Location-sharing social network Foursquare and HootSuite, which lets users monitor Twitter and other social networks more easily, appeared to have recovered. Ideeli, a four-year-old SoHo-based flash-sale shopping site, reported a temporary outage for about 10 minutes around 4 a.m. Thursday. While the company was able to quickly recover by using infrastructure from a different zone, the outage puts a new spin on the retailer’s relationship with Amazon.
Here’s the scoop:
Prior to the explosive use of online services, when online companies went into business they had to build their own racks of servers to store data and run applications.
With the advance of online services, companies such as Netflix do not host their own servers; instead they rent space on the servers of providers such as Rackspace, Cbeyond, Amazon and dozens of others. These providers (such as Amazon.com) manage the online applications and data of their clients (such as Netflix).
Ideally when a problem happens with the server (or servers) of a cloud computing vendor, there are redundant systems in place to move data and applications to working servers – thus mitigating any damage.
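To make the idea of redundant systems concrete, here is a minimal sketch in Python of choosing a working location when one has failed. It is an illustration only, not how Amazon or any other vendor actually does it: the function name, the zone names and the health check are all hypothetical, and a real setup would also have to move the data and redirect traffic.

```python
from typing import Callable, Optional, Sequence


def pick_healthy_zone(
    zones: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first zone whose health check passes, or None if all fail.

    `zones` is listed in order of preference (primary first).
    `is_healthy` is a hypothetical check supplied by the caller, for
    example a function that pings a status page in that zone.
    """
    for zone in zones:
        try:
            if is_healthy(zone):
                return zone
        except Exception:
            # A health check that raises is treated like a failed check.
            continue
    return None


# Example: the primary zone is down, so the backup zone is chosen.
down = {"zone-a"}  # hypothetical zone names
chosen = pick_healthy_zone(["zone-a", "zone-b"], lambda z: z not in down)
```

If every zone fails the check, the function returns None, which is the signal to fall back on whatever manual plan the business has.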
From what I’ve read Amazon.com’s backup systems did not kick in.
So what can you do? What is the takeaway?
– Continue to use online services and continue to move to host your data “in the cloud”
– Understand that technology will fail: your iPad, your BlackBerry, your Vostro notebook and your online service provider.
– Have backup plans in place to continue operating if your technology fails.
– Have systems in place to know what to do, and when, if technology fails. For example, if your phone lines go down for only one hour, that might not be a serious problem. But what happens if they are down for 2 days? For your online services, is a 30-minute outage catastrophic, or can your business be fine if they are down for, say, 4 hours?
– Have backup systems in place with your vendor and with 2nd and 3rd party vendors. If you have an important application running in the cloud you MUST make the investment to have an engineer (or team of engineers) build systems that have failovers and redundancy.
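One way to capture "what to do and when" is to write the plan down as a simple table of outage durations and responses. The sketch below does that in Python. The 30-minute, 4-hour and 2-day thresholds echo the examples above; the actions attached to them are made up for illustration and would differ for every business.

```python
from datetime import timedelta

# Hypothetical plan: (how long the outage has lasted, what to do).
# The actions are illustrative assumptions, not recommendations.
PLAN = [
    (timedelta(minutes=30), "wait and monitor"),
    (timedelta(hours=4), "switch to backup vendor"),
    (timedelta(days=2), "invoke full disaster recovery"),
]


def action_for(outage, plan=PLAN):
    """Return the action for the largest threshold the outage has reached."""
    chosen = "no action needed"
    for threshold, action in sorted(plan, key=lambda item: item[0]):
        if outage >= threshold:
            chosen = action
    return chosen


# Example: five hours into an outage, the plan says to switch vendors.
current_action = action_for(timedelta(hours=5))
```

The point is less the code than the exercise: deciding the thresholds and actions ahead of time means nobody has to improvise in the middle of an outage.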
A New York Tech Meetup discussion board participant (Jonathan Vanasco) writes:
These companies were reliant upon nothing but complete stupidity. I know of several high-profile sites that run on EC2 (Amazon Web Services) and had multi-zone failover systems in place; they were largely unaffected.
There was an issue at a single physical data center. This is akin to when Rackspace had power issues a few years back in one center, or when that digital hub in LA lost primary and backup power. A smart engineer would have designed a system on EC2 to handle multiple zones for failover, just like a smart engineer in a colo facility would have designed a system to fail over to a hosting center on a different coast.
Amazon had given all the sites that busted today the capabilities and abilities to have failover. While many sites hosted on Amazon went offline today, many stayed online as well — thanks to great engineers and architects who did their job the right way.
The companies are to blame – Amazon does not guarantee 100% uptime, and one would be foolish to expect a service promise like that. You should always build systems expecting failures to happen (and running something on historically unstable cloud platforms should really suggest that is a top priority). A bunch of CTOs and VPs of Engineering at some famous companies were simply stupid and cut corners.
While I largely agree with this, what happens in the (very common) scenario where the CTO and/or VP of Engineering says ‘it should be done this way’, and the CEO or the head of the company says “let’s not worry about it now” or “we have other priorities”? It could be argued that they are not doing their jobs properly, but sometimes tech folks are constrained because the business folk make the decision to take that risk. I’m not proposing that every outage yesterday, or even a majority of them, was a result of the scenario I just outlined, but it’s a possibility.
Oh, and one other thought for those who may balk at some of the prices for talent mentioned in the “Negotiations with a developer” thread: ideally, if you’re paying those high rates for talent, you’re going to get someone who’s thought through these scenarios, put together plans for such a situation, justified their expense to the business, explained why those plans are necessary, and executed when things go pear-shaped, so their sites are up and running while others are blaming Amazon for “trashing” their sites/businesses. You’re less likely to have someone come up with plans like those for $20/hr. [END QUOTE]
Technology will fail. Be prepared and have plans in place to deal with it.