Amazon has apologized for the extended outage that hit its AWS infrastructure cloud late last week, providing an extensive explanation for the problem and extending 10 days of credit to customers who were using data stores in the portion of the service where the problem originated. The outage was sparked, the company said, when …
Say whatever you want, this cloud thing is a shot in the foot for IT people like me who are supposed to have scheduled visits to clients. We lose billing hours, and when something goes wrong we cannot do shit to help the client. Even if you have redundancy for the services, you have so many extra breaking points that it is just stupid.
At least on my clients, if both ISPs break down at the same time, or if the firewall blows up, they can at least access files on the internal network and do some shit. I can't even imagine trying to explain to a multi-millionaire CEO that Amazon lost his data and/or all the services are down. My head will be on a plate 2 seconds later.
Cloud computing is the "new thing" that can be a life saver in businesses. However, rare events like this outline just how much of your business you're risking. There are still several advantages to using a cloud service, especially for small businesses using outsourced IT or the like, but there's just no beating a local network setup with a competent IT staff (even if that staff is just one person). A smaller business can likely handle 30min of a server (even an entire VM host) being offline while a critical component is replaced (competent implies being smart enough to stock a spare part of non-redundant server hardware, if the risk assessment is high enough). Likely your ISP will have an outage before a cloud provider will have downtime, so if your local servers have less downtime than your ISP (fairly doable actually), you could be better off having a local setup. Disaster recovery you say? If your business burns to the ground, you run into the question of why you would need access to your servers anyway? Your competent IT staff would have an offsite backup from the day prior anyway, so access to data is there. Sure, you won't be able to bring all your systems back online and operating until you replace your server(s) (unless you have a very inventive IT staff), but with the "disaster" hobbling your place of business, such would not be required.
Money and skill are the primary game-stoppers for a decent local setup. Your budget can't afford the ideal redundancy, infrastructure, internet-connectivity, or staff that a cloud provider can. It really comes down to if you can afford the one-in-a-blue-moon Amazon-style snafu (with the potential loss of your data), or if you prefer to rely on your potentially less adequate DR plan.
If you're outsourcing most of your stuff to a cloud provider, buying two 2TB portable hard drives, preferably encrypting them (BitLocker would be an easy solution for a moderately IT-literate person), and using them to back up cloud data, even weekly, by a simple copy and paste would in my humble opinion be a fairly cheap, low-skill solution to potential cloud provider problems. For a small company, this may even allow them to work while 'the snafu' is underway and not just restore any lost data.
The reason why you might put your stuff in the cloud is because you've been told it is 'better' and 'cheaper' than DIY IT.
Amazon's (and Google's (and probably others)) recent outages show that, in fact, their ability to keep complex, multi-tenanted infrastructure up and running is no better than most businesses can do in-house. So, the 'better' claim is not that accurate. 'Cheaper' is an interesting concept as well. Having done cost modelling on Cloud for a sizeable organisation, the cost savings punted about by many vendors are quite illusory.
Don't get me wrong - I think there is a place for hosted solutions and 'on demand' infrastructure etc etc, but it's not for mission-critical production systems unless the service credits you can get from a provider equal or exceed the financial and reputational impact of an outage. And good luck with that one from an Amazon or a Google!
What use is a post-mortem...
... when it doesn't tell you why what went wrong went wrong?
Maybe there exists an internal version of same that does contain the important bits, but we don't know and given the poor communication earlier the expectation is we'll never know.
What is clear is that at least parts within this cloud thing don't react gracefully when the cloud turns out to be imperfect. Whether this is an incident and they'll be running more comprehensive stress-tests from now on? Again, we don't know.
Repeatedly leaving out the critical bits has made amazon's cloud entirely unfit for use in critical infrastructure engineering. Simply because you haven't an inkling what's going to bite you next time.
Well it seems they knew not to put all their eggs in one basket, but then went and tied all the baskets together so tightly that when they dropped one...
Bulk scrambled egg anyone?
The postmortem raises more questions than it answers
The postmortem raises a long list of questions regarding the network design as well as some interesting questions about the resilience methodologies in use.
In fact, some bits of the postmortem sound outright surreal to anyone with a telco/ISP background. Two separate networks? Routing between separated control planes and traffic?
An architectural error
The concept of Amazon's storage is that there are typically three copies of any data element with one at a remote site. This particular failure appears to have occurred because the "remote" site chosen by most users was just in another part of the data center, rather than truly independent (no shared internal network) of the local copy.
Whether this was a misunderstanding of the concept of remote, or a misrepresentation that zones do not share resources is only a part of the issue. There is a real weakness in the Amazon model. If a lot of copies are lost, the system MUST rebuild a new replica somewhere, to meet the published SLA and their own architectural requirements.
This means they must over-provision storage by an amount to hold the largest data center they have (not zones---locations!) and, and this is where it gets really painful, provide compute resources in the management nodes and comm resources throughout the system for the bandwidth needed to handle literally billions of replication requests.
I believe this is beyond current technology or financial reach. Amazon should have had two more layers of logic before allowing the storm to explode. First, they should have detected this as a multi-zone, data center outage, rather than a myriad of individual recovery requests. This would have allowed throttling to occur, possibly almost to preventing replication. The second layer is also obvious. They should have had logic and an independent system back-channel network to detect that the problem storage had NOT failed, but only become inaccessible. This would have allowed business to continue using other data centers for either storage or compute or both. Any write transactions to the (healthy, but offline) affected replicas could be journaled and synced later. (Note that, in this current disaster, restoring sync after several days of partial outage must have been fun!)
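The two layers of logic described above can be sketched roughly as follows. This is a toy illustration only, not Amazon's design: the `Volume` fields, the `plan_recovery` function, the 20% storm threshold and the re-mirror budget are all hypothetical, and the "back channel" is simply assumed to exist and to report correctly.

```python
from dataclasses import dataclass

@dataclass
class Volume:
    name: str
    replica_reachable: bool  # what the normal data network reports
    replica_alive: bool      # what the assumed independent back channel reports

def plan_recovery(volumes, storm_fraction=0.2, max_remirrors=2):
    """Map each volume that lost sight of its replica to an action."""
    lost = [v for v in volumes if not v.replica_reachable]
    # Layer 1: many simultaneous losses look like a zone-level event, not disk failures.
    storm = len(lost) >= storm_fraction * len(volumes)
    budget = max_remirrors
    actions = {}
    for v in lost:
        if v.replica_alive:
            # Layer 2: replica is healthy but cut off, so journal writes and sync later.
            actions[v.name] = "journal"
        elif storm and budget == 0:
            actions[v.name] = "wait"      # throttled until the storm subsides
        else:
            actions[v.name] = "remirror"
            if storm:
                budget -= 1
    return actions

# Hypothetical fleet: 4 cut-off-but-healthy, 3 genuinely failed, 3 unaffected.
volumes = (
    [Volume(f"v{i}", False, True) for i in range(0, 4)]
    + [Volume(f"v{i}", False, False) for i in range(4, 7)]
    + [Volume(f"v{i}", True, True) for i in range(7, 10)]
)
print(plan_recovery(volumes))
```

In this made-up fleet only two of the seven affected volumes start copying data; the rest either journal or wait, which is the throttling effect argued for above.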
Disaster-proofing and recovery is a major rationale for the Amazon data structure. That they failed the test of a very predictable failure scenario sets me to wonder if the technology team had explored this adequately, and if they understand the beast they created well enough. I've been having discussions with grid, HPC and storage people for over 7 years on subjects like this and I've concluded there is a gap in understanding of failure modes in large clusters. I've seen this arise mainly as a result of peer pressure and corporate culture (it's hard to be a naysayer and critic), and it's also hard to critique your own design adequately. That's likely what happened here. I hope we all learn the lesson, or cloud deployment will lose out to FUD.
Re: An architectural error
@Jim - You are confusing the EBS service with S3.
Amazon's S3 service was completely unaffected by this; that is the service that stores data in multiple locations (zones) within a region. You don't 'choose' those locations - that is taken care of automatically.
The affected service was EBS which is just a standard drive with mirroring onto two devices. Therefore an EBS volume has some redundancy, but it is all within the same zone. You should never, ever rely on an EBS volume not failing completely.
Architectural best practice (whether in the cloud or not) is where possible not to rely on a single device or server. Those that had well engineered setups were unaffected by this outage as their EBS volume(s) that already existed in their secondary zone continued unaffected. Yes, you couldn't create/attach/detach/snapshot volumes in the other zones due to the common control plane overloading, but the volumes that already existed went on just fine.
The biggest problems here are:
(a) Communication - everyone has universally commented that this was poor.
(b) Some people (2.5%) with RDS (database) services running in multiple zones were affected. This shouldn't have happened, and Amazon have admitted this was part network traffic related, but part due to a bug.
(c) The control plane overload affected other zones.
(d) This was caused by human error.
The fact that a whole zone failed? Not good, but not unexpected. I've never met a data centre that hasn't had some sort of power or network failure that affects multiple servers. Normally something causes the power to failover to UPS/generators (eg. testing!) and it doesn't :-)
Re: An architectural error
Both S3 and EBS have the same exposure. They are topographically similar, insofar as S3 has two local and one remote copy spread over multiple nodes, while EBS has two local copies (a disk mirror), and if sensibly implemented, one or more remote copies.
The paradigm for repairing a drive failure is easy to understand. We have 20 years of experience dealing with it in RAID. Spare space is found, and a new mirror or replica pieced together from the working data. The issue is what happens when a large chunk of the data storage goes AWOL. If the system starts rebuilding images for all the objects, the create and copy traffic explodes, followed by very long latencies and effective operational lockup.
Another crash mode occurs if the lost storage is bigger than available free space across the zones. Here, the control plane saturates as the systems try to find free space on other nodes by each node broadcasting to all the other nodes, but then get into a deadly embrace since these requests cannot be satisfied. Again this leads to control plane lockup.
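That second crash mode can be illustrated with a toy simulation. It assumes, purely for illustration, that every stranded volume polls every node once per round; the function name and all the numbers are made up, and nothing here reflects how EBS actually searches for space.

```python
def simulate_search(stranded, free_slots, nodes, rounds):
    """Each round, every stranded volume asks every node for space (assumed broadcast)."""
    history = []
    for _ in range(rounds):
        messages = stranded * nodes          # control-plane traffic this round
        placed = min(stranded, free_slots)   # only as many as there is room for
        stranded -= placed
        free_slots -= placed
        history.append((messages, stranded))
    return history

# Hypothetical numbers: 1000 volumes need a new mirror, but only 600 slots are free.
history = simulate_search(stranded=1000, free_slots=600, nodes=50, rounds=4)
for n, (messages, left) in enumerate(history, 1):
    print(f"round {n}: {messages} messages, {left} volumes still stranded")
```

Once the free space is used up, the remaining volumes keep generating the same traffic every round without making any progress, which is the lockup described above.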
That seems to be what happened in this incident. To quote Amazon,
"The mistake meant that many EBS nodes could not connect to their replicas, and they started searching for free space where they could re-mirror their data. With so many volumes affected, not all could find available space."
There are other disaster scenarios to explore, such as the ability of the rebuild system to keep up with drive failures. In year 8 of their life, out of a million drives, roughly 2 drives will fail per hour. With a 24 hour rebuild, that looks like the failures win!
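A quick back-of-envelope check of those figures, as a sketch. The fleet size, failure rate and rebuild time are the ones quoted above; the implied annual failure rate and the steady-state number of rebuilds in flight (arrival rate times duration, i.e. Little's law) are derived values, not anything published by Amazon.

```python
HOURS_PER_YEAR = 24 * 365

def annual_failure_rate(fleet_size, failures_per_hour):
    """Fraction of the fleet that fails in one year at a constant hourly rate."""
    return failures_per_hour * HOURS_PER_YEAR / fleet_size

def concurrent_rebuilds(failures_per_hour, rebuild_hours):
    """Little's law: average rebuilds in flight = arrival rate x rebuild time."""
    return failures_per_hour * rebuild_hours

afr = annual_failure_rate(1_000_000, 2)
in_flight = concurrent_rebuilds(2, 24)
print(f"implied annual failure rate: {afr:.2%}")
print(f"rebuilds in flight at any time: {in_flight}")
```

So 2 failures per hour corresponds to roughly 1.75% of the fleet per year, and a 24 hour rebuild means about 48 rebuilds running at any moment. Whether "the failures win" then depends on whether the system can sustain that many rebuilds in parallel.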
RE: Re: An architectural error
Whilst communication was criticised, it always will be in any form of failure where the push seems to have been "find and fix the technical problem", rather than simply returning the customers back to operations. Having been the buffer between the business (i.e., the people that don't know IT but actually use the services IT provide to make the company money) and the techies (the people that think they keep the business running but often fail to realise the financial impact of a problem), I've been in the position of having to beat up on the techies to get them to actually provide updates to the business and of having to explain to the non-techies exactly why things aren't back to normal yet.
The problem with us techies is we are, by nature, problem solvers, and when something goes wrong we like to understand the problem and then find the most efficient fix. But all the business cares about is getting things going again so they can start making (or stop losing) money. It seems that the Amazon techies concentrated on understanding and solving the issue, rather than keeping their customers informed. This is doubly understandable as it looks like an admin error - switching to the wrong router - rather than a technical error was to blame, and the admins probably all wanted to avoid having to face the music for as long as possible. Sounds like Amazon needs to re-think both their disaster handling, change control and monitoring, and their customer advisory procedures. First rule of dropping the company in the smelly stuff is make sure you tell the bizz people ASAP just how deep and smelly it is, as the longer they go without info they can make a decision around means the more stressed (and probably vengeful) they will become!
"anyone with a telco/ISP background. "
Do competent people like that still find relevant employment? I thought they tended to be unpopular (and eventually unemployed?) owing to pointing out risks in cheapskate approaches now favoured by PHBs and certified-Microsoft-dependent IT architects, and in particular in pointing out the costs of Doing IT Right?
Awesome, I think you've just explained to me why I've struggled to find work recently >.<
I shall immediately commence smashing my skull into the nearest concrete surface and delete half of my CV contents. Must also remember to dribble a lot in interviews. ;)
Amazon for "mission critical"
I don't think so. That would be like, or even worse (though cheaper) than, entrusting your cloud data to Microsoft. I suggest they go back to selling books and bloated word processors and phones that brick after update respectively.
For the cloud - Guess what! - We've got Google.
Not a good result, but *hopefully* a wakeup call to Amazon.
Change management is *tricky*.
Testing an upgrade/configuration plan *before* you do it might be a good idea.
Manual changes are error prone.
Customers who designed *proper* architectures did not fall over.
This might *seem* to put some blame on the customers but part of why MS stays in business is the way it looks after the *non* tech savvy customers. How much it saves them from their *own* ignorance and unwillingness to learn about the stuff they use.
I hope Amazon learn a *lot* from this failure. Offering near mainframe levels of reliability is hard. Especially at a price in a competitive market. I suspect they will lose customers from this event.
It's a matter of customer *trust*. Time will tell how much they have lost.
Does no-one test anything any more? I don't understand how exponential consumption of finite resources can even happen? Did the code not have any form of load limiting?
When I was in mainframe land one of our 6-monthly jobs used to be to test the disaster recovery plan. We took the weekly tapes to the place that would hire us an IBM in a trailer, and timed how long it took to configure the hired gear (following an already prepared checklist), load the tapes, and run 1000 terminal-based transactions. Part of my job was to measure the current drain from the generator while we did that, and the fuel consumption.
Any new system installed, hardware or software, was given a resilience test. We would turn bits of it off and make a timed log of what failures occurred when, and what the consequences were. At weekends during the year we would repeat those failures on the production system and ensure there were no significant deviations in the timing or propagation. We would look for failure to centrally diagnose or report failures, and any such 'hole' had to be demonstrably fixed within a month. (even if this sometimes meant another red lamp over the operator's console and a long cable)
Every single bit of kit deployed had a failure policy, ranging from warranty and maintenance contracts to a minimum of two ways to continue production without it. We used to test the users, too, to make sure they knew how to find the failure policy and use the alternatives.
Managers have to learn that difficult != unnecessary, and that expensive != unnecessary.
... That "less expensive" = "Bonus and promotion".
And most of the time, the promotion to another job will occur before any consequence of half-assed, short-sighted decisions.
I'm impressed with your testing procedures, sounds like something from a long time ago...
Hard to measure, but the disfavor customers will now likely have of businesses that utilize AWS likely doesn't measure up to the coupons Amazon is handing out for the days of service lost. A simple calculation of downtime = value won't entice companies to renew their contracts.
Change control process?
An engineer switched to the wrong device/network leading to a cascade failure. This type of thing happens in IT. I've been in IT since punch card files and paper tape. This type of human error continues to happen.
I would have thought Amazon needs to look at its Change Control process. Each step of the change should have been evaluated and documented in advance. Every system has some kind of weakness as the system evolves, the change control process should proactively avoid those weaknesses.
As part of the Post Mortem Amazon will need to look for those areas of weakness and either eliminate them or isolate them from change. It should also examine situations where accountants with spreadsheet vision have endangered the resilience of the system. It is all just plain risk management and some of those risks are likely to be organisational.
But what was affected, really?
I make use of a service which uses Amazon S3 to handle certain sorts of data request. It reduces the load on their own network.
There is a suggestion that their app code which takes advantage of this is flawed.
What I am sure of is that their performance sucked, last weekend, at the time when Amazon was sorting out this problem. Coincidence? This does look to have affected the network level of Amazon's operation.
It's not vital to me, but I can see a problem in these big "cloud" operations obscuring the risk.
"The EBS cluster couldn't handle API requests to create new volumes, and as these requests backed up in a queue, it couldn't handle API requests from other availability zones"
Sounds like the classic 'single point of failure in systems that are not supposed to be connected' deal.
If multiple availability zones all send requests to the same EBS cluster then the EBS cluster service that handles the requests is a single point of failure for all availability zones that use that cluster.
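One common way to stop a backlog from one zone starving the others is to give each zone its own bounded queue (a "bulkhead"). The sketch below is hypothetical: the `ZoneQueues` class, the zone names and the limit are invented for illustration, and it says nothing about how Amazon's control plane is actually built.

```python
from collections import deque

class ZoneQueues:
    """One bounded request queue per availability zone (hypothetical sketch)."""

    def __init__(self, zones, per_zone_limit):
        self.limit = per_zone_limit
        self.queues = {z: deque() for z in zones}

    def submit(self, zone, request):
        q = self.queues[zone]
        if len(q) >= self.limit:
            return False          # shed load from the overloaded zone only
        q.append(request)
        return True

    def next_batch(self):
        """Round-robin: take at most one request from each zone per pass."""
        return [q.popleft() for q in self.queues.values() if q]

cp = ZoneQueues(["zone-a", "zone-b"], per_zone_limit=100)
# zone-a floods the control plane with create-volume requests...
accepted = sum(cp.submit("zone-a", f"create-{i}") for i in range(1000))
# ...but a single request from zone-b is still served on the next pass.
cp.submit("zone-b", "attach-b")
print(accepted, cp.next_batch())
```

With a single shared queue, the zone-b request would sit behind the whole zone-a backlog; here it is served in the very next batch, at the cost of rejecting most of zone-a's flood.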
Problems predate 21st April
I think the problems go back further than the stated date of 21st April. I found the EC2 service unusable on 15th April, as I tweeted here: http://twitter.com/#!/kitlovesfsharp/status/58789308554944512
Amazon have redundancy up the wazoo.
Small and medium business can't be expected to shell out for the level of redundancy available on the platform.
The problem is not the architecture, but probably a user error. That is not to say that the backend architectural complexity didn't have a role to play but that cannot be avoided. Amazon will learn and improve incrementally, but there will always be failure and we must design for it.
If those businesses affected had read the docs, they would have used multiple availability zones, which would have reduced (not eliminated) the possibility that they would have been affected.
On another point why should amazon provide MORE detail into the postmortem? If you have a problem with a volume or instance: start another one up (because you have been using best practices and have made an AMI and have been snapshotting your EBS volumes) and you are back up and running again. I've enough problems of my own without worrying about amazon's techies jobs.
In summary, my bet is that humans were responsible for the issue at Amazon, and the relevant personnel at all the whining orgs that failed to use the tools correctly, are responsible for their own mini tragedies.
Except that the failure spanned availability zones, which is why people had downtime even when they were using availability zones. They did read the doc, seems like you didn't read the article tho.
The cloud of scalability...
The effect of amazon's stunning incompetence demonstrates how truly scalable the cloud computing model is. Simple errors are able to scale up to astounding outages that most data centres couldn't have dreamed of in the past. We'll have jobs for the rest of our lives; good work Amazon!
Sounds like Amazon suffered a severe case of Split-Brain
It started with the network, but once the distinct copies of the storage started losing sight of each other all hell broke loose.
Anyone who has done due diligence with a technology like drbd will break into a cold sweat when seeing "Standalone" start spamming in the Syslog; because unless they handle the next steps with absolute precision (and were wise enough to have the right precautions in place beforehand!) the odds are you'll be going back to the tape library and rebuilding from scratch.
Cascade failure modes are usually tough to predict, but blindingly obvious once you've been bitten by one.
If you REALLY care about your business not going on the floor then CIOs & CTOs need to get their heads out of the clouds and balance cost savings against decades of sound design and implementation practice. Unfortunately the guys that would have saved the day by building and running a true HA setup are either unemployed or unwilling to become unemployed by telling IT like it is.
Time for me to get off the dole and start consulting... ;-)
Mainframe-level fault tolerance? Nope.
@Robert E A Harvey, nope, people don't generally do these kinds of tests any more. In fact, a lot of IT people seem to be quite cavalier about their setups (if they do backups and failover at all, they not only don't test it, but don't even look over the overall design to see if it would theoretically work -- apparently, including Amazon). Of course, the fact that there is not really comprehensive information on how one would make all the bits and pieces tie together, as there is in the mainframe world, does not help in this. And for the most part those IT professionals who would plan to this degree and like to do some kind of tests of various failure modes are unable to due to budgetary and time constraints.
Although I have not been an operator on a mainframe, I've looked into mainframe architecture, as compared and contrasted to current "cloud computing", virtual machine setups, and so on. In general, the "state of the art" is about where the mainframe was circa late 1960s... the pieces are all there, and it generally works but the bugs have not been worked out yet (Although, it does vary -- there have been systems like the Sun Enterprise 10000, a.k.a. Sun Starfire, that are much more fault tolerant than just using some regular machines.)
Better late than never
the thing that gives me an uneasy feeling about this all is that there are indeed mission critical services run inside a bookstore, glorified as it may be... Why people choose to do that is really beyond me, equally as it is beyond me why people would store confidential data at the site of a known data miner...
I know it is the aspects of it being cheap and readily available to the nitwits, but aren't business owners supposed to take care of their data anymore? What happened with the stance that IT is merely a replacement of the good old secretary pen?
Also, what nobody seems to pick up, even fellow passionate IT professionals(!), is that most things that happen in the IT industry are merely ways to try and lure money from people without giving them anything back... In the grand scheme of things, we had a mainframe; then somebody decided that central wasn't good enough, and built an open system; now, some 30 or more years onwards, the IT industry is trying its hardest to build a mainframe on open systems, thereby going back to the centralised infrastructure we started out with.
Anyhow, Dear Clients (potential or otherwise): You have to remember that if you want disaster-resistant services, you better do IT in a way that suits your business, not in a way that suits somebody else's business... Or focus on the cost, but then, please, keep quiet when things fail. If you don't want to spend the money, fair enough... Just remember: Only the sun rises for free... Anything beyond that carries a service fee!
CEO - ITPassion Ltd
The Problem with Clouds.............
....is that they are subject to winds and can be here today, gone tomorrow.
C'est la vie
A mega corporation is only as strong as its weakest employee. Or, "it only takes one ''tard to destroy the work of a thousand geniuses."
When it's in your house at least you can see who the 'tards are. You have no control over the 'tards Amazon hires.
Great. Now I've got "These clouds fall like dominos, DOMINOS" in my head. Cheers for that.