It's not surprising that Amazon's infrastructure cloud has gone on the fritz. This is what happens to internet services from time to time. What is surprising – or at least more troubling – is that today's outage affected multiple "availability zones" across the service. "Availability Zones," according to Amazon, "are distinct …
There went the next couple of years' worth of downtime
At 99.95% uptime, with 365.25 days per year of 24 hours each, that allows about 4 hours 23 minutes of downtime per year.
Looks like one of the services that smoked was Elastic Beanstalk, so both North American data centers were whacked out. Notes say issues from 3:16AM to present (5:55pm). That is pushing, what, about 18 hours and counting? So if it is all magically working before I finish this, they shouldn't be down again until 2015?
Oh wait, no, what they will tell you is that 99.95% of their systems will run 99.95% of the time. And lucky you to be part of the 0.05% that won't run, ever.
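For anyone who wants to check the arithmetic above, here is a minimal sketch. The 99.95% figure and the roughly 18-hour outage come from this comment; the function names are made up for illustration, and this is not how any SLA is actually assessed.

```python
# Back-of-the-envelope downtime budget for a given uptime percentage.
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours, same year length as the comment uses


def downtime_budget_hours(uptime_percent: float) -> float:
    """Hours of downtime per year that still meet the stated uptime."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def years_of_budget_used(outage_hours: float, uptime_percent: float) -> float:
    """How many years' worth of downtime budget a single outage consumes."""
    return outage_hours / downtime_budget_hours(uptime_percent)


budget = downtime_budget_hours(99.95)
hours, minutes = divmod(round(budget * 60), 60)
years = years_of_budget_used(18, 99.95)

print(f"Yearly budget at 99.95%: {hours} h {minutes} min")
print(f"An 18-hour outage uses about {years:.1f} years of that budget")
```

An 18-hour outage works out to a little over four years of budget, which is where the "until 2015" quip comes from.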
In fairness anything that big will have a huge impact when it falls down, and it is too big and too complex not to crash.
This is where SLAs become a piece of paper, and why not all SLAs are created equal. There is a difference between the banking industry, where an operator can be fined hundreds of thousands of dollars per hour for downtime, and other industries, where they credit you a fraction of your monthly service charges and send you an apologetic email with no explanations (if you are lucky enough to get that...)
Good luck on the refund
EBS is the issue
My company's service is hosted by AWS, mirrored across us-east-1d and us-east-1b, and we did not experience any disruption. That's because we do not use EBS at all, not even for the boot drive on our VMs, which use the older S3-backed format. The downside is that all data on the system disks is lost if the VMs go down, which we guard against using active-standby replication across availability zones.
The problem with EBS is that Amazon overpromised (enterprise-grade storage at very low prices) and underdelivered (mediocre and inconsistent performance both in terms of IOPS and MBps, and spotty availability). Enterprise storage is very expensive (several thousand dollars per terabyte), and Amazon's pricing of $0.10/GB/month means they must be cutting corners somewhere.
As for isolation between availability zones, I remember reading somewhere that zones with the same prefix (like us-east-1d vs. us-east-1b) are in the same physical data center, but different rooms with different power supplies. The challenge in replicating across different regions is packet latency induced by the speed of light (a minimum of 30 milliseconds between East and West coasts). That's an eternity when compared to the response time of a disk, and most database replication systems perform poorly across WAN links with high latency, even the asynchronous ones. On very high volume sites, it is often not feasible to do so, apart from backup-shipping as a disaster-recovery measure, where data loss would be measured in days.
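A rough sanity check on that latency floor, as a sketch only. The 4,000 km coast-to-coast distance and the two-thirds-of-c speed in fiber are my own round-number assumptions, not measurements of any Amazon link, and real routes add switching and detour overhead on top.

```python
# Lower bound on coast-to-coast latency from propagation delay alone.
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
FIBER_FACTOR = 2 / 3               # assumption: light in fiber travels at ~2/3 c
COAST_TO_COAST_KM = 4_000          # assumption: rough straight-line distance


def one_way_ms(distance_km: float, speed_factor: float = 1.0) -> float:
    """Propagation delay in milliseconds over the given distance."""
    return distance_km / (SPEED_OF_LIGHT_KM_S * speed_factor) * 1000


vacuum_rtt = 2 * one_way_ms(COAST_TO_COAST_KM)
fiber_rtt = 2 * one_way_ms(COAST_TO_COAST_KM, FIBER_FACTOR)

print(f"Round trip in vacuum: {vacuum_rtt:.1f} ms")
print(f"Round trip in fiber:  {fiber_rtt:.1f} ms")
# A replication scheme that waits for one round trip per commit can do at most
# this many strictly serial commits per second on a single connection.
print(f"Serial commits/s at fiber RTT: {1000 / fiber_rtt:.0f}")
```

Under these assumptions the round trip lands somewhere between the high twenties and about 40 ms, so the 30 ms figure is in the right ballpark as a floor.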
Too bad availability zones did not work as advertised, but separate regions are an option
We are also in us-east-1a and us-east-1d. We do make extensive use of EBS. We thought we were well prepared by having our DB2 database in two availability zones. We got lucky and stayed up right through the mess. I think we will have to consider going across regions and not just availability zones with our database. We are fortunate enough to have this option as DB2 HADR ASYNC mode works well across great distances. The fact that availability zones did not work as advertised is a disappointment. Let's see if different regions will do the trick.
I blogged about this http://freedb2.com/2011/04/21/cloud-crash-has-a-silver-lining/
The outage has also taken down Pocket Legends just as they started a free fought night promo.
There will be many a grumpy company as a result of this. I wonder how much compensation Amazon is going to have to cough up.
The problem is not that bad?
There is a huge, mathematically created perception problem that Amazon has.
If 10 000 sites host themselves in auntie's basement and various cages all over the planet and only get 99.5% uptime, no news stories will be written when each one drops. (They would be going down at a rate of 20 or more per day!)
When they are all hosted on the same system, the news story becomes 10 000 times bigger, even though the uptime could be much better.
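To put rough numbers on that, here is a small sketch. The 10 000 sites and 99.5% uptime are from the comment above; the mean outage lengths are purely hypothetical values I picked to show how the daily rate moves, and the model assumes outages are independent and spread evenly over the year.

```python
# How many independently hosted sites are down, and how often they drop.

def expected_sites_down(sites: int, uptime_percent: float) -> float:
    """Average number of sites down at any given moment."""
    return sites * (1 - uptime_percent / 100)


def outages_per_day(sites: int, uptime_percent: float,
                    mean_outage_hours: float) -> float:
    """Average number of new outages per day across all sites.

    mean_outage_hours is an assumed average length of a single outage.
    """
    down_hours_per_day = expected_sites_down(sites, uptime_percent) * 24
    return down_hours_per_day / mean_outage_hours


print(f"Down at any moment: {expected_sites_down(10_000, 99.5):.0f} sites")
for hours in (4, 12, 43.83):  # 43.83 h = one outage using the whole yearly budget
    rate = outages_per_day(10_000, 99.5, hours)
    print(f"Mean outage {hours:>5} h -> about {rate:.0f} outages per day")
```

Even in the gentlest case, where each site has one long outage a year, that is still more than 20 drops a day that nobody writes about.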
Refunds: Amazon has enough money to give a month of EC2 away as a gesture, no matter what the contract says.
centralized cloud computing
' "By launching instances in separate Availability Zones," Amazon says, "you can protect your applications from failure of a single location." But today's outage – which began around 1:41am Pacific time and also affected the use of Amazon's Elastic Block Store (EBS) service – spread across multiple zones in the East region.'
I do believe this whole cloud computing concept has been oversold. For a business with multiple locations, a number of servers sited locally in a peer-to-peer configuration would provide a more reliable service. All they rely on is an end-to-end IP connection; if one site goes, the rest can carry on.
Remind me again
Why we are all supposed to embrace the convenience and reliability of the cloud
'cos it requires less kit and staff and apparently a great saving?
(I don't buy in to the usual advertised crap)
re: ElasticHosts advertising spam
Oh dear... business must be bad if you need to resort to spamming forums featuring your so-called "direct competitor"... more ElasticHosts spam...
These kinds of high-profile outages are what it's going to take to temper everyone's overhyped expectations of cloud computing. The concept is amazing -- ever-expanding, always-on computing power that can be used like water or electricity. The reality behind the marketing stuff falls short.
The shiny PowerPoint stuff fails to mention that your typical IT problems don't disappear; they just become someone else's problem. If your host either is incompetent or just has a really bad day, it doesn't mean that your workload and data are instantly going to migrate themselves to another location. It CAN mean that, but only if the service provider designed it that way, and you paid for that service. Also, if there's a true disaster, SLAs are useless other than ensuring you can extract money or free service from the provider at a later date. You're still down and out till it's fixed.
The reality is that for most big-scale applications, DR is difficult. You're either looking at dedicated LAN-speed network links between datacenters for replication, architecting your app so that it can support failover properly, or both. Problem is, once the CIO sees "monthly charge based on use" and "get rid of your local IT assets and staff" in the same presentation, it's all over.
Beer because it's Friday, and Amazon's sysadmins are going to need it after the 72-hour shift they'll be pulling to fix this.
It's 'only' a PR problem
So the fact that the outage happens to 10 000 sites all at once is somehow worse than it happening to those same sites, but spread out over a year?
It's mostly a PR problem. The thing about EC2 is that every outage makes the news. Reddit (hosted on EC2) crashes on its own once a month, but that does not make the news.
Services going back up
The likes of Foursquare and Quora are back up, and Reddit is in emergency mode, but if you read the feeds, the widespread repeats of old news could give the impression that nothing has improved... Amazon EC2 / AWS outage and Sony PSN downtime UPDATE http://t.co/C0X2rXu
...tells me that this is a software problem and all of Amazon's availability zones are running the same software, so they all suffer together. The moral of the story will be that if you want real availability, you need to have more than one cloud provider.
If I were a consultant, I could get paid for that "insight".
Re: Psychic debugging
Update: Amazon are now saying that the root cause was a "network event" that all their clouds responded to in the same way. Would the person who down-voted me like to give that thumb back, or were you just criticising my spelling?
>Would the person who down-voted me like to give that thumb back, or were you just criticising my spelling?
No, for two reasons. 1. Because you asked and 2. Because a "network event" is not a software error.
I'm also inclined towards giving you a further thumbs down for being gullible enough to accept "network event" as a plausible excuse. It's usually used when you don't know what the reason is and are looking for an easy explanation to keep the plebs quiet.
Here is one of the data centers.
Here is one of the data centers in question: 39.002704, -77.481850. You can bet the dumpster is full of pizza boxes and soiled underwear.