So, basically a land hurricane?
> 2012
> Still not using quark computronium safely embedded in earth's core, accessible only via high-energy neutrinaser.
It's the capitalists, I say. They are holding EVERYTHING back.
A wave of "hurricane-like" thunderstorms ripped across Indiana, Ohio, West Virginia, and Virginia on Friday night, leaving more than 3.5 million people without power and knocking out the US-East-1 data center operated by Amazon Web Services. Netflix, Pinterest, Instagram, and Heroku, which run their services atop Amazon's …
Was driving I-64 just west of Richmond VA and pulled in to a hotel for the night just before the storm hit. Watched the storm from our room. The wind absolutely whipped the trees. We were lucky. Lights flickered, but we didn't lose power or communications. No damage around the hotel.
Continued west on I-64 today and saw large trees down and debris along the interstate. Crews had cleaned up anything blocking the interstate by the time we got on the road. No traffic backups due to downed trees. Found rest areas and towns without power the whole way. On top of that there are 100F+ (37C) temperatures today. I saw 108F on my car's thermometer at one point. Without air conditioning it's like an oven out there.
More likely, a squall line http://en.wikipedia.org/wiki/Squall_line
Hurricanes cannot form over land. They are driven by hot moist air rising from an ocean surface. They also take several days to get going, so you get at least twelve hours' notice that a hurricane is headed your way. Usually longer.
Squall lines give very little advance notice. I've heard tell of a transition from a hot summer day to roofs being blown off an hour later.
When you really understand cloud, then if the customer can fire up his generator, train his repurposed 10 m satellite dish on a distant wifi point and get Internet, he can access your service in whatever degraded way his bandwidth permits - even if the rest of the continent and your local resources are offline.
You need to ensure that a safe voltage arrives inside the datacentre or none at all. Turning on generators automatically when there's an orderly power cut is straightforward. But when you're dealing with shorts, or indeed lightning strikes, on the sort of high-voltage power lines likely feeding that site, things become very unpredictable and very dangerous. Switchovers will be almost impossible to load balance and achieve cleanly.
I'd say the engineers were quite happy with only a 9 minute power down, given the situation - as the power lines feeding the site likely looked a bit like: http://www.youtube.com/watch?feature=player_embedded&v=NYCHBI66izs
As for why Instagram & co put everything in a single availability zone, well, that's just sheer muppetry.....
If they do not have UPSes (that is, UPSes that come with surge protection as standard) then frankly anyone doing biz with such a Data Centre needs their head examined.
Switchover should not be problematic... if it is, then you do not have a resilient environment and your BCS/DR guy is a cowboy.
There is no excuse these days for a power outage in a DC that does not trigger a UPS to take over supplying power - from your main units down to the rack-mounted ones... there is no excuse for a DC to go 'lights out' if the external power dies.
More worrying is that it is clear that Amazon do not spread redundancy across their Data Centres. At the first sign of trouble, management and staff need to think about where the break point is in terms of failovers to other sites.
Cloud. Yeah right.
Pint coz it's the finals of Euro 2012. Go England! Oh wait...they already have...errr...Forza Italia! And Leeds United *cough*
"More worrying is that it is clear that Amazon do not spread redundancy across their Data Centres."
They do. Read up on "Availability Zones" - in this case only a single one went down... That Netflix, Instagram, etc put all their eggs in one basket, is their own problem...
@ 142
Well evidently they don't have this kind of redundancy within their 'Availability Zones' right? It is quite clear that the entire 'Availability Zone' thing is just marketing wankery if the only Cloud Provider that went down was Amazon.
Pint coz...tomorrow is Monday.
>They do. Read up on "Availability Zones" -
According to some reports it was the availability zone failing to fail over that took out Amazon's own service. It is claimed that the data centre went down too quickly: the availability service relied on the failing data centre informing the others in advance so they could spin up their copies.
In the middle of the outage some users were complaining that Amazon's service dashboard for their instances claimed that the DC was fully up and running when in fact it was dark.
You answered your own question - "miniscule" being the operative word. I've worked at a large computer centre (still a fraction of the size of Amazon of course) and there they had a big room full of batteries to last for the few seconds it took to run up the secondary generators. These only lasted for a minute or so before they were getting really overloaded, but that gave the primary generators time to run up and stabilise. Now scale that up for an Amazon-sized server farm and see what kind of monstrous UPS plant that would need.
Basically, cloud computing is supposed to be sufficiently resilient that you don't need a UPS. Well, that's clearly more the theory than the practice.
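To put numbers on the scale problem in the comment above: the stored energy a battery plant needs is just load multiplied by ride-through time. A back-of-envelope sketch (the load figures are illustrative, not Amazon's actual numbers):

```python
# Back-of-envelope for bridging a grid dropout with batteries:
# energy needed = load * ride-through time until the generators
# stabilise. Loads here are illustrative guesses, not real figures.
def bridge_energy_kwh(load_mw, seconds):
    """kWh of stored energy needed to carry `load_mw` for `seconds`."""
    return load_mw * 1000 * seconds / 3600.0

print(bridge_energy_kwh(0.5, 60))  # small centre: ~8.3 kWh
print(bridge_energy_kwh(30, 60))   # hyperscale guess: 500.0 kWh
```

The energy itself isn't the hard part - it's that batteries must also deliver that energy at full load power for the whole minute, which is what makes the plant monstrous at scale.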
Flywheel UPS works well for that. The local synchrotron facility has over 1MW of standby generators, with flywheel UPS between them and the grid. When grid power fails the flywheels power the generators for the 10 seconds or so it takes the diesels to fire up and take up the load.
Alternatively go the telco route, with a small battery/UPS in each rack.
So you're saying that Amazon can't do what a small scale hosting solution can do, because it is tricky? Isn't that what they are selling us - "Trust us, we know DCs".
Isn't the whole point of cloud computing that someone much more experienced than you at providing DC facilities provides your DC facilities?
There is nothing wrong here, from what I can tell a single availability zone went down.
AWS is designed in such a way that if availability isn't important, you can base your load in one locale (in this case, N. Virginia). If you want more availability, use best practices and spread your loads around.
The real story is that these services still aren't properly able to cope with the conditions of the underlying infrastructure.
Yeah, they've got the region "us-east" (Virginia) with availability zones a-d. If only one zone in the region went down I don't see this as a real issue. There's a phrase involving eggs and baskets that springs to mind.
I can't see how this would drive people to move to other public clouds for reliability either. At least from the EC2 perspective, Amazon's cloud is failing in the way it's advertised to fail. Again with Azure, unless T&Cs have changed since I last looked, you get no SLA unless you've got your stuff deployed in different Azure reliability zones anyway.
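The eggs-and-baskets point is easy to make concrete: even naive round-robin placement across zones means a single-zone outage only costs you about 1/N of the fleet. A toy sketch (zone names are illustrative, and this is plain arithmetic, not a live AWS call):

```python
# Spread N instances round-robin across a region's availability
# zones, so that losing any single zone loses only ~1/len(zones)
# of the fleet. Zone names are illustrative placeholders.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]

def place_instances(n, zones=ZONES):
    """Return a mapping of zone -> number of instances to launch there."""
    placement = {z: 0 for z in zones}
    for i in range(n):
        placement[zones[i % len(zones)]] += 1
    return placement

def surviving_fraction(placement, failed_zone):
    """Fraction of the fleet still up after one zone goes dark."""
    total = sum(placement.values())
    lost = placement.get(failed_zone, 0)
    return (total - lost) / total

plan = place_instances(12)
print(plan)                                    # 3 instances per zone
print(surviving_fraction(plan, "us-east-1a"))  # 0.75
```

The real work is making the application tolerate losing that quarter gracefully - placement alone doesn't buy you that.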
Sounds like someone either didn't do an adequate disaster plan, or more likely some accountant wonk in management decided to save some money and not implement the entire plan. Personally, I hope they kept track of who made those decisions - but somehow it's more likely that the wonk got a promotion for "saving money" and it's someone else who is going to get blamed for it. Probably someone in I.T.
number one: I thought cloud-based services were supposed to cope with this kind of occurrence?
number two: one single data center does not a cloud make... ...services resident in one data center ARE NOT cloud services! (e.g. Instagram et al.)
number three: don't they have a UPS and backup generator?
summary: if it were my business running in 'the cloud' on Amazon services, I would be LIVID, to put it mildly.
roy.
AWS lets you buy services in regional availability centres, be they throughout the continental US, Europe, Asia - all over, really. It's up to the customers to decide on their failover scoping, preparation and scripting - I think this highlights that many didn't, and assumed that it would 'just happen', but no-one should assume that it is bulletproof, and should make the necessary backup plans. Google reckon they can do all this automagically, but even they occasionally manage to stuff things up.
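The failover scoping and scripting mentioned above can start as something very small: probe the primary, and fall back to a standby if it stops answering. A toy sketch (hostnames and the probe are stand-ins - a real version would hit an HTTP health-check URL and repoint DNS or a load balancer):

```python
# Toy failover decision: try the primary endpoint a few times and,
# if it never answers, direct traffic to the standby instead.
# Hostnames and the probe function are illustrative stand-ins.
def choose_endpoint(probe, primary, standby, retries=3):
    """Return primary if it answers within `retries` probes, else standby."""
    for _ in range(retries):
        if probe(primary):
            return primary
    return standby

# Simulated probe: the us-east endpoint is dark, eu-west is up.
status = {"app.us-east-1.example.com": False,
          "app.eu-west-1.example.com": True}
probe = lambda host: status[host]

active = choose_endpoint(probe, "app.us-east-1.example.com",
                         "app.eu-west-1.example.com")
print(active)  # app.eu-west-1.example.com
```

The hard part in practice is not this loop but making sure the standby actually has your data and capacity when the switch happens.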
EXACTLY.
Not looking very "elastic" is it.
Why on earth didn't Amazon fail over the workload to another DC within minutes of a problem occurring? Isn't that the whole bleeding idea of the all-magic, highly resilient, always-on cloud?
PMSL. Epic Cloud Fail. Just another example of a cloud hype vs reality disconnect.
Deja vu! Always have a plan B - no matter how sophisticated a system, the unexpected will always happen. This is our strategy: rely on others only to the extent that you are able to manage any failure. http://www.workbooks.com/community/blog/buck-stops-here
"Luckily for the Prickett Morgan household, we had just finished up watching several episodes of The IT Crowd over Netflix just before the storm hit."
Have you considered the possibility that this extremely rash and almost incomprehensible act may well have CAUSED this devastating storm?
The storm here (Western Capitol Region/D.C.) was terrible. The worst I've seen in the six years since I moved to the mid-Atlantic region in terms of property and infrastructure damage. 'Land Hurricane' per above is accurate except the lightning was intense! At times it looked like daylight outside because of the sustained lightning.
That being said, all the "disaster planning experts" above have to understand that in situations like that (which the systems are designed to detect through mains variances) you can't just instantly go to backup without knowing what caused the power outage. If the site was hit directly there could be internal shorts causing the problem, and if you keep forcing juice down its throat then the whole place might burn; Halon be damned... The system detected those variances and worked as designed.
Amazon did a fine job with only nine minutes of downtime. Most people can't even get in a good wank in that time.
Lightning strikes don't always raise the potential of the building. They can also raise the potential of the surrounding ground which is where most Earth taps are on your electrical system.
Most equipment (especially UPS) have a hell of a time with GND having a higher voltage than the incoming phases. Follow your +pV GND with a +pV phase and the resulting surge is like a tidal wave where electricity flows back and then hits forward twice as hard.
Shouldn't be an issue. An operation of their size would (or should) have its own substation, and as they can reliably load balance the phases they can (should) have no primary neutral and float the secondary one, or at least anchor it to the building's steel frame. Outside ground potential can then do whatever it likes, and even if the primary terminals are lit up like a Christmas tree the secondary should be fine.
No. You are not right.
Grounded rooftop conductors can help with small indirect strikes but nothing can manage multiple direct strikes in a short period. Meters go wonky, internal breakers flip and shit's just weird. Keeping everything up and running is not as simple as you think. You have to know what's going on before you just put the juice back to it.
I tried to watch Netflix on my Wii the night after this occurred and it locked up at a loading tiles screen. For hours their twitter feed showed that everything was ok. I went to sleep a while after that so I don't know how much longer it took them to notice they lost an entire datacenter somewhere out there in the cloud.
Any professionally-run data center should be capable of riding through a power interruption of any sort, short of a direct hit on the building wiring. Most major data centers boast of having enough generator fuel onsite for days of off-grid local power generation, in many cases with contracts with multiple fuel suppliers to deliver more if necessary.
The fact that this one could not ride through even a short interruption tells me that cost-reduction is the #1 priority here. Same goes for the clients who didn't span their instances across multiple EC2/AWS zones or different providers entirely - especially since they should have known about the threadbare power infrastructure Amazon is using.
We build our apps on Google's Appengine. We monitor every five minutes, and have not detected a single outage in over a year (since we switched to their high replication datastore). I know AWS outages only affect limited sets of their customers, but they seem to happen fairly frequently.
Appengine did require a mind-shift and some re-engineering of our APIs. But on the credit side of the ledger it also removed a lot of sysadmin work because it is a platform rather than raw infrastructure. Which also reduces the chances of outages due to someone munting your database/web/mail server configuration.
Of course, now I've mentioned this, Appengine will suffer a global failure in 3... 2... 1...
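For what it's worth, the five-minute check described above needs very little code. A minimal sketch (the URL, interval, and probe are all placeholders; the probe is injectable so the loop can be exercised without a network):

```python
# A tiny uptime monitor of the kind described above: probe a health
# URL on a fixed interval and keep a running availability figure.
import time

def availability(results):
    """Fraction of successful probes (1.0 means no detected outage)."""
    return sum(results) / len(results) if results else 0.0

def monitor(probe, url, interval=300, max_probes=3):
    """Run `max_probes` checks against `url`, `interval` seconds apart."""
    results = []
    for i in range(max_probes):
        results.append(bool(probe(url)))
        if i < max_probes - 1:
            time.sleep(interval)
    return results

# Simulated run: four canned probe results, one failure among them.
canned = iter([True, True, False, True])
results = monitor(lambda url: next(canned),
                  "http://example.com/health", interval=0, max_probes=4)
print(availability(results))  # 0.75
```

In production the probe would be a real HTTP GET with a timeout, and anything below 1.0 would page someone rather than just print.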
So, now we get VCs chucking a load of money at solving problems that the whole cloud thing was supposed to solve in the first place. i.e. scalability and redundancy.
I do so love slapping a cobbled-up patch over the obvious holes rather than fixing something so it does what it's bloody supposed to do. Still, stops things getting so simple that anyone can understand 'em and keeps us all in work I guess, so there is a bright side.
Is this cloud of clouds going to be THE cloud first thought of, or will we need more than one of these to stop it all falling on its arse too?
Although the author may be right about the VC cash, I doubt it would fix the problem anyway. We're talking about a huge geographical area that was effectively hit by a Cat 1 hurricane without the warnings that accompany a Cat 1 hurricane, and on a path no hurricane would ever take (which is part of the problem restoring power in Northern Virginia: the people they normally call on are working on their own problems and everybody is calling further west). The Wide Area Redundant Array really would need servers in practically every area of the US, Canada, and Mexico, with extra thrown in across Asia, Australia, and Europe to boot.
...so we're OK as far as the power situation goes, as all that stuff is underground inside the city, so we don't have to worry about falling trees taking out power lines and such.
However, my broadband provider, Megapath (nee Covad) has both of its main offices for the DC Metro region in the 'burbs, in Arlington, VA and Silver Spring, MD, both of which are still struggling to get their power back. On top of this, Verizon has suffered some hardware damage and failures. So, between the power outages and Verizon's hardware issues, there are still several tens of thousands of Megapath* customers in the region with no Internet (I'm posting this from a coffee shop near my house).
Still, the wife and I are thankful as we never lost our power through all this -- ironically, we still had Internet after the storms passed; the 'Net went down early Saturday AM EDT -- although some friends of mine who I spoke with last night who live just outside the city, out in Arlington, still have no power or landline. They still have working cellular, but they occasionally have to jump in the car and take a drive or two around the block to charge their phones. I'm sending them a standing invitation to catch a Metro into the city and come hang out at our place for a while, watch a movie, drink a couple of beers, charge their phones and enjoy our A/C if it gets too bad where they are.
*Insert obligatory "Megadeth" joke here, if you insist. I've been with them for six or seven years and am actually quite happy with them.