Amazon cloud knocked out by violent storms in Virginia

A wave of "hurricane-like" thunderstorms ripped across Indiana, Ohio, West Virginia, and Virginia on Friday night, leaving more than 3.5 million people without power and knocking out the US-East-1 data center operated by Amazon Web Services. Netflix, Pinterest, Instagram, and Heroku, which run their services atop Amazon's …

COMMENTS

This topic is closed for new posts.
  1. Destroy All Monsters Silver badge

    So, basically a land hurricane?

    > 2012

    > Still not using quark computronium safely embedded in earth's core, accessible only via high-energy neutrinaser.

    It's the capitalists, I say. They are holding EVERYTHING back.

    1. Aaron Em

      Re: So, basically a land hurricane?

      Basically, yeah. I hear it was a hell of a storm -- I wish I hadn't slept through it.

      1. Anonymous Coward

        Re: So, basically a land hurricane?

        Was driving I-64 just west of Richmond VA and pulled in to a hotel for the night just before the storm hit. Watched the storm from our room. The wind absolutely whipped the trees. We were lucky. Lights flickered, but we didn't lose power or communications. No damage around the hotel.

        Continued west on I-64 today and saw large trees down and debris along the interstate. Crews had cleaned up anything blocking the interstate by the time we got on the road. No traffic backups due to down trees. Found rest areas and towns without power the whole way. On top of that there are 100F+ (37C) temperatures today. I saw 108F on my car's thermometer at one point. Without air conditioning it's like an oven out there.

      2. disgruntled yank

        Re: So, basically a land hurricane?

        I'm impressed that you could. Of course, maybe I'd have done so, had I been asleep.

        1. Nigel 11
          Alert

          Re: So, basically a land hurricane?

          More likely, a squall line http://en.wikipedia.org/wiki/Squall_line

          Hurricanes cannot form over land. They are driven by hot moist air rising from an ocean surface. They also take several days to get going, so you get at least twelve hours notice that a hurricane is headed your way. Usually longer.

          Squall lines give very little advance notice. I've heard tell of a transition from a hot summer day to roofs being blown off an hour later.

    2. LarsG

      So you want to leave all your data on the cloud?

      Rethink time methinks.

      1. Mikel
        Pint

        It's fine

        When you really understand cloud then if the customer can fire up his generator, train his repurposed 10M satellite dish on a distant wifi point and get Internet, then he can access your service in whatever degraded way his bandwidth permits - even if the rest of the continent and your local resources are offline.

  2. Tom 20
    FAIL

    The cloud taken out by clouds? Now I've heard it all...

    1. Anonymous Coward

      Quite - kind of worrying that the entire cloud was taken out by a storm. So much for redundancy....

      1. Andrew Baines Silver badge
        FAIL

        9 minutes?

        What about backup power? Even my home office can survive 30 minutes.

        Guess I'll be keeping my own servers for a while longer.

    2. Jim Carter
      Coat

      Have you heard the sound of a man eating his own head?

  3. Anonymous Coward
    WTF?

    Standby Generators?

    Don't Amazon bother with standby generators at their datacentres, to cover loss of utility power?

    1. 142

      Re: Standby Generators?

      You need to ensure that a safe voltage arrives inside the datacentre or none at all. Turning on generators automatically when there's an orderly powercut is straightforward. But when you're dealing with shorts, or indeed lightning strikes, on the sort of high voltage power-lines likely feeding that site, things become very unpredictable and very dangerous. Switchovers will be almost impossible to load balance and achieve cleanly.

      I'd say the engineers were quite happy with only a 9 minute power down, given the situation - as the power lines feeding the site likely looked a bit like: http://www.youtube.com/watch?feature=player_embedded&v=NYCHBI66izs

      As for why Instagram & co put everything in a single availability zone, well, that's just sheer muppetry.....

      1. Phil O'Sophical Silver badge

        Re: Standby Generators?

        That's what a UPS is for. As soon as the main supply becomes unstable the UPS kicks in. Main supply is then disconnected from UPS input and replaced with generator power, the UPS covering the outage. It remains like that until the public supply comes back solid & stable.
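A minimal sketch of the transfer sequence described above, written as a small state machine. The names `Source` and `next_source` are hypothetical, and the rule for returning to mains straight from battery is an assumption rather than something stated in the thread; real transfer switches also use timers and interlocks that are not modelled here.

```python
from enum import Enum, auto


class Source(Enum):
    """Where the load is currently drawing its power from (hypothetical model)."""
    MAINS = auto()
    UPS_BATTERY = auto()
    GENERATOR = auto()


def next_source(current: Source, mains_stable: bool, generator_ready: bool) -> Source:
    """Return the source feeding the load after one control step."""
    if current is Source.MAINS:
        # Unstable mains is disconnected and the UPS carries the load.
        return Source.MAINS if mains_stable else Source.UPS_BATTERY
    if current is Source.UPS_BATTERY:
        if mains_stable:
            # Assumption: a brief glitch that clears lets the load go back to mains.
            return Source.MAINS
        # Otherwise wait on battery until the generator is ready to take over.
        return Source.GENERATOR if generator_ready else Source.UPS_BATTERY
    # On generator: stay there until the public supply is solid and stable again.
    return Source.MAINS if mains_stable else Source.GENERATOR
```

Stepping it with `mains_stable=False` moves the load from mains to battery, and then to the generator once it reports ready.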

      2. IT Hack

        Re: Standby Generators?

        If they do not have UPS (that is UPS that come with surge protection as standard) then frankly anyone doing biz with such a Data Centre needs their head examined.

        Switch over should not be problematic...if it is then you do not have a resilient environment and your BCS/DR guy is a cowboy.

        There is no excuse these days for a power outage in a DC that does not trigger a UPS to take over supplying power - from your main ones to the rack mounted....there is no excuse for a DC to go 'lights out' if the external power dies.

        More worrying is that it is clear that Amazon do not spread redundancy across their Data Centres. The first sign of trouble the management and staff need to think about what the break point is in terms of failovers to other sites.

        Cloud. Yeah right.

        Pint coz its the finals of Euro 2012. Go England! Oh wait...they already have...errr...Forza Italia! And Leeds United *cough*

        1. 142

          Re: Standby Generators?

          "More worrying is that it is clear that Amazon do not spread redundancy across their Data Centres."

          They do. Read up on "Availability Zones" - in this case only a single one went down... That Netflix, Instagram, etc put all their eggs in one basket, is their own problem...

          1. IT Hack
            Pint

            Re: Standby Generators?

            @ 142

            Well evidently they don't have this kind of redundancy within their 'Availability Zones' right? It is quite clear that the entire 'Availability Zone' thing is just marketing wankery if the only Cloud Provider that went down was Amazon.

            Pint coz...tomorrow is Monday.

            1. 142
              Facepalm

              Re: Standby Generators?

              "Well evidently they don't have this kind of redundancy within their 'Availability Zones' right? "

              That's.... the whole.... point..............................

          2. Anonymous Coward

            Re: Standby Generators?

            So that's like: we have redundancy, but you have to provide it yourself?

          3. Yet Another Anonymous coward Silver badge

            Re: Standby Generators?

            >They do. Read up on "Availability Zones" -

            According to some reports it was the availability zone failing to fail over that took out Amazon's own service. It is claimed that the data centre went down too quickly and the availability service relied on the down data centre to inform the others in advance to spin up their copies.

            In the middle of the outage some users were complaining that Amazon's service dashboard for their instances claimed that the DC was fully up and running when in fact it was dark.

  4. Anonymous Coward
    Joke

    Ouch!

    That's gotta hurt their guaranteed 99,9999999999999999999999999999999999999999999999999999% uptime!

  5. Chris Miller

    So what was it? A lightning strike on the building? I assume they've got UPS to handle the power issues.

    1. Fatman
      FAIL

      RE: So what was it? A lightning strike on the building?

      More likely some muppet decided to see what `the BIG RED button does`!!!

      WHAM!!!! TOTAL DARKNESS!

    2. kain preacher

      The leader of a bolt of lightning can travel at speeds of 220,000 km/h (140,000 mph) and can reach temperatures of about 30,000 °C (54,000 °F). Oh, and there's the fact that it's over 1,000 times the energy that goes into the UPS.

  6. Anonymous Coward

    Wrong kind of cloud?

    Don't mess with the Cumulonimbus kids. They are mean MoFo's

  7. Anonymous Coward

    Have Amazon never heard of diesel backup generators? I work for a mid-sized ISP - minuscule compared to Amazon - and we have backup generators for just this kind of situation. What gives, Amazon?

    1. Anonymous Coward

      You answered your own question - "minuscule" being the operative word. I've worked at a large computer centre (still a fraction of the size of Amazon of course) and there they had a big room full of batteries to last for the few seconds it took to run up the secondary generators. These only lasted for a minute or so before they were getting really overloaded, but that gave the primary generators time to run up and stabilise. Now scale that up for an Amazon-sized server farm and see what kind of monstrous UPS plant that would need.

      Basically, cloud computing is supposed to be sufficiently resilient that you don't need a UPS. Well, that's clearly more the theory than the practice.

      1. Phil O'Sophical Silver badge

        Monstrous UPS?

        Flywheel UPS works well for that. The local synchrotron facility has over 1MW of standby generators, with flywheel UPS between them and the grid. When grid power fails the flywheels power the generators for the 10 seconds or so it takes the diesels to fire up and take up the load.

        Alternatively go the telco route, with a small battery/UPS in each rack.

      2. Tom 38

        So you're saying that Amazon can't do what a small scale hosting solution can do, because it is tricky? Isn't that what they are selling us - "Trust us, we know DCs".

        Isn't the whole point of cloud computing that someone much more experienced than you at providing DC facilities provides your DC facilities?

  8. karlp
    Stop

    There is nothing wrong here; from what I can tell, a single availability zone went down.

    AWS is designed in such a way that if availability isn't important, you can base your load in one locale (in this case, N. Virginia). If you want more availability, use best practices and spread your loads around.

    The real story is that these services still aren't properly able to cope with the conditions of the underlying infrastructure.
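As a rough illustration of why spreading a load around helps, here is a back-of-the-envelope sketch. The function name `combined_availability` and the 99% figure are hypothetical, and the calculation assumes zones fail independently, which a regional storm may well violate.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Chance that at least one of `zones` zones is up.

    Assumes every zone has the same availability and that zones fail
    independently of each other (a strong, illustrative assumption).
    """
    if zones < 1:
        raise ValueError("need at least one zone")
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    # The service is down only if every zone is down at the same time.
    return 1.0 - (1.0 - zone_availability) ** zones


# Hypothetical figures: one zone at 99% versus the same load spread over two.
single = combined_availability(0.99, 1)
double = combined_availability(0.99, 2)
```

Under these assumptions the second zone cuts expected unavailability from 1% to 0.01%; correlated failures would shrink that gain.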

    1. eLD

      Yeah, they've got the region "us-east" (Virginia) with availability zones a -> d. If only one zone in the region went down I don't see this as a real issue. There's a phrase involving eggs and baskets that springs to mind.

      I can't see how this would drive people to move to other public clouds for reliability either. At least from the EC2 perspective, Amazon's cloud is failing in the way it's advertised to fail. Again with Azure, unless T&Cs have changed since I last looked, you get no SLA unless you've got your stuff deployed in different Azure reliability zones anyway.

  9. raving angry loony

    Poor planning

    Sounds like someone either didn't do an adequate disaster plan, or more likely some accountant wonk in management decided to save some money and not implement the entire plan. Personally, I hope they kept track of who made those decisions - but somehow it's more likely that the wonk got a promotion for "saving money" and it's someone else who is going to get blamed for it. Probably someone in I.T.

  10. Stephen W Harris
    Joke

    I think the storm was an excuse; it was the leap second that did it!

  11. roy 5
    Facepalm

    cloud? enterprise quality data center? hmmm :/

    number one: I thought cloud based services are supposed to cope with this kind of occurrence?

    number two: one single data center does not a cloud make... ...services resident in one data center ARE NOT cloud services! (e.g. Instagram et al.)

    number three: don't they have a UPS and backup generator?

    summary; if it was my business running in 'the cloud' on amazon services, I would be LIVID to put it mildly.

    roy.

    1. ElNumbre
      Stop

      Re: cloud? enterprise quality data center? hmmm :/

      AWS let you buy services in regional availability centres, be they throughout the continental US, Europe, Asia, all over really. It's up to the customers to decide on their failover scoping, preparation and scripting - I think this highlights that many didn't and assumed that it would 'just happen'. No one should assume that it is bulletproof; everyone should make the necessary backup plans. Google reckon they can do all this automagically, but even they occasionally manage to stuff things up.

  12. Coderjoe
    FAIL

    Second outage in June

    AWS had another power issue in their N. Virginia region on June 15. I don't know if it was the same datacenter offhand, but this does make me wonder. Where are the UPS and generator backups?

  13. jaycee331
    FAIL

    Re: cloud? enterprise quality data center? hmmm :/

    EXACTLY.

    Not looking very "elastic", is it?

    Why on earth didn't Amazon fail over the workload to another DC within minutes of a problem occurring? Isn't that the whole bleeding idea of the all magic, highly resilient, always on cloud?

    PMSL. Epic Cloud Fail. Just another example of a cloud hype vs reality disconnect.

    1. JamesKay
      Linux

      Re: cloud? enterprise quality data center? hmmm :/

      Deja vu! Always have a plan B - no matter how sophisticated a system, the unexpected will always happen. This is our strategy: rely on others only to the extent that you are able to manage any failure. http://www.workbooks.com/community/blog/buck-stops-here

  14. Anonymous IV
    Alert

    Cause and effect?

    "Luckily for the Prickett Morgan household, we had just finished up watching several episodes of The IT Crowd over Netflix just before the storm hit."

    Have you considered the possibility that this extremely rash and almost incomprehensible act may well have CAUSED this devastating storm?

    1. Robert Carnegie Silver badge

      And have you tried turning it off and on again?

      Ah go on! Go on! Go on, go on, go on, go on, go on, go on, go on. GO ON!

      ...oh yeah, wrong show.

      1. tpm (Written by Reg staff)

        Re: And have you tried turning it off and on again?

        I believe Mother Nature turned it off and on.... HA!

      2. martin burns
        Coat

        Re: And have you tried turning it off and on again?

        ShowS shurely...?

        Still, the outage may not have been small, but it was FAR AWAY (from Reg Towers)

        Yeah, the one with the bookshop business card (scribbled on the back of a torn beermat) in the top pocket.

  15. Don Jefe
    Boffin

    Amazon Did Well

    The storm here (Western Capitol Region/D.C.) was terrible. The worst I've seen in the six years since I moved to the mid-Atlantic region in terms of property and infrastructure damage. 'Land Hurricane' per above is accurate except the lightning was intense! At times it looked like daylight outside because of the sustained lightning.

    That being said, all the "disaster planning experts" above have to understand that in situations like that (which the systems are designed to detect through mains variances) you can't just instantly go to backup without knowing what caused the power outage. If the site was hit directly there could be internal shorts causing the problem and if you keep forcing juice down its throat then the whole place might burn; Halon be damned... The system detected those variances and worked as designed.

    Amazon did a fine job with only nine minutes of downtime. Most people can't even get in a good wank in that time.

    1. IT Hack

      Re: Amazon Did Well

      Because of course Amazon do not have lightning conductors on their buildings and the only other ingress for power is via a UPS (that deals with power spikes) which does not exist, amirite?

      1. Anonymous Coward

        Re: Amazon Did Well

        Lightning strikes don't always raise the potential of the building. They can also raise the potential of the surrounding ground which is where most Earth taps are on your electrical system.

        Most equipment (especially UPS) have a hell of a time with GND having a higher voltage than the incoming phases. Follow your +pV GND with a +pV phase and the resulting surge is like a tidal wave where electricity flows back and then hits forward twice as hard.

      2. Will Godfrey Silver badge
        Meh

        Re: Amazon Did Well

        Shouldn't be an issue. An operation of their size would (or should) have its own substation, and as they can reliably load balance the phases they can (should) have no primary neutral and float the secondary one, or at least anchor it to the building's steel frame. Outside ground potential can then do whatever it likes and even if the primary terminals are lit up like a Christmas tree the secondary should be fine.

        1. Don Jefe
          Boffin

          Re: Amazon Did Well

          The nine minutes to restore service would have been no shorter even if they had their own substation. The detected fault was internal and no amount of outside infrastructure would have mattered. The system worked as designed and it worked rather well.

      3. Don Jefe
        Boffin

        Re: Amazon Did Well

        No. You are not right.

        Grounded rooftop conductors can help with small indirect strikes but nothing can manage multiple direct strikes in a short period. Meters go wonky, internal breakers flip and shit's just weird. Keeping everything up and running is not as simple as you think. You have to know what's going on before you just put the juice back to it.

  16. pixl97

    Netflix outage site takes hours to notice cloud is missing.

    I tried to watch Netflix on my Wii the night after this occurred and it locked up at a loading tiles screen. For hours their twitter feed showed that everything was ok. I went to sleep a while after that so I don't know how much longer it took them to notice they lost an entire datacenter somewhere out there in the cloud.

    1. pixl97

      Re: Netflix outage site takes hours to notice cloud is missing.

      I hate replying to myself, but..

      Is it a bad idea to have your system that warns everyone the cloud is down, on the cloud?

      1. Brendan Sullivan
        FAIL

        Re: Netflix outage site takes hours to notice cloud is missing.

        Yes, yes it is a bad idea.

  17. Phil Koenig
    FAIL

    Mismanaged

    Any professionally-run data center should be capable of riding through a power interruption of any sort, short of a direct hit on the building wiring. Most major data-centers boast of having enough generator fuel onsite for days of off-grid local power-generation, with in many cases contracts with multiple fuel contractors to deliver more if necessary.

    The fact that this one could not span even a short interruption tells me that cost-reduction is the #1 priority here. Same goes for the clients who didn't span their instances across multiple EC/AWS zones or different providers entirely - especially since they should have known about the threadbare power infrastructure Amazon is using.

  18. Andus McCoatover
    Windows

    It's Raining Lawyers!!

    Hallelujah! (OK, close as I can get...)

    http://www.dailymail.co.uk/news/article-1254812/Hundreds-fish-fall-sky-remote-Australian-town-Lajamanu.html

  19. Greg Fawcett
    Facepalm

    Glad we chose Appengine...

    We build our apps on Google's Appengine. We monitor every five minutes, and have not detected a single outage in over a year (since we switched to their high replication datastore). I know AWS outages only affect limited sets of their customers, but they seem to happen fairly frequently.

    Appengine did require a mind-shift and some re-engineering of our APIs. But on the credit side of the ledger it also removed a lot of sysadmin work because it is a platform rather than raw infrastructure. Which also reduces the chances of outages due to someone munting your database/web/mail server configuration.

    Of course, now I've mentioned this, Appengine will suffer a global failure in 3... 2... 1...

    1. Anonymous Coward

      Re: Glad we chose Appengine...

      Reading Google Appengine (and related services) documentation still makes my head fall off and roll around on the floor. It's all just so Wonderland I daren't actually touch it (relationality in a transactional database! Who the hell needs that?). Any tips?

  20. TeeCee Gold badge
    Facepalm

    ".....load balance across multiple cloud suppliers."

    So, now we get VCs chucking a load of money at solving problems that the whole cloud thing was supposed to solve in the first place. i.e. scalability and redundancy.

    I do so love slapping a cobbled-up patch over the obvious holes rather than fixing something so it does what it's bloody supposed to do. Still, stops things getting so simple that anyone can understand 'em and keeps us all in work I guess, so there is a bright side.

    Is this cloud of clouds going to be THE cloud first thought of, or will we need more than one of these to stop it all falling on its arse too?

    1. Tom 13

      Re: ".....load balance across multiple cloud suppliers."

      Although the author may be right about the VC cash, I doubt it would fix the problem anyway. We're talking about a huge geographical area that was effectively hit by a Cat 1 hurricane without the warnings that accompany a Cat 1 hurricane and on a path no hurricane would ever take (which is part of the problem restoring power in Northern Virginia: the people they normally call on are working on their own problems and everybody is calling further west). The Wide Area Redundant Array really would need servers in practically every area of the US, Canada, and Mexico, with extra thrown in in Asia, Australia, and Europe to boot.

  21. Mike Flugennock

    I live in the middle of Washington, DC...

    ...so we're OK as far as the power situation goes, as all that stuff is underground inside the city, so we don't have to worry about falling trees taking out power lines and such.

    However, my broadband provider, Megapath (nee Covad) has both of its main offices for the DC Metro region in the 'burbs, in Arlington, VA and Silver Spring, MD, both of which are still struggling to get their power back. On top of this, Verizon has suffered some hardware damage and failures. So, between the power outages and Verizon's hardware issues, there are still several tens of thousands of Megapath* customers in the region with no Internet (I'm posting this from a coffee shop near my house).

    Still, the wife and I are thankful as we never lost our power through all this -- ironically, we still had Internet after the storms passed; the 'Net went down early Saturday AM EDT -- although some friends of mine who I spoke with last night who live just outside the city, out in Arlington, still have no power or landline. They still have working cellular, but they occasionally have to jump in the car and take a drive or two around the block to charge their phones. I'm sending them a standing invitation to catch a Metro into the city and come hang out at our place for a while, watch a movie, drink a couple of beers, charge their phones and enjoy our A/C if it gets too bad where they are.

    *Insert obligatory "Megadeth" joke here, if you insist. I've been with them for six or seven years and am actually quite happy with them.

  22. Helloworld

    Just Amazon?

    Seems like about half the world has a data center in Northern Virginia, and came through it OK. Was there some unique combination of factors that hit Amazon? Maybe they need to do a little reliability re-engineering.
