AWS blames 'latent bug' for prolonging Sydney EC2 outage

Amazon Web Services has explained the extended outage its Sydney services suffered last weekend, attributing downtime to a combination of power problems and a “latent bug in our instance management software”. Sydney recorded over 150 mm of rain last weekend. On Sunday the 5th the city copped 93 mm alone, plus winds gusting …

  1. Just a geek

    that spinning flywheel...

    ...which provides power during generator startup is the same system that Chernobyl was testing on the night of the accident, with similar results, it seems!

    1. Mark 85

      Re: that spinning flywheel...

      That type of system is fairly common. The breakers used are usually the sources of the problem. If they arc on opening and the arc doesn't break quickly, the power does flow back onto the grid. Humidity is not a good thing with those breakers and many times they are outside the main building or in the generator shed where humidity isn't controlled.

      OTOH, watching them arc is impressive and spectacular.

      1. Jim Mitchell

        Re: that spinning flywheel...

        The article seems to gloss over this, but backfeeding power into the utility grid is going to make people very upset. And possibly dead. If your disconnect/transfer switch device is known to be problematic, why was it allowed to be installed in the first place?

        1. Mark 85

          Re: that spinning flywheel...

          From what I've seen, it's either "profit" or "stupidity". Pick 2.

          1. Alan Brown Silver badge

            Re: that spinning flywheel...

            It's usually "profit" or its cousin "economising on options"

            There is a really good reason to ensure that the system you specified is the system that's been ordered (accounting types will go X=Y so Z(*)) and actually delivered (**).

            (*) We were lambasted over the price (and quantities) of tape being consumed in IT and asked why we couldn't use cheaper alternatives with all purchasing ability blocked until this was resolved. The answer was that we'd looked at the suggested products from Sellotape, but concluded that they would have an unfortunate tendency to gum up the tape drives.

            (**) A classic case being the Quantity surveyor who decided a building was massively overengineered, and so deleted much of this extra cost without referral back to the customer. The result was a purpose-built city library building that didn't have floors strong enough to hold bookshelves on its upper 3 (of 5) floors.

      2. Alan Brown Silver badge

        Re: that spinning flywheel...

        In our experience (filthy power is the norm in SE England) it's best to interpose the flywheel permanently between the source and load, else normal day-to-day glitches and spikes will cause random trouble.

        That way you gain the benefit of 100% conditioned power hitting the datacentre (no spikes, etc) and you don't need to worry about breakers causing outages (although it's happened here on several occasions when the system has been switched to pass-through for flywheel maintenance). The Caterpillar/Standby Power Systems setup is pretty shitty overall, but still better than most of the rest.

        This doesn't help if the diesels don't start - which has also happened here thanks to helpful people in the organisation economising on a £700k purchase by deleting a £500 redundant starting option.

        Of course if you're 100% serious about your power you run several flywheels and generators in N+1 parallel configuration. This allows you to switch out 1 flywheel or 1 diesel for maintenance and still have the capability to ride a power outage. Phase coherency is a long-solved problem.

      3. Denarius

        Re: that spinning flywheel...

        No doubt the breakers were well tested. In Nevada. Humid? Sydney? Never </sarcasm>. Given the weather bureau was accurately predicting the size of that front and its probable arrival times and effects, why weren't the diesel gens up and running with the tanks topped up? Lots of stories about shiny generators without fuel when bean counters save money.

        No one in their right mind trusts the crap that is Oz power distribution, not to mention the compulsive planting of trees where they can do maximum damage in 30 years.

  2. Anonymous Coward

    Your business in the cloud - reminiscing...

    "Service unavailable" was your tag line for the few months you were in business. "I'm sorry, we can't find your order, our system's down again", was your unhappy customer service team's mantra to your dwindling and increasingly infuriated customer base. But at least you saved some money not buying your own machines... Oh wait. You were with Monster Cloud and your prices went up by 5,000% overnight! Each extra customer was actually costing you.

    You still wonder where it all went wrong as you serve up yet another sugar-loaded coffee perversion to a skinny, Apple Watch-wearing, man-bunned hipster in Starbucks. The only thing in the cloud now is your head. How could the consultants have lied to you like this... How are you going to pay the bills?

    "But I was in the cloud", you sigh. You weren't really. You just gave away the core of your business to someone else. And paid them handsomely to not care about it nearly as much as you did.

    1. WraithCadmus

      Re: Your business in the cloud - reminiscing...

      The cloud offers many advantages, but always have a plan to do your business somewhere else if it's needed.

    2. sinnerFA

      Re: Your business in the cloud - reminiscing...

      This by far is the best definition of the ambiguous "cloud". Can we get this made official?

  3. Adam 52 Silver badge

    Regular readers will know from the byline that they should check the facts in this story, which you can do by going to http://status.aws.amazon.com/ and looking for the yellow triangle on 4th June.

    1. Anonymous Coward

      Are you saying that what was in the byline is incorrect? To me it looks correct: last weekend, on Sunday the 5th (in Australia, which is where it occurred), Amazon had a failure in Sydney with EC2.

      In the article this is also clarified.

  4. Anonymous Coward

    The Cloud...

    Somebody else's computers you have no control over.

    1. RudderLessIT

      Re: The Cloud...

      Which is why you have a contract... Also, you are weighing the cost of having to own everything (and then manage & maintain it) against the benefit of simply raising a purchase order.

      1. Dagg Silver badge

        Re: The Cloud...

        >Which is why you have a contract...

        Yeah, right. What isn't included is excluded, and if it is included there will be so many caveats around the inclusion that the whole thing means you are screwed.

  5. wyatt

    Maybe a move from spring-powered breakers to explosive-powered breakers is needed? Wouldn't like to be near them when they actuate...

    1. Anonymous Coward

      I'm surprised it's not solid state in this day and age.

      Spring-powered breakers sound very steampunk.

  6. Somone Unimportant

    OK, so they knew that bad weather was coming...

    With bad weather forecast some time beforehand, would it have been hard for AWS to have one generator actually up and running in advance?

    Would have helped avoid this outage and also provided a test for the UPS system.

    If I had stuff on AWS, I'd be spitting chips over this, if the outage was indeed due to a UPS issue. But if I were on AWS, I'd also have systems ready in another availability zone to take over should one go down.

  7. Anonymous Coward

    Cheap servers keep cloud costs down

    Their design philosophy is that servers are homogeneous and "cattle not pets." Cattle don't need dual power supplies and dual power feeds - they're expected to be frequently slaughtered and replaced. But the whole infrastructure is a set of least cost dominoes with lowest price components. Cloud services are built cheap and easily recover from failures, but they're also built to fail - "cattle not pets." And sometimes it takes a while to get a new herd settled in.

    Cloud service providers are like street vendor food - cheap, usually easy, and a lot better than nothing. But they aren't the most satisfying meal and sometimes you end up down and out the next day.

  8. RudderLessIT

    Just love for AWS

    I love that when Intune is out for four hours, there is a plethora of f*&K Microsoft! posts, yet AWS goes down and all is quiet...

  9. Anonymous Coward

    Well

    This outage is actually significant. It is now obvious that AWS has a design problem with its power systems. They know their data centers can't go down from a power loss. Bad for business and bad for customers - probably in that order. With all the money spent and time invested in getting the best of the best for their power systems, they still failed. They still failed. And it isn't the first time. AWS now has a trend of power failures and is the only major cloud to have that trend. Experts can recommend solutions that include multiple regions or cloud vendors to avoid application outage, and that's probably good advice; however, it significantly adds to the cost of the overall solution, is a bitch to test and manage, and may still go down for a wide variety of reasons that are impossible to test for across multiple regions or cloud vendors. Sticky wicket.

    1. DainB Bronze badge

      Re: Well

      http://www.itnews.com.au/news/amazon-web-services-to-build-two-sydney-data-centres-396802

      Let me try to decipher that marketing speak.

      They were trying to find a Tier 3 datacenter that can accommodate their growth. There are plenty of them in Sydney. So what they were really trying to do is find a cheaper datacenter than Equinix. Failing to find one, they decided to build their own, which crumbled at the first power outage.

  10. Anonymous Coward

    Really?

    Seriously, how can AWS consider their Sydney site to be of an acceptable standard when it was stated that it is powered by one utility provider? It sounds like they saved money on that but put in proper internal redundancies, gambling that the backups would work.

    For a normal decent data centre I'd expect power inputs from multiple substations and not uncommonly more than one utility provider. Costly but that's how you avoid complete outages on a big scale.
