Bungled storage upgrade led to Google cloud brownout

Google's 'fessed up to another bungle that browned out its cloud. The incident in question meant Google App Engine applications “received errors when issuing authenticated calls to Google APIs over a period of 17 hours and 3 minutes.” The cause, Google says, was that its “... engineers have recently carried out a migration of …

  1. Voland's right hand Silver badge

    Eternal Beta

    Eternal Beta can work when you have a rock star service engineering team which is an integral part of your software engineering team.

    It does not work any more if you hire by the busload and ship 'em by the busload. That is the point when you need processes. That is also the point where you get bitten in the a**e by the fact that you have built this enormous thing without any f*** process in place around it. That is also the point where you realize that adding formal release mechanics to something which was built and shipped for years without them is unbelievably hard. It is the point where you look back at your initial "Eternal Beta" idea and you double-facepalm: "Why the F*** didn't we have formal releases from the start?"

    1. frank ly

      Re: Eternal Beta

      "... adding formal release mechanics to something which was built and shipped for years without them is unbelievably hard."

      Even if you get the mechanics in place, you then have to change the 'culture' and mindset of many people. That can be impossibly hard.

      1. Daggerchild Silver badge

        Re: Eternal Beta

        Still, "I'm going to delete these things because nothing should be using them" differs in an important way from "I'm going to delete these things because nothing is using them". Bad techie. No biscuit.

        Optimism is a bad property to have in a sysadmin.

        1. Destroy All Monsters Silver badge
          Trollface

          Re: Eternal Beta

          On the contrary.

          Sometimes you just need to flush stuff and see whether the phone rings soon (or not so soon) after. Because finding out by questionnaire whether anyone still has valid data on this 15-year-old storage rig is pointless.

          1. Paul Crawford Silver badge

            Re: "anyone still has valid data on this 15-year old storage rig"

            Step 1 - unplug networking

            Step 2 - wait for several days/weeks to see what falls over and/or who calls you.

            Step 3 - shut down the rig now you're fairly confident it's not really needed, as once it has stopped and cooled you have little chance of spinning the disks up ever again!

  2. Mark 85

    "it also reveals simple mistakes are more often the problem rather than lightning strikes."

    It seems this is the normal failure mode for everything from rocket engineering to IT to even screw-ups at the local coffee shack. Disaster recovery and failure mode simulations always cover the big stuff. It's the little crap that nails you every time.

  3. Hollerith 1

    First, do no harm

    And then do no evil.

    (BTW, 'affected', not 'effected', Reg.)

  4. Alister

    I wonder, is there anywhere which shows just how many outages Google's services have suffered this year?

    Oh, the irony: a quick bit of googling produces these figures:

    Feb 18 Google Compute engine down about 1 hour

    Mar 9 Google Compute engine down about 45 min

    May 3 Google Play, Hangouts, Mail down about 3 hours

    Jun 19 Google App Engine down about 4 hours

    Aug 14 Google Compute Storage 11 hours of brownouts

    Aug 27 Google Cloud Storage down for 9 hours

    Oct 9 Google Apps, Docs down for 5 hours

    Dec 8 Google Container Engine loss of services for 21 hours

    Dec 17 Google App Engine authentication failures for 17 hours

    I make that 71 hours and 45 minutes of outages this year.

    How many nines is that?
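
    A rough back-of-the-envelope (taking the figures above at face value, and assuming a plain 365-day year with no overlap between incidents) puts it at roughly two nines:

      # Availability sketch from the outage list above (assumed totals).
      downtime_hours = 71 + 45 / 60          # 71 hours 45 minutes of outages
      hours_in_year = 365 * 24               # 8760 hours in the year
      availability = 1 - downtime_hours / hours_in_year
      print(f"{availability:.4%}")           # ~99.18% -- about two nines, well short of three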

    1. Anonymous Coward
      Anonymous Coward

      It's tricky, because, as the Reg says, Google are much more transparent than their rivals with these details, so it could be that this is just par for the course.

      More likely, though, Google are just sloppier than AWS (who have been doing this for some time now and have a more rigorous (and scarier) company culture in general) and Microsoft (who have their own history of screw-ups but seem to have learnt faster than Google to avoid them).

      If I were an enterprise invested in Google cloud, this kind of openness would put me on edge rather than reassure me. Partly because it looks like they're not learning, but also because Google are unusually twitchy about killing off products that don't get traction and harm their brand image. No one would be very surprised if they announced next week that they don't do cloud any more.

      Which is probably why so few enterprises are invested in Google cloud.

      1. TheVogon

        "More likely, though, Google are just sloppier than AWS"

        To be fair, AWS have customers to worry about losing...

  5. Martin Summers Silver badge

    More companies should be encouraged to be more human. Mistakes happen because humans make mistakes. So long as they can fix it and nothing is lost, that is all that matters. Unless we start seeing "RFO: Cleaner unplugged the UPS", that is.

  6. Anonymous Coward
    Anonymous Coward

    No, I don't want my Datacenter

    to look like Google's.

    Thanks & Goodbye
