Google broke its own cloud AGAIN, with TWO software bugs

A couple of days ago Google's cloud went offline, just about everywhere, for 18 minutes. Now the Alphabet subsidiary has explained why and issued a personal apology penned by “Veep for 24x7” Benjamin Treynor Sloss. And yes, that is Sloss' real title. Sloss says the problem started when “engineers removed an unused Google …

  1. Anonymous Coward
    Anonymous Coward

    Storing your Company's commercial secrets on Google Cloud ...

    ... should be a sacking offence.

    1. Pascal Monett Silver badge
      Trollface

      Apparently there are at least 5 DevOps Managers that disagree with you.

    2. Wilco

      Re: Storing your Company's commercial secrets on Google Cloud ...

      Why? This issue didn't affect commercial confidentiality or data security; it affected availability. I'd be confident that my data is more secure in Google's data centre than it is in some poorly patched corporate data cupboard-under-the-stairs. I might not always be able to get access to it, but hey, if I can't then the h4xx0rs can't either.

      Moreover, when something goes wrong they are pretty good about finding out what it was, fixing it and then telling everyone in detail about it. Working at large financial services organisations, I've seen any number of major outages where the root cause analysis was "dunno" and the follow up action was "cross fingers".

      1. Anonymous Coward
        Anonymous Coward

        Re: Storing your Company's commercial secrets on Google Cloud ...

        Availability is part of Security, so it did affect data security.

    3. TheVogon

      Re: Storing your Company's commercial secrets on Google Cloud ...

      Fortunately, only the tumbleweed rolling across their near empty cloud noticed the outage...

  2. A Non e-mouse Silver badge

    Why on Earth did they roll out a change to multiple zones at once? Surely you should be changing one zone at a time, so that if you do bork a zone, the others carry on working.
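
    Roughly what I mean, as a sketch; every name in it is made up for illustration and has nothing to do with Google's actual tooling:

      import time

      # All names here are hypothetical -- this is not Google's deployment system.
      ZONES = ["zone-a", "zone-b", "zone-c"]

      def apply_change(zone, change):
          print(f"applying {change} to {zone}")      # stand-in for a real deploy call

      def zone_healthy(zone):
          return True                                # stand-in for real monitoring probes

      def rollback(zone, change):
          print(f"rolling back {change} in {zone}")

      def staged_rollout(zones, change, soak_seconds=300):
          """Push a change one zone at a time; stop at the first unhealthy zone."""
          for zone in zones:
              apply_change(zone, change)
              time.sleep(soak_seconds)               # let the change soak before judging it
              if not zone_healthy(zone):
                  rollback(zone, change)
                  raise RuntimeError(f"halted at {zone}; remaining zones untouched")

      staged_rollout(ZONES, "route-filter-v2", soak_seconds=1)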

    1. Adam 52 Silver badge

      It was a BGP change. They are by no means the first to take out large chunks of the Internet with a dodgy BGP config; at least this was confined to their own network.

  3. Anonymous Coward
    Anonymous Coward

    18 Minutes

    Have you used Azure or Office 365?

    18 minutes is the usual daily uptime!

  4. tony2heads

    Lightning in clouds

    Lightning is totally to be expected in clouds. Not the other stuff

    1. boatsman

      Re: Lightning in clouds

      There's no lightning in clouds.

      Lightning is outgoing traffic when we are talking clouds... normally, anyway...

  5. Mike 125

    How many ways...

    >> Google says it has a “canary step” designed to catch messes like that described above. But the canary had a bug ...

    There's always another way of saying "Actually no, we don't really know what we're doing."

  6. Anonymous Coward
    Anonymous Coward

    Like I said in another article

    Once you get past 4 9s or so, downtime is mostly related to human error. From the list in the article of all of Google Cloud's outages since August, only one was not human error. And assuming all outages were of the 18-minute magnitude of this one, Google Cloud is running at about 3 1/2 9s.
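
    For the record, the "nines" arithmetic is a one-liner; the outage count below is my own guess for illustration, not a figure from the article:

      import math

      def nines(downtime_minutes, period_days):
          """Availability over a period, expressed as a 'number of nines'."""
          total_minutes = period_days * 24 * 60
          unavailability = downtime_minutes / total_minutes
          return -math.log10(unavailability)

      # Assumption for illustration only: six 18-minute outages over roughly eight months.
      downtime = 6 * 18
      period = 8 * 30
      print(f"availability: {1 - downtime / (period * 24 * 60):.4%}")   # ~99.97%
      print(f"nines: {nines(downtime, period):.1f}")                    # ~3.5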

    1. Anonymous Coward
      Anonymous Coward

      Re: Like I said in another article

      "Once you get past 4 9s or so, downtime is mostly related to human error"

      Once you get past two nines you should not be dealing with an external vendor connected via the Internet, and even that's a stretch already. Even when said vendor can deliver 6 nines, it's pointless if you can only reach them over a network with so many intermediaries between you and the vendor that ANY kind of uptime statement is as reliable as your guess at next week's lotto results.

      So, even if we assumed that a "let's throw a lot of things at the wall and see what sticks" vendor like Google would be able to achieve good uptimes for other than their spying activities, you're still nowhere near reliability unless you have multiple, diverse, dedicated circuits going to their data centre.

      1. Steve Davies 3 Silver badge
        Happy

        Re: Like I said in another article

        Q: Are my Cloud Services running?

        A: Yes sir, your cloud services are running perfectly

        Q: Then why can't I get at them?

        A: We don't know Sir. Please try your network provider... As the operator looks out the window at a JCB, a big hole and several people all scratching their heads and saying 'That fibre optic cable should not be there'.

  7. Anonymous Coward
    Anonymous Coward

    This shows the weakness of "availability zones" from the same provider.

    Better to take one availability zone from Google and one from Amazon (say). You end up paying for the traffic which goes from one to the other; but you get genuine resilience.

    Alternatively, Google should manage their availability zones as if they were completely separate providers on separate networks - in particular with different AS numbers, and different management teams.
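
    Roughly what the first option means at the client end, as a sketch; the endpoints are made up, and a real setup would use health-checked DNS or anycast rather than a polling loop:

      import urllib.request
      import urllib.error

      # Made-up endpoints for the same app on two providers -- illustration only.
      ENDPOINTS = [
          "https://app.gcp.example.com/healthz",
          "https://app.aws.example.com/healthz",
      ]

      def first_healthy(endpoints, timeout=2.0):
          """Return the first endpoint whose health check answers 200, else raise."""
          for url in endpoints:
              try:
                  with urllib.request.urlopen(url, timeout=timeout) as resp:
                      if resp.status == 200:
                          return url
              except (urllib.error.URLError, OSError):
                  continue   # that provider is down or unreachable -- try the next
          raise RuntimeError("no provider reachable; so much for genuine resilience")

      # With the made-up hostnames above this will raise; point it at real health checks.
      print(first_healthy(ENDPOINTS))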

    1. Pascal Monett Silver badge

      You get genuine resilience if you can manage the nightmare of getting two different Cloud providers to handle the same data without bungling things.

      Apparently, Cloud is hard enough as it is with ONE provider. Put another one in the mix and you just might become the poster child for a How Not To Do Cloud article.

      But yeah, in theory redundancy is based on two of a thing.

      1. smartypants

        Redundancy adds complexity, and may not actually work!

        I don't mean to criticise redundancy in principle, but often it turns out not to be all that it's cracked up to be, at least in certain contexts. (We can all agree that putting data in a single place is silly.)

        Imagine that to avoid being hurt by the google failure you changed your architecture to run in parallel on two totally different providers (e.g. Google and AWS). Would it be more reliable? I'd suggest it's not necessarily true.

        Why?

        Well, it involves extra complexity in the design and operation of your system, which in turn makes it more likely that a programming error on your part will bring down the edifice, rather than a problem at either Google or AWS, especially once half of your team has changed since it was all designed.

        And then there's the additional problem of the huge cost of planning to have an infrastructure where all the load can be handled by less than half of the whole lot. That's always a difficult sell, and in practice, over time, this tends to be ignored, meaning that when things do go titsup in one bit, then the whole thing grinds to a halt anyway.

        In practice, the only way to tell if your system is reliable is to regularly break it on purpose and demonstrate that everything still works, and people are naturally reluctant to make breaking things on purpose their job! (Though years ago Netflix invented 'Chaos Monkey', which did precisely that to ensure certain classes of failure could be handled gracefully; there's a rough sketch of the idea below.)

        So in short, it might be more reliable if you just rely on a single provider!
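
        As for the "break it on purpose" bit, here's a rough sketch of the Chaos Monkey idea; the instance names and terminate() call are placeholders, not any real cloud API:

          import random

          # Placeholders only -- a real chaos tool talks to the provider's API and
          # respects opt-outs, schedules and blast-radius limits.
          INSTANCES = ["web-1", "web-2", "web-3", "worker-1", "worker-2"]

          def terminate(instance):
              print(f"terminating {instance} (on purpose)")

          def chaos_round(instances, kill_fraction=0.2, seed=None):
              """Kill a random slice of the fleet; 'reliable' means nobody notices."""
              rng = random.Random(seed)
              victims = rng.sample(instances, max(1, int(len(instances) * kill_fraction)))
              for victim in victims:
                  terminate(victim)
              return victims

          chaos_round(INSTANCES, kill_fraction=0.2, seed=42)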

        1. Nate Amsden

          Re: Redundancy adds complexity, and may not actually work!

          I still see more cases where developers build things that they KNOW are fragile and will fail than cases where they make them robust.

          Cutting corners is just standard practice these days.

    2. TeeCee Gold badge
      Facepalm

      Yes, but the whole point of this "cloud" thing is that it's supposed to have redundancy and a complete lack of any single failure point built in (i.e. "five nines" availability is implicit in the concept we were sold). It should be physically impossible for any human error to break the whole thing.

      Using two clouds to make a reliable cloud is like owning a dog and barking for it. More to the point, you've just added a SPOF in the bit that handles which one your traffic's going to (or maybe we're thinking of hosting that on a third service???).

      Wake me up when somebody comes up with an offering in this area that's actually bloody fit for purpose.

    3. Anonymous Coward
      Anonymous Coward

      "availability zones" from the same provider

      "Better to take one availability zone from Google and one from Amazon (say). You end up paying for the traffic which goes from one to the other; but you get genuine resilience."

      You really don't; you just make sure that multiple big vendors are a risk to your business.

      If you want resilience, run it in house and fail over to AWS.

  8. Anonymous Coward
    Anonymous Coward

    Meh, 18 mins

    It's a shame and all, but still a million times more reliable than our internal company servers

  9. batfastad
    Trollface

    Automate automate automate

    So Google Cloud is following Azure's "Downtime As A Service" approach in automating even downtime now?!
