Google Cloud rolls back changes after 18-hour load balancer brownout

Google Cloud's load balancers have suffered a lengthy connectivity problem. First reported at 00:52 Pacific time on August 30th, the incident is still unresolved as of 19:18. Google's struggled with this one. At 06:00 the company said it had “determined the infrastructure component responsible for the issue and mitigation …

  1. Anonymous Coward

    But, but... it's the cloud

    It's more like 15 minutes of random downtime a week plus at least 3 extended outages a year. Most of them are the kind of screwups that "more instances for more uptime" can't fix. I'm trying hard to see how Google hosting is more cost efficient, easier to maintain, or more reliable than some high-end desktops on the floor with a "Don't touch" sign. I reminisce about the days when IT staff knew how to build systems in a datacenter.

    1. Nate Amsden

      Re: But, but... it's the cloud

      One of my biggest issues was/is that cloud players are always screwing with their stuff. There is very little means for customers to opt out of or postpone changes, and probably 95%+ of the changes are not even communicated in the IaaS space (except after the fact, when there are brownouts etc.). Changes are more often communicated in the SaaS space, at least for the application side of things, though even then it seems to be really rare in SaaS for a customer to have any say in accepting such changes.

      Compare that with more traditional data center stuff, where you basically have power + network links, both of which often have fantastic reliability proven over a decade or more (anything higher up in the stack is managed by the organization). Add to that the fact that network routing and redundant power are far less complex (and far more mature technology) than an entire cloud application stack, which sits on top of networking and power as well.

      Data centers and network carriers (the good ones anyway) are usually very verbose in communicating any maintenance or changes on their systems. The carrier that the organization I work for uses even communicates things such as events that would trigger BGP route reconvergence. Not that we really care about short periods of time when routing may not be optimal; it's not that critical. But the attention to detail is good.

  2. ST Silver badge
    Mushroom

    is QA still a thing at Google?

    Or do Git commits get pushed straight into production?

    1. RyokuMas Silver badge
      Joke

      Re: is QA still a thing at Google?

      It was - but QA involves criticising Google's systems, so...

  3. teknopaul Silver badge

    for sure

    The fact it took engineers 18 hours to work out the command to roll back a submit implies it was git. ;)

  4. David Roberts
    Windows

    In the middle of a brown out at the moment

    Sorry if this is TMI.

  5. TaabuTheCat

    Too big, too complex

    I'm starting to get the feeling these systems are getting too big and too complex to be managed for uptime. Not only Google and AWS, but look at O365. They just had a SharePoint incident where file sharing stopped working (yeah, ironic) and it took them 10 days to patch all those who were affected. Think about that: if you were at the end of the queue for the fix, you lost file sharing capability for 10 days! And the whole thing was caused by a bug in an "upgrade".

    Perhaps they need to stop working on "upgrades" for a while and start working on rapid rollback. I still don't get how MS can upgrade (break) everything at once, and then need 10 days to unwind the changes. Something doesn't make sense.

  6. Claptrap314 Silver badge

    Former Google SRE here--not on GCP.

    Cloud operations has never been about more servers = more stability. Or even ==. Cloud operations give you the ability to improve stability, but this requires that the entire stack be engineered to operate in this fashion.

    1) Datacenters can be taken down for routine or emergency service. This can be at the power or water distribution level (although I only observed it at the power level). If you are not in multiple regions, you are NOT HA. If you are in multiple data centers, but they are on the same maintenance schedule, you are NOT HA.

    2) OS & firmware upgrades on the underlying hardware, both routine and emergency, happen. If you cannot handle 5% of your servers being down (in addition to a couple of datacenters being down), you are NOT HA.

    3) Changes happen. Tracing problems back in a stack as tall as exists at a cloud is not easy, because the entire point of separating the layers is that coordination is not required.

    4) I'm not sure that AWS qualifies as having a mature offering. Google never claimed that they would be mature out of the gate. There are major differences in providing external services to internal, and Google appears to have been honest that it is going to take time to match AWS's maturity. The monthly fails during 2016 were certainly undesirable, but I don't even know if I would consider them embarrassing _at the time_. Now would be embarrassing. But we're not seeing that failure rate.
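As a rough illustration of points 1) and 2) above, the headroom rule can be written as a quick capacity check. This is only a sketch under stated assumptions: the function names, the default of two datacenters down, and the choice to treat the largest datacenters as the ones lost are all hypothetical and not taken from any Google or GCP tooling.

```python
def surviving_servers(servers_per_dc, dcs_down=2, server_loss_pct=5):
    """Servers left after losing whole datacenters plus a share of the rest.

    Assumption: the largest datacenters are the ones taken down (worst case),
    and server_loss_pct percent of every remaining datacenter is also offline,
    rounded up to whole servers.
    """
    ordered = sorted(servers_per_dc)
    keep = ordered[:max(len(ordered) - dcs_down, 0)]
    total = 0
    for n in keep:
        lost = -(-n * server_loss_pct // 100)  # integer ceiling division
        total += n - lost
    return total


def is_ha(servers_per_dc, servers_needed, dcs_down=2, server_loss_pct=5):
    """True if peak load still fits on what survives the combined failures."""
    return surviving_servers(servers_per_dc, dcs_down, server_loss_pct) >= servers_needed


if __name__ == "__main__":
    fleet = [100, 100, 100, 100]  # hypothetical fleet: four equal datacenters
    print(surviving_servers(fleet))
    print(is_ha(fleet, servers_needed=190))
```

A deployment in only two datacenters fails this check for any non-zero load, which matches the point that a single-region or shared-maintenance setup is not HA.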

    1. Ken Moorhouse Silver badge

      IDKWTFSOTAA

      Valuable comment by the look of it, but can you please explain your T&TLAs?

      SRE

      GCP

      HA

  7. handleoclast

    Counter

    Time to reset the “Days since last self-inflicted cloud crash” counter to zero, guys.

    Hmmmm, do I detect a subtle reference to this?

    1. Anonymous Coward

      Re: Counter

      I thought it was "Days since last tornado" in a trailer park. (Can't remember: Simpsons? King of the Hill?)


Biting the hand that feeds IT © 1998–2019