Google cloud wobbles as workers patch wrong routers

Add another SNAFU to the long list of Google cloud wobbles caused by human error: this time the Alphabet subsidiary decided to patch the wrong routers. The wobble wasn't a big one: it lasted just 46 minutes and only hit Google Compute Engine instances in the us-central1-f zone. Of course it wasn't minor if yours was one of the …

  1. allthecoolshortnamesweretaken

    "a gradual rollout process for all new releases" - now is that or isn't that DevOps?

    1. Anonymous Coward
    2. Simon Sharwood, Reg APAC Editor (Written by Reg staff)

      I thought DevOps was "yesterday afternoon's hasty revision to code nugget X goes live everywhere ..... NOW!"

      1. Lysenko

        Precisely

        ... but given your position you might want to keep an eye out for black helicopters from the Advertorial department if you go voicing heresy like that! The point of this article is clearly the agile MTTR demonstrated!

  2. frank ly

    Refinements

    "Google network engineers are refining configuration management policies to enforce isolated changes which are specific to the various switch types in the network."

    They were told to 'update the routers with this patch'. Next time, they'll be told to 'update the routers as appropriate'.

    1. Peter Simpson 1
      Happy

      Re: Refinements

      You'd think...

      Google might have automated tools to change router configurations remotely, based on some kind of master network configuration tool.

      Apparently, you'd be incorrect. That's probably intentional. Having a "man in the loop" might be a way to prevent automated chaos. Perhaps manual chaos is easier to fix.
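
(For illustration only, a minimal Python sketch of what "isolated changes which are specific to the various switch types" could look like as a pre-flight gate. The inventory, model names and apply_patch step are assumptions for the sketch, not Google's actual tooling.)

    # Hypothetical pre-flight check: refuse to push a patch to any device
    # whose model isn't the type the change was written for.
    PATCH_TARGET_MODEL = "edge-router-x42"   # assumed model this patch targets

    inventory = [
        {"name": "rtr-usc1f-01", "model": "edge-router-x42"},
        {"name": "sw-usc1f-07",  "model": "tor-switch-a9"},  # the "wrong" kit
    ]

    def devices_for_patch(devices, target_model):
        """Keep only the kit the change was actually written for."""
        return [d for d in devices if d["model"] == target_model]

    for device in devices_for_patch(inventory, PATCH_TARGET_MODEL):
        print(f"would patch {device['name']} ({device['model']})")
        # apply_patch(device)  # hypothetical push step, now gated by device type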

  3. Ken Moorhouse Silver badge

    Post-It(R) Note Failure

    The Post-It(R) Notes they had stuck on the devices to be modified had fallen off, and were stuck back on to the wrong equipment.

    Preventative advisory: Use Sellotape(R) to secure Post-It(R) note to device.

  4. Anonymous Coward

    it took them 45 minutes...

    ... to figure out that a network change broke the network?

    Surely a quick check of the change list would have shown that network changes were happening in the DC, and that they had likely gone bad or been done wrong.

    When looking for the arsonist, it's usually the guy stood there with petrol and matches.

    1. TeeCee Gold badge
      Facepalm

      Re: it took them 45 minutes...

      Yup! Rule 1 of problem diagnosis: What changed recently?

      Having said that, the 45 minutes was probably chewed up in figuring out that the change had also been applied to the wrong kit...

    2. Bob H

      Re: it took them 45 minutes...

      I remember that when things went off-air at a previous job, it was standard procedure to yell "Nigel!!!!" into the racks. It was probably him who had broken something.

  5. Charlie Clark Silver badge
    Stop

    It could well be that rival clouds aren't as forthcoming with reports of messes like this, and that the stream of SNAFUs Google reports is a sign of commendable openness and transparency.

    Or they could be signs of immature processes.

    This whole article oozes snide, but only really has insinuation to back it up. I'm not a Google fan, but it seems to me that they have pretty mature processes, particularly when it comes to disaster recovery, where it really counts. Being prepared to go public with the procedural details without pointing the finger ("we fucked up and this is why…") is one of the best ways to underline to employees how important their work is.

    Status feeds are one thing, but how many complete outages of Google have there been this year? And of Azure and Amazon?

    1. Bob H

      'Disaster recovery' is an overused phrase. In my view DR is like insurance: it shouldn't be needed, but you have it just in case. If your day-to-day processes need DR then you are doing it wrong. You should have good processes with appropriate monitoring and roll-back; in this case they didn't seem to have that.

      1. Adam 52 Silver badge

        I take the opposite view. If your architecture doesn't have DR built in and in use every day then chances are that it'll fail when you need it.

    2. sabroni Silver badge

      re: This whole article oozes snide

      This whole site oozes snide. If it causes you offence when it's directed at Google then I guess there must be a little bit of Google love there....

  6. Picky
    WTF?

    Easy to do - especially when the boss is involved

    About 12 years ago I did IT for a small university. We had a massive, new, £2M+ Cisco network installed. I emailed the project manager with some patch changes - specifically Floor 5 Box 1 etc, etc. When the whole of my postgrad department fell apart, I discovered that he had gone to Floor 2 Box 1 to repatch.

    What made it worse was that he used an Access database - and didn't have a backup/printout of the original patches he had fucked up. I left a year later.

    1. Dan Wilkie

      Re: Easy to do - especially when the boss is involved

      I feel your pain. I had to repatch a satellite office, out of hours and only in a limited slot. The cabling was, well, frankly it looked like a nest of snakes had exploded in the cabinet.

      The only way we could get the repatching done was to cut all the cables, pull them out, and repatch fresh (we're talking 20m patch cables to go from a switch to the adjacent patch panel, looped around the cabinet 16 times and a mixture of telephone and network patches to boot).

      Whoever had patched it in the first place... Anyway - knowing there was a limited time, we went with the safe and sure method: we went round the day before and mapped every single port to its end point and recorded it on an Excel sheet (which we printed two copies of; one stayed in my drawer till after the repatching and the other came with us).

      By the end of it we were able to pull four 48-port switches out of the rack that were no longer required - went from 300-odd connections down to 100-ish.

      I can't remember where I was going with this.

  7. KeithR

    "This whole site oozes snide. If it causes you offence when it's directed at Google then I guess there must be a little bit of Google love there..."

    Naaah. Snide makes the world go round, but in this case there's a damn' sight more snide than is actually warranted.

    Snide for snide's sake is just lazy.

  8. Crazy Operations Guy

    Never change production equipment

    The beauty of cloud/virtualization seems to escape Google. The proper way of working on something like this would have been to spin up a new DC with the changes needed, then slowly move stuff from another DC to it. Once everything is moved, you perform the changes on the now-empty DC, and once that's done, move the machines over from the next DC. This allows for keeping the machines fresh, and timetables can be shifted without affecting production.
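
(A minimal sketch, with assumed DC names and placeholder helpers, of the drain-then-patch rotation described above; it illustrates the commenter's suggestion, not how Google actually operates.)

    def migrate_workloads(src, dst):
        # placeholder for live-migrating instances between data centres
        print(f"migrating workloads: {src} -> {dst}")

    def apply_changes(dc):
        # placeholder for applying the switch/router changes to an empty DC
        print(f"patching now-empty DC: {dc}")

    def rolling_patch(dcs, spare):
        """Rotate workloads through a spare, already-updated DC so changes
        only ever land on kit carrying no production traffic."""
        empty = spare                  # spun up with the changes already applied
        for dc in dcs:
            migrate_workloads(src=dc, dst=empty)
            apply_changes(dc)          # dc is empty now, safe to change
            empty = dc                 # the freshly patched DC becomes the next spare

    rolling_patch(["dc-a", "dc-b", "dc-c"], spare="dc-new")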

  9. f00bar

    Hmm, not sure about measuring providers based on their status feed.

    Have to say this seems a bit unfair on Google. We use them and several of their competitors, and we're regularly told about incidents that aren't actually affecting us - at a ratio of about 10:1 - and have never seen a service-impacting incident go undocumented.

    With other providers, the ratio seems to be the other way around and we only see proactive information from the provider about a subset of incidents that do affect us.

    Google aren't perfect, but judging providers on the volume of their own incident reports just penalises the better-developed folks with good processes and proactive incident reporting, IMO.
