"a gradual rollout process for all new releases" - now is that or isn't that DevOps?
Google cloud wobbles as workers patch wrong routers
Add another SNAFU to the long list of Google cloud wobbles caused by human error: this time the Alphabet subsidiary patched the wrong routers. The wobble wasn't a big one: it lasted just 46 minutes and only hit Google Compute Engine instances in the us-central1-f zone. Of course it wasn't minor if yours was one of the …
COMMENTS
-
Tuesday 1st March 2016 04:54 GMT frank ly
Refinements
"Google network engineers are refining configuration management policies to enforce isolated changes which are specific to the various switch types in the network."
They were told to 'update the routers with this patch'. Next time, they'll be told to 'update the routers as appropriate'.
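The guardrail the quoted fix describes, "isolated changes which are specific to the various switch types", can be sketched as a pre-flight check: refuse to push a patch unless every device in the change list matches the type the patch was written for. This is a hypothetical illustration; the inventory fields and device names are invented, not Google's actual tooling.

```python
def validate_rollout(patch_target_type, devices):
    """Reject a change list that mixes switch types.

    Raises ValueError if any device's type differs from the type the
    patch targets, so 'update the routers' can't silently hit the
    wrong kind of box.
    """
    mismatched = [d["name"] for d in devices if d["type"] != patch_target_type]
    if mismatched:
        raise ValueError(
            f"patch targets {patch_target_type!r} but change list "
            f"also includes: {mismatched}"
        )
    return devices

# Invented inventory: one core router and one top-of-rack switch.
inventory = [
    {"name": "core-rtr-1", "type": "core-router"},
    {"name": "tor-sw-7", "type": "top-of-rack"},
]

# A top-of-rack patch aimed at this mixed list fails loudly instead of
# quietly patching the wrong routers.
```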
-
Tuesday 1st March 2016 14:39 GMT Peter Simpson 1
Re: Refinements
You'd think...
Google might have automated tools to change router configurations remotely, based on some kind of master network configuration tool.
Apparently, you'd be incorrect. That's probably intentional. Having a "man in the loop" might be a way to prevent automated chaos. Perhaps manual chaos is easier to fix.
-
-
Tuesday 1st March 2016 07:03 GMT Anonymous Coward
it took them 45 minutes...
... to figure out that a network change broke the network?
Surely a quick check of the change list would show that network changes were happening in the DC, and that one of them had likely gone bad or been done wrong.
When looking for the arsonist, it's usually the guy stood there with petrol and matches.
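The commenter's triage step, cross-referencing the incident window against the change log before hunting for exotic causes, can be sketched in a few lines. The timestamps and log entries below are invented for illustration.

```python
from datetime import datetime

def changes_during(change_log, start, end):
    """Return the changes whose timestamp falls inside the incident window."""
    return [c for c in change_log if start <= c["when"] <= end]

# Invented change log: one router push near the outage, one unrelated change.
log = [
    {"when": datetime(2016, 2, 23, 6, 40), "what": "router config push, us-central1-f"},
    {"when": datetime(2016, 2, 22, 18, 0), "what": "storage firmware upgrade"},
]

# Check a window around the outage: the router push is the obvious suspect,
# i.e. the guy stood there with petrol and matches.
suspects = changes_during(log, datetime(2016, 2, 23, 6, 30),
                          datetime(2016, 2, 23, 7, 30))
```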
-
Tuesday 1st March 2016 09:31 GMT Charlie Clark
It could well be that rival clouds aren't as forthcoming with reports of messes like this, and that the stream of SNAFUs Google reports is a sign of commendable openness and transparency.
Or they could be signs of immature processes.
This whole article oozes snide but only really has insinuation to back it up. I'm not a Google fan but it seems to me that they have pretty mature processes, particularly when it comes to disaster recovery, where it really counts. Being prepared to go public with the procedural details without pointing the finger: "we fucked up and this is why…" is one of the best ways to underline to employees how important their work is.
Status feeds are one thing, but how many complete outages has Google had this year? And Azure and Amazon?
-
Tuesday 1st March 2016 10:24 GMT Bob H
'Disaster recovery' is an overused phrase. In my view, DR is like insurance: it shouldn't be needed, but you have it just in case. If your day-to-day processes need DR, you are doing it wrong. You should have good processes with appropriate monitoring and roll-back; in this case they didn't seem to have that.
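The "monitoring and roll-back" process the comment calls for is essentially a canary rollout: patch one device at a time, check health, and revert everything the moment a check fails. A minimal sketch, with the apply/rollback/health-check hooks as hypothetical stand-ins for real tooling:

```python
def patch_with_rollback(devices, apply, rollback, health_check):
    """Apply a change device by device; on a failed health check,
    undo every change made so far and stop the rollout.

    Returns (success, devices_that_were_patched).
    """
    patched = []
    for device in devices:
        apply(device)
        patched.append(device)
        if not health_check(device):
            # Monitoring caught a problem: roll back in reverse order.
            for d in reversed(patched):
                rollback(d)
            return False, patched
    return True, patched
```

The key design point is that the rollback path is exercised automatically, not left as a manual runbook step for 46 minutes into the outage.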
-
-
Tuesday 1st March 2016 13:55 GMT Picky
Easy to do - especially when the boss is involved
About 12 years ago I did IT for a small university. We had a massive, new, £2M+ Cisco network installed. I emailed the project manager with some patch changes, specifically Floor 5 Box 1, etc. When the whole of my postgrad department fell apart, I discovered that he had gone to Floor 2 Box 1 to repatch.
What made it worse was that he used an Access database - and didn't have a backup/printout of the original patches he had fucked up. I left a year later.
-
Tuesday 1st March 2016 16:18 GMT Dan Wilkie
Re: Easy to do - especially when the boss is involved
I feel your pain. I had to repatch a satellite office, out of hours and only in a limited slot. The cabling was, well frankly it looked like a nest of snakes had exploded in the cabinet.
The only way we could get the repatching done was to cut all the cables, pull them out, and repatch fresh (we're talking 20m patch cables to go from a switch to the adjacent patch panel, looped around the cabinet 16 times and a mixture of telephone and network patches to boot).
Whoever had patched it in the first place... Anyway - knowing there was a limited time, we went with the safe and sure method: we went round the day before and mapped every single port to its endpoint, recording it in an Excel sheet (which we printed two copies of; one stayed in my drawer until after the repatching and the other came with us).
By the end of it we were able to pull four 48-port switches out of the rack that were no longer required - went from 300-odd connections down to 100-ish.
I can't remember where I was going with this.
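The survey step in the story above, map every port to its endpoint before you cut anything, amounts to producing a file you can print twice and diff afterwards. A small sketch using Python's stdlib `csv` module; the port names and services are invented.

```python
import csv
import io

# Invented survey data: patch-panel port -> where it actually terminates.
port_map = [
    {"panel_port": "P1-01", "endpoint": "sw1/gi0/1", "service": "network"},
    {"panel_port": "P1-02", "endpoint": "pbx/ext-204", "service": "telephone"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["panel_port", "endpoint", "service"])
writer.writeheader()
writer.writerows(port_map)

# Two printed copies: one for the drawer, one for the cabinet.
survey = buf.getvalue()
```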
-
-
Tuesday 1st March 2016 16:17 GMT KeithR
"This whole site oozes snide. If it causes you offence when it's directed at Google then I guess there must be a little bit of Google love there..."
Naaah. Snide makes the world go round, but in this case there's a damn sight more snide than is actually warranted.
Snide for snide's sake is just lazy.
-
Tuesday 1st March 2016 22:05 GMT Crazy Operations Guy
Never change production equipment
The beauty of cloud/virtualization seems to escape Google. The proper way to work on something like this would be to spin up a new DC with the changes needed, then slowly move stuff from another DC to it. Once everything has moved, you perform the changes on the now-empty DC and, when done, move the machines over from yet another DC. This keeps the machines fresh, and timetables can be shifted without affecting production.
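The drain-and-patch rotation the comment describes can be sketched as a loop: keep one spare, already-patched zone, migrate each live zone's workloads onto the spare, patch the now-empty zone, and let it become the next spare. Zone names and the patch hook are hypothetical, not any real provider's API.

```python
def drain_and_patch(zones, workloads, apply_patch):
    """Rotate a patch through zones so no zone is ever changed
    while it still serves traffic.

    zones[0] is a freshly built spare that already has the change;
    each workload is a dict with a mutable "zone" field.
    """
    spare = zones[0]
    for zone in zones[1:]:
        # Drain: live-migrate every workload off the zone to be patched.
        for w in workloads:
            if w["zone"] == zone:
                w["zone"] = spare
        # The zone is now empty, so the change can't hit production.
        apply_patch(zone)
        # The freshly patched zone becomes the spare for the next round.
        spare = zone
    return workloads
```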
-
Wednesday 2nd March 2016 14:38 GMT f00bar
Hmm, not sure about measuring providers based on their status feed.
Have to say this seems a bit unfair on Google. We use them and several of their competitors, and regularly note that we are told about incidents that aren't actually affecting us, at a ratio of about 10:1, and we have never seen a service-impacting incident go undocumented.
With other providers, the ratio seems to be the other way around and we only see proactive information from the provider about a subset of incidents that do affect us.
Google aren't perfect, but judging providers on the volume of their own incident reports just penalises the better-developed folks with good processes and proactive incident reporting, IMO.