How a power blip briefly broke GitHub's boxes and tripped it offline

Exactly how a momentary power failure managed to trigger a two-hour GitHub outage has been revealed in full today. The popular and widely used source-code-hosting service fell off the internet last Wednesday, and soon after blamed the downtime on a "brief power disruption at our primary data center" that "caused a cascading …

  1. Reginald Marshall

    A bumpy landing

    "Any landing you can walk away from is a good one." Applied to IT, any disruption which leaves you with your data intact is tolerable. No data loss here (or we'd have heard about it, I'm sure.) Comic distractions of the "keyboard not found, press F1 to continue" variety seem unavoidable in any sufficiently complex setup.

    1. SimonC

      Re: A bumpy landing

      Highly debatable when applied to IT... when you get sued for failing to meet SLAs!

      An example: when I worked in AML, if the verification API went down, loads of companies using it weren't permitted to let people sign up to their sites to spend money. If those people had just seen a £200,000 TV commercial telling them to play slut-roulette and jumped on their tablets to sign up and couldn't, they're probably going to forget and lose enthusiasm to do it later, and the client is going to be furious and demanding £££ compensation.

  2. Will Godfrey Silver badge
    Thumb Up

    Thanks

    Good to see an organisation putting out a detailed appraisal of what happened.

    Now, about that BT outage...

    1. ideapete
      Pint

      Re: Thanks

      Now if they would only DESCRIBE the power outage: surge, dirty power, failure of backup, WHAT?

  3. Tom 64

    Redis

    Very surprised to read that it sounds like they're running this on bare-metal boxes.

    Redis is a very good candidate for stuffing in a VM or container and porting around.

    1. Electron Shepherd

      Re: Redis

      I read it as the physical hosts couldn't see their local hard drives. Having all your VMs on a nice SAN is all well and good, allowing you to vMotion them around the DC as required, but usually the underlying VMware / Xen / Hyper-V host machine is booting off local drives.
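
As a rough illustration of the portability Tom 64 is getting at (and not anything from GitHub's actual setup), here is a minimal redis-py sketch in which clients ask a Sentinel quorum where the current master lives, so it matters far less which physical box, VM or container Redis happens to be running on at the moment. The host names and the "mymaster" service name are hypothetical.

    # Hypothetical example: redis-py's Sentinel support lets clients follow the
    # master across failovers instead of hard-coding a single box.
    from redis.sentinel import Sentinel

    sentinel = Sentinel(
        [("sentinel-1.example", 26379), ("sentinel-2.example", 26379)],
        socket_timeout=0.5,
    )

    master = sentinel.master_for("mymaster", socket_timeout=0.5)   # read/write
    replica = sentinel.slave_for("mymaster", socket_timeout=0.5)   # read-only

    master.set("build:last_green", "abc123")
    # Replication is asynchronous, so a freshly written key may not be visible
    # on a replica immediately.
    print(replica.get("build:last_green"))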

  4. Richard 12 Silver badge

    The bigger question is why they rebooted

    What was the momentary power disruption and why didn't the UPS systems mitigate it?

    Was it a faulty UPS? What was the fault and how can they (and others) spot it in the future?

    Did these 25% of devices not have dual PSUs on different UPS buses, or were these faulty but unnoticed?

    Or was it a deliberate design decision to allow this type of reboot as it should be harmless?

    1. Mikey
      Joke

      Re: The bigger question is why they rebooted

      The more likely reason, when you realise what OS they run, is that it's all obsolescent hardware running on the cheap, because you don't need top-of-the-line when running Linux, right?

      All joking aside, I'm kinda surprised there was no backup power available; you'd think a datacentre would have that as a basic provision. At the very least, have an in-rack UPS unit that would start a graceful shutdown in such an event. But then, hindsight is always late to the planning party, is it not?

  5. Gideon 1

    Single point of failure

    By having all the Redis servers on identical hardware, they all failed in the same way at the same time, which is a kind of single point of failure. No redundancy of design.

    Did they ever test their ability to withstand power failure by interrupting the power to part of the installation to see what happens?

    1. teknopaul

      Re: Single point of failure

      They've obviously never rebooted on that firmware, or they would have noticed, right? The best advice for failover testing I've heard was from Amazon: run three of everything in prod and pull the plug regularly, like daily, or for every new software drop. That way you have confidence your production environment behaves like test. Never do a graceful shutdown, not even in test; it just masks what happens in a real failure scenario.

      The trouble with all that is that you can't test how well it behaves when you spike the electricity; you just can't do that on a daily basis. Respect for rebuilding a Redis cluster in two hours, and for having hardware available to do that on.

      1. phil dude
        Coat

        Re: Single point of failure

        that sounds *brutal*, useful, and a little scary...

        (hugs his nice, warm, UPS)...

        P.
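
A sketch of the "pull the plug, never shut down gracefully" testing teknopaul describes above, shrunk down to local processes so it is self-contained. The service names and PID file paths are invented for the example; a real exercise would hard-kill whole instances (or the power) rather than a single process, but the point is the same: SIGKILL gives the victim no chance to clean up.

    # Illustrative only: pick one instance at random and hard-kill it.
    import os
    import random
    import signal
    from pathlib import Path

    # Hypothetical PID files for three Redis instances (Unix-only, since it
    # relies on SIGKILL).
    PID_FILES = {
        "redis-a": Path("/var/run/redis-a.pid"),
        "redis-b": Path("/var/run/redis-b.pid"),
        "redis-c": Path("/var/run/redis-c.pid"),
    }

    def pull_the_plug() -> str:
        """Hard-kill one instance at random; no graceful shutdown, ever."""
        name, pid_file = random.choice(list(PID_FILES.items()))
        pid = int(pid_file.read_text().strip())
        os.kill(pid, signal.SIGKILL)  # the process gets no chance to clean up
        return name

    if __name__ == "__main__":
        victim = pull_the_plug()
        print(f"Killed {victim}; now check that failover actually happened.")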

  6. TheProf
    Boffin

    boxes?

    Surely it's BOXEN?

    1. Thecowking

      Re: boxes?

      As far as I know, only brethren, children and oxen take the Germanic pluralisation of -(r)en in English.

      Box is not ox.

  7. Tempest
    FAIL

    What? No UPS?

    Who runs a data centre these days without some form of UPS?

    Sounds like a cheap operation.

    One benefit of living in a country which has unscheduled power outages is that protection is the norm. My home standby generator runs on gas - the stuff that is leaking all over California - not petroleum. I also use battery power storage - charged by FREE solar.

    1. Electron Shepherd

      Re: What? No UPS?

      I'm sure they have a UPS. Even if you have a UPS, that doesn't guarantee you won't lose power to one or more racks.

      Maybe the UPS was being serviced, and someone forgot to bypass it first. Perhaps the bypass switch itself was faulty. Possibly there was a fault with the power distribution infrastructure between the UPS and the racks. The list goes on...

    2. jimmyj

      Re: What? No UPS?

      Living in 'the' country (rural) with frequent outages, I soon learned to have a UPS running at about a third of capacity. I also have solar charging a pair of garden-tractor-sized 12 V batteries to back up the backup. What I didn't expect was the actual voltage value: after losing several power supply boxen (!) in my tower, I took my trusty Fluke (!) to the utility socket and measured 131 volts! 'Yikes,' I exclaimed, and found a 6 V transformer which I wired as a 'buck' to supply my IT gear.

      For those unfamiliar: a 'buck' or 'boost' is simply a transformer fed from the mains with its output wired out of phase (or in phase) with the same mains, to subtract (or add) its output from the voltage delivered to the equipment. In my case 125 V solved the problem.
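
For anyone following along with jimmyj's buck/boost explanation, the arithmetic is just series subtraction or addition of the secondary voltage. This sketch assumes the small transformer really contributes its nameplate 6 V at the measured mains level, which is only approximately true in practice.

    # Worked numbers from the comment above: 131 V mains, 6 V secondary.
    MAINS_V = 131.0      # measured at the utility socket
    SECONDARY_V = 6.0    # small transformer wired in series with the mains

    buck_v = MAINS_V - SECONDARY_V    # secondary out of phase with the mains
    boost_v = MAINS_V + SECONDARY_V   # secondary in phase with the mains
    print(f"buck: {buck_v:.0f} V, boost: {boost_v:.0f} V")  # buck: 125 V, boost: 137 V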

  8. John Stoffel

    They needed to use Chaos Monkey from Netflix

    These guys need to use Chaos Monkey from Netflix to do fault injection more often to stress test the infrastructure. They already have horizontal scaling, so this type of testing is a perfect match for them.

    Now, not having the systems boot up because of BIOS problems (crappy BIOS, maybe) ain't fun at all. I wonder if that's an uptime issue, since they said it could be solved by pulling the complete power from the system and then rebooting.

    This is why you try to have multiple motherboards/vendors in your server center, to spread out the pain when stuff like this pops up. It's not simple, it's not cheap, and it's hard to do at times.

    John

    1. Nate Amsden

      Re: They needed to use Chaos Monkey from Netflix

      For most, simply having redundant power is enough. Multiple power feeds failing is extremely rare. I'd bet a lot of their gear was single-feed, and they were betting on rack-level availability, not assuming that a single UPS services multiple racks.

      One company I was at was so cheap that their backend infrastructure of 20 to 25 racks was all covered by a SINGLE UPS. There were other UPSs available, but they didn't pay to even get alternating racks on two different UPSs. Fortunately for them, there was never an outage on the backend gear. The front-end gear was also on a single UPS, and that UPS tripped and went offline on a couple of occasions when someone plugged in something they shouldn't have.

      I've only been hosted at one facility that had both feeds fail at once. I inherited the gear and we moved out within six months (2006). It was an Internap facility in Seattle (they operated, but did not own, it). Three years later that place had a fire and a full 30+ hours of no power.

  9. DropBear

    "Users were served the unicorn-of-fail page"

    OMG! Sparklelord! GitHub reads Doctor McNinja?!?
