How a power blip briefly broke GitHub's boxes and tripped it offline

Exactly how a momentary power failure managed to trigger a two-hour GitHub outage has been revealed in full today. The popular and widely used source-code-hosting service fell off the internet last Wednesday, and soon after blamed the downtime on a "brief power disruption at our primary data center" that "caused a cascading …

  1. Reginald Marshall

    A bumpy landing

    "Any landing you can walk away from is a good one." Applied to IT, any disruption which leaves you with your data intact is tolerable. No data loss here (or we'd have heard about it, I'm sure.) Comic distractions of the "keyboard not found, press F1 to continue" variety seem unavoidable in any sufficiently complex setup.

    1. SimonC

      Re: A bumpy landing

      Highly debatable when applied to IT... when you get sued for failing to meet SLAs!

      An example: when I worked in AML, if the verification API went down, loads of companies using it weren't permitted to let people sign up to their sites to spend money. If those people had just seen a £200,000 TV commercial telling them to play slut-roulette and jumped on their tablets to sign up and couldn't, they're probably going to forget and lose enthusiasm to do it later, and the client is going to be furious and demanding £££ compensation.

  2. Will Godfrey Silver badge
    Thumb Up

    Thanks

    Good to see an organisation putting out a detailed appraisal of what happened.

    Now, about that BT outage...

    1. ideapete
      Pint

      Re: Thanks

      Now if they would only DESCRIBE the power outage: surge, dirty power, failure of backup, WHAT?

  3. Tom 64

    Redis

    Very surprised to read that it sounds like they're running this on bare-metal boxes.

    Redis is a very good candidate for stuffing in a VM or container and porting around.

    1. Electron Shepherd

      Re: Redis

      I read it as the physical hosts couldn't see their local hard drives. Having all your VMs on a nice SAN is all well and good, allowing you to vMotion them around the DC as required, but usually the underlying VMware / Xen / Hyper-V host machine is booting off local drives.
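
As a rough illustration of the portability Tom 64 is getting at (and not anything from GitHub's actual setup), here is a minimal redis-py sketch in which clients ask a Sentinel quorum where the current master lives, so it matters far less which physical box, VM or container Redis happens to be running on at the moment. The host names and the "mymaster" service name are hypothetical.

    # Hypothetical example: redis-py's Sentinel support lets clients follow the
    # master across failovers instead of hard-coding a single box.
    from redis.sentinel import Sentinel

    sentinel = Sentinel(
        [("sentinel-1.example", 26379), ("sentinel-2.example", 26379)],
        socket_timeout=0.5,
    )

    master = sentinel.master_for("mymaster", socket_timeout=0.5)   # read/write
    replica = sentinel.slave_for("mymaster", socket_timeout=0.5)   # read-only

    master.set("build:last_green", "abc123")
    # Replication is asynchronous, so a freshly written key may not be visible
    # on a replica immediately.
    print(replica.get("build:last_green"))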

  4. Richard 12 Silver badge

    The bigger question is why they rebooted

    What was the momentary power disruption and why didn't the UPS systems mitigate it?

    Was it a faulty UPS? What was the fault and how can they (and others) spot it in the future?

    Did these 25% of devices not have dual PSUs on different UPS buses, or were these faulty but unnoticed?

    Or was it a deliberate design decision to allow this type of reboot as it should be harmless?

    1. Mikey
      Joke

      Re: The bigger question is why they rebooted

      The more likely reason, when you realise what OS they run, is that it's all obsolescent hardware running on the cheap, because you don't need top-of-the-line when running Linux, right?

      All joking aside, I'm kinda surprised there was no backup power available; you'd think a datacentre would have that as a basic provision. At the very least, have an in-rack UPS unit that would start a graceful shutdown in such an event. But then, hindsight is always late to the planning party, is it not?

  5. Gideon 1

    Single point of failure

    By having all the Redis servers on identical hardware, they all failed in the same way at the same time, which is a kind of single point of failure. No redundancy of design.

    Did they ever test their ability to withstand power failure by interrupting the power to part of the installation to see what happens?

    1. teknopaul

      Re: Single point of failure

      They've obviously never rebooted on that firmware, or they would have noticed, right? The best advice for failover testing I've heard was from Amazon: run three of everything in prod and pull the plug regularly, like daily, or for every new software drop. That way you have confidence your production environment behaves like test. Never do a graceful shutdown, not even in test; it just masks what happens in a real failure scenario.

      The trouble with all that is that you can't test how well it behaves when you spike the electricity; you just can't do that on a daily basis. Respect for rebuilding a Redis cluster in two hours, and for having hardware available to do that on.

      1. phil dude
        Coat

        Re: Single point of failure

        that sounds *brutal*, useful, and a little scary...

        (hugs his nice, warm, UPS)...

        P.
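
A sketch of the "pull the plug, never shut down gracefully" testing teknopaul describes above, shrunk down to local processes so it is self-contained. The service names and PID file paths are invented for the example; a real exercise would hard-kill whole instances (or the power) rather than a single process, but the point is the same: SIGKILL gives the victim no chance to clean up.

    # Illustrative only: pick one instance at random and hard-kill it.
    import os
    import random
    import signal
    from pathlib import Path

    # Hypothetical PID files for three Redis instances (Unix-only, since it
    # relies on SIGKILL).
    PID_FILES = {
        "redis-a": Path("/var/run/redis-a.pid"),
        "redis-b": Path("/var/run/redis-b.pid"),
        "redis-c": Path("/var/run/redis-c.pid"),
    }

    def pull_the_plug() -> str:
        """Hard-kill one instance at random; no graceful shutdown, ever."""
        name, pid_file = random.choice(list(PID_FILES.items()))
        pid = int(pid_file.read_text().strip())
        os.kill(pid, signal.SIGKILL)  # the process gets no chance to clean up
        return name

    if __name__ == "__main__":
        victim = pull_the_plug()
        print(f"Killed {victim}; now check that failover actually happened.")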

  6. TheProf
    Boffin

    boxes?

    Surely it's BOXEN?

    1. Thecowking

      Re: boxes?

      As far as I know, only brethren, children and oxen take the Germanic pluralisation of -(r)en in English.

      Box is not ox.

  7. Tempest
    FAIL

    What? No UPS?

    Who runs a data centre these days without some form of UPS?

    Sounds like a cheap operation.

    One benefit of living in a country which has unscheduled power outages is that protection is the norm. My home standby generator runs on gas - the stuff that is leaking all over California - not petroleum. I also use battery power storage - charged by FREE solar.

    1. Electron Shepherd

      Re: What? No UPS?

      I'm sure they have a UPS. Even if you have a UPS, that doesn't guarantee you won't lose power to one or more racks.

      Maybe the UPS was being serviced, and someone forgot to bypass it first. Perhaps the bypass switch itself was faulty. Possibly there was a fault with the power distribution infrastructure between the UPS and the racks. The list goes on...

    2. jimmyj

      Re: What? No UPS?

      Living in 'the' country (rural) with frequent outages, I soon learned to have a UPS running at about a third of capacity. I also have solar charging a pair of garden-tractor-sized 12 V batteries to back up the backup. What I didn't expect was the actual voltage value: after losing several power supply boxen (!) in my tower, I took my trusty Fluke (!) to the utility socket and measured 131 volts! 'Yikes,' I exclaimed, and found a 6 V transformer which I wired as a 'buck' to supply my IT gear.

      For those unfamiliar: a 'buck' or 'boost' is simply a transformer fed from the mains with its output wired out of phase (or in phase) with the same mains, to subtract (or add) its output from the voltage delivered to the equipment. In my case 125 V solved the problem.
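
For anyone following along with jimmyj's buck/boost explanation, the arithmetic is just series subtraction or addition of the secondary voltage. This sketch assumes the small transformer really contributes its nameplate 6 V at the measured mains level, which is only approximately true in practice.

    # Worked numbers from the comment above: 131 V mains, 6 V secondary.
    MAINS_V = 131.0      # measured at the utility socket
    SECONDARY_V = 6.0    # small transformer wired in series with the mains

    buck_v = MAINS_V - SECONDARY_V    # secondary out of phase with the mains
    boost_v = MAINS_V + SECONDARY_V   # secondary in phase with the mains
    print(f"buck: {buck_v:.0f} V, boost: {boost_v:.0f} V")  # buck: 125 V, boost: 137 V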

  8. John Stoffel

    They needed to use Chaos Monkey from Netflix

    These guys need to use Chaos Monkey from Netflix to do fault injection more often to stress test the infrastructure. They already have horizontal scaling, so this type of testing is a perfect match for them.

    Now, not having the systems boot up because of BIOS problems (crappy BIOS, maybe) ain't fun at all. I wonder if that's an uptime issue, since they said it could be solved by pulling the complete power from the system and then rebooting.

    This is why you try to have multiple motherboards/vendors in your server center, to spread out the pain when stuff like this pops up. It's not simple, it's not cheap, and it's hard to do at times.

    John

    1. Nate Amsden

      Re: They needed to use Chaos Monkey from Netflix

      For most, simply having redundant power is enough. Multiple power feeds failing is extremely rare. I'd bet a lot of their gear was single-feed, and they were betting on rack-level availability, not assuming that a single UPS services multiple racks.

      One company I was at was so cheap that their backend infrastructure of 20 to 25 racks was all covered by a SINGLE UPS. There were other UPSs available, but they didn't pay to even get alternating racks on two different UPSs. Fortunately for them, there was never an outage on the backend gear. The front-end gear was also on a single UPS, and that UPS tripped and went offline on a couple of occasions when someone plugged in something they shouldn't have.

      I've only been hosted at one facility that had both feeds fail at once. I inherited the gear and we moved out within six months (2006). It was an Internap facility in Seattle (they operated, but did not own, it). Three years later that place had a fire and a full 30+ hours of no power.

  9. DropBear

    "Users were served the unicorn-of-fail page"

    OMG! Sparklelord! GitHub reads Doctor McNinja?!?
