Everest outage was caused by split brains

Server farm Everest's blackout on 15 November was caused by a power outage combined with stacked routers each running different software versions. A "reason for outage" document issued by Everest admitted there had been a "loss of connectivity" for clients using IP network services between 0830 and 1030 on 15 November …

  1. Anonymous Coward

    Despite a rigorous testing regime for the generators at Everest, the generators started but failed to maintain the load placed upon them.

    I've seen that happen before. The "rigorous test" was to start the diesels every Monday, and then to shut them down again. After years of such cold starts and short runs they were so coked-up inside that when they were needed in anger they started OK, but as soon as they got warmed up they died. Only a full head-off decoke got them back into working order. After that the test was to start & run for 30 mins...

    1. Anonymous Coward

      Even that is not enough

      I have some friends who built and run a hosting facility (not in the UK), which was recently bought by one of the big players.

      Their approach was to test less often - monthly rather than weekly or daily - but to run for up to 6 hours under full load, with multiple switch-overs back and forth.

      The first time they did it, they got a noise abatement notice after the inhabitants of all the apartment blocks within 100m petitioned the city council. They had to install some pretty nifty bespoke mufflers on the generators as a result.

  2. Anonymous Coward

    Aww shoot

    I only clicked on this article because I thought it was about MUMSNET. I want my outrage back!

    1. Buzzword

      Re: Aww shoot

      Same here - I keep reading Memset as Mumsnet. Also, memset() is a function in the standard C library. Either way, not a great name for a hosting company.
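
      For the curious, memset() just fills a block of memory with a single byte value. A trivial example (the buffer name is made up, obviously):

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            char banner[16];
            memset(banner, 'x', sizeof banner - 1);  /* fill the first 15 bytes with 'x' */
            banner[sizeof banner - 1] = '\0';        /* terminate the string             */
            printf("%s\n", banner);                  /* prints xxxxxxxxxxxxxxx           */
            return 0;
        }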

      1. VinceH

        Re: Aww shoot

        Ditto - I read the headline as "Mumsnet outrage was caused by split brains" - and wondered (a) what the latest silly outrage was about, and (b) if the 'split brains' was a reference to some medical issue affecting those outraged.

      2. Uberseehandel

        Re: Aww shoot

        A long time ago, I was told that the name memset() is a nerdy-ish in-joke... but as somebody who wrote non-compatible tape conversion systems in Fortran, I was in no position to say anything.

  3. G2

    basic electrical mistake there: full load =/= start-up load

    "full load" is almost never equal to "everything is starting up at once" load...

    the electrical load when a device starts up has a BIG spike - for a fraction of a second it can reach 500% or more of the regular load during normal operation

    https://en.wikipedia.org/wiki/Inrush_current

    Regular servers usually have a BIOS setting to delay automatic power-on by a random amount of time. The delay is decided when power is applied to the pre-power-on circuits and is picked at random between 0 and 60/120 seconds, depending on the BIOS setting.

    If you have just a few servers then this setting is very much recommended in order to spread the start-up load spikes over a longer period of time.

    For a datacenter though, where a lot of servers start up with random delays but all within the same 60 or 120 seconds of the power being turned on, the combined load spikes from all those servers translate into a very LONG start-up load spike, and the generators should be designed for a constant current load of at least 200-300% of the regular "full" load of the datacenter.
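
    A rough toy model of how the random delay flattens the combined spike (every figure below is invented purely for illustration, not taken from any real kit):

      /* N servers each draw a short inrush spike when they power on, then
         settle to a steady draw. Compare the worst combined draw when they
         all start at once vs. each start delayed by a random 0-120 seconds. */
      #include <stdio.h>
      #include <stdlib.h>

      #define N_SERVERS   500
      #define STEADY_W    300    /* assumed steady draw per server, watts  */
      #define SPIKE_W     1500   /* assumed inrush draw per server (~5x)   */
      #define SPIKE_SECS  2      /* assumed length of the inrush spike     */
      #define WINDOW_SECS 120    /* BIOS random start-delay window         */
      #define TOTAL_SECS  (WINDOW_SECS + SPIKE_SECS)

      static long peak_kw(int max_delay)
      {
          long load[TOTAL_SECS] = {0};
          long peak = 0;

          for (int i = 0; i < N_SERVERS; i++) {
              int start = max_delay ? rand() % max_delay : 0;
              for (int t = start; t < TOTAL_SECS; t++)
                  load[t] += (t < start + SPIKE_SECS) ? SPIKE_W : STEADY_W;
          }
          for (int t = 0; t < TOTAL_SECS; t++)
              if (load[t] > peak)
                  peak = load[t];
          return peak / 1000;
      }

      int main(void)
      {
          srand(42);
          printf("all at once       : peak ~%ld kW\n", peak_kw(0));
          printf("0-120s random     : peak ~%ld kW\n", peak_kw(WINDOW_SECS));
          printf("steady 'full' load: ~%ld kW\n", (long)N_SERVERS * STEADY_W / 1000);
          return 0;
      }

    With everything starting together the peak is simply N_SERVERS x SPIKE_W - five times the steady "full" load with these made-up numbers - whereas with the random delays it should stay fairly close to the steady figure.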

    1. Anonymous Coward

      Re: basic electrical mistake there: full load =/= start-up load

      But this wasn't a start-up load? Everything was running when the generators kicked in (and then failed). After that it was a start-up load. I guess the thing to do would be to manually switch off various feeds and then put them back in one at a time over a period of a few minutes to ensure no major spike.

      1. G2

        Re: basic electrical mistake there: full load =/= start-up load

        @"But this wasn't a start up load"... actually, it was very much one if you think of it from the point of view of the generators. (not from servers' point of view):

        1. a lot of UPSs lose mains power, thus the capacitors/transformers feeding the UPSs from mains are drained of charge.

        2. generators start

        3. UPSs now start getting power from generators => inrush current to UPS's mains circuits

        4. => power spike applied on generators

        if they have lots of UPSs without random power start delays then that translates into a HUGE combined spike on generators.
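
        To put rough (purely illustrative) numbers on it: if each UPS front end pulls, say, 8x its nominal input current for a few mains cycles while its input capacitors recharge, then 40 UPSs all re-energising at the same instant briefly ask the generators for roughly 8x the load they were sized for - a transient lasting only tens of milliseconds, but enough to trip or stall a set sized only for the steady load.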

  4. Dr Who

    A lesson in redundancy

    To have redundant everything you need to know what everything is. In the complex interconnected world in which we operate it's often hard to know what you don't know.

    One of our servers at Memset (who are normally an all-round top-notch provider) was affected by this outage. Anticipating that data centres can never actually have 100% uptime, we replicate our servers to another DC run by a different service provider. That way we can just switch the DNS and hey presto.

    For our DNS we use another normally top-notch provider with loads of DNS servers spread around the world etc. It just so happened that, after ten years of flawless and uninterrupted service, they had a problem at the exact same time as the Memset outage. All their DNS servers were running normally, but their control panel went offline for an hour due to a database glitch - meaning we couldn't switch the DNS to our redundant server.

    As it happened, our Memset server came back very quickly and we didn't need to switch, but still, another lesson learned.

    Any tips on how to mitigate this problem would be much appreciated. DNS secondaries with another provider (or our own) would not have helped in this instance, as the DNS itself was running normally - we just couldn't modify the zone files.

    1. Anonymous Coward

      Re: A lesson in redundancy

      Host your own DNS - use a primary and secondary DNS server, host them in two separate data centre locations, both fronted by resilient network links.

      If you outsource your servers, networks and IP configurations to others, don't be surprised when the cloud - aka 'other people's computers' - fails.

