BA IT systems failure: Uninterruptible Power Supply was interrupted

An IT bod from a data centre consultancy has been fingered as the person responsible for killing wannabe budget airline British Airways' Boadicea House data centre – and an explanation has emerged as to what killed the DC. Earlier this week Alex Cruz, BA's chief exec, said a major "power surge" at 0930 on Saturday 27 May …

  1. Anonymous Coward
    Anonymous Coward

    If it got interrupted...

    Then it's not a UPS. It's a DHL. Dumbass High-risk Liability.

    1. Dan 55 Silver badge
      Holmes

      Re: If it got interrupted...

      First we know everything was slowing down. Maybe they decided they couldn't fix it live and wanted to force a failover.

      The Times suggests a big red button was pressed in the data centre by a contractor and the power went down. That might be when BA claimed there was a power failure.

      That would be the point when the failover failed. Perhaps that is why the CEO said something about there being millions of messages, although he seems to have stopped saying that now, maybe because it suggests there's something wrong with their IT.

      Then I guess they tried to bring the data centre back up, and it looked like the bridge of the Enterprise, shaking about, staff falling to the floor, and smoke everywhere. That would be the power surge.

      I wonder how long it was since power and switching to secondary or backup data centres were tested.

      1. sad_loser
        FAIL

        Re: If it got interrupted...

        The issue is: what was someone doing in the DC, playing with buttons they should not have had access to?

        If your IT workforce is all in house then you don't get contractors wandering around unsupervised.

        1. Anonymous Coward
          Anonymous Coward

          Re: If it got interrupted...

          Electrical installation is rarely done in-house and is quite a specialised task. You'd have to be special to cock this up.

          I can see it now.

          Electrician's apprentice - Whoops, why has it gone quiet?

          Foreman - Quick, restart everything and get out before anyone notices.

          1. Anonymous Coward
            Anonymous Coward

            Re: If it got interrupted...

            Oh, that takes me back some years. New standby generator went in on Friday night/Saturday morning. We're on site, shut down everything, new generator in place, all well, so stuff gets brought back up. 9:30 am, apprentice sparky cut a wire that caused the whole thing to shut down. I got a call from security, arrived at 9:45 am (I stay close) and the place was like the Marie Celeste. Open cans of diesel for the generator, warm cups of tea and not a bloody person on site.

            1. Mark 85 Silver badge

              Re: If it got interrupted...

              Similar to where I worked, except the backup power gen came on, picked up the load and then promptly died. Seems there wasn't much fuel in the tank. Maintenance guy was fingered for it as it was in his job description to fuel the generator and keep it topped off. The lesson was that firing off the generator once a week for 10 minutes to test uses fuel... duh!!!!!!

          2. Version 1.0 Silver badge

            Re: If it got interrupted...

            The electrical generators are almost certainly three-phase generators ... the trick here is connecting the three phases in the right order. I saw a generator test years ago in Oxford fail on the initial installation test after the phases were connected incorrectly. The generator spun up, and as the power switched over there was one heck of a bang and a lot of smoke ... and no more electricity.

            1. GettinSadda

              Re: If it got interrupted...

              > "I saw a generator test years ago in Oxford fail"

              Yeah... I was told of a similar incident, but in a power station. When the station is powered on it needs to sync to the grid before linking as it is vital that not only the frequency matches exactly, but also the phase. The traditional way to do this was with a dial showing the phase-error. Apparently when the plant was down for maintenance they also had cleaners in to give the control room a going over. One of these cleaners discovered that it was possible to unscrew the glass fronts of the dials to clean the glass. In the process they knocked off the needle, and replaced it... 180 degrees out of phase. When the power station was brought back online the generators apparently detached themselves from the floor... with considerable (i.e. demolition-grade) force!

        2. Anonymous Coward
          Anonymous Coward

          Re: If it got interrupted...

          IT staff rarely go near the electrical stuff, it's far too dangerous for that.

          1. Tim Jenkins

            Re: If it got interrupted...

            "IT staff rarely go near the electrical stuff, it's far too dangerous for that."

            As a significant percentage of BOFH plotlines have taught us ; )

          2. Dwarf Silver badge

            Re: If it got interrupted...

            IT staff rarely go near the electrical stuff, it's far too dangerous for that.

            Er, IT staff work on things that are run off the very same electrical stuff. I do hope you are not implying that data centre grade equipment is too dangerous?

            Having said that, even a complete Muppet can hurt themselves with nothing more than a mildly sharp stick or an LR44 button cell battery that looks like a sweetie. Like everything, it's all down to training and understanding the job and its risks. Take a look on YouTube for the chaps who work on live 500kV power lines, or the guys who maintain the bulb at the top of radio towers.

            Everyone should have seen all the warning signs on the way into the facility (that tick the boxes in the H&S assessment) and had the prerequisite training about: safe escape routes if gas discharge occurs (no, not that gas, the other one); the presence of 3-phase power; the presence of UPS power; various classes of laser optics; automated equipment such as tape libraries that can move without warning; and of course the data centre troll who's not been seen for a couple of weeks now. Oh, and of course the ear defenders due to the noise, plus the phone that you can't hear as it's too noisy.

            My point is that data centres are no worse than any other environment - like maintaining a car engine or running a mower in your garden.

            1. Anonymous Coward
              Anonymous Coward

              Re: If it got interrupted...

              Point taken.

              The heavy electrical stuff and switch rooms should be kept under lock and key at all times. In every DC I have been in, the techies were not allowed near these. They were expected to be familiar with all the safety items you mention.

        3. Dan 55 Silver badge

          Re: If it got interrupted...

          Who said he shouldn't have had access to the buttons? He's the electrician.

          I bet he was called in and told to pull the plug as a consequence of the system grinding slowly to a halt yet not switching over to secondary. It can't be a coincidence.

          1. Stoneshop Silver badge
            FAIL

            Re: If it got interrupted...

            I bet he was called in and told to pull the plug as a consequence of the system grinding slowly to a halt yet not switching over to secondary.

            If you really want to force a failover that way, you do so by shutting down the small number of systems that would cause the monitoring system to detect a "critical services in DC1 down, let's switch to DC2". If you can't log in to those systems because of system or network load you connect to their ILO/DRAC/whatever, which is on a separate network, and just kill those machines. If the monitoring system itself has gone gaga because of the problems, you restart that, then pull the rug out from under those essential systems. Or you cut connectivity between DC1 and the outside world (including DC2), triggering DC2 to become live, because that would be a failure mode that the failover should be able to cope with.

            You. Do. Not. Push. The Big. Red. Button. To. Do. So.

            Ever.
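            The controlled version Stoneshop describes can be sketched in a few lines. This is purely illustrative: the service names, and the rule that monitoring promotes DC2 once all critical services in DC1 are down, are invented for the example, not BA's actual setup:

```python
# Hypothetical sketch of forcing a failover the controlled way: stop only
# the services the monitoring system treats as critical in DC1, and let
# the existing failover logic promote DC2. All names are invented.
CRITICAL_SERVICES = ["booking-db", "message-broker", "auth-gateway"]

def stop_service(name: str) -> None:
    # In reality: log in over SSH, or via iLO/DRAC on the separate
    # management network if the OS is too loaded to respond.
    print(f"stopping {name} in DC1")

def monitoring_detects_dc1_down(stopped: set) -> bool:
    # Monitoring declares DC1 down once every critical service is gone.
    return all(s in stopped for s in CRITICAL_SERVICES)

stopped = set()
for svc in CRITICAL_SERVICES:
    stop_service(svc)
    stopped.add(svc)

if monitoring_detects_dc1_down(stopped):
    print("monitoring: critical services in DC1 down, promoting DC2")
```

            The point being: the failover path that is tested and monitored gets exercised, instead of an EPO yanking power from everything at once.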

            1. Nolveys Silver badge

              Re: If it got interrupted...

              You. Do. Not. Push. The Big. Red. Button. To. Do. So.

              Ever.

              I am Groot?

        4. Mark York 3 Silver badge
          Holmes

          Re: If it got interrupted...

          At one facility I worked at & I don't have the full story of why......

          The offshore support insisted on the former plant Sysadmin hitting the plant's BRB. Pictures were sent to the remote guy via email, he confirmed that was the button he wanted pushed & goodness gracious me it was going to be pushed. He was advised again of what it was that would be pushed & the consequences, the plant manager dutifully informed of what was required, what the offshore wanted & what the fallout would be.

          & so it came to pass that the BRB was pushed on the word of the Technically Competent Support representative.

          (Paraphrasing here........)

          "Goodness gracious me, Why your plant disappearing from network?"

          "Because the BRB you insisted was the button you wanted pushing, despite my telling you that it was never to be pushed under pain of death, has just shut down the entire plant."

          I think it took until about 15 minutes before production was due to commence the following day to get everything back up & running.

        5. TheVogon Silver badge

          Re: If it got interrupted...

          "playing with buttons they should not have had access to."

          EPO buttons are easily accessible. That's the whole point of them as an emergency safety feature. Usually near the door in each DC hall...

          1. Stoneshop Silver badge
            Facepalm

            Re: If it got interrupted...

            Usually near the door in each DC hall...

            But not so near that they can be mistaken for a door opener button by the dimmest of dimwits. At chest/shoulder height and at least a few steps away from the door appears to me the most sensible location.

            That said, I've seen a visitor who shouldn't have had access to the computer room in the first place look around, totally fail to see the conveniently located, hip-height blue button at least as large as a BRB next to the exit door, and kill the computer room because a Big Red Button high up the wall and well away from the exit is obviously the one to push to open the door for you.

            Unfortunately, tar, feathers and railroad rails are not common inventory items in today's business environment; rackmount rails are too short and flimsy for carrying a person.

            1. BagOfSpanners

              Re: If it got interrupted...

              In my office the button to open the exit door is right next to the Fire Alarm button (which has no guard). There are also light switches and other visual clutter nearby. At the end of a long tiring day I've sometimes come close to pressing the wrong button.

            2. Anonymous Coward
              Anonymous Coward

              Re: If it got interrupted...

              In my day, one of the first things pointed out to me was the BRB and the circumstances under which it can be used without being fired.

      2. pdh

        Re: If it got interrupted...

        > I wonder how long it was since power and switching to secondary or backup data centres were tested.

        I think it's been about a week.

        1. Anonymous Coward
          Anonymous Coward

          Re: If it got interrupted...

          Somebody please ask BA if they do this.

      3. Paul Hovnanian Silver badge

        Re: If it got interrupted...

        "The Times suggests a big red button"

        These exist in many data centres. But they are not intended for normal, sequenced shutdowns or to initiate failover to backups. They are usually placed near the exits and intended to be hit in the event of a serious problem like a fire. They trip off all sources _Right_Now_ and don't allow time for software to complete backups or mirroring functions.

        *Usually for events that dictate personnel get out immediately.

        1. Anonymous Coward
          Anonymous Coward

          Re: If it got interrupted...

          My last employer had the big red shutdown button conveniently located next to the exit door. Unfortunately just in the position that the door open button would be. One lunchtime a visiting engineer carrying a few boxes of spare parts accidentally pressed it trying to open the door...

          1. TheVogon Silver badge

            Re: If it got interrupted...

            "One lunchtime a visiting engineer carrying a few boxes of spare parts accidentally pressed it trying to open the door..."

            They usually have a plastic cover. And a large label....

            1. John R. Macdonald

              Re: If it got interrupted...

              @TheVogon

              One installation I worked in learned the plastic cover and label thingy the hard way (the hapless third-party support techie who pushed the BRB instead of the door opener was banned permanently from the site to boot).

              1. Vladimir Nicolici

                Re: If it got interrupted...

                I think the best solution to prevent accidental use is to have 2 big red buttons. And to require both to be simultaneously pushed to trigger the power shutdown.

                In fact I saw a UPS product with exactly that feature: two EPO buttons that you needed to push simultaneously to shut it down.
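                As a toy model, that two-button interlock is just an AND gate with a coincidence window; the half-second window and the function name below are invented for illustration (real units do this in hardware, not software):

```python
from typing import Optional

# Toy model of a two-button EPO interlock: the shutdown only fires when
# both buttons are pressed within a short coincidence window, so a single
# stray press (elbow, box of spares, helmet) does nothing.
COINCIDENCE_WINDOW = 0.5  # seconds; invented figure for the sketch

def epo_should_fire(press_a: Optional[float], press_b: Optional[float]) -> bool:
    """press_a and press_b are press timestamps, or None if not pressed."""
    if press_a is None or press_b is None:
        return False
    return abs(press_a - press_b) <= COINCIDENCE_WINDOW
```

                One accidental press returns False; only a deliberate two-handed push inside the window triggers the shutdown.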

                1. Grunt #1

                  Re: If it got interrupted...

                  Fine in principle, but that assumes it is a planned event; it's an EPO for a reason.

                  Better to have someone knowledgeable watching over the contractor.

                  1. Vladimir Nicolici

                    Re: If it got interrupted...

                    Having two buttons doesn't mean a single person can't operate them. You can place the two buttons close enough for that. But it ensures the person operating them really knows what they're doing, and isn't randomly pressing buttons.

                    Of course, if the purpose of having such buttons is to allow even untrained people to shut everything down in case of emergency, it would complicate things. But a large warning message, for example "In case of fire, you need to press these two buttons at the same time!" should take care of that as well.

                    1. Anonymous Coward
                      Anonymous Coward

                      Re: If it got interrupted...

                      The real answer is always train people before letting them in.

                  2. Anonymous Coward
                    Anonymous Coward

                    Re: If it got interrupted...

                    @Grunt

                    But then you need another contractor to do the knowledgeable one's job

            2. Mike Richards

              Re: If it got interrupted...

              They cost extra.

        2. macjules Silver badge

          Re: If it got interrupted...

          "No determination has been made yet regarding the cause of this incident. Any speculation to the contrary is not founded in fact."

          Which would of course not be a problem for the Daily Mail. The Paul Nuttall of the newspaper world.

      4. circusmole

        Re: If it got interrupted...

        The first thing I would ask for would be the fail over test schedule and the resulting reports on how they went (if they did any).

    2. fidodogbreath Silver badge

      Re: If it got interrupted...

      ...then a BOFH and PFY will soon get (yet another) new Boss.

    3. Anonymous Coward
      Anonymous Coward

      Re: If it got interrupted...

      My guess about the initial failure: ATS left in bypass, or failed on power transfer. 15 minutes for someone authorised to manually switch the ATS to a good power source or bypass the failed ATS. UPS/generators not specced (or too much new kit added to the DC) for the full startup load of the whole data centre, which then failed again.

      Some systems probably started up and began a re-sync, then the high load crapped out the generators, returning everything to silence again and leaving the replication in an unknown state when systems were manually restarted over a longer period to manage the initial load.

    4. TheVogon Silver badge

      Re: If it got interrupted...

      Maybe Emergency Power Off resembles the Indian for Light Switch?

    5. Steve Channell
      Facepalm

      A low availability cluster

      My bet would be that a cluster failover was initiated by the power failure, then fail-back was manually triggered, but the primary site failed again with the power surge while the secondary systems were starting. With a manual fail-back an engineer would be needed to fail over again, not just a bargain-basement operator.

  2. Bronek Kozicki Silver badge
    Thumb Up

    wannabe budget airline British Airways

    c'mon guys, that's just ... cheap

    1. Ochib Silver badge
      Joke

      Not as cheap as EasyJet.

    2. PNGuinn

      "wannabe budget airline British Airways

      c'mon guys, that's just ... cheap"

      Sorry? I thought budget airlines were supposed to be cheap?

      Perhaps you mean cruel.

  3. nuked
    Mushroom

    Tech Support: "Have you tried unplugging it and plugging it back in again"

    1. Martin Summers Silver badge

      Yeah but at least that normally actually works!

  4. Whitter
    Headmaster

    ... not due to outsourcing...

    If you don't know why it happened, then I doubt you know it was not due to outsourcing.

    1. iRadiate

      Re: ... not due to outsourcing...

      That's false logic. I personally don't know why the Russians didn't send a man to the moon but I know for definite that they didn't. Now fuck off.

      1. Yet Another Anonymous coward Silver badge

        Re: ... not due to outsourcing...

        Or they did but covered it up .....

        1. Anonymous Coward
          Anonymous Coward

          Re: ... not due to outsourcing...

          Sure, the outage itself may not have been due to outsourcing.

          The extreme time needed to get things running though... that has outsourcing written all over it.

      2. LDS Silver badge

        't know why the Russians didn't send a man'

        Oh well, they were pretty secretive back then, hiding whole cities from maps, and restricting access harshly. Just as BA would like to be today.

        Anyway it was because their Moon rockets couldn't be launched without failing quickly.

      3. Anonymous Coward
        Anonymous Coward

        Re: ... not due to outsourcing...

        You may be right, but you need to be nice.

  5. Anonymous Coward
    Anonymous Coward

    If the people that manage the servers are from TCS and they were unable to recover from the power failure in a reasonable amount of time, then I deduce that they are at fault. Maybe not for the initial outage, but for the subsequent problems. They would also be responsible for the disaster recovery procedures, so the fact that it all failed in the first place also lies with them.

    1. Anonymous Coward
      Anonymous Coward

      Thanks BA

      I've always had trouble explaining what I do for a living to non-IT managers.

      Problem solved as I was able to explain in conversation this week, " I prevent DR and PR disasters like BA ".

      They understood in a flash.

      1. Doctor Syntax Silver badge

        Re: Thanks BA

        "They understood in a flash."

        It won't take them long to forget.

      2. eldakka Silver badge

        Re: Thanks BA

        They understood in a flash.

        Hopefully with no accompanying bangs or blue smoke.

        1. Anonymous Coward
          Anonymous Coward

          Re: Thanks BA

          Sometimes that flash portends bad things: Nuclear Ouchi

    2. Anonymous Coward
      Anonymous Coward

      Reading between the lines

      So we've got an explanation that fits part of the information released so far (i.e. the power issues), but there seem to be large gaps between what we are being told (active-active or active-active-passive DCs that provide fault tolerance) and the two-plus days of outage.

      In addition, saying that UK staff ran UK data centres, AND covering off the actions they took, leaves a lot of questions about who runs the systems that stopped and about the slow or problematic attempts to recover them.

      My question for BA would be "if you had not outsourced, would you have expected to experience an outage, and if it did happen, would it have taken two days to recover?" Given the impact to BA's business, I would hope the answer is no and TCS screwed up...

  6. Potemkine Silver badge

    "Outsourcing is a practice used by different companies to reduce costs by transferring portions of work to outside suppliers rather than completing it internally."

    +

    "the Daily Mail fingered a contractor from CBRE Global Workplace Solutions"

    +

    "Outsourcing not to blame"

    =

    WTF!?!

    1. Anonymous Coward
      Anonymous Coward

      A contractor isn't the same as outsourcing. Electrical work in a DC is unusual and highly specialised. It is not something you keep in house.

      1. jtaylor

        At a previous company, we did have a full-time electrician. When he wasn't fixing something, he was supervising an upgrade or replacement, designing future electrical buildouts, meeting with DC tenants to be sure we didn't overcommit the electrical supply, fixing stuff around the offices, and generally being Very Useful.

        To be fair, we had more than 1 data center, with complex power requirements. Much like BA, come to think....

        I've worked both as a direct employee and as a contractor. A company has much more control over direct employees.

        1. Anonymous Coward
          Anonymous Coward

          "A company has much more control over direct employees."

          But it can sue the ever-living balls off a contractor from an external electrical company (who are adequately indemnified), while all they can do to an employee is fire them.

          1. Anonymous Coward
            Anonymous Coward

            But it can sue the ever-living balls off a contractor from an external electrical company (who are adequately indemnified) ...

            Adequately enough to cough up £150M?

            1. d3vy Silver badge

              "Adequately enough to cough up £150M?"

              I'd have thought so... I'm just a contract developer, but because of the systems I work on I carry £50m professional indemnity insurance.

              1. Anonymous Coward
                Anonymous Coward

                "I'd have thought so"

                Most contractors have £1-£3 million cover. If you screw up worse than that you just take any cash as dividends and let your limited company take the hit.

            2. Anonymous Coward
              Anonymous Coward

              To be fair BA like to pass the buck to insurance companies.

              1. Aitor 1 Silver badge

                Good one

                They can't pass the buck without recognizing the problem was outsourcing....

              2. Doctor Syntax Silver badge

                "BA like to pass the buck to insurance companies"

                Who'll pass it right back come renewal time.

            3. Andrew12

              Very, very unlikely. There will be all kinds of clauses. And people don't understand simple things, like 99% uptime meaning 3.65 days down. And it can all come at once.
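              The arithmetic is worth spelling out; a quick sketch:

```python
# Convert an availability percentage into allowed downtime per year.
def downtime_days(availability_pct: float, days_per_year: float = 365.0) -> float:
    return days_per_year * (1.0 - availability_pct / 100.0)

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% uptime allows {downtime_days(pct):.3f} days down per year")
```

              99% works out to 3.65 days a year, and nothing in a typical SLA says it can't all be taken in one long weekend.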

          2. H H

            Very true and so sad. What a counterproductive way of working.

        2. Captain Badmouth

          At a previous company, we did have a full-time electrician

          Was he an electrician or an electrical engineer?

          World of difference.

      2. Aitor 1 Silver badge

        It is the same

        You should have qualified electricians for your DCs. If you don't you will have problems.

      3. Doctor Syntax Silver badge

        "A contractor isn't the same as outsourcing."

        CBRE are a little vague as to what they do. It seems to come down to "advising" but it certainly doesn't sound like electrical contracting. Maybe they were overseeing electrical contractors. See their site at http://www.cbre.co.uk/uk-en/services/global_corporate_services/data_centre_solutions where they claim "With Data Centres, knowledge is power".

        1. Anonymous Coward
          Anonymous Coward

          "With Data Centres, knowledge is power".

          If there is any.

          1. thomasallan80

            Re: "With Data Centres, knowledge is power".

            I think that's around the wrong way.

            Data Centres: With power there is knowledge, internet connection, web services, etc.

            Data Centres: Without power there is .........

    2. Androgynous Cupboard Silver badge

      You have to factor in the Mail's need to point the blame at someone foreign. Allow for that and it all makes more sense. Surprised they didn't accuse him of eating a swan for good measure.

      1. Brewster's Angle Grinder Silver badge

        "You have to factor in the Mail's need to point the blame at someone foreign."

        Much like the comment threads here.

        1. Anonymous Coward
          Anonymous Coward

          It's not their fault

          They are just trying to make a living. The fault lies with the BA management for allowing this deBAcle.

    3. anothercynic Silver badge

      "The Daily Mail..."

      Enough said.

  7. Anonymous Coward
    Anonymous Coward

    Disaster recovery

    This is surely going to go down in the history books as a lesson in how to turn a crisis into a disaster, on every level from technology design and implementation, to financial impact, public relations and more.

    Do BA/IAG management not realise that what they're saying isn't plausible, not even to people with only a very basic clue [1], and that what BA/IAG management are doing and not doing isn't helping?

    [1] Anyone who is publicly supporting BA/IAG's version of events might want to consider what has been possible in IT (and in PR) for a few decades, and look at the definition of "useful idiot".

    1. Anonymous Coward
      Anonymous Coward

      Re: Disaster recovery

      Do BA/IAG management not realise that what they're saying isn't plausible

      The era of post-truth explanations: the train left the station during the BSE scare, has been rolling accelerando since then, and you can't stop the ride!!

      1. Anonymous Coward
        Anonymous Coward

        Re: Disaster recovery

        +1 to the musician!

    2. Anonymous Coward
      Anonymous Coward

      Re: Disaster recovery

      Not as unlikely as you would think.

      How about the DC that went down because an electrician accidentally knocked the kill switch with his helmet.

      1. Anonymous Coward
        Anonymous Coward

        Re: Disaster recovery

        I assume the thumbs down is because you don't believe it. The incident report says otherwise.

        PS

        I was never very good at irony, but this was the explanation given.

  8. Anonymous Coward
    Anonymous Coward

    So poor old sparky pulled the wrong plug, panicked, hit the wrong button and everything went boom in DC1. All expected scenarios so far.

    Why did DC2 blow up? If they're active-active it's not a failover scenario, it's simply a reduction in capacity. What's the betting they've gone active-active for cost reasons (i.e. less kit needed than for active-passive) but woefully underspecced the whole thing for a failover? Or simply allowed the number of applications to exceed the capacity of one DC alone because "this will never happen"?
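    That sizing question is exactly the kind of thing a one-line check should catch at design time; a minimal sketch, with all figures invented for illustration:

```python
# For a genuinely fault-tolerant active-active pair, either DC must be
# able to carry the ENTIRE load on its own. Figures below are invented.
def survives_dc_loss(total_load: float, single_dc_capacity: float) -> bool:
    return single_dc_capacity >= total_load

total_load = 140.0          # arbitrary units of aggregate app load
single_dc_capacity = 100.0  # each DC sized for ~70% of the pair

if not survives_dc_loss(total_load, single_dc_capacity):
    print("WARNING: losing one DC overloads the survivor")
```

    If the check fails, you don't have fault tolerance; you have two DCs that both fall over when one does.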

    1. Anonymous Coward
      Anonymous Coward

      Disaster Recovery anyone?

      In both cases you need a DRP.

      1. Kevin Johnston

        Re: Disaster Recovery anyone?

        I think from a management perspective it is not a DRP, but an NMP (Not My Problem)

        1. Stoneshop Silver badge
          Thumb Up

          Re: Disaster Recovery anyone?

          it is not a DRP, but an NMP (Not My Problem)

          A SEP, actually.

  9. Anonymous Coward
    Anonymous Coward

    Isn't restore 101

    To switch things off *before* powering back up ?

    Otherwise you're just risking it tripping again ...

    and again ...

    and again ...

    The proper procedure is:

    1) power goes

    2) power everything off (physical switches if needs be)

    3) fix underlying problem

    4) begin power-up sequence. Which is laid out somewhere in the DR manual.

    But then I am 50, so have some smarts you'd have to pay for.
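    The four steps above map naturally onto a checked power-up sequence; a rough sketch, with the stage names and the health check invented for illustration:

```python
# "Restore 101" as code: everything off first, fix the fault, then bring
# kit up in a defined order with a check after each stage. The stage list
# and the health check are placeholders for whatever the DR manual says.
POWER_UP_ORDER = ["switchgear", "cooling", "network", "storage", "servers"]

def stage_healthy(stage: str) -> bool:
    # Placeholder for a real check: breaker state, temperatures, pings...
    return True

def restore(order):
    brought_up = []
    for stage in order:
        if not stage_healthy(stage):
            # Stop here rather than risk tripping the whole site again.
            raise RuntimeError(f"{stage} failed its check, halting power-up")
        brought_up.append(stage)
    return brought_up

print(restore(POWER_UP_ORDER))
```

    The key design choice is that a failed check halts the sequence, instead of the "switch it all back on and hope" approach that keeps re-tripping the supply.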

    1. Jay 2

      Re: Isn't restore 101

      Agreed. Though it looks like after the power off whoops event, someone thought that just switching on the power again without any full-on checks of all the various power bits and everything connected with it was a good idea...

      1. Anonymous Coward
        Anonymous Coward

        Re: Isn't restore 101

        That would only explain the loss of DC2 if the initial fault in DC1 caused it to go Byzantine and corrupt the state of DC2.

        That's the missing link here. We're all professionals, we know fuck ups will happen, we know even relatively minor errors can bring down something as complex as a datacentre if proper procedures are not followed. What hasn't been explained is how the death of DC1 brought down DC2.

        Options include:

        1) BA are lying and this is purely an RBS/Natwest style fuckup at the app/data level.

        2) BA are telling the truth about the root cause but the DCs weren't truly independent at the infra level causing one power fault to screw both

        3) Same as 2 but incompetent applications management/implementation caused the failure to be incorrectly handled

        4) It is a genuine accident and a dumb contractor screwed DC2 by hitting the wrong reset button, resulting in multiple days to replace damaged kit and replay lost data from backups.

        5) It was an attempted forced-failover from DC1 to DC2 that has broken DC2 and left a crippled DC1 covering all the load on the hottest, busiest day of the year.

        Place your bets.

        1. Aitor 1 Silver badge

          Re: Isn't restore 101

          From my experience 1 and 3.

          Lying, plus incompetent PHBs messed it up.

    2. Stoneshop Silver badge

      Re: Isn't restore 101

      Quite.

      We had a rather unscheduled event once, where the fire brigade threw the Big Red Switch in the outside feed. While the cleanup was being done (mopping up the water and ventilating the building) we worked out the startup sequence for the stuff present: network gear and standalone servers that wouldn't care about connectivity, servers that would need network or else their network configuration would be totally bonkers, and servers that would need to see particular other servers, otherwise they would be in a bind, with the best way to recover being rebooting once the other end became available.

      Energy was told to switch off all local circuit breakers before restoring power to each of the computer rooms, so that we could switch off all systems before the racks got powered.

      With that crib sheet things went as good as flawless.
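      Working a start order out of a connectivity matrix, as described, is essentially a topological sort; a sketch with an invented dependency map:

```python
from graphlib import TopologicalSorter

# Miniature version of the crib sheet: map each system to the systems
# that must already be up before it starts. The map itself is invented.
deps = {
    "network": [],
    "standalone-servers": [],
    "storage": ["network"],
    "app-servers": ["network", "storage"],
    "batch": ["app-servers"],
}

# static_order() yields every node only after all of its prerequisites.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

      Anything with no dependencies (network gear, standalone servers) comes out first, and the interdependent servers follow in a safe order.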

      1. Anonymous Coward
        Anonymous Coward

        Re: Isn't restore 101

        Good effort, but why didn't you have a plan prepared in advance?

        1. Stoneshop Silver badge

          Re: Isn't restore 101

          Good effort, but why didn't you have a plan prepared in advance?

          Good question. Next question please.

          What it comes down to is that that particular information just wasn't there, and basically the best thing to do in the time available was distilling it from a connectivity matrix, combined with noting whether systems were essential, auxiliary or 'meh, can wait'.

          We have detailed info on all systems, including how to start them up from zero. Like for the main VMS cluster, basically: 1) network, 2) storage management, 3) storage shelves and controllers, 4) the cluster nodes themselves, but only a few of those documents describe the how and why of interaction with other systems. That info tends to be in another class of documents, not the system operation manuals. There are sections that describe what has to be done to neighbouring systems in case of a total system shutdown, but that assumes you can log in to those systems to shut down the affected comms channels and such. With the power cut having done so for you, site-wide, there was a certain 'fingers crossed' involved, but the hardware proved surprisingly robust (two minor errors over the entire site, AFAIR), and the software only needed minimal prodding to get the essential bits working again.

          1. Mark 85 Silver badge
            Facepalm

            Re: Isn't restore 101

            That makes sense. At least you had the data available. I've heard of at least one place where there were no hard copies of the recovery plan as they were all "in the operation's database"...

            1. Anonymous Coward
              Anonymous Coward

              Re: Isn't restore 101

              It's the first thing I ask for when I enter a site. Everything should follow.

          2. Anonymous Coward
            Anonymous Coward

            Re: Isn't restore 101

            Next question.

            Why aren't the different components integrated?

            1. Stoneshop Silver badge

              Re: Isn't restore 101

              Why aren't the different components integrated?

              Hysterical raisins, for a large part. Security comes into play too: systems that have outside connections in whatever form are not allowed to even communicate with the main clusters directly, let alone have processes running on those clusters. And there is stuff like data conversion systems supplied by third parties, hardware and software, therefore not integrated, as a matter of fact.

              Furthermore, you don't have your monitoring system integrated with whatever you're monitoring, do you? Or your storage management system integrated with the systems you're providing storage to?

  10. Bill Gates

    Bullshit. Delta Airlines used the same 'power failure' excuse for their outage, and we learned six months later it was due to a network screw up.

    1. Adrian 4 Silver badge

      yebbut, 'power surge' is management-speak for 'borked'. It doesn't necessarily have anything whatever to do with actual power.

      1. Anonymous Coward
        Anonymous Coward

        Just their power

        I suspect the DC staff will be listened to now.

  11. Anonymous Coward
    Anonymous Coward

    Send me your CV

    Who cares that you're over 25 and have seen this before.

    1. Andrew12

      Re: Send me your CV

      Your comment explains why under 25s should not have the vote.

      And should not be let into data centres unsupervised.

  12. andy 103

    A data or application problem most likely

    This is the biggest load of BS ever. There's no way this started with a power failure at all.

    What's more likely is there was a data/application error that they'd never encountered (or planned for, or tested) and someone decided to kill the power ("have you tried turning it off and on again"). Because the systems at the other sites mirror the one that had the problem, they will then have wondered why killing the power did absolutely nothing to fix the problem. So then they'll have killed the power at the other sites and tried to power it back up. As the applications came back online they may have been faced with loads of data corruption, which was possibly fixed either manually and/or with a combination of tools built into their applications.

    The article quotes someone as saying a data problem is easier to fix than a hardware one. No idea where you got that total bullshit from. It depends on the circumstances. Even if you had to replace some hardware, that can generally be done faster than trying to fix a set of applications with corrupt or otherwise invalid files that are all trying to talk to one another.

    And as for "this all happened in the UK and isn't outsourced" - who developed and tested the applications? Oh yeah, outsourced Indian workers. *Slow clap*

    1. Anonymous Coward
      Anonymous Coward

      Re: A data or application problem most likely

      "There's no way this started with a power failure at all."

      Do you know something we don't? There are a thousand ways and more this could have started with a power failure.

      1. andy 103

        Re: A data or application problem most likely

        "Do you know something we don't? There are a thousand ways and more this could have started with a power failure."

        No, I don't. However consider all of this....

        1. Assume there was a power failure at the primary DC.

        2. The primary DC has backup/UPS power - why doesn't that work? The article suggests *maybe* the main power and backup were applied simultaneously causing the servers to use 480V. Fair enough.

        3. How does (2) affect what happens at the secondary DC? Why does exactly the same thing happen on a redundant system which is designed to mitigate against such problems occurring at one DC?

        If the power management is also controlled via software, that is a data/application problem, which hasn't been tested - if you are sending the "wrong" data to the secondary DC it will only have the same results, and replicate the problem there!

        I can't see how this would just come down to a "freak" power incident (nobody else in the area has reported it either) that knocked out two physically separate data centres, whereby the UPS also failed to work. It's just too coincidental and convenient.

        It's more a case of whatever happened at DC1 was mirrored at DC2 - either by humans - or by data sent from one site to the other.

        1. Stoneshop Silver badge

          Re: A data or application problem most likely

          2. The primary DC has backup/UPS power - why doesn't that work? The article suggests *maybe* the main power and backup were applied simultaneously causing the servers to use 480V. Fair enough.

          That (getting 480V fed into the racks) suggests a more than grave wiring error that would have caused one or more of: 1) seriously frying the output side of the UPS, 2) seriously frying the generator, 3) an almighty bang, 4) parts leaving their position at high velocity, 5) the electrician(s) that did the wiring leaving their place of employment at high velocity, and 6) one or more electrical certification agencies not previously involved in certifying and testing this setup taking a long, hard look at the entire process, from commissioning the installation to the aforementioned result.

          3. How does (2) affect what happens at the secondary DC? Why does exactly the same thing happen on a redundant system which is designed to mitigate against such problems occurring at one DC?

          Did it? Or was the second DC karking the result of the primary DC splaffing corrupted data as it went down, and thus corrupting the failover?

        2. Tridac

          Re: A data or application problem most likely

          2) 480 volts. BS: there's no way that any UPS I've ever seen could operate or be wired so that could happen. The only possible thing that might have happened is that grid power was restored out of phase with the UPS inverter output, which could cause a big bang. However, UPSes are specifically designed to deal with that, with the state change inhibited and the UPS phase adjusted until it's in sync with the grid input. No UPS could work without that logic.

          UPS systems are usually either true online or switched. With true online, the UPS inverter always supplies the load, with the grid input powering the inverter and float-charging the batteries. With switched, the grid normally supplies the load directly and the batteries are idle, on float charge. For the latter, when the grid goes down, the inverter starts up from batteries to power the load, which is switched over from grid input to UPS inverter within a few cycles at most. All these things have smart microprocessor control, which continually samples mains quality and has lockout delays to ensure that brief transients don't cause an unnecessary switchover, plus delays to ensure that grid power is clean and stable before reconnecting it to the load.
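          The lockout behaviour described above can be caricatured in a few lines. This is only an illustrative sketch of the transfer decision in a switched (line-interactive) UPS, with made-up voltage window and delays, and it ignores the phase-sync logic a real unit needs before reconnecting to the grid:

          ```python
          # Illustrative only: a switched/line-interactive UPS transfer decision
          # with assumed thresholds. Real kit also phase-syncs before reconnecting.
          GOOD_MIN_V, GOOD_MAX_V = 216.0, 264.0  # assumed +/-10% window around 240 V
          TRANSFER_DELAY = 0.02    # ride through transients shorter than this (s)
          REQUALIFY_DELAY = 5.0    # mains must stay clean this long to come back (s)

          def simulate(samples, dt):
              """samples: mains RMS voltage per tick of length dt; returns which source feeds the load each tick."""
              source, bad_t, good_t, out = "mains", 0.0, 0.0, []
              for v in samples:
                  ok = GOOD_MIN_V <= v <= GOOD_MAX_V
                  if source == "mains":
                      bad_t = 0.0 if ok else bad_t + dt
                      if bad_t >= TRANSFER_DELAY:     # lockout expired: go to battery
                          source, good_t = "battery", 0.0
                  else:
                      good_t = good_t + dt if ok else 0.0
                      if good_t >= REQUALIFY_DELAY:   # mains requalified: transfer back
                          source, bad_t = "mains", 0.0
                  out.append(source)
              return out

          print(simulate([240, 240, 0, 240, 240], dt=0.01))   # brief dip: stays on mains
          print(simulate([240, 240, 0, 0, 0], dt=0.01)[-1])   # sustained outage: 'battery'
          ```

          The point being that a one-cycle glitch never reaches the load, while a genuine outage transfers within the stated delay, and the unit won't hand back to the grid until it has been clean for the full requalification period.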

          We've now had several stories about this, none of which seem real, and all ignoring the giant elephant in the room, which is that their disaster recovery failed completely. The tech to do this has been around for decades, so how could it go so badly wrong in a company of that stature?...

          1. 404 Silver badge

            'So how could it go so badly wrong'

            Never underestimate the power of human arrogance and stupidity - if it can happen, it will.*

            *recent events surrounding NSA having dangerous tools taken from them and used to bitchslap corporates and gov agencies that should have known better prove that pretty well...

          2. Anonymous Coward
            Anonymous Coward

            Re: A data or application problem most likely

            "their disaster recover failed completely. The tech to do this has been around for decades, so how could it go so badly wrong, in a company of that stature ?..."

            That is the $128+M question. The rest has by now been repeated so often that it's starting to get a little less entertaining than it was.

          3. Anonymous Coward
            Anonymous Coward

            Re: A data or application problem most likely

            A power surge occurred at Redbus 8/9 Harbour Exchange back in 2004-ish. Hundreds of power supplies had to be replaced after going bang due to the same thing delivering 480V to all rack servers. At the time the MD personally paid for a plane shipment from China full of server power supplies to get all customers back up and running. It seems strange that BA had enough hot spares for this task.

          4. CaseyFNZ

            Re: A data or application problem most likely

            I did once see a new three-phase UPS installed where the electrician hadn't connected the output neutral before placing the UPS online in UPS mode. The resultant phase-to-phase high voltage destroyed several server power supplies.

            1. Grunt #1

              Re: A data or application problem most likely

              Where's the checklist?

            2. Richard 12 Silver badge

              Re: A data or application problem most likely

              Lost neutral is seriously bad juju.

              One or two phases will go hot, and once any phase goes over about 250VAC it'll rapidly kill things.

              And it doesn't take a big imbalance to do that.

              I've seen a few major installs with a rusty neutral - they tested out fine initially, then blew things up a few weeks or months later.
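              Richard 12's point about imbalance can be put into numbers with Millman's theorem: once the neutral is gone, the star point floats to the admittance-weighted average of the phase voltages, so the lightly loaded phase goes hot. A back-of-envelope sketch with resistive loads and made-up values:

              ```python
              import cmath
              import math

              V_LN = 240.0  # nominal line-to-neutral volts
              # Three phase sources, 120 degrees apart.
              sources = [cmath.rect(V_LN, math.radians(a)) for a in (0, -120, 120)]
              loads = [10.0, 10.0, 100.0]  # ohms; phase 3 is lightly loaded

              # Millman's theorem: with the neutral lost, the floating star point
              # sits at the admittance-weighted average of the source voltages.
              y = [1.0 / z for z in loads]
              v_star = sum(v * yi for v, yi in zip(sources, y)) / sum(y)

              for i, v in enumerate(sources, 1):
                  print(f"phase {i}: {abs(v - v_star):.0f} V across the load")
              # -> phases 1 and 2 sag to about 209 V; phase 3 rises to about 343 V
              ```

              Even this modest imbalance puts the light phase well over the ~250VAC mark where things start dying, which squares with the "one or two phases will go hot" observation above.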

              1. AdrianMontagu

                Re: A data or application problem most likely

                I think this has some credence. I have experienced a faulty earth on a newly rewired three-phase system. This can cause interphase voltages of 280V or more. That could cause higher voltages in the logic circuits and thence zapping of chips and insidious faults. Furthermore, the higher voltages in the logic circuits could pass to the connected DC2 via the data links (NOT POWER CIRCUITS). That could take out DC2. Just a hypothesis.

          5. David Holland

            A lost neutral?

            480 volts is unlikely, as is the suggestion that supplies were connected in series to give 480V. What is entirely possible, when a fault or mistake happens while synchronising three-phase generators to the mains or another three-phase supply, is a lost or floating neutral, which usually results in phases being substantially under- or over-voltage by anything from near zero change to near 240V. If you are lucky you just fry equipment, but less than a month ago such an incident also burnt down a pair of semi-detached houses.

    2. Stoneshop Silver badge

      Re: A data or application problem most likely

      The article quotes someone as saying a data problem is easier to fix than a hardware one. No idea where you got that total bullshit from. It depends on the circumstances. Even if you had to replace some hardware, that can generally be done faster than trying to fix a set of applications with corrupt or otherwise invalid files that are all trying to talk to one another.

      Indeed. And even if half your hardware is fried, it should be possible to bring up the other half with a reduced set of applications in a way that core functionality can be restored. And your DR plan should have tables of what machines can be reallocated to other tasks in a case like that.

      Corrupted data is another matter entirely. Can you fix it by rebuilding a few database indexes or zeroing some data fields, do you need to restore a backup or is the 3rd line support tiger team huddled over their monitors amid mountains of empty coffee cups, alternately muttering lines of logging or code, and obscenities?

  13. A Non e-mouse Silver badge

    Electrical Interlocks

    If... the power control software controlling the multiple feeds into the data centre didn’t get the switch between battery and backup generator supply right at the critical moment ... it could have resulted in both the battery supply and the generator supply being briefly connected in series to the power bus feeding the racks. That would result in the data centre’s servers being fed 480v instead of 240v...

    There should be some serious interlocking going on to prevent two completely different power inputs being connected at the same time to the same load. If there isn't, someone has seriously screwed up on safety and you've got a *very* dangerous situation on your hands.

    1. Anonymous Coward
      Anonymous Coward

      Re: Electrical Interlocks

      Failover from utility to generator power is fully automated in our DC. You'd need to be in the plant room with wellies, gloves and a big set of jump leads to override it. Even if you succeeded it would trip out the breakers PDQ.

      Anyway, "a power surge broke the computer" is totally a "dog ate my homework" excuse, so I say they're bluffing to cover for an even more embarrassing reason.

      1. Steve 114
        Holmes

        Re: Electrical Interlocks

        Is it a coincidence that there was (genuinely) serious flooding in Bangalore on the same day?

    2. Red Ted
      Mushroom

      Re: Electrical Interlocks

      Indeed, all the backup systems I have worked with have both electrical and mechanical locks to prevent this happening. It's a Very Bad Thing (TM).

      I have seen a photo of what happens if you connect 1MW of diesel gen-set to the grid without syncing the phase. The result is that the stator (there's a clue in the name) rotates by the phase difference.

      1. Tridac

        Re: Electrical Interlocks

        ...Or the generator crankshaft or coupling gets sheared off with the shock load. Seen examples of that...

        1. Alan Brown Silver badge

          Re: Electrical Interlocks

          ".Or the generator crankshaft or coupling gets sheared off with the shock load. Seen examples of that..."

          This is why they're supposed to have shear pins

    3. Anonymous Coward
      Anonymous Coward

      Re: Electrical Interlocks

      "There should be some serious interlocking going on to prevent two completely different power inputs being connected at the same time to the same load."

      There was an Open University "System Design" course programme about such things. One of the real-world examples was a local tram system. At the end of the line the operator had to throw a switch to reverse the electric motor power for the return journey.

      The switch had three positions - with the centre one being "off". What happened several times was that the operator went through "off" so fast that the reverse power was applied to a motor that was still moving. This produced a destructive surge.

      The simple modification was to have a separate key-lock "on/off" switch for each of the two directions. The switches were separated by a large distance on the panel. There was only one key. This gave sufficient time delay while both were in the "off" position.
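      That one-key scheme is a classic trapped-key interlock: whichever switch holds the key is the only one that can be on, and walking the key across the panel enforces the dead interval. A toy model of just the mutual-exclusion part (class and method names invented for illustration):

      ```python
      class TrappedKeyInterlock:
          """Toy model: one key shared by two direction switches, so both can
          never be energised at once, and 'off' must be passed through between them."""

          def __init__(self):
              self.key_free = True
              self.energised = None  # "forward", "reverse" or None

          def switch_on(self, direction):
              # The key must be free; if it is trapped in the other switch,
              # that switch is still on and we refuse to energise.
              if not self.key_free:
                  raise RuntimeError("key trapped in the other switch; turn it off first")
              self.key_free = False
              self.energised = direction

          def switch_off(self):
              # Turning off releases the key: this is the enforced dead interval.
              self.energised = None
              self.key_free = True

      tram = TrappedKeyInterlock()
      tram.switch_on("forward")
      # tram.switch_on("reverse") would raise here: the key is trapped
      tram.switch_off()
      tram.switch_on("reverse")
      ```

      The physical separation of the panels adds the time delay; the single key adds the guarantee that no sequence of operator fumbling can apply both directions at once.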

      1. Anonymous Coward
        Anonymous Coward

        Re: Electrical Interlocks

        "There should be some serious interlocking going on to prevent two completely different power inputs being connected at the same time to the same load."

        Back in the 1960s a university laid on a tour for prospective students. They demonstrated winding up a small generator - and watching the phase lights to time when to sync adding its output to the national power grid.

        The lecturer told the story of someone getting their timing wrong - with the result that the generator was momentarily competing against the national grid. The generator stopped dead - and disintegrated.

      2. AdrianMontagu

        Re: Electrical Interlocks

        Good comments. Proper power "Transfer Switches" should automatically "break" before they "make". However, there are times when you want to isolate a UPS and run direct from mains for maintenance. This requires a sequence of switching to occur. Get it wrong and you lose power. This still doesn't address the lack of DC2 coming on stream.

    4. Blofeld's Cat
      FAIL

      Re: Electrical Interlocks

      "There should be some serious interlocking going on to prevent two completely different power inputs being connected at the same time to the same load."

      As an electrician, the simplest ways I can think of for getting 415 V where 240 V should be are either to swap over one phase and the neutral, or to accidentally disconnect the neutral on a 3 phase supply.

      The former is almost always an installation error (I took out a bank of lighting when I was an apprentice), while the latter is generally poor design (inappropriately switched neutral) or an equipment failure.

      I also seem to recall, (but cannot find*), a news story where a disgruntled employee damaged part of a DC by temporarily disconnecting the neutral.

      * I think it was on El Reg [citation needed]

      1. patrickstar

        Re: Electrical Interlocks

        The classic way to do the latter is to forget hooking up the neutral after insulation testing. Can have very fun results, depending on how skewed the load is across the phases.

        There have also been a number of cases of a lost neutral from overhead wires being torn down, people crashing into cabinets, etc., with buildings literally catching fire as a result.

        1. Richard 12 Silver badge

          Re: Electrical Interlocks

          WTF are you even touching the neutral for any testing?

          That's not needed for any form of electrical testing that I recognise. Low N to PE resistance on the load side is easily found by other means, and on the supply side they're either bonded together or they aren't.

          In fact, if you disconnected the neutral to any system while any live was still connected I'd throw you straight off site and end your career.

      2. Citizen99

        Re: Electrical Interlocks

        "... As an electrician, the simplest ways I can think of of getting 415 V where 240 V should be ..."

        Up-vote for the first post I've seen getting the square-root of 3 relation correct.

        Not that it affects the arguments ;-)

        /pedantry
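        For the record, the √3 comes from the 120° separation between phases, which is why the answer is 415 V rather than double. A two-line check (numbers only, nothing BA-specific):

        ```python
        import math

        # Line-to-line voltage between two phases 120 degrees apart is sqrt(3)
        # times the line-to-neutral voltage, not double it: hence ~415 V, not 480 V.
        v_ln = 240.0
        v_ll = math.sqrt(3) * v_ln
        print(f"{v_ll:.1f} V")  # -> 415.7 V, nominally quoted as 415 V
        ```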

  14. MakingBacon
    Facepalm

    Oh come on!

    It's a Daily Mail article for God's sake.

    We all know how accurate they can be ...

    1. Yet Another Anonymous coward Silver badge

      Re: Oh come on!

      Obviously too many Volts or Amperes - both named after foreigners.

      They need to have more British Farads or Henrys

      1. Anonymous Coward
        Anonymous Coward

        Re: Oh come on!

        "They need to have more British Farads or Henrys"

        There are already too many British Hooray Henrys in positions of power.

      2. Captain Badmouth
        Happy

        Re: Oh come on!

        "Obviously too many Volts or Amperes"

        Nonsense, obviously using the wrong plugs...

        R.I.P. Roy Kinnear.

        1. Anonymous Coward
          Anonymous Coward

          some folk might need a bit of help with this...

          "obviously using the wrong plugs...

          R.I.P. Roy Kinnear."

          ALGERNON! The LASER!

  15. Aladdin Sane Silver badge

    We haven't heard anything from the BOFH recently.

    Coincidence?

  16. Anonymous Coward
    Anonymous Coward

    Someone unplugged the power cable to plug in the hoover

    Someone unplugged the power cable to plug in the hoover

    I reckon that is the reason.

    And the cleaning staff work at the same time in both buildings :)

    1. ecofeco Silver badge

      Re: We haven't heard anything from the BOFH recently.

      I think you may be on to something there...

      1. Korev Silver badge
        Flame

        Re: We haven't heard anything from the BOFH recently.

        I suspect the boss is currently the new circuit breaker; it's a stressful job and if he's not careful he'll burn out...

  17. Anonymous Coward
    Anonymous Coward

    UK FAST - 2009

    We used them for our website hosting, and SQL backend to that.

    Came in Monday, noticed SQL agent had stopped. Bit more digging and discovered an unplanned outage.

    Spoke to UK FAST who - without any hint of shame - admitted that both the primary and backup SQL boxes had been connected to the same power bus, so when they had an outage we lost both.

    A note to anyone who is using these jokers: they genuinely didn't understand why it was a problem.

    1. Alan Brown Silver badge

      Re: UK FAST - 2009

      "admitted that both the primary and backup SQL boxes had been connected to the same power bus"

      I don't even hookup the two rack PDUs to the same power bus, let alone systems which are supposed to be failovers.

    2. Pedigree-Pete
      Mushroom

      Careful where you plug in your servers.

      When our small company outgrew personal printers and "floppynet" we installed a Novell server (NetWare 3.12 IIRC) in what was then the warehouse.

      As a non-IT person, but the closest we had, I was nominated "IT Mgr".

      I had the presence of mind to have a non-switched box on the wall for the IEC lead that powered the server and all seemed good.

      That was until our kettle in the kitchen started to fur up and trip out the kitchen power.

      Sadly the switched socket that I had replaced with a sealed box was on the same circuit so the server crashed unceremoniously every time this happened.

      We had that changed promptly and ordered a small but suitable UPS and commenced negotiations with the ISP next door for a feed from their Diesel Genny.

      Good lesson to learn early.

      Still not an IT Mgr and from the tales I read here I'm bloody glad I'm not. PP

  18. Charlie Clark Silver badge

    Looking forward to the BOFH's take on this

    Should be able to get a whole series of articles out of management fucking this up from start to finish and blaming everything from the cleaning staff to Cthulu!

    1. Aladdin Sane Silver badge

      Re: Looking forward to the BOFH's take on this

      Hail Cthulhu.

      1. Yet Another Anonymous coward Silver badge

        Re: Looking forward to the BOFH's take on this

        I always assumed Cthulhu was responsible for keeping BA (and indeed most airlines) operating

  19. chivo243 Silver badge
    Coat

    Right tool for the Job!

    We always call in an expert for our electrical needs. Will be replacing 3 UPS next weekend... He's already visited and we're clear to fly!

  20. Anonymous Coward
    Anonymous Coward

    I did think it might be something like this. Maybe the UPS was put into bypass so work could be done on it, then someone hit the main breaker and, bob's your uncle, fanny's your aunt, the whole lot goes! Then in the panic power gets restored while the other DC is part way into taking over, and you have a right old shit storm happening. Now we only run a small DC at our site, with something like 24 racks, 4 CRAC units and a UPS supplying the room, but that could happen with our setup; we don't have a genny. To put the UPS into bypass you need a key to move the bypass switch lever.

    1. Mark York 3 Silver badge
      Facepalm

      Had a similar situation with a planned building shutdown, I was scheduled to be on-site for the testing of all desktop equipment 9am Sunday morning at our office suite.

      Nobody had switched the bypass on the UPS before the Friday shutdown (servers powered down safely though), so the UPS was dead. India thought I would be at my desk, 40 miles away from the building, at 3am, & left lots of messages asking why the servers were not coming back up remotely.

      I turned up on schedule & in the dark (figuratively), electrical contractors for the UPS had to turn up & test each battery in turn just in case one "blew" when power was restored.

  21. Jason Bloomberg Silver badge
    Black Helicopters

    Boadicea House

    "no, we're not going to pinpoint it on a map for you"

    Is this some sort of; we don't want to be accused of helping them darned terrorists thing?

    I am rather surprised Caliphate Inc. haven't claimed responsibility, as they seem to claim they have inspired or are responsible for everything bad that ever happens.

    1. Yet Another Anonymous coward Silver badge

      Re: Boadicea House

      Although ISIS have claimed responsibility for Heathrow generally

      1. AmenFromMars
        Joke

        Re: Boadicea House

        It may well have been IS-IS, a lot of networks use it to underpin their MPLS core.

      2. Stoneshop Silver badge
        Devil

        From The Meaning Of Liff

        AIRD OF SLEAT (n. archaic)

        Ancient Scottish curse placed from afar on the stretch of land now occupied by Heathrow Airport.

        It's clearly working.

    2. Anonymous Coward
      Anonymous Coward

      Re: Boadicea House

      Too late, Google maps already shows it.

      Street view allows you to see gennies at the DC too.

      1. cd / && rm -rf *

        Re: Boadicea House

        "Street view allows you to see gennies at the DC too"

        A very fetching shade of blue.

    3. Anonymous Coward
      Anonymous Coward

      Re: Boadicea House

      Just try googling it instead.

  22. bazza Silver badge

    Ambitious...

    "After a few minutes of this shutdown of power, it was turned back on in an unplanned and uncontrolled fashion"

    That's got to count as one of the most ambitious attempts to switch something back on in the hope that no one will notice it ever got switched off in the first place.

    Whoops!

  23. Colin Bull 1
    Flame

    Power hosed

    I worked in a data centre once when a power contractor disconnected a cable he should not have that was live. The cable then progressed to wave about in the air like a snake on fire and arced several items of power supply kit including UPS batteries.

    After the immediate problem was solved it took more than a few hours to determine what kit was damaged. This included many server PSUs, many network interfaces and many less important items.

    Luckily this was a facility that was not live, but because the kit was sprayed over a period of time (perhaps a minute or two) there was always a chance data would be out of sync or corrupted.

  24. LoPath
    Trollface

    All fine unless....

    It's all great as long as a proper change control was approved.

  25. Shadow Systems Silver badge

    Next they'll blame the squirrels...

    Those evil, nasty, cable maiming squirrels.

    *Shakes a palsied fist*

    Dang you squirrels, leave my nuts alone!

    *Hugs tin of mixed party nuts*

  26. John H Woods
    Trollface

    Nothing leaked to El Reg

    I am extremely suspicious that nobody from the inside has tipped off the Reg with the real story. It suggests an unfeasible level of loyalty (given all the outsourcing) or that nobody involved reads this esteemed organ which, in turn, suggests a paucity of people who actually know anything.

    1. Anonymous Coward
      Anonymous Coward

      Re: Nothing leaked to El Reg

      "nobody from the inside has tipped off the Reg with the real story. it suggests an unfeasible level of loyalty (given all the outsourcing) or that nobody involved reads this esteemed organ which, in turn, suggests a paucity of people who actually know anything."

      The stories (and comments) reaching the public about what allegedly happened are almost infinitely improbable, which (as you and others have noted) is quite remarkable.

      Perhaps the BA datacentres have turned into the Marie Celeste of airline IT: perhaps nobody was actually in operational control? That's as likely as most of the other stories circulated so far, starting with the original "a big boy blew my power supply and ran away".

      1. Dan 55 Silver badge

        Re: Nothing leaked to El Reg

        The same CEO ran a tight ship at Vueling as well. There was a 4-day meltdown last summer and nobody said anything apart from the blindingly obvious.

    2. JimC

      Re: Nothing leaked to El Reg

      I have known cases where everyone concerned has been very careful not to pin down the exact cause too precisely, just in case it turns out to be their side of the fence.

      This may well be the case. If it was indirectly due to inadequacies in the outsourced service then the person who points it out will be making top executives look foolish, and pointing out the state of the emperor's clothes is usually followed by a man with a large weapon telling you what you didn't see. If it were in-house then you're making in-house liable to be outsourced, and if it's a UK contractor jobs will be at risk as well. So it may well be in no-one's interests to define the rot cause too accurately to senior management.

      1. Anonymous Coward
        Anonymous Coward

        Re: Nothing leaked to El Reg

        "rot cause" seems like the correct term. ;-)

        No matter who did it, the responsibility always lies with the executive. That is what leadership really means. though in this case they seem to be leading BA to the bottom.

  27. Anonymous Coward
    Anonymous Coward

    Me too, I thought someone in the know would have posted on here by now!

  28. TooManyChoices

    Single Point of Failure

    Maybe, if BA were running active:active, they have just found a single point of failure in the design that they did not notice before.

    For example maybe taking out the power completely knackered the Enterprise Service Bus (or access to it). If messages could not get to the backup then it fits in with what Cruz said about a power supply issue causing a network problem that meant messaging failed between all the systems.

    The backup data centre might have just been twiddling its thumbs.

    1. Mark 85 Silver badge

      Re: Single Point of Failure

      Add to that... maybe two DCs close to each other but only one UPS for both... err... to save money, of course.

  29. Anonymous Coward
    Anonymous Coward

    Sounds familiar

    https://www.youtube.com/watch?v=AyOAjFNPAbA

    Skip to 35.04

    1. Bronek Kozicki Silver badge

      Re: Sounds familiar

      The interesting bit is 37:18, thanks.

  30. Ian Emery Silver badge

    240volts + 240volts = 240volts

    Unless of course you do something REALLY stupid.

    The only time I have ever come across anything this stupid was while working in a GUS centre; are they and BA linked in any way??

    They were running 240volts via 2 phases of a 3 phase power line, but fitted with SINGLE pole isolators, so in the "off" position, all the equipment was still "live" - and someone got fried.

    To compound the issue, every single CB was bypassed at all 25 stations, so even if you tripped all the switches in the control cabinets, it was still all live.

    Installed in 1973, I found it the first time I opened a cabinet box to try and find out why the guy had been fried - in 1992.

    Even if they managed to use two different phases as "Live" and "Neutral", you would still only get 380volts.

    It sounds to me like a scapegoat has been chosen, and they are making up a story to use.

  31. CastorAcer

    Looks like no one has done any root cause analysis...

    The proximate cause is that some poor chump turned off something that he or she shouldn't have.

    There are probably several root causes including inadequate training, supervision and planning.

  32. ecofeco Silver badge

    The cause is pretty obvious

    https://www.theregister.co.uk/2017/04/11/british_airways_website_down/

    As you stare at the dead British Airways website, remember the hundreds of tech staff it laid off

    The staff who knew the quirks and oddities of the system were dismissed and the remaining staff had no clue. The rest is nothing but shouting.

    I've seen this happen a few times. Quite a few times.

    Manglement always thinks a complex network is plug and play and so are the staff. And this is always the result.

    So much schadenfreude here.

  33. Daedalus Silver badge

    Willium strikes again

    (wearing brown overalls with a ciggie in his mouth)

    "Emergency power? More than my jobs worth, mate"

  34. JamieL

    Restore in the wrong direction?

    Reminds me of one occasion a few years back. Had installed an active-active firewall pair across two sites with each handling one leg of a 2MB leased-line to the internet (see, I said it was a few years back!) and automatic config hot-synchronized between the pair.

    One firewall sends alarms to say fan playing up. No problem, will just power it down, replace with a good unit then bring it back up again. Power down the offending unit, traffic fails seamlessly over to the secondary and all good. Re-cable from dead unit to the new one, power up. Still all good. Give the "re-synch" command and hear someone outside wondering why the internet has stopped working. Upon closer inspection I determine that the re-synch has indeed happened, with the configuration being dutifully copied across - but I now have two units each with a blank config. Heart stopped and blood ran cold as I realise what I've done...

    Fortunately it was only a 15 min sprint to the other site where I had a laptop with all the backed up configs. A lesson learned the hard way indeed!

    But TBH in BA's case I can't believe that the Friday of a Bank Holiday weekend wasn't considered a change freeze with nobody allowed to do anything other than emergency work in their DCs.

    1. Jc (the real one)

      Re: Restore in the wrong direction?

      Too many years ago (37, if you want to know) I talked to a customer who had just tested his backup process ("just to get the feel of the syntax").

      At that time , some utilities had a syntax of "$Copy {input device} to {output device}" while others had a "$backup {target device} from {source}".

      This chap managed to copy an 8" floppy to his 100MB database drive (and it worked... perfectly.... database was now about 300KB! )

      Could BA have just run a backup / recovery in the WRONG direction?

      Jc

  35. Anonymous Coward
    Anonymous Coward

    you walked the walk

    Just the thought that people that have been coasting along for years thinking they are in charge were bricking it for several days is nice...

  36. sitta_europea Bronze badge

    In any building which consumes anything like a megawatt the power supply will be three-phase, as will be the backup generators. The UPS (or, rather, presumably a number of them) in an installation of that sort of size will be supplied directly from the three-phase supply and will provide 240V single-phase AC only at the UPS outputs.

    So we know straight away that any talk about 240V+240V=480V is nonsense, because to get 240V in the first place you had to come up with some sort of 'neutral' which is NOT provided by the generator and divide the, er, real numbers by the square root of 3. This is why you'll see mention of 415V all over the place in serious power circuits. That's the phase-to-phase voltage. There isn't any 'neutral' and you can't wire three-phase circuits 'in series' - the suggestion makes no sense.

    I'm not a Chartered Electrical Engineer for nothing.
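
The √3 arithmetic above can be checked in a couple of lines of Python (the 415 V and 240 V figures are the UK nominals quoted in the thread, not measurements):

```python
import math

# Phase-to-phase voltage = phase-to-neutral voltage * sqrt(3),
# so the familiar 415 V figure divides back down to roughly 240 V.
PHASE_TO_PHASE = 415.0
phase_to_neutral = PHASE_TO_PHASE / math.sqrt(3)
print(round(phase_to_neutral, 1))  # 239.6
```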

    1. Primus Secundus Tertius Silver badge

      240 Volts??

      In my young day it was 220V. Then, ca 1968, it was upped to 240V. That fried a lot of light bulbs, making work for the working man.

      Then, ca 1996, the EU standardised on 230V. That left a lot of light bulbs running until eternity, so they had to invent an eco-scare to force people to keep replacing light bulbs.

      1. Dan 55 Silver badge

        The EU standardised on 230V +/- a percentage which covered all countries, which meant absolutely nothing changed.

        1. patrickstar

          In Sweden they actually raised the voltage at some point. Used to be 220V - now it's exactly 230.0V from the substation in most cases.

          1. Wensleydale Cheese

            "In Sweden they actually raised the voltage at some point. Used to be 220V - now it's exactly 230.0V from the substation in most cases."

            Ditto in Switzerland. I think it was someone in Holland who told me about it and when I checked the wall outlet with a meter, sure enough, bang on 230V.

          2. Alan Brown Silver badge

            It might be 230 from the substation in Sweden, but it's 252 at my wall socket most of the time.

        2. Richard 12 Silver badge

          The UK used to be 240VAC +/-6% (225-254VAC), and various EU countries at 220VAC and 230VAC nominal.

          Then the EU harmonised at 230VAC +10%/-6% (216-253VAC)

          This range was carefully chosen to ensure that no EU install had to change anything at all.

          Today, most sites in the UK still get 240VAC.

          New builds often get 245-250VAC, as they tap high to allow for future additional load without having to touch anything.
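
The harmonisation arithmetic works out like this - a minimal sketch of the tolerance bands quoted above, nothing more:

```python
def band(nominal, plus_pct, minus_pct):
    """Return (low, high) limits for a nominal voltage and +/- tolerances."""
    return (nominal * (1 - minus_pct / 100.0),
            nominal * (1 + plus_pct / 100.0))

uk_old = band(240, 6, 6)          # (225.6, 254.4)
eu_harmonised = band(230, 10, 6)  # (216.2, 253.0)

# A UK site still delivering 240 VAC sits comfortably inside the
# harmonised band, so no installation had to change anything.
assert eu_harmonised[0] <= 240 <= eu_harmonised[1]
print(uk_old, eu_harmonised)
```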

    2. Wensleydale Cheese

      "The UPS (or, rather, presumably a number of them) in an installation of that sort of size will be supplied directly from the three-phase supply and will provide 240V single-phase AC only at the UPS outputs."

      What about computer systems that require three-phase?

      Even if such beasties are no longer common, I would have thought that someone the size of BA would have some legacy kit using three-phase.

      1. IGnatius T Foobar

        Three phase UPS are common in data centers.

        Even if such beasties are no longer common, I would have thought that someone the size of BA would have some legacy kit using three-phase.

        Total bollocks. The kind of UPS that are installed in even the most modestly sized data centers have three phase inputs and three phase outputs.

        1. Anonymous Coward
          Anonymous Coward

          Re: Three phase UPS are common in data centers.

          Yep, and you should try to get your usage of each phase as equal as possible.

  37. Cynic_999 Silver badge

    No such thing as "uninterruptable"

    I once worked at the main telephone exchange of the capital of a particular country. The power could not fail - they had a huge room of lead acid batteries constantly float-charging to supply the 50V power (@ circa 4000 amps) to the electromagnetic exchange. Any one of the 3 container-sized power supplies feeding the batteries could handle the full load, and those were fed from two different sections of the national grid. Two enormous diesel generators (primary and backup) each in a separate room would take over in the unlikely event that both grids went down.

    A bush fire 100 miles away weakened a few 400kV pylons which collapsed and took out one arm of the grid. The additional load on the other arm promptly tripped it offline.

    The primary diesel generator started up automatically within seconds but soon made horrible noises and stopped due to the fact that the maintenance guys had drained the oil the previous week as routine maintenance, then realised they had no oil in stock so it had been left dry while they ordered some (but had been put back online so the boss wouldn't notice).

    The manager quickly went to the secondary generator and started it manually, but while it roared into life it generated no power at all. The batteries could power the exchange for about 30 minutes and time was running out fast. A loose terminal on the excitation winding of the secondary generator was found 5 minutes before we were about to undertake a complete shutdown as the battery voltage was falling to below 45V, and a hasty repair was made. Just as power was about to be switched to the now functional generator, the grid came back on, making it unnecessary.

    A very close call - bringing up an electromagnetic exchange is not straightforward. You cannot simply power it all up in one go, as almost all the solenoids would energise at power-up and overload the whole system. Hundreds of fuses must first be removed, thousands of electromechanical switches manually moved to their home positions, and then the fuses replaced in a particular sequence.

    A lesson in Sod's law.

    1. yoganmahew

      Re: No such thing as "uninterruptable"

      I likewise worked in a company in the travel industry that had an enormous 5-generator (IIRC) UPS and a car park on top of the diesel tanks. Every Friday it was tested and the A/C went bang. It was unfashionably modern for its time.

      One night I was on shift on site, first-line support, when the A/C went bang and the lights dimmed. Hmmm. The lights shouldn't dim, we're on UPS. Then the alarms started :) The two fellows from the energy centre, one as white as a sheet, came in shortly after. One half of the UPS (for it was twinned, with the fifth a spare) had blown up after it had kicked in following a half-second power drop due to expected engineering work. The chappie had walked into the generator room just as some switch thing arced onto the floor right in front of him, drawing a scar into the concrete.

      We lost half the UPS, which, as it turned out, powered all the DASD. They all had to be cold started and it also turned out that we had two large floppies to IML them... it took a while.

      I was 24... nobody died... happy days... :)

      edit: PS no power supplies were damaged by being switched off and on. And this was in the days when humidity, not just temperature, was important. I remember placing saucers of water round a small DC when the humidifier in the single AC failed. But that was some generations ago.

    2. Anonymous Coward
      Anonymous Coward

      Re: No such thing as "uninterruptable"

      I built and installed a system on a government owned, contractor operated site a few years back. It was the kind of place that really, really didn't want to lose power, and their UPS set up was pretty impressive. Then, a few months afterwards there was a lot of flooding in the region and the grid went down in the area. I held my breath, I happened to be spending Christmas in the area and I was waiting for the phone call asking me to go in and sort out the system after the inevitable borkage that had no doubt occurred.

      The call never came. The entire site's UPS had worked flawlessly, it never missed a cycle as it failed over first to battery and then generators, and ran quite happily all week on the fuel they'd got in stock. When the grid came back they'd still got another month's worth of fuel in the holding tank and a bowser on standby, just in case.

      I was very impressed as, i) this was GOCO, ii) it worked, iii) it wasn't costing the tax payer an arm and a leg, iv) someone in government had managed to get the contract right

      BA

      It sounds like BA have got years of engineering debt built up over decades of doing their own IT. The thing about doing your own IT is that you have to invest in it, otherwise these sorts of problems will occur and get worse. BA need to get serious about either building a new infrastructure or moving onto someone else's. It sounds like things are pretty marginal in their current (only?) building, and a "simple" thing like a fire will kill their business stone dead, permanently - BA ceases to exist.

      That's a corporate wipe out risk they're running. Compared to the cost of a planned replacement / duplication of their current infrastructure (a guessed hundred million-ish if they do it themselves, considerably less if they use someone else's?), corporate wipe out is veeery expensive.

      I wonder how the shareholders feel about that?

      Oil

      The oil industry is the same. A friend worked in one major company, he reported that their IT was in a hideous mess. It was the sum result of decades of projects that had been started during good years for the oil industry, and aborted half way through when the next slump came along.

  38. tedleaf

    See, I always thought BA stood for bloody amateurs..

    Looks like I was right all along !!!

    All I can say is that it couldn't happen to a more deserving bunch of incompetent, expensive morons..

    Oh I do so hope their insurance says no, you caused it, you pay for it..

    That should put a dent in any manglement bonuses this year...

    1. ecofeco Silver badge

      Dent in bonuses? You're joking of course.

      Dent in some lower scapegoat's employment is more like it.

      1. Pedigree-Pete
        Pint

        Dent in bonuses? You're joking of course.

        You're right, but the source of cash in a private enterprise is its customers, so we'll ultimately be paying. :( PP

        >>Friday Eve.

  39. Glennda37

    I wonder...

    If the sparky that took down telehouse a couple of years ago and probably got sacked has moved across town...

    1. Andrew12

      Re: I wonder...

      There are a few such "world-class" data centres around Docklands that I have experience of. Sure, I've seen the acres of batteries and the generators, but I have never found one that was genuinely UPS.

      In one case a FTSE 100 company fitted its own UPS next to its servers, despite already paying a huge rent to a facility that was supposed to provide it. That's mad right?

      Or was the international bank that relied on the sales brochure, contract and SLA mad? It's pretty hopeless waving your contract around when your European systems have no power.

  40. John 156
    IT Angle

    What is odd is that CBRE Global Workplace Solutions is a facilities management company for commercial property.

    1. Anonymous Coward
      Anonymous Coward

      CBRE Global Workplace's web server is TITSUP

      "CBRE Global Workplace Solutions is a facilities management company for commercial property."

      Synchronicity? Karma?

      http://www.cbre.co.uk/uk-en/services/global_corporate_services currently says

      "A big boy eat my power supply and ran away".

      Well, actually it says:

      "An error occurred while processing the request. Try refreshing your browser. If the problem persists contact the site administrator"

      but they're both equally helpful.

      This one works though:

      http://www.cbre.co.uk/uk-en

      [edit, 9 minutes later: both "working as expected"]

  41. Anonymous Coward
    Anonymous Coward

    It happens

    These things definitely happen. I was overseeing the recommissioning of the main propulsion switchboard on one of HM's Submarines. 720V DC stuff (shiver). Jack forgot the DC Shore supply was connected and switched to Batteries-in-Series. Big bang. Magnetised Submarine. Cost £5M to degauss. Embarrassment all round. It was a long, long time ago.

  42. Anonymous Coward
    Anonymous Coward

    Power-supply network

    The power-supplies are servers themselves, with a network connection and complex software. There is a big kill-switch, but it is all controlled through software. Possibly a software fault on the power-supply network caused trouble, and the power-supply was killed abruptly, resulting in damaged hardware, and possibly software. Restarting can break some hardware itself.

    When our troubles come, they come not single spies, but in battalions...

  43. Mr Chuckles

    Well Willie, the first rule of Outsourcing is you can't outsource responsibility. The whole Outsourcing thing is simply a way to reduce pay and conditions for workers. Also, companies who outsourced IT Operations and Infrastructure (sure, it's only tin) now realise it's neither cost-efficient nor a better service. When something goes wrong, it takes much longer to recover when IT is scattered to the four winds.

    1. ecofeco Silver badge

      There are many more of these hard lessons yet to be learned.

      Bank on it.

  44. Mvdb

    I am not sure why most of the comments and the UK press are so negative on BA. I am from the Netherlands and just want to know what happened. Reading a lot of nonsense in the newspapers.

    Human error based on a single source. Denied by the contractor. Does any newspaper check facts, or do they not care and just publish?

    There are many documented failures of power in datacenters. So BA is certainly not alone. An overview here http://up2v.nl/2017/06/02/datacenter-complete-power-failures/

    And a detailed post about what went wrong here. Hope someone will learn from it

    http://up2v.nl/2017/05/29/what-went-wrong-in-british-airways-datacenter/

  45. ccannick

    Good Follow Up

    Good follow up story but I'm left with more questions than answers.

    https://medium.com/@computespace/british-airways-5-17-ff8bfa25ec7d

  46. Miss Lincolnshire

    Contractor cocks up enormously.....

    ..........remind me to renew my indemnity insurance tout de suite

  47. astounded1

    When You Don't Test And Train And Your Setup Sucks...

    Active:active without estrangement between the two is a flying farce. In other words, your secondary live site must be autonomous, not interlinked. You move to it with no electrical interplay. What kind of collective ignorance is at play here?

    Nobody loves the smell of fried computers in the morning. Or do they?

    1. Wensleydale Cheese

      Re: When You Don't Test And Train And Your Setup Sucks...

      "Nobody loves the smell of fried computers in the morning. Or do they?"

      Not too sure about that. One especially warm weekend a customer's aircon failed and the computer system got so hot that the only sensible thing to do was condemn it, claim from the insurance, and buy a new replacement.

      Our hardware salesman was not displeased with an extra sale landing in his lap.

      1. Alan Brown Silver badge

        Re: When You Don't Test And Train And Your Setup Sucks...

        "One especially warm weekend a customer's aircon failed and the computer system got so hot that the only sensible thing to do was condemn it"

        How many people running datacentres DON'T have a crowbar set to drop on the power if room temps go over a set limit? (usually 35C)

        1. Anonymous Coward
          Anonymous Coward

          Re: When You Don't Test And Train And Your Setup Sucks...

          "How many people running datacentres DON'T have a crowbar set to drop on the power if room "

          Almost everyone running anything mission critical? You would start manually turning less critical stuff off first if you had a heat build up.

  48. MR J

    Doubt 480v

    With an industrial electrical background I can say that only a fool would have a system that could in any way "combine" in series to supply 480v.

    It doesn't seem logical either, because for that to happen the systems would need to be in series to begin with, that just seems highly unlikely, only someone who doesn't understand the basics of electricity would do something like that.

    What I see as more likely to have taken place was a failure of the generator / battery system and someone instead throwing mains feeds straight into the barn. The load is excessive and the voltage being so far out causes some of the supply units to go pop or for breakers to get thrown off again; they cycle the power again trying to fix the issue but it doesn't help. Hook up enough SMPS units and try to power them all at once on a leg that can't support the starting current and you will see it happen every time!

    Inrush current of 100 amps for one computer is bad, about 20x max load, so if you imagine this in a data center with some fool turning everything on at once, melted lines and dead equipment does seem possible.
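
The inrush arithmetic above can be sketched with the comment's own 100 A / ~20x figures; the fleet size and stagger group are made-up illustrative numbers, not anything known about BA's hall:

```python
# Illustrative figures only: 100 A momentary inrush per PSU (~20x a
# 5 A steady-state draw, per the comment above), and an assumed 500 PSUs.
PSUS = 500
INRUSH_A = 100.0
STEADY_A = 5.0

all_at_once_peak = PSUS * INRUSH_A   # 50,000 A momentary surge
steady_state = PSUS * STEADY_A       # 2,500 A normal running load

# Staggering power-up in groups of 25 caps the worst instant: earlier
# groups have settled to steady-state by the time the last group closes.
GROUP = 25
staggered_peak = (PSUS - GROUP) * STEADY_A + GROUP * INRUSH_A  # 4,875 A

print(all_at_once_peak, steady_state, staggered_peak)
```

Even with generous margins, a feed sized for the 2,500 A running load has no hope of supplying a 50,000 A surge - hence melted lines and tripped breakers.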

  49. Anonymous Coward
    Anonymous Coward

    This should worry a lot of people...

    The ambiguity for starters, but also the lack of emergency plans C and D etc. How do other airlines, especially tight outfits like Ryanair, get by... Does there need to be a failsafe where partner airlines pick up the IT slack... ???

  50. Anonymous Coward
    Anonymous Coward

    480v?

    How would you end up with the two feeds in series? I would have thought that a more likely scenario would be to end up with the two sources in parallel.

    1. theblackhand Silver badge

      Re: 480v?

      How to get to 480V?

      Your incoming power feed at 240V plus your UPS/generator feed which is a separate 240V feed.

      And a big switch between them, with software to control it and manual override. That someone screwed up, based on the article - although the engineer may be being thrown under a bus and the details may be more subtle.

      BA are supposedly around 250 racks per DC - given the age of the DCs and the likely equipment (mainframe- and comms-heavy), they are likely around 1-1.5MW per DC. Nothing will be small.

      1. Richard 12 Silver badge

        Re: 480v?

        Except that AC supplies don't work like that, and no UPS could ever be physically wired that way and ever work to begin with.

      2. patrickstar

        Re: 480v?

        Remember kids, connecting power sources in parallel increases the current, connecting them in series increases the voltage.

        If you connect two AC feeds in parallel, which is what would most likely happen if a switch borked, you'd either get a short (if they were out of phase or the phase order different - many years ago, SUNET had a hilarious DC failure caused by a UPS bypass switch accidentally engaging), or nothing at all (in fact, big feeds are actually several parallel conductors already, since a single one would get too big).

        Connecting AC feeds in series? No, sorry, doesn't work like it does for DC...

        At most, the latter could cause things monitoring the direction of current in the HV feed to trip. Or a generator to run backwards, which could very well spell disaster - for the generator, not for the server.

        Otherwise the major concern would be this happening during a power failure and frying the poor lineman trying to fix it.

        The typical ways to get overvoltage from wiring errors in a three-phase system would be to either

        a) lose the neutral wire in which case your single phase loads would get 0-415V, depending on the load balance between the phases

        or

        b) hook up a single phase load across two phases instead of a single phase and neutral, which would give it 415V.

        Neither case gives you 480V, although they certainly won't be good for the equipment regardless.

        Both cases are highly unlikely to result from some sort of switch failure since the neutral is never, ever switched in the first place.

        I guess you could somehow theoretically end up with two out-of-phase sources supplying your 3 phases, which could cause more than 415V between phases but still wouldn't hurt your single-phase loads (at least not unless the neutral wire gets very overloaded as a result).
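
Case (a) above - the lost neutral - can be sketched as a simple resistive divider. This is a deliberate simplification (it ignores load phase angles, and the impedance values are illustrative):

```python
def lost_neutral(z1, z2, v_pp=415.0):
    """Voltages across two resistive loads left in series when the
    neutral disappears: instead of each seeing ~240 V to neutral, they
    divide the 415 V phase-to-phase voltage by impedance ratio."""
    v1 = v_pp * z1 / (z1 + z2)
    return v1, v_pp - v1

print(lost_neutral(100, 100))  # balanced loads: (207.5, 207.5)
print(lost_neutral(1000, 10))  # badly unbalanced: ~(410.9, 4.1)
```

So, as the comment says, single-phase kit can see anywhere from near 0 V to near 415 V depending on the load balance - never 480 V.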

  51. sanmigueelbeer Silver badge
    Unhappy

    DR plans

    Except for big financial organizations, a lot of companies don't really have a real DR plan. True, they might have a secondary data centre, but to actually get down and test the DR plan by flipping off the power to the primary DC? Not going to happen, even if it has the blessing of the CIO/CTO.

    To divert traffic and data away from the primary DC, there is a lot of preparatory work that needs to happen. This alone defeats the purpose of redundancy. The sensitivity of data traffic has now gone to a level of stupidity that it's not enough to just configure the primary and secondary path and hope that the client will be able to send the traffic down the secondary path in case the primary path is down or detected to be down.

    Now that's the technology side of things. How about the financial side of the coin? How much money does BA have to spend (annually) to build and maintain a mirror of the BoHo? And now here comes the question akin to "what are the odds of winning the lottery?": How often will BA see a system-wide outage? And then throw in the equation of "how much will a system-wide outage cost BA?" and then one will come to a final conclusion that, with the event that just happened, it is still cheaper not to have a DR site.

    Apologies for the long post.

    1. Anonymous Coward
      Anonymous Coward

      Re: DR plans

      It can be done, it isn't cheap or easy.

    2. Alan Brown Silver badge

      Re: DR plans

      "to actually get down and test the DR plan by flipping off the power to the primary DC?"

      Anon BA staffers have already posted into other threads that's EXACTLY what used to be done.

      Whatever the borkage is, it's recent.

  52. Grumpy Rob

    Coarse colonial

    May not be relevant, but my favourite quote (not original) when systems go down is:

    Q: Why is a computer system like an erect penis?

    A: Because if you f**k with it it's going to go down.

    Seen it any number of times!

  53. Tim99 Silver badge
    FAIL

    Mandy Rice-Davis Applies*

    "Although various people have speculated that operations and jobs outsourced to India's Tata Consulting Services (TCS) contributed to the cockup, both the airline and TCS vehemently deny it."

    *He (They) would, wouldn't he (they)? Wikipedia Link.

    1. Anonymous Coward
      Anonymous Coward

      Re: Mandy Rice-Davis Applies*

      TCS couldn't have been a contributor as they take three days to respond to a major incident.

  54. Anonymous Coward
    Anonymous Coward

    How many CBRE staff does it take to fit a ceiling light bulb?

    5 - 1 holds the bulb while standing on a table, 4 spin the table

  55. Anonymous Coward
    Anonymous Coward

    My theory

    Outsourcing, Indian coding and WebSphere.

    1. Anonymous Coward
      Anonymous Coward

      Re: My theory

      Down vote as the Indians were nowhere near the scene.

      1. Anonymous Coward
        Anonymous Coward

        Re: My theory

        US Operator: Not likely to be the Indian site; their primary power is so bad that they are testing the fail-and-restore disaster plan several times a week.

        Also, most server racks split the three-phase power service and route three single-phase hot-neutral pairs to single-phase power supplies, to produce lots of high-amperage 3/5/12V DC. Electrical Codes have surge standards to handle "ordinary" inside-the-building shenanigans. They are useless against trees and construction equipment bridging two phases of 1300+ V (or one phase to the local distribution lines).

        And now for something completely different: Will European standard electric cars protect themselves when plugged into bad supply voltage? Ah, not so completely different--I lied.

        1. Alan Brown Silver badge

          Re: My theory

          "Electrical Codes have surge standards to take "ordinary" inside the building shenanigans. They are useless to handle trees and construction equipment bridging two phases of 1300+ V (or one phase to the local distribution lines)."

          But you design your switchgear on the incoming side for those possibilities regardless - and for the obvious one in many countries of a 6kV or 11kV distribution line falling on the 240V lines where the distribution is above ground and poles are susceptible to cars.

          Most DCs have a dedicated 11kV feed and local distribution transformers, but you'd be surprised at the kinds of _shit_ that comes up the power lines.

          It's not just the USA. UK power feeds are far from clean, with 1920s power standards still being acceptable in terms of dropouts, short brownouts and spikes. Our 2 large online UPS systems see an average of 5 notifiable events PER DAY. The stored kinetic energy (flywheel UPS) is used several times per week and the diesels are run in anger 3-5 times per month - mainly due to incoming power being well out of spec, rather than a complete outage.

  56. Anonymous Coward
    Anonymous Coward

    Application design not power

    I’m surprised the comments on here contain so much speculation on the way the power issue may have caused the problem. I think BA are lying and it wasn’t anything to do with power and was an application error.

    There have been previously reported IT problems at BA going back months, if not years. System slowdowns and crashes bringing staff close to tears. Then the system goes down during the busiest period on a bank holiday. That's not a coincidence.

    This is a poorly conceived cover up to hide the fact that offshoring IT to India has resulted in unstable systems.

    1. Anonymous Coward
      Anonymous Coward

      Re: Application design not power

      It was certainly an application failure. The failure to apply proper procedures and respond properly. Compounded by the PR failure and lack of control of the message.

      I understand that senior management went into lock down to resolve the issue, but why didn't they have a front man/woman to communicate this?

  57. JeffyPoooh Silver badge
    Pint

    UPSs suck

    They're not very reliable, and often *cause* problems, up to and including catching fire.

    It's like having a guard dog that sleeps through burglaries, chews the furniture, poops on the floor, and occasionally eats one of your children.

    Present UPSs often occupy the functional niche where something potentially useful and far less harmful should be.

    1. JeffyPoooh Silver badge
      Pint

      Re: UPSs suck

      Our building has had to be evacuated twice due to the smoldering UPS.

      And because only the servers were "protected" by the UPS, and the 100+ PCs were not, data was still lost with each power outage.

      UPSs are typically daft. An embodiment of human stupidity.

      One could imagine a UPS done correctly, and they obviously do exist. Somewhere.

      1. Anonymous Coward
        Anonymous Coward

        Re: UPSs suck

        "because only the servers were "protected" by the UPS, and the 100+ PCs were not, data was still lost with each power outage."

        Hmmm. Have your IT department been around long enough to have heard about "thin clients" (using servers and storage on the network), or even just ordinary desktop PCs with zero local file storage (just local apps, local display, local processor, and memory, and (presumably) local Windows licence)? Or is that too 1990s for them?

      2. Alan Brown Silver badge

        Re: UPSs suck

        "And because only the servers were "protected" by the UPS, and the 100+ PCs were not, data was still lost with each power outage."

        Modulo the point that leaving data on client machines is a "really bad idea"

        In that case, you need a whole building UPS. If you've got £3-1100k to throw at the job (depends on sizing) I can point to a few suppliers. These come as 20-40 foot shipping containers (again, depends on size) so be prepared to lose a couple of car parks.

    2. Alan Brown Silver badge

      Re: UPSs suck

      "They're not very reliable. And often *cause* problems, up to catching fire."

      Only if you don't maintain the things properly. They carry significant energy and MUST be respected. Don't put them in the same rack (or room) as the computing equipment, they need their own environment.

      And yes, I've had one "catch fire" - after being put into service following a 6 month furlough due to a blown power transistor.

      PHB of the techs had insisted it be stored outside because it was cluttering up the workshop, where it'd gotten damp - it was only under a tarp instead of being properly wrapped up. About 12 hours after it was hooked up to the load, it decided it didn't like its environment anymore and would make that fact clear by smoking up a storm.

      Staff reaction to the rancid smoke filling the building? They opened a few windows as they came in at 5am (This was a radio station). And it wasn't until management arrived at 8am that the building got evacuated.

  58. JimRoyal

    A sysadmin and a sparky are not the same. Don't send one to do the job of the other.

  59. IGnatius T Foobar
    FAIL

    This is 100% the fault of outsourcing.

    Outsourcing is 100% to blame. The outsourcing firm should be sacked, and the person who decided to use the outsourcing firm should be sacked, drawn, and quartered.

  60. sanmigueelbeer Silver badge
    Facepalm

    Push the red button, I dare you

    Yeah, we had that. We had an idiot do just that: He deliberately pushed the RED button and everything just went dark and quiet very, very quickly. (The emergency shutdown button was clearly labeled and could not be accidentally pressed as it was inside an enclosed structure.)

    On the other hand, it was also a good way to really, really test the redundancy of the systems and DR. Needless to say, it failed. Completely.

    Power was restored in 45 minutes after the button was pushed but the rest of the IT system took three to four hours to recover and all required manual intervention.

    If y'all think that was funny, the postmortem was even funnier. All the executives in charge of the different systems sat around the meeting table trying to explain to the chief of the IT department why the system took so long to recover after power was restored and why the redundancy didn't work. Guess what: the status quo was maintained. No one wanted to ask or answer why critical systems didn't have an automatic failover mechanism and required a large amount of manual intervention to get things moving. No one.

    Please note that the people in charge of the systems were 40% full-time staff; the rest were highly paid contractors.

    And I forgot to mention that the site was a 1200-bed hospital.

    1. Anonymous Coward
      Anonymous Coward

      Re: Push the red button, I dare you

      Which is why every organisation should have a DR expert who is authorised to ask the hard questions, especially where lives are at risk. If you think a risk manager will do, try asking him something technical.

      1. kwhitefoot

        Re: Push the red button, I dare you

        Asking the questions doesn't help. In my stint as a Unix sysadmin about 20 years ago, I pointed out that we couldn't tell whether our plans would actually work and asked for funding to run a disaster recovery exercise. The request wasn't denied, just ignored.

        1. Anonymous Coward
          Anonymous Coward

          Re: Push the red button, I dare you

          They may be idiots but it's their right to ignore the request. Politics always seems to override reason and logic.

    2. Anonymous Coward
      Anonymous Coward

      Re: Push the red button, I dare you

      "I forgot to mention that the site was a 1200-bed hospital."

      Wanna name names/locations? I will - this sorry tale is from a decade or so ago, when I was a routine visitor to North Hants Hospital (Basingstoke, UK), which is now around a thousand beds and wasn't that different a decade ago. It could've been various other places too.

      There's not even a big red button in this picture, and the whole hospital lost power.

      There was particularly extensive construction work going on around the site, externally and internally. On the occasion in question it was late afternoon (just after visiting time) at a time of year when late afternoon means relative darkness.

      The lights went out without notice, the sockets (and everything else that didn't have its own batteries) lost power, etc. Not good, but these things happen and are planned for, so staff were initially not too concerned.

      Unfortunately power+lighting was not restored in the timescale the staff expected. Nor was there any power to the small number of "critical services" including a few sockets. Stuff that had its own batteries might still be OK (e.g. some portable or life-critical stuff). Pretty much everything else - zilch.

      To add to the fun, the main corridor in the hospital was one of the areas being worked on, and a full height partition had been erected along much of its length. So it had its locally-powered emergency lighting, but the lighting was useless because various people had allowed an inappropriate partition to be built without preserving the emergency lighting facilities. So people couldn't even see where they were going.

      I realised just how unprepared people (staff, management, contractors, etc) were for something like this, and left as quickly as I could.

      Turns out from later leaks that an inadequately supervised digger working in the car park had taken out both the main grid incomer and the feed from the standby generators, which shared a common underground duct.

      How many things had to go wrong, how many people had to *not* do their jobs properly, to enable something like that to happen?

  61. BitEagle

    Management failure

    Not to dismiss any of the technical possibilities that have been discussed here, but the single most likely reason for such a catastrophic outage is that IT budgets have been shaved consistently over a period of years, to the point where all the senior IT managers understood that their staffing levels, processes and infrastructure were probably inadequate to survive a catastrophic failure or series of failures. They were equally aware, though, that telling their finance and operational colleagues this would probably result in their being side-lined, fired, down-sized or moved to "special projects"...

    For an organisation in this state of denial, the quarterly bottom line is everything, and long-term means only the next quarter. You could reasonably argue that this sort of failure is possibly the only way BA's IT investment could ever increase enough to address the long-term failure to invest responsibly.

  62. Stuart Castle

    He may be right when he says that BA's system administration is not outsourced to another country. That does not mean it's not outsourced though, and it does not mean that it being outsourced did not contribute to the problem.

    Regarding the comment someone made earlier about customers not getting compensation because BA outsourced the IT service: for the purposes of compensation (and any potential legal action), that may be irrelevant. The customer's contract (such as it is) is with BA. If an outside contractor is maintaining a system that BA relies on, and that system fails, preventing BA from providing a service, then it's up to BA to provide the compensation (and BA will also be the target of any legal action). They can then launch any actions needed to reclaim the money from their contractor.

  63. zosconsult

    I read through to the end and was rewarded by seeing the words TCS and "cockup" together in a sentence. Perfect.

  64. Alan Brown Silver badge

    The problem with this timeline

    Is that the system was reported to be issuing random destinations on boarding passes on Friday.

    The contractors have come out swinging too - "we do not recognise this scenario" is a polite way of warning BA that anything further is likely to result in legal action.

  65. Nicko

    Complete and utter rubbish... and Unforgivable.

    "...our source said, the power control software controlling the multiple feeds into the data centre didn’t get the switch between battery and backup generator supply right at the critical moment – or, potentially, if someone panicked and interrupted the automatic switchover sequence – it could have resulted in both the battery supply and the generator supply being briefly connected in series to the power bus feeding the racks. That would result in the data centre’s servers being fed 480v instead of 240v, causing a literal meltdown."

    Hogwash.

    No. No. NO. Any modern DC has a sequence where the utility feed comes into a box called an ATS (Automatic Transfer Switch). Under normal circumstances, ALL the equipment in the DC is powered from the UPS which is on-line at all times.

    The UPS has two roles: to maintain the output voltage at a steady 230V to cater for input supply power fluctuations, and to load-balance the input currents across the three supply phases (most modern UPS systems don't care much about output load balancing, though it's good practice to try to balance them to get the most out of the UPS).

    On-line UPS systems take the input power feed, convert it to DC to charge the batteries, then they have inverters that take the DC from the battery and convert it into clean AC for the data center. Unless the UPS is in bypass mode for maintenance, this is the case at all times for all modern data center UPS systems.

    If the utility feed fails, the ATS detects this and immediately sends a signal to the genset to start the generators.

    The UPS should have about 10 minutes of run time with no utility input - this is to cover the start-up of the genset, and if the genset doesn't start, time to send a signal to the servers that they need to do an orderly shutdown (this is done by s/w agents on the servers - signalling normally by IP).

    At NO TIME should there be a detectable fluctuation in the power to the data center - all that happens when the input utility supply fails is that the batteries in the UPS are no longer being charged and start to discharge (hence the 10 minute run-time) and the genset starts up.

    As soon as the ATS detects that the genset is producing the correct voltages, the utility supply (which has failed anyway) is automatically disconnected from the UPS and the genset output connected in its place - this happens in a fraction of a second, automatically, and again, there is NO interruption to the supply to the data center, as the servers are, as always, running off the batteries, not the UPS input supply.

    The UPS batteries are now being charged by the genset, and life continues as normal. Normally, gensets spin up and stabilise in a couple of minutes (our one, a 450kW jobbie on the roof, takes under two minutes). The genset should have at least 24 hours of fuel on-site.

    When the ATS detects that the utility supply is restored AND STABLE, it disconnects the genset from the input to the UPS and connects the utility supply in its place. After a few minutes of stable running, the genset is switched off.

    There is simply NO EXCUSE for a modern (read "last 10 years") DC to lose power the way BA did. Just unforgivable.
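
    For what it's worth, the sequence described above can be sketched as a toy state machine - purely illustrative, with made-up timings (the 10-minute battery runtime and two-minute genset start are the figures quoted in the comment, not real BA numbers):

    ```python
    from enum import Enum

    class Source(Enum):
        UTILITY = "utility"
        GENSET = "genset"

    class ATS:
        """Toy model of the failover sequence described above: the load
        always runs from the UPS inverter, and the ATS only selects what
        feeds the UPS input. All timings are illustrative assumptions."""

        UPS_RUNTIME_S = 600     # ~10 minutes of battery, per the post
        GENSET_START_S = 120    # genset spin-up: "under two minutes"

        def __init__(self):
            self.input_source = Source.UTILITY
            self.genset_running = False

        def on_utility_failure(self):
            # Batteries begin discharging; the load sees no fluctuation.
            self.genset_running = True               # ATS signals genset start
            if self.GENSET_START_S < self.UPS_RUNTIME_S:
                self.input_source = Source.GENSET    # seamless input switch
                return "running on genset"
            # Genset didn't stabilise in time: tell server agents to shut down.
            return "orderly shutdown signalled to servers"

        def on_utility_restored_and_stable(self):
            # Utility back AND stable: switch input back, then stop the genset.
            self.input_source = Source.UTILITY
            self.genset_running = False
            return "back on utility"
    ```

    The point the sketch makes is the same one the comment makes: the load never sees the transfer, because the transfer only changes what charges the batteries.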

    1. Alan Brown Silver badge

      Re: Complete and utter rubbish... and Unforgivable.

      "On-line UPS systems take the input power feed, convert it to DC to charge the batteries, then they have inverters that take the DC from the battery and convert it into clean AC for the data center. Unless the UPS is in bypass mode for maintenance, this is the case at all times for all modern data center UPS systems."

      For larger sites, Flywheel UPSes are the same (incoming mains drives the flywheel motor-generator), but allowable dropout time is usually in the region of 15-20 seconds.

      The ATSes are as described and gennies are on the input side of the flywheel.

      This gets around the _substantial_ issues associated with battery maintenance, but introduces its own dangers - a 2 ton magnetically levitated flywheel in a vacuum chamber with _that_ much stored kinetic energy is not something to mess with, lest it exit its chamber (and the building) at ~100mph.
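
      A quick back-of-envelope on why that flywheel is so scary - the 1 MW load figure is an assumption for illustration, not something from the post; only the 15-20 second ride-through and the 2 ton / ~100mph figures come from above:

      ```python
      # Energy needed to ride through the quoted 15-20 second dropout
      # window at an ASSUMED 1 MW data-centre load: E = P * t.
      load_w = 1_000_000          # assumed load, for illustration only
      dropout_s = 20              # upper end of the quoted ride-through
      stored_j = load_w * dropout_s
      print(stored_j / 1e6)       # prints 20.0 (megajoules)

      # For scale: a 2-tonne mass at ~100 mph (~45 m/s) carries
      # 0.5 * 2000 * 45**2 of translational kinetic energy -
      # a fraction of what the spinning rotor must hold.
      ke_j = 0.5 * 2000 * 45**2
      print(ke_j / 1e6)           # prints 2.025 (megajoules)
      ```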

      1. TheVogon Silver badge

        Re: Complete and utter rubbish... and Unforgivable.

        "For larger sites, Flywheel UPSes are the same (incoming mains drives the flywheel motor-generator), but allowable dropout time is usually in the region of 15-20 seconds."

        Dated technology these days. Gas fuel cells are usually the way to go: http://www.datacenterknowledge.com/archives/2012/09/17/microsoft-were-eliminating-backup-generators/

  66. Wzrd1

    Been there, done that, got the blasted tee shirt

    Once, at a secure US military installation which was key to all current wartime communications, the technical control facility manager decided to take the building's UPS offline and go direct to mains power, the unit being active:active at all times. The reason was simple and necessary: replacing a room full of dead UPS batteries.

    Regrettably, he only skimmed the instruction manual, didn't want to wait for the installation electrician and flipped the twisty switch.

    The entire server room went down hard. When he put the switch right (he was one position off from the correct setting), one key rack didn't come online and remained dark.

    At the time, this BOFH had been wearing the information assurance hat, but I'm also an experienced BOFH and a certified electronics technician in industrial automation and robotics. So reading industrial electrical blueprints is ancient news to me.

    "Where is the electrical blueprint?"

    Spreading several blueprints out on the floor, kneeling, tossing the incorrect diagrams aside, I rapidly located it (paraphrased, to protect NDA information): "Ah! Circuit breaker 57A, in bank 12F. Where is it?"

    Predictable look of confusion and consternation and disclaimers of such arcane knowledge.

    A swift heel and toe express around the battery/UPS room located the breaker - conveniently located behind a one-off bank of several hundred batteries, seriously out of view and traffic. Sure as can be, the breaker was tripped.

    There was one chance in three that I'd flip that breaker on my own authority, on a US military base, and worse, in wartime. Slim, fat and none.

    "OK, here's the culprit. *I* am not going to touch the damned thing, it's way outside of my job responsibilities and I won't accept responsibility. So, it's your ball. Wait for the installation electrician or push it yourself and *you* take any resultant heat for hardware failure."

    The manager considered, "It'll be two hours before the electrician gets here!" He switched the breaker off, then to the on position. The rack lit up.

    It took nearly 12 hours, and a very upset COMSEC custodian, to restore all services. Each crypto device required rekeying, which required the presence of said custodian to provide the appropriate USB (and other) keys.

    Six months before, we had a similar outage, due to a blown transformer and the aforementioned room full of dead batteries. A room that was ignored, right until a US General couldn't use his telephone, due to the outage.

    Suddenly, we had the budget to replace that which we had complained of twice weekly.

  67. anonymous boring coward Silver badge

    it could have resulted in both the battery supply and the generator supply being briefly connected in series to the power bus feeding the racks. That would result in the data centre’s servers being fed 480v instead of 240v, causing a literal meltdown

    Sounds weird wired.

  68. amh22

    As an IT professional, I don't believe any of the explanation. In the unlikely event that there is any truth in it then the CIO should be fired immediately as well as the head of IT.

    Neither has happened which reinforces my belief.

    For many years now (I worked on planning fail-over back in the late '90s, and it wasn't new then), no corporation as large as BA, relying as it does on IT systems, has gone without a triangulation system.

    Triangulation is never based on sites in close proximity, so where are BA's three sites? How could all three fail?

    It is unimaginable that BA does not have such a system in place but if it doesn't, Tata is almost certainly the consultancy that devised the backup plan.

    Whatever the reality, the CIO and head of IT have to take full responsibility and either resign or be fired. Anything less and you have two problems: 1. BA has proven it has no commitment to client service. 2. There is a rat and no one is admitting it, which comes back to 1.

    As reporters, you need to keep digging 'cause you're being sold a pup.

    1. Alan Brown Silver badge

      "Triangulation is never based on sites in close proximity so where are BA's three sites? How could all three fail?"

      2 of the three are within 2 miles of each other.

  69. Pat Harkin

    You said you wanted a 100% guaranteed failover system...

    ...so I configured it so that when one DC fails, the others fail automatically without human intervention.

  70. Anonymous Coward
    Anonymous Coward

    It's a shame

    That this story is spread across different pages on the Reg. There are valid comments and suggestions spread around.

  71. Captain Badmouth
    Paris Hilton

    Management bullshit

    "it could have resulted in both the battery supply and the generator supply being briefly connected in series to the power bus feeding the racks. That would result in the data centre’s servers being fed 480v instead of 240v, causing a literal meltdown"

    Has anyone here in the electrical/electronic/computer field ever heard of an automatic system being configured in such a manner? Power supplies capable of being connected in series? Parallel maybe..., but then with 3 phase you don't get 480v. Somebody with a 1+1=2 understanding of ac electrics has come up with a bullshit "press release" from a "source".
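
    The arithmetic is easy enough to sanity-check (using the nominal UK voltages; none of these figures are from BA):

    ```python
    import math

    # Two 240V supplies in series would indeed sum to 480V, which is
    # presumably where the "source" got the number. But paralleled
    # supplies share a voltage, and UK three-phase gives roughly 400V
    # line-to-line (230V * sqrt(3)), not 480V.
    series_v = 240 + 240                  # the quoted "meltdown" scenario
    parallel_v = 240                      # paralleled sources stay at 240V
    line_to_line_v = 230 * math.sqrt(3)   # ~398V for UK 3-phase

    print(series_v)                # prints 480
    print(round(line_to_line_v))   # prints 398
    ```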

    How many months before the truth comes out?

    Paris, very saucy. ( but probably knows more about electrics than the source)

  72. Anonymous Git

    "...two different data centres, both of which are no more than a mile from the eastern end of Heathrow's two runways. Neither is under the flightpaths....

    ... and from aerial views (no, we're not going to pinpoint it on a map for you) BoHo looks to be around about that size...."

    oh i see... that secret location.. just a little bit east of the two runways... just between the runways... and doesn't have all that coolant plant on the roof... gotcha... ssshhhh don't tell anyone.

    nothing like the article hinting to where it is located lol.

    https://tinyurl.com/ybk6b5hp

    1. Anonymous Coward
      Anonymous Coward

      Remember

      ...when the BA plane crashed just short of the runway?

      https://en.wikipedia.org/wiki/British_Airways_Flight_38

      Too close perhaps and thankfully not frequent.

  73. Anonymous Coward
    Anonymous Coward

    Normal accidents, I have nothing to add.

    "Normal Accidents" contributed key concepts to a set of intellectual developments in the 1980s that revolutionized the conception of safety and risk. It made the case for examining technological failures as the product of highly interacting systems, and highlighted organizational and management factors as the main causes of failures. Technological disasters could no longer be ascribed to isolated equipment malfunction, operator error or acts of God.

    Extracted from Wikipedia.

    Previous post referral

    Thanks to whoever referred us to this.

  74. Anonymous Coward
    Anonymous Coward

    Outsourcing IT support may not have caused....

    Outsourcing the IT to TCS may not have caused the outage, but may have contributed to the delay in the recovery of the IT systems, prolonging the outage.

    Too often we see companies penny-pinching in the wrong areas, and it comes back to bite them. The accountants look at staff numbers on a spreadsheet, see one UK permanent resource costing X and one offshore resource costing less than half, and choose the offshore one. I know it's not just about the daily cost - there are also the overheads that go along with permanent staff - but when you replace someone with 20+ years' experience of the airline, how it runs, and the IT systems in all their intricacy with a well-qualified offshore resource, you lose 20+ years of experience you can never get back.

    I'm not saying this would definitely have made a difference, but "you do the maths".

    BA / IAG seem hell bent on destroying what was once a great brand. BA is no longer a full service airline.

    The profits may be up, for now, but how long will passengers put up with all the reductions in service but not in fare?

    BA / IAG need a change of management and change of direction. Quality counts and is appreciated and remembered. Mention BA these days and they're used as an example of failure!

  75. Anonymous Coward
    Anonymous Coward

    Pretty quickly? Hardly - it was my week's holiday that was cancelled.

    From the Grauniad today.

    Willie Walsh

    How bad a week was it? “It wasn’t a week,” says Walsh swiftly. “The thing you’ve got to recognise is that BA was back to normal pretty quickly. The critical thing was to get moving again and get customers looked after. I think some of the criticism has been unfair – but it’s easy for me to say that.”
