BA IT systems failure: Uninterruptible Power Supply was interrupted

An IT bod from a data centre consultancy has been fingered as the person responsible for killing wannabe budget airline British Airways' Boadicea House data centre – and an explanation has emerged as to what killed the DC. Earlier this week Alex Cruz, BA's chief exec, said a major "power surge" at 0930 on Saturday 27 May …

  1. Anonymous Coward
    Anonymous Coward

    If it got interrupted...

    Then it's not a UPS. It's a DHL. Dumbass High-risk Liability.

    1. Dan 55 Silver badge
      Holmes

      Re: If it got interrupted...

      First we know everything was slowing down. Maybe they decided they couldn't fix it live and wanted to force a failover.

      The Times suggests a big red button was pressed in the data centre by a contractor and the power went down. That might be when BA claimed there was a power failure.

      That would be the point when the failover failed. Perhaps that is why the CEO said something about there being millions of messages, although he seems to have stopped saying that now, maybe because it suggests there's something wrong with their IT.

      Then I guess they tried to bring the data centre back up, and it looked like the bridge of the Enterprise, shaking about, staff falling to the floor, and smoke everywhere. That would be the power surge.

      I wonder how long it was since power and switching to secondary or backup data centres were tested.

      1. sad_loser
        FAIL

        Re: If it got interrupted...

        The issue is: what was someone doing in the DC playing with buttons they should not have had access to?

        If your IT workforce is all in house then you don't get contractors wandering around unsupervised.

        1. Anonymous Coward
          Anonymous Coward

          Re: If it got interrupted...

          Electrical installation is rarely done in-house and is quite a specialised task. You'd have to be special to cock this up.

          I can see it now.

          Electrician's apprentice - Whoops, why has it gone quiet?

          Foreman - Quick, restart everything and get out before anyone notices.

          1. Anonymous Coward
            Anonymous Coward

            Re: If it got interrupted...

            Oh, that takes me back some years. New standby generator went in on Friday night/Saturday morning. We're on site, shut down everything, new generator in place, all well, so stuff gets brought back up. 9:30 am, apprentice sparky cut a wire that caused the whole thing to shut down. I got a call from security, arrived at 9:45 am (I stay close) and the place was like the Mary Celeste. Open cans of diesel for the generator, warm cups of tea and not a bloody person on site.

            1. Mark 85 Silver badge

              Re: If it got interrupted...

              Similar to where I worked, except the backup power gen came on, picked up the load and then promptly died. Seems there wasn't much fuel in the tank. Maintenance guy was fingered for it as it was in his job description to fuel the generator and keep it topped off. The lesson was that firing off the generator once a week for 10 minutes to test uses fuel... duh!!!!!!

          2. Version 1.0 Silver badge

            Re: If it got interrupted...

            The electrical generators are almost certainly three-phase generators ... the trick here is connecting the three phases in the right order. I saw a generator test years ago in Oxford fail on the initial installation test after the phases were connected incorrectly. The generator spun up, and as the power switched over there was one heck of a bang and a lot of smoke ... and no more electricity.

            1. GettinSadda

              Re: If it got interrupted...

              > "I saw a generator test years ago in Oxford fail"

              Yeah... I was told of a similar incident, but in a power station. When the station is powered on it needs to sync to the grid before linking as it is vital that not only the frequency matches exactly, but also the phase. The traditional way to do this was with a dial showing the phase-error. Apparently when the plant was down for maintenance they also had cleaners in to give the control room a going over. One of these cleaners discovered that it was possible to unscrew the glass fronts of the dials to clean the glass. In the process they knocked off the needle, and replaced it... 180 degrees out of phase. When the power station was brought back online the generators apparently detached themselves from the floor... with considerable (i.e. demolition-grade) force!
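              Why phase matters as much as frequency can be shown with a little arithmetic. The sketch below assumes two ideal sine sources of equal amplitude and equal frequency, which is a simplification of a real generator and grid; the function name is made up for illustration. The voltage across the open breaker is then 2·V·|sin(Δφ/2)|, so a needle refitted 180 degrees out puts twice the nominal voltage across the contacts at the moment of closing.

```python
import math


def breaker_voltage_ratio(phase_error_deg: float) -> float:
    """Amplitude of the voltage across an open breaker, as a multiple of
    the nominal amplitude, for two equal-amplitude, equal-frequency sine
    sources that differ only by phase_error_deg (idealised assumption)."""
    return 2.0 * abs(math.sin(math.radians(phase_error_deg) / 2.0))


if __name__ == "__main__":
    # In phase, slightly out, badly out, and exactly opposed.
    for error in (0, 10, 60, 180):
        ratio = breaker_voltage_ratio(error)
        print(f"{error:>3} deg phase error -> {ratio:.2f} x nominal")
```

              In phase the difference is zero and the breaker closes quietly; at 180 degrees it is at its maximum, which fits the demolition-grade outcome described.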

        2. Anonymous Coward
          Anonymous Coward

          Re: If it got interrupted...

          IT staff rarely go near the electrical stuff, it's far too dangerous for that.

          1. Tim Jenkins

            Re: If it got interrupted...

            "IT staff rarely go near the electrical stuff, it's far too dangerous for that."

            As a significant percentage of BOFH plotlines have taught us ; )

          2. Dwarf Silver badge

            Re: If it got interrupted...

            "IT staff rarely go near the electrical stuff, it's far too dangerous for that."

            Er, IT staff work on things that are run off the very same electrical stuff. I do hope you are not implying that data centre grade equipment is too dangerous?

            Having said that, even a complete Muppet can hurt themselves with nothing more than a mildly sharp stick or an LR44 button cell battery that looks like a sweetie. Like everything, it's all down to training and understanding the job and the risks of the job. Take a look on YouTube for the chaps that work on the live 500 kV power lines, or the guys that maintain the bulb at the top of the radio towers.

            Everyone should have seen all the warning signs on the way into the facility (that tick the boxes in the H&S assessment) and had the prerequisite training about safe escape routes if gas discharge occurs (no, not that gas, the other one); the presence of 3-phase power; the presence of UPS power; various classes of laser optics; automated equipment such as tape libraries that can move without warning; and of course the data centre troll who's not been seen for a couple of weeks now. Oh, and of course, the ear defenders due to the noise, plus the phone that you can't hear as it's too noisy.

            My point is that data centres are no worse than any other environment - like maintaining a car engine or running a mower in your garden.

            1. Anonymous Coward
              Anonymous Coward

              Re: If it got interrupted...

              Point taken.

              The heavy electrical stuff and switch rooms should be kept under lock and key at all times. In every DC I have been in, the techies were not allowed near these. They were expected to be familiar with all the safety items you mention.

        3. Dan 55 Silver badge

          Re: If it got interrupted...

          Who said he shouldn't have had access to the buttons? He's the electrician.

          I bet he was called in and told to pull the plug as a consequence of the system grinding slowly to a halt yet not switching over to secondary. It can't be a coincidence.

          1. Stoneshop Silver badge
            FAIL

            Re: If it got interrupted...

            "I bet he was called in and told to pull the plug as a consequence of the system grinding slowly to a halt yet not switching over to secondary."

            If you really want to force a failover that way, you do so by shutting down the small number of systems that would cause the monitoring system to detect a "critical services in DC1 down, let's switch to DC2". If you can't log in to those systems because of system or network load you connect to their ILO/DRAC/whatever, which is on a separate network, and just kill those machines. If the monitoring system itself has gone gaga because of the problems, you restart that, then pull the rug out from under those essential systems. Or you cut connectivity between DC1 and the outside world (including DC2), triggering DC2 to become live, because that would be a failure mode that the failover should be able to cope with.

            You. Do. Not. Push. The Big. Red. Button. To. Do. So.

            Ever.
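            The controlled alternative described above, where monitoring decides that DC1 has lost critical services or connectivity and DC2 should go live, can be sketched roughly as follows. This is a minimal illustration, not anyone's actual monitoring setup: the `SiteStatus` type, the service names and the "any critical service down" rule are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SiteStatus:
    """Hypothetical view of the primary site as seen by the monitoring system."""
    critical_services_up: Dict[str, bool]  # service name -> currently healthy?
    reachable_from_dc2: bool               # is the DC1 <-> DC2 link alive?


def should_fail_over(dc1: SiteStatus) -> bool:
    """Decide whether DC2 should go live.

    Assumed rule: fail over if DC1 is cut off from DC2, or if any of the
    critical services in DC1 is down. Non-critical hosts are not consulted.
    """
    if not dc1.reachable_from_dc2:
        return True
    return not all(dc1.critical_services_up.values())


if __name__ == "__main__":
    healthy = SiteStatus({"booking": True, "messaging": True}, True)
    killed = SiteStatus({"booking": False, "messaging": True}, True)
    isolated = SiteStatus({"booking": True, "messaging": True}, False)
    for label, status in (("healthy", healthy), ("killed", killed), ("isolated", isolated)):
        print(label, "->", "fail over" if should_fail_over(status) else "stay on DC1")
```

            Either trigger is something an operator can cause deliberately, by killing a handful of machines over the management network or by cutting the inter-site link, and neither involves removing power from the whole hall.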

            1. Nolveys Silver badge

              Re: If it got interrupted...

              > You. Do. Not. Push. The Big. Red. Button. To. Do. So.

              > Ever.

              I am Groot?

        4. Mark York 3 Silver badge
          Holmes

          Re: If it got interrupted...

          At one facility I worked at & I don't have the full story of why......

          The offshore support insisted on the former plant Sysadmin hitting the plant's BRB. Pictures were sent to the remote guy via email; he confirmed that was the button he wanted to be pushed & goodness gracious me it was going to be pushed. He was advised again of what it was that would be pushed & the consequences; the plant manager was dutifully informed of what was required, what the offshore wanted & what would be the fallout.

          & so it came to pass that the BRB was pushed on the word of the Technically Competent Support representative.

          (Paraphrasing here........)

          "Goodness gracious me, Why your plant disappearing from network?"

          "Because the BRB you insisted that was the button you wanted pushing, despite my telling you that it was never to be pushed under pain of death has just shut down the entire plant."

          I think it took until about 15 minutes before production was due to commence the following day to get everything back up & running.

        5. TheVogon Silver badge

          Re: If it got interrupted...

          "playing with buttons they should not have had access to."

          EPO buttons are easily accessible. That's the whole point of them as an emergency safety feature. Usually near the door in each DC hall...

          1. Stoneshop Silver badge
            Facepalm

            Re: If it got interrupted...

            "Usually near the door in each DC hall..."

            But not so near that they can be mistaken for a door opener button by the dimmest of dimwits. At chest/shoulder height and at least a few steps away from the door appears to me the most sensible location.

            That said, I've seen a visitor who shouldn't have had access to the computer room in the first place look around, totally fail to see the conveniently located, hip-height blue button at least as large as a BRB next to the exit door, and kill the computer room because a Big Red Button high up the wall and well away from the exit is obviously the one to push to open the door for you.

            Unfortunately, tar, feathers and railroad rails are not common inventory items in today's business environment; rackmount rails are too short and flimsy for carrying a person.

            1. BagOfSpanners

              Re: If it got interrupted...

              In my office the button to open the exit door is right next to the Fire Alarm button (which has no guard). There are also light switches and other visual clutter nearby. At the end of a long tiring day I've sometimes come close to pressing the wrong button.

            2. Anonymous Coward
              Anonymous Coward

              Re: If it got interrupted...

              In my day, one of the first things pointed out to me was the BRB and the circumstances under which it can be used without being fired.

      2. pdh

        Re: If it got interrupted...

        > I wonder how long it was since power and switching to secondary or backup data centres were tested.

        I think it's been about a week.

        1. Anonymous Coward
          Anonymous Coward

          Re: If it got interrupted...

          Somebody please ask BA if they do this.

      3. Paul Hovnanian Silver badge

        Re: If it got interrupted...

        "The Times suggests a big red button"

        These exist in many data centers. But they are not intended for normal, sequenced shutdowns or to initiate failover to backups. They are usually placed near the exits and intended to be hit in the event of a serious problem like a fire. They trip off all sources _Right_Now_ and don't allow time for software to complete backups or mirroring functions.

        *Usually for events that dictate personnel get out immediately.

        1. Anonymous Coward
          Anonymous Coward

          Re: If it got interrupted...

          My last employer had the big red shutdown button conveniently located next to the exit door. Unfortunately just in the position that the door open button would be. One lunchtime a visiting engineer carrying a few boxes of spare parts accidentally pressed it trying to open the door...

          1. TheVogon Silver badge

            Re: If it got interrupted...

            "One lunchtime a visiting engineer carrying a few boxes of spare parts accidentally pressed it trying to open the door..."

            They usually have a plastic cover. And a large label....

            1. John R. Macdonald

              Re: If it got interrupted...

              @TheVogon

              One installation I worked in learned the plastic cover and label thingy the hard way (the hapless third party support techie who pushed the BRB instead of the door opener was banned permanently from the site to boot).

              1. Vladimir Nicolici

                Re: If it got interrupted...

                I think the best solution to prevent accidental use is to have 2 big red buttons. And to require both to be simultaneously pushed to trigger the power shutdown.

                In fact I saw a UPS product having exactly that feature, two EPO buttons that you needed to push simultaneously to shut it down.
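                The two-button idea boils down to a small interlock rule: power is only cut if both buttons are pressed within a short window of each other. A rough sketch follows, in which the function name, the timestamp inputs and the 0.5 second window are all assumptions for illustration, not the behaviour of any particular UPS product.

```python
from typing import Optional


def epo_triggered(press_a: Optional[float],
                  press_b: Optional[float],
                  window_s: float = 0.5) -> bool:
    """Return True only if both EPO buttons were pressed close together.

    press_a / press_b are the times (in seconds) at which each button was
    last pressed, or None if that button has not been pressed at all.
    window_s is the assumed maximum gap that still counts as simultaneous.
    """
    if press_a is None or press_b is None:
        return False  # a single stray press does nothing
    return abs(press_a - press_b) <= window_s


if __name__ == "__main__":
    print(epo_triggered(12.0, None))   # one button only: elbow, box or wrong-door press
    print(epo_triggered(12.0, 12.2))   # both together: deliberate shutdown
    print(epo_triggered(12.0, 40.0))   # two unrelated presses far apart
```

                A single accidental press is ignored, while a deliberate two-handed press still cuts power without any further confirmation step.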

                1. Grunt #1

                  Re: If it got interrupted...

                  Fine in principle, but that assumes it is a planned event, it's an EPO for a reason.

                  Better to have someone knowledgeable watching over the contractor.

                  1. Vladimir Nicolici

                    Re: If it got interrupted...

                    Having two buttons doesn't mean a single person can't operate them. You can place the two buttons close enough for that. But it ensures the person operating them really knows what they're doing, and isn't randomly pressing buttons.

                    Of course, if the purpose of having such buttons is to allow even untrained people to shut everything down in case of emergency, it would complicate things. But a large warning message, for example "In case of fire, you need to press these two buttons at the same time!" should take care of that as well.

                    1. Anonymous Coward
                      Anonymous Coward

                      Re: If it got interrupted...

                      The real answer is always train people before letting them in.

                  2. Anonymous Coward
                    Anonymous Coward

                    Re: If it got interrupted...

                    @Grunt

                    But then you need another contractor to do the knowledgeable one's job

            2. Mike Richards Silver badge

              Re: If it got interrupted...

              They cost extra.

        2. macjules Silver badge

          Re: If it got interrupted...

          "No determination has been made yet regarding the cause of this incident. Any speculation to the contrary is not founded in fact."

          Which would of course not be a problem for the Daily Mail. The Paul Nuttall of the newspaper world.

      4. circusmole

        Re: If it got interrupted...

        The first thing I would ask for would be the fail over test schedule and the resulting reports on how they went (if they did any).

    2. fidodogbreath Silver badge

      Re: If it got interrupted...

      ...then a BOFH and PFY will soon get (yet another) new Boss.

    3. Anonymous Coward
      Anonymous Coward

      Re: If it got interrupted...

      My guess about the initial failure: ATS left in bypass or failed on power transfer. 15 mins for someone authorised to manually switch ATS to good power source or bypass failed ATS. UPS/generators not specified for the full startup load of the whole data center (or too much new kit added to the DC), which then failed again.

      Some systems probably started up and began a re-sync then the high load crapped out the generators returning everything to silence again, leaving the replication in an unknown state when systems were manually restarted over a longer period to manage the initial load.
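      The point about startup load can be made concrete with a toy planner that brings systems up in batches, so that the startup draw of the current batch plus the steady draw of everything already running stays within what the generators can supply. Everything in it is assumed for illustration: the kW figures, the system names and the simple greedy rule are not taken from the BA incident.

```python
from typing import List, Tuple

# (name, running_kw, startup_kw): startup draw is assumed to exceed running draw.
System = Tuple[str, float, float]


def plan_startup_batches(systems: List[System], capacity_kw: float) -> List[List[str]]:
    """Greedily group systems, in the given order, into startup batches.

    A batch is closed as soon as adding the next system would push
    (steady load of earlier batches + startup load of this batch)
    above capacity_kw. Raises ValueError if a system cannot be started
    even on its own.
    """
    batches: List[List[str]] = []
    settled_kw = 0.0            # steady-state load of batches already up
    batch: List[System] = []    # systems in the batch being assembled

    for system in systems:
        name, _running_kw, startup_kw = system
        batch_startup_kw = sum(s[2] for s in batch)
        if settled_kw + batch_startup_kw + startup_kw > capacity_kw:
            if batch:
                # Close the current batch; once up, it only draws running load.
                batches.append([s[0] for s in batch])
                settled_kw += sum(s[1] for s in batch)
                batch = []
            if settled_kw + startup_kw > capacity_kw:
                raise ValueError(f"{name} cannot be started within {capacity_kw} kW")
        batch.append(system)

    if batch:
        batches.append([s[0] for s in batch])
    return batches


if __name__ == "__main__":
    kit = [("storage", 20.0, 50.0), ("db", 30.0, 60.0), ("web", 10.0, 25.0)]
    print(plan_startup_batches(kit, capacity_kw=100.0))
    print(plan_startup_batches(kit, capacity_kw=200.0))
```

      With plenty of headroom everything can start at once; with a tight supply the same kit has to come up in stages, which is the slow manual restart described above.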

    4. TheVogon Silver badge

      Re: If it got interrupted...

      Maybe Emergency Power Off resembles the Indian for Light Switch?

    5. Steve Channell
      Facepalm

      A low availability cluster

      My bet would be that a cluster failover was initiated by the power failure, then a fail-back was manually triggered, but the primary site failed again with the power surge, starting the secondary systems. With a manual failback, an engineer would be needed to fail over again, and not just a bargain basement operator.

  2. Bronek Kozicki Silver badge
    Thumb Up

    wannabe budget airline British Airways

    c'mon guys, that's just ... cheap

    1. Ochib
      Joke

      Not as cheap as EasyJet.

    2. PNGuinn Silver badge

      "wannabe budget airline British Airways

      c'mon guys, that's just ... cheap"

      Sorry? I thought budget airlines were supposed to be cheap?

      Perhaps you mean cruel.

  3. nuked
    Mushroom

    Tech Support: "Have you tried unplugging it and plugging it back in again"

    1. Martin Summers Silver badge

      Yeah but at least that normally actually works!

  4. Whitter
    Headmaster

    ... not due to outsourcing...

    If you don't know why it happened, then I doubt you know it was not due to outsourcing.

    1. iRadiate

      Re: ... not due to outsourcing...

      That's false logic. I personally don't know why the Russians didn't send a man to the moon but I know for definite that they didn't. Now fuck off.

      1. Yet Another Anonymous coward Silver badge

        Re: ... not due to outsourcing...

        Or they did but covered it up .....

        1. Anonymous Coward
          Anonymous Coward

          Re: ... not due to outsourcing...

          Sure, the outage itself may not have been due to outsourcing.

          The extreme time needed to get things running though... that has outsourcing written all over it.

      2. LDS Silver badge

        't know why the Russians didn't send a man'

        Oh well, they were pretty secretive back then, hiding whole cities from maps, and restricting access harshly. Just as BA would like to be today.

        Anyway it was because their Moon rockets couldn't be launched without failing quickly.

      3. Anonymous Coward
        Anonymous Coward

        Re: ... not due to outsourcing...

        You may be right, but you need to be nice.

  5. Anonymous Coward
    Anonymous Coward

    If the people that manage the servers are from TCR and they were unable to recover from the power failure in a reasonable amount of time then I deduce that they are at fault. Maybe not for the initial outage but the subsequent problems. They would also be responsible for the disaster recovery procedures so the fact it all failed in the first place also lies with them.
