BA's 'global IT system failure' was due to 'power surge'

British Airways CEO Alex Cruz has said the root cause of Saturday's London flight-grounding IT systems ambi-cockup was "a power supply issue*" and that the airline has "no evidence of any cyberattack". The airline has cancelled all flights from London's Heathrow and Gatwick amid what BA has confirmed to The Register is a " …


  1. PeterHP

    As a retired computer power & environment specialist: there is no such thing as a "power surge"; there are voltage spikes that can take out power supplies. This sounds like bullshit from someone who either does not know what they are talking about or is being fed a line. In my experience, most CEOs and IT directors did not know how much money the company would lose per hour if the systems went down, or whether it would stay in business at all, and did not consider the IT system important to the operation of the company. I bet they now know a DR system would have been cheaper. But they are not alone: I have seen many airline data centres that run on a wing and a prayer.

    1. vikingivesterled

      In fairness to Cruz, he didn't specify what the power surge (in layman's terms) or voltage spike (in engineer's terms) took out. It could have been something as simple as non-UPS'd air-con power supplies being destroyed, and a lack of environmental alarms going to the right people, leading to overheating before manual intervention. AC can also be notoriously difficult to fix quickly. I have myself used emergency blowers and roll-out ducts to cool an airline's overheating data centre, where the windows were sealed and unopenable to pass pressurised fire-control tests.

  2. ricardian

    Power spikes & surges

    This is something on a much, much smaller scale than the BA fiasco!

    I'm the organist at the kirk on Stronsay, a tiny island in Orkney. Our new electronic organ arrived four years ago and behaved well until about 12 months ago. Since then it has had frequent "seizures" which made it unplayable for about 24 hours, after which it behaved normally. When I spoke to an agent of the manufacturer, he asked if we had many solar panels or wind turbines. I replied that we did and that the number was growing quite quickly. He said that these gadgets create voltage spikes which can affect delicate electronic kit, and recommended a "spike buster" mains socket. Since fitting one of these the organ has behaved perfectly. I suspect that with the growing number of wind turbines and solar panels this sort of problem will become more and more noticeable.

    1. vikingivesterled

      Re: Power spikes & surges

      That would probably only be an issue when there is more power produced than can be consumed, meaning the island needs something that can instantly divert, consume or absorb the overproduced power, like a sizeable battery bank, a water/pool heater or similar. Alternatively, if it is not connected to the national grid, the base generator/device that creates the AC sync is not sufficiently advanced.

      1. ricardian

        Re: Power spikes & surges

        Our island is connected to the grid. Orkney has a power surplus but is unable to export the power because the cable across the Pentland Firth is already operating at full capacity. We do have a "smart grid" https://www.ssepd.co.uk/OrkneySmartGrid/

    2. anonymous boring coward Silver badge

      Re: Power spikes & surges

      Although on a smaller scale, an organ mishap can be very humiliating.

    3. Alan Brown Silver badge

      Re: Power spikes & surges

      "He said that these gadgets create voltage spikes which can affect delicate electronic kit "

      If they can 'affect delicate electronic equipment' then something's not complying with standards and it isn't the incoming power that's out of spec....

      Seriously. The _allowable_ quality of mains power hasn't changed since the 1920s. Brownouts, minor dropouts, massive spikes and superimposed noise are _all_ acceptable. The only thing that isn't allowed is serious deviation from the notional 50Hz supply frequency (60Hz if you're in certain countries).

      This is _why_ we use the ultimate power conditioning system at $orkplace - a flywheel.

      As for your "spike-buster", if things are as bad as you say, you'll probably find the internal filters are dead in 3-6 months with no indication other than the telltale light on it having gone out.

      If your equipment is that touchy (or your power that bad), then use a proper UPS with brownout/spike filtering such as one of these: http://www.apc.com/shop/th/en/products/APC-Smart-UPS-1000VA-LCD-230V/P-SMT1000I

  3. Anonymous Coward

    Power spikes etc.

    The difference between your situation and DCs is that they have the money to invest in a UPS which conditions the power as well.

    Your point is noted, though, as there are many endpoints that don't have the same facility.

  4. Anonymous Coward

    Outsourcery

    Now that the BAU function has been outsourced, the real bills will arrive. All the changes that are now deemed necessary will be chargeable, leading to a massive increase in the IT cost base.

  5. Florida1920
    Holmes

    "fixed by local resources"

    Translation: Two people. One to talk to India on the phone, the other to apply the fixes.

  6. Anonymous Coward

    Something else will crash on Tuesday.

    The share price.

  7. Dodgy Geezer Silver badge

    I see that El Reg is unable....

    ...to get ANY data leaked from the BA IT staff at all.

    One more advantage of outsourcing to a company which does not speak English...

    1. Anonymous Coward

      Re: I see that El Reg is unable....

      Is there an Indian version of El Reg?

  8. GrapeBunch

    Real-time redundancy is why Nature invented DNA.

  9. Milton

    Hands up ... if you believe this for a second

    Sorry, it won't wash. A single point of catastrophic failure, in 2017, for one of the world's biggest airlines, which relies upon a vast real-time IT system? A "power failure"?

    Even BA cannot be that incompetent. Pull the other one.

  10. anonymous boring coward Silver badge

    OK, so something failed. And they didn't have a working automatic failover. I get that. Embarrassing, and the CEO should go just for that reason.

    What I don't get is how it could take so long to fix it? It must have been absolute top priority to fix within the hour, with extra bonuses and pats on backs to the engineers who quickly brought it back up again. How could it take so long?

    1. pleb

      "It must have been absolute top priority to fix within the hour, with extra bonuses and pats on backs to the engineers who quickly brought it back up again. How could it take so long?"

      They had the engineers all lined up ready to execute the timely fix you speak of. Trouble was, they could not get them booked on a flight over from India.

  11. Anonymous South African Coward Bronze badge

    As an interesting aside, how does Lufthansa's IT stack up against BA's IT?

  12. Anonymous Coward

    My 20p/80p's worth

    It's 80 degrees and only 20% of staff are in over the long weekend

    80% of the legacy systems knowledge went when 20% of experienced staff were decommissioned

    80% of the time systems can handle 20% data over capacity

    120% of the UK's power is available from wind and solar, so 80% of coal/nuclear capacity is off-line

    20% cloud cover and wind dropping to 80% cause sudden massive drop in grid capacity…. causing large voltage spikes

    ‘Leccy fall-out agreements briefly swing into action; BA can use UPS + generators to cover this

    DC switches to UPS whilst only 80% of the 20% under-capacity generators spin up successfully - "the power surge"

    80% of current customer accounts lose a critical 20% of their data when the twin systems can't sync.

    An 80% chance this is 20% right or a 20% chance this is 80% right?

  13. Ian P

    The power (with its backups) will never fail so we don't plan for it.

    I guess it is just a blinkered approach. You convince yourself that the power will be fine, and so you ignore the case where there is a failure, hence the chaos when the system that will never fail actually fails. Is it the MD's fault? Yes, for hiring an incompetent IT manager. I'd replace the IT manager once the dust has settled. But are those crucial nuggets of information that he has in his head backed up?

  14. Anonymous Coward

    Brownout to brownpants

    My take: a "power surge" happened in the form of a brownout, probably triggered by IT.

    Simple scenario: support identify a requirement to do a needful update, which is automated via a management tool. The playbook for doing needful updates states the command sequence to execute; this is fat-fingered (or applied incorrectly) and, rather than rolling out progressively across the estate, it applies to everything at once (see the sketch below).

    Servers all start rebooting near-simultaneously, the inrush startup currents promptly overload the PDU/UPS/genset, and many failures follow: some physical hardware, some data corruption. Local support are asked to please revert; sadly it's Saturday, most are not at work, and many no longer work for the company.

    Fat finger brownout thus becomes a brownpants moment.
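
    As a sketch of the "progressive rollout" the scenario above says was skipped (nothing here is BA's or any vendor's actual tooling; host names, batch size and the health check are all invented), this is roughly what a batched, halt-on-failure rollout looks like:

    ```python
    # Generic sketch of a progressive rollout: patch a small batch, check health,
    # and stop at the first sign of trouble instead of hitting the whole estate.
    import random

    HOSTS = [f"srv{i:03d}" for i in range(1, 101)]   # invented host names
    BATCH_SIZE = 5

    def apply_update(host):
        print(f"patching {host}")

    def healthy(host):
        # Stand-in for a real post-patch health check (service up, no reboot loop, ...)
        return random.random() > 0.02

    def progressive_rollout(hosts):
        for i in range(0, len(hosts), BATCH_SIZE):
            batch = hosts[i:i + BATCH_SIZE]
            for host in batch:
                apply_update(host)
            failed = [h for h in batch if not healthy(h)]
            if failed:
                print(f"halting rollout, unhealthy hosts: {failed}")
                return False          # never proceeds to the rest of the estate
            print(f"batch {i // BATCH_SIZE + 1} OK")
        return True

    if __name__ == "__main__":
        progressive_rollout(HOSTS)
    ```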

  15. Anonymous Coward

    IT outsourcing

    I used to work at BA - not in IT, but in an area that worked with IT day in and day out.

    I, and many others, left as the airline and the capabilities that we loved gradually got ruined.

    Most of the IT guys and gals were offered generous packages to leave. Three "global suppliers" were chosen, and all work had to be given to them. They would pitch for it, and the cheapest nearly always won. Unsurprisingly, the good IT people took the packages and the less good stayed (some good stayed also, but not enough).

    SIP / CAP / CAP2 / FICO / FLY etc. are all complex systems, and when experience leaves, the level of support goes down considerably. I think they have probably cut too much; senior management don't have enough knowledge of IT to know when one cut is too many. The IT guys were resistant to change, so the head was cut off the snake, lots of yes-people remained, and this is where we end up (as well as the complete outages of the website recently).

    To say that this is unrelated to the removal of most of the people who knew how these systems worked is disingenuous. Two data centres, with backup power, so I fail to understand how one power surge could take out both of them independently. It sounds more like a failed update/upgrade by inexperienced staff, and then a lack of experienced staff around to fix it.

    Such a shame.

  16. Mvdb

    A summary of what went wrong inside the BA datacenter is here:

    http://up2v.nl/2017/05/29/what-went-wrong-in-british-airways-datacenter/

    I hope BA will make the root cause public so the world can learn from it. However, I doubt BA will actually do this.

  17. 0laf
    Black Helicopters

    From another forum and a friend of a friend that works with BA IT.

    The outsourcer was told to apply security patches, which they did, and then power-cycled the whole datacenter.

    When it came back up, the power spiked and popped many network cards and memory modules.

    The outsourcers lacked experience in initiating the DR plan, and it didn't work. Or maybe DR wasn't in the contract.

    True or not, I dunno.

  18. Anonymous Coward

    Soft target?

    With all the terrorist risks to add to the natural causes and cock-ups that will happen, I find it surprising that the locations of the BA DCs are known. Even some idiot loser can work out that hitting the data centres will somehow have an impact out of all proportion to the cost. That being the case, why doesn't BA have a plan that works?

  19. Mvdb

    Another update to my reconstruction of what went wrong:

    A UPS in one of BA's London datacenters failed for some reason. As a result, systems went down. Power was restored within minutes, however not gradually; as a result, a power surge happened which damaged servers and networking equipment. This left many systems down, including the Enterprise Service Bus.

    http://up2v.nl/2017/05/29/what-went-wrong-in-british-airways-datacenter/

    Big question: why wasn't a failover to the other datacenter initiated?

    1. Anonymous Coward

      Probably

      .. because they estimated the failover would take longer than fixing it on site. Clearly the wrong decision was made. Or perhaps they had no faith in their DR plan.

    2. Snoopy the coward

      Messaging systems failed to sync...

      From what I have read, they did a failover but just couldn't resync again, meaning they don't have a point-in-time recovery capability for their messaging systems. Not sure what messaging systems BA are using, but I know MQ can recover quite easily: it will resend what has failed and will not resend a duplicate.

      But anyway, I think the applications are linked in a very complicated manner and a failover needs to be done in a very strict sequence, or else it will ruin everything, requiring a restore from tape to recover. So the initial failure was the power surge which destroyed some hardware; the failover was initiated but just couldn't continue from where it went down, thus requiring many hours of manual recovery work to get it up again (see the sketch below).
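
      To illustrate the "will not resend a duplicate" point above: a minimal, purely illustrative Python sketch (nothing MQ- or BA-specific; the class, table and message IDs are all invented) of a consumer that survives a post-failover replay by deduplicating on message ID before applying anything.

      ```python
      # Minimal sketch of idempotent message handling after a failover replay.
      # The producer resends anything unacknowledged; the consumer records the
      # IDs it has already applied so a resend is ignored rather than applied twice.
      import sqlite3

      class IdempotentConsumer:
          def __init__(self, db_path="processed.db"):
              # Durable record of processed message IDs (stand-in for a real
              # message store / transaction log).
              self.db = sqlite3.connect(db_path)
              self.db.execute("CREATE TABLE IF NOT EXISTS processed (msg_id TEXT PRIMARY KEY)")

          def handle(self, msg_id, payload):
              if self.db.execute("SELECT 1 FROM processed WHERE msg_id = ?", (msg_id,)).fetchone():
                  return "duplicate ignored"            # replayed after failover: skip
              self.apply(payload)                       # business logic, e.g. update a booking
              self.db.execute("INSERT INTO processed (msg_id) VALUES (?)", (msg_id,))
              self.db.commit()                          # acknowledge only once durably recorded
              return "applied"

          def apply(self, payload):
              print("applying:", payload)

      if __name__ == "__main__":
          c = IdempotentConsumer(":memory:")
          print(c.handle("m1", {"booking": "ABC123"}))  # applied
          print(c.handle("m1", {"booking": "ABC123"}))  # duplicate ignored (resend)
      ```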

  20. Captain Badmouth

    Willie Walsh

    on BBC now, still blaming a failure of electrical power to their IT systems. We know where the problem occurred, he says.

    BBC report that industry experts remain sceptical.

    1. Grunt #1

      Re: Willie Walsh

      Mr White Wa(l)sh.

      "We know what happened but we're still investigating why it happened and that investigation will take some time," he said.

      - We're hoping some other sucker is in the headlines when we publish.

      "The team at British Airways did everything they could in the circumstances to recover the operation as quickly as they could."

      - The recovery they performed was no doubt a fantastic job which pulled BA out of a tailspin at the last minute. The real question is what caused the tailspin.

  21. Anonymous Coward

    There were companies in the WTC on 9/11 with redundant DCs in New Jersey. The backup DC took over, they didn't lose any data, the file system didn't drop buffers on the floor, etc. And it wasn't Windows or UNIX based. The technology is out there but people don't like "old" proprietary systems... except when it saves them money.

  22. Snoopy the coward

    Surge caused by?

    Since there was no surge reported from the supply grid, the surge must have been caused by some heavy equipment; I can only think of the air-conditioning systems. Someone had left all the switches of the computers and cooling systems in the ON position, so when the power resumed, the air-conditioning systems caused a huge power surge, destroying many computer circuit boards. Experienced staff seem to be lacking in the BA datacentre, probably laid off to cut costs.

    1. Anonymous Coward

      Looks like an electrician struck again

      https://www.theguardian.com/business/2017/jun/02/ba-shutdown-caused-by-contractor-who-switched-off-power-reports-claim

      1. Anonymous Coward

        Re: no functioning disaster recovery

        From comments on that Grauniad article:

        "I want to see Willie Walsh get asked about this by a real tech journalist - maybe one from The Register or similar. "Mr Walsh, is it true that your global IT infrastructure has no functioning disaster recovery? If so, how soon do you think this will happen again?""

        No link, I don't want to get the Interwebs into an Nth order binary loop and make matters worse for outsiders than they already are.

        Making matters worse for BA/IAG management (and staff, and passengers) is a job that management are apparently well qualified for.

        1. Anonymous Coward

          Re: no functioning disaster recovery

          Many papers and websites have used The Register as their source. Perhaps BA should send their excuses to The Register to ensure the correct message gets out.

  23. anonymous boring coward Silver badge

    Could someone please explain why power should "spike" when, as the story goes, everything was started at once? In my mind there would be a rush of current leading to a brown-out condition.

    Perhaps it's a "power demand spike" that is being referenced?

    But these statements seem more aimed at conjuring up images of dangerous voltage spikes entering the system and blowing things up, like some episode of Star Trek, or Space: 1999, where CRTs tended to explode.

    After a complete power failure, presumably equipment would need powering up in a controlled manner?

    That must all be part of the specifications for the system, and should happen more or less automatically. It seems unlikely that all systems would power up simultaneously and overwhelm the supplies? (See the sketch below.)
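
    On the controlled power-up point: a minimal, purely illustrative sketch (the rack names, wave size and power_on() stand-in are all invented; real sequencing would live in the facility's procedures or DCIM tooling) of staging equipment back up in small waves so startup demand never lands on the supply all at once:

    ```python
    # Staged power-up sketch: energise dependencies first (storage, network),
    # then compute, in small waves with a settle delay so inrush/startup
    # current decays before the next wave is switched on.
    import time

    RACKS = {
        "storage": ["san-01", "san-02"],
        "network": ["core-sw-01", "core-sw-02"],
        "compute": [f"app-{i:02d}" for i in range(1, 13)],
    }

    WAVE_SIZE = 4        # hosts energised per wave
    SETTLE_SECONDS = 5   # kept short for the demo; in reality minutes per wave

    def power_on(host):
        # Stand-in for whatever actually energises a host (iLO/iDRAC, smart PDU outlet, ...)
        print(f"powering on {host}")

    def staged_power_up():
        for tier in ("storage", "network", "compute"):
            hosts = RACKS[tier]
            for i in range(0, len(hosts), WAVE_SIZE):
                wave = hosts[i:i + WAVE_SIZE]
                for host in wave:
                    power_on(host)
                print(f"{tier}: wave {wave} up, settling {SETTLE_SECONDS}s")
                time.sleep(SETTLE_SECONDS)

    if __name__ == "__main__":
        staged_power_up()
    ```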
