BA's 'global IT system failure' was due to 'power surge'

British Airways CEO Alex Cruz has said the root cause of Saturday's London flight-grounding IT systems ambi-cockup was "a power supply issue*" and that the airline has "no evidence of any cyberattack". The airline has cancelled all flights from London's Heathrow and Gatwick amid what BA has confirmed to The Register is a " …

    1. Anonymous Coward

      Re: Comment from a Times article.

      Aside from the fact that you don't physically power off servers to install software upgrades, that sounds completely plausible.

    2. Stoneshop

      Re: Comment from a Times article.

      Unfortunately, computers in these data centres are used to being up and running for lengthy periods of time.

      True.

      That means, when you restart them, components like memory chips and network cards fail.

      Nonsense; only if you power-cycle them. Just rebooting without power-cycling doesn't matter to memory or network cards. Processors and fans may be working closer to full load while booting than at average load, and with that the PSUs will be working harder, but your standard data centre gear can cope with that.

      Compounding this, if you start all the systems at once, the power drain is immense and you may end up with not enough power going to the computers

      Switching PSUs have the habit of drawing more current from the mains as the voltage drops, which will cause the voltage to drop even more, and so on, until they blow a fuse or a circuit breaker trips. But as this lightens the load on the entire feed, it's really quite hard to get a DC to go down this way.
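      A back-of-envelope sketch of that effect (illustrative numbers, not anyone's actual kit): a switch-mode PSU behaves roughly as a constant-power load, so the current it draws climbs as the mains voltage sags.

      def psu_current(power_w, mains_v):
          # Idealised constant-power load: I = P / V
          return power_w / mains_v

      for volts in (230, 220, 200, 180, 160):
          # current rises as the voltage drops
          print(f"{volts:>3} V -> {psu_current(500, volts):.2f} A")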

      - this can also cause components to fail. It takes quite a long time to identify all the hardware that failed and replace it.

      Any operational monitoring tool will immediately call out the systems that it can't connect to; the harder part will be getting the hardware jocks to replace/fix all affected gear.

  1. Paul Hovnanian Silver badge

    Power Supply Issue

    Someone tripped over the cord.

  2. N2

    And don't try to re-book...

    No worries there, Señor Cruz, we won't, so we won't have to bother your call centre either.

  3. simonorch

    memories

    It reminds me of a story from a number of years ago involving a previous company I worked for.

    Brand spanking new DC of which they were very proud. One day the power in the area went down but their lovely shiny generator hadn't kicked in. No problem, just go out with this key card for the electronic lock to get in and start the generator manually... oh, wait...

    1. Kreton

      Re: memories

      When I started my first job over 50 years ago I came across a standby generator, a great big thing with a massive flywheel/clutch arrangement. If the AC supply failed, the clutch, held by an electromagnet, released and the flywheel engaged the diesel generator to start it up. I can't remember the exact figures but I'm sure the replacement power was flowing in less than 2 seconds.
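      A rough sketch of why a big flywheel can bridge that gap: the stored kinetic energy (E = 1/2 I w^2) carries the load while the diesel spins up. The figures below are made-up assumptions, not that machine's real specs.

      import math

      def ride_through_seconds(inertia_kg_m2, rpm, load_kw, usable_fraction=0.5):
          # Seconds a flywheel can carry a load while the diesel starts.
          omega = rpm * 2 * math.pi / 60                        # angular speed, rad/s
          stored_kj = 0.5 * inertia_kg_m2 * omega ** 2 / 1000   # kinetic energy, kJ
          return stored_kj * usable_fraction / load_kw

      print(f"{ride_through_seconds(inertia_kg_m2=400, rpm=3000, load_kw=300):.0f} s")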

  4. Anonymous Coward

    More interesting reading: http://www.flyertalk.com/forum/british-airways-executive-club/1739886-outsourcing-prediction-8.html (and elsewhere on FT)

    1. 23456ytrfdfghuikoi876tr

      and interesting DC insights here:

      http://www.electron.co.uk/pdf/ba-heathrow.pdf

      https://cdn1.ait-pg.co.uk/wp-content/uploads/2016/09/British_Airways_Case_Study.pdf

      http://www.star-ref.co.uk/media/43734/case-study-no-02-british-airways.pdf

      http://assets.raritan.com/case_studies/downloads/british-airways-raritan-case-study.pdf

  5. Anonymous Coward

    Now that things are settling down I suspect the BA senior management are busy polishing their sloping shoulders. Given that they can't blame outsourcing - both because they approved it and also it is generally a bad idea to piss off the people who control your IT lifeblood - my guess is that they will go with "unforeseeable combination of circumstances" plus, of course, "lessons will be learned".

    1. Anonymous Coward

      Good opportunity

      As I say, good opportunity to "learn lessons and hire a buddy as infrastructure continuity director".

      And then pay money to another buddy to study the problem in depth... and maybe fire a few of the internal IT guys for not being part of the solution and for keeping on pointing at the problems.

  6. Anonymous Coward

    Happenings ten years time ago

    This six minute video was released ten years ago by a then-major IT vendor. It features resilient dual data centres, multiple system architectures, multiple software architectures, the usual best in class HA/DR stuff from that era. Stuff that has often been referred to more recently as "legacy" systems - forgetting the full definition, which is "legacy == stuff that works."

    https://www.youtube.com/watch?v=qMCHpUtJnEI

    Lose a site, normal service resumes in a few minutes. Not a few days. Design it right and the failures can even be handled transparently.
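    A toy sketch of the client-side half of that idea (placeholder URLs, nothing to do with the video's actual setup): try the primary site, fail over to the secondary if it doesn't answer. Real DR also needs replication, quorum and DNS/GSLB, not just a retry loop.

    import urllib.request

    SITES = ["https://dc1.example.invalid/health",
             "https://dc2.example.invalid/health"]

    def first_healthy(sites, timeout=2):
        # Return the first site that answers its health check, else raise.
        for url in sites:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status == 200:
                        return url
            except OSError:
                continue  # site down or unreachable - try the next one
        raise RuntimeError("no site reachable")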

    Somewhere, much of what was known back then seems to have been forgotten.

    The video was made by/for and focuses on HP, who aren't around any more as such. But most of the featured stuff (or close equivalent) is still available. Even the non-commodity stuff. Now that even Matt Bryant has admitted that IA64 is dead, NonStop and VMS are both on the way to x86 [1].

    Anyone got BA's CEO's and CIO's email address? Is it working at the moment?

    [1] Maybe commodity x86 isn't such a bright idea given the recent (and already forgotten?) AMT/vPro vulnerabilities. Or maybe nobody cares any more.

    1. Anonymous Coward

      Re: Happenings ten years time ago

      "Anyone got BA's CEO's and CIO's email address? Is it working at the moment?"

      According to a flight attendant I spoke to after I had a particularly bad time with BA pre-flight, it's Alex.Cruz@ba.com. She said she was sick of apologising for BA over the last few years... and I know what she means. The cost cutting is very evident and perhaps BA should now stand for Budget Airways.

      I note from BBC reports that his last airline apparently had similar issues.

      Anonymous? - yes, I have some notoriously difficult to use air miles which I intend to at least try and use before I finish flying with that airline for good!

      1. Alan Brown Silver badge

        Re: Happenings ten years time ago

        "I have some notoriously difficult to use air miles which I intend to at least try and use before I finish flying with that airline for good!"

        They can generally be used elsewhere within the same alliance.

  7. Tom Paine

    72h and counting

    It's Monday lunchtime, and it's looking like the PR disaster has greatly amplified how memorable this will be in years to come when people come to book flights. This is what they should have been googling three days ago:

    Grauniad: "Saving your reputation when a PR scandal hits"

    https://www.theguardian.com/small-business-network/2015/oct/23/save-reputation-pr-scandal-media-brand

    Torygraph: "Six tips to help you manage a public relations disaster"

    http://www.telegraph.co.uk/connect/small-business/how-to-manage-a-public-relations-disaster/

    Forbes: "10 Tips For Reputation And Crisis Management In The Digital World"

    https://www.forbes.com/sites/ekaterinawalter/2013/11/12/10-tips-for-reputation-and-crisis-management-in-the-digital-world/#bc0de87c0c68

    Listening to endless voxpops from very pissed off BA pax, those articles make very interesting reading. BA seems to have confused the "Do" and "Don't" lists...

    They're now saying the famous power failure was for a few seconds; agree with above commentards saying that's suggestive of some sort of data replication / inconsistency issue. Still hungry for the gory details though... come on, someone in the know, post here as AC (from a personal device obvs)

  8. one crazy media

    Is BA powering their global IT systems from a single power source? If you are going to lie, come up with a good lie. Otherwise, you're like Trump.

    1. vikingivesterled

      It is a sad fact of electricity. You can't power the same stuff from two different grids at the same time. It has something to do with how AC power alternates and the need for synchronisation, meaning all failure reactions are of a switch-over nature. This is why you sometimes see the lights blink when the local grid to your door switches to a different source. And sometimes these switch-overs fail, or several happen in a series, leading to power surges and failed equipment.

      Most airlines have a main data centre with a main database to ensure the seat you book is not already taken by somebody else booking through a different centre. It is not like tomatoes, where any tomato will do the job. You will not be happy if that specific seat on that specific plane on that specific flight you booked is occupied by that somebody else.
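      A minimal sketch of the point (hypothetical schema, not any airline's real system): with one authoritative database and a uniqueness constraint, the second attempt to book the same seat is rejected inside a transaction instead of silently double-booked.

      import sqlite3

      db = sqlite3.connect(":memory:")
      db.execute("CREATE TABLE booking (flight TEXT, seat TEXT, passenger TEXT, UNIQUE (flight, seat))")

      def book(flight, seat, passenger):
          try:
              with db:  # transaction: commits on success, rolls back on error
                  db.execute("INSERT INTO booking VALUES (?, ?, ?)", (flight, seat, passenger))
              return True
          except sqlite3.IntegrityError:
              return False  # seat already taken

      print(book("BA117", "14A", "Alice"))  # True
      print(book("BA117", "14A", "Bob"))    # False: already taken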

      1. chairman_of_the_bored
        Joke

        Except if you are United...

      2. patrickstar

        The entire grid of a nation is usually in phase (with a few exceptions like Japan).

        However, you can't just go hook up random connections since the current flows need to be managed. And if one feed dies due to a short somewhere, you don't want to be feeding current into that. Etc.

      3. Alan Brown Silver badge

        "You can't power the same stuff from 2 different grids at the same time."

        Yes, you can: It's all in the secret sauce.

        https://www.upssystems.co.uk/knowledge-base/understanding-standby-power/

        And when it's really critical, you DON'T connect the DC directly to the external grid. That's what flywheel systems are for (you can't afford power glitches when testing spacecraft parts as one f'instance)

        http://www.power-thru.com/ (our ones run about 300kW continuous load apiece and have gennies backing them up.)

        For the UK, Caterpillar have some quite nice packaged systems ranging from 250kVA to 3500kVA - and they can be stacked if you need more than that, or to build up from a small size as your DC grows.

        http://www.cat.com/en_GB/products/new/power-systems/electric-power-generation/continuous-power-module/1000028362.html

        So, yes. This _is_ a solved problem - and it's been a solved problem for at least a couple of decades.

        1. Anonymous Coward

          Are there any qualified electrical people on this site?

          If so, please educate us.

  9. Tom Paine
    Go

    Interview with Cruz has more detail

    Interview on today's WatO has a lot more lines between which technical detail can be read. Starts about 10 minutes in, after the news bulletin:

    http://www.bbc.co.uk/programmes/b08rp2xd

    1. anthonyhegedus Silver badge

      Re: Interview with Cruz has more detail

      Well, he said he won't resign, which hopefully will cost them even more. He forgot to mention anything about incidentals that need to be paid for, like onward flights, hotels etc.

      And I didn't hear anything other than that there were local power problems at Heathrow. And it definitely wasn't the outsourced staff or the redundancies. And presumably it wasn't a hack either. I don't suppose they need to be hacked, they're quite capable of ruining themselves on their own anyway.

      Lame excuses, no detail and they're going to shaft the customers and probably reward the CEO.

  10. OFSO

    Only one way to do it - and that is the right way.

    1) The European Space Operations Centre has two independent connections to the German national grid - one entering the site at one end, one at the other. In addition, a massive genset was hired for weeks surrounding critical ops, with living accommodation for the operator. (Yes, it was that large.)

    2) A backup copy of all software and data files was made every day at midnight, encompassing the control centre and the tracking, telemetry & command station main computers (STAMACS) spread around the world.

    In my 25 years there I only remember one massive failure, which is when an excavator driver dug up the mains supply cable - in the days when there was only one power feed. In fact that incident is why there are two power connections today. Maybe someone here will correct me but I do not remember any major computing/software outages lasting more than an hour, before the system was reloaded and back on line.

    Of course we were not trying to cut costs and maximise profits.

    1. Anonymous Coward

      Re: Only one way to do it - and that is the right way.

      So, all your systems were physically close to each other then? That is not 'the right way'.

  11. StrapJock

    Twenty years ago...

    I remember designing network infrastructures on behalf of BAA some 20 years ago (1996 - 2000). Three of them were in the UK, the others in far flung places. Back then all network infrastructures had to be designed to survive catastrophic failure e.g. bomb blast, fire, flood, air crash etc., and be capable of "self-healing" - i.e. have the ability to continue processing passengers within 3 minutes. All the major airlines including BA insisted on these requirements. Not an easy task back then but we managed to get the infrastructure restore down to 30 seconds.

    It seems like BA has sacrificed those standards in favour of saving money and now has a system which takes three days to restore. I wonder how much that will cost them?

    Progress.

  12. thelondoner

    BA boss 'won't resign' over flight chaos

    "BA chief executive Alex Cruz says he will not resign and that flight disruption had nothing to do with cutting costs.

    He told the BBC a power surge had "only lasted a few minutes", but the back-up system had not worked properly.

    He said the IT failure was not due to technical staff being outsourced from the UK to India."

    http://www.bbc.co.uk/news/uk-40083778

  13. Rosie

    Power supply problem

    You'd think they could come up with something more plausible than that old chestnut.

  14. JPSC

    Never say never

    "We would never compromise the integrity and security of our IT systems"...

    .. unless placing a fundamental part of our infrastructure under the control of a bunch of incompetent foreign workers would increase our CEO's quarterly bonus.

    TFTFY.

  15. Anonymous Coward

    So one power outage can take down an entire system and it's not his fault. If I were a major shareholder I'd be outraged and he'd be gone.

  16. Duffaboy
    FAIL

    Here is why they have had a system failure

    They have most likely put someone in charge of IT spend who knows Jack S about IT.

  17. Duffaboy
    Trollface

    Could it be that it's still down because

    The IT Support staff also have to fly the planes, and can't fly them because the IT systems are down...

  18. john.w

    When IT is your company

    Airline CEOs used to understand the importance of their IT; a quote from Harry Lucas's 1999 book 'Information Technology and the Productivity Paradox - Assessing the value of Investing in IT'

    Robert Crandall, American Airlines' recently retired CEO, said that "if forced to break up American, I might just sell the airline and keep the reservations system". It was SABRE, developed by AA and IBM - https://www.sabre.com/files/Sabre-History.pdf

    BA should keep the IT in house and sub contract the planes to Air India.

  19. Anonymous Coward

    I'm enjoying BA's pain ...

    If it isn't the CEO's fault, whose is it?

    Former BA CEO, Alex Cruz .....

    I gave up on BA (and United) years ago for other reasons.

  20. chairman_of_the_bored

    IT datacentre design is NOT rocket science, but it has to be approached holistically - any major datacentre worth its name will have at least dual feeds to the grid plus generators on standby. But if nobody really understands the importance of the applications running (and why would remote staff in India have a clue?) then you end up with non-overlapping up-time requirements.

    The datacentre (probably) had their power back up within their agreed contractual requirements, but nobody seems to have considered the implications of countless linked database servers crashing mid-transaction. If the load sheets for each aircraft relied on comms links back to the UK, didn't anyone consider the possibility of a comms breakdown? Why not a local backup in (say) Excel? It is probably a cheap and fair jibe to attack the Indian outsourcers, but all these things should have been considered years ago.
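    The "local backup" point is easy to sketch. The fetch function below is a hypothetical stand-in for whatever link BA actually uses; the idea is just to cache the last good load sheet locally so a comms outage degrades to stale-but-usable data rather than nothing.

    import json, pathlib

    CACHE = pathlib.Path("loadsheet_cache.json")

    def get_load_sheet(flight, fetch):
        # fetch(flight) -> dict; may raise OSError when the comms link is down.
        try:
            sheet = fetch(flight)
            CACHE.write_text(json.dumps(sheet))       # refresh the local copy
            return sheet
        except OSError:
            if CACHE.exists():
                return json.loads(CACHE.read_text())  # last known good copy
            raise

    def link_down(flight):
        raise OSError("comms link down")

    print(get_load_sheet("BA117", lambda f: {"flight": f, "zfw_kg": 61000}))  # fetches and caches
    print(get_load_sheet("BA117", link_down))                                 # survives the outage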

  21. Pserendipity

    Fake news!

    BA flights were leaving without baggage on Friday, with no explanation.

    BA flights from everywhere but LHR and LGW were flying.

    Now, if MI6 got a whisper of a bomb on a BA plane leaving London over the Bank Holiday weekend, how could the authorities avoid panic and ensure that all passengers and baggage got fully screened?

    Hmmm, how about blaming computer systems (everybody does) the weather (climate change is politically neutral) and outsourced IT (well, the Commonwealth is next on the immigrant exclusion agenda once Brexit is sorted) while playing games with the security levels?

    [I await receipt of a personal D-Notice]

  22. Brock Landers

    Who will do the needful and revert?

    Outsourcing is sanctioned at the top and so no heads will roll. If this debacle is a result of outsourcing to a BobCo in India, then you can bet that the Execs at the top will defend the BobCo. To admit they made a mistake in outsourcing a vital function of their business to a 3rd World Bob Shop would be like the Pope saying "there is no God", just will not happen. The answer will be "MOAR OUTSOURCING".

    Sorry to be the bearer of bad news, but those Directors with their MBA in Utter B*ll*cks hate IT, they hate 1st world IT staff and they hate the fact that they themselves do not understand IT. They'll always do what they can to offload IT to India et al with their mistaken belief that IT people are not smart, anyone can do IT and Indians are the best and the cheapest when compared to 1st World people. It's almost as if such people cannot stand people below them who are smarter than them: the IT guys & gals.

    I seem to remember this rot setting in during the late 90s. Anyone remember when non-IT chumps started to show up in IT depts? They were called "Service Management" armed with their sh-ITIL psychobabble. Then we got the "InfoSec" types who knew feck all about Windows, Servers, Networking, Exchange etc. but barked out orders based on sh*t they were reading on InfoSec forums. And then we got the creeping death known as the "Bobification of IT"...Indians with "Bachelors Degrees" and MCSE certifications. And now we know that cheating on exams and certs is an epidemic in India. Why are directors & managers in the 1st world outsourcing 1st world jobs to 3rd world crooks?

    Who will do the needful? Who will revert?

    1. Anonymous Coward

      Re: Who will do the needful and revert?

      Oh no, on the contrary... they DO understand that IT ppl tend to be smarter than them but with less people skills. But some DO have people skills.

      Therefore it is very dangerous to have your own IT, as they will potentially know more about running the business than you... so you prevent all of that by outsourcing it.

      1. Brock Landers

        Re: Who will do the needful and revert?

        Thank you for saying the needful.

  23. Anonymous Coward

    power surge at 9.30am on Saturday that affected the company's messaging system

    From the Independent. Cruz reportedly said:

    'The IT failure was caused by a short but catastrophic power surge at 9.30am on Saturday that affected the company's messaging system, he said, and the backup system failed to work properly.'

    I wonder if they had a single point of failure in their communications network?

    Having spent my early career in Engineering Projects for a very well known British Telecommunications company, when building the new-fangled digital network it was absolutely drilled into us to ensure that there were always hot standby backups, diverse routing, backup power supplies, etc. And we tested it too. I spent some weeks of my life going through testing plans ensuring that if you made a 999 call in our area it would be bloody difficult for the call not to get through.

    For us, 5 9's was not good enough. I doubt BA were operating at 3.
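    For the record, here is what those nines actually buy you - just arithmetic, not anyone's published SLA:

    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for nines in (3, 4, 5):
        availability = 1 - 10 ** -nines
        downtime = MINUTES_PER_YEAR * (1 - availability)
        print(f"{nines} nines ({availability:.5f}) allows about {downtime:.0f} minutes of downtime a year")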

    1. Anonymous Coward

      Re: power surge at 9.30am on Saturday that affected the company's messaging system

      Most power spikes and cuts are very short; the real issue is how long it takes to get back to normal.

      1. Anonymous Coward

        Re: power surge at 9.30am on Saturday that affected the company's messaging system

        However, when it comes to critical systems, power spikes should not affect you. You have built-in protection for that?

        Power cuts should also not affect you (at least for quite a period). You have batteries for immediate takeover without loss of service and then generators to keep the batteries going?

        You do test it at regular intervals as well?

  24. Kreton

    Ticked the wrong box

    The customer complained there was no data on the backup tapes. They had ticked the wrong box in the backup software and presumably ever since they installed the system all they had backed up was the system, not the data. Much anguish and gnashing of teeth.

    1. Korev Silver badge

      Re: Ticked the wrong box

      Where I used to work, most of the IT was outsourced to someone you'd have heard of. For months they'd been taking monthly backups for archiving; no one at the said firm thought it was odd that a thousand-person site's main filer only required a single tape! Of course, when we needed them to restore, only then did they realise. This was the excuse we needed to bring this service into our shadow IT group...

  25. PyroBrit

    Maintain your generators

    Whilst working at a company in 2001 we had a total power failure to the building. Quite correctly, the UPS maintained the integrity of the server room and the backup generator started up as commanded.

    Upon contacting the power company we were told it would be down for at least an hour. Our building services guy said we had fuel for the generator for 48 hours. Two hours later the generator died.

    Turns out the fuel gauge was stuck on full and the tank was close to empty. Dipsticks are your friend.

    1. Florida1920

      Re: Maintain your generators

      Dipsticks are your friend.

      Sounds more like dipsticks were in charge of the data center.

    2. Aitor 1

      Re: Maintain your generators

      Seen that more than once.

      Also, people expect 3 year old fuel not to clog the filters.

      1. Alan Brown Silver badge

        Re: Maintain your generators

        "Also, people expect 3 year old fuel not to clog the filters."

        Which is why you have circulation pumps, regular run tests, redundant filtration systems and duplicated fuel gauges.

        Of course, if it's all put together by speed-e-qwik building contractors then you'd better be absolutely sure you covered every last nut, bolt and washer in your contract and specified that any departures _they_ make from the plans are not allowed (else they will, and then try to charge you for putting it right).

  26. Saj73

    Tata Consultancy.... surprised at incompetency.... I am not

    Typical. I am not surprised that Tata's or HCL's name would have cropped up. All the knowledge walked out when they let go of the local resources. Tata and group are just ticket takers - I work and see this first-hand in my line of business with these 'Indian IT specialists', who are complete rubbish; only 1 in 100 will know what needs to be done. It is mass production and shift work; I will bet my bottom dollar they had challenges just bringing resources on board because it was past their working hours. This won't be the first or last time, and the lame excuse is that this is standard practice in the industry. Well, go and have a look at GE or larger organisations who are now going back because of the complete third-class incompetence that these outsourcing companies bring.

    It is time for the British public to demand that BA bring its core IT competencies in-house. This won't be the first or last time this happens.

    1. Alan Brown Silver badge

      Re: Tata Consultancy.... surprised at incompetency.... I am not

      "This is time for the british public to demand that BA "

      The "british public" can't demand anything. BA is owned by a spanish private consortium.

      So that's a lack of excellence in engineering OR customer service.

  27. SarkyGit

    Have they said it was the IT kit that had a power failure?

    Ever seen what very quickly happens in a DC when the air-con fails?

    It will boil your data and traffic; this could cause plenty of anomalies that a DR site won't recognise as issues until thermal cutouts (hopefully correctly configured) kick in or physical parts fail.

    You can be assured you have many days work in front of you.
