BA's 'global IT system failure' was due to 'power surge'

British Airways CEO Alex Cruz has said the root cause of Saturday's London flight-grounding IT systems ambi-cockup was "a power supply issue*" and that the airline has "no evidence of any cyberattack". The airline has cancelled all flights from London's Heathrow and Gatwick amid what BA has confirmed to The Register is a " …

    1. Fake_McFakename

      Re: State of IT

      The RBS Group scheduling system incident was caused by a botched upgrade performed by internal IT staff in Scotland, nothing to do with outsourcing.

      It's incredibly frustrating to see outsourcing being touted as the Great Evil of IT whenever there's a major incident like this, especially when people haven't the first idea about the actual root cause of the incident.

      1. Anonymous Coward
        Anonymous Coward

        Re: State of IT

        "The RBS Group scheduling system incident was caused by a botched upgrade performed by internal IT staff in Scotland, nothing to do with outsourcing. [...]"

        You'll have a definitive source for that, presumably?

        Btw, thanks for creating a brand new account today for posting that highly valuable item.

        1. Fake_McFakename

          Re: State of IT

          "You'll have a definitive source for that, presumably?

          Btw, thanks for creating a brand new account today for posting that highly valuable item."

          This is RBS Group's incident report containing the technical details:

          http://www.rbs.com/content/dam/rbs/Documents/News/2014/11/RBS_IT_Systems_Failure_Final.pdf

          These are the letters between the Treasury Committee and RBS about the incident:

          http://www.parliament.uk/business/committees/committees-a-z/commons-select/treasury-committee/news/treasury-committee-publishes-letters-on-rbs-it-failures/

          This is the specific letter in which RBS specify that the team involved were based in Edinburgh:

          http://www.parliament.uk/documents/commons-committees/treasury/120706-Andrew-Tyrie-re-IT-systems.pdf

          Btw, thanks for continuing the Register Comment Section's long tradition of posting anti-outsourcing opinion with little or no facts about the root cause, or any actual real-life experience of major incidents.

          1. Anonymous Coward
            Anonymous Coward

            Re: State of IT

            Who allowed them to outsource it to Scotland?

          2. Anonymous Coward
            Anonymous Coward

            Re: State of IT

            The initial RBS problem was caused by UK staff (a botched CA7 upgrade), but the real damage was done by the recently offshored ops analysts trying to recover the batch suites: stuff run out of order, stuff run twice, stuff not run at all. Basically, shoving jobs in and hoping for the best led to corrupt databases, lost data, and referential integrity between applications being lost. The 90-odd UK staff who knew these suites in intricate detail had recently been dumped. The offshoring turned a trivial issue into a major disaster.
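
            That recovery failure mode (jobs run out of order, twice, or not at all) is exactly what dependency-checked, run-once batch scheduling is meant to prevent. A minimal Python sketch of the idea; the job names and dependencies are invented for illustration, not taken from the actual RBS or BA suites:

                # Sketch: dependency-checked, run-once batch execution.
                # Job names and dependencies below are invented for illustration.

                class BatchRunError(Exception):
                    pass

                def run_suite(jobs, run_job):
                    """jobs maps job name -> list of prerequisite job names.
                    run_job actually executes one job. A job never runs twice,
                    and never runs before all of its prerequisites."""
                    completed = set()
                    pending = dict(jobs)
                    while pending:
                        runnable = [name for name, deps in pending.items()
                                    if all(d in completed for d in deps)]
                        if not runnable:
                            # Nothing can run: missing prerequisite or a dependency cycle.
                            raise BatchRunError(f"cannot schedule: {sorted(pending)}")
                        for name in runnable:
                            run_job(name)
                            completed.add(name)
                            del pending[name]
                    return completed

                # Example with made-up job names:
                suite = {
                    "extract_accounts": [],
                    "post_transactions": ["extract_accounts"],
                    "update_balances": ["post_transactions"],
                }
                run_suite(suite, lambda name: print("running", name))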

    2. Anonymous South African Coward Bronze badge

      Re: State of IT

      "...and don't even get me started on RS232 communications and hardware handshaking."

      Heh, been there, done that. It was the time when you still got cards with jumpers on them, and newer NE2000s with software configuration utilities were just being released.

      I preferred the jumper style of changing addresses though; it makes it much easier to determine what IRQ and whatnot a card is using when the PC is switched off.

    3. keithpeter Silver badge
      Coat

      Re: State of IT

      "I've worked in IT for nearly 30 years and have seen the gradual dumbing down of IT, some in part due to technology changes, others due to offshoring."

      @Jaded Geek and all

      Disclaimer: I'm an end user

      Could the increased demand for 'live' and 'real time' data be a factor as well? It strikes me that this results in the layering of systems to considerable depth, and possibly unknown spread. Again, as you say later in your post, the dependency graph for a restart becomes very complex and may not be known/documented. The graph could even be cyclic, so System A version 9.3 expects System B 3.7 to be running, but System B 3.8 has been installed, which in turn depends on System A 9.3 (see the sketch below).

      Coat: not my mess so out for a walk
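
      The cyclic-dependency worry above is easy to check for mechanically, given an (often missing) map of which system needs which. A minimal Python sketch; the system names and versions are invented for illustration:

          # Sketch: detect a cycle in a restart-dependency graph before
          # trying to bring systems back up. Names are invented.

          def find_cycle(deps):
              """deps maps system -> list of systems that must be up first.
              Returns one cycle as a list, or None if the graph is acyclic."""
              WHITE, GREY, BLACK = 0, 1, 2
              colour = {}
              stack = []

              def visit(node):
                  colour[node] = GREY
                  stack.append(node)
                  for dep in deps.get(node, []):
                      if colour.get(dep, WHITE) == GREY:   # back edge: cycle found
                          return stack[stack.index(dep):] + [dep]
                      if colour.get(dep, WHITE) == WHITE:
                          found = visit(dep)
                          if found:
                              return found
                  stack.pop()
                  colour[node] = BLACK
                  return None

              for node in deps:
                  if colour.get(node, WHITE) == WHITE:
                      found = visit(node)
                      if found:
                          return found
              return None

          # System A needs System B up first, but the installed System B
          # now needs System A: neither can be restarted first.
          restart_deps = {"System A 9.3": ["System B 3.8"],
                          "System B 3.8": ["System A 9.3"]}
          print(find_cycle(restart_deps))
          # ['System A 9.3', 'System B 3.8', 'System A 9.3']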

      1. one crazy media

        Re: State of IT

        Dumbing down is not the word. Anyone who can send a text is a Network Engineer. Anyone who can format a Word document is a Software Engineer.

        Companies don't want to pay decent wages to educated and qualified engineers, so they get dummies who can't think.

  1. Anonymous Coward
    Anonymous Coward

    Alex Cruz, meet Dido Harding. I'm sure you'll get on really well, as you have a lot in common.

  2. Camilla Smythe

    I do not get it.

    I've been running Linux on my desktop, vintage 2006, for quite a long time, and every time someone fucks up the house electrics it comes back up the same as it was before, after swearing at whoever fucked them up and resetting the breakers.

    1. vikingivesterled

      Re: I do not get it.

      Yes, but: does it run a relational database, did it ever lose power in the middle of an upgrade, and did the power ever come back in a set of surges that fried your power supply? That is why even your home computer should be protected by the cheapest of UPSes.

  3. trev101

    This is going to go down in history as "How not to do IT you depend upon".

    Getting rid of all the staff who knew how to design a resilient system, and all the staff who had specialist knowledge of the existing systems, indicates extremely poor judgement by the management.

    1. Destroy All Monsters Silver badge

      "This is going to go down in history as 'How not to do IT you depend upon'."

      Since the '70s or so, that book has become fat, fat, fat.

      Right next to it is a thin volume about "successful land wars in Asia waged by people from the European landmass", but that's for another discussion.

  4. Tombola

    Very Old IT Person

    Yes, I'm v old! When I joined an IT Dept. there was still a rusting Deuce in its stores, waiting to be sold for its metal value, which included the mercury in its primitive memory. So I've come thru' it all & am now in retirement. The aspect that I can't understand is that in over 200 observations, none come from any folks who really know what has happened & what is going on.

    Is it the outsourcing to Tata, or what? What is Tata running?

    Is it badly designed & implemented resilience vis-a-vis power supply?

    Or whatever?

    Over 200 "guessers" have been at it but no inside knowledge that can explain. Readers must include some employees who could say what really is happening. The official statement that its a a power outage suggests that somebody hasn't thrown enough logs into the silly upgraded Drax power station.

    It has to be much more complicated & therefore much more worthwhile knowing about. So let's have no more rabbiting about whether Linux would have saved them, & some inside facts, please!

    1. Anonymous Coward
      Anonymous Coward

      Re: Very Old IT Person

      I agree - we're definitely missing some unofficial info as to what actually happened.

      In the absence of any details, I will add that I'm firmly in the "outsourcing your core / mission-critical IT is beyond dumb" camp.

      I had to review the source code of a system we outsourced to TCS to write - and it made you want to cry - horrible, verbose & unreadable spaghetti code. 20 lines of gibberish to do what you could have written in about 5 lines in a much cleaner fashion - over and over again. Dealing with systems written by the old-school big 5 UK consultancies was bad enough (a few good people and dozens of grads with 'learn programming' books in one hand) - but this was on another level.

      1. Anonymous Coward
        Anonymous Coward

        Re: Very Old IT Person

        A flash from the past for me.

        I used to be extremely good with Visual Basic.

        The problem is (as is now the case with Perl, Python, JavaScript, PHP...) that people felt it was easier, therefore the average programmer was CRAP.

        So my customer outsourced this system to a company in Morocco.

        The result? No passing of parameters, everything global with a1, a2 variable names, strong typing ignored, etc. etc. (roughly the pattern sketched below).

        To their credit, it DID work and they tested it perfectly: no bugs.

        But... impossible to maintain... and very expensive with it... so in a few years, it had to be scrapped.

        These days user interfaces should aim at being easy to use and maintain. Almost forget efficiency... computers are cheap, just don't do anything stupid...

        As for the snafu... well, when you outsource on price, this is what you get. Outsourcing for the same quality is at least 30% more expensive than doing it in house.

        If you cannot manage your IT department, even less so the outsourcers...
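
        For readers who haven't met that style, here is an invented Python illustration of the "everything global, no parameters" pattern described above, next to the same logic written with explicit parameters; neither is the actual Moroccan VB code.

            # Invented illustration only: the global-variable anti-pattern
            # described above, and a parameterised equivalent.

            # --- anti-pattern: opaque globals mutated from everywhere ---
            a1 = 0.0
            a2 = 0.0
            a3 = 0.0

            def calc():
                # Reads and writes module globals; correct only if a1 and a2
                # happen to be set first, and impossible to test in isolation.
                global a3
                a3 = a1 * a2

            # --- the same logic with explicit inputs and output ---
            def line_total(unit_price, quantity):
                """Pure function: easy to test, reuse and maintain."""
                return unit_price * quantity

            a1, a2 = 9.99, 3
            calc()
            print(a3)                   # 29.97, but only because a1/a2 were set above
            print(line_total(9.99, 3))  # 29.97, self-contained and obvious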

    2. Anonymous Coward
      Anonymous Coward

      Re: Very Old IT Person

      "So let's have no more rabbiting about whether Linux would have saved them"

      ????

      I haven't seen a single post along those lines.

      Most are speculating about management failures, and Bayesian reasoning indicates that this is indeed likely to be the root cause.

    3. Grunt #1

      Re: Very Old IT Person

      I suspect everyone who knows is working, worn out or resting and has better things to do. The fact there are probably too few of them won't help, no matter what the reason.

      It seems the BA communications plan is to tell no-one, including passengers. What puzzles me is that they have good plans for flying incidents; why treat their IT differently?

      1. Aitor 1

        Re: Very Old IT Person

        They see IT as they see plumbing: a cost centre with almost no repercussion in their main gig.

        The fact that they should be a SW powerhouse escapes them.

        1. vikingivesterled

          Re: Very Old IT Person

          Until, like in plumbing, the shit starts to fly, or in this case not fly.

        2. Anonymous Coward
          Anonymous Coward

          Re: Very Old IT Person

          Plumbing is probably the only time you don't want a backup.

      2. ajcee

        Re: Very Old IT Person

        "What puzzles me is they have good plans for flying incidents why treat their IT differently?"

        Because IT is a cost, silly! IT doesn't make any money.

        After all: if you're not directly selling to the punters and bringing in the money, what good are you?

        1. Anonymous Coward
          Anonymous Coward

          Re: Very Old IT Person

          "if you're not directly selling to the punters and bringing in the money, what good are you?"

          Good question, and if the IT Director hasn't got a convincing answer when the CEO asks the question, then the IT Director should get what's coming to him.

          Shame all those BA ex-customers and soon-to-be ex-staff had to suffer before the CEO would finally understand what IT's for, and what happens when it's done badly. Maybe the compensation for this weekend's customers can be deducted from the "executive compensation" approved by the Remuneration Committee; after all, £100M or so presumably isn't much to them...

          1. Hans 1

            Re: Very Old IT Person

            Sadly, the CEO and IT director will blame the proles, you know, the understaffed, underfunded ones in the trenches trying to keep the machine rolling while adhering to useless, time-wasting "corporate policies" and always taking the blame for the results of board decisions...

            Crap, I must be a lefty!

  5. Cincinnataroo

    The stand-out for me is all the guessing. Apparently no decent folk working on this problem feel enough responsibility to humanity to tell us the truth.

    That in itself is a serious issue.

    Does anybody here have insight into the people who are reputed to be doing the work? (Or has some smart person designed a nice family of DNNs, and there are no humans involved at all?)

  6. Wzrd1 Silver badge

    Well, I can't speak to BA, but I know of a case

    Where a US military installation, quite important for wartime communications, entirely lost power to its critical communications center for the entire bloody war, due to a single transformer and a dodgy building UPS, which was meant to keep everything operational for all of five minutes, in order to let the standby generators come fully up to stable speed.

    It turned out that, due to the installation being in a friendly nation in the region, it had lower priority (odd, as US CENTCOM was HQ'd there). So when the battery room full of batteries outlasted their lifetimes and failed, they were, due to budgeting, not funded for lifecycle replacement.

    Until all war communications to the US failed. A month later, the batteries arrived by boat and then had to endure customs.

    That was all after correcting the lack of monthly generator testing, which management claimed was unheard of, but which the technical control facility supervisors admitted was a regular test they had forgotten about and which had hence managed to avoid being part of our monthly SOP.

    That was brought up by myself, the installation IASO, in a shocked outburst when told that the generator had failed and was untested.

    The gaffe in SOP was corrected.

    It then failed again, due to a different transformer exploding after a leak of coolant oil in the desert heat, and a flood the week before, caused by a ruptured pipe.

    Not a single one of us dreamed that water from the one-inch pipe, leaking onto the calcium carbonate layer directly beneath the sand, would flood into the below-ground diesel tank and displace the fuel, so that when it was needed, the generator got fuel from the lines and then a fine drink of fresh water.

    Yes, another change in SOP: whenever there is a flood within X meters of a below-ground generator fuel supply, test the generator again. The generator had been tested the week before the leak, so it was two weeks from the next test.

    Boy, was my face red!

    1. Kreton

      Re: Well, I can't speak to BA, but I know of a case

      I worked with backup software on a number of operating systems and knew that disaster recovery was more than just a backup of the data. From the number of comments showing diverse practical experience here, this seems an invaluable source of information for someone to construct a How To Manual from, inviting contributions from others. A new website perhaps? I don't have the time myself, but I'm sure someone has, so that the next time something like this happens the embarrassment of fingers wagging and saying "It was on the Web" can be well and truly propagated. Oh, and don't have the IT director and team living too close to the facility: if there is a major incident in the area they will want to secure their families before the computer systems.

      1. Anonymous Coward
        Anonymous Coward

        Re: Well, I can't speak to BA, but I know of a case

        For starters try this..... http://www.continuitycentral.com/

  7. 0laf
    FAIL

    Not shocked

    Last time I flew BA (2016) the plane broke before the doors even closed (fuel valve problem). BA basically reacted as if they'd never seen or even heard of a broken plane before, and as if all their staff had just come off a week-long absinthe and amphetamine bender. They lost a busload of passengers who then re-entered T5 without going through security. BA staff were wandering round shouting "I don't know what to do" and the tannoy was making automated boarding calls the staff didn't know about. I've rarely seen such a display of shambolic ineptitude.

    Still, the compo (when the ombudsman made them stop ignoring me) was more than the cost of the flight.

    So to see a fuck up of this magnitude, really not surprised at all.

    1. Anonymous Coward
      Anonymous Coward

      EU tech: Vive La Merde

      Airbus?

      French+British

      A cooperation between some who don't care about engineering and some who don't care about service..?

      1. Anonymous Coward
        Anonymous Coward

        Re: EU tech: Vive La Merde

        I've worked with the French, and they really cared about engineering and were very good.

        OK, I get your point.

  8. RantyDave

    The ol' corrupted backup

    My money's on "oh smeg, the backup's broken". Somehow incorrect data has been being written to the backup for the past six months - and you can replicate incorrect data as much as you like, it's still wrong. Hence the enormous delay while they pick through the wreckage and fix tables one by one until the damn thing is working again.

    1. Anonymous Coward
      Anonymous Coward

      Re: The ol' corrupted backup

      "Somehow incorrect data has been being written to the backup for the past six months "

      And presumably nobody's tried a restore, because it's too time consuming and expensive.

      That restore could have been tested on the "DR/test/spare" system which various folks have been mentioning - IF management had the foresight to invest in equipment and skills. But outsourcing isn't about value for money, it's about cheap. Until senior management get to understand that management failures will affect them personally, anyway.
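
      Testing a restore on spare kit, as suggested above, doesn't have to be elaborate. A minimal Python sketch of the idea; the paths are placeholders rather than anything from BA's actual environment, and a real test would also start the application against the restored data:

          # Sketch: periodically restore the latest backup to a scratch area
          # and verify every file arrived intact. Paths are placeholders.

          import hashlib
          import pathlib
          import shutil

          def checksum(path):
              return hashlib.sha256(path.read_bytes()).hexdigest()

          def test_restore(backup_dir, scratch_dir):
              backup = pathlib.Path(backup_dir)
              scratch = pathlib.Path(scratch_dir)
              shutil.rmtree(scratch, ignore_errors=True)
              shutil.copytree(backup, scratch)   # stand-in for the real restore step
              ok = True
              for src in backup.rglob("*"):
                  if src.is_file():
                      dst = scratch / src.relative_to(backup)
                      if not dst.exists() or checksum(src) != checksum(dst):
                          print("restore mismatch:", src)
                          ok = False
              return ok

          # Placeholder locations; point these at a real backup and a spare volume.
          print("restore test passed"
                if test_restore("/backups/latest", "/scratch/restore_test")
                else "restore test FAILED")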

      1. vikingivesterled

        Re: The ol' corrupted backup

        Backups and DR testing never gave any manager a promotion. It is quite soul-sucking to work on something for years and years that nobody up top ever notices, until, if you are (un)lucky, the one day it is needed. And then you'd probably get criticism for why the main system went down at all and why there was a short break in service, instead of a pat on the back for how quickly and easily it was restored thanks to your decade of quiet labouring to ensure it would be.

        Smart talkers steer well clear of it and work instead on sexy development projects that give new income streams and bonuses, leaving DR to those with a conscience, who will be the first out the door when savings are looked for.

    2. Anonymous Coward
      Anonymous Coward

      Re: The ol' corrupted backup

      What makes you think they ever put the right data in the tapes?

      I have worked for quite a few Fortune 500 companies that did not test their backups because "these systems are not mission critical". Only to find that, well, maybe they are after all; we just marked them "non-critical" to pay less to outsource them... Me being the project leader at the outsourcing company, my answer was of course "well, it's a 12x5 service, but we are happy to make an effort, and will send you the bill".

      And this is how you make money with outsourcing: everything not in the contract, you are happy to provide... at a steep price. That is where you make the money.

  9. TerryH

    Couldn't start the standby...?

    https://www.thesun.co.uk/news/3669536/british-airway-it-failure-outsourced-staff/

    The article claims that outsourced staff didn't know how to start the backup or standby system...?

    It needs to be verified of course, but if true, it makes you wonder what other knowledge and procedural gaps there are now... Were there gaps before the outsource deal? Or were they just not handed over? Would they have been handed over if the outsource deal hadn't been rushed? Are TCS actually any good on the whole? How safe do you now feel flying on a BA plane?

    My experience with TCS has been mixed. They are all well-meaning and they have a few very good, stand-out technical people, but there are a lot who don't have much experience or knowledge and are just thrown in at the deep end at the client.

    Recently had one of them sat with me and a few colleagues (Windows Server infrastructure build team). He had clearly been tasked by his head office to turn each of our 20 to 30 years of experience and knowledge into some sort of script that they could then follow to fulfill server build projects... It was painful, and an entirely impossible task in my opinion. Plus he could not even do a basic install of Windows Server without help to begin with!

    Anyway, a very depressing state of affairs, and perhaps if companies want to cut costs so badly then directors can consider taking more pay cuts etc. Once you've sacked or pissed off a loyal worker you will never get them back.

    1. Anonymous Coward
      Anonymous Coward

      Re: Couldn't start the standby...?

      "Recently had one of them sat with me and a few colleagues (windows server infrastructure build team).He had been clearly been tasked by his head office to turn each of our 20 to 30 years experience and knowledge into some sort of script that they could then follow to fulfill server build projects... It was painful and an entirely impossible task in my opinion. Plus he could not even do a basic install of Windows server without help to begin with!"

      This is one of the problems with outsourcing.

      I once spent a few weeks handing over a system to an Indian guy, and it was built in a language he had no experience in. To his credit, he was a fast learner.

      The other problem is that these companies move people around and have a high turnover of staff. I managed an intranet for a company, and it got pushed out to India. I bumped into one of the users, and they were complaining that every 2 months or so they got new people to deal with. So they'd then have to explain what things meant, and things would take a long time because the staff didn't know the codebase. Eventually, they moved it back to the UK.

      The best setups I've seen are a mixed thing: a team in the UK of half a dozen employed technical staff, and a load of guys out somewhere else. That team is there for the long term and knows the code base pretty well. They can turn around live problems in hours rather than days.

    2. Saj73

      Re: Couldn't start the standby...?

      Absolutely correct. I have gone through this with various of these Indian-flavoured companies. All of them rubbish. 1 in 100 knows what needs to be done, and knowledge transfer is never enough - how do you expect 20 years of expertise to be transferred to offshore staff who are only interested in working their shift?

  10. Anonymous Coward
    Anonymous Coward

    You get the idea

    Ted Striker: My orders came through. My squadron ships out tomorrow. We're bombing the storage depots at Daiquiri at 1800 hours. We're coming in from the north, below their radar.

    Elaine Dickinson: When will you be back?

    Ted Striker: I can't tell you that. It's classified.

  11. rainbowlite

    I worked with TCS many years ago and found the staff to be very bright and often over-qualified; however, they were rarely allowed to use their brains, as they were constrained by a 'do as the customer says' mantra. This often led to very inefficient/poor designs or code, perfectly replicated again and again.

    Regardless of who the work gets moved to, if it is not in house then there can be an immediate disconnect between urgency and pain/responsibility. Even within an organisation, especially where people work out of multiple sites and remotely, if you are not amongst the users who would be staring at you, and cannot hear the senior managers stomping around, there is naturally less of a driver for you to burn the extra hours etc.

    We still don't know what happened - I prefer the idea that it was a peak demand issue sprinkled with a reduction in capacity at maybe one DC.

    1. LyingMan

      Well... that was eons ago. Now there are two significant changes.

      1. The brightness has faded. Recruitment is focusing on the cheap and easy but not-so-bright, who won't jump ship after gaining three or four years of experience. This has been going on for the last 10 to 11 years, but accelerated in the last 6 years.

      2. The culture within TCS has also changed. Previously, as many have mentioned, it was 'the customer is king'. Now it is: pass the ball back to the customer with more questions. If the customer asks for a change, ask repeatedly about the specification until the customer cannot answer in any clear form, and then implement something minimal that the customer cannot complain about as not meeting the requirements. In the meantime, grill the customer for test cases until they lose the will to live (remember that in most companies the business team's turnover is such that, if you ask questions for long enough, the person who asked will have left the team before it comes around to implementation!) and write code that does just enough to make those test cases pass.

  12. Anonymous Coward
    Anonymous Coward

    No surprise

    I used to work for BA in the IT department until last year. Given the management and outsourcing now in place, this latest debacle is no surprise at all. I could say a lot more, but it would just turn into a rant...

    1. stevenotinit
      Thumb Up

      Re: No surprise

      I'd like to hear your rant, actually. I don't think I'm the only one, and I think BA needs to hear it for their own good, as well as for their customers and IT staff!

      1. This post has been deleted by its author

      2. Anonymous Coward
        Anonymous Coward

        Re: No surprise

        They never listened to any of us when we still worked there, so it wouldn't make a jot of difference now. Management there are living in a parallel universe where nothing can be said against the great idea that is outsourcing. BA used to be a great place to work until 3-4 years ago; the sad thing is that it deteriorated to the extent that I was happy to leave...

  13. Anonymous Coward
    Anonymous Coward

    Laughing stock

    Striker: We're going to have to blow up the computer!

    Elaine Dickinson: Blow ROC?

    [a smiling face appears on the computer]

  14. Grunt #1

    At least Sainsbury's have reacted quickly.

    http://www.continuitycentral.com/index.php/jobs/2015-operational-resilience-manager

    If they can do it, why can't BA?

  15. A Mills

    False savings

    Various sources report that BA will very likely be facing total compensation claims a great deal north of 110 million pounds. I suspect that a much smaller sum than this could well have bought them some decent system redundancy.

    Bean counters in charge of IT, you know it makes sense.

    1. Anonymous South African Coward Bronze badge

      Re: False savings

      Ouch.

  16. Anonymous Coward
    Anonymous Coward

    Comment from a Times article.

    From the IT rumour mill

    Allegedly, the staff at the Indian data centre were told to apply some security fixes to the computers in the data centre. The BA IT systems have two parallel systems to cope with updates. What was supposed to happen was that they apply the fixes to the computers of the secondary system and, when all is working, apply them to the computers of the primary system. In this way, the programs all keep running without any interruption.

    What they actually did was apply the patches to _all_ the computers. Then they shut down and restarted the entire data centre. Unfortunately, computers in these data centres are used to being up and running for lengthy periods of time. That means that when you restart them, components like memory chips and network cards fail. Compounding this, if you start all the systems at once, the power drain is immense and you may end up with not enough power going to the computers - this can also cause components to fail. It takes quite a long time to identify all the hardware that failed and replace it.

    So the claim that it was caused by "power supply issues" is not untrue. Bluntly - some idiot shut down the power.

    Would this have happened if outsourcing had not been done? Probably not, because prior to outsourcing you had BA employees who were experienced in maintaining BA computer systems and knew without thinking what the proper procedures were. To the offshore staff, there is no context; they've no idea what they're dealing with - it's just a bunch of computers that need to be patched. Job done, get a bonus for doing it quickly, move on.
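
    If the rumour above is anywhere near accurate, the intended procedure was an ordinary rolling update: patch the secondary system, check it, and only then touch the primary, restarting hosts a few at a time rather than the whole data centre at once. A hedged Python sketch of that procedure; the hostnames, patch_host() and is_healthy() are placeholders, not BA's or TCS's actual tooling:

        # Sketch of the two-parallel-systems patching described above.
        # Hostnames and the patch/health-check functions are placeholders.

        import time

        SECONDARY = ["app-secondary-01", "app-secondary-02"]
        PRIMARY = ["app-primary-01", "app-primary-02"]

        def patch_host(host):
            print("applying security fixes to", host)   # stand-in for real patching

        def is_healthy(host):
            print("health-checking", host)
            return True                                  # stand-in for real checks

        def rolling_patch():
            # 1. Patch the secondary system while the primary carries the load,
            #    one host at a time to avoid an inrush of load (or power draw).
            for host in SECONDARY:
                patch_host(host)
                time.sleep(1)
                if not is_healthy(host):
                    raise RuntimeError(host + " unhealthy; primary left untouched")

            # 2. Only when the secondary is confirmed good, patch the primary,
            #    again one host at a time, never both systems together.
            for host in PRIMARY:
                patch_host(host)
                time.sleep(1)
                if not is_healthy(host):
                    raise RuntimeError(host + " unhealthy; service stays on the secondary")

        rolling_patch()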
