BA CEO blames messaging and networks for grounding

The catastrophic systems failure that grounded British Airways flights for a day appears to have been caused by networking hardware failing to cope with a power surge and messaging systems failing as a result. The Register has asked BA's press office to detail what went wrong, what equipment failed, what disaster recovery …

  1. jerehada

    So by messaging he means some sort of enterprise service bus was taken down? Also, power surge protection is normal, so what went wrong with the switchover to redundant power? The hints from this, a bit like the TalkTalk hack, suggest something very simple and not some unavoidable impossible to understand failure they would like to media-steer us towards.

    1. wyatt

      I'm looking forward to seeing the RCA, if we ever do. So many businesses have single points of failure, even within HA systems. I realise that backup systems have been mentioned but if they don't work or can't be brought online (ever tested?), has someone been telling porkie pies?

      1. Anonymous Coward

        "I realise that backup systems have been mentioned"

        I used to work for a company, large company, that provided DR services. The vast majority of companies treat DR as a compliance checkbox. They buy some DR services so they can say they have DR services... but in the event of a primary data center loss, there really is only the rough outline of a plan. Basically their data, or most of it, is in some alternative site and they may have the rest of their gear there too or not. There is rarely anything resembling a real time switch over from site A to site B in case of a disaster in which their entire stack(s) would come up without any manual intervention at site B. Mainly because architectures are a hodge podge of stuff which has collected over the years. Many companies never rewrite or modernize anything, meaning much of the environment is legacy with legacy HA/DR tools... and there is sparse automation.

        1. wheelybird

          There's a difference between disaster recovery and high-availability (though they do overlap).

          It's perfectly reasonable that disaster recovery is a manual fail-over process. Fully resilient systems over two geographically separated locations can be hard and expensive to implement for smaller companies with not much in the way of a budget or resources, and so you have to compromise your expectations for DR.

          Even if failing-over can be automated, there might be a high cost in failing-back afterwards, and so you might actually prefer the site to be down for a short while instead of kicking in the DR procedures; it works out cheaper and avoids complications with restoring the primary site from the DR site.

          Not every company runs a service that absolutely needs to be up 24/7.

          A lot of people designing the DR infrastructure will be limited by the (often poor) choices of technology made by the people that wrote the in-house stuff.

          As an example, replicating your MySQL database between two datacentres is more complicated than most people would expect. Do you do normal replication and risk that the slave has lagged behind the master at the point of failure, losing data? Or use synchronous replication like Galera at the cost of a big latency hit to the cluster, slowing it right down?

          If it's normal replication, do you risk master-master so that it's easy to fail-back, with the caveat that master-master is generally frowned upon for good reasons?
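          The trade-off described above can be sketched as a toy model. This is only an illustration under stated assumptions: the function names, the lag of three writes and the latency figures are all invented for the example, and real MySQL or Galera behaviour is considerably more involved.

```python
def writes_lost_on_failover(committed_on_master, replica_lag):
    """Writes acknowledged by the master but not yet applied on the replica.

    Toy model of asynchronous replication: the replica trails the master
    by `replica_lag` writes at the moment the master site is lost.
    """
    if replica_lag <= 0:
        return []
    return committed_on_master[-replica_lag:]


def commit_latency_ms(local_commit_ms, inter_dc_rtt_ms, synchronous):
    """Rough commit latency: a synchronous commit also waits for one
    round trip to the other datacentre before acknowledging the client."""
    return local_commit_ms + (inter_dc_rtt_ms if synchronous else 0.0)


if __name__ == "__main__":
    # Hypothetical workload: ten bookings committed on the master.
    bookings = [f"booking-{i}" for i in range(1, 11)]

    # Asynchronous: fast commits, but lagging writes vanish on failover.
    print("lost on failover:", writes_lost_on_failover(bookings, replica_lag=3))
    print("async commit ms:", commit_latency_ms(2.0, 15.0, synchronous=False))

    # Synchronous: nothing is lost, but every commit pays the round trip.
    print("lost on failover:", writes_lost_on_failover(bookings, replica_lag=0))
    print("sync commit ms:", commit_latency_ms(2.0, 15.0, synchronous=True))
```

          Which side of that trade-off is acceptable depends on whether losing a handful of acknowledged writes is worse for the business than slower commits on every transaction.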

          I think it's disingenuous to berate people for implementing something that can be very difficult to implement.

          Though of course, large companies with lots of money and lots to lose by being down (like BA) have no excuses.

        2. Anonymous Coward

          That's scary

          That is our DR to a tee, so glad I'm not the boss

          anon for obvious reasons

      2. NoneSuch

        You can't test everything all the time.

        Stuff happens and often the unimagined causes grief.

        Redundancies are only guaranteed when they come from HR.

    2. Ledswinger Silver badge

      suggest something very simple and not some unavoidable impossible to understand failure

      The weasel isn't going to take any personal responsibility - even though he is THE chief executive officer. But it is his fault, all of it, in that capacity. The total and absolute failure of everything is clearly a series of multiple failures, and he (and BA) are trying to control the message as though that denies the reality of this catastrophe. He should be fired for his poor communication and poor leadership if nothing else. But that's what you get when you put the boss of a tiddly low cost airline into a big, highly complex operation with a totally different value proposition.

      Looking around, press comment reckons that it'll be two weeks before all flight operational impacts are worked out (crews, aircraft in the wrong place at the wrong time, passenger failures made as good as they can), and the total cost will be about £100m loss of profit. I wonder if that will affect his bonus?

      1. Credas Silver badge

        But that's what you get when you put the boss of a tiddly low cost airline into a big, highly complex operation with a totally different value proposition.

        Whatever you might think about his performance during this unmitigated balls-up, there's much more relevant experience in his biography than just running a "tiddly low cost airline".

        1. Dan 55 Silver badge

          I don't know who you're trying to convince, but it's not me. Neither Clickair nor Vueling have or had stellar reputations, Sabre's had its outages, and the less said about US airports the better.

        2. dieseltaylor

          His CEO experience is from a minor airline, so accept that fact. His previous experience reads well but then every exec I know of makes sure it does. : )

          Should he jump? Probably not but some people somewhere must be guilty of hiding, or not implementing, necessary IT improvements.

      2. Bloodbeastterror

        "I wonder if that will affect his bonus?"

        Ha ha ha ha... Of course not. After the attainment of a certain pay grade "reward for failure" kicks in. Only the actual workers enjoy "reward for success". Sometimes.

        1. Antron Argaiv Silver badge
          Thumb Up

          Re: "I wonder if that will affect his bonus?"

          A former boss referred to it as "f*ck up and move up".

          Though, admittedly, a change of employer is sometimes implied.

        2. Anonymous Coward

          Re: "I wonder if that will affect his bonus?"

          He didn't get one last year

          "Alex Cruz, the Spanish CEO of British Airways, will not receive a bonus for 2016 from the IAG airlines group. The company said in a statement to the National Stock Market Commission that he will be the only one of the 12 senior executives not to receive a bonus."

          1. John Smith 19 Gold badge
            Unhappy

            "he will be the only one of the 12 senior executives not to receive a bonus."

            Which suggests he has been trying extra hard to get one.

            And look what his efforts have produced.....

            I think he's going to be on the corporate naughty step again.

            IT.

            It's trickier than it looks in the commercials.

      3. Anonymous Coward

        The weasel isn't going to take any personal responsibility - even though he is THE chief executive officer.

        IT and the CIO don't fall under him, IT is provided by [parent company] IAG "Global Business Services" as of last year. But of course, Cruz has fully supported all the rounds of cuts that have been made.

        It smells like a store-and-forward messaging system from the dawn of the mainframe age

        JMS-based ESB.

        Ex BA AC

        1. Anonymous Coward

          But you would think that something as critical as an ESB in BA would mean that they have built it with high availability in an active/active configuration with plenty of spare capacity built in with nodes in different locations and on different power supplies. And of course ensuring that the underlying data network has similar high availability.

          Otherwise you have just built in a single point of failure to your whole enterprise and as Murphy's law tells us - if it can go wrong then it will go wrong and usually at the most inopportune moment.
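          As a back-of-envelope illustration of that point, here is a minimal sketch of the usual availability arithmetic. The 99% per-node figure is an assumption chosen for the example, and the redundancy formula only holds if the nodes fail independently, which is exactly what separate locations and separate power supplies are meant to buy.

```python
HOURS_PER_YEAR = 8760


def redundant_availability(node_availability, nodes):
    """Availability of N redundant nodes, assuming independent failures:
    the service is down only when every node is down at once."""
    return 1.0 - (1.0 - node_availability) ** nodes


def chained_availability(*components):
    """Availability of components in series: all of them must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.99  # assumed availability of one node
    pair = redundant_availability(single, 2)
    print("one node, hours down per year:", downtime_hours_per_year(single))
    print("active/active pair, hours down per year:", downtime_hours_per_year(pair))

    # Put the redundant pair behind one shared, non-redundant dependency
    # (a single power feed, say) and the weak link dominates again.
    behind_spof = chained_availability(pair, single)
    print("pair behind a single point of failure:", downtime_hours_per_year(behind_spof))
```

          The last figure is the single point of failure in numbers: however good the redundant part is, the chain is no better than its one non-redundant link.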

          1. Norman Nescio Bronze badge
            Pint

            ESB?

            "...something as critical as an ESB in BA would mean that they have built it with high availability in an active/active configuration with plenty of spare capacity..."

            They probably thought they could just pop down the road to the brewery in Chiswick and get a refill there.

            1. Tom Paine Silver badge
              Pint

              Re: ESB?

              Funny, I thought Fuller's had closed that site and moved to an industrial estate in Maidstone or Nuneaton or something -- but I was completely wrong: https://www.fullers.co.uk/brewery

              Doesn't it look nice? Mmmm... ale...

              1. Simon Harris Silver badge
                Pint

                Re: ESB?

                "Funny, I thought Fuller's had closed that site ... but I was completely wrong: https://www.fullers.co.uk/brewery

                Doesn't it look nice? Mmmm... ale..."

                They do an excellent brewery tour with a tasting session in their bar/museum afterwards :)

              2. Bigbird3141

                Re: ESB?

                Think you're confusing it with Young's - the Wandsworth-based brewer Fullers bought and closed and redeveloped the site of.

                1. CH in CT20
                  Pint

                  Re: ESB?

                  Ahem. You seem to be confusing Fuller's with Charles Wells, the company which brews Young's beers in Bedford.

            2. Doctor Syntax Silver badge

              Re: ESB?

              "They probably thought they could just pop down the road to the brewery in Chiswick and get a refill there."

              You thought they could organise....?

        2. Ledswinger Silver badge

          IT and the CIO don't fall under him, IT is provided by [parent company] IAG "Global Business Services" as of last year.

          As a director of BA, he is in fact responsible in law, even if the group have chosen to provide the service differently. I work for a UK based, foreign owned energy company. Our IT is supported by Anonco Business Services, incorporated in the parent company's jurisdiction, and owned by the ultimate parent. If our IT screws up (which it does with some regularity), our customers have redress against the UK business, and our directors hold the full contractual, legal and regulatory liability, whether the service that screwed up is in-house, outsourced, or delivered via captive service companies.

          1. Anonymous Coward

            Director?

            If he is a director of BA! A search of Companies House finds a director of a BA company in the name of

            Alejandro Cruz De Llano

            I'm guessing this is him?

            A member of staff of a company only has legal responsibility if they are a registered director with Companies House. The fact the company calls them a CEO or director does not mean they are a registered director.

            1. butigy

              Re: Director?

              Actually I don't believe that's correct. If you hold yourself out to be a director then it's possible you can be treated like one. There's also the concept of shadow directors, but that's a bit different.

      4. John Smith 19 Gold badge
        Unhappy

        "and the total cost will be about £100m loss of profit. I wonder if that will affect his bonus?"

        You can bet that any "profit improving" (i.e. cost cutting) ideas certainly did.

        This should as well.

        But probably won't, given this is the "New World Order" of large corporate management that takes ownership of any success and avoids any possibility that their decisions could have anything to do with this.

        If you wonder who most modern CEOs' role model for corporate behavior is, it's simple.

        Carter Burke in Aliens.

        1. 0765794e08
          Joke

          Re: "and the total cost will be about £100m loss of profit. I wonder if that will affect his bonus?"

          “Carter Burke in Aliens”

          Sticking with movies, Johnny from Airplane! springs to mind...

          “Just kidding! Oh, wrong cable. Should’ve been the grey one. Rapunzel! Rapunzel!”

      5. Mike Richards Silver badge

        Cruz previously worked at Vueling which has a terrible record for cancellations, lost bookings and cruddy customer service - so he's clearly brought his experience over.

        He was appointed to cut costs at BA, which he's done by emulating Ryanair and easyJet whilst keeping BA prices. He's allowed the airline to go downmarket just as the Turkish, the Gulf and Asian carriers are hitting their stride in offering world-wide routing and don't treat customers like crap. Comparing Emirates to BA in economy is like chalk and cheese.

        BA's only hope is if the American carriers continue to be as dreadful as ever.

        1. Anonymous Coward

          I had the pleasure of flying back to the UK on American in Business Class recently - service and comfort was a notch above BA Club World, and the ticket was cheaper than BA Premium Economy. BA are screwed...

        2. Richard Laval

          "BA's only hope is if the American carriers continue to be as dreadful as ever."

          So they definitely have a fighting chance then!

    3. Voland's right hand Silver badge

      It smells like a store-and-forward messaging system from the dawn of the mainframe age (Shows how much BA has been investing into its IT). It may even be hardware + software. Switching over to backup is non-trivial as this is integrated into transactions, so you need to rewind transactions, etc.

      It can go wrong and often does, especially if you have piled up a gazillion new and wonderful things connected to it via extra interfaces. An example of this type of clusterf*** is the NATS catastrophic failure a few years back.

      That is NOT the clusterf*ck they experienced though because their messaging and transaction was half-knackered on Friday. My boarding pass emails were delayed 8 hours, check-in email announcement by 10 hours. So while it most likely was the messaging component, it was not knackered by a surge, it was half-dead 24h before that and the "surge" was probably someone hired on too little money (or someone hired on too much money giving the idiotic order) trying to reset it via power-cycle on Sat.

      This is why, when you intend to run a system and build on it for decades, you have to upgrade, and you have to START each upgrade cycle by upgrading the messaging and networking. Not do it as an afterthought and an unwelcome expense (the way BA does anything related to paying with the exception of paying exec bonuses).

      1. James Anderson

        If it was a properly architected and configured mainframe system it would have just worked.

        High availability, failover, geographically distributed databases, etc. etc. were implemented on the mainframe sometime in the late '80s.

        Some of the commentards on this site seem to think the last release of a mainframe OS was in 1979, when actually they have been subject to continuous development, incremental improvement and innovation to this day. A modern IBM mainframe is bleeding edge hardware and software presenting a venerable 1960s facade to its venerable programmers. Bit like a modern Bentley with its staid '50s styling on the outside and a monster twin turbo multi valve engine on the inside.

        1. Nolveys Silver badge
          Windows

          @ James Anderson

          (Mainframe operating systems) have been subject to continuous development, incremental improvement and innovation to this day.

          That sounds expensive, has anyone told Ginni about this?

        2. Mr Dogshit Silver badge
          Headmaster

          There is no such verb as "to architect".

          1. MyffyW Silver badge

            no such verb as "to architect".

            I architect - the successor to the Asimov robot flick

            You architect - an early form of 21st century abuse

            He/She architects - well I have no problem with gender fluidity

            We architect - sadly nothing to do with Nintendo

            You architect - abuse, but this time collective

            They architect - in which case it was neither my fault, nor yours

          2. Nigel 13

            There is now.

            1. Aus Tech

              RE: There is now.

              It's too late now, the disaster has already happened. Very much like the old story "shut the gate, the horse has bolted."

          3. tfb Silver badge
            Alert

            But there will be as soon as enough prescriptive-grammar fogeys who can remember that once there wasn't die off. This is how language evolves: by the death of idiots.

            1. Anonymous Coward

              Die off is fine. So is die back. They're descriptive and worth keeping. Architect as a verb is more or less OK, although why did someone assume 'design' wasn't good enough, since it's a correct description of the process, making architect as a verb a replacement for a word that didn't need replacing.

            2. whileI'mhere

              by the death birth of idiots.

              FTFY

          4. Anonymous Coward

            <rant> True. But at least that's one I can, however reluctantly, at least imagine.

            For me, by far the worst example of this American obsession with creating non-existent 'verbs' is, obviously, 'to leverage'.

            Surely that sounds as crass to even the most dim-witted American as it does to everyone else in the English speaking world, doesn't it? I'm told these words are created to make the speaker sound important when they are clueless.

            I can accept that some lone moron invented the word. But why did the number of people using it ever rise above 1? </rant>

            1. Anonymous Coward

              *I can accept that some lone moron invented the word. But why did the number of people using it ever rise above 1?*

              Because the number of morons is >>>> 0[1]

              [1] Yes - I made up >>>> to be "a far, far larger number than the one compared against". It's mine. You can't have it. So there.

            2. Glenturret Single Malt

              And pronounce it levverage instead of leeverage.

          5. Nifty

            No such verb as to architect?

            https://en.oxforddictionaries.com/definition/architect

            verb

            [WITH OBJECT]Computing

            Design and configure (a program or system)

            ‘an architected information interface’

          6. Jtom Bronze badge

            Suggestion we were given some twenty-five years ago: Don't verb nouns.

            1. Marduk

              Verbing weirds a language.

          7. dajames Silver badge
            Headmaster

            There is no such verb as "to architect".

            That's the beauty of the English language -- a word doesn't have to exist to be usable. (Almost) anything goes.

            It's not always a good idea to use words that "don't exist" -- especially if you're unhappy about being lexicographered into the ground by your fellow grammar nazis -- but most of the time you'll get the idea across.

            [There is no such verb as "to lexicographer", either, but methinks you will have got the point!]

            Ponder, though, on this.

        3. CrazyOldCatMan Silver badge

          A modern IBM mainframe is bleeding edge hardware and software presenting a venerable 1960s facade to its venerable programmers.

          And always has done. In the early '90s, I was maintaining TPF assembler code that was originally written in the '60s (some was older than me!).

          And I doubt very much if those systems are not still at the heart of things - they worked. In the same way as banks still have lots of stuff using Cobol, I suspect airlines still have a lot of IBM mainframes running TPF. With lots of shiny interfaces so that modern stuff can be done with the source data.

          1. Down not across

            With lots of shiny interfaces so that modern stuff can be done with the source data.

            Dunno if it's shiny, but probably something like MQ.

            For the most part it seems to do a fairly decent job in distributed systems if it has been properly configured.

      2. yoganmahew

        @Voland

        "It smells like a store-and-forward messaging system from the dawn of the mainframe age "

        You mean teletype? No, it doesn't sound like that, TTY store and forward is simple in its queuing and end points. If the next hop to the destination is unavailable, it stops. When it is available, it restarts. To me, this sounds like MOM, that wonderful modern replacement for TTY. That heap of junk that queue fulls and discards messages, that halts with only writers and no readers, or readers and no writers. That cause of more extended system outages than any other component in a complex system.
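        The contrast drawn above can be sketched with two toy queues. This is a minimal illustration of the behaviour as characterised in the comment, not a model of any real messaging product or of BA's systems; the class names, the capacity of three and the discard-on-full policy are assumptions made for the example.

```python
from collections import deque


class StoreAndForwardLink:
    """Holds every message while the next hop is down, then resumes in order."""

    def __init__(self):
        self.pending = deque()
        self.delivered = []
        self.next_hop_up = True

    def send(self, message):
        self.pending.append(message)
        self.flush()

    def flush(self):
        # Nothing moves while the next hop is unavailable; nothing is lost.
        while self.next_hop_up and self.pending:
            self.delivered.append(self.pending.popleft())


class BoundedDroppingQueue:
    """Fixed-capacity queue that discards new messages once it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()
        self.delivered = []
        self.dropped = []
        self.reader_attached = True

    def send(self, message):
        if len(self.pending) >= self.capacity:
            self.dropped.append(message)  # queue full: message discarded
            return
        self.pending.append(message)
        self.drain()

    def drain(self):
        while self.reader_attached and self.pending:
            self.delivered.append(self.pending.popleft())


if __name__ == "__main__":
    link = StoreAndForwardLink()
    link.next_hop_up = False
    for n in range(5):
        link.send(f"msg-{n}")
    link.next_hop_up = True
    link.flush()
    print("store-and-forward delivered:", link.delivered)

    queue = BoundedDroppingQueue(capacity=3)
    queue.reader_attached = False
    for n in range(5):
        queue.send(f"msg-{n}")
    queue.reader_attached = True
    queue.drain()
    print("bounded queue delivered:", queue.delivered, "dropped:", queue.dropped)
```

        In the first case the outage costs only delay; in the second, anything sent after the queue fills is gone, and the systems on either side no longer agree on what has happened.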

        @James Anderson

        Damn straight!

        On the issue of local resources:

        The use of 'resources' is as usual interesting - it speaks volumes of identikit, replaceable, hired in, temporary. An airline doesn't refer to its pilots as resources, or its aircraft engineers. It shouldn't refer to its IT staff as such... my guess is that of course it was local 'resource' fixing the problem, they'd be the only ones with the access to touch the systems unless BA has gone even more loony than some on here suspect. The local resources would be on a bridge with a cast of hundreds from the supplier, all shouting at each other, all pointing in different directions. First the load balancer would be failed over and resynched. Then the firewall. Then the DNS. Then someone would point out that some component that's never failed before has everything going through it and that it has a single error report in a log somewhere not highlighted in automation because it's never been seen before.

        Then, as said above, when some numpty decides to restart the box, a part of it fails catastrophically - something to do with electricity... it may be sunspots or a power surge, yes, we'll call it a power surge.

        1. MarkHewis

          'Resource', as you said: an org with so little respect for IT that it calls critical staff that is deeply worrying.

        2. Mellipop

          We're doing the RCA

          This is why I like reading the comments on the Reg; we explore the problems and suggest changes.

          If only some of those managers in the PMO carefully read these comments instead of 'socialising' requirements or 'managing expectations' then more of these chaotic and complex system evolutions would actually get better. Dare I say become antifragile?

          My suggestion is an AI based system monitoring tool would have worked wonders.

          https://www.moogsoft.com for example.

          My two penny worth.

      3. Anonymous Coward

        "It smells like a store-and-forward messaging system from the dawn of the mainframe age"

        Probably the case.

    4. Tom Paine Silver badge

      "millions of messages"

      Haven't seen the Sky TV etc stuff, but on R4 WatO on Monday he used a phrase like "millions of messages passing between the various systems". I interpreted that to mean IP packets. Of course I may very well be wrong!

      1. Mike Richards Silver badge

        Re: "millions of messages"

        Three quarters of those messages were probably copies of the ones I get from BA advertising holidays and telling me my points are about to expire because I haven't sufficiently braced myself yet for another whirl of the unique BA customer experience.

      2. Dwarf Silver badge

        Re: "millions of messages"

        You are probably 100% correct - The IP packets are probably "ICMP unreachable"

    5. TitterYeNot
      Facepalm

      "So by messaging he means some sort of enterprise service bus was taken down?"

      Sounds something like it. To quote Cruz - “we were unable to restore and use some of those backup systems because they themselves could not trust the messaging that had to take place amongst them.”

      So, production system suffers major power failure, production backup power doesn't kick in, and either:

      A) Power is restored to production but network infrastructure now knackered either due to hardware failure or someone (non-outsourced someone, obviously, 'coz he said so <coughs>) not saving routing and trust configuration to non-volatile memory in said hardware, so no messages forwarded.

      or

      B) DR is immediately brought online as the active system, but they then find that whatever trust mechanism is used on their messaging bus (account/ certificate/ network config) isn't set up properly so messages are refused or never get to the intended end-point in the first place, leaving their IT teams (non-outsourced IT teams, obviously, 'coz he said so <coughs>) scrabbling desperately through the documentation of applications they don't understand trying to work out WTF is going wrong.

      Same old story, again and again...

      - Mr Cruz, did you have backup power for your production data centre?

      - Yes definitely, the very best.

      - Mr Cruz, did you test your backup power supply?

      - Erm, no, that takes effort and costs money...

      - Ah, so you didn't have resilient backup power then, did you? Mr Cruz, did you have a DR environment?

      - Yes definitely, the very best money can buy, no skimping on costs, honest...

      - Mr Cruz, did you test failover to your DR environment?

      - Erm, no, that takes effort and costs money...

      - Ah, so you didn't have resilient DR capability then did you Mr Cruz?

      - Mr Cruz did......etc. etc. ad nauseam...

    6. TkH11

      messaging

      I doubt they are that modern to be using SOA architecture.

      I read a transcript of what he said the other day and he was alluding to network switches going down. So I think he's trying to dumb down his words for a non technical audience, messaging - aka packets being switched or routed across the network between servers and apps.

      1. Chris 239

        Re: messaging

        @TkH11

        I don't know Mr Cruz's background but I expect it was dumbed down FOR him not BY him.

        Or probably dumbed down several times through layers of progressively dumber management.......

      2. Matt Bryant Silver badge
        Thumb Up

        Re: TkH11 Re: messaging

        ".....he was alluding to network switches going down....." Ahem, not wanting to point fingers, of course - perish the thought! - but, knowing some of the "solution providers" involved in the designs of BA systems, has anyone asked Cisco for a quote about the resilience of their core switches in "power surge" situations?

        1. Anonymous Coward

          Re: TkH11 messaging

          TkH11 not sure re the Cisco question -

          http://www.computerweekly.com/news/4500254373/British-Airways-picks-Juniper-Networks-to-build-out-cloud-core-network

          Apparently the boxes bricking is a feature of what is used...

  2. Voland's right hand Silver badge

    Even if it is sourced locally

    He personally is grossly incompetent.

    IT is not a cost center in a modern airline. It is a key operational component and in fact a profit center. Without IT you cannot operate online bookings, notifications and most importantly you cannot dynamically price your flights. That is wholly dependent on the transactions being done electronically. All of the profit margin of a modern airline comes from dynamic pricing. If it prices statically it will be in the red.

    He, however, has systematically treated IT as a cost to be cut, not as a profit center to be leveraged. So even if he hired the staff for these systems locally, they were most likely hired under the market rate and the results are self-explanatory.

    1. Ledswinger Silver badge

      Re: Even if it is sourced locally

      IT is not a cost center in a modern airline. It is a key operational component and in fact a profit center.

      It's more than a profit centre. A modern airline is an IT business, one that just happens to fly aircraft. There are no manual processes to back up ops management, scheduling, pricing, customer acquisition, customer processing, ticketing and invoicing.

      Until BA (and many large businesses) get a grip on this concept and start treating IT (people, infrastructure, systems) as the core of their business, we'll continue to see this sort of screw up.

      1. Mark 110 Silver badge

        Re: Even if it is sourced locally

        I can almost guarantee that whilst the physical infrastructure was locally supported the application support will have been offshored. It's the typical model in these organisations. Always seems to end up costing more than the planned savings in my experience.

        The question to ask is when was the DR plan last updated? When was it tested? Was it successful?

        If the answer is 'what plan?', which isn't as uncommon as you might think, then someone's head will probably roll.

        1. Anonymous Coward

          Re: Even if it is sourced locally

          Oh, can I get back to you as I've just got to finish this important document.

      2. John Smith 19 Gold badge
        Unhappy

        "A modern airline is an IT business, one that just happens to fly aircraft. "

        Which echoes the comment that banks are IT businesses (big ones if they are retail) which just happen to have a banking license.

        There is at least one major IBM iSeries app that was basically a complete banking system, just add money, banking license and customer accounts.

        I wonder how many major lines of business have been so automated that manual reversion is simply impossible. I'm guessing the fruit and veg arms of all big supermarkets.

        1. A Non e-mouse Silver badge

          @John Smith 19 Re: "A modern airline is an IT business, one that just happens to fly aircraft. "

          I wonder how many major lines of business have been so automated that manual reversion is simply impossible. I'm guessing the fruit and veg arms of all big supermarkets.

          I read about the history of the LEO computer. Even back then, they realised that although this newfangled computer could improve things enormously, they were buggered if it went down. So they didn't go live with it until a second, backup, unit was in place.

          1. John Smith 19 Gold badge
            Unhappy

            "So they didn't go live with it until a second, backup, unit was in place."

            Sensible plan.

            It would seem other companies should follow such an example.

        2. Anonymous Coward

          Re: "A modern airline is an IT business, one that just happens to fly aircraft. "

          Ooops, did you see the Sainsbury's outage this week?

          1. Mike Richards Silver badge

            Re: "A modern airline is an IT business, one that just happens to fly aircraft. "

            I wonder if the two are connected?

            I ordered a pack of Sainsbury's own-brand lemons and I've ended up with a decidedly inorganic BA Boeing 747-400.

      3. Anonymous Coward
        Anonymous Coward

        Re: Even if it is sourced locally

        Could not agree more, but have you worked in a large enterprise and tried to explain that to the beanies?

    2. Charlie Clark Silver badge

      Re: Even if it is sourced locally

      He personally is grossly incompetent.

      And presumably already negotiating his exit so that blame doesn't spread up to the IAG board.

      So, he'll leave early on a fat settlement and the cuts will continue, presumably after a round of pink slips for those who just happened to be in the building.

    3. Tom Paine Silver badge

      Re: Even if it is sourced locally

      Devil's advocate for a moment: couldn't you say the same thing about electricity?

      1. Peter2 Silver badge

        Re: Even if it is sourced locally

        Devil's advocate for a moment: couldn't you say the same thing about electricity?

        Yes, you could.

        Which is why important servers have UPSes to ensure they don't lose power for more than a few milliseconds, and backup generators which can then ramp up and take over from battery backups.

  3. Anonymous Coward
    Anonymous Coward

    The excuse just doesn't fly with me.

    1. Korev Silver badge
      Coat

      It's plane that there's more to this

      1. Dan 55 Silver badge

        Don't get in a flap about it.

        1. Korev Silver badge
          Coat

          Try to wing it instead

      2. AndrueC Silver badge
        Joke

        Perhaps they've just been winging it for too long.

        1. Grunt #1

          Perhaps they are at Cruzing altitude?

          Communications fail.

        2. John Brown (no body) Silver badge
          Mushroom

          Yeah, Cruzing blindly into disaster.

    2. macjules Silver badge

      Well, there are great alternatives out there .. United for example

      PS El Reg: https://www.britishairways.com/en-gb/information/delayed-or-cancelled-flights - information on delayed and cancelled flights. Google is your friend (sometimes).

      1. Symon Silver badge
        Coat

        Alternatives.

        Charlie: Ray, all airlines have crashed at one time or another, that doesn't mean that they are not safe.

        Raymond: QANTAS. QANTAS never crashed.

        Charlie: QANTAS?

        Raymond: Never crashed.

        Wait, not true...

        https://www.itnews.com.au/news/qantas-blames-it-provider-for-check-in-system-outage-281697

        1. collinsl

          Re: Alternatives.

          Ah, well, we can explain that, the front fell off.

          1. yoganmahew

            Re: Alternatives.

            @collinsl

            For the two people in the world that have never seen it...

            https://www.youtube.com/watch?v=1snpT5KxaWo

        2. Mike Lewis

          Re: Alternatives.

          Qantas did crash, badly, in 1999. They switched to their backup data centre so they could test their main centre for year 2000 problems. The backup centre had a fire and they couldn't return to the main one.

  4. Norman Nescio Bronze badge

    Ethernet

    At a basic level, the progress of Ethernet datagrams around a network is 'messaging', so the problem could be as simple as a switch failing to operate properly, and a backup/fail-over process not working. Rebuilding a switch configuration from scratch in a data-centre might take a while, especially if documentation is missing or inaccurate.

    1. Warm Braw Silver badge

      Re: Ethernet

      Networks are supposed to be redundant - that's the whole point of spanning trees and routing protocols. A switch failing to operate shouldn't be catastrophic. And anyone found responsible for "missing or inaccurate" documentation in a critical operation of this kind should be hung from the cable trays as a warning to others.

      1. Norman Nescio Bronze badge

        Re: Ethernet

        @ Warm Braw

        "Networks are supposed to be redundant - that's the whole point of spanning trees and routing protocols. A switch failing to operate shouldn't be catastrophic. And anyone found responsible for "missing or inaccurate" documentation in a critical operation of this kind should be hung from the cable trays as a warning to others."

        I agree completely, they are supposed to be redundant, and the cause of 'missing or inaccurate' documentation does need to be determined and rectified.

        I'm not saying that a 'simple' switch failure is the cause of the BA issue. Just pointing out that 'messaging' in CEO-speak is not inconsistent with standard LAN protocols: you don't necessarily need to invoke message-passing applications, although that would be the standard interpretation of what the CEO said.

        It might surprise you to find out that spanning tree may well not be enabled. In the core of a big data-centre you are likely to have a pair (or more) of switches that are meant to share throughput in normal running, but be able to fail over to each other in case of need. This isn't done by spanning tree, and could be implemented on virtual switches, just to make things a little more complicated.

        The thing about change control on switches and routers is that it is hard. There are expensive solutions out there, but any device that has a running configuration and a stored configuration, where the running configuration can be changed without a reboot, is susceptible to someone making a change and not making the corresponding change in the stored configuration file. There are expensive solutions that force all* access to managed devices through a server where sessions are logged, keystroke-by-keystroke on an account-by-account basis, so you can see who did what to which device and when. Reviewing those logs is tedious, but sometimes necessary. In addition, most changes are automated/scripted to prevent typos. There are periodic automatic audits of the configuration on the device against the configuration stored in the management system. Despite all this, discrepancies occur.

        What could easily happen is a power glitch that kills one of two switches (high-end switches can have multiple power supplies and multiple control processors, but it is still possible for a dodgy power feed to bring one down terminally, possibly letting the magic blue smoke out). The other switch picks up the load, but - maybe the fail-over logic doesn't work properly so it reboots, or a power-cycle is needed - at which point you find the stored config being loaded is not the same as the old running config, and all hell breaks loose - systems that should be able to talk to each other can't; and systems that should be on isolated VLANs suddenly can chat to each other. You find you are using physical cabling that hasn't been used for a while, some of which has a fault, so you need to start tracing cables between devices using data-centre cabling documentation last updated by an external contractor whose mother-tongue wasn't English; and the technician in the data-centre has to come out of the machine room to talk to you because the mobile signal inside is poor and the noise from the fans so high you can't reliably hear him (or her), and there are no fixed-line phones. It can take days to sort out. Unfortunately I have experienced all of the above at one time or another. What should happen and what does happen can be remarkably different.

        I hope we find out what actually happened to BA, but I suspect a veil of 'commercial confidentiality' will be drawn over it. An anonymised version may turn up in comp.risks or on https://catless.ncl.ac.uk/Risks/ someday.

        *Of course, in 'an emergency' there are means of gaining access without going through the management system. Reconciling any changes made then with what the management system thinks the configuration should be is always fun.
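The periodic audit described above (device configuration versus the copy held by the management system) can be sketched roughly as follows. This is only an illustration: the function name `config_drift`, the sample VLAN text, and the rule that lines starting with `!` are comments are all assumptions made for the example, not anything taken from a real management product.

```python
import difflib


def config_drift(running: str, stored: str) -> list[str]:
    """Return unified-diff lines showing how the running config differs from the stored one."""

    def normalise(text: str) -> list[str]:
        # Drop blank lines and '!' comment lines (assumed comment style),
        # and strip trailing whitespace so cosmetic differences are ignored.
        return [
            line.rstrip()
            for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("!")
        ]

    return list(
        difflib.unified_diff(
            normalise(stored),
            normalise(running),
            fromfile="stored",
            tofile="running",
            lineterm="",
        )
    )


# Hypothetical example: someone added VLAN 30 live and never saved it.
STORED = """
vlan 10
 name booking
vlan 20
 name payments
"""

RUNNING = """
vlan 10
 name booking
vlan 20
 name payments
vlan 30
 name temp-fix
"""

if __name__ == "__main__":
    for line in config_drift(RUNNING, STORED):
        print(line)
```

An empty result means the two copies agree; any `+` line is something that would vanish on the next reboot, which is exactly the failure mode described in the comment.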

      2. Anonymous Coward
        Anonymous Coward

        Re: Ethernet

        My interpretation from his dumbed-down explanation is they lost power to the network switches. You have to remember, they've consistently referred to a power failure; if you've lost power to all the switches, then spanning tree won't help you.

        I'm suspecting a poorly maintained UPS, with knackered batteries, and they lost power to the entire equipment room.

        1. John Smith 19 Gold badge
          Unhappy

          "I'm suspecting a poorly maintained UPS, with knackered batteries,"

          Yes, that should do it.

          Dropping a spanner across a couple of power bus bars in the main electricity distribution room of a building is also quite effective. Spectacular to witness apparently, but I only saw the after effects in a company.

          In fact the premium rate power repair service took hours to turn up, the system wide UPS batteries had not been charged and the backup generator was due to be fueled next week. IOW a perfect storm.

          The Director level pyrotechnics were quite spectacular.

      3. Alan Brown Silver badge

        Re: Ethernet

        "Networks are supposed to be redundant - that's the whole point of spanning trees and routing protocols."

        Spanning tree operated over a domain more than 5 switches wide is a disaster waiting to happen (that's what it was designed for). It can be a disaster on domains even smaller than that if there's LACP involved (any LACP disturbances result in a complete spanning tree rebuild, so you don't want LACP to your servers unless the switches they're connected to don't use spanning tree to the rest of the network)

        Thankfully there are better alternatives.

        I deployed TRILL on our campus a couple of years ago. Whilst the switchmakers primarily push it as a datacentre protocol it was _DESIGNED_ for campus-wide/MAN applications and will work across WANs too. (Bypass Cisco and look around, there are a number of OEMs all selling kit using Broadcom's excellent Trident2+ descendants, with far better levels of support than Cisco sell)

        Naked TRILL does leave a (small) SPOF - routers for inter-subnet work, but that was plugged a while back: https://tools.ietf.org/html/rfc7956 - The distributed L3 extension to TRILL takes care of that nicely and means one less complicating factor can be taken out of the loop (no need for VRRP or OSPF or other failover protocols within the network, just at the edges)

        1. Anonymous Coward
          Anonymous Coward

          Re: Ethernet

          I hope "a couple" is > 2. TRILL was all very cool for the datacentre 5 or 6 years ago, but it's a dead duck nowadays (VXLAN is where it's at if you still need layer 2 - ideally with EVPN as the control plane). In the campus I can't see any reason to have layer 2 domains that span multiple switches...

    2. highdiver_2000

      Re: Ethernet

      I don't think it is as simple as a dead switch, even a humongous one.

    3. Bill M

      Re: Ethernet

      I set up a resilient IATA Type B interface for airline industry messaging at the tail end of last century with gateways at multiple geographic locations so it would all carry on working even if an entire data centre went TITSUP.

      A couple of months after this I was on a conference call with the Messaging Provider about another project and the concept of such resilient messaging was mooted and the Messaging Provider stated that it was impossible to set up geographically dispersed resilient messaging. When I said I had done just this for a previous project there was a pregnant pause followed by indignation from the Messaging Provider saying it was impossible and even if it was possible then we were not allowed to do it.

      The Global IT Director had been on the conference call and visited me a couple of days later to see how I had set up the resilience. Soon after the Global IT Director announced we were changing our Messaging Provider and bunged me a pay rise and a cash bonus.

      Makes me wonder what Messaging Provider BA uses.

      1. Tom Paine Silver badge

        Re: Ethernet

        Cool story. When _I_ point out elementary blunders or misapprehensions, I get told off for making the head of $DEPT look bad in front of his underlings :(

        1. Bill M

          Re: Ethernet

          He was a good Global IT Director.

          I can remember a panic conference call a few years later when a major system was effectively down due to performance issues. I was sorting the issue with a techie from the supplier, but the Global IT Director kept wittering on about other things and I snapped at him and told him to either shut up or fuck off.

          He did shut up and I got things sorted with the techie and back up soon after. I was a bit worried in case I had overstepped the mark when I got no phone calls or emails from him for a few days. But then I got an email from him bunging me another, albeit modest, pay rise.

    4. dajames Silver badge

      Re: Ethernet

      Rebuilding a switch configuration from scratch in a data-centre might take a while, especially if documentation is missing or inaccurate.

      Oh, I know this one ... or the documentation is complete, accurate, and thorough ... but everyone assumes that if there is any documentation at all it will be patchy, inaccurate, and misfiled, so they don't even look for it, and just make stuff up as they go along.

      (Cynic, moi?)

  5. mordac

    Where was the "power surge"

    Is anyone aware of a power failure that corresponds with the onset of this episode?

    1. Korev Silver badge

      Re: Where was the "power surge"

      It could also be internal to BA's premises.

    2. Arthur the cat Silver badge

      Re: Where was the "power surge"

      I've seen suggestions of a lightning strike, but that could be someone confusing Saturday's cock up with Sunday's thunderstorm. There are always people ready to put 2 and 2 together to make 22.

      1. Charlie Clark Silver badge
        Thumb Up

        Re: Where was the "power surge"

        There are always people ready to put 2 and 2 together to make 22.

        Have an extra upvote for that.

      2. Nick Kew Silver badge

        Re: Where was the "power surge"

        I've seen suggestions of a lightning strike,

        Might've been me, in the last El Reg commentfest on the subject.

        but that could be someone confusing Saturday's cock up with Sunday's thunderstorm.

        You mean, my comment posted here on Saturday referenced Sunday's alleged storm?

        More likely having observed the very big and long-lasting thunderstorms we had around the wee hours of Friday night / Saturday morning. Caused me to power down more electricals than I've done any time in my four-and-a-bit years since moving here. All my computing/networking gear, including UPS protection. Even the dishwasher, which had been due to run overnight.

        1. Arthur the cat Silver badge

          Re: Where was the "power surge"

          You mean, my comment posted here on Saturday referenced Sunday's alleged storm?

          I was thinking old fashioned print media (except read by the magic that is the Internet :-). Trouble is I have several in the news feeds and can't remember which it was.

          More likely having observed the very big and long-lasting thunderstorms we had around the wee hours of Friday night / Saturday morning. Caused me to power down more electricals than I've done any time in my four-and-a-bit years since moving here. All my computing/networking gear, including UPS protection. Even the dishwasher, which had been due to run overnight.

          $DEITY, that bad? Worst I've ever had in 30+ years round here (Cambridge) was a nearby lightning strike that took out an old fax modem and left everything else standing. Probably because the place is rather flat and the church steeples (and the UL) stick up above most other buildings.

      3. Tom Paine Silver badge
        Trollface

        Re: Where was the "power surge"

        Indeed. There are rather a lot of commentards here and on the previous story of the 27th offering detailed, blow-by-blow accounts of what must have happened, and they're all different -- except that everyone's dead certain Cruz is an incompetent idiot and that it's all due to the famous outsourcing deal.

        I'm not saying the outsourcing had nothing to do with it -- I'm not in the position to know -- and I don't take his assurances that it wasn't a factor completely at face value. I just wonder how so many people know so much more about it than I do, when we all read the same article...

    3. Dodgy Geezer Silver badge

      Re: Where was the "power surge"

      SSE and UK Power Networks, which both supply electricity in the area, said that there had been “no power surge”.

      1. TkH11

        Re: Where was the "power surge"

        At the start of the incident, all references were to a power failure. It's only Cruz that introduced 'power surge'. Power surges are generally caused by the national grid power in to the site supplied by the electricity company; power failures can be anything, including external to the site, or within the internal power distribution network in the data centre: UPS, generators, circuit breakers tripping.

        Either an attempt by Cruz to deflect blame on to the power company, or just a poor choice of words.

  6. frank ly Silver badge

    Just saying

    "Cruz's watchword in all interviews was “profusely” - that's the adjective chosen ..."

    'Profusely' is an adverb.

    1. Arthur the cat Silver badge
      Headmaster

      Re: Just saying

      'Profusely' is an adverb.

      You forgot the icon :-)

  7. Brett Weaver

    I'm not a BA Customer or Shareholder

    I just have 40 years experience in IT. The CEO should be fired (After he fires the CIO)..

    Actually I am sick and tired of organisations insisting that issues that have been addressed successfully for generations are somehow new, and different and they could not have anticipated....

    1. Anonymous Coward
      Anonymous Coward

      'could not have anticipated....'

      How this would affect their bonuses.... Guys like this never fall on their sword. It's capitalism 101... Screw everyone else as long as the CEO and Snr Execs get paid!

    2. Dunstan Vavasour
      Coat

      Re: I'm not a BA Customer or Shareholder

      "...issues have been addressed to make sure this can never happen again until next time"

  8. Blotto

    Sounds like they had some type of encryption system running, perhaps encrypting their WAN traffic that relies on a central key server that went awol.

    Something like a Group Encrypted Transport VPN

    http://www.cisco.com/c/en/us/td/docs/ios-xml/ios/sec_conn_getvpn/configuration/xe-3s/sec-get-vpn-xe-3s-book/sec-get-vpn.html

    Very secure but

    Lose your key server / config or timings and things go bad quickly,

    Loss or delay of replication leading to out of sync data leading to corruption leading to a mess.

    1. Anonymous Coward
      Anonymous Coward

      There are stabs in the dark and then there are stabs in the dark

      1. Anonymous Coward
        Anonymous Coward

        Gollum? Is that you, my preciouses?

      2. eldakka Silver badge

        And then there are those who forgot to pick up a knife in the first place... ;)

    2. TkH11

      Encrypted traffic

      F. Me! I think I've just seen a UFO. Anybody else got any entirely random and made up theories?

      Let me offer this one: a cat found a small hole in the side of the building, jumped up on to the circuit breakers and urinated. Bang! circuit breakers blew, crashing the databases whilst they were in the process of writing records.

      1. Grunt #1

        Re: Encrypted traffic

        How was the cat?

  9. Codysydney

    You can have all the redundancy you want, you can think you're completely safe, but if you don't TEST IT, you may as well be running everything on a single server powered by mice.

    1. Anonymous Coward
      Anonymous Coward

      There is Testing it and there is Testing it

      Doing a once only failover at 02:00 on a Monday morning is not going to cut it.

      Doing it regularly at 02:00 on a Monday morning is a step in the right direction

      Doing it at least once on a very busy holiday weekend is testing it but only once you are really confident in the previous step.

      Doing a DR failover with little or no load is one thing. Doing it when the system is 80% loaded is IMHO a proper real world test.

      In my last job (also in the Airline Industry which was outsourced to India) we would send someone to the main DC and tell them to sit there all day. At some point in the day they would hit the big RED button that would power down that DC. They would do it without warning to anyone else. That way we would know if the systems could fail over properly and come up again when power was restored.

      All that anyone outside of the person tasked with doing the power down knew was that sometime that day, a failover would happen.

      Almost the first thing that the outsourced team did was to put a stop to those regular DR tests. The reason they gave was that they'd have to send almost the whole team to the site to be there when it happened. That would be too expensive and time consuming.

      After a 'Doh' moment, I took my redundancy and left them to it.

      As I said, there is Testing and there is TESTING!

      1. richard_w

        Re: There is Testing it and there is Testing it

        Having worked there, I strongly suspect that there is not enough confidence in the DR capability to try it out in anger under full load, for fear of exactly this

      2. Anonymous Coward
        Anonymous Coward

        Re: There is Testing it and there is Testing it

        It's noticeable how prominent those big Red Emergency Power Breaker buttons look to a disgruntled employee whose job is being outsourced pretty soon and knows it won't end well for the Company involved. Especially when it controls the power to a (hypothetical, of course) data centre the size of 5 football pitches.

        I know of one example - a US company was very jittery about their size and prominence in the newly built Data Centre, when they set up a European Base. It's odd, the need to explicitly explain that you do need to be able to quickly find these big red buttons to kill the circuit breaker, in an Emergency, emphasizing the point, you don't let anyone into those areas you have any doubts about.

        1. Anonymous Coward
          Anonymous Coward

          Re: There is Testing it and there is Testing it

          And sometimes for special customers who are concerned about Data Center resilience, they are invited to come along and turn the power off themselves with the BRB (Big Red Button). It seems that only half of them actually hit the BRB, the other half chicken out and the Data Center manager has to do it himself.

      3. Anonymous Coward
        Anonymous Coward

        Re: There is Testing it and there is Testing it

        "Doing it at least once on a very busy holiday weekend is testing it but only once you are really confident in the previous step."

        ... however, can you imagine the headlines etc if we get the report that "BA reveal the bank holiday IT failure was due to their IT department deciding to test their failure recovery system on one of the busiest days of the year. The spokesperson added that 'the IT department were confident from previous tests that this would work but unforeseen circumstances brought the system to a standstill'. BA would like to thank all the passengers that have been delayed for their patience and for playing their part in this learning experience"

      4. A Non e-mouse Silver badge

        Re: There is Testing it and there is Testing it

        I think this is happening because IT is often siloed into "systems" (SAN, Networking, Servers, etc) rather than services. Which means that few (if any) people really know how everything glues together.

        (Anyone remember this IBM TV advert from a few years back...?)

        1. Roland6 Silver badge

          Re: There is Testing it and there is Testing it

          I think this is happening because IT is often siloed into "systems" ... rather than services. Which means that few (if any) people really know how everything glues together.

          Another cause is that few enterprise applications (off-the-shelf or bespoke) have been truly written to incorporate instrumentation and measurement hooks for service monitoring and mapping, and even fewer organisations have systems management systems that do service level monitoring of applications. So few (if any) people really know which interfaces/systems are the bottlenecks and critical points in the business functions/application's cloud.
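As a rough illustration of what "instrumentation and measurement hooks" might look like at the application level, here is a minimal sketch. Everything in it is hypothetical: the `instrumented` decorator, the in-memory `METRICS` store, and the `lookup_booking` function are invented for the example, and a real deployment would ship these numbers to a monitoring system rather than keep them in a dictionary.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-service counters: calls, errors and cumulative latency (in-memory, for illustration only).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(service_name: str):
    """Decorator that records call count, error count and elapsed time for a service call."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[service_name]["errors"] += 1
                raise  # the hook observes the failure, it does not swallow it
            finally:
                METRICS[service_name]["calls"] += 1
                METRICS[service_name]["total_seconds"] += time.perf_counter() - start

        return wrapper

    return decorator


@instrumented("booking.lookup")
def lookup_booking(ref: str) -> dict:
    # Stand-in for a call to a downstream system.
    if not ref:
        raise ValueError("empty booking reference")
    return {"ref": ref, "status": "confirmed"}
```

With hooks like this on every interface, the call and error counts per service give at least a first answer to "which interfaces are the bottlenecks and critical points".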

      5. Roland6 Silver badge

        Re: There is Testing it and there is Testing it

        Re: As I said, there is Testing and there is TESTING!

        Agree, hitting "the big RED button that would power down that DC" is a very dramatic, but quite simple failure, basically I would list it in the 'clean' fail list. What I suspect happened at BA was a dirty fail combined with a rather large systems estate - 200 (application) systems is rather a lot of systems to keep in sync.

        Suspect in the final analysis, the fault, that actually caused the recovery failure, will be found in some poorly written Web2.0 application (ie. written without regard for the constraints of distributed real-time transactional processing).

    2. Bill M

      Correction: Not "powered by mice", but "powered by a single mouse"

  10. Jos V

    No more?

    "Cruz has also promised never again have such an experience with BA"

    Why, because soon he will be "pursuing different opportunities?"

    1. Aitor 1 Silver badge

      Re: No more?

      Or maybe spend some quality time with the family, while crying over the unfair treatment and using bills to heat the mansion.

  11. Anonymous Coward
    Anonymous Coward

    'backup power systems then failed'

    - Well, if you're a terrorist, now you know which areas BA is vulnerable on...

    - Like the recent United meltdown it doesn't show much actual leadership, just finger pointing and excuse making while this CEO waits for his million-dollar-bonus and golden parachute, until the next elite arrives.

    - If anything too, it shows growing distancing between the elite and the workers, in everything except PR statements: 'sure aren't we all on the same shop floor lads'???

    - Have to say it was nice watching this particular CEO squirm. BA kept $2K after they refused to let us embark recently and wouldn't return the money, so forgive a moment of smugness.

  12. jake Silver badge

    Regardless of anything else ...

    ... it would seem that BA is run by blithering[0] idiots.

    [0] That's my verbed adjective of the week.

    1. Charlie Clark Silver badge
      Headmaster

      Re: Regardless of anything else ...

      I think you mean gerundive. Though it could also be a straight participle. I guess it depends on the amount of blithering going on.

      1. jake Silver badge

        Re: Regardless of anything else ...

        Not a lot of gerundiving in English, but there is a whole lotta verbin' goin' on. Ceteris paribus, of course.

        1. Uncle Slacky Silver badge
          Unhappy

          Re: Regardless of anything else ...

          Verbing weirds language.

          1. Z80

            Re: Regardless of anything else ...

            https://xkcd.com/1443/

      2. Bill M

        Re: Regardless of anything else ...

        Got to be very careful messing about with any gerundive without protection - all too easy to catch gerundivitus.

  13. MonkeyCee Silver badge

    Latest config not saved

    All right, my bet is that the power failure is caused by putting too much load on the circuits at one time. Either from a mass reboot or the cooling systems not behaving nicely, or being mismanaged. Some piece of vital network kit wasn't on the UPS, and lost its current config on reboot. Or mangled it in a fun way.

    1. Korev Silver badge

      Re: Latest config not saved

      Someone on the other thread is claiming that there was a mass reboot due to patching which caused lots of circuit breakers to pop.

      1. Anonymous Coward
        Anonymous Coward

        Re: Latest config not saved

        Mmmh, proper design of a datacenter won't allow too many machines to turn of at the same time. There are managed PDUs which monitor the load and can control the power on/off sequence.

        Also a hot reboot doesn't drain as much power as a cold one - but even patching of large installations is done in a rolling fashion.
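The staged power-on idea mentioned above can be sketched as a simple batching problem: switch machines on in groups so that the combined inrush current of any one group stays under the circuit's limit. This is an illustrative sketch only; the function name, host names and ampere figures are made up, and it does not model how any particular managed PDU actually schedules its outlets.

```python
def power_on_batches(inrush_amps: dict[str, float], limit_amps: float) -> list[list[str]]:
    """Group hosts, in the order given, into batches whose summed inrush stays within limit_amps."""
    batches: list[list[str]] = []
    current: list[str] = []
    load = 0.0
    for host, amps in inrush_amps.items():
        if amps > limit_amps:
            raise ValueError(f"{host} alone exceeds the circuit limit")
        if load + amps > limit_amps:
            # Adding this host would trip the (assumed) breaker limit: start a new batch.
            batches.append(current)
            current, load = [], 0.0
        current.append(host)
        load += amps
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical rack: four servers at 6 A inrush each on a 16 A circuit.
    rack = {"srv-a": 6.0, "srv-b": 6.0, "srv-c": 6.0, "srv-d": 6.0}
    for number, batch in enumerate(power_on_batches(rack, 16.0), start=1):
        print(number, batch)
```

Each batch would be powered on, allowed to settle, and only then followed by the next, which is the same principle as the rolling patching the comment describes.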

    2. Alan Brown Silver badge

      Re: Latest config not saved

      "Allright, my bet is that the power failure is caused by putting too much load on the circuits at one time. Either from a mass reboot or the cooling systems not behaving nicely, or being mismanaged."

      If any of the above happen, then the power system is mismanaged.

      If it can't survive a mass reboot then you beef it up or don't let such a thing ever happen

      I can't fathom DCs which appear to only have one (or two) cooling systems. This falls under "not a good idea" for a number of reasons - and after some nasty experiences with dickheaded wiring on "centrally managed" cooling(*), I'm minded to insist on completely independent command and control systems for each one even if that puts the cost up a little.

      (*) Hint: Guess what happens to your 5 "independent" cooling systems if the central panel goes "phut" ?

  14. Anonymous Coward
    Anonymous Coward

    As a 3rd party....

    ...a link between our systems was down on Saturday, rejecting connections from us to them.

    A restart of a "system" their end fixed it Saturday evening.

    That's all the info we have.

    Anon for obvious reasons.

    1. Mike Richards Silver badge

      Re: As a 3rd party....

      Are you saying they turned it off and on again?

      1. Doctor Syntax Silver badge

        Re: As a 3rd party....

        "Are you saying they turned it off and on again?"

        It seems to have been a case of turned it off.

    2. Anonymous Coward
      Anonymous Coward

      Re: As a 3rd party....

      Anon for obvious reasons.

      Umm, not quite sure how to say this, but...

  15. pele

    One possible explanation, since he mentions that backup systems would not trust other systems, is that the backups were running expired certificates (for db connections etc., or for the backup sub-domain), so the backup machines would talk to each other but would not talk to the primaries (because of the aforementioned certificate issues), since no one ever bothered to check whether they even could (cross-talk between the mix of primary and backup boxen).

    At my previous 70k-plus-employee organisation we had a team of 5 who would only worry about certificates and keep them fresh on all machines all day every day. We even had a random person in charge of changing batteries in servers on any given day for chrissake! And there were no "backup systems" per se, everything was live at all times, in parallel, but spread across 3 physical locations. And we would be knocking machines out daily in the process of rolling out new versions and patches, so we were constantly testing "backup systems". Every Friday the PSUs for the building were kicking in just for fun. As did the fire and smoke alarms. Something like this BA "incident" was not possible.

    So to my mind the CIO/CTO bear the full responsibility for this, together with a lot of people downstream from there. And the CEO just for choosing an incompetent CTO/CIO for the job. And the whole board for bringing on-board such an incompetent CEO. YES, outsourcing crucial brain power to the Indian subcontinent DOES have EVERYTHING to do with this episode. Just like it did for RBS 5 years ago: an assowl comes along who thinks he can just "upgrade" the MQ and "let's just clear this stupid queue there, who needs that anyway". I say someone needs to fine BA heavily and use the money to re-hire all the laid-off people. And then re-nationalise the whole lot.

    No, I never worked for BA, nor am I a member of any workers' union. I was just a well-pleased BA frequent flyer who is sad to see such a great airline being degraded by small greedy men who have no basic understanding of the quality they (used to) represent.
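
    For anyone wondering what "keeping certificates fresh" looks like once it is automated, here is a minimal Python sketch that flags certificates which have expired or are close to expiry. The inventory labels, the dates and the 30-day threshold are invented for the example and say nothing about BA's setup; the only real library call is `ssl.cert_time_to_seconds`, which parses the `notAfter` text format found in a certificate.

```python
import ssl

SECONDS_PER_DAY = 86400


def days_left(not_after: str, now: float) -> float:
    """Days until expiry; a negative value means the certificate has expired."""
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def find_expiring(certs: dict, now: float, threshold_days: float = 30) -> list:
    """Return (label, days_left) pairs at or under the threshold, most urgent first."""
    flagged = [(label, days_left(not_after, now)) for label, not_after in certs.items()]
    return sorted(
        (item for item in flagged if item[1] <= threshold_days),
        key=lambda item: item[1],
    )


if __name__ == "__main__":
    # Hypothetical inventory: label -> certificate notAfter string
    inventory = {
        "primary-db": "Jan  5 00:00:00 2018 GMT",
        "backup-db": "Dec 25 00:00:00 2017 GMT",
        "message-bus": "Jan  1 00:00:00 2019 GMT",
    }
    now = ssl.cert_time_to_seconds("Jan  1 00:00:00 2018 GMT")
    for label, remaining in find_expiring(inventory, now):
        print(f"{label}: {remaining:.0f} days left")
```

    The point of running something like this against primaries and backups alike is that an expired certificate on a standby box is invisible until the day you fail over to it.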

    1. kmac499

      It's looking increasingly likely that, whatever the initial cause of failure, the recovery was hampered by poor or zero maintenance of backup hardware, software and configurations.

      Maybe the CIO should take a leaf out of the aircraft maintenance process. Scheduled checks and refurbishments, signed checklists stamped by authorised\certified engineers as dictated by external regulators.

      I wonder what BA's public liability insurance quote is going to look like next year?

      1. pele

        You don't need to take anything out of aircraft trade - you need, however, to have listened in CS208 Software Engineering class when the lecturer was talking about software lifecycle management etc etc. Which would have been a requirement to finish off your Comp Sci degree. Which I am certain would have been a requirement to have been hired as a CTO in the first place.

        This mentality of blaming EVERYTHING nowadays on "IT issues" is getting on my nerves big time. All cock-ups that incompetent idiots cause can simply be referred to your friendly local IT guy who'd be more than happy to take all the blame. Or, failing that, a "terrorist threat".

      2. Doctor Syntax Silver badge

        "Maybe the CIO should take a leaf out of the aircraft maintenance process. Scheduled checks and refurbishments, signed checklists stamped by authorised\certified engineers as dictated by external regulators."

        Way too expensive. That's why we're outsourcing it.

      3. Bill M

        That comment now makes me worry whether BA has compromised their aircraft maintenance process as well.

        1. John 78

          > That comment now makes me worry whether BA has compromised their aircraft maintenance process as well

          The "aircraft maintenance process" is overseen by the CAA, to stop people doing shortcuts.

          1. Alan Brown Silver badge

            > The "aircraft maintenance process" is overseen by the CAA, to stop people doing shortcuts.

            And all the "paperwork" is on..... computers in BA's datacentre.

            As were all the loadout calculations and a bunch of other safety critical functions.

  16. Ochib

    “All the parties involved around this particular event have not been involved with any type of outsourcing in any foreign country ,” he said [Reg emphasis]. “They have all been local issues around a local data centre who has been managed and fixed by local resources.”

    So where is the data centre? Because you could read this as:

    "All the parties involved around this particular event have not been involved with any type of outsourcing in any foreign country [Data Center and Staff are in the same town]. They have all been local issues around a local data centre [Based in Mexico] who has been managed and fixed by local resources [Also based in Mexico]."

    1. Nick Ryan Silver badge

      Precisely, his choice of weasel-words was careful. At no point did he mention where the data centres were located. "foreign country" is relative to the origin point.

  17. Ian Emery Silver badge

    The Compensation Documents

    Can be found in a filing cabinet, in an unused toilet in the basement, and behind a sign reading "Beware of the Leopard".

    1. Antron Argaiv Silver badge

      Re: The Compensation Documents

      While the excuses...can be found here:

      http://pages.cs.wisc.edu/~ballard/bofh/bofhserver.pl

  18. lglethal Silver badge
    WTF?

    um what?

    "There are no redundancies or outsourcing taking place around this particular hardware live operational systems resilience set of infrastructure in this particular case."

    Is this a misquote, or did he really say something which can best be described as gibberish or perhaps a poor attempt at bullsh%t bingo?

  19. Anonymous Coward
    Anonymous Coward

    My sympathies are with the passengers caught up in this...

    Like many (all?) commentators here, I get the feeling there's a lot more to this than a simple random act of God power surge type event. And I also get the feeling we'll be seeing more and more of this as companies that don't consider themselves as IT-reliant refuse to invest in keeping their systems up-to-date with the demands placed on them, outsource vital support to far away and cheap lands, and generally believe that fizzy TV ads and pastel coloured websites are more than enough to keep the bad things at bay.

    People who work at these companies are getting fed up with being treated like dirt, the people in the far away lands (fill in the applicable country yourself) are starting to appreciate what being called a 'low cost environment' really means, are starting to appreciate their importance to the company, and are starting to quietly down tools at critical moments. This will only get worse as every company engages in a desperate race to the bottom.

    But my sympathies go out to all the passengers caught up in this nonsense - air travel is horribly soul destroying and stressful at the best of times, so to add to the misery almost feels criminal. But alas it's going to take a major air catastrophe before we see any sort of real improvements.

  20. JimmyPage Silver badge
    Headmaster

    EFFECTED ?????

    I think he meant "affected".

    What the fuck is going on with the UK that the CEO of an alleged serious global company is allowed to make such a schoolboy error ????

    Also, really, El Reg should have noted it as "(sic)"

    It really is a bad day for the image of UK education when a French and an Indian colleague can point out a Brit's bad grammar .....

    1. pele

      Re: EFFECTED ?????

      "Education, education, education!"

      Remember that?

      1. Charlie Clark Silver badge

        Re: EFFECTED ?????

        "Education, education, education!"

        Yes, it's considered too expensive. Gruel is already coming back, the workhouses won't be long.

    2. Nick Kew Silver badge
      Headmaster

      Re: EFFECTED ?????

      Spoke to Sky News ...

      Effected and Affected being near-homophones, the error would then appear to be one of transcription.

    3. 2+2=5 Silver badge
      Facepalm

      Re: EFFECTED ?????

      > I think he meant "affected".

      I don't think the people who've had their honeymoons and various "holidays of a lifetime" ruined are worried about a small grammatical slip.

    4. Anonymous Coward
      Anonymous Coward

      Re: EFFECTED ?????

      A lot of English people get it wrong, such is our wonderful edukation system where everything gets A and A*s in their GCSEs, this guy is Spanish. At least allow him a bit of slack.

      1. Anonymous Coward
        Anonymous Coward

        Re: EFFECTED ?????

        I bet his grasp of grammar is better than mine.

  21. George.Zip

    I'm thinking the answer is quite simple, as part of the bank holiday deep clean a Henry vacuum went rogue having been plugged directly into the same PDU as the switch marked "Really important do not touch".

    Ooops

  22. PNGuinn Silver badge
    Trollface

    Explanations, reasons, excuses ...

    He could always get some underling to google for some BOFH excuses ... eg

    excuse no: 102

    Power company testing new voltage spike (creation) equipment

    excuse no: 254

    Interference from lunar radiation

    Go on, you know you want to....

  23. xyz

    how about...

    the grandfather clock syncing their stuff reset itself to 1971

  24. Fenton

    Hardware vs software

    Even if the hardware is locally maintained, it's the software that goes tits up when the hardware fails. And it's the software management that has been outsourced.

    The software is old and does not have automated resyncing (taking referential integrity into account), and there are likely to be hundreds of little scripts that have been created to get around functionality gaps, many of which will not have been documented or architected properly (likely to have been created to fix a P1 issue).

    I've been part of many an outsourcing deal, with perspective from both sides.

    The recipient doesn't know what they don't know, the people giving up the system don't know what they have forgotten and are generally pissed off they may be redundant soon.

    The best moves have been the ones where a ground up review/re-write/rearchitect/retest have been part of the move adding new functionality, plugging gaps properly and properly re-testing.

    1. Commswonk Silver badge

      Re: Hardware vs software

      "...many of which will not have been documented or architected properly..."

      There is still no verb "to architect".

      1. Doctor Syntax Silver badge

        Re: Hardware vs software

        There is still no verb "to architect".

        Could you please suggest a pre-existing alternative.

        The implication of the word is something at a higher level than "design" would cover, dealing with the overall form and how the components fit together but not quite the same as "specify".

        If there isn't such a verb - and off-hand I can't think of one - then it might be necessary to devise one. Importing a word from one part of speech into another is a long established practice in English. All it requires is for enough people to do it so that it becomes accepted. Objections along the lines of "there's no such word" seem to be part of the process.

        1. Commswonk Silver badge

          Re: Hardware vs software

          There is still no verb "to architect".

          @Doctor Syntax: Could you please suggest a pre-existing alternative.

          Try this:

          ...there are likely to be hundreds of little scripts that have been created to get around functionality gaps, many of which will not have been documented or [architected] structured properly...

      2. jeremyjh

        Re: Hardware vs software

        There may not yet be such a verb, but keep saying so and there will be.

    2. Doctor Syntax Silver badge

      Re: Hardware vs software

      "The best moves have been the ones where a ground up review/re-write/rearchitect/retest have been part of the move adding new functionality, plugging gaps properly and properly re-testing."

      And we all know what the chances of that happening are when the objective is cost-saving.

    3. TkH11

      Re: Hardware vs software

      Smoke and mirrors and politician double talk. To be a CEO you have to be good at that.

      Hardware failed, was restored by local staff, but application and system support, done remotely from India, I bet.

  25. richard_w

    I dealt with BA as a supplier and I also worked there. As a supplier, the IT department was super fast to blame us when there was a hardware problem, but refused point blank to take the steps which would have avoided the problem in the first place. Because it would have cost more, or because they did not need to do what we suggested on their small systems (as opposed to multi terabyte enterprise database systems).

    When I worked for them it became apparent that the multiplicity of disparate systems all of which had to communicate to keep the airline running were an accident waiting to happen.

    1. Nick Ryan Silver badge

      When I worked for them it became apparent that the multiplicity of disparate systems all of which had to communicate to keep the airline running were an accident waiting to happen.

      <= THIS

      Organisations need to periodically refactor and simplify their systems, particularly after what can sometimes be decades of accumulation and evolution. Yes, there is a risk in doing this, however there is probably more of a risk in not doing this because of unknown, undocumented systems being somehow critical to operations. Like insurance, it's a gamble and in this case BA lost - now it's just a case of them working out how much money they lost as a result.

      Some organisations refactor and simplify on a continual basis which, if done well, should remove the need for more drastic measures later. This requires more ongoing investment and planning, but the upside is better systems overall and, most likely, lower cost in the long run.

      1. richard_w

        In BA's case there were 2 things which militated against simplification. One was a lack of understanding of the individual architectural components (for instance there was a mission critical PC system in many airports around the world which exactly one person in the airline understood). If it went wrong, they put him on a plane (first of course), to fix it.

        The other was the fact that many of the systems dated from the 1980s or possibly earlier, and had been changed repeatedly. The only way to fix this was to re-architect them. Not quick, expensive, and gave no bang for your buck, unless you improved the functionality from the business perspective.

        1. Anonymous Coward
          Anonymous Coward

          Everything was a profit centre at BA when I was there (too long ago and too unimportant in the scheme of things to have any insight into this fuck up). Not as in profitable, but in that it had to make a profit otherwise bad things would happen.

          And with a set-up like that, IT is considered a dead weight because it's got no external clients, instead of being the backbone of the company.

          1. Grunt #1

            Internal charging.

            Internal charging would fix that.

      2. Dunstan Vavasour

        If there's a part of your environment that everybody's afraid to fix in case it breaks, that's the first bit you should be fixing.

      3. Anonymous Coward
        Anonymous Coward

        I think BA did refactor their systems, to modernise them. But poor design, low calibre software engineers, possibly cheap off-shore outsourcing of the development, and inadequate testing are what is now leading to most of the problems BA is experiencing. I think the most recent incident is probably a combination of an initial hardware problem and then a massive issue of trying to recover crashed applications and databases, with messaging systems having left apps in inconsistent states and thus very difficult to recover.

    2. Anonymous Coward
      Anonymous Coward

      Risk Appetite

      I work in the transport business supporting some large systems. The level of risk appetite is scarily low. I could tell you a few horror stories, but I'd be recognised. Hardware and software are not updated and operating systems are not patched, because the customers are too frightened that it will cause future operational issues. The customer is frightened to reboot servers in case they don't come back up.

      1. Anonymous Coward
        Anonymous Coward

        Re: Risk Appetite

        That would probably be the same transport company that has no meaningful DR plan, and wouldn't know which systems to restore in what order in the event of an outage. Senior exec capability is scarily low in both BAU and projects. Actually I think most transport cos are like this from what I've seen...

  26. Anonymous Coward
    Anonymous Coward

    How was a surge possible?

    They not using UPS kit?

    Also, am I getting the wrong end of the stick here in assuming that their live kit and BC kit was in the same building?

    1. Anonymous Coward
      Anonymous Coward

      Re: How was a surge possible?

      Probably wasn't a surge. They initially said it was a power failure. It was only Cruz that mentioned a surge. And UPSs are not infallible: the batteries need to be maintained and only have a finite life.

  27. John Smith 19 Gold badge
    Unhappy

    A note on message passing.

    In IBM MF land message queues (msgq in the AS400 command language) are effectively named pipes which can link processes. They can expand if the "writer" is producing a lot more data than the "reader" can accommodate at any one time. IIRC they can also do character set translation (EG EBCDIC to ASCII) which is handy given a lot of stuff is not EBCDIC as standard.

    BTW there is also an MS version of MQ series.

    I can't recall whether, if the reader dies, that can pause the writer process or the queue just keeps getting bigger (the simple programming option is that the MQ just deals with it. No special case handling required).
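
    The two behaviours are easy to show in miniature. The following is a generic Python sketch using the standard `queue` module, not a statement about how IBM MQ behaves: a bounded queue pushes back on the writer once the dead reader stops draining it, while an unbounded one just keeps growing. The `produce` helper and the message names are made up for the example.

```python
import queue


def produce(q: queue.Queue, count: int) -> int:
    """Try to enqueue `count` messages without blocking; return how many were accepted."""
    accepted = 0
    for n in range(count):
        try:
            q.put_nowait(f"message-{n}")
            accepted += 1
        except queue.Full:
            # A bounded queue pushes back on the writer once it is full
            break
    return accepted


if __name__ == "__main__":
    # The reader is "dead": nothing ever calls get() on either queue
    bounded = queue.Queue(maxsize=3)
    unbounded = queue.Queue()  # maxsize=0 means no limit
    print("bounded accepted:", produce(bounded, 10), "depth:", bounded.qsize())
    print("unbounded accepted:", produce(unbounded, 10), "depth:", unbounded.qsize())
```

    Either choice needs an operational answer: with a bounded queue someone has to decide what the writer does while it is refused, and with an unbounded one someone has to watch the depth.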

    I can see the joker in the pack being different processes dying at different times, leaving different queues holding mixed amounts of good and bad data that are not synchronised, making it very difficult to decide which entries (BTW they are called "messages" but the definition of "message" is very flexible) to discard.

    However these issues are completely predictable and MF devs and ops have been dealing with them for decades. BA should definitely have some tools to manage this and some procedures in the Ops manual to use them.

    As for configuration I find it very hard to believe that in 2017 a business this big does not have a set of daemons checking all its network hardware and recording their actual (working) configurations.
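
    The core of such a daemon is small. Here is a minimal Python sketch, under the assumption that something else has already fetched each device's running configuration as text: it records a fingerprint of every known-good configuration and later reports anything that has changed, gone missing or appeared unrecorded. The device names and status labels are hypothetical.

```python
import hashlib


def fingerprint(config_text: str) -> str:
    """Stable digest of a device's configuration text."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def record_baseline(configs: dict) -> dict:
    """Map each device name to the fingerprint of its known-good configuration."""
    return {device: fingerprint(text) for device, text in configs.items()}


def detect_drift(baseline: dict, current_configs: dict) -> dict:
    """Compare freshly collected configurations against the recorded baseline."""
    report = {}
    for device, known in baseline.items():
        if device not in current_configs:
            report[device] = "unreachable"
        elif fingerprint(current_configs[device]) != known:
            report[device] = "changed"
    for device in current_configs:
        if device not in baseline:
            report[device] = "unrecorded"
    return report


if __name__ == "__main__":
    baseline = record_baseline({"core-sw1": "vlan 10\n", "core-sw2": "vlan 20\n"})
    today = {"core-sw1": "vlan 10\n", "core-sw2": "vlan 30\n", "edge-sw9": "vlan 40\n"}
    print(detect_drift(baseline, today))
```

    A real tool would also keep the full text of each recorded configuration, since a fingerprint tells you that a switch changed but not what to restore.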

    This is also one of those moments when labeling all those cables and power plugs with what they power and what they should be plugged into turns out to be quite a good idea.

    So much HA and DR is not in the moment. It's in the months of prep before the moment.

    1. Anonymous Coward
      Anonymous Coward

      Re: BTW there is also an MS version of MQ series.

      I well know - I integrated it into the Avery "Weighman" suite to connect to British Steel (as was).

      It was designed to be a reliable protocol, on top of the unreliable TCP/IP stack - so no data loss. Unread messages are queued (MQ - "Message Queuing") until they can be processed. Obviously there is a theoretical upper limit predicated on memory and disk space, but if you hit that, you have far more problems than a missed message.
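
      For the younger readers, the principle fits in a page. This is a toy Python sketch of store-and-forward with acknowledgements, not MQ's actual protocol: the sender keeps each message until it is acknowledged, and the receiver ignores duplicates caused by retries. `Receiver`, `FlakyLink`, `drain_outbox` and the fault names are all invented for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    seen: set = field(default_factory=set)
    delivered: list = field(default_factory=list)

    def receive(self, msg_id, body) -> bool:
        # A retry of an already-processed message is acknowledged but not processed twice
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.delivered.append(body)
        return True  # acknowledgement


class FlakyLink:
    """Applies a scripted list of faults: "lose_message", "lose_ack" or None."""

    def __init__(self, receiver: Receiver, faults):
        self.receiver = receiver
        self.faults = list(faults)
        self.attempts = 0

    def send(self, msg_id, body) -> bool:
        self.attempts += 1
        fault = self.faults.pop(0) if self.faults else None
        if fault == "lose_message":
            return False  # never arrived, so no acknowledgement
        acked = self.receiver.receive(msg_id, body)
        if fault == "lose_ack":
            return False  # arrived, but the sender never hears about it
        return acked


def drain_outbox(outbox, link: FlakyLink, max_attempts: int = 5) -> list:
    """Send each queued message until acknowledged; return those still undelivered."""
    undelivered = []
    for msg_id, body in outbox:
        for _ in range(max_attempts):
            if link.send(msg_id, body):
                break
        else:
            undelivered.append((msg_id, body))
    return undelivered


if __name__ == "__main__":
    receiver = Receiver()
    link = FlakyLink(receiver, ["lose_message", "lose_ack", None])
    left_over = drain_outbox([(1, "weigh-in"), (2, "weigh-out")], link)
    print(receiver.delivered, left_over, link.attempts)
```

      The email answer to "what if it doesn't arrive?" is that nobody finds out. Here the sender always knows which messages are still sitting in the outbox.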

      While we're talking reliable protocols ... am I alone in finding that more (younger) people really struggle with the idea ... ("We'll send it by email.", "er, what if it doesn't arrive ?", "but that never happens ...").

      1. Alan Brown Silver badge

        Re: BTW there is also an MS version of MQ series.

        > am I alone in finding that more (younger) people really struggle with the idea ... ("We'll send it by email.", "er, what if it doesn't arrive ?", "but that never happens ...").

        No you're not, and many of the worst offenders are people old enough and experienced enough to know far better.

        They're also the same people who use email as a multi-terabyte document repository, then complain because the servers (which haven't been upgraded to cope) are struggling/crashing, or won't let you change the storage methods to something which _WILL_ cope because of the "risk"

    2. Alistair Silver badge
      Windows

      Re: A note on message passing.

      MQs runs on linux, AIX, HPUX and Solaris as well as on MF and Windows.

      Actually runs rather well on all of that save windows.

      does inline encryption between MF and *nix. No experience with MF to Windows encryption.

      have a rather critical pipeline from HPUX <-> MF <-> linux <-> vpn tunnel(s) <-> ??

      Terrifying days for that are when one of those elements updates MQ. 8 years and we've never had an update take it out. Only SSL cert (changes) take it out.

      1. John Smith 19 Gold badge
        Unhappy

        "MQs runs on linux, AIX, HPUX and Solaris as well as on MF and Windows."

        Thank you for reminding me. It's been a while. MQ's multi-platform nature is one of its strengths. I called it MF land because of IBM's MF centric view.

        As with anything involving HA systems, I'm sure no update would go to the live environment unless it had been thoroughly tested first.

        All of which makes the CEO's story about this being a messaging failure seem stranger and stranger.

  28. DaddyHoggy

    If your Backup System is untested and fails on the first time of asking, then you don't have a backup system...

    1. Aitor 1 Silver badge

      Backup and DR is not sexy

      Nobody wants to sink money in that!! And it is a dead end career decision to go that way...

      1. Doctor Syntax Silver badge

        Re: Backup and DR is not sexy

        "And it is a dead end career decision to go that way..."

        I'd guess that a few people are currently taking a big whack of money from BA for just this.

        1. Anonymous Coward
          Anonymous Coward

          Re: Backup and DR is not sexy

          I've found most are scared of it, some of us enjoy the terror.

  29. cynical bar steward
    IT Angle

    Dump bad operating systems NOW

    If you want reliable clustering and computing, dump the permanently bugged M$ (since Win 3.1 they have NOT fixed it yet?). Hey, sorry to tell you guys but this emperor is stark naked and always will be. Put proper stuff in. Legacy means IT WORKS. M$ just keep dangling that carrot beyond W10? Is this coffee you can smell?

    Reliability was designed in, not cobbled around. Unixen may be more secure but never designed from the ground up for reliability.

    [Does BA use any of this? Who cares, it's RUBBISH and this proves it.]

    Programming for VMS railroads you into doing things properly, or at least considering that.

    Oh, another thing VMS and [insert name of exploit]? It doesn't need patching 15 times a day for security holes. But soon to appear on the very same hardware M$ cannot run securely.

    VMS is alive and well, you'd almost think that the old naysayers were the lemmings leaping towards that cliff [of unreliability]... Best place for them, probably.

    Yeah yeah you've committed too much into your IT infrastructure to change direction? It's your funeral...

    The cost of the IT rarely actually measures much up to the value of the data or the company, and in these days of bleeding talent to save costs sight of that has been lost, penny wise and pound foolish.

    Fine, it's like the CVIT (Cash and Valuable In Transit) firms swapping to using bicycles because they are cheaper.

  30. Anonymous Coward
    Anonymous Coward

    Only speculating here but my bet would be something like this:

    http://www.cisco.com/c/en/us/about/supplier-sustainability/memory.html

    Note that this affected other vendors and not just Cisco but... having seen other customers impacted by the issue above, it feels very familiar (short power failure/maintenance caused network devices to die).

    Looking forward to reading some root cause analysis...

  31. AbsolutelyBarking

    What a SNAFU. Any decent middleware/MQ platform is (or should be...) designed to be fault tolerant and, in particular, work in a distributed environment. One node falls over and it all carries on working, albeit maybe with more queuing latency. Quite good fun seeing this work properly when you go around yanking network cables out (in the dev environment obviously..) and back in. An infrastructure architecture FAIL IMHO.
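
    The "one node falls over and it all carries on" behaviour, boiled down to a toy: a Python sketch in which the publisher simply tries the next queue node when one is down. `QueueNode`, `NodeDown` and `publish` are hypothetical names for the illustration and do not correspond to any particular middleware product's API.

```python
class NodeDown(Exception):
    """Raised when a queue node cannot accept messages."""


class QueueNode:
    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.messages = []

    def put(self, message) -> None:
        if not self.up:
            raise NodeDown(self.name)
        self.messages.append(message)


def publish(nodes, message) -> str:
    """Try each node in turn and return the name of the one that accepted the message."""
    for node in nodes:
        try:
            node.put(message)
            return node.name
        except NodeDown:
            continue  # fail over to the next node
    raise NodeDown("no queue node available")


if __name__ == "__main__":
    a, b = QueueNode("node-a"), QueueNode("node-b")
    print(publish([a, b], "booking-1"))
    a.up = False  # the equivalent of yanking a network cable
    print(publish([a, b], "booking-2"))
```

    Real platforms also have to replicate or recover whatever was sitting on the failed node, which is the hard part this sketch leaves out.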

    Senior people typically only interested in reducing short term costs = this year's bonus. Well that worked out well didn't it....

    1. Anonymous Coward
      Anonymous Coward

      Think you mean FUBAR, not SNAFU.

  32. eldakka Silver badge
    Devil

    My theory:

    They are probably using IBM mainframes and IBM P-series servers and IBM Datapower appliances, all with IBM providing the support with a dedicated team of 20 professionals who have worked on the contract for years with tons of corporate knowledge.

    Then IBM fired all their contractors, and were suddenly down to 5 permanents as the 15 contractors are let go, and those permanents are mostly management-types. So there were no skilled staff left to handle what would have otherwise been a routine failover and so botched it.

    1. Anonymous Coward
      Anonymous Coward

      Plus...

      the IBM Support desk is in India where the first thing they ask you to do is 'Switch it off and on again'...

      They seem blissfully ignorant that 'switching it off and on again' is not something you do to an IBM Mainframe setup. Perhaps that was the power surge?.... {mind boggles}

      Anon as I make a living from implementing MQSeries, WAS and IIB on a variety of platforms.

  33. Domeyhead

    Quite surprised at some of the immature and uninformed comments on a site that should provide more information than the mainstream media. I have worked with several large organisations and in 20 years experienced 4 or 5 serious (datacentre scale) failures. In all cases fast failover technology was available but in all cases it failed to function fully and correctly when needed. In one case the business in question had 3 levels of UPS to protect a datacentre and it was one of these that caused the outage!

    Only one company in that list actually bothered to rehearse DR events and this is crucial - when did BA actually rehearse a full DR at the datacentre in question? The most fundamental requirement of any DR is to prove it works anytime, as much for the people as for the technology. I wrote the DR specification for one particular system (not BA, I hasten to add). Auditors would call the DR at an unspecified time and then it was down to the staff. Although operations rehearsed recovery regularly (around once per year) they never passed the DR audit in the 4 attempts I was involved in, but they did get a whole lot better (and calmer) at recovering from the inevitable spanners that chance throws into the mix.

    Let's get something clear here. A CEO may be accountable for the ultimate financial performance of a company - that is his job - but he is clearly NOT responsible for the success or otherwise of a particular DR. Accountability and responsibility are not the same thing. The CIO (or IT director in some places) IS accountable. And the IT Operations Director IS responsible because he/she can actually make decisions that influence the quality of the solution - especially the requirement to homogenise and rehearse under an independent auditor.

    I suspect the problems here were caused by multiple hardware and software layers having inconsistent commit and rollback points. I've seen this myself - storage using some kind of block level mirror across sites, while up above the database commits at a row level and the application above that commits at a transaction level. All these bits flow along a bus at some point - but it's not the bus that's the problem. It's a lack of DR planning and rehearsal (Ops Director), a lack of communication between delivery and operational teams (technical team managers), and thinking that a whole load of expensive software and hardware enables you to do without Business Continuity professionals (CIO). There are plenty of war stories out there - look for what they all have in common - and it isn't the CEO of BA.

    1. Doctor Syntax Silver badge

      "Let's get something clear here. a CEO may be accountable for the ultimate financial performance of a company - that is his job - but he is clearly NOT responsible for the success or otherwise of a particular DR."

      He is - or should be - accountable or responsible, whichever you prefer, for ensuring that the CIO has taken whatever steps are necessary to ensure the proper operation of his side of the business.

      If the CEO has lost sight of the fact that his business relies on IT for the moment to moment operation of the business (not just day to day) and not acted accordingly then he should cease to be the CEO.

  34. Cuddles Silver badge

    Not outsourced

    "All the parties involved around this particular event have not been involved with any type of outsourcing in any foreign country"

    Maybe they should have been? It's not like that could have made things much worse.

  35. CheesyTheClown Silver badge

    How could this even happen?

    I'm developing a system now that is small and not even mission critical and it has redundancy and failover. Does anyone alive actually make anything anymore that doesn't do this?

    1. John Smith 19 Gold badge
      Unhappy

      "not even mission critical and it has redundancy and failover. "

      Good for you.

      Now when will you test it and what will you do if it fails?

  36. Anonymous Coward
    Anonymous Coward

    Well it sounds like a Network Failure (power) caused the ESB to fail. It must be more than that though because I cannot believe that BA would have a single ESB node connected to a single switch for communication to the rest of the enterprise.

    If they did then they deserve everything they got and have got plenty of architecture redesigning to do.

  37. Anonymous Coward
    Anonymous Coward

    Should have simplified and bunged it all in AWS years ago.

    1. Neil Spellings

      Yea because cloud providers never fail....

      1. Anonymous Coward
        Anonymous Coward

        What do you get from clouds?

        Shade.

        Lightning.

        Thunder.

        Rain.

        None of them good for a DC..

  38. smartypants

    Taking a step back for a moment...

    Regardless of the nature of the issue and the reasons for it, in general, things do go wrong when humans are involved. The aspiration we have of things never breaking is a good one, but reality is in control here.

    What has changed in the last 40 years is the sheer number of IT systems out there, and for each, the average complexity of what the systems do, and the number of people they serve. But although methodologies and technologies for ensuring service integrity have proliferated, the static component in the mix is the mere human, and this hasn't really changed much at all in the same period (perhaps fatter?) We're just as able to underestimate the risk factors as we always were, and we are just as subject to the lure of following the money (or the promise of spending less of it).

    There will generally be more of this sort of thing in the future. In each case, the broken service will return to what it was before the event in just a few days, which is more than can be said for the system known as the "United Kingdom", which some idiots have taken the back off and are planning to improve it by cutting a load of wires with shears.

  39. simmondp

    Business School 101

    Business School 101; never outsource anything which is critical to your business. Outsource that which is not critical and gives you no competitive advantage (for example; payroll & HR systems).

    BA seems to have forgotten the essentials.......

    Any CEO who signed off on such a deal, irrespective of whether it was the root cause, should be falling on their sword.

    Any CEO who does not understand the criticality of the IT systems (and it may be negligible) in their business should not be in the job.

  40. 2+2=5 Silver badge
    FAIL

    Rule 1 of Press Releases

    Rule 1 of press releases is... delete the evidence when your client screws up big time!

    http://sap.sys-con.com/node/1187093

    1. Doctor Syntax Silver badge

      Re: Rule 1 of Press Releases

      "The result, according to Penfold, is to drive an agile business"

      And following the link we find "The result, according to Penfold, is to drive an agile business". That speaks volumes.

      1. Brewster's Angle Grinder Silver badge

        Re: Rule 1 of Press Releases

        I thought Dangermouse always did the driving, not Penfold?

    2. TooManyChoices

      Re: Rule 1 of Press Releases

      It's changed now due to acquisition:

      Is this the messaging system that went down?

      https://www.aurea.com/our-solutions/cx-platform/cx-messenger/

  41. chivo243 Silver badge

    Didn't I hear this before?

    Some other place had backup power issues, took days to find the correct working parts...

    "Cruz told several outlets that a power surge took some systems offline and that backup power systems then failed."

  42. cdegroot

    20/20 hindsight is easy...

    But let's look at the other side of the coin. You're a CEO of an airline. Competition is incredibly fierce, and because of that and deregulation ticket prices have been dropping like a rock over the decades (when my dad took us to Spain in the late '70s, I think the charter tickets were 600 guilders. I guess I can pick up an AMS->ALC roundtrip for 200 euros or thereabouts?). We're all profiting from the collective consumer consciousness driving prices into the basement, let's not forget that.

    Efficiency and cost cutting are pretty much your only options, and pretty much they need to be done across the board. As a CEO, you're in the risk taking business - cut just enough cost, you make a profit; cut just too much somewhere, and you have a disaster. Cut cost on aircraft maintenance, the disaster is people dying, cut cost on IT, the disaster is a huge inconvenience, egg-on-face, and a - probably nicely predictable - monetary loss. Yes, this may wipe out this year's profits, but across the board, it's probably the better outcome.

    You know what? I'd probably cut IT costs if I'd be running an airline. I think it's just too simple to blame this on stupidity. Yes, you could probably do better than BA's IT (I've only worked three air travel related systems - booking, ATC and airport - and that was enough to put me off for good), but I wouldn't be surprised if these IT systems are a mess of old and new kept together by the Kermit protocol and similar band-aids. For all I know, the C-levels took reasonable decisions with measurable risks, and the coin just fell on the wrong side.

    1. Doctor Syntax Silver badge

      Re: 20/20 hindsight is easy...

      "Efficiency and cost cutting are pretty much your only options"

      Your IT systems then become the key to providing that efficiency. Your business becomes, as has been said in another comment, an IT business that happens to fly planes. Or another way to put it is that IT becomes one of your core competences. Becoming incompetent at that is stupidity.

  43. JaitcH
    Happy

    Dependable Utilities Generate a Blasé attitude

    I lived for many years in Canada where the water flows free and plentiful and generates most of our power needs, year after year, decade after decade. Then in 1998 came the Great Ice Storm.

    Quebec was knocked flat. Much of its power comes from Labrador (part of Newfoundland) and many of the stately pylons carrying the life blood of today's lifestyle collapsed, simply crumpled, under the weight of the ice.

    'Low voltage' (local distribution) failed, too, with trees and hydro (electricity) wires falling and utility poles cracking or snapping off under the combination of 7-11 cm (3-4 in) of ice and cold weather. But after a month everyone had their power restored.

    Then came the Northeast blackout of 2003, a widespread power outage that occurred throughout parts of the Northeastern and Midwestern United States and Ontario, Canada, on Thursday, August 14, 2003, just after 4:10 p.m. EDT.

    Two very hot days in a big city like Toronto, with no power, is not fun.

    Now, living in VietNam, power failures are nothing. Most businesses have portable generators; larger entities such as hotels and hospitals have impressive generators, housed in 'sound proof' box-shaped containers.

    My wife owns two mid-sized hotels and I own a mid-size 4-floor office building and a couple of homes. All have fire alarms and stand-by power systems. The hotels have LED lighting, throughout, and their initial back-up is through batteries, big batteries.

    My office has battery/generator back-up, as do my homes.

    Every one of our buildings has an automatic, human intervention-free, fire and power system. They are programmed to randomly power off, or sound the alarm, so that all occupants are very familiar with emergency procedures.

    I wonder how often BA actually tested their facilities without giving prior notice? How often does YOUR company do the same? It's the only way to really test emergency equipment.
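
    The unannounced-drill idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the drill labels, the function name and the 90-day window are hypothetical and are not taken from the poster's actual building systems.

```python
import random
from datetime import datetime, timedelta

# Hypothetical drill labels; the post only mentions power-offs and alarms.
DRILL_TYPES = ("power_off", "fire_alarm")


def schedule_drills(start, days, count, seed=None):
    """Return `count` (time, drill_type) pairs at random minutes within `days` days of `start`."""
    rng = random.Random(seed)  # a seed is accepted only so a run can be reproduced
    total_minutes = days * 24 * 60
    # sample() draws distinct minutes, so no two drills land on the same minute
    offsets = sorted(rng.sample(range(total_minutes), count))
    return [(start + timedelta(minutes=m), rng.choice(DRILL_TYPES)) for m in offsets]


if __name__ == "__main__":
    # Six unannounced drills somewhere in the next 90 days (assumed numbers).
    for when, kind in schedule_drills(datetime(2017, 6, 1), days=90, count=6):
        print(when.isoformat(timespec="minutes"), kind)
```

    The point of drawing the times at random is that nobody, including the people who maintain the equipment, can prepare for a specific date.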

  44. Dodgy Geezer Silver badge

    Culled from the Pilot's forum...

    "On Saturday morning around 9:30 there was indeed a power surge that had a catastrophic effect over some communications hardware which eventually affected the messaging across our systems..."

    ...Mr Cruz said the surge was “so strong that it rendered the back-up system ineffective”, causing an “outage of all our systems” at 170 airports in 70 countries. Power companies denied that there had been any supply problems at the company’s main hub at Heathrow or the airline’s headquarters, north of the airport perimeter. SSE and UK Power Networks, which both supply electricity in the area, said that there had been “no power surge”....

    1. fowler

      Re: Culled from the Pilot's forum...

      Mr Cruz said the power surge was so strong it caused the Thronge Sprocket to fail rendering the Brattling Pieces inoperative.

      And if you believe that I have a bridge in Brooklyn you might want to buy.

    2. anonymous boring coward Silver badge

      Re: Culled from the Pilot's forum...

      About that "power surge":

      Looks like embellishment of the truth and dramatisation comes easy to this CEO.

  45. Alan Brown Silver badge

    erm

    Fully redundant networking systems with rapid convergence and no spanning tree storms are even easier to implement than full DR across multiple datacentres.

    This makes no sense unless someone's been lobotomising the internal IT structures.

  46. macaroo

    Sounds like a lot of gobbledygook and thinly veiled excuses for a more basic problem

  47. SharkNose

    BA used to be a massive user of IBM MQ. I'd be astonished if that isn't still in the picture for intersystem messaging. I've used it extensively for pushing 20 years and found it to be incredibly resilient when configured correctly.

    The BA problem sounds more like a lack of robust tested procedures to be able to reconcile disparate systems following an outage.

  48. Anonymous Coward
    Anonymous Coward

    As insane as the BA fiasco is, I think I can top it.

    In the 1990s I worked for a cable TV and broadband provider based in the Boston suburbs. We had a single data center which delivered phone, TV and internet service to the entire New England region (and parts of NY state) and it just happened to be in the same building as our corporate HQ, call center and depot. In fact, almost the whole company was based in a single building.

    Security was non-existent. Anyone could walk through the front door and enter the data center at will. For example, FedEx guys would routinely access the DC wheeling in their deliveries unescorted and unchallenged.

    One day a FedEx guy was wheeling in several boxes of new hardware. Someone - I forget who - politely held one of the 2 main doors open all the way so he could wheel in his dolly/cart/trolley. The only problem was that behind the door, on the wall, was the Big Red Button (BRB). The door opener leant against the BRB and, well, you know what happened next. The DC was shut down and all of New England went offline - TV, phone, cable - the lot. The call center went offline. Corporate systems - including phone - went offline. The only route to the outside world was a single analog phone mounted to a wall in the DC.

    RCA was simple - someone hit the BRB. The long-term solution? Encase the BRB in a perspex box. And.....er.....that was it. Amazing.

    Then again, this was the company who hired an IT Director who made the decision to (and remember this was the mid 1990s) equip the 200 seat call center with Macs rather than Windows PCs. His reason? He simply preferred Macs. That no commercial 3rd party systems/tech - phone, CTI, databases, etc. - integrated with Macs was irrelevant. He just liked Macs. Madness.

  49. Bob Gateaux

    To be fair, if they implemented the Emperor's New Clothes that is an Enterprise Service Bus then they deserve all they got.

  50. CJames

    Not monitoring - not smart

    Never underestimate the importance of monitoring data centre performance. In a millisecond world why is it that monitoring every few minutes, or not monitoring at all, is acceptable? The world has changed, application performance rules any on-line activity, so application and infrastructure owners need to continuously monitor performance in real-time, not just availability, or agree a performance based SLA with their supplier. Modern Infrastructure Performance Monitoring (IPM) platforms work at an application centric level and can proactively prevent slow-downs or outages before they happen by alerting before end users are impacted.

    1. John Smith 19 Gold badge
      Happy

      data centre performance..continuously monitor performance in real-time..performance based SLA

      "Modern Infrastructure Performance Monitoring..application centric level..proactively prevent slow-downs

      House !

      2 posts in less than 2 hours and both posts look remarkable alike.

      1. Roland6 Silver badge
        Pint

        Re: data centre performance..continuously monitor performance in real-time..performance based SLA

        Re: House !

        Bingo !

        There are (currently) only two posts by CJames - and both posts look remarkably alike...

      2. Anonymous Coward
        Anonymous Coward

        Re: data centre performance..continuously monitor performance in real-time..performance based SLA

        "2 posts in less than 2 hours and both posts look remarkable alike."

        More than "remarkably alike". Word for word cut and paste (near enough?) in two different topics.

        Registered today too.

        I wonder if El Reg have sent Chris a copy of the advertising tariff?

    2. Vic

      Re: Not monitoring - not smart

      Never underestimate the importance of monitoring data centre performance

      That's twice you've posted that exact same message - and those are your only two posts here.

      Do you have a commercial interest in such things, perchance?

      Vic.

    3. Anonymous Coward
      Anonymous Coward

      Re: Not monitoring - not smart

      Your points are valid.

      If you are a salesman then please say so.

      I would buy from an honest salesman.

  51. cantankerous swineherd Silver badge

    from the grauniad:

    The global head of IT for IAG, Bill Francis, has been attempting to find savings of around €90m per year from outsourcing. Francis, with no previous IT experience, was previously in charge of introducing new contracts and working practices to cabin crew at BA during the bitter industrial dispute of 2010-11.

    draw your own conclusions.

    1. John Smith 19 Gold badge
      Unhappy

      "Francis,..no previous IT experience, was previously in charge of introducing new contracts "

      So basically a management goon.

      Doesn't sound like someone who takes advice, especially from subordinates, before, during or after an IT situation.

      Which is a bit of a problem if you a) Know nothing about IT and b) The s**t has hit the fan.

  52. Old Handle

    There are no redundancies [...] taking place around this particular hardware

    Well there's your problem. A system that big needs redundancy.

  53. Jtom Bronze badge

    While there has been much discussion about Cruz's future vis-à-vis the system crash, I have seen no mention of the other half of this debacle: the complete and utter meltdown of BA's airport services. Rebooking desks with no agents, poor communications, wrong info given out, unretrievable luggage, etc.

    I have always told my staff that everyone makes mistakes, it's how you handle them that shows your worth. BA failed miserably, and has continued to provide poor customer service since. They seemingly have no plans as to what to do if they aren't flying due to their own cock-ups (e.g., they don't have the luxury of being terse, saying, "Don't blame us, it's the weather. Sorry you can't find a hotel room, but we aren't responsible.")

    As soon as it hit the fan and flights started getting cancelled, someone should have been calling in caterers to provide free food and drinks, sending reps into the crowds to give them updates, doing what they could to help special needs passengers, working with airport and nearby hotels to get as many bookings as possible (and threatening them with lawsuits if they try to price-gouge), and so on. From what I have read, major airports (Heathrow, Gatwick) were in shambles and the BA staff running around in circles.

    1. anonymous boring coward Silver badge

      These people aren't suited to thinking by themselves and taking any initiative.

      I once was at an airport where my luggage would never reach the pick-up area because the belt was full with unclaimed luggage from some earlier flight (and those passengers were nowhere to be seen).

      I asked the staff there why they didn't remove the luggage so the new luggage could reach us. They "weren't working for the luggage handling company".

      Of course, the orange-vest people on the other side of the rubber curtains couldn't come out into the hall to remove luggage, so they just stood around waiting for the band to move...

      So I and another passenger removed all the luggage so the band could again move, and our luggage could get to us.

      And this was in Western Europe. Not bloody USSR.

  54. anonymous boring coward Silver badge

    Cruz has also promised that passengers will never again have such an experience with BA, in part because the carrier will review the incident and figure out how to avoid a repeat.

    Is that because the only way such an experience can happen is if this incident occurs in the exact same way again?

    Very intelligent thinking by that fantastic CEO. Obviously this has nothing at all to do with having enough of the right kind of people available to quickly identify issues and resolve them urgently. Nothing at all.

    Looks like BA has a f*cking politician at the helm -and we all know how well that usually goes.

  55. scubaal

    currently in LHR

    In BA Lounge in LHR still trying to get to where I am supposed to be (spain) from OZ due to BA cock-up.

    Most helpful BA office was in USA - try calling that from Oz - but at least they answer.

    The Oz support number ON THE BA WEBSITE has a recorded message that says 'this number is no longer in use - please call the number on the web site'.

    You couldn't make it up.

  56. Dodgy Geezer Silver badge

    I don't think I've ever seen...

    ...a big blue-chip IT disaster where so LITTLE technical detail has leaked out to the Register.

    Of course, employing people who only speak Hindi probably helps...

  57. mojacar_man

    BA didn't need to outsource offshore to India. TATA, a massive Indian company with its finger in many pies, has had a major IT presence in UK for many years. (For instance, they took over the IT department of Legal and General Insurance about a dozen years ago. Programmers had the choice of working for TATA or walking. Many walked, rather than accept the new terms!)

    TATA, of course, are free to share their workload - with their Indian head office!

    1. Anonymous Coward
      Anonymous Coward

      Legal & General

      Anyone know how that worked out for them?

  58. AdrianMontagu

    IT Failure

    Is there going to be ANY comment from the IT chief?

    If not then we must assume a cover-up.

  59. cyfahead
    Mushroom

    The big picture has historical precedents...

    Some light has been shed upon the BA systems outage over the Bank Holiday:

    "Speaking to Sky News, he [BA CEO Cruz] added a little more detail about what went wrong, as follows:

    On Saturday morning around 9:30 there was indeed a power surge that had a catastrophic effect over some communications hardware which eventually affected the messaging across our systems.

    Tens of millions of messages every day that are shared across 200 systems across the BA network and it actually affected all of those systems across the network. "

    From a policy perspective for politicians as an IT Systems Engineer I would draw the following conclusion, and warning, to us all:

    The BA disaster will be followed by others until all telecommunications are seen as a single multi-modal utility channel and as a critical local resource by every region across the planet and brought under common control and direct accountability to the people it serves.

    -O-

    What BA CEO Cruz describes is simple.. there was a failure of their Critical Component Failure Analysis process. This is, de rigueur, an everyday functional department of any major IT operation.

    The constant problem these exercises live with is in defining the boundaries of "your" IT operation, over which you have control, and the risks beyond that where you have none, or little, and to mitigate against them while hoping all the components 'sites' making up that external environment are as diligent in their CCFA processes as you are yourself.

    The potential resilience and durability of complex systems that rely upon the Interconnection of many independent participants/components is theoretically high... if every component has available to it many alternative ways of making the connections it needs.

    But the degree of interconnection of any specific independent site/component in a network of commercial entities is the consequence of cost-benefit driven decisions taken in its strictly local context. It is also often a factor the specific details of which are regarded as being commercially sensitive and are kept confidential within each local operation.
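
    The resilience argument above can be put into rough numbers. What follows is a sketch under simplifying assumptions, not a model of BA's network: each alternative route is assumed to fail independently with the same availability, and the `p_common` parameter is a hypothetical stand-in for a shared event (a power surge, say) that takes every route down at once.

```python
def availability(p_path, n_paths, p_common=0.0):
    """Chance that at least one of n_paths alternative routes is usable.

    p_path   -- availability of a single route (0..1), assumed equal for all
    n_paths  -- number of alternative routes, assumed to fail independently
    p_common -- assumed chance of a shared event that downs every route at once
    """
    if not 0.0 <= p_path <= 1.0 or not 0.0 <= p_common <= 1.0:
        raise ValueError("probabilities must be between 0 and 1")
    if n_paths < 1:
        raise ValueError("need at least one path")
    # All routes fail on their own only if every independent route is down.
    all_independent_fail = (1.0 - p_path) ** n_paths
    # The shared event caps the result no matter how many routes exist.
    return (1.0 - p_common) * (1.0 - all_independent_fail)
```

    With two independent 99% routes the formula gives 99.99%, but any non-zero `p_common` caps the result at `1 - p_common` however many routes are added. That is the practical difference between duplication and real resilience.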

    The result is the failure of the 'Internet' to achieve the absolute resilience which was the raison d'etre, and the main design feature, of the 1950/1960's USDF 'Arpanet' out of which the 'Internet' has evolved.

    Historically, just like chaotic complexity eventually characterised the commercial competitive generation of electrical power and the laying of railway track, the entire 'multi-modal' telecommunications enterprise has now reached the same level of criticality to our economic existence and displays the same state of chaos and susceptibility to CCF events.

    The BA disaster will be followed by others until all telecommunications are seen as a single multi-modal utility channel and as a critical local resource by every region across the planet and brought under common control and direct accountability to the people it serves.

    Clearly there are existential impediments to achieving this, but they go beyond the froth of ICANN haggling and Steely Neely's earlier resistance to non-EU corporations' pressures. There is a context in our Economic History and the philosophies of political economy by which they are formed and interpreted. For some of that.. and a peek into the Pandora's box that lies there, read on... you should have time before they let you board your next flight...

    -O-

    The 'Internet', like roads, running water, rail and energy reticulation before it, has now evolved from being a luxury novelty to being a critical part of the infrastructure of many national economies and their global trading links.

    In the past whenever a complex system technology has reached this stage it has been brought under total central control within its country of operation. This 'nationalisation' has been achieved, in general, either by public ownership or nationally controlled 'sole operating licences' granted commercial operators in clearly defined non-overlapping domains whose service levels, revenues and operating profits are expected to be tightly controlled in exchange for an absence of competition, except when periodically bidding for licence renewals.

    In the past the evolution of a new technology from 'novelty' to 'utility' was easily confined within the geopolitical domain of a country and regulated accordingly. It is inherent to the Internet, and telecommunications in general, that this technology creates complex local systems interconnected in complex global systems which transcend geopolitical domains.

    However, the Internet is no longer alone. As it evolved so too did the global energy reticulation system and the rail and the road systems. None of them have any significant degree of resilience designed into them. Some resilience is there, but tends to be at the duplicative level. There is very little evidence of the complex interconnections that high levels of 'resilience' require. UK's winter gas supply depends completely on 2 'taps'. One in Russia and one in Sweden. The few remaining liquid methane tankers do not have the capacity, or operational flexibility, to quickly step in to the energy gap that would be created by a sustained cessation of piped gas from across the North Sea.

    After commercial interests have successfully exploited a new technology, moving it from a market 'novelty' to a national 'utility' we have seen over a period of more than 200 years that it has been left to the national exchequer to foot, or underwrite, the bill for the expenses of mitigating the risks of economic disruption to which the commercially driven development of 'utilities' has exposed the general population of businesses and households.

    Examples include energy stockpiles, rail track networks, channel tunnels, enlarged and deepened port facilities, motorway links to outlying areas, fibre-optic cables into rural areas, and cellular phone networks into areas with poor signal propagation characteristics and low population density even though having numerically large numbers of potential users.

    The trans-national geopolitical domain which is the European Union, despite subversion by commercial and neoliberal politico-economic interests of late, has been very effective in creating the conditions for achieving high resilience at lower overall cost of many such complex 'utility' technologies. Beyond its geopolitical domain it is the USA which seeks to arrange such things to its own benefit and under its geopolitical control through a combination of financial, gunboat and global 'administrative' functions and organisations which it economically dominates through their dependence upon its, the USA's, high-levels of funding relative to other contributors.

    Brexit affords UK entities, which relative to their EU ones have large global and hence also US interests, to continue to exploit the 'special relationship' which the largest UK financial and trading corporations, through their governments, have sought to preserve since becoming indebted to the US by WWII. By this means they have been able to transform without loss the value of their global commercial interests from being under the direct protection of the then defunct Empire to the indirect, but potentially more effective, protection provided by the USA.

    Now, in 2017, the EU has become the single largest trading entity in the world, outstripping the USA in the position it has long jealously guarded for itself. This has threatened UK interests whose commercial roots lie in the days of our own Empire and in the 'special relationship' which is so often spoken of but which is so difficult to define, unless one says exactly what it is. In the historical context, it is no more and no less than a global trading alliance protected by the mutually exercised global projection of military power.

    By extending its commitment to the EU through the Lisbon Treaty, the UK was suddenly able to play 'both ends to the middle'. It was de facto a member of the two largest trading empires in the world, both commercially and militarily through NATO. During the last two decades the EU welcomed the strength in world markets that the UK added to it. On the other side of the Atlantic the US might be forgiven for feeling a bit double-crossed as the EU weakened its own growing need for global trade pre-eminence.

    Entities participating in and benefiting from the 'special relationship', on both sides of the Atlantic, were now finding themselves increasingly being sidelined by UK and other EU commercial allies, and by increasing EU resistance to US-dominated global organisations and to the demands of US global corporations. At the same time the US was having to compete with a resurgent China, then SE Asia and soon to be followed by India.

    The EU's military capacity was largely confined to a defensive posture, mandated to protect its members' trading routes. The EU was a poor partner for UK global interests as compared with the indirect protection of the US loosely allied to the UK's own military. A Brexit presented, for them and the US, a win-win situation.. strengthen the 'special relationship' that mitigates risk in international trade, and weaken the competitive trading power of the EU.

    In so far as those US global trading interests identify with the US political right, as do their UK counterparts, it bears some speculation as to how far back their overt alliance goes. And to what extent Mr Farage's UKIP is a child, and deliberate pawn, of the UK and US political right embedded in the foundations of both their main political parties. Certainly the public appearance of the close relationship between Trump and Farage was very sudden and very intense. And once one speculates this far it raises the prospect of speculating upon the actions and influence, upon anti-EU moves to the right of players in many EU nations, of the other player that is mooted to be in the game to re-assert global influence... Mr Putin.

    -O-

  60. Anonymous Coward
    Anonymous Coward

    Wow

    Did you write this as a response to the BA debacle? A good piece, if it was sourced elsewhere please say where. Up vote as I think a lot of what you say is valid.

    Britain will be no more independent than we were a year ago except we will have different masters. I'm pro-US and pro-EU but to some extent we have burnt our bridges. As for BA they will be one of the first to go, it already has its holding company registered office (IAG) in Madrid and has a Spanish boss.

  61. Grunt #1

    "To Fly you need Servers"

    Perhaps the BA motto should be changed to this from "To Fly. To Serve".

  62. FudgeMonkey

    A network issue?

    Of course it's a network problem - anything you can't explain or don't understand is ALWAYS a 'network problem' in my experience.

    If I had a pound for every time a misconfigured server, badly written application or lack of hardware maintenance was described as a 'network problem', I could hire somebody else to tell them it isn't and why...

  63. Anonymous Coward
    Anonymous Coward

    It's a shame

    That this story is spread across different pages on the Reg. There are valid comments and suggestions spread around.


Biting the hand that feeds IT © 1998–2019