So by messaging he means some sort of enterprise service bus was taken down? Also, power surge protection is normal, so what went wrong with the switchover to redundant power? The hints from this, a bit like the TalkTalk hack, suggest something very simple and not some unavoidable impossible to understand failure they would like to media-steer us towards.
The catastrophic systems failure that grounded British Airways flights for a day appears to have been caused by networking hardware failing to cope with a power surge and messaging systems failing as a result. The Register has asked BA's press office to detail what went wrong, what equipment failed, what disaster recovery …
Tuesday 30th May 2017 06:18 GMT wyatt
Tuesday 30th May 2017 12:16 GMT Anonymous Coward
"I realise that backup systems have been mentioned"
I used to work for a large company that provided DR services. The vast majority of companies treat DR as a compliance checkbox. They buy some DR services so they can say they have DR services... but in the event of a primary data center loss, there really is only the rough outline of a plan. Basically their data, or most of it, is in some alternative site, and they may or may not have the rest of their gear there too. There is rarely anything resembling a real-time switchover from site A to site B in case of a disaster, in which their entire stack(s) would come up at site B without any manual intervention. That is mainly because architectures are a hodgepodge of stuff which has collected over the years. Many companies never rewrite or modernize anything, meaning much of the environment is legacy with legacy HA/DR tools... and there is sparse automation.
Tuesday 30th May 2017 13:12 GMT wheelybird
There's a difference between disaster recovery and high-availability (though they do overlap).
It's perfectly reasonable that disaster recovery is a manual fail-over process. Fully resilient systems over two geographically separated locations can be hard and expensive to implement for smaller companies with not much in the way of a budget or resources, and so you have to compromise your expectations for DR.
Even if failing-over can be automated, there might be a high cost in failing-back afterwards, and so you might actually prefer the site to be down for a short while instead of kicking in the DR procedures; it works out cheaper and avoids complications with restoring the primary site from the DR site.
Not every company runs a service that absolutely needs to be up 24/7.
A lot of people designing the DR infrastructure will be limited by the (often poor) choices of technology made by the people that wrote the in-house stuff.
As an example, replicating your MySQL database between two datacentres is more complicated than most people would expect. Do you do normal replication and risk that the slave has lagged behind the master at the point of failure, losing data? Or use synchronous replication like Galera at the cost of a big latency hit to the cluster, slowing it right down?
If it's normal replication, do you risk master-master so that it's easy to fail-back, with the caveat that master-master is generally frowned upon for good reasons?
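The trade-off between the two replication modes can be sketched in a few lines. This is a toy model only, not MySQL or Galera code: the class names, the `ship` catch-up step and the numbers are all made up for illustration, and "round trips" simply stands in for the cross-site latency paid on each commit.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    log: list = field(default_factory=list)


@dataclass
class Primary:
    replica: Replica
    synchronous: bool
    log: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # writes not yet shipped (async only)
    round_trips: int = 0  # cross-site round trips paid while committing

    def commit(self, row):
        self.log.append(row)
        if self.synchronous:
            # The commit waits until the remote site has the write.
            self.replica.log.append(row)
            self.round_trips += 1
        else:
            # The commit returns at once; the write is shipped later.
            self.pending.append(row)

    def ship(self, n):
        """Deliver up to n queued writes to the replica (async catch-up)."""
        for row in self.pending[:n]:
            self.replica.log.append(row)
        del self.pending[:n]


def rows_lost_on_failover(primary):
    """Writes acknowledged to clients but missing from the replica."""
    return len(primary.log) - len(primary.replica.log)


def simulate(synchronous, writes=10, shipped_before_crash=7):
    """Commit some writes, let the replica partly catch up, then 'crash'."""
    primary = Primary(replica=Replica(), synchronous=synchronous)
    for i in range(writes):
        primary.commit(i)
    primary.ship(shipped_before_crash)
    return rows_lost_on_failover(primary), primary.round_trips
```

With the default numbers, `simulate(False)` loses the three writes that had not been shipped but pays no commit-time round trips, while `simulate(True)` loses nothing and pays one round trip per write.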
I think it's disingenuous to berate people for implementing something that can be very difficult to implement.
Though of course, large companies with lots of money and lots to lose by being down (like BA) have no excuses.
Tuesday 30th May 2017 06:26 GMT Ledswinger
suggest something very simple and not some unavoidable impossible to understand failure
The weasel isn't going to take any personal responsibility - even though he is THE chief executive officer. But it is his fault, all of it, in that capacity. The total and absolute failure of everything is clearly a series of multiple failures, and he (and BA) are trying to control the message as though that denies the reality of this catastrophe. He should be fired for his poor communication and poor leadership if nothing else. But that's what you get when you put the boss of a tiddly low cost airline into a big, highly complex operation with a totally different value proposition.
Looking around, press comment reckons that it'll be two weeks before all flight operational impacts are worked out (crews, aircraft in the wrong place at the wrong time, passenger failures made as good as they can), and the total cost will be about £100m loss of profit. I wonder if that will affect his bonus?
Tuesday 30th May 2017 06:51 GMT Credas
But that's what you get when you put the boss of a tiddly low cost airline into a big, highly complex operation with a totally different value proposition.
Whatever you might think about his performance during this unmitigated balls-up, there's much more relevant experience in his biography than just running a "tiddly low cost airline".
Tuesday 30th May 2017 07:16 GMT Bloodbeastterror
Tuesday 30th May 2017 17:11 GMT Anonymous Coward
Re: "I wonder if that will affect his bonus?"
He didn't get one last year.
"Alex Cruz, the Spanish CEO of British Airways, will not receive a bonus for 2016 from the IAG airlines group. The company said in a statement to the National Stock Market Commission that he will be the only one of the 12 senior executives not to receive a bonus."
Wednesday 31st May 2017 06:25 GMT John Smith 19
"he will be the only one of the 12 senior executives not to receive a bonus."
Which suggests he has been trying extra hard to get one.
And look what his efforts have produced.....
I think he's going to be on the corporate naughty step again.
It's trickier than it looks in the commercials.
Tuesday 30th May 2017 07:19 GMT Anonymous Coward
The weasel isn't going to take any personal responsibility - even though he is THE chief executive officer.
IT and the CIO don't fall under him, IT is provided by [parent company] IAG "Global Business Services" as of last year. But of course, Cruz has fully supported all the rounds of cuts that have been made.
It smells like a store-and-forward messaging system from the dawn of the mainframe age
Ex BA AC
Tuesday 30th May 2017 09:45 GMT Anonymous Coward
But you would think that something as critical as an ESB in BA would mean that they have built it with high availability in an active/active configuration with plenty of spare capacity built in with nodes in different locations and on different power supplies. And of course ensuring that the underlying data network has similar high availability.
Otherwise you have just built in a single point of failure to your whole enterprise and as Murphy's law tells us - if it can go wrong then it will go wrong and usually at the most inopportune moment.
Tuesday 30th May 2017 13:04 GMT Ledswinger
IT and the CIO don't fall under him, IT is provided by [parent company] IAG "Global Business Services" as of last year.
As a director of BA, he is in fact responsible in law, even if the group have chosen to provide the service differently. I work for a UK-based, foreign-owned energy company. Our IT is supported by Anonco Business Services, incorporated in the parent company's jurisdiction, and owned by the ultimate parent. If our IT screws up (which it does with some regularity), our customers have redress against the UK business, and our directors hold the full contractual, legal and regulatory liability, whether the service that screwed up is in-house, outsourced, or delivered via captive service companies.
Tuesday 30th May 2017 15:29 GMT Anonymous Coward
If he is a director of BA! A search of Companies House finds a director of a BA company in the name of
Alejandro Cruz De Llano
I'm guessing this is him?
A member of staff of a company only has legal responsibility if they are a registered director with Companies House. The fact the company calls them a CEO or director does not mean they are a registered director.
Tuesday 30th May 2017 08:06 GMT John Smith 19
"and the total cost will be about £100m loss of profit. I wonder if that will affect his bonus?"
You can bet that any "profit improving" (i.e. cost-cutting) ideas certainly did.
This should as well.
But probably won't, given this is the "New World Order" of large corporate management that takes ownership of any success and avoids any possibility that their decisions could have anything to do with this.
If you wonder who most modern CEOs' role model for corporate behavior is, it's simple.
Carter Burke in Aliens.
Tuesday 30th May 2017 12:42 GMT Mike Richards
Cruz previously worked at Vueling which has a terrible record for cancellations, lost bookings and cruddy customer service - so he's clearly brought his experience over.
He was appointed to cut costs at BA, which he's done by emulating Ryanair and easyJet whilst keeping BA prices. He's allowed the airline to go downmarket just as the Turkish, Gulf and Asian carriers are hitting their stride in offering world-wide routing without treating customers like crap. Comparing Emirates to BA in economy is like chalk and cheese.
BA's only hope is if the American carriers continue to be as dreadful as ever.
Tuesday 30th May 2017 06:29 GMT Voland's right hand
It smells like a store-and-forward messaging system from the dawn of the mainframe age (Shows how much BA has been investing into its IT). It may even be hardware + software. Switching over to backup is non-trivial as this is integrated into transactions, so you need to rewind transactions, etc.
It can go wrong and often does, especially if you have piled up a gazillion new and wonderful things connected to it via extra interfaces. An example of this type of clusterf*** is the NATS catastrophic failure a few years back.
That is NOT the clusterf*ck they experienced though because their messaging and transaction was half-knackered on Friday. My boarding pass emails were delayed 8 hours, check-in email announcement by 10 hours. So while it most likely was the messaging component, it was not knackered by a surge, it was half-dead 24h before that and the "surge" was probably someone hired on too little money (or someone hired on too much money giving the idiotic order) trying to reset it via power-cycle on Sat.
This is why, when you intend to run a system and build on it for decades, you have to upgrade, and you have to START each upgrade cycle by upgrading the messaging and networking. Not do it as an afterthought and an unwelcome expense (the way BA does anything related to paying, with the exception of paying exec bonuses).
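Why switching a store-and-forward system over to its backup is non-trivial can be shown with a toy sketch. Nothing here reflects BA's actual systems: the `Journal`, `Broker` and `Consumer` classes and the message names are hypothetical, and the journal is simply assumed to survive the failure of the primary.

```python
class Journal:
    """Durable message log assumed to be visible to both brokers."""

    def __init__(self):
        self.entries = {}       # message id -> payload
        self.delivered = set()  # ids whose delivery was recorded


class Broker:
    def __init__(self, journal):
        self.journal = journal

    def accept(self, msg_id, payload):
        # Store: the message is journaled before anything else happens.
        self.journal.entries[msg_id] = payload

    def forward(self, msg_id, consumer):
        # Forward: hand the message over, then record the delivery.
        consumer.handle(msg_id, self.journal.entries[msg_id])
        self.journal.delivered.add(msg_id)

    def replay_undelivered(self, consumer):
        # After failover, resend everything not recorded as delivered.
        for msg_id in sorted(self.journal.entries):
            if msg_id not in self.journal.delivered:
                self.forward(msg_id, consumer)


class Consumer:
    def __init__(self):
        self.seen = set()
        self.applied = []

    def handle(self, msg_id, payload):
        # Replays can repeat a message, so duplicates are ignored by id.
        if msg_id in self.seen:
            return
        self.seen.add(msg_id)
        self.applied.append(payload)


def demo():
    journal, consumer = Journal(), Consumer()
    primary = Broker(journal)
    for msg_id, payload in [(1, "check-in"), (2, "boarding-pass"), (3, "bag-tag")]:
        primary.accept(msg_id, payload)
    primary.forward(1, consumer)
    # Message 2 reaches the consumer, but the primary dies before recording it.
    consumer.handle(2, journal.entries[2])
    standby = Broker(journal)
    standby.replay_undelivered(consumer)
    return consumer.applied
```

The standby cannot tell whether message 2 arrived, so it resends it. Without the duplicate check in the consumer, that message would be applied twice, which is the kind of transaction rewinding a real switchover has to get right.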
Tuesday 30th May 2017 07:03 GMT James Anderson
If it was a properly architected and configured mainframe system it would have just worked.
High availability, failover, geographically distributed databases, etc. etc. were implemented on the mainframe sometime in the late '80s.
Some of the commentards on this site seem to think the last release of a mainframe OS was in 1979, when actually they have been subject to continuous development, incremental improvement and innovation to this day. A modern IBM mainframe is bleeding edge hardware and software presenting a venerable 1960s facade to its venerable programmers. Bit like a modern Bentley with its staid '50s styling on the outside and a monster twin turbo multi valve engine on the inside.
Tuesday 30th May 2017 10:15 GMT MyffyW
There is no such verb as "to architect".
I architect - the successor to the Asimov robot flick
You architect - an early form of 21st century abuse
He/She architects - well I have no problem with gender fluidity
We architect - sadly nothing to do with Nintendo
You architect - abuse, but this time collective
They architect - in which case it was neither my fault, nor yours
Tuesday 30th May 2017 12:41 GMT tfb
Tuesday 30th May 2017 14:24 GMT Anonymous Coward
Die off is fine. So is die back. They're descriptive and worth keeping. Architect as a verb is more or less OK, although why did someone assume 'design' wasn't good enough? It's a correct description of the process, which makes architect as a verb a replacement for a word that didn't need replacing.
Tuesday 30th May 2017 14:13 GMT Anonymous Coward
<rant> True. But at least that's one I can, however reluctantly, at least imagine.
For me, by far the worst example of this American obsession with creating non-existent 'verbs' is, obviously, 'to leverage'.
Surely that sounds as crass to even the most dim-witted American as it does to everyone else in the English speaking world, doesn't it? I'm told these words are created to make the speaker sound important when they are clueless.
I can accept that some lone moron invented the word. But why did the number of people using it ever rise above 1? </rant>
Wednesday 31st May 2017 03:28 GMT Jtom
Wednesday 31st May 2017 16:41 GMT dajames
There is no such verb as "to architect".
That's the beauty of the English language -- a word doesn't have to exist to be usable. (Almost) anything goes.
It's not always a good idea to use words that "don't exist" -- especially if you're unhappy about being lexicographered into the ground by your fellow grammar nazis -- but most of the time you'll get the idea across.
[There is no such verb as "to lexicographer", either, but methinks you will have got the point!]
Ponder, though, on this.
Tuesday 30th May 2017 15:06 GMT CrazyOldCatMan
A modern IBM mainframe is bleeding edge hardware and software presenting a venerable 1960s facade to its venerable programmers.
And always has done. In the early '90s, I was maintaining TPF assembler code that was originally written in the '60s (some was older than me!).
And I very much suspect those systems are still at the heart of things - they worked. In the same way as banks still have lots of stuff using Cobol, I suspect airlines still have a lot of IBM mainframes running TPF. With lots of shiny interfaces so that modern stuff can be done with the source data.