So by messaging he means some sort of enterprise service bus was taken down? Also, power surge protection is normal, so what went wrong with the switchover to redundant power? The hints from this, a bit like the TalkTalk hack, suggest something very simple and not some unavoidable, impossible-to-understand failure they would like to media-steer us towards.
The catastrophic systems failure that grounded British Airways flights for a day appears to have been caused by networking hardware failing to cope with a power surge and messaging systems failing as a result. The Register has asked BA's press office to detail what went wrong, what equipment failed, what disaster recovery …
I'm looking forward to seeing the RCA, if we ever do. So many businesses have single points of failure, even within HA systems. I realise that backup systems have been mentioned but if they don't work or can't be brought online (ever tested?), has someone been telling porkie pies?
suggest something very simple and not some unavoidable impossible to understand failure
The weasel isn't going to take any personal responsibility - even though he is THE chief executive officer. But it is his fault, all of it, in that capacity. The total and absolute failure of everything is clearly a series of multiple failures, and he (and BA) are trying to control the message as though that denies the reality of this catastrophe. He should be fired for his poor communication and poor leadership if nothing else. But that's what you get when you put the boss of a tiddly low cost airline into a big, highly complex operation with a totally different value proposition.
Looking around, press comment reckons that it'll be two weeks before all flight operational impacts are worked out (crews, aircraft in the wrong place at the wrong time, passenger failures made as good as they can), and the total cost will be about £100m loss of profit. I wonder if that will affect his bonus?
It smells like a store-and-forward messaging system from the dawn of the mainframe age (Shows how much BA has been investing into its IT). It may even be hardware + software. Switching over to backup is non-trivial as this is integrated into transactions, so you need to rewind transactions, etc.
It can go wrong and often does, especially if you have piled up a gazillion new and wonderful things connected to it via extra interfaces. An example of this type of clusterf*** is the NATS catastrophic failure a few years back.
That is NOT the clusterf*ck they experienced though because their messaging and transaction was half-knackered on Friday. My boarding pass emails were delayed 8 hours, check-in email announcement by 10 hours. So while it most likely was the messaging component, it was not knackered by a surge, it was half-dead 24h before that and the "surge" was probably someone hired on too little money (or someone hired on too much money giving the idiotic order) trying to reset it via power-cycle on Sat.
This is why, when you intend to run a system and build on it for decades, you have to upgrade, and you have to START each upgrade cycle by upgrading the messaging and networking. Not do it as an afterthought and an unwelcome expense (the way BA does anything related to paying with the exception of paying exec bonuses).
But that's what you get when you put the boss of a tiddly low cost airline into a big, highly complex operation with a totally different value proposition.
Whatever you might think about his performance during this unmitigated balls-up, there's much more relevant experience in his biography than just running a "tiddly low cost airline".
If it was a properly architected and configured mainframe system it would have just worked.
High availability, failover, geographically distributed databases, etc. etc. were implemented on the mainframe sometime in the late '80s.
Some of the commentards on this site seem to think the last release of a mainframe OS was in 1979, when actually they have been subject to continuous development, incremental improvement and innovation to this day. A modern IBM mainframe is bleeding edge hardware and software presenting a venerable 1960s facade to its venerable programmers. Bit like a modern Bentley with its staid '50s styling on the outside and a monster twin turbo multi valve engine on the inside.
"I wonder if that will affect his bonus?"
Ha ha ha ha... Of course not. After the attainment of a certain pay grade "reward for failure" kicks in. Only the actual workers enjoy "reward for success". Sometimes.
The weasel isn't going to take any personal responsibility - even though he is THE chief executive officer.
IT and the CIO don't fall under him, IT is provided by [parent company] IAG "Global Business Services" as of last year. But of course, Cruz has fully supported all the rounds of cuts that have been made.
It smells like a store-and-forward messaging system from the dawn of the mainframe age
Ex BA AC
I don't know who you're trying to convince, but it's not me. Neither Clickair nor Vueling have or had stellar reputations, Sabre's had its outages, and the less said about US airports the better.
"and the total cost will be about £100m loss of profit. I wonder if that will affect his bonus?"
You can bet that any "profit improving" (IE cost cutting) ideas certainly did.
This should as well.
But probably won't, given this is the "New World Order" of large corporate management that takes ownership of any success and avoids any possibility that their decisions could have anything to do with this.
If you wonder who is most modern CEO's role model for corporate behavior it's simple.
Carter Burke in Aliens.
@ James Anderson
(Mainframe operating systems) have been subject to continuous development, incremental improvement and innovation to this day.
That sounds expensive, has anyone told Ginni about this?
"It smells like a store-and-forward messaging system from the dawn of the mainframe age "
You mean teletype? No, it doesn't sound like that, TTY store and forward is simple in its queuing and end points. If the next hop to the destination is unavailable, it stops. When it is available, it restarts. To me, this sounds like MOM, that wonderful modern replacement for TTY. That heap of junk that queue fulls and discards messages, that halts with only writers and no readers, or readers and no writers. That cause of more extended system outages than any other component in a complex system.
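For anyone who wants the difference spelled out, here is a rough Python sketch of the two behaviours described above. It is a toy model only: the class names `StoreAndForward` and `BoundedDropQueue` are made up for illustration and do not represent any real MOM product or whatever BA actually runs. The first queue simply stops while the next hop is down and resumes afterwards; the second is a bounded buffer with no reader, which starts discarding once it is full.

```python
from collections import deque


class StoreAndForward:
    """Hypothetical TTY-style relay: holds messages until the next hop is up."""

    def __init__(self):
        self.pending = deque()
        self.delivered = []
        self.next_hop_up = True

    def send(self, msg):
        self.pending.append(msg)
        self.flush()

    def flush(self):
        # Drain in order, but only while the next hop is reachable.
        while self.next_hop_up and self.pending:
            self.delivered.append(self.pending.popleft())


class BoundedDropQueue:
    """Hypothetical bounded buffer with no reader: discards once full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()
        self.dropped = 0

    def send(self, msg):
        if len(self.buffer) >= self.capacity:
            self.dropped += 1  # message is silently lost
            return False
        self.buffer.append(msg)
        return True


def demo():
    saf = StoreAndForward()
    saf.next_hop_up = False      # next hop goes away
    for i in range(5):
        saf.send(i)              # messages pile up, none are lost
    saf.next_hop_up = True       # next hop comes back
    saf.flush()

    mom = BoundedDropQueue(capacity=3)
    for i in range(5):
        mom.send(i)              # writers only, nobody reading
    return saf.delivered, list(mom.buffer), mom.dropped
```

In the demo the store-and-forward relay ends up delivering all five messages in order, while the bounded queue keeps the first three and drops the other two.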
On the issue of local resources:
The use of 'resources' is as usual interesting - it speaks volumes of identikit, replaceable, hired in, temporary. An airline doesn't refer to its pilots as resources, or its aircraft engineers. It shouldn't refer to its IT staff as such... my guess is that of course it was local 'resource' fixing the problem, they'd be the only ones with the access to touch the systems unless BA has gone even more loony than some on here suspect. The local resources would be on a bridge with a cast of hundreds from the supplier, all shouting at each other, all pointing in different directions. First the load balancer would be failed over and resynched. Then the firewall. Then the DNS. Then someone would point out that some component that's never failed before has everything going through it and that it has a single error report in a log somewhere not highlighted in automation because it's never been seen before.
Then, as said above, when some numpty decides to restart the box, a part of it fails catastrophically - something to do with electricity... it may be sunspots or a power surge, yes, we'll call it a power surge.
There is no such verb as "to architect".
But you would think that something as critical as an ESB in BA would mean that they have built it with high availability in an active/active configuration with plenty of spare capacity built in with nodes in different locations and on different power supplies. And of course ensuring that the underlying data network has similar high availability.
Otherwise you have just built in a single point of failure to your whole enterprise and as Murphy's law tells us - if it can go wrong then it will go wrong and usually at the most inopportune moment.
"...something as critical as an ESB in BA would mean that they have built it with high availability in an active/active configuration with plenty of spare capacity..."
They probably thought they could just pop down the road to the brewery in Chiswick and get a refill there.
no such verb as "to architect".
I architect - the successor to the Asimov robot flick
You architect - an early form of 21st century abuse
He/She architects - well I have no problem with gender fluidity
We architect - sadly nothing to do with Nintendo
You architect - abuse, but this time collective
They architect - in which case it was neither my fault, nor yours
"millions of messages"
Haven't seen the Sky TV etc stuff, but on R4 WatO on Monday he used a phrase like "millions of messages passing between the various systems" . I interpreted that to mean IP packets. Of course I may very well be wrong!
"So by messaging he means some sort of enterprise service bus was taken down?"
Sounds something like it. To quote Cruz - “we were unable to restore and use some of those backup systems because they themselves could not trust the messaging that had to take place amongst them.”
So, production system suffers major power failure, production backup power doesn't kick in, and either:
A) Power is restored to production but network infrastructure now knackered either due to hardware failure or someone (non-outsourced someone, obviously, 'coz he said so <coughs>) not saving routing and trust configuration to non-volatile memory in said hardware, so no messages forwarded.
B) DR is immediately brought online as the active system, but they then find that whatever trust mechanism is used on their messaging bus (account/ certificate/ network config) isn't set up properly so messages are refused or never get to the intended end-point in the first place, leaving their IT teams (non-outsourced IT teams, obviously, 'coz he said so <coughs>) scrabbling desperately through the documentation of applications they don't understand trying to work out WTF is going wrong.
Same old story, again and again...
- Mr Cruz, did you have backup power for your production data centre?
- Yes definitely, the very best.
- Mr Cruz, did you test your backup power supply?
- Erm, no, that takes effort and costs money...
- Ah, so you didn't have resilient backup power then, did you? Mr Cruz, did you have a DR environment?
- Yes definitely, the very best money can buy, no skimping on costs, honest...
- Mr Cruz, did you test failover to your DR environment?
- Erm, no, that takes effort and costs money...
- Ah, so you didn't have resilient DR capability then did you Mr Cruz?
- Mr Cruz did......etc. etc. ad nauseam...
There is now.
Funny, I thought Fuller's had closed that site and moved to an industrial estate in Maidstone or Nuneaton or something -- but I was completely wrong: https://www.fullers.co.uk/brewery
Doesn't it look nice? Mmmm... ale...
"I realise that backup systems have been mentioned"
I used to work for a company, large company, that provided DR services. The vast majority of companies treat DR as a compliance checkbox. They buy some DR services so they can say they have DR services... but in the event of a primary data center loss, there really is only the rough outline of a plan. Basically their data, or most of it, is in some alternative site and they may have the rest of their gear there too or not. There is rarely anything resembling a real time switch over from site A to site B in case of a disaster in which their entire stack(s) would come up without any manual intervention at site B. Mainly because architectures are a hodge podge of stuff which has collected over the years. Many companies never rewrite or modernize anything, meaning much of the environment is legacy with legacy HA/DR tools... and there is sparse automation.
"It smells like a store-and-forward messaging system from the dawn of the mainframe age"
Probably the case.
But there will be as soon as enough prescriptive-grammar fogeys who can remember that once there wasn't die off. This is how language evolves: by the death of idiots.
Cruz previously worked at Vueling which has a terrible record for cancellations, lost bookings and cruddy customer service - so he's clearly brought his experience over.
He was appointed to cut costs at BA which he's done by emulating RyanAir and EasyJet whilst keeping BA prices. He's allowed the airline to go downmarket just as the Turkish, the Gulf and Asian carriers are hitting their stride in offering world-wide routing and don't treat customers like crap. Comparing Emirates to BA in economy is like chalk and cheese.
BA's only hope is if the American carriers continue to be as dreadful as ever.
Re: "millions of messages"
Three quarters of those messages were probably copies of the ones I get from BA advertising holidays and telling me my points are about to expire because I haven't sufficiently braced myself yet for another whirl of the unique BA customer experience.
IT and the CIO don't fall under him, IT is provided by [parent company] IAG "Global Business Services" as of last year.
As a director of BA, he is in fact responsible in law, even if the group have chosen to provide the service differently. I work for a UK based, foreign owned energy company. Our IT is supported by Anonco Business Services, incorporated in the parent company's jurisdiction, and owned by the ultimate parent. If our IT screws up (which it does with some regularity), our customers have redress against the UK business, and our directors hold the full contractual, legal and regulatory liability, whether the service that screwed up is in-house, outsourced, or delivered via captive service companies.
There's a difference between disaster recovery and high-availability (though they do overlap).
It's perfectly reasonable that disaster recovery is a manual fail-over process. Fully resilient systems over two geographically separated locations can be hard and expensive to implement for smaller companies with not much in the way of a budget or resources, and so you have to compromise your expectations for DR.
Even if failing-over can be automated, there might be a high cost in failing-back afterwards, and so you might actually prefer the site to be down for a short while instead of kicking in the DR procedures; it works out cheaper and avoids complications with restoring the primary site from the DR site.
Not every company runs a service that absolutely needs to be up 24/7.
A lot of people designing the DR infrastructure will be limited by the (often poor) choices of technology made by the people that wrote the in-house stuff.
As an example, replicating your MySQL database between two datacentres is more complicated than most people would expect. Do you do normal replication and risk that the slave has lagged behind the master at the point of failure, losing data? Or use synchronous replication like Galera at the cost of a big latency hit to the cluster, slowing it right down?
If it's normal replication, do you risk master-master so that it's easy to fail-back, with the caveat that master-master is generally frowned upon for good reasons?
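To put the replication trade-off in concrete terms, here is a toy Python model. It is a sketch only, under loud assumptions: `AsyncPair` and `SyncPair` are invented names, the round-trip figure is arbitrary, and neither class reflects how MySQL replication or Galera is actually implemented. It only shows the shape of the choice: asynchronous replication acknowledges writes immediately and can lose whatever has not been shipped yet, while a synchronous scheme loses nothing but pays the inter-site round trip on every write.

```python
class AsyncPair:
    """Toy asynchronous replication: primary acks before the replica has the row."""

    def __init__(self):
        self.primary = []
        self.replica = []

    def write(self, row):
        self.primary.append(row)  # acknowledged straight away

    def replicate(self, n=1):
        # Ship up to n not-yet-replicated rows to the other site.
        start = len(self.replica)
        self.replica.extend(self.primary[start:start + n])

    def fail_over(self):
        # Primary is gone: anything not yet shipped is lost.
        lost = self.primary[len(self.replica):]
        return list(self.replica), lost


class SyncPair:
    """Toy synchronous replication: every write waits for the remote site."""

    def __init__(self, round_trip_ms):
        self.round_trip_ms = round_trip_ms  # assumed inter-site latency
        self.primary = []
        self.replica = []
        self.total_latency_ms = 0

    def write(self, row):
        self.primary.append(row)
        self.replica.append(row)            # committed on both sides
        self.total_latency_ms += self.round_trip_ms

    def fail_over(self):
        lost = self.primary[len(self.replica):]
        return list(self.replica), lost


def demo():
    a = AsyncPair()
    for row in range(1, 6):
        a.write(row)
    a.replicate(3)                          # replica lags by two rows
    async_result = a.fail_over()

    s = SyncPair(round_trip_ms=10)
    for row in range(1, 6):
        s.write(row)
    sync_result = s.fail_over()
    return async_result, sync_result, s.total_latency_ms
```

With five writes and the replica two rows behind, the asynchronous pair fails over with rows 4 and 5 missing; the synchronous pair loses nothing but has accumulated 50 ms of added write latency at the assumed 10 ms round trip.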
I think it's disingenuous to berate people for implementing something that can be very difficult to implement.
Though of course, large companies with lots of money and lots to lose by being down (like BA) have no excuses.
Re: "and the total cost will be about £100m loss of profit. I wonder if that will affect his bonus?"
“Carter Burke in Aliens”
Sticking with movies, Johnny from Airplane! springs to mind...
“Just kidding! Oh, wrong cable. Should’ve been the grey one. Rapunzel! Rapunzel!”
"Funny, I thought Fuller's had closed that site ... but I was completely wrong: https://www.fullers.co.uk/brewery
Doesn't it look nice? Mmmm... ale..."
They do an excellent brewery tour with a tasting session in their bar/museum afterwards :)
I had the pleasure of flying back to the UK on American in Business Class recently - service and comfort was a notch above BA Club World, and the ticket was cheaper than BA Premium Economy. BA are screwed...
<rant> True. But at least that's one I can, however reluctantly, at least imagine.
For me, by far the worst example of this American obsession with creating non-existent 'verbs' is, obviously, 'to leverage'.
Surely that sounds as crass to even the most dim-witted American as it does to everyone else in the English speaking world, doesn't it? I'm told these words are created to make the speaker sound important when they are clueless.
I can accept that some lone moron invented the word. But why did the number of people using it ever rise above 1? </rant>
That is our DR to a tee, so glad I'm not the boss
anon for obvious reasons
Die off is fine. So is die back. They're descriptive and worth keeping. Architect as a verb is more or less OK, although why did someone assume 'design' wasn't good enough, since it's a correct description of the process, making architect as a verb a replacement for a word that didn't need replacing.
Re: "I wonder if that will affect his bonus?"
A former boss referred to it as "f*ck up and move up".
Though, admittedly, a change of employer is sometimes implied.
death birth of idiots.
No such verb as to architect?
Design and configure (a program or system)
‘an architected information interface’
A modern IBM mainframe is bleeding edge hardware and software presenting a venerable 1960s facade to its venerable programmers.
And always has done. In the early '90s, I was maintaining TPF assembler code that was originally written in the '60s (some was older than me!).
And I doubt very much if those systems are not still at the heart of things - they worked. In the same way as banks still have lots of stuff using Cobol, I suspect airlines still have a lot of IBM mainframes running TPF. With lots of shiny interfaces so that modern stuff can be done with the source data.
*I can accept that some lone moron invented the word. But why did the number of people using it ever rise above 1?*
Because the number of morons is >>>> 0
 Yes - I made up >>>>> to be "a far, far larger number than the one compared against" It's mine. You can't have it. So there.
"They probably thought they could just pop down the road to the brewery in Chiswick and get a refill there."
You thought they could organise....?
I doubt they are that modern to be using SOA architecture.
I read a transcript of what he said the other day and he was alluding to network switches going down. So I think he's trying to dumb down his words for a non-technical audience: messaging, aka packets being switched or routed across the network between servers and apps.
If he is a director of BA! A search of Companies House finds a director of a BA company in the name of
Alejandro Cruz De Llano
I'm guessing this is him?
A member of staff of a company only has legal responsibility if they are a registered director with Companies House. The fact the company calls them a CEO or director does not mean they are a registered director.
You can't test everything all the time.
Stuff happens and often the unimagined causes grief.
Redundancies are only guaranteed when they come from HR.
Re: "I wonder if that will affect his bonus?"
He didn't get one last year
"Alex Cruz, the Spanish CEO of British Airways, will not receive a bonus for 2016 from the IAG airlines group. The company said in a statement to the National Stock Market Commission that he will be the only one of the 12 senior executives not to receive a bonus. "
'Resource': as you said, an org with so little respect for IT that it calls critical staff that is deeply worrying.
Re: "millions of messages"
You are probably 100% correct - The IP packets are probably "ICMP unreachable"
With lots of shiny interfaces so that modern stuff can be done with the source data.
Dunno if it's shiny, but probably something like MQ.
For the most part it seems to do a fairly decent job in distributed systems if it has been properly configured.
RE: There is now.
It's too late now, the disaster has already happened. Very much like the old story "shut the gate, the horse has bolted."
Suggestion we were given some twenty-five years ago: Don't verb nouns.
His CEO experience is from a minor airline, so accept that fact. His previous experience reads well but then every exec I know of makes sure it does. : )
Should he jump? Probably not but some people somewhere must be guilty of hiding, or not implementing, necessary IT improvements.