The biggest British Airways IT meltdown WTF: 200 systems in the critical path?

One of the key principles of designing any high availability system is to make sure only vital apps or functions use it and everything else doesn't – sometimes referred to as KISS (Keep It Simple Stupid). High availability or reliability is always technically challenging at whatever systems level it is achieved, be it hardware …

Silver badge
Joke

More importantly

Is it just me, or does that picture look like David Cameron is about to be eaten by a dinosaur? Have we accidentally uncovered Angela Merkel's real identity?

14
8
Anonymous Coward

Re: More importantly

I'd say Theresa May's, but velociraptors seem to be effective, have a mission and set out to do it, rather than changing their mission with the daily headlines ....

25
6
Silver badge

Re: More importantly

Most large dinosaurs are strong and stable

36
0
Silver badge

Re: More importantly

2 thumbs down for my joke?

Is that you Mr & Mrs Cameron?

9
4
Silver badge

Re: More importantly

Thumbs down from me only because I'm bored sick of reading political posts in almost every comment section now.

20
14
Silver badge

Re: More importantly

Don't worry, citizen. Most forum comments will be deemed thought crime in a few years. The citizens posting them will be in reeducation.

Strong and stable reeducation. Nothing to see here please move along.

27
0
Silver badge
Big Brother

Re: More importantly

@boltar: "Thumbs down from me only because I'm bored sick of reading political posts in almost every comment section now."

You usually see these kind of posts when they're attempting to derail the subject and distract from the contents of the main article. That being, the crap IT system at BA and who is responsible for it.

9
0
Silver badge

Re: More importantly

You usually see these kind of posts when they're attempting to derail the subject and distract from the contents of the main article

Nope, just making a joke. Surprised anybody took the trouble to comment.

2
2
Silver badge

Ignorance and greed. Trusting the professionals to do their job can overcome the former; not so sure about the latter. BTW it needn't necessarily be the greed of immediate managers. Businesses in general tend to live like the grasshopper before he learns his valuable lesson.

9
0
Silver badge

Ignorance and greed

In my experience, it has more to do with ignorance. I am always amazed at how poor most people are at judging risk. In particular, people tend to overestimate and agonize over vanishingly small risks and underestimate the mundane, everyday risks they face. Examples of each are being killed by a terrorist in the US versus being killed in a car accident. That can make for bad policy and decisions, in politics and in business.

61
0
Silver badge

Re: Ignorance and greed

From the article "Indeed, it was far from clear that even senior NASA management were actually capable of understanding the warnings their engineers were raising – often having neither an engineering or a scientific background."

This - in every company I've worked for. Even in the ones where they had some engineering experience it was so out of date as to be useless, or they actually only had a talent for climbing greasy poles. The best boss I ever had was an utter charlatan, but he had the sense to leave the engineering to those that knew about it.

25
0
Silver badge

Re: Ignorance and greed

" Examples of each are being killed by a terrorist in the US versus being killed in a car accident."

People always underestimate risk when they feel they're in control.

9
0

Re: Ignorance and greed

"Even in the ones where they had some engineering experience it was so out of date as to be useless"

In IBM I referred to these managers as "technicians". They had forgotten everything they ever learned in Engineering and had learned nothing in Management.

1
2
Silver badge

Re: Ignorance and greed

I've used this before.

The greatest test of an engineer is not his technical ingenuity but his ability to persuade those in power who do not want to be persuaded and convince those for whom the evidence of their own eyes is anything but convincing.

Extract from "Plain Words" in The Engineer 2nd October 1959

12
0
Anonymous Coward

Re: Ignorance and greed

>The greatest test of an engineer is not his technical ingenuity but his ability to persuade those in power who do not want to be persuaded and convince those for whom the evidence of their own eyes is anything but convincing.

Wise words for 1959 but in this post fact world the engineer needs more in the way of seminary skills than logic or debate.

1
0
Silver badge

Re: Ignorance and greed

I am always amazed at how poor most people are at judging risk. In particular, people tend to overestimate and agonize over vanishingly small risks and underestimate the mundane, everyday risks they face.

Well spotted. Here's a bit more on that.

https://en.wikipedia.org/wiki/Risk_perception

In the words of Mr Munroe, "Six hours of fascinated clicking later..."

1
0
Anonymous Coward

The problem for many businesses is that their competition are cutting corners and cutting out as much spend as possible too. Customers then reward that behaviour - there's no point having the most reliable IT stack if you have no customers left to fund it all. You end up with a capture effect where the luckiest cheapskate has all the customers until their luck runs out, then people build resilient systems from scratch all over again, which immediately start getting cost reduced.

3
0
Silver badge
Boffin

@James51

No, not ignorance and greed.

Try microservices in addition to having legacy systems in place, where it is cheaper to add another microservice to the chain than it is to rewrite the original service with the additional feature and test it.

The one advantage is that if you have only a certain class of travelers who have an additional process to check some sort of security... you don't have to run everyone thru that process.

Note, I'm not suggesting that this is the case, or that this model is the best fit for BA, but it could be viable and it is what is happening when you consider stream processing.

The issue is that at some point you run into a problem when the chain gets too long and it breaks in places and you don't know how to move forward or handle the errors.
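A rough sketch of why a long chain hurts, assuming every service in the chain must be up and that failures are independent. The 99.99% per-service figure is an illustrative assumption, not a BA number; the 200 comes from the article headline.

```python
def chain_availability(per_service: float, n_services: int) -> float:
    """Availability of a chain in which all n independent services must be up."""
    return per_service ** n_services

HOURS_PER_YEAR = 8760

# Assumed per-service availability of 99.99% (hypothetical figure).
for n in (1, 10, 50, 200):
    a = chain_availability(0.9999, n)
    downtime_hours = (1.0 - a) * HOURS_PER_YEAR
    print(f"{n:>3} services in series: {a:.4%} available, "
          f"about {downtime_hours:.1f} hours down per year")
```

Under those assumptions a 200-link chain is only about 98% available even though each link looks excellent on its own. Real failures are rarely independent, so treat this as a way of seeing the trend rather than as a prediction.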

0
0

low availability cluster

This sounds scarily like a standard blueprint for a service-oriented architecture gone horribly wrong. In financial markets it was common for "tib-storms" to crash a broadcast network with re-requests to sync topics, but capacity, tiering and investment addressed it.

My money is on a panicked recovery - like the RBS CA7 debacle.

0
0
Anonymous Coward

Sunny when it is working

We provide cloud services and connectivity is the key factor in the actual uptime of our services to our end users. IT managers regularly make the wrong decision based on perceived priorities. If something is working, then tasks related to it go down the list and are even forgotten.

For example, keeping some services running over ADSL when you have a new leased line available and not prioritising the work to switch because everything is working. Statistics tell us leased lines have better availability and quality of service but the customer often only reacts when a failure happens.

I know that they said outsourcing is not to blame but getting rid of people that know how things work or are held together is a dangerous risk. How many companies know nothing about what all the boxes do let alone how they are all dependent on each other?

44
0
Silver badge

Re: Sunny when it is working

getting rid of people that know how things work or are held together is a dangerous risk

This is an inevitable consequence of the regular "efficiency" pogroms that most companies undertake against their own support services (and that applies to functions like finance and procurement too). There is a vast amount of tacit knowledge in employees' heads that is never written down, and which the business places no value on until things go wrong. By then it is too late, because these pogroms are always selective - the people seen as whingers, the challenging, the "difficult", those simply so clever or well informed that they are a threat to management, all are first on the list to go. And unfortunately those are often the people who know how much string and sellotape holds everything together.

I work for a large company that has a home brew CRM of great complexity. It works pretty well, costs next to nothing in licence fees (cf SAP or Oracle), and we have absolute control - we're not beholden to a tech company who can force upgrade sales by "ending support". Over recent years we've outsourced many levels of IT to HPE, and each time something new goes over the fence, HPE waste no time in getting rid of the expensive talent that has been TUPE'd across. We did even have a CIO who understood the system - but he's been pushed out and replaced by a corporate SAP-head. You can guess what's going on now - the company is sleepwalking into replacing a low risk, stable CRM with a very high risk, high cost SAP implementation, and at the end of it will have a similarly complex CRM, except that it will cost us far more in annual licence fees, we'll have no control of the technology, and the costs of the changeover alone will total around £400m, judging by the serial screwups by all of our competitors.

64
0
Gold badge

Re: Sunny when it is working

"and which the business places no value on until things go wrong"

The only way to find out whether everyone still employed knows how to rebuild the system, is to provide them with an opportunity to do it. (It needn't be the actual system. You can let them assemble a clone.) Of course, that's expensive, but that is the cost of finding out whether the proposed efficiency drive is safe. My guess is that if that cost was included in the business case for the efficiency drive, the case would disappear.

Taking the argument a step further, it is easily seen that it isn't safe to let any of your staff go until you have reached the point where the system can be rebuilt by script. That's going to be an unpopular conclusion within management circles, but its unpopularity doesn't mean it is wrong.

10
0
Silver badge

Re: Sunny when it is working

Problem is most IT managers see the free audit and recommendations as salesmen trying to sell shit. So they won't learn.

Better to take the audit for the reports and ignore the phone for a month.

0
0
Silver badge
Facepalm

Re: Sunny when it is working

I know that they said outsourcing is not to blame but getting rid of people that know how things work or are held together is a dangerous risk. How many companies know nothing about what all the boxes do let alone how they are all dependent on each other?

Another factor that I see happening is feature sprawl, add-ons often being introduced as 'nice to have', with a low priority to fix if broken. Problem is, even if those features keep being handled at low prio[0], each of those features adds to the knowledge the first and second line support have to have at the ready, as well as simply adding to the workload as such. Having to not just physically but also mentally switch from one environment to another if a more urgent problem comes in and you have to suspend or hand off the first problem because you're the one who best understands the second one is another matter.

[0] and often they don't, because the additional info they provide allows for instance faster handling of processes, smoother workflow, better overview, etcetera, and after a while people balk at having to do without them. So even when they're still officially low prio, call handling often bumps them to medium or even high because "people can't work". Oh yes they can; how about remembering the workflow that doesn't rely on those add-ons? The workflow they were trained in?

4
0
Silver badge

Re: Sunny when it is working

"it is easily seen that it isn't safe to let any of your staff go until you have reached the point where the system can be rebuilt by script."

And even then, when the staff are let go you may find nobody knows what the script actually does and you will even more likely find that nobody knows why it does it.

Not only do you need to retain knowledgeable staff, you need to have succession planning in place.

7
0
Anonymous Coward

Re: Sunny when it is working

Worked for a large firm and we switched from one provider to another for some fairly mission critical stuff. The previous system required essentially one program to be running and that was it. The new system required several add-on programs (some TSR) to be running in addition to the main program on a user's PC. The first time we noticed this wasn't long after deployment when someone couldn't start the main software on their machine. We tried various things with the vendor on the phone before they suggested that one of the other little progs might not be running or have stopped.

Spoke to someone else who used their software and he said that they didn't really do traditional updates to their software. If some functionality was additionally needed they'd just write another small add on program to provide this. Then after a few years release a completely new product complete with new name plus the bells and whistles added to the old one and the cycle restarts. Didn't exactly fill me with confidence.

1
0
Bronze badge

Re: Sunny when it is working

TSR? That brings back memories. You could also understand this as a reaction to "creeping featurism" on the part of the client company.

In DR-DOS I used to use TSRs to achieve needed functionality on a work PC. Difference is, in that world, nobody ever produced a single program to provide the same functionality.

0
0

Typo? Looks strange

"shuttle failure was "necessarily" one in 105"

1 in 105? is this perhaps meant to mean "1 in 10^5"?

9
0
Silver badge

Re: Typo? Looks strange

Yup. 1 in 100,000 was the figure projected by NASA.

8
0
Pint

Re: Typo? Looks strange

Updated

0
0

Re: Typo? Looks strange

"The chance that an HTHTP pipe will burst is 10^-7." You can't estimate things like that; a probability of 1 in 10,000,000 is almost impossible to estimate. It was clear that the numbers for each part of the engine were chosen so that when you add everything together you get 1 in 100,000.

From "What Do You Care What Other People Think", Richard Feynman.
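To put a number on "almost impossible to estimate", here is a small sketch of how many failure-free trials would be needed before a bound that small could be claimed from test data. It uses the standard zero-failure binomial bound, which is my own choice of illustration and not Feynman's argument; the 95% confidence level is an assumption.

```python
import math

def trials_needed(p_max: float, confidence: float = 0.95) -> int:
    """Failure-free independent trials needed to claim P(failure) <= p_max.

    If the true failure probability were p_max, the chance of seeing zero
    failures in n trials is (1 - p_max)**n. We need that to drop below
    1 - confidence before the claim is supported by the data.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_max))

for p in (1e-3, 1e-5, 1e-7):
    print(f"to support p <= {p:g}: {trials_needed(p):,} failure-free trials")
```

For 10^-7 that is on the order of thirty million clean tests of the part, which is why a figure like that has to come from somewhere other than measurement.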

15
0

Re: Typo? Looks strange

@TDog

"You can't estimate things like that; a probability of 1 in 10,000,000 is almost impossible to estimate."

It's also wildly meaningless. 1 in 10m whats? Messages through the system? Milliseconds? Times the life of the universe? (Remember six sigma events happened daily during the biggest move days of the financial crash... either the universe is impossibly old and all those events happened in a row, or these sort of 1 in x statistics are complete bunkum).
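A quick sketch of the six sigma point, assuming daily moves follow a normal distribution and that a year has 252 trading days. Both are assumptions of the model being criticised, not facts about markets.

```python
import math

TRADING_DAYS_PER_YEAR = 252  # assumed convention

def two_sided_tail(k_sigma: float) -> float:
    """Probability that a normal variable lands at least k sigma from its mean."""
    return math.erfc(k_sigma / math.sqrt(2.0))

def years_between_events(k_sigma: float) -> float:
    """Expected wait, in years, for one daily move of k sigma or more."""
    return 1.0 / (two_sided_tail(k_sigma) * TRADING_DAYS_PER_YEAR)

for k in (2, 4, 6):
    print(f"{k} sigma: p = {two_sided_tail(k):.3g} per day, "
          f"one every {years_between_events(k):.3g} years")
```

Under the normal assumption a six sigma day should turn up once in millions of years, so seeing several in a week says the distribution was wrong, not that the universe got unlucky.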

9
2

Re: Typo? Looks strange

Everyone should read Richard Feynman.

33
0
Silver badge

Re: Typo? Looks strange

Strongly seconded. His writing is hugely entertaining as well as educational.

7
0
Silver badge

Re: Typo? Looks strange

When working on fibre optics we used to work for an error rate of less than 1 bit in 10**14 bits. It's actually not that hard to work out if you are above or below that level at the theory level. Sitting in the lab for whatever was required to check that less than 1 bit every 3 days is wrong on average for 400Mb is another matter altogether.
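The "1 bit every 3 days" figure can be checked with a few lines, assuming "400Mb" means a 400 Mbit/s line rate. The second function, which estimates how long an error-free soak test has to run, uses a 95% confidence level that is my assumption rather than anything stated above.

```python
import math

SECONDS_PER_DAY = 86_400

def seconds_per_error(ber: float, bit_rate_bps: float) -> float:
    """Mean time between bit errors for a given bit error rate and line rate."""
    return 1.0 / (ber * bit_rate_bps)

def error_free_test_seconds(ber: float, bit_rate_bps: float,
                            confidence: float = 0.95) -> float:
    """Error-free run time needed to claim the true error rate is below `ber`.

    Uses the small-rate approximation: bits needed ~= -ln(1 - confidence) / ber.
    """
    bits_needed = -math.log(1.0 - confidence) / ber
    return bits_needed / bit_rate_bps

ber = 1e-14    # target from the comment: 1 bit in 10**14
rate = 400e6   # assumption: 400 Mbit/s

print(f"mean time between errors: "
      f"{seconds_per_error(ber, rate) / SECONDS_PER_DAY:.2f} days")
print(f"error-free run for 95% confidence: "
      f"{error_free_test_seconds(ber, rate) / SECONDS_PER_DAY:.2f} days")
```

So the mean gap between errors is a little under three days, and demonstrating the rate in the lab means watching a clean link for well over a week.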

7
0
Silver badge

Re: Typo? Looks strange - Everyone should read Richard Feynman.

Mrs. May wants to know why you want people reading stuff that would be useful to terrorists.

(I would prefer that Daesh supporters continued to believe in miracles rather than science, thanks.)

8
0
Gold badge
Joke

" His writing is hugely entertaining as well as educational."

Indeed.

But I could never get the image of him as a New York taxi driver chewing a cigar out of my head.

"How 'bout that Quantum Chrono Dynamics, huh? Virtual particles mediating force transfer in a vacuum. Tricky stuff. You in town on business?"

Joking aside the world is poorer, not just for his intellect and vision but also for his ability to explain complex ideas. His rubber band in a cup of ice water (modelling the root cause of the Challenger crash) was a classic. Simple enough for even the "I don't understand science" crowd to grasp.

13
0
Anonymous Coward

Re: Feynman: see also Haddon-Cave

" [feynman's] writing is hugely entertaining as well as educational."

Closer to home in the UK, there's a senior judge called Charles Haddon Cave. He's a lawyer not a scientist or engineer, but if you need an inquiry done properly, he seems like a good man to have on your side. His writing is also educational, and entertaining in a way.

See e.g. his talk(s) on "Leadership & Culture, Principles & Professionalism, Simplicity & Safety – Lessons from the Nimrod Review"

RAF Nimrod XV230 suffered a catastrophic mid-air fire whilst on a routine mission over Helmand Province in Afghanistan on 2nd September 2006. This led to the total loss of the aircraft and the death of all 14 service personnel on board. It was the biggest single loss of life of British service personnel in one incident since the Falklands War. The cause was not enemy fire, but leaking fuel being ignited by an exposed hot cross-feed pipe. It was a pure technical failure. It was an accident waiting to happen.

The deeper causes were organizational and managerial. This presentation addresses:

(1) A failure of Leadership, Culture and Priorities

(2) The four States of Man (Risk Ignorant, Cavalier, Averse and Sensible)

(3) Inconvenient Truths

(4) The importance of simplicity

(5) Seven Steps to the loss of Nimrod (over 30 years)

(6) Seven Themes of Nimrod

(7) Ten Commandments of Nimrod

(8) The four LIPS Principles (Leadership, Independence, People and Simplicity)

(9) The four classic cultures (Flexible, Just, Learning and Reporting Cultures)

(10) The vital fifth culture (A Questioning Culture) "

See especially point 10: A Questioning Culture.

In various places, just search for it (I have to be elsewhere ASAP).

As well as the Nimrod inquiry, from memory he also did the inquiry for the Piper Alpha oil rig disaster and the Herald of Free Enterprise ferry disaster.

6
0

Re: " His writing is hugely entertaining as well as educational."

I think he was born at the southern end of NY, but with that accent he should be from Noo Joizy.

I always fondly imagine him wearing a zoot suit and spats, carrying a violin case.

One unarguably great thing Bill Gates did was to buy the rights to the lecture series so we can all watch them for free.

8
0

Re: Typo? Looks strange

Everyone should read the Rogers Commission appendix by Richard Feynman at the very least:

"For a successful technology, reality must take precedence over public relations, for nature cannot be fooled."

https://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/Appendix-F.txt

8
0
Silver badge

Re: Typo? Looks strange

"You can't estimate things like that; a probability of 1 in 10,000,000 is almost impossible to estimate."

But it will happen 99.9% of the time.

(Apologies to TP.)

7
0
Silver badge

Re: Typo? Looks strange

>Everyone should read Richard Feynman.

The most underrated and largely unknown boffin by the public probably of all time (certainly of the 20th century). Though Maxwell is right up there as well.

6
0
Gold badge
Unhappy

Re: Typo? Looks strange

""For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.""

I can't recall if it was him or AC Clarke who commented "Against the laws of Physics there are no appeals."

The Universe does not care how rich, famous or powerful you are. If a meteorite comes through your roof all that matters is are you in its path or not (yes people really have died of this).

1
0
Silver badge

Re: Typo? Looks strange - Everyone should read Richard Feynman.

Voyna i Mor - "(I would prefer that Daesh supporters continued to believe in miracles rather than science, thanks.)"

Really? If they believed in science, surely they'd stop supporting Daesh?

What is the scientific likelihood of enjoying 72 virgins (or white raisins) after death?

0
0
Anonymous Coward

Re: Typo? Looks strange

I think I see the problem here:- NASA predicted 1 in 10^5, and the actual probability of failure was 1 in 10^6, which, as all readers of great literature know is almost certainly going to happen.

3
0
Silver badge

Re: Feynman: see also Haddon-Cave

Another tangent: accident investigation reports can be very thought provoking, as well as interesting in their own right. Chernobyl, both Shuttle accidents, the Deepwater Horizon / Macondo 252, Piper Alpha, and all sorts of air accident investigation reports -- all have lessons, and describe similar patterns of organisational and system design or operation failures or accidents waiting to happen to those in many fellow commentards' workplaces. Recognising them doesn't necessarily help you stop them happening, because the root causes are often many pay grades above one's own, but it does make saying "I told you so" more fun.

2
0
Silver badge
Thumb Up

Re: Feynman: see also Haddon-Cave

I think this is the Charles Haddon Cave talk you refer to:

https://www.youtube.com/watch?v=y99_lhFFCsk

1
0
Anonymous Coward

Re: Feynman: see also Haddon-Cave

That'll do nicely, thanks. Charles Haddon-Cave's Piper Alpha 25 presentation session is a good place to start. It's nearly an hour long, but can mostly be treated as radio.

There is an almost identical script (or maybe transcript) at

https://www.judiciary.gov.uk/wp-content/uploads/JCO/Documents/Speeches/ch-c-speech-piper25-190613.pdf

0
0
Anonymous Coward

Re: Feynman: see also Haddon-Cave

"Recognising them doesn't necessarily help you stop them happening, because the root causes are often many pay grades above one's own."

True.

"it does make saying "I told you so" more fun,."

Please don't take this the wrong way, but how much fun is there when being ignored by management leads to e.g. a fatal incident which could easily have been avoided?

1
0
Silver badge

Workers defending their territory; managers afraid to challenge them.

This sounds like a situation where each worker aggressively defends his or her patch. "No, you can't possibly merge my legacy paper reporting system with Bob's new email reporting system, because [insert ridiculous reason here]." Given the chance, most of us will defend the systems we maintain (and by extension our jobs): it's human nature. A manager's job is to challenge the ridiculous reasons given.

BA's management are squarely to blame here.

17
0

Biting the hand that feeds IT © 1998–2017