The biggest British Airways IT meltdown WTF: 200 systems in the critical path?

One of the key principles of designing any high availability system is to make sure only vital apps or functions use it and everything else doesn't – sometimes referred to as KISS (Keep It Simple Stupid). High availability or reliability is always technically challenging at whatever systems level it is achieved, be it hardware …

LDS
Silver badge

Re: Workers defending their territory; managers afraid to challenge them.

VMs and containers might be making it even more complex - now everyone will want his or her system to work separately from everything else, yet they still need to share something (and in the worst case you have a lot of duplication to keep in sync) and communicate with each other.

While there are often good reasons to have multiple separate systems, there are also sometimes good reasons to consolidate them (to simplify the architecture) and then make them redundant and fail-safe (which is easier to achieve when the architecture is simpler).

11
0
Silver badge

Re: Workers defending their territory; managers afraid to challenge them.

This sounds like a situation where each worker aggressively defends his or her patch

And why do you think it will be any different if every single one of them is perceived as a cost to be shoveled off to TaTa?

3
0
Silver badge

Re: Workers defending their territory; managers afraid to challenge them.

And why do you think it will be any different if every single one of them is perceived as a cost to be shoveled off to TaTa?

I didn't say anything about outsourcing. Outsourcing doesn't solve the problem at all: it merely shifts the problem to another company, and conceals the complexity from the end client.

Rather, it's an internal problem of employees being allowed to take "possession" of their little piece of the system (or in BA's case, their 1 system out of the 200). It then becomes hard to move or replace that person, and they become very resistant to change. I've seen this happening in a lot of places, especially large government or quasi-government organisations. The way to avoid it is for management to rotate employees around different systems so that everyone knows a bit about how three or four systems work, rather than just knowing a single system in depth. This also helps you recover if/when the critical employee leaves.

I don't have any specific knowledge of the BA situation; but 200 critical systems in an organisation with strong unions (making it hard to fire intransigent workers) suggests something like this may have happened.

8
0
Silver badge

Re: Workers defending their territory; managers afraid to challenge them.

I'd imagine it's more likely managers defending their territory than their subordinates. It will almost certainly be the manager or sales rep who requires a certain service and refuses to have it taken away.

7
0
Silver badge

Re: Workers defending their territory; managers afraid to challenge them.

"The way to avoid it is for management to rotate employees around different systems"

Ouch! This is how the Civil Service produces senior officials who can avoid responsibility for anything. Something goes wrong on A's watch and he immediately blames predecessor B who in turn blames predecessor C who immediately blames A and/or B.

5
0
Silver badge
Thumb Up

Re: Workers defending their territory; managers afraid to challenge them.

Given the chance, most of us will defend the systems we maintain (and by extension our jobs): it's human nature.

I had a tour around the Mercedes engine factory in Stuttgart a few years ago. Our guide proudly told us that they'd never made anyone redundant since the 50s, and it meant that employees were encouraged to suggest ideas that would mean they'd need fewer people, without worrying about their jobs. If more companies took that attitude, who knows how much more efficient they'd be.

5
0
Silver badge

Re: Workers defending their territory; managers afraid to challenge them.

> Ouch! This is how the Civil Service ...

Yes, fair point. But with developers, you only get rotated around 3-4 systems, so you eventually come back to code you previously worked on. The Civil Service path is one-way, so you never have a chance to apply lessons learned elsewhere to your previous mistakes.

0
0
Silver badge

Re: Workers defending their territory; managers afraid to challenge them.

"eventually come back to code you previously worked on."

And fail to understand a word of it.

2
0

Prophets of doom

> actually there are none – that like or encourage prophets of doom.

Except perhaps religious organizations.

11
1
Silver badge

Re: Prophets of doom

Except perhaps religious organizations.

Well, British Airways may not want any prophets of doom, but as a consequence they're now enjoying the profits of doom, to the value of minus £100m. Serve the fuckers right, too.

24
0
LDS
Silver badge

Re: Prophets of doom

IMHO, internally, even religious organizations wish to be able to exploit their gullible devotees in this world, and to explore its sins fully, for a long, long time yet. It's no surprise that many prophets of doom who asked people to repent and renounce worldly riches and pleasures were sent to the stake....

3
2
TRT
Silver badge

Re: Prophets of doom

I made £100s with a port of a popular 8 bit sci-fi themed FPS...

Or was that a profit of Doom?

7
0

Re: Prophets of doom

32 bit, I believe. Sorry.

1
0
TRT
Silver badge

Re: Prophets of doom

It used an 8-bit palette, which is what I was referring to.

Although it began life as a 32-bit compile, it also ran on 16-bit DOS and other 16-bit platforms by using various techniques. It has been ported to a Vic-20, which is an 8-bit machine throughout.

1
0
Silver badge

Once worked for a company which ran every service or app on a different server, and each of those servers was duplicated off-site. Even SQL databases: one database per server (and then duplicated).

Rack upon rack of Compaq DL3x0 servers. Now, I didn't work for the IT dept (I was pimped out to paying customers), but when you find out the IT manager's nickname was Kermit, it didn't need explaining what a bunch of muppets the IT dept was. It was as though no one knew servers could multi-task!

4
11
Silver badge

> It was as though no one knew servers could multi-task !!

It always seemed to me that the vast majority of suppliers would only guarantee support for their systems, especially Windows systems, if they were the only thing on the box. As boxes are cheap and support is expensive, one box per system, no matter how stupid, was the sensible option. If you have multiple software vendors on the same box, the first response to any problem is almost guaranteed to be finger pointing.

And don't give me 'you should migrate to Linux'. Firstly, Linux suppliers weren't much better, and secondly, if the Windows system is the one that meets the users' perceived needs best then you're stuck with it, unless you run a remarkably dysfunctional, non-customer-focused shop.

24
0
Anonymous Coward

Thoughts

Too expensive and complex. Gradual change and adaptation have to be the order of the day in most places. Still, that's no excuse for accepting the rubbish that persists in many places.

1
0

I'd rather deal with a company like that than some bunch of smart arses who believe their own bullshit that they can make a few boxes do everything.

I wonder what the chances are that a setup like that, with similar-style power supplies, would end up doing a BA?

There is such a thing as the law of diminishing returns: if you run really vital services, doing it the simple but pricey way makes sense if you cannot afford, or survive, a single total failure, ever.

Did you leave or did they kick you out?

7
0

And have they ever suffered a total crash ?

6
0
Silver badge

We do the same, but with VMs on a clustered host. Management is simple with tools, and the only real extra cost is storage, as VM licensing is cheap enough. I don't duplicate off-site though, as one of the clusters is in a different part of the site.

Backups are different, of course; they go to another part of the site.

1
0
Bronze badge

Re: > It was as though no one knew servers could multi-task !!

Pretty much this.

"You need to be the only on the box? OK!" *builds VM for app* "Here you go!"

89.9% of the time, none's the wiser, and the other 10% of the time, the vendor is basically "Oh, it's a VM. We support that too!" (It's not *quite* 100%, because there's always that ONE VENDOR who INSISTS they be the only tenant on that host/set of hosts because their code sucks that badly and they tend to be of the 'throw more hardware resources at the problem' types.)

3
0
LDS
Silver badge

Re: > It was as though no one knew servers could multi-task !!

Running different workloads competing for resources on a single system may not be an easy task, especially on OSes that don't make it easy to partition and assign resources - i.e. Windows, where something can be done using Job Objects, but that requires application support. Otherwise a "rogue" application can exhaust all physical memory and/or CPU and starve other applications.

But developers capable of writing software that "plays nicely with others" are rare, because it's far simpler and quicker to use as many resources as you can find, instead of writing clever, optimized code. And even when the options exist, they are often overlooked, and the default installation is designed for a standalone application, because any other kind of tuning has to be done on an ad hoc basis.

So the simplest way became VMs, especially when the hypervisor can control the resources assigned to each VM.
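As a rough illustration of the "partition and assign resources" point, here is a minimal, Unix-only sketch that caps a child process so a rogue app can't starve its neighbours; Windows would need Job Objects instead, and the command path and limit values below are purely illustrative.

```python
# Minimal sketch (assumed workload and limits): cap a child process's memory
# and CPU so a "rogue" app cannot exhaust the whole box. POSIX only; the
# Windows equivalent would be a Job Object.
import resource
import subprocess

MEM_LIMIT_BYTES = 2 * 1024**3   # 2 GiB address-space cap - pick per workload
CPU_LIMIT_SECONDS = 3600        # hard stop after an hour of CPU time

def _apply_limits():
    # Runs in the child just before exec(): a runaway allocation now fails
    # inside this process instead of taking out everything else on the host.
    resource.setrlimit(resource.RLIMIT_AS, (MEM_LIMIT_BYTES, MEM_LIMIT_BYTES))
    resource.setrlimit(resource.RLIMIT_CPU, (CPU_LIMIT_SECONDS, CPU_LIMIT_SECONDS))

# "/usr/bin/some-batch-job" is a placeholder for whatever shares the box.
proc = subprocess.Popen(["/usr/bin/some-batch-job"], preexec_fn=_apply_limits)
proc.wait()
```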

1
0

Do any really large companies rip it all out and start again?

I'd love to hear examples of really large companies that wrip-out their IT and start again to get genuine resilience back after x years of smooth operating.

I'd be willing to bet the only time this gets close to happening is when a government or parent company forces them to do so.

Thoughts?

9
0
Anonymous Coward

Re: Do any really large companies rip it all out and start again?

no evidence of "wripping" in my career.

This is handled by outsourcing your IT then blaming the supplier.

12
0
Silver badge

Re: Do any really large companies rip it all out and start again?

I can think of one project during my career which was a complete rip-out-and-replace exercise...

A company I contracted to had a large system which they'd built up from scratch. They got bought out by a much larger company, who had their own corporate standard system for such things, and within a short time a decree was issued that company #1's system should be replaced.

We then had a very long and drawn out project (actually it was more like a programme of projects) to do a migration.

To be fair, the outcome was what had been requested, but I don't think that the amount spent on the migration could be paid back in terms of tangible benefits (at least not for the company involved - personally I scored quite a bit of overtime and an end-of-project bonus which sorted me out with a new fitted kitchen)

8
0
Silver badge

Re: Do any really large companies rip it all out and start again?

I have worked on one Data centre transformation program that had exactly that goal. Complete migration of all services to a new hosting provider and rearchitecting all services to give them the right level of resilience/recoverability.

I have worked on another similar program that didn't have quite the scope to rearchitect but we rebuilt to resilience/recovery patterns and rewrote and tested the IT Service Continuity plans.

Currently working on a smaller-scale program of work to identify resilience gaps and close them. My current client would have a few critical services out of action for at least a couple of weeks if there were a proper 'doom' scenario, and it's debatable whether some of the other services that look like they should be recoverable actually are, as no one has ever tested them. We are going to both remediate and test.

Apparently the former IT Director was a penny pincher.

5
0
Anonymous Coward

Re: Do any really large companies rip it all out and start again?

How about an example of a large 24x7 company that has a resilient system with DR sites and multiple systems that have grown over many years, overseen by different people, where the senior tech or CTO has the balls to test it by walking up to the master power switch in Datacentre 1 to show it all switching over seamlessly while the system is live.

The risks could be catastrophic and if the systems are running just fine who would risk it? I certainly wouldn't.

Maybe that is what a CTO in BA was trying to prove - the resilience of the datacentre even on a bank holiday weekend!

2
1
Gold badge

Re: Do any really large companies rip it all out and start again?

I'm afraid that this is precisely what a CTO has to be brave enough to do.

At least if you throw the switch yourself you can choose the moment. You can tell all your front-line staff beforehand, so that they can tell the customers what is going on. You can have a manual back-up plan resourced and in place, so that your front-line staff can actually respond. Best of all, you can not do it on a Bank Holiday weekend.

If you don't flick the switch yourself, it is only a matter of time before Fate throws the switch for you. Then you lose at least two of the aforementioned benefits and probably all three because many failure modes are more likely at busy times.

12
0
Silver badge

Re: Do any really large companies rip it all out and start again?

You really don't want to do an unplanned failover unless you really have to. Not without a massive amount of planning, anyway. The risks to your business are enormous, especially if the primary site comes back online in an unplanned way (that can be worse than the initial failure, and I suspect it's what fucked BA up).

What you do need to do is plan assiduously the following:

- what actions to take to provide service continuity if one of your sites fails (this might be nothing - your architecture might be resilient to site failure)

- how to prove that you can do that in a planned way

- what to do if it's unplanned (it's different)

- how to bring the failed site back online safely

If you are ever going to try and prove things work as designed and tested in an unplanned scenario, you need to do it on a production-like test environment, and that gets very expensive. Lots of people do it - finance institutions mainly, though I have done it at a big energy firm as well.

0
0
Silver badge

Re: Do any really large companies rip it all out and start again?

I know of two. One was when I was a lowly apprentice at British Leyland: DAF mandated a new IT system. It was a monster with a robotic tape library (8 drives, lord knows what capacity). This replaced some weird, bespoke-ish system.

The second was after I finished uni, when BAe were shutting down Strand Road in Preston. They amalgamated and partially replaced Warton's systems to accommodate it. That was a year's hands-on work after my HND. They paid for my final-year BSc too, which was nice.

1
0
Silver badge

Re: Do any really large companies rip it all out and start again?

I had a client - small business, maybe a dozen employees - who did this in the run-up to Y2K.

His servers were Xenix with a fairly old version of Informix and custom applications. He did a rip-and-replace with SCO and a packaged system that was allegedly Informix compatible; he wanted various custom tweaks added, and there were more of these over the years. Also over the years I gradually discovered various "interesting" aspects to the alleged Informix compatibility, which ended up with me directly amending the data in sysindexes so it reflected the actual indexes.

When he retired he sold the business to a group who presumably ripped and replaced with whatever they ran on as a group; certainly I never heard from them.

0
0
Silver badge

Re: Do any really large companies rip it all out and start again?

"Not without a massive amount of planning anyway."

You should have the massive amount of planning in place anyway. If you don't test it yourself on your own terms Murphy will do it for you and not at a time of your own choosing.

3
0
Silver badge

Re: Do any really large companies rip it all out and start again?

"I'd love to hear examples of really large companies that wrip-out their IT and start again to get genuine resilience back after x years of smooth operating."

It does happen. I know one company that did exactly that, building two new parallel DCs to replace the ageing tat with new, shiny, reliable kit. The problem was that retirement of the old DCs became a tangled and difficult process that took over five years to complete, leading to a doubling of costs for those five years. Even then it wasn't perfect: decommissioning the last DC resulted in a massive outage because someone had forgotten something important.

1
0
Silver badge

Re: Do any really large companies rip it all out and start again?

My DR test failover, at an operator of systems hosting financial trading:

The DR/failover plan existed on paper but hadn't been tested for years, during which enormous changes had been made to the code, systems and environments. Eventually management let ops spend a weekend testing it out. On paper, and in regulatory filings, it took 30 minutes. The first time it was tried, it took 14 hours. After three months of working on all the issues that came to light, they tried again: 2 hours this time. Another iteration of fixing and testing. Third time: 27 minutes. They now test it every quarter. They were in the fortunate position of having Friday night and most of the weekend to make changes with zero customer impact, but everything had to be fully operational by Sunday evening, ready for the start of trading. Doing that if you're a bank, or an airline, or any other 24/7 operation must be enormously difficult, and of course the longer it's left untested, the harder and more dangerous it gets to test.

1
0
Anonymous Coward

I have to disagree with what you have just said. While 200 systems might seem a lot, for a big critical application it might be "just right". Example:

- load-balancing web front end - 20 to 40 systems, all "critical" (if one goes down, the service stays up, but the downed system is still critical and the situation has to be resolved fast). From your BA example, with 3 datacenters they are split into 2 or 3 differently placed groups

- database high availability: at least 2 database servers if we are speaking about small numbers. We aren't, so we are speaking about much more: a lot of data, a lot of connections/queries. 20? 40? All critical. One goes down, there is no problem (theoretically).

- batches, middleware and application servers: we add some more critical servers again

- virtualization high availability: again we are speaking about big numbers; it is not a good idea to stuff too many critical machines onto the same critical physical server

- storage: again, big numbers, all critical

And it is not over yet. Depending on the quantity of data to be handled, 200 critical devices might be just right. They are all critical; on the other hand, if some of them go down, the others should have absolutely no problem holding the load. At least in theory.
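A back-of-envelope sketch of that last "the others hold the load" claim, with made-up node counts and availability figures, just to show how a properly redundant pool behaves compared with a single box:

```python
# Availability of a pool of n identical nodes where the service survives as
# long as at least k of them are up (binomial sum). All figures illustrative.
from math import comb

def pool_availability(a: float, n: int, k: int) -> float:
    # Probability that k or more of the n nodes are up at any given moment.
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))

a = 0.99  # a single node that is down roughly 3.7 days a year
print(pool_availability(a, n=1, k=1))    # 0.99 - one box, one point of failure
print(pool_availability(a, n=30, k=20))  # effectively 1.0 - a 20-of-30 front end
```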

8
5

Agree - 200 doesn't sound unusual for an organisation as IT-dependent as BA, bearing in mind their business went IT-dependent very, very early.

Whether it's good or not, I'd hazard a guess that it's NOT an outlier compared to other organisations its size.

Rather suspect the author hasn't worked in any organisation as big and hasn't had to contend with the sheer inertia that creates.

Also rather suspect this was 1 rogue application (ESB or similar) spamming corruption further out.

Double suspect that there are a few architects & CIOs waiting for the full BA post-mortem.

16
0

Depends what they mean by "system". I read that as 200 applications, rather than interpreting it as system = device/server.

On your interpretation, I'd completely agree: 200 devices/resources is not at all OTT.

But 200 applications, each with its own reliance on supporting systems and infrastructure, each with its own data flows... I wouldn't fancy detailing that data model (unless it was for a fee, mind).

14
0
Silver badge

200 Applications?

In local government IT, which has a very diverse user base, getting down to only 200 applications would have been beyond my wildest dreams. Counting PC-client-only applications, it was well into four figures.

6
0
Anonymous Coward

Re: 200 Applications?

A few years ago I worked for a far more modern and competent airline than BA and I can assure you that 200 interdependent systems is not unusual.

Running an airline is a very complex business and there are a lot of factors that need to be accounted for. It's not just the airline itself, either; it has to communicate with the systems in the airport too, and the same is true for all of the outstations.

6
0
Anonymous Coward

Only 200?

Surely they have more.

0
0
Anonymous Coward

Larger systems need to continuously evolve to survive

Having "200 systems in your critical path" is indeed a tad worrying. Being a large, distributed multi-national enterprise like BA etc. isn't really an excuse, either. We can drone on and on about "corporate inertia" and saying that having a system built "layer upon layer upon layer" is to blame, making it just too large and unweildy to be manageable or indeed fit for purpose.

Well, yes - I would have to agree.

I think the challenge is how to manage enterprise-level systems getting built layer upon layer and becoming fossilised through corporate inertia, making it hard to (a) evaluate their ongoing operational function in an objective way and (b) engineer an improved system that perhaps replaces entire chunks/layers with more modern, more performant solutions. This is about mapping where the problems are, finding out what the critical chunks are that *must* be improved, and then building a simpler, more maintainable system to perform the task in hand. In short, building a live, functioning system that is under continuous evolution. This goes way beyond continuous delivery.

2
0
Bronze badge

"Depending on the quantity of data to be treated, 200 critical devices might be just right. They are all critical, on the other hand if some of them are going down, the others should have absolutely no problem to hold the load. At least in theory."

But that's the whole point: not that they are critical, but that they are on the critical path, which means that if any one of them goes down, the entire system goes down. These aren't redundant devices - each one is necessary for the whole to work.

I never took a statistics & probability class, but if I'm not mistaken, the chances of failure in this scenario increase exponentially with each device added. (If I am mistaken, I'm sure somebody will cheerfully point it out. I just want it to be the guy who actually took S&P classes and not the guy who thinks the probability of getting 10 heads in a row is 1 in 10.)
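For anyone who wants the arithmetic (with illustrative figures, not BA's): if every one of n systems must be up, the combined availability is the product of the individual availabilities, so it shrinks geometrically with every system added; for small per-system failure rates, the chance of at least one being down grows roughly as n times p before it saturates.

```python
# Rough numbers for a serial critical path: the chain is only up when every
# system in it is up. The per-system availability here is an assumption.
n = 200
per_system = 0.999            # each system up 99.9% of the time
whole_chain = per_system ** n
print(round(whole_chain, 3))  # ~0.819 - roughly a 1-in-5 chance that
                              # something in the path is down at any moment
```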

3
0
Silver badge

"But that's the whole point: not that they are critical, but they are critical path."

Nobody said that the 200 were on the critical path; that's an entire fabrication the author made up in order to justify this article.

It wouldn't surprise me if there were though; the login app, the pilot hours app, the crew roster app, the passenger seat allocation app, the cargo load app, the weight and balance app, the weather app to get wind direction for the fuel prediction app which in turn drives the fuel load app, the emergency divert planning app, the terror checklist app, the crew hotel booking app...
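Purely to illustrate how a chain like that turns into a critical path, here is a toy sketch; the dependency edges between the apps named above are invented for the example, not BA's real ones.

```python
# Toy illustration of "critical path": a few made-up dependencies between the
# kinds of apps listed above, and what stops working when one of them dies.
DEPENDS_ON = {
    "fuel load app":          ["fuel prediction app"],
    "fuel prediction app":    ["weather app", "weight and balance app"],
    "weight and balance app": ["cargo load app", "passenger seat allocation app"],
    "crew roster app":        ["pilot hours app", "crew hotel booking app"],
}

def impacted(failed: str) -> set[str]:
    # Everything that directly or indirectly depends on the failed system.
    hit = {failed}
    changed = True
    while changed:
        changed = False
        for app, deps in DEPENDS_ON.items():
            if app not in hit and hit.intersection(deps):
                hit.add(app)
                changed = True
    return hit - {failed}

print(impacted("weather app"))  # weather takes out fuel prediction and fuel load
```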

3
0
Silver badge

Re: Larger systems need to continuously evolve to survive

"This is about mapping where the problems are, finding out what the critical chunks are that *must* be improved and then building a simpler more maintainable system to perform the task in hand. In short, building a live, functioning system that is under continuous evolution."

This. It's also easier to do as you go along. A good maxim would be to aim for a situation in which the result of each added development is that the system looks as if it were designed that way from the start.

1
0

Antifragile?

It's been a decade or more since NN Taleb wrote 'Fooled by Randomness' and 'Black Swan', and five years since 'Antifragile'. Has anyone applied this philosophy to software systems? Have any CTOs read these works?

6
0
Anonymous Coward

Re: Antifragile?

Would that be Agile?

1
1

Re: Antifragile?

{PHBMode}Chaos Monkey{/PHBMode}

What do you mean knowing those 2 words isn't enough?
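For the record, a dry-run sketch of what those two words actually amount to; the instance names are hypothetical and the terminate call is a stub, not any real orchestration API.

```python
# A "chaos monkey" in miniature: pick one instance at random and (pretend to)
# kill it, so the failover path gets exercised before Fate does it for you.
import random

INSTANCES = ["web-01", "web-02", "booking-db-replica", "crew-roster-app"]

def terminate(instance: str, dry_run: bool = True) -> None:
    if dry_run:
        print(f"[dry-run] would terminate {instance}")
    else:
        raise NotImplementedError("wire this up to your own orchestration API")

victim = random.choice(INSTANCES)
terminate(victim)  # run for real only once the recovery plan has been tested
```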

0
0
Anonymous Coward

Death to beancounters

blah blah textspittle

0
0

Anyone who claims they can deliver five nines of availability, even for discrete components let alone a complex web of hardware and software, is talking out of their arse. Five nines means you can have a maximum 0.864-second outage in any given 24-hour period. Of course you can start saying that the uptime calculation should be done over a week, month or year, but where do you stop - a decade? Uptime stats only have real meaning over short periods.

So, hands up: who, for any amount of money, is going to guarantee less than 0.864 seconds of downtime across DC, comms, hardware and 200 interdependent applications? And how do you even define what counts as "up"?

It's basically all finger in the air stuff.
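The 0.864-second figure above does check out; for reference, here is the same five-nines budget over a few windows (pure arithmetic, nothing BA-specific):

```python
# Allowed downtime at "five nines" (99.999%) over different windows.
availability = 0.99999
seconds = {"day": 86_400, "month (30d)": 2_592_000, "year": 31_536_000}
for window, s in seconds.items():
    print(f"{window}: {s * (1 - availability):.3f} s of allowed downtime")
# day: 0.864 s, month (30d): 25.920 s, year: 315.360 s (about 5.26 minutes)
```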

4
9
Silver badge

This was Feynman's point - the managers were generating the numbers that they wanted, not the numbers they needed - reality is a bitch and it always bites eventually. You can learn a lot by taking Feynman's approach to calculations.

13
0
