BA's 'global IT system failure' was due to 'power surge'

British Airways CEO Alex Cruz has said the root cause of Saturday's London flight-grounding IT systems ambi-cockup was "a power supply issue*" and that the airline has "no evidence of any cyberattack". The airline has cancelled all flights from London's Heathrow and Gatwick amid what BA has confirmed to The Register is a " …

Anonymous Coward

Re: Cynical Me

I already work at BA.

My job is done, I'll get my coat.

1
0
Silver badge

Re: Back in the day...

I worked in a call centre; the office space contingency was to rent a spare location if our offices ever burnt down (or similar).

While obviously "renting" spare capacity of bespoke systems is not really viable unless everything is "cloud" (read universally deployable apps, which of cause takes a lot of the specialist options away from your software/hardware), can there not be a joint effort by some companies for some spare capacity?

While leaving a second complete duplicate somewhere is very costly, could there not be some similar/VM/smaller systems for at least basic use, shared costs across companies as it's very unlikely all of them go down the same day?

0
0
Anonymous Coward

Re: Back in the day...

"While obviously "renting" spare capacity of bespoke systems is not really viable unless everything is "cloud""

Back in the day when people knew how to do these things right, when cloud was still called "bureau timesharing", there were companies that had their DR kit in a freight container (or maybe several, across several sites) that could be readily deployed to anywhere with the required connectivity. It might be a container+kit they owned or it might be a container+kit rented from a DR specialist or whatever. And they'd know how it worked and how to deploy when it was needed (not "if").

That's probably a couple of decades ago, so it's all largely been forgotten by now, and "containers" means something entirely different.

E.g. here's one that Revlon prepared earlier (article from 2009), but my recollection is that this had already been going on elsewhere for years by that time:

http://www.computerworld.com/article/2524681/business-intelligence/revlon-creates-a-global-it-network-using--mini-me--datacenters.html

0
0
Silver badge

Re: Back in the day...

"shared costs across companies as it's very unlikely all of them go down the same day?"

If this were something on offer and BA had signed up for it, how would all the other corporate clients feel knowing that BA are now using the DR centre, so if anything goes pear-shaped, they're on their own? It's like paying for an insurance policy that won't pay out if somebody else has made a similar claim before you on a given day. Would you buy cheap car insurance with a clause that reads:

"RBS Bastardo Ledswinger Insurance plc pay only one claim each working day across all policies written and in force; If your claim is our second or subsequent of that day it will not be paid, although we will deem the claim fully honoured. No refunds, the noo!"?

2
0

Re: Back in the day...

When you return to work on Tuesday, ask about your DRP and shared resources; you will be amazed how many firms are in this position. It is an accepted risk that reduces costs and is unlikely to materialise. In most cases the major DR suppliers allow you to be locked out for 90 days; it is a first-come, first-served business model.

The most likely scenario for multiple organisations to be hit simultaneously is through cyber attack; you have been warned.

2
0
Silver badge

Re: Cynical Me

"Was there any review in terms of obtaining another genny and/or onsite sparkie during operational hours? No."

In other words, next time it happens the site manager won't bother.

It's amazing how much money there is to fix things AFTER the stable door has been smashed to smithereens.

2
0
Silver badge

Re: Back in the day...

"the office space contingency was to rent a spare location if our offices ever burnt down"

Oh, so you need that much space, that urgently? Sure, I can do that for a 2500% premium.

0
0
Anonymous Coward

Re: Back in the day...

Sungard will happily arrange for this in advance for a small fee.

( I don't work for them, but have used them)

1
0
Anonymous Coward

Penny wise pound foolish

BA: Penny wise pound foolish.

Officially BA tells the world that it was due to a power failure but that 'IT is looking into it', which makes no sense. Power failure ==> a sparky is the go-to, surely.

What may have happened

1. BA offshores IT

2. UK has bank holiday weekend (3 days)

3. Upgrade / new release cockup ==> systems down

4. No plan B.

26
0
Silver badge

Re: Penny wise pound foolish

Any number of things may have happened.

5
0
Anonymous Coward

Re: Penny wise pound foolish

"Any number of things may have happened."

There are a finite number of things that you can think of that may go wrong ...and an infinite number that can go wrong.

1
0

Re: Penny wise pound foolish

What you are suggesting is that BA is lying to the public about the cause of the failure. There's too much chance of being found out; I don't honestly think they would so blatantly lie to hide what happened.

I believe what is likely is that there actually was a power failure, which may or may not have been BA's fault (if it's BA's responsibility to maintain the UPS and they failed to do it, then it is their fault).

Having experienced a genuine power failure to the site, and some kind of failure in the UPS and generator system, I bet BA's remote off-shore India IT support team struggled to bring the systems and applications back.

So they're only telling 1/10th of the real story, enough to make everyone think they are not to blame, when in fact they probably are.

3
1

Re: Penny wise pound foolish

There isn't an infinite number of things that can go wrong; it's finite, but just very large in number.

0
0
Silver badge

Re: TkH11 "Some infinities are bigger than others"

No, it is infinite, though it depends on your mission objectives and your definition of "success". When we frame success it has a defined, boxed-in end result (so even where infinite, it is bounded in the types of infinity involved). Failure is open-ended (we can always add any finite number, or any other type of infinity).

There is one way it can go right, but the ways it can go wrong are infinite, as you can always add one more spanner into the works... forever. ;)

0
0
Anonymous Coward

Re: Penny wise pound foolish

No matter what happens, the responsibility lies with the BA executives. It may not be their fault that it went wrong, but it is their fault for allowing it to destroy the business.

2
0
Anonymous Coward

Re: Penny wise pound foolish

BA is to blame regardless of the cause. The CEO should be given the boot; he's clearly not fit for the job.

Outsourcing IT when IT is the lifeblood of the core business? Oh well, serves them right.

2
0
Anonymous Coward

Re: Penny wise pound foolish

Would they buy aircraft just because they are cheaper without some serious due diligence?

0
0
Silver badge

Re: Penny wise pound foolish

"So they're only telling 1/10th of the real story, enough to make everyone think they are not to blame, when in fact they probably are."

Which is probably why the utility companies are now all stepping forward to categorically deny any kind of power surge anywhere in the country.

1
0

Top up your meter?

I'm reminded of that ad for a company which allows you to top up from your phone, so even if the office BA use is closed for the weekend, they can still get it sorted.

1
0

Still, ElReg finally got the story about 7 hours after the Beeb and everyone else, and have managed to add exactly what insight?

11
17
Anonymous Coward

I have to agree

It was disheartening not to see anything on this for hours on El Reg.

Perhaps they were all on a jolly to Spain for the weekend and got stuck at the airport?

4
3
a_a

You must be new here. ElReg is a light and frothy take on IT news; it's not 24/7 breaking news.

24
3
Silver badge

Re: I have to agree

I don't.

The Beeb is a well funded 24hr 365 day service. El Reg is not.

17
1
Silver badge

Re: I have to agree

"Perhaps they where all on a jolly to Spain for the weekend and got stuck at the airport?"

And the BA employee with the only key to the genny shed in his pocket was on the same flight.

2
1
Anonymous Coward

Re: I have to agree

"The Beeb is a well funded 24hr 365 day service. "

Wasn't there a proposal a while back that the BBC should cut back on 24/7 news resources - as the media moguls didn't like the competition?

3
1
Bronze badge

Re: I have to agree

My take on ElReg's story is that it's a placeholder for comments and a link to contact someone directly for people that do know more.

Looking around the various news sites and places that might know, I don't see much more than what is known about the effects of the outage and the official statements.

The interesting stuff will come in during the week when people are back in the office and go for after work drinks with ex-colleagues :)

8
0
Silver badge

Re: I have to agree

"My take on ElReg's story is that it's a placeholder for comments and a link to contact someone directly for people that do know more."

El Reg is always quiet on the editorial front at the weekend, ever since they stopped their "weekend edition" experiment. The surprise to me was that they managed to get a story out at all and I think you are correct, put something - anything - out there, and hope that the commentards will fill the gap. However few of BA's IT staff are left in the UK, I bet a couple of them read El Reg...

For analysis I'll come back on Tuesday.

For the record, I have to agree with everyone saying "it stinks", because a huge business such as BA (or Capita - yup, it's a bit of a coincidence) shouldn't fall completely over for lack of a few amps at some data centre or other.

What's next? London City Airport's recently fanfared "remote tower"?

M.

8
0
Anonymous Coward

Don't worry your time will come.

Because we are one of the following.

- The smug ones who have been lucky to avoid a total failure.

- The smug ones who have planned and tested for this.

- The smug ones that don't work for BA/Capita/TCS etc..

- The smug ones who are not the ones in BA DCs right now.

1
0
Anonymous Coward

Re: Don't worry your time will come.

In 1999 I was part of a team at a broadcaster dealing with a planned power shutdown to test the UPS and generators, just in case the power went out on New Year's Eve. My role was simple in that I had to shut down anything that had been left on where there wasn't a user at their desk to do it themselves. The non-core systems (i.e. anything but the studios/IT/Engineering kit) would not have power, and it was essential that everything else was off with work saved. My floor was cleared bang on time, and after receiving clearance from all the other floors the fire alarm tannoy was used to announce to anyone left in the building not involved that the power was about to be cut.

Power was cut and then restored two minutes later, which was odd given that I thought we were going to give the generators a bit of a run. So I found the Head of Technology and the Facilities Management Manager looking concerned; they said that the generators had not kicked in. It turned out to be something that was easily fixed, and that in part involved sacking the company contracted to provide maintenance. Questions about their 'regular' testing schedule were raised soon afterwards. We had apparently made assurances that should the power go out at 00:00 on 01/01/2000 we'd still be broadcasting with no interruptions. That was why we were running the test in September, with time still left to fix anything that failed.

2
0

Doesn't add up

One of the things reported on PM on Radio 4 was that, early on, people scanning boarding passes were getting incorrect destinations on the screen. They reported that someone flying to Sweden got three different incorrect destinations when the card was scanned.

It was also reported that, at least initially, BA's phones weren't working at Heathrow, but I would have thought they would have had some local survivability in place if the phones couldn't register with the systems at the data centre, with a backup local breakout to the PSTN.

My best (albeit poor) guess is that it is more likely to be network-related. Faulty router(s)?

5
0
Silver badge

Re: Doesn't add up

It does potentially add up if the root cause was last night's thunderstorms corrupting something, in a manner that wasn't anticipated by whatever monitoring they have in place.

0
0
Bronze badge

Re: Doesn't add up

Faulty network equipment rarely results in wrong destinations when scanning boarding passes; it results in either slow or no connectivity.

The boarding pass issue sounds more like storage, with either a fault (i.e. the power issues or a resulting hardware failure) causing a failover to another site with either stale data or the failover process not working smoothly (i.e. automated scripts firing in the wrong order or manual steps not being run correctly).

i.e. I suspect this is more of an RBS-type issue than a King's College-type failure.
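For what it's worth, here's a minimal sketch (all names and thresholds invented, nothing to do with BA's actual tooling) of the two gates that go missing when a failover runs scripts in the wrong order or against a stale replica: check the standby's freshness first, then run the steps in a fixed dependency order.

```python
# Minimal sketch with invented names: gate a DR failover on replica freshness,
# then run the steps in a fixed order (storage, then database, then apps).
from datetime import datetime, timedelta

MAX_STALENESS = timedelta(minutes=5)  # an assumed RPO, purely illustrative

def replica_is_fresh(last_applied: datetime, now: datetime) -> bool:
    """True if the standby has applied changes recently enough to promote."""
    return now - last_applied <= MAX_STALENESS

def run_failover(steps, last_applied: datetime) -> None:
    """Run ordered failover steps only if the standby isn't serving stale data."""
    now = datetime.now()
    if not replica_is_fresh(last_applied, now):
        raise RuntimeError(f"standby is {now - last_applied} behind; refusing to promote stale data")
    for name, action in steps:
        print(f"running step: {name}")
        action()

if __name__ == "__main__":
    steps = [
        ("promote standby storage", lambda: None),
        ("promote standby database", lambda: None),
        ("repoint application tier", lambda: None),
    ]
    run_failover(steps, last_applied=datetime.now() - timedelta(minutes=2))
```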

3
0
Anonymous Coward

Re: Doesn't add up

This makes it sound like an application messaging-tier failure. I've seen this before, where the WebLogic tier was misconfigured and responses to transactions were routed back to the wrong IP address. An interesting side effect was that a customer would call in and register their details with the IVR system, and their case details would pop up on a call centre agent's PC in one call centre while the phone call routed to a different one. Unfortunately the poor agent who received the phone call then could not access the case. If this has just started happening, I suspect a software upgrade is to blame.
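To make that failure mode concrete, here's a toy sketch (invented route names, not BA's or WebLogic's actual configuration) of how a swapped reply-routing table delivers a back-end response to the wrong site, which is exactly the case-pops-up-on-the-wrong-agent's-PC symptom described above.

```python
# Toy illustration only: a swapped reply-routing table sends responses to the
# wrong consumer, the same shape of failure as the IVR/agent mix-up above.
correct_routes = {"ivr-site-A": "agents-site-A", "ivr-site-B": "agents-site-B"}
broken_routes  = {"ivr-site-A": "agents-site-B", "ivr-site-B": "agents-site-A"}  # after a bad change

def dispatch_response(routes: dict, origin: str, case_details: str) -> str:
    """Deliver a back-end response to the agent pool mapped for the originating IVR."""
    return f"case '{case_details}' delivered to {routes[origin]}"

print(dispatch_response(correct_routes, "ivr-site-A", "caller 123"))  # right site
print(dispatch_response(broken_routes, "ivr-site-A", "caller 123"))   # wrong site: the agent who gets the call can't see the case
```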

1
0
Silver badge

Re: Doesn't add up

Thanks Anon, IP routing of responses/caching problems does sound familiar... Steam and some other consumer stores/websites had such a problem when a reconfig sent the wrong cached pages all around, so customer data got spewed out to the wrong people.

So a power outage, recovered from, did not "recover" as expected?

0
0
Silver badge

Re: Doesn't add up

"So a power outage, recovered from, did not "recover" as expected?"

An educated guess tells me that they have lost a hall or a datacentre, and probably only then found out that vital systems are not fully replicated / key stuff doesn't work without it. Most probably the systems that were DR tested were tested in isolation, without a proper full DR shutdown, and someone overlooked critical dependencies.

Once you are in such a situation and find you would need to redesign your infrastructure to fix gaping design holes, it's usually faster and safer to fix and turn the broken stuff back on.
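If that guess is anywhere near right, the gap is the classic one: nobody checked that every dependency of a "replicated" system is itself present at the DR site. A minimal sketch of that check, with entirely made-up service names:

```python
# Minimal sketch, invented service names: flag anything a service depends on
# (directly or transitively) that isn't actually replicated at the DR site.
dependencies = {
    "check-in": ["booking-db", "baggage-msgbus", "crew-rostering"],
    "booking-db": [],
    "baggage-msgbus": [],
    "crew-rostering": ["hr-directory"],  # the sort of dependency that gets overlooked
    "hr-directory": [],
}
replicated_at_dr = {"check-in", "booking-db", "baggage-msgbus"}

def missing_at_dr(service: str, seen=None) -> set:
    """Return every transitive dependency of `service` not replicated at DR."""
    seen = seen if seen is not None else set()
    missing = set()
    for dep in dependencies.get(service, []):
        if dep in seen:
            continue
        seen.add(dep)
        if dep not in replicated_at_dr:
            missing.add(dep)
        missing |= missing_at_dr(dep, seen)
    return missing

print(missing_at_dr("check-in"))  # {'crew-rostering', 'hr-directory'}: isolated DR tests never surface these
```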

2
1
Anonymous Coward

Re: Doesn't add up

If so, they are not the only ones.

AC for obvious reasons.

0
0
SJG

Operational Failover is incredibly complex

Let's assume that BA have lost a data centre. The process of switching hundreds, maybe thousands, of systems to a secondary site is extraordinarily complex. Assuming that everything has been replicated accurately (probably not), you've also got a variety of recovery points (RPOs) depending on the type and criticality of each system. BA have mainframes, various types of RDBMS, and storage systems that may be extremely difficult to get back to a consistent transaction point.

I know of no companies who routinely switch off an entire data centre to see whether their systems run after failover. Thus BA, and most big companies who find themselves in this position, will likely be running recovery procedures and recovery code that have never been fully tested.

The weak point of any true DR capability is the difficulty of synchronising multiple, independent transactional systems which may have failed at subtly different times.
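As a rough illustration of that last point (the figures are invented), independently replicated systems rarely stop at the same moment: the only mutually consistent restore point for the fleet is the oldest replica, and everything the other systems committed after that has to be reconciled by hand.

```python
# Rough illustration with invented timestamps: each system's standby stops at a
# different point, so the fleet-wide consistent recovery point is the oldest one.
from datetime import datetime

last_replicated = {
    "reservations":   datetime(2017, 5, 27, 9, 58, 12),
    "departure-ctrl": datetime(2017, 5, 27, 9, 59, 47),
    "baggage":        datetime(2017, 5, 27, 9, 55, 3),
    "loyalty":        datetime(2017, 5, 27, 9, 59, 59),
}

consistent_point = min(last_replicated.values())
print(f"fleet-wide consistent recovery point: {consistent_point}")

for system, point in sorted(last_replicated.items()):
    print(f"{system:14s} must reconcile {point - consistent_point} of later transactions")
```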

10
3

Re: Operational Failover is incredibly complex

You've pointed out what might be the problem. I once worked for a (public) organisation in which no-one dared take responsibility for pulling the big switch in case the backup system didn't take over properly. With the almost inevitable result that when a real failure occurred, the backup system didn't take over properly. At least in a planned test they could have warned the relevant people and had a team standing by to get it working again with the minimum of disruption.

20
0

Re: Operational Failover is incredibly complex

Agreed. Plus, N+1 systems are based on educated (sic!) guesses of what loads will be - which they won't be, as systems just grow to fill the capacity provided. So when N+1 becomes N (e.g. you lose a high-density compute rack), everything fails over, and typical VMs then restart with more resources than they had when they were running normally, because the hypervisor had nicked all the empty RAM/unused CPU cycles. So now nothing fits, and everything starts thrashing and crashing domino-style...

At some point you'll be looking to turn the whole thing off and on again - assuming you've documented that process properly, and nothing has been corrupted, ...
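A back-of-the-envelope version of that failure, with invented numbers: once the surviving hosts have to absorb the failed hardware's VMs at their full reservations, the sums stop adding up and the thrashing starts.

```python
# Back-of-the-envelope sketch, invented figures: N+1 capacity versus the demand
# that actually lands on the surviving hosts after one of them drops out.
HOSTS = 5                      # N+1 design: five hosts, sized so four can "just" cope
RAM_PER_HOST_GB = 512
VMS = 40
RAM_RESERVED_PER_VM_GB = 56    # what each VM restarts with once ballooning/page-sharing is gone

degraded_capacity = (HOSTS - 1) * RAM_PER_HOST_GB
demand_after_failover = VMS * RAM_RESERVED_PER_VM_GB

print(f"capacity at N  : {degraded_capacity} GB")
print(f"restart demand : {demand_after_failover} GB")
print("fits" if demand_after_failover <= degraded_capacity
      else "does not fit: everything starts swapping and thrashing domino-style")
```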

2
0

Re: Operational Failover is incredibly complex

I know plenty of companies that routinely switch DCs every night to avoid this sort of monumental cock-up.

13
0
Bronze badge

Re: Operational Failover is incredibly complex

I've worked in numerous places (public and private sector) where the DCs have had to be powered down for five-yearly electrical testing. It's a complete power-down: all systems off, AC off, UPS off, gennies isolated, etc. It's a pain to manage, and eerie walking through silent data halls slowly warming to ambient temp, with the constant worry of what won't come back up.

Full DC power-downs are not a rare event.

13
0

Re: Operational Failover is incredibly complex

Having assisted in a terrifically minor way in helping develop and test such a system for a client, I can vouch for this. It took a team of 15 more than six months of work to get that system up and tested for failover, and they were relatively small (think AS400 + 200-ish VMs) and already had the DR environment built out when we became involved.

Also, we have no evidence with which to judge BA beyond their own words that this is related to a power outage.

4
0
Bronze badge

Re: Operational Failover is incredibly complex

Plenty of companies do test DR or equivalent in production every six months or so. At least those who take it seriously. Any company not doing this is accepting the risk of not being able to run after a serious problem. A CIO without that live testing on the list of required operating costs is immediately culpable.

5
0
Silver badge

Re: Operational Failover is incredibly complex

"assuming you've documented that process properly"

And assuming you didn't go paperless, in which case the whole documentation would be on one of the servers that's not working.

5
1
Anonymous Coward

Re: Operational Failover is incredibly complex

At least in a planned test you can decide when the 'failure' occurs, i.e. not during peak processing time!

2
0
Silver badge

Re: Operational Failover is incredibly complex

Those companies probably don't have a mishmash of legacy systems, some decades old, and complicated links to other service providers and their networks. That said, I intuit - possibly wrongly - that a mishmash of legacy systems would be less likely to fail completely, because different chunks of it would have been originally designed as standalone, or at least much less interdependent. (Anyone care to wield the cluestick with actual data or proper research on whether that's the case?)

It's interesting too that quite a few of these sorts of mega-outages hit industries that were some of the first to computerise in the 60s and 70s -- air travel and retail banking. What other sectors would fit that category and are also high volume / mass market infrastructural systems, I wonder?

(looks uneasily at all those ageing nuclear stations built on coastlines before they'd discovered the Storegga Slide...)

0
0
Anonymous Coward

Re: Operational Failover is incredibly complex

It's worse than a mishmash. It'll be set up like a bank. Sure you've got your all-singing, all-dancing modern apps-and-web front end for the masses, but under the hood there will be the same core monolithic mainframe actually doing the real work, substantially unchanged for 30 years. This is translated into modern representations with layers upon layers upon layers of interfaces. Take out any one of those layers and it's anyone's guess what the outcome will be. That's why DR for these systems is particularly tricky.

4
0
Silver badge

Re: Operational Failover is incredibly complex

"...not during peak processing time!"

Ah, yes! The happy memories of faffing about switching things on and off at 2 o'clock in the morning to make sure that the DR was properly set up.

Unsociable but richly rewarding!

3
0
Anonymous Coward

Re: Operational Failover is incredibly complex

"I know of no companies who routinely switch off an entire data centre to see whether their systems run after failover."

I do.

3
0
Silver badge

Re: Operational Failover is incredibly complex

I forget the company (and my Google-fu is off today), but it may have been Google or Valve's Steam: they had a fire that took out a DC, and the system just carried on normally.

So it can be done well and right.

1
0
