BA's 'global IT system failure' was due to 'power surge'

British Airways CEO Alex Cruz has said the root cause of Saturday's London flight-grounding IT systems ambi-cockup was "a power supply issue*" and that the airline has "no evidence of any cyberattack". The airline has cancelled all flights from London's Heathrow and Gatwick amid what BA has confirmed to The Register is a " …


  1. PeterHP

    As a retired computer power & environment specialist: there is no such thing as a "power surge"; there are voltage spikes that can take out power supplies. This sounds like bullshit from someone who either does not know what they are talking about or is being fed a line. In my experience, most CEOs and IT directors did not know how much money the company would lose per hour if the systems went down, or whether it would stay in business at all, and did not consider the IT system important to the operation of the company. I bet they now know a DR system would have been cheaper. But they are not alone: I have seen many airline data centres that run on a wing and a prayer.

    1. vikingivesterled

      In fairness to Cruz, he didn't specify what the power surge (in layman's terms) or voltage spike (in engineer's terms) took out. It could have been something as simple as non-UPS'd air-con power supplies being destroyed, and a lack of environmental alarms going to the right people, leading to overheating before manual intervention. AC can also be notoriously difficult to fix quickly. I have myself used emergency blowers and roll-out ducts to cool an airline's overheating data centre, where the windows were sealed and unopenable to pass pressurised fire-control tests.

  2. ricardian

    Power spikes & surges

    This is something on a much, much smaller scale than the BA fiasco!

    I'm the organist at the kirk on Stronsay, a tiny island in Orkney. Our new electronic organ arrived four years ago and behaved well until about 12 months ago. Since then it has had frequent "seizures" which made it unplayable for about 24 hours, after which it behaved normally. When I spoke to an agent of the manufacturer, he asked if we had many solar panels or wind turbines. I replied that we did and that the number was growing quite quickly. He said that these gadgets create voltage spikes which can affect delicate electronic kit, and recommended a "spike buster" mains socket. Since fitting one of these the organ has behaved perfectly. I suspect that with the growing number of wind turbines and solar panels this sort of problem will become more and more noticeable.

    1. vikingivesterled

      Re: Power spikes & surges

      That would probably only be an issue when there is more power produced than can be consumed, meaning the island needs something that can instantly divert, consume or absorb the overproduced power, like a sizeable battery bank, a water/pool heater or similar. Alternatively, if it is not connected to the national grid, the base generator/device that creates the AC sync is not sufficiently advanced.

      1. ricardian

        Re: Power spikes & surges

        Our island is connected to the grid. Orkney has a power surplus but is unable to export the power because the cable across the Pentland Firth is already operating at full capacity. We do have a "smart grid" https://www.ssepd.co.uk/OrkneySmartGrid/

    2. anonymous boring coward Silver badge

      Re: Power spikes & surges

      Although on a smaller scale, an organ mishap can be very humiliating.

    3. Alan Brown Silver badge

      Re: Power spikes & surges

      "He said that these gadgets create voltage spikes which can affect delicate electronic kit "

      If they can 'affect delicate electronic equipment' then something's not complying with standards and it isn't the incoming power that's out of spec....

      Seriously. The _allowable_ quality of mains power hasn't changed since the 1920s. Brownouts, minor dropouts, massive spikes and superimposed noise are _all_ acceptable. The only thing that isn't allowed is serious deviation from the notional 50Hz supply frequency (60Hz if you're in certain countries).

      This is _why_ we use the ultimate power conditioning system at $orkplace - a flywheel.

      As for your "spike-buster", if things are as bad as you say, you'll probably find the internal filters are dead in 3-6 months with no indication other than the telltale light on it having gone out.

      If your equipment is that touchy (or your power that bad), then use a proper UPS with brownout/spike filtering such as one of these: http://www.apc.com/shop/th/en/products/APC-Smart-UPS-1000VA-LCD-230V/P-SMT1000I

  3. Anonymous Coward

    Power spikes etc.

    The difference between your situation and DCs is that they have the money to invest in a UPS which conditions the power as well.

    Your point is noted, though, as there are many endpoints that don't have the same facility.

  4. Anonymous Coward

    Outsourcery

    Now that the BAU function has been outsourced, the real bills will arrive. All the changes that are now deemed necessary will be chargeable, leading to a massive increase in the IT cost base.

  5. Florida1920
    Holmes

    "fixed by local resources"

    Translation: Two people. One to talk to India on the phone, the other to apply the fixes.

  6. Anonymous Coward

    Something else will crash on Tuesday.

    The share price.

  7. Dodgy Geezer Silver badge

    I see that El Reg is unable....

    ...to get ANY data leaked from the BA IT staff at all.

    One more advantage of outsourcing to a company which does not speak English...

    1. Anonymous Coward

      Re: I see that El Reg is unable....

      Is there an Indian version of El Reg?

  8. GrapeBunch

    Real-time redundancy is why Nature invented DNA.

  9. Milton

    Hands up ... if you believe this for a second

    Sorry, it won't wash. A single point of catastrophic failure, in 2017, for one of the world's biggest airlines, which relies upon a vast real-time IT system? A "power failure"?

    Even BA cannot be that incompetent. Pull the other one.

  10. anonymous boring coward Silver badge

    OK, so something failed. And they didn't have a working automatic failover. I get that. Embarrassing, and the CEO should go just for that reason.

    What I don't get is how it could take so long to fix it? It must have been absolute top priority to fix within the hour, with extra bonuses and pats on backs to the engineers who quickly brought it back up again. How could it take so long?

    1. pleb

      "It must have been absolute top priority to fix within the hour, with extra bonuses and pats on backs to the engineers who quickly brought it back up again. How could it take so long?"

      They had the engineers all lined up ready to execute the timely fix you speak of. Trouble was, they could not get them booked on a flight over from India.

  11. Anonymous South African Coward Bronze badge

    As an interesting aside, how does Lufthansa's IT stack up against BA's IT?

  12. Anonymous Coward

    My 20p/80p's worth

    It's 80 degrees and only 20% of staff are in over the long weekend

    80% of the legacy systems knowledge went when 20% of experienced staff were decommissioned

    80% of the time systems can handle 20% data over capacity

    120% of the UK's power is available from wind and solar, so 80% of coal/nuclear capacity is off-line

    20% cloud cover and wind dropping to 80% cause sudden massive drop in grid capacity…. causing large voltage spikes

    ‘Leccy fall-out agreements briefly swing into action; BA can use UPS + generators to cover this

    DC switches to UPS whilst only 80% of the 20% under-capacity generators spin up successfully - "the power surge"

    80% of current customer accounts lose a critical 20% of their data when the twin systems can't sync.

    An 80% chance this is 20% right or a 20% chance this is 80% right?

  13. Ian P

    The power (with its backups) will never fail so we don't plan for it.

    I guess it is just a blinkered approach. You convince yourself that the power will be fine, and so you ignore the case where there is a failure, hence the chaos when the system that will never fail actually fails. Is it the MD's fault? Yes, for hiring an incompetent IT manager. I'd replace the IT manager once the dust has settled. But are those crucial nuggets of information that he has in his head backed up?

  14. Anonymous Coward

    Brownout to brownpants

    My take: a "power surge" happened in the form of a brownout, probably triggered by IT.

    Simple scenario: support identify a requirement to do a needful update, which is automated via a management tool. The playbook for doing needful updates states the command sequence to execute; this is fat-fingered (or applied incorrectly) and, rather than rolling out progressively across the estate, it applies to everything at once (see the sketch below).

    Servers all start rebooting near-simultaneously, the inrush startup currents promptly overload the PDU/UPS/genset, and many failures follow: some physical hardware, some data corruption. Local support are asked to please revert; sadly it's Saturday, most are not at work, and many no longer work for the company.

    Fat finger brownout thus becomes a brownpants moment.
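
    As a sketch of the "progressive rollout" the scenario above says was skipped (nothing here is BA's or any vendor's actual tooling; host names, batch size and the health check are all invented), this is roughly what a batched, halt-on-failure rollout looks like:

    ```python
    # Generic sketch of a progressive rollout: patch a small batch, check health,
    # and stop at the first sign of trouble instead of hitting the whole estate.
    import random

    HOSTS = [f"srv{i:03d}" for i in range(1, 101)]   # invented host names
    BATCH_SIZE = 5

    def apply_update(host):
        print(f"patching {host}")

    def healthy(host):
        # Stand-in for a real post-patch health check (service up, no reboot loop, ...)
        return random.random() > 0.02

    def progressive_rollout(hosts):
        for i in range(0, len(hosts), BATCH_SIZE):
            batch = hosts[i:i + BATCH_SIZE]
            for host in batch:
                apply_update(host)
            failed = [h for h in batch if not healthy(h)]
            if failed:
                print(f"halting rollout, unhealthy hosts: {failed}")
                return False          # never proceeds to the rest of the estate
            print(f"batch {i // BATCH_SIZE + 1} OK")
        return True

    if __name__ == "__main__":
        progressive_rollout(HOSTS)
    ```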

  15. Anonymous Coward

    IT outsourcing

    I used to work at BA - not in IT, but in an area that worked with IT day in and day out.

    I, and many others, left as the airline and the capabilities that we loved gradually got ruined.

    Most of the IT guys and gals were offered generous packages to leave. Three "global suppliers" were chosen, and all work had to be given to them. They would pitch for it, and the cheapest nearly always won. Unsurprisingly, the good IT people took the packages and the less good stayed (some good stayed also, but not enough).

    SIP / CAP / CAP2 / FICO / FLY etc. are all complex systems, and when experience leaves, the level of support goes down considerably. I think they have probably cut too much; senior management don't have enough knowledge of IT to know when one cut is too many. The IT guys were resistant to change, so the head was cut off the snake, lots of yes-people remained, and this is where we end up (as well as the complete outages of the website recently).

    To say that this is unrelated to the removal of most of the people who knew how these systems worked is disingenuous. Two data centres, with backup power, so I fail to understand how one power surge could take out both of them independently. It sounds more like a failed update/upgrade by inexperienced staff, and then a lack of experienced staff around to fix it.

    Such a shame.

  16. Mvdb

    A summary of what went wrong inside the BA datacenter is here:

    http://up2v.nl/2017/05/29/what-went-wrong-in-british-airways-datacenter/

    I hope BA will make the root cause public so the world can learn from it. However, I doubt BA will actually do this.

  17. 0laf
    Black Helicopters

    From another forum and a friend of a friend that works with BA IT.

    The outsourcer was told to apply security patches, which they did, and then power-cycled the whole datacenter.

    When it came back up, the power spiked and popped many network cards and memory modules.

    The outsourcers lacked experience in initiating the DR plan, and it didn't work. Or maybe DR wasn't in the contract.

    True or not, I dunno.

  18. Anonymous Coward

    Soft target?

    With all the terrorist risks to add to the natural causes and cock-ups that will happen, I find it surprising that the locations of the BA DCs are known. Even some idiot loser can work out that hitting the data centres will somehow have an impact out of all proportion to the cost. That being the case, why doesn't BA have a plan that works?

  19. Mvdb

    Another update to my reconstruction of what went wrong:

    A UPS in one of BA's London datacenters failed for some reason. As a result, systems went down. Power was restored within minutes, however not gradually; as a result, a power surge happened which damaged servers and networking equipment. This left many systems down, including the Enterprise Service Bus.

    http://up2v.nl/2017/05/29/what-went-wrong-in-british-airways-datacenter/

    Big question: why wasn't a failover to the other datacenter initiated?

    1. Anonymous Coward

      Probably

      .. because they estimated the failover would take longer than fixing it on site. Clearly the wrong decision was made. Or perhaps they had no faith in their DR plan.

    2. Snoopy the coward

      Messaging systems failed to sync...

      From what I have read, they did a failover but just couldn't resync again, meaning they don't have a point-in-time recovery capability for their messaging systems. Not sure what messaging systems BA are using, but I know MQ can recover quite easily: it will resend what has failed and will not resend a duplicate.

      But anyway, I think the applications are linked in a very complicated manner and a failover needs to be done in a very strict sequence, or else it will ruin everything, requiring a restore from tape to recover. So the initial failure was the power surge which destroyed some hardware; the failover was initiated but just couldn't continue from where it went down, thus requiring many hours of manual recovery work to get it up again (see the sketch below).
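
      To illustrate the "will not resend a duplicate" point above: a minimal, purely illustrative Python sketch (nothing MQ- or BA-specific; the class, table and message IDs are all invented) of a consumer that survives a post-failover replay by deduplicating on message ID before applying anything.

      ```python
      # Minimal sketch of idempotent message handling after a failover replay.
      # The producer resends anything unacknowledged; the consumer records the
      # IDs it has already applied so a resend is ignored rather than applied twice.
      import sqlite3

      class IdempotentConsumer:
          def __init__(self, db_path="processed.db"):
              # Durable record of processed message IDs (stand-in for a real
              # message store / transaction log).
              self.db = sqlite3.connect(db_path)
              self.db.execute("CREATE TABLE IF NOT EXISTS processed (msg_id TEXT PRIMARY KEY)")

          def handle(self, msg_id, payload):
              if self.db.execute("SELECT 1 FROM processed WHERE msg_id = ?", (msg_id,)).fetchone():
                  return "duplicate ignored"            # replayed after failover: skip
              self.apply(payload)                       # business logic, e.g. update a booking
              self.db.execute("INSERT INTO processed (msg_id) VALUES (?)", (msg_id,))
              self.db.commit()                          # acknowledge only once durably recorded
              return "applied"

          def apply(self, payload):
              print("applying:", payload)

      if __name__ == "__main__":
          c = IdempotentConsumer(":memory:")
          print(c.handle("m1", {"booking": "ABC123"}))  # applied
          print(c.handle("m1", {"booking": "ABC123"}))  # duplicate ignored (resend)
      ```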

  20. Captain Badmouth

    Willie Walsh

    on BBC now, still blaming a failure of electrical power to their IT systems. We know where the problem occurred, he says.

    BBC report that industry experts remain sceptical.

    1. Grunt #1

      Re: Willie Walsh

      Mr White Wa(l)sh.

      "We know what happened but we're still investigating why it happened and that investigation will take some time," he said.

      - We're hoping some other sucker is in the headlines when we publish.

      "The team at British Airways did everything they could in the circumstances to recover the operation as quickly as they could."

      - The recovery they performed was no doubt a fantastic job which pulled BA out of a tailspin at the last minute. The real question is what caused the tailspin.

  21. Anonymous Coward

    There were companies in the WTC on 9/11 with redundant DCs in New Jersey. The backup DC took over, they didn't lose any data, the file system didn't drop buffers on the floor, etc. And it wasn't Windows or UNIX based. The technology is out there but people don't like "old" proprietary systems... except when it saves them money.

  22. Snoopy the coward

    Surge caused by?

    Since there was no surge reported from the supply grid, the surge must have been caused by some heavy equipment; I can only think of the air-conditioning systems. Someone had left all the switches of the computers and cooling systems in the ON position, so when the power resumed, the air-conditioning systems caused a huge power surge, destroying many computer circuit boards. Experienced staff seem to be lacking in the BA datacentre, probably laid off to cut costs.

    1. Anonymous Coward

      Looks like an electrician struck again

      https://www.theguardian.com/business/2017/jun/02/ba-shutdown-caused-by-contractor-who-switched-off-power-reports-claim

      1. Anonymous Coward

        Re: no functioning disaster recovery

        From comments on that Grauniad article:

        "I want to see Willie Walsh get asked about this by a real tech journalist - maybe one from The Register or similar. "Mr Walsh, is it true that your global IT infrastructure has no functioning disaster recovery? If so, how soon do you think this will happen again?""

        No link, I don't want to get the Interwebs into an Nth order binary loop and make matters worse for outsiders than they already are.

        Making matters worse for BA/IAG management (and staff, and passengers) is a job that management are apparently well qualified for.

        1. Anonymous Coward

          Re: no functioning disaster recovery

          Many papers and websites have used The Register as their source. Perhaps BA should send their excuses to The Register to ensure the correct message gets out.

  23. anonymous boring coward Silver badge

    Could someone please explain why power should "spike" when, as the story goes, everything was started at once? In my mind there would be a rush of current leading to a brown-out condition.

    Perhaps it's a "power demand spike" that is being referenced?

    But these statements seem more aimed at conjuring up images of dangerous voltage spikes entering the system and blowing things up, like some episode of Star Trek, or Space: 1999, where CRTs tended to explode.

    After a complete power failure, presumably equipment would need powering up in a controlled manner?

    That must all be part of the specifications for the system, and should happen more or less automatically. It seems unlikely that all systems would power up simultaneously and overwhelm the supplies? (See the sketch below.)
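
    On the controlled power-up point: a minimal, purely illustrative sketch (the rack names, wave size and power_on() stand-in are all invented; real sequencing would live in the facility's procedures or DCIM tooling) of staging equipment back up in small waves so startup demand never lands on the supply all at once:

    ```python
    # Staged power-up sketch: energise dependencies first (storage, network),
    # then compute, in small waves with a settle delay so inrush/startup
    # current decays before the next wave is switched on.
    import time

    RACKS = {
        "storage": ["san-01", "san-02"],
        "network": ["core-sw-01", "core-sw-02"],
        "compute": [f"app-{i:02d}" for i in range(1, 13)],
    }

    WAVE_SIZE = 4        # hosts energised per wave
    SETTLE_SECONDS = 5   # kept short for the demo; in reality minutes per wave

    def power_on(host):
        # Stand-in for whatever actually energises a host (iLO/iDRAC, smart PDU outlet, ...)
        print(f"powering on {host}")

    def staged_power_up():
        for tier in ("storage", "network", "compute"):
            hosts = RACKS[tier]
            for i in range(0, len(hosts), WAVE_SIZE):
                wave = hosts[i:i + WAVE_SIZE]
                for host in wave:
                    power_on(host)
                print(f"{tier}: wave {wave} up, settling {SETTLE_SECONDS}s")
                time.sleep(SETTLE_SECONDS)

    if __name__ == "__main__":
        staged_power_up()
    ```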
