BA IT systems failure: Uninterruptible Power Supply was interrupted

An IT bod from a data centre consultancy has been fingered as the person responsible for killing wannabe budget airline British Airways' Boadicea House data centre – and an explanation has emerged as to what killed the DC. Earlier this week Alex Cruz, BA's chief exec, said a major "power surge" at 0930 on Saturday 27 May …

Anonymous Coward

If it got interrupted...

Then it's not a UPS. It's a DHL. Dumbass High-risk Liability.

22
2
Reply
Silver badge
Holmes

Re: If it got interrupted...

First we know everything was slowing down. Maybe they decided they couldn't fix it live and wanted to force a failover.

The Times suggests a big red button was pressed in the data centre by a contractor and the power went down. That might be when BA claimed there was a power failure.

That would be the point when the failover failed. Perhaps that is why the CEO said something about there being millions of messages, although he seems to have stopped saying that now, maybe because it suggests there's something wrong with their IT.

Then I guess they tried to bring the data centre back up, and it looked like the bridge of the Enterprise, shaking about, staff falling to the floor, and smoke everywhere. That would be the power surge.

I wonder how long it was since power and switching to secondary or backup data centres were tested.

21
3
Reply
FAIL

Re: If it got interrupted...

The issue is: what was someone doing in the DC, playing with buttons they should not have had access to?

If your IT workforce is all in house then you don't get contractors wandering around unsupervised.

33
5
Reply
Anonymous Coward

Re: If it got interrupted...

Electrical installation is rarely done in-house and is quite a specialised task. You'd have to be special to cock this up.

I can see it now.

Electrician's apprentice - Whoops, why has it gone quiet?

Foreman - Quick, restart everything and get out before anyone notices.

35
0
Reply
Anonymous Coward

Re: If it got interrupted...

IT staff rarely go near the electrical stuff, it's far too dangerous for that.

16
1
Reply
pdh

Re: If it got interrupted...

> I wonder how long it was since power and switching to secondary or backup data centres were tested.

I think it's been about a week.

71
0
Reply
Anonymous Coward

Re: If it got interrupted...

Somebody please ask BA if they do this.

0
0
Reply
Silver badge

Re: If it got interrupted...

Who said he shouldn't have had access to the buttons? He's the electrician.

I bet he was called in and told to pull the plug as a consequence of the system grinding slowly to a halt yet not switching over to secondary. It can't be a coincidence.

5
3
Reply
Holmes

Re: If it got interrupted...

At one facility I worked at & I don't have the full story of why......

The offshore support insisted on the former plant Sysadmin hitting the plant's BRB. Pictures were sent to the remote guy via email, he confirmed that was the button he wanted to be pushed & goodness gracious me it was going to be pushed. He was advised again of what it was that would be pushed & the consequences, and the plant manager was dutifully informed of what was required, what the offshore wanted & what would be the fallout.

& so it came to pass that the BRB was pushed on the word of the Technically Competent Support representative.

(Paraphrasing here........)

"Goodness gracious me, Why your plant disappearing from network?"

"Because the BRB you insisted that was the button you wanted pushing, despite my telling you that it was never to be pushed under pain of death has just shut down the entire plant."

I think it took until about 15 minutes before production was due to commence the following day to get everything back up & running.

30
2
Reply

Re: If it got interrupted...

"IT staff rarely go near the electrical stuff, it's far too dangerous for that."

As a significant percentage of BOFH plotlines have taught us ; )

19
0
Reply
Silver badge

Re: If it got interrupted...

...then a BOFH and PFY will soon get (yet another) new Boss.

1
0
Reply
Anonymous Coward

Re: If it got interrupted...

My guess about the initial failure: ATS left in bypass or failed on power transfer. 15 mins for someone authorised to manually switch the ATS to a good power source or bypass the failed ATS. UPS/generators not specified (or too much new kit added to the DC) for the full startup load of the whole data center, which then failed again.

Some systems probably started up and began a re-sync then the high load crapped out the generators returning everything to silence again, leaving the replication in an unknown state when systems were manually restarted over a longer period to manage the initial load.
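The staged restart described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `plan_staged_startup`, the load figures, and the `inrush_factor` of 1.5 are all hypothetical, and nothing here reflects how BA's site was actually sized. The idea is that systems already running draw their steady load, while a stage being switched on briefly draws more, so each stage is closed as soon as adding another system would exceed generator capacity.

```python
def plan_staged_startup(loads_kw, capacity_kw, inrush_factor=1.5):
    """Group systems into start-up stages.

    loads_kw: list of (name, steady_kw) pairs, in the desired start order.
    capacity_kw: what the UPS/generators can supply (assumed constant).
    inrush_factor: assumed ratio of start-up draw to steady draw.

    Constraint per stage: steady load of earlier stages plus the inrush
    of everything in the current stage must not exceed capacity_kw.
    """
    stages = []
    settled_kw = 0.0  # steady load of stages that are already up
    stage, stage_steady, stage_inrush = [], 0.0, 0.0

    for name, steady_kw in loads_kw:
        inrush_kw = steady_kw * inrush_factor
        if settled_kw + stage_inrush + inrush_kw > capacity_kw:
            if stage:
                # Close the current stage and let it settle to steady load.
                stages.append(stage)
                settled_kw += stage_steady
                stage, stage_steady, stage_inrush = [], 0.0, 0.0
            if settled_kw + inrush_kw > capacity_kw:
                raise ValueError(f"{name} cannot be started within capacity")
        stage.append(name)
        stage_steady += steady_kw
        stage_inrush += inrush_kw

    if stage:
        stages.append(stage)
    return stages


if __name__ == "__main__":
    example = [("db", 40), ("app", 30), ("web", 20)]  # made-up figures
    print(plan_staged_startup(example, capacity_kw=100))
```

Starting everything at once is the single-stage case, which only works when capacity comfortably exceeds the combined inrush.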

12
0
Reply
Silver badge

Re: If it got interrupted...

IT staff rarely go near the electrical stuff, it's far too dangerous for that.

Er, IT staff work on things that are run off the very same electrical stuff. I do hope you are not implying that data centre grade equipment is too dangerous?

Having said that, even a complete Muppet can hurt themselves with nothing more than a mildly sharp stick or an LR44 button cell battery that looks like a sweetie. Like everything, it's all down to training and understanding the job and its risks. Take a look on YouTube for the chaps that work on live 500 kV power lines, or the guys that maintain the bulb at the top of radio towers.

Everyone should have seen all the warning signs on the way into the facility (that tick the boxes in the H&S assessment) and had the prerequisite training about safe escape routes if gas discharge occurs (no, not that gas, the other one); the presence of 3 phase power; the presence of UPS power; various classes of laser optics; automated equipment such as tape libraries that can move without warning; and of course the data centre troll who's not been seen for a couple of weeks now. Oh, and of course, the ear defenders due to the noise, plus the phone that you can't hear as it's too noisy.

My point is that data centres are no worse than any other environment - like maintaining a car engine or running a mower in your garden.

8
7
Reply
Anonymous Coward

Re: If it got interrupted...

Oh, that takes me back some years. New standby generator went in on Friday night/Saturday morning. We're on site, shut down everything, new generator in place, all well, so stuff gets brought back up. 9:30 am, apprentice sparky cut a wire that caused the whole thing to shut down. I got a call from security, arrived at 9:45 am (I stay close) and the place was like the Mary Celeste. Open cans of diesel for the generator, warm cups of tea and not a bloody person on site.

14
0
Reply
Silver badge

Re: If it got interrupted...

"The Times suggests a big red button"

These exist in many data centers. But they are not intended for normal, sequenced shutdowns or to initiate failover to backups. They are usually placed near the exits and intended to be hit in the event of a serious problem like a fire. They trip off all sources _Right_Now_ and don't allow time for software to complete backups or mirroring functions.

*Usually for events that dictate personnel get out immediately.

16
0
Reply
Silver badge
FAIL

Re: If it got interrupted...

I bet he was called in and told to pull the plug as a consequence of the system grinding slowly to a halt yet not switching over to secondary.

If you really want to force a failover that way, you do so by shutting down the small number of systems that would cause the monitoring system to detect a "critical services in DC1 down, let's switch to DC2". If you can't log in to those systems because of system or network load you connect to their ILO/DRAC/whatever, which is on a separate network, and just kill those machines. If the monitoring system itself has gone gaga because of the problems, you restart that, then pull the rug out from under those essential systems. Or you cut connectivity between DC1 and the outside world (including DC2), triggering DC2 to become live, because that would be a failure mode that the failover should be able to cope with.

You. Do. Not. Push. The Big. Red. Button. To. Do. So.

Ever.
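For what it's worth, the "take down the few systems the monitor watches" approach above boils down to a decision rule like the following. A minimal sketch only: the names `ServiceStatus`, `should_fail_over` and `active_site`, and the `max_unhealthy` threshold, are hypothetical and not taken from any real monitoring product or from BA's setup.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    name: str
    healthy: bool


def should_fail_over(critical, max_unhealthy=0):
    """True when more critical DC1 services are down than we tolerate."""
    down = sum(1 for service in critical if not service.healthy)
    return down > max_unhealthy


def active_site(critical_dc1, dc1_reachable):
    """Pick the live site.

    DC2 goes live if DC1 is cut off from the outside world, or if the
    monitor sees critical services in DC1 down. Both are failure modes
    the failover is supposed to cope with, so either can be induced on
    purpose without touching the power.
    """
    if not dc1_reachable or should_fail_over(critical_dc1):
        return "DC2"
    return "DC1"


if __name__ == "__main__":
    services = [ServiceStatus("db", True), ServiceStatus("mq", False)]
    print(active_site(services, dc1_reachable=True))
```

Killing those machines via their out-of-band management interface just flips `healthy` to false for the monitor; replication and everything else in the hall keeps its power.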

27
0
Reply
Anonymous Coward

Re: If it got interrupted...

My last employer had the big red shutdown button conveniently located next to the exit door. Unfortunately just in the position that the door open button would be. One lunchtime a visiting engineer carrying a few boxes of spare parts accidentally pressed it trying to open the door...

20
1
Reply
Silver badge

Re: If it got interrupted...

"No determination has been made yet regarding the cause of this incident. Any speculation to the contrary is not founded in fact."

Which would of course not be a problem for the Daily Mail. The Paul Nuttall of the newspaper world.

10
3
Reply
Silver badge

Re: If it got interrupted...

"One lunchtime a visiting engineer carrying a few boxes of spare parts accidentally pressed it trying to open the door..."

They usually have a plastic cover. And a large label....

3
1
Reply
Silver badge

Re: If it got interrupted...

"playing with buttons they should not have had access to."

EPO buttons are easily accessible. That's the whole point of them as emergency safety feature. Usually near the door in each DC hall...

1
0
Reply
Silver badge

Re: If it got interrupted...

Maybe Emergency Power Off resembles the Indian for Light Switch?

1
10
Reply
Silver badge

Re: If it got interrupted...

You. Do. Not. Push. The Big. Red. Button. To. Do. So.

Ever.

I am Groot?

15
0
Reply

Re: If it got interrupted...

@TheVogon

One installation I worked in, learned the plastic cover and label thingy the hard way (the hapless third party support techie who pushed the BRB instead of the door opener was banned permanently from the site to boot)

2
0
Reply
Silver badge
Facepalm

Re: If it got interrupted...

Usually near the door in each DC hall...

But not so near that they can be mistaken for a door opener button by the dimmest of dimwits. At chest/shoulder height and at least a few steps away from the door appears to me the most sensible location.

That said, I've seen a visitor who shouldn't have had access to the computer room in the first place look around, totally fail to see the conveniently located, hip-height blue button at least as large as a BRB next to the exit door, and kill the computer room because a Big Red Button high up the wall and well away from the exit is obviously the one to push to open the door for you.

Unfortunately, tar, feathers and railroad rails are not common inventory items in today's business environment; rackmount rails are too short and flimsy for carrying a person.

7
0
Reply

Re: If it got interrupted...

In my office the button to open the exit door is right next to the Fire Alarm button (which has no guard). There are also light switches and other visual clutter nearby. At the end of a long tiring day I've sometimes come close to pressing the wrong button.

7
0
Reply
Silver badge

Re: If it got interrupted...

The electrical generators are almost certainly three-phase generators ... the trick here is connecting the three phases in the right order. I saw a generator test years ago in Oxford fail on the initial installation test after the phases were connected incorrectly. The generator spun up, and as the power switched over there was one heck of a bang and a lot of smoke ... and no more electricity.

6
0
Reply

Re: If it got interrupted...

The first thing I would ask for would be the fail over test schedule and the resulting reports on how they went (if they did any).

5
0
Reply
Facepalm

A low availability cluster

My bet would be that a cluster failover was initiated by the power failure, then fail-back manually triggered, but the primary site failing again with power surge starting the secondary systems. With a manual failback an engineer would be needed to failover again and not just a bargain basement operator.

1
1
Reply

Re: If it got interrupted...

I think the best solution to prevent accidental use is to have 2 big red buttons. And to require both to be simultaneously pushed to trigger the power shutdown.

In fact I saw a UPS product having exactly that feature, two EPO buttons that you needed to push simultaneously to shut it down.
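The two-button idea is easy to state as logic. A rough sketch, assuming the controller timestamps each press: the function name `epo_triggered` and the half-second window are made up for illustration and are not taken from the UPS product mentioned.

```python
def epo_triggered(press_a_s, press_b_s, window_s=0.5):
    """Two-button emergency power off interlock.

    press_a_s / press_b_s: time in seconds at which each button was
    pressed, or None if it was not pressed at all.
    The shutdown fires only if both buttons were pressed within
    window_s seconds of each other (an assumed tolerance).
    """
    if press_a_s is None or press_b_s is None:
        return False
    return abs(press_a_s - press_b_s) <= window_s


if __name__ == "__main__":
    print(epo_triggered(10.0, 10.2))   # both pressed together
    print(epo_triggered(10.0, None))   # elbow on one button only
```

A single stray press, or two presses seconds apart, does nothing; a deliberate two-handed push still cuts power immediately.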

2
0
Reply

Re: If it got interrupted...

Fine in principle, but that assumes it is a planned event, it's an EPO for a reason.

Better to have someone knowledgeable watching over the contractor.

0
0
Reply
Anonymous Coward

Re: If it got interrupted...

In my day, one of the first things pointed out to me was the BRB and the circumstances under which it can be used without being fired.

1
0
Reply
Anonymous Coward

Re: If it got interrupted...

Point taken.

The heavy electrical stuff and switch rooms should be kept under lock and key at all times. Every DC I have been in the techies were not allowed near these. They were expected to be familiar with all the safety items you mention.

0
0
Reply

Re: If it got interrupted...

> "I saw a generator test years ago in Oxford fail"

Yeah... I was told of a similar incident, but in a power station. When the station is powered on it needs to sync to the grid before linking as it is vital that not only the frequency matches exactly, but also the phase. The traditional way to do this was with a dial showing the phase-error. Apparently when the plant was down for maintenance they also had cleaners in to give the control room a going over. One of these cleaners discovered that it was possible to unscrew the glass fronts of the dials to clean the glass. In the process they knocked off the needle, and replaced it... 180 degrees out of phase. When the power station was brought back online the generators apparently detached themselves from the floor... with considerable (i.e. demolition-grade) force!
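To put a number on why phase matters, here is a textbook simplification (assuming the generator and the grid have the same amplitude $V$ and frequency $\omega$, and differ only by a phase error $\delta$). The voltage across the open breaker just before it closes is the difference of the two waveforms:

```latex
\begin{align*}
v_{\text{grid}}(t) &= V \sin(\omega t), \qquad
v_{\text{gen}}(t) = V \sin(\omega t + \delta) \\
v_{\text{gen}}(t) - v_{\text{grid}}(t)
  &= 2V \sin\!\left(\tfrac{\delta}{2}\right) \cos\!\left(\omega t + \tfrac{\delta}{2}\right)
\end{align*}
```

So the amplitude across the breaker is $2V\,\lvert\sin(\delta/2)\rvert$: zero when the needle honestly reads zero, and the full $2V$ when the needle has been put back 180 degrees out. Closing the breaker at that point is about the worst case possible.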

5
0
Reply

Re: If it got interrupted...

Having two buttons doesn't mean a single person can't operate them. You can place the two buttons close enough for that. But it ensures the person operating them really knows what they're doing, and isn't randomly pressing buttons.

Of course, if the purpose of having such buttons is to allow even untrained people to shut everything down in case of emergency, it would complicate things. But a large warning message, for example "In case of fire, you need to press these two buttons at the same time!" should take care of that as well.

1
0
Reply
Silver badge

Re: If it got interrupted...

Similar to where I worked, except the backup power gen came on, picked up the load and then promptly died. Seems there wasn't much fuel in the tank. Maintenance guy was fingered for it as it was in his job description to fuel the generator and keep it topped off. The lesson was that firing off the generator once a week for 10 minutes to test uses fuel... duh!!!!!!

2
0
Reply
Anonymous Coward

Re: If it got interrupted...

The real answer is always train people before letting them in.

1
1
Reply
Anonymous Coward

Re: If it got interrupted...

@Grunt

But then you need another contractor to do the knowledgeable one's job

1
0
Reply
Silver badge

Re: If it got interrupted...

They cost extra.

1
0
Reply
Silver badge
Thumb Up

wannabe budget airline British Airways

c'mon guys, that's just ... cheap

17
5
Reply
Joke

Not as cheap as EasyJet.

15
0
Reply
Silver badge

"wannabe budget airline British Airways

c'mon guys, that's just ... cheap"

Sorry? I thought budget airlines were supposed to be cheap?

Perhaps you mean cruel.

6
0
Reply
Mushroom

Tech Support: "Have you tried unplugging it and plugging it back in again"

22
0
Reply
Silver badge

Yeah but at least that normally actually works!

1
0
Reply
Headmaster

... not due to outsourcing...

If you don't know why it happened, then I doubt you know it was not due to outsourcing.

37
2
Reply

Re: ... not due to outsourcing...

That's false logic. I personally don't know why the Russians didn't send a man to the moon but I know for definite that they didn't. Now fuck off.

8
34
Reply
Silver badge

Re: ... not due to outsourcing...

Or they did but covered it up .....

13
0
Reply
LDS
Silver badge

't know why the Russians didn't send a man'

Oh well, they were pretty secretive back then, hiding whole cities from maps, and restricting access harshly. Just as BA would like to be today.

Anyway it was because their Moon rockets couldn't be launched without failing quickly.

2
2
Reply
Anonymous Coward

Re: ... not due to outsourcing...

Sure, the outage itself may not have been due to outsourcing.

The extreme time needed to get things running though... that has outsourcing written all over it.

18
0
Reply
Anonymous Coward

Re: ... not due to outsourcing...

You may be right, but you need to be nice.

1
0
Reply
Anonymous Coward

If the people that manage the servers are from TCR and they were unable to recover from the power failure in a reasonable amount of time then I deduce that they are at fault. Maybe not for the initial outage but the subsequent problems. They would also be responsible for the disaster recovery procedures so the fact it all failed in the first place also lies with them.

26
0
Reply


Biting the hand that feeds IT © 1998–2018