If it got interrupted...
Then it's not a UPS. It's a DHL. Dumbass High-risk Liability.
An IT bod from a data centre consultancy has been fingered as the person responsible for killing wannabe budget airline British Airways' Boadicea House data centre – and an explanation has emerged as to what killed the DC. Earlier this week Alex Cruz, BA's chief exec, said a major "power surge" at 0930 on Saturday 27 May …
First, we know everything was slowing down. Maybe they decided they couldn't fix it live and wanted to force a failover.
The Times suggests a big red button was pressed in the data centre by a contractor and the power went down. That might be when BA claimed there was a power failure.
That would be the point when the failover failed. Perhaps that is why the CEO said something about there being millions of messages, although he seems to have stopped saying that now, maybe because it suggests there's something wrong with their IT.
Then I guess they tried to bring the data centre back up, and it looked like the bridge of the Enterprise, shaking about, staff falling to the floor, and smoke everywhere. That would be the power surge.
I wonder how long it was since power and switching to secondary or backup data centres were tested.
Electrical installation is rarely done in-house and is quite a specialised task. You'd have to be special to cock this up.
I can see it now.
Electrician's apprentice - Whoops, why has it gone quiet?
Foreman - Quick, restart everything and get out before anyone notices.
Oh, that takes me back some years. New standby generator went in on Friday night/Saturday morning. We're on site, shut down everything, new generator in place, all well, so stuff gets brought back up. 9:30 am, an apprentice sparky cut a wire that caused the whole thing to shut down. I got a call from security, arrived at 9:45 am (I stay close) and the place was like the Marie Celeste. Open cans of diesel for the generator, warm cups of tea and not a bloody person on site.
Similar to where I worked, except the backup power gen came on, picked up the load and then promptly died. Seems there wasn't much fuel in the tank. The maintenance guy was fingered for it, as it was in his job description to fuel the generator and keep it topped off. The lesson was that firing up the generator once a week for 10 minutes to test it uses fuel... duh!!!!!!
The electrical generators are almost certainly three-phase generators ... the trick here is connecting the three phases in the right order. I saw a generator test years ago in Oxford fail on the initial installation test after the phases were connected incorrectly. The generator spun up, and as the power switched over there was one heck of a bang and a lot of smoke ... and no more electricity.
> "I saw a generator test years ago in Oxford fail"
Yeah... I was told of a similar incident, but in a power station. When the station is powered on it needs to sync to the grid before linking as it is vital that not only the frequency matches exactly, but also the phase. The traditional way to do this was with a dial showing the phase-error. Apparently when the plant was down for maintenance they also had cleaners in to give the control room a going over. One of these cleaners discovered that it was possible to unscrew the glass fronts of the dials to clean the glass. In the process they knocked off the needle, and replaced it... 180 degrees out of phase. When the power station was brought back online the generators apparently detached themselves from the floor... with considerable (i.e. demolition-grade) force!
IT staff rarely go near the electrical stuff, it's far too dangerous for that.
Er, IT staff work on things that are run off the very same electrical stuff. I do hope you are not implying that data centre grade equipment is too dangerous?
Having said that, even a complete Muppet can hurt themselves with nothing more than a mildly sharp stick or an LR44 button cell battery that looks like a sweetie. Like everything, it's all down to training and understanding the job and its risks. Take a look on YouTube at the chaps who work on live 500 kV power lines, or the guys who maintain the bulb at the top of radio towers.
Everyone should have seen all the warning signs on the way into the facility (the ones that tick the boxes in the H&S assessment) and had the prerequisite training about safe escape routes if gas discharge occurs (no, not that gas, the other one), the presence of three-phase power, the presence of UPS power, various classes of laser optics, automated equipment such as tape libraries that can move without warning, and of course the data centre troll who's not been seen for a couple of weeks now. Oh, and of course the ear defenders due to the noise, plus the phone that you can't hear because it's too noisy.
My point is that data centres are no worse than any other environment - like maintaining a car engine or running a mower in your garden.
I bet he was called in and told to pull the plug as a consequence of the system grinding slowly to a halt yet not switching over to secondary.
If you really want to force a failover that way, you do so by shutting down the small number of systems that would cause the monitoring system to detect "critical services in DC1 down, let's switch to DC2". If you can't log in to those systems because of system or network load, you connect to their ILO/DRAC/whatever, which is on a separate network, and just kill those machines. If the monitoring system itself has gone gaga because of the problems, you restart that, then pull the rug out from under those essential systems. Or you cut connectivity between DC1 and the outside world (including DC2), triggering DC2 to become live, because that would be a failure mode that the failover should be able to cope with.
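For what it's worth, a minimal sketch of that "kill only the critical hosts, out of band" approach might look something like the Python below, assuming IPMI-capable BMCs on a separate management network; the hostnames, credentials and the list of "critical" boxes are entirely made up, and any real runbook and tooling would obviously differ.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: force a failover by powering off only the hosts whose
loss makes monitoring declare "critical services in DC1 down", via their
out-of-band BMCs (IPMI), rather than hitting the site EPO.
Hostnames and credentials below are placeholders."""
import subprocess

# Hosts whose loss should trip the monitoring into promoting DC2 (made up)
CRITICAL_DC1_BMCS = ["db-primary-bmc.dc1.example", "mq-core-bmc.dc1.example"]

def power_off_via_bmc(bmc_host: str) -> None:
    # ipmitool talks to the BMC over the management network, so this still
    # works when the host itself is unreachable because of load.
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host,
         "-U", "opsuser", "-P", "REDACTED",
         "chassis", "power", "off"],
        check=True,
    )

if __name__ == "__main__":
    for bmc in CRITICAL_DC1_BMCS:
        print(f"Powering off host behind {bmc} ...")
        power_off_via_bmc(bmc)
    print("Now let monitoring notice and promote DC2.")
```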
You. Do. Not. Push. The. Big. Red. Button. To. Do. So.
At one facility I worked at & I don't have the full story of why......
The offshore support insisted on the former plant Sysadmin hitting the plant's BRB. Pictures were sent to the remote guy via email, he confirmed that was the button he wanted pushed & goodness gracious me it was going to be pushed. He was advised again of what it was that would be pushed & the consequences, the plant manager dutifully informed of what was required, what the offshore wanted & what the fallout would be.
& so it came to pass that the BRB was pushed on the word of the Technically Competent Support representative.
"Goodness gracious me, Why your plant disappearing from network?"
"Because the BRB you insisted that was the button you wanted pushing, despite my telling you that it was never to be pushed under pain of death has just shut down the entire plant."
I think it took until about 15 minutes before production was due to commence the following day to get everything back up & running.
Usually near the door in each DC hall...
But not so near that they can be mistaken for a door opener button by the dimmest of dimwits. At chest/shoulder height and at least a few steps away from the door appears to me the most sensible location.
That said, I've seen a visitor who shouldn't have had access to the computer room in the first place look around, totally fail to see the conveniently located, hip-height blue button at least as large as a BRB next to the exit door, and kill the computer room, because a Big Red Button high up the wall and well away from the exit is obviously the one to push to open the door for you.
Unfortunately, tar, feathers and railroad rails are not common inventory items in today's business environment; rackmount rails are too short and flimsy for carrying a person.
"The Times suggests a big red button"
These exist in many data centers. But they are not intended for normal, sequenced shutdowns or to initiate failover to backups. They are usually placed near the exits and intended to be hit in the event of a serious problem like a fire. They trip off all sources _Right_Now_ and don't allow time for software to complete backups or mirroring functions.
*Usually for events that dictate personnel get out immediately.
My last employer had the big red shutdown button conveniently located next to the exit door. Unfortunately just in the position that the door open button would be. One lunchtime a visiting engineer carrying a few boxes of spare parts accidentally pressed it trying to open the door...
I think the best solution to prevent accidental use is to have 2 big red buttons. And to require both to be simultaneously pushed to trigger the power shutdown.
In fact I saw a UPS product with exactly that feature: two EPO buttons that you needed to push simultaneously to shut it down.
Having two buttons doesn't mean a single person can't operate them. You can place the two buttons close enough for that. But it ensures the person operating them really knows what they're doing, and isn't randomly pressing buttons.
Of course, if the purpose of having such buttons is to allow even untrained people to shut everything down in case of emergency, it would complicate things. But a large warning message, for example "In case of fire, you need to press these two buttons at the same time!" should take care of that as well.
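As a toy illustration (not any particular vendor's EPO logic), the interlock could be as simple as: arm on the first button, and only fire if the second is seen within a short window, otherwise treat it as an accidental knock. The button names and the read_button() hook below are placeholders.

```python
import time

# Toy two-button EPO interlock sketch: the shutdown only fires if both
# buttons are pressed within a short window of each other. read_button()
# is a placeholder for whatever the real EPO controller exposes.
SIMULTANEITY_WINDOW_S = 0.5

def read_button(name: str) -> bool:
    """Placeholder: return True while the named button is held down."""
    raise NotImplementedError

def epo_should_fire() -> bool:
    # Wait for either button to go down first...
    while not (read_button("A") or read_button("B")):
        time.sleep(0.01)
    # ...then require the other one within the window, else ignore it.
    deadline = time.monotonic() + SIMULTANEITY_WINDOW_S
    while time.monotonic() < deadline:
        if read_button("A") and read_button("B"):
            return True   # both held together: genuinely cut the power
        time.sleep(0.01)
    return False          # lone press: probably someone looking for the door
```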
My guess about the initial failure: ATS left in bypass or failed on power transfer; 15 mins for someone authorised to manually switch the ATS to a good power source or bypass the failed ATS; UPS/generators not specified (or too much new kit added to the DC) for the full startup load of the whole data centre, which then failed again.
Some systems probably started up and began a re-sync, then the high load crapped out the generators, returning everything to silence again and leaving the replication in an unknown state when systems were manually restarted over a longer period to manage the initial load.
My bet would be that a cluster failover was initiated by the power failure, then fail-back was manually triggered, but the primary site failed again with the power surge while the secondary systems were starting. With a manual failback an engineer would be needed to fail over again, not just a bargain-basement operator.
If the people that manage the servers are from TCR and they were unable to recover from the power failure in a reasonable amount of time, then I deduce that they are at fault. Maybe not for the initial outage, but for the subsequent problems. They would also be responsible for the disaster recovery procedures, so the blame for it all failing in the first place also lies with them.