IBM's Warwick, UK, data centre was hit by a power outage on Thursday, causing customers' mission-critical systems to be down for several hours, according to a source. Many customers use this IBM data centre to host their applications, including Edwards Ltd., a global manufacturer of vacuum pump technologies for precision …
Looks like someone dropped the danglie on this...
One would think that a setup like this would run up its generators every few days to check all was well with the engine, alternator and transformers etc. and note if the damn thing works properly. Did they underestimate the power demand on the generator and overload it? Or has the credit crunch hit IBM to the point that it's been running management cars on the red diesel destined for the generator...
IBM says it was a "Short outage"? LOL. It was 7 freaking hours!
There's form with IBM Warwick
I worked for IBM for many years. When Warwick was built they specified there should be two entirely independent cables bringing mains power to the site (quite big fat cables, clearly). 'twas done..
Some time after it opened - IIRC before the "hosting paying punters' systems" game started - it all went "pufffttt" and the power dropped - no juice from either cable. Everyone runs round in circles panicking & the systems are brought back up on generators... those that weren't UPS'd.
After the inevitable enquiry it transpired that although there were two separately routed cables into the site... they both led back to the same electricity sub-station. Brilliant!
Been there, seen that
Have been the unfortunate engineer on duty when this sort of thing happened. Fortunately we had enough battery power to keep the site running until our facilities engineer could get on site and figure a way to get things working.
I can't tell you how fun it is standing in the pouring rain trying to get the generator to come on line (you could manually start it but the UPS wouldn't take the input from it). The engineer ended up pulling a sodding great fuse out of the main panel, which failed one of the phases, at which point the whole thing just sprang into life. I wasn't paid enough to reach into a three phase power distribution panel I can tell you....
So I can have sympathy for them. Sometimes you can test everything as much as you want but until the power goes out for real you just don't know.
Here, we have to get customers' approval for this type of drill. Time and again, no customer wants to run any risk related to "let's switch off the mains and see the generator humming".
> One would think that a setup like this would run up its generators every few days to check all w
No, one would expect a world class data processing outsourcer like this to have multiple site datacentre redundancy configured to work seamlessly in active-active or active-passive failover configuration.
Losing a datacentre should not result in lost transaction processing capabilities on mission critical systems...
Last I knew, most of those entire systems were backed up on other sites
Being on the flightpath to Birmingham International, my understanding was that many of the critical Warwick services were real-time mirrored (albeit onto older kit) in other sites, precisely because of the risk of a 'jet crash' scale calamity. I suppose it depends what level of backup the customer is paying for, though. I'd be interested if customers such as TUI were affected, for instance. It was always considered a little ironic that TUI would pay for such a high level of redundancy, given that in all likelihood, it would be one of their jets, doing the crashing!
Either way, many of those systems can take upwards of an hour to reboot (some are rebooted once every year, as a matter of course, and it's quite an operation, whenever it's that time of year - even when you have a schedule to work to). So bringing the whole lot back from an unexpected sudden death, in stages, would cause a lot of downtime.
One would be wrong
"One would think that a setup like this would run up its generators every few days to check all was well with the engine, alternator and transformers etc. and note if the damn thing works properly", rhydian
One would be more accurate to speculate that the maintenance was outsourced to a company that hired another company that outsourced to another company that hired on cheap East European contract labor. At least that's the way they do it round here.
I was hit by this...
I was on holiday, on the last full day, attempting to use a FairFX US currency card and it completely ruined the last day of my holiday, not being able to make any payments (courtesy of the aforementioned MasterCard processor).
Stupid IBM. Epic FAIL.
I have nothing but contempt for IBM. "Server(s)" them right!
@rhydian and Anon @14:58
It is a must in our organisation... every Thursday our mains power drops out and the generators take over for a while. The last time something did go wrong the whole thing pancaked, but at least we learnt from it. Chances are that our systems won't go down...
IBM clearly was a bit too confident that everything worked and would continue to.
Hmmm... that said, interesting that IBM has a massive data centre in Warwick - The town's not really big... ;-)
My impression is that the power outage was quite short - nothing like 7 hours. The rest of the time was spent in bringing systems on line in an orderly manner. I would be surprised if any readers of The Register did not understand this. Certainly all the customers would be aware of it - it's standard T's & C's used in all contracts covering this type of service provision.
The big issue is what were customer expectations, based on the specifications of the MG sets and the UPS.
It is my experience that standby MG sets are far too frequently just that: they stand by while everybody is expecting them to start. So the UPS should be specified to be large enough to allow a manual start of the MG after mains failure. But it would be very rare - and unacceptably expensive at a site the size of Warwick - to specify a UPS that can support the failure of a prime or singleton MG set while mains power is also off. The mean time to fix a failed MG set is certainly some hours. A better solution is always to have multiple MG sets, even though there are extra synchronisation costs involved - with each set capable of supporting a reduced operation of the site.
Twice as smug...
Advised against putting mission-critical services on Slowaris on SPARC? Tick!
Advised against out-sourcing mission-critical systems to IBM Warwick? Tick!
Will I be buying a lottery ticket this week? TICK!
Running up your generators is all well and good...
Until the contactors weld themselves so that you could only run on generators...
The natives were not happy about that when our scheduled gen-test meant the genny was running for 4 days. People were queued up at the site gate moaning about the noise. Then again, we've been here for longer than them, so they were told to go away.
Anyway, 4 days later, the new contactor arrives, and the datacenter needs to be powered down so the sparks can do their thing.
An interesting week.
I used to do IT for a certain private hospital group whom I will not mention (but who share the name of a famous English chef).
We ran gen tests once a month, for each site, without fail. Everything critical had its own battery attached, so in the event that the gen set didn't fire up, we could flick back to the mains, and no one would notice.
There is no excuse for a multi million dollar data centre NOT doing this gen-set testing once a god damn month, especially if they are confident that their UPS system is up to par.
Here is a really easy system guys...
Step 1) Flick off mains
Step 2) See if the gen system kicks off
If Gen system fails
Then Turn mains back on AND fix gen system
Repeat until gen system works.
If your power systems are designed correctly, you should be able to do this, with no one the wiser, as your UPS will cover the 30 seconds or so it takes for your gen systems to notice the loss of power.
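For anyone who prefers it spelled out, the steps above can be sketched as a loop in Python. This is only an illustration: `generator_starts` and `fix_generator` are hypothetical stand-ins for whatever the site's real procedures are, and the `max_attempts` cap is an added assumption so the loop cannot run forever.

```python
from typing import Callable


def run_generator_test(
    generator_starts: Callable[[], bool],
    fix_generator: Callable[[], None],
    max_attempts: int = 3,
) -> int:
    """Repeat the mains-off test until the generator picks up the load.

    Returns the number of attempts needed; raises if it never works.
    Both callables are hypothetical stand-ins for real site procedures.
    """
    for attempt in range(1, max_attempts + 1):
        # Step 1: mains is switched off; the UPS carries the load meanwhile.
        # Step 2: see whether the gen system kicks off.
        if generator_starts():
            # Test passed: mains can be restored and the test ends.
            return attempt
        # Gen system failed: mains goes back on, fix it, then repeat.
        fix_generator()
    raise RuntimeError(f"generator still failing after {max_attempts} attempts")


# Usage with a fake generator that fails once and then works.
outcomes = iter([False, True])
repairs = []
attempts = run_generator_test(lambda: next(outcomes), lambda: repairs.append("fixed"))
print(attempts, repairs)
```

The sketch leaves out the part that matters most in practice, which is that the UPS has to hold the load for the whole of each attempt.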
Generators can't just sit there until you need them. They are complex machines that need monthly testing and checking.
Can I have my multi zero consultancy fee please?
Mine's the one with the pliers and the voltmeter in the pocket....
Server UPS AND Generator?
All 3 failed? How did they allow that to happen? Shouldn't they be maintained for this type of thing?
Generator tests fail too
I would hope they tested the generator. But I've heard of several large instances where the generator test itself caused an outage -- generator caught fire, burned out some switch it shouldn't, etc.
I would also not judge too quickly -- they summarize it as "the generator failed", but that doesn't necessarily mean it didn't start; possibly the hardware failed that hooks the generator to the data center; maybe there was some power surge when the power went out, damaging the generator etc.; maybe the generator ran but was not working under load. Any of these would not have been found even if it was tested.
@AC, "they both led back to the same electricity sub-station. Brilliant!" That's hilarious. This does happen with data as well; I remember reading about some data center that ordered like 6 or 8 different redundant connections so a backhoe wouldn't knock them off line. Well, one did... these providers were claiming they had different physical connections but actually had something like different wavelengths on the same fiber, or different fibers in the same bundle, so indeed a backhoe knocked out the primary and all backups.
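A rough way to check for this kind of hidden common point, sketched in Python. The component names (`substation_1`, the ducts) are made up for illustration; the idea is simply to list everything each supposedly independent path depends on and intersect the sets.

```python
from itertools import combinations


def shared_components(paths: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Return, for each pair of paths, the components they have in common."""
    overlaps = {}
    for (name_a, parts_a), (name_b, parts_b) in combinations(paths.items(), 2):
        common = parts_a & parts_b
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


# Hypothetical example: two separately routed cables, one substation.
feeds = {
    "cable_north": {"duct_north", "substation_1"},
    "cable_south": {"duct_south", "substation_1"},
}
print(shared_components(feeds))
```

Of course this only helps if the inventory is honest, which is exactly what went wrong when providers sold different wavelengths on one fiber as separate connections.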
" No, one would expect a world class data processing outsourcer like this to have multiple site datacentre redundancy configured to work seamlessly in active-active or active-passive failover configuration."
That really depends on what kind of data center they are. The case where they were hosting SAP for someone, I'd agree, IBM should have been taking care of redundancy. In some cases, hosting amounts to "put your boxes in our datacenter", it's up to the customer to then get redundant boxes, put them in multiple data centers, and arrange for failover.
I think when doing generator testing with live equipment on the line you wouldn't just flip off the main power to see if the generator comes up; you should be able to do that without having to force the UPS to take the full load.
When testing UPS systems you should be able to fire up the generators so that in the event the UPS fails the generator power is already flowing and there should be no disruption of service.
Of course, if the automatic switchover fails then you lose power to the UPS. That happened at a local Internap facility here a couple of years ago: the UPS worked, the generator worked, but the automatic switch that flipped the UPS from grid to generator power failed. The batteries were already needing replacement and didn't last as long as people thought, and they failed.
And of course it's ideal to have everything connected to two independent power sources so you can test one at a time. And run metered power strips (at least) so you can ensure that you are not overloading your circuits. Most data centers won't alert you to this unless you're a serious abuser (long after you've effectively lost redundancy by oversubscribing the line too much).
And I for one wouldn't want to be at a data center where there's only 30 seconds of UPS power. I would want at least 15 minutes; in the event of a problem, 30 seconds is of course not enough time for on-site personnel to respond. I think data centers that use flywheel UPSs are just nuts.
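To put the oversubscription point in concrete terms, here is a minimal Python sketch. The 16 A limit and the readings are hypothetical numbers, and it assumes the whole load lands on the surviving circuit when one feed drops; any derating margin a real site applies is left out.

```python
def survives_feed_loss(load_a_amps: float, load_b_amps: float, circuit_limit_amps: float) -> bool:
    """True if either circuit alone could carry the combined load."""
    return load_a_amps + load_b_amps <= circuit_limit_amps


# Hypothetical readings from metered power strips on a pair of 16 A circuits.
print(survives_feed_loss(6.0, 7.0, 16.0))   # 13 A combined fits on one circuit
print(survives_feed_loss(10.0, 9.0, 16.0))  # 19 A combined would overload the survivor
```

In the second case each circuit looks comfortable on its own at 10 A and 9 A, which is why nobody notices that the redundancy has already gone.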
Now if my equipment is fully redundant across sites and losing a site isn't a big deal then it's not a big deal, but organizations that have that level of availability are very few and far between.
Seems most posters here have none.
First, to test a generator, you rarely have to remove the main power. A well installed generator will have a bypass to allow full system start and load testing without removing the main power from the equipment. Good enough 99.9% of the time.
For those that think IBM "should have" blah, blah, blah, what do you know about the contract they have with the clients involved? Perhaps the client wanted to be at an IBM data centre, but wasn't willing to spend the money on a second data centre. You know IBM is going to try to sell it and if the client declines, you can be sure that 53 IBM lawyers covered their arse with a clause that says "you declined better service, you're on your own now".
Lastly, hardware fails, usually when you don't want it to. How do you know the generator wasn't tested the day before and it worked like a charm? Maybe it was just its day. That's exactly the reason why, as people have suggested, the *clients* *should* have had a redundant site. Because hardware fails!
Now, everyone go turn on their generator and report back here that it worked.
What do you mean you don't have one?
This took out Equifax's credit referencing services as well as killing ISDN equipment at the BT exchange!
Lucky we believe in backup systems, even if IBM don't ;-)
Running generators as a test has its own problems, as BT found out many years ago. Policy was to start the diesel generator every Monday to make sure it worked. It did.
Then when the power failed one day, the generator started fine, ran for about 30 minutes, and died.
It turned out that the constant cold starts & shutdowns had had the same effect as short runs in a car, the generator was well coked-up inside. Once it got up to running temperature all that glowing carbon caused havoc.
After a full cylinder-head-off rebuild the test policy was changed to a test run once a month, up to full temperature for a few hours.
IBM data center flat lines
Oh, well, just a little egg on their faces.
Quite a few years ago, I was assisting the head electrician on a high rise residential tower installing a back up generator. The only reason for the generator was for the four elevators (I think you Brits call them 'lifts'). Because the site location was in the lightning capital of the US, the cycle time was set at 2-1/2 minutes after power failure.
The local fire marshal required a 'simulated power failure' before he would sign off on the certificate of occupancy. I was given the 'pleasure' of killing the power. Pull the 1200 amp main switch, and two and a half minutes later, the generator started right up. Some firefighters went aboard each elevator, and ran all of them up and down for over half an hour. Satisfied that the generator was able to perform its job, the fire marshal gave his approval. I was told by the fire department that once a year they will inspect the entire building for fire safety issues, and that generator better work when they show up.
The only thing I can think of that is worse than a generator that fails to run is one that runs out of fuel because some (management) tw@t forgot to have the fuel tank refilled. ("Hey boss, I know a way to save some money, let's not buy any more fuel for the generator")
Flames - because generators run on flammable liquids.
Re: Five 9's
Here's some math for you... "Five 9's" that people are so enamored with is the 99.999% up time...
60 seconds * 60 minutes = 3600 sec/hr * 24 hr/day = 86,400 sec/day * 365 day/yr = 31,536,000 sec/yr
00.001% of this is 31536 seconds, which equals 8.76 hours... so a 7 hour outage is still within your 99.999% up time requirement with over 1.5 hours to spare... as long as there are no further outages this year.
Sad - but not uncommon
I can't remember 100% but had some involvement with Reuters in the late '80s. As I recall the share data had one or two outages in similar circumstances.
The backup generators failed because the diesel in their tanks had turned to treacle - basically unless you use it the lighter fractions separate or even evaporate so you have to either run them or flush the tanks every so often. Expensive.
Oh yes and three power feeds into the building (at different corners!) to protect against JCB syndrome BUT turned out they used the same duct in the building. So when that caught fire they lost all three power lines. Can't recall if it was an overload or if some builders did it. Always beware builders within 100's of yards. I've worked in two buildings where some builders set fire to the adjacent building. Made being a fire marshal much more real!
Fire - well fire.
Re: Five 9's
Here's some slightly better maths: five 9s is 00.001% time down, which is 315.36 seconds per year, or just over five minutes. So a 7 hour outage is still within your uptime requirement for the next century with a bit to spare...
Correct, and since no-one installs equipment just for a year, I always prefer to think of "5-nines" as about 1 hour max of downtime (unintentional AND planned) over 10 years. Looked at like that, people start to understand the challenge.
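The arithmetic is easy to check in a few lines of Python. This is just a sketch: it uses a 365-day year as in the posts above, and the function name is made up for the example.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # 31,536,000, ignoring leap years


def downtime_budget_seconds(availability_percent: float, years: float = 1.0) -> float:
    """Allowed downtime in seconds for a given availability over a period."""
    return SECONDS_PER_YEAR * years * (100.0 - availability_percent) / 100.0


per_year = downtime_budget_seconds(99.999)
per_decade = downtime_budget_seconds(99.999, years=10)
outage = 7 * 3600  # the 7 hour outage, in seconds

print(f"{per_year:.2f} s per year")
print(f"{per_decade / 60:.1f} min per decade")
print(f"{outage / per_year:.1f} years of budget used")
```

That gives roughly 315 seconds a year, a little under an hour per decade, and a 7 hour outage eating about 80 years' worth of allowance.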
The other side of these stories also....
Also, to comment: I have been present several times when external power was cut (road works, say, with a machine digging through the power lines) and the Data Centre was kept running for 12+ hours on the diesel power generators and UPS, and later through the second external power line connected to a different sub-station.
5 hours to bring the systems up and running doesn't seem to be a catastrophic situation, as probably the (business) data loss was minimal or nil and the true impact was stopping electronic business for some hours, which also happens a lot with normal ATM debit payments with having these disruptions...
"... which also happens a lot with normal ATM debit payments WITHOUT having these disruptions..."
Wonder about data center design
I'd like to know if the data center in question was a tier 1, 2, 3, or higher. If built as a tier 1 or 2, I would expect this type of down-time, as redundancy is not a primary issue. However, if designed as a tier 3 or higher, then IBM might need to rethink their build. Of course, putting mission-critical processing into anything but a tier 3+ center is folly to begin with.
Me thinkish they did not fill up the tank for the power generator.