A major US data centre in Texas was taken out when a truck driver hit a utility pole, sending it crashing into a power transformer. The Dallas/Fort Worth facility, owned by managed hosting outfit Rackspace, automatically switched to power from its own generators when the accident happened, but chiller units failed to re-start. …
"his 'perfect storm' chain of events"
"While our systems are engineered to chug through major failure, this 'perfect storm' chain of events beat both our set up and our data centre's sophisticated back-up systems."
Oh, yes, a data centre power incident is a 'perfect storm'; nobody could have predicted that happening. One is left wondering what is meant by 'major failure'; perhaps one of the servers has a RAID controller in case a disk fails. If this lot are providing a service to customers, they need to learn about reliability quickly and realise that two cheap stacks in independent data centres trump an expensive 'resilient' stack in one building every time.
The hardware could certainly be made more redundant in the way you're suggesting, but that's only half the issue. If they're not serving static content then syncing the data on the two sites stops it being quite so simple.
Obviously it's still possible, but it stops being so financially clear which option to use.
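The two-cheap-sites-versus-one-resilient-site argument above can be put as simple availability arithmetic. A minimal sketch, assuming illustrative uptime figures (99% per cheap site, 99.9% for the expensive one) and independent failures; these are not anyone's real numbers:

```python
# Illustrative availability arithmetic for "two cheap sites beat one
# resilient site". All uptime figures here are assumptions for the sketch.

def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability fraction."""
    return (1.0 - availability) * 365 * 24

# One expensive "resilient" site at an assumed 99.9% availability:
single_site = 0.999

# Two independent cheap sites at an assumed 99% each; the service is
# only down when BOTH fail at once (assuming either site alone can
# carry the load):
two_sites = 1.0 - (1.0 - 0.99) ** 2   # = 0.9999

print(f"single resilient site: {downtime_hours_per_year(single_site):.2f} h/yr")
print(f"two cheap sites:       {downtime_hours_per_year(two_sites):.2f} h/yr")
```

Under these assumptions the cheap pair comes out roughly ten times more available, which is the commenter's point; the reply above is right that the saving has to be weighed against the cost of keeping the two sites in sync.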
BTW the Rackspace managed hosting with "100% Network uptime Guaranteed" advert in the middle of these comments is quite amusing.
We have a big lever switch marked "Override" so we can run on generator-only feed until we are confident that the fault has been fixed. Maybe they should look at such an item, or isn't that hi-tech enough?
I thought resilience meant "what if the data centre was nuked?" as well. Surely this is a single point of failure. Given what I have seen and heard of American driving standards, perhaps this is not a perfect storm. I understand the phobia Americans have for small cars is largely based on a fear of being crushed by amphetamine-crazed long-distance wagon drivers.
It does worry me a bit that what is basically a power failure (albeit elaborate, but basically boiling down to grid power not being reliable but internal power being "supposedly" reliable) caused such problems at a "resilient" hosting firm. Power going up and down is hardly an excuse for actual customers having their servers switched off until and unless a) you run out of diesel for the generators b) your generators blow up or c) something catches fire (yes, I'd let you off with switching off everything in your data centre for even the smallest of fires).
And even then I'd expect it to be a very temporary issue until you fixed the above problem.
I work in broadcast, and my working ethic is: "if the listener/viewer can tell that something is wrong, I have failed in my task".
The studio can be on fire, but as long as we're still transmitting a programme then it's OK. Broadcast systems CANNOT ever have downtime. In this instance it's not Rackspace's fault; power cuts are a problem, and obviously, although their backup generators kicked in, the systems did not start up correctly, which I'm sure they will fix.
Syncing sites is an expensive (but good) way to go; if they're to compete in the hosting market then they have to trade off the cost of the second site against how much they can charge.
Don't plan for the worst
This same type of event happened to me back in 1986. Triple-redundant gas-turbine generators, battery backups, fly-wheels - and when the power dropped and the generators couldn't be restarted it was the chillers that were not on battery that killed three mainframes.
The moral of this story: this happens on a regular basis. No matter what you plan for, there is always going to be SOMETHING that was NOT planned that WILL happen...and, since the planning was for REALLY BAD things happening, when the failure occurs, it will be IN ADDITION TO the other REALLY BAD things.
There is only one solution: you buy insurance to cover the losses when something unplanned happens, and you make sure your contracts define exactly WHEN the insurance comes into effect. Don't bother with super-six-thousand-times-redundant systems, because there will always be the six-thousand-and-first problem that catches the system with its bloomers down.
You're not in IT, are you? Your comment is so practical and realistic that I find it hard to believe you are. If you are, you're a rare breed. Yay for practicality!!!
No, sorry, if the servers go down, it's Rackspace's fault. Period. It's their job to make sure the servers stay running and they failed.
A total power failure is to be expected at some point and should (and can) be covered, and wasn't.
Too much power, not a lack of it
The problem wasn't lack of power generation capacity. It was that the power had to be removed to allow for recovery and repair work to be done safely. You might like uptime but electrical workers like their lives.
All of this reinforces what ISPs have known for years: site redundancy is the best redundancy of all.
You couldn't be any righter, Sean! It is all Rackspace's issue. 100% reliability? ...impossible. An ad such as that should be a RED FLAG to anyone looking for hosting. The company LIED! Couldn't they have said something like five 9s, and then in small print stated: 'over a thousand years'... At least that would leave a lot of room for getting that ONE 9 back up to 5 :-) ... you know, someday.
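For anyone weighing up claims like "five 9s" against "100% guaranteed", the allowed downtime per year is easy to tabulate. A quick back-of-the-envelope sketch (the figures are straight arithmetic, not any provider's SLA terms):

```python
# How much downtime each "nine" of availability actually allows per year;
# a back-of-the-envelope check on claims like "100% uptime guaranteed".

MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 (ignoring leap years)

for nines in range(1, 6):
    availability = 1.0 - 10 ** -nines          # e.g. 3 nines = 99.9%
    downtime_min = (1.0 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.5%} -> {downtime_min:,.1f} minutes/year")
```

Five nines works out at roughly five minutes of downtime a year, so a single chiller failure of a few hours blows years' worth of the budget, never mind 100%.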
And the chillers were not on UPS because...?
"Perfect storm" indeed. "Sophisticated", hah. Too expensive, more likely.
@ Glen Turner
"It was that the power had to be removed to allow for recovery and repair work to be done safely. You might like uptime but electrical workers like their lives."
Sorry, but I just can't see the relevance of that statement.
The safety of the emergency workers should have nothing to do with whether or not a service provider like Rackspace has AC power available. If you're in that line of business you should have all of the following: duplicated utility feeds from different distribution sub-circuits; at least triple-redundant backup generators (where I work, one of the three is on load, one is running at speed but not on load, and the third is waiting to start should either of the first two fail during a utility failure); and significant UPS capacity that can run all critical systems, including aircon, for long enough to get a failed engine sorted should all three not start.
Being in the telephone exchange business, we have all of these plus several hours of 50 volt battery capacity available.
Is this cheap? Of course not; if nothing else, the building I'm based in consumes about 1,000 litres of diesel per hour when the power is out. But if you're in the reliability business, you need to spend the money.
Oh, and yes, there have been occasions when we were down to battery power and UPS only, i.e. all three engines failed in turn during a power outage.
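The three-generator arrangement described above (one on load, one spinning, one on standby) can be given a rough number. A sketch assuming a made-up 2% chance of any one generator failing during an outage, and assuming failures are independent, which the "all three failed in turn" anecdote shows is optimistic:

```python
# Rough failure arithmetic for a triple-redundant generator setup.
# The 2% per-generator failure chance is an illustrative assumption,
# not a measured figure.

p_fail = 0.02                 # assumed P(one generator fails this outage)

# If failures were truly independent, all three fail together with
# probability p^3:
p_all_three = p_fail ** 3     # 8 in a million outages

print(f"P(all three fail) = {p_all_three:.1e}")
```

The catch, as the anecdote shows, is that real failures are often correlated (shared fuel, shared maintenance, shared cause), so the independent-failure figure is a floor, not a promise; hence the battery and UPS bridge still earns its keep.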
...it's BAU P&L...
@Cube: 2 independent data centres
... yup, better than one... it sounds like they do not have a synced replication scheme to another site, since they were hell-bent on getting the power back up at the affected DC.
@Graham: 100% network uptime guaranteed
...looks like the marketing department got confused by all the 99.9995 stuff and thought "that's close enough to 100... and it's fewer digits, so the advert won't cost as much".
Catastrophic events will occur; it just depends on what you are prepared to invest in as a company to ensure you *still* have a client base after a major outage... I fully agree with @Dave: it's a financial tradeoff.
BBC (and Dave can probably verify)
A mate who used to work at the Beeb told me that an IBM salesdroid was horrified that the BBC wouldn't buy UPSs for their new servers.
He was taken to a basement where he was shown an engine, from a ship, which powers the whole of Television Centre if mains power goes out. The 'starter motor' for this giant generator/engine is apparently a V8 car engine...
The best-laid plans
Well, once again - thanks to all the great minds that contribute here - Rackspace has been duly castigated for its lack of foresight/money/intelligence/honesty/divination.
However, back in the real world, where business continuity planning rules apply, we learn two things. We learn that the amount we're prepared to pay for redundancy, alternate power sources, etc. bears a direct relationship to our estimation of the likelihood of an event occurring (and we are often made fools of by doing so); and we learn that Robert Burns may well have been in the BCP business. Why else would he have written the cardinal rule of business continuity planning, "The best laid plans o' mice and men gang aft agley"?
By the way, congratulations to Chris Collins for the day's most gratuitous swipe at America and Americans. Way to go, Chris!
Strategy for prevention needed - Rackspace could up its game.
You've all mentioned how Rackspace needs to secure its backup electrical facilities, but this incident highlights their need to follow the less-travelled path of investing in the improvement of their primary commodity.
In much the same way that Rackspace might build a bleeding-edge facility, the company (and its like) should blaze the path for alternative energy sources. I don't mean backup alone, but as a primary power source, using bleeding-edge power generation technology.
Find another company that is on this path, invest in them, and reap the rewards of using their breakthrough method. Renewable-energy issues aside, the purpose is to further reduce dependencies.
Sure, they likely receive kickbacks from their home-town headquarters for using the local power stations and contributing to the town treasury. But not spending is the best way to save money (better than paying discounted rates), and that means higher revenues (further enhancements, etc.).
The larger benefit (the one I'm most interested in) is that of the alternative energy industry. When big companies risk-and-research (invest in burgeoning industries) it can quickly trickle down to city governments and individual households.
If I wanted to reassure customers, I'd announce investments/partnerships with an alternative energy outfit, pointing to a goal of being umpteen percent less dependent on local power.
The best competitive advantage is when you're the only player. Rackspace should see this as an opportunity to do more than fix it: to prevent it.
"Perfect storm"? You yanks are so dramatic
For "perfect storm" read 1) "can't afford location redundancy" or 2) "don't understand database replication"
Why not come clean and say this event happened outside of their "cost vs availability" envelope?