We've heard of it!
Tata's datacentre in the east end of London went titsup for two hours on Thursday evening, following a power cut. Backup power systems also failed, downing servers belonging to hosting providers throughout three floors of the Stratford facility at about 5.20pm. Firms including C4L, ServerCity and Coreix were hit by the outage …
Really? I'd never have guessed!
This seems to be a recurring theme with power outages these days. Can anyone recall an occasion where the backup power systems have actually ever worked?
My home UPS did during a local 30 minute outage. OTOH, I'm only powering my computer, monitor, cable modem, vonage wireless router and phones from it.
It's going on 40 years I've been working in the IT industry and I've been directly involved one way or another with dozens of installations with power backups of many sorts, and seen them activated many times.
They nearly always work as expected.
But when they do, hardly anyone notices and the event is not considered newsworthy.
[IT? What's that? Don't we call it Data Processing anymore? What happened to my plug boards? Ah - there they are - behind my card boxes.]
You wouldn't hear about times when the backup power worked, because that's not newsworthy.
"Power outage at tata datacentre! Backup power worked perfectly, no issues at all" isn't really that interesting as a story now, is it?
For a very good reason. "Everything's fine, folks" doesn't make the news.
I know what you're getting at, but no, there are plenty of systems that have no problems at all because they have effective (and tested..) systems in place. However, "Blackout Causes Failover System To Work As Expected!!" doesn't make quite as good a headline as "Blackout Causes Failover to Fail Epically! Zombie incursions expected!!"
Oh yeah, we suffered a massive spike in our data-centre's main power supply at about 11am one morning - our business's busiest trading period.
Of course our system worked - for all of the 9 seconds it took for the same spike to hit the secondary power supply and destroy the 10-year old UPS.
Lucky we had a separate disaster-recovery site 50 miles away - oh wait, the replication link hadn't been working for a few days and no-one had noticed.
IT companies never learn.
Mine worked fine today. Power cut upset the clocks and the satellite receivers, but the UPS kicked in and kept my server and DSL running happily.
Another reason to keep IT under control, and not outsource it to clouds. They evaporate when you least expect it.
They do work, the thing is you never see the press crowing about "data centre in north london suffers short mains disruption, UPS and generator systems function as intended, nothing happened" because that wouldn't be news.
Now, if you don't build your data centre properly, don't maintain it properly (like checking that the UPS is actually charging the batteries) and don't run tests (such as black building) you are in for a nasty surprise.
These things don't look after themselves, and the two best ways to shoot yourself in the foot are:
1) Let some bean counter skimp on the maintenance budget and hope it won't be customer-affecting when it happens.
2) Build a banking-grade, redundant-everything monster, more complex than the LHC, that is too complex to operate or maintain, and watch it all go wrong because nobody can run it properly.
Obviously one of these is cheaper than the other...
Not a big Fasthosts fan myself, but when Gloucester got flooded a few years ago and lost electricity the datacentre backup systems did kick in and work whilst under water.
I used to work in a building with a tiny 'data center' (constructed with cubicle walls...) and we had quite a few power outages, through which our UPS and genny power managed to work each time.
That's not to say that failed UPSs didn't cause us some extra downtime for no reason...
@ Maliciously Crafted Packet
When they work you never hear about them ....... it's just not that interesting ...
Now if only the water processing plant had too we wouldn't have had to use the bowsers those few weeks! I guess that is as good a definition for 'irony' as any you'll find.
I remember back in the 80s when I worked for GEC/Marconi, they decided to do a generator test one weekend.
So they kicked them off, and they both started... Within a minute one stopped so they went to have a look at the problem.
Whilst being ignored the second one felt lonely and decided to catch fire. Luckily there was an onsite fire brigade who came round in their little red landrover and tackled it.
Unfortunately the fire, or the extinguishing of it, caused the mains electrical feed to trip, and the entire data centre was plunged into darkness.
We had a great 5 hours of doing sod all in the office on Monday morning as the operators were still trying to get the mainframes to start up!
Strangely enough, I can't recall one that made the news!
Power outages happen all the time; however, when everything is working, you don't know about it. It's only when something goes wrong that it makes the news.
We have backup UPS, generators and an alternate site that can handle the processing load if necessary. Supposedly, you shouldn't notice it even if a bomb hit the site. Unfortunately, recently, it all went tits up. The generators didn't kick in and processing didn't switch to the alternate site for some reason. The UPS is only there to keep everything going during the time it takes to fire up the generators, so that wouldn't keep it going for long.
But, when it's tested every month or so, you don't even know it's happening!
no tata, thank you very much
Probably had electric starters.
TATA are the Indian conglomerate that own everything from Tetley Tea to Land Rover. Perhaps you would have gotten a response if they used an Indian call centre instead!
If they moved their power control systems to their new 27" iMac ;p
One problem is that people are often reluctant to test them fully in case they don't work!
One way to do it properly is to pull the breaker on the incoming mains supplies.
I originally said tata to your English jobs, now it seems it's tata to your English datacentres as well, oh well, saved some cash didn't we?
Because I've never seen anybody that has truly tested their DR plan to the point that they weren't surprised when they had some downtime. I have seen plenty of people doing nice safe little tests to satisfy the management, but nobody that's done a real test.
Which probably explains why I've seen things like a data centre go down because, while there was power for the servers, there was none for the air con. You'd be surprised how quickly a server room can reach 40 degrees without aircon.
If the backup-generator started correctly, then it would not be newsworthy.
Shock horror. Company's disaster plan worked correctly!
Yes, as an Electrical Engineer for more years than you can shake a stick at, I can say the scenario is all too common of late.
Standby Geny - check, full fuel tank - check, battery ok? er wot?
The company I work for maintains standby Genys for the local Police/ fire/Ambulance infrastructure.
Genys are tested on load for two hours every six months, and the heavy-duty batteries are changed for new every year. If there has been no mains outage, the Geny will have run for about four hours in a year, but you change the batteries anyway cos comms are vital.
Now a big-name bank we look after does not want to pay for new batteries every year: "Why? There is hardly any wear on them and they cost over a hundred quid each! Just leave the original ones for another year."
repeat till failure
Anonymous for obvious reasons.
Would that be one of those banks that have paid 'bonuses' in the millions?
At a certain place that I worked, the backup genny was tested weekly for startup, run and power output. Every month a full load cutover of the entire site to generator for a couple of hours was performed.
When we actually had a power cut, do you think we could get the sodding thing to start? Still, watching the faces of those trying to make it work as the UPS battery deathclock ticked down was quite funny.
BTW, any idea what happens some months later when, after some spending of the arse-covering budget, two gennies start up while connected to a power distribution system originally specified to support only one?
I took this little lot as conclusive proof of Sod's Law.
I used to have a server hosted in this data-centre with ServerCity.
Tata are total cowboys. The data-centre is poorly laid out, everything's just been shoved in wherever it fits, and you have to tread carefully otherwise you'll break your neck by tripping over the cables sprawled across the floor.
The day after I visited the data-centre to do an OS upgrade on my server, I switched suppliers to Telecity - you could see the difference as soon as you walked through the door.
No tata for me either.
Joy, that's why I'm planning on doing my offsite backups to...
A very good question!
I have seen successful backup power working at Level 3's building in Goswell Road several times. I have also seen a 7 hour outage there.
Telehouse, seen a power outage there. Telehouse reps said it didn't happen, but couldn't explain why every piece of our equipment, in several racks, suddenly decided to turn itself off and on at the same time.
Harbour Exchange Redbus (as it was then), seen a loss of power during "testing of the backup power system". Not successful, I would say. This was after another power outage when the backup systems didn't work. I recall several power outages at this site.
Power outage at BT's Ilford POP. We were the first to notice and call in about it. Not sure if they have backup power, but I would be surprised if they did not.
Basically at every site where we have equipment, where backup UPS and generators are supplied, I have seen outages. I think it's fair to say backup power works sometimes.
I wonder how much the BOFH paid them when the beancounters cut his 'maintenance' budget.
.. sounds like grounds to ditch ANY backup contract with those people then!
Also VoIP in the same data-center to cover your telecoms, inspired, truly inspired!
Icon? For what was required not provided!
Tatas are always better in pairs.
I'm pretty sure a test/maintenance/replace schedule (whatever it's called) for their batteries would help? Anyway - a diesel generator should have sorted the problem just like that! FAIL
On two occasions in the past, when working as a systems test engineer in large (nameless) companies, I have suggested cutting the main power feed (at a non-busy time and with advance warning to all staff) to test the backup power facilities and procedures, recovery process, etc.
(Note: this was at a stage before a facility had become fully active and part of everyday company activity.)
Each time, I was told that there was a danger of disruption to services and possible damage to equipment, so that would not be allowed by 'management'.
By the time my brain had finished doing mental backflips to try to understand their point of view, the meeting had ended.
Coat, because at the end of the day, I got paid whether it worked or not.
Migrated out of there a couple of months back, lucky us, bad luck for anyone left there. It's a rubbish facility in a rubbish location that's rather expensive too...
I wonder if on the last test they saw the generators' fuel was low and went, "Oh, let's put in a fiver; that will pass the test for now and we can fill it up later" :)
Does make you wonder about N+1 for redundancy; it seems this case was -N-1.
Apparently Tata has the following
"The facility offers complete redundancy in protected power, HVAC, fire suppression ..............
The facility takes power feeds from multiple power grids, distributed via N+1 MGE UPS battery backup power and three 2.5 MVA Caterpillar diesel generators to backup primary power source with 48-hour on-site fuel storage supported by continuous refuelling contracts"
Which is all well and good, but only if people know how to use it. I guess they thought they didn't need to, as 99.9999% of the time it would be done automatically; but if it isn't, the people on site need to know how to start it manually.
RE: Maliciously Crafted Packet
Does it ever make the news if a D/C loses main power and the generators kick in and work?
I would like to bet that more do have mains failure and stay online and are not in the news for working correctly when there is a power loss than those that lose power and fail to get the backup systems working.
Tata Tata and your epic fail.
So a 'spokeswoman said Tata was still looking into what caused the outage and the subsequent failure of backup power'.
More like they need some time to scour their Ts & Cs and cook up a lame excuse that attempts to absolve them of responsibility.
Go to the back of the Data Centres for Dummies class, stay after school and write out 1000 times:
"Data Centres must be continuously supplied with AC power, all redundant systems must be tested, tested and tested again"
When I was a telephone engineer we had Strowger exchanges run off 80V batteries. A big diesel engine could run the batteries when the mains failed.
We had 2 gennies at Grantham. We tested one on Tuesday & the other one on Friday.
Odds are this company thought they didn't need a Power & Plant engineer.
I've seen it all too often - These companies try to save "so much money" which works short term, but only ends up like this, with reputation in tata's....
At one time there was considerable concern that the electricity distribution infrastructure in that part of London would be inadequate for 2012.
Related to that, a big new substation in preparation for the Olympics opened a few weeks ago.
Could these two be related to Thursday's failure?
And they can offer an even cheaper center in India. It won't stay up either ;)
Tata for now...
IIRC something similarly embarrassing happened to a Reuters data centre in the 1980s when I worked for them. Builders = power outage, then it turns out the generators had been fuelled months (or years) earlier. Apparently diesel goes off if you do that - turns into a kind of nauseous treacle, or an acid.
I think the only other people who usually have to look out for this problem are the MOD with mothballed kit and farmers, so it's little known. Some sources on the web say diesel will keep for 18-24 months without additives while others say 2 years max with additives. Apparently you have to keep it cool and avoid water condensing out in the tank to get those times. The data centre tanks are probably sited outside in the sun....
In fairness the UPS batteries may have been empty because they'd kept things going long enough for the generators to (not) start. Those generators on the site are advertised as being 7.5MW in total so I guess the battery life is pretty short!
Sort of -- two fat generators caught fire after ten minutes ---oh how we laughed.
...not all of Tata's customers were affected. Spammers seemed to have uninterrupted service; got two spam emails this morning and one last night advertising Tata-hosted "make money fast" sites. Good to know that not all their customers had problems!
CoreIX is still claiming on their website "The Coreix Premium Network has obtained 100% Network uptime over the last 4+ years." That does not tally up with their status page at http://status.coreix.net/ - which was pretty useless when they were down. If you have a status page you should make sure it is run on an entirely separate infrastructure and domain.
Testing often doesn't work, as it's either not done on load or it's done in a controlled manner. Large spikes can knock out the control systems, and even then you could have tested it 5 minutes before and it could still not work for some reason.
Much like backups, the test is only relevant at that moment in time; it does not guarantee the situation going forwards.
Even with all that an emergency power off will defeat you and you won't be allowed back in until the firemen say so. Over the years I've seen outages caused by overload, component failure, huge spikes and a failed fan causing the fire alarms to go off.
Nothing is perfect however much testing you do.
In my experience it's generally the control systems that fail, and it requires an engineer to install a bypass and then a 2nd outage to take it out once everything is fixed!
Still annoying though!
Details received from Coreix
At approximately 16:48 GMT the Stratford, London facility lost mains power from the power grid. The time-line of events is as follows:
16:48 - Power to site lost - running on UPS.
17:15 - UPS systems depleted, generators failed to start.
17:30 - Generators failovers failed to function despite multiple attempts - The power engineers were dispatched.
18:32 - The power engineers arrived on site.
18:54 - The power engineer estimates 30-45 minutes to return power to site.
19:16 - The power was returned to site and the process of booting up each rack commenced.
19:35 - All racks powered up and brought on-line.
The last twenty-four hours have seen numerous power grid blips but the systems as designed have taken the load, with the UPS battery backup and generators keeping the facility live, ensuring an un-interrupted service.
Tonight, however, at 16:48 a failure in the control boards prevented the generators from powering the facility once the power from the grid failed. A manual bypass was installed to get the facility on-line.
The facility power engineers are continuing to monitor the facility and extra staff were drafted in to help.
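For anyone who wants durations rather than timestamps, here is a rough sketch that works them out from the Coreix timeline above. The dictionary keys and the `minutes_between` helper are my own names, not anything from Coreix, and it assumes every time in the list falls on the same day.

```python
from datetime import datetime

# Timestamps copied from the Coreix timeline (GMT, assumed same day).
# The key names are made up for this sketch.
TIMELINE = {
    "mains_lost": "16:48",
    "ups_depleted": "17:15",
    "engineers_dispatched": "17:30",
    "engineers_on_site": "18:32",
    "power_restored": "19:16",
    "all_racks_online": "19:35",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# How long the UPS carried the load before running flat.
ups_runtime = minutes_between(TIMELINE["mains_lost"], TIMELINE["ups_depleted"])
# How long the engineers took to reach the site once dispatched.
engineer_travel = minutes_between(
    TIMELINE["engineers_dispatched"], TIMELINE["engineers_on_site"]
)
# How long the site had no power at all.
time_dark = minutes_between(TIMELINE["ups_depleted"], TIMELINE["power_restored"])
# How long customers were down, from UPS depletion to the last rack coming back.
total_outage = minutes_between(TIMELINE["ups_depleted"], TIMELINE["all_racks_online"])

print(f"UPS runtime:      {ups_runtime} min")
print(f"Engineer travel:  {engineer_travel} min")
print(f"Site without power: {time_dark} min")
print(f"Total outage:     {total_outage} min")
```

On those figures the UPS held for 27 minutes, the engineers took just over an hour to arrive, and the racks were down for about two hours twenty in total.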
"Tata's Datacenter goes tits up?"