We've heard of it!
Tata's datacentre in the east end of London went titsup for two hours on Thursday evening, following a power cut. Backup power systems also failed, downing servers belonging to hosting providers throughout three floors of the Stratford facility at about 5.20pm. Firms including C4L, ServerCity and Coreix were hit by the outage …
It's going on 40 years I've been working in the IT industry and I've been directly involved one way or another with dozens of installations with power backups of many sorts, and seen them activated many times.
They nearly always work as expected.
But when they do, hardly anyone notices and the event is not considered newsworthy.
[IT? What's that? Don't we call it Data Processing anymore? What happened to my plug boards? Ah - there they are - behind my card boxes.]
I know what you're getting at, but no, there are plenty of systems that have no problems at all because they have effective (and tested...) systems in place. However, a "Blackout Causes Failover System To Work As Expected!!" doesn't make quite as good a headline as "Blackout Causes Failover to Fail Epically! Zombie incursions expected!!"
Oh yeah, we suffered a massive spike in our data-centre's main power supply at about 11am one morning - our business's busiest trading period.
Of course our system worked - for all of the 9 seconds it took for the same spike to hit the secondary power supply and destroy the 10-year old UPS.
Lucky we had a separate disaster-recovery site 50 miles away - oh wait, the replication link hadn't been working for a few days and no-one had noticed.
IT companies never learn.
They do work; the thing is, you never see the press crowing about "Data centre in north London suffers short mains disruption, UPS and generator systems function as intended, nothing happened", because that wouldn't be news.
Now, if you don't build your data centre properly, don't maintain it properly (like checking that the UPS is actually charging the batteries) and don't run tests (such as black building) you are in for a nasty surprise.
These things don't look after themselves, and the two best ways to shoot yourself in the foot are:
1) Let some bean counter skimp on the maintenance budget and hope it won't be customer-affecting when it happens.
2) Build a banking-grade, redundant-everything monster more complex than the LHC that is too complex to operate or maintain, and watch it all go wrong because nobody can run it properly.
Obviously one of these is cheaper than the other...
I used to work in a building with a tiny 'data center' (constructed with cubicle walls...) and we had quite a few power outages, through which our UPS and genny power managed to work each time.
That's not to say that failed UPSs didn't cause us some extra downtime for no reason...
I remember back in the 80s when I worked for GEC/Marconi, they decided to do a generator test one weekend.
So they kicked them off, and they both started... Within a minute one stopped so they went to have a look at the problem.
Whilst being ignored, the second one felt lonely and decided to catch fire. Luckily there was an onsite fire brigade who came round in their little red Land Rover and tackled it.
Unfortunately the fire, or the extinguishing of it, caused the mains electrical feed to trip, and the entire data centre was plunged into darkness.
We had a great 5 hours of doing sod all in the office on Monday morning as the operators were still trying to get the mainframes to start up!
Strangely enough, I can't recall one that made the news!
Power outages happen all the time; however, when everything is working, you don't know about it. It's only when something goes wrong that it makes the news.
We have backup UPS, generators and an alternate site that can handle the processing load if necessary. Supposedly, you shouldn't notice anything even if a bomb hit the site. Unfortunately, recently, it all went tits up. The generators didn't kick in, and processing didn't switch to the alternate site for some reason. The UPS is only there to keep everything going during the time it takes to fire up the generators, so that wouldn't keep it going for long.
But, when it's tested every month or so, you don't even know it's happening!
Because I've never seen anybody that has truly tested their DR plan to the point that they weren't surprised when they had some downtime. I have seen plenty of people doing nice safe little tests to satisfy the management, but nobody that's done a real test.
Which probably explains why I've seen things like a data centre go down because, while everything did have power for the servers, it didn't have any for the air con. You'd be surprised how quickly a server room can reach 40 degrees without aircon.
Yes, as an Electrical Engineer for more years than you can shake a stick at, I can say the scenario is all too common of late.
Standby geny - check. Full fuel tank - check. Battery OK? Er, wot?
The company I work for maintains standby genys for the local police/fire/ambulance infrastructure.
Genys are tested on load for two hours every six months, and the heavy-duty batteries changed for new every year. If there has been no mains outage, the geny will have run for about four hours in a year, but you change the batteries anyway 'cos comms are vital.
Now, a big-name bank we look after does not want to pay for new batteries every year: "Why, there is hardly any wear on them and they cost over a hundred quid each!! Just leave the original ones in for another year."
Repeat till failure.
Anonymous for obvious reasons.
At a certain place that I worked, the backup genny was tested weekly for startup, run and power output. Every month a full load cutover of the entire site to generator for a couple of hours was performed.
When we actually had a power cut, do you think we could get the sodding thing to start? Still, watching the faces of those trying to make it work as the UPS battery deathclock ticked down was quite funny.
BTW, any idea what happens some months later when, after some spending of the arse-covering budget, two gennies start up while connected to a power distribution system originally specified to support only one?
I took this little lot as conclusive proof of Sod's Law.
I used to have a server hosted in this data-centre with ServerCity.
Tata are total cowboys. The data-centre is poorly laid out, everything's just been shoved in wherever it fits, and you have to tread carefully otherwise you'll break your neck by tripping over the cables sprawled across the floor.
The day after I visited the data-centre to do an OS upgrade on my server, I switched suppliers to Telecity - you could see the difference as soon as you walked through the door.
No Tata for me either.
A very good question!
I have seen successful backup power working at Level 3's building in Goswell Road several times. I have also seen a 7 hour outage there.
Telehouse, seen a power outage there. Telehouse reps said it didn't happen, but couldn't explain why every piece of our equipment, in several racks, suddenly decided to turn itself off and on at the same time.
Harbour Exchange Redbus (as it was then), seen a loss of power during "testing of the backup power system". Not successful, I would say. This was after another power outage when the backup systems didn't work. I recall several power outages at this site.
Power outage at BT's Ilford POP. We were the first to notice and call in about it. Not sure if they have backup power, but I would be surprised if they did not.
Basically at every site where we have equipment, where backup UPS and generators are supplied, I have seen outages. I think it's fair to say backup power works sometimes.
On two occasions in the past, when working as a systems test engineer in large (nameless) companies, I have suggested cutting the main power feed (at a non-busy time and with advance warning to all staff) to test the backup power facilities and procedures, recovery process, etc.
(Note: this was at a stage before a facility had become fully active and part of everyday company activity.)
Each time, I was told that there was a danger of disruption to services and possible damage to equipment, so that would not be allowed by 'management'.
By the time my brain had finished doing mental backflips to try to understand their point of view, the meeting had ended.
Coat, because at the end of the day, I got paid whether it worked or not.
I wonder if, on the last test, they saw the generator's fuel was low and went "Oh, let's put in a fiver, that will pass the test for now and we can fill it up later" :)
Does make you wonder about N+1 for redundancy; it seems this case was -N-1.
Apparently Tata has the following:
"The facility offers complete redundancy in protected power, HVAC, fire suppression ...
The facility takes power feeds from multiple power grids, distributed via N+1 MGE UPS battery backup power and three 2.5 MVA Caterpillar diesel generators to backup primary power source with 48-hour on-site fuel storage supported by continuous refuelling contracts"
Which is all well and good, but only if people know how to use it. I guess they thought they didn't need to, as 99.9999% of the time it would kick in automatically; but still, if it doesn't, the people on site need to know how to start it manually.
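The advertised spec above is a classic N+1 design: enough units to carry the load, plus one spare, so any single unit can fail without dropping the load. A minimal sketch of that arithmetic (the load figures below are assumed for illustration; only the three 2.5 MVA generators come from the quoted spec):

```python
# Illustrative N+1 check, not Tata's actual control logic.

def survives_single_failure(unit_capacity_mva, units, load_mva):
    """True if the remaining units still cover the load after one unit fails."""
    return (units - 1) * unit_capacity_mva >= load_mva

# Three 2.5 MVA generators: N+1 only holds if two of them can carry the site.
print(survives_single_failure(2.5, units=3, load_mva=4.0))  # True: 5 MVA covers 4 MVA
print(survives_single_failure(2.5, units=3, load_mva=6.0))  # False: no longer N+1
```

Of course, as this outage shows, the arithmetic only matters if the failover actually engages.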
RE: Maliciously Crafted Packet
Does it ever make the news if a D/C loses mains power and the generators kick in and work?
I would bet that more sites have a mains failure and stay online, and are not in the news for working correctly when there is a power loss, than lose power and fail to get the backup systems working.
Tata Tata and your epic fail.
So a 'spokeswoman said Tata was still looking into what caused the outage and the subsequent failure of backup power'.
More like they need some time to scour their Ts & Cs and cook up a lame excuse that attempts to absolve them of responsibility.
Go to the back of the Data Centres for Dummies class, stay after school and write out 1000 times:
"Data Centres must be continuously supplied with AC power, all redundant systems must be tested, tested and tested again"
At one time there was considerable concern that the electricity distribution infrastructure in that part of London would be inadequate for 2012.
Related to that, a big new substation in preparation for the Olympics opened a few weeks ago.
Could these two be related to Thursday's failure?
IIRC something similarly embarrassing happened to a Reuters data centre in the 1980s when I worked for them. Builders = power outage; then it turns out the generators had been fuelled months (or years) earlier. Apparently diesel goes off if you do that - it turns into a kind of nauseous treacle, or an acid.
I think the only other people who usually have to look out for this problem are the MOD, with mothballed kit, and farmers, so it's little known. Some sources on the web say diesel will keep for 18-24 months without additives, while others say 2 years max with additives. Apparently you have to keep it cool and avoid water condensing out in the tank to get those times. The data centre tanks are probably sited outside in the sun...
In fairness, the UPS batteries may have been empty because they'd kept things going long enough for the generators to (not) start. The generators on the site are advertised as 7.5MW in total, so I guess the battery life is pretty short!
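The point about short battery life is just energy over draw. A back-of-envelope sketch (all figures here are assumed for illustration, not the site's actual battery capacity or load):

```python
# Rough UPS hold-up time: usable battery energy divided by load drawn.

def holdup_minutes(usable_battery_kwh, load_kw):
    """Minutes of runtime the batteries can supply at a given load."""
    return usable_battery_kwh / load_kw * 60.0

# e.g. a 2 MW load on 900 kWh of usable battery gives about 27 minutes,
# which is roughly the hold-up the Coreix report describes.
print(round(holdup_minutes(900, 2000)))  # 27
```

At multi-megawatt loads, batteries are a bridge to the generators, nothing more.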
CoreIX is still claiming on their website "The Coreix Premium Network has obtained 100% Network uptime over the last 4+ years." That doesn't tally with their status page at http://status.coreix.net/ - which was pretty useless when they were down. If you have a status page, you should make sure it runs on entirely separate infrastructure and a separate domain.
Testing often doesn't catch problems, as either it's not done on load or it's done in a controlled manner. Large spikes can knock out the control systems, and even then you could have tested it 5 minutes before and it could still not work for some reason.
Much like backups, a test is only relevant at that moment in time; it does not guarantee the situation going forwards.
Even with all that, an emergency power-off will defeat you, and you won't be allowed back in until the firemen say so. Over the years I've seen outages caused by overload, component failure, huge spikes, and a failed fan causing the fire alarms to go off.
Nothing is perfect however much testing you do.
In my experience it's generally the control systems that fail, and it requires an engineer to install a bypass, and then a second outage to take it out once everything is fixed!
Still annoying though!
Details received from Coreix
At approximately 16:48 GMT the Stratford, London facility lost mains power from the power grid. The time-line of events is as follows:
16:48 - Power to site lost - running on UPS.
17:15 - UPS systems depleted, generators failed to start.
17:30 - Generator failover failed to function despite multiple attempts - power engineers were dispatched.
18:32 - The power engineers arrived on site.
18:54 - The power engineer estimates 30-45 minutes to return power to site.
19:16 - The power was returned to site and the process of booting up each rack commenced.
19:35 - All racks powered up and brought on-line.
The last twenty-four hours have seen numerous power grid blips, but the systems as designed have taken the load, with the UPS battery backup and generators keeping the facility live, ensuring an uninterrupted service.
Tonight, however, at 16:48 a failure in the control boards prevented the generators from powering the facility once the power from the grid failed; a manual bypass was installed to get the facility back on-line.
The facility power engineers are continuing to monitor the site, and extra staff were drafted in to help.
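The Coreix timeline above reduces to a couple of durations; the timestamps below are taken verbatim from the report, and the code is just HH:MM arithmetic:

```python
from datetime import datetime

def minutes_between(start, end):
    """Minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

print(minutes_between("16:48", "17:15"))  # 27 minutes on UPS before depletion
print(minutes_between("16:48", "19:35"))  # 167 minutes from mains loss to all racks online
```

So the batteries bought them 27 minutes, and the site was dark, in whole or part, for nearly three hours.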
Pah... I have an account with the Halifax - well, soon I DID have an account with the Halifax, since after their 'power outage due to the weather' the other weekend their UPSes failed to work, their generators never kicked in until someone physically went to the site two hours after the power dropped, and they spent all day rebuilding their servers.
Apparently Halifax don't believe in offsite failover, and from what I gather they don't have any UPS either... for me, HBOS have earned a new moniker:
Halifax, Bank of FAIL!
HBOS, your summons will be in the post!
I can't see the problem here. Everyone's talking like they've had a major life experience (or not) here.
For those of you actually affected, thank your lucky stars you don't depend on TalkTalk.
For those of you not affected, Sainsbury's are still selling salted peanuts at a very reasonable price, and if you don't like salt you can always wash it off. Wet peanuts solve both dehydration and nutrition issues, enabling those ana+++ retentive amongst you to have a fucking good dump in the AM.
for nigh on 190 years.
I remember it quite well. I said to Charles, 'Mr Babbage, your machine is quite spectacular, but what about the power backup!? I know that young fancy of yours, Miss Ada, will lead you a merry dance, with her cog-whirling and fancy-schmancy mathematicals, but dear God man, don't forget the power backup.'
I said, with much foresight and deliberation: 'It should be a power that cannot be interrupted, a supply of uninterruptible power.'
I swear the old codger must have misheard me, what with his dickey ear and all that, and thought I had said 'Hun'. He picked up his fire stoker and chased me out the door. I was a sprightly lad at the time, so I led him a merry dance through the streets of London, Babbage huffing and puffing behind, quite out of steam. 'Yes', I thought, 'those who don't learn from history are destined to repeat it' - well, at least that is what old Eddy used to say. I did think he was a bit of a Burke though.
Having worked at a company that Tata came in and covered a major contract for, because the PHBs deemed we didn't have the expertise in-house, I am not surprised in the least.
They, like an awful lot of other huge Indian outsourcing companies trying to branch out, think they can just cut costs to the bone, throw in a load of inexperienced staffers to learn on the fly, and baffle everyone in upper management with managementspeak.
However, when it comes time to walk the walk as well as talk the talk, the outcome is woefully inadequate, and the response tends to descend into a flurry of email chains, each fwd or cc slowly descending their internal caste structure until it falls on the desk of someone so junior they can't pass the buck any further along. They will have to go and sort it themselves, whereby, being junior, they won't have the experience, and won't have the political clout to take the needed decisive action when a serious situation arises. They'll then send an email to someone who does have the political standing to sort something, but who will conveniently be too busy (hiding behind the couch to avoid any flak) to deal with it (and take some responsibility). Until the client screams for blood and points to SLAs, whereupon some shakeup will take place and, for a few weeks, people will try to field the issues.
Usually we found it easier to just quietly do the work ourselves while the Tata people sent huge email chains around, trying to generate as much noise as possible to cover the fact they weren't actually contributing anything.
Host in their DC? I'd rather have a Linux box on a BT DSL line!
Biting the hand that feeds IT © 1998–2020