When a server loses power, even for a split second, it can damage the hard drive.
True, very true...that's why you fit a UPS to bridge the gap between mains failure and backup generation.
Manchester workmen were blamed for knocking out hosting business UKFast's data centre today, after it seems some hapless bod cut a cable with a pickaxe. According to the Brit firm's status page, the problem arose this morning at 11am, affecting its MANOC 5 data centre. The incoming mains supply was lost to the site and …
"Gas mains are no more impervious to pickaxes than power cables are."
But you are very unlikely to lose both at the same time. They are entirely diverse services: the gas only needs to be there when the electricity isn't. As I would have thought was obvious.
Hence companies like eBay, Google, FedEx, Bank of America, Walmart, Coca-Cola, etc. use Bloom Energy servers instead of wasting power on online UPS conversion and keeping batteries charged.
>> There's nothing especially "clean" about Bloom, it's just a fuel cell that runs on gas, conceptually the same as a standard gas or diesel genset.
Well there is - fuel cells emit far less CO2 than a typical gas generation plant, and near zero of the other common related pollutants such as NOx, SOx and VOCs.
>> Produces 735-849 lbs of CO2 output per MWh.
As opposed to typical values of 2,117 lbs/MWh for current coal generation and 1,314 lbs/MWh for existing gas power plants - so a much lower CO2 output for fuel cells.
As opposed to typical values of 2,117 lbs/MWh for current coal generation and 1,314 lbs/MWh for existing gas power plants
And around 600 for modern cogeneration (CHP) sets which are popular for local off-grid supply.
Bloom isn't dirty by any means, but it's not especially clean either.
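The figures quoted in this exchange can be lined up directly. A quick sketch (the numbers are the ones posted above; the Bloom midpoint is my own assumption):

```python
# CO2-intensity figures quoted in this thread, in lbs of CO2 per MWh.
# Treat them as indicative, not authoritative.
emissions_lbs_per_mwh = {
    "coal plant": 2117,
    "gas plant": 1314,
    "modern CHP set": 600,
    "Bloom fuel cell (range)": (735, 849),
}

# Midpoint of the quoted Bloom range, used as the comparison baseline.
bloom_mid = sum(emissions_lbs_per_mwh["Bloom fuel cell (range)"]) / 2  # 792

for source, value in emissions_lbs_per_mwh.items():
    v = sum(value) / 2 if isinstance(value, tuple) else value
    print(f"{source:25s} {v:7.0f} lbs/MWh  ({v / bloom_mid:.2f}x Bloom)")
```

On these numbers coal comes out around 2.7x Bloom and existing gas plant around 1.7x, while modern CHP undercuts it - which is exactly the "cleaner than most, but not especially clean" point being made.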
Bloom Energy servers instead of wasting power via online UPS conversion and keeping batteries charged.
Bloom Energy units are just specialised fuel cells, effectively solid-state generators. It's just local generation with the grid as backup, nothing especially new in that, and as environmentally unfriendly as a standard gas generator. You don't need a UPS because you assume the grid backup is "always on", so there's no generator start-up gap to bridge should your primary source fail. If a backhoe should take out both gas and electricity conduits, you're screwed.
Standard gas generators emit lots of CO2 and are subject to grid and transformer losses. Fuel cells emit hardly any CO2 and the primary output is water vapour (and electricity!)
That's simply not possible. You're consuming a hydrocarbon in the presence of oxygen, it doesn't matter if it's a hot or cold combustion, the chemical products will be the same. See the data sheets on the Bloom website for the CO2 figures. Only a hydrogen-fed fuel cell will produce water alone.
Also there'll be no grid or transformer losses from a local gas generator, which will not be connected to the grid.
"and as environmentally unfriendly as a standard gas generator"
No, standard gas generator plants emit about double the CO2 plus various other pollutants, and are subject to grid and transformer losses. Bloom Energy fuel cells emit about half the typical CO2 output of a gas generator plant and a quarter of a typical coal plant.
"If a backhoe should take out both gas and electricity conduits you're screwed."
You could say the same if someone took out all the fibre connectivity to your data centre. Which is why diverse services are fed to your data centre some distance apart via concrete lined trenches. Usually with dual feeds too. If you look at the rather impressive customer list above, clearly it works.
Microsoft - generally acknowledged as one of the world leaders in datacentres - are putting these directly into racks:
http://www.datacenterknowledge.com/design/microsoft-launches-pilot-natural-gas-powered-data-center-seattle
Having worked on gas mains, leaking and otherwise, in a previous incarnation, I can assure you that only a very ancient, rusted-out one is vulnerable to a pickaxe. But a trenching machine... or a backhoe... that's different. BTW, I once helped drive a 1 1/2" gas pipe squarely through an underground telephone cable, knocking out service to portions of two adjacent states. It was fascinating to see service trucks and company cars full of engineers screech in from all points of the compass, form a seething knot and bicker about who was entitled to scream at whom.
No UPS can be guaranteed to function through a short-circuit or other dangerous situation (e.g. phase crossing).
However, a datacentre uses UPS only as a brief stopgap, and the slightest delay in starting up the generators will mean dead batteries and a power blip inside.
But "UPS" don't provide "uninterruptible" power. They just provide a backup, like any other. When a dangerous situation exists, even a high-end UPS will cut out for safety. Yes, I've seen them do it. In one case, a phase-crossing accident would literally hard-power-off the UPS instantly without warning or beeping or anything - just a single red light. Just bang, down, wait for power to return to normal. UPS was doing its job, before, during and after.
A pickaxe through a cable is exactly the kind of thing that can bridge the live and earth, or multiple phases for instance, and UPS can't completely isolate the inside from the outside.
When a dangerous situation exists, even a high-end UPS will cut out for safety
Which is why there's a distinction between High Availability (the UPS + generator) and Disaster Recovery (the second site with the hot standby on completely different power circuits). Each protects against a different type of fault.
I can tell you for a solid fact that APC UPS units get mighty peeved when they see 220 on the _ground line_. In that incident, the UPS went into full isolation and shut down hard, which protected the server that was on it from getting its hardware blown up (unlike two of the brand-new workstations at that site, which decided to set their power supplies on fire as the electrolytic caps blew out). Thankfully, the server passed its fsck and carried on after everything was brought back to normal.
(mushroom cloud icon, because it's not every day one sees flames jetting out the back of a brand new computer's power brick.)
>I can tell you for a solid fact that APC UPS units get mighty peeved when they see 220 on the _ground line_.
What, it's 2017 and this is still a problem! :)
Back in the late 1970s I worked for a company that installed IT systems next to railway lines, i.e. signalling systems, where fluctuating voltages on the ground line were regular events (i.e. every time a train went by). So the company had developed some rather fancy switchgear that sorted the problem. The other problem (to delicate IT systems) we saw in the early '80s was the power spikes caused by the then newly introduced thyristor-controlled systems; these were particularly troublesome as they were invisible to the then-new digital scopes but not to the analogue scopes.
@Roland6: It's usually only a problem when it's intentionally done by an electrician who got the bill for the outage they caused to the businesses in that office complex, *and* our bill for the replacement of hardware, technician labor, call out, etc. :)
If I recall correctly, said electrician shorted the 220V line in to the something - now that I'm thinking, it might have been the neutral line and not the ground line. (US wiring uses hot, neutral, and earth ground for most things.) In any case, two of the workstations did not like whatever they did, and the capacitors in their power supplies blew up rather messily. The UPS did exactly what it was supposed to do: isolated the load entirely, then shut down.
It was fun walking into the shop in the mornings and smelling freshly cooked power supplies.
"when they see 220 on the _ground line_"
If you have decent input and output protection this "should" never happen.
You can get away with assuming it's all fine 99% of the time, but it's that 1% that gets you - and in a lot of cases the UPS power supply systems are overcautious about shutting down under conditions that the mains keeps going under. Lots of arguments between power people and data centre people revolve around what's acceptable.
I worked as an engineer developing the circuitry and switching components for UPS systems running the safety systems at two nuclear facilities in the U.S. These systems delivered 360 V at 375 A, uninterrupted.
Rule #1 : Four independent UPS systems
Rule #2 : Two UPS off grid powering the safety systems. One at 100% drain, one at 50% drain
Rule #3 : One UPS being discharged in a controlled manner to level battery life and identify cell defects
Rule #4 : Recharge the drained battery
Rule #5 : Fourth UPS drain and recharge separately
Rule #6 : Two diesel generators off grid
This system may not guarantee 100%, but it is far better than five nines. There can be an absolute catastrophic failure on the supplying grid and it does not impact the systems one bit, because the systems are never actually connected to the grid. And before you come back with issues or waste related to transference: the cost benefits far outweigh the losses, because the lifespan of 90% of the cells is extended from four years by an additional 3-5 years by properly managing them in this fashion. And the power lost at this level is far less expensive than replacing the cells twice as often.
P.S. Before you call bullshit: there was extensive (corroborated) research at the University of South Florida over a period of 15 years on this one topic.
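The rotation in those rules can be sketched as a simple role cycle: each unit steps through the four roles in turn. This is a toy illustration of the scheme described above; the unit names and the idea of shifting one role per step are my own assumptions, not from the original post.

```python
# Four-UPS rotation sketch: two units carry the safety load (one at 100%,
# one at 50%), one is controlled-discharged to level cell wear and expose
# weak cells, and one recharges. Each rotation step, units shift one role.
ROLES = ["carry load @100%", "carry load @50%", "controlled discharge", "recharge"]

def rotation(units, steps):
    """Yield one {unit: role} assignment per rotation step."""
    n = len(units)
    for step in range(steps):
        yield {units[(i + step) % n]: role for i, role in enumerate(ROLES)}

for step, assignment in enumerate(rotation(["UPS-A", "UPS-B", "UPS-C", "UPS-D"], 4)):
    print(f"step {step}: {assignment}")
```

After four steps every cell string has seen one full discharge/recharge cycle, which is the claimed mechanism for extending cell life.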
"Those things cook batteries until after 3 years they are no good. "
A lot of that has to do with the discharge cycles they see. Lead Acid batteries DO NOT like being discharged and the deeper the discharge the fewer cycles they'll endure.
AGM batteries are compact and don't gas off, but that's their only advantage. If you want cells that last for decades, then use traction batteries or a string of flooded deep discharge telco cells like these exide ones: https://www.exide.com/en/product/flooded-classic-ncn (those are the nuke type, read the PDF to see the selection choices you have)
You'll need 24 of them.
Even at end of life they're impressive. A certain exchange engineer strapped 12 old satellite exchange ones into the back of an original Fiat Bambino, replaced the engine with an electric motor and would commute the 5 miles from home to work for a week on a single charge - he did this for 20 years (until he retired) and there was no noticeable degradation in range.
AGMs also don't generally like being charged all the time, either - that's what usually kills the battery packs on the APC units. Generally what happens is one or more of the batteries in the pack gets tired of dealing with the overcharge and goes dead open, at which point the pack stops charging entirely and you lose protection. Given the prices that APC charges for replacement battery packs, I'm more fond of buying a set of replacement batteries and rebuilding the packs - the downside is that you void the connected-equipment coverage by doing that. I'm not quite certain what the big 3-phase Emerson-Liebert beasts we use at RedactedCo use, but I don't have to worry about it because we have them on a maintenance contract, and the company we are using is quite reliable.
P.S. Before you call bullshit: there was extensive (corroborated) research at the University of South Florida over a period of 15 years on this one topic.
I'd be interested in a reference/pointer to the research; I suspect that because the end result is fewer battery sales, this isn't something many in the vendor community would want to be widely known.
"True, very true..."
So how does the drive suffer damage then? I assume you are talking about physical damage here. I've never seen a hard drive damaged by a power failure; data corruption yes, but actual physical damage, no. Aren't they designed to auto-park the head when the power trips? Maybe it's different in large data centres with thousands of servers. Please enlighten me :)
I'll wager the components of a recovery plan were all documented and tested, including physical test runs of the gensets. I doubt they did a full "turn off the mains power" test, but if they had, they'd have been in the same position (sitting in a dark data centre, thinking "shit!").
The other possibility is that they have turned off the mains in tests, and everything went perfectly. That's a known problem with standby power - it only works most of the time. And on that subset of times when it doesn't work, you usually need it and everybody notices.
A question for the DR professionals: What is the ACTUAL failure rate of a completely successful, fully automatic handover from interrupted mains to on-site generators? My guess is nobody does it often enough to know.
"What is the ACTUAL failure rate of a completely successful, fully automatic handover from interrupted mains to on-site generators? "
We get around 400-600 power breaks per year (rotten power feeds in the Surrey countryside). We've had about 5 unplanned outages in the last decade. That's with a flywheel kinetic system backed by diesel generators, and at least one of those was due to the generator starter motor battery being dead. Most of the time the flywheel rides out the break and the gensets only start at the 10-second mark.
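Those figures actually answer the question above. Taking the midpoint of the quoted range as an assumption, the arithmetic is:

```python
# Back-of-envelope handover failure rate from the figures quoted above:
# ~400-600 mains breaks per year over a decade, with ~5 unplanned outages.
breaks_per_year = 500          # assumed midpoint of the 400-600 range
years = 10
failures = 5

events = breaks_per_year * years           # ~5000 handover events
failure_rate = failures / events
print(f"handover failure rate ~ {failure_rate:.2%}")   # ~0.10%
```

So roughly one failed handover per thousand attempts, on a site that gets far more practice than most.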
Exactly.
The article very carefully skims over that bit, doesn't it.
The incoming mains supply was lost to the site and generators failed to take over the service.
Every datacenter I've ever dealt with does weekly on-load generator tests, and UPS failover tests.
Now we all know shit happens, no matter how much we try to prepare, but this does feel like they haven't been taking enough time on planning or testing.
Why didn't their UPS have enough capacity to keep things up, even if the generators failed to start cleanly?
Every datacenter I've ever dealt with does weekly on-load generator tests, and UPS failover tests.
None of which tells you that you're safe against a breaker cascade as the whole A load switches to B and idling PSUs in blade chassis reactivate etc. There is no substitute for randomly[1] flicking breakers, PSUs and HVACs on a routine basis to verify TIII resiliency and DR will work as required. Unfortunately, that requires a degree of testicular fortitude entirely absent in facilities staff (other than perhaps those actually angling for a P45) and so this sort of thing keeps happening.
[1] It has to be random, otherwise Ops will shift loads to other infrastructure to protect their uptime metrics and thus invalidate the results. Idling equipment draws less power.
Such testing is what I (as a conscientious consultant) recommend to my clients. But I also tell them: "If you ever have a real disaster (fire, flood etc) and 80% of services carry on working, you'll be a hero. If you do a disaster test and 98% of services carry on working, start looking for a new job."
"There is no substitute for randomly[1] flicking breakers, PSUs and HVACs on a routine basis"
Break the thing on purpose so you're better at fixing it. That could annoy a lot of people, but they too will get used to coping when it breaks, so will be less affected when it breaks for real. It's only reliable things which cause a massive problem when one day they break.
@jmch
Doesn't matter how big your batteries are, if the generators don't work the batteries will eventually run out.
No, that's obviously true, but my reading of this situation is that the mains power went out and everything stopped immediately. They should have had sufficient battery power to at least give them time to manually start the backup generators, but that doesn't appear to have happened.
but the UPS is connected to the server so when battery level = x it shuts down safely
I bet it isn't in any large datacentre - with tens of thousands of servers it's just going to be a big hassle and create problems of its own (false alarms causing shutdowns). Instead, they work on the basis of having UPSs sized to cover the gap till the gennys start up - and gennys to take over before the batteries run out. In principle, there should never be a need for a low UPS battery to shut down the servers. Apart from these loss-of-mains events, most other faults won't give you any warning before the server loses power.
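The "sized to cover the gap" approach is just energy arithmetic. A minimal sketch, with all figures (battery capacity, load, efficiency, usable depth-of-discharge) made up for illustration:

```python
# Rough UPS runtime sizing for the bridge-the-generator-start approach:
# usable stored energy divided by the load it must carry.
def ups_runtime_minutes(battery_kwh, load_kw, usable_fraction=0.8, inverter_eff=0.92):
    """Minutes of runtime: stored energy * usable fraction * efficiency / load."""
    return battery_kwh * usable_fraction * inverter_eff / load_kw * 60

# e.g. 500 kWh of batteries carrying a 2 MW hall:
print(f"{ups_runtime_minutes(500, 2000):.1f} minutes")  # 11.0 minutes
```

On numbers like these the UPS buys around ten minutes, which is ample for a genset that synchronises in under a minute and hopeless if it never does - exactly the failure mode in this incident.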
We saw a small temperature wobble for the 20 mins prior to the failure. I wonder if they were on UPS ?
That isn't uncommon. The servers are on UPS but the CRAC units aren't so temperature trends upwards until the generators kick in. That's the theory, anyway. I've never known anyone actually perform a controlled test to see what happens if the generators fail - does the UPS power run out before thermal overrun cuts in or not?
I have seen it happen for real though. The "cold aisle" input temperature hit about 80 degrees before the monitoring equipment comms failed. God knows what the peak was (dark site - no humans present).
That isn't uncommon. The servers are on UPS but the CRAC units aren't so temperature trends upwards until the generators kick in.
Well, I guess that would be restricted to a brief temperature bump if the generators took over in a minute, maybe two, but I've seen the thermograph register 10 degrees' rise in as many minutes in a computer room when the AC conked out, in a room way less loaded than a modern datacentre.
If the UPS is sized for the few minutes during which the generators are supposed to get their act together you might have maybe 30 degrees rise before the power fails, so not yet tripping overtemp safeties.
If the UPS is sized for the few minutes during which the generators are supposed to get their act together you might have maybe 30 degrees rise before the power fails, so not yet tripping overtemp safeties.
The problem with that is that the UPS sizing is always based on full DC capacity whereas a cold aisle containment unit is a semi-sealed micro-climate. That means that in a part filled DC it is possible for a given CAC unit to keep on trucking for several times longer than the design calculations suggest and ambient cooling doesn't help because the CAC is semi-sealed. In the case I referred to above, the equipment was still operating on UPS at least half an hour after HVAC went down.
It does remind me of a time, long ago, when I worked for a large, international, ummm, lets just say MoD supplier.
One weekend they decided to test the generator failover, they tripped the trigger and the two generators sprang into life.
For some reason, which I never understood, they decided to give the generators a load, namely the site mainframe... Which they did, and things were ok for a while.
Then generator #2 died, so they all ran off to investigate - nobody thought about putting the mainframe back to the grid feed at this point - which was a pity... as while they were all distracted by genny #2, #1 ran out of fuel.
When we all came in on Monday, they were *still* trying to get the mainframe to boot.
UK Fast provides Colo services.
You can bet that most of the affected Colo customers haven't invested in the infrastructure to have their gear distributed across two data centres. It is not something anyone expects to get from Colo unless they pay for it - double costs plus load balancers etc.
You can bet your arse that anyone with that setup stayed online, but attacking the DC for those customers that didn't is daft.
The only thing I'd want to know is why the generators didn't kick in. Our DC runs full load tests but sods law states that something will go wrong right when you need it.
Good old metrolink building the tram to the Trafford centre, if anyone has a visit between now and Christmas (why would you though?) please marvel at the temporary "roundabout" at event city. It's a sheer work of art, no doubt the result of weeks of architects drinking special brew and trying to draw just one round circle then giving up and throwing the lane rulebook for roundabouts out of the window while shouting incoherent drunk instructions to the workers. This "accident" does not surprise me.
I reckon that most of these road planning disasters are the result of a decision taken, it must have been around 1991, to remove the ban on bringing coffee mugs into the planning room where all the blueprints are laid out. It's the only explanation I can think of for the bizarre appearance, over the last three decades, of tens of thousands of roundabouts with no additional roads connecting into them, offset from the natural flow of the traffic by about two lanes' width, overlapping with each other or just plain badly built - planners putting their coffee mugs down on top of the road planning documents.
Thousands of Roundabouts?
How about the sudden rise of traffic lights operating 24/7 when there might only be a decent amount of traffic for 2-3 hours a day.
They are obviously just used to slow ALL traffic down and increase pollution due to more idling.
Traffic lights V2019 will all come with red-light cameras. All you habitual red-light jumpers (and that includes buses and HGVs) will need to beware.
We, the motorists, are a big fat juicy target who can't easily fight back.
....must have been around 1991, to remove the ban on bringing coffee mugs into the planning room
One has to wonder if that was actually coffee or tea in those mugs then, or something with a bit of alcohol content. I've worked with a few like that. I never, ever invited them to a meeting when their coffee/tea mug was less than half full.
We got alerted that our servers had gone down at around 1040 this morning - they came back online again at some time around 1130.
First issue is that the backup generators did not kick in at all - no idea why not, but there should never be downtime due to a power issue, whether that is because the generators take over or because you have more than one input feed for power.
Now, 3 hours later, whilst the servers are all online they are as laggy as hell - we suspect the network is to blame but anything that generates traffic to our database servers is really slow whilst the servers themselves are fine. Will see how the next few hours pans out.
MANOC 5 is supposed to be a Tier III facility which means (and I quote):
"N+1 fault tolerant providing at least 72 hour power outage protection"
If a single HVAC failure took out the DC that means they're either lying or else the secondary and tertiary power supplies plus the UPS and generators were all simultaneously both defective and not known to be defective. Even with negligence and incompetence bordering on deliberate sabotage, I don't find the latter option credible.
I know this is pretty far-fetched, but maybe they could implement some kind of system where there is a battery between the mains power and the datacentre infrastructure. If the special battery system becomes active, they could have some sort of power generator that takes over in a seamless manner.
This is already a thing. Our datacenter has this using an off the shelf facility-wide unit, and we're not even Tier III. This was a display of sheer incompetence, nothing more.
Also, wonder what happened to the worker. If the line was 13.8kV all that's left is a plasma burn, but something lower voltage might have been survivable(ish).....
EDIT: On second reading I missed the sarcasm :) That being said, a lot of folks don't know the difference between true double conversion and a "standard" UPS, I'd hope these operators weren't stupid enough to use single conversion units at the racks. Then again, since they're mentioning hard drives in the individual nodes, who knows....
I think that if the BA data centre had been run by managers schooled in the standards of reliability applied to airliners then their problems would not have happened. If my experience is anything to go by, It is most likely that they would have liked to do a better job but the bean counters had other ideas.
It's possible to make a robust, durable UPS setup for datacentres.
Until management decide it costs too much and shitcan it.
Bean counters have more sense than to cut corners in these area when you explain the consequences of fucking up. Management will let you talk and do it anyway.
The line *in the song* is actually "Heigh-ho, Heigh-ho, it's home from work we go", and in the official Disney lyrics for it the line "off to work we go" isn't even mentioned.
There's a bit more to it than that, but it's always a great argument starter down the pub.
Apparently this data center didn't have enough redundancy. A proper carrier-class data center is going to be fed from more than one mains supply, entering the building in different locations. Those mains supplies will then lead to separate UPS plants, separate PDUs, and finally to each critical piece of IT gear via A/B power supplies in each rack.
Any data center that has a single point of failure *anywhere* is not a data center in which one should run mission-critical workloads.
A proper carrier-class data center is going to be fed from more than one mains supply, entering the building in different locations. Those mains supplies will then lead to separate UPS plants, separate PDUs, and finally to each critical piece of IT gear via A/B power supplies in each rack.
Which is probably about N+3.
N+1 simply means the site has one independent backup, in the form of the UPS and gensets. It didn't work, but up until this morning everybody hoped that it did, and that "hope" element is common to most disaster recovery and resilience plans, no matter how many Ns they claim to have.
I've heard from a digger driver that they don't worry about things like digging through power cables or water mains - it slows them down too much and any fallout from damage they do is picked up by the insurance company, so why bother trying to avoid it?
It's no wonder it's such a common occurrence.
I've heard from a digger driver that they don't worry about things like digging through power cables or water mains
I heard that too, from a BT staffer. If you wrote to them weeks in advance they'd send a map saying where not to dig but if you put a JCB bucket right through they'd actually come out on site, same day too!
A pickaxe magically stops their UPS and generators from working. Yeah ok...
They are currently still working on "residual" issues. Which means their big clients are being ignored so that they can get the bulk of the smaller clients working. I think many people will be doing the same as us and changing provider.
Tesla has a battery system to prevent this.
Back in the day I worked on a navigation system transmitter site.
We had duplicate mains feeds from different segments of the grid, a massive bank of batteries and a ONE-CYLINDER ENGINED GENERATOR. We technicians religiously maintained the back-up system, carefully checking each individual 2 volt glass-walled cell, recording the battery electrolyte levels, internal resistance (annually), etc.
Once a week the station engineer would, without prior warning, disconnect the grid power to test the reserve power system. It never failed.
I would put money on this incident that regular maintenance was not performed.
Outages happen, simple as that, regardless of who you are, what tier you are, how big or small you are. Surely we've learnt that by now? There always appears to be shock and amazement when a DC or major cloud operator suffers an outage.
So where is your DR plan? All eggs in one basket? You should be able to implement a pretty decent automated failover that kicks in within 5 minutes or at least a manual one within 30 mins.
Perhaps you should look at your own DR plan whilst twiddling your fingers waiting for your server to come back up.
In cab UPS (dual fed everything) is worth a thought if you are colocated, it at least means your servers can shut themselves down cleanly in the event of a power failure.
Which gives me a thought: why don't DCs offer the facility for servers to know when they are on UPS and the UPS is running down? My kit at home (Windows and Linux) knows this and can shut down automatically to prevent damage, so why can't kit at DCs?
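The home-kit behaviour described here is typically done with Network UPS Tools (NUT): the UPS publishes status flags like OL (online), OB (on battery) and LB (low battery), and a client shuts the host down when both OB and LB are set. A minimal sketch, assuming a NUT server is already configured; the UPS name "myups@localhost" and the polling interval are illustrative:

```python
# Poll a NUT-managed UPS and shut down cleanly when on battery and low.
import subprocess
import time

def ups_status(ups="myups@localhost"):
    """Return the NUT ups.status string, e.g. 'OL', 'OB', or 'OB LB'."""
    out = subprocess.run(["upsc", ups, "ups.status"],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def should_shut_down(status):
    """True only when on battery (OB) AND low battery (LB)."""
    flags = status.split()
    return "OB" in flags and "LB" in flags

def monitor(poll_seconds=30):
    while True:
        if should_shut_down(ups_status()):
            subprocess.run(["shutdown", "-h", "now"])  # clean OS shutdown
            return
        time.sleep(poll_seconds)
```

In practice NUT's own upsmon daemon does this for you; the sketch just makes the on-battery/low-battery logic explicit. The barrier in a colo isn't the software, it's getting the DC to expose its facility UPS state to thousands of tenant machines.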
Please find below an interim report following the power issue affecting our MaNOC5 facility data centres on Tuesday 12th December 2017.
Please be aware that this is an interim report based on the information we currently have available; we are waiting on further information from our generator suppliers, which we will add to the final report when it is available.
At 10:28 GMT on Tuesday 12th December 2017, our MANOC 5, 6 and 7 facilities were impacted by an instability on the incoming mains power as a result of a civil contractor passing a spike through the main feed. This was not work being carried out on behalf of UKFast or within our site but at another location 0.75km away on the path to our onsite transformer.
The UPS system supported the load for its designed time and the generators started; however, due to the physical damage to the power cable, service to the site was unstable and intermittent. As a result, the generator set failed to synchronise and take over service.
UKFast engineers on site were alerted that the generators were unable to take over the load and that manual intervention was required. During this time the UPS batteries depleted past the designed runtime, resulting in total power loss to equipment on site at 10:40 GMT.
The manual synchronisation was completed at 10:48 GMT and the onsite generator set took over power, enabling us to start bringing power back on for client services.
Our engineers have worked throughout the day to restore individual services and continue to do so for those clients who remain affected by the power issue. The process of bringing back all services in a facility requires more than just powering on equipment; coupled with the resulting failures in physical devices that have required replacement, this makes it a lengthy process.
Once we have resumed full service, we will be investigating what we can do to prevent this from happening again, covering both the incoming power issue and the time taken to restore service for some of our clients.
We will update this report as we get more information from our generator supplier and also from our teams to discuss the delays in resolving this issue.
Kind regards,
Charlotte
Charlotte Bentley-Crane
Account Management Director
UKFast
The UPS system supported the load for its designed time and the generators started; however, due to the physical damage to the power cable, service to the site was unstable and intermittent. As a result, the generator set failed to synchronise and take over service.
Holy crap! You mean that their transfer switch didn't automatically cut the mains feed when the power went wobbly and back up kicked in?
And not restore mains power until it's stable? (For at least 15 minutes, if I recall correctly)
Or, just as bad, they had more than one generator operating, and no way to synchronise them?
Wait, what?
If that statement is true, then they are about to get a very nasty visit from National Grid.
The only way a break or short upstream of the local transformer/substation could affect generator sync is if they backfeed their generators into the incoming mains supply.
Or possibly their PE arrangement is dangerously wrong.
Both of which are illegal due to being literally deadly.
I mean, how much must they have scrimped on this build to have not got this right!! So the incoming feed is wobbly, the batteries take over while the gen sets kick in... they have massive fuel tanks, so why would they have worried? I thought they were able to hold power for 3 days? I'm guessing someone from Sudlows is going to get a kicking for this!
Bad maintenance, or most likely insufficient maintenance due to lack of investment. This shouldn't be a thing these days, but it still happens where someone in the organisation can't see what a piece of switchgear does, so can't see the point of having a PMA. OK, you save the company maybe a grand a year, and then get hit by service credits (or worse) and lose a six- or seven-digit sum as a fine or compensation claim.
Wait ... so the MD claims that it was a pickaxe, but there's no evidence. I mean ... ENW have nothing about it on their website, no outages. If someone did hit a power cable with a pickaxe ... I'm reckoning that'll be a toasty person. I wonder how many MVA they have going through that cable that this alleged workman hit, with his super-sharp pickaxe that got through the armour surrounding the cable and into the main cluster, conveniently bridging live and earth. Which all seems mighty convenient, or unlucky.
Even if it was a power feed issue ... I thought they were using dual power feeds, which begs the question as to what happened to the other power feed.
Finally, we then have the issue of no power failover. If you host with them, I would be asking to see evidence of previous failover tests, to see whether they have had issues before.
Between this and the crazy peepshow dancing girls on stage - bad few days for UKFast. No doubt some inspirational blog will follow.
We were offline 11.5 hours because of this.
We summarised our events here: https://www.linkedin.com/pulse/onbuy-offline-115-hours-due-ukfast-data-centre-failure-cas-paton/
The funny part of this is that we called them at the height of the incident, as a mystery shopper - check it out !!
Please find below an update on yesterday's issue from our Critical Power Director, explaining our return to mains power:
We have been running successfully on our backup generators since 10.48am, 12th December 2017.
We have 12,250 litres of fuel which equates to a run time of 49 hours, with tankers on standby who are able to deliver with immediate effect.
At this time we await confirmation of a time to re-energise the power network to MaNOC 5, 6 & 7. The cable has been fixed by Electricity North West. We understand this will more than likely be later today.
The return to mains power will be done in a controlled manner. The UKFast data centre is currently locked onto generator power, so when ENW switch the power back on we will prove it is working perfectly before starting the process of switching back to mains. This mitigates any risk of the power coming on intermittently or incorrectly.
Once ENW confirm to UKFast that the electric supply is energised, we will check that the electricity supply is present at our transformer and check the electricity for phase rotation to ensure the supply is electrically correct. Even though ENW should have proved this themselves, we will double check it before proceeding to the next step.
Once we are happy that the supply is stable and correct, we will activate our automatic power change-over system. The change-over system will monitor the power for 5 minutes and then initiate the automated return to mains power. This is called a proving period.
At this point the generator's bus-coupler main breaker will open, removing generator power from the system. The UPS, which is constantly in operation, will automatically support the technical load for around 10 seconds while the mains electricity power circuit breaker is closed and reconnected to the power systems.
The UPS battery system also supports the mains change-over for a further 2 minutes as the UPS slowly transfers from battery power to mains power. This is called a "walk in" procedure and removes the risk of the UPS seeing huge power demands and creates a smooth transition. There is no break in power to the technical load during the walk in period.
During the walk in period the air-conditioning will stop for a moment and restart and also perform a walk in procedure to ensure no stress is placed on the power network. This takes around 1-2 minutes as the CRAC units stop and restart.
During this period the generators continue to run should they be required. Once the transfer is complete the generators run on for a further 3 minutes then shut down and go back into standby mode.
Return to mains power is then complete. Should there be any issue during the transfer period, we can automatically switch back to generator power.
On site managing the process we have the UKFast Electrical Team and Ingram Generator Service Partners.
It's not uncommon to lose power in the UK. In the last 12 months we have performed this exercise twice after losing power to the area. Both times the systems have switched over perfectly. This is one of the reasons we run an N+1 environment.
We prove the start signal on a weekly basis, which fires up the generators, and the UPS tests itself every day at 8am. We are the only data centre that we are aware of to hold NICEIC accreditation, meaning we are a fully licensed electrical contractor and can manage and maintain our data centres without the need for external contractors.
Unfortunately we do not have an exact time when the supply will be reconnected, however we are on standby and working closely with ENW who indicate it will probably be late afternoon or early evening.
Yours sincerely,
Miles Allen
Critical Power Director
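The return-to-mains procedure described in the update can be sketched as a simple timed sequence. The timings (5-minute proving period, ~10-second UPS ride-through, 2-minute walk-in, 3-minute generator run-on) and the fuel figures (12,250 litres for 49 hours, so roughly 250 L/h) come from the update itself; the function and step names are illustrative assumptions, not UKFast's actual control logic.

```python
# Illustrative sketch (not UKFast's real control system) of the
# return-to-mains sequence described in the update above.
# Timings and fuel figures are the ones quoted; names are assumed.

FUEL_LITRES = 12_250
RUN_TIME_HOURS = 49
BURN_RATE = FUEL_LITRES / RUN_TIME_HOURS  # ~250 L/h across the generator set

def return_to_mains(mains_stable_for_s: float) -> list[str]:
    """Walk through the change-over steps once mains is re-energised."""
    steps = []
    steps.append("check supply present at transformer and phase rotation")
    # Proving period: the change-over system monitors mains for 5 minutes.
    if mains_stable_for_s < 5 * 60:
        steps.append("proving period not met: stay on generator")
        return steps
    steps.append("proving period passed (5 min)")
    steps.append("open generator bus-coupler breaker")
    steps.append("UPS supports load (~10 s) while mains breaker closes")
    steps.append("UPS 'walk in': ramp from battery to mains over ~2 min")
    steps.append("CRAC units stop, restart and walk in (~1-2 min)")
    steps.append("generators run on 3 min, then return to standby")
    return steps

print(f"burn rate ~= {BURN_RATE:.0f} L/h")
for step in return_to_mains(mains_stable_for_s=600):
    print("-", step)
```

Note that the sequence never leaves the load unsupported: generator, UPS battery, and mains overlap at every hand-off, which is why the letter can claim no break in power during the walk-in.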
"It's not uncommon to lose power in the UK"
Okay, first of all.. I'm not the most experienced guy posting here that's for sure... I've only been enduring a career in IT for 13 years. But during those 13 years, working in a number of data centres, I've never actually come across the loss/failure of external power.
It's different at home. Quite regular blips there... but I would like to think the quality of the connection to a data centre would be far superior to the wet string they use for over-populated 1970s housing estates.
As usual all I hear is excuses from UKFast. They don't have a clue!
Maybe if they did depend on external contractors they would actually have a better power connection... I've always thought you should stick to what you are good at. If you're good at running a DC, then do that, and contract in some experts for dealing with the power side.
And why only a single power connection?
"we run an N+1" - errrr, except for power feed....
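For readers unfamiliar with the jargon: "N+1" just means one more unit than the load requires, so any single unit can fail without dropping capacity. The commenter's point is that a single mains feed has no "+1" at all. A minimal sketch, with hypothetical unit counts:

```python
# Rough illustration of N+1 redundancy: capacity survives any single
# unit failure, but a non-redundant element (e.g. one mains feed) is
# still a single point of failure. Unit counts are hypothetical.

def survives(units_total: int, units_needed: int, failures: int) -> bool:
    """True if enough units remain to carry the load after `failures`."""
    return units_total - failures >= units_needed

N = 3  # hypothetical: the load needs 3 generator sets
assert survives(N + 1, N, failures=1)      # N+1 rides out one failure
assert not survives(N + 1, N, failures=2)  # but not two
assert not survives(1, 1, failures=1)      # a single feed has no spare
```

Which is why dual independent power feeds matter: redundant generators and UPS downstream cannot compensate for a single shared feed upstream, as this incident demonstrated.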
I was on call for a healthcare provider when the grid failed on 1st January 2011. The campus-style site had a link to a 33kV substation and an 11kV substation. The 33kV distribution went down and the 11kV only supplied a corner of the site.
The DC had a UPS and a genny (the site had 4 gennies in total). The UPS lasted 30 minutes less than the grid outage; the grid was down for a little over an hour. The genny had fluid level sensors in its bund - you know, to stop it blowing itself up if there was a leak. The bund was full of snow so the genny wouldn't start. An E&F manager bypassed the bund level switches but found the genny would start and idle perfectly until the revs were needed to actually make some power, at which point it gasped and died.
There was a tiny leak in the non return valve in the diesel filter which was sucking air in and interfering with the fuel pressure.
4 problems in a row:
Grid failure
UPS batteries exhausted
Generator safety cut out
Generator fuel supply issues
In the aftermath the UPS was upgraded to hold for at least 60 minutes and the gennies were reconfigured with a changeover mechanism to allow a second genny to be manually switched over to supply the DC.
Shit happens. Good thing it was a Sunday and a bank holiday; there was plenty of time to stand it all up again afterwards. Nobody rehearses starting a cold datacentre. When your DHCP is down and your phones have rebooted (because the switches went down) but not picked up their IP addresses, TFTP configs etc., it quickly affects absolutely everything. And what do you start first?
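The "what do you start first?" problem is essentially a topological sort over service dependencies. A minimal sketch using Python's standard-library graphlib; the service names and dependency edges below are made up for illustration, not any real site's inventory:

```python
# Hypothetical cold-start dependency graph for a datacentre: each
# service maps to the services that must already be up before it can
# start. Names and edges are illustrative only.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

deps = {
    "core-switches": [],
    "dhcp":          ["core-switches"],
    "dns":           ["core-switches"],
    "tftp":          ["core-switches", "dhcp"],
    "voip-phones":   ["dhcp", "tftp"],   # phones need IPs, then configs
    "app-servers":   ["dns", "dhcp"],
}

# static_order() yields a start sequence where every service comes
# after all of its prerequisites.
order = list(TopologicalSorter(deps).static_order())
print(order)  # switches come first, phones and apps last
```

Writing such a graph down before the outage, rather than reconstructing it from memory at 2am, is most of the value of a cold-start runbook.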
On a side note, everybody that was needed came in and stayed until we stood it back up - but a combination of dire management that treated everyone as a consumable and a desperate arse covering blame culture means that NONE of them are working there anymore. But the experience they've gained is invaluable.
AC for obvs!
Is it ironic that in April UKFast won the government contract to support effective response to power outages?
"British hosting firm UKFast has secured a contract to supply the UK Cabinet Office with a cloud platform for its emergency ResilienceDirect service, which supports effective response to incidents like natural disasters, terror attacks and power outages."
https://www.ukfast.co.uk/press-releases/ukfast-secures-six-figure-government-cloud-deal.html
More seriously, Trafford Park is also susceptible to the odd local scrote (or disgruntled business owner) torching BT manholes by throwing petrol down them, or burning cabinets. It last happened to the comms lines supplied by the Salford Quays exchange in the late 1990s and early 2000s.