Uninterruptible Power Supply
The Revenge of the Oxymoron...
IT infrastructure company Pulsant suffered a power outage at its Maidenhead data centre last night that cut websites off from the internet. Although electricity was restored soon after by routing around a dead uninterruptible power supply, the blip knocked out web hosting firms' servers and firewalls, and issues were still …
We were one of the unfortunate hosting companies (servwise.com) that experienced the power failure, on three of our servers. Luckily they were not our most-used servers and the outage had little impact on our customers. I just hope the same thing does not happen tonight when they put the UPS back. :(
Been in 3 DCs; all of them lost power at least once, despite it being delivered by N+1 UPS with dual feeds. Important stuff needs an in-rack UPS too, just to add another point of failure!
Sure, shit happens and that's the way of the world, but this is a yearly event at Blue Square/Pulsant. The UPS system in the BS2 and BS3 areas has been nothing but trouble for as long as I can remember. Last year they evacuated the site because one of its components went up in smoke, tripping the VESDA alarms. The whole site was cleared and it was a good couple of hours before customers could regain access to start picking up the pieces.
Right now I think my garage and a two stroke generator is a safer bet for up-time than my racks in BS.
maybe they should get a UPS for the UPS?
Why does this remind me of customers who faithfully do nightly backups of their data, without ever having tested whether the restore procedure works?
So much for dual redundant power supplies. You would think that when doing maintenance work on a UPS they would make sure there is an alternate power source available for exactly this eventuality!
>> ... they would make sure there is an alternate power source available ...
Err, if you read the article you will see that they did in fact have an alternative and were able to bypass the UPS to get power to connected systems. As someone who's experienced this first hand, I can tell you that bringing up a large setup from cold can be tricky - it's hard not to build in many dependencies (which can be circular!).
For example, your main routers and firewalls include references to devices/services by DNS name*. From a cold start, your DNS servers aren't running, so the firewalls either fail to boot or fail to load the correct setup.
So, you think, let's bring up the DNS first - but of course, they don't work properly until you have internet connectivity.
So you may have to do some work to get internet connectivity up, then get some other systems up, then go back to the first bits and get them up with the normal config, and so on.
Unless you put a lot of time, effort, and money into managing your startup process (which remember, you plan to never actually use), then all this startup scheduling will have to be done manually.
* If you use IP addresses then you introduce a different management problem of making sure they all get updated as required.
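The circular start-up problem described above can be sketched as a small dependency graph. This is only an illustration: the service names and their dependencies are invented for the example, and `startup_order` and `break_cycle` are hypothetical helpers, not anything from the article. It uses `graphlib` from the Python standard library (3.9 or later) to look for a boot order, then shows that dropping one dependency (say, a bootstrap firewall config that uses IP addresses) makes an order possible.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical dependency map: each service lists what must be up before it.
deps = {
    "firewall": {"dns"},       # firewall config refers to hosts by DNS name
    "dns": {"internet"},       # DNS needs upstream connectivity to work properly
    "internet": {"firewall"},  # no connectivity until the firewall is up
    "database": {"dns"},
}


def startup_order(deps):
    """Return a valid boot order, or None if the dependencies are circular."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None


def break_cycle(deps, service, dropped):
    """Return a copy of deps in which `service` can boot without `dropped`."""
    fixed = {name: set(needs) for name, needs in deps.items()}
    fixed[service].discard(dropped)
    return fixed


print(startup_order(deps))                # circular, so no order exists
bootstrap = break_cycle(deps, "firewall", "dns")
print(startup_order(bootstrap))           # an order exists once the loop is cut
```

In practice the "cut" is the manual step: bring the firewall up with a minimal config, get connectivity and DNS going, then go back and load the normal config.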
Then you have all these systems that had an uncontrolled shutdown. They will all need to check their filesystems - which may or may not take a long time (so many variables). Databases will also need checking. During these checks, there may well be errors found, and if you are unlucky these errors may be serious.
And lastly, some of these systems may have been running for years without ever being powered off. It's not uncommon for them to fail once they are powered off - so now you have dead hardware as well.
So putting all this together, it's no surprise it takes time to get things back up after such an incident.
Perhaps I should clarify my point about having an alternate power source available.
I mentioned dual redundancy in the sense that if the output of the UPS they were working on fell over (as it did), there should have been a secondary source of power to the server racks already online, so that there was no power interruption at all, thus preventing the hard shutdowns, cold boots, circular dependencies, etc. Any competent DC engineer makes damn sure that there is always online backup power ready to take over instantly.
Don't worry, I know all about server farm power failures and the trouble bringing them back up. I have been in the industry for years, so I have also faced my fair share of these kinds of problems.
Problems come when some bean counter says no.
"Problems come when some bean counter says no."
When you start listening to bean counters with regard to infrastructure needs, you have failed.
The engineering company I work for is essentially run by the bank's accountants, and accountants don't know shit about general engineering, laser cutting and steel fabrication. All they see is jobs going out and money coming in. They neither see nor understand the need to replace tooling that still works, even though the staff using said tooling are screaming out for new tools because they need to do 10x as much setup and lose parts to poor quality, because the still-working tools are drastically worn.
"during planned maintenance"
Well, we all have a few bar-room stories there...
Proviso: I'm an old mainframe systems programmer, so naturally biased .-)
Yup, always amusing that one.
My favourite was an IBM System/38 that was having a microcode fix installed and was down to engineering mode for the short, planned, maintenance window. The '38 has four cooling fans, one on the CPU / cards (critical) and three on the PSU (any two will do). The engineer noticed that one of the fans on the PSU was dead and offered to swap it, as he had one in the car and it would only take a few minutes to drop from engineering mode to full power off.
15 minutes later, the thing's off and a new fan is installed in place of the dead one. Switch on and one fan, the new one, starts up. We all stand there, hoping it'll make it through the bootstrap to the boot logon so it can be shut back down, before the inevitable thermal check. It didn't.
Needless to say, he didn't have three more spares on him and went from hero to zero......
I stopped reading right there, as that explained it all... Lumison were always a useless shower of dolts.
Best one I ever had was a server dropping offline with no warning. The machines above and below it in the rack were still operating perfectly, so I knew the problem had to be local to that one machine.
Phoned up to ask if someone could have a quick nose at the machine and see why it wasn't responding to anything, only to get the response: "That would be the one that was belching out smoke."
Turned out that the filtering cap across the mains inlet on one of the PSUs had caught fire, but because the PSU was still operational (albeit with flames) it hadn't switched to the backup.
Goes up in smoke, and doesn't stop! I need one of those - even if just for the fun! :D
Partial power disruption during scheduled maintenance = somebody tripped over the cord.
The company I used to work for had data centres all over the UK. One of them had a prolonged outage starting in the small hours of the morning (for which I was on call) because a capacitor on the HV side of one UPS had physically exploded, taking the HV side of the adjacent redundant controller with it; one had an extended outage when a static switch melted during a routine test, causing a complete loss of power to the distribution panels; one had a shortish outage (but long startup) after an electrician accidentally dislodged a cut-out lug at the top of a distribution panel and shorted two phases together in said panel, causing the UPS to shut down (they don't like having phases crossed!); another had a planned run on generator curtailed because the building's owner had just finished bricking up the inlet vent for the generator (without notifying anyone), resulting in the local fire service decreeing that said generator was now a fire risk and had to be shut down.
Then I moved jobs, a new UPS got installed at $newjob, and it transpired someone had got an RMS calculation back to front and said UPS was therefore only 70% capable of full load and kept going into bypass. It got made bigger, and then controllers kept failing.
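The 70% figure is close to 1/√2 ≈ 0.707, the ratio between the RMS and peak values of a sine wave. Whether that is what actually went wrong here is purely a guess, since the post does not say which calculation was reversed. A minimal sketch, assuming a sinusoidal load and a made-up requirement of 100 A RMS, shows how applying the conversion in the wrong direction leaves a unit rated at about 70% of what is needed:

```python
import math


def rms_from_peak(peak):
    """RMS value of a sine wave with the given peak value."""
    return peak / math.sqrt(2)


def peak_from_rms(rms):
    """Peak value of a sine wave with the given RMS value."""
    return rms * math.sqrt(2)


# Hypothetical requirement: the load draws 100 A RMS.
required_rms_amps = 100.0

# Assumed mistake: the RMS figure is treated as a peak value and
# converted "back" to RMS, so the unit is sized for too little current.
mis_sized_rms_amps = rms_from_peak(required_rms_amps)

fraction = mis_sized_rms_amps / required_rms_amps
print(f"UPS can carry {fraction:.0%} of the real load")
```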
Since I moved offices to over a mile away, it hasn't barfed. Coincidence?
I wonder why more companies aren't integrating UPSes into the power supply. It's far easier to integrate it into that part.
One possible design would be to have a special 14 volt output on the PSU to charge the battery, and then a second primary winding and switcher to run off that battery.
Another design would be a dual stage power supply. You'd first have a simple coarsely regulated power supply giving you about 14 volts (or a multiple of that) for charging. The batteries are then connected in parallel to the first stage. Then from there on you'd have a second stage working on input voltages between 12 and 14 volts (or multiples of that) and giving you the voltages you actually want.
The first design would probably be more efficient; the latter would be simpler and wouldn't have any "switch" between the two modes. Once the power goes out, the voltage of the battery will simply drop from about 14 volts (required for charging) to about 12-13 volts.
Sidenote: Obviously you'd use lead acid batteries which can be charged with a constant voltage.
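The "no switch" point of the two-stage design comes down to the second stage's input window covering both the charging voltage and the on-battery voltage. A rough sketch using only the figures quoted above (about 14 V while charging, about 12-13 V on battery, a 12-14 V second-stage input, all scaled by the number of batteries in series); the function names are made up for illustration and nothing here models a real charger or battery:

```python
# Per-battery figures taken from the post above (approximate).
CHARGE_V = 14.0              # first-stage output while mains is present
ON_BATTERY_V = (12.0, 13.0)  # bus voltage once mains is gone
STAGE2_INPUT_V = (12.0, 14.0)  # what the second stage is designed to accept


def bus_voltages(n_batteries):
    """Bus voltages for n batteries in series ("multiples of that")."""
    low, high = ON_BATTERY_V
    return {
        "charging": CHARGE_V * n_batteries,
        "on_battery": (low * n_batteries, high * n_batteries),
    }


def stage2_covers_both_modes(n_batteries):
    """True if the second stage runs in either mode without any switchover."""
    in_low = STAGE2_INPUT_V[0] * n_batteries
    in_high = STAGE2_INPUT_V[1] * n_batteries
    bus = bus_voltages(n_batteries)
    batt_low, batt_high = bus["on_battery"]
    return in_low <= batt_low and batt_high <= in_high and bus["charging"] <= in_high


for n in (1, 2, 4):
    print(n, bus_voltages(n), stage2_covers_both_modes(n))
```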
"a partial power disruption during scheduled maintenance"
So, somebody hit the wrong breaker?