Years ago, I received a report that we had a faulted UPS at our data center, which, as it failed, would take out half the power supplies in one of our production racks. I duly headed down to investigate and saw that the fault light was on and the battery was drained. Assuming it might just be a temporary glitch, I hit the reset button on the UPS. The next thing I heard was a great, terrible silence as all of our production systems, including our core switches and SAN, went offline due to every single UPS in the data center shutting down. After I changed into my emergency pair of brown trousers, I called the most senior manager I could get hold of and explained that we were dead in the water, resulting in an "all hands on deck" call.
The scenario that emerged was as follows:
Due to a lack of confidence in the data center's UPS/generator system, we had installed our own UPS units. Fire safety laws mandated that all power in the facility be able to be shut down, so the UPS units were wired into an emergency power-off (EPO) circuit: one single EPO circuit for the entire facility. When I reset the faulted UPS, it shorted back into the EPO circuit, which caused every other UPS in our cage to receive an EPO signal and shut down. This led to a bunch of us standing around the back of the culprit UPS with an electrician, trying to safely remove the EPO wire, which would not come out because it had physically fused to the plug. Eventually, we reached a Mission Impossible/James Bond-style scene where the wire cutters came out and we had to make the call to just cut it and hope that nothing worse happened.
One crispy-fried electrician later . . . (I jest, of course)
With the EPO wire cut, all the other UPSes came back online, powering everything back up. We removed the offending UPS and began the cleanup process. Fortunately, it took only about four hours from the initial failure to final confirmation that our production systems were back up.
The UPS manufacturer's response to this behavior from their faulty hardware amounted to "bummer, dudes." A few weeks later, we pulled all of our UPS units out, having decided we were better off relying on the data center's power backup after all.