One fine morning around 3am I rebooted one of my servers in co-lo. After 10 minutes it did not come back on-line, so I hit the PDU to power-cycle it. No alarm bells rang at this point since this particular machine was very, very cranky due to a bad temperature sensor on the motherboard. Most times killing the power for 20 minutes would bring it back online.
By 4am the server was not up and I got a text message that another server had shut down while yet another was starting to shut down. Of course this had me very worried and I hopped in the car to head to the co-lo.
The co-lo is through two fob-key doors, up a split flight of stairs in a back stairwell. As I approached the door I felt a lot of heat. This brought back memories of all the fire safety videos in school and interstitial PSAs from Saturday morning cartoons: a hot door means fire! I sampled the air a few times and did not smell any smoke, so, like any idiot would do, I presented my fob to the sensor and opened the door.
I thought I was going to pass out from the heat. I had not felt heat like that since a summer many years ago on a trip through Texas when my flip-flops melted to the asphalt parking lot of a truck stop. I got inside and found the air conditioning was not working and neither was the air handler. The thermometer on the wall at the door read around 160F. Servers were beeping, some had shut down already, and I had to get the heat out of there. But the room is a natural heat trap: windows covered over with foam insulation board, an inside door that I cannot open, and an outside door which just vents into a stairwell with no exhaust ventilation.
I started calling the co-lo operators and leaving nasty messages. I found two exhaust fans which had been closed up when the new dedicated A/C was (recently) installed and cut them open, only to find them stuffed with insulation and covered on the far end. (This was the first time the A/C had been given a good running due to hot weather.) I had pulled down one of the foam insulation boards and was just about to chuck a stool out the window to ventilate the room when someone showed up. Once we got the temperature in the room down enough that the blower would turn on and the compressor would run, someone had to run a hose on the compressor to keep it from tripping until the A/C guy could turn up.
Obviously the A/C had failed. We figured out from one of the servers' graphs that sometime around 8pm the A/C compressor failed and the temperature inside the server rose steadily from 78F to around 135F. It held there for a couple of hours until around 1:30am, when the blower motor went into thermal shutdown and the temperature in the server rocketed to about 180F before the graphs stopped. Tying into some of the other stories here, as I recall a contributing factor to the failure of the compressor was a reversed phase. I was too angry about the whole event to stick around and get the whole story.
MRTG has a neat feature to trigger a script when specified variables hit specified values. I now have my UPS graphs trigger alarms in my office and text messages when temperature (amongst other variables) hits the danger zone. It is a co-lo, after all, and not a data center, is what I tell myself, along with the fact that I should have been watching that crap from the get-go.
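For anyone who wants to set up the same thing, here is a rough sketch of what such a trigger script could look like. This is not my actual script. It assumes MRTG's threshold keywords (something like ThreshMaxO[ups-temp]: 95 together with ThreshProgO[ups-temp]: /usr/local/bin/temp_alert.py in the config) and that MRTG hands the script three arguments: the target name, the threshold that was broken, and the current value. The target name "ups-temp", the value 95, the script path, and the notify() hook are all made up for illustration; check the MRTG docs for your version before trusting any of it.

```python
#!/usr/bin/env python3
"""Hypothetical MRTG threshold handler for a temperature target."""
import sys
from datetime import datetime


def build_alert(target, threshold, value):
    """Format a one-line alert from the three arguments MRTG is assumed to pass."""
    return f"ALERT {target}: reading {value} crossed threshold {threshold}"


def notify(message):
    """Placeholder delivery: write a timestamped line to stderr.

    Swap this for whatever actually reaches you at 3am
    (SMS gateway, pager, a siren in the office).
    """
    stamp = datetime.now().isoformat(timespec="seconds")
    print(f"{stamp} {message}", file=sys.stderr)


def main(argv):
    # Assumed call shape: script <target> <threshold> <current value>
    if len(argv) != 4:
        print(f"usage: {argv[0]} TARGET THRESHOLD VALUE", file=sys.stderr)
        return 2
    notify(build_alert(*argv[1:4]))
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

You can test it by hand before wiring it into MRTG: run it with three arguments such as ups-temp 95 101 and make sure the alert actually lands wherever you will see it.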