New Year's Eve
I'm keeping my fingers crossed for this year. Last year was a disaster. At about 3pm on NYE, some alerts were raised by our ISP. Trying to get in to have a look, I found a number of machines strangely non-responsive (including our main monitoring server). Thinking the worst, I had a look at the UPS logs, which showed the output had gone down for a few seconds. I managed to reboot a number of machines via remote PDUs and get to a more-or-less working state.
15 minutes later *everything* went dark, so I was off to the DC. When I got there, I was greeted by silence from the racks and a 160kVA UPS festooned with red lights. One phase of input was gone and the UPS was in shutdown. I managed to get hold of an electrician and the on-call UPS engineer. The sparky arrived first and found a blown 300A fuse in the UPS feed. We searched in vain for spares but managed to come up with a 200A in the same size, which would do at a pinch. I went back up to the DC and, via the radio, asked the sparky to switch the breaker back on. I was confronted by a 6-foot fountain of sparks leaping from the front of one of the redundant UPS rack units and a very loud bang indeed. If I'd been standing in front of it, it would not have been pretty.
Not too long after, the UPS guy arrived with the smell of smoke still heavy in the air. The scorched unit was opened up, revealing a main board covered in soot and the input wires from the rectifiers melted back by over half an inch from where they had been soldered into the board, with blobs of molten metal scattered around. The UPS chap, although rather surprised, checked all the contacts in the frame, which had luckily survived, and set off back to base to get a replacement unit.
At about 5am he returned, new unit in hand. We had to replace all 3 phase fuses, and then there was a very tense moment as the breaker was thrown again. Luckily, power was finally restored. Thankfully, due to the way the days fell, we had two more days to recover everything. I called in the rest of the team and managed to get 3 people to help me sequence the power-on (about 120 physical machines and a few hundred VMs). I left, exhausted, by 7pm (but was still connected from home), and by 11pm on the 1st we had all the servers up, with the application guys in Melbourne finishing up on the holiday Monday.
It ruined New Year's for a good few of us that year. And we only got 1.5 TOIL/OT from it, but at least the "right" people remembered what we did and thanked us just a few days ago. Fingers crossed for this year.
Postscript: An IGBT had cracked open in the UPS module, something the engineers had never seen before. One 300A fuse and 4 more 200A blown...