A lesson in redundancy
To have redundant everything, you need to know what everything is. In the complex, interconnected world in which we operate, it's often hard to know what you don't know.
One of our servers at Memset (who are normally an all-round top-notch provider) was affected by this outage. Anticipating that data centres can never actually have 100% uptime, we replicate our servers to another DC run by a different service provider. That way we can just switch the DNS and hey presto.
For our DNS we use another normally top-notch provider with loads of DNS servers spread around the world. It just so happened that, after ten years of flawless and uninterrupted service, they had a problem at exactly the same time as the Memset outage. All their DNS servers were running normally, but their control panel went offline for an hour due to a database glitch, meaning we couldn't switch the DNS to our redundant server.
As it happened, our Memset server came back very quickly and we didn't need to switch, but still, another lesson learned.
Any tips on how to mitigate this problem would be much appreciated. DNS secondaries with another provider (or our own) would not have helped in this instance, as DNS was running normally; we just couldn't modify the zone files.
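For anyone trying to picture the dependency we missed, here is a rough sketch of the failover path as a small Python model. It is only an illustration under assumptions: the names (Status, can_fail_over, assess) are made up for this example, and it assumes DNS currently points at the primary server. The point it shows is that "switch the DNS" depends on the DNS control panel, not just on the DNS servers answering queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Status:
    """Hypothetical snapshot of the pieces involved in a DNS-based failover."""
    primary_up: bool            # main server (at the first provider)
    standby_up: bool            # replica in the other data centre
    dns_serving: bool           # DNS servers are answering queries
    dns_control_panel_up: bool  # we can actually edit the zone


def site_reachable(s: Status) -> bool:
    # Assumes DNS currently points at the primary.
    return s.dns_serving and s.primary_up


def can_fail_over(s: Status) -> bool:
    # Switching needs a healthy standby, working DNS, AND a way to edit the zone.
    return s.standby_up and s.dns_serving and s.dns_control_panel_up


def assess(s: Status) -> str:
    if site_reachable(s):
        # The site is up, but flag the hidden risk if we could not switch right now.
        return "ok" if can_fail_over(s) else "ok, but failover unavailable"
    return "fail over" if can_fail_over(s) else "stuck"


# The situation described above: primary down, standby fine,
# DNS answering normally, control panel offline.
incident = Status(primary_up=False, standby_up=True,
                  dns_serving=True, dns_control_panel_up=False)
print(assess(incident))
```

Running a check like this regularly against real health probes (not shown here) would at least flag the "ok, but failover unavailable" state before the primary goes down.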