Diary of a server failure


Duh

1) If your rebuild-from-backup time is 3 days, why don't you invest in something minor (like a 1TB eSATA- (or even USB-) connected disk that I can buy from any Maplin's shop for about £50-100)? It will *not* provide any extra security or redundancy, but it *will* provide a restore time on the order of minutes to hours rather than days. Permanently connect it and have backups write to it as an external medium in the usual backup procedure. It won't save you in a fire, but it'll cut your restore times by orders of magnitude when it comes to most hardware failures. Especially handy if you know your drives have been troublesome lately. Or a nice, cheap NAS box is the perfect "local restore" backup for things like this: you barely need to do *anything* to make it work, and Gigabit restore speeds certainly sound better than waiting days for backup tapes to be found.
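To put rough numbers on that, here is a back-of-envelope sketch of how long a full 1TB restore might take from a local disk or NAS. The throughput figures are my own ballpark assumptions for sustained rates, not measurements from the article, and the names (`restore_hours`, `ASSUMED_RATES`, `restore_table`) are made up for illustration.

```python
# Back-of-envelope restore time estimate.
# All throughput numbers below are ASSUMED sustained rates, not measurements.

TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes

# Hypothetical sustained transfer rates in bytes per second.
ASSUMED_RATES = {
    "USB 2.0 disk": 30 * MB,
    "eSATA disk": 100 * MB,
    "Gigabit NAS": 110 * MB,
}


def restore_hours(data_bytes, throughput_bytes_per_s):
    """Hours needed to copy data_bytes at a constant throughput."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return data_bytes / throughput_bytes_per_s / 3600


def restore_table(data_bytes):
    """Estimated restore time in hours (one decimal) for each assumed medium."""
    return {
        name: round(restore_hours(data_bytes, rate), 1)
        for name, rate in ASSUMED_RATES.items()
    }


if __name__ == "__main__":
    for name, hours in restore_table(1 * TB).items():
        print(f"{name}: about {hours} h for 1 TB")
```

Under those assumptions a full terabyte is a few hours of copying, and a partial restore of the machines you need first is correspondingly quicker. Either way it is nowhere near three days.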

2) If ESXi was so finicky about what it would install on, you were simply lucky that it installed at all. What would you have done if it didn't like the only card that could see your RAID, so that no matter WHAT machine you used for the restore it wouldn't install?

3) If your RAID5 rebuild times are that high, that's why RAID6 was invented. At the cost of an extra disk, it's a MASSIVE reassurance when you have long RAID rebuild times (and even a double-failed array can be read raw off the disk and reconstructed, assuming point 4 below applies).
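For anyone wondering why a second failure during a long RAID5 rebuild is fatal, here is a toy sketch of single-parity (XOR) reconstruction. It is a simplification under my own assumptions: real RAID5 rotates parity across the disks and works on much larger stripes, and RAID6 adds a second, independent parity calculation that this sketch does not implement. The function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length byte blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe):
    """Rebuild a stripe in which failed members are marked as None.

    Single parity can recover exactly one missing block, because the XOR
    of all surviving blocks equals the missing one.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity cannot recover more than one lost block")
    repaired = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        repaired[missing[0]] = xor_blocks(survivors)
    return repaired


if __name__ == "__main__":
    stripe = make_stripe([b"abcd", b"efgh", b"ijkl"])
    damaged = list(stripe)
    damaged[1] = None  # simulate one failed disk
    print(rebuild(damaged) == stripe)
```

Lose one block and the survivors give it back. Lose two and there is simply not enough information left, which is the window a long rebuild leaves open and the gap the extra RAID6 disk is there to close.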

4) Your RAID was highly dependent on the chipset implementation given to you. This in itself is a *fabulous* argument for software RAID (which has closed the gap tremendously in terms of speed) or for using a RAID setup that has a well-documented layout (and thus can be loaded in any machine with something like the Linux "md" driver, at least well enough for data recovery). If the chipset had changed, or the format had been upgraded, or newer chips didn't come with backwards compatibility, you would be stuffed again.

You were damn lucky, basically. And all for something that an extra hard drive (either in the RAID or as an external backup device) would have turned into a mere afternoon job without panic.
