Re: Mirrored systems
We have had some experience of fail-over systems and it is much harder to make it work properly than you imagine at first. You have a few rather tricky issues to address:
1) Under what conditions do you fail over? Total loss of one system is obvious (power off, kernel panic, etc.) but what do you do if some part is down and the others look OK? What exactly are the thresholds for action?
2) If you go for something more useful than total outage, how do you make sure it's not triggered by a temporary condition (flood of data requests, etc.) that might push system load higher than normal, but is in fact an acceptable short-term condition?
3) When failing over, how do you ensure data completeness and integrity? If, for example, one head on a NAS fails, you could end up with partly written files and may not be sure of what the clients think was successfully written.
4) How do you avoid the "split brain" problem, where one system takes over from what it thinks is a failed mirror, but that mirror is still doing stuff with shared resources? If you go for powering down the failed system (AKA "shoot it in the head", zombie apocalypse style) to be damned sure it's not meddling with shared stuff, how do you then avoid the risk of mutually assured destruction if both lose the heartbeat link and more or less simultaneously kill the other?
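
To make points 2 and 4 a bit more concrete, here is a rough Python sketch of two common ideas, not a recipe for any particular cluster product. The first part only declares a failure after several bad health checks in a row, so a short load spike does not trigger a fail-over. The second part uses a third-party arbiter that hands the "right to shoot" to exactly one node, so two nodes that lose the heartbeat link at the same time cannot both kill each other. All names here (FailoverDetector, Witness, the threshold of 3) are made up for illustration, and the Witness is an in-process stand-in for what would really be a quorum disk or an external lock service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailoverDetector:
    """Trigger only after `threshold` consecutive failed health checks."""
    threshold: int = 3   # illustrative value, tune for your own system
    bad_streak: int = 0

    def observe(self, check_ok: bool) -> bool:
        """Record one health check; return True if fail-over should start."""
        if check_ok:
            self.bad_streak = 0      # any good check resets the count
        else:
            self.bad_streak += 1
        return self.bad_streak >= self.threshold


class Witness:
    """Hypothetical arbiter: the first node to ask gets the fencing right."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None

    def try_acquire(self, node_id: str) -> bool:
        """Return True only for the node that holds the fencing right."""
        if self.holder is None:
            self.holder = node_id
        return self.holder == node_id


if __name__ == "__main__":
    detector = FailoverDetector(threshold=3)
    # One good check in the middle of the run resets the streak.
    for ok in [False, False, True, False, False, False]:
        print(ok, detector.observe(ok))

    # Both nodes lose the heartbeat and race to the witness.
    witness = Witness()
    print("node-a may fence:", witness.try_acquire("node-a"))
    print("node-b may fence:", witness.try_acquire("node-b"))
```

In this sketch a node would only power off its peer if the detector has fired and try_acquire returned True for it. The loser of the race has to stand down rather than shoot. This says nothing about point 3: knowing what the clients believe was written is a separate problem.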