What does ‘high availability’ actually mean in IT today? We’ve written elsewhere about availability in general and about good systems design, but what if you want to ensure availability when things go unexpectedly wrong? From a systems perspective, this is where we start adding to the number of nines we talk about, for …
It's not that easy
Remember, it's usually not the hardware that fails, although hardware is where it's easiest to throw money. The biggest components of service recovery are problem determination and application restore. I don't think I've ever seen application restoration after a failure listed as a requirement during application package selection; it is always viewed as an infrastructure problem or a bolt-on.
That's in terms of operational failure; when you talk about 'DR', organisational reaction times are added to the mix. I've only once worked with an organisation (a global bank) that could actually cope with a disaster hitting its datacentre.
I've come round to the view that for most businesses, building a bulletproof main environment is more worthwhile than trying to provide a cut-price DR environment.
A lot of companies lease a virtualised DR centre, shipping their data to it and planning to call down capacity from a server farm if they need to invoke DR. I'd love to know how the DC provider calculates capacity, and whether they refuse customers who are geographically too close to each other and so likely to suffer the same disasters.
Buy me a pint and I'll keep talking all day about this.
There are a lot of moving parts in this discussion; I think the author did a good job of distinguishing High Availability from Business Continuity and Disaster Recovery. I agree with Ken 16's comment that main data center availability should be the primary focus; establishing a local failover on the same campus would come second, followed by a warm site that is not too close geographically. The likelihood of a significant failure decreases as the scope of the failure widens. That is, you're more likely to have a server fail than to suffer a failure that takes out your data center, and rarer still to have one that takes out the entire campus or your geographical region.

Full disclosure: I work at Stratus Technologies. We run all of our mission-critical applications on our own servers; these servers provide greater than five 9s availability without clustering. We call this Continuous Availability (CA), as opposed to High Availability (HA), which I'd classify as three to four 9s.

The advantage of a local failover site is that you have more control over the infrastructure, and it can be leveraged to avoid "routine" downtime from planned outages, e.g. scheduled power maintenance in the lab or major application and/or infrastructure upgrades. Leveraging virtualization for local and remote failover sites significantly simplifies disaster recovery plans and keeps infrastructure costs down. Cheers!
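To put the nines in this thread into concrete terms, here is a minimal Python sketch that converts a number of nines into a yearly downtime budget. It also applies the textbook formula for two redundant nodes, as in a local failover pair. The function names are illustrative only. The redundancy formula assumes independent failures and instant failover, which is optimistic given the earlier point that problem determination and application restore, not hardware, dominate recovery.

```python
# Illustrative sketch: yearly downtime budgets for "N nines", plus the
# textbook availability of a redundant pair. Function names are hypothetical.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignores leap years


def availability_from_nines(nines: int) -> float:
    """Turn a count of nines into a fraction, e.g. 3 -> 0.999, 5 -> 0.99999."""
    return 1 - 10 ** (-nines)


def downtime_minutes_per_year(availability: float) -> float:
    """Minutes per year a service may be down and still meet `availability`."""
    return (1 - availability) * SECONDS_PER_YEAR / 60


def parallel_availability(a: float, b: float) -> float:
    """Availability of two nodes where either one can carry the load.

    Assumes failures are independent and failover is instantaneous; shared
    power, shared software bugs or a slow restore all break this assumption.
    """
    return 1 - (1 - a) * (1 - b)


if __name__ == "__main__":
    for n in (3, 4, 5):
        a = availability_from_nines(n)
        print(f"{n} nines: {downtime_minutes_per_year(a):.1f} min/year")
    # Two three-nines nodes in a failover pair, under the independence assumption
    print(f"pair of 3-nines nodes: {parallel_availability(0.999, 0.999):.6f}")
```

Run as a script, it prints roughly 525.6, 52.6 and 5.3 minutes per year for three, four and five nines. This shows why the gap between HA at three to four 9s and something above five 9s matters in practice.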