El Reg's ever-twitching cyberspace antennae have detected that Compellent is certifying its storage product with a coming major release of ESX which appears to be slated for May. Many people expected ESX 4.0 to be announced at VMworld in Cannes last week. Instead there was much talk about VMware and the …
The new version will support "Fault Tolerance", which keeps two VMs in lockstep with each other (a lot like Tandem computing hardware, but SO much cheaper). There will also be the Cisco Nexus 1000V distributed network switch, allowing the networking team to have full Cisco IOS control of the virtual network device; host profile management to keep all hosts configured to a known standard; a 64-bit kernel and service console (ESX has supported 64-bit guests for a while, but the kernel and console were still only 32-bit); and a lot of new storage plugin technologies.
For a geek like myself, I'm almost wetting myself in anticipation!!
While Fault Tolerance certainly has an interesting, experimental, pioneering character, it might be barely usable for mission-critical applications, since it only works with a single CPU. Who wants to run a mission-critical app on a single CPU?
But I'm sure it is a good path. Once it goes public and gets used in various niches, it will be a good base for a possible multi-CPU lockstep technology. But if you know how complicated it can be to make SMP efficient even on one OS, you also know how difficult it will be to keep two OSs with SMP in lockstep.
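To make the lockstep idea a bit more concrete, here is a toy sketch in Python. It is only an illustration and assumes nothing about how VMware actually implements Fault Tolerance: `ToyVM` and `run_lockstep` are made-up names, and the "machine" is just a deterministic state update. The point is that two replicas stay identical as long as they apply the same inputs in the same order, and that a different order gives a different state, which is roughly why multiple CPUs interleaving events make lockstep so much harder.

```python
from dataclasses import dataclass


@dataclass
class ToyVM:
    """Hypothetical deterministic machine: state depends only on the inputs applied so far."""
    state: int = 0
    applied: int = 0

    def apply(self, event: int) -> None:
        # Any deterministic, order-sensitive update will do for the illustration.
        self.state = (self.state * 31 + event) % 1_000_003
        self.applied += 1


def run_lockstep(events):
    """Feed every input to a primary, log it, and replay it on a secondary."""
    primary, secondary = ToyVM(), ToyVM()
    log = []
    for event in events:
        primary.apply(event)
        log.append(event)         # primary records the input it just consumed
        secondary.apply(log[-1])  # secondary replays the same input in the same order
        if primary.state != secondary.state:
            raise RuntimeError("replicas diverged")
    return primary, secondary


if __name__ == "__main__":
    p, s = run_lockstep([3, 1, 4, 1, 5])
    print(p.state == s.state)
```

With a single stream of inputs the order is unambiguous. With several CPUs the order in which events hit shared state is not fixed, so both replicas would have to agree on that order as well.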
Clustering is Painful
As of today, VMware ESX has made "high availability" on commodity hardware a no-brainer in terms of configuration, compatibility, supportability and cost: set up the hardware for HA (multiple hosts, shared storage, redundant hardware) and every VM you deploy instantly becomes "highly available" (VMs automatically restart on another host in the event of a hardware failure).
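The restart-on-another-host behaviour can be sketched roughly like this. It is a simplified illustration, not VMware's actual HA placement logic: the function name, the host and VM names, and the "pick the least loaded host" rule are all assumptions made for the example.

```python
def restart_failed_vms(placement, failed_host, healthy_hosts):
    """Return a new VM-to-host mapping with VMs from failed_host restarted elsewhere.

    placement: dict mapping VM name -> host name (hypothetical names).
    healthy_hosts: list of hosts that are still up.
    """
    if not healthy_hosts:
        raise ValueError("no healthy hosts left to restart VMs on")

    new_placement = dict(placement)

    # Count how many VMs each healthy host is already running.
    load = {host: 0 for host in healthy_hosts}
    for host in placement.values():
        if host in load:
            load[host] += 1

    # Restart each orphaned VM on the currently least loaded healthy host.
    for vm, host in sorted(placement.items()):
        if host == failed_host:
            target = min(healthy_hosts, key=lambda h: load[h])
            new_placement[vm] = target
            load[target] += 1

    return new_placement


if __name__ == "__main__":
    before = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB"}
    print(restart_failed_vms(before, "hostA", ["hostB", "hostC"]))
```

Note that the VMs are restarted, not kept running: whatever was in memory on the failed host is lost, which is exactly the gap that lockstep-style fault tolerance is meant to close.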
If you wanted even higher availability, you had to resort to clustering at the application level with solutions like Microsoft's Clustering Services, Symantec's Veritas Cluster Server etc. Those in the know are well aware of how painfully complex these solutions are to configure and run successfully. For certain applications and requirements, there was no alternative. For non-cluster-aware workloads, there wasn't a solution at all.
The introduction of "Fault Tolerance" in ESX 4 essentially bridges that gap and offers a simple way to get that extra level of availability without having to jump through hoops and fork out a tonne of $$$. The marketing indicates that making a VM "fault tolerant" is a mere act of "checking a tick box". Woohoo!
I can appreciate that the initial support is for single-vCPU virtual machines only. Many of VMware's "v1.0 features" start out somewhat "limited" or "experimental", and the kinks get ironed out as more and more customers deploy them in their environments. You can certainly imagine the extra network bandwidth required to give two VMs a communication channel so they can run in lockstep continuously. As 10GigE proliferates and becomes commonplace, the barriers will start to fall, I reckon. In the meantime, it makes sense to go after the "low hanging fruit" before enabling such functionality for your mission-critical stack.