Server partners Stratus Technologies and NEC have revamped their fault tolerant server lineups to take advantage of Intel's quad-core Nehalem EP Xeon 5500 servers. The two companies typically like to get their fault tolerant machines out the door within a quarter of a new chip launch from Intel. This time, though, the big shift …
dubious about this bit
"Fault tolerant servers are distinct from the more popular clusters in that they are two completely mirrored systems running two copies of an operating system and their applications are kept in absolute lockstep by a chipset and electronics in Intel's Xeon chips."
Unless someone corrects me, I don't buy this. Unless you could guarantee absolutely identical memory access patterns then there will be variations in cache access which will lead to different levels of cache being accessed which, given the huge differences in access time of different cache levels, would cause the system to run at the speed of the slowest - which could be very slow.
And then there's different disk access characteristics, so that even nominally identical units will have different request servicing times.
And there's interrupts. And network servicing times possibly acting as a bottleneck of different sizes per processor, moment to moment, both in network hardware differences and network load.
And then there's the circuitry on top to supposedly validate the outputs against each other (according to what, anyway, and what happens when a discrepancy is found?)
And this would be bad enough if it was multicore, single socket, but this is supposed to have *dual sockets*.
I suppose big jobs will be main memory- rather than cache-bound, so that price is perhaps already paid (ie. it runs as if from main mem anyway, so cache bonus is effectively lost), but I must be missing something. Can anyone who actually knows about this stuff please enlighten me, ta.
The setup inside the box seems to be basically the same as the original 68xxx based machines that I last used, so here it is in a nutshell.
Every board in the machine is paired and hot pluggable
All disks are mirrored - RAID 1
Each logical CPU consists of 4 physical CPUs, two on each of the paired boards. The same sort of arrangement also applies to comms boards, disk controllers, etc, but for simplicity I'll just describe CPU boards.
The basis of the system is that the two CPUs on a board are very tightly synchronised. They both execute the same instruction at the same time and fast comparators compare the outputs of the pair. If a comparator spots a difference it can turn the board off before the bad data gets onto the main busses that connect all boards in the cabinet.
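To make the comparator idea concrete, here's a toy sketch in Python. It is only a model of the behaviour described above, not how the hardware is actually built: the class name `LockstepBoard` and the `healthy`/`faulty` functions are made up for illustration, and a real comparator works on bus signals, not return values.

```python
from typing import Callable, Optional


class LockstepBoard:
    """Toy model of one board: two CPUs run the same step, a comparator checks them."""

    def __init__(self, cpu_a: Callable[[int], int], cpu_b: Callable[[int], int]):
        self.cpu_a = cpu_a
        self.cpu_b = cpu_b
        self.online = True

    def step(self, operand: int) -> Optional[int]:
        """Return the value to put on the bus, or None if the board is offline."""
        if not self.online:
            return None
        a = self.cpu_a(operand)
        b = self.cpu_b(operand)
        if a != b:
            # Comparator mismatch: switch the board off before bad data reaches the bus.
            self.online = False
            return None
        return a


def healthy(x: int) -> int:
    return x + 1


def faulty(x: int) -> int:
    # Hypothetical fault: this CPU gives a wrong answer for one particular input.
    return 99 if x == 3 else x + 1


board = LockstepBoard(healthy, faulty)
results = [board.step(i) for i in range(5)]
print(results, board.online)
```

Once the two CPUs disagree the board stays off, so nothing further from it reaches the bus.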
The entire box has a single system image with paired boards synchronised at the bus read/write level. This means that a failing board can be turned off without affecting system operation - unless, of course, its pair has already failed. IOW the system is completely tolerant to single point failures and also to a more limited set of multi-point failures. You can pull up to half the boards out of the system without affecting its operation or performance provided that you don't pull both halves of a pair.
There is only one copy of the OS and of each application program in memory. Each running process is a single image, but each executes simultaneously on all four processors that make a logical processor: when everything is working correctly the data from three of them is discarded. If a board fails, things go on in the same way, but now the logical processor has only two chips until the board is replaced. On replacement the board is tested, brought up and synchronised with its active pair-mate, so the logical processor is again made of four physical processors. During all this the affected processes have continued to run without interruption and at full speed.
It used to be said that any non-fault-tolerant OS could be run on a Stratus with one change: it needed a special fault detecting interrupt handler whose only job was to kick off the phone-home process if an error interrupt occurred.
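The failover described above can be sketched the same way. Again this is a toy model under my own assumptions: `Board`, `LogicalProcessor`, the `fault_at` parameter and the "multiply by two" workload are all invented for the example, and the test and resynchronise step on replacement is reduced to swapping in a fresh object.

```python
class Board:
    """Toy board holding two lockstepped CPUs."""

    def __init__(self, name, fault_at=None):
        self.name = name
        self.fault_at = fault_at  # hypothetical: step at which the two CPUs disagree
        self.in_service = True

    def execute(self, step, value):
        if not self.in_service:
            return None
        cpu_a = value * 2
        cpu_b = -1 if step == self.fault_at else value * 2
        if cpu_a != cpu_b:
            self.in_service = False  # board switches itself off on a mismatch
            return None
        return cpu_a


class LogicalProcessor:
    """Two paired boards (four CPUs) presented as a single processor."""

    def __init__(self, left, right):
        self.boards = [left, right]

    def execute(self, step, value):
        outputs = [b.execute(step, value) for b in self.boards]
        good = [o for o in outputs if o is not None]
        if not good:
            raise RuntimeError("both boards of the pair have failed")
        return good[0]  # the surplus copies are simply discarded

    def replace(self, index, new_board):
        # Stand-in for "tested, brought up and synchronised with its pair-mate".
        self.boards[index] = new_board


cpu = LogicalProcessor(Board("A", fault_at=1), Board("B"))
out = [cpu.execute(step, v) for step, v in enumerate([10, 20, 30])]
failed = [b.name for b in cpu.boards if not b.in_service]
cpu.replace(0, Board("A2"))
out.append(cpu.execute(3, 40))
print(out, failed)
```

Board A drops out at step 1, but the caller still gets a result for every step because board B carries on alone until the replacement goes in.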
Two things happen when a board fails: it's switched off and, after a 30 second delay, the system rings Stratus and tells them what broke so they can send an engineer round with a replacement. The delay was introduced because in the early days actual faults were hugely exceeded by false alarms that were due to people showing their mates that you could pull a board or two without anything happening apart from BOARD FAILED messages appearing on the console as you took them out and IN SERVICE messages appearing when you stuck the board back in. The delay let you pull the board, say "look Ma, no fault" and stick it back in without the system phoning home.
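The delay logic amounts to a simple debounce. A rough sketch, assuming a periodic `tick` and explicit timestamps in seconds; the `PhoneHome` class and its method names are hypothetical, only the 30 second figure comes from the description above.

```python
PHONE_HOME_DELAY_S = 30  # the delay quoted above


class PhoneHome:
    """Toy debounce: only report a board that stays failed for the full delay."""

    def __init__(self, delay=PHONE_HOME_DELAY_S):
        self.delay = delay
        self.pending = {}  # board name -> time it was reported failed
        self.calls = []    # boards actually reported to the vendor

    def board_failed(self, board, now):
        self.pending[board] = now

    def board_in_service(self, board):
        # Board came back before the delay expired: cancel the pending call.
        self.pending.pop(board, None)

    def tick(self, now):
        for board, failed_at in list(self.pending.items()):
            if now - failed_at >= self.delay:
                self.calls.append(board)
                del self.pending[board]


monitor = PhoneHome()
monitor.board_failed("cpu0", now=0)   # someone pulls a board to show off
monitor.board_in_service("cpu0")      # ...and pushes it back in
monitor.board_failed("disk1", now=20) # a genuine failure
monitor.tick(now=35)                  # only 15 s elapsed: no call yet
monitor.tick(now=50)                  # 30 s elapsed: phone home
print(monitor.calls)
```

The pulled-and-reinserted board never generates a call; the one that stays dead does.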
You must have heard this story
Told to me a long time ago; somebody phoned up the service centre and said 'we had an earthquake, and our fault tolerant box has fallen over'. Service guy; 'it can't fall over, it's a fault tolerant system'. Customer; 'No, no, it's still *running*, but it's fallen over on its side and we need someone to help us get it upright again' :-) :-) :-)
Thanks for that explanation. I understand that one of Intel's chips was designed to do something similar (the PII?) but we're talking several generations ago. IIRC the 68040 would be running ~30MHz, against current clock speeds of round about 2 *orders of magnitude* faster. I don't know but I'm guessing the clock is on-chip, so how do you sync 2 sockets? Even if it is off-chip you would have significant clock skew.
Dunno, it seems more plausible but still extremely hard.