Linux cluster supplier Penguin Computing is diving into the low-power ARM microserver racket and has tapped server chip upstart Calxeda – which has just rolled out its multiyear product roadmap for its EnergyCore processors – as its chip and interconnect supplier for its first boxes. The new machine, called the Ultimate Data X1 …
This sounds like the set up for an episode of campy 1960s Batman.
The question is, why has The Penguin become so interested in ARM microservers? What is his fiendish plan?
And here I was gonna ask if it ran the new Win-ate, innately.
Cache coherent interconnect would only be useful if...
...the processors have enough physical address bits to allow direct addressing across all the memory attached to the interconnect. Cortex-A9 can only address 4GB in total, so to get anywhere near addressing the memory on 4096 sockets, you'd only be able to put 1MB on each socket, which seems a bit small for today's software... :-)
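For anyone who wants to check the sums, here is a rough back-of-envelope sketch. It assumes a single flat physical address space split evenly across sockets, with no address-extension tricks. The function names are made up for illustration, and the 4 GiB-per-socket target in the second call is just an assumed figure, not anything from the announcement.

```python
# Back-of-envelope check, assuming one flat physical address space
# shared evenly by every socket on the coherent interconnect.
PHYS_ADDR_BITS = 32   # Cortex-A9: 4 GiB of physical address space
SOCKETS = 4096


def per_socket_bytes(addr_bits: int, sockets: int) -> int:
    """Bytes each socket gets when the address space is split evenly."""
    return (1 << addr_bits) // sockets


def bits_needed(per_socket: int, sockets: int) -> int:
    """Address bits needed to directly address per_socket bytes on every socket."""
    return (per_socket * sockets - 1).bit_length()


# 2**32 / 2**12 = 2**20 bytes
print(per_socket_bytes(PHYS_ADDR_BITS, SOCKETS) // 2**20, "MiB per socket")  # 1 MiB per socket

# Hypothetical target: 4 GiB on each of the 4096 sockets
print(bits_needed(4 * 2**30, SOCKETS), "address bits needed")  # 44 address bits needed
```

So under those assumptions you would need something like 44 physical address bits to give every socket a full 4 GiB, against the 32 that are actually there.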
Also, are you sure you even *want* it? I was somewhat involved with the 1536 processor Altix 3700 system that was installed at my workplace (nf.nci.org.au); it seemed that SGI were keen for us to run it as a few honking big SMP boxes, but the exposure to component failure that you get with a few huge SMPs means it only really makes sense for jobs which necessarily take the whole system. AFAIK that's how NASA ran their Columbia Altix cluster. We ran a big mix of workloads across that number of CPUs, and so a failure that crashed a 512-1024 CPU SMP would have killed a lot of jobs that had no dependency on the failed part.
Even when the Altixes were run as a cluster of 32-64 CPU SMPs, with the same interconnect serving to run MPI between SMP boxes, the cache coherency in the interconnect was still there, and could lead to cascading failures if you didn't shut down a failed SMP box in just the right way; memory shared between SMP nodes for MPI communication was actually cache coherent with the other nodes mapping the same memory, so a failure in one node could cause other nodes to fail if cache lines for other machines' memory got "stuck" on the failed node. Not pretty, and made worse by all the Clustered XFS storage fencing that happened as nodes died; if enough fence-outs happened in a short time, it could cause the Brocade director-class FC switches to hang, with further ensuing hilarity.
Moral is, be very sure that you want the very tightly coupled thing, because you'll pay for the complexity one way or another...
Stop trying to confuse the shiny world of press releases with your boring real world experience, the kind of thing that any big-system fule doth know (and hatheth knowed for ages).
Don't you understand that there are lunches, VC funding rounds and maybe even IPOs at stake here?
p... p... pick up a penguin!
Chocolate bars! Not servers!