If you want big server iron but you have midrange server budgets, Numascale has an adapter card that it wants to sell to you. The NumaConnect SMP card turns a cluster of Opteron servers into a shared memory system, and in the not-too-distant future, probably Xeon-based machines, too. The clustering of cheap server nodes to make …
What problem are we solving with NUMA?
In the late '90s I worked for a bank where we were tasked with migrating two years of transaction data from a very expensive Sequent Symmetry NUMA-Q system running Oracle. I think it cost £20m and made the front page of Computer Weekly when it went live.
In a nutshell, it didn't work: there was one DBA per user (four of each), only a small percentage of the fact table could be included in a query, and no joins were possible unless you were prepared to wait forever or to have your queries crash the system.
The billions of rows were dropped onto Teradata, an MPP system, where it worked fine, and where the table and app sit to this day... only a lot bigger.
There's nowt new under the sun.
cheap numas are useless
IBM has the Xseries 3850 and 3950, which it claims are NUMA, but if you connect more than one box together and use memory cards, the memory becomes so slow that my Eee PC 901 showed better latency. You are then far better off with a cluster of cheaper nodes.
What I took from this is to run the STREAM benchmark on every new config I am asked to deploy on, and to reject it if memory turns out to be too slow.
If you need NUMA, you need to buy something that gives you almost the same memory latency for all nodes, like the SGI Altix. It doesn't cost much more than all those crappy NUMAs around.
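A rough sketch of that kind of acceptance check, for anyone who wants to try the idea before building the real STREAM benchmark (which is a C/Fortran program). This is not STREAM itself: it times a bulk buffer copy, similar in spirit to the STREAM Copy kernel, and the function names, buffer size and the 50% threshold are all assumptions made up for illustration.

```python
import time


def copy_bandwidth_gb_s(size_bytes=64 * 1024 * 1024, repeats=5):
    """Best-of-N bandwidth of a large buffer copy, in GB/s (rough, Copy-like)."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # slice assignment copies the whole buffer in one bulk operation
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)  # guard against a zero reading from a coarse timer
    # Count bytes read plus bytes written, as STREAM does for its Copy kernel.
    return 2 * size_bytes / best / 1e9


def acceptable(measured_gb_s, baseline_gb_s, min_fraction=0.5):
    """Hypothetical rule: reject a config below a fraction of a known-good baseline."""
    return measured_gb_s >= min_fraction * baseline_gb_s


if __name__ == "__main__":
    print(f"copy bandwidth: {copy_bandwidth_gb_s():.1f} GB/s")
```

On a multi-box NUMA system you would want to run this pinned to each node in turn, since a single unpinned run mostly measures whatever memory happens to be local.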
Shared memory NUMA @
You are comparing apples with oranges. You compare an IBM system with a new technology, which I assume you haven't tested, and with an SGI system, which I assume you have.
Give these people a break, at least until you have had your fingers on the keyboard and can quantify your claims.
It's all about NUMA!
I have seen your argument about having the same latency for all nodes before. My reply was the same then as it is now: I can make all the memory equally slow, but I can't make it equally fast! NUMA means that some (the local) memory is faster than the remote memory. Modern machines with more than one CPU socket and on-chip memory controllers are all NUMA machines. Also, all modern processors rely on highly efficient cache hierarchies to reach a reasonable performance level compared to their peak performance. This stems from the fact that processors operate at around 3 GHz (a cycle time of about 0.3 nanoseconds) while DRAM access time is on the order of 100 nanoseconds. Simply put: smaller memories closer to the processor can be made faster than bigger memories further away.
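To put those two numbers side by side, here is the arithmetic as a small sketch. The 3 GHz clock and the 100 ns DRAM access time are the round figures quoted above; the function names are just made up for the example.

```python
def cycle_time_ns(clock_ghz):
    """Length of one clock cycle in nanoseconds."""
    return 1.0 / clock_ghz


def stall_cycles(access_ns, clock_ghz):
    """How many clock cycles pass while waiting for one memory access."""
    return access_ns * clock_ghz


print(round(cycle_time_ns(3.0), 2))      # 0.33
print(round(stall_cycles(100.0, 3.0)))   # 300
```

In other words, a single access that goes all the way to DRAM costs roughly 300 cycles, which is why the cache hierarchy matters so much.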
At Numascale, we have applied the following design philosophy: keep the latency for remote accesses as low as possible, and introduce a fourth level of cache on each node to smooth out the latency difference between remote and local access. This cache can be configured to be roughly 700 times larger than the L3 cache of the processor (4 GBytes vs 6 MBytes), and it operates with the native processor cache line size of 64 bytes per line. A remote access that misses the cache must traverse the interconnect fabric, so its latency will be in the microsecond range. This is around a factor of 10 more than local memory. That may not be so bad when you count in the hit rate of all the caches and look at the alternatives: going to a secondary storage device, which is far slower, or going through the more painstaking process of decomposing your data structure and writing a message passing program.
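The "count in the hit rate" point can be made concrete with a simple weighted-average model. This is only an illustration, not a measurement: the 100 ns and 1000 ns figures are the round numbers from this thread, the hit rates are hypothetical, and it is an assumption here that a hit in the node-level cache costs about the same as a local memory access.

```python
LOCAL_NS = 100.0    # local memory / assumed cost of a node-level cache hit
REMOTE_NS = 1000.0  # remote access that has to cross the interconnect


def effective_latency_ns(hit_rate, hit_ns=LOCAL_NS, miss_ns=REMOTE_NS):
    """Average latency of a remote-data access for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns


for rate in (0.5, 0.9, 0.99):  # hypothetical hit rates
    print(rate, round(effective_latency_ns(rate), 1))
```

With a 90% hit rate the average comes out at about 190 ns, so under these assumptions the factor of 10 shrinks to roughly a factor of 2. How close a real workload gets to that depends entirely on its access pattern.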