Latency, it's all about latency
I have some rather memory-intensive code that I once ran (or rather walked) on a ScaleMP machine (an older incarnation: 8 boards with 8 cores each). Performance was dismal. Why? Each thread may need access to every part of memory, because the outcome for each pixel in these huge images may depend on any other pixel in the image, and you do not know beforehand which ones matter. Everything works hunky-dory as long as each processor only accesses the memory on its own board, but the moment it needs large amounts of data from another board, latency kills performance. Getting a speed-up of 0.5 at 2 threads (i.e. slower than a single thread, if the threads weren't explicitly pinned to cores on the same board) is rather discouraging.
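If you do want to pin threads by hand on Linux, the pthread affinity calls will do it. Below is a minimal sketch that assumes cores 0 and 1 live on the same board; the real core-to-board mapping has to be checked first (numactl --hardware or hwloc), so the numbers here are only placeholders.

    /* Sketch: pin two worker threads to cores assumed to be on the same board,
     * so they only touch board-local memory. The core numbers are guesses;
     * check the real topology with numactl --hardware or hwloc first.
     * Build with: gcc -pthread pin.c -o pin */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NTHREADS 2

    static void *worker(void *arg)
    {
        int core = *(int *)arg;

        /* Restrict this thread to one core on the chosen board. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
            fprintf(stderr, "could not pin thread to core %d\n", core);

        /* ... per-thread image processing on board-local data would go here ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int cores[NTHREADS] = {0, 1};   /* assumed to sit on the same board */

        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, &cores[i]);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }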
What they are doing is putting a software layer over a distributed-memory or NUMA machine, so as to hide the complexities and allow shared-memory algorithms to run (or rather walk) without the need to rewrite the code. ScaleMP does hide the actual NUMA architecture very well. Curiously, this leads to problems when optimizing the parallel code for that particular hardware, precisely because the details are too well hidden. You really need to understand the memory architecture and the latencies of the machine to design the appropriate algorithm. Parallel programming on shared-memory machines and on distributed-memory/NUMA machines are two very different ball games, often requiring a careful rethink of the algorithms in order to get the processors to spend their time working, not talking (just like an old-fashioned classroom) or waiting for data.
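Once you do know the architecture, you can at least place the data deliberately. Here is a minimal sketch using libnuma on Linux, giving each board (NUMA node) its own slice of the image; the slice size is made up and none of this is ScaleMP-specific.

    /* Sketch: place each board's slice of the image in that board's own memory,
     * so threads pinned there see only local latency for their slice.
     * Slice size is illustrative; requires libnuma on Linux.
     * Build with: gcc slices.c -lnuma -o slices */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }

        size_t slice_bytes = 64UL * 1024 * 1024;   /* 64 MiB per node, illustrative */
        int nodes = numa_max_node() + 1;

        for (int node = 0; node < nodes; node++) {
            /* Physically allocate the slice in this node's memory. */
            unsigned char *slice = numa_alloc_onnode(slice_bytes, node);
            if (slice == NULL) {
                fprintf(stderr, "allocation on node %d failed\n", node);
                continue;
            }

            /* ... hand the slice to threads pinned to this node's cores ... */

            numa_free(slice, slice_bytes);
        }
        return 0;
    }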
A ScaleMP-like approach could work if the latencies are kept very low (as with QPI). On run-of-the-mill network connections (or even InfiniBand), you need to rethink shared-memory code, not so much because of bandwidth, but because of latency. For Cell/GPU-type systems similar rethinks are needed, for much the same reason.
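The typical rethink is to stop making fine-grained remote accesses and instead fetch remote data in blocks, overlapping the round trip with computation. A rough sketch of that pattern in MPI follows; the ranks, tag, block size and block count are all illustrative, not taken from my actual code.

    /* Sketch: hide remote latency by prefetching the next block of remote data
     * with a non-blocking receive while the current block is being processed.
     * Ranks, tag, block size and block count are illustrative only.
     * Build with: mpicc overlap.c -o overlap && mpirun -np 2 ./overlap */
    #include <mpi.h>
    #include <stdlib.h>

    #define BLOCK   4096   /* floats per remote block */
    #define NBLOCKS 8      /* number of blocks to fetch */

    static void process(const float *data, int n) { (void)data; (void)n; }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        float *cur  = calloc(BLOCK, sizeof *cur);
        float *next = calloc(BLOCK, sizeof *next);

        if (rank == 0) {                      /* consumer: overlaps fetch and compute */
            MPI_Request req;
            MPI_Irecv(cur, BLOCK, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &req);
            for (int b = 0; b < NBLOCKS; b++) {
                MPI_Wait(&req, MPI_STATUS_IGNORE);
                if (b + 1 < NBLOCKS)          /* start the next fetch before computing */
                    MPI_Irecv(next, BLOCK, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &req);
                process(cur, BLOCK);          /* network round trip overlaps with this */
                float *tmp = cur; cur = next; next = tmp;
            }
        } else if (rank == 1) {               /* producer: just streams the blocks */
            for (int b = 0; b < NBLOCKS; b++)
                MPI_Send(cur, BLOCK, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        }

        free(cur);
        free(next);
        MPI_Finalize();
        return 0;
    }

The point is simply that the wait for remote data disappears behind the computation once it is fetched in blocks; fine-grained per-pixel requests would pay the full round trip every single time.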