It's all about NUMA!
I have seen your argument about having the same latency for all nodes before. My reply was the same then as it is now: I can make all the memory equally slow, but I can't make it equally fast! Numa means that some (the local) memory is faster than the remote memory. Modern machines with more than one socket of CPUs and on-chip memory controllers are all Numa machines. Also all modern processors rely on highly efficient cache hierarchies to be able to obtain a reasonable performance level compared to their peak performance. This stems from the fact that the processors operate around 3GHz (0.3nanoseconds cycle time) and DRAM access time is in the order of 100 nanoseconds. Simply put: Smaller memories closer to the processor can be made faster that bigger memories further away.
At Numascale, we have applied the following design philosophy: Keep the latency for remote accesses as low as possible and introduce a fourth level of cache on each node to smoothen out the latency difference between remote and local access. This cache can be configured to be ≈1000 times larger than the L3 cache of the processor (4GBytes vs 6MBytes) and it operates with the native processor cache line size of 64 bytes per line. Remote latency on cache miss will necessarily have to traverse the interconnect fabric and will be in the microsecond range. This is around a factor of 10 more than local memory. This may not be so bad when you count in the hit rate of all the caches and look at the alternative which means that you will have to go to a secondary storage device which is seriously much slower, or go through a more painstaking process of decomposing your data structure and write a message passing program.