All about the interconnects
The argument goes that if you can break your work into smaller chunks and attack the problem in parallel, you might not need as much fast memory, especially if what you do have is much faster than generic DDR4. NVLink interconnects at up to 200GB/sec can (in theory) flush through all the RAM on even the largest multi-GPU setup in a matter of seconds; compare that with legacy multi-socket boards, which struggle to share anywhere near that much data between NUMA nodes in the same time, even when older SLI bridges were used to link GPUs carrying faster GDDR5.

If the GPU is doing all the work and CPUs are no longer constrained by the total number of PCIe lanes available, the picture may differ widely between CPU architectures. It certainly looks like it might between Intel Xeon or AMD EPYC x86 boards and the POWER approach, which treats GPUs more like the custom accelerator coprocessors on mainframes. I wonder where the next I/O bottleneck will be attacked if multiple terabytes of fast graphics RAM for in-memory databases become commonplace.
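A quick back-of-envelope sketch of that "flush through all the RAM" claim, using purely illustrative numbers (the 200GB/sec link figure from above, plus an assumed 32GB of memory per GPU and a 16-GPU node, neither of which refers to any specific product):

```python
# Back-of-envelope: time to stream a GPU's entire memory over its interconnect.
# All figures below are illustrative assumptions, not measurements.

NVLINK_BW_GBPS = 200   # assumed per-GPU interconnect bandwidth, GB/s
HBM_PER_GPU_GB = 32    # assumed on-board memory per GPU, GB
NUM_GPUS = 16          # assumed size of a large multi-GPU node

# Total fast memory across the hypothetical node.
total_mem_gb = HBM_PER_GPU_GB * NUM_GPUS

# Each GPU drains (or fills) its own memory over its own link,
# so with all links running in parallel the whole node takes
# roughly the same time as a single GPU.
seconds_per_gpu = HBM_PER_GPU_GB / NVLINK_BW_GBPS

print(f"Total GPU RAM on node: {total_mem_gb} GB")
print(f"Time to stream one GPU's memory: {seconds_per_gpu:.2f} s")
```

Even with half a terabyte of aggregate GPU memory, the parallel-links assumption keeps the theoretical flush time well under a second, which is the contrast being drawn with NUMA-node sharing on legacy multi-socket boards.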