"I would have to conduct my own testing. My lab results consistently show an ability to saturate RAM bandwidth on DDR2 systems. Your results smell like an issue with RAM bandwidth, especially considering that's where you're pulling your I/O. I will look to retry by placing the I/o on a Micron p420M PCI-E SSD instead."
This implies you are asserting that running virtualized causes a substantial overhead on memory I/O; otherwise saturating memory I/O shouldn't matter. I'm open to the idea that the biggest overhead manifests on loads sensitive to memory bandwidth, although measuring a memory bottleneck independently of a CPU bottleneck isn't trivial.
"I also disagree with your assessment regarding near/far cores on NUMA setups. Just because the hypervisor can obfuscate this for guest OSes doesn't mean you should let it do so for all use cases. If and when you have one of those corner case workloads where it is going to hammer the CPUs ina highly parallel fashion with lots of shared memory between then you need to start thinking about how you are assigning cores to your VMs."
In some cases you may not have much of a choice, if you need more cores for a VM than a single physical socket has on it. For other cases, maybe you could get a little more mileage out of things by manually specifying the CPU socket/core/thread geometry - if your hypervisor supports that. I'd be interested to see your measurements on how much difference this makes on top of pinning the cores.
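For the bare-metal side of such a comparison, keeping a workload on one socket can be sketched in Python roughly as below. This is only an illustration under assumptions: it assumes Linux, where each NUMA node lists its CPUs in /sys/devices/system/node/nodeN/cpulist and os.sched_setaffinity() is available. The helper names (parse_cpulist, read_node_cpus, pick_single_node, pin_to) are made up for this example, and pinning the VM's vCPUs is a separate matter configured in the hypervisor itself.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-5,12-17' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_node_cpus(node):
    """Read the CPU ids of one NUMA node from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())


def pick_single_node(nodes, wanted):
    """Return (node_id, cpus) for the first node with at least `wanted` CPUs.

    `nodes` maps node id -> list of CPU ids. Returns None when no single node
    is big enough, which is the "not much of a choice" case.
    Note: taking the first `wanted` ids does not look at hyperthread siblings.
    """
    for node_id, cpus in sorted(nodes.items()):
        if len(cpus) >= wanted:
            return node_id, cpus[:wanted]
    return None


def pin_to(cpus):
    """Restrict the calling process (and future children) to the given CPUs."""
    os.sched_setaffinity(0, set(cpus))
```

Something like pin_to(read_node_cpus(0)) before launching the benchmark would then keep the bare-metal run on node 0, assuming node 0 exists on the machine.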
"Hypervisors can dedicate cores. They can also assign affinity in non-dedicated circumstances. So when I test something that I know is going to be hitting the metal enough to suffer from the latency of going across to fetch memory from another NUMA node I start restricting where that workload can play. Just like I would in production."
Sure you can, but testing with a simple baseline use case where you have one host and one big VM seems like a good place to start assessing the least-bad-case scenario for the overhead. As you add more VMs and more arbitration of what runs where, the overhead is only going to go up rather than down.
I'm not even saying that the overhead matters in most cases - my workstation at home is a dual 6-core Xeon with two GTX780Ti GPUs (and an additional low spec one), split up using Xen into three workstations, of which two are gaming capable. The two gaming spec virtual machines each have a dedicated GPU and 3 pinned cores / 6 threads, both on the same physical socket (but with no overlap on the CPUs). The performance is good enough for any game I have thrown at it, even though I am running at 3840x2400 (T221). So clearly even for gaming type loads this kind of a setup is perfectly adequate, even though it is certainly not overhead-free. It is "good enough".
But in a heavily loaded production database server you don't necessarily have the luxury of being able to sacrifice any performance for the sake of convenience.
"Frankly, I'd also start asking pointed questions about why such workloads are running on a CPU at all, and can't I just feed the thing a GPU and be done with it?"
That's all well and good if you are running custom code you can write yourself. Meanwhile, the real world is depressingly bogged down in legacy and off-the-shelf applications, very few of which come with GPU offload, and most of which wouldn't benefit due to the size of data they deal with (PCIe bandwidth is typically lower than RAM bandwidth, so once your data doesn't fit into VRAM you are often better off staying on the CPU).
"That makes me very curious where the tipping point between my workloads and your simulation is."
Databases are a fairly typical worst-case scenario when it comes to virtualization. If you have a large production database server, you should be able to cobble together a good test case. Usually 100GB or so of database and 20-30GB of captured general log works quite well, if your queries are reasonably optimized. Extract the SELECT queries from your general log (Percona Toolkit comes with tools to do this, but I find they are very broken in most versions, so I just wrote my own general log extractor and session generator that throws SELECTs into separate files on a round-robin basis). You will need to generate at least twice as many files as you have threads in your test configuration (e.g. 24 files for a single 6-core/12-thread Xeon).
You then replay all of those in parallel and wait for them to complete. Run the test twice and record the time of the second run (so the buffer pools are primed by the first run). Then repeat the same in a VM with the same amount of RAM and the same number of CPU cores/threads (restrict the amount of RAM on bare metal with the mem= kernel parameter, assuming you are testing on Linux). This should give you a reasonably good basis for comparison.
Depending on the state of tune of your database, how well indexed your queries are, and how much it all ends up grinding onto disks, I usually see a difference of somewhere in the 35%-44% ballpark. Less optimized, poorly indexed DBs show a lower performance hit because they end up being more disk I/O bottlenecked.
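To make the extraction and replay steps above a bit more concrete, here is a rough Python sketch. It is not my actual extractor, just an illustration under assumptions: it assumes general log entries look like "<timestamp><whitespace><thread id> <Command><TAB><argument>" (the exact layout varies between server versions, so adjust ENTRY_RE), it joins multi-line statements with spaces (which would break queries containing "--" comments), and the function names and the session_NNN.sql file naming are invented for the example. The client command passed to replay() is whatever reads SQL on stdin in your setup, for instance ["mysql", "testdb"] with a hypothetical database name.

```python
import re
import subprocess
import time
from pathlib import Path

# Assumed entry shape: optional timestamp, whitespace, thread id, command, tab, argument.
ENTRY_RE = re.compile(r"^\S*\s+(\d+) ([A-Z][A-Za-z ]*?)(?:\t(.*))?$")


def extract_selects(lines):
    """Yield SELECT statements from general log lines, joining continuation lines."""
    current = None  # parts of the SELECT being collected, or None
    for raw in lines:
        line = raw.rstrip("\n")
        m = ENTRY_RE.match(line)
        if m:
            if current is not None:
                yield " ".join(current)
            command, arg = m.group(2), m.group(3) or ""
            is_select = command == "Query" and arg.lstrip().upper().startswith("SELECT")
            current = [arg.strip()] if is_select else None
        elif current is not None and line.strip():
            current.append(line.strip())  # continuation of a multi-line SELECT
    if current is not None:
        yield " ".join(current)


def min_session_files(threads):
    """At least twice as many session files as hardware threads under test."""
    return 2 * threads


def split_round_robin(statements, n_files):
    """Deal statements into n_files buckets, one at a time, in order."""
    buckets = [[] for _ in range(n_files)]
    for i, stmt in enumerate(statements):
        buckets[i % n_files].append(stmt)
    return buckets


def write_sessions(buckets, out_dir):
    """Write each bucket to its own .sql file and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, bucket in enumerate(buckets):
        path = out / f"session_{i:03d}.sql"
        path.write_text("".join(s.rstrip(";") + ";\n" for s in bucket))
        paths.append(path)
    return paths


def replay(paths, client_cmd):
    """Start one client per file, feed it the SQL on stdin, return elapsed seconds."""
    start = time.monotonic()
    running = []
    for path in paths:
        f = open(path, "rb")
        proc = subprocess.Popen(client_cmd, stdin=f, stdout=subprocess.DEVNULL)
        running.append((proc, f))
    for proc, f in running:
        proc.wait()
        f.close()
    return time.monotonic() - start
```

Calling replay() twice on the same files and keeping only the second number gives the primed-buffer-pool timing, once on bare metal and once in the VM.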