Interesting developments, in particular having multiple MPI tasks able to run simultaneously on a single Kepler chip. Great added flexibility.
When Nvidia did a preview of its next-generation "Kepler" GPU chips back in March, the company's top brass said that they were saving some of the goodies in the Kepler design for the big event at Nvidia's GPU Technology Conference in San Jose, which runs this week. And true to its word, the Kepler GPUs do have some goodies that …
Power vs clock speed
Interesting. However, one thing early on bothered me. If power scales as only the log of the clock speed, that would reduce the sensitivity of the power consumption to clock speed, not increase it. Wouldn't it? So I suspect it may be the other way around, that the log of the power consumption scales with the clock speed.
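To put rough numbers on that argument, here is a minimal sketch comparing the two readings. The function names and the constants `a` and `b` are purely illustrative assumptions, not figures from the article or from Nvidia; the point is only how the slope dP/df behaves under each model.

```python
import math

def power_log_model(f, a=1.0):
    """Hypothetical reading 1: P = a * ln(f)."""
    return a * math.log(f)

def power_exp_model(f, b=1.0):
    """Hypothetical reading 2: ln(P) = b * f, i.e. P = exp(b * f)."""
    return math.exp(b * f)

def sensitivity(model, f, h=1e-6):
    """Central-difference estimate of dP/df at clock speed f."""
    return (model(f + h) - model(f - h)) / (2 * h)

# Clock speeds in arbitrary units (assumed values, for illustration only).
for f in (1.0, 2.0, 3.0):
    print(f, sensitivity(power_log_model, f), sensitivity(power_exp_model, f))
```

Under the first reading the slope is a/f, so power becomes less sensitive to clock speed as the clock rises. Under the second the slope is b·P, so the sensitivity grows along with the power itself, which is the behaviour the argument above expects.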
Re: Power vs clock speed
Yes, you're not the only one to spot that. The article has been amended.
Why the massive hit on double precision in the K10 compared to the Fermi? Is there a technical reason, or is it just to force DP users on to the more expensive K20?
Re: DP bad
It's not just DP that is bad on the Kepler K10: you also get 50% less memory per GPU. While there may be more memory per board, I don't believe the GPUs share memory (i.e. all connectivity between the GPUs is via a PCIe bridge), so if you need to run larger data sets you get a choice of hitting main memory via PCIe or the second GPU via PCIe.
You've got cores and core-groups confused at the start; you write
The Fermi GPU had 512 cores, with 64KB of L1 cache per core and a 768KB L2 cache shared across a group of 32 cores known as a streaming multiprocessor, or SM
where in fact there is a single 768KB L2 cache shared between all 512 cores, and 64KB L1-like memory shared across each SM.
'The Fermi GPU has sixteen streaming multiprocessors, each comprising 32 cores and 64KB of fast memory, and a 768KB L2 cache shared by the sixteen SMs' would be a more correct way to put it.
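As a quick sanity check on that wording, the hierarchy can be written down as a small sketch. The class and field names here are made up for illustration; only the numbers (16 SMs, 32 cores and 64KB per SM, one 768KB L2) come from the comment above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FermiLayout:
    """Illustrative model of the Fermi hierarchy described above."""
    sm_count: int = 16           # streaming multiprocessors per GPU
    cores_per_sm: int = 32       # cores in each SM
    fast_mem_per_sm_kb: int = 64 # L1-like memory, shared within one SM
    l2_cache_kb: int = 768       # single L2, shared by all SMs

    @property
    def total_cores(self) -> int:
        return self.sm_count * self.cores_per_sm

    @property
    def total_fast_mem_kb(self) -> int:
        return self.sm_count * self.fast_mem_per_sm_kb

fermi = FermiLayout()
print(fermi.total_cores)        # 16 * 32 cores
print(fermi.total_fast_mem_kb)  # 16 * 64 KB of per-SM fast memory
print(fermi.l2_cache_kb)        # one L2 for the whole chip, not per group
```

Sixteen SMs of 32 cores each gives the 512 cores quoted in the article, with the 64KB counted once per SM rather than once per core and the L2 counted once per chip.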