The Nvidia-sponsored 2010 GPU Technical Conference kicks off today in San Jose, California, and all of the key HPC players as well as some upstarts will be on hand to try to surf on the cresting wave of CPU-GPU hybrid computing models that will no doubt start taking over the HPC centers of the world and start moving out to our …
Amber benchmark doesn't look so amazing to me...
Those Amber benchmarks are comparing 32 sockets' worth of 6-core 2.6GHz Opteron HEs (in a partition on a large, presumably busy, system) against 8 M2050 GPU boards. According to Top500 (http://www.top500.org/system/10185) those Opterons are rated at ~40 GFLOPS per socket. The latest and greatest x64 processors are pushing on towards 100 GFLOPS per socket, so it seems to me that an 8-socket box stuffed with Xeons could compare favourably against those 8 GPU boards in that benchmark in price and performance.
Why not ARM cores?
Great analysis!
It would be interesting to see general processor functions available. ARM core(s) would be adequate for a basic GNU/Linux kernel, as we're seeing in tablets and smartphones. Good performance and low power consumption would fit with the roadmap.
Why not indeed
That is just what I thought and commented in a related article (submitted but not cleared yet). We can only hope!
What you need
Is a graphics co-processor with an inbuilt ARM core. I know of at least one designed for the mobile space (<1 W) which would make a great building block. Lots of oomph, somewhere between PS2 and PS3 3D performance, HD encode/decode etc. using vector processors and dedicated HW blocks, and a nice ARM core to keep things in order.
Embedding some kind of CPU core would be quite interesting. Trying to keep a 1TFLOP core fed with data in a signal processing application when all you've got is a 'feeble' PCIe connection might be difficult... [I say feeble - of course it's pretty impressive as things stand].
Where's my virtual fag packet...
1 TFLOPS, say two operands per operation, one result = 3 floating point numbers / operation. A floating point number is 4 bytes, so that's 12 bytes per operation. 12 x 1e12 = 12 TByte/sec. Hmmm. Optimistically PCIe is 16GByte/sec, almost 1000 times too slow (though definitely a wet finger in the air worst case guesstimate).
So to me that means getting the data in to the chip, doing about 1000 operations on it, then kicking the result out to something else, and at that the PCIe will be red hot busy. Trouble is, thinking of 1000 operations to perform on every single piece of input data sounds tricky to me for my favourite area (signal processing).
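The fag-packet sum above can be checked with a few lines of Python. This is only a sketch of the same worst-case estimate: the figures (1 TFLOPS, three 4-byte floats per operation, 16 GByte/sec for PCIe) are the ones quoted in the comment, and the function and constant names are made up for illustration.

```python
# Worst-case estimate: every operation streams its operands and result
# over the bus. All figures are the ones quoted in the comment above.
PEAK_FLOPS = 1e12            # 1 TFLOPS
FLOATS_PER_OP = 3            # two operands in, one result out
BYTES_PER_FLOAT = 4          # single precision
PCIE_BYTES_PER_SEC = 16e9    # optimistic PCIe figure


def required_bandwidth(flops=PEAK_FLOPS,
                       floats_per_op=FLOATS_PER_OP,
                       bytes_per_float=BYTES_PER_FLOAT):
    """Bytes per second needed if no data is ever reused on the chip."""
    return flops * floats_per_op * bytes_per_float


def shortfall(link_bytes_per_sec=PCIE_BYTES_PER_SEC):
    """How many times too slow the link is, which is also roughly how many
    operations must be done per transferred operation's worth of data."""
    return required_bandwidth() / link_bytes_per_sec


print(f"required: {required_bandwidth() / 1e12:.0f} TByte/sec")
print(f"PCIe shortfall: {shortfall():.0f}x")
```

With these numbers the shortfall works out at 750 times, which is where the "about 1000 operations" per piece of input data comes from.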
So yes, a good idea, but to really maximise the throughput of such a powerful device Nvidia would have to seriously improve on 'just' PCIe (and whatever else lies beyond it, e.g. 10GEth) as it currently stands. Otherwise you'd have a bored GPU just waiting for the next batch of data to turn up.
And more problems..
If you are unable to feed the boards using PCIe, you might think it is a good idea to create GPU/CPU hybrids, so they don't have to access memory through PCIe.
But in that case, you still have to access memory, and that will be the next bottleneck. If you are able to remedy this, as bazza points out, that nice 10GEth will be your next bottleneck, one that can be solved, for example, by joining multiple interfaces...
Point is, you always have one bottleneck.
I run a Tesla based GPU core as an experiment in my development/research. My stuff is on large n-body simulations (with some interesting twists).
The point is that if you keep the GPU busy, and you are happy with single precision, it will certainly outperform my CPU (AMD 4 core Phenom Black, running Linux). That only happens if you can keep the system busy with work. That means you really, really do need to think about the way you write the application. The concept of SIMD is that all threads really are running the same instructions: introduce branches and you will slow the system down.
Worse is when you want to go back to the core application and make decisions about what to do next. Then you are back in to the rounds of passing data backwards and forwards between the CPU and GPU.
There has to be a lot more work done on this interface. PCIe will not remain fast enough for very long.
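The point above about branches slowing down SIMD execution can be illustrated with a toy cost model. This is a deliberately simplified sketch, not a description of any particular GPU: it assumes a group of threads runs in lock-step and has to step through every branch path that at least one of its threads takes. The function name, the group size of 32 and the cycle costs are all made up for illustration.

```python
def group_cycles(takes_branch, cost_if, cost_else):
    """Cycles for one lock-step group of threads to get past an if/else.

    takes_branch: one boolean per thread, True if that thread takes the 'if'.
    The group pays for each path that any of its threads needs; threads not
    on the current path simply sit idle while it runs.
    """
    cycles = 0
    if any(takes_branch):        # at least one thread needs the 'if' path
        cycles += cost_if
    if not all(takes_branch):    # at least one thread needs the 'else' path
        cycles += cost_else
    return cycles


GROUP_SIZE = 32  # hypothetical lock-step group size

uniform = [True] * GROUP_SIZE                          # every thread agrees
divergent = [i % 2 == 0 for i in range(GROUP_SIZE)]    # threads alternate

print(group_cycles(uniform, cost_if=10, cost_else=10))
print(group_cycles(divergent, cost_if=10, cost_else=10))
```

In this model the divergent group takes twice as long as the uniform one for the same amount of useful work per thread, which is why data is usually arranged so that neighbouring threads take the same path.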
There are a couple of gotchas with ARM:
1) 64bit address space - many applications will need a lot of memory to keep those GPUs busy.
2) Boring stuff like cache and I/O (you will need lots of I/O bandwidth to drive those GPUs) burns a very large proportion of the power budget these days. Amdahl's law will tell you that changing the I/O & cache will have more effect on the power efficiency than changing the core.
Sadly there isn't much advantage in punting ARM cores from a technical point of view.
If Nvidia can't get an x64 ISA license, they could use MIPS (64 bits since the mid 90s), but Nvidia already have some experience and licenses with ARM so it may be more attractive to them.
Rugged systems and bandwidth
Wander over to Simon Collins from GE at the show and see some defense and aerospace targeted rugged CUDA boards in OpenVPX chassis, with 20 x 10G links or more.