Embedding some kind of CPU core would be quite interesting. Trying to keep a 1 TFLOPS core fed with data in a signal processing application when all you've got is a 'feeble' PCIe connection might be difficult... [I say feeble - of course it's pretty impressive as things stand].
Where's my virtual fag packet...
1 TFLOPS, say two operands per operation plus one result = 3 floating point numbers per operation. A floating point number is 4 bytes, so that's 12 bytes per operation. 12 x 1e12 = 12 TByte/sec. Hmmm. Optimistically PCIe is 16 GByte/sec, which is about 750 times too slow, so call it a factor of 1000 (though that's definitely a wet-finger-in-the-air, worst-case guesstimate).
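For anyone who wants to fiddle with the numbers, here's that sum as a few lines of Python. It's only a sketch of the worst case above: the constant and function names are my own, and it assumes every operand and result crosses the link with no reuse at all, which real code would hopefully avoid.

```python
# Back-of-envelope figures from the post; all names here are illustrative.
PEAK_FLOPS = 1e12          # 1 TFLOPS
VALUES_PER_OP = 3          # two operands in, one result out
BYTES_PER_VALUE = 4        # single-precision float
PCIE_BYTES_PER_SEC = 16e9  # optimistic PCIe figure used above


def required_bandwidth(flops=PEAK_FLOPS,
                       values_per_op=VALUES_PER_OP,
                       bytes_per_value=BYTES_PER_VALUE):
    """Bytes/sec needed if nothing is ever reused on the chip (worst case)."""
    return flops * values_per_op * bytes_per_value


def shortfall(link_bytes_per_sec=PCIE_BYTES_PER_SEC):
    """How many times too slow the link is for that worst case."""
    return required_bandwidth() / link_bytes_per_sec


if __name__ == "__main__":
    print(f"needed: {required_bandwidth() / 1e12:.0f} TByte/sec")
    print(f"link is {shortfall():.0f}x too slow")
```

Change PCIE_BYTES_PER_SEC or BYTES_PER_VALUE to see how the gap moves for a different link or for double precision.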
So to me that means getting the data in to the chip, doing about 1000 operations on it, then kicking the result out to something else, and at that the PCIe will be red hot busy. Trouble is, thinking of 1000 operations to perform on every single piece of input data sounds tricky to me for my favourite area (signal processing).
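To put a rough number on how much work each piece of data needs, here's another small Python sketch. It makes a kinder assumption than the worst case above: each 4-byte input value crosses the link exactly once and results are ignored, so it comes out at a few hundred operations per value rather than 1000. The names are made up for illustration and the model ignores latency, overlap and everything else that makes real life harder.

```python
# Illustrative model only: input values cross the link once, output ignored.
PEAK_FLOPS = 1e12          # 1 TFLOPS
BYTES_PER_VALUE = 4        # single-precision float
PCIE_BYTES_PER_SEC = 16e9  # optimistic PCIe figure


def values_per_sec(link_bytes_per_sec=PCIE_BYTES_PER_SEC,
                   bytes_per_value=BYTES_PER_VALUE):
    """How many input values the link can deliver each second."""
    return link_bytes_per_sec / bytes_per_value


def utilisation(ops_per_value):
    """Fraction of peak FLOPS kept busy, capped at 1.0 (compute bound)."""
    return min(1.0, ops_per_value * values_per_sec() / PEAK_FLOPS)


def ops_per_value_to_saturate():
    """Operations per input value needed before the link stops being the limit."""
    return PEAK_FLOPS / values_per_sec()


if __name__ == "__main__":
    for ops in (1, 10, 100, 1000):
        print(f"{ops:5d} ops/value -> {utilisation(ops):.1%} of peak")
    print(f"break-even: {ops_per_value_to_saturate():.0f} ops/value")
```

Either way the conclusion is the same order of magnitude: with only a handful of operations per sample the chip spends nearly all its time waiting on the link.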
So yes, a good idea, but to really maximise the throughput of such a powerful device Nvidia would have to seriously improve on 'just' PCIe (and whatever else lies beyond it, e.g. 10GEth) as it currently stands. Otherwise you'd have a bored GPU just waiting for the next batch of data to turn up.