A top General Electric techie gave a presentation at the GPU Technology Conference this week in San José, California, and discussed the benefits of Remote Direct Memory Access (RDMA) for InfiniBand and its companion GPUDirect method of linking GPU memories to each other across InfiniBand networks. And just for fun, the GE tech …
Wine won't help. After all, the fact that it is NOT emulation means it won't do anything to help run x86 instructions on an ARM chip. So unless Crysis is recompiled for ARM, you won't have any hope of running that.
The PCI-Express bus has been a bottleneck since...
How wonderfully, scarily fast things change.
ok - I got the idea after the 1st page...
and I'm eagerly awaiting the genius Eadon taking the piss out of it
Re: ok - I got the idea after the 1st page...
Indeed, should be using the wonderful Dragon CPU from China as it is more open lol.
A very clear explanation of a very complex subject
This could have been buried in part numbers and stats, but I found it admirably concise. Some writers seem to feel that technical subjects have to be complex to understand, like you have to prove you're smart enough to read their work.
I'm no expert, but I felt this gave me a good enough understanding that RDMA gives you an 8x speed-up in data throughput to a GPU, which sounds like a pretty worthwhile gain.
Given what we now know about how many FLOPS it takes to animate a face this is not to be sniffed at.
Thumbs up for a nice write up.
Cool, seriously cool!
We live in such exciting times. I have several algorithms which could benefit from much faster global access (and just more processing grunt, but that goes without saying). I do worry about how to harness the power embedded in your typical GPU architecture. They do not seem to like data-driven processing order much. Scientifically, that is a challenge of course, not a problem.
It is Dustin Franklin not Dusty.
Not only is this stuff very leading edge, but we can also make the same board start up after a night in a Siberian based tank, or after sitting all day in a helicopter in the Iraqi desert, with convection or conduction cooling. Or work in your data centre!
or April fool
Who in their right mind would create such a beautiful compute machine then install Windows?
Interesting piece, but I worry about any table of results that does not appear to be internally consistent....
The piece does not define what exactly is meant by "latency", but it is odd that the results in the second and third columns (both labelled "latency") are precisely 1/2 of the transfer time that one would compute using the complicated mathematical formula "time = quantity / rate". In this case, for example, one would expect transferring 16 KiB at a rate of 2000 MB/s to take 8.192 microseconds, rather than the 4.09 microseconds stated in the table.
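For anyone who wants to check the arithmetic, here is a minimal Python sketch of "time = quantity / rate". It assumes KiB means 1024 bytes and MB/s means 10^6 bytes per second; the function name is just something I made up for illustration, and only the 16 KiB / 2000 MB/s pair comes from the example above.

```python
def transfer_time_us(size_bytes: int, rate_mb_per_s: float) -> float:
    """Transfer time in microseconds, assuming decimal megabytes (10**6 bytes).

    1 MB/s is exactly 1 byte per microsecond, so bytes / (MB/s) is already
    a time in microseconds.
    """
    return size_bytes / rate_mb_per_s


block_bytes = 16 * 1024          # 16 KiB, assuming binary kibibytes
rate_mb_per_s = 2000.0           # rate quoted in the example above

full_time = transfer_time_us(block_bytes, rate_mb_per_s)
half_time = full_time / 2        # what the "latency" columns appear to show

print(f"computed transfer time: {full_time:.3f} us")
print(f"half of that:           {half_time:.3f} us")
```

Half of the computed time is 4.096 microseconds, which lines up with the 4.09 in the table if that figure was truncated rather than rounded.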
Latency might mean "time for the first bit of the data to arrive" or it might mean "time for the entire block of data to arrive". The latter is not possible given the stated numbers (since all the computed transfer times are precisely twice the stated "latencies"), while the former would imply a mildly perverse buffering scheme that always buffered precisely 1/2 of the data before beginning to deliver it to the accelerator.
Buffering exactly 1/2 of the data is perhaps not as crazy as it sounds -- such schemes are sometimes (often?) used in optimized rate-matching interfaces. If the input is guaranteed to be a contiguous block, then buffering exactly 1/2 the data allows the buffer to transmit the output data at 2x the input rate after pausing for 1/2 of the transfer time. Such a scheme minimizes the latency between the arrival and delivery of the final bit of data in the input block. Unfortunately, it also makes the actual hardware latency invisible (provided that the hardware latency is less than 1/2 of the transfer time of the smallest block with reported results).
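A rough Python sketch of that half-buffering schedule, under my own assumptions: the input arrives as one contiguous block at a constant rate, the output runs at exactly twice that rate, and the hardware latency is ignored. The function names are hypothetical, and this is only a model of the timing, not a claim about how the hardware in the article actually works.

```python
def half_buffer_schedule(block_bytes: float, in_rate: float):
    """Return (input_done, output_start, output_done) times.

    in_rate is in bytes per microsecond (numerically equal to MB/s),
    so all times are in microseconds.
    """
    input_done = block_bytes / in_rate        # last input byte arrives
    output_start = input_done / 2             # pause for half the transfer time
    out_rate = 2 * in_rate                    # then drain at twice the input rate
    output_done = output_start + block_bytes / out_rate
    return input_done, output_start, output_done


def never_underflows(block_bytes: float, in_rate: float, steps: int = 1000) -> bool:
    """Check that the output never asks for bytes that have not arrived yet."""
    _, output_start, output_done = half_buffer_schedule(block_bytes, in_rate)
    out_rate = 2 * in_rate
    for i in range(steps + 1):
        t = output_start + (output_done - output_start) * i / steps
        arrived = min(block_bytes, in_rate * t)
        sent = out_rate * (t - output_start)
        if sent > arrived + 1e-6:             # small tolerance for float rounding
            return False
    return True


input_done, output_start, output_done = half_buffer_schedule(16 * 1024, 2000.0)
print(f"output starts at {output_start:.3f} us, finishes at {output_done:.3f} us")
print(f"input finishes at {input_done:.3f} us")
print("buffer never underflows:", never_underflows(16 * 1024, 2000.0))
```

In this model the last output byte leaves at the same moment the last input byte arrives, which is the "minimal latency for the final bit" property described above, and the first output byte appears at exactly half the transfer time.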
Whether this buffering scheme makes sense depends a lot on the data access patterns of the subsequent processing steps. If the subsequent step demands that a full block be in place before starting, then this is the way to go. On the other hand, many signal processing algorithms could pipeline operations with data transfers in smaller blocks, in which case a different buffering scheme might make more sense.