Nvidia has announced the latest version of its GPU programming language, CUDA 6, which adds a "Unified Memory" capability that, as its name implies, relieves programmers from the trials and tribulations of having to manually copy data back and forth between separate CPU and GPU memory spaces.
[Figure: CUDA 6 Unified Memory schematic]
Unless I am missing something obvious, I've signed up (something I rarely do), waited to be approved, and now that I am approved there is still nothing for me to download. (Just harvesting email addresses?)
It seems that this unified memory is a response to what AMD has been talking about. (It's hard to tell who planned it first since this all happens behind closed doors, but at least AMD was talking about it first.)
It makes sense because it is a PITA to upload/download explicitly to/from the device.
Re: Unified Memory
It is a response to what AMD have done with their APUs, but it's definitely a sticking plaster (band-aid for our transatlantic cousins) for NVidia's problem. Sure, the programmer doesn't have to worry about transferring data between CPU and GPU anymore, but the hardware does and the latency is still there.
If anything this could make the situation worse for NVidia. Before this, when the programmer had to do their own data transfers, the latency was explicitly there in the source code. It was practically shouting "this is painful and slow, don't do this too often coz it'll be a slow steaming pile of shite". Now that it's all hidden from the programmer it is easy to write working code with little evidence of inefficiency in the source. Laziness will become harder to spot.
In AMD land where the APUs have properly unified memory at the electronics level everyone wins. There are no inefficient data copies to be done at all, so source code that looks efficient ends up being efficient. That's a very good thing. It's not something that NVidia can compete with unless they start building themselves a serious x86 core.
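The "hidden copies" argument in the comment above can be made concrete with a toy model in plain C++ (no GPU involved). `ManagedBuffer`, `ping_pong` and `batched` are hypothetical names invented for this sketch, and the rule "one migration every time the other side touches the data" is an assumption for illustration, not a description of how CUDA 6 actually moves data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a "managed" buffer: one logical array, two physical copies,
// and a counter for the transfers performed behind the programmer's back.
// Hypothetical illustration only; this is not CUDA's actual mechanism.
class ManagedBuffer {
public:
    explicit ManagedBuffer(std::size_t n) : host_(n, 0.0f), device_(n, 0.0f) {}

    // Access from CPU code: migrate first if the valid copy is on the device.
    std::vector<float>& on_host() {
        assert(host_valid_ || device_valid_);  // one copy is always current
        if (!host_valid_) { host_ = device_; host_valid_ = true; ++migrations_; }
        device_valid_ = false;  // assume the caller may write
        return host_;
    }

    // Access from "GPU" code: migrate first if the valid copy is on the host.
    std::vector<float>& on_device() {
        assert(host_valid_ || device_valid_);
        if (!device_valid_) { device_ = host_; device_valid_ = true; ++migrations_; }
        host_valid_ = false;  // assume the caller may write
        return device_;
    }

    std::size_t migrations() const { return migrations_; }

private:
    std::vector<float> host_, device_;
    bool host_valid_ = true, device_valid_ = false;
    std::size_t migrations_ = 0;
};

// Innocent-looking code: alternate CPU and "GPU" accesses on every step.
std::size_t ping_pong(std::size_t steps) {
    ManagedBuffer buf(1024);
    for (std::size_t i = 0; i < steps; ++i) {
        buf.on_host()[0] += 1.0f;    // CPU touches the data
        buf.on_device()[0] *= 2.0f;  // "kernel" touches the data
    }
    return buf.migrations();
}

// Same number of accesses, batched: all CPU steps first, then all GPU steps.
std::size_t batched(std::size_t steps) {
    ManagedBuffer buf(1024);
    for (std::size_t i = 0; i < steps; ++i) buf.on_host()[0] += 1.0f;
    for (std::size_t i = 0; i < steps; ++i) buf.on_device()[0] *= 2.0f;
    return buf.migrations();
}
```

Under these assumptions `ping_pong(100)` counts 199 migrations while `batched(100)` counts 1, although neither loop contains a visible copy call.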
Re: Unified Memory
Quote: "unless they start building themselves a serious x86 core". That is "Plan A".
Nvidia also has a "Plan B": bury the hatchet in x86's back and start shipping ARM cores on their chips. Even a 64-bit ARM core will be only a minor addition to the BOM and heat envelope, and it will have synchronized memory and working spinlocks. The underlying x86 will become a mere carrier of ARM blades. One step further and it will disappear altogether in some setups.
I would not be surprised if we hear from them that they have implemented CUDA 6 properly (not as a steaming pile of hacks and kludges) on such hardware and are shipping it.
Re: Unified Memory
"If anything this could make the situation worse for NVidia. Before this, when the programmer had to do their own data transfers, the latency was explicitly there in the source code."
I got the same feeling on reading the article. The latencies are still there, but now they're just hidden behind a software translation layer. I'll agree that doing explicit DMA or other main memory <-> device memory transfers is annoying, but we already have a technique for hiding DMA latencies(*), namely double (multi) buffering.
Multi-buffering can, for many problems, not only "hide" the latencies, but effectively eliminate them for all but the first block to be transferred. If this new feature does automatic loop unrolling and transparently adds multi-buffering (or even just double-buffering) when it detects it should be used, then that would be pretty nifty. Unfortunately, judging by the description in the article, this isn't what it's doing, and all we get is blocking, full-latency access to the "shared" memory, with "shared" in quotes because it's only a software abstraction, not a hardware feature. I could be pleasantly surprised, but from the article, it seems like it's only a sop to lazy programmers, and not real shared memory at all.
(*) I'm not actually up to speed on CUDA, so I'm assuming it uses DMA to do data transfers?
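The double-buffering idea described in the comment above can be sketched in plain C++, with a background task standing in for the DMA engine. `transfer_chunk`, `process_chunk` and `sum_double_buffered` are hypothetical names, and `std::async` is only a stand-in for an asynchronous device transfer; this is a sketch of the pattern, not CUDA code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a host-to-device transfer: copies one chunk of
// `src` into the staging buffer `dst`.
void transfer_chunk(const std::vector<float>& src, std::size_t begin,
                    std::size_t end, std::vector<float>& dst) {
    dst.assign(src.begin() + begin, src.begin() + end);
}

// Hypothetical stand-in for a kernel: reduces one staged chunk.
double process_chunk(const std::vector<float>& chunk) {
    return std::accumulate(chunk.begin(), chunk.end(), 0.0);
}

// Double-buffered pipeline: while chunk i is processed from one buffer,
// chunk i+1 is already being transferred into the other.
double sum_double_buffered(const std::vector<float>& data, std::size_t chunk) {
    assert(chunk > 0);
    std::vector<float> buffers[2];
    double total = 0.0;
    if (data.empty()) return total;

    // Only the first transfer is fully exposed; nothing can overlap with it.
    transfer_chunk(data, 0, std::min(chunk, data.size()), buffers[0]);

    int current = 0;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        std::size_t next_begin = begin + chunk;
        std::future<void> pending;
        if (next_begin < data.size()) {
            std::size_t next_end = std::min(next_begin + chunk, data.size());
            // Start fetching the next chunk into the idle buffer.
            pending = std::async(std::launch::async, transfer_chunk,
                                 std::cref(data), next_begin, next_end,
                                 std::ref(buffers[1 - current]));
        }
        total += process_chunk(buffers[current]);  // overlaps with the fetch
        if (pending.valid()) pending.get();        // wait before swapping
        current = 1 - current;
    }
    return total;
}
```

The result is the same as a plain loop over the data; the point is the ordering, where every transfer after the first runs while the previous chunk is being processed.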
Re: Unified Memory
Agreed. While explicit memory transfers are a PITA unless the development environment provides good tools, having implicit, potentially unknown, memory transfers is just asking for inefficiency. It's pretty much on a par with the inefficiency you find anywhere near anything remotely .NET where a "string" is involved.
However, massively parallel programming is a bugger to get your head around when it comes to coordinating many processes that may or may not rely on each other. Forward planning by initiating a memory fetch of blocks that are known to be of interest is easy at the first level, but it very rapidly gets far too complicated. Eventually, for all but a few coders much more sadistic than I am, it will turn out to be more efficient to have a suitably "smart" development environment perform many of the optimisations.
The Free Software Foundation might be interested to hear that GCC belongs to Mentor Graphics! Do you think we should thrill them?
Yeh, I read that like WTF exactly? You can take that 1 way and it can work (maybe they supply extensions to GCC?), but the way it is worded seems to be going the wrong way. I'm assuming the author believes that if you use C then you can only use the GCC compiler? All C is "GCC"? ... jeesh, I wish...or maybe I just wish there was only 1 compiler.
I can't see how this fanciful wrapper can be anything other than a PR move.
"From the point of view of the developer using CUDA 6, the memory spaces of the CPU and GPU might as well be physically one and the same. 'The developer now can just operate on the data,' Gupta says."
Either a very ill-conceived PR move, or they are just calling us idiots to our faces.
That's a pretty unfair comment, I think. Whilst some of us don't mind thinking in multiple different address spaces all at once (if you think 2 address spaces is hard, try the 69+ you get with VME...), anything that makes the programmer's job easier is a good thing for NVidia; they'll sell more product because of it.
At the software level they are copying what AMD have done, which is understandable. They can't copy AMD at the hardware level though, so they may start to struggle to compete in terms of whole system performance.
As long as you get loud warnings telling you that you are probably doing something that will bite you later in this and that piece of code, I'm for it.
64-bit memory space
With 32-bit systems, this was an insurmountable problem. With 64-bit ones, it is a matter of memory mapping of the GPU memory into the CPU's virtual memory space. In truth, this is not a difficult problem, and the fact that it hasn't happened until now is not a "cudo" to nVidia! Although, I will admit that the issues are more likely business process related than anything else, and those are always more difficult to overcome than the merely technical!! :-)
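The address-space arithmetic behind the comment above can be checked in a few lines of C++. The function names are hypothetical, the memory sizes are illustrative figures rather than the specs of any particular machine, and the check ignores the part of the address space that an operating system reserves for itself.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kGiB = std::uint64_t{1} << 30;

// Bytes addressable with a pointer of the given width (valid for 1..63 bits).
constexpr std::uint64_t address_space_bytes(unsigned pointer_bits) {
    return std::uint64_t{1} << pointer_bits;
}

// True if host RAM and GPU memory can both be mapped in full into one flat
// virtual address space of the given width.
constexpr bool fits_unified(unsigned pointer_bits, std::uint64_t host_bytes,
                            std::uint64_t gpu_bytes) {
    return host_bytes + gpu_bytes <= address_space_bytes(pointer_bits);
}
```

A 32-bit pointer covers 4 GiB in total, so `fits_unified(32, 4 * kGiB, 2 * kGiB)` is false, while a 48-bit virtual address space holds the same pair with room to spare.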
Meanwhile, Intel are adding very wide GPU-like operations to their own instruction set and these, obviously, enjoy the same single view of memory. Sounds like the cycle of re-incarnation is nearing completion, with everything ending up on the one [CG]PU.
Did they fix the security?
Hope they fixed the security problems...
The original problem was that there was nothing preventing a user from loading data into the GPU, then downloading it into any location in physical memory. Been there, saw that. The Cray MP line would crash (most often) because of that.
All they had to do was add an IOMMU to the board. That way the host system would be able to limit the IO to just one user... and if they chose to crap on themselves it didn't matter.
I know the Cray nearly didn't pass evaluation on low level security audits because of it. Only the nodes without the GPU passed.
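What an IOMMU adds, as described in the comment above, can be sketched as a toy translation table in C++. `ToyIommu` and its methods are hypothetical names for this illustration; real IOMMUs use hardware page tables per device, and nothing here reflects the Cray systems mentioned.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

constexpr std::uint64_t kPageSize = 4096;

// Toy IOMMU: a per-device table mapping device-visible page numbers to
// physical page numbers. DMA to an unmapped page is rejected.
class ToyIommu {
public:
    // The OS grants the device access to exactly one physical page.
    void map_page(std::uint64_t device_page, std::uint64_t physical_page) {
        table_[device_page] = physical_page;
    }

    // Translate a device address. Returns false (the access faults) if the
    // page was never granted; otherwise writes the physical address.
    bool translate(std::uint64_t device_addr, std::uint64_t& physical_addr) const {
        auto it = table_.find(device_addr / kPageSize);
        if (it == table_.end()) return false;
        physical_addr = it->second * kPageSize + device_addr % kPageSize;
        return true;
    }

private:
    std::map<std::uint64_t, std::uint64_t> table_;
};
```

With only the pages of one user's buffers mapped, a transfer aimed anywhere else in physical memory fails the lookup instead of overwriting someone else's data.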