Hmm...
Like Samsung was doing with ARM about 8 years ago?
:-D
Intel said it was working on stacking a layer of memory on its Xeon processors to run memory-bound workloads faster. It said this in a pitch at the Denver-based Supercomputing Conference (SC13) which is running from 17 to 22 Nov. According to an EE Times report, Intel's Rajeeb Hazra, a VP and general manager of its data …
"Why on earth make a difference between internal and external memory ?"
So that you can make your memory-bound workloads run faster by telling your application to use the internal rather than the external memory. It's quite likely that the memory mapping will be taken care of by the compiler so that it will all work like magic for programmers who don't like complexity.
Work harder, little engine.
Someone is dangling gigabytes of monstrous-bandwidth, low-latency memory in front of you, and your first reaction is "it's tooo haaarrd"??
One day, it will be one big contiguous memory space. Until then, programmers will have to earn their pay packets. Perhaps that will sort out the cans from the can'ts.
(I speak as a programmer, not a hardware guy).
Chances are it'll be well-hidden down in the O/S's virtual to physical page translation, so it'll look like one big contiguous space if your program isn't the sort to need every speed optimisation it can find. You'll probably have access to an extended malloc with a flag to request near memory, and be able to request that pages be locked in near or in far memory. The paging system will probably have algorithms to move busy pages of far memory into near memory and to move idle pages of near memory outwards.
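For a feel of how that extended malloc might look, here's a rough C sketch that borrows today's libnuma calls as a stand-in; the node numbers are pure assumption (the stacked memory would presumably just show up as another NUMA node):

#include <numa.h>      /* libnuma: link with -lnuma */
#include <stdio.h>
#include <string.h>

#define NEAR_NODE 0    /* assumption: node 0 is the stacked, on-package memory */
#define FAR_NODE  1    /* assumption: node 1 is the ordinary DIMMs */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this kernel\n");
        return 1;
    }

    /* Ask for 64 MB backed by the "near" node - today's analogue of
       an extended malloc with a near-memory flag. */
    size_t len = 64UL << 20;
    void *hot  = numa_alloc_onnode(len, NEAR_NODE);
    void *cold = numa_alloc_onnode(len, FAR_NODE);
    if (!hot || !cold) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    memset(hot, 0, len);    /* touch the pages so they actually get placed */
    memset(cold, 0, len);

    numa_free(hot, len);
    numa_free(cold, len);
    return 0;
}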
There's already a sort of near/far distinction on some multi-CPU systems, in that memory is local to a CPU or local to a different CPU. On the Quad CPU AMD system I once looked at in detail, 1/4 of the memory was local, 1/2 was one CPU hop away, and 1/4 was two hops away (so effectively three levels).
Nigel 11, you have it right regarding NUMA systems.
And as you say, quite run-of-the-mill multi-CPU motherboards are already NUMA systems.
And there are much bigger NUMA systems out there!
Install the absolutely great tool 'hwloc' from the OpenMPI project
http://www.open-mpi.org/projects/hwloc/
You can get a graphical display of how your system is laid out.
Assuming you are running Linux, install the 'numactl' package and use
numactl --hardware
numastat
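And if you'd rather poke at it from C, a tiny sketch with libnuma's numa_distance() prints the same relative-cost table that numactl reports (purely illustrative; the node count and distances depend entirely on your box):

#include <numa.h>   /* libnuma: compile with -lnuma */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported here\n");
        return 1;
    }

    int nodes = numa_max_node() + 1;
    printf("Relative access cost between nodes (10 = local):\n");
    for (int from = 0; from < nodes; from++) {
        for (int to = 0; to < nodes; to++)
            printf("%4d", numa_distance(from, to));
        printf("\n");
    }
    return 0;
}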
When Intel pulled out of the Hybrid Memory Cube Consortium, the reckoning was that they intended to roll their own version, and this is it! It's probably put together with Micron as a partner, and is close to, but not exactly the same as, HMC.
The high end will be interesting. NVidia is into HMC, and AMD is likely working up the idea too.
And the idea of memory stacking fits the mobe market as well, so ARM is in the game!
Roll on persistent carbon nanotube memory. That will upset the applecart again!
HMC talks of terabyte-per-second speeds, so it will have a big impact on performance.
"There would also need to be data moving or tiering software to transfer data from Far Memory into Near Memory and vice versa."
Like the OS... :)
With a *nix you could set up the far memory as a swap device, no application tweaking necessary.
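Something like this minimal sketch, using the Linux swapon(2) call - /dev/far_mem0 is a made-up name for whatever block device the far memory got exposed as, and in practice you'd just mkswap it and add a line to fstab:

#include <sys/swap.h>   /* swapon(), SWAP_FLAG_* (Linux-specific) */
#include <stdio.h>

int main(void)
{
    /* Hypothetical block device backed by the far-memory tier.
       Give it a high priority so it's preferred over disk swap. */
    int prio  = 100;
    int flags = SWAP_FLAG_PREFER |
                ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);

    /* Needs root, and the device must have been mkswap'd first. */
    if (swapon("/dev/far_mem0", flags) != 0) {
        perror("swapon");
        return 1;
    }
    return 0;
}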
A long time ago I was infatuated with the idea that you could simply ditch the L2->N caches and slap in a chunk of fast, wide local memory instead, and let the MMU and the kernel handle the caching of stuff in that local memory. The idea behind it was to let folks get more deterministic behaviour from their code by removing the async caching logic from the equation. It wasn't my finest idea; the benefits would have been small and the downsides pretty huge, I think. :)
Funny to watch Intel scrabble around looking for USPs that other folks have already done. :)
Are you sure?
The main issue with RAM on CPUs is the area required (which is also why it is statistically more likely to have a fault). Redundancy is easy (create a block of X units where the product requires X-1, and disable one unit). Intel released papers in the '90s on how they do this.
Putting more RAM on as a second layer (i.e. stacked) lets you get very high bandwidth (a wide, short bus) without the complexity required to achieve the same thing from an off-die memory subsystem, where trace path lengths can result in timing issues.
All systems with virtual memory have the O/S managing the physical pages of memory (and the backing storage). The hardware does the virtual-to-physical address translation when the data has a current physical address, and throws a "page fault" when the data has to be moved from backing storage into a free physical page; the O/S handles the page faults. With different classes of RAM, the O/S will also be managing the movement of data between near and far physical pages when necessary, while the virtual address of the data that the programmer uses won't change.
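On Linux you can already sketch that movement with move_pages(2), which migrates the physical pages between nodes while the virtual addresses stay put; node 0 standing in for "near" here is just an assumption:

#include <numaif.h>   /* move_pages(), MPOL_MF_MOVE; link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    size_t npages = 16;

    /* Allocate a few pages and touch them so they get physical frames. */
    char *buf = aligned_alloc(pagesize, npages * pagesize);
    memset(buf, 1, npages * pagesize);

    void *pages[16];
    int nodes[16], status[16];
    for (size_t i = 0; i < npages; i++) {
        pages[i] = buf + i * pagesize;
        nodes[i] = 0;              /* assumption: node 0 plays "near memory" */
    }

    /* Migrate the physical pages; the pointers in 'pages' stay valid. */
    if (move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE) != 0)
        perror("move_pages");
    else
        printf("page 0 now on node %d, same virtual address %p\n",
               status[0], pages[0]);

    free(buf);
    return 0;
}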
NUMA = non-uniform memory access.
My first exposure to NUMA was in DEC's "Wildfire" product family (aka AlphaServer GS80/GS160/GS320), which was an Alpha-based high-end (for those days) server with up to 32 processors, in chunks of four per box. Each box (known as a "quad building block") had a certain amount of local memory, accessed relatively quickly, and could also access (transparently, within the address space) the memory in other boxes. Although the software didn't directly know whether memory accesses were remote or local, there were performance penalties for remote access. Depending on the application and OS, those penalties might be small or significant. Wildfire was not initially marketed as a NUMA system.
Now that Intel have invented NUMA again, everything will be correspondingly faster. However, there will still be penalties for accessing non-local memory...
One of the big challenges for Intel, just as it was for DEC's Wildfire, will be maintaining cache coherence across the processor/chip/box boundaries. Alpha systems should have had it relatively easy there, because the architectural model said it was a Bad Idea to assume full-time memory coherence across multiple processors. Wildfire's performance still suffered because of system design issues - there was a noticeable latency in remote access (factor of 3?). The next generation of Alpha boxes, based on "EV7" Alpha chips with lots of interconnect stuff built into the CPU chip rather than in the external support chipset, significantly reduced the remote access penalty, to the extent that NUMAness was barely relevant, except perhaps at the level of the OS scheduler (don't move stuff from one CPU to another unless you have to - bit of context in the Suse/UKUUG snippet below).
On the other hand, x86 systems and applications have, for the last few decades, typically been designed around the legacy x86 concept of all memory being coherent all of the time (or has AMD64 finally fixed that, given that Opteron introduced ccNUMA on x86-64?). If that hasn't been properly fixed, it's going to be a right pain making this work effectively on anything other than a Powerpoint slide for the journalists.
http://www.compaq.com/alphaserver/archive/gs320/
http://www.ukuug.org/events/linux2001/papers/html/AArcangeli-numa.html
etc
Currently one of the big problems in the embedded world is that you have extremely little memory. Adding external memory typically means having to go to BGA packages and 8-layer PCBs, which is rather expensive.
Now if you had a significant amount of memory (>16 megabytes) inside your CPU, this might be a competitive edge for Intel over ARM in the embedded market.