Intel is readying its many-core "Knights Corner" coprocessor for release next year, but programming the 50-core-plus beastie and its 100 or 200 core follow-ons will require a whole new approach to code wrangling. "Something that might have been only a couple per cent overhead at most in a one- or two-core system may become …
It'd be nice if everything was so convenient
This sort of coding (concern for data locality in massively-parallel architectures) has been quite normal in supercomputing for decades. And in the GPGPU programming segment for years. And it still is darn hard. The basic problem is that not all applications nicely slot into something where data will conveniently be located locally to where it is needed. Hence Berkeley can come up with a fastest-ever parallel algorithm for matrix multiplication - because they chose a problem (matrix mult) that *can* take full advantage of the way modern multi-core systems work. And it's *still* hard to code! Try doing the same with parallel algorithms that don't conveniently have data locality and you find the system runs hardly any faster than on a normal CPU because memory bottlenecks dominate. Sure, *sometimes* you can get it faster with tricks like "compute the same variable per core", but funnily enough Intel won't be telling us of all the failures too (which I suspect vastly outweigh the successes - nobody can publish failures because it might just be their lack of imagination in solving the problem and so gets rejected by reviewers).
Sorry, adding more cores onto a chip isn't some kind of revolution - it's an admission of defeat. Trying to tell us that it's our fault 'cause we need to get smarter in coding is raw marketing spin.
Isn't it already a good improvement to have one core per process?
If you think about Google Chrome that would mean one core per tab... sounds nice to me.
but you need more....
you need one core per gestured finger on every pane; it's not like the core can do something else while waiting for input. Can it?
maybe it requires a hardware rethink....
I think that 200 * MESI protocol latency as a potential memory latency might be a stopper here; that is about the DRAM speed of an IBM XT. The expected latency could be much lower if everyone rewrites their software to suit a rather brain-dead architecture.
At 100x memory latency, it is crouching in the field that the original virtual memory architects worked in; their solution was to surrender the scheduling of memory to the OS, and to implement in hardware the address translation needed to make it work.
The same thinking can be applied, but at the granularity of cache lines: let the OS be responsible for scheduling the cache, and implement a very high-speed link under program control to perform it. In this, the shared and remote memory becomes more like a storage device, and the machines really operate on local cache-like RAM.
With this, you don't need a massive re-implementation of all the software in the world to take good control of these resources. You will want to pay a lot more attention to locality of reference, however. A sparsely addressed list will chug along at 8086 speeds.....
What this doesn't do is permit a single app [ MIMD ] to span the entire processor set; rather you have to be content with NUMA-esque islands of smaller, fully connected SMPs.
This is the architecture of the big-time supercomputers anyway [ unless Fujitsu made an 80000 node cache coherent interconnect recently ], and is the one with scalability.
Can I still join in the other Reinders' Games ?
Rudolph T. R. N. Reinder
"Intel parallelism guru "
He's a guru? Doesn't sound like it. Just one example : "Well, if you put it in a variable, and then a hundred processors access that variable to get that data, you've got a bottleneck." - and that's foreign to his brain. This point has been obvious for decades to anyone who has had to deal with as few as two processors in a system.
He's making it out to be a cool new field of thinking, but it is really just re-invention to quite a few old farts out there.
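For anyone who hasn't met the shared-variable bottleneck, here is a rough sketch of the usual workaround: each worker accumulates into its own local variable and the results are merged once at the end. The function names (sum_shared, sum_local) are made up for illustration, and Python threads won't show the real hardware cost because of the interpreter lock; only the structure is the point.

```python
import threading

def sum_shared(chunks):
    """Every worker updates ONE shared total under a lock (the bottleneck)."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        for x in chunk:
            with lock:          # every single update contends for this lock
                total += x

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

def sum_local(chunks):
    """Each worker keeps a private accumulator; results are merged once."""
    partials = [0] * len(chunks)

    def worker(i, chunk):
        acc = 0                 # private to this worker, no contention
        for x in chunk:
            acc += x
        partials[i] = acc       # one write per worker, each to its own slot

    threads = [threading.Thread(target=worker, args=(i, c))
               for i, c in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)
```

Both versions give the same answer; the second simply touches shared state once per worker instead of once per element.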
is it me or does he sound
like a bit of a moron?
"Gee, I mean, programming is hard that's for sure, certainly for an old hick like me. Some of that stuff is really, you know, difficult, Sheesh!"
This is completely wasted on ~100% of commercial software
In that part of the software market, it's all about rapid application development, and sod the efficiency. They rely on Moore's Law to make sure that by the time their software hits customer systems, the computers are powerful enough to cope.
So MIC processors will be completely wasted on commercial boxes, which is where the majority of the systems will be sold.
Even if someone (extremely cleverly) produces an IDE that can generate parallel code to make good use of many-cores, much of the workload that is done is not suited to run in a parallel manner anyway.
Apologies in advance to those that do, but most new programmers nowadays are never taught about registers, how cache works, the actual instruction set that machines use, and I'm sure that there are a lot of people reading even on this site who do not really understand what a coherent cache actually is.
I work with people who are trying to make certain large computer models more parallel, and they are very aware that communication and memory bandwidth is the key. Code that is already parallel tops out at a much smaller number of cores than the current systems that they have available can provide. And the next generation system, which will have still more cores, may not actually run their code much faster than the current one.
But even these people, many of whom have dedicated their working lives to making large computational models work on top 500 supercomputers, don't really want to have to worry about this level. They rely on the compilers and runtimes to make sensible decisions about how variables are stored, arguments are passed, and inter-thread communication is handled.
And when these decisions are wrong, things get complex. We found recently that a particular vendor-optimised matrix multiplication routine stomped all over carefully written code by generating threads for all cores in the system, ignoring the fact that all the cores were already occupied running separately coded threads. We ended up with each lock-stepped thread generating many times more threads during the matmul than there were cores, completely trashing the cache and causing multiple thread context switches. It actually slowed the code down compared to running the non-threaded version of the same routine.
It will be a whole new ball game even for these people who do understand it if they have to start thinking still more about localization of memory, and if they will have difficulty, the average commercial programmer writing in Java or C# won't have a clue!
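To put numbers on the matmul oversubscription described above, here is a back-of-envelope sketch. The helper names and the core counts are hypothetical (the post names neither the vendor library nor the machine); it only shows how quickly nested threading multiplies, and what a per-caller thread budget would look like if the library let you set one.

```python
def total_runnable_threads(outer_threads, inner_per_outer):
    """Threads competing for cores when every outer thread spawns its own pool."""
    return outer_threads * inner_per_outer

def inner_thread_budget(total_cores, outer_threads):
    """Hypothetical cap: split the cores evenly among outer threads, minimum 1."""
    return max(1, total_cores // outer_threads)

# Assumed example: 32 cores, 32 application threads already running,
# and a library that naively starts one thread per core in every call.
cores = 32
outer = 32
naive = total_runnable_threads(outer, cores)                  # 32 * 32
capped = total_runnable_threads(outer, inner_thread_budget(cores, outer))
print(naive, capped)
```

With those assumed numbers the naive case has 1024 threads fighting over 32 cores, while the capped case stays at 32.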
Don't share anything
I can only see massive parallelism working if you stop sharing memory across processors. Shared data will inevitably become a bottleneck even if you add kludges like transactional shared memory or coherent caches. Memory is also a bad model because it implies near-instant response to accesses, which can only happen with small local memories.
So each core needs its own small, local unshared memory and all sharing is done through explicit communication. Sure, you now have a problem of minimizing communication, but at least the cost is now clearly visible. Local neighbour-to-neighbour communication can be reasonably fast, but you still should not wait for response. Instead, you handle incoming data as it comes and send data when it is ready.
This needs different programming paradigms than C or Java, as both rely intimately on shared mutable data. Sure, you can get some way by adding library functions for parallelism, but these will be kludges akin to adding jet engines to biplanes -- the basic design inhibits full exploitation of the available power.
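A minimal sketch of the style described above: each stage keeps its own state and only talks to its neighbour through an explicit queue, handling data as it arrives. The names (stage, run_pipeline) are invented for illustration, and Python threads do technically share an address space, so treat the queues as standing in for hardware links rather than as a faithful model of a share-nothing chip.

```python
import queue
import threading

def stage(inbox, outbox, fn):
    """Process items as they arrive; None is the agreed end-of-stream marker."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)    # pass the shutdown signal to the neighbour
            return
        outbox.put(fn(item))    # send as soon as the result is ready

def run_pipeline(values):
    """Two-stage pipeline: add one, then double. Values must not be None."""
    q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=stage, args=(q0, q1, lambda x: x + 1)),
        threading.Thread(target=stage, args=(q1, q2, lambda x: x * 2)),
    ]
    for w in workers:
        w.start()
    for v in values:
        q0.put(v)
    q0.put(None)

    results = []
    while True:
        item = q2.get()
        if item is None:
            break
        results.append(item)
    for w in workers:
        w.join()
    return results
```

The only communication cost is the queue traffic, and it is all visible in the code, which is rather the point of the approach.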
"This new thinking is bubbling to the surface"
Except that it isn't even remotely new thinking.
We were doing this on Transputers *decades* ago...
This "guru" might want to do a bit more background reading...
I was going to mention transputers in my last post
but I decided that it was long enough already!
Ahh, but they were a British invention and so "never existed" - the USA is only just catching up !
> Ahh, but they were a British invention and so "never existed"
But the IntellaSys SEAforth is a Merkin chip, and that's multi-cored as well.
Intel do seem to be somewhat behind the times...
An old problem
This is just a distributed system on a chip. The usual rules apply.
I was never skilful enough to code in the demo scene, but I remember that when PCs were taking over it was said that CPUs had by then become faster at calculating trig functions directly than at looking up precomputed values in memory.
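For readers who never saw the trick, a rough sketch of the two options is below. The table size and the names (SIN_TABLE, sin_lookup) are arbitrary choices for illustration; which one is faster depends entirely on the machine, and this says nothing about timing, only about the accuracy you trade away for the lookup.

```python
import math

TABLE_SIZE = 1024   # assumed size; demo-era tables were often much smaller
SIN_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]

def sin_lookup(angle):
    """Approximate sin(angle) by reading the nearest-below table entry."""
    index = math.floor(angle / (2 * math.pi) * TABLE_SIZE) % TABLE_SIZE
    return SIN_TABLE[index]     # one memory access instead of a computation

def sin_direct(angle):
    """Compute sin(angle) on the spot, no memory traffic for the table."""
    return math.sin(angle)
```

With 1024 entries the lookup is off by at most about 2*pi/1024, roughly 0.006, which was usually good enough for graphics effects.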