US nuke boffins say they have seen the future of multicore computing, and it is troubled. Researchers at the Sandia national lab say that their projections indicate that performance gains flatten out badly after quad cores and cease altogether after eight - and beyond that point, performance actually worsens as more cores are …
20 year old news?
This theoretical limit was known 20 years ago. The more cores, the more inter-core communication is required and eventually the cores do nothing but talk to each other.
When I was at university, the core limit was said to be 10. Each core absorbed 10% of the total available CPU just communicating.
That is why Sun and others concentrate on hardware threads within the core. Main memory is so slow compared with on-chip caches that it pays to have additional hardware threads available to run on the CPU while data is being fetched from memory. That way you can get each core running for much more of the time.
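A back-of-envelope sketch of why extra hardware threads help, in Python. This is an idealised model, not a description of any real Sun chip: it assumes every thread alternates between a fixed burst of computing and a fixed memory stall, and that switching threads is free. The function name and the cycle counts are made up for illustration.

```python
def core_utilisation(threads, compute_cycles, stall_cycles):
    """Fraction of time one core does useful work (idealised model).

    Each hardware thread computes for compute_cycles, then waits
    stall_cycles for memory. While one thread waits, another may run.
    """
    if threads < 1 or compute_cycles <= 0 or stall_cycles < 0:
        raise ValueError("need threads >= 1, compute_cycles > 0, stall_cycles >= 0")
    # One thread keeps the core busy for this share of the time.
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    # Threads fill each other's stalls until the core is saturated.
    return min(1.0, threads * per_thread)


if __name__ == "__main__":
    # Hypothetical figures: 20 cycles of work per 180-cycle memory stall.
    for t in (1, 2, 4, 8, 16):
        print(t, "threads:", core_utilisation(t, 20, 180))
```

With those invented figures a single thread keeps the core busy a tenth of the time, and it takes ten threads to saturate it. Real cores lose some of that to contention for caches and execution units, which this sketch ignores.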
the industry knew this 15+ years ago
The scalability of multiprocessors has always been a sticking point. It came into sharp focus (for me and others) when Motorola et al. released the M88k RISC chips in the early 90's. The architecture inherently limits the speed at which processor caches can be accessed by other CPUs (to see if one thread has altered a variable your thread wants to use), the way interrupts get handled and some other features that I've forgotten about over time.
Even then, none of this was new. IBM had run into issues of their own with mainframe processors - which still have significant flattening of the CPUs/MIPS curve as numbers go up.
You don't have to be a nuclear, or any other type of scientist to have this knowledge. Just a little background in general purpose computing and the realisation that very little is novel - even in this field.
Say what now?
Suitability of multiple cores depends on the application. This here 8800 GTX has way more than four cores and goes like shit off a shovel - provided I ask it to render stuff. If I ask it to lex a C file it's not so hot.
I find it hard to believe that Sandia are only just now modelling the performance of their core computations versus number of processors per die and memory bandwidth.
There is no solution if all you want to do is scale your single-instance computation up to larger N. That's Amdahl's Law. We've known that for more than 20 years.
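For anyone who wants to play with the numbers, here is a minimal Python sketch of Amdahl's Law for a fixed-size problem. The formula is the standard one; the function name and the 90% parallel fraction are illustrative choices, not anything measured by Sandia.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's Law for a fixed-size problem."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part never shrinks; only the parallel part is divided up.
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # 0.9 is an illustrative figure, not a measurement.
    for n in (1, 2, 4, 8, 16, 1024):
        print(n, "processors:", round(amdahl_speedup(0.9, n), 2))
```

However many processors you add, the speedup stays below 1 / (1 - parallel_fraction), so a job that is 90% parallel can never run more than ten times faster.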
But I have plenty of workloads that need almost no memory bandwidth but do a lot of computation. Run these in parallel with workloads that have reasonable bandwidth requirements and you're golden.
I wish I worked in some kind of recognized national lab, then I could issue iconoclastic press releases which state the bleeding obvious and pass them off as original research.
News just in: Gordon Brown was an incompetent Chancellor
And at around the same time Mr Brown started training for his UK-economy-wrecking position, the good folks at www.spec.org started designing benchmarks and publishing audited (properly audited, not by dodgy Big Five casino "accountants") results.
Go back as far as you like, say the early 1990s, and look at the scalability of the multiprocessor benchmarks. Throughput never scales linearly as the number of processors goes up, especially if the OS is Windows (if you can find the same bm on similar hardware with Windows and then with Unix/Linux, Windows doesn't win). And the time to get a given job done doesn't necessarily decrease linearly as you increase the number of processors either.
It's good to know that, nearly twenty years later, this is still news.
...the boffins take a rather simplistic view of this methinks.
Sure, I would agree, dumping effectively 8 Pentium cores (i.e. complex devices) on a single die, you'll very quickly come up against the core/IO ceiling - there are only a certain number of pins you can economically place on the device.
The problem isn't the concept of multi-cores, it's the design of them.
The current 'big gun' chip manufacturers are simply taking their years old processors, and shoehorning 4, 6 or 8 onto a die. Fail.
As a case in point, take a look at Intellasys (I don't work for them and am not associated with them in any way) and their SEAForth40C18 chip. Chuck Moore has managed to fit 40 (yes, forty) simple processors, each one with its own ROM and RAM, onto a single chip the size of a postage stamp. Only the cores around the edge of the chip have access to the pins, so careful thought and planning is required to maximise the efficiency of your application (specifically how you spread your application out across the cores). The performance, if you design your application correctly, is, frankly, of biblical proportions.
And of course, the human level knowledge to leverage the power of these devices already exists - any of you that did any work on the Inmos chips in the 90's will know what I mean.
Just because the 'big boys' have hit the core/performance ceiling, doesn't mean Moore's law is bust - others are making huge progress.
If the cores are being starved of memory bandwidth, then it's the memory interface that's at fault, not the number of cores.
So if Quad core is best..
...and chip makers can still continue to increase the number of transistors available on the chips, and the bottleneck is getting data to / from the chip from main memory, why not use the additional transistors on the chip to really increase on-chip cache?
Or if they can really pack the transistors in, do away with system RAM altogether and develop a quad core chip with a couple of gigs of memory ON CHIP; that should be quick, no?
By the way, you might have guessed from my comment that I have absolutely NO knowledge of chip architecture or the interface between processor and memory, so if it is total bollocks I apologise!
Maybe that's the case for cores, but the Cell has an elegant approach of 1 CPU surrounded by 8 synergistic processing units, to stupendous effect.
Sure, if you just throw a load of cores on a die, it's gonna be shit, but approach the problem with some brains behind it and you can get around it.
3D stacked cores anyone?
pass the activia, watson.
So let me get this right, it takes a nuclear physicist to say that tomorrow's chips won't work properly with today's memory bandwidth?
I hope they're better at programming tic-tac-toe than they are at making insightful comments about the future of computing..
This may be completely wrong...
as I have no clue about chip design, but why can't they stick say a gig of fast RAM onto the processor itself?
Finally, someone gets a little publicity for the obvious
The problem - in any kind of close-coupled multi-processing, you hit a point of diminishing returns as you add processors, where the coordination overheads become larger than the gains - has been obvious for quite a while. Gene Amdahl pointed it out in the late sixties. The chip manufacturers' engineers know this perfectly well, but marketing departments have been caught up in the race to the most cores, and kept on demanding them. It's just like the megahurtz wars.
They will, of course, point out that they are answering the point by putting more memory controllers onto processors - Intel's forthcoming Nehalem has three, while Sun's UltraSPARC T2 has eight - but this only moves the bottleneck elsewhere. The penalties for coordination between memory busses may become the problem, or else something else. But there isn't a Happy Hunting Ground of unrestrained parallelism out there waiting for Moore's Law to go far enough. There's only more and more struggle to make it work.
I don't expect that the manufacturers will change strategy soon; given that some kinds of code don't need much coordination, there will be example cases they can claim as "programming the right way" for quite a while. But there is no one right way, and no simple route; anyone who claims to have found it is fooling himself, usually by looking at a small set of problems, and claiming his ideas generalise to everything.
I don't know what the next fashion will be; were I given the power to decide, I'd ask either for memory that was a load faster, or some really huge on-chip caches, say half a GB. Those two ideas would have pretty similar effects: reducing the "memory wall". To get really huge amounts of processing power, we're probably going to need to climb out of the Von Neumann playpen, and deal with some different programming model entirely.
I have the solution!
Split your workload across multiple servers. You know, Server1 does A-M, Server2 does N-Z.
It really simplifies the engineering required to tackle the problem.
AC, because it was Paris' idea, not mine.
"The idea is that very accurate sims using huge amounts of computing power will allow the US nuke arsenal to be maintained in reliable condition without the use of live tests."
"The idea is that waving the 'National Security' flag at the right moment will allow the scientists to secure gigantic grants and create job security the likes of which normal people can only dream of, all the while ignoring the economic and social cost of throwing these huge sums of money at solving a totally unimportant problem with weapons which will never be used while access to decent health care in America would shame an African nation."
we apologise for any confusion caused.
Does that include the cell
or are we talking X86
First bit of sense in here. Yes it is the memory, and its channel, that's starving the registers now, and actually that's been the case for a long time, which is why my old 486 has to have cache, etc.
Other nails in the coffin of progress are EDO RAM, SDRAM, DDR-SDRAM etc., each one giving better "straight line" performance for a single accessor, but falling harder onto its arse when it comes to HPC at each "improvement".
Actually it all started back in the 70's when DRAM got a market lead, and thus an advertising and R&D lead, over SRAM.
There should be a law declaring EDO/SD/DDRSD etc. not to be RAM at all, but SAM (sequentially accessed memory). It's more like accessing a disk, and the latency/turnaround cycle needs to be considered when calculating the "speed".
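To put a rough number on the latency point, here is a small Python sketch that folds a fixed access latency into the headline transfer rate. The model is deliberately crude (one latency hit per transfer, then streaming at the peak rate), and the 50 ns and 10 bytes/ns figures are invented for illustration, not taken from any datasheet.

```python
def effective_bandwidth(transfer_bytes, latency_ns, peak_bytes_per_ns):
    """Bytes per nanosecond actually delivered for one transfer.

    Assumes a single fixed latency before the data starts flowing,
    then streaming at the advertised peak rate.
    """
    if transfer_bytes <= 0 or latency_ns < 0 or peak_bytes_per_ns <= 0:
        raise ValueError("need positive size and peak rate, non-negative latency")
    streaming_ns = transfer_bytes / peak_bytes_per_ns
    return transfer_bytes / (latency_ns + streaming_ns)


if __name__ == "__main__":
    # Hypothetical memory: 50 ns latency, 10 bytes/ns peak.
    for size in (8, 64, 4096, 1 << 20):
        rate = effective_bandwidth(size, 50, 10)
        print(size, "bytes:", round(rate, 2), "bytes/ns")
```

Small scattered reads see only a fraction of the advertised rate, while long sequential bursts get close to it, which is the "SAM" complaint in a nutshell.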
Isn't this about copper?
So the processor can't get data quick enough from the RAM or Network... doesn't that mean the problem lies in the connection between them, i.e. the copper wiring?
Surely someone is working on a fiber optic solution to seriously ramp up the data delivery from RAM to processor?
DRAM on CPU
DRAM contains one capacitor per bit. The capacitors have to be big enough to hold enough charge to stop a stray electron from changing the value of the bit. This is normally done by cutting trenches in the chip to provide extra surface area. It is possible to have trenches only in a DRAM portion of the chip. As far as I know, Intel and AMD have not done this but some ASIC manufacturers have. It is also possible to use unusual materials to get enough capacitance without a trench. Again, this is done by some ASIC manufacturers but not by AMD and Intel (as far as I know).
Intel and AMD could sell combined CPU/DRAM modules similar to the CPU/SRAM slot 1 / slot A devices from last decade. It takes a significant amount of power to send fast memory signals through a socket, across a PCB to another socket, along the DIMM to a memory chip and back. Putting the DRAM and CPU into a module would decrease the power and increase the speed at the cost of making it awkward to upgrade the memory.
ASIC manufacturers can put an ARM or MIPS CPU onto the same chip as some DRAM. The process is more expensive and the capacities are not as high as a dedicated DRAM chip. On the other hand, the bus width can be much wider and latency can be lower.
The Cell CPU, IBM's POWER CPU and several games consoles already use embedded DRAM for cache and 3D rendering. AMD is moving towards being a fabless chip designer. This might make it easier for them than for Intel to catch up with the non-x86 crowd.
The SPEs in Cell aren't really general purpose cores and only achieve their best when doing maths.
So it depends on the problem domain.
Supercomputers do well with thousands of cores often x86 chips...
@Eddie Edwards and @Does that include the cell
The difference with your GPU is that it has a real CPU sitting behind it to manage the work that each of the GPU's units is doing. And,... correct me if I'm wrong, the units within a GPU mostly aren't full "cores" but smaller functional units.
As for the Cell, that's almost the same case, isn't it? Not all the units on a Cell are full free-standing processors. I believe in Cell-language they're called "elements" and are connected via the Element Interconnect Bus. Anyways, you can read Wikipedia as well as I can... it basically boils down to this: the Cell is a few actual processors managing the work of lots of co-processors. Remember when we used to use external FPUs?
The problem with current X86 SMP boxes is that you have X number of full, real processors, each able to shit on what its siblings are currently working on, and managing that is not easy. ;)
> any of you that did any work on the Inmos chips in the 90's will know what I mean.
yup I miss my transputer arrays and OCCAM language
Paris - I bet she has parallel access
Pedantic I know but
"Performance gains extrapolated from Moore's famous Law can't be sustained"
Well, who ever thought they could extrapolate performance gains from a prediction about miniaturization processes?
If supposedly technical people were to stop using inaccurate terminology they might save some confusion later. Moore's famous law wasn't even a theory. Relying on it to predict performance gains is idiotic.
Nuke boffin gains stop at zero.
For Christ's sake, this is just such utter pessimism and I am actually surprised to see it being repeated here. To keep the explanation short: the trick is to do as many calculations as you can with as little memory as necessary.
CPUs come not only with an increasing number of cores but with an increasing number of memory channels, too. Guess why?
And guess what, increasing the number of memory channels is often not necessary! In the past, people programmed their computers to apply a simple operation to a large amount of data and then repeat the approach over and over until they got a result. Physicists especially loved to play with large vectors, arrays and matrices, as it was their only way of solving physics problems on a computer. This was when the expression RAM was defined, meaning Random Access Memory and implying that all memory can be accessed equally well in any order. But this is no longer the whole truth: programmers have learned that RAM, while it can still be randomly accessed, has very different performance costs depending on how it is accessed. One effect was that programmers had to learn that while quick-sort is the fastest in theory, merge-sort turned out to be faster in practice.
Today a computer's memory stretches from the CPU's registers, over its caches and (virtual) main memory, on to hard disk storage. The concept of randomly accessible memory is very old, and is nothing more than a convenience to an experienced programmer and a help to an inexperienced one. Those who cannot make use of the performance potential of multiple cores have not been tracking the development of computer technology and have become as old as the "old irons" they used to work with. The nuke boffins either need to learn how to change their programming style to get the most out of newer hardware, or better stop playing with nukes altogether, or else they will be replaced with people who can make use of what the hardware offers. It (or IT) does not need guys who invent new and unbreakable barriers; it needs concepts like multiple cores to allow for faster parallel processing.
Re: This may be completely wrong...
"as I have no clue about Chip design, but why cant they stick say a gig of fast ram onto the processor itself?"
From my time lurking in comp.arch, I gather that the simple answer is that different semi-conductor processes are used for RAMs and CPUs. This probably isn't insurmountable, but the smart move up until now has been to keep them separate. A second issue is that system builders have more flexibility in the amount of RAM if it comes on separate chips and you can choose how many banks to fill. Again, this is an economic argument, not a fundamental one. More memory on chip will come, whether it is in the form of cache or directly addressable RAM.
But even this just buys us more time to change how our programs work. Making 4- or 8-way multi-core chips lets an OS full of single-threaded programs run slightly faster. It also gives developers an incentive to put 4 or 8 threads into their apps next time they are designing. Given an armful of 4- or 8-way multithreaded apps, an OS could probably use 16 or 32 cores before reaching the point of diminishing returns again. RAM-on-CPU will probably deliver that.
Beyond that, someone mentioned clustering, which of course forces the programmer to manage all the data traffic between cores. If *that* problem could be even semi-automated, we might be able to produce software that scaled to N-way for N=a-few-dozen. That would let the chip designers explore yet more options. Intel demonstrated an 80-way chip last year. The transistor budget is not the problem here.
Real parallelism will come, simply because there is no alternative, but it will be a painful evolution, with software and hardware having to take alternate steps. Then again, where's the job security in doing something easy?
The Memory Wall strikes again
But just try getting research funding to fix things. Not as exciting as transactional memory or virtualization or yet another cache-reshuffling.
Snow leopard anyone?
If Apple can get its rumored Snow Leopard going, it'll be a game changer. As far as I can tell, you need a new way of doing things. Supposedly Snow Leopard will toss special tasks to the graphics processor, freeing the main processor. That's what I've heard anyway.
Isn't this an x86 limitation?
I'm under the impression that all these studies are on the stupid x86 arch we've been stuck with these days. Cell seems to cope well with its 8 SPEs, and RISC processors in general seemed to be doing well in the multiprocessing area. Not to mention those other chips mentioned here (the 40-core one, for example).
The real problem is that we're still using 80's tech (x86) with the equivalent of "patchwork" all over it, when we should've moved on to better technologies. Instead of improving the arch, x86 is getting more and more cores without any real improvement. Cray had it right: better to have 2 oxen than 1024 chickens.
It's about time we switched over to a better arch.
Multicore vs message passing?
This has always been rather obvious, and has been remarked on ever since the first dual core came on the market. The most remarkable development is that quad core works. The solution is a software architecture that abstracts the processor, so that software can be migrated with no effort between symmetric architectures (multicore) and asymmetric ones (message passing). This too is limited by the speed of communication channels, but with an intelligent optimising distributed scheduler the gains can be even more dramatic than the migration from single core to quad. See www.connectivelogic.co.uk for more information.
Sorry, but has no-one heard of multi-threading? Has anyone looked at what processors like CUDA-compatible PhysX or Sun T1/T2s can do when given massively parallel tasks? Tried running a J2EE app on Niagara? Or SAP on an M9000? I cry "Bullshit!" and wish I could get a government grant to write gibberish.
"intelligent optimising distributed scheduler"
That's fine, for the <1% of typical computer usage which can usefully be parallelised. Things that can usefully be parallelised typically already have been, since the 1980s or earlier, in one way or another. For the rest, the ones which don't easily parallelise, we either need better software (eg nothing Vista-related) or faster hardware or both.
If parallelisation is going to help, we don't need PC toys, we need clued-up designers that know about things like Communicating Sequential Processes: http://www.usingcsp.com/ (*the* CSP book is now available online!).
@Snow leopard anyone?
Yes, Apple will fix these fundamental architecture issues that Intel and AMD so far haven't. What's more they'll do it totally in software. But wait... once you have enough GPUs on XYZ bus to run OS X super mega edition don't you have the same problem that you have a contended bus and lots of headless chickens (GPUs) running around on that bus?
/me wonders if Mac people realise that a modern Mac is just an Intel reference design in an expensive box.
So at what point....
....does (assuming X86) multiple CPU's performance level out?
I rightly or wrongly figured that as there's no point going above quad core, the next performance leap would be made by motherboards holding multiple CPU's until that levels out and another fix is needed.
No, it's not an x86 limitation, it's a theoretical limit to the performance that any system can gain by adding cores.
Note that your Cell comparison is completely irrelevant: that's not a multicore system, and in particular, each SPE has its own local memory, so they don't have to all fall over each other trying to access the same DRAM pages at the same time. This is why there are already multi*processor* systems with way more than 8 CPUs - systems exist with tens of thousands of processors - and they don't suffer the same limitation. But in multi*core* systems all the CPU cores have to compete for access to the same memory bandwidth, and that's where the limitation arises. It has nothing to do with RISC-vs-CISC architecture.
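The shared-bandwidth argument can be sketched in a few lines of Python. This is a simple ceiling model, not Sandia's simulation: throughput is whichever is smaller, what the cores can compute or what the shared memory bus can feed them. All names and figures below are hypothetical.

```python
def chip_throughput(cores, ops_per_core, bytes_per_op, bus_bytes_per_sec):
    """Operations per second for cores sharing one memory bus.

    Simple ceiling model: the chip runs at the lower of its total
    compute rate and the rate at which the bus can supply operands.
    """
    if cores < 1 or ops_per_core <= 0 or bytes_per_op <= 0 or bus_bytes_per_sec <= 0:
        raise ValueError("all arguments must be positive")
    compute_limit = cores * ops_per_core
    memory_limit = bus_bytes_per_sec / bytes_per_op
    return min(compute_limit, memory_limit)


if __name__ == "__main__":
    # Hypothetical chip: 1e9 ops/s per core, 32e9 bytes/s of shared bandwidth.
    for n in (1, 2, 4, 8, 16):
        hungry = chip_throughput(n, 1e9, 8, 32e9)    # 8 bytes fetched per op
        frugal = chip_throughput(n, 1e9, 0.5, 32e9)  # mostly works out of cache
        print(n, "cores:", hungry, frugal)
```

With these made-up numbers the memory-hungry workload stops improving at four cores, while the cache-friendly one keeps scaling. The sketch only shows the flattening; it does not model contention making things worse beyond that point.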
Modern version of the Transputer
Give me a single-core CPU with 128MB of RAM (or even just 64MB) on the same die and six interconnects. One interconnect is for the peripheral bus controller. The CPUs can't all be in contention for the peripherals -- give one or two processors that job and have others make requests. Another is for a bank of slower shared memory for passing messages to all the other CPUs. This could be the DDR2 we are using now. The other four are to connect to the neighbors on each side of it directly to pass messages back and forth.
Central memory can still be huge. The on-chip memory would be the size of what we used for main memory a decade ago. The rest of the connections allow fairly fast message passing to direct neighbors.
If you put 16 sockets on the motherboard, you'd have 2GB of memory at full CPU speed (1GB if you only put 64MB memory on each) and you could have 64GB or whatever of shared central memory. You could even use a separate memory controller for the centralized memory, because most work would be done in the on-chip memory.
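As a quick sanity check of that arithmetic, and of the four-neighbour wiring, here is a small Python sketch. Laying the 16 sockets out as a 4 x 4 grid with no wrap-around at the edges is my assumption; the comment above only says each CPU talks to the neighbour on each side. The function names are made up.

```python
def total_local_memory_mb(sockets, mb_per_cpu):
    """Total on-die memory across all sockets, in megabytes."""
    return sockets * mb_per_cpu


def neighbours(index, width, height):
    """Indices of the CPUs directly above, left, right and below.

    CPUs are numbered row by row on a width x height grid. Edges do
    not wrap around (an assumption), so corner CPUs get two neighbours.
    """
    if not 0 <= index < width * height:
        raise ValueError("index outside the grid")
    row, col = divmod(index, width)
    candidates = [(row - 1, col), (row, col - 1), (row, col + 1), (row + 1, col)]
    return [r * width + c for r, c in candidates
            if 0 <= r < height and 0 <= c < width]


if __name__ == "__main__":
    print(total_local_memory_mb(16, 128), "MB with 128MB per CPU")
    print(total_local_memory_mb(16, 64), "MB with 64MB per CPU")
    print("CPU 5 talks to", neighbours(5, 4, 4))
```

On such a grid a message between opposite corners needs six hops, so where you place communicating tasks matters, just as it did on the Transputer.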
Only 4 cores is useful?
Funny, the guys over at Intellasys seem to have a 40-core processor that works nicely (http://www.intellasys.net/). Maybe the problem we are seeing isn't so much an issue with lack of resources, as lack of efficient usage of them. Maybe we should stop expecting the compiler to do all the work of programming, and start managing our resources ourselves.
Then again, it seems these guys only tested one architecture with one problem domain and El Reg assumed their findings applied to all. Weak, guys, weak.
(Note: This is clearly not an architecture designed for super-computing applications, but one for embedded systems. That is not the point. The point is that maybe we need to be considering other ideas than the 'traditional' idea that a computer is x86.)
Are these boffins suggesting that nothing, anywhere, ever - will be able to run Crysis at a decent frame rate?
These nuclear physicists could do with some engineering knowledge.
The answer is not to consider an archaic "non-law" as a limitation, the answer is a change in algorithm. There is no reason whatsoever additional cores shouldn't be log( n ), it's merely an engineering challenge - and not a new one.
If the issue is speed, the answer is either more cores running the same speed as memory, or interleaving memory, so one memory address actually maps to two or more memory locations serviced by different controllers. So a 64k address is actually 8 x 8k addresses serviced by 8 controllers.
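A minimal Python sketch of that interleaving, assuming the simplest scheme: the low bits of the address pick the controller and the remaining bits pick the location within it. Interleaving at single-address granularity is my assumption; a real design would more likely interleave on cache lines. The names are hypothetical.

```python
CONTROLLERS = 8  # as in the 8 x 8k example above


def route(address, controllers=CONTROLLERS):
    """Map a flat address to (controller, local address).

    Low-order interleaving: consecutive addresses land on
    consecutive controllers, so a sequential sweep uses them all.
    """
    if address < 0:
        raise ValueError("address must be non-negative")
    local_address, controller = divmod(address, controllers)
    return controller, local_address


if __name__ == "__main__":
    # A 64k address space becomes 8 banks of 8k each.
    for addr in (0, 1, 7, 8, 9, 65535):
        print(addr, "->", route(addr))
```

The top address of the 64k space, 65535, lands at local address 8191 on controller 7, so each controller does indeed serve exactly 8k locations.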
If the issue is address contention, then contend less.
Make memory locking cheaper
Make memory addresses swappable amongst cores
If P1 has locked a memory address P2 requires write access to, tell P1 to read from P2's cache and write the data for you. Memory reads should always be contention free.
Ensure a thread's execution context is separate from the thread itself, so blocking threads can pick up new work rather than sleeping.
If it's the multi-core overhead, then reduce inter-core communication.
Communicate only with neighbours
Share less & peer more
So in an 8 core chip, use the space for 4 of the cores for communication rather than CPU. There, problem solved! For my next trick, solving NP complete in (n log n)^2 time...
"Are these boffins suggesting that nothing, anywhere, ever - will be able to run Crysis at a decent frame rate?"
I know it was a joke, but ... I suspect that these boffins would spoil their pants if one of their workloads was as embarrassingly parallel as Crysis. The issues are
1) ...whether your problem has a lot of computation compared to sharing of data
2) ...whether that ratio scales nicely as you chunk the problem into smaller pieces.
3) ...whether your chosen implementation language(s) let you express that
For most applications, the answer to (1) is yes for "parts of the problem". I have no trouble identifying small loops and other parts of my code that are embarrassingly parallel. Neither has my compiler, come to that.
Sadly, this isn't useful, because of (2). These fragments *are* small and by the time I've glued together several hundred embarrassingly parallel fragments, I discover that the overheads of the glue weigh more than the original single-threaded solution. Worse, there are quite a few important problems that don't even have embarrassingly parallel fragments.
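The glue problem can be shown with a toy cost model in Python. It assumes each fragment splits perfectly across the cores but pays a fixed cost to fork and join; the function names and all the unit costs are invented for illustration, not measured from any real program.

```python
def serial_time(fragments, work_per_fragment):
    """Time to run every fragment one after another on one core."""
    return fragments * work_per_fragment


def parallel_time(fragments, work_per_fragment, cores, glue_cost):
    """Time when each fragment is spread over the cores.

    Every fragment pays glue_cost once (fork, join, data shuffling)
    on top of its perfectly divided share of the work.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return fragments * (work_per_fragment / cores + glue_cost)


if __name__ == "__main__":
    # Hypothetical: 300 fragments, 4 cores, 8 units of glue per fragment.
    for work in (10, 100):
        print("work", work,
              "serial", serial_time(300, work),
              "parallel", parallel_time(300, work, 4, 8))
```

With fragments of 10 units the parallel version is slower than the serial one (3150 against 3000), while at 100 units it wins easily. In this model parallelising only pays once the glue costs less than work * (1 - 1/cores) per fragment.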
Even assuming the parallelism exists, the answer to (3) is still no if you are programming in any language that you've ever heard of. OK, some kind soul mentioned Erlang earlier and it is interesting, but if it really solved the problem then Intel and AMD would have beaten a path to their door and hammered it down by now. As I mentioned earlier, Intel have built an 80-way processor. Sun have built and sold a 64-way processor. The hardware isn't the problem.
On a small scale, almost any compiler can take your function, spread out all the data dependencies and find the optimal solution. Good compilers can do this for groups of functions that call one another in some sort of closed system. On an out-of-order processor, that actually delivers useful parallelism even today.
On a larger scale, no-one has found a concise way of scaling that. That is, there's no compiler where you feed it a number N and it generates code that runs efficiently on an N-way system for arbitrary values of N.
Even with such a language *someone* would still have to rewrite everything from your OS upwards and those in the closed source universe would still have to pay for an upgrade to everything they own, before end-users would see a benefit.
But I remain an optimist, because modestly multi-core processors make it affordable for lots of people to experiment with modestly parallel rewrites of their software. (The cost of the glue drops in relation to the benefit.) It also makes languages like Erlang more affordable and more attractive. For 50 years, there has been a serious penalty on anyone who didn't manually serialise their algorithm. At last, this is beginning to change.
saw this a while back
Here is an advance already!
With a square chip you can then cover each surface with IO points
problem solved for the next 2 years
On one level, this is just re-discovering Amdahl's Law 30 years late.
On another level, it is just utter bollocks. There are a number of processor companies with multi-core systems which do scale to vastly more devices.
* Picochip has 300 cores on one device - and can cascade devices. The biggest system they have had 5000 processors.
* Ambric had something similar. They ran out of $ but the technology worked.
* Tilera has 64.
* Cisco has a design using 256 Tensilica cores in one ASIC
And so on...
At most you can say "a processor which was never designed to be used in a multicore way won't be very good at it". Big wow.
Paris because she knows more about processor architecture than these numpties.
This boffin is a buffoon
Hate to tell this boffin/buffoon - he forgot to look at the rest of the multi-processor world.
It seems SPARC has not been suffering the same problems as AMD or Intel with regard to multiple cores. Sun has for years been producing CPUs with 4, 6 and 8 cores, plus 8 crypto cores on a single die, with linear performance scaling.
Moving beyond 8 core has to do with the memory bandwidth, memory interconnects, and cache coherency - design it right and move on.
Vendors who successfully do near-linear SMP performance into 100 physical processors (like SUN) are most likely able to succeed in moving into 16 and more cores per socket - since they licked the problem in the large scale and the only thing that is to be done is to stick the existing technology onto a piece of silicon.