The top minds at IT analyst Gartner have been mulling over the ever-increasing number of cores on modern processors, and have come to a conclusion that many academic experts have already come to - boosting core counts to take advantage of Moore's Law is gonna run out of gas, and sooner rather than later, because software can't …
WOW! Great article
Even Gartner doesn't get it! The problem is not the breakdown of Moore's law, or the problems of parallelization. Those problems only affect physics simulations and such. For your average everyday user (and website), the problem is memory capacity, memory speed, and most importantly, disk speed and caching. Tim gets that! Why can't everyone else?
I want 1000 monkeys to crank out Shakespeare's greatest work.
Any idea how long this will take?
>> operating systems have an eight-bit field
"This can be changed, of course, but it has to be changed and it most likely will not be done as a patch to existing operating systems running in the data centers of the world."
So with the number of cores around 8 now & doubling every 2 years, we've got a decade to hit that limit. Nah, no chance of an OS patch in that timeframe...
Please Mr Gartner, can I get paid for this sort of shrewd analysis? If you fly me business class I promise to glean all the wisdom of the in-flight mags en route.
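The back-of-envelope sum above can be checked in a few lines of Python. This is only a sketch under the comment's own assumptions (about 8 cores today, a doubling every 2 years, and an eight-bit field that tops out at 256); the function name is made up for illustration.

```python
import math

def years_until_limit(cores_now: int, core_limit: int, years_per_doubling: float) -> float:
    """Years until the core count reaches the limit, assuming steady doubling."""
    doublings = math.log2(core_limit / cores_now)  # 8 -> 256 is 5 doublings
    return doublings * years_per_doubling

# An eight-bit field can number 2**8 = 256 cores.
print(years_until_limit(8, 2 ** 8, 2))  # 10.0
```

Five doublings at two years each gives the decade quoted above.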
Core Count versus BIOS Count
If all the cores have to use one BIOS, what does the bandwidth of that BIOS have to be to enable all of them to work efficiently? I've always been able to beat other geeks' systems by using an individual system for each process, including multiple hard drives and/or ram disks and/or networked disks per system. Isn't this how SETI and folding@home et alia work? And I bet a 128-atom system would throw less heat than a 128-core server box.
256 core limit?
Only if the OS has to know about all the cores.
People used to put 256K of RAM onto systems that only had 64K of address space. Not an OS limit, a hardware limit. How did they do that?
"I want 1000 monkeys to crank out Shakespeare's greatest work.
Any idea how long this will take?"
I dunno, but this article, 1 monkey, 10 minutes...
How many times has this same subject been covered already? Yeesh.
Readallabaahdit!!: Gartner talks very expensive drivel...
... picked up and carried by journalist who doesn't understand hardware tradeoffs!!! Story not so simple!!!
I've been had by Gartner's 'analyses' before. They need critical reading in turn, which isn't done here.
I don't get it
Even though I am a professional developer, I just don't get why some hardware can't be designed that automatically spread the load between different cores. Is it really that much of a problem to do so?
``boosting core counts to take advantage of Moore's Law is gonna run out of gas, and sooner rather than later, because software can't take advantage of the threads that chip makers can deliver.''
Ah, the good, old ``future is like the past, only more so'' fallacy.
``The software never changes. The developers never learn.
And please pay us for our very informative analysis.''
Gartner : Stating the obvious!
Why do companies keep on buying Gartner reports....! Yet again they state the blo^^y obvious!
The problem with spreading the load is 'which thread do you put on which core and when is it going to run?' or, to put it more technically, the 'thread deadlock avoidance algorithm'.
That takes processing power, and it takes it away from the CPU's job of crunching your tax bill for the entire year. So while you think doubling your cores from 4 to 8 will result in your work being done in 1/2 the time, you need software to spread the load and time to run that software, which could mean jobs taking 3/4 of the time instead of 1/2 the time.
Going back to ye olden days (about 1996), you could buy a twin-CPU m/b to run a pair of P200 chips. They did some experiments with putting more P200s on a motherboard and found that once you got beyond 4, you needed more than 1 CPU to cope with the scheduling for each extra CPU you added, so an 8-CPU motherboard ended up performing no better than a 1-CPU motherboard.
Paris, because she's avoided wholey deadlock
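The "3/4 of the time instead of 1/2" figure can be reproduced with a rough model. The sketch below uses Amdahl's law, which the comment does not name and which treats the overhead as a fixed slice of work that cannot be spread across cores; the 20% slice is an assumed number picked to match the 3/4 figure, not a measurement.

```python
def relative_time(cores: int, serial_fraction: float) -> float:
    """Run time relative to a single core, per Amdahl's law."""
    return serial_fraction + (1.0 - serial_fraction) / cores

serial = 0.2  # assumed: 20% of the job cannot be spread across cores
ratio = relative_time(8, serial) / relative_time(4, serial)
print(f"8 cores take {ratio:.2f} of the 4-core time")  # 0.75, not 0.50
```

Scheduling overhead that grows with the core count, as in the P200 story, would make the picture worse than this fixed-slice model suggests.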
Whatever happened to BeOS?
for more than ten years proving scalable programming over hundreds of nodes. Not transferable knowledge then? surely multi-cores are analogous to multi-node systems and therefore scalable OS architectures can address this. Just because M$ can't do it, doesn't mean it's not possible!
Dump Multithreading Now!
Excellent article. The powers that be at Intel, AMD, IBM, Sun, HP and the other leaders in multicore processor technology know that multithreading is not part of the future of multicore computing. In fact, multithreading IS the reason for the parallel programming crisis. The big boys and girls are pushing multithreading because their current crop of processors are useless without it. However, this is a disaster in the making because the market will be stuck with super expensive applications that are not compatible with tomorrow's computers. Fortunately, there is a way to design and program multicore processors that does not involve the use of threads at all (see link below).
Having said that, I agree with the article's author. The biggest problem facing the processor industry is not the parallel programming crisis. The worst problem of them all has to do with memory bandwidth. As the number of cores continues to increase, memory subsystems will be hard pressed to keep up. The memory bandwidth problem is the real showstopper because it threatens to repeal Moore's law regardless of whether or not the programming problems are solved.
I suspect, however, that a correct programming model will open new avenues of research that may lead to a solution to the memory bandwidth crisis.
How to Solve the Parallel Programming Problem:
Cores Cores and More Cores!
Cores bobbing in a sea of memory.
Cores are becoming cheap like cycles. When there are thousands of them, the cores can be allocated to tasks rather than swapped to tasks. This is a valid paradigm when cores become essentially free, like cycles are for the typical PC user (if you ever look at the idle process, it is usually 99% for Joe PCUser).
A lot of businesses obsess about using every last cycle and core and wire under the floor, but really the objective is to maximize profit... which is not the same thing at all. But people think it is and support their mortgages by such a belief, so there it is.
...lots of products are influenced by what Gartner reports say.
That's not because they're all accurate or realistic, but because the customers don't know the difference. If Gartner says it's true, then the customer thinks it's true, so it must be true (at least if you want to sell to them). Like I said, shame.
But they're right about too many cores spoiling the broth. It's a quick fix that isn't scalable. I'm sure we've all seen those before.
It's official; Moore's Law Is Dead
It just turned out that it wasn't really gate count we cared about, it was uniprocessor horsepower. Doubling gate count used to increase performance, but no longer. We might squeeze another doubling or two of gate count out of semiconductor physics, but the aspect of Moore's Law that mattered to sales hit the wall 5 years ago. Since doubling the gate count turns out to also double the cost of building the fab, the industry may choose to stop investing, now that it is getting clear that there's no money in it.
The chip business has never been about design innovation (or the success of the x86 is one gi-normous fluke). It's been about turning the process crank to generate free money. With those days gone, expect some "financial turbulence" in this sector. "Financial turbulence" is to companies what "Rightsizing" is to workers. It's a laughable euphemism for disaster.
With chip costs heading toward zero, expect truly humongous downward pressure on software prices too. Who's gonna pay $300 to run Windows on a $200 PC? More important, when a company only gets $100 for the PC for which you pay $200, what company wants to be Microsoft's bitch and jack up the price, then hand over more than half the proceeds of the sale?
As I sit in the unemployment office, what really pisses me off is that I called this turn of events back in 2005, but was too wimpy to make the money bet that would have made me rich instead of broke.
So, CPUs, memory, and hard drives have all been getting bigger and faster. Since CPUs have been getting bigger/faster than the others, I guess it's only logical to complain bitterly.
A stunningly stupid idea.
So, you put as many cores as you can inside the processor but you still have a single bus restricting access to the memory - durgh!
This problem was solved 20 years ago by the Transputer - a British invention.
My understanding was that real world problems were not amenable to standard configurations, hardware design never solved the problem, and that transputers became great graphics engines, briefly (and as we see, GPUs are turning into great gen purpose number crunchers)
All programming required compilation of pre-defined serial and parallel procedures, and so assumed closed system problems (still remember the coursework..)
Never been near a super computer but I understand configuring the problem is still more of an art than a science.
Don't understand why any of this doesn't apply to multi-core processors running normal OS
can our current Windows even handle the existing hardware?
When I run Cognos Data Manager and various SQL Server processes the CPU utilisation is typically 12%. That is an eighth of all the CPU. It would appear that the bottleneck is CPU but the software cannot take advantage of the multi-processor, multi-core, multi-thread arrangements we currently have. If we are only getting 12%, will moving to a more modern server just give us 12% (or maybe 6%!) of a slightly more powerful CPU?
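The "eighth of all the CPU" arithmetic can be sketched as follows. The eight-core figure is an assumption read off "an eighth" (the post does not give the server's core count), and the function name is hypothetical.

```python
def overall_utilisation(busy_cores: float, total_cores: int) -> float:
    """Percentage a whole-machine CPU meter would show."""
    return 100.0 * busy_cores / total_cores

# Assumed: one thread saturating one core of an eight-core box.
print(overall_utilisation(1, 8))   # 12.5
# The same single thread on a hypothetical sixteen-core box.
print(overall_utilisation(1, 16))  # 6.25
```

Under that assumption the meter reads about 12% even though the one busy core is flat out, which is where the "maybe 6%" worry for a bigger server comes from.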
I didn't even read the article (it's too long for my increasingly shortening attention span), but the article title entirely caught my eye as I thought at first it read "Exploding core c*nts: Heading for the buffers". Now THAT would have been an article worth reading.
There is an interesting article here: http://en.wikipedia.org/wiki/Transputer
To paraphrase: it talks about an early attempt (1980s) at parallel computers and gives a good comparison with what we have now.
I've no idea what your developer skills are so please forgive me if I sound patronising, it's not intended :)
Take a problem like applying a filter to an in memory bitmap.
For simplicity's sake, let's say that filter is really crap: all it does is read the last pixel in the X coordinate and the current pixel. It does a sum with the two values and writes out a new value to the current pixel. If the X pixel is at element zero, it does nothing.
You could write something like:
(I'm not very good at pseudo code, sorry)
For Y = 0 to Bitmap.Y.Length
    For X = 0 to Bitmap.X.Length
        do filter calculation on Bitmap[X,Y] and write new value to Bitmap[X,Y]
(end pseudo code)
The CPU core cannot understand that your code is stepping through a two-dimensional array, reading two values and writing out a new one.
From its point of view, it is reading two memory locations, performing a sum and writing out a value to one of those locations. It does not even know how many times this is going to happen (I think).
Now let's say that you're not happy with the performance of this filter and it looks like a great task to multi-thread.
So you ask the OS how many cores you have and split up the bitmap into chunks to be processed by these cores. Let's say that you divide the bitmap into quarters with a bisection along the X and the Y coordinates of your bitmap (wrong).
You then write a routine that can be used by multiple threads:
Pseudo code again.
MyMethod(bitmap, startX, endX, startY, endY)
    /* My method is passed a reference to the bitmap and told the pixels to start processing and where to stop */
    For Y = startY to endY
        For X = startX to endX
            do filter calculation on Bitmap[X,Y] and write new value to Bitmap[X,Y]
/* Spawn four threads, each running the method with different coordinates */
(end pseudo code)
Again, the process is the same as before, but this time all the cores are performing the same processing on the bitmap. Great.
But there is a deliberate problem with the method I described: as well as splitting the bitmap into top and bottom halves, I split it into left and right halves. This means that the threads dealing with the right-hand side of the bitmap will give an incorrect value for their leftmost pixels unless the threads on the left-hand side have finished processing their rightmost pixels.
Had I created a method that split the bitmap into chunks of whole rows only (stacked along the Y-axis), my algorithm would have worked, because each pixel depends only on the pixel to its left in the same row.
Hopefully, this rather long illustration shows the problems encountered with multi-threading.
For a core to know that I relied on the value in a pixel that, at some point, was going to be processed by another core, and for it to wait until that core had processed the pixel, would be impossible.
It takes the programmer to know how an algorithm can be multi-threaded, if at all.
A compiler may be able to work out this simple example, but I doubt that it could cope with more complex problems.
Still, get your head around multi-threading and your salary should increase!
Disclaimer. This is the best simple problem I can come up with. If someone wants to post a better example or link, please feel free to do so. I won't be offended. I don’t mind being corrected either!
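For anyone who wants to try it, the row-wise split that does work can be sketched in Python. This is a minimal sketch, not a tuned implementation: the bitmap is assumed to be a list of rows of integers, all function names are made up, and in CPython the threads will not actually run the arithmetic in parallel because of the interpreter lock, so it only demonstrates that the partitioning gives the right answer.

```python
from concurrent.futures import ThreadPoolExecutor

def filter_rows(bitmap, start_y, end_y):
    """Apply the toy filter in place to rows start_y .. end_y - 1."""
    for y in range(start_y, end_y):
        row = bitmap[y]
        for x in range(1, len(row)):      # pixel 0 has no left neighbour
            row[x] = row[x - 1] + row[x]  # needs the already-updated left pixel

def filter_serial(bitmap):
    """Single-threaded reference version."""
    filter_rows(bitmap, 0, len(bitmap))

def filter_parallel(bitmap, workers=4):
    """Give each worker a strip of whole rows, so no strip reads another's pixels."""
    height = len(bitmap)
    step = max(1, -(-height // workers))  # ceiling division, at least one row
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(filter_rows, bitmap, y, min(y + step, height))
                   for y in range(0, height, step)]
        for future in futures:
            future.result()               # re-raise any error from a worker
```

Because a row never reads a pixel from another row, the strips can be processed in any order and still match the serial result. Cutting the rows themselves in half would break that.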
Eventually, server-side software will run into the same flatline that desktop unit sales have hit: you just don't need to go there. With the exception of gamers, why do you upgrade your desktop anymore? Word and IE don't suck up that many resources.
Eventually businesses will hit the point of "well, we could upgrade, but we have disk space and cycle space for double the customers and work we have now, so what's the point?"
Massive multicore processors will be used primarily by the people that need ever-growing power or ever-faster processing: academic facilities, nuclear profiling, people like DreamWorks' and Pixar's rendering contractors, economic profilers, corps like Google.
Mine's the one with 1024 cores in the pocket running 4 lines of code _really fast_.
Fnargh! Moore's Law pedantry
"Gartner: boosting core counts to take advantage of Moore's Law is gonna run out of gas"
"Wiki: the number of transistors that can be placed inexpensively on an integrated circuit has increased exponentially, doubling approximately every two years"
Gartner: Wrong. Moore's Law says nothing of the layers taking advantage of the integrated circuit, it *only* applies to the transistor count.
The rest of their report may or may not be relevant.
The (relatively) obvious solution is to move more of it onto the chip, so you're less dependent on the motherboard bus. 256 megs or so will let you cache entire programs. A new, larger processor form factor might be needed to accommodate the extra transistors, but that's not too hard to design.
Threads are useful
Threads are actually useful for working around the issue of memory bandwidth. While traditional CPUs (Power/PA-RISC/Intel) are sitting idle waiting half their life for memory, a modern CPU (Niagara) is getting work done on the threads that actually have memory available. It may not solve the throughput problem, but at least it cuts the time a CPU spends in context switching (which is enormous).
Threads are not the problem.
The desktop is not interesting in this discussion as not much is really going on there. As someone mentioned earlier how much faster do you need to run OpenOffice? At least I think that's what they said...
Painful to read
I'm confused about the problem that people are trying to solve here. It's not that multithreading is hard--it's not. And it's not that CPU-intensive software isn't multithreaded--because it often is.
I do raytracing, video/audio transcoding, and some scientific computing on my quad core computer. And it's all multithreaded and it's all awesome. I get great CPU utilization and speedups. I can not WAIT for 8 and 16-core CPUs to come out. I'm going to be first in line.
I guess my e-mail program and my IM program may not be taking full advantage of my quad-core processor, but WTF? Why in the world would I ever care? The fact that some software isn't multithreaded only seems to be of concern to some propeller-beanied nerds who decided it's a problem that NEEDS to be solved with new programming languages, CPU architectures, etc.
Paris because she has the good sense to not worry about this non-issue.
Re: I don't get it
Kenny Swan wrote:
"I just don't get why some hardware can't be designed that automatically spread the load between different cores. Is it really that much of a problem to do so?"
Consider the loop:
Every iteration of the loop depends on the result of the previous iteration. It's very tricky to parallelize (if not impossible).
It's data dependency issues like that which bugger up your scaling. You have to take a step back and redesign your code to avoid them.
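A hypothetical loop with this kind of dependency, written in Python purely as an illustration (the recurrence and the function name are assumptions, not the loop from the comment above):

```python
def recurrence(values):
    """x[i] = 2 * x[i-1] + values[i]: each step needs the previous result."""
    if not values:
        return []
    x = [values[0]]
    for i in range(1, len(values)):
        # x[i - 1] must be finished before x[i] can start, so the
        # iterations cannot simply be handed to different cores.
        x.append(2 * x[i - 1] + values[i])
    return x

print(recurrence([1, 2, 3]))  # [1, 4, 11]
```

Handing iteration 2 to another core gains nothing here, because that core would just sit waiting for iteration 1 to finish.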
If your CPU is at 12% then one of two things holds: you are either not pushing the processor hard enough, or you are trying to push it but it is bottlenecking elsewhere. By definition it isn't bottlenecking on the CPU because the CPU is not at 100%; a bottleneck means 'this piece of the system can't deliver results any faster because it has reached its performance limit'.
If the SQL Server is bottlenecking elsewhere it's most likely to be the disk being hammered, or the program calling the DB to do the work - the Cognos bit - which can't run any faster. These aren't the only possibilities but are the most common.
DBs can scale pretty well cpu-wise if set up OK. Always have done.
@savain: your posts rank with AMFM for content.
Parallel processing is doing different things at on
So how about allocating discrete tasks to different processors....
EG one processor to
Stereo image processor
Mic 1 Speech interpreter
Speaker voice synthesizer
= automatic car
even Paris can drive
I think Intel saw this coming years ago. Unfortunately, the memory industry didn't want to go with RDRAM and we were stuck with DDR SDRAM, which is completely inferior at all levels of speed to RDRAM.
RDRAM would have been able to scale. Now the mobo makers have to test all the paths that DDR SDRAM uses, which must be a real nightmare.
Nick, exactly right. That loop cannot be parallelized. Nested loops where the calculated variable(s) is or are independent of neighbouring values are excellent candidates for parallelization. Very common in a number of scientific applications, and I suppose in graphics and image processing as well, which is why multicore processors are becoming popular in those fields if you can't afford clusters. Reduce a six-day CPU-limited model to 12 hours and people get very interested, at least the ones that do that sort of thing.
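By contrast, here is a minimal Python sketch of the independent case described above, where each output value depends only on its own input. The doubling "filter" and the function names are invented for illustration, and CPython threads will not give a real speedup on this arithmetic; the point is only that the rows can be handed out in any order.

```python
from concurrent.futures import ThreadPoolExecutor

def brighten_row(row):
    """Each output pixel depends only on the matching input pixel."""
    return [min(255, 2 * pixel) for pixel in row]

def brighten(grid, workers=4):
    """Process rows independently; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(brighten_row, grid))

print(brighten([[1, 2], [200, 3]]))  # [[2, 4], [255, 6]]
```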
CPU cores are rare and expensive
Intel is the bottleneck here.
The business model of selling "CPU chips" is as dead as the model of selling music on plastic disks.
Future "computers" will handle commands like "play me that song that goes like..." by sending this command to all the memory units, each of which will have sufficient "CPU power" to fully flood their internal memory buses.
If Intel has a future it's by going back to being a RAM producer.
Blugreen is right
Most RDBMSs I work with can scale well to at least 64 CPUs/cores. It's not really true that they always have. Back in the early 90s I remember performance on above 8 CPUs actually got worse.
Of course the problem here is that some tasks don't gain anything from parallelism, getting 9 women pregnant still means a nine month wait for a baby.
It also seems to me that a lot of developers really don't understand parallelism and think it's something you add afterwards.
Finally, as with other commentators, let me take this opportunity to mention what a bunch of idiots Gartner are. I remember them telling me in a meeting that a certain RDBMS couldn't do more than 100 transactions a second. I said that this surprised me as we had it in production doing 700. They replied that I must be lucky, I replied that I'd seen it running faster at other sites I'd worked on. They had no reply.
Moral of the story, don't listen to Gartner and certainly don't pay for this rubbish.
There, that feels better?
@Core Count versus BIOS Count
Yes, because all IO goes through the BIOS.... even on systems that have no "BIOS". doh.
You might want to actually read up on "how computers work" before you wade in next time.
@256 core limit?
That would be bank switching. So 256k of RAM would be 4 banks, which you can only see 64k of at a time. Switching banks isn't free.
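A toy model of that in Python, as a sketch only: the class and method names are made up, and it swaps the whole 64K at once, whereas real machines typically banked just a window of the address space.

```python
BANK_SIZE = 64 * 1024  # what a 16-bit address can reach

class BankedMemory:
    """256K of RAM seen through a single 64K window (hypothetical model)."""

    def __init__(self, banks: int = 4) -> None:
        self._banks = [bytearray(BANK_SIZE) for _ in range(banks)]
        self._current = 0
        self.switches = 0  # count switches, since they are not free

    def select_bank(self, bank: int) -> None:
        if not 0 <= bank < len(self._banks):
            raise ValueError("no such bank")
        if bank != self._current:
            self._current = bank
            self.switches += 1

    def read(self, address: int) -> int:
        return self._banks[self._current][address & 0xFFFF]

    def write(self, address: int, value: int) -> None:
        self._banks[self._current][address & 0xFFFF] = value

mem = BankedMemory()
mem.write(0x1000, 7)      # lands in bank 0
mem.select_bank(1)
print(mem.read(0x1000))   # 0: same address, different bank
mem.select_bank(0)
print(mem.read(0x1000))   # 7
```

The program only ever sees 64K at a time and pays for every switch.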
History repeating on me
"Moore's Law is gonna run out of gas, and sooner rather than later, because software can't take advantage..."
Maybe I am an old man, but I would swear that I read the same thing written about the 80386 chip when it was launched. "What was the point of a 32 bit chip, when there were no 32 bit applications?"
At that time, a 16MHz 80386 computer actually took longer to do the same job as a 16MHz 80286. This was because it was doing 32 bit fetches to get the same 16 bit data. But the software caught up... as usual.
"I want 1000 monkeys to crank out Shakespeare's greatest work."
That entirely depends upon how long you take to choose which of the Bard's was his greatest.
"Of course the problem here is that some tasks don't gain anything from parallelism, getting 9 women pregnant still means a nine month wait for a baby"
True, but getting 9 women pregnant would be a hell of a lot more fun than sorting out this bl*ddy database
"all IO goes through the BIOS" ?????
I thought that statement started being generally false in (say) 1993 or so, when NT first started to hit the streets? Today's BIOS only exists to configure the system and get it booted following a powerup or three finger salute, any proper OS completely bypasses the BIOS once it's up and running, surely?
Don't they teach kids anything at school these days?
Chicken and egg - Azul example catastrophe
Ask the numbskull management at Azul how core-counts only help if your software takes advantage of it - that buggered science project has to be one of the biggest cash-sinks in recent memory - nothing like running an application on an over-priced appliance with 192 cores and only one slow-one is active...yet the rumour has it that they keep the "hope-alive" for the future of multi-threaded applications being just over the horizon...pity the poor sod who bought into that - of course this all just confirms "...there is a sucker born every minute...." and the founders keep gettin' paid...brilliant!
@Paul Smith + my 1/2p
>>but the software caught up... as usual.<<
And now we have m$ windows. how the fuck has the software caught up ?
What we have is dinosaur O/S linked with dinosaur GUI linked with fuckwit user. Recipe for disaster.
Even basic multithreading can get into deadlock that is difficult to resolve. You have to have a means of ensuring memory that is being read is not written into at the same time. Look at the loop processing example above. That example does not expand to multicore, unless you unroll the loop. If that is practical.
Basic multithreading uses locking flags, mutexes, etc to avoid deadlock, race conditions, etc. And that's on a single core. Now we go multicore. Something has to arbitrate, and suddenly as you scale up the arbitration resource requirement exceeds the multicore resource benefit.
Designing multithreaded software is hard enough when you control all the threads in that software. Now throw in other software, and the fact that it isn't all running on the same CPU, but may be accessing the same data, etc. The deadlock possibilities are mindboggling ! How does thread 7 on core 2 know that thread 5 on core 4 has locked such and such resource, without a dedicated processor controlling access to shared resources ?
My brain isn't big enough to handle all of this !
However, I'm all for these high power multicore boxes. Sooner or later I'll be able to sell the CPU cycles for academic research (eg. chocolate teapot) whilst heating my home. Nice one !
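One common way to keep two threads from deadlocking over a pair of locks is to always take the locks in the same order. The Python sketch below shows that idea with the standard threading module; the Account class, the transfer function and the numbers are all hypothetical, and it is a toy, not a recipe for a real system.

```python
import threading

class Account:
    """Hypothetical shared resource guarded by its own mutex."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src: Account, dst: Account, amount: int) -> None:
    if src is dst:
        return  # nothing to move, and locking the same mutex twice would hang
    # Take both locks in one agreed order (here by id()), so two threads can
    # never each hold one lock while waiting for the other.
    first, second = sorted((src, dst), key=id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount

def run_demo(rounds: int = 10_000):
    """Two threads move money in opposite directions at the same time."""
    a, b = Account(1000), Account(1000)

    def move(src, dst):
        for _ in range(rounds):
            transfer(src, dst, 1)

    threads = [threading.Thread(target=move, args=(a, b)),
               threading.Thread(target=move, args=(b, a))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a.balance, b.balance
```

If each thread instead locked "its own" account first, the two could end up waiting on each other forever. In this sketch the lock is just an object in memory that every thread can see, so nothing extra is needed to arbitrate between them.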
If the memory bus is a bottleneck
Why not go from slots to sockets? After all, Slot 1 & Slot A were abandoned - maybe it's time we did the same for RAM. Granted, you lose half your possible capacity and you need to make some room on the mobo, but that's not all that difficult.
For a different approach to problem solving and multicore devices,
check out a 40 core device, in production now.
This is not something to run Windblows or *nix,
but if you want to design a filter, or do some real time processing.
Embedded processors are everywhere and we often don't need
a full blown OS and a C++## compiler for a little old router, etc.
Forget Moore's law, that's for Intel type processing.
Try Chuck Moore's law, less is more, the master minimalist.
not so fast
Gaming will always eat every CPU and GPU cycle available.
The bottlenecks mentioned are already solved by SUN, how about other vendors?
"More recently, researchers at Sandia National Laboratory released a paper showing that chips run out of gas at eight cores."
Eight cores on the SUN CoolThreads T1 and T2 processors scale near linearly with no sign of running "out of gas".
Solaris with SPARC scales near-linearly into dozens of cores on a single large machine instance (without clustering) - place those same mechanisms in a single piece of silicon and any common man can see that the research was done by stupid people.
Similar things were said about 4 processors per box, where 8 processors per box would cause Windows NT to suffer performance degradation - but this was due to a poor OS design, not a poor hardware design.
"I/O and memory bandwidth issues keep the processors tapping their feet, waiting for data."
This is why a good system vendor will make sure there is ample memory and I/O bandwidth available. Moving from 4 access channels to 8 access channels, as people speculate that SUN is going to do, will resolve this issue.
Also, there is a wide-scale movement in SUN to move to more solid state hard disks in their storage, which will alleviate I/O issues. Even ZFS will support solid-state flash to augment disks, at the OS level, so software applications will not be required to tune specifically for new hardware. This seems to resolve this concern.
"Sun Microsystems already has eight cores per chip (with eight threads per core) with its 'Niagara' family of Sparc T series, and will boost that to 16 cores and 32 threads with its 'Rock' UltraSparc-RK processors, due later this year."
Will SUN demonstrate near-linear performance increments with "Niagara" CoolThreads T series as well as with "Rock" RK series?
If the issues with scaling from 4 to 8 cores are isolated to the issues mentioned in this article, then it seems SUN has had them resolved.
I am sure that SUN is not the only hardware vendor who has these issues resolved.
We need some smarter people writing these articles.
Spare us the sales BS!
Yeah - Niagara is a wonder - lots of cores, no memory bandwidth. Look at any benchmark Sun has dared publish results for using Niagara boxes. They are nowhere near leadership in performance per core or even performance per watt. I read somewhere on El Reg one time that ROCK stood for Regatta On a Chip Killer. Regatta - isn't that IBM POWER4 technology that came out in 2001? ROCK's own initials imply it was meant to combat POWER4...hmm...IBM is on POWER6 now. ROCK should be a real barn burner! :-)
Single threaded performance is still important. Multithreading isn't that easy, hence the efforts in our higher learning institutes to promote more R&D in writing software for these wonderful new technologies. Multicore is only easy for the hardware guys!
POWER6 doesn't look too shabby - great single threaded performance at 5GHz with a system interconnect running as high as 2.5GHz providing lots of memory bandwidth per core. And Intel finally getting rid of the front side bus bottleneck...
I personally like POWER6 and Nehalem and won't be getting off the mainframe anytime soon. Those are three pretty solid platforms I think.
Re: Regatta On a Chip Killer
I haven't heard that one, but I wouldn't doubt that's what it stands for. Regatta had 32 cores in an entire system costing millions of dollars. So, if one ROCK chip can do what an entire Regatta could do, then I am impressed.
That's not what I meant...
I think you missed my sarcasm. If ROCK actually stood for Regatta on a Chip Killer, it was designed to take on 2001 technology. That makes it rather outdated and rather late.
At the end of the day, my position is multicore multithreaded hardware is much easier for the hardware vendors to produce than it is for software folks (outside of Top 500 users) to exploit. Somehow those cores have to be fed data and instructions too...Sun's core count seems to be way beyond its ability to feed the cores with data (memory bandwidth per core).
That's not what I meant...
No. You don't get me. There is a lot of real work being done outside of the Top 500. Sun has been working in the greater than 64 core/cpu world for over 10 years. Solaris is perfectly tuned for this. It is not unrealistic for a Solaris box to use 64-256 cores and scale perfectly. Just because Linux and Windows cannot do this, does not mean that it is not doable.
Also, Sun has been doing hardware domaining for over 10 years as well. No other vendor will be able to do the same amount of work that a ROCK will be able to do in 250W. Just imagine 32 virtual servers in 250W all running at 2.3GHz. It can't be done by any other vendor. That is impressive.
Just because you are stuck in a PC world doesn't mean that the rest of the world isn't actually doing real, interesting things. No other vendor can do the work of a 32-core Regatta in a single CPU today or in the near future, especially at anywhere near 250W.