"Power is now the limiter of every computing platform, from cellphones to PCs and even data centres," said NVIDIA chief executive Jen-Hsun Huang, speaking at the company's GPU Technology Conference in Beijing last week. There was much talk there about the path to exascale, a form of supercomputing that can execute 10^18 flop/s ( …
Power is now the limiter of every computing platform, from cellphones to PCs and even data centres
"Power is now the limiter of every computing platform, from cellphones to PCs and even data centres"
Nope - still bandwidth as far as many users are concerned. That's what limits both my home and work systems. Perhaps I should add "affordable" to that term as I'm sure I could get higher if I was prepared to mortgage my next five decades of income.
Does it have to be Many Weak vs. Powerful Few?
"Using much simpler processors and many of them, we can optimise for throughput. The unfortunate part is that this processor would no longer be good for single-threaded applications."
This is an interesting statement, and I have an interesting question to follow it.
All processors today use, essentially, a duplicating scheme. Dual core on up are just copies of the first core; an 8 core CPU has 8 identical cores. Why? Knowing that some operations are ideal for a single, fast core, and other operations are ideal for a huge pile of parallel, weak cores, why not create a processing unit that has both a small amount of high-power, strong, fast cores, and a large number of low-power, small, weak cores? That way, the scheduling process wouldn't need to bridge two distinct processing units, and better still, processes could be 'weighted' towards parallel or serial. If the parallel part of the processor is bogged down, the commands can be kicked to the high-speed side, and if the high-speed processors are full, some processes can be kicked to the slower parallel processes.
Essentially, it would tie a powerful, parallel-approach GPU to a high-speed, serial-approach CPU, but without limiting processes to one or the other - not just two designs on one chip, but a single, dual-purpose chip.
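To put rough numbers on the idea above, here is a minimal sketch using an Amdahl's-law style estimate. Everything in it is an assumption made up for illustration: the `run_time` helper, the 10% serial fraction, the core counts, and the guess that a weak core runs at a quarter the speed of a strong one. It models no real chip.

```python
def run_time(serial_fraction, serial_speed, parallel_throughput):
    """Time for one unit of work: the serial part runs on a single core,
    the parallel part is spread over the chip's total throughput."""
    return serial_fraction / serial_speed + (1 - serial_fraction) / parallel_throughput


# Hypothetical chips; speeds are relative to one strong core = 1.0.
SERIAL = 0.1  # assumed share of the work that cannot be parallelised

strong_only = run_time(SERIAL, serial_speed=1.0, parallel_throughput=8 * 1.0)
weak_only = run_time(SERIAL, serial_speed=0.25, parallel_throughput=64 * 0.25)
mixed = run_time(SERIAL, serial_speed=1.0, parallel_throughput=2 * 1.0 + 48 * 0.25)

for name, t in [("8 strong", strong_only), ("64 weak", weak_only), ("2 strong + 48 weak", mixed)]:
    print(f"{name:>20}: {t:.3f}")
```

With these made-up figures the mixed chip finishes first, because the serial part still gets a strong core while the parallel part gets most of the weak cores' throughput. Different assumptions would shift the result.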
It's almost as though you read the article and then reworded it slightly and posted it as your own idea.
That's what the article is talking about in a nutshell.
HOMOgeneous multicore computing came about for reasons of practicality and compatibility. Practicality because they were easier to design. Then there was the idea that no matter where on the CPU the instructions went, it would work. It was a practical solution for the first mainstream push of multicore computing.
These days, though, we're starting to see the limitations, such as underutilization. Some tasks can only be divided so much in the x86 instruction set, plus there are tasks like matrix and floating point operations that don't play to x86's strengths. With a better understanding of identifying these outlying tasks, it becomes more practical to start calling in "specialists" and start departing from the homogeneous multicore design.
With proper task management, we can make sure tasks don't get sent to the wrong core, and we can utilize all parts of the CPU more efficiently: kinda like the modern workplace where the push is to get the workers to do as much as they can so you can keep the labor costs down.
Heterogeneous multicore computing is far from new, but it now has more mainstream support. The Cell Broadband Engine was an early attempt at a mainstream heterogeneous CPU, but it suffered from the "first is worst" syndrome: other players managed to outclass it pretty quickly, but the idea has stuck. We'll probably see more heterogeneous computing in the coming years.
Perhaps I misunderstood, then
The statement "By adding the two processors, the sequential code can run on the CPU, the parallel code can run on the GPU, and as a result you can get the benefit of the both. We call it heterogeneous computing" seemed to indicate that the collection of processors was still seen as separate processors; perhaps in the same chip package, but still separate. It still mentions CPU and GPU as different parts of a whole.
My thought was not for a collection of independent GPU and CPU processors, but rather a processor chip with a few fast cores and many parallel cores, all in a single package. Not a large collection of mostly-fast cores, but a few very fast cores, and a lot of slow, massively parallel cores. Instead of choosing from the beginning to write a program based on a CPU or a GPU, a programmer would write however he or she pleased. Then, either the compiler, or the processor itself in real time, would choose what instructions get executed on which core.
Further, if the fast core was overtaxed, a few of the single-core instructions could run on the massively parallel cores. If the parallel cores were full, a few parallel instructions could go through the fast processor - the processor could balance itself.
That is unlike, for example, the MIC processor, which is basically a GPU that runs x86 instructions, or the possible ARM SoC, which is merely a CPU and a GPU put in the same package, but no more tied together than if they were separate.
Maybe I'm seeing a difference where there is none, but it seemed clear at the time :-)
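The self-balancing idea described in this comment could be sketched roughly as follows. This is only a toy model under stated assumptions: `CorePool`, `dispatch`, the pool capacities and the "serial"/"parallel" labels are hypothetical names invented here, and real hardware or compilers would decide placement in far more detail.

```python
from dataclasses import dataclass, field


@dataclass
class CorePool:
    name: str
    capacity: int  # assumed limit on tasks the pool can hold at once
    queue: list = field(default_factory=list)

    def has_room(self) -> bool:
        return len(self.queue) < self.capacity


def dispatch(task: str, kind: str, fast: CorePool, wide: CorePool) -> str:
    """Place a task on its preferred pool, spilling to the other one if full."""
    preferred, fallback = (fast, wide) if kind == "serial" else (wide, fast)
    for pool in (preferred, fallback):
        if pool.has_room():
            pool.queue.append(task)
            return pool.name
    raise RuntimeError("both pools are full")


fast = CorePool("fast", capacity=2)  # a few strong cores
wide = CorePool("wide", capacity=4)  # many weak cores
placements = [dispatch(f"t{i}", "serial", fast, wide) for i in range(3)]
print(placements)  # ['fast', 'fast', 'wide']
```

The third serial task spills onto the parallel pool once the fast pool is full, which is the "kick it to the other side" behaviour the comment describes.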
@ArmanX I believe Nvidia is thinking along these lines - a GPU with some cores based on the ARM CPU rather than simpler GPU cores. That's what Tesla CTO Steve Scott told me anyway.
computing power vs transistor count
I saw a graph of this a few years ago and it was perfectly linear (can't find it now, anyone help?).
It may be that the "...50 times more energy is dedicated to the scheduling of that operation than the operation itself" is unavoidable. Without spending this much effort it might be that the consequent pipeline bubbles, register dependencies etc., & especially read/write dependencies, would drop the performance by that amount. Just a guess though.
The "With four cores, in order to execute an operation, [...]" doesn't make sense as the execution is in-core and therefore totally independent of however many other cores there are - reads/writes which certainly would introduce the dependencies I mentioned, don't (IMO) factor in at this point.
(I am not a chip architect and may be guilty of some fuzzy thinking here)
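For scale, the quoted 50-to-1 figure can be turned into a share of total energy. A minimal sketch, assuming the ratio means scheduling energy divided by the energy of the operation itself; the function name `useful_energy_fraction` is made up for this example.

```python
def useful_energy_fraction(overhead_ratio: float) -> float:
    """Share of total energy spent on the operation itself when scheduling
    costs overhead_ratio times as much energy as the operation."""
    return 1.0 / (1.0 + overhead_ratio)


fraction = useful_energy_fraction(50)
print(f"{fraction:.1%}")  # 2.0%
```

So under that reading only about 2% of the energy goes into the operation. Whether the other 98% is avoidable is exactly the open question raised above.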
Sounds like a nice problem
Under reasonable assumptions as to the hardware architecture consider the scenarios:
1) Optimize useful instruction flow, increasing the number of actual hardware operations.
2) Incur inefficiencies in the useful instruction flow, with wasted hardware operations and delays.
A hardware operation needs the energy of mumble-mumble times the Boltzmann constant.
Draw a graph showing "Watt / useful instruction" for a series of increasing instruction flow optimization hardware, keeping the value "useful instruction / s" constant.
GO! You can probably tie it up to a No-Go Theorem in Applied Theoretical Computer Science, too.
Why are we so obsessed with the power profile of these chips? Yes, they suck up power, but they are sucking up power to improve the world we live in. No one seems to care about how much power Coke (or other advertisers) uses to light up their billboards across the world.
Your bloody Christmas tree lights would be better switched off in return for a few extra core hours.
That's not the problem
That power translates to heat being generated in a very small space, which is bad. You need bulky cooling, copper pipes, coolant tubes, noisy and dusty ventilators, special air ducts and space between the elements. This reduces density and increases cost.
The demand for that power translates to your data center needing a cheap power source, ideally nearby, like a hydroelectric dam around the corner, or a dedicated gas turbine (NSA builds its data miners where there is cheap electricity available). If the machines need more power on average, the diesel engines and batteries for the uninterruptible power supply will have to become larger. My local data center bills in "power steps", i.e. in watts, and there is a cap on how much power you can pull out of the socket. That's a hard limiting factor.
OpenCL is not lower level
Other than differences in the boilerplate needed to launch it and plumb everything together, you are hard pressed to spot the difference between CUDA and OpenCL kernel code for most common tasks.
CUDA is currently the favourite because of some very nice tools and a lot of libraries, but only because it has a couple of years' head start. If Intel put some of its compiler-writing muscle behind OpenCL, or AMD paid PGI to port Thrust, the game changes.
Actually, since it's common to rewrite OpenCL code at runtime for specific cases, you could argue it's higher level than CUDA.
Dual CPU GPU
@ArmanX tricky to do with big complex x86 type processors. One problem is that they are synchronous - all the billions of transistors clock together which uses a huge amount of power. GPUs typically run at a much lower clock speed.
Designing CPUs where different parts run at different speeds, or free-run, has been a series of failures for at least the last 25 years. One possible approach is to add a small linear CPU to the GPU to do the boring housekeeping stuff - but probably an ARM rather than x86.
Until we can get processors with freely scaling speed, it would be difficult indeed to combine them - especially in terms of power. It's not worth it, otherwise.
Though I could see it working - turn off all the low speed/high parallel processors, turn on the high speed/low parallel, process the serial command quickly, then flip it back the other way - it wouldn't be worth the effort to put it on one chip, rather than two distinct chips.
There are a few big problems with asynchronous computing. One of them is instruction coordination and dependency. A lot of computing tasks in a CPU are interdependent, so they have to wait on the state of different units. In synchronous units, timings are better known, so processors can be tuned so that dependent results arrive at predictable times, reducing the need for gatekeeping.
The second and more fundamental issue goes to uncertainty. Without the clock ruling the processor, there must still exist some form of coordination between the parts of a processor to determine who gets what first and so on. Otherwise, you can end up in metastable states which can produce dangerous uncertainties.