Stanford super runs million-core calculation

Stanford University engineers are claiming a record for the year-old Sequoia supercomputer, after running up a calculation that used more than a million of the machine’s cores at once. The work was conducted by the university’s Centre for Turbulence Research, seeking to get a model for supersonic jet noise that’s more …

COMMENTS

This topic is closed for new posts.
Gold badge
Thumb Up

Quite impressive in terms of size but am I alone in wondering.

Shouldn't this SIMD thing just work by now instead of needing lots of twiddling?

Thumbs up, as this is a serious problem for aircraft, and if anyone wants to take a scheduled supersonic flight in their lifetime this is going to be needed.

0
1
Bronze badge
Boffin

Re: Quite impressive in terms of size but am I alone in wondering.

"Shouldn't this SIMD thing just work by now instead needing lots of twiddling?"

It's not just SIMD. Although the article doesn't state it explicitly, each of the cores models a small area of space and has to communicate various outputs to neighbouring small areas of space. The clue is in the line "The waves propagating throughout the simulation require a carefully orchestrated balance between computation, memory and communication." Amdahl's Law puts a brake on how well any real-world computation like this will scale up when run on a parallel (or SIMD) architecture, due to the need for components to interconnect and transfer data between each other (such as propagating global force/pressure vectors after each local computation per simulation time quantum). In this case, I'm sure a lot of their time spent "ironing out the wrinkles" went on getting those inter-core messaging parts of the simulation humming. But there are other potential bottlenecks that need to be looked at to prevent stalls/starvation too (i.e., "computation, memory and communication" above). There's definitely not just a single "point and shoot" solution to parallel programming.
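
As a rough illustration of the neighbour-to-neighbour exchange described above, here is a minimal Python sketch of a 1D domain split into chunks, where each chunk needs one "halo" value from each neighbour before it can update its own edge cells. The function names, the three-point averaging update and the list-based "communication" are all assumptions made for illustration; the Stanford code and its real messaging layer are not described in the article.

```python
# Toy 1D domain decomposition with halo exchange (illustrative only).
# Each chunk stands in for the patch of space owned by one core.

def step_serial(u):
    """One smoothing step over the whole domain; the two end cells stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new


def split(u, parts):
    """Split the domain into `parts` contiguous chunks (assumes parts <= len(u))."""
    n = len(u)
    bounds = [n * k // parts for k in range(parts + 1)]
    return [u[bounds[k]:bounds[k + 1]] for k in range(parts)]


def step_decomposed(chunks):
    """Same step, but each chunk only sees its own cells plus its halo values."""
    # Communication phase: every chunk receives its neighbours' edge cells.
    halos = []
    for k in range(len(chunks)):
        left = [chunks[k - 1][-1]] if k > 0 else []
        right = [chunks[k + 1][0]] if k < len(chunks) - 1 else []
        halos.append((left, right))

    # Computation phase: purely local once the halo values have arrived.
    result = []
    for chunk, (left, right) in zip(chunks, halos):
        padded = left + chunk + right
        offset = len(left)
        new = chunk[:]
        for j in range(len(chunk)):
            p = j + offset
            # Skip the two cells at the ends of the global domain.
            if 0 < p < len(padded) - 1:
                new[j] = (padded[p - 1] + padded[p] + padded[p + 1]) / 3.0
        result.append(new)
    return result


if __name__ == "__main__":
    field = [float((i * i) % 7) for i in range(16)]
    merged = [x for chunk in step_decomposed(split(field, 4)) for x in chunk]
    print(merged == step_serial(field))
```

The point of the sketch is that no chunk can start its computation phase until the communication phase has delivered its halo values, which is where the balancing act between computation and communication comes from.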

4
0
Anonymous Coward

Re: Quite impressive in terms of size but am I alone in wondering.

"Shouldn't this SIMD thing just work by now instead needing lots of twiddling?"

That's sort of like asking why there's still bad software in the world--shouldn't compilers just produce good software?

Making software run on multiple cores (not SIMD, but whatever) is more art than science, no matter how many compilers come along promising to do it for you automatically.

1
0
Silver badge

Re: Quite impressive in terms of size but am I alone in wondering.

Dead right. Getting a million cores running effectively is very hard!! It is not just Amdahl's law that can get in the way (i.e. the max speed-up is limited by a section of the code that is sequential), it is communication overhead. We are working on a problem that, if implemented naively, would require O(N log N) communication, with N the number of pixels, which in practice means that the largest data set we have to work on (1.5 Tpixel) requires on the order of 120 TB of data traffic. We are trying to get that down to O(G √N log N), with G the number of grey levels, which boils down to on the order of 240 GB of traffic in our case. Still a lot, but it should bring the algorithm into the realms of the possible.

I do not see compilers taking over this sort of code redesign automatically any time soon.
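
For readers who have not met Amdahl's law, a minimal Python sketch of the standard formula follows: if a fraction p of the work can be parallelised and the rest is sequential, the speed-up on n cores is 1 / ((1 - p) + p / n). The function name and the example fractions are assumptions chosen for illustration, and the formula deliberately ignores the communication overhead discussed above.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speed-up on `cores` cores when only `parallel_fraction` of the
    work can run in parallel (Amdahl's law; communication cost is ignored)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical parallel fractions, evaluated at a million cores.
    for p in (0.95, 0.999, 0.999999):
        print(p, round(amdahl_speedup(p, 1_000_000), 1))
```

Even with only one millionth of the work left sequential, a million cores deliver roughly half of the ideal million-fold speed-up, and that is before any time spent on communication is counted.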

5
0
Boffin

Re: Quite impressive in terms of size but am I alone in wondering.

"Shouldn't this SIMD thing just work by now instead needing lots of twiddling?"

What SIMD thing?

1
0

Re: Quite impressive in terms of size but am I alone in wondering.

It's a little simplistic to say that this SIMD thing should just work. A friend of mine used to write code for multi-core systems. She said it's relatively simple to code for 2 or 4 cores. When that code has to scale to thousands (or even a million, in this case), then scheduling and managing communication between the threads running on each core becomes both a massive job and a nightmare.

1
0

Re: Quite impressive in terms of size but am I alone in wondering.

@John Smith 19: The SIMD thing _does_ just work. Out of the box, no prob.

It's when you insist that you want every last bit of performance from your hardware that things get difficult. Running a computer at a couple of percent efficiency isn't hard and for many situations is plenty good enough. Running your home PC at anything approaching full throttle is an interesting engineering task. Running a million-CPU super at high efficiency is a serious challenge.

So, it's not that there aren't tools. There are and they do a decent job. But a team of specialists piling on the man-months can do better. So it becomes economics: is the cost of the specialists worth the additional performance? This is really no different from programs like Photoshop containing a few hand-coded pieces of assembly. The compiler is good, but sometimes it's worth it to do better.

0
0
Silver badge

Re: supersonic flight

John Smith 19, this simulation is not about supersonic flight but about gas coming out of the engine at supersonic speeds. A very common thing among jet engines...

1
0
Boffin

Re: Quite impressive in terms of size but am I alone in wondering.

If by your question, you wish to find out what SIMD is, it is one of four models for organisation of parallel computations and parallel computers.

S - Single, I - Instruction (stream), M - Multiple, D - Data (stream)

The other three (just for completeness' sake) are as follows:

SISD (Single Instruction, Single Data);

MISD (Multiple Instruction, Single Data);

MIMD (Multiple Instruction, Multiple Data).

The most common ones commercially are SISD (Standard Serial Computing), SIMD, and MIMD.

0
0
Bronze badge

At what point

is it cheaper and simpler to build a jet engine rather than build a simulation of one?

3
0
Anonymous Coward

Re: At what point

not at this point, clearly.

Even if the simulation takes longer than a physical test would, the physical test can't tell you what is going on inside the engine at every point. And I would think that a virtual design is quicker to build virtually than all or part of a real, physical jet engine.

0
0
Gold badge

Re: At what point

At what point is it cheaper and simpler to build several dozen jet engines, each to a one-off specification, rather than build a simulation of just one and tweak its parameters?

"There, fixed that for you" as they say.

But yes, this beastie would be fairly close to that point. I assume it can also be used for other things, and I also assume that its builders have learned a fair bit about architecture and so the next one will be cheaper.

1
0
Boffin

Re: At what point

One important advantage of constructing a simulation is that you get to validate models you might have constructed by induction from empirical methods like building (many versions of) jet engines and testing them with instruments.

The benefit of having a validated, quantitative model is that you are now aware of how the various attributes and parameters of the model constrain each other, thus you are better enabled to do effective engineering. You will be better aware of the various tradeoffs and optimisations you may perform during the specification and design process.

1
0
Bronze badge

The trick is...

"The trick is getting the models to run quickly enough: and it was the search for speed that led the researchers to get to work getting their code to run across so many cores in parallel."

Well, that's one way of putting it.

Or,

"The trick is getting enough points, and it was the search for higher resolution that led the researchers to get to work getting their code to run across so many cores in parallel."

Anyone can get a model to run in minutes. The trick is getting a model that has enough resolution to tell you what you want to know.

.

Ok you can transform faster==more parallel, except you can't exactly, and now that parallel simulations run "fast enough", you aren't trying to make it faster: you're trying to get more points at the same speed.
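
The distinction drawn here, more points at the same speed rather than the same points faster, is usually called weak scaling, and its textbook counterpart to Amdahl's law is Gustafson's law. A minimal Python sketch, assuming the work per core is held fixed and that a fraction s of the run time on the parallel machine is sequential; the function names and the example numbers are hypothetical.

```python
def gustafson_speedup(serial_fraction, cores):
    """Scaled speed-up when the problem grows with the machine
    (Gustafson's law): s + (1 - s) * n."""
    return serial_fraction + (1.0 - serial_fraction) * cores


def total_points(points_per_core, cores):
    """Grid size reachable when every core keeps the same local workload."""
    return points_per_core * cores


if __name__ == "__main__":
    # Hypothetical workload: 100,000 grid points per core, 1% sequential time.
    for cores in (1_000, 100_000, 1_000_000):
        print(cores, total_points(100_000, cores), gustafson_speedup(0.01, cores))
```

Under this model the resolution grows in proportion to the core count while the wall-clock time per step stays roughly constant, which matches the "more points at the same speed" framing.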

1
0
Anonymous Coward

Why did I read that as "university’s Centre for Tumescence Research"?

1
0
Gold badge

Perhaps you looked at the picture at the bottom of the article.

0
0

Take.

Mind.

Out.

Of.

Gutter!

0
0
Gold badge
Unhappy

So it's partly the communications overhead and the synchronisation that's the problem.

I seem to recall a system that decoupled inter-processor comms from the data/instruction bus while incorporating a simple hardware scheduler.

But that was a long time ago.

0
1

Re: So it's partly the communications overhead and the synchronisation that's the problem.

Sure, you can do that (and I expect they have done just that). But, if your calculation cannot continue until you have received the new input values from the neighboring CPU, then it doesn't matter if the hardware is decoupled, your software will stall.

So, what you need is not just well-balanced hardware, but well-balanced software that plays precisely to the strengths and weaknesses of that hardware. That's tough.
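
A toy timing model in Python makes the stall concrete. It assumes each iteration splits into interior cells (which need no neighbour data), boundary cells (which do), and one exchange with neighbours; the function names and all the numbers are hypothetical, and real machines have many more effects than this captures.

```python
def blocking_step_time(interior, boundary, comm):
    """Wait for neighbour data first, then compute everything."""
    return comm + interior + boundary


def overlapped_step_time(interior, boundary, comm):
    """Start the exchange, compute the interior while it is in flight,
    then compute the boundary cells once the data has arrived."""
    return max(interior, comm) + boundary


if __name__ == "__main__":
    # Hypothetical costs in arbitrary time units: interior 8, boundary 2.
    for comm in (1.0, 5.0, 20.0):
        print(comm,
              blocking_step_time(8.0, 2.0, comm),
              overlapped_step_time(8.0, 2.0, comm))
```

While the exchange is shorter than the interior work, overlapping hides it completely; once it is longer, the core sits idle regardless of how the hardware is decoupled, which is the software stall described above.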

1
0

Re: So it's partly the communications overhead and the synchronisation that's the problem.

You're thinking of Transputers. Which were certainly the right idea for CFD, not so much on account of the architecture but because of the good balance between compute and communication speeds, and especially the very low communication latency. Low latency meant that relatively little time was wasted hanging around at the end of an iteration, waiting for data to arrive from neighbouring processors. But they had their problems too - especially absence of any kind of global/broadcast communication.

I don't know how things have changed since those days, but at that time there was a tension between algorithmic efficiency and parallel processing - the more efficient CFD algorithms coupled cells across the whole domain, which was generally OK to parallelise on a shared-memory system but was a no-go on a distributed-memory architecture, where only less efficient algorithms could be used.

So... real kudos to the guys & gals, both system architects and software developers, who have pulled off the feat of building a system and a real-world application that scale across 10^6 processing elements.

0
0
Boffin

This release makes it sound like Stanford engineered it

The 1,572,864-core computer was built by IBM and runs Linux.

http://en.wikipedia.org/wiki/IBM_Sequoia

I don't know what record they are claiming to have achieved with 1 million cores.

0
0
Silver badge

Re: This release makes it sound like Stanford engineered it

I believe, having actually read the article, that this is the first time they have used over a million cores in the same calculation. Usually they are spread over multiple jobs.

0
0

Is it hard to do that?

Could you not do that just by writing some bad code?

0
0
Silver badge
Joke

Re: Is it hard to do that?

"Could you not do that just by writing some bad code?"

You'd better put more paper in the printer if you're making a million copies

0
0
Silver badge
Joke

no biggie!

I am reliably informed that the new 128GB iPad can do it faster

0
0