Cray turns cluster crank with ScaleMP

Does your performance in the datacenter suffer because you don’t have enough memory to really get the job done? Do you have apps that don’t perform well on clusters, or don’t parallelize at all? If this describes you or a loved one, read on, because Cray thinks it has the solution for you. Cray, with partner ScaleMP, recently …

COMMENTS

This topic is closed for new posts.

And still

There are those who claim Linux will not scale. Not that it really matters as those who buy that sort of equipment know what they want.

Roo

Re: And still

I guess those guys who refuse to learn the lesson will continue to take a big hit in the profit margin. :)


Re: And still

Implementing cache coherency over IB doesn't scream scaling. Not unless someone publishes the NUMA ratio to back it up.


Re: And still

Frankly, if the whole of that memory just looked like a bloody fast DISK it would still make stuff immensely faster.

Summing huge series or working over large arrays of data is not generally CPU bound. The actual algorithms work in small RAM areas; it's collecting and storing the data to work on that screws things up.

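The point above (the arithmetic needs only a small working set; it is moving the data that hurts) can be sketched in a few lines of Python. Everything here is an illustrative toy, not a real workload:

```python
# Toy illustration: summing a huge series needs only a tiny working set.
# The cost is in streaming the data to the CPU, not in the arithmetic.

def streaming_sum(chunks):
    """Sum an arbitrarily large series one small chunk at a time."""
    total = 0
    for chunk in chunks:          # each chunk fits comfortably in cache/RAM
        total += sum(chunk)       # the actual compute touches little memory
    return total

def make_chunks(n_chunks, chunk_size):
    """Simulate a 'huge' data set as a generator of small chunks."""
    for i in range(n_chunks):
        yield range(i * chunk_size, (i + 1) * chunk_size)

total = streaming_sum(make_chunks(1000, 1000))
print(total)  # equals sum(range(1_000_000))
```

The algorithm itself never holds more than one chunk; how fast the chunks can be fed to it is what decides the runtime.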

Re: And still

What are you trying to say?


Re: And still

Yeah, very good point on that. And this solution should be a hell of a lot faster on those types of workloads than message passing on clusters. But we'll need to see some numbers to really know for sure.

I expect to see this kind of technology filter down into the enterprise fairly soon. ScaleMP is part of the secret sauce in SAP's HANA in-memory analytics product set.


Re: And still

I see what you mean, but I remember the days when we had SMP systems using buses and crossbars with flat access times (i.e. non-NUMA). The system I'm most familiar with, the Sun E10000, had a 16x16 crossbar with 1.3 GB/sec of bandwidth at about 550 ns latency. The E10k scaled like a scorched weasel (which is pretty good) when compared with the systems of that era.

Today, you can get 56Gb FDR InfiniBand that offers 6.8 GB/sec of bandwidth at around 700 ns latency. That's not too bad at all, and should be enough to ensure reasonably linear scalability on a good portion of workloads.

In any case, an SMP system should be able to scale better than a cluster running message passing. While sitting here writing out this reply, I can't think of any application that would scale better on an MPP system vs. SMP, but I could be wrong.

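A quick sanity check on the latency and bandwidth figures quoted above, using the standard first-order model: transfer time = latency + size / bandwidth. The figures are the commenter's, not vendor-verified:

```python
# First-order transfer-time model applied to the numbers in the comment.

def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to move one message: fixed latency plus streaming time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

E10K = (550e-9, 1.3e9)   # 550 ns latency, 1.3 GB/s crossbar
FDR  = (700e-9, 6.8e9)   # 700 ns latency, 6.8 GB/s FDR InfiniBand

for size in (64, 4096, 1 << 20):   # small, page-sized, 1 MiB messages
    t_old = transfer_time(size, *E10K)
    t_new = transfer_time(size, *FDR)
    print(f"{size:>8} B: E10k {t_old * 1e6:8.2f} us, FDR {t_new * 1e6:8.2f} us")

# Small messages are latency-bound, where the E10k's lower latency still
# wins slightly; large transfers are bandwidth-bound, where FDR's roughly
# 5x bandwidth advantage dominates.
```

Which regime a workload sits in (many small synchronizing messages vs. large bulk transfers) decides whether the latency or the bandwidth figure matters.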

Re: And still

"In any case, an SMP system should be able to scale better than a cluster running message passing. While sitting here writing out this reply, I can't think of any application that would scale better on an MPP system vs. SMP, but I could be wrong."

Sorry, you are wrong. There is always an overhead (cache coherency) which the cluster does not have, so classical MPI code like CFD will "always" scale better than any ccNUMA system. ScaleMP does well with CFD because in hardware it is inherently a cluster. Where it doesn't do well is in mixed workloads, e.g. CFD and FEA structural analysis on the same system.

Anonymous Coward

Re: And still

The market for big SMP UNIX boxes has shrunk, but meet the great-grandson of the Sun E10K: the SPARC Enterprise M9000. It has 64 sockets of SPARC64 running at 3 GHz (which gives you 256 cores, 512 threads), 4 TB of memory, and a crossbar interconnect rated at 737 GB/s. It also comes with 288 I/O slots. Available from either Oracle or Fujitsu. It will smoke that Cray box, but it's not cheap.

Anonymous Coward

Re: And still

I guess the M6-32 is the grandson of the M9000 then.

http://www.oracle.com/us/products/servers-storage/servers/sparc/oracle-sparc/m6-32/overview/index.html

32 TB of memory plus 32 CPUs at 3.6 GHz (384 cores, 3,072 processing threads) with a 3,072 GB/sec interconnect. And you can run a single Solaris image on it. Talk about scalability...

Roo

Re: And still

"In any case, a SMP system should be able to scale better than a cluster running message passing. While sitting here writing out this reply, I can't think of any application that would scale better in an MPP system vs. SMP, but I could be wrong."

The data/sync signals travel the same distance regardless of whether the code explicitly (Message Passing) or implicitly (Shared Memory) initiates the transfer. There is no physical reason why Message Passing code should be slower than Shared Memory on the same hardware; it's all down to the implementation.

For the record these days shared memory systems are usually implemented on top of hardware that does message passing (mainstream examples from the x86 world: HyperTransport, QuickPath).

In my time I have implemented lightweight message passing APIs that sit on shared memory and converted shared memory code to make use of them. Benchmarking revealed that there was no performance penalty over the original code on the same hardware. The message passing code was portable while the shared memory code was pretty much tied to the machine it was developed on, and the message passing code could tackle problems that exceeded the available address space, so it was far more scalable than the original code.
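A minimal sketch of what a lightweight message-passing API layered on shared memory might look like, in the spirit of the comment above. The Channel class and all names are hypothetical, not the commenter's actual code, and threads stand in for processors sharing memory:

```python
# A message-passing channel built on memory shared between threads:
# a bounded buffer guarded by a lock and two condition variables.
import threading
from collections import deque

class Channel:
    """Bounded message channel implemented on top of shared memory."""
    def __init__(self, capacity=16):
        self._buf = deque()
        self._capacity = capacity
        lock = threading.Lock()
        self._not_empty = threading.Condition(lock)
        self._not_full = threading.Condition(lock)

    def send(self, msg):
        with self._not_full:
            while len(self._buf) >= self._capacity:
                self._not_full.wait()
            self._buf.append(msg)
            self._not_empty.notify()

    def recv(self):
        with self._not_empty:
            while not self._buf:
                self._not_empty.wait()
            msg = self._buf.popleft()
            self._not_full.notify()
            return msg

# "Converted" code: instead of both threads mutating a shared array,
# the producer sends values and the consumer alone owns the result.
ch = Channel()
results = []

def producer():
    for i in range(100):
        ch.send(i)
    ch.send(None)                      # sentinel: end of stream

def consumer():
    while (msg := ch.recv()) is not None:
        results.append(msg)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
print(sum(results))  # 4950
```

The interaction between the two sides is now confined to send/recv calls, which is exactly what makes such code portable: swap the Channel internals for sockets or MPI and the producer/consumer logic is untouched.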

Personally I think the biggest benefit of message passing is that it formalizes the interaction between processes, which can be leveraged to produce more robust and fault tolerant code. I like fault tolerant code because when you run it on a seriously big network of machines you can almost guarantee that a node will fail somewhere. Making an existing piece of shared memory code survive losing a chunk of its address space and a processor is usually very difficult if not impossible.

Don't get me wrong. For some problems I really wouldn't bother with message passing, but they are in the minority. :)
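The fault-tolerance argument above can be illustrated with a toy supervisor that requeues work when a "node" fails. All names here are hypothetical, and failure is simulated with an exception; a real system would detect dead nodes via timeouts or broken connections:

```python
# Because message passing formalizes who owns each task, a supervisor can
# requeue work that a failed node never completed.
import queue

def run_with_retries(tasks, workers, max_attempts=3):
    """Dispatch tasks round-robin; requeue a task when its worker 'fails'."""
    pending = queue.Queue()
    for t in tasks:
        pending.put((t, 0))            # (task, attempts so far)
    results = {}
    i = 0
    while not pending.empty():
        task, attempts = pending.get()
        worker = workers[i % len(workers)]
        i += 1
        try:
            results[task] = worker(task)
        except RuntimeError:           # stand-in for a dead node
            if attempts + 1 >= max_attempts:
                raise                  # give up after repeated failures
            pending.put((task, attempts + 1))
    return results

def good_worker(task):
    return task * task

def flaky_worker(task):
    if task == 3:                      # this node "dies" on task 3
        raise RuntimeError("node failure")
    return task * task

out = run_with_retries(range(6), [good_worker, flaky_worker])
print(sorted(out.items()))  # squares of 0..5, including the retried task 3
```

The equivalent recovery in a shared-memory program would mean salvaging whatever half-written state the dead processor left behind, which is the difficulty the comment describes.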
