Even before Rackable Systems bought the carcass of supercomputer maker Silicon Graphics in April and took its name a month later, the future of the Itanium-based Altix shared memory supercomputers was in question. Since the takeover, the new SGI has been trying to stir up some interest in the existing machines while not exactly …
open to interpretation
"The real pity (perhaps) is that Java, because of its interpreted nature and therefore its high overhead compared to compiled languages..."
did we miss the last ten years of JIT compilation technology work? once you've been through a piece of code a couple of times and the JIT has been at it, it's not interpreted at all... java's reluctance to give you access to the low-level fiddly bits you need to implement efficient HPC code is the real problem.
is C++ comparable to FORTRAN these days? people were up to all sorts of template-unpleasantness to do efficient vector calculations last time i looked...
Lies, damn lies, statistics, and El Reg articles
It's a bit hard to follow the hardware setups and their respective performance. I presume that the article is making all of its SGI benchmark references to a 1024-core system with 512 cores devoted to the tests. This is then compared to 32 cores, 128 cores, and 64 cores.
Excuse me, but doesn't it make sense to do a little math for the reader and note how much performance *per core* a system is producing? Sure, 512 cores beats 32 cores. Isn't that expected? But the Sun system gives 26,293 BOPS per core, while SGI gives just 18,750 BOPS per core.
Now who smokes whom?
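For anyone who wants to redo that arithmetic, here is a rough sketch. The per-core figures are the ones quoted above; the core counts (512 for SGI, 32 for Sun) are the presumption made in this comment, and the aggregate totals are simply implied by multiplying the two, not published results.

```python
# Per-core BOPS figures as quoted in the comment above.
# Core counts are the commenter's presumption, not confirmed numbers.
SGI_CORES, SGI_BOPS_PER_CORE = 512, 18_750
SUN_CORES, SUN_BOPS_PER_CORE = 32, 26_293


def implied_total(cores: int, bops_per_core: int) -> int:
    """Aggregate throughput implied by a per-core figure (not a measured result)."""
    return cores * bops_per_core


sgi_total = implied_total(SGI_CORES, SGI_BOPS_PER_CORE)
sun_total = implied_total(SUN_CORES, SUN_BOPS_PER_CORE)

# How much faster one Sun core is than one SGI core on this benchmark.
per_core_ratio = SUN_BOPS_PER_CORE / SGI_BOPS_PER_CORE
# How much more aggregate throughput the bigger box delivers.
aggregate_ratio = sgi_total / sun_total

print(f"Sun per core vs SGI per core: {per_core_ratio:.2f}x")
print(f"SGI aggregate vs Sun aggregate: {aggregate_ratio:.2f}x")
```

So the answer to "who smokes whom" depends on which ratio you care about: per core or per box.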
I want to know the nr of CPUs in each machine. Not the nr of cores. Because if there are 64 CPUs then I know that machine uses lots of power, and I can rank the CPUs against each other. If a machine uses 128 cores, it could be using 16 cpus which tells me it uses less power. Or it could use 64 cpus which tells me that the machine uses lots of power.
Please talk about sockets, not the nr of cores. The nr of cores is not interesting when we talk about performance, which we do in this article.
So what if one core is faster than another core? Does this mean that the entire CPU is faster than the other? Not necessarily. Drawing that conclusion is misleading and only confuses things. It is like saying "this machine had 16 DRAM memory sticks" and hiding how much memory was used.
It is much easier and more revealing if you say 32 dual-core CPUs instead of 64 cores. In the future, there will be 16-core CPUs and even 32-core CPUs. To say 64 cores then would only serve to confuse. It would be better to say "2 of the 32-core CPUs". The nr of sockets is far more revealing in terms of power usage, performance, and bang for the buck.
Java, because of its interpreted nature ... compared to compiled languages
Well, that's not the point...
The point is not performance per core. The point here is that IBM, Sun, HP, etc. do not make ANY machine that will scale up past 128 cores, while SGI does. For some tasks a shared-memory supercomputer is the only way to go, a cluster of smaller systems will not work.
It'd be a brave MIS manager who bought SGI
I used SGI kit for HPC back when they were the old, purple-beastie SGI and loved it. But it'd be a brave purchaser who bought their kit to run ERP or middleware.
- Standard hardware?: err, no, not remotely. NUMA I can live with (lots of people ran Oracle on Sequent Dynix +NUMA after all). But the use of Itanium would scare my Forbes-reading colleagues
- Viable vendor?: Hmm, maybe, but their commitment to the data center is yet to be proven. They're certainly no Dell or Sun here.
- Cheap? Hell no.
As for the comments that Java might be a good HPC language, I'm skeptical. For all their faults, Fortran and C offer access to really whopping amounts of memory _in a deterministic fashion_ which you need when you are modeling genomes and suchlike. I'd love to know how they get Java to access a couple of TB of RAM and not occasionally think that it ought to do a stop-the-world garbage collection. -Xmx=1000000Gb --no-garbage-collection maybe? Would make for a very interesting article.
Scaling in shared memory
Java benchmarks like this are no way to stress shared memory model machines. By its very nature this benchmark is designed to run multiple Java VMs with very low levels of intercommunication. The result that came out (512 Itanium cores vs 32 Niagara cores) is pretty well what you would expect and, contrary to the article, is hardly a shining result. Nobody in their right mind would contemplate using a massive shared-memory system like this for such an application. It's a very expensive way of providing that much processing capacity. Compare the costs with a modern x64 Intel architecture implementation using 24-core blades to provide the same sort of throughput.
A much better test of shared memory machines is as a database server. Unless you have a workload which lends itself to DB partitioning, then that will really exercise shared memory (which is what you buy this sort of box for - not running dozens or hundreds of largely independent Java VMs).
Incidentally, SUN were indeed brilliant at marketing the E10K, but the hype went far beyond the capability of the box. I recall it being sold as having partitioning capabilities comparable to those on contemporary mainframes (which was a joke - E10K partitioning was in no way comparable, but I saw senior IT managers with mainframe backgrounds falling for this stuff). The E10K wasn't a particularly reliable machine, it had a poor I/O system and high latency across the backplane and very rigid partitioning rules. But it was marketed brilliantly, much of it in the lead-up to the Y2K "crisis" when IT shops had money to burn courtesy of panicked boards.
RE: Well, that's not the point...
Depends on your definition of a "machine" - do you mean one hardware system or do you go by the OS image? If it's the former then even the SGI is suss as it looks like a massive amount of what are effectively individual hardware systems ("blades") bolted together with a high-speed interconnect to make one immense "system" which runs a single OS image. So, if you are going to allow the amalgamation of individual hardware systems, then any Beowulf cluster can be considered one "system", which means even Sun could claim they can scale out (with Galaxy) to that number of cores. After all, if it does the same job, why not?
The bit that made me smile is that the 9040 is the old dual-core, 1.6GHz 18MB cache CPU, so you could probably expect at least a 10% performance increase if it was running the 1.66GHz 24MB cache 9150N model as can run in the hp Superdomes. The problem we have found with such large systems is that they really need a massive amount of memory to really get the best performance, to the point where the memory becomes the most costly part of the server at almost twice the cost of the CPUs! I'm guessing memory would be all the more critical for Java (<shudders in horror>).
Well, that's not the point EITHER...
On one hand, this article addresses the need/possibility of SGI pitching to the commercial space, and THEN goes on to give specs in terms of performance per core. Totally off target.
If you want to pitch to the commercial space, then the ONLY metric is performance per dollar, period. And perhaps trade-in allowances for your current kit, but you can't get that from a benchmark obviously.
In the commercial space, most acquisitions are based upon cost metrics first, supportability second. SGI always sold cool kit, and I've used them myself on a few projects, but they have limited experience in supporting their boxes in a commercial environment. For them to be successful in that space, they need to compete on a cost per performance number, AND have off-the-shelf solutions that include pre-configurations for network integration, storage integration, load scheduling, fault reporting, etc. None of this stuff is rocket science, but it's stuff that HP and IBM do very, very well - and I suspect SGI would have to make some investment in their sales and configuration abilities to offer support to the same degree.
SUN (or rather Fujitsu on their behalf) certainly do offer a machine that scales beyond 128 cores. The SUN M9000 in 16 system board mode with the quad-core SPARC64 VII cpus will provide 256 cores (and 512 hardware threads).
But as mentioned on another post, SPECJBB is a lousy benchmark for stressing shared memory systems.
What are you talking about? SUN has a T5440 with 4 Niagaras = 256 cores. Do you mean sockets? You see how confusing this talk about "cores" is? The nr of cores doesn't say anything about the power usage of one cpu, nor the performance of one cpu, etc. And besides, with the acquisition of SUN, Oracle will surely change the pricing not to punish many-core Niagaras. In the future everyone will have 8-16 cores / cpu. Then it will be REALLY confusing to pit those CPUs against a legacy dual core Power6+.
-Wow, the Power6 core is twice as fast as cpu X, this means that the power6 is really fast!
-Nope, the cpu X has 32 cores, each half as fast as one Power6 core. This means the cpu X is 16 times as fast as one Power6. The Power6 is dead slow in comparison.
-Ouch, I didn't see that!
-Nope. IBM likes it that way. IBM likes you to draw wrong conclusions. And one Power6 uses 16 times as much power as one cpu X. That is also relevant to be able to draw correct conclusions.
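The arithmetic in that exchange can be sketched as below. "cpu X" is the hypothetical chip from the dialogue, per-core speeds are relative to one Power6 core, and the only real-world input is that the Power6 is dual core, as stated above. This is a naive model that assumes throughput scales linearly with cores, which real workloads rarely manage.

```python
def socket_throughput(cores: int, per_core_speed: float) -> float:
    """Naive per-socket throughput: cores times relative per-core speed.

    Assumes perfect scaling across cores, which is an idealisation.
    """
    return cores * per_core_speed


POWER6_CORE_SPEED = 1.0  # baseline: one Power6 core
CPU_X_CORE_SPEED = 0.5   # hypothetical: each cpu X core is half as fast

cpu_x_socket = socket_throughput(32, CPU_X_CORE_SPEED)
power6_socket = socket_throughput(2, POWER6_CORE_SPEED)  # dual-core Power6

# Compared against a single Power6 core, as in the dialogue.
vs_one_core = cpu_x_socket / POWER6_CORE_SPEED
# Compared socket against socket.
vs_socket = cpu_x_socket / power6_socket

print(f"cpu X vs one Power6 core: {vs_one_core:.0f}x")
print(f"cpu X vs one Power6 socket: {vs_socket:.0f}x")
```

Note that the "16 times" figure is against a single Power6 core; socket against socket, the same assumptions give a smaller multiple, which is rather the point about stating what is being compared.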
The T5440 has 4 x 8 = 32 cores. Cores shouldn't be mixed up with hardware threads. Each T5440 core has 8 threads, 2 integer processors (4 threads sharing each) and one FP. What the Niagara threads do is make use of otherwise "dead" time whilst a thread is stalled waiting for main memory. In effect a hardware thread is a virtual CPU. There are other contention points besides the IP & FP, but the upshot is that you don't have the equivalent of 256 cores of processing power in a T5440, and it is very workload dependent: if there are lots of cache misses, you get considerable extra throughput; if processor cache misses are rarer, then you don't. The Niagara is great for some things, but if single thread speed matters it's a poor choice.
For that matter, Itaniums now have two threads per core as does the SPARC64 VII (which means an M9K can have as many as 512 threads).
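To keep sockets, cores and threads apart, a small sketch using only the figures given in this thread. The `Machine` class is made up for illustration, and the M9000's 64 sockets is not stated anywhere above; it is derived here from 256 cores on quad-core CPUs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    """Illustrative record of a box's socket/core/thread layout."""
    name: str
    sockets: int
    cores_per_socket: int
    threads_per_core: int

    @property
    def cores(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def threads(self) -> int:
        # Hardware threads, not cores: a stalled thread frees the core for another.
        return self.cores * self.threads_per_core


# T5440: 4 Niagara sockets, 8 cores each, 8 threads per core.
t5440 = Machine("Sun T5440", sockets=4, cores_per_socket=8, threads_per_core=8)
# M9000: socket count derived from 256 cores / 4 cores per CPU (an inference).
m9000 = Machine("Sun M9000", sockets=64, cores_per_socket=4, threads_per_core=2)

for m in (t5440, m9000):
    print(f"{m.name}: {m.sockets} sockets, {m.cores} cores, {m.threads} threads")
```

The 256 that got quoted for the T5440 is its thread count, not its core count, which is exactly where the earlier confusion came from.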
Cost per Operation
It would be nice if the article included the total cost of the systems mentioned. Cost per operation (similar to TPC numbers) would be more useful, in my opinion.
Ah, yes. You are absolutely correct. I had a brain slip. I know that. Thanks for pointing that out.
PS. I put all blame on all this talk about cores and sockets and threads.