To round out our visits with the major hardware vendors, we stopped by Hewlett-Packard’s booth. We had our new pal Steven give us a walk-through on their newest big x86 iron – an eight-socket monster (64 cores total max). The system is composed of two 4-socket chassis, each of which sports up to 64 DIMM slots – meaning that …
But it will still be rubbish
HP have tried this countless times, playing catch-up behind the rest of the world. 64 cores is a zone few people bother to play in, as you can get better results for your buck in the 32-core sector. Think about it: HP's last attempt to be the biggest was the DL785 G6, which fell on its face 15 days later when Dell sent the R815 to be benchmarked, and both of these systems scored a moderate result.
But then the next generation of processors rolls out and they are both sat in a corner smelling slightly of wee and wearing a silly hat. In rolls the Westmere generation from Intel (Xeon 56xx)
and now the 32-core area of the market is king, and who sits on top of that mountain?
None other than the sausage-munching, rather efficient, German-based Fujitsu Technology with the RX600 S5. Why not have a look at this, El Reg, rather than checking out the mediocre end of the market?
Don't believe me? Have a look yourself
It's a good read.
I disregard the Cisco as it's more expensive than the sun and isn't mass-produced.
Re: But it will still be rubbish
Why don't you try looking at the technology and understanding it before criticising, eh?
The biggest challenge with pushing more cores into a NUMA architecture is CPU and memory locks, and more precisely keeping the impact of those locks to a minimum (which is why x86 has traditionally been stuck in the 2-socket and 4-socket space, where that's easier to control). In layman's terms, the more components in a shared architecture, the more lock negotiation has to occur to manage those components, and often that negotiation pulls resources away from the actual task of executing application code.
We used to call it blue droop, after a famous three-letter company's (begins with I, ends with M) "expansion" technology for x86 (recently launched as v5) that allowed you to link multiple 4-socket nodes together. It all used to work lovely until about 3 nodes and above, where you would actually get diminishing returns due to lock management.
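To put a rough shape on that droop, here is a minimal Python sketch. It uses Gunther's Universal Scalability Law as a stand-in model; it is not anything measured on the hardware discussed here. The function names and both coefficients are assumptions, picked purely so the curve tops out around three nodes like the anecdote above.

```python
def usl_speedup(nodes, contention=0.05, coherency=0.10):
    """Relative throughput of `nodes` linked nodes versus a single node.

    contention: fraction of work serialized on shared locks (assumed value)
    coherency:  cost of nodes negotiating with each other (assumed value)
    """
    penalty = 1 + contention * (nodes - 1) + coherency * nodes * (nodes - 1)
    return nodes / penalty


def best_node_count(max_nodes, contention=0.05, coherency=0.10):
    """Node count with the highest modelled throughput, from 1 to max_nodes."""
    return max(
        range(1, max_nodes + 1),
        key=lambda n: usl_speedup(n, contention, coherency),
    )


if __name__ == "__main__":
    # Print the modelled speedup for 1 to 6 linked nodes.
    for n in range(1, 7):
        print(f"{n} node(s): {usl_speedup(n):.2f}x")
```

The pairwise `coherency` term is what bends the curve back down: every added node has to negotiate with every node already there, so past a certain point adding another one costs more than it brings.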
Now this problem of resource management has been well understood and conquered in the UNIX space (think Itanium/PA-RISC, POWER et cetera) with clever crossbars and management chips (OpenVMS actually has the best methodology, but that's a conversation over a pint!).
So seeing as HP has some pretty funky technology in that UNIX space, wouldn't it be nice if they took some of that and applied it to the x86 space ... hmmm? (I'll give you a clue: they have)
So maybe you want to actually look at the technology in the product before criticising!
Yep - you nailed it
I'm the analyst who did the filming and wrote the short blog piece - not a Reg reporter... just wanted to say nice work, Adam 73... you explained the challenge of SMP scaling in a clear and concise way. Meaning I'm going to steal it and start using it as my own right away. The factors that you write about above are also what make it so hard to scale things like Oracle RAC or other applications using distributed systems - there's just too much locking and internode communication flying around.
The question you bring up about the speed/latency of IBM's expansion stuff - their scalability port - is one that I've had as well. I've been assured that it's fast enough these days to ensure flat memory access - but we won't know without testing with benchmarks and real world workloads.
As you point out, the Unix guys control their own hardware and OSes and thus have the ability to optimize for SMP performance. Hmm... OpenVMS was your favorite? For me, it's the original Cray/Sun E10k with the crossbar... true flat SMP that scaled like a howler monkey climbing a tree. As we used to say back then, "NUMA is a bug" that needs to be stomped out. However, tech marches on and there are now network connections like InfiniBand that top that crossbar in terms of bandwidth, but aren't quite there in terms of latency (yet). When they get the latency number down, we might see true scaling using distributed systems - assuming the operating systems are designed to take advantage of it.
hot swap memory?
FL1X - I don't get your point there, pointing El Reg to a quad-socket server when the article is about an eight-socket server. The VMmark results you point to show the Fujitsu is just a tad faster (the difference is a rounding error) than the HP DL580 G7 (quad socket too).
I suspect The Reg went to the HP booth to check out the 8-socket beast because there are so few of them left, which is probably why AMD left that market. HP hasn't had an 8-socket Intel system in a long time (I recall a Reg article, I think, that mentioned the last 8-socket Intel system HP had was from the Compaq days).
My question is around hot swap memory. For such a big system, I imagine you'd want hot swap memory, but from what I can tell the design of the system prevents accessing the memory slots without removing an entire module, which includes CPUs.
HP has had hot swap memory in ProLiants for a long time -
I'm not sure of the state of hot swap cpus, though I suppose these days you can probably safely take CPUs offline to remove the module with some operating systems, though the idea scares me, unlike swapping memory that is being mirrored.
I agree that quad socket is likely the sweet spot for right now, though I'm sure there are big Oracle database loads out there that can benefit from the 8 sockets, and the price of the hardware is a drop in the bucket compared to the price of the software. Contrary to popular belief RAC isn't for everything.
Another great question
Again, I'm the guy who did the original story. And I'm not a reporter hired by El Reg, I guess I'd say that we're partners. I have my own small analyst firm and I sort of consult with the Reg as their tame HPC Industry Analyst.
I went to the HW vendors to see what they were bringing to the table in terms of big iron. Aside from my own weakness for systems that get big and go fast, I believe that large shared memory spaces and lots of cores (plus I/O) are the most efficient way to host lots of virtualized workloads. Having more system resources to parcel out among workloads can yield higher overall efficiency since you can host more apps and spread the virtualization overhead across a larger number of guests (up to a point).
I also think that having larger systems enables admins to build more customized system images to optimize performance for apps that might become memory or CPU starved in smaller boxes under peak load. I think we're going to see a greater awareness of the need for 'lumpy systems' rather than the past 'balanced system' mantra that the vendors used to chant. Lumpy systems (with greater or lesser amounts of memory, CPU, or I/O shares) fit system resource shares to the workload rather than simply throwing it an equal share of horsepower. This will lead to higher overall performance and, with informed placement of workloads on the virtualized boxes, will also lead to higher system utilization. [-end rant-]
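As a rough illustration of the 'lumpy' versus 'balanced' idea, here is a minimal Python sketch. Everything in it is hypothetical: the 512 GB memory figure, the four workloads and their peak demands, and the function names are made up for the example (only the 64-core count comes from the system discussed above). It compares handing every guest an equal slice with sizing each guest's slice to its peak demand.

```python
# Hypothetical host: 64 cores as in the article; the memory size is an assumption.
HOST = {"cores": 64, "mem_gb": 512}

# Hypothetical peak demands per workload (made-up numbers).
WORKLOADS = {
    "database": {"cores": 8, "mem_gb": 300},
    "web": {"cores": 24, "mem_gb": 48},
    "batch": {"cores": 28, "mem_gb": 100},
    "cache": {"cores": 4, "mem_gb": 64},
}


def balanced_shares(host, workloads):
    """Give every guest the same slice of each resource."""
    n = len(workloads)
    return {name: {res: host[res] / n for res in host} for name in workloads}


def lumpy_shares(host, workloads):
    """Give each guest a slice proportional to its peak demand."""
    totals = {res: sum(w[res] for w in workloads.values()) for res in host}
    return {
        name: {res: host[res] * w[res] / totals[res] for res in host}
        for name, w in workloads.items()
    }


def starved(shares, workloads):
    """Names of guests whose share of any resource is below peak demand."""
    return sorted(
        name
        for name, w in workloads.items()
        if any(shares[name][res] < w[res] for res in w)
    )


if __name__ == "__main__":
    print("balanced starves:", starved(balanced_shares(HOST, WORKLOADS), WORKLOADS))
    print("lumpy starves:", starved(lumpy_shares(HOST, WORKLOADS), WORKLOADS))
```

The made-up demands happen to add up to exactly what the host has, so the lumpy split fits everyone. On a real box you would also need a policy for what to do when peak demands exceed the hardware.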
Pretty poor description from the HP buff: it's one system with two CPU cages.