Sun Microsystems is today officially debuting its two new "massively threaded" servers based on the UltraSPARC T2+ processor. The SPARC servers, part of Sun and Fujitsu's partnership, were first revealed in gory detail by El Reg early last month. The major advancement in Sun's new CMT SPARC Enterprise T5140 and T5240 servers is …
128 threads, just not at once.
"....a single machine is able to process 128 instructional threads at the same time...." I thought we agreed in an earlier thread that at most it was two concurrent threads per core, so max sixteen threads at once, the rest all being stalled waiting for the too small cache to cough up data or instructions. And seeing as those threads are weeny threads unsuited to most datacenter applications I can't see them making many "data centers, kittens happy". Sun proved this using Lotus Notes as a case study, a torture device which is likely to make most users want to throw kittens let alone New Zealand hedgehogs at their sysadmins. Even SPARC64 is a better solution for most business apps than this!
/countdown to Valdis seizure in three... two... one...
Poor SMP scalability
I was disappointed that Ricky said it only gets 1.7x the performance, which means that adding the second chip only gives 70% more performance. I was also shocked that when asked the question "Since you round-robin the threads on the core, how many threads are actually running at once?" they insisted it was 128 (vs. 32), then strangely talked about how they run two threads at the same time better than Intel's HyperThreading. I also did not hear how this was going to be a database-class system vs. the web-only T2 system. The chip's cores are the same, so they only support light threads. If anything, the removal of two of the memory controllers would decrease its database performance. Obviously, no one would ever use this system for Oracle since it requires 12 licenses for about half a million dollars US. Certainly having only 6 slots is a major inhibitor for performance and reliability. Also curious about 1.4GHz availability given the heat density of the system. Command-line thread partitioning is unacceptable; only time will tell if Sun's promises of a GUI with dynamic management will come to fruition. Obviously the relationship with Fujitsu continues to be strained as the M-Class systems are deemphasized.
Re: Poor SMP scalability
Firstly, probably most systems that go from 8-16 cores are not seeing linear scalability. Secondly these things are not intended to be big enormously scalable database systems: that's what the Rock boxes will be (among other things). This is just a bigger Niagara box (which is not a useless thing by any means, but is not the same).
re: SMP scalability
Wow, this is obvious Sun competitor FUD, if I ever saw it. Either you're being spoon fed by (IBM?), or you work for them.
"1.7X the performance which means that adding the second chip only gives 70% more performance."
So what? IBM more than doubled their MHz from P5 to P6 and got less than a 1.6x performance increase based on SPECint2006. So what? They're still faster.
"Since you round robin the threads..."
This is nonsense... Well, not exactly nonsense, but it shows a lack of understanding of how processors work (of which, of course, I myself am also guilty). The idea of Niagara is to take advantage of all the time that a thread spends waiting on memory. Memory is inherently slow in relation to processors, so threads are constantly waiting on their next piece of information. In other architectures, such a thread has to go off-CPU to allow another thread onto the CPU, which is extremely expensive (taking many cycles). With Niagara and the T2, the CPU just switches to another thread which is already ready to run (taking 0 cycles).
To point at this as a weakness is just acknowledging that other platforms have a huge issue.
"Obviously, no one would ever use this system for Oracle since it requires 12 licenses..."
Another piece of nonsense. For one, Oracle is not the only game. Also, many companies have site licenses (which Oracle seems to prefer), so this is not an issue.
"Command line thread partitioning is unacceptable..."
Again, who cares??? You can control your use of threads via several means, including via a hypervisor (LDOMS), psrset, or even Containers. Seeing as Sun's competitors don't have domains on this class/price of system, and that IBM's "containers" are so new and not even supported by Oracle yet, I don't see the issue here. Of course, as you said, you manage the threads via command line, which is not that great... oh yeah, unless of course you own any other platform and have no threads to manage in the first place!
"Obviously the relationship with Fujitsu continues to be strained as the M-Class systems are deemphasized."
Huh? What the... Where'd this come from? Fujitsu is selling these systems under their own label. How does that... what?
Cores, threads, instruction pipelines
Matt, I presume you're talking about:
2 instruction pipelines + 1 floating point unit + 1 stream processing unit (cryptographic)
8KB data cache + 16KB instruction cache
The Intel Dunnington has a 32KB instruction cache, so I don't think the T2's smaller cache is really going to be a problem, since it's a different architecture and instruction set. Dunnington gives two threads per core, but I didn't find information on how many instruction pipelines it has, either.
It will be interesting to see what comes of all of this in the market, as that is the real evaluator of technology.
RE: 128 threads, just not at once.
Matt Bryant's comment is a completely senseless piece of drivel. The new Sun systems perform extremely well on the Domino/Notes benchmark, delivering both the highest number of concurrent users and the lowest response time. I sure as hell would have picked the Sun Niagara systems over IBM's p5s. Here are more of the Lotus Domino performance numbers:
And some more numbers handily beating IBM p5 once again on SAP benchmark:
Niagara 2 boxes SPANK HP
I'm running Sun Niagara kit for a very large scale web farm - we're now throwing out the HP DL x86 systems entirely. Too slow, too hot, too expensive. Good on Sun for nailing the architecture... don't know what they'd be like for modeling bombs, but that's not what I do all day long. And IBM seems wide of the mark with P6.
Yes. 128 Threads All At Once.
This would best be explained in a half hour with a whiteboard to draw pictures on. But here goes anyway: I suspect the confusion has to do with not everyone understanding the difference between coarse-grained and fine-grained multithreading. Coarse-grained (like Intel HyperThreading, I think) is basically a hardware context-switching mechanism. The threads are switched at the start and end of the instruction pipeline, so the entire pipeline is running thread A or thread B but never both. The T1, T2 and T2+ chips use a fine-grained approach where each stage of the pipeline can be running a different thread. So the instruction decoder can be working on a thread A instruction, while an address is being generated for thread B, and an XOR operation is being done for thread C, etc., all during the same clock cycle. So, when things line up nicely, you do get 128 threads all executing at the same time on the new 2-way T2+ boxes.
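No whiteboard handy, so here's the round-robin pipeline idea as a few lines of Python instead. This is a toy model of my own, not Sun's actual pipeline; the stage count and thread count are made up for illustration:

```python
# Toy model of fine-grained multithreading: each pipeline stage can hold
# an instruction from a different thread in the same clock cycle, whereas
# coarse-grained switching runs one thread through the whole pipeline.

STAGES = 6                        # hypothetical pipeline depth
THREADS = ["A", "B", "C", "D"]    # hypothetical hardware threads per core

def fine_grained_cycle(cycle):
    """Return which thread occupies each pipeline stage in a given cycle.

    With round-robin issue, stage s holds the instruction issued
    (cycle - s) cycles ago, so consecutive stages hold different threads.
    """
    return [THREADS[(cycle - s) % len(THREADS)] for s in range(STAGES)]

# In any single cycle, several threads are genuinely in flight at once:
occupancy = fine_grained_cycle(cycle=10)
print(occupancy)            # ['C', 'B', 'A', 'D', 'C', 'B']
print(len(set(occupancy)))  # 4 distinct threads active in one cycle
```

With a coarse-grained scheme, that occupancy list would contain a single thread name until the hardware switched contexts; here every cycle has all four threads somewhere in the pipe.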
Re: Poor SMP scalability
"Obviously, no one would ever use this system for Oracle since it requires 12 licenses for about a half million dollars US"
Um, last time I looked Oracle charges 1/4 of a CPU per core, so a 2-way box with 8 cores per CPU would require 4 CPU licenses.
We use one (a single-CPU, 8-core 5220) for Oracle that is mostly OLTP-based, and it outperforms our old UltraSPARC III SF480 hands down. This server, for less than half the price, is a great box. Try one, but don't expect it to beat a POWER6; that's not what it was intended to do.
Horizontal JBoss/BEA/Tomcat scaling on a single box is amazing!!! We hit a whopping 5% CPU utilization on 250 concurrent transactions per second while our database server (an old E450) was pinned. With a bigger DB server we could easily do >1000 per second. Not bad for <$30K.
Big bang for little buck!!!
Um last time i looked Oracle charges 1/4 CPU per core
Yes, that was the last time you looked. The T1 was a handicapped chip for which Oracle gave a price break. With the 1.4GHz T1 and all T2s the multiplier is 0.75, so the per-chip Oracle license is 8 x 0.75 = 6 licenses x $40K = $240K, or in the case of this new box 12 x $40K = $480K. So be careful at the ELA renewal if you are loading up these weak-core chips.
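For anyone who wants to check the arithmetic, here it is as a quick Python snippet. The 0.75 core factor and the $40K list price are the figures quoted in this thread, not official numbers of mine:

```python
# Sanity-check of the Oracle per-core licensing figures quoted above.
PRICE_PER_LICENCE = 40_000   # USD list price per processor licence (poster's figure)

def oracle_licences(sockets, cores_per_socket, core_factor):
    """Licences required = total cores x Oracle's core factor."""
    return sockets * cores_per_socket * core_factor

t2_single = oracle_licences(1, 8, 0.75)    # single-socket T2 box
t2plus_dual = oracle_licences(2, 8, 0.75)  # 2-socket T5240

print(t2_single, t2_single * PRICE_PER_LICENCE)      # 6.0 240000.0
print(t2plus_dual, t2plus_dual * PRICE_PER_LICENCE)  # 12.0 480000.0
```

At the old T1 factor of 0.25 the same 16 cores would have needed only 4 licences, which is exactly the gap between this post and the one above it.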
Multithreading and caching
Matt Bryant shows a fundamental lack of understanding of multithreading and its relation to caching. Caching exists to mitigate latency. Period. Nothing more.
Vertical multithreading is a method for mitigating latency. Period.
The fact is, a multithreaded architecture like Niagara needs less cache, not more, precisely because it has more threads active. A thread which is not ready to process (i.e., data is not in cache), waits for execution, but three other threads continue to execute. If the data was readily available, there is no need to multithread.
Next, the relative performance per thread is certainly a valid concern, but realize a Niagara thread is probably equal to about a 750 MHz UltraSPARC III or a 1GHz Pentium III Xeon. Not fast by today's standards, but realize that if, for example, an ERP database server has 128 active threads, the Niagara server is actively working on all of them simultaneously, while the Xeon, POWER, or Itanium chip is time-slicing those software threads among far fewer processor threads. The result is additional context switches, cache flushes, etc.
Now, regarding scalability. To suggest that going from one to two processors gains only a 1.7x increase in software performance also shows a fundamental lack of understanding of basic computer science. This is not going from one to two; it is going from 64 threads to 128, which is like going from a 64-processor system to a 128-processor system. It is hard to scale software from 64 to 128 threads. I'm sure that in a partitioned environment the scalability will be closer to linear.
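A quick Amdahl's-law sketch shows just how demanding a 1.7x gain from 64 to 128 threads actually is. The parallel fraction here is an illustrative assumption of mine, not a measured property of any workload:

```python
# Amdahl's law: speedup over one thread for a workload with a given
# parallelizable fraction running on n hardware threads.

def speedup(parallel_fraction, n_threads):
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_threads)

# To see 1.7x from doubling 64 threads to 128, the workload must be
# roughly 99.7% parallelizable -- almost perfectly scalable software.
p = 0.997
gain = speedup(p, 128) / speedup(p, 64)
print(round(gain, 2))  # 1.72
```

In other words, 1.7x for the second socket is close to the theoretical ceiling for real software, not evidence of poor SMP scaling.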
Finally, the comments about "six slots" are also irrelevant. Compared to the 32 x 33 MHz, 64-bit PCI slots on a 1999-era E10K, the SE5240's six 8x PCI-E slots provide 80% more I/O bandwidth.
Personally, I would love to see Intel take its great new Atom processor core and create a massively multicore x86 architecture. 32 Atom cores with 64 threads would be so much more interesting than four Itanium cores with eight threads.
i do love...
...how announcements of Sun's new kit always stirs such FUD-laden debates! personally, i use T1 based kit and it goes like the clappers. i should be getting some T2 machines in the next few months and i really can't wait. i don't think the budget will stretch to some of these, though, unfortunately.
"These machines are obviously ..."
"... rubbish because if you apply a load which Sun says is unsuitable, it performs badly."
The Niagara processor architecture takes a modestly clocked, simplified processor core and then gets as many threads as possible to run on it. Sun will clearly state "if you want single threaded performance, Do Not Use these systems, go for either Sparc64 or Opteron/Xeon." OTOH, if you have lots of 1U or blade systems handling load balanced traffic, it is likely that Niagara will give a huge saving in space/power in your datacentre.
If we were really all nerdy techies, this wouldn't need spelling out every time Sun launches a CMT product.
Finally ... I'm sure I'll regret this .. but I'm going to directly address a comment to Mr Bryant. Matt, it is good to be an enthusiast for a vendor's technology - in your case HP, in my case Sun. It is good to openly acknowledge the strong and weak points of both your favourite and competitors. Hell, Sun went through a stage three years ago when the product line was ageing and weak, but with the M series, Niagara and Opteron/Xeon ranges it is now solid. Similarly, we could have a measured exposition of HP's transition from PA-RISC to Itanium - from which HP finally seem to be emerging with sound products. However, the tone of the discussion matters a lot. A good rule of thumb is: "If I appear to be simply slagging off the competition, my remarks will not carry weight with readers."
Mine's the flak jacket.
So many cowards! And Joel.
"....The fact is, a multithreaded architecture like Niagara needs less cache, not more, precisely because it has more threads active..." I call male bovine manure on that one! Precisely because you have so many mini-threads running, you will have to serve all of them through the same cache chain. Do you expect the data to just get to the cores by magic? And seeing as Niagaras have so little cache (L1, L2 and L3 - they're all part of the chain to reduce the latency of going to main memory or disk, so don't just mention instruction cache), you will have to keep continually flushing cache to keep all the threads even ticking over, let alone humming, which means you actually have lots of requests going out to cache, memory or disk. That is why you actually need more cache for all those stalled threads, otherwise you will have to do more requests out to memory or disk. Niagara doesn't have the cache. Sun's benchmarking fiddles involve massive amounts of system memory to try and hide this.
"....So, when things line up nicely, you do get 128 threads all executing at the same time on the new 2 way T2+ boxes....." So things line up nicely when? Remember, we're talking real world apps not Sunshine benchmarking fiddles here. Intel has spent a lot of time on their predictive technologies to keep cache hit rates very high, Sun has not. Intel (and IBM) have designed chips with large cache sizes to maximise this advantage, Sun has not. So an Itanium or Xeon (or Power) keeps its threads spinning a lot better than any Niagara design will ever do, which is why Sun has to work with lots of stalled threads. Which is why Sun carefully craft any benchmark to use tiny, linear workloads (like webserving) for which Niagara designs are actually very good. But even old Xeon will trump it easily on anything more complex, especially database apps like Oracle.
"....If the data was readily available, there is no need to multithread...." Yes, in nice and easily predictable workloads (like Sun benchmarks, I guess). But in the real world, this is the reason we have things like out of order execution, branch prediction, etc, etc. Because data is not readily available, it has to come from cache (low latency), memory (medium latency), local disk (big latency) or even worse, another system or SAN (BIG latency). And because real world work flows don't always come in nice and easily predictable streams, you get stalled threads. Sun's response is not to try and design a way to keep the threads spinning as much as possible, but to simply accept the poorest solution. It's like being told to buy a dozen unreliable scooters in the hope one will be able to carry you as far as a reliable car.
"....This would best be explained in a half hour with a whiteboard to draw pictures on...." Draw this on your whiteboard. Draw a hose going into a can with pinholes up the side. Make the hose thin. Then draw water going into the can. With a thin hose, the water dribbles out of the can fast enough so that the water level never reaches above the middle holes. This is Niagara, with its poor chain of small cache and low memory bandwidth. To spray water out of all the holes would need a bigger hose. Sun's solution is not a bigger hose but to switch between the holes and pretend it is flowing out of all of them at once. To try and make the flow better for benchmarking, Sun puts a large tank on the hose (system memory) in the hope of keeping the flow steady, but it still has to go through the tiny hose to the can, and they still have to switch between holes. If you were trying to put out a fire with this you'd get burnt.
Now draw the can as a tube with a large hole at the bottom. Make the hose as wide as the can mouth. Put the large tank in if you like. This is Itanium (or Xeon, or Power, or Opteron) which has a massive cache chain, high memory bandwidth, and better predictive technologies. More water comes down the hose and goes straight through the can. It's like a fire hose nozzle and it is a far better solution for ninety-nine-out-of-a-hundred fires. There, I'm betting that with even your slow drawing ability that didn't take half-an-hour.
Where's teh kitteh?
Being a B3tard, I clicked on the article right away when I saw 'kittens happy'.
Imagine (well, you don't have to - it's right here for you) my disappointment when I could not find even one vague link or reference to kittens.
Is a picture of a kitten asleep on a nice warm 2U rack too much to ask? Does the new server have kitten-proof filters on the fans?
Threading and cache misses
Lets look at this based on a single socket system, so we have:
1 Socket, 8 Execution cores, 64 Threads
Now, the whole point of the Niagara design is to deal with what happens when you miss the cache. You only have 8 cores, so only 8 things will be actually processing at any time. But when the executing item misses the cache, instead of just staying on the CPU core (showing CPU busy), something else can go on the actual execution core.
So, Niagara is ideal for workloads where you are inherently likely to miss cache a lot of the time - for example, a webserver which has 100s of users connected to it. A T2 based server reports it has 64 CPUs, but there are only 8 places where instructions can actually operate (the bit which burns power), even though there are 64 sets of registers etc.
So, to spend die space on increasing the likelihood of hitting cache would run counter to the basic design concept - which is to keep the execution cores busy all the time despite the mismatch between CPU speed and memory latency. You have to invert your viewpoint, and instead of looking at the processor from the thread's point of view, you have to look from the execution core outwards. In a fast single threaded processor you keep spending clock cycles waiting for stuff to come in from main memory: in a Niagara processor, you can go and serve another customer instead of waiting - so it takes longer to serve a particular customer, but you get more customers served overall. If your objective is to maximize the number of customers served at an acceptable speed, Niagara may well be the answer; if your objective is to serve each customer as fast as possible then it definitely isn't.
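The customers-served argument can be put as a toy throughput model. This is my own simplification with invented cycle counts, not measured T2 figures:

```python
# Toy throughput model of latency hiding: each request costs some cycles
# of real work on the core plus some cycles stalled waiting on memory.

COMPUTE = 10   # cycles of actual execution per request (assumed)
STALL = 70     # cycles stalled on memory per request (assumed)

def requests_per_kcycle(hw_threads):
    """Requests one core completes per 1000 cycles with N hardware threads.

    The core hides stall time by running other threads, but only up to
    the point where the execution unit is fully saturated.
    """
    ideal_overlap = (COMPUTE + STALL) / COMPUTE   # threads needed to hide all stalls
    busy_fraction = min(1.0, hw_threads / ideal_overlap)
    return 1000 * busy_fraction / (COMPUTE + STALL)

print(requests_per_kcycle(1))   # 1.5625 -- single-threaded core, mostly idle
print(requests_per_kcycle(8))   # 12.5   -- Niagara-style 8 threads/core
```

Each individual request still takes the full 80 cycles (longer in wall-clock terms, since it shares the core), but the core serves 8x as many of them, which is exactly the "more customers served overall" trade-off described above.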
As for running databases, it is fair to judge the suitability against the Oracle licensing costs. The original T2000s were charged at 0.25 processors per core, and performance was "mixed". The T2-based systems are charged at 0.75 processors per core, and work much better.
Maybe you should post your credentials so we have a better idea where you might be coming from with all your..."insight".
Let me guess. Marketing flunky at a Sun competitor?
Da kitteh is cute
100+ Sparc64 processors just last week replaced by 16 T1, listen to the purrs.
Got some T2s on order. Purrs more.
Hoping the T2+ will (if we ever think we need them) tickle the kitteh behind both ears at once.
We have 12 T2000s and 2 5220s in house. They are fantastic as production web and app servers. (In non-production, good at everything else.)
On a philosophical note, I subscribe to the anything-but-HP philosophy. IBM has a screamer with the P6 (costs are high and in-order execution hurts it), and I love the x86 architecture because of the direct competition in chips.
RE: gizmo and Dunnie
Lol! Actually I don't work for a vendor, and I find the accusation of "working" in marketing just plain insulting. If you don't like what I post about Sun then you can blame their marketing droids for all the Sunshine I have to carefully disprove to senior management almost monthly to stop them making really stupid decisions. Having suffered Sun's marketeering junk, feature sales and downright lies for years, I feel quite happy to share a little cynicism with the world in general. Now, seeing as I've disclosed that I don't work for a vendor, do you care to disclose which kindergarten you are at?
And Dunnie, even when HP had PA-RISC it was a far better product than SPARC, especially when you consider that customers that did go for SPARC also usually went for Veritas clustering due to Slowaris's awful product, whilst ServiceGuard was always a sales leader for hp-ux. HP built a carefully integrated solution of management tools, servers and storage products - Sun didn't (in fact, the most common management/monitoring product for Slowaris environments is still HP OpenView products).
Sun has been on the decline since Y2K, and they now have a point solution in Niagara that is hard put to defend itself against x86 (in fact so hard pressed that Sun have had to resort to selling x86 and cuddling up to Microsoft!), and no enterprise product in view (Rock is still just vapourware), leaving them stuck with having to badge a competitor's product (FSC's SPARC64) they once labelled "a poor man's clone". Those of us that remember the arrogance of "Solaris on SPARC for everything" find this all highly amusing! HP have carefully developed the Integrity range into the market leader; if Sun hadn't been so stupid and dropped their Itanium development, they could have had Slowaris-on-Itanium servers out four or five years ago and actually have a competitive product. You may call that "slagging off"; I just call it a fair appraisal of the facts. You may either offer rebuttal or post more Sun marketeering - your choice.
To the person who thinks he knows it all
You think that this machine is designed for DB workloads. It's not.
And just FYI NOTHING, absolutely NOTHING beats an IBM mainframe in OLTP. Why? Mainframes are there for huge IO, that is their point.
The database results on that thing are just a coincidence. The thing is IDEAL for concurrent content serving and request processing with stable latency. I mean, I have witnessed connections timing out between 2 servers under heavy load.
Integrity is a market leader? Who are you kidding?
> HP have carefully developed the Integrity range into the market leader
Ha ha ha, that has to be one of the funniest and stupidest things I've seen from you, Matt! You must be sucking on your crack pipe each time before you write up a post. HP-UX is a stagnant platform that has absolutely nothing going for it, with perhaps a small bunch of Tru64 customers that are being forcefully migrated to it. About 5 to 8 years ago I might have conceded that HP-UX was a winning platform compared to Solaris 8/9, but by now HP has completely lost the plot and I would not consider HP-UX for a second as a platform for new deployments. Innovation on HP-UX has completely stalled in its latest incarnation, 11i v3, which I would say is now at least 5 years behind Solaris 10 and AIX 6.1. A testament to that is the latest release, HP-UX 11i v3 Update 3, which is nothing more than a repackaging exercise with almost zero new features. HP-UX, and Integrity by extension, are heading to the dustbin really fast unless HP wakes up really soon. Case in point: I'm working with a large telco and they're currently in the process of retiring *all* HP-UX gear (both PA-RISC and Integrity), to be replaced by Solaris on SPARC and x86 as the new midrange platform standard. And it is not the only organization ditching HP for Sun in the Unix space. It is a slippery slope for HP, and the way things are going, HP-UX will be relegated to a distant third behind Sun and IBM in the Unix space.
Don't waste your time responding to Matt Bryant's comment. His goal on this board is to diss every exciting thing that comes out of either IBM or Sun, then talk highly of everything that comes out of HP.
Even a blind person can see he is an HP cheerleader. Check IBM and Sun posts and all you get from him is negative comments; check out HP posts and he's praising this or that.
So his comments on IBM & Sun posts are pretty much biased and misinformed.
Re: 128 threads, just not at once
Actually, Sun did not prove the T2+'s performance with only a Lotus Notes benchmark; they in fact published leading results for SPECrate (int and fp), SAP, SPECjAppServer, SPECjbb and SPECompM2001 as well as the Notes results.
FYI, SAP is a Data Center application, in most big data centers you will also find large numbers of email servers and even larger numbers of web/app servers. From the numbers posted by Sun it would appear that the T2 is ideally suited as a platform to support the workloads which are currently occupying at least 70% of the servers in most data centers.
The only drawback of the old T1 architecture, FP performance, seems to have been fixed, and the addition of a more comprehensive crypto unit (it would be nice if it supported bigger block sizes for MD5, sigh) makes it very suitable as a platform to replace more specialist systems as well.
Relating to a point you made about memory throughput in an earlier anti-T2 posting, Sun also published a STREAM result for the T5240 which comes in at 30GB/s. Interesting, because the smallest Itanium-based server that can better this number is the rx8640 with 4 cells and 32 cores; a 2-cell rx8640 only comes in at 2/3 of the throughput of a T5240. I don't have to remind you that an rx8640 is a 17RU server, or, put another way, you could put 8 T5240s in the same space as a single rx8640 and still have 1RU left.
CPU clock rates have risen rapidly, but memory latency (not bandwidth) has not improved at the same rate, requiring vendors to configure larger and larger L1/L2/L3 caches to try to reduce the rate of cache misses, and so reduce the number of times in a given period that the whole core stalls.
The T2 does not have a large cache because it does not need a large cache: if a thread stalls, this is only 1/128th of the available processing resource stalled doing nothing. On a 2-socket Itanium server a stall idles a minimum of 1/4 of the total available CPU resources, which is far worse, hence the need for huge caches.
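To put numbers on that, using only the thread and core counts cited above:

```python
# Fraction of a box's execution contexts idled by a single stalled thread.

def stalled_fraction(contexts):
    return 1 / contexts

print(stalled_fraction(128))  # 0.0078125 -- one of 128 T2+ hardware threads
print(stalled_fraction(4))    # 0.25      -- one of 4 cores on a 2-socket dual-core Itanium
```

A stall costs the Itanium box 32 times as much of its capacity, which is why it has to spend so much die area trying to avoid stalls in the first place.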
What you do need is a lot of memory throughput, which the T2 has; more per core, in fact, than the rx8640. The STREAM results for the 8640 come in at about 1.28GB/s per core with 16 cores, as opposed to 1.93GB/s per core for the 16-core T2.
Digital observed back at the end of the 1990s that a typical Oracle application caused the system to spend 70% of its total available CPU cycles stalled waiting on main memory. Despite increases in L1/L2/L3 cache sizes, the relative CPU/memory latency has got worse, so this ratio is unlikely to have changed for the better.
The T2 is a very elegant sidestep of this issue. I would be tempted to suggest that you would be better off not knocking what you don't understand.