So many cowards! And Joel.
"....The fact is, a multithreaded architecture like Niagara needs less cache, not more, precisely because it has more threads active..." I call male bovine manure on that one! Precisely because you have so many mini-threads running, you will have to serve all of them through the same cache chain. Do you expect the data to just get to the cores by magic? And seeing as Niagaras have so little cache (L1, L2 and L3 - they're all part of the chain to reduce the latency of going to main memory or disk, so don't just mention instruction cache), you will have to keep continually flushing cache to keep all the threads even ticking over, let alone humming, which means you actually have lots of requests going out to cache, memory or disk. That is why you actually need more cache for all those stalled threads, otherwise you will have to do more requests out to memory or disk. Niagara doesn't have the cache. Sun's benchmarking fiddles involve massive amounts of system memory to try and hide this.
"....So, when things line up nicely, you do get 128 threads all executing at the same time on the new 2 way T2+ boxes....." So things line up nicely when? Remember, we're talking real world apps not Sunshine benchmarking fiddles here. Intel has spent a lot of time on their predictive technologies to keep cache hit rates very high, Sun has not. Intel (and IBM) have designed chips with large cache sizes to maximise this advantage, Sun has not. So an Itanium or Xeon (or Power) keeps its threads spinning a lot better than any Niagara design will ever do, which is why Sun has to work with lots of stalled threads. Which is why Sun carefully craft any benchmark to use tiny, linear workloads (like webserving) for which Niagara designs are actually very good. But even old Xeon will trump it easily on anything more complex, especially database apps like Oracle.
"....If the data was readily available, there is no need to multithread...." Yes, in nice and easily predictable workloads (like Sun benchmarks, I guess). But in the real world, this is the reason we have things like out of order execution, branch prediction, etc, etc. Because data is not readily available, it has to come from cache (low latency), memory (medium latency), local disk (big latency) or even worse, another system or SAN (BIG latency). And because real world work flows don't always come in nice and easily predictable streams, you get stalled threads. Sun's response is not to try and design a way to keep the threads spinning as much as possible, but to simply accept the poorest solution. It's like being told to buy a dozen unreliable scooters in the hope one will be able to carry you as far as a reliable car.
"....This would best be explained in a half hour with a whiteboard to draw pictures on...." Draw this on your whiteboard. Draw a hose going into a can with pinholes up the side. Make the hose thin. Then draw water going into the can. With a thin hose, the water dribbles out of the can fast enough so that the water level never reaches above the middle holes. This is Niagara, with it's poor chain of small cache and low memory bandwidth. To spray water out off all the holes would need a bigger hose. Suns's solution is not a bigger hose but to switch between the holes and pretend it is flowing out of all of them at once. To try and make the flow better for benchmarking, Sun puts a large tank on the hose (system memory) in the hope of keeping the flow steady, but it still has to go through the tiny hose to the can, and they still have to switch between holes. If you were trying to put out a fire with this you'd get burnt.
Now draw the can as a tube with a large hole at the bottom. Make the hose as wide as the can mouth. Put the large tank in if you like. This is Itanium (or Xeon, or Power, or Opteron) which has a massive cache chain, high memory bandwidth, and better predictive technologies. More water comes down the hose and goes straight through the can. It's like a fire hose nozzle and it is a far better solution for ninety-nine-out-of-a-hundred fires. There, I'm betting that with even your slow drawing ability that didn't take half-an-hour.
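And if you'd rather have the can and hose as arithmetic, here is a toy Python model. It is a sketch under my own assumptions, not a simulation of any real chip: every thread does a fixed burst of work and then stalls for a fixed time, thread switching is free, and the "hose" is a single cap on how much of peak throughput the memory system can feed. All names and numbers are made up for illustration.

```python
def core_throughput(threads, work_cycles, stall_cycles, bandwidth_cap):
    """Fraction of peak throughput one core achieves in this toy model.

    threads:       hardware threads the core switches between (the holes)
    work_cycles:   cycles a thread computes before it needs data
    stall_cycles:  cycles it then waits for that data
    bandwidth_cap: fraction of peak the memory system can feed (the hose), 0.0 to 1.0
    """
    # Switching threads hides stalls until the core is fully busy...
    latency_hidden = min(1.0, threads * work_cycles / (work_cycles + stall_cycles))
    # ...but nothing flows out faster than the hose delivers it.
    return min(latency_hidden, bandwidth_cap)


# Thin hose: going from 8 to 16 threads changes nothing, the cap wins.
thin_8 = core_throughput(threads=8, work_cycles=10, stall_cycles=90, bandwidth_cap=0.5)
thin_16 = core_throughput(threads=16, work_cycles=10, stall_cycles=90, bandwidth_cap=0.5)

# Wide hose, short stalls (big cache, good prediction): one thread already does as well.
wide_1 = core_throughput(threads=1, work_cycles=10, stall_cycles=10, bandwidth_cap=1.0)

print(thin_8, thin_16, wide_1)
```

In this model, extra threads only help until the hose is full. After that you are just switching between holes, and the only ways to get more water out are a wider hose or shorter stalls.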