A team of researchers at Carnegie Mellon University have been studying how they can make cheap, low-powered, and relatively unimpressive server nodes gang up and do more work than the two-socket x64 server that is the workhorse of the IT industry. They have come up with an approach called FAWN, which is short for Fast Array of …
Nothing new, move along
Not surprising. I have been using in-order (Via up till recently) cpus for servers for 8 years now. They can run circles around a Xeon as long as you do not force them to do heavy computation.
By the way, Intel and AMD are actually better now than they used to be. The difference was truly staggering 8 years back.
In 2002 a Via C3 800 delivered with ease >= 60MB/s ext2 filesystem performance on a RAID 1 software set while consuming 7W (motherboard without disks). An OEM edition of an Intel Xeon 2x2 (2 cpus, 2 threads each) barely crawled at 20MB/s while consuming 200W+ (again motherboard only). That is more than 60 times difference not in Intel's favour. So looking at the graph there has been a considerable improvement in Intel's finest. However, it is still not good enough to beat an in-order CPU doing a simple IO-bound task.
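As a rough check on the "more than 60 times" figure, here is a minimal sketch that uses only the numbers quoted above (60MB/s at 7W and 20MB/s at 200W). The function and variable names are illustrative, not from any benchmark tool. Because the Via throughput is given as a lower bound and the Xeon power as "200W+", the result is a floor rather than an exact ratio.

```python
# Throughput per watt, using only the figures quoted in the comment above.
# Names here are illustrative; nothing is taken from a real benchmark suite.

def mb_per_sec_per_watt(throughput_mb_s: float, power_w: float) -> float:
    """Return filesystem throughput delivered per watt of board power."""
    return throughput_mb_s / power_w

via_c3 = mb_per_sec_per_watt(60.0, 7.0)    # Via C3 800: >= 60 MB/s at 7 W
xeon = mb_per_sec_per_watt(20.0, 200.0)    # Xeon 2x2: about 20 MB/s at 200 W+

ratio = via_c3 / xeon

print(f"Via C3: {via_c3:.2f} MB/s per W")
print(f"Xeon:   {xeon:.2f} MB/s per W")
print(f"Ratio:  {ratio:.1f}x in favour of the Via")
```

On these numbers the gap works out to roughly 86 times, so "more than 60 times" is, if anything, conservative.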
Not That Unique
"One of the key factors behind the wimpy nodes doing so well is that on such a node, processing, main memory, and flash memory speeds are more in synch than they are on a modern x64 or RISC server."
Not all RISC servers - sounds just like the design logic of a SPARC T series cpu.
Aren't they the ones that use proper chips?
I know fawns don't have arms
but if they did...
Depends what you mean by "big query set"
For the first time recently, I saw a properly specified SQL Server box. Rather than just screw about with processors, and the like, it was also simply FULL of ramalangadingdong.
As a result, the entire database was cached, warts and all, so queries require no disk access at all, leaving the machine free to use all disk access for the non-buffered, non-cached writes.
It absolutely flew when necessary, while still bumping along at almost no cpu when doing little.
Sure, there's big queries, and big queries, but there's no excuse so far as I can see for hiring someone like me to spend weeks analysing a system when you can just put a quarter of a terabyte of RAM in. I was impressed at what a processor can do when there's no disk waits at all and no cached cycling.
On another note, what I'd pay for MS to add cached system disk to its operating system, such that Program Files, users and Windows were all rammable.
This way I'd just get a machine with 32Gig of RAM, and partition 28Gig of it to store all the stuff that makes it run. MS could even make booting like lightning by pulling all the boot files out in one go, by positioning files sequentially as far as the disk heads were concerned. After boot, the disk could idle along pulling in all the stuff you could possibly need.
Just think how infuriating Office 2007's stupid f*cking ide would be then, when you know it was only their crap stupidity that was causing your productivity problems. Let's make a new IDE that takes 60 days to learn, so that the guys who use it once a month for two hours, will be still screwing about 5 years from now.
not much use for bursty workloads
Here's the thing. In the real world you don't get customer facing systems that run a flat workload. The amount of work they are required to do varies over time. So f'r instance, all those online shops might have very few transactions to process while Corrie is on, but as soon as the ads start people make a dive for their 'pooters and start buying stuff (in response to the advertisements, telling them to do so?).
So, if you've implemented your online sales system with a FAWN array - and it's working flat out, where does the extra capacity to cope with your peak loads come from? Answer: the punters have to wait. And as we know, waiting punters tend to go elsewhere.
So while it makes for a nice green headline: that these low powered processors are working flat out and delivering lots of work per unit energy, it means they have no flexibility when a surge happens. A bit like saying that the M25 averages 20,000 vehicles per hour, so we only need 2 lanes. Without designing it to cope with rush (hah!) hours or bank holiday traffic.
Now, if you REALLY want to improve the useful-work-per-Joule rating of a computer, why not take a look at that major source of bloat and inefficiency: the operating system?
Let this be a lesson to all you who mock the anorak clad
Never underestimate the cerebral power of a Fast Array of Wimpy Nerds
I thought all the Wimpy joints had gone by now
But no, they're still going strong.
...relatively stupid chips used in the wimpy nodes can deliver 1 billion instructions per joule while running at a much lower frequency... and they can also ask 'do you want fries with that?'
Yes, the one with chip fat on the sleeve, thanks!
Now build it with ARM chips
What is the number of instructions per joule that you get? -- corrected for you needing fewer ARM instructions than x86 ones to get some work done.
Inconsistencies in report
I picked up these two, apparently contradictory quotes
"And a node can do 1,300 256-byte queries per second, according to the paper and process 364 queries per joule of energy. This, say the techies, is two orders of magnitude better bang per joule than a regular server can deliver."
and later on :-
"This Intel server node was able to process 4,771 random 256 byte reads, providing an efficiency rating of 52 queries per joule. The 21-node FAWN cluster idled at 83 watts, and peaked out at 99 watts during gets and 91 watts during gets. This is 36,000 against a 20 GB dataset, which is what gives you the 364 queries per joule (including the power drawn from the switch linking the nodes)."
A "couple of orders of magnitude" is a factor of 100 whilst the comparison with an optimised conventional server gave a factor of exactly 7:1. So unless this two orders of magnitude was against a non-optimised conventional server with physical disks (hardly a sensible comparison when what you are looking at is an optimised architecture for the future), then that two orders of magnitude quote is highly misleading.
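The 7:1 figure can be checked directly from the quoted numbers. This is a minimal sketch, assuming the 36,000 figure is queries per second and the 99 watts is the cluster's peak draw, as the second quote suggests. Variable names are illustrative.

```python
import math

# Figures taken from the two quotes above.
fawn_queries_per_sec = 36_000   # 21-node FAWN cluster, 20 GB dataset (assumed per second)
fawn_peak_watts = 99            # peak draw, including the switch
intel_queries_per_joule = 52    # the Intel server node in the second quote

# One watt is one joule per second, so queries/sec divided by watts
# gives queries per joule.
fawn_queries_per_joule = fawn_queries_per_sec / fawn_peak_watts

ratio = fawn_queries_per_joule / intel_queries_per_joule
orders_of_magnitude = math.log10(ratio)

print(f"FAWN:  {fawn_queries_per_joule:.0f} queries per joule")
print(f"Ratio: {ratio:.2f} to 1")
print(f"That is {orders_of_magnitude:.2f} orders of magnitude")
```

On these figures the gap against the Intel node is a shade under one order of magnitude, not two.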
Given that there is a lot more to come on conventional server designs with many more cores, I wonder if this is truly a good indicator.
I wonder how this would have looked if it had been compared with a Sun Niagara server equipped with solid state disk. That's pretty well designed for this sort of workload, and I would expect it to be a factor of at least 2x more power efficient than the Intel server per unit of throughput on highly threaded applications such as this. The Niagara also forgoes fancy speculative processing and other such features in the name of keeping the cores very simple and exploiting dead time whilst core resources are stalled for memory access.
Ultimately this sort of workload will depend on the total memory bandwidth that can be brought to bear on the problem. Multiple motherboards and slower memory might, or might not, be more power-efficient than a single, higher speed shared memory architecture. However, the Niagara design approach could still be the best one for many such systems as it is easier to scale in software terms and there is less to manage.
The monster grew miserable as it paddled on the shingle
Not so surprising as the target for current server processors has been until recently blazing speed first. Efficiency only recently entered the room. How would something actually designed for low power, an ARM say, fare? Can't actually find ARM power consumption - any helpers?
The reasons why modern processors are slower are relevant I'm sure, but dragging simultaneous multithreading into it is probably wrong; SMT is a way of hiding memory latency by having many slower threads, thus avoiding (potentially) much of the expensive pipelining/branch prediction/etc. stuff. So I guess stuff like Sun's latest chip designs would in spirit resemble this design - lots of simple units bolted together but on one chip. Roughly speaking. IANA chip designer.
As for having disks & not thrashing them, if your app is fairly time insensitive then you may be able to hide the latency by smart batching of queries. Brains usually beats hardware.
Nonetheless, simplicity has to be the way to go and I like this idea lots.
(FWIW from a quick read of Voldemort, it seems to be the same thing as memcached at heart, a distributed hash table)
ARM vs Intel
"Can't actually find ARM power consumption - any helpers"
One example is the Samsung ARM Hummingbird chip (which can be clocked at up to 2GHz). It's rated at 0.75 mW/MHz, so at 500MHz it would be about 375mW.
You have to add in the power needed for the DRAM etc. as well, but still, it's going to be way better than Intel. Then again, Intel wouldn't want to back that kind of research. ;)
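For anyone who wants to play with the numbers, here is a minimal sketch of the mW/MHz arithmetic above. It assumes power scales linearly with clock speed, which is a simplification (real chips usually need more voltage at higher clocks), and it covers the core only, not DRAM or the rest of the board. The function name is made up for illustration.

```python
# Core power estimate from a mW/MHz rating.
# Assumption: power scales linearly with clock; DRAM and board power excluded.

HUMMINGBIRD_MW_PER_MHZ = 0.75  # rating quoted in the comment above

def core_power_mw(mw_per_mhz: float, clock_mhz: float) -> float:
    """Return the estimated core power in milliwatts at the given clock."""
    return mw_per_mhz * clock_mhz

for clock_mhz in (500, 1000, 2000):
    power = core_power_mw(HUMMINGBIRD_MW_PER_MHZ, clock_mhz)
    print(f"{clock_mhz} MHz: {power:.0f} mW")
```

At 500MHz this gives the 375mW quoted above; the 2GHz figure is only as good as the linear-scaling assumption.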
I've had clusters of Mini-ITX boards doing all sorts of things for years, although more for the distributed nature than raw grunt.
It's not rocket science. Oh no, hang on a moment, that is what I do!
500MHz is wimpy?
Once upon a time, and not that long ago, 500MHz would have been considered in the WTF-that-is-unbelievably-awesome range. Or perhaps it would have been a processor on a very expensive mainframe. But ya know what? Today it "sucks."
For some weird reason, efficiency in software has been on the back burner. I really don't understand why. Software engineering has really been tossed out the window. Interpreted languages are really pushed, and VM-based languages are the main current platform for development. And what do we get for all that?
Really, it is bloody well time that we take a serious look at what it takes to actually have good, reliable, and EFFICIENT software. Writing a small kernel is NOT difficult. Writing clean code is NOT difficult. It just takes something called "planning." Yes, forethought. We have super computer-class machines, and we utilize them like a Jack-in-the-box. Today's students are taught simple-minded concepts, and they come out of schools and write garbage and can't comprehend simple concepts like recursion or stacks.
Bloody educational failure.
That picture of their stage 2 work shows more or less exactly what I have running as my router right now. They are Alix boards, w/ 500MHz Geodes, which run cool enough to not even require a heat sink, let alone a fan. I even use the same 4GB CF Card that is pictured. I imagine that you could fit about 60 of those boards in a full-depth 3U rack, plus some power supply action and a mighty switch or two.
But yeah, seems like you could get pretty far with some of those serious ARM chips that they have these days, probably not nearly as cheaply as low end x86 hardware though. Those Alix boards with the SD card are under $150 each, and they come with basically everything you need.
RE: not much use for bursty workloads
True, but not every task in the enterprise is customer-facing, there's a lot of batch stuff still going on in the back-office. And even if your requirement has bursty workloads, there are technologies out there to allow you to power-up and power-down servers either to schedule or demand (last time I checked, Corrie did follow a predictable schedule, along with a predictably mind-numbing storyline). I've long nagged our vendor rep for some Atom-powered blades that will fit the same chassis as the x64 blades as we have some tasks which would suit this kind of low-CPU-power-but-continual-grind computing.
As an example, we have a rack of old DL320s with Celeron CPUs. They were bought years ago for a project where we needed a bank of backup servers for some Linux clusters, but our commercial backup software of the time didn't cover Linux. Instead, one of our Linux gurus (and he was the real item, sandals and all!) wrote a simple backup program that sits on top of standard x86 Red Hat, and CPU power is not really required. At the time, the simplest and cheapest servers were the DL320s. I would love to replace them with blades, but the current IBM and hp x64 blades we use are overkill for the task. Atom-based blades or servers, especially if we could boot them from flash or USB keys, would be perfect as we could just port the current Linux code, whereas anything like T2+ would be both too expensive and require a code re-write (and there's no Linux support on T2+) or a new backup app. Unfortunately, there is little chance of management allowing us to build a DIY FAWN rack for the job as they insist everything has to be covered by support contracts.
So, I'm looking at hp and IBM to step up and make me some Atom blades that will fit the generic chassis. Come on, guys, it can't be that hard?
This has been done. It's called Beowulf. It was a NASA project where they took a mass of old machines that were going to be discarded and built computing clusters with them. This is not a new idea.
Matt Bryant, check Supermicro for Atom-based servers. Newegg was selling them for about $359.00 US for a box with 2GB of RAM and a hard drive. You can buy them without the drive and boot them from USB thumbdrives.
SUPERMICRO SYS-5015A-H 1U Barebone Server Intel 945GC Intel Atom 330 Dual-Core 1.6GHz processor
Rackmount (1U), only 14 inches deep. Very nice solid little box. I've been deploying other similar units for years.
This is hilarious
Wow, I'm groundbreaking. I've been running a FAWN for years. An array of old beige boxes, obsolete for desktop purposes but great for running a web server, a mail server, a file server, a backup server, a little fist-sized print server appliance, the list goes on.
I wasn't sure if I was being cost effective and green by recycling obsolete gear or if I was horribly behind the times and needed to spend $20K to consolidate all this onto one higher powered server supporting half a dozen virtual machines. Now I find out I'm on the cutting edge, proving that server virtualization is a shell game with no payoff. Ultimately you have to provide the processor and RAM to support the same amount of processing and making it denser will always cost more and run hotter. Time to go sell all those server and virtualization stocks.