When someone provides me the funding I request for a project without fighting me tooth and nail over every bent copper, I believe that I should return the courtesy by going above and beyond to earn the trust that has been placed in me. One result of this is that the bigger the budget of the project, the pickier I become about …
Last time I was in love with SM it was short lived
Last time I was "in love" with a SM box it was a rather short lived affair.
After about a year (granted, that was in a warmer than average equipment room at 25C or so) the capacitors started popping like Guy Fawkes Day fireworks. Granted, this was at the height of the "duff capacitor" affair, when the whole industry suffered from it, so it may have been a one-off. Nonetheless, the experience has taught me to be very wary of their kit specs and to make sure I have plenty of resource to spare when using them.
Compared to them HP may not be as efficient. It is not as cheap either. However, it usually continues to run for way longer than its scheduled time. In fact, I have yet to see an HP/Compaq server retired as a result of a usage-induced hardware failure (duff disks aside, of course). All the ones I have retired had to go to the skip simply because they had become obsolete, not because they had become flaky. If it were not for Moore's law some of them would still be trucking along today, 10 years after they came off the assembly line.
HPC server stress testing
a.k.a. "we played Crysis all weekend"
You're having more fun than me it sounds like - cheers!
How do you play Crysis
on a Tesla? No outputs! However...that would be one /sexy/ gaming machine if I had a DVI port...
Should have used the joke ahead icon, but seriously this sounds like an amazing project to work on and I am green with envy. The Crysis comment was just in jest :)
I know it was in jest...but I wasn't. If you can figure out a way to play Crysis without the DVI port, I'll buy you a pint. I would *love* to take these beauties for a spin! :) After all, the question has to be asked: with two Xeons, 48GB of RAM and two Tesla cards...does Crysis still run like crap?
Because it wrecks my laptop...
Dude! You're getting...
Realtek NICs. Yech. Granted, it's for management, but still.
Anyway. Two GPUs and two CPUs in 1U. If you're only after GPUs, you could instead take a bigger box and stuff 8x dual GPU cards in it. That's less CPU sockets (and less memory sockets), but given that CPUs and chipsets eat a lot of power too, it could save you some kilowatts. Of course that all depends on just what you're trying to do with the boxes.
It's an interesting compromise, this GPU processing thing.
If I made some ridiculous uber-machine with quad 12-core CPUs and 8 GPUs it would crunch numbers so fast I'd need a little lie-down. That said, how much time would it spend crunching numbers versus chatting with the control server looking for new jobs? Personally, I wish control software were a little bit more dynamic. I would love to have a couple of real number-crunching beefcakes for the render jobs that can't quite be broken up as much. The rest could be farmed out to the smaller nodes.
Instead, you need to find a balance between speed of processing, power efficiency, cooling, ability to supply X number of watts to a single system and ability to actually get jobs from the control server. Given that the client uses Lightwave, I've found from testing that 2xCPU and 2xGPU seems to be about the right balance. At the end of the day, the control software just doesn’t seem to be good enough to deal with more.
many GPU setup
I think it's not only control software - put too many high-spec GPUs in a box and PCIe will hold you back, since the software will have to copy some serious amount of data to and from GPUs.
Well, I've never built a box like this one, but .... I would be tempted to learn CUDA just to see what it can do :)
You don't have to make use of the Realtek if you don't want to - the IPMI can share with LAN1 on the mobo. It also supports VLAN tagging when sharing in this manner.
Yes, though that depends on just what you're loading the GPUs with. 4x2GPUs on a 4x4x4x4x board might just suffice for things that require more crunching "hotspots" than munching through sheetfuls of data for just a few calculations. Things like brute forcing cryptographic hashes are very "hotspot-y", so to speak. Some other things, less so.
Have to admit I haven't paid enough attention to know if Trevor mentioned just what sort of jobs he has in mind. Anyhow, the control software not dealing well with configurations other than 2xCPU+2xGPU pretty much moots speculation there for production work.
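To put rough numbers on the PCIe point, here is a minimal Python sketch of best-case copy times over links of different widths. It assumes PCIe 2.0 at its theoretical rate of about 500 MB/s per lane per direction and a hypothetical 4 GB transfer; real throughput will be lower, and `copy_seconds` is just an illustrative helper, not anything from the render software.

```python
# Theoretical PCIe 2.0 throughput: 5 GT/s per lane with 8b/10b encoding,
# i.e. roughly 500 MB/s per lane in each direction (an upper bound).
PCIE2_MB_PER_S_PER_LANE = 500


def copy_seconds(data_gb: float, lanes: int) -> float:
    """Best-case time to move data_gb (decimal GB) across a PCIe 2.0 link."""
    bandwidth_mb_s = PCIE2_MB_PER_S_PER_LANE * lanes
    return data_gb * 1000 / bandwidth_mb_s


if __name__ == "__main__":
    # The 4 GB payload is an assumed figure, chosen only for illustration.
    for lanes in (16, 8, 4):
        print(f"x{lanes}: {copy_seconds(4, lanes):.2f} s to copy 4 GB one way")
```

A job that copies a few GB and then crunches for minutes barely notices an x4 slot; a job that streams data through for a handful of calculations per byte will.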
I've got an.....
Abacus you can borrow, the next time you need graphical speed!
Cooling No Problem..
"My next challenge - designing cooling systems to handle over 25kW per rack “on the cheap” – is the daunting one"
You live in Alberta you say, and it's winter there right now? A dryer hose and a wall grommet should do the trick.
What about when it's +40c? Also, in my experience, servers don't appreciate you directly dumping air at -40c onto them either...
The image from the supermicro site shows one board up, one board down.
As to the price... 6K Canadian is very interesting. Sounds like you're getting a steep discount.
This would make a nice node in any hadoop cluster.
@Ian Michael Gumby
It only seems like a “great deal” if you assume maxed RAM. While the board supports 192GB of RAM, I'm only actually loading the systems out with 48GB. That's 12x 4GB modules, a pair of CPUs, the two GPU cards and the server. You can buy the barebones server with the 2x GPU modules retail for $5700 here in Canada. 48GB of RAM + CPUs aren't less than another grand, retail. Buy a few of them and a discount of $1000 off the retail really isn't that much.
I wasn't expecting it to be filled up with memory.
Even with 48GB it's still a pretty good deal. Each M2050 card retails for around $2500-$2900 USD with a higher MSRP.
I don't recall how many drives you have, but for a hadoop cluster data node you'd put in 4 2TB SATA drives. (Maybe 3TB if priced right.)
Definitely something to consider.
Well, we kind of did
I'm sure I wasn't the only one who thought you'd maxed the RAM.
I'm very sorry I didn't make that clearer. That is totally my bad. Even 48GB of RAM is probably excessive for these nodes...but I like to fill all the slots. I guess I forgot that not everyone would realise that the average video rendering box would not make use of 192GB of RAM. It's mostly about the number crunching. They typically crunch work units in the 4-8GB range, though they could get tasked with up to 36GB, depending on the job.
We're doing tests now to see if 10Gig NICs will really speed up overall farm performance, or if it is (as I suspect) going to be bottlenecked by the control software, not the network. Only tests will tell…
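For a feel of what the network alone contributes, a back-of-the-envelope Python sketch of the time to ship one work unit at line rate. It assumes the comparison is 1 Gbit/s versus 10 Gbit/s and ignores protocol overhead, storage speed and the control software entirely, so it only bounds the best case; `transfer_seconds` is an illustrative name.

```python
def transfer_seconds(work_unit_gb: float, link_gbit_s: float) -> float:
    """Best-case time to send a work unit (decimal GB) at full line rate."""
    return work_unit_gb * 8 / link_gbit_s


if __name__ == "__main__":
    # Work-unit sizes quoted in the thread: typically 4-8 GB, up to 36 GB.
    for size_gb in (4, 8, 36):
        slow = transfer_seconds(size_gb, 1)    # assumed 1 Gbit/s baseline
        fast = transfer_seconds(size_gb, 10)   # 10 Gbit/s NIC
        print(f"{size_gb:>2} GB: {slow:6.1f} s at 1G, {fast:5.1f} s at 10G")
```

If a node then spends many minutes crunching each unit, saving a minute of transfer will not show up much in farm throughput, which is why only the tests will tell.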
@Ian Michael Gumby
Getting the cards in the server seems way cheaper. What I am ordering really isn't that far off the retail price. Even the local supplier I use for retail gear has a decently low retail price: http://www.cdw.ca/shop/products/Supermicro-SuperServer-6016GT-TF-FM205-no-CPU/2251250.aspx. Remember that you have to add CPUs and RAM to that.
That said, my client has some decent connections, and got a reasonable discount off of what seems to be the Canadian retail price for this gear. Also to be noted is that I don't have any disks in any of these nodes: they load their OS over the network. It's just board/chips/RAM/GPUs.
All about the plumbing
Wasn't that Seymour Cray's observation about the toughest bit of designing the Cray 1?
I suspect the cooling might be a little trickier than basic hardware selection. I suspect it depends on how good the facilities in your machine room are and how much freedom you have to remodel (historic building versus using a jigsaw on the Sheetrock).
@John Smith 19
It's a big converted warehouse sitting on top of a massive concrete slab with a two-story 3500 sq ft basement underneath. Sadly, most of the building is offices and warehousing. The corner of the building I get to work in really isn't that big...but I can punch holes in the wall/roof/floor if I need. I just can’t move walls.
I'd never touch SuperMicro again...
...several years ago I used to build computers for radio stations as part of my job. I used to source motherboards without anything on board as at the time most on-board stuff was non-standard and crap, and we always used specialist audio cards and graphics anyway (when dual monitors were unusual).
Anyway, I digress. To cut a long story short, after weeks of issues with a client's playout software wobbling all over the place, timing wise (despite us using a serial radio clock) it turned out that the CMOS clocks were shoddy. Different motherboards fixed the issue.
So, there's a story you didn't need to read.
Wobbly CMOS clock to blame for garbled playback? Not likely...
In a particular generation of SuperMicro dual-Xeon boards (Nocona/Irwindale = socket 604), some devices in the RAM VRM's were dying. Cannot say if it was the old elyt caps or the switching FET's combined with some thermal design omission... multiple different motherboards of that generation showed higher RMA rates. Since those days, I've heard no complaints about SuperMicro - I'm pretty sure they've learned from that historical experience :-)
As for system clock distribution: that's a relatively complex issue. You have multiple levels of clocks in the system (some of them in hardware, some of them in software) and multiple levels of audio data buffers (again HW/SW). Makes me wonder if you were facing buffer underruns, or indeed wobbly playback clock (as in sampling rate). All audio cards that I know of have their own Xtal oscillators for the sampling rate clock - so the system-wide PCI clock should have little effect. BTW, the CMOS RTC clock can hardly be the culprit - the PCI clock and the various hardware timers' clocks (=> also your OS system clock) tick along some other master reference crystal, different from the non-volatile CMOS RTC. And, that multi-output clock synthesizer for the various busses and chipset subsystems can employ a technique called "spread spectrum" on purpose - to duck some EMI radiation limits simply by making the radiated "frequency poles" broader / softer. In some BIOSes, the "spread spectrum feature" can be disabled (in others, it cannot). This "spread spectrum" thing is quite common and perfectly legitimate in modern chipsets/motherboards.
However, I still don't think a little bit of added jitter in the CPU+PCI clocks would hamper your audio playback. Rather, I have a different explanation: IRQ and general bus transfer latencies, resulting in buffer underruns. "A few years back" could quite as well correlate to the transition from the old-fashioned discrete interrupt delivery over dedicated signals, to the new+hip message-signaled delivery, in-band over the "hub link" or whatever the chipset backbone link is called. The change has come in the form of chipsets such as i815 / i845. Previous Intel chipsets and contemporary chipsets from cheaper competition still used the old "out of band" IRQ delivery and were therefore showing better "interrupt latencies" under load. Another factor might be that, at about the same time, motherboard vendors (BIOSes) started to use SMI more extensively for software emulation of some missing features (such as, to emulate legacy keyboard / floppy on top of USB devices) - again resulting in occasional excessive latencies. The RTAI project even had some standard test utilities for this. I recall that some telco voice processing boards for the PCI bus did have a problem with that - and a feasible workaround at the time was to replace the Intel-based mobo's with something SiS-based.
I mean to say that none of this is a problem on part of SuperMicro - it's evolution, and it's common to a particular generation of system chipsets. Blame the chipset makers...
I wonder how many window unit air conditioners you'd need to dissipate that much heat. Super Wal-Mart always seems to have them on sale for about $100 each. That's probably not the most energy efficient solution (nor is it likely practical).
On a more serious note, have you looked into geothermal? The ground up here has to be relatively cool year-round. As I understand it, geothermal is quite a bit more efficient than any hvac system.
Cooling 25kW per rack? Easy solution
Don't do it. High density != high efficiency.
You are not running an ultra-low-latency interconnect (Ethernet is slow) so you don't need to hit that power density, which is just going to cause problems your client doesn't need.
If those boxes are really only happy up to 25C intake at full load then the cooling design is fundamentally compromised and there will be several problems;
1) Those near the top of the rack are going to run hotter whatever you do (inside the rack if not at the front intake), which means you have to supply <<20C to the rack to avoid problems
2) The systems are going to be wasting power on cooling fans which will cost your client money they don't need to spend (1U cases with 36mm fans are crap for cooling a few hundred watts let alone 1kW), they could spend this money on more compute nodes instead of noisy inefficient little fans
3) The internal components are going to be running way too close to the edge of their thermal envelope, in reliability testing terms we would call this "accelerated life testing". By increasing the thermal stress your client is likely to see in 1 year how many failures the components would suffer, properly cooled, in 5 or 10 years
A decent 2U chassis (agree with you that white box is frequently as good and cheaper) would provide not just fans large enough to move air effectively and efficiently but also enough space to get sensible airflow and heatsink sizes in. You won't lose much density per rack with 2U boxes as you'll be able to get more of them in before hitting the wall on cooling.
The decent 2U chassis should also be able to run to the basic spec all the Tier 1 vendor kit is tested to, the ASHRAE TC9.9 data cent(er) environmental specification which states 28C intake is OK for years on end and 32C / 35C intake is acceptable for shorter periods. As your client will be junking the kit in a few years anyway you should be able to run up into the low 30s without problems. Of course those little 1U boxes are probably going to start dying in a big way within 18 months so look carefully at how long your client wants to keep running them for before they have to go in a skip.
If low humidity is the constraint then you don't really want to be running below a 5.5C dew point for extended periods anyway, a little bit of the hot exhaust air remixed with the cold intake after passing through a spray or evaporative humidifier works wonders for controlling minimum humidity for hardly any power cost.
The low specific humidity concern is not about the intake air humidity, it is about the relative humidity once the intake air is heated inside the server, if you are just using super dry (from cold outside) intake air then keeping the intake temp low to manage the intake RH is not going to fix the RH inside the box where the problems occur.
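The relative humidity point can be illustrated with a small Python sketch using the Magnus approximation for saturation vapour pressure. The 5.5C dew point comes from the text above; the 20C intake and 45C internal air temperatures are assumed values picked only to show the trend, and the Magnus coefficients are one commonly used set rather than anything specific to server vendors.

```python
import math


def saturation_vapour_pressure_hpa(temp_c: float) -> float:
    """Magnus approximation over water (result in hPa)."""
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))


def relative_humidity_pct(dew_point_c: float, air_temp_c: float) -> float:
    """RH of air with a given dew point once it sits at air_temp_c."""
    return 100.0 * (saturation_vapour_pressure_hpa(dew_point_c)
                    / saturation_vapour_pressure_hpa(air_temp_c))


if __name__ == "__main__":
    dew_point = 5.5          # minimum dew point mentioned in the comment
    for label, temp in (("intake", 20.0), ("inside server", 45.0)):  # assumed
        rh = relative_humidity_pct(dew_point, temp)
        print(f"{label} at {temp:.0f}C: about {rh:.0f}% RH")
```

The moisture content does not change as the air passes through the box, so the same dew point that gives a comfortable RH at the intake gives a far lower one once the air has been heated.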
As for cooling 25kW (or 20kW as you could get to with 2U boxes without trouble) per rack you might consider;
1) Contained air flow, don't bother with hot / cold aisle, it doesn't work properly at those densities you will waste your life chasing hot spots and doing CFD to find out that you should have contained the air flow.
a) Doesn't matter whether you contain the hot or the cold air, given the high temperatures the GPUs run at and the hot exhaust air though you might want to contain the hot aisles and use a suspended ceiling to extract the air to outside or for cooling and recirc. Humans won't want to spend long in the exhaust of those boxes running a render job
b) In row cooling is no better than external air or perimeter CRAC units in a new build and can be rather expensive
2) Don't bother with a raised floor, at 20+kW per rack you would need a very deep plenum and a cold aisle 5-6 vented tiles wide to get enough air in without starving the systems at the bottom of the racks. Contain the hot aisles and use nice wide cold aisles, then just push the cooled air into the room and let it flow down the cold aisles. The large volume of the cold aisles will work as a plenum and feed near constant pressure air to all of the racks.
3) Overall density is not about how many kW per rack but about how many kW per square metre of total floor space, the power and cooling plant are going to get larger as the total IT kW grows and at 25kW per rack you are likely to end up wasting a lot of space due to air flow problem areas, 15-20kW per rack is likely to yield the same overall kW / m2
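To see why getting enough air to a rack becomes the hard part at these densities, here is a rough Python sketch of the airflow needed to carry away a given heat load, using the sensible-heat relation Q = P / (rho * cp * dT). The air density, specific heat and the 15 K intake-to-exhaust temperature rise are assumed round figures, not measurements from these servers.

```python
AIR_DENSITY_KG_M3 = 1.2        # assumed, roughly sea-level air at 20C
AIR_SPECIFIC_HEAT_J_KG_K = 1005.0
M3_S_TO_CFM = 2118.88          # cubic metres per second to cubic feet per minute


def airflow_m3_s(heat_load_w: float, delta_t_k: float) -> float:
    """Volumetric airflow needed to remove heat_load_w with a delta_t_k rise."""
    return heat_load_w / (AIR_DENSITY_KG_M3 * AIR_SPECIFIC_HEAT_J_KG_K * delta_t_k)


if __name__ == "__main__":
    delta_t = 15.0             # assumed intake-to-exhaust temperature rise
    for rack_kw in (15, 20, 25):
        flow = airflow_m3_s(rack_kw * 1000, delta_t)
        print(f"{rack_kw} kW rack: {flow:.2f} m3/s "
              f"(~{flow * M3_S_TO_CFM:.0f} CFM) at dT = {delta_t:.0f} K")
```

A smaller temperature rise across the servers pushes the required flow up proportionally, which is one reason starved racks and hot spots appear so quickly in uncontained layouts.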
Great article, very interesting, thanks,
You could also consider water-cooled racks, which could supply 25kW of cooling per rack very efficiently (as it is only cooling the racks and not the surrounding air). I've used Rittal ones in the past (for blade servers) and they've been fine.
Some consideration is needed for what happens if the rack heat exchanger fails - equipment will overheat very quickly unless machines shut down automatically, and/or you have an automatic door release system and a cool enough ambient air temp to buy some time. Shouldn't happen very often though :-) !
can you actually stack them together?
If you use any PCIe card that is not passively cooled like the M-series Tesla are, the card fans will need to suck air from somewhere, sideways - thus the air intake on the top right of the case, for the right hand GPU. If you want to place more than 20 servers in a rack, then at some point you have to bolt two of them together, and one of them will not "breathe". Then I suppose you are limited to passively cooled cards.