"but what if the sun does shine? For more than a couple of hours?”
Well, some probabilities are so unlikely you can just discount them.
Welcome again to On-Call, our regular tale of things that happen when readers are called in to fix big messes on weekends and evenings. Before we get to this week's tale, a quick reminder we've some prizes for new submissions as part of our Sysadmin Day celebrations. Write to me if you've a story of being called out to do …
"but what if the sun does shine? For more than a couple of hours?”
Well, some probabilities are so unlikely you can just discount them.
Indeed, there's never any sun each summer, never any newspaper headlines about the "surprising" sunny spell, never any hosepipe ban each summer, never any problem with railway lines buckling in the sun, never any problems with cold weather each winter, never any newspaper headlines about the "surprising" cold spell, never any flooding, never any problem with railway points freezing, etc.
Reminds me of running an old PDP-11 in 1990 (it was attached to an even older piece of important measuring equipment). It had airco fitted and that summer it promptly died. Apparently it wasn't rated for an outside air temperature of more than 30C. Think about it - an airco that shuts down when the outside temperature tops 30C. It says something about old British summers that people thought that good enough...
> It had airco fitted and that summer it promptly died
A bit like a new computer room project I was involved with - merging two separate computer rooms in a new building. We offered to do a proper calculation (but, being the support team, we weren't trusted to do anything related to the new computer room), so the managers chose instead to just install double the capacity of one of the old ones. They chose the wrong (smaller) one. Likewise with UPS capacity.
The end result was a 2-chiller aircon system where theoretically we could take one chiller offline for maintenance but actually had to run them both at about 70% load all the time. Likewise with the UPS - about 80% of the way into the merge, the UPS went into bypass mode as we were pulling too much current, and we couldn't put in a bigger unit as the UPS was surrounded by fixed frames and racks. It ran on bypass for months with us praying that the power supply in the new building wouldn't crap out on us. Especially as restarting the systems had to be done in a very specific fashion, otherwise transactions would almost certainly be lost.
Many years ago, at a previous company, after a bit of refurb, we got a dedicated room for our 2 racks of development kit; the room was divided in two by a wire mesh. The other half had structured wiring and Ethernet switches. We had two identical but separate A/C units fitted for redundancy, both with their own outside unit. Everything seemed to be working well...
One Monday morning we arrived and someone alerted us to the loud noise coming from the room. We unlocked the door and were met with a huge wave of heat and what turned out to be the noise of the fan blades scraping against ice that had built up in one of the units. I can't remember the details exactly, but I think it had been cycling through freeze and melt cycles all weekend. I think the two units were also working against each other, or the other one had given up.
There was water all over the place, dripping from the unit - luckily none had dripped onto our equipment.
A common mistake is to get the aircon spec for a server room wrong.
If a compute load is constant, steady, and unchanging then an airconditioner to match is OK. A lot of cheaper aircon units cannot be 'throttled' - they need the full load heat to work properly, otherwise they ice up. So with a steady compute load and a matched aircon, it's all OK.
However, if one has a compute load that varies throughout the day (e.g. a lot of internet-connected services), or your server room is going to be built up slowly, then a throttleable aircon unit is needed. These cost more money, so they're not always specced. False economy.
And you'd think that having a spare unit ready to go would be a no brainer too...
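Since essentially every watt the kit draws comes back out as heat, the sizing arithmetic behind "getting the spec right" is simple enough to sketch. A back-of-the-envelope version (the 1.2 overhead factor for lighting, people and heat leaking in through the building fabric is my own guess, not a standard figure):

```python
def cooling_required(it_load_kw, overhead_factor=1.2):
    """Rough cooling requirement for a server room.

    Nearly all electrical power drawn by IT kit ends up as heat, so the
    required cooling capacity is roughly the IT load plus a margin for
    lighting, people, and heat conducted in through the walls and roof.
    """
    heat_kw = it_load_kw * overhead_factor
    btu_per_hr = heat_kw * 3412.14   # 1 kW = 3412.14 BTU/hr
    tons = btu_per_hr / 12000.0      # 1 ton of refrigeration = 12,000 BTU/hr
    return heat_kw, btu_per_hr, tons

# A modest room drawing 10 kW of IT load:
heat, btu, tons = cooling_required(10)
print(f"{heat:.1f} kW of heat -> {btu:,.0f} BTU/hr -> {tons:.1f} tons")
```

Speccing exactly that figure with a single fixed-output unit is the false economy described above: the moment the room runs at part load, a unit that can't throttle starts icing up.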
One of my favourite interview questions for budding software engineers is, how much do they know about aircon? A related one is that of mains supply. A good answer is an acknowledgement that it's important to characterise a system's heat output and spec an aircon to match.
How is this usually handled? You have N+1 power, multiple levels of redundancy, but spec the AC at max load and assume it won't break. You laugh, but I know of a Fortune 100 company's main datacenter for which this was true. Maybe having two units, one of which is idle, is a bit much, but having 5 units, only 4 of which are needed... That would also solve the problem of variable loads, as one or more could be cycled off as the load changes. That way you can avoid the more expensive ones with multistage compressors - the savings might actually pay for that fifth unit.
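The "five units, four needed" arithmetic is the usual N+1 rule, and it fits in a few lines (the capacities below are made-up numbers for illustration):

```python
import math

def units_installed(peak_heat_kw, unit_capacity_kw, spares=1):
    """N+1 sizing: enough units to carry the peak load, plus `spares`
    extra so one can fail or be taken down for maintenance."""
    needed = math.ceil(peak_heat_kw / unit_capacity_kw)
    return needed + spares

# 100 kW of heat with 25 kW units: four carry the load, five installed.
print(units_installed(100, 25))
```

As the load varies, anything beyond `needed` can be cycled off - the variable-load benefit mentioned above.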
Hopefully the design is smart enough that it isn't always the same units running and the same unit sitting idle, as we all know what will happen when the long-idle one is needed...
Until very recently DC people didn't think like that (or at all)
All racks took X kW, so the electricians wired in X kW, the HVAC people specced 1/2X of AC and the management billed you for it.
Nobody ever seemed to put any thought into how hot it really needed to be, where the airflow went, or how much power was actually being used - so you would typically go into machine rooms being cooled so hard you had to wear gloves.
The same in Africa. On summer weekends doing maintenance we would be dressed in shorts and T-shirts - with jumpers in our bags for when we were in the computer room. If we were standing about we did so behind the replaceable disk units to get the benefit of their exhaust heat.
The city centre shops had aircon units with exhausts at pavement level. Inside the shops was usually much too cold - and the pavements even hotter than the general ambient temperature.
"One of my favourite interview questions for budding software engineers is, how much do they know about aircon?"
You're obviously at one of these companies that confuses any sort of IT role together. A sysadmin MAY find such knowledge relevant, but really they should be focused on the software aspects of their charges, which is a complex enough job. If your programmers are being asked about air conditioning, then they should take that as a strong warning sign that they're applying to a company that doesn't understand IT roles very well and they're about to be dumped in some generic pool of "the IT types".
> A common mistake is to get the aircon spec for a server room wrong
Yes, but an "on/off" unit can cope with variable loads - it just doesn't run the compressor all the time. That is in fact how pretty well all systems worked until relatively recently when the power electronics to do variable load working because "cheap". Alternatively, if you correctly design the system, the compressor works full time sucking but can only suck down to a specific pressures, which sets a specific evaporator temperature. On part load, the compressor starts with it's cylinder partly evacuated - and the effective load on it reduces. In extreme, the vacuum at the inlet is such that at full compression it doesn't expel any gas - and the actual load on the motor is low.
A much much bigger problem is speccing the wrong sort of unit - typically one with too cold an evaporator. An AC unit intended for "comfort cooling" will expect a certain amount of the heat removed to be due to condensing water - and as long as there is enough of this heat then it doesn't freeze the condensate which then runs off the evaporator. If the air is "too dry", then it cools the air much cooler, the evaporator fins get very cold, and what moisture there is will freeze - blocking the airflow and allowing that part of the evaporator to get even colder and thus ensuring that the ice cannot melt.
So you need a unit designed for dry air - it'll have a larger evaporator so as to compensate for the higher evaporator temperature, and a different refrigerant cycle designed to work at a higher temperature so as not to freeze the condensate.
Well that's the over-simplified version at least !
Not a server-room, but a two-person overflow office ...
When I worked in Italy, the portakabin's aircond freezing up became a regular occurrence. Just about August - when the dry heat of early summer has become the humid heat of late summer - was worst. We learned a few workarounds, like switching it off and going for a coffee every couple of hours to give it a chance to recuperate.
It was still better than working in the main building, when the aircond would bring with it the filth of neighbouring smokers in their offices. And for much of the year - spring and autumn - we could keep the door open and welcome the cats in.
A few weeks ago I got a call from an elderly friend complaining she couldn't send or receive emails on her iPad; after going through a few hoops over the phone, I realised that her router was switched off.
"Go check it is plugged in and switch it on," I suggested.
"Oh no, I can't do that," was the reply, "I'll have to get a man in".
>"I'll have to get a man in".
Have we wandered into the script of a porn movie?
Now sysadmin porn - THAT is an overlooked market.
"I'm here to install your new SAN"
"ooh what a big SSD that is"
... black Linux conference t-shirt falls to the floor revealing the 6pack of the typical sysadmin.....
One gig I had was in a temporary building: one storey, wooden walls, black tar roof. Half of the aircon system was b0rk3d; it had a small leak and they had already consumed their freon quota. Even on cooler days this was already a problem, never mind days that had the sun stoking the roof full blast. So they had installed garden sprinklers under the heat exchanger (which by sheer accident happened to be in the shade for most of the day), and by 10 o'clock or so, when the computer room temperature hit 25C, someone had to open the tap. Then, depending on the accumulated heat, one of the team had to stay overtime to turn the tap off in the early evening. On very warm days this system was augmented by opening the back door and having a couple of fans wafting in outside air, which was perceptibly cooler.
Due to the wildly varying computer room temps we had a fair number of disk and component failures. Way more than could be accounted for by the age of the systems.
My utterly practical solution - get a couple of buckets of white latex or some rolls of that alu-covered bubble-wrap foil and chuck it at the roof - was dismissed with "We're going to move out of here soon anyway." Yeah right, but not before the end of the summer (AFAIK, they were there for two more years).
"dismissed with "We're going to move out of here soon anyway.". Yeah right, but not before the end of the summer (AFAIK, they were there for two more years)"
Sounds about right.
In most industrial settings, "temporary" means "until it falls apart + 2 years to build the replacement".
> In most industrial settings, "temporary" means "until it falls apart + 2 years to build the replacement".
Nah, this was government. So, "until it falls apart, plus five years to try to build a replacement which doesn't work, two years to break open the contract with $VENDOR, then four more years getting $VENDOR to build something that has half the functionality for double the original price."
Joe, one piece of advice, leave that amateur club, now!
>running the lab wasn't really a part of my role
Don't do it, get the manager to find somebody whose role that is.
>A newly-installed anti-virus product wouldn't let anyone do remote access unless their machine ran the same AV software. And Joe's machine was locked down by IT, so he couldn't just install the software himself and get on with things.
This guy is a techie and they lock down his PC, really? I can remotely understand locking down the machines the drones use, but techies???
If he had gotten the manager to move off his backside and find the right guy for the task, with matching AV on the laptop, they could have saved quite some dosh ...
The real kicker was that our AV solution was Norton A/V, mandated and password-protected so it couldn't be uninstalled, despite this being my home machine - but corporate IT that weekend had dictated that endpoints had to run Symantec AV (who had just taken over Norton) to be granted access. I'd tried to get the trial version of Symantec on my machine to no avail, hence my hunt around the world network to find an access point this policy hadn't been rolled out to.
Our server room is in a sealed room in the attic, with a single A/C unit to cover 2 racks... one day the A/C shut down; let's just say the room hit about 65C. The door was so hot that we had to use a towel on the handle. We had to shut 90% of our infrastructure down. The company who maintain the A/C provided 2 portable units to cover us whilst they fixed the main unit - both of these burnt out within 45 minutes of switching on!
Lesson learnt - Fast forward 2 years, this small room now has 4 A/C units...
I have done experiments that show dramatic improvements in ambient temperature using passive cooling techniques (getting outside air into the room and back out again). It amazes me that every time I broach the question I'm told the filtration requirements would be "ruinously expensive". Unlike the costs of a complete machine-room shutdown and restart.
One place I supported phoned in reports of random disconnections. The network hub (yes it is that long ago) obviously needed a restart, I thought.
I turned up on site, collected the comms room key from security and headed to the 5th floor.
As I approached the room (metal wall up to waist height, with glass above) I could feel the heat. Both aircon units had failed over the weekend, and the temperature had reached over 50C.
After the door had been open for a while, I tried to open the window to the outside. Locked. Special Key needed.
AND I burned my hand on the bare concrete pillar supporting the floors above.
All the time, the servers (a little closer to the floor than the LAN hubs) behaved perfectly. The potted plant on someone's desk just outside never recovered though.
I remember at a job about 15 years ago now, when we built out the "server room" as we called it (about 11 racks of gear, much of it low density stuff sitting on rack shelves in 2 post racks), I made sure to equip it with tons of battery backup, quite literally there was probably 1,200+ pounds of APC Smart UPS rack mount systems in various racks. Battery run time was probably at least 90 minutes.
One Sunday morning my phone wakes me up with a message saying the system is going on battery power. I was tired, and said, oh neat, the system works. Took me what I guess was about 30 seconds to realize oh shit, I forgot the AC is now down and the room is going to turn into an oven. So I rushed over to the office (2 miles away) and started manually shutting shit down (it would have shut down automatically when the UPS battery got low, but I didn't want to wait that long).
Fortunately there was nothing truly mission critical there (90%+ was development systems that weren't in use on weekends, lots of Ultra Sparcs, PA-RISC, a couple Alphas, buncha x86 too). Room was pretty warm when I got there but at the end of the day nothing damaged, nothing lost. Power came back on probably an hour or so later.
All my jobs after that, our stuff was always in a co-location facility - short of one facility in Seattle that was prone to issues (I moved my then-company out years before their big fire, which caused 30+ hours of downtime). Never had to worry after that.
Another place I worked actually had the A/C on the no-break too, so, no problem, you'd think. Most power failures didn't last more than half an hour or so, but when it had to run for a while longer for the first time it transpired that yes, the no-break itself would very much like to have some cooling too.
Not quite in the same weight category as most stories, but on an individual machine basis it can be important too.
I had a Pentium 166 (that will date it nicely...) machine with a SCSI disk and a few other add-ons. One day the disk started giving read errors, so I went to fish the machine out from where it lived to have a look at it. When I did, I could feel the heat radiating from it, it turned out that every single fan in the machine had failed. This was before the addition of speed sensors on PC cooling fans. It had soldiered on bravely with the internal temperature rising, until it reached a point at which the disk drive couldn't maintain performance, hence the read errors. When it was cool, I stripped it down and there was noticeable discolouration on the Pentium chip where the die temperature had clearly reached impressive levels. As I recall, the chip survived another couple of years before expiring, but by then it had been relegated to a secondary role so it wasn't a catastrophe when it did finally shuffle off this mortal coil.
On a larger scale - a VAX 11/730 (IIRC) test machine in a little office.
When the aircon failed over the weekend, the disk drive (huge box the size of a washing machine, about 50MB...) started to squeal. Eventually the disk stopped rotating, just as we came in.
Fearing the worst, a quick shutdown was in order.
After the AC was fixed, the VAX started up without any complaint and kept running thereafter.
Servers? Wimpy little tin boxes... :)
but quick version: 1990, room full of VAXen and drives and shizz. 1st job.
tell the sysop it feels very warm behind the 8650. he says yeah but it puts out a lot of heat.
few weeks later... feels a bit warm again. sysop goes and looks at old aircon panel outside, no fault lights. it's my imagination.
few weeks later, 1st super hot day of year, cannot access the cluster from desk. run over to find all doors open, screaming teleprinters and worried sysop just powering down disks.
turns out the panel that showed the faults had failed. so we got no fault lights as gradually the 3 units failed (which must have coincided with me noticing the temp changes).
aircon panel was re-engineered to show 'run' lights not 'fault' lights. interesting distinction. we all learned something :)
however since then at my current place of work we have a small comms room with servers and a 150TB SAN for our area. due to building works both aircon units were replaced with brand new units (though with significantly longer pipe runs).
remember that brief heatwave we had a month or so ago? I thought I'd just pop my head in there to see if everything was ok (because of the above). one AC unit dead, no power. I reported this to our internal facilities guys as urgent. Even after repeated phone calls it took TWO AND A HALF FUCKING DAYS to get our on-site (but subcontracted) engineers to look at it. I was doing my nut.
turns out, some nobhead hadn't marked the trip switch back at the distributor with 'server room aircon do not turn off assholes!' and another nobhead just turned it off by accident.
anyway, for 2.5 days in the highest heat we've had this year we were on 1 AC unit, with me thinking that what happened 25 years ago was going to happen again...
"aircon panel was re-engineered to show 'run' lights not 'fault' lights. interesting distinction. we all learned something :)"
Just you watch. That panel will fail in another way: in this case, a short-circuit that makes the light stay on even when the AC isn't really running anymore. Murphy ALWAYS has a way to strike.
Was it 2003 or 2004 when they recorded close to 40 degrees at Heathrow? A mate was working in London and they were doing 60-second rotations in the comms room to close down boxes that they couldn't remote on to. It was so hot that they went in for a minute then came out to cool off while someone else went in.
On a visit to Hong Kong to see friends in the mid-90s, I got chatting with a bloke who worked for a Major News Relay Network (which shall otherwise remain nameless). They were so desperate to recruit someone with network knowledge to keep their flaky international connections up that I almost agreed to leave the UK and accept double the wage to live in an expensive but almost tax-free city. What stopped me was the fact that while I knew barely anything about networking I could see that their infrastructure was horribly, horribly wrong -- that and I had a conscience.
I think I probably worked for that Major News Relay Network. Or at least, the company I worked for was gobbled up by them a couple of months after I left, so I almost worked for them. Office on Connaught Road Central with glorious views of the Route 4 elevated motorway right outside the window?
The company I currently work for had the factory extended a few years back, with a nice shiny new server room specced out (a windowless concrete box).
I got sent in to wire up a couple of phone lines as we started moving staff into the new block.
While doing this work, I realised that the builders had installed nothing more technical than a bathroom extractor fan, as the plans called for.
I immediately pointed out to my superiors how much this would likely cost in cooked servers, and suggested they got the builders back to tie it into the AC of the new lab next door.
This also happened at the college I went to. They blew the budget on the PCs and left nothing for the aircon. So unlike the cooled and working Apple Mac room, the PC room constantly had fried PCs - especially as that room was fully windowed along the one wall facing the full sun all day long.
A new computer room in Africa had been designed with two aircon units. These were tall units each mounted against a wall - with the cold air pumped from the bottom under the false floor. After a few weeks it was obvious that the air conditioning wasn't coping with the heat. Another aircon unit was installed and everything was ok.
The mystery was solved when we visited the site to do a software update. The computer operators hospitably offered us a cold beer. They lifted a floor tile in front of one of the aircon units to reveal their stash of beer and water melons - up against the cold outlet of the aircon unit.
In the days when networks were a new thing the company had small nodes in convenient offices across the country. The node in Manchester kept failing intermittently - but no fault or cause could be found. It did seem to happen about the same time on any random day.
The circumstances for a failure turned out to be very specific.
1) One of the access panels on the node computer's case had to have been left open.
2) The open panel had to be facing the window in the room
3) The sun had to be shining brightly through the window for a while.
The problem was a chip on a board. It only failed when the sun's heat caused a critical thermal expansion. By the time the engineers arrived the sun had gone in or moved past the window - and the chip was happy again.
Back in my early days in IT/EDP as a lowly Computer Operator, we had a similar problem in NZ. Over winter a number of new washing-machine-sized hard disk drives had been added. Come February (usually the hottest month in Auckland), the central air conditioning system couldn't cope with the extra heat, so several million dollars of ICL 2900 mainframe and associated equipment had to be powered down every afternoon from 2pm to 6pm.
The problem was quickly solved by adding some large window mounted aircon units. Upgrading the building systems was too difficult.
When I started with my current employers, the computer "cupboard" had been fitted with a heat pump instead of proper air conditioning. The heat exchangers were located inside the warehouse part of the building, just under a sheet-metal roof.
End result: in hot weather there was next to no cooling, then the device would ice up and stop working, dripping water down the wall while it defrosted. The unit wasn't rated for an outside temperature below about 10C, so in winter it would also stop working. Eventually it decided to die and was replaced with the right unit for the job - 7.5 kW of decent cooling, which we moved to the current offices.
The thick end of about 15 years ago, some of our comms stuff was hosted in one facility of a very well-known datacentre/co-location provider in London, who shall remain nameless. We'd been using that particular facility for a little while and all had been fine, but it was getting a little 'busier' and each time we visited more of the place was being used.
One fine summer's day I rocked up at our rack to find a tall desk-side fan whirring away next to it, blocking access to the rack door. There was also a desk-side AC unit with a long, long exhaust pipe running to a propped-open window (luckily, a few floors off the ground).
Immediate inquiries to the facility team brought a) a response about the building AC not coping with the load, and b) an instant response from us that we were cancelling our contract and moving out just as soon as BT et al could relocate our connectivity. Luckily for us the rack was mostly switches and comms kit, not servers, which were elsewhere (and most of which at the time were 4U power monstrosities).
One of the huge Airedale aircon units had failed at a site I used to visit but they were 'moving soon' so they didn't bother replacing it.
They had a temporary 'solution' in place which was to remove the failed AC and leave open the hole in the wall where the outlet was. Approx eight feet wide by a foot tall.
Cue servers covered in pigeon shit and cute baby pigeons stalking around the server room.
Never did work out exactly where they were nesting but I have my suspicions it was around the back of the DG mainframe where nobody but the bravest sysadmin dared go.
I used to work in a place that was *somewhat* large, so they had water chiller units producing 1000 tons of refrigeration. There were FOUR redundancies. (A ton of refrigeration is the heat flow needed to melt one ton of ice in 24 hours - about 3.5 kW - so 1000 tons is roughly 3.5 MW of cooling. So yeah, this thing could BURY US in snow if we wanted. And yeah, the place was HUGE.)
Instead of dealing with freon and whatnot, there were just cold water lines run into each building, feeding heat exchangers. These heat exchangers had throttled intake valves - except not: the valves reacted to the building temperature sensors, and crapped out. Permanently. But there was still a bypass valve, thank $deity.
So, to avoid ice forming in our office building because of the faulty valve, the building operator (yes, each building had one operator designated to it, for other systems beyond A/C) would just shut down the cold water over the weekend, but he usually forgot to turn it back on Monday mornings.
Guess what usually happened at Monday dawn.
They never fixed the valve, but reassigned the night-shift security to turn the water back on. Markings on the valve and on the wall behind it gave the exact amount of cold water to match the building's heat output (except on hot days, cold days, direct-sunlight days)...
We eventually found out about the faulty valve, and put ourselves to the task of opening the bypass valve to our generally desired temperature, night-shift security be damned. We even instructed the secretary on which valve to operate, as she was the first person on the premises Monday mornings.
A single quarter turn on the valve shifted the temps 2C, so you just turned the valve and waited a couple of hours.
Even our HVAC engineer admitted we managed the temps in our office better than the automatic valve did, because we changed it in advance, in anticipation of a sunny or cold day...
I worked (briefly) for a fast food company with nationwide chain stores, some selling a popular fried chicken and some a popular pizza.
The store "servers" were *purchased* as 4-year-old ex-lease desktops, usually placed on the top of the cupboards in the manager's "office" (a shoe-box annex to the kitchen area). No aircon, no grease filters, nada. It would regularly top 55 degrees up there, and that was outside the PC enclosure.
I had the idea of performing remote WMI queries on the hardware to see if fan speed was low and CPU temp high, which would indicate that the inside of the PC could be rendered down for one last serve of hot chips. But being ex-lease desktops they didn't even have WMI instrumentation built in. We just had to wait for them to fail, then have the store sell everything via manual transactions for a few days while we built, shipped and installed new 4-year-old ex-lease desktops by way of replacement.
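For what it's worth, the health check being described boils down to a pair of threshold comparisons once you have the readings. A sketch (the threshold values and sample readings are illustrative guesses; on real hardware the readings would come from WMI, IPMI or lm-sensors - exactly the instrumentation these boxes lacked):

```python
def looks_cooked(fan_rpm, cpu_temp_c, min_fan_rpm=1000, max_temp_c=85):
    """Flag a machine whose fan has slowed or stopped while the CPU
    runs hot - the combination that precedes a dead store server."""
    return fan_rpm < min_fan_rpm and cpu_temp_c > max_temp_c

print(looks_cooked(2400, 55))  # healthy box
print(looks_cooked(300, 92))   # candidate for the fryer
```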
I lasted three months before choosing not to renew my contract.
Edit for speling an grammer [sic]
Biting the hand that feeds IT © 1998–2019