Power supplies used by HP ProLiant DL380 G5 rack-mounted servers may fail if they're left dormant for long periods of time in certain environments, according to the experience of one customer and comments from HP. This could be a problem for servers that are only powered up for failovers and failover testing. A reader of The …
Oh come on...
"It's hard to tell what actually caused the issue, but HP believes it was environmental – that the power supplies or the servers were stored in an area of high humidity, water, etc."
From what the article suggests, the servers were in racks in a data-centre, not on the floor in a cellar. Are Parisian data-centres really likely to be any more wet or humid than any other data-centre??
Maybe by high they mean >0%
Maybe their power supplies got wet in a different Paris.
if perplexed, see icon.
..that for a dollar!
What else would you expect...
...from HP support? Had a long argument with them once about their 32bit version of a driver for a particular printer missing a feature present in the 64bit version and was told it was a "problem with the 32 Windows" (the feature in question was to do with reducing print margins so that it would get closer than 20mm to the edge). Ended up at the complaints line with the "most senior" person who basically said "tough, wait for a new driver which may never be created, go away I don't care".
"HP believes it was environmental – that the power supplies or the servers were stored in an area of high humidity, water, etc" - maybe they've taken water cooling to the extreme in Paris and just flooded the data floors...
To be fair..
dude... HP as a printer company is a totally different beast from HP as an enterprise datacenter vendor (much higher accountability should be expected).
Re: To be fair...
Yep. HP's printer warranty department is lightyears ahead of the consumer PC/laptop department.
We usually get a brand new printer shipped when customers have warranty issues with printers.
Laptops/desktops on the other hand - let's just say one customer with ON SITE NBD warranty was required to take the computer 100km to an authorized repairer, and ended up not having the computer for nearly 3 months.
I thankfully went with Dell for my home server - 3 years NBD onsite, and I know Dell come to town (same company as the "local" HP repairers, but Dell's travel distances are more sane).
It's probably the capacitors and bad design causing the failures. I'd imagine that after sitting discharged for extended periods, on power-up they momentarily present a dead-short load until they have a bit of charge in them, and the resulting inrush current is doing damage somewhere. There are design measures to mitigate inrush from a cold start, but if these are the same supplies as the G5s I look after, the quality isn't up to much. I've had a few where the output rectification diodes shorted, killing the mobo and HDDs just out of warranty, and it's £850 for a new mobo for these. Thanks HP. At least the G4's biggest problem was the electrolytics on the mobo itself; that was fairly cheap to rectify, if a bit of a pig to solder.
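To put rough numbers on the inrush point above: a fully discharged bulk capacitor looks like a short for the first instant, so the peak current is limited only by whatever series resistance is in the path. This is a minimal sketch, and every value in it is an illustrative assumption (230 V mains, a 470 µF bulk cap, 0.5 ohm of stray resistance versus a 10 ohm cold NTC limiter), not anything measured on a DL380 G5 supply.

```python
import math


def peak_inrush_current(v_rms, series_resistance_ohm):
    """Worst-case inrush into a fully discharged capacitor: I = V_peak / R."""
    # Worst case is switch-on at the crest of the mains waveform.
    v_peak = v_rms * math.sqrt(2)
    return v_peak / series_resistance_ohm


def rc_time_constant(series_resistance_ohm, capacitance_farad):
    """Time constant (seconds) of the initial charging path."""
    return series_resistance_ohm * capacitance_farad


# All values below are assumptions chosen only for illustration.
MAINS_V_RMS = 230.0   # assumed mains voltage
R_NO_LIMITER = 0.5    # assumed stray resistance with no inrush limiter (ohms)
R_WITH_NTC = 10.0     # assumed cold resistance of an NTC limiter (ohms)
BULK_CAP = 470e-6     # assumed bulk capacitor (farads)

if __name__ == "__main__":
    for label, r in (("no limiter", R_NO_LIMITER), ("cold NTC", R_WITH_NTC)):
        amps = peak_inrush_current(MAINS_V_RMS, r)
        tau_ms = rc_time_constant(r, BULK_CAP) * 1000
        print(f"{label}: peak ~{amps:.0f} A, time constant ~{tau_ms:.2f} ms")
```

The point of the comparison is only that the limiter trades a much lower peak for a longer charge time; it says nothing about which part actually fails in these supplies.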
I've had computers sitting around for years not being used, plug them in and they fire up fine. What kind of weird crap are they doing to get that to occur? Especially in a datacenter, I've had boxes running at a colo in Dallas for years, temperature remained 65F the entire time. When we went to check on them, there was not a speck of dust anywhere on them. I cannot fathom how being turned off in an environment like that could cause failure.
They're not *off* off.
They're left on standby. The system is still powered up insofar as the power switch is lit up, and the iLO board is still running. I know that the PSUs in the DL380G5s that I run make a nasty buzzing noise when the systems are in standby. They settle right down when powered up (and it's not distressingly loud anyway), but it's there.
Glad all of mine stay on!
Who was fibbing?
Is this the HP VP of marketing telling porkies when the power supplies really are faulty, or was it the help team thinking up the first excuse they could to explain the failure - as soon as they heard the kit had been left powered down it was the old "(sucking air through the teeth) oh well, that explains it mate, you can't do that or you're asking for trouble".
ProLiants have always had issues
With over 20 years in IT, I have ALWAYS run into strange and unusual issues with ProLiant servers.
Back with Compaq, it was always an adventure doing firmware/software updates. I quit counting the number of times the systems would just die during or after the update. I dreaded the process. It would always end up with a call to support, and would inevitably end with a visit from a field tech with new parts (firmware update failed and no recovery option), or having to wipe the system and re-install, or restore, EVERYTHING (driver or Compaq software would conflict with the O/S). Thank DOG I always ran backups.
It always dumbfounded me that during the initial call the support person would always say "Oh yea, we have had issues with that." I would come back with "Why the $%^@ would you put it out when you knew about it?!?! Or at least put a warning in the notes!!!!" They couldn't ever give an answer.
I was excited when HP came in, hoping they would drop the ProLiant crap, but alas they dropped the old HP line of servers and kept the ComCrap line. HP went to the bottom of my list of viable equipment suppliers.
Sounds like bullsh!t to me.
And just as a heads up - DL380 G5 power supplies don't just die from being unused. I moved our pair of DL380's from one spot on the rack to another, and once installed in their new location only one power supply was functional - and that one was replaced when it failed after being in service for six months. All of the other original supplies went from working to dead in a space of 45 minutes.
Still waiting on replacements.
The Most Likely Cause
Is our old friend the electrolytic capacitor. Although our servers rarely failed (they were never powered down unless for maintenance), we had lots of 2 year old desktops fail to start after holidays. They were under warranty so Dell supplied replacements for all the failed power supplies (well over a hundred of them).
I examined one of the failed supplies and found that the cause was faulty capacitors. This is very common nowadays: the power supplies run at fairly high temperatures, and when they are switched off the capacitors die as they cool down, so the supply won't start when power is re-applied. In one case I started one by warming it up with a fan heater and then powering it up -- that got the machine running until a replacement supply was available.
I have seen the same thing in other equipment as well. Sometimes after a power outage some switches would fail to power back up; the cause was faulty caps in the power supplies, and I would just replace the caps with good-quality high-temp ones and get the thing back up and running in an hour or so. Often part of the cause was higher-than-normal temps in the power supply due to seized fans.
Electrolytics need voltage
The electrolytic capacitors need a voltage to maintain the Al2O3 layer that acts as the dielectric between the aluminum and the conducting paste that are the two plates. You can reform the layer by gradually increasing the DC voltage applied on startup. This is a standard technique when examining old antiques.
Really old antiques?
"This is a standard technique when examining old antiques"
Old antique being a relative term, sitting at my 17th Century desk in my 19th Century house :)
<<The electrolytic capacitors need a voltage to maintain the Al2O3 layer that acts as the dielectric between the aluminum and the conducting paste that are the two plates. You can reform the layer by gradually increasing the DC voltage applied on startup. This is a standard technique when examining old antiques.>>
Yes, that's why they haven't made capacitors like the ones in vintage radios since the 1930s.
Modern capacitors only need that treatment if faulty anyway.
Seems to be batches of Capacitors out there made with dodgy ingredients.
I share this guy's experience: Would anyone at HP like to take a look at my supplies?
During a weekend power-down for electrical testing, our two Proliant G5s were unplugged from the UPS and left disconnected for two days. When reconnected - I couldn't get them to start, no LEDs, no fans, nothing at all. All four power supplies seemed completely dead. Since they were out of warranty, I ordered four replacements (cursing mildly), but after 36 hours (plugged in) - all had returned to life. Environmental conditions might be one possibility, but so is NBTI wear-out in the power electronics. But the G5's are four years old - and our newer Proliant 380 G6/G7 which have a different common power supply were unaffected. It would be nice to know my power supplies will last for another two years.
Whilst a full scale failover is to be applauded (but try getting approval for that on 24 hour systems), the idea of just powering servers up once a year is none too clever. As a confidence measure it would be extremely sensible to test them at least monthly. Even if it's not possible to test the full app, it's still possible to test basic hardware operation, connectivity and the like.
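The monthly power-on check is easy to keep honest if something flags the boxes that have gone too long without one. A minimal sketch, assuming you already record a last-tested date per standby server somewhere; the hostnames, the dates and the 31-day threshold are all made up for illustration.

```python
from datetime import date, timedelta


def overdue_for_power_on_test(last_tested, today, max_age_days=31):
    """Return hostnames whose last power-on test is older than max_age_days.

    last_tested maps hostname -> date of the last successful power-on test.
    """
    limit = timedelta(days=max_age_days)
    return sorted(
        host for host, tested in last_tested.items() if today - tested > limit
    )


if __name__ == "__main__":
    # Hypothetical standby servers and test dates.
    records = {
        "dr-web-01": date(2011, 1, 5),
        "dr-db-01": date(2011, 4, 1),
    }
    for host in overdue_for_power_on_test(records, date(2011, 4, 26)):
        print(f"{host} is overdue for a power-on test")
```

This only tells you which servers to go and power up; the actual hardware and connectivity checks still have to be done on the box.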
I smell bullshit
.. and/or bad design.
I can't believe anyone designs a REAL power supply these days that doesn't soft-start.
If moisture is causing problems then the boards haven't been properly varnished/sealed.
Electrolytics failing through decaying oxide layer was a thing of the past in 1980! Far more likely is that they are under specced for ripple current and are overheating then drying out.
Quite often the 'start' circuit is basically a capacitor and high value resistor in series. Frequently the designers chose a resistor that is at the limit of its power rating - with obvious consequences.
DL380 G5 power supplies had a problem
We went through a number of them in the first year. Repeated failures (the replacement part gave out after an initial failure). Some time in (might've been about a year) we started getting replacement power supplies that were obviously a different revision than the previous part, and we had almost no problems after that. Our newest DL380 G5's have never had a ps failure. We had 10 or so of them, mostly used as VMware hosts.
We haven't had any issues with the G6/G7 models.
I'd start comparing power supplies that failed vs. ones that haven't. There may be obvious differences you can spot in the power supplies (labeling or visible components).
Problem also in Aruba with DL380G5
I got the same problem here in Aruba. My servers were unplugged for a long weekend, and today when I started them back up, one of my DL380G5 power supplies was down.
Luckily my system is redundant with 2 power supplies so I am running, but this problem should be investigated by HP. My system was dormant for 4 days and oops, there went one of my power supplies.
Greetings from Aruba
I once shut down a DL110 so I could remove the heatsink from the processor to check its stepping in order to purchase an identical one for the second CPU socket. This was done under proper servicing conditions (antistatic wrist strap) and the system was down for all of about 3 minutes but when powered up again the system board 'red lighted' and that was the end of it - I couldn't get the system working again so the drives were hastily moved to a standby (non-HP) server and the Proliant scrapped.
cpu details including stepping are iirc available to the OS
Various utilities should provide that for you.
You'd need to take it down to swap cpus so I guess it still gets its chance to die.
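For what it's worth, on a Linux box the stepping is right there in /proc/cpuinfo, no heatsink removal required. A minimal sketch, assuming Linux and the usual "key : value" layout with a blank line between processors; the helper names are my own, and other OSes need a different tool.

```python
from pathlib import Path


def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict per logical processor."""
    cpus = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current processor block.
            if current:
                cpus.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        cpus.append(current)
    return cpus


def cpu_steppings(text):
    """Return (model name, stepping) for each logical processor found."""
    return [
        (cpu.get("model name", "unknown"), cpu.get("stepping", "unknown"))
        for cpu in parse_cpuinfo(text)
    ]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux only
    if cpuinfo.exists():
        for model, stepping in cpu_steppings(cpuinfo.read_text()):
            print(f"{model}: stepping {stepping}")
```

The family and model fields sit in the same block if you need those to match the second CPU as well.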
Happening since 2007
We saw this back in late 2007. There's a long thread on HP's support forum about this ...
Once we got the problematic PSUs swapped with different revs we haven't had a problem (yes the occasional PSU fails but not a silly number).
Our G5 servers do get left switched off for quite some time while still plugged in. The problem I've put this down to is that the PSU seems to get very hot when the server is off. Obviously there is power still going to the motherboard for ILO, etc but as there is no PSU fan activity, there is no cooling. When you first power on the server, the PSU fans are then running overtime to cool itself down because of the heat buildup.
The G6 and G7 don't seem to have this problem anymore.
Could it be heat?
I don't know if any of you have ever felt the heat coming off a G5 power supply when it's on standby but it is HOT. I mean really hot. There aren't any fans active to blow air through and there's something inside still generating a lot of heat.
If these servers had been left for an extended period like this my guess is they just cooked.
"HP believes this was an isolated case...."
Hmm, ten isolated cases all in one place. How unlucky is that?
I had a no-name homebrew style box as a server at my previous job. Ran no worries for about 2 years prior to my arrival, and another 12 months after I started. Shut it down one day to move it, and 10 minutes later, it wouldn't turn on.
In my case, popped caps in PSU and on the mobo. Stable until it got switched off.
Why we beat our hp field engineers regularly.
I feel quite left out as we don't have any old, failed PSUs to moan about. But then that may be because we have regular meetings with our local hp field engineer, who keeps an eye on all our kit and makes sure the firmware is up-to-date, and anything listed as potentially "dodgy" in the hp engineering bulletins is scheduled for replacement or update before they go on the fritz. He seems to think this is a better idea than just letting everything that is going to cause problems happen whenever (usually 2am on a Sunday morning). Admittedly, we pay a bit extra for the hp support that delivers this service. Strangely enough, this type of proactive system management means I don't have any PSU failures to moan about. Can I suggest a company that pays enough to afford international failover should also spend a bit more on proactive system management, or at least some admins with a clue?
Really. Glad you work in an environment where your job is done for you.
Known problem, http://forums11.itrc.hp.com/service/forums/questionanswer.do?admit=109447626+1303807984398+28353475&threadId=1118732
Bad idea anyway
Keeping DR servers powered off until you need them is a false economy, and very poor DR practice in any case. Any sysadmin will tell you that more system failures (not just power supplies) happen at power-on than any other time.
It's also when you find out that someone "borrowed" the network cable from the obviously-unused system, or that you missed the daylight-savings update, or...
The whole point of a DR site is that it is up and running, so you can use server monitoring software to check the health of your servers. If you do end up with a disaster and have to power on a dormant site, this is exactly the sort of problem you can run into. You may have your disk systems powered and replicating, but it's not much good if your servers have all quietly failed due to dodgy PSUs, broken internal disks, failed HBAs, broken optical fibres etc. Furthermore, unless you boot from SAN you're not going to have any software updates applied to your servers, so they'll come up with an out of date OS image.
This is pretty poor practice on behalf of the company and HP are being blamed for causing a problem that should never have come about in the first place. Yes, dodgy PSUs are HP's problem, but best practice says they shouldn't have been in a position to cause this failure in the first place.
Why would you need to even open the box, let alone remove the heatsink from a CPU, in order to find out the processor's stepping? I'm sorry to say, but you brought this problem on yourself - you should never remove the heatsink from a server CPU; you probably knackered up the heat transfer layer and didn't reseat the processor properly. This is a classic problem when replacing desktop processors and re-using an old heatsink.
Err - what?
"The whole point of a DR site is that it is up and running, so you can use server monitoring software to check the health of your servers"
Strictly speaking no - the point of DR in general is to decide what is required for the business and if the business can do a move to a DR site in such a way it's valid (in many DR scenarios the staff have to leave anyway so regrouping on a new infrastructure 12 hours later may be appropriate)
Resilience is keeping everything on, which is part of DR but isn't the be-all and end-all. It's harsh to criticise a working method that is better in some ways for the environment and for costs, when some DR providers charge for power use in standby.
I've seen "DR" where the answer is "keep everything resilient". Unfortunately, no-one took into account a scenario where no-one could get onto the systems even though they were up.
Proliant PSU's and Proliants in general
We have lots of ProLiants of various flavours and have not had a problem whilst running, but it's notable that we recently moved 3 servers between sites and back (easier to do the rebuild over a few days and re-deploy).
We lost 2 out of 6 PSU units and had 3 dead drives as a result. Moral.. don't switch them off unless you're desperate!