Another reason not to blade everything
all eggs in one basket...
A power supply failure in HP BladeSystem c7000 enclosures can cause the whole BladeSystem to fail, the firm has admitted. According to an HP advisory note: "HP has identified a potential, yet extremely rare issue with HP BladeSystem c7000 Enclosure 2250W Hot-Plug Power Supplies manufactured prior to March 20, 2008. "This …
Yep. That's me. What kind of organisation doesn't use some form of mesh power distribution in critical systems?
Hint: Racal were doing that in the 1970s
We replaced all of ours in November last year after this advice was released.
Each blade contains several virtual servers and
each server handles several applications and
one rare error happens to occur and
The whole shebang goes totally titsup.
You gotta love progress.
Not all blades are created equal, though... some are more redundant than others.
So did they redesign them so a failed power supply will isolate itself, and not blow the whole power bus... or did they just find the early ones had a higher failure rate than they'd like and replace them? The former would be great. The latter helps the immediate problem, but still it wouldn't be too cool if (5+ years down the road) the system goes back from high-availability to low availability as the power supplies age.
According to the article, replacing power supplies doesn't seem to solve the design problem; it just increases the mean time between failures.
I know that IBM blades have 4 p/s - two load balanced for each half of the chassis. Dell's blades contain 6 power supplies and are fully load-balanced and redundant. Sounds like they have a better design here.
It's not at all uncommon for manufacturers of 'highly available' hardware to have failures like this, particularly in the power system.
Actually, thinking back, I can't remember dealing with any company that has not caused me an outage due to 'redundant' power failing.
Nonetheless, it's a good advertisement for ensuring that you architect properly and don't rely on a single blade chassis. Or rack. Or power supply. Or datacentre. Or power company. Or comms company. Or building. Or site on the wrong side of a fault-line. Or...
True, not all blades are created equal. Check your favourite vendor's blade architecture:
a) Does the blade have only one power connector?
b) Can the power supplies be configured to do N+1 redundancy?
If the answer to either is "Yes", you have blades that are dependent on a single DC power bus. A power supply fault on the DC side can kill the DC bus and take down everything powered by it.
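For anyone who wants to run that checklist over a list of candidate chassis, here is a minimal sketch. The function name and parameters are made up for illustration; the rule itself is just the two questions above, nothing from any vendor's documentation.

```python
def depends_on_single_dc_bus(power_connectors_per_blade: int,
                             supports_n_plus_1: bool) -> bool:
    """Apply the two checklist questions from the comment above.

    Hypothetical helper: returns True if the answer to either question
    is "Yes", i.e. the blade has only one power connector, or the power
    supplies can be configured for N+1 redundancy.
    """
    has_one_connector = power_connectors_per_blade == 1
    return has_one_connector or supports_n_plus_1


# Illustrative inputs only, not measurements of any real product.
print(depends_on_single_dc_bus(1, True))   # one connector, N+1 capable
print(depends_on_single_dc_bus(2, False))  # two connectors, no N+1 option
```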
This single DC power bus design is clearly in the HP C7000. It is also in the Dell M1000e (see page 32 in http://www.dell.com/downloads/global/products/pedge/en/pedge_m1000e_white_paper.pdf).
Changing to better power supplies might reduce the risk, but cannot eliminate it, because the SPOF is not in the power supplies; it's in the enclosure midplane's single DC bus. Good luck to all the people who have heeded the HP recall, and don't be surprised if the problem resurfaces as the power supplies age.
The only way to eliminate this SPOF is to duplicate the DC bus, have half the power supplies on one and half on the other, then connect each blade to both DC buses. This duplicate power midplane is not a new design; it's been around since November 2002 in another vendor's (IBM) blade chassis. Why HP and Dell have not copied it is a mystery.
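To make the single-bus versus dual-bus argument concrete, here is a toy model. It assumes, as the comment above claims, that a DC-side fault in one supply takes down the whole bus it feeds. The enclosure layouts, names and blade counts are hypothetical and only illustrate the topology, not any real product.

```python
from typing import Dict, List


def blades_lost(psu_to_bus: Dict[str, str],
                blade_to_buses: Dict[str, List[str]],
                faulty_psu: str) -> List[str]:
    """Blades left with no live DC bus after faulty_psu kills its bus.

    Assumption (from the comment, not vendor docs): one DC-side fault
    takes down the entire bus that the faulty supply is attached to.
    """
    dead_bus = psu_to_bus[faulty_psu]
    return [blade for blade, buses in blade_to_buses.items()
            if all(bus == dead_bus for bus in buses)]


def worst_single_fault(psu_to_bus: Dict[str, str],
                       blade_to_buses: Dict[str, List[str]]) -> int:
    """Largest number of blades lost to any one power supply fault."""
    return max(len(blades_lost(psu_to_bus, blade_to_buses, psu))
               for psu in psu_to_bus)


# Hypothetical single-bus enclosure: six supplies, one shared DC bus.
single_bus_psus = {f"psu{i}": "bus0" for i in range(6)}
single_bus_blades = {f"blade{i}": ["bus0"] for i in range(16)}

# Hypothetical dual-bus enclosure: supplies split, every blade on both buses.
dual_bus_psus = {"psu0": "busA", "psu1": "busA", "psu2": "busB", "psu3": "busB"}
dual_bus_blades = {f"blade{i}": ["busA", "busB"] for i in range(16)}

print(worst_single_fault(single_bus_psus, single_bus_blades))  # 16
print(worst_single_fault(dual_bus_psus, dual_bus_blades))      # 0
```

Under that assumption, adding more supplies to the single bus does not change the first number; only connecting each blade to a second bus does.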
The IBM blade chassis has 4 power supplies and if I lose the wrong two, I lose power to half my blades. There's no mystery - HP and Dell didn't copy this because it's an inferior design.
Furthermore, IBM also has a single active midplane, which is an even greater SPOF.
I disagree with A/Coward. Two power supplies with a redundant power domain are clearly proven to be more reliable than multiple power supplies in a single power domain. The reason HP/Dell use a single domain (single power connector on each blade) is to get greater density. No matter how well you design components, sometimes bad things happen. That's why you have redundant everything, especially in a chassis that hosts multiple physical servers, each of them hosting multiple virtual servers. HP will "fix" the power supplies, but the design compromise remains, and power supplies will still fail outside this "bad batch" problem.
To Anonymous Coward (Posted Thursday 15th January 2009 13:13 GMT): SPOF means Single Point of Failure. Losing two power supplies is Multiple Points of Failure, which is far less probable. Even so, in the IBM BladeCenter, the power supply pair that must fail together to take the enclosure down are connected to separate power harnesses, with each power harness meant to connect to separate external power grids. Therefore the failure of any single power grid will only take out one member of each of the two power supply pairs, leaving the enclosure running.
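The wiring described above can be checked by brute force. This is a sketch using made-up labels (PSU1 to PSU4, domain1/domain2, gridA/gridB) for the four-supply layout as the comment describes it; it is not taken from IBM documentation.

```python
from itertools import combinations

# Hypothetical labels: each supply feeds one power domain and is fed by one grid.
PSUS = {
    "PSU1": ("domain1", "gridA"),
    "PSU2": ("domain1", "gridB"),
    "PSU3": ("domain2", "gridA"),
    "PSU4": ("domain2", "gridB"),
}


def domains_down(failed_psus):
    """Power domains left with no working supply after the given failures."""
    all_domains = {domain for domain, _ in PSUS.values()}
    alive = {domain for name, (domain, _) in PSUS.items()
             if name not in failed_psus}
    return all_domains - alive


def psus_on_grid(grid):
    """Supplies that lose input power when the given grid fails."""
    return {name for name, (_, g) in PSUS.items() if g == grid}


# Losing either external grid removes one supply from each pair.
grid_outages = {grid: domains_down(psus_on_grid(grid))
                for grid in ("gridA", "gridB")}

# Which double supply failures take a domain down?
fatal_pairs = [pair for pair in combinations(sorted(PSUS), 2)
               if domains_down(set(pair))]

print(grid_outages)  # {'gridA': set(), 'gridB': set()}
print(fatal_pairs)   # [('PSU1', 'PSU2'), ('PSU3', 'PSU4')]
```

So in this model no single grid outage drops a domain, and only two of the six possible double supply failures do, which is the "wrong two" case raised earlier in the thread.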
HP and Dell have also adopted the same design in splitting the six power supply inputs over two power grids. Unfortunately, instead of extending this redundancy to the DC side of the power supplies, they all converge onto one DC bus on one midplane.
The IBM BladeCenter has two separate midplanes, each one with its own DC bus. That's why all IBM blade servers have two power connectors: they draw power from two separate DC buses. I don't know if the active components you speak of are hardware monitors or in the data and power paths. Whatever the case may be, there are two duplicate sets because there are two midplanes, so again, no SPOF.
Was the HP power supply recall a result of a bad batch of power supplies? That does happen to every vendor from time to time, so it is plausible that this is just bad luck. However, the recall affects all power supplies for the C7000 manufactured before 20 March 2008, that is, since the launch of the C7000 in 2006. By their own calculation, HP claim to have shipped more than a million blades. Considering e-class, p-class and c-class, c-class is by far the most successful and would account for 500,000 blades or more. Assuming 6 power supplies for every 16 blades, that's roughly 187,500 power supplies! That is not a bad batch, it's an expensive design flaw. A profit-making company would not make that kind of recall unless not doing it would cost even more. It makes you wonder about HP's definition of "extremely rare".
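The back-of-envelope estimate above is easy to reproduce. All three inputs are the commenter's assumptions (500,000 c-class blades, 16 blades and 6 supplies per enclosure), not HP figures.

```python
# Inputs are the commenter's assumptions, not published HP numbers.
c_class_blades = 500_000
blades_per_enclosure = 16
psus_per_enclosure = 6

# Assumes every enclosure is fully populated with blades and supplies.
enclosures = c_class_blades // blades_per_enclosure
affected_psus = enclosures * psus_per_enclosure

print(enclosures)     # 31250
print(affected_psus)  # 187500
```

Partly filled enclosures would push the supply count higher for the same number of blades, so if anything this is a floor under those assumptions.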
Unfortunately, the design flaw is not in the power supply (I would expect HP to be capable of making power supplies as good as IBM's) but in not having a redundant DC bus. To fix this is a lot harder, because the midplane would need to be changed and a redundant power connector has to be added to every blade. This is a whole new architecture which would be incompatible with existing blades, something HP would be loath to do given that e-class, p-class and c-class blades are mutually incompatible.
So rather than fix the real problem, HP have elected to issue improved power supplies (probably with better DC fault isolation) to reduce the probability of failure. It's like issuing a recall on all cars to upgrade the suspension rather than fixing the potholes in the road that are causing the crashes in the first place. I can understand why they have done this, but it certainly convinces me that my VMware cluster is going to be deployed on rack mount servers rather than blades, or at least not on HP or Dell blades anyway.