Knowing just what breaks a storage box is of obvious interest to data center admins. It's quite reasonable to conclude the blame should be heaped on the 80-some platters spinning all day at 7200 RPMs. But a recent study presented at the USENIX Conference on File and Storage Technologies argues that disk failure isn't nearly the …
.....the short version is that "disks" aren't the only thing that can break in your average storage system?
Wow.....research money well spent. Any BOFH would be able to tell you that for a fraction of the cost.
27-68 percent?! Can't you get any ACCURATE results?!
"The research indicates between 27-68 per cent of storage subsystem failures come from physical interconnects."
Bit of a difference! Saying between a quarter and nearly three-quarters of failures come from interconnects is akin to saying between 1 and 100 per cent of disks fail!
CRAP STATISTICAL RESULTS!
Read the article. The long version is if you address Y and Z you can reduce failure rates by P and Q. The average BOFH couldn't tell you that, but might be interested to know, since Y and Z probably cost less than a new disc.
Now perhaps I have designed too many RAID groups, but if the AFR (sic) is higher in the backplanes and connecting wires, doesn't spanning racks increase the chance of failure? While it arguably reduces the risk of a single point of unavailability, it sharply increases the chance of partial failure inside a single storage array. It also adds complexity, and it still doesn't actually prevent an outage. A partial RAID group failure might as well be losing the whole thing unless the faulty rack in question is fairly intelligent. Even in a RAID 1 configuration, loss of the mirror would expose the RAID group to even greater danger were a disk to fail in the time it took to bring the other rack back online.
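The commenter's point can be put on a back-of-envelope footing. The sketch below uses made-up numbers (the 0.9 per cent disk AFR from the article, plus a purely hypothetical 2 per cent AFR per backplane/cable path) and assumes independent failures; it only illustrates that spanning racks adds components, and every added component is another way to lose a member of the group.

```python
# Sketch, not a real model: probability that at least one component in
# a RAID group's path fails in a year, assuming independent failures.

def p_any_failure(component_afrs):
    """P(at least one component fails) = 1 - P(all survive)."""
    p_all_survive = 1.0
    for p in component_afrs:
        p_all_survive *= (1.0 - p)
    return 1.0 - p_all_survive

DISK_AFR = 0.009          # 0.9% per disk, figure quoted in the article
INTERCONNECT_AFR = 0.02   # hypothetical 2% per backplane/cable path

# 14 disks behind one rack's backplane...
single_rack = p_any_failure([DISK_AFR] * 14 + [INTERCONNECT_AFR])
# ...versus the same 14 disks split across two racks: two backplanes.
spanning = p_any_failure([DISK_AFR] * 14 + [INTERCONNECT_AFR] * 2)

print(f"single rack: {single_rack:.3f}, spanning: {spanning:.3f}")
assert spanning > single_rack  # more parts in the path, more partial failures
```

With these numbers the spanning layout sees a noticeably higher chance of some partial failure per year, even though it removes the single point of unavailability.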
Let's see here, where are LXUCEXCH-Disks 1-7 and 8-14?
Replication is what they mean, spanning racks is what they say...or are they trying to sneak an academic paper by us using smoke and mirrors to cover up a marketing ploy?
Then again, I am not sure how exhaustively NetApp tests their kit before packing it up and sending it out the door... if not much, then you can count on a good number of failures from connecting wires, loose chip connections and/or faulty chips the first time they really get toasty.
My god, disks aren't the only component in a storage subsystem that fail? *gasp*
Cables? Power Supplies? Backplanes? My god, who knew that the other components weren't infallible too.
*sighs* It's taken them 44 months to find this out? Could they not have talked to the top-tier vendors (EMC, HDS, IBM et al) and asked what components fail the most?
Or possibly ask most storage administrators, we'd straight away tell you about HBAs, duff cables, dodgy backplanes, replication faults etc.
Quite frankly, if a dodgy power supply is killing your storage system, you really must be buying budget kit.
Is there going to be a paid 44 month study into what fails with tape backups? I'll quite happily get paid to help out on that one... I'm guessing it's not the tape 100% of the time.
leave it to the professionals
All this tells us is that when you leave storage management to people who "have a PC at home" (and therefore know about disks - ha!) you get amateurish, cobbled together systems using cheap components and lousy architectures that fall over, figuratively speaking, as soon as you look at them.
This study tells us more about the commoditisation of datacentres. Because eBay will sell you a disk for about the price of a packet of cornflakes, t'powers that be now object to paying realistic prices for _solutions_ to store their enterprise data.
What we need are some decent disasters (a few more minor earthquakes should do it) to literally shake up people's perception of storage and make them start doing it properly.
Re: "27-68 percent?! Can't you get any ACCURATE results?! "
That is the point, this is the range of failure rates across the different disk systems whose failure data was assessed. By giving the range they are giving you the accurate and useful statistical result and the information that different designs of storage shelf have substantially varying failure rates.
Re: "Could they not have talked to the top tier vendors (EMC, HDS, IBM et al) and ask what components fail the most?"
Nope, because many vendors don't want to admit what the real failure rates are as this would damage the marketing impression that if you spend 10 times as much money on big iron your data is perfectly safe.
Re: "but if the AFR (sic) is higher in the backplanes and connecting wires, doesn't spanning racks increase the chance of failure?"
Not if implemented properly. See the internal architecture of the Sun X4500. It is designed to let you lay out the 48 disks in RAID 6 (RAID-Z2) dual-parity sets that can survive either a controller or disk plane failure without losing access to an intact array. Yes, this exposes the RAID group to greater danger after an interconnect failure, but I would take that over loss of the RAID group, as you now have options for dealing with the failure that don't start with a restore from the last backup.
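The idea can be sketched in a few lines. This is an illustrative layout, not the actual X4500 wiring: 48 disks on 6 controllers of 8 disks each, grouped into 8 dual-parity (RAID-Z2) vdevs of 6 disks, one disk per controller. Any single controller failure then costs each vdev exactly one disk, which dual parity survives.

```python
# Hypothetical layout: vdev v takes disk slot v from every controller,
# so each of the 8 vdevs spans all 6 controllers (8 x 6 = 48 disks).
CONTROLLERS, DISKS_PER_CONTROLLER, PARITY = 6, 8, 2

vdevs = [
    [(ctrl, v) for ctrl in range(CONTROLLERS)]
    for v in range(DISKS_PER_CONTROLLER)
]

def survives_controller_loss(vdevs, dead_ctrl):
    """A RAID-Z2 vdev stays intact if it loses at most PARITY disks."""
    return all(
        sum(1 for (ctrl, _) in vdev if ctrl == dead_ctrl) <= PARITY
        for vdev in vdevs
    )

# Every possible single-controller failure leaves every vdev readable.
assert all(survives_controller_loss(vdevs, c) for c in range(CONTROLLERS))
print("all vdevs survive any single controller failure")
```

The design choice is the interleaving: because no vdev has more than one disk on any controller, a controller is never a single point of data loss, only of reduced redundancy.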
Re: "leave it to the professionals"
Well, that is rather the problem: many of the 'professionals' are still promoting high-margin, monolithic big iron that takes the 'throw more tin at the problem' approach to storage reliability instead of innovation and science. See http://storagemojo.com/2008/02/24/why-do-storage-systems-fail/ for some examples. The sort of technology that eBay and Google have used brings real business benefit; when implemented properly it beats big iron on both price and Mean Time To Data Loss (forget MTBF, that is a distraction and frequently a fiction). If you can store the vast volumes of data that these operators do for a tenth or less of the cost of 'enterprise'-priced storage systems, then you are more competitive. Stand back and watch distributed-parity storage systems with smart software such as ZFS take over from the expensive tin as the market starts to understand storage reliability and cost.
When you are explaining to the end customer that their system has gone down due to a storage problem, do you tell them all about interconnects, HBA ports, SAN switches, fibrelinks, multipathing software etc. etc. or do you tell them that there was a problem with the disk? Unless there is a good reason to go into the nitty-gritty you tell them that there has been a disk failure.
As for people designing RAID systems but not multipathing to them, I don't have words. Why would you mitigate the failure of disks, but not the failure of (usually, these days) fibres, which are much more fragile and sit under the floor, just waiting for a tile to be dropped onto them? Even if you are using copper, you need a good reason not to multipath.
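The case for multipathing fits in one line of probability. Assuming a purely hypothetical per-path failure probability and independent paths (each with its own HBA port, cable and switch port), both paths must fail before connectivity is lost:

```python
# Illustrative only: assumed chance that one complete path (HBA port,
# cable, switch port) fails in some window.
p_path = 0.02

p_single = p_path           # one path: any single fault cuts you off
p_multi = p_path * p_path   # two independent paths: both must fail

print(f"single path: {p_single:.4f}, dual path: {p_multi:.4f}")
assert p_multi < p_single / 10  # orders of magnitude less exposure
```

The caveat baked into the assumption: the benefit only holds if the paths really are independent, i.e. they don't share a tray under the same floor tile.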
Replacing disks that aren't faulty? wtf?
So this study thinks that people are replacing disks after someone has unplugged a fibre cable or something else? That's just daft. They are either a) idiots or b) using some cobbled-together (whether by themselves or an external company) shit that is not fit for purpose. When users are pulling disks on their storage to fix a connectivity problem or something like that, then that is obviously a "training opportunity" making itself known.
We use high-end kit, and disk/board/power components are (almost) all capable of being non-disruptively replaced with no risk of data loss. We have seen that on very rare occasions the vendor cannot be 100 per cent certain what is causing a problem on an array and will spell out what steps will be taken. They usually start with the most probable faulty part(s), then move on to the easiest to replace ..... but never, NEVER, would they suggest "let's start pulling disks randomly to see if that fixes it".
Muppets and data centres don't mix well .... that's why it's best to keep the damn management out of them ;)
Leave it to the Military?
When I worked for a MOD hardware company, we knew the hardest part to get reliable was connectors.
Domestic goods have little to no gold plating on connectors; you need bendy materials to take the shock of thermal expansion, and thick wires with inspected pins where the wire mounts onto the pin in the connector.
Power supplies of course have to be of high quality. But you need after-sales support for at least five years on the framework around the storage device too.
Maybe the Aviation industry standards should be looked at if Mil-spec is too high.
After all they have to have very reliable wiring and backplane framework in commercial planes.
Personally I'd like to see vibration/shock sensors in HDs.
It is likely that in 10 years we'll all be using solid-state devices, so failure will be mainly due to vibration and thermal cycling.
5 nines availability??
"As an example, in low-end storage systems (defined as having embedded storage heads with shelf enclosures) the annualized failure rate (AFR) is about 4.6 per cent. The AFR for the disks only is 0.9 per cent, or only 20 per cent of overall AFR."
I bet the manufacturers still claim five nines availability though.
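The arithmetic in the quoted figures does hold up, for what it's worth:

```python
# Figures quoted from the study for low-end systems.
overall_afr = 4.6   # per cent, annualized failure rate of the whole system
disk_afr = 0.9      # per cent, disks alone

disk_share = disk_afr / overall_afr * 100
print(f"disks: {disk_share:.1f}% of overall AFR")  # ~19.6%, i.e. "about 20%"
assert round(disk_share) == 20
```

Which is exactly the study's headline: roughly four-fifths of what gets logged as "storage failure" isn't the disks.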
Spinning all day...
"Knowing just what breaks a storage box is of obvious interest to data center admins. It's quite reasonable to conclude the blame should be heaped on the 80-some platters spinning all day at 7200 RPMs"
I think you'll find that in most datacentres the majority of disks are spinning all day (and night) long at 10,000-15,000 rpm.
You tend to find 7,200 rpm disks in 'commodity' desktop machines.
Didn't have the money to replace a drive when an interconnect failed
Ok, so anytime I was presented with a downed server or storage device, I did not have the budget to just replace the hard drive without identifying the cause.
Yes, I've seen interconnect failure. More often I've seen power supply failure. And more often than that, physical hard drive failure: either the motor/bearing wouldn't spin the drive, a click-of-death type thing (who knows where the drive head is, but it isn't where it's supposed to be), or an actual head crash into the platter.
So someone who didn't have the money to just replace a whole drive mechanism when something went wrong can tell you, yep, other things can go wrong. But the majority of the time it was a physical hard drive failure.
Steve Jobs because his iHardDrives will never fail.
Failure does not necessarily = outage
Note that these systems use RAID disk so a disk failure, which happens regularly, does not equal an outage, nor does cable, SFP, controller etc. failure because these systems are redundant. I see user error, like incorrect configurations, not updating code or updating it incorrectly etc. as the #1 cause of true outages where users cannot access data, so I am not exactly sure what this study is saying because hardware failure is rarely the cause of an outage.
I will agree that many disk failures are not really disk failures at all, but simply disks that are erroneously flagged as failed by the system because of errant code, or another device like a cable or SFP inserting errors into the system. Many "defective" drives that are called in for replacement are actually fine.
Mine is the one with the hot spare drive in the pocket.
What systems are they testing this theory of theirs on?
I can tell you right now that disk failures make up about 35-55% of storage failures.
The only reasons to misdiagnose a disk failure are when the control card is bad in a very specific way, so that it brings only one disk down instead of multiples, and when the enclosure slot itself is dead. Those two would make up under 5 per cent of all disk failures.
I am talking about EMC equipment that already utilizes multiple host paths as well as multiple paths to each disk within the enclosure itself.
Not a very useful study, unless your storage system was bought in an electronics store like Best Buy.
Their data source is the NetApp AutoSupport database.
... that a lot of claimed "disk failures" are nothing of the sort.
However, it's just easier to pull the disc and get it replaced if it's got corrupted.
Home users, I'd bet, tend to just blame the disc, and professional storage admins are likely under pressure to get the thing fixed quickly. If it's random corruption from a software glitch, then they'll likely just assume replacing the disc fixed it.