The RAID industry standard for storage is RAID-6, with recovery from a double drive failure. But it's not going to be good enough as disk capacities increase, prolonging failed disk rebuild times and so lengthening the window of unrecoverable failure if a third disk fails before the recovery from a double drive failure is …
Um, I don't know what industry he is talking about, but most shops I have worked with (and I have worked with a few) use RAID 1+0 (striped mirrors.) Disk is cheap, and RAID 5 often sucks for write performance (especially without hardware) - and write performance is what most people want, these days. On the *odd* occasion, we have used mirrored RAID 5/6, for the absolutely paranoid - but when you start looking for the odds of three / five particular disks of a set failing within such a tight window (especially given the availability of hot spares), you should perhaps invest the time in looking at far more likely risk factors - such as a Boeing landing in your data centre. If your data is all in one building - and your company cannot live without it being live, you will be screwed anyway. (A campus cluster is what you should be looking at, with SRDF or similar disk mirroring across sites, with a dedicated fibre connection.)
(Actually, on that last point, one of my past clients relocated some of their Amsterdam data centres to be further away from Schiphol Airport, for precisely that reason.)
Personally, I think this RAID 5-6-7 bollocks is insane. The answer is simple: Use something like RAID 5 if you have to, but keep it mirrored - and keep a RAID1+0 / RAID 5 array of SSDs as a hot spare. Then make sure that when a disk fails, the data gets copied to these SSDs as fast as possible. Then when you have redundancy, copy the data back to a spindle-based drive. Doing it this way will be much faster, because SSD write performance is way faster, and the SSD array can cope with a large number of seeks without impacting on the copy-to-spindle performance. In the end, SSDs are going to overtake spindle-based storage (the writing is already on the wall for notebook drives) - and nobody is going to care about double redundancy or triple redundancy RAID if they just mirror the data and have a duff drive replaced by a hot spare in less than five minutes.
The more room the more crap.
Companies are wasting massive amounts of money storing information they think they need, writing to disks, then replicating & backing up.
And when the time comes to recover ... ooops, the file system is no longer supported and the piece of software that wrote the data is no longer around.
Useful but pointless
Surely a factor of your recovery system is the speed at which it recovers, so whether drive manufacturers create larger discs or not, you're still going to want to go with more smaller drives rather than fewer larger drives, so your system comes back online and up to performance speed in as little time as possible.
With the price of smaller drives driven down by their shiny larger replacements, this can be cost effective too, with the right balance of drive size.
Also, drive technology, whilst not blazingly increasing in performance, is still improving a little, which will all factor in.
And there's no mention here of switching to SSDs.
Disk Throughput Rates
Whilst it is true that disk throughput rates do not go up at anything like the same rate as capacity increases, it isn't quite as bad as described - at least for sequential access where increased capacity comes from higher areal densities (where the capacity comes only by adding platters then the rebuild time would go up proportionately).
Where capacity comes from increased bit density, total capacity goes up with the square of linear density. Double the bits per unit length and total capacity goes up by a factor of four: basically twice as many tracks and twice the amount of data on each track (roughly - track and bit density won't necessarily go up at quite the same rate). That means if the capacity of a disk is quadrupled in this way, then the rebuild time will be doubled (twice as many tracks to read). The only way to improve this would be using multiple independent read-write mechanisms (which introduces cost, complexity, reliability problems and issues of aerodynamic and vibrational interaction) or higher spin rates - which are already close to reasonable mechanical limits.
(The position on random access is a lot worse than sequential access - the number of IOPS is pretty well fixed unless there are mechanical improvements.)
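To put numbers on the scaling argument above, here is a minimal sketch. It assumes a fixed spin rate, one track read per revolution, and track density rising at the same rate as bit density; the function names are made up for illustration.

```python
def scale_by_linear_density(factor):
    """Scale both bits-per-track and tracks-per-surface by `factor`.

    Returns (capacity_multiplier, sequential_rebuild_time_multiplier).
    Assumption: spin rate is unchanged, so reading one track still takes
    one revolution however many bits it holds.
    """
    capacity = factor * factor   # more tracks x more data per track
    rebuild_time = factor        # only the track count adds revolutions
    return capacity, rebuild_time


def scale_by_platters(factor):
    """Add platters only: capacity and rebuild time grow together."""
    return factor, factor


print(scale_by_linear_density(2))  # capacity x4, rebuild time x2
print(scale_by_platters(3))        # capacity x3, rebuild time x3
```

So a density-driven quadrupling of capacity costs a doubling of rebuild time, while the same capacity gained through extra platters would cost the full factor.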
I'm not wholly convinced about the need for triple parity though. Double parity is important as there is a significant chance of an uncorrectable read error on your remaining copy (a complete failure of another drive in the rebuild window is a very much less likely event). However, the chances of a failure in the double parity protection in a RAID-6 setup are very much reduced. Against that, putting a third parity into the configuration introduces even more write overheads and reduces available capacity.
The killer problem here is that the very geometry involved in disk storage (which is essentially a sequential access system with a moderately high speed seek facility) is going to make things worse. Patching up RAID systems to cover for incredibly lengthy rebuilds at considerable cost to write performance is no fix for the fundamental problem that you have many TB all being funneled through a single limited bandwidth bottleneck as represented by that single active read/write head.
Egads, the terminology for RAID was established over 15 years ago by Patterson and Chen. There have been some variations since then, but the underpinnings remain.
RAID 6 is not a synonym for two parity disks. It is the terminology used for P+Q parity. Using three parity disks is still RAID 6. Or possibly RAID 2, although that would be an unusual layout.
RAID 10 FTW!
I agree with Oliver on this one. RAID 5/6 is often referred to as a poor man's RAID. RAID 10 all the way!
That's not correct.
RAID-6 can also be 3D-XOR - it is fundamentally a dual parity mechanism - P+Q is more efficient and is just another variant. It's also harder to implement which is why most software RAID6 is 3DXOR.
I'm guessing you're a Wikipedia user, because it doesn't even mention 3D-XOR - no surprise there then.
"I'm not wholly convinced about the need for triple parity though. Double parity is important as there is a significant chance of an uncorrectable read error on your remaining copy."
Can you explain this part again? Maybe I misunderstand you, or it is you who misunderstands ZFS?
Apparently "1+0" doesn't count as a title
Because of course, we all have the budget and the shelf space to install 2 shelves of disk instead of 1.
RAID and triple parity etc
A few thoughts from the history of RAID, plus on triple parity etc:
RAID 0 was used for speed by striping data across multiple disks. Whilst faster than writing to a single disk, this increased the risk of losing all your data proportionately to the number of disks used.
E.g. the chance of losing all your data due to a failed drive in a 4-drive RAID 0 configuration is four times that of a single drive.
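As a rough check on the "four times" figure, here is a small sketch. It assumes drives fail independently and that each has the same chance of failing over the period of interest; the 3% figure is purely illustrative, not a measured rate.

```python
def raid0_loss_probability(p_drive, n_drives):
    """Chance that at least one of n independent drives fails.

    In RAID 0 any single drive failure loses the whole array.
    """
    return 1 - (1 - p_drive) ** n_drives


p = 0.03  # assumed per-drive failure probability, illustrative only
single = raid0_loss_probability(p, 1)
four = raid0_loss_probability(p, 4)
print(f"1 drive: {single:.4f}, 4-drive RAID 0: {four:.4f}, ratio {four / single:.2f}")
```

For small per-drive probabilities the ratio comes out just under four, so "four times" is a fair approximation rather than an exact figure.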
So, in order to give some protection against an increased chance of losing all your data, mirroring (RAID 1) was combined with RAID 0 to give us RAID 0+1 or RAID 10. Now we had speed, plus another copy of the data on the mirror for redundancy, in case of drive failure/loss.
In the past, due to processors being fairly weak, RAID 5 and RAID 6 were slow when the parity calculations were done in software, so the only practical alternative was to put the RAID processing onto a separate RAID controller card (host bus adapter, or HBA).
However, due to the problem of what happened when the power was lost between writing the data stripe and writing the parity data (called the 'RAID-5 write hole'), the solution/kludge was to add NVRAM so that upon power being restored, the RAID card could then complete the write operation for data/parity still not written. But NVRAM was expensive and so the 'I' (inexpensive) in RAID was lost.
Roll on a few years and the power of processors had increased dramatically, leaving them mostly idle at low CPU utilization, so it started to become possible to do heavy RAID calculations in software, with the advantage of no card to buy, plus no data being held ransom to a proprietary hardware RAID controller (especially if the software is open source).
Also, disk capacities became bigger and bigger, but the error rates remained fairly constant. But due to the increased amount of data passing through these storage systems, the number of data errors occurring was starting to become a problem.
See CERN report: http://indico.cern.ch/getFile.py/access?contribId=3&sessionId=0&resId=1&materialId=paper&confId=13797
and an analysis: http://blogs.zdnet.com/storage/?p=191
"Based on the findings - a PB data storage statistically will have 2,500 corrupt files that you won't even know about - and that is with non-compressed files - the number goes up with compressed files."
ZFS was designed with many goals in mind such as how to deal with ever growing amounts of data, larger file sizes, increasing numbers of errors, latent data failures (scrubbing to fix these) etc.
ZFS employed 256-bit checksums per block to allow easy detection of errors in the data, and fixes the errors on-the-fly when the file is read, either as a result of a direct read access or via a scrub of the storage pool.
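The detect-and-repair idea can be sketched in a few lines. This is not ZFS code: SHA-256 from Python's hashlib merely stands in for a 256-bit block checksum, the redundant copies are modelled as a plain list, and the function names are hypothetical.

```python
import hashlib


def checksum(block: bytes) -> bytes:
    """256-bit checksum of a block (SHA-256 used as a stand-in)."""
    return hashlib.sha256(block).digest()


def read_with_self_heal(copies, expected):
    """Return a verified block and overwrite any copy that fails the check.

    copies:   list of redundant copies of the same block (e.g. mirror sides)
    expected: checksum stored separately from the data when it was written
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("no copy matches the stored checksum")
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = bytes(good)  # repair the bad copy from the good one
    return good


stored = checksum(b"payroll data")
mirror = [b"payroll dat\x00", b"payroll data"]  # first copy silently corrupted
print(read_with_self_heal(mirror, stored), mirror)
```

The key design point is that the checksum is kept apart from the block it protects, so a drive that returns the wrong data without reporting an error is still caught.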
Due to other features like delayed, aggregated writes, ZFS can now block-up many writes into one big write operation so that I/O is much more efficient, and this feature, combined with much more powerful modern processors, now allows modern file systems like ZFS to make RAID 5/6/'7' calculations in software reasonably quickly.
So now we have the best of all worlds: high levels of redundancy & other protective measures, plus reasonable speed too.
Due to drives often being bought in one batch, when one drive fails there is often a short time to rebuild the failed drive before the next drive fails too, due to similar design/build/material characteristics.
So, as Adam Leventhal says, due to larger drive sizes, rebuild time is increasing, so it's better to have two further drives in reserve than just one to protect your data during this vulnerable time, hence triple parity. So triple parity is not such a bad idea at all, IMHO.
And there's another advantage with RAID-Z3 over the RAID 10 example above:
The RAID 10 configuration above, consisting of 4 drives in the RAID 0 set plus a further 4 drives in the RAID 1 mirror, uses 8 drives to give 4 TB of usable capacity, but it is more fragile: when a drive fails in the RAID 0 set, if a second drive fails in the RAID 1 set you're toast.
In contrast, with an 8-drive ZFS RAID-Z3 configuration you can have ANY 3 drives fail before losing any data, and you have 5TB of usable capacity instead of the RAID 10 set's 4TB.
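One way to check this comparison is to enumerate every combination of failed drives. The sketch below rests on assumptions that are not in the comment: 1 TB drives, the RAID 10 set laid out as four two-way mirrored pairs, and made-up function names.

```python
from itertools import combinations

DRIVES = range(8)
MIRROR_PAIRS = [(0, 1), (2, 3), (4, 5), (6, 7)]  # assumed RAID 10 layout


def raid10_survives(failed):
    """Striped mirrors survive unless both halves of some pair are gone."""
    failed = set(failed)
    return not any(a in failed and b in failed for a, b in MIRROR_PAIRS)


def raidz3_survives(failed):
    """Triple parity survives any loss of up to three drives."""
    return len(failed) <= 3


def survival_counts(survives, n_failed):
    """Return (surviving combinations, total combinations)."""
    combos = list(combinations(DRIVES, n_failed))
    return sum(1 for c in combos if survives(c)), len(combos)


for n in (2, 3):
    print(n, "failed:",
          "RAID 10", survival_counts(raid10_survives, n),
          "RAID-Z3", survival_counts(raidz3_survives, n))
```

Under these assumptions the striped-mirror layout survives 24 of the 28 possible two-drive failures and 32 of the 56 three-drive failures, while RAID-Z3 survives all of them. A mirror-of-stripes (RAID 0+1) layout would fare worse than the pairs modelled here.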
Hopefully I haven't made any typos or mistakes, but please feel free to correct any mistakes I may have made.
I'm sorry Oliver, but your assessment of the likelihood of multiple disk failures is off a little. The key issue is that you tend to buy all your disks for a raid array at the same time, and if you study failure times you find that they exhibit a bathtub curve. This means that you tend to get clustered failures early (covered usually by manufacturer testing to a large extent) and another cluster around MTBF (mean time between failure). On top of that when a raid array (either mirrored or striped) disk fails, during the rebuild, the raid array can quite often be running at a higher load, increasing the likelihood of that second failure. I've certainly seen clustered disk failures that have resulted in a striped and mirrored set getting nailed.
@AC Actually I'm quoting from the seminal paper: Peter M. Chen , Edward K. Lee , Garth A. Gibson , Randy H. Katz , David A. Patterson, RAID: high-performance, reliable secondary storage, ACM Computing Surveys (CSUR), v.26 n.2, p.145-185, June 1994. Section 3.2.7.
The point is that Chen defined the terms. Since then the terms have been abused, and also extended. But since this is generally the paper which is accepted as the basis for most RAID discussion I think I remain correct. No, I don't quote Wikipedia.
There is some wiggle room: the paper only discusses two-bit errors corrected with P+Q, but it should be clear that the fundamental mechanism scales within the same architecture. Thus RAID 6 remains the correct term for extended parity beyond 2, unless the system is a RAID 2 style.
"......and so the 'I' (inexpensive) in RAID was lost."
I reckon that happened as soon as the storage manufacturer's approved disk started costing somewhere north of twice what an off-the-shelf component did. i.e. pretty much as soon as it was handed over from product development to sales and marketing.
FYI, if anyone is curious about the fundamental math
It is Gaussian elimination, where the matrix elements are chosen from a finite field (e.g., polynomials of degree N with 1/0 coefficients, modulo an irreducible polynomial (see linear feedback shift register for an example of this)). Take an N+K row (*), N column matrix, multiply it by a length N column to get a length N+K column, and that is your protected data. If you lose any K items of your protected data, you remove the corresponding rows from the N+K by N matrix to get a N by N matrix -- invert it, multiply that by the subset of the protected data column, and you recover your original data.
(*) the rows need a special property, that any N of them form a full rank matrix. A Vandermonde matrix has this property.
It's all algebra.
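For the curious, the recipe above can be sketched as follows. To keep it short, this swaps in the prime field of integers modulo 257 for the polynomial field described in the comment (so coded symbols can take the value 256 and would not fit in a byte); all names are illustrative, and `pow(x, -1, P)` needs Python 3.8 or later.

```python
P = 257  # assumption: prime field GF(257) in place of a polynomial field


def vandermonde(rows, cols):
    """Row i is [x^0, x^1, ...] for x = i + 1; any `cols` rows are full rank."""
    return [[pow(x, j, P) for j in range(cols)] for x in range(1, rows + 1)]


def encode(data, k):
    """Multiply the (N+K) x N matrix by the length-N data column."""
    m = vandermonde(len(data) + k, len(data))
    return [sum(a * d for a, d in zip(row, data)) % P for row in m]


def solve(matrix, rhs):
    """Gauss-Jordan elimination mod P: find x with matrix * x = rhs."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)
        aug[col] = [v * inv % P for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(v - f * p) % P for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def recover(protected, n):
    """Rebuild the data from any n surviving items (lost ones are None)."""
    m = vandermonde(len(protected), n)
    keep = [i for i, v in enumerate(protected) if v is not None][:n]
    return solve([m[i] for i in keep], [protected[i] for i in keep])


data = [10, 200, 33, 47, 5]      # N = 5 data symbols, each 0..255
coded = encode(data, 3)          # K = 3, so any 3 items may be lost
coded[0] = coded[3] = coded[6] = None
print(recover(coded, len(data)) == data)
```

Real implementations normally use a systematic code (the data is stored unmodified alongside the parity) over GF(2^8), but the elimination step is the same idea.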
"But NVRAM was expensive and so the 'I' (inexpensive) in RAID was lost."
RAID = Redundant Array of Inexpensive Disks
The acronym doesn't say anything about the _controller_ being inexpensive....
Not going to add much to this debate
But. . .
"Every added terabyte adds four hours to the rebuild time, half a day"
Since when is 4 hours half a day? Last time I checked half a day was 12, but I could be wrong, what with clocks being inaccurate and all. Unless of course we are talking about half a WORK day, in which case I can understand.
Array reliability factors
The trouble with single parity is when you are recovering, you have no parity. And it's at that time you discover that a block on one of the other disks has gone bad and can't be read. Ouch.
Data-scrubbing helps. (That means that the array reads itself in toto every night or every week, while there is still a working parity disk. If a read error happens, don't fail the disk immediately. Recalculate the block from the other disks and re-write it. For a single bad block, the drive will re-map it and that's the temporary end of the problem, with the block now stored elsewhere in the drive. You are of course also watching the SMART statistics, and if any Reallocated sector count starts increasing other than very infrequently, you pre-emptively replace that drive.)
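The "recalculate the block from the other disks" step is plain XOR for single parity. Here is a minimal sketch, assuming one stripe held as in-memory byte strings and hypothetical function names; a real array would also re-write the rebuilt block to the drive so the bad sector gets re-mapped.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def scrub_stripe(data_blocks, parity, unreadable):
    """Rebuild the block at index `unreadable` from the rest plus parity."""
    survivors = [b for i, b in enumerate(data_blocks) if i != unreadable]
    rebuilt = xor_blocks(survivors + [parity])
    data_blocks[unreadable] = rebuilt  # stands in for the re-write to disk
    return rebuilt


stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(stripe)
stripe[1] = b"\x00\x00"  # pretend this block came back as a read error
print(scrub_stripe(stripe, parity, unreadable=1))
```

This only works while the parity is intact, which is exactly why the scrub has to happen before a drive is lost rather than during the rebuild.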
Even so, you'd prefer to have parity during a rebuild operation.
If it takes a day to rebuild and the expected time to failure of another drive in the array is 100 days, that's a 1/100 chance of losing the multi-Terabyte array with RAID-5 after the first drive fails. That 100 days is not 3+ years, because the first failure may reflect a common defect in the whole batch of disks. It may well be less than 100 days to the next, fatal, failure.
With RAID-6, two more drives have to die for it to be fatal; that's a 1 in 10,000 chance. Probably good enough. Three parity disks gives you 1 in a million, even for a fairly pessimistic MTTF assumption. I think Sun has this about right.
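The arithmetic behind those odds, as a back-of-envelope sketch. It takes the simplification above at face value: each further fatal failure is treated as an independent event with probability (rebuild time / expected time to next failure), ignoring how many drives are in the array. The function name is made up.

```python
def loss_chance_after_first_failure(rebuild_days, days_to_next_failure,
                                    further_failures_needed):
    """Chance that enough further drives die inside the rebuild window."""
    per_failure = rebuild_days / days_to_next_failure
    return per_failure ** further_failures_needed


for name, needed in (("RAID-5", 1), ("RAID-6", 2), ("triple parity", 3)):
    chance = loss_chance_after_first_failure(1, 100, needed)
    print(f"{name}: about 1 in {round(1 / chance):,}")
```

Each extra parity disk multiplies the odds by another factor of 100 under these assumptions, which is where the 1/100, 1/10,000 and 1 in a million figures come from.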
I wonder if any RAID array supplier has ever bought disks in batches well ahead of time and kept them "hot", so that they could ship arrays to customers with no two disks from the same batch, and all with different run-times (spread over several months) at time of assembly? Supplying somewhat "used" disks would actually considerably increase reliability!
I used to build 4-5 disk Linux software RAID arrays using one disk made by each of 4-5 manufacturers to eliminate common-mode failure as far as possible. But today, there are only 4 manufacturers: Hitachi, WD, Seagate, Samsung.
As for always using RAID-1 or -10, it isn't much better UNLESS you always mirror to a drive made by a different manufacturer. If both drives are from the same batch, you have common mode failure: failure of the first drive may imply the MTTF of the other drive is 100 days. 1 day to re-mirror -> 1% chance of total loss. Ouch, again.
If you (or your vendor) have had the foresight to mirror to a different make of drive, failures should not correlate and MTTF of the surviving disk is probably 1000 days, 0.1% chance of total loss. RAID-6 is more reliable. Of course, performance isn't as good. And these are post-first-failure probabilities; some folks will never see any drive fail at all.
BTW how many folks out there are using hardware RAID controllers with non-parity RAM buffering your data? So a faulty memory chip goes undetected until all your data is scrambled?
Another win for Linux software RAID, just as long as it's running in a server with ECC RAM. Non-ECC RAM in controllers is IMO the reason why RAID-[5,6] has a bad rep.
Maybe it all becomes irrelevant soon. With the latest filesystems you will be able to choose per-file or per-folder-tree what level of redundancy you want. Multiple parity with backups for the filesystem metadata. RAID-1, or -6, or -5, or none for files, depending on how important their contents are deemed to be. Also end-to-end checksumming to detect faulty controller electronics or buffer RAM. Also self-healing. So systems won't want RAID controllers at all, they'll just want a big pool of raw disks for the filesystem to manage.
Sun started this ball rolling with ZFS, a shame they seem to have dropped it of late.
I'm less knowledgeable about the guts of RAID algorithms than other commenters, but I do know this: RAID-DP is not the same as RAID 6, in that RAID-DP is a proprietary RAID algorithm belonging to NetApp. In any case, there are solutions beyond throwing more expensive disk resources at the problem; Xiotech, for example, has architected their solution to bypass one of the causes of the bathtub curve, namely vibration resonance in disk enclosures. They also allow for much more efficient use and reuse of disks. Not that I'm intending to be a shill, mind you, I just think the technology is cool.
Silent corruption of files
Studies at CERN show that on average, one byte per 30MB is erroneous.
It is the same problem as ECC. Some bits will be flipped in RAM, without the computer even noticing it. It can be due to power spikes, cosmic radiation, whatever. You therefore need some additional bits that can detect and correct these errors, hence you use ECC.
The very same problem occurs with hard drives. A modern drive has 20% of its surface dedicated to error correcting codes. There are lots of errors when reading/writing all the time, which get corrected. But some of the errors are not possible to correct. Worse, some of the errors are not even detectable by the hardware. Such errors occur with very low probability, but they occur. We talk about "silent corruption".
The problem is that with today's large drives and large RAIDs, there are so many bits involved that even if the probability of corruption is very low, errors are bound to occur. They occur quite often in fact, as the CERN study shows. This is the reason RAID-5 must be abandoned (there are too many bits, some of them faulty without telling you):
We need some kind of ECC mechanism for hard drives. Which is exactly what ZFS provides. And THAT, gentlemen, is the single reason to use ZFS: because of its ECC features. Here the ZFS architect explains the problems a modern file system must solve, and the future of top modern file systems. Very good read, this ACM article:
I would like to see...
I would like to see triple parity from ZFS and an option to bypass hardware error correction.
If ZFS is doing all the error correction at the file system level, the user should be able to reclaim the extra disk space.
and so it begins...
> I would like to see triple parity from ZFS and an option to bypass hardware error correction.
Self. Foot. Shoot.
Enjoy the extra 20% capacity while you can.
> Self. Foot. Shoot.
> Enjoy the extra 20% capacity while you can.
See: http://queue.acm.org/detail.cfm?id=1317400 where it says...
DB What are the provocative problems in storage that are still outstanding, and does ZFS help? What’s next? What’s still left? What are the things that you see down the pike that might be the big issues that we’ll be dealing with?
JB There are not just issues, but opportunities, too. I’ll give you an example. We were looking at the spec sheets for one of the newest Seagate drives recently, and they had an awful lot of error-correction support in there to deal with the fact that the media is not perfect.
BM They’re pushing the limits of the physics on these devices so hard that there’s a statistical error rate.
JB Right, so we looked at the data rates coming out of the drive. The delivered bandwidth from the outer tracks was about 80 megabytes per second, but the raw data rate—the rate that is actually coming off the platter—was closer to 100. This tells you that some 20 percent of the bits on that disk are actually error corrections.
BM Error correcting, tracking, bad sector remapping.
JB Exactly, so one of the questions you ask yourself is, “Well, if I’m going to start moving my data-integrity stuff up into the file system anyway—because I can actually get end-to-end data integrity that way, which is always stronger—then why not get some additional performance out of the disk drive? Why not give me an option with this disk drive?” I’ll remap all the bad sectors, because we don’t even have to remap them. It suffices to allocate it elsewhere and basically deliberately leak the block that is defective. It wouldn’t take a whole lot of file-system code to do that.
Then you can say, “Put the drive in this mode,” and you’ve got a drive with 20 percent more capacity and 20 percent higher bandwidth because you’re running ZFS on top of it. That would be pretty cool.
DB That’s a really exciting idea. Have you had those discussions with the drive vendors about whether they would offer that mode?
BM Not quite, because they’re most interested in moving up the margin chain, if you will, and providing more unreliable devices that they sell at a lower cost; it isn’t really something they care to entertain all that thoroughly.
Make Up Your Minds
You went with RAID so you wouldn't have to buy as many disks. If you don't want the RAID rebuild time then mirror instead. ----- sheeshhhh!