It's too much to demand that any vendor try and make RAID storage exciting. Our hats, however, go off to Panasas for trying. The company this week started hyping its Tiered Parity Architecture (TPA) – technology that serves as an extension to RAID. Panasas CTO Garth Gibson, one of the inventors of RAID back at Berkeley, is once …
Panasas decides to redefine RAID
"With Horizontal Parity, for example, the company uses multiple RAID controllers to perform recovery tasks in parallel."
. . . so one of their customers had triple disk failures in one raid group - nasty. Does the Panasas O/S do regular "disk scrubs" like one vendor I know does? Was there no predictive failure and copy-out of the "failing" disks onto spares? No, even nastier.
"But basically there are a bunch of codes that we implement on there to detect errors at a sector level."
. . . Would that be something like the SCSI "g" list of bad sectors? And anyway all disk writes are checksummed - that's a given.
"RAID 6 or double parity RAID"
. . . Since RAID 6 is essentially RAID 5 with a second parity block distributed across the disks, the data used for reconstruction is literally all over the place. With NetApp's Double Parity RAID (RAID DP), there are 2 dedicated parity disks per RAID group, so there's far less overhead on both everyday writes and reconstructions.
Then there's the fact that NetApp's O/S, Data ONTAP, is RAID aware; I don't know of another vendor who can say the same.
"Like the RAID 6 approach, Vertical Parity does require extra overhead in the way of disk space. Panasas contends that its overhead - 20 per cent - equals that of RAID 6."
. . . NetApp's default raid group size for RAID DP is 16 disks, 2/16 = 12.5% overhead.
"But Panasas does not require more space as the disks grow in density, while the RAID 6 crowd does."
. . . and there was me thinking that 20% of a bigger thing was bigger than 20% of a smaller thing.
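For what it's worth, the arithmetic in the last two replies can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, the 10-disk group is simply the size at which two parity disks come to 20 per cent, and the 300GB and 1,000GB disk sizes are assumed examples.

```python
def parity_fraction(parity_disks: int, group_size: int) -> float:
    """Share of a RAID group's raw capacity that is spent on parity."""
    if not 0 < parity_disks < group_size:
        raise ValueError("need 0 < parity_disks < group_size")
    return parity_disks / group_size


def parity_capacity_gb(parity_disks: int, disk_gb: int) -> int:
    """Raw capacity given over to parity, in GB."""
    return parity_disks * disk_gb


print(parity_fraction(2, 16))  # 16-disk group with 2 parity disks: 0.125
print(parity_fraction(2, 10))  # 10-disk group with 2 parity disks: 0.2

# The fraction stays the same, but the absolute cost grows with disk size.
print(parity_capacity_gb(2, 300), parity_capacity_gb(2, 1000))  # 600 2000
```

So the percentage is independent of disk size, while the number of gigabytes given up to parity is not.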
"With Network Parity, Panasas performs a complete check on data as it moves between storage boxes and server/client systems."
. . . That will be something like the OSI model's layer 4 then - TCP
Rosenthal seems to be talking out of his bottom.
"Distributed" RAID on 3PAR systems
I've been a customer of 3PAR for most of 2007, and one of the things that I really like about their systems is the "virtualized" disks. The physical disks are split up into 256MB chunks, and the RAID configuration is on the chunks, rather than the disks. Each disk has reserved chunks set aside for redundancy purposes; on my 300GB FC drives it is ~18GB. Long story short, when a drive fails, the entire array participates in the rebuild process (all spindles, depending on the size of the volume; a 10GB volume will be spread across up to 39 disks). There are no "spare" disks; all disks are in use. So rebuilds are real quick compared to traditional RAID systems.
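To put rough numbers on that layout, here is a minimal Python sketch. It assumes 1GB = 1024MB, and the names are hypothetical, not anything taken from 3PAR's software.

```python
CHUNK_MB = 256  # chunk size quoted above


def chunks(capacity_gb: int, chunk_mb: int = CHUNK_MB) -> int:
    """Whole chunks that fit in capacity_gb, assuming 1 GB = 1024 MB."""
    return capacity_gb * 1024 // chunk_mb


total = chunks(300)    # chunks on one 300GB drive
reserved = chunks(18)  # chunks in the ~18GB held back for redundancy
print(total, reserved)              # 1200 72
print(f"{reserved / total:.0%}")    # share of each drive kept in reserve
```

Under those assumptions roughly 6 per cent of every drive is held back, which is what takes the place of dedicated spare disks.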
Other advantages are things like being able to run multiple RAID levels on the same physical disks (I run RAID 1+0 with data on the outer regions of the disks and RAID 5+0 on the inner regions).
Of course it's also handy to be able to convert between RAID levels online, with no downtime.
But if you really want to protect against multiple failures you have this ability today in several higher end RAID systems, by mirroring multiple times, say two RAID 1 volumes, mirrored together. I've been told that some of the highest end HDS systems, for example (the ones where they pay you for any downtime associated with the array), default to something like triple mirroring. If data protection is THAT important the user should be using such a system (assuming they want the data to be completely available vs replicating to a second array where there may be downtime in pointing servers over to it in the event of a failure).
My 3PAR array has a "mirror depth" option which may be the same thing, I'm not sure, haven't tried it. It also has the ability to automatically lay out the data so that (provided you have enough shelves) it can ensure redundancy in the event an entire shelf goes off line. My array is only 2 shelves so we can't take advantage of that, but I imagine it can save a lot of planning for folks that want that kind of assurance.
And of course, while not completely risk free, running RAID 1+0 instead of say RAID 5 with a hot spare gives you a higher chance of surviving multiple disk failures (as long as they are the right disks). Especially, say, if your RAID volume is made up of 10+ disks. The only time I've experienced multiple disk failures on the same array before I could replace the disks was back in 2001, I think, with the IBM 75GXP disks. Had 4 drive failures in the span of 3 days on two different systems. The systems were dedicated 'backup' servers, so nothing important was lost; I just rebuilt the system that lost its marbles and re-synced the data. I must've lost at least 30 75GXP drives and even joined the class action lawsuit for a while until the judge decided to exclude people outside the state of California.
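The "right disks" point can be made concrete with a small Python sketch. It assumes plain two-way mirrors, that the second failure is equally likely to hit any surviving disk, and that it arrives before the first disk has been rebuilt; the function name is just for illustration.

```python
from fractions import Fraction


def raid10_survives_second_failure(disks: int) -> Fraction:
    """Chance a RAID 1+0 volume survives a second, random disk failure.

    After one disk is lost, only its mirror partner is fatal, so
    disks - 2 of the remaining disks - 1 are safe to lose.
    """
    if disks < 4 or disks % 2:
        raise ValueError("need an even number of disks, at least 4")
    return Fraction(disks - 2, disks - 1)


print(raid10_survives_second_failure(10))  # 8/9
print(raid10_survives_second_failure(4))   # 2/3
```

For single-parity RAID 5 the same figure is zero until the hot spare has finished rebuilding, since any second failure in the set is fatal.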
So, fast RAID rebuilds are here, and have been for years, at least with 3PAR; I'm not personally aware of other vendors that offer similar technology.
Here I was believing that everything except the bit-rot problem surrounding atomic writes would be handled by a 2D RAID (that's two dimensions), and that to solve the bit-rot problem you would use a 2D setup with ZFS being the topmost one. Notice that this solution is faster, cheaper, simpler.
RAID set by "chunks"
RAID sets by "chunks" of disks is not unique to 3PAR - it's used on several high-end arrays. As for the effect on rebuild time, rebuilding these will still be limited by the transfer rate on the single disk that failed. If you have a 300GB disk that needs to be rebuilt (whether configured as a whole disk in a single RAID set or into multiple RAID sets made up of "slices" of disks) then you still have this ultimate bottleneck of having to regenerate 300GB of data onto a single spindle.
There is another point to note - if the recovery time is effectively dictated at the limit by the throughput of the regenerated volume and you involve more disks through this approach, then your exposure to a double disk failure is increased. If a RAID set of 5 disks (single parity) is involved in the rebuild, then you are exposed to a hard failure on just 4 disks during the rebuild time. However, if this one disk participates in 10 different RAID sets of this size, then there will be 40 other drives involved, the failure of any one of which would invalidate that RAID "chunk". Only if the rebuild time is reduced proportionately is that exposure equivalent (and at the limit you still have the throughput of that one drive).
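A quick way to see that trade-off is to count exposure in "drive-minutes": drives at risk multiplied by the length of the rebuild window. The Python sketch below assumes the chunked RAID sets share no drives other than the failed one, and the 120-minute rebuild is an illustrative figure; both function names are hypothetical.

```python
def other_drives_at_risk(set_size: int, sets_per_disk: int = 1) -> int:
    """Drives whose failure during the rebuild would break a RAID set.

    Assumes the RAID sets overlap only on the failed drive.
    """
    return (set_size - 1) * sets_per_disk


def exposure_drive_minutes(drives: int, rebuild_minutes: int) -> int:
    """Crude exposure measure: drives at risk times the rebuild window."""
    return drives * rebuild_minutes


whole_disk = other_drives_at_risk(5)     # 4 drives
chunked = other_drives_at_risk(5, 10)    # 40 drives

print(exposure_drive_minutes(whole_disk, 120))  # 480
print(exposure_drive_minutes(chunked, 120))     # 4800: same rebuild time, 10x exposure
print(exposure_drive_minutes(chunked, 12))      # 480: only a 10x faster rebuild breaks even
```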
Of course you have spread the read I/Os required for the regeneration over more disks, so the effect is averaged over more disks, but if the sustainable write transfer rate to a 300GB drive is 40MBps then it will still take something over two hours to rebuild and cannot possibly take less. This problem will get worse as disk capacities get larger. The reason? Well, it's quite simple: if the disk rotational speed remains fixed as capacities get larger (and 15K appears to be about the cost-effective mechanical limit, and even that isn't available on large capacity disks) then the transfer rate only goes up as the square root of the increase in areal density. In other words, if technological improvements mean that the bit density increases by a factor of four, it will take twice as long to read or write a disk in its entirety. This problem is only set to get worse.
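Both figures are easy to reproduce. The Python sketch below assumes 1GB = 1000MB, a constant sustained write rate, and unchanged rotational speed and platter geometry; the function names are illustrative only.

```python
import math


def min_rebuild_hours(capacity_gb: float, write_mb_per_s: float) -> float:
    """Lower bound on the time to rewrite one whole drive, in hours."""
    return capacity_gb * 1000 / write_mb_per_s / 3600


def full_pass_slowdown(areal_density_factor: float) -> float:
    """How much longer a full read or write takes after a density increase.

    Capacity grows with areal density, but at fixed rpm the transfer
    rate grows only with its square root (the linear bit density).
    """
    capacity_factor = areal_density_factor
    transfer_rate_factor = math.sqrt(areal_density_factor)
    return capacity_factor / transfer_rate_factor


print(f"{min_rebuild_hours(300, 40):.2f} hours")  # a little over two hours
print(full_pass_slowdown(4))                      # 2.0
```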
The real problem with RAID failures with large disks is not really the possibility of encountering a "hard" double disk failure. The problem is that during the rebuild operation an unrecoverable read error will occur. This can stop a rebuild operation dead in its tracks and is the real reason for multi-redundant parity schemes (whatever method is used). The MTBF figures quoted by disk vendors are those for a complete disk failure and not for single unrecoverable read operations. With the increase in size of disks, these exposures are increasing and many of the engineering limits which were deemed acceptable in the days of GB disks are no longer so in those 1,000 times larger.
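To illustrate how that exposure scales with disk size, here is a rough Python sketch. The error rate of one unrecoverable read per 10^14 bits is an assumed, illustrative figure (check the drive's data sheet for the real one), errors are treated as independent, and the 4-disk read corresponds to rebuilding one member of a 5-disk single-parity set.

```python
import math


def p_unrecoverable_read(bytes_read: float, errors_per_bit: float = 1e-14) -> float:
    """Probability of at least one unrecoverable read error.

    errors_per_bit defaults to an ASSUMED rate of 1 error per 1e14 bits.
    Computes 1 - (1 - p)**bits in a numerically stable way.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-errors_per_bit))


GB = 1e9
small = p_unrecoverable_read(4 * 1 * GB)     # four 1GB disks read in full
large = p_unrecoverable_read(4 * 1000 * GB)  # four disks 1,000 times larger

print(f"{small:.4%}")
print(f"{large:.1%}")
```

Under these assumptions the chance of hitting a read error during the rebuild goes from a few hundredths of a per cent to better than one in four, which is the case for a second level of redundancy.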
But if your RAID is contained both outside and inside a disk, you are still royally fucked if you have a hardware error; most disks fail rather than giving data you don't expect. Indeed, often RAID control systems fail (or at least degrade) a disk if it does give unexpected data. Your best bet is to virtualise and have small segments of each disk presented to different hosts, in different RAID sets; this way if you have a single drive outage there is minimal impact to many machines. The segments that have failed can be moved around (rebuilt from the parity) as required.
Now, what would be best is to 0+1 data that you _really_ need and use snapshots and multiple concurrent copies, but you do need a serious amount of wedge for this.
but you do have to ask
Where is the IT angle?
Just Say No
Shhhh... It's a s_e_c_r_e_t ...
"The company has proved reticent to talk about the exact nature of the technology, even though it has already filed for patents."
Sigh. US patent applications are all available on-line at www.uspto.gov. I can't be bothered to go look up these ones - RAID is far too boring a subject.