big disks him go titsup
spin around mix it up
him get up walkaround disk again
IBM's research department haz released a research paper on RAID 5 that has intrigued and baffled our correspondent in equal measure. The paper, by IBM Almaden researcher Mario Blaum, professes to solve a problem where RAID 5 is insufficient to recover data when two disks fail. Blaum's solution claims to do so better than a …
I didnn't attempt to read the paper, the quoted bits were enough to put me off.
Speaking of bits, I do remember some sysadmin stating a couple of years ago, that rebuilding a degraded raid5 error is essentially poinless, and the best course of action is to restore from your backup onto a fresh array.
The argument was, that if you look at the expected bit error rates of a harddrive, and compare it to the size of your array, you'll discover you're statistically likely to encounter atleast one flipped bit during raid rebuild, and end up with corrupted data, in one file, somewhere.
As I'm a mere caveman studying the entrails of a wooly mammoth, I must admit I'm just speculating when I propose the theory, as told to me by the mammoth's internals, that perhaps these researchers are presenting error correction codes, and placement strategies for these codes, that would push the probability of single bit errors up into the realm of "wont happen" for another decade or so?
ilmari speculated thus:
"...perhaps these researchers are presenting error correction codes, and placement strategies for these codes, that would push the probability of single bit errors up into the realm of "wont happen" for another decade or so?"
I'm sensing a Grant Application in all of this. Doubtful payoff down the road, but sustenance for a Research Boffin meanwhile...
That sysadmin is full of shite. Firstly, any flipped bits on the raid set are going to be duplicated in the backup set. Second, all decent storage manufacturers maintain checksum data to detect flipped bits, and then use RAID to fix them. Which os why a failed disk in a raid-5 set is a dangerous place to be in if you don't have scrubber functionality in your array. Third, mature array technology often includes technology to do that aforementioned scrubbing. This dramatically reduces the incidence of a flipped bit causing a rebuild failure.
There is a good reason for not rebuilding a degraded RAID 5 array, but that's not it.
An obvious reason is that you can bet your salary that the array was commissioned with a full set of shiny, new disks. Unless the disk that failed was a serious short-lifer, you can bet that the others are now not long for this world too. Rebuilding the array soon gets like painting the Forth bridge.
Compound that with the fact that disks are more likely to fail (if they're going to) when heavily loaded and that the rebuild is going to cane the living shit out of all of them.
The result is that the business risk associated with allowing it to limp on degraded while you spin up a new one is often lower than the risk of dropping the lot on the floor when it goes titsup in the rebuild.
Wow, that's nasty... some of it has echos of reed-solomon, but that's all I can follow once it gets past defining conventional RAID in matrix terms.
I can tell you that this stuff is obscure. Really obscure. This isn't just your regular nasty math that any professor can figure out - this is cutting-edge research maths. There have only been a handful of papers I can find on the subject. There are probably about ten people who really understand this right now. It doesn't look impossible to follow, just arcane enough that even a professional mathematician is going to need to slog over it for an hour.
I can tell you the big catch though - and it's a really big catch. Failure model. These codes are made for correcting unreadable blocks on a hard drive, and depend upon the rest of that hard drive still being accessible. They won't do you a lot of good when two whole drives die at once. At least, that's as far as I can follow it. I'm not entirely confident, but I think that's how it works.
Might be applicable to flash controllers though - while drives tend to fail in a 'clunk-bzz-dead' manner, flash memory is more likely to lose individual cells.
... if you sat down the the author of the paper, it'd make sense (at a high level) and you'd 'get' the principles of it in 20 minutes.
That's a big problem with mathematicians, they can explain stuff simply but when they write papers it's their custom to be as terse as possible. Clarity is sacrificed by convention and that's a crying shame. It's time they changed, really (any eggheads listening?)
Caesarius gave a good example of what it's like to have it put in simple terms with his comments about GF(2).
Maths is not easy but it doesn't have to be nearly so hard; I'm not into masochism.
I just realised while soaking in the bath - there is an application of this in RAID. It can correct the single-sector read errors encountered during an array rebuild.
Also, why do none of your tablet accessory reviews include the most useful accessory to me, the watertight food bag from Asda?
"Also, why do none of your tablet accessory reviews include the most useful accessory to me, the watertight food bag from Asda?"
You, sir, are the true genius we're looking for. Forget all this maths stuff for recovering data. That's only of trivial importance compared to to the fact that you've just solved the most pressing problem known to bathkind.
Yours truly, from the B Ark.
Personally, I didn't find much in there that really confused me. But then, that's probably because I've written a set of Perl modules (see Math::FastGF2 and Crypt::IDA on CPAN) to construct RAID-like "erasure codes". The paper here has elements of Reed-Solomon codes or Rabin's "Information Dispersal Algorithm", but also needs some understanding of the maths underlying maximal distance separation (MDS) codes in theory. I'll try to give a brief outline of what I think is going on...
First, you need to understand the basics of error correction. Although the implementations differ, essentially the problem of reconstructing the original data is analogous to the related problem of "secret sharing". A simple scheme to share a secret among some number of people so that 3 of them are required to recover it would be to (1) encode the secret as a point in a 3-d space, (2) generate n "shares", which are equations of randomly-generated planes, each of which passes through the secret point, and (3) distribute the shares (ie, the equations of the planes) to each of the n participants. Any 3 of these participants can then gather together and plot out their respective planes to discover the unique point which is at the intersection of the three. Or equivalently, they can use algebraic means (as opposed to geometric) to invert the 3x3 matrix formed by combining their equations and find the solution that way. Note that if we only have two people, the intersection of their planes will be a straight line, implying an infinite number of solutions, so just two people alone can't reconstruct the secret. Also note that by setting n > 3 we introduce an element of redundancy, which is the basis of many error-correcting codes, so that if n = 5, for example, any two of the participants can lose their shares and the secret may still be recovered by using simply linear algebra.
That's the simplest analogy to explain how these kinds of codes work. In practice, we don't use lines, planes and hyperplanes in a real space, but instead (generally) use either field arithmetic modulo some prime or (more usually) Galois fields modulo an irreducible polynomial (which is effectively a prime, or serves the same function, anyway). They're just more convenient to work with. I won't try to explain how those work here, but really there's not much difference between using GF(2^x) mod p and working with points, lines, planes and so on. See my Math::FastGF2 Perl module for my attempt at explaining it. Another major difference is that most implementations of erasure codes use polynomials instead of using the intersections of lines or planes. Just as 3 planes are enough to uniquely determine a single point of intersection in 3-space, the evaluation of a polynomial of order 3 (ax^3 + bx^2 + cx + d) at any 3 points is enough to determine the original polynomial parameters (a through d).
So all of this is a brief introduction to how Reed-Solomon codes work, or at least a good working analogy for it. Leaving aside implementation details about whether we're working with Galois fields or polynomials, we create a redundant code by making an k x n matrix and then solve it to recover the original (or secret) data by taking any k x k submatrix and solving (inverting) it. The maths of secret sharing is actually the same as information dispersal/Reed-Solomon codes. I'll follow this up with another post to explain RAID and the paper...
You didn't put it simply. You just wanked off your own ego and didn't clarify squat..
I didn't bother reading the paper, but did I get the impression that the researcher just wanted to add two more sets of parity bits to increase the probability of complete recovery?
OK, so I think I'll dive right in and try to paraphrase the paper, explaining bits about RAID and what advance the paper makes along the way.
As mentioned in the intro, the way RAID works is based on the kind of maths as I outlined in my previous post. Even though it's got all this maths as a backdrop, for various reasons (efficiency being the most important one), RAID only implements some very simple bit-level operations rather than bringing the full power of the maths theory to bear. RAID looks at redundancy by distinguishing between data "slices" and parity slices. In the simplest case, you can have a data slice mirrored across two drives, and use the XOR of the two slices to provide your parity stripe. For this type of parity checking, if you lose either of the data stripes, you can still recover it by XORing the remaining one with the parity stripe. This kind of redundancy is (more or less) how different RAID levels are built up, with extra mirroring and parity added on top of existing layers. As the paper says, this is pretty wasteful and it can still fall prey to certain kinds of multiple failure modes that can't be recovered from.
It would be much better if, instead of simply adding more mirror/parity disks, we could use something more akin to a secret-sharing scheme as I described earlier. With such a scheme, actually there would be no distinction between data stripes and parity stripes. To reconstruct (read) the data, you'd have to read in k good stripes out of n (n > k if we want redundancy), do some maths magic and return the correct bits. Unfortunately, that's not how RAID works. All of its redundancy is built using either plain XOR of data or (in RAID 6) a more complex formula that's (generally) based on Hamming codes (or maximal distance separation codes in general). Unfortunately, these make assumptions about which and how many of the bits of the matrix used to reconstruct the data can fail. Effectively what this paper is showing is how to construct a couple of different parity calculations that can be used to scale up the redundancy level to tolerate more failures in a (presumably) more realistic pattern. It seems to be using the same kind of matrix operations as used in Reed-Solomon codes, including using a Vandermonde-form matrix to prove the invertibility (solvability) of their coding (parity check) matrices. So it's not quite as generic as a secret-sharing scheme would be, but it's definitely using the same mathematical machinery.
Some of this stuff is still obscure to me, but overall it still makes general sense within the framework of Reed-Solomon codes and Hamming codes. I think that once you understand the basics of using matrices for solving linear equations and know generally how to work in Galois fields, this paper isn't actually all that obscure. IMO, of course :)
I think AC @22:10 captured it pretty well. It does help to have some familiarity with Hamming and Reed-Solomon codes, and/or with the application of group theory to error correction. (Groups are simpler structures than fields, and you can use them to create error-correcting codes that operate similarly to R-S, PMDS, and SD codes, so they give you the idea with a bit less mathematics to understand.) I generally dig out my copy of Gersting's textbook Mathematical Structures for Computer Science when I need a refresher on this sort of stuff, but surely there are good introductions available online.
My CS degree included a couple of discrete mathematics classes and one in linear algebra - I did a quick skim of the paper, and I didn't see anything more esoteric than the stuff we covered there, if memory serves (this was back in the '80s). Do they not teach that sort of thing these days? (I realize many, probably most, Reg readers don't hold CS degrees - IT and CS are not at all the same thing - but surely a good number do.)
Start with the idea that adding a parity bit on the end enables you to detect one bit in error, but not two bits in error. Now trust me that adding more "parity" bits, each calculated using XOR of a subset of the data bits, can not only detect more errors but also provide enough information to work out the original data if there were few enough errors. For a trivial example, sending one bit of data and two parity bits, each of which is merely a copy of the data, allow you to correct up to one error in each set of three bits. But that is a very poor ratio of data to parity, so a lot of work has been done to find "codes", which are described by the "subsets of bits to XOR" mentioned earlier, which offer a much better data to parity ratio.
The idea is to apply this to disks by taking a block of data that we want to write to disk, add parity bits to it, and share out the data and parity bits across the RAIDed disks. If one disk becomes unreliable, when we read back a set of data and parity bits we can reconstruct the data, but only if there are not too many errors. So research is done to find "codes" that work with a specific number of disks and can cope with one disk failure. Or two disks, if there are enough parity bits, and so on.
Mr Blaum is trying for sufficient parity bits to cope with two failed disks, which is what RAID 6 claims. He has only presented one aspect of the codes he considers, called the parity check matrix. Whereas this matrix does indeed imply multiplying the data+parity bits by said matrix, there are always methods of avoiding doing exactly that but getting the required answer. The point is that the parity check matrix is comparatively easy to analyse.
If you want to understand the detail, start by reading about Hamming codes.
Also, read about "finite fields". GF(2) means Galois Field order 2, which is a very posh way of saying "binary, and use XOR to add and AND to multiply". GF(anything else) is a more complicated arrangement where we redefine ADD and MULTIPLY to behave themselves, even though the numbers don't behave quite like familiar integers and certainly don't look like them. See e.g. Reed-Soloman codes.
That's enough for today...
Lets start with some data stored on a disk, and increase the size until problems get in the way. Step one, buy a bigger disk. When that doesn't work, split the data between two disks. Pretend you are locked into software that won't work like that. Plan B is to make a big virtual disk out of many small disks. Plan B is a disaster waiting to happen. When any disk fails, you get file system corruption and lose all your data.
The old solution was RAID 5: Add one more disk than you need. If you now have N disks, reserve 1/Nth of each disk for parity data. For each sector of parity data, assign one sector of real data from each of the other drives. Start with each parity sector set to the exclusive or of all the corresponding data sectors. Before every write, read the sector that is about to be overwritten, write the new data. XOR the old data, the new data and the data from the corresponding parity sector and store the result back on the parity sector. This preserves the fact that each parity sector is the exclusive or of the corresponding data sectors. If one drive breaks, the array still works. When you need to read a sector on the broken drive, you read the sector from each drive that shares the same parity sector as the missing sector. The exclusive or of all those sectors is the data you wanted. Pull out the broken drive, plug in a new one and you can use that trick to restore all the data you cannot read directly from the broken drive.
RAID 5 has issues. That read before every write costs performance. When a drive fails, reading all the other drives to get the missing data hits performance. When you replace the broken drive, the array thrashes hard and you loose performance. The subtle disaster is if a read fails, and the error detection algorithm doesn't notice. (If you have a large busy array, this will bite you). The miss-read sector will cause the parity sector to contain garbage. When the next disk fails (when - not if - as you have so many disks) the garbage in the parity sector will cause data corruption to the corresponding sector on the broken disk. RAID 5 is almost always the wrong answer. Instead double the number of disk drives and store each sector on two drives. Disks are cheap, and mirroring does does not impose the large performance penalties that come with RAID 5.
Now lets move into the IBM universe. In this universe, error detection is perfect, so the file system corruption risk of RAID 5 disappears. The poor write performance of RAID 5 does not matter to you, and you do not mind you system slowing to a craw each time a disk fails. You do care about the price of disks, so you are too cheap to go for mirroring. Lastly, you buy your drives from which ever manufacturer is currently going through a bad patch (this happens to all of them - just watch the commentards slagging off the particular manufacture that happened to caused them grief). You decide that a second drive can fail before a first failed drive can be restored, but that you will never have three broken drives at the same time because you have sacrificed a chicken to to voodoo gods.
Now comes the difficult bit. In the IBM universe, how much space to you need to reserve so that you can recover all the data if two drives are broken? This is quite a difficult mathematical problem. The lazy answer is to choose a simple algorithm that uses more than the minimum required space. A less lazy answer is to get a computer to search for a better solution than the simple one you came up with. Mathematicians at IBM have solved the problem, and now the minimum required space (and the algorithm that uses it) can be calculated.
If you have been programming a while, you might well have bumped into single bit error correcting codes (If that day has yet to arrive, just remember to ask wakipedia about Reed-Solomon error correction). If you cannot make data packets small enough to get only single bit errors, you will need to look at correcting multiple bits at once. Mathematicians have done all the grunt work for you already, but they express it in their own jargon which is clear as mud to most programmers.
Lets start with binary and exclusive or. Programmers should have met modulo arithmetic. Byte 130 plus byte 136 is byte 10 because there was an over flow. In computing integer arithmetic is module 256 or 65536 or the bigger numbers for 32 and 64 bits. Now take that down to modulo 2 - the arithmetic for bits:
Mathematician's addition modulo two is the same as the exclusive or instruction from computing. Mathematicians know all about addition, so they use it whenever exclusive or is the obvious solution to a programmer. If you looked inside the paper, you will notice GF(2^b). That 2 is for arithmetic modulo two (Binary! :-). It has been ages since I had to deal with this stuff, but if I recall correctly, that b is the number of bits in a sector - or whatever data packet you need an error correction code for (mathematicians imagine a vector of bits). GF is short for Galois field (he his French pronounce it Galwa or mathematicians will know you are a programmer). In mathematics, a field is a set with an 'add' and a 'multiply' instruction. In this case, the set is all possible contents of a disk sector. There are a bunch of additional restrictions on how you define add and multiply. Today, add means make a new sector exclusive oring the corresponding bits of two other sectors. Multiply means bitwise and the corresponding bits. The fun bit (for mathematicians) is computing's 'and' and 'exclusive or' operations on sectors have the required properties for a Galois field. They can use all the things they have discovered about Galois fields (far more than most programmers can stand) to find the equations for recovering data from a RAID array with two broken drives.
We have almost reached the point where I am going to run away screaming, but there is one more piece of mathematics that I have actually programmed, so with any luck, I can explain it and perhaps three or four of you will actually need to use at some point in the next 50 years. Pretend we have N bits and a Gremlin can flip up to m of them. If m=0, we have 2^N useful messages. If m=1, for each useful message, we have to use (1+N) code points: one for the message received without errors, and N more for all the single bit errors. That gives us at most (2^N)/(1+N) useful messages. If m=2, we only get (2^N)/(1+N+N*(N-1)/2) useful messages. When we receive one of the 2^N code points, we have to work out which useful message was actually sent. Clearly we want the 'nearest' one. In order for there to be a 'nearest' one, we need a list of the useful codes and a function that converts two code points into a positive integer that we can call the distance between those code points. One handy definition of distance is the number of bits you have to flip to get from one code point to another.
If the minimum distance between any two useful code points is two, then we can detect (but not correct) some errors because there will be some code points that are the same distance (1) from two different useful code points. If the minimum distance is 3, we can correct some errors (if a code point is 1 unit away from a useful code point, it must be at least two units from any other useful code point). To correct two bit flips, we need a minimum distance between valid code points of 5. For small values of N, you could find a good set of useful code points and a distance function by trial and error. When N gets large, you really want some mathematics that find those things for you.
All this is really useful when you do not know which bits the gremlin flipped. In the IBM universe, disk drives can always spot when reading a sector failed. Instead of some unknown bits getting flipped, some known bits are set to 0. This requires some slightly different mathematics to to usual error correcting code programmers meet. If you get as far as page 2, the paper defines Partitial Minimum Spanning Distance and SD codes which are just acronyms to frighten journalist on page 1. I skimmed through to the end, and did not see any mention of performance, thrashing when a drive fails or error propagation when a bad read has the right checksum by chance. I expect this set up has worse performance and corruption problems than RAID 5. For most people, the extra disks required for RAID 10 are going to be cheaper than implementing IBM's pretty mathematics.
Now lets move into the IBM universe. Quite.
The USENIX presentation on SD codes at https://www.usenix.org/conference/fast13/sd-codes-erasure-codes-designed-how-storage-systems-really-fail is very helpful (+1 for that post!) talking about erasure codes, but also lets us know where they are coming from: they really don't like using more disks than is theoretically needed to protect against their chosen failure mechanism.
I can see that disks are cheap, and I agree that recovering from a failure should not detract from normal operation more than is acceptable. Discussion then follows the "unfortunately .. fortunately .. unfortunately there was a pitchfork in the haystack ..." route, where people pay their money and make their choice. So I observe:
- some systems need higher spec disks which cost more
- more disks means more power dissipation, and we are seeing customers realising that they need to look for lower dissipation equipment
- you'll get much better performance from a hardware implementation of error correction, but it costs money, and it is still not likely to run so fast that the rest of the system does not notice it
Murphy's Law says that there are a finite number of things that you can think of that might go wrong - and an infinite number of things that can go wrong. This means there are failure modes that will cause a disk misread to go undetected.
<misty eyed mode>
There once was an ingenious hardware-assisted disk file system called Indexed Sequential that developed a mysterious problem for one specific customer. The fast search depended on whether a matching data key was found or not. Every once in a few million searches, the data key was reported by the hardware as not found - although it was there. The customer application then overwrote the existing key record with a new "arbitrary" key - which orphaned any data records that used the old key. The "missing" records anomaly was usually not noticed until a few weeks after the miss.
It was a hardware problem that happened extremely infrequently - and only if the voltage tolerances exceeded about 5% on a particular batch of disk controller boards. However the engineering set up tolerance was 10% - and these were checked, and reset, once a week. It was the controller's key compare logic that failed - not the reading of the disk data.
The application software had assumed that any hardware failure would cause an error to be reported on the device. Therefore what seemed an efficient way to index an awkward data set using "arbitrary" keys - was an unfortunate choice in hindsight. Other customers were not affected as their applications used keys which were data hashes - and a blip merely recreated the same key..
</misty eyed mode>
I think I'll go with dual parity 14+2 RAID and at least understand how the data protection is working so I can enjoy my beer instead of stressing out wondering if the proof has a math error somewhere that will require a team of PhD's to uncover...
If my understanding of the introduction is correct, in a nutshell, it's adding extra parity protection. The way I *think* it works is this:
Imagine this is some data on your raid array, the vertical columns represent individual disks, the horizontal rows are stripes of data:
1 0 1 0 1
1 1 0 1 1
0 1 1 1 0
In RAID-5 or RAID-6, we protect against a disk failure by adding parity to the *horizontal* rows. This is good, it means we can protect against one or two total disk failures using these schemes, and for RAID-6, we can correct individual read errors even if we have one disk failed.
What the author is saying is that these schemes still can't cope with multiple read errors at the same addresses. So if we have a disk failure and two more read errors, we could lose data. Using our same block of data, this is what we've lost:
1 0 1 E 1
E E 0 E 1
0 1 1 E 0
Disk 4 has failed (the vertical line of errors), and we've also had read errors from two other disks. Due to some really bad luck, both of those errors are within the same horizontal stripe.
Now this is pretty rare, and we're talking lightning bolt strike while winning the lottery kind of rare here, but the paper seems to be saying that by using both vertical *and* horizontal parity, you can increase the level of protection.
From the above example, you can use vertical parity to correct the individual bit errors:
1 0 1 E 1
1 1 0 E 1
0 1 1 E 0
And now it's a straightforward job of using the 'normal' parity to complete the rebuild. Nice idea, not sure how much benefit it adds vs simply adding more traditional parity, but it's interesting at least.
With 30 seconds of effort I can see from LinkedIn and the IBM web site that Mario has an advanced degree in Mathematics from CalTech and that he's been working for IBM and Hitachi since 1985 researching error correcting codes and data reliability all that time, for organization for whom this topic is going to be important. Clearly someone who is going to have more that the usual familiarity in the field.
You choose to suggest that the way to understand this paper is to solicit the opinions of your readers rather than, say, calling or writing to Mario directly. I'm inclined to agree with your assessment of your cerebral limitations though perhaps for different reasons.
Nail. Head. On the. Hit.
I couldn't understand the paper as I don't have 90 per cent of the maths knowledge to do so - cerebral insufficiency. Mario couldn't explain it to me so I could understand it because of my limitations - so that would be no use in trying to get the paper's contents described on the Reg'.
This way is much better - and more fun.
ZRAID will detect the problems before you do, unlike conventional rubbish RAID 5 or 6, so will automatically move the data to good sectors, so no need for data correction; I also use ZRAID2 (much better than RAID6) because it is nice to have the extra genuine safety when a drive fails (one WD Red did, replaced in warranty); this is arguably faster and cheaper than mirrored RAID5.
Biting the hand that feeds IT © 1998–2019