A question on hashing
Thanks for the clear explanation Trevor.
One question though - how are hash collisions avoided within deduplication software - there are parallels with encryption/decryption here.
Steve
Two months ago hyper-converged infrastructure appliance startup SimpliVity sued its rival Springpath. I know next to nothing about the lawsuit – nor do I want to – but the whole thing has caused commentators and armchair analysts everywhere to ask what exactly SimpliVity does that is unique. I have spent the past two months …
Hi Steve,
Hash collisions are a pain. These can be avoided in any number of ways but usually boil down to using a hash space so large the chances are exceptionally unlikely you'll ever encounter one. (See here: http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html) Of course, this doesn't eliminate hash collisions which is why a almost nobody uses simple hashing anymore.
Each company has their own take, but a scary number of deduplication approaches really are just "hashing with some kludges to solve minor problems" (like collisions), where "solving" doesn't mean "eliminating entirely".
Actually coming up with really novel ways to do data efficiency didn't seem to become "a thing" until the latest round of the storage wars started about 5 years ago. For many government and enterprise buyers this can be a problem. They only buy from the legacy vendors, unless the new ones go through years of proving out. This can - and does - leave some ops teams in a bit of a pickle trying to deliver the full range of modern functionality. (Or, at least, trying to delivery it at prices that compete with public cloud computing, which isn't as feature rich or as secure, but it is the reality of what they are compared to. Damned if you do, damned if you don't.)
That being said, kludges aside, things like hash collisions can happen. So don't take your 10 year old SAN with it's primitive deduplication features and try to shove 1PB worth of data into it by "just adding some shelves". The bigger the space the more likely you're going to run into problems with implementations.
Mostly, if you refresh every 5 years you shouldn't have to worry. Also, if you use one of the new technologies that isn't hash reliant, you shouldn't run into these issues.
At 100TB you should be fine. Even at 500TB you probably run no risks worth mentioning. But if you are planning at any point to have a storage space larger than 1PB using one of today's storage technologies then demand to speak to a nerd and start asking exactly these sorts of questions of them. It isn't that hash collisions themselves are a likely risk, but you start getting into the scale where implementations don't quite match the theory and it's worth asking the questions.
And don't just look at what you plan to load your storage system with today. Think about the total life cycle of the system. 5 years in production and 5 years as an archival unit. Will you expand it to 1PB in that time? If so, be relentless in your inquisition. If you don't get the warm fuzzies from the answers you are given, move on.
There are lots of storage companies to choose from.
IANACE (I Am Not A Crypto Expert), so don't take this as the final word, but if I recall correctly hash collisions are a theoretical possibility with all hashing algorithms. It simply becomes a question of the size of the hash space as to how likely those collisions are.
"a bit by bit comparison if the hash matches, and store the blocks separately if it they are different."
Surely doing that (and the other unmentioned but essential bits) isn't a sign of paranoia, it's a sign that doing otherwise is stupid?
A storage system that returns some largely unexpected data with no relationship to the original except it has the same hash isn't really much of a storage system, in my book. But then maybe I'm a dinosaur. Dinosaurs are extinct. Marketing people and their equivalents on the purchasing side (techno-hipsters?) are still with us.
While I agree that the extra level of confidence supplied by the bit for bit comparison in hash matches is good for peace of mind, whether it is "stupid" to do otherwise depends on the expected collision rate.
Go read up the specs on your favorite hard drives and look at the specs for error rates. Not the ones for corrected or detected errors, but the ones for undetected read and write errors. Yes, those errors are extremely unlikely but if the hash collision was even more unlikely would you still feel you need to do the bit by bit comparison? What's the point, if the underlying hard drives were 1000x more likely to have an undetected error that gave you back false data? Heck, even ECC RAM has an undetected read/write error rate.
"read up the specs on your favorite hard drives and look at the specs for error rates. "
Many people are fully aware of the failure rates of hard drives. That's one of the reasons that some people choose to use disk mirroring, or some flavour of RAID: to protect valuable data against random hardware failure. You usually can't predict exactly when a hardware failure will happen but you can be confident that one will happen sooner or later.
"if the hash collision was even more unlikely [...]"
If the de-dupe system doesn't handle hash collisions in a sensible way, and someone is unlucky enough to get a collision, there is no recovery, the original data is gone forever and the alleged data on "disk" will be wrong forever with no recovery, whereas if someone gets an inevitable actual disk error, it doesn't need to matter, precautions are easy (RAID etc) and in case of desperation there are data recovery services that can sometimes help.
You're comparing apples with oranges, random hardware problems with "defective by design" systems.
Badly done de-dup may work for Google's search engine data and a handful of other applications where nobody cares much if the results are a bit (or a lot) wrong occasionally.
Badly done de-dup should not be relevant in the run of the mill applications where it matters that the data you get back from your storage subsystem really is the data you originally put in.
My post had NOTHING to do with hard drive failures. I was talking about silent data corruption. Two totally different things, and mirroring won't help - because 1) 99.9% of mirroring implementations only read one side of the mirror and thus would have a 50/50 chance of using the "bad" data and 2) even a super paranoid implementation that read both and compared would require triple mirroring to overcome this error, or at least try to re-read the same sector (forced physical read) and hope the read gets the correct data this time.
ECC isn't that good a comparison. First, RAM (ECC or not) is in general only used for transient data, not persistent storage. Usually the work-in-progress data can be recreated somehow if something goes noticeably wrong - you re-load the program, retype the document, re-run the transaction from the logs, re-record the video, or whatever. If not, then you either need to live with it or use something reasonably robust, e.g. ECC RAM.
ECC RAM is fantastically good at detecting errors, and pretty good at correcting them. Undetected errors can still occur, albeit with a remarkably low probability, which is why people who really care (e.g. financial institutions) use systems like Tandem NonStop, that compare the outputs of two independent computers linked "as one", working on the same data at the same time.
Note also that ECC RAM errors are typically unpredictable, they are not data dependent. If you run the same program set multiple times on ECC RAM with a minor fault, it'll typically fail in different places on each run. Correctable errors (which is most of them) will be invisible, uncorrectable errors (much lower probability) will be detected and action can be taken accordingly, and in the event of an undetected error (even lower probability) your data is screwed.
If you run the same dataset multiple times through a de-dup system with a defective design in respect of hash collision, the same data will fail in exactly the same place every time. And it will not tell you that it has failed. If that's what you want, fine. Just tell readers where you work so we can steer clear.
One size does not fit all.
When the hashes are the same, and Netapp or ZFS compares and saves both, does it change the contents of one slightly to change the resulting hash, does it save some metadata note about which is which, put them in different parts of the file structure and link them to their correct sets...? How does it cope? Apologise for the stupid question, this is all beyond my ken really, but fascinating, I'd like to know how it gets resolved.
It will store the files twice, a link to them appropriately. The de-dup table must be able to take both locations of the same hash though as the older flecther4 hashing algorithm used just for checksum would cause problems if de-dup turm on for that pool.
more info can be found at https://blogs.oracle.com/bonwick/entry/zfs_dedup
e.g.
Trust or verify?
If you accept the mathematical claim that a secure hash like SHA256 has only a 2\^-256 probability of producing the same output given two different inputs, then it is reasonable to assume that when two blocks have the same checksum, they are in fact the same block. You can trust the hash. An enormous amount of the world's commerce operates on this assumption, including your daily credit card transactions. However, if this makes you uneasy, that's OK: ZFS provies a 'verify' option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if not. To enable this variant of dedup, just specify 'verify' instead of 'on':
it is reasonable to assume that when two blocks have the same checksum, they are in fact the same block.
That is not reasonable.
An m-bit hash may take 2^m different forms. An n-bit block may take 2^n different forms. So if n > m, there are 2^(n-m) different blocks that would generate the same hash. Nothe that n<=m is an entirely useless operation, so we have this risk.
This is the Counting Argument. It's used to disprove "magic" lossless compression algorithms that will store anything in nothing...
Vic.
That is not reasonable.
The git tool uses large digests (sha-1) to identify all commits. This is a 160-bit hash and apparently you'd need 2^80 random messages to get a 50% chance of a collision.
I recall reading that git hasn't any safeguards or double-checks to see if two commits hash to the same value. So Linus was just trusting the maths and making a calculated risk. For his purposes, at least, the chance of collision was so vanishingly small that's it's a risk that he was willing to take (for the sake of simplifying the code, one assumes).
I get what you're saying about the pigeonhole principle, it's not always unreasonable to assume that collisions won't happen.
I don't know what you're trying to say, there, Vic. By your logic, git having less entropy in the source should imply less entropy in the output hashes, which would in turn imply more collisions.
Anyway, entropy is irrelevant since the point of a good (cryptographic) message digest is to decorrelate patterns in the input from those in the output hash. Entropy measures over the set of output hashes should be the same regardless of input (again, for a "good" hash algorithm).
I'm just making the point that while you can't get away from the pigeonhole principle, if you have vastly more holes than pigeons, you can put some sort of bound on what's an acceptable probability of collision and design your application with the assumption that this risk is "vanishingly small enough".
It's all a trade-off, like Trevor explained in the article and in his post above.
"the point of a good (cryptographic) message digest is to decorrelate patterns in the input from those in the output hash."
The point of a good cryptographic message digest (hash) is (amongst others) to simplify detection of whether a message has been tampered with in transit.
If the hashes are different, the message has been tampered with. 100% confidence.
If the hashes are identical, the message may still have been tampered with. You don't *know* either way unless you refer to the source, which is in many cases too tedious to contemplate.
Same goes for checksums and retries in TCP etc.
If the checksums are different, the messages are different (the received data does not match the transmitted data). 100% confidence.
If the checksums match, the received data still may not match the transmitted data, but there is a level of confidence which is greater than zero (and significantly less than 100%).
By your logic, git having less entropy in the source should imply less entropy in the output hashes, which would in turn imply more collisions.
No, not at all.
The set of possible inputs which could generate a given hash is much reduced by the requirement that the inpout in question is source code (with a limited character set). The same cannot be said for the general case of a file.
this risk is "vanishingly small enough"
And I'm, saying that, in the general case, this is simply not true. Only if you constrain your input set - such as is the case with git - can you make it so.
Vic.
OK, this is my final post on the question, Vic. I promise.
I understand better now the point you're making about "entropy". You're saying that, eg, if I hashed all the natural numbers, represented in ascii, in a range (let's say they're zero-padded up to some minimum block length to make a fair comparison) that there will be fewer collisions than if the blocks all contained random 8-bit data. I think that a well-designed hash won't have significantly different collision rates over the two kinds of data for a given capacity range (ie, the number of messages to be stored).
And I'm, saying that, in the general case, [that the risk is "vanishingly small enough"] is simply not true
Mathematically speaking, you're absolutely right. Any hash can and will have collisions as proved by the counting argument/pigeonhole principle (or birthday paradox). All I'm saying is that in practice assuming that the same hash implies the same contents can be a reasonable engineering assumption. As evidenced by how git does it.
The assumptions that git makes on the actual hash size (160 bits) and expected number of commits (and, if I were to accept it, a factor to account for "entropy") aren't going to hold for a massive block-based de-dupe system, but you can plug the numbers into some formulas to find the expected collision rate and choose your digest size to make the risk "low enough" that it's not worth worrying about (eg, a mean time to collision of 50 years, if that's what you want).
I think that a well-designed hash won't have significantly different collision rates over the two kinds of data for a given capacity range (ie, the number of messages to be stored).
But we already know that the hashes used in git were not specifically designed for text data; thus we can expect that the data blocks that will cause a hash collision with our desired block will be spread randomly throughout the possible input space. If we now dramatically subset that input space (by requiring our input to be text data), we have ipso facto discarded all those possible collisions form data blocks that were not text data - thus we have a much reduced number of possible collisions, and thus a much reduced probability of collision.
Within the filesystem view, we cannot do that input selection; any possible block variant is permitted. Thus we cannot achieve that probability reduction, and the process is much less safe.
Mathematically speaking, you're absolutely right.
I know.
All I'm saying is that in practice assuming that the same hash implies the same contents can be a reasonable engineering assumption
And I'm saying that is not a reasonable assumption in the general case. I have outlined my reasoning above.
you can plug the numbers into some formulas to find the expected collision rate and choose your digest size to make the risk "low enough" that it's not worth worrying about
For the general case without subsequent block-checking? The hash is required to be of at least the same size as the block. And that is not exceptionally useful...
Vic.
OK, so I'm going back on my promise to not write any more, but ...
1. Does git's normal input type lead to fewer collisions?
Your line of reasoning about the structure of git's input leading to fewer collisions is pure conjecture. There is no "ipso facto" about your conclusions. You say that subspacing the input space discards potential collisions, but you neglect to consider that it also removes many non-colliding inputs, too. For your argument to work you'd have to explain why subspacing preferentially removes more colliding inputs, proportionately speaking.
In fact, the input to hash algorithms are (in the class of) infinite strings, so the normal rules of logic when dealing with finite numbers simply don't apply. Half of an infinite space is still an infinite space. In those terms, it's hard to see how your counting argument can hold any water on this point.
2. Is assuming hashes are collision-free a "reasonable" assumption in a block-centric app?
I believe that it is, but you have to plug the block size and digest size into the formulas for the birthday paradox so that you get an acceptable collision rate for a given capacity/occupancy (ie, how many blocks you intend to store).
A simple analysis should be able to convince you that doubling the number of possible hash buckets (ie, adding one more bit to the digest) will more than halve the collision rate for practical occupancy levels (which is a minuscule fraction of the hash space). This takes it out of the realm of being a question of pure mathematics and turns it into an applied maths/engineering question. And that's why I'm saying that it's an entirely reasonable assumption to make if you pick the right hash size for the problem at hand.
The actual formula to determine the probability that there is no collision, with a hash of h bits and an occupancy level o is:
(^{2 h} P_{o}) / 2 ^{ho}.
where ^{x}P_{y} is the Permutation function.
I submit, without proof, that as h increases (with o fixed), the P term will increase faster than twice the rate of increase in the other term, assuming that o is sufficiently small. (in fact, it should increase much faster). Thus we can make the collision probability (1 minus the above) arbitrarily small by making a linear increase in the number of hash bits.
Basing the assumption that a well-chosen digest algorithm will be (practically speaking) collision-free on the above is, I am sure, completely justified.
OK, so I'm going back on my promise to not write any more
Well then, you're going to have to start doing your own research. This is the last time I'm going to spoon-feed you.
For your argument to work you'd have to explain why subspacing preferentially removes more colliding inputs, proportionately speaking.
We don't care about "proportionally", we care about "absolutely".
For a given hash & block size, there are a finite number of blocks that will cause collisions in a given hash. By removing some of that finite set, we have fewer potentials for collision. It is that simple.
It is clear that collisions are a problem in the general case - we know the number of collisions for any particular block by way of the Counting Argument. The only way this cannot be improved by llimiting the input set to textual data is if the hash were designed to maintain the same bit spectrum for similar input data. This would make it a very poor hash. And so, the number of possible colliding blocks is necessarily reduced by limiting the inpuit dataset. LIke I said.
And from this point on, you're on your own.
Vic.
Well then, you're going to have to start doing your own research. This is the last time I'm going to spoon-feed you.
I have presented my research.
For a given hash & block size, there are a finite number of blocks that will cause collisions in a given hash. By removing some of that finite set, we have fewer potentials for collision. It is that simple
Yes, but your argument was about git, not fixed-sized block. I have pointed out that we are not dealing with finite sets there. Thus, your counting argument is fallacious.
It is clear that collisions are a problem in the general case
And equally clearly (actually, more so), I gave you the equation for quantifying the collision rate and outlined a simple proof that the error rate can be made arbitrarily small for practical input parameters.
I don't know why you have such a problem with understanding this.
I also don't know what is your hang-up with this "but in the general case" line of argument. We agree on the pigeonhole principle (hash collisions must exist) and I think we can both agree that the analysis of the birthday paradox is apropros. That is the general case and I'm confident that I've analysed it correctly and that it vindicates my argument. Of course, I left some small amount of work for you to do to verify that what I said is correct, but that's simple high-school maths.
If you do decide to argue further, please make it clear whether you're arguing about git or block-level hashing. And don't try to bring a fallacious argument from one (ie, git) across to bolster your argument (such that it is) in the other. Thank you.
If the hash is different between two blocks, you know (100% certainty) that the blocks are different.
If the hash is the same for two blocks, you have no certainty whether the data is the same in the two blocks.
To be certain whether the data is the same you have to look at the data. All of it - or at least until you find a difference, at which point you know the blocks are different. If you reach the end of a block without finding a difference, THEN you *know* the blocks are the same.
Any claim beyond that is statistical mumbo jumbo or vendor hype or both.
Hash tables themselves are old technology, dating back at least to the era when 64kB was all a computer could address. Assuming the storage industry hasn't redefined the meaning of hashing, of course.
Hash tables were primarily used for fast lookup - do I already have this value in my table, or (more usually) not have it. There's plenty of stuff out there describing how they work in general.
Knowledge of the hash value and the hash algorithm is not in general sufficient to re-create the input data.
Hashing has been repurposed by the de-dup industry - do I already have a matching chunk of data in the store? If the hash is different, as it frequently will be, it is 100% certain that this data block is not there already (otherwise the hash would exist already in the look up table), so I have to write it to persistent storage, enter the hash value in a lookup table, and make a note of where I wrote it so I can read the data back later when something else has that hash value (yes I know that's massively oversimplified, but it doesn't matter here).
If the hash is the same, the matching data *may* be there already, but only a fool would think that they *know* the data is the same, or that they can re-use the existing data without fear simply because the hash matches. Sadly the IT industry has no shortage of fools.
As I understand it for the netapp, only newly written blocks are looked at by the dedupe process. If a block has the same hash but different contents to an existing block, then I think it just marks it as having been processed, and thereafter no longer considers it a newly written block.
There's an initial pass over the whole volume when turning dedupe on, but the same principle applies - if the hashes match, and the contents match, it gets deduped, otherwise it's marked as having been processed and isn't looked at again until the contents change.
"it is reasonable to assume that when two blocks have the same checksum, they are in fact the same block. You can trust the hash."
Rubbish, unless you're willing to accept that (e.g.) your mail may occasionally go to completely the wrong location, your financial transactions may occasionally have the wrong recipient, etc.
Oh, hang on...
If zfs and netapp have options to compare the whole block when a hash collision occurs is there any reporting on how often the collisions occur? A faster hash routine might have some collisions but if those collisions are handled properly then an overall speed up may happen.
Few of my friends in the Netapp's dedup team used to mention that it is extremely rare to hit hash collisions and then fail to dedup the blocks. Still, Netapp didn't trust the finger prints (hashes). If we believe in the hashes and dedup two totally different blocks, just because hashes are a match, it leads to corruption of file system (A file content corruption in which a file system would never be able to recover from and only application knows that the file is corrupted)
But, I also hear the other side of the argument. As it is extremely rare to hit the case "hashes match but not block's content" and probability of hitting that case is smaller than the probability of disk errors, it is OK to trust the fingerprints/hashes. But, I don't believe in that argument as file systems can usually recover from disk failures. But, an induced corruption just because you believed in fingerprints is not.
The analogy did put me in mind of dictionary-based compression (including LZW), too, as well as Merkle trees (for the sync problem).
The idea of re-using dictionaries sounds good, but you need good algorithms and data structures to make it tractable. You can't just blindly try compressing all blocks against all dictionaries to find the best compression ratio. That would take forever. I think that some clever tree-based/recursive construction (probably along with Bloom filters or maybe Rice-Golomb encoding) and some dynamic programming could make this work.
Yep; Merkle trees popped into my head as well.
It sounds like a hierarchical type of dictionary combined with hash trees for sync; really neat idea, but not sure how it's dedupe in the dedupe sense as opposed to some sort of inline compression - or a fancy type of RLE.
Might be me just semantically nitpicking.