Could a hard drive dedupe data?

Hitachi GST president Steve Milligan says one of the drivers affecting the hard drive industry is the need for efficient storage with technologies like virtualisation and deduplication. What is he on about? He presented at a recent Needham conference for HDD investors and said that the storage market was driven by three things …

COMMENTS

This topic is closed for new posts.
  1. M. Burns Silver badge
    Boffin

    What the drive industry needs

    Is to start offering drives that are internally some form of high-level RAID, but that "look just like a standard drive" to the OS. I'd pay a premium for that type of almost bullet-proof drive in my laptop.

  2. Anonymous Coward
    Boffin

    Filesystem-level deduping

    If you use a file system that uses hash-based block addressing, deduping is a non-issue. These systems can also provide cryptographic guarantees of file integrity with negligible performance overhead for many applications, and in some cases they provide performance improvements due to reduced duplication of blocks within and between files.
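
A minimal sketch of the scheme this poster describes, for illustration only: blocks are addressed by a hash of their contents, so writing identical data twice stores only one copy. The class and file names, the SHA-256 choice and the 4 KB block size are assumptions, not anything from the article.

```python
# Minimal sketch of a content-addressed (hash-based) block store: blocks are
# keyed by the hash of their contents, so a second write of identical data
# stores nothing new. SHA-256 and the 4 KB block size are assumptions chosen
# for illustration.
import hashlib

BLOCK_SIZE = 4096  # assumed dedupe block size


class HashAddressedStore:
    def __init__(self):
        self.blocks = {}   # hex digest -> block bytes
        self.files = {}    # filename -> ordered list of block digests

    def write(self, name, data):
        digests = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # Identical blocks hash to the same key and are stored only once,
            # so deduplication falls out of the addressing scheme itself.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])


store = HashAddressedStore()
store.write("a.bin", b"hello world" * 1000)
store.write("b.bin", b"hello world" * 1000)   # identical content
assert store.read("a.bin") == store.read("b.bin")
print(len(store.blocks), "unique blocks stored for two identical files")
```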

  3. Austen
    Joke

    binary dedup

    1, 1, 1, another 1, oooh another 1, geez half this drive is just 1's...

  4. Lionel Baden

    only downside

    the drive capacity marketing dept will go apeshit with bullshit claims

    Meh

    Our drive holds 50 petabytes!!!!!! *

    * of the same empty text file, not real-life data usage

  5. Anonymous Coward
    WTF?

    Make 'em work first!

    How about you make drives that work OK for more than 5 mins when they come out of the factory? Eh? Yes, Seagate, I am looking at you!

    Rather than faffing about with all this DD codswallop, let's get the latest 2TB whoppers to actually run for more than 6 months without crapping out. In the bad old days you'd buy a drive and it would run and run until you sold it; nowadays you're lucky if they last 9 months, tops!

  6. PG 1

    re: "hash-based block addressing, deduping is a non-issue"

    Oh dear... hash algorithms have collisions. You can't rely on hashing alone. Hashing can only tell you that there *might* be a dupe - you then need to compare both blocks bit-by-bit to ensure that they are really duplicates.

    However, because you can't stick a TB of RAM in an HD (or can you?) you can't keep all those blocks in memory, so you need to read the potential duplicate back from disk and do the compare before performing the write. That is likely to be a considerable performance drag - maybe acceptable for the rare-ish dedupe-write events.

    Minimally, a 1TB drive will need approx 1GB of RAM onboard to maintain the block hashes for lookup. And what happens after a reboot? Do you have to read the entire drive back to rebuild the block-hash lookup table before allowing any writes again?

    Seems a little optimistic to me.
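
A quick back-of-the-envelope check of the "approx 1GB of RAM for a 1TB drive" figure above. The block size and per-entry sizes are assumptions chosen for illustration; the point is only that the lookup table scales with drive size divided by block size.

```python
# Back-of-the-envelope check of the "~1GB of RAM per 1TB drive" figure.
# Block size and per-entry sizes below are assumptions for illustration.
DRIVE_BYTES = 1 << 40        # 1 TiB
BLOCK_SIZE = 4096            # assumed 4 KiB dedupe block

blocks = DRIVE_BYTES // BLOCK_SIZE   # 2^28 = 268,435,456 blocks

for label, entry_bytes in [
    ("4-byte truncated hash only", 4),
    ("8-byte hash + 4-byte block address", 12),
    ("full SHA-256 digest + 8-byte address", 40),
]:
    table_gib = blocks * entry_bytes / (1 << 30)
    print(f"{label}: ~{table_gib:.0f} GiB of lookup table")

# Prints roughly 1, 3 and 10 GiB: "about a gigabyte" is the optimistic end,
# and keeping full digests costs an order of magnitude more.
```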

  7. rch
    Thumb Down

    Dedupe should be as global as possible

    For dedupe to be efficient you need lots of data - a lot more capacity than a single hard drive can offer. A file server with a mechanism that can dedupe across tens or hundreds of drives will achieve far greater space savings than a single drive can.

    Hardware compression, on the other hand, would be quite beneficial for a single drive. Tape drives do this with success, as the performance of the drive actually increases with the compressibility of the data.

  8. Trevor Pott o_O Gold badge
    Pint

    @Anonymous Coward 16:50

    Hear hear!

  9. Pascal Monett Silver badge
    Thumb Up

    @ M. Burns

    Now that is a great idea. I second it.

  10. norman
    FAIL

    Dedupe?

    My thought is that the deduplication engine belongs off the hard drive; hard drives need more speed, not more overhead.

    NetApp and EMC (I assume EMC) can deduplicate iSCSI and Fibre Channel LUNs in a regularly scheduled process.

    I give a self-deduping hard drive a fail.

  11. norman
    Flame

    A clever plan....

    Data is all ones and zeros....

    Dedupe it down to a single one, zero doesn't count anyway...

  12. Anonymous Coward
    Anonymous Coward

    RE:PG1

    You clearly don't understand the probabilities involved in hash collisions. Using an SHA-256 hash, you would need to generate around 2 x 10^12 times the amount of data projected to be created in 2010 (around 1 zettabyte) for there to be a 1 in 10^18 chance of a collision, assuming the hashed data blocks are 4KB in size (they would likely be larger, making a collision even less likely).
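
The figures above can be sanity-checked with the usual birthday-bound approximation, P(collision) ~= n^2 / (2 * 2^256) for n randomly hashed blocks. This is only a rough sketch; the 4 KB block size and the 1 ZB baseline come from the comment itself.

```python
# Rough check of the figures above using the standard birthday approximation
# P(collision) ~= n^2 / (2 * H), where n is the number of blocks hashed and
# H = 2^256 is the number of possible SHA-256 digests.
from math import sqrt

H = 2 ** 256                # distinct SHA-256 outputs
target_p = 1e-18            # collision probability quoted in the comment
block_size = 4 * 1024       # 4 KB blocks, per the comment
zettabyte = 1e21            # ~1 ZB projected to be created in 2010

n_blocks = sqrt(2 * target_p * H)    # blocks needed to reach target_p
data_bytes = n_blocks * block_size
print(f"{n_blocks:.1e} blocks, i.e. {data_bytes / zettabyte:.1e} zettabytes")
# Prints roughly 2e+12 zettabytes, which matches the 2 x 10^12 figure above.
```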

  13. Hugh McIntyre

    @ M Burns and @ PG 1

    @ M Burns

    RAID inside a single disc enclosure is not such a good idea:

    - You can't replace one side of a failed mirror and say "rebuild". You'd have to replace both sides and then presumably copy the data over.

    - A failed controller is common and kills the whole RAID set.

    However, perhaps SSDs will alter what RAID means, given that you have a collection of many flash chips rather than two or so spindles.

    @ PG 1 ("hash functions have collisions")

    Yes, if you use a weak hash function. As it happens, though, the OpenSolaris ZFS folks have recently been adding de-dup, and the argument is that something like SHA-256 puts the collision probability at something like 1 in 2^88 to 2^100, even assuming 2^64 bits of storage in future and 1MB block sizes. This assumes the hash function has very good cryptographic randomness (which most people accept for SHA-256).

    Since this is on a single disk rather than a whole filesystem, it will take even longer into the future before you can buy 2^64 bits on a single disk - that's about 2 million TB. It also means that filesystem-level de-dup across multiple disks will probably give better de-dup results.

    Now, the Solaris folks have added "bitwise verify" as an option for paranoid people, but I think it's not on by default for SHA-256. You may even get a prize if you get a collision out of SHA-256, although clearly it's possible.
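
A minimal sketch of the write path being debated: drop a block on a hash match, with an optional byte-for-byte check in the spirit of the ZFS "verify" option mentioned above. This is not real ZFS code; the class, method names and 4 KB block are invented for illustration.

```python
# Sketch of a dedupe write path: drop a block on a hash match, with an
# optional byte-for-byte check in the spirit of ZFS's "verify" option.
# Not real ZFS code; names and structure are invented for illustration.
import hashlib


class DedupTable:
    def __init__(self, verify=False):
        self.verify = verify
        self.table = {}                  # digest -> stored block bytes

    def write_block(self, block):
        digest = hashlib.sha256(block).digest()
        existing = self.table.get(digest)
        if existing is not None:
            if not self.verify or existing == block:
                return digest, False     # duplicate: nothing new stored
            # Same digest, different bytes: an actual SHA-256 collision.
            # A real system would store the block separately; the sketch
            # just flags it, since this is astronomically unlikely.
            raise RuntimeError("SHA-256 collision detected")
        self.table[digest] = block
        return digest, True              # new block stored


fast = DedupTable(verify=False)          # trust the hash alone
paranoid = DedupTable(verify=True)       # hash match, then compare bytes
for table in (fast, paranoid):
    table.write_block(b"A" * 4096)
    _, stored = table.write_block(b"A" * 4096)
    assert stored is False               # second identical block deduped
```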

  14. James Halliday
    Alert

    What I want

    is a combination disc and SSD unit. I don't want to see two drives, though - I just want to see one big disc volume. What I want the drive to do is dynamically cache commonly accessed files onto the SSD (e.g. all that OS gubbins that loads when I start it up).

    The current strategy is to buy an SSD for your primary boot drive and then add a second large disc for your 'storage'. That's just inefficient, as you have to buy an SSD you know to be larger than your boot volume will require (so you cough up a fortune for gigs of flash you probably don't need). It would also greatly help laptops, where there isn't enough physical space for two drives and you currently have to choose between speed/battery life and storage.
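
A very rough sketch of the hybrid-drive behaviour described above: count reads per block and serve the hottest blocks from a small flash cache, with everything else coming off the platter. The cache size, promotion threshold and callback interface are all invented for illustration; real hybrid-drive firmware would be far more involved.

```python
# Rough sketch of the hybrid-drive idea: count reads per block and serve the
# hottest blocks from a small flash cache. The cache size, promotion
# threshold and callback interface are invented for illustration.
from collections import Counter


class HybridDrive:
    def __init__(self, flash_blocks=1024, promote_after=3):
        self.flash = {}                  # LBA -> cached block data
        self.flash_blocks = flash_blocks
        self.promote_after = promote_after
        self.reads = Counter()           # LBA -> read count

    def read_block(self, lba, read_from_platter):
        self.reads[lba] += 1
        if lba in self.flash:
            return self.flash[lba]       # fast path: served from flash
        data = read_from_platter(lba)    # slow path: spin the disc
        if (self.reads[lba] >= self.promote_after
                and len(self.flash) < self.flash_blocks):
            self.flash[lba] = data       # promote a frequently read block
        return data


drive = HybridDrive()
platter = {lba: bytes([lba]) * 512 for lba in range(10)}  # fake platter
for _ in range(5):
    drive.read_block(3, platter.__getitem__)
print(3 in drive.flash)    # True: block 3 now lives on the flash side
```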

  15. Chris007
    Joke

    @Trevor Pott o_O

    What can you hear?

    Here here!

  16. PG 1
    FAIL

    @AC and @Hugh McIntyre

    I understand the probabilities very well. What you both don't seem to realise is that *any* probability of collision > 0 for hard drive data means that it is unacceptable for use as a method of deduplicating data.

    You both seem to be arguing "it is so unlikely and so rare, it'll never happen" - but what if that block happens to match something crucial, from something in an .mp3 file to your company's core database - or what about your bank account? A dedupe collision in any of those will cause pain from "this song won't play anymore" to "what do you mean my bank account is empty?". You won't think it is acceptable then.

    Again, I repeat: all hashes by definition have collisions (you can't fit a billion inputs into a million spaces - mathematically you have a minimum of 1,000 possible inputs for each hash, assuming an equispaced distribution, and it gets worse if it isn't equispaced). Regardless of how likely a collision is, you can't use a hash for anything more than indicating that there MIGHT be a duplicate block, and then act accordingly.
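
The counting argument can be made concrete. Taking a 4 KB block as an example (the block size is an assumption, not something PG 1 specifies), the pigeonhole principle gives:

```latex
% Pigeonhole count for SHA-256 over 4 KB blocks (block size assumed for illustration).
\[
  \underbrace{2^{8 \times 4096} = 2^{32768}}_{\text{distinct 4 KB blocks}}
  \;\gg\;
  \underbrace{2^{256}}_{\text{SHA-256 digests}}
  \quad\Longrightarrow\quad
  \text{on average } 2^{32768-256} = 2^{32512} \text{ blocks share each digest.}
\]
```

So collisions certainly exist; the debate in the following comments is only about how likely you are to hit one.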

  17. Hugh McIntyre
    Thumb Up

    @ PG 1

    Re: "What you don't seem to realize ... Regardless of how likely a collision is, you can't use a hash for anything more than indicating that there MIGHT be a duplicate block and to act accordingly."

    I realize perfectly well.

    The argument is that once you get out to 10^-77 or so, the rest of your computer is not nearly that reliable, so you could also end up with the block-by-block compare giving you the wrong result, or a phantom disk write to the wrong block, or a virus overwriting your file, or many other failures (also including things like you getting hit by a bus and not needing the file any more). I.e. it's not that there's no chance of a collision, but it's also possible that the computer fails or gets hit by a natural disaster.

    Now, personally, I'd probably use this at the filesystem level and would quite probably turn on the "verify" option, or at least want to be able to say "the following critical files should be golden and not just marked as dups of other blocks, but some other less critical files can have dup-processing" - partly in case of unexpected non-randomness in the hash function, and partly because performance isn't so critical at home.

    But the key point is that arguing "risk of false dup > 0" misses the point that the risk of your computer failing is non-zero (although small) for many other reasons, so you still have risk even with full verification - especially on consumer-grade hardware, which does not bother with things like ECC because of cost.
