ZFS offers deduplication for free
soon. It's on the way. :o)
IBM is adding replication to its ProtecTIER deduplication product, which it acquired by buying Diligent in April 2008. ProtecTIER products will be able to replicate deduplicated data to a remote site and thus reduce any needs customers might have to ship tapes to a remote site for disaster recovery. Transmitting deduplicated …
Um, with replication, you're primarily sending newly created blocks... There really should not be a large amount of data to dedupe in NEW data! Yes, there's some, but mostly that's from people flinging files around in e-mail and copying them to multiple personal folders scattered all around the company's servers. That is handled by simply configuring workspaces and restricting direct delivery of internal attachments within the mail system... That's not only free, but reduces the storage burden on the mail system and the throughput load on those servers, saving money too.
We have about 3,000 servers here. We're currently testing technologies from 3 different SAN vendors that support storage virtualization, dedupe, and replication. Thus far, we've discovered that thanks to our architecture, excluding OS and application volumes and considering only data, we can only save about 6% using deduplication. The cost of the dedupe licensing in each case exceeds the cost of even an additional 10% storage.
Since we have a vast system imaging and deployment methodology, we don't back up or store server boot and app drives on SAN, only data. Our servers are not "recovered", they're simply re-imaged if they crash, on new hardware or a VM box, so making OS backups is unimportant. Also, a large number of our systems are non-Windows, and virtualized in shared-binary virtual machines (the ultimate form of dedupe).
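To put the 6% versus 10% comparison above into numbers, here is a rough break-even sketch. The function name, the 100 TB capacity and the per-TB price are all made-up placeholders, not figures from any vendor; only the 6% savings rate and the "licence costs more than 10% extra storage" relationship come from the post.

```python
def dedupe_net_benefit(capacity_tb, savings_fraction, cost_per_tb, license_cost):
    """Return (value of storage saved, net benefit after paying the licence).

    All inputs are hypothetical illustration values supplied by the caller.
    """
    saved_tb = capacity_tb * savings_fraction
    saved_value = saved_tb * cost_per_tb
    return saved_value, saved_value - license_cost


# Assumed figures: 100 TB of data at 1000 currency units per TB.
capacity_tb = 100
cost_per_tb = 1000.0

# Lower bound on the licence price: the cost of 10% more raw storage.
license_cost = capacity_tb * 0.10 * cost_per_tb

saved_value, net = dedupe_net_benefit(capacity_tb, 0.06, cost_per_tb, license_cost)
print(f"storage saved is worth {saved_value:.0f}, net benefit {net:.0f}")
```

With those assumptions the net benefit comes out negative: dedupe saves less storage than the licence money would simply buy.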
Question: Does ZFS offer memory-based inline deduplication for free at >900MB/sec? Is it hash based?
I have no clue. ZFS dedup was recently announced in a talk. That's all I've heard. I think that dedup has been integrated into the ZFS code now.
But I do know that the more drives you use, the higher the bandwidth. If you use 46 SATA 7200 rpm drives, you reach 2-3 GB/sec read speeds. That is >900MB/sec. But I don't know how dedup will affect that. I guess if you have a fast enough CPU it should be no problem, as ZFS uses no hardware RAID controller cards. Everything is done on the CPU.
That was my point, really.
Diligent (ProtecTIER) does >900MB/sec dedupe today, off two boxes of commodity hardware clustered together. It scales hugely (1PB, off the shelf) because it's less limited by memory scaling problems than "traditional" (if there is such a thing) hash-based dedupe algorithms are.
Yes, ZFS will do dedupe for free (if you consider storage I/O, processor and RAM to be free). Diligent isn't free, but it's more effective than the mooted ZFS dedupe will be anyhow.
Forgive the slightly combative way of asking the question, I've just finished writing a whitepaper on this stuff, and comparing hash-based dedupe to Diligent's fingerprinting approach is sort of like comparing the ark to the Ark Royal.
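For anyone unsure what "hash-based" dedupe and its memory scaling problem look like in practice, here is a minimal sketch. It is a generic fixed-block scheme, not how ZFS or ProtecTIER actually work; the 4 KB block size, SHA-256, the 40-byte index entry (32-byte digest plus an 8-byte location) and all function names are assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size in bytes


def dedupe(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed blocks and store each distinct block once."""
    index = {}    # digest -> position in store (this is what must fit in RAM)
    store = []    # unique blocks, in first-seen order
    recipe = []   # store positions needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        position = index.get(digest)
        if position is None:          # first time this content is seen
            position = len(store)
            store.append(block)
            index[digest] = position
        recipe.append(position)
    return store, recipe, index


def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from unique blocks and the recipe."""
    return b"".join(store[i] for i in recipe)


def index_bytes(capacity_bytes: int, block_size: int = BLOCK_SIZE,
                entry_bytes: int = 40) -> int:
    """Rough index size if every block in the given capacity were unique."""
    return (capacity_bytes // block_size) * entry_bytes


data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
store, recipe, index = dedupe(data)
print(len(store), recipe)
print(index_bytes(10**15))
```

Under these assumptions, indexing 1 PB of unique 4 KB blocks needs an index of nearly 10 TB, which is the sort of memory scaling problem being described: the lookup table outgrows RAM long before the disks fill up.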