Windows Server 8 is coming, and it is bringing storage enhancements with it. Data deduplication in particular has caught my eye: it is something I have wanted on my Windows file servers for a long time. The technology is nothing new; ZFS has had deduplication for a while now, and it is (experimentally) available …
Windows Server 8 IS coming...
But the link you provided goes to a general Windows 8 blog -- lots of client stuff, not much server stuff. Might I suggest
Not as common/ready as you may think...
The deduplication functionality in btrfs is completely useless: it is at a pre-experimental stage, and it isn't synchronous like every other sensible dedupe implementation (and, judging by the discussions I had on the mailing list, it likely never will be). So don't bank on that. ZFS, however, has been available on Linux for a while, with both kernel-level and FUSE implementations.
Another option for deduplication is lessfs (fuse based).
Another thing worth mentioning is that synchronous deduplication will always be cheaper in total CPU and disk I/O than asynchronous deduplication, because with the asynchronous approach you end up doing the disk I/O twice. Conceptually, this is similar to the parity RAID (5, 6) write hole (write-read-checksum).
online dedup on btrfs is not impossible. Doing offline dedup on btrfs is just much much easier.
Linus once said that Linux would never have any real-time OS capabilities, as that would require a complete redesign of the OS. Fast forward a few years, and that "complete redesign" has been achieved incrementally; Linux keeps gaining features typical of an RTOS.
If we take that view, Linux won't be leading the field but trailing it as a back marker. That is not an acceptable position. The ZFS port is the best thing to happen to Linux file systems in a long time. btrfs is so far behind in both design and implementation that it isn't going to approach ZFS any time soon.
btrfs has two advantages over ZFS: volumes can be shrunk as well as extended, and the on-disk format is much easier to extend.
Ability to set redundancy level on a per-directory and per-file basis will be a killer feature for me though...
But, yes, unfortunately, OpenSolaris ZFS is much more feature-complete than btrfs on Linux. btrfs is much more feature-complete than ZFS on Linux though.
Short and sweet. The article covers most of the pluses and minuses of dedupe in a straightforward and digestible manner. The point about backup is a very important one, and I expect lots of people will get a headache when their 1TB live system becomes a 12TB backup image!
Dedup the backups too
"The point about backup is a very important one and I expect lots of people will get a headache from their 1TB live system becoming a 12TB backup image!"
Of course you could always dedup your backups too. This enables longer retention of more backups on disk meaning quicker restores.
Some vendors even support dedup to tape although just because you can doesn't mean you should, and restores would be horrendous.
The other option is to start archiving data out of your production email and file systems so that you don't end up backing up old data over and over again.
And increase the risk of your backup failing
As mentioned in the article, a deduped backup with a single dodgy media can break the entire backup set.
Most compression technologies that I have come across tend to have this asymmetry, where a small amount of corruption causes a disproportionate failure. With storage sizes into the petabytes, you have to build in redundancy as the chance of getting an undetected read or write failure starts to become statistically significant.
Still, two 6TB systems with dedup will give you much more data retention than a single 12TB one with dumb copy of a 1TB production system. While still preserving redundancy.
Thanks! Actually rather proud of this one; I was trying to compress a difficult and complex topic into ~500 words. Additionally, it was bashed out via voice-to-text on my HTC Desire whilst driving down a pitch-black highway through the rain at 100 km/h.
Glad someone liked it!
Is deduplication necessary ?
With current SAS disk prices at £214 for 3TB, is it worth the expenditure on a fast processing platform to do deduplication? With a reasonable RAID system (including spares), 500 disks would cover a 1 petabyte array for a disk price of just over £100k, and a system price of probably £200k-£300k (with controllers, racks, PSUs etc). Unless the dedupe equipment can save enough on storage costs to pay for itself, there is no point in having it.
The dedupe equipment also becomes a single point of failure so needs to be replicated as the disk array contents are meaningless without the dedupe unit to convert back to the flat files expected by the client systems.
Consideration also needs to be given to what happens if the dedupe equipment fails during writing - can the required spare take over without data loss?
(1 corrupt database could easily exceed the cost of the whole storage array - dedupe equipment included.)
Unless the storage farm is in the multiple petabyte size then deduplication is probably not cost effective.
If you have a virtualized environment, with 5 or more instances of an OS on a single system - dedup makes a lot of sense.
Dedup built into the underlying OS (ZFS) providing the virtualization also means the expensive memory serving as the disk cache (ARC) is also deduped... not to mention any expensive blocks of storage sitting in flash memory (L2ARC) is also deduped.
Weigh the benefits of dedup (reduced cost of memory, flash, storage and I/O for increased throughput) against the expense of needing some more CPU cores, and the tradeoff is a no-brainer, especially since ZFS is free and already bundled with multiple OS's.
An extra CPU core is cheap in comparison to memory, flash, and disk storage while no other special hardware controllers are needed (at least for ZFS.)
Not All Dedup is the Same
When dedup is integrated into the file system (such as ZFS, or apparently now Windows 8), there is no "dedup equipment" to be separated from the files. Any disk array is meaningless if you don't have a copy of the file system that filled it in.
When you are talking about "one corrupt database" ruining your entire set of stored data you are assuming that there is a "dedup database" that is somehow separate from and not integrated with the file system directories themselves.
Of course your file system directories need to be sufficiently robust, and that is something you should kick the tires on with any FS vendor.
The other issue is whether dedup is necessary. The justification for saving raw disk space loses some appeal every year. But in the long run, dedup will also reduce network traffic. This is already true to the degree that backup traffic can be reduced, and will be more as distributed dedup algorithms become more widely deployed.
Virtualised OS deduplication
With multiple copies of the OS in use, a much cheaper (and hardware free) method is for the underlying OS (hypervisor) to use a common OS image. If the hypervisor knows about the common image then there is no need for hardware based deduplication.
In any case given the current size and cost of disks and the size of OS images, the cost per OS image on disk would be under £0.50 (Linux) or £2 (Windows 7) so there is very little money to be saved even on a large server farm with a thousand virtual systems running.
As regards the point of the occupancy of the RAM or SSD based disk cache, this depends on the access rate to the OS disks. Once booting has completed most of the OS disk is accessed only rarely so should not unduly occupy the cache.
Re Not All Dedup is the Same
I was talking about databases of the traditional type, e.g. Oracle. If a failure in the dedupe equipment causes corruption in a database, then fixing it could be a nightmare.
What was described in the original article was external dedupe used at the disk array level (invisible to the OS) not a dedupe facility built into the OS.
BTW: Would you let a Microsoft deduplication bit of software loose on a production database - I know that I would be very reluctant.
OK, so what happens when ..
the algorithm changes from release to release, or I want to move to another platform and bring my data with me, or I want to store Windows data on EC2 and incorporate it into a Linux-based web solution? I see one more opportunity for long-term platform lock-in, with huge cost implications if I want to switch.
Good brief article and we need to see more discussion on this before Windows 8 Server hits the streets. Of all the features that Microsoft could cut from Server 8, I think this is the most probable. It creates huge problems for partners in the de-dupe space. And, as you mentioned the horsepower requirements will only add to the uncertainty level of available compute power for PaaS.
> OK, so what happens when... I want to move to another platform and bring my data with me , or I want to store windows data on EC2 and incorporate it into a linux-based web solution?
When you want to move to another platform, move to the cloud, move from the cloud, use a different OS - just do it... open source dedup is already running as base infrastructure in freely available OS's and in the cloud.
Free, Open Source, and DeDup go wonderfully together!
Truth is, if you want to port data from one vendor to another, you have to suck the undeduplicated data off the long way and then rededuplicate it.
No yarding disks out and moving them around.
Err, seems wrong to me.
> Synchronous deduplication takes a lot of CPU power.
> It’s easy to imagine why. Try to compress 5GB of text files into
> a zip ball. Now, picture your hard drive as a half-petabyte zip ball
> that you are reading from and writing to at 10Gbit/s. Processing
> power is suddenly very important.
What on earth has compressing zips got to do with anything?
I would hope that de-duplication would actually save power.
Here's how it should work: for each block that you're going to write, compute a checksum. Yes this requires some CPU effort but it's not great - and you are probably already computing checksums for data integrity. Do a lookup in an in-memory database of that checksum. If you already have that data, you can stop - there is no need to actually write the block to the disk, saving you power. If you don't already have that data, save it as usual.
If people are doing de-dupe in a way that requires more power, they are doing it wrong.
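The scheme described above can be sketched in a few lines of Python (a toy in-memory model for illustration only; a real implementation would write blocks to disk and persist its index):

```python
import hashlib

class DedupStore:
    """Toy in-memory dedup: keep one copy per unique block."""
    def __init__(self):
        self.blocks = {}   # checksum -> block data (stands in for a disk write)
        self.index = []    # logical block list: checksums in write order

    def write_block(self, data: bytes) -> bool:
        """Returns True if the block was new and actually 'written'."""
        digest = hashlib.sha256(data).digest()
        self.index.append(digest)
        if digest in self.blocks:
            return False           # duplicate: skip the physical write entirely
        self.blocks[digest] = data
        return True

store = DedupStore()
store.write_block(b"A" * 4096)
store.write_block(b"B" * 4096)
store.write_block(b"A" * 4096)     # duplicate: not written a second time
```

After the three writes above, the store holds two unique blocks but three logical ones, which is exactly the power-saving argument: the duplicate write never hits the disk.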
Oh, an in memory database, of course!
That will be really cheap to buy, use no power to run and be addressable from a single CPU!
You might want to go and look at the relative size of the appropriate checksums (specifically your hash must be provably unique) and see how many GB of RAM you need for each 1TB of disk space.
Meantime I am off to buy shares in a RAM vendor ;-)
At the risk of making a mistake on the fag packet (1978 vintage Casio fx-39 actually) I reckon that half a petabyte would need 250 GB of checksums, assuming that your checksum is 32 bits and your block is 8 KB.
So maybe, if you have smaller blocks, or you use longer checksums to improve the chances of not having the same result for two different blocks, you may have to do a bit more work.
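The fag-packet figure above checks out, assuming binary units and the stated 32-bit checksum and 8 KB block size:

```python
# Back-of-the-envelope check of the ~250 GB checksum figure above.
capacity = 512 * 2**40          # half a petabyte (512 TiB) in bytes
block_size = 8 * 1024           # 8 KiB blocks, as assumed
checksum_size = 4               # 32-bit checksum, as assumed

num_blocks = capacity // block_size
checksum_bytes = num_blocks * checksum_size
# checksum_bytes works out to 256 GiB, close to the ~250 GB estimate
```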
>Do a lookup in an in-memory database of that checksum. If you already have that data, you can
>stop - there is no need to actually write the block to the disk, saving you power. If you don't already
>have that data, save it as usual.
You missed out an important step (which is what makes async and sync dedupe more similar in resource terms than you think).
If your checksum shows a *possible* duplicate block, you bit-by-bit compare the two data blocks to make sure they really are identical. Which involves reading the potentially identical block from disk ofc.
Then, if they are *actually* identical, you write a pointer instead of a duplicate of the data.
Presumably an async dedupe process would need some way of marking up non-identical blocks with checksum collisions to prevent constant comparisons.
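That verify-on-match step might look like this (a hypothetical sketch; the dictionary lookup stands in for the disk read of the candidate block, which is precisely the extra I/O cost being discussed):

```python
import hashlib

stored = {}  # checksum -> block data; the dict read stands in for a disk read

def write_verified(data: bytes) -> str:
    """Dedup with verify-on-match: never trusts the hash alone."""
    digest = hashlib.sha256(data).digest()
    existing = stored.get(digest)
    if existing is not None:
        # Possible duplicate: read the candidate block back and compare
        # byte-for-byte before writing a pointer instead of the data.
        if existing == data:
            return "pointer"     # genuine duplicate: write only a reference
        return "collision"       # same hash, different data: must store
                                 # separately (secondary location omitted here)
    stored[digest] = data
    return "written"

assert write_verified(b"x" * 4096) == "written"
assert write_verified(b"x" * 4096) == "pointer"
```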
And that's exactly how it does work. Now work out how much RAM you need per TB of deduped data.
Deduping is not free, it requires masses of memory to keep checksums in, slows down disk writes and is often a false economy. Unless you are expecting dedupe rates of >5, you probably are going to spend way more on RAM than you would on having much more disk capacity.
Given a choice between a 2 x 6 x 3 TB deduped raidz server with 128GB of RAM and an 8 x 6 x 3 TB undeduped raidz with 32GB of RAM, the latter will perform faster and have more capacity.
No, you don't do a bitwise check
> If your checksum shows a *possible* duplicate block, you bit-by-bit
> compare the two data blocks to make sure they really are identical.
No, surely not. Large hashes are sufficient; look at how e.g. git works. It uses a 160-bit hash. Presumably you can make arguments about being more likely to be hit by a comet than to suffer a hash collision.
With a 160-bit hash and 4 kbyte blocks, for each GB of files you have 250 k blocks and each needs around 24 bytes to store the hash and its location, i.e. 6 MB of metadata per GB of data.
You could keep all that in RAM all the time - 6 GB of RAM for 1 TB of disk doesn't seem crazy to me - but if that RAM is too expensive you can keep it on disk and just cache it in RAM. You would probably need something like a Bloom filter in RAM in front of the main database in that case.
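A minimal version of the Bloom filter idea mentioned above might look like this (illustrative sizing, not tuned for real workloads): a fast in-RAM "definitely not seen" check, so the on-disk hash index only gets consulted on a probable hit.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: membership test with false positives but
    no false negatives, so 'not present' answers can skip the index."""
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions from salted SHA-256 digests of the key.
        for i in range(self.num_hashes):
            h = hashlib.sha256(i.to_bytes(4, "big") + key).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add(b"block-checksum-1")
bf.might_contain(b"block-checksum-1")   # True: maybe present, check the index
bf.might_contain(b"never-added")        # almost certainly False: skip the index
```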
I still find the idea that de-dupe increases power to be unbelievable.
In memory database
Power failure, RAM stick gone bad, Northbridge packing it in, CPU fan seizes up...
I'll stick with things that guarantee writes of both the data and the hash tables, thanks...
Yes, you do.
"No, surely not. Large hashes are sufficient; look at how e.g. git works. It uses a 160-bit hash. Presumably you can make arguments about being more likely to be hit by a comet than to suffer a hash collision."
Only an idiot would make such an argument. Even a potential cometary collision is a rare event. Furthermore any cometary collision decreases the chance of a future event, as the comet involved is removed from the roster of collision candidates. Finally a cometary collision is, given the current state of technology, difficult to avoid.
A potential hash collision, on the other hand, is guaranteed to happen every block write after the first. And every block hashed increases the chance that the next block will suffer a collision, as there are more blocks to collide with.
According to http://en.wikipedia.org/wiki/Birthday_attack, an ideal hash function would see, on average, roughly one collision per 1.25 * sqrt(2^160) = 1.25 * 2^80 unique blocks hashed - an enormous number, but not zero. (Git uses 160-bit hashes for its objects. Since very few code projects have 2^40, let alone 2^80, objects associated with them (Linux kernel 2.6.30, the latest version for which I could find this statistic, has approximately 28,000 - i.e. fewer than 2^15 - files), and collisions between projects would not be a problem, Git would rarely if ever have a collision issue.)
But since many of the organizations using these dedupe technologies have petabytes of data (and can write terabytes per hour), and collisions are _100%_AVOIDABLE_ (simply by doing the bitwise comparison you so quickly disparage), dedupe developers have wisely chosen to compare blocks when hashes match rather than assume they are the same while there is any probability, however small, that they are not.
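For what it's worth, the birthday-bound arithmetic can be sanity-checked directly (assuming a 160-bit hash and 4 KB blocks, as above):

```python
import math

hash_bits = 160          # e.g. SHA-1, as git uses
block_size = 4096        # 4 KB blocks, as assumed above

# Birthday bound: expected unique blocks hashed before the first collision
# is roughly 1.25 * sqrt(2^160) = 1.25 * 2^80 blocks.
expected_blocks = 1.25 * math.sqrt(2 ** hash_bits)

# Probability of any collision after hashing one petabyte of unique blocks.
n = 2 ** 50 // block_size                # 2^38 blocks per petabyte
p = n * (n - 1) / 2 / 2 ** hash_bits     # vanishingly small

# Either way, a bit-for-bit compare on hash match drives the risk to zero.
```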
As a backup/storage person, I'm rather ambivalent about dedupe. On the one hand you can squash your data very small, which is particularly useful for backups. On the other, you run the risk of losing everything if a database table corrupts. I speak from experience, having had a backup-related dedupe system totally die due to database corruption. The only reason we didn't lose everything was that replication was scheduled, so the corruption didn't replicate.
What I would say is: always replicate de-duped data, replication should be scheduled rather than online/synchronous, and if you back up, dump your long-retention backups out to tape.
I'd say, if you have two systems already, do a double backup from main system (if it is at all possible).
Then there's no common point between dedup servers (apart from software).
But yes, duplicating (if not triplicating) the de-duped data is a very good idea, unless you have a very "lively" data set, you'll still have a net gain in storage.
It compresses the data as well, hence the extra CPU power - which I believe can be chosen to be done client- or server-side.
I prefer byte-level deduplication.
>> Deduplication can be done at the file level, the block level or the byte level. File and block level are the most common.
I prefer byte-level deduplication. You can store *everything* on a 256 byte disk drive.
Check out the Dedupe Cards
In-line dedupe and compress (and turn on encryption as well) for Windows servers is available at a bargain-basement price from BridgeSTOR. The heavy computational lifting is handled by a processor on a PCIe card.
Here is what The Register had to say about it:
"Most people won’t back up data as deduplicated blocks – it is just too risky. The loss of one piece of backup media can render data irretrievable on all other media. This means budgeting backup bandwidth for the fully undeduplicated data to run every night."
The first point is correct and aside from that, to restore you have to first restore your deduplicated tapes to an area of disk before you can commence recovering server(s), adding an additional stage into any recovery process.
The second point, however, isn't strictly true. Forget your expensive Data Domain boxes and use software client-side dedup to back up to cheap disk overnight. Client-side dedup means less bandwidth required, and as it's a block-level incremental, backup windows are cut too. Then rehydrate that data off to tape when the backups are finished. Deduped copy on disk for immediate recovery and rehydrated copy on tape for off-siting/long-term retention. Simples.
Backup or DR ?
Dedupe should not be considered the basis of good DR. Deduplicated backups are a quick, safe way to provide data recovery for customer-level and system-level data loss events, but your DR position must be based on solid data/system redundancy, i.e. data/system mirroring across sites, or at least full backups to removable media stored off-site. Never think that a deduplicated solution on a single site (which is really pointless), or a two-site solution with replication, is a way to recover your entire systems after a catastrophic loss - that would be very foolhardy. In summary: dailies and even weeklies may be a thing of the past with dedupe, but beyond the normal 60-90 day loss point, monthlies/yearlies to tape are still the sure way to get something back, even if the building burns down or some other extreme event occurs. Tape, tape and tape is the only trustworthy position for good DR.
With Tape as the safety net, Deduplicated backups are the cream.
@Phil Endecott 19:20
"Large hashes are sufficient; look at how e.g. git works. It uses a 160-bit hash"
Am I reading this right? You're not seriously suggesting that a (e.g.) 160bit hash has no (not negligible, no) risk of collision from a 4k filesystem block?
If you are, let me tell you a well-known secret: hashes, whatever the algorithm, aren't used to guarantee that two different chunks of data will hash differently; they are used for fast but not entirely reliable comparisons, such as table lookups, and/or for reducing (but definitely not totally eliminating) the chances of undetected data corruption.
The hash is by definition (much) shorter than the data being hashed, so there is by definition a possibility of identical hashes from different data (see also rainbow tables, which produce all possible hash values from a much smaller set of inputs than the number of potential input values). When two different inputs produce the same hash, it is called a collision. Please go read about it; it's not exactly rocket science, and it goes back to the pre-Internet era.
If my understanding of Phil's simple model is right, you not only could but sooner or later would get different data hashing identically and therefore being incorrectly de-duped. Payroll wouldn't like that, nor would Accounts. Engineering might get upset too. But write-only data that is never read again (e.g. much PowerPoint, DivX, and corporate email) wouldn't exhibit a visible problem, so that's probably 80% or more of storage that doesn't matter and could readily be directed to /dev/null anyway. Correctly identifying the other 20% that will be needed again is tricky, though.
OK so I've got this file called (e.g.) AUTOEXEC.BAT. As hinted at earlier, I've got 5 identical copies, one in each of five independent virtual machines, so the de-dup stores only one copy of it.
Who wants to explain to me what happens when one of those five wants to modify their autoexec? It's no longer a duplicate, so presumably we need some copy-on-write mechanism that lets the writer see the modified copy while the rest see the original.
It all sounds rather risky, but probably not a great deal more risky than trusting business critical data to many of today's IT people.
Re: AUTOEXEC.BAT questions
Re: "Who wants to explain to me what happens when one of those five wants to modify their autoexec? It's no longer a duplicate, so presumably we need some copy-on-write mechanism"
Yes, you need copy on write. And this is what systems like ZFS do today (ZFS always does copy-on-write for everything so it's easy). You need copy-on-write anyway if you want to support things like snapshots in the filesystem.
@ various people worrying about data errors on one copy of data:
This is a valid concern; ZFS at least had some options to say "no dedup on these critical directories" or "keep at least N copies, then dedup after that".
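The copy-on-write sharing described a couple of posts up can be sketched with reference counts (a toy model; the class, names and one-block-per-file simplification are all illustrative):

```python
import hashlib

class CowDedupFS:
    """Toy deduped store: blocks are shared via refcounts, and writing to a
    shared block allocates a new block rather than modifying it in place."""
    def __init__(self):
        self.blocks = {}    # checksum -> (data, refcount)
        self.files = {}     # filename -> checksum of its single block

    def write(self, name: str, data: bytes):
        # Drop the file's reference to its old block, if any.
        old = self.files.get(name)
        if old is not None:
            d, rc = self.blocks[old]
            if rc == 1:
                del self.blocks[old]         # last reference: free the block
            else:
                self.blocks[old] = (d, rc - 1)
        # Point at an existing identical block, or store a new one.
        digest = hashlib.sha256(data).digest()
        d, rc = self.blocks.get(digest, (data, 0))
        self.blocks[digest] = (d, rc + 1)
        self.files[name] = digest

fs = CowDedupFS()
for vm in range(5):
    fs.write(f"vm{vm}/AUTOEXEC.BAT", b"@ECHO OFF\r\n")
# One physical block, shared five ways.
fs.write("vm0/AUTOEXEC.BAT", b"@ECHO OFF\r\nSET PATH=C:\\BIN\r\n")
# Now two physical blocks: vm0's new copy, plus the original still
# shared by the other four VMs.
```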
"You need copy-on-write anyway if you want to support things like snapshots in the filesystem."
Exactly. I used to understand some of this stuff in the days of DEC Storageworks, which supported snapshots in host software or in controller hardware depending on what you needed. I've lost touch with it now.
Like a lot of core IT technology, it gets forgotten and needs reinventing, preferably with a different set of bugs to iron out.
Dedupe isn't always the same...
It’s good to see Microsoft raising the visibility of the data deduplication opportunity. As your article indicates, there are some key issues that one need be concerned about when considering deduplication. As I see it, they can be grouped into three areas: performance, scalability and efficiency. Performance is critical because it can limit dedupe’s use cases. Fortunately, there are faster processors constantly coming to market that also have multi-core capabilities which are addressing the issue. There have been recent indexing and memory management advances that provide additional incremental scalability, which addresses my second point. We are seeing rampant data growth, so the more dedupe can scale the better it will be able to keep up with the growth. Efficiency is also important, and is an often overlooked characteristic. Deduplication solutions that only work with large granularity (ex. 128K) are markedly less effective at saving space than those that can efficiently handle small chunk sizes. Furthermore, the amount of RAM required to perform efficient deduplication can make the process extremely inefficient, particularly if a GB (or more) of RAM is required to deduplicate a TB of storage. Remember, in 2011 both components cost about the same!
There are many flavors of dedupe available in today’s marketplace. As we’ve often seen in the past, the open source community steps up to the plate when there’s a real need that is not being filled, so it should be no surprise to see developers building their own solutions. Dedupe solutions are differentiated based on how they address performance, scalability and efficiency requirements. When your readers can find all of these requirements addressed in a single offering, they are looking at a leading dedupe technology.
One of those de-dup activities
Maybe all those copies of GPL2 or GPL3. We could start by just saying "GPL2" or "GPL3" and let it go.
Of course most of the de-dupe activity is in email, where just about everyone copies the original email when just a reference to it would be all that is needed. That in itself would save LOTS of storage.
The next step would be to just refer to spam by reference, unfortunately the spammers aren't in on the need (*SIGH*).
Dedup word documents.
I theorise that a lot of corporate disk storage is used for Word documents. Most of these are boilerplate-derived corporate crap. Surely it would be worthwhile to store these documents as a set of links to often-used phrases, saving only the unique content with the document on some kind of 'doc server', instead of each one as a separate document on a file server. This could be much cheaper to achieve than block-level dedupe.