CPU resources?
"In-line deduplication is more space-efficient and requires CPU resources for dedupe to be supplied as data lands on the array."
So what performance impact will this have on the array?
NetApp has updated its Clustered Data ONTAP OS to support in-line deduplication and 3.8TB SSDs. Inline deduplication means data is deduplicated as it lands on the array instead of being initially stored in its full form and then having later, post-process deduplication applied to it. In-line deduplication is more space- …
Listening to the Tech ONTAP podcast, it's something that can still be turned off, so it's still an afterthought.
I'll bet the release notes / documentation for it give a whole long list of reasons to stay away.
Done properly, dedupe speeds things up rather than slowing things down: you get more out of each byte of cache once it's deduped, and you also avoid I/Os because you can discard duplicate writes.
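The cache argument can be sketched in a few lines: if the cache holds unique *physical* blocks rather than logical ones, a single cached copy serves every logical duplicate. A toy model (all names hypothetical, not any vendor's implementation):

```python
import hashlib

# Toy model: logical addresses map to content hashes; the backend and the
# read cache both store unique physical blocks keyed by hash.
logical_map = {}    # logical block number -> content hash
backend = {}        # hash -> block data (the "array")
cache = {}          # hash -> block data (the deduped read cache)
backend_reads = 0

def write(lbn, data):
    h = hashlib.sha256(data).hexdigest()
    logical_map[lbn] = h
    backend.setdefault(h, data)   # duplicates cost no extra space or write I/O

def read(lbn):
    global backend_reads
    h = logical_map[lbn]
    if h not in cache:            # one backend read warms the cache...
        backend_reads += 1
        cache[h] = backend[h]
    return cache[h]               # ...for *every* logical copy of that block

# 100 logical blocks, all identical (think: a common OS image block)
for lbn in range(100):
    write(lbn, b"common OS block")
for lbn in range(100):
    read(lbn)
print(backend_reads)              # 1 backend read; a non-deduped cache does 100
```

One physical block in cache satisfies all 100 logical reads, which is the "more out of each deduped cache" effect in miniature.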
Isn't that the truth! Do yourselves a favour and read the release notes, because some of us have been burned one too many times as sales <> deployment reality. Our experience has been that there are too many gotchas and last-minute surprises with this system that automagically tend to surface just before implementation. It has been a frustrating battle for us for the last 14 mos.
Hi Paul, miss you!
Not an afterthought, a primary objective. And while inline does reduce disk IO, it also increases CPU cycles. This is why Pure, for example, will disable it under heavy load. To be done "properly" you would have to have a more or less fixed metadata-to-capacity ratio. This means your DRAM would be the limiting factor in your ability to scale up capacity, which Dell/EMC face with X-Bricks.
In its current iteration, Data ONTAP's inline dedupe is recommended to be used in conjunction with post-process to provide optimal savings. Inline is most helpful for some VDI use cases that were an issue with post-process only, primarily OS patching for persistent desktops.
The guidelines are simple: Don't use it for databases (they don't dedupe), use it for VDI.
Cheers, mate!
Hi Dan, miss you (and the rest of the gang) too :-)
http://www.netapp.com/us/technology/storage-efficiency/feature-story-compression.aspx most workloads benefit from dedupe. Not everything, of course, but a significant portion.
Any time you can offload from the back end or save space is a benefit.
I agree, most implementations of dedupe are pants. If a feature is worth using then it should always be on. Being under load isn't an excuse, since systems properly purchased should be as busy as possible.
In an ideal world, everything would be in-line and the idea of a 'post process' for anything would be eliminated since doing work post process just means you're doing the same work twice, along with all the other implementation joys such as having blocks locked in place by other features.
As you rightly note, memory can be a limiting factor if all the hash metadata is held in main memory. Fortunately, that is solvable as well, either by using SSDs to hold that metadata or by using scale-out memory solutions. I can understand why not SSDs if you're in a race to zero, but most workloads don't need it...
Dedupe is still alive? It never seems to have lived up to its promise somehow, not since those heady days of 2008, before Data Domain was gobbled up by EMC.
Dedupe is a Nirvana, in theory. By now it was supposed to be everywhere: even in ZFS and Linux. One of the problems is that encrypted data can't be deduped, and encryption is becoming the norm in some areas. Corporate desktops (or at least laptops) are now usually encrypted. So is data held in the "cloud".
What happens to a 100TB array if somebody dumps 20TB of encrypted data on there? Do the inline ingesters just dumbly thrash themselves to death trying to dedupe/undedupe it, or do they know what they are dealing with and somehow skip those blocks? Compressed data is almost as bad, e.g. nearly all media files.
It seems that non-compressed, non-encrypted data will soon be restricted, perhaps, to internal corporate office servers. And who wants to buy a fancy DD SSD array system for that mundane stuff?
If encrypted or pre-compressed (and unique) data hits a dedupe device of any type, it'll not dedupe.
Backups of said data will still dedupe well.
If it's properly in-line, no thrashing, they would (sensibly) generate a hash of the data and go 'new' (store), 'new' (store), 'new' (store) for each block of data.
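In other words, the inline path is just one hash and one lookup per block; encrypted (effectively random) data simply misses every time and gets stored in full, with no thrashing. A toy sketch of that logic (hypothetical, not how any particular array implements it):

```python
import hashlib
import os

store = {}  # content hash -> physical block

def ingest(block):
    """Inline dedupe: hash, look up; 'new' -> store it, 'seen' -> skip it."""
    h = hashlib.sha256(block).hexdigest()
    if h in store:
        return "seen"             # duplicate: no write I/O, no extra space
    store[h] = block
    return "new"

# VDI-style data: the same 4 KiB block written 1000 times stores one copy
results = [ingest(b"\x00" * 4096) for _ in range(1000)]
print(results.count("new"))       # 1

# "Encrypted" data modelled as random bytes: every block looks new, so the
# engine just goes 'new' (store), 'new' (store), ... and keeps everything
random_results = [ingest(os.urandom(4096)) for _ in range(1000)]
print(random_results.count("new"))  # 1000
```

The cost of the encrypted case is only the hashing and the full-size storage, not any pathological rework.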
But, reality check. You encrypt the (physical) disks, not the apps. I've yet to come across any customer who runs server VMs with guest-level encryption. I'm sure there must be some people out there who run per-VM encryption, and those people will already have the restricted product lists they're used to working with.
"Backups of said data will still dedupe well."
Don't use a dedupe array as a backup target, tempting as it will seem. Yes, you would save tons of space. But for backups, you actually *want* to have multiple physical copies of the data, Not one physical copy and many logical copies (which is what deduped data is). If those physical blocks die, you could lose not just a single backup, but all generations of that backup within the deduped domain.
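The failure mode is easy to picture: in a deduped store, N backup generations are just N logical references to one physical block, so losing that single block damages every generation at once. A toy model (hypothetical names, not any product's on-disk format):

```python
import hashlib

physical = {}   # content hash -> block data (the one physical copy)
backups = []    # each backup generation is a list of content-hash references

def take_backup(blocks):
    refs = []
    for b in blocks:
        h = hashlib.sha256(b).hexdigest()
        physical.setdefault(h, b)     # dedupe: later generations add no copies
        refs.append(h)
    backups.append(refs)

data = [b"block-A", b"block-B"]
for _ in range(30):                   # 30 nightly backups of unchanged data
    take_backup(data)

# One physical block dies...
dead = hashlib.sha256(b"block-A").hexdigest()
del physical[dead]

# ...and every generation that referenced it is now unrestorable
restorable = sum(all(h in physical for h in gen) for gen in backups)
print(restorable)                     # -> 0 (all 30 generations lost)
```

With 30 independent physical copies, the same single-block failure would have cost one generation, not all of them.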
I guess another reason that dedupe isn't widespread is that storage is just so cheap now. Even for primary enterprise storage, what you pay for is the speed rather than the capacity. "Big data" might be a better candidate.
"...If those physical blocks die"
Maybe still a case to be had for tape backup?
Even better, hey: as you said, "multiple physical copies of the data", so why not have those physical copies on different media? Disk-only backup (including traditional backup, CDP-style, replication, etc.) is not the only option, and it's expensive and often unnecessary. Having one full (backup) copy plus logical copies on disk is then maybe a reasonable option, with additional full copies found elsewhere.
While on the subject, why keep old backup data on disk at all?
What you'd usually do on a NetApp system is, leave dedupe on, keep your snapshots, and *snapmirror* that to a different physical site. That way you have another physical set of copies of your data, not on the same spindles, but in another location, where you'd also be protected against site disasters (e.g. prolonged power outages, floods, fire, ...)
Doesn't really matter if it took them ages to offer inline dedupe...
NetApp is just happy to be in the news...
And that Snap Protect vs Intelli Snap nonsense is a waste of news space, too.
The underlying technology has always been Intelli Snap. It's just that NetApp marketing like to confuse customers. Once they realise that customers aren't that dumb, they revert to the original name.
Soon we'll see "NetApp Data Cluster Ontap Mode FAS".
They were the first to offer dedupe... also for a long time the only ones to offer it for primary storage.
Who cares, if it wasn't inline!
That's only important if you want to reduce write amplification on flash systems.
And they did use 'always on deduplication' before, if you check the relevant Tech Reports (e.g. about Horizon View on an AFF system). Yes, there was some more write amplification, but again, who cares, since it was covered by the NetApp warranty anyway (and they hadn't yet had a single SSD fail because of aging either).
I'd love a dedupe home NAS, so I can save space storing all those mp4 files - oh wait. No. Well then my massive FLAC archive, surely...no, can't dedupe that either. Well what about that massive archive of ISO images, I can, no, hang on,er... mp3, flv, encrypted backups, no, no, and no... jpegs no, gif no...
Yes, dedupe is alive and well - in fact, it is flourishing. In the flash array market, all of the top 5 vendors offer it in one form or another. A few vendors, like HDS, even offer effective primary storage deduplication for disk and hybrid systems on their NAS units.

Don't confuse what Data Domain has (backup-optimized dedupe) with what is shipping in primary flash arrays (random-access optimized dedupe). They are completely different technologies. Backup dedupe is optimized for sequential IO, specialized backup formats, and high rates of duplicate data. Primary dedupe operates on blocks of a fixed size and is optimized for random access in mixed workloads.

Under these workloads, primary dedupe products can see zero performance impact or even performance improvement. It all depends on the underlying hardware and, even more importantly, on the dedupe software capabilities.