In our homes and offices, duplicated information is such a fact of life that we don’t even think about it. In the digital world, though, we can think about it, and should, because it can stop a lot of wasteful spending. Imagine a department of twenty people. They each have their own filed copies of their employment contract and a …
Hmm. . .
"deduplication means you only pay for the storage of unique data there, and not multiple copies of a PowerPoint presentation or an image that has been identically attached to twenty emails"
But The Powers That Be insist on spamming everyone with PowerPoint files when a single A4 sheet would do just fine and most likely deliver the message more clearly. And there's the company dross of platitudes and corporate bullshit that takes up pages and, once again, is either pretty much meaningless or has a tiny message to impart but, as it's from on high, needs to be wrapped in flowery crap.
They will also be the very same people who send out yet another PowerPoint file about keeping things small and not cluttering up systems with unimportant dross. And then several emails.
Alongside all this is the insistence that loads of hard copies must be made 'for record keeping purposes'.
First you need to stop the top end from peppering the lower orders with corporate confetti that is, somehow, so very important that a two-line memo is never enough.
DeDuping's easier than changing corporate behaviour
Look, the marketing and management people who keep spamming you with PowerPoint decks when a single text file would be easier for both you and them aren't going to stop that kind of behaviour. And the employees who save the files in their own directories, where they've got control over them, instead of leaving them in their email aren't going to stop doing that either. That holds even if the email system operators relax their "please don't keep files longer than 60 days" rules to 90 or 180 or 365, because management has told them they need to hang on to this content and they've learned not to trust the Email Operators not to delete anything actually important. So even if the email system dedupes messages, you'll still have multiple copies.
But if you send them 42 copies of the request for deduping equipment for your shiny new Storage Area Network, somebody's going to buy it, and it probably will save you a lot of storage; maybe it'll even save enough to pay for itself, since the SAN storage costs you far more per very fast terabyte than the medium-speed terabyte drives you bought at your local consumer products store.
Too many conditionals
The article (and, one presumes, the underlying justification for the product) is packed full of assumptions: sentences starting with "if" and "assume", and "can"s all over the place. These are presented as if they are facts, rather than wild and optimistic guesses. The 20:1 shrinkage only comes about if 95% of a company's data is copies of other stuff.
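To put numbers on that, here is a rough sketch of the arithmetic, assuming the best case where every redundant byte is eliminated and the dedupe metadata costs nothing. The function names are made up for illustration.

```python
def dedupe_ratio(duplicate_fraction: float) -> float:
    """Best-case reduction ratio when `duplicate_fraction` of the stored
    bytes are redundant copies (assumes zero metadata overhead)."""
    if not 0.0 <= duplicate_fraction < 1.0:
        raise ValueError("duplicate_fraction must be in [0, 1)")
    return 1.0 / (1.0 - duplicate_fraction)


def duplicate_fraction_needed(ratio: float) -> float:
    """Fraction of stored bytes that must be redundant to reach `ratio`:1."""
    if ratio < 1.0:
        raise ValueError("ratio must be at least 1")
    return 1.0 - 1.0 / ratio


if __name__ == "__main__":
    for target in (20, 7, 2):
        needed = duplicate_fraction_needed(target)
        print(f"{target}:1 needs {needed:.1%} of the data to be redundant")
```

Run the other way, `dedupe_ratio(0.20)` gives 1.25:1, so a shop where a fifth of the data is duplicated cannot do better than that under these assumptions.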
For most organisations the vast majority of the data they hold is either business/product/customer data held in honkin' big databases or it's the mass of timewasting trivia known as email. The business data is all there because it has been justified in costed business cases and the email is there because most employees need to fill their pointless days doing something.
If you wanted to reduce the size of backups (or, more importantly, the time required to take and restore them), a better solution would be to purge all the MP3s, video and browser caches that the staff amass on the core storage. But with disks costing fifty quid a terabyte (i.e. roughly 1 hour's "funny money" cost of a sys-admin) the case would be very hard to make.
"dedupe is coming next year"
We'll keep hearing this line for a while.
Get ready to not really be sure how much storage you actually have. Someone could wipe out heaps of back end by simply using "Find and Replace". But then thin provisioning is a bit like this already.
It Works For Me :-)
rsnapshot is great, I get 7 days backups onto a portable disc the size of the main disc
du -D currently reports 1057 GB on a 512 GB disc :-)))
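For anyone wondering how that is possible: rsnapshot keeps unchanged files as hard links between snapshots, so the same bytes show up in every snapshot directory but occupy the disc once. A small sketch of the effect, with made-up directory and file names and a filesystem that supports hard links assumed:

```python
import os
import tempfile


def snapshot_usage(root: str) -> tuple[int, int]:
    """Return (apparent_bytes, unique_bytes) for regular files under root.

    apparent_bytes counts every directory entry; unique_bytes counts each
    inode once, which is closer to what the disc actually stores.
    """
    apparent = 0
    unique = 0
    seen = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            st = os.stat(os.path.join(dirpath, name))
            apparent += st.st_size
            key = (st.st_dev, st.st_ino)
            if key not in seen:
                seen.add(key)
                unique += st.st_size
    return apparent, unique


def demo(days: int = 7, size: int = 1_000_000) -> tuple[int, int]:
    """Build `days` fake snapshots of one unchanged file using hard links."""
    with tempfile.TemporaryDirectory() as root:
        first = os.path.join(root, "daily.0")  # illustrative snapshot name
        os.mkdir(first)
        source = os.path.join(first, "report.dat")  # hypothetical file
        with open(source, "wb") as handle:
            handle.write(b"x" * size)
        for day in range(1, days):
            snap = os.path.join(root, f"daily.{day}")
            os.mkdir(snap)
            os.link(source, os.path.join(snap, "report.dat"))
        return snapshot_usage(root)


if __name__ == "__main__":
    apparent, unique = demo()
    print(f"apparent: {apparent} bytes, actually stored: {unique} bytes")
```

Only files that did not change between runs are shared this way; a changed file is stored again in full, so this is file-level rather than block-level sharing.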
Can I have that .....
... in triplicate please ;-)
There is a trust issue in the examples in the article
"electronic copies of their HR contract and pension scheme guide"
If there were just a single copy of this type of information, then employees would have trouble making sure that what they agreed to when they started a job was the same as the current single copy. You would end up needing an audit trail for changes to your single copy, with its own complexity and storage issues. People keep their own copies as a point-in-time reference, to guard against change in the primary copy.
It's funny. Lotus Notes, which was originally sold as a document management system, was selling this "store a pointer, not the document" idea 15 or more years ago. This aspect of its use has fallen by the wayside, but I'm sure that it is still in there somewhere. Mind you, using Notes as the glue for a mobile workforce (as IBM does) requires multiple copies to be stored on the hard disks of all of the laptops anyway, so the de-duplication benefits are negated.
Another thing is that you don't want to de-dupe your backup, at least not completely. You must have multiple copies on different media, because media fails. Enterprise backup tools such as TSM work using an incremental-forever model, meaning that by default, only one copy of each version of a file is backed up, but then has specific techniques to force more than one copy of a file to be kept on separate media.
I know I am a born sceptic, but I must admit to being unconvinced by the block-level de-duplication that is being pushed by vendors. Maybe I have just not seen the right studies, or maybe I'm already de-duplicating a lot of what I do using techniques like incremental-forever backups and single-system image deployment.
Maybe I'm just a neo-luddite. Who knows.
It's about "who" and "whom"
or 1984 or something.
I'd want a contract or something from HR as my own copy, preferably on hard copy, so there's no need to wonder if things have changed between me getting it and me needing to read it.
I'm not ecstatic about "dematerialisation" in the Land Registry either, but they didn't ask me...
No Animal Shall...
...sleep on a bed. Between Sheets.
because storage is a significant enterprise cost these days?
de-duping is useful and good practice but I don't think there's much cash saving to realise from it.
Banks, please take note ...
My bank insists on sending me *four* tax certificates every year. All together on the same day. On *four* separate A4 pages, each in a separate first class envelope.
Not only does it waste postage, it also wastes paper unnecessarily and means I have to keep four separate bits of paper and add up four sets of numbers when I have to fill in my tax form.
Once, banks would claim it was "too hard" to identify plural accounts belonging to the same person -- but that excuse can't be used now they have online banking and don't require a separate login to each account.
Yes banks, you *know* how many accounts we have and could very easily print just one tax certificate showing each of the accounts and the total figure as well as the individuals. So why waste paper and postage this way?
Isn't it time the law required banks to provide tax certificates online instead of by post? Some do, but not mine. And the law should also require them to keep statements and tax certificates available online for seven years (mine deletes them after 9 months, which is unreasonably short) so you don't have to remember to make copies every six months.
De-dupe is done at block level
De-duplication is a great tool for eliminating the storage cost for exact copies of files. However, as soon as the content is changed in typical office files, they cannot be de-duplicated at the block level. For example, a Word document that has had one character changed somewhere in it will add an "edited by" tag at the top of the document, thereby "pushing" all the content down a few bytes, and therefore all the blocks will not match up for de-dupe.
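The shifting effect is easy to reproduce. This is a toy model only: it assumes fixed-size 4 KiB blocks compared by SHA-256, and it uses synthetic data with an 11-byte prefix standing in for the inserted tag, not a real Word file. Products that cut blocks at content-defined boundaries exist to get around exactly this and would behave differently.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; real systems vary


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each fixed-size block of data (the last block may be short)."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def shared_blocks(stored: bytes, candidate: bytes) -> int:
    """Count blocks of candidate that already exist among stored's blocks."""
    known = set(block_hashes(stored))
    return sum(1 for digest in block_hashes(candidate) if digest in known)


# 64 blocks of deterministic pseudo-random content (synthetic test data).
original = b"".join(
    hashlib.sha256(str(i).encode()).digest() for i in range(64 * 128)
)

# Edit 1: overwrite one byte in place; the length is unchanged.
patched = bytearray(original)
patched[10_000] ^= 0xFF
in_place = bytes(patched)

# Edit 2: insert a few bytes at the front; everything after them shifts.
shifted = b"edited by X" + original

if __name__ == "__main__":
    print("identical copy:", shared_blocks(original, original), "of 64 blocks shared")
    print("in-place edit: ", shared_blocks(original, in_place), "blocks shared")
    print("front insert:  ", shared_blocks(original, shifted), "blocks shared")
```

An in-place overwrite costs one block, while the small insert at the front leaves nothing for fixed-block dedupe to match.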
God help anyone who believes they can achieve anything over 15% storage reduction in a typical environment. The exception might be a large Exchange 2010 installation.
First, as already mentioned, the examples are poor. People have those individual paper copies (and mine are at home, NOT at work) so that they have a permanent record of what they signed up for - rather than what someone later changed it to. And yes, I've firsthand experience of people trying to move goalposts in a contract.
Also, when it comes to off-site backup then the statement made doesn't actually work in real life. You wouldn't normally try and replicate your backup store off-site in one go. Typically you'd transfer one full copy of your live data once at the startup of the remote copy, and then thereafter you would transfer increments. But there is an overhead in transferring increments - so if (say) 5% of your data changed, you'd have to transfer more than 5% because of the overhead of working out what's changed.
Incidentally, it's a problem that the Rsync protocol was developed specifically to handle efficiently.
But yes, there is a place for DeDup.
What percentage of data *is* duplicated?
If the rate of duplication is (for example) 20% (seems extreme), then the maximum possible saving has to be less than 20%. Even given this extreme example, why not just take 'a long lunch break' (joke alert) and by the time you get back, Moore's Law will have increased the size of hard drives by about that much.
In summary - probably a problem that could be profitably ignored. But YMMV.
I've worked with a lot of SAN disk, at a lot of firms. I've deployed both DeDupe primary storage as well as backup systems that dedupe data on the fly. Block and file level dedupe.
The average savings, for firms on SAN (local storage and DASD is just so cheap it's not worth the cost of the fiber controllers, let alone $10K/TB SAN licensed disk to have it deduped), is about 30%. Yep, not 7:1, 1.3:1.
Fact is, when we're talking TBs of data, we're talking record data, scanned images, e-mail, etc.; it's highly unique. The copies of data in circulation that are true duplicates (many people, same file) are typically sent via e-mail (which natively dedupes, except when crossing servers, and even Exchange 2010 does that cross-server now). A large enterprise with a DRM system sends links to start with, not raw files. Most of the duplication is on local workstation HDDs, not servers.
30% savings is still significant when we're talking multi-dozen TB savings, especially in SAN, but the license cost of SAN to start with, versus tying together a bunch of smaller, less advanced, dedicated disk clusters (server or node specific), is so much higher that it generally eats the cost savings. You buy SAN for IOPS and for reliability, not because dedupe can save you money. Adding the dedupe licensing on top of the SAN costs rarely breaks even, except for user-oriented file systems.
Since e-mail already dedupes, DRM already does as well, and most backup systems too (in the D2D backup world, not so much tape), the cost increase to add dedupe across an entire SAN is not often worth it. For this reason, I've been advising most of my clients to buy 2 or more SAN infrastructures: a high-performance, high-reliability SAN for VMs, databases, and app servers, which have little to dedupe, and a smaller SAN without the advanced performance, but with dedupe, to use on file servers hosting user data.
If the license cost of dedupe was cheap, say 5% extra, or better yet a function of the disk savings (save 30% disk, pay 10% more, but only save 3% disk, pay 1% more) then I could more openly agree.
In MASS systems, and especially where you're dealing with storing mostly identical documents, it works, but most of those are moving to "document on demand" models, and only storing the unique differences in the database to start with, plus a few scanned bits like signature lines, and they re-build the doc when it's requested, and don't store the whole thing locally anyway. This saved the SAN based dedupe technology for a lot of people.
Ever heard of "incremental backups"?
Same idea, slightly different way to get there. Same dependency, though: Now you need the entire backup, and not just the most recent one, for your restores. It's more interdependent and thus more prone to failure. Given the very purpose of backing up, you can't just ignore that effect.
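That dependency can be shown with a toy model. This is only a sketch: a "backup" here is a plain mapping from path to content, `None` as a value marks a deletion, and `None` in place of a whole backup stands for media that can no longer be read. None of this is any particular product's format.

```python
from typing import Optional

# path -> content; a value of None records that the file was deleted
Backup = dict[str, Optional[bytes]]


def restore(chain: list[Optional[Backup]]) -> dict[str, bytes]:
    """Replay a full backup followed by its increments, oldest first."""
    state: dict[str, bytes] = {}
    for index, backup in enumerate(chain):
        if backup is None:
            # One unreadable link invalidates everything that depends on it.
            raise RuntimeError(
                f"backup #{index} is unreadable; cannot restore past it"
            )
        for path, content in backup.items():
            if content is None:
                state.pop(path, None)
            else:
                state[path] = content
    return state


full = {"a.txt": b"v1", "b.txt": b"v1"}
monday = {"a.txt": b"v2"}                   # a.txt changed
tuesday = {"b.txt": None, "c.txt": b"v1"}   # b.txt deleted, c.txt added

if __name__ == "__main__":
    print(restore([full, monday, tuesday]))
    try:
        restore([full, None, tuesday])  # Monday's media has failed
    except RuntimeError as error:
        print("restore failed:", error)
```

With independent full copies, losing Monday costs you Monday. With a chain, it costs you everything after Monday as well.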
I'm not sure what mentioning the cloud is supposed to prove, but it's worth noting that instead of the physical media you only have contractual obligations, if that, securing your access to the data in case of need. And /when/ you need, the need is /high/, because that's what you have backups /for/.
And, of course, advocating deduping religiously will handily turn it into yet one more excuse to not care just how much data we generate. About on par with dropping the font size to unreadable in the belief that doing so will reduce attachment size. Shyeah right.
There is no doubt the technology is useful. But not duping data in the first place is more effective; a little thought up front will return much longer and farther reaching yields than trying to patch up earlier non-thought with ever cleverer technology.
"Never time nor money to do it right the first time, but always to patch up or do it again afterward."
When Winston was working at his terminal re-writing history by 'correcting' specific newspaper articles, I thought the technology to do this was too fanciful. With perfect de-duping in place across the nation, now I see it.
The article above this when I looked was one about Oracle and a "dramatic leap in storage technology"
Wonder if it involves dedup and ZFS?
In the digital world duplication is a money-wasting sin
Then thank the lord I de-duped my photos directory over the weekend.
Then I synced the changes (with delete enabled) to the back up drive.
I saved a few gigabytes, so I'm feeling closer to heaven than ever now.
Actually: it was a project much needed. It was a mess, with quite a few duplicate directories.
When can I expect the cheque?
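For the curious, a weekend clean-up like that can be sketched as below. This is one possible approach, not necessarily the tool used above: it groups files by SHA-256 of their contents and only reports the groups, leaving the decision about what to delete to a human. The file names in the demo are invented.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large photos need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: str) -> list[list[str]]:
    """Return groups of paths under root whose contents are identical."""
    by_hash: dict[str, list[str]] = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and anything that is not a file
            by_hash[sha256_of(path)].append(path)
    groups = [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
    return sorted(groups)


def demo() -> list[list[str]]:
    """Build a tiny fake photo tree and report its duplicate groups."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "copy"))
        samples = [
            ("a.jpg", b"same bytes"),
            (os.path.join("copy", "a.jpg"), b"same bytes"),
            ("b.jpg", b"different bytes"),
        ]
        for relative, data in samples:
            with open(os.path.join(root, relative), "wb") as handle:
                handle.write(data)
        return [
            [os.path.relpath(path, root) for path in group]
            for group in find_duplicates(root)
        ]


if __name__ == "__main__":
    for group in demo():
        print("identical:", ", ".join(group))
```

Note that this only catches byte-for-byte copies; a photo that has been re-saved or resized counts as a different file.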
Makes backing up your datastore nice and quick
"Oh, those 20 Tb are a dupe of these ones. We'll just store a pointer - but we'll protect the pointer with RAID6".
Won't happen. Can't happen?
Deduping also has requirements, namely: brains, thoughtful analysis (usually before committing to any action), and knowledgeable people.
Hence the title
at the record level. If it's not just the actual deltas it will still be too big ......