Xiotech technology VP Rob Peglar has moved to Isilon, now an EMC business, to become chief technology officer (CTO) for the Americas. We interviewed Rob and asked him questions that reveal quite a lot about Isilon's prospects, big data, the role of flash in scale-out filers, deduplication and Isilon, and what we should think …
Big data is about low-cost storage - hard to see EMC playing here
The Big Data work isn't just about storage that scales, it's about low-cost storage that scales. Historically, EMC haven't played in that market...
If you are smart, dedupe will work with all data - big or otherwise
First of all, congrats on your new position, Rob.
Next, on the topic of dedupe and scale-out: yes, it's true that dedupe will not scale infinitely. At some point the return is not worth the investment made in creating billions of metadata hashes with longer and longer lookup times. Anyone promoting global deduplication is likely confusing marketing hype with technical reality.
The solution for scale-out dedupe will be to store similar data in manageable storage volumes. Object-oriented file storage will play an important role here by grouping similar files into common repositories. The probability of finding dupes goes up, and the time to find the dupes goes down. Simple, huh?
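To make the grouping idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not any vendor's implementation: files are split into fixed-size 4 KB chunks, chunks are fingerprinted with SHA-256, and "similar" simply means "same file extension". The names `dedupe_by_group` and `by_extension` are made up for this example.

```python
import hashlib
from collections import defaultdict

CHUNK_SIZE = 4096  # fixed-size chunks; an assumption for illustration


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield a SHA-256 digest for each fixed-size chunk of data."""
    for offset in range(0, len(data), chunk_size):
        yield hashlib.sha256(data[offset:offset + chunk_size]).digest()


def by_extension(name: str) -> str:
    """Hypothetical grouping rule: files with the same extension are 'similar'."""
    return name.rsplit(".", 1)[-1].lower() if "." in name else ""


def dedupe_by_group(files, group_key=by_extension):
    """Keep one hash index per group instead of a single global index.

    files: iterable of (name, data) pairs.
    Returns {group: {"index_size": unique chunks, "total_chunks": all chunks}}.
    """
    indexes = defaultdict(set)   # group -> set of chunk digests seen
    totals = defaultdict(int)    # group -> number of chunks examined
    for name, data in files:
        group = group_key(name)
        for digest in chunk_hashes(data):
            totals[group] += 1
            indexes[group].add(digest)
    return {
        g: {"index_size": len(indexes[g]), "total_chunks": totals[g]}
        for g in totals
    }


files = [
    ("a.vmdk", b"A" * 8192),               # two identical chunks
    ("b.vmdk", b"A" * 8192),               # same content as a.vmdk
    ("c.log", b"B" * 4096 + b"C" * 4096),  # two distinct chunks
]
stats = dedupe_by_group(files)
```

Each lookup only touches the index for its own group, so no single index has to hold every hash in the system. The trade-off in this sketch is that duplicates spread across different groups are never found.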
Not all dedupe solutions are created equal. You are correct that some basic dedupe solutions simply won’t scale because of limitations in their index structure, and as a result their performance drops off quickly as they attempt look-ups. There are alternatives today. We have delivered to our OEMs a deduplication solution with hashing and indexing that scales linearly, because the index, memory utilization and overhead are optimized for scale-out architectures. Dedupe built for scale-out can scale out! It’s just a matter of design.
Dedupe and Big Data = Diminishing Returns
You are missing the bigger point. The law of diminishing returns applies to dedupe and scale-out storage. There is no reason to dedupe a PB of data if that PB contains lots of dissimilar objects. Instead, intelligently group objects with similar characteristics and dedupe the groups much more efficiently. As my dad used to say, there are two ways to fix a problem: you can use a sledgehammer or use your brain.
Law of diminishing returns?
Why do you assume there’s less duplicate data in a PB of storage? In fact, what we’ve seen is that the larger the storage repository, the greater the amount of duplicate data. There may be a rare use case where the data is entirely unique, but that would be the exception, not the rule. It comes down to two questions: does the dedupe engine scale out to accommodate larger data stores, and does it have the performance to dedupe without impacting storage performance?
I’d rather bet there will be duplicate data and deploy a system that can scale to PBs. What have you got to lose (assuming the system can scale and handle the performance of course)?