5 posts • joined 9 Jul 2010
Dedupe isn't always the same...
It’s good to see Microsoft raising the visibility of the data deduplication opportunity. As your article indicates, there are some key issues that one needs to be concerned about when considering deduplication. As I see it, they can be grouped into three areas: performance, scalability and efficiency. Performance is critical because it can limit dedupe’s use cases. Fortunately, faster multi-core processors are constantly coming to market, which helps address the issue. On my second point, scalability, recent indexing and memory management advances provide additional incremental headroom. We are seeing rampant data growth, so the more dedupe can scale, the better it will be able to keep up with that growth. Efficiency is also important, and is an often-overlooked characteristic. Deduplication solutions that only work with large granularity (e.g., 128K) are markedly less effective at saving space than those that can efficiently handle small chunk sizes. Furthermore, the amount of RAM required to perform deduplication can make the process extremely inefficient, particularly if a GB (or more) of RAM is required to deduplicate a TB of storage. Remember, in 2011 both components cost about the same!
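To make the granularity and RAM trade-off concrete, here is a minimal Python sketch. It is not any vendor's implementation: it assumes simple fixed-size chunking, SHA-256 fingerprints, and a hypothetical 32 bytes of index RAM per chunk entry. The function names and the synthetic sample data are made up for illustration.

```python
import hashlib


def stored_bytes(data: bytes, chunk_size: int) -> int:
    """Bytes that must be kept after fixed-size-chunk dedupe of `data`."""
    unique = {}  # fingerprint -> length of the first chunk seen with it
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return sum(unique.values())


def index_ram_bytes(capacity_bytes: int, chunk_size: int, entry_bytes: int) -> int:
    """RAM for a naive in-memory index holding one entry per chunk.

    `entry_bytes` is an assumed per-entry cost, not a measured figure.
    """
    return (capacity_bytes // chunk_size) * entry_bytes


def make_sample_data() -> bytes:
    """Synthetic 1 MiB stream: 4K blocks repeat, but no 128K window repeats."""
    blocks = [bytes([i]) * 4096 for i in range(64)]
    windows = [
        b"".join(blocks[(k + j) % 64] for j in range(32))  # 32 x 4K = 128K
        for k in range(8)
    ]
    return b"".join(windows)


if __name__ == "__main__":
    data = make_sample_data()
    for size in (128 * 1024, 4096):
        print(size, stored_bytes(data, size), index_ram_bytes(2**40, size, 32))
```

On this synthetic stream, 128K chunking finds no duplicates and keeps all 1,048,576 bytes, while 4K chunking keeps only 159,744 bytes (39 unique blocks). The catch is the index: under the assumed 32 bytes per entry, a naive index for 1 TiB needs 256 MiB of RAM at 128K chunks but 8 GiB at 4K chunks, which is exactly why index and memory design matters as much as chunk size.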
There are many flavors of dedupe available in today’s marketplace. As we’ve often seen in the past, the open source community steps up to the plate when there’s a real need that is not being filled, so it should be no surprise to see developers building their own solutions. Dedupe solutions are differentiated based on how they address performance, scalability and efficiency requirements. When your readers can find all of these requirements addressed in a single offering, they are looking at a leading dedupe technology.
Law of diminishing returns?
Why do you assume there’s less duplicate data in a PB of storage? In fact, what we’ve seen is that the larger the storage repository, the greater the amount of duplicate data. There may be a rare use case where the data is entirely unique, but that would be the exception, not the rule. It comes down to whether the dedupe engine scales out to accommodate larger data stores and whether it has the performance to dedupe without impacting storage performance.
I’d rather bet there will be duplicate data and deploy a system that can scale to PBs. What have you got to lose (assuming the system can scale and handle the performance, of course)?
Not all dedupe solutions are created equal. You are correct that some basic dedupe solutions simply won’t scale because of limitations in their index structure, and as a result their performance drops off quickly as the number of look-ups grows. There are alternatives today. We have delivered to our OEMs a deduplication solution with hashing and indexing that scale linearly, because the index, memory utilization and overhead are optimized for scale-out architectures. Dedupe built for scale-out can scale out! It’s just a matter of design.
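As a generic illustration of why index design drives scale-out (this is not a description of Permabit's product), the sketch below partitions a fingerprint index across shards by hash. The `ShardedIndex` class and its methods are hypothetical names; each shard stands in for one node's in-memory index.

```python
import hashlib


class ShardedIndex:
    """Toy chunk index split across shards by fingerprint (illustrative only)."""

    def __init__(self, num_shards: int):
        # One dict per shard; in a real cluster each would live on its own node.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, fingerprint: bytes) -> int:
        # Route by the first 8 bytes of the fingerprint, so any look-up
        # touches exactly one shard regardless of total index size.
        return int.from_bytes(fingerprint[:8], "big") % len(self.shards)

    def add_chunk(self, chunk: bytes) -> bool:
        """Return True if the chunk is new (must be stored), False if duplicate."""
        fp = hashlib.sha256(chunk).digest()
        shard = self.shards[self._shard_for(fp)]
        if fp in shard:
            shard[fp] += 1  # bump the reference count for the duplicate
            return False
        shard[fp] = 1
        return True
```

Because each look-up goes to a single shard, adding shards divides both the index memory and the look-up load. A production design would more likely use consistent hashing than a plain modulo, so that adding a node does not remap most existing entries.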
77GB/s is faster than 27.5, isn't it?
On October 19, Permabit announced 77GB/s in an ESG-validated test. See our press release: http://bit.ly/9M39tV. I am sure Brian Garrett would be happy to confirm the results with Curtis.
Mike has a Point!
The gist of Mike’s position is that there are unknown risks that inhibit him from going outside of the incumbent storage suppliers to acquire storage technology. Interestingly, he does acknowledge that there are known risks with the incumbents too! So, it’s more the devil you know versus the one you don’t. I think Mike has a point!
Vendor incumbency breeds familiarity, and a bit of contempt occasionally, as Mike’s comments indicate! His frustration with vendor speed of change, and implicitly their ability to implement newer technology that can address his business needs, also jumps out in the piece. Unfortunately, he is making a very significant point. Vendor incumbency matters!
Incumbent vendors have a huge opportunity because of the thousands of Mikes out there. And a few have the formula right! They can and do optimize their time to market by avoiding the typical “not invented here” syndrome. The best storage companies have shown an ability to advance products by combining internal development, OEM agreements and strategic acquisitions as sources for the “kit” they offer. For example, more enlightened vendors optimize by “sticking to their knitting” and doing what they do best: building hardware and operating software that optimizes their product offerings. Once they establish a base-level architecture and systems, they source features such as advanced efficiency, automated tiering and performance from best-of-breed suppliers (yes Mike, those smaller companies!) to embellish and differentiate their offerings. The result is that the enlightened incumbent vendor can deliver products to market faster and more efficiently, products that are differentiated from their competitors’ and more closely address their customers’ “pain.” In turn, the Mikes of the world can have the security of the incumbent vendor with rapid time-to-market solutions to address their needs. You just have to make sure you pick the right ones!
Because there are aspiring companies that offer advanced efficiency, automated tiering, performance, data optimization and deduplication, the pressure has never been higher to obtain best-of-breed technology and embed it into the incumbent vendors’ offerings. Today there is increasing momentum to build out the next generation of storage solutions that optimize feature sets, differentiate and deliver quickly. A few have it right! Mike has a point!