Storage DEATHMATCH: Permabit v Isilon

Permabit has taken issue with our Isilon and a question of Big Data story, insisting that its Albireo deduplication technology can be applied to scale-out file data storage, and deliver a tenfold reduction in storage costs with no performance penalty. We had previously asked Isilon's chief technology officer for the Americas, …

COMMENTS

This topic is closed for new posts.
WTF?

Albireo only requires 0.1 bytes of RAM per block of data indexed

So that would be four fifths of a bit then? Or are we talking qubits here?


Change the perspective...

End users dealing with big data or not so big data should change the perspective: where is the big data volume coming from?

It is the result of the massive number of single files created. Attacking this problem by reducing single file sizes through advanced technologies such as native format optimization provides significant savings throughout the data lifecycle without tweaking the existing infrastructure...

It is not so much a question of an old Buick versus a hybrid car. Both cars, if they are fully packed, need a lot of energy to get from A to B. Having the same information content weigh 50, 60 or 70% less means both cars need less energy (gas or hybrid) to get from A to B... it is time to change the perspective: the engine is not the problem, the excess luggage is.

WTF?

Is Albireo using black magic?

The math is giving me a headache. Peglar says 4 trillion items of metadata are needed to track the 4K blocks in 4PB of data; I think that should be 1 trillion, but it doesn't change his overall point, especially since he also said you'd only need to use 8 bytes of data per 4K block for indexing, while I think most systems would need more than 8 bytes.

To wit, Permabit says "in a system such as Albireo, the percentage of overhead is a bit over 1 per cent of the disk for 4K blocks" -- which means they're using 40+ bytes per 4K block.

However, in the next breath, Permabit says "Albireo only requires 0.1 bytes of RAM per block of data indexed" -- and 0.1 is less than 40.

I'm assuming they mean they store 40+ bytes per block *on disk* while using an in-memory index that takes only 0.1 bytes per block.

Problem I have with that is 0.1 bytes is not even a bit. I don't even understand the meaning of "0.1 bytes" and I don't see how it's possible to index *anything* using less than a bit. The wizards at Permabit *seem* to be dabbling in black magic.
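The arithmetic this commenter is puzzling over can be checked directly; a quick Python sketch using only the figures quoted in the thread (4K blocks, 4PB of data, "a bit over 1 per cent" disk overhead):

```python
# Check the index-overhead arithmetic from the thread.
BLOCK = 4 * 1024                  # 4K block size, in bytes
DATA = 4 * 1024**5                # 4PB of data, in bytes

blocks = DATA // BLOCK            # number of 4K blocks in 4PB
print(blocks)                     # 1099511627776 -- ~1 trillion, not 4 trillion

# "a bit over 1 per cent of the disk" of overhead per 4K block:
overhead_per_block = 0.01 * BLOCK
print(overhead_per_block)         # ~41 bytes on disk per block
```

This bears out both of the commenter's corrections: 4PB of 4K blocks is about 1 trillion blocks, and 1 per cent of a 4K block is around 41 bytes.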

FAIL

and it's not chocolate

Something like Isilon is a big, wide grid platform serving 1000s or 10000s of concurrent client sessions.

One challenge that type of storage platform solves is managing access locking between those 1000s of clients; that arbitration rides a high-speed interconnect between the nodes.

Now, if we then have to add arbitration for a dedupe/reconstitute process on top, that Permabit grid is gonna get damned complex layered over the storage grid.

I'd say it's a way off just yet for Big Data (and fast data), not quite as far off as holographic storage, but maybe some GPU cores might help.

Stop

often highly unique and can rarely be deduplicated

Really?

Has he ever looked at Facebook?

It is 90% the same pointless drivel; who would even notice if you compacted the data into a couple of hundred generics and added a bit of random-number generation to "personalize" it?

Thumb Down

Close...

His retort doesn't make Permabit any more viable in a big data environment. For 4PB of data, they need 100GB of RAM PER NODE, just for storing the dedupe hash tables? What kind of crack is he smoking that he thinks that's a scalable solution? It might be slightly more scalable than Isilon believes it is, but it's still nowhere near the realm of viable.
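The ~100GB figure this commenter objects to follows directly from Permabit's claimed 0.1 bytes of RAM per block; a quick check:

```python
# Derive the ~100GB RAM figure from Permabit's own numbers.
BLOCK = 4 * 1024                  # 4K block size, in bytes
DATA = 4 * 1024**5                # 4PB of data, in bytes

blocks = DATA // BLOCK            # ~1.1 trillion 4K blocks
ram = blocks * 0.1                # claimed 0.1 bytes of RAM per block indexed
print(ram / 1024**3)              # ~102 GiB -- the figure in dispute
```

Whether that 100GB sits on every node or is spread across the grid is exactly the point Permabit disputes further down the thread.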


Good discussion, but wrong problem?

Rob's comments are more reasonable. Permabit's math is a bit difficult to follow. Generally, Big Data is weakly compressible, and where it is not, an application- or data-type-specific algorithm is probably better. In any case, I don't believe this is a storage issue. Disks are cheap, even though the Big 5 tend to mark them up immensely.

The real issue is network transmission bandwidth. Until 40GE (via port ganging) is economic, we'll see serious latencies in the network with Big Data. The solution for this IS as much compression as possible, but the rub is that it has to be done in the host at write time to maximize the improvement, and it has to be undone in the host at read time. With 12- and 16-core x64 CPUs available within 12 months, it sounds like a tunable software tool is needed, with a library of applets for the true optimizer.
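A minimal sketch of the write-time/read-time pattern Jim describes, using zlib as a stand-in compressor (the function names are illustrative; a real tool would choose a codec per data type):

```python
import zlib

def host_write(payload: bytes, level: int = 6) -> bytes:
    """Compress in the host at write time, before data hits the network."""
    return zlib.compress(payload, level)

def host_read(wire: bytes) -> bytes:
    """Undo the compression in the host at read time."""
    return zlib.decompress(wire)

# Repetitive machine-generated records compress very well.
record = b"sensor_id=42,temp=21.5,status=OK\n" * 10_000
wire = host_write(record)
assert host_read(wire) == record
print(len(wire) / len(record))    # fraction of network bandwidth actually used
```

The saving applies end to end: the bytes stay small on the wire and on disk, at the cost of CPU cycles in the host on both sides.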


native format optimization...

Jim, good observation, but compression cannot be the answer, because the data needs to be rehydrated (it doesn't matter how powerful the machine is)...

Native format optimization is the answer, as NO rehydration is needed, EVER, and data is reduced by 40-75% through optimization of content INSIDE the files... this works especially well for big data, where dedupe adds no benefit. It resolves not only the storage problem but the network problem as well: with no decompression step, each piece of data is permanently smaller, not only on disk but on the network too...


Answers to Math Questions

Here's a reply that answers all the math questions asked above:

Question: Albireo only requires 0.1 bytes of RAM per block of data indexed. So that would be four fifths of a bit then? Or are we talking qubits here?

Permabit Answer: Albireo requires, on average, 0.1 bytes of RAM per block being indexed because Albireo uses a hybrid memory/disk index. Only a small portion of the index is retained in memory at any given time; however, Albireo maintains extremely high performance because more than 99% of the time a deduplication request can be fulfilled without having to retrieve any portion of the disk index.

Question: Is Albireo using black magic?

The math is giving me a headache. Peglar says 4 trillion items of metadata are needed to track the 4K blocks in 4PB of data; I think that should be 1 trillion, but it doesn't change his overall point, especially since he also said you'd only need to use 8 bytes of data per 4K block for indexing, while I think most systems would need more than 8 bytes.

To wit, Permabit says "in a system such as Albireo, the percentage of overhead is a bit over 1 per cent of the disk for 4K blocks" -- which means they're using 40+ bytes per 4K block.

However, in the next breath, Permabit says "Albireo only requires 0.1 bytes of RAM per block of data indexed" -- and 0.1 is less than 40.

I'm assuming they mean they store 40+ bytes per block *on disk* while using an in-memory index that takes only 0.1 bytes per block.

Problem I have with that is 0.1 bytes is not even a bit. I don't even understand the meaning of "0.1 bytes" and I don't see how it's possible to index *anything* using less than a bit. The wizards at Permabit *seem* to be dabbling in black magic.

Permabit Answer: Albireo only requires, on average, 0.1 bytes per block of data indexed because Albireo uses a hybrid memory/disk index. For the portions of the index that are currently in memory, Albireo uses around 4 bytes per entry; however, only a small portion of the index is required to be in memory during normal operation. Based on sophisticated data modelling that Permabit has developed over the past ten years, Albireo is able to maintain extremely high performance because more than 99% of the time a deduplication request can be fulfilled without having to retrieve any portion of the disk index.
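Given those two figures, the 0.1-byte average implies what fraction of the index must be RAM-resident; a quick sketch (the fraction itself is inferred from Permabit's numbers, not stated by them):

```python
# Infer the RAM-resident fraction of the index from Permabit's two figures.
BYTES_PER_RAM_ENTRY = 4      # stated size of an in-memory index entry
AVG_BYTES_PER_BLOCK = 0.1    # stated average RAM cost per block indexed

ram_fraction = AVG_BYTES_PER_BLOCK / BYTES_PER_RAM_ENTRY
print(ram_fraction)          # 0.025 -- only ~1 in 40 entries held in memory
```

That also dissolves the "fraction of a bit" objection above: no single block is indexed with 0.8 bits; rather, 1 in 40 blocks gets a full 4-byte in-memory entry and the rest live on disk, averaging out to 0.1 bytes per block.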

Question: Close...

His retort doesn't make Permabit any more viable in a big data environment. For 4PB of data, they need 100GB of RAM PER NODE, just for storing the dedupe hash tables? What kind of crack is he smoking that he thinks that's a scalable solution? It might be slightly more scalable than Isilon believes it is, but it's still nowhere near the realm of viable.

Permabit Answer: No, Albireo would only require 100 GB across the entire grid. Deduplication does not require that the entire index be stored local to each node; this is another misconception that Peglar presented in his original interview.
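If the index really is partitioned across the grid rather than replicated per node, the per-node cost falls with the node count; a sketch assuming an even partition (the node counts are illustrative):

```python
# Per-node share of a grid-wide dedupe index, assuming even partitioning.
TOTAL_INDEX_GB = 100         # Permabit's stated grid-wide figure for 4PB

for nodes in (4, 16, 64):
    per_node = TOTAL_INDEX_GB / nodes   # even partition, no replication
    print(f"{nodes} nodes -> {per_node:.2f} GB of index RAM each")
```

On those assumptions, a 64-node grid would carry under 2GB of index RAM per node, which is the crux of Permabit's disagreement with the "100GB per node" reading.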


So then Permabit doesn't actually answer his final and major objection:

Big Data of the type Peglar is discussing: data which cannot be significantly de-duped in any event.

I don't work with Big Data, but Peglar's objection makes sense to me. I was once able to achieve 99% compression when I zipped together a bunch of zip files, but that was a rather unique situation. In normal operations I got 70-80% compression on my main files; with some I got a mere 20% on the primary pass. But once the primary compression was done, running another compression pass against the file didn't reduce its size, unlike in my unique case. Peglar is working with data analogous to my 20%-on-first-pass files, which makes it not worth the effort to dedupe.


More a sales effort than a discussion by Permabit

Permabit's main response to the Isilon comments is to disagree about how its own solution handles dedupe in large data environments, not to address the general point the Isilon comments were making. Permabit seems more interested in touting its patented technology than in actually answering questions or providing informative details.

Permabit's Albireo is an interesting product in that it is basically a standalone system that scours the data for duplicates as a side process; by staying out of the data stream it avoids impacting performance, which seems clever enough, but much of the rest of it seems to be mostly marketing PR at the moment.

