Ocarina, the deduplication startup, is making waves with its partnerships with storage vendors, due to its unique lossless image compression technology. Yet the Ocarina founders were not wedded to image deduplication when they started up the company. How did it come about? The way Murli Thirumale, Ocarina's CEO, tells it, the …
I was bored a few weeks ago and spent some time running various compression tests on about 1TB of images (no - not *that* kind) and found I could get around 15-20% compression on JPGs just using "find ... -exec bzip2". Not as effective as this - but then it's quite a bit cheaper ;-) The spread was around 5% - 30% depending on the image & original compression.
>"Intriguingly, the DCT concept is often used in signal and image processing for lossy compression."
Well yes, that's basically what a bleedin' JPEG is in the first place. So
>"extracts the full rich image data from an existing image file in to a Discrete Cosine Matrix (DCT space)"
is nothing but a wordy and obfuscated way of saying "They look at the raw data blocks in the JPG file" (rather than ... what? decoding it to a bitmap and analysing that?). That doesn't mean their claimed compression ratios aren't impressive, but I think they'll probably turn out to be mostly based on their standard assumptions about de-duping archive data, applied when deduping at the level of the internal structure of the file type rather than the drive-sector level.
So do try not to let them dazzle you with big-sounding sciencey words; I think the whole thing can be summed up as "Files have repetitive internal block structure that you can apply deduping to if you write some file-format sensitive plugins for your backup architecture." Note particularly their disclaimer that
>"results on grouped studies are better than on individual images"
and I don't see any reason to believe they've come up with some magical compression technique that defies Shannon's laws or can compress arbitrary random data. Typical marketing department snake-oil bullshit sales-speak.
I skim-read the article admittedly, but what I gleaned was that massive amounts of processing (like, really massive) will give you a smaller file.
Are we so stingy these days that we're unwilling to pay for storage? I thought scrabbling around for disk space was a thing of the past.
What he said
Was about to post a long rant, but then read the post by Anon #2 - pretty much spot on.
I would say though that neither the chaps sending out the press release nor the chap writing the article have a clue what they're talking about - either it uses DCT, and is therefore "lossy", or it doesn't - which is it?
Lossless image compression that can compress files as much as they are claiming, i.e. beyond "lossy" JPG compression techniques, is pretty much the information equivalent of a perpetual motion machine.
You have a point, and I usually roll my eyes and think, ANOTHER snake oil company. But they actually have deployments? So it can't be that bad if people are actually paying for it and seeing results.
Admittedly, it might just be a small improvement, but enough to get clients and earn money, so more power to them, if you see my point.
Nothing to see here
Agree with the above. No wizardry here - looks like all they are doing is taking a JPEG (which is a lossily compressed image in the first place), partly decoding it to recover the quantised DCT coefficients and then re-encoding that data with an entropy coding scheme which is more efficient although more computationally expensive than the antique JPEG scheme. You might as well use ZIP.
interesting, not earth-shattering
The hardware requirements, the fact that they're apparently CPU-bound, and the fact that it looks at groups of images all lead to the conclusion that they're just setting larger windows on how far back they look for similar data.
You can test this by compressing large numbers of almost-alike files. One file per archive, and you get a certain level of compression. Multiple files per archive, and you'll probably achieve a better compression ratio (assuming file size smaller than look-behind window).
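That test is easy to sketch. The following is a minimal illustration, assuming Python's standard zlib module (Deflate, 32 KB look-behind window) as the compressor and synthetic "almost-alike" files built from one random base with a few bytes changed; the file count, sizes and the helper name make_similar_files are made up for the example and say nothing about what Ocarina actually does.

```python
import random
import zlib


def make_similar_files(count=20, size=2000, edits=10, seed=0):
    """Build `count` byte strings that differ from a shared base in a few bytes."""
    rng = random.Random(seed)
    base = bytes(rng.randrange(256) for _ in range(size))
    files = []
    for _ in range(count):
        data = bytearray(base)
        for _ in range(edits):
            data[rng.randrange(size)] = rng.randrange(256)
        files.append(bytes(data))
    return files


files = make_similar_files()

# One archive per file: the compressor never sees the other files.
separate = sum(len(zlib.compress(f, 9)) for f in files)

# All files in one stream: each file sits well inside the look-behind
# window of the previous one, so repeated content becomes back-references.
together = len(zlib.compress(b"".join(files), 9))

print("one archive per file:", separate, "bytes")
print("all files in one archive:", together, "bytes")
```

The base data is random, so each file on its own is essentially incompressible; any saving in the combined stream comes purely from redundancy between files.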
You can also see this in action if you use pngout or kzip (see http://advsys.net/ken/utils.htm ), which use a more exhaustive search of all compression methods (compared to normal png or zip producing apps). The output you get is (usually) smaller, and is a standard png/zip file to boot.
Lossless lossy compression
So the gist of this technology is to use common features of groups of JPEGs and MPEGs to further compress the content (a bit like using bzip2 to compress a tar rather than individual files is probably a crude analogy). It's a bit of a stretch to call this lossless compression, given that MPEG and JPEG are both lossy anyway. Not to mention that part of the raison d'etre for MPEG (well, MPEG4, anyway) is to allow efficient streaming; isn't streaming performance going to suffer a bit?
...how many of those negative comments came from NetApp fan boys. The last Ocarina article claimed Ocarina Deduped better than NetApp..... perhaps their recent $1.5B investment wasn't wisely spent - Ocarina may have been a better purchase. One thing seems pretty likely - one of the big boys will buy Ocarina soon enough - perhaps HDS since they have pretty much zero to offer in de-dupe today.
Silly smoke and mirrors
I'm with No Duh man,
Take 1000 files of 2k each. Store them in one big file, instant 50% saving*. Talk mumbo jumbo to the world to explain it. Add a bit of deduplication and bingo...
But if they had a lossless image compression then they wouldn't be selling dedupe software, and if they had high compression video compression they wouldn't be selling dedupe software, and if they had magic compression software, they wouldn't be selling dedupe software.....
The DCT stuff is smoke and mirrors, yes we use DCT and wavelets and fractals for image compression... it's not lossless and could never be.
You understand that the DCT approach is to take a block of pixels, calculate an equation using a DCT or wavelet or some other approach. The equation takes the form of a set of terms k1x, k2x^2, k3x^3, k4x^4, k5x^5, k6x^6... k-infinity
But if x is in the range 0 to 1, then on average x is 0.5, and on average x^2 is 0.25, and on average x^3 is 0.125
So k2 is less important than k1, and k3 less important than k2 and so on. So you can throw away later terms because they contribute less than the early terms.
Using k1, k2, k3 gives a good approximation to the original data when decompressed, so you throw away k4,k5.... k-infinity terms. Thus you've taken a block of pixels (e.g. 16x16 pixels) and replaced them with perhaps 32 bits of data which when shoved through the inverse DCT gives a good approximate to the original 16x16 pixel block.
By its nature it can never be lossless (even if you had infinity DCT terms, you'd still suffer from the float rounding errors!), so their description can never match their claim and so must be smoke and mirrors.
* A 2k file takes up a 4k cluster
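The "instant 50% saving" arithmetic from the footnote can be worked through in a few lines. This is only a sketch of the cluster-rounding sum, assuming 2k means 2048 bytes and a filesystem that allocates whole 4096-byte clusters; the helper name allocated_bytes is made up for the example and it ignores filesystem metadata entirely.

```python
import math


def allocated_bytes(file_sizes, cluster_size):
    """Bytes consumed on disk when every file occupies whole clusters."""
    return sum(math.ceil(size / cluster_size) * cluster_size for size in file_sizes)


cluster = 4 * 1024          # assumed 4k cluster
sizes = [2 * 1024] * 1000   # 1000 files of 2k each

separate = allocated_bytes(sizes, cluster)        # each 2k file fills one 4k cluster
packed = allocated_bytes([sum(sizes)], cluster)   # same data stored as one big file
saving = 1 - packed / separate

print(f"separate: {separate} bytes, packed: {packed} bytes, saving: {saving:.0%}")
```

No compression or deduplication is involved at this stage; the saving is purely the slack space recovered by packing.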
@ Anonymous Coward Posted Wednesday 27th May 2009 12:01 GMT
Alternatively as stated they are using the lossless form of DCT. There were a couple of articles on lossless DCT in the IEEE archives if you are interested.
Mind you such schemes are only marginally better than lossless jpeg, so their scheme sounds like a lot of effort for not much gain.
I guess it depends on how much space you're going to save, I suspect large image repository operators like Photobucket might be interested. It's one thing for a home user to pop down to Maplin/PC World/Tesco and grab an extra few hundred GB of space for the price of a decent night out, it's quite another for a company to fork out tens/hundreds of thousands for the new drives, servers to put them in, racks to put the servers in, server rooms to stick the racks in, power supplies and cooling systems to keep the servers happy...
Storage might be cheaper than it's ever been, but that doesn't necessarily make it cheap when you're talking high volume, and especially not when you factor in the associated costs that come with those volumes.
To add on to the debate...
Can Ocarina provide performance figures for image formats that started out losslessly-compressed themselves, in particular PNG files (which use decently-robust Deflate as the algorithm of choice)?
Lossless DCT? Be careful about what you're saying there, the lossless DCTs are *N bit* lossless DCTs.
i.e. given data with a certain resolution, the DCT and inverse DCT combination return the same data TO THAT RESOLUTION.
I know it's not a big deal when we're talking integers from 0 to 240 or 0 to 255, but it becomes a big deal when your images aren't consumer grade stuff.
Re: AC "I wonder..."
There are /Netapp/ fan boys ???!
@ DCTs can't be lossless
"either it uses DCT, and is therefore "lossy""
"By its nature it can never be lossless (even if you had infinity DCT terms, you'd still suffer from the float rounding errors!)"
Both wrong; DCT is mathematically invertible and if you calculate DCTs using integer arithmetic you can go to and fro losslessly. (You can do the same with float arithmetic but it gets quite tricky.)
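A small round trip shows the invertibility point, with the caveat raised earlier about resolution. This sketch assumes a plain unnormalised 1-D DCT-II and its matching inverse computed in floating point, with the result rounded back to integers; it is not the integer-arithmetic DCT mentioned above, and the eight sample values are made up. No coefficients are discarded, so nothing is compressed here; it only shows that the transform itself loses nothing at 8-bit resolution.

```python
import math


def dct(samples):
    """Unnormalised DCT-II of a list of numbers."""
    n = len(samples)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(samples))
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of dct() above (a scaled DCT-III)."""
    n = len(coeffs)
    return [
        coeffs[0] / n
        + 2 / n * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k) for k in range(1, n))
        for i in range(n)
    ]


block = [52, 55, 61, 66, 70, 61, 64, 73]   # made-up 8-bit sample values

# Float rounding error is many orders of magnitude below 0.5,
# so rounding to the nearest integer restores the original samples.
recovered = [round(v) for v in idct(dct(block))]
print(recovered == block)
```

Loss only enters when coefficients are quantised or thrown away, which is a separate step from the transform.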
@DCT can't be lossless
"Both wrong; DCT is mathematically invertible and if you calculate DCTs using integer arithmetic you can go to and fro losslessly. (You can do the same with float arithmetic but it gets quite tricky.)"
I don't think that follows. Addition is invertible, but addition is not lossless when using floats.
x+1.0 -1.0 !=x , for x = 1e-40 when using floats. Just because it's invertible doesn't mean it's lossless except to some accuracy, e.g. 16 bit integers.
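That float example can be checked directly. A minimal sketch, assuming Python, where float is an IEEE 754 double on common platforms; the same thing happens with single-precision floats.

```python
x = 1e-40

# 1e-40 is far below the spacing between representable numbers near 1.0,
# so x + 1.0 rounds to exactly 1.0 and the information in x is gone.
y = (x + 1.0) - 1.0

print(y == x)
print(y)

# The same round trip on integers is exact.
n = 7
print((n + 1) - 1 == n)
```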
You may think that because a double is stored as 8 bytes that you could treat the 8 bytes as 8 integers, but for DCT to work there has to be an underlying pattern that the DCT exposes, you then eliminate the redundancy and reverse DCT it.
Without the underlying pattern and redundancy you don't have a compression algorithm at all.
I think this is kind of off topic, since we're talking about DCTs as though Ocarina really have some magic DCT compression algorithm that doesn't involve throwing away the least relevant terms of the DCT equation and somehow that gains them 40% compression over JPEG and yet they sell deduping software not this compression method.
But the article is sciency talk, not the stuff of real math. More the sort of stuff put out for pointy haired bosses to quote to sound knowledgeable.
I'm the no duh guy. In another life, I do sysadminning. I have to make netapps work with AD/CIFS, SMB, NFS and Cygwin all at the same time.
I am not exactly a "fan".
If they're guaranteeing lossless compression on everything
they really need to look into the "pigeonhole principle." If you are guaranteeing that any N-bit sequence can be losslessly compressed into an M-bit sequence, where M is less than N, then you're eventually going to run out of M-bit sequences when it comes time to encode the next N-bit sequence. Since any M-bit sequence can only be expanded into a single N-bit sequence, there is no way for everything to be losslessly compressed into a smaller sequence.
Granted, there are multiple compression schemes which do better or worse jobs with different sorts of data, and you can certainly improve average-case compression for certain classes of images, but there is absolutely no way to guarantee lossless compression *at all* for ALL images.
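The counting behind the pigeonhole argument is short enough to write out. This sketch simply counts bit strings; N = 8 is an arbitrary choice and the helper name count_strings is made up for the example.

```python
def count_strings(length):
    """Number of distinct bit strings of exactly `length` bits."""
    return 2 ** length


n = 8
inputs = count_strings(n)

# Every possible output that is strictly shorter than n bits
# (lengths 0 through n - 1, including the empty string).
shorter_outputs = sum(count_strings(m) for m in range(n))

print("n-bit inputs:", inputs)
print("strictly shorter outputs:", shorter_outputs)
```

There is always one fewer shorter string than there are N-bit inputs, so at least two inputs would have to share an output, and a decoder could not tell them apart.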
Lossless compression using DCT is perfectly possible
It is perfectly possible to use DCTs to make a lossless compression algorithm, even without using very high accuracy DCTs. Nowhere in the article does it say that the compression uses *only* DCTs. First use DCTs to encode a reasonable lossy version of the picture. Then subtract that from the original picture, and you have the high-frequency noise that didn't fit the DCTs. This can be compressed in other ways (there should be very low amplitude in the signals, so you can use only a small number of bits), tied onto the DCT compressed data, and you have a lossless image compression that uses DCTs. In particular, if the "original" image is already a jpeg, the noise difference data should be very small and easily compressed.
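The lossy-plus-residual idea can be sketched on a single 1-D block. This is only an illustration under stated assumptions: an unnormalised DCT-II, the first three coefficients kept and rounded to integers as the "lossy" part, and made-up sample values. The function names (encode, decode, lossy_decode) are invented for the example, and it does not entropy-code anything, so it shows that the scheme is lossless rather than how much it would save.

```python
import math


def dct(samples):
    """Unnormalised DCT-II of a list of numbers."""
    n = len(samples)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(samples))
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of dct() above (a scaled DCT-III)."""
    n = len(coeffs)
    return [
        coeffs[0] / n
        + 2 / n * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k) for k in range(1, n))
        for i in range(n)
    ]


def lossy_decode(coeffs, n):
    """Rebuild an integer approximation from the kept coefficients only."""
    padded = list(coeffs) + [0] * (n - len(coeffs))
    return [round(v) for v in idct(padded)]


def encode(block, keep=3):
    """Return (kept integer coefficients, integer residual)."""
    coeffs = [round(c) for c in dct(block)[:keep]]
    approx = lossy_decode(coeffs, len(block))
    residual = [x - a for x, a in zip(block, approx)]
    return coeffs, residual


def decode(coeffs, residual):
    """Lossy approximation plus residual gives back the original exactly."""
    approx = lossy_decode(coeffs, len(residual))
    return [a + r for a, r in zip(approx, residual)]


block = [52, 55, 61, 66, 70, 61, 64, 73]   # made-up 8-bit sample values
coeffs, residual = encode(block)
print("kept coefficients:", coeffs)
print("residual:", residual)
print("lossless:", decode(coeffs, residual) == block)
```

The decoder repeats exactly the computation the encoder used to build the approximation, so the residual cancels any rounding in the DCT step. Whether the total comes out smaller than the input depends on how cheaply the residual can be coded.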
Aaah, a trip down memory lane
Does anybody else remember the saga of "Adams Platform"?
I wonder what ever happened there?
Apparently, Adams Platform wasn't just a revolutionary video codec, Adams Platform has proved to be the mathematical expression of matter!
I kid you not!
Are they mixing up "patented" and "secret"?
If the process is patented, it can't be kept secret - you have to declare all the details in the patent. Well, that was the idea, anyway. I suppose if the US patent office allows patents that are so general that they could apply to anything, there would be room to declare enough basic details to enable them to sue anyone who began developing along the same lines, but still keep the vital details secret.
I thought that the JPEG2000 project had already come up with much more efficient compression algorithms for images, taking advantage of the big advances in processor power etc since JPEG compression first appeared. However, the creators of the technologies, quite understandably, wanted to be paid for their work. And since, as has been said, storage capacity has also made big advances in that time, few if any system or application developers felt a need to buy in.
Ouch - I think I've seen more technically-correct articles in InformationWeek. Picking on the details of DCTs and the like is almost beside the point; this piece was lost well before it tried to offer specifics.
Just who, precisely, believed lossless compression of images was "impossible"? It's trivially true that for any given message longer than one bit, there's at least one encoding that compresses it, though it may expand all other messages. (Encode the target message as a single zero bit; encode all other messages as a one bit followed by the original data verbatim. Implementing the decoder is left as an exercise for the reader.)
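For anyone who would rather not do the exercise, here is one way the degenerate scheme and its decoder might look. It is a sketch that represents messages as Python strings of "0" and "1" characters; the helper name make_codec and the target message are made up for the example.

```python
def make_codec(target):
    """Build an encoder/decoder pair that favours one chosen message."""

    def encode(message):
        # The target shrinks to a single zero bit; everything else
        # grows by one leading one bit.
        return "0" if message == target else "1" + message

    def decode(code):
        # A lone "0" can only mean the target; otherwise strip the flag bit.
        return target if code == "0" else code[1:]

    return encode, decode


target = "1011001110001111"
encode, decode = make_codec(target)

other = "0000111100001111"
print(len(target), "->", len(encode(target)))
print(len(other), "->", len(encode(other)))
print(decode(encode(target)) == target, decode(encode(other)) == other)
```

One message gets dramatically shorter and every other message gets one bit longer, which is the trade-off every lossless compressor makes in a less extreme form.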
Even degenerate cases aside, it's clear that there will often be some redundancy in image and other files that can theoretically be exploited by lossless compressors; and when compressing a set of images, the probability of redundancy increases. This is exploitable in practice using good HMM-based entropy-encoders such as bzip2 and ppmd (the BWT used by bzip2 is effectively a simplified HMM), as one or two commentators have already noted.
Conversely, as other commentators have noted, you can't losslessly compress everything, thanks to the pigeonhole principle. Lossless compression is a question of mapping from the original set of messages to a new set, such that the ones you're interested in tend to be on the short end of the range. Ocarina may have found a practical way to improve that mapping somewhat for the messages their customers are interested in; but they haven't violated some mythical law of compression.