This sounds great but the real question is what's its Weissman score?
Microsoft used the Open Compute Project (OCP) Global Summit to announce the open-sourcing of the company's cloudy compression technology, Project Zipline. Disappointingly not a wire strung between data centres, along which techies can whizz with armfuls of USB sticks – an aerial Sneakernet if you will – Project Zipline is all …
Microsoft structures data by creating two parallel databases. One contains the compressed ("zipped") data, and the other, derived from it, textual patterns. It then becomes possible to search the compressed data by meaning, using only the patterns. So I guess the zipped data just has the regular characteristics, and I don't know if or how Microsoft compresses the patterns.
Brotli can't quite keep up with faster internet connections. For instance, a fast connection can upload several megabytes per second, but Brotli may need up to 20 seconds to compress just 4 megabytes of data. As an alternative to exhaustive compressors like Zopfli, a greedy algorithm like gzip -9 can waste up to 10% of the space but can keep up with almost any line speed.
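For anyone who wants to see that trade-off first-hand, here is a rough, self-contained sketch using Python's standard zlib and lzma modules as stand-ins (not Brotli or Zopfli themselves); the payload is made up and the absolute numbers will vary, the shape of the result is the point.

    import os
    import time
    import zlib
    import lzma

    # Hypothetical payload: ~1 MB of incompressible noise plus ~4 MB of repetitive text.
    data = os.urandom(1 << 20) + b"the quick brown fox jumps over the lazy dog " * 100_000

    def bench(name, compress):
        start = time.perf_counter()
        out = compress(data)
        elapsed = time.perf_counter() - start
        print(f"{name:>8}: {len(out) / len(data):6.1%} of original, "
              f"{len(data) / elapsed / 1e6:6.1f} MB/s")

    bench("zlib -1", lambda d: zlib.compress(d, 1))   # fast, greedy matching
    bench("zlib -9", lambda d: zlib.compress(d, 9))   # the 'gzip -9' end of deflate
    bench("xz/lzma", lambda d: lzma.compress(d))      # much tighter, much slower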
I've been using xz (LZMA) compression for my tarballs. It takes longer but makes the files smaller.
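For scripted backups rather than the command line, a minimal sketch of the same thing with Python's standard tarfile module (the paths here are only placeholders):

    import tarfile

    # "w:xz" streams the archive through LZMA, the same format `tar cJf` produces.
    with tarfile.open("backup.tar.xz", "w:xz") as tar:
        tar.add("my_project")  # placeholder directory; the whole tree gets archived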
Why Microsoft would choose ZLib over other methods is unclear. Maybe popularity?
lz4 is supposed to be several times as fast at compression (and also decompression) when compared to 'deflate'. So if they want speed, lz4 would be a better choice, I think.
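A rough way to check that claim is a round-trip comparison; this sketch assumes the third-party lz4 package (pip install lz4) next to the standard zlib, and the payload is made up:

    import time
    import zlib
    import lz4.frame  # third-party: pip install lz4

    data = b"moderately repetitive log line, request id 12345\n" * 100_000  # ~5 MB

    def roundtrip(name, compress, decompress):
        start = time.perf_counter()
        packed = compress(data)
        middle = time.perf_counter()
        assert decompress(packed) == data
        end = time.perf_counter()
        print(f"{name:>8}: compress {middle - start:.3f}s, "
              f"decompress {end - middle:.3f}s, ratio {len(packed) / len(data):.1%}")

    roundtrip("deflate", lambda d: zlib.compress(d, 6), zlib.decompress)
    roundtrip("lz4", lz4.frame.compress, lz4.frame.decompress)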
Well, if they had anti-replication (i.e. de-duplication) in their file system, it would make more sense than indexing based on text content. I'd venture to guess that the text indexing is for a basic search algorithm, and probably not one that spans their entire file system.
It's my personal opinion that an algorithm that simply scans the data live, off the hard disk, with aggressive disk caching [like Linux and FreeBSD have], outperforms any background "index everything" algorithm and data set. As an example, if I want to find something on any file system, I typically use 'grep'. Even with Cygwin, it seems SO much more flexible (and the results more relevant) than trying to use MS's ridiculous "search".
The kinds of things _I_ would search for on a Windows system: "Which header file has THIS function in it?" [and considering where Microsoft wants to place header files, it's painful enough already trying to naviguess to that - so I typically make a symlink in a Cygwin environment so I can do it more sensibly with 'find' and 'grep'; a rough scripted version of that approach is sketched below].
It's also my experience that with compressed hard drives, decompression actually IMPROVES throughput. SSD drives, maybe not so much, but DEFINITELY on a hard drive. It has been so since the 90's, when MS first integrated disk compression into the OS (and got sued by STAC for it).
anyway, I think it would be an overall 'win' for them (pun intended) to not bother so much with the indexing, and just focus on throughput and performance.
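For what it's worth, here is a minimal sketch of that "scan it live off the disk" approach for the header-file question above; the include directory and function name are only placeholders:

    import os
    import re
    import sys

    def find_in_headers(root, func_name):
        """Grep-style live scan: print header lines under root that mention func_name."""
        pattern = re.compile(r"\b" + re.escape(func_name) + r"\s*\(")
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if not name.endswith((".h", ".hpp")):
                    continue
                path = os.path.join(dirpath, name)
                try:
                    with open(path, errors="replace") as handle:
                        for lineno, line in enumerate(handle, 1):
                            if pattern.search(line):
                                print(f"{path}:{lineno}: {line.strip()}")
                except OSError:
                    pass  # unreadable file; skip it, much as grep -s would

    if __name__ == "__main__":
        # e.g. python find_in_headers.py /usr/include memcpy  (placeholder arguments)
        find_in_headers(sys.argv[1], sys.argv[2])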
I wonder how much storage could be saved if cloudy providers were allowed to match up files stored across their estate and eliminate duplicates? Obviously user-generated content is pretty unique, but how many copies of programs and the like are being backed up, repeated thousands of times across the world?
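As a back-of-the-envelope way to see what that could save on a single tree, here is a hedged sketch that groups files by a whole-file SHA-256 hash; real de-duplication systems usually chunk files and hash the chunks, so treat this only as an illustration:

    import hashlib
    import os
    import sys
    from collections import defaultdict

    def dedupe_report(root):
        """Group files under root by content hash and total up the duplicate bytes."""
        by_digest = defaultdict(list)
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as handle:
                        digest = hashlib.sha256(handle.read()).hexdigest()
                except OSError:
                    continue  # unreadable; skip
                by_digest[digest].append(path)

        reclaimable = sum(
            os.path.getsize(paths[0]) * (len(paths) - 1)
            for paths in by_digest.values()
            if len(paths) > 1
        )
        print(f"bytes reclaimable by keeping one copy of each file: {reclaimable}")

    if __name__ == "__main__":
        dedupe_report(sys.argv[1])  # e.g. python dedupe_report.py /home (placeholder)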
Hopefully this will trickle down to Windows Server and bring a little more feature parity with ZFS. They've only had about 15 years to tackle a problem everyone else knew was coming. I wonder what systems the NSA will be using in Utah to store our data without "bursting at the seams"?
I analysed a large quantity of binary data only to find that it consisted solely of ones and zeros so I wrote a de-dupe algorithm that can reduce any arbitrary data-set to two (2) bits.
I will be launching a kickstarter shortly to fund further development of the decompression algo, but if anyone here wants to invest early on preferential terms just DM me.
--> when I hit the investment target