Microsoft unzips Zipline, lets world+dog have a go with cloudy storage compression tech

Microsoft used the Open Compute Project (OCP) Global Summit to announce the open-sourcing of the company's cloudy compression technology, Project Zipline. Disappointingly not a wire strung between data centres, along which techies can whizz with armfuls of USB sticks – an aerial Sneakernet if you will – Project Zipline is all …

  1. Munkeh

    Pied?

    This sounds great but the real question is what's its Weissman score?

    1. Doogie Howser MD

      Re: Pied?

      Damn, beat me to it.

      1. I.Geller Bronze badge

        Re: Pied?

        Microsoft structures data by creating two parallel databases. One contains the zip information, and the other contains textual patterns derived from it. It then becomes possible to search the zip information by meaning, using only the patterns. So I guess the zip itself has just the regular characteristics, and I don't know whether or how Microsoft zips the patterns.

        1. President Orange

          Middle out?

          So you are saying the AI stands in the middle and goes back and forth between the two databases? Sort of like Middle-out compression?

    2. john.jones.name

      compatibility

      Brotli can’t quite keep up with faster internet connections. For instance, a fast internet connection can upload several megabytes per second, but Brotli may require up to 20 seconds to compress just 4 megabytes of data. As an alternative to the Zopfli compression, using a greedy algorithm like gzip -9 to do the compression can waste up to 10% of the space but can keep up with almost any line speed.

      1. bombastic bob Silver badge
        Devil

        Re: compatibility

        I've been using xz (LZMA) compression for my tarballs. It takes longer but makes the files smaller.

        Why Microsoft would choose zlib over other methods is unclear. Maybe popularity?

        lz4 is supposed to be several times as fast at compression (and also decompression) when compared to 'deflate'. So if they want speed, lz4 would be a better choice, I think.
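The size-versus-time trade-off described above can be checked with a small sketch. It uses only Python's standard `zlib` and `lzma` modules; lz4 is left out because it would need a third-party package. The helper name `compare_codecs` and the synthetic sample data are assumptions made for illustration, and the numbers will vary with the data and the machine.

```python
import lzma
import time
import zlib


def compare_codecs(data: bytes) -> dict:
    """Compress data with each codec; return {name: (compressed_size, seconds)}."""
    codecs = (
        ("zlib -9", lambda d: zlib.compress(d, 9), zlib.decompress),
        ("xz (LZMA)", lzma.compress, lzma.decompress),
    )
    results = {}
    for name, compress, decompress in codecs:
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Round-trip check: the codec must give back exactly what went in.
        assert decompress(packed) == data
        results[name] = (len(packed), elapsed)
    return results


# Synthetic, fairly repetitive sample (an assumption; real tarballs differ).
sample = b"".join(b"GET /index.html?id=%d HTTP/1.1\r\n" % i for i in range(5000))
results = compare_codecs(sample)

for name, (size, seconds) in results.items():
    print(f"{name}: {len(sample)} -> {size} bytes in {seconds:.4f}s")
```

Running it against a real tarball rather than the synthetic sample gives a fairer picture of the trade-off.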

  2. I.Geller Bronze badge

    Microsoft wants to parse texts and get patterns.

    Microsoft wants to parse texts and get patterns, that's it. This process does not contribute to espionage in any way, since it is impossible to understand what the text is about from its patterns.

    1. bombastic bob Silver badge
      Devil

      Re: Microsoft wants to parse texts and get patterns.

      well if they had anti-replication in their file system, it would make more sense than indexing based on text contents. I'd venture to guess that the text indexing is for a basic search algorithm, and probably not one that goes across their entire file system.

      It's my personal opinion that an algorithm that simply scans the data live, off of the hard disk, with aggressive disk caching [like Linux and FreeBSD have], outperforms any background "index everything" algorithm and data set. As an example, if I want to find something on any file system, I typically use 'grep'. Even with Cygwin, it seems to be SO much more flexible (and results more relevant) than trying to use MS's ridiculous "search".

      The kinds of things _I_ would search for on a windows system: "Which header file has THIS function in it" [and considering where Microsoft wants to place header files, it's painful and bad enough already trying to naviguess to that - so I typically make a symlink in a Cygwin environment so I can do it more sensibly with 'find' and 'grep'].

      It's also my experience that with compressed hard drives, decompression actually IMPROVES throughput. SSDs, maybe not so much, but DEFINITELY on a hard drive. It has been so since the '90s, when MS first integrated disk compression into the OS (and got sued by Stac for it).

      anyway, I think it would be an overall 'win' for them (pun intended) to not bother so much with the indexing, and just focus on throughput and performance.

  3. Anonymous Coward

    Ultimately privacy and security are drivers against compression ...

    I wonder how much storage could be saved if cloudy providers were allowed to match up files stored across their estate and eliminate duplicates? Obviously user-generated content is pretty unique, but how many people are backing up copies of programs and the like, repeated thousands of times across the world?
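One common way to eliminate duplicates like this is to key stored data by a hash of its content, so identical files are kept once no matter how many users upload them. The sketch below is a toy in-memory illustration of that idea, not a description of how any cloud provider does it; the class `DedupStore` and its methods are hypothetical names.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical payloads are kept only once."""

    def __init__(self) -> None:
        self._blobs = {}  # SHA-256 hex digest -> payload bytes
        self._names = {}  # user-visible file name -> digest

    def put(self, name: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Store the payload only if this content has not been seen before.
        self._blobs.setdefault(digest, data)
        self._names[name] = digest
        return digest

    def get(self, name: str) -> bytes:
        return self._blobs[self._names[name]]

    def logical_bytes(self) -> int:
        """Total size as the users see it (every file counted)."""
        return sum(len(self._blobs[d]) for d in self._names.values())

    def physical_bytes(self) -> int:
        """Total size actually held (each distinct payload counted once)."""
        return sum(len(b) for b in self._blobs.values())


store = DedupStore()
installer = b"pretend this is a popular installer" * 100
store.put("alice/setup.exe", installer)
store.put("bob/backup/setup.exe", installer)   # duplicate content
store.put("carol/holiday.txt", b"unique user-generated content")

print(store.logical_bytes(), "logical bytes,", store.physical_bytes(), "physical bytes")
```

The privacy concern in the comment title shows up here too: matching by hash only works if the provider can see, or at least fingerprint, the unencrypted content.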

  4. RyokuMas Silver badge
    Coat

    Zipline, George and Bungle

    Don't know so much about "George", but "bungle"... well, it's Microsoft, so only a matter of time, really.

  5. Leedos

    Storage Spaces and ReFS

    Hopefully this will trickle down to Windows Server and bring a little bit more feature parity with ZFS. They've only had about 15 years to tackle a problem everyone else knew was coming. I wonder what systems the NSA will be using in Utah to store our data without "bursting at the seams"?

    1. bombastic bob Silver badge
      Devil

      Re: Storage Spaces and ReFS

      if it were me designing it, ZFS with de-duplication and lz4 compression... running on FreeBSD.

  6. Blockchain commentard Silver badge

    What they need to do is put forward a compression system for HTML/Javascript. All that text with whitespace just so the source is human readable is frankly a disgrace. And since it might be harder to block adverts I'm surprised people like Google and Facebook don't push it - you've also got less storage needed and faster page loading. Win-win I think.

    1. JohnFen Silver badge

      "What they need to do is put forward a compression system for HTML/Javascript."

      Is there something wrong with the existing solutions that do this?

    2. bombastic bob Silver badge
      Meh

      "What they need to do is put forward a compression system for HTML/Javascript"

      Interesting, but I'd prefer GETTING RID OF THE JAVASCRIPT instead. And I think the HTTP protocol already allows for compressed data transfer over 'teh intarwebs'...

      /me points out that you can't turn a pig into something that's not a pig. Javascript in HTML is what it is. A pig. A really FAT pig that gains LOTS of weight over time, and eats bandwidth+CPU without ever getting full...
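On the point that HTTP already allows compressed transfer: a server can gzip a response body and label it with a `Content-Encoding: gzip` header. The sketch below only builds such a body and header set with Python's standard `gzip` module; the HTML content is made up for illustration and no real server is involved.

```python
import gzip

# Made-up, whitespace-heavy HTML standing in for a human-readable page source.
html = (
    "<ul>\n"
    + "".join(f'    <li class="item">Entry {i}</li>\n' for i in range(200))
    + "</ul>\n"
).encode("utf-8")

# What a server would send when the client advertises "Accept-Encoding: gzip".
body = gzip.compress(html)
headers = {
    "Content-Type": "text/html; charset=utf-8",
    "Content-Encoding": "gzip",
    "Content-Length": str(len(body)),
}

print(f"{len(html)} bytes of HTML -> {len(body)} bytes on the wire")
```

Repeated indentation and markup compress well, which is why readable whitespace costs less on the wire than it does in the source file.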

  7. JohnFen Silver badge

    I almost got excited!

    "Disappointingly not a wire strung between data centres, along which techies can whizz with armfuls of USB sticks"

    Damn. If that's what it was, I'd actually be excited about it. Instead, this is only of interest to the cloudheads.

    1. bombastic bob Silver badge
      Joke

      Re: I almost got excited!

      it'd be more fun if they launched packages of sneakernet devices using potato guns...

  8. 's water music Silver badge
    Coat

    de-dupe FTW

    I analysed a large quantity of binary data only to find that it consisted solely of ones and zeros so I wrote a de-dupe algorithm that can reduce any arbitrary data-set to two (2) bits.

    I will be launching a kickstarter shortly to fund further development of the decompression algo, but if anyone here wants to invest early on preferential terms just DM me.

    --> when I hit the investment target


Biting the hand that feeds IT © 1998–2019