Swiss-based file compressor Balesio has added PDF file compression to its space reduction capabilities for Microsoft Office and image files. Balesio's FILEminimizer Server software uses a native format optimisation process, effectively recoding how data is stored in PDF target files without altering its format. This means the …
PDF already has compression built-in, but it's apparently underused?
Since PDF is essentially PostScript without the loops and such, but with compression added, it's not that hard to see how that would work. Would be interesting to see what exactly they're doing. I'm guessing turning on compression in the first place, maybe use a better encoder and up the compression ratio. But it does say curious things about existing PDF-generating applications.
I wrote a document, text without pictures, in MS Word 2010. I let MS Word write it as PDF and it came to around 300k. I then used the shareware PDF995 PDF-generating printer driver to create a PDF and it was half the size, and looked as good. Same font embedding options too.
PDF is one of those formats that can be used well or badly, and a badly-written file has lots of room for optimization.
Doesn't say much
Only that the zip equivalent PDF compression is several generations behind state of the art.
These guys probably have proprietary algorithms highly tuned for SSEv27.whatever. So a newish processor chip will blast through the compression.
It also says very curious things about all the MS Office applications.
Either they are storing a huge amount of unnecessary data or their output filters are rather poor as regards file size.
I suppose the real question is how long this small size output takes to make. If it's comparable to a normal save, then...
That said, it's not really very surprising. The file sizes of MS Word have always surprised me.
Native Format Optimization technology
What we do is apply our Native Format Optimization (NFO) technology to PDF files. Our technology is composed of a comprehensive set of content-aware native optimization algorithms especially developed for unstructured file formats such as Microsoft Office files, PowerPoint presentations, PDF files and images.
More info and a technical White Paper can be found here:
Chris Schmid, COO balesio AG
Ever try saving a Word doc as html? There are <font> and style tags around EVERY block of text, not to mention the style tags in each <p> or <td>. Simply removing these in favour of a top-level CSS style would save space. Likely it is these types of tricks that are employed in "native format optimization" since, by definition, they cannot use 3rd party compression methods.
Very informative! How it works: 'Balesio's native format optimization technology applies a comprehensive set of content-aware optimization mechanisms which are both "lossless" and "visually lossless" meaning they are technically "lossy", but in practice deliver optimized files which are visually identical to the original,' blah blah blah circular blah.
If "socially lossless" compression is allowed you can probably get 95% savings straight off the bat by simply deleting the vast majority of PPTs - if the repetitious uselessness of most presentations you've attended in your life is taken into account.
Thank you for chipping in. However, honesty bids me say that the verbiage doesn't exactly invite me to read more of the same; my eyes glaze over from the word salad rather than my being awed by your proprietaryness. You're not talking to a batch of tech-illiterate IT-salescritters and -PHBs here. As such, a bit of a missed opportunity.
So they claim to be able to compress already compressed files?
Yeah - right.
Not all compression schemes are created equal... e.g. the run-length encoding (RLE) scheme used in some BMP files is a simple but sometimes quite effective means of losslessly compressing image data, but you can pretty much guarantee that an RLE-compressed BMP will still be able to be losslessly compressed even further using a half-decent general purpose compression tool (e.g. 7Zip).
Even using a decent compression scheme, the creator of the file may have decided not to use the highest levels of compression available as a trade-off to reduce the amount of time/memory required to generate the compressed output, leaving some scope to further reduce the file size if another party is willing to expend the additional effort required to run the compressor at its highest level.
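To put rough numbers on the two points above, here is a minimal sketch using only Python's standard zlib module. The "image" is made-up data and the RLE encoder is a naive illustration, not the actual BMP RLE format; the point is only that RLE output can still shrink under a general purpose compressor, and that the compression level is a knob the file's creator may not have turned all the way up.

```python
import zlib

def rle_encode(data: bytes) -> bytes:
    """Naive byte-level RLE: (count, value) pairs, runs capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)

# Made-up "image": 200 identical rows, each with a few long runs.
row = b"\x00" * 300 + b"\xff" * 40 + b"\x00" * 300
raw = row * 200

rle = rle_encode(raw)
rle_then_deflate = zlib.compress(rle, 9)  # general purpose pass over the RLE output
fast = zlib.compress(raw, 1)              # low effort setting
best = zlib.compress(raw, 9)              # highest effort setting

print("raw:", len(raw))
print("RLE only:", len(rle))
print("RLE then deflate:", len(rle_then_deflate))
print("deflate level 1:", len(fast), "level 9:", len(best))
```

On data this regular the gap between levels 1 and 9 is small; on real documents it varies a lot, which is presumably where a tool willing to spend more CPU time finds some of its savings.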
..they simply remove duplicated crap from the PDFs and store the relevant information only once. I recently developed an OpenXML-based PowerPoint generator and I can tell you that PowerPoint creates lots of redundant formatting information. Say you want to make the dividers of a table a bit fatter - that means spitting out 50 bytes for (each of) the left, right, top and bottom of EACH CELL ! That certainly is simple to compress, but I assume there is much more less-easily compressed redundancy in these kinds of document formats.
They created a custom compressor and I assume it is entirely possible to achieve the said compression rates, as many commercial pieces of software (such as MS Office) have been developed with a single imperative in mind - maximum functionality at earliest possible release date.
Would have been awesome a decade ago
However, now storage is approaching free, hasn't that horse rather bolted?
I can sort of see your point, although as someone who learned about computers back in the days when storage space was relatively expensive, and access times were relatively slow, I still find it hard to let go of the idea that every byte matters.
There's also the question of who this is aimed at - if you've got a handful of PDFs on your home PC with a few hundred GB of free space, then what's a few extra MB here or there? But what if you're a business with hundreds of thousands of PDFs on a server, with your users generating millions of accesses every day to those files? I'm guessing that being able to shave even just a few % off each file might then start to make good sense...
You might want to note that enterprise storage
a) has never been cheap
b) has barely kept pace with demand.
Why do you think there is a market for all the various deduping and compression appliances?
Storage is never free
Especially in an enterprise environment.
With terabyte disks available now (generally speaking - the tsunami effect will not last), one would think that all IT departments have stuffed their servers with disk space.
That is not true, unfortunately. For some strange reason, enterprise servers are still quite often choking from mailbox mismanagement and small disk sizes.
Must have something to do with the fact that server-grade disks are held to much higher reliability and I/O throughput than basic consumer disks and thus more expensive in cost per GB.
That, plus the fact that IT is still viewed as a cost center, makes enterprise servers the red-headed stepchild in disk space availability.
On my personal PC, I have no less than 2 disks with 2TB each, plus four other disks for a total of over 5TB of storage space.
I know for a fact that there are a few companies not far from where I live that employ several hundred people and they have far less storage available on their mail server (think 300GB max).
But my disks are not RAID, or SCSI, nor are they high-throughput parity-checking error-correcting disks. They are just high-capacity consumer-grade disks and I was crazy enough to buy that kind of storage for myself.
My money, my choice. Company IT departments do not have that kind of freedom.
Why Don't We Build Lead Airplanes
...after all, if we mount 20 Saturn V engines to an A380 made out of lead, it will still take off. Aluminum is soo yesteryear !
It is not just about disk space...
...it is also about memory, network traffic, retrieval time etc. The bigger the file, the greater the cost (one way or another).
<quote>It also says very curious things about all the MS Office applications.
Either they are storing a huge amount of unnecessary data or their output filters are rather poor as regards file size.</quote>
Nothing unusual there. Take a look at the HTML generated by Outlook in an email, the output of Excel and Word's "Save as WebPage" and Access' OutputTo HTML.
Bloated cr*p every time.
it could also be called an Un-Bloater.
In some other news, pigs can't fly...
Storage is cheap, bandwidth (and not just the network one) isn't. A modern CPU will decompress a file in memory in far less time than it takes to pull a larger file from storage into memory. That's the reasoning behind moving flash>PCIE. Removing (or widening at least) the current bottleneck.
As for the matter at hand, it comes down to how much junk metadata the content creating app will jam into the PDF. Adobe apps are particularly nasty there, trust me, and compression will give you HUGE savings.
Stuff like CutePDF or PDFCreator have a much better time (and smaller output) because they get handed the "raw print data" stripped of all the junk. Compression will give some savings still, just not on a major scale.
I just exported an invoice from an Open Office spreadsheet with graphics as a pdf.
Filesize 27 KB (To be exact it is 27.1 KB)
Interested as to the savings this can make in our backup window on the fileservers. Provisioned three times what we had 30 months ago and we're getting full again... So much for quotas!!
smaller or smaller.
i don't have a problem with the kb used by a file but i often have a problem reading pdfs on ereaders. sure you can zoom in and scroll but that can be annoying and time consuming. a java app called briss does the job of trimming off excess margins. just my 2c.
it as a by product trims off a few kb.
Original .DOC is huge.
Original Word file, one page, text only, barely any formatting - 75kb.
Using one of those free ad PDF printers - 15kb PDF. Text file, no tables, converted to IMAGE PDF, as if you just scanned the thing. I believe there is some JPEG compression behind the scenes too?
Same file in Notepad, which is pretty much just ASCII - less than 1kb.
Yes, I'd guess MsOffice wastes space on those files. OK, I can agree that in order to make a file editable, you must have a bit of over-coding... but WTF?
As I wrote in my previous message, just making a single table divider a bit fatter means 50 bytes in (uncompressed) PowerPoint. So if the table has 50 cells, that is about 50*4*50 bytes just for that feature ! At least, that is how PowerPoint does it; maybe OpenXML has better options I am not aware of...
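Working that arithmetic through, as a minimal sketch: the 50-byte figure and the 50-cell table are the numbers quoted above, while the border string is a hypothetical stand-in padded to 50 bytes, not real PowerPoint or OpenXML markup.

```python
import zlib

BYTES_PER_BORDER = 50  # figure quoted in the comment above
BORDERS_PER_CELL = 4   # left, right, top, bottom
CELLS = 50

total = BYTES_PER_BORDER * BORDERS_PER_CELL * CELLS
print(total)  # 10000 bytes just for fatter dividers

# Hypothetical stand-in for one border definition, padded to 50 bytes.
border = b'<border side="x" width="2pt"/>'.ljust(BYTES_PER_BORDER)
blob = border * (BORDERS_PER_CELL * CELLS)

# Identical repeats like this are the easy case for a general purpose compressor.
packed = zlib.compress(blob, 9)
print(len(blob), len(packed))
```

Exact repeats collapse to almost nothing under deflate, so this kind of redundancy mostly costs space in the uncompressed form. Near-duplicates scattered through a large file are the harder case.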
Lets have more adverts posing as news articles please. Truly excellent work!
Because, after all, if we needed compression technology it's not like us readers could find a company hawking it on our own.
Exactly! We're all quite capable of going to our favourite IT news website to find out about new and interesting technologies - we hardly need The Register to tell us about them.
Maybe balesio could make a reader for their new format that isn't a malware nightmare. Saving a bit of rust is nice, but a replacement for reader would be awesome.
PDF differences mostly JPEG 2000 & downsampling
The PDF sample on their website is ~2.45MB.
95% of that is images. They're pretty large, hi resolution (304 dpi) and compressed with the JPEG filter.
In the compressed file from Balesio the images are 200dpi and compressed with the much newer JPEG2000 filter.
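For a sense of how much of the saving comes from the downsampling alone, a rough sketch: pixel count scales with the square of the resolution, so going from 304 dpi to 200 dpi leaves well under half the pixels before any change of codec. The function name is made up for illustration, and the real byte saving will differ because JPEG and JPEG2000 output does not scale linearly with pixel count.

```python
def pixel_ratio(src_dpi: float, dst_dpi: float) -> float:
    """Fraction of the original pixel count left after resampling."""
    return (dst_dpi / src_dpi) ** 2

ratio = pixel_ratio(304, 200)
print(f"{ratio:.3f}")          # 0.433
print(f"{(1 - ratio):.1%} fewer pixels")
```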
Using the "PDF Optimizer" in Acrobat with the same settings, and low quality for the compression, the file reduces to 426K. 82.91% of this file is images.
The Balesio compressed file appears to use similar settings to Acrobat's "low quality". Its file size is 431K. 81.80% of this file is images.
Acrobat and Balesio both turn on object-level compression, which helps a bit. Balesio do do a bit more than this (e.g. they join the Contents streams together to save the overhead of the compression filter parameters and the PDF structure around them, and they get rid of the metadata, which deliberately includes a blank 2KB block for updates as well as useful document info). Some of these things can also be done by Acrobat.
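The effect of joining streams can be sketched with plain zlib, which implements the same deflate scheme as PDF's Flate filter. This is only an illustration with made-up content fragments, not Balesio's or Acrobat's actual processing, and it counts only the compressed bytes; the per-stream PDF dictionary and object wrapper mentioned above would add further overhead to the "separate" case.

```python
import zlib

# Made-up page content fragments; real PDF content streams would differ.
fragments = [
    f"BT /F1 12 Tf 72 {720 - 14 * i} Td (Line {i} of the sample text) Tj ET\n".encode()
    for i in range(40)
]

# Each fragment compressed on its own: every stream pays the header and
# checksum cost and cannot reuse patterns seen in the other fragments.
separate = sum(len(zlib.compress(f, 9)) for f in fragments)

# All fragments joined and compressed once.
joined = len(zlib.compress(b"".join(fragments), 9))

print("separate:", separate, "joined:", joined)
```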
Of course you need to ensure that your reader works with JPEG2000 and Object streams, so it needs to be PDF 1.5 compatible at least.
Almost all of the above can be done at PDF creation time, if using a decent PDF creator application and if the user knows how to use it. And wants to downsample the images and use JPEG2000, which often is not the case.