How does it compare with Tesseract ( https://github.com/tesseract-ocr ) or GOCR ( http://jocr.sourceforge.net/ )?
Apart, of course, from having to send all your stuff to somebody else's computer?
Amazon Web Services has announced the general availability of Textract, a service for converting scanned documents to text. Optical character recognition (OCR) is a mature technology built into many applications. Insert a scanned document into Microsoft's OneNote, for example, and you can "copy text from picture" with …
Dr S mentions Tesseract. Other products are ABBYY reader and Adobe Acrobat; and as the article mentions, OneNote. I have used all these others. ABBYY I find the best if I want to reflow the text into different page sizes or columns, and Acrobat the best if I want to preserve the original appearance of the document. OneNote turns each line into a new paragraph -- not good for reflowing text.
None of them work well against cursive or script fonts, and they all fail with handwritten images.
Ran a few documents through it to test it. A document written in standard English worked all right (a few minor errors), but results started getting a little worse with docs in German and French. And it just fell over and puked when I fed it some documents written in Chinese, Korean, Arabic, and Hindi. Pretty much the further you got away from ASCII characters, the worse it performed.
All of these were just translated versions of the same marketing document.
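For anyone who wants to put a number on "the worse it performed", here is a rough sketch of a character error rate check. It assumes you have a hand-typed reference transcript for each document plus the text the OCR service returned; the function names are my own, nothing here comes from Textract or any other OCR product.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by reference length (hypothetical helper)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)
```

Run it per language on the same document and compare the rates. It counts Unicode code points, so treat the figures for scripts with combining marks as approximate.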
Mature my ass. I've used FineReader since version 6 or 7 and it's still as shitty with tables and pics as it was almost 20 years ago (they're on version 14 or something now, only with an install file about 4 times larger). And still the same bugs in the actual OCR (no, they clearly don't act on bug reports). But hey, I've tried all the other major OCR software and they all fail miserably one way or another. All you need is a several-hour-long post-OCR editing session. Good news: you kind of get to know the text better ;)
btw, with all the "AI" posturing around, you would think that, by now, some of this "intelligence" would have already been applied to recognize at least the same layout patterns ("this book has inserted pics every x pages, so treat them as pix with description THROUGHOUT THE FUCKING TEXT!"). Or if too challenging, at least recognize the same words _consistently_ across the same text. At least those words that are - in some languages - pretty unambiguous, oH, newar mimb..
Biting the hand that feeds IT © 1998–2019