Amazon's optical character recognition toy Textract is here but still a bit short-sighted

Amazon Web Services has announced the general availability of Textract, a service for converting scanned documents to text. Optical character recognition (OCR) is a mature technology built into many applications. Insert a scanned document into Microsoft's OneNote, for example, and you can "copy text from picture" with …

  1. Doctor Syntax Silver badge

    How does it compare with Tesseract ( https://github.com/tesseract-ocr ) or GOCR ( http://jocr.sourceforge.net/ )?

    Apart, of course, from having to send all your stuff to somebody else's computer?

  2. Primus Secundus Tertius

    Many other products

    Dr S mentions Tesseract. Other products are ABBYY reader and Adobe Acrobat; and as the article mentions, OneNote. I have used all these others. ABBYY I find the best if I want to reflow the text into different page sizes or columns, and Acrobat the best if I want to preserve the original appearance of the document. OneNote turns each line into a new paragraph -- not good for reflowing text.

    None of them work well against cursive or script fonts, and they all fail with handwritten images.

    1. Mage Silver badge

      Re: Many other products

      So not much change in 20 years. You STILL need a human proof reader and sometimes a good copy-typist is faster unless you have an industrial scanner.

      So basically yet another "rent computer time like the 1960s and trust a US megacorp with your privates" service.

    2. Doctor Syntax Silver badge

      Re: Many other products

      "ABBYY reader and Adobe Acrobat"

      ABBYY is Windows only. Acrobat, a long time since I looked at it but only Reader was available for Linux. If it ain't cross platform it don't count.

  3. tiggity Silver badge

    Plenty of other web OCR APIs

    Some have been around for ages; I used a few of them in the past to see how they compared to Tesseract. Big names such as MS have cloudy OCR APIs, and lots of smaller, more specialist companies provide such services.

  4. Crazy Operations Guy Silver badge

    Total garbage with non-ASCII text

    Ran a few documents through it to test it. A document written in standard English worked all right (a few minor errors), but it started getting a little worse with docs in German and French. It just fell over and puked when I fed it some documents written in Chinese, Korean, Arabic, and Hindi. Pretty much the further you got away from ASCII characters, the worse it performed.

    All of these were just translated versions of the same marketing document.

  5. Anonymous Coward

    (OCR) is a mature technology

    mature my ass, I've used FineReader since version 6 or 7 and it's still as shitty with tables and pics as it was almost 20 years ago (they're on version 14 or something now, only with an install file about 4 times larger). And still the same bugs in actual OCR (no, they clearly don't act on bug reports). But hey, I've tried all the other major OCR software and it all fails miserably one way or another. All you need is a several-hour-long post-OCR edit. Good news: you kind of get to know the text better ;)

    btw, with all the "AI" posturing around, you would think that, by now, some of this "intelligence" would have already been applied to recognize at least the same layout patterns ("this book has inserted pics every x pages, so treat them as pix with description THROUGHOUT THE FUCKING TEXT!"). Or if too challenging, at least recognize the same words _consistently_ across the same text. At least those words that are - in some languages - pretty unambiguous, oH, newar mimb..

  6. ButWhy?

    Sorry but has Oracle Text not already been doing this very successfully for the past 20 odd years?
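
Several comments above compare OCR engines by feel ("a few minor errors", "a little worse", "fell over"). One common way to put a number on that is the character error rate: the edit distance between a hand-checked reference text and the OCR output, divided by the length of the reference. The sketch below is a minimal illustration of that idea, not the method any commenter used. The function names and the sample strings are hypothetical, and it does not call Textract, Tesseract, or any other engine; it only compares text you have already obtained from one.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by reference length, after NFC normalisation.

    Normalising first means a composed and a decomposed accent count as
    the same character, which matters for non-ASCII text.
    """
    ref = unicodedata.normalize("NFC", reference)
    out = unicodedata.normalize("NFC", ocr_output)
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, out) / len(ref)


if __name__ == "__main__":
    # Hypothetical (reference, OCR output) pairs, invented for illustration.
    samples = {
        "English": ("Total cost", "Total c0st"),
        "German": ("Größe prüfen", "Grosse prufen"),
    }
    for language, (truth, recognised) in samples.items():
        rate = character_error_rate(truth, recognised)
        print(f"{language}: CER = {rate:.2%}")
```

Running the same translated document through each engine and comparing the resulting rates per language would turn "the further you got away from ASCII, the worse it performed" into something measurable, provided a trusted reference transcription exists for each language.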


Biting the hand that feeds IT © 1998–2019