Two weeks after it was first made aware of the problem, Xerox has begun rolling out a fix for a software glitch that caused numbers in documents scanned by certain of its WorkCentre multi-function printers (MFPs) to come up garbled. "Our engineering team has been working around the clock to deliver the patch," Xerox wrote in a …
I still don't fully understand this.
If it's faulty OCR, why does the copy look such poor quality?
If it's not using OCR, how do the digits get flipped?
Sounds like an over-optimization in an attempt to improve processing time, as if they reduced the resolution the page gets digitized at while attempting OCR. If 8s and 6s are among the affected digits, I can see why that would be a problem: throw in oil and dirt on the scan surface, and a low-resolution comparison during digitization would give an incorrect result.
When you think about it, how often do you clean the glass on a copier, or worry about dust and other grime on it, compared with something like a camera lens? I know when I do high-quality art scans, dust and other fine debris can be a big annoyance and can degrade the quality of a scan.
It's not OCR as such - it's a phenomenally aggressive and ill-advised compression algorithm - JBIG something, if memory serves. It figures that since a lot of data involves repeated glyphs, they only need to store one copy of each glyph and then just paste 'em all over the place.
So it's not OCR, in the sense that it doesn't "know" it's a 6. It could be a ɮ or a ʯ (or would that be an ʯ?); all the algorithm cares about is that it stores one ʯ and then goes CONTROL V Y'ALL whenever it finds a similar shape in the document. So that terribly expensive 500-odd bits of data for every ʯ gets crunched down into one ʯ bitmap (which, from the looks of things, is compressed within an inch of its life itself) and then the rest of them are just coordinates and a pointer.
It's fabulously efficient - for all intents and purposes, it is a bit like OCR, except they cleverly get around the problem of accurately knowing what a symbol is by not caring what it is; they just care that they're all alike.
Except, of course, when they're not. And that's the problem: It's so aggressive that it's apparently seeing a '6' and seeing an '8' and saying, "Eh, close enough for government work", and deciding they're both 6s (or both 8s). So, bada google, bada boom, you end up with a nigh-on undetectable error made in the worst of all possible ways, and all that for no particular reason anyway, since I have yet to figure out why saving memory in exchange for lots of CPU does any damn good whatsoever in a freakin' copier.
But I guess that's why I don't work for Xerox.
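For anyone who wants to see the failure mode in miniature, here is a rough Python sketch of the store-one-glyph-and-paste-it idea described above. It is a toy under stated assumptions, not JBIG2 itself and certainly not Xerox's code: the 5x7 bitmaps, the pixel-difference comparison, and the names hamming, compress, decompress and max_diff are all made up for illustration. A real encoder matches symbols in its own way.

```python
# Toy glyph bitmaps, 5 wide by 7 tall; '#' is ink. (Hypothetical shapes.)
SIX = [
    ".###.",
    "#....",
    "#....",
    "####.",
    "#...#",
    "#...#",
    ".###.",
]
EIGHT = [
    ".###.",
    "#...#",
    "#...#",
    ".###.",
    "#...#",
    "#...#",
    ".###.",
]
ONE = [
    "..#..",
    ".##..",
    "..#..",
    "..#..",
    "..#..",
    "..#..",
    ".###.",
]


def hamming(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(ca != cb for ra, rb in zip(a, b) for ca, cb in zip(ra, rb))


def compress(glyphs, max_diff):
    """Store one bitmap per 'similar' shape, plus a pointer for each glyph.

    max_diff is an assumed tolerance: a glyph within that many differing
    pixels of a stored bitmap is replaced by a pointer to that bitmap.
    """
    dictionary = []  # the bitmaps that are actually stored
    refs = []        # one index into dictionary per glyph on the page
    for glyph in glyphs:
        for index, stored in enumerate(dictionary):
            if hamming(glyph, stored) <= max_diff:
                refs.append(index)  # "close enough": reuse the stored shape
                break
        else:
            dictionary.append(glyph)  # new shape: store it
            refs.append(len(dictionary) - 1)
    return dictionary, refs


def decompress(dictionary, refs):
    """Rebuild the page by pasting the stored bitmap at every pointer."""
    return [dictionary[index] for index in refs]


page = [SIX, EIGHT, ONE, EIGHT, SIX]

# Exact matching only: the page survives the round trip.
exact = decompress(*compress(page, max_diff=0))

# Sloppy matching: the toy 6 and 8 differ by only 3 pixels, so they merge.
sloppy = decompress(*compress(page, max_diff=4))
```

With max_diff=0 the page comes back unchanged. With max_diff=4 both 8s come back as perfectly crisp 6s, because the first shape stored wins, and nothing in the output hints that anything was swapped.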
Thanks for the explanation. Baffling that they decided to take that route for the sake of a few hundred kilobytes.
It sounds like it could be the bizarre brain-child of an eccentric but senior engineer at Xerox. And it's worked well enough until now that no one could be bothered to tell him what a pile of over-engineered crap the idea was.
To enable the various options
It's so you can read in hundreds of pages at a time and then collate 50 copies of them. Without the ability to store all the pages you would have to cycle around the physical input multiple times (which would not be feasible for bound books etc.) or use some huge collation robot.
Of course, disks are now pretty damn cheap, so I don't know if this is only older copiers, or if Xerox decided it was cheaper to ship an aggressive compression algorithm than a decent amount of storage.
Re: for the sake of a few hundred kilobytes.
It's not the kilobytes they are trying to save, it's processing cycles spent swapping the kilobytes every time it encounters the glyph. Memory is cheap, processing speed not quite as much. Especially when you have a fleshy standing at the copier who thinks it's all just an optical copy like it was in the good old days.
Still the worst kind of error to have: one that doesn't call attention to itself. Unless you are doing a careful line by line comparison, you aren't likely to spot it.
I'll bet the manager responsible for this mess has seen his career prospects deep-8ed, eh?
Actually, that manager will blame some programmer that no longer works with the firm.
I know few managers that would take responsibility for a cock-up like that. Have you noticed your manager's bathroom never needs air freshener? Never stinks in there.
Well, even if we accept the myth that a manager is ever responsible for anything, I'm not sure they'd know who the manager responsible for it was.
RE: I'll bet the manager responsible for this mess has seen his career prospects deep-8ed, eh?
More likely, promoted.
Re: RE: I'll bet the manager responsible for this mess has seen his career prospects deep-8ed, eh?
It is JBIG2 (http://en.wikipedia.org/wiki/JBIG2) and was designed by a legitimate Standards group (http://en.wikipedia.org/wiki/Joint_Bi-level_Image_Experts_Group). Xerox may be responsible for the aggressive glyph-matching bug, though.
Many of these documents will be from paper to PDF scans, and there is no way back or any ability to determine if changes have occurred. The good news is that any in-machine OCR did not use the compressed images, so OCR data may be better than the image.
A solution for a problem
that should not have gotten past their internal alpha testing.
Re: A solution for a problem
I can easily see this getting past testing. All it takes is two or three rotations of the management teams. First team put together the idea when everything was expensive. They set out testing parameters for the edge cases, ran it and everything checked. Rotate in the second team. They tweak some things to speed testing. Rotate in the third team. They tweak some things in the algorithm. Now the tweaked algorithm causes a problem the tweaked test no longer checks, but it passes the test.
And that's assuming they all respect the work that came previously. A friend came back from a convention I helped run. He noted the registration line never really cleared the whole weekend. It's a problem some friends and I solved with hard work and by spending some hard earned money years ago. But they figured we didn't know what we were doing and made a change "to improve efficiency." In this case the specifics are that they axed the paid help we brought in to assist because "they were just slowing things down." Never checked to see if the free volunteers (who do tend to be faster because of enthusiasm) could supply enough free workers to staff everything.
I saw they also converted our modified bank line algorithm to a full bank line model. We'd stack registrants two or three deep at each booth, but feed from a single bank line. In theory the bank line is more efficient because the next person moves to the next aisle. That's fine as long as the walk to the booth is quick compared to the average processing time. But as the walk to the booth approaches the average, the efficiency goes away. So if you stack two or three at the booth, you don't lose down time for the walk. All critical when you are trying to hit a mean processing time of about 47 seconds a person including information confirmation, badge selection, assembly and payment.
But we didn't understand their new problems so they've tossed us on the ash heap of history.
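The walk-versus-processing point can be put in rough numbers. This is only a back-of-the-envelope Python sketch: the 47 seconds is the target from the post, but the walk times and the function name booth_throughput are assumptions for illustration, and the model ignores variance and assumes a stacked booth always has someone waiting.

```python
def booth_throughput(service_s, walk_s, stacked):
    """People per hour through one booth.

    In a pure bank line the booth sits idle while the next person walks
    over, so each cycle costs service time plus walk time. With people
    stacked at the booth, the walk overlaps the previous person's service.
    """
    cycle_s = service_s if stacked else service_s + walk_s
    return 3600.0 / cycle_s


SERVICE_S = 47.0  # mean processing time quoted in the post

for walk_s in (5.0, 20.0, 47.0):  # assumed walk times, in seconds
    pure = booth_throughput(SERVICE_S, walk_s, stacked=False)
    stack = booth_throughput(SERVICE_S, walk_s, stacked=True)
    print(f"walk {walk_s:>4.0f}s: pure {pure:5.1f}/h, stacked {stack:5.1f}/h")
```

Under these assumptions a short walk costs little, but once the walk takes as long as the processing itself, a pure bank line booth handles half as many people per hour as a stacked one.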
Contract small print.
I can see a lot of people carefully checking any contracts which may have passed through any affected copiers. Who knows what contract terms may have been invalidated when the customer got their copy.