[WEB4LIB] Re: Text extraction from pdf files.

rich at richardwiggins.com rich at richardwiggins.com
Sat Sep 8 10:56:42 EDT 2001


But isn't that a settable parameter?  I thought the whole idea was that if you are doing massive batch runs where you can't have a human operator make the decision for characters in doubt, you want to preserve the image of each suspect character so that a future reader can make his or her own decision. 

Adobe always understood that this inhibited automated use of the text (e.g., indexing), but I thought you could specify at scan time what confidence level it needed for each character.  And in any event, you can have a human operator go through and fix the suspect characters.

If scanning instead just took its best guess for suspect characters, you'd still end up with a text file that was not very usable, since a large percentage of the time the guess is wrong.
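
To make the threshold idea concrete, here is a minimal sketch in Python. It is purely illustrative: the Glyph record, the 0-to-1 confidence scale, the "?" placeholder, and the extract_text function are all assumptions made up for this example, not anything taken from Adobe Capture or Acrobat. It only shows how a settable confidence level might trade off converted text against flagged suspect characters.

```python
# Hypothetical illustration only: this is NOT Adobe Capture's API or data model.
from dataclasses import dataclass


@dataclass
class Glyph:
    guess: str         # the OCR engine's best-guess character
    confidence: float  # assumed confidence score between 0.0 and 1.0


def extract_text(glyphs, threshold, placeholder="?"):
    """Keep guesses at or above the threshold; flag the rest as suspect.

    Returns the extracted text plus the positions of the suspect characters,
    which a human operator could review later.
    """
    chars = []
    suspects = []
    for position, glyph in enumerate(glyphs):
        if glyph.confidence >= threshold:
            chars.append(glyph.guess)
        else:
            chars.append(placeholder)
            suspects.append(position)
    return "".join(chars), suspects


# Made-up scores for the word "library".
word = [
    Glyph("l", 0.99), Glyph("i", 0.97), Glyph("b", 0.55), Glyph("r", 0.98),
    Glyph("a", 0.96), Glyph("r", 0.40), Glyph("y", 0.95),
]

strict_text, strict_suspects = extract_text(word, threshold=0.9)
best_guess_text, best_guess_suspects = extract_text(word, threshold=0.0)
```

With a threshold of 0.0 every best guess is accepted and nothing is flagged for review, which is the "just take its best guess" case; raising the threshold leaves more gaps but tells you exactly where a human needs to look.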

So isn't it the case that the frequency of this problem would vary greatly depending on the amount invested in the scanning work?  Or are you saying that in the real world no one ever invests that much effort?  :-)

/rich

On Wed, 05 September 2001, Roy Tennant wrote:

> 
> If you scanned the documents using Adobe Acrobat, then you may not 
> have as much "plain text" as you think. The way Adobe Capture worked 
> (and I can only presume that Exchange works in a similar fashion) is 
> that it does what OCR it is sure of, and then it fills in the gaps 
> with image fragments. This means that although you have a document 
> that is fairly close in appearance to the original, and with at least 
> part of the text converted to machine-readable form, it isn't 
> necessarily all converted. Extracting plain text from such a file 
> would be an exercise in futility -- you'd be better off rescanning 
> from scratch or sending it offshore to be rekeyed.

_____________________________________________________

Richard Wiggins
Writing, Speaking, and Consulting on the Internet
rich at richardwiggins.com  http://richardwiggins.com 

