[WEB4LIB] Text extraction from pdf files.

Roy Tennant roy.tennant at ucop.edu
Thu Sep 6 00:21:38 EDT 2001


If you scanned the documents using Adobe Acrobat, then you may not 
have as much "plain text" as you think. The way Adobe Capture worked 
(and I can only presume that Exchange works in a similar fashion) is 
that it does what OCR it is sure of, and then it fills in the gaps 
with image fragments. This means that although you have a document 
that is fairly close in appearance to the original, and with at least 
part of the text converted to machine-readable form, it isn't 
necessarily all converted. Extracting plain text from such a file 
would be an exercise in futility -- you'd be better off rescanning 
from scratch or sending it offshore to be rekeyed.

But that isn't exactly what you asked... you want to know what your 
options are for converting PDFs to text. The SWISH-E indexing engine 
includes filters for MS Word and Adobe Acrobat files. More info at 
http://sunsite.berkeley.edu/SWISH-E/2.2/docs/split/SWISH-FAQ/Can_I_index_my_PDF_Word_and_co.html
The PDF filter is based on code from the XPDF effort (see the pdftotext 
module), which can be found at http://www.foolabs.com/xpdf/xpdf.html. 
There are no doubt other options as well. Good luck,
Roy
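
A minimal sketch of the pdftotext route, in Python, assuming the 
pdftotext command-line tool from xpdf is installed and on the PATH. 
The function names and the 200-characters-per-page threshold are 
hypothetical, chosen only for illustration. The last helper is a rough 
check for the situation described above, where a scanned PDF holds 
mostly image fragments and very little machine-readable text.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=False):
    """Build the pdftotext argument list; "-" sends the text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # ask pdftotext to keep the physical page layout
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext on one file and return whatever text it finds."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,
    )
    # Assumption: output is UTF-8; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")


def looks_mostly_image(text, min_chars_per_page=200):
    """Heuristic: True when more than half the pages carry little text."""
    pages = text.split("\f")  # pdftotext ends each page with a form feed
    if pages and not pages[-1].strip():
        pages.pop()  # drop the empty piece after the final form feed
    if not pages:
        return True
    sparse = sum(1 for p in pages if len(p.strip()) < min_chars_per_page)
    return sparse > len(pages) / 2
```

If looks_mostly_image(extract_text("scan.pdf")) comes back True for a 
sample file (the file name here is a placeholder), the PDF is probably 
one of the image-heavy cases, and rescanning with a different OCR setup 
is likely the better use of time.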

At 7:07 PM -0700 9/5/01, Tony Parsons wrote:
>Dear all,
>
>This is only a vaguely web-related question, as we'd be using email to
>disseminate the information once this problem is solved. Hopefully it's not
>too inappropriate, I'm just not aware, yet, of many library computing-type
>lists.
>
>Does anyone know how to extract plain text from a pdf file? I have scanned
>some documents with Adobe Exchange, which we would like to manipulate into
>text. I've done a bit of hunting around with not much luck, as far as
>conversion software is concerned. Ghostview/Ghostscript programs seem to
>extract no text from the PDFs I've scanned.
>
>Should I just give up and organise a different method of scanning, or is
>there a *reasonably* straightforward way of doing this?
>
>Regards
>Tony.
>--
>Tony Parsons
>Technical Services Librarian
>Royal Australian College of General Practitioners - Resource Centre
>Ph  (03) 9214 1487
>Fax (03) 9214 1403
>http://www.racgp.org.au


