[WEB4LIB] Scanning text

Grace Agnew gagnew at rci.rutgers.edu
Fri Mar 15 10:59:25 EST 2002


Betty,

There are at least a couple of options:

1.  Digitize your pages as uncompressed TIFF at 150-300 ppi.  Then provide 
the book as either PDF or DjVu.  Both are commercial products that provide 
automatic page concatenation with bookmarking, etc.  Both require additional 
plug-ins at the browser for use.  Large PDF files are difficult to download 
and work better for download to the printer, but Adobe Acrobat 5.0 has a 
"save as web-enabled" feature that streamlines the output for the web.  It's 
easy to OCR with either DjVu or PDF.  You can use Adobe Capture or the Caere 
product (which I think uses FineReader).  FineReader is the better OCR.  OCR 
packages such as ABBYY FineReader offer you the option of a separate text 
file or a text file attached to the image, which is transparent to the user.  
DjVu has a particularly good OCR capability that apparently lets you "lift" 
words right off the image into word processing.  Both of these are 
commercial "use copy" solutions, and offer great functionality for the cost 
but not interoperability with other scanned books.
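
To get a feel for what uncompressed masters at 150-300 ppi cost in disk 
space, here is a rough back-of-the-envelope sketch in Python.  The 6 x 9 
inch page size and 24-bit color depth are my assumptions, not anything 
about your books, and the figure covers pixel data only (a real TIFF adds 
a small header).

```python
def uncompressed_tiff_bytes(width_in, height_in, ppi, bits_per_pixel=24):
    """Approximate pixel-data size of one uncompressed page scan, in bytes."""
    width_px = round(width_in * ppi)    # pixels across the page
    height_px = round(height_in * ppi)  # pixels down the page
    return width_px * height_px * bits_per_pixel // 8


# Assumed page size: 6 x 9 inches, scanned in 24-bit color.
for ppi in (150, 300):
    size = uncompressed_tiff_bytes(6, 9, ppi)
    print(f"{ppi} ppi: {size / 1_000_000:.1f} MB per page")
```

Doubling the resolution quadruples the size of each master, so it is worth 
settling on the resolution before scanning a whole book.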

2.  There are some fairly simple page turners out there, written primarily 
in Perl, that pull together JPEG images with navigational capability at the 
page level (go to page 7, previous page, next page, etc.).  If you find one 
you like, contact the library that runs it.  They will generally provide you 
with their Perl or Java script.  You can use a simple page turner to display 
JPEG or PNG images without the need for additional plug-ins at the browser.
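
To show how little is involved, here is a minimal sketch of the idea in 
Python rather than Perl.  It is not any particular library's script: the 
function names and the page1.html, page2.html naming scheme are made up 
for illustration.  It writes one static HTML file per page image, each 
with previous/next links and a "go to page" row.

```python
from html import escape
from pathlib import Path


def render_page(page_no, total, image_name, title):
    """Return the HTML for one page of the book (pages are numbered from 1)."""
    nav = []
    if page_no > 1:
        nav.append(f'<a href="page{page_no - 1}.html">Previous page</a>')
    if page_no < total:
        nav.append(f'<a href="page{page_no + 1}.html">Next page</a>')
    # "Go to page N" row: every page is a link except the current one.
    jump = " ".join(
        str(n) if n == page_no else f'<a href="page{n}.html">{n}</a>'
        for n in range(1, total + 1)
    )
    return (
        f"<html><head><title>{escape(title)}, page {page_no}</title></head>\n"
        f"<body>\n<p>{' | '.join(nav)}</p>\n"
        f'<img src="{escape(image_name)}" alt="Page {page_no}">\n'
        f"<p>Go to page: {jump}</p>\n</body></html>\n"
    )


def build_page_turner(image_names, title, out_dir):
    """Write page1.html .. pageN.html into out_dir and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for page_no, image_name in enumerate(image_names, start=1):
        path = out / f"page{page_no}.html"
        html = render_page(page_no, len(image_names), image_name, title)
        path.write_text(html, encoding="utf-8")
        paths.append(path)
    return paths
```

Calling build_page_turner with the list of page images in reading order, 
a title such as "Algoma Mines", and an output directory produces a set of 
pages you can link to from page1.html.  The images are assumed to sit in 
the same directory as the generated HTML.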


3.  A more complicated solution, and more difficult to implement 
immediately, but a preferable long-term solution for building interoperable 
collections across many institutions, would be to create a "page turner" 
that uses the METS structure map.  A "page turner" is primarily a Perl or 
Java program that works with the structure map to provide navigation 
by section name (e.g. chapter names), by page, etc.

  In this case, you structure the divisions of the work into chapters, 
pages, illustrations, etc.  In the long run, I think METS will have the 
most utility because it should enable output in numerous formats, from TEI 
(for OCRed text) to image displays in multiple formats.  It also creates an 
interoperable framework where--again over time--users could mine structured 
text across a collection--for example to retrieve all the bibliographies 
from a book collection or all the tables of contents.  METS structure maps 
will provide an open-standards means for creating any structured 
resource--books, e-journals, diaries, correspondence, etc. and searching 
across collections to retrieve, for example, any digital diaries.  Within 
an institution, structure maps could be used in an annual report collection 
to pull out all the budget spreadsheets, for example.  You are able to 
define how granular the structure should be.
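
As a rough illustration of the structure map idea, here is a minimal 
Python sketch that builds a METS structMap of nested div elements (book, 
chapter, page) and then pulls out the pages of one section by name.  The 
chapter labels and file IDs are invented for the example, and the sketch 
leaves out the fileSec that a complete METS document would need to 
declare those file IDs, so treat it as an outline, not a valid record.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_struct_map(book_title, chapters):
    """chapters: list of (label, [file_id, ...]) tuples in reading order."""
    mets = ET.Element(q("mets"))
    struct_map = ET.SubElement(mets, q("structMap"), TYPE="logical")
    book = ET.SubElement(struct_map, q("div"), TYPE="book", LABEL=book_title)
    page_no = 0
    for label, file_ids in chapters:
        chapter = ET.SubElement(book, q("div"), TYPE="chapter", LABEL=label)
        for file_id in file_ids:
            page_no += 1
            page = ET.SubElement(chapter, q("div"), TYPE="page",
                                 ORDER=str(page_no), LABEL=f"Page {page_no}")
            # fptr points at the page image declared elsewhere in the record.
            ET.SubElement(page, q("fptr"), FILEID=file_id)
    return mets


def divs_of_type(mets, div_type):
    """All divs of a given TYPE, in document order."""
    return [d for d in mets.iter(q("div")) if d.get("TYPE") == div_type]


def pages_in_section(mets, section_label):
    """File IDs of the pages inside the chapter-level div with this LABEL."""
    for chapter in divs_of_type(mets, "chapter"):
        if chapter.get("LABEL") == section_label:
            return [page.find(q("fptr")).get("FILEID")
                    for page in chapter.findall(q("div"))
                    if page.get("TYPE") == "page"]
    return []


# Hypothetical book: labels and file IDs are made up for illustration.
book = build_struct_map("Algoma Mines", [
    ("Chapter 1", ["IMG0001", "IMG0002"]),
    ("Bibliography", ["IMG0003"]),
])
print(pages_in_section(book, "Bibliography"))
print(ET.tostring(book, encoding="unicode"))
```

The same lookup run over many books is what would let you retrieve, say, 
every bibliography in a collection, provided the institutions label their 
divisions consistently.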



At 10:06 AM 3/14/2002 -0800, you wrote:
>Greetings, folks,
>
>I have been lurking on this list for a while and find your hints, tips &
>tricks to be very useful, so I'm hoping you'll be able to help me out.
>
>The Thunder Bay Public Library staff are working on a digitization project.
>We're scanning lots of historical photos which will be accessible via our
>website, and we're pretty comfortable with that whole process.
>
>What's new for us is text scanning. We  would like to scan the contents of
>several (public domain) books in non-OCR mode - we're simply doing the
>images of the pages. We're not sure how we make the page scans of the book
>contents available to users as a single unit, if that makes sense, i.e.
>they would click on a link to "Algoma Mines" and then voila, the scanned
>book magically appears in its entirety. ;) I know it's do-able, since books
>like this are visible all over the Web. But *how* did they do it?
>
>We're using Adobe Photoshop 6.0.
>
>Any advice, help, URLs of resources etc. would be most appreciated.
>
>thanks much,
>
>Betty Braaksma
>Betty Braaksma
>Head of Reference Services
>Thunder Bay Public Library
>Thunder Bay, Ontario
>807-624-4203
>

Grace Agnew
Associate University Librarian for Digital Library Systems
Rutgers, the State University of New Jersey
Library Technical Services Building
47 Davidson Road
Piscataway, NJ  08854-5603

gagnew at rci.rutgers.edu
PH: (732) 445-5908
FAX: (732) 445-5888



