[WEB4LIB] Scanning text
Grace Agnew
gagnew at rci.rutgers.edu
Fri Mar 15 10:59:25 EST 2002
Betty,
There are at least a couple of options:
1. Digitize your pages as uncompressed TIFF at 150-300 ppi. Then
provide the book as either PDF or DjVu. Both
commercial products provide automatic page concatenation with bookmarking,
etc. Both require additional plug-ins at the browser for use. Large PDF
files are difficult to download and are better suited to download for
printing, but Adobe Acrobat 5.0 has a "save as web-enabled" feature that
streamlines the output for the web. It's easy to OCR with either DjVu or
PDF. You can
use Adobe Capture or the Caere product (which I think uses
FineReader). FineReader is the better OCR engine. OCR packages such as
Abbyy (not sure about spelling) FineReader offer you the option of a
separate text file or a text file attached to the image, which is
transparent to the user. DjVu has a particularly good OCR capability that
apparently lets you "lift" words right off the image into a word
processor. Both of these are
commercial "use copy" solutions, and offer great functionality for the cost
but not interoperability with other scanned books.
2. There are some fairly simple page turners out there, written primarily
in Perl, that pull together JPEG images with navigational capability at the
page level (go to page 7, previous page, next page, etc.). If you find one
you like, contact the library that runs it. They will generally provide
you with their
Perl or Java script. You can use a simple page turner to display JPEG or
PNG images without the need for additional plug-ins at the browser.
3. A more complicated solution, and one that is harder to implement
immediately but preferable in the long term for building interoperable
collections across many institutions, would be to create a "page turner"
driven by a METS structure map. A "page turner" is primarily a Perl or
Java program that works with the structure map to provide navigation
by section name (e.g. chapter names), by page, etc.
In this case, you structure the divisions of the work into chapters,
pages, illustrations, etc. In the long run, I think METS will have the
most utility because it should enable output in numerous formats, from TEI
(for OCRed text) to image displays in multiple formats. It also creates an
interoperable framework where--again over time--users could mine structured
text across a collection--for example to retrieve all the bibliographies
from a book collection or all the tables of contents. METS structure maps
will provide an open-standards means of creating any structured
resource--books, e-journals, diaries, correspondence, etc.--and of
searching across collections to retrieve, for example, any digital
diaries. Within
an institution, structure maps could be used in an annual report collection
to pull out all the budget spreadsheets, for example. You are able to
define how granular the structure should be.
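To make option 3 a little more concrete, here is a rough sketch (in Python
rather than Perl or Java, purely for brevity) of a structure map and a tiny
"page turner" that reads it. The element and attribute names (structMap,
div, fptr, fileSec, FLocat) come from the METS schema; everything else is an
assumption for illustration: the chapter labels, the image file names, the
IMG0001-style IDs, and the helper functions build_mets, page_list,
first_page_of and turn are all hypothetical, and a real METS document would
carry descriptive and administrative metadata that is left out here.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets(title, chapters):
    """Build a minimal METS tree.

    chapters: list of (chapter label, [image file names]) in reading order.
    Labels and file names are supplied by the caller (hypothetical here).
    """
    root = ET.Element(q("mets"))
    file_grp = ET.SubElement(ET.SubElement(root, q("fileSec")), q("fileGrp"))
    struct_map = ET.SubElement(root, q("structMap"), TYPE="logical")
    book = ET.SubElement(struct_map, q("div"), TYPE="book", LABEL=title)
    order = 0
    for label, images in chapters:
        chapter = ET.SubElement(book, q("div"), TYPE="chapter", LABEL=label)
        for name in images:
            order += 1
            file_id = f"IMG{order:04d}"  # made-up ID scheme
            f = ET.SubElement(file_grp, q("file"), ID=file_id,
                              MIMETYPE="image/jpeg")
            ET.SubElement(f, q("FLocat"),
                          {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": name})
            page = ET.SubElement(chapter, q("div"), TYPE="page",
                                 ORDER=str(order))
            ET.SubElement(page, q("fptr"), FILEID=file_id)
    return root


def page_list(root):
    """Flatten the structure map into page records sorted by ORDER."""
    hrefs = {f.get("ID"): f.find(q("FLocat")).get(f"{{{XLINK_NS}}}href")
             for f in root.iter(q("file"))}
    pages = []
    for chapter in root.iter(q("div")):
        if chapter.get("TYPE") != "chapter":
            continue
        for page in chapter.findall(q("div")):
            file_id = page.find(q("fptr")).get("FILEID")
            pages.append({"order": int(page.get("ORDER")),
                          "chapter": chapter.get("LABEL"),
                          "image": hrefs[file_id]})
    return sorted(pages, key=lambda p: p["order"])


def first_page_of(pages, chapter_label):
    """Navigation by section name: page number where a chapter starts."""
    for p in pages:
        if p["chapter"] == chapter_label:
            return p["order"]
    raise KeyError(chapter_label)


def turn(pages, number):
    """Navigation by page: the record for page `number` plus the
    previous/next page numbers (None at either end of the book).
    Assumes pages are numbered 1..N without gaps, as build_mets does."""
    index = number - 1
    if not 0 <= index < len(pages):
        raise ValueError(f"no page {number}")
    return {**pages[index],
            "previous": number - 1 if index > 0 else None,
            "next": number + 1 if index < len(pages) - 1 else None}


if __name__ == "__main__":
    mets = build_mets("Algoma Mines",
                      [("Chapter 1", ["p001.jpg", "p002.jpg"]),
                       ("Chapter 2", ["p003.jpg"])])
    pages = page_list(mets)
    print(turn(pages, first_page_of(pages, "Chapter 2")))
    print(ET.tostring(mets, encoding="unicode"))
```

The point of the design is that the navigation code never looks at file
names directly; it only walks the structure map, so the same turner could
in principle serve any book described this way, at whatever granularity
the divisions were defined.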
At 10:06 AM 3/14/2002 -0800, you wrote:
>Greetings, folks,
>
>I have been lurking on this list for a while and find your hints, tips &
>tricks to be very useful, so I'm hoping you'll be able to help me out.
>
>The Thunder Bay Public Library staff are working on a digitization project.
>We're scanning lots of historical photos which will be accessible via our
>website, and we're pretty comfortable with that whole process.
>
>What's new for us is text scanning. We would like to scan the contents of
>several (public domain) books in non-OCR mode - we're simply doing the
>images of the pages. We're not sure how we make the page scans of the book
>contents available to users as a single unit, if that makes sense, i.e.
>they would click on a link to "Algoma Mines" and then voila, the scanned
>book magically appears in its entirety. ;) I know it's do-able, since books
>like this are visible all over the Web. But *how* did they do it?
>
>We're using Adobe Photoshop 6.0.
>
>Any advice, help, URLs of resources etc. would be most appreciated.
>
>thanks much,
>
>Betty Braaksma
>Head of Reference Services
>Thunder Bay Public Library
>Thunder Bay, Ontario
>807-624-4203
>
Grace Agnew
Associate University Librarian for Digital Library Systems
Rutgers, the State University of New Jersey
Library Technical Services Building
47 Davidson Road
Piscataway, NJ 08854-5603
gagnew at rci.rutgers.edu
PH: (732) 445-5908
FAX: (732) 445-5888