[WEB4LIB] PDF files and scanners

Thu Feb 10 09:00:22 EST 2000

----- Original Message -----
From: "Michael A. Weber" <webermi at alvernia.edu>
>
> I am working on putting our college viewbook out on our webpage in Adobe
> Acrobat "pdf" files...

> I can't read the text that I have in my multi-colored, photopacked
> file...

> Another problem source may be how I save the file.  I cannot save the
> file directly as a PDF file.  I am saving it as Bitmap...

PDF allows both formatted text and bitmap images in the same file.  The
text elements in PDF are resolution independent, while (obviously) the
bitmap images are not.  Zoom in to 1600% magnification and the differences
are immediately obvious.  As you describe the process, you're ending up
with no text, per se, but just pictures of text--a very important
difference.
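
To make "text stored as text" concrete, here is a rough Python sketch that
writes a one-page PDF by hand.  The function name build_text_pdf and the
sample string are my own inventions for illustration, and it assumes the
message fits in Latin-1 and that the viewer supplies the standard
Helvetica font.  The point is the content stream: the words travel as a
string handed to the Tj (show text) operator, not as pixels.

```python
def build_text_pdf(message: str) -> bytes:
    """Return a one-page PDF whose page content is a real text object."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = (message.replace("\\", "\\\\")
                      .replace("(", "\\(")
                      .replace(")", "\\)"))
    # BT/ET bracket a text object; Tf picks the font, Td positions, Tj draws.
    content = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.3\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object, for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_at = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # mandatory free entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_at
    return bytes(out)


if __name__ == "__main__":
    with open("hello.pdf", "wb") as handle:
        handle.write(build_text_pdf("College viewbook (2000)"))
```

A scanned page saved into a PDF would instead carry an image object full
of pixel data, which is why zooming in on it only makes the dots bigger.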

To avoid this, your best bet is to generate the PDF from an original
electronic version, like a word processor document, that stores the text
as text.  If that isn't possible, A) tell your college viewbook editors to
get their act together, and B) explore Acrobat's "Capture" function, which
basically does an OCR scan on the bitmapped sections of a PDF file and
converts any text it finds.  But Acrobat Capture will never be as simple
or accurate as running PDFWriter or Distiller on the original file.

> ...what is the strength of using adobe files on
> the web in the first place?

You have to ask, on a net where you commonly see HTML constructions like
<CENTER><FONT FACE="Palatino, Book Antiqua, Zapf Calligraphic, Times,
serif" size="4" color="#055361">...?  Authors and publishers are
presentational control freaks.

The strengths and weaknesses of HTML and PDF are almost complementary.
HTML marks up documents structurally, and by professional publishing
standards provides only crude layout options.  Even with HTML 4's use of
Unicode, problems will arise with unexpected characters or character sets,
and there's currently no way to set mathematical formulas.  Also, of
course, an "HTML" document with graphics is actually a number of files,
possibly living in different file systems or on different servers, which
makes document portability a nightmare.
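
For a feel of how many separate files one "HTML document" can involve,
here is a small Python sketch using only the standard library.  The class
and function names, the tag list, and the example.edu addresses are all
made up for illustration; a real page can pull in files in other ways
(and not every LINK element points at something needed for display), so
treat this as a rough count rather than a complete inventory.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceLister(HTMLParser):
    """Collect the URLs of files a page pulls in besides the HTML itself."""

    # Tag -> attribute that names an external file (an illustrative subset).
    RESOURCE_ATTRS = {
        "img": "src",
        "script": "src",
        "frame": "src",
        "embed": "src",
        "link": "href",
    }

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        wanted = self.RESOURCE_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page's own address.
                self.resources.append(urljoin(self.base_url, value))


def list_resources(html, base_url):
    parser = ResourceLister(base_url)
    parser.feed(html)
    parser.close()
    return parser.resources


if __name__ == "__main__":
    page = ('<HTML><BODY><IMG SRC="images/cover.gif">'
            '<IMG SRC="http://media.example.edu/campus.jpg"></BODY></HTML>')
    for url in list_resources(page, "http://www.example.edu/viewbook/index.html"):
        print(url)
```

Every URL it prints is another file that has to travel with the page, and
in this made-up example one of them lives on a different server entirely.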

PDF, on the other hand, was designed as a page description language that
allows authors to make use of all the layout tools in their word
processor, desktop publishing program, etc.  It allows you to package with
the document any fonts, font subsets, and images that need to be included,
all in one file.  It also includes security settings so that a file may
require a password to print, edit, or select text from the document.
Against that, it lacks HTML's device independence, and to some extent
its platform independence; it is a proprietary format and requires
proprietary software; accessibility is a greater challenge than with HTML;
and file sizes are typically larger.

Thomas Dowling
OhioLINK - Ohio Library and Information Network
tdowling at ohiolink.edu