[WEB4LIB] Re: PDF versus HTML

Patricia F Anderson pfa at umich.edu
Wed Mar 30 11:29:04 EST 2005


Hi, Walt,

On Wed, 30 Mar 2005 Walt_Crawford at notes.rlg.org wrote:

> "1 to 10%" strikes me as considerably different than the reality I see, at
> least using Acrobat 7 directly in Word. A 20,000-word Cites & Insights
> (that is, about 120K of pure text) becomes a 290K PDF; and that's at worst a

I think you've been lucky. I've seen PDF documents range from barely over
the size of the plain text to hugely inflated files 10 or more times
larger. Yes, graphics and color often play a role in this. ;-)

> 2.5:1 expansion over the pure text. HTML requires overhead as well.
> Ditto the "very large amount of space"--in fact, the stories within a
> typical C&I, separated out into HTML pieces, take up almost exactly as much
> space as the single PDF, thanks to HTML overhead.

What do you mean by "HTML overhead"?

> As far as "cable modems" are concerned, somehow I don't have any trouble
> loading ordinary PDFs on a dial-up modem...

I've had modem transfers time out and cancel the download, even for
10-page document files. Again, I think you've been lucky.

> I'm sure there are cases where PDF is enormously larger (for example, if
> you're comparing a scanned page in PDF to the same page rekeyed as text),
> but for cases where the PDF is being created directly from a document, and
> there are no bit-mapped graphics, I'd love to see evidence that even a 10:1
> ratio is at all common.

You're right -- this would make an interesting and useful study! Some
folks have looked at this, but I didn't find much. Perhaps someone else
has seen a study on this?
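
For anyone who wants to collect numbers for such a comparison, here is a
minimal sketch of the arithmetic in Python. The function names and the
file-based helper are illustrative assumptions, not an existing tool; the
only figures used are the ones quoted in this thread (about 120K of text,
a 290K PDF).

```python
import os


def expansion_ratio(text_bytes: int, pdf_bytes: int) -> float:
    """Return the PDF size divided by the plain-text size."""
    if text_bytes <= 0:
        raise ValueError("text size must be positive")
    return pdf_bytes / text_bytes


def ratio_for_files(text_path: str, pdf_path: str) -> float:
    """Hypothetical helper: compare two files already saved on disk."""
    return expansion_ratio(os.path.getsize(text_path), os.path.getsize(pdf_path))


# Figures quoted above: about 120K of pure text, 290K as PDF.
ratio = expansion_ratio(120 * 1024, 290 * 1024)
print(f"{ratio:.2f}:1")  # prints 2.42:1
```

Running the same calculation over a batch of documents, with and without
graphics, would show how common a 10:1 ratio really is.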

> I note one aspect that's not being mentioned: PDF allows typographic
> integrity, which HTML does not, particularly because HTML depends on font
> availability at the user's computer.

There are times when this is *particularly* important: for example,
poetry, music, documents in a mix of languages, files with intricate
layouts, or embedded graphics tied to the text in some fashion. It's a
good point to raise, and an argument for providing a variety of ways to
access the information, or else for deciding that the information is so
dependent on layout and font that it is not relevant to anyone outside
the target population (and defining that population clearly).

Patricia Anderson, pfa at umich.edu


