SGML for Web Pages

Keith Engwall engwall at uthscsa.edu
Tue Dec 19 09:46:38 EST 1995


>At 20:53 95/12/19, Heinrich C. Kuhn wrote:
>>And we will have to rely on ink on stable paper
>>for a *very* long time yet. The only sensible
>>solution, that I can think about for longterm
>>archiving "electronic" documents is having a
>>paper printout of all the documents in a small
>>number (e.g. one or two on every continent)
>>of archiving libraries.
>
>You can't print dynamic documents and you can't sensibly try and print
>documents built into a hypertext matrix and hope to make sense of them.

I agree, to a point.  Electronic documents that are designed to be dynamic
or part of a whole organism (a hypertextual manuscript) cannot be captured
properly on paper.  For instance, what do you capture?  The code?  That is
difficult to interpret and would have to be hand-entered before it could be
seen as it should be.  The output?  At what time?  We now have electronic
documents that change as we watch them... we would have to choose an
arbitrary moment in time to represent this document for archival purposes.
That is not satisfactory.

>I would suggest that the future of the preservation of paper is in
>digitisation not in paper or microform (remember that?) copies.

True for some things, not for all.  We should not pigeonhole ourselves
into one medium for everything.  It is good to have redundancy across
media.  In particular, electronic media are so technology dependent... all
it takes is a good blackout and you lose everything not stored
magnetically, and don't even have access to that.  Visible media only take
light to read.  True, storage becomes an issue, but there will always be
(or should, anyway) certain manuscripts that are valuable enough to merit
higher storage demands.

>>I admit that it is not too improbable, that interpreting
>>GIF- and JPEG-encoded graphics will be a problem after
>>a quarter of a century,
>
>Not so.  Formats with open standards will have no problem with
>interpretation.  It's proprietary standards which are not public which might
>die.

Again, I agree... to a point.  Formats with obsolete standards (open or
proprietary) may easily be lost.  This is far less likely with open
standards, but it is not outside of the realm of possibility (or even
likelihood).  Proprietary standards are, of course, in much graver peril.
Already, word processing formats of less than a decade ago are unreadable
on many of today's word processors without acquiring special format
conversion software.  It cannot be too long before certain obsolete formats
get dropped from the list.

>
>>but I see no such problems with
>>HTML, as HTML is basically ASCII and ASCII will be readable
>>for a long time to come probably.
>
>HTML just uses _readable_ ascii.  Just because a binary object might use
>ASCII which does not map directly to printable characters makes no
>difference in its ability to be interpreted.
>
This is probably the most important point.  Just because we cannot read
barcodes in grocery stores does not mean that the information is not there.
Similarly, just because non-ASCII encoding is not readable without
interpretation software does not mean it cannot be read.  Even image files
can be scanned for text by Optical Character Recognition software.  So
long as the data is not corrupt and the proper interpreter is used, data
can be translated from any format to one that is readable, editable, etc.
ASCII is really no different (it's just that the interpretation occurs for
us automatically).
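
To make the point about interpretation concrete, here is a minimal Python
sketch.  The sample bytes and the function names are assumptions chosen for
illustration, not anything from the thread.  It shows the same stored bytes
read three ways, with ASCII being just one interpreter among several.

```python
# Hypothetical sample data: four bytes, chosen only for illustration.
raw = bytes([0x53, 0x47, 0x4D, 0x4C])


def as_ascii(data: bytes) -> str:
    """Interpret the bytes with the ASCII table (the 'automatic' reading)."""
    return data.decode("ascii")


def as_hex(data: bytes) -> str:
    """Show the same bytes as space-separated hexadecimal pairs."""
    return data.hex(" ")


def as_integers(data: bytes) -> list[int]:
    """Show the same bytes as plain integers, with no character table."""
    return list(data)


if __name__ == "__main__":
    print(as_ascii(raw))
    print(as_hex(raw))
    print(as_integers(raw))
```

The stored data never changes between the three calls; only the interpreter
applied to it does.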

As for HTML, it was never meant to be a formatting standard for indexing or
archiving text.  Its purpose is to provide fluid and dynamic navigation of
the internet (its internal navigation is rather primitive).  SGML and HTML
are vastly different in goal and scope.  SGML attempts to break down a
textual manuscript and index it by parts (chapters, paragraphs, footnotes,
etc.), and it is only the first step in full-text augmentation.  Even
so, SGML is not necessarily the best (and certainly not the only) choice
in electronic document formatting.  It is a very good and much needed
start, but we are still deep within the shakeout period for this new
technology.
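
As a rough illustration of indexing a manuscript by its parts, here is a
small Python sketch.  It assumes a well-formed XML fragment as a stand-in
for SGML (the standard library has no general SGML parser), and the element
names (manuscript, chapter, para, footnote) and sample text are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical structured manuscript; the tag names are assumptions.
SAMPLE = """
<manuscript>
  <chapter id="c1">
    <title>Paper</title>
    <para>Ink on stable paper lasts.<footnote>Given good storage.</footnote></para>
    <para>Storage is the cost.</para>
  </chapter>
  <chapter id="c2">
    <title>Bits</title>
    <para>Formats need interpreters.</para>
  </chapter>
</manuscript>
"""


def index_by_part(markup: str) -> Counter:
    """Count how many of each structural part the document contains."""
    root = ET.fromstring(markup)
    return Counter(element.tag for element in root.iter())


def paragraphs_in(markup: str, chapter_id: str) -> list[str]:
    """Return the leading text of each paragraph in one chapter."""
    root = ET.fromstring(markup)
    chapter = root.find(f"./chapter[@id='{chapter_id}']")
    if chapter is None:
        return []
    return [para.text or "" for para in chapter.findall("para")]


if __name__ == "__main__":
    print(index_by_part(SAMPLE))
    print(paragraphs_in(SAMPLE, "c2"))
```

Because the markup names the parts rather than their appearance, a single
chapter or every footnote can be retrieved without reading the whole text.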

Let's hope that we do not let things in this area go the way that
electronic cataloging did with the use of the MARC record for on-line
cataloging (another format that is performing a function for which it was
never intended).

Keith

---------------------------------------------------------------------
Keith Engwall          Just one thing before we talk about computers:
Systems Librarian      PCs are just great... PCs are infuriating
Briscoe Library        Macs are just great... Macs are infuriating
UTHSCSA                Neither one will help you scratch the middle
engwall at uthscsa.edu    of your back worth a darn.  Ok, go on...
---------------------------------------------------------------------



