Stats O' the Day

Thomas Dowling tdowling at OHIOLINK.edu
Fri Feb 7 17:25:08 EST 1997


Yes, I do have other things I'm *supposed* to be doing...

I was recently working on a project that involved looking at the source
of quite a few library home pages, and was surprised at the
inconsistency of HTML I was seeing.  So I took the opportunity to pull
down the source for about 640 homepages (North American academic,
consortia, state, and selected public libraries--no offense to anyone I
left out, that was just the group I was looking at).  I hunted down my
copies of nsgmls and the HTML 3.2 DTD, just to see what was what.

Please don't take this as an attempt to criticize anyone else's page
or to demand changes.  Even picking a standard to validate against is a
tricky proposition these days, so we can all agree that being out of
line with any such standard isn't a heinous crime.  Just forgive me
when I mumble that if we aren't serious about abiding by standard
information formats, who will be?


After removing the several pages that returned 404 Not Found errors and
other server-generated messages--which all parsed correctly, I'm glad
to say--I ended up with 624 home pages, and these figures.

  Total number of pages:            624
  Number of pages w/ no errors(*):   24 (That's 3.8%, folks)
  Number of pages w/ no errors
    when I allow BGCOLOR in tables
    and TARGET in anchors:           25
  Average / Median Number
    of Errors:                       20.4 / 13
  Pages with 3 or fewer errors:      97
  Number of pages w/ 40+ errors:     85
  Number of pages w/ 80+ errors:     14

(*) Hats off to the library web workers at Edinboro U of PA, Foothill
Coll., Indiana U, Middle Tennessee SU, Northeastern, Saginaw Valley SU,
St. Olaf Coll., Swarthmore Coll., Tulane, U of Cal - Irvine,  U of New
Mexico - Valencia, U of Northern Colorado, Allegheny U of Health
Sciences, U of Wisc - La Crosse, Well Coll., Wesleyan U, Illinet, Long
Island Lib Resources Council, PORTALS, Nat. Lib of Canada, U of
Calgary, U of Guelph, Case Western, and the Colorado School of Mines. 
No mention will be made of the home pages  containing 172, 217, and 267
errors apiece.  Unfortunately, I guess I can't say you know who you
are.
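
For anyone who wants to check the proportions behind the table above,
here is a minimal Python sketch.  The counts are copied straight from
the table; the helper name share() is only an illustration and is not
part of anything I actually ran.

```python
# Counts copied from the table above; 624 is the total number of pages parsed.
TOTAL_PAGES = 624

def share(count, total=TOTAL_PAGES):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)

counts = {
    "no errors": 24,
    "no errors, allowing BGCOLOR and TARGET": 25,
    "3 or fewer errors": 97,
    "40+ errors": 85,
    "80+ errors": 14,
}

for label, count in counts.items():
    print(f"{label}: {count} of {TOTAL_PAGES} = {share(count)}%")
```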

Obviously, what constitutes an HTML error is subject to some
interpretation, and the only clear-cut consequence is the loss of
predictability over how a browser will render the page.  Nsgmls, being
an SGML parser, does not allow you to fudge in any way, which is why I
manually recalculated the figures allowing the
not-yet-but-pretty-soon-now attributes of BGCOLOR in table elements and
TARGET in anchors; quite a few sites seem to be using those, though not
enough to greatly affect averages for the entire group.  In the
interest of full disclosure, the OhioLINK home page shows 57 errors
when parsed against the HTML 3.2 DTD, and 15 even when I allow BGCOLOR
and TARGET (due to style sheets and related attributes).

I noticed several kinds of errors repeating over and over:

Unquoted attributes.  I'm sure the exact list of allowable unquoted
characters is buried in the DTD somewhere, but the general rule of
thumb is that attribute values with anything other than letters and
numbers need to be quoted.  <font size="+1"> works, while <font size=+1>
doesn't.  <a href=http://gohere.now.edu> needs quotes, as does <p align
= center> (because it includes a space).
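
The rule of thumb above can be sketched as a small Python check.  This
is only an approximation, not what an SGML parser does: the function
name needs_quotes() is made up for illustration, and treating periods
and hyphens as safe alongside letters and numbers is an assumption
based on the usual SGML name characters.

```python
import re

# Rule of thumb from the text: letters and numbers are safe unquoted.
# Assumption: periods and hyphens are also safe (SGML name characters).
SAFE_UNQUOTED = re.compile(r"[A-Za-z0-9.-]+")

def needs_quotes(value):
    """Return True if an attribute value should be wrapped in quotes."""
    return SAFE_UNQUOTED.fullmatch(value) is None

# Values taken from the examples in the text.
for value in ["center", "+1", "http://gohere.now.edu"]:
    verdict = "needs quotes" if needs_quotes(value) else "fine unquoted"
    print(f"{value}: {verdict}")
```

Note that this only looks at the value itself; the spaces around the
equals sign in <p align = center> are a separate matter that a check
like this would not see.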

Improperly nested tags.  Some elements just aren't allowed to include
others.  <b><h1>Really Important Header</h1></b> doesn't work.
Likewise, some elements *need* to be in others: notably, all data in a
table needs to be in a table cell, and all table cells need to be in
table rows.
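
As a rough illustration of the table rule, here is a Python sketch
built on the standard library's html.parser.  It is not a validator:
the class and function names are hypothetical, it only checks that
cells sit in rows and rows sit in tables, it does not catch stray text
outside a cell, and it can be fooled by nested tables.

```python
from html.parser import HTMLParser

# Required ancestor for each table element, per the rule in the text.
REQUIRED_ANCESTOR = {"td": "tr", "th": "tr", "tr": "table"}

class TableNestingChecker(HTMLParser):
    """Collects a message for each cell or row found outside its container."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # tags seen so far that have not been closed
        self.problems = []

    def handle_starttag(self, tag, attrs):
        needed = REQUIRED_ANCESTOR.get(tag)
        if needed and needed not in self.open_tags:
            self.problems.append(f"<{tag}> outside <{needed}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating omitted end tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

def check_table_nesting(html):
    checker = TableNestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems

print(check_table_nesting("<table><td>No row here</td></table>"))
```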

Comments.  Comments are not HTML tags like other elements in a
document.  They are comments within SGML declarations: the declaration
has to start with "<!" then the comment has to open with "--"; the
comment then closes with "--" and the declaration closes with ">".  So
<!--This is a working comment--> but <!--This isn't>.
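
To spot the most common form of this mistake, a comment that opens
with "<!--" but never reaches a closing "-->", something like the
following Python sketch would do.  It is deliberately simplified and
does not implement the full SGML comment rules; the function name is
invented for the example.

```python
def find_unclosed_comments(html):
    """Return the offsets of '<!--' openers that never reach a closing '-->'."""
    offsets = []
    pos = 0
    while True:
        start = html.find("<!--", pos)
        if start == -1:
            break
        end = html.find("-->", start + 4)
        if end == -1:
            # No close anywhere after this opener: the rest of the
            # document would be swallowed by the comment.
            offsets.append(start)
            break
        pos = end + 3
    return offsets

print(find_unclosed_comments("<!--This is a working comment-->"))
print(find_unclosed_comments("<p>x</p><!--This isn't>"))
```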

Bizarro tags from whacked-out HTML editors and/or sloppy typists. 
<l1> instead of <li>, <bold> instead of <b>, <X-SAS-WINDOW> instead
of...I have no idea what.

Netscapisms, and to a lesser extent Internet Explorerisms.  No
religious wars here, but if you're going to use <NOBR> and <TEXTFLOW>,
or <MARQUEE>, you just need to do so in the understanding that their
use isn't standard [yet].  If you're going to use <BLINK>...

It's probably fair to ask if these errors are really important enough
to fix.  I can only offer the cautionary tale of how many pages on the
Web suddenly disappeared when Netscape 2.0b came out with corrected
comment parsing.  Many sloppily created comments never closed, in the
new version's eyes, and so the entire rest of the document was part of
the comment and therefore not displayed.  I'd say HTML has become
harder to write correctly since then, and it's harder to predict how
today's (or tomorrow's) browsers might handle unexpected, incorrect
HTML.  So, validate, validate, validate; I use WebTechs
(http://www.webtechs.com/html-val-svc/) and a couple of command-line
parsers.  If you validate, at least you'll know where you stand with
regard to standard HTML.

Thomas Dowling
Ohio Library and Information Network
tdowling at ohiolink.edu


