Stats o' the Day revisited

Thomas Dowling tdowling at ohiolink.edu
Tue Apr 29 12:28:30 EDT 1997


Web4Lib--

About three months ago, I posted some statistics regarding the validation
of HTML on library home pages.  By the standard calculation, that was two
Web Years ago, so I thought I'd do it again and also touch on a couple of
other web authoring topics.

Let me try to recap some of the discussion points that arose from my
earlier post.  First and foremost, it is not a crime to write invalid HTML;
the most you can say is that invalid HTML may display in unpredicted ways
on some browsers.  Unfortunately, the browser in question may be the next
version of one your readers commonly use, so the next big upgrade might do
something nasty to your pages.  Current example: some HTML editors (notably
FrontPage) let you include numeric entities in the 128-159 range for
characters such as smart quotes or em- and en-dashes.  Those code points
are printable in the Windows character set but not in ISO Latin 1 or
Unicode, and Netscape 4.0b has taken to displaying them as question marks:

    She looked at me and said, ?Hey?Didn?t you go to the 
    University of Wisconsin?Madison??
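
One possible repair, offered as a rough sketch rather than a vetted tool:
remap numeric references in the 128-159 range through the Windows-1252 code
page to the Unicode characters the author presumably meant.  The function
name fix_cp1252_refs is hypothetical and the sample input is made up;
whether a given browser renders the rewritten references is a separate
question.

```python
import re

# Numeric references 128-159 are C1 control codes in Latin 1 and Unicode,
# but printable characters (smart quotes, dashes) in Windows-1252.
_REF = re.compile(r"&#(\d+);")


def fix_cp1252_refs(html_text):
    """Rewrite &#128;..&#159; as references to the intended Unicode characters."""
    def repl(match):
        code = int(match.group(1))
        if 128 <= code <= 159:
            try:
                char = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # A few slots (e.g. 129) are unassigned in Windows-1252.
                return match.group(0)
            return "&#%d;" % ord(char)
        return match.group(0)  # anything outside the range is left alone

    return _REF.sub(repl, html_text)


print(fix_cp1252_refs("&#147;Hey&#151;Didn&#146;t you&#148;"))
```

References outside the 128-159 range, such as &#233; for e-acute, pass
through unchanged.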

Second, using a strict SGML validator to compare documents to HTML DTDs has
a number of problems.  An important one is that HTML and URLs are written
to different specifications, so an SGML validator will flag as errors any
anchor tags whose URLs contain certain characters (specifically unescaped
ampersands, as in query strings).
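
To illustrate the ampersand problem: a URL with a query string is legal as
a URL, but an SGML parser reads each raw ampersand as the start of an
entity reference, so inside an HREF value it has to be written as &amp;.
The sketch below uses html.escape from the Python standard library; the
helper name href_attribute and the example.com URL are made up for
illustration.

```python
from html import escape


def href_attribute(url):
    """Return an HREF attribute with the URL escaped for use inside HTML."""
    # escape() turns & into &amp;; with quote=True it also escapes quote marks.
    return 'HREF="%s"' % escape(url, quote=True)


print(href_attribute("http://www.example.com/search?author=smith&title=web"))
```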

Third, there is no HTML DTD which is really current with accepted practice.
The current standard--to the extent the term "standard" applies--is HTML
3.2, which was specifically written to describe popular browser behavior as
of early- to mid-1996.  The experimental Cougar DTD was only recently
updated, for the first time in nine months; I confess I haven't had a
chance to look at it closely, although I do notice that it adds the
FRAMESET element for the first time.  Anyone not specifically validating
against the newest Cougar draft would generate errors for any Frames-based
document.

Because of shortcomings with the available DTDs, there are times when it's
perfectly defensible to write invalid HTML.  However, as with improvised
jazz and abstract art, you should really know the rules before you set out
to break them.  Also, if the rules you play by aren't written in a DTD,
they aren't formally spelled out anywhere.

Conversely, validating HTML against a DTD can only turn up syntax errors.
You could write semantic gibberish (or at least reverse the WIDTH and
HEIGHT attributes in an IMG tag) and still pass: "<p>Colorless green ideas
slept furiously.</p>" is perfectly valid.

So here's the state of validation on our home pages.  I expanded the number
of pages checked over last time, and was also able to correctly validate
against DTDs other than HTML 3.2 if specified in a DOCTYPE declaration.  A
word to the wise: if your DOCTYPE is specifying HTML 2.0 or 3.0--or 2.1,
whatever that is--you may want to check if that's really what you mean.  I
took these documents at their word.
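
For anyone wanting to check what their own pages declare, here is a minimal
sketch of pulling the public identifier out of a DOCTYPE declaration and
falling back to HTML 3.2 when none is given.  It is not the script used for
these stats; the function name declared_dtd and the fallback behavior are
assumptions modeled on the description above.

```python
import re

# Matches e.g. <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
_DOCTYPE = re.compile(r'<!DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]*)"', re.IGNORECASE)

# Assumed default: validate against HTML 3.2 when no DOCTYPE is declared.
HTML_32 = "-//W3C//DTD HTML 3.2 Final//EN"


def declared_dtd(page_source, default=HTML_32):
    """Return the public identifier a page declares, or the default if none."""
    match = _DOCTYPE.search(page_source)
    return match.group(1) if match else default


print(declared_dtd('<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"><HTML>'))
print(declared_dtd("<HTML><HEAD><TITLE>No DOCTYPE here</TITLE></HEAD></HTML>"))
```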


VALIDATION STATS

Pages checked: 1114 (Libweb's listings for U.S. and Canadian libraries as
of 4/24)
Average/Median number of errors: 24 / 13 (compared to 20 and 13 in
February)
Number of pages with zero errors: 77, or 7% (compared to 4% in February)
Number of pages with three or fewer errors: 236, or 21% (compared to 16% in
Feb)
Number of pages with 40 or more errors: 186, or 17% (14% in Feb)
Number of pages with 80 or more errors: 57, or 5% (2% in Feb)

Number of pages that specified a DOCTYPE: 243, 22%
Complete list is at
<URL:http://gold.ohiolink.edu/tdowling/libpages/doctypes.html>
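
The percentages above follow directly from the counts and the 1114-page
total.  A short worked check, with the counts copied from the figures
above and a hypothetical helper named percent:

```python
TOTAL_PAGES = 1114

# Counts as reported in the validation stats.
counts = {
    "zero errors": 77,
    "three or fewer errors": 236,
    "40 or more errors": 186,
    "80 or more errors": 57,
    "specified a DOCTYPE": 243,
}


def percent(count, total=TOTAL_PAGES):
    """Share of pages, rounded to the nearest whole percent."""
    return round(100 * count / total)


for label, count in counts.items():
    print("%-22s %4d  %2d%%" % (label, count, percent(count)))
```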


HTML EDITOR STATS

Since I was looking at people's pages, I took the opportunity to see what
HTML editors were identifying themselves in their pages.  Out of 1114
pages, I could find 164 that identified an HTML editor.  None of these
showed HoTMetaL or Hot Dog; do these programs identify themselves in the
HTML source in any way?

  Netscape Gold       72
  MS FrontPage        51
  Adobe Pagemill      16
  MS Word 97/
    Publisher 97/
    I'net Assistant    9
  Claris Home Page     5

Complete list is at
<URL:http://gold.ohiolink.edu/tdowling/libpages/generators.html>
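
One common way editors identify themselves is a META tag with
NAME="GENERATOR".  Whether that is the only signal worth checking is an
assumption on my part; the sketch below looks for that tag alone, using
Python's standard html.parser.  The names GeneratorFinder and
find_generators are hypothetical, and the sample CONTENT string is
illustrative rather than taken from any surveyed page.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collect the CONTENT of any <META NAME="GENERATOR"> tags in a page."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name == "generator" and attr.get("content"):
            self.generators.append(attr["content"])


def find_generators(page_source):
    """Return every generator string declared in the page source."""
    finder = GeneratorFinder()
    finder.feed(page_source)
    return finder.generators


sample = '<HTML><HEAD><META NAME="GENERATOR" CONTENT="Mozilla/3.01Gold (Win95; I)"></HEAD></HTML>'
print(find_generators(sample))
```

An editor that writes no such tag would be invisible to this kind of check.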


SERVER STATS

We're hearing from at least one database vendor that we should change our
Web server from NCSA to either Apache or Netscape Enterprise.  That
naturally made me curious about what other people were using:

  NCSA        25%
  Apache      22%
  Netscape    21%
  CERN         6%
  MS IIS       5%
  WebStar      4%
  WebSite      3%
  OSU          3%

The complete list is at
<URL:http://gold.ohiolink.edu/tdowling/libpages/servers.html>

Note that this is *very* different from the stats reported by Netcraft at
<URL:http://www.netcraft.co.uk/Survey/> for the net as a whole, which show
Apache in the mid-40% range and IIS next at around 15%.
  BTW, does anyone know why Netcraft no longer provides subtotals for the
.edu domain?
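
For anyone wanting to run a similar tally on their own list of sites, a
web server normally names itself in the Server response header.  The
sketch below is one way to collect and count those headers, not a
description of how the numbers above were gathered; the function names
are hypothetical, and the sample header strings in the usage lines are
invented.

```python
from collections import Counter
from urllib.request import Request, urlopen


def fetch_server_header(url, timeout=10):
    """Issue a HEAD request and return the Server response header, if any."""
    with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
        return response.headers.get("Server", "")


def server_product(server_header):
    """Reduce a Server header such as 'Apache/1.1.3' to its product name."""
    tokens = server_header.split()
    if not tokens:
        return "unknown"  # header missing or empty
    return tokens[0].split("/")[0]


def tally(server_headers):
    """Count how many headers name each server product."""
    return Counter(server_product(header) for header in server_headers)


# Usage with made-up header values (no network access needed):
print(tally(["NCSA/1.5.2", "Apache/1.1.3", "Apache/1.2b7", ""]))
```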


Thomas Dowling
OhioLINK - Ohio Library and Information Network
tdowling at ohiolink.edu


