Alexa, archiving the web

Robert J. Tiess u1019306 at warwick.net
Mon Jan 11 18:22:37 EST 1999


JQ Johnson wrote:
> Ancillary services include a rating service and an archive of deleted
> pages (so if you get a 404 not found Alexa can give you a copy of the page
> as it appeared last time Kahle's company archived it).

I believe web archiving also raises an array of interesting questions,
such as whether the 404 is there for good reason.  Perhaps it was a
medical site that contained inaccurate information and the page was
removed on purpose.  Or perhaps it was a news page which contained
some amount of misinformation and was deleted to prevent any further
problems.  Or perhaps it was an author's work that was contracted to
appear for only so long and then was to be removed.  Or maybe it was
removed by the ISP due to content/policy violations, or for copyright,
security, personal, or legal purposes....  All aspects worth
considering.  By its very nature, the 404 also tends to imply "old
data," although it may result from many other causes (e.g. a bad URL,
a renamed file, a moved directory).  In any case I would never
recommend using a cached web page as a definitive source of
information.  It is a convenient service, based in good intention, but
old data, and potentially bad data, may be the greater inconvenience.

An interesting quote from Alexa:  "The average Web page has a life of
approximately 44 days." (http://www.alexa.com/whatisalexa/faq.html)
I am not sure I agree with this figure, or how accurate it is, but it
is a potentially sobering indicator of how volatile web information is
and of how info-caches may rise in importance in the future.  I do not
take such a short-term view of web development, preferring instead to
emphasize sound site structure before building begins.  Some long-term
planning beforehand could eliminate many 404s, since many of those are
merely pages relocated to other areas of a server or to another
domain.  I would also appeal to webmasters to leave forwarding URLs
wherever major pages are moved; this would alleviate some of the
frustration users experience when searching for information,
especially in a time-intensive environment such as a library.
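
On an Apache server, for instance, that forwarding URL can be a
permanent redirect.  A minimal sketch, assuming mod_alias is enabled;
the paths and hostname here are hypothetical:

    # In httpd.conf or an .htaccess file: answer requests for the
    # old location with an HTTP 301 "Moved Permanently" pointing at
    # the new one, so browsers and robots follow along automatically.
    Redirect permanent /old/page.html http://www.example.org/new/page.html

A server-side redirect like this is preferable to a "this page has
moved" placeholder, since it works even for visitors arriving from
stale bookmarks or search engine listings.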

Alexa indicates it observes the Standard for Robot Exclusion (SRE),
but how many sites actually have a robots.txt file in place, and how
many webmasters know or care enough to create one?  Alexa has
guidelines on this for webmasters:
http://www.alexa.com/support/for_webmasters.html
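
For webmasters who do wish to keep their pages out of the archive, the
SRE requires only a plain robots.txt file at the top level of the
server.  A minimal sketch; the user-agent name below is an assumption
on my part, so check Alexa's guidelines for the name its crawler
actually reports:

    # robots.txt -- must live at the root of the site, e.g.
    # http://www.example.org/robots.txt (hypothetical host).
    # "ia_archiver" is assumed here to be Alexa's crawler name.
    User-agent: ia_archiver
    Disallow: /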

If you would like to have your site archived, you can go to
http://www.alexa.com/support/get_archive.html
It's actually a good idea, a form of insurance, but not the equivalent
of a full remote backup.  For that you can try Freedrive
(http://www.freedrive.com), which is free.

Robert

rjtiess at warwick.net
http://members.tripod.com/~rtiess



