Search Engine History -> nostalgia reigns; history may suffer
Richard Wiggins
rich at richardwiggins.com
Tue Dec 18 22:12:44 EST 2001
Google's revival of the Usenet backlog seems to have inspired a number of old-timers to search the corpus for historical artifacts -- or just to see what they said back in 19xx... It is, for once, a positive story of digital preservation.
For instance, Brad Templeton has gone after the question of how "spam" took on its online meaning of mass commercial mailing: http://www.templetons.com/brad/spamterm.html
Here's a cautionary tale about using quick Google Usenet searches to establish "history."
One of the issues that arose very quickly with nascent search engines was overly aggressive crawling. If you follow the search engine history links already posted, you'll see that some of the early crawlers were pretty dumb about how often they revisited a site, and that could melt your server, performance-wise.
There were also people who VEHEMENTLY objected to being indexed in the first place. They felt that, even though they were publishing openly on the Net (Gopherspace, FTP-space, or Webspace), they shouldn't be indexed for random discovery by random people.
When such a discussion popped up on the Gopher newsgroup in 1993, I proposed that we ought to have a mechanism for publishers to tell spiders to lay off. See:
http://groups.google.com/groups?q=g:thl4020516427d&hl=en&selm=259itv%24106i%40msuinfo.cl.msu.edu
<quote>
In the immediate term perhaps we could devise a simple way for
you to "just say no"? How about if a file exists in the root
of the Gopher tree, under a name like "no-index"? Veronica and
cousins would look for that file first; if it's there, stop
immediately.
</quote>
I posted that on 1993-08-22. So far as I know, that was the first proposal for letting Internet content providers give signals to crawlers to stop at a certain level of the document hierarchy. The Veronica folks soon implemented the concept.
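Just to make the mechanism concrete for anyone who never wandered Gopherspace: a Veronica-style crawler honoring the convention only had to ask a server for a root-level "no-index" selector before descending any further. Here's a little Python sketch of that check -- purely illustrative; the function name is mine, and the assumption that a missing selector comes back as a type-'3' error line is a simplification, not a description of Veronica's actual code:

    import socket

    def server_says_no_index(host, port=70, selector="no-index", timeout=10):
        # Ask a Gopher server whether a root-level "no-index" sentinel
        # exists; if it does, a polite crawler stops right here.
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                # A Gopher request is just the selector string plus CRLF.
                sock.sendall(selector.encode("ascii") + b"\r\n")
                reply = sock.recv(1024)
        except OSError:
            return False  # unreachable server -- nothing to crawl anyway
        # Many Gopher servers answer an unknown selector with an error
        # item whose line starts with '3'; anything else suggests the
        # sentinel file really is there, so the crawler should back off.
        return not reply.startswith(b"3")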
The Web folks promulgated the same concept, under the rubric of the "Robot Exclusion Standard." The earliest reference to it that I can find in Google Groups is dated 10 Aug 1994.
The Robot Exclusion Standard for the Web was conceived somewhere in the 1993..1994 time frame. But the discussion of that standard didn't take place, so far as I know, primarily on Usenet. It took place on a mailing list. See: http://www.robotstxt.org/wc/norobots.html
<quote>
This document represents a consensus on 30 June 1994 on the robots mailing list (robots-request at nexor.co.uk) [Note the Robots mailing list has relocated to WebCrawler. See the Robots pages at WebCrawler for details], between the majority of robot authors and other people with an interest in robots. It has also been open for discussion on the Technical World Wide Web mailing list (www-talk at info.cern.ch). This document is based on a previous working draft under the same title.
</quote>
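For list members who haven't looked at the document itself: the consensus it records boils down to a single plain-text file named /robots.txt at the top of a Web server, listing which crawlers may fetch which paths. A minimal example (the paths here are invented for illustration) looks like this:

    # /robots.txt -- 1994-style Robot Exclusion Standard
    # "*" means the record applies to every robot
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/

Conceptually it's the same "just say no" sentinel as the Gopher proposal above, with per-robot and per-path granularity added.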
The bottom line is that anyone writing a history of the topic who does a rote search for "robot exclusion standard" in Google Groups isn't going to find the earlier history that happened to use a different nomenclature, and isn't going to find relevant discussion that didn't take place on Usenet. History requires more sophisticated sleuthing.
Good reference librarians, and good historians, will understand these considerations intuitively.
/rich
PS -- I fear this stroll down memory lane has little to do with how libraries use Web tools; my apologies to the list, as I hope to avoid the Wrath of Tennant.
Richard Wiggins
Writing, Speaking, and Consulting on Internet Topics
rich at richardwiggins.com www.richardwiggins.com