*implementing* search engines for a campus

Nick Arnett narnett at Verity.COM
Wed Nov 22 15:03:33 EST 1995


At 10:16 AM 11/22/95, JQ Johnson wrote:
>Since we spend less money
>accomplishing things than some other sites (cf another thread on this
>list) we're particularly interested in freeware.  That suggests using
>harvest (or perhaps lycos?), but our experience with harvest has been
>that it is a rather complex system to set up and administer.

I'd be curious to hear more, on the list or offline, about how you'd like
to see this work from an admin standpoint.  We see a great need for this --
most of our early customers were hungry for the spider that we offer.  Many
organizations have multiple servers and would like to create a unified
index.  Here are some of the admin issues:

* How do you define the range of servers that will be indexed?  For a small
number, it seems reasonable to expect an administrator to enter the root
URLs.  However, larger organizations don't want to have to keep track of
every server that may come on-line.  Would a hostname-based system work?
The trouble with that is coping with multiple domains, creating wildcards
that work internationally, and so on.

* Is a decentralized "push" model acceptable?  This would mean that each
Web server owner would be responsible for setting up an indexing
application on the server, something like a Harvest Gatherer, which would
send documents and/or indexes to the search server(s).  Each server owner
would determine which documents to include and when to do updates.

* How critical is elimination of duplicate documents?  Symbolic links,
network mounted drives, CGIs and just plain old multiple copies of
documents can result in duplicate entries in the index.  It's not easy to
identify and remove them.

* Is a completely centralized "pull" model acceptable, in which a spider
runs on the search server(s), doing updates over the wire?  In this
scenario, the search administrator makes all of the decisions: which
documents to index, how often to update, etc.

* What would be reasonable per-server and site license prices?  We have
envisioned a possible scenario in which you'd pay one price for the search
engine plus a certain amount for each server that you index.  What if the
search engine could use the existing freeware Harvest Gatherers; would you
use them and what would you then be willing to pay for the search engine?
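
To make two of the questions above concrete (hostname-based scoping and
duplicate elimination), here is a minimal Python sketch. It is only an
illustration under stated assumptions, not a description of any product or of
Harvest: the patterns in HOST_PATTERNS and the names in_scope, fingerprint and
dedupe are all hypothetical. It assumes shell-style wildcards are enough to
describe the servers in scope, and it treats two documents as duplicates only
when their content is identical after whitespace normalization.

```python
import hashlib
from fnmatch import fnmatch
from urllib.parse import urlsplit

# Hypothetical scope list; these patterns are illustrative only.
HOST_PATTERNS = ["*.example.edu", "www.example.org"]


def in_scope(url, patterns=HOST_PATTERNS):
    """Return True if the URL's hostname matches any wildcard pattern."""
    host = (urlsplit(url).hostname or "").lower()
    return any(fnmatch(host, p.lower()) for p in patterns)


def fingerprint(body):
    """Hash whitespace-normalized bytes so identical copies collide."""
    normalized = b" ".join(body.split())
    return hashlib.sha256(normalized).hexdigest()


def dedupe(docs):
    """Keep the first URL seen for each distinct content fingerprint.

    docs is an iterable of (url, body_bytes) pairs.
    """
    seen = {}
    for url, body in docs:
        seen.setdefault(fingerprint(body), url)
    return sorted(seen.values())
```

A content hash like this catches copies reached through symbolic links,
network mounts or plain duplication, but not CGI output that varies from
request to request, and not near-duplicates. A single wildcard such as
"*.example.edu" also covers only one domain, which is exactly the
multiple-domain problem raised above: every additional domain needs its own
pattern.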

Nick
(not announcing any product plans here!)




More information about the Web4lib mailing list