*implementing* search engines for a campus

Prentiss Riddle riddle at is.rice.edu
Wed Dec 20 10:08:40 EST 1995


| Date: Wed, 22 Nov 1995 10:20:00 -0800
| From: JQ Johnson <jqj at darkwing.uoregon.edu>
| To: Multiple recipients of list <web4lib at library.berkeley.edu>
| Subject: *implementing* search engines for a campus
| 
| The various threads discussing search engines on this list have focused
| mostly on the public web-wide search tools.  I'd like to pursue a
| slightly different thread:  what has your experience been implementing 
| local search engines/indices for web pages on your campus?
| 
| Many sites have such indices, often implemented using wais, glimpse, or
| swish if the server platform happens to be Unix.  In our case we have a
| distributed "site" with a growing number of servers, and it would be
| very desirable to have easy-to-manage tools to provide indices of web
| pages that spanned several web servers.  Since the size of the database
| is comparatively small (< 100K pages), ease of use probably dominates
| expressive power in searches.  Since we spend less money
| accomplishing things than some other sites (cf. another thread on this
| list), we're particularly interested in freeware.  That suggests using
| harvest (or perhaps lycos?), but our experience with harvest has been
| that it is a rather complex system to set up and administer.

To chime in late to this thread:

We haven't done it yet, but our current plan is to use Harvest.  We too
have a distributed site with a growing number of servers, and
single-server solutions won't work for us.  Harvest seems to have been
designed from the ground up with distributed heterogeneous sites in
mind, so knock on wood it will turn out to be the right choice.

A bigger problem for us is access control and the prevention of
"leakage".  We have many documents which are not for export beyond the
boundaries of our campus, and my users have let me know that they are
not willing to tolerate a search engine which leaks even titles or
excerpted lines in violation of access control rules.  Through what
seems to me to be poor foresight on the part of protocol designers,
protocols like HTTP, Gopher, FTP, and WAIS (you name it) have no
mechanism for a search robot, proxy server, or other gateway to say,
"treat me like an unprivileged user at some off-campus site".  Even
single-server search engines which theoretically could pay attention to
access control rules in httpd configuration files fail to do so.  So
every WWW search engine I have seen would happily build an index that
mingles restricted and unrestricted documents and serve it out to the
world.
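
(For what it's worth, "paying attention to access control rules" needn't be
hard.  Here is a rough sketch, in Python, of how an indexer could apply
NCSA-httpd-style "allow from" / "deny from" directives before letting a page
into a public index.  The rule format and helper names are mine, purely for
illustration; no existing engine that I know of actually does this.)

    # Illustrative sketch only: how an indexer could, in principle, honor
    # NCSA-httpd-style "allow from" / "deny from" directives before adding
    # a page to a publicly searchable index.  Helper names are hypothetical.

    def parse_limit(lines):
        """Collect allow/deny host patterns from a <Limit GET> block."""
        rules = {"allow": [], "deny": []}
        for line in lines:
            parts = line.split()
            if len(parts) >= 3 and parts[0].lower() in rules:
                rules[parts[0].lower()].extend(parts[2:])  # skip the word "from"
        return rules

    def visible_off_campus(client_host, rules):
        """Would an off-campus client be allowed to fetch this document?"""
        denied = any(p == "all" or client_host.endswith(p) for p in rules["deny"])
        allowed = any(p == "all" or client_host.endswith(p) for p in rules["allow"])
        return allowed or not denied   # "order deny,allow" semantics

    rules = parse_limit(["deny from all", "allow from .rice.edu"])
    print(visible_off_campus("spider.lycos.com", rules))   # False: keep it out of the index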

The solution we plan to try is to set up our Harvest engine on a
nominally "off-campus" machine: physically housed on campus, but with
an IP address that our servers treat as off-campus.  The index it
builds will therefore include only those documents which we are sure
are permitted to go "beyond the hedges," as they say at Rice.
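
(Put another way, the filtering happens at the individual servers rather
than in the index.  A crude sketch of the effect, again in Python and again
purely hypothetical in its details: whatever the off-campus indexing host
cannot fetch simply never enters the index.)

    # Sketch of the "off-campus indexer" idea.  Assumes the campus servers
    # already refuse restricted documents (e.g. with 403 Forbidden) to
    # addresses outside campus; the URL list is made up for illustration.

    import urllib.request
    import urllib.error

    def gather(urls):
        """Fetch pages as seen from the indexing host's off-campus address."""
        index = []
        for url in urls:
            try:
                with urllib.request.urlopen(url) as resp:
                    index.append((url, resp.read()))
            except urllib.error.HTTPError as err:
                if err.code in (401, 403):
                    continue    # restricted document: it never reaches the index
                raise
        return index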

-- Prentiss Riddle ("aprendiz de todo, maestro de nada") riddle at rice.edu
-- RiceInfo Administrator, Rice University / http://is.rice.edu/~riddle
-- Home office: 2002-A Guadalupe St. #285, Austin, TX 78705 / 512-323-0708

