*implementing* search engines for a campus

Prentiss Riddle riddle at is.rice.edu
Wed Dec 20 12:27:49 EST 1995


> Date: Wed, 20 Dec 1995 08:54:33 -0800
> From: Michael Alan Dorman <mdorman at caldmed.med.miami.edu>
> Subject: Re: *implementing* search engines for a campus
> 
> On Wed, 20 Dec 1995, Prentiss Riddle wrote:
> > A bigger problem for us is access control and the prevention of
> > "leakage".  We have many documents which are not for export beyond the
> > boundaries of our campus, and my users have let me know that they are
> > not willing to tolerate a search engine which leaks even titles or
> > excerpted lines in violation of access control rules.
> 
> Anyway, it would seem to me that Harvest might provide an easier solution. 
> Something like this: 
> 
> Each web server runs a gatherer. 
> 
> Each gatherer only indexes information appropriate for outside
> consumption.  Presumably this is putting responsibility for choosing what 
> to index in the hands of the people who are making that determination.

Thanks for your suggestion, but the last sentence is not a valid
assumption.  We allow our users to create personal web pages, including
their own .htaccess files (on machines where we run NCSA httpd).  The
administrator of a particular server may have no idea what sort of
access control restrictions his users have imposed on their personal
pages.  Access control at that level is a "page maintainer" issue, not
a "server maintainer" issue.

One *might* be able to hack something together which would scan the
tree for .htaccess files (and also look for access control rules in
the server's top-level configuration files), then convert those rules
to a format meaningful to the Harvest local gatherer.  However, such a
solution would be tied to a particular flavor of HTTP server (e.g. NCSA
httpd in the case of ".htaccess" files), and some sites on our campus
are running other flavors of servers.
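
To make the .htaccess idea concrete, here is a rough sketch (in Python)
of the sort of scanner I have in mind for an NCSA httpd document tree.
It only understands a few directives ("deny from", "allow from",
"require"), and the document root, the campus domain suffix and the
plain-list output are placeholders I made up, not anything the Harvest
gatherer actually asks for:

#!/usr/bin/env python
# Rough sketch, not a real Harvest tool: walk a document tree, find
# NCSA-style .htaccess files, and print the directories whose contents
# should be kept away from the gatherer.  The document root, the campus
# domain suffix, and the plain-list output are made-up placeholders.
#
# A typical restrictive .htaccess this script would flag looks like:
#
#     <Limit GET>
#     order deny,allow
#     deny from all
#     allow from .rice.edu
#     </Limit>

import os

DOCUMENT_ROOT = "/usr/local/etc/httpd/htdocs"   # hypothetical path
CAMPUS_SUFFIX = ".rice.edu"                     # "campus-only" host suffix

def is_restricted(htaccess_path):
    """Crude reading of NCSA rules: restricted if the file denies
    everyone, allows only campus hosts, or requires a password."""
    restricted = False
    for line in open(htaccess_path):
        line = line.strip().lower()
        if line.startswith("deny from"):
            restricted = True
        elif line.startswith("allow from"):
            hosts = line.split()[2:]
            if "all" in hosts:
                restricted = False          # "allow from all" reopens it
            elif hosts and all(h.endswith(CAMPUS_SUFFIX) for h in hosts):
                restricted = True
        elif line.startswith("require"):
            restricted = True               # password-protected area
    return restricted

def restricted_dirs(root):
    """Yield every directory under root guarded by a restrictive .htaccess."""
    for dirpath, dirnames, filenames in os.walk(root):
        if ".htaccess" in filenames and \
           is_restricted(os.path.join(dirpath, ".htaccess")):
            yield dirpath
            dirnames[:] = []                # no need to descend further

if __name__ == "__main__":
    # One directory per line; a wrapper would have to translate these
    # into whatever exclusion syntax the local gatherer understands.
    for d in restricted_dirs(DOCUMENT_ROOT):
        print(d)

Translating that list into the gatherer's own exclusion mechanism is
the part I haven't worked out, and of course it still does nothing for
the non-NCSA servers on campus.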

As I say, the proper place to have addressed this issue in an open and
portable fashion would have been in the protocols themselves, but the
protocol designers seem to have overlooked the "gateway leakage"
problem.  (If anyone knows of proposals to address this problem at the
protocol level, please let me know.)

> [The broker on the primary server] presents a union of the data
> indexed by each of the [server-specific] gatherers.
> 
> As I understand the capabilities of Harvest, this is the most efficient
> and therefore recommended configuration. 

Yes, my plan to run Harvest remotely will sacrifice some efficiency in
the interest of avoiding "leakage".

> Further, you might be able to have multiple gatherer datasets on each
> server, one of publicly accessible material and one of restricted
> material, and you could set up another broker on a restricted server that
> could allow your internal users access to an index of your entire web. 

I'd thought of that, too.  If my world-visible broker turns out to be
efficient enough, I may supplement it with a campus-only broker, built
from "within the hedges".  But I wouldn't be surprised if it turns out
that I can only afford to run one.
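
For what it's worth, the split between the two datasets could be driven
by the same restricted-directory list as the scanner above: one pile of
URLs feeds the world-visible gatherer, the other stays within the
hedges.  Again just a sketch, with made-up file names and a naive
URL-to-path mapping rather than real Harvest configuration:

#!/usr/bin/env python
# Companion sketch for the two-dataset idea: split one list of candidate
# URLs into a world-readable set and a campus-only set, which could then
# feed two separate gatherer runs (and, downstream, two brokers).  The
# file names and the URL-to-path mapping are assumptions on my part,
# not actual Harvest configuration.

import os

DOCUMENT_ROOT = "/usr/local/etc/httpd/htdocs"    # hypothetical, as above
SERVER_PREFIX = "http://www.rice.edu/"           # hypothetical server URL

def url_to_path(url):
    """Map a URL on this server back to a filesystem path (naively)."""
    return os.path.join(DOCUMENT_ROOT, url[len(SERVER_PREFIX):])

def split_urls(candidates, restricted):
    """Return (public, campus_only) URL lists using the restricted dirs."""
    public, campus_only = [], []
    for url in candidates:
        path = url_to_path(url)
        if any(path == d or path.startswith(d + os.sep) for d in restricted):
            campus_only.append(url)
        else:
            public.append(url)
    return public, campus_only

if __name__ == "__main__":
    # "restricted.list" is the output of the .htaccess scan above;
    # "all-urls.list" is whatever enumeration of the server's URLs we trust.
    restricted = [l.strip() for l in open("restricted.list") if l.strip()]
    candidates = [l.strip() for l in open("all-urls.list") if l.strip()]
    public, campus_only = split_urls(candidates, restricted)
    open("public-urls.list", "w").write("\n".join(public) + "\n")
    open("campus-urls.list", "w").write("\n".join(campus_only) + "\n")

The two output lists would then seed the public and campus-only
gatherer runs respectively.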

-- Prentiss Riddle ("aprendiz de todo, maestro de nada") riddle at rice.edu
-- RiceInfo Administrator, Rice University / http://is.rice.edu/~riddle
-- Home office: 2002-A Guadalupe St. #285, Austin, TX 78705 / 512-323-0708

