Automating what's new pages

Dave Lewis drlewi1 at srv.PacBell.COM
Wed Mar 13 12:13:59 EST 1996


I use the Harvest system (http://harvest.cs.colorado.edu) to do a
variety of things across my company's internal web.  While I don't
currently generate an automated "what's new" page, it would be
reasonably easy to do so.

The Harvest architecture consists of "gatherers" and "brokers."  A
gatherer roams (walks) a web space, bounded by the number of hosts
and leaves to visit, patterns that limit the hosts visited, etc., and
collects a database of resource description objects (sets of name/value
attributes in "Summary Object Interchange Format," or SOIF).  The
gatherer offers this collection to any broker that would like to use
it, without regard for how it will be used.  The meta-information about
each resource includes the time of discovery (initial or change), the
URL, a typical set of meta attributes (author, ...), the set of URLs
referenced from within the resource, the set of images referenced, etc.
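
As a rough illustration of what one such resource description carries,
here is a minimal Python sketch.  It is not the SOIF syntax itself and
not Harvest code; the class name, field names, and example values are
all assumptions made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ResourceDescription:
    """One gathered object: a flat set of attributes about a single URL.

    Hypothetical structure; field names are illustrative, not SOIF names.
    """
    url: str
    title: str
    author: str
    last_changed: datetime                          # discovery or latest change
    links: set[str] = field(default_factory=set)    # URLs referenced by the page
    images: set[str] = field(default_factory=set)   # images referenced by the page


# Example record with made-up values.
example = ResourceDescription(
    url="http://intranet.example.com/projects/index.html",
    title="Projects",
    author="A. Author",
    last_changed=datetime(1996, 3, 12, 9, 30),
    links={"http://intranet.example.com/projects/alpha.html"},
)
```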

A broker pulls object descriptions from one or more gatherers,
indexes them (for faster retrieval), and then offers query
services against the collected set.  Fielded search against the
resource/object descriptions is possible.  I believe (though I
haven't verified) that I can therefore ask a broker for all pages that
appeared or changed after a given date.  One could call the broker
with such a query from a cron job (run no more often than the
gather/collect process) and build a "what's new" page from the results.
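
The cron-job idea might look something like the following Python
sketch.  It assumes the broker's answer has already been fetched into a
list of records; the actual Harvest query call is not shown, and the
record layout, function names, and sample data are hypothetical.

```python
import html
from datetime import datetime

# Hypothetical records, standing in for what a broker query might return.
RECORDS = [
    {"url": "http://intranet.example.com/a.html", "title": "Page A",
     "last_changed": datetime(1996, 3, 12)},
    {"url": "http://intranet.example.com/b.html", "title": "Page B",
     "last_changed": datetime(1996, 2, 1)},
    {"url": "http://intranet.example.com/c.html", "title": "Page C",
     "last_changed": datetime(1996, 3, 10)},
]


def changed_since(records, cutoff):
    """Return records that appeared or changed after cutoff, newest first."""
    recent = [r for r in records if r["last_changed"] > cutoff]
    return sorted(recent, key=lambda r: r["last_changed"], reverse=True)


def render_whats_new(records):
    """Render a bare HTML list; a real page would need surrounding markup."""
    items = [
        '<li><a href="{0}">{1}</a> ({2:%Y-%m-%d})</li>'.format(
            html.escape(r["url"], quote=True),
            html.escape(r["title"]),
            r["last_changed"],
        )
        for r in records
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    # A cron job could run this and write the output to the what's new page.
    print(render_whats_new(changed_since(RECORDS, datetime(1996, 3, 1))))
```

The cutoff would normally be the time of the previous run, so that each
run lists only what changed since the last page was built.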

Netscape has just introduced a "Catalog Server" which has exactly
this feature, and which uses a commercialized version of this 
technology.  (They hired one of the PIs at Colorado to come on board
and build it.)

In general I question the utility of a simple implementation of this,
for the following reasons.  If a "published resource" consists of
multiple pages, it will appear multiple times in the what's new
publication.  There needs to be a way to limit inclusion to the title
or "home" page of the published resource.  Also, many items are not
worth including.  For instance, many sites on our internal web publish
daily or bi-weekly (or what have you) "statistics" pages.  I really
don't want those in "what's new."  These can be serious problems
(depending upon what utility is desired from the "what's new" page),
but they are solvable.  An additional desirable feature would be topic-
or task-specific "what's new" pages.  This is essentially the area of
customized information cultivation and delivery.  With the right search
engine under a broker (i.e., one whose search provides the required
precision and recall), one could easily run periodic queries to derive
such pages.
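
The two filtering problems above (one entry per published resource, and
no routine statistics pages) could be approached with simple URL rules,
as in this Python sketch.  The patterns and the notion of which file
names count as a "home" page are assumptions for illustration; a real
site would need hand-tuned rules.

```python
import re
from urllib.parse import urlsplit

# Hypothetical rules; each site would have to tune these by hand.
EXCLUDE_PATTERNS = [re.compile(r"/stats?/"), re.compile(r"statistics", re.I)]
HOME_PAGE_NAMES = {"", "index.html", "home.html"}


def is_home_page(url):
    """Treat a URL as a home page if its last path segment is an index name."""
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    return last_segment.lower() in HOME_PAGE_NAMES


def is_excluded(url):
    """True if the URL matches any pattern for pages not worth listing."""
    return any(p.search(url) for p in EXCLUDE_PATTERNS)


def worth_listing(urls):
    """Keep only home pages that are not on the exclusion list."""
    return [u for u in urls if is_home_page(u) and not is_excluded(u)]


URLS = [
    "http://intranet.example.com/projects/",
    "http://intranet.example.com/projects/alpha.html",
    "http://intranet.example.com/stats/index.html",
    "http://intranet.example.com/hr/index.html",
]
```

With the sample URLs above, worth_listing(URLS) keeps only the
/projects/ and /hr/index.html entries.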

Many other neat things are possible using this architecture.  I'm
planning, for instance, to provide a service that takes a URL
and returns the set of URLs that point to it.  This answers the
question, "what pages point to mine?" and may be used to facilitate
"notify referrers of change or deletion."
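
Since each resource description already lists the URLs it references,
the "what points to mine" lookup amounts to inverting that mapping.  A
minimal Python sketch, with made-up URLs and hypothetical function
names, might be:

```python
from collections import defaultdict


def build_backlinks(outlinks):
    """Invert {page: URLs it references} into {target: pages referring to it}."""
    backlinks = defaultdict(set)
    for page, targets in outlinks.items():
        for target in targets:
            backlinks[target].add(page)
    return backlinks


def pages_pointing_to(url, backlinks):
    """Return the sorted list of pages that reference url (empty if none)."""
    return sorted(backlinks.get(url, set()))


# Made-up link data, standing in for the gathered "URLs referenced" sets.
OUTLINKS = {
    "http://intranet.example.com/a.html": {"http://intranet.example.com/c.html"},
    "http://intranet.example.com/b.html": {
        "http://intranet.example.com/c.html",
        "http://intranet.example.com/a.html",
    },
}

BACKLINKS = build_backlinks(OUTLINKS)
```

The index would be rebuilt after each gather run, so it is only as
current as the most recent crawl.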

Dave Lewis
drlewi1 at pacbell.com


