MOMspider on multi-hostname sites?

Prentiss Riddle riddle at is.rice.edu
Thu Dec 17 16:43:50 EST 1998


I'm getting started using Roy Fielding's terrific link checker
MOMspider (http://www.ics.uci.edu/pub/websoft/MOMspider/) and have run
into a limitation which I wonder if anyone has addressed.

Our website has acquired a number of cnames over the years, most of
which are equivalent (www.rice.edu, riceinfo.rice.edu, etc.).  We have
many links which use these variant hostnames interchangeably.

MOMspider seems to assume that a site or tree must consistently use one
and only one host:port pair.  Thus MOMspider will fail to recognize
child links where the child uses a different cname from the parent,
e.g.:

    http://cname1.blah.edu/index.html -> http://cname2.blah.edu/sub/index.html

(Note that this is a slightly simpler case than the more general
problem of using MOMspider to check a site which is spread across
multiple servers.)

My questions, in order of complexity:

(1) Does anyone see a way around this limitation using standard
    MOMspider 1.00?

(2) Has anyone already done mods to MOMspider 1.00 to get around this
    limitation?

(3) I'm thinking of trying to hack MOMspider 1.00 to handle sites like
    mine.  As a first step I'm trying to determine the minimum number
    of places where I would need to canonicalize a URL (i.e., convert
    variant cnames to a default hostname).  I could canonicalize URLs
    only during the child test, but then MOMspider would fail to detect
    already-visited links.  I could canonicalize all URLs upon first
    seeing them (probably the easiest solution), but then the hostnames
    in broken link reports wouldn't match the URLs in the HTML itself,
    possibly confusing the web maintainers who read the reports.  Can
    anyone who's already tinkered with MOMspider internals hazard a
    guess about the best way to approach this?
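
To make the trade-off in (3) concrete, here is a rough sketch of a
middle path: canonicalize only when comparing URLs (the child test and
the visited lookup) and keep the URL exactly as written in the HTML for
reporting.  This is not MOMspider code (MOMspider is Perl, and I have
not checked where its internals would need the hook); it is a
standalone Python illustration.  The alias table, the function names
and the Visited class are all hypothetical, and the hostnames are the
placeholders from the example above.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical alias table: variant hostname -> default hostname.
HOST_ALIASES = {"cname2.blah.edu": "cname1.blah.edu"}


def canonicalize(url, aliases=HOST_ALIASES):
    """Return a comparison key for url with aliased hostnames folded together."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    host = aliases.get(host, host)
    port = parts.port
    # Drop the port when it is absent or is the default for plain http.
    if port is None or (scheme == "http" and port == 80):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # The fragment is discarded: it does not name a different document.
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))


def is_child(url, top_url):
    """Child test as a prefix match on canonical forms (an assumption)."""
    return canonicalize(url).startswith(canonicalize(top_url))


class Visited:
    """Visited set keyed by canonical URL, remembering the original spelling."""

    def __init__(self):
        self._seen = {}  # canonical URL -> URL as first written in the HTML

    def add(self, url):
        """Record url; return False if an equivalent URL was already seen."""
        key = canonicalize(url)
        if key in self._seen:
            return False
        self._seen[key] = url
        return True

    def as_written(self, url):
        """Return the first-seen spelling of url, for use in reports."""
        return self._seen.get(canonicalize(url))
```

With this arrangement the link in the example above passes the child
test, a page reached under two hostnames is visited only once, and a
report can still quote the hostname that appears in the HTML.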

Thanks for any advice anybody can offer.

-- Prentiss Riddle ("aprendiz de todo, maestro de nada") riddle at rice.edu
-- Webmaster, Rice University / http://is.rice.edu/~riddle

