SSI (was CSS includes)

JQ Johnson jqj at darkwing.uoregon.edu
Wed Sep 29 19:18:30 EDT 1999


Roy Tennant notes that "I don't give my SSI text files the '.html'
filename extension, since doing so would cause them to be indexed as web
pages."

The indexing of "dynamic" web pages such as those generated by SSI is
something of a can of worms, so I'd like to get some ideas about issues
from this list.

I've noticed that some Apache web servers generate different HTTP headers
when they serve SSI-created files.  Specifically, they don't generate a
Last-Modified: line.  That's not unreasonable; an SSI-processed web page
really didn't exist until the moment it was served, even if it was based
on a .shtml document that was very similar in content.
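
For illustration, here is a minimal sketch of how a spider might check
whether a response carries a date stamp at all.  It is written in Python,
and the function name last_modified is a made-up helper for this example,
not part of Apache or of any particular spider.  It takes the response
headers as a mapping of names to values, and matches the header name
case-insensitively.

```python
from email.utils import parsedate_to_datetime


def last_modified(headers):
    """Return the Last-Modified time as a datetime, or None.

    `headers` is any mapping of header names to values.  None is
    returned when the header is missing or its value cannot be parsed.
    """
    for name, value in headers.items():
        if name.lower() == "last-modified":
            try:
                return parsedate_to_datetime(value)
            except (TypeError, ValueError):
                # Header present but not a valid HTTP date.
                return None
    # No Last-Modified header at all, as with some SSI-processed pages.
    return None
```

The header object returned by urllib.request.urlopen(url).headers should
work as the argument here, since it offers the same items() interface.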

So my question is:  how do various spiders handle such non-date-stamped
pages?  My impression is that many ignore them, just as they ignore
URLs that end in .cgi or .shtml or .asp.

What is the most reasonable behavior here?  I can see arguments for both
sides.  In particular, for sites that make heavy use of small-scale
dynamism (e.g. automatic "last updated" stamps at the bottoms of pages,
implemented using SSI or ASP), I assume we'd want to treat SSI-generated
pages as "real".  On the other hand, it's probably not appropriate to
crawl and index all of the dynamically generated pages of an OPAC, even if
they are accessible via menus (rather than just via a search form).  More
generally, what is the RIGHT heuristic for spiders to use as they crawl
the web and want to prune their search?
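
To make the trade-off concrete, here is one possible pruning policy
sketched in Python.  It is only an illustration under assumptions of my
own, not a claim about what any real spider does.  The suffix lists, the
category names, and the functions classify and should_index are all
hypothetical.  In particular, putting .asp among the scripts is an
arbitrary choice, since ASP is also used for small-scale dynamism.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed suffix lists; a real spider would want these configurable.
SCRIPT_SUFFIXES = {".cgi", ".asp"}
SSI_SUFFIXES = {".shtml"}


def classify(url):
    """Sort a URL into 'query', 'script', 'ssi', or 'static'."""
    parts = urlsplit(url)
    if parts.query:
        # Looks like the result of a search form.
        return "query"
    ext = posixpath.splitext(parts.path)[1].lower()
    if ext in SCRIPT_SUFFIXES:
        return "script"
    if ext in SSI_SUFFIXES:
        return "ssi"
    return "static"


def should_index(url, skip=("query", "script")):
    """Return True unless the URL falls in a category listed in `skip`."""
    return classify(url) not in skip
```

With the default skip list, an .shtml page is treated as "real" and
indexed, while .cgi URLs and URLs with query strings are pruned.  Passing
skip=("query", "script", "ssi") gives the stricter behavior of ignoring
SSI pages too.  Note that a URL-based test like this cannot recognize
OPAC pages that are reached via menus and carry no query string.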

JQ Johnson                      Office: 115F Knight Library
Academic Education Coordinator  mailto:jqj at darkwing.uoregon.edu
1299 University of Oregon       phone: 1-541-346-1746; -3485 fax
Eugene, OR  97403-1299          http://darkwing.uoregon.edu/~jqj/


