Web Search Engines "Made Simple"--HotBot/Infoseek reply
Bob Duncan
duncanr at lafvax.lafayette.edu
Tue Nov 11 13:28:45 EST 1997
My parents still don't get this whole librarian thing, but once I inform
them I started a thread (which someone actually wanted stifled!!!) I'm sure
they'll be proud...
I received responses from Infoseek and HotBot regarding search performance
with the date rape queries.
As suggested by many on the list, "date" is a stop word in HotBot, and is
therefore discarded. However, pages containing "date" still make their way
to the top of the results list because "If a stopword is included in an
'exact phrase' search, it [the engine] will wildcard that word finding the
correct sequence of words entered but allowing any word to replace the
stopword."
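To make the quoted behavior concrete, here is a minimal Python sketch of an "exact phrase" match in which a stop word acts as a wildcard. It is only an illustration of the description above, not HotBot's actual code; the function name, the whitespace tokenization, and the one-word stop list are all assumptions.

```python
# Hypothetical stop list for illustration; HotBot's real list is dynamic and unpublished.
STOPWORDS = {"date"}

def phrase_matches(phrase, text, stopwords=STOPWORDS):
    """Return True if text contains the phrase, letting any word replace a stop word."""
    query = phrase.lower().split()
    words = text.lower().split()
    if not query:
        return False
    # Slide a window the length of the phrase across the text.
    for start in range(len(words) - len(query) + 1):
        window = words[start:start + len(query)]
        # A stop word in the query matches whatever word occupies its position.
        if all(q in stopwords or q == w for q, w in zip(query, window)):
            return True
    return False
```

Under these assumptions, phrase_matches("date rape", "the rape of the lock") is true, because "date" has been wildcarded and "the rape" fits the pattern. That is the kind of irrelevant hit the phrase search was supposed to prevent.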
HotBot says:
>Stopwords are excluded from searches to avoid slowing down the system and
>returning irrelevant matches. There is no static list of stopwords; the
>list is constantly updated as the frequency of words appearing in webpages
>changes with every crawl HotBot does. In the meantime, the HotBot
>engineering team is researching ways to allow stopwords to be included in
>queries without compromising our system security. Remember, the more
>precise you are in specifying your search term, the more relevant your set
>of results will be.
IMHO the task of being "more precise" becomes difficult when terms
conveying content become stop words. Equating "of" and "the" with terms
like "web," "HTML," "cgi," "text," and "computer" (other terms on HotBot's
dynamic list of stop words) seems unsound.
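HotBot did not say how its list is built, but a frequency-driven stop list might look something like the sketch below. The function name, the document-frequency measure, and the 50% cutoff are my assumptions, chosen only to show how a content-bearing word can end up on the list simply by being common.

```python
from collections import Counter

def dynamic_stopwords(pages, max_fraction=0.5):
    """Return the words that appear in more than max_fraction of the pages."""
    doc_freq = Counter()
    for page in pages:
        # Count each word once per page (document frequency, not raw count).
        doc_freq.update(set(page.lower().split()))
    threshold = max_fraction * len(pages)
    return {word for word, count in doc_freq.items() if count > threshold}
```

Run over a crawl in which most pages mention "web" or "HTML", a rule like this would drop those words just as readily as "of" and "the", since it measures only how often a word occurs and not whether it conveys content.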
I'm still not sure exactly what's going on at Infoseek; I'm guessing a
similar "wildcarding" was at work with the phrase query. Their response to
my question was scary:
>You need to run the query this way:
>
> date +rape
>
> the + before date is not needed.
Uh...according to your help page, and the way the engine usually performs,
if I want to *require* the word it is! (And the search results are
different depending on which words the + is applied to.) When the "support"
folks don't get it (my question to them; what the + should imply; the
difference between a term being required or optional; how their engine
operates; etc.), then we're in deep doo-doo.
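For anyone unsure why the + matters, here is a small Python sketch of required versus optional terms as I understand them from the help page. It is not Infoseek's implementation; the function names are hypothetical, and the rule that optional terms only influence ranking (which is not modeled here) is an assumption.

```python
def parse_query(query):
    """Split a query into (required, optional) term lists based on a leading +."""
    required, optional = [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        else:
            optional.append(token)
    return required, optional

def matches(query, text):
    """Return True if the text satisfies the query under the assumed + semantics."""
    required, optional = parse_query(query)
    words = set(text.lower().split())
    if required:
        # Every +term must be present; optional terms would only affect ranking.
        return all(term in words for term in required)
    # With no required terms, any one optional term is enough.
    return any(term in words for term in optional)
```

With these rules, "date +rape" matches a page that never mentions "date", while "+date +rape" does not, so the two queries are not interchangeable.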
I agree with various list folk that we can't be too picky considering the
resources in question are effectively free. I agree that Web search engines
could use some improvement. I understand that the Web is different from a
DIALOG database. (But I would also argue that when we pay for a DIALOG
search, more of our money's going towards the quality of info available,
not a more capable search program.)
My main concern is that the engines, for the benefit of Web users (savvy
and not-so), could be a tad more up front with how results are arrived at
without revealing state secrets. As Byron Mayes implied in his post on the
subject, if an engine is "intelligent" enough to ignore a term, isn't it
capable of printing a few lines of text which explain that the term occurs
too frequently and was discarded? (AltaVista does it, but I notice they've
placed this little piece of enlightenment at the *bottom* of the results
page. (Didn't it use to be up top?)) More importantly, if a
tips/options/help page is provided, how much would it take to add a few
words about exceptions to the rule? Referring a user to a "why searches
fail" page is of no use when the search results page carries no indication
that the results are indeed flawed.
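The "few lines of text" would not take much. The Python sketch below shows one way a results page could report discarded terms; the function name and the wording of the message are hypothetical, and the stop list is passed in because I do not know how any engine stores its own.

```python
def discarded_terms_notice(query, stopwords):
    """Return a one-line notice naming the query terms that were ignored, or ''."""
    dropped = [word for word in query.lower().split() if word in stopwords]
    if not dropped:
        return ""
    return ("Note: the following terms occur too frequently and were ignored: "
            + ", ".join(dropped))
```

Printed at the top of the results list, a line like this would tell the user that "date" was thrown out before he or she starts reading the hits.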
HotBot's presentation of search results for the "exact phrase" 'date rape'
shows:
"Returned: 141448 matches.
Breakdown: date: 11193928, rape: 144093 "
What indication is there to the user that his or her "exact phrase" query
was ignored? In fact, when I see those numbers, my natural assumption is
that *both* terms *were* used to achieve the final result; why else are
they letting me know? A little explanation could go a long way.
Bob Duncan
~'~'~'~'~'~'~'~'~'~'~'~'~'~'~'~'~
Robert E. Duncan
Reference/Instruction Librarian
David Bishop Skillman Library
Lafayette College
Easton, PA 18042
duncanr at lafayette.edu
http://www.lafayette.edu/faculty/duncanr/