Comparison Study for Search Engines
Peter W. He
peterhe at ibm.net
Thu Nov 30 22:46:15 EST 1995
At 01:37 PM 11/20/95 -0800, you wrote:
>The most basic kind of search is Boolean -- test for the presence and
>absence of certain words ("and" and "or" operators). Most search engines
>are built on an "inverted index" of all of the words in the indexed
>documents (like a concordance) and apply Boolean operators to them.
>
>To address the problem of documents that are missed because they use
>related words, rather than the query terms, there are technologies that
>broaden searches automatically -- thesauruses, stemming, sound-alike and
>statistics, for example. Thesauruses look for synonyms, stemming looks for
>the roots of words, statistical systems expand by also searching for
>co-occurring words. These tend to introduce a lot of irrelevant documents,
>however. (Building a dictionary/thesaurus into a "semantic network" can
>offset the errors introduced, but it's a lot of work.)
>
>Proximity searching helps narrow things -- looking for words near one
>another. The simplest kind of proximity is to keep track of phrases,
>sentences and paragraphs. A bit more powerful is a fuzzier proximity
>operator that considers documents more relevant if the search terms are
>closer together.
>
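
To make the quoted description of Boolean search over an inverted index concrete, here is a minimal sketch in Python. The function names (build_index, search_and, search_or) and the sample documents are made up for illustration; this is not how any particular search engine implements it.

```python
from collections import defaultdict
import re

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def search_and(index, *words):
    """Documents containing every one of the given words."""
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

def search_or(index, *words):
    """Documents containing at least one of the given words."""
    return set().union(*(index.get(w, set()) for w in words))

# Hypothetical sample documents, keyed by document id.
docs = {
    1: "annuity rates rise",
    2: "annual report on rates",
    3: "bond yields",
}
index = build_index(docs)
print(search_and(index, "annuity", "rates"))
print(search_or(index, "annuity", "annual"))
```
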
What's the difference between stemming and wildcards? I usually find
stemming confusing and troublesome. Every variation I want can be
handled well with a wildcard. For example, stemming "annuity" will annoyingly
bring up "annual", while "annuit*" will match only "annuity" and
"annuities".
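
A rough sketch of why the two behave differently, in Python. The suffix list and crude_stem below are a deliberately crude stemmer I made up for illustration; real stemmers use more careful rules and may not conflate these particular words. The wildcard match uses the standard library's fnmatch.

```python
from fnmatch import fnmatchcase

# Hypothetical suffix list for a deliberately crude stemmer (an assumption,
# not the rule set of any real product).
SUFFIXES = ("ities", "ity", "ies", "al", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_search(query, vocabulary):
    """Return every word that shares the query's crude stem."""
    target = crude_stem(query)
    return [w for w in vocabulary if crude_stem(w) == target]

def wildcard_search(pattern, vocabulary):
    """Return every word matching a shell-style pattern such as 'annuit*'."""
    return [w for w in vocabulary if fnmatchcase(w, pattern)]

vocabulary = ["annuity", "annuities", "annual", "annuals", "bond"]
print(stem_search("annuity", vocabulary))      # also picks up "annual"
print(wildcard_search("annuit*", vocabulary))  # only the "annuit..." words
```

The stemmer reduces "annuity", "annuities" and "annual" to the same root, so all three come back; the wildcard only matches words that literally begin with "annuit".
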
Nick gave a good outline for a good and POWERFUL search tool. I am hoping
for a bit more: not only proximity, but specifying how many words may lie
between the two given words; zoning, so that hits for certain words are
limited to the title, first paragraph or last paragraph; and on top of that,
proximity within the title or first paragraph. Is there any search software
that can do those things?
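
To show what I mean, here is a small Python sketch of a word-distance proximity test combined with zones. Everything in it is an assumption for illustration: the names (within, zones, zone_proximity), the rule that the first line is the title, and the rule that blank lines separate paragraphs. It is not taken from any existing search product.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def within(tokens, first, second, max_between):
    """True if the two words occur, in either order, with at most
    max_between other words between them."""
    pos_a = [i for i, t in enumerate(tokens) if t == first]
    pos_b = [i for i, t in enumerate(tokens) if t == second]
    return any(0 < abs(a - b) <= max_between + 1 for a in pos_a for b in pos_b)

def zones(document):
    """Split a document into zones. Assumes the first line is the title
    and that blank lines separate paragraphs."""
    title, _, body = document.partition("\n")
    paragraphs = [p for p in re.split(r"\n\s*\n", body) if p.strip()]
    return {
        "title": tokenize(title),
        "first_paragraph": tokenize(paragraphs[0]) if paragraphs else [],
        "last_paragraph": tokenize(paragraphs[-1]) if paragraphs else [],
    }

def zone_proximity(document, zone, first, second, max_between):
    """Apply the proximity test inside one named zone only."""
    return within(zones(document)[zone], first, second, max_between)

# Hypothetical sample document.
doc = (
    "Fixed Annuity Rates\n\n"
    "An annuity pays a fixed rate each year.\n\n"
    "Ask your agent about rates before you buy an annuity."
)
print(zone_proximity(doc, "title", "annuity", "rates", 0))
print(zone_proximity(doc, "first_paragraph", "annuity", "rate", 3))
print(zone_proximity(doc, "first_paragraph", "annuity", "rate", 2))
```
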
Peter
--------------
Peter He
Associate Systems/Info. Specialist
IBM
peterhe at ibm.net
914-684-3636