Comparison Study for Search Engines

Peter W. He peterhe at ibm.net
Thu Nov 30 22:46:15 EST 1995


At 01:37 PM 11/20/95 -0800, you wrote:

>The most basic kind of search is Boolean -- test for the presence and
>absence of certain words ("and" and "or" operators).  Most search engines
>are built on an "inverted index" of all of the words in the indexed
>documents (like a concordance) and apply Boolean operators to them.
>
>To address the problem of documents that are missed because they use
>related words, rather than the query terms, there are technologies that
>broaden searches automatically -- thesauruses, stemming, sound-alike and
>statistics, for example.  Thesauruses look for synonyms, stemming looks for
>the roots of words, statistical systems expand by also searching for
>co-occurring words.  These tend to introduce a lot of irrelevant documents,
>however.  (Building a dictionary/thesaurus into a "semantic network" can
>offset the errors introduced, but it's a lot of work.)
>
>Proximity searching helps narrow things -- looking for words near one
>another.  The simplest kind of proximity is to keep track of phrases,
>sentences and paragraphs.  A bit more powerful is a fuzzier proximity
>operator that considers documents more relevant if the search terms are
>closer together.
>

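For anyone who has not seen one, the inverted index and Boolean operators described above might look roughly like this in Python. This is only a sketch: the three sample documents and the names build_index, search_and and search_or are invented for illustration and do not come from any particular search engine.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search_and(index, *terms):
    """Documents that contain every term (Boolean AND)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """Documents that contain at least one term (Boolean OR)."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Annuity rates rose this year",
    2: "The annual report covers annuities",
    3: "Search engines index the web",
}
index = build_index(docs)

print(search_and(index, "annual", "annuities"))
print(search_or(index, "annuity", "annuities"))
```

The index is built once, so each query only has to intersect or union a few sets of document ids rather than rescan the text.
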
What's the difference between using stemming and using a wildcard?  I
usually find stemming confusing and troublesome.  Every variation I want
can be handled well with a wildcard.  For example, stemming "annuity" will
annoyingly bring up "annual", while "annuit*" will only generate "annuity"
and "annuities".
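
To make the contrast concrete, here is a small Python sketch. I do not know what rules the stemmer in any given engine uses, so toy_stem and its suffix list are assumptions chosen only to reproduce the over-matching described above; a real stemmer may behave differently. The wildcard side uses fnmatchcase from Python's standard fnmatch module, and the vocabulary is made up.

```python
from fnmatch import fnmatchcase

# Assumed suffix list for a deliberately crude stemmer (illustration only).
SUFFIXES = ("ities", "ity", "als", "al", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_matches(query, vocabulary):
    """Words whose toy stem equals the toy stem of the query."""
    target = toy_stem(query)
    return sorted(w for w in vocabulary if toy_stem(w) == target)


def wildcard_matches(pattern, vocabulary):
    """Words matching a shell-style pattern such as 'annuit*'."""
    return sorted(w for w in vocabulary if fnmatchcase(w.lower(), pattern.lower()))


# Hypothetical index vocabulary.
vocabulary = ["annuity", "annuities", "annual", "annually"]

print(stem_matches("annuity", vocabulary))      # stemming pulls in "annual"
print(wildcard_matches("annuit*", vocabulary))  # the wildcard does not
```

With this toy stemmer, "annuity", "annuities" and "annual" all reduce to "annu", which is exactly the kind of conflation I find troublesome; the wildcard only matches words that literally begin with "annuit".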

Nick gave a good outline for a good and POWERFUL search tool.  I am hoping
for a bit more: not only proximity searching, but also specifying how many
words may come between the two specified words; zoning, so that word hits
can be limited to the title, first paragraph or last paragraph; and, on top
of that, proximity searching within the title or first paragraph.  Is there
any search software that can do those things?
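
To show what I mean, here is a rough Python sketch of the two features combined: a "within N words" test, restricted to a chosen zone of the document. Everything in it is an assumption made for illustration: the names within, zones and zone_within, the rule that paragraphs are separated by blank lines, and the sample document. It is not how any existing product works, as far as I know.

```python
import re


def tokenize(text):
    """Split text into lowercased words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def within(words, a, b, max_gap):
    """True if a and b occur with at most max_gap words between them."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(
        abs(i - j) - 1 <= max_gap for i in pos_a for j in pos_b if i != j
    )


def zones(doc):
    """Split a document into title, first paragraph and last paragraph."""
    paragraphs = [p for p in doc["body"].split("\n\n") if p.strip()]
    return {
        "title": doc["title"],
        "first": paragraphs[0] if paragraphs else "",
        "last": paragraphs[-1] if paragraphs else "",
    }


def zone_within(doc, zone, a, b, max_gap):
    """Proximity test limited to one zone: 'title', 'first' or 'last'."""
    return within(tokenize(zones(doc)[zone]), a.lower(), b.lower(), max_gap)


# Hypothetical document; paragraphs are separated by a blank line.
doc = {
    "title": "Annuity rates and retirement planning",
    "body": "Fixed annuity products pay a guaranteed rate.\n\n"
            "Variable products depend on the market.\n\n"
            "Ask an adviser before buying any annuity.",
}

print(zone_within(doc, "title", "annuity", "retirement", 2))
print(zone_within(doc, "first", "annuity", "rate", 3))
```

Here max_gap counts the words strictly between the two terms, so adjacent words have a gap of zero.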

Peter

--------------
Peter He
Associate Systems/Info. Specialist
IBM
peterhe at ibm.net
914-684-3636


