Comparison Study for Search Engines

Nick Arnett narnett at Verity.COM
Mon Nov 20 16:12:45 EST 1995


At 2:32 PM 11/16/95, John C. Matylonek wrote:
>I know this is a recent thread, but could someone recap the list
>of references, studies, urls, that describe the searching
>algorithms, and features of the various web search engines?

Hard question to answer in a few words... but I'll take a stab at it.

The most basic kind of search is Boolean -- testing for the presence or
absence of certain words ("and", "or" and "not" operators).  Most search
engines are built on an "inverted index" of all of the words in the
indexed documents (like a concordance) and apply Boolean operators to it.
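
To make this concrete, here's a toy sketch in Python of an inverted
index with Boolean "and"/"or" (the documents are invented examples;
real indexes are far bigger and smarter):

    # Minimal inverted index: map each word to the set of documents
    # (by id) that contain it.
    docs = {
        1: "web search engines index documents",
        2: "boolean search tests for words",
        3: "documents about web typography",
    }

    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)

    # Boolean operators become set operations on the postings.
    def search_and(a, b):
        return index.get(a, set()) & index.get(b, set())

    def search_or(a, b):
        return index.get(a, set()) | index.get(b, set())

    print(search_and("web", "documents"))      # {1, 3}
    print(search_or("boolean", "typography"))  # {2, 3}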

To address the problem of documents that are missed because they use
related words rather than the query terms, there are technologies that
broaden searches automatically -- thesauruses, stemming, sound-alike
matching and statistics, for example.  Thesauruses look for synonyms,
stemming looks for the roots of words, and statistical systems expand the
query by also searching for co-occurring words.  These tend to introduce a
lot of irrelevant documents, however.  (Building a dictionary/thesaurus
into a "semantic network" can offset the errors introduced, but it's a lot
of work.)
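
Here's a toy expansion step in Python; the synonym table and suffix
list are placeholders I made up, and real systems use far larger
resources:

    SYNONYMS = {"car": ["automobile", "auto"]}
    SUFFIXES = ["ing", "ed", "s"]

    # Crude stemming: strip a few common suffixes.
    def stem(word):
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    # Expand a query term into itself, its root and the root's synonyms.
    def expand(term):
        root = stem(term)
        variants = {term, root}
        variants.update(SYNONYMS.get(root, []))
        return variants

    print(expand("cars"))  # {'cars', 'car', 'automobile', 'auto'}

Search the expanded set with "or" and the retrieval broadens -- along
with the junk, as noted above.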

Proximity searching helps narrow things -- looking for words near one
another.  The simplest kind of proximity is to keep track of phrases,
sentences and paragraphs.  A bit more powerful is a fuzzier proximity
operator that considers documents more relevant if the search terms are
closer together.
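
A fuzzy proximity score might look something like this sketch (in a
real engine the word positions would come out of the index rather than
being computed on the fly):

    # Score a document higher when two query terms appear closer
    # together.  'positions' maps each word to its word offsets.
    def proximity_score(positions, term_a, term_b):
        best = None
        for i in positions.get(term_a, []):
            for j in positions.get(term_b, []):
                gap = abs(i - j)
                if best is None or gap < best:
                    best = gap
        if best is None:
            return 0.0              # one of the terms is missing
        return 1.0 / (1.0 + best)   # smaller gap, higher score

    doc = "intellectual property rights in the digital age".split()
    positions = {}
    for offset, word in enumerate(doc):
        positions.setdefault(word, []).append(offset)

    print(proximity_score(positions, "property", "rights"))   # 0.5
    print(proximity_score(positions, "property", "digital"))  # 0.2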

Natural language parsers, which aren't in any of the major commercial
products yet, as far as I know, take phrase searching a step further.
They identify the phrases in the query that should be found as phrases in
the searched documents.  For example, "intellectual property rights" is a
noun phrase, so a natural language system would give higher weight to
documents containing those words as a phrase, as opposed to the typical
free text parser, which would just look for all three words, anywhere in
the documents.
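
In sketch form, the difference might look like this (the 2.0 and 1.0
weights are arbitrary placeholders):

    # Give extra weight when the query words appear as a contiguous
    # phrase, the way a natural language parser would prefer.
    def score(doc_words, query_words):
        n = len(query_words)
        # Phrase hit: query words in order, adjacent.
        for i in range(len(doc_words) - n + 1):
            if doc_words[i : i + n] == query_words:
                return 2.0
        # Free-text hit: all words present, anywhere.
        if all(w in doc_words for w in query_words):
            return 1.0
        return 0.0

    query = "intellectual property rights".split()
    doc_a = "a survey of intellectual property rights law".split()
    doc_b = "property rights and the intellectual climate".split()
    print(score(doc_a, query))  # 2.0 -- the phrase appears intact
    print(score(doc_b, query))  # 1.0 -- all three words, scattered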

Finally, relevancy ranking of the found documents is the real key to
getting results without having to refine your search repeatedly (unless
you're only searching a small number of documents, of course).  Simple
relevancy ranking is typically based on counting the number of occurrences
of the search terms in the documents.  A refinement is to look at
density -- occurrences relative to the document length.  Really
sophisticated relevancy ranking lets you weight various kinds of evidence
-- Boolean matches, density, proximity, thesaurus expansion and so on; all
of the things I've described, and perhaps more -- in order to decide
whether a document is relevant.
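
A density score with weighted evidence might be sketched like so (the
weights are invented; a real engine tunes many more factors):

    # Combine term density with a proximity score, each weighted.
    def density(doc_words, query_words):
        hits = sum(1 for w in doc_words if w in query_words)
        return hits / float(len(doc_words))

    def rank(doc_words, query_words, prox_score,
             w_density=0.6, w_proximity=0.4):
        return (w_density * density(doc_words, query_words)
                + w_proximity * prox_score)

    doc = "web search engines rank web documents by relevance".split()
    query = {"web", "documents"}
    print(density(doc, query))               # 3/8 = 0.375
    print(rank(doc, query, prox_score=0.5))  # 0.6*0.375 + 0.4*0.5 = 0.425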

When you get into that level of sophistication, queries can become quite
complex, which means that to be most useful, the search engine should have
a means of storing, re-using, managing and pre-calculating the results of
useful queries.  If such a system is structured according to the meanings
of the queries, allowing information domains to be described, it is usually
called a knowledgebase.
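
At its simplest, that means caching named queries and their
pre-computed results -- something like this deliberately
oversimplified sketch:

    # Stored-query table: complex queries are saved by name and their
    # results cached, so they needn't be re-run every time.
    stored_queries = {}

    def define_topic(name, query_fn):
        stored_queries[name] = {"query": query_fn, "results": None}

    def results_for(name):
        entry = stored_queries[name]
        if entry["results"] is None:   # compute once, re-use thereafter
            entry["results"] = entry["query"]()
        return entry["results"]

    # Stand-in for a real Boolean/proximity query over an index.
    define_topic("ip-law", lambda: {"doc-14", "doc-72"})
    print(results_for("ip-law"))  # runs the stored query
    print(results_for("ip-law"))  # served from the cache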

Whew... that's the nickel description.  Not a bad exercise for me, I guess.

>What I'm particularly interested in is how companies utilize the
>knowledge of searching algorithms to maximize the chances that
>their sites come up high on the list.

I'm not sure that many web sites are quite that sophisticated yet.  I've
been quite intrigued by the idea, though.  Advertising-based sites
especially will want to know which combinations of words and phrases are
likely to trigger retrievals and to move documents to higher relevancy
rankings.  Of course, what works for one search engine might not work for
another, so there's probably no single strategy that will work for all.

As sites become more sophisticated about search engines, it'll be incumbent
on the search services to increase the sophistication of their search
algorithms.  Of all of the large services on the net today, I think
InfoSeek is the most sophisticated and the hardest to "spoof."  Their
advantage will probably become more and more obvious as time goes by.
However, there are a number of large competitors coming along and the
business will be competitive.  Those who depend on engines that only use
Boolean and density will be left behind, I'm certain.

As people learn more and more about searching (just as they learned about
typography to use desktop publishing), they'll be frustrated with the
simple engines.  And I'll have people with whom I can commiserate!

Nick



