Comparison Study for Search Engines

elisabeth roche ace at Opus1.COM
Mon Nov 20 17:46:50 EST 1995


Thought you would be interested to know that when I did my Church Music
search in Infoseek yesterday, the advertisement that appeared as a clickable
logo at the top of the page was tied to my search terms: it was a music
store selling on the net.

The future is now!! Don't forget Glimpse either. 

BTW, the advertisements on Netscape and other large companies' sites on the
WWW have a cost per thousand comparable to package insert programs in
Direct Marketing, $25 to $65 per thousand. 

Elisabeth Roche ace at opus1.com


At 01:28 PM 11/20/95 -0800, Nick Arnett wrote:
>At 2:32 PM 11/16/95, John C. Matylonek wrote:
>>I know this is a recent thread, but could someone recap the list
>>of references, studies, urls, that describe the searching
>>algorithms, and features of the various web search engines?
>
>Hard question to answer in a few words... but I'll take a stab at it.
>
>The most basic kind of search is Boolean -- test for the presence or
>absence of certain words ("and", "or" and "not" operators).  Most search engines
>are built on an "inverted index" of all of the words in the indexed
>documents (like a concordance) and apply Boolean operators to them.
>
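
To make that concrete, here is a minimal Python sketch of an inverted
index with Boolean AND/OR applied to it (the documents and function
names are invented for illustration; no real engine works exactly this
way):

    # Minimal sketch: map each word to the set of documents containing it,
    # then apply Boolean AND/OR to those sets.
    docs = {
        1: "church music and hymn collections",
        2: "sheet music store on the net",
        3: "church architecture and history",
    }

    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)

    def boolean_and(*terms):
        sets = [index.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    def boolean_or(*terms):
        hits = set()
        for t in terms:
            hits |= index.get(t, set())
        return hits

    print(boolean_and("church", "music"))   # {1}
    print(boolean_or("church", "music"))    # {1, 2, 3}
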
>To address the problem of documents that are missed because they use
>related words, rather than the query terms, there are technologies that
>broaden searches automatically -- thesauruses, stemming, sound-alike matching,
>and statistics, for example.  Thesauruses look for synonyms, stemming looks for
>the roots of words, statistical systems expand by also searching for
>co-occurring words.  These tend to introduce a lot of irrelevant documents,
>however.  (Building a dictionary/thesaurus into a "semantic network" can
>offset the errors introduced, but it's a lot of work.)
>
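
A toy sketch of that kind of broadening, assuming a hand-made thesaurus
and a very crude suffix-stripping stemmer (neither resembles any real
product's dictionaries, and a real system would stem the index as well
as the query):

    # Toy query expansion: add synonyms from a small thesaurus and collapse
    # words to crude stems.  Both tables are invented for illustration.
    thesaurus = {"music": {"song", "hymn"}, "church": {"chapel", "cathedral"}}

    def stem(word):
        # Very crude stemmer: strip a few common English suffixes.
        for suffix in ("ing", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def expand(query_terms):
        expanded = set()
        for term in query_terms:
            expanded.add(stem(term))
            for synonym in thesaurus.get(term, set()):
                expanded.add(stem(synonym))
        return expanded

    print(expand(["church", "music"]))
    # {'hymn', 'song', 'chapel', 'music', 'cathedral', 'church'} (order varies)
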
>Proximity searching helps narrow things -- looking for words near one
>another.  The simplest kind of proximity is to keep track of phrases,
>sentences and paragraphs.  A bit more powerful is a fuzzier proximity
>operator that considers documents more relevant if the search terms are
>closer together.
>
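
One way such a fuzzy proximity operator might be scored (purely a
sketch, not how any particular engine does it) is by the smallest gap
between the two terms' positions in the document:

    # Fuzzy proximity sketch: the smaller the minimum distance between the
    # two terms' positions, the more relevant the document.
    def min_gap(words, term_a, term_b):
        pos_a = [i for i, w in enumerate(words) if w == term_a]
        pos_b = [i for i, w in enumerate(words) if w == term_b]
        if not pos_a or not pos_b:
            return None  # one of the terms is missing entirely
        return min(abs(a - b) for a in pos_a for b in pos_b)

    doc = "the church choir sang early music from the church library".split()
    print(min_gap(doc, "church", "music"))  # 3: closest pair is 3 positions apart

A document could then be ranked by the inverse of that gap, so pages
with the terms in the same phrase float to the top.
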
>Natural language parsers, which aren't in any of the major commercial
>products yet, as far as I know, take phrase searching to the next step.
>They identify the phrases in the query that should be found as phrases in
>the searched documents.  For example, "intellectual property rights" is a
>noun phrase, so a natural language system would give higher weight to
>documents containing those words as a phrase, as opposed to the typical
>free text parser, which would just look for all three words, anywhere in
>the documents.
>
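
A hedged sketch of that phrase weighting, skipping the parsing itself
(the phrase is simply handed in as a string, and the weights are
arbitrary placeholders):

    # Give extra weight to documents containing the query as an exact phrase,
    # on top of credit for containing all the individual words anywhere.
    def phrase_score(doc_text, phrase):
        words = doc_text.lower().split()
        terms = phrase.lower().split()
        score = 0.0
        if all(t in words for t in terms):
            score += 1.0              # all terms present somewhere
        if phrase.lower() in doc_text.lower():
            score += 2.0              # exact phrase gets a strong boost
        return score

    query = "intellectual property rights"
    print(phrase_score("a guide to intellectual property rights", query))   # 3.0
    print(phrase_score("property rights and intellectual freedom", query))  # 1.0
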
>Finally, relevancy ranking of the found documents is the real key to
>getting results without having to refine your search repeatedly (unless
>you're only searching a small number of documents, of course).  Simple
>relevancy ranking typically is based on counting the number of occurrences
>of the search terms in the documents.  A refinement is to look at the
>density -- occurrences relative to the document length.  Really
>sophisticated relevancy ranking allows you to weight various kinds of
>evidence -- Booleans, density, proximity, thesaurus, etc.; all of the
>things I've described, perhaps more, in order to decide if a document is
>relevant.
>
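
A sketch of the simple end of that spectrum: raw occurrence counts
refined into density (count divided by document length).  The sample
documents are invented, and the scoring is not any engine's actual
formula:

    # Simple relevancy ranking: count query-term occurrences, then divide by
    # document length so short, focused documents are not swamped by long ones.
    def rank(docs, query_terms):
        scored = []
        for doc_id, text in docs.items():
            words = text.lower().split()
            count = sum(words.count(t) for t in query_terms)
            density = count / len(words) if words else 0.0
            scored.append((density, count, doc_id))
        return sorted(scored, reverse=True)

    docs = {
        1: "church music church music hymns and more church music",
        2: "a long page that mentions church music once " + "filler " * 40,
    }
    for density, count, doc_id in rank(docs, ["church", "music"]):
        print(doc_id, count, round(density, 3))  # doc 1 wins on density
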
>When you get into that level of sophistication, queries can become quite
>complex, which means that to be most useful, the search engine should have
>a means of storing, re-using, managing and pre-calculating the results of
>useful queries.  If such a system is structured according to the meanings
>of the queries, allowing information domains to be described, it is usually
>called a knowledgebase.
>
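
One hypothetical way to picture such a stored-query layer (the
QueryStore class and the stand-in search function are invented for
illustration):

    # Named queries whose results can be pre-calculated once and re-used.
    class QueryStore:
        def __init__(self, search_fn):
            self.search_fn = search_fn  # any function: query string -> results
            self.saved = {}             # name -> query string
            self.cache = {}             # name -> pre-calculated results

        def save(self, name, query):
            self.saved[name] = query

        def run(self, name, refresh=False):
            if refresh or name not in self.cache:
                self.cache[name] = self.search_fn(self.saved[name])
            return self.cache[name]

    store = QueryStore(lambda q: "results for: " + q)
    store.save("church-music", "church AND (music OR hymns)")
    print(store.run("church-music"))  # later runs come from the cache
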
>Whew... that's the nickel description.  Not a bad exercise for me, I guess.
>
>>What I'm particularly interested in is how companies utilize the
>>knowledge of searching algorithms to maximize the chances that
>>their sites come up high on the list.
>
>I'm not sure that many web sites are quite that sophisticated yet.  I've
>been quite intrigued by the idea, though.  Advertising-based sites
>especially will want to know which combinations of words and phrases are
>likely to trigger retrievals and to move documents to higher relevancy
>rankings.  Of course, what works for one search engine might not work for
>another, so there's probably no single strategy that will work for all.
>
>As sites become more sophisticated about search engines, it'll be incumbent
>on the search services to increase the sophistication of their search
>algorithms.  Of all of the large services on the net today, I think
>InfoSeek is the most sophisticated and the hardest to "spoof."  Their
>advantage will probably become more and more obvious as time goes by.
>However, there are a number of large competitors coming along and the
>business will be competitive.  Those who depend on engines that only use
>Boolean and density will be left behind, I'm certain.
>
>As people learn more and more about searching (just as they learned about
>typography to use desktop publishing), they'll be frustrated with the
>simple engines.  And I'll have people with whom I can commiserate!
>
>Nick


