inconsistencies web search performance

Wed Nov 12 15:05:02 EST 1997

At 04:12 PM 11/10/97 -0800, Booher, Craig wrote:

>	Now we have reached the second conundrum faced by users of
>Internet search engines - relevancy ranking.  While our knowledge of the
>search engine details may be limited, we know even less about their
>relevancy ranking algorithms.  With what confidence can we state that
>(using the example provided) item #4,763 is less relevant than
>previously presented items?

I'll take issue with that on two counts.

1. Given the number of documents being searched, relevancy ranking is
insufficient to reasonably differentiate among documents for
subject-oriented searches.  It is only sufficient when the searcher knows
almost exactly what he or she is seeking and how it differs from the corpus.
This is a small percentage of searches.

2. We do not understand the "algorithms," if there are such structures, used
by *humans* for subject-oriented categorization; sufficiently advanced
relevancy ranking will be essentially unpredictable because it is based on
fuzzy logic in an effort to imitate the poorly understood human mind's
methods.  This is true of Verity's relevancy ranking for all but the
simplest queries.  There is no useful way to predict how a set of evidence
will accrue into a relevancy score.  The useful information is the
algorithms' goal, not the actual algorithms.  For example, our density
operator is a third-order algorithm that ranks the first few repetitions of
a term much higher than the later ones, while also taking into account the
document length.  The goal is to have a reasonable curve, to capture the
presumed human behavior that when a term is repeated a few times, it is
significant, but when it is repeated many more times, its significance
increases only gradually.  Thus, you can know the algorithm, but unless you
completely understand how people's use of language is revealed in term
density, the information isn't useful.
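
A minimal sketch of the kind of curve described above may help.  It is
not Verity's density operator, whose details are not given here; it is a
simple saturating function chosen only because it has the two stated
properties (early repetitions count most, and document length is taken
into account).  The name density_score and the parameters avg_length and
k are assumptions made for illustration.

```python
def density_score(term_count, doc_length, avg_length=500.0, k=1.5):
    """Illustrative saturating density score in the range [0, 1).

    Hypothetical stand-in, not Verity's algorithm.  avg_length and k
    are assumed tuning values with no basis in any real product.
    """
    if term_count <= 0 or doc_length <= 0:
        return 0.0
    # Longer documents need more repetitions to reach the same score.
    norm = k * (doc_length / avg_length)
    # Each additional repetition adds less than the one before it.
    return term_count / (term_count + norm)


if __name__ == "__main__":
    # Print the score and the gain contributed by each repetition.
    previous = 0.0
    for count in range(1, 11):
        score = density_score(count, doc_length=500)
        print(count, round(score, 3), round(score - previous, 3))
        previous = score
```

Even with the formula in hand, nothing in it says whether the third
repetition of a term *should* be worth less than half of the first;
that judgment is about language use, not arithmetic.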

Our appreciation of the usefulness of categorization in conjunction with
search has led us to defocus somewhat on improving relevancy ranking.  The
volume of documents being searched has grown beyond what relevancy ranking
alone can handle; we're focusing on returning results in the context of
categories (something we learned from librarians!), with relevancy ranking
coming into play only after the searcher has chosen the context(s) in which
to search in detail.
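
The category-first flow can be sketched roughly as follows.  This is an
illustration under assumptions, not a description of any Verity product:
the hit tuples, category names, scores, and function names are all made
up for the example.

```python
from collections import defaultdict

# Made-up sample hits: (doc_id, category, relevancy_score).
SAMPLE_HITS = [
    ("a", "Libraries", 0.90),
    ("b", "Search engines", 0.40),
    ("c", "Libraries", 0.95),
    ("d", "Libraries", 0.20),
]


def group_by_category(hits):
    """Bucket hits by category, keeping (doc_id, score) pairs."""
    groups = defaultdict(list)
    for doc_id, category, score in hits:
        groups[category].append((doc_id, score))
    return dict(groups)


def category_overview(groups):
    """First view for the searcher: categories with hit counts."""
    return sorted(
        ((name, len(docs)) for name, docs in groups.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


def ranked_within(groups, category):
    """Second view: relevancy ranking applied only inside one category."""
    return sorted(groups.get(category, []), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    groups = group_by_category(SAMPLE_HITS)
    print(category_overview(groups))
    print(ranked_within(groups, "Libraries"))
```

The relevancy score is never asked to separate thousands of documents
at once; it only orders the much smaller set inside the context the
searcher picked.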

Nick Arnett

       Product Manager, Knowledge Applications
        Verity Inc.  (http://www.verity.com/)
        "Connecting People with Information"
  Phone: (408) 542-2164  E-mail: narnett at verity.com