inconsistencies web search performance
Booher, Craig
cbooher at kcc.com
Fri Nov 14 11:18:26 EST 1997
Nick,
Thank you very much for your reply to my comments about relevancy
ranking. I was hoping some vendors would respond and figured, given
your willingness to participate in this forum, that you would be the
first.
I wholeheartedly concur with your first point - especially as
implemented with the current web search engines. Unfortunately, when
the average user hears "relevancy ranking" touted as a feature by a
search engine, they naively assume that the first "X" hits of a
retrieved set are ALWAYS the most relevant. I find myself constantly
reminding staff in our organization that relevancy ranking is an inexact
science (art?) at this stage of its development in the web arena.
I truly appreciate your second point for several reasons. First, I have
been (unsuccessfully) trying for several years to obtain information on
Verity's relevance ranking "algorithms". I usually encounter hand
waving or no response. Your description was a significant first step to
answering my questions. I realize that the exact algorithms (i.e.,
object and source code) are considered proprietary and guarded for their
"competitive advantage" (whether perceived or real). However, I still
would like to see some more disclosure in this area. For example, which
academic schools of thought were used as a basis or springboard for the
current implementation? Can we see the mathematical development which
led to the conclusion that this particular relevancy ranking approach
was valid and optimal for the search engine and knowledge domain? In
short, I'd like some theoretical basis and references to the literature
which I can use to assess whether I, as a consumer, am comfortable with
your relevancy ranking implementation.
I would disagree with you when you say that "the algorithm's goal" not
the algorithm itself is the "useful information." The goal of all these
algorithms is to present the "most relevant" information first. HOW
they determined what is "most relevant" is the useful information a
consumer needs to know in order to assess the usefulness of the tool.
(Again, I don't need to know the exact code, but I do require an
understanding of the theoretical foundation for the algorithm.)
Second, I was interested in your concluding comments wherein you
indicated Verity was recognizing the inadequacy of applying relevancy
ranking to large volumes of disparate information (at least that was my
interpretation of your remarks). Instead, you are attempting to place
search results into some sort of context and then apply relevancy
ranking within the context. Is that context explicitly determined by
the user, or somehow "calculated" by the computer? In either case, I'd
be interested in learning more about this approach (and as you can
probably surmise from my earlier remarks, can you identify any published
literature on which you are basing your determination? :-))
Again, thanks for your comments, and I hope others will contribute to
this thread.
----------
From: Nick Arnett[SMTP:narnett at verity.com]
Sent: Wednesday, November 12, 1997 3:04 PM
To: Multiple recipients of list
Subject: RE: inconsistencies web search performance
At 04:12 PM 11/10/97 -0800, Booher, Craig wrote:
> Now we have reached the second conundrum faced by users of Internet
> search engines - relevancy ranking. While our knowledge of the search
> engine details may be unacceptable, we know even less about their
> relevancy ranking algorithms. With what confidence can we state that
> (using the example provided) item #4,763 is less relevant than
> previously presented items?
I'll take issue with that on two counts.
1. Given the number of documents being searched, relevancy ranking is
insufficient to reasonably differentiate among documents for
subject-oriented searches. It is only sufficient when the searcher knows
almost exactly what he or she is seeking and how it differs from the
corpus. This is a small percentage of searches.
2. We do not understand the "algorithms," if there are such structures,
used by *humans* for subject-oriented categorization; sufficiently
advanced relevancy ranking will be essentially unpredictable because it
is based on fuzzy logic in an effort to imitate the poorly understood
human mind's methods. This is true of Verity's relevancy ranking for all
but the simplest queries. There is no useful way to predict how a set of
evidence will accrue into a relevancy score. The useful information is
the algorithms' goal, not the actual algorithms. For example, our
density operator is a third-order algorithm that ranks the first few
repetitions of a term much higher than the later ones, while also
taking into account the document length. The goal is to have a
reasonable curve, to capture the presumed human behavior that when a
term is repeated a few times, it is significant, but when it is
repeated too many times, its significance increases only gradually.
Thus, you can know the algorithm, but unless you completely understand
how people's use of language is revealed in term density, the
information isn't useful.
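The saturating curve described here can be sketched roughly. This is
not Verity's actual density operator (that remains proprietary); it is
a BM25-style approximation in which early repetitions of a term add a
lot to the score, later repetitions add progressively less, and longer
documents are discounted. The function name, parameters, and constants
are all illustrative assumptions:

```python
def density_score(term_count: int, doc_length: int,
                  k: float = 1.5, b: float = 0.75,
                  avg_doc_length: float = 500.0) -> float:
    """Illustrative saturating term-density score (BM25-style sketch).

    The first few repetitions of a term raise the score sharply;
    later repetitions add less and less (diminishing returns).
    Documents longer than the assumed average are penalized, so raw
    repetition in a long document counts for less.
    """
    if term_count <= 0 or doc_length <= 0:
        return 0.0
    # Length normalization: scale the saturation constant k by how
    # this document's length compares to an assumed average length.
    norm = k * ((1 - b) + b * doc_length / avg_doc_length)
    return term_count * (k + 1) / (term_count + norm)
```

Plotting this over term_count gives the "reasonable curve" in
question: steep at first, then flattening, with the whole curve
shifted down for longer documents.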
Our appreciation of the usefulness of categorization in conjunction
with search has led us to defocus somewhat on improving relevancy
ranking. The volume of documents being searched has grown beyond
relevancy ranking's limits; we're focusing on returning results in the
context of categories (something we learned from librarians!), with
relevancy ranking coming into play only once the searcher has chosen
the context(s) in which to search in detail.
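The two-step approach outlined above (first choose a context, then
apply relevancy ranking only within it) can be sketched in a few
lines. The toy corpus, category names, and simple count-based ranking
below are invented for illustration; the actual Verity categorization
mechanism is not described in enough detail here to reproduce:

```python
from collections import defaultdict

# Toy corpus of (title, category, text) records. The categories stand
# in for the librarian-style contexts described above; all names are
# invented for this sketch.
DOCS = [
    ("Intro to ranking", "search", "ranking ranking relevance"),
    ("Cataloging basics", "libraries", "catalog subject subject"),
    ("Query syntax", "search", "query term ranking"),
]

def categories_for(query_term):
    """Step 1: report which contexts contain the term at all."""
    counts = defaultdict(int)
    for _, category, text in DOCS:
        counts[category] += text.split().count(query_term)
    return dict(counts)

def search_in_context(query_term, category):
    """Step 2: rank documents, but only within the chosen context."""
    hits = [(title, text.split().count(query_term))
            for title, cat, text in DOCS
            if cat == category and query_term in text.split()]
    # Relevancy ranking (here, a crude term count) applies only
    # after the searcher has picked a category.
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

A searcher would first call categories_for() to see which contexts
mention the term, pick one, and only then receive a relevancy-ordered
list scoped to that context.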
Nick Arnett
Product Manager, Knowledge Applications
Verity Inc. (http://www.verity.com/)
"Connecting People with Information"
Phone: (408) 542-2164 E-mail: narnett at verity.com
Sincerely,
Craig S. Booher
Technical Information Coordinator
Kimberly-Clark Corporation
P.O. Box 999
Neenah, WI 54956-0999
telephone: 920/721-5219
fax: 920/721-8471