inconsistencies web search performance
Booher, Craig
cbooher at kcc.com
Fri Nov 14 11:18:26 EST 1997
Nick,
Thank you very much for your reply to my comments about relevancy
ranking. I was hoping some vendors would respond and figured, given
your willingness to participate in this forum, that you would be the
first.
I wholeheartedly concur with your first point - especially as
implemented with the current web search engines. Unfortunately, when
the average user hears "relevancy ranking" touted as a feature by a
search engine, they naively assume that the first "X" hits of a
retrieved set are ALWAYS the most relevant. I find myself constantly
reminding staff in our organization that relevancy ranking is an inexact
science (art?) at this stage of its development in the web arena.
I truly appreciate your second point for several reasons. First, I have
been (unsuccessfully) trying for several years to obtain information on
Verity's relevance ranking "algorithms". I usually encounter hand
waving or no response. Your description was a significant first step to
answering my questions. I realize that the exact algorithms (i.e.,
object and source code) are considered proprietary and guarded for their
"competitive advantage" (whether perceived or real). However, I still
would like to see some more disclosure in this area. For example, which
academic schools of thought were used as a basis or springboard for the
current implementation? Can we see the mathematical development which
led to the conclusion that this particular relevancy ranking approach
was valid and optimal for the search engine and knowledge domain? In
short, I'd like some theoretical basis and references to the literature
which I can use to assess whether I, as a consumer, am comfortable with
your relevancy ranking implementation.
I would disagree with you when you say that "the algorithm's goal" not
the algorithm itself is the "useful information." The goal of all these
algorithms is to present the "most relevant" information first. HOW
they determined what is "most relevant" is the useful information a
consumer needs to know in order to assess the usefulness of the tool.
(Again, I don't need to know the exact code, but I do require an
understanding of the theoretical foundation for the algorithm.)
Second, I was interested in your concluding comments wherein you
indicated Verity was recognizing the inadequacy of applying relevancy
ranking to large volumes of disparate information (at least that was my
interpretation of your remarks). Instead, you are attempting to place
search results into some sort of context and then apply relevancy
ranking within the context. Is that context explicitly determined by
the user, or somehow "calculated" by the computer? In either case, I'd
be interested in learning more about this approach (and as you can
probably surmise from my earlier remarks, can you identify any published
literature on which you are basing your determination? :-))
Again, thanks for your comments, and I hope others will contribute to
this thread.
----------
From: Nick Arnett[SMTP:narnett at verity.com]
Sent: Wednesday, November 12, 1997 3:04 PM
To: Multiple recipients of list
Subject: RE: inconsistencies web search performance
At 04:12 PM 11/10/97 -0800, Booher, Craig wrote:
> Now we have reached the second conundrum faced by users of Internet
> search engines - relevancy ranking. While our knowledge of the search
> engine details may be unacceptable, we know even less about their
> relevancy ranking algorithms. With what confidence can we state that
> (using the example provided) item #4,763 is less relevant than
> previously presented items?
I'll take issue with that on two counts.
1. Given the number of documents being searched, relevancy ranking is
insufficient to reasonably differentiate among documents for
subject-oriented searches. It is only sufficient when the searcher knows
almost exactly what he or she is seeking and how it differs from the
corpus. This is a small percentage of searches.
2. We do not understand the "algorithms," if there are such structures,
used by *humans* for subject-oriented categorization; sufficiently
advanced relevancy ranking will be essentially unpredictable because it
is based on fuzzy logic in an effort to imitate the poorly understood
human mind's methods. This is true of Verity's relevancy ranking for all
but the simplest queries. There is no useful way to predict how a set of
evidence will accrue into a relevancy score. The useful information is
the algorithms' goal, not the actual algorithms. For example, our
density operator is a third-order algorithm that ranks the first few
repetitions of a term much higher than the later ones, while also
taking into account the document length. The goal is to have a
reasonable curve, to capture the presumed human behavior that when a
term is repeated a few times, it is significant, but when it is
repeated too many times, its significance increases only gradually.
Thus, you can know the algorithm, but unless you completely understand
how people's use of language is revealed in term density, the
information isn't useful.
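The saturating curve described here can be sketched roughly. This is
not Verity's actual density operator (that remains proprietary); it is
a BM25-style approximation in which early repetitions of a term add a
lot to the score, later repetitions add progressively less, and longer
documents are discounted. The function name, parameters, and constants
are all illustrative assumptions:

```python
def density_score(term_count: int, doc_length: int,
                  k: float = 1.5, b: float = 0.75,
                  avg_doc_length: float = 500.0) -> float:
    """Illustrative saturating term-density score (BM25-style sketch).

    The first few repetitions of a term raise the score sharply;
    later repetitions add less and less (diminishing returns).
    Documents longer than the assumed average are penalized, so raw
    repetition in a long document counts for less.
    """
    if term_count <= 0 or doc_length <= 0:
        return 0.0
    # Length normalization: scale the saturation constant k by how
    # this document's length compares to an assumed average length.
    norm = k * ((1 - b) + b * doc_length / avg_doc_length)
    return term_count * (k + 1) / (term_count + norm)
```

Plotting this over term_count gives the "reasonable curve" in
question: steep at first, then flattening, with the whole curve
shifted down for longer documents.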
Our appreciation of the usefulness of categorization in conjunction
with search has led us to defocus somewhat on improving relevancy
ranking. The volume of documents being searched has grown beyond
relevancy ranking's limits; we're focusing on returning results in the
context of categories (something we learned from librarians!), with
relevancy ranking coming into play only once the searcher has chosen
the context(s) in which to search in detail.
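The two-step approach outlined above (first choose a context, then
apply relevancy ranking only within it) can be sketched in a few
lines. The toy corpus, category names, and simple count-based ranking
below are invented for illustration; the actual Verity categorization
mechanism is not described in enough detail here to reproduce:

```python
from collections import defaultdict

# Toy corpus of (title, category, text) records. The categories stand
# in for the librarian-style contexts described above; all names are
# invented for this sketch.
DOCS = [
    ("Intro to ranking", "search", "ranking ranking relevance"),
    ("Cataloging basics", "libraries", "catalog subject subject"),
    ("Query syntax", "search", "query term ranking"),
]

def categories_for(query_term):
    """Step 1: report which contexts contain the term at all."""
    counts = defaultdict(int)
    for _, category, text in DOCS:
        counts[category] += text.split().count(query_term)
    return dict(counts)

def search_in_context(query_term, category):
    """Step 2: rank documents, but only within the chosen context."""
    hits = [(title, text.split().count(query_term))
            for title, cat, text in DOCS
            if cat == category and query_term in text.split()]
    # Relevancy ranking (here, a crude term count) applies only
    # after the searcher has picked a category.
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

A searcher would first call categories_for() to see which contexts
mention the term, pick one, and only then receive a relevancy-ordered
list scoped to that context.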
Nick Arnett
Product Manager, Knowledge Applications
Verity Inc. (http://www.verity.com/)
"Connecting People with Information"
Phone: (408) 542-2164 E-mail: narnett at verity.com
Sincerely,
Craig S. Booher
Technical Information Coordinator
Kimberly-Clark Corporation
P.O. Box 999
Neenah, WI 54956-0999
telephone: 920/721-5219
fax: 920/721-8471