Inconsistencies in web search performance

Steve Harter harter at indiana.edu
Fri Nov 14 16:26:19 EST 1997


> The bad ones are exact because they only do Boolean searching; the good ones
> are inexact because they use fuzzy logic.  The smarter the search engines
> get, the less exact the relevancy will be.  This is the nature of
> information.  Ask the editor of a newspaper why a particular news article
> belongs on the front page (i.e., is considered highly relevant to the
> paper's target audience) and you won't get a set of logical rules; relevancy
> ranking is a matter of opinion.  Fundamentally, subject-based searching of
> text, when done well, is subjective.  Further, relevancy should have to do
> with much more than the subject.  Sometimes a document is relevant because
> it is well-written, authoritative, provocative, humorous, popular or has
> other qualities that have nothing to do with the subject.  Only the last of
> those -- popular -- is likely to be measurable by automation.
> 

There is a large theoretical and empirical literature in information
science on the nature of relevance.  In recent years there has been much
research in user-based relevance (also called psychological, cognitive, or
situational relevance), in which the kinds of criteria Nick describes,
among many others, have been found to operate in the judgments of real
people in real situations.  I'm not sure I agree that only popularity offers the
potential of automation, however.  For example, the journal in which an
article is published can be an important factor in determining relevance,
as can the department or school with which the author is affiliated. With
some effort, schools and journals (and web sites?) could be weighted by
perceived prestige, it would seem.
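
To make that concrete, here is a rough sketch of what I have in mind, in
Python.  The prestige figures and the 80/20 blend are invented for
illustration; no engine I know of has published such numbers.

JOURNAL_PRESTIGE = {              # assumed, hand-assigned weights in [0, 1]
    "JASIS": 0.9,
    "Library Trends": 0.7,
}
AFFILIATION_PRESTIGE = {
    "Indiana University": 0.8,
}

def prestige_score(journal, affiliation):
    """Average the two prestige signals, defaulting unknowns to 0.5."""
    j = JOURNAL_PRESTIGE.get(journal, 0.5)
    a = AFFILIATION_PRESTIGE.get(affiliation, 0.5)
    return (j + a) / 2.0

def rank_score(subject_score, journal, affiliation):
    """Blend topical match with prestige; the 80/20 split is arbitrary."""
    return 0.8 * subject_score + 0.2 * prestige_score(journal, affiliation)

print(rank_score(0.7, "JASIS", "Indiana University"))  # about 0.73

Compiling the prestige tables is the hard (and contestable) part; the
arithmetic itself is trivial.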


> The basis of default behaviors is traditional information retrieval --
> balancing precision and recall.  One big problem is that some people, such
> as executives, don't want to see a single false positive (an irrelevant
> result), but they don't mind if you miss quite a few potentially relevant
> documents (false negatives); while others, such as patent lawyers and other
> professional researchers, are intolerant of an engine that misses anything
> -- they'll sift through false positives.
> 

This does not seem like such a large problem to me.  Why not have the
user characterize the kind of search wanted: (a) comprehensive (high
recall); (b) a reasonable balance between recall and precision; or (c)
precise and accurate (high precision)?  Then conduct the search with an
algorithm designed to meet that goal.  In other words, instead of having
one algorithm try to fit all situations, design three or five of them to
meet a range of goals, and let the user's goal determine the ranking and
retrieval algorithm used.
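
As a toy sketch of the switch I mean (in Python; the scoring function is
a stand-in and the cutoffs are invented, untested numbers):

def overlap_score(query_terms, doc_terms):
    """Toy relevance score: fraction of query terms found in the document."""
    q = set(query_terms)
    if not q:
        return 0.0
    return len(q & set(doc_terms)) / len(q)

# Hypothetical cutoffs per goal; the numbers are illustrative only.
CUTOFFS = {"high_recall": 0.1, "balanced": 0.5, "high_precision": 0.9}

def search(query_terms, docs, goal="balanced"):
    """docs: list of (doc_id, term_list).  Returns (score, doc_id) pairs,
    best first, filtered by the cutoff the user's goal selects."""
    cutoff = CUTOFFS.get(goal, CUTOFFS["balanced"])
    hits = [(overlap_score(query_terms, terms), doc_id)
            for doc_id, terms in docs]
    return sorted([h for h in hits if h[0] >= cutoff],
                  key=lambda h: h[0], reverse=True)

docs = [("d1", ["web", "search", "engines"]),
        ("d2", ["library", "science"])]
print(search(["web", "search"], docs, goal="high_recall"))  # [(1.0, 'd1')]

The point is only that the goal selects the algorithm (here, merely a
cutoff); a real system could swap in entirely different ranking methods.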


> >Can we see the mathematical development which
> >led to the conclusion that this particular relevancy ranking approach
> >was valid and optimal for the search engine and knowledge domain?  In
> >short, I'd like some theoretical basis and references to the literature
> >which I can use to assess whether I, as a consumer, am comfortable with
> >your relevancy ranking implementation.
> 
> There are many components to our relevancy ranking, which are combined by
> default with an accrue operation, a fuzzy logic operator that means
> essentially "the more, the better."  We're also now optionally offering a
> "sum" operator, which is similar, but not fuzzy.
> 

Amen.  But for me it doesn't have to be a theoretical basis; I would
settle for knowing what ad hoc criteria are used.  The statement above,
which refers only to "many (undefined) components," doesn't really help.
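
For what it is worth, one plausible reading of "the more, the better" is
a probabilistic-OR combination.  The posting does not define the operator,
so the following Python sketch is only a guess at the semantics:

from functools import reduce

def accrue(scores):
    """Assumed 'accrue': probabilistic OR over scores in [0, 1].
    Each additional piece of evidence raises the total, never past 1."""
    return 1.0 - reduce(lambda acc, s: acc * (1.0 - s), scores, 1.0)

def plain_sum(scores):
    """The non-fuzzy 'sum' alternative: evidence simply adds, unbounded."""
    return sum(scores)

print(accrue([0.5, 0.5, 0.5]))     # 0.875 -- more evidence, higher score
print(plain_sum([0.5, 0.5, 0.5]))  # 1.5   -- same idea, but not capped at 1

Even if the real operator differs, publishing this much -- the combination
rule itself -- is exactly the kind of disclosure I am asking for.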

> 
> >In either case, I'd
> >be interested in learning more about this approach (and as you can
> >probably surmise from my earlier remarks, can you identify any published
> >literature on which you are basing your determination? :-))
> 
> I don't closely follow the academic literature; I almost never see any
> informed articles in the trade or popular press.  

For interested folks, there is a ton of such literature, dating back to
the mid-1950s.  Most of the algorithms suggested and tested are ad hoc
rather than rooted in theory, and they partially contradict one another.
Nor is there any theory explaining how the criteria used should be
combined to form a single ranking.
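
By way of illustration, the combination rules in that literature look
something like the following invented example (the criteria names and the
weights are made up; real engines publish neither):

CRITERIA_WEIGHTS = {          # all names and numbers are illustrative
    "term_frequency": 0.4,    # how often query terms occur in the text
    "title_match":    0.3,    # whether query terms appear in the title
    "recency":        0.2,    # newer documents score higher
    "popularity":     0.1,    # e.g., link or citation counts
}

def combined_rank(criterion_scores):
    """Weighted sum of per-criterion scores, each assumed in [0, 1].
    The weights have no theoretical justification -- that is the point."""
    return sum(CRITERIA_WEIGHTS[name] * criterion_scores.get(name, 0.0)
               for name in CRITERIA_WEIGHTS)

print(combined_rank({"term_frequency": 0.8, "title_match": 1.0}))  # about 0.62

Change any weight and the ranking changes, with no principled way to say
which version is "right."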

Can't the search engines at least list the various criteria employed in
their algorithms, without revealing the exact way in which they are
operationalized?  That would be extremely helpful.

   Steve

      Stephen P. Harter,  School of Library and Information Science
      Indiana University                      Voice: (812) 855-5113
      Bloomington, IN 47405                     Fax: (812) 855-6166
                           <harter at indiana.edu>
