inconsistencies web search performance
Nick Arnett
narnett at verity.com
Fri Nov 14 14:29:21 EST 1997
At 08:36 AM 11/14/97 -0800, Booher, Craig wrote:
>I find myself constantly
>reminding staff in our organization that relevancy ranking is an inexact
>science (art?) at this stage of its development in the web arena.
The bad ones are exact because they only do Boolean searching; the good ones
are inexact because they use fuzzy logic. The smarter the search engines
get, the less exact the relevancy will be. This is the nature of
information. Ask the editor of a newspaper why a particular news article
belongs on the front page (i.e., is considered highly relevant to the
paper's target audience) and you won't get a set of logical rules; relevancy
ranking is a matter of opinion. Fundamentally, subject-based searching of
text, when done well, is subjective. Further, relevancy should have to do
with much more than the subject. Sometimes a document is relevant because
it is well-written, authoritative, provocative, humorous, popular or has
other qualities that have nothing to do with the subject. Only the last of
those -- popular -- is likely to be measurable by automation.
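To make that distinction concrete, here's a rough sketch in Python (my own
illustration with invented scoring; it isn't any engine's actual algorithm,
ours included):

    # Boolean matching: a document either satisfies the query or it
    # doesn't; the answer is exact but blunt.
    def boolean_match(doc_terms, query_terms):
        return all(t in doc_terms for t in query_terms)

    # Fuzzy ranking: each document gets a degree of relevance in [0, 1];
    # partial matches count, so the ranking is inexact by design.
    def fuzzy_score(doc_terms, query_terms):
        hits = sum(1 for t in query_terms if t in doc_terms)
        return hits / len(query_terms)

    doc = {"web", "search", "relevancy", "ranking"}
    print(boolean_match(doc, ["web", "ranking"]))           # True
    print(fuzzy_score(doc, ["web", "ranking", "boolean"]))  # 0.666...

The Boolean answer is exact but blunt; the fuzzy score is inexact on
purpose, which is the point.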
>For example, which
>academic schools of thought were used as a basis or springboard for the
>current implementation?
The basis of default behaviors is traditional information retrieval --
balancing precision and recall. One big problem is that different users want
different balances. Some people, such as executives, don't want to see a
single false positive (an irrelevant result), but don't mind missing quite a
few potentially relevant documents (false negatives). Others, such as patent
lawyers and other professional researchers, are intolerant of an engine that
misses anything -- they'll sift through false positives.
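In textbook terms (this is standard IR arithmetic, nothing Verity-specific),
those two appetites look like this:

    # Precision: fraction of returned documents that are relevant.
    # Recall: fraction of relevant documents that are returned.
    def precision_recall(returned, relevant):
        returned, relevant = set(returned), set(relevant)
        hits = returned & relevant
        precision = len(hits) / len(returned) if returned else 1.0
        recall = len(hits) / len(relevant) if relevant else 1.0
        return precision, recall

    # The executive: a short list, nearly all relevant -- high precision,
    # low recall.
    print(precision_recall(["d1", "d2"],
                           ["d1", "d2", "d3", "d4", "d5"]))
    # -> (1.0, 0.4)

    # The patent lawyer: everything remotely plausible -- high recall,
    # lower precision.
    print(precision_recall(["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"],
                           ["d1", "d2", "d3", "d4", "d5"]))
    # -> (0.625, 1.0)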
>Can we see the mathematical development which
>led to the conclusion that this particular relevancy ranking approach
>was valid and optimal for the search engine and knowledge domain? In
>short, I'd like some theoretical basis and references to the literature
>which I can use to assess whether I, as a consumer, am comfortable with
>your relevancy ranking implementation.
There are many components to our relevancy ranking, which are combined by
default with an accrue operation, a fuzzy logic operator that means
essentially "the more, the better." We're also now optionally offering a
"sum" operator, which is similar, but not fuzzy.
>I would disagree with you when you say that "the algorithm's goal" not
>the algorithm itself is the "useful information." The goal of all these
>algorithms is to present the "most relevant" information first. HOW
>they determined what is "most relevant" is the useful information a
>consumer needs to know in order to assess the usefulness of the tool.
>(Again, I don't need to know the exact code, but I do require an
>understanding of the theoretical foundation for the algorithm.)
I think we're saying the same thing; you may have said it better!
>Second, I was interested in your concluding comments wherein you
>indicated Verity was recognizing the inadequacy of applying relevancy
>ranking to large volumes of disparate information (at least that was my
>interpretation of your remarks). Instead, you are attempting to place
>search results into some sort of context and then apply relevancy
>ranking within the context. Is that context explicitly determined by
>the user, or somehow "calculated" by the computer?
Both; either. Our shipping products can dynamically cluster documents by
comparing automatically extracted features; this often helps discover
sub-topics within a search results list. Among the search services, Excite
offers something similar. Future products will be able to use pre-existing
taxonomies, integrating them with search results. There's a lot of value in
presenting people with search results in a familiar context, especially in a
literally familiar space, since humans have excellent spatial recollection,
which is the basis of a great deal of semiotics (lead articles in newspapers
almost always start in the upper left corner; we have little trouble
remembering such things).
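As a rough sketch of the clustering idea (the features, similarity measure
and threshold below are all invented for illustration; this is not our
shipping implementation):

    # Group search results whose extracted features overlap enough.
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def cluster(results, threshold=0.3):
        clusters = []
        for doc_id, features in results:
            for c in clusters:
                # Join the first cluster whose seed is similar enough.
                if jaccard(features, c[0][1]) >= threshold:
                    c.append((doc_id, features))
                    break
            else:
                clusters.append([(doc_id, features)])
        return clusters

    results = [
        ("d1", ["search", "relevancy", "ranking"]),
        ("d2", ["relevancy", "ranking", "fuzzy"]),
        ("d3", ["newspapers", "editors", "front-page"]),
    ]
    for c in cluster(results):
        print([doc_id for doc_id, _ in c])  # ['d1', 'd2'] then ['d3']

Each cluster becomes a candidate sub-topic within the results list.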
>In either case, I'd
>be interested in learning more about this approach (and as you can
>probably surmise from my earlier remarks, can you identify any published
>literature on which you are basing your determination? :-))
I don't closely follow the academic literature; I almost never see any
informed articles in the trade or popular press. In fact, I pay little
attention to the IR and AI communities that gave birth to Verity. I'm much
more interested in learning from librarians and publishers, so that we build
products based on an understanding of, and a compromise among, the values of
all three: technology, library science and publishing. I would hope that
Verity can build tools that are as good a compromise among these as Yahoo!
is as a service. Narrowly viewed from any one of the domains, Yahoo! isn't
so great, but as a packaging of the three, it's a success.
Nick
--
Product Manager, Knowledge Applications
Verity Inc. (http://www.verity.com/)
"Connecting People with Information"
Phone: (408) 542-2164 E-mail: narnett at verity.com