Comparing search log distribution curves
Richard Wiggins
rich at richardwiggins.com
Sun Sep 22 14:02:40 EDT 2002
For an upcoming article, I'm looking for folks who are willing to compare
their search log distributions. While analyzing the most popular searches
at Michigan State University, I plotted unique search phrases versus their
rank, and came up with this curve:
http://netfact.com/rww/write/searcher/rww-searcher-msukeywords-searchdist-apr-jul2002.gif
If that link is too long, please use this: http://tinyurl.com/1ktk
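For anyone who wants to produce a comparable curve from their own logs, here is a minimal sketch in Python of the counting step described above. It is not the script used for the MSU plot. It assumes the log has already been reduced to one search phrase per entry, and the function name rank_frequency and the sample phrases are made up for illustration. Folding case and whitespace is also an assumption; you may prefer to count raw strings.

```python
from collections import Counter

def rank_frequency(queries):
    """Return (rank, phrase, count) tuples, most frequent phrase first."""
    # Assumption: fold case and collapse whitespace so that
    # "Library Hours" and "library  hours" count as the same phrase.
    normalized = (" ".join(q.lower().split()) for q in queries)
    counts = Counter(p for p in normalized if p)
    # Sort by descending count; break ties alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, phrase, n)
            for rank, (phrase, n) in enumerate(ordered, start=1)]

# Hypothetical sample data, not taken from any real log.
sample = ["library hours", "Library Hours", "football tickets",
          "library  hours", "map", "football tickets"]

for rank, phrase, n in rank_frequency(sample):
    print(rank, phrase, n)
```

Plotting count against rank from this output gives the kind of curve shown in the image above.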
It turns out that this sort of distribution is common in nature. Folks more
learned than I, such as Avi Rappoport, Lou Rosenfeld, and members of this
list, are already familiar with such curves, as predicted by Pareto (and
Zipf and Bradford). Here, for instance, is a curve showing Web site
popularity as a Zipf distribution:
http://www.useit.com/alertbox/zipf.html
Another example: I plotted data from Wordtracker.com, which compiles actual
searches by end users of metacrawlers, and came up with a curve like this:
http://netfact.com/rww/write/searcher/rww-searcher-wordtracker-distribution-sep2002.gif
(or http://tinyurl.com/1kug )
Here's the question: Does the shape of that curve vary for different
information spaces? Your library offers all kinds of searchable spaces:
-- Your library's Web presence
-- Periodical databases
-- Your OPAC
-- Commercial databases, varying from general to highly specialized.
-- Your intranet
My hypothesis is that a similar curve will apply in virtually all databases
that allow general purpose keyword searching. The question is how much the
curve varies by audience and by topic. I suspect a completely general index
with general users will have stronger "best sellers" than a very specific
corpus such as an obscure scientific database. But maybe that's wrong;
every discipline has fashionable topics du jour. (The shape of the curve
definitely varies with how long a period you sample; a 3-month sample will
pick up far more unique phrases than a 1-day one.)
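One possible way to put numbers on "the shape of the curve" so that logs from different information spaces can be compared is sketched below in Python. Both measures are my assumptions about what might be useful, not an established method: zipf_slope fits a straight line to log(count) versus log(rank) by ordinary least squares (a crude estimator, but easy to reproduce), and top_share reports what fraction of all searches the top k phrases account for, which is one reading of how strong the "best sellers" are. The function names are hypothetical.

```python
import math

def zipf_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    Needs at least two phrases, and every count must be positive.
    A slope near -1 corresponds to a classic Zipf curve.
    """
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(c) for c in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def top_share(counts, k):
    """Fraction of all searches accounted for by the k most popular phrases."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:k]) / sum(ordered)

# Synthetic counts that follow count = 1000 / rank exactly (not real data).
ideal = [1000 / rank for rank in range(1, 51)]
print(round(zipf_slope(ideal), 3))
print(round(top_share(ideal, 5), 3))
```

Because the number of unique phrases grows with the sampling period, both figures are only comparable between logs that cover similar lengths of time.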
I can share a search log analysis script written in Perl if that'd help. Any
literature citations on this would be greatly appreciated too.
Thanks,
/rich
PS -- I'm also looking for examples of Web sites that share their most
recent searches openly, such as:
http://search.msu.edu/info/logger.html?when=last31days
http://www.utexas.edu/search/popular/
http://www.hyperdictionary.com/dictionary?ShowTop=1
http://www.pbbt.com/Directory/keywords.shtml
http://50.lycos.com/
http://www.google.com/press/zeitgeist.html
____________________________________________________
Richard Wiggins
Writing, Speaking, and Consulting on Internet Topics
rich at richardwiggins.com www.richardwiggins.com
More information about the Web4lib mailing list