[WEB4LIB] Comparing search log distribution curves
Robertson, James
Robertson at ADM.NJIT.EDU
Mon Sep 23 08:47:33 EDT 2002
Rich, et al.
Those interested in this topic might be interested in Bernardo A.
Huberman's "The laws of the web: patterns in the ecology of information"
(Cambridge, Mass.: MIT Press, c2001).
A "breezy" (conversational, easy-to-read) 100-page book that finds such
patterns (Zipf, Pareto, Bradford, 80/20, etc.) in everything from (1) the
number of clicks a user makes per web site, to (2) the number of pages in
a given web site, to (3) the "degree of separation" between any two random
sites (the number of clicks to get from here to there), and so forth.
--Jim
Jim Robertson
Assistant University Librarian
Van Houten Library
New Jersey Institute of Technology
323 King Blvd.
Newark, NJ 07102-1982
(973) 596-5798 -- james.c.robertson at njit.edu -- www.library.njit.edu
-----Original Message-----
From: Richard Wiggins [mailto:rich at richardwiggins.com]
Sent: Sunday, September 22, 2002 2:09 PM
To: Multiple recipients of list
Subject: [WEB4LIB] Comparing search log distribution curves
For an upcoming article, I'm looking for folks who are willing to compare
their search log distributions. While analyzing the most popular searches
at Michigan State University, I plotted unique search phrases versus their
rank, and came up with this curve:
http://netfact.com/rww/write/searcher/rww-searcher-msukeywords-searchdist-apr-jul2002.gif
If that link is too long please use this: http://tinyurl.com/1ktk
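For anyone who wants to produce that kind of rank/frequency data from their
own logs, the tally can be sketched in a few lines. This is an illustrative
Python sketch, not the actual analysis script; it assumes a log with one
search phrase per line, and all names are hypothetical:

```python
# Hypothetical sketch: tally search phrases from a one-phrase-per-line log
# and emit (rank, count) pairs for a rank/frequency plot like the one above.
from collections import Counter

def rank_frequency(phrases):
    """Return (rank, count) pairs, most frequent phrase first."""
    counts = Counter(p.strip().lower() for p in phrases if p.strip())
    return [(rank, count)
            for rank, (phrase, count) in enumerate(counts.most_common(), start=1)]

# Toy "log" standing in for real search-log lines:
log = ["library hours", "library hours", "opac", "library hours", "ill"]
print(rank_frequency(log))  # [(1, 3), (2, 1), (3, 1)]
```

Plotting count against rank from output like this gives the long-tailed
curve shown in the graphic.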
It turns out that this sort of distribution is common in nature. Folks more
learned than I, such as Avi Rappoport, Lou Rosenfeld, and members of the
list, are already familiar with such curves, as predicted by Pareto (and
Zipf and Bradford). Here for instance is a curve showing Web site
popularity as a Zipf distribution:
http://www.useit.com/alertbox/zipf.html
Another example: I plotted data from Wordtracker.com, which compiles actual
searches by end users of metacrawlers, and came up with a curve like this:
http://netfact.com/rww/write/searcher/rww-searcher-wordtracker-distribution-sep2002.gif
(or http://tinyurl.com/1kug )
Here's the question: Does the shape of that curve vary for different
information spaces? Your library offers all kinds of searchable spaces:
-- Your library's Web presence
-- Periodical databases
-- Your OPAC
-- Commercial databases, varying from general to highly specialized.
-- Your intranet
My hypothesis is that a similar curve will apply in virtually all databases
that allow general purpose keyword searching. The question is how much the
curve varies by audience and by topic. I suspect a completely general index
with general users will have stronger "best sellers" than a very specific
corpus such as an obscure scientific database. But maybe that's wrong;
every discipline has fashionable topics du jour. (The shape of the curve
definitely varies with the length of the sampling period; a three-month
sample will pick up far more unique phrases than a one-day one.)
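One way to make that comparison concrete is to fit the slope of each
rank/frequency curve on a log-log scale: a Zipf-like distribution is roughly
a straight line there, and a steeper slope means stronger "best sellers."
This is a hypothetical sketch of such a fit (plain least squares, no
external libraries), not anything from the article:

```python
# Hypothetical sketch: fit the slope of log(count) vs. log(rank) so that
# curves from different information spaces can be compared by one number.
import math

def zipf_slope(counts):
    """Least-squares slope of log(count) vs. log(rank).

    `counts` must be sorted in descending order (rank 1 first).
    """
    xs = [math.log(rank) for rank in range((1), len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# A pure Zipf curve (count proportional to 1/rank) has slope -1:
counts = [1000 / r for r in range(1, 51)]
print(round(zipf_slope(counts), 2))  # -1.0
```

A general-audience index might show a slope steeper than -1 at the head,
while a narrow scientific database might be flatter; the fitted slope gives
a single number to compare across logs.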
I can share a search log analysis script written in Perl if that'd help. Any
literature citations on this would be greatly appreciated too.
Thanks,
/rich
PS -- I'm also looking for examples of Web sites that share their most
recent searches openly, such as:
http://search.msu.edu/info/logger.html?when=last31days
http://www.utexas.edu/search/popular/
http://www.hyperdictionary.com/dictionary?ShowTop=1
http://www.pbbt.com/Directory/keywords.shtml
http://50.lycos.com/
http://www.google.com/press/zeitgeist.html
____________________________________________________
Richard Wiggins
Writing, Speaking, and Consulting on Internet Topics
rich at richardwiggins.com www.richardwiggins.com