[WEB4LIB] Comparing search log distribution curves
Robertson, James
Robertson at ADM.NJIT.EDU
Mon Sep 23 08:47:33 EDT 2002
Rich, et al.
Those interested in this topic might be interested in Bernardo A.
Huberman's "The laws of the web: patterns in the ecology of information"
(Cambridge, Mass.: MIT Press, c2001).
A "breezy" (conversational, easy-to-read) 100-page book that finds such
patterns (Zipf, Pareto, Bradford, 80/20, etc.) in everything from (1) the
number of clicks a user makes per web site, to (2) the number of pages in
a given web site, to (3) the "degree of separation" between any two random
sites (the number of clicks to get from here to there), and so forth.
--Jim
Jim Robertson
Assistant University Librarian
Van Houten Library
New Jersey Institute of Technology
323 King Blvd.
Newark, NJ 07102-1982
(973) 596-5798 -- james.c.robertson at njit.edu -- www.library.njit.edu
-----Original Message-----
From: Richard Wiggins [mailto:rich at richardwiggins.com]
Sent: Sunday, September 22, 2002 2:09 PM
To: Multiple recipients of list
Subject: [WEB4LIB] Comparing search log distribution curves
For an upcoming article, I'm looking for folks who are willing to compare
their search log distributions. While analyzing the most popular searches
at Michigan State University, I plotted unique search phrases versus their
rank, and came up with this curve:
http://netfact.com/rww/write/searcher/rww-searcher-msukeywords-searchdist-apr-jul2002.gif
If that link is too long please use this: http://tinyurl.com/1ktk
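For anyone who wants to produce that kind of rank/frequency data from their
own logs, the tally can be sketched in a few lines. This is an illustrative
Python sketch, not the actual analysis script; it assumes a log with one
search phrase per line, and all names are hypothetical:

```python
# Hypothetical sketch: tally search phrases from a one-phrase-per-line log
# and emit (rank, count) pairs for a rank/frequency plot like the one above.
from collections import Counter

def rank_frequency(phrases):
    """Return (rank, count) pairs, most frequent phrase first."""
    counts = Counter(p.strip().lower() for p in phrases if p.strip())
    return [(rank, count)
            for rank, (phrase, count) in enumerate(counts.most_common(), start=1)]

# Toy "log" standing in for real search-log lines:
log = ["library hours", "library hours", "opac", "library hours", "ill"]
print(rank_frequency(log))  # [(1, 3), (2, 1), (3, 1)]
```

Plotting count against rank from output like this gives the long-tailed
curve shown in the graphic.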
It turns out that this sort of distribution is common in nature. Folks more
learned than I, such as Avi Rappoport, Lou Rosenfeld, and members of the
list, are already familiar with such curves, as predicted by Pareto (and
Zipf and Bradford). Here for instance is a curve showing Web site
popularity as a Zipf distribution:
http://www.useit.com/alertbox/zipf.html
Another example: I plotted data from Wordtracker.com, which compiles actual
searches by end users of metacrawlers, and came up with a curve like this:
http://netfact.com/rww/write/searcher/rww-searcher-wordtracker-distribution-sep2002.gif
(or http://tinyurl.com/1kug )
Here's the question: Does the shape of that curve vary for different
information spaces? Your library offers all kinds of searchable spaces:
-- Your library's Web presence
-- Periodical databases
-- Your OPAC
-- Commercial databases, varying from general to highly specialized.
-- Your intranet
My hypothesis is that a similar curve will apply in virtually all databases
that allow general purpose keyword searching. The question is how much the
curve varies by audience and by topic. I suspect a completely general index
with general users will have stronger "best sellers" than a very specific
corpus such as an obscure scientific database. But maybe that's wrong;
every discipline has fashionable topics du jour. (The shape of the curve
definitely varies with the length of the sampling period; a three-month
sample will pick up far more unique phrases than a one-day one.)
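One way to make that comparison concrete is to fit the slope of each
rank/frequency curve on a log-log scale: a Zipf-like distribution is roughly
a straight line there, and a steeper slope means stronger "best sellers."
This is a hypothetical sketch of such a fit (plain least squares, no
external libraries), not anything from the article:

```python
# Hypothetical sketch: fit the slope of log(count) vs. log(rank) so that
# curves from different information spaces can be compared by one number.
import math

def zipf_slope(counts):
    """Least-squares slope of log(count) vs. log(rank).

    `counts` must be sorted in descending order (rank 1 first).
    """
    xs = [math.log(rank) for rank in range((1), len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# A pure Zipf curve (count proportional to 1/rank) has slope -1:
counts = [1000 / r for r in range(1, 51)]
print(round(zipf_slope(counts), 2))  # -1.0
```

A general-audience index might show a slope steeper than -1 at the head,
while a narrow scientific database might be flatter; the fitted slope gives
a single number to compare across logs.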
I can share a search log analysis script written in Perl if that'd help. Any
literature citations on this would be greatly appreciated too.
Thanks,
/rich
PS -- I'm also looking for examples of Web sites that share their most
recent searches openly, such as:
http://search.msu.edu/info/logger.html?when=last31days
http://www.utexas.edu/search/popular/
http://www.hyperdictionary.com/dictionary?ShowTop=1
http://www.pbbt.com/Directory/keywords.shtml
http://50.lycos.com/
http://www.google.com/press/zeitgeist.html
____________________________________________________
Richard Wiggins
Writing, Speaking, and Consulting on Internet Topics
rich at richardwiggins.com www.richardwiggins.com