Comparing search log distribution curves

Richard Wiggins rich at
Sun Sep 22 14:02:40 EDT 2002

For an upcoming article, I'm looking for folks who are willing to compare
their search log distributions.  While analyzing the most popular searches
at Michigan State University, I plotted unique search phrases versus their
rank, and came up with this curve:

If that link is too long please use this:

It turns out that this sort of distribution is common in nature. Folks more
learned than I, such as Avi Rappoport, Lou Rosenfeld, and members of this
list, are already familiar with such curves, as described by Pareto, Zipf,
and Bradford.  Here for instance is a curve showing Web site
popularity as a Zipf distribution:

Another example: I plotted data from, which compiles actual
searches by end users of metacrawlers, and came up with a curve like this:

(or )
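
For anyone who wants to build the same kind of rank-versus-frequency table from their own log, here is a minimal sketch in Python. It assumes the log has already been reduced to one search phrase per entry; the function name rank_frequency and the normalization step (lowercasing and collapsing whitespace) are illustrative assumptions, not the method used for the curves mentioned above.

```python
from collections import Counter


def rank_frequency(phrases):
    """Return [(rank, phrase, count), ...] ordered from most to least frequent."""
    # Assumed normalization: lowercase and collapse runs of whitespace,
    # so "Library  hours" and "library hours" count as the same phrase.
    counts = Counter(" ".join(p.lower().split()) for p in phrases if p.strip())
    # Sort by descending count; break ties alphabetically so output is stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, phrase, count)
            for rank, (phrase, count) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    # Hypothetical sample data, standing in for a real search log.
    sample = ["Library hours", "library  hours", "parking",
              "library hours", "", "football tickets", "parking"]
    for rank, phrase, count in rank_frequency(sample):
        print(rank, count, phrase)
```

Plotting count against rank (often on log-log axes) then gives the curve in question.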

Here's the question:  Does the shape of that curve vary for different 
information spaces?  Your library offers all kinds of searchable spaces:

-- Your library's Web presence
-- Periodical databases
-- Your OPAC
-- Commercial databases, ranging from general to highly specialized
-- Your intranet

My hypothesis is that a similar curve will apply in virtually all databases
that allow general purpose keyword searching.  The question is how much the
curve varies by audience and by topic.  I suspect a completely general index
with general users will have stronger "best sellers" than a very specific
corpus such as an obscure scientific database.  But maybe that's wrong;
every discipline has fashionable topics du jour.  (The shape of the curve
definitely varies with how long a period you sample; a 3-month sample will
pick up far more unique phrases than a 1-day one.)
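
One rough way to compare the shape of the curve across different information spaces is to boil each log down to a couple of numbers. The Python sketch below computes two of them: the slope of a straight-line fit on log-log axes (about -1 for a classic Zipf curve) and the share of all searches accounted for by the top N phrases, as a measure of how strong the "best sellers" are. The function names zipf_slope and head_share are made up for illustration, and a least-squares fit on log-log data is only a crude estimate, so treat the results as a basis for comparison rather than a rigorous measurement.

```python
import math


def zipf_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    Expects at least two positive counts, one per unique phrase.
    """
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(count) for count in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def head_share(counts, top_n):
    """Fraction of all searches accounted for by the top_n phrases."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:top_n]) / sum(ordered)


if __name__ == "__main__":
    # Hypothetical counts for two logs; real values would come from your own data.
    general_log = [1000.0 / rank for rank in range(1, 51)]
    specialized_log = [40, 38, 35, 33, 30, 29, 27, 25, 24, 22]
    for name, counts in (("general", general_log),
                         ("specialized", specialized_log)):
        print(name, round(zipf_slope(counts), 2), round(head_share(counts, 3), 2))
```

A steeper (more negative) slope and a larger head share would both point to stronger best sellers. Because the number of unique phrases grows with the sampling period, comparisons are only meaningful between logs covering similar periods.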

I can share a search log analysis script written in Perl if that'd help. Any
literature citations on this would be greatly appreciated too.



PS -- I'm also looking for examples of Web sites that share their most
recent searches openly, such as:

Richard Wiggins
Writing, Speaking, and Consulting on Internet Topics
rich at     
