FW: [Web4lib] The sources of Wikipedia

Gimon, Charles A CAGimon at mplib.org
Thu Sep 7 15:41:54 EDT 2006


Of course, the fact that the #1 resource out of 161,000 had only 460 hits kind of puts things into perspective, too. Long tail, anyone?

Pokémon as a subject is likely to generate a large number of discrete articles. There have certainly been more Pokémon than there have been Presidents of the United States.

Here's a typical botanical article: http://en.wikipedia.org/wiki/Spelt . Of five references, two are online, two are abstracts online, and one is a print item with an ISBN. I'm assuming your regex would handle dashes and spaces in the ISBN pattern? Or does the dump canonicalize them? It looks like it might: the dashes were removed in the URL that is linked from the ISBN here.
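
For what it's worth, here is one way a script could tolerate dashes and spaces, sketched in Python rather than Perl. This is only a guess at an approach, not the regex actually used on the dump; the names ISBN_CANDIDATE, normalize_isbn and find_isbns are made up for illustration, and it only looks at ten-digit ISBNs.

```python
import re

# Hypothetical pattern: the letters ISBN, a colon or whitespace, then ten
# digits (the last may be X), each optionally followed by one dash or space.
# The lookahead skips longer digit runs such as 13-digit numbers.
ISBN_CANDIDATE = re.compile(
    r"ISBN[:\s]\s*((?:[0-9][- ]?){9}[0-9Xx])(?![0-9])"
)


def normalize_isbn(raw):
    """Remove dashes and whitespace and upper-case a trailing x."""
    return re.sub(r"[\s-]", "", raw).upper()


def find_isbns(text):
    """Return the normalized ISBN-like strings found in text."""
    return [normalize_isbn(m.group(1)) for m in ISBN_CANDIDATE.finditer(text)]
```

With this, find_isbns("ISBN 1-930206-50-X") and find_isbns("ISBN 193020650X") both give the same key, so dashed and undashed citations of one book would be counted together. A digit that happens to follow a nine-digit typo after a space would still be swallowed, so it is a sketch, not a cure.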

--Charles "Gotta Catch 'em All" Gimon
  Web Coordinator
  Minneapolis Public Library



-----Original Message-----
From: web4lib-bounces at webjunction.org [mailto:web4lib-bounces at webjunction.org] On Behalf Of Lars Aronsson
Sent: Thursday, September 07, 2006 2:25 PM
To: web4lib at webjunction.org
Subject: [Web4lib] The sources of Wikipedia



For those of us helplessly addicted to Perl programming, one of 
the greatest joys of Wikipedia is the ability to download the 
entire database dump in XML format and dig through it for hidden 
patterns.  The dumps are available at http://download.wikimedia.org/

One of the original peculiarities of the wiki markup language used 
in Wikipedia's articles is that the letters ISBN followed by one 
whitespace character and ten digits (the last of which may be an X) 
are recognized as a link to a separate page, where you can look up 
that ISBN in various bookstores or libraries.  In the most recent 
dump of the English Wikipedia, I found 161,973 such ISBN patterns.  
All books are created equal, but some are more equal than the rest.  
I found the following ISBNs to be the most referenced:

Count ISBN        Title

  460 0954381157  "Trade unions of the world"
  391 0439154049  "The official Pokemon handbook"
  389 193020650X  "Official Nintendo Pokémon FireRed Version"
  387 130206151   (an error for 1930206151, another Pokemon title)
  372 1930206585  "Official Nintendo Pokémon Emerald Player's Guide"
  357 0761547614  "Prima's Official Pokemon Guide"
  346 0002169878  "Collins Guide to the Sea Fishes of New Zealand"
  342 1569315604  "Pokemon Adventures, Adventure 3: Saffron Cit..."
  334 1930206194  "Super Smash Bros. Melee, Official Guide from..."
  334 1569315086  "Pokemon Adventures: Legendary Pokemon, Vol. 2"
  333 1569314365  "Pokemon Graphic Novel vol. 3: Electric Pikac..."
  332 1930206313  "Gameboy Advance Pokemon Ruby Version and Sap..."
  332 1598120026  "Official Nintendo Pokémon XD: Gale of Darkne..."
  332 1569318514  "Pokemon Adventures, Volume 7: Yellow Caballe..."

Well, I could go on, but I'll stop there.  I guess all it takes is 
a handful of people with a strong interest in Pokemon who are very 
careful to cite sources with ISBN numbers, and pretty soon you 
outnumber everybody except the guy who wrote 460 articles about 
trade unions, always citing the same book.
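
A count like the one above can be sketched in a few lines.  This is 
a Python illustration, not the Perl script that produced the table.  
The pattern LOOSE_ISBN and the function names are hypothetical, and 
the pattern is deliberately loose (any digit run after "ISBN" and one 
whitespace) so that malformed entries such as 130206151 still show 
up and can be flagged by the standard ISBN-10 check digit test.

```python
import re
from collections import Counter

# Assumption: "ISBN", one whitespace character, then a run of digits with
# an optional trailing X. Looser than the ten-character form on purpose.
LOOSE_ISBN = re.compile(r"ISBN\s([0-9]+[Xx]?)")


def count_isbns(lines):
    """Tally every ISBN-like string seen in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(m.upper() for m in LOOSE_ISBN.findall(line))
    return counts


def isbn10_is_valid(isbn):
    """ISBN-10 check: weights 10 down to 1, a final X counts as 10,
    and the weighted sum must be divisible by 11."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return False
    total = sum(
        (10 - i) * (10 if ch == "X" else int(ch))
        for i, ch in enumerate(isbn)
    )
    return total % 11 == 0
```

Feeding the lines of a dump file to count_isbns and calling 
most_common(14) on the result would give a list shaped like the one 
above; isbn10_is_valid("130206151") is False because the string has 
only nine characters, while the corrected 1930206151 passes.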


-- 
  Lars Aronsson (lars at aronsson.se)
  Aronsson Datateknik - http://aronsson.se