[Web4lib] Problems with Wikipedia
Lars Aronsson
lars at aronsson.se
Fri Jan 5 23:32:05 EST 2007
Steven Jeffery wrote:
> Actually, this is somewhat incorrect. While many times larger than an
> encyclopedia, Wikipedia does not plan to cover every topic. There is a
> specific policy (http://en.wikipedia.org/wiki/WP:N) covering what will and
> will not be included in the database. In essence, they DO believe that some
> topics are more worthy of documentation than others, but that threshold is
> lower than that of an encyclopedia.
An alternative way to read your second sentence is that
Wikipedia, even if it eventually covers every topic, doesn't
*plan* to. Planning just isn't in the toolbox.
I think the "WP:N" (Wikipedia's notability guideline) is more a
tool for getting rid of teenagers who want to document their
buddies. The guideline does set a threshold for what can be
included, but it is far lower than for any printed encyclopedia.
Just as Zipf's law in linguistics says that some words appear far
more frequently than others, and "the long tail" in e-commerce
says much the same about products
(see http://en.wikipedia.org/wiki/Zipf's_law
and http://en.wikipedia.org/wiki/The_Long_Tail),
it should be possible to rank topics for encyclopedia articles
by how much they are in demand.
I don't know how to actually do this. Perhaps by analyzing which
articles people tend to read, perhaps by analyzing which articles
have the most edits. At the top of any such list, you will find
the equivalents of "the", "and" and "in": bestselling books and
encyclopedia articles about people who appear on television.
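As a toy illustration (nothing Wikipedia actually runs; the
Python below and the filename "corpus.txt" are my own made-up
sketch), a few lines are enough to watch Zipf's law at work in
any large text:

    import re
    from collections import Counter

    # Count how often each word occurs in a plain-text file.
    with open("corpus.txt", encoding="utf-8") as f:
        counts = Counter(re.findall(r"[a-z']+", f.read().lower()))

    # Under Zipf's law, count is roughly proportional to 1/rank,
    # so the product rank * count should stay roughly constant.
    for rank, (word, n) in enumerate(counts.most_common(15), 1):
        print("%4d %8d %10d  %s" % (rank, n, rank * n, word))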
So, in theory Wikipedia can cover everything from George W. Bush
to some lesser known aspects of popular culture in the 1970s -- or
the 1870s. There are 1.5 million examples of this. Wikipedia does
have a lower threshold in the "notability" guideline, which is
fine. But does that mean it *covers* everything in between?
I'm afraid there are big gaps that don't get filled with time,
and that we have no way of even estimating how big those gaps are.
Recently, I've been trying to help improve the spell checking
dictionaries for free software. These dictionaries should contain
as many correctly spelled words as possible, so that all
unrecognized words can be underlined in red. This sounds overly
simplistic, but that's actually how they work. If you have a
sufficiently large body of text (a corpus), you can gather pretty
accurate statistics on how often each word tends to appear. If
your dictionary is far too small, its words will only cover maybe
95 percent of the words in a typical text, meaning that on average
one word in 20 will be underlined, even if it is correctly
spelled, because it doesn't appear in the dictionary. This
happens to less common words like "Web4lib", "Wikipedia", "Zipf"
and "Lars"(*). Any text contains some uncommon words like that.
You can add these words to your personal dictionary, but how
does one improve the base dictionary that is distributed with
the software? Adding "Web4lib" to that dictionary would help me,
but not many other people. It's the "globally
relevant" words that should be added. Fortunately we have ways to
know which they are, because of research in computational
linguistics since the 1960s.
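The simple-minded version of that research is easy enough to
sketch: run a large corpus past the word list and rank the words
it doesn't recognize by how often they occur. The filenames below
are again placeholders:

    import re
    from collections import Counter

    with open("words.txt", encoding="utf-8") as f:
        dictionary = {line.strip().lower() for line in f}

    with open("corpus.txt", encoding="utf-8") as f:
        counts = Counter(re.findall(r"[a-z']+", f.read().lower()))

    # The more often a missing word occurs in the corpus, the
    # more people would be helped by adding it to the dictionary.
    missing = sorted(((n, w) for w, n in counts.items()
                      if w not in dictionary), reverse=True)
    for n, w in missing[:30]:
        print("%8d  %s" % (n, w))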
Similarly, it's the "globally relevant" topics that should be
covered in Wikipedia, but how do we know which they are?
Every day since 2001, people have read the news and gone to
Wikipedia to check the background. What is a tsunami? Which
countries border on Afghanistan? Why did they call this hurricane
Katrina? If the facts weren't already there, readers dug up
information in other places and wrote the Wikipedia articles.
But nobody is going back through the 1950s, day by day, to add
background information on the news stories of that decade.
Perhaps that is what needs to be done? Can we estimate what
we're missing?
Or perhaps the news stories from the 1950s aren't globally
relevant, because nobody will ask for these facts. Then Wikipedia
is fine just as it is, covering hurricane Katrina and all the
figures from Star Wars. That's what people want to know, and
that's already covered. I don't know which scenario is scarier:
that we're already done, or that we'll never be.
---
(*) The posting above contains 612 occurrences of 283 unique
words. The word "the" occurs 25 times (4% of the text) but 156 of
the words (half of the vocabulary) appear only once (as Zipf's law
predicts). The standard "GNU Aspell" English spell checker
recognizes everything except Wikipedia and Zipf, which together
occur 10 times and thus make up 1.63 percent of the posting, for a
quite acceptable coverage of 98.37 percent. Does that mean the
dictionary knows about "Web4lib"? No, it cheats: it treats digits
like white space, and it does recognize "Web" and "lib" on their
own.
It is also possible that I wrote "seize" where I should have used
"size", because the spell checker only checks that the words are
right, not that I use the right words.
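Counts like those above take only a few lines to reproduce;
"posting.txt" is a placeholder for the text of this message:

    import re
    from collections import Counter

    with open("posting.txt", encoding="utf-8") as f:
        counts = Counter(re.findall(r"[A-Za-z0-9']+",
                                    f.read().lower()))

    total = sum(counts.values())
    hapax = sum(1 for n in counts.values() if n == 1)  # seen once
    word, n = counts.most_common(1)[0]
    print("words: %d, unique: %d, appearing once: %d"
          % (total, len(counts), hapax))
    print('"%s" occurs %d times (%.0f%% of the text)'
          % (word, n, 100.0 * n / total))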
--
Lars Aronsson (lars at aronsson.se)
Aronsson Datateknik - http://aronsson.se