[Web4lib] Problems with Wikipedia
Lars Aronsson
lars at aronsson.se
Fri Jan 5 23:32:05 EST 2007
Steven Jeffery wrote:
> Actually, this is somewhat incorrect. While many times larger than an
> encyclopedia, Wikipedia does not plan to cover every topic. There is a
> specific policy (http://en.wikipedia.org/wiki/WP:N) covering what will and
> will not be included in the database. In essence, they DO believe that some
> topics are more worthy of documentation than others, but that threshold is
> lower than that of an encyclopedia.
An alternative way to read your second sentence is that
Wikipedia, even if it eventually covers every topic, doesn't
*plan* to. Planning just isn't in the toolbox.
I think the "WP:N" (Wikipedia's notability guideline) is more a
tool for getting rid of teenagers who want to document their
buddies. The guideline does set a threshold for what can be
included, but it is far lower than for any printed encyclopedia.
Just as Zipf's law in linguistics says that some words appear far
more frequently than others, and "the long tail" in e-commerce
says much the same about products
(see http://en.wikipedia.org/wiki/Zipf's_law
and http://en.wikipedia.org/wiki/The_Long_Tail),
it should be possible to rank topics for encyclopedia articles
by how much they are in demand.
I don't know how to actually do this. Perhaps by analyzing which
articles people tend to read, perhaps by analyzing which articles
have the most edits. At the top of any such list, you will find
the equivalents of "the", "and" and "in": bestselling books and
encyclopedia articles about people who appear on television.
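As a toy illustration (nothing Wikipedia actually runs; the
Python below and the filename "corpus.txt" are my own made-up
sketch), a few lines are enough to watch Zipf's law at work in
any large text:

    import re
    from collections import Counter

    # Count how often each word occurs in a plain-text file.
    with open("corpus.txt", encoding="utf-8") as f:
        counts = Counter(re.findall(r"[a-z']+", f.read().lower()))

    # Under Zipf's law, count is roughly proportional to 1/rank,
    # so the product rank * count should stay roughly constant.
    for rank, (word, n) in enumerate(counts.most_common(15), 1):
        print("%4d %8d %10d  %s" % (rank, n, rank * n, word))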
So, in theory Wikipedia can cover everything from George W. Bush
to some lesser known aspects of popular culture in the 1970s -- or
the 1870s. There are 1.5 million examples of this. Wikipedia does
have a lower threshold in the "notability" guideline, which is
fine. But does that mean it *covers* everything in between?
I'm afraid there are big gaps that don't get filled with time,
and that we have no way of even estimating how big those gaps are.
Recently, I've been trying to help improve the spell checking
dictionaries for free software. These dictionaries should contain
as many correctly spelled words as possible, so that all
unrecognized words can be underlined in red. This sounds overly
simplistic, but that's actually how they work. If you have a
sufficiently large body of text (a corpus), you can gather pretty
accurate statistics on how often each word tends to appear. If
your dictionary is far too small, its words will only cover maybe
95 percent of the words in a typical text, meaning that on average
one word in 20 will be underlined, even if it is correctly
spelled, because it doesn't appear in the dictionary. This
happens to less common words like "Web4lib", "Wikipedia", "Zipf"
and "Lars"(*). Any text contains some uncommon words like that.
You can add these words to your personal dictionary, but how
does one improve the base dictionary that is distributed with
the software? Adding "Web4lib" to that dictionary would help me,
but not many other people. It's the "globally
relevant" words that should be added. Fortunately we have ways to
know which they are, because of research in computational
linguistics since the 1960s.
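The simple-minded version of that research is easy enough to
sketch: run a large corpus past the word list and rank the words
it doesn't recognize by how often they occur. The filenames below
are again placeholders:

    import re
    from collections import Counter

    with open("words.txt", encoding="utf-8") as f:
        dictionary = {line.strip().lower() for line in f}

    with open("corpus.txt", encoding="utf-8") as f:
        counts = Counter(re.findall(r"[a-z']+", f.read().lower()))

    # The more often a missing word occurs in the corpus, the
    # more people would be helped by adding it to the dictionary.
    missing = sorted(((n, w) for w, n in counts.items()
                      if w not in dictionary), reverse=True)
    for n, w in missing[:30]:
        print("%8d  %s" % (n, w))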
Similarly, it's the "globally relevant" topics that should be
covered in Wikipedia, but how do we know which they are?
Every day since 2001, people have read the news and gone to
Wikipedia to check the background. What is a tsunami? Which
countries border on Afghanistan? Why did they call this hurricane
Katrina? If the facts weren't already there, readers dug up
information in other places and wrote the Wikipedia articles.
But nobody is going back through the 1950s, day by day, to add
background information on the news stories of that decade.
Perhaps that is what needs to be done? Can we estimate what
we're missing?
Or perhaps the news stories from the 1950s aren't globally
relevant, because nobody will ask for these facts. Then Wikipedia
is fine just as it is, covering hurricane Katrina and all the
figures from Star Wars. That's what people want to know, and
that's already covered. I don't know which scenario is scarier:
that we're already done, or that we'll never be.
---
(*) The posting above contains 612 occurrences of 283 unique
words. The word "the" occurs 25 times (4% of the text) but 156 of
the words (half of the vocabulary) appear only once (as Zipf's law
predicts). The standard "GNU Aspell" English spell checker
recognizes everything except Wikipedia and Zipf, which together
occur 10 times and thus make up 1.63 percent of the posting, for a
quite acceptable coverage of 98.37 percent. Does that mean the
dictionary knows about "Web4lib"? No, it cheats: it treats digits
like white space, and it does recognize "Web" and "lib" on their
own.
It is also possible that I wrote "seize" where I should have used
"size", because the spell checker only checks that the words are
right, not that I use the right words.
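Counts like those above take only a few lines to reproduce;
"posting.txt" is a placeholder for the text of this message:

    import re
    from collections import Counter

    with open("posting.txt", encoding="utf-8") as f:
        counts = Counter(re.findall(r"[A-Za-z0-9']+",
                                    f.read().lower()))

    total = sum(counts.values())
    hapax = sum(1 for n in counts.values() if n == 1)  # seen once
    word, n = counts.most_common(1)[0]
    print("words: %d, unique: %d, appearing once: %d"
          % (total, len(counts), hapax))
    print('"%s" occurs %d times (%.0f%% of the text)'
          % (word, n, 100.0 * n / total))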
--
Lars Aronsson (lars at aronsson.se)
Aronsson Datateknik - http://aronsson.se