[Web4lib] Google Books a tease, not a useful tool, for serious research

Fri Jul 6 13:36:09 EDT 2007

Stephen Cauffman wrote:

> Per the 'search within full text' issue: on July 3, Google Books 
> released a "View plain text" feature for the public domain items, 
> where you can view the OCR text. See their blog post on it 
> at: 
> http://booksearch.blogspot.com/2007/07/greater-access-to-public-domain-works.html 
> or http://tinyurl.com/ysddmt

Oh really, less than three years after the project was announced.  
It took the University of Michigan's "Making of America" longer 
than that before they dared to expose the OCR text to readers.

Now we're only waiting for an "edit/proofread this page" link,
because as we all know there are tons of OCR errors, especially in 
the non-English books.

Now we can start to check which OCR errors are most common in 
Swedish books from the 1860s, and whether that differs from OCR 
errors in Swedish books from the 1850s, so that we can give Google 
hints on how to improve their OCR process.  But they have already 
been OCRing more than two years' worth of books that they might 
have to redo.  And Swedish books from the 1890s or 1920s are still 
considered to be under copyright, so only snippets are shown and 
no OCR text. 
This feedback loop is a lot slower than one would expect from the 
web 2.0 decade.  It's a bit faster than the 5 year plans of the 
Soviet Union, but not a lot.
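
The decade-by-decade comparison of OCR errors mentioned above could 
be sketched roughly as follows.  This is only an illustration, 
assuming that a proofread transcription exists for each OCR sample. 
The function names (ocr_confusions, confusions_by_decade) and the 
sample strings are hypothetical and invented for this sketch; they 
are not taken from Google Books or any real tool.

```python
from collections import Counter
from difflib import SequenceMatcher


def ocr_confusions(ocr_text, proofread_text):
    """Count (correct, misread) substring pairs where the OCR differs.

    Only replacements are counted; pure insertions and deletions
    are ignored to keep the sketch short.
    """
    counts = Counter()
    matcher = SequenceMatcher(None, proofread_text, ocr_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            counts[(proofread_text[i1:i2], ocr_text[j1:j2])] += 1
    return counts


def confusions_by_decade(samples):
    """Group confusion counts by publication decade.

    samples is an iterable of (year, ocr_text, proofread_text).
    """
    by_decade = {}
    for year, ocr_text, proofread_text in samples:
        decade = year // 10 * 10
        by_decade.setdefault(decade, Counter()).update(
            ocr_confusions(ocr_text, proofread_text)
        )
    return by_decade


if __name__ == "__main__":
    # Invented sample lines, not real OCR output.
    samples = [
        (1853, "for", "för"),
        (1868, "ar", "är"),
    ]
    for decade, counts in sorted(confusions_by_decade(samples).items()):
        print(decade, counts.most_common(5))
```

Comparing the most common pairs for the 1850s against those for 
the 1860s would show whether the typical misreadings differ 
between the two decades.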

-- 
  Lars Aronsson (lars at aronsson.se)
  Aronsson Datateknik - http://aronsson.se