[Web4lib] Google Books a tease, not a useful tool, for serious
research
Lars Aronsson
lars at aronsson.se
Fri Jul 6 13:36:09 EDT 2007
Stephen Cauffman wrote:
> Per the 'search within full text' issue: on July 3, Google Books
> released a "View plain text" feature for the public domain items,
> where you can view the OCR text. See their blog post on it
> at:
> http://booksearch.blogspot.com/2007/07/greater-access-to-public-domain-works.html
> or http://tinyurl.com/ysddmt
Oh really, less than three years after the project was announced.
It took the University of Michigan's "Making of America" longer
than that before they dared to expose the OCR text to readers.
Now we're only waiting for an "edit/proofread this page" link,
because as we all know there are tons of OCR errors, especially in
the non-English books.
Now we can start to check which OCR errors are most common in
Swedish books from the 1860s, and whether that differs from OCR
errors in Swedish books from the 1850s, so that we can give Google
hints on how to improve their OCR process. But they have already
OCRed more than two years' worth of books that they might have to redo.
And Swedish books from the 1890s or 1920s are still considered to
be under copyright, so only snippets are shown and no OCR text.
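
As a rough sketch of what such a check could look like: assuming we
had hand-proofread versions of some pages to compare against (Google
offers nothing of the kind, and the function names and sample lines
below are made up for illustration), one could align each OCR line
with its corrected version and tally the substitutions per decade.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusion_counts(ocr_text, proofread_text):
    """Count (ocr, correct) substitutions between two versions of a text."""
    counts = Counter()
    matcher = SequenceMatcher(None, ocr_text, proofread_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only substitutions are tallied; pure insertions/deletions are skipped.
        if tag == "replace":
            counts[(ocr_text[i1:i2], proofread_text[j1:j2])] += 1
    return counts


def errors_by_decade(pages):
    """pages: iterable of (year, ocr_text, proofread_text) tuples."""
    by_decade = {}
    for year, ocr_text, proofread_text in pages:
        decade = year // 10 * 10
        by_decade.setdefault(decade, Counter()).update(
            confusion_counts(ocr_text, proofread_text))
    return by_decade


# Invented sample lines, not real Google Books output.
sample = [
    (1854, "Konungen har i dag", "Konungen har i dag"),
    (1863, "Konungcn har i dag", "Konungen har i dag"),
    (1867, "en liten gosse", "en liten gosse"),
]

for decade, counts in sorted(errors_by_decade(sample).items()):
    print(decade, counts.most_common(3))
```

With real data the interesting part would be comparing the most
common pairs between decades, since typefaces and spelling changed
over that period.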
This feedback loop is a lot slower than one would expect from the
Web 2.0 decade. It's a bit faster than the five-year plans of the
Soviet Union, but not a lot.
--
Lars Aronsson (lars at aronsson.se)
Aronsson Datateknik - http://aronsson.se