[Web4lib] Google Allows Downloads of out-of-copyright Books

Thu Aug 31 15:58:08 EDT 2006

Huh?  The whole premise of Google Book Search is that it lets you find
content across a vast corpus of millions of books that have been OCRed.  If
the OCR isn't good enough to deliver the text in the PDF you download, then
the OCR isn't going to be good enough for you to perform successful
searches!

Yesterday I searched for "Michigan Agricultural College" and found the book
I needed immediately.  If the OCR had interpreted it as "Michigen
Agr1iculturol Collage" my search would have failed.  Any spelling correction
you could apply to the OCR for search purposes could be applied to the OCR
for delivering the actual text.
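
To make that concrete, here is a rough sketch of the idea in Python. It is
only an illustration, not a description of what Google actually does: the
tiny LEXICON, the 0.7 cutoff, and the helper names correct_word and
correct_text are all assumptions made up for this example. It uses the
standard library's difflib to snap each OCRed word to its closest known word.

```python
import difflib

# Hypothetical lexicon of known-good words. A real system would use a far
# larger vocabulary; this one exists only for the illustration.
LEXICON = ["michigan", "agricultural", "college"]


def correct_word(word, lexicon=LEXICON, cutoff=0.7):
    """Return the closest lexicon entry, or the lowercased word if none is close."""
    candidate = word.lower()
    matches = difflib.get_close_matches(candidate, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else candidate


def correct_text(text):
    """Apply word-level correction to a whitespace-separated OCR string."""
    return " ".join(correct_word(w) for w in text.split())


raw_ocr = "Michigen Agr1iculturol Collage"
cleaned = correct_text(raw_ocr)

# The same cleaned string can feed the search index and the delivered text.
search_index_entry = cleaned
delivered_text = cleaned
```

The point is that the cleanup step is independent of where its output goes:
whatever is good enough to index is equally available to put in the download.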

I think Google is really missing the boat by not putting full text in those
PDFs.  It greatly lowers the value of the product. All I can figure is that
Google perceives some loss of competitive advantage for themselves if they
do so.

Ironically, in my very first experience with Google Book Search / download,
it was the metadata they got wrong.  Google's index dates the book to 1850.
The correct date is 1876 -- and that was obviously OCRed correctly, because
a search including both the phrase and the date succeeds.

/rich

On 8/31/06, Jonathan Gorman <jtgorman at uiuc.edu> wrote:
>
>
> > Technologically it is
> > not that difficult.
>
> I'm skeptical of this.  I've followed OCR on and off again out of my own
> interest over the last few years.  To be able to handle nearly any book
> pre-1923 with a reasonable error rate is a bit tricky.  Even if the
> rate is really good, most projects I have seen require human
> effort and pre-processing to get these rates.  I'd be glad to be proven
> wrong here.
>
>