[Web4lib] Google Allows Downloads of out-of-copyright Books
Jonathan Gorman
jtgorman at uiuc.edu
Thu Aug 31 16:30:37 EDT 2006
Hi Richard,
If you look at the post I responded to, Deborah Kaplan says it's
technologically simple. I'm sorry, it's pretty obvious to me on re-reading
that I took that statement in a broader sense than she meant. Still,
saying it's simple assumes Google is keeping a "book-like" text
somewhere.
Google's been secretive about this whole thing. It's not even clear to me
that there exist copies of the books as sequences of OCRed pages that
could be read end to end. If Google has streamlined their process in some
way so that they keep only a) page images, b) image coordinates, and c) a
compressed index, they might not have a "technologically simple" way to
reconstruct the text.
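
To make the distinction concrete, here is a minimal sketch in Python. It says
nothing about how Google actually stores anything; the function names and the
sample sentence are made up for illustration. It only shows that an index that
records word positions can be turned back into running text, while an index
that records only which words occur cannot.

```python
from collections import defaultdict


def build_positional_index(words):
    """Map each word to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(words):
        index[word].append(pos)
    return dict(index)


def build_set_index(words):
    """Keep only which words occur; order and counts are discarded."""
    return set(words)


def reconstruct(positional_index):
    """Rebuild the original word sequence from a positional index."""
    slots = {}
    for word, positions in positional_index.items():
        for pos in positions:
            slots[pos] = word
    return [slots[i] for i in range(len(slots))]


# Hypothetical sample "page" of OCRed words.
page = "the cat sat on the mat near the door".split()

# With positions, the text can be recovered exactly.
recovered = reconstruct(build_positional_index(page))

# Without positions, a scrambled page yields the same index,
# so the original order cannot be recovered from it.
scrambled = list(reversed(page))
same_index = build_set_index(page) == build_set_index(scrambled)
```

Both kinds of index answer "does this page contain this word?", which is all a
search needs, but only the first could hand back something book-like.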
I'm not arguing Google shouldn't provide these, just making the point we
need to be careful about throwing around terms like "simple". Also that
we shouldn't dismiss an existing resource just because the provider of
that free resource (to the patron that walks into our library) didn't
finish making all of what we consider "easy" improvements.
My suspicion is that Google is not necessarily achieving a better error
rate than many others and probably doesn't want things like the error rate
to be known. Of course, nothing can compel them to reveal these things,
although I certainly hope people will try. I myself would like to be able
to access the text as well; it would provide some material for some
projects I have filed away in the back of my mind.
Again, apologies for the mis-reading,
Jonathan T. Gorman
Research Information Specialist
University of Illinois at Champaign-Urbana
216 Main Library - MC522
1408 West Gregory Drive
Urbana, IL 61801
Phone: (217) 244-4688
On Thu, 31 Aug 2006, Richard Wiggins wrote:
> Huh? The whole premise of Google Book Search is that it lets you find
> content across a vast corpus of millions of books that have been OCRed. If
> the OCR isn't good enough to deliver the text in the PDF you download, then
> the OCR isn't going to be good enough for you to perform successful
> searches!
>
> Yesterday I searched for "Michigan Agricultural College" and found the book
> I needed immediately. If the OCR had interpreted it as "Michigen
> Agr1iculturol Collage" my search would have failed. Any spelling correction
> you could apply to the OCR for search purposes could be applied to the OCR
> for delivering the actual text.
>
> I think Google is really missing the boat by not putting full text in those
> PDFs. It greatly lowers the value of the product. All I can figure is that
> Google perceives some loss of competitive advantage for themselves if they
> do so.
>
> Ironically, in my very first experience with Google Book Search / download,
> it was the metadata they got wrong. Google's index dates the book to 1850.
> The correct date is 1876 -- and that was obviously OCRed correctly, because
> a search including the phrase and the date succeeds.
>
> /rich
>
>
> On 8/31/06, Jonathan Gorman <jtgorman at uiuc.edu> wrote:
>>
>>
>> > Technologically it is
>> > not that difficult.
>>
>> I'm skeptical of this. I've followed OCR on and off again out of my own
>> interest over the last few years. To be able to handle nearly any book
>> pre-1923 with a reasonable error rate is a bit tricky. Even if the
>> rate is really good, most projects I have seen require human
>> effort and pre-processing to get these rates. I'd be glad to be proven
>> wrong here.
>>
>>
>