[Web4lib] Google Books a tease, not a useful tool, for serious research

John Fereira jaf30 at cornell.edu
Sat Jul 7 08:18:14 EDT 2007


At 10:15 AM 7/6/2007, Richard Wiggins wrote:
>Stephen,
>
>Right, gotcha.  That helps.  My beef is the PDF you download is page
>images.  Obviously if it's searchable and if they offer plain text, they
>have OCRed the entire work.  Maybe it's too hard for them to make the PDF
>searchable.
>
>Thanks,
>
>/rich
>
>
>On 7/6/07, Stephen Cauffman <SCauffman at cslib.org> wrote:
>>
>>
>>
>>
>>My main question is why the PDFs of full text books are page images, not
>>benefitting from the OCR that obviously took place.

I have worked on several projects in which books were scanned to 
produce digital versions, so I've dealt with the same questions.

If the PDFs are produced by scanning, as opposed to being born 
digital, all you can get is a photographic image of each page.  That 
page can be OCRed to produce a "text" version of the original page, 
which can then be indexed for a search engine.  For the most recent 
project I'm working on, a "position" file is also produced, which can 
provide information about where words in the text version appear on 
the scanned image.  The text version of a page will *not* be the same 
as the original page.  OCR is not 100% accurate, especially when 
you're dealing with very old publications, or pages containing photos 
or odd fonts.  Retaining all the formatting and fonts of the original 
page is not easy and would typically require a manual review of every 
page.  One site I visited indicated that Google is scanning 
3,000 books a day.  At that rate, a manual QA and structuring process 
is just not feasible.  The project that I am dealing with is going 
to be doing 5-10K books a month, and at that rate we can only do spot 
checks on the OCR accuracy.
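
To make the "position" file idea a bit more concrete, here is a minimal 
Python sketch.  The record layout (page number, word, pixel bounding 
box) and the names WordBox, build_index and find_word are my own 
assumptions for illustration; this is not the format used by Google or 
by any particular scanning project, and the sample coordinates are made 
up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """One OCRed word and where it sits on the scanned page image (hypothetical layout)."""
    page: int
    text: str
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


def build_index(boxes):
    """Map each lowercased word to every place it appears on a page image."""
    index = {}
    for box in boxes:
        index.setdefault(box.text.lower(), []).append(box)
    return index


def find_word(index, word):
    """Return the boxes for a word, or an empty list if OCR never produced it."""
    return index.get(word.lower(), [])


# Made-up sample records standing in for a real position file.
boxes = [
    WordBox(page=12, text="Harrison", x=140, y=310, width=96, height=22),
    WordBox(page=12, text="Avenue", x=244, y=310, width=80, height=22),
    WordBox(page=57, text="Harrison", x=88, y=902, width=95, height=21),
]
index = build_index(boxes)
hits = find_word(index, "harrison")
```

A viewer could use the returned boxes to highlight a search hit on the 
page image.  The lookup is only as good as the OCR: a misread word 
simply never appears in the index.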


>>It's fair enough, I suppose, if they want to stick to the claim that
>>Google
>>Book Search is a book locator tool, not a full text resource.  But the
>>book
>>I found in my example was scanned by U of California, and stamped on the
>>first page as "Discarded by Los Angeles Public Library."
>>
>>Imagine how much cooler it would be if anyone, anywhere, had access to a
>>PDF
>>in OCRed form of a book that's thousands of miles away, and discarded by
>>the
>>library that housed it.
>>
>>Anyhow, has anyone else experienced the behavior of having the same search
>>return different results in Google Book Search?   Another poster has seen
>>similar behavior with Google Patent searches.
>>
>>/rich
>>
>>
>>
>>
>>
>>On 7/6/07, Coleman, Ronald <rcoleman at ushmm.org> wrote:
>> >
>> > I always viewed Google Books as a discovery tool that could help serious
>> > researchers locate books they can use for their research.  I don't think
>> > of it as its own standalone research tool.  It doesn't replace the copy
>> > sitting on the library shelves, but it can lead you to that copy when
>> > other tools (library catalogs, journal databases, etc) cannot.  It's
>> > simply another tool in a researcher's "toolbox" of research techniques.
>> >
>> > The example I always use when working with researchers is this: Once, at
>> > the reference desk, I was approached by a historian who was looking for
>> > information on a Romanian businessman from the 1930s named Nicolae
>> > Malaxa.  The historian had already looked through all the books on the
>> > Holocaust in Romania that were in the DS135 R& section--including all
>> > the books he himself had written on the subject--and came up with
>> > nothing.  We searched JSTOR, Project Muse, and all the other usual
>> > suspects, and found nothing.  Before he walked away, I pulled up Google
>> > Books and tried searching for "Nicolae Malaxa."  The very first hit was
>> > for a book entitled "Wanted! The Search for Nazis in America," which is
>> > a title we would never have considered before.  I used the catalog to
>> > locate the book on our shelves and, sure enough, there was an entire
>> > chapter on Malaxa.  Again, Google Books didn't replace the book on the
>> > shelf; it was merely a tool we used to discover it.  In this regard,
>> > Google Books (and A9, Microsoft Live Book Search, et al) is a useful and
>> > valuable tool that augments other search techniques.
>> >
>> > Your concerns are valid and understandable, but from my perspective it
>> > seems like you are trying to make this service out to be something it is
>> > not.
>> >
>> >
>> > Ron Coleman
>> > Reference Librarian
>> > United States Holocaust Memorial Museum
>> > 100 Raoul Wallenberg Place, SW
>> > Washington, DC  20024
>> >
>> >
>> > -----Original Message-----
>> > From: web4lib-bounces at webjunction.org
>> > [mailto:web4lib-bounces at webjunction.org] On Behalf Of Richard Wiggins
>> > Sent: Friday, July 06, 2007 8:22 AM
>> > To: Web4Lib
>> > Subject: [Web4lib] Google Books a tease, not a useful tool,for serious
>> > research
>> >
>> > I think we've plumbed these troubled waters before, but my experience
>> > over
>> > the last two days has me shaking my head, wondering if Google really
>> > considers Google Book a serious research tool.
>> >
>> > To me, to be useful, a research tool needs these features:
>> >
>> > -- You must be able to cite what you find. You must be able to provide a
>> > reference that others can follow in order to retrieve exactly what you
>> > retrieved.
>> >
>> > -- You must be able to quote it.  That is, you must be able to copy text
>> > from it and paste that text into an article, an e-mail, whatever.
>> >
>> > -- You must be able to reproduce the search that found the item.
>> >
>> > -- You must be able to search within the full text.
>> >
>> > -- Others must be able to do all of these things.
>> >
>> > As a matter of sport in the last couple days I've been trying to chase
>> > down
>> > a matter of historical fact:  is the proper name of a thoroughfare in
>> > East
>> > Lansing "Harrison Road" or is it "Harrison Avenue."  This has been a fun
>> > research project worthy of History Detectives (except the subject matter
>> > is
>> > a lot more boring than their tales).
>> >
>> > Google Book Search offered some tantalizing evidence from the Michigan
>> > public laws of 1907.  What was especially cool was that the book was
>> > digitized by the University of California just this past May.
>> >
>> > Here's what's not cool:
>> >
>> > -- My first search revealed the tantalizing tidbit re the founding of
>> > East
>> > Lansing, when Harrison Avenue was a boundary of the town.
>> >
>> > --- For some reason, subsequent searches did not pull up that tidbit,
>> > but
>> > rather metadata about the volume.
>> >
>> > -- And now, unless I'm losing my mind, repeats of the same searches
>> > don't
>> > even find that volume.
>> >
>> > -- I was able to find the URL in my browser cache, in this bizarre form
>> > (not
>> > even sure it will paste)   http://books.google.com/books?id=_VUyAAAAIAAJ
>> > ....  (In my browser address bar the upper case AAAs are crossed out.)
>> >
>> > -- If you manage to locate the PDF and download it, of course you cannot
>> > search it, because it is a PDF stripped of Acrobat power; the pages are
>> > images, and not searchable.  This is a volume with 1200 pages.  Eyeball
>> > scanning for the text that Google Book Search once coughed up on screen
>> > is a
>> > waste of time and an insult.
>> >
>> > Again, I know we've covered some of this turf, but doesn't the
>> > combination
>> > of these facts destroy the value of Google Book Search as a serious
>> > research
>> > tool?
>> >
>> > Google seems to be paranoid about others mining their data.  Do they
>> > actually change search behavior to limit the number of searches for a
>> > book?
>> > If so it's obviously preventing reproducibility of research and even
>> > opening
>> > the door to denial of service.
>> >
>> > /rich
>> > _______________________________________________
>> > Web4lib mailing list
>> > Web4lib at webjunction.org
>> > http://lists.webjunction.org/web4lib/
>> >

John Fereira
jaf30 at cornell.edu
Ithaca, NY 


