[Web4lib] Google Books a tease, not a useful tool, for serious research

Sat Jul 7 09:46:35 EDT 2007

John Fereira wrote:

> At 10:15 AM 7/6/2007, Richard Wiggins wrote:
>
>> Stephen,
>>
>> Right, gotcha.  That helps.  My beef is the PDF you download is page
>> images.  Obviously if it's searchable and if they offer plain text, they
>> have OCRed the entire work.  Maybe it's too hard for them to make the PDF
>> searchable.
>>
>> Thanks,
>>
>> /rich
>>
>>
>> On 7/6/07, Stephen Cauffman <SCauffman at cslib.org> wrote:
>>
>>>
>>>
>>>
>>>
>>> My main question is why the PDFs of full text books are page images, not
>>> benefitting from the OCR that obviously took place.
>>
>
> I have worked on several projects in which books have been scanned to 
> produce digital versions of the books so I've dealt with the same 
> questions.
>
> If the PDFs are produced from scanning, as opposed to being born
> digital, all you can get is a photo image of each page.  That page can
> be OCRed, which produces a "text" version of the original page that
> can be indexed for a search engine.  For the most recent project I'm
> working on, a "position" file is also produced, which records where
> words in the text version appear on the scanned image.  The text
> version of a page will *not* be the same as the original page.  OCR
> text is not 100% accurate, especially when you're dealing with very
> old publications, or pages containing photos or odd fonts.  Retaining
> all the formatting and fonts of the original page is not easy and
> would typically require a manual review of every page.  One site I
> visited indicated that Google is scanning 3,000 books a day.  At that
> rate, a manual QA and structuring process is just not feasible.  The
> project that I am dealing with is going to be doing 5-10K books a
> month and we can only do spot checks on the OCR accuracy at that rate.

The Open Content Alliance makes searchable PDFs available through a
technique similar to what you describe. See
http://www.archive.org/details/goodytwoshoes00newyiala for an example,
then look at the PDF version (link on the left-hand side).
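
To make the "position" file idea concrete, here is a rough sketch in
Python.  The WordBox record, its field names, and the sample data are
assumptions made up for illustration; they are not the actual format
used by the Open Content Alliance, Google, or John's project.  The idea
is only that each OCRed word keeps its page number and bounding box, so
a text hit can be mapped back to a spot on the page image.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """One OCRed word and where it sits on the scanned page (hypothetical layout)."""
    page: int
    text: str
    x: int        # left edge, in image pixels
    y: int        # top edge, in image pixels
    width: int
    height: int


def normalize(word):
    """Lowercase and drop punctuation so 'Avenue,' matches 'avenue'."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


def build_index(boxes):
    """Map each normalized word to every place it occurs on the page images."""
    index = defaultdict(list)
    for box in boxes:
        key = normalize(box.text)
        if key:
            index[key].append(box)
    return index


def find_word(index, query):
    """Return the page positions for a query word, or an empty list."""
    return index.get(normalize(query), [])


# Invented sample data standing in for a real position file.
SAMPLE = [
    WordBox(1, "Harrison", 120, 340, 96, 18),
    WordBox(1, "Avenue,", 222, 340, 80, 18),
    WordBox(2, "harrison", 60, 90, 94, 18),
]

index = build_index(SAMPLE)
for hit in find_word(index, "Harrison"):
    print(f"page {hit.page}: box at ({hit.x}, {hit.y}), size {hit.width}x{hit.height}")
```

As far as I understand it, a searchable PDF built from scans works on
the same principle: the page image stays as it is and the OCR text is
laid over it invisibly at those coordinates, so the reader software can
search and highlight without the formatting having to be reconstructed.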

Cheers,

--Sebastian

>>> It's fair enough, I suppose, if they want to stick to the claim that
>>> Google Book Search is a book locator tool, not a full text resource.  But
>>> the book I found in my example was scanned by U of California, and stamped
>>> on the first page as "Discarded by Los Angeles Public Library."
>>>
>>> Imagine how much cooler it would be if anyone, anywhere, had access to a
>>> PDF in OCRed form of a book that's thousands of miles away, and discarded
>>> by the library that housed it.
>>>
>>> Anyhow, has anyone else experienced the behavior of having the same search
>>> return different results in Google Book Search?   Another poster has seen
>>> similar behavior with Google Patent searches.
>>>
>>> /rich
>>>
>>>
>>>
>>>
>>>
>>> On 7/6/07, Coleman, Ronald <rcoleman at ushmm.org> wrote:
>>> >
>>> > I always viewed Google Books as a discovery tool that could help serious
>>> > researchers locate books they can use for their research.  I don't think
>>> > of it as its own standalone research tool.  It doesn't replace the copy
>>> > sitting on the library shelves, but it can lead you to that copy when
>>> > other tools (library catalogs, journal databases, etc) cannot.  It's
>>> > simply another tool in a researcher's "toolbox" of research techniques.
>>> >
>>> > The example I always use when working with researchers is this: Once, at
>>> > the reference desk, I was approached by a historian who was looking for
>>> > information on a Romanian businessman from the 1930s named Nicolae
>>> > Malaxa.  The historian had already looked through all the books on the
>>> > Holocaust in Romania that were in the DS135 R& section--including all
>>> > the books he himself had written on the subject--and came up with
>>> > nothing.  We searched JSTOR, Project Muse, and all the other usual
>>> > suspects, and found nothing.  Before he walked away, I pulled up Google
>>> > Books and tried searching for "Nicolae Malaxa."  The very first hit was
>>> > for a book entitled "Wanted! The Search for Nazis in America," which is
>>> > a title we would never have considered before.  I used the catalog to
>>> > locate the book on our shelves and, sure enough, there was an entire
>>> > chapter on Malaxa.  Again, Google Books didn't replace the book on the
>>> > shelf; it was merely a tool we used to discover it.  In this regard,
>>> > Google Books (and A9, Microsoft Live Book Search, et al) is a useful and
>>> > valuable tool that augments other search techniques.
>>> >
>>> > Your concerns are valid and understandable, but from my perspective it
>>> > seems like you are trying to make this service out to be something it is
>>> > not.
>>> >
>>> >
>>> > Ron Coleman
>>> > Reference Librarian
>>> > United States Holocaust Memorial Museum
>>> > 100 Raoul Wallenberg Place, SW
>>> > Washington, DC  20024
>>> >
>>> >
>>> > -----Original Message-----
>>> > From: web4lib-bounces at webjunction.org
>>> > [mailto:web4lib-bounces at webjunction.org] On Behalf Of Richard Wiggins
>>> > Sent: Friday, July 06, 2007 8:22 AM
>>> > To: Web4Lib
>>> > Subject: [Web4lib] Google Books a tease, not a useful tool, for serious
>>> > research
>>> >
>>> > I think we've plumbed these troubled waters before, but my experience
>>> > over the last two days has me shaking my head, wondering if Google really
>>> > considers Google Book a serious research tool.
>>> >
>>> > To me, to be useful, a research tool needs these features:
>>> >
>>> > -- You must be able to cite what you find. You must be able to provide a
>>> > reference that others can follow in order to retrieve exactly what you
>>> > retrieved.
>>> >
>>> > -- You must be able to quote it.  That is, you must be able to copy text
>>> > from it and paste that text into an article, an e-mail, whatever.
>>> >
>>> > -- You must be able to reproduce the search that found the item.
>>> >
>>> > -- You must be able to search within the full text.
>>> >
>>> > -- Others must be able to do all of these things.
>>> >
>>> > As a matter of sport in the last couple days I've been trying to chase
>>> > down a matter of historical fact:  is the proper name of a thoroughfare
>>> > in East Lansing "Harrison Road" or is it "Harrison Avenue"?  This has
>>> > been a fun research project worthy of History Detectives (except the
>>> > subject matter is a lot more boring than their tales).
>>> >
>>> > Google Book Search offered some tantalizing evidence from the Michigan
>>> > public laws of 1907.  What was especially cool was that the book was
>>> > digitized by the University of California just this past May.
>>> >
>>> > Here's what's not cool:
>>> >
>>> > -- My first search revealed the tantalizing tidbit re the founding of
>>> > East Lansing, when Harrison Avenue was a boundary of the town.
>>> >
>>> > -- For some reason, subsequent searches did not pull up that tidbit,
>>> > but rather metadata about the volume.
>>> >
>>> > -- And now, unless I'm losing my mind, repeats of the same searches
>>> > don't even find that volume.
>>> >
>>> > -- I was able to find the URL in my browser cache, in this bizarre form
>>> > (not even sure it will paste)   http://books.google.com/books?id=_VUyAAAAIAAJ
>>> > ....  (In my browser address bar the upper case AAAs are crossed out.)
>>> >
>>> > -- If you manage to locate the PDF and download it, of course you cannot
>>> > search it, because it is a PDF stripped of Acrobat power; the pages are
>>> > images, and not searchable.  This is a volume with 1200 pages.  Eyeball
>>> > scanning for the text that Google Book Search once coughed up on screen
>>> > is a waste of time and an insult.
>>> >
>>> > Again, I know we've covered some of this turf, but doesn't the
>>> > combination of these facts destroy the value of Google Book Search as a
>>> > serious research tool?
>>> >
>>> > Google seems to be paranoid about others mining their data.  Do they
>>> > actually change search behavior to limit the number of searches for a
>>> > book?  If so it's obviously preventing reproducibility of research and
>>> > even opening the door to denial of service.
>>> >
>>> > /rich
>>> > _______________________________________________
>>> > Web4lib mailing list
>>> > Web4lib at webjunction.org
>>> > http://lists.webjunction.org/web4lib/
>>> >
>
>
> John Fereira
> jaf30 at cornell.edu
> Ithaca, NY
>
>

-- 
Sebastian Hammer, Index Data
quinn at indexdata.com   www.indexdata.com
Ph: (603) 209-6853 Fax: (866) 383-4485