[Web4lib] Google Books a tease, not a useful tool, for serious research
Sebastian Hammer
quinn at indexdata.dk
Sat Jul 7 09:46:35 EDT 2007
John Fereira wrote:
> At 10:15 AM 7/6/2007, Richard Wiggins wrote:
>
>> Stephen,
>>
>> Right, gotcha. That helps. My beef is the PDF you download is page
>> images. Obviously if it's searchable and if they offer plain text, they
>> have OCRed the entire work. Maybe it's too hard for them to make the
>> PDF searchable.
>>
>> Thanks,
>>
>> /rich
>>
>>
>> On 7/6/07, Stephen Cauffman <SCauffman at cslib.org> wrote:
>>
>>>
>>>
>>>
>>>
>>> My main question is why the PDFs of full text books are page images,
>>> not benefitting from the OCR that obviously took place.
>>
>
> I have worked on several projects in which books have been scanned to
> produce digital versions of the books so I've dealt with the same
> questions.
>
> If the PDFs are produced from scanning, as opposed to being born
> digital, all you can get is a photo image of each page. That page can
> be OCRed, which can produce a "text" version of the original page that
> can be indexed for a search engine. For the most recent project I'm
> working on, a "position" file is also produced, which can provide
> information about where words in the text version appear on the
> scanned image. The text version of a page will *not* be the same as
> the original page. OCR text is not 100% accurate, especially when you're
> dealing with very old publications, or pages containing photos or odd
> fonts. Retaining all the formatting and fonts of the original page
> is not easy and would typically require a manual review of every
> page. One site I visited indicated that Google is scanning
> 3,000 books a day. At that rate, a manual QA and structuring process
> is just not feasible. The project that I am dealing with is going to
> be doing 5-10K books a month, and we can only do spot checks on the OCR
> accuracy at that rate.
The Open Content Alliance makes searchable PDFs available through a
technique similar to what you describe. See
http://www.archive.org/details/goodytwoshoes00newyiala for an example,
then look at the PDF version (link on the left-hand side).
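
To make the "position" file idea quoted above concrete, here is a minimal
Python sketch. It assumes each OCRed word is stored with its page number
and a bounding box on the scanned image, so that a search hit in the text
can be mapped back to a region of the page image. The WordBox record, the
find_phrase function, and the sample coordinates are all hypothetical; this
is not the actual file format used by the project described above or by the
Open Content Alliance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """One OCRed word and where it sits on the scanned page (hypothetical layout)."""
    page: int
    text: str
    x: int
    y: int
    width: int
    height: int


def normalize(token: str) -> str:
    """Lower-case a token and drop punctuation so 'Avenue,' matches 'avenue'."""
    return "".join(ch for ch in token.lower() if ch.isalnum())


def find_phrase(words, phrase):
    """Return one list of WordBox entries per occurrence of the phrase.

    A match must be a run of consecutive words that all sit on the same page.
    """
    targets = [t for t in (normalize(p) for p in phrase.split()) if t]
    if not targets:
        return []
    normalized = [normalize(w.text) for w in words]
    hits = []
    for start in range(len(words) - len(targets) + 1):
        end = start + len(targets)
        window = words[start:end]
        same_page = len({w.page for w in window}) == 1
        if same_page and normalized[start:end] == targets:
            hits.append(window)
    return hits


# Made-up sample data standing in for a real position file.
SAMPLE = [
    WordBox(1, "Harrison", 120, 340, 96, 18),
    WordBox(1, "Avenue,", 222, 340, 80, 18),
    WordBox(1, "thence", 308, 340, 70, 18),
    WordBox(2, "Harrison", 90, 120, 96, 18),
    WordBox(2, "Road", 192, 120, 50, 18),
]

if __name__ == "__main__":
    for hit in find_phrase(SAMPLE, "harrison avenue"):
        for box in hit:
            print(box.page, box.text, (box.x, box.y, box.width, box.height))
```

A viewer could draw highlight rectangles from the returned boxes over the
page image. A PDF producer could use the same coordinates to place the
recognized text on top of each scanned page, which is one way an
image-based PDF can be made searchable.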
Cheers,
--Sebastian
>>> It's fair enough, I suppose, if they want to stick to the claim that
>>> Google Book Search is a book locator tool, not a full text resource.
>>> But the book I found in my example was scanned by U of California, and
>>> stamped on the first page as "Discarded by Los Angeles Public Library."
>>>
>>> Imagine how much cooler it would be if anyone, anywhere, had access to
>>> a PDF in OCRed form of a book that's thousands of miles away, and
>>> discarded by the library that housed it.
>>>
>>> Anyhow, has anyone else experienced the behavior of having the same
>>> search return different results in Google Book Search? Another poster
>>> has seen similar behavior with Google Patent searches.
>>>
>>> /rich
>>>
>>>
>>>
>>>
>>>
>>> On 7/6/07, Coleman, Ronald <rcoleman at ushmm.org> wrote:
>>> >
>>> > I always viewed Google Books as a discovery tool that could help
>>> > serious researchers locate books they can use for their research. I
>>> > don't think of it as its own standalone research tool. It doesn't
>>> > replace the copy sitting on the library shelves, but it can lead you
>>> > to that copy when other tools (library catalogs, journal databases,
>>> > etc) cannot. It's simply another tool in a researcher's "toolbox" of
>>> > research techniques.
>>> >
>>> > The example I always use when working with researchers is this: Once,
>>> > at the reference desk, I was approached by a historian who was looking
>>> > for information on a Romanian businessman from the 1930s named Nicolae
>>> > Malaxa. The historian had already looked through all the books on the
>>> > Holocaust in Romania that were in the DS135 R7 section--including all
>>> > the books he himself had written on the subject--and came up with
>>> > nothing. We searched JSTOR, Project Muse, and all the other usual
>>> > suspects, and found nothing. Before he walked away, I pulled up Google
>>> > Books and tried searching for "Nicolae Malaxa." The very first hit was
>>> > for a book entitled "Wanted! The Search for Nazis in America," which
>>> > is a title we would never have considered before. I used the catalog
>>> > to locate the book on our shelves and, sure enough, there was an
>>> > entire chapter on Malaxa. Again, Google Books didn't replace the book
>>> > on the shelf; it was merely a tool we used to discover it. In this
>>> > regard, Google Books (and A9, Microsoft Live Book Search, et al) is a
>>> > useful and valuable tool that augments other search techniques.
>>> >
>>> > Your concerns are valid and understandable, but from my perspective
>>> > it seems like you are trying to make this service out to be something
>>> > it is not.
>>> >
>>> >
>>> > Ron Coleman
>>> > Reference Librarian
>>> > United States Holocaust Memorial Museum
>>> > 100 Raoul Wallenberg Place, SW
>>> > Washington, DC 20024
>>> >
>>> >
>>> > -----Original Message-----
>>> > From: web4lib-bounces at webjunction.org
>>> > [mailto:web4lib-bounces at webjunction.org] On Behalf Of Richard Wiggins
>>> > Sent: Friday, July 06, 2007 8:22 AM
>>> > To: Web4Lib
>>> > Subject: [Web4lib] Google Books a tease, not a useful tool, for
>>> > serious research
>>> >
>>> > I think we've plumbed these troubled waters before, but my experience
>>> > over the last two days has me shaking my head, wondering if Google
>>> > really considers Google Book a serious research tool.
>>> >
>>> > To me, to be useful, a research tool needs these features:
>>> >
>>> > -- You must be able to cite what you find. You must be able to
>>> > provide a reference that others can follow in order to retrieve
>>> > exactly what you retrieved.
>>> >
>>> > -- You must be able to quote it. That is, you must be able to copy
>>> > text from it and paste that text into an article, an e-mail, whatever.
>>> >
>>> > -- You must be able to reproduce the search that found the item.
>>> >
>>> > -- You must be able to search within the full text.
>>> >
>>> > -- Others must be able to do all of these things.
>>> >
>>> > As a matter of sport in the last couple days I've been trying to chase
>>> > down a matter of historical fact: is the proper name of a thoroughfare
>>> > in East Lansing "Harrison Road" or is it "Harrison Avenue"? This has
>>> > been a fun research project worthy of History Detectives (except the
>>> > subject matter is a lot more boring than their tales).
>>> >
>>> > Google Book Search offered some tantalizing evidence from the Michigan
>>> > public laws of 1907. What was especially cool was that the book was
>>> > digitized by the University of California just this past May.
>>> >
>>> > Here's what's not cool:
>>> >
>>> > -- My first search revealed the tantalizing tidbit re the founding of
>>> > East Lansing, when Harrison Avenue was a boundary of the town.
>>> >
>>> > -- For some reason, subsequent searches did not pull up that tidbit,
>>> > but rather metadata about the volume.
>>> >
>>> > -- And now, unless I'm losing my mind, repeats of the same searches
>>> > don't even find that volume.
>>> >
>>> > -- I was able to find the URL in my browser cache, in this bizarre
>>> > form (not even sure it will paste)
>>> > http://books.google.com/books?id=_VUyAAAAIAAJ .... (In my browser
>>> > address bar the upper case AAAs are crossed out.)
>>> >
>>> > -- If you manage to locate the PDF and download it, of course you
>>> > cannot search it, because it is a PDF stripped of Acrobat power; the
>>> > pages are images, and not searchable. This is a volume with 1200
>>> > pages. Eyeball scanning for the text that Google Book Search once
>>> > coughed up on screen is a waste of time and an insult.
>>> >
>>> > Again, I know we've covered some of this turf, but doesn't the
>>> > combination of these facts destroy the value of Google Book Search as
>>> > a serious research tool?
>>> >
>>> > Google seems to be paranoid about others mining their data. Do they
>>> > actually change search behavior to limit the number of searches for a
>>> > book? If so it's obviously preventing reproducibility of research and
>>> > even opening the door to denial of service.
>>> >
>>> > /rich
>>> > _______________________________________________
>>> > Web4lib mailing list
>>> > Web4lib at webjunction.org
>>> > http://lists.webjunction.org/web4lib/
>>> >
>
>
> John Fereira
> jaf30 at cornell.edu
> Ithaca, NY
>
>
--
Sebastian Hammer, Index Data
quinn at indexdata.com www.indexdata.com
Ph: (603) 209-6853 Fax: (866) 383-4485