[Web4lib] Google Books a tease, not a useful tool, for serious research

Richard Wiggins richard.wiggins at gmail.com
Fri Jul 6 10:15:46 EDT 2007


Stephen,

Right, gotcha.  That helps.  My beef is that the PDF you download consists of
page images.  Obviously, if it's searchable and they offer plain text, they
have OCRed the entire work.  Maybe it's too hard for them to make the PDF
searchable.

Thanks,

/rich


On 7/6/07, Stephen Cauffman <SCauffman at cslib.org> wrote:
>
> Per the 'search within full text' issue: on July 3, Google Books released a
> "View plain text" feature, where you can view the OCR text, for public
> domain items. See their blog post on it at:
>
> http://booksearch.blogspot.com/2007/07/greater-access-to-public-domain-works.html
> or
> http://tinyurl.com/ysddmt
>
> Of course, there will be problems with the OCR text too...
>
> Regards,
> Steve Cauffman
>
> >>> "Richard Wiggins" <richard.wiggins at gmail.com> 7/6/2007 9:57:09 am >>>
> I understand your point.  After all, this is what Google says Google Book
> Search offers:
>
> > What is Google Book Search?
> > Search the full text of books to find ones that interest you and learn
> > where to buy or borrow them.
>
> So, arguably, I should be happy to find the book, and anything else is
> gravy.  But they also say:
>
>
> > Full view: If we've determined that a book is out of copyright, or the
> > publisher or rightsholder has given us permission, you'll be able to page
> > through the entire book from start to finish, as many times as you like. If
> > the book is in the public domain, you'll also be able to download, save and
> > print a PDF version to read at your own pace.
>
>
> My main question is why the PDFs of full text books are page images, not
> benefiting from the OCR that obviously took place.
>
> It's fair enough, I suppose, if they want to stick to the claim that Google
> Book Search is a book locator tool, not a full text resource.  But the book
> I found in my example was scanned by U of California, and stamped on the
> first page as "Discarded by Los Angeles Public Library."
>
> Imagine how much cooler it would be if anyone, anywhere, had access to a PDF
> in OCRed form of a book that's thousands of miles away, and discarded by the
> library that housed it.
>
> Anyhow, has anyone else experienced the behavior of having the same search
> return different results in Google Book Search?   Another poster has seen
> similar behavior with Google Patent searches.
>
> /rich
>
>
>
>
>
> On 7/6/07, Coleman, Ronald <rcoleman at ushmm.org> wrote:
> >
> > I always viewed Google Books as a discovery tool that could help serious
> > researchers locate books they can use for their research.  I don't think
> > of it as its own standalone research tool.  It doesn't replace the copy
> > sitting on the library shelves, but it can lead you to that copy when
> > other tools (library catalogs, journal databases, etc) cannot.  It's
> > simply another tool in a researcher's "toolbox" of research techniques.
> >
> > The example I always use when working with researchers is this: Once, at
> > the reference desk, I was approached by a historian who was looking for
> > information on a Romanian businessman from the 1930s named Nicolae
> > Malaxa.  The historian had already looked through all the books on the
> > Holocaust in Romania that were in the DS135.R7 section--including all
> > the books he himself had written on the subject--and came up with
> > nothing.  We searched JSTOR, Project Muse, and all the other usual
> > suspects, and found nothing.  Before he walked away, I pulled up Google
> > Books and tried searching for "Nicolae Malaxa."  The very first hit was
> > for a book entitled "Wanted! The Search for Nazis in America," which is
> > a title we would never have considered before.  I used the catalog to
> > locate the book on our shelves and, sure enough, there was an entire
> > chapter on Malaxa.  Again, Google Books didn't replace the book on the
> > shelf; it was merely a tool we used to discover it.  In this regard,
> > Google Books (and A9, Microsoft Live Book Search, et al) is a useful and
> > valuable tool that augments other search techniques.
> >
> > Your concerns are valid and understandable, but from my perspective it
> > seems like you are trying to make this service out to be something it is
> > not.
> >
> >
> > Ron Coleman
> > Reference Librarian
> > United States Holocaust Memorial Museum
> > 100 Raoul Wallenberg Place, SW
> > Washington, DC  20024
> >
> >
> > -----Original Message-----
> > From: web4lib-bounces at webjunction.org
> > [mailto:web4lib-bounces at webjunction.org] On Behalf Of Richard Wiggins
> > Sent: Friday, July 06, 2007 8:22 AM
> > To: Web4Lib
> > Subject: [Web4lib] Google Books a tease, not a useful tool, for serious
> > research
> >
> > I think we've plumbed these troubled waters before, but my experience over
> > the last two days has me shaking my head, wondering if Google really
> > considers Google Book a serious research tool.
> >
> > To me, to be useful, a research tool needs these features:
> >
> > -- You must be able to cite what you find. You must be able to provide a
> > reference that others can follow in order to retrieve exactly what you
> > retrieved.
> >
> > -- You must be able to quote it.  That is, you must be able to copy text
> > from it and paste that text into an article, an e-mail, whatever.
> >
> > -- You must be able to reproduce the search that found the item.
> >
> > -- You must be able to search within the full text.
> >
> > -- Others must be able to do all of these things.
> >
> > As a matter of sport in the last couple days I've been trying to chase down
> > a matter of historical fact:  is the proper name of a thoroughfare in East
> > Lansing "Harrison Road" or is it "Harrison Avenue"?  This has been a fun
> > research project worthy of History Detectives (except the subject matter is
> > a lot more boring than their tales).
> >
> > Google Book Search offered some tantalizing evidence from the Michigan
> > public laws of 1907.  What was especially cool was that the book was
> > digitized by the University of California just this past May.
> >
> > Here's what's not cool:
> >
> > -- My first search revealed the tantalizing tidbit re the founding of East
> > Lansing, when Harrison Avenue was a boundary of the town.
> >
> > -- For some reason, subsequent searches did not pull up that tidbit, but
> > rather metadata about the volume.
> >
> > -- And now, unless I'm losing my mind, repeats of the same searches don't
> > even find that volume.
> >
> > -- I was able to find the URL in my browser cache, in this bizarre form (not
> > even sure it will paste)   http://books.google.com/books?id=_VUyAAAAIAAJ
> > ....  (In my browser address bar the upper case AAAs are crossed out.)
> >
> > -- If you manage to locate the PDF and download it, of course you cannot
> > search it, because it is a PDF stripped of Acrobat power; the pages are
> > images, and not searchable.  This is a volume with 1200 pages.  Eyeball
> > scanning for the text that Google Book Search once coughed up on screen is a
> > waste of time and an insult.
> >
> > Again, I know we've covered some of this turf, but doesn't the combination
> > of these facts destroy the value of Google Book Search as a serious research
> > tool?
> >
> > Google seems to be paranoid about others mining their data.  Do they
> > actually change search behavior to limit the number of searches for a book?
> > If so it's obviously preventing reproducibility of research and even opening
> > the door to denial of service.
> >
> > /rich
> > _______________________________________________
> > Web4lib mailing list
> > Web4lib at webjunction.org
> > http://lists.webjunction.org/web4lib/
> >

