[Web4lib] Google Allows Downloads of out-of-copyright Books

Thu Aug 31 17:07:43 EDT 2006

Interesting example. If you go to page 1 you get a message saying "This 
page does not contain any text recoverable by the OCR engine." Is it 
possible that Michigan is providing OCR "on the fly"? If you go to page 
8 you get:

  Copyright, 18w,

B@ DODD, MEAD AND COMPANY,

411 r at h @umieS

@n(Wr at ft@ @rr@

5 OHN WILSON AND SON, CAMBRIDGE, U. S. A.

Here's the table of contents page:

(@t'

@ 1@ -r: @

@Je@ @3(

CONTENTS

CHAPTER PAGS

I. MATERIAL AND METHOD . . 7
II. TIME AND PLACE 20
III. MEDITATION AND IMAGINATION 34
IV. THE FIRST DELIGHT . . . 51
V. THE FEELING FOR LITERATURE 63
VI. THE BOOKS OF LIFE . . . 74
Vii. FROM THE BOOK TO THE READER 8@
VIII. BY WAY OF ILLUSTRATION . 95
IX. PERSONALITY 109
X. LIBERATION THROUGH IDEAS . 121
XI. THE LOGIC OF FREE LIFE. . 132
XII. THE IMAGINATION 143
XIII. BREADTH OF LIFE 154
XIV. RACIAL EXPRESSION . . . i65
XV. FRESHNESS OF FEELING. . . 174

And, as I suspected, the OCR has trouble with line breaks:
(p.15)
There is no
getting to the bottom of Shake
speare,
...
of Shakespeare, because it continu
ally brings to the student of his
...
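
For what it's worth, here is a rough sketch of how the split words might be 
stitched back together after the fact. It assumes you have some word list to 
check against; the function name and the tiny vocabulary are made up for 
illustration and have nothing to do with whatever Michigan or Google 
actually run. A line-final fragment is glued to the first token of the next 
line only when the result is a known word.

```python
def rejoin_split_words(lines, vocabulary):
    """Merge a line-final fragment with the next line's first token when
    the two together form a word in `vocabulary` (lowercase entries)."""
    out = []
    carry = ""  # last token of the previous line, not yet emitted
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if carry:
            # Ignore trailing punctuation and case for the lookup only.
            candidate = (carry + tokens[0]).rstrip(".,;:!?").lower()
            if candidate in vocabulary:
                tokens[0] = carry + tokens[0]   # e.g. "Shake" + "speare,"
            else:
                tokens.insert(0, carry)         # ordinary word boundary
        carry = tokens.pop()                    # hold back the last token
        out.extend(tokens)
    if carry:
        out.append(carry)
    return " ".join(out)


# Hypothetical vocabulary; a real one would be a full dictionary.
vocabulary = {"shakespeare", "continually"}
page_15 = ["There is no", "getting to the bottom of Shake", "speare,"]
print(rejoin_split_words(page_15, vocabulary))
```

A dictionary check like this will miss proper names and period spellings 
that are not in the word list, so it is a starting point at best.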

I can see where the large caps at the beginning of chapters are causing 
problems:

Chapter II.

Time and Place.

PJ' 0 get at the heart of Shakes@@
peare's plays, and to secure

(This was "To get" of course)
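
If anyone wanted to put a number on errors like this one, a character error 
rate is one common way to do it: count the single-character edits needed to 
turn the OCR output into the correct text and divide by the length of the 
correct text. The sketch below assumes you have a hand-corrected 
transcription to compare against, which none of us has for this book; the 
function names are my own.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ocr_text, correct_text):
    """Edit distance divided by the length of the correct text."""
    if not correct_text:
        raise ValueError("correct_text must not be empty")
    return levenshtein(ocr_text, correct_text) / len(correct_text)


# The drop-cap example from the start of Chapter II.
print(char_error_rate("PJ' 0 get", "To get"))
```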

Large amounts of the text (except for the line breaks) look fine, 
although I only read carefully through a couple of pages. I certainly 
don't fault the OCR program -- this book is an interesting example of 
the challenges of doing OCR on a book from, as it says, "18w". 
Undoubtedly modern texts would produce better results.

kc

Brian Sheppard wrote:
> Note that you can view the OCR'd text via Michigan's page-turner 
> interface:
>
> http://mdp.lib.umich.edu/cgi/m/mdp/pt?id=39015016881628;q1=%22great%20mass%20of%20information%22;start=1;size=25;seq=44;orient=0;view=text 
>
>
> On August 31, at 3:30 PM, Jonathan Gorman wrote:
>
>> My suspicion is that Google is not necessarily having a better 
>> rate than many others and probably doesn't want things like the error 
>> rate to be known.  Of course, nothing can compel them to reveal these 
>> things, although I certainly hope people will try.  I myself would 
>> like to be able to access text as well; it would provide some 
>> material for some projects I have filed away in the back of my mind.
>
> --------------------------------------------------
> Brian Sheppard
> University of Wisconsin Digital Collections Center
> bsheppard at library.wisc.edu    (608) 262-3349

-- 
-----------------------------------
Karen Coyle / Digital Library Consultant
kcoyle at kcoyle.net http://www.kcoyle.net
ph.: 510-540-7596
fx.: 510-848-3913
mo.: 510-435-8234
------------------------------------