[Web4lib] Google Allows Downloads of out-of-copyright Books

Brian Sheppard bsheppard at library.wisc.edu
Thu Aug 31 17:33:39 EDT 2006


That would be the book cover in that case. I imagine the text you see  
is manually inserted by the operator in cases where a cover has no  
text. Also, these aren't page numbers but physical page sequences. I  
think a preservation decision was made to include covers in the  
scans. I'm sure the OCR is stored somewhere in a fixed form.


On August 31, at 4:07 PM, Karen Coyle wrote:

> Interesting example. If you go to page 1 you get a message saying  
> "This page does not contain any text recoverable by the OCR  
> engine." Is it possible that Michigan is providing OCR "on the  
> fly?" If you go to page 8 you get:
>
>  Copyright, 18w,
>
> B@ DODD, MEAD AND COMPANY,
>
> 411 r@h @umieS
>
> @n(Wr@ft@ @rr@
>
> 5 OHN WILSON AND SON, CAMBRIDGE, U. S. A.
>
> Here's the table of contents page:
>
> (@t'
>
> @ 1@ -r: @
>
> @Je@ @3(
>
> CONTENTS
>
> CHAPTER PAGS
>
> I. MATERIAL AND METHOD . . 7
> II. TIME AND PLACE 20
> III. MEDITATION AND IMAGINATION 34
> IV. THE FIRST DELIGHT . . . 51
> V. THE FEELING FOR LITERATURE 63
> VI. THE BOOKS OF LIFE . . . 74
> Vii. FROM THE BOOK TO THE READER 8@
> VIII. BY WAY OF ILLUSTRATION . 95
> IX. PERSONALITY 109
> X. LIBERATION THROUGH IDEAS . 121
> XI. THE LOGIC OF FREE LIFE. . 132
> XII. THE IMAGINATION 143
> XIII. BREADTH OF LIFE 154
> XIV. RACIAL EXPRESSION . . . i65
> XV. FRESHNESS OF FEELING. . . 174
>
> And, as I suspected, the OCR has trouble with line breaks:
> (p.15)
> There is no
> getting to the bottom of Shake
> speare,
> ...
> of Shakespeare, because it continu
> ally brings to the student of his
> ...
>
> I can see where the large caps at the beginning of chapters are  
> causing problems:
>
> Chapter II.
>
> Time and Place.
>
> PJ' 0 get at the heart of Shakes@@
> peare's plays, and to secure
>
> (This was "To get" of course)
>
> Large amounts of the text (except for the line breaks) look fine,
> although I only read carefully through a couple of pages. I
> certainly don't fault the OCR program -- this book is an
> interesting example of the challenges of doing OCR on a book from,
> as it says, "18w". Undoubtedly modern texts would produce better
> results.
>
> kc
>
> Brian Sheppard wrote:
>> Note that you can view the OCR'd text via Michigan's page-turner  
>> interface:
>>
>> http://mdp.lib.umich.edu/cgi/m/mdp/pt?id=39015016881628;q1=%22great%20mass%20of%20information%22;start=1;size=25;seq=44;orient=0;view=text
>>
>> On August 31, at 3:30 PM, Jonathan Gorman wrote:
>>
>>> My suspicion is that Google is not necessarily having a
>>> better rate than many others and probably doesn't want things
>>> like the error rate to be known.  Of course, nothing can compel
>>> them to reveal these things, although I certainly hope people
>>> will try.  I myself would like to be able to access the text as
>>> well; it would provide some material for some projects I have
>>> filed away in the back of my mind.
>>
>> --------------------------------------------------
>> Brian Sheppard
>> University of Wisconsin Digital Collections Center
>> bsheppard at library.wisc.edu    (608) 262-3349
>
> -- 
> -----------------------------------
> Karen Coyle / Digital Library Consultant
> kcoyle at kcoyle.net http://www.kcoyle.net
> ph.: 510-540-7596
> fx.: 510-848-3913
> mo.: 510-435-8234
> ------------------------------------

--------------------------------------------------
Brian Sheppard
University of Wisconsin Digital Collections Center
bsheppard at library.wisc.edu    (608) 262-3349
