[Web4lib] Google Allows Downloads of out-of-copyright Books
Karen Coyle
kcoyle at kcoyle.net
Thu Aug 31 17:07:43 EDT 2006
Interesting example. If you go to page 1 you get a message saying "This
page does not contain any text recoverable by the OCR engine." Is it
possible that Michigan is providing OCR "on the fly"? If you go to page
8 you get:
Copyright, 18w,
B@ DODD, MEAD AND COMPANY,
411 r at h @umieS
@n(Wr at ft@ @rr@
5 OHN WILSON AND SON, CAMBRIDGE, U. S. A.
Here's the table of contents page:
(@t'
@ 1@ -r: @
@Je@ @3(
CONTENTS
CHAPTER PAGS
I. MATERIAL AND METHOD . . 7
II. TIME AND PLACE 20
III. MEDITATION AND IMAGINATION 34
IV. THE FIRST DELIGHT . . . 51
V. THE FEELING FOR LITERATURE 63
VI. THE BOOKS OF LIFE . . . 74
Vii. FROM THE BOOK TO THE READER 8@
VIII. BY WAY OF ILLUSTRATION . 95
IX. PERSONALITY 109
X. LIBERATION THROUGH IDEAS . 121
XI. THE LOGIC OF FREE LIFE. . 132
XII. THE IMAGINATION 143
XIII. BREADTH OF LIFE 154
XIV. RACIAL EXPRESSION . . . i65
XV. FRESHNESS OF FEELING. . . 174
And, as I suspected, the OCR has trouble with line breaks:
(p.15)
There is no
getting to the bottom of Shake
speare,
...
of Shakespeare, because it continu
ally brings to the student of his
...
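
For what it's worth, here is a minimal sketch of how split words like these might be rejoined after the fact. It assumes the end-of-line hyphen has already been lost (as in the samples above), so it relies on a word list to decide when two fragments belong together. The function name rejoin_split_words and the tiny vocabulary are my own illustration, not anything Michigan or Google is known to use.

```python
import re


def rejoin_split_words(lines, vocabulary):
    """Rejoin words that OCR split across a line break without a hyphen.

    `vocabulary` is a hypothetical word list supplied by the caller; a
    fragment pair is joined only when the combined word appears in it.
    """
    vocab = {word.lower() for word in vocabulary}
    out = []
    for raw in lines:
        line = raw.strip()
        if out and line:
            prev_words = out[-1].split()
            cur_words = line.split()
            if prev_words and cur_words:
                tail = prev_words[-1]
                head = cur_words[0]
                # Compare letters only, so "speare," still matches.
                letters = re.match(r"[A-Za-z]+", head)
                if tail.isalpha() and letters:
                    joined = tail + letters.group(0)
                    if joined.lower() in vocab:
                        # Move the fragment (with its punctuation) up a line.
                        prev_words[-1] = tail + head
                        out[-1] = " ".join(prev_words)
                        line = " ".join(cur_words[1:])
                        if not line:
                            continue
        out.append(line)
    return out


sample = [
    "getting to the bottom of Shake",
    "speare,",
    "of Shakespeare, because it continu",
    "ally brings to the student of his",
]
fixed = rejoin_split_words(sample, ["Shakespeare", "continually"])
print("\n".join(fixed))
```

A word list this naive will occasionally join two legitimate words that happen to form a third (say "to" and "day"), so the result would still need spot checking.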
I can see where the large caps at the beginning of chapters are causing
problems:
Chapter II.
Time and Place.
PJ' 0 get at the heart of Shakes@@
peare's plays, and to secure
(This was "To get" of course)
Large amounts of the text (except for the line breaks) look fine,
although I only read carefully through a couple of pages. I certainly
don't fault the OCR program -- this book is an interesting example of
the challenges of doing OCR on a book from, as it says, "18w".
Undoubtedly modern texts would produce better results.
kc
Brian Sheppard wrote:
> Note that you can view the OCR'd text via Michigan's page-turner
> interface:
>
> http://mdp.lib.umich.edu/cgi/m/mdp/pt?id=39015016881628;q1=%22great%20mass%20of%20information%22;start=1;size=25;seq=44;orient=0;view=text
>
>
> On August 31, at 3:30 PM, Jonathan Gorman wrote:
>
>> My suspicion is that Google is not necessarily having a better
>> rate than many others and probably doesn't want things like the error
>> rate to be known. Of course, nothing can compel them to reveal these
>> things, although I certainly hope people will try. I myself would
>> like to be able to access text as well, it would provide some
>> material for some projects I have filed away in the back of my mind.
>
> --------------------------------------------------
> Brian Sheppard
> University of Wisconsin Digital Collections Center
> bsheppard at library.wisc.edu (608) 262-3349
--
-----------------------------------
Karen Coyle / Digital Library Consultant
kcoyle at kcoyle.net http://www.kcoyle.net
ph.: 510-540-7596
fx.: 510-848-3913
mo.: 510-435-8234
------------------------------------