[WEB4LIB] Re: More on Google digitization and Europe

Thu Apr 28 17:19:49 EDT 2005

Karen Coyle wrote:
> Google is scanning texts, running them through OCR, and creating 
> indexes based on *un-corrected* OCR. So they aren't interested 
> in reproducing the text itself as text, and are using a "good 
> enough" approach to access. It's kind of a quantity vs. quality 
> approach.
> [...]
> This is not a criticism of the Google project; they've chosen 
> this method as an economic way to do something that would be 
> unaffordable otherwise. It's a simple trade-off.

Add to this that what Google does today is not necessarily an 
indication of what they will do tomorrow.

My own rough estimate is that buying an old book costs $0.01--$0.10 
per page, scanning and running OCR costs $0.10--$1.00 per page, 
and careful proofreading costs $1--$10 per page.  You could 
receive the book for free as a gift, or you could have a volunteer 
proofread it for free, but there is always a limit to how many 
such gifts you can receive.  These costs indicate how much you can 
accomplish.  With the same resources, would you rather buy a 
hundred books, scan ten or proofread one?
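
The trade-off above can be sketched in a few lines of Python. This is 
only an illustration: the per-page costs are the rough ranges quoted 
above, while the $3,000 budget and the 300 pages per book are 
hypothetical values picked so that the numbers come out round.

```python
# Per-page cost ranges in US cents (low, high), from the rough estimate above.
# Integer cents are used to avoid floating-point rounding.
COST_CENTS_PER_PAGE = {
    "buy": (1, 10),
    "scan_ocr": (10, 100),
    "proofread": (100, 1000),
}


def books_for_budget(budget_dollars, pages_per_book=300):
    """Return {activity: (fewest_books, most_books)} for a given budget.

    pages_per_book is an assumed average, not a figure from the post.
    """
    budget_cents = budget_dollars * 100
    result = {}
    for activity, (low, high) in COST_CENTS_PER_PAGE.items():
        fewest = budget_cents // (high * pages_per_book)  # expensive end
        most = budget_cents // (low * pages_per_book)     # cheap end
        result[activity] = (fewest, most)
    return result


if __name__ == "__main__":
    # Hypothetical budget of $3,000.
    for activity, (fewest, most) in books_for_budget(3000).items():
        print(f"{activity}: between {fewest} and {most} books")
```

At the expensive end of each range, this assumed budget buys a hundred 
books, scans ten, or proofreads one, which is the ratio in the question 
above.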

 * Buying: Traditional libraries apparently prioritize buying and 
   shelving over digitization.  I enjoy shopping for books, but my 
   home is so full of them that I'm holding back.  Even though 
   they are within arm's reach, their contents remain unavailable, 
   because I cannot search them.  Where did I read this...?  I 
   cannot find it.  How did people live without running water and 
   searchable books?

 * Scanning: My experience is that non-proofread OCR text combined 
   with facsimile images is useful enough for many purposes.  The 
   scanning effort would often be worthwhile even if I were the only 
   reader, but restricting myself to out-of-copyright books that I 
   can share openly makes so much more sense.  I personally 
   believe that we will live to see a copyright reform or massive 
   Creative Commons licensing that enables us to digitize and 
   share most works from the 1950s, 1960s and 1970s.

 * Proofreading: Requiring full proofreading and markup severely 
   limits how much you can digitize and publish.  Through the web 
   interface for proofreading that we have at Project Runeberg, or 
   through Distributed Proofreaders (pgdp.net), it is very easy to 
   involve hundreds of volunteers in this process.  But I think it 
   is a pity that PGDP postpones publishing until after proofing.

For a commercial (for-pay, for-profit) service, publishing 
non-proofed text is a problem because competitors can accuse 
you of poor quality.  But volunteer projects (and most libraries) 
can afford to tell their readers to stop whining and 
do the job themselves.  If anybody wants to do a fully proofread 
and marked up electronic text of a book, it is easier to start 
with a rough digital facsimile than with the paper book.  One 
prerequisite is that the raw OCR text is openly available, and 
that volunteers are given a chance to help in proofreading.  I 
don't know if Google has any plans in this direction.  Maybe they 
read Web4Lib and are getting ideas.

-- 
  Lars Aronsson (lars at aronsson.se)
  Project Runeberg - free Nordic literature - http://runeberg.org/