[WEB4LIB] Re: More on Google digitization and Europe
Lars Aronsson
lars at aronsson.se
Thu Apr 28 17:19:49 EDT 2005
Karen Coyle wrote:
> Google is scanning texts, running them through OCR, and creating
> indexes based on *un-corrected* OCR. So they aren't interested
> in reproducing the text itself as text, and are using a "good
> enough" approach to access. It's kind of a quantity vs. quality
> approach.
> [...]
> This is not a criticism of the Google project; they've chosen
> this method as an economic way to do something that would be
> unaffordable otherwise. It's a simple trade-off.
Add to this that what Google does today is not necessarily an
indication of what they will do tomorrow.
My own rough estimate is that buying an old book costs $0.01--$0.10
per page, scanning and running OCR costs $0.10--$1.00 per page,
and careful proofreading costs $1--$10 per page. You could
receive the book for free as a gift, or you could have a volunteer
proofread it for free, but there is always a limit to how many
such gifts you can receive. These costs indicate how much you can
accomplish. With the same resources, would you rather buy a
hundred books, scan ten or proofread one?
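To make the arithmetic behind that question concrete, here is a
minimal sketch in Python. The per-page ranges are my rough estimates
from above; the budget figure and the function name are made up
purely for illustration.

```python
# Per-page cost ranges in USD, taken from the rough estimate above.
COST_PER_PAGE = {
    "buy": (0.01, 0.10),
    "scan_ocr": (0.10, 1.00),
    "proofread": (1.00, 10.00),
}


def pages_for_budget(budget, activity):
    """Return (fewest, most) pages the budget covers for one activity."""
    low, high = COST_PER_PAGE[activity]
    # The high per-page cost gives the fewest pages, the low cost the most.
    return round(budget / high), round(budget / low)


BUDGET = 1000.0  # hypothetical budget in USD, chosen only for illustration

for activity in COST_PER_PAGE:
    fewest, most = pages_for_budget(BUDGET, activity)
    print(f"{activity:10s} {fewest:>7,} to {most:>7,} pages")
```

Each step up in quality costs roughly ten times more per page, so
whatever the budget, it covers about a tenth as many pages as the
step before.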
* Buying: Traditional libraries apparently prioritize buying and
shelving over digitization. I enjoy shopping for books, but my
home is so full of them that I'm holding back. Even though
they are within arm's reach, their contents are effectively
unavailable because I cannot search them. Where did I read this...? I
cannot find it. How did people live without running water and
searchable books?
* Scanning: My experience is that non-proofread OCR text combined
with facsimile images is useful enough for many purposes. The
scanning effort would often be worthwhile even if I were the only
reader, but restricting myself to out-of-copyright books that I
can share openly makes so much more sense. I personally
believe that we will live to see a copyright reform or massive
Creative Commons licensing that enables us to digitize and
share most works from the 1950s, 1960s and 1970s.
* Proofreading: Requiring full proofreading and markup severely
limits how much you can digitize and publish. Through the web
interface for proofreading that we have at Project Runeberg, or
through Distributed Proofreaders (pgdp.net), it is very easy to
involve hundreds of volunteers in this process. But I think it
is a pity that PGDP postpones publishing until after proofing.
For a commercial (for-pay, for-profit) service, publishing
non-proofread text is a problem because competitors can accuse
you of poor quality. But volunteer projects (and most libraries)
can afford to tell their readers to stop whining and
do the job themselves. If anybody wants to do a fully proofread
and marked up electronic text of a book, it is easier to start
with a rough digital facsimile than with the paper book. One
prerequisite is that the raw OCR text is openly available, and
that volunteers are given a chance to help in proofreading. I
don't know if Google has any plans in this direction. Maybe they
read Web4Lib and are getting ideas.
--
Lars Aronsson (lars at aronsson.se)
Project Runeberg - free Nordic literature - http://runeberg.org/