[Web4lib] RE: sharpening images and OCR

Robert Malesko maleskonk at yahoo.com
Wed Dec 3 01:20:40 EST 2008


Hi John,

I have two suggestions: Adobe Acrobat and Tesseract.

Adobe Acrobat: At my last position I used OCR to digitize hundreds of thousands of pages of source material of widely varying quality. My software choice was limited by the powers that be. We used the OCR engines built into Adobe Acrobat versions 7 and 8. Version 7 was good enough, and version 8 was a noticeable improvement (mostly in speed). It looks like version 9 is out now.


Tesseract: The fact that you're starting with TIFF images caught my eye. The Tesseract OCR engine is a possible candidate for you. Originally developed by HP, it's now under development by Google. It's open-source software, so it's free; all you have to lose for your efforts is time. It's been on my radar ever since it was open-sourced two years ago.

http://code.google.com/p/tesseract-ocr/

It's a robust OCR engine, but it does have some limits. For starters, because it's command-line only (no GUI), you might find it less user-friendly than commercial packages. Also:

"Tesseract is a raw OCR engine. It has no document layout analysis, no output formatting, and no graphical user interface. It only processes a TIFF image of a single column and creates text from it."

The Wikipedia article from which that quote comes has some good background info:

http://en.wikipedia.org/wiki/Tesseract_(software)
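
Since Tesseract is command-line only, a batch of TIFFs would probably be run through a small script. Here is a minimal sketch in Python, assuming the basic "tesseract <image> <outputbase>" invocation, which writes the recognized text to <outputbase>.txt. The function names and the file names are hypothetical, and I haven't tuned this for poor-quality microfilm scans.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base):
    """Return the argument list for: tesseract <image> <output_base>."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_tiff(image_path, output_base):
    """Run Tesseract on one TIFF and return the recognized text.

    Returns None if no 'tesseract' executable is found on the PATH.
    """
    if shutil.which("tesseract") is None:
        return None
    # Tesseract writes its result to <output_base>.txt
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical file name; substitute one of your own single-column TIFFs.
    text = ocr_tiff("page001.tif", "page001")
    print(text if text is not None else "tesseract not found on PATH")
```

Because the engine expects a single column, each newspaper article would likely need to be cropped to one column before it is passed in.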

Hope that helps!

-Rob Malesko


Hi,

We have just captured photographs from old newspapers that are on
microfilm. We are about to put these images on the Web. What software is
best for enhancing these images? Sometimes, there are black lines going
down through the image.

A second, related question is this: we are hoping to OCR some articles
from a TIFF image thus created. What OCR package costing less than
2,000 dollars might be best for this task? I suspect that none of them
will be good enough because the original newspaper was in such a poor
state when microfilmed, but I thought I would investigate it anyway.

Any advice would be much appreciated.

 

Regards John




      

