[Web4lib] RE: sharpening images and OCR

Thu Dec 4 12:59:47 EST 2008

Regarding Tesseract's limitations: you should also pay attention to
Ocropus, which provides the document layout analysis and other pieces
that Tesseract lacks, and uses Tesseract as its OCR engine (though it's
designed to allow other engines to be plugged in). It's still in its
early stages (version 0.3.1) but shows a lot of promise.

http://code.google.com/p/ocropus/

As always, judge an open source project by its community. Ocropus shows
a lot of activity by several developers, so it looks pretty healthy:

http://code.google.com/p/ocropus/updates/list

That said, they've fallen behind their release plans:

http://sites.google.com/site/ocropus/roadmap
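
For anyone who wants to try the raw Tesseract engine on a batch of TIFFs
in the meantime, here is a minimal sketch of how the command-line tool
might be scripted. It assumes a `tesseract` binary is on the PATH and
accepts the basic `tesseract <image> <output base>` form; the function
names are hypothetical helpers of my own, not part of either project.

```python
import subprocess
from pathlib import Path


def tesseract_command(image: Path, out_dir: Path) -> list[str]:
    """Build the command line for one page image.

    Assumes the basic form `tesseract <image> <output base>`, where
    Tesseract itself appends ".txt" to the output base.
    """
    out_base = out_dir / image.stem
    return ["tesseract", str(image), str(out_base)]


def find_tiffs(folder: Path) -> list[Path]:
    """Return the .tif/.tiff files in a folder, sorted by name."""
    return sorted(p for p in folder.iterdir()
                  if p.suffix.lower() in {".tif", ".tiff"})


def ocr_folder(folder: Path, out_dir: Path) -> None:
    """Run Tesseract once per TIFF and report any page that fails."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for image in find_tiffs(folder):
        result = subprocess.run(tesseract_command(image, out_dir))
        if result.returncode != 0:
            print(f"tesseract failed on {image}")
```

Calling `ocr_folder(Path("scans"), Path("ocr_text"))` would then write
one text file per page image. Since the engine does no layout analysis,
multi-column newspaper pages would presumably need to be cropped to
single columns first.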

Peter

> -----Original Message-----
> From: web4lib-bounces at webjunction.org 
> [mailto:web4lib-bounces at webjunction.org] On Behalf Of Robert Malesko
> Sent: Tuesday, December 02, 2008 11:21 PM
> To: web4lib at webjunction.org
> Subject: [Web4lib] RE: sharpening images and OCR
> 
> Hi John,
> 
> I have two suggestions: Adobe Acrobat and Tesseract.
> 
> Adobe Acrobat: At my last position I used OCR to digitize 
> hundreds of thousands of pages of widely varying quality 
> source material. My software choice was limited by the powers 
> that be. We used the OCR engines built into Adobe Acrobat 
> versions 7 and 8. Version 7 was good enough, and Version 8 
> was a noticeable improvement (mostly in speed). Looks like 
> version 9 is out now.
> 
> 
> Tesseract: The fact that you're starting with TIFF images 
> caught my eye. The Tesseract OCR engine is a possible candidate 
> for you. Originally developed by HP, it's now under 
> development by Google. It's open-source software, so it's 
> free; all you have to lose for your efforts is time. It's 
> been on my radar ever since it was open-sourced two years ago.
> 
> http://code.google.com/p/tesseract-ocr/
> 
> It's a robust OCR engine, but it does have some limits. For 
> starters, because it's command-line only (no GUI), you might 
> find it less user-friendly than commercial packages. Also:
> 
> "Tesseract is a raw OCR engine. It has no document layout 
> analysis, no output formatting, and no graphical user 
> interface. It only processes a TIFF image of a single column 
> and creates text from it."
> 
> The Wikipedia article from which that quote comes has some 
> good background info:
> 
> http://en.wikipedia.org/wiki/Tesseract_(software)
> 
> Hope that helps!
> 
> -Rob Malesko
> 
> 
> Hi,
> 
> We have just captured photographs from old newspapers that 
> are on microfilm. We are about to put these images on the 
> Web. What software is best for enhancing these images? 
> Sometimes, there are black lines going down through the image.
> 
> A second, related question is this: we are hoping to OCR some 
> articles from the TIFF images thus created. What OCR package 
> costing less than 2,000 dollars might be best for this task? 
> I suspect that none of them will be good enough because the 
> original newspaper was in such a poor state when microfilmed, 
> but I thought I would investigate it anyway.
> 
> Any advice would be much appreciated.
> 
>  
> 
> Regards, John
> 