[Web4lib] Extracting images from PDF?

Jonathan Gorman jtgorman at uiuc.edu
Thu Dec 15 09:59:13 EST 2005


Hi Bob,


Was just thinking about this again, as it reminded me of some of the 
personal CDs I have lying around that I really should put into a more 
convenient form for my own reading.

Some observations:

I believe ImageMagick works on the principle that it takes the "rendered" 
page and creates an image from that.  It's a similar process to taking a 
screenshot and saving that.
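
A rough sketch of that rendering approach, driven from Python. I'm 
assuming ImageMagick's "convert" command is on your PATH and that it 
can read PDFs (it normally hands them to Ghostscript). The function 
names, the 300 dpi default and the output pattern are just my 
illustrative choices, not anything from your setup.

```python
import subprocess


def render_command(pdf_path, out_pattern="page-%03d.png", dpi=300):
    """Build an ImageMagick command that rasterizes every page of a PDF.

    The names and defaults here are illustrative assumptions.
    """
    # -density goes before the input file so it is applied while the
    # PDF is being rendered, not to the already-rendered bitmap.
    return ["convert", "-density", str(dpi), pdf_path, out_pattern]


def render_pages(pdf_path, out_pattern="page-%03d.png", dpi=300):
    """Run the command; raises CalledProcessError if convert fails."""
    subprocess.run(render_command(pdf_path, out_pattern, dpi), check=True)


if __name__ == "__main__":
    # "scan.pdf" is a placeholder file name.
    print(" ".join(render_command("scan.pdf")))
```

The output resolution is whatever density you ask for, regardless of 
the resolution of the scans stored inside the PDF.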

pdfimages, I believe, examines the PDF document and extracts the embedded 
images, although this won't work for native vector-drawn graphics in a 
PDF. Probably not an issue for you though.  A rough analogy would be a 
program that could pull out any image you had pasted into a Word 
document. It's not going to get the little boxes or lines that you draw 
in Word itself.  I doubt that there was any OCR done either, from your 
description of the problem, so transferring the text isn't an issue 
either.
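
For comparison, a minimal sketch of the extraction approach from 
Python. I'm assuming the "pdfimages" tool (it ships with Xpdf and 
Poppler) is on your PATH. The function names and the "img" prefix are 
made up for illustration.

```python
import subprocess


def extract_command(pdf_path, out_prefix="img", keep_jpeg=True):
    """Build a pdfimages command that pulls embedded images from a PDF.

    The names and defaults here are illustrative assumptions.
    """
    cmd = ["pdfimages"]
    if keep_jpeg:
        # -j writes images stored as JPEG in the PDF out as .jpg files
        # instead of converting them to PPM/PBM.
        cmd.append("-j")
    cmd += [pdf_path, out_prefix]
    return cmd


def extract_images(pdf_path, out_prefix="img", keep_jpeg=True):
    """Run the command; raises CalledProcessError if pdfimages fails."""
    subprocess.run(extract_command(pdf_path, out_prefix, keep_jpeg), check=True)


if __name__ == "__main__":
    # "scan.pdf" is a placeholder file name.
    print(" ".join(extract_command("scan.pdf")))
```

The files come out numbered after the prefix (img-000, img-001 and so 
on), one per embedded image rather than one per page.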

So I guess one more issue is the quality of the images.  If you want to 
ensure that the quality is as good as the original, you'll probably want 
to go with something along the lines of the second method, where it 
literally extracts the images.  If "good enough" quality is all you 
need, either approach works.


Jonathan T. Gorman
Visiting Research Information Specialist
University of Illinois at Urbana-Champaign
216 Main Library - MC522
1408 West Gregory Drive
Urbana, IL 61801
Phone: (217) 244-4688


