[Web4lib] Extracting images from PDF?
Jonathan Gorman
jtgorman at uiuc.edu
Thu Dec 15 09:59:13 EST 2005
Hi Bob,
I was just thinking about this again, as it reminded me of some of the
personal CDs I have lying around that I really should put into a more
convenient form for my own reading.
Some observations:
I believe ImageMagick works on the principle that it takes the "rendered"
page and creates an image from that. It's a similar process to taking a
screenshot and saving that.
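A rough sketch of that rendering approach, assuming ImageMagick's classic
"convert" command is on your PATH and Ghostscript is installed so it can
read PDFs. The file names, the 300 dpi density, and the helper names
render_command/render_pages are my own illustrative choices, not anything
from your setup:

```python
import subprocess


def render_command(pdf_path, out_pattern="page-%03d.png", density=300):
    """Build an ImageMagick command that rasterizes every page of a PDF.

    -density sets the rendering resolution in dpi; it is placed before the
    input file so that it applies when the PDF is read.
    The %03d in the output name is filled in with the page index.
    """
    return ["convert", "-density", str(density), pdf_path, out_pattern]


def render_pages(pdf_path, **kwargs):
    # Runs the command; raises CalledProcessError if convert fails.
    subprocess.run(render_command(pdf_path, **kwargs), check=True)
```

Calling render_pages("scan.pdf") should leave one PNG per page. Since the
pages are re-rendered, a higher density gives larger files but cannot add
detail beyond what the embedded scans already hold.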
pdfimages, I believe, examines the PDF document and extracts the embedded
images, although this won't work for native vector-drawn images in a PDF.
Probably not an issue for you though. A rough analogy would be a program
that could extract any image out of a Word document that you cut and paste
into it. It's not going to get the little boxes or lines that you draw
into Word. From your description of the problem, I doubt that any OCR
was done either, so carrying over the text isn't an issue.
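A minimal sketch of the extraction approach, assuming the pdfimages tool
(from xpdf or poppler) is installed. The helper names and the "img" output
prefix are hypothetical; the -j flag asks pdfimages to write JPEG-encoded
images out as .jpg files instead of converting them to PPM:

```python
import subprocess


def extract_command(pdf_path, image_root="img", keep_jpeg=True):
    """Build a pdfimages command that pulls embedded images out of a PDF.

    Output files are named <image_root>-NNN.<ext>, one per embedded image.
    """
    args = ["pdfimages"]
    if keep_jpeg:
        # Save DCT (JPEG) images as .jpg rather than PPM.
        args.append("-j")
    args += [pdf_path, image_root]
    return args


def extract_images(pdf_path, **kwargs):
    # Runs the command; raises CalledProcessError if pdfimages fails.
    subprocess.run(extract_command(pdf_path, **kwargs), check=True)
```

Calling extract_images("scan.pdf") should write img-000, img-001, and so
on into the current directory, skipping anything drawn as vector graphics.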
So I guess one more issue is the quality of the images. If you want to
ensure that the quality is as good as the original, you'll probably want
to go with the second method, which literally extracts the embedded
images rather than re-rendering them. If "good enough" quality is all you
need, either approach works.
Jonathan T. Gorman
Visiting Research Information Specialist
University of Illinois at Champaign-Urbana
216 Main Library - MC522
1408 West Gregory Drive
Urbana, IL 61801
Phone: (217) 244-4688