[Web4lib] Extracting images from PDF?

Jonathan Gorman jtgorman at uiuc.edu
Wed Dec 14 16:16:24 EST 2005


Can't say I have any "real world experience", but some programs I'd 
look into are xpdf/pdfimages (also pdftohtml, which uses these) and 
perhaps ImageMagick.  I'm not sure what the state of the Windows ports 
of these programs is.

Do you know what type of images are contained in the PDF file?  Are they 
actually TIFF, JPEG, or something else?

I guess my first approach to the problem would be to process each issue 
through the conversion programs, extract the images, and then just copy 
the "last" image of each issue.  Of course, this depends on the naming 
scheme for issues and the like.  The person would need a little bit of 
knowledge of scripting.  Given a reasonable setup it shouldn't be too 
difficult.
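
A rough sketch of that approach in Python, untested against real issues. 
It assumes pdfimages (from xpdf) is on the PATH, that its -j option 
writes JPEG-encoded images out as .jpg files, and that it names its output 
<root>-NNN.<ext> with NNN counting up through the document.  The function 
names and directory arguments are made up for illustration.

```python
import re
import shutil
import subprocess
from pathlib import Path

# pdfimages names its output <root>-NNN.<ext>; capture the NNN counter.
_INDEX = re.compile(r"-(\d+)\.[A-Za-z0-9]+$")


def last_image(paths):
    """Return the path with the highest pdfimages counter, or None."""
    numbered = []
    for p in paths:
        match = _INDEX.search(Path(p).name)
        if match:
            numbered.append((int(match.group(1)), str(p)))
    return max(numbered)[1] if numbered else None


def extract_last_image(pdf_path, work_dir, dest_dir):
    """Run pdfimages on one issue and copy its last image to dest_dir.

    Hypothetical helper: work_dir holds every extracted image, dest_dir
    receives one file per issue, named after the PDF.
    """
    pdf_path, work_dir, dest_dir = Path(pdf_path), Path(work_dir), Path(dest_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Image root: pdfimages appends -000, -001, ... and an extension.
    root = work_dir / pdf_path.stem
    subprocess.run(["pdfimages", "-j", str(pdf_path), str(root)], check=True)

    last = last_image(work_dir.glob(pdf_path.stem + "-*"))
    if last is None:
        return None  # the PDF contained no extractable images
    target = dest_dir / (pdf_path.stem + Path(last).suffix)
    shutil.copy(last, target)
    return target
```

Looping extract_last_image over something like Path("issues").glob("*.pdf") 
would then handle a whole run of issues, though whether the "last" image 
is really the one wanted depends on how each PDF was put together.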

Sounds like an interesting problem.

Jon Gorman


