[Web4lib] Extracting images from PDF?

John Fereira jaf30 at cornell.edu
Fri Dec 16 07:00:45 EST 2005


At 07:59 PM 12/14/2005, Araby Y Greene wrote:
>This has worked for me in the past, at least with individual PDF
>documents and Acrobat 6:
>
>Open the PDF in Adobe Acrobat
>Select Advanced from the top menu bar
>Export all images...
>   Select your format (jpg, tif, png, JPEG 2000)
>   Select your settings
>
>Acrobat outputs the images using the basic filename plus page number and
>image number.

This process might work well if you've only got a handful of PDFs to
handle.  If you've got several hundred or several thousand, an
automated approach might be needed.  If you can write Java code, I've
had good success with a Java API called iText, though I've used it to
go the other way around.  Last week I wrote a little program with the
API (it took about an hour) to build PDF files from a collection of
TIFF images.  From six CDs of scanned documents containing over
20,000 images, I produced 2,200 PDF documents.
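
A minimal sketch of the batching half of such a job, using only the
standard Java library.  It is an illustration, not the program
described above: the class name TiffBatchPlanner, the method
groupByDocument, and the "<documentId>_<pageNumber>.tif" naming scheme
are all assumptions made for the example.  The PDF-writing step is
left as a comment, since it depends on the library and version used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: decides which scanned images belong in which PDF.
class TiffBatchPlanner {

    // Assumed naming scheme: "<documentId>_<pageNumber>.tif",
    // e.g. "report12_003.tif". Returns documentId -> page files, sorted.
    static Map<String, List<String>> groupByDocument(List<String> fileNames) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String name : fileNames) {
            String lower = name.toLowerCase();
            if (!lower.endsWith(".tif") && !lower.endsWith(".tiff")) {
                continue; // skip anything that is not a TIFF
            }
            int cut = name.lastIndexOf('_');
            if (cut <= 0) {
                continue; // does not match the assumed naming scheme
            }
            String docId = name.substring(0, cut);
            groups.computeIfAbsent(docId, k -> new ArrayList<>()).add(name);
        }
        for (List<String> pages : groups.values()) {
            // Plain text sort; correct only if page numbers are zero-padded.
            Collections.sort(pages);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        Collections.addAll(names, "b_002.tif", "a_001.tif", "b_001.tif", "notes.txt");
        for (Map.Entry<String, List<String>> e : groupByDocument(names).entrySet()) {
            // A PDF library (iText, for instance) would write one PDF per
            // entry here, adding each listed image as a page.
            System.out.println(e.getKey() + ".pdf <- " + e.getValue());
        }
    }
}
```

Keeping the grouping separate from the PDF writing means the plan can
be checked against a directory listing before any files are produced.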


>You could also export single images (from the messed up pages) by using
>the Select Image tool, then right click and Save As (.bmp or .jpg), but
>you don't have all the nice format options that are available with
>"Export all images."
>
>-Araby
>__________________________
>Araby Greene
>Web Development Librarian
>Getchell Library/322
>University of Nevada, Reno
>http://www.library.unr.edu/
>araby at unr.edu
>775.784.6500 x343
>
>      /|
>   \'o.O'
>   =(___)=
>     U
>ACK! THPTPHH!
>
>
>
>
> > -----Original Message-----
> > From: web4lib-bounces at webjunction.org
> > [mailto:web4lib-bounces at webjunction.org] On Behalf Of Robert Sullivan
> > Sent: Wednesday, December 14, 2005 12:48 PM
> > To: web4lib at webjunction.org; genealib at lists.acomp.usf.edu
> > Subject: [Web4lib] Extracting images from PDF?
> >
> > A local labor organization had some old newsletters scanned and
> > presented us with 4 CDs of PDFs, with each issue a separate file.
> >
> > This would be great, except that they were scanned from a bound volume
> > 2 pages at a time, so any given file will contain the last page of the
> > previous issue and be missing the last page of the issue named.  This
> > portends some amount of patron confusion.
> >
> > I'm considering trying to take these images apart and reassemble them
> > in a more useful way.  We have Acrobat 7, but our graphics staff uses
> > it at a fairly low level so they can't help me.  I have found
> > references to software which will let you save images from PDFs as
> > TIFFs, but I was hoping for some real world experience.
> >
> > Thanks for any advice on the least painful way to handle this,
> >
> > --
> > Bob Sullivan
> > Schenectady Digital History Archive
> > <http://www.schenectadyhistory.org/>
> > Schenectady County (NY) Public Library
> > _______________________________________________
> > Web4lib mailing list
> > Web4lib at webjunction.org
> > http://lists.webjunction.org/web4lib/

John Fereira
jaf30 at cornell.edu
Ithaca, NY  


