[Web4lib] capturing data from Amazon in an Access database

Will Kurt wkurt at bbn.com
Tue Jul 3 11:30:42 EDT 2007


While the authors, titles, etc. aren't indicated 
with XML tags, the DOM seems to remain pretty 
consistent and doesn't seem too difficult to parse.
I threw together a quick proof-of-concept Python 
script in about half an hour that grabs the author, 
title, price and ISBN. I only tested it against a 
few random pages, but it seems to work pretty 
well. You'll need the BeautifulSoup module to run 
the script: http://www.crummy.com/software/BeautifulSoup/

I've attached the AmazonInfoGrabber.py file so 
you (or anyone else) can feel free to edit it as 
you see fit, or create something similar. You 
can just plug in any book URL (it doesn't 
work with other media for now) and it should spit out the right info.
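
For anyone who wants to go from a book URL to its ISBN without fetching the page, here is a minimal sketch (Python 3, standard library only). It assumes the ten-character code after "/dp/" in the URL is the book's ISBN-10, which holds for the example URL in the attached script but is an assumption in general. The helper name isbn_from_url is made up for illustration.

```python
import re

# Assumption: book URLs carry a 10-character ISBN-10 right after "/dp/".
_DP_PATTERN = re.compile(r"/dp/(\d{9}[\dX])(?:[/?]|$)")


def isbn_from_url(url):
    """Return the ISBN-10 embedded in a book URL, or None if none is found."""
    match = _DP_PATTERN.search(url)
    return match.group(1) if match else None


url = "http://www.amazon.com/Divine-Comedy-Purgatorio-Paradiso-Everymans/dp/0679433139/"
print(isbn_from_url(url))
```

A URL without a "/dp/" segment simply gives None, so the caller can decide whether to skip it or ask for the ISBN directly.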

It's sort of ugly, but I was just curious as to 
how difficult it would be to parse the Amazon 
pages, and I figured the results might be helpful 
to someone out there. With a few small edits, this 
could easily take the URL (or ISBN) as user input 
and return an Excel file or something similar.
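
One way the "Excel file or something similar" could look: a minimal Python 3 sketch that writes the four fields as a comma-delimited file, which Excel and Access can both import. The function name, the column names and the sample record are placeholders made up for illustration; the real values would come from the scraper.

```python
import csv
import io

# Assumed column names, matching the four fields the script grabs.
FIELDS = ["title", "author", "price", "isbn"]


def write_books_csv(records, out_file):
    """Write a list of dicts (one per book) as CSV with a header row."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


# Placeholder record, not real scraped data.
sample = [{"title": "Example Title, Vol. 1", "author": "A. N. Author",
           "price": "$0.00", "isbn": "0000000000"}]

buffer = io.StringIO()
write_books_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead of memory, pass a file opened with open("books.csv", "w", newline="") in place of the StringIO buffer. The csv module quotes any field that contains a comma, so titles like the one above survive the round trip.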

Although the AWS route (if there is one) would be more stable.

--Will

At 05:36 AM 7/3/2007, John Fitzgibbon wrote:
>Hi,
>
>In the old days when we compiled order lists we used Whitaker's Books in
>Print on CD-ROM and this allowed us to save a record into a text file
>with fields delimited by commas and records delimited by carriage
>returns. I believe that the US equivalent, Bowker, offered the same
>functionality.
>
>Now, we are using Amazon more and more. However, when we find a record
>in Amazon there is no easy way of saving it into Access or Excel.
>
>I don't think XML along with XSLT can be used because authors, titles,
>and prices are not indicated with any special tags on the 'wish list' or
>'shopping cart' pages of Amazon. For example, authors are not delineated
>with <span class='author'> or some such tag.
>
>Also, there is no RSS feed which would provide us with a ready-made XML
>page of the results.
>
>Copying and pasting into Excel seems very cumbersome.
>
>When I first considered some solutions, the problem seemed
>very trivial, but now that these solutions won't work, I am at a loss.
>
>If only more web pages were based on XML...
>
>I would welcome any ideas.
>
>Regards
>John
>
>John Fitzgibbon
>
>p: 00 353 91 562471
>f: 00 353 91 565039
>w: http://www.galwaylibrary.ie
>
-------------- next part --------------
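# Python 2 script; needs the BeautifulSoup 3 module.
from BeautifulSoup import BeautifulSoup
import urllib

# Plug in any Amazon book URL here (other media won't work for now).
url = "http://www.amazon.com/Divine-Comedy-Purgatorio-Paradiso-Everymans/dp/0679433139/"
htmlSource = urllib.urlopen(url)
soup = BeautifulSoup(htmlSource)

# Title and author are read from the third <div class="buying"> in the body.
authorTitleChunk = soup.body.findAll("div",{"class" : "buying"})[2]
bookTitle = authorTitleChunk.find("b",{"class":"sans"}).contents[0]
bookAuthor = authorTitleChunk.a.contents[0]

# Price is the <b class="price"> inside the "product" table of the priceBlock div.
price = soup.find("div",{"id":"priceBlock"}).find("table",{"class":"product"}).find("b",{"class":"price"}).contents[0]

# ISBN is the text after the bold label in the fourth <li> of the first "bucket" list.
bookDescription = soup.find("td",{"class":"bucket"}).find("div",{"class":"content"}).ul
isbn = bookDescription.findAll("li")[3].b.next.next

print bookTitle
print bookAuthor
print price
print isbn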

