[Web4lib] capturing data from Amazon in an Access database
Will Kurt
wkurt at bbn.com
Tue Jul 3 11:30:42 EDT 2007
While the authors, titles, etc. aren't indicated
with XML tags, the DOM seems to remain pretty
consistent and doesn't seem too difficult to parse.
I threw together a quick proof-of-concept Python
script in about half an hour that grabs the author,
title, price and ISBN. I only tested it against a
few random pages, but it seems to work pretty
well. You'll need the BeautifulSoup module to run
the script: http://www.crummy.com/software/BeautifulSoup/
I've attached the AmazonInfoGrabber.py file so
you (or anyone else) can feel free to edit it as
you see fit, or create something similar. You
can just plug in any book URL (it doesn't
work with other media for now) and it should spit out the right info.
It's sort of ugly, but I was just curious
how difficult it would be to parse the Amazon
pages, and I figured the results might be helpful
to someone out there. With a few small edits this
could easily take the URL (or ISBN) as user input
and write out an Excel-friendly CSV file or something
similar (a rough sketch of that is below). The AWS
route (if there is one) would be more stable, though.
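For what it's worth, here is a rough, untested sketch of that idea (Python 2,
like the attached script). It assumes the parsing code from AmazonInfoGrabber.py
has been wrapped in a hypothetical get_book_info(url) function returning the
title, author, price and ISBN; it takes the book URL on the command line and
appends one record to a CSV file that Excel or Access can import:

import csv
import sys

def append_book_row(url, csv_path="amazon_books.csv"):
    # get_book_info() is assumed to wrap the BeautifulSoup parsing from
    # AmazonInfoGrabber.py and return (title, author, price, isbn)
    title, author, price, isbn = get_book_info(url)
    # Encode to UTF-8 so the Python 2 csv module can write the values
    row = [unicode(v).encode("utf-8") for v in (title, author, price, isbn)]
    out = open(csv_path, "ab")
    csv.writer(out).writerow(row)
    out.close()

if __name__ == "__main__":
    # Usage: python amazon_to_csv.py <amazon book URL>
    append_book_row(sys.argv[1])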
--Will
At 05:36 AM 7/3/2007, John Fitzgibbon wrote:
>Hi,
>
>In the old days when we compiled order lists we used Whitaker's Books in
>Print on CD-ROM, and this allowed us to save a record into a text file
>with fields delimited by commas and records delimited by carriage
>returns. I believe that the US equivalent, Bowker, offered the same
>functionality.
>
>Now, we are using Amazon more and more. However, when we find a record
>in Amazon there is no easy way of saving it into Access or Excel.
>
>I don't think XML along with XSLT can be used because authors, titles,
>and prices are not indicated with any special tags on the 'wish list' or
>'shopping cart' pages of Amazon. For example, authors are not delineated
>with <span class='author'> or some such tag.
>
>Also, there is no RSS feed that would provide us with a ready-made XML
>page of the results.
>
>Copying and pasting into Excel seems very cumbersome.
>
>When I first considered the problem, some solutions seemed very
>trivial, but now that those solutions won't work, I am at a loss.
>
>If only more web pages were based on XML...
>
>I would welcome any ideas.
>
>Regards
>John
>
>John Fitzgibbon
>
>p: 00 353 91 562471
>f: 00 353 91 565039
>w: http://www.galwaylibrary.ie
>
>_______________________________________________
>Web4lib mailing list
>Web4lib at webjunction.org
>http://lists.webjunction.org/web4lib/
-------------- next part --------------
# AmazonInfoGrabber.py -- proof-of-concept scraper for Amazon book pages
# Requires Python 2 and BeautifulSoup 3 (http://www.crummy.com/software/BeautifulSoup/)
from BeautifulSoup import BeautifulSoup
import urllib

# Plug any Amazon book URL in here
url = "http://www.amazon.com/Divine-Comedy-Purgatorio-Paradiso-Everymans/dp/0679433139/"

# Fetch the page and parse it into a navigable tree
htmlSource = urllib.urlopen(url).read()
soup = BeautifulSoup(htmlSource)

# The third "buying" div holds the title/author block
authorTitleChunk = soup.body.findAll("div", {"class": "buying"})[2]
bookTitle = authorTitleChunk.find("b", {"class": "sans"}).contents[0]
bookAuthor = authorTitleChunk.a.contents[0]

# Price sits in its own block; the ISBN is the fourth item in the product-details list
price = soup.find("div", {"id": "priceBlock"}).find("table", {"class": "product"}).find("b", {"class": "price"}).contents[0]
bookDescription = soup.find("td", {"class": "bucket"}).find("div", {"class": "content"}).ul
isbn = bookDescription.findAll("li")[3].b.next.next

print bookTitle
print bookAuthor
print price
print isbn
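To try the attached script, install BeautifulSoup, edit the url variable to
point at the book page you want, and run it under Python 2
(python AmazonInfoGrabber.py); it should print the title, author, price and
ISBN, one per line.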