[Web4lib] new book list script
Will Kurt
wkurt at bbn.com
Thu May 17 12:41:33 EDT 2007
I'm actually in the middle of working on a project myself using
Python for some screen scraping. All I use is the 'urllib2' module
to grab the pages and a really excellent HTML/XML parsing library,
BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
The documentation for Beautiful Soup is not too long or difficult.
Additionally, for bad XML etc., BeautifulSoup's parser attempts to
repair errors, and its nice 'prettify' method returns the result as
reasonably formatted markup.
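Here is a rough sketch of the fetch-then-parse idea, in case it is useful. Please treat it as an illustration only: it is written for Python 3 and uses just the standard library ('urllib.request' and 'html.parser') as stand-ins for urllib2 and BeautifulSoup, so it does not show the BeautifulSoup API. The class name, the helper functions, and the sample markup are all hypothetical, and a real OPAC page will need its own rules for finding the title, author, and cover image.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleAndImages(HTMLParser):
    """Collect the <title> text and the src of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "title":
            self.in_title = True
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def scrape(html):
    """Return (title, [image URLs]) from a string of possibly sloppy HTML."""
    parser = TitleAndImages()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.images


def fetch(url):
    """Download a page; undecodable bytes are replaced rather than raising."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Made-up sample markup, including an unclosed <p>, standing in for a catalog page.
SAMPLE = """<html><head><title> Example Book / An Author </title></head>
<body><img src="/covers/0000000000.jpg"><p>unclosed paragraph
</body></html>"""

title, images = scrape(SAMPLE)
print(title)
print(images)
```

For a live page you would call scrape(fetch(url)) with the record's URL. The parser here does not repair markup the way BeautifulSoup tries to; it simply tolerates unclosed tags and keeps going.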
The best Python resource I can recommend is 'Dive into Python', which
is both a book and a free website: http://www.diveintopython.org/
It's particularly good if you already have a background in another
language, but in general it's very good for quick syntax questions
(plus it gives a good overview of the language).
I hope that helps!
--Will
At 12:19 PM 5/17/2007, Bret Parker wrote:
>Ouch. Yes, I see what you mean about the Millennium approach being
>inaccessible.
>
>I have toyed with using Python's HTML and HTTP modules (not their
>formal names). The examples I used when I was playing with this were
>all from Alex Martelli's Python in a Nutshell (around p. 419; the
>httplib module). Perl may have similar features, but I find Python
>to be less Byzantine than Perl. I have another O'Reilly book, John
>Callender's Perl for Web Site Management. I have used some scripts
>from there just to extract links from HTML files that reside on the
>same computer as the Perl program. But with Python I was actually able
>to 'get' the HTML using HTTP calls (with Python acting like the
>browser on my workstation and contacting a web server), and then I
>attempted to parse the XHTML (in this case) with Python. I ran into
>some dead ends due to invalid XML or bad character data and abandoned
>my approach. But you might find something that would work for you,
>especially if you do not need valid XML as I did for what I was doing.
>
>For a helpful Python intro, try either of these sources:
>
> How to Think Like a Computer Scientist, Learning with Python
> http://www.ibiblio.org/obp/thinkCSpy/
>
> or Learning Python by Mark Lutz (O'Reilly) or any of the other
> books by Mark.
>
> I really like Alex Martelli's Python in a Nutshell, but it does not
> offer many examples, so I would not recommend it as a systematic
> way to learn Python scripting.
>
> Python can be freely downloaded at Python.org.
>
>Bret Parker, Senior Applications Programmer Analyst (MLIS)
>Stockton-San Joaquin County Public Library
>City of Stockton (California)
>bret.parker at ci.stockton.ca.us
>(209) 937-7148
>
>http://www.stockton.lib.ca.us
>
>
> >>> "Ben Haines" <bhaines at forestparkpubliclibrary.org> 5/17/2007 8:16 AM >>>
>Thanks for all your responses!
>
>Kathleen: the ISBNs are from other staff doing collection
>development, and ultimately from Baker and Taylor, etc. I could
>manually put together a page by typing in the HTML to display the
>title and author for each book, then locating the cover image in our
>catalog and adding the URL. But it would certainly be quicker and
>easier if the process could be automated.
>
>Bret: Our OPAC is managed at the consortium level, so I can't really
>do much configuration. Also, the Millennium reports server can't be
>accessed from member libraries at this point (although this might
>change in the near future). That's why I thought that scraping the
>OPAC page and reassembling the data might be the way to go. Is this
>sort of thing difficult to do in Python? Can you point me to any good examples?
>
>-Ben
>
>--
>Ben Haines
>Reference/Technology Librarian
>Forest Park Public Library
>bhaines at forestparkpubliclibrary.org
>
>-----Original Message-----
>From: Turner,Kathleen [mailto:kt32 at drexel.edu]
>Sent: Thursday, May 17, 2007 8:25 AM
>To: Ben Haines; web4lib at webjunction.org
>Subject: RE: [Web4lib] new book list script
>
>
>Where are you getting the ISBNs, and why couldn't that source also give
>you the rest of the info?
>
>Kathleen
>
>
>Kathleen H. Turner
>Web/Education Librarian
>W.W. Hagerty Library
>33rd and Market Streets
>Philadelphia, PA 19104-2875
>
>Tel: 215.895.6783
>Fax: 215.895.2070
>khturner at drexel.edu
>_______________________________________________
>Web4lib mailing list
>Web4lib at webjunction.org
>http://lists.webjunction.org/web4lib/