[Web4lib] new book list script

Will Kurt wkurt at bbn.com
Thu May 17 12:41:33 EDT 2007


I'm actually in the middle of working on a project myself using 
Python for some screen scraping.  All I use is the 'urllib2' module 
to grab the pages and a really excellent HTML/XML parsing library, 
BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

The documentation for Beautiful Soup is not too long or difficult.

Additionally, for bad XML and the like, BeautifulSoup does its best to 
repair the markup as it parses, and its 'prettify' method returns the 
result as reasonably formatted XML.
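
To give a feel for the fetch-then-parse approach, here is a rough sketch rather than a finished script. It makes several assumptions: it uses Python 3, where 'urllib2' became 'urllib.request'; it uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without installing anything; and the class names 'bibTitle' and 'bibAuthor' are hypothetical, so you would need to look at your own OPAC's HTML to find the real ones.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class RecordParser(HTMLParser):
    """Collect the text inside elements whose class attribute is wanted."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # class name currently being captured, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.wanted:
            self.current = cls
            self.fields[cls] = ""

    def handle_endtag(self, tag):
        # Any closing tag ends the current capture.
        self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] += data


def fetch(url):
    """Download a page and decode it, replacing undecodable bytes."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_record(html):
    """Return the title and author found in one record page."""
    # 'bibTitle' and 'bibAuthor' are hypothetical class names.
    parser = RecordParser(["bibTitle", "bibAuthor"])
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.fields.items()}


# Deliberately sloppy markup: unclosed tags and a bare ampersand.
SAMPLE = """
<table><tr><td class="bibTitle">Dive into Python
<tr><td class="bibAuthor">Pilgrim, Mark</td></tr>
<p>unclosed paragraph & a stray ampersand
"""

print(extract_record(SAMPLE))
```

In real use you would call extract_record(fetch(url)) once per ISBN, with whatever record URL pattern your catalog uses. The parser here does not insist on well-formed markup, which is the same property that makes BeautifulSoup comfortable to use on real-world pages.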

The best Python resource I can recommend is 'Dive into Python', which 
is both a book and a free website: http://www.diveintopython.org/

It's particularly good if you already have a background in another 
language, but in general it's very good for quick syntax questions 
(plus it also gives a good overview of the language.)

I hope that helps!
--Will

At 12:19 PM 5/17/2007, Bret Parker wrote:
>Ouch. Yes, I see what you mean about the Millennium approach being 
>inaccessible.
>
>I have toyed with using Python's html and http modules (not their 
>formal names). The examples I used when I was playing with this were 
>all from Alex Martelli's Python in a Nutshell (around pp. 419; the 
>httplib module).  Perl may have similar features but I find Python 
>to be less Byzantine than Perl.  I have another O'Reilly book, John 
>Callender's Perl for Web Site Management. I have used some scripts 
>from there just to extract links from HTML files that reside on the 
>same computer as the Perl program. But with Python I was actually able 
>to 'get' the HTML using HTTP calls (with Python acting like the 
>browser on my workstation and contacting a web server), and then I 
>attempted to parse the XHTML (in this case) with Python. I ran into 
>some dead ends due to invalid XML or bad character data and abandoned 
>my approach. But you might find something that would work for you, 
>especially if you do not need valid XML as I did for what I was doing.
>
>For a helpful Python intro, try either of these sources:
>
>   How to Think Like a Computer Scientist, Learning with Python
>     http://www.ibiblio.org/obp/thinkCSpy/
>
>   or  Learning Python by Mark Lutz (O'Reilly)  or any of the other 
> books by Mark.
>
>  I really like Alex Martelli's Python in a Nutshell, but I would not 
> recommend it as a source of examples for learning Python scripting 
> systematically.
>
>    Python can be freely downloaded at Python.org.
>
>Bret Parker, Senior Applications Programmer Analyst (MLIS)
>Stockton-San Joaquin County Public Library
>City of Stockton (California)
>bret.parker at ci.stockton.ca.us
>(209) 937-7148
>
>http://www.stockton.lib.ca.us
>
>
> >>> "Ben Haines" <bhaines at forestparkpubliclibrary.org> 5/17/2007 8:16 AM >>>
>Thanks for all your responses!
>
>Kathleen: the ISBNs are from other staff doing collection 
>development, and ultimately from Baker and Taylor, etc.  I could 
>manually put together a page by typing in the HTML to display the 
>title and author for each book, then locating the cover image in our 
>catalog and adding the URL.  But it would certainly be quicker and 
>easier if the process could be automated.
>
>Bret: Our OPAC is managed at the consortium level, so I can't really 
>do much configuration. Also, the Millennium reports server can't be 
>accessed from member libraries at this point (although this might 
>change in the near future). That's why I thought that scraping the 
>OPAC page and reassembling the data might be the way to go. Is this 
>sort of thing difficult to do in Python? Can you point me to any good examples?
>
>-Ben
>
>--
>Ben Haines
>Reference/Technology Librarian
>Forest Park Public Library
>bhaines at forestparkpubliclibrary.org
>
>-----Original Message-----
>From: Turner,Kathleen [mailto:kt32 at drexel.edu]
>Sent: Thursday, May 17, 2007 8:25 AM
>To: Ben Haines; web4lib at webjunction.org
>Subject: RE: [Web4lib] new book list script
>
>
>Where are you getting the ISBNs and why couldn't that source also give
>you the rest of the info?
>
>Kathleen
>
>
>Kathleen H. Turner
>Web/Education Librarian
>W.W. Hagerty Library
>33rd and Market Streets
>Philadelphia, PA 19104-2875
>
>Tel: 215.895.6783
>Fax: 215.895.2070
>khturner at drexel.edu
>_______________________________________________
>Web4lib mailing list
>Web4lib at webjunction.org
>http://lists.webjunction.org/web4lib/
>


