[Web4lib] Re: Google Search Appliance and OPACs

Jeremy Frumkin jeremy.frumkin at oregonstate.edu
Thu Feb 7 13:40:18 EST 2008


Roy - 

Or, to take the vision you describe below one step further, it could be much
better to have an open and freely accessible repository of such data upon
which any library, library cooperative, non-profit, etc. could build a range
of useful services.

-- jaf

"Open is as open does"


On 2/7/08 9:38 AM, "Roy Tennant" <tennantr at oclc.org> wrote:

> I just want to point out that there is a world of difference between
> exposing unique content to web crawlers and exposing commonly held content.
> I can remember back in the days of Gopher (youngsters may wish to refer to
> <http://en.wikipedia.org/wiki/Gopher_%28protocol%29> for an explanation)
> when a few libraries put their catalogs up on Gopher. It was a complete
> disaster. If you searched for just about anything (using Veronica, I
> believe) you would get swamped with individual catalog records from
> libraries that usually were some distance from you. What good was that? It
> would have been much better had a library cooperative that had information
> about the holdings of thousands of libraries provided a link to a service
> that would quickly route the interested party to their local library that
> has the book. That's funny, that sounds amazingly similar to what happens
> now...
> Roy
> 
> 
> On 2/7/08 5:08 AM, "Breeding, Marshall" <marshall.breeding at Vanderbilt.Edu>
> wrote:
> 
>> > Likewise, the Vanderbilt Television News Archive proactively works to
>> > ensure its catalog is well represented in the global search engines.
>> > We've been doing this for a couple of years now.
>> >
>> > For my April 2006 column in Computers in Libraries, I described the
>> > basic approach:
>> >
>> > "How we funneled searchers from Google to our collections by catering to
>> > Web crawlers"
>> > (http://www.librarytechnology.org/ltg-displaytext.pl?RC=12049)
>> >
>> > Basically, a script generates static HTML pages for each dynamic record
>> > in the database.  In the process it creates an HTML index and a sitemap
>> > following the protocol originally proposed by Google:
>> >   http://www.sitemaps.org/
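>> > A minimal sketch of that generation step (the record data, base URL, and
>> > file names here are illustrative assumptions, not our actual setup):

```python
# Sketch: emit one static HTML page per record, then a sitemap.xml
# following the sitemaps.org protocol.  BASE and the records list
# are placeholders for whatever the catalog database provides.
import xml.etree.ElementTree as ET

BASE = "http://example.org/records"                   # hypothetical base URL
records = [{"id": "1001", "title": "Sample record"}]  # stand-in data

urls = []
for rec in records:
    path = f"{rec['id']}.html"
    with open(path, "w", encoding="utf-8") as f:
        # One crawlable page per database record
        f.write(f"<html><head><title>{rec['title']}</title></head>"
                f"<body><h1>{rec['title']}</h1></body></html>")
    urls.append(f"{BASE}/{path}")

# Build the sitemap listing every generated page
urlset = ET.Element("urlset",
                    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for u in urls:
    ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = u
ET.ElementTree(urlset).write("sitemap.xml",
                             encoding="utf-8", xml_declaration=True)
```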
>> >
>> > Microsoft Live and Yahoo have recently begun supporting the sitemap
>> > protocol.
>> >
>> > For our TV News archive, we're interested in serving researchers all
>> > over the world, and the proactive approach we've followed by pushing
>> > our content into the search engines has resulted in significant
>> > increases in visits to our site and requests for materials.
>> >
>> > I use similar techniques for Library Technology Guides
>> > (http://www.librarytechnology.org) but instead of generating flat HTML
>> > pages, I point the sitemaps at persistent links that call up pages
>> > directly from the database.
>> >
>> > I'm not sure that this approach is ideal for library catalogs, where
>> > tens of thousands of libraries around the world have overlapping
>> > content.  This might end up being really messy if every library exposes
>> > its records independently.  I see it as a great approach for local
>> > digital collections with unique content.
>> >
>> > -marshall breeding
>> >  Executive Director, Vanderbilt Television News Archive
>> >  Director for Innovative Technologies and Research,
>> >            Vanderbilt University Library
>> >
>> >
>> > -----Original Message-----
>> > From: web4lib-bounces at webjunction.org
>> > [mailto:web4lib-bounces at webjunction.org] On Behalf Of Martin Vojnar
>> > Sent: Thursday, February 07, 2008 4:23 AM
>> > To: Tim Spalding
>> > Cc: Gem Stone-Logan; web4lib at webjunction.org
>> > Subject: Re: [Web4lib] Re: Google Search Appliance and OPACs
>> >
>> > Dear all,
>> >
>> > we did something like this. We dumped our catalog
>> > (http://aleph.vkol.cz, approx. 1 million records)
>> > into static HTML pages, so crawlers could come
>> > and take them. Every static page has a link to
>> > the live record in the catalog.
>> >
>> > First we built a tree structure between all the
>> > records, so robots would start at the home page
>> > (http://aleph.vkol.cz/pub) and find the rest of the
>> > records. This proved OK, but it took Google about
>> > two months to get all the records.
>> >
>> > So we switched to a sitemap solution
>> > (http://aleph.vkol.cz/sitemap.xml) and Google
>> > crawled/indexed everything in 2 weeks.
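>> > One wrinkle at this scale: the sitemaps.org protocol caps each sitemap
>> > file at 50,000 URLs, so a catalog of about a million records needs a
>> > sitemap index pointing at many per-chunk sitemaps. A rough sketch (the
>> > URLs are illustrative, not the actual vkol.cz layout):

```python
# Sketch: build a sitemap-index file for ~1M records split into
# 50,000-URL chunks, per the sitemaps.org protocol limit.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
total, per_file = 1_000_000, 50_000
n_files = -(-total // per_file)            # ceiling division -> 20 files

index = ET.Element("sitemapindex", xmlns=NS)
for n in range(n_files):
    # Each <sitemap><loc> entry names one chunk sitemap (hypothetical URLs)
    loc = ET.SubElement(ET.SubElement(index, "sitemap"), "loc")
    loc.text = f"http://example.org/sitemap-{n:02d}.xml"
ET.ElementTree(index).write("sitemap_index.xml",
                            encoding="utf-8", xml_declaration=True)
```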
>> >
>> > Some stats say we get approx. 2,000 new visitors every
>> > day, with an 80% bounce rate. Obviously there are
>> > many follow-up questions (the world is not our
>> > target audience, so why publish the catalog in Google
>> > instead of in local search engines, etc.), but this
>> > was more or less just an experiment.
>> >
>> > Other crawlers (Yahoo, MSN) do not match Google's
>> > performance and do not work with sitemap files
>> > efficiently.
>> >
>> > BR, Martin
>> >
>> > On 6 Feb 2008 at 21:27, Tim Spalding wrote:
>> >
>>> >> Has anyone tried just making a HUGE page of links and putting it
>>> >> somewhere Google will find it? Almost all OPACs allow direct links to
>>> >> records, by ISBN or something else. On a *few* (I've seen it on
>>> >> HiP) spidering this way causes serious session issues. (LibraryThing
>>> >> made this mistake once.) But it might be a way to get data into
>>> >> Google.
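>>> >> The idea is simple enough to sketch: one static page of direct
>>> >> ISBN links into an OPAC. The link pattern and ISBNs below are
>>> >> hypothetical; real OPACs vary in their deep-link syntax.

```python
# Sketch of the "huge page of links" idea: a single static HTML
# page of deep links into an OPAC, one per ISBN, for crawlers to find.
OPAC = "http://opac.example.org/record?isbn="    # assumed link pattern
isbns = ["9780131103627", "9780262033848"]       # sample ISBNs

rows = "\n".join(f'<li><a href="{OPAC}{i}">{i}</a></li>' for i in isbns)
with open("links.html", "w", encoding="utf-8") as f:
    f.write(f"<html><body><ul>\n{rows}\n</ul></body></html>")
```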
>>> >>
>>> >> Tim
> 
> --
> 
> 
> _______________________________________________
> Web4lib mailing list
> Web4lib at webjunction.org
> http://lists.webjunction.org/web4lib/
> 
> 
> 
> 
> ===============================================
> Jeremy Frumkin
> Head, Emerging Technologies and Services
> 121 The Valley Library, Oregon State University
> Corvallis OR 97331-4501
>  
> Jeremy.Frumkin at oregonstate.edu
>  
> 541.602.4905
> 541.737.3453 (Fax)
> ===============================================
> " Without ambition one starts nothing. Without work one finishes nothing. " -
> Emerson


