[Web4lib] Re: Google Search Appliance and OPACs

Thu Feb 7 12:38:40 EST 2008

I just want to point out that there is a world of difference between
exposing unique content to web crawlers and exposing commonly held content.
I can remember back in the days of Gopher (youngsters may wish to refer to
<http://en.wikipedia.org/wiki/Gopher_%28protocol%29> for an explanation)
when a few libraries put their catalogs up on Gopher. It was a complete
disaster. If you searched for just about anything (using Veronica, I
believe) you would get swamped with individual catalog records from
libraries that usually were some distance from you. What good was that? It
would have been much better had a library cooperative that had information
about the holdings of thousands of libraries provided a link to a service
that would quickly route the interested party to their local library that
has the book. That's funny, that sounds amazingly similar to what happens
now...
Roy

On 2/7/08 5:08 AM, "Breeding, Marshall" <marshall.breeding at Vanderbilt.Edu>
wrote:

> Likewise, the Vanderbilt Television News Archive proactively works to
> ensure its catalog is well represented in the global search engines.
> We've been doing this for a couple of years now.
> 
> For my April 2006 column in Computers in Libraries, I described the
> basic approach:
> 
> "How we funneled searchers from Google to our collections by catering to
> Web crawlers"
> (http://www.librarytechnology.org/ltg-displaytext.pl?RC=12049)
> 
> Basically, a script generates static HTML pages for each dynamic record
> in the database.  In the process it creates an HTML index and a sitemap
> following the protocol originally proposed by Google:
>   http://www.sitemaps.org/
> 
> Microsoft Live and Yahoo have recently begun supporting the sitemap
> protocol.
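
A minimal sketch of the sitemap step described above, not the script actually used at Vanderbilt. It assumes the record URLs are already known; the example.org addresses and the function names are hypothetical. The sitemaps.org protocol limits a single file to 50,000 URLs, so a larger catalog would be split into several files listed in a sitemap index.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit in the sitemaps.org protocol


def build_sitemap(urls):
    """Return one sitemap document (as a string) listing the given URLs."""
    if len(urls) > MAX_URLS_PER_FILE:
        raise ValueError("too many URLs for a single sitemap file")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url  # text is XML-escaped on output
    return ET.tostring(urlset, encoding="unicode")


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index document pointing at several sitemap files."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(index, encoding="unicode")


def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    # Hypothetical record URLs; a real catalog would read these from its database.
    record_urls = ["http://catalog.example.org/record/%d" % i for i in range(1, 4)]
    print(build_sitemap(record_urls))
```

For a catalog larger than 50,000 records, `chunk(record_urls, MAX_URLS_PER_FILE)` gives the per-file lists, and `build_sitemap_index` ties the resulting files together.
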
> 
> For our TV News archive, we're interested in serving researchers all
> over the world, and the proactive approach that we've followed of pushing
> our content into the search engines has resulted in significant
> increases in visits to our site and requests for materials.
> 
> I use similar techniques for Library Technology Guides
> (http://www.librarytechnology.org) but instead of generating flat HTML
> pages, I point the sitemaps at persistent links that call up pages
> directly from the database.
> 
> I'm not sure that this approach is ideal for library catalogs, where
> tens of thousands of libraries around the world have overlapping
> content.  This might end up being really messy if every library exposes
> its records independently.  I see it as a great approach for local
> digital collections with unique content.
> 
> -marshall breeding
>  Executive Director, Vanderbilt Television News Archive
>  Director for Innovative Technologies and Research,
>            Vanderbilt University Library
> 
> 
> -----Original Message-----
> From: web4lib-bounces at webjunction.org
> [mailto:web4lib-bounces at webjunction.org] On Behalf Of Martin Vojnar
> Sent: Thursday, February 07, 2008 4:23 AM
> To: Tim Spalding
> Cc: Gem Stone-Logan; web4lib at webjunction.org
> Subject: Re: [Web4lib] Re: Google Search Appliance and OPACs
> 
> Dear all,
> 
> we did something like this. We dumped our catalog
> (http://aleph.vkol.cz, about 1 million records)
> into static HTML pages so that crawlers could come
> and take them. Every static page has a link to
> the live record in the catalog.
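
A rough sketch of the static-page idea described above, under stated assumptions: the record fields (`id`, `title`, `author`), the URL pattern, and the function name are all hypothetical, and a real OPAC export would look different. Each generated page carries a link back to the live record.

```python
from html import escape
from urllib.parse import quote

# Hypothetical URL pattern for the live catalog record; a real OPAC will differ.
LIVE_URL = "http://catalog.example.org/record?id={record_id}"


def render_record_page(record):
    """Return a static HTML page (as a string) for one catalog record dict."""
    live_link = LIVE_URL.format(record_id=quote(str(record["id"])))
    title = escape(record["title"])    # escape text so markup in data is harmless
    author = escape(record["author"])
    return (
        "<html><head><title>{t}</title></head><body>\n"
        "<h1>{t}</h1>\n"
        "<p>{a}</p>\n"
        '<p><a href="{u}">View the live record in the catalog</a></p>\n'
        "</body></html>\n"
    ).format(t=title, a=author, u=escape(live_link))


if __name__ == "__main__":
    sample = {"id": 42, "title": "Cats & Dogs", "author": "A. Writer"}
    print(render_record_page(sample))
```
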
> 
> At first we built a tree structure linking all the
> records, so that robots would start at the home page
> (http://aleph.vkol.cz/pub) and find the rest of
> the records. This worked, but it took Google about 2
> months to get all the records.
> 
> So we switched to a sitemap
> (http://aleph.vkol.cz/sitemap.xml), and Google
> crawled and indexed everything in 2 weeks.
> 
> Our stats show about 2000 new visitors every
> day, with an 80% bounce rate. Obviously there are
> many follow-up questions (the world is not our
> target, so why publish the catalog in Google
> instead of local search engines, etc.), but this
> was more or less just an experiment.
> 
> Other crawlers (Yahoo, MSN) do not match Google's
> performance and do not work with sitemap files
> efficiently.
> 
> BR, Martin
> 
> On 6 Feb 2008 at 21:27, Tim Spalding wrote:
> 
>> Has anyone tried just making a HUGE page of links and putting it
>> somewhere Google will find it? Almost all OPACs allow direct links to
>> records, by ISBN or something else. On a *few* (I've seen it on
>> HiP), spidering this way causes serious session issues. (LibraryThing
>> made this mistake once.) But it might be a way to get data into
>> Google.
>> 
>> Tim
