[Web4lib] Re: Google Search Appliance and OPACs
Roy Tennant
tennantr at oclc.org
Thu Feb 7 12:38:40 EST 2008
I just want to point out that there is a world of difference between
exposing unique content to web crawlers and exposing commonly held content.
I can remember back in the days of Gopher (youngsters may wish to refer to
<http://en.wikipedia.org/wiki/Gopher_%28protocol%29> for an explanation)
when a few libraries put their catalogs up on Gopher. It was a complete
disaster. If you searched for just about anything (using Veronica, I
believe) you would get swamped with individual catalog records from
libraries that usually were some distance from you. What good was that? It
would have been much better had a library cooperative that had information
about the holdings of thousands of libraries provided a link to a service
that would quickly route the interested party to their local library that
has the book. That's funny, that sounds amazingly similar to what happens
now...
Roy
On 2/7/08 5:08 AM, "Breeding, Marshall" <marshall.breeding at Vanderbilt.Edu>
wrote:
> Likewise, the Vanderbilt Television News Archive proactively works to
> ensure its catalog is well represented in the global search engines.
> We've been doing this for a couple of years now.
>
> For my April 2006 column in Computers in Libraries, I described the
> basic approach:
>
> "How we funneled searchers from Google to our collections by catering to
> Web crawlers"
> (http://www.librarytechnology.org/ltg-displaytext.pl?RC=12049)
>
> Basically, a script generates static HTML pages for each dynamic record
> in the database. In the process it creates an HTML index and a sitemap
> following the protocol originally proposed by Google:
> http://www.sitemaps.org/
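
A minimal sketch of the approach described above, not the actual Vanderbilt script. The record fields (`id`, `title`, `summary`, `live_url`), the function names, and the URL layout are all assumptions made for illustration; only the sitemap namespace and the `urlset`/`url`/`loc` elements come from the sitemaps.org protocol.

```python
from html import escape
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def static_url(base_url, record):
    """Public URL of the static page for one record (hypothetical layout)."""
    return f"{base_url}/{record['id']}.html"


def record_page(record):
    """Static HTML page for one record, linking back to the live record."""
    title = escape(record["title"])
    live = escape(record["live_url"], quote=True)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n"
        f"<p>{escape(record['summary'])}</p>\n"
        f'<p><a href="{live}">View the live record</a></p>\n'
        "</body></html>\n"
    )


def index_page(records, base_url):
    """HTML index that links to every static record page."""
    items = "\n".join(
        f'<li><a href="{escape(static_url(base_url, r), quote=True)}">'
        f"{escape(r['title'])}</a></li>"
        for r in records
    )
    return f"<!DOCTYPE html>\n<html><body><ul>\n{items}\n</ul></body></html>\n"


def sitemap_xml(records, base_url):
    """Sitemap listing the static page of every record."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for r in records:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = static_url(base_url, r)
    return ET.tostring(urlset, encoding="unicode")
```

A real script would write each `record_page` result to `<id>.html`, the index to `index.html`, and the sitemap to `sitemap.xml` under the web root. Escaping titles and URLs matters here because catalog data often contains `&` and quotes.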
>
> Microsoft Live and Yahoo have recently begun supporting the sitemap
> protocol.
>
> For our TV News archive, we're interested in serving researchers all
> over the world, and the proactive approach that we've followed by pushing
> our content into the search engines has resulted in a significant
> increase in visits to our site and requests for materials.
>
> I use similar techniques for Library Technology Guides
> (http://www.librarytechnology.org) but instead of generating flat HTML
> pages, I point the sitemaps at persistent links that call up pages
> directly from the database.
>
> I'm not sure that this approach is ideal for library catalogs, where
> tens of thousands of libraries around the world have overlapping
> content. This might end up being really messy if every library exposes
> its records independently. I see it as a great approach for local
> digital collections with unique content.
>
> -marshall breeding
> Executive Director, Vanderbilt Television News Archive
> Director for Innovative Technologies and Research,
> Vanderbilt University Library
>
>
> -----Original Message-----
> From: web4lib-bounces at webjunction.org
> [mailto:web4lib-bounces at webjunction.org] On Behalf Of Martin Vojnar
> Sent: Thursday, February 07, 2008 4:23 AM
> To: Tim Spalding
> Cc: Gem Stone-Logan; web4lib at webjunction.org
> Subject: Re: [Web4lib] Re: Google Search Appliance and OPACs
>
> Dears,
>
> we did something like this. We dumped our catalog
> (http://aleph.vkol.cz, about 1 million records)
> into static HTML pages, so crawlers could come
> and take them. Every static page has a link to
> the live record in the catalog.
>
> First we built a tree structure linking all
> records, so robots would start at the home page
> (http://aleph.vkol.cz/pub) and find the rest of
> the records. This worked, but it took Google about 2
> months to get all the records.
>
> So we switched to a sitemap solution
> (http://aleph.vkol.cz/sitemap.xml) and Google
> crawled/indexed everything in 2 weeks.
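
For a catalog of this size, one detail of the sitemaps.org protocol matters: a single sitemap file may list at most 50,000 URLs, so around a million records need several sitemap files plus a sitemap index that points at them. The sketch below shows one way to do that split. It is not the script used at aleph.vkol.cz; the function names and the `sitemap-N.xml` file naming are assumptions for illustration.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit in the sitemaps.org protocol


def chunked(urls, size=MAX_URLS):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_sitemaps(urls, sitemap_base):
    """Return {filename: xml_text} for the sitemap files and their index.

    `sitemap_base` is the public URL prefix under which the files
    will be served (hypothetical; adjust to the real web root).
    """
    files = {}
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for n, chunk in enumerate(chunked(urls), start=1):
        name = f"sitemap-{n}.xml"
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for u in chunk:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = u
        files[name] = ET.tostring(urlset, encoding="unicode")
        # Register this file in the index.
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = f"{sitemap_base}/{name}"
    files["sitemap.xml"] = ET.tostring(index, encoding="unicode")
    return files
```

With one million record URLs this produces 20 sitemap files and one index, and only the index URL needs to be submitted to the search engine.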
>
> Some stats say we got about 2000 new visitors every
> day with an 80% bounce rate. Obviously there are
> many follow-up questions (the world is not our
> target, so why publish the catalog in Google
> instead of local search engines, etc.), but this
> was more or less just an experiment.
>
> Other crawlers (Yahoo, MSN) do not match Google's
> performance and do not work with sitemap files
> efficiently.
>
> BR, Martin
>
> On 6 Feb 2008 at 21:27, Tim Spalding wrote:
>
>> Has anyone tried just making a HUGE page of links and putting it
>> somewhere Google will find it? Almost all OPACs allow direct links to
>> records, by ISBN or something else. On a *few* (I've seen it on
>> HiP), spidering this way causes serious session issues. (LibraryThing
>> made this mistake once.) But it might be a way to get data into
>> Google.
>>
>> Tim