[Web4lib] Re: Google Search Appliance and OPACs

Breeding, Marshall marshall.breeding at Vanderbilt.Edu
Thu Feb 7 08:08:34 EST 2008


Likewise, the Vanderbilt Television News Archive proactively works to
ensure its catalog is well represented in the global search engines.
We've been doing this for a couple of years now.

For my April 2006 column in Computers in Libraries, I described the
basic approach:

"How we funneled searchers from Google to our collections by catering to
Web crawlers"
(http://www.librarytechnology.org/ltg-displaytext.pl?RC=12049)

Basically, a script generates a static HTML page for each dynamic record
in the database.  In the process it creates an HTML index and a sitemap
following the protocol originally proposed by Google:
  http://www.sitemaps.org/

Microsoft Live and Yahoo have recently begun supporting the sitemap
protocol.
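For anyone who wants to try this, a sitemap file is just an XML list of
URLs. A minimal sketch in Python (the helper name and the example.org
URLs are illustrative, not the actual Vanderbilt script):

```python
# Sketch: build a sitemap.xml for a set of crawler-friendly record
# pages, per the protocol at http://www.sitemaps.org/.
from xml.sax.saxutils import escape

def build_sitemap(urls):
    """Return sitemap XML listing the given page URLs."""
    entries = "\n".join(
        "  <url><loc>%s</loc></url>" % escape(u) for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + entries + "\n</urlset>\n"
    )

if __name__ == "__main__":
    pages = ["http://example.org/records/%d.html" % n for n in (1, 2, 3)]
    print(build_sitemap(pages))
```

The same file works whether the <loc> entries point at flat HTML files
or at persistent links into the live database, which is the variation
described below for Library Technology Guides.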

For our TV News archive, we're interested in serving researchers all
over the world, and the proactive approach we've followed by pushing
our content into the search engines has resulted in a significant
increase in visits to our site and in requests for materials.

I use similar techniques for Library Technology Guides
(http://www.librarytechnology.org) but instead of generating flat HTML
pages, I point the sitemaps at persistent links that call up pages
directly from the database. 

I'm not sure that this approach is ideal for library catalogs, where
tens of thousands of libraries around the world have overlapping
content.  This might end up being really messy if every library exposes
its records independently.  I see it as a great approach for local
digital collections with unique content.

-marshall breeding
 Executive Director, Vanderbilt Television News Archive
 Director for Innovative Technologies and Research, 
           Vanderbilt University Library


-----Original Message-----
From: web4lib-bounces at webjunction.org
[mailto:web4lib-bounces at webjunction.org] On Behalf Of Martin Vojnar
Sent: Thursday, February 07, 2008 4:23 AM
To: Tim Spalding
Cc: Gem Stone-Logan; web4lib at webjunction.org
Subject: Re: [Web4lib] Re: Google Search Appliance and OPACs

Dear all,

we did something like this. We dumped our catalog 
(http://aleph.vkol.cz, ca. 1 million records) 
into static HTML pages, so crawlers could come 
and take them. Every static page has a link to 
the live record in the catalog.
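A minimal sketch of that static-dump step, assuming a generic record
structure and a placeholder OPAC URL pattern (not the actual Aleph
setup):

```python
# Sketch: write one flat HTML page per catalog record, each linking
# back to the live OPAC record so searchers can reach current data.
import html
import os

# Placeholder link pattern; a real deployment would use its OPAC's
# persistent-link syntax here.
LIVE_URL = "http://example.org/opac/record/%s"

def render_record_page(rec_id, title, author):
    """Render one crawler-friendly page with a link to the live record."""
    live = LIVE_URL % rec_id
    return (
        "<html><head><title>%s</title></head><body>"
        "<h1>%s</h1><p>%s</p>"
        '<p><a href="%s">View live catalog record</a></p>'
        "</body></html>"
    ) % (html.escape(title), html.escape(title), html.escape(author), live)

def dump_records(records, outdir):
    """Write a static page for each (id, title, author) tuple."""
    os.makedirs(outdir, exist_ok=True)
    for rec_id, title, author in records:
        path = os.path.join(outdir, "%s.html" % rec_id)
        with open(path, "w", encoding="utf-8") as f:
            f.write(render_record_page(rec_id, title, author))

# Usage: dump_records([("000001", "Example Title", "Doe, Jane")], "pub")
```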

First we built a tree structure between all the 
records, so robots would start at the home page 
(http://aleph.vkol.cz/pub) and find the rest of 
the records. This worked, but it took Google 
about two months to get all the records.

So we switched to a sitemap solution 
(http://aleph.vkol.cz/sitemap.xml), and Google 
crawled/indexed everything in two weeks.

Our stats show we get about 2,000 new visitors 
every day, with an 80% bounce rate. Obviously 
there are many follow-up questions (the world is 
not our target audience, so why publish the 
catalog in Google instead of in local search 
engines, etc.), but this was more or less just 
an experiment.

Other crawlers (Yahoo, MSN) do not match Google's 
performance and do not work with sitemap files 
efficiently.

BR, Martin

On 6 Feb 2008 at 21:27, Tim Spalding wrote:

> Has anyone tried just making a HUGE page of links and putting it
> somewhere Google will find it? Almost all OPACs allow direct links to
> records, by ISBN or something else. On a *few* (I've seen it on
> HiP) spidering this way causes serious session issues. (LibraryThing
> made this mistake once.) But it might be a way to get data into
> Google.
> 
> Tim
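
The big-page-of-links idea Tim describes could be sketched like this
(the ISBN-link pattern is a placeholder; real OPACs each have their own
direct-link syntax):

```python
# Sketch: one static HTML page listing direct links to OPAC records
# by ISBN, placed somewhere a crawler will find it.
def links_page(isbns, url_pattern="http://example.org/opac/isbn/%s"):
    """Return an HTML page with one list item per ISBN link."""
    items = "\n".join(
        '<li><a href="%s">%s</a></li>' % (url_pattern % i, i)
        for i in isbns
    )
    return "<html><body><ul>\n%s\n</ul></body></html>" % items

# Usage: open("biglinks.html", "w").write(links_page(["9780143039433"]))
```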

-- 
Ing. Martin Vojnar, Research Library Olomouc, 
Czech Republic
phone://+420 585 205 352
http://www.vkol.cz/



_______________________________________________
Web4lib mailing list
Web4lib at webjunction.org
http://lists.webjunction.org/web4lib/

