[Web4lib] Re: Google Search Appliance and OPACs
Breeding, Marshall
marshall.breeding at Vanderbilt.Edu
Thu Feb 7 08:08:34 EST 2008
Likewise, the Vanderbilt Television News Archive proactively works to
ensure its catalog is well represented in the global search engines.
We've been doing this for a couple of years now.
For my April 2006 column in Computers in Libraries, I described the
basic approach:
"How we funneled searchers from Google to our collections by catering to
Web crawlers"
(http://www.librarytechnology.org/ltg-displaytext.pl?RC=12049)
Basically, a script generates a static HTML page for each dynamic record
in the database. In the process it creates an HTML index and a sitemap
following the protocol originally proposed by Google:
http://www.sitemaps.org/
Microsoft Live and Yahoo have recently begun supporting the sitemap
protocol.
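In rough outline, the script does something along these lines; the
database fields, file names, and URLs below are simplified placeholders
for illustration, not our actual setup:

    #!/usr/bin/env python
    # Sketch: write one static HTML page per catalog record, plus an HTML
    # index and a sitemap.xml following http://www.sitemaps.org/.
    # Table, column, and URL names are hypothetical.
    import os
    import sqlite3
    from xml.sax.saxutils import escape

    BASE_URL = "http://example.edu/tvnews"   # assumed base URL for the flat pages
    OUT_DIR = "static"
    os.makedirs(OUT_DIR, exist_ok=True)

    conn = sqlite3.connect("catalog.db")      # assumed local copy of the records
    records = conn.execute("SELECT id, title, description FROM records")

    index_links = []
    sitemap_urls = []

    for rec_id, title, description in records:
        page_name = "record-%s.html" % rec_id
        # One crawlable flat page per dynamic record, linking back to the live OPAC.
        with open(os.path.join(OUT_DIR, page_name), "w") as f:
            f.write("<html><head><title>%s</title></head><body>" % escape(title))
            f.write("<h1>%s</h1><p>%s</p>" % (escape(title), escape(description or "")))
            f.write('<p><a href="http://example.edu/opac?id=%s">View live record</a></p>'
                    % rec_id)
            f.write("</body></html>")
        index_links.append('<li><a href="%s">%s</a></li>' % (page_name, escape(title)))
        sitemap_urls.append("%s/%s" % (BASE_URL, page_name))

    # HTML index page so crawlers that follow links can reach every record page.
    with open(os.path.join(OUT_DIR, "index.html"), "w") as f:
        f.write("<html><body><ul>\n%s\n</ul></body></html>" % "\n".join(index_links))

    # sitemap.xml so crawlers that support the protocol can find the pages directly.
    with open(os.path.join(OUT_DIR, "sitemap.xml"), "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in sitemap_urls:
            f.write("  <url><loc>%s</loc></url>\n" % escape(url))
        f.write("</urlset>\n")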
For our TV News archive, we're interested in serving researchers all
over the world, and the proactive approach we've followed by pushing
our content into the search engines has resulted in significant
increases in visits to our site and in requests for materials.
I use similar techniques for Library Technology Guides
(http://www.librarytechnology.org) but instead of generating flat HTML
pages, I point the sitemaps at persistent links that call up pages
directly from the database.
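In sketch form, that just means the sitemap lists the persistent URLs
themselves rather than flat files; the record numbers below are
placeholders, following the ltg-displaytext.pl?RC= pattern in the
article link above:

    # Sketch: sitemap entries pointing at persistent dynamic links instead
    # of generated flat pages. Record numbers are placeholders.
    record_ids = [12049, 12050, 12051]

    with open("sitemap.xml", "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for rc in record_ids:
            f.write("  <url><loc>http://www.librarytechnology.org/"
                    "ltg-displaytext.pl?RC=%d</loc></url>\n" % rc)
        f.write("</urlset>\n")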
I'm not sure that this approach is ideal for library catalogs, where
tens of thousands of libraries around the world have overlapping
content. This might end up being really messy if every library exposes
its records independently. I see it as a great approach for local
digital collections with unique content.
-marshall breeding
Executive Director, Vanderbilt Television News Archive
Director for Innovative Technologies and Research,
Vanderbilt University Library
-----Original Message-----
From: web4lib-bounces at webjunction.org
[mailto:web4lib-bounces at webjunction.org] On Behalf Of Martin Vojnar
Sent: Thursday, February 07, 2008 4:23 AM
To: Tim Spalding
Cc: Gem Stone-Logan; web4lib at webjunction.org
Subject: Re: [Web4lib] Re: Google Search Appliance and OPACs
Dear all,
we did something like this. We dumped our catalog
(http://aleph.vkol.cz, about 1 million records)
into static HTML pages so that crawlers could come
and fetch them. Every static page has a link to
the live record in the catalog.
First we built a tree structure linking all the
records, so robots would start at the home page
(http://aleph.vkol.cz/pub) and find the rest of the
records. This worked, but it took Google about two
months to pick up all the records.
So we switched to a sitemap solution
(http://aleph.vkol.cz/sitemap.xml), and Google
crawled and indexed everything in two weeks.
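One practical detail: the sitemap protocol allows at most 50,000 URLs
per sitemap file, so with about a million records the sitemap.xml
typically ends up being a sitemap index pointing to several smaller
files. A simplified sketch (the file names and record URL pattern are
placeholders, not our real layout):

    # Sketch: split a large URL list into sitemap files of at most 50,000
    # entries each, plus a sitemap index that references them.
    # The record URL pattern is a placeholder.
    MAX_URLS = 50000

    record_urls = ["http://aleph.vkol.cz/record-%07d.html" % n
                   for n in range(1, 1000001)]

    def write_urlset(path, urls):
        with open(path, "w") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for url in urls:
                f.write("  <url><loc>%s</loc></url>\n" % url)
            f.write("</urlset>\n")

    sitemap_files = []
    for i in range(0, len(record_urls), MAX_URLS):
        name = "sitemap-%d.xml" % (i // MAX_URLS)
        write_urlset(name, record_urls[i:i + MAX_URLS])
        sitemap_files.append(name)

    # The index file is the one submitted to the search engines.
    with open("sitemap.xml", "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for name in sitemap_files:
            f.write("  <sitemap><loc>http://aleph.vkol.cz/%s</loc></sitemap>\n" % name)
        f.write("</sitemapindex>\n")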
Our statistics show we got about 2,000 new visitors
every day, with an 80% bounce rate. Obviously there
are many follow-up questions (the whole world is not
our target audience, so why publish the catalog to
Google instead of to local search engines, etc.), but
this was more or less just an experiment.
Other crawlers (Yahoo, MSN) do not match Google's
performance and do not handle sitemap files as
efficiently.
BR, Martin
On 6 Feb 2008 at 21:27, Tim Spalding wrote:
> Has anyone tried just making a HUGE page of links and putting it
> somewhere Google will find it? Almost all OPACs allow direct links to
> records, by ISBN or something else. On a *few* (I've seen it on
> HiP) spidering this way causes serious session issues. (LibraryThing
> made this mistake once.) But it might be a way to get data into
> Google.
>
> Tim
--
Ing. Martin Vojnar, Research Library Olomouc,
Czech Republic
phone://+420 585 205 352
http://www.vkol.cz/
_______________________________________________
Web4lib mailing list
Web4lib at webjunction.org
http://lists.webjunction.org/web4lib/