[Web4lib] Re: Google Search Appliance and OPACs

Martin Vojnar vojnar at vkol.cz
Thu Feb 7 17:11:59 EST 2008


Roy,

yes, at first sight it seems like flooding.

However, when the published content is common, it 
gets quite a low score (PageRank) and does not 
rank very high in the results.

Conversely, the more unique it is, the closer 
to the top it will be placed. 

IMHO it is still quite difficult to say what 
the real treasure in a single library's 
collection is, and therefore to limit exposure 
to it (as Marshall suggested). Some content may 
look unique to a worldwide search engine, while 
a local search engine (specialised in a field, 
a geographical domain, etc.) may take a 
different view.
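As an aside: the sitemap generation we and Marshall describe below is easy to script. Here is a minimal sketch in Python; the base URL pattern and the record numbers are only placeholders for illustration, not our real setup:

```python
# Minimal sitemap generator: one <url> entry per catalog record,
# following the protocol at http://www.sitemaps.org/
# BASE is a placeholder for a persistent per-record link pattern.
from xml.sax.saxutils import escape

BASE = "http://aleph.vkol.cz/F?func=direct&doc_number=%09d"

def build_sitemap(record_numbers):
    """Return sitemap XML listing one <loc> per record number."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for n in record_numbers:
        # escape() turns the "&" in the query string into "&amp;",
        # as the sitemap protocol requires valid XML
        lines.append("  <url><loc>%s</loc></url>" % escape(BASE % n))
    lines.append("</urlset>")
    return "\n".join(lines)

if __name__ == "__main__":
    with open("sitemap.xml", "w") as f:
        f.write(build_sitemap(range(1, 1001)))
```

Note that the protocol caps each sitemap file at 50,000 URLs, so a 
catalog of around a million records needs several files plus a 
sitemap index file pointing at them.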

BR, Martin 

On 7 Feb 2008 at 9:38, Roy Tennant wrote:

> I just want to point out that there is a world of difference between
> exposing unique content to web crawlers and exposing commonly held content.
> I can remember back in the days of Gopher (youngsters may wish to refer to
> <http://en.wikipedia.org/wiki/Gopher_%28protocol%29> for an explanation)
> when a few libraries put their catalogs up on Gopher. It was a complete
> disaster. If you searched for just about anything (using Veronica, I
> believe) you would get swamped with individual catalog records from
> libraries that usually were some distance from you. What good was that? It
> would have been much better had a library cooperative that had information
> about the holdings of thousands of libraries provided a link to a service
> that would quickly route the interested party to their local library that
> has the book. That's funny, that sounds amazingly similar to what happens
> now...
> Roy
> 
> 
> On 2/7/08 5:08 AM, "Breeding, Marshall" <marshall.breeding at Vanderbilt.Edu>
> wrote:
> 
> > Likewise, the Vanderbilt Television News Archive proactively works to
> > ensure its catalog is well represented in the global search engines.
> > We've been doing this for a couple of years now.
> > 
> > For my April 2006 column in Computers in Libraries, I described the
> > basic approach:
> > 
> > "How we funneled searchers from Google to our collections by catering to
> > Web crawlers"
> > (http://www.librarytechnology.org/ltg-displaytext.pl?RC=12049)
> > 
> > Basically, a script generates static HTML pages for each dynamic record
> > in the database.  In the process it creates an HTML index and a sitemap
> > following the protocol originally proposed by Google:
> >   http://www.sitemaps.org/
> > 
> > Microsoft Live and Yahoo have recently begun supporting the sitemap
> > protocol.
> > 
> > For our TV News archive, we're interested in serving researchers all
> > over the world, and the proactive approach we've followed of pushing
> > our content into the search engines has resulted in a significant
> > increase in visits to our site and in requests for materials.
> > 
> > I use similar techniques for Library Technology Guides
> > (http://www.librarytechnology.org) but instead of generating flat HTML
> > pages, I point the sitemaps at persistent links that call up pages
> > directly from the database.
> > 
> > I'm not sure that this approach is ideal for library catalogs, where
> > tens of thousands of libraries around the world have overlapping
> > content.  This might end up being really messy if every library exposes
> > its records independently.  I see it as a great approach for local
> > digital collections with unique content.
> > 
> > -marshall breeding
> >  Executive Director, Vanderbilt Television News Archive
> >  Director for Innovative Technologies and Research,
> >            Vanderbilt University Library
> > 
> > 
> > -----Original Message-----
> > From: web4lib-bounces at webjunction.org
> > [mailto:web4lib-bounces at webjunction.org] On Behalf Of Martin Vojnar
> > Sent: Thursday, February 07, 2008 4:23 AM
> > To: Tim Spalding
> > Cc: Gem Stone-Logan; web4lib at webjunction.org
> > Subject: Re: [Web4lib] Re: Google Search Appliance and OPACs
> > 
> > Dear all,
> > 
> > we did something like this. We dumped our catalog
> > (http://aleph.vkol.cz, ca. 1 million records)
> > into static html pages, so crawlers could come
> > and take them. Every static page has a link to
> > the live record in the catalog.
> > 
> > First, we built a tree structure linking all the
> > records, so robots would start at the home page
> > (http://aleph.vkol.cz/pub) and find the rest of
> > the records. This worked, but it took Google
> > about 2 months to fetch all the records.
> > 
> > So we switched to a sitemap solution
> > (http://aleph.vkol.cz/sitemap.xml), and Google
> > crawled/indexed everything in 2 weeks.
> > 
> > Some stats say we get about 2,000 new visitors a
> > day with an 80% bounce rate. Obviously there are
> > many follow-up questions (the world is not our
> > target audience, so why publish the catalog in
> > Google instead of in local search engines, etc.),
> > but this was more or less just an experiment.
> > 
> > Other crawlers (Yahoo, MSN) do not match Google's
> > performance and do not process sitemap files as
> > efficiently.
> > 
> > BR, Martin
> > 
> > On 6 Feb 2008 at 21:27, Tim Spalding wrote:
> > 
> >> Has anyone tried just making a HUGE page of links and putting it
> >> somewhere Google will find it? Almost all OPACs allow direct links to
> >> records, by ISBN or something else. On a *few* (I've seen it on
> >> HiP), spidering this way causes serious session issues. (LibraryThing
> >> made this mistake once.) But it might be a way to get data into
> >> Google.
> >> 
> >> Tim
> 
> -- 
> 
> 
> _______________________________________________
> Web4lib mailing list
> Web4lib at webjunction.org
> http://lists.webjunction.org/web4lib/


-- 
Ing. Martin Vojnar, Research Library Olomouc, 
Czech Republic
phone://+420 585 205 352
http://www.vkol.cz/

