[Web4lib] Re: Web4lib Digest, Vol 35, Issue 8

Fri Feb 8 13:39:04 EST 2008

I haven't really been following this thread, but since I've recently
developed my own approach to making a portion of our catalog "Google
friendly", I thought I'd chip in. In my HTML representation of a
record, my application constructs a link back to the catalog record
and, if the data allows it, a WorldCat link. Currently, about 2/3 of
my hits on these pages are coming from search engines, so this
approach throws those page viewers a bone, albeit a small one.

Sample record:
http://sunsite.berkeley.edu/wikis/datalab/Data/Cd184971071

CD Archive search page:
http://sunsite.berkeley.edu/wikis/datalab/Data/CdArchive
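
For anyone who wants to try the same thing, here is a minimal sketch of
the link construction described above. It is not my actual application
code: the catalog URL pattern and the record field names ("id", "oclc",
"isbn") are assumptions for illustration. The WorldCat link is built only
when the record carries an OCLC number or an ISBN.

```python
from urllib.parse import quote

# Hypothetical catalog URL pattern; substitute your own OPAC's record URL.
CATALOG_URL = "http://catalog.example.edu/record/{record_id}"
# WorldCat permalink patterns for OCLC numbers and ISBNs.
WORLDCAT_OCLC_URL = "http://www.worldcat.org/oclc/{number}"
WORLDCAT_ISBN_URL = "http://www.worldcat.org/isbn/{isbn}"


def record_links(record):
    """Return a list of (label, url) pairs for one record dict.

    The catalog link is always built. The WorldCat link is added only
    if the data allows it, preferring an OCLC number over an ISBN.
    The field names "id", "oclc" and "isbn" are assumed, not standard.
    """
    record_id = quote(str(record["id"]), safe="")
    links = [("Catalog record", CATALOG_URL.format(record_id=record_id))]

    oclc = record.get("oclc")
    isbn = record.get("isbn")
    if oclc:
        number = quote(str(oclc), safe="")
        links.append(("WorldCat", WORLDCAT_OCLC_URL.format(number=number)))
    elif isbn:
        # Drop hyphens and spaces so the ISBN is a bare identifier.
        cleaned = str(isbn).replace("-", "").replace(" ", "")
        links.append(("WorldCat", WORLDCAT_ISBN_URL.format(isbn=quote(cleaned, safe=""))))
    return links
```

Each pair can then be written into the record page as an ordinary anchor
tag, so a visitor arriving from a search engine has somewhere to go next.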

Harrison

PS - Roy, funny you should mention Gopher because just a couple days
ago, I inadvertently clicked on a link that pointed to a gopher site.
Unfortunately, Safari choked on it.

> ---------- Forwarded message ----------
> From: Roy Tennant <tennantr at oclc.org>
> To: <web4lib at webjunction.org>
> Date: Thu, 07 Feb 2008 09:38:40 -0800
> Subject: Re: [Web4lib] Re: Google Search Appliance and OPACs
> I just want to point out that there is a world of difference between
> exposing unique content to web crawlers and exposing commonly held content.
> I can remember back in the days of Gopher (youngsters may wish to refer to
> <http://en.wikipedia.org/wiki/Gopher_%28protocol%29> for an explanation)
> when a few libraries put their catalogs up on Gopher. It was a complete
> disaster. If you searched for just about anything (using Veronica, I
> believe) you would get swamped with individual catalog records from
> libraries that usually were some distance from you. What good was that? It
> would have been much better had a library cooperative that had information
> about the holdings of thousands of libraries provided a link to a service
> that would quickly route the interested party to their local library that
> has the book. That's funny, that sounds amazingly similar to what happens
> now...
> Roy
>
>
> On 2/7/08 5:08 AM, "Breeding, Marshall" <marshall.breeding at Vanderbilt.Edu>
> wrote:
>
> > Likewise, the Vanderbilt Television News Archive proactively works to
> > ensure its catalog is well represented in the global search engines.
> > We've been doing this for a couple of years now.
> >
> > For my April 2006 column in Computers in Libraries, I described the
> > basic approach:
> >
> > "How we funneled searchers from Google to our collections by catering to
> > Web crawlers"
> > (http://www.librarytechnology.org/ltg-displaytext.pl?RC=12049)
> >
> > Basically, a script generates static HTML pages for each dynamic record
> > in the database. In the process it creates an HTML index and a sitemap
> > following the protocol originally proposed by Google:
> >   http://www.sitemaps.org/
> >
> > Microsoft Live and Yahoo have recently begun supporting the sitemap
> > protocol.
> >
> > For our TV News archive, we're interested in serving researchers all
> > over the world, and the proactive approach that we've followed of pushing
> > our content into the search engines has resulted in significant increases
> > in visits to our site and requests for materials.
> >
> > I use similar techniques for Library Technology Guides
> > (http://www.librarytechnology.org) but instead of generating flat HTML
> > pages, I point the sitemaps at persistent links that call up pages
> > directly from the database.
> >
> > I'm not sure that this approach is ideal for library catalogs, where
> > tens of thousands of libraries around the world have overlapping
> > content.  This might end up being really messy if every library exposes
> > its records independently.  I see it as a great approach for local
> > digital collections with unique content.
> >
> > -marshall breeding
> >  Executive Director, Vanderbilt Television News Archive
> >  Director for Innovative Technologies and Research,
> >            Vanderbilt University Library
> >
> >
> > -----Original Message-----
> > From: web4lib-bounces at webjunction.org
> > [mailto:web4lib-bounces at webjunction.org] On Behalf Of Martin Vojnar
> > Sent: Thursday, February 07, 2008 4:23 AM
> > To: Tim Spalding
> > Cc: Gem Stone-Logan; web4lib at webjunction.org
> > Subject: Re: [Web4lib] Re: Google Search Appliance and OPACs
> >
> > Dear all,
> >
> > we did something like this. We dumped our catalog
> > (http://aleph.vkol.cz, approx. 1 million records)
> > into static HTML pages, so crawlers could come
> > and take them. Every static page has a link to
> > the live record in the catalog.
> >
> > First we built a tree structure linking all the
> > records, so robots would start at the home page
> > (http://aleph.vkol.cz/pub) and find the rest of
> > the records. This worked, but it took Google
> > approx. 2 months to get all the records.
> >
> > So we switched to a sitemap solution
> > (http://aleph.vkol.cz/sitemap.xml) and Google
> > crawled/indexed everything in 2 weeks.
> >
> > Some stats say we got approx. 2000 new visitors every
> > day with an 80% bounce rate. Obviously there are
> > many follow-up questions (the world is not our
> > target, so why publish the catalog in Google
> > instead of local search engines, etc.), but this
> > was more or less just an experiment.
> >
> > Other crawlers (Yahoo, MSN) do not match Google's
> > performance and do not work with sitemap files
> > efficiently.
> >
> > BR, Martin
> >
> > On 6 Feb 2008 at 21:27, Tim Spalding wrote:
> >
> >> Has anyone tried just making a HUGE page of links and putting it
> >> somewhere Google will find it? Almost all OPACs allow direct links to
> >> records, by ISBN or something else. On a *few* (I've seen it on
> >> HiP), spidering this way causes serious session issues. (LibraryThing
> >> made this mistake once.) But it might be a way to get data into
> >> Google.
> >>
> >> Tim
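
For reference, here is a rough sketch of the sitemap approach that
Marshall and Martin describe in the quoted messages above. It is not
either site's actual script. The sitemap protocol caps a single sitemap
file at 50,000 URLs, so a catalog of around a million records needs
several sitemap files plus a sitemap index that points to them. The file
names and the base URL parameter are assumptions for illustration.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file URL limit in the sitemap protocol


def build_sitemap(urls):
    """Return the XML text of one sitemap listing the given record URLs."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="%s">' % SITEMAP_NS]
    for url in urls:
        # escape() entity-encodes &, < and > as the protocol requires.
        lines.append("  <url><loc>%s</loc></url>" % escape(url))
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


def build_sitemap_index(sitemap_urls):
    """Return the XML text of a sitemap index pointing at each sitemap."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<sitemapindex xmlns="%s">' % SITEMAP_NS]
    for url in sitemap_urls:
        lines.append("  <sitemap><loc>%s</loc></sitemap>" % escape(url))
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"


def build_all(urls, base_url, max_per_file=MAX_URLS_PER_FILE):
    """Split urls into sitemap files and return {filename: xml_text}.

    base_url is the (hypothetical) public location of the sitemap files
    and must end with a slash. The file names are arbitrary choices.
    """
    files = {}
    names = []
    for start in range(0, len(urls), max_per_file):
        name = "sitemap-%d.xml" % (len(names) + 1)
        files[name] = build_sitemap(urls[start:start + max_per_file])
        names.append(name)
    files["sitemap_index.xml"] = build_sitemap_index(
        [base_url + name for name in names])
    return files
```

Writing each entry of the returned dictionary to disk under the web root
and submitting the index URL to the search engines is then a separate
step.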

-- 
Harrison Dekker -- Coordinator of Data Services -- UC Berkeley Libraries
510-642-8095 :: GTalk:vagrantscholar :: AIM:hdekker :: Meebo:ucbdekker
------------------------
Q: Why is this email 5 sentences or less?
A: http://five.sentenc.es