FW: [Web4lib] Federated Search versus Crawler or Spider

Michael L. Champion mchampion at lvdl.org
Tue Jul 3 12:22:26 EDT 2007


Perhaps federated search is the wrong technology.  

Wouldn't it be better if we could load the metadata and location
information directly into a database that we control?  This would allow
us to decide what the appropriate metadata might be, how search results
would be returned, etc.
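For illustration, the "database that we control" idea can be sketched in a few lines. This is a minimal sketch, assuming a locally held SQLite store; the schema and field names are invented for this example, not anything proposed in the original post:

```python
import sqlite3

# A locally controlled metadata store: we pick the fields, we pick how
# results come back. The schema below is an illustrative assumption.
conn = sqlite3.connect(":memory:")  # use a file path in practice
conn.execute("""
    CREATE TABLE records (
        id       INTEGER PRIMARY KEY,
        title    TEXT NOT NULL,
        creator  TEXT,
        vendor   TEXT,            -- which provider the record came from
        location TEXT NOT NULL    -- URL where the full content lives
    )
""")

# Harvested metadata goes straight into our own database...
conn.execute(
    "INSERT INTO records (title, creator, vendor, location) VALUES (?, ?, ?, ?)",
    ("Example Article", "Doe, Jane", "SomeVendor",
     "http://vendor.example.com/articles/123"),
)

# ...so we decide what "search" means and what a result looks like.
rows = conn.execute(
    "SELECT title, location FROM records WHERE title LIKE ?", ("%Example%",)
).fetchall()
print(rows)
```

Because the metadata and location live locally, result ranking, display, and access control are all local decisions rather than vendor ones.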

The TOCROSS project in Great Britain (in, I believe, 2003) was a proof
of concept of sorts
(http://www.jisc.ac.uk/whatwedo/programmes/programme_pals2/project_tocross.aspx).

Or, we could do what Google does and contract to allow us to crawl the
vendor's database and pull back the information that we feel is
important.  

There would be some technical problems with de-duping and incomplete
records but nothing that couldn't be worked out.
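De-duping across harvested sources usually comes down to computing a normalized fingerprint per record and collapsing matches, keeping the most complete copy. A minimal sketch, where the fields and normalization rules are assumptions for illustration only:

```python
import re

def fingerprint(record):
    """Build a crude match key from title + creator.

    Lowercases, strips punctuation, and collapses whitespace so that
    near-identical records from different vendors produce the same key.
    """
    raw = (record.get("title", "") + "|" + record.get("creator", "")).lower()
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s|]", "", raw)).strip()

def dedupe(records):
    """Keep the most complete record (most non-empty fields) per fingerprint."""
    best = {}
    for rec in records:
        key = fingerprint(rec)
        completeness = sum(1 for v in rec.values() if v)
        if key not in best or completeness > best[key][0]:
            best[key] = (completeness, rec)
    return [rec for _, rec in best.values()]

records = [
    {"title": "Digital Libraries", "creator": "Smith, A.", "issn": ""},
    {"title": "Digital  Libraries!", "creator": "smith, a.", "issn": "1234-5678"},
]
merged = dedupe(records)
print(len(merged))  # the two variant records collapse to one
```

Incomplete records fall out of the same pass: when two records share a fingerprint, the fuller one wins, so partial vendor records get superseded rather than duplicated.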

As an added bonus, once the data and links are in our database,
controlling authorized access to the vendor's content would be much,
much easier.

For those who want to make sure the content they link to is always
available, there is the LOCKSS project
(http://www.lockss.org/lockss/Home). Then we would hold both the
metadata and the data of the content we purchase. (If the vendors are
willing to give us contracts that say we can, of course.)

I can see a three-layer contracting scheme growing out of this: one
contract for harvesting the metadata, one for accessing the content, and
one for holding the content.

Regards,

Michael Champion
Head, Information Technology Services
Lake Villa District Library


-----Original Message-----
From: web4lib-bounces at webjunction.org
[mailto:web4lib-bounces at webjunction.org] On Behalf Of Danielle Plumer
Sent: Tuesday, July 03, 2007 9:13 AM
To: web4lib at webjunction.org
Subject: RE: [Web4lib] Federated Search versus Crawler or Spider

I've been working on a federated search project here in Texas
(www.texasheritageonline.org), and the one lesson I think we've learned
is that if you rely on a single protocol you're not going to get very
far. I should note that my project is mostly looking at freely available
materials, although a sister site, www.libraryoftexas.org, does
federated search across licensed content.

We have sites that support OAI-PMH (with remarkably little
standardization of metadata), sites on Z39.50, sites using SRU. We have
large legacy databases for which we're developing procedures to add one
of the above protocols (SRU particularly is open to community
development, BTW). And I hope in the future to add support for indexing
web-harvested material, possibly using OpenSearch (another relatively
easy to implement option). 
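To make the OAI-PMH case concrete, here is a minimal sketch of pulling Dublin Core fields out of a ListRecords response. The sample XML is fabricated for demonstration; a real harvester would also handle resumption tokens, deleted records, and the wide variation in how repositories actually populate these fields:

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OAI-PMH and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A fabricated, pared-down ListRecords response.
SAMPLE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Texas Heritage Sample</dc:title>
          <dc:creator>Plumer, D.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

def extract_records(xml_text):
    """Yield (identifier, title, creator) tuples from a ListRecords payload."""
    root = ET.fromstring(xml_text)
    for rec in root.findall(".//oai:record", NS):
        ident = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        title = rec.findtext(".//dc:title", namespaces=NS)
        creator = rec.findtext(".//dc:creator", namespaces=NS)
        yield ident, title, creator

records = list(extract_records(SAMPLE))
print(records)
```

The parsing itself is the easy part; the "remarkably little standardization" shows up in which dc elements a given repository bothers to fill in, and with what conventions.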

So, I'd say the onus is on the federated search developer to support
multiple protocols and to normalize widely heterogeneous result sets. I
don't know that we'll ever get to a point where I can say that *every*
resource is included, but I do think we can include everything that
Google and Yahoo! can get to, and then some.
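That normalization step largely amounts to mapping each source's native field names onto one internal schema. A toy sketch, where the source names and field mappings are invented for illustration:

```python
# Per-source mappings from native field names to a common internal schema.
# Both the source labels and the mappings are illustrative assumptions.
FIELD_MAPS = {
    "oai_dc":  {"dc:title": "title", "dc:creator": "author", "dc:date": "date"},
    "marcish": {"245a": "title", "100a": "author", "260c": "date"},
}

def normalize(source, raw):
    """Translate one raw result into the common schema, dropping unknown keys."""
    mapping = FIELD_MAPS[source]
    return {mapping[k]: v for k, v in raw.items() if k in mapping}

a = normalize("oai_dc", {"dc:title": "Alamo Papers", "dc:creator": "Unknown"})
b = normalize("marcish", {"245a": "Alamo Papers", "100a": "Unknown", "999x": "?"})
print(a == b)  # two protocols, one result shape
```

Once every source emits the same shape, ranking, merging, and de-duping can be written once instead of per protocol.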

Danielle Cunniff Plumer, Coordinator
Texas Heritage Digitization Initiative
Texas State Library and Archives Commission
512.463.5852 (phone) / 512.936.2306 (fax)
dplumer at tsl.state.tx.us

-----Original Message-----
From: web4lib-bounces at webjunction.org
[mailto:web4lib-bounces at webjunction.org]On Behalf Of Andrew Ashton
Sent: Tuesday, July 03, 2007 7:59 AM
To: Ross Singer; McIntyre, Ruth
Cc: web4lib at webjunction.org
Subject: RE: [Web4lib] Federated Search versus Crawler or Spider


 Ross Singer wrote:
"2) the onus is on the content providers to provide a standardized
search interface - you lose all control about what is indexed/how it's
indexed and how search results are presented"

This seems to be the kiss of death for any really useful federated
search project.  Remember the
OCLC SiteSearch project?  We did a proof-of-concept test here at
Skidmore College and quickly discovered that the lack of vendor
standardization made it impractical.  Librarians are nothing if not
completists, and offering a federated search service that only covered
2/3 of our resources wasn't going to cut it.  Sure, you could buy a
federated search product and let the vendor worry about maintaining
access to the non-standardized targets, but any technology that
precludes community development ought to be considered dead in the
water.

--
Andrew Ashton
Systems Librarian
Scribner Library, Skidmore College
(518)580-5505




_______________________________________________
Web4lib mailing list
Web4lib at webjunction.org
http://lists.webjunction.org/web4lib/

