[Web4lib] Federated Search versus Crawler or Spider

Ross Singer ross.singer at library.gatech.edu
Tue Jul 3 07:39:02 EDT 2007


Is it a safe assumption that all targets can handle some sort of
standardized federated search request?  What do they support?  Z39.50?
 SRU?  OpenSearch?  Would they need to be screenscraped?

Do any of these targets have OAI-PMH providers so you can harvest the
data and index it locally?

Your downsides of federated search are:
1) performance - it's not quick (or easy, honestly) to search a bunch
of targets and merge and present the results
2) the onus is on the content providers to provide a standardized
search interface - you lose all control over what is indexed, how it's
indexed, and how search results are presented
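
To make the first downside concrete, here is a rough sketch of the
fan-out-and-merge step. Everything in it is an assumption for
illustration: the target functions (site_a, site_b, site_down) are
stand-ins for whatever Z39.50/SRU/OpenSearch/screenscraping client each
site would really need, and the "score" field is hypothetical. Each
target function is assumed to enforce its own network timeout.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def federated_search(targets, query):
    """Query every target concurrently; return (merged_hits, failed_names).

    `targets` maps a target name to a callable taking the query string and
    returning a list of dicts with at least a "title" key.
    """
    merged, failed = [], []
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        futures = {pool.submit(fn, query): name for name, fn in targets.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                hits = future.result()
            except Exception:
                # One broken target should not sink the whole search.
                failed.append(name)
                continue
            merged.extend({**hit, "source": name} for hit in hits)
    # Scores from different targets are not really comparable, so this
    # ordering is only a crude heuristic.
    merged.sort(key=lambda h: (-h.get("score", 0.0), h["title"]))
    return merged, sorted(failed)


# Hypothetical stand-ins for real per-site search clients.
def site_a(query):
    return [{"title": f"{query} on pasture", "score": 0.9}]


def site_b(query):
    return [{"title": f"{query} nutrition", "score": 0.4}]


def site_down(query):
    raise ConnectionError("target unavailable")


results, failed = federated_search(
    {"a": site_a, "b": site_b, "down": site_down}, "sheep"
)
```

Even in this toy form, the total response time is bounded by the slowest
target, and the merged ranking is only as good as whatever each target
chooses to return.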

The downsides to crawling are:
1) Parsers would need to be written that understand what the crawler
is looking at for each provider
2) The quality of your metadata might suffer based on the information
that you can effectively glean from the crawled pages
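
As a rough illustration of what "a parser per provider" might look like,
the sketch below pulls the <title> and <meta> tags out of a crawled page
and then applies per-provider rules to decide which tag holds which
field. The provider names and the PROVIDER_RULES table are made up for
the example; real sites would need their own rules, and many pages will
not carry usable meta tags at all, which is exactly the metadata-quality
problem.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the <title> text and all <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical rules: which meta tag (if any) holds each field per provider.
PROVIDER_RULES = {
    "example-dept.test": {"title": "dc.title", "date": "dc.date"},
    "example-producer.test": {"title": None, "date": "last-modified"},
}


def parse_page(provider, html):
    """Return a small metadata record for one crawled page."""
    extractor = MetaExtractor()
    extractor.feed(html)
    record = {}
    for field, meta_name in PROVIDER_RULES[provider].items():
        value = extractor.meta.get(meta_name) if meta_name else None
        if field == "title" and not value:
            value = extractor.title.strip()  # fall back to the <title> element
        record[field] = value or None
    return record
```

A missing field comes back as None, so the indexer can at least tell
"no date found" apart from an empty string.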

The OAI harvesting method, in my opinion, is your best option.  You
can harvest daily (or even more often) and since you'd at least be
assured of getting an oai_dc record, you'd have structured metadata to
index locally, which you can search and present however you
like.
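
For what it's worth, a minimal sketch of the harvesting side might look
like the following. It only builds ListRecords request URLs and parses an
oai_dc response; the base URL, identifiers, and sample XML are invented
for the example, and a real harvester would also need to fetch the URLs,
loop on the resumptionToken, and remove deleted records from the local
index rather than just skipping them.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, from_date=None, resumption_token=None):
    """Build a ListRecords request; a resumptionToken replaces other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if from_date:
            params["from"] = from_date  # e.g. yesterday, for a daily harvest
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token_or_None) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        header = rec.find("oai:header", NS)
        if header is None or header.get("status") == "deleted":
            continue  # skipped here; a real harvester would de-index these
        records.append({
            "identifier": header.findtext("oai:identifier", default="", namespaces=NS),
            "titles": [e.text for e in rec.iterfind(".//dc:title", NS) if e.text],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token.strip() or None)


# Invented sample response, trimmed to the elements the parser reads.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.test:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Managing worms in sheep</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header status="deleted"><identifier>oai:example.test:2</identifier></header>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

records, token = parse_list_records(SAMPLE)
```

The records that come back can go straight into whatever local index you
prefer, which is where the control over searching and presentation comes
from.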

My guess is your search is going to have to be some combination of the
three, depending on what you're dealing with on the other end.

-Ross.


On 7/2/07, McIntyre, Ruth <RMcintyre at agric.wa.gov.au> wrote:
> I must declare my hand and say that I strongly favour the federated search
> option.  My preference for federated searching is based on the fact that it
> is "up front", it is clear what sites you are searching, the search
> functionality should replicate the functionality of searching the target
> sites, the information retrieved is absolutely current and there is no
> negative impact on the target sites.
> The argument in favour of spidering is based on the belief that it will be
> cheaper and have better searching functionality. It may well be cheaper, but
> I am unconvinced about the functionality and the quality control on the
> information retrieved.
> I am also apprehensive about undermining goodwill towards the Livestock
> Library if we propose using a spider to crawl other websites.  Just as with
> federated searching we would seek permission, as IT staff could block an
> unfamiliar spider.
> Do you have any thoughts on this?  Am I just being too conservative?
> Ruth McIntyre
> Manager, Livestock Library
>
>
> -----Original Message-----
> From: web4lib-bounces at webjunction.org
> [mailto:web4lib-bounces at webjunction.org] On Behalf Of Thomas,Dylan
> Sent: Monday, 2 July 2007 6:36 PM
> To: 'web4lib at webjunction.org'
> Subject: Re: [Web4lib] Federated Search versus Crawler or Spider
>
> McIntyre, Ruth wrote:
> > Dear Colleagues
> >
> > I manage the Livestock Library, http://www.livestocklibrary.com.au/ ,
> > an online database of links to over 22,000 journal articles and
> > conference papers about livestock production developed for the benefit
> > of all participants in Australia's livestock industries.
> >
> > We wish to also provide access to information that has been written for
> > producers (farmers); however producer information changes frequently, so
> > it is important to access current information from its publishers' sites.
> > We are having a debate as to whether a federated search or a
> > spider/crawler is the best way to access this information.  There are
> > probably about 20 - 30 target sites, most of which are freely available
> > to the public and do not require authentication.
> >
> > We have trialled a federated search and are reasonably confident it will
> > work with the target sites, though it is preferable for target sites to
> > have advanced search screens, which is not always the case.
> >
> > I would be interested to know the opinion of list members.
> >
> > Kind regards
> >
> > Ruth McIntyre
> > Manager, Livestock Library
> > c/- Department of Agriculture and Food, Western Australia
> > 3 Baron Hay Court
> > South Perth  WA  6151
> >
> > Tel:  61 8 9368 3611
> > Mob:  0409 688 546
> > Fax:  61 8 9368 4051
> > e-mail:  rmcintyre at agric.wa.gov.au
> >
> > www.livestocklibrary.com.au
> > The leading Australian site for credible beef and sheep industry
> > publications.
> >
> > I work part-time, on Mondays, Tuesdays and Wednesdays.
> >
> > This e-mail and files transmitted with it are privileged and confidential
> > information intended for the use of the addressee. The confidentiality
> > and/or privilege in this e-mail is not waived, lost or destroyed if it has
> > been transmitted to you in error. If you received this e-mail in error you
> > must
> > (a) not disseminate, copy or take any action in reliance on it;
> > (b) please notify the Department of Agriculture and Food, WA immediately
> > by return e-mail to the sender;
> > (c) please delete the original e-mail.
> >
> > This email has been successfully scanned by
> > McAfee Anti-Virus software.
> > Department of Agriculture and Food WA
> > _______________________________________________
> > Web4lib mailing list
> > Web4lib at webjunction.org
> > http://lists.webjunction.org/web4lib/
>
> It appears that federated search is perhaps best suited, given the apparent
> limitations on search bots' capabilities in accessing data sources
> (though I may be wrong!).
>
> Are there obvious advantages and disadvantages that you can highlight?
>
> --
> This email and any attachments may contain confidential material and
> is solely for the use of the intended recipient(s).  If you have
> received this email in error, please notify the sender immediately
> and delete this email.  If you are not the intended recipient(s), you
> must not use, retain or disclose any information contained in this
> email.  Any views or opinions are solely those of the sender and do
> not necessarily represent those of the University of Wales, Bangor.
> The University of Wales, Bangor does not guarantee that this email or
> any attachments are free from viruses or 100% secure.  Unless
> expressly stated in the body of the text of the email, this email is
> not intended to form a binding contract - a list of authorised
> signatories is available from the University of Wales, Bangor Finance
> Office.  www.bangor.ac.uk

