[Web4lib] Federated searching-general question re sub groupings

Peter Noerr pnoerr at MuseGlobal.com
Thu May 17 11:40:32 EDT 2007


Katy, 

Thanks for the additional reasons. I've commented on each below. My main take from this is that the difficulties are not technical, but commercial/legal (are we allowed?), or practical (is it  useful?). These problems have been with us for ages, and are not unique to FS - it just tends to throw some of them into sharper relief. 

Certainly FS is another time-saving tool the library can use to make itself the 'go to' place for information. The fewer places the user has to hop between to get information (or anything else), the better, and that is what FS is for.

Peter

> -----Original Message-----
> From: web4lib-bounces at webjunction.org 
> [mailto:web4lib-bounces at webjunction.org] On Behalf Of Kathryn 
> Silberger
> Sent: Thursday, May 17, 2007 6:21 AM
> To: web4lib at webjunction.org
> Subject: RE: [Web4lib] Federated searching-general question 
> re sub groupings
> 
> Peter:
> 
>       Steve listed several good reasons why databases 
> shouldn't be included in a federated search engine. Here are a few more reasons 
> they can't or shouldn't be included:
> 
> 1)  The database does not want to be included.  Initially 
> Jstor asked the FS vendors not to target Jstor because of concerns about the 
> load it would put on their servers. (Not the case at present.)  There was 
> some talk that Google didn't want to be a target because the FSE (federated 
> search engine) bypasses the ads that Google has on their results list.  Some 
> very good museum sites put "no robots" in their metadata.  I imagine 
> they probably won't want to be FS targets either.
<PLN: Of course not wanting to be is a very good reason they _should_ not be included, not a reason they _could_ not be. I'm not advocating 'forced access' in any way, all I want is to differentiate between commercial/legal prohibition and technical/content difficulty. 

Interestingly, load has not proved to be a problem in practice. Given the capability, most users limit the extent of their search to what they consider appropriate sources, so the expected 'overspill' searches don't really happen. It's also arguable (research anyone?) that, given the extra post-processing tools of most FS, the user is more likely to find an answer with fewer 'probing' searches.

Your last couple of sentences bring up an interesting nomenclature point. Despite this activity having plenty of names to go round (metasearch, federated search, distributed search, parallel search, etc.), there are no agreed definitions of which is which. The sense in which FS is used in the library world does not involve robots. Unfortunately, in the commercial/web world it is a different beast and does often involve what we would call harvesting - hence the prohibition. We have found that an explanation of the process eventually leads to access being allowed. This is an activity where the vendors are doing as much education as selling. />
> 
> 2)  The database does not have an XML gateway or a Z39.50 server, and no
> screen-scraper has been written for it.  With more specialized targets
> that may be the case.  Our federated search vendor takes requests for
> screen-scrapers.  They enable dozens of targets each month, but I get
> the feeling they have a list of requests they are working on.
<PLN: I have to make my standard plea for accuracy here. What you refer to as "screen scraping" is, in fact, "HTML Parsing". Screen scraping is a different technique, as those of us of a certain age will remember, which literally involved using screen positions (line and column) to identify information on a 24 line x 80 column display.

HTML parsing is record parsing involving techniques exactly like those for extracting data from MARC or XML structured records. (end of rant).

The format of the records (and the protocol and search language) is no barrier to using that source. All sources (even Z39.50 accessed catalogs) need a specially configured connector. HTTP/HTML sources are no different in this respect. 

The difference is that the HTTP/HTML format is volatile and changes every time the site owner wants it to. The interchange formats are designed and defined to be stable over a period of years. So HTTP/HTML connectors 'break' and need to be adapted to the new result structure. This is a service issue not a technical prohibition.

In terms of process, all the FS vendors have a queue of connectors to build or configure (either themselves as a service, or in conjunction with their customers). This includes work for non-HTTP/HTML sources. The number in their library and the rate of production (and maintenance) depend on the vendor. We have over 4,500 connectors in our library (many for very specialized sources), the majority of which are HTTP/HTML. />
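
To make the distinction above concrete, here is a minimal sketch in Python contrasting the two techniques. It is an illustration only: the sample screen line, the column positions, the markup and the class names "record", "title" and "year" are all invented for the example and do not come from any real source or any vendor's connector.

```python
from html.parser import HTMLParser

# Hypothetical 80-column terminal line: the title occupies columns 0-39
# and the year starts at column 40 (positions invented for illustration).
SCREEN_LINE = "Federated Search Basics".ljust(40) + "2007".ljust(40)


def scrape_screen_line(line):
    """Screen scraping: pick fields out by fixed column position."""
    return {"title": line[0:40].strip(), "year": line[40:44].strip()}


# Hypothetical result markup; the class names are assumptions.
RESULT_HTML = """
<div class="record"><span class="title">Federated Search Basics</span>
<span class="year">2007</span></div>
"""


class RecordParser(HTMLParser):
    """HTML parsing: pick fields out by the structure of the markup."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "record":
            self.records.append({})
        elif tag == "span" and css_class in ("title", "year") and self.records:
            self._field = css_class

    def handle_data(self, data):
        if self._field and self.records:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_html_records(markup):
    """Return one dict per record found in the markup."""
    parser = RecordParser()
    parser.feed(markup)
    parser.close()
    return parser.records
```

If the hypothetical site renamed its "title" class, the parser would simply stop finding that field, which is the kind of breakage that makes HTTP/HTML connectors a maintenance issue rather than a technical barrier.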
> 
> 3)  The database has some idiosyncrasy that makes its results only
> marginally useful.  Lexis Nexis is an example of this.  They still fail
> a search if it produces more than 1,000 results.  I can't understand why
> they don't cut off the search and return the first 1,000, but there you
> have it.
> If you are doing a federated search for some less popular legal event,
> including Lexis Nexis makes sense.  But if the legal event is a little too
> popular, or worse the search is focused on a very popular event or topic,
> no results are returned.  I'm not sure how this new release will behave with FSEs.
<PLN: Again not a technical prohibition (in fact Lexis-Nexis is a popular Source), but this policy doesn't make much sense. They are not alone in it, and much worse are some popular sites which fill the results list with their "top results" if they have nothing that actually matches the search! Of course, you have the same problem if you search directly, so this really has nothing to do with including them in FS or not. This could be a situation that changes. />
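
As a rough illustration of the point above, the following Python sketch shows one way a federated search layer could keep going when a single source refuses an over-broad query. Everything here is hypothetical: the function names, the exception and the toy data are invented, and the limit of 1,000 is taken only from the description in this thread, not from any real Lexis-Nexis interface.

```python
class TooManyResults(Exception):
    """Raised by a hypothetical connector when its source rejects a broad search."""


def make_source(records, limit=None):
    """Build a toy search function over a list of titles.

    If `limit` is set, the source fails outright when a query matches more
    than `limit` records, instead of returning the first `limit`.
    """
    def search(query):
        matches = [r for r in records if query.lower() in r.lower()]
        if limit is not None and len(matches) > limit:
            raise TooManyResults(
                f"{len(matches)} matches exceed the limit of {limit}"
            )
        return matches
    return search


def federated_search(query, sources):
    """Query every source; collect results and note which sources failed."""
    results, failures = {}, {}
    for name, search in sources.items():
        try:
            results[name] = search(query)
        except TooManyResults as exc:
            # One refusing source should not sink the whole search.
            failures[name] = str(exc)
    return results, failures


# Toy sources: one capped at 1,000 matches, one without a cap.
sources = {
    "capped": make_source([f"Trial report {i}" for i in range(1500)], limit=1000),
    "uncapped": make_source(["Trial procedure handbook", "Contract law primer"]),
}
results, failures = federated_search("trial", sources)
```

With the broad query "trial", the capped source lands in `failures` while the uncapped one still returns its match; a narrower query would bring the capped source back into `results`. The same refusal would occur when searching the capped source directly, so the sketch only shows that the federated layer can report it without losing the other sources.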
> 
>      As time goes by, I imagine these problems will be worked out.
> Offering a FSE is one way a college or university can make their website
> the "information destination" for their students.  As students are
> overwhelmed with website destinations, I believe they will have their
> preferred social destination, photo destination, shopping destination,
> etc., and their academic info destination.  To be a player in that sort
> of world, the database producers will have to work with the FS vendors.
> 
> Katy
> 
> Kathryn K. Silberger
> Automation Resources Librarian
> James A. Cannavino Library
> Marist College
> 3399 North Road
> Poughkeepsie, NY  12601
> Kathryn.Silberger at marist.edu
> (845) 575-3000 x.2419
> 
> 
> Steve Cramer SMCRAMER <smcramer at uncg.edu>
> Sent by: web4lib-bounces at webjunction.org
> 05/14/2007 08:54 AM
> 
> To
> web4lib at webjunction.org
> cc
> 
> Subject
> RE: [Web4lib] Federated searching-general question re sub groupings
> 
> 
> 
> 
> Well, there could be a number of reasons why certain databases can't be
> included in a federated search, or probably shouldn't be. Numeric
> databases, pay-per-search databases, and databases with a small number of
> concurrent users are examples.
> 
> --Steve
> ___________________________________________________
> Steve Cramer
> Librarian for Accounting, Apparel, Business, & Economics
> University of North Carolina at Greensboro
> smcramer at uncg.edu, 336-256-0346, AIM: stevebizlib
> 
> 
> 
> "Peter Noerr" <pnoerr at MuseGlobal.com>
> Sent by: web4lib-bounces at webjunction.org
> 05/10/2007 06:10 PM
> 
> To
> <web4lib at webjunction.org>
> cc
> 
> Subject
> RE: [Web4lib] Federated searching-general question re sub groupings
> 
> 
> 
> 
> 
> 
> One question and one observation:
> 
> Question:
> 
> Kathryn (in her ACRL presentation) and one other poster on this thread
> have mentioned that certain databases "cannot be searched by federated
> search" (or similar, more succinct phrasing). I am intrigued to know what
> some examples of the databases are, or what the characteristics are which
> make them unsearchable by a federated search engine.
> 
> Observation:
> We have noticed a growing trend in both the corporate and library use of
> federated search towards the use of "subject verticals". The reasons are
> all over the place, but one major theme is that users want fewer, but
> better 'quality' results. If the user is already in a subject specialized
> part of the web site, then the expectation seems to be that they will get
> only very relevant material. And the converse: if they are on the front
> page, they will get all sorts of stuff.
> 
> Also it is easier to consider moving a specialized search box out to the
> place where the users are likely to be (a course web site, or project
> collaboration page, for example) thus getting them to use the library
> without having to be there. (This mixes with another thread, but it does
> seem to be a trend to move specialist access out to where people are
> working.)
> 
> 
> Disclaimer:
> In the interests of full disclosure; MuseGlobal is a major commercial
> developer and OEM vendor of search management software, which includes
> federated search and results analysis components.
> 
> Peter
> 
> Dr Peter L Noerr
> CTO, MuseGlobal, Inc.
> 
> +1 801 208 1880
> www.museglobal.com
> 
> 
> > -----Original Message-----
> > From: web4lib-bounces at webjunction.org
> > [mailto:web4lib-bounces at webjunction.org] On Behalf Of Kathryn
> > Silberger
> > Sent: Thursday, May 10, 2007 6:10 AM
> > To: web4lib at webjunction.org
> > Subject: Fw: [Web4lib] Federated searching-general question
> > re sub groupings
> >
> > Lisa:
> >
> >       I think you have asked some good questions.  I am at Marist College
> > and we have been using federated search since fall of 2005.  Our students
> > have been receptive and positive about it.  We have it front and center on
> > our home page and we have seen article usage skyrocket.  When we set it up
> > we tried to look at searching from the student's perspective, and that led
> > us to use the terminology of the Registrar's office.  Each of our federated
> > groupings bears the name of a major awarded by the college.  That is the
> > terminology that guides their overall academic experience and we have found
> > that it works well for grouping databases into federated searches.  I agree
> > with you that students don't want to have to consider lots of choices
> > before searching.  They live with a fair number of web destinations for
> > broad life activities, i.e. socializing, banking, travel, shopping -- I
> > believe they would like the library to be a single destination.
> >
> >       You are quite right about the clustering.  Students have been
> > conditioned by other web searching experience to use clusters to filter
> > search results.  (They want the movie, not the book at Amazon - they filter
> > via cluster.)  About 80% - 90% of the time the clustering will create a
> > very relevant subset.  Those proposed sub-groupings would have some general
> > academic databases and they would need to use the clustering regardless.  I
> > have found that newspapers can present a problem in certain situations.  If
> > a technical topic has been in the news for whatever reason, you can get the
> > first page of results with too many newspaper articles.
> >
> >       We gave a paper on federated searching at ACRL this year.  We
> > put up our paper, PowerPoint and a couple of Flash demos at
> > http://library.marist.edu/ACRL/Foxhunt_demo.html .  You can see the
> > clustering in each of the Flashes.
> >
> >               Good luck.  I think you are on the right track.
> >
> >
> > Katy
> >
> > Kathryn K. Silberger
> > Automation Resources Librarian
> > James A. Cannavino Library
> > Marist College
> > 3399 North Road
> > Poughkeepsie, NY  12601
> > Kathryn.Silberger at marist.edu
> > (845) 575-3000 x.2419
> >
> >
> >
> >
> > "Pons, Lisa (ponslm)" <PONSLM at UCMAIL.UC.EDU>
> > Sent by: web4lib-bounces at webjunction.org
> > 05/09/2007 10:18 AM
> >
> > To
> > <web4lib at webjunction.org>
> > cc
> >
> > Subject
> > [Web4lib] Federated searching-general question re sub groupings
> >
> > I have a general question- sorry this is so long!
> >
> > We're a few steps away from implementing our new federated search tool.
> > It has been an interesting experience!
> >
> > I have some questions regarding how this tool is seen across your
> > institutions - that is, what is the vision for its use?
> >
> > For example, we have created our tool with 21 subject categories. Now,
> > some of our subject specialists want to create sub-categories, choose
> > their own databases to be searched, and put a search box on their subject
> > guide pages that will only search within their sub-category.
> >
> > For example, on our main federated page, we have Earth and Environmental
> > Sciences, which includes 10 databases to be searched. Now, the subject
> > specialist wants to create a sub-category for Geography and put the
> > search box on her subject guide page. The sub-category may or may not have
> > the same databases as the main Earth and Environmental Sciences
> > category.
> >
> > My question is, won't this confuse users?  Does this partially defeat
> > the purpose of a "federated search" by limiting the search to a very
> > slender set of resources? We are using Serials Solutions Central Search,
> > which has Vivisimo to cluster results - shouldn't that be enough?
> >
> > Isn't this kind of library 1.0 thinking - that every tool must be
> > separate, and to find this, you must go there, to find that, you must go
> > somewhere else?
> >
> > I need help here - if I am wrong I need to shut up about it with my
> > colleagues; if I am right, I need help from all the experts out there
> > explaining why it is wrong.
> >
> > Thanks!
> > _______________________________________________
> > Web4lib mailing list
> > Web4lib at webjunction.org
> > http://lists.webjunction.org/web4lib/
> >
> >

