[Web4lib] web site search engines

Thu Sep 22 15:31:31 EDT 2005

Hi Karen,

Although I am no Nutch or Lucene expert, I can relate some  
information based on my own experience as a Nutch user.

Regarding which search features are supported, you need to look
at the Lucene docs for specific answers.  Here is a very brief
overview (with links to more info):
http://lucene.apache.org/java/docs/features.html

Many of your questions seem directed at search query options (e.g.  
boolean, wildcard, fielded search).  This documentation reveals some  
of the query options supported:
http://lucene.apache.org/java/docs/queryparsersyntax.html
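
To give a flavor of what that page covers, here is a small sketch listing
a few query strings in the Lucene query parser syntax.  The field names
("title", "body") and the search terms are made up for illustration, and
the snippet only holds and prints the strings; it does not call Lucene.

```python
# Illustrative Lucene query-parser strings, keyed by the feature they show.
# Field names ("title", "body") and search terms are hypothetical.
EXAMPLE_QUERIES = {
    "phrase": '"digital library"',
    "boolean": 'catalog AND (opac OR "discovery layer")',
    "plus/minus": '+library -museum',
    "single-char wildcard": 'te?t',
    "multi-char wildcard": 'librar*',
    "fielded": 'title:"search engines" AND body:nutch',
    "fuzzy": 'roam~',
    "boosted term": 'lucene^4 nutch',
}

for feature, query in EXAMPLE_QUERIES.items():
    print(f"{feature:22} {query}")
```

On the "where" of wildcarding: as far as I can tell from that page, the
query parser does not accept * or ? as the first character of a term.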

Here is a book dedicated to Lucene:
http://lucenebook.com/

For what it's worth, I think the Lucene documentation is much farther  
along than the Nutch documentation.

Nutch provides a web crawler front-end to the Lucene search libraries  
(where all the IR stuff happens).  In Nutch you can configure all  
your crawler settings such as what URLs to crawl, how many links to  
follow, etc.  Tinkering with relevance algorithms is possible at the  
Lucene level.
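
As a rough illustration of what "what URLs to crawl" and "how many links
to follow" amount to, here is a minimal sketch of the idea.  It is not
Nutch code or a Nutch config file: the rule list, the example.org domain,
MAX_DEPTH, and should_crawl are all assumptions of mine, loosely modeled
on the +pattern / -pattern style of URL filter rules as I understand them.

```python
import re

# Hypothetical include/exclude rules in the "+pattern / -pattern" style:
# the first rule whose regex matches decides whether a URL is crawled.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)),
    ("+", re.compile(r"^http://([a-z0-9-]+\.)*example\.org/")),
    ("-", re.compile(r".")),  # reject everything else
]

MAX_DEPTH = 3  # hypothetical limit on how many links deep to follow


def should_crawl(url: str, depth: int) -> bool:
    """Return True if the URL is within the depth limit and the first
    matching rule is an include ("+") rule."""
    if depth > MAX_DEPTH:
        return False
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched


print(should_crawl("http://www.example.org/about.html", 1))
print(should_crawl("http://www.example.org/logo.gif", 1))
print(should_crawl("http://elsewhere.example.com/", 1))
print(should_crawl("http://www.example.org/deep/page.html", 4))
```

Rule order matters in a scheme like this: the image/script exclusion has
to come before the domain include, or those files would be accepted.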

I don't know of a zero-cost web interface for configuring
Lucene-based apps.  However, I have read that SearchBlox
(http://www.searchblox.com/) offers a product that is a web-based
admin interface to Lucene.

Hope this helps.

Tito

On Sep 22, 2005, at 2:08 PM, K.G. Schneider wrote:

> Some of the questions I have when I evaluate search engines don't seem to be
> answered on the Nutch pages. Is there a features page and I'm just missing
> it?
>
> Questions I have include what kind of searches it supports (quoted, nested,
> truncation, wildcarding [and where], Boolean), whether stemming is an option
> and what it uses for stemming (and can you add exceptions/changes), Boolean
> operator support (can you use Google-like plus or minus or are you stuck
> with 1990s terms), weighted field searching, synonym support, what kinds of
> indexes it builds, multi-format indexing, incremental indexing, spell-check
> support, thesauri support, fielded searching, rank-by-reputation, and a lot
> more.
>
> I want to know how the search engine handles punctuation and special
> characters (and what's configurable), document format support,
> post-coordination options... well, on and on. Then there is how easy it is to
> configure and how transparent its configuration is to a working
> organization: does it require geeky command line stuff, or can a
> knowledgeable manager enter a web or software interface to view or modify
> settings?
>
> How about result sorting? Deduping? Tinkering with relevance algorithms?
> Ranking overrides? Etc.
>
> I've evaluated Google client, and I too have a deal-breaking problem with it
> being "secret sauce." I also note that many of its capabilities are not
> switches. If Google doesn't believe in stemming, you don't get stemming as
> an option. I believe that's how it is configured at present. In
> metadata-reliant databases, that's a killer. Basically it's designed for
> organizations that aren't that interested in search and just want a
> reasonably good product so they can go back to selling socks or whatever.
>
> But we're librarians. I search, therefore I am. ;)
>
> Karen G. Schneider
> kgs at bluehighways.com