[Web4lib] web site search engines

Ross Singer ross.singer at library.gatech.edu
Thu Sep 22 15:21:29 EDT 2005


Karen,

Many of your questions are beyond my ability to answer, but some of 
them are answered here:
http://lucene.apache.org/java/docs/queryparsersyntax.html

Programmatically you could definitely do /most/ of the things you 
mention here (and, yes, that is a barrier to entry... but, frankly, so 
are many of your requirements -- in that few products would do this out 
of the box). 

Weighted field searching and control over the relevance algorithm are 
built into Lucene.  Spell-check, thesauri, synonyms, etc. could be 
(fairly) easily added (with some minor programming).
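
To make the weighted-field and synonym ideas a bit more concrete, here is 
a toy sketch in Python. It is not Lucene's API or its relevance 
algorithm; the field names, the weights, the synonym table, and the 
scoring rule (weighted term counts) are all assumptions chosen for 
illustration.

```python
from collections import Counter

# Hypothetical per-field weights: a hit in the title counts for more
# than a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "subject": 2.0, "body": 1.0}

# Hypothetical synonym table; a real one would be loaded from a thesaurus.
SYNONYMS = {"car": ["automobile"]}

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def expand(terms):
    """Return the query terms plus any synonyms listed for them."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded


def score(doc, query):
    """Sum the term counts in each field, multiplied by the field weight."""
    terms = expand(tokenize(query))
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        counts = Counter(tokenize(doc.get(field, "")))
        total += weight * sum(counts[t] for t in terms)
    return total


def search(docs, query):
    """Return ids of matching docs, best score first, ties broken by id."""
    scored = [(score(d, query), d["id"]) for d in docs]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ranked if s > 0]
```

With these weights, a page whose title contains "automobile" outranks a 
page that only mentions "car" in its body, because the query is expanded 
with the synonym and title hits are weighted more heavily.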

You can add any sort of structured data you want to the indexed page 
(Dublin Core, for example)... Lucene could store them as separate fields.
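
As a rough picture of what "separate fields" buys you, the Python sketch 
below keeps one posting list per (field, term) pair, so a search can be 
limited to a single metadata field. This is a simplified stand-in and 
not how Lucene is implemented; the class name, the page ids, and the 
sample values are hypothetical, and only the "dc.title" and "dc.creator" 
labels are borrowed from Dublin Core.

```python
from collections import defaultdict


class FieldedIndex:
    """Toy inverted index keyed by (field, term)."""

    def __init__(self):
        # Maps (field name, lowercased term) to the set of matching doc ids.
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        """Index every whitespace-separated term under its own field."""
        for field, value in fields.items():
            for term in value.lower().split():
                self.postings[(field, term)].add(doc_id)

    def search(self, field, term):
        """Return sorted ids of docs containing the term in that field."""
        return sorted(self.postings.get((field, term.lower()), set()))


index = FieldedIndex()
index.add("page1", {
    "dc.title": "Library hours",
    "dc.creator": "Smith",
    "body": "The library opens at nine",
})
index.add("page2", {
    "dc.title": "Staff directory",
    "dc.creator": "Jones",
    "body": "Smith works at the library",
})

# Only page1 lists Smith as its creator; page2 merely mentions the name.
print(index.search("dc.creator", "smith"))
print(index.search("body", "smith"))
```

Because the creator and the body text are indexed separately, a search 
on "dc.creator" does not pick up pages that only mention the name in 
passing.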

It sounds, though, like your needs are a little more specialized than 
the average "site search", which has very little metadata to work with.

-Ross.

K.G. Schneider wrote:

>Some of the questions I have when I evaluate search engines don't seem to be
>answered on the Nutch pages. Is there a features page and I'm just missing
>it? 
>
>Questions I have include what kind of searches it supports (quoted, nested,
>truncation, wildcarding [and where], Boolean), whether stemming is an option
>and what it uses for stemming (and can you add exceptions/changes), Boolean
>operator support (can you use Google-like plus or minus or are you stuck
>with 1990s terms), weighted field searching, synonym support, what kinds of
>indexes it builds, multi-format indexing, incremental indexing, spell-check
>support, thesauri support, fielded searching, rank-by-reputation, and a lot
>more. 
>
>I want to know how the search engine handles punctuation and special
>characters (and what's configurable), document format support,
>post-coordination options... well on and on. Then there is how easy it is to
>configure and how transparent is its configuration to a working
>organization: does it require geeky command line stuff, or can a
>knowledgeable manager enter a web or software interface to view or modify
>settings? 
>
>How about result sorting? Deduping? Tinkering with relevance algorithms?
>Ranking overrides? Etc.
>
>I've evaluated Google client, and I too have a deal-breaking problem with it
>being "secret sauce." I also note that many of its capabilities are not
>switches. If Google doesn't believe in stemming, you don't get stemming as
>an option. I believe that's how it is configured at present. In
>metadata-reliant databases, that's a killer. Basically it's designed for
>organizations that aren't that interested in search and just want a
>reasonably good product so they can go back to selling socks or whatever.
>
>But we're librarians. I search, therefore I am. ;) 
>
>Karen G. Schneider
>kgs at bluehighways.com
>
>_______________________________________________
>Web4lib mailing list
>Web4lib at webjunction.org
>http://lists.webjunction.org/web4lib/

