[Web4lib] web site search engines

Thu Sep 22 11:17:04 EDT 2005

For a standard website search application I would recommend Nutch
(particularly if you already have a Tomcat server running):

"Nutch is open source web-search software. It builds on Lucene Java,  
adding web-specifics, such as a crawler, a link-graph database,  
parsers for HTML and other document formats, etc."

Here is a tutorial which provides an overview of what's involved in  
using Nutch (ignore the "whole-web crawling" section):
http://lucene.apache.org/nutch/tutorial.html

The Nutch crawler is relatively easy to configure via XML files.   
When you download Nutch you also get a .war file which contains a  
starter JSP script for displaying search results.  So you can get a  
prototype of your library search application running in a few hours  
(assuming you have Tomcat already set up).  After that, you will  
spend most of the time tweaking the crawl filter, and tweaking your  
presentation script.
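
To give a feel for what tweaking the crawl filter involves, here is a
rough sketch in Python (not Nutch code) of how an ordered list of
include/exclude regex rules can decide which URLs get crawled.  It
assumes a rule format where "+" means include, "-" means exclude, and
the first matching rule wins, with unmatched URLs skipped.  The host
library.example.edu, the rule list, and the function names are made up
for illustration; check the filter file shipped with your Nutch
download for the real syntax and defaults.

```python
import re

# Hypothetical rules: '+' includes, '-' excludes, first match wins.
# Lines starting with '#' are comments.
RULES = r"""
# skip non-http schemes
-^(file|ftp|mailto):
# skip images, stylesheets, scripts and archives
-\.(gif|jpg|png|css|js|zip)$
# skip URLs that look like dynamic queries
-[?*!@=]
# accept anything on the (made-up) library host
+^http://([a-z0-9]*\.)*library\.example\.edu/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn the rule text into a list of (include, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return True if the first rule that matches the URL is an include rule."""
    for include, regex in rules:
        if regex.search(url):
            return include
    return False  # no rule matched, so the URL is skipped


rules = parse_rules(RULES)

if __name__ == "__main__":
    for url in ("http://www.library.example.edu/help/hours.html",
                "http://www.library.example.edu/logo.gif",
                "http://www.example.com/"):
        print(url, accepts(url, rules))
```

Because the first match wins, rule order matters: the exclusions sit
above the host rule so that images and query URLs on your own site are
still dropped.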

In addition to HTML pages, Nutch can index Word and PDF docs.

Here is a library search application that makes use of Nutch for  
website search:
http://www.lib.ncsu.edu/search/?q=usability

Tito

On Sep 22, 2005, at 10:38 AM, John Fereira wrote:

> At 09:50 AM 9/22/2005, Mark Costa wrote:
>
>> I am looking for a good web site search engine that I can place on
>> our library's web site. It needs to be free, easy to implement, and
>> not be Google.
>>  It's not that I have a beef with Google, it's just that they refuse
>> to fix their statistics reporting program.
>>
>
> You might want to look at Lucene.  One of the nice things about
> Lucene is that it's not Google but one can use Google-like query
> parsing (almost by default).  For example, combining boolean
> expressions with phrases can be difficult to implement with some
> search engines but the default query parser for Lucene already does
> it.  Using a Google-like query parser can be a huge win as it
> doesn't require explaining yet another query syntax.
>
> The other big advantage with Lucene is that it allows an index to be
> created with multiple fields.  For example, you can combine full
> text from the static HTML files on your site along with metadata in
> a backend database that might be used for dynamically generated
> pages.  That metadata could include, for example, subsection
> information (e.g. search only pages in the "help" section) or temporal
> metadata (only show pages which have changed in the past week).
>
> Lucene was originally written as a Java API (open source) but there
> are implementations in other languages (for Perl it's called Plucene).
>
>
>
> _______________________________________________
> Web4lib mailing list
> Web4lib at webjunction.org
> http://lists.webjunction.org/web4lib/
>