[WEB4LIB] RE: Cataloging computer files

A. Bullen abullen at ameritech.net
Wed Apr 20 17:19:35 EDT 2005


Maria:

I come again to my first (search engine) love, Swish-E. If you:

A.) install the Windows version of Swish-E
B.) store the data you want indexed on networked (or shared, somehow) drives
C.) build Swish-E with the appropriate filters (for Word, MP3s, Excel,
PDFs, etc.)
D.) store your data with appropriate metadata built into the native format

you could use Swish-E to spider the network drives. Swish-E can be made
to search specific fields; so if, say, you rigorously enforce DC (Dublin
Core) metadata, you can have it look for keyword X in the
Content.Creator field.
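
To make the "spider the network drives" step concrete, here is a minimal Python sketch of the file-gathering part only. It is not how Swish-E does it internally; the extension list and the function name find_indexable are made up for illustration, and the root path would be whatever your shared drive is mounted as.

```python
import os

# Hypothetical extension list, based on the file types mentioned in this thread.
INDEXABLE = {".doc", ".pdf", ".xls", ".htm", ".html", ".mp3"}

def find_indexable(root):
    """Return a sorted list of paths under root with an indexable extension."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare extensions case-insensitively (REPORT.PDF vs report.pdf).
            ext = os.path.splitext(name)[1].lower()
            if ext in INDEXABLE:
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

The resulting list is what you would hand to the indexer, or to a filter that converts each file to text first.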

More complicated schemes are possible (I like lists, sorry):

1.) You could develop a Visual Basic utility, a "Publish This!" button
that appears in their Word, Excel, etc. programs. This button would
bring up a fill-in-the-blank metadata creation form. The form's contents
are then saved as an XML record, which in turn points to the original
document. Swish-E could then spider the XML record and record its data,
then follow the link to the actual piece being indexed. Tinkering with
how Swish-E returns search results could yield both the structured data
and the keyword/fuzzy search.
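
The XML record in scheme 1 might look something like what this Python sketch produces. I use Python here rather than Visual Basic purely for illustration. The record and source element names and the build_record function are assumptions, not anything Swish-E requires; only the Dublin Core namespace URI is the standard one.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"

def build_record(doc_path, fields):
    """Build an XML metadata record that points back to the original document.

    fields maps Dublin Core element names (e.g. "title", "creator") to values.
    The <record> and <source> element names are hypothetical.
    """
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    # Link from the metadata record to the document it describes.
    link = ET.SubElement(record, "source")
    link.set("href", doc_path)
    return ET.tostring(record, encoding="unicode")
```

Calling build_record("//server/share/report.doc", {"title": "Annual Report", "creator": "Jane Doe"}) returns one small record per document, which the indexer can treat as fielded data while the href leads to the full text.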

2.) You might also write a series of rule sets that build your
structure for you. You could use wget and download the files you want
(*.doc, *.pdf, *.xls, *.htm*, etc. on shared drives), and then convert
them into text using the various x-to-y filters available on the
Internet. Once they are in text form, you could run a set of heuristic
rules against them to assign subject categories. With the exception of a
description field, you could probably fill in all of the vital DC fields
using this method. Again, Swish-E could then spider these distilled
records, link them to the original document, spider that for keyword
searches, and you would have a structured catalog.
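
A heuristic rule set of the kind described in scheme 2 could be as simple as keyword counting. The sketch below is one possible approach, not a tested classifier: the RULES table, the category names, and the min_hits threshold are all invented for illustration.

```python
import re

# Hypothetical rule set: subject category -> keywords that suggest it.
RULES = {
    "Budget": ["budget", "invoice", "expenditure"],
    "Personnel": ["hiring", "payroll", "vacation"],
}

def suggest_subjects(text, rules=RULES, min_hits=2):
    """Return the categories whose keywords occur at least min_hits times."""
    words = re.findall(r"[a-z]+", text.lower())
    subjects = []
    for subject, keywords in rules.items():
        # Total occurrences of any of this category's keywords in the text.
        hits = sum(words.count(k) for k in keywords)
        if hits >= min_hits:
            subjects.append(subject)
    return subjects
```

Run against the text extracted from each file, the returned categories could fill the DC subject field of that file's record.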

Andrew Bullen
Illinois State Library

>>Hi all,
>>
>>I need to figure out a way to catalog all data files on our 
>>computer drives, and index them in such a way that all staff 
>>can later find computer files easily when they do research.  
>>Problem is, we have so much data stored all over these 
>>drives, that it is very hard to find anything when we really 
>>need it.  Consequently, if files aren't named, saved and 
>>indexed in an organized way, then the information is 
>>effectively lost to the researcher.
>>
>>Is there a program you can suggest that can help me do this?
>>
>>Thanks for your help!
>>Maria