[WEB4LIB] RE: Cataloging computer files
A. Bullen
abullen at ameritech.net
Wed Apr 20 17:19:35 EDT 2005
Maria:
I come again to my first (search engine) love, Swish-E. If you:
A.) install the Windows version of Swish-E
B.) store the data you want indexed on networked (or shared, somehow) drives
C.) build Swish-E with the appropriate filters (for Word, MP3s, Excel,
PDFs, etc.)
D.) store your data with appropriate metadata built into the native format
you could use Swish-E to spider the network drives. Swish-E can be made
to search specific fields; so, say, you rigorously enforce DC metadata,
you can have it look for keyword X in the Content.Creator field.
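A field-restricted search of that kind might be driven from a script along
these lines. This is only a sketch: the index path, the metaname "creator",
and the helper name build_field_query are assumptions for illustration, and
the metaname would have to be declared in the Swish-E configuration when the
index is built.

```python
import shlex


def build_field_query(metaname, keyword, index_file="index.swish-e"):
    """Return the argument list for a swish-e search limited to one metaname.

    Hypothetical helper: the metaname and index path are placeholders.
    """
    # Swish-E limits a search to a metaname with the form  name=(words)
    query = f"{metaname}=({keyword})"
    # -f names the index file, -w gives the search words
    return ["swish-e", "-f", index_file, "-w", query]


if __name__ == "__main__":
    cmd = build_field_query("creator", "bullen")
    # Print a shell-safe command line; hand cmd to subprocess.run() to execute it
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting
problems with the parentheses when it is eventually run.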
More complicated schemes are (I like lists, sorry):
1.) You could develop a Visual Basic utility that is a "Publish This!"
utility which appears as a button on their Word, Excel, etc. programs.
This button would bring up a fill-in-the-blank metadata creation form.
The form then gets created as an XML record, which in turn points to the
original document. Swish-E could then spider the XML record, record its
data, and follow the link to the actual piece being indexed. Tweaking how
Swish-E returns search results could then yield both the structured data
and the keyword/fuzzy search.
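The XML record in scheme 1 could look something like what this sketch
writes. It is in Python rather than Visual Basic purely for brevity, and the
element names ("record", "source", and the field names passed in) are
hypothetical; they would need to match whatever the index is configured to
treat as searchable fields.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def write_metadata_record(doc_path, fields, out_dir):
    """Write an XML record describing doc_path and return the record's path.

    Element names here are illustrative assumptions, not a fixed schema.
    """
    doc_path = Path(doc_path)
    record = ET.Element("record")
    # Pointer back to the original document, so the indexer can follow it
    ET.SubElement(record, "source").text = str(doc_path)
    # One child element per metadata field from the fill-in-the-blank form
    for name, value in fields.items():
        ET.SubElement(record, name).text = value
    out_path = Path(out_dir) / (doc_path.name + ".xml")
    ET.ElementTree(record).write(str(out_path), encoding="utf-8",
                                 xml_declaration=True)
    return out_path
```

A call such as write_metadata_record("report.doc", {"creator": "A. Bullen",
"title": "Annual report"}, "records") would leave report.doc.xml in the
records directory, ready to be spidered alongside the document itself.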
2.) You might also develop a series of rule sets that build your
structure for you. You could use wget and download the files you want
(*.doc, *.pdf, *.xls, *.htm*, etc. on shared drives), and then convert
them into text using the various x to y filters available on the
Internet. Once in text form, you could run a set of heuristic rules
against them to develop subject categories. With the exception of a
description field, you could probably fill in all of the vital DC fields
using this method. Again, Swish-E could then spider these distilled
records, link them to the original document, spider that for keyword
searches, and you could have a structured catalog.
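The heuristic step in scheme 2 might be sketched as below, once a file has
already been converted to text. The rule table, the threshold, and the
function names are all invented for illustration; real rules would come from
looking at your own documents.

```python
import re
from pathlib import Path

# Hypothetical rule set: subject label -> keywords that suggest it
SUBJECT_RULES = {
    "Budget": ["budget", "appropriation", "fiscal"],
    "Grants": ["grant", "award", "funding"],
}


def guess_subjects(text, rules=SUBJECT_RULES, min_hits=2):
    """Return subjects whose keywords occur at least min_hits times in text."""
    # Exact whole-word matching only; plurals and stems are not handled
    words = re.findall(r"[a-z]+", text.lower())
    subjects = []
    for subject, keywords in rules.items():
        hits = sum(words.count(k) for k in keywords)
        if hits >= min_hits:
            subjects.append(subject)
    return subjects


def distill(path, text):
    """Build a dict of DC-style fields from a file path and its extracted text."""
    path = Path(path)
    # Use the first non-blank line as a title, falling back to the file name
    first_line = next(
        (ln.strip() for ln in text.splitlines() if ln.strip()), path.stem
    )
    return {
        "title": first_line,
        "format": path.suffix.lstrip(".").lower(),
        "identifier": str(path),
        "subject": "; ".join(guess_subjects(text)),
    }
```

The description field is left out on purpose, since that is the one field
rules like these are unlikely to fill in well.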
Andrew Bullen
Illinois State Library
>>Hi all,
>>
>>I need to figure out a way to catalog all data files on our
>>computer drives, and index them in such a way that all staff
>>can later find computer files easily when they do research.
>>Problem is, we have so much data stored all over these
>>drives, that it is very hard to find anything when we really
>>need it. Consequently, if files aren't named, saved and
>>indexed in an organized way, then the information is
>>effectively lost to the researcher.
>>
>>Is there a program you can suggest that can help me do this?
>>
>>Thanks for your help!
>>Maria
>>
>>