[Web4lib] A common filesystem framework

Mon Nov 7 17:59:58 EST 2005

Hi Guys,

The filesystem-based structure came out of my work at the National
Library of Australia, although there is nothing library-specific about
it. Basically, there are pre- and post-processes we run on our data,
such as converting to and from various formats, analysing, indexing,
filtering and so forth, and quite often, because all or most of our
tools exist outside of databases, the filesystem is the place to be.
(I've written some more rationale for using filesystems instead of
RDBMSs, but that sort of formality is a banality in a forum such as
this ... :)

********

The gist of it is three main directories, where 'apps' and 'libs' are
a shared space for various tools, applications and reusable libraries,
and a 'data' directory is a shared space with a special meaning ;

   /apps
   /libs
   /data

The data directory can contain as many subdirectories as you want,
with the constraint that each of them is a data-set, so for example
a data-set that consists of records of children's literature might
look like ;

   /data/childrenslit

Also in the data directory is a configuration file called config.xml,
and you register your apps, libs and data-sets in it, although only
the data-set registration is mandatory ;

<config>
   <!-- Register your apps here if you like ... -->
   <apps />
   <!-- Register your libs here if you like ... -->
   <libs />
   <!-- You must register your data-sets here -->
   <data>
      <data-set id="childrenslit">
         <with-records schema="marc21" read-access="true"
            write-access="false" update-access="false" />
      </data-set>
   </data>
   <!-- Register your supported schemata here -->
   <schemata>
      <schema id="marc" extension="bin" />
      <schema id="marc21" namespace="http://..." extension="xml" />
      <schema id="mods" namespace="http://..." extension="xml" />
      <schema id="xobis" namespace="http://..." extension="xml" />
      <schema id="rss2" namespace="http://..." extension="xml;rss;rdf" />
      <schema id="foap" namespace="http://..." extension="rdf" />
   </schemata>
</config>
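
To make the registration concrete, here is a minimal sketch in Python of
how a tool might read config.xml and work out what it may do with each
data-set. The function name load_data_sets and the shape of the returned
dictionary are made up for illustration, and treating a missing access
attribute as "false" is an assumption; nothing above specifies a default.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the config.xml shown above.
CONFIG = """<config>
   <apps />
   <libs />
   <data>
      <data-set id="childrenslit">
         <with-records schema="marc21" read-access="true"
            write-access="false" update-access="false" />
      </data-set>
   </data>
   <schemata>
      <schema id="marc" extension="bin" />
      <schema id="marc21" namespace="http://..." extension="xml" />
      <schema id="rss2" namespace="http://..." extension="xml;rss;rdf" />
   </schemata>
</config>"""


def load_data_sets(xml_text):
    """Return {data-set id: {schema id: access flags and extensions}}."""
    root = ET.fromstring(xml_text)
    # A schema may list several extensions separated by ';'.
    extensions = {s.get("id"): s.get("extension", "").split(";")
                  for s in root.iter("schema")}
    data_sets = {}
    for ds in root.iter("data-set"):
        records = {}
        for rec in ds.iter("with-records"):
            schema = rec.get("schema")
            records[schema] = {
                # Assumption: an absent attribute means no access.
                "read": rec.get("read-access") == "true",
                "write": rec.get("write-access") == "true",
                "update": rec.get("update-access") == "true",
                "extensions": extensions.get(schema, []),
            }
        data_sets[ds.get("id")] = records
    return data_sets


if __name__ == "__main__":
    print(load_data_sets(CONFIG))
```

A tool would call load_data_sets once at start-up and refuse to write
to any schema whose "write" flag is false.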

Each data-set's id is the name of the directory it resides in, and you
specify what schemas you can expect to find in that structure ;

         <with-records schema="marc" read-access="true"
            write-access="true" />
         <with-records schema="marc21" read-access="true"
            write-access="false" update-access="false" />

The example above would specify a data-set where we can read and write
MARC records, but only read MARC 21 XML, a typical MARC to MARC XML
conversion. We can also see that for each record we can expect one
MARC file and one MARC XML file. If there are discrepancies, for
example because a new record has been popped into the structure, we
know what tools to call to do the conversion. Any tool that wants to
work on this data-set knows what parts of it it can read, write and
update. (I'm sure better concurrency information could be thought of.)

Next is the structure of the data-set directory itself. If we go back
to our previous example ;

  /childrenslit

A tree-structured dirindex is required, starting at this directory and
running three levels deep. Each record file is named
[id].[schema].[extension]. An example for a MARC XML record with id
'676732a' ;

  /childrenslit/a/a2/a23/676732a.marc21.xml

Notice the reverse order for the id directory structure; most ids
carry their uniqueness / traffic at the right-hand end, so this is just
a pragmatic choice. There is no requirement to create the
tree-structured dirindex for all possible combinations, only for those
that will be filled with actual records. If a directory becomes empty
at a later stage, it can also be deleted.
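
The reversed-id layout can be sketched in a few lines of Python. The
function name record_path and its depth parameter are hypothetical, and
rejecting ids shorter than three characters is an assumption, since the
scheme above does not say what should happen to them.

```python
from pathlib import PurePosixPath


def record_path(data_set, record_id, schema, extension, depth=3):
    """Build /data-set/x/xy/xyz/[id].[schema].[extension] from a record id."""
    reversed_id = record_id[::-1]
    if len(reversed_id) < depth:
        # Assumption: short ids are rejected rather than padded.
        raise ValueError("id is shorter than the dirindex depth")
    # Each level is a growing prefix of the reversed id: a, a2, a23.
    levels = [reversed_id[:n] for n in range(1, depth + 1)]
    filename = f"{record_id}.{schema}.{extension}"
    return PurePosixPath("/", data_set, *levels, filename)


if __name__ == "__main__":
    print(record_path("childrenslit", "676732a", "marc21", "xml"))
```

For the id '676732a' this gives the same path as the example above,
/childrenslit/a/a2/a23/676732a.marc21.xml.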

(There is also the possibility of creating a file index definition
file here, but I'll hold this off until needed; file structure
traversal is more a configuration issue than a technical one.)
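
Traversing the structure to spot the discrepancies mentioned earlier
(a MARC file with no matching MARC XML file, say) might look like the
Python sketch below. The names find_missing and demo are made up for
illustration, and the sketch assumes that record ids contain no dots,
so every record file name splits into exactly three parts.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def find_missing(data_set_dir, expected_schemas):
    """Map each record id to the expected schemas it has no file for."""
    expected = set(expected_schemas)
    seen = defaultdict(set)
    for path in Path(data_set_dir).rglob("*"):
        parts = path.name.split(".")
        # Only files named [id].[schema].[extension] count as records.
        if path.is_file() and len(parts) == 3:
            record_id, schema, _extension = parts
            seen[record_id].add(schema)
    return {rid: sorted(expected - found)
            for rid, found in seen.items() if expected - found}


def demo():
    """Build a throwaway data-set with one unconverted record and check it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp, "childrenslit")
        for rel in ("a/a2/a23/676732a.marc.bin",
                    "a/a2/a23/676732a.marc21.xml",
                    "a/a3/a33/676733a.marc.bin"):
            target = root / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch()
        return find_missing(root, ["marc", "marc21"])


if __name__ == "__main__":
    print(demo())
```

In the demo, record 676733a has a MARC file but no MARC XML file, so it
is the only id reported, and a conversion tool would know to pick it up.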

How it is supposed to work
--------------------------------

I can have different apps that look at the same data-set and do
various things to it. For example, here are two apps that use the same
data-set ;

     <application id="my_app">
        <description>Converts MARC to MARC XML</description>
        <uses data-set="childrenslit">
           <reactor schema-idref="marc" test="new;update" />
        </uses>
     </application>
     <application id="my_other_app">
        <description>Indexes MARC XML</description>
        <uses data-set="childrenslit">
           <reactor schema-idref="marc21" test="new;update;delete" />
        </uses>
     </application>

Both applications know from the config.xml file who does what, and we
can write reactors to data being added, updated or deleted in a given
schema.
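
As a rough sketch of how a dispatcher could use those registrations,
the Python below looks up which applications should react to an event
on a given schema in a given data-set. The function name reactors_for
is hypothetical, and wrapping the two application elements in an apps
element is an assumption about where they live in config.xml.

```python
import xml.etree.ElementTree as ET

# The two registrations from above, wrapped in <apps> (an assumption).
APPS = """<apps>
     <application id="my_app">
        <description>Converts MARC to MARC XML</description>
        <uses data-set="childrenslit">
           <reactor schema-idref="marc" test="new;update" />
        </uses>
     </application>
     <application id="my_other_app">
        <description>Indexes MARC XML</description>
        <uses data-set="childrenslit">
           <reactor schema-idref="marc21" test="new;update;delete" />
        </uses>
     </application>
</apps>"""


def reactors_for(xml_text, data_set, schema, event):
    """Return ids of applications that react to `event` on `schema`."""
    root = ET.fromstring(xml_text)
    matches = []
    for app in root.iter("application"):
        for uses in app.iter("uses"):
            if uses.get("data-set") != data_set:
                continue
            for reactor in uses.iter("reactor"):
                # The test attribute lists events separated by ';'.
                events = reactor.get("test", "").split(";")
                if reactor.get("schema-idref") == schema and event in events:
                    matches.append(app.get("id"))
    return matches


if __name__ == "__main__":
    print(reactors_for(APPS, "childrenslit", "marc", "new"))
    print(reactors_for(APPS, "childrenslit", "marc21", "delete"))
```

A new MARC record would wake up my_app, and the MARC XML file it writes
would in turn be a "new" event for my_other_app.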

The idea from here is that we can simply exchange apps and libs and
point them at whatever data-set we want them to work with. If an app
has support for a few good schemas, it would be a simple matter of
plonking it in, updating config.xml, and then running it.

By this I could share with you for example the XPF framework to create
a Topic Maps website from any data-set (MARC XML, Topic Maps, RDF,
RSS, DocBook and a few other schemata), the Phonto tool (automatic
ontology extraction and analysis tool) and a host of XSLT libraries
for various conversions. Imagine next the creation of lexical parsers,
AI tools, data-set to SemWeb bridges, and so forth, easily shared. I
have some data-sets, you have others, and sharing these is often
restricted, so let's share tools that read the same structures, on the
path to world domination.

Anyway, that's the basic idea, in by no means exhaustive terms. Any thoughts?

Regards,

Alex
--
"Ultimately, all things are known because you want to believe you know."
                                                         - Frank Herbert
__ http://shelter.nu/ __________________________________________________