[Web4lib] Dublin Core: An idea and thoughts

Andrew Hankinson andrew.hankinson at gmail.com
Mon Oct 30 20:38:40 EST 2006


Hi folks,

I've been mulling an idea over in my head for a while now, and am just
getting to the point where I think I can sufficiently explain it.  I
have not really floated this idea to many people yet, so I am
interested in hearing feedback from everyone out there.

First, some preamble:
As with most good ideas that have not gone anywhere, the problem lies
in the execution, and not the idea itself.  I think that,
unfortunately, such is the case with Dublin Core.  It was meant to
help organize the web - to provide metadata and machine-understandable
context that the average web developer could deploy without having a
degree in information retrieval.

Unfortunately, after almost 10 years of having this standard, we're
not really much further ahead.

I know many software projects use Dublin Core as a foundation for
their metadata - projects such as Dspace, Fedora, Greenstone, and
countless others.  Ironically, however, these are projects used mostly
by information retrieval professionals, and DC has made few inroads
into adoption by the general web development public.

I think one of the reasons for this discrepancy is the lack of useful
and popular tools for this standard.  It all starts with the people
producing the content. Presently, for a website to have a Dublin Core
record it must be included in the metadata section of the header of
each and every page you create - a task that grows increasingly
cumbersome when it comes to maintaining hundreds or thousands of web
pages.  This metadata must often be hand-crafted - yes, there are some
tools that assist with its construction to some extent, but there is
no mechanism for automatically creating metadata.

The second part of the puzzle is the people organizing the content.
They depend on the people producing the content, but with so few sites
using Dublin Core, they have no impetus to build it into their
software products.

The third part is, of course, the people consuming the content.
However, since the people producing the content are not supplying the
people organizing it with any useful information, they cannot pass
this along to the consumers.  'Consumers' take what they are given,
which in the current information environment means they rely on
Google's ranking algorithm.

Now, the idea:
Since every web site must be served from a web server, what if we
could take the metadata cataloguing and management out of the page
level and place it with the server itself?  In thinking about this, I
specifically had in mind Apache as a platform, but other platforms
would function similarly.

Installing a module for Apache (say, 'mod_dc' or similar) would
extend the functionality of the server to understand and serve
requests for Dublin Core metadata.  By keeping its configuration as
simple as possible, I believe we can lower the barriers to
implementation.

Consider: every subdirectory in a website can contain a .htaccess
file.  This file provides local configuration options for all files in
that directory and below.  What if, in this file, we could write
(example given in pseudo-code):

<dc.title>Title of site or sub-site</dc.title>
<dc.date.modified>2006-10-30</dc.date.modified>
<dc.creator>Joe Q. Developer</dc.creator>
etc. etc. for all the DC elements
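In real Apache configuration syntax, such a module would more likely
use directives than XML-like tags.  Just as a sketch - neither the
module nor these directive names exist, they are purely hypothetical:

```apacheconf
# Hypothetical .htaccess fragment for an imagined mod_dc module.
# None of these directives exist in Apache; this is only a sketch.
DCTitle     "Title of site or sub-site"
DCModified  "2006-10-30"
DCCreator   "Joe Q. Developer"
```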

Since .htaccess files are parsed hierarchically, you could inherit
common properties across all the pages in your site (sort of like a
cascading style sheet).  So, if you had one publishing organization
for the entire site, you would not have to maintain that in each
subfolder - you could place that tag in your site root or even in your
main configuration file, and all pages and subsites below that would
inherit that information.  To override that for, say, an independent
sub-site, simply put a <dc.publisher> tag in the .htaccess file in
that sub-directory.
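The cascading behaviour described above can be sketched in a few lines
of Python.  This is my own illustration, not part of any real mod_dc:
each .htaccess file contributes a dict of DC fields, and deeper
directories override values inherited from above.

```python
def collect_metadata(levels):
    """Merge Dublin Core records from the site root down to a sub-site.

    `levels` is a list of dicts, one per .htaccess file, ordered from
    the document root to the requested directory.  Later (deeper)
    entries override inherited values, much like a cascading style
    sheet.
    """
    merged = {}
    for record in levels:
        merged.update(record)
    return merged

# Site root sets a publisher and creator; the independent sub-site
# overrides the publisher and adds its own title.
root = {"dc.publisher": "Example Org", "dc.creator": "Joe Q. Developer"}
subsite = {"dc.publisher": "Independent Sub-site", "dc.title": "Sub-site"}
record = collect_metadata([root, subsite])
```

Here `record` ends up with the sub-site's publisher and title while
still inheriting the creator from the site root.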

This would alleviate most of the work done by developers, and allow
for a centralized record that could be maintained by people other than
the 'code monkeys' without having to re-write every page on the
website.  (librarians and taxonomists, here's your chance!)

For organizers and consumers, then, they would use tools that would
pass a query to the webserver to see what it has to offer for Dublin
Core.  Something like:
http://www.mysite.com/?dc or http://www.mysite.com/subsite/?dc.
The server would then respond with properly formatted XML of the
complete Dublin Core record that could then be parsed by a web
browser, but could also be parsed with a myriad of other tools to
provide further indexing, navigation, structural and semantic
metadata.
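On the consumer side, a tool that fetched such a response could parse
it with any standard XML library.  The element names and structure
below are my own assumptions (borrowing the standard DC namespace),
since no response format is actually defined yet:

```python
import xml.etree.ElementTree as ET

# A hypothetical response from http://www.mysite.com/?dc - the
# wrapper element and overall structure here are assumptions.
SAMPLE_RESPONSE = """\
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Title of site or sub-site</dc:title>
  <dc:creator>Joe Q. Developer</dc:creator>
  <dc:date>2006-10-30</dc:date>
</metadata>"""

def parse_dc_record(xml_text):
    """Parse the server's XML into a simple {element: value} dict."""
    dc_ns = "{http://purl.org/dc/elements/1.1/}"
    root = ET.fromstring(xml_text)
    return {child.tag[len(dc_ns):]: child.text
            for child in root if child.tag.startswith(dc_ns)}

record = parse_dc_record(SAMPLE_RESPONSE)
```

A crawler or aggregator could run this over every site it visits and
index the results, with no page-level scraping required.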

What's out there now:
I have had a look at a number of other projects that seem to offer
similar services, but so far I have been underwhelmed.  There is a
mod_dublincore project out there, but it seemed to be focussed on
providing RSS functionality, and in any case has not been updated
since 2000. (http://web.resource.org/rss/1.0/modules/dc/).  There is
also mod_oai, which looks extremely powerful but is also fairly dense
to implement.

I would envision something that is, above all, simple to implement,
and that makes it easy to see real, tangible benefits from putting it
in place.  Like I mentioned before, the key sticking point is getting
the content developers to do it.  Once they start delivering metadata,
the organizers and consumers will start using it.  "If you build it,
they will come."  The barriers to implementation and ongoing
maintenance just need to be much, much lower.

That said, I'm looking for some feedback on this idea.  I know there
are some drawbacks to this implementation, and it might have already
been tried and failed - I don't know.  Any and all opinions are
welcome.

Andrew

