[Web4lib] MARC to XML: the agony and the ecstasy
Marc Cromme
marc at indexdata.dk
Thu Oct 11 03:38:03 EDT 2007
K.G. Schneider wrote:
> For a presentation, I'm trying to come up with examples of what happens
> when MARC is transformed to XML and then exposed through search engines
> such as Endeca, Siderean, etc. ... examples that work well for
> librarians who are at least a little familiar with MARC. Part of the
> message I'm getting across is that MARC is very record-oriented and XML
> is about data (MARC is from Mars... XML is from Venus?); but I'm also
> trying to suggest that when we start exposing MARC in new webby
> environments sometimes we get some new abilities, but other times the
> results are not so pretty, due to limitations of the source data, even
> though it's now XML.
I think it's important to distinguish two things here: the form and the
content.
As for the content, MARC stays MARC, even if wrapped in MARCXML.
The same funny rules for fields and subfields, the same arcane ways of
digging the title information out of different fields, etc. We are used
to it, so using MARCXML or MARC really makes no difference here; the
point is that we know the semantics of the content and use it accordingly.
When it comes to form - the gift wrapping - there is a one-to-one
mapping between MARC and MARCXML, so I tend to use whatever is most
convenient. Assuming you know the char set encoding of the original MARC
record, that is.
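
To illustrate the one-to-one mapping, here is a minimal Python sketch,
assuming a well-formed ISO 2709 record whose character encoding is already
known. The helper names build_record and marc_to_marcxml and the sample
record are made up for this example; real records deserve a real MARC
toolkit with proper error handling.

```python
import xml.etree.ElementTree as ET

FT, RT = b"\x1e", b"\x1d"   # field terminator, record terminator
SF = b"\x1f"                # subfield delimiter
NS = "http://www.loc.gov/MARC21/slim"


def build_record(fields, encoding="utf-8"):
    """Hypothetical helper: pack (tag, data) pairs into one ISO 2709 record."""
    directory, body = b"", b""
    for tag, data in fields:
        raw = data.encode(encoding) + FT
        # Directory entry: tag (3), field length (4), start offset (5).
        directory += f"{tag}{len(raw):04d}{len(body):05d}".encode("ascii")
        body += raw
    base = 24 + len(directory) + 1      # leader + directory + terminator
    total = base + len(body) + 1        # + record terminator
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + FT + body + RT


def marc_to_marcxml(record, encoding="utf-8"):
    """Re-wrap one binary MARC record as a MARCXML <record> element."""
    leader = record[:24].decode("ascii")
    base = int(leader[12:17])           # base address of data
    directory = record[24:base - 1]     # without its field terminator
    root = ET.Element(f"{{{NS}}}record")
    ET.SubElement(root, f"{{{NS}}}leader").text = leader
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = record[base + start:base + start + length - 1]
        if tag.startswith("00"):        # control field: no indicators
            el = ET.SubElement(root, f"{{{NS}}}controlfield", {"tag": tag})
            el.text = data.decode(encoding)
            continue
        attrs = {"tag": tag, "ind1": chr(data[0]), "ind2": chr(data[1])}
        el = ET.SubElement(root, f"{{{NS}}}datafield", attrs)
        for chunk in data[2:].split(SF)[1:]:
            sub = ET.SubElement(el, f"{{{NS}}}subfield", {"code": chr(chunk[0])})
            sub.text = chunk[1:].decode(encoding)
    return root


SUB = "\x1f"
sample = build_record([
    ("001", "demo0001"),
    ("245", "10" + SUB + "aMARC to XML :" + SUB + "bthe agony and the ecstasy"),
])
ET.register_namespace("", NS)
print(ET.tostring(marc_to_marcxml(sample), encoding="unicode"))
```

Note that the encoding has to be passed in from outside: nothing in this
sketch tries to detect it, which is exactly the caveat above.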
For example, we have built a metasearcher (called Masterkey, see
http://masterkey.indexdata.com/ ) which fetches binary MARC from the
backend targets, internally re-formats it as MARCXML, and uses the
expressive power of XSLT transformations to pick out the interesting bits
and bytes for indexing. Finally, the display HTML is rendered by XSLT
transformations as well. This way, we can use a standard XSLT engine
for templating, instead of writing our own templating engine for binary
MARC records.
The extra verbosity of MARCXML over binary MARC is in my humble opinion
not an issue any more with the huge disks we have. Collections of 30
million records and more can easily fit on one disk.
A nice thing about MARCXML is that it states its character set encoding
in the XML declaration. If used correctly, you will never have char
encoding troubles in your applications. In contrast, it is notoriously
difficult to figure out from the outside which char set a MARC based
library system uses for queries and records, and the publicly available
information is often missing or incorrect. Trial and error takes time!
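
A small Python sketch of the difference, using only the standard library.
The record content and the helper names place_of_publication and
decode_or_none are made up for illustration. The same MARCXML document is
serialized in two encodings, and the parser recovers the same text from
both because each one declares its encoding; a bare byte string from a
binary MARC field carries no such hint.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/MARC21/slim"}
TEMPLATE = (
    '<?xml version="1.0" encoding="{enc}"?>'
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<datafield tag="260" ind1=" " ind2=" ">'
    '<subfield code="a">København</subfield>'
    '</datafield></record>'
)


def place_of_publication(xml_bytes):
    """Parse MARCXML bytes; the parser honours the declared encoding."""
    root = ET.fromstring(xml_bytes)
    return root.find("m:datafield[@tag='260']/m:subfield[@code='a']", NS).text


def decode_or_none(raw, encoding):
    """Try one guessed encoding on bare bytes; None if it does not fit."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


latin1_doc = TEMPLATE.format(enc="ISO-8859-1").encode("iso-8859-1")
utf8_doc = TEMPLATE.format(enc="UTF-8").encode("utf-8")

# Different bytes on disk, same text after parsing.
print(place_of_publication(latin1_doc), place_of_publication(utf8_doc))

# Bare field bytes: the caller has to guess, and a wrong guess fails
# (or, with other encodings, silently yields wrong characters).
bare = "København".encode("iso-8859-1")
print(decode_or_none(bare, "utf-8"), decode_or_none(bare, "iso-8859-1"))
```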
As for human resources, I would like to say that, MARCXML being a special
breed of XML, there are plenty of high quality end user tools and
programmers' toolboxes to use with it, and web developers are nowadays
used to using these tools, in any programming language of choice, be it
open source or commercial, as you wish. Binary MARC programming
specialists, on the other hand, are rare and hard to find.
Storage and indexing are other issues: In principle, it does not matter
if you use file storage, a relational database, an XML database, or a
full-text indexer to keep your records and find them again in your
application or library system. The important thing is that you can store
them and that you can perform the kind of searches you wish against your
storage.
If you want full-text indexing, or want to apply complex semantic
indexing rules to the records (like mapping from MARC to the Z39.50 BIB-1
use attribute set), and you do not want to code your own templating
engine, again, MARCXML and XSLT/XPath are a good choice. The semantic
definitions found at http://www.loc.gov/z3950/agency/bib1.html are more
or less easily translated to XSLT templates for full-text indexing
engines, or XPath/XQuery searches in XML databases.
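
As a rough sketch of such a translation, here it is in Python with the
limited XPath support of the standard library rather than in XSLT. The
table BIB1_TO_MARC is a deliberately tiny, simplified assumption (the real
definitions on the LOC page list many more fields per use attribute), and
the sample record and the function name index_terms are made up.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/MARC21/slim"}

# BIB-1 use attribute -> (MARC tag, subfield code) pairs.
# Illustrative subset only, not the full LOC definitions.
BIB1_TO_MARC = {
    4: [("245", "a"), ("245", "b")],      # Title
    7: [("020", "a")],                    # ISBN
    1003: [("100", "a"), ("700", "a")],   # Author
}

SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">0000000000</subfield>
  </datafield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Example, Author</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">MARC to XML :</subfield>
    <subfield code="b">the agony and the ecstasy</subfield>
  </datafield>
  <datafield tag="700" ind1="1" ind2=" ">
    <subfield code="a">Sample, Coauthor</subfield>
  </datafield>
</record>"""


def index_terms(record):
    """Collect subfield values per BIB-1 use attribute for one record."""
    index = {}
    for use, sources in BIB1_TO_MARC.items():
        for tag, code in sources:
            path = f"m:datafield[@tag='{tag}']/m:subfield[@code='{code}']"
            for sub in record.findall(path, NS):
                if sub.text:
                    index.setdefault(use, []).append(sub.text)
    return index


for use, values in index_terms(ET.fromstring(SAMPLE)).items():
    print(use, values)
```

Each row of the table corresponds to what would be one small template in
an XSLT stylesheet, so extending the mapping means adding rows, not code.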
If you can live with simpler searches, the facilities provided by RDB
systems might be sufficient, and many of them nowadays have XML storage
fields or binary BLOB fields to be used with MARCXML or binary MARC.
Finally, MARCXML is a great format to use if you want to tap into other
people's semantic work for free: for example, the LOC homepage offers XSLT
stylesheets for conversion between MARCXML, MODS, and Dublin Core:
http://www.loc.gov/standards/marcxml/
So my rule of thumb, capturing the strength of each format, is more or less:
- use binary MARC for backend communication to library systems
- use binary MARC or MARCXML if your search requirements are modest
- use MARCXML if you need sophisticated indexing in full-text engines
- use MARCXML for web frontend work
- use MARCXML for easy conversion of content to other semantic formats
Probably, other engineers with other skill sets would slice and dice
differently.
Yours, Marc Cromme, Index Data
>
> Originally I was trying to come up with metaphors about the difference
> between MARC and the recombinative, hey-let's-put-on-a-show quality of
> XML (a banana, an orange, and an apple... or fruit salad; a formal
> garden... a rose float at the parade) but then I remembered what it was
> like to explain email in the early days of the 'net, and the reality is
> that the way to do that was to sit people in front of a computer and
> have them send and receive messages. (Remember all those metaphors we
> strained for in the Early Days?) So I'm hoping to find some good, live
> examples of what I'm talking about.
>
> Thanks!
>
> Karen G. Schneider
> kschneider at cclaflorida.org
> Research & Development
> College Center for Library Automation
> _______________________________________________
> Web4lib mailing list
> Web4lib at webjunction.org
> http://lists.webjunction.org/web4lib/
>
--
Marc Cromme
M.Sc and Ph.D in Mathematical Modelling and Computation
Senior Developer, Project Manager
Index Data Aps
Købmagergade 43, 2
1150 Copenhagen K.
Denmark
tel: +45 3341 0100
fax: +45 3341 0101
http://www.indexdata.com
INDEX DATA Means Business
for Open Source and Open Standards