[Web4lib] MARC to XML: the agony and the ecstasy

Thu Oct 11 03:38:03 EDT 2007

K.G. Schneider wrote:
> For a presentation, I'm trying to come up with examples of what happens
> when MARC is transformed to XML and then exposed through search engines
> such as Endeca, Siderean, etc. ... examples that work well for
> librarians who are at least a little familiar with MARC. Part of the
> message I'm getting across is that MARC is very record-oriented and XML
> is about data (MARC is from Mars... XML is from Venus?); but I'm also
> trying to suggest that when we start exposing MARC in new webby
> environments sometimes we get some new abilities, but other times the
> results are not so pretty, due to limitations of the source data, even
> though it's now XML. 

I think it's important to distinguish two things here: the form and the 
content.

As for the content, MARC stays MARC, even if wrapped in MARCXML. 
The same funny rules for fields and subfields, the same arcane ways of 
digging the title information out of different fields, etc. We are used 
to it, so using MARCXML or MARC really makes no difference here. The 
point is that we know the semantics of the content and use it accordingly.

When it comes to form - the gift wrapping - there is a one-to-one 
mapping between MARC and MARCXML, so I tend to use whatever is most 
convenient. Assuming you know the char set encoding of the original MARC 
record, that is.

For example, we build a metasearcher (called Masterkey, see
http://masterkey.indexdata.com/ ) which fetches binary MARC from the 
backend targets, internally re-formats it as MARCXML, and uses the 
expressive power of XSLT transformations to pick out interesting bits 
and bytes for indexing. Finally, display HTML is rendered by XSLT 
transformations as well. This way, we can use a standard XSLT engine 
for templating, instead of writing our own templating engine for binary 
MARC records.
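
A rough idea of the "pick out the interesting bits" step, sketched in
Python with the path expressions of xml.etree.ElementTree instead of
XSLT (only to keep the example free of extra libraries; this is not the
Masterkey code). The sample record and the names subfields and
index_entry are made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# A hypothetical, heavily shortened MARCXML record.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Doe, Jane.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title :</subfield>
    <subfield code="b">a subtitle /</subfield>
    <subfield code="c">Jane Doe.</subfield>
  </datafield>
</record>"""


def subfields(record, tag, codes):
    """Return the text of the given subfield codes in all fields with this tag."""
    out = []
    for field in record.findall(f"marc:datafield[@tag='{tag}']", NS):
        for sub in field.findall("marc:subfield", NS):
            if sub.get("code") in codes:
                out.append(sub.text or "")
    return out


def index_entry(record):
    """Build a flat dict of the kind an indexer might consume."""
    return {
        "title": " ".join(subfields(record, "245", "ab")),
        "author": " ".join(subfields(record, "100", "a")),
    }


record = ET.fromstring(SAMPLE.encode("utf-8"))
print(index_entry(record))
```

The same selection written as an XSLT template would match
marc:datafield[@tag='245']/marc:subfield[@code='a' or @code='b'];
the point is that the rules are declarative path expressions and not
hand-written byte offset arithmetic.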

The extra verbosity of MARCXML over binary MARC is in my humble opinion 
not an issue any more with the huge disks we have. Collections of 30 
million records and more can easily fit on one disk.

A nice thing about MARCXML is that it states its character set encoding 
in the XML declaration. If used correctly, you will never have character 
encoding troubles in your applications. In contrast, it is notoriously 
difficult to figure out from the outside which character set a MARC-based 
library system uses for queries and records, and the publicly available 
information is often missing or incorrect. Trial and error takes time!
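
A small Python sketch of the difference. With XML the parser reads the
declared encoding and hands you proper text whatever bytes were on the
wire. For binary MARC 21 the only in-band hint is leader position 9
("a" for Unicode, blank for MARC-8), and you have to trust that the
system producing the record set it correctly. The function names are
invented for this example.

```python
import xml.etree.ElementTree as ET


def parse_in(encoding: str, text: str) -> str:
    """Serialize text in the given encoding, declare it, and parse it back."""
    doc = f'<?xml version="1.0" encoding="{encoding}"?><t>{text}</t>'
    return ET.fromstring(doc.encode(encoding)).text


def marc21_leader_encoding(raw: bytes) -> str:
    """Report what leader/09 of a binary MARC 21 record claims."""
    flag = raw[9:10]
    if flag == b"a":
        return "utf-8"
    if flag == b" ":
        return "marc-8"  # a label only; Python ships no MARC-8 codec
    return "unknown"


# Different bytes on disk, same text after parsing.
for enc in ("UTF-8", "ISO-8859-1"):
    print(enc, parse_in(enc, "Købmagergade"))
```

Note that the leader flag says nothing about the encoding a target
expects for queries, so that part still has to be found out by other
means.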

As for human resources: since MARCXML is just a special breed of XML, 
there are plenty of high-quality end-user tools and programmers' 
toolboxes to use with it, open source or commercial as you wish, for any 
programming language of choice, and web developers are nowadays used to 
these tools. Binary MARC programming specialists, on the other hand, are 
rare and hard to find.

Storage and indexing are other issues: In principle, it does not matter 
if you use file storage, a relational database, an XML database, or a 
full-text indexer to keep your records and find them again in your 
application or library system. The important thing is that you can store 
them and that you can perform the kind of searches you wish against your 
storage.

If you want full-text indexing, or want to apply complex semantic 
indexing rules on the records (like mapping from MARC to the Z39.50 BIB-1 
use attribute set), and you do not want to code your own templating 
engine, then again MARCXML with XSLT/XPath is a good choice. The semantic 
definitions found at  http://www.loc.gov/z3950/agency/bib1.html are more 
or less easily translated to XSLT templates for full-text indexing 
engines, or XPath/XQuery searches in XML databases.
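
As a sketch of what such a translation can look like, here is a small
rule table in Python that maps a few BIB-1 use attributes to MARC tags
and subfields and applies it to a MARCXML record. The attribute numbers
(4 Title, 7 ISBN, 1003 Author) are from the BIB-1 set; the tag lists
are deliberately shortened for the example and are not the full
definitions from the page above. The sample record and all names in
the code are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# BIB-1 use attribute -> (name, [(MARC tag, subfield codes), ...]).
# Simplified for illustration; the real definitions list many more fields.
USE_ATTRIBUTES = {
    4: ("Title", [("245", "ab")]),
    7: ("ISBN", [("020", "a")]),
    1003: ("Author", [("100", "a"), ("700", "a")]),
}


def index_terms(record):
    """Collect, per use attribute, the subfield texts selected by the rules."""
    terms = {}
    for attr, (_name, rules) in USE_ATTRIBUTES.items():
        found = []
        for tag, codes in rules:
            for field in record.findall(f"marc:datafield[@tag='{tag}']", NS):
                found += [s.text for s in field.findall("marc:subfield", NS)
                          if s.get("code") in codes and s.text]
        terms[attr] = found
    return terms


SAMPLE = b"""<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">0000000000</subfield>
  </datafield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Doe, Jane.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title /</subfield>
    <subfield code="c">Jane Doe.</subfield>
  </datafield>
</record>"""

print(index_terms(ET.fromstring(SAMPLE)))
```

Each row of the table corresponds to roughly one XSLT template or one
XPath expression, so extending the indexing rules means adding rows,
not writing new parsing code.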

If you can live with simpler searches, the facilities provided by RDB 
systems might be sufficient, and many of them nowadays have XML storage 
fields or binary BLOB fields to be used with MARCXML or binary MARC.
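
A minimal sketch of the modest-requirements case, using SQLite from
the Python standard library as a stand-in for whatever RDB system you
run. The table layout, the column names and the two sample rows are
assumptions for the example: the record is kept whole in a text or
BLOB column, and a title extracted at load time serves the simple
searches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT,"     # extracted once at load time, used for simple searches
    " marcxml TEXT,"   # the full record as MARCXML
    " marc BLOB)"      # or the original binary MARC
)


def store(title, marcxml=None, marc=None):
    """Insert one record; either representation may be left empty."""
    conn.execute(
        "INSERT INTO records (title, marcxml, marc) VALUES (?, ?, ?)",
        (title, marcxml, marc),
    )


def search_title(word):
    """Substring search on the extracted title column."""
    rows = conn.execute(
        "SELECT id, title FROM records WHERE title LIKE ? ORDER BY id",
        (f"%{word}%",),
    )
    return rows.fetchall()


store("The agony and the ecstasy", marcxml="<record>...</record>")
store("Another example title", marc=b"placeholder bytes")
print(search_title("agony"))
```

Anything beyond this kind of substring matching (ranking, stemming,
field-specific rules) is where a full-text engine fed from MARCXML
starts to pay off.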

Finally, MARCXML is a great format to use if you want to tap into other 
people's semantic work for free: for example, the LOC website offers XSLT 
stylesheets for conversion between MARCXML, MODS, and Dublin Core:
http://www.loc.gov/standards/marcxml/

So my rule of thumb, capturing the strengths of each format, is more or less:

- use binary MARC for backend communication to library systems
- use binary MARC or MARCXML if your search requirements are modest
- use MARCXML if you need sophisticated indexing in full-text engines
- use MARCXML for web frontend work
- use MARCXML for easy conversion of content to other semantic formats

Probably, other engineers with other skill sets would slice and dice 
differently.

Yours, Marc Cromme, Index Data

> 
> Originally I was trying to come up with metaphors about the difference
> between MARC and the recombinative, hey-let's-put-on-a-show quality of
> XML (a banana, an orange, and an apple... or fruit salad; a formal
> garden... a rose float at the parade) but then I remembered what it was
> like to explain email in the early days of the 'net, and the reality is
> that the way to do that was to sit people in front of a computer and
> have them send and receive messages. (Remember all those metaphors we
> strained for in the Early Days?) So I'm hoping to find some good, live
> examples of what I'm talking about. 
> 
> Thanks! 
> 
> Karen G. Schneider
> kschneider at cclaflorida.org
> Research & Development
> College Center for Library Automation
> 

-- 

Marc Cromme
M.Sc and Ph.D in Mathematical Modelling and Computation
Senior Developer, Project Manager

Index Data Aps
Købmagergade 43, 2
1150 Copenhagen K.
Denmark

tel: +45 3341 0100
fax: +45 3341 0101

http://www.indexdata.com

INDEX DATA Means Business
for Open Source and Open Standards