[Web4lib] MARC to XML: the agony and the ecstasy

Jonathan Gorman jtgorman at uiuc.edu
Thu Oct 11 08:25:53 EDT 2007


>The extra verbosity of MARCXML over binary MARC is in my humble opinion 
>not an issue any more with the huge disks we have. Collections of 30 
>million records and more can easily fit on one disk.

Yes, to clarify: I was only using the size of XML files as an example of how XML's history and focus shaped it, not as a criticism of XML compared to MARC.  In this day of fat pipes, huge drives, and excellent compression, verbosity is not the problem it would have been a few decades ago.  Certainly, it's an extremely small drawback compared to the ease of transforming and manipulating XML ;).

I was trying to demonstrate that although many people think of XML as "data-driven" or focused on data, the origins of XML really stem from the document markup folks.  If the focus had been only data, I would think some of the rules, like the one requiring named closing tags, might have ended up different.

Quick example... compare

<name>foo</name>
<address><street>1212 bob drive</street></address>

which might have ended up as

<name>foo</>
<address><street>1212 bob drive</></>

Since you can't have overlapping tags, I think you could get away with this shorthand: a bare </> would always close the current (innermost open) element.  In most data applications, it would be pretty easy to tell which tag closed which.  Data files tend to follow patterns, have simple semantics, keep related data close together, etc.
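
To make the "always closes the current element" point concrete, here is a minimal sketch in Python.  It is not a real XML parser: the function name expand_short_closes is made up for illustration, and it assumes tags carry no attributes and that there are no comments, CDATA sections, or self-closing tags.  It keeps a stack of open elements and rewrites each bare </> as the named close tag of whatever is on top.

```python
import re

# Matches an anonymous close "</>", a named close "</x>", or an open tag "<x>".
# Assumption: no attributes, comments, CDATA, or self-closing tags.
TOKEN = re.compile(r"</>|</([A-Za-z_][\w.-]*)>|<([A-Za-z_][\w.-]*)>")


def expand_short_closes(text):
    """Rewrite each "</>" as the named close tag of the innermost open element."""
    stack = []  # names of currently open elements, innermost last
    out = []
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(text[pos:m.start()])  # copy character data unchanged
        pos = m.end()
        close_name, open_name = m.group(1), m.group(2)
        if open_name is not None:
            stack.append(open_name)
            out.append(m.group(0))
            continue
        # Either "</>" (close_name is None) or a named close tag.
        if not stack:
            raise ValueError("close tag with no open element")
        current = stack.pop()
        if close_name is not None and close_name != current:
            raise ValueError(f"expected </{current}>, found </{close_name}>")
        out.append(f"</{current}>")
    out.append(text[pos:])
    if stack:
        raise ValueError(f"unclosed elements: {stack}")
    return "".join(out)


short = "<name>foo</><address><street>1212 bob drive</></>"
print(expand_short_closes(short))
```

Because elements cannot overlap, the stack is all the bookkeeping needed; the named close tag adds no information for the machine, only a check for the human reader.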

I have seen large XML files limit your choice of tools, and data files can get large.  This isn't as true as it used to be, thankfully.  I do remember a lot of early tools that would read everything into memory to build a full DOM, and not very efficiently at that.  This has been getting better over the past few years, and I've seen some rather nifty techniques for handling large XML files.  True, some of these problems could possibly be solved with more memory, but libraries don't have quite the resources a private company might.
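
One such technique is incremental (streaming) parsing, where each record is processed and then discarded instead of building the whole tree first.  Here is a minimal sketch using iterparse from Python's standard xml.etree.ElementTree module.  The count_records helper and the un-namespaced <collection>/<record> sample are assumptions for illustration; real MARCXML elements live in a namespace, so the tag to match would differ.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, record_tag="record"):
    """Count record elements without holding the whole document in memory."""
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            count += 1       # a real job would extract fields from elem here
            root.clear()     # drop finished records so memory use stays flat
    return count


# Hypothetical sample data; a real file path could be passed instead.
sample = (
    b"<collection>"
    b"<record><title>a</title></record>"
    b"<record><title>b</title></record>"
    b"</collection>"
)
print(count_records(io.BytesIO(sample)))
```

The trade-off is that you only ever see one record at a time, which suits record-oriented data much better than it suits free-form documents.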


Jon Gorman

