[Web4lib] MARC strictness

Thomale, J j.thomale at ttu.edu
Wed Nov 30 11:50:05 EST 2005


> One of the strangest inconsistencies that I've found was this:
> 
>     600 14 $a Chateaubriand, Fran $d ois René de
> 
> What surprised me even more, however, was the response I got 
> when I reported this.  The librarian told me that this comes 
> from a batch conversion of some legacy records a few years 
> ago, where the c-cedilla (ç) character was misinterpreted as 
> the subfield separator followed by letter d, i.e. subfield 
> "$d".  He informed me that he had now corrected the record 
> that I reported.  But wait, there are hundreds of similar 
> records!  It's as if he just didn't have the tools to do a 
> global "search and replace".

From my (admittedly limited) experience processing and manipulating MARC records, and based on comments that I've heard throughout this discussion and elsewhere, it's not quite that simple. If you were talking about a well-structured database with a relatively clear data dictionary and rules for data entry, then yes--searching globally for errors and replacing them would be a routine matter. In that case, what would be considered consistent and inconsistent within the database would be well defined.
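
(Even the specific c-cedilla corruption quoted above is easy enough to *detect* with a small script--here is a rough sketch using pymarc, where the filename and the lowercase-$d heuristic are my own stand-ins, not anything my library actually runs. Finding the suspect records is the easy part; knowing in general what counts as an error, and what the correct replacement is, is where it gets hard.)

    from pymarc import MARCReader

    # Flag personal-name fields whose $d starts with a lowercase letter.
    # In a legitimate heading, $d holds dates and starts with a digit, so
    # a lowercase start is the signature of the mangled c-cedilla above.
    with open('catalog.mrc', 'rb') as fh:          # placeholder filename
        for record in MARCReader(fh):
            if record is None:                     # unparseable record
                continue
            for field in record.get_fields('100', '600', '700'):
                if any(d[:1].islower() for d in field.get_subfields('d')):
                    ids = record.get_fields('001')
                    rec_id = ids[0].data if ids else '(no 001)'
                    print(rec_id, field)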

There are a lot of problems with MARC when you begin trying to define "consistency." Some inconsistency is built right into the combination of format and cataloging rules--such as various types of punctuation sometimes being used as delimiters for data units and sometimes not, even though MARC's subfields were designed to serve exactly that purpose. So, in order to tell what's consistent and what isn't in that context, you'd have to have some sort of data structure that defines the appropriate punctuation for each individual subfield. And even then, as you stated in your original post, you've got to take into account that abbreviations will end in a period regardless of the rule for that subfield. So you'd have to have a data structure defining all possible abbreviations.
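
(To make that concrete, the checking loop itself is trivial--something like the sketch below, again using pymarc--but the rule table and the abbreviation list are the hard part, and the ones here are invented stand-ins for rules that in reality would run to hundreds of entries.)

    from pymarc import MARCReader

    # Invented examples of per-subfield punctuation rules; not real AACR2.
    EXPECTED_ENDING = {
        ('260', 'a'): ' :',   # place of publication, when followed by $b
        ('260', 'b'): ',',    # publisher, when followed by $c
    }
    # A tiny stand-in for what would need to be an exhaustive list.
    ABBREVIATIONS = {'ill.', 'cm.', 'p.', 'ed.'}

    with open('catalog.mrc', 'rb') as fh:          # placeholder filename
        for record in MARCReader(fh):
            if record is None:
                continue
            for (tag, code), ending in EXPECTED_ENDING.items():
                for field in record.get_fields(tag):
                    for value in field.get_subfields(code):
                        words = value.split()
                        # Abbreviations legitimately end in a period no
                        # matter what the subfield rule says.
                        if words and words[-1].lower() in ABBREVIATIONS:
                            continue
                        if not value.endswith(ending):
                            print('check', tag, '$' + code + ':', repr(value))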

Once you've accounted for all of the built-in inconsistencies, then you've got inconsistencies between institutions due to variance in local practices. So your programming would have to take those into account each time you change institutions.

That's not to say there aren't some more obvious problems you could check for--for instance, you could do some kind of global search and replace based on fields under authority control. Normalize the data in personal name fields and subject fields, compare it to an authority file, and then replace the inconsistent data with the appropriate data from the authority file. But even then you wouldn't necessarily know which piece of data to use to replace the bad one--that kind of decision usually takes a judgment call. The best you could probably do would be to produce a report of which records and which fields are incorrect and then have someone manually fix those records.
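
(A bare-bones version of that report idea might look like the sketch below: normalize the headings, compare them to a list of authorized forms, and flag mismatches for a person to review rather than replacing anything automatically. Again, assume pymarc; the filenames are placeholders, and reading the authority file as a flat list of headings, one per line, is a simplification.)

    import re
    from pymarc import MARCReader

    def normalize(heading):
        # Lowercase, strip punctuation, collapse whitespace.
        heading = re.sub(r'[^\w\s]', '', heading.lower())
        return ' '.join(heading.split())

    # Pretend the authorized forms are already extracted, one per line.
    with open('authorized_headings.txt', encoding='utf-8') as fh:
        authorized = {normalize(line) for line in fh if line.strip()}

    with open('catalog.mrc', 'rb') as fh:          # placeholder filename
        for record in MARCReader(fh):
            if record is None:
                continue
            for field in record.get_fields('100', '600', '650'):
                heading = ' '.join(field.get_subfields('a', 'b', 'c', 'd'))
                if normalize(heading) not in authorized:
                    print('REVIEW:', heading)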

Furthermore--again, in my experience--all but the simplest automatic manipulation of MARC needs to be double-checked, because there is bound to be something I didn't take into account that makes some of the resulting records incorrect. And so it falls back to a manual process.

My point is that one of the reasons libraries, including mine, tend to fix MARC records manually and retrospectively is that the rules themselves are so wacky and inconsistent that they preclude much automatic processing. There *are* no tools for global search and replace, in your sense, within the MARC world (I'm sure someone will correct me if I'm wrong). :-)

Now--maybe I'm making this too difficult just because I'm bad at programming. Maybe the library world in general hasn't been creative enough or motivated enough to see the consistency behind the inconsistency--the order behind the chaos of MARC. (Truly--I'm not trying to be a smart-aleck).

> This is when I thought that, hey, I have those tools.  I 
> could find and fix the majority of these records in a matter 
> of hours and maybe earn myself a few hundred dollars.  That 
> is not a full library budget, but the salary cost for one 
> workday. The current state is an official embarrassment.  It 
> would be good to fix this.  
> If it is common that library catalogs have these 
> inconsistencies, and library systems don't help to fix them, 
> I could make it a business to offer my MARC tune-up services.

I say go for it! If your methods work reliably and are relatively simple and inexpensive, it couldn't hurt. I would be very interested to see/hear what you have in mind.

> But again, this inconsistency really doesn't matter because 
> you can still search for Chateaubriand.  Who cares about the 
> given name? If libraries thought that inconsistencies were 
> important, they would have found ways to fix them long ago.  
> Add to this that libraries are monopolies.  They have to 
> fight against budget cuts, but never against competitors.  
> The library with the erroneous catalog records doesn't really 
> lose business.  If they don't directly benefit from fixing 
> them, then why should they?

Well...most librarians I know genuinely want to help their patrons. I'm not a library administrator, nor do I aspire to be one, but I think it's really a question of cost-effectiveness. Since it has historically been impossible to do much automatic MARC cleanup, and cleaning up an entire catalog manually would take a tremendous number of man-hours, libraries have had to make do with simpler but less effective methods: spot-checking the catalog, hacking out rudimentary report-running routines, and retrospectively fixing problems as they're reported. I think that catalog database cleanup most often happens wherever it has the most impact on usability for patrons. If there's an obvious problem in the 245$a of a book that circulates frequently, then it will probably get fixed relatively quickly, while a problem in an obscure subfield that isn't even indexed might never get fixed, because it doesn't affect retrieval of the record.

> So, back to my question: How can you motivate that libraries 
> should fix their broken catalog records?  Or shouldn't they?

Again, I think that most libraries *try* so long as their efforts are worth it (i.e., the fixes actually address something with a systemic impact and aren't prohibitively difficult or time-consuming). If it doesn't make a difference to the patrons and the system in general, why go through the time and expense?

Now...I realize that I'm dancing around the larger issues that these questions raise--whether or not human-generated structured metadata is "worth it" in a larger sense, how much structure is too much structure, and whether or not data consistency ("metadata quality"?) is that important.

These are big questions, and I think the best answer that anybody can give right now to the first is that the amount of structure in an effective metadata format should fit the purpose that the metadata is trying to accomplish. For purposes of *basic* information retrieval, as Mike has pointed out, little structure is necessary--and MARC is most definitely overkill, which is why some data inconsistency is tolerable. But, for information display (which can sometimes be very complex), automatic reporting, data analysis, resource preservation, etc., metadata structures with MARC's semantic depth are not out of the question.

As for data consistency, I don't think we can argue that it's not important. Many of the postings in this thread have talked about the difficulty of working *around* MARC's inconsistencies. Data consistency is a fundamental principle of data architecture, after all, and generally makes for happy systems. But, as has been pointed out, MARC is a special beast. It has a very rich history--there's a LOT of very interesting information in the world's MARC store, despite its inconsistency. That the format still works as well as it does is a small miracle. And, if I may [sort of] quote from the ultimate fount of wisdom, "when 40 years old you become, look as good you will not, hmm?"

Jason Thomale
Metadata Librarian
Texas Tech University Libraries

