[Web4lib] MARC strictness

Mon Nov 28 11:24:56 EST 2005

I'm going to question one of J. Thomale's comments, although with less than
100% confidence that I'm right (and noting that most of the comments are on
the money):

> Now, I wasn't even a twinkle in my father's eye when MARC and AACR came
> about, but, from what I understand, these cataloging rules come out of
> the card catalog era, when you had to be very concise in order to fit
> the pertinent metadata on a card. This conciseness translated well to
> MARC when it was young due to strict limitations on numbers of
> characters for subfields, fields, and records. That's why you have all
> the abbreviations, punctuation conventions, etc.

I was working in the library field when MARC came out, started working with
MARC in 1973 (five years later) and eventually wound up writing the first
book to attempt to clarify MARC for librarians and vendors (MARC for
Library Use, first published in 1984--admittedly 16 years after MARC II
first appeared in 1968).

Issues of conciseness apply to catalog cards, to some extent--but MARC has
never had significant limitations on "numbers of characters for subfields,
fields, and records." The sheer variability of MARC has always been one of
its great strengths.

There's a 9,999-character limit on a given field (and, thus,
subfield--there's no special limit on subfield length)--although,
admittedly, many systems prior to Web times probably had roughly
1,500-character limits on displayable fields (because with screen-at-a-time
character-based displays, there's no good way to handle a longer field).
There's technically no limit on the length of a MARC record, although it's
necessary to use special conventions if the record grows beyond 99,999
characters--but that's the length of a novella, or, say, 50 to 60 screens'
worth of text!
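
For anyone wondering where those two numbers come from, here is a minimal
Python sketch, assuming the standard Z39.2 / ISO 2709 layout: a 24-character
leader whose first five characters hold the record length, followed by a
directory of 12-character entries (3-character tag, 4-digit field length,
5-digit starting position). The function names are made up for illustration
and are not taken from any MARC library.

```python
# Sketch of the Z39.2 / ISO 2709 record layout behind the limits above.
# Function names are illustrative only, not from any existing library.

LEADER_LEN = 24              # fixed-length leader at the start of every record
DIR_ENTRY_LEN = 12           # 3-char tag + 4-digit length + 5-digit start
FIELD_TERMINATOR = "\x1e"

MAX_FIELD_LEN = 10**4 - 1    # four decimal digits -> 9,999
MAX_RECORD_LEN = 10**5 - 1   # five decimal digits -> 99,999


def directory_entry(tag, length, start):
    """Format one 12-character directory entry, enforcing the digit widths."""
    if length > MAX_FIELD_LEN:
        raise ValueError(
            f"field {tag} is {length} characters; the directory allows {MAX_FIELD_LEN}"
        )
    if start > MAX_RECORD_LEN:
        raise ValueError("starting position does not fit in five digits")
    return f"{tag:>3}{length:04d}{start:05d}"


def parse_directory(record):
    """Yield (tag, length, start) for each directory entry in a raw record."""
    base = int(record[12:17])                 # leader 12-16: base address of data
    directory = record[LEADER_LEN:base - 1]   # drop the directory's terminator
    for i in range(0, len(directory), DIR_ENTRY_LEN):
        entry = directory[i:i + DIR_ENTRY_LEN]
        yield entry[0:3], int(entry[3:7]), int(entry[7:12])
```

The limits fall straight out of the digit widths: nothing in the format
restricts a subfield separately, and a field can be any length that fits in
four digits.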

("When it was young" can have many meanings; the system I refer to below, now
known as the RLG Union Catalog, went to a 30K length limit in 1980. I know;
I wrote the batch processing system that enforced those limits and actually
created USMARC records from the internal RLINMARC structure. That 30K limit
was based on mainframe programming realities. The system no longer has such
limits.)

Yes, real-world systems had and still have tighter limits in some cases.
The system I've worked with had a 30K record length limit (and a 12K
directory length limit, for a maximum of 1,000 fields within a record) for
a long time, and that was one of the largest limits around (I don't know
what WorldCat's limit is now, or if there is one, but it was around 8K or
16K characters for many years, I believe). None of these limits really
matter except for mixed collections (formerly archival and manuscript
control), where you may have hundreds or thousands of detailed
entries--unless, of course, you're trying to embed the whole text into the
MARC record, which is a misuse of the format.

The abbreviations and punctuation conventions come from AACR and ISBD. MARC
(both the underlying data format standard, Z39.2, and MARC21, probably the
most elaborate set of tag/subfield/indicator specifications) provides a way
to identify all the metadata--a rich combination of syntax and semantics.
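
A rough Python sketch of that split, under stated assumptions: the syntax
layer (Z39.2) only knows about indicators and delimiter-plus-code subfields,
while the meaning of each code comes from a MARC 21 table. The three-entry
table below is a tiny excerpt for field 100, and the function names are
hypothetical. Note that the punctuation inside the values is untouched by
either layer; that part belongs to AACR and ISBD.

```python
SUBFIELD_DELIMITER = "\x1f"
FIELD_TERMINATOR = "\x1e"

# Tiny excerpt of MARC 21 meanings for field 100; illustrative, not complete.
MARC21_100 = {
    "a": "personal name",
    "c": "titles associated with the name",
    "d": "dates associated with the name",
}


def split_field(raw):
    """Syntax only: return (indicators, [(code, value), ...]) for a data field."""
    indicators, *chunks = raw.rstrip(FIELD_TERMINATOR).split(SUBFIELD_DELIMITER)
    return indicators, [(chunk[0], chunk[1:]) for chunk in chunks if chunk]


def label_subfields(raw, meanings):
    """Semantics: attach a meaning from a tag-specific table to each subfield."""
    _, subfields = split_field(raw)
    return [(meanings.get(code, "not in this excerpt"), value)
            for code, value in subfields]


# Example: a 100 field with first indicator 1 and the dates in $d.
field_100 = "1 \x1faMeriam, James Lathrop,\x1fd1917-2000.\x1e"
for meaning, value in label_subfields(field_100, MARC21_100):
    print(f"{meaning}: {value}")
```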

Otherwise, a good set of comments.

Walt Crawford
wcc at rlg.org, 650-691-2227
-------------------------------------
Typically reachable:
Monday & Wednesday 7 a.m.-3 p.m.
Tuesday & Thursday 7 a.m.-2 p.m.
Friday 7-11 a.m.
--------------------------------------

web4lib-bounces at webjunction.org wrote on 11/28/2005 07:57:35 AM:

> Hi Lars,
>
> > I'm looking at a set of MARC records from a library near me.
> > Their cataloging guidelines are a very close translation of
> > the Library of Congress' MARC21 guidelines, but there seems
> > to be a lot of built-in tradition too, that isn't covered in
> > documents.
>
> Although it isn't necessarily obvious from looking at a set of MARC
> guidelines (whether it's MARC 21 Concise from LOC or OCLC MARC or
> whatever), MARC is only supposed to dictate the structure of the
> record's content, not the formatting of that content. Whether or not the
> 100 field is written thusly:
>
> 100 1 $a James Lathrop Meriam $d 1917 to 2000 ["incorrect" according to
> most libraries' cataloging standards]
>
> is outside the scope of any MARC specification. In order to get the
> content formatting, you have to use a set of cataloging rules.
> English-speaking countries use the Anglo-American Cataloging Rules,
> version 2 (AACR2), which is a very complex set of rules that tells you
> exactly how to format text in a cataloging record (no matter whether or
> not that record is encoded using MARC).
>
> Now, I wasn't even a twinkle in my father's eye when MARC and AACR came
> about, but, from what I understand, these cataloging rules come out of
> the card catalog era, when you had to be very concise in order to fit
> the pertinent metadata on a card. This conciseness translated well to
> MARC when it was young due to strict limitations on numbers of
> characters for subfields, fields, and records. That's why you have all
> the abbreviations, punctuation conventions, etc.
>
> You're absolutely right, however, that there is a lot of tradition built
> into most libraries' cataloging practices. Although we all use
> MARC/AACR2, we all have our own local practices that--for better or
> worse--sometimes contradict these standards. And, because of these
> standards' age and the richness of their histories, there can be a lot
> of variety in local practice between, and even within, libraries. In a
> lot of cases there was at one time a good reason for a particular quirk,
> but the reasons have been forgotten or simply no longer apply--and yet
> the quirk persists. For some reason, I find the topic of local
> cataloging practices, how they developed, and why they exist to be
> terribly fascinating, so I apologize if I'm rambling.
>
> > My experience (and I should point out that I'm a programmer, not a
> > librarian) tells me that people will follow formatting rules
> > if it matters, but not otherwise.  All C, Java, and Perl
> > programs have statements that end in a semicolon, or else
> > they refuse to run.
> > But not all programs are well structured, or easy to explain.
> > And this seems to apply to MARC records as well.
>
> Hmm. I also have a programming background that predates my librarian
> background, and that's a very interesting insight--although MARC/AACR2
> provides a lot more structure than do the formatting rules of any
> particular programming language.
>
> > The search interface to this library's catalog seems to
> > handle every subfield just the same.  Sometimes in the
> > personal names fields (100, 600, 700), I see subfields $c
> > (title) and $d (years of birth and death) interchanged:
> >
> >    100 1  $a Meriam, James Lathrop, $c 1917-2000.
> >    100 0  $a Husayn ibn Ali, $d King of Hejaz, $c 1853?-1931.
> >    700 1  $a Barth $d Professor $4 aut
>
> Something like this has got to be a mistake. The structure of the
> personal name fields is so standardized that switching subfields around
> like this could not be an actual practice. It's just too big of an
> inconsistency. And this is a MARC inconsistency, not an AACR2
> inconsistency.
>
> > In the two first examples, if the subfield markers are
> > removed, the remainder is a human-readable line of text with
> > commas and a period at the end.  This is the more common
> > case, but the third example doesn't have these commas. Is
> > there a rule for this?
>
> Yes. :-) In a card catalog record, "Meriam, James Lathrop, 1917-2000."
> would be written exactly so. This entire string would represent the
> author. In MARC, although the dates are separated out into a separate
> subfield, the formatting conventions persist. The last example, AFAIK,
> isn't meant to be read as a single string, so each separate subfield is
> just a separate piece of data, hence the lack of punctuation.
>
> Although I am not an expert on MARC/AACR2, similarities to those first
> two examples that you gave exist in the title fields, the publication
> information fields, and the physical description fields, among others.
>
> So, a lot of these formatting conventions come out of cataloging
> tradition. Where there are no traditions to guide them, there are no
> strange looking formatting conventions.
>
> If you're interested, I would find a copy of AACR2, or at least a
> concise version of it. The book that helped me tremendously is "The
> Concise AACR2" by Michael Gorman (yeah, yeah, I know...). It's currently
> in its 4th edition.
>
> > In trying to clean up the records, simply removing the comma
> > or period at the end of a subfield is insufficient, because
> > there are cases such as "$c Dr." or "$a Eliot, T. S." where
> > the period should be part of the subfield.
> >
> > The contents of subfield $d also varies greatly, e.g. the
> > English "fl." (flourished) is mixed with the Swedish "levde",
> > or the English "B.C." with the Swedish "f.Kr.", or more
> > complicated statements such as "was born no later than 1751".
> >  Circa can be abbreviated "c." (as in English) or "ca" or
> > "c:a" (as in Swedish).
> > Or the simple question mark after 1853 in the example above.
> > In LoC's guidelines, I find no rules for the text inside the
> > $d subfield.
>
> :-) Yes. Again, that's AACR2's job to define the text inside a subfield,
> not MARC's (with some exceptions). The examples given in LOC's
> guidelines are formatted according to AACR2, I believe, just because
> that's what everyone uses.
>
> Attempting to automatically process the content of human-created MARC
> records is going to give you the headache to end all headaches, because
> cataloging rules, even within a single standard, are not consistent--at
> least, not by a computer's definition of "consistent."
>
> > Apparently, all these formatting inconsistencies exist
> > because it really doesn't matter.  You can search for
> > "Lathrop 1917" or "King Husayn ibn Ali" and you find what
> > you're looking for.  Nobody would search for people having
> > the title 1917.
>
> Right. I'm not entirely sure how most library systems index MARC
> records, but I imagine that they would have to ignore
> formatting--otherwise searching would be impossible.
>
> The next logical question, then, is: why is it so important to
> catalogers that every comma, period, capitalized letter, etc. is in the
> right place? Well, beyond for the sake of following the "standard"
> (whether that's AACR2 or some local practice), I really don't know.
>
> I think this is part of the reason that catalogers look so suspiciously
> on "metadata," and why those of us who come from a more IT-ish
> background can get so frustrated when dealing with metadata in a library
> setting. Metadata really does not need to be (and really *should not*
> be) as complicated as some catalogers--at least, in my experience--would
> like to make it. Of course, I don't think this is their fault. I think
> it's just an effect of dealing with a metadata standard as complex and
> arcane as MARC for an extended period of time.
>
> > Is this kind of inconsistency a problem, and how do libraries
> > handle it?  Do you insist that such errors be corrected (and
> > how do you motivate this requirement?), or have you long
> > since given up that fight?
>
> I'm not a part of this at our own library, so I can't give a very
> detailed answer. But I know that our catalogers attempt to make all the
> records they create--whether it's original cataloging or modifying
> records downloaded from OCLC or a vendor--conform to local practices and
> "correct" cataloging rules, whenever possible. If there's a major
> problem with a record that's already in our catalog, the problem is
> brought to the attention of our database management team and they try to
> fix whatever is causing the problem in the record. Doubtless there are
> many, many records with problems that are still waiting to be found.
>
> How do library *systems* deal with this level of inconsistency? I
> imagine it varies from system to system.
>
> Does that help?
>
> Jason Thomale
> Metadata Librarian
> Texas Tech University Libraries
> _______________________________________________
> Web4lib mailing list
> Web4lib at webjunction.org
> http://lists.webjunction.org/web4lib/
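
On the cleanup problem raised in the quoted message (a trailing period
sometimes belongs to the data, as in "$c Dr." or "$a Eliot, T. S."), one
possible heuristic is sketched below in Python. It is only a sketch: the
abbreviation list is a hypothetical, deliberately short sample built from
the examples in this thread, and real records would need a reviewed list and
manual checking of whatever the rule gets wrong.

```python
import re

# Hypothetical sample list; a real cleanup would need a much fuller one.
KNOWN_ABBREVIATIONS = {"Dr.", "Jr.", "Sr.", "fl.", "ca.", "c.", "B.C.", "f.Kr."}


def strip_trailing_punctuation(value):
    """Drop a trailing comma, and a trailing period unless it looks like data."""
    value = value.rstrip()
    if value.endswith(","):
        return value[:-1].rstrip()
    if value.endswith("."):
        last_word = value.split()[-1]
        is_initial = re.fullmatch(r"[A-Z]\.", last_word) is not None
        if last_word in KNOWN_ABBREVIATIONS or is_initial:
            return value                  # period is part of the subfield
        return value[:-1].rstrip()        # period is separator punctuation
    return value


for sample in ["Meriam, James Lathrop,", "1917-2000.", "Dr.", "Eliot, T. S."]:
    print(repr(sample), "->", repr(strip_trailing_punctuation(sample)))
```

A rule like this will still misjudge names that end in an unlisted
abbreviation, which is one more reason the free text inside $d and $c is hard
to normalize automatically.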