Sunday, September 18, 2011

Meaning in MARC: Indicators

I have been doing a study of the semantics of MARC data on the futurelib wiki. An article on what I learned about the fixed fields (00X) and the number and code fields (0XX) appeared in the code4lib journal, issue 14, earlier this year. My next task was to tackle the variable fields in the MARC range 1XX-8XX.

This is a huge task, so I started by taking a look at the MARC indicators in this tag range, and have expanded this to a short study of the role that indicators play in MARC. I have to say that it is amazing how much one can stretch the MARC format with one or two single-character data elements.

Indicators have a large number of effects on the content of the MARC fields they modify. Here is the categorization that I have come up with, although I'm sure that other breakdowns are equally plausible.

I. Indicators that do not change the meaning of the field

There are indicators that have a function, but it does not change the meaning of the data in the field or subfields.
  • Display constants: some, but not all, display constants merely echo the meaning of the tag, e.g. 775 Other Edition Entry, Second Indicator
    Display constant controller
    # - Other edition available
    8 - No display constant generated
  • Trace/Do not trace: I consider these indicators to be carry-overs from card production.
  • Non-filing indicators: similar to indicators that control displays, these indicators make it possible to sort (was filing) titles properly, ignoring the initial articles ("The ", "A ", etc.). 
  • Existence in X collection: there are indicators in the 0XX range that let you know if the item exists in the collection of a national library. 
II. Indicators that do change the meaning of the field

Many indicators serve as a way to expand the meaning of a field without requiring the definition of a new tag.
  • Identification of the source or agency: a single field like a 650 topical subject field can have content from an unlimited list of controlled vocabularies because the indicator (or the indicator plus the $2 subfield) provides the identity of the controlled vocabulary.
  • Multiple types in a field: some fields can have data of different types, controlled by the indicator. For example, the 246 (Varying form of title) has nine different possible values, like Cover title or Spine title, controlled by a single indicator value. 
  • Pseudo-display controllers: the same indicator type that carries display constants that merely echo the meaning of the field also has a number of instances where the display constant actually indicates a different meaning for the field. One example is the 520 (Summary, etc.) field with display constants for "subject," "review," "abstract," and others. 
 Some Considerations

Given the complexity of the indicators there isn't a single answer to how this information should be interpreted in a semantic analysis of MARC. I am inclined to consider the display constants and tracing indicators in section I to not have meaning that needs to be addressed. These are parts of the MARC record that served the production of card sets but that should today be functions of system customization. I would argue that some of these have local value but are possibly not appropriate for record sharing.

The non-filing indicators are a solution to a problem that is evident in so many bibliographic applications. When I sort by title in Zotero or Mendeley, a large portion of the entries are sorted under "The." The world needs a solution here, but I'm not sure what it is going to be. One possibility is to create two versions of a title: one for display, with the initial article, and one for sorting, without. Systems could do the first pass at this, as they often to today with taking author names and inverting them into "familyname, forenames" order. Of course, humans would have to have the ability to make corrections where the system got it wrong.

The indicators that identify the source of a controlled vocabulary could logically be transformed into a separate data element for each vocabulary (e.g. "LCSH," "MeSH"). However, the number of different vocabularies is, while not infinite, very large and growing (as evidenced by the practice in MARC to delegate the source to a separate subfield that carries codes from a controlled list of sources), so producing a separate data element for each vocabulary is unwieldy, to say the least. At some future date, when controlled vocabularies "self-identify" using URIs this may be less of a problem. For now, however, it seems that we will need to have multi-part data elements for controlled vocabularies that include the source with the vocabulary terms.

The indicators that further sub-type a field, like the 520 Summary field, can be fairly easily given their own data element since they have their own meaning. Ideally there would be a "type/sub-type" relationship where appropriate.


And Some Problems

There are a number of areas that are problematic when it comes to the indicator values. In many cases, the MARC standard does not make clear if the indicator modifies all subfields in the field, or only a select few. In some instances we can reason this out: the non-filing indicators only refer to the left-most characters of the field, so they can only refer to the $a (which is mandatory in each of those fields). On the other hand, for the values in the subject area (6XX) of the record, the source indicator relates to all of the subject subfields in the field. I assume, however, that in all cases the control subfields $3-$8 perform functions that are unrelated to the indicator values. I do not know at this point if there are fields in which the indicators function on some other subset of the subfields between $a and $z. That's something I still need to study.

I also see a practical problem in making use of the indicator values in any kind of mapping from MARC to just about anything else. In 60% of MARC tags either one or both indicator positions is undefined. Undefined indicators are represented in the MARC record with blanks. Unfortunately there are also defined indicators that have a meaning assigned to the character "blank." There is nothing in the record itself to differentiate blank indicator values from undefined indicators. Any transformation from MARC to another format has to have knowledge about every tag and its indicators in order to do anything with these elements. This is another example of the complexity of MARC for data processing, and yet another reason why a new format could make our lives easier.

More on the Wiki

For anyone else who obsesses on these kinds of things there is more detail on all of this on the futurelib wiki. I welcome comments here, and on the wiki. If you wish to comment on the wiki, however, I need to add your login to the site (as an anti-spam measure). I will undoubtedly continue my own obsessive behavior related to this task, but I really would welcome collaboration if anyone is so inclined. I don't think that there is a single "right answer" to the questions I am asking, but am working on the principle that some practical decisions in this area can help us as we work on a future bibliographic carrier.

No comments: