Monday, February 03, 2020

Use the Leader, Luke!

If you learned the MARC format "on the job" or in some other library context you may have learned that the record is structured as fields with 3-digit tags, each with two numeric indicators, and that subfields have a subfield indicator (often shown as "$" because it is a non-printable character) and a single character subfield code (a-z, 0-9). That is all true for the MARC records that libraries create and process, but the MAchine Readable Cataloging standard (Z39.2 or ISO 2709) has other possibilities that we are not using. Our "MARC" (currently MARC21) is a single selection from among those possibilities, in essence an application profile of the MARC standard. The key to the possibilities afforded by MARC is in the MARC Leader, and in particular in two positions that our systems generally ignore because they always contain the same values in our data:
Leader byte 10 -- Indicator count
Leader byte 11 -- Subfield code length
In MARC21 records, Leader byte 10 is always "2" meaning that fields have 2-byte indicators, and Leader byte 11 is always 2 because the subfield code is always two characters in length. That was a decision made early on in the life of MARC records in libraries, and it's easy to forget that there were other options that were not taken. Let's take a short look at the possibilities the record format affords beyond our choice.

Both of these Leader positions are single bytes that can take values from 0 to 9. An application could use the MARC record format and have zero indicators. It isn't hard to imagine an application that has no need of indicators or that has determined to make use of subfields in their stead. As an example, the provenance of vocabulary data for thesauri like LCSH or the Art and Architecture Thesaurus could always be coded in a subfield rather than in an indicator:
650 $a Religion and science $2 LCSH
Another common use of indicators in MARC21 is to give a byte count for the non-filing initial articles on title strings. Istead of using an indicator value for this some libraries outside of the US developed a non-printing code to make the beginning and end of the non-filing portion. I'll use backslashes to represent these codes in this example:
245 $a \The \Birds of North America
I am not saying that all indicators in MARC21 should or even could be eliminated, but that we shouldn't assume that our current practice is the only way to code data.

In the other direction, what if you could have more than two indicators? The MARC record would allow you have have as many as nine. In addition, there is nothing to say that each byte in the indicator has to be a separate data element; you could have nine indicator positions that were defined as two data elements (4 + 5), or some other number (1 + 2 + 6). Expanding the number of indicators, or beginning with a larger number, could have prevented the split in provenance codes for subject vocabularies between one indicator value and the overflow subfield, $2, when the number exceeded the capability of a single numerical byte. Having three or four bytes for those codes in the indicator and expanding the values to include a-z would have been enough to include the full list of authorities for the data in the indicators. (Although I would still prefer putting them all in $2 using the mnemonic codes for ease of input.)

In the first University of California union catalog in the early 1980's we expanded the MARC indicators to hold an additional two bytes (or was it four?) so that we could record, for each MARC field, which library had contributed it. Our union catalog record was a composite MARC record with fields from any and all of the over 300 libraries across the University of California system that contributed to the union catalog as dozen or so separate record feeds from OCLC and RLIN. We treated the added indicator bytes as sets of bits, turning on bits to represent the catalog feeds from the libraries. If two or more libraries submitted exactly the same MARC field we stored the field once and turned on a bit for each separate library feed. If a library submitted a field for a record that was new to the record, we added the field and turned on the appropriate bit. When we created a user display we selected fields from only one of the libraries. (The rules for that selection process were something of a secret so as not to hurt anyone's feelings, but there was a "best" record for display.) It was a multi-library MARC record, made possible by the ability to use more than two indicators.

Now on to the subfield code. The rule for MARC21 is that there is a single subfield code and that is a lower case a-z and 0-9. The numeric codes have special meaning and do not vary by field; the alphabetic codes aare a bit more flexible. That gives use 26 possible subfields per tag, plus the 10 pre-defined numeric ones. The MARC21 standard has chosen to limit the alphabetic subfield codes to lower case characters. As the fields reached the limits of the available subfield codes (and many did over time) you might think that the easiest solution would be to allow upper case letters as subfield codes. Although the subfield code limitation was reached decades ago for some fields I can personally attest to the fact that suggesting the expansion of subfield codes to upper case letters was met with horrified glares at the MARC standards meeting. While clearly in 1968 the range of a-z seemed ample, that has not be the case for nearly half of the life-span of MARC.

The MARC Leader allows one to define up to 9 characters total for subfield codes. The value in this Leader position includes the subfield delimiter so this means that you can have a subfield delimiter and up to 8 characters to encode a subfield. Even expanding from a-z to aa-zz provides vastly more possibilities, and allow upper case as well give you a dizzying array of choices.

The other thing to mention is that there is no prescription that field tags must be numeric. They are limited to three characters in the MARC standard, but those could be a-z, A-Z, 0-9, not just 0-9, greatly expanding the possibilities for adding new tags. In fact, if you have been in the position to view internal systems records in your vendor system you may have been able to see that non-numeric tags have been used for internal system purposes, like noting who made each edit, whether functions like automated authority control have been performed on the record, etc. Many of the "violations" of the MARC21 rules listed here have been exploited internally -- and since early days of library systems.

There are other modifiable Leader values, in particular the one that determines the maximum length of a field, Leader 20. MARC21 has Leader 20 set at "4" meaning that fields cannot be longer than 9999. That could be longer, although the record size itself is set at only 5 bytes, so a record cannot be longer than 99999. However, one could limit fields to 999 (Leader value 20 set at "3") for an application that does less pre-composing of data compared to MARC21 and therefore comfortably fits within a shorter field length. 

The reason that has been given, over time, why none of these changes were made was always: it's too late, we can't change our systems now. This is, as Caesar might have said, cacas tauri. Systems have been able to absorb some pretty intense changes to the record format and its contents, and a change like adding more subfield codes would not be impossible. The problem is not really with the MARC21 record but with our inability (or refusal) to plan and execute the changes needed to evolve our systems. We could sit down today and develop a plan and a timeline. If you are skeptical, here's an example of how one could manage a change in length to the subfield codes:

a MARC21 record is retrieved for editing
  1. read the Leader 10 of the MARC21 record
  2. if the value is "2" and you need to add a new subfield that uses the subfield code plus two characters, convert all of the subfield codes in the record:
    • $a becomes $aa, $b becomes $ba, etc.
    • $0 becomes $01, $1 becomes $11, etc.
    • Leader 10 code is changed to "3"
  3. (alternatively, convert all records opened for editing)

a MARC21 record is retrieved for display
  1. read the Leader 10 of the MARC21 record
  2. if the value is "2" use the internal table of subfield codes for records with the value "2"
  3. if the value is "3" use the internal table of subfield codes for records with the value "3"

Sounds impossible? We moved from AACR to AACR2, and now from AACR2 to RDA without going back and converting all of our records to the new content.  We have added new fields to our records, such as the 336, 337, 338 for RDA values, without converting all of the earlier records in our files to have these fields. The same with new subfields, like $0, which has only been added in recent years. Our files have been using mixed record types for at least a couple of generations -- generations of systems and generations of catalogers.

Alas, the time to make these kinds of changes this was many years ago. Would it be worth doing today? That depends on whether we anticipate a change to BIBFRAME (or some other data format) in the near future. Changes do continue to be made to the MARC21 record; perhaps it would have a longer future if we could broach the subject of fixing some of the errors that were introduced in the past, in particular those that arose because of the limitations of MARC21 that could be rectified with an expansion of that record standard. That may also help us not carry over some of the problems in MARC21 that are caused by these limitations to a new record format that does not need to be limited in these ways.

Epilogue


Although the MARC  record was incredibly advanced compared to other data formats of its time (the mid-1960's), it has some limitations that cannot be overcome within the standard itself. One obvious one is the limitation of the record length to 5 bytes. Another is the fact that there are only two levels of nesting of data: the field and the subfield. There are times when a sub-subfield would be useful, such as when adding information that relates to only one subfield, not the entire field (provenance, external URL link). I can't advocate for continuing the data format that is often called "binary MARC" simply because it has limitations that require work-arounds. MARCXML, as defined as a standard, gets around the field and record length limitations, but it is not allowed to vary from the MARC21 limitations on field and subfield coding. It would be incredibly logical to move to a "non-binary" record format (XML, JSON, etc.) beginning with the existing MARC21 and  to allow expansions where needed. It is the stubborn adherence to the ISO 2709 format really has limited library data, and it is all the more puzzling because other solutions that can keep the data itself intact have been available for many decades.

Tuesday, January 28, 2020

Pamflets

I was always a bit confused about the inclusion of "pamflets" in the subtitle of the Decimal System, such as this title page from the 1922 edition:


Did libraries at the time collect numerous pamphlets? For them to be the second-named type of material after books was especially puzzling.

I may have discovered an answer to my puzzlement, if not THE answer, in Andrea Costadoro's 1856 work:
A "pamphlet" in 1856 was not (necessarily) what I had in mind, which was a flimsy publication of the type given out by businesses, tourist destinations, or public health offices. In the 1800's it appears that a pamphlet was a literary type, not a physical format. Costadoro says:
"It has been a matter of discussion what books should be considered pamphlets and what not. If this appellation is intended merely to refer to the SIZE of the book, the question can be scarecely worth considering ; but if it is meant to refer to the NATURE of a work, it may be considered to be of the same class and to stand in the same connexion with the word Treatise as the words Tract ; Hints ; Remarks ; &c, when these terms are descriptive of the nature of the books to which they are affixed." (p. 42)
To be on the shelves of libraries, and cataloged, it is possible that these pamphlets were indeed bound, perhaps by the library itself. 

The Library of Congress genre list today has a cross-reference from "pamphlet" to "Tract (ephemera)". While Costadoro's definition doesn't give any particular subject content to the type of work, LC's definition says that these are often issued by religious or political groups for proselytizing. So these are pamphlets in the sense of the political pamphlets of our revolutionary war. Today they would be blog posts, or articles in Buzzfeed or Slate or any one of hundreds of online sites that post such content.

Churches I have visited often have short publications available near the entrance, and there is always the Watchtower, distributed by Jehovah's Witnesses at key locations throughout the world, and which is something between a pamphlet (in the modern sense) and a journal issue. These are probably not gathered in most libraries today. In Dewey's time the printing (and collecting by libraries) of sermons was quite common. In a world where many people either were not literate or did not have access to much reading material, the Sunday sermon was a "long form" work, read by a pastor who was probably not as eloquent as the published "stars" of the Sunday gatherings. Some sermons were brought together into collections and published, others were published (and seemingly bound) on their own.  Dewey is often criticized for the bias in his classification, but what you find in the early editions serves as a brief overview of the printed materials that the US (and mostly East Coast) culture of that time valued. 

What now puzzles me is what took the place of these tracts between the time of Dewey and the Web. I can find archives of political and cultural pamphlets in various countries and they all seem to end around the 1920's-30's, although some specific collections, such as the Samizdat publications in the Soviet Union, exist in other time periods.

Of course the other question now is: how many of today's tracts and treatises will survive if they are not published in book form?