Wednesday, April 12, 2017

If It Ain't Broke

For the first time in over forty years there is serious talk of a new metadata format for library bibliographic data. This is an important moment.

There is not, however, a consensus within the profession on the need to replace the long-standing MARC record format with something different. A common reply to the suggestion that library data creation needs a new data schema is the phrase: "If it ain't broke, don't fix it." This is more likely to be uttered by members of the cataloging community - those who create the bibliographic data that makes up library catalogs - than by those whose jobs entail systems design and maintenance. It is worth taking a good look at the relationship that catalogers have with the MARC format, since their view is informed by decades of daily encounters with a screen of MARC encoding.

Why This Matters

When the MARC format was developed, its purpose was clear: it needed to provide the data that would be printed on catalog cards produced by the Library of Congress. Those cards had been printed for over six decades, so there was no lack of examples to use to define the desired outcome. In ways unimagined at the time, MARC would change, nay, expand the role of shared cataloging, and would provide the first online template for cataloging.

Today work is being done on the post-MARC data schema. However, how the proposed new schema might change the daily work of catalogers is unclear. There is some anxiety in the cataloging community about this, and it is understandable. What I unfortunately see is a growing distrust of this development on the part of the data creators in our profession. It has not been made clear what their role is in the development of the next "MARC," not even whether their needs are a driving force in that development. Surely a new model cannot be successful without the consideration (or even better, the participation) of the people who will spend their days using the new data model to create the library's data.

(An even larger question is the future of the catalog itself, but I hardly know where to begin on that one.)

If it Ain't Broke...

The push-back against proposed post-MARC data formats is often seen as a blanket rejection of change. Undoubtedly this is at times the case. However, given that there have now been multiple generations of catalogers who worked and continue to work with the MARC record, we must assume that the members of the cataloging community have in-depth knowledge of how that format serves the cataloging function. We should tap that knowledge as a way to understand the functionality in MARC that has had a positive impact on cataloging for four decades, and should study how that functionality could be carried forward into the future bibliographic metadata schema.

I asked on Twitter for input on what catalogers like about MARC, and received some replies. I also viewed a small number of presentations by catalogers, primarily those about proposed replacements for MARC. From these I gathered the following list of "what catalogers like about MARC." I present these without comment or debate. I do not agree with all of the statements here, but that is no matter; the purpose here is to reflect cataloger perspectives.

(Note: This list is undoubtedly incomplete and I welcome comments or emails with your suggestions for additions or changes.)

What Catalogers Like/Love About MARC

There is resistance to moving away from using the MARC record for cataloging among some in the Anglo-American cataloging community. That community has been creating cataloging data in the MARC formats for forty years. For these librarians, MARC has many positive qualities, and these are qualities that are not perceived to exist in the proposals for linked data. (Throughout the sections below, read "library cataloging" and variants as referring to the Anglo-American cataloging tradition that uses the MARC format and the Anglo-American Cataloging Rules and its newer forms.)

MARC is Familiar

Library cataloging makes use of a very complex set of rules that determine how a resource is described. Once the decisions are made regarding the content of the description, those results are coded in MARC. Because the creation of the catalog record has been done in the MARC format since the late 1970's, working catalogers today have known only MARC as the bibliographic record format and the cataloging interface. Catalogers speak in "MARC" - using the tags to name data elements - e.g. "245" instead of "title proper".


Those who work with MARC consider it to be "human readable." Most of the description is text, therefore what the cataloger creates is exactly what will appear on the screen in the library catalog. If a cataloger types "ill." that is what will display; if the cataloger instead types "illustrations" then that is what will display. In terms of viewing a MARC record on a screen, some cataloger displays show the tags and codes to one side, and the text of those elements is clearly readable as text.

MARC Gives Catalogers Control

The coding is visible, and therefore what the cataloger creates on the screen is virtually identical to the machine-readable record that is being created. Everything that will be shown in the catalog is in the record (with the exception of cover art, at least in some catalogs). The MARC rules say that the order of fields and subfields in the record are the order in which that information should be displayed in the catalog. Some systems violate this by putting the fields in numeric order, but the order of subfields is generally maintained. Catalogers wish to control the order of display and are frustrated when they cannot. In general, changing anything about the record with automated procedures can un-do the decisions made by catalogers as part of their work, and is a cause of frustration for catalogers.

MARC is International

MARC is used internationally, and because the record uses numerics and alphanumeric codes, a record created in another country is readable to other MARC users. Note that this was also the purpose of the International Standard Bibliographic Description (ISBD), which instead of tags uses punctuation marks to delimit elements of the bibliographic description. If a cataloger sees this, but cannot read the text:

  245 02   |a לטוס עם עין אחת / |c דני בז.

it is still clear that this is a title field with a main title (no subtitle), followed by a statement of the author's name as provided on the title page of the book.

MARC is the Lingua Franca of Cataloging

This is probably the key point that comprises all of the above, but it is important to state it as such. This means that the entire workflow, the training materials, the documentation - all use MARC. Catalogers today think in MARC and communicate in MARC. This also means that MARC defines the library cataloging community in the way that a dialect defines the local residents of a region. There is pride in its "library-ness". It is also seen as expressing the Anglo-American cataloging tradition.

MARC is Concise

MARC is concise as a physical format (something that is less important today than it was in the 1960s when MARC was developed), and it is also concise on the screen. "245" represents "title proper"; "240" represents "uniform title"; "130" represents "uniform title main entry". Often an entire record can be viewed on a single screen, and the tags and subfield codes take up very little display space.

MARC is Very Detailed

MARC21 has about 200 tags currently defined, and each of these can have up to 36 subfields. There are about 2000 subfields defined in MARC21, although the distribution is uneven and depends on the semantics of the field; some fields have only a handful of subfields, and in others there are few codes remaining that could be assigned.

MARC is Flat

The MARC record is fairly flat, with only two levels of coding: field and subfield. This is a simple model that is easy to understand and easy to visualize.

MARC is Extensible

Throughout its history, the MARC record has been extended by adding new fields and subfields. There are about 200 defined fields which means that there is room to add approximately 600 more.

MARC has Mnemonics

Some coding is either consistent or mnemonic, which makes it easier for catalogers to remember the meaning of the codes. There are code blocks that refer to cataloging categories, such as the title block (2XX), the notes block (5XX) and the subject block (6XX). Some subfields have been reserved for particular functions, such as the use of the numeric subfields in 0-8. In other cases, the mnemonic is used in certain contexts, such as the use of subfield "v" for the volume information of series. In other fields, the "v" may be used for something else, such as the "form" subfield in subject fields, but the context makes it clear.

There are also field mnemonics. For example, all tagged fields that have "00" in the second and third places are personal name fields. All fields and subfields that use the number 9 are locally defined (with a few well-known exceptions).

MARC is Finite and Authoritative

MARC defines a record that is bounded. What you see in the record is all of the information that is being provided about the item being described. The concept of "infinite graphs" is hard to grasp, and hard to display on a screen. This also means that MARC is an authoritative statement of the library bibliographic description, whereas graphs may lead users to sources that are not approved by or compatible with the library view.

Thursday, April 06, 2017

Precipitating Forward

Our Legacy, Our Mistake

If you follow the effort taking place around the proposed new bibliographic data standard, BIBFRAME, you may have noticed that much of what is being done with BIBFRAME today begins our current data in MARC format and converts it to BIBFRAME. While this is a function that will be needed should libraries move to a new data format, basing our development on how our legacy data converts is not the best way to move forward. In fact, it doesn't really tell us what "forward" might look like if we give it a chance.

We cannot define our future by looking only at our past. There are some particular aspects of our legacy data that make this especially true.          

I have said before (video, article) that we made a mistake when we went from printing cards using data encoded in MARC, to using MARC in online catalogs. The mistake was that we continued to use the same data that had been well-adapted to card catalogs without making the changes that would have made it well-adapted to computer catalogs. We never developed data that would be efficient in a database design or compatible with database technology. We never really moved from textual description to machine-actionable data points. Note especially that computer catalogs fail to make use of assigned headings as they are intended, yet catalogers continue to assign them at significant cost.

One of the big problems in our legacy data that makes it hard to take advantage of computing technology is that the data tends to be quirky. Technology developers complain that the data is full of errors (as do catalogers), but in fact it is very hard to define, algorithmically, what is an error in our data.  The fact is that the creation of the data is not governed by machine rules; instead, decisions are made by humans with a large degree of freedom. Some fields are even defined as being either this or that, something that is never the case in a data design. A few fields are considered required, although we've all seen records that don't have those required fields. Many fields are repeatable and the order of fields and subfields is left to the cataloger, and can vary.

The cataloger view is of a record of marked-up text. Computer systems can do little with text other than submit it for keyword indexing and display it on the screen. Technical designers look to the fixed fields for precise data points that they can operate on, but these are poorly supported and are often not included in the records since they don't look like "cataloging" as it is defined in libraries. These coded data elements are not defined by the cataloging code, either, and can be seen a mere "add-ons" that come with the MARC record format. The worst of it is that they are almost uniformly redundant with the textual data yet must be filled in separately, an extra step in the cataloging process that some cannot afford.

The upshot of this is that it is very hard to operate over library catalog data algorithmically. It is also very difficult to do any efficient machine validation to enforce consistency in the data. If we carry that same data and those same practices over to a different metadata schema, it will still be very hard to operate over algorithmically, and it will still be hard to do quality control as a function of data creation.

The counter argument to this is that cataloging is not a rote exercise - that catalogers must make complex decisions that could not be done by machines. If cataloging were subject to the kinds of data entry rules that are used in banking and medical and other modern systems, then the creativity of the cataloger's work would be lost, and the skill level of cataloging would drop to mere data entry.

This is the same argument you could used for any artisanal activity. If we industrialize the act of making shoes, the skills of the master shoe-maker are lost. However, if we do not industrialize shoe production, only a very small number of people will be able to afford to wear shoes.

This decision is a hard one, and I sympathize with the catalogers who are very proud of their understanding of the complexity of the bibliographic world. We need people who understand that complexity. Yet increasingly we are not able to afford to support the kind of cataloging practices of which we are proud. Ideally, we would find a way to channel those skills into a more efficient workflow.

There is a story that I tell often: In the very early days of the MARC record, around the mid-1970's, many librarians thought that we could never have a "computer catalog" because most of our cataloging existed only on cards, and we could NEVER go back and convert the card catalogs, retype every card into MARC. At that same time, large libraries in the University of California system were running over 100,000-150,000 cards behind in their filing. For those of you who never filed cards... it was horribly labor intensive. Falling 150,000 cards behind meant that a book was on the shelf THREE MONTHS before the cards were in the catalog. Some of this was the "fault" of OCLC which was making it almost too easy to create those cards. Another factor was a great increase in publishing that was itself facilitated by word processing and computer-driven typography. Within less than a decade it became more economical to go through the process of conversion from printed cards to online catalogs than to continue to maintain enormous card catalogs. And the rest is history. MARC, via OCLC, created a filing crisis, and in a sense it was the cost of filing that killed the card catalog, not the thrill of the modern online catalog.

The terrible mistake that we made back then was that we did not think about what was different between the card catalog and the online catalog, and we did not adjust our data creation accordingly. We carried the legacy data into the new format which was a disservice to both catalogers and catalog users. We missed an opportunity to provide new discovery options and more efficient data creation.

We mustn't make this same mistake again.

The Precipitant

Above I said that libraries made the move into computer-based catalogs because it was uneconomical to maintain the card catalog. I don't know what the precipitant will be for our current catalog model, but there are some rather obvious places to look to for that straw that will break the MARC/ILS back. These problems will probably manifest themselves as costs that require the library to find a more efficient and less costly solution. Here are some of the problems that I see today that might be factors that require change:

  • Output rates of intellectual and cultural products is increasing. Libraries have already responded to this through shared cataloging and purchase of cataloging from product vendors. However, the records produced in this way are then loaded into thousands of individual catalogs in the MARC-using community.
  • Those records are often edited for correctness and enhanced. Thus they are costing individual libraries a large amount of money, potentially as much or more than libraries save by receiving the catalog copy.
  • Each library must pay for a vendor system that can ingest MARC records, facilitate cataloging, and provide full catalog user (patron) support for searching and display.
  • "Sharing" in today's environment means exporting data and sending it as a file. Since MARC records can only be shared as whole records, updates and changes generally are done as a "full record replace" which requires a fair amount of cycles. 
  • The "raw" MARC record as such is not database friendly, so records must be greatly massaged in order to store them in databases and provide indexing and displays. Another way to say this is that there are no database technologies that know about the MARC record format. There are database technologies that natively accept and manage other data formats, such as key-value pairs

There are some current technologies that might provide solutions:

  • Open source. There is already use of open source technology in some library projects. Moving more toward open source would be facilitated by moving away from a library-centric data standard and using at least a data structure that is commonly deployed in the information technology world. Some of this advantage has already been obtained with using MARCXML.
  • The cloud. The repeated storing of the same data in thousands of catalogs means not being able to take advantage of true sharing. In a cloud solution, records would be stored once (or in a small number of mirrors), and a record enhancement would enhance the data for each participant without being downloaded to a separate system. This is similar to what is being proposed by OCLC's WorldShare and Ex Libris' Alma, although presumably those are "starter" applications. Use of the cloud for storage might also mean less churning of data in local databases; it could mean that systems could be smaller and more agile.
  • NoSQL databases and triple stores. The current batch of databases are open source, fast, and can natively process data in a variety of formats (although not MARC). Data does not have to be "pre-massaged" in order to be stored in a database or retrieved and the database technology and the data technology are in sync. This makes deployment of systems easier and faster. There are NoSQL database technologies for RDF. Another data format that has dedicated database technology is XML, although that ship may have sailed by now.
  • The web. The web itself is a powerful technology that retrieves distributed data at astonishing rates. There are potential cost/time savings on any function that can be pushed out the web to make use of its infrastructure. 

The change from MARC to ?? will come and it will be forced upon us through technology and economics. We can jump to a new technology blindly, in a panic, or we can plan ahead. Duh.