Friday, September 01, 2006

Murdering MARC

It's been almost four years now since Roy Tennant's rallying cry of "MARC Must Die" and little has been done to further that goal. It seems pretty clear that the MARC format will not expire of its own accord, so it may be time to contemplate murder. (I'm not usually given to violent actions. Perhaps I've been reading a bit too much medieval history of late.)

There's understandably a great reluctance to tackle a change that will have such a wide-ranging effect on our everyday library operations. However, like all large tasks, it becomes more manageable when it has been analyzed into a number of smaller tasks, and I'm convinced that if we put our minds to it we can move on to a bibliographic record format that meets our modern needs. I'm also convinced that we can transition systems from the current MARC format to its successor without having to undergo a painful revolution.

The alternative to change is that library systems will cobble on kludge after kludge in their attempts to provide services that MARC does not support. It will be very costly for us to interact with non-library services and we'll continue to be viewed as adhering to antiquated technology. Since I don't like this alternative, I propose that we begin the "Death to MARC" process ASAP. It should start with a few analysis tasks, some of which are underway:

a) Analysis of the data elements in the MARC record. I have done some work on this, although as yet unpublished. But I will share here some of the data I have gathered over the past few years.
b) Analysis of how the MARC record has actually been used. This is underway at the University of North Texas where Bill Moen and colleagues are studying millions of OCLC records to discover the frequency with which data elements are actually used. This data is important because we absolutely will have to be able to transition our current bibliographic data to the new format or formats. (Yes, I said "formats.") Another aspect of this would be to investigate how library systems have made use of the data elements in the MARC record, with the hope of identifying ones that may not be needed in the future, or whose demise would have a minimum impact.
c) A functional analysis of library systems. There's a discussion taking place on a list called ngc4lib about the "Next Generation Catalog" for libraries. In that discussion it quickly becomes clear that the catalog is no longer (if it ever was) a discrete system, and our "bibliographic record" is really serving a much broader role than providing an inventory of library holdings. This was the bee in my bonnet when I wrote my piece for the Library Hi Tech special issue on the future of MARC. I'm not sure I could still stand by everything I said in that article, but the issues that were bugging me then are bugging me still now. If we don't understand the functions we need to serve, our basic bibliographic record will not further our goals.

If there's interest in this topic, perhaps we can get some discussion going that will lead to action. I'm all ears (in the immortal words of Ross Perot).

22 comments:

Bruce D'Arcus said...

Hi Karen -- nothing like a provocative title! You won't be surprised to know that I think the way out of MARC and its legacy (including MODS frankly) is RDF. So think about stripping down MODS to something closer to DC and DCQ (though it might be different in fact) and then through RDF/OWL schemas plugging those more grounded views into the more abstract world of FRBR (which now has a nice RDF representation).

I think the time of huge, monolithic, and largely self-contained data formats is passing.

Tom Habing said...

MARC may not be as outdated as you think. Take a look at Looking back, looking forward: a metadata standard for LANL's aDORe repository. To quote from the abstract:

Although often disparaged or dismissed in the library community, the MARC standard, notably the MARCXML standard, provides surprising flexibility and robustness for mapping disparate metadata to a vendor-neutral format for storage, exchange, and downstream use.

J. McRee (Mac) Elrod said...

Murder fixed fields if you like, but insofar as the variable fields reflect the international standard of ISBD in a language neutral manner, we've seen nothing suggested which even begins to have its utility and versatility.

The study of OCLC records mentioned is NOT a study of the *use* of MARC fields, i.e. their utilisation in OPACs and usefulness for patrons, but rather their inclusion in records, which is not particularly helpful.

J. McRee (Mac) Elrod mac@slc.bc.ca

Robin Wendler said...

A minor point in the grand murder plot, but...
Not to take away from the analysis of tag occurrence, but has Bill et al's work taken into account the percentage of the time a given tag is relevant? For example, the 250 (edition) might not appear in a high percentage of records, because it usually does not apply. But when it does apply, it is crucial to understanding what resource the record describes. Other less obvious examples abound of rarely used elements that are crucial for some resources or to some communities. Bill's work needs to be interpreted with care, and I fear some administrators only hear the sound bite "lots of MARC tags rarely used!"

I'm trying to be open minded about the idea that records don't matter, but somehow the statement that X is "3rd ed., rev." doesn't help much by itself.

The inevitability of this transition from the familiar to the unknown makes me feel like my great-grandmother. She never grasped cloverleaf highway exchanges, introduced late in her life. She had trouble understanding why anyone would have to turn right when they wanted to go left. Just trying not to become roadkill on the metadata highway here...

Bruce said...

I think the "language-neutrality" argument that I've seen used to justify MARC is a red herring. MARC solves the problem of cross-language intelligibility by making it equally difficult to read by everyone. That's hardly a very useful approach in a world of the emerging semantic web, where technologies like RDF and OWL allow you to elegantly attach localized natural descriptions to any URI-identified term, and to link terms (so a Chinese group makes their own localized IRI for a title; not a problem to link its meaning to a dc:title).
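As a rough illustration of that kind of linking, the sketch below models RDF statements as plain Python tuples instead of using an RDF library. The Dublin Core and RDFS IRIs are real; the `http://example.org/...` IRIs, the sample title, and the helper names are invented for the example, and `rdfs:subPropertyOf` is only one of several ways the link could be expressed.

```python
# Minimal sketch: RDF statements as (subject, predicate, object) tuples.
# Literals are (text, language) pairs; everything else is an IRI string.
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"

# Hypothetical localized property and resource, invented for illustration.
LOCAL_TITLE = "http://example.org/terms/题名"
BOOK = "http://example.org/book/1"

triples = {
    (LOCAL_TITLE, RDFS_SUBPROPERTY_OF, DC_TITLE),  # link the local term to dc:title
    (LOCAL_TITLE, RDFS_LABEL, ("题名", "zh")),       # localized, human-readable label
    (DC_TITLE, RDFS_LABEL, ("Title", "en")),
    (BOOK, LOCAL_TITLE, ("红楼梦", "zh")),            # data stated with the local term
}


def super_properties(prop):
    """Return prop plus every property it is (transitively) a sub-property of."""
    found, todo = set(), [prop]
    while todo:
        current = todo.pop()
        if current in found:
            continue
        found.add(current)
        todo.extend(o for s, p, o in triples
                    if s == current and p == RDFS_SUBPROPERTY_OF)
    return found


def values_of(subject, prop):
    """Values of prop on subject, including those stated via sub-properties."""
    return sorted(o for s, p, o in triples
                  if s == subject and prop in super_properties(p))


def label(term, lang):
    """Return the label of term in the given language, or None."""
    for s, p, o in triples:
        if s == term and p == RDFS_LABEL and o[1] == lang:
            return o[0]
    return None


# A consumer that only knows dc:title still finds the locally described title.
print(values_of(BOOK, DC_TITLE))
print(label(LOCAL_TITLE, "zh"), label(DC_TITLE, "en"))
```

The data is written once with the local term, and a reader who knows only dc:title can still retrieve it by following the declared link.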

It's pretty telling damnation that you rarely if ever see any project these days outside the library world that uses MARC or MARCXML.

DenverJeffrey said...

Dear Karen: The best way, I think, to "murder" MARC is to create a better alternative. In my library, we still invest heavily in print assets (books), and we need a system to provide for their discovery by our patrons. These books are the only source of the information they contain; it is not available on the Internet. So providing a discovery tool is crucial. Our analyses show that using an online catalog with MARC as the metadata scheme, plus the content standards that go along with it, is currently the best method available on the market to do this. We would surely move to a better system that didn't use MARC were one available, but one is not.

Your using the phrase "murder MARC" reveals a sort of personal distaste towards the standard and makes the reader wonder why this hang up exists. Why this attitude towards a metadata standard? How did MARC jilt you?

I have bookmarked your blog and will look forward to reading your postings, just as I enjoy reading (and learning a lot from) your print publications, many of which I found using, alas, MARC.

Mia said...

Why not: "MARC must MORPH"? It's a bit more upbeat! At the outset, let me say that I personally lack the technical expertise to offer any concrete suggestions, but perhaps I can make some contribution on the conceptual level.

In terms of amplifying retrieval for users, much of the existing MARC data has never actually been capitalized upon, let alone exploited. Case in point: OCLC has recently been doing some prototyping, estimating Audience level in WorldCat holdings to draw out a relationship for Scholarship. That's one field.

I'm interested in getting creative with what is already in those records, and which has largely been ignored.

IE: tease out those existing relationships which are lying dormant. Wake those MARC fields up, in other words. Perhaps along the lines that Stefan Gradmann suggests? (which seem to point in a tantalizing direction from my perspective)

Certainly there are fields which offer way too fine granularity (007 in certain areas)- I'm not aware of efforts to harvest those grains for the benefit of a user trying to correlate things during a search. (that's the acid test, for me).

I labour under the assumption that we only get the chance to capture the data, take the snapshot, once. (perhaps that need not be the assumption??)

I guess the debate being framed is: what should be in the snapshot.

Hope these comments are not perceived as being out in left field (pun?)

Dan Matei said...

Hi Karen

I agree that MARC should die, but peacefully :-)
And it deserves military honors ! It served us well, a long time.

At the 2004 ELAG [European Library Automation Group] (www.elag2004.no), in Workshop 10, we discussed how to move from MARC to something based on XML.

The (main) problems we detected with MARC were (and I cite from the workshop report):

- the "1 to 1 principle" is not observed (i.e. the records are not "normalized", that is, a record contains data about several entities, e.g. the book and its authors); the "bright side": the MARC record is "self-contained" !
- it is not too flexible, i.e. it is almost flat, i.e. it allows only 2 (or 3 ?) hierarchical levels, that is, it does not allow a good control of the granularity of data; [on the other hand — Ole Husby contributed — it is too flexible, i.e. (as we understand it) it allows (too often) alternative solutions to a problem];
- some tags (e.g. those for the headings) express two different things: a) the nature of the related entity and b) the kind of relationship of that entity with the "host" record; this leads to a bigger MARC "schema" than necessary;
- it does not allow (naturally) multilingual data within a record, i.e. for the fields with values in the language of the cataloguing.
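The third point can be made concrete with the MARC 21 heading tags, where the first digit signals the relationship to the record and the last two digits signal the kind of entity. What follows is only a minimal sketch covering a handful of well-known tags; the function name `split_heading_tag` is invented here and is not part of any MARC tool.

```python
# First digit of a heading tag: relationship of the entity to the record.
RELATIONSHIP = {
    "1": "main entry",
    "6": "subject added entry",
    "7": "added entry",
    "8": "series added entry",
}

# Last two digits of a heading tag: nature of the related entity.
ENTITY = {
    "00": "personal name",
    "10": "corporate name",
    "11": "meeting name",
    "30": "uniform title",
}


def split_heading_tag(tag):
    """Split a heading tag into (relationship, entity kind), or None if not covered."""
    if len(tag) != 3:
        raise ValueError("MARC tags are three characters long")
    relationship = RELATIONSHIP.get(tag[0])
    entity = ENTITY.get(tag[1:])
    if relationship is None or entity is None:
        return None
    return relationship, entity


for tag in ("100", "610", "711", "830", "245"):
    print(tag, split_heading_tag(tag))
```

In this sketch four relationships and four entity kinds yield sixteen distinct tags, which is the "bigger schema than necessary" effect: the two ideas are multiplied together instead of being recorded separately.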

At the 2006 ELAG meeting in Bucharest (elagreports.cimec.ro/papers/papers.htm), my paper listed 3 specific functional requirements for the shared catalogues, i.e.:

FR1: language neutrality, i.e. the textual elements could be expressed in several languages, and the language of an element could be detected automatically;
FR2: traceability of changes, i.e. the modifications could be tracked, dated and attributed (thus, reversed);
FR3: opinion neutrality, i.e. different opinions could coexist in the metadata, that is the elements could have alternative values, with clearly assigned intellectual responsibilities.

I think the next catalographic format(s) should solve the "MARC problems" and meet these 3 extra requirements. Besides, it/they should use a data model based on FRBR & CRM conceptual models, i.e. it/they should accommodate library AND museum resources.

My preference is a granular XML schema (a suggestion is in my Bucharest paper), not RDF. In my view, RDF is an "atomic" approach. I guess that a "molecular" one is more convenient: i.e. elements for work, expression, manifestation, item, person, corporate body, etc., all linked by relationships, as in the CRM model. I will present a (first) draft of a (partial) schema in my paper at the CIDOC meeting in Gothenburg (www.meta.se/cidoc06/), in a couple of days.

Anonymous said...

One of the biggest problems with MARC is how it holds us hostage as a 'cataloging code'. MARC is not supposed to be a cataloging code at all of course, it is theoretically simply an encoding standard independent from cataloging codes like AACR2.

However, in reality, cataloging tends to be done to the MARC "standard". MARC is as much a part of the code of rules controlling conventional cataloging as AACR2.

This is a problem, because it leads to confusion and trouble rationalizing and improving our cataloging practice, with the AACR2/RDA people not wanting to discuss MARC, and vice versa. We _pretend_ they are independent, because they _ought_ to be independent, but they are not. So while MARC ought to be just one of many possible encoding formats, in fact we are held hostage to it; there's no way to use another encoding format---or data model, because too much of what we do is targeted to MARC.

In my mind FRBR is a large part of the solution. FRBR can provide an explicit data model (instead of an implicit one in MARC), and can be the glue that holds together our cataloging code(s) to our encoding format(s) without them having to reference each other directly, but allowing a rational 'interface', if you will, between them.

But this can only happen if people work towards making it happen. FRBR is not up to the task yet, FRBR is just a start (yet FRBR hasn't been elaborated in the decade since it was introduced...). Our cataloging codes and our future encoding standards also need to be worked on with this goal in mind.

Jonathan Rochkind

Mia said...

To pick up on some loose threads - (and I promise to continue to dangle them loosely and mix metaphors while I'm at it) - one FRBR sticky point converges around the expression level. I haven't figured out which parties are bothered about it and why exactly, but setting that aside, I'm willing to give it a run. I focus on FRBR Expression since as we've all experienced, multiple E's just do not scale well. Anywhere. So, taking it from the top: an entity comes along, for which I, as creator of the metadata, feel an overwhelming clarity and urgency to specify thusly [as a unique expression]. On the other hand, it could also just be more like a gnawing, faint guilty feeling, like 'this could be one of those cases where I OUGHT to specify it thusly, and in case of doubt, then I should do so' times. Millions of bibliographic entities have had, and will continue to have, edition-level (and soon-to-be-expression-level?) ascribed to them by human rule-followers/interpreters in just that way. Hence, my earlier 'Anywhere' assertion. How to collapse and cluster, yet explode when relevant later on, is the objective. My point (though I'm trying hard to obscure it) is that either there is a clearly beneficial reason to distinguish all possible in-case-of-doubt-editions/expressions (even at an atomistic level) or, there is not. To try a different analogy: data snapshot = edition/expression; multiple E data snapshots = a stream where each E(n+1) is a frame. The E's are on a continuum, from "most clear and urgent" to "least clear and possibly debatable." What I would then next be looking for is a structure which will give the user the ability to ignore all/some/any individual frames, or freeze one, on-the-fly, at will, depending on my pivot point. At the same time, a structure where the data creators/encoders (be they human or machine) need not overly agonize about the precision, because the atomization (if you will) makes it flexible and enables 'forgiving-ness' down the road. 
I hope that whatever new framework emerges will allow the necessary uncoupling of elements so that things can be 'resized' later on.

Mia Massicotte

William Moen said...

Thank you Karen for starting this discussion and setting up this forum for postings and responses.

As I read the posts here and in the MARC list, I think it will be helpful to unbundle what we mean by MARC, and then focus discussion of issues on these various aspects of MARC. So we have at least the following, and there may be more aspects that should be considered:

1. The Z39.2/ISO 2709 record structure

2. MARC 21's implementation of the Z39.2/ISO 2709

3. MARC 21's family of formats (bib, authority, etc.)

4. MARC 21's set of data elements for each of the formats

5. MARC 21's instructions for punctuation or other included coding in the data values.
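To make the first two items more tangible, here is a rough sketch of the Z39.2/ISO 2709 layout as MARC 21 uses it: a 24-character leader, a directory of 12-character entries (tag, field length, starting position), and the field data, with control characters separating fields and subfields. The sample record and the function names are made up for illustration, and a real parser would need validation and character-set handling that are omitted here.

```python
FT = b"\x1e"  # field terminator
RT = b"\x1d"  # record terminator
SF = b"\x1f"  # subfield delimiter


def build_record(fields):
    """Serialize [(tag, body_bytes)] into a leader + directory + data record."""
    directory, data = b"", b""
    for tag, body in fields:
        body += FT
        # Directory entry: 3-char tag, 4-digit length, 5-digit start offset.
        directory += b"%s%04d%05d" % (tag.encode("ascii"), len(body), len(data))
        data += body
    directory += FT
    base = 24 + len(directory)      # base address of data (leader 12-16)
    total = base + len(data) + 1    # record length (leader 00-04), incl. RT
    leader = b"%05dnam a22%05d a 4500" % (total, base)
    return leader + directory + data + RT


def parse_record(raw):
    """Return (leader, fields) from a single serialized record."""
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])
    directory = raw[24:base - 1]    # drop the terminator closing the directory
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        body = raw[base + start:base + start + length - 1]  # strip terminator
        if tag.startswith("00"):
            # Control fields carry no indicators or subfields.
            fields.append((tag, body.decode("utf-8")))
        else:
            indicators = body[:2].decode("ascii")
            subfields = [(chunk[:1].decode("ascii"), chunk[1:].decode("utf-8"))
                         for chunk in body[2:].split(SF)[1:]]
            fields.append((tag, indicators, subfields))
    return leader, fields


# A made-up two-field record, used only to exercise the functions above.
sample = build_record([
    ("001", b"demo0001"),
    ("245", b"10" + SF + b"aMurdering MARC /" + SF + b"cKaren Coyle."),
])
leader, fields = parse_record(sample)
print(leader)
for field in fields:
    print(field)
```

Note that the sketch touches only items 1 and 2 of the list: the structure says nothing about which tags exist or what punctuation belongs inside the values, which is why it helps to discuss those aspects separately.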

I also agree with your suggestion for analyses that need to be undertaken or completed to inform decisions about a machine-readable record structure and data elements that will likely supplant the current MARC regime at some point in the future.

William Moen said...

In response to Tom's pointer to the Goldsmith & Knudson article, I too appreciate their conclusion about the utility of "MARC" (I'm using quotes here since I think there are several aspects that people have in mind when they use this term -- as indicated in my previous posting) in the context of their metadata requirements. As I read the article and heard the presentation at JCDL, to me the primary strength they found in "MARC" was the set of very granular data elements. Further, they focused on MARCXML rather than the Z39.2/ISO2709 structure in their choice of "MARC" for their use.

Sally McCallum from LC's Network Development and MARC Standards Office used the Goldsmith & Knudson article as a way of framing a "future of MARC" presentation at the recent RLG Member Forum (Aug 2006, Washington, DC). Sally's presentation (MP3 and Powerpoint) is available at: http://www.rlg.org/en/page.php?Page_ID=20968

specifically:

http://www.rlg.org/en/downloads/Sally_McCallum.mp3
http://www.rlg.org/en/pdfs/Forum.8-06.McCallum.pdf

William Moen said...

I've responded to J. McRee (Mac) Elrod's comments posted on the MARC or other lists in the past about our research study on MARC utilization by catalogers. In those responses, I explained that the scope of this research project was to identify the extent of use of currently defined MARC content designation structures by catalogers. We believe that our empirical investigation into catalogers' use will provide one perspective on MARC data element use. It is clear that research studies should also be conducted on how various systems actually take advantage of the richly coded data provided by the catalogers, and more fundamentally still, the extent to which users (the various groups of users) find the data provided in a MARC record useful to their information seeking needs.

William Moen said...

Robin Wendler's comments are critical in terms of our research study on catalogers' use of MARC content designation structures. We did do our analyses on sets of records based on format of material, so in cases where there are fields/subfields more relevant to a specific format of material, we could get a more accurate picture of their use (e.g., 255 for cartographic materials).

But Robin's example of a 250 for a specific bib item is not part of our analyses. This was pointed out to me clearly for field 656 that occurred once in the 7.5 million LC created records for Books, Pamphlets and Printed Sheets that are in our complete dataset of 56 million records. A cataloger pointed out to me that this field may not be used a lot in Book type materials, but is highly used in archival materials.
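The difference between pooled counts and per-format counts can be sketched in a few lines. The records below are toy data invented for the illustration, not figures from our dataset, and the function name `tag_share_by_format` is hypothetical.

```python
from collections import Counter, defaultdict


def tag_share_by_format(records):
    """For each format, return the share of records in which each tag occurs."""
    totals = Counter()
    with_tag = defaultdict(Counter)
    for rec in records:
        totals[rec["format"]] += 1
        for tag in set(rec["tags"]):  # count a repeated tag once per record
            with_tag[rec["format"]][tag] += 1
    return {
        fmt: {tag: n / totals[fmt] for tag, n in counts.items()}
        for fmt, counts in with_tag.items()
    }


# Toy records, invented for illustration only.
toy_records = [
    {"format": "books", "tags": ["245", "250", "260"]},
    {"format": "books", "tags": ["245", "260"]},
    {"format": "maps", "tags": ["245", "255"]},
    {"format": "maps", "tags": ["245", "255", "260"]},
]

shares = tag_share_by_format(toy_records)
for fmt, tags in sorted(shares.items()):
    for tag, share in sorted(tags.items()):
        print(fmt, tag, share)
```

In the toy data, field 255 appears in every map record and in no book record, so a pooled count across all four records would understate its importance for maps. Grouping by format addresses that, but it still cannot tell whether a field such as 250 was applicable to a particular item.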

Our hope is that, by providing our results to the broader communities of interest, we will enable them to help interpret specific instances. We have tried to be very careful in our presentations of results to indicate that our "numbers don't stand on their own," but need the interpretive analyses that other experts can provide.

Bruce said...

Dan said:

"My preference is a granular XML schema (a suggestion is in my Bucharest paper), not RDF. In my view, RDF is an "atomic" approach."

I dunno, I really think when you go down this path of thinking in terms of document formats and XML Schema, you end up repeating the same mistake all over again, which is to treat library metadata as somehow special. It's not, and I think RDF is in fact perfectly suited to representing library metadata as is, and more importantly, to breaking library metadata out of the walls of the library.

Why shouldn't one take existing DC, etc. metadata and simply mash it up and plug it into the more abstract world of a FRBR (or FRBR-inspired) model via RDF, rather than reinvent an entire infrastructure that only some librarians will ever understand?

Dorothea said...

Erm, because nobody understands RDF?

Bruce said...

Dorothea -- nobody outside a small circle of library metadata experts understands MARC or MODS or MADS or METS or FRBR either. And XML Schema is hardly an elegant, simple answer; it is arguably more complex with significantly less pay-off.

I find the argument that RDF is hard to understand absolutely perplexing given the over-the-top complexity of library-based specifications.

Dorothea said...

I'm fully aware of the problem with library-spec complexity, thanks. I just don't think RDF is the right platform to build a better spec from. Nor is XML Schema, granted.

Where's RelaxNG when you need it? Or XTM?

Anonymous said...

An interesting call to arms, since it comes from a mercenary.

As an independent library digital consultant, wouldn't you benefit greatly from any abandonment of the status quo -- whether it's needed or not?

Karen Coyle said...

I prefer the term "conslutant."

I'm flattered that anyone would consider me powerful enough to manipulate the entire library community for my own gain. Perhaps I've been selling myself short.

Bruce said...

Dorothea -- I'm a huge advocate of RELAX NG, having written a fair number of schemas (including a translation of MODS) in it. It is a very powerful and elegant way to write XML languages.

But it's solving a different -- much more limited -- problem: how to validate XML. It does nothing to solve the larger problem of semantic interoperability, and to do that you need a model. There's no way around it.

Whether it's RDF, or some reinvent-the-wheel substitute (notwithstanding XTM, which is something I'm not competent to discuss), I can't be sure, but solve this problem you must. Otherwise, libraries will fall farther and farther behind where the rest of the world is going WRT metadata.

It might be worth considering that the complexity in library specs that you agree on is in fact not an accident, but rather coming at the problem from the wrong direction.

I actually think FRBR gets it right: a clear, abstract (and relational) model. But there's a lot of work to do to bring it to the ground.

Hilbert said...

Hello! I read this article! Big thanks to author, very interesting. Write more.