Coyle's InFormation: MARC: from mark-up to data

Friday, March 05, 2010

MARC: from mark-up to data

The main reason that I keep pushing the semantic web is not that I think the semantic web is the answer to all of our problems -- but I do think we need to have something to be moving toward in terms of transforming our data carrier to something both more modern and web-compatible. The semantic web gives us some basic concepts of data design. I'm not sure that the semantic web concepts will hold for data as complex as the library bibliographic record, but there's only one way to find out: do it. That's a huge task, of course.

The first question to be answered is: What are our data elements? In theory, this should be one of the simpler questions, but it's not. I can create a list of all of the MARC fields, subfields, and fixed field elements (which I have, and they are linked from this page of the futurelib wiki), but that doesn't answer the question. Here's why:

Indicators

The indicators in the MARC fields are like a wild card in poker -- they can be used to utterly transform the play. Some of the indicators are simple and probably can be dismissed: the non-filing indicators and the indicators that control printing. Some are data elements in themselves: "Existence in NAL collection" is essentially a binary data element. Many further refine the meaning of the field, allowing the field to carry any one of a number of related subelements:

Second - Type of ring
# - Not applicable
0 - Outer ring
1 - Exclusion ring

Others name the source of the term, such as LCSH or MeSH. It'll take a fair amount of work to figure out what all of these qualifiers mean in terms of actual data elements.
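To make the indicator problem concrete, here is a minimal sketch of what "figuring out the qualifiers" could look like: a lookup table that turns a (tag, indicator position, value) triple into a named data element. The table, the element names, and the function `interpret_indicators` are all hypothetical. The post does not name the tags, so assigning the ring example to field 034 and the LCSH/MeSH values to the second indicator of field 650 is a reading of the MARC 21 bibliographic format, not something stated above.

```python
# Hypothetical lookup table: (tag, indicator position, value) -> (element name, meaning).
# Only a handful of entries are shown; the element names are illustrative,
# not an official list of data elements.
INDICATOR_MEANINGS = {
    ("034", 2, " "): ("type_of_ring", "Not applicable"),
    ("034", 2, "0"): ("type_of_ring", "Outer ring"),
    ("034", 2, "1"): ("type_of_ring", "Exclusion ring"),
    ("650", 2, "0"): ("subject_source", "LCSH"),
    ("650", 2, "2"): ("subject_source", "MeSH"),
}


def interpret_indicators(tag, ind1, ind2):
    """Return the data elements implied by a field's two indicators."""
    elements = {}
    for position, value in ((1, ind1), (2, ind2)):
        # MARC documentation writes a blank indicator as '#'; treat it as a space.
        key = (tag, position, " " if value in ("#", "") else value)
        if key in INDICATOR_MEANINGS:
            name, meaning = INDICATOR_MEANINGS[key]
            elements[name] = meaning
    return elements


print(interpret_indicators("650", " ", "0"))
print(interpret_indicators("034", "1", "#"))
```

Indicators that only control filing or printing would simply have no entry in the table, which is one way of "dismissing" them.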

Redundancy

There is non-textual (although not non-string) data in the MARC record, primarily in the fixed fields (00X) but also in some of the number and code fields (0XX). Some of these, actually most of these, are redundant with display information in the body of the record. Should these continue to be separate data elements, or can we remove this redundancy and still have useful user displays? Basically, having the same information entered in two different ways in your data is just begging for trouble, and we've all seen fixed field dates and display (260 $c) dates that contradict each other.
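The date contradiction is easy to test for, at least roughly. The sketch below assumes the fixed field in question is the 008, where Date 1 sits in character positions 07-10, and that the display date can be approximated by the first four-digit year found in 260 $c. The function names and the sample strings are invented for illustration.

```python
import re


def fixed_field_date1(field_008):
    """Return Date 1 from an 008 string (positions 07-10), or None if it is not four digits."""
    date1 = field_008[7:11]
    return date1 if re.fullmatch(r"\d{4}", date1) else None


def display_year(subfield_260c):
    """Pull the first four-digit year out of a 260 $c string such as 'c2006.' or '[1998]'."""
    match = re.search(r"\d{4}", subfield_260c)
    return match.group(0) if match else None


def dates_agree(field_008, subfield_260c):
    """True or False when both dates can be read; None when either one cannot."""
    coded = fixed_field_date1(field_008)
    displayed = display_year(subfield_260c)
    if coded is None or displayed is None:
        return None
    return coded == displayed


# Invented sample data: only the first 18 characters of an 008 are given here.
print(dates_agree("060101s2006    nyu", "c2006."))
print(dates_agree("060101s2005    nyu", "c2006."))
```

Returning None for partially known dates (such as "19uu") keeps the check from reporting a contradiction where the data is merely incomplete.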

Inconsistency

Primarily due to the constraints of the MARC format, the same information has been coded differently in different fields. A personal author entry in the 100 field uses subfields abcdejqu; in the 760 linking entry field, all of that data is entered into subfield a. It's the same data element, and by that I mean that the same contents are contained in the concatenation of abcdejqu as in a. Bringing together all of these krufty bits into a more rational data definition is something I really long for.
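One way to check the claim that the two codings carry the same data element is to concatenate the 100 subfields and compare the result with the single subfield a of the linking entry. This is only a sketch: the field is modeled as a plain list of (code, value) pairs, the helper names are hypothetical, and the sample heading is invented rather than taken from a real record.

```python
import re

NAME_SUBFIELDS = "abcdejqu"  # the 100-field subfields named in the post


def heading_string(subfields, codes=NAME_SUBFIELDS):
    """Concatenate the values of the chosen subfields, in record order."""
    return " ".join(value.strip() for code, value in subfields if code in codes)


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace for a loose comparison."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


# Invented sample data for illustration only.
field_100 = [("a", "Twain, Mark,"), ("d", "1835-1910.")]
linking_subfield_a = "Twain, Mark, 1835-1910."

same_element = normalize(heading_string(field_100)) == normalize(linking_subfield_a)
print(same_element)
```

The comparison only works in this direction. Going from the single string back to separate subfields would require guessing where each part begins, which is exactly the information the flattened coding throws away.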

And of course my favorite... data buried in text

So much of our data isn't data, it's text, or it's data buried in text. My favorite example is the ISBN. Everyone knows how important the ISBN is in all kinds of bibliographic linking operations. But there isn't a place in our record for the ISBN as a data element. Instead, there is a subfield that takes the ISBN as well as other information.

020 __ |a 0812976479 (pbk.)

This means that every system that processes MARC records has to have code that separates out the actual ISBN from whatever else might be in the subfield. Other buried information includes things like pagination and size or other extents:

300 __ |a 1 sound disc : |b analog, 33 1/3 rpm, stereo. ; |c 12 in.

300 __ |a 376 p. ; |c 21 cm.
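The kind of extraction code every system ends up writing might look like the following sketch. It assumes the ISBN is the first whitespace-delimited token of 020 $a, and it recognizes only two patterns in the 300 field ("N p." for pages, and a number followed by "cm" or "in" for size). The function names are hypothetical, and real records will contain many variations this does not handle.

```python
import re


def split_020a(value):
    """Split an 020 $a string into (isbn, qualifier); isbn is None if no ISBN is found."""
    parts = value.strip().split(None, 1)
    if not parts:
        return None, None
    isbn = parts[0].replace("-", "").upper()
    # Accept a 10-character ISBN (last character may be X) or a 13-digit ISBN.
    if not re.fullmatch(r"\d{9}[\dX]|\d{13}", isbn):
        return None, value.strip()
    qualifier = parts[1].strip("() ") if len(parts) > 1 else None
    return isbn, qualifier


def parse_extent(subfield_a, subfield_c):
    """Extract a page count from 300 $a and a (number, unit) size from 300 $c."""
    pages = re.search(r"(\d+)\s*p\.", subfield_a)
    size = re.search(r"(\d+(?:\.\d+)?)\s*(cm|in)\b", subfield_c)
    return {
        "pages": int(pages.group(1)) if pages else None,
        "size": (float(size.group(1)), size.group(2)) if size else None,
    }


print(split_020a("0812976479 (pbk.)"))
print(parse_extent("376 p. ;", "21 cm."))
print(parse_extent("1 sound disc :", "12 in."))
```

Note that the sound disc example yields no page count at all: the extent there is "1 sound disc", a different kind of value that this pattern-matching approach does not capture.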

Once this analysis is done (and I do need help, yes, thank you!), it may be possible to compare MARC to the RDA elements and see where we do and don't have a match. I have a drafty web page where I am putting the lists I'm creating of RDA elements, but I will try to get it all written up on the futurelib wiki so it's all in one place. I encourage others to grab this data and play with it, or to start doing whatever you think you can do with the registered RDA vocabularies. And please post your results somewhere and let me know so that I can gather it all, probably on the wiki.

3 comments:

MLB said...: Karen,

Thanks for your effort to move us beyond MARC in this very practical way. Determining what "our" data elements are is a necessary step toward a genuine replacement to MARC. (3/08/2010 10:38 AM)
Jackie said...: Karen, any reactions to the new OCLC Research report (Smith-Yoshimura et al.) on machine use of MARC tags with regard to the issues you raise in your post? (3/30/2010 12:23 PM)
Karen Coyle said...: Jackie, I got as far as printing it, then got caught up in deadlines. But I definitely intend to read through it once I get a chance. (3/30/2010 1:53 PM)