Friday, October 29, 2010

SkyRiver/OCLC suit moved to Ohio court

The judge in the federal district court in San Francisco (the Northern District of California) has agreed to OCLC's request to transfer the proceedings in the SkyRiver/OCLC suit to the Southern District of Ohio. In an impressively thoughtful 10-page document, the judge weighs the parties' arguments for and against the transfer. In the end, the decision comes down to two things:
  1. A majority of the potential witnesses that are neither SkyRiver nor OCLC employees (e.g. libraries that can give evidence) are closer to Ohio than to California.
  2. In terms of documentation as evidence, most of this documentation will need to come out of OCLC's file cabinets, since the suit refers to OCLC business practices over a significant period of time.
I was hoping to be able to sit in on some of the action in the San Francisco court, although more experienced folks have told me that it could be deadly dull. Now we need to find possible bloggers in the Ohio area to cover this. Any volunteers?

Tuesday, October 12, 2010

Beyond MARC-up

In the recent Code4Lib Journal, Jason Thomale has published an article "Interpreting MARC: Where's the Bibliographic Data?" in which he struggles to find the separate logical elements in a MARC 245 field. I must admit that I'm not entirely clear on what he means by 'bibliographic data' but I empathize with his attempts to find the data in MARC. In his conclusion he says:
... MARC has as much in common with a textual markup language (such as SGML or HTML) as it does with what we might consider to be “structured data.”
I have myself often referred to MARC as a markup language, to distinguish it from what a computer scientist would call "data." We took the catalog card and marked it up so that we could store the text in a machine-readable form and re-create the card format as precisely as possible. Along the way, a few fields (publication date, language, format) were considered in need of being expressed as actual data, and so the fixed fields were designed to hold those. Oddly enough, though, in most cases the same information was available in the text, meaning that the information had to be entered twice: once as text, and once as data.
008/07-10 = 1984
260 $c = 1984
This fact is proof that at one point the MARC developers were fully aware that the text in the variable fields was ill-suited to machine operations other than printing on a card (or display on a screen).
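The duplication is easy to see in code. A minimal sketch, assuming the MARC 21 layout of the 008 field (Date 1 in character positions 07-10); the record dict here is an invented toy structure for illustration, not a real MARC parser:

```python
import re

# A toy record: the fixed-field 008 and the variable 260 field.
# (Illustrative structure only -- not an actual MARC record format.)
record = {
    "008": "840101s1984    caua          000 0 eng d",
    "260": {"a": "San Francisco :", "b": "Example Press,", "c": "1984."},
}

# Date 1 lives in 008 character positions 07-10 as structured data...
date_from_008 = record["008"][7:11]

# ...while the same date is buried in the free text of 260 $c,
# where a machine has to fish it out with pattern matching.
match = re.search(r"\d{4}", record["260"]["c"])
date_from_260c = match.group(0) if match else None

print(date_from_008)                    # 1984
print(date_from_260c)                   # 1984
print(date_from_008 == date_from_260c)  # True -- entered twice
```

The fixed field hands the date over with a simple slice; the variable field requires guessing at the text, which is exactly the distinction between data and markup.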

I have been working off and on for a number of years on an analysis of MARC that is perhaps similar to Thomale's search for the bibliographic data of MARC. I characterize my project as an attempt to define the data elements of the MARC record. The logic goes like this: if we want to create a new, more flexible format for library data, one way to begin that process is to break MARC data up into its data elements. These can then be re-combined using a new data carrier. The converse is that if we cannot break MARC up into its data elements, then any new carrier will surely be saddled with some of the problematic aspects of MARC, such as:
  • redundancy, especially the repetition of the same content across many different fields
  • inconsistency, where the content in those different fields is coded differently or with a different level of granularity
  • potential contradiction between data in fixed fields and textual data
I am still just at the beginning of my analysis, but for anyone who wants to follow along and comment/cajole/criticize, I am doing my thinking out loud on the futurelib wiki. I thought I would start with the 0XX fields, but decided to drop back and start with 007/008. I have a database of all of the 007/008 elements and their values, (linked in tab-delimited format on this wiki page) so I've been able to sort and eliminate and do other database-y things that help me see what's there.
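For anyone who wants to poke at the tab-delimited file themselves, this kind of sorting and eliminating needs nothing more than the standard library. A sketch with a few made-up rows; the column layout (field, material category, position, element name, value code, value label) is my guess at a plausible schema, not the wiki file's actual one:

```python
import csv
import io

# Hypothetical rows in the spirit of the 007/008 element list.
# Column layout is an assumed schema, not the wiki file's actual format.
tsv = io.StringIO(
    "007\tmicroform\t05\tColor\tf\tColor\n"
    "007\tmotion picture\t03\tColor\tc\tColor\n"
    "008\tbooks\t28\tGovernment publication\tf\tFederal/national\n"
    "008\tmaps\t28\tGovernment publication\tf\tFederal/national\n"
)

rows = list(csv.reader(tsv, delimiter="\t"))

# Sort by element name, then field and position -- the kind of
# "database-y" view that groups like-named elements together.
rows.sort(key=lambda r: (r[3], r[0], r[2]))

# Distinct element names are far fewer than field/position slots.
names = {r[3] for r in rows}
print(sorted(names))  # ['Color', 'Government publication']
```

Four slots in the format, but only two element names: the interesting question, taken up below, is when same-named slots are really the same element.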

I'm not interested in replicating MARC, so I do not want to create something that is one-to-one with MARC fields and subfields. As an example, some fixed field data elements and their values appear more than once in the MARC format, such as the 008 "Government publication" element which is identical in the 008 for books, computer files, maps, continuing resources and visual materials. As far as I'm concerned it is a single data element. On the other hand, an element named "Color" appears in more than one 007 field, but in each case the values that are valid for the data element are different. These then are different data elements.
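The merge/split rule here can be stated precisely: two occurrences are the same data element exactly when they share both a name and the same set of valid values. A sketch of that test (the value sets below are illustrative subsets, not the complete MARC lists):

```python
from collections import defaultdict

# (element name, frozenset of valid value codes, MARC location).
# Value sets are illustrative, not the complete MARC 21 lists.
occurrences = [
    ("Government publication", frozenset("acfilmosuz "), "008/28 books"),
    ("Government publication", frozenset("acfilmosuz "), "008/28 maps"),
    ("Color", frozenset("abcmuz"), "007/03 motion picture"),
    ("Color", frozenset("abchmuz"), "007/09 projected graphic"),
]

# Key on (name, value set): identical pairs collapse into one element.
elements = defaultdict(list)
for name, values, location in occurrences:
    elements[(name, values)].append(location)

for (name, values), locations in elements.items():
    print(name, "->", locations)
# "Government publication" collapses to a single data element with two
# MARC locations; the two "Color" occurrences remain distinct elements
# because their sets of valid values differ.
```

This gives three data elements from four field/position slots, which is exactly the reduction the analysis is after.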

I am struggling with how to create usable output from my investigations. I may code some things in the Open Metadata Registry, but at the moment that would have to be done by hand and I need something more automated. I would like to represent the controlled lists in the fixed fields in an RDF-compatible way using SKOS. This should be relatively simple once certain decisions are made (naming, URIs, etc.).
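Once the naming and URI decisions are made, generating the SKOS itself is mechanical. A sketch that emits Turtle for one fixed-field value list; the namespace URI is an invented placeholder, not a decided one, and the value labels are abbreviated:

```python
# Hypothetical namespace -- a placeholder, not a decided URI.
NS = "http://example.org/marc/"

def skos_scheme(scheme_id, label, values):
    """Emit one fixed-field value list as a SKOS concept scheme in Turtle."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{NS}{scheme_id}> a skos:ConceptScheme ;",
        f'    skos:prefLabel "{label}"@en .',
    ]
    for code, value_label in values.items():
        lines += [
            "",
            f"<{NS}{scheme_id}{code}> a skos:Concept ;",
            f"    skos:inScheme <{NS}{scheme_id}> ;",
            f'    skos:notation "{code}" ;',
            f'    skos:prefLabel "{value_label}"@en .',
        ]
    return "\n".join(lines)

# 007 microform, position 09 ("Generation"), a few of its values.
print(skos_scheme("007microform09", "Generation",
                  {"a": "First generation (master)",
                   "b": "Printing master",
                   "c": "Service copy"}))
```

Each value code becomes a skos:Concept within the scheme for its data element, which is the kind of output that could later be loaded into the Open Metadata Registry rather than keyed in by hand.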

A big question is how to link all of this back to MARC. For the fixed fields it's relatively easy to create a string that represents the MARC origins of the data, for example:
  • 007microform05 to represent the data element (field 007, category of material Microform, position 05)
  • 007microform05f to represent the actual value (field 007, category of material Microform, position 05, value=f)
When it comes to the variable fields this is going to be more difficult because, as Thomale points out in his article, a logical element may span more than one field/subfield, and there may also be multiple elements in a single subfield. Working that out is going to be very, very difficult. So it seems best to go for the low-hanging fruit of the fixed fields.
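The fixed-field naming convention above is simple enough to generate mechanically. A sketch of it as described (the function name is mine, for illustration):

```python
def marc_origin(field, category, position, value=None):
    """Build the MARC-origin string: field tag + material category +
    zero-padded position, with the value code appended when naming
    a specific value rather than the data element itself."""
    fragment = f"{field}{category.lower()}{position:02d}"
    return fragment + value if value else fragment

print(marc_origin("007", "Microform", 5))       # 007microform05
print(marc_origin("007", "Microform", 5, "f"))  # 007microform05f
```

Nothing comparable works for the variable fields, for the reasons given above: there is no fixed field/position pair to encode when one logical element spans several subfields or several elements share one.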

Note that there have been other good starts at defining the MARC fixed fields in SKOS, and eventually we may be able to bring this all together. Meanwhile, I did grab for the URI portion of this work and obviously am working toward dereferenceable URIs.