Monday, July 23, 2012

Futures and Options

No, I'm not talking about the stock market, but about the options that we have for moving beyond the MARC format for library data. You undoubtedly know that the Library of Congress has its Bibliographic Framework Transition Initiative that will consider these options. In an ALA Webinar last week I proposed my own set of options -- undoubtedly not as well-studied as LC's will be, but I offer them as one person's ideas.

It helps to remember the three database scenarios of RDA. These show a progressive view of moving from the flat record format of MARC to a relational database. The three RDA scenarios (which should be read from the bottom up) are:

  1. Relational database model -- In this model, data is stored as separate entities, presumably following the entities defined in FRBR. Each entity has a defined set of data elements and the bibliographic description is spread across these entities which are then linked together using FRBR-like relationships.
  2. Linked authority files -- The database has bibliographic records and has authority records, and there are machine-actionable links between them. These links should allow certain strings, like name headings, to be stored only once, and should reflect changes to the authority file in the related bibliographic records.
  3. Flat file model -- The database has bibliographic records and it has authority records, but there is no machine-actionable linking between the two. This is the design used by some library systems, but it is also a description of the situation that existed with the card catalog.

These move from #3, the least desirable, to #1, the intended format of RDA data. I imagine that the JSC may not precisely subscribe to these descriptions today because, in the few years since the document was created, the technology environment has changed and linked data now appears to be the goal. The models are still interesting in the way that they show a progression.
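
To make the progression concrete, here is a minimal sketch of the same description held under each of the three scenarios. It is only an illustration: the dictionary layouts, the keys ("a1", "w1", "m1") and the function names are assumptions of mine, not anything defined by RDA or the JSC.

```python
# Scenario 3 (flat file): the name heading is a text string repeated in every record.
flat_bibs = [
    {"title": "Moby Dick", "author": "Melville, Herman, 1819-1891"},
    {"title": "Typee", "author": "Melville, Herman, 1819-1891"},
]

# Scenario 2 (linked authority file): the heading is stored once and
# bibliographic records point to it by a key ("a1" is a made-up key).
authorities = {"a1": "Melville, Herman, 1819-1891"}
linked_bibs = [
    {"title": "Moby Dick", "author_id": "a1"},
    {"title": "Typee", "author_id": "a1"},
]

# Scenario 1 (relational): the description is spread across separate
# FRBR-like entities that are joined by relationships (hypothetical layout).
works = {"w1": {"title": "Moby Dick", "creator_id": "a1"}}
manifestations = {
    "m1": {"work_id": "w1", "publisher": "Harper & Brothers", "year": 1851}
}


def linked_headings():
    """Resolve each linked record's author key to the current heading."""
    return [authorities[bib["author_id"]] for bib in linked_bibs]


def describe(manifestation_id):
    """Assemble a display string by following the links between entities."""
    m = manifestations[manifestation_id]
    w = works[m["work_id"]]
    name = authorities[w["creator_id"]]
    return f'{name}. {w["title"]}. {m["publisher"]}, {m["year"]}.'


# One change to the authority file shows up in every linked record,
# while the flat records keep the old string until each one is edited.
authorities["a1"] = "Melville, Herman"
print(linked_headings())
print(flat_bibs[0]["author"])
print(describe("m1"))
```

The point of the sketch is only that in scenarios 2 and 1 the heading lives in one place, so a correction there reaches every description that links to it.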

I also have in mind something of a progression, or at least a set of three options that move from least to most desirable. To fully explain each of these in sufficient detail will require a significant document, and I will attempt to write up such an explanation for the Futurelib wiki site. Meanwhile, here are the options that I see, with their advantages and disadvantages. The order, in this case, is from what I see as least desirable (#3, in keeping with the RDA numbering) to most desirable (#1).

#3 Serialization of MARC in RDF

Advantages

  • mechanical - requires no change to the data
  • would be round-trippable, similar to MARCXML
  • requires no system changes, since it would just be an export format

Disadvantages

  • does not change the data at all - all of the data remains as text strings, which do not link
  • keeps library data in a library-only silo
  • library data will not link to any non-library sources, and even linking to library sources will be limited because of the profusion of text strings
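
As a rough sketch of what option #3 amounts to, the code below turns a simplified MARC-like record into subject/predicate/object triples and back again. The namespace under example.org, the record layout, and the function names are all hypothetical, and real MARC details such as indicators and the leader are ignored.

```python
# Hypothetical namespace for MARC tag/subfield predicates; not a real vocabulary.
BASE = "http://example.org/marc/"
RECORD_BASE = "http://example.org/record/"

# A simplified record: (tag, subfield code, value). Indicators are left out.
record = {
    "id": "rec1",
    "fields": [
        ("100", "a", "Melville, Herman,"),
        ("100", "d", "1819-1891."),
        ("245", "a", "Moby Dick."),
    ],
}


def marc_to_triples(rec):
    """Mechanically emit one triple per subfield; the data is unchanged."""
    subject = RECORD_BASE + rec["id"]
    return [(subject, BASE + tag + code, value) for tag, code, value in rec["fields"]]


def triples_to_marc(triples):
    """Rebuild the record from the triples (relies on the list order)."""
    rec_id = triples[0][0][len(RECORD_BASE):]
    fields = []
    for _, predicate, value in triples:
        local = predicate[len(BASE):]           # e.g. "100a"
        fields.append((local[:3], local[3:], value))
    return {"id": rec_id, "fields": fields}


triples = marc_to_triples(record)
for t in triples:
    print(t)

# The conversion round-trips, but every object is still a text string.
print(triples_to_marc(triples) == record)
print(all(not obj.startswith("http") for _, _, obj in triples))
```

The serialization is lossless in this toy case, which is the advantage, but every value is still a literal, so nothing in it can link to anything else.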

#2 Extraction of linked data from MARC records

Advantages

  • does not require major library system changes because it extracts data from current MARC format
  • some things (e.g. "persons") can be given linkable identifiers that will link to other Web resources
  • the linked data can be re-extracted as we learn more, so we don't have to get it right or complete the first time
  • does not change the work of catalogers

Disadvantages

  • probably not round-trippable with MARC
  • the linked data is entirely created by programs and algorithms, so it doesn't get any human quality control (think: union catalog de-duping algorithms)
  • capabilities of the extracted data are limited by what we have in records today, similar to the limitations of attempting to create RDA in MARC
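
A minimal sketch of option #2 might look like the following. The lookup table, the example.org URIs, and the function names are assumptions for illustration only; a real extraction would match headings against an actual authority file. The two Dublin Core Terms property URIs are real, though using a plain string as the creator when no identifier is found is a simplification.

```python
# Hypothetical table mapping normalized name headings to identifiers.
KNOWN_PERSONS = {
    "melville, herman, 1819-1891": "http://example.org/person/1",
}

DCT_TITLE = "http://purl.org/dc/terms/title"
DCT_CREATOR = "http://purl.org/dc/terms/creator"


def normalize(heading):
    """Lower-case, trim trailing punctuation, and collapse whitespace."""
    return " ".join(heading.lower().rstrip(" .,").split())


def extract(rec_id, title, author_heading):
    """Produce triples for one record, linking the author when a match exists."""
    subject = "http://example.org/resource/" + rec_id
    uri = KNOWN_PERSONS.get(normalize(author_heading))
    # Fall back to the original text string when no identifier is found.
    creator = uri if uri is not None else author_heading
    return [
        (subject, DCT_TITLE, title),
        (subject, DCT_CREATOR, creator),
    ]


print(extract("rec1", "Moby Dick", "Melville, Herman, 1819-1891."))
print(extract("rec2", "An Untraced Pamphlet", "Unknown, Author"))
```

Because the MARC records stay untouched, the extraction can be rerun with a better table or better matching later. The decision to link is made entirely by the algorithm, which is the quality-control concern noted above.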

#1 Linked data "all the way down", i.e. working in linked data natively

Advantages

  • gives us the greatest amount of interoperability with web resources and the most integration with the information in that space
  • allows us to implement the intent of RDA
  • allows us to create interesting relationships between resources and possibly serve users better

Disadvantages

  • requires new library systems
  • will probably instigate changes in cataloging practice
  • presumably entails significant costs, but we have little ability to develop a cost/benefit analysis

There is a lot behind these three options that isn't explained here, and I am also interested in hearing other options that you see. I don't think that our options are only three -- there could be many points between them -- but this is intended to be succinct.

To conclude, I don't see much, if any, value in my option #3; #2 is already being done by the British Library, OCLC, and the National Library of Spain; I have no idea how far in our future #1 is, nor even if we'll get there before the next major technology change. If we can't get there in practice, we should at least explore it in theory because I believe that only #1 will give us a taste of a truly new bibliographic data model.

2 comments:

Melissa Powell said...

I would love to see what folks like Skyriver are thinking and planning in regards to this. I would also love to hear what the ILS folks are thinking about with their research pipelines knowing that this is coming. As you said, it means new products and if they are smart they are already on it.

Melissa Powell said...

I wonder what folks like Skyriver are doing to prepare? I would also be very interested in what the ILSs have in their research pipelines in regard to the changes. Considering many have not even kept up with RDA, it is worrisome. A smart company would already be on it. It only makes smart business sense that they make the change when we do.

This may be where the open source companies have an advantage. We are already able to work with them on creating platforms that accept the RDA changes.

A very exciting time to be a cataloger. :)