Saturday, March 31, 2012

Can libraries change?



The declaration by the Library of Congress that the time has come to make the long-overdue change to a new data format has rocked the library world. A common reaction is: "How can we do that, when we have 1) thousands of library systems that are designed primarily to work with MARC records, 2) no money to pay for a major change, and 3) no clear idea what we should be heading towards?" This fear is increased by the fact that so far there is little public evidence of activity on this project.

A change of this nature is huge. It's not quite Europe converting to the Euro, but within the library world it is a change of the magnitude of converting the Internet from IPv4 to IPv6. We've made other big changes in the past, in particular the change from the card catalog to the OPAC. That effort required us to purchase new systems and to convert the whole of the printed card catalog to the MARC format. Amazingly, it took only about a decade to complete that transition.

However, here is the key difference:  that change was entirely internal to the library community. This next one has an additional complication brought on by the fact that the target environment for the future of library data is the Web. This new framework will need to be integrated with that massively complex environment of networked information. This adds unknowns ("How will Web users interact with library data?"), but it also affords some possibilities that we didn't have with the change to MARC. Mainly, it allows us to make use of existing Web technology and the Web community for help in both designing and implementing the change.

What this means to me is that this is not a "library-only" activity that we are embarking on. Unknown numbers of users and systems will want or need to make use of library data on the Web. At least we hope so. Right now, Web services in need of bibliographic data often point to Amazon. Others rely on "crowd-sourced" solutions like Mendeley or Zotero. What will make library data most useful and usable to the larger community? This isn't a question we should be asking of libraries, but of potential users of the data.

There are other important questions we should be thinking about. How will we test whether the new framework is well designed for system functionality and efficiency? How will we convert from what we have to this new framework? What structures must we put in place to maintain and extend the framework over time?

It seems very unlikely that the Library of Congress can address all of this in the 18-month period that is allotted for this work (of which perhaps 12 months remain) because: a) their focus is understandably primarily on the needs of their organization, and b) this effort, to be successful, must have input from organizations that are external to the library community. That’s a very tall order for a short time span.

None of this should be taken to imply that LC doesn't have smart, skilled staff to work on this -- they do. But if you've ever taken on a large project in your institution, you know that the staff working on the project are also doing much of the day-to-day work that occupied them before the project began. Few are able to dedicate 100% of their time to a new effort. The question therefore becomes: How can a larger community help LC with this project, taking on appropriate task areas?

I have in mind a set of tasks that could be worked on in parallel, by a number of different interested constituencies and with some good coordination. More details are needed, but the big picture is something like this:

[Diagram: four parallel tracks: the Web track, the bibliographic description track, the IT track, and the management track]

The Web track is an obvious one, especially given that the W3C has already shown an interest in facilitating the entry of library data onto the Semantic Web. There is also a growing realization in the library community that we need to begin, fairly quickly, to build on the foundation standards developed by the W3C Semantic Web activity. There appears to be a similar awareness in the Semantic Web community that library data presents interesting challenges. For example, library data has revealed a need for handling authority data that SKOS (Simple Knowledge Organization System) cannot currently meet. Discussions on lists that focus on the Semantic Web make it clear that our early efforts in defining library data in RDF are helping to inform the thinking of the experts in linked data creation.
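To make the SKOS point a bit more concrete, here is a minimal sketch in Python using the rdflib library, with an invented namespace, of a personal name authority forced into SKOS. SKOS gives us a concept with preferred and alternate labels, which covers the display forms of the name but not the richer structure (dates as data, relationships to other entities, cited sources) that library authority records carry; that gap is roughly what the discussions above are about.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

# Invented namespace for illustration; not a real authority file URI.
AUTH = Namespace("http://example.org/authorities/")

g = Graph()
g.bind("skos", SKOS)
g.bind("auth", AUTH)

person = AUTH["twain-mark"]
g.add((person, RDF.type, SKOS.Concept))
# SKOS can carry the authorized and variant forms of the name as labels...
g.add((person, SKOS.prefLabel, Literal("Twain, Mark, 1835-1910", lang="en")))
g.add((person, SKOS.altLabel, Literal("Clemens, Samuel Langhorne, 1835-1910", lang="en")))
# ...but it has no properties for the structured content of an authority
# record, such as dates of birth and death as data, relationships between
# persons, or the sources consulted.

print(g.serialize(format="turtle"))
```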

The bibliographic description track is of course the key one for libraries. This to me is the solid ground of LC, along with its community partners: to determine the semantics of the data that libraries will use to describe their resources and to provide access for users. RDA already does a great deal of this but the task ahead is to make sure that one can express those concepts in a new data format. There should also be an analysis of the cataloging workflow and even of the expected functionality of a cataloging interface. The requirements arising out of this track will inform the work of the Web and IT tracks as they help the Library convert these requirements to implementable structures and applications.
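As a rough illustration of what "expressing those concepts in a new data format" could look like, here is a sketch, again Python with rdflib, of a few descriptive elements stated as RDF properties. The namespaces and property names are invented for the example; the real ones would come from whatever element vocabularies the new framework adopts.

```python
from rdflib import Graph, Literal, Namespace, URIRef

# Invented namespaces standing in for a registered element vocabulary (EL)
# and for the library's own resource identifiers (BIB).
EL = Namespace("http://example.org/elements/")
BIB = Namespace("http://example.org/bib/")

g = Graph()
g.bind("el", EL)

resource = BIB["rec0001"]
g.add((resource, EL.titleProper, Literal("Moby-Dick, or, The whale")))
g.add((resource, EL.statementOfResponsibility, Literal("by Herman Melville")))
g.add((resource, EL.extent, Literal("xxiii, 635 pages")))
# Controlled values become URIs rather than transcribed strings.
g.add((resource, EL.carrierType, URIRef("http://example.org/terms/carrier/volume")))

print(g.serialize(format="turtle"))
```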

The IT track is absolutely essential: How do we assure that we have data structures that work well with the entire gamut of library systems functions, from acquisitions through circulation? One question I have in particular is about the efficiency of a large bibliographic database structured around the FRBR entities. Efficiency means more than efficient creation of the bibliographic data; the data must also be efficient to retrieve and display. The report on the Future of Bibliographic Control recommended testing of FRBR. RDA has served to test many of the FRBR concepts, but as yet there is no proof of concept of a data structure that uses the FRBR entities. This track, as I see it, would involve library systems vendors as well as some computer professionals who work with "big data" and semantic web technologies.
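To make the efficiency question a little more tangible, here is a toy sketch (Python with rdflib; the property names are invented, not any official FRBR vocabulary) of a single book split across the Work, Expression, Manifestation, and Item entities. Even a basic catalog display has to follow the links between those entities, and whether that traversal stays fast across millions of records is exactly the kind of thing a proof of concept would need to show.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# Invented FRBR-style vocabulary and identifiers, just to show the shape of
# data split across the four Group 1 entities.
FRBR = Namespace("http://example.org/frbr/")
EX = Namespace("http://example.org/data/")

g = Graph()
work, expr, manif, item = EX.w1, EX.e1, EX.m1, EX.i1

g.add((work, RDF.type, FRBR.Work))
g.add((work, FRBR.title, Literal("Moby-Dick")))
g.add((expr, RDF.type, FRBR.Expression))
g.add((expr, FRBR.realizationOf, work))
g.add((expr, FRBR.language, Literal("eng")))
g.add((manif, RDF.type, FRBR.Manifestation))
g.add((manif, FRBR.embodimentOf, expr))
g.add((manif, FRBR.publicationDate, Literal("1851")))
g.add((item, RDF.type, FRBR.Item))
g.add((item, FRBR.exemplarOf, manif))
g.add((item, FRBR.callNumber, Literal("PS2384 .M6")))

# Building one display line means walking Item -> Manifestation -> Expression
# -> Work; this traversal is where the retrieval and display questions live.
m = g.value(item, FRBR.exemplarOf)
e = g.value(m, FRBR.embodimentOf)
w = g.value(e, FRBR.realizationOf)
print(g.value(w, FRBR.title), g.value(m, FRBR.publicationDate), g.value(item, FRBR.callNumber))
```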

The management track is very important but in LC's plan it might be relegated to a later phase since it appears to deal mainly with future activities such as maintenance and modification of the standard. This, however, would be a mistake because the standards for maintenance and extension must be in place from the very beginning of the new framework. I would even say that the new framework should be developed from the beginning with a core and extensions. This eliminates the need to have on opening day a standard that is "everything for everybody," and could allow for a phased implementation of the framework. Note that there are some immediate issues in RDA that require a maintenance standard, such as how to handle open-ended controlled lists in a way that would be compatible with Semantic Web standards.
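Here is a small sketch of what an open-ended controlled list might look like under Semantic Web conventions: the list is published as a SKOS concept scheme, and a new term can be added -- even by someone outside the core standard -- simply by minting a new URI in the same scheme. The namespaces and the extension term are invented for illustration, but the pattern is essentially the "core and extensions" arrangement suggested above.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

# Invented namespaces: a "core" list maintained with the standard, and a
# local extension namespace that adds terms without touching the core.
CORE = Namespace("http://example.org/core/carrier/")
LOCAL = Namespace("http://example.org/my-library/carrier/")

g = Graph()
g.bind("skos", SKOS)

scheme = CORE["scheme"]
g.add((scheme, RDF.type, SKOS.ConceptScheme))

# A term defined by the core list.
g.add((CORE.volume, RDF.type, SKOS.Concept))
g.add((CORE.volume, SKOS.prefLabel, Literal("volume", lang="en")))
g.add((CORE.volume, SKOS.inScheme, scheme))

# A locally added term: a new URI in the same scheme, no change to the core.
g.add((LOCAL.audioWalkingTour, RDF.type, SKOS.Concept))
g.add((LOCAL.audioWalkingTour, SKOS.prefLabel, Literal("audio walking tour", lang="en")))
g.add((LOCAL.audioWalkingTour, SKOS.inScheme, scheme))

print(g.serialize(format="turtle"))
```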

A critical part of this is coordination among all of these activities. Such a coordinating role, however, is not unusual in a large IT project where work is spread across groups with intersecting milestones.

It seems to me that a division of this nature (and not necessarily exactly how I have described it here) would relieve LC of some of the work that it is undoubtedly currently considering taking on; it could increase the speed with which the full design could be completed; and in my opinion it has the possibility to produce a higher quality solution than could be achieved by a single organization. Logical participants include NISO (both in its role as the standards body that manages the MARC standard and in its role as a focus for the library technology community), W3C's Semantic Web community, Dublin Core Metadata Initiative (which is working on standards for application profiles in RDF), and IFLA (which now has a Semantic Web interest group). There is also some possible synergy with projects like the Internet Archive, the Digital Public Library of America, schema.org, and the Zotero community. Clearly funding would be needed, and that's also not a simple task.

My concern is that if we don't organize ourselves in this way, come January 2013 we will not be anywhere near having the ability to create bibliographic data in a new framework. RDA will be implemented inadequately in MARC and, as that solution is the path of least resistance, work to create a new framework will slow to a crawl. If we don't step up to this task, for many years to come we will continue to see library data housed in frameworks and silos that are invisible to most information seekers. That would indeed be very unfortunate.

Note: A session planned for ELAG2012, to be led by Lukas Koster, takes a very similar approach, with the intention of delivering ideas to LC for the new framework.