Tuesday, November 01, 2011

Future Format: Goals and Measures

The LC report on the future bibliographic format (aka replacement for MARC) is out. The report is short and has few specifics, other than the selection of RDF as the underlying data format. A significant part of the report lists requirements; these, too, are general in nature and may not be comprehensive.

What needs to be done before we go much further is to begin to state our specific goals and the criteria we will use to determine whether we have met those goals. Some goals we will discover in the course of developing the new environment, so this should be considered a growing list. I think it is important that every goal have measurements associated with it, to the extent possible. It makes no sense to make changes if we cannot know what those changes have achieved. Here are some examples of the kinds of things I am thinking of in terms of goals; these may not be the actual goals of the project; they are just illustrations that I have invented.

COSTS
 - goal: it should be less expensive to create the bibliographic data during the cataloging process
   measurement: using time studies, compare cataloging in MARC and in the new format
 - goal: it should be less expensive to maintain the format
   measurement: compare the total time required for a typical MARBI proposal to the time required for an equivalent change to the new format
 - goal: it should be less expensive for vendors to make required changes or additions
   measurement: compare the number of programmer hours needed to make a change in the MARC environment and the new environment

COLLABORATION
 - goal: collaboration on data creation with a wider group of communities
   measurement: count the number of non-library communities that we are sharing data with before and after
 - goal: greater participation of small libraries in shared data
   measurement: count number of libraries that were sharing before and after the change
 - goal: make library data available for use by other information communities
   measurement: count use of library data in non-library web environments before and after

INNOVATION
 - goal: library technology staff should be able to implement "apps" for their libraries faster and more easily than they can today.
   measurement: either the number of apps created, or a measure of implementation time (this one may be hard to compare)
 - goal: library systems vendors can develop new services more quickly and more cheaply than before
   measurement: number of changes made in the course of a year, or number of staff dedicated to those changes. Another measurement would be what libraries are charged and how many libraries make the change within some stated time frame

As you can tell from this list, most of the measurements require system implementation, not just the development of a new format. But the new format cannot be an end in itself; the goal has to be the implementation of systems and services using that format. The first MARC format that was developed was tested in the LC workflow to see if it met the needs of the Library. This required the creation of a system (called the "MARC Pilot Project") and a test period of one year. The testing that took place for RDA is probably comparable and could serve as a model. Some of the measurements will not be available before full implementation, such as the inclusion of more small libraries. Continued measurement will be needed.

So, now, what are the goals that YOU especially care about?

7 comments:

Dorothea said...

Time studies? Really? Using catalogers experienced in MARC?

I can think of quite a few reasons not to trust the results there... malingering (sad to say) hardly least, but relative experience levels (and the necessity of unlearning MARC habits) are a tremendous factor as well.

I hope at least the experimental design will be transparent.

Karen Coyle said...

Dorothea, from what I understand, studies of that nature were part of the RDA test. I don't think anyone is going to be deriving stats accurate to ±3 percentage points from them, but they were enough to conclude that cataloging in RDA was not taking more time than cataloging in AACR. From the test design document:

"For each record, each tester (individual test partners and staff members of test partner institutions) will also complete an online questionnaire that elicits information on the amount of time required to produce the records ..."

http://www.loc.gov/bibliographic-future/rda/testing.html

I agree about having the experimental design be transparent.

Michael K said...

FRBR needs to be a consideration in a next-generation bibliographic format. The entity/relationship model in FRBR has many strengths over current practice. But Z39.2 is ill-suited as a carrier of such data, as it mixes parts of the WEMI structure together. Ideally, one could include the entire WEMI universe in a single record (split into distinct portions), or use separate records for each entity of WEMI, and the logic on the back end would be able to stitch things together or break them apart as necessary.

If implemented properly, it means that the next time a library has to add yet another new edition of The Adventures of Huckleberry Finn, the cataloger only has to bring up the existing Work and add a new Expression to it. By doing some intelligent auto-population of fields, this could be a very speedy process.
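To make that concrete, here is a minimal sketch in Python of WEMI entities stored as separate records that link by identifier; the identifiers and field names are invented for illustration and do not come from any actual format proposal.

    # Hypothetical WEMI entities kept as separate records that point to one
    # another by identifier, instead of being mixed together in one Z39.2 record.
    work = {
        "id": "work/huck-finn",  # invented identifier
        "type": "Work",
        "title": "The Adventures of Huckleberry Finn",
        "creator": "Twain, Mark",
    }

    def add_expression(work_id, language, year):
        """Create a new Expression record that links back to an existing Work."""
        return {
            "id": "expression/%s-%s-%s" % (work_id, language, year),
            "type": "Expression",
            "realizes": work_id,  # the link that lets back-end logic stitch WEMI together
            "language": language,
            "date": year,
        }

    # The cataloger pulls up the existing Work and adds only the new Expression;
    # the Work-level data (title, creator) is never re-keyed.
    new_edition = add_expression(work["id"], "eng", 2011)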

Unknown said...

I wasn't sure whether to post here or BIBFRAME, so here goes:

System implementation is too much of a wildcard to get meaningful results from testing. The amount of time required to catalog across ILSes is already divergent, and thoughtful UI design (not a strong suit of library vendors) could reduce that time in MARC-space [much] further. Altogether, efficiency of UI and other factors don't have a lot to do with interchange formats, but we librarians think that way because MARC is so stupidly both our cataloging UI and our interchange format. With interfaces that abandon the MARC-as-an-entry-form paradigm (a change unrelated to whether there's a new interchange framework), time studies would measure the implementations' effectiveness rather than the framework's (the total amount of metadata entered being equal).

It sounds from the document as though a non-MARC carrier is mostly meant to accommodate linked data [URIs]. The format would be a carrier for RDA with FRBR and all the other fun stuff. As such, the schema of bibliographic control is already defined. We can build databases and implementations and output MARC21 for the time being, taking full advantage of a new interchange format when it's closer to finalized. Costs of cataloging or maintenance shouldn't depend on the outcome of this new LC document at all.

It's a given that collaboration will increase if free linked data is available and readily usable. While it will be interesting to see how it pans out and measure its spread, I fail to see what such a goal accomplishes. It's not the number of groups (or their size necessarily) sharing data that makes the framework a success; it's the time saved by the reduction in redundancy that collaboration affords.

Additionally, framework-wise, RDF isn't necessarily a great choice. The model diverges from the syntax of XML and JSON as used most everywhere else, such that bibliographic data would be transported in a way that core language libraries wouldn't be optimized to handle.
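To illustrate the divergence (a sketch only, with invented field names), here is the same tiny description as a tree-shaped JSON document and as a flat set of RDF-style triples:

    import json

    # Tree-shaped JSON document: everything nests under one root, which is what
    # most core language libraries and document stores are tuned to handle.
    as_document = {
        "title": "The Adventures of Huckleberry Finn",
        "creator": {"name": "Twain, Mark", "birth": 1835},
        "subjects": ["Mississippi River", "Runaway children"],
    }
    print(json.dumps(as_document, indent=2))

    # Graph-shaped RDF-style data: a flat set of (subject, predicate, object)
    # statements with no inherent root or nesting; any tree has to be rebuilt
    # by whoever consumes the data.
    as_triples = [
        ("book1", "title", "The Adventures of Huckleberry Finn"),
        ("book1", "creator", "person1"),
        ("person1", "name", "Twain, Mark"),
        ("person1", "birth", 1835),
        ("book1", "subject", "Mississippi River"),
        ("book1", "subject", "Runaway children"),
    ]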

If innovation and ease of implementation are truly goals of the new framework, a model that fits less-hijacked XML or especially JSON is a better choice. Developers are migrating to JavaScript environments like node.js in droves, and JSON-document-based NoSQL storage has become de rigueur in the cloud. Client-side implementations can natively ingest JSON, even across domains via JSONP. And as far as I know, it's programmatically more efficient to ingest JSON than XML in most languages.

According to the LC plan, a goal of this framework is backward-compatibility with RDA over MARC21. So if the plan is to transport MARC 1:1 into an RDF-based model, it's basically just guaranteeing bloat. What the library community needs, future-planning-wise, is a roadmap with sensible versioning; versions that aren't necessarily backward-compatible. Making this-or-that record compatible should probably be an ILS issue rather than a standards issue in the first place.

Karen Coyle said...

Brad - Lots of good stuff in your comment. I totally agree that if we force backward compatibility we will never go forward. We just have to cross the bridge at some point, and I say sooner is better than later, since it's already pretty late.

As for XML and JSON... There already is RDF in XML that appears to work like any ol' XML. I've been working on a project with folks who are doing bib data in JSON and it seems that handling namespaces in JSON is either awkward or just not well understood. I think that namespaces are essential if our data is going to travel outside of its cloistered environment and be used more widely. This isn't an argument against JSON, but for making sure that it can handle namespaces easily.
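For example (the prefixes and keys here are invented, not taken from the project I mentioned), the common workaround is to carry a prefix map alongside the data, which works but leaves every consumer to do the expansion itself:

    import json

    # Namespace handling bolted onto plain JSON: a prefix map travels with the
    # record, and keys use the prefixes. Every consumer must expand them itself.
    record = {
        "prefixes": {
            "dc": "http://purl.org/dc/terms/",
            "foaf": "http://xmlns.com/foaf/0.1/",
        },
        "dc:title": "The Adventures of Huckleberry Finn",
        "foaf:maker": "Twain, Mark",
    }

    def expand(rec):
        """Replace prefixed keys with full URIs so the data can travel outside its home system."""
        prefixes = rec.get("prefixes", {})
        expanded = {}
        for key, value in rec.items():
            if key == "prefixes":
                continue
            prefix, _, local = key.partition(":")
            if local and prefix in prefixes:
                key = prefixes[prefix] + local
            expanded[key] = value
        return expanded

    print(json.dumps(expand(record), indent=2))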

That said, I think most of us agree that we aren't aiming at a single serialization but something flexible enough to be used in at least the most common serializations.

I also agree that testing in areas where UIs make the difference is difficult. Perhaps those are not tests that we can rely on. However, if there is *nothing* we can test then I don't see how we can be confident of our outcomes. I'm uneasy about the idea of creating a data carrier based on purely theoretical assumptions. In fact, RDF seems to suffer from this -- seems like a great idea, but implementation must not be easy because so little of it has taken place. (Ditto FRBR, by the way.)

I'm not sure if you and I are saying the same thing about collaboration. I agree that in part the measurement is time saving. I also think that there are quality considerations, as well as considerations for linking. The way I see it, today we create data only for libraries. There is some re-use through bibliographic apps like Endnote, Zotero, etc., but there are a lot of people independently creating bibliographic data at some expense to themselves (e.g. citations and bibliographies in texts). Since many libraries are funded as public entities, they increase their value if the public gets more "bang" out of the data libraries produce. This becomes important in an environment where many jurisdictions are looking to libraries as a way to cut costs. The more libraries can show that there are people relying on their services AND on their data, the better they can defend their existence.

Unknown said...

It seems like the developing standard has more to do with copy cataloging than original cataloging, since original records would be created under the existing RDA rules. Even if that weren't the case, there's a necessary separation of standard from implementation. There are tons of absolutely sprawling standards, with examples of both simple and complex implementations. There are ILSes that make copy cataloging easy, and others that make it difficult; that fact doesn't really reflect on MARC and Z39.50 as standards at all.

Collaboration-wise, the ability of catalogs to ingest and export records, and the vendor/library's willingness to publicly expose their data are also separate issues. MARC as a format can already be parsed, re-serialized, or used directly. Whether that happens is contingent not on the format but on a decision to publicize the data. I don't imagine the standard will say you have to freely expose your records via an API without restriction, so collaboration via such a means is a happy side effect rather than a strict goal.
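For what it's worth, the parsing side is already routine. A sketch using the existing pymarc library (the file names are invented) shows how little code it takes, which is why the real question is whether the file gets published at all:

    from pymarc import MARCReader

    # Read binary MARC (ISO 2709), print a title field, and write the records
    # back out unchanged. The code is trivial; getting "records.mrc" exposed
    # in the first place is the policy decision.
    with open("records.mrc", "rb") as infile, open("copy.mrc", "wb") as outfile:
        for record in MARCReader(infile):
            for field in record.get_fields("245"):
                print(" ".join(field.get_subfields("a", "b")))
            outfile.write(record.as_marc())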

Karen Coyle said...

Brad, check out the recent discussion on the LLD list (starts October, goes into November, probably will continue) called "disjointedness of FRBR classes." It's a good example of how your model can make data sharing difficult.

Also, I read your post on RDF/XML, and you are right that RDF and XML are not good partners. The main thing is that RDF doesn't produce a tree structure, and that's what XML and JSON model. So fitting RDF into either is, in a way, a square-peg-in-a-round-hole exercise. However, I think we can produce linked data without RDF -- and I think that linked data is a goal while RDF is a particular method. Linking mainly requires identifiers and a certain granularity. RDF may be overkill for what we want to do today.
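A minimal sketch of what I mean, assuming nothing more than stable identifiers at a useful granularity (all of the URIs and keys below are invented for illustration):

    import json

    # Plain JSON records that link to each other by identifier: no RDF machinery,
    # just stable URIs at the granularity we care about (person, work, topic).
    author = {
        "id": "http://example.org/person/mark-twain",
        "name": "Twain, Mark",
    }

    book = {
        "id": "http://example.org/work/huck-finn",
        "title": "The Adventures of Huckleberry Finn",
        "creator": author["id"],  # linking is just citing the other record's identifier
        "subjects": ["http://example.org/topic/mississippi-river"],
    }

    print(json.dumps(book, indent=2))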