Wednesday, January 11, 2012

Bibliographic Framework: RDF and Linked Data

With the newly developed enthusiasm for RDF as the basis for library bibliographic data we are seeing a number of efforts to transform library data into this modern, web-friendly format. This is a positive development in many ways, but we need to be careful to make this transition cleanly without bringing along baggage from our past.

Recent efforts have focused on translating library record formats into RDF with the result that we now have:
    ISBD in RDF
    FRBR in RDF
    RDA in RDF

and will soon have
    MODS in RDF

In addition there are various applications that convert MARC21 to RDF, although none is "official." That is, none has been endorsed by an appropriate standards body.

Each of these efforts takes a single library standard and, using RDF as its underlying technology, creates a full metadata schema that defines each element of the standard in RDF. The result is that we now have a series of RDF silos, each defining data elements as if they belong uniquely to that standard. We have, for example, at least four different declarations of "place of publication": in ISBD, RDA, FRBR and MODS, each with its own URI. There are some differences between them (e.g. RDA separates place of publication, manufacture, production while ISBD does not) but clearly they should descend from a common ancestor:
    RDA: place of publication
    RDA: place of distribution
    RDA: place of manufacture
    FRBRer: has place of publication or distribution
    ISBD: has place of publication, production, distribution

This would be annoying, but not unworkable, if these different instances of "place of publication" could be treated as having some meaning in common such that one could link a FRBRer element to an ISBD element, but they cannot. The reason they cannot is that each of these constrains the elements in a particular way that defines its relationship to a single data context (what we generally think of as a "record structure"). The elements are not independent of that context, and this means that each can only be used within that particular context. This is the antithesis of the linked data concept, where data sets from diverse sources share metadata elements. It is this re-use of elements that creates the "link" in linked data. To achieve this, metadata elements need to be unconstrained by a particular context.
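
One way to picture this kind of constraint is as an RDFS domain declaration. The sketch below is plain Python rather than a real RDF toolkit, and every identifier in it is an invented placeholder, not the published ISBD or FRBRer namespace. It assumes that each schema binds its "place of publication" property to a class of its own, and shows that merely using the property types the described thing into that schema's record context.

```python
# Hypothetical, shortened identifiers: placeholders, not the published vocabularies.
ISBD_PLACE = "isbd:hasPlaceOfPublicationProductionDistribution"
FRBRER_PLACE = "frbrer:hasPlaceOfPublicationOrDistribution"

# Assumed schema declarations: each property is bound (via a domain)
# to a class that belongs to one record structure only.
DOMAINS = {
    ISBD_PLACE: "isbd:Resource",
    FRBRER_PLACE: "frbrer:Manifestation",
}

def infer_types(triples, domains):
    """Apply the RDFS domain rule: if (s, p, o) and p has domain C, then s is a C."""
    types = {}
    for subject, prop, _value in triples:
        if prop in domains:
            types.setdefault(subject, set()).add(domains[prop])
    return types

# The same book described once with each vocabulary.
data = [
    ("ex:book1", ISBD_PLACE, "London"),
    ("ex:book1", FRBRER_PLACE, "London"),
]

inferred = infer_types(data, DOMAINS)
print(sorted(inferred["ex:book1"]))
```

Under these assumptions, borrowing either property brings its whole record context along with it, which is the sense in which the elements are not independent of that context.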

Linking can also be achieved through vertical relationships, similar to "broader" and "narrower" in thesauri. This is less direct, but makes it possible to mix data sets that have differing levels of granularity. In our case, the ISBD "place of publication, production, distribution" could be defined as broader to the three RDA elements that treat those separately. Unfortunately that is not possible because of the way that ISBD and RDA have been defined in RDF. (I'll post more detail about this later for those who want more.)
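
The broader/narrower idea can be sketched as sub-property reasoning. The example is plain Python with made-up identifiers, and it assumes, contrary to the situation just described, that the three RDA elements could be declared narrower than the single ISBD element.

```python
# Hypothetical placeholder identifiers, for illustration only.
ISBD_PLACE = "isbd:placeOfPublicationProductionDistribution"

# Assumed "narrower -> broader" declarations (what rdfs:subPropertyOf expresses).
BROADER = {
    "rda:placeOfPublication": ISBD_PLACE,
    "rda:placeOfDistribution": ISBD_PLACE,
    "rda:placeOfManufacture": ISBD_PLACE,
}

def with_broader(triples, broader):
    """Add, for every statement made with a narrower property,
    the same statement restated with its broader property."""
    result = set(triples)
    for subject, prop, value in triples:
        if prop in broader:
            result.add((subject, broader[prop], value))
    return result

rda_data = [("ex:rdaBook", "rda:placeOfPublication", "Chicago")]
isbd_data = [("ex:isbdBook", ISBD_PLACE, "Chicago")]

merged = with_broader(rda_data + isbd_data, BROADER)

# Both descriptions can now be found through the coarser ISBD element.
found = sorted(s for s, p, v in merged if p == ISBD_PLACE and v == "Chicago")
print(found)
```

The finer RDA statement is kept alongside the derived one, so the data sets mix at the coarser level without losing granularity.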

The result is that we now have a series of RDF silos, expressions of our data in RDF that lack the linking capabilities of linked data because they are bound to specific data structures. Clearly we gain little in terms of linked data by creating mutually incompatible bibliographic views. Not only are these RDF schemes not compatible with each other, none will be linkable to bibliographic data from communities outside of libraries who publish their data on the Web. That means no linking to Amazon, to Wikipedia, to citations within documents.

Given where we are in the development of linked data for libraries, we now have two options:

1) Define 'super-elements' that float above the record formats and that are not bound by the constraints of the RDF-defined records. In this case there would be a general "place of publication" that is super- to all of the "place of publication" elements in the various records, and would be subordinate to a general concept of "place" that is widely used (possibly a property of GeoNames). To implement linking, each record element would be extrapolated to its super-elements.

2) Define our data elements outside of any particular record format first, then use these in the record schemas. In this case there would be only one instance of "place of publication" and it would be used throughout the various bibliographic records whenever an element with that meaning is needed. Those records would be interchangeable as linked data using their component data elements, and would interact with other bibliographic data on the Web using the RDF-defined elements and their relationships.
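
The difference between the two options can be sketched in a few lines of plain Python. All identifiers are hypothetical placeholders, including the format-neutral "place of publication" and the general "place" property; nothing here reflects an existing vocabulary.

```python
# All identifiers below are hypothetical placeholders.
GENERAL_PUB_PLACE = "bib:placeOfPublication"  # assumed format-neutral element
GENERAL_PLACE = "geo:place"                   # assumed widely used "place" property

# Option 1: record-bound elements point up to super-elements.
SUPER = {
    "isbd:placeOfPublication": GENERAL_PUB_PLACE,
    "rda:placeOfPublication": GENERAL_PUB_PLACE,
    GENERAL_PUB_PLACE: GENERAL_PLACE,
}

def super_elements(prop, supers):
    """Follow the chain of super-elements upward from a record element."""
    chain = []
    while prop in supers:
        prop = supers[prop]
        chain.append(prop)
    return chain

def linkable(prop_a, prop_b, supers):
    """Two elements can be linked if they are the same element
    or share at least one super-element."""
    up_a = {prop_a, *super_elements(prop_a, supers)}
    up_b = {prop_b, *super_elements(prop_b, supers)}
    return bool(up_a & up_b)

# Option 1: linking works only after extrapolating upward.
option1 = linkable("isbd:placeOfPublication", "rda:placeOfPublication", SUPER)

# Option 2: every record schema uses the one shared element directly,
# so no hierarchy is needed to match them.
option2 = linkable(GENERAL_PUB_PLACE, GENERAL_PUB_PLACE, {})

print(option1, option2)
```

In this sketch option 1 depends on maintaining the mapping table for every record format, while option 2 needs no mapping at all because the element is defined once and reused.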

My message here is that we need to be creating data, not records, and that we need to create the data first, then build records with it for those applications where records are needed. Those records will operate internally to library systems, while the data has the potential to make connections in linked data space. I would also suggest that we cease creating silo'd RDF record formats, as these will not move us forward. Instead, we should concentrate on discovering and defining the elements of our data, and begin looking outward at all of the data we want to link to in the vast information universe.


_____
* Note on RDA: RDA in RDF includes two "versions" of each data element: one bound to FRBR and one not. The latter has potential for re-use outside of a FRBR environment, and was designed for this purpose by the DCMI/RDA task force. Its relationship to "official" RDA is somewhat unclear at this time, but hopefully the unbound version will gain support as the linked data concept is absorbed into the bibliographic framework.



9 comments:

Paul Burley said...

This touches on a fundamental question I have about RDA: why are there so many provisions for transcription, when so many elements of the "record" could exist as linked data or defined data elements?

Place of publication is a particularly painful example. You bring this out in your posting, but really, why are we transcribing geographic names when they’re readily available in Geonames, or even in LCSH? Similarly, why are we transcribing publisher names? A publisher is just another corporate body; this data could come from LC/NAF / VIAF. On a tangent, but why are we manually transcribing "pages", "leaves", "illustrations", "maps", etc., when these could simply be predefined data elements?

Karen, I lack your expertise, but I often look at the RDA text and wonder if it's just a hindrance towards moving into a linked data world.

Karen Coyle said...

Paul,

I have the same "sinking feeling" about RDA, but keep hoping that we'll work out some of these issues as systems are built to implement RDA. There is a group meeting at ALA that is working on recommendations to make the elements of physical description, like "pages," into data rather than text. Once again, however, this is being done within the cataloging community with no systems or vendor input. That fact just boggles my mind.

I am more and more coming to the conclusion that cataloging as we know it today (including RDA) has to change -- radically. I think that there is no longer a role for most headings, and they actually are detrimental to re-use of library data. RDA has retained way too much of the linear card environment. As you say, it is ill-suited to the Web of data.

Ryan Shaw said...

"the ISBD 'place of publication, production, distribution' could be defined as broader to the three RDA elements that treat those separately. Unfortunately that is not possible because of the way that ISBD and RDA have been defined in RDF."

I'm one of those who would be interested to read your elaboration of this point.

Karen Coyle said...

Ryan, I'll definitely post, but as an FYI I have been keeping a wiki page on some of the discussion around FRBR: Futurelib FRBR page. It's not easy to follow without more of an intro, I admit, so I'll get on that.

Bill said...

I would prefer an approach that focused on the "actions" applied to entities, so for a "book" you would have a property "published" and this node could have properties such as place, organisation, date and so on. This does allow the use of properties defined by others if these prove useful and help standardisation across domains.

Creating properties that combine action and entity is, to me, very restrictive. To a certain extent, in this case, MARC adopted the approach I prefer: the field represents the action, the subfields represent the properties of that action.

Karen Coyle said...

Bill, the British National Bibliography data in linked data format does take this 'event driven' approach, with publication as an event rather than a "thing." You can see it in their data model (PDF).

Kevin M. Randall said...

There are things that specifically describe the manifestation, and those cannot be standardized without compromising the purpose of transcribed elements. The recent move to make the MARC bibliographic 440 field obsolete is a case in point. The series statement in a 4XX field is a description of a specific manifestation. If you replace the statement with the authorized heading, then the series statement is no longer reliable as a description of the manifestation, because the authorized heading can be changed. One single element cannot serve both purposes (representation of the manifestation on one hand, and standardized data on the other). We need to determine which elements are needed for which purposes, and make sure that all of the purposes are covered. In some contexts, for some resources, it is very important that the metadata include the place of publication and name of publisher as found on the resource. There may also be a need to create relationships between the place of publication and name of publisher and the resource, which are not dependent on the exact transcribed form of the names on the resource but on the entities to which those names refer. Can it be that RDA may simply need to define more elements, and make it clear for which purposes each is to be used?

Karen Coyle said...

Kevin,

I agree that if the transcribed elements are considered essential then they should be clearly defined as "transcribed" and entered as text.* They should not be assumed to play multiple roles in the description. Yet it is essential to understand that textual data will not be usable for linking in the linked data sense. If we want items to link on, say, place of publication, then there will also need to be an identifier for the place coming out of a controlled vocabulary. In some circumstances this can be machine-derived from the transcribed place and presented to the cataloger for verification. I say that because this kind of duplication of data does not necessarily mean that the cataloger has to input the data twice, the way it does in most MARC systems today.

* I understand the need for transcription as identification, but I'm not convinced that
1) it is always needed
2) it is always worth the time
3) the same need cannot be satisfied by other means in the future (e.g. digitization of title pages).

Since this is information to be displayed to the user, not used for retrieval, a display of the title page may be as good a surrogate as a transcription into metadata -- and sometimes perhaps even better.

Kevin M. Randall said...

Karen,

Very good points, and that's pretty much why I said "In some contexts, for some resources ..." I think to say that we should always include both forms is just as foolish as to say that we never need both forms. Maybe the latest Stephen King novel doesn't need as much information recorded as does a first printing of a Mark Twain novel (I hope I'll be forgiven in some quarters for using such outdated, irrelevant examples as print resources...). How much to record and code, and how, will always be up to cataloger judgment and institutional policy. It's just important that we have data structures that support those judgments and policies. And I definitely agree with you that we should be looking at alternative methods of identification, such as images.

And I'm really glad you pointed out that seemingly "duplicate" elements need not require duplicate data entry! (I'll avoid stepping onto the soapbox that I occupied last Friday on the bibframe list...)