Friday, August 26, 2011

New bibliographic framework: there is a way

Since my last post undoubtedly left readers with the idea that I have my head in the clouds about the future of bibliographic metadata, I wish to present here some of the reasons why I think this can work. Many of you were probably left thinking: Yeah, right. Get together a committee of a gazillion different folks and decide on a new record format that works for everyone. That, of course, would not be possible. But that's not the task at hand. The task at hand is actually about the opposite of that. Here are a few parameters.

#1 What we need to develop is NOT a record format

The task ahead of us is to define an open set of data elements. Open, in this case, means usable and re-usable in a variety of metadata contexts. What wrapper (read: record format) you put around them does not change their meaning. Your chicken soup can be in a can, in a box, or in a bowl, but it is still chicken soup. That's the model we need for metadata. Content, not carrier. Meaning, not record format. Usable in many different situations.

#2 Everyone doesn't have to agree to use the exact same data elements

We only need to know the meaning of the data elements and what relationships exist between different data elements. For example, we need to know that my author and your composer are both persons and are both creators of the resource being described. That's enough for either of us to use the other's data under some circumstances. It isn't hard to find overlapping bits of meaning between different types of bibliographic metadata.
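To make the author/composer example concrete, here is a minimal sketch in Python. The element names (books:author, music:composer, shared:creator), the "narrower than" table, and the sample records are all made up for illustration; they are not taken from any real vocabulary.

```python
# Hypothetical "narrower than" relationships between data elements.
# Each key is an element; its value is the broader element it refines.
NARROWER_THAN = {
    "books:author": "shared:creator",
    "music:composer": "shared:creator",
}


def is_a(element, target):
    """Follow narrower-than links upward until target is found or the chain ends."""
    seen = set()
    while element is not None and element not in seen:
        if element == target:
            return True
        seen.add(element)
        element = NARROWER_THAN.get(element)
    return False


def creators(record):
    """Return every value whose element is some kind of creator."""
    return [value for element, value in record.items()
            if is_a(element, "shared:creator")]


# Two records from two communities, each using its own element names.
my_record = {"books:author": "A. Writer", "books:title": "An Example Book"}
your_record = {"music:composer": "B. Composer", "music:key": "G minor"}

print(creators(my_record))
print(creators(your_record))
```

Neither community changed its own element names; the only shared knowledge is the small table of relationships.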

Not all metadata elements will overlap between communities. The cartographic community will have some elements that the music library community will never use, and vice versa. That's fine. That's even good. Each specialist community can expand its metadata to the level of detail that it needs in its area. If the music library finds a need to catalog a map, they can "borrow" what they need from the cartographic folks.

Where data elements are equivalent or are functionally similar, data definitions should include this information. Although they are defined differently, you can see the similarities among these data elements:
  • pbcoreTitle = a name given to the media item you are cataloging
  • RDA:titleProper = A word, character, or group of words and/or characters that names a resource or a work contained in it.
  • MARC:245 $a = title of a work
  • dublincore:title = A name given to the resource
All of these are types of titles, and have a similar role in the descriptive cataloging of their respective communities: each names the target resource. These elements therefore can be considered members of a set that could be defined as: data elements that name the target resource. Having this relationship defined makes it possible to use this data in different contexts and even to bring these titles together into a unified display. This is no different to the way we create web pages with content from different sources like Flickr, YouTube, and a favorite music artist's web site, like the image here.

In this "My Favorites" case, the titles come from the Internet Movie Database, a library catalog display, the Billboard music site, and Facebook. It doesn't matter where they came from or what the data element was called at that site; what matters is that we know which part is the "name-of-the-thing" that we want to display here.
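A rough sketch of how such a unified display might pick out the "name-of-the-thing", written in Python. The four element names come from the list above; the set name, the sample records, and their "source" labels are assumptions made up for this illustration.

```python
# Elements that belong to the set "data elements that name the target resource".
NAMES_THE_RESOURCE = {
    "pbcoreTitle",
    "RDA:titleProper",
    "MARC:245 $a",
    "dublincore:title",
}


def display_name(record):
    """Return the first value whose element names the resource, or None."""
    for element, value in record.items():
        if element in NAMES_THE_RESOURCE:
            return value
    return None


# Hypothetical records from three different sources, each with its own title element.
favorites = [
    {"source": "film database", "dublincore:title": "Example Film"},
    {"source": "library catalog", "MARC:245 $a": "Example Book"},
    {"source": "media archive", "pbcoreTitle": "Example Broadcast"},
]

titles = [display_name(record) for record in favorites]
print(titles)
```

The display code never needs to know which community produced a record, only which elements play the naming role.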

#3 You don't have to create all new data elements for your resources if appropriate ones already exist

When data elements are defined within the confines of a record, each community has to create an entire data element schema of their own, even if they would be coding some elements that are also used by other communities. Yet, there is no reason for different communities to each define a data element for an element like the ISBN because one will do. When data elements themselves are fully defined apart from any particular record format you can mix and match, borrowing from others as needed. This not only saves some time in the creation of metadata schemas but it also means that those data elements are 100% compatible across the metadata instances that use them.

In addition, if there are elements that you need only rarely for less common materials in your environment, it may be more economical to borrow data elements created by specialist communities when they are needed, saving your community the effort of defining additional elements under your metadata name space.

To do all of this, we need to agree on a few basic rules.

1) We need to define our data elements in a machine-readable and machine-actionable way, preferably using a widely accepted standard.

This requires a data format for data elements that contains the minimum needed to make use of a defined data element. Generally, this minimum information is:
  • a name (for human readers)
  • an identifier (for machines)
  • a human-readable definition
  • both human and machine-readable definitions of relationships to other elements (e.g. "equivalent to" "narrower than" "opposite of")
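
As a sketch only, the minimum information listed above could be held in a structure like the following Python. The class names, field names, and example.org identifiers are hypothetical; a real implementation would more likely use an existing standard than a home-grown structure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    kind: str    # e.g. "equivalent to", "narrower than", "opposite of"
    target: str  # identifier of the related element


@dataclass
class DataElement:
    name: str        # for human readers
    identifier: str  # for machines
    definition: str  # human-readable definition
    relationships: list = field(default_factory=list)

    def related(self, kind):
        """Return identifiers of elements related to this one in the given way."""
        return [r.target for r in self.relationships if r.kind == kind]


# Hypothetical element definition; the URIs are placeholders.
title = DataElement(
    name="Title",
    identifier="http://example.org/elements/title",
    definition="A name given to the resource",
    relationships=[
        Relationship("equivalent to", "http://example.org/other/titleProper"),
    ],
)

print(title.related("equivalent to"))
```

Note that nothing here says what record the element lives in; the definition stands apart from any carrier.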

2) We must have the willingness and the right to make our decisions open and available online so others can re-use our metadata elements and/or create relationships to them.

3) We also must have a willingness to hold discussions about areas of mutual interest with other metadata creators and with metadata users. That includes the people we think of today as our "users": writers, scholars, researchers, and social network participants. Open communication is the key. Each of us can teach, and each of us can learn from others. We can cooperate on the building of metadata without getting in each other's way. I'm optimistic about this.


Unknown said...

I completely agree with all of this and think it is much more achievable than other initiatives. Looked at in one way, nobody really needs to change anything they do, so long as what I have called the "exchange format" functions to match semantically similar concepts as well as possible.

It doesn't have to be in RDF, which is rather complex, just in an XML format that uses all kinds of XSLs to convert to and from your own format.

The final result would not be perfect but redundancies would become clearer with time and in any case, it would be much easier to find resources than today.

And I think that should be the goal: make things better than searching today. Making things better for next week or in five years' time can be done then, but just make things better than they are now. People will appreciate that.

Karen Coyle said...


The new carrier could indeed be XML, but with some caveats -- XML is, after all, just a way to wrap data. But if the data is DEFINED in XML, then it will most likely be hierarchical in structure, and that is very confining. I think this issue deserves a longer explanation, and I will make it the topic of a future post, with some diagrams that I think will make the concept clear.

Anonymous said...

Have you considered adding a MARC crosswalk to the new Media Annotations Ontology & API that W3C launched earlier this year?
"The intent of this vocabulary is to bridge the different descriptions of media resources, and provide a core set of descriptive properties. This document defines a core set of metadata properties for media resources, along with their mappings to elements from a set of existing metadata formats."

I have always thought that MARC was blatantly lacking from their long list of crosswalks. I keep finding all kinds of digitised library resources described in MARC, since it is the tool of choice for many cataloguers and metadata specialists in libraries.

Moreover, you'll find that the RDF/XML and other syntax questions have been sidestepped by also specifying an API that metadata management software can support natively:

Karen Coyle said...

cumont -

I'm hoping we can move into a "post-crosswalk" world -- and the sooner the better. One of the values of linked data is that we can define relationships between elements in different vocabularies without having to actually copy data from one vocabulary to another. In other words, a property can have more than one definition, depending on your context.

Everyone likes their own metadata schema best -- and there is a good reason for that: metadata schemas are defined around communities and functions, and each community has different needs, and therefore different metadata. A glance at the media ontology reveals that it has little to do with the needs of libraries (Where is azimuth for cartographic resources? Where is key for music?) but may be fine for a CD collection.

Rather than try to force different metadata sets together through crosswalks I would prefer that we develop a technology that allows them to co-exist and interact. It's a big order, maybe on the level of a 'holy grail,' but I think it's worth trying.