Tuesday, July 07, 2009

Yee on RDF and Bibliographic Data

[This entire thread has been combined on the futurelib wiki. Please add your comments and ideas there.]

I've been thinking for a while about how I could respond to some of the questions in Martha Yee's recent article in Information Technology and Libraries (June 2009 - pp. 55-80). Even the title is a question: "Can Bibliographic Data Be Put Directly onto the Semantic Web?" (Answer: it already is.) Martha is conducting an admirable gedanken experiment about the future of cataloging, creating her own cataloging code and trying to mesh her ideas with concepts coming out of the semantic web community. The article's value is not only in her conclusions but in the questions that she raises. In its unfinished state, Martha's thinking is provocative and just begging for further discussion and development. (Note: I hope Martha is allowed to put her article online, because otherwise access is limited to LITA members.) (Martha's article is available here.)

The difficulty that I am having at the moment is that it appears to me that there are some fundamental misunderstandings in Yee's attempt to grapple with an RDF model for library data. In addition, she is trying to work with FRBR and RDA, both of which have some internal inconsistencies that make a rigorous analysis difficult. (In fact, Yee suggests an improvement to FRBR that I think IFLA should seriously consider, and that is that subject in FRBR should be a relationship, and that the entities in Group 3 should be usable in any relevant situation, not just as subjects. p. 66, #6. After that, maybe they'll consider my similar suggestion regarding the Group 1 entities.)

I'm trying to come up with an idea of how to chunk Yee's questions so that we can have a useful but focused discussion.

I'm going to try to begin this with a few very basic statements that are based on my understanding of the semantic web. I do not consider myself an expert in RDF, but I also suspect that there are few real experts among us. If any of you reading this want to disagree with me, or chime in with your own favorite "RDF basics," please do.

1. RDF is not a record format; it isn't even a data format


Those of us in libraries have always focused on the record -- essentially a complex document that acts as a catalog surrogate for a complex thing, such as a book or a piece of recorded music. RDF says nothing about records. All that RDF says is that there is data that represents things and there are relationships between those things. What is often confusing is that anything can be an RDF thing, so the book, the author, the page, the word on the page -- if you wish, any or all of these could be things in your universe.

Many questions that I see in library discussions of the possible semantic web future are about records and applications: Will it be possible to present data in alphabetical order? What will be displayed? None of these are directly relevant to RDF. Instead, they are questions about the applications that you build out of your data. You can build records and applications using data that has "RDF Nature." These records and applications may look different from the ones we use today, and they may provide some capabilities in terms of linking and connecting data that we don't have today, but if you want your application to do it, it should be possible to do it using data that follows the RDF model. However, if you want to build systems that do exactly what today's library systems do, there isn't much reason to move to semantic web technology.

2. A URI is an identifier; it identifies


There is a lot of angst in the library world about using URI-structured identifiers for things. The concern is mainly that something like "Mark Twain" will be replaced with "http://id.loc.gov/authorities/n79021164" in library data, and that users will be shown a bibliographic record that goes like:
http://id.loc.gov/authorities/n79021164
Adventures of Tom Sawyer
or will have to wait for half an hour for their display because the display form must be retrieved from a server in Vanuatu. This is a misunderstanding about the purpose of using identifiers. A URI is not a substitute for a human-readable display form. It is an identifier. It identifies. Although my medical plan may identify me as p37209372, my doctor still knows me as Karen. The identifier, however, keeps me distinct from the many other Karens in the medical practice. Whether or not your application carries just identifiers in its data, carries an identifier and a preferred display form, or an identifier and some number of different display forms (e.g. in different languages) is up to the application and its needs. The point is that the presence of an identifier does not preclude having human-readable forms in your data record or database.
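As a rough sketch of the "identifier plus display forms" option, here is one way an application might keep both. The dictionary layout, the language keys, and the function name are assumptions made up for illustration, not any system's actual design:

```python
# Hypothetical local store: each identifier maps to display forms keyed by
# language. The URI is the one quoted in the post; the language keys and the
# lookup function are illustrative assumptions.
LABELS = {
    "http://id.loc.gov/authorities/n79021164": {
        "en": "Mark Twain",
        "zh-Latn": "Ma-kʻo Tʻu-wen",
    },
}


def display_form(uri, lang="en"):
    """Return a human-readable label for an identifier.

    Falls back to the English form, and finally to the URI itself,
    so the user never depends on a remote lookup at display time.
    """
    forms = LABELS.get(uri, {})
    return forms.get(lang) or forms.get("en") or uri
```

The identifier stays constant in the data; which label the user sees is purely a display decision.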

So why use identifiers? An identifier gives you precision in the midst of complexity. Author n79021164 may be "Mark Twain" to my users, and "Ma-kʻo Tʻu-wen" to someone else's, but we will know it is the same author if we use the same identifier. And Pluto the planet-like object will have a different identifier from Pluto the animated character because they are different things. It doesn't matter that they have the same name in some languages. The identifier is not intended for human consumption, but is needed because machines are not (yet?) able to cope with the ambiguities of natural language. Using identifiers it becomes possible for machines to process statements like "Herman Melville is the author of Moby Dick" without understanding one word of what that means. If Melville is A123 and Moby Dick is B456 and authorship is represented by x->, then a machine can answer a question like: "what are all of the entities with A123 x->?", which to a human translates to: "What books did Herman Melville write?"
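The Melville question can be sketched in a few lines of Python. This is only a toy: "author_of" stands in for the x-> relationship, and every identifier other than A123 and B456 is invented to give the query something to filter out.

```python
# Statements as (subject, predicate, object) triples.
# A123 = Herman Melville and B456 = Moby Dick, as in the text;
# B789, A999 and B111 are made-up extras for the example.
TRIPLES = [
    ("A123", "author_of", "B456"),
    ("A123", "author_of", "B789"),
    ("A999", "author_of", "B111"),
]


def objects_of(subject, predicate, triples=TRIPLES):
    """Return every object linked from `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# "What are all of the entities with A123 x->?"
print(objects_of("A123", "author_of"))  # ['B456', 'B789']
```

The machine matches identifiers and nothing else; it never needs to know what an author or a book is.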

As we know from our own experience, creating identities is tricky business. As we rely more on identifiers, we need to be aware of how important it is to understand exactly what an identifier identifies. When a library creates an authority record for "Twain, Mark," it may appear to be identifying a person; in fact, it is identifying a "personal author," who can be the same as a person, but could be just one of many names that a natural person writes under, or could be a group of people who write as a single individual. This isn't the same definition of person that would be used by, for example, the IRS or your medical plan. We can also be pretty sure that, barring a miracle, we will not have a situation where everyone agrees on one single identifier or identifier system, so we will need switching systems that translate from one identifier space to another. These may work something like xISBN, where you send in one identifier and you get back one or more identifiers that are considered equivalent (for some definition of "equivalent").
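A switching system of this kind might, at its simplest, look like the sketch below. It is not how xISBN is implemented; the grouping structure and every identifier except the LC URI are invented for illustration.

```python
# Hypothetical equivalence sets: identifiers from different systems that are
# treated as naming the same "personal author". Only the id.loc.gov URI comes
# from the post; the "example:" identifiers are placeholders.
EQUIVALENT = [
    {
        "http://id.loc.gov/authorities/n79021164",
        "example:other-system/0001",
        "example:local/twain",
    },
]


def equivalents(identifier):
    """Send in one identifier, get back the others considered equivalent."""
    for group in EQUIVALENT:
        if identifier in group:
            return sorted(group - {identifier})
    return []  # unknown identifier: nothing to translate to
```

What counts as "equivalent" is decided by whoever builds the groups, which is exactly where the definitional care described above comes in.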

3. The key to functional bibliographic systems is in the data

There is a lot of expressed disappointment about library systems. There is no doubt that the systems have flaws. The bottom line, however, is that a system works with data, and the key to systems functionality is in the data. Library data, although highly controlled, has been primarily designed for display to human readers, and a particular kind of display at that.

One of the great difficulties is with what libraries call "authority control." Certain entities (persons, corporate bodies, subjects) are identified with a particular human-readable string, and a record is created that can contain variant forms of that string and some other strings with relationships to the entity that the record describes. This information is stored separately from the bibliographic records that carry the strings in the context of the description of a resource. Unfortunately, the data in the authority records is not truly designed for machine-processing. It's hard to find simple examples, so I will give a simplistic one:

US (or U.S.)
is an abbreviation for United States. The catalog needs to inform users that they must use United States instead of US, or must allow retrieval under either. The authority control record says:
"US see United States"

United States, of course, appears in a lot of names. You might assume then that every place where you find "United States" you'll find a reference, such that United States. Department of State would have a reference from U.S. Department of State that refers the user from that undesirable form of the name ... but it doesn't. The reference from U.S. to United States is supposed to somehow be generalized to all of the entries that have U.S. in them. Except, of course, for those to which it should not be applied, like US Tumbler Co. or US Telecomm Inc. (but it is applied to US Telephone Association). There's a pattern here, but probably not one that can be discerned by an algorithm and quite possibly not obvious to all humans, either. What it comes down to, however, is that if you want machines to be able to do things with your data, you have to design your data in a way that machines can work with it using their plodding, non-sentient, aggravatingly dumb way of making decisions: "US" is either equal to "United States" or it isn't.
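To make the point concrete: a machine can only apply the "US see United States" reference where the data says explicitly that it applies. In this sketch the flags follow the three examples above, while the data layout and function are assumptions for illustration only.

```python
# Illustrative only: each heading records explicitly whether its "US"
# stands for "United States". The True/False values follow the examples
# in the text; the structure itself is a made-up assumption.
US_MEANS_UNITED_STATES = {
    "US Tumbler Co.": False,
    "US Telecomm Inc.": False,
    "US Telephone Association": True,
}


def search_forms(heading):
    """Return the forms under which a heading should be retrievable."""
    forms = [heading]
    if US_MEANS_UNITED_STATES.get(heading, False) and heading.startswith("US "):
        forms.append("United States " + heading[len("US "):])
    return forms
```

With the decision stored per heading, the program does not have to guess at a pattern that even humans find hard to state.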

Another difficulty arises from the differences between the ideal data and real data. If you have a database in which only half of the records have an entry for the language of the work, providing a search on language guarantees that many records for resources will never be retrieved by those searches even if they should be. We don't want to dumb down our systems to the few data elements that can reliably be expected in all records, but it is hard to provide for missing data. One advantage of having full text is that it probably will be possible to determine the predominant language of the work even if it isn't encoded in the metadata, but when you are working with metadata alone there often isn't much you can do.
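A small sketch shows how the missing data plays out in a language search. The records, field names, and language codes are invented for the example; the only point is that a record with no language entry is silently skipped unless the system tracks it separately.

```python
# Invented sample records; record 2 has no language recorded.
RECORDS = [
    {"id": 1, "title": "Moby Dick", "language": "eng"},
    {"id": 2, "title": "Lolita"},
    {"id": 3, "title": "Der Zauberberg", "language": "ger"},
]


def search_by_language(records, code):
    """Split records into those matching `code` and those with no language at all."""
    matched = [r for r in records if r.get("language") == code]
    unknown = [r for r in records if "language" not in r]
    return matched, unknown


matched, unknown = search_by_language(RECORDS, "eng")
print([r["id"] for r in matched])  # [1]
print([r["id"] for r in unknown])  # [2]
```

Returning the "unknown" set at least lets an application tell the user that some records could not be evaluated, rather than pretending the result is complete.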

A great deal of improvement could be possible with library systems if we would look at the data in terms of system needs. Not in an idealized form, because we'll never have perfect data, but looking at desired functionality and then seeing what could be done to support that functionality in the data. While the cataloging data we have today nicely supports the functionality of the card catalog, we have never made the transition to truly machine-actionable data. There may be some things we decide we cannot do, but I'm thinking that there will be some real "bang for the buck" possibilities that we should seriously consider.

Next... I'll try to get to the questions in Martha's article.

11 comments:

Patrick Danowski said...

Hi Karen,

I also see some problems about mixing up rules and RDF. But RDF can be a data format in the sense that an RDF argument represents a "field". But RDF is not a database format. It is also a way to express formats (ontologies).

About linking: in German catalogs (and not only there) the Authority Files are already linked with their identifiers. But that doesn't mean that it is not possible to also hold a local copy.

There will be always more than one representation of data. The human readable version and a machine readable one. RDF is more about a machine readable one than about the humans.

I wrote a somewhat general comment on Martha's article because I'm not able to access it (as a non-member).

Patrick

Eric Hellman said...

Thanks for this post, Karen. You've made me realize that there are aspects to authority files that I was completely unaware of.

To amplify your comment on RDF is not an x, RDF is a data model, and that model is triples. There is an easy and rough transformation to go from MARC into an RDF model: the triples are (record ID, MARC field/subfield, field value). A single MARC record decomposes into many triples.

At a recent conference, RV Guha from Google commented to me that the Google people would really love to see libraries start to expose their bibliographic data as RDFa (a standard for embedding RDF into HTML). Could be interesting.

Eric

lbjay said...

Hi Karen,

An even better example link for Lolita might be to the LIBRIS catalog's representation here in rdf/xml. Note the use of common bibliographic ontologies, such as dublin core and the BIBO ontology.

Heidi Hoerman said...

Your library may have access as of late. Quite a few ALA journals are creeping into the aggregations. Both Gale and Ebsco now have IT&L in some of their packages. That's how I'm getting to the article.

Karen Coyle said...

Thanks for comments. Eric, you are probably aware of the work the Talis folks have done to translate MARC into triples. I know this has been done at other institutions as well. I am curious as to how such systems handle the more complex fields, like the title with its initial article and subtitle. I need to look into that. It is possible (and I think Patrick hints at this) that we may expose our data to the Web as RDF but use another format internally to be true to our more complex needs. Not that RDF can't do complex, but I'm not sure it's the right tool for the job... still thinking about that.

lbjay, I am not totally convinced by BIBO as a general bibliographic format, as I've said before. It definitely isn't adequate for library data. For example, like other bibliographic formats I've seen (e.g. OpenURL) that deal with journals and journal articles it has issue and number, but libraries use up to 6 different levels of enumeration for journals. Although that may be excessive, one at least needs Part. I'm trying to view the Libris data but it looks like US and Europe are not speaking at the moment. Maybe things are going badly at G8. :-) I'll try again later.

Irvin Flack said...

Karen

Thanks for drawing attention to this paper (and thanks to Heidi for the tip on accessing it). I've only had a chance to skim it so far but I'm worried that the horrible literal/non-literal issue is going to keep dogging all discussions of RDF and cataloguing. Eg the question on p. 64 about publisher's names being expressed as either a literal OR using a URI. The core point, according to my understanding of RDF, is that there is no problem in a non-literal being _represented_ by a string of characters, eg 'Shakespeare, William...'. Resolving this 'words and the world' confusion is the key to many of the cataloguing-RDF problems, as I see it.

-Irvin

Karen Coyle said...

Irvin, in essence, a URI is just a string of characters, right?

Much of library data comes from controlled lists, and the values in those lists are strings. That doesn't make them less controlled (or any less non-literal, in the RDF sense). I think the difficulty is that we have used display forms as identifiers in library data, and when the display needs to change, then so does the identifier. It also means that any community desiring a different display can't share identifiers with us. As long as we fill our records with "Shakespeare, William" we'll have trouble interacting with those who fill theirs with "Wm. Shakespeare" - if we both use that as the identifier for the author.

I'm fine with including both a URI and a display form if that is convenient for systems, but we need a constant identifier that transcends typographical and cultural differences. When people see words, they tend to read them as words, not as non-literal strings. If nothing else, you need to clearly code what your value represents. In library data, we haven't made that distinction.

Irvin said...

Yep, I agree.

I'll now go and read the paper. :-)

Lukas Koster said...

Karen,

Very good points here. I haven't read Yee's article yet (will do so soon), but I do have some remarks.

RDF indeed is not a format of any kind, it is merely a mechanism for describing relationships between things (or objects, items, whatever you want to call them), just like ERD is. I do not agree that it is confusing that anything can be a thing, because that is the way it is.....
RDF can be used to describe any model (data format) you like.
RDF is strongly associated with the "Linked Data" concept (http://www.w3.org/DesignIssues/LinkedData.html), coined by Tim Berners-Lee (although recently there has been a discussion about that, see: http://efoundations.typepad.com/efoundations/2009/07/linked-data-vs-web-of-data-vs-.html). The idea is "data IS relationships". I totally agree with Yee's proposals that "subject" should be a relationship (because it is!), and that FRBR Group 3 entities should be object of any relationship. Personally I think this should also apply to Group 2 entities. I illustrate this in my post Linked Data for Libraries.
I also agree with your doubts about the hierarchical nature of Group 1 relationships. The real world just is not that hierarchical!

Now, RDF, Semantic Web and Linked Data are NOT about storing your (bibliographic) data (although a good normalised data model is always better than something like MARC...), but about exposing your data in a specific way that takes care of all relationships. In this way, RDF is an API, or exchange format, that potentially links everything to everything.
It is also not about presentation (although you can use RDF presentation formats of course). Data obtained through RDF can be presented in any way you like.

And yes, RDF needs URIs as identifiers, for all the reasons you mention. These identifiers are only meant to be able to maintain or create the links between data, not to display to end users. It's just like ISBN or ISSN, or digital author identifiers in VIAF. The nice thing with RDF/Linked Data is that you can have relationships like "same as", which could be used to avoid confusion about which thing is referred to.

karl said...

It would be cool if you could make a summary post, with all the questions, and all links to answers posts.

Karen Coyle said...

Yes, I've been meaning to gather it all up and put it on a wiki. Will do that soon.