I've been thinking for a while about how I could respond to some of the questions in Martha Yee's recent article in Information Technology and Libraries (June 2009 - pp. 55-80). Even the title is a question: "Can Bibliographic Data Be Put Directly onto the Semantic Web?" (Answer: it already is.) Martha is conducting an admirable gedanken experiment about the future of cataloging, creating her own cataloging code and trying to mesh her ideas with concepts coming out of the semantic web community. The article's value is not only in her conclusions but in the questions that she raises. In its unfinished state, Martha's thinking is provocative and just begging for further discussion and development. (Note: I hope Martha is allowed to put her article online, because otherwise access is limited to LITA members.) (Martha's article is available here.)
The difficulty that I am having at the moment is that it appears to me that there are some fundamental misunderstandings in Yee's attempt to grapple with an RDF model for library data. In addition, she is trying to work with FRBR and RDA, both of which have some internal inconsistencies that make a rigorous analysis difficult. (In fact, Yee suggests an improvement to FRBR that I think IFLA should seriously consider: that subject in FRBR should be a relationship, and that the entities in Group 3 should be usable in any relevant situation, not just as subjects; see p. 66, #6. After that, maybe they'll consider my similar suggestion regarding the Group 1 entities.)
I'm trying to come up with an idea of how to chunk Yee's questions so that we can have a useful but focused discussion.
I'm going to try to begin this with a few very basic statements that are based on my understanding of the semantic web. I do not consider myself an expert in RDF, but I also suspect that there are few real experts among us. If any of you reading this want to disagree with me, or chime in with your own favorite "RDF basics," please do.
1. RDF is not a record format; it isn't even a data format
Those of us in libraries have always focused on the record -- essentially a complex document that acts as a catalog surrogate for a complex thing, such as a book or a piece of recorded music. RDF says nothing about records. All that RDF says is that there is data that represents things and there are relationships between those things. What is often confusing is that anything can be an RDF thing, so the book, the author, the page, the word on the page -- if you wish, any or all of these could be things in your universe.
Many questions that I see in library discussions of the possible semantic web future are about records and applications: Will it be possible to present data in alphabetical order? What will be displayed? None of these are directly relevant to RDF. Instead, they are questions about the applications that you build out of your data. You can build records and applications using data that has "RDF Nature." These records and applications may look different from the ones we use today, and they may provide some capabilities in terms of linking and connecting data that we don't have today, but if you want your application to do it, it should be possible to do it using data that follows the RDF model. However, if you want to build systems that do exactly what today's library systems do, there isn't much reason to move to semantic web technology.
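To make the split between data and application a little more concrete, here is a minimal Python sketch. It is not RDF syntax, just plain tuples standing in for statements, and every identifier in it (ex:book1, ex:author, and so on) is invented for illustration. The point is only that the "record" and the alphabetical order are things the application builds from the statements.

```python
# Each statement is a (thing, relationship, thing-or-value) triple.
# All identifiers below are made up for this example.
triples = [
    ("ex:book1", "ex:title", "Moby Dick"),
    ("ex:book1", "ex:author", "ex:person1"),
    ("ex:person1", "ex:name", "Melville, Herman"),
    ("ex:book2", "ex:title", "Adventures of Tom Sawyer"),
    ("ex:book2", "ex:author", "ex:person2"),
    ("ex:person2", "ex:name", "Twain, Mark"),
]


def objects(subject, relationship):
    """Return every object stated for this subject and relationship."""
    return [o for s, r, o in triples if s == subject and r == relationship]


def build_record(book):
    """Assemble an application-level 'record' from individual statements."""
    authors = [
        name
        for person in objects(book, "ex:author")
        for name in objects(person, "ex:name")
    ]
    return {"title": objects(book, "ex:title")[0], "authors": authors}


# The application, not the data model, decides on alphabetical order.
books = sorted({s for s, r, o in triples if r == "ex:title"})
records = sorted((build_record(b) for b in books), key=lambda rec: rec["title"])

for rec in records:
    print(rec["title"], "/", "; ".join(rec["authors"]))
```

A different application could walk the same statements starting from the person instead of the book and produce an author listing, without any change to the data.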
2. A URI is an identifier; it identifies
There is a lot of angst in the library world about using URI-structured identifiers for things. The concern is mainly that something like "Mark Twain" will be replaced with "http://id.loc.gov/authorities/n79021164" in library data, and that users will be shown a bibliographic record that goes like:
http://id.loc.gov/authorities/n79021164
Adventures of Tom Sawyer

or will have to wait for half an hour for their display because the display form must be retrieved from a server in Vanuatu. This is a misunderstanding about the purpose of using identifiers. A URI is not a substitute for a human-readable display form. It is an identifier. It identifies. Although my medical plan may identify me as p37209372, my doctor still knows me as Karen. The identifier, however, keeps me distinct from the many other Karens in the medical practice. Whether or not your application carries just identifiers in its data, carries an identifier and a preferred display form, or an identifier and some number of different display forms (e.g. in different languages) is up to the application and its needs. The point is that the presence of an identifier does not preclude having human-readable forms in your data record or database.
So why use identifiers? An identifier gives you precision in the midst of complexity. Author n79021164 may be "Mark Twain" to my users, and "Ma-kʻo Tʻu-wen" to someone else's, but we will know it is the same author if we use the same identifier. And Pluto the planet-like object will have a different identifier from Pluto the animated character because they are different things. It doesn't matter that they have the same name in some languages. The identifier is not intended for human consumption, but is needed because machines are not (yet?) able to cope with the ambiguities of natural language. Using identifiers, it becomes possible for machines to process statements like "Herman Melville is the author of Moby Dick" without understanding one word of what that means. If Melville is A123 and Moby Dick is B456 and authorship is represented by x->, then a machine can answer a question like: "what are all of the entities with A123 x->?", which to a human translates to: "What books did Herman Melville write?"
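The Melville question can be sketched in a few lines of Python. A123, B456 and x-> come from the paragraph above; B789 (standing in for a second Melville title) and the label table are my own additions for illustration, and none of these are real identifiers. Notice that the lookup never touches a human-readable string; the labels come into play only at display time.

```python
# "x->" stands for authorship, as in the text. B789 and the labels
# table are assumptions added for this example.
AUTHOR_OF = "x->"

statements = [
    ("A123", AUTHOR_OF, "B456"),
    ("A123", AUTHOR_OF, "B789"),
]

# Display forms live alongside the identifiers; they do not replace them.
labels = {
    "A123": "Herman Melville",
    "B456": "Moby Dick",
    "B789": "Typee",
}


def objects_of(subject, relationship):
    """Answer 'what are all of the entities with <subject> <relationship>?'"""
    return [o for s, r, o in statements if s == subject and r == relationship]


def display(identifier):
    """Return the display form, falling back to the bare identifier."""
    return labels.get(identifier, identifier)


works = objects_of("A123", AUTHOR_OF)
print(works)                          # identifiers, for the machine
print([display(w) for w in works])    # display forms, for the human
```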
As we know from our own experience, creating identities is tricky business. As we rely more on identifiers, we need to be aware of how important it is to understand exactly what an identifier identifies. When a library creates an authority record for "Twain, Mark," it may appear to be identifying a person; in fact, it is identifying a "personal author," who can be the same as a person, but could be just one of many names that a natural person writes under, or could be a group of people who write as a single individual. This isn't the same definition of person that would be used by, for example, the IRS or your medical plan. We can also be pretty sure that, barring a miracle, we will not have a situation where everyone agrees on one single identifier or identifier system, so we will need switching systems that translate from one identifier space to another. These may work something like xISBN, where you send in one identifier and you get back one or more identifiers that are considered equivalent (for some definition of "equivalent").
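Here is a rough sketch of what such a switching system might look like on the inside, assuming the simplest possible design: sets of identifiers that someone has declared equivalent. This is not how xISBN itself is implemented, and the identifier spaces and values (libauth:, bookstore:, publisher:) are entirely invented.

```python
# Each set groups identifiers that are considered equivalent, for
# whatever definition of "equivalent" the service has adopted.
# All identifier spaces and values here are hypothetical.
equivalence_sets = [
    {"libauth:0001", "bookstore:mtwain", "publisher:au-552"},
    {"libauth:0002", "bookstore:hmelville"},
]


def equivalents(identifier):
    """Send in one identifier, get back the ones considered equivalent."""
    for group in equivalence_sets:
        if identifier in group:
            return sorted(group - {identifier})
    return []  # unknown identifier: the service has nothing to say


print(equivalents("libauth:0001"))
print(equivalents("bookstore:hmelville"))
```

The hard part is not the lookup but deciding what goes into each set, which is exactly the question of what each identifier identifies.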
3. The key to functional bibliographic systems is in the data
There is a lot of expressed disappointment about library systems. There is no doubt that the systems have flaws. The bottom line, however, is that a system works with data, and the key to systems functionality is in the data. Library data, although highly controlled, has been primarily designed for display to human readers, and a particular kind of display at that.
One of the great difficulties is with what libraries call "authority control." Certain entities (persons, corporate bodies, subjects) are identified with a particular human-readable string, and a record is created that can contain variant forms of that string and some other strings with relationships to the entity that the record describes. This information is stored separately from the bibliographic records that carry the strings in the context of the description of a resource. Unfortunately, the data in the authority records is not truly designed for machine-processing. It's hard to find simple examples, so I will give a simplistic one:
US (or U.S.)
is an abbreviation for United States. The catalog needs to inform users that they must use United States instead of US, or must allow retrieval under either. The authority control record says:
"US see United States"
United States, of course, appears in a lot of names. You might assume then that every place where you find "United States" you'll find a reference, such that United States. Department of State would have a reference from U.S. Department of State that refers the user from that undesirable form of the name ... but it doesn't. The reference from U.S. to United States is supposed to somehow be generalized to all of the entries that have U.S. in them. Except, of course, for those to which it should not be applied, like US Tumbler Co. or US Telecomm Inc. (but it is applied to US Telephone Association). There's a pattern here, but probably not one that can be discerned by an algorithm and quite possibly not obvious to all humans, either. What it comes down to, however, is that if you want machines to be able to do things with your data, you have to design your data in a way that machines can work with it using their plodding, non-sentient, aggravatingly dumb way of making decisions: "US" is either equal to "United States" or it isn't.
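A small Python sketch shows what that dumb decision-making looks like, assuming a reference table that is applied by exact match on the whole heading. The table layout and function name are mine, not a description of any actual authority file or library system.

```python
# Explicit references: variant heading -> preferred heading.
# The structure is an assumption made for this illustration.
see_references = {
    "US": "United States",
    "U.S.": "United States",
}


def resolve(heading):
    """Follow a reference only when the whole heading matches exactly."""
    return see_references.get(heading, heading)


print(resolve("US"))                        # a reference exists
print(resolve("U.S. Department of State"))  # no reference, so unchanged

# The machine generalizes nothing. If the longer heading should be
# redirected, that has to be stated in the data as well.
see_references["U.S. Department of State"] = "United States. Department of State"
print(resolve("U.S. Department of State"))
print(resolve("US Tumbler Co."))            # correctly left alone
```

The human rule "apply the reference wherever it makes sense" has no place in this table; every case has to be either stated or left out.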
Another difficulty arises from the differences between the ideal data and real data. If you have a database in which only half of the records have an entry for the language of the work, providing a search on language guarantees that many records for resources will never be retrieved by those searches even if they should be. We don't want to dumb down our systems to the few data elements that can reliably be expected in all records, but it is hard to provide for missing data. One advantage of having full text is that it probably will be possible to determine the predominant language of the work even if it isn't encoded in the metadata, but when you are working with metadata alone there often isn't much you can do.
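The language problem is easy to demonstrate with a toy example. The four records below are made up, and the field names are my own; half of them lack a usable language value, as in the scenario just described.

```python
# Toy records, invented for illustration. Two have no usable language.
records = [
    {"title": "Moby Dick", "language": "eng"},
    {"title": "Adventures of Tom Sawyer"},        # language never recorded
    {"title": "Les Misérables", "language": "fre"},
    {"title": "Typee", "language": None},         # field present but empty
]


def search_by_language(code):
    """Return titles whose recorded language matches the code."""
    return [r["title"] for r in records if r.get("language") == code]


def missing_language():
    """Return titles that no language search can ever retrieve."""
    return [r["title"] for r in records if not r.get("language")]


print(search_by_language("eng"))
print(missing_language())

# Share of records that a language search can reach at all.
coverage = 1 - len(missing_language()) / len(records)
print(coverage)
```

A system can at least report the coverage figure next to the search results, so that users know how much of the database the search could not see.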
A great deal of improvement could be possible with library systems if we would look at the data in terms of system needs. Not in an idealized form, because we'll never have perfect data, but looking at desired functionality and then seeing what could be done to support that functionality in the data. While the cataloging data we have today nicely supports the functionality of the card catalog, we have never made the transition to truly machine-actionable data. There may be some things we decide we cannot do, but I'm thinking that there will be some real "bang for the buck" possibilities that we should seriously consider.
Next... I'll try to get to the questions in Martha's article.