Tuesday, September 02, 2008

Semantic Dementia

"Semantic dementia" is a term for something many of us of advanced age experience: forgetting words we once knew. It brings to my mind, however, the kind of demented semantics that we often encounter in standards in our field, and the use of or creation of words that obscure the meaning of the standard.

I understand the need that standards have to be very precise in their terminology; to give terms specific meaning. There often is a conflict, however, between that desire for precision and the need to communicate well with the users of the standard. An example of this is the OpenURL* standard, which pioneered the "Context object" and its ever-obscure children like the Referent and the Referring Entity. Quick: give me a definition for Referent.... right, it's not exactly on the tip of anyone's tongue.

I'm going to say that there are two kinds of people in the world: those who think that using a standard should require many hours of study leading to a complete understanding and absorption of the concepts and terminology, so that there cannot be any possible mis-use of the standard; and those who think that a standard should be fairly understandable in a single reading, and usable shortly thereafter. Members of the former group seem to feel that the ideas in their standard are so clever, so unique, that they cannot be comprehended easily. Members of the latter group (to which I obviously belong) assume that standards recombine previous concepts into new structures, and, deep down, are generally simple ideas that one could express simply.

Jeff Young, who clearly has an element of Type 2 in him, managed to unbundle the studied obscurity of the OpenURL with the opening post to his blog, Q6. He replaced the OpenURL terms with Who, What, Where, Why, When, How. I believe that for many people, the light bulb suddenly lit up upon reading his explanation.

A similar simplification is needed for the Dublin Core Abstract Model, and I'm going to attempt that even though I think it's a dangerous thing to do. DCAM defines a set of metadata types that can help us communicate with each other about our metadata. It should simplify crosswalking of metadata sets, and make standards more understandable across communities. Unfortunately, it has not done so, at least in part because of some rather demented semantics.

DCAM, Simplified

First, you need to understand that the DCAM is about metadata structure, not meaning, or at least not meaning in the human sense of the term. It describes a generalized underlying format for machine-readable metadata. In the simplest terms, it provides the information that a program would need to determine what operations it can perform on the metadata that it receives. In this sense, it is a general metadata analogue of the OpenURL's Context Object: a formalized message about something.

The basis of the DCAM is key/value pairs, each of which is called a statement, which is the terminology from RDF. Any group of statements describes a single resource. A resource can be just about anything; what counts as a resource depends on what your metadata ultimately describes. Examples are: a book; a person; an event. The set of key/value pairs that describes a resource is called a description. If you describe more than one resource in your metadata record, then you will have multiple descriptions. These make up the description set, which is the sum of descriptions that you have defined for your purpose. These descriptions can be packaged into a record. It all looks something like this:
The statement level is where we get to the real meat of the DCAM, and the part that I think holds great potential. The actual DCAM diagram is very large and filled with terminology that makes it difficult to grasp the meaning of the concepts. I'm going to simplify that meaning here, with the understanding that there is more to DCAM than this simple explanation. Consider this Step 1, not the whole enchilada.
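
The layering described above (statements grouped into a description, descriptions grouped into a description set) can be sketched as a few Python classes. This is only a rough illustration: the class and field names are assumptions chosen for readability, not terms or structures taken from the DCAM itself, and the sample values are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Statement:
    """One key/value pair, e.g. title = Moby Dick."""
    key: str
    value: str


@dataclass
class Description:
    """All the statements that describe a single resource."""
    resource: str  # informal label for what is being described (illustrative)
    statements: List[Statement] = field(default_factory=list)


@dataclass
class DescriptionSet:
    """The descriptions that travel together, e.g. packaged into one record."""
    descriptions: List[Description] = field(default_factory=list)


# Two resources described in the same record: a book and a person.
book = Description("a book", [Statement("title", "Moby Dick")])
person = Description("a person", [Statement("name", "Herman Melville")])
record = DescriptionSet([book, person])
```

The point of the nesting is simply that each description is about exactly one resource, while a record may carry several descriptions at once.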

Essentially you have key/value pairs. A key/value pair can look something like:
title = Moby Dick
where "title" is the key and "Moby Dick" is the value.

The first rule in the DCAM is that the key must be identified with a URI, a Uniform Resource Identifier. Here's a URI that you might use for this key/value pair:
http://purl.org/dc/elements/1.1/title
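
As a small sketch of what this rule means in practice, the short key name can be expanded into its URI before the pair is stored or exchanged. The helper name `identify_key` is hypothetical; the namespace is the Dublin Core one used in the URI above.

```python
# Namespace taken from the example URI in the text.
DC_ELEMENTS = "http://purl.org/dc/elements/1.1/"


def identify_key(key: str, namespace: str = DC_ELEMENTS) -> str:
    """Expand a short key name such as 'title' into a full URI."""
    return namespace + key


# The key/value pair from the text, with the key now identified by a URI.
statement = (identify_key("title"), "Moby Dick")
```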

There is nothing new in this; using URIs is a very common convention. It's in the definition of the value that DCAM adds something. Values can be "literals" or not. DCAM makes use of the RDF definition of a literal:
"Literals are used to identify values such as numbers and dates by means of a lexical representation. Anything represented by a literal could also be represented by a URI, but it is often more convenient or intuitive to use literals."

Literals can be plain or typed, as defined in the RDF documentation. An example of a typed literal is a date or currency. A typed literal gives you some control over the format of the string, such as "YYYYMMDD."
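
To illustrate the kind of control a typed literal offers, here is a minimal Python sketch that checks whether a string fits the "YYYYMMDD" pattern and names a real calendar date. The function name is an assumption made for this example; neither RDF nor the DCAM prescribes this particular check.

```python
from datetime import datetime


def is_valid_yyyymmdd(text: str) -> bool:
    """Return True if text is exactly eight ASCII digits forming a real date."""
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        return False
    try:
        datetime.strptime(text, "%Y%m%d")
    except ValueError:
        # Eight digits, but not a calendar date (e.g. February 31st).
        return False
    return True
```

A plain literal such as "Sept. 2, 2008" gives a program nothing to check; a typed one lets it reject "20080231" before the value spreads to other systems.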

This is contrasted with the non-literal values. The non-literal values are not defined in the RDF documentation, except to imply that they are everything that is not a literal. The DCAM goes further and defines non-literals as being of two types: the non-literal value is either a URI that names the value or it is a value from a named, controlled vocabulary. So you can have:
http://purl.org/dc/dcmitype/Event

which is a URI for a controlled value, in this case the DCMI type value, Event. Or you can have:
[URI for ISO 639-1, language codes] + "en" for English
This latter is similar to what we often do in MARC records, which is to record a code as a string and indicate what controlled list that string is from.
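
The two forms of non-literal value just described can be sketched side by side. This is a simplified illustration, not the DCAM's own structure: the class and field names are assumptions, and the `example.org` address is a made-up placeholder standing in for the URI of whatever code list is being used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NonLiteral:
    value_uri: Optional[str] = None        # a URI that names the value itself
    vocabulary_uri: Optional[str] = None   # a URI that names the controlled list
    value_string: Optional[str] = None     # the code or term taken from that list


def describe(value: NonLiteral) -> str:
    """Say which of the two forms a non-literal value takes."""
    if value.value_uri:
        return "named by a URI"
    if value.vocabulary_uri and value.value_string:
        return "string from a named vocabulary"
    return "underspecified"


# Form 1: the value has its own URI.
event = NonLiteral(value_uri="http://purl.org/dc/dcmitype/Event")

# Form 2: a code plus the list it comes from (placeholder URI, not a real one).
english = NonLiteral(
    vocabulary_uri="http://example.org/vocabulary/codes",
    value_string="en",
)
```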

Obviously, if this were all there was to DCAM, we'd all be all over it by now. What happens next is that we start trying to apply this in a metadata world that is not as neat as we would like. For example, what do we do with an ISBN -- is it a structured value? Yes. Is it a member of a controlled vocabulary? Sometimes, yes, because there is a finite list of ISBNs and each one of them is known (at least to Bowker and other ISBN agencies). So, is it a typed literal, or a nonliteral?

In the end, however, perhaps it doesn't matter that this definition of "nonliteral" leaves us with some ambiguity. Perhaps what really matters is that we distinguish between these three kinds of values:
  1. Plain strings. These will be necessary for values that simply cannot be controlled or are not being controlled.
  2. Structured strings. In these, the range of values can be great, and is not being housed in a finite list, but because of their structure they can often be acted on for functions like quality control, transformations, etc.
  3. Identified values. An identified value is the essence of the semantic web. It allows algorithms to make connections between values even though those algorithms are not machine intelligences, are not understanding the human meaning behind the value.
Our mission, as information professionals, if we choose to accept it, is to move our data, where possible, from point 1 to points 2 or 3 so that there are more possibilities to act on the data in the larger web environment.
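
As a rough sketch of what acting on this three-way distinction could look like, the Python below sorts an incoming string into one of the three kinds. Everything here is an illustrative assumption: the function names, the tiny lookup table, and the choice of the ISBN-13 check digit as the "structure" test are not drawn from the DCAM.

```python
# Hypothetical lookup from known terms to their identifiers.
KNOWN_VALUES = {"event": "http://purl.org/dc/dcmitype/Event"}


def isbn13_is_valid(text: str) -> bool:
    """Check the ISBN-13 check digit (weights alternate 1, 3 across 13 digits)."""
    digits = text.replace("-", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def kind_of_value(text: str) -> str:
    """Classify a value as 'identified', 'structured', or 'plain'."""
    if text.strip().lower() in KNOWN_VALUES:
        return "identified"   # kind 3: the string can be swapped for an identifier
    if isbn13_is_valid(text):
        return "structured"   # kind 2: the format itself can be checked
    return "plain"            # kind 1: nothing more can be said about it
```

For example, `kind_of_value("Moby Dick")` falls through to "plain", while a well-formed ISBN-13 is recognized as "structured" even though no list of ISBNs is consulted.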

I welcome discussion, criticism, additions... whatever comes to your mind. Really.

* NOTE: I'd give you a link to the OpenURL standard, but NISO has gone with one of those content management systems that produce unusably long links. So you'll have to go to the NISO site and look for it under Standards.

8 comments:

Anonymous said...

The OpenURL standard is here:
http://tinyurl.com/6lgv3h

I've found myself to be a TinyURL evangelist lately.

Jonathan Rochkind said...

Thank you! I've been looking for a clear language explanation of what the heck DCAM is forever! (And trying to needle people like Stu Weibel and Andy Powell to write :) )

Jonathan Rochkind said...

Your taxonomy of plain, structured, and identified makes a lot of sense to me, and a lot more sense than "literal" and "non-literal" ever did, that's for sure. I also like that you emphasize that our role as information professionals is to try to move things up the hierarchy (hopefully that will be clear to some who are worried that anyone interested in things like DCAM or RDF has a secret agenda to move things DOWN the hierarchy!)

It remains unclear to me what DCAM gives us beyond what RDF does. To be sure, RDF is already both complicated and not necessarily right (you mention the confusing 'literal' and 'non-literal' come from RDF), so I'm not saying RDF is necessarily The Answer either. But DCAM is even MORE complicated than RDF, it incorporates RDF as just one component, and RDF is more widely adopted, and it's not clear to me what DCAM actually gives us to justify its additional (conceptual and engineering) overhead. Never has been, and I've never gotten an answer from the DCAMites that made sense to me.

This stuff is very abstract, which is part of what makes it confusing. We're not talking about things, things are very removed from what we're talking about. We're really sort of talking about meta-meta-data. Data about metadata, to let computers make use of metadata in a general sense more easily, when that metadata can come from so many different sources and be of so many different types. Meta-meta-data is meant to give us a common very abstract framework for dealing with metadata in general, to not always be re-inventing the wheel and allow more interoperability. But this is a very abstract thing, and necessarily is going to be a BIT confusing.

The trick is figuring out how to make it no more confusing than it needs to be. Part of that, I think, is making it no more abstract than it needs to be! Don't try to make it solve everything possible for it to solve. To me, part of the problem of OpenURL is that it is in fact way too abstract. It was designed to do anything and everything, not just to solve the particular problems that motivated it. As a result, it's both more difficult to use for the particular domain that motivated it than it needs to be (very few people have actually tried to write a general purpose OpenURL parsing/generating library that is actually standards compliant. Those who have will tell you it's a bear. I have the misfortune to have to work with one, and it's generally not fun) -- as well as, ironically, even less likely to be adopted outside the original domain that motivated it, because it's just so darn complicated.

Karen Coyle said...

Here's a section from a document I sent to the OpenURL committee on 11/11/2002 regarding the then current OpenURL draft:

b) The "R" Words
The entity words, referent, referrer, referringentity, etc. are both unnatural and confusing in their similarity. Where possible we should find terms closer to natural language and avoid the repetition of "ref...". The OAI-PMH is a good example of a standard that uses clear language for similar concepts. For example,

ContextObject could be RequestRecord
referrer could be sender
referringEntity could be item

While these might not be the best or only options, it should be clear that we need to avoid terminology that does not resonate immediately with the reader of the document.


Needless to say, I did not prevail. Nor did I prevail over the confusion between the broader ContextObject standard being defined within a document named "OpenURL." The idea of branding won out over logic, I'm afraid, and a ContextObject that is not embedded in a URL is still an "OpenURL," by definition. It just makes no sense to me.

Anonymous said...

Okay--here's how I understand the primary difference between "literal" and "non-literal," which I think your summary--which is otherwise excellent--doesn't quite make clear. This also helps explain why they're called literals and non-literals.

On a broader conceptual level--not from a machine-level POV--whether or not a value "should" be a literal is partly dependent on the property. Some properties lend themselves better to containing literals, while others lend themselves to containing non-literals. Let me explain.

A literal is a value that references nothing other than itself. You could consider it the "end of the line" when you're thinking about linked data. It's data that isn't linked to anything else. For example, the property "FirstName" would probably have as its value a literal. Consider "FirstName=Karen"--Karen isn't referencing the person, it's a literal string (or "value string") that tells what the FirstName is. The FirstName property, in turn, would probably be part of a description set that describes the resource--the person--that could be identified by the string "Karen Coyle."

A non-literal, on the other hand, is a value that serves as a reference to something else. Hence "non-literal"--it's not a data value to be taken literally. It's a pointer--a link--that refers to something else. Properties whose value would logically be another resource should contain non-literals. "Author," for example. Even when we say, "The Author of this blog posting is Karen Coyle," we're not referring to the literal string "Karen Coyle." That string didn't write the posting. We're using that string as a name that references the actual person. The person authored the blog posting. "Karen Coyle" is just a convenient reference--or non-literal--that points to the person. So--first of all, that's the difference between a literal and a non-literal.

So--you have the two types of non-literals. 1. A URI. The URI uniquely identifies the resource being referenced. Using the above example, since a URI doesn't have to be resolvable, we could make up a URI for Karen, or we could use a URI that points to a resource describing Karen. 2. Any other value string (and, by definition, a value string can't be a URI) that references an entity beyond the string itself. Now--since this second type of non-literal isn't a URI, you're also supposed to include a URI pointing to a resource that identifies the encoding schema for the value so that the value can be interpreted. The value string might be a non-URI identifier--an OCLC number, for instance. Or it might be a string from a controlled list like "Coyle, Karen 1970-".

Does that make sense?

I think what the DCAM intends with the two types of non-literals is to allow for non-URI identifiers.

Since I'm an RDBMS guy at heart, it's easy to think of it in those terms. A non-literal would be like a foreign key. The value itself may or may not mean anything--it just references a record in another table. A literal would be a cell that isn't a foreign or primary key. It's the actual data.

Now--this certainly isn't unambiguous. Going back to the FirstName example, one might use a non-literal for this property if you're actually thinking about first names as entities/resources in your data model. Maybe you have a separate description of each name, complete with history, related names, etc. In this case you could use a URI to identify each name, or each resource that describes any given name, or you could keep using the value string "Karen"--but in the latter case you would also need a URI associated with it that identifies how to interpret that value. Otherwise it's just a literal. So--in this case, you have the same value string ("Karen") that we could use for the same property as a literal or as a non-literal. From my understanding, what matters is whether or not you're using it as an identifier to refer to something else and whether or not you include a URI that describes the identifier--not whether or not it's "structured data."

I think one reason this gets so confusing is because, in language, there is no such thing as a literal. Language is an abstraction--words always reference something else. Even numbers represent mathematical concepts and dates represent temporal units. And, in reality, we derive meaning from context--the infinite references and links between everything with which we've ever come into contact. But data models aren't reality. The references and links eventually dead-end somewhere, which is why we have the concept of literals.

So--I do want to say that you are absolutely right that, with our data, we should strive to move up your hierarchy--from using literals to using unambiguous, URI non-literals. The fewer literals in our data model, the richer the data model and the more closely the model resembles reality. This is why linked data is so exciting, because it's a step toward a rich, machine-encoded version of reality. :-)

Anonymous said...

So I just looked back at the DCAM, and it looks like my previous comment sort of confused the "vocabulary encoding scheme" of a non-literal and the "syntax encoding scheme" of a "typed value string." Wow, that is confusing.

So--in the definition of a non-literal, the URI that identifies the non-literal is optional, the URI that identifies the vocabulary encoding scheme is also optional, and one or more value strings that identify the non-literal value are also optional.

But--I still think the general idea is that a literal value represents a literal value and nothing else. It is what it is. A non-literal value identifies or represents something else. The non-literal value could be a URI, it could be one or more non-URI values, or (according to the DCAM) a combination of both. A non-literal may or may not include a URI pointing to a vocabulary encoding scheme for the value.

So--in most cases an LCSH heading would be a non-literal value (it references a subject, a conceptual entity). The non-literal value might consist of a few different components. If there's a URI that identifies the particular heading you're using, you could include that. You could also include the plain text LCSH string. You could also include a URI that identifies the LCSH vocabulary as a whole. All of that would be part of the same "non-literal value," in DCAM-speak. In another context, the same plain-text string might be considered just a literal if you're not using the string to refer to the LC heading.

This might help, also--later in the DCAM it makes the following distinction: a "value is the physical, digital, or conceptual entity /or/ literal ... " So--you can infer that a non-literal would be a reference to the "physical, digital, or conceptual entity" and a literal would just be a data value. So--one last analogy. If I said, "This book's author is Karen Coyle," then the real value of "author" is *the person,* and "Karen Coyle" is being used like a non-literal value to identify *the person.* If I say, "This book's author is the person with the name 'Karen Coyle,'" then "the person ..." is still the value of "author" but now "Karen Coyle" is simultaneously the non-literal value of the "author" property and the literal value of the "name" property.

Again, does that make sense? I apologize for the multiple novels I've written here--it does seem like an important thing to work out, though, and I'm not 100% sure that I'm right.

Karen Coyle said...

Jason, I can't thank you enough for your post, which takes us to Chapter 2 of this discussion: WHY do/should we select literals or non-literals for our values? I'd like to repost your comment as a blog post to 1) get it a larger audience and 2) be able to interleave some comments of my own. I'll do that shortly, although if you prefer not, just let me know.

Karen Coyle said...

Jason: "If I said, "This book's author is Karen Coyle," then the real value of "author" is *the person,* and "Karen Coyle" is being used like a non-literal value to identify *the person.*"

I believe that you can indeed say: "this book's author is [literal value 'Karen Coyle']." Simple metadata does that all of the time. I think that the distinction is *not* in the string or even in the fact that you put it in classic RDF-triple terms, but in the intended use. So in a MARC record following AACR2, an author name in the 100 field is a non-literal because it represents a heading in the authority file. In a MARC record that is not using any particular cataloging rules (or where you as a recipient have no idea what the rules are), the entry in the 100, even if it is identical to the entry in the AACR2 record, is a literal because you can make no inference about what it might represent outside of the metadata record.

By defining the author name as a non-literal (and setting up particular rules for the creation of that non-literal value) you are saying, as you describe in your first comment, that this string has some possibilities beyond just being a string. How much is known about those possibilities would depend on information stored elsewhere.

A URI is one way to point to that information, but doesn't necessarily perform that function. I'm highly sympathetic to the use of good identifiers, but in fact a URI is just an identifier -- it doesn't itself fill in any gaps you need to make good use of the value.

I guess what I'm saying is that I'm comfortable with the various practical distinctions between literal and non-literal, but I find purely abstract distinctions to be a bit of a waste of time (unless I'm in the mood for philosophical contemplation, rather than trying to get work done). So my view is: what's the action? rather than: is this literal or non-literal? The answer to that latter question is important, but doesn't give me an answer to the former.

What we need to work on as "chapter 3" of this discussion, and which I think will take a great deal of effort, is: how do we create data with useful linking possibilities, and how do we convey those possibilities to anyone who might try to use the data? If the DCAM helps us clarify our thoughts in this area, then great. If not, then we may need to create a different set of distinctions and a vocabulary that will help us work with those distinctions.