Friday, September 05, 2008

Literals and non-literals, take 2

Jason Thomale responded to my previous post with his insights into literals and non-literals, and I have to say that this really hit me up-side the head, in the best of ways. Here are some paragraphs from Jason's comment (which is worth reading in full):

A literal is a value that references nothing other than itself. You could consider it the "end of the line" when you're thinking about linked data. It's data that isn't linked to anything else. For example, the property "FirstName" would probably have as its value a literal. Consider "FirstName=Karen"--Karen isn't referencing the person, it's a literal string (or "value string") that tells what the FirstName is. The FirstName property, in turn, would probably be part of a description set that describes the resource--the person--that could be identified by the string "Karen Coyle."

A non-literal, on the other hand, is a value that serves as a reference to something else. Hence "non-literal"--it's not a data value to be taken literally. It's a pointer--a link--that refers to something else. Properties whose value would logically be another resource should contain non-literals. "Author," for example. Even when we say, "The Author of this blog posting is Karen Coyle," we're not referring to the literal string "Karen Coyle." That string didn't write the posting. We're using that string as a name that references the actual person. The person authored the blog posting. "Karen Coyle" is just a convenient reference--or non-literal--that points to the person. So--first of all, that's the difference between a literal and a non-literal.

...

Since I'm an RDBMS guy at heart, it's easy to think of it in those terms. A non-literal would be like a foreign key. The value itself may or may not mean anything--it just references a record in another table. A literal would be a cell that isn't a foreign or primary key. It's the actual data.

Now--this certainly isn't unambiguous. Going back to the FirstName example, one might use a non-literal for this property if you're actually thinking about first names as entities/resources in your data model. Maybe you have a separate description of each name, complete with history, related names, etc. In this case you could use a URI to identify each name, or each resource that describes any given name, or you could keep using the value string "Karen"--but in the latter case you would also need a URI associated with it that identifies how to interpret that value. Otherwise it's just a literal. So--in this case, you have the same value string ("Karen") that we could use for the same property as a literal or as a non-literal. From my understanding, what matters is whether or not you're using it as an identifier to refer to something else and whether or not you include a URI that describes the identifier--not whether or not it's "structure data."
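To make the foreign-key analogy concrete, here is a minimal Python sketch. The class names, the example.org URIs, and the RESOURCES lookup table are all invented for illustration; nothing here comes from DCAM or from Jason's comment. It only shows the same string "Karen" used once as a dead-end literal and once as a pointer to another description.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Literal:
    """A value that stands only for itself: the 'end of the line'."""
    text: str


@dataclass(frozen=True)
class NonLiteral:
    """A value that points at another resource, much like a foreign key."""
    uri: str          # hypothetical identifier for the linked resource
    label: str = ""   # optional value string, e.g. a display name


Value = Union[Literal, NonLiteral]

# Hypothetical "other table": descriptions keyed by URI (all values invented).
RESOURCES = {
    "http://example.org/name/karen": {"RelatedName": Literal("Karin")},
}


def follow(value: Value):
    """Return the linked description, or None when there is nothing to follow."""
    if isinstance(value, NonLiteral):
        return RESOURCES.get(value.uri)
    return None


# The same string "Karen", used both ways for the same property.
as_literal = Literal("Karen")
as_non_literal = NonLiteral(uri="http://example.org/name/karen", label="Karen")

print(follow(as_literal))      # nothing to follow
print(follow(as_non_literal))  # the description of the name itself
```

The point of the sketch is that the string is identical in both cases; only the accompanying URI tells a program that there is somewhere else to go.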

What Jason does here is to look beyond the way that DCAM defines the structures of literals and non-literals and instead focus on what UI folks would call the "affordances." In other words, what do these types of values do for me in a linked environment? Although I've heard DC folks talk about this aspect of the DCAM, it is not brought out in the DCAM document itself.

Where I think that my concept of this differs from that which circulates in the DC world is that I'm not at all interested in refining philosophical points about the fine lines between literal and non-literal. This comes up in a second comment of Jason's that I reply to. I believe that Jason's analysis is in agreement with the DCAM definitions, which, however, don't work for me:
Jason: "If I said, "This book's author is Karen Coyle," then the real value of "author" is *the person,* and "Karen Coyle" is being used like a non-literal value to identify *the person.*"

Karen: I believe that you can indeed say: "this book's author is [literal value 'Karen Coyle']." Simple metadata does that all of the time. I think that the distinction is *not* in the string or even in the fact that you put it in classic RDF-triple terms, but in the intended use. So in a MARC record following AACR2, an author name in the 100 field is a non-literal because it represents a heading in the authority file. In a [metadata] record that is not using any particular cataloging rules (or where you as a recipient have no idea what the rules are), the value in the [author or creator] field, even if it is identical to the entry in the AACR2 record, is a literal because you can make no inference about what it might represent outside of the metadata record.
The difference that I see here is between a theoretical non-literal ("author of this book is Karen Coyle") and a value that one can actually act on ("author of this book is person identified in library land by LC Control Number: n 89613425"). I realize that this means that the context of the data has an effect on whether one would call the data literal or non-literal, but in fact, I'm less concerned with what you would call it than what I can do with it at any given moment in time. It's this knowing what I can do with a value that to me is of prime importance, and finding a way to convey to people and machines what they can do with a value is my main goal. (I don't know if Jason would disagree with this, but he knows how to comment, so I'll let him speak for himself.)

I am now arriving at the conclusion that if we focus on real affordances for linking, rather than structure, then we can have a very useful discussion of types of metadata affordances that serve our purposes. These may or may not exactly parallel the DCAM structure, but I don't think that adoption of the DCAM is our task -- I think our task is to create a useful model for the next generation of library data. What DCAM provides us with is an existing model that we can poke at, dissect, try to work with, and throw our own ideas at. Then, once we have defined our affordances we can figure out a way to structure our data profiles so they reveal those affordances to human and machine users.

5 comments:

Jonathan Rochkind said...

Makes a lot of sense to me, and I finally understand that 'literal' and 'non-literal' business, thanks!

I agree that "Karen Coyle" can be a literal (when it's a transcribed string that is not a key to an external record), but the author can be a non-literal when, as in the 100, it's meant to be a key to another record.

Is "non-literal" then basically the same thing as saying "identifier"? Isn't something meant to be a key to (to reference) an external record of some kind exactly what we mean by 'identifier'? If so, is there any added clarity in the "literal" and "non-literal" terminology, or should we just speak of identifiers and not identifiers?

Diane said...

Karen:

I like how you're thinking here, and agree that the function of the literal/non-literal is the important thing, and your post points out effectively why determining which one we're looking at doesn't answer the most important question.

Now the trick is to incorporate these insights into the various flavors of documentation we need to develop.

Kudos,
Diane

Karen Coyle said...

Jonathan, I, too, have been thinking about this "non-literal v. identifier" issue. I don't think that non-literals are the same as identifiers, but I do think that library record headings (names, subjects) are often non-literals. They do represent something that could be linked to outside of the bibliographic record itself.

As I've said before, I don't consider AACR2 name headings to be identifiers because they are the value string itself, and thus they change if the decisions about the data change. An identifier would be the LC name authority record number, which should remain the same even if the preferred form of the name is changed.

So I think we now have a way to talk about the role of the preferred form of the heading, although I really want to find a term that is better than "non-literal." It would be much better to say what these values are, rather than what they aren't -- especially because there may be non-literal forms that we decide not to include in our model. "Linking value?" vs. "Linking identifier"?

To me, what is important about the library headings is that 1) they are controlled and 2) they create the link between the bibliographic record and the authority record. It is also interesting that they could be replaced by an identifier and provide the same (or better) functionality.
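A small Python sketch of the heading-versus-identifier point, under loud assumptions: the record layout, the function names, the titles, and both heading strings are invented for illustration, and only the control number n 89613425 comes from the post. It shows a link made through the heading string breaking when the preferred form is revised, while a link made through the record number survives.

```python
# Hypothetical authority file keyed by a stable record identifier.
authority_file = {
    "n 89613425": {"preferred_heading": "Coyle, Karen"},  # assumed heading form
}

# Two hypothetical bibliographic records for the same author:
# one links by heading string, the other by identifier.
bib_by_heading = {"title": "Example title A", "author": "Coyle, Karen"}
bib_by_id = {"title": "Example title B", "author_id": "n 89613425"}


def resolve_by_heading(bib, authorities):
    """Find the authority record whose preferred heading matches the string."""
    for record_id, record in authorities.items():
        if record["preferred_heading"] == bib["author"]:
            return record_id
    return None


def resolve_by_id(bib, authorities):
    """Look the authority record up directly by its identifier."""
    return bib["author_id"] if bib["author_id"] in authorities else None


before = (resolve_by_heading(bib_by_heading, authority_file),
          resolve_by_id(bib_by_id, authority_file))

# The preferred form of the name is revised (the new form is invented).
authority_file["n 89613425"]["preferred_heading"] = "Coyle, Karen (revised form)"

after = (resolve_by_heading(bib_by_heading, authority_file),
         resolve_by_id(bib_by_id, authority_file))

print(before)
print(after)
```

In practice a library system would update the headings in its bibliographic records when the authority changes, so the sketch exaggerates the breakage; it is only meant to show why the heading is a value that can drift while the record number is not.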

Jason Thomale said...

Karen, thanks for the post.

I completely agree that what's important is the general concept that, in a given statement, the value can be either a link to another entity/resource or it can be a data value--and all that that implies for building data models on top of DCAM or RDF. In my mind, the details about literals vs. non-literals, typed vs. untyped literals, non-literal vs. "identifier," etc. are more a matter of implementation. If a machine is acting on a data set, it needs to know unambiguously how to act on each data value, so those details are important for how the model is implemented.

Let me just say this--and I don't consider myself an expert on either RDF or the DCAM, so take this for what it's worth. As we all know, RDF is supposed to be the enabling tech behind the semantic web. It allows data to be crawlable, just like the web is crawlable. And, just like the URL makes this possible for the web, the URI makes this possible for the semantic web.

As you noted, Karen, RDF defines the concept of a literal without explicitly defining the concept of a non-literal. But, anything that isn't a literal data value is assumed to be a URI. This fact greatly simplifies implementation. URIs allow a crawler to crawl on to the next resource. Literal values don't. There's no ambiguity in how a program should interpret a data value.
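As a rough sketch of that crawling behavior (not an RDF library, just plain Python with made-up example.org URIs and a hypothetical tuple layout for triples), a crawler can branch on a single question: is this value a URI or a literal?

```python
from collections import deque

# Hypothetical triples: (subject, property, value). A value is tagged
# "uri" (can be followed) or "literal" (end of the line).
TRIPLES = [
    ("http://example.org/post/1", "author",
     ("uri", "http://example.org/person/kc")),
    ("http://example.org/post/1", "title",
     ("literal", "Literals and non-literals, take 2")),
    ("http://example.org/person/kc", "name",
     ("literal", "Karen Coyle")),
]


def crawl(start):
    """Breadth-first walk that follows URI values and stops at literals."""
    seen = [start]
    queue = deque([start])
    literals = []
    while queue:
        subject = queue.popleft()
        for s, prop, (kind, value) in TRIPLES:
            if s != subject:
                continue
            if kind == "uri":
                # A URI is a link: queue it for a later visit.
                if value not in seen:
                    seen.append(value)
                    queue.append(value)
            else:
                # A literal is data: record it and go no further.
                literals.append((s, prop, value))
    return seen, literals


visited, found = crawl("http://example.org/post/1")
print(visited)
print(found)
```

The crawler never has to guess: the only rule it needs is "follow URIs, collect literals."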

Now--certainly there's nothing magical about a URI--it *is* just a type of identifier. But defining a consistent type of identifier to use throughout our underlying framework simplifies implementation. And, for making the semantic web a reality, this simplicity is vital.

OTOH, I think the DCAM concept of "non-literal" introduces some ambiguity and, in my non-expert opinion, unnecessary complexity--obviously this is so based on the current conversation. The DCAM allows us to treat a non-URI, opaque identifier as a non-literal--a link to more data. That means the program that's using the data needs to know how to interpret these non-URI identifiers. The more types of these identifiers you have, the more complex it becomes to implement the data model at a machine level. The more complex the machine implementation of our data model, the more library-centric it is and the less friendly it is to outsiders who want to use/crawl our data, and the more difficult it is to leverage non-library tools. If one of our goals is to make our data more accessible to outsiders, then we want to implement it in a way that doesn't require people to learn, e.g., how we resolve an LC control number or an OCLC number, just so they can crawl our data.

In my mind that doesn't mean we couldn't still make use of our library identifiers--it just means we would have to wrap them in semantic web goodness--make them addressable via URIs, etc.
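One way to picture that wrapping, as a minimal sketch only: the base URIs below are placeholders on example.org, the to_uri helper is hypothetical, and the whitespace stripping is a simplification rather than any official normalization rule. The idea is that the library-specific knowledge lives in one small function, and everything downstream sees ordinary URIs.

```python
from urllib.parse import quote

# Hypothetical base URIs; a real service would publish its own patterns.
BASES = {
    "lccn": "http://example.org/id/lccn/",
    "oclc": "http://example.org/id/oclc/",
}


def to_uri(scheme: str, identifier: str) -> str:
    """Turn a library identifier into a URI a generic crawler can follow."""
    if scheme not in BASES:
        raise ValueError(f"unknown identifier scheme: {scheme}")
    compact = "".join(identifier.split())  # drop all whitespace (simplified)
    return BASES[scheme] + quote(compact, safe="")


uri = to_uri("lccn", "n 89613425")
print(uri)
```

An outsider crawling the data would then need to know nothing about LC control numbers or OCLC numbers, only how to dereference a URI.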

Owen said...

What Jason says about DCAM here makes sense to me, but it seems to me that DCAM introduces the two types of non-literal to allow us to move towards the semantic web in smaller steps - rather than trying to solve all our problems at the same time.