Tuesday, January 08, 2008

More on RDA and "literals"

I did spend time looking over the RDA Element Analysis for its use of "literal" and "non-literal." (I know that to some of you this will seem like nit-picking, but trust me that in the end it will matter.) Let me start by giving my definition of "literal" and "non-literal":

literal: an alphanumeric string representing a value.
example: "Moby Dick, or The whale"
example: "Herman Melville"
example: "ISBN:123456789x"

non-literal: a surrogate for the value itself.
example: uri:lccn: n 79006936 [identifies the LC name authorities record for Person Herman Melville]
example: http://authoritylists.info/uri/RDACarr/1052 [identifies RDA carrier type "volume"]
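
A minimal Python sketch of these two definitions may help. The class names Literal and NonLiteral and the is_literal helper are assumptions made for illustration; they do not come from RDA or any other standard. The identifier strings are copied from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """An alphanumeric string that is itself the value."""
    text: str


@dataclass(frozen=True)
class NonLiteral:
    """An identifier standing in for the value; label is only a display hint."""
    identifier: str
    label: str = ""


# The same author can be carried either way; which one is used is a choice.
title = Literal("Moby Dick, or The whale")
author_as_literal = Literal("Herman Melville")
author_as_non_literal = NonLiteral("uri:lccn: n 79006936", label="Herman Melville")
carrier = NonLiteral("http://authoritylists.info/uri/RDACarr/1052", label="volume")


def is_literal(value) -> bool:
    """Return True when the value is a plain string rather than a surrogate."""
    return isinstance(value, Literal)
```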

In programming, non-literals are those data elements that you give a name to. This means that in programming you are mainly working with non-literals:

mainTitle = 245ab

In data records, the values are often literals:

dc:title = "Moby Dick"

When we think about RDA and the literal/non-literal difference, we have some choices. RDA could treat every value as a literal, and let another standard, the data standard, define some as non-literals. This would provide for the maximum flexibility for implementing RDA.

Another possibility is that RDA could define some elements, such as the vocabulary lists included in RDA, as non-literals. These are, indeed, defined as non-literals in the RDA Element Analysis. This would mean that any use of RDA would need to define vocabularies for those data elements, and would have to assign identifiers to those vocabularies and their entries.

There are data elements that might be a literal in one implementation, and a non-literal in another. Author names are an example of this. Today we actually embed the author name as a literal in our MARC records; in the future we could use an identifier to link to an author record, as shown in the example above. Whether or not something is a literal is often a matter of implementation.

There are some data elements, or pieces of information that we think of as single data elements, that could be a combination of a literal and a non-literal. We already have this in the MARC record in the date element in the 008 field. The date itself is a literal ("1984"); it's just a string of numbers. Included with the date is a code that tells you what kind of date it is (single, range, copyright date). The same could be true of the extent statement: "345 p." could consist of a literal ("345") and a code for the unit (pages or leaves or volumes, etc.).
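
As a rough sketch of that combination in Python: the unit identifiers under example.org, the UNIT_CODES table, and the names Extent, CodedDate, and split_extent are all invented for illustration and are not defined by RDA or MARC. The only MARC detail assumed is that "s" is the 008 code for a single date.

```python
import re
from typing import NamedTuple

# Hypothetical unit identifiers, invented for this sketch.
UNIT_CODES = {
    "p.": "http://example.org/units/pages",
    "leaves": "http://example.org/units/leaves",
    "v.": "http://example.org/units/volumes",
}


class Extent(NamedTuple):
    count: str  # literal: just a string of digits
    unit: str   # non-literal: an identifier for the unit


class CodedDate(NamedTuple):
    date: str       # literal: just a string of digits
    date_type: str  # code saying what kind of date it is


def split_extent(statement: str) -> Extent:
    """Split a simple extent statement such as '345 p.' into count and unit."""
    match = re.fullmatch(r"\s*(\d+)\s+(\S+)\s*", statement)
    if match is None:
        raise ValueError(f"unrecognized extent statement: {statement!r}")
    count, unit_text = match.groups()
    if unit_text not in UNIT_CODES:
        raise ValueError(f"unknown unit: {unit_text!r}")
    return Extent(count, UNIT_CODES[unit_text])


extent = split_extent("345 p.")
published = CodedDate("1984", "s")  # "s": single date, as coded in the MARC 008
```

This only handles the simplest "number, then unit" statements; real extent statements are far messier, which is part of why the split is hard to do after the fact.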

There are data elements that are hard to think of as a non-literal, and in fact they may never be one. Titles, explanatory notes, values like numbers of pages or dates -- all of these are likely to be simple text values in a data record.

Conclusion

RDA, as a set of cataloging rules, should not pre-determine whether elements are transcribed as literals or whether they are represented with surrogates for the values.

A step related to RDA in which RDA is defined as data elements that can be encoded for processing should allow literals for all data elements, but should be defined in such a way that non-literals could be used for any data element.

Another step, that encodes RDA as the library world's bibliographic record, should define non-literals for all vocabulary lists, and, where possible, for all units of measure or data element attributes (such as the type of publication date). It should also define optional non-literals for all authority-controlled elements. This would allow us to move increasingly in the direction of using non-literals.

Of course, our data elements themselves are (or should be) defined in such a way that they are identified with URIs, and therefore are non-literal values. This should be an obvious step in moving our data in the direction of the semantic web.

OK, I've stuck my neck out here -- all comments welcome!

11 comments:

Anonymous said...

Hi Karen,

I have to admit to a large level of fuzziness in trying to understand the literal vs. non-literal here, but let me ask these questions:

You used an example of a MARC value for a non-literal value. Are you suggesting RDA use something other than MARC to determine the non-literal values?

Are there non-literal values found in Dublin Core that we could already use?

Would the RDA list of non-literal values eventually replace MARC?

Thanks,

Patty

Karen Coyle said...

Patty,

The "literal/non-literal" designation is about what goes into the data element. If we do decide to use non-literals for some elements, it will be very hard to do so in the MARC format. The extent in the MARC 300 field is an example of that -- we don't have a way to separate the numeric value from the meaning of the value (pages, leaves, etc.). We might be able to add a subfield to personal name fields to hold a suitable identifier, but other fields do not have subfields available (like some of the 78x fields). Some MARC data elements are already non-literals, mainly in the fixed fields, but expanding our use will require a different data carrier, in my opinion.

If you look at Dublin Core you can see that each of the Dublin Core data elements is defined with an identifier, like:

http://purl.org/dc/elements/1.1/title

That is a non-literal. If you look at how DC handles a list of data types (e.g. moving image, sound file, etc.), it does this with a set of values that have been given identifiers like:
http://purl.org/dc/dcmitype/Sound

So rather than typing in "sound file", which would be a literal, you provide http://purl.org/dc/dcmitype/Sound, which stands for the same thing. It is, however, more precise, because everyone using it must mean the same thing, which is defined on the DC web site.

It isn't a question of whether the RDA values would eventually replace MARC, since MARC is a particular record format, and the RDA values (as I envision them) are a set of data elements that are not yet part of a record format. Whether they could be expressed in MARC or not is a matter of current debate. I think not, but others contend that MARC will work as an expression of RDA.

Anonymous said...

Hi Karen,

Thanks for the clarification--this helps a lot.

I guess my concern is that the strings in the non-literal DC examples are decidedly long and, unless a cataloger or other coder could select from a list of values and/or a template, there could easily be mistakes made in typing the non-literal values.

I guess that RDA could define a set of simpler non-literal values. I was hoping that we would have more interaction between RDA and DC, however.

Sorry for the ramble--I'm still trying to get my head around the whole RDA/FRBR/DC world.

Patty

Karen Coyle said...

Patty, rambling is fine, and yours is right on the mark.

No human beings should ever have to type in things like http://purl.org/dc/elements/1.1/title although I know that today those do get hand typed by folks creating XML records. But that's really bad practice. Instead, humans should see the human-readable value ("title" or "sound recording") and a program should input the big, ugly value. It does amaze me that folks working in MARC actually have to find and input rather meaningless codes in the fixed fields. We should have done better than that years ago.
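
Something like the following Python sketch is what I have in mind. The three URIs are real DCMI Type Vocabulary identifiers; the table and the function names identifier_for and label_for are assumptions made up for this illustration, not part of any cataloging system.

```python
# Human-readable labels mapped to DCMI Type Vocabulary identifiers.
DCMI_TYPE_BY_LABEL = {
    "sound": "http://purl.org/dc/dcmitype/Sound",
    "moving image": "http://purl.org/dc/dcmitype/MovingImage",
    "text": "http://purl.org/dc/dcmitype/Text",
}


def identifier_for(label: str) -> str:
    """Return the identifier to store for the label a person picked."""
    key = label.strip().lower()
    if key not in DCMI_TYPE_BY_LABEL:
        raise ValueError(f"no identifier known for label {label!r}")
    return DCMI_TYPE_BY_LABEL[key]


def label_for(identifier: str) -> str:
    """Return the label to display for a stored identifier."""
    for label, uri in DCMI_TYPE_BY_LABEL.items():
        if uri == identifier:
            return label
    raise ValueError(f"unknown identifier {identifier!r}")


stored = identifier_for("Sound")  # what the program writes into the record
shown = label_for(stored)         # what the person sees on the screen
```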

Anonymous said...

Karen

One initial comment: I've been going back over the DCAM document and I don't think your definitions clearly bring out the distinction DCAM makes between the value itself and the surrogate used to represent that value.

DCAM[1] has both a Resource Model (sect 2.1), which includes 'values' as entities -- and a Description Set Model (sect 2.2), which includes the surrogates representing the values. The values themselves -- literal or non-literal -- never appear in a description, only surrogates for them.

So the first step for RDA should be to decide whether the value (in the sense of the DCAM Resource Model) for a property is a literal or a non-literal. Then we can say that the value should be represented by a literal surrogate (if a literal value) or a non-literal surrogate (if a non-literal value) in a description.

The author of Moby Dick is clearly a non-literal value (the person, Herman Melville) and therefore should be represented by a non-literal surrogate. The title proper ('Moby-Dick, or, The whale') is a literal value and therefore should be represented by a literal surrogate (the character string 'Moby-Dick, or, The whale').

Looking quickly at what DC has chosen for the ranges for its properties[2], my impression is that only Title and Date properties have literal values (and I think Title, from other discussions I've read, is still controversial).

-Irvin

[1]http://dublincore.org/documents/abstract-model/
[2]http://dublincore.org/documents/domain-range/

Erik Hetzner said...

This document is extraordinarily confusing to me. Could somebody explain how an RDA literal compares with an RDF literal? In RDF a literal is a string (basically), either typed or not typed. But in this document we have strings which are non-literal. I presume that this is because RDA wants to have non-literals which are not URIs, which is not allowed in RDF. But this should be made clear in the beginning, or non-literal strings should be eliminated.

I would advise anybody reading this document to make sure that they have read the W3C document on RDF concepts and abstract syntax (http://www.w3.org/TR/rdf-concepts/): it helps to understand the RDA element analysis.

As for the issue of literals, the RDF concepts document makes the useful point that any value that you represent as a literal could be represented by a URI. Eventually you will have to nail this down and use a literal, but I agree that doing this too early could be a mistake: especially if it is done in a haphazard way.

Another aside, it seems to me that the page extent literal issue that you raise is essentially the issue of units & literals. It is discussed briefly at http://www.w3.org/TR/REC-rdf-syntax/#rdfvalue,
and if I recall correctly you can go back & read long discussions of the issue on the semantic web lists. The solution is basically as you put it: to have a link to a literal value (345) and to a URI representing the unit (pages).
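
A rough sketch of that structured-value pattern, written as plain Python tuples rather than with an RDF library. The rdf:value URI is the real one; the book, extent, and unit identifiers under example.org, the Lit class, and the properties_of helper are all assumptions invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    """Marks a string as an RDF literal rather than a URI."""
    text: str


RDF_VALUE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#value"

# Hypothetical identifiers, invented for this sketch.
BOOK = "http://example.org/book/moby-dick"
EXTENT = "http://example.org/terms/extent"
UNIT = "http://example.org/terms/unit"
PAGES = "http://example.org/units/pages"

EXTENT_NODE = "_:extent1"  # blank node that groups the number with its unit

triples = [
    (BOOK, EXTENT, EXTENT_NODE),
    (EXTENT_NODE, RDF_VALUE, Lit("345")),  # the literal value
    (EXTENT_NODE, UNIT, PAGES),            # the URI for the unit
]


def properties_of(subject, graph):
    """Collect predicate -> object for every triple with the given subject."""
    return {predicate: obj for s, predicate, obj in graph if s == subject}


extent = properties_of(EXTENT_NODE, triples)
```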

Extent & quantities are discussed again in the beginning of this document, but again I am completely at a loss to understand the meaning. Perhaps somebody else could parse this for me: "A quantity is generally represented by a non-literal value surrogate using a typed value string with an associated syntax encoding scheme." I don't see how a quantity could be represented in any other way than how you put it in your post: a quantity is represented by a literal value (e.g., 345) combined with a non-literal unit (e.g., pages).

Karen Coyle said...

egh said:

"But in this document we have strings which are non-literal. I presume that this is because RDA wants to have non-literals which are not URIs, which is not allowed in RDF."

That's how I read it as well. I think that the distinction that RDA is making is between strings that have been *literally* taken from the item (title, publisher) and those that are supplied by the cataloger (notes, physical size). This is, to me, a different meaning of "literal" and does not coincide with that of RDF or DCAM. I think it makes sense to make a distinction between the transcribed elements and the provided elements, but RDA will need to develop different terminology for these. At the very least, it should not state, as it does, that its use of literal and non-literal follows DCAM.

I have to admit, though, that there are some interesting issues when it comes to the names of persons and corporate entities. To some they may look like literals, but in fact they are constructed and artificial strings that identify the entity. I'm going to think more about those, and ask some folks who understand this more than I do.

Erik Hetzner said...

Hi Karen.

Thanks for your comment & for giving me the impetus to finally read through the DCMI Abstract model document. It helped to clarify what is being discussed here.

As I read it, DCMI describes for "Description sets" (which are somehow built on top of plain descriptions) a model in which each statement consists of a property & a value surrogate. The property is a URI. The value surrogate is the key part. It can be either literal or non-literal. This is subtly different from literal/non-literal as used in RDF, unfortunately in my opinion. Let's call it DCMI-(literal, non-literal) and RDF-(literal, non-literal). If the value surrogate is DCMI-literal then it is represented by one value string which is an RDF-literal. If it is DCMI-non-literal then it is more complicated. The value surrogate can then be some combination of a URI (that is, an RDF-non-literal) which 'identifies the [DCMI-]non-literal value associated with the property', a vocabulary encoding scheme URI, and zero or more value strings, which are RDF-literals which identify a DCMI-non-literal value. In other words, the DCMI Abstract model uses RDF-literals to encode DCMI-non-literals. Which makes sense, but I wish they would be a little more clear about this. Skimming the RDA element analysis again, it makes sense. But the use of square boxes, which are used in RDF diagrams to describe literals, to represent what they call non-literals really threw me.
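
Here is my reading of that model as a Python sketch. The class and field names are my own paraphrase of the DCMI Abstract Model terms, not an official API, and the example statements are made up for illustration (the DC property and DCMI Type URIs themselves are real).

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class LiteralValueSurrogate:
    """DCMI-literal: exactly one value string, which encodes the value."""
    value_string: str


@dataclass(frozen=True)
class NonLiteralValueSurrogate:
    """DCMI-non-literal: some combination of the three parts below."""
    value_uri: Optional[str] = None
    vocabulary_encoding_scheme_uri: Optional[str] = None
    value_strings: Tuple[str, ...] = ()  # zero or more RDF-literals


@dataclass(frozen=True)
class Statement:
    """A property URI paired with a value surrogate."""
    property_uri: str
    value_surrogate: Union[LiteralValueSurrogate, NonLiteralValueSurrogate]


title_statement = Statement(
    "http://purl.org/dc/elements/1.1/title",
    LiteralValueSurrogate("Moby Dick"),
)

type_statement = Statement(
    "http://purl.org/dc/elements/1.1/type",
    NonLiteralValueSurrogate(
        value_uri="http://purl.org/dc/dcmitype/Text",
        vocabulary_encoding_scheme_uri="http://purl.org/dc/terms/DCMIType",
        value_strings=("Text",),  # an RDF-literal inside a DCMI-non-literal
    ),
)
```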

I do think that RDA is making some mistakes here in terms of what is a literal and what is a non-literal. Or, at least, they are not being entirely consistent with the DCMI AM. The RDA drafts seem to take a very narrow definition of literal: it can really only be a language string. The DCMI AM definitions of the distinction between a DCMI-non-literal value and a DCMI-literal value are not particularly useful, in my opinion. However, the definition of a value string includes this useful exposition of the difference between a DCMI-literal and a DCMI-non-literal: 'In a literal value surrogate a value string encodes the value; in a non-literal value surrogate a value string represents the value.' Now, it is my opinion (perhaps the RDA drafters have a different one) that a date, for instance, which in the RDA element analysis is a non-literal, is encoded by a value string, rather than represented by it. To my mind a value is only 'represented' if you have to look it up somewhere else. But I could be misunderstanding this whole discussion (as my first comment showed). :)

Anonymous said...

e.

Thank you very much for that. Your description squares with how I see the DCAM and you've also cleared up the confusion I was having with RDF literal vs DCAM literal.

Your last point about representing meaning 'look up elsewhere' might be the key to my fuzziness between DCAM literals and non-literals. What you say about dates ties in with the fact that the DC Ranges and Domains document treats dates as literals. I was having trouble understanding how a date, which I'd consider an abstract entity, could be a literal -- but perhaps this represents/encodes distinction is the key.

I think the core difficulty I (and I suspect other librarians) have with all this is that the worldview is not one we're familiar with. It's like being dropped onto a new planet.

Karen Coyle said...

e - "To my mind a value is only 'represented' if you have to look it up somewhere else."

Yes, I'm told this is correct, in almost these same words. But we do need to distinguish between those things that are represented by alpha codes (like the ISO country codes) and those that are represented by URIs. I need to look more into that.

Erik Hetzner said...

(e==egh; I don't know why blogger named me "e" for one comment)

Irvin-I'd definitely go to the source (the DCMI Abstract model) & see if what I'm saying makes sense for yourself. This is for me as well (not a librarian) like being dropped on another planet. I am generally impressed with the thought that has gone into these models, but I don't always understand what they are getting at.

Karen-That is the sort of distinction that I think the DCMI abstract model is making when they say that a DCMI-non-literal value surrogate can be made up of a value string which is an RDF-literal.