Tuesday, November 18, 2014

Classes in RDF

RDF allows one to define class relationships for things and concepts. The RDFS1.1 primer describes classes succinctly as:
Resources may be divided into groups called classes. The members of a class are known as instances of the class. Classes are themselves resources. They are often identified by IRIs and may be described using RDF properties. The rdf:type property may be used to state that a resource is an instance of a class.
This seems simple, but it is in fact one of the primary areas of confusion about RDF.

If you are not a programmer, you probably think of classes in terms of taxonomies -- genus, species, sub-species, etc. If you are a librarian you might think of classes in terms of classification, like Library of Congress or the Dewey Decimal System. In these, the class defines certain characteristics of the members of the class. Thus, with two classes, Pets and Veterinary science, you can have:
- dogs
- cats

Veterinary science
- dogs
- cats
In each of those, dogs and cats have different meaning because the class provides a context: either as pets, or information about them as treated in veterinary science.

For those familiar with XML, it has similar functionality because it makes use of nesting of data elements. In XML you can create something like this:
and it is clear which price goes with which type of drink, and that the bits directly under the <drink> level are all drinks, because that's what <drink> tells you.

Now you have to forget all of this in order to understand RDF, because RDF classes do not work like this at all. In RDF, the "classness" is not expressed hierarchically, with a class defining the elements that are subordinate to it. Instead it works in the opposite way: the descriptive elements in RDF (called "properties") are the ones that define the class of the thing being described. Properties carry the class information through a characteristic called the "domain" of the property. The domain of the property is a class, and when you use that property to describe something, you are saying that the "something" is an instance of that class. It's like building the taxonomy from the bottom up.

This only makes sense through examples. Here are a few:
1. "has child" is of domain "Parent".

If I say "X - has child - 'Fred'" then I have also said that X is a Parent because every thing that has a child is a Parent.

2. "has Worktitle" is of domain "Work"

If I say "Y - has Worktitle - 'Der Zauberberg'" then I have also said that Y is a Work because every thing that has a Worktitle is a Work.

In essence, X or Y is an identifier for something that is of unknown characteristics until it is described. What you say about X or Y is what defines it, and the classes put it in context. This may seem odd, but if you think of it in terms of descriptive metadata, your metadata describes the "thing in hand"; the "thing in hand" doesn't describe your metadata. 

Like in real life, any "thing" can have more than one context and therefore more than one class. X, the Parent, can also be an Employee (in the context of her work), a Driver (to the Department of Motor Vehicles), a Patient (to her doctor's office). The same identified entity can be an instance of any number of classes.
"has child" has domain "Parent"
"has licence" has domain "Driver"
"has doctor" has domain "Patient"

X - has child - "Fred"  = X is a Parent 
X - has license - "234566"  = X is a Driver
X - has doctor - URI:765876 = X is a Patient
Classes are defined in your RDF vocabulary, as as the domains of properties. The above statements require an application to look at the definition of the property in the vocabulary to determine whether it has a domain, and then to treat the subject, X, as an instance of the class described as the domain of the property. There is another way to provide the class as context in RDF - you can declare it explicitly in your instance data, rather than, or in addition to, having the class characteristics inherent in your descriptive properties when you create your metadata. The term used for this, based on the RDF standard, is "type," in that you are assigning a type to the "thing." For example, you could say:
X - is type - Parent
X - has child - "Fred"
This can be the same class as you would discern from the properties, or it could be an additional class. It is often used to simplify the programming needs of those working in RDF because it means the program does not have to query the vocabulary to determine the class of X. You see this, for example, in BIBFRAME data. The second line in this example gives two classes for this entity:
a bf:Instance, bf:Monograph .

One thing that classes do not do, however, is to prevent your "thing" from being assigned the "wrong class." You can, however, define your vocabulary to make "wrong classes" apparent. To do this you define certain classes as disjoint, for example a class of "dead" would logically be disjoint from a class of "alive." Disjoint means that the same thing cannot be of both classes, either through the direct declaration of "type" or through the assignment of properties. Let's do an example:
"residence" has domain "Alive"
"cemetery plot location" has domain "Dead"
"Alive" is disjoint "Dead" (you can't be both alive and dead)

X - is type - "Alive"                                         (X is of class "Alive")
X - cemetery plot location - URI:9494747      (X is of class "Dead")
Nothing stops you from creating this contradiction, but some applications that try to use the data will be stumped because you've created something that, in RDF-speak, is logically inconsistent. What happens next is determined by how your application has been programmed to deal with such things. In some cases, the inconsistency will mean that you cannot fulfill the task the application was attempting. If you reach a decision point where "if Alive do A, if Dead do B" then your application may be stumped and unable to go on.

All of this is to be kept in mind for the next blog post, which talks about the effect of class definitions on bibliographic data in RDF.

Note: Multiple domains are treated in RDF as an AND (an intersection). Using a library-ish example, let's assume you want to define a note field that you can use for any of your bibliographic entities. For this example, we'll define entities Work, Person, Manifestation for ex:note. You define your note property something like:

    a rdf:Property ;
    rdfs:label "Note"@en ;
    rdfs:domain ex:Work ;
    rdfs:domain ex:Person ;
    rdfs:domain ex:Manifestation ;
    rdfs:range rdfs:literal .

Any subject on which you use ex:note would be inferred to be, at the same time, Work and Person and Manifestation - which is manifestly illogical. There is not a way to express the rule: "This property CAN be used with these classes" in RDF. For that, you will need something that does not yet exist in RDF, but is being worked on in the W3C community, which is a set of rules that would allow you to validate property usage. You might also want see what schema.org has done for domain and range.


Jon Phipps said...

Nicely done!

It might also be worth pointing out (at some point) that determining the class of some 'thing' is potentially much more difficult in the wilder, open world of RDF where 'you' aren't in control of the assignment of properties. In the open world everything is true and valid until proven otherwise by closed world analysis. You may say that 'alive' is disjoint with 'dead' in your domain, but this is an assertion that would by necessity be rejected in the zombie domain.

This is the other key difference between RDF and XML: like Schrödinger's cat, RDF data by design can exist in both a state of open world truth (everything is true) and closed world falsehood (not on my planet!), without altering the data. This is both the challenge and the opportunity of RDF, because the open world encourages discovery of truths that aren't your truths, but may still be completely true, including truths that directly contradict your truths. XML is far more closed minded.

Karen Coyle said...

Thanks, Jon. Yes, the next post will talk about closed vs. open world. I thought that the explanation of classes should precede that. The closed vs. open is the other main sticking point for people trying to understand RDF. You see this as a design feature of RDF; I must admit that I don't see closed world as principles in RDF, but I do see them applied in practice, perhaps more than open world thinking. Perhaps this is a comment on the model of RDF itself: while the open world thinking is new and exciting, what RDF fails to address is that there is a great need for applications that function with closed world rules.

As for "alive" and "dead" - it's damned hard to find a disjoint pair that everyone can agree on. At least I didn't use "male | female". How about "dead | alive | undead"?