Friday, January 16, 2015

Real World Objects

I was asked a question about the meaning and import of the RDF concept of "Real World Object" (RWO) and didn't give a very good answer off the cuff. I'll try to make up for that here.

The concept of RWO comes out of the artificial intelligence (AI) community. Imagine that you are developing robots and other machines that must operate within the same world that you and I occupy. You have to find a way to "explain," in a machine-operational way, everything in our world: stairs and ramps, chairs and tables, the effect of gravity on a cup when you miss placing it on the table, the stars, love and loyalty (concepts are also objects in this view). The AI folks have actually set a goal to create such descriptions, which they call ontologies, for everything in the world; for every RWO.

You might consider this a conceit, or a folly, but that's the task they have set for themselves.

The original Scientific American article that described the semantic web used as its example intelligent 'bots that would manage your daily calendar and make appointments for you. This was far short of the AI "ontology of everything" but the result that matters to us now is that there have been AI principles baked into the development of RDF, including the concept of the RWO.

RWO isn't as mysterious as it may seem, and I can provide a simple example from our world. The MARC record for a book has the book as its RWO, and most of its data elements "speak" about the book. At the same time, we can say things about the MARC record, such as who originally created it, and who edited it last, and when. The book and the record are different things, different RWO's in an RDF view. That's not controversial, I would assume.

Our difficulties arise because in the past we didn't have a machine-actionable way to distinguish between those two "things": the book and the record. Each MARC record got an identifier, which identified the record. We've never had identifiers for the thing the record describes (although the ISBN sometimes works this way). It has always been safe to assume that the record was about the book, and what identified the book was the information in the record. So we obviously have a real world object, but we didn't give it its own identifier - because humans could read the text of the record and understand what it meant (most of the time or some of the time).

I'm not fully convinced that everything can be reduced to RWO/not-RWO, and so I'm not buying that is the only way to talk about our world and our data. It should be relatively easy, though, without getting into grand philosophical debates, to determine the difference between our metadata and the thing it describes. That "thing it describes" can be fuzzy in terms of the real world, such as when the spirit of Edgar Cayce speaks through a medium and writes a book. I don't want to have to discuss whether the spirit of Edgar Cayce is real or not. We can just say that "whoever authors the book is as real as it gets." So if we forget RWO in the RDF sense and just look sensibly at our data, I'm sure we can come to a practical agreement that allows both the metadata and the real world object to exist.

That doesn't resolve the problem of identifiers, however, and for machine-processing purposes we do need separate identifiers for our descriptions and what we are describing.* That's the problem we need to solve, and while we may go back and forth a bit on the best solution, the problem is tractable without resorting to philosophical confabulations.

* I think that the multi-level bibliographic descriptions like FRBR and BIBFRAME make this a bit more complex, but I haven't finished thinking about that, so will return if I have a clearer idea.

Saturday, January 10, 2015

This is what sexism looks like #2

Libraries, it seems, are in crisis, and many people are searching for answers. Someone I know posted a blog post pointing to community systems like Stack Overflow and Reddit as examples of how libraries could create "community." He especially pointed out the value of "gamification" - the ranking of responses by the community - as something libraries should consider. His approach was that it is "human nature" to want to gain points. "We are made this way: give us a contest and we all want to win." (The rest of the post and the comments went beyond this to the questions of what libraries should be today, etc.)

There were many (about 4 dozen, almost all men) comments on his blog (which I am not linking to, because I don't want this to be a "call out"). He emailed me asking for my opinion.

I responded only to his point about gamification, which was all I had time for, saying that in that area his post ignored an important gender issue. The competitive aspect was part of what makes those sites unfriendly to women.

I told him that there have been many studies of how children play, and they reveal some distinct differences between genders. Boys begin play by determining a set of rules that they will follow, and during play they may stop to enforce or discuss the rules. Girls begin to play with an unstructured understanding of the game, and, if problems arise during play, they work on a consensus. Boys games usually have points and winners. Girls' games are often without winners and are "taking turns" games. Turning libraries into a "winning" game could result in something like Reddit, where few women go, or if they do they are reluctant to participate.

And I said: "As a woman, I avoid the win/lose situations because, based on general social status (and definitely in the online society) I am already designated a loser. My position is pre-determined by my sex, so the game is not appealing."

I didn't post this to the site, just emailed it to the owner. It's good that I did not. The response from the blog owner was:
This is very interesting. But I need to see some proof.
Some proof. This is truly amazing. Search on Google Scholar for "games children gender differences" and you are overwhelmed with studies.

But it's even more amazing because none of the men who posted their ideas to the site were asked for proof. Their ideas are taken at face value. Of course, they didn't bring up issues of gender, class, or race in their responses, as if these are outside of the discussion of what libraries should be. And to bring them up is an "inconvenience" in the conversation, because the others do not want to hear it.

He also pointed me to a site that is "friendly to women." To that I replied that women decide what is "friendly to women."

I was invited to comment on the blog post, but it is now clear that my comments will not be welcome. In fact, I'd probably only get inundated with posts like "prove it." This does seem to be the response whenever a woman or minority points out an inconvenient truth.

Welcome to my world.

Monday, November 24, 2014

Multi-Entity Models.... Baker, Coyle, Petiya

Multi-Entity Models of Resource Description in the Semantic Web: A comparison of FRBR, RDA, and BIBFRAME
by Tom Baker, Karen Coyle, Sean Petiya
Published in: Library Hi Tech, v. 32, n. 4, 2014 pp 562-582 DOI:10.1108/LHT-08-2014-0081
Open Access Preprint

The above article was just published in Library hi Tech. However, because the article is a bit dense, as journal articles tend to be, here is a short description of the topic covered, plus a chance to reply to the article.

We now have a number of multi-level views of bibliographic data. There is the traditional "unit card" view, reflected in MARC, that treats all bibliographic data as a single unit. There is the FRBR four-level model that describes a single "real" item, and three levels of abstraction: manifestation, expression, and work. This is also the view taken by RDA, although employing a different set of properties to define instances of the FRBR classes. Then there is the BIBFRAME model, which has two bibliographic levels, work and instance, with the physical item as an annotation on the instance.

In support of these views we have three RDF-based vocabularies:

FRBRer (using OWL)
RDA (using RDFS)
BIBFRAME (using RDFS)

The vocabularies use a varying degree of specification. FRBRer is the most detailed and strict, using OWL to define cardinality, domains and ranges, and disjointness between classes and between properties. There are, however, no sub-classes or sub-properties. BIBFRAME properties all are defined in terms of domains (classes), and there are some sub-class and sub-property relationships. RDA has a single set of classes that are derived from the FRBR entities, and each property has the domain of a single class. RDA also has a parallel vocabulary that defines no class relationships; thus, no properties in that vocabulary result in a class entailment. [1]

As I talked about in the previous blog post on classes, the meaning of classes in RDF is often misunderstood, and that is just the beginning of the confusion that surrounds these new technologies. Recently, Bernard Vatant, who is a creator of the Linked Open Vocabularies site that does a statistical analysis of the existing linked open data vocabularies and how they relate to each other, said this on the LOV Google+ group:
"...it seems that many vocabularies in LOV are either built or used (or both) as constraint and validation vocabularies in closed worlds. Which means often in radical contradiction with their declared semantics."
What Vatant is saying here is that many vocabularies that he observes use RDF in the "wrong way." One of the common "wrong ways" is to interpret the axioms that you can define in RDFS or OWL the same way you would interpret them in, say, XSD, or in a relational database design. In fact, the action of the OWL rules (originally called "constraints," which seems to have contributed to the confusion, now called "axioms") can be entirely counter-intuitive to anyone whose view of data is not formed by something called "description logic (DL)."

A simple demonstration of this, which we use in the article, is the OWL axiom for "maximum cardinality." In a non-DL programming world, you often state that a certain element in your data is limited to the number of times it can be used, such as saying that in a MARC record you can have only one 100 (main author) field. The maximum cardinality of that field is therefore "1". In your non-DL environment, a data creation application will not let you create more than one 100 field; if an application receiving data encounters a record with more than one 100 field, it will signal an error.

The semantic web, in its DL mode, draws an entirely different conclusion. The semantic web has two key principles: open world, and non-unique name. Open world means that whatever the state of the data on the web today, it may be incomplete; there can be unknowns. Therefore, you may say that you MUST have a title for every book, but if a look at your data reveals a book without a title, then your book still has a title, it is just an unknown title. That's pretty startling, but what about that 100 field? You've said that there can only be one, so what happens if there are 2 or 3 or more of them for a book? That's no problem, says OWL: the rule is that there is only one, but the non-unique name rule says that for any "thing" there can be more than one name for it. So when an OWL program [2] encounters multiple author 100 fields, it concludes that these are all different names for the same one thing, as defined by the combination of the non-unique name assumption and the maximum cardinality rule: "There can only be one, so these three must really be different names for that one." It's a bit like Alice in Wonderland, but there's science behind it.

What you have in your database today is a closed world, where you define what is right and wrong; where you can enforce the rule that required elements absolutely HAVE TO be there; where the forbidden is not allowed to happen. The semantic web standards are designed for the open world of the web where no one has that kind of control. Think of it this way: what if you put a document onto the open web for anyone to read, but wanted to prevent anyone from linking to it? You can't. The links that others create are beyond your control. The semantic web was developed around the idea of a web (aka a giant graph) of data. You can put your data up there or not, but once it's there it is subject to the open functionality of the web. And the standards of RDFS and OWL, which are the current standards that one uses to define semantic web data, are designed specifically for that rather chaotic information ecosystem, where, as the third main principle of the semantic web states, "anyone can say anything about anything."

I have a lot of thoughts about this conflict between the open world of the semantic web and the needs for closed world controls over data; in particular whether it really makes sense to use the same technology for both, since there is such a strong incompatibility in underlying logic of these two premises. As Vatant implies, many people creating RDF data are doing so with their minds firmly set in closed world rules, such that the actual result of applying the axioms of OWL and RDF on this data on the open web will not yield the expected closed world results.

This is what Baker, Petiya and I address in our paper, as we create examples from FRBRer, RDA in RDF, and BIBFRAME. Some of the results there will probably surprise you. If you doubt our conclusions, visit the site http://lod-lam.slis.kent.edu/wemi-rdf/ that gives more information about the tests, the data and the test results.

[1] "Entailment" means that the property does not carry with it any "classness" that would thus indicate that the resource is an instance of that class.

[2] Programs that interpret the OWL axioms are called "reasoners". There are a number of different reasoner programs available that you can call from your software, such as Pellet, Hermit, and others built into software packages like TopBraid.

Tuesday, November 18, 2014

Classes in RDF

RDF allows one to define class relationships for things and concepts. The RDFS1.1 primer describes classes succinctly as:
Resources may be divided into groups called classes. The members of a class are known as instances of the class. Classes are themselves resources. They are often identified by IRIs and may be described using RDF properties. The rdf:type property may be used to state that a resource is an instance of a class.
This seems simple, but it is in fact one of the primary areas of confusion about RDF.

If you are not a programmer, you probably think of classes in terms of taxonomies -- genus, species, sub-species, etc. If you are a librarian you might think of classes in terms of classification, like Library of Congress or the Dewey Decimal System. In these, the class defines certain characteristics of the members of the class. Thus, with two classes, Pets and Veterinary science, you can have:
Pets
- dogs
- cats

Veterinary science
- dogs
- cats
In each of those, dogs and cats have different meaning because the class provides a context: either as pets, or information about them as treated in veterinary science.

For those familiar with XML, it has similar functionality because it makes use of nesting of data elements. In XML you can create something like this:
<drink>
    <lemonade>
        <price>$2.50</price>
        <amount>20</amount>
    </lemonade>
    <pop>
        <price>$1.50</price>
        <amount>10</amount>
    </pop>
</drink>
and it is clear which price goes with which type of drink, and that the bits directly under the <drink> level are all drinks, because that's what <drink> tells you.

Now you have to forget all of this in order to understand RDF, because RDF classes do not work like this at all. In RDF, the "classness" is not expressed hierarchically, with a class defining the elements that are subordinate to it. Instead it works in the opposite way: the descriptive elements in RDF (called "properties") are the ones that define the class of the thing being described. Properties carry the class information through a characteristic called the "domain" of the property. The domain of the property is a class, and when you use that property to describe something, you are saying that the "something" is an instance of that class. It's like building the taxonomy from the bottom up.

This only makes sense through examples. Here are a few:
1. "has child" is of domain "Parent".

If I say "X - has child - 'Fred'" then I have also said that X is a Parent because every thing that has a child is a Parent.

2. "has Worktitle" is of domain "Work"

If I say "Y - has Worktitle - 'Der Zauberberg'" then I have also said that Y is a Work because every thing that has a Worktitle is a Work.

In essence, X or Y is an identifier for something that is of unknown characteristics until it is described. What you say about X or Y is what defines it, and the classes put it in context. This may seem odd, but if you think of it in terms of descriptive metadata, your metadata describes the "thing in hand"; the "thing in hand" doesn't describe your metadata. 

Like in real life, any "thing" can have more than one context and therefore more than one class. X, the Parent, can also be an Employee (in the context of her work), a Driver (to the Department of Motor Vehicles), a Patient (to her doctor's office). The same identified entity can be an instance of any number of classes.
"has child" has domain "Parent"
"has licence" has domain "Driver"
"has doctor" has domain "Patient"

X - has child - "Fred"  = X is a Parent 
X - has license - "234566"  = X is a Driver
X - has doctor - URI:765876 = X is a Patient
Classes are defined in your RDF vocabulary, as as the domains of properties. The above statements require an application to look at the definition of the property in the vocabulary to determine whether it has a domain, and then to treat the subject, X, as an instance of the class described as the domain of the property. There is another way to provide the class as context in RDF - you can declare it explicitly in your instance data, rather than, or in addition to, having the class characteristics inherent in your descriptive properties when you create your metadata. The term used for this, based on the RDF standard, is "type," in that you are assigning a type to the "thing." For example, you could say:
X - is type - Parent
X - has child - "Fred"
This can be the same class as you would discern from the properties, or it could be an additional class. It is often used to simplify the programming needs of those working in RDF because it means the program does not have to query the vocabulary to determine the class of X. You see this, for example, in BIBFRAME data. The second line in this example gives two classes for this entity:
<http://bibframe.org/resources/FkP1398705387/8929207instance22>
a bf:Instance, bf:Monograph .

One thing that classes do not do, however, is to prevent your "thing" from being assigned the "wrong class." You can, however, define your vocabulary to make "wrong classes" apparent. To do this you define certain classes as disjoint, for example a class of "dead" would logically be disjoint from a class of "alive." Disjoint means that the same thing cannot be of both classes, either through the direct declaration of "type" or through the assignment of properties. Let's do an example:
"residence" has domain "Alive"
"cemetery plot location" has domain "Dead"
"Alive" is disjoint "Dead" (you can't be both alive and dead)

X - is type - "Alive"                                         (X is of class "Alive")
X - cemetery plot location - URI:9494747      (X is of class "Dead")
Nothing stops you from creating this contradiction, but some applications that try to use the data will be stumped because you've created something that, in RDF-speak, is logically inconsistent. What happens next is determined by how your application has been programmed to deal with such things. In some cases, the inconsistency will mean that you cannot fulfill the task the application was attempting. If you reach a decision point where "if Alive do A, if Dead do B" then your application may be stumped and unable to go on.

All of this is to be kept in mind for the next blog post, which talks about the effect of class definitions on bibliographic data in RDF.


Note: Multiple domains are treated in RDF as an AND (an intersection). Using a library-ish example, let's assume you want to define a note field that you can use for any of your bibliographic entities. For this example, we'll define entities Work, Person, Manifestation for ex:note. You define your note property something like:

ex:note
    a rdf:Property ;
    rdfs:label "Note"@en ;
    rdfs:domain ex:Work ;
    rdfs:domain ex:Person ;
    rdfs:domain ex:Manifestation ;
    rdfs:range rdfs:literal .

Any subject on which you use ex:note would be inferred to be, at the same time, Work and Person and Manifestation - which is manifestly illogical. There is not a way to express the rule: "This property CAN be used with these classes" in RDF. For that, you will need something that does not yet exist in RDF, but is being worked on in the W3C community, which is a set of rules that would allow you to validate property usage. You might also want see what schema.org has done for domain and range.

Saturday, October 25, 2014

Citations get HOT

The Public Library of Science research section, PLOSLabs (ploslabs.org) has announced some very interesting news about the work that they are doing on citations, which they are calling "Rich Citations".

Citations are the ultimate "linked data" of academia, linking new work with related works. The problem is that the link is human-readable only and has to be interpreted by a person to understand what the link means. PLOS Labs have been working to make those citations machine-expressive, even though they don't natively provide the information needed for a full computational analysis.

Given what one does have in a normal machine-readable document with citations, they are able to pull out an impressive amount of information:
  • What section the citation is found in. There is some difference in meaning whether a citation is found in the "Background" section of an article, or in the "Methodology" section. This gives only a hint to the meaning of the citation, but it's more than no information at all.
  • How often a resource is cited in the article. This could give some weight to its importance to the topic of the article.
  • What resources are cited together. Whenever a sentence ends with "[3][7][9]", you at least know that those three resources equally support what is being affirmed. That creates a bond between those resources.
  • ... and more
As an open access publisher, they also want to be able to take users as directly as possible to the cited resources. For PLOS publications, they can create a direct link. For other resources, they make use of the DOI to provide links. Where possible, they reveal the license of cited resources, so that readers can know which resources are open access and which are pay-walled.

This is just a beginning, and their demo site, appropriately named "alpha," uses their rich citations on a segment of the PLOS papers. They also have an API that developers can experiment with.

I was fortunate to be able to spend a day recently at their Citation Hackathon where groups hacked on ongoing aspects of this work. Lots of ideas floated around, including adding abstracts to the citations so a reader could learn more about a resource before retrieving it. Abstracts also would add search terms for those resources not held in the PLOS database. I participated in a discussion about coordinating Wikidata citations and bibliographies with the PLOS data.

Being able to datamine the relationships inherent in the act of citation is a way to help make visible and actionable what has long been the rule in academic research, which is to clearly indicate upon whose shoulders you are standing. This research is very exciting, and although the PLOS resources will primarily be journal articles, there are also books in their collection of citations. The idea of connecting those to libraries, and eventually connecting books to each other through citations and bibliographies, opens up some interesting research possibilities.

Sunday, October 19, 2014

schema.org - where it works

In the many talks about schema.org, it seems that one topic that isn't covered, or isn't covered sufficiently, is "where do you do it?" That is, where does it fit into your data flow? I'm going to give a simple, typical example. Your actual situation may vary, but I think this will help you figure out your own case.

The typical situation is that you have a database with your data. Searches go against that database, the results are extracted, a program formats these results into a web page, and the page is sent to the screen. Let's say that your database has data about authors, titles and dates. These are stored in your database in a way that you know which is which. A search is done, and let's say that the results of the search are:
author:  Williams, R
title: History of the industrial sewing machine
date: 1996
This is where you are in your data flow:

The next thing that happens (and remember, I'm speaking very generally) is that the results then are fed into a program that formats them into HTML, probably within a template that has all your headers, footers, sidebars and branding and sends the data to the browser. The flow now looks like

Let's say that you will display this as a citation, that looks like:
Williams, R. History of the industrial sewing machine. 1996.
Without any fancy formatting, the HTML for this is:
<p>Williams, R. History of the industrial sewing machine. 1996.</p>
Now we can see the problem that schema.org is designed to fix. You started with an author, a title and date, but what you are showing to the world is a string of characters are that undifferentiated. You have lost all the information about what these represent. To a machine, this is just another of many bazillions of paragraphs on the web. Even if you format your data like this:
<p>Author: Williams, R.</p>
<p>Title:  Williams, R. History of the industrial sewing machine</p>
<p>Date: 1996</p>
What a machine sees is:
<p>blah: blah</p>
<p>blah: blah</p>
<p>blah: blah</p>  
What we want is for the program that is is formatting the HTML to also include some metadata from schema.org that retains the meaning of the data you are putting on the screen. So rather than just putting HTML formatting, it will add formatting from schema.org. Schema.org has metadata elements for many different types of data. Using our example, let's say that this is a book, and here's how you could mark that up in schema.org:
<div vocab="http://schema.org/">
<div   typeof="Book">
<p>
    <span property="author">Williams, R.</span> <span property="name">History of the industrial sewing machine</span>. <span property="datePublished">1996</span>.
    </p>
    </div>
</div>
Again, this is a very simple example, but when we test this code in the Google Rich Snippet tool, we can see that even this very simple example has added rich information that a search engine can make use of:
To see a more complex example, this is what Dan Scott and I have done to enrich the files of the Bryn Mawr Classical Reviews.

The review as seen in a browser (includes schema.org markup)

The review as seen by a tool that reads the structured schema.org data.

From these you can see a couple of things. The first is that the schema.org markup does not change how your pages look to a user viewing your data in a browser. The second is that hidden behind that simple page is a wealth of rich information that was not visible before.

Now you are probably wondering: well, what's that going to do for me? Who will use it? At the moment, the users of this data are the search engines, and they use the data to display all of that additional information that you see under a link:


In this snippet, the information about stars, ratings, type of film and audience comes from schema. org mark-up on the page.

Because the data is there, many of us think that other users and uses will evolve. The reverse of that is that, of course, if the information isn't there then those as yet undeveloped possibilities cannot happen.



Wednesday, October 01, 2014

This is what sexism looks like

[Note to readers: sick and tired of it all, I am going to report these "incidents" publicly because I just can't hack it anymore.]

I was in a meeting yesterday about RDF and application profiles, in which I made some comments, and was told by the co-chair: "we don't have time for that now", and the meeting went on.

Today, a man who was not in the meeting but who listened to the audio sent an email that said:
"I agree with Karen, if I correctly understood her point, that this is "dangerous territory".  On the call, that discussion was postponed for a later date, but I look forward to having that discussion as soon as possible because I think it is fundamental."
And he went on to talk about the issue, how important it is, and at one point referred to it as "The requirement is that a constraint language not replace (or "hijack") the original semantics of properties used in the data."

The co-chair (I am the other co-chair, although reconsidering, as you may imagine) replied:
"The requirement of not hijacking existing formal specification languages for expressing constraints that rely on different semantics has not been raised yet."
"Has not been raised?!" The email quoting me stated that I had raised it the very day before. But an important issue is "not raised" until a man brings it up. This in spite of the fact that the email quoting me made it clear that my statement during the meeting had indeed raised this issue.

Later, this co-chair posted a link to a W3C document in an email to me (on list) and stated:
"I'm going on holidays so won't have time to explain you, but I could, in theory (I've been trained to understand that formal stuff, a while ago)"
That is so f*cking condescending. This happened after I quoted from W3C documents to support my argument, and I believe I had a good point.

So, in case you haven't experienced it, or haven't recognized it happening around you, this is what sexism looks like. It looks like dismissing what women say, but taking the same argument seriously if a man says it, and it looks like purposely demeaning a woman by suggesting that she can't understand things without the help of a man.

I can't tell you how many times I have been subjected to this kind of behavior, and I'm sure that some of you know how weary I am of not being treated as an equal no matter how equal I really am.

Quiet no more, friends. Quiet no more.

(I want to thank everyone who has given me support and acknowledgment, either publicly or privately. It makes a huge difference.) 

Some links about "'Splaining"
http://scienceblogs.com/thusspakezuska/2010/01/25/you-may-be-a-mansplainer-if/
http://geekfeminism.wikia.com/wiki/Splaining