Monday, April 08, 2019

I, too, want answers

Around 1966-67 I worked on the reference desk at my local public library. For those too young to remember, this was a time when all information was in paper form, and much of that paper was available only at the library. The Internet was just a twinkle in the eye of some scientists at DARPA, and none of us had any idea what kind of information environment was in our future.* The library had a card catalog and the latest thing was that check-outs were somehow recorded on microfilm, as I recall.

As you entered the library the reference desk was directly in front of you in the prime location in the middle of the main room. A large number of library users went directly to the desk upon entering. Some of these users had a particular research in mind: a topic, an author, or a title. They came to the reference desk to find the quickest route to what they sought. The librarian would take them to the card catalog, would look up the entry, and perhaps even go to the shelf with the user to look for the item.**

There was another type of reference request: a request for facts, not resources. If one wanted to know what was the population of Milwaukee, or how many slot machines there were in Saudia Arabia***, one turned to the library for answers. At the reference desk we had a variety of reference materials: encyclopedias, almanacs, dictionaries, atlases. The questions that we could answer quickly were called "ready reference." These responses were generally factual.

Because the ready reference service didn't require anything of the user except to ask the question, we also provided this service over the phone to anyone who called in. We considered ourselves at the forefront of modern information services when someone would call and ask us: "Who won best actor in 1937?" OK, it probably was a bar bet or a crossword puzzle clue but we answered, proud of ourselves.

I was reminded of all this by a recent article in Wired magazine, "Alexa, I Want Answers."[1] The argument as presented in the article is that what people REALLY want is an answer; they don't want to dig through books and journals at the library; they don't even want an online search that returns a page of results; what they want is to ask a question and get an answer, a single answer. What they want is "ready reference" by voice, in their own home, without having to engage with a human being. The article is about the development of the virtual, voice-first, answer machine: Alexa.

There are some obvious observations to be made about this. The glaringly obvious one is that not all questions lend themselves to a single, one sentence answer. Even a question that can be asked concisely may not have a concise answer. One that I recall from those long-ago days on the reference desk was the question: "When did the Vietnam War begin?" To answer this you would need to clarify a number of things: on whose part? US? France? Exactly what do you mean by begin? First personnel? First troops? Even with these details in hand experts would differ in their answers.

Another observation is that in the question/answer method over a voice device like Alexa, replying with a lengthy answer is not foreseen. Voice-first systems are backed by databases of facts, not explanatory texts. Like a GPS system they take facts and render them in a way that seems conversational. Your GPS doesn't reply with the numbers of longitude and latitude, and your weather app wraps the weather data in phrases like: "It's 63 degrees outside and might rain later today." It doesn't, however, offer a lengthy discourse on the topic. Just the facts, ma'am.[3]

It is very troubling that we have no measure of the accuracy of these answers. There are quite a few anecdotes about wrong answers (especially amusing ones) from voice assistants, but I haven't seen any concerted studies of the overall accuracy rate. Studies of this nature were done in the 1970's and 1980's on library reference services, and the results were shocking. Even though library reference was done by human beings who presumably would be capable of detecting wrong answers, the accuracy of answers hovered around 50-60%.[2] Repeated studies came up with similar results, and library journals were filled with articles about this problem. The  solution offered was to increase training of reference staff. Before the problem could be resolved, however, users who previously had made use of "ready reference" had moved on to in-sourcing their own reference questions by using the new information system: the Internet. If there still is ready reference occuring in libraries, it is undoubtedly greatly reduced in the number of questions asked, and it doesn't appear that studying the accuracy is on our minds today.

I have one final observation, and that is that we do not know the source(s) of the information behind the answers given by voice assistants. The companies behind these products have developed databases that are not visible to us, and no source information is given for individual answers. The voice-activated machines themselves are not the main product: they are mere user interfaces, dressed up with design elements that make them appealing as home decor. The data behind the machines is what is being sold, and is what makes the machines useful. With all of the recent discussion of algorithmic bias in artificial intelligence we should be very concerned about where these answers come from, and we should seriously consider if "answers" to some questions are even appropriate or desirable.

Now, I have question: how is it possible that so much of our new technology is based on so little intellectual depth? Is reductionism an essential element of technology,  or could we do better? I'm not going to ask Alexa**** for an answer to that.

[1] Vlahos, James. “Alexa, I Want Answers.” Wired, vol. 27, no. 3, Mar. 2019, p. 58. (Try EBSCO)
[2] Weech, Terry L. “Review of The Accuracy of Telephone Reference/Information Services in Academic Libraries: Two Studies.” The Library Quarterly: Information, Community, Policy, vol. 54, no. 1, 1984, pp. 130–31.

* The only computers we saw were the ones on Star Trek (1966), and those were clearly a fiction.
** This was also the era in which the gas station attendent pumped your gas, washed your windows, and checked your oil while you waited in your car.
*** The question about Saudia Arabia is one that I actually got. I also got the one about whether there were many "colored people" in Haiti. I don't remember how I answered the former, but I do remember that the user who asked the latter was quite disappointed with the answer. I think he decided not to go.
**** Which I do not have; I find it creepy even though I can imagine some things for which it could be useful.

Tuesday, March 12, 2019

I'd like to buy a VOWEL

One of the "defects" of RDF for data management is that it does not support business rules. That's a generality, so let me explain a bit.

Most data is constrained - it has rules for what is and what is not allowed. These rules can govern things like cardinality (is it required? is it repeatable?), value types (date, currency, string, IRI), and data relationships (If A, then not B; either A or B+C). This controlling aspect of data is what many data stores are built around; a bank, a warehouse, or even a library manage their activities through controlled data.

RDF has a different logical basis. RDF allows you to draw conclusions from the data (called "inferencing") but there is no mechanism of control that would do what we are accustomed to with our current business rules. This seems like such an obvious lack that you might wonder just how the developers of RDF thought it would be used. The answer is that they were not thinking about banking or company databases. The main use case for RDF development was using artificial intelligence-like axioms on the web. That's a very different use case from the kind of data work that most of us engage in.

RDF is characterized by what is called the "open world assumption" which says that:

- at any moment a set of data may be incomplete; that does not make it illegitimate
- anyone can say anything about anything; like the web in general there are no controls over what can and cannot be stated and who can participate

However, RDF is being used in areas where data with controls was once employed; where data is validated for quality and rejected if it doesn't meet certain criteria; where operating on the data is limited to approved actors. This means that we have a mis-match between our data model and some of the uses of that data model.

This mis-match was evident to people using RDF in their business operations. W3C held a preliminary meeting on "Validation of Data Shapes" in which there were presentations over two days that demonstrated some of the solutions that people had developed.  This then led to the Data Shapes working group in 2014 which produced the shapes validation language, SHACL (SHApes Constraint Language) in 2017. Of the interesting ways that people had developed to validate their RDF data, the use of SPARQL searches to determine if expected patterns were met became the basis for SHACL. Another RDF validation language, ShEx (Shape Expressions), is independent of SPARQL but has essentially the same functionality of SHACL. There are other languages as well (SPIN, StarDog, etc.) and they all assume a closed world rather than the open world of RDF.

My point on all this is to note that we now have a way to validate RDF instance data but no standard way(s) to define our metadata schema, with constraints, that we can use to produce that data. It's kind of a "tail wagging the dog" situation. There have been musings that the validation languages could also be used for metadata definition, but we don't have a proof of concept and I'm a bit skeptical. The reason I'm skeptical is that there's a certain human-facing element in data design and creation that doesn't need to be there in the validation phase. While there is no reason why the validation languages cannot also contain or link to term definitions, cataloging rules, etc. these would be add-ons. The validation languages also do most of their work at the detailed data level, while some guidance for humans happens at the macro definition of a data model - What is this data for? Who is the audience? What should the data creator know or research before beginning? What are the reference texts that one should have access to? While admittedly the RDA Toolkit  used in library data creation is an extreme form of the genre, you can see how much more there is beyond defining specific data elements and their valid values. Using a metadata schema in concert with RDF validation - yes! That's a winning combination, but I think we need bot.

Note that there are also efforts to use the validation languages to analyze existing graphs.(PDF) These could be a quick way to get an overview of data for which you have no description, but the limitations of this technique are easy to spot. They have basically the same problem that AI training datasets do: you only learn what is in that dataset, not the full range of possible graphs and values that can be produced. If your data is very regular then this analysis can be quite helpful; if your data has a lot of variation (as, for example, bibliographic data does) then the analysis of a single file of data may not be terribly helpful. At the same time, exercising the validation languages in this way is one way to discover how we can use algorithms to "look at" RDF data.

Another thing to note is that there's also quite a bit of "validation" that the validation languages do not handle, such as the reconciliation work that if often done in OpenRefine. The validation languages take an atomistic view of the data, not an overall one. I don't see a way to ask the question "Is this entry compatible with all of the other entries in this file?" That the validation languages don't cover this is not a fault, but it must be noted that there is other validation that may need to be done.

WOL, meet WVL


We need a data modeling language that is suitable to RDF data, but that provides actual constraints, not just inferences. It also needs to allow one to choose a closed world rule. The RDF suite of standards has provided the Web Ontology Language, which should be WOL but has been given the almost-acronym name of OWL. OWL does define "constraints", but they aren't constraints in the way we need for data creation. OWL constrains the axioms of inference. That means that it gives you rules to use when operating over a graph of data, and it still works in the open world. The use of the term "ontology" also implies that this is a language for the creation of new terms in a single namespace. That isn't required, but that is becoming a practice.

What we need is a web vocabulary language. WVL. But using the liberty that went from WOL to OWL, we can go from WVL to VWL, and that can be nicely pronounced as VOWEL. VOWEL (I'm going to write it like that because it isn't familiar to readers yet) can supply the constrained world that we need for data creation. It is not necessarily an RDF-based language, but it will use HTTP identifiers for things. It could function as linked data but it also can be entirely in a closed world. Here's what it needs to do:
  • describe the things of the metadata
  • describe the statements about those things and the values that are valid for those statements
  • give cardinality rules for things and statements
  • constrain values by type
  • give a wide range of possibilities for defining values, such as lists, lists of namespaces, ranges of computable values, classes, etc.
  • for each thing and statement have the ability to carry definitions and rules for input and decision-making about the value
  • can be serialized in any language that can handle key/value pairs or triples
  • can (hopefully easily) be translatable to a validation language or program
Obviously there may be more. This is not fully-formed yet, just the beginning. I have defined some of it in a github repo. (Ignore the name of the repo - that came from an earlier but related project.) That site also has some other thoughts, such as design patterns, a requirements document, and some comparison between existing proposals, such as the Dublin Core community's Description Set Profile, BIBFRAME, and soon Stanford's profle generator, Sinopia.

One of the ironies of this project is that VOWEL needs to be expressed as a VOWEL. Presumably one could develop an all-new ontology for this, but the fact is that most of what is needed exists already. So this gets meta right off the bat which makes it a bit harder to think about but easier to produce.

There will be a group starting up in the Dublin Core space to continue development of this idea. I will announce that widely when it happens. I think we have some real possibilities here, to make VOWEL a reality. One of my goals will be to follow the general principles of the original Dublin Core metadata, which is that simple wins out over complex, and it's easier to complex-ify simple than to simplify complex.

Monday, January 28, 2019

FRBR without FR or BR

(This is something I started working on that turns out to be a "pulled thread" - something that keeps on unwinding the more I work on it. What's below is a summary, while I decide what to do with the longer piece.)

FRBR was developed for the specific purpose of modeling library catalog data. I give the backstory on FRBR in chapter 5 of my book, "FRBR Before and After." The most innovative aspect of FRBR was the development of a multi-entity view of creative works. Referred to as "group 1" of three groups of entities, the entities described there are Work, Expression, Manifestation, and Item (WEMI). They are aligned with specific bibliographic elements used in library catalogs, and are defined with a rigid structure: the entities are linked to each other in a single chain; the data elements are defined each as being valid for one and only one entity; all WEMI entities are disjoint.

In spite of these specifics, something in that group 1 has struck a chord for metadata designers who do not adhere to the library catalog model as described in FRBR. In fact, some mentions or uses of WEMI are not even bibliographic in nature.* This leads me to conclude that a version of WEMI that is not tied to library catalog concepts could provide an interesting core of classes for metadata that describes creative or created resources.

We already have some efforts that have stepped away from the specifics of FRBR. From 2005 there is the first RDF FRBR ontology, frbrCore, which defines the entities of FRBR and key relationships between them as RDF classes. This ontology breaks away from FRBR in that it creates super-classes that are not defined in FRBR, but it retains the disjointness between the primary entities. We also have FRBRoo which is a FRBR-ized version of the CIDOC museum metadata model. This extends the number of classes to include some that represent processes that are not in the static model of the library catalog. In addition we have FaBiO, a bibliographic ontology that uses frbrCore classes but extends the WEMI-based classes with dozens of sub-classes that represent types of works and expressions.

I conclude that there is something in the ability to describe the abstraction of work apart from the concrete item that is useful in many areas. The intermediate entities, defined in FRBR as expression and manifestation, may have a role depending on the material and the application for which the metadata is being developed. Other intermediate entities may be useful at times. But as a way to get started, we can define four entities (which are "classes" in RDF) that parallel the four group 1 entities in FRBR. I would like to give these new names to distance them from FRBR, but that may not be possible as people have already absorbed the FRBR terminology.

FRBR            /   option1 / option2
work               / idea        / creative work
expression      / creation  / realization
manifestation / object     / product
item                / instance / individual

My preferred rules for these classes are:
  • any entity can be iterative (e.g. a work of a work)
  • any entity can have relationships/links to any other entity
  • no entity has an inherent dependency on any other entity
  • any entity can be used alone or in concert with other entities
  • no entities are disjoint
  • anyone can define additional entities or subclasses   
  • individual profiles using the model may recommend or limit attributes and relationships, but the model itself will not have restrictions
This implements a a theory of ontology development known as "minimum semantic commitment." In this theory,  base vocabulary terms should be defined with as little semantics as possible, with semantics in this sense being the axiomatic semantics of RDF. An ontology whose terms have high semantic definition, such as the original FRBR, will provide fewer opportunities for re-use because uses must adhere to the tightly defined semantics in the original ontology. Less commitment in the base ontology means that there are greater opportunities for re-use; desired semantics can be defined in specific implementations through the creation of application profiles.

Given this freedom, how would people choose to describe creative works? For example, here's one possible way to describe a work of art:

    title: Acrobats
    creator: Paul Klee
    genre: abstract art
    topic: acrobats
    date: 1914
    size: 9 x 9
    base material: paper
    material: watercolor, pastel, ink
    color: mixed
    signed: PKlee
    dated: 1914
And here's a way to describe a museum store's inventory record for a print:

    title: Acrobats
    creator: Paul Klee
    genre: abstract art
    topic: acrobats
    date: 1914
    description: 12-color archival inkjet print
    size: 24 x 36 inches
    price: $16.99
There is also no reason why a non-creative product couldn't use the manifestation class (which is one of the reasons that I would prefer to call it "product," which would resonate better for these potential users):

    description: dining chair
    dimensions: 26 x 23 x 21.5 inches
    weight:  21 pounds
    color: gray
    manufacturer: YEEFY
    price: $49.99
Here is the sum total of what this core WEMI would look like, still using the FRBR terminology:

<> rdf:type owl:Class ;
    rdfs:label "Work"@en ;
    rdfs:comment: "The creative work as abstraction."@en .

<> rdf:type owl:Class ;
    rdfs:label "Expression"@en ;
    rdfs:comment: "The creative work as it is expressed in a potentially perceivable form."@en .

<> rdf:type owl:Class ;                                                             rdfs:label "Manifestation"@en ;
    rdfs:comment: "The physical product that contains the creative work."@en .

<> rdf:type owl:Class ;
    rdfs:label "Item"@en ;
    rdfs:comment: "An instance or individual copy of the creative work."@en .

I can see communities like Dublin Core and as potential locations for these proposed classes because they represent general metadata communities, not just the GLAM world of IFLA. (I haven't approached them.) I'm open to hearing other ideas for hosting this, as well as comments on the ideas here. For it? Against it? Is there a downside?

* Examples of some "odd" references to FRBR for use in metadata for: