Saturday, November 23, 2019

The Work

The word "work" generally means something brought about by human effort, and at times implies that this effort involves some level of creativity. We talk about "works of art" referring to paintings hanging on walls. The "works" of Beethoven are a large number of musical pieces that we may have heard. The "works" of Shakespeare are plays, in printed form but also performed. In these statements the "work" encompasses the whole of the thing referred to, from the intellectual content to the final presentation.

This is not the same use of the term as is found in the Library Reference Model (LRM). If you are unfamiliar with the LRM, it is the successor to FRBR (which I am assuming you have heard of) and it includes the basic concepts of work, expression, manifestation and item that were first introduced in that previous study. "Work," as used in the LRM, is a concept designed for use in library cataloging data. It is narrower than the common use of the term illustrated in the previous paragraph and is defined thus:
Class: Work
Definition: An abstract notion of an artistic or intellectual creation.
In this definition the term only includes the idea of a non-corporeal conceptual entity, not the totality that would be implied in the phrase "the works of Shakespeare." That totality is described when the work is realized through an LRM-defined "expression" which in turn is produced in an LRM-defined "manifestation" with an LRM-defined "item" as its instance.* These four entities are generally referred to as a group with the acronym WEMI.

Because many in the library world are very familiar with the LRM definition of work, we have to use caution when using the word outside the specific LRM environment. In particular, we must not impose the LRM definition on uses of the word that do not intend that meaning. One should expect that the LRM definition of work would rarely be found in any conversation that is not about the library cataloging model for which it was defined. However, it is harder to distinguish uses within the library world, where one might expect the use to adhere to the LRM.

To show this, I want to propose a particular use case. Let's say that a very large bibliographic database has many records of bibliographic description. The use case is that it is deemed to be easier for users to navigate that large database if they could get search results that cluster works rather than getting long lists of similar or nearly identical bibliographic items. Logically the cluster looks like this:

In data design, it will have a form something like this:
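
A minimal sketch of one such form, assuming hypothetical record fields and a hypothetical `work_key` function that stands in for whatever matching logic judges two records to carry the same intellectual content (none of these names come from an actual system, and the sample records are invented):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BibRecord:
    """A full bibliographic description (hypothetical, simplified fields)."""
    record_id: str
    title: str
    creator: str
    edition: str


def work_key(record: BibRecord) -> str:
    """Hypothetical clustering key: normalized creator plus normalized title."""
    return f"{record.creator.strip().lower()}|{record.title.strip().lower()}"


def cluster_by_work(records: list[BibRecord]) -> dict[str, list[str]]:
    """Group record identifiers under a shared work-cluster key."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for record in records:
        clusters[work_key(record)].append(record.record_id)
    return dict(clusters)


# Invented sample records; the edition labels are placeholders.
records = [
    BibRecord("r1", "Hamlet", "Shakespeare, William", "edition A"),
    BibRecord("r2", "hamlet ", "Shakespeare, William", "edition B"),
    BibRecord("r3", "Macbeth", "Shakespeare, William", "edition A"),
]
clusters = cluster_by_work(records)
```

Each cluster key points at whole bibliographic records, not at expressions, which is the distinction the following paragraphs turn on.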

This is a great idea, and it does appear to have a similarity to the LRM definition of work: it is gathering those bibliographic entries that are judged to represent the same intellectual content. However, there are reasons why the LRM-defined work could not be used in this instance.

The first is that there is only one WEMI relationship for work, and that is from LRM work to LRM expression. Clearly the bibliographic records in this large library catalog are not LRM expressions; they are full bibliographic descriptions including, potentially, all of the entities defined in the LRM.

To this you might say: but there is expression data in the bibliographic record, so we can think of this work as linking to the expression data in that record. That leads us to the second reason: the entities of WEMI are defined as being disjoint. That means that no single "thing" can be more than one of those entities; nothing can be simultaneously a work and an expression, or any other combination of WEMI entities. So if the only link we have available in the model is from work to expression, unless we can somehow convince ourselves that the bibliographic record ONLY represents the expression (which it clearly does not since it has data elements from at least three of the LRM entities) any such link will violate the rule of disjointness.
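
The disjointness argument can be illustrated with a small sketch. The mapping of data elements to entities and the helper names are assumptions made for illustration; they are not taken from the LRM itself:

```python
WEMI = {"work", "expression", "manifestation", "item"}

# Hypothetical mapping of bibliographic data elements to the one entity that owns each.
ELEMENT_ENTITY = {
    "subject": "work",
    "language": "expression",
    "publisher": "manifestation",
    "barcode": "item",
}


def entities_described(record: dict[str, str]) -> set[str]:
    """Return the WEMI entities whose data elements appear in the record."""
    return {ELEMENT_ENTITY[name] for name in record if name in ELEMENT_ENTITY}


def can_be_typed_as(record: dict[str, str], entity: str) -> bool:
    """Under disjointness, a record can stand for `entity` only if every
    WEMI element it carries belongs to that single entity."""
    if entity not in WEMI:
        raise ValueError(f"unknown entity: {entity}")
    return entities_described(record) <= {entity}


# An invented bibliographic record carrying elements from three entities.
bib_record = {"subject": "Denmark", "language": "English", "publisher": "Example Press"}
```

Here `can_be_typed_as(bib_record, "expression")` is false because the record also carries work and manifestation elements, so a work-to-expression link that targets the whole record would break the rule.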

Therefore, the work in our library system can have much in common with the conceptual definition of the LRM work, but it is not the same work entity as is defined in that model.

This brings me back to my earlier blog post with a proposal for a generalized definition of WEMI-like entities for created works.  The WEMI concepts are useful in practice, but the LRM model has some constraints that prevent some desirable uses of those entities. Providing unconstrained entities would expand the utility of the WEMI concepts both within the library community, as evidenced by the use case here, and in the non-library communities that I highlight in that previous blog post and in a slide presentation.

To be clear, "unconstrained" refers not only to the removal of the disjointness between entities, but also to allowing the creation of links between the WEMI entities and non-WEMI entities, something that is not anticipated in the LRM. The work cluster of bibliographic records would need a general relationship, perhaps, as in the case of VIAF, linked through a shared cluster identifier and an entity type identifying the cluster as representing an unconstrained work.

* The other terms are defined in the LRM as:

Class: Expression
Definition: A realization of a single work usually in a physical form.

Class: Manifestation
Definition: The physical embodiment of one or more expressions.

Class: Item
Definition: An exemplar of a single manifestation.

Monday, April 08, 2019

I, too, want answers

Around 1966-67 I worked on the reference desk at my local public library. For those too young to remember, this was a time when all information was in paper form, and much of that paper was available only at the library. The Internet was just a twinkle in the eye of some scientists at DARPA, and none of us had any idea what kind of information environment was in our future.* The library had a card catalog and the latest thing was that check-outs were somehow recorded on microfilm, as I recall.

As you entered the library the reference desk was directly in front of you in the prime location in the middle of the main room. A large number of library users went directly to the desk upon entering. Some of these users had particular research in mind: a topic, an author, or a title. They came to the reference desk to find the quickest route to what they sought. The librarian would take them to the card catalog, would look up the entry, and perhaps even go to the shelf with the user to look for the item.**

There was another type of reference request: a request for facts, not resources. If one wanted to know what was the population of Milwaukee, or how many slot machines there were in Saudi Arabia***, one turned to the library for answers. At the reference desk we had a variety of reference materials: encyclopedias, almanacs, dictionaries, atlases. The questions that we could answer quickly were called "ready reference." These responses were generally factual.

Because the ready reference service didn't require anything of the user except to ask the question, we also provided this service over the phone to anyone who called in. We considered ourselves at the forefront of modern information services when someone would call and ask us: "Who won best actor in 1937?" OK, it probably was a bar bet or a crossword puzzle clue but we answered, proud of ourselves.

I was reminded of all this by a recent article in Wired magazine, "Alexa, I Want Answers."[1] The argument as presented in the article is that what people REALLY want is an answer; they don't want to dig through books and journals at the library; they don't even want an online search that returns a page of results; what they want is to ask a question and get an answer, a single answer. What they want is "ready reference" by voice, in their own home, without having to engage with a human being. The article is about the development of the virtual, voice-first, answer machine: Alexa.

There are some obvious observations to be made about this. The glaringly obvious one is that not all questions lend themselves to a single, one-sentence answer. Even a question that can be asked concisely may not have a concise answer. One that I recall from those long-ago days on the reference desk was the question: "When did the Vietnam War begin?" To answer this you would need to clarify a number of things: on whose part? US? France? Exactly what do you mean by begin? First personnel? First troops? Even with these details in hand experts would differ in their answers.

Another observation is that in the question/answer method over a voice device like Alexa, replying with a lengthy answer is not foreseen. Voice-first systems are backed by databases of facts, not explanatory texts. Like a GPS system they take facts and render them in a way that seems conversational. Your GPS doesn't reply with the numbers of longitude and latitude, and your weather app wraps the weather data in phrases like: "It's 63 degrees outside and might rain later today." It doesn't, however, offer a lengthy discourse on the topic. Just the facts, ma'am.[3]

It is very troubling that we have no measure of the accuracy of these answers. There are quite a few anecdotes about wrong answers (especially amusing ones) from voice assistants, but I haven't seen any concerted studies of the overall accuracy rate. Studies of this nature were done in the 1970s and 1980s on library reference services, and the results were shocking. Even though library reference was done by human beings who presumably would be capable of detecting wrong answers, the accuracy of answers hovered around 50-60%.[2] Repeated studies came up with similar results, and library journals were filled with articles about this problem. The solution offered was to increase training of reference staff. Before the problem could be resolved, however, users who previously had made use of "ready reference" had moved on to in-sourcing their own reference questions by using the new information system: the Internet. If there still is ready reference occurring in libraries, it is undoubtedly greatly reduced in the number of questions asked, and it doesn't appear that studying the accuracy is on our minds today.

I have one final observation, and that is that we do not know the source(s) of the information behind the answers given by voice assistants. The companies behind these products have developed databases that are not visible to us, and no source information is given for individual answers. The voice-activated machines themselves are not the main product: they are mere user interfaces, dressed up with design elements that make them appealing as home decor. The data behind the machines is what is being sold, and is what makes the machines useful. With all of the recent discussion of algorithmic bias in artificial intelligence we should be very concerned about where these answers come from, and we should seriously consider if "answers" to some questions are even appropriate or desirable.

Now, I have a question: how is it possible that so much of our new technology is based on so little intellectual depth? Is reductionism an essential element of technology, or could we do better? I'm not going to ask Alexa**** for an answer to that.

[1] Vlahos, James. “Alexa, I Want Answers.” Wired, vol. 27, no. 3, Mar. 2019, p. 58. (Try EBSCO)
[2] Weech, Terry L. “Review of The Accuracy of Telephone Reference/Information Services in Academic Libraries: Two Studies.” The Library Quarterly: Information, Community, Policy, vol. 54, no. 1, 1984, pp. 130–31.

* The only computers we saw were the ones on Star Trek (1966), and those were clearly a fiction.
** This was also the era in which the gas station attendant pumped your gas, washed your windows, and checked your oil while you waited in your car.
*** The question about Saudi Arabia is one that I actually got. I also got the one about whether there were many "colored people" in Haiti. I don't remember how I answered the former, but I do remember that the user who asked the latter was quite disappointed with the answer. I think he decided not to go.
**** Which I do not have; I find it creepy even though I can imagine some things for which it could be useful.

Tuesday, March 12, 2019

I'd like to buy a VOWEL

One of the "defects" of RDF for data management is that it does not support business rules. That's a generality, so let me explain a bit.

Most data is constrained - it has rules for what is and what is not allowed. These rules can govern things like cardinality (is it required? is it repeatable?), value types (date, currency, string, IRI), and data relationships (If A, then not B; either A or B+C). This controlling aspect of data is what many data stores are built around; a bank, a warehouse, or even a library manage their activities through controlled data.
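
As a rough illustration of the three kinds of rules named above, here is a minimal sketch. The field names, the rule table, and the `validate` function are all hypothetical; real systems express these rules in many different ways:

```python
from datetime import date

# Hypothetical rule table: field -> (required, repeatable, expected value type).
RULES = {
    "title": (True, False, str),
    "issued": (False, False, date),
    "subject": (False, True, str),
    "isbn": (False, True, str),
    "issn": (False, False, str),
}

# Hypothetical relationship rule of the form "if A, then not B".
MUTUALLY_EXCLUSIVE = [("isbn", "issn")]


def validate(record: dict[str, list]) -> list[str]:
    """Return rule violations; an empty list means the record is accepted."""
    errors = []
    for field, (required, repeatable, value_type) in RULES.items():
        values = record.get(field, [])
        if required and not values:
            errors.append(f"{field}: required")  # cardinality
        if not repeatable and len(values) > 1:
            errors.append(f"{field}: not repeatable")  # cardinality
        if any(not isinstance(v, value_type) for v in values):
            errors.append(f"{field}: expected {value_type.__name__}")  # value type
    for a, b in MUTUALLY_EXCLUSIVE:
        if record.get(a) and record.get(b):
            errors.append(f"{a} and {b}: mutually exclusive")  # relationship
    # Anything not named in the rule table is rejected outright.
    for field in sorted(record.keys() - RULES.keys()):
        errors.append(f"{field}: not allowed")
    return errors
```

The last loop is the controlling part: data that the rules do not anticipate is refused rather than quietly accepted.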

RDF has a different logical basis. RDF allows you to draw conclusions from the data (called "inferencing") but there is no mechanism of control that would do what we are accustomed to with our current business rules. This seems like such an obvious lack that you might wonder just how the developers of RDF thought it would be used. The answer is that they were not thinking about banking or company databases. The main use case for RDF development was using artificial intelligence-like axioms on the web. That's a very different use case from the kind of data work that most of us engage in.

RDF is characterized by what is called the "open world assumption" which says that:

- at any moment a set of data may be incomplete; that does not make it illegitimate
- anyone can say anything about anything; like the web in general there are no controls over what can and cannot be stated and who can participate

However, RDF is being used in areas where data with controls was once employed; where data is validated for quality and rejected if it doesn't meet certain criteria; where operating on the data is limited to approved actors. This means that we have a mis-match between our data model and some of the uses of that data model.
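
The difference between the two assumptions can be sketched with a toy set of triples. The identifiers and function names below are invented for illustration and do not come from any RDF library:

```python
from typing import Optional

# A tiny graph of (subject, predicate, object) triples with invented identifiers.
triples = {
    ("book1", "title", "Hamlet"),
    ("book1", "creator", "Shakespeare"),
    ("book2", "title", "Untitled draft"),
}


def has_value(subject: str, predicate: str) -> bool:
    """True if the graph states any value for this subject and predicate."""
    return any(s == subject and p == predicate for s, p, _ in triples)


def open_world(subject: str, predicate: str) -> Optional[bool]:
    """Open world: a missing statement is unknown (None), never false."""
    return True if has_value(subject, predicate) else None


def closed_world(subject: str, predicate: str) -> bool:
    """Closed world: a missing statement counts as false, so data can be rejected."""
    return has_value(subject, predicate)
```

Asking whether `book2` has a creator yields `None` under the open world reading and `False` under the closed one; only the second gives a validator something to act on.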

This mis-match was evident to people using RDF in their business operations. W3C held a preliminary meeting on "Validation of Data Shapes" in which there were presentations over two days that demonstrated some of the solutions that people had developed. This then led to the Data Shapes working group in 2014 which produced the shapes validation language, SHACL (SHApes Constraint Language) in 2017. Of the interesting ways that people had developed to validate their RDF data, the use of SPARQL searches to determine if expected patterns were met became the basis for SHACL. Another RDF validation language, ShEx (Shape Expressions), is independent of SPARQL but has essentially the same functionality as SHACL. There are other languages as well (SPIN, Stardog, etc.) and they all assume a closed world rather than the open world of RDF.

My point on all this is to note that we now have a way to validate RDF instance data but no standard way(s) to define our metadata schema, with constraints, that we can use to produce that data. It's kind of a "tail wagging the dog" situation. There have been musings that the validation languages could also be used for metadata definition, but we don't have a proof of concept and I'm a bit skeptical. The reason I'm skeptical is that there's a certain human-facing element in data design and creation that doesn't need to be there in the validation phase. While there is no reason why the validation languages cannot also contain or link to term definitions, cataloging rules, etc., these would be add-ons. The validation languages also do most of their work at the detailed data level, while some guidance for humans happens at the macro definition of a data model - What is this data for? Who is the audience? What should the data creator know or research before beginning? What are the reference texts that one should have access to? While admittedly the RDA Toolkit used in library data creation is an extreme form of the genre, you can see how much more there is beyond defining specific data elements and their valid values. Using a metadata schema in concert with RDF validation - yes! That's a winning combination, but I think we need both.

Note that there are also efforts to use the validation languages to analyze existing graphs (PDF). These could be a quick way to get an overview of data for which you have no description, but the limitations of this technique are easy to spot. They have basically the same problem that AI training datasets do: you only learn what is in that dataset, not the full range of possible graphs and values that can be produced. If your data is very regular then this analysis can be quite helpful; if your data has a lot of variation (as, for example, bibliographic data does) then the analysis of a single file of data may not be terribly helpful. At the same time, exercising the validation languages in this way is one way to discover how we can use algorithms to "look at" RDF data.

Another thing to note is that there's also quite a bit of "validation" that the validation languages do not handle, such as the reconciliation work that is often done in OpenRefine. The validation languages take an atomistic view of the data, not an overall one. I don't see a way to ask the question "Is this entry compatible with all of the other entries in this file?" That the validation languages don't cover this is not a fault, but it must be noted that there is other validation that may need to be done.

WOL, meet WVL


We need a data modeling language that is suitable to RDF data, but that provides actual constraints, not just inferences. It also needs to allow one to choose a closed world rule. The RDF suite of standards has provided the Web Ontology Language, which should be WOL but has been given the almost-acronym name of OWL. OWL does define "constraints", but they aren't constraints in the way we need for data creation. OWL constrains the axioms of inference. That means that it gives you rules to use when operating over a graph of data, and it still works in the open world. The use of the term "ontology" also implies that this is a language for the creation of new terms in a single namespace. That isn't required, but that is becoming a practice.

What we need is a web vocabulary language. WVL. But using the liberty that went from WOL to OWL, we can go from WVL to VWL, and that can be nicely pronounced as VOWEL. VOWEL (I'm going to write it like that because it isn't familiar to readers yet) can supply the constrained world that we need for data creation. It is not necessarily an RDF-based language, but it will use HTTP identifiers for things. It could function as linked data but it also can be entirely in a closed world. Here's what it needs to do:
  • describe the things of the metadata
  • describe the statements about those things and the values that are valid for those statements
  • give cardinality rules for things and statements
  • constrain values by type
  • give a wide range of possibilities for defining values, such as lists, lists of namespaces, ranges of computable values, classes, etc.
  • for each thing and statement have the ability to carry definitions and rules for input and decision-making about the value
  • be serializable in any language that can handle key/value pairs or triples
  • be (hopefully easily) translatable to a validation language or program
Obviously there may be more. This is not fully formed yet, just the beginning. I have defined some of it in a GitHub repo. (Ignore the name of the repo - that came from an earlier but related project.) That site also has some other thoughts, such as design patterns, a requirements document, and some comparison between existing proposals, such as the Dublin Core community's Description Set Profile, BIBFRAME, and soon Stanford's profile generator, Sinopia.
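
To make the list of requirements a little more concrete, here is a rough sketch of what a profile in this spirit might look like as plain key/value pairs, along with a small function that turns it into a validator. Every key name, identifier, and rule below is invented for illustration; this is not the format defined in the repo mentioned above:

```python
import json

# Hypothetical profile: one "thing" with two statements, each carrying a
# human-facing definition alongside its cardinality and value constraints.
profile = {
    "thing": "http://example.org/Book",
    "closed": True,  # closed world: undefined statements are errors
    "statements": {
        "http://example.org/title": {
            "label": "Title",
            "definition": "The name given to the resource.",
            "min": 1,
            "max": 1,
            "value_type": "string",
        },
        "http://example.org/format": {
            "label": "Format",
            "definition": "The physical or digital format.",
            "min": 0,
            "max": None,  # None means repeatable without limit
            "value_type": "string",
            "values": ["print", "ebook", "audio"],
        },
    },
}

TYPES = {"string": str, "integer": int}


def validate(description: dict, profile: dict) -> list[str]:
    """Check a description (statement IRI -> list of values) against the profile."""
    errors = []
    for prop, rule in profile["statements"].items():
        values = description.get(prop, [])
        if len(values) < rule["min"]:
            errors.append(f"{rule['label']}: too few values")
        if rule["max"] is not None and len(values) > rule["max"]:
            errors.append(f"{rule['label']}: too many values")
        for value in values:
            if not isinstance(value, TYPES[rule["value_type"]]):
                errors.append(f"{rule['label']}: wrong value type")
            elif "values" in rule and value not in rule["values"]:
                errors.append(f"{rule['label']}: value '{value}' not in list")
    if profile["closed"]:
        for prop in sorted(description.keys() - profile["statements"].keys()):
            errors.append(f"{prop}: not defined in profile")
    return errors


# Key/value structures like this serialize directly, here to JSON.
serialized = json.dumps(profile, indent=2)
```

The same dictionary serves both audiences: the definitions are there for the person creating the data, and the cardinality and value rules are what a program checks.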

One of the ironies of this project is that VOWEL needs to be expressed as a VOWEL. Presumably one could develop an all-new ontology for this, but the fact is that most of what is needed exists already. So this gets meta right off the bat which makes it a bit harder to think about but easier to produce.

There will be a group starting up in the Dublin Core space to continue development of this idea. I will announce that widely when it happens. I think we have some real possibilities here, to make VOWEL a reality. One of my goals will be to follow the general principles of the original Dublin Core metadata, which is that simple wins out over complex, and it's easier to complex-ify simple than to simplify complex.

Monday, January 28, 2019

FRBR without FR or BR

(This is something I started working on that turns out to be a "pulled thread" - something that keeps on unwinding the more I work on it. What's below is a summary, while I decide what to do with the longer piece.)

FRBR was developed for the specific purpose of modeling library catalog data. I give the backstory on FRBR in chapter 5 of my book, "FRBR Before and After." The most innovative aspect of FRBR was the development of a multi-entity view of creative works. Referred to as "group 1" of three groups of entities, the entities described there are Work, Expression, Manifestation, and Item (WEMI). They are aligned with specific bibliographic elements used in library catalogs, and are defined with a rigid structure: the entities are linked to each other in a single chain; the data elements are defined each as being valid for one and only one entity; all WEMI entities are disjoint.

In spite of these specifics, something in that group 1 has struck a chord for metadata designers who do not adhere to the library catalog model as described in FRBR. In fact, some mentions or uses of WEMI are not even bibliographic in nature.* This leads me to conclude that a version of WEMI that is not tied to library catalog concepts could provide an interesting core of classes for metadata that describes creative or created resources.

We already have some efforts that have stepped away from the specifics of FRBR. From 2005 there is the first RDF FRBR ontology, frbrCore, which defines the entities of FRBR and key relationships between them as RDF classes. This ontology breaks away from FRBR in that it creates super-classes that are not defined in FRBR, but it retains the disjointness between the primary entities. We also have FRBRoo which is a FRBR-ized version of the CIDOC museum metadata model. This extends the number of classes to include some that represent processes that are not in the static model of the library catalog. In addition we have FaBiO, a bibliographic ontology that uses frbrCore classes but extends the WEMI-based classes with dozens of sub-classes that represent types of works and expressions.

I conclude that there is something in the ability to describe the abstraction of work apart from the concrete item that is useful in many areas. The intermediate entities, defined in FRBR as expression and manifestation, may have a role depending on the material and the application for which the metadata is being developed. Other intermediate entities may be useful at times. But as a way to get started, we can define four entities (which are "classes" in RDF) that parallel the four group 1 entities in FRBR. I would like to give these new names to distance them from FRBR, but that may not be possible as people have already absorbed the FRBR terminology.

FRBR          / option 1 / option 2
work          / idea     / creative work
expression    / creation / realization
manifestation / object   / product
item          / instance / individual

My preferred rules for these classes are:
  • any entity can be iterative (e.g. a work of a work)
  • any entity can have relationships/links to any other entity
  • no entity has an inherent dependency on any other entity
  • any entity can be used alone or in concert with other entities
  • no entities are disjoint
  • anyone can define additional entities or subclasses   
  • individual profiles using the model may recommend or limit attributes and relationships, but the model itself will not have restrictions
This implements a theory of ontology development known as "minimum semantic commitment." In this theory, base vocabulary terms should be defined with as little semantics as possible, with semantics in this sense being the axiomatic semantics of RDF. An ontology whose terms have high semantic definition, such as the original FRBR, will provide fewer opportunities for re-use because uses must adhere to the tightly defined semantics in the original ontology. Less commitment in the base ontology means that there are greater opportunities for re-use; desired semantics can be defined in specific implementations through the creation of application profiles.
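
One way to picture the split between an unrestricted base model and a restrictive profile is the sketch below. The `Entity` class, the relation names, and the strict profile are all hypothetical; the profile only approximates FRBR-style rules (one class per entity, links only along the single chain):

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A base-model entity: any classes, any links, no built-in restrictions."""
    id: str
    classes: set[str]
    links: list[tuple[str, "Entity"]] = field(default_factory=list)

    def link(self, relation: str, target: "Entity") -> None:
        self.links.append((relation, target))


# The only class-to-class links a strict, FRBR-like profile would accept.
STRICT_CHAIN = {
    ("work", "expression"),
    ("expression", "manifestation"),
    ("manifestation", "item"),
}


def strict_profile_errors(entity: Entity) -> list[str]:
    """Apply a restrictive profile on top of the unrestricted base model."""
    errors = []
    if len(entity.classes) > 1:
        errors.append(f"{entity.id}: entity has more than one class")
    for relation, target in entity.links:
        for source_class in sorted(entity.classes):
            for target_class in sorted(target.classes):
                if (source_class, target_class) not in STRICT_CHAIN:
                    errors.append(
                        f"{entity.id} {relation} {target.id}: "
                        f"{source_class} -> {target_class} not allowed"
                    )
    return errors


play = Entity("play", {"work"})
film = Entity("film", {"work"})
film.link("basedOn", play)  # a work of a work
record = Entity("record", {"expression", "manifestation"})  # not disjoint
```

The base model accepts `film` and `record` as they are; only a community that chooses the strict profile would see them flagged.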

Given this freedom, how would people choose to describe creative works? For example, here's one possible way to describe a work of art:

    title: Acrobats
    creator: Paul Klee
    genre: abstract art
    topic: acrobats
    date: 1914
    size: 9 x 9
    base material: paper
    material: watercolor, pastel, ink
    color: mixed
    signed: PKlee
    dated: 1914
And here's a way to describe a museum store's inventory record for a print:

    title: Acrobats
    creator: Paul Klee
    genre: abstract art
    topic: acrobats
    date: 1914
    description: 12-color archival inkjet print
    size: 24 x 36 inches
    price: $16.99
There is also no reason why a non-creative product couldn't use the manifestation class (which is one of the reasons that I would prefer to call it "product," which would resonate better for these potential users):

    description: dining chair
    dimensions: 26 x 23 x 21.5 inches
    weight:  21 pounds
    color: gray
    manufacturer: YEEFY
    price: $49.99
Here is the sum total of what this core WEMI would look like, still using the FRBR terminology:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# <> stands in for each class IRI; none is assigned in this post.
<> rdf:type owl:Class ;
    rdfs:label "Work"@en ;
    rdfs:comment "The creative work as abstraction."@en .

<> rdf:type owl:Class ;
    rdfs:label "Expression"@en ;
    rdfs:comment "The creative work as it is expressed in a potentially perceivable form."@en .

<> rdf:type owl:Class ;
    rdfs:label "Manifestation"@en ;
    rdfs:comment "The physical product that contains the creative work."@en .

<> rdf:type owl:Class ;
    rdfs:label "Item"@en ;
    rdfs:comment "An instance or individual copy of the creative work."@en .

I can see communities like Dublin Core and as potential locations for these proposed classes because they represent general metadata communities, not just the GLAM world of IFLA. (I haven't approached them.) I'm open to hearing other ideas for hosting this, as well as comments on the ideas here. For it? Against it? Is there a downside?

* Examples of some "odd" references to FRBR for use in metadata for: