Tuesday, October 13, 2015

SHACL - Shapes Constraint Language

If you've delved into RDF or other Semantic Web technologies, you may have found yourself baffled at times by their tendency to produce data that is open to interpretation. This is, of course, a feature, not a bug. RDF has as the basis of its design something called the "Open World Assumption". The OWA acts more like real life than controlled data stores do, because it allows the answer to many questions to be neither true nor false but "we may not have all of the information." This makes it very hard to do the kind of data control and validity checking that is the norm in databases and in data exchange.

There is an obvious need in some situations to exercise constraints on the data that one manages in RDF. This is particularly true within local systems where data is created and updated, and when exchanging data with known partners. To fill this gap, the RDF Data Shapes Working Group of the World Wide Web Consortium has been working on a new standard, called the SHApes Constraint Language (SHACL), that will perform for RDF the function that XML Schema performs for XML: it will allow software developers to define validity rules for a particular set of RDF.

SHACL has been in development for nearly a year and is just now available as a First Public Working Draft. An FPWD is by no means a finished product, but it is far enough along to give readers an idea of the direction the standard is taking. It is made available at this stage because comment from the larger community is extremely important. The introduction to the draft tells you where to send your comments. (Note: I serve on the working group representing the Dublin Core community, so I will do my best to make sure your comments get full consideration.)

Like many standards, SHACL is not easy to understand. However, I think it will be important for members of the library and other cultural heritage communities to make an effort to weigh in on this standard. Support for SHACL is strong in the "enterprise" sector, among people who primarily work on highly controlled, closed systems like banks and other information-intensive businesses. How SHACL benefits those whose data is designed for the open web may depend on us.

SHACL Basics

The key to understanding SHACL is that it is based in large part on SPARQL, because SPARQL already has formally defined mechanisms that operate on RDF graphs. There will be little if any SHACL functionality that could not be done with SPARQL alone. But the SPARQL queries that perform some of these functions are devilishly difficult to write, so SHACL should provide a cleaner, more constraint-oriented language for the same tasks.

SHACL consists of a core set of constraints that belong to the SHACL language and have SHACL-defined properties; these should be sufficient for most validation needs. SHACL also has a template mechanism that makes it possible for anyone to create a templated constraint to meet additional needs.

What does SHACL look like? It's RDF, so it looks like RDF. Here's a SHACL statement that covers the case "either one foaf:name OR (one foaf:forename AND one foaf:lastname)":

    # (prefix declarations omitted)
    # Either one foaf:name, OR (one foaf:forename AND one foaf:lastname).
    ex:PersonShape
        a sh:Shape ;
        sh:scopeClass foaf:Person ;
        sh:constraint [
            a sh:OrConstraint ;
            sh:shapes (
                [ sh:property [
                      sh:predicate foaf:name ;
                      sh:minCount 1 ;
                      sh:maxCount 1 ;
                  ] ]
                [ sh:property [
                      sh:predicate foaf:forename ;
                      sh:minCount 1 ;
                      sh:maxCount 1 ;
                  ] ;
                  sh:property [
                      sh:predicate foaf:lastname ;
                      sh:minCount 1 ;
                      sh:maxCount 1 ;
                  ]
                ]
            )
        ] .

SHACL shapes can be either open or closed. Open, the default, means that the shape constrains the named properties but ignores any other properties that appear in the same RDF graph. Closed essentially means "these properties and only these properties; everything else is a violation."
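As a minimal sketch of a closed shape: note that I'm borrowing the sh:closed flag from later drafts of the language, so the exact property names here are assumptions rather than the FPWD's own syntax:

    # Sketch of a closed shape: a foaf:Person may have a foaf:name and
    # nothing else. The sh:closed flag is an assumption based on later
    # SHACL drafts; the FPWD may express closedness differently.
    ex:ClosedPersonShape
        a sh:Shape ;
        sh:scopeClass foaf:Person ;
        sh:closed true ;
        sh:property [
            sh:predicate foaf:name ;
            sh:minCount 1 ;
            sh:maxCount 1 ;
        ] .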

There are comparisons, such as "equal/not equal", that act on pairs of properties. There are also constraints on values, such as declared value types (IRI, datatype), lists of valid values, and pattern matching.
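To give a flavor of these, here is a sketch of value constraints on an imaginary book description. The ex: and dct: terms are purely illustrative, and the constraint properties (sh:datatype, sh:nodeKind, sh:pattern) are my reading of the drafts, so check the FPWD itself for the exact names:

    # Sketch: the publication date must be a year, the identifier must
    # match a pattern, and the creator must be an IRI. All names here
    # are illustrative assumptions, not taken from the FPWD text.
    ex:BookShape
        a sh:Shape ;
        sh:scopeClass ex:Book ;
        sh:property [
            sh:predicate dct:issued ;
            sh:datatype xsd:gYear ;
        ] ;
        sh:property [
            sh:predicate dct:identifier ;
            sh:pattern "^ISBN" ;
        ] ;
        sh:property [
            sh:predicate dct:creator ;
            sh:nodeKind sh:IRI ;
        ] .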

The question that needs to be answered about this draft is whether SHACL, as currently defined, meets our needs -- or at least most of them. One way to address this would be to gather some typical and some atypical validation tests that are needed for library and archive data, and try to express those in SHACL. I have a few examples (mainly from Europeana data), but I definitely need more. You can add them to the comments here, send them to me (or send a link to documentation that outlines your data rules), or post them directly to the working group list if you have specific questions.

Thanks in advance.

Tuesday, September 22, 2015

FRBR Before and After - Afterword

Below is a preview of the Afterword of my book, FRBR, Before and After. I had typed the title of the section as "Afterward" (caught by the copy editor, of course), and yet, as I think about it, that wasn't really an inappropriate misspelling, because what really matters now is what comes after -- after we think hard about what our goals are and how we could achieve them. In any case, here's a preview of that "afterward" from the book.


There is no question that FRBR represents a great leap forward in the theory of bibliographic description. It addresses the “work question” that so troubled some of the great minds of library cataloging in the twentieth century. It provides a view of the “bibliographic family” through its recognition of the importance of the relationships that exist between created cultural objects. It has already resulted in vocabularies that make it possible to discuss the complex nature of the resources that libraries and archives gather and manage.

As a conceptual model, FRBR has informed a new era of library cataloging rules. It has been integrated into the cataloging workflow to a certain extent. FRBR has also inspired some non-library efforts, and those have given us interesting insight into the potential of the conceptual model to support a variety of different needs.

The FRBR model, with its emphasis on bibliographic relationships, has the potential to restore to the catalog the context that was once managed through alphabetical collocation. In fact, the use of a Semantic Web technology with a model of entities and relations could be a substantial improvement in this area, because the context that brings bibliographic units together can be made explicit: “translation of,” “film adaptation of,” “commentary on.” This, of course, could be achieved with or without FRBR, but because the conceptual model articulates the relationships, and the relationships are included in the recent cataloging rules, it makes sense to begin with FRBR and evolve from there.

However, the gap between the goals developed at the Stockholm meeting in 1991 and the result of the FRBR Study Group’s analysis is striking. FRBR defined only a small set of functional requirements, at a very broad level: find, identify, select, and obtain. The study would have been more convincing as a functional analysis if those four tasks had been further analyzed and had been the focus of the primary content of the study report. Instead, from my reading of the FRBR Final Report, it appears that the entity-relation analysis of bibliographic data took precedence over user tasks in the work of the FRBR Study Group.

The report’s emphasis on the entity-relation model, and the inclusion of three simple diagrams in the report, is most likely the reason for the widespread belief that the FRBR Final Report defines a technology standard for bibliographic data. Although technology solutions can be, and have been, developed around the FRBR conceptual model, no technology solution is presented in the FRBR Final Report. Even more importantly, there is nothing in the FRBR Final Report to suggest that there is one, and only one, technology possible based on the FRBR concepts. This is borne out by the examples we have of FRBR-based data models, each of which interprets the FRBR concepts to serve its particular set of needs. The strength of FRBR as a conceptual model is that it can support a variety of interpretations. FRBR can be a useful model for future developments, but it is a starting point, not a finalized product.

There is, of course, a need for technology standards that can be used to convey information about bibliographic resources. I say “standards” in the plural, because it is undeniable that libraries and their users have such a wide range of functions and needs that no one solution could possibly serve all. Well-designed standards create a minimum level of compliance that allows interoperability while permitting necessary variation to take place. A good example of this is the light bulb: with a defined standard base for the light bulb we have been able to move from incandescent to fluorescent and now to LED bulbs, all the time keeping our same lighting fixtures. We must do the same for bibliographic data so that we can address the need for variation in the different approaches between books and non-books, and between the requirements of the library catalog versus the use of bibliographic data in a commercial model or in a publication workflow.

Standardization on a single over-arching bibliographic model is not a reasonable solution. Instead, we should ask: what are the minimum necessary points of compliance that will make interoperability possible between these various uses and users? Interoperability needs to take place around the information and meaning carried in the bibliographic description, not in the structure that carries the data. What must be allowed to vary in our case is the technology that carries that message, because it is the rapid rate of technology change that we must be able to adjust to in the least disruptive way possible. The value of a strong conceptual model is that it is not dependent on any single technology.

It is now nearly twenty years since the Final Report of the FRBR Study Group was published. The FRBR concept has been expanded to include related standards for subjects and for persons, corporate bodies, and families. There is an ongoing Working Group for Functional Requirements for Bibliographic Records that is part of the Cataloguing Section of the International Federation of Library Associations. It is taken for granted by many that future library systems will carry data organized around the FRBR groups of entities. I hope that the analysis that I have provided here encourages critical thinking about some of our assumptions, and fosters the kind of dialog that is needed for us to move fruitfully from broad concepts to an integrative approach for bibliographic data.

From FRBR, Before and After, by Karen Coyle. Published by ALA Editions, 2015

© Karen Coyle, 2015
FRBR, Before and After by Karen Coyle is licensed under a Creative Commons Attribution 4.0 International License.

Sunday, September 13, 2015

Models of our World

This is to announce the publication of my book, FRBR, Before and After, by ALA Editions, available in November, 2015. As is often the case, the title doesn't tell the story, so I want to give a bit of an introduction before everyone goes: "Oh, another book on FRBR, yeeech." To be honest, the book does have quite a bit about FRBR, but it's also a think piece about bibliographic models, and a book entitled "Bibliographic Models" would look even more boring than one called "FRBR, Before and After."

The "before" part is a look at the evolution of the concept of Work, and, yes, Panizzi and Cutter are included, as are Lubetzky, Wilson, and others. Then I look at modeling, how goals and models are connected, and the effect that technology has (and has not) had on library data. The second part of the book focuses on the change that FRBR has wrought, both in our thinking and in how we model the bibliographic description. I'll post more about that in the near future, but let me just say that you might be surprised at what you read there.

The text will also be available as open access in early 2016, thanks to the generosity of ALA Editions, which agreed to this model. I do hope that enough libraries and individuals decide to purchase the hard copy that ALA Publishing puts out so that this model of print plus OA is economically viable. I can attest that the editorial work and design that went into the book produced a final version that I could not have even approximated on my own.

Monday, August 10, 2015

Google becomes Alphabet

I thought it was a joke, especially when the article said that they have two investment companies, Ventures and Capital. But it's all true, so I have this to say:

G is for Google, H is for cHutzpah. In addition to our investment companies, Ventures and Capital, we are instituting a think tank, Brain, and a company focused on carbon-based life forms, Body. Servicing these will be three key enterprises: Food, Water, and Air. Support will be provided by Planet, a subsidiary of Universe. Of course, we'll also need to provide Light. Let there be. Singularity. G is for God.

Friday, July 17, 2015

Flexibility in bibliographic models

A motley crew of folks had a chat via Google Hangout earlier this week to talk about FRBR and Fedora. I know exactly squat about Fedora, but I've just spent 18 months studying FRBR and other bibliographic models, so I joined the discussion. We came to a kind of nodding agreement, which I will try to express here, but one that requires us to do some hard work if we are to make it something we can work with.

The primary conclusion was that the models of FRBR and BIBFRAME, with their separation of bibliographic information into distinct entities, are too inflexible for general use. There are simply too many situations in which either the nature of the materials or the available metadata does not fit into the entity boundaries defined in those models. This is not news -- since the publication of FRBR in 1998 there have been numerous articles pointing out the need for modifications of FRBR for different materials (music, archival materials, serials, and others). The report of the audio-visual community to BIBFRAME said the same. Similar criticisms have been aimed at recent generations of cataloging rules, whose goal is to provide uniformity in bibliographic description across all media types. The differences in treatment that are needed by the various communities are not mutually compatible, which means that a single model is not going to work over the vast landscape that is "cultural heritage materials."

At the same time, folks in this week's informal discussion were able to readily cite use cases in which they would want to identify a group of metadata statements that would define a particular aspect of the data, such as a work or an item. The trick, therefore, is to find a sweet spot between the need for useful semantics and the need for flexibility within the heterogeneous cultural heritage collections that could benefit from sharing and linking their data amongst themselves.

One immediate thought is: let's define a core! (OK, it's been done, but maybe that's a different core.) The problem with this idea is that there are NO descriptive elements that will be useful for all materials. Title? (seems obvious) -- but there are many materials in museums and archives that have no title, from untitled art works, to museum pieces ("Greek vase"), to materials in archives ("Letter from John to Mary"). Although these are often given names of a sort, none have titles that function to identify them in any meaningful way. Creators? From anonymous writings to those Greek vases, not to mention the dinosaur bones and geodes in a science museum, many things don't have identifiable creators. Subjects? Well, if you mean this to be "topic" then again, not everything has a topic; think "abstract art" and again those geodes. Most things have a genre or a type, but standardizing on those alone would hardly reap great benefits in data sharing.

The upshot, at least the conclusion that I reach, is that there are no universals. At best there is some overlap between (A & B) and then between (B & C), etc. What the informal group that met this week concluded is that there is some value in standardizing among like data types, simply to make the job of developers easier. The main requirement overall, though, is to have a standard way to share one's metadata choices, not unlike an XML schema but for the RDF world: something that others can refer to or, even better, use directly in processing the data you provide.

Note that none of the above means throwing over FRBR, BIBFRAME, or RDA entirely. Each has defined some data elements that will be useful, and it is always better to re-use than to re-invent. But attempts to use these vocabularies to fix a single view of bibliographic data are simply not going to work in a world as varied as the one we live in. We limit ourselves greatly if we reject data that does not conform to a single definition rather than making use of connections between close but not identical data communities.

There's no solution being offered at this time, but identifying the target is a good first step.

Thursday, May 28, 2015

International Cataloguing Principles, 2015

IFLA is revising the International Cataloguing Principles and asked for input. Although I doubt that it will have an effect, I did write up my comments and send them in. Here's my view of the principles, including their history.

The original ICP dates from 1961 and reads like a very condensed set of cataloging rules. [Note: As T Berger points out, this document was entitled the "Paris Principles", not ICP.] It was limited to the choice and form of entries (personal and corporate authors, titles). It also stated clearly that it applied to alphabetically sequenced catalogs:
The principles here stated apply only to the choice and form of headings and entry words -- i.e. to the principal elements determining the order of entries -- in catalogues of printed books in which entries under authors' names and, where these are inappropriate or insufficient, under the titles of works are combined in one alphabetical sequence.
The basic statement of principles was not particularly different from those stated by Charles Ammi Cutter in 1875.


[Image: ICP 1961]

Note that the ICP does not include subject access, which was included in Cutter's objectives for the catalog. Somewhere between 1875 and 1961, cataloging became descriptive cataloging only. Cutter's rules did include a fair amount of detail about subject cataloging (13 pages, as compared to 23 pages on authors).

The next version of the principles was issued in 2009. This version is intended to be "applicable to online catalogs and beyond." This is a post-FRBR set of principles, and the objectives of the catalog are given in points with the headings find, identify, select, obtain, and navigate. The first four are, of course, the FRBR user tasks. The fifth, navigate, was (as I recall) suggested by Elaine Svenonius, and it obviously was looked on favorably even though, as far as I know, it has not been added to the FRBR document.

The statement of functions of the catalog in this 2009 draft is rather long, but the "find" function gives an idea of how the goals of the catalog have changed:

[Image: ICP 2009]

It's worth pointing out a few key changes. The first is the statement "as the result of a search...": the 1961 principles were designed for an alphabetically arranged catalog, while this set of principles recognizes that there are searches and search results in online catalogs, and it never mentions alphabetical arrangement. The second is that there is specific reference to relationships, which are expected to be searchable along with attributes of the resource. The third is that there is something called "secondary limiting of a search result," which appears to reflect the use of facets in search interfaces.

The differences between the 2015 draft of the ICP and this 2009 version are relatively minor. The big jump in thinking takes place between the 1961 version and the 2009 version. My comments (pdf) to the committee are as much about the 2009 version as the 2015 one. I make three points:
    1. The catalog is a technology, and cataloging is therefore closely related to that technology.
    Although the ICP talks about "find," etc., it doesn't relate those actions to the form of the "authorized access points." There is no recognition that searching today is primarily on keywords, not on left-anchored strings.

    2. Some catalog functions are provided by the catalog but not by cataloging.
    The 2015 ICP includes among its principles that of accessibility of the catalog for all users. Accessibility, however, is primarily a function of the catalog technology, not the content of the catalog data. It also recommends (to my great pleasure) that the catalog data be made available for open access. This is another principle that is not content-based. Equally important is the idea expressed in the 2015 principles under "navigate": "... beyond the catalogue, to other catalogues and in non-library contexts." This is clearly a function of the catalog, with the support of the catalog data, but what data serves this function is not mentioned.

    3. Authority control must be extended to all elements that have recognized value for retrieval.
    This mainly refers to the inclusion of the elements that serve as limiting facets on retrieved sets. None of the elements listed here are included in the ICP's instructions on "authorized access points," yet these are, indeed, access points. Uncontrolled forms of dates, places, content, carrier, etc., are simply not usable as limits. Yet nowhere in the document is the form of these access points addressed.

There is undoubtedly much more that could be said about the principles, but this is what seemed to me to be appropriate to the request for comment on this draft.

Monday, May 11, 2015

Catalogers and Coders

Mandy Brown has a blog post highlighting The Real World of Technology by Ursula Franklin. As Brown states it, Franklin describes
holistic technologies and prescriptive technologies. In the former, a practitioner has control over an entire process, and frequently employs several skills along the way...By contrast, a prescriptive technology breaks a process down into steps, each of which can be undertaken by a different person, often with different expertise.
It's the artisan vs. Henry Ford's dis-empowered worker. As we know, there has been some recognition, especially in the Japanese factory models, that dis-empowered workers produce poorer-quality goods with less efficiency. Brown has her own riff on this, but what came immediately to my mind was the library catalog.

The library catalog is not a classic case of the assembly line, but it has the element of different workers being tasked with different aspects of an outcome, with no one responsible for the whole. We have (illogically, I say) separated the creation of the catalog data from the creation of the catalog.

In the era of card catalogs (and the book catalogs that preceded them), catalogers created the catalog. What they produced was what people used, directly. Catalogers decided the headings that would be the entry points to the catalog, and thus determined how access would take place. Catalogers wrote the actual display that the catalog user would see. Whether or not people would find things in the catalog was directly in the hands of the catalogers, and they could decide what would bring related entries within card-flipping distance of each other, and whether cross-references were needed.

The technology of the card catalog was the card. The technologist was the cataloger.

This is no longer the case. The technology of the catalog is now a selection of computer systems. Not only are catalogers not designing these systems, in most cases no one in libraries is doing so. This has created a strange and uncomfortable situation in the library profession. Cataloging is still based on rules created by a small number of professional bodies, mostly IFLA and some national libraries. IFLA is asking for comments on its latest edition of the International Cataloging Principles, but those principles are not directly related to catalog technology. Some Western libraries are making use of, or moving toward, the rules created by the Joint Steering Committee for Resource Description and Access (RDA), which boasts of being "technology neutral." These two new-ish standards have nothing to say about the catalog itself, as if cataloging existed in some technological limbo.

Meanwhile, work goes on in the bibliographic data arena with the development of the BIBFRAMEs, variations on a new data carrier for cataloging data. This latter work has nothing to say about how resources should be cataloged, what services catalogs should perform, or how they should make the data useful. Its philosophy is "whatever in, whatever out."

Meanwhile #2, library vendors create the systems that will use the machine-readable data that is created following cataloging rules that very carefully avoid any mention of functionality or technology. Are catalog users expected to perform left-anchored searches on headings? Keyword searches on whole records? Will the system provide facets that can act as secondary limits on search results? What will be displayed to the user? What navigation will be possible? Who decides?

The code4lib community talks about getting catalogers and coders together, and wonders whether catalogers should be taught to code. The problem, however, is not between coders and catalogers but is systemic in our profession. We have kept cataloging and computer technology separate, as if they weren't both absolutely necessary. One is the chassis, the other the engine, but nothing useful can come off the assembly line unless both are present in the design and the creation of the product.

It seems silly to have to say this, but you simply cannot design data and the systems that will use the data each in its own separate silo. This situation is so patently absurd that I am embarrassed to have to bring it up. Getting catalogers and coders together is not going to make a difference as long as they are trying to fit one group's round peg into the other's square hole. (Don't think about that metaphor too much.) We have to have a unified design, that's all there is to it.

What are the odds? *sigh*