Tuesday, March 12, 2019

I'd like to buy a VOWEL

One of the "defects" of RDF for data management is that it does not support business rules. That's a generality, so let me explain a bit.

Most data is constrained - it has rules for what is and what is not allowed. These rules can govern things like cardinality (is it required? is it repeatable?), value types (date, currency, string, IRI), and data relationships (If A, then not B; either A or B+C). This controlling aspect of data is what many data stores are built around; a bank, a warehouse, or even a library manage their activities through controlled data.

RDF has a different logical basis. RDF allows you to draw conclusions from the data (called "inferencing") but there is no mechanism of control that would do what we are accustomed to with our current business rules. This seems like such an obvious lack that you might wonder just how the developers of RDF thought it would be used. The answer is that they were not thinking about banking or company databases. The main use case for RDF development was using artificial intelligence-like axioms on the web. That's a very different use case from the kind of data work that most of us engage in.
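To make the difference between inference and control concrete, here is a minimal sketch in plain Python rather than in any RDF tooling. The triples, the property `ex:author`, and the class `ex:Book` are made up for illustration. The point is that an axiom in the style of rdfs:domain does not reject a statement about something that is not known to be a book; it concludes that the thing is a book.

```python
# Illustrative data only: the "ex:" names are hypothetical.
triples = {
    ("ex:thing1", "ex:author", "ex:person1"),
}

# Axiom in the style of rdfs:domain: anything that has an ex:author is an ex:Book.
domain_axioms = {"ex:author": "ex:Book"}


def infer_types(data, axioms):
    """Return the data plus every rdf:type triple implied by the domain axioms."""
    inferred = set(data)
    for subject, predicate, _obj in data:
        if predicate in axioms:
            inferred.add((subject, "rdf:type", axioms[predicate]))
    return inferred


result = infer_types(triples, domain_axioms)
# Nothing was rejected; a new statement was added instead.
print(("ex:thing1", "rdf:type", "ex:Book") in result)
```

A business rule would have asked "is ex:thing1 allowed to have an author?" and possibly refused the data. The axiom never refuses anything.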

RDF is characterized by what is called the "open world assumption" which says that:

- at any moment a set of data may be incomplete; that does not make it illegitimate
- anyone can say anything about anything; like the web in general there are no controls over what can and cannot be stated and who can participate

However, RDF is being used in areas where data with controls was once employed; where data is validated for quality and rejected if it doesn't meet certain criteria; where operating on the data is limited to approved actors. This means that we have a mismatch between our data model and some of the uses of that data model.

This mismatch was evident to people using RDF in their business operations. W3C held a preliminary meeting on "Validation of Data Shapes" in which there were presentations over two days that demonstrated some of the solutions that people had developed. This then led to the Data Shapes working group in 2014, which produced the shapes validation language SHACL (SHApes Constraint Language) in 2017. Of the interesting ways that people had developed to validate their RDF data, the use of SPARQL searches to determine if expected patterns were met became the basis for SHACL. Another RDF validation language, ShEx (Shape Expressions), is independent of SPARQL but has essentially the same functionality as SHACL. There are other languages as well (SPIN, Stardog, etc.) and they all assume a closed world rather than the open world of RDF.
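The pattern-matching approach amounts to querying for an expected pattern and reporting where it is missing. The sketch below imitates that idea in plain Python; it is not SPARQL, SHACL, or ShEx syntax, and the triples and the rule "every ex:Book has exactly one ex:title" are invented for illustration.

```python
# Hypothetical instance data as (subject, predicate, object) tuples.
triples = [
    ("ex:book1", "rdf:type", "ex:Book"),
    ("ex:book1", "ex:title", "A title"),
    ("ex:book2", "rdf:type", "ex:Book"),  # no title stated
    ("ex:book3", "rdf:type", "ex:Book"),
    ("ex:book3", "ex:title", "First title"),
    ("ex:book3", "ex:title", "Second title"),
]


def values(data, subject, predicate):
    """All objects stated for one subject/predicate pair."""
    return [o for s, p, o in data if s == subject and p == predicate]


def violations(data, target_class, predicate, min_count, max_count):
    """Closed-world check: treat the data in hand as complete and report
    every instance of target_class whose value count is out of bounds."""
    report = []
    for s, p, o in data:
        if p == "rdf:type" and o == target_class:
            n = len(values(data, s, predicate))
            if not (min_count <= n <= max_count):
                report.append((s, predicate, n))
    return report


print(violations(triples, "ex:Book", "ex:title", 1, 1))
```

Under the open world assumption the missing title of ex:book2 would simply be unknown, not an error. It becomes a violation only because the check assumes that the data in hand is all there is.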

My point on all this is to note that we now have a way to validate RDF instance data but no standard way(s) to define our metadata schema, with constraints, that we can use to produce that data. It's kind of a "tail wagging the dog" situation. There have been musings that the validation languages could also be used for metadata definition, but we don't have a proof of concept and I'm a bit skeptical. The reason I'm skeptical is that there's a certain human-facing element in data design and creation that doesn't need to be there in the validation phase. While there is no reason why the validation languages cannot also contain or link to term definitions, cataloging rules, etc., these would be add-ons. The validation languages also do most of their work at the detailed data level, while some guidance for humans happens at the macro definition of a data model - What is this data for? Who is the audience? What should the data creator know or research before beginning? What are the reference texts that one should have access to? While admittedly the RDA Toolkit used in library data creation is an extreme form of the genre, you can see how much more there is beyond defining specific data elements and their valid values. Using a metadata schema in concert with RDF validation - yes! That's a winning combination, but I think we need both.

Note that there are also efforts to use the validation languages to analyze existing graphs. (PDF) These could be a quick way to get an overview of data for which you have no description, but the limitations of this technique are easy to spot. They have basically the same problem that AI training datasets do: you only learn what is in that dataset, not the full range of possible graphs and values that can be produced. If your data is very regular then this analysis can be quite helpful; if your data has a lot of variation (as, for example, bibliographic data does) then the analysis of a single file of data may not be terribly helpful. At the same time, exercising the validation languages in this way is one way to discover how we can use algorithms to "look at" RDF data.

Another thing to note is that there's also quite a bit of "validation" that the validation languages do not handle, such as the reconciliation work that is often done in OpenRefine. The validation languages take an atomistic view of the data, not an overall one. I don't see a way to ask the question "Is this entry compatible with all of the other entries in this file?" That the validation languages don't cover this is not a fault, but it must be noted that there is other validation that may need to be done.

WOL, meet WVL

 

We need a data modeling language that is suitable to RDF data, but that provides actual constraints, not just inferences. It also needs to allow one to choose a closed world rule. The RDF suite of standards has provided the Web Ontology Language, which should be WOL but has been given the almost-acronym name of OWL. OWL does define "constraints", but they aren't constraints in the way we need for data creation. OWL constrains the axioms of inference. That means that it gives you rules to use when operating over a graph of data, and it still works in the open world. The use of the term "ontology" also implies that this is a language for the creation of new terms in a single namespace. That isn't required, but that is becoming a practice.

What we need is a web vocabulary language. WVL. But using the liberty that went from WOL to OWL, we can go from WVL to VWL, and that can be nicely pronounced as VOWEL. VOWEL (I'm going to write it like that because it isn't familiar to readers yet) can supply the constrained world that we need for data creation. It is not necessarily an RDF-based language, but it will use HTTP identifiers for things. It could function as linked data but it also can be entirely in a closed world. Here's what it needs to do:
  • describe the things of the metadata
  • describe the statements about those things and the values that are valid for those statements
  • give cardinality rules for things and statements
  • constrain values by type
  • give a wide range of possibilities for defining values, such as lists, lists of namespaces, ranges of computable values, classes, etc.
  • for each thing and statement have the ability to carry definitions and rules for input and decision-making about the value
  • can be serialized in any language that can handle key/value pairs or triples
  • can (hopefully easily) be translatable to a validation language or program
Obviously there may be more. This is not fully-formed yet, just the beginning. I have defined some of it in a GitHub repo. (Ignore the name of the repo - that came from an earlier but related project.) That site also has some other thoughts, such as design patterns, a requirements document, and some comparison between existing proposals, such as the Dublin Core community's Description Set Profile, BIBFRAME, and soon Stanford's profile generator, Sinopia.
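As a thought experiment only, the requirements listed above could be met by something as plain as key/value pairs. The sketch below is not the vocabulary in the repo. The keys (`min`, `max`, `type`, `values`, `note`) and the sample statements are assumptions, made up to show how a description of statements might carry guidance for people and also be translated into a validation routine.

```python
# Hypothetical profile: keys and property names are illustrative assumptions.
PROFILE = {
    "ex:title": {"min": 1, "max": 1, "type": str,
                 "note": "Transcribe the title as it appears on the item."},
    "ex:language": {"min": 0, "max": None, "type": str,
                    "values": {"en", "fr", "de"}},
    "ex:pages": {"min": 0, "max": 1, "type": int},
}


def validate(record, profile):
    """Check one record (property -> list of values) against the profile.
    Closed world: properties not in the profile are reported as errors."""
    errors = []
    for prop in record:
        if prop not in profile:
            errors.append(f"{prop}: not defined in profile")
    for prop, rule in profile.items():
        vals = record.get(prop, [])
        if len(vals) < rule["min"]:
            errors.append(f"{prop}: at least {rule['min']} value(s) required")
        if rule["max"] is not None and len(vals) > rule["max"]:
            errors.append(f"{prop}: at most {rule['max']} value(s) allowed")
        for v in vals:
            if not isinstance(v, rule["type"]):
                errors.append(f"{prop}: {v!r} is not of type {rule['type'].__name__}")
            elif "values" in rule and v not in rule["values"]:
                errors.append(f"{prop}: {v!r} is not in the list of valid values")
    return errors


good = {"ex:title": ["Acrobats"], "ex:language": ["en"], "ex:pages": [12]}
bad = {"ex:language": ["xx"], "ex:pages": ["twelve"], "ex:color": ["gray"]}
print(validate(good, PROFILE))
for message in validate(bad, PROFILE):
    print(message)
```

The `note` key is never read by the validator. It is there for the person creating the data, which is the human-facing part that a validation language alone does not need.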

One of the ironies of this project is that VOWEL needs to be expressed as a VOWEL. Presumably one could develop an all-new ontology for this, but the fact is that most of what is needed exists already. So this gets meta right off the bat which makes it a bit harder to think about but easier to produce.

There will be a group starting up in the Dublin Core space to continue development of this idea. I will announce that widely when it happens. I think we have some real possibilities here, to make VOWEL a reality. One of my goals will be to follow the general principles of the original Dublin Core metadata, which is that simple wins out over complex, and it's easier to complex-ify simple than to simplify complex.

Monday, January 28, 2019

FRBR without FR or BR

(This is something I started working on that turns out to be a "pulled thread" - something that keeps on unwinding the more I work on it. What's below is a summary, while I decide what to do with the longer piece.)

FRBR was developed for the specific purpose of modeling library catalog data. I give the backstory on FRBR in chapter 5 of my book, "FRBR Before and After." The most innovative aspect of FRBR was the development of a multi-entity view of creative works. Referred to as "group 1" of three groups of entities, the entities described there are Work, Expression, Manifestation, and Item (WEMI). They are aligned with specific bibliographic elements used in library catalogs, and are defined with a rigid structure: the entities are linked to each other in a single chain; the data elements are defined each as being valid for one and only one entity; all WEMI entities are disjoint.

In spite of these specifics, something in that group 1 has struck a chord for metadata designers who do not adhere to the library catalog model as described in FRBR. In fact, some mentions or uses of WEMI are not even bibliographic in nature.* This leads me to conclude that a version of WEMI that is not tied to library catalog concepts could provide an interesting core of classes for metadata that describes creative or created resources.

We already have some efforts that have stepped away from the specifics of FRBR. From 2005 there is the first RDF FRBR ontology, frbrCore, which defines the entities of FRBR and key relationships between them as RDF classes. This ontology breaks away from FRBR in that it creates super-classes that are not defined in FRBR, but it retains the disjointness between the primary entities. We also have FRBRoo which is a FRBR-ized version of the CIDOC museum metadata model. This extends the number of classes to include some that represent processes that are not in the static model of the library catalog. In addition we have FaBiO, a bibliographic ontology that uses frbrCore classes but extends the WEMI-based classes with dozens of sub-classes that represent types of works and expressions.

I conclude that there is something in the ability to describe the abstraction of work apart from the concrete item that is useful in many areas. The intermediate entities, defined in FRBR as expression and manifestation, may have a role depending on the material and the application for which the metadata is being developed. Other intermediate entities may be useful at times. But as a way to get started, we can define four entities (which are "classes" in RDF) that parallel the four group 1 entities in FRBR. I would like to give these new names to distance them from FRBR, but that may not be possible as people have already absorbed the FRBR terminology.


FRBR          / option 1 / option 2
work          / idea     / creative work
expression    / creation / realization
manifestation / object   / product
item          / instance / individual

My preferred rules for these classes are:
  • any entity can be iterative (e.g. a work of a work)
  • any entity can have relationships/links to any other entity
  • no entity has an inherent dependency on any other entity
  • any entity can be used alone or in concert with other entities
  • no entities are disjoint
  • anyone can define additional entities or subclasses   
  • individual profiles using the model may recommend or limit attributes and relationships, but the model itself will not have restrictions
This implements a theory of ontology development known as "minimum semantic commitment." In this theory, base vocabulary terms should be defined with as little semantics as possible, with semantics in this sense being the axiomatic semantics of RDF. An ontology whose terms have high semantic definition, such as the original FRBR, will provide fewer opportunities for re-use because uses must adhere to the tightly defined semantics in the original ontology. Less commitment in the base ontology means that there are greater opportunities for re-use; desired semantics can be defined in specific implementations through the creation of application profiles.

Given this freedom, how would people choose to describe creative works? For example, here's one possible way to describe a work of art:

work:
    title: Acrobats
    creator: Paul Klee
    genre: abstract art
    topic: acrobats
    date: 1914
item:
    size: 9 x 9
    base material: paper
    material: watercolor, pastel, ink
    color: mixed
    signed: PKlee
    dated: 1914
   
And here's a way to describe a museum store's inventory record for a print:

work:
    title: Acrobats
    creator: Paul Klee
    genre: abstract art
    topic: acrobats
    date: 1914
manifestation:
    description: 12-color archival inkjet print
    size: 24 x 36 inches
    price: $16.99
   
There is also no reason why a non-creative product couldn't use the manifestation class (which is one of the reasons that I would prefer to call it "product," which would resonate better for these potential users):

manifestation/product:
    description: dining chair
    dimensions: 26 x 23 x 21.5 inches
    weight:  21 pounds
    color: gray
    manufacturer: YEEFY
    price: $49.99
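
The three descriptions above combine the entities differently, and the core model permits all of them. One way to picture the division of labor is the following sketch, in which the base model imposes nothing and an application profile adds the restrictions that a particular community wants. The profile structure is hypothetical and not part of any existing specification.

```python
# The three descriptions above, reduced to entity -> attribute names.
artwork = {"work": {"title", "creator", "genre", "topic", "date"},
           "item": {"size", "base material", "material", "color", "signed", "dated"}}
store_print = {"work": {"title", "creator", "genre", "topic", "date"},
               "manifestation": {"description", "size", "price"}}
chair = {"manifestation": {"description", "dimensions", "weight", "color",
                           "manufacturer", "price"}}

# Hypothetical profile for a store inventory: it requires a manifestation
# with a price. The base model itself imposes no such requirement.
STORE_PROFILE = {"manifestation": {"price"}}


def conforms(description, profile):
    """True if every entity the profile names is present with its required attributes."""
    return all(entity in description and required <= description[entity]
               for entity, required in profile.items())


for name, d in [("artwork", artwork), ("store_print", store_print), ("chair", chair)]:
    print(name, conforms(d, STORE_PROFILE))
```

The artwork description fails this store profile but remains a perfectly good use of the base classes; a museum catalog profile could require work and item instead.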
   
Here is the sum total of what this core WEMI would look like, still using the FRBR terminology:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://example.com/Work> rdf:type owl:Class ;
    rdfs:label "Work"@en ;
    rdfs:comment "The creative work as abstraction."@en .

<http://example.com/Expression> rdf:type owl:Class ;
    rdfs:label "Expression"@en ;
    rdfs:comment "The creative work as it is expressed in a potentially perceivable form."@en .

<http://example.com/Manifestation> rdf:type owl:Class ;
    rdfs:label "Manifestation"@en ;
    rdfs:comment "The physical product that contains the creative work."@en .

<http://example.com/Item> rdf:type owl:Class ;
    rdfs:label "Item"@en ;
    rdfs:comment "An instance or individual copy of the creative work."@en .

I can see communities like Dublin Core and schema.org as potential locations for these proposed classes because they represent general metadata communities, not just the GLAM world of IFLA. (I haven't approached them.) I'm open to hearing other ideas for hosting this, as well as comments on the ideas here. For it? Against it? Is there a downside?


* Examples of some "odd" references to FRBR for use in metadata for:

Tuesday, November 27, 2018

It's "academic"

We all know that writing and publishing is of great concern to those whose work is in academia; the "publish or perish" burden haunts pre-tenure educators and grant-seeking researchers. Revelations that data had been falsified in published experimental results bring great condemnation from publishers and colleagues, and yet I have a feeling that underneath it all is more than an ounce of empathy from those who are fully aware of the forces that would lead one to put one's thumb on the scales for the purposes of winning the academic jousting match. It is only a slight exaggeration to compare these souls to the storied gladiators whose defeat meant summary execution. From all evidence, that is how many of them experience the contest to win the ivory tower - you climb until you fall.

Research libraries and others deal in great part with the output of the academe. In many ways their practices reinforce the value judgments made on academic writing, such as having blanket orders for all works published by a list of academic presses. In spite of this, libraries have avoided making an overt statement of what is and what is not "academic." The "deciders" of academic writing are the publishers - primarily the publishers of peer-reviewed journals that decide what information does and does not become part of the record of academic achievement, but also those presses that issue scholarly monographs. Libraries are the consumers of these decisions but stop short of tagging works as "academic" or "scholarly."

The pressure on academics has only increased in recent years, primarily because of the development of "impact factors." In 1955, Eugene Garfield introduced the idea that one could create a map of scientific publishing using an index of the writings cited by other works. (Science, 1955; 122:108–11) Garfield was interested in improving science by linking works so that one could easily find supporting documents. However, over the years the purpose of citation has evolved from a convenient link to precedents into a measure of the worth of scholars themselves in the form of the "h-index" - a measure of a person (not a work), based on how often that person's publications have been cited. The h-index is the "lifetime home runs" statistic of the academic world. One is valued for how many times one is cited, making citations the coin of the realm, not sales of works or even readership. No one in academia could or should be measured on the same scale as a non-academic writer when it comes to print runs, reviews, or movie deals. Imagine comparing the sales figures of "Poetic Autonomy in Ancient Rome" with "The Da Vinci Code". So it matters in academia to carve out a world that is academic, and that isolates academic works such that one can do things like calculate an h-index value.
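
For readers who have not met it, the h-index as it is usually defined is the largest number h such that the author has h publications with at least h citations each. Here is a short sketch of that calculation, with invented citation counts:

```python
def h_index(citation_counts):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Invented citation counts for one author's six publications.
print(h_index([25, 8, 5, 3, 3, 0]))
```

Note that the calculation only works if someone has first decided which publications and which citations count, which is where the boundary around "academic" comes in.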

This interest in all things academic has led to a number of metadata oddities that make me uncomfortable, however. There are metadata schemas that have an academic bent that translates to a need to assert the "scholarliness" of works being given a bibliographic description. There is also an emphasis on science in these bibliographic metadata, with less acknowledgement of the publishing patterns of the humanities. My problem isn't solely with the fact that they are doing this, but in particular with how they go about it.

As an example, the metadata schema BIBO clearly has an emphasis on articles as scholarly writing; notably, it has a publication type "academic article" but does not have a publication type for "academic book." This reflects the bias that new scientific discoveries are published as journal articles, and many scientists do not write book-length works at all. This slights the work of historians like Ann M. Blair whose book, Too Much to Know, has what I estimate to be about 1,450 "primary sources," ranging from manuscripts in Latin and German from the 1500s to modern works in a number of languages. It doesn't get much more academic than that.

BIBO also has different metadata terms for "journal" and "magazine":
  • bibo:journal "A periodical of scholarly journal Articles."
  • bibo:magazine "A periodical of magazine Articles. A magazine is a publication that is issued periodically, usually bound in a paper cover, and typically contains essays, stories, poems, etc., by many writers, and often photographs and drawings, frequently specializing in a particular subject or area, as hobbies, news, or sports."
Something in that last bit on magazines smacks of "leisure time" while the journal clearly represents "serious work." It's also interesting that the description of magazine is quite long, describes the physical aspects ("usually bound in a paper cover"), and gives a good idea of the potential content. "Journal" is simply "scholarly journal articles." Aside from the circularity of the definitions (journal has journal articles, magazines have magazine articles), what this says is simply that a journal is a "not magazine."

Apart from the snobbishness of the difference between these terms is the fact that one seeks in vain for a bright line between the two. There is, of course, the "I know it when I see it" test, and there is definitely some academic writing that you can pick out without hesitation. But is an opinion piece in the journal of a scientific society academic? How about a book review? How about a book review in the New York Review of Books (NYRB), where articles run to 2,000 to 5,000 words, are written by an academic in the field, and make use of the encyclopedic knowledge of the topic on the part of the reviewer? When Marcia Angell, professor at the Harvard Medical School and former Editor in Chief of The New England Journal of Medicine, writes for the NYRB, has she slipped her academic robes for something else? She seems to think so. On her professional web site she lists among her publications a (significantly long) letter to the editor (called a "comment" in academic journal-ese) of a science journal article about women in medicine but she does not include in her publication list the articles she has written for NYRB even though these probably make more use of her academic knowledge than the comment did. She is clearly making a decision about what is "academic" (i.e. career-related) and what is not. It seems that the dividing line is not the content of the writing but how her professional world esteems the publishing vehicle.

Not to single out BIBO, I should mention other "culprits" in the tagging of scholarly works, such as Wikidata. Wikidata has:
  • academic journal article (Q18918145) article published in an academic journal
  • academic writing (Q4119870) academic writing and publishing is conducted in several sets of forms and genres
  • scholarly article (Q13442814) article in an academic publication, usually peer reviewed
  • scholarly publication (Q591041) scientific publications that report original empirical and theoretical work in the natural sciences
There is so much wrong with each of these, from circular definitions to bias toward science as the only scholarly pursuit (scholarly publication is a "scientific publication" in the "natural sciences"). (I've already commented on this in Wikidata, sarcastically calling it a fine definition if you ignore the various directions that science and scholarship have taken since the mid-19th century.) What this reveals, however, is that the publication and publisher define whether the work is "scholarly." If any article in an academic publication is a scholarly article, then the comment by Dr. Angell is, by definition, scholarly, and the NYRB articles are not. Academia is, in fact, a circularly-defined world.

Giving one more example, schema.org has this:
  • schema:ScholarlyArticle (sub-class of Article) A scholarly article.
Dig that definition! There are a few other types of article in schema.org, such as "NewsArticle" and "TechArticle" but it appears that all of those magazine articles would be simple "Article."

Note that in real life publications call themselves whatever they wish. With a hint at how terms may have changed over time: Ladies' Home Journal calls itself a journal, and the periodical published by the American Association for the Advancement of Science, Science, gives itself the domain sciencemag.org. "Science Magazine" just sounds right, doesn't it?

It's not wrong for folks to characterize some publications and some writing as "academic" but any metadata term needs a clear definition, which these do not have. What this means is that people using these schemas are being asked to make a determination with very little guidance that would help them separate the scholarly or academic from... well, from the rest of publishing output. With the inevitable variation in categorization, you can be sure that in metadata coded with these schemas the separation between scholarly/academic and not scholarly/academic writing is probably not going to be useful because there will be little regularity of assignment between communities that are using this metadata.

I admit that I picked on this particular metadata topic because I find the designation of "scholarly" or "academic" to be judgemental. If nothing else, when people judge they need some criteria for that judgement. What I would like to see is a clear definition that would help people decide what is and what is not "academic," and what the use cases are for why this typing of materials should be done. As with most categorizations, we can expect some differences in the decisions that will be made by catalogers and indexers working with these metadata schemas. A definition at least gives you something to discuss and to argue for.  Right now we don't have that for scholarly/academic publications.

And I am glad that libraries don't try to make this distinction.


Tuesday, August 14, 2018

Libraryland, We Have a Problem


The first rule of every multi-step program is to admit that you have a problem. I think it's time for us librarians to take step one and admit that we do have a problem.

The particular problem that I have in mind is the disconnect between library data and library systems in relation to the category of metadata that libraries call "headings." Headings are the strings in the library data that represent those entities that would be entry points in a linear catalog like a card catalog.

It pains me whenever I am an observer to cataloger discussions on the proper formation of headings for items that they are cataloging. The pain point is that I know that the value of those headings is completely lost in the library systems of today, and therefore there are countless hours of skilled cataloger time that are being wasted.

The Heading


Both book and card catalogs were catalogs of headings. The catalog entry was a heading followed by one or more bibliographic entries. Unfortunately, the headings serve multiple purposes, which is generally not a good data practice but is due to the need for parsimony in library data when that data was analog, as in book and card catalogs.

  • A heading is a unique character string for the "thing" – the person, the corporate body, the family – essentially an identifier.
      Tolkien, J. R. R. (John Ronald Reuel), 1892-1973
  • It supports the selection of the entity in the catalog from among the choices that are presented (although in some cases the effectiveness of this is questionable)
  • It is an access point, intended to be the means of finding, within the catalog, those items held by the library that meet the need of the user.
  • It provides the sort order for the catalog entries (which is why you see inverted forms like "Tolkien, J. R. R.")
      United States. Department of State. Bureau for Refugee Programs
      United States. Department of State. Bureau of Administration
      United States. Department of State. Bureau of Administration and Security
      United States. Department of State. Bureau of African Affairs
  • That sort order, and those inverted headings, also have a purpose of collocation of entries by some measure of "likeness"
      Tolkien, J. R. R. (John Ronald Reuel), 1892-1973
      Tolkien Society
      Tolkien Trust

    The last three functions, providing a sort order, access, and collocation, have been lost in the online catalog. The reasons for this are many, but the main explanation is that keyword searching has replaced alphabetical browse as a way to locate items in a library catalog.

    The upshot is that many hours are spent during the cataloging process to formulate a left-anchored, alphabetically order-able heading that has no functionality in library catalogs other than as fodder for a context-less keyword search.

    Once a keyword search is done the resulting items are retrieved without any correlation to headings. It may not even be clear which headings one would use to create a useful order. The set of retrieved bibliographic resources from a single keyword search may not provide a coherent knowledge graph. Here's an illustration using the keyword "darwin":

    Gardiner, Anne.
    Melding of two spirits : from the "Yiminga" of the Tiwi to the "Yiminga" of Christianity / by Anne Gardiner ; art work by
    Darwin : State Library of the Northern Territory, 1993.
    Christianity--Australia--Northern Territory.
    Tiwi (Australian people)--Religion.
    Northern Territory--Religion.

    Crabb, William Darwin.
    Lyrics of the golden west. By W. D. Crabb.
    San Francisco, The Whitaker & Ray company, 1898
    West (U.S.)--Poetry.

    Darwin, Charles, 1809-1882.
    Origin of species by means of natural selection; or, The preservation of favored races in the struggle for life and The descent of man and selection in relation to sex, by Charles Darwin.
    New York, The Modern library [1936]
    Evolution (Biology)
    Natural selection.
    Heredity.
    Human evolution.

    Bear, Greg, 1951-
    Darwin's radio / Greg Bear.
    New York : Ballantine Books, 2003.
    Women molecular biologists--Fiction.
    DNA viruses--Fiction.

    No matter what you would choose as a heading on which to order these, it will not produce a sensible collocation that would give users some context to understand the meaning of this particular set of items – and that is because there is no meaning to this set of items, just a coincidence of things named "Darwin."
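
The difference between the two kinds of retrieval can be shown in a few lines. The sketch below is a toy, not how any library system is implemented: the records are abbreviated from the four examples above, and the two search functions are assumptions about the simplest possible keyword match and left-anchored heading browse.

```python
# Abbreviated from the four records above: (name heading, title, imprint).
records = [
    ("Gardiner, Anne.", "Melding of two spirits",
     "Darwin : State Library of the Northern Territory, 1993."),
    ("Crabb, William Darwin.", "Lyrics of the golden west.",
     "San Francisco, The Whitaker & Ray company, 1898"),
    ("Darwin, Charles, 1809-1882.", "Origin of species by means of natural selection",
     "New York, The Modern library [1936]"),
    ("Bear, Greg, 1951-", "Darwin's radio",
     "New York : Ballantine Books, 2003."),
]


def keyword_search(recs, word):
    """Match the word anywhere in any field, with no regard to its role."""
    word = word.lower()
    return [r for r in recs if any(word in field.lower() for field in r)]


def heading_browse(recs, prefix):
    """Left-anchored match on the name heading only, in heading order."""
    prefix = prefix.lower()
    return sorted(r for r in recs if r[0].lower().startswith(prefix))


print(len(keyword_search(records, "darwin")))
print([r[0] for r in heading_browse(records, "darwin")])
```

The keyword search returns all four records, because "Darwin" appears as a place of publication, a middle name, an author's surname, and a word in a title. The heading browse returns only the record entered under Charles Darwin, which is the only case where the string plays the role that the heading was built for.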

    Headings that have been chosen to be controlled strings should offer a more predictable user search experience than free text searching, but headings do not necessarily provide collocation. As an example, Wikipedia uses the names of its pages as headings, and there are some rules (or at least preferred practices) to make the headings sensible. A search in Wikipedia is a left-to-right search on a heading string that is presented as a drop-down list of a handful of headings that match the search string:




    Included in the headings in the drop-down are "see"-type terms that, when selected, take the user directly to the entry for the preferred term. If there is no one preferred term Wikipedia directs users to disambiguation pages to help users select among similar headings:


    The Wikipedia pages, however, only provide accidental collocation, not the more comprehensive collocation that libraries aim to attain. That library-designed collocation, however, is also the source of the inversion of headings, making those strings unnatural and unintuitive for users. Although the library headings are admirably rules based, they often use rules that will not be known to many users of the catalog, such as the difference in name headings with prepositions based on the language of the author. To search on these names, one therefore needs to know the language of the author and the rule that is applied to that language, something that I am quite sure we can assume is not common knowledge among catalog users.

    De la Cruz, Melissa
    Cervantes Saavedra, Miguel de
    I may be the only patron of my small library branch who knows to look for the mysteries by Icelandic author Arnaldur Indriðason under "A" not "I".

    What Is To Be Done?


    There isn't an easy (or perhaps not even a hard) answer. As long as humans use words to describe their queries we will have the problem that words and concepts, and words and relationships between concepts, do not neatly coincide.

    I see a few techniques that might be used if we wish to save collocation by heading. One would be to allow keyword searching but for the system to use that to suggest headings that then can be used to view collocated works. Some systems do allow users to retrieve headings by keyword, but headings, which are very terse, are often not self-explanatory without the items they describe. A browse of headings alone is much less helpful than the association of the heading with the bibliographic data it describes. Remember that headings were developed for the card catalog where they were printed on the same card that carried the bibliographic description.

    Another possible area of investigation would be to look to the classified catalog, a technique that has existed alongside alphabetical catalogs for centuries. The Decimal Classification of Dewey was a classified approach to knowledge with a language-based index (his "Relativ Index") to the classes. (It is odd that the current practice in US libraries is to have one classification for items on shelves and an unrelated heading system (LCSH) for subject access.)
    The classification provides the intellectual collocation that the headings themselves do not provide. The difficulty with this is that the classification collocates topically but, at least in its current form, does not collocate the name headings in the catalog that identify people and organizations as entities.

    Conclusion (sort of)

    Controlled headings as access points for library catalogs could provide better service than keyword search alone. How to make use of headings is a difficult question. The first issue is how to exploit the precision of headings while still allowing users to search on any terms that they have in mind. Keyword search is, from the user's point of view, frictionless. They don't have to think "what string would the library have used for this?".

    Collocation of items by topical sameness or other relationships (e.g. "named for", "subordinate to") is possibly the best service that libraries could provide, although it is very hard to do this through the mechanism of language strings. Dewey's original idea of a classified order with a language-based index is still a good one, although classifications are hard to maintain and hard to assign.

    If challenged to state what I think the library catalog should be, my answer would be that it should provide a useful order that illustrates one or more intellectual contexts that will help the user enter and navigate what the library has to offer. Unfortunately I can't say today how we could do that. Could we think about that together?

    Readings

    Dewey, Melvil. Decimal classification and relativ index for libraries, clippings, notes, etc. Edition 7. Lake Placid Club, NY., Forest Press, 1911. https://archive.org/details/decimalclassifi00dewegoog

    Shera, Jesse H, Margaret E. Egan, and Jeannette M. Lynn. The Classified Catalog: Basic Principles and Practices. Chicago, Ill: American Library Association, 1956




    Monday, August 06, 2018

    FRBR as a Data Model


    (I've been railing against FRBR since it was first introduced. It still confuses me some. I put out these ideas for discussion. If you disagree, please add your thoughts to this post.)

    I was recently speaking at a library conference in Oslo where I went through my criticisms of our cataloging models, and how they are not suited to the problems we need to solve today. I had my usual strong criticisms of FRBR and the IFLA LRM. However, when I finished speaking I was asked why I am so critical of those models, which means that I did not explain myself well. I am going to try again here, as clearly and succinctly as I can.

    Conflation of Conceptual Models with Data Models


    FRBR's main impact was that it provided a mental model of the bibliographic universe that reflects a conceptual view of the elements of descriptive cataloging. You will find nothing in FRBR that could not be found in standard library cataloging of the 1990's, which is when the FRBR model was developed. What FRBR adds to our understanding of bibliographic information is that it gives names and definitions to key concepts that had been implied but not fully articulated in library catalog data. If it had stopped there we would have had an interesting mental model that allows us to speak more precisely about catalogs and cataloging.

    Unfortunately, the use of diagrams that appear to define actual data models and the listing of entities and their attributes have led the library world down the wrong path, that of reading FRBR as the definition of a physical data model. Compounding this, the LRM goes down that path even further by claiming to be a structural model of bibliographic data, which implies that it is the structure for library catalog data. I maintain that the FRBR conceptual model should not be assumed to also be a model for bibliographic data in a machine-readable form. The main reason for this has to do with the functionality that library catalogs currently provide (and what functions they may provide in the future). This is especially true in relation to what FRBR refers to as its Group 1 entities: work, expression, manifestation, and item.

    The model defined in the FRBR document presents an idealized view that does not reflect the functionality of bibliographic data in library catalogs nor likely system design. This is particularly obvious in the struggle to fit the reality of aggregate works into the Group 1 "structure," but it is true even for simple published resources. The remainder of this document attempts to explain the differences between the ideal and the real.

    The Catalog vs the Universe


    One of the unspoken assumptions in the FRBR document is that it poses its problems and solutions in the context of the larger bibliographic universe, not in terms of a library catalog. The idea of gathering all of the manifestations of an expression and all of the expressions of a work is not shown as contingent on the holdings of any particular library. Similarly, bibliographic relationships are presented as having an existence without addressing how those relationships would be handled when the related works are not available in a data set. This may be due to the fact that the FRBR working group was made up solely of representatives of large research libraries whose individual catalogs cover a significant swath of the bibliographic world. It may also have arisen from the fact that the FRBR working group was formed to address the exchange of data between national libraries, and thus was intended as a universal model. Note that no systems designers were involved in the development of FRBR to address issues that would come up in catalogs of various extents or types.

    The questions asked and answered by the working group were therefore not of the nature of "how would this work in a catalog?" and were more of the type "what is the nature of bibliographic data?". The latter is a perfectly legitimate question for a study of the nature of bibliographic data, but that study cannot be assumed to answer the first question.

    Functionality


    Although the F in FRBR stands for "functional" FRBR does little to address the functionality of the library catalog. The user tasks find, identify, select and obtain (and now explore, added in the LRM) are not explained in terms of how the data aids those tasks; the FRBR document only lists which data elements are essential to each task. Computer system design, including the design of data structures, needs to go at least a step further in its definition of functions, which means not only which data elements are relevant, but the specific usage the data element is put to in an actual machine interaction with the user and services. A systems developer has to take into account precisely what needs to be done with the FRBR entities in all of the system functions, from input to search and display.

    (Note: I'm going to try to cover this better and to give examples in an upcoming post.)

    Analysis that is aimed at creating a bibliographic data format for a library catalog would take into account that providing user-facing information about work and expression is context-dependent based on the holdings of the individual library and on the needs of its users. It would also take into account the optional use of work and expression information in search and display, and possibly give alternate views to support different choices in catalog creation and deployment. Essentially, analysis for a catalog would take system functionality into account.

    A lot of facts about the nature of computer-based catalogs have to be acknowledged: that users are no longer performing “find” in an alphabetical list of headings, but are performing keyword searches; that collocation based on work-ness is not a primary function of catalog displays; that a significant proportion of a bibliographic database consists of items with a single work-expression-manifestation grouping; and finally that there is an inconsistent application of work and expression information in today's data.

    In spite of nearly forty years of using library systems whose default search function is a single box in which users are asked to input query terms that will be searched as keywords taken from a combination of creator, title, and subject fields in the bibliographic record, the LRM doubles down on the status of textual headings as primary elements, aka: Nomen. Unfortunately it doesn't address the search function in any reasonable fashion, which is to say it doesn't give an indication of the role of Nomen in the find function. In fact, here is the sum total of what the LRM says about search:

    "To facilitate this task [find], the information system seeks to enable effective searching by offering appropriate search elements or functionality."


    That's all. As I said in my talk at Oslo, this is up there with the jokes about bad corporate mission statements, like: "We endeavor to enhance future value through intelligent paradigm implementation." First, no information system seeks to enable ineffective searching. Yet the phrase "effective searching" is meaningless in itself; without a definition of what is effective this is just a platitude. The same is true for "appropriate search elements": no one would suggest that a system should use inappropriate search elements, but defining appropriate search is not at all a simple task. In fact, I contend that one of the primary problems with today's library systems is that we specifically lack a definition of appropriate, effective search. This is rendered especially difficult because the data that we enter into our library systems is data that was designed for an entirely different technology: the physical card catalog, organized as headings in alphabetical order.

    One Record to Rule Them All


    Our actual experience regarding physical structures for bibliographic data should be sufficient proof that there is not one single solution. Although libraries today are consolidating around the MARC21 record format, primarily for economic reasons, there have been numerous physical formats in use that mostly adhere to the international standard of ISBD. In this same way, there can be multiple physical formats that adhere to the conceptual model expressed in the FRBR and LRM documents. We know this is the case by looking at the current bibliographic data, which includes varieties of MARC, ISBD, BIBFRAME, and others. Another option for surfacing information about works in catalogs could follow what OCLC seems to be developing, which is the creation of works through a clustering of single-entity records. In that model, a work is a cluster of expressions, and an expression is a cluster of manifestations. This model has the advantage that it does not require the cataloger to make decisions about work and expression statements before it is known if the resource will be the progenitor of a bibliographic family, or will stand alone. It also does not require the cataloger to have knowledge of the bibliographic universe beyond their own catalog.

    The key element of all of these, and countless other, solutions is that they can be faithful to the mental model of FRBR while also being functional and efficient as systems. We should also expect that the systems solutions to this problem space will not stay the same over time, since technology is in constant evolution.
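
    As one illustration, the clustering approach mentioned above could be sketched along the following lines in Python. The records are invented, and the matching keys (author plus title for a work, with language added for an expression) are assumptions made for the example; they are not a description of how OCLC actually builds its clusters.

```python
from collections import defaultdict

# Hypothetical single-entity records, one per manifestation as cataloged.
RECORDS = [
    {"id": "m1", "author": "Author, A.", "title": "Example Novel", "language": "eng", "year": 1951},
    {"id": "m2", "author": "Author, A.", "title": "Example Novel", "language": "eng", "year": 1992},
    {"id": "m3", "author": "Author, A.", "title": "Example Novel", "language": "fre", "year": 1960},
    {"id": "m4", "author": "Writer, B.", "title": "Standalone Report", "language": "eng", "year": 2001},
]


def work_key(record):
    """Assumed work-level match: same author and same title."""
    return (record["author"].lower(), record["title"].lower())


def expression_key(record):
    """Assumed expression-level match: same work, same language."""
    return work_key(record) + (record["language"],)


def cluster(records):
    """Group records as work -> expression -> list of manifestation ids.

    No record states its own work or expression; both levels emerge from
    whatever happens to be in this particular data set.
    """
    works = defaultdict(lambda: defaultdict(list))
    for record in records:
        works[work_key(record)][expression_key(record)].append(record["id"])
    return works


for work, expressions in cluster(RECORDS).items():
    print(work)
    for expression, manifestation_ids in expressions.items():
        print("   ", expression[-1], manifestation_ids)
```

    A resource that stands alone simply ends up as a cluster of one, so nothing has to be decided about it at the time of cataloging.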

    Summary


    I have identified here two quite fundamental areas where FRBR's analysis differs from the needs of system development: 1) the difference between conceptual and physical models and 2) the difference between the (theoretical) bibliographic universe and the functional library catalog. Neither of these is a criticism of FRBR as such, but they do serve as warnings about a widely held assumption in the library world today, which is that of mistaking the FRBR entity model for a data and catalog design model. This is evident in the outcry over the design of the BIBFRAME model which uses a two-tiered bibliographic view and not the three tiers of FRBR. The irony of that complaint is that at the very same time as those outcries, catalogers are using FRBR concepts (as embodied in RDA) while cataloging into the one-tiered data model of MARC, which includes all of the entities of FRBR in a single data record. While cataloging into MARC records may not be the best version of bibliographic data storage that we could come up with, we must acknowledge that there are many possible technology solutions that could allow the exercise of bibliographic control while making use of the concepts addressed in FRBR/LRM. Those solutions must be based at least as much on user needs in actual catalogs as on bibliographic theory.

    As a theory, FRBR posits an ideal bibliographic environment which is not the same as the one that is embodied in any library catalog. The diagrams in the FRBR and LRM documents show the structure of the mental model, but not library catalog data. Because the FRBR document does not address implementation of the model in a catalog, there is no test of how such a model does or does not reflect actual system design. The extrapolation from mental model to physical model is not provided in FRBR or the LRM, as neither addresses system functions and design, not even at a macro level.

    I have to wonder if FRBR/LRM shouldn't be considered a model for bibliography rather than library catalogs. Bibliography was once a common art in the world of letters but that has faded greatly over the last half century. Bibliography is not the same as catalog creation, but one could argue that libraries and librarians are the logical successors to the bibliographers of the past, and that a “universal bibliography” created under the auspices of libraries would provide an ideal context for the entries in the library catalog. This could allow users to view the offerings of a single library as a subset of a well-described world of resources, most of which can be accessed in other libraries and archives.


    Tuesday, October 10, 2017

    Google Books and Mein Kampf

    I hadn't looked at Google Books in a while, or at least not carefully, so I was surprised to find that Google had added blurbs to most of the books. Even more surprising (although perhaps I should say "troubling") is that no source is given for the book blurbs. Some at least come from publisher sites, which means that they are promotional in nature. For example, here's a mildly promotional text about a literary work, from a literary publisher:



    This gives a synopsis of the book, starting with:

    "Throughout a single day in 1892, John Shawnessy recalls the great moments of his life..." 

    It ends by letting the reader know that this was a bestseller when published in 1948, and calls it a "powerful novel."

    The blurb on a 1909 version of Darwin's The Origin of Species is mysterious because the book isn't a recent publication with an online site providing the text. I do not know where this description comes from, but because the entire thrust of this blurb is about the controversy of evolution versus the Bible (even though Darwin did not press this point himself) I'm guessing that the blurb post-dates this particular publication.


    "First published in 1859, this landmark book on evolutionary biology was not the first to deal with the subject, but it went on to become a sensation -- and a controversial one for many religious people who could not reconcile Darwin's science with their faith."
    That's a reasonable view to take of Darwin's "landmark" book but it isn't what I would consider to be faithful to the full import of this tome.

    The blurb on Hitler's Mein Kampf is particularly troubling. If you look at different versions of the book you get both pro- and anti-Nazi sentiments, neither of which really belongs on a site that claims to be a catalog of books. Also note that because each book entry has only one blurb, the tone changes considerably depending on which publication you happen to pick from the list.


    First on the list:
    "Settling Accounts became Mein Kampf, an unparalleled example of muddled economics and history, appalling bigotry, and an intense self-glorification of Adolf Hitler as the true founder and builder of the National Socialist movement. It was written in hate and it contained a blueprint for violent bloodshed."

    Second on the list:
    "This book has set a path toward a much higher understanding of the self and of our magnificent destiny as living beings part of this Race on our planet. It shows us that we must not look at nature in terms of good or bad, but in an unfiltered manner. It describes what we must do if we want to survive as a people and as a Race."
    That's horrifying. Note that both books are self-published, and the blurbs are the ones that I find on those books in Amazon, perhaps indicating that Google is sucking up books from the Amazon site. There is, or at least at one point there was, a difference between Amazon and Google Books. Google, after all, scanned books in libraries and presented itself as a search engine for published texts; Amazon will sell you Trump's tweets on toilet paper. The only text on the Google Books page still claims that Google Books is about search: "Search the world's most comprehensive index of full-text books." Libraries partnered with Google with lofty promises of gains in scholarship:
    "Our participation in the Google Books Library Project will add significantly to the extensive digital resources the Libraries already deliver. It will enable the Libraries to make available more significant portions of its extraordinary archival and special collections to scholars and researchers worldwide in ways that will ultimately change the nature of scholarship." Jim Neal, Columbia University
    I don't know how these folks now feel about having their texts intermingled with publications they would never buy and described by texts that may come from shady and unreliable sources.

    Even leaving aside the grossest aspects of the blurbs and Google's hypocrisy about its commercialization of its books project, adding blurbs to the book entries with no attribution and clearly not vetting the sources is extremely irresponsible. It's also very Google to create sloppy algorithms that illustrate their basic ignorance of the content they are working with -- in this case, the world's books.

    Tuesday, August 08, 2017

    On reading Library Journal, September, 1877

    Among the many advantages of retirement is the particular one of idle time. And I will say that as a librarian one could do no better than to spend some of that time communing with the history of the profession. The difficulty is that it is so rich, so familiar in many ways that it is hard to move through it quickly. Here is just a fraction of the potential value to be found in the September issue of volume two of Library Journal.* Admittedly this is a particularly interesting number because it reports on the second meeting of the American Library Association.

    For any student of library history it is especially interesting to encounter certain names as living, working members of the profession.



    Other names reflect works that continued on, some until today, such as Poole and Bowker, both names associated with long-running periodical indexes.

    What is particularly striking, though, is how many of the topics of today were already being discussed then, although obviously in a different context. The association was formed, at least in part, to help librarianship achieve the status of a profession. Discussed were the educating of the public on the role of libraries and librarians as well as providing education so that there could be a group of professionals to take the jobs that needed that professional knowledge. There was work to be done to convince state legislatures to support state and local libraries.

    One of the first acts of the American Library Association when it was founded in 1876 (as reported in the first issue of Library Journal) was to create a Committee on Cooperation. This is the seed for today's cooperative cataloging efforts as well as other forms of sharing among libraries. In 1877, undoubtedly encouraged by the participation of some members of the publishing community in ALA, there was hope that libraries and publishers would work together to create catalog entries for in-print works.
    This is one hope of the early participants that we are still working on, especially the desire that such catalog copy would be "uniform." Note that there were also discussions about having librarians contribute to the periodical indexes of R. R. Bowker and Poole, so the cooperation would flow in both directions.

    The physical organization of libraries also was of interest, and a detailed plan for a round (actually octagonal) library design was presented:
    His conclusion, however, shows a difference in our concepts of user privacy.
    Especially interesting to me are the discussions of library technology. I was unaware of some of the emerging technologies for reproduction such as the papyrograph and the electric pen. In 1877, the big question, though, was whether to employ the new (but as yet un-perfected) technology of the typewriter in library practice.

    There was some pooh-poohing of this new technology, but some members felt it might be reaching a state of usefulness.


    "The President" in this case is Justin Winsor, Superintendent of the Boston Library, then president of the American Library Association. Substituting more modern technologies, I suspect we have all taken part in this discussion during our careers.

    Reading through the Journal evokes a strong sense of "le plus ça change..." but I admit that I find it all rather reassuring. The historical beginnings give me a sense of why we are who we are today, and what factors are behind some of our embedded thinking on topics.


    * Many of the early volumes are available from HathiTrust, if you have access. Although the texts themselves are public domain, these are Google-digitized books and are not available without a login. (Don't get me started!) If you do not have access to those, most of the volumes are available through the Internet Archive. Select "text" and search on "library journal". As someone without HathiTrust institutional access I have found most numbers in the range 1-39, but am missing (hint, hint): 5/1880; 8-9/1887-88; 17/1892; 19/1894; 28-30/1903-1905; 34-37/1909-1912. If I can complete the run I think it would be good to create a compressed archive of the whole and make that available via the Internet Archive to save others the time of acquiring them one at a time. If I can find the remainder that are pre-1923 I will add those in.