Tuesday, August 14, 2018

Libraryland, We Have a Problem


The first rule of every multi-step program is to admit that you have a problem. I think it's time for us librarians to take step one and admit that we do have a problem.

The particular problem that I have in mind is the disconnect between library data and library systems in relation to the category of metadata that libraries call "headings." Headings are the strings in the library data that represent those entities that would be entry points in a linear catalog like a card catalog.

It pains me whenever I am an observer to cataloger discussions on the proper formation of headings for items that they are cataloging. The pain point is that I know that the value of those headings is completely lost in the library systems of today, and therefore there are countless hours of skilled cataloger time that are being wasted.

The Heading


Both book and card catalogs were catalogs of headings. The catalog entry was a heading followed by one or more bibliographic entries. Unfortunately, the headings serve multiple purposes, which is generally not a good data practice but is due to the need for parsimony in library data when that data was analog, as in book and card catalogs.

  • A heading is a unique character string for the "thing" – the person, the corporate body, the family – essentially an identifier.
Tolkien, J. R. R. (John Ronald Reuel), 1892-1973
  • It supports the selection of the entity in the catalog from among the choices that are presented (although in some cases the effectiveness of this is questionable)


  • It is an access point, intended to be the means of finding, within the catalog, those items held by the library that meet the need of the user.
  • It provides the sort order for the catalog entries (which is why you see inverted forms like "Tolkien, J. R. R.")
United States. Department of State. Bureau for Refugee Programs
United States. Department of State. Bureau of Administration
United States. Department of State. Bureau of Administration and Security
United States. Department of State. Bureau of African Affairs
    • That sort order, and those inverted headings, also have a purpose of collocation of entries by some measure of "likeness"
    Tolkien, J. R. R. (John Ronald Reuel), 1892-1973
    Tolkien Society
    Tolkien Trust
    The last three functions, providing a sort order, access, and collocation, have been lost in the online catalog. The reasons for this are many, but the main explanation is that keyword searching has replaced alphabetical browse as a way to locate items in a library catalog.

    The upshot is that many hours are spent during the cataloging process to formulate a left-anchored, alphabetically order-able heading that has no functionality in library catalogs other than as fodder for a context-less keyword search.

    Once a keyword search is done the resulting items are retrieved without any correlation to headings. It may not even be clear which headings one would use to create a useful order. The set of retrieved bibliographic resources from a single keyword search may not provide a coherent knowledge graph. Here's an illustration using the keyword "darwin":

    Gardiner, Anne.
    Melding of two spirits : from the "Yiminga" of the Tiwi to the "Yiminga" of Christianity / by Anne Gardiner ; art work by
    Darwin : State Library of the Northern Territory, 1993.
    Christianity--Australia--Northern Territory.
    Tiwi (Australian people)--Religion.
    Northern Territory--Religion.

    Crabb, William Darwin.
    Lyrics of the golden west. By W. D. Crabb.
    San Francisco, The Whitaker & Ray company, 1898
    West (U.S.)--Poetry.

    Darwin, Charles, 1809-1882.
    Origin of species by means of natural selection; or, The preservation of favored races in the struggle for life and The descent of man and selection in relation to sex, by Charles Darwin.
    New York, The Modern library [1936]
    Evolution (Biology)
    Natural selection.
    Heredity.
    Human evolution.

    Bear, Greg, 1951-
    Darwin's radio / Greg Bear.
    New York : Ballantine Books, 2003.
    Women molecular biologists--Fiction.
    DNA viruses--Fiction.

    No matter what you would choose as a heading on which to order these, it will not produce a sensible collocation that would give users some context to understand the meaning of this particular set of items – and that is because there is no meaning to this set of items, just a coincidence of things named "Darwin."
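The difference between the two kinds of retrieval can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` class and the functions `keyword_search` and `heading_browse` are hypothetical names, not the behavior of any actual library system, and the keyword index is reduced to a simple word match across all fields.

```python
import re
from dataclasses import dataclass


@dataclass
class Record:
    heading: str        # main (creator) heading, as it would be filed in a card catalog
    title: str
    imprint: str
    subjects: tuple = ()


# The four records retrieved by the keyword "darwin" in the illustration above.
RECORDS = [
    Record("Gardiner, Anne.",
           'Melding of two spirits : from the "Yiminga" of the Tiwi to the "Yiminga" of Christianity',
           "Darwin : State Library of the Northern Territory, 1993.",
           ("Christianity--Australia--Northern Territory.",
            "Tiwi (Australian people)--Religion.",
            "Northern Territory--Religion.")),
    Record("Crabb, William Darwin.",
           "Lyrics of the golden west.",
           "San Francisco, The Whitaker & Ray company, 1898",
           ("West (U.S.)--Poetry.",)),
    Record("Darwin, Charles, 1809-1882.",
           "Origin of species by means of natural selection",
           "New York, The Modern library [1936]",
           ("Evolution (Biology)", "Natural selection.", "Heredity.", "Human evolution.")),
    Record("Bear, Greg, 1951-",
           "Darwin's radio / Greg Bear.",
           "New York : Ballantine Books, 2003.",
           ("Women molecular biologists--Fiction.", "DNA viruses--Fiction.")),
]


def tokens(text):
    """Split text into lower-cased words (a stand-in for a real keyword index)."""
    return re.findall(r"\w+", text.casefold())


def keyword_search(records, term):
    """Match the term as a word anywhere in the record, ignoring which field it came from."""
    term = term.casefold()
    return [r for r in records
            if term in tokens(" ".join((r.heading, r.title, r.imprint, *r.subjects)))]


def heading_browse(records, prefix):
    """Left-anchored match on the heading only, returned in heading order."""
    prefix = prefix.casefold()
    hits = [r for r in records if r.heading.casefold().startswith(prefix)]
    return sorted(hits, key=lambda r: r.heading.casefold())


keyword_hits = keyword_search(RECORDS, "darwin")
browse_hits = heading_browse(RECORDS, "darwin")
print(len(keyword_hits))                  # 4
print([r.heading for r in browse_hits])   # ['Darwin, Charles, 1809-1882.']
```

The keyword search returns all four records because "Darwin" appears as a place of publication, a middle name, a surname, and a title word. The left-anchored browse returns only the heading that files under "Darwin".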

    Headings that have been chosen to be controlled strings should offer a more predictable user search experience than free text searching, but headings do not necessarily provide collocation. As an example, Wikipedia uses the names of its pages as headings, and there are some rules (or at least preferred practices) to make the headings sensible. A search in Wikipedia is a left-to-right search on a heading string that is presented as a drop-down list of a handful of headings that match the search string:




    Included in the headings in the drop-down are "see"-type terms that, when selected, take the user directly to the entry for the preferred term. If there is no one preferred term Wikipedia directs users to disambiguation pages to help users select among similar headings:


    The Wikipedia pages, however, only provide accidental collocation, not the more comprehensive collocation that libraries aim to attain. That library-designed collocation, however, is also the source of the inversion of headings, making those strings unnatural and unintuitive for users. Although the library headings are admirably rules based, they often use rules that will not be known to many users of the catalog, such as the difference in name headings with prepositions based on the language of the author. To search on these names, one therefore needs to know the language of the author and the rule that is applied to that language, something that I am quite sure we can assume is not common knowledge among catalog users.

    De la Cruz, Melissa
    Cervantes Saavedra, Miguel de
    I may be the only patron of my small library branch that has known to look for the mysteries by Icelandic author Arnaldur Indriðason under "A" not "I".

    What Is To Be Done?


    There isn't an easy (or perhaps not even a hard) answer. As long as humans use words to describe their queries we will have the problem that words and concepts, and words and relationships between concepts, do not neatly coincide.

    I see a few techniques that might be used if we wish to save collocation by heading. One would be to allow keyword searching but for the system to use that to suggest headings that then can be used to view collocated works. Some systems do allow users to retrieve headings by keyword, but headings, which are very terse, are often not self-explanatory without the items they describe. A browse of headings alone is much less helpful than the association of the heading with the bibliographic data it describes. Remember that headings were developed for the card catalog where they were printed on the same card that carried the bibliographic description.
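As a rough sketch of the first technique, the Python below matches a keyword against headings only and then shows each suggested heading together with the titles filed under it. Everything here is an assumption made for illustration: the function names `build_heading_index` and `suggest_headings` are hypothetical, the data is a handful of records from the "darwin" example, and a real system would need a proper index rather than a scan.

```python
import re
from collections import defaultdict

# (title, headings) pairs taken from the "darwin" illustration earlier in this post.
CATALOG = [
    ("Lyrics of the golden west.",
     ["Crabb, William Darwin.", "West (U.S.)--Poetry."]),
    ("Origin of species by means of natural selection",
     ["Darwin, Charles, 1809-1882.", "Evolution (Biology)",
      "Natural selection.", "Heredity.", "Human evolution."]),
    ("Darwin's radio",
     ["Bear, Greg, 1951-", "Women molecular biologists--Fiction.",
      "DNA viruses--Fiction."]),
]


def build_heading_index(catalog):
    """Map each heading to the titles filed under it."""
    index = defaultdict(list)
    for title, headings in catalog:
        for heading in headings:
            index[heading].append(title)
    return index


def suggest_headings(index, keyword):
    """Return, in filing order, the headings that contain the keyword as a word."""
    keyword = keyword.casefold()
    hits = [h for h in index if keyword in re.findall(r"\w+", h.casefold())]
    return sorted(hits, key=str.casefold)


index = build_heading_index(CATALOG)
suggestions = suggest_headings(index, "darwin")

# Show each suggested heading with the items it describes, not the heading alone.
for heading in suggestions:
    print(heading)
    for title in index[heading]:
        print("    " + title)
```

Note that "Darwin's radio" is not suggested for the keyword "darwin", because the word appears in its title and not in any of its headings. Whether that is the desired behavior is exactly the kind of design decision a catalog would have to make.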

    Another possible area of investigation would be to look to the classified catalog, a technique that has existed alongside alphabetical catalogs for centuries. The Decimal Classification of Dewey was a classified approach to knowledge with a language-based index (his "Relativ Index") to the classes. (It is odd that the current practice in US libraries is to have one classification for items on shelves and an unrelated heading system (LCSH) for subject access.)
    The classification provides the intellectual collocation that the headings themselves do not provide. The difficulty with this is that the classification collocates topically but, at least in its current form, does not collocate the name headings in the catalog that identify people and organizations as entities.

    Conclusion (sort of)

    Controlled headings as access points for library catalogs could provide better service than keyword search alone. How to make use of headings is a difficult question. The first issue is how to exploit the precision of headings while still allowing users to search on any terms that they have in mind. Keyword search is, from the user's point of view, frictionless. They don't have to think "what string would the library have used for this?".

    Collocation of items by topical sameness or other relationships (e.g. "named for", "subordinate to") is possibly the best service that libraries could provide, although it is very hard to do this through the mechanism of language strings. Dewey's original idea of a classified order with a language-based index is still a good one, although classifications are hard to maintain and hard to assign.

    If challenged to state what I think the library catalog should be, my answer would be that it should provide a useful order that illustrates one or more intellectual contexts that will help the user enter and navigate what the library has to offer. Unfortunately I can't say today how we could do that. Could we think about that together?

    Readings

    Dewey, Melvil. Decimal classification and relativ index for libraries, clippings, notes, etc. Edition 7. Lake Placid Club, NY., Forest Press, 1911. https://archive.org/details/decimalclassifi00dewegoog

    Shera, Jesse H., Margaret E. Egan, and Jeannette M. Lynn. The Classified Catalog: Basic Principles and Practices. Chicago, Ill.: American Library Association, 1956.




    Monday, August 06, 2018

    FRBR as a Data Model


    (I've been railing against FRBR since it was first introduced. It still confuses me some. I put out these ideas for discussion. If you disagree, please add your thoughts to this post.)

    I was recently speaking at a library conference in Oslo where I went through my criticisms of our cataloging models, and how they are not suited to the problems we need to solve today. I had my usual strong criticisms of FRBR and the IFLA LRM. However, when I finished speaking I was asked why I am so critical of those models, which means that I did not explain myself well. I am going to try again here, as clearly and succinctly as I can.

    Conflation of Conceptual Models with Data Models


    FRBR's main impact was that it provided a mental model of the bibliographic universe that reflects a conceptual view of the elements of descriptive cataloging. You will find nothing in FRBR that could not be found in standard library cataloging of the 1990s, which is when the FRBR model was developed. What FRBR adds to our understanding of bibliographic information is that it gives names and definitions to key concepts that had been implied but not fully articulated in library catalog data. If it had stopped there we would have had an interesting mental model that allows us to speak more precisely about catalogs and cataloging.

    Unfortunately, the use of diagrams that appear to define actual data models and the listing of entities and their attributes have led the library world down the wrong path, that of reading FRBR as the definition of a physical data model. Compounding this, the LRM goes down that path even further by claiming to be a structural model of bibliographic data, which implies that it is the structure for library catalog data. I maintain that the FRBR conceptual model should not be assumed to also be a model for bibliographic data in a machine-readable form. The main reason for this has to do with the functionality that library catalogs currently provide (and what functions they may provide in the future). This is especially true in relation to what FRBR refers to as its Group 1 entities: work, expression, manifestation, and item.

    The model defined in the FRBR document presents an idealized view that does not reflect the functionality of bibliographic data in library catalogs nor likely system design. This is particularly obvious in the struggle to fit the reality of aggregate works into the Group 1 "structure," but it is true even for simple published resources. The remainder of this document attempts to explain the differences between the ideal and the real.

    The Catalog vs the Universe


    One of the unspoken assumptions in the FRBR document is that it poses its problems and solutions in the context of the larger bibliographic universe, not in terms of a library catalog. The idea of gathering all of the manifestations of an expression and all of the expressions of a work is not shown as contingent on the holdings of any particular library. Similarly, bibliographic relationships are presented as having an existence without addressing how those relationships would be handled when the related works are not available in a data set. This may be due to the fact that the FRBR working group was made up solely of representatives of large research libraries whose individual catalogs cover a significant swath of the bibliographic world. It may also have arisen from the fact that the FRBR working group was formed to address the exchange of data between national libraries, and thus was intended as a universal model. Note that no systems designers were involved in the development of FRBR to address issues that would come up in catalogs of various extents or types.

    The questions asked and answered by the working group were therefore not of the nature of "how would this work in a catalog?" and were more of the type "what is the nature of bibliographic data?". The latter is a perfectly legitimate question for a study of the nature of bibliographic data, but that study cannot be assumed to answer the first question.

    Functionality


    Although the F in FRBR stands for "functional," FRBR does little to address the functionality of the library catalog. The user tasks find, identify, select and obtain (and now explore, added in the LRM) are not explained in terms of how the data aids those tasks; the FRBR document only lists which data elements are essential to each task. Computer system design, including the design of data structures, needs to go at least a step further in its definition of functions, which means defining not only which data elements are relevant, but the specific use each data element is put to in an actual machine interaction with the user and services. A systems developer has to take into account precisely what needs to be done with the FRBR entities in all of the system functions, from input to search and display.

    (Note: I'm going to try to cover this better and to give examples in an upcoming post.)

    Analysis that is aimed at creating a bibliographic data format for a library catalog would take into account that providing user-facing information about work and expression is context-dependent based on the holdings of the individual library and on the needs of its users. It would also take into account the optional use of work and expression information in search and display, and possibly give alternate views to support different choices in catalog creation and deployment. Essentially, analysis for a catalog would take system functionality into account.

    A number of facts about the nature of computer-based catalogs have to be acknowledged: that users are no longer performing “find” in an alphabetical list of headings, but are performing keyword searches; that collocation based on work-ness is not a primary function of catalog displays; that a significant proportion of a bibliographic database consists of items with a single work-expression-manifestation grouping; and finally that there is an inconsistent application of work and expression information in today's data.

    In spite of nearly forty years of using library systems whose default search function is a single box in which users are asked to input query terms that will be searched as keywords taken from a combination of creator, title, and subject fields in the bibliographic record, the LRM doubles down on the status of textual headings as primary elements, aka: Nomen. Unfortunately it doesn't address the search function in any reasonable fashion, which is to say it doesn't give an indication of the role of Nomen in the find function. In fact, here is the sum total of what the LRM says about search:

    "To facilitate this task [find], the information system seeks to enable effective searching by offering appropriate search elements or functionality."


    That's all. As I said in my talk in Oslo, this is up there with the jokes about bad corporate mission statements, like: "We endeavor to enhance future value through intelligent paradigm implementation." First, no information system seeks ineffective searching. Yet the phrase "effective searching" is meaningless in itself; without a definition of what is effective this is just a platitude. The same is true for "appropriate search elements": no one would suggest that a system should use inappropriate search elements, but defining appropriate search is not at all a simple task. In fact, I contend that one of the primary problems with today's library systems is that we specifically lack a definition of appropriate, effective search. This is rendered especially difficult because the data that we enter into our library systems is data that was designed for an entirely different technology: the physical card catalog, organized as headings in alphabetical order.

    One Record to Rule Them All


    Our actual experience regarding physical structures for bibliographic data should be sufficient proof that there is not one single solution. Although libraries today are consolidating around the MARC21 record format, primarily for economic reasons, there have been numerous physical formats in use that mostly adhere to the international standard of ISBD. In this same way, there can be multiple physical formats that adhere to the conceptual model expressed in the FRBR and LRM documents. We know this is the case by looking at the current bibliographic data, which includes varieties of MARC, ISBD, BIBFRAME, and others. Another option for surfacing information about works in catalogs could follow what OCLC seems to be developing, which is the creation of works through a clustering of single-entity records. In that model, a work is a cluster of expressions, and an expression is a cluster of manifestations. This model has the advantage that it does not require the cataloger to make decisions about work and expression statements before it is known if the resource will be the progenitor of a bibliographic family, or will stand alone. It also does not require the cataloger to have knowledge of the bibliographic universe beyond their own catalog.
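To make the clustering idea concrete, here is a minimal Python sketch. It is not OCLC's algorithm, which is not described here; the record fields, the matching keys, and the function names `normalize` and `cluster` are all assumptions chosen for illustration, and the data is deliberately generic. Real clustering would need far more forgiving matching than a normalized author and title.

```python
import re
from collections import defaultdict

# Hypothetical manifestation-level records, one per published edition.
# Assumption: a translation carries the original title so that it can be matched.
RECORDS = [
    {"author": "Author A", "title": "Title X",  "language": "eng", "publication": "Publisher 1, 1990"},
    {"author": "Author A", "title": "Title X.", "language": "eng", "publication": "Publisher 2, 2005"},
    {"author": "Author A", "title": "Title X",  "language": "fre", "publication": "Publisher 3, 1998"},
    {"author": "Author B", "title": "Title Y",  "language": "eng", "publication": "Publisher 1, 2001"},
]


def normalize(text):
    """Lower-case and strip punctuation so trivially different strings match."""
    return " ".join(re.findall(r"\w+", text.casefold()))


def cluster(records):
    """Group manifestations into expressions (by language) within works (by author and title)."""
    works = defaultdict(lambda: defaultdict(list))
    for rec in records:
        work_key = (normalize(rec["author"]), normalize(rec["title"]))
        expression_key = rec["language"]
        works[work_key][expression_key].append(rec["publication"])
    return works


works = cluster(RECORDS)
for work_key, expressions in works.items():
    print(work_key)
    for language, manifestations in expressions.items():
        print("   ", language, manifestations)
```

In this sketch no one decides in advance that "Title X" is a work with two expressions; that structure emerges from the records that happen to be in the data set, and "Title Y" simply stands alone as a cluster of one.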

    The key element of all of these, and countless other, solutions is that they can be faithful to the mental model of FRBR while also being functional and efficient as systems. We should also expect that the systems solutions to this problem space will not stay the same over time, since technology is in constant evolution.

    Summary


    I have identified here two quite fundamental areas where FRBR's analysis differs from the needs of system development: 1) the difference between conceptual and physical models and 2) the difference between the (theoretical) bibliographic universe and the functional library catalog. Neither of these is a criticism of FRBR as such, but they do serve as warnings about a widely held assumption in the library world today, which is that of mistaking the FRBR entity model for a data and catalog design model. This is evident in the outcry over the design of the BIBFRAME model, which uses a two-tiered bibliographic view and not the three tiers of FRBR. The irony of that complaint is that at the very same time as those outcries, catalogers are using FRBR concepts (as embodied in RDA) while cataloging into the one-tiered data model of MARC, which includes all of the entities of FRBR in a single data record. While cataloging into MARC records may not be the best version of bibliographic data storage that we could come up with, we must acknowledge that there are many possible technology solutions that could allow the exercise of bibliographic control while making use of the concepts addressed in FRBR/LRM. Those solutions must be based at least as much on user needs in actual catalogs as on bibliographic theory.

    As a theory, FRBR posits an ideal bibliographic environment which is not the same as the one that is embodied in any library catalog. The diagrams in the FRBR and LRM documents show the structure of the mental model, but not library catalog data. Because the FRBR document does not address implementation of the model in a catalog, there is no test of how such a model does or does not reflect actual system design. The extrapolation from mental model to physical model is not provided in FRBR or the LRM, as neither addresses system functions and design, not even at a macro level.

    I have to wonder if FRBR/LRM shouldn't be considered a model for bibliography rather than library catalogs. Bibliography was once a common art in the world of letters but that has faded greatly over the last half century. Bibliography is not the same as catalog creation, but one could argue that libraries and librarians are the logical successors to the bibliographers of the past, and that a “universal bibliography” created under the auspices of libraries would provide an ideal context for the entries in the library catalog. This could allow users to view the offerings of a single library as a subset of a well-described world of resources, most of which can be accessed in other libraries and archives.


    Tuesday, October 10, 2017

    Google Books and Mein Kampf

    I hadn't looked at Google Books in a while, or at least not carefully, so I was surprised to find that Google had added blurbs to most of the books. Even more surprising (although perhaps I should say "troubling") is that no source is given for the book blurbs. Some at least come from publisher sites, which means that they are promotional in nature. For example, here's a mildly promotional text about a literary work, from a literary publisher:



    This gives a synopsis of the book, starting with:

    "Throughout a single day in 1892, John Shawnessy recalls the great moments of his life..." 

    It ends by letting the reader know that this was a bestseller when published in 1948, and calls it a "powerful novel."

    The blurb on a 1909 version of Darwin's The Origin of Species is mysterious because the book isn't a recent publication with an online site providing the text. I do not know where this description comes from, but because the entire thrust of this blurb is about the controversy of evolution versus the Bible (even though Darwin did not press this point himself) I'm guessing that the blurb post-dates this particular publication.


    "First published in 1859, this landmark book on evolutionary biology was not the first to deal with the subject, but it went on to become a sensation -- and a controversial one for many religious people who could not reconcile Darwin's science with their faith."
    That's a reasonable view to take of Darwin's "landmark" book but it isn't what I would consider to be faithful to the full import of this tome.

    The blurb on Hitler's Mein Kampf is particularly troubling. If you look at different versions of the book you get both pro- and anti-Nazi sentiments, neither of which really belongs on a site that claims to be a catalog of books. Also note that because each book entry has only one blurb, the tone changes considerably depending on which publication you happen to pick from the list.


    First on the list:
    "Settling Accounts became Mein Kampf, an unparalleled example of muddled economics and history, appalling bigotry, and an intense self-glorification of Adolf Hitler as the true founder and builder of the National Socialist movement. It was written in hate and it contained a blueprint for violent bloodshed."

    Second on the list:
    "This book has set a path toward a much higher understanding of the self and of our magnificent destiny as living beings part of this Race on our planet. It shows us that we must not look at nature in terms of good or bad, but in an unfiltered manner. It describes what we must do if we want to survive as a people and as a Race."
    That's horrifying. Note that both books are self-published, and the blurbs are the ones that I find on those books in Amazon, perhaps indicating that Google is sucking up books from the Amazon site. There is, or at least there once was, a difference between Amazon and Google Books. Google, after all, scanned books in libraries and presented itself as a search engine for published texts; Amazon will sell you Trump's tweets on toilet paper. The only text on the Google Books page still claims that Google Books is about search: "Search the world's most comprehensive index of full-text books." Libraries partnered with Google with lofty promises of gains in scholarship:
    "Our participation in the Google Books Library Project will add significantly to the extensive digital resources the Libraries already deliver. It will enable the Libraries to make available more significant portions of its extraordinary archival and special collections to scholars and researchers worldwide in ways that will ultimately change the nature of scholarship." Jim Neal, Columbia University
    I don't know how these folks now feel about having their texts intermingled with publications they would never buy and described by texts that may come from shady and unreliable sources.

    Even leaving aside the grossest aspects of the blurbs and Google's hypocrisy about its commercialization of its books project, adding blurbs to the book entries with no attribution and clearly not vetting the sources is extremely irresponsible. It's also very Google to create sloppy algorithms that illustrate their basic ignorance of the content they are working with -- in this case, the world's books.

    Tuesday, August 08, 2017

    On reading Library Journal, September, 1877

    Among the many advantages of retirement is the particular one of idle time. And I will say that as a librarian one could do no better than to spend some of that time communing with the history of the profession. The difficulty is that it is so rich, so familiar in many ways that it is hard to move through it quickly. Here is just a fraction of the potential value to be found in the September issue of volume two of Library Journal.* Admittedly this is a particularly interesting number because it reports on the second meeting of the American Library Association.

    For any student of library history it is especially interesting to encounter certain names as living, working members of the profession.



    Other names reflect works that continued on, some until today, such as Poole and Bowker, both names associated with long-running periodical indexes.

    What is particularly striking, though, is how many of the topics of today were already being discussed then, although obviously in a different context. The association was formed, at least in part, to help librarianship achieve the status of a profession. Discussed were the educating of the public on the role of libraries and librarians as well as providing education so that there could be a group of professionals to take the jobs that needed that professional knowledge. There was work to be done to convince state legislatures to support state and local libraries.

    One of the first acts of the American Library Association when it was founded in 1876 (as reported in the first issue of Library Journal) was to create a Committee on Cooperation. This is the seed for today's cooperative cataloging efforts as well as other forms of sharing among libraries. In 1877, undoubtedly encouraged by the participation of some members of the publishing community in ALA, there was hope that libraries and publishers would work together to create catalog entries for in-print works.
    This is one hope of the early participants that we are still working on, especially the desire that such catalog copy would be "uniform." Note that there were also discussions about having librarians contribute to the periodical indexes of R. R. Bowker and Poole, so the cooperation would flow in both directions.

    The physical organization of libraries also was of interest, and a detailed plan for a round (actually octagonal) library design was presented:
    His conclusion, however, shows a difference in our concepts of user privacy.
    Especially interesting to me are the discussions of library technology. I was unaware of some of the emerging technologies for reproduction such as the papyrograph and the electric pen. In 1877, the big question, though, was whether to employ the new (but as yet un-perfected) technology of the typewriter in library practice.

    There was some pooh-poohing of this new technology, but some members felt it might be reaching a state of usefulness.


    "The President" in this case is Justin Winsor, Superintendent of the Boston Library, then president of the American Library Association. Substituting more modern technologies, I suspect we have all taken part in this discussion during our careers.

    Reading through the Journal evokes a strong sense of "plus ça change..." but I admit that I find it all rather reassuring. The historical beginnings give me a sense of why we are who we are today, and what factors are behind some of our embedded thinking on topics.


    * Many of the early volumes are available from HathiTrust, if you have access. Although the texts themselves are public domain, these are Google-digitized books and are not available without a login. (Don't get me started!) If you do not have access to those, most of the volumes are available through the Internet Archive. Select "text" and search on "library journal". As someone without HathiTrust institutional access I have found most numbers in the range 1-39, but am missing (hint, hint): 5/1880; 8-9/1887-88; 17/1892; 19/1894; 28-30/1903-1905; 34-37/1909-1912. If I can complete the run I think it would be good to create a compressed archive of the whole and make that available via the Internet Archive to save others the time of acquiring them one at a time. If I can find the remainder that are pre-1923 I will add those in.

    Sunday, July 09, 2017

    The Work

    I've been on a committee that was tasked by the Program for Cooperative Cataloging folks(*) to help them understand some of the issues around works (as defined in FRBR, RDA, BIBFRAME, etc.). There are huge complications, not the least being that we all are hard-pressed to define what a work is, much less how it should be addressed in some as-yet-unrealized future library system. Some of what I've come to understand may be obvious to you, especially if you are a cataloger who provides authority data for your own catalog or the shared environment. Still, I thought it would be good to capture these thoughts. Of course, I welcome comments and further insights on this.



    There are at least four different meanings to the term work as it is being discussed in library venues.

    "Work-ness"

    First there is the concept that every resource embodies something that could be called a "work" and that this work is a human creation. The idea of the work probably dates back as far as the recognition that humans create things, and that those things have meaning. There is no doubt that there is "work-ness" in all created things, although prior to FRBR there was little attempt to formally define it as an aspect of bibliographic description. It entered into cataloging consciousness in the 20th century: Patrick Wilson saw works as families of resources that grow and branch with each related publication;[1] Richard Smiraglia looked at works as a function of time;[2] and Seymour Lubetzky seems to have been the first to insist on viewing the work as intellectual content separate from the physical piece.[3]

    "Work Description"

    Second, there is the work in the bibliographic description: the RDA cataloging rules define the attributes or data elements that make up the work description, like the names of creators and the subject matter of the resource. Catalogers include these elements in descriptive cataloging even when the work is not defined as a stand-alone entity, as in the case of doing RDA cataloging in a MARC21 record environment. Most of the description of works is not new; creators and subjects have been assigned to cataloged items for a century or more. What is changed is that conceptually these are considered to be elements of the work that is inherent in the resource that is being cataloged but not limited to the item in hand.

    It is this work description that is addressed in FRBR. The FRBR document of 1998 describes the scope of its entities as solely bibliographic, specifically excluding authority data:
    "The present study does not analyse those additional data associated with persons, corporate bodies, works, and subjects that are typically recorded only in authority records."
    Notably, FRBR is silent on the question of whether the work description is unique within the catalog, which would be implied by the creation of a work authority "record".

    "Work Decision"

    Next there is the work decision: this is the situation when a data creator determines whether the work to be described needs a unique and unifying entry within the stated cataloging environment to bring together exemplars of the same work that may be described differently. If so, the cataloger defines the authoritative identity for the work and provides information that distinguishes that work from all other works, and that brings together all of the variations of that work. The headings ("uniform titles") that are created also serve to disambiguate expressions of the same work by adding dates, languages, and other elements of the expression. To back all of this up, the cataloger gives evidence of his/her decision, primarily what sources were consulted that support the decision.

    In today's catalog, a full work decision, resulting in a work authority record, is done for only a small number of works, with the exception of musical works where such titles are created for nearly all. The need to make the work decision may vary from catalog to catalog and can depend on whether the library holds multiple expressions of the work or other works that may need clarification in the catalog. Note that there is nothing in FRBR that would indicate that every work must have a unique description, just that works should be described. However, some have assumed that the FRBR work is always a representation of a unique creation. I don't find that expressed in FRBR nor the FRBR-LRM.

    "Work Entity"

    Finally there is the work entity: this is a data structure that encapsulates the description of the work. This data structure could be realized in any number of different encodings, such as ISO 2709 (the underlying record structure for MARC21), RDF, XML, or JSON. The latter two can also accommodate linked data in the form of RDF/XML or JSON-LD.

    Here we have a complication in our current environment because the main encodings of bibliographic data, MARC21 and BIBFRAME, both differ from the work concept presented in FRBR and in the RDA cataloging rules, which follow FRBR fairly faithfully. With a few exceptions, MARC21 does not distinguish work elements from expression or manifestation elements. Encoding RDA-defined data in the MARC21 "unit record" can be seen as proof of the conceptual nature of the work (and expression and manifestation) as defined in FRBR.

    BIBFRAME, the proposed replacement for MARC21, has re-imagined the bibliographic work entity, departing from the entity breakdown in FRBR by defining a BIBFRAME work entity that tends to combine elements from FRBR's work and expression. However, where FRBR claims a neat division between the entities, with no overlapping descriptive elements, BIBFRAME 2.0 is being designed as a general bibliographic model, not an implementation of FRBR. (Whether or not BIBFRAME achieves this goal is another question.)

    The diagrams in the 1998 FRBR report imply that there would be a work entity structure. However, the report also states unequivocally that it is not defining a data format.(**) In keeping with 1990s library technology, FRBR anticipates that each entity may have an identifier, but the identifier is a descriptive element (think: ISBN), not an anchor for all of the data elements of the entity (think: IRI).

    As we see with the implementation of RDA cataloging in the MARC21 environment, describing a work conceptually does not require the use of a separate work "record." Whether work decisions are required for every cataloged manifestation is a cataloging decision; whether work entities are required for every work is a data design decision. That design decision should be based on the services that the system is expected to render.  The "entity" decision may or may not require any action on the part of the cataloger depending on the interface in which cataloging takes place. Just as today's systems do not store the MARC21 data as it appears on the cataloger's screen, future systems will have internal data storage formats that will surely differ from the view in the various user interfaces.

    "The Upshot"

    We can assume that every human-created resource has an aspect of work-ness, but this doesn't always translate well to bibliographic description nor to a work entity in bibliographic data. Past practice in relation to works differs significantly from, say, the practice in relation to agents (persons, corporate bodies) for whom one presumes that the name authority control decision is always part of the cataloging workflow. Instead, work "names" have been inconsistently developed (with exceptions, such as in music materials). It is unclear if, in the future, every work description will be assumed to have undergone a "work name authority" analysis, but even more unreliable is any assumption that can be made about whether an existing bibliographic description without a uniform title has had its "work-ness" fully examined.

    This latter concern is especially evident in the transformations of current MARC21 cataloging into either RDA, BIBFRAME, or schema.org. From what I have observed, the transformations do not preserve the difference between a manifestation title that does not have a formal uniform title to represent the work, and those titles that are currently coded in MARC21 fields 130, 240, or the $t of an author/title field. Instead, where a coded uniform title is not available in the MARC21 record, the manifestation title is copied to the work title element. This means that the fact that a cataloger has carefully crafted a work title for the resource is lost. Even though we may agree that the creation of work titles has been inconsistent at best, copying transcribed titles to the work title entity wherever no uniform title field is present in the MARC21 record seems to be a serious loss of information. Or perhaps I should put this as a question: in the absence of a uniform title element, can we assume that the transcribed title is the appropriate work title?
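    To make the concern concrete, here is a minimal sketch in Python of a transformation that keeps the distinction. It assumes a drastically simplified record, a dict keyed by MARC21 tag with a single title string per field, and it ignores the $t of author/title fields. The names `WorkTitle` and `work_title` and the two `source` labels are hypothetical, invented for this illustration, as are the sample records. The only point is that the origin of the work title travels with the title.

```python
from typing import NamedTuple, Optional


class WorkTitle(NamedTuple):
    value: str
    source: str  # "uniform_title" or "transcribed_title" (hypothetical labels)


def work_title(record: dict) -> Optional[WorkTitle]:
    """Pick a work title from a simplified record (tag -> title string)."""
    # Fields 130 and 240 carry a cataloger-assigned uniform title.
    for tag in ("130", "240"):
        if record.get(tag):
            return WorkTitle(record[tag], "uniform_title")
    # Field 245 holds the transcribed manifestation title. If it is reused
    # as the work title, record that fact instead of silently copying it.
    if record.get("245"):
        return WorkTitle(record["245"], "transcribed_title")
    return None


# Made-up sample records for illustration only.
with_uniform = {"240": "Hamlet", "245": "The tragedy of Hamlet, Prince of Denmark"}
without_uniform = {"245": "An example title"}

print(work_title(with_uniform))
print(work_title(without_uniform))
```

    A downstream system could then decide for itself how much to trust a work title whose source is only the transcribed title.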

    To conclude, I guess I will go ahead and harp on a common nag of mine, which is that copying data from one serialization to another is not the transformation that will help us move forward. The "work" is very complex; I would feel less concerned if we had a strong and shared concept of what services we want the work to provide in the future, which should help us decide what to do with the messy legacy that we have today.


    Footnotes

    * Note that in 1877 there already was a "Co-operation committee" of the American Library Association, tasked with looking at cooperative cataloging and other tasks. That makes this a 140-year-old tradition.
    "Of the standing committees, that on co-operation will probably prove the most important organ of the Association..." (see more at link)

    ** If you want more about what FRBR is and is not, I will recommend my book "FRBR: Before and After" (open access copy) for an in-depth analysis. If you want less, try my SWIB talk "Mistakes Have Been Made" which gets into FRBR at about 13:00, but you might enjoy the lead-up to that section.

    References

    [1] Wilson, Patrick. Two Kinds of Power : an Essay on Bibliographical Control. University of California Publications: Librarianship. Berkeley, Los Angeles, London: University of California Press, 1978.
    [2] Smiraglia, Richard. The Nature of “a Work”; Implications for the Organization of Knowledge. Lanham: Scarecrow Press, 2001.
    [3] Lubetzky, Seymour. Principles of Cataloging. Final report. Phase I. In: Seymour Lubetzky: writings on the classical art of cataloging. Edited by Elaine Svenonius and Dorothy McGarry. Englewood, CO: Libraries Unlimited, 2001.

    Tuesday, June 20, 2017

    Pray for Peace

    This is a piece I wrote on March 22, 2003, two days after the beginning of the second Gulf war. I just found it in an old folder, and sadly have to say that things have gotten worse than I feared. I also note an unfortunate use of terms like "peasant" and "primitive" but I leave those as a recognition of my state of mind/information. Pray for peace.

    Saturday, March 22, 2003

    Gulf War II

    The propaganda machine is in high gear, at war against the truth. The bombardments are constant and calculated. This has been planned carefully over time.

    The propaganda box sits in every home showing footage that it claims is of a distant war. We citizens, of course, have no way to independently verify that, but then most citizens are quite happy to accept it at face value.

    We see peaceful streets by day in a lovely, prosperous and modern city. The night shots show explosions happening at a safe distance. What is the magical spot from which all of this is being observed?

    Later we see pictures of damaged buildings, but they are all empty, as are the streets. There are no people involved, and no blood. It is the USA vs. architecture, as if the city of Baghdad itself is our enemy.

    The numbers of casualties, all of them ours, all of them military, are so small that each one has an individual name. We see photos of them in dress uniform. The families state that they are proud. For each one of these there is the story from home: the heavily made-up wife who just gave birth to twins and is trying to smile for the camera, the child who has graduated from school, the community that has rallied to help re-paint a home or repair a fence.

    More people are dying on the highways across the USA each day than in this war, according to our news. Of course, even more are dying around the world of AIDS or lung cancer, and we aren't seeing their pictures or helping their families. At least not according to the television news.

    The programming is designed like a curriculum with problems and solutions. As we begin bombing the networks show a segment in which experts explain the difference between the previous Gulf War's bombs and those used today. Although we were assured during the previous war that our bombs were all accurately hitting their targets,  word got out afterward that in fact the accuracy had been dismally low. Today's experts explain that the bombs being used today are far superior to those used previously, and that when we are told this time that they are hitting their targets it is true, because today's bombs really are accurate.

    As we enter and capture the first impoverished, primitive village, a famous reporter is shown interviewing Iraqi women living in the USA who enthusiastically assure us that the Iraqi people will welcome the American liberators with open arms. The newspapers report Iraqis running into the streets shouting "Peace to all." No one suggests that the phrase might be a plea for mercy by an unarmed peasant facing a soldier wearing enough weaponry to raze the entire village in an eye blink.

    Reporters riding with US troops are able to phone home over satellite connections and show us grainy pictures of heavily laden convoys in the Iraqi desert. Like the proverbial beasts of burden, the trucks are barely visible under their packages of goods, food and shelter. What they are bringing to the trade table is different from the silks and spices that once traveled these roads, but they are carrying luxury goods beyond the ken of many of Iraq's people: high tech sensor devices, protective clothing against all kinds of dangers, vital medical supplies and, perhaps even more important, enough food and water to feed an army. In a country that feeds itself only because of international aid -- aid that has been withdrawn as the US troops arrive -- the trucks are like self-contained units of American wealth motoring past.

    I feel sullied watching any of this, or reading newspapers. It's an insult to be treated like a mindless human unit being prepared for the post-war political fall-out. I can't even think about the fact that many people in this country are believing every word of it. I can't let myself think that the propaganda war machine will win.

    Pray for peace.

    Wednesday, May 17, 2017

    Two FRBRs, Many Relationships

    There is tension in the library community between those who favor remaining with the MARC21 standard for bibliographic records, and others who are promoting a small number of RDF-based solutions. This is the perceived conflict, but in fact both camps are looking at the wrong end of the problem - that is, they are looking at the technology solution without having identified the underlying requirements that a solution must address. I contend that the key element that must be taken into account is the role of FRBR on cataloging and catalogs.

    Some background:  FRBR is stated to be a mental model of the bibliographic universe, although it also has inherent in it an adherence to a particular technology: entity-relation analysis for relational database design. This is stated fairly clearly in the  introduction to the FRBR report, which says:

    The methodology used in this study is based on an entity analysis technique that is used in the development of conceptual models for relational database systems. Although the study is not intended to serve directly as a basis for the design of bibliographic databases, the technique was chosen as the basis for the methodology because it provides a structured approach to the analysis of data requirements that facilitates the processes of definition and delineation that were set out in the terms of reference for the study. 

    The use of an entity-relation model was what led to the now ubiquitous diagrams that show separate entities for works, expressions, manifestations and items. This is often read as a proposed structure for bibliographic data, where a single work description is linked to multiple expression descriptions, each of which in turn link to one or more manifestation descriptions. Other entities like the primary creator link to the appropriate bibliographic entity rather than to a bibliographic description as a whole. In relational database terms, this would create an efficiency in which each work is described only once regardless of the number of expressions or manifestations in the database rather than having information about the work in multiple bibliographic descriptions. This is seen by some as a potential efficiency also for the cataloging workflow as information about a work does not need to be recreated in the description of each manifestation of the work.
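    As a rough illustration of that relational reading, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are assumptions made for this example; the FRBR report does not specify a schema. The point is only that the work is described once, however many expressions and manifestations link back to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE work (
    id INTEGER PRIMARY KEY, title TEXT, creator TEXT);
CREATE TABLE expression (
    id INTEGER PRIMARY KEY, work_id INTEGER REFERENCES work(id), language TEXT);
CREATE TABLE manifestation (
    id INTEGER PRIMARY KEY, expression_id INTEGER REFERENCES expression(id),
    publisher TEXT, year INTEGER);
""")

# One work row, two expressions of it, three manifestations in all.
conn.execute("INSERT INTO work VALUES (1, 'Example work', 'Example, Author')")
conn.executemany("INSERT INTO expression VALUES (?, ?, ?)",
                 [(1, 1, "eng"), (2, 1, "fre")])
conn.executemany("INSERT INTO manifestation VALUES (?, ?, ?, ?)",
                 [(1, 1, "Publisher A", 1990),
                  (2, 1, "Publisher B", 2005),
                  (3, 2, "Publisher C", 1998)])


def manifestations_of(work_id):
    """Return (work title, language, publisher, year) for every manifestation."""
    return conn.execute("""
        SELECT w.title, e.language, m.publisher, m.year
        FROM manifestation m
        JOIN expression e ON m.expression_id = e.id
        JOIN work w ON e.work_id = w.id
        WHERE w.id = ?
        ORDER BY m.id""", (work_id,)).fetchall()


for row in manifestations_of(1):
    print(row)
```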

    Two FRBRs


    What this means is that we have (at least) two FRBRs: the mental model of the bibliographic universe, which I'll refer to as FRBR-MM; and the bibliographic data model based on an entity-relation structure, which I'll refer to as FRBR-DM. These are not clearly separated in the FRBR final report and there is some ambiguity in statements from members of the FRBR working group about whether both models are intended outcomes of the report. Confusion arises in many discussions of FRBR when we do not distinguish which of these functions is being addressed.

    FRBR-Mental Model


    FRBR-MM is the thinking behind the RDA cataloging rules, and the conceptual entities define the structure of the RDA documentation and workflow. It instructs catalogers to analyze each item they catalog as being an item or manifestation that carries the expression of a creative work. There is no specific data model associated with the RDA rules, which is why it is possible to use the mental model to produce cataloging that is entered into the form provided by the MARC21 record, a structure that approximates the catalog entry described in AACR2.

    In FRBR-MM, some entities can be implicit rather than explicit. FRBR-MM does not require that a cataloguer produce a separate and visible work entity. In the RDA cataloging coded in MARC, the primary creator and the subjects are associated with the overall bibliographic description without there being a separate work identity. Even when there is a work title created, the creator and subjects are directly associated with the bibliographic description of the manifestation or item. This doesn't mean that the cataloguer has not thought about the work and the expression in their bibliographic analysis, but the rules do not require those to be called out separately in the description. In the mental model you can view FRBR as providing a checklist of key aspects of the bibliographic description that must be addressed.

    The FRBR report defines bibliographic relationships more strongly than previous cataloging rules. For her PhD work, Barbara Tillett (a principal on both the FRBR and RDA work groups) painstakingly viewed thousands of bibliographic records to tease out the types of bibliographic relationships that were noted. Most of these were implicit in free-form cataloguer-supplied notes and in added entries in the catalog records. Previous cataloging rules said little about bibliographic relationships, while RDA, using the work of Tillett which was furthered in the FRBR final report, has five chapters on bibliographic relationships. In the FRBR-MM encoded in MARC21,  these continue to be cataloguer notes ("Adapted from …"), subject headings ("--adaptations"), and added entry fields. These notes and headings are human-readable but do not provide machine-actionable links between bibliographic descriptions. This means that you cannot have a system function that retrieves all of the adaptations of a work, nor are systems likely to provide searches based on relationship type, as these are buried in text. Also, whether relationships are between works or expressions or manifestations is not explicit in the recorded data. In essence, FRBR-MM in MARC21 ignores the separate description of the FRBR-defined Group 1 entities (WEMI), flattening the record into a single bibliographic description that is very similar to that produced with AACR2.
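    The difference between a relationship buried in a note and a machine-actionable link can be sketched in a few lines of Python. Everything here is an assumption for illustration: the identifiers, the relationship names, and the `related` helper are invented, and a real system would use whatever vocabulary and storage it has chosen.

```python
# A relationship recorded only as note text: readable by people,
# but opaque to software unless it parses free-form prose.
note_only = {
    "id": "m1",
    "title": "Example film",
    "note": "Adapted from the novel of the same name.",
}

# The same kind of information as typed links between identified entities.
# Each link is (source id, relationship type, target id).
links = [
    ("w2", "adaptation_of", "w1"),
    ("w3", "adaptation_of", "w1"),
    ("w4", "supplement_to", "w1"),
]


def related(target_id, relationship, links):
    """Return the ids of all entities that have `relationship` to `target_id`."""
    return [source for (source, rel, target) in links
            if rel == relationship and target == target_id]


# With typed links, retrieving all adaptations of a work is a simple lookup.
print(related("w1", "adaptation_of", links))
```

    Nothing comparable can be done with `note_only` without guessing at the wording of the note.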

    FRBR-Data Model


    FRBR-DM adheres to the model of separate identified entities and the relationships between them. These are seen in the diagrams provided in the FRBR report, and in the section on bibliographic relationships from that report. The first thing that needs to be said is that the FRBR report based its model on an analysis that is used for database design. There is no analysis provided for a record design. This is significant because databases and records used for information exchange can have significantly different structures. In a database there could be one work description linked to any number of expressions, but when exchanging information about a single  manifestation presumably the expression and work entities would need to be included. That probably means that if you have more than one manifestation for a work being transmitted, that work information is included for each manifestation, and each bibliographic description is neatly contained in a single package. The FRBR report does not define an actual database design nor a record exchange format, even though the entities and relations in the report could provide a first step in determining those technologies.
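    The difference between the database view and the exchange view can be shown with a small Python sketch. The dictionaries stand in for database tables, and the `exchange_package` function and all of the sample values are hypothetical; FRBR defines neither. What it shows is that the work is stored once but is copied into every package that leaves the database.

```python
import copy

# Stand-ins for database tables: each entity is stored once, keyed by id.
works = {"w1": {"title": "Example work", "creator": "Example, Author"}}
expressions = {"e1": {"work": "w1", "language": "eng"}}
manifestations = {
    "m1": {"expression": "e1", "publisher": "Publisher A", "year": 1990},
    "m2": {"expression": "e1", "publisher": "Publisher B", "year": 2005},
}


def exchange_package(manifestation_id):
    """Bundle one manifestation with copies of its expression and work."""
    manifestation = manifestations[manifestation_id]
    expression = expressions[manifestation["expression"]]
    work = works[expression["work"]]
    return {
        "manifestation": copy.deepcopy(manifestation),
        "expression": copy.deepcopy(expression),
        "work": copy.deepcopy(work),
    }


# Two manifestations of the same work yield two self-contained packages,
# each carrying its own copy of the work description.
packages = [exchange_package(m) for m in ("m1", "m2")]
print(packages[0]["work"] == packages[1]["work"])
```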

    FRBR-DM uses the same mental model as FRBR-MM, but adds considerable functionality that comes from the entity-relationship model. FRBR-DM implements the concepts in FRBR in a way that FRBR-MM does not. It defines separate entities for work, expression, manifestation and item, where MARC21 has only a single entity. FRBR-DM also defines relationships that can be created between specific entities. Without actual entities some relationships between the entities may be implicit in the catalog data, but only in a very vague way. A main entry author field in a MARC21 record has no explicit relationship to the work concept inherent in the bibliographic description, but many people's mental model would associate the title and the author as being a kind of statement about the work being described. Added entries may describe related works but they do not link to those works.

    The FRBR-DM model was not imposed on the RDA rules, which were intended to be neutral as to the data formats that would carry the bibliographic description. However, RDA was designed to support the FRBR-DM by allowing for individual entity descriptions with their own identifiers and for there to be identified relationships between those entities. FRBR-DM proposes the creation of a work entity that can be shared throughout the bibliographic universe where that work is referenced. The same is true for all of the FRBR entities. Because each entity has an identified existence, it is possible to create relationships between entities; the same relationships that are defined in the FRBR report, and more if desired. FRBR-DM, however, is not supported by the MARC21 model because MARC21 does not have a structure that would permit the creation of separately identified entities for the FRBR entities. FRBR-DM does have an expression as a data model in the RDA Registry. In the registry, RDA is defined as an RDF vocabulary in parallel with the named elements in the RDA rule set, with each element associated with the FRBR entity that defines it in the RDA text. This expression, however, so far has only one experimental system implementation in RIMMF. As far as I know, no libraries are yet using this as a cataloging system.

    The replacement proposed by the Library of Congress for the MARC21 record, BIBFRAME, makes use of entities and relations similar to those defined in FRBR, but does not follow FRBR to the letter. The extent to which it was informed by FRBR is unclear but FRBR was in existence when BIBFRAME was developed. Many of the entities defined by FRBR are obvious, however, and would be arrived at by any independent analysis of bibliographic data: persons, corporate bodies, physical descriptions, subjects. How BIBFRAME fits into the FRBR-MM or the FRBR-DM isn't clear to me and I won't attempt to find a place for it in this current analysis. I will say that using an entity-relation model and promoting relationships between those entities is a mainstream approach to data, and would most likely be the model in any modern bibliographic data design.


    MARC v RDF? 


    The decision we are facing in terms of bibliographic data is often couched in terms of "MARC vs. RDF"; however, that is not the actual question that underlies that decision. Instead, the question should be couched as: entities and relations, or not? If you want to share entities like works and persons, and if you want to create actual relationships between bibliographic entities, something other than MARC21 is required. What that "something" is should be an open question, but it will not be a "unit record" like MARC21.

    For those who embrace the entity-relation model, the perceived "rush to RDF" is not entirely illogical; RDF is the current technology that supports entity-relation models. RDF is supported by a growing number of open source tools, including database management and indexing. It is a World Wide Web Consortium (W3C) standard, and is quickly becoming a mainstream technology used by communities like banking, medicine, and academic and government data providers. It also has its down sides: there is no obvious support in the current version of RDF for units of data that could be called "records" - RDF only recognizes open graphs; RDF is bad at retaining the order of data elements, something that bibliographic data often relies upon. These "faults" and others are well known to the W3C groups that continue to develop the standard and some are currently under development as additions to the standard.
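    The ordering problem is easy to see in miniature. In this Python sketch a plain set of (subject, predicate, object) tuples stands in for a graph; it is an analogy, not an RDF library, and the names and numbered predicates are made up. RDF has its own list and container constructs for stating order; the numbered predicates here are only a stand-in for the idea that order must be recorded explicitly as data.

```python
# A graph modeled as a set of (subject, predicate, object) statements.
# The order in which statements are entered carries no information.
entered_first = {
    ("book1", "author", "Smith"),
    ("book1", "author", "Jones"),
}
entered_second = {
    ("book1", "author", "Jones"),
    ("book1", "author", "Smith"),
}
same_graph = entered_first == entered_second  # the two are indistinguishable

# To keep the order of authors, the order itself has to be stated as data.
# The numbered predicates are a hypothetical device for this example.
ordered = {
    ("book1", "author_2", "Jones"),
    ("book1", "author_1", "Smith"),
}


def authors_in_order(statements, subject):
    """Rebuild the author sequence from the explicitly numbered statements."""
    numbered = []
    for s, p, o in statements:
        if s == subject and p.startswith("author_"):
            numbered.append((int(p.split("_")[1]), o))
    return [name for _, name in sorted(numbered)]


print(same_graph)
print(authors_in_order(ordered, "book1"))
```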

    At the same time, leaping directly to a particular solution is bad form. Data development usually begins with a gathering of use cases and requirements, and technology is developed to meet the gathered requirements. If it is desired to take advantage of some or all of the entity-relation capabilities of FRBR, the decision about the appropriate replacement for MARC21 should be based on a needs analysis. I recall seeing some use cases in the early BIBFRAME work, but I also recall that they seemed inadequate. What needs to be addressed is the extent to which we expect library catalogs to make use of bibliographic relationships, and whether those relationships must be bound to specific entities.

    What we could gain by developing use cases would be a shared set of expectations that could be weighed against proposed solutions. Some of the aspects of what catalogers like about MARC may feed into those requirements, as well as what we wish for in the design of the future catalog. Once the set of requirements is reasonably complete, we have a set of criteria against which to measure whether the technology development is meeting the needs of everyone involved with library data.

    Conclusion: It's the Relationships


    The disruptive aspect of FRBR is not primarily that it creates a multi-level bibliographic model between works, expressions, manifestations, and items. The disruption is in the definition of relationships between and among those entities that requires those entities to be separately identified. Even the desire to share work and expression descriptions separately can most likely be met by identifying the pertinent data elements within a unit record. But the bibliographic relationships defined in FRBR and RDA, if they are to be actionable, require a new data structure.

    The relationships are included in RDA but are not implemented in RDA in MARC21, basically because they cannot be implemented in a "unit record" data format. The key question is whether those relationships (or others) are intended to be included in future library catalogs. If they are, then a data format other than MARC21 must be developed. That data format may or may not implement FRBR-defined bibliographic relationships; FRBR was a first attempt to redefine a long-standing bibliographic model and should be considered the first, not the last, word in bibliographic relationships.

    If we couch the question in terms of bibliographic relationships, not warring data formats, we begin to have a way to go beyond emotional attachments and do a reasoned analysis of our needs.