Tuesday, November 27, 2018

It's "academic"

We all know that writing and publishing are of great concern to those whose work is in academia; the "publish or perish" burden haunts pre-tenure educators and grant-seeking researchers. Revelations that data had been falsified in published experimental results bring great condemnation from publishers and colleagues, and yet I have a feeling that underneath it all is more than an ounce of empathy from those who are fully aware of the forces that would lead one to put one's thumb on the scale for the purposes of winning the academic jousting match. It is only a slight exaggeration to compare these souls to the storied gladiators whose defeat meant summary execution. From all evidence, that is how many of them experience the contest to win the ivory tower - you climb until you fall.

Research libraries and others deal in great part with the output of the academe. In many ways their practices reinforce the value judgments made on academic writing, such as having blanket orders for all works published by a list of academic presses. In spite of this, libraries have avoided making an overt statement of what is and what is not "academic." The "deciders" of academic writing are the publishers - primarily the publishers of peer-reviewed journals that decide what information does and does not become part of the record of academic achievement, but also those presses that issue scholarly monographs. Libraries are the consumers of these decisions but stop short of tagging works as "academic" or "scholarly."

The pressure on academics has only increased in recent years, primarily because of the development of "impact factors." In 1955, Eugene Garfield introduced the idea that one could create a map of scientific publishing using an index of the writings cited by other works. (Science, 1955; 122:108–11) Garfield was interested in improving science by linking works so that one could easily find supporting documents. However, over the years the purpose of citation has evolved from a convenient link to precedents into a measure of the worth of scholars themselves in the form of the "h-index" - a measure, calculated from how often a person's works have been cited, that rates the person rather than any single work. The h-index is the "lifetime home runs" statistic of the academic world. One is valued for how many times one is cited, making citations the coin of the realm, not sales of works or even readership. No one in academia could or should be measured on the same scale as a non-academic writer when it comes to print runs, reviews, or movie deals. Imagine comparing the sales figures of "Poetic Autonomy in Ancient Rome" with those of "The Da Vinci Code." So it matters in academia to carve out a world that is academic, and that isolates academic works such that one can do things like calculate an h-index value.
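
To make the statistic concrete: as it is usually defined, the h-index is the largest number h such that a person has h works that have each been cited at least h times. A minimal sketch in Python follows; the function name and the citation counts are invented for illustration.

```python
def h_index(citation_counts):
    """Return the largest h such that h works have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for one scholar's works.
print(h_index([25, 8, 5, 3, 3, 1]))
```

Nothing in the calculation looks at sales or readership, and one very heavily cited work does not raise the number by itself.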

This interest in all things academic has led to a number of metadata oddities that make me uncomfortable, however. There are metadata schemas that have an academic bent that translates to a need to assert the "scholarliness" of works being given a bibliographic description. There is also an emphasis on science in these bibliographic metadata, with less acknowledgement of the publishing patterns of the humanities. My problem isn't solely with the fact that they are doing this, but in particular with how they go about it.

As an example, the metadata schema BIBO clearly has an emphasis on articles as scholarly writing; notably, it has a publication type "academic article" but does not have a publication type for "academic book." This reflects the bias that new scientific discoveries are published as journal articles, and many scientists do not write book-length works at all. This slights the work of historians like Ann M. Blair whose book, Too Much to Know, has what I estimate to be about 1,450 "primary sources," ranging from manuscripts in Latin and German from the 1500s to modern works in a number of languages. It doesn't get much more academic than that.

BIBO also has different metadata terms for "journal" and "magazine":
  • bibo:journal "A periodical of scholarly journal Articles."
  • bibo:magazine "A periodical of magazine Articles. A magazine is a publication that is issued periodically, usually bound in a paper cover, and typically contains essays, stories, poems, etc., by many writers, and often photographs and drawings, frequently specializing in a particular subject or area, as hobbies, news, or sports."
Something in that last bit on magazines smacks of "leisure time" while the journal clearly represents "serious work." It's also interesting that the description of magazine is quite long, describes the physical aspects ("usually bound in a paper cover"), and gives a good idea of the potential content. "Journal" is simply "scholarly journal articles." Aside from the circularity of the definitions (journal has journal articles, magazines have magazine articles), what this says is simply that a journal is a "not magazine."

Apart from the snobbishness of the difference between these terms, there is the fact that one seeks in vain for a bright line between the two. There is, of course, the "I know it when I see it" test, and there is definitely some academic writing that you can pick out without hesitation. But is an opinion piece in the journal of a scientific society academic? How about a book review? How about a book review in the New York Review of Books (NYRB), where articles run to 2,000-5,000 words, are written by an academic in the field, and make use of the encyclopedic knowledge of the topic on the part of the reviewer? When Marcia Angell, professor at the Harvard Medical School and former Editor in Chief of The New England Journal of Medicine, writes for the NYRB, has she slipped her academic robes for something else? She seems to think so. On her professional web site she lists among her publications a (significantly long) letter to the editor (called a "comment" in academic journal-ese) of a science journal article about women in medicine, but she does not include in her publication list the articles she has written for NYRB even though these probably make more use of her academic knowledge than the comment did. She is clearly making a decision about what is "academic" (i.e. career-related) and what is not. It seems that the dividing line is not the content of the writing but how her professional world esteems the publishing vehicle.

Not to single out BIBO, I should mention other "culprits" in the tagging of scholarly works, such as Wikidata. Wikidata has:
  • academic journal article (Q18918145) article published in an academic journal
  • academic writing (Q4119870) academic writing and publishing is conducted in several sets of forms and genres
  • scholarly article (Q13442814) article in an academic publication, usually peer reviewed
  • scholarly publication (Q591041) scientific publications that report original empirical and theoretical work in the natural sciences
There is so much wrong with each of these, from circular definitions to bias toward science as the only scholarly pursuit (scholarly publication is a "scientific publication" in the "natural sciences"). (I've already commented on this in Wikidata, sarcastically calling it a fine definition if you ignore the various directions that science and scholarship have taken since the mid-19th century.) What this reveals, however, is that the publication and publisher define whether the work is "scholarly." If any article in an academic publication is a scholarly article, then the comment by Dr. Angell is, by definition, scholarly, and the NYRB articles are not. Academia is, in fact, a circularly-defined world.

Giving one more example, schema.org has this:
  • schema:ScholarlyArticle (sub-class of Article) A scholarly article.
Dig that definition! There are a few other types of article in schema.org, such as "NewsArticle" and "TechArticle," but it appears that all of those magazine articles would be simply "Article."
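
The circularity of venue-based typing can be made concrete with a few lines of Python. This is only a sketch: the venue list, the journal name in it, and the `is_scholarly` function are hypothetical and do not come from BIBO, Wikidata, or schema.org.

```python
# Hypothetical list: someone still has to decide which venues count as "academic".
ACADEMIC_VENUES = {"Example Journal of Medicine"}

def is_scholarly(item):
    """Venue-based typing: an item is 'scholarly' if its venue is on the list."""
    return item["venue"] in ACADEMIC_VENUES

items = [
    {"kind": "letter to the editor", "venue": "Example Journal of Medicine"},
    {"kind": "long review essay by an academic", "venue": "New York Review of Books"},
]

for item in items:
    label = "scholarly" if is_scholarly(item) else "not scholarly"
    print(f'{item["kind"]} in {item["venue"]}: {label}')
```

The content of the writing never enters the decision; the whole judgment has been moved into the venue list, which is exactly the part that none of these schemas define.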

Note that in real life publications call themselves whatever they wish. With a hint at how terms may have changed over time: Ladies' Home Journal calls itself a journal, and the periodical published by the American Association for the Advancement of Science, Science, gives itself the domain sciencemag.org. "Science Magazine" just sounds right, doesn't it?

It's not wrong for folks to characterize some publications and some writing as "academic" but any metadata term needs a clear definition, which these do not have. What this means is that people using these schemas are being asked to make a determination with very little guidance that would help them separate the scholarly or academic from... well, from the rest of publishing output. With the inevitable variation in categorization, you can be sure that in metadata coded with these schemas the separation between scholarly/academic and not scholarly/academic writing is probably not going to be useful because there will be little regularity of assignment between communities that are using this metadata.

I admit that I picked on this particular metadata topic because I find the designation of "scholarly" or "academic" to be judgemental. If nothing else, when people judge they need some criteria for that judgement. What I would like to see is a clear definition that would help people decide what is and what is not "academic," and what the use cases are for why this typing of materials should be done. As with most categorizations, we can expect some differences in the decisions that will be made by catalogers and indexers working with these metadata schemas. A definition at least gives you something to discuss and to argue for.  Right now we don't have that for scholarly/academic publications.

And I am glad that libraries don't try to make this distinction.


Tuesday, August 14, 2018

Libraryland, We Have a Problem


The first rule of every multi-step program is to admit that you have a problem. I think it's time for us librarians to take step one and admit that we do have a problem.

The particular problem that I have in mind is the disconnect between library data and library systems in relation to the category of metadata that libraries call "headings." Headings are the strings in the library data that represent those entities that would be entry points in a linear catalog like a card catalog.

It pains me whenever I am an observer to cataloger discussions on the proper formation of headings for items that they are cataloging. The pain point is that I know that the value of those headings is completely lost in the library systems of today, and therefore there are countless hours of skilled cataloger time that are being wasted.

The Heading


Both book and card catalogs were catalogs of headings. The catalog entry was a heading followed by one or more bibliographic entries. Unfortunately, the headings serve multiple purposes, which is generally not a good data practice but is due to the need for parsimony in library data when that data was analog, as in book and card catalogs.

  • A heading is a unique character string for the "thing" – the person, the corporate body, the family – essentially an identifier.
      Tolkien, J. R. R. (John Ronald Reuel), 1892-1973
  • It supports the selection of the entity in the catalog from among the choices that are presented (although in some cases the effectiveness of this is questionable)
  • It is an access point, intended to be the means of finding, within the catalog, those items held by the library that meet the need of the user.
  • It provides the sort order for the catalog entries (which is why you see inverted forms like "Tolkien, J. R. R.")
      United States. Department of State. Bureau for Refugee Programs
      United States. Department of State. Bureau of Administration
      United States. Department of State. Bureau of Administration and Security
      United States. Department of State. Bureau of African Affairs
  • That sort order, and those inverted headings, also have a purpose of collocation of entries by some measure of "likeness"
      Tolkien, J. R. R. (John Ronald Reuel), 1892-1973
      Tolkien Society
      Tolkien Trust

    The last three functions, providing a sort order, access, and collocation, have been lost in the online catalog. The reasons for this are many, but the main explanation is that keyword searching has replaced alphabetical browse as a way to locate items in a library catalog.
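
For readers who have never used a browse list, the sort-order and collocation functions can be sketched in a few lines of Python. This is only an illustration: the `browse` function is hypothetical, and the simplified `sort_key` (lowercase, punctuation treated as spaces) is an assumption standing in for real library filing rules, which are considerably more complex.

```python
import bisect
import re

headings = [
    "Tolkien Trust",
    "United States. Department of State. Bureau of African Affairs",
    "Tolkien, J. R. R. (John Ronald Reuel), 1892-1973",
    "United States. Department of State. Bureau of Administration",
    "Tolkien Society",
    "United States. Department of State. Bureau for Refugee Programs",
]

def sort_key(heading):
    # Simplified stand-in for filing rules: lowercase, punctuation as spaces.
    return " ".join(re.sub(r"[^\w\s]", " ", heading.casefold()).split())

ordered = sorted(headings, key=sort_key)
keys = [sort_key(h) for h in ordered]

def browse(start, count=3):
    """Return up to `count` headings that file at or after `start`."""
    position = bisect.bisect_left(keys, sort_key(start))
    return ordered[position:position + count]

for heading in browse("tolkien"):
    print(heading)
```

The user lands at a point in an ordered list and sees the neighbors of the heading, which is the context that a keyword result set does not supply.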

    The upshot is that many hours are spent during the cataloging process to formulate a left-anchored, alphabetically order-able heading that has no functionality in library catalogs other than as fodder for a context-less keyword search.

    Once a keyword search is done the resulting items are retrieved without any correlation to headings. It may not even be clear which headings one would use to create a useful order. The set of retrieved bibliographic resources from a single keyword search may not provide a coherent knowledge graph. Here's an illustration using the keyword "darwin":

    Gardiner, Anne.
    Melding of two spirits : from the "Yiminga" of the Tiwi to the "Yiminga" of Christianity / by Anne Gardiner ; art work by
    Darwin : State Library of the Northern Territory, 1993.
    Christianity--Australia--Northern Territory.
    Tiwi (Australian people)--Religion.
    Northern Territory--Religion.

    Crabb, William Darwin.
    Lyrics of the golden west. By W. D. Crabb.
    San Francisco, The Whitaker & Ray company, 1898
    West (U.S.)--Poetry.

    Darwin, Charles, 1809-1882.
    Origin of species by means of natural selection; or, The preservation of favored races in the struggle for life and The descent of man and selection in relation to sex, by Charles Darwin.
    New York, The Modern library [1936]
    Evolution (Biology)
    Natural selection.
    Heredity.
    Human evolution.

    Bear, Greg, 1951-
    Darwin's radio / Greg Bear.
    New York : Ballantine Books, 2003.
    Women molecular biologists--Fiction.
    DNA viruses--Fiction.

    No matter what you would choose as a heading on which to order these, it will not produce a sensible collocation that would give users some context to understand the meaning of this particular set of items – and that is because there is no meaning to this set of items, just a coincidence of things named "Darwin."
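
The mechanics behind that result can be sketched as follows. This is a minimal illustration in Python, not a description of how any particular library system is implemented: the field names and the `keyword_search` function are assumptions, and the records are abbreviated from the four shown above.

```python
import re

# Abbreviated from the four records above; the field names are assumptions.
records = [
    {
        "creator": "Gardiner, Anne.",
        "title": "Melding of two spirits",
        "publication": "Darwin : State Library of the Northern Territory, 1993.",
        "subjects": ["Tiwi (Australian people)--Religion."],
    },
    {
        "creator": "Crabb, William Darwin.",
        "title": "Lyrics of the golden west",
        "publication": "San Francisco, The Whitaker & Ray company, 1898",
        "subjects": ["West (U.S.)--Poetry."],
    },
    {
        "creator": "Darwin, Charles, 1809-1882.",
        "title": "Origin of species by means of natural selection",
        "publication": "New York, The Modern library [1936]",
        "subjects": ["Evolution (Biology)", "Natural selection."],
    },
    {
        "creator": "Bear, Greg, 1951-",
        "title": "Darwin's radio",
        "publication": "New York : Ballantine Books, 2003.",
        "subjects": ["DNA viruses--Fiction."],
    },
]

def words(value):
    """Split a field (string or list of strings) into lowercase keywords."""
    text = " ".join(value) if isinstance(value, list) else value
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_search(term):
    """Return (creator, field) for each record containing the term in any field."""
    hits = []
    for record in records:
        for field, value in record.items():
            if term.lower() in words(value):
                hits.append((record["creator"], field))
                break  # one hit per record is enough to retrieve it
    return hits

for creator, field in keyword_search("darwin"):
    print(f"{creator:30} matched in: {field}")
```

The same word is found as a place of publication, a forename, a surname, and a word in a title, and nothing in the retrieval step knows the difference.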

    Headings that have been chosen to be controlled strings should offer a more predictable user search experience than free text searching, but headings do not necessarily provide collocation. As an example, Wikipedia uses the names of its pages as headings, and there are some rules (or at least preferred practices) to make the headings sensible. A search in Wikipedia is a left-to-right search on a heading string that is presented as a drop-down list of a handful of headings that match the search string.

    Included in the headings in the drop-down are "see"-type terms that, when selected, take the user directly to the entry for the preferred term. If there is no one preferred term Wikipedia directs users to disambiguation pages to help users select among similar headings.
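
A left-anchored match on heading strings, with "see"-type terms resolving to a preferred heading, might be sketched like this in Python. The page names are chosen for illustration, the "see" reference is an assumption, and the `suggest` and `resolve` functions are hypothetical; this is not Wikipedia's actual search implementation.

```python
# Preferred headings (illustrative page names).
headings = [
    "Darwin (disambiguation)",
    "Darwin, Northern Territory",
    "Darwin's finches",
    "Charles Darwin",
]

# "See"-type terms mapped to their preferred heading (assumed for the example).
see_references = {"Darwin, Charles": "Charles Darwin"}

def suggest(typed, limit=5):
    """Left-anchored match: offer headings and see-terms that start with the typed string."""
    prefix = typed.casefold()
    candidates = sorted(set(headings) | set(see_references))
    return [c for c in candidates if c.casefold().startswith(prefix)][:limit]

def resolve(selected):
    """Follow a see-reference to the preferred heading; other headings resolve to themselves."""
    return see_references.get(selected, selected)

for choice in suggest("darwin"):
    print(f"{choice}  ->  {resolve(choice)}")
```

Note that "Charles Darwin" is reached here only through the inverted "see" term, because a left-anchored search on "darwin" cannot match a heading that begins with "Charles."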


    The Wikipedia pages, however, only provide accidental collocation, not the more comprehensive collocation that libraries aim to attain. That library-designed collocation, however, is also the source of the inversion of headings, making those strings unnatural and unintuitive for users. Although the library headings are admirably rules-based, they often use rules that will not be known to many users of the catalog, such as the difference in name headings with prepositions based on the language of the author. To search on these names, one therefore needs to know the language of the author and the rule that is applied to that language, something that I am quite sure we can assume is not common knowledge among catalog users.

    De la Cruz, Melissa
    Cervantes Saavedra, Miguel de
    I may be the only patron of my small library branch who has known to look for the mysteries by Icelandic author Arnaldur Indriðason under "A" not "I".

    What Is To Be Done?


    There isn't an easy (or perhaps not even a hard) answer. As long as humans use words to describe their queries we will have the problem that words and concepts, and words and relationships between concepts, do not neatly coincide.

    I see a few techniques that might be used if we wish to save collocation by heading. One would be to allow keyword searching but for the system to use that to suggest headings that then can be used to view collocated works. Some systems do allow users to retrieve headings by keyword, but headings, which are very terse, are often not self-explanatory without the items they describe. A browse of headings alone is much less helpful than the association of the heading with the bibliographic data it describes. Remember that headings were developed for the card catalog where they were printed on the same card that carried the bibliographic description.
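
The first technique, using a keyword search to suggest headings and then showing each heading together with the items it describes, could look something like this. It is a rough sketch in Python under stated assumptions: the tiny data set reuses records from the "darwin" illustration, and `suggest_headings` is a hypothetical function, not a feature of any existing catalog.

```python
import re
from collections import defaultdict

# (heading, title) pairs; a real catalog would draw these from bibliographic records.
entries = [
    ("Darwin, Charles, 1809-1882.", "Origin of species by means of natural selection"),
    ("Evolution (Biology)", "Origin of species by means of natural selection"),
    ("Crabb, William Darwin.", "Lyrics of the golden west"),
    ("Bear, Greg, 1951-", "Darwin's radio"),
]

# Keep each heading attached to the items filed under it, as on a catalog card.
items_by_heading = defaultdict(list)
for heading, title in entries:
    items_by_heading[heading].append(title)

def suggest_headings(keyword):
    """Return headings containing the keyword, each with the titles filed under it."""
    word = keyword.lower()
    return {
        heading: titles
        for heading, titles in sorted(items_by_heading.items())
        if word in re.findall(r"[a-z0-9]+", heading.lower())
    }

for heading, titles in suggest_headings("darwin").items():
    print(heading)
    for title in titles:
        print("    " + title)
```

In this sketch the keyword is matched against headings only, so "Darwin's radio" is not offered; whether title words should also lead the user to headings is one of the design questions such a system would have to settle.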

    Another possible area of investigation would be to look to the classified catalog, a technique that has existed alongside alphabetical catalogs for centuries. The Decimal Classification of Dewey was a classified approach to knowledge with a language-based index (his "Relativ Index") to the classes. (It is odd that the current practice in US libraries is to have one classification for items on shelves and an unrelated heading system (LCSH) for subject access.)
    The classification provides the intellectual collocation that the headings themselves do not provide. The difficulty with this is that the classification collocates topically but, at least in its current form, does not collocate the name headings in the catalog that identify people and organizations as entities.
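
A classified order with a language-based index can be sketched in the same spirit. In this Python illustration the class numbers, index terms, and one of the titles are simplified assumptions chosen for the example, not quotations from any edition of the Decimal Classification, and the `lookup` and `shelf_range` names are hypothetical.

```python
# Index: natural-language terms pointing to class numbers (illustrative values).
relative_index = {
    "evolution": "575",
    "natural selection": "575",
    "heredity": "575.1",
    "american poetry": "811",
}

# Catalog entries as (class number, title), filed in class order.
classified = sorted([
    ("811", "Lyrics of the golden west"),
    ("575.1", "A hypothetical book on heredity"),
    ("575", "Origin of species by means of natural selection"),
])

def lookup(term):
    """Translate a user's word into a class number via the index."""
    return relative_index.get(term.lower())

def shelf_range(class_number):
    """Return every title whose class number falls at or under the given number."""
    return [title for number, title in classified if number.startswith(class_number)]

print(shelf_range(lookup("Evolution")))
```

The user still starts with a word, but what comes back is a neighborhood in the classified order rather than a set of string matches.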

    Conclusion (sort of)

    Controlled headings as access points for library catalogs could provide better service than keyword search alone. How to make use of headings is a difficult question. The first issue is how to exploit the precision of headings while still allowing users to search on any terms that they have in mind. Keyword search is, from the user's point of view, frictionless. They don't have to think "what string would the library have used for this?"

    Collocation of items by topical sameness or other relationships (e.g. "named for", "subordinate to") is possibly the best service that libraries could provide, although it is very hard to do this through the mechanism of language strings. Dewey's original idea of a classified order with a language-based index is still a good one, although classifications are hard to maintain and hard to assign.

    If challenged to state what I think the library catalog should be, my answer would be that it should provide a useful order that illustrates one or more intellectual contexts that will help the user enter and navigate what the library has to offer. Unfortunately I can't say today how we could do that. Could we think about that together?

    Readings

    Dewey, Melvil. Decimal classification and relativ index for libraries, clippings, notes, etc. Edition 7. Lake Placid Club, NY., Forest Press, 1911. https://archive.org/details/decimalclassifi00dewegoog

    Shera, Jesse H, Margaret E. Egan, and Jeannette M. Lynn. The Classified Catalog: Basic Principles and Practices. Chicago, Ill: American Library Association, 1956




    Monday, August 06, 2018

    FRBR as a Data Model


    (I've been railing against FRBR since it was first introduced. It still confuses me some. I put out these ideas for discussion. If you disagree, please add your thoughts to this post.)

    I was recently speaking at a library conference in Oslo where I went through my criticisms of our cataloging models, and how they are not suited to the problems we need to solve today. I had my usual strong criticisms of FRBR and the IFLA LRM. However, when I finished speaking I was asked why I am so critical of those models, which means that I did not explain myself well. I am going to try again here, as clearly and succinctly as I can.

    Conflation of Conceptual Models with Data Models


    FRBR's main impact was that it provided a mental model of the bibliographic universe that reflects a conceptual view of the elements of descriptive cataloging. You will find nothing in FRBR that could not be found in standard library cataloging of the 1990s, which is when the FRBR model was developed. What FRBR adds to our understanding of bibliographic information is that it gives names and definitions to key concepts that had been implied but not fully articulated in library catalog data. If it had stopped there we would have had an interesting mental model that allows us to speak more precisely about catalogs and cataloging.

    Unfortunately, the use of diagrams that appear to define actual data models and the listing of entities and their attributes have led the library world down the wrong path, that of reading FRBR as the definition of a physical data model. Compounding this, the LRM goes down that path even further by claiming to be a structural model of bibliographic data, which implies that it is the structure for library catalog data. I maintain that the FRBR conceptual model should not be assumed to also be a model for bibliographic data in a machine-readable form. The main reason for this has to do with the functionality that library catalogs currently provide (and what functions they may provide in the future). This is especially true in relation to what FRBR refers to as its Group 1 entities: work, expression, manifestation, and item.

    The model defined in the FRBR document presents an idealized view that does not reflect the functionality of bibliographic data in library catalogs nor likely system design. This is particularly obvious in the struggle to fit the reality of aggregate works into the Group 1 "structure," but it is true even for simple published resources. The remainder of this document attempts to explain the differences between the ideal and the real.

    The Catalog vs the Universe


    One of the unspoken assumptions in the FRBR document is that it poses its problems and solutions in the context of the larger bibliographic universe, not in terms of a library catalog. The idea of gathering all of the manifestations of an expression and all of the expressions of a work is not shown as contingent on the holdings of any particular library. Similarly, bibliographic relationships are presented as having an existence without addressing how those relationships would be handled when the related works are not available in a data set. This may be due to the fact that the FRBR working group was made up solely of representatives of large research libraries whose individual catalogs cover a significant swath of the bibliographic world. It may also have arisen from the fact that the FRBR working group was formed to address the exchange of data between national libraries, and thus was intended as a universal model. Note that no systems designers were involved in the development of FRBR to address issues that would come up in catalogs of various extents or types.

    The questions asked and answered by the working group were therefore not of the nature of "how would this work in a catalog?" and were more of the type "what is the nature of bibliographic data?". The latter is a perfectly legitimate question for a study of the nature of bibliographic data, but that study cannot be assumed to answer the first question.

    Functionality


    Although the F in FRBR stands for "functional," FRBR does little to address the functionality of the library catalog. The user tasks find, identify, select and obtain (and now explore, added in the LRM) are not explained in terms of how the data aids those tasks; the FRBR document only lists which data elements are essential to each task. Computer system design, including the design of data structures, needs to go at least a step further in its definition of functions, which means specifying not only which data elements are relevant, but also the specific use to which each data element is put in an actual machine interaction with the user and services. A systems developer has to take into account precisely what needs to be done with the FRBR entities in all of the system functions, from input to search and display.

    (Note: I'm going to try to cover this better and to give examples in an upcoming post.)

    Analysis that is aimed at creating a bibliographic data format for a library catalog would take into account that providing user-facing information about work and expression is context-dependent based on the holdings of the individual library and on the needs of its users. It would also take into account the optional use of work and expression information in search and display, and possibly give alternate views to support different choices in catalog creation and deployment. Essentially, analysis for a catalog would take system functionality into account.

    A lot of facts about the nature of computer-based catalogs have to be acknowledged: that users are no longer performing “find” in an alphabetical list of headings, but are performing keyword searches; that collocation based on work-ness is not a primary function of catalog displays; that a significant proportion of a bibliographic database consists of items with a single work-expression-manifestation grouping; and finally that there is an inconsistent application of work and expression information in today's data.

    In spite of nearly forty years of using library systems whose default search function is a single box in which users are asked to input query terms that will be searched as keywords taken from a combination of creator, title, and subject fields in the bibliographic record, the LRM doubles down on the status of textual headings as primary elements, aka: Nomen. Unfortunately it doesn't address the search function in any reasonable fashion, which is to say it doesn't give an indication of the role of Nomen in the find function. In fact, here is the sum total of what the LRM says about search:

    "To facilitate this task [find], the information system seeks to enable effective searching by offering appropriate search elements or functionality."


    That's all. As I said in my talk at Oslo, this is up there with the jokes about bad corporate mission statements, like: "We endeavor to enhance future value through intelligent paradigm implementation." First, no information system seeks to enable ineffective searching. Yet the phrase "effective searching" is meaningless in itself; without a definition of what is effective this is just a platitude. The same is true for "appropriate search elements": no one would suggest that a system should use inappropriate search elements, but defining appropriate search is not at all a simple task. In fact, I contend that one of the primary problems with today's library systems is that we specifically lack a definition of appropriate, effective search. This is rendered especially difficult because the data that we enter into our library systems is data that was designed for an entirely different technology: the physical card catalog, organized as headings in alphabetical order.

    One Record to Rule Them All


    Our actual experience regarding physical structures for bibliographic data should be sufficient proof that there is not one single solution. Although libraries today are consolidating around the MARC21 record format, primarily for economic reasons, there have been numerous physical formats in use that mostly adhere to the international standard of ISBD. In this same way, there can be multiple physical formats that adhere to the conceptual model expressed in the FRBR and LRM documents. We know this is the case by looking at the current bibliographic data, which includes varieties of MARC, ISBD, BIBFRAME, and others. Another option for surfacing information about works in catalogs could follow what OCLC seems to be developing, which is the creation of works through a clustering of single-entity records. In that model, a work is a cluster of expressions, and an expression is a cluster of manifestations. This model has the advantage that it does not require the cataloger to make decisions about work and expression statements before it is known if the resource will be the progenitor of a bibliographic family, or will stand alone. It also does not require the cataloger to have knowledge of the bibliographic universe beyond their own catalog.
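
To make the clustering idea concrete, here is a minimal sketch in Python. It is emphatically not OCLC's algorithm, which is only described in outline above; the sample records, the field names (including an assumed `work_title` holding a preferred title), the matching keys, and the `cluster` function are all assumptions made for the example.

```python
from collections import defaultdict

# Single-entity records, one per manifestation (invented sample data).
manifestations = [
    {"creator": "Darwin, Charles", "work_title": "Origin of species",
     "language": "eng", "publisher": "Modern Library", "year": 1936},
    {"creator": "Darwin, Charles", "work_title": "Origin of Species",
     "language": "eng", "publisher": "Hypothetical Press", "year": 1998},
    {"creator": "Darwin, Charles", "work_title": "Origin of species",
     "language": "ger", "publisher": "Hypothetischer Verlag", "year": 1963},
    {"creator": "Bear, Greg", "work_title": "Darwin's radio",
     "language": "eng", "publisher": "Ballantine Books", "year": 2003},
]

def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not split clusters."""
    return " ".join(text.casefold().split())

def cluster(records):
    """Group manifestations into expressions, and expressions into works."""
    works = defaultdict(lambda: defaultdict(list))
    for record in records:
        work_key = (normalize(record["creator"]), normalize(record["work_title"]))
        expression_key = record["language"]  # crude stand-in for "same expression"
        works[work_key][expression_key].append(record)
    return works

works = cluster(manifestations)
for work_key, expressions in works.items():
    print(work_key, {language: len(items) for language, items in expressions.items()})
```

A record that matches nothing simply forms a cluster of one, so no one has to decide in advance whether a resource will turn out to be the progenitor of a bibliographic family.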

    The key element of all of these, and countless other, solutions is that they can be faithful to the mental model of FRBR while also being functional and efficient as systems. We should also expect that the systems solutions to this problem space will not stay the same over time, since technology is in constant evolution.

    Summary


    I have identified here two quite fundamental areas where FRBR's analysis differs from the needs of system development: 1) the difference between conceptual and physical models and 2) the difference between the (theoretical) bibliographic universe and the functional library catalog. Neither of these is a criticism of FRBR as such, but they do serve as warnings about a widely held assumption in the library world today, that of mistaking the FRBR entity model for a data and catalog design model. This is evident in the outcry over the design of the BIBFRAME model, which uses a two-tiered bibliographic view and not the three tiers of FRBR. The irony of that complaint is that at the very same time as those outcries, catalogers are using FRBR concepts (as embodied in RDA) while cataloging into the one-tiered data model of MARC, which includes all of the entities of FRBR in a single data record. While cataloging into MARC records may not be the best version of bibliographic data storage that we could come up with, we must acknowledge that there are many possible technology solutions that could allow the exercise of bibliographic control while making use of the concepts addressed in FRBR/LRM. Those solutions must be based at least as much on user needs in actual catalogs as on bibliographic theory.

    As a theory, FRBR posits an ideal bibliographic environment which is not the same as the one that is embodied in any library catalog. The diagrams in the FRBR and LRM documents show the structure of the mental model, but not library catalog data. Because the FRBR document does not address implementation of the model in a catalog, there is no test of how such a model does or does not reflect actual system design. The extrapolation from mental model to physical model is not provided in FRBR or the LRM, as neither addresses system functions and design, not even at a macro level.

    I have to wonder if FRBR/LRM shouldn't be considered a model for bibliography rather than library catalogs. Bibliography was once a common art in the world of letters but that has faded greatly over the last half century. Bibliography is not the same as catalog creation, but one could argue that libraries and librarians are the logical successors to the bibliographers of the past, and that a “universal bibliography” created under the auspices of libraries would provide an ideal context for the entries in the library catalog. This could allow users to view the offerings of a single library as a subset of a well-described world of resources, most of which can be accessed in other libraries and archives.