Coyle's InFormation

Tuesday, November 27, 2018

It's "academic"

We all know that writing and publishing are of great concern to those whose work is in academia; the "publish or perish" burden haunts pre-tenure educators and grant-seeking researchers. Revelations that data had been falsified in published experimental results bring great condemnation from publishers and colleagues, and yet I have a feeling that underneath it all is more than an ounce of empathy from those who are fully aware of the forces that would lead one to put one's thumbs on the scales for the purposes of winning the academic jousting match. It is only a slight exaggeration to compare these souls to the storied gladiators whose defeat meant summary execution. From all evidence, that is how many of them experience the contest to win the ivory tower - you climb until you fall.

Research libraries and others deal in great part with the output of the academe. In many ways their practices reinforce the value judgments made on academic writing, such as having blanket orders for all works published by a list of academic presses. In spite of this, libraries have avoided making an overt statement of what is and what is not "academic." The "deciders" of academic writing are the publishers - primarily the publishers of peer-reviewed journals that decide what information does and does not become part of the record of academic achievement, but also those presses that issue scholarly monographs. Libraries are the consumers of these decisions but stop short of tagging works as "academic" or "scholarly."

The pressure on academics has only increased in recent years, primarily because of the development of "impact factors." In 1955, Eugene Garfield introduced the idea that one could create a map of scientific publishing using an index of the writings cited by other works. (Science, 1955; 122:108–11) Garfield was interested in improving science by linking works so that one could easily find supporting documents. However, over the years the purpose of citation has evolved from a convenient link to precedents into a measure of the worth of scholars themselves in the form of the "h-index" - a measure, built from citation counts, of how often a person (not a work) has been cited. The h-index is the "lifetime home runs" statistic of the academic world. One is valued for how many times one is cited, making citations the coin of the realm, not sales of works or even readership. No one in academia could or should be measured on the same scale as a non-academic writer when it comes to print runs, reviews, or movie deals. Imagine comparing the sales figures of "Poetic Autonomy in Ancient Rome" with "The Da Vinci Code". So it matters in academia to carve out a world that is academic, and that isolates academic works such that one can do things like calculate an h-index value.
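For readers who have not met the calculation itself: as it is commonly defined, a person's h-index is the largest number h such that h of their works have each been cited at least h times. The sketch below assumes that definition; the function name and the citation counts are made up for illustration and do not come from any real citation database.

```python
def h_index(citation_counts):
    """Return the largest h such that h works have at least h citations each."""
    # Rank the works from most cited to least cited.
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the top `rank` works all have at least `rank` citations
        else:
            break         # every later work is cited even less
    return h


# Hypothetical citation counts for one author's five works.
example_counts = [10, 8, 5, 4, 3]
print(h_index(example_counts))
```

Note that the calculation only works once someone has decided which works and which citations count, which is exactly the carving out of an "academic" world described above.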

This interest in all things academic has led to a number of metadata oddities that make me uncomfortable, however. There are metadata schemas that have an academic bent that translates to a need to assert the "scholarliness" of works being given a bibliographic description. There is also an emphasis on science in these bibliographic metadata, with less acknowledgement of the publishing patterns of the humanities. My problem isn't solely with the fact that they are doing this, but in particular with how they go about it.

As an example, the metadata schema BIBO clearly has an emphasis on articles as scholarly writing; notably, it has a publication type "academic article" but does not have a publication type for "academic book." This reflects the bias that new scientific discoveries are published as journal articles, and many scientists do not write book-length works at all. This slights the work of historians like Ann M. Blair whose book, Too Much to Know, has what I estimate to be about 1,450 "primary sources," ranging from manuscripts in Latin and German from the 1500's to modern works in a number of languages. It doesn't get much more academic than that.

BIBO also has different metadata terms for "journal" and "magazine":

bibo:journal "A periodical of scholarly journal Articles."
bibo:magazine "A periodical of magazine Articles. A magazine is a publication that is issued periodically, usually bound in a paper cover, and typically contains essays, stories, poems, etc., by many writers, and often photographs and drawings, frequently specializing in a particular subject or area, as hobbies, news, or sports."

Something in that last bit on magazines smacks of "leisure time" while the journal clearly represents "serious work." It's also interesting that the description of magazine is quite long, describes the physical aspects ("usually bound in a paper cover"), and gives a good idea of the potential content. "Journal" is simply "scholarly journal articles." Aside from the circularity of the definitions (journal has journal articles, magazines have magazine articles), what this says is simply that a journal is a "not magazine."

Apart from the snobbishness of the difference between these terms is the fact that one seeks in vain for a bright line between the two. There is, of course, the "I know it when I see it" test, and there is definitely some academic writing that you can pick out without hesitation. But is an opinion piece in the journal of a scientific society academic? How about a book review? How about a book review in the New York Review of Books (NYRB), where articles run to 2-5,000 words, are written by an academic in the field, and make use of the encyclopedic knowledge of the topic on the part of the reviewer? When Marcia Angell, professor at the Harvard Medical School and former Editor in Chief of The New England Journal of Medicine, writes for the NYRB, has she slipped her academic robes for something else? She seems to think so. On her professional web site she lists among her publications a (significantly long) letter to the editor (called a "comment" in academic journal-eze) of a science journal article about women in medicine, but she does not include in her publication list the articles she has written for NYRB even though these probably make more use of her academic knowledge than the comment did. She is clearly making a decision about what is "academic" (i.e. career-related) and what is not. It seems that the dividing line is not the content of the writing but how her professional world esteems the publishing vehicle.

Not to single out BIBO, I should mention other "culprits" in the tagging of scholarly works, such as Wikidata. Wikidata has:

academic journal article (Q18918145) article published in an academic journal
academic writing (Q4119870) academic writing and publishing is conducted in several sets of forms and genres
scholarly article (Q13442814) article in an academic publication, usually peer reviewed
scholarly publication (Q591041) scientific publications that report original empirical and theoretical work in the natural sciences

There is so much wrong with each of these, from circular definitions to bias toward science as the only scholarly pursuit (scholarly publication is a "scientific publication" in the "natural sciences"). (I've already commented on this in Wikidata, sarcastically calling it a fine definition if you ignore the various directions that science and scholarship have taken since the mid-19th century.) What this reveals, however, is that the publication and publisher define whether the work is "scholarly." If any article in an academic publication is a scholarly article, then the comment by Dr. Angell is, by definition, scholarly, and the NYRB articles are not. Academia is, in fact, a circularly-defined world.

Giving one more example, schema.org has this:

schema:ScholarlyArticle (sub-class of Article) A scholarly article.

Dig that definition! There are a few other types of article in schema.org, such as "NewsArticle" and "TechArticle" but it appears that all of those magazine articles would be simple "Article."

Note that in real life publications call themselves whatever they wish. With a hint at how terms may have changed over time: Ladies' Home Journal calls itself a journal, and the periodical published by the American Association for the Advancement of Science, Science, gives itself the domain sciencemag.org. "Science Magazine" just sounds right, doesn't it?

It's not wrong for folks to characterize some publications and some writing as "academic" but any metadata term needs a clear definition, which these do not have. What this means is that people using these schemas are being asked to make a determination with very little guidance that would help them separate the scholarly or academic from... well, from the rest of publishing output. With the inevitable variation in categorization, you can be sure that in metadata coded with these schemas the separation between scholarly/academic and not scholarly/academic writing is probably not going to be useful because there will be little regularity of assignment between communities that are using this metadata.

I admit that I picked on this particular metadata topic because I find the designation of "scholarly" or "academic" to be judgemental. If nothing else, when people judge they need some criteria for that judgement. What I would like to see is a clear definition that would help people decide what is and what is not "academic," and what the use cases are for why this typing of materials should be done. As with most categorizations, we can expect some differences in the decisions that will be made by catalogers and indexers working with these metadata schemas. A definition at least gives you something to discuss and to argue for. Right now we don't have that for scholarly/academic publications.

And I am glad that libraries don't try to make this distinction.

Tuesday, August 14, 2018

Libraryland, We Have a Problem

The first rule of every multi-step program is to admit that you have a problem. I think it's time for us librarians to take step one and admit that we do have a problem.

The particular problem that I have in mind is the disconnect between library data and library systems in relation to the category of metadata that libraries call "headings." Headings are the strings in the library data that represent those entities that would be entry points in a linear catalog like a card catalog.

It pains me whenever I am an observer to cataloger discussions on the proper formation of headings for items that they are cataloging. The pain point is that I know that the value of those headings is completely lost in the library systems of today, and therefore there are countless hours of skilled cataloger time that are being wasted.

The Heading

Both book and card catalogs were catalogs of headings. The catalog entry was a heading followed by one or more bibliographic entries. Unfortunately, the headings serve multiple purposes, which is generally not a good data practice but is due to the need for parsimony in library data when that data was analog, as in book and card catalogs.

A heading is a unique character string for the "thing" – the person, the corporate body, the family – essentially an identifier.

Tolkien, J. R. R. (John Ronald Reuel), 1892-1973

It supports the selection of the entity in the catalog from among the choices that are presented (although in some cases the effectiveness of this is questionable).

It is an access point, intended to be the means of finding, within the catalog, those items held by the library that meet the need of the user.

It provides the sort order for the catalog entries (which is why you see inverted forms like "Tolkien, J. R. R.").

United States. Department of State. Bureau for Refugee Programs
United States. Department of State. Bureau of Administration
United States. Department of State. Bureau of Administration and Security
United States. Department of State. Bureau of African Affairs

That sort order, and those inverted headings, also have a purpose of collocation of entries by some measure of "likeness."

Tolkien, J. R. R. (John Ronald Reuel), 1892-1973

Tolkien Society

Tolkien Trust

The last three functions, providing a sort order, access, and collocation, have been lost in the online catalog. The reasons for this are many, but the main explanation is that keyword searching has replaced alphabetical browse as a way to locate items in a library catalog.
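To make the lost function concrete, here is a minimal sketch of alphabetical browse: the headings are sorted once, and a left-anchored lookup returns the run of neighboring headings that begin with the same characters. The headings are the ones used as examples above; the `browse` function and the plain character-by-character sort are assumptions for illustration only, since real library filing rules handle punctuation and spacing differently.

```python
import bisect

# Headings taken from the examples above, deliberately out of order.
headings = [
    "Tolkien Trust",
    "United States. Department of State. Bureau of African Affairs",
    "Tolkien, J. R. R. (John Ronald Reuel), 1892-1973",
    "United States. Department of State. Bureau for Refugee Programs",
    "Tolkien Society",
    "United States. Department of State. Bureau of Administration and Security",
    "United States. Department of State. Bureau of Administration",
]

# Plain case-insensitive character sort (an assumption, not library filing rules).
sorted_headings = sorted(headings, key=str.casefold)
sort_keys = [h.casefold() for h in sorted_headings]


def browse(prefix):
    """Return the run of headings that start with `prefix` (left-anchored)."""
    wanted = prefix.casefold()
    # Binary search for the first heading that sorts at or after the prefix.
    start = bisect.bisect_left(sort_keys, wanted)
    run = []
    for heading, key in zip(sorted_headings[start:], sort_keys[start:]):
        if not key.startswith(wanted):
            break  # the run of matching neighbors has ended
        run.append(heading)
    return run


for heading in browse("Tolkien"):
    print(heading)
```

A browse on "Department of State" returns nothing here, because the match is anchored at the left edge of the string. That is the trade: the sorted list gives collocation, but only to the user who knows how the heading begins.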

The upshot is that many hours are spent during the cataloging process to formulate a left-anchored, alphabetically order-able heading that has no functionality in library catalogs other than as fodder for a context-less keyword search.

Once a keyword search is done the resulting items are retrieved without any correlation to headings. It may not even be clear which headings one would use to create a useful order. The set of retrieved bibliographic resources from a single keyword search may not provide a coherent knowledge graph. Here's an illustration using the keyword "darwin":

Gardiner, Anne.

Melding of two spirits : from the "Yiminga" of the Tiwi to the "Yiminga" of Christianity / by Anne Gardiner ; art work by

Darwin : State Library of the Northern Territory, 1993.

Christianity--Australia--Northern Territory.

Tiwi (Australian people)--Religion.

Northern Territory--Religion.

Crabb, William Darwin.

Lyrics of the golden west. By W. D. Crabb.

San Francisco, The Whitaker & Ray company, 1898

West (U.S.)--Poetry.

Darwin, Charles, 1809-1882.

Origin of species by means of natural selection; or, The preservation of favored races in the struggle for life and The descent of man and selection in relation to sex, by Charles Darwin.

New York, The Modern library [1936]

Evolution (Biology)

Natural selection.

Heredity.

Human evolution.

Bear, Greg, 1951-

Darwin's radio / Greg Bear.

New York : Ballantine Books, 2003.

Women molecular biologists--Fiction.

DNA viruses--Fiction.

No matter what you would choose as a heading on which to order these, it will not produce a sensible collocation that would give users some context to understand the meaning of this particular set of items – and that is because there is no meaning to this set of items, just a coincidence of things named "Darwin."
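The coincidence is easy to see if the search is asked to report which field produced each match. The sketch below is a simplification under stated assumptions: the field names are invented, the titles are shortened from the records above, and a bare substring test stands in for whatever tokenizing a real catalog does.

```python
# Records abbreviated from the "darwin" result set above; field names are hypothetical.
records = [
    {
        "creator": "Gardiner, Anne.",
        "title": "Melding of two spirits",
        "publication": "Darwin : State Library of the Northern Territory, 1993.",
        "subjects": ["Christianity--Australia--Northern Territory.",
                     "Tiwi (Australian people)--Religion.",
                     "Northern Territory--Religion."],
    },
    {
        "creator": "Crabb, William Darwin.",
        "title": "Lyrics of the golden west.",
        "publication": "San Francisco, The Whitaker & Ray company, 1898",
        "subjects": ["West (U.S.)--Poetry."],
    },
    {
        "creator": "Darwin, Charles, 1809-1882.",
        "title": "Origin of species by means of natural selection",
        "publication": "New York, The Modern library [1936]",
        "subjects": ["Evolution (Biology)", "Natural selection.",
                     "Heredity.", "Human evolution."],
    },
    {
        "creator": "Bear, Greg, 1951-",
        "title": "Darwin's radio / Greg Bear.",
        "publication": "New York : Ballantine Books, 2003.",
        "subjects": ["Women molecular biologists--Fiction.", "DNA viruses--Fiction."],
    },
]


def keyword_hits(records, keyword):
    """Return (creator, matching field names) for each record containing the keyword."""
    wanted = keyword.casefold()
    hits = []
    for record in records:
        matched = []
        for field, value in record.items():
            # Subject lists are joined so every field is searched as one string.
            text = " ".join(value) if isinstance(value, list) else value
            if wanted in text.casefold():
                matched.append(field)
        if matched:
            hits.append((record["creator"], matched))
    return hits


for creator, fields in keyword_hits(records, "darwin"):
    print(creator, "matched in", fields)
```

Each record matches for a different reason: a place of publication, a forename, a surname, a word in a title. No single heading could sensibly order that set.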

Headings that have been chosen to be controlled strings should offer a more predictable user search experience than free text searching, but headings do not necessarily provide collocation. As an example, Wikipedia uses the names of its pages as headings, and there are some rules (or at least preferred practices) to make the headings sensible. A search in Wikipedia is a left-to-right search on a heading string that is presented as a drop-down list of a handful of headings that match the search string:

Included in the headings in the drop-down are "see"-type terms that, when selected, take the user directly to the entry for the preferred term. If there is no one preferred term Wikipedia directs users to disambiguation pages to help users select among similar headings:

The Wikipedia pages, however, only provide accidental collocation, not the more comprehensive collocation that libraries aim to attain. That library-designed collocation, however, is also the source of the inversion of headings, making those strings unnatural and unintuitive for users. Although the library headings are admirably rules based, they often use rules that will not be known to many users of the catalog, such as the difference in name headings with prepositions based on the language of the author. To search on these names, one therefore needs to know the language of the author and the rule that is applied to that language, something that I am quite sure we can assume is not common knowledge among catalog users.

De la Cruz, Melissa

Cervantes Saavedra, Miguel de

I may be the only patron of my small library branch that has known to look for the mysteries by Icelandic author Arnaldur Indriðason under "A" not "I".

What Is To Be Done?

There isn't an easy (or perhaps not even a hard) answer. As long as humans use words to describe their queries we will have the problem that words and concepts, and words and relationships between concepts, do not neatly coincide.

I see a few techniques that might be used if we wish to save collocation by heading. One would be to allow keyword searching but for the system to use that to suggest headings that then can be used to view collocated works. Some systems do allow users to retrieve headings by keyword, but headings, which are very terse, are often not self-explanatory without the items they describe. A browse of headings alone is much less helpful than the association of the heading with the bibliographic data it describes. Remember that headings were developed for the card catalog where they were printed on the same card that carried the bibliographic description.
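The first technique can be sketched in a few lines: the keyword is matched against the headings rather than the whole record, and each suggested heading is shown together with the items filed under it. This is only an illustration under assumptions; the function name is hypothetical, only creator headings are used, and the titles are shortened from the "darwin" records above.

```python
from collections import defaultdict

# (creator heading, title) pairs drawn from the "darwin" records above.
entries = [
    ("Gardiner, Anne.", "Melding of two spirits"),
    ("Crabb, William Darwin.", "Lyrics of the golden west"),
    ("Darwin, Charles, 1809-1882.", "Origin of species by means of natural selection"),
    ("Bear, Greg, 1951-", "Darwin's radio"),
]

# Group the items under their heading, as the card catalog did.
items_by_heading = defaultdict(list)
for heading, title in entries:
    items_by_heading[heading].append(title)


def suggest_headings(keyword):
    """Return headings containing the keyword, each with the items filed under it."""
    wanted = keyword.casefold()
    return {
        heading: titles
        for heading, titles in sorted(items_by_heading.items())
        if wanted in heading.casefold()
    }


for heading, titles in suggest_headings("darwin").items():
    print(heading)
    for title in titles:
        print("    " + title)
```

With this approach "Darwin's radio" no longer appears, because the keyword occurs only in its title and not in a heading. Whether that is a gain in precision or a loss in recall depends on what the user had in mind.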

Another possible area of investigation would be to look to the classified catalog, a technique that has existed alongside alphabetical catalogs for centuries. The Decimal Classification of Dewey was a classified approach to knowledge with a language-based index (his "Relativ Index") to the classes. (It is odd that the current practice in US libraries is to have one classification for items on shelves and an unrelated heading system (LCSH) for subject access.)

The classification provides the intellectual collocation that the headings themselves do not provide. The difficulty with this is that the classification collocates topically but, at least in its current form, does not collocate the name headings in the catalog that identify people and organizations as entities.

Conclusion (sort of)

Controlled headings as access points for library catalogs could provide better service than keyword search alone. How to make use of headings is a difficult question. The first issue is how to exploit the precision of headings while still allowing users to search on any terms that they have in mind. Keyword search is, from the user's point of view, frictionless. They don't have to think "what string would the library have used for this?".

Collocation of items by topical sameness or other relationships (e.g. "named for", "subordinate to") is possibly the best service that libraries could provide, although it is very hard to do this through the mechanism of language strings. Dewey's original idea of a classified order with a language-based index is still a good one, although classifications are hard to maintain and hard to assign.

If challenged to state what I think the library catalog should be, my answer would be that it should provide a useful order that illustrates one or more intellectual contexts that will help the user enter and navigate what the library has to offer. Unfortunately I can't say today how we could do that. Could we think about that together?

Readings

Dewey, Melvil. Decimal classification and relativ index for libraries, clippings, notes, etc. Edition 7. Lake Placid Club, NY: Forest Press, 1911. https://archive.org/details/decimalclassifi00dewegoog

Shera, Jesse H., Margaret E. Egan, and Jeannette M. Lynn. The Classified Catalog: Basic Principles and Practices. Chicago, Ill: American Library Association, 1956.

Monday, August 06, 2018

FRBR as a Data Model

(I've been railing against FRBR since it was first introduced. It still confuses me some. I put out these ideas for discussion. If you disagree, please add your thoughts to this post.)

I was recently speaking at a library conference in Oslo where I went through my criticisms of our cataloging models, and how they are not suited to the problems we need to solve today. I had my usual strong criticisms of FRBR and the IFLA LRM. However, when I finished speaking I was asked why I am so critical of those models, which means that I did not explain myself well. I am going to try again here, as clearly and succinctly as I can.

Conflation of Conceptual Models with Data Models

FRBR's main impact was that it provided a mental model of the bibliographic universe that reflects a conceptual view of the elements of descriptive cataloging. You will find nothing in FRBR that could not be found in standard library cataloging of the 1990's, which is when the FRBR model was developed. What FRBR adds to our understanding of bibliographic information is that it gives names and definitions to key concepts that had been implied but not fully articulated in library catalog data. If it had stopped there we would have had an interesting mental model that allows us to speak more precisely about catalogs and cataloging.

Unfortunately, the use of diagrams that appear to define actual data models and the listing of entities and their attributes have led the library world down the wrong path, that of reading FRBR as the definition of a physical data model. Compounding this, the LRM goes down that path even further by claiming to be a structural model of bibliographic data, which implies that it is the structure for library catalog data. I maintain that the FRBR conceptual model should not be assumed to also be a model for bibliographic data in a machine-readable form. The main reason for this has to do with the functionality that library catalogs currently provide (and what functions they may provide in the future). This is especially true in relation to what FRBR refers to as its Group 1 entities: work, expression, manifestation, and item.

The model defined in the FRBR document presents an idealized view that does not reflect the functionality of bibliographic data in library catalogs nor likely system design. This is particularly obvious in the struggle to fit the reality of aggregate works into the Group 1 "structure," but it is true even for simple published resources. The remainder of this document attempts to explain the differences between the ideal and the real.

The Catalog vs the Universe

One of the unspoken assumptions in the FRBR document is that it poses its problems and solutions in the context of the larger bibliographic universe, not in terms of a library catalog. The idea of gathering all of the manifestations of an expression and all of the expressions of a work is not shown as contingent on the holdings of any particular library. Similarly, bibliographic relationships are presented as having an existence without addressing how those relationships would be handled when the related works are not available in a data set. This may be due to the fact that the FRBR working group was made up solely of representatives of large research libraries whose individual catalogs cover a significant swath of the bibliographic world. It may also have arisen from the fact that the FRBR working group was formed to address the exchange of data between national libraries, and thus was intended as a universal model. Note that no systems designers were involved in the development of FRBR to address issues that would come up in catalogs of various extents or types.

The questions asked and answered by the working group were therefore not of the nature of "how would this work in a catalog?" and were more of the type "what is the nature of bibliographic data?". The latter is a perfectly legitimate question for a study of the nature of bibliographic data, but that study cannot be assumed to answer the first question.

Functionality

Although the F in FRBR stands for "functional," FRBR does little to address the functionality of the library catalog. The user tasks find, identify, select and obtain (and now explore, added in the LRM) are not explained in terms of how the data aids those tasks; the FRBR document only lists which data elements are essential to each task. Computer system design, including the design of data structures, needs to go at least a step further in its definition of functions, which means specifying not only which data elements are relevant, but the specific usage the data element is put to in an actual machine interaction with the user and services. A systems developer has to take into account precisely what needs to be done with the FRBR entities in all of the system functions, from input to search and display.

(Note: I'm going to try to cover this better and to give examples in an upcoming post.)

Analysis that is aimed at creating a bibliographic data format for a library catalog would take into account that providing user-facing information about work and expression is context-dependent based on the holdings of the individual library and on the needs of its users. It would also take into account the optional use of work and expression information in search and display, and possibly give alternate views to support different choices in catalog creation and deployment. Essentially, analysis for a catalog would take system functionality into account.

A lot of facts about the nature of computer-based catalogs have to be acknowledged: that users are no longer performing "find" in an alphabetical list of headings, but are performing keyword searches; that collocation based on work-ness is not a primary function of catalog displays; that a significant proportion of a bibliographic database consists of items with a single work-expression-manifestation grouping; and finally that there is an inconsistent application of work and expression information in today's data.

In spite of nearly forty years of using library systems whose default search function is a single box in which users are asked to input query terms that will be searched as keywords taken from a combination of creator, title, and subject fields in the bibliographic record, the LRM doubles down on the status of textual headings as primary elements, aka: Nomen. Unfortunately it doesn't address the search function in any reasonable fashion, which is to say it doesn't give an indication of the role of Nomen in the find function. In fact, here is the sum total of what the LRM says about search:

"To facilitate this task [find], the information system seeks to enable effective searching by offering appropriate search elements or functionality."

That's all. As I said in my talk at Oslo, this is up there with the jokes about bad corporate mission statements, like: "We endeavor to enhance future value through intelligent paradigm implementation." First, no information system seeks to enable ineffective searching. Yet the phrase "effective searching" is meaningless in itself; without a definition of what is effective this is just a platitude. The same is true for "appropriate search elements": no one would suggest that a system should use inappropriate search elements, but defining appropriate search is not at all a simple task. In fact, I contend that one of the primary problems with today's library systems is that we specifically lack a definition of appropriate, effective search. This is rendered especially difficult because the data that we enter into our library systems is data that was designed for an entirely different technology: the physical card catalog, organized as headings in alphabetical order.

One Record to Rule Them All

Our actual experience regarding physical structures for bibliographic data should be sufficient proof that there is not one single solution. Although libraries today are consolidating around the MARC21 record format, primarily for economic reasons, there have been numerous physical formats in use that mostly adhere to the international standard of ISBD. In this same way, there can be multiple physical formats that adhere to the conceptual model expressed in the FRBR and LRM documents. We know this is the case by looking at the current bibliographic data, which includes varieties of MARC, ISBD, BIBFRAME, and others. Another option for surfacing information about works in catalogs could follow what OCLC seems to be developing, which is the creation of works through a clustering of single-entity records. In that model, a work is a cluster of expressions, and an expression is a cluster of manifestations. This model has the advantage that it does not require the cataloger to make decisions about work and expression statements before it is known if the resource will be the progenitor of a bibliographic family, or will stand alone. It also does not require the cataloger to have knowledge of the bibliographic universe beyond their own catalog.
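The clustering idea can be sketched roughly as follows. To be clear about what is assumed: this is not OCLC's algorithm, which is not described here; the records, field names, and grouping keys (creator plus work title for a work, language for an expression) are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical single-entity records, one per manifestation (not real catalog data).
manifestations = [
    {"id": "m1", "creator": "Author A", "work_title": "Title X", "language": "eng"},
    {"id": "m2", "creator": "Author A", "work_title": "Title X", "language": "eng"},
    {"id": "m3", "creator": "Author A", "work_title": "Title X", "language": "fre"},
    {"id": "m4", "creator": "Author B", "work_title": "Title Y", "language": "eng"},
]


def cluster(records):
    """Group manifestation ids into expressions, and expressions into works."""
    works = defaultdict(lambda: defaultdict(list))
    for record in records:
        # Assumed work key: normalized creator plus normalized work title.
        work_key = (record["creator"].casefold(), record["work_title"].casefold())
        # Assumed expression key: language alone, a deliberate oversimplification.
        expression_key = record["language"]
        works[work_key][expression_key].append(record["id"])
    # Convert the nested defaultdicts to plain dicts for display.
    return {work: dict(expressions) for work, expressions in works.items()}


for work, expressions in cluster(manifestations).items():
    print(work)
    for language, ids in expressions.items():
        print("    ", language, ids)
```

The point of the sketch is where the decision happens: the cataloger describes each manifestation once, and the work and expression groupings are computed afterward from whatever the database happens to hold.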

The key element of all of these, and countless other, solutions is that they can be faithful to the mental model of FRBR while also being functional and efficient as systems. We should also expect that the systems solutions to this problem space will not stay the same over time, since technology is in constant evolution.

Summary

I have identified here two quite fundamental areas where FRBR's analysis differs from the needs of system development: 1) the difference between conceptual and physical models and 2) the difference between the (theoretical) bibliographic universe and the functional library catalog. Neither of these is a criticism of FRBR as such, but they do serve as warnings about a widely held assumption in the library world today, that of mistaking the FRBR entity model for a data and catalog design model. This is evident in the outcry over the design of the BIBFRAME model which uses a two-tiered bibliographic view and not the three tiers of FRBR. The irony of that complaint is that at the very same time as those outcries, catalogers are using FRBR concepts (as embodied in RDA) while cataloging into the one-tiered data model of MARC, which includes all of the entities of FRBR in a single data record. While cataloging into MARC records may not be the best version of bibliographic data storage that we could come up with, we must acknowledge that there are many possible technology solutions that could allow the exercise of bibliographic control while making use of the concepts addressed in FRBR/LRM. Those solutions must be based at least as much on user needs in actual catalogs as on bibliographic theory.

As a theory, FRBR posits an ideal bibliographic environment which is not the same as the one that is embodied in any library catalog. The diagrams in the FRBR and LRM documents show the structure of the mental model, but not library catalog data. Because the FRBR document does not address implementation of the model in a catalog, there is no test of how such a model does or does not reflect actual system design. The extrapolation from mental model to physical model is not provided in FRBR or the LRM, as neither addresses system functions and design, not even at a macro level.

I have to wonder if FRBR/LRM shouldn't be considered a model for bibliography rather than library catalogs. Bibliography was once a common art in the world of letters but that has faded greatly over the last half century. Bibliography is not the same as catalog creation, but one could argue that libraries and librarians are the logical successors to the bibliographers of the past, and that a “universal bibliography” created under the auspices of libraries would provide an ideal context for the entries in the library catalog. This could allow users to view the offerings of a single library as a subset of a well-described world of resources, most of which can be accessed in other libraries and archives.

Bibliography

IFLA. Functional Requirements for Bibliographic Records. 1998/2008
IFLA. Library Reference Model. 2017

Tuesday, October 10, 2017

Google Books and Mein Kampf

I hadn't looked at Google Books in a while, or at least not carefully, so I was surprised to find that Google had added blurbs to most of the books. Even more surprising (although perhaps I should say "troubling") is that no source is given for the book blurbs. Some at least come from publisher sites, which means that they are promotional in nature. For example, here's a mildly promotional text about a literary work, from a literary publisher:

This gives a synopsis of the book, starting with:

"Throughout a single day in 1892, John Shawnessy recalls the great moments of his life..."

It ends by letting the reader know that this was a bestseller when published in 1948, and calls it a "powerful novel."

The blurb on a 1909 version of Darwin's The Origin of Species is mysterious because the book isn't a recent publication with an online site providing the text. I do not know where this description comes from, but because the entire thrust of this blurb is about the controversy of evolution versus the Bible (even though Darwin did not press this point himself) I'm guessing that the blurb post-dates this particular publication.

"First published in 1859, this landmark book on evolutionary biology was not the first to deal with the subject, but it went on to become a sensation -- and a controversial one for many religious people who could not reconcile Darwin's science with their faith."

That's a reasonable view to take of Darwin's "landmark" book but it isn't what I would consider to be faithful to the full import of this tome.

The blurb on Hitler's Mein Kampf is particularly troubling. If you look at different versions of the book you get both pro- and anti- Nazi sentiments, neither of which really belongs on a site that claims to be a catalog of books. Also note that because each book entry has only one blurb, the tone changes considerably depending on which publication you happen to pick from the list.

First on the list:

"Settling Accounts became Mein Kampf, an unparalleled example of muddled economics and history, appalling bigotry, and an intense self-glorification of Adolf Hitler as the true founder and builder of the National Socialist movement. It was written in hate and it contained a blueprint for violent bloodshed."

Second on the list:

"This book has set a path toward a much higher understanding of the self and of our magnificent destiny as living beings part of this Race on our planet. It shows us that we must not look at nature in terms of good or bad, but in an unfiltered manner. It describes what we must do if we want to survive as a people and as a Race."

That's horrifying. Note that both books are self-published, and the blurbs are the ones that I find on those books in Amazon, perhaps indicating that Google is sucking up books from the Amazon site. There is, or at least there once was, a difference between Amazon and Google Books. Google, after all, scanned books in libraries and presented itself as a search engine for published texts; Amazon will sell you Trump's tweets on toilet paper. The only text on the Google Books page still claims that Google Books is about search: "Search the world's most comprehensive index of full-text books." Libraries partnered with Google with lofty promises of gains in scholarship:

"Our participation in the Google Books Library Project will add significantly to the extensive digital resources the Libraries already deliver. It will enable the Libraries to make available more significant portions of its extraordinary archival and special collections to scholars and researchers worldwide in ways that will ultimately change the nature of scholarship." Jim Neal, Columbia University

I don't know how these folks now feel about having their texts intermingled with publications they would never buy and described by texts that may come from shady and unreliable sources.

Even leaving aside the grossest aspects of the blurbs and Google's hypocrisy about its commercialization of its books project, adding blurbs to the book entries with no attribution and clearly not vetting the sources is extremely irresponsible. It's also very Google to create sloppy algorithms that illustrate their basic ignorance of the content they are working with -- in this case, the world's books.

Tuesday, August 08, 2017

On reading Library Journal, September, 1877

Among the many advantages of retirement is the particular one of idle time. And I will say that as a librarian one could do no better than to spend some of that time communing with the history of the profession. The difficulty is that it is so rich, so familiar in many ways that it is hard to move through it quickly. Here is just a fraction of the potential value to be found in the September issue of volume two of Library Journal.* Admittedly this is a particularly interesting number because it reports on the second meeting of the American Library Association.

For any student of library history it is especially interesting to encounter certain names as living, working members of the profession.

Other names reflect works that continued on, some until today, such as Poole and Bowker, both names associated with long-running periodical indexes.

What is particularly striking, though, is how many of the topics of today were already being discussed then, although obviously in a different context. The association was formed, at least in part, to help librarianship achieve the status of a profession. Discussed were the educating of the public on the role of libraries and librarians as well as providing education so that there could be a group of professionals to take the jobs that needed that professional knowledge. There was work to be done to convince state legislatures to support state and local libraries.

One of the first acts of the American Library Association when it was founded in 1876 (as reported in the first issue of Library Journal) was to create a Committee on Cooperation. This is the seed for today's cooperative cataloging efforts as well as other forms of sharing among libraries. In 1877, undoubtedly encouraged by the participation of some members of the publishing community in ALA, there was hope that libraries and publishers would work together to create catalog entries for in-print works.

This is one hope of the early participants that we are still working on, especially the desire that such catalog copy would be "uniform." Note that there were also discussions about having librarians contribute to the periodical indexes of R. R. Bowker and Poole, so the cooperation would flow in both directions.

The physical organization of libraries also was of interest, and a detailed plan for a round (actually octagonal) library design was presented:

His conclusion, however, shows a difference in our concepts of user privacy.

Especially interesting to me are the discussions of library technology. I was unaware of some of the emerging technologies for reproduction such as the papyrograph and the electric pen. In 1877, the big question, though, was whether to employ the new (but as yet un-perfected) technology of the typewriter in library practice.

There was some pooh-poohing of this new technology, but some members felt it might be reaching a state of usefulness.

"The President" in this case is Justin Winsor, Superintendent of the Boston Library, then president of the American Library Association. Substituting more modern technologies, I suspect we have all taken part in this discussion during our careers.

Reading through the Journal evokes a strong sense of "plus ça change..." but I admit that I find it all rather reassuring. The historical beginnings give me a sense of why we are who we are today, and what factors are behind some of our embedded thinking on topics.

* Many of the early volumes are available from HathiTrust, if you have access. Although the texts themselves are public domain, these are Google-digitized books and are not available without a login. (Don't get me started!) If you do not have access to those, most of the volumes are available through the Internet Archive. Select "text" and search on "library journal". As someone without HathiTrust institutional access I have found most numbers in the range 1-39, but am missing (hint, hint): 5/1880; 8-9/1887-88; 17/1892; 19/1894; 28-30/1903-1905; 34-37/1909-1912. If I can complete the run I think it would be good to create a compressed archive of the whole and make that available via the Internet Archive to save others the time of acquiring them one at a time. If I can find the remainder that are pre-1923 I will add those in.

Sunday, July 09, 2017

The Work

I've been on a committee that was tasked by the Program for Cooperative Cataloging folks(*) to help them understand some of the issues around works (as defined in FRBR, RDA, BIBFRAME, etc.). There are huge complications, not the least being that we all are hard-pressed to define what a work is, much less how it should be addressed in some as-yet-unrealized future library system. Some of what I've come to understand may be obvious to you, especially if you are a cataloger who provides authority data for your own catalog or the shared environment. Still, I thought it would be good to capture these thoughts. Of course, I welcome comments and further insights on this.

There are at least four different meanings to the term work as it is being discussed in library venues.

"Work-ness"

First there is the concept that every resource embodies something that could be called a "work" and that this work is a human creation. The idea of the work probably dates back as far as the recognition that humans create things, and that those things have meaning. There is no doubt that there is "work-ness" in all created things, although prior to FRBR there was little attempt to formally define it as an aspect of bibliographic description. It entered into cataloging consciousness in the 20th century: Patrick Wilson saw works as families of resources that grow and branch with each related publication;[1] Richard Smiraglia looked at works as a function of time;[2] and Seymour Lubetzky seems to have been the first to insist on viewing the work as intellectual content separate from the physical piece.[3]

"Work Description"

Second, there is the work in the bibliographic description: the RDA cataloging rules define the attributes or data elements that make up the work description, like the names of creators and the subject matter of the resource. Catalogers include these elements in descriptive cataloging even when the work is not defined as a stand-alone entity, as in the case of doing RDA cataloging in a MARC21 record environment. Most of the description of works is not new; creators and subjects have been assigned to cataloged items for a century or more. What has changed is that conceptually these are considered to be elements of the work that is inherent in the resource that is being cataloged but not limited to the item in hand.

It is this work description that is addressed in FRBR. The FRBR document of 1998 describes the scope of its entities to be solely bibliographic, specifically excluding authority data:

"The present study does not analyse those additional data associated with persons, corporate bodies, works, and subjects that are typically recorded only in authority records."

Notably, FRBR is silent on the question of whether the work description is unique within the catalog, which would be implied by the creation of a work authority "record".

"Work Decision"

Next there is the work decision: this is the situation when a data creator determines whether the work to be described needs a unique and unifying entry within the stated cataloging environment to bring together exemplars of the same work that may be described differently. If so, the cataloger defines the authoritative identity for the work and provides information that distinguishes that work from all other works, and that brings together all of the variations of that work. The headings ("uniform titles") that are created also serve to disambiguate expressions of the same work by adding dates, languages, and other elements of the expression. To back all of this up, the cataloger gives evidence of his/her decision, primarily what sources were consulted that support the decision.

In today's catalog, a full work decision, resulting in a work authority record, is done for only a small number of works, with the exception of musical works where such titles are created for nearly all. The need to make the work decision may vary from catalog to catalog and can depend on whether the library holds multiple expressions of the work or other works that may need clarification in the catalog. Note that there is nothing in FRBR that would indicate that every work must have a unique description, just that works should be described. However, some have assumed that the FRBR work is always a representation of a unique creation. I don't find that expressed in FRBR or in the FRBR-LRM.

"Work Entity"

Finally there is the work entity: this is a data structure that encapsulates the description of the work. This data structure could be realized in any number of different encodings, such as ISO 2709 (the underlying record structure for MARC21), RDF, XML, or JSON. The latter two can also accommodate linked data in the form of RDF/XML or JSON-LD.
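
To make the point about encodings concrete, here is a minimal sketch in Python, offered as an illustration only. The WorkEntity class, its three elements, and the example.org vocabulary and identifier are all hypothetical; none of them comes from FRBR, RDA, or BIBFRAME. It shows one work description written out as plain JSON and again as JSON-LD, where the only differences are the added context, type, and IRI.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class WorkEntity:
    """Hypothetical container for a work description (not a real schema)."""
    title: str
    creators: list = field(default_factory=list)
    subjects: list = field(default_factory=list)


def to_json(work):
    """Serialize the description as plain JSON, with no identifier anchor."""
    return json.dumps(asdict(work), sort_keys=True)


def to_json_ld(work, iri):
    """Serialize the same description as JSON-LD, anchored by an IRI.

    The @vocab value is a placeholder, not a published vocabulary.
    """
    doc = {
        "@context": {"@vocab": "https://example.org/vocab/"},
        "@id": iri,
        "@type": "Work",
    }
    doc.update(asdict(work))
    return json.dumps(doc, sort_keys=True)


work = WorkEntity(title="Hamlet", creators=["Shakespeare, William"])
print(to_json(work))
print(to_json_ld(work, "https://example.org/work/1"))
```

The description itself is identical in both outputs; what changes is whether the data elements hang off an identifier that other data can link to.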

Here we have a complication in our current environment because the main encodings of bibliographic data, MARC21 and BIBFRAME, both differ from the work concept presented in FRBR and in the RDA cataloging rules, which follow FRBR fairly faithfully. With a few exceptions, MARC21 does not distinguish work elements from expression or manifestation elements. Encoding RDA-defined data in the MARC21 "unit record" can be seen as proof of the conceptual nature of the work (and expression and manifestation) as defined in FRBR.

BIBFRAME, the proposed replacement for MARC21, has re-imagined the bibliographic work entity, departing from the entity breakdown in FRBR by defining a BIBFRAME work entity that tends to combine elements from FRBR's work and expression. However, where FRBR claims a neat division between the entities, with no overlapping descriptive elements, BIBFRAME 2.0 is being designed as a general bibliographic model, not an implementation of FRBR. (Whether or not BIBFRAME achieves this goal is another question.)

The diagrams in the 1998 FRBR report imply that there would be a work entity structure. However, the report also states unequivocally that it is not defining a data format.(**) In keeping with 1990s library technology, FRBR anticipates that each entity may have an identifier, but the identifier is a descriptive element (think: ISBN), not an anchor for all of the data elements of the entity (think: IRI).

As we see with the implementation of RDA cataloging in the MARC21 environment, describing a work conceptually does not require the use of a separate work "record." Whether work decisions are required for every cataloged manifestation is a cataloging decision; whether work entities are required for every work is a data design decision. That design decision should be based on the services that the system is expected to render. The "entity" decision may or may not require any action on the part of the cataloger depending on the interface in which cataloging takes place. Just as today's systems do not store the MARC21 data as it appears on the cataloger's screen, future systems will have internal data storage formats that will surely differ from the view in the various user interfaces.

"The Upshot"

We can assume that every human-created resource has an aspect of work-ness, but this doesn't always translate well to bibliographic description nor to a work entity in bibliographic data. Past practice in relation to works differs significantly from, say, the practice in relation to agents (persons, corporate bodies) for whom one presumes that the name authority control decision is always part of the cataloging workflow. Instead, work "names" have been inconsistently developed (with exceptions, such as in music materials). It is unclear if, in the future, every work description will be assumed to have undergone a "work name authority" analysis, but even more unreliable is any assumption that can be made about whether an existing bibliographic description without a uniform title has had its "work-ness" fully examined.

This latter concern is especially evident in the transformations of current MARC21 cataloging into RDA, BIBFRAME, or schema.org. From what I have observed, the transformations do not preserve the difference between a manifestation title that does not have a formal uniform title to represent the work, and those titles that are currently coded in MARC21 fields 130, 240, or the $t of an author/title field. Instead, where a coded uniform title is not available in the MARC21 record, the manifestation title is copied to the work title element. This means that the fact that a cataloger has carefully crafted a work title for the resource is lost. Even though we may agree that the creation of work titles has been inconsistent at best, copying transcribed titles to the work title entity wherever no uniform title field is present in the MARC21 record seems to be a serious loss of information. Or perhaps I should put this as a question: in the absence of a uniform title element, can we assume that the transcribed title is the appropriate work title?
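
One way a transformation could avoid that loss is to carry along a note of where the work title came from. What follows is a minimal sketch in Python, not a description of how any existing converter works. The record shape (a dict of MARC21 tags, each holding a dict of subfield codes), the function name, and the "source" labels are all hypothetical, and for brevity it looks only at $a of fields 130, 240, and 245, leaving out the $t of author/title fields.

```python
def derive_work_title(record):
    """Return a work title plus a note on where it came from.

    `record` is a hypothetical in-memory shape, e.g.
    {"240": {"a": "Hamlet"}, "245": {"a": "The tragedy of Hamlet"}}.
    """
    # Prefer a cataloger-assigned uniform title (fields 130 or 240).
    for tag in ("130", "240"):
        uniform = record.get(tag)
        if uniform and uniform.get("a"):
            return {"title": uniform["a"], "source": "uniform_title", "tag": tag}

    # Fall back to the transcribed title in 245, but say so, because
    # no work decision is evident in the record.
    transcribed = record.get("245")
    if transcribed and transcribed.get("a"):
        return {"title": transcribed["a"], "source": "transcribed_title", "tag": "245"}

    # Nothing usable was found.
    return None


with_uniform = {
    "240": {"a": "Hamlet"},
    "245": {"a": "The tragedy of Hamlet, Prince of Denmark"},
}
without_uniform = {"245": {"a": "Hamlet"}}

print(derive_work_title(with_uniform))
print(derive_work_title(without_uniform))
```

Both records end up with a work title, but only the first can claim that a cataloger made a work decision, and the "source" value keeps that difference visible downstream.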

To conclude, I guess I will go ahead and harp on a common nag of mine, which is that copying data from one serialization to another is not the transformation that will help us move forward. The "work" is very complex; I would feel less concerned if we had a strong and shared concept of what services we want the work to provide in the future, which should help us decide what to do with the messy legacy that we have today.

Footnotes

* Note that in 1877 there already was a "Co-operation committee" of the American Library Association, tasked with looking at cooperative cataloging and other tasks. That makes this a 140-year-old tradition.

"Of the standing committees, that on co-operation will probably prove the most important organ of the Association..." (see more at link)

** If you want more about what FRBR is and is not, I will recommend my book "FRBR: Before and After" (open access copy) for an in-depth analysis. If you want less, try my SWIB talk "Mistakes Have Been Made" which gets into FRBR at about 13:00, but you might enjoy the lead-up to that section.

References

[1] Wilson, Patrick. Two Kinds of Power : an Essay on Bibliographical Control. University of California Publications: Librarianship. Berkeley, Los Angeles, London: University of California Press, 1978.
[2] Smiraglia, Richard. The Nature of “a Work”; Implications for the Organization of Knowledge. Lanham: Scarecrow Press, 2001.
[3] Lubetzky, Seymour. Principles of Cataloging. Final report. Phase I. In: Seymour Lubetzky: writings on the classical art of cataloging. Edited by Elaine Svenonius and Dorothy McGarry. Englewood, CO, Libraries Unlimited. 2001

Tuesday, June 20, 2017

Pray for Peace

This is a piece I wrote on March 22, 2003, two days after the beginning of the second Gulf war. I just found it in an old folder, and sadly have to say that things have gotten worse than I feared. I also note an unfortunate use of terms like "peasant" and "primitive" but I leave those as a recognition of my state of mind/information. Pray for peace.

Saturday, March 22, 2003

Gulf War II

The propaganda machine is in high gear, at war against the truth. The bombardments are constant and calculated. This has been planned carefully over time.

The propaganda box sits in every home showing footage that it claims is of a distant war. We citizens, of course, have no way to independently verify that, but then most citizens are quite happy to accept it at face value.

We see peaceful streets by day in a lovely, prosperous and modern city. The night shots show explosions happening at a safe distance. What is the magical spot from which all of this is being observed?

Later we see pictures of damaged buildings, but they are all empty, as are the streets. There are no people involved, and no blood. It is the USA vs. architecture, as if the city of Baghdad itself is our enemy.

The numbers of casualties, all of them ours, all of them military, are so small that each one has an individual name. We see photos of them in dress uniform. The families state that they are proud. For each one of these there is the story from home: the heavily made-up wife who just gave birth to twins and is trying to smile for the camera, the child who has graduated from school, the community that has rallied to help re-paint a home or repair a fence.

More people are dying on the highways across the USA each day than in this war, according to our news. Of course, even more are dying around the world of AIDS or lung cancer, and we aren't seeing their pictures or helping their families. At least not according to the television news.

The programming is designed like a curriculum with problems and solutions. As we begin bombing, the networks show a segment in which experts explain the difference between the previous Gulf War's bombs and those used today. Although we were assured during the previous war that our bombs were all accurately hitting their targets, word got out afterward that in fact the accuracy had been dismally low. Today's experts explain that the bombs being used today are far superior to those used previously, and that when we are told this time that they are hitting their targets it is true, because today's bombs really are accurate.

As we enter and capture the first impoverished, primitive village, a famous reporter is shown interviewing Iraqi women living in the USA who enthusiastically assure us that the Iraqi people will welcome the American liberators with open arms. The newspapers report Iraqis running into the streets shouting "Peace to all." No one suggests that the phrase might be a plea for mercy by an unarmed peasant facing a soldier wearing enough weaponry to raze the entire village in an eye blink.

Reporters riding with US troops are able to phone home over satellite connections and show us grainy pictures of heavily laden convoys in the Iraqi desert. Like the proverbial beasts of burden, the trucks are barely visible under their packages of goods, food and shelter. What they are bringing to the trade table is different from the silks and spices that once traveled these roads, but they are carrying luxury goods beyond the ken of many of Iraq's people: high tech sensor devices, protective clothing against all kinds of dangers, vital medical supplies and, perhaps even more important, enough food and water to feed an army. In a country that feeds itself only because of international aid -- aid that has been withdrawn as the US troops arrive -- the trucks are like self-contained units of American wealth motoring past.

I feel sullied watching any of this, or reading newspapers. It's an insult to be treated like a mindless human unit being prepared for the post-war political fall-out. I can't even think about the fact that many people in this country are believing every word of it. I can't let myself think that the propaganda war machine will win.

Pray for peace.