Coyle's InFormation: 2017

Tuesday, October 10, 2017

Google Books and Mein Kampf

I hadn't looked at Google Books in a while, or at least not carefully, so I was surprised to find that Google had added blurbs to most of the books. Even more surprising (although perhaps I should say "troubling") is that no source is given for the book blurbs. Some at least come from publisher sites, which means that they are promotional in nature. For example, here's a mildly promotional text about a literary work, from a literary publisher:

This gives a synopsis of the book, starting with:

"Throughout a single day in 1892, John Shawnessy recalls the great moments of his life..."

It ends by letting the reader know that this was a bestseller when published in 1948, and calls it a "powerful novel."

The blurb on a 1909 version of Darwin's The Origin of Species is mysterious because the book isn't a recent publication with an online site providing the text. I do not know where this description comes from, but because the entire thrust of this blurb is about the controversy of evolution versus the Bible (even though Darwin did not press this point himself) I'm guessing that the blurb post-dates this particular publication.

"First published in 1859, this landmark book on evolutionary biology was not the first to deal with the subject, but it went on to become a sensation -- and a controversial one for many religious people who could not reconcile Darwin's science with their faith."

That's a reasonable view to take of Darwin's "landmark" book but it isn't what I would consider to be faithful to the full import of this tome.

The blurb on Hitler's Mein Kampf is particularly troubling. If you look at different versions of the book you get both pro- and anti- Nazi sentiments, neither of which really belongs on a site that claims to be a catalog of books. Also note that because each book entry has only one blurb, the tone changes considerably depending on which publication you happen to pick from the list.

First on the list:

"Settling Accounts became Mein Kampf, an unparalleled example of muddled economics and history, appalling bigotry, and an intense self-glorification of Adolf Hitler as the true founder and builder of the National Socialist movement. It was written in hate and it contained a blueprint for violent bloodshed."

Second on the list:

"This book has set a path toward a much higher understanding of the self and of our magnificent destiny as living beings part of this Race on our planet. It shows us that we must not look at nature in terms of good or bad, but in an unfiltered manner. It describes what we must do if we want to survive as a people and as a Race."

That's horrifying. Note that both books are self-published, and the blurbs are the ones that I find on those books on Amazon, perhaps indicating that Google is sucking up books from the Amazon site. There is, or at least there once was, a difference between Amazon and Google Books. Google, after all, scanned books in libraries and presented itself as a search engine for published texts; Amazon will sell you Trump's tweets on toilet paper. The only text on the Google Books page still claims that Google Books is about search: "Search the world's most comprehensive index of full-text books." Libraries partnered with Google with lofty promises of gains in scholarship:

"Our participation in the Google Books Library Project will add significantly to the extensive digital resources the Libraries already deliver. It will enable the Libraries to make available more significant portions of its extraordinary archival and special collections to scholars and researchers worldwide in ways that will ultimately change the nature of scholarship." Jim Neal, Columbia University

I don't know how these folks now feel about having their texts intermingled with publications they would never buy and described by texts that may come from shady and unreliable sources.

Even leaving aside the grossest aspects of the blurbs and Google's hypocrisy about its commercialization of its books project, adding blurbs to the book entries with no attribution and clearly not vetting the sources is extremely irresponsible. It's also very Google to create sloppy algorithms that illustrate their basic ignorance of the content they are working with -- in this case, the world's books.

Tuesday, August 08, 2017

On reading Library Journal, September, 1877

Among the many advantages of retirement is the particular one of idle time. And I will say that as a librarian one could do no better than to spend some of that time communing with the history of the profession. The difficulty is that it is so rich, so familiar in many ways that it is hard to move through it quickly. Here is just a fraction of the potential value to be found in the September issue of volume two of Library Journal.* Admittedly this is a particularly interesting number because it reports on the second meeting of the American Library Association.

For any student of library history it is especially interesting to encounter certain names as living, working members of the profession.

Other names reflect works that continued on, some until today, such as Poole and Bowker, both names associated with long-running periodical indexes.

What is particularly striking, though, is how many of the topics of today were already being discussed then, although obviously in a different context. The association was formed, at least in part, to help librarianship achieve the status of a profession. Discussed were the educating of the public on the role of libraries and librarians as well as providing education so that there could be a group of professionals to take the jobs that needed that professional knowledge. There was work to be done to convince state legislatures to support state and local libraries.

One of the first acts of the American Library Association when it was founded in 1876 (as reported in the first issue of Library Journal) was to create a Committee on Cooperation. This is the seed for today's cooperative cataloging efforts as well as other forms of sharing among libraries. In 1877, undoubtedly encouraged by the participation of some members of the publishing community in ALA, there was hope that libraries and publishers would work together to create catalog entries for in-print works.

This is one hope of the early participants that we are still working on, especially the desire that such catalog copy would be "uniform." Note that there were also discussions about having librarians contribute to the periodical indexes of R. R. Bowker and Poole, so the cooperation would flow in both directions.

The physical organization of libraries also was of interest, and a detailed plan for a round (actually octagonal) library design was presented:

His conclusion, however, shows a difference in our concepts of user privacy.

Especially interesting to me are the discussions of library technology. I was unaware of some of the emerging technologies for reproduction such as the papyrograph and the electric pen. In 1877, the big question, though, was whether to employ the new (but as yet un-perfected) technology of the typewriter in library practice.

There was some poo-pooing of this new technology, but some members felt it may be reaching a state of usefulness.

"The President" in this case is Justin Winsor, Superintendent of the Boston Library, then president of the American Library Association. Substituting more modern technologies, I suspect we have all taken part in this discussion during our careers.

Reading through the Journal evokes a strong sense of "plus ça change..." but I admit that I find it all rather reassuring. The historical beginnings give me a sense of why we are who we are today, and what factors are behind some of our embedded thinking on topics.

* Many of the early volumes are available from HathiTrust, if you have access. Although the texts themselves are public domain, these are Google-digitized books and are not available without a login. (Don't get me started!) If you do not have access to those, most of the volumes are available through the Internet Archive. Select "text" and search on "library journal". As someone without HathiTrust institutional access I have found most numbers in the range 1-39, but am missing (hint, hint): 5/1880; 8-9/1887-88; 17/1892; 19/1894; 28-30/1903-1905; 34-37/1909-1912. If I can complete the run I think it would be good to create a compressed archive of the whole and make that available via the Internet Archive to save others the time of acquiring them one at a time. If I can find the remainder that are pre-1923 I will add those in.

Sunday, July 09, 2017

The Work

I've been on a committee that was tasked by the Program for Cooperative Cataloging folks(*) to help them understand some of the issues around works (as defined in FRBR, RDA, BIBFRAME, etc.). There are huge complications, not the least being that we all are hard-pressed to define what a work is, much less how it should be addressed in some as-yet-unrealized future library system. Some of what I've come to understand may be obvious to you, especially if you are a cataloger who provides authority data for your own catalog or the shared environment. Still, I thought it would be good to capture these thoughts. Of course, I welcome comments and further insights on this.

There are at least four different meanings to the term work as it is being discussed in library venues.

"Work-ness"

First there is the concept that every resource embodies something that could be called a "work" and that this work is a human creation. The idea of the work probably dates back as far as the recognition that humans create things, and that those things have meaning. There is no doubt that there is "work-ness" in all created things, although prior to FRBR there was little attempt to formally define it as an aspect of bibliographic description. It entered into cataloging consciousness in the 20th century: Patrick Wilson saw works as families of resources that grow and branch with each related publication;[1] Richard Smiraglia looked at works as a function of time;[2] and Seymour Lubetzky seems to have been the first to insist on viewing the work as intellectual content separate from the physical piece.[3]

"Work Description"

Second, there is the work in the bibliographic description: the RDA cataloging rules define the attributes or data elements that make up the work description, like the names of creators and the subject matter of the resource. Catalogers include these elements in descriptive cataloging even when the work is not defined as a stand-alone entity, as in the case of doing RDA cataloging in a MARC21 record environment. Most of the description of works is not new; creators and subjects have been assigned to cataloged items for a century or more. What is changed is that conceptually these are considered to be elements of the work that is inherent in the resource that is being cataloged but not limited to the item in hand.

It is this work description that is addressed in FRBR. The FRBR document of 1998 describes the scope of its entities to be solely bibliographic, specifically excluding authority data:

"The present study does not analyse those additional data associated with persons, corporate bodies, works, and subjects that are typically recorded only in authority records."

Notably, FRBR is silent on the question of whether the work description is unique within the catalog, which would be implied by the creation of a work authority "record".

"Work Decision"

Next there is the work decision: this is the situation when a data creator determines whether the work to be described needs a unique and unifying entry within the stated cataloging environment to bring together exemplars of the same work that may be described differently. If so, the cataloger defines the authoritative identity for the work and provides information that distinguishes that work from all other works, and that brings together all of the variations of that work. The headings ("uniform titles") that are created also serve to disambiguate expressions of the same work by adding dates, languages, and other elements of the expression. To back all of this up, the cataloger gives evidence of his/her decision, primarily what sources were consulted that support the decision.

In today's catalog, a full work decision, resulting in a work authority record, is done for only a small number of works, with the exception of musical works where such titles are created for nearly all. The need to make the work decision may vary from catalog to catalog and can depend on whether the library holds multiple expressions of the work or other works that may need clarification in the catalog. Note that there is nothing in FRBR that would indicate that every work must have a unique description, just that works should be described. However, some have assumed that the FRBR work is always a representation of a unique creation. I don't find that expressed in FRBR nor the FRBR-LRM.

"Work Entity"

Finally there is the work entity: this is a data structure that encapsulates the description of the work. This data structure could be realized in any number of different encodings, such as ISO 2709 (the underlying record structure for MARC21), RDF, XML, or JSON. The latter two can also accommodate linked data in the form of RDFXML or JSON-LD.
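To make the idea of a work entity as a data structure concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `WorkEntity` class, its field names, and the example.org identifier are invented, and this is not the structure defined by FRBR, RDA, or BIBFRAME. It shows only that the same work description can be held in one structure and written out in one of the possible encodings (plain JSON here).

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class WorkEntity:
    """Hypothetical container for the elements of a work description."""
    identifier: str  # an anchor for the whole entity (think: IRI)
    title: str
    creators: List[str] = field(default_factory=list)
    subjects: List[str] = field(default_factory=list)


def to_json(work: WorkEntity) -> str:
    """Serialize the entity as plain JSON, one of several possible encodings."""
    return json.dumps(asdict(work), sort_keys=True)


work = WorkEntity(
    identifier="http://example.org/work/1",  # made-up identifier
    title="On the Origin of Species",
    creators=["Darwin, Charles"],
    subjects=["Evolution (Biology)"],
)

encoded = to_json(work)
decoded = json.loads(encoded)  # round-trips back to the same elements
```

The point of the sketch is that the entity and its encoding are separate decisions: the same `WorkEntity` could just as well be written out as XML or as RDF statements.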

Here we have a complication in our current environment because the main encodings of bibliographic data, MARC21 and BIBFRAME, both differ from the work concept presented in FRBR and in the RDA cataloging rules, which follow FRBR fairly faithfully. With a few exceptions, MARC21 does not distinguish work elements from expression or manifestation elements. Encoding RDA-defined data in the MARC21 "unit record" can be seen as proof of the conceptual nature of the work (and expression and manifestation) as defined in FRBR.

BIBFRAME, the proposed replacement for MARC21, has re-imagined the bibliographic work entity, departing from the entity breakdown in FRBR by defining a BIBFRAME work entity that tends to combine elements from FRBR's work and expression. However, where FRBR claims a neat division between the entities, with no overlapping descriptive elements, BIBFRAME 2.0 is being designed as a general bibliographic model, not an implementation of FRBR. (Whether or not BIBFRAME achieves this goal is another question.)

The diagrams in the 1998 FRBR report imply that there would be a work entity structure. However, the report also states unequivocally that it is not defining a data format.(**) In keeping with 1990s library technology, FRBR anticipates that each entity may have an identifier, but the identifier is a descriptive element (think: ISBN), not an anchor for all of the data elements of the entity (think: IRI).

As we see with the implementation of RDA cataloging in the MARC21 environment, describing a work conceptually does not require the use of a separate work "record." Whether work decisions are required for every cataloged manifestation is a cataloging decision; whether work entities are required for every work is a data design decision. That design decision should be based on the services that the system is expected to render. The "entity" decision may or may not require any action on the part of the cataloger depending on the interface in which cataloging takes place. Just as today's systems do not store the MARC21 data as it appears on the cataloger's screen, future systems will have internal data storage formats that will surely differ from the view in the various user interfaces.

"The Upshot"

We can assume that every human-created resource has an aspect of work-ness, but this doesn't always translate well to bibliographic description nor to a work entity in bibliographic data. Past practice in relation to works differs significantly from, say, the practice in relation to agents (persons, corporate bodies) for whom one presumes that the name authority control decision is always part of the cataloging workflow. Instead, work "names" have been inconsistently developed (with exceptions, such as in music materials). It is unclear if, in the future, every work description will be assumed to have undergone a "work name authority" analysis, but even more unreliable is any assumption that can be made about whether an existing bibliographic description without a uniform title has had its "work-ness" fully examined.

This latter concern is especially evident in the transformations of current MARC21 cataloging into RDA, BIBFRAME, or schema.org. From what I have observed, the transformations do not preserve the difference between a manifestation title that does not have a formal uniform title to represent the work, and those titles that are currently coded in MARC21 fields 130, 240, or the $t of an author/title field. Instead, where a coded uniform title is not available in the MARC21 record, the manifestation title is copied to the work title element. This means that the fact that a cataloger has carefully crafted a work title for the resource is lost. Even though we may agree that the creation of work titles has been inconsistent at best, copying transcribed titles to the work title entity wherever no uniform title field is present in the MARC21 record seems to be a serious loss of information. Or perhaps I should put this as a question: in the absence of a uniform title element, can we assume that the transcribed title is the appropriate work title?
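One way a transformation could avoid this loss is to record where the work title came from. The following is a minimal sketch under stated assumptions, not a description of any existing converter. The flat tag-to-string dictionary is a hypothetical simplification of a MARC21 record (indicators, subfields, and the $t of author/title fields are ignored), and the `WorkTitle` type and its `source` labels are invented for illustration. Field 245 is used as the transcribed title.

```python
from typing import Dict, NamedTuple, Optional


class WorkTitle(NamedTuple):
    value: str
    source: str  # "uniform" (cataloger-assigned) or "transcribed" (copied)


def work_title_from_marc(fields: Dict[str, str]) -> Optional[WorkTitle]:
    """Pick a work title from a simplified, hypothetical MARC21-like mapping."""
    # Prefer a coded uniform title when the cataloger supplied one.
    for tag in ("130", "240"):
        if tag in fields:
            return WorkTitle(fields[tag], "uniform")
    # Otherwise fall back to the manifestation title, but say so.
    if "245" in fields:
        return WorkTitle(fields["245"], "transcribed")
    return None


with_uniform = work_title_from_marc(
    {"240": "Hamlet", "245": "The tragedy of Hamlet, Prince of Denmark"}
)
without_uniform = work_title_from_marc({"245": "On the origin of species"})
```

With the `source` value carried along, a later process can still tell a title that has been through a work decision from one that was merely copied.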

To conclude, I guess I will go ahead and harp on a common nag of mine, which is that copying data from one serialization to another is not the transformation that will help us move forward. The "work" is very complex; I would feel less concerned if we had a strong and shared concept of what services we want the work to provide in the future, which should help us decide what to do with the messy legacy that we have today.

Footnotes

* Note that in 1877 there already was a "Co-operation committee" of the American Library Association, tasked with looking at cooperative cataloging and other tasks. That makes this a 140-year-old tradition.

"Of the standing committees, that on co-operation will probably prove the most important organ of the Association..." (see more at link)

** If you want more about what FRBR is and is not, I will recommend my book "FRBR: Before and After" (open access copy) for an in-depth analysis. If you want less, try my SWIB talk "Mistakes Have Been Made" which gets into FRBR at about 13:00, but you might enjoy the lead-up to that section.

References

[1] Wilson, Patrick. Two Kinds of Power : an Essay on Bibliographical Control. University of California Publications: Librarianship. Berkeley, Los Angeles, London: University of California Press, 1978.
[2] Smiraglia, Richard. The Nature of “a Work”; Implications for the Organization of Knowledge. Lanham: Scarecrow Press, 2001.
[3] Lubetzky, Seymour. Principles of Cataloging. Final report. Phase I. In: Seymour Lubetzky: writings on the classical art of cataloging. Edited by Elaine Svenonius and Dorothy McGarry. Englewood, CO, Libraries Unlimited. 2001

Tuesday, June 20, 2017

Pray for Peace

This is a piece I wrote on March 22, 2003, two days after the beginning of the second Gulf war. I just found it in an old folder, and sadly have to say that things have gotten worse than I feared. I also note an unfortunate use of terms like "peasant" and "primitive" but I leave those as a recognition of my state of mind/information. Pray for peace.

Saturday, March 22, 2003

Gulf War II

The propaganda machine is in high gear, at war against the truth. The bombardments are constant and calculated. This has been planned carefully over time.

The propaganda box sits in every home showing footage that it claims is of a distant war. We citizens, of course, have no way to independently verify that, but then most citizens are quite happy to accept it at face value.

We see peaceful streets by day in a lovely, prosperous and modern city. The night shots show explosions happening at a safe distance. What is the magical spot from which all of this is being observed?

Later we see pictures of damaged buildings, but they are all empty, as are the streets. There are no people involved, and no blood. It is the USA vs. architecture, as if the city of Baghdad itself is our enemy.

The numbers of casualties, all of them ours, all of them military, are so small that each one has an individual name. We see photos of them in dress uniform. The families state that they are proud. For each one of these there is the story from home: the heavily made-up wife who just gave birth to twins and is trying to smile for the camera, the child who has graduated from school, the community that has rallied to help re-paint a home or repair a fence.

More people are dying on the highways across the USA each day than in this war, according to our news. Of course, even more are dying around the world of AIDS or lung cancer, and we aren't seeing their pictures or helping their families. At least not according to the television news.

The programming is designed like a curriculum with problems and solutions. As we begin bombing, the networks show a segment in which experts explain the difference between the previous Gulf War's bombs and those used today. Although we were assured during the previous war that our bombs were all accurately hitting their targets, word got out afterward that in fact the accuracy had been dismally low. Today's experts explain that the bombs being used today are far superior to those used previously, and that when we are told this time that they are hitting their targets it is true, because today's bombs really are accurate.

As we enter and capture the first impoverished, primitive village, a famous reporter is shown interviewing Iraqi women living in the USA who enthusiastically assure us that the Iraqi people will welcome the American liberators with open arms. The newspapers report Iraqis running into the streets shouting "Peace to all." No one suggests that the phrase might be a plea for mercy by an unarmed peasant facing a soldier wearing enough weaponry to raze the entire village in an eye blink.

Reporters riding with US troops are able to phone home over satellite connections and show us grainy pictures of heavily laden convoys in the Iraqi desert. Like the proverbial beasts of burden, the trucks are barely visible under their packages of goods, food and shelter. What they are bringing to the trade table is different from the silks and spices that once traveled these roads, but they are carrying luxury goods beyond the ken of many of Iraq's people: high tech sensor devices, protective clothing against all kinds of dangers, vital medical supplies and, perhaps even more important, enough food and water to feed an army. In a country that feeds itself only because of international aid -- aid that has been withdrawn as the US troops arrive -- the trucks are like self-contained units of American wealth motoring past.

I feel sullied watching any of this, or reading newspapers. It's an insult to be treated like a mindless human unit being prepared for the post-war political fall-out. I can't even think about the fact that many people in this country are believing every word of it. I can't let myself think that the propaganda war machine will win.

Pray for peace.

Wednesday, May 17, 2017

Two FRBRs, Many Relationships

There is tension in the library community between those who favor remaining with the MARC21 standard for bibliographic records, and others who are promoting a small number of RDF-based solutions. This is the perceived conflict, but in fact both camps are looking at the wrong end of the problem - that is, they are looking at the technology solution without having identified the underlying requirements that a solution must address. I contend that the key element that must be taken into account is the role of FRBR in cataloging and catalogs.

Some background: FRBR is stated to be a mental model of the bibliographic universe, although it also has inherent in it an adherence to a particular technology: entity-relation analysis for relational database design. This is stated fairly clearly in the introduction to the FRBR report, which says:

The methodology used in this study is based on an entity analysis technique that is used in the development of conceptual models for relational database systems. Although the study is not intended to serve directly as a basis for the design of bibliographic databases, the technique was chosen as the basis for the methodology because it provides a structured approach to the analysis of data requirements that facilitates the processes of definition and delineation that were set out in the terms of reference for the study.

The use of an entity-relation model was what led to the now ubiquitous diagrams that show separate entities for works, expressions, manifestations and items. This is often read as a proposed structure for bibliographic data, where a single work description is linked to multiple expression descriptions, each of which in turn links to one or more manifestation descriptions. Other entities like the primary creator link to the appropriate bibliographic entity rather than to a bibliographic description as a whole. In relational database terms, this would create an efficiency in which each work is described only once regardless of the number of expressions or manifestations in the database rather than having information about the work in multiple bibliographic descriptions. This is seen by some as a potential efficiency also for the cataloging workflow as information about a work does not need to be recreated in the description of each manifestation of the work.

Two FRBRs

What this means is that we have (at least) two FRBRs: the mental model of the bibliographic universe, which I'll refer to as FRBR-MM; and the bibliographic data model based on an entity-relation structure, which I'll refer to as FRBR-DM. These are not clearly separated in the FRBR final report and there is some ambiguity in statements from members of the FRBR working group about whether both models are intended outcomes of the report. Confusion arises in many discussions of FRBR when we do not distinguish which of these functions is being addressed.

FRBR-Mental Model

FRBR-MM is the thinking behind the RDA cataloging rules, and the conceptual entities define the structure of the RDA documentation and workflow. It instructs catalogers to analyze each item they catalog as being an item or manifestation that carries the expression of a creative work. There is no specific data model associated with the RDA rules, which is why it is possible to use the mental model to produce cataloging that is entered into the form provided by the MARC21 record, a structure that approximates the catalog entry described in AACR2.

In FRBR-MM, some entities can be implicit rather than explicit. FRBR-MM does not require that a cataloguer produce a separate and visible work entity. In the RDA cataloging coded in MARC, the primary creator and the subjects are associated with the overall bibliographic description without there being a separate work identity. Even when there is a work title created, the creator and subjects are directly associated with the bibliographic description of the manifestation or item. This doesn't mean that the cataloguer has not thought about the work and the expression in their bibliographic analysis, but the rules do not require those to be called out separately in the description. In the mental model you can view FRBR as providing a checklist of key aspects of the bibliographic description that must be addressed.

The FRBR report defines bibliographic relationships more strongly than previous cataloging rules. For her PhD work, Barbara Tillett (a principal on both the FRBR and RDA work groups) painstakingly viewed thousands of bibliographic records to tease out the types of bibliographic relationships that were noted. Most of these were implicit in free-form cataloguer-supplied notes and in added entries in the catalog records. Previous cataloging rules said little about bibliographic relationships, while RDA, using the work of Tillett which was furthered in the FRBR final report, has five chapters on bibliographic relationships. In the FRBR-MM encoded in MARC21, these continue to be cataloguer notes ("Adapted from …"), subject headings ("--adaptations"), and added entry fields. These notes and headings are human-readable but do not provide machine-actionable links between bibliographic descriptions. This means that you cannot have a system function that retrieves all of the adaptations of a work, nor are systems likely to provide searches based on relationship type, as these are buried in text. Also, whether relationships are between works or expressions or manifestations is not explicit in the recorded data. In essence, FRBR-MM in MARC21 ignores the separate description of the FRBR-defined Group 1 entities (WEMI), flattening the record into a single bibliographic description that is very similar to that produced with AACR2.
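The difference between a relationship buried in a note and a machine-actionable link can be sketched in a few lines of Python. This is only an illustration under assumptions: the identifiers, the relationship labels ("adaptationOf", "sequelTo"), and the helper functions are all invented here and are not drawn from RDA, FRBR, or any system.

```python
from collections import defaultdict

# A note is readable by people, but a system can only treat it as text.
note = "Adapted from the novel of the same name."

# Typed links between identified entities: (subject, relationship, object).
links = [
    ("work:film-1", "adaptationOf", "work:novel-1"),
    ("work:opera-1", "adaptationOf", "work:novel-1"),
    ("work:novel-2", "sequelTo", "work:novel-1"),
]


def index_by_target(triples):
    """Group links by the entity they point to, then by relationship type."""
    index = defaultdict(lambda: defaultdict(list))
    for subject, relationship, target in triples:
        index[target][relationship].append(subject)
    return index


def related(index, target, relationship):
    """Return every entity that has the given relationship to the target."""
    return sorted(index.get(target, {}).get(relationship, []))


index = index_by_target(links)
adaptations = related(index, "work:novel-1", "adaptationOf")
# adaptations is ['work:film-1', 'work:opera-1']
```

Retrieving all adaptations of a work becomes a simple lookup once the relationship type and both ends of the relationship are explicit, which is exactly what the free-text note cannot provide.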

FRBR-Data Model

FRBR-DM adheres to the model of separate identified entities and the relationships between them. These are seen in the diagrams provided in the FRBR report, and in the section on bibliographic relationships from that report. The first thing that needs to be said is that the FRBR report based its model on an analysis that is used for database design. There is no analysis provided for a record design. This is significant because databases and records used for information exchange can have significantly different structures. In a database there could be one work description linked to any number of expressions, but when exchanging information about a single manifestation presumably the expression and work entities would need to be included. That probably means that if you have more than one manifestation for a work being transmitted, that work information is included for each manifestation, and each bibliographic description is neatly contained in a single package. The FRBR report does not define an actual database design nor a record exchange format, even though the entities and relations in the report could provide a first step in determining those technologies.
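The contrast between a database structure and an exchange record can be sketched as follows. This is a hypothetical illustration, not a design taken from the FRBR report: the dictionaries stand in for database tables, the keys and publisher names are made up, and `export_record` is an invented helper. The work is stored once, but each exported manifestation record carries its own copy of the work and expression information.

```python
# Stand-ins for database tables: each work is described only once.
works = {"w1": {"title": "On the Origin of Species", "creator": "Darwin, Charles"}}
expressions = {"e1": {"work": "w1", "language": "eng"}}
manifestations = {
    "m1": {"expression": "e1", "publisher": "Publisher A", "date": "1859"},
    "m2": {"expression": "e1", "publisher": "Publisher B", "date": "1909"},
}


def export_record(manifestation_id):
    """Build a self-contained exchange record for one manifestation."""
    m = manifestations[manifestation_id]
    e = expressions[m["expression"]]
    w = works[e["work"]]
    return {
        "work": dict(w),  # copied into every record that needs it
        "expression": {k: v for k, v in e.items() if k != "work"},
        "manifestation": {k: v for k, v in m.items() if k != "expression"},
    }


records = [export_record(mid) for mid in sorted(manifestations)]
```

Here the database holds one work description while the two exported records each repeat it, which is the redundancy that a record format accepts in exchange for being a single, neatly contained package.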

FRBR-DM uses the same mental model as FRBR-MM, but adds considerable functionality that comes from the entity-relationship model. FRBR-DM implements the concepts in FRBR in a way that FRBR-MM does not. It defines separate entities for work, expression, manifestation and item, where MARC21 has only a single entity. FRBR-DM also defines relationships that can be created between specific entities. Without actual entities some relationships between the entities may be implicit in the catalog data, but only in a very vague way. A main entry author field in a MARC21 record has no explicit relationship to the work concept inherent in the bibliographic description, but many people's mental model would associate the title and the author as being a kind of statement about the work being described. Added entries may describe related works but they do not link to those works.

The FRBR-DM model was not imposed on the RDA rules, which were intended to be neutral as to the data formats that would carry the bibliographic description. However, RDA was designed to support the FRBR-DM by allowing for individual entity descriptions with their own identifiers and for there to be identified relationships between those entities. FRBR-DM proposes the creation of a work entity that can be shared throughout the bibliographic universe where that work is referenced. The same is true for all of the FRBR entities. Because each entity has an identified existence, it is possible to create relationships between entities; the same relationships that are defined in the FRBR report, and more if desired. FRBR-DM, however, is not supported by the MARC21 model because MARC21 does not have a structure that would permit the creation of separately identified entities for the FRBR entities. FRBR-DM does have an expression as a data model in the RDA Registry. In the registry, RDA is defined as an RDF vocabulary in parallel with the named elements in the RDA rule set, with each element associated with the FRBR entity that defines it in the RDA text. This expression, however, so far has only one experimental system implementation in RIMMF. As far as I know, no libraries are yet using this as a cataloging system.

The replacement proposed by the Library of Congress for the MARC21 record, BIBFRAME, makes use of entities and relations similar to those defined in FRBR, but does not follow FRBR to the letter. The extent to which it was informed by FRBR is unclear but FRBR was in existence when BIBFRAME was developed. Many of the entities defined by FRBR are obvious, however, and would be arrived at by any independent analysis of bibliographic data: persons, corporate bodies, physical descriptions, subjects. How BIBFRAME fits into the FRBR-MM or the FRBR-DM isn't clear to me and I won't attempt to find a place for it in this current analysis. I will say that using an entity-relation model and promoting relationships between those entities is a mainstream approach to data, and would most likely be the model in any modern bibliographic data design.

MARC v RDF?

The decision we are facing in terms of bibliographic data is often couched in terms of "MARC vs. RDF"; however, that is not the actual question that underlies that decision. Instead, the question should be couched as: entities and relations, or not? If you want to share entities like works and persons, and if you want to create actual relationships between bibliographic entities, something other than MARC21 is required. What that "something" is should be an open question, but it will not be a "unit record" like MARC21.

For those who embrace the entity-relation model, the perceived "rush to RDF" is not entirely illogical; RDF is the current technology that supports entity-relation models. RDF is supported by a growing number of open source tools, including database management and indexing. It is a World Wide Web Consortium (W3C) standard, and is quickly becoming a mainstream technology used by communities like banking, medicine, and academic and government data providers. It also has its down sides: there is no obvious support in the current version of RDF for units of data that could be called "records" - RDF only recognizes open graphs; RDF is bad at retaining the order of data elements, something that bibliographic data often relies upon. These "faults" and others are well known to the W3C groups that continue to develop the standard and some are currently under development as additions to the standard.
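
The point about order can be illustrated with a small sketch in plain Python. It models a graph as an unordered set of (subject, predicate, object) triples, which is an assumption made for illustration and not a use of any RDF library. The `ex:` names are invented. The `rdf:_1`, `rdf:_2` membership properties are the ones RDF defines for containers, used here to show that order has to be stated explicitly as data if it is to survive.

```python
def graph(*triples):
    """Model a graph as an unordered set of (subject, predicate, object) triples."""
    return frozenset(triples)


# The same two statements, written down in a different order.
g1 = graph(("ex:book1", "ex:subject", "Evolution"),
           ("ex:book1", "ex:subject", "Natural selection"))
g2 = graph(("ex:book1", "ex:subject", "Natural selection"),
           ("ex:book1", "ex:subject", "Evolution"))

# As graphs they are identical: the order in which they were written is gone.
same_graph = g1 == g2


def ordered_values(g, container, prefix="rdf:_"):
    """Recover an explicit order from numbered membership properties."""
    members = [(int(p[len(prefix):]), o)
               for s, p, o in g
               if s == container and p.startswith(prefix)]
    return [o for _, o in sorted(members)]


# Order survives only when it is recorded as part of the data.
g3 = graph(("ex:subjects1", "rdf:_2", "Natural selection"),
           ("ex:subjects1", "rdf:_1", "Evolution"))
ordered = ordered_values(g3, "ex:subjects1")
```

A MARC record gets its order for free from the sequence of fields and subfields. In a graph, the cataloger's chosen order must be carried by extra statements like the numbered ones above.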

At the same time, leaping directly to a particular solution is bad form. Data development usually begins with a gathering of use cases and requirements, and technology is developed to meet the gathered requirements. If it is desired to take advantage of some or all of the entity-relation capabilities of FRBR, the decision about the appropriate replacement for MARC21 should be based on a needs analysis. I recall seeing some use cases in the early BIBFRAME work, but I also recall that they seemed inadequate. What needs to be addressed is the extent to which we expect library catalogs to make use of bibliographic relationships, and whether those relationships must be bound to specific entities.

What we could gain by developing use cases would be a shared set of expectations that could be weighed against proposed solutions. Some of the aspects of what catalogers like about MARC may feed into those requirements, as well as what we wish for in the design of the future catalog. Once the set of requirements is reasonably complete, we have a set of criteria against which to measure whether the technology development is meeting the needs of everyone involved with library data.

Conclusion: It's the Relationships

The disruptive aspect of FRBR is not primarily that it creates a multi-level bibliographic model between works, expressions, manifestations, and items. The disruption is in the definition of relationships between and among those entities, which requires those entities to be separately identified. Even the desire to share work and expression descriptions separately can most likely be met by identifying the pertinent data elements within a unit record. But the bibliographic relationships defined in FRBR and RDA, if they are to be actionable, require a new data structure.

The relationships are included in RDA but are not implemented in RDA in MARC21, basically because they cannot be implemented in a "unit record" data format. The key question is whether those relationships (or others) are intended to be included in future library catalogs. If they are, then a data format other than MARC21 must be developed. That data format may or may not implement FRBR-defined bibliographic relationships; FRBR was a first attempt to redefine a long-standing bibliographic model and should be considered the first, not the last, word in bibliographic relationships.

If we couch the question in terms of bibliographic relationships, not warring data formats, we begin to have a way to go beyond emotional attachments and do a reasoned analysis of our needs.

Wednesday, April 12, 2017

If It Ain't Broke

For the first time in over forty years there is serious talk of a new metadata format for library bibliographic data. This is an important moment.

There is not, however, a consensus within the profession on the need to replace the long-standing MARC record format with something different. A common reply to the suggestion that library data creation needs a new data schema is the phrase: "If it ain't broke, don't fix it." This is more likely to be uttered by members of the cataloging community - those who create the bibliographic data that makes up library catalogs - than by those whose jobs entail systems design and maintenance. It is worth taking a good look at the relationship that catalogers have with the MARC format, since their view is informed by decades of daily encounters with a screen of MARC encoding.

Why This Matters

When the MARC format was developed, its purpose was clear: it needed to provide the data that would be printed on catalog cards produced by the Library of Congress. Those cards had been printed for over six decades, so there was no lack of examples to use to define the desired outcome. In ways unimagined at the time, MARC would change, nay, expand the role of shared cataloging, and would provide the first online template for cataloging.

Today work is being done on the post-MARC data schema. However, how the proposed new schema might change the daily work of catalogers is unclear. There is some anxiety in the cataloging community about this, and it is understandable. What I unfortunately see is a growing distrust of this development on the part of the data creators in our profession. It has not been made clear what their role is in the development of the next "MARC," not even whether their needs are a driving force in that development. Surely a new model cannot be successful without the consideration (or even better, the participation) of the people who will spend their days using the new data model to create the library's data.

(An even larger question is the future of the catalog itself, but I hardly know where to begin on that one.)

If it Ain't Broke...

The push-back against proposed post-MARC data formats is often seen as a blanket rejection of change. Undoubtedly this is at times the case. However, given that there have now been multiple generations of catalogers who worked and continue to work with the MARC record, we must assume that the members of the cataloging community have in-depth knowledge of how that format serves the cataloging function. We should tap that knowledge as a way to understand the functionality in MARC that has had a positive impact on cataloging for four decades, and should study how that functionality could be carried forward into the future bibliographic metadata schema.

I asked on Twitter for input on what catalogers like about MARC, and received some replies. I also viewed a small number of presentations by catalogers, primarily those about proposed replacements for MARC. From these I gathered the following list of "what catalogers like about MARC." I present these without comment or debate. I do not agree with all of the statements here, but that is no matter; the purpose here is to reflect cataloger perspectives.

(Note: This list is undoubtedly incomplete and I welcome comments or emails with your suggestions for additions or changes.)

What Catalogers Like/Love About MARC

There is resistance to moving away from using the MARC record for cataloging among some in the Anglo-American cataloging community. That community has been creating cataloging data in the MARC formats for forty years. For these librarians, MARC has many positive qualities, and these are qualities that are not perceived to exist in the proposals for linked data. (Throughout the sections below, read "library cataloging" and variants as referring to the Anglo-American cataloging tradition that uses the MARC format and the Anglo-American Cataloging Rules and its newer forms.)

MARC is Familiar

Library cataloging makes use of a very complex set of rules that determine how a resource is described. Once the decisions are made regarding the content of the description, those results are coded in MARC. Because the creation of the catalog record has been done in the MARC format since the late 1970's, working catalogers today have known only MARC as the bibliographic record format and the cataloging interface. Catalogers speak in "MARC" - using the tags to name data elements - e.g. "245" instead of "title proper".

MARC is WYSIWYG

Those who work with MARC consider it to be "human readable." Most of the description is text, therefore what the cataloger creates is exactly what will appear on the screen in the library catalog. If a cataloger types "ill." that is what will display; if the cataloger instead types "illustrations" then that is what will display. In terms of viewing a MARC record on a screen, some cataloger displays show the tags and codes to one side, and the text of those elements is clearly readable as text.

MARC Gives Catalogers Control

The coding is visible, and therefore what the cataloger creates on the screen is virtually identical to the machine-readable record that is being created. Everything that will be shown in the catalog is in the record (with the exception of cover art, at least in some catalogs). The MARC rules say that the order of fields and subfields in the record is the order in which that information should be displayed in the catalog. Some systems violate this by putting the fields in numeric order, but the order of subfields is generally maintained. Catalogers wish to control the order of display and are frustrated when they cannot. In general, changing anything about the record with automated procedures can un-do the decisions made by catalogers as part of their work, and is a cause of frustration for catalogers.

MARC is International

MARC is used internationally, and because the record uses numerics and alphanumeric codes, a record created in another country is readable to other MARC users. Note that this was also the purpose of the International Standard Bibliographic Description (ISBD), which instead of tags uses punctuation marks to delimit elements of the bibliographic description. If a cataloger sees this, but cannot read the text:

245 02 |a לטוס עם עין אחת / |c דני בז.

it is still clear that this is a title field with a main title (no subtitle), followed by a statement of the author's name as provided on the title page of the book.
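
The same reading can be done mechanically. The following is a minimal sketch in Python that splits the display line above into tag, indicators, and subfields. It assumes the display convention used in the example (a space after the tag and after the indicators, and "|" introducing each subfield code). It does not read the actual MARC exchange format, and the function name `parse_field` is invented.

```python
def parse_field(line):
    """Split a displayed MARC field into tag, indicators and subfields.

    Assumes the on-screen convention used in the example above,
    not the binary MARC exchange format.
    """
    tag, indicators, rest = line.split(" ", 2)
    subfields = []
    for chunk in rest.split("|")[1:]:        # text before the first "|" is empty
        code, value = chunk[0], chunk[1:].strip()
        subfields.append((code, value))
    return {
        "tag": tag,
        "ind1": indicators[0],
        "ind2": indicators[1],
        "subfields": subfields,
    }


line = "245 02 |a לטוס עם עין אחת / |c דני בז."
parsed = parse_field(line)
codes = [code for code, _ in parsed["subfields"]]
```

Nothing in the function depends on being able to read the text: the tag "245" and the subfield codes "a" and "c" carry the structure, which is what makes the record usable across languages.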

MARC is the Lingua Franca of Cataloging

This is probably the key point that comprises all of the above, but it is important to state it as such. This means that the entire workflow, the training materials, the documentation - all use MARC. Catalogers today think in MARC and communicate in MARC. This also means that MARC defines the library cataloging community in the way that a dialect defines the local residents of a region. There is pride in its "library-ness". It is also seen as expressing the Anglo-American cataloging tradition.

MARC is Concise

MARC is concise as a physical format (something that is less important today than it was in the 1960s when MARC was developed), and it is also concise on the screen. "245" represents "title proper"; "240" represents "uniform title"; "130" represents "uniform title main entry". Often an entire record can be viewed on a single screen, and the tags and subfield codes take up very little display space.

MARC is Very Detailed

MARC21 has about 200 tags currently defined, and each of these can have up to 36 subfields. There are about 2000 subfields defined in MARC21, although the distribution is uneven and depends on the semantics of the field; some fields have only a handful of subfields, and in others there are few codes remaining that could be assigned.

MARC is Flat

The MARC record is fairly flat, with only two levels of coding: field and subfield. This is a simple model that is easy to understand and easy to visualize.

MARC is Extensible

Throughout its history, the MARC record has been extended by adding new fields and subfields. There are about 200 defined fields which means that there is room to add approximately 600 more.

MARC has Mnemonics

Some coding is either consistent or mnemonic, which makes it easier for catalogers to remember the meaning of the codes. There are code blocks that refer to cataloging categories, such as the title block (2XX), the notes block (5XX) and the subject block (6XX). Some subfields have been reserved for particular functions, such as the use of the numeric subfields in 0-8. In other cases, the mnemonic is used in certain contexts, such as the use of subfield "v" for the volume information of series. In other fields, the "v" may be used for something else, such as the "form" subfield in subject fields, but the context makes it clear.

There are also field mnemonics. For example, all tagged fields that have "00" in the second and third places are personal name fields. All fields and subfields that use the number 9 are locally defined (with a few well-known exceptions).
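
These conventions are regular enough to express as a few rules. The sketch below, in Python, encodes only the mnemonics named in this section. The function name `describe_tag` is invented, and the exception list holds a single example (490, the series statement field). It is not a complete list of exceptions and not a full account of MARC21 tag semantics.

```python
# Blocks named in the text above; other blocks are left out on purpose.
BLOCKS = {"2": "title block", "5": "notes block", "6": "subject block"}

# One example of a defined field containing a 9; not a complete list.
NOT_LOCAL = {"490"}


def describe_tag(tag):
    """Return the mnemonic readings of a three-digit tag, per the rules above."""
    notes = []
    if tag[0] in BLOCKS:
        notes.append(BLOCKS[tag[0]])
    if tag[1:] == "00":
        notes.append("personal name")
    if "9" in tag and tag not in NOT_LOCAL:
        notes.append("locally defined")
    return notes


title = describe_tag("245")
subject_person = describe_tag("600")
local_note = describe_tag("590")
series = describe_tag("490")
```

Under these rules "600" reads as both a subject field and a personal name field, and "590" as a locally defined note, which is how a cataloger decodes the tags at a glance.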

MARC is Finite and Authoritative

MARC defines a record that is bounded. What you see in the record is all of the information that is being provided about the item being described. The concept of "infinite graphs" is hard to grasp, and hard to display on a screen. This also means that MARC is an authoritative statement of the library bibliographic description, whereas graphs may lead users to sources that are not approved by or compatible with the library view.

Thursday, April 06, 2017

Precipitating Forward

Our Legacy, Our Mistake

If you follow the effort taking place around the proposed new bibliographic data standard, BIBFRAME, you may have noticed that much of what is being done with BIBFRAME today begins with our current data in MARC format and converts it to BIBFRAME. While this is a function that will be needed should libraries move to a new data format, basing our development on how our legacy data converts is not the best way to move forward. In fact, it doesn't really tell us what "forward" might look like if we give it a chance.

We cannot define our future by looking only at our past. There are some particular aspects of our legacy data that make this especially true.

I have said before (video, article) that we made a mistake when we went from printing cards using data encoded in MARC, to using MARC in online catalogs. The mistake was that we continued to use the same data that had been well-adapted to card catalogs without making the changes that would have made it well-adapted to computer catalogs. We never developed data that would be efficient in a database design or compatible with database technology. We never really moved from textual description to machine-actionable data points. Note especially that computer catalogs fail to make use of assigned headings as they are intended, yet catalogers continue to assign them at significant cost.

One of the big problems in our legacy data that makes it hard to take advantage of computing technology is that the data tends to be quirky. Technology developers complain that the data is full of errors (as do catalogers), but in fact it is very hard to define, algorithmically, what is an error in our data. The fact is that the creation of the data is not governed by machine rules; instead, decisions are made by humans with a large degree of freedom. Some fields are even defined as being either this or that, something that is never the case in a data design. A few fields are considered required, although we've all seen records that don't have those required fields. Many fields are repeatable and the order of fields and subfields is left to the cataloger, and can vary.

The cataloger view is of a record of marked-up text. Computer systems can do little with text other than submit it for keyword indexing and display it on the screen. Technical designers look to the fixed fields for precise data points that they can operate on, but these are poorly supported and are often not included in the records since they don't look like "cataloging" as it is defined in libraries. These coded data elements are not defined by the cataloging code, either, and can be seen as mere "add-ons" that come with the MARC record format. The worst of it is that they are almost uniformly redundant with the textual data yet must be filled in separately, an extra step in the cataloging process that some cannot afford.

The upshot of this is that it is very hard to operate over library catalog data algorithmically. It is also very difficult to do any efficient machine validation to enforce consistency in the data. If we carry that same data and those same practices over to a different metadata schema, it will still be very hard to operate over algorithmically, and it will still be hard to do quality control as a function of data creation.

The counter argument to this is that cataloging is not a rote exercise - that catalogers must make complex decisions that could not be done by machines. If cataloging were subject to the kinds of data entry rules that are used in banking and medical and other modern systems, then the creativity of the cataloger's work would be lost, and the skill level of cataloging would drop to mere data entry.

This is the same argument you could use for any artisanal activity. If we industrialize the act of making shoes, the skills of the master shoe-maker are lost. However, if we do not industrialize shoe production, only a very small number of people will be able to afford to wear shoes.

This decision is a hard one, and I sympathize with the catalogers who are very proud of their understanding of the complexity of the bibliographic world. We need people who understand that complexity. Yet increasingly we are not able to afford to support the kind of cataloging practices of which we are proud. Ideally, we would find a way to channel those skills into a more efficient workflow.

There is a story that I tell often: In the very early days of the MARC record, around the mid-1970's, many librarians thought that we could never have a "computer catalog" because most of our cataloging existed only on cards, and we could NEVER go back and convert the card catalogs, retype every card into MARC. At that same time, large libraries in the University of California system were running over 100,000-150,000 cards behind in their filing. For those of you who never filed cards... it was horribly labor intensive. Falling 150,000 cards behind meant that a book was on the shelf THREE MONTHS before the cards were in the catalog. Some of this was the "fault" of OCLC which was making it almost too easy to create those cards. Another factor was a great increase in publishing that was itself facilitated by word processing and computer-driven typography. Within less than a decade it became more economical to go through the process of conversion from printed cards to online catalogs than to continue to maintain enormous card catalogs. And the rest is history. MARC, via OCLC, created a filing crisis, and in a sense it was the cost of filing that killed the card catalog, not the thrill of the modern online catalog.

The terrible mistake that we made back then was that we did not think about what was different between the card catalog and the online catalog, and we did not adjust our data creation accordingly. We carried the legacy data into the new format which was a disservice to both catalogers and catalog users. We missed an opportunity to provide new discovery options and more efficient data creation.

We mustn't make this same mistake again.

The Precipitant

Above I said that libraries made the move into computer-based catalogs because it was uneconomical to maintain the card catalog. I don't know what the precipitant will be for our current catalog model, but there are some rather obvious places to look to for that straw that will break the MARC/ILS back. These problems will probably manifest themselves as costs that require the library to find a more efficient and less costly solution. Here are some of the problems that I see today that might be factors that require change:

Output rates of intellectual and cultural products are increasing. Libraries have already responded to this through shared cataloging and purchase of cataloging from product vendors. However, the records produced in this way are then loaded into thousands of individual catalogs in the MARC-using community.
Those records are often edited for correctness and enhanced. Thus they are costing individual libraries a large amount of money, potentially as much or more than libraries save by receiving the catalog copy.
Each library must pay for a vendor system that can ingest MARC records, facilitate cataloging, and provide full catalog user (patron) support for searching and display.
"Sharing" in today's environment means exporting data and sending it as a file. Since MARC records can only be shared as whole records, updates and changes generally are done as a "full record replace" which requires a fair amount of cycles.
The "raw" MARC record as such is not database friendly, so records must be greatly massaged in order to store them in databases and provide indexing and displays. Another way to say this is that there are no database technologies that know about the MARC record format. There are database technologies that natively accept and manage other data formats, such as key-value pairs.

There are some current technologies that might provide solutions:

Open source. There is already use of open source technology in some library projects. Moving more toward open source would be facilitated by moving away from a library-centric data standard and using at least a data structure that is commonly deployed in the information technology world. Some of this advantage has already been obtained by using MARCXML.
The cloud. The repeated storing of the same data in thousands of catalogs means not being able to take advantage of true sharing. In a cloud solution, records would be stored once (or in a small number of mirrors), and a record enhancement would enhance the data for each participant without being downloaded to a separate system. This is similar to what is being proposed by OCLC's WorldShare and Ex Libris' Alma, although presumably those are "starter" applications. Use of the cloud for storage might also mean less churning of data in local databases; it could mean that systems could be smaller and more agile.
NoSQL databases and triple stores. The current batch of databases are open source, fast, and can natively process data in a variety of formats (although not MARC). Data does not have to be "pre-massaged" in order to be stored in a database or retrieved and the database technology and the data technology are in sync. This makes deployment of systems easier and faster. There are NoSQL database technologies for RDF. Another data format that has dedicated database technology is XML, although that ship may have sailed by now.
The web. The web itself is a powerful technology that retrieves distributed data at astonishing rates. There are potential cost/time savings on any function that can be pushed out to the web to make use of its infrastructure.

The change from MARC to ?? will come and it will be forced upon us through technology and economics. We can jump to a new technology blindly, in a panic, or we can plan ahead. Duh.

Monday, February 13, 2017

Miseducation

There's a fascinating video created by the Southern Poverty Law Center (in January 2017) that focuses on Google but is equally relevant to libraries. It is called The Miseducation of Dylann Roof.

In this video, the speaker shows that by searching on "black on white violence" in Google the top items are all from racist sites. Each of these links only to other racist sites. The speaker claims that Google's algorithms will favor similar sites to ones that a user has visited from a Google search, and that eventually, in this case, the user's online searching will be skewed toward sites that are racist in nature. The claim is that this is what happened to Dylann Roof, the man who killed 9 people at an historic African-American church - he entered a closed information system that consisted only of racist sites. It ends by saying: "It's a fundamental problem that Google must address if it is truly going to be the world's library."

I'm not going to defend or deny the claims of the video, and you should watch it yourself because I'm not giving a full exposition of its premise here (and it is short and very interesting). But I do want to question whether Google is or could be "the world's library", and also whether libraries do a sufficient job of presenting users with a well-rounded information space.

It's fairly easy to dismiss the first premise - that Google is or should be seen as a library. Google is operating in a significantly different information ecosystem from libraries. While there is some overlap between Google and library collections, primarily because Google now partners with publishers to index some books, there is much that is on the Internet that is not in libraries, and a significant amount that is in libraries but not available online. Libraries pride themselves on providing quality information, but we can't really take the lion's share of the credit for that; the primary gatekeepers are the publishers from whom we purchase the items in our collections. In terms of content, most libraries are pretty staid, collecting only from mainstream publishers.

I decided to test this out and went looking for works promoting Holocaust denial or Creationism in a non-random group of libraries. I was able to find numerous books about deniers and denial, but only research libraries seem to carry the books by the deniers themselves. None of these come from mainstream publishing houses. I note that the subject heading, Holocaust denial literature, is applied to both those items written from the denial point of view, as well as ones analyzing or debating that view.

Creationism gets a bit more visibility; I was able to find some creationist works in public libraries in the Bible Belt. Again, there is a single subject heading, Creationism, that covers both the pro- and the con-. Finding pro- works in WorldCat is a kind of "needle in a haystack" exercise.

Don't dwell too much on my findings - this is purely anecdotal, although a true study would be fascinating. We know that libraries to some extent reflect their local cultures, such as the presence of the Gay and Lesbian Archives at the San Francisco Public Library. But you often hear that libraries "cover all points of view," which is not really true.

The common statement about libraries is that we gather materials on all sides of an issue. Another statement is that users will discover them because they will reside near each other on the library shelves. Is this true? Is this adequate? Does this guarantee that library users will encounter a full range of thoughts and facts on an issue?

First, just because the library has more than one book on a topic does not guarantee that a user will choose to engage with multiple sources. There are people who seek out everything they can find on a topic, but as we know from the general statistics on reading habits, many people will not read voraciously on a topic. So the fact that the library has multiple items with different points of view doesn't mean that the user reads all of those points of view.

Second, there can be a big difference between what the library holds and what a user finds on the shelf. Many public libraries have a high rate of circulation of a large part of their collection, and some books have such long holds lists that they may not hit the shelf for months or longer. I have no way to predict what a user would find on the shelf in a library that had an equal number of books expounding the science of evolution vs those promoting the biblical concept of creation, but it is frightening to think that what a person learns will be the result of some random library bookshelf.

But the third point is really the key one: libraries do not cover all points of view, if by points of view you include the kind of mis-information that is described in the SPLC video. There are many points of view that are not available from mainstream publishers, and there are many points of view that are not considered appropriate for anything but serious study. A researcher looking into race relations in the United States today would find the sites that attracted Roof to provide important insights, as SPLC did, but you will not find that same information in a "reading" library.

Libraries have an idea of "appropriate" that they share with the publishing community. We are both scientific and moral gatekeepers, whether we want to admit it or not. Google is an algorithm functioning over an uncontrolled and uncontrollable number of conversations. Although Google pretends that its algorithm is neutral, we know that it is not. On Amazon, which does accept self-published and alternative press books, certain content like pornography is consciously kept away from promotions and best seller lists. Google has "tweaked" its algorithms to remove Holocaust denial literature from view in some European countries that forbid the topic. The video essentially says that Google should make wide-ranging cultural, scientific and moral judgments about the content it indexes.

I am of two minds about the idea of letting Google or Amazon be a gatekeeper. On the one hand, immersing a Dylann Roof in an online racist community is a terrible thing, and we see the result (although the cause and effect may be hard to prove as strongly as the video shows). On the other hand, letting Google and Amazon decide what is and what is not appropriate does not sit well at all. As I've said before, having gatekeepers whose motivations are trade secrets that cannot be discussed is quite dangerous.

There has been a lot of discussion lately about libraries and their supposed neutrality. I am very glad that we can have that discussion. With all of the current hoopla about fake news, Russian hackers, and the use of social media to target and change opinion, we should embrace the fact of our collection policies, and admit widely that we and others have thought carefully about the content of the library. It won't be the most radical in many cases, but we care about veracity, and that's something that Google cannot say.