Thursday, April 13, 2023

What is Controlled Digital Lending? The Origin Story

 The bulk of the reporting on the lawsuit between publishers (four of them, led by Hachette) and the Internet Archive's version of Controlled Digital Lending hits one of these points of view:

  •  The publishers are evil, money-grubbing idiots going after the generous, saintly Internet Archive
  •  The Internet Archive is evil, stealing from the poor publishers and even poorer writers

As is so often the case, it really is more complex than that. I will try to throw a bit of clarity into the mix here, mainly by talking about some of the realities of library service in the 21st century, and the origins of controlled digital lending.

The Origins of CDL

Michelle M. Wu, a law librarian and law professor, wrote a piece for the Law Library Journal in 2011 explaining the dilemmas faced by law libraries and proposing a modest solution.

"Building a Collaborative Digital Collection: A Necessary Evolution in Libraries" (LAW LIBRARY JOURNAL Vol. 103:4 [2011-34]) (online)

The solution is what became Controlled Digital Lending. The reasons she lays out are the key.

The main argument that Wu puts forth (and that I find convincing) is this: library users either want or actually need to be able to access materials remotely, which means in electronic formats over a network. Increasingly, materials that libraries wish to provide are available from publishers in those electronic formats. The catch, however, is that libraries are not able to own materials in electronic formats, but instead can only subscribe to access services. It is this lack of ownership that is the rub. If a library loses its digital subscription for some reason, such as no longer being able to afford it, it not only loses access to future materials, it loses access to all of the past materials that were included in that subscription. This puts libraries in the terrible position of having to decide between fulfilling their role as the reliable repository and archive of material in their subject area and serving the needs of library users. As Wu points out, libraries are already struggling to afford the materials that they feel they should be collecting, so purchasing these materials both in hard copy for archival purposes and also in digital form for user service is entirely out of reach.

What Wu suggests in her article is a variation on Inter-Library Loan, combined with a library collective purchasing plan. A cooperative group of law libraries would combine purchasing physical resources for those items that are rarely used but that should be available to the researchers who need them. This is not a revolutionary idea - library consortia have been making use of this kind of approach for a significant amount of time. The difference in Wu's plan is that as items are requested from the consortial holdings, they will be digitized and the digital format will be the one loaned. To stay within the intention of copyright law, in particular First Sale, Wu offers that the digital file will be loaned as a surrogate for the physical copy:

"Materials acquired would be digitized, and only the number of copies acquired in print for each subsequently digitized document would circulate at any given time. The print copy would be stored for archival purposes; only the digital copy would “circulate.”" Wu p.535
The physical resource would fulfill the need for an archival copy, and the digital resource would allow lending to any networked member of the cooperative group. Her solution assumes an effective digital rights management system that would make the loan a loan and not a pirate-able copy.
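
To make the "only the number of copies acquired in print" rule concrete, here is a minimal sketch in Python. It is an illustration only, not Wu's or anyone's actual system: the Title class, its fields, and the user names are all hypothetical, and the digital rights management that would enforce the loan on the reader's device is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Title:
    """One work held by a (hypothetical) consortium."""
    name: str
    print_copies_owned: int  # archival print copies; these never circulate
    borrowers: set = field(default_factory=set)  # users holding a digital loan

    def checkout(self, user: str) -> bool:
        """Lend a digital copy only while loans stay within copies owned."""
        if user in self.borrowers:
            return True  # already on loan to this user; no extra copy used
        if len(self.borrowers) >= self.print_copies_owned:
            return False  # every owned copy is already represented by a loan
        self.borrowers.add(user)
        return True

    def checkin(self, user: str) -> None:
        """End the loan, freeing one copy for the next reader."""
        self.borrowers.discard(user)


treatise = Title("Rarely used treatise", print_copies_owned=2)
results = [treatise.checkout(u) for u in ("ana", "ben", "cy")]
treatise.checkin("ana")
after_return = treatise.checkout("cy")
print(results, after_return)
```

With two print copies owned, the third request is refused until one of the first two loans is returned. That refusal is the whole of the "control" in this sketch.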

Wu carefully covers all of the potential legal objections, and points out the various areas of US copyright law that might be touched on with her proposal. Specific areas are First Sale, Fair Use, and the various exceptions to the copyright law that are applied to libraries. She defends the digitization as format-shifting, not unlike the format-shifting that is done for sound recordings as the technology for that medium has changed.
"It is the work itself that is copyrighted, not the form." Wu p. 541
She also addresses what would be an obvious objection of rights holders, that the digital copy is substituting for a purchase. The hard copy would be purchased by the consortium, and given her statement that these would primarily be low-use materials that many libraries would not themselves purchase, no harm would be done to the market, which would be limited in any case.

The argument that I find strongest is that of preservation: the US copyright law does allow libraries to make copies of works for the purposes of preservation if no equivalent copy is available for purchase. (Section 108 subsection c) Using the argument that the purpose of the library is to preserve as well as to make works available, Wu says:
"In cases where a digital version is available only for license, a library could argue that such a license is not equivalent to either the print copy or a digital copy they would make, because both of these items would be owned by the library and the licensed digital version would not." Wu p. 539

Context Counts

Controlled Digital Lending is the technology: the digitization of paper works and the lending of the digital copy using management software that prevents piracy of the digital file.  It is the context that makes Wu's proposal different to the implementation of controlled digital lending at the Internet Archive.
  • Wu's proposal was for a consortium of law libraries serving their own users; the IA's implementation was open to anyone on the web
  • Wu's proposal was for academic materials of low use; the IA's included popular works
  • The works in Wu's proposal would have been selected with specific research purposes in mind; the IA's collection was an opportunistic group of books that they had often obtained as second-hand - therefore no research purpose could be argued. (Purpose is one of the Fair Use factors.)
  • Wu's proposal argued for the need of libraries to preserve materials that otherwise would not be preserved; the Archive is indeed an archive, known for its preservation of web sites that otherwise would be lost. However, the popular books named in the lawsuit against the Archive are already "preserved" in thousands of libraries who have those physical books on their shelves - the preservation argument is not easily supplied. 

HathiTrust, which is a consortium of libraries that originally contributed to the Google Books project, is an example that follows Wu's approach. HathiTrust  stores digitized copies of books and follows the ruling related to Google Books that searching of in-copyright works is permitted, but not reading. HathiTrust developed its own controlled digital lending service as an emergency service for when a member library is temporarily closed due to a disaster. In that case, users from the member library can borrow digitized books held by the member library in hard copy.

Wu even suggests that libraries might share the burden of digitization by providing digitized copies to libraries that own the books in hard copy. This latter, though, was one of the things that got the Internet Archive in trouble because it became a "digital lending broker" for other libraries, adding their hard copy count to the Archive's lending "units" including some of the books owned by the publishers in the lawsuit.

The Upshot

The argument presented by Wu is quite strong and is justified through her careful reading of copyright law, in particular as that law applies to libraries. The extension of her proposal to popular reading materials and to an unlimited user base changes everything. Libraries do have specific collections, identified users, and stated purposes that guide their acquisitions. Something I feel strongly about - that is absent from so many modern information activities - is that effective information use requires purposeful resource selection and organization. Any mass of stored resources is only as valuable as its organization and coherence. In some cases, a "less" that is well organized can be more informative than a "more" that may lack the key works in a subject area. It may be old-fashioned on my part, but I adhere to the concept of defined user goals and the deliberate collection of specific works in support of user learning. This is what I read in Wu's work but which I do not see in the Archive's activities.

I think Wu's context could be understood as falling within the confines of copyright law. I'm not sure that the Archive's case does. I do hope that this current lawsuit does not result in a rejection of digitization for lending for all libraries.

Friday, April 07, 2023

Libraries, the law, and equality

 


In the spirit of "everyone is equal under the law", it is equally illegal for both a starving man and a billionaire to steal a loaf of bread. Or to copy a book.

 Libraries for the People

It was not all that long ago when "library" often referred to the room in a rich man's home where he stored books that were only available to him, and perhaps members of his family (especially if they were not female). Other libraries, usually larger ones, were attached to prestigious educational institutions and accessible by people worthy of that prestige (which would not include non-white nor female people). We are fortunate  today that we have these things called "public libraries," libraries that serve everyone regardless of their wealth, their race, or their gender.

Here's the catch: public libraries are generally small and modestly funded by the local community. A moderately sized public library has 50,000 - 100,000 volumes. A large public library may have up to 500,000 volumes. A large university library has many more. Harvard University library claims to have 20 million book volumes, 400 million manuscripts, and 10 million photographs. Stanford University library may have at least 12 million book volumes. Michigan State University libraries have about 7 million book volumes. The British Museum Library lays claim to between 170 and 200 million items of which 13.5 million are printed books and e-books. There is no question that the members of our community who are served solely by public libraries, while they have unprecedented access to books, are not able to study the full range of printed knowledge of our world. To wit, the university libraries are often referred to as "research libraries" while the local public libraries are called "reading libraries." This separates us into "readers" and "researchers," and while you might conclude that any literate person can read, only those associated with large libraries will be able to avail themselves of the tools to do research.

Digital Access

Much of the research done in academe consumes and creates journal articles. Originally issued only in paper, and mailed to libraries and departments, journal articles have been available in digital form from the mid-1990s and today it would be unusual for an article-based publication to be issued only in paper. Journal owners have digitized the full run of publications, as have cooperative projects based in academia. A researcher or student at a Western university is likely to have more than a century of scientific, technical or social science academic article output available through the Internet, any day, any time, and perhaps from any place. Anyone from less wealthy nations will have less access, although perhaps just a tad more than they had when the articles were issued only in paper.

The story is different with books. While most academic articles have been converted to digital form, the same cannot be said of books. It is only recently that publishers have issued their books in electronic form using the electronic files that are now part and parcel of the publishing process. That only takes care of current publications, however. Sitting in libraries are centuries of one-off publications in book form. Books from this vast backlog must be digitized from the existing physical copy.  Projects by libraries and educational institutions to digitize the monographic backlog, similar to those that succeeded in digitizing the journal output of the ages, have not been accomplished. There are various reasons why that is the case: the sheer number of book pages that would need to be digitized is huge; non-destructive digitization of bound volumes is difficult and often does not yield good results; partnering with publishers for this task is hampered by the fact that numerous books from the 20th century and older are "orphaned," meaning that although they may be under copyright their copyright holder cannot be found; and compared to modern ebooks, digitized books have little to recommend them for reading, although with their searchable text they may be useful for research.

The only efforts to digitize the backlog of books, Google Book Search and the digitizing by the Internet Archive, have resulted in lawsuits against those organizations. The suit against Google concluded that digitization is allowed as long as the digitized books are provided for purposes of searching but not reading. The Internet Archive took the view that books are for reading, an approach that I find hard to oppose.

Reading vs Research

Reading and research are related but different activities. Reading is often associated with books, and includes books on scientific and academic topics as well as fiction, from great literature to beach reads. While few non-researchers read academic articles, some members of academe do read books as part of their research. Of course, many people also read for pleasure; reading is a key means of acquiring culture, alongside other activities like taking in performances of various arts.

If you are not at one of those institutions with a large research library, the only way you may have to see the content of many books is by accessing a digitized book. A digitized book is not the same reading experience as the ebook produced by publishers. A digitized book has not been produced from an electronic file of its contents as an ebook has been. Instead, each page of the physical book has been photographed, and those images have been analyzed using optical character recognition (OCR) software. The result of the OCR is a text file, and that file will be more or less "lossy" depending on things like the condition of the original book pages, the clarity of the font, the language of the text. 
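
One rough way to picture how "lossy" an OCR text file is: compare it with the text as actually printed and see how much of it survives. The sketch below does that with Python's standard difflib module. The sample strings are made up for illustration, and the similarity ratio is only a stand-in for the error measures that real digitization projects use.

```python
from difflib import SequenceMatcher


def ocr_similarity(page_text: str, ocr_text: str) -> float:
    """Return a similarity ratio from 0 to 1 between printed text and OCR output."""
    return SequenceMatcher(None, page_text, ocr_text).ratio()


# Invented examples: a clean scan versus a worn page with an unclear font.
original = "The quick brown fox"
clean_scan = "The quick brown fox"
poor_scan = "Tlie qu1ck brovvn f0x"

clean_score = ocr_similarity(original, clean_scan)
poor_score = ocr_similarity(original, poor_scan)
print(f"clean: {clean_score:.2f}  poor: {poor_score:.2f}")
```

A perfect scan scores 1.0. The poor scan scores lower, and a search for "brown" in that text file would fail even though the word is plainly visible in the page image.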

Unlike an ebook, reading the digitized book usually means viewing pictures of the book's pages.

 


It's not a great reading experience, but imagine that the book is important for your studies or your work; it would be worth the effort.

On the other hand, if you are wanting some modern leisure reading and you are in North America, you will be much better served by checking out the book and ebook offerings of your public library. If you are not in North America, and if your locality has a limited public library or no public library at all, then the extra effort that you may need to make to read a digitized book may be worth it to you. If, however, you had the funds to purchase the materials you needed or were associated with an institution that made those materials available to you, it is unlikely that you would choose the less sophisticated and less available copy provided at the Archive.

Hachette, et al., v Internet Archive

The above sets out some of the social parameters that we should consider when thinking about the recent lawsuit relating to Controlled Digital Lending. (See previous post.) In brief, the Internet Archive has digitized many books and makes them available globally, lending one "copy" at a time. A group of publishers has sued the Archive based on a set of books for which the publishers hold the copyright. The issue is often presented as a test of the concept of Controlled Digital Lending, although only some books are in question in the lawsuit. Those books represent only a portion of the books available at the Archive or in libraries in general. Although one may think of a binary division of books into "still in copyright" and "no longer in copyright" the actual situation is more complex.

  1. There are the books that are out of copyright, which generally means books from 1927 and earlier in the US. These are not under discussion. However, there is no way to separate the basic out-of-copyright content of a book, like Mark Twain's Huck Finn, from later reprintings that often add some bit of a preface so that the publisher can put a copyright notice on it and pretend to have the rights. Such "books" may be considered in copyright even though the primary content of those books is not. There is unfortunately no penalty for a publisher in slapping a copyright statement onto a book that is not under copyright, as can be seen in my favorite example of a blank journal sold with a copyright notice.
  2. There are the orphaned works, for which there is no one to assert rights. Either the rights holder (the publisher) no longer exists, or the documentation that would make it possible to assert rights does not exist. Because this is a category of unknowns, it is quite difficult to determine which books are in this category.
  3. There are works that are not orphaned but the publisher is not asserting rights in relation to Controlled Digital Lending. This may be the majority of the books being loaned by the Archive because there are only four publishers in the lawsuit. We don't know what the other publishers think about the lending.
  4. There are the books by the four publishers that are included in the lawsuit. These four publishers  are asserting that the Archive violated their rights and potentially deprived them of income.

It would be great to know the figures that would allow us to compare 1-3 with 4. It would also be great to know how many loans were actually made by the Archive of those books in the 4th category. Presumably that figure will inform the penalty that is imposed on the Archive.

The Archive's defense seems to be solid as it shows that in both the presence and the absence of its contested service no change was noted for publisher sales. It is chilling that the judge so readily dismissed the Archive's arguments, and especially chilling if you consider, as a hypothetical, applying this same argument to libraries in general.

"IA’s experts observed that print sales of the Works in Suit and general demand for library ebooks did not decrease while the Works in Suit were available on IA’s Website; that Amazon rankings for the Works in Suit improved when IA’s digital lending skyrocketed (and government lockdowns were in full effect) at the beginning of the Covid-19 pandemic; and that, despite the removal of the Works in Suit from IA’s library in June 2020, OverDrive checkouts of the Works in Suit did not increase." (Case 1:20-cv-04160-JGK-OTW Document 188 Filed 03/24/23 Page 42)

That sounds like a good defense, yet the judge dismisses it.

"But these metrics do not begin to meet IA’s burden to show a lack of market harm. Taking them at face value, they show at best that the presence of the Works in Suit in IA’s online library correlated, however weakly, with positive financial indicators for the Publishers in other areas. They do not show that IA’s conduct caused these benefits to the Publishers. In any event, IA cannot offset the harm it inflicts on the Publishers’ library ebook revenues, see, e.g., Andy Warhol Found., 11 F.4th at 48; TVEyes, 883 F.3d at 180, by pointing to other asserted benefits to the Publishers in other markets. Nor could those asserted benefits tip the scales in favor of fair use when the other factors point so strongly against fair use." (Case 1:20-cv-04160-JGK-OTW Document 188 Filed 03/24/23 Page 43)
Given this kind of reasoning, there is no "proof" that any library could provide that would clearly absolve the library of harm to publishers. That should be okay because "not harming publishers" is not how we should see the role of libraries in our world. Libraries exist for the same reasons that educational institutions exist: to further the abilities of citizens to participate in "science and the useful arts", as it is called in the Constitution. Yet as Dan Cohen says in his article in the Atlantic:
On Friday, the judge sided almost entirely with the publishers. The Internet Archive “argues that its digital lending makes it easier for patrons who live far from physical libraries to access books and that it supports research, scholarship, and cultural participation by making books widely accessible on the Internet,” Judge John G. Koeltl wrote in his pointed ruling. “But these alleged benefits cannot outweigh the market harm to the Publishers.”
Thus, societal benefits, such as those of libraries and schools, take a back seat to profit. Or should I say "alleged benefits." Today, copyright law creates a basis for the legality of library lending through the first sale doctrine. Some library privileges relating to making copies are included in the US copyright law. But these do not add up to actual support for the work of libraries, only a limitation on culpability as they perform key functions such as preserving cultural materials that have been abandoned by their creators and providing access to recorded culture to all who request it. In the legal regime, libraries are allowed, but not encouraged, to provide a valuable service for society. Judge John G. Koeltl has little regard for that service.

Thursday, April 06, 2023

Judge's Decision on Internet Archive's Controlled Digital Lending

The story is long and complex, so here's about the shortest Q&A summary that I can manage. Remember IANAL (I am not a lawyer), IAAL (I am a librarian). Also, I'm leaving out lots of details here, but provide links so that you can get to them. While this is playing out as a legal question, the societal issues are barely considered. I will try to give some thoughts on those soon.

Q: Who sued the Archive?
A: Four publishers: Hachette Book Group, Inc., HarperCollins Publishers LLC, John Wiley & Sons, Inc., Penguin Random House LLC.

Q: What did they sue about?
A: That the Archive digitized paper books for which the publishers hold the copyright and loaned the digital copies to people.

Q: Are these the only publishers whose works the Archive digitized?
A: Oh, no. There are probably thousands of others.

Q: What are the books that are named in the suit?
A: There are too many to list, but here are a few to give you an idea:

  • Elizabeth Gilbert's Eat, Pray, Love: One Woman's Search for Everything Across Italy, India and Indonesia
  • Malcolm Gladwell's Blink: The Power of Thinking Without Thinking
  • C. S. Lewis's The Lion, the Witch, and the Wardrobe
  • J. D. Salinger's The Catcher in the Rye
  • Laura Ingalls Wilder's Little House on the Prairie

There are many minor works as well, and others whose titles you would recognize.
The full list is at: https://storage.courtlistener.com/recap/gov.uscourts.nysd.537900/gov.uscourts.nysd.537900.1.1.pdf

Q: What was the Archive's legal defense for its actions?
A: The Archive argues that digitization is analogous to the kind of time-shifting that is done through technologies like TiVo; it is a sort of "format shift" and therefore is fair use. It also argues that the Archive, as a non-profit library, is providing a lending service like libraries do with hard copies of books. It calls this process of digitizing and lending "Controlled Digital Lending." In Controlled Digital Lending the library treats the original hard copy and the digital copy as a single "thing" and lends either one or the other but not both at a time. This is called the "one-to-one principle" and it is designed to mimic the first sale doctrine of US copyright law, which is the basis for the legality of library lending.

Q: How did the court respond?
A: The judge looked at the four fair use factors of US copyright law and concluded that the Archive's use was not fair. He accepted the publishers' arguments that the lending of the books competed with the publishers' own digital and physical sales. He also bought the publishers' argument that the Archive, albeit a non-profit, gained status and therefore donations through the book lending service.

Q: Are there legal arguments to support Controlled Digital Lending?
A: Yes, some have been made. In particular there is the work of Michelle Wu, who wrote "Building a Collaborative Digital Collection: A Necessary Evolution in Libraries." Her initial thesis regarded law libraries and their difficulty in keeping up with the production of legal resources. Later, she was one of a group of legal scholars who developed a more general statement on Controlled Digital Lending. They argue that in this environment of increasing remote access to information, libraries have to be able to move beyond the requirement that users visit a physical space to access materials. And since not all materials have been provided in digital form, libraries need to take on the process of digitization for materials that they hold only in hard copy.

Q: Is this the first time that libraries have digitized materials?
A: No. Libraries have used various technologies, including digitization, to make materials available to disabled users. They also have digitized, faxed, and copied individual journal articles and book sections to satisfy interlibrary loan requests. They rarely have digitized entire books except to preserve rare materials, but those are generally free of copyrights due to their age.

Q: So, did the Archive do something wrong?
A: Possibly. For materials online a copyright holder can issue a "take down" notice, and the recipient is obligated to remove the item from access. The publishers claim that they gave the Archive a list of items to take down, but not all were removed. I haven't seen a statement from the Archive on why that method failed. Then, for about four months, during the beginning of the COVID pandemic in 2020, the Archive eliminated the one-to-one rule and allowed unlimited lending. This was done as a service to offset the fact that during that time many physical libraries were closed to their users, but it was not in keeping with the legal principles that had been laid out for Controlled Digital Lending. 

Another possible error was the digitization of materials for which the publishers have digital versions (ebooks) on offer. This makes the argument that the Archive was competing with the publishers more convincing. Copyright law also protects "creative" works more strongly than factual works, and these are publishers of fiction as well as popular non-fiction, types of works that one could see as worthy of maximum copyright protection. Materials intended for research and education (academic journal articles, scientific treatises) are more likely to meet the "purpose" requirement of a fair use assessment. It is quite a bit harder to claim fair use copying for "something fun to read" and the publishers in the suit are all major purveyors of popular reading.

 Continuing on, most libraries have a limited user base: universities serve current students, staff and faculty; public libraries serve residents in their jurisdiction. The Archive was lending materials globally. That latter is both an argument against the Archive, if you are a publisher, and an argument for the Archive, if you support equal access to information. 

Q: Didn't we go through this already with Google books?

A: Not quite. Google never allowed anyone to read its digitized books. It stated that its digitization project was to provide searching within the text of books, and users were only displayed snippets, not the whole book. That was deemed to be fair use by the court. Since then, Google Books has mainly been acquiring digital texts provided by publishers, and the amount of visible content is part of the agreement between Google and its book partners.

Q: Could a different implementation of Controlled Digital Lending succeed?
A: Possibly. There are libraries that have partnered with the Archive in this project but were not mentioned in the lawsuit; it is unclear whether they will be able to continue lending their digitized books - although they may have to find another technical solution to the lending service, which is currently run by the Archive. There is also the possibility that a digitization project that had specific service goals, like the one initially proposed by Wu for law libraries, would be easier to defend. Both the Archive and the earlier digitization project, Google Books, decided that it was expedient to digitize first and ask permission later. They also both digitized indiscriminately, including old and new, academic and popular. Google eventually adopted an "opt-in" model in its publisher relations, although as the search engine of record what it has to offer is a level of visibility that no one else can provide. The other option is to limit access to books in the public domain, which cuts off almost the last century of works. 

Q: What's next?
A: There will be appeals by the Archive, but if those do not alter the court's view then the Archive will be required to compensate the publishers for its infringement of their rights. Presumably that compensation will be based on some estimated amount that the publishers were damaged. So far I have not seen any actual figures that would be used to make such a determination.


Tuesday, January 10, 2023

KO is KO'd

A library is intended to be a place of organized knowledge. Knowledge organization (KO) takes place in two areas: the shelf and the catalog. In this post I want to address KO in the catalog.

Headings

KO in the catalog makes use of "headings". Headings are catalog entry points, such as the title of a work or the name of an author. Library catalogs also assign topical headings to their holdings.

The "knowledge organization" of the title and author headings consists of alternate versions of those. Alternate forms can be from an unused form (Cornwell, David John Moore) to the used one (Le Carré, John). They can also refer from one form that a searcher may use (Twain, Mark) to a related name that is also to be found in the catalog (Snodgrass, Quintus Curtius).

Subject headings are a bit more complex because they also have the taxonomic relationships of broader and narrower concepts. So a broader term (Cooking) can link to a narrower term (Desserts) in the same topic area. Subject headings also have alternate terms and related terms.

The way that this KO is intended to work is that each heading and reference is entered into the catalog in alphabetical order where the user will encounter them during a search.

Cornwell, David John Moore
    see: Le Carré, John

Twain, Mark
    see also: Snodgrass, Quintus Curtius
    
Cooking
    see narrower: Desserts
    see narrower: Frying
    see narrower: Menus
    
It may seem obvious but it is still important to note that this entire system is designed around an alphabetical browse of strings of text. The user was alerted to the alternate terms and the topical structure during the browse of cards in the card catalog, where the alternate and taxonomic entries were recorded for the user to see. Any "switching" from one term to another was done by the user herself who had to walk over to another catalog drawer and look up the term, if she so chose. The KO that existed in the catalog was evident to the user.
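
The alphabetical browse can be sketched in a few lines of Python. This is only an illustration: the catalog structure and the browse function are hypothetical, the headings are the examples used above, and the plain character-by-character sort stands in for real library filing rules, which are considerably more involved.

```python
# Hypothetical catalog: each heading maps to its (reference type, target) pairs.
# An empty list means the heading is an ordinary entry point.
catalog = {
    "Cornwell, David John Moore": [("see", "Le Carré, John")],
    "Le Carré, John": [],
    "Twain, Mark": [("see also", "Snodgrass, Quintus Curtius")],
    "Snodgrass, Quintus Curtius": [],
    "Cooking": [
        ("see narrower", "Desserts"),
        ("see narrower", "Frying"),
        ("see narrower", "Menus"),
    ],
    "Desserts": [],
    "Frying": [],
    "Menus": [],
}


def browse(catalog, start):
    """Return the lines a user would see browsing alphabetically from `start`."""
    lines = []
    for heading in sorted(catalog):
        if heading < start:
            continue  # skip the part of the file that precedes the starting point
        lines.append(heading)
        # References are filed with the heading, so the user cannot miss them.
        for ref_type, target in catalog[heading]:
            lines.append(f"    {ref_type}: {target}")
    return lines


drawer = browse(catalog, "Co")
print("\n".join(drawer))
```

The point of the sketch is that the references are part of what is displayed. Nothing is switched behind the scenes, and following a reference is left to the user.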

Automation

A database of data creates the ability to search rather than browse. A database search plucks precise elements from storage in response to a query and delivers them to the questioner. The "random access" of that technology has all but eliminated the need to find information through alphabetical order. Before the database there was no retrieval in the sense that we mean today, retrieval where a user is given a finite set of results without intermediate steps on their part. Instead, yesterday's catalog users moved around in an unlimited storehouse of relevant and non-relevant materials from which they had to make choices.

In the database environment, the user does not see the KO that may be provided. Even if the system does some term-switching from unused to used terms, the searcher is given the result of a process that is not transparent. Someone searching on "Cornwell, David" will receive results for the name "John Le Carré" but no explanation of why that is the case. Less likely is that a search on "Twain, Mark" will lead the searcher to the works that Twain wrote under the additional alias of "Snodgrass, Quintus Curtius" or that the search on "Cooking" will inform the user that there is a narrower heading for "Menus." A precise retrieval provides no context for the result, and context is what knowledge organization is all about.
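The difference between silent term-switching and a visible reference can be sketched in a few lines. The lookup tables and both function names are assumptions made for this illustration and do not describe any particular catalog system.

```python
# Hypothetical reference table: unused form of a name -> used form.
SEE = {"Cornwell, David John Moore": "Le Carré, John"}

# Hypothetical holdings, keyed by the used heading.
TITLES = {"Le Carré, John": ["The Spy Who Came in from the Cold"]}


def opaque_search(name):
    """Switch terms silently and return only the results."""
    return TITLES.get(SEE.get(name, name), [])


def transparent_search(name):
    """Return the results plus a note explaining any term switch."""
    used = SEE.get(name, name)
    note = f"{name}; see: {used}" if used != name else None
    return TITLES.get(used, []), note


print(opaque_search("Cornwell, David John Moore"))
print(transparent_search("Cornwell, David John Moore"))
```

Both functions find the same titles. Only the second tells the searcher why a search on one name produced works under another.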

Answering a question is not a conversation. The card catalog engaged the user in at least a modicum of conversation as it suggested entry headings other than the ones being browsed. It is even plausible that some learning took place as the user was guided from one place in the list to another. None of that is intended or provided with the database search.

KW is especially not KO

The loss of KO is exacerbated with keyword searching. While one might be able to link a reference to a single-word topic or to a particular phrase, such as "cookery" to "cooking," individual words that can appear anywhere in a heading are even further removed from any informational context. A word like "solar" ("solar oscillations", "solar cars", "orbiting solar observatories") or "management" ("wildlife management", "time management", "library catalog management") is virtually useless on its own, and the items retrieved will be from significantly different topic areas.
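A small sketch shows how a single keyword cuts across unrelated topic areas. The list of headings is taken from the examples in the paragraph above; the `keyword_search` function is a hypothetical, deliberately naive matcher.

```python
headings = [
    "Solar oscillations",
    "Solar cars",
    "Orbiting solar observatories",
    "Wildlife management",
    "Time management",
    "Library catalog management",
]


def keyword_search(word, headings):
    """Return every heading that contains the word, wherever it appears."""
    word = word.lower()
    return [h for h in headings if word in h.lower().split()]


print(keyword_search("solar", headings))
print(keyword_search("management", headings))
```

Each call returns three headings that share a word and nothing else.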

Keyword searching is very popular because, as one computer science student once told me, "I always get something." The controversy today over mis-information is around the fact that "something" is a context-free deliverable. In libraries, keyword searching helps users retrieve items with complex headings, but the resulting resources may be so different one from the other that the retrieved set resembles a random selection from the catalog. Note, too, that even the sophisticated search engines are unable to inform their users that broader and narrower topics exist, nor can they translate from words to topics. Words are tools to express knowledge, but keywords are only fragments of knowledge.

21st Century Goals

I would like to suggest a goal for 21st century librarians, and that is a return to knowledge organization. I don't know how it can be done, but it is essential to provide this as a service to library users who are poorly served by the contextless searches in today's library catalogs. To accomplish this with computer and database technology will probably not make use of the technique of heading assignment of the card catalog. Users might enter the library through a topic map of some type, perhaps. I really don't know. I do know that educating users will be a big hurdle; the facility of typing a few words and getting "something" will be hard to overcome in a world where quick bits of information are not only the norm but all that some generations have ever known. A knowledge system has to be demonstrably better, and that's a tall order.

Saturday, November 05, 2022

Cautions on ISBN and a bit on DOI

I have been reading through the documents relating to the court case that Hachette has brought against the Internet Archive's "controlled digital lending" program. I wrote briefly about this before. In this recent reading I am once again struck by the use and over-use of ISBNs as identifiers. Most of my library peeps know this, but for others, I will lay out the issues and problems with these so-called "identifiers".

"BOOK"

The "BN" of the ISBN stands for "BOOK NUMBER." The "IS" is for "INTERNATIONAL STANDARD" which was issued by the International Organization for Standardization (ISO), whose documents are unfortunately paywalled. But the un-paywalled page defines the target of an ISBN as:

[A] unique international identification system for each product form or edition of a separately available monographic publication published or produced by a specific publisher that is available to the public.

What isn't said here in so many words is that the ISBN does not define a specific content; it defines a salable product instance in the same way that a UPC code is applied to different sizes and "flavors" of Dawn dish soap. What many people either do not know or may have forgotten is that every book product is given a different ISBN. This means that the hardback book, the trade paperback, the mass-market paperback, the MOBI ebook, the EPUB ebook, even if all brought to market by a single publisher, all have different ISBNs.

The word "book" is far from precise and it is a shame that the ISBN uses that term. Yes, it is applied to the book trade, but it is not applied to a "book" except in a common sense of that word. When you say "I read a book" you do not often mean the same thing as the B in ISBN. Your listener has no idea if you are referring to a hard back or a paperback copy of the text. It would be useful to think of the ISBN as the ISBpN - the International Standard Book product Number.

Emphasizing the ISBN's use as a product code, bookstores at one point were assigning ISBNs to non-book products like stuffed animals and other gift items. This was because the retail system that the stores used required ISBNs. I believe that this practice has been quashed, but it does illustrate that the ISBN is merely a bar code at a sales point.

1970

The ISBN became a standard product number in the book trade in 1970, in the era when the Universal Product Code (UPC) concept was being developed in a variety of sales environments. This means that every book product that appeared on the market before that date does not have an ISBN. This doesn't mean that a text from before that date cannot have an ISBN - as older works are re-issued for the current market, they, too, are given ISBNs as they are prepared for the retail environment. Even some works that are out of copyright (pre-1925) may be found to have ISBNs when they have been reissued. 

The existence of an ISBN on the physical or electronic version of a book tells you nothing about its copyright status and does not mean that the book content is currently in print. It has the same meaning as the bar code on your box of cereal - it is a product identifier that can be used in automated systems to ring up a purchase. 

The Controlled Digital Lending Lawsuit

The lawsuit between a group of publishers led by Hachette and the Internet Archive is an example of two different views: that of selling and that of reading.

 

In the lawsuit the publishers quantify the damage done to them in terms of numbers of ISBNs. This implies that the lawsuit is not including back titles that are pre-ISBN. Because the concern is economic, items that are long out of print don't seem to be included in the lawsuit.

The difference between the book as product and the book as content shows up in how ISBNs are used. The publisher’s expert notes that many metadata records at the archive have multiple ISBNs and surmises that the archive is adding these to the records. What this person doesn’t know is that library records, which the archive is using, often contain ISBNs for multiple book products which the libraries consider interchangeable. The library user is seeking specific content and is not concerned with whether the book is a hard back, has library binding, or is one of the possible soft covers. The “book” that the library user is seeking is an information vessel.

It is the practice in libraries, where there is more than one physical book type available, to show the user a single metadata record that doesn’t distinguish between them. The record may describe a hard bound copy even though the library has only the trade paperback. This may not be ideal but the cost-benefit seems defensible. Users probably pay little attention to the publication details that would distinguish between these products. 

 

From a single library metadata record

 

Where libraries do differentiate is between forms that require special hardware or software. Even here, however, the ISBN cannot be used for the library’s purpose because services that manage these materials can provide the books in the primary digital reading formats based on a single metadata record, even though each ebook format is assigned its own ISBN for the purpose of sales.

The product view is what you see on Amazon. The different products have different prices, which is one way they are distinguished. A buyer can see the different prices for hard copy, paperback, or Kindle book, and often a range of prices for used copies. Unlike the library user, the Amazon customer has to make a choice, even if all of the options have the same content. For sales to be possible, each of the products has its own ISBN.

Different products have different prices


Counting ISBNs may be the correct quantifier for the publishers, but they feature only minimally in the library environment. Multiple ISBNs on a single library metadata record is not an attempt to hide publisher products by putting them together; it's good library practice for serving its readers. Users coming to the library with an ISBN will be directed to the content they seek regardless of the particular binding the library owns. Counting the ISBNs in the Internet Archive's metadata will not be a good measure of the number of "books" there using the publisher's definition of "book."
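The gap between the two counts is easy to see in a toy example. The records below are invented, the identifiers are placeholders rather than real ISBNs, and the two function names are assumptions made for the sketch.

```python
# Each library record describes content and may list several product ISBNs.
# The identifiers are placeholders, not real ISBNs.
records = [
    {"title": "Title A", "isbns": ["A-hardback", "A-trade-paper", "A-mass-market"]},
    {"title": "Title B", "isbns": ["B-hardback", "B-epub"]},
    {"title": "Title C", "isbns": []},  # e.g. published before ISBNs existed
]


def count_products(records):
    """The publisher's count: one per distinct ISBN."""
    return len({isbn for record in records for isbn in record["isbns"]})


def count_contents(records):
    """The library's count: one per record, whatever ISBNs it carries."""
    return len(records)


print(count_products(records))
print(count_contents(records))
```

Three records yield five ISBNs here, and one of the three contributes none at all, so neither number can be derived from the other.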


Digital Object Identifier (DOI)


I haven't done a deep study of the use of DOIs, but again there seems to be a great enthusiasm for the DOI as an identifier yet I see little discussion of the limitations of its reach. DOI began in 2000 so it has a serious time limit. Although it has caught on big with academic and scientific publications, it has less reach with social sciences, political writing, and other journalism. Periodicals that do not use DOIs may well be covering topics that can also be found in the DOI-verse. Basing an article research system on the presence of DOIs is an arbitrary truncation of the knowledge universe.

 

The End

 

Identifiers are useful. Created works are messy. Metadata is often inadequate. As anyone who has tried to match up metadata from multiple sources knows, working without identifiers makes that task much more difficult. However, we must be very clear, when using identifiers, to recognize what they identify.


Monday, June 27, 2022

The OCLC v Clarivate Dilemma

OCLC has filed suit against the company Clarivate, which owns ProQuest and Ex Libris. The suit focuses on a metadata service proposed by Ex Libris called "MetaDoor." MetaDoor isn't a bibliographic database à la WorldCat; it is a peer-to-peer service that allows its users to find quality records in the catalog systems of other libraries. ("MetaDoor" is a terrible name for a product, by the way.)

What seems to specifically have OCLC's dander up is that Ex Libris states that it will allow any and all libraries, not just its Alma customers, to use this service for free. As the service does not yet exist it is unknown how it could affect the library metadata sharing environment. It may succeed, it may fail. If it succeeds, the technology that Ex Libris develops will be a logical next step in bibliographic data sharing, but its effect on OCLC is hard to predict.

Yesterday's and Today's Technology

WorldCat is yesterday's technology: a huge, centralized database. Peer-to-peer sharing of bibliographic records has been available since the 1980's with the development of the Z39.50 protocol, and presumably libraries have done a considerable amount of sharing over that protocol to obtain records from other libraries. Over the years many programs and systems have been developed to make use of Z39.50 and the protocol is built in to library systems, both for obtaining records and for sharing records.

The actual extent of peer-to-peer sharing of bibliographic records today does not seem to be known, although I did only a brief amount of research looking for that information. It is definitely in use in library environments where participation in OCLC is unaffordable; articles vaunt its use in Russia, India, Korea, and other countries. It is built into the open source library system Koha that is aimed at those libraries that are priced out of the mainstream library systems market. Where libraries have known peers, such as the national library of a country, peer-to-peer makes good sense.

What OCLC's centralized database has that peer-to-peer lacks (at least to date) is consolidated library holdings information. As Kyle Banerjee said on Twitter, the real value in WorldCat is the holdings. This is used by interlibrary loan systems, and it is what appears on the screen when you do a WorldCat search. Cleverly, OCLC has recorded the geographical location of all of its holding libraries and can give you a list of libraries relative to your location. In the past this type of service was only available through a central database, but we may have arrived at a point where peer-to-peer could provide this as well.

A couple of other things before I look at some specific points in the lawsuit. One is that WorldCat is not the only bibliographic database used for sharing of metadata. Some smaller library companies also have their own shared databases. These are much smaller than WorldCat and the libraries that use them generally 1) are unable to afford OCLC's member fees and 2) do not have need of the depth or breadth of WorldCat's bibliographic data. For example, the CARL database from the company TLC has 77 million records, far fewer than WorldCat's more than 500 million. Even the Library of Congress catalog is only 20 million strong. The value for some libraries is that WorldCat contains the long tail; for others, that long tail is not needed. It's the difference between the Harvard library and your local public library. Harvard may well have need for metadata for a Lithuanian poetry journal; your local public library can do just fine with a peer database of popular works published in the US.

And another: we're slowly moving from a "thing"-based world to a "data"-based world. Yes, scholars still need books and journals, but increasingly our information seeking returns tiny bites, not big thoughts. You can rue that, but I think it's only going to get worse. It's like the difference between a Ken Burns 10-part documentary on the Civil War and TikTok. The metadata creation activity for the deep thoughts of books and articles is not viable for YouTube, Instagram, TikTok or even Facebook. Us "book people" are hanging on to a vast repository that is less and less looking forward and more and more becoming dusty and crusty. We don't want to lose that valuable archive, but it is hard to claim that we are not a fading culture.

OK, to the lawsuit.

What is the Nut of this Case?

OCLC claims in its suit that Clarivate is undertaking MetaDoor as a malicious act, targeting WorldCat with a desire to destroy it. I don't think you need to be malicious to come up with a project to create an efficient system for sharing bibliographic records. Creating a shared database at this time is simply a logical need for any data service. 

The main fact behind OCLC's suit is that uploading library catalog bibliographic records to MetaDoor is a violation of the libraries' contracts with OCLC, and that Clarivate/Ex Libris is encouraging libraries to violate their contracts. As Clarivate has no such contract with OCLC, the suit uses terms like "conspiracy" and a lot of "tortious" to argue that Clarivate/Ex Libris is breaking some law of competition by encouraging OCLC customers to violate their contracts.

I'm not sure how that will play in court but you can see on the Clarivate site that one of their main areas of expertise is in intellectual property around data. Regardless of the outcome of this suit we may get to see some interesting arguments around data ownership. It's still a wide-open area where some smart discussion would be very welcome.

The ILS Market

The lawsuit complains that Clarivate has become the largest player in the ILS market through its purchase of systems like Ex Libris and ProQuest. (It isn't clear to me how "large" is defined here.) It also bemoans the consolidation of the library market. The library market is hardly unique in this; consolidation of this type is a normal course of things in our barely-regulated capitalism. It is, as always, hard to understand just what Clarivate owns because Clarivate owns ProQuest which owns Ex Libris which owns Innovative Interfaces which owns SkyRiver and VTLS, among others. The number of players in the library market, which once was a handful of independent companies, is shrinking at a rapid rate, and this has been a worry in the library world now for decades.

On its web site Clarivate presents itself as a research data and analytics company. It includes ProQuest and Web of Science in its list of offerings, but interestingly makes no mention of Ex Libris. I've always wondered why anyone with any business sense would want to enter the library cataloging systems market. In fact, Clarivate inherited Ex Libris when it purchased ProQuest, and the Clarivate press release upon acquiring ProQuest makes no mention of Ex Libris or other library systems.

Speaking of market consolidation, one must remember that at one time OCLC had two rather large competitors in the library cataloging market: the Western Library Network and Research Libraries Information Network. OCLC purchased both of these, and they then ceased to exist. That was itself a consolidation that concerned many because at the time few library cataloging systems provided a significantly large database to support the cataloging activity. Also, take a gander at this chart from Marshall Breeding's Library Tech Guides that shows the "mergers and acquisitions" of OCLC:


(more readable on Marshall's site, so hop over there for details)

What is a WorldCat record?

The lawsuit speaks of the "theft" of WorldCat records by Ex Libris for their MetaDoor product (which isn't well explained as the records will be voluntarily offered by the participating libraries). The peer-to-peer action of MetaDoor, however, does not touch the WorldCat database directly. As I understand it from the Ex Libris web site, libraries using the Ex Libris system agree to have that system harvest records from their database. Information from those records will be indexed in MetaDoor but the records themselves will not be stored there. Users of MetaDoor will discover records they need for cataloging through MetaDoor, and the records will be retrieved from the library system holding the record. Without a doubt, some of those records will have been downloaded by libraries during cataloging on OCLC. The lawsuit refers to these as "WorldCat records."

Here's the hitch: these records are distributed among individual library databases. Each MARC record is a character string in which any part of that string can be modified using software written for that purpose. That software may be part of the library catalog system, or it may be standalone software like the freely available MarcEdit. Other software, like OpenRefine, has been incorporated into batch workflows for MARC records to make changes to records. Basically, the records undergo a lot of changes, both the "enhancements" in WorldCat that the lawsuit refers to but also an unknown quantity of modifications once individual libraries obtain them.

Note that some libraries do not use OCLC and therefore have no WorldCat records, and many libraries have multiple sources of bibliographic data. It simply isn't possible to say "all your MARC belong to us." It's much more complicated than that. Although there is nominally both provenance and versioning data in the MARC records, these fields are as editable as all others. In addition, some systems ignore these and do not attempt to update those details as the records are edited. This means that there is no way to look at a record in a library database and determine precisely from where it was originally obtained prior to being in that database. If library A uses OCLC to create a catalog record and library B (not an OCLC cataloging customer) uses its catalog system's Z39.50 option to copy that record from Library A, modifying the record for its own purposes, then library C obtains the record from Library B ... well, you see the problem. These records may flow throughout the library catalog universe, losing their identity as WorldCat records with each step.
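The chain from library A to library C can be sketched with a drastically simplified stand-in for a MARC record. The assumption here is that the nominal provenance sits in field 035 (system control number, where an OCLC number is conventionally prefixed "(OCoLC)") and field 040 (cataloging source). The record values are placeholders and every function name is hypothetical.

```python
import copy

# A much-simplified stand-in for a MARC record: tag -> list of field values.
# All values are placeholders.
library_a = {
    "035": ["(OCoLC)0000000"],
    "040": ["AAA"],
    "245": ["An example title"],
}


def copy_and_localize(record, drop_tags=()):
    """Copy a record as a receiving library might, dropping chosen fields."""
    local = copy.deepcopy(record)
    for tag in drop_tags:
        local.pop(tag, None)
    return local


def looks_like_worldcat(record):
    """Naive provenance test: does the record carry an OCLC control number?"""
    return any(value.startswith("(OCoLC)") for value in record.get("035", []))


# Library B copies from A and drops the control number;
# library C copies from B and drops the cataloging source as well.
library_b = copy_and_localize(library_a, drop_tags=("035",))
library_c = copy_and_localize(library_b, drop_tags=("040",))

print(looks_like_worldcat(library_a))
print(looks_like_worldcat(library_c))
```

The descriptive content in library C's copy is identical to library A's, but nothing in the record itself says where it came from.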

OCLC appears to claim in its suit that the OCLC number confers some kind of ownership stamp on the records. In one of the later paragraphs of a very insightful Scholarly Kitchen blog post, Todd Carpenter reminds us that OCLC has not claimed restrictions on the identification number. Also, like everything else in the MARC record, that number can be deleted, modified, or added to a record at the whim of the cataloger. (OK, I admit that "whim" and "cataloger" probably shouldn't be used in the same sentence.) 

Rather than flinging lawsuits around, it would be very interesting to use that money to hire one of those people who looks out 20 years to tell you what the environment will be and what you should be investing in today. I can cover a certain amount of the past, but the future is a fog to me. I hope someone has ideas.

-----

As with many lawsuits, there's a lot of flinging documentation back and forth. Check out this site to keep an eye on things. I welcome recommendations of other resources.

Wednesday, January 26, 2022

What's in a Name?

This is an essay about the forms of names and their representation in metadata. It is not by any means complete, nor am I an expert in this very complex area. These are my observations and a few suggestions for future work. All comments welcome.

[Because this is huge, and printing from here is oddly difficult, here's a PDF.]

If you do anything online, and surely you do, you have filled in countless forms with your name and address. Within the Western and English-speaking world, these have some minor (and occasionally annoying) variations. You might be asked for a first name and last name, or a given name and a family name, or just a name in a particular order.


There are variations, of course. Some recognize the practice of giving a person a "middle" name, that is a second, and perhaps secondary, additional name.

Because these forms are often used in commercial sites and the companies wish to have a polite relationship with their customers, you might be asked about your preferred form of address. 

These forms of address have cultural significance, and the list itself can reveal quite a bit about a culture. This is the list from the British Airways site:

We'll come back to some of these below.

The above examples come from commerce sites. The use of names at those sites is mostly social. Even on a site like a bank, the name has only a minor role in regards to identification because security relies on user names, passwords, and two-factor identification. Names themselves are poor identifiers because they are far from unique across a population. Even if you think you have an unusual name, you will find others with your name in the vastness of the Internet.*

If you think about times that you've been on the phone with some bank or service, they invariably ask you to provide a telephone number, an email address, or a unique identifier like a social security number as a way to identify you. Only after they have located a record with that identifier do they use your name, both as confirmation that you've given (and they've entered) the correct number and so that they can cheerily refer to you by your name.

Names in Cultural Heritage

Where commercial organizations use names to effect a relationship with their current customers, cultural heritage institutions have a different set of needs. They often cover not only names of modern persons but persons worldwide and of previous eras. An organization must be able to encode this full range of names in a way that is useful today but that is, to the extent possible, faithful to the cultural and historical context of the person. Royalty, religious figures, even characters in mythology all have a very tender place in their respective cultures. To treat them otherwise is to dismiss their cultural importance. You wouldn't want to provide metadata for Benedict XVI without also including that his title and his role in the church is "Pope". You most certainly would not simply name him "Joseph Ratzinger" unless you were giving a very specific, pre-Pope, context.  I don't know what name Queen Elizabeth II would provide when signing up for an Amazon account ("Elizabeth Windsor"?) as there is unlikely to be an input box appropriate for her royal name, but culturally and historically she is Elizabeth II, Queen of Great Britain.

There is also the question of giving people their due rank in whatever hierarchy the particular culture values. As you can see from the list of titles offered by British Airways in the example above, titles of nobility are important in the UK, whereas US-based airlines limit the titles to Mr., Mrs., Ms. and Dr. We can presume that to "mis-title" a person would be a social faux pas in most cultures, but there is also a historical context included in titles that one would not want to lose.

The "firstname, lastname" Problem

Not all names fit the "firstname, lastname" model. A primary reason to identify these parts of names is to support displays in alphabetical order by the "last name". This assumes that the last name is a family name, and that common usage is to gather together all persons with that family name in a display. In reality, this singular "family name" is only one possible name pattern. 

As the term "family name" implies, this positions a person within a group of persons with a particular relationship. In the dominant Western world, the name is paternal and denotes a line of inheritance. But this is by no means the only name pattern that exists. There are cultures where the child's name includes the family names of both the mother and the father, and sometimes other ancestors in the family line. This is how Juan Rodríguez y García-Lozano and María de la Purificación Zapatero Valero have a son named José Luis Rodríguez Zapatero. Treating "Rodríguez Zapatero" as the family name would not bring together the alphabetical entries of the father and son.

There are other cultures that have a given name and a patronymic. While a patronymic may look like a family name, it is not. The singer Björk may have seemed to be using a single name as part of her art, like Cher or Madonna, but in fact in the Icelandic culture persons are known by a single name. When a more precise designation is needed, that name is enhanced with a name based on the given name of their father. In this case, Björk has a more "official" name of Björk Guðmundsdóttir, which is "Björk daughter of Guðmundur". Her father's name was Guðmundur Gunnarsson, he being the son of Gunnar. The author Arnaldur Indriðason is "Arnaldur the son of Indriði." In this practice, creating an order based on the patronymic would result in just a jumble of individual parental names, and persons are almost always called solely by their "first-and-only" name.

Yet another exception to the firstname/lastname conundrum relates to the names of royalty as mentioned above. Charles, Prince of Wales is the son of Elizabeth II. Their names do not connect them which is somewhat ironic given how important family relationships are to royal lineage. Both are of the house of Windsor but you wouldn't know that from their names. Like a Pope, the cultural or political position in these cases outweighs the personal. In addition, the title by which someone is officially known can change over time, making identification even more confusing, with titles being inherited or bestowed as circumstances change. Some people hold a plethora of titles: in addition to Prince of Wales, Charles is Earl of Chester, Duke of Cornwall, Duke of Rothesay, Earl of Merioneth and Baron Greenwich. This is as bad as the name proliferation in Russian novels, and just as confusing.

And there are the "one name" instances.  We have historical figures with only a single name ("Homer", "Aesop") but there are also current cultures in which members have only one name.


Any metadata that strictly requires both a given name and a family name will be unable to accommodate these and it is not unusual for people with only one name to be required to provide a second name to conform to the given/family name expectation in other cultures. There may even be local traditions for how one invents such a name. Yet they would not use that invented name in their own home country.

Names and Language

It is hard to separate language from culture, but there are some name situations in which the name is translated into the "receiving" language. Catherine (the Great) is Catherine in French, Caterina in Italian, etc.  The same is true of Popes:

Papa Franciscus (Latin)

Papa Francesco (Italian)

Papa Francisco (Spanish)

Pope Francis (English)

Another twist is that scientists and other cognoscenti of the late medieval and early modern times communicated with each other in Latin, and, probably as a form of showing that they were members of this elite club, often converted their names to a Latin form. Thus, one Aldo Pio Manuzio, a Venetian scholar and a very early book printer, took the name Aldus Pius Manutius. Francis Bacon published his "Novum Organum" (which was in Latin) as "Franciscus Baconis". 

Things get doubly complex as people and their names move from one culture to another. Many people of Chinese origin reverse the order of their names from family name first then given name to the preferred order in Western countries that places the family name last. In some cases, as with science fiction author Liu Cixin, a change for the Western marketplace creates a bit of confusion for anyone wanting to correctly encode this Chinese name.


Note that his translator, Ken Liu, an American, uses the Western form of his own name. So this book cover is a good illustration of the name problem across cultures.

Names in Metadata

How we handle names in metadata design depends mainly on the intended application functions for the data. I give below some key functions that use names, but this is an incomplete list. I can see these four as key purposes for names and their encoding in metadata:

1. Display - Names get displayed in a number of different contexts, from phone books to faculty listings on a web site to conference name tags. Displays may use all or only part of a name, and there are a variety of ways that one can order the name parts.

2. Disambiguation - Which Mary Jones is this? How do I identify and find the one that I am looking for?

3. Addressing - We do want to address people appropriately, and we also want to talk about them appropriately.

4. Finding - Searching via keyword is without context, so I'll assume that all name forms can be searched in that way. I will describe "finding" as meaning a search for a specific, known name.

I'll trace these through some metadata schemas to illustrate the metadata capabilities one might have.

Library of Congress (and other libraries)

Libraries have been dealing with names and name forms for, well, forever; as long as there have been libraries. The set of rules for determining what name to enter for someone in the library catalog is many, many tens of pages long, and there are separate rules for personal names, corporate names, and family names. Yet library name practices have their limitations, in particular that names are entered as strings that are to be used to create a specific alphabetical sort order that begins with the surname, followed by a comma, and then the forename(s). 

Dempsey, Martin, 1904-
Dempsey, Martin E.
Dempsey, Mary.
Dempsey, Mary A.

Display by family name works well for Western names with family names, but not for Eastern names that place the family name first.

  Mao, Zedong

Following Chinese name practice his name would naturally be given as "Mao Zedong" because the family name is always given first. If one attempts to use the comma to revert names to their natural order, say from "Smith, Jane" to "Jane Smith" then you would also end up with "Zedong Mao" which is not correct in that cultural context. A culturally sensitive "natural order" display is not supported by this metadata. 
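The inversion problem can be shown with a few lines of Python. This is only a sketch: the function name invert_heading is made up for illustration, and it assumes the heading is a plain "family, given" string with nothing after the given name.

```python
def invert_heading(heading: str) -> str:
    """Naively turn 'Family, Given' into 'Given Family' by splitting on the first comma."""
    family, sep, given = heading.partition(",")
    if not sep:
        # No comma at all: nothing to invert, so return the string unchanged.
        return heading.strip()
    return f"{given.strip()} {family.strip()}"


print(invert_heading("Smith, Jane"))  # Jane Smith
print(invert_heading("Mao, Zedong"))  # Zedong Mao
```

The second result is the wrong form for the Chinese context, and nothing in the string itself tells the program that it should have left the order alone.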

The primary display form is the Western one of lastname-comma-first names, but there are exceptions for entry by forename, which is given specific coding: 

Arnaldur Indriðason, 1961-
Homer

As I've shown in the Mao Zedong example, the encoding of name parts in library data does not provide what you might need to create other display forms. In the case of Arnaldur Indriðason, outside of the library's need to alphabetize its entries, you may want to know that Indriðason is a patronymic if you intend to use the name to address the person as he would be addressed in his culture. The example of "Mao, Zedong" is lacking the information that this is a name in a culture that regularly refers to people with their surname preceding their given name (and without a comma). You would want to know that this should be rendered as "Mao Zedong" when used in that context. 

As you can see in the examples above, the Library of Congress name practice goes beyond just the name and adds elements that are meant to inform and clarify. It includes dates (birth, death); titles and other terms associated with a name (Pope, Jr., illustrator); enumeration (II); and fuller form of the name, which fills in portions of the name that use initials ("Boyle, Timothy D. (Timothy Dale)").  Interestingly, the "III" in Pope Pius III is an enumeration, while the "III" in "John R Kennedy, III" is an "other term associated with a name." I'm going to guess that this primarily relates to the positioning of the "III" in the display. This illustrates a tension between identifying parts of the name and providing the desired display of those parts.

There is another problem with "title and other terms" because it is a catchall element that doesn't distinguish between some very different types of data. The documentation lists:

  • titles designating rank, office, or nobility, e.g., Sir
  • terms of address, e.g., Mrs.
  • initials of an academic degree or denoting membership in an organization, e.g., F.L.A. 
  • a roman numeral used with a surname
  • other words or phrases associated with the name, e.g., clockmaker, Saint.

As you can see, some of these would display before the name in a "natural order" display:

  • Sir Paul McCartney
  • Mrs. Harriet Ward

While others display afterward:

  • John Kennedy, Jr.
  • John Kennedy, III

 And some can be either or both:

  • Dr. Paul Johnson, DDS
  • Dr. Sophie Jones, Ph.D., F.I.P.A.
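To make the positioning problem concrete, here is a minimal Python sketch. The PREFIX_TERMS table and the natural_order function are hypothetical; the point is that the catchall element carries no position information, so an application has to bring its own list of which terms go in front.

```python
# Hypothetical lookup table: the catchall element does not say where a term
# belongs, so the application has to know this on its own.
PREFIX_TERMS = {"Sir", "Mrs.", "Dr."}


def natural_order(given: str, family: str, terms: list[str]) -> str:
    """Build a display name, putting known prefix terms first and the rest after."""
    before = [t for t in terms if t in PREFIX_TERMS]
    after = [t for t in terms if t not in PREFIX_TERMS]
    name = " ".join(before + [given, family])
    return ", ".join([name] + after)


print(natural_order("Paul", "McCartney", ["Sir"]))       # Sir Paul McCartney
print(natural_order("John", "Kennedy", ["Jr."]))         # John Kennedy, Jr.
print(natural_order("Paul", "Johnson", ["Dr.", "DDS"]))  # Dr. Paul Johnson, DDS
```

Any term missing from the table silently lands after the name, which is exactly the kind of guesswork that more specific coding would avoid.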

There is always the need to disambiguate between people with the same name. Some of these "other terms"  work well in identifying a person:

Boyle, Tom (Professor)

Boyle, Tom (Spiritualist)

However, the clarification between identical names used most often in library name data is the dates of birth and death. These used to be included only when necessary to distinguish between identical names, but the information is now included whenever it is available to the cataloger. This makes the dates an integral part of the name, much like the roman numerals in the names of Popes. 

Pius I, Pope, d. ca. 154 

Pius II, Pope, 1405-1464

Although perhaps once useful for the purpose of distinguishing otherwise identical names, the sheer number of people who are included in library catalogs has greatly limited the utility of these dates for disambiguation.

Kennedy, John, 1919-1945
Kennedy, John, 1921-
Kennedy, John, 1926-1994
Kennedy, John, 1928-
Kennedy, John, 1931-
Kennedy, John, 1931-2004
Kennedy, John, 1934-2012
Kennedy, John, 1939-
Kennedy, John, 1940-
Kennedy, John, 1947-
Kennedy, John, 1948-
Kennedy, John, 1951-
Kennedy, John, 1953-
Kennedy, John, 1956-
Kennedy, John, 1959-
Kennedy, John, 1963-
Kennedy, John, 1965-
Kennedy, John, 1973-
Kennedy, John, -1988.

There is provision for alternate versions of names in library practice, although these reside in a separate file and are not always linked to the primary name in library databases.

Boyle, Thomas John
    see: Boyle, T. C. 

The library name practices, although probably the most detailed of any metadata name schemes, are not very generalizable; they serve one designated application, which is the alphabetical order of the entries in the library catalog. 

Dublin Core

Dublin Core is absolutely minimal when it comes to names, as "core" implies. It provides only one property, dct:creator, without further detail. It also does not distinguish between persons and organizations: both can be coded as "creator" with an implicit class of Agent. Any further intelligence must be provided elsewhere in a metadata scheme that makes use of Dublin Core.

Dublin Core does allow for the value of the dct:creator property to be either a literal or an IRI or Bnode, and the encoding of the value of the IRI could be a more precise name form. Using an IRI could also be a method for providing a unique identity for the creator.
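The difference between a literal value and an IRI value can be sketched with plain Python tuples standing in for triples. Everything under example.org is invented for illustration, as are the "agents" table and the creator_label function; only the dct:creator property IRI is real.

```python
DCT_CREATOR = "http://purl.org/dc/terms/creator"

# (subject, predicate, object) triples. The example.org IRIs are made up.
literal_triple = ("http://example.org/book/1", DCT_CREATOR, "Boyle, T. C.")
iri_triple = ("http://example.org/book/1", DCT_CREATOR, "http://example.org/agent/42")

# A hypothetical description that the IRI could lead to, holding richer name data.
agents = {
    "http://example.org/agent/42": {
        "name": "Boyle, T. C.",
        "variant": "Boyle, Thomas John",
    }
}


def creator_label(triple: tuple[str, str, str]) -> str:
    """Return a name for the creator, following the IRI when one is used."""
    value = triple[2]
    if value in agents:
        return agents[value]["name"]
    return value  # a literal: the string is all we have


print(creator_label(literal_triple))  # Boyle, T. C.
print(creator_label(iri_triple))      # Boyle, T. C.
```

Both forms can yield the same display string, but only the IRI form gives the creator an identity that is independent of how the name happens to be written.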

FOAF

The "Friend of a Friend" vocabulary is about people, their names, and some modern social connectivity: email address, web site, etc. FOAF has three name properties:

  • foaf:name - which can be used for an entire name, undifferentiated in terms of types of name
  • foaf:familyName & foaf:givenName - intended to be used together (but with no mechanism to enforce that); these allow an obvious separation between the names. How they would display is left to the applications that make use of them. 

The foaf:familyName and foaf:givenName properties cover a limited set of name forms. In the context of many online sites this may suffice, especially where there is no enforcement of "real" names. Given that FOAF was developed for use within and between online social sites, it avoids the need for historical forms of names.

All of these are defined as taking literal values, which we know does not provide an unambiguous identity for a person. There are properties defined in FOAF under the "Social Web" rubric, such as an email address, that should serve to disambiguate persons in a particular social context. These are not, however, part of the name itself.

schema.org

The vocabulary schema.org was developed to provide "structured data on the Internet". (This is exactly the original impetus behind Dublin Core. How that went south, and what schema.org attempts to do instead, is beyond this post.) The vocabulary listed under the person schema is extensive, although only a few elements are directly related to names:

sdo:familyName, sdo:givenName, sdo:additionalName

sdo:givenName is defined as the "first name" and sdo:familyName is defined as the "last name". sdo:additionalName is "An additional name for a Person, can be used for a middle name". This latter is highly flexible but at the same time non-specific. It also creates some confusion in terms of the order of names for anyone whose name does not fit the exact "first-middle-last" pattern. As shown above, it's not totally uncommon to have more than one name that can fit into any of those particular buckets. Presumably the properties are repeatable, but they are defined with the singular term "name". The schema also does not clarify a display order. 

sdo:givenName "T."
sdo:familyName "Boyle"
sdo:additionalName "C."
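An application that wants to display this name has to pick an order itself. The Python sketch below uses dictionary keys that mirror the three property names; the first_middle_last function and the order it applies are assumptions of this sketch, not something that schema.org defines.

```python
def first_middle_last(person: dict[str, str]) -> str:
    """Join the name parts in an assumed given-additional-family order."""
    keys = ("givenName", "additionalName", "familyName")
    return " ".join(person[k] for k in keys if person.get(k))


boyle = {"givenName": "T.", "familyName": "Boyle", "additionalName": "C."}
print(first_middle_last(boyle))  # T. C. Boyle
```

The result looks right here only because this name happens to fit the pattern that the function assumes.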

Schema.org does have properties for both pre-name and post-name honorifics. The examples given for these are: sdo:honorificPrefix (Dr., Mrs.); sdo:honorificSuffix (M.D., PhD).  These examples don't make it clear if it might be possible to encode:

sdo:givenName "Charles"
sdo:honorificSuffix "Prince of Wales"

or

sdo:givenName "Pius"
sdo:honorificPrefix "Pope"
sdo:honorificSuffix "II"

In any case it appears that this would not distinguish between the informal honorifics like "Esq." and those that are essential parts of the name such as titles of nobility. There also does not seem to be an obvious way to encode non-honorific suffixes, such as "Jr." or "III". 

Without some strong guidance, it would be hard to know which of these properties would be used for the parts of a name like María de la Purificación Zapatero Valero. We'll see a possible solution to this with Wikidata, below.

Wikipedia

Wikipedia has probably millions of articles for people and therefore has to deal with the question of names. Their search does not distinguish between names and other article topics, and all are searched in left-to-right natural order in a drop-down box. Names are article titles just as any topic can be an article title.

There is no special coding of the name or parts of the name - it is simply a string of characters. Where more than one person has the same name, article creators must add something to disambiguate the name, usually an area of activity and perhaps a location associated with that activity.


Wikipedia also has a special type of page where topics that have common terms, including names, can be further defined.

 
These pages allow an explanation to distinguish between people who share a name. It goes beyond the parenthetical phrases that are used to create unique article names for persons with the same name, and is much more human-friendly than the birth and death dates that library cataloging relies on. Yet while Wikipedia excels in disambiguation, its encoding for names is limited to a single property, "name", in the infobox for a person, although it also allows for honorifics and for alternative forms of the name. 
 

Because the various Wikipedias are divided by language, there are properties for translations and transliterations of names, and it allows for name changes over the course of a person's life. 

Wikidata

Wikidata began by extracting data points from the Wikipedia entries, primarily from the infoboxes, but has grown beyond that to a database of facts that is edited directly. Perhaps because it is massively crowd-sourced, a long list of name properties has been developed. In addition to the usual given name and family name there are terms like demonym (a name representing a place), second family name in Spanish name, Roman cognomen (ancient surname), patronym or matronym (names representing the person's father or mother), first family name in Portuguese name, and many others. 

Also because it is crowd-sourced there should be no expectation that this list is complete or balanced. It most likely represents a modicum of self-interest on the part of participants.

Conclusion (?)

Any solution in this area needs to recognize that one size does not fit all. For some applications a single "name=[string]" will be sufficient and it would be seriously counter-productive to force those folks to engage in detailed encoding. Another barrier to detailed encoding is that few people have the knowledge to encode the universe of name forms at a detailed level. Requiring metadata creators to make distinctions outside of their understanding would only result in error-ridden metadata. Better a blind single string than mis-coded details. Yet there will be applications and their metadata communities that can or must make use of the subtleties of name details that are not of interest to others.

Because of both the great variety of name forms and the variability of applications that make use of names, I recommend a metadata vocabulary that follows the principle of minimum semantic commitment. This means a vocabulary that includes broad classes and properties that can be used as is where detailed coding is not needed or desired, but which can be extended to accommodate many different contexts.

The trick then is to define broad classes that aid in defining semantics but do little restriction. Classes for things like "Agent", with subclasses for "Person", "Groups of Persons", and perhaps "Non-persons". Properties could begin with "name" which could be subdivided into any definable part of a name that people find useful. Further specificity can be provided by application profiles that define such requirements as cardinality or value types for the various properties. Applications themselves could contain rules for the displays that are needed for their use cases.
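As a rough illustration of what minimum semantic commitment could look like in practice, here is a Python sketch. It is not a proposed standard: the name_parts field, the part labels, and the display function are all hypothetical, and only the class names echo the ones suggested above.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str  # the broad property, always usable on its own
    # Optional, open-ended refinements; the labels are up to each community.
    name_parts: dict[str, str] = field(default_factory=dict)


@dataclass
class Person(Agent):
    pass


@dataclass
class GroupOfPersons(Agent):
    pass


def display(agent: Agent, order: tuple[str, ...] = ()) -> str:
    """Use detailed parts when the application asks for them and they exist;
    otherwise fall back to the plain name string."""
    if order and all(part in agent.name_parts for part in order):
        return " ".join(agent.name_parts[part] for part in order)
    return agent.name


simple = Person(name="Homer")
detailed = Person(
    name="Mao Zedong",
    name_parts={"familyName": "Mao", "givenName": "Zedong"},
)

print(display(simple, ("givenName", "familyName")))    # Homer
print(display(detailed, ("givenName", "familyName")))  # Zedong Mao
print(display(detailed))                               # Mao Zedong
```

The simple record costs its creator nothing beyond a string, while the detailed record lets an application that cares about ordering apply its own display rules.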

The challenge now is to find a standards group that is interested in taking this on.

-------

* With perhaps a few exceptions. I once heard Lorcan Dempsey opine that people's names would be much more useful if parents would just give their children unique names, "... like Lorcan Dempsey."