Monday, March 01, 2021

Digitization Wars, Redux


 Because this is long, you can download it as a PDF here.

From 2004 to 2016 the book world (authors, publishers, libraries, and booksellers) was involved in the complex and legally fraught activities around Google’s book digitization project. Once known as “Google Book Search,” the company claimed that it was digitizing books to be able to provide search services across the print corpus, much as it provides search capabilities over texts and other media that are hosted throughout the Internet. 

Both the US Authors Guild and the Association of American Publishers sued Google (both separately and together) for violation of copyright. These suits took a number of turns including proposals for settlements that were arcane in their complexity and that ultimately failed. Finally, in 2016 the legal question was decided: digitizing to create an index is fair use as long as only minor portions of the original text are shown to users in the form of context-specific snippets. 

We now have another question about book digitization: can books be digitized for the purpose of substituting remote lending in the place of the lending of a physical copy? This has been referred to as “Controlled Digital Lending (CDL),” a term developed by the Internet Archive for its online book lending services. The Archive has considerable experience with both digitization and providing online access to materials in various formats, and its Open Library site has been providing digital downloads of out of copyright books for more than a decade. Controlled digital lending applies solely to works that are presumed to be in copyright. 

Controlled digital lending works like this: the Archive obtains and retains a physical copy of a book. The book is digitized and added to the Open Library catalog of works. Users can borrow the book for a limited time (2 weeks) after which the book “returns” to the Open Library. While the book is checked out to a user no other user can borrow that “copy.” The digital copy is linked one-to-one with a physical copy, so if more than one copy of the physical book is owned then there is one digital loan available for each physical copy. 

The Archive is not alone in experimenting with lending of digitized copies: some libraries have partnered with the Archive’s digitization and lending service to provide digital lending for library-owned materials. In the case of the Archive the physical books are not available for lending. Physical libraries that are experimenting with CDL face the added step of making sure that the physical book is removed from circulation while the digitized book is on loan, and reversing that on return of the digital book. 

Although CDL has an air of legality due to limiting lending to one user at a time, authors and publishers associations had raised objections to the practice. [nwu] However, in March of 2020 the Archive took a daring step that pushed their version of the CDL into litigation: using the closing of many physical libraries due to the COVID pandemic as its rationale, the Archive renamed its lending service the National Emergency Library [nel] and eliminated the one-to-one link between physical and digital copies. Ironically this meant that the Archive was then actually doing what the book industry had accused it of (either out of misunderstanding or as an exaggeration of the threat posed): it was making and lending digital copies beyond its physical holdings. The Archive stated that the National Emergency Library would last only until June of 2020, presumably because by then the COVID danger would have passed and libraries would have re-opened. In June the Archive’s book lending service returned to the one-to-one model. Also in June a suit was filed by four publishers (Hachette, HarperCollins, Penguin Random House, and Wiley) in the US District Court of the Southern District of New York. [suit] 

The Controlled Digital Lending, like the Google Books project, holds many interesting questions about the nature of “digital vs physical,” not only in a legal sense but in a sense of what it means to read and to be a reader today. The lawsuit not only does not further our understanding of this fascinating question; it sinks immediately into hyperbole, fear-mongering, and either mis-information or mis-direction. That is, admittedly, the nature of a lawsuit. What follows here is not that analysis but gives a few of the questions that are foremost in my mind.

 Apples and Oranges 

 Each of the players in this drama has admirable reasons for their actions. The publishers explain in their suit that they are acting in support of authors, in particular to protect the income of authors so that they may continue to write. The Authors’ Guild provides some data on author income, and by their estimate the average full-time author earns less than $20,000 per year, putting them at poverty level.[aghard] (If that average includes the earnings of highly paid best selling authors, then the actual earnings of many authors is quite a bit less than that.) 

The Internet Archive is motivated to provide democratic access to the content of books to anyone who needs or wants it. Even before the pandemic caused many libraries to close the collection housed at the Archive contained some works that are available only in a few research libraries. This is because many of the books were digitized during the Google Books project which digitized books from a small number of very large research libraries whose collections differ significantly from those of the public libraries available to most citizens. 

Where the pronouncements of both parties fail is in making a false equivalence between some authors and all authors, and between some books and all books, and the result is that this is a lawsuit pitting apples against oranges. We saw in the lawsuits against Google that some academic authors, who may gain status based on their publications but very little if any income, did not see themselves as among those harmed by the book digitization project. Notably the authors in this current suit, as listed in the bibliography of pirated books in the appendix to the lawsuit, are ones whose works would be characterized best as “popular” and “commercial,” not academic: James Patterson, J. D. Salinger, Malcolm Gladwell, Toni Morrison, Laura Ingalls Wilder, and others. Not only do the living authors here earn above the poverty level, all of them provide significant revenue for the publishers themselves. And all of the books listed are in print and available in the marketplace. No mention is made of out-of-print books, no academic publishers seem to be involved. 

On the part of the Archive, they state that their digitized books fill an educational purpose, and that their collection includes books that are not available in digital format from publishers:

“ While Overdrive, Hoopla, and other streaming services provide patrons access to latest best sellers and popular titles,  the long tail of reading and research materials available deep within a library’s print collection are often not available through these large commercial services.  What this means is that when libraries face closures in times of crisis, patrons are left with access to only a fraction of the materials that the library holds in its collection.”[cdl-blog]

This is undoubtedly true for some of the digitized books, but the main thesis of the lawsuit points out that the Archive has digitized and is also lending current popular titles. The list of books included in the appendix of the lawsuit shows that there are in-copyright and most likely in-print books of a popular reading nature that have been part of the CDL. These titles are available in print and may also be available as ebooks from the publishers. Thus while the publishers are arguing that current, popular books should not be digitized and loaned (apples), the Archive is arguing that they are providing access to items not available elsewhere, and for educational purposes (oranges). 

The Law 

The suit states that publishers are not questioning copyright law, only violations of the law.

“For the avoidance of doubt, this lawsuit is not about the occasional transmission of a title under appropriately limited circumstances, nor about anything permissioned or in the public domain. On the contrary, it is about IA’s purposeful collection of truckloads of in-copyright books to scan, reproduce, and then distribute digital bootleg versions online.” ([Suit] Page 3).

This brings up a whole range of legal issues in regard to distributing digital copies of copyrighted works. There have been lengthy arguments about whether copyright law could permit first sale rights for digital items, and the answer has generally been no; some copyright holders have made the argument that since transfer of a digital file is necessarily the making of a copy there can be no first sale rights for those files. [1stSale] [ag1] Some ebook systems, such as the Kindle, have allowed time-limited person-to-person lending for some ebooks. This is governed by license terms between Amazon and the publishers, not by the first sale rights of the analog world. 

Section 108 of the copyright law does allow libraries and archives to make a limited number of copies The first point of section 108 states that libraries can make a single copy of a work as long as 1) it is not for commercial advantage, 2) the collection is open to the public and 3) the reproduction includes the copyright notice from the original. This sounds to be what the Archive is doing. However, the next two sections (b and c) provide limitations on that first section that appear to put the Archive in legal jeopardy: section “b” clarifies that copies may be made for preservation or security; section “c” states that the copies can be made if the original item is deteriorating and a replacement can no longer be purchased. Neither of these applies to the Archive’s lending. 

 In addition to its lending program, the Archive provides downloads of scanned books in DAISY format for those who are certified as visually impaired by the National Library Service for the Blind and Physically Handicapped in the US. This is covered in 121A of the copyright law, Title17, which allows the distribution of copyrighted works in accessible formats. This service could possibly be cited as a justification of the scanning of in-copyright works at the Archive, although without mitigating the complaints about lending those copies to others. This is a laudable service of the Archive if scans are usable by the visually impaired, but the DAISY-compatible files are based on the OCR’d text, which can be quite dirty. Without data on downloads under this program it is hard to know the extent to which this program benefits visually impaired readers. 


Most likely as part of the strategy of the lawsuit, very little mention is made of “lending.” Instead the suit uses terms like “download” and “distribution” which imply that the user of the Archive’s service is given a permanent copy of the book

“With just a few clicks, any Internet-connected user can download complete digital copies of in-copyright books from Defendant.” ([suit] Page 2). “... distributing the resulting illegal bootleg copies for free over the Internet to individuals worldwide.” ([suit] Page 14).
Publishers were reluctant to allow the creation of ebooks for many years until they saw that DRM would protect the digital copies. It then was another couple of years before they could feel confident about lending - and by lending I mean lending by libraries. It appears that Overdrive, the main library lending platform for ebooks, worked closely with publishers to gain their trust. The lawsuit questions whether the lending technology created by the Archive can be trusted.
“...Plaintiffs have legitimate fears regarding the security of their works both as stored by IA on its servers” ([suit] Page 47).

In essence, the suit accuses IA of a lack of transparency about its lending operation. Of course, any collaboration between IA and publishers around the technology is not possible because the two are entirely at odds and the publishers would reasonably not cooperate with folks they see as engaged in piracy of their property. 

Even if the Archive’s lending technology were proven to be secure, lending alone is not the issue: the Archive copied the publishers’ books without permission prior to lending. In other words, they were lending content that they neither owned (in digital form) nor had licensed for digital distribution. Libraries pay, and pay dearly, for the ebook lending service that they provide to their users. The restrictions on ebooks may seem to be a money-grab on the part of publishers, but from their point of view it is a revenue stream that CDL threatens. 

Is it About the Money?

“... IA rakes in money from its infringing services…” ([suit] Page 40). (Note: publishers earn, IA “rakes in”)
“Moreover, while Defendant promotes its non-profit status, it is in fact a highly commercial enterprise with millions of dollars of annual revenues, including financial schemes that provide funding for IA’s infringing activities. ([suit] Page 4).

These arguments directly address section (a)(1) of Title 17, section 108: “(1) the reproduction or distribution is made without any purpose of direct or indirect commercial advantage”. 

At various points in the suit there are references to the Archive’s income, both for its scanning services and donations, as well as an unveiled show of envy at the over $100 million that Brewster Kahle and his wife have in their jointly owned foundation. This is an attempt to show that the Archive derives “direct or indirect commercial advantage” from CDL. Non-profit organizations do indeed have income, otherwise they could not function, and “non-profit” does not mean a lack of a revenue stream, it means returning revenue to the organization instead of taking it as profit. The argument relating to income is weakened by the fact that the Archive is not charging for the books it lends. However, much depends on how the courts will interpret “indirect commercial advantage.” The suit argues that the Archive benefits generally from the scanned books because this enhances the Archive’s reputation which possibly results in more donations. There is a section in the suit relating to the “sponsor a book” program where someone can donate a specific amount to the Archive to digitize a book. How many of us have not gotten a solicitation from a non-profit that makes statements like: “$10 will feed a child for a day; $100 will buy seed for a farmer, etc.”? The attempt to correlate free use of materials with income may be hard to prove. 


Decades ago, when the service Questia was just being launched (Questia ceased operation December 21, 2020), Questia sales people assured a group of us that their books were for “research, not reading.” Google used a similar argument to support its scanning operation, something like “search, not reading.” The court decision in Google’s case decided that Google’s scanning was fair use (and transformative) because the books were not available for reading, as Google was not presenting the full text of the book to its audience.[suit-g] 

The Archive has taken the opposite approach, a “books are for reading” view. Beginning with public domain books, many from the Google books project, and then with in-copyright books, the Archive has promoted reading. It developed its own in-browser reading software to facilitate reading of the books online. [reader] (*See note below)

Although the publishers sued Google for its scanning, they lost due to the “search, not reading” aspect of that project. The Archive has been very clear about its support of reading, which takes the Google justification off the table. 

“Moreover, IA’s massive book digitization business has no new purpose that is fundamentally different than that of the Publishers: both distribute entire books for reading.” ([suit] Page 5). 

 However, the Archive's statistics on loaned books shows that a large proportion of the books are used for 30 minutes or less. 

“Patrons may be using the checked-out book for fact checking or research, but we suspect a large number of people are browsing the book in a way similar to browsing library shelves.” [ia1] 

 In its article on the CDL, the Center for Democracy and Technology notes that “the majority of books borrowed through NEL were used for less than 30 minutes, suggesting that CDL’s primary use is for fact-checking and research, a purpose that courts deem favorable in a finding of fair use.” [cdt] The complication is that the same service seems to be used both for reading of entire books and as a place to browse or to check individual facts (the facts themselves cannot be copyrighted). These may involve different sets of books, once again making it difficult to characterize the entire set of digitized books under a single legal claim. 

The publishers claim that the Archive is competing with them using pirated versions of their own products. That leads us to the question of whether the Archive’s books, presented for reading, are effectively substitutes for those of the publishers. Although the Archive offers actual copies, those copies that are significantly inferior to the original. However, the question of quality did not change the judgment in the lawsuit against copying of texts by Kinko’s [kinkos], which produced mediocre photocopies from printed and bound publications. It seems unlikely that the quality differential will serve to absolve the Archive from copyright infringement even though the poor quality of some of the books interferes with their readability. 

Digital is Different

Publishers have found a way to monetize digital versions, in spite of some risks, by taking advantage of the ability to control digital files with technology and by licensing, not selling, those files to individuals and to libraries. It’s a “new product” that gets around First Sale because, as it is argued, every transfer of a digital file makes a copy, and it is the making of copies that is covered by copyright law. [1stSale] 

The upshot of this is that because a digital resource is licensed, not sold, the right to pass along, lend, or re-sell a copy (as per Title 17 section 109) does not apply even though technology solutions that would delete the sender’s copy as the file safely reaches the recipient are not only plausible but have been developed. [resale] 

“Like other copyright sectors that license education technology or entertainment software, publishers either license ebooks to consumers or sell them pursuant to special agreements or terms.” ([suit] Page 15)

“When an ebook customer obtains access to the title in a digital format, there are set terms that determine what the user can or cannot do with the underlying file.”([suit] Page 16)

This control goes beyond the copyright holder’s rights in law: DRM can exercise controls over the actual use of a file, limiting it to specific formats or devices, allowing or not allowing text-to-speech capabilities, even limiting copying to the clipboard.

Publishers and Libraries 

The suit claims that publishers and libraries have reached an agreement, an equilibrium.

“To Plaintiffs, libraries are not just customers but allies in a shared mission to make books available to those who have a desire to read, including, especially, those who lack the financial means to purchase their own copies.” ([suit] Page 17).
In the suit, publishers contrast the Archive’s operation with the relationship that publishers have with libraries. In contrast with the Archive’s lending program, libraries are the “good guys.”
“... the Publishers have established independent and distinct distribution models for ebooks, including a market for lending ebooks through libraries, which are governed by different terms and expectations than print books.”([suit] Page 6).
These “different terms” include charging much higher prices to libraries for ebooks, limiting the number of times an ebook can be loaned. [pricing1] [pricing2]
“Legitimate libraries, on the other hand, license ebooks from publishers for limited periods of time or a limited number of loans; or at much higher prices than the ebooks available for individual purchase.” [agol]
The equilibrium of which publishers speak looks less equal from the library side of the equation: library literature is replete with stories about the avarice of publishers in relation to library lending of ebooks. Some authors/publishers even speak out against library lending of physical books, claiming that this cuts into sales. (This same argument has been made for physical books.)
“If, as Macmillan has determined, 45% of ebook reads are occurring through libraries and that percentage is only growing, it means that we are training readers to read ebooks for free through libraries instead of buying them. With author earnings down to new lows, we cannot tolerate ever-decreasing book sales that result in even lower author earnings.” [agliblend][ag42]

The ease of access to digital books has become a boon for book sales, and ebook sales are now rising while hard copy sales fall. This economic factor is a motivator for any of those engaged with the book market. The Archive’s CDL is a direct affront to the revenue stream that publishers have carved out for specific digital products. There are indications that the ease of borrowing of ebooks - not even needing to go to the physical library to borrow a book - is seen as a threat by publishers. This has already played out in other media, from music to movies. 

It would be hard to argue that access to the Archive’s digitized books is merely a substitute for library access. Many people do not have actual physical library access to the books that the Archive lends, especially those digitized from the collections of academic libraries. This is particularly true when you consider that the Archive’s materials are available to anyone in the world with access to the Internet. If you don’t have an economic interest in book sales, and especially if you are an educator or researcher, this expanded access could feel long overdue. 

We need numbers 

We really do not know much about the uses of the Archive’s book collection. The lawsuit cites some statistics of “views” to show that the infringement has taken place, but the page in question does not explain what is meant by a “view”. Archive pages for downloadable files of metadata records also report “views” which most likely reflect views of that web page, since there is nothing viewable other than the page itself. Open Library book pages give “currently reading” and “have read” stats, but these are tags that users can manually add to the page for the work. To compound things, the 127 books cited in the suit have been removed from the lending service (and are identified in the Archive as being in the collection “litigation works

Although numbers may not affect the legality of the controlled digital lending, the social impact of the Archive’s contribution to reading and research would be clearer if we had this information. Although the Archive has provided a small number of testimonials, a proof of use in educational settings would bolster the claims of social benefit which in turn could strengthen a fair use defense. 


(*) The NWU has a slide show [nwu2] that explains what it calls Controlled Digital Lending at the Archive. Unfortunately this document conflates the Archive's book Reader with CDL and therefore muddies the water. It muddies it because it does not distinguish between sending files to dedicated devices (which is what Kindle is) or dedicated software like what libraries use via software like Libby, and the Archive's use of a web-based reader. It is not beyond reason to suppose that the Archive's Reader software does not fully secure loaned items. The NWU claims that files are left in the browser cache that represent all book pages viewed: "There’s no attempt whatsoever to restrict how long any user retains these images". (I cannot reproduce this. In my minor experiments those files disappear at the end of the lending period, but this requires more concerted study.) However, this is not a fault of CDL but a fault of the Reader software. The reader is software that works within a browser window. In general, electronic files that require secure and limited use are not used within browsers, which are general purpose programs.

Conflating the Archive's Reader software with Controlled Digital Lending will only hinder understanding. Already CDL has multiple components:

  1. Digitization of in-copyright materials
  2. Lending of digital copies of in-copyright materials that are owned by the library in a 1-to-1 relation to physical copies

We can add #3, the leakage of page copies via the browser cache, but I maintain that poorly functioning software does not automatically moot points 1 and 2. I would prefer that we take each point on its own in order to get a clear idea of the issues.

The NWU slides also refer to the Archive's API which allows linking to individual pages within books. This is an interesting legal area because it may be determined to be fair use regardless of the legality of the underlying copy. This becomes yet another issue to be discussed by the legal teams, but it is separate from the question of controlled digital lending. Let's stay focused.













[nwu] "Appeal from the victims of Controlled Digital Lending (CDL)". (Retrieved 2021-01-10) 

[nwu2] "What is the Internet Archive doing with our books?"



[reader] Bookreader 




Thursday, June 25, 2020

Women designing

Those of us in the library community are generally aware of our premier "designing woman," the so-called "Mother of MARC," Henriette Avram. Avram designed the MAchine Reading Cataloging record in the mid-1960's, a record format that is still being used today. MARC was way ahead of its time using variable length data fields and a unique character set that was sufficient for most European languages, all thanks to Avram's vision and skill. I'd like to introduce you here to some of the designing women of the University of California library automation project, the project that created one of the first online catalogs in the beginning of the 1980's, MELVYL. Briefly, MELVYL was a union catalog that combined data from the libraries of the nine (at that time) University of California campuses. It was first brought up as a test system in 1980 and went "live" to the campuses in 1982.

Work on the catalog began in or around 1980, and various designs were put forward and tested. Key designers were Linda Gallaher-Brown, who had one of the first masters degrees in computer science from UCLA, and Kathy Klemperer, who like many of us was a librarian turned systems designer.

We were struggling with how to create a functional relational database of bibliographic data (as defined by the MARC record) with computing resources that today would seem laughable but were "cutting edge" for that time. I remember Linda remarking that during one of her school terms she returned to her studies to learn that the newer generation of computers would have this thing called an "operating system" and she thought "why would you need one?" By the time of this photo she had come to appreciate what an operating system could do for you. The one we used at the time was IBM's OS 360/370.

Kathy Klemperer was the creator of the database design diagrams that were so distinctive we called them "Klemperer-grams." Here's one from 1985:
MELVYL database design Klemperer-gram, 1985
Drawn and lettered by hand, not only did these describe a workable database design, they were impressively beautiful. Note that this not only predates the proposed 2009 RDA "database scenario" for a relational bibliographic design by 24 years, it provides a more detailed and most likely a more accurate such design.
RDA "Scenario 1" data design, 2009
In the early days of the catalog we had a separate file and interface for the cataloged serials based on a statewide project (including the California State Universities). Although it was possible to catalog serials in the MARC format, the systems that had the detailed information about which issues the libraries held was stored in serials control databases that were separate from the library catalog, and many serials were represented by crusty cards that had been created decades before library automation. The group below developed and managed the CALLS (California Academic Library List of Serials). Four of those pictured were programmers, two were serials data specialists, and four had library degrees. Obviously, these are overlapping sets. The project heads were Barbara Radke (right) and Theresa Montgomery (front, second from right).

At one point while I was still working on the MELVYL project, but probably around the very late 1990's or early 2000's, I gathered up some organization charts that had been issued over the years and quickly calculated that during its history the project the technical staff that had created this early marvel had varied from 3/4 to 2/3 female. I did some talks at various conferences in which I called MELVYL a system "created by women." At my retirement in 2003 I said the same thing in front of the entire current staff, and it was not well-received by all. In that audience was one well-known member of the profession who later declared that he felt women needed more mentoring in technology because he had always worked primarily with men, even though he had indeed worked in an organization with a predominantly female technical staff, and another colleague who was incredulous when I stated once that women are not a minority, but over 50% of the world's population. He just couldn't believe it.

While outright discrimination and harassment of women are issues that need to be addressed, the invisibility of women in the eyes of their colleagues and institutions is horribly damaging. There are many interesting projects, not the least the Wikipedia Women in Red, that aim to show that there is no lack of accomplished women in the world, it's the acknowledgment of their accomplishments that falls short. In the library profession we have many women whose stories are worth telling. Please, let's make sure that future generations know that they have foremothers to look to for inspiration.

Monday, May 25, 2020


I've been trying to capture what I remember about the early days of library automation. Mostly my memory is about fun discoveries in my particular area (processing MARC records into the online catalog). I did run into an offprint of some articles in ITAL from 1982 (*) which provide very specific information about the technical environment, and I thought some folks might find that interesting. This refers to the University of California MELVYL union catalog, which at the time had about 800,000 records.

Operating system: IBM 360/370
Programming language: PL/I
CPU: 24 megabytes of memory
Storage: 22 disk drives, ~ 10 gigabytes

The disk drives were each about the size of an industrial washing machine. In fact, we referred to the room that held them as "the laundromat."

Telecommunications was a big deal because there was no telecommunications network linking the libraries of the University of California. There wasn't even one connecting the campuses at all. The article talks about the various possibilities, from an X.25 network to the new TCP/IP protocol that allows "internetwork communication." The first network was a set of dedicated lines leased from the phone company that could transmit 120 characters per second (character = byte) to about 8 ASCII terminals at each campus over a 9600 baud line. There was a hope to be able to double the number of terminals.

In the speculation about the future, there was doubt that it would be possible to open up the library system to folks outside of the UC campuses, much less internationally. (MELVYL was one of the early libraries to be open access worldwide over the Internet, just a few years later.) It was also thought that libraries would charge other libraries to view their catalogs, kind of like an inter-library loan.

And for anyone who has an interest in Z39.50, one section of the article by David Shaughnessy and Clifford Lynch on telecommunications outlines a need for catalog-to-catalog communication which sounds very much like the first glimmer of that protocol.


(*) Various authors in a special edition: (1982). In-Depth: University of California MELVYL. Information Technology and Libraries, 1(4)

I wish I could give a better citation but my offprint does not have page numbers and I can't find this indexed anywhere. (Cue here the usual irony that libraries are terrible at preserving their own story.)

Monday, April 27, 2020

Ceci n'est pas une Bibliothèque

On March 24, 2020, the Internet Archive announced that it would "suspend waitlists for the 1.4 million (and growing) books in our lending library," a service they then named The National Emergency Library. These books were previously available for lending on a one-to-one basis with the physical book owned by the Archive, and as with physical books users would have to wait for the book to be returned before they could borrow it. Worded as a suspension of waitlists due to the closure of schools and libraries caused by the presence of the coronavirus-19, this announcement essentially eliminated the one-to-one nature of the Archive's Controlled Digital Lending program. Publishers were already making threatening noises about the digital lending when it adhered to lending limitations, and surely will be even more incensed about this unrestricted lending.

I am not going to comment on the legality of the Internet Archive's lending practices. Legal minds, perhaps motivated by future lawsuits, will weigh in on that. I do, however, have much to say on the use of the term "library" for this set of books. It's a topic worthy of a lengthy treatment, but I'll give only a brief account here.


The roots “LIBR…” and “BIBLIO…” both come down to us from ancient words for trees and tree bark. It is presumed that said bark was the surface for early writings. “LIBR…”, from the Latin word liber meaning “book,” in many languages is a prefix that indicates a bookseller’s shop, while in English it has come to mean a collection of books and from that also the room or building where books are kept. “BIBLIO…” derives instead from the Greek biblion (one book) and biblia (books, plural). We get the word Bible through the Greek root, which leaked into old Latin and meant The Book.

Therefore it is no wonder that in the minds of many people, books = library.  In fact, most libraries are large collections of books, but that does not mean that every large collection of books is a library. Amazon has a large number of books, but is not a library; it is a store where books are sold. Google has quite a few books in its "book search" and even allows you to view portions of the books without payment, but it is also not a library, it's a search engine. The Internet Archive, Amazon, and Google all have catalogs of metadata for the books they are offering, some of it taken from actual library catalogs, but a catalog does not make a quantity of books into a library. After all, Home Depot has a catalog, Walmart has a catalog; in essence, any business with an inventory has a catalog.
"...most libraries are large collections of books, but that does not mean that every large collection of books is a library."

The Library Test

First, I want to note that the Internet Archive has met the State of California test to be defined as a library, and this has made it possible for the Archive to apply for library-related grants for some of its projects. That is a Good Thing because it has surely strengthened the Archive and its activities. However, it must be said that the State of California requirements are pretty minimal, and seem to be limited to a non-profit organization making materials available to the general public without discrimination. There doesn't seem to be a distinction between "library" and "archive" in the state legal code, although librarians and archivists would not generally consider them easily lumped together as equivalent services.

The Collection

The Archive's blog post says "the Internet Archive currently lends about as many as a US library that serves a population of about 30,000." As a comparison, I found in the statistics gathered by the California State Library those of the Benicia Public Library in Benicia California. Benicia is a city with a population of 31,000; the library has about 88,000 books. Well, you might say, that's not as good as over one million books at the Internet Archive. But, here's the thing: those are not 88,000 random books, they are books chosen to be, as far as the librarians could know, the best books for that small city. If Benicia residents were, for example, primarily Chinese-speaking, the library would surely have many books in Chinese. If the city had a large number of young families then the children's section would get particular attention. The users of the Internet Archive's books are a self-selected (and currently un-defined) set of Internet users. Equally difficult to define is the collection that is available to them:
This library brings together all the books from Phillips Academy Andover and Marygrove College, and much of Trent University’s collections, along with over a million other books donated from other libraries to readers worldwide that are locked out of their libraries.
Each of these is (or was, in the case of Marygrove, which has closed) a collection tailored to the didactic needs of that institution. How one translates that, if one can, to the larger Internet population is unknown. That a collection has served a specific set of users does not mean that it can serve all users equally well. Then there is that other million books, which are a complete black box.

Library science

I've argued before against dumping a large and undistinguished set of books on a populace, regardless of the good intentions of those doing so. Why not give the library users of a small city these one million books? The main reason is the ability of the library to fulfill the 5 Laws of Library Science:
  1. Books are for use.
  2. Every reader his or her book.
  3. Every book its reader.
  4. Save the time of the reader.
  5. The library is a growing organism. [0]
The online collection of the Internet Archive nicely fulfills laws 1 and 5: the digital books are designed for use, and the library can grow somewhat indefinitely. The other three laws are unfortunately hindered by the somewhat haphazard nature of the set of books, combined with the lack of user services.

Of the goals of librarianship, matching readers to books is the most difficult. Let's start with law 3, "every book its reader." When you follow the URL to the National Emergency Library, you see something like this:
The lack of cover art is not the problem here. Look at what books you find: two meeting reports, one journal publication, and a book about hand surgery, all from 1925. Scroll down for a bit and you will find it hard to locate items that are less obscure than this, although undoubtedly there are some good reads in this collection. These are not the books whose readers will likely be found in our hypothetical small city. These are books that even some higher education institutions would probably choose not to have in their collections. While these make the total number of available books large, they may not make the total number of useful books large. Winnowing this set to one or more (probably more) wheat-filled collections could greatly increase the usability of this set of books.

"While these make the total number of available books large, they may not make the total number of useful books large."

A large "anything goes" set of documents is a real challenge for laws 2 and 4: every reader his or her book, and save the time of the reader. The more chaff you have the harder it is for a library user to find the wheat they are seeking. The larger the collection the more of the burden is placed on the user to formulate a targeted search query and to have the background to know which items to skip over. The larger the retrieved set, the less likely that any user will scroll through the entire display to find the best book for their purposes. This is the case for any large library catalog, but these libraries have built their collection around a particular set of goals. Those goals matter. Goals are developed to address a number of factors, like:
  • What are the topics of interest to my readers and my institution?
  • How representative must my collection be in each topic area?
  • What are the essential works in each topic area?
  • What depth of coverage is needed for each topic? [1]
If we assume (and we absolutely must assume this) that the user entering the library is seeking information that he or she lacks, then we cannot expect users to approach the library as an expert in the topic being researched. Although anyone can type in a simple query, fewer can assess the validity and the scope of the results. A search on "California history" in the National Emergency Library yields some interesting-looking books, but are these the best books on the topic? Are any key titles missing? These are the questions that librarians answer when developing collections.

The creation of a well-rounded collection is a difficult task. There are actual measurements that can be run against library collections to determine if they have the coverage that can be expected compared to similar libraries. I don't know if any such statistical packages can look beyond quantitative measures to judge the quality of the collection; the ones I'm aware of look at call number ranges, not individual titles.  There

Library Service

The Archive's own documentation states that "The Internet Archive focuses on preservation and providing access to digital cultural artifacts. For assistance with research or appraisal, you are bound to find the information you seek elsewhere on the internet." After which it advises people to get help through their local public library. Helping users find materials suited to their need is a key service provided by libraries. When I began working in libraries in the dark ages of the 1960's, users generally entered the library and went directly to the reference desk to state the question that brought them to the institution. This changed when catalogs went online and were searchable by keyword, but prior to then the catalog in a public library was primarily a tool for librarians to use when helping patrons. Still, libraries have real or virtual reference desks because users are not expected to have the knowledge of libraries or of topics that would allow them to function entirely on their own. And while this is true for libraries it is also true, perhaps even more so, for archives whose collections can be difficult to navigate without specialized information. Admitting that you give no help to users seeking materials makes the use of the term "library" ... unfortunate.

What is to be done?

There are undoubtedly a lot of useful materials among the digital books at the Internet Archive. However, someone needing materials has no idea whether they can expect to find what they need in this amalgamation. The burden of determining whether the Archive's collection might suit their needs is left entirely up to the members of this very fuzzy set called "Internet users." That the collection lends at the rate of a public library serving a population of 30,000 shows that it is most likely under-utilized. Because the nature of the collection is unknown one can't approach, say, a teacher of middle-school biology and say: "they've got what you need." Yet the Archive cannot implement a policy to complete areas of the collection unless it knows what it has as compared to known needs.

"... these warehouses of potentially readable text will remain under-utilized until we can discover a way to make them useful in the ways that libraries have proved to be useful."

I wish I could say that a solution would be simple - but it would not. For example, it would be great to extract from this collection works that are commonly held in specific topic areas in small, medium and large libraries. The statistical packages that analyze library holdings all are, AFAIK, proprietary. (If anyone knows of an open source package that does this, please shout it out!) If would also be great to be able to connect library collections of analog books to their digital equivalents. That too is more complex than one would expect, and would have to be much simpler to be offered openly. [2]

While some organizations move forward with digitizing books and other hard copy materials, these warehouses of potentially readable text will remain under-utilized until we can discover a way to make them useful in the ways that libraries have proved to be useful. This will mean taking seriously what modern librarianship has developed over its circa 2 centuries, and in particular those 5 laws that give us a philosophy to guide our vision of service to the users of libraries.


[0] Even if you are familiar with the 5 laws you may not know that Ranganathan was not as succinct as this short list may imply. The book in which he introduces these concepts is over 450 pages long, with extended definitions and many homey anecdotes and stories.

[1] A search on "collection development policy" will yield many pages of policies that you can peruse. To make this a "one click" here are a few *non-representative* policies that you can take a peek at:
[2] Dan Scott and I did a project of this nature with a Bay Area public library and it took a huge amount of human intervention to determine whether the items matched were really "equivalent". That's a discussion for another time, but, man, books are more complicated than they appear.

Monday, February 03, 2020

Use the Leader, Luke!

If you learned the MARC format "on the job" or in some other library context you may have learned that the record is structured as fields with 3-digit tags, each with two numeric indicators, and that subfields have a subfield indicator (often shown as "$" because it is a non-printable character) and a single character subfield code (a-z, 0-9). That is all true for the MARC records that libraries create and process, but the MAchine Readable Cataloging standard (Z39.2 or ISO 2709) has other possibilities that we are not using. Our "MARC" (currently MARC21) is a single selection from among those possibilities, in essence an application profile of the MARC standard. The key to the possibilities afforded by MARC is in the MARC Leader, and in particular in two positions that our systems generally ignore because they always contain the same values in our data:
Leader byte 10 -- Indicator count
Leader byte 11 -- Subfield code length
In MARC21 records, Leader byte 10 is always "2" meaning that fields have 2-byte indicators, and Leader byte 11 is always 2 because the subfield code is always two characters in length. That was a decision made early on in the life of MARC records in libraries, and it's easy to forget that there were other options that were not taken. Let's take a short look at the possibilities the record format affords beyond our choice.

Both of these Leader positions are single bytes that can take values from 0 to 9. An application could use the MARC record format and have zero indicators. It isn't hard to imagine an application that has no need of indicators or that has determined to make use of subfields in their stead. As an example, the provenance of vocabulary data for thesauri like LCSH or the Art and Architecture Thesaurus could always be coded in a subfield rather than in an indicator:
650 $a Religion and science $2 LCSH
Another common use of indicators in MARC21 is to give a byte count for the non-filing initial articles on title strings. Istead of using an indicator value for this some libraries outside of the US developed a non-printing code to make the beginning and end of the non-filing portion. I'll use backslashes to represent these codes in this example:
245 $a \The \Birds of North America
I am not saying that all indicators in MARC21 should or even could be eliminated, but that we shouldn't assume that our current practice is the only way to code data.

In the other direction, what if you could have more than two indicators? The MARC record would allow you have have as many as nine. In addition, there is nothing to say that each byte in the indicator has to be a separate data element; you could have nine indicator positions that were defined as two data elements (4 + 5), or some other number (1 + 2 + 6). Expanding the number of indicators, or beginning with a larger number, could have prevented the split in provenance codes for subject vocabularies between one indicator value and the overflow subfield, $2, when the number exceeded the capability of a single numerical byte. Having three or four bytes for those codes in the indicator and expanding the values to include a-z would have been enough to include the full list of authorities for the data in the indicators. (Although I would still prefer putting them all in $2 using the mnemonic codes for ease of input.)

In the first University of California union catalog in the early 1980's we expanded the MARC indicators to hold an additional two bytes (or was it four?) so that we could record, for each MARC field, which library had contributed it. Our union catalog record was a composite MARC record with fields from any and all of the over 300 libraries across the University of California system that contributed to the union catalog as dozen or so separate record feeds from OCLC and RLIN. We treated the added indicator bytes as sets of bits, turning on bits to represent the catalog feeds from the libraries. If two or more libraries submitted exactly the same MARC field we stored the field once and turned on a bit for each separate library feed. If a library submitted a field for a record that was new to the record, we added the field and turned on the appropriate bit. When we created a user display we selected fields from only one of the libraries. (The rules for that selection process were something of a secret so as not to hurt anyone's feelings, but there was a "best" record for display.) It was a multi-library MARC record, made possible by the ability to use more than two indicators.

Now on to the subfield code. The rule for MARC21 is that there is a single subfield code and that is a lower case a-z and 0-9. The numeric codes have special meaning and do not vary by field; the alphabetic codes aare a bit more flexible. That gives use 26 possible subfields per tag, plus the 10 pre-defined numeric ones. The MARC21 standard has chosen to limit the alphabetic subfield codes to lower case characters. As the fields reached the limits of the available subfield codes (and many did over time) you might think that the easiest solution would be to allow upper case letters as subfield codes. Although the subfield code limitation was reached decades ago for some fields I can personally attest to the fact that suggesting the expansion of subfield codes to upper case letters was met with horrified glares at the MARC standards meeting. While clearly in 1968 the range of a-z seemed ample, that has not be the case for nearly half of the life-span of MARC.

The MARC Leader allows one to define up to 9 characters total for subfield codes. The value in this Leader position includes the subfield delimiter so this means that you can have a subfield delimiter and up to 8 characters to encode a subfield. Even expanding from a-z to aa-zz provides vastly more possibilities, and allow upper case as well give you a dizzying array of choices.

The other thing to mention is that there is no prescription that field tags must be numeric. They are limited to three characters in the MARC standard, but those could be a-z, A-Z, 0-9, not just 0-9, greatly expanding the possibilities for adding new tags. In fact, if you have been in the position to view internal systems records in your vendor system you may have been able to see that non-numeric tags have been used for internal system purposes, like noting who made each edit, whether functions like automated authority control have been performed on the record, etc. Many of the "violations" of the MARC21 rules listed here have been exploited internally -- and since early days of library systems.

There are other modifiable Leader values, in particular the one that determines the maximum length of a field, Leader 20. MARC21 has Leader 20 set at "4" meaning that fields cannot be longer than 9999. That could be longer, although the record size itself is set at only 5 bytes, so a record cannot be longer than 99999. However, one could limit fields to 999 (Leader value 20 set at "3") for an application that does less pre-composing of data compared to MARC21 and therefore comfortably fits within a shorter field length. 

The reason that has been given, over time, why none of these changes were made was always: it's too late, we can't change our systems now. This is, as Caesar might have said, cacas tauri. Systems have been able to absorb some pretty intense changes to the record format and its contents, and a change like adding more subfield codes would not be impossible. The problem is not really with the MARC21 record but with our inability (or refusal) to plan and execute the changes needed to evolve our systems. We could sit down today and develop a plan and a timeline. If you are skeptical, here's an example of how one could manage a change in length to the subfield codes:

a MARC21 record is retrieved for editing
  1. read the Leader 10 of the MARC21 record
  2. if the value is "2" and you need to add a new subfield that uses the subfield code plus two characters, convert all of the subfield codes in the record:
    • $a becomes $aa, $b becomes $ba, etc.
    • $0 becomes $01, $1 becomes $11, etc.
    • Leader 10 code is changed to "3"
  3. (alternatively, convert all records opened for editing)

a MARC21 record is retrieved for display
  1. read the Leader 10 of the MARC21 record
  2. if the value is "2" use the internal table of subfield codes for records with the value "2"
  3. if the value is "3" use the internal table of subfield codes for records with the value "3"

Sounds impossible? We moved from AACR to AACR2, and now from AACR2 to RDA without going back and converting all of our records to the new content.  We have added new fields to our records, such as the 336, 337, 338 for RDA values, without converting all of the earlier records in our files to have these fields. The same with new subfields, like $0, which has only been added in recent years. Our files have been using mixed record types for at least a couple of generations -- generations of systems and generations of catalogers.

Alas, the time to make these kinds of changes this was many years ago. Would it be worth doing today? That depends on whether we anticipate a change to BIBFRAME (or some other data format) in the near future. Changes do continue to be made to the MARC21 record; perhaps it would have a longer future if we could broach the subject of fixing some of the errors that were introduced in the past, in particular those that arose because of the limitations of MARC21 that could be rectified with an expansion of that record standard. That may also help us not carry over some of the problems in MARC21 that are caused by these limitations to a new record format that does not need to be limited in these ways.


Although the MARC  record was incredibly advanced compared to other data formats of its time (the mid-1960's), it has some limitations that cannot be overcome within the standard itself. One obvious one is the limitation of the record length to 5 bytes. Another is the fact that there are only two levels of nesting of data: the field and the subfield. There are times when a sub-subfield would be useful, such as when adding information that relates to only one subfield, not the entire field (provenance, external URL link). I can't advocate for continuing the data format that is often called "binary MARC" simply because it has limitations that require work-arounds. MARCXML, as defined as a standard, gets around the field and record length limitations, but it is not allowed to vary from the MARC21 limitations on field and subfield coding. It would be incredibly logical to move to a "non-binary" record format (XML, JSON, etc.) beginning with the existing MARC21 and  to allow expansions where needed. It is the stubborn adherence to the ISO 2709 format really has limited library data, and it is all the more puzzling because other solutions that can keep the data itself intact have been available for many decades.

Tuesday, January 28, 2020


I was always a bit confused about the inclusion of "pamflets" in the subtitle of the Decimal System, such as this title page from the 1922 edition:

Did libraries at the time collect numerous pamphlets? For them to be the second-named type of material after books was especially puzzling.

I may have discovered an answer to my puzzlement, if not THE answer, in Andrea Costadoro's 1856 work:
A "pamphlet" in 1856 was not (necessarily) what I had in mind, which was a flimsy publication of the type given out by businesses, tourist destinations, or public health offices. In the 1800's it appears that a pamphlet was a literary type, not a physical format. Costadoro says:
"It has been a matter of discussion what books should be considered pamphlets and what not. If this appellation is intended merely to refer to the SIZE of the book, the question can be scarecely worth considering ; but if it is meant to refer to the NATURE of a work, it may be considered to be of the same class and to stand in the same connexion with the word Treatise as the words Tract ; Hints ; Remarks ; &c, when these terms are descriptive of the nature of the books to which they are affixed." (p. 42)
To be on the shelves of libraries, and cataloged, it is possible that these pamphlets were indeed bound, perhaps by the library itself. 

The Library of Congress genre list today has a cross-reference from "pamphlet" to "Tract (ephemera)". While Costadoro's definition doesn't give any particular subject content to the type of work, LC's definition says that these are often issued by religious or political groups for proselytizing. So these are pamphlets in the sense of the political pamphlets of our revolutionary war. Today they would be blog posts, or articles in Buzzfeed or Slate or any one of hundreds of online sites that post such content.

Churches I have visited often have short publications available near the entrance, and there is always the Watchtower, distributed by Jehovah's Witnesses at key locations throughout the world, and which is something between a pamphlet (in the modern sense) and a journal issue. These are probably not gathered in most libraries today. In Dewey's time the printing (and collecting by libraries) of sermons was quite common. In a world where many people either were not literate or did not have access to much reading material, the Sunday sermon was a "long form" work, read by a pastor who was probably not as eloquent as the published "stars" of the Sunday gatherings. Some sermons were brought together into collections and published, others were published (and seemingly bound) on their own.  Dewey is often criticized for the bias in his classification, but what you find in the early editions serves as a brief overview of the printed materials that the US (and mostly East Coast) culture of that time valued. 

What now puzzles me is what took the place of these tracts between the time of Dewey and the Web. I can find archives of political and cultural pamphlets in various countries and they all seem to end around the 1920's-30's, although some specific collections, such as the Samizdat publications in the Soviet Union, exist in other time periods.

Of course the other question now is: how many of today's tracts and treatises will survive if they are not published in book form?

Saturday, November 23, 2019

The Work

The word "work" generally means something brought about by human effort, and at times implies that this effort involves some level of creativity. We talk about "works of art" referring to paintings hanging on walls. The "works" of Beethoven are a large number of musical pieces that we may have heard. The "works" of Shakespeare are plays, in printed form but also performed. In these statements the "work" encompasses the whole of the thing referred to, from the intellectual content to the final presentation.

This is not the same use of the term as is found in the Library Reference Model (LRM). If you are unfamiliar with the LRM, it is the successor to FRBR (which I am assuming you have heard of) and it includes the basic concepts of work, expression, manifestation and item that were first introduced in that previous study. "Work," as used in the LRM is a concept designed for use in library cataloging data. It is narrower than the common use of the term illustrated in the previous paragraph and is defined thus:
Class: Work
Definition: An abstract notion of an artistic or intellectual creation.
In this definition the term only includes the idea of a non-corporeal conceptual entity, not the totality that would be implied in the phrase "the works of Shakespeare." That totality is described when the work is realized through an LRM-defined "expression" which in turn is produced in an LRM-defined "manifestation" with an LRM-defined "item" as its instance.* These four entities are generally referred to as a group with the acronym WEMI.

Because many in the library world are very familiar with the LRM definition of work, we have to use caution when using the word outside the specific LRM environment. In particular, we must not impose the LRM definition on uses of the work that are not intending that meaning. One should expect that the use of the LRM definition of work would be rarely found in any conversation that is not about the library cataloging model for which it was defined. However, it is harder to distinguish uses within the library world where one might expect the use to be adherent to the LRM.

To show this, I want to propose a particular use case. Let's say that a very large bibliographic database has many records of bibliographic description. The use case is that it is deemed to be easier for users to navigate that large database if they could get search results that cluster works rather than getting long lists of similar or nearly identical bibliographic items. Logically the cluster looks like this:

In data design, it will have a form something like this:

This is a great idea, and it does appear to have a similarity to the LRM definition of work: it is gathering those bibliographic entries that are judged to represent the same intellectual content. However, there are reasons why the LRM-defined work could not be used in this instance.

The first is that there is only one WEMI relationship for work, and that is from LRM work to LRM expression. Clearly the bibliographic records in this large library catalog are not LRM expressions; they are full bibliographic descriptions including, potentially, all of the entities defined in the LRM.

To this you might say: but there is expression data in the bibliographic record, so we can think of this work as linking to the expression data in that record. That leads us to the second reason: the entities of WEMI are defined as being disjoint. That means that no single "thing" can be more than one of those entities; nothing can be simultaneously a work and an expression, or any other combination of WEMI entities. So if the only link we have available in the model is from work to expression, unless we can somehow convince ourselves that the bibliographic record ONLY represents the expression (which it clearly does not since it has data elements from at least three of the LRM entities) any such link will violate the rule of disjointness.

Therefore, the work in our library system can have much in common with the conceptual definition of the LRM work, but it is not the same work entity as is defined in that model.

This brings me back to my earlier blog post with a proposal for a generalized definition of WEMI-like entities for created works.  The WEMI concepts are useful in practice, but the LRM model has some constraints that prevent some desirable uses of those entities. Providing unconstrained entities would expand the utility of the WEMI concepts both within the library community, as evidenced by the use case here, and in the non-library communities that I highlight in that previous blog post and in a slide presentation.

To be clear, "unconstrained" refers not only to the removal of the disjointness between entities, but also to allow the creation of links between the WEMI entities and non-WEMI entities, something that is not anticipated in the LRM. The work cluster of bibliographic records would need a general relationship, perhaps, as in the case of VIAF, linked through a shared cluster identifier and an entity type identifying the cluster as representing an unconstrained work.

* The other terms are defined in the LRM as:

Class: Expression
Definition: A realization of a single work usually in a physical form.

Class: Manifestation
Definition: The physical embodiment of one or more expressions.

Class: Item
Definition: An exemplar of a single manifestation.