Coyle's InFormation: 04/01/2023

Thursday, April 13, 2023

What is Controlled Digital Lending? The Origin Story

The bulk of the reporting on the lawsuit between publishers (four of them, led by Hachette) and the Internet Archive's version of Controlled Digital Lending hits one of these points of view:

The publishers are evil, money-grubbing idiots going after the generous, saintly Internet Archive
The Internet Archive is evil, stealing from the poor publishers and even poorer writers

As is so often the case, it really is more complex than that. I will try to throw a bit of clarity into the mix here, mainly by talking about some of the realities of library service in the 21st century, and the origins of controlled digital lending.

The Origins of CDL

Michelle M. Wu, a law librarian and law professor, wrote a piece for the Law Library Journal in 2011 explaining the dilemmas faced by law libraries and proposing a modest solution.

"Building a Collaborative Digital Collection: A Necessary Evolution in Libraries," Law Library Journal Vol. 103:4 [2011-34] (online)

The solution is what became Controlled Digital Lending. The reasons she lays out are the key.

The main argument that Wu puts forth (and that I find convincing) is this: library users either want or actually need to be able to access materials remotely, which means in electronic formats over a network. Increasingly, materials that libraries wish to provide are available from publishers in those electronic formats. The catch, however, is that libraries are not able to own materials in electronic formats, but instead can only subscribe to access services. It is this lack of ownership that is the rub. If a library loses its digital subscription for some reason, such as no longer being able to afford it, it not only loses access to future materials, it loses access to all of the past materials that were included in that subscription. This puts libraries in the terrible position of having to decide between fulfilling their role as the reliable repository and archive of material in their subject area, or of serving the needs of library users. As Wu points out, libraries are already struggling to afford the materials that they feel they should be collecting, so purchasing these materials both in hard copy for archival purposes and also in digital form for user service is entirely out of reach.

What Wu suggests in her article is a variation on Inter-Library Loan, combined with a library collective purchasing plan. A cooperative group of law libraries would combine purchasing physical resources for those items that are rarely used but that should be available to the researchers who need them. This is not a revolutionary idea - library consortia have been making use of this kind of approach for a significant amount of time. The difference in Wu's plan is that as items are requested from the consortial holdings, they will be digitized and the digital format will be the one loaned. To stay within the intention of copyright law, in particular First Sale, Wu offers that the digital file will be loaned as a surrogate for the physical copy:

"Materials acquired would be digitized, and only the number of copies acquired in print for each subsequently digitized document would circulate at any given time. The print copy would be stored for archival purposes; only the digital copy would “circulate.”" Wu p.535

The physical resource would fulfill the need for an archival copy, and the digital resource would allow lending to any networked member of the cooperative group. Her solution assumes an effective digital rights management system that would make the loan a loan and not a pirate-able copy.
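To make the lending rule concrete, here is a minimal sketch in Python of the constraint Wu describes: the number of digital loans out at any moment never exceeds the number of print copies the consortium owns. This is only an illustration under stated assumptions. The `Title` class, its field names, and the one-loan-per-patron rule are hypothetical and do not come from Wu's article or from any real lending system, and the sketch leaves out the digital rights management layer entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Title:
    """One work held by a hypothetical consortium (illustrative only)."""
    name: str
    print_copies_owned: int                      # physical copies kept in storage
    borrowers: set = field(default_factory=set)  # patrons holding a digital loan now

    def checkout(self, patron: str) -> bool:
        """Lend the digital surrogate only if an owned copy is still free."""
        if patron in self.borrowers:
            return False  # assumption: one loan per patron per title
        if len(self.borrowers) >= self.print_copies_owned:
            return False  # every owned copy is already on loan
        self.borrowers.add(patron)
        return True

    def checkin(self, patron: str) -> None:
        """Return the loan, freeing a copy for the next patron."""
        self.borrowers.discard(patron)


# Two print copies owned, so only two digital loans can be out at once.
treatise = Title("Rarely Used Treatise", print_copies_owned=2)
results = [treatise.checkout(p) for p in ("ana", "ben", "cy")]
treatise.checkin("ana")                  # one copy comes back
results.append(treatise.checkout("cy"))  # now the third patron can borrow
print(results)  # [True, True, False, True]
```

The third request is refused until a copy is returned, which is the behavior of a physical shelf carried over to the digital file.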

Wu carefully covers all of the potential legal objections, and points out the various areas of US copyright law that might be touched on with her proposal. Specific areas are First Sale, Fair Use, and the various exceptions to the copyright law that are applied to libraries. She defends the digitization as format-shifting, not unlike the format-shifting that is done for sound recordings as the technology for that medium has changed.

"It is the work itself that is copyrighted, not the form." Wu p. 541

She also addresses what would be an obvious objection of rights holders, that the digital copy is substituting for a purchase. The hard copy would be purchased by the consortium, and given her statement that these would primarily be low-use materials that many libraries would not themselves purchase, no harm would be done to the market, which would in any case be limited.

The argument that I find strongest is that of preservation: the US copyright law does allow libraries to make copies of works for the purposes of preservation if no equivalent copy is available for purchase. (Section 108 subsection c) Using the argument that the purpose of the library is to preserve as well as to make works available, Wu says:

"In cases where a digital version is available only for license, a library could argue that such a license is not equivalent to either the print copy or a digital copy they would make, because both of these items would be owned by the library and the licensed digital version would not." Wu p. 539

Context Counts

Controlled Digital Lending is the technology: the digitization of paper works and the lending of the digital copy using management software that prevents piracy of the digital file. It is the context that makes Wu's proposal different to the implementation of controlled digital lending at the Internet Archive.

Wu's proposal was for a consortium of law libraries serving their own users; the IA's implementation was open to anyone on the web
Wu's proposal was for academic materials of low use; the IA's included popular works
The works in Wu's proposal would have been selected with specific research purposes in mind; the IA's collection was an opportunistic group of books that they had often obtained second-hand - therefore no research purpose could be argued. (Purpose is one of the Fair Use factors.)
Wu's proposal argued for the need of libraries to preserve materials that otherwise would not be preserved; the Archive is indeed an archive, known for its preservation of web sites that otherwise would be lost. However, the popular books named in the lawsuit against the Archive are already "preserved" in thousands of libraries that have those physical books on their shelves - the preservation argument is not easily made.

HathiTrust, which is a consortium of libraries that originally contributed to the Google Books project, is an example that follows Wu's approach. HathiTrust stores digitized copies of books and follows the ruling related to Google Books that searching of in-copyright works is permitted, but not reading. HathiTrust developed its own controlled digital lending service as an emergency service for when a member library is temporarily closed due to a disaster. In that case, users from the member library can borrow digitized books held by the member library in hard copy.

Wu even suggests that libraries might share the burden of digitization by providing digitized copies to libraries that own the books in hard copy. This latter, though, was one of the things that got the Internet Archive in trouble because it became a "digital lending broker" for other libraries, adding their hard copy count to the Archive's lending "units" including some of the books owned by the publishers in the lawsuit.

The Upshot

The argument presented by Wu is quite strong and is justified through her careful reading of copyright law, in particular as that law applies to libraries. The extension of her proposal to popular reading materials and to an unlimited user base changes everything. Libraries do have specific collections, identified users, and stated purposes that guide their acquisitions. Something I feel strongly about - that is absent from so many modern information activities - is that effective information use requires purposeful resource selection and organization. Any mass of stored resources is only as valuable as its organization and coherence. In some cases, a "less" that is well organized can be more informative than a "more" that may lack the key works in a subject area. It may be old-fashioned on my part, but I adhere to the concept of defined user goals and the deliberate collection of specific works in support of user learning. This is what I read in Wu's work but which I do not see in the Archive's activities.

I think Wu's context could be understood as falling within the confines of copyright law. I'm not sure that the Archive's case does. I do hope that this current lawsuit does not result in a rejection of digitization for lending for all libraries.

Friday, April 07, 2023

Libraries, the law, and equality

In the spirit of "everyone is equal under the law", it is equally illegal for both a starving man and a billionaire to steal a loaf of bread. Or to copy a book.

Libraries for the People

It was not all that long ago when "library" often referred to the room in a rich man's home where he stored books that were only available to him, and perhaps members of his family (especially if they were not female). Other libraries, usually larger ones, were attached to prestigious educational institutions and accessible by people worthy of that prestige (which would not include non-white or female people). We are fortunate today that we have these things called "public libraries," libraries that serve everyone regardless of their wealth, their race, or their gender.

Here's the catch: public libraries are generally small and modestly funded by the local community. A moderately sized public library has 50,000 - 100,000 volumes. A large public library may have up to 500,000 volumes. A large university library has many more. Harvard University library claims to have 20 million book volumes, 400 million manuscripts, and 10 million photographs. Stanford University library may have at least 12 million book volumes. Michigan State University libraries have about 7 million book volumes. The British Library lays claim to between 170 and 200 million items, of which 13.5 million are printed books and e-books. There is no question that the members of our community who are served solely by public libraries, while they have unprecedented access to books, are not able to study the full range of printed knowledge of our world. To wit, the university libraries are often referred to as "research libraries" while the local public libraries are called "reading libraries." This separates us into "readers" and "researchers," and while you might conclude that any literate person can read, only those associated with large libraries will be able to avail themselves of the tools to do research.

Digital Access

Much of the research done in academe consumes and creates journal articles. Originally issued only in paper, and mailed to libraries and departments, journal articles have been available in digital form from the mid-1990s and today it would be unusual for an article-based publication to be issued only in paper. Journal owners have digitized the full run of publications, as have cooperative projects based in academia. A researcher or student at a Western university is likely to have more than a century of scientific, technical or social science academic article output available through the Internet, any day, any time, and perhaps from any place. Anyone from less wealthy nations will have less access, although perhaps just a tad more than they had when the articles were issued only in paper.

The story is different with books. While most academic articles have been converted to digital form, the same cannot be said of books. It is only recently that publishers have issued their books in electronic form using the electronic files that are now part and parcel of the publishing process. That only takes care of current publications, however. Sitting in libraries are centuries of one-off publications in book form. Books from this vast backlog must be digitized from the existing physical copy. Projects by libraries and educational institutions to digitize the monographic backlog, similar to those that succeeded in digitizing the journal output of the ages, have not been accomplished. There are various reasons why that is the case: the sheer number of book pages that would need to be digitized is huge; non-destructive digitization of bound volumes is difficult and often does not yield good results; partnering with publishers for this task is hampered by the fact that numerous books from the 20th century and older are "orphaned," meaning that although they may be under copyright their copyright holder cannot be found; and compared to modern ebooks, digitized books have little to recommend them for reading, although with their searchable text they may be useful for research.

The only efforts to digitize the backlog of books, Google Book Search and the digitizing by the Internet Archive, have resulted in lawsuits against those organizations. The suit against Google concluded that digitization is allowed as long as the digitized books are provided for purposes of searching but not reading. The Internet Archive took the view that books are for reading, an approach that I find hard to oppose.

Reading vs Research

Reading and research are related but different activities. Reading is often associated with books, and includes books on scientific and academic topics as well as fiction, from great literature to beach reads. While few non-researchers read academic articles, some members of academe do read books as part of their research. Of course, many people also read for pleasure; reading is a key means of acquiring culture, alongside other activities like taking in performances of various arts.

If you are not at one of those institutions with a large research library, the only way you may have to see the content of many books is by accessing a digitized book. A digitized book is not the same reading experience as the ebook produced by publishers. A digitized book has not been produced from an electronic file of its contents as an ebook has been. Instead, each page of the physical book has been photographed, and those images have been analyzed using optical character recognition (OCR) software. The result of the OCR is a text file, and that file will be more or less "lossy" depending on things like the condition of the original book pages, the clarity of the font, and the language of the text.

Unlike an ebook, reading the digitized book usually means viewing pictures of the book's pages.

It's not a great reading experience, but imagine that the book is important for your studies or your work; it would be worth the effort.

On the other hand, if you are wanting some modern leisure reading and you are in North America, you will be much better served by checking out the book and ebook offerings of your public library. If you are not in North America, and if your locality has a limited public library or no public library at all, then the extra effort that you may need to make to read a digitized book may be worth it to you. If, however, you had the funds to purchase the materials you needed or were associated with an institution that made those materials available to you, it is unlikely that you would choose the less sophisticated and less available copy provided at the Archive.

Hachette, et al., v Internet Archive

The above sets out some of the social parameters that we should consider when thinking about the recent lawsuit relating to Controlled Digital Lending. (See previous post.) In brief, the Internet Archive has digitized many books and makes them available globally, lending one "copy" at a time. A group of publishers has sued the Archive based on a set of books for which the publishers hold the copyright. The issue is often presented as a test of the concept of Controlled Digital Lending, although only some books are in question in the lawsuit. Those books represent only a portion of the books available at the Archive or in libraries in general. Although one may think of a binary division of books into "still in copyright" and "no longer in copyright" the actual situation is more complex.

1. There are the books that are out of copyright, which generally means books from 1927 and earlier in the US. These are not under discussion. However, there is no way to separate the basic public-domain content of a book, like Mark Twain's Huck Finn, from later reprintings that often add some bit of a preface so that the publisher can put a copyright notice on it and pretend to have the rights. Such "books" may be considered in copyright even though the primary content of those books is not. There is unfortunately no penalty for a publisher in slapping a copyright statement onto a book that is not under copyright, as can be seen in my favorite example of a blank journal sold with a copyright notice.
2. There are the orphaned works, for which there is no one to assert rights. Either the rights holder (the publisher) no longer exists, or the documentation that would make it possible to assert rights does not exist. Because this is a category of unknowns, it is quite difficult to determine which books are in this category.
3. There are works that are not orphaned but the publisher is not asserting rights in relation to Controlled Digital Lending. This may be the majority of the books being loaned by the Archive because there are only four publishers in the lawsuit. We don't know what the other publishers think about the lending.
4. There are the books by the four publishers that are included in the lawsuit. These four publishers are asserting that the Archive violated their rights and potentially deprived them of income.

It would be great to know the figures that would allow us to compare 1-3 with 4. It would also be great to know how many loans were actually made by the Archive of those books in the 4th category. Presumably that figure will inform the penalty that is imposed on the Archive.

The Archive's defense seems to be solid as it shows that in both the presence and the absence of its contested service no change was noted for publisher sales. It is chilling that the judge so readily dismissed the Archive's arguments, and especially chilling if you consider, as a hypothetical, applying this same argument to libraries in general.

"IA’s experts observed that print sales of the Works in Suit and general demand for library ebooks did not decrease while the Works in Suit were available on IA’s Website; that Amazon rankings for the Works in Suit improved when IA’s digital lending skyrocketed (and government lockdowns were in full effect) at the beginning of the Covid-19 pandemic; and that, despite the removal of the Works in Suit from IA’s library in June 2020, OverDrive checkouts of the Works in Suit did not increase." (Case 1:20-cv-04160-JGK-OTW Document 188 Filed 03/24/23 Page 42)

That sounds like a good defense, yet the judge dismisses it.

"But these metrics do not begin to meet IA’s burden to show a lack of market harm. Taking them at face value, they show at best that the presence of the Works in Suit in IA’s online library correlated, however weakly, with positive financial indicators for the Publishers in other areas. They do not show that IA’s conduct caused these benefits to the Publishers. In any event, IA cannot offset the harm it inflicts on the Publishers’ library ebook revenues, see, e.g., Andy Warhol Found., 11 F.4th at 48; TVEyes, 883 F.3d at 180, by pointing to other asserted benefits to the Publishers in other markets. Nor could those asserted benefits tip the scales in favor of fair use when the other factors point so strongly against fair use." (Case 1:20-cv-04160-JGK-OTW Document 188 Filed 03/24/23 Page 43)

Given this kind of reasoning, there is no "proof" that any library could provide that would clearly absolve the library of harm to publishers. That should be okay because "not harming publishers" is not how we should see the role of libraries in our world. Libraries exist for the same reasons that educational institutions exist: to further the abilities of citizens to participate in "science and useful arts", as it is called in the Constitution. Yet as Dan Cohen says in his article in the Atlantic:

On Friday, the judge sided almost entirely with the publishers. The Internet Archive “argues that its digital lending makes it easier for patrons who live far from physical libraries to access books and that it supports research, scholarship, and cultural participation by making books widely accessible on the Internet,” Judge John G. Koeltl wrote in his pointed ruling. “But these alleged benefits cannot outweigh the market harm to the Publishers.”

Thus, societal benefits, such as those of libraries and schools, take a back seat to profit. Or should I say "alleged benefits." Today, copyright law creates a basis for the legality of library lending through the first sale doctrine. Some library privileges relating to making copies are included in the US copyright law. But these do not add up to actual support for the work of libraries, only a limitation on culpability as they perform key functions such as preserving cultural materials that have been abandoned by their creators and providing access to recorded culture to all who request it. In the legal regime, libraries are allowed, but not encouraged, to provide a valuable service for society. Judge John G. Koeltl has little regard for that service.

Thursday, April 06, 2023

Judge's Decision on Internet Archive's Controlled Digital Lending

The story is long and complex, so here's about the shortest Q&A summary that I can manage. Remember IANAL (I am not a lawyer), IAAL (I am a librarian). Also, I'm leaving out lots of details here, but provide links so that you can get to them. While this is playing out as a legal question, the societal issues are barely considered. I will try to give some thoughts on those soon.

Q: Who sued the Archive?
A: Four publishers: Hachette Book Group, Inc., HarperCollins Publishers LLC, John Wiley & Sons, Inc., Penguin Random House LLC.

Q: What did they sue about?
A: That the Archive digitized paper books for which the publishers hold the copyright and loaned the digital copies to people.

Q: Are these the only publishers whose works the Archive digitized?
A: Oh, no. There are probably thousands of others.

Q: What are the books that are named in the suit?
A: There are too many to list, but here are a few to give you an idea:

Elizabeth Gilbert's Eat, Pray, Love: One Woman's Search for Everything Across Italy, India and Indonesia
Malcolm Gladwell's Blink: The Power of Thinking Without Thinking
C. S. Lewis's The Lion, the Witch, and the Wardrobe
J. D. Salinger's The Catcher in the Rye
Laura Ingalls Wilder's Little House on the Prairie

There are many minor works as well, and others whose titles you would recognize.
The full list is at: https://storage.courtlistener.com/recap/gov.uscourts.nysd.537900/gov.uscourts.nysd.537900.1.1.pdf

Q: What was the Archive's legal defense for its actions?
A: The Archive argues that digitization is analogous to the kind of time-shifting that is done through technologies like TiVo; it is a sort of "format shift" and therefore is fair use. It also argues that, as a non-profit library, it is providing a lending service like libraries do with hard copies of books. It calls this process of digitizing and lending "Controlled Digital Lending." In Controlled Digital Lending the library treats the original hard copy and the digital copy as a single "thing" and lends either one or the other but not both at a time. This is called the "one-to-one principle" and it is designed to mimic the First Sale doctrine of US copyright law, which is the basis for the legality of library lending.

Q: How did the court respond?
A: The judge looked at the four fair use factors of US copyright law and concluded that the Archive's use was not fair. He accepted the publishers' arguments that the lending of the books competed with the publishers' own digital and physical sales. He also bought the publishers' argument that the Archive, albeit a non-profit, gained status and therefore donations through the book lending service.

Q: Are there legal arguments to support Controlled Digital Lending?
A: Yes, some have been made. In particular there is the work of Michelle Wu, who wrote "Building a Collaborative Digital Collection: A Necessary Evolution in Libraries." Her initial thesis regarded law libraries and their difficulty in keeping up with the production of legal resources. Later, she was one of a group of legal scholars who developed a more general statement on Controlled Digital Lending. They argue that in this environment of increasing remote access to information, libraries have to be able to move beyond the requirement that users visit a physical space to access materials. And since not all materials have been provided in digital form, libraries need to take on the process of digitization for materials that they hold only in hard copy.

Q: Is this the first time that libraries have digitized materials?
A: No. Libraries have used various technologies, including digitization, to make materials available to disabled users. They also have digitized, faxed, and copied individual journal articles and book sections to satisfy interlibrary loan requests. They rarely have digitized entire books except to preserve rare materials, but those are generally free of copyrights due to their age.

Q: So, did the Archive do something wrong?
A: Possibly. For materials online a copyright holder can issue a "take down" notice, and the recipient is obligated to remove the item from access. The publishers claim that they gave the Archive a list of items to take down, but not all were removed. I haven't seen a statement from the Archive on why that method failed. Then, for about four months, during the beginning of the COVID pandemic in 2020, the Archive eliminated the one-to-one rule and allowed unlimited lending. This was done as a service to offset the fact that during that time many physical libraries were closed to their users, but it was not in keeping with the legal principles that had been laid out for Controlled Digital Lending.

Another possible error was the digitization of materials for which the publishers have digital versions (ebooks) on offer. This makes the argument that the Archive was competing with the publishers more convincing. Copyright law also protects "creative" works more strongly than factual works, and these are publishers of fiction as well as popular non-fiction, types of works that one could see as worthy of maximum copyright protection. Materials intended for research and education (academic journal articles, scientific treatises) are more likely to meet the "purpose" requirement of a fair use assessment. It is quite a bit harder to claim fair use copying for "something fun to read" and the publishers in the suit are all major purveyors of popular reading.

Continuing on, most libraries have a limited user base: universities serve current students, staff and faculty; public libraries serve residents in their jurisdiction. The Archive was lending materials globally. That latter is both an argument against the Archive, if you are a publisher, and an argument for the Archive, if you support equal access to information.

Q: Didn't we go through this already with Google Books?
A: Not quite. Google never allowed anyone to read its digitized books. It stated that its digitization project was to provide searching within the text of books, and users were only displayed snippets, not the whole book. That was deemed to be fair use by the court. Since then, Google Books has mainly been acquiring digital texts provided by publishers, and the amount of visible content is part of the agreement between Google and its book partners.

Q: Could a different implementation of Controlled Digital Lending succeed?
A: Possibly. There are libraries that have partnered with the Archive in this project but were not mentioned in the lawsuit; it is unclear whether they will be able to continue lending their digitized books, and if they can, they may have to find another technical solution to the lending service, which is currently run by the Archive. There is also the possibility that a digitization project that had specific service goals, like the one initially proposed by Wu for law libraries, would be easier to defend. Both the Archive and the earlier digitization project, Google Books, decided that it was expedient to digitize first and ask permission later. They also both digitized indiscriminately, including old and new, academic and popular. Google eventually adopted an "opt-in" model in its publisher relations, although as the search engine of record what it has to offer is a level of visibility that no one else can provide. The other option is to limit access to books in the public domain, which cuts off almost the last century of works.

Q: What's next?
A: There will be appeals by the Archive, but if those do not alter the court's view then the Archive will be required to compensate the publishers for its infringement of their rights. Presumably that compensation will be based on some estimated amount that the publishers were damaged. So far I have not seen any actual figures that would be used to make such a determination.

Resources:

The Controlled Digital Lending site
The many documents in the lawsuit
The judge's decision
The Authors Guild CDL page
Google books legal decision (from Wikipedia because it has lots of links)