Sunday, July 15, 2012

Friends of HathiTrust

I have written before about the lawsuit of the Authors Guild (AG) against HathiTrust (HT). The tweet-sized explanation is that the AG claims that the digitized books in the HathiTrust corpus that are not in the public domain infringe copyright. HathiTrust claims that the digitized copies are justified under fair use. (It may be relevant that many of the digitized texts stored in HT are the result of the mass digitization done by Google.)

For analysis of the legal issues, please see James Grimmelmann's blog, in particular his post summarizing how the various arguments fit into copyright law's "four factors."

I want to focus on some issues that I think are of particular interest to librarians and scholars. In particular, I want to bring up some of the points from the amicus brief from the digital humanities and law scholars.

Unlike scientists and others who work with quantifiable data (social scientists using census data, business researchers with huge amounts of data from stock markets, etc.), those working in the humanities, whose raw material is printed text, have not been able to make use of the massive data mining techniques that are moving other areas of research forward. If you want to study how language has changed over time, or when certain concepts entered the vocabulary of mass media, the physical storage of this information makes it impossible to run these as calculations, and the size of the corpus makes it very difficult, if not impossible, to do the research in "human time". Thus, the "Digital Humanities" can engage in modern research only after their primary materials have been digitized.

This presumably speaks to the first factor of fair use:

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

As Grimmelmann says, "The Authors Guild focuses on the corpus itself; HathiTrust focuses on its uses." It may make sense that scholars should be allowed to make copies of any material they need to use in their research, but I can imagine objections, some of which the AG has already made: 1) you don't need to systematically copy every book in every library to do your research and 2) that's fine, but can you guarantee that infringing copies will not be distributed?

It's a hard sell, yet it's also hard not to see the point of view of the humanities scholars who feel that they could make great progress (ok, and some good career moves) if they had access to this material.

The other argument that the digital humanities scholars make is that the data derived from the digitization process is not infringing because it is non-expressive metadata. Here it gets a bit confusing because although they refer to the data derived from digitization as "metadata," the examples that they give vary from the digitized copies themselves, to a database where all of this is stored, and to the output from Google n-grams. If the database consists of metadata, then the Google n-grams are an example of the use of that metadata, but are not an example of the metadata itself. In fact the "metadata" that is produced from digitization is a good graphic copy of each page of the book, plus a reproduction, word for word (with unfortunate but not deliberate imprecision) of the text itself. That this copy is essential for the research uses desired is undeniable, and the brief gives many good examples of quantitative research in the humanities. But I fear that their insistence that digitization produces mere "metadata" may not be convincing.

Here's a short version from the text:

"In ruling on the parties’ motions, the Court should recognize that text mining is a non-expressive use that presents no legally cognizable conflict with the statutory rights or interests of the copyright holders. Where, as here, the output of a database—i.e., the data it produces and displays—is noninfringing, this Court should find that the creation and operation of the database itself is likewise noninfringing. The copying required to convert paper library books into a searchable digital database is properly considered a “nonexpressive use” because the works are copied for reasons unrelated to their protectable expressive qualities; none of the works in question are being read by humans as they would be if sitting on the shelves of a library or bookstore." p. 2

They also talk about transformation of works. The legal issues here are complex, and my impression is that past legal decisions may not provide a clear path. They then end a section with this quote:

"By contrast, the many forms of metadata produced by the library digitization at the heart of this litigation do not merely recast copyrightable expression from underlying works; rather, the metadata encompasses numerous uncopyrightable facts about the works, such as author, title, frequency of particular words or phrases, and the like." (p.17)

This, to me, comes completely out of left field. Anyone who has done digitization projects is aware that most projects use human-produced library metadata for the authors and titles of the digitized works. In addition, the result of the OCR step of the digitization process is a large text file that is the text, from first word to last, in that order, and possibly a mapping file that gives the coordinates of the location of each word on each OCR'd page. Any term frequency data is a few steps away from the actual digitization process and its immediate output, and fits in perfectly with the earlier arguments around the use of datamining.
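
To make that distinction concrete, here is a minimal sketch in Python of what "a few steps away" means. It assumes the OCR output is available as one plain text string; the sample text and the function name `term_frequencies` are hypothetical illustrations, not anything taken from the brief or from HathiTrust's actual systems.

```python
import re
from collections import Counter


def term_frequencies(ocr_text):
    """Count how often each word occurs in a block of OCR'd text."""
    # Lowercase the text, then pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", ocr_text.lower())
    return Counter(words)


# Hypothetical stand-in for the direct output of OCR:
# the text itself, first word to last, in order.
ocr_text = "The whale, the whale! A sail, a sail!"

# The frequency table is a separate product computed from that text.
freqs = term_frequencies(ocr_text)
print(freqs.most_common(3))
```

The input here is the word-for-word copy; the counts exist only after a further computation has been run over it. That is the gap I see between what digitization produces and what the brief calls "metadata."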

I do sincerely hope that digitization of texts will be permitted by the court for the purposes argued in this paper. An attempt at justification, after the fact, of Google's mass digitization project may, however, suffer weaknesses inherent in that project, in particular that no prior negotiation was attempted with either authors or publishers, and since the amended settlement between Google and the suing parties was denied by the court, there is no mutual agreement on uses, security, or compensation.

In addition, the economic and emotional impact of Google's role in this process cannot be ignored: this is a company that is so strong and so pervasive in our lives that mere nations struggle to protect their own (and their citizens') interests. When Google or Amazon or Facebook steps into your territory, the earth trembles and fear is not an unreasonable response. I worry that the idea of digitization itself has been tainted, making it harder for scholars to make their case for the potential benefits of post-digitization research.


Kevin said...

The fact that a creation is human-produced does not make it copyrightable under US law. The telephone book is the classic example of something produced by a human ("sweat of the brow") but not copyrightable as a compilation (because it has no "modicum of creativity").

Karen Coyle said...

Kevin, I don't think that it was implied anywhere in the post that human-produced metadata is, by definition, copyrightable. My comment about human-created bibliographic data was about the source of the data, not the rights. It is important that the description of the technology within the brief be accurate, and unfortunately there are lapses. I find that throughout they confuse the direct output of digitization with uses that have been made of it, such as calculation of term frequency.

The amicus brief is arguing that data derived from a textual work is not within the copyright of the original (pre-digitized) work. They do not speak at all about any ownership over what they are calling the "metadata." It is an interesting question -- and one that has not been addressed, AFAIK -- whether the results of computational research on the corpus can themselves be copyrighted. Doing something like advanced linguistic research that happens to use computer programs could well be considered creative enough for copyright to apply to the programs, but I don't know if one can claim copyright in the resulting data. I haven't looked into this, but I believe that this question has come up in scientific research.

In terms of human-created library metadata, I have to say that I have trouble seeing library cataloging as mere "sweat of the brow." If it were that, we wouldn't need 1500 pages of cataloging rules, with a lot of "if...then." Like complex data extraction, library cataloging requires complex decision-making. The end result may still be considered purely factual, but I am always struck by the sheer complexity of the activity -- maybe "sweat of the brain" would be a better term.