Tuesday, September 24, 2013

Hopes and fears for Google Books case

We're back in the saddle of the now epic lawsuit against Google for its massive scanning of the books held by libraries. I have very mixed feelings about the case and its outcomes, and the news reports from yesterday's hearing (transcript) in Judge Denny Chin's court are not making me feel any better about it. In brief, the Authors Guild is claiming that Google's scanning of in-copyright books is not fair use. Since that act alone is not sufficient to defeat a defense of fair use, they also state (correctly, in my view) that although Google is not providing advertising on the individual book pages, it makes money overall off of the scanned books because that digital corpus reinforces its position against other search engines. There are some things that the Authors Guild has right (such as that Google makes money off of search results pages that can include links to Google Books), but they miss the mark in other arguments:
"For all intents and purposes, it paid libraries for the right to digitize and copy much of our nation’s literary heritage and then used the resulting digital library to gain a competitive advantage over search engine competitors that respected the rights of authors by limiting their digitization programs to books that were either licensed or were no longer protected by copyright. Aided by its infringing conduct, Google’s search engine has proven remarkably successful—to the point where “google” has become a widely used verb in the English language."
First, the addition of Google Books to the search took place long after we were all "googling." Google's main value still comes from providing access to open web resources that otherwise would just be a massive digital junk heap. I suspect that those who are interested in using Google to search within the text of "closed" books (ones that are not available as full text online) consciously go to the Google Book Search pages. I don't know this for a fact, but I'd be willing to bet that user intent behind most Google searches is to access the actual content of a web page or document, not to be given a reference to an off-line resource.

Next is the statement that Google "paid libraries for the right to digitize..." This makes it sound like Google gave the libraries money, and that there was no cost to the libraries. The agreement between Google and libraries was an exchange that had costs for both (less for the libraries, more for Google) and benefits for both (less for the libraries, more for Google). In the end, Google got the better part of the deal, but libraries got something, albeit something they have not yet been able to benefit from greatly: copies of the scans at a lower price than if they had done the digitization themselves. Unfortunately, due to both copyright issues and the nature of the agreement between Google and the libraries, there are significant barriers to making the kind of uses that would make this a truly transformative corpus for research.

All of the news reports emphasized some comments by Judge Chin to the effect that Google Books appears to be both transformative (in the copyright law sense) and a benefit to society. What worries me a bit is that Judge Chin is not looking beyond the use of the resulting digital texts for search. I consider search to be the tip of the iceberg, and the visible part of Google Books that Google would like everyone to focus on. My assumption is that Google has a research interest in having exclusive access to 20 million non-Web digital texts in a myriad of languages, and that this research is aimed not only at search but at Google's desire to be THE interface between man and machine, which means that machines have to get better at human languages.

If Judge Chin rules that Google's book digitization is fair use, it's a huge win, not only for Google but also for libraries. After all, if it is fair use for Google to digitize works for the purposes of searching, there is no question that it is also fair use for libraries to do the same. If Judge Chin rules that Google's book digitization is NOT fair use because of profit-making, then we still do not know for sure whether library digitization would be considered fair use (although much would depend on exactly how the decision is worded). This of course makes me want to cheer on Chin toward the "is fair use" decision, but at the same time I know that this means that any research that Google is doing on its private cache of digital texts will continue, giving them great advantages over competitors in the arms race of technology advancement.

Once again, I so wish that large-scale digitization for search and research had been undertaken by libraries, not Google. The questions of "not for profit" and social value would be a slam-dunk, and I'd not be harboring this fear that there is a hidden agenda behind the project. Maybe if libraries had done this we'd only have two or three million digitized books, not 20 million (as is claimed for Google), but they'd be untainted, in my mind, and I could still consider them a cultural heritage resource rather than a commercial product.

Sunday, September 22, 2013

Copyright, Metadata, and Attribution

The Berkeley Center for Law and Technology (BCLT) has done some interesting research on copyright, including a white paper that details the issues of performing "due diligence" in a determination of orphan works.

Recently I attended a small meeting co-sponsored by BCLT and the DPLA to begin a discussion of the issues around copyright in metadata, with a particular emphasis on bibliographic metadata. Much of the motivation for this is the uncertainty in the library and archival community about whether they can freely share their metadata. As long as this question remains unanswered, there are barriers to the free flow of data from and between cultural heritage institutions.

At the conclusion of the meeting it was clear that it will take some research to fully define the problem space. Fortunately for all of us, BCLT may be able to devote resources to undertake such a study, similar to what they have done around orphan works.

One of the first questions to address is whether bibliographic metadata is copyrightable in the first place. If not, then no further steps need to be taken -- not even putting a CC0 license on the data. In fact, some knowledgeable folks worry that using CC0 implies that there do exist intellectual property rights that must be addressed.

However, before you can attempt to determine if bibliographic metadata can be argued to be a set of facts which, under US copyright law, do not enjoy protection, you must be able to define "bibliographic metadata." During the meeting we did not attempt to create such a definition, but discussion ranged from "anything about a resource" to a specific set of descriptive elements. As there were representatives of archives in the room, we also talked about some of the implications of describing unpublished materials, which have a different legal standing but also provide less self-identification than resources that have been published. Drawing the line between fact and embellishment in bibliographic metadata is not going to be easy. Nor will the determination of the level of creativity of the data, a necessary part of the analysis for US law. Note that other types of metadata were also discussed, such as rights metadata and preservation metadata, as well as a recognition that the exchange of metadata will of course cross national boundaries. Any study will have to determine where it will draw the "metadata" line, and also whether one can address the question with an international scope.

Another complexity is that bibliographic data is already "crowd-sourced" in a sense. For any given bibliographic record, different contributions have been made by different librarians from different institutions and at different times. This recognition makes it hard to ascribe intellectual ownership to any one party. And while library catalog data may be considered to be factual, it is much more than a simple rendering of facts, as the complexity of the cataloging rules attests. I likened library cataloging to a medical diagnosis: the end result (some scribbles in a file and perhaps a prescription given to the patient) does not reveal all of the knowledge and judgment that went into the decision. Metadata is the tip of an iceberg. That may not change its legal status, but I think that unless you have delved into the intricacies of cataloging it is hard to appreciate all that goes into the fairly simple display that users see on the screen.

The legal question is difficult, and to me it isn't entirely clear that solving the question of the legality of bibliographic data exchange will be sufficient to break the logjam. In a sense, projects like DPLA and Europeana, both of which have declared their metadata to be available with a CC0 license, might have more real impact than a determination based in law. Significant discussion at the meeting was about the need for attribution on the part of cultural heritage institutions. As with academics, the reputation and standing of such institutions depend on their getting recognition for their work. Releasing metadata (including thumbnails in the case of visual materials) needs to increase the visibility of those institutions, and to raise public awareness of the value of their collections. It is possible that solving the attribution problem could essentially dissolve the barriers to metadata sharing, since the gain to the institutions would be obvious.

Perhaps my one unique contribution to the group discussion was this:

We all know the © symbol and what it means. What we need now is an equally concise and recognizable symbol for attribution. Something like "(@)The Bancroft Library" or "(@)Dr. Seuss Collection". This would shorten attribution statements but also make them quickly recognizable, and a statement could also be a link to the appropriate web page. Standardizing attribution in this way should make adding attributions easier, and would demonstrate a culture of "giving credit where credit is due." The symbol needs to be simple, and should be easy to understand. It's time to comb through the Unicode charts for just the right one. Any suggestions?

See Also:

Unicode U+1F6A9 - Triangular Flag on Post, meaning "location"