Sunday, September 20, 2009

DOJ drops bomb in Google/AAP settlement

On Friday, September 18, 2009, the Department of Justice delivered its long-awaited Statement of Interest in the proposed settlement between Google and the AAP/AG in the class action suit surrounding the Google Book Search product. The DOJ has some very specific requirements for modification of the settlement, some of which could result in significant changes in the nature of the agreement. The headline, however, is:
that "the court should reject the settlement in its current form," and reconsider after changes are made.

Beyond that, my summary is this:

1) the DOJ does not like that the settlement allows uses of orphan works that go beyond those allowed by copyright law, and especially that others will be profiting from those uses

2) the DOJ considers the settlement to be anti-competitive, and

3) between the lines, it appears that the DOJ can't decide between supporting full access to scanned books for the good of mankind and wanting the settlement to limit itself to the original scope of Google's project, which was to digitize for indexing only.

And I should add:

4) nothing here has a direct effect on libraries or the Google library partners, except, perhaps, in that it changes the product that Google will provide as its subscription service, and

5) that the DOJ letter clearly states that Google and the AAP/AG are already in the process of making changes to the settlement to respond to the DOJ's concerns.

The Concerns

The Class

The first concern has to do with the definition of the class of rights holders who are party to the class action suit. DOJ concludes that the settlement does not satisfy the rules for defining a class as set out in Rule 23, the rule that governs class action suits.

In this area, DOJ is mainly concerned with the potential rights holders of orphan works. It isn't easy to understand what solutions DOJ sees for finding the rights holders for these works, but the Department is uneasy that known rights holders will be the ones negotiating with the rights registry, and that they will also benefit from any money made on orphan works. In other words, it will be to the advantage of rights holders that the parents of those orphans NOT be found. DOJ suggests, among other things, that the money made on orphan works not be paid out to others, but be used to try to find rights holders.

It also suggests that not enough work was done to notify all potential members of the class, in particular foreign authors.

The Potential Uses, and Orphan and Out-of-Print Works

DOJ appears to be nervous about the open-endedness of the future uses that Google can make of both orphan and out-of-print works. To remedy this, it suggests that out-of-print works (including orphans) be treated the same as in-print works, that is, that rights holders must opt in to any uses that Google intends to make of the works. To me this makes sense from a legal point of view, since copyright does not distinguish between in- and out-of-print status. It makes less sense from a market point of view, because presumably there is less active interest in the out-of-print works on the part of the rights holder. However, we really do not know what in- and out-of-print mean in a predominantly digital environment, and it may be a mistake to be making decisions based on the analog market, as the settlement does.

Some parts of the DOJ document suggest what could be radical solutions, yet they appear almost as asides. For example, when suggesting that out-of-print works should be subject to opt-in, they say:
"Such a revision would, of course, not give Google immediate authorization to use all out-of-print works beyond the digitization and scanning which is the foundation of the plaintiffs' Complaint in this matter." p. 14
This seems to indicate that DOJ would be more comfortable with a settlement that essentially authorized the current scope of the Google Book Search product, which was the basis for Google's claim of Fair Use: search and snippet display.

In another section, they voice concern over the fact that some rights holders will be earning money on the unclaimed works of others. They say:
"The risk of such improper leveraging might also be reduced by narrowing the scope of the license. A settlement that simply authorized Google to engage in scanning and snippet displays in the future would limit the profits that others could potentially derive from out-of-print works whose owners fail to learn of their right to claim those profits." p. 15
In fact, this would greatly limit the profit that Google could earn (from which those of the rights holders derive), since the main source of expected profit for Google seems to be from the licensing of full views of the books (to libraries and other institutions) and the "sale" of books to individuals. If this is really what the DOJ means, then it is essentially suggesting that Google have no more use of orphaned works than it has today. With that limitation, it seems that Google might as well go forward with its Fair Use defense, if it would want to continue scanning books at all.


DOJ is concerned that the settlement doesn't allow for sufficient competition. It isn't clear to me, however, how that competition might be achieved. First, the document states that the Registry does not have the power to give access to works to entities other than Google, since copyright law doesn't allow it. Then it says that the best solution is to make sure that other companies get equal access. To show that I'm not making this up (although I may be misinterpreting):
"The Proposed Settlement does not forbid the Registry from licensing these works to others. But the Registry can only act "to the extent permitted by law." S.A. 6.2(b). And the parties have represented to the United States that they believe the Registry would lack the power and ability to license copyrighted books without the consent of the copyright owner -- which consent cannot be obtained from the owners of orphan works." p. 23
"This risk of market foreclosure would be substantially ameliorated if the Proposed Settlement could be amended to provide some mechanism by which Google's competitors could gain comparable access to orphan works...." p. 25
As far as antitrust goes, the document states that although there are concerns about antitrust, the full analysis has not been completed. There are suggestions, however, that the main concerns have to do with the Book Rights Registry and the setting of prices for all works (instead of relying on competition to determine prices).


All in all, it seems to me that the DOJ has pointed out some of the same problems indicated by others, but unfortunately hasn't really given a clear direction for the settlement to take. What we do know is that we'll see a new version of the settlement sometime in the future... many more pages of dense text to ponder.

Monday, September 14, 2009

Google Books Metadata and Library Functions

In a recent post in the NGC4LIB list, we got a very welcome answer from Chip Nilges of OCLC about Google's use of WorldCat records:
"To answer Karen's most recent post, Google can use any WC metadata field. And it's important to note as well that our agreement with Google is not exclusive. We're happy to work with others in the same way. The goal, as I said in my original post, is to support the efforts of our members to bring their collections online, make them discoverable, and drive traffic to library services."



As we have seen from recent postings about the metadata being presented in the Google Book Search (GBS) service, there are some problems. Although Google claims to have taken the metadata from its library partners, we can look at records in GBS and the record for that item in the library partner database and see how very different they are. It is clear that Google has not retained all of the fields that libraries have provided, and has made some very odd choices about what to keep. Perhaps what we need to do, to help Google improve the metadata, is to make clear what data elements we anticipate we will need in order to integrate the Google product with library services.

When you ask people what metadata is needed for a service, they will often reply something like "everything" or "more is better." I'm going to take a different approach here because I think it is a good idea to connect metadata needs with actual functionality. This not only justifies the metadata, but the functionality helps explain the nature of the metadata that is required. For example, if we say that we want "date of publication" in our metadata, it may seem that we could use the date from the publication statement, which can have dates like "c1956" or "[1924]." If, instead, we indicate that we want to use dates in computational research, then it is clear (hopefully) that we need the fixed field date (from the 008 field in the MARC record).
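To make the date example concrete, here is a minimal sketch in Python of the two approaches. It assumes the usual MARC 21 layout in which Date 1 sits in character positions 07-10 of the 008 field. The function names and the sample 008 string are made up for illustration and do not come from any real record.

```python
import re
from typing import Optional


def year_from_publication_statement(text: str) -> Optional[int]:
    """Guess a year from a transcribed date such as 'c1956' or '[1924]'."""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None


def year_from_008(field_008: str) -> Optional[int]:
    """Read Date 1 from character positions 07-10 of the 008 fixed field."""
    date1 = field_008[7:11]
    # Partly unknown dates such as '19uu' are not usable as a number.
    return int(date1) if date1.isdigit() else None


# Hypothetical 40-character 008 for a book: date entered on file, type of
# date, Date 1, Date 2 (blank), place code, 17 unused positions, language.
sample_008 = "900101" + "s" + "1956" + "    " + "nyu" + " " * 17 + "eng" + " " + "d"

print(year_from_publication_statement("c1956"))
print(year_from_publication_statement("[1924]"))
print(year_from_008(sample_008))
```

The first function has to guess at what the cataloger transcribed, while the second reads a value that was coded for machine use in the first place, which is why the fixed field is the safer source for computation.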

So here are the functions that come to my mind, and I welcome additions. (Do remember that at this point we are only talking about books, so many fields relating to other formats will not be included.) I'll add the related MARC fields as I get a chance.

Function: Scholarship
Need: A thorough description of the edition in question. This will include authors, titles, physical description, and series information.

Function: Metasearch
Need: To be able to combine searches with the same data elements in library catalogs. Generally this means "headings" from the bibliographic record (authors, titles, subject headings).

Function: Collection development
Need: To use GBS to fill in gaps (or make comparisons) in a library's holdings, usually using classification numbers.

Function: Linking to other bibliographic collections or databases
Need: Identifiers and headings that may be found in other collections that would allow linking.

Function: Computation
Need: Data elements that can mark a text in time and space (date and place of publication), as well as those that can help segment the file, like language. This function also may need to rely on combining editions into groupings of Works, since this research may need to distinguish Works from Manifestations. Computation will most likely use metadata as a controlled vocabulary, and the full text of the work as the "meat" of the research.
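
As a rough illustration of the Computation function, the sketch below segments a handful of records by language using coded values rather than transcribed text. It assumes the MARC 21 book layout in which the 008 field carries the place of publication in positions 15-17 and the language in positions 35-37. The record identifiers and the make_008 helper are hypothetical and exist only to build sample data.

```python
from collections import defaultdict
from typing import Tuple


def place_and_language(field_008: str) -> Tuple[str, str]:
    """Return (place code, language code) from 008 positions 15-17 and 35-37."""
    return field_008[15:18].strip(), field_008[35:38].strip()


def make_008(date1: str, place: str, language: str) -> str:
    """Build a 40-character 008 for illustration; unused positions are blank."""
    return ("900101" + "s" + date1 + "    " + place.ljust(3)
            + " " * 17 + language + " " + "d")


# Hypothetical records keyed by a made-up identifier.
records = {
    "book-1": make_008("1956", "nyu", "eng"),
    "book-2": make_008("1924", "fr", "fre"),
    "book-3": make_008("1960", "cau", "eng"),
}

# Segment the file by language code.
by_language = defaultdict(list)
for record_id, field_008 in records.items():
    place, language = place_and_language(field_008)
    by_language[language].append(record_id)

for language, ids in by_language.items():
    print(language, ids)
```

None of this works if the fixed fields are dropped when the library records are loaded, which is the point of listing them as a need.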

Tuesday, September 08, 2009

GBS, according to Amazon

When I first read the settlement agreement between Google, the AAP and the Authors Guild, I immediately thought: "Wow. Jeff Bezos must be freaking out!" Because it is obvious that the settlement, as written, sets up a bookselling operation of unprecedented proportions. It also does so in a way that makes it hard if not impossible for any other company to compete in certain areas, particularly in relation to works that are out of print but not out of copyright.

Amazon has responded to the proposed settlement with a document for the court. (The document for Amazon was authored by David Nimmer, known for "Nimmer on Copyright", the primary text on the topic of US copyright -- and which sells for over $2,000. When it comes to "big guns" it's hard to get any bigger.) The document makes four major points relating to the settlement. I will paraphrase them here, but if you have an interest in what Amazon has to say you must read the document yourself, because my analysis undoubtedly reflects my non-expert reading of it.

  1. The settlement should be rejected because it makes changes to copyright law that should be decided by Congress, not a lawsuit.

  2. The settlement should be rejected because the Book Rights Registry that it creates is a cartel of rights holders, and violates anti-trust law.

  3. The settlement must be rejected because its expropriation of orphan works violates the copyright act.

  4. The settlement must be rejected because it would release Google from liability for future actions.

All of these seem like good arguments to me, but I am especially taken by the fourth one. The Amazon document explains in some detail that class action here is being used to allow future actions that are not part of the complaint.
"A class action settlement can only extinguish claims that arise from the same factual predicate as the class claims.... Future claims for future conduct cannot be released by a settlement agreement because they are not part of the same factual predicate as the purported claims." p. 35
What this says, in my interpretation, is that Google is being taken to court by the AAP and AG because it has, in the past, scanned and OCR'd books that are in copyright without asking permission of the rights holders. Yet, the settlement addresses actions that Google has not yet taken, such as the sale of institutional subscriptions, consumer sales of access to books, and a variety of possible revenue models such as print on demand. This is not redress for violation of rights but a kind of blanket agreement that gives Google rights over the materials for future developments.
"The sale of books or subscriptions to a database of scanned works is conduct in which Google has not yet engaged and, because of criminal sanctions, likely would never engage without a clear license to do so." p. 39
Nimmer's analysis seems to be that this is not appropriate in a lawsuit, and especially one in which members of the class are giving up future rights that cannot even be enumerated. The hypothetical example reads:
"... let us imagine that Google has already scanned Lonesome Dove and included it in the Google Books Program, that Technology X is invented in 2016, and that Google decides in 2020 to inaugurate widescale expoitation of books via that new technology including Lonesome Dove. To the extent that author Larry McMurtry objects to that exploitation in 2021 (in the same way that previous litigation contested the scope of his grant of books rights to his publisher in Lonesome Dove at the dawn of the age of audio books), a dispute may develop between author and publisher. The Settlement Agreement goes out of its way to immunize Google from any liability for copyright infringement under those circumstances." p. 39 footnote 29
I can neither confirm nor dispute this analysis, but there is something very frightening about giving up (or assigning, depending on how you see it) rights for an indefinite future when we have no idea what that future will bring. The Amazon comments have interpreted the settlement as having overly expansive concessions to Google that could have unintended consequences in the future.

Monday, September 07, 2009

GBS and Bad Metadata

Ever since Geoffrey Nunberg got up at the Google Books Colloquium at Berkeley on August 28, 2009, and showed the audience how bad the Google Books metadata is (see "Google's Book Search: A Disaster for Scholars," his article in The Chronicle of Higher Education, and "Google Books: The Metadata Mess," the slide presentation from the conference at UC Berkeley), some parts of the academic world have been buzzing about the topic.

Google representatives claim that their data comes from libraries and from other sources, but it is easy to show that Google is not including the library's bibliographic record in GBS. It might just be seen as a short-sighted decision on their part not to keep all of the data from the MARC records supplied by the libraries. After all, which of these do you think makes the most sense to the casual reader:
12 pages
12 p. 27 cm.
However, there is some evidence that Google is missing parts of the library bibliographic record. Here are some examples of subjects from GBS and the records from the very libraries that supplied the works:

GBS:
Indians of North America
Indian baskets

Library records:
Indians of North America -- Languages.
Indians of North America -- California
Indian baskets -- North America

This is the same pattern that appeared in the records released by the University of Michigan for their public domain scanned books -- only the $a of the 6XX field was included. (I wrote about this earlier.) Many other fields are also excluded from those Michigan records, and one has to wonder if the same was true of the records received/used by Google.
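
The effect of keeping only $a can be sketched in a few lines of Python. This is an illustration under assumptions, not a description of how Google or Michigan processed the records: the subject fields are modeled as simple lists of (subfield code, value) pairs, and the $x and $z codes assigned to the subdivisions are my guess at how the headings above would be coded.

```python
def full_heading(subfields):
    """Join all subfield values with ' -- ', as in the library display."""
    return " -- ".join(value for _code, value in subfields)


def main_heading_only(subfields):
    """Keep only subfield $a, dropping subdivisions such as $x and $z."""
    return " -- ".join(value for code, value in subfields if code == "a")


# Hypothetical 650 fields modeled on the library headings shown above.
subject_fields = [
    [("a", "Indians of North America"), ("x", "Languages")],
    [("a", "Indians of North America"), ("z", "California")],
    [("a", "Indian baskets"), ("z", "North America")],
]

library_view = [full_heading(field) for field in subject_fields]

# Dropping the subdivisions collapses distinct headings into duplicates;
# dict.fromkeys removes the duplicates while keeping the original order.
a_only_view = list(dict.fromkeys(main_heading_only(f) for f in subject_fields))

print(library_view)
print(a_only_view)
```

Three distinct library headings reduce to the two broad ones, which matches the shorter list of subjects shown for the same books.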

I know that it is possible to retrieve the full library records for the books because the Open Library is using this technique to retrieve bibliographic data for the public domain books scanned by Google. Google is obviously capable of doing this, yet chooses not to.

This leaves us with a bit of a mystery, although I think I know the answer. The mystery is: why would Google only use limited metadata from the participating libraries? And why won't they answer the question that I asked at the Conference: "Do you have a contract with OCLC? And does it restrict what data you can use?" Because if the answer is "yes and yes" then we only have ourselves (as in "libraries") to blame. And Nunberg and his colleagues should be furious at us.