Tuesday, August 29, 2006

The dotted line

The University of California has released its agreement with Google ("uc:" in quotes below). Because UC is a public institution, all such contracts must be made publicly available on request. We similarly have access to the University of Michigan agreement ("um:" in quotes below), which gives us the ability to do some comparison.

As is always the case, the language of the contract does not entirely reveal the intentions of the Parties. It is instead a strange almost-Shakespearean courtship where neither Party wishes to say what they really want, and everyone pretends like it's all so wonderfully fine, while at the same time each player is hoping to pull the wool over the eyes of the other. So my interpretation here may reveal more about my own assumptions than any truth about the contracts.

[Note: I apologize for any typos, but I had to transcribe much of this from the PDF files, which did not allow for text copy. If you see errors, let me know and I'll fix them.]

Quality Control

The Michigan contract gives the library the quality control review function, and allows the library to actually hold up the digitizing process if its quality requirements are not met:
um: 2.4 Digitizing the Selected Content. Google will be responsible for Digitizing the Selected Content. Subject to handling constraints or procedures specified in the Project Plan, Google shall at its sole discretion determine how best to Digitize the Selected Content, so long as the resulting digital files meet the benchmarking guidelines agreed to by Google and U of M, and the U of M Digital Copy can be provided to U of M in a format agreed to by Google and U of M. U of M will engage in ongoing review (thorough sampling) of the resulting digital files, and shall inform Google of files that do not meet benchmarking guidelines or do not comply with the agreed-upon format, U of M may stop new work until this failure can be rectified.
Perhaps Google has learned its lesson about trying to meet the standards of libraries, because UC's contract is notably silent on the QC topic. The paragraph that begins ...
uc: 2.4 Digitizing the Selected Content. Subject to handling constraints or procedures specified in the Project Plan, Google shall in its sole discretion determine how best to Digitize the Selected Content.
... then goes on to talk about Google's responsibility in taking care of UC's books, and its promise to replace any that get damaged. Nothing more about the digital files that result.

[Note: Peter Brantley points out section 4.7.1, which refers to image standards in line with those established as a community of library partners, and lets UC QA up to 250 books a month to "assess quality." However, there is no stated recourse, so UC and Google are relying on each other's good intentions here.]

What the Libraries Get

In this case, UC seems to have learned from the past experience of others, and negotiated for more from Google:
um: 2.5 ... the U of M Digital Copy will consist of a set of image and OCR files and associated information indicating at a minimum (1) bibliographic information consisting of the title and author of each Digitized work, (2) which image files correspond to that Digitized work, and (3) the logical order of these image files.
uc: 4.7 University Digital Copy. Unless otherwise agreed by the Parties in writing, the "University Digital Copy" means the digital copy of the selected content that is Digitized by Google consisting of (a) a set of image and OCR files, (b) associated meta-information about the files including bibliographic information consisting of title and author of each Digitized work and technical information consisting of the date of scanning the work, information about which image files correspond to what digitized work, and information pertaining to the logical order of image files that make up a Digitized work, (c) a list of works that are supplied for Digitization but not actually Digitized, and (d) the image coordinates for each Digitized Work ("Image Coordinates"); provided that Image Coordinates will only be provided (i) so long as University complies with the volume commitments set forth in Section 2.2 and (ii) pursuant to the restrictions on University's use and distribution of such Image Coordinates set forth in Section 4.10.
The "Image Coordinates" are what make it possible to locate a word in a page image, for example for the purpose of highlighting the query word on the screen. Michigan didn't get these coordinates with its files, and possibly the other four original Google library partners did not, either. We'll look at Section 4.10 and its limitations in a moment, but the "volume commitments" in 2.2 say that:
uc: University will use reasonable efforts to provide or provide Google with access to no less than three thousand (3,000) books (or such amount that is mutually agreed to by the Parties) of Selected Content per day to Digitize commencing on the sixty-first (61st) day after the Effective Date...
So it sounds like the University really, really, really wanted the coordinates and Google really, really, really wanted to make sure that the University would not drag its heels in terms of providing the books to Google. So these two inherently unrelated desires became bargaining chips.
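To make concrete why the coordinates matter, here is a minimal sketch of how word-level coordinates could drive query highlighting on a page image. The contract does not describe the actual format of Google's Image Coordinates, so the `WordBox` structure, its field names, and the sample values are all assumptions made purely for illustration.

```python
from typing import List, NamedTuple


class WordBox(NamedTuple):
    """Hypothetical record: one OCR word and its rectangle on the page image."""
    text: str
    x: int       # left edge, in pixels (assumed unit)
    y: int       # top edge, in pixels (assumed unit)
    width: int
    height: int


def boxes_for_query(words: List[WordBox], query: str) -> List[WordBox]:
    """Return the boxes whose word matches the query, ignoring case and
    surrounding punctuation. A viewer would draw a highlight over each box."""
    target = query.casefold()
    punctuation = '.,;:!?"\'()'
    return [w for w in words if w.text.strip(punctuation).casefold() == target]


# Invented sample data for a single page.
page = [
    WordBox("The", 10, 10, 30, 12),
    WordBox("Library,", 45, 10, 60, 12),
    WordBox("library", 10, 30, 55, 12),
]
hits = boxes_for_query(page, "library")
```

Without the coordinates, the OCR text can still be searched, but the viewer has no way of knowing where on the image the matching word sits.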

Using the Files

Both contracts state that Google owns the Digital Copy, and make it clear that neither Google nor the library is claiming any ownership of the underlying texts that have been digitized. This seems to be in keeping with US copyright law, although there is the inherent difficulty that occurs when you display a digital version of a public domain resource on a screen. At that point, any controls desired by the owner of the digital file are hard to enforce. In the Michigan contract, Google basically states that the university will not allow wholesale downloading of the files, and will attempt to prevent any downloading for commercial purposes (as if they could tell that from a download action):
um: 4.4.1 Use of U of M Digital Copy on U of M Website. U of M shall have the right to use the U of M Digital Copy, in whole or in part at U of M's sole discretion, as part of services offered on U of M's website. U of M shall implement technological measures (e.g., through use of the robots.txt protocol) to restrict automated access to any portion of the U of M Digital Copy or the portions of the U of M website on which any portion of the U of M Digital Copy is available. U of M shall also make reasonable efforts (including but not limited to restrictions placed in Terms of Use for the U of M website) to prevent third parties from (a) downloading or otherwise obtaining any portion of the U of M Digital Copy for commercial purposes, (b) redistributing any portions of the U of M Digital Copy, or (c) automated and systematic downloading from its website image files from the U of M Digital Copy. U of M shall restrict access to the U of M Digital Copy to those persons having a need to access such materials and shall also cooperate in good faith with Google to mutually develop methods and systems for ensuring that the substantial portions of the U of M Digital Copy are not downloaded from the services offered on U of M's website or otherwise disseminated to the public at large.

There are a number of interesting items in the above paragraph. First, that a robots.txt file is considered a "technological measure." In fact, it is at best a gentleman's agreement; there is nothing that forces you to abide by the instructions in the robots.txt file, so the less gentlemanly are not stopped from accessing the items that it calls "disallowed." It also is theoretically only a message center for web crawlers, and not a way to limit access to sections of one's web site. More robust technology must be employed for that. The next is that access is to be restricted to "those persons having a need to access such materials," which is about the vaguest access condition that I can imagine. How could any of us show that we have such a need in relation to an information resource? Well, if nothing else, that language doesn't appear in the UC contract, so maybe they've re-thought that particular requirement.
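The "gentleman's agreement" nature of robots.txt can be seen in a short sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `/digital-copy/` path, the host name, and the `ExampleBot` agent name are all invented for illustration; nothing here reflects how Michigan or UC actually configure their sites.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a library site might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /digital-copy/
"""


def polite_crawler_may_fetch(url: str, agent: str = "ExampleBot") -> bool:
    """Answer the question a well-behaved crawler asks itself before fetching.

    The check runs entirely on the crawler's side. The server is never
    involved, so a crawler that skips this call is not blocked by anything.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


blocked = polite_crawler_may_fetch("https://library.example.edu/digital-copy/page1.png")
open_page = polite_crawler_may_fetch("https://library.example.edu/about.html")
```

Here `blocked` is False and `open_page` is True, but only because the crawler chose to ask. Actually restricting access would take something enforced by the server, such as a login or IP-based controls.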

Next, the contract allows Michigan to use the digital files to provide services to its consortial partners, but basically leaves it up to Michigan to get the proper agreements from those libraries:
um: 4.4.2 Use of U of M Digital Copy in Cooperative Web Services. Subject to the restrictions set forth in this section, U of M shall have the right to use the U of M Digital Copy, in whole or in part at U of M's sole discretion, as part of services offered in cooperation with partner research libraries such as the institutions in the Digital Library Federation. Before making any such distribution, U of M shall enter into a written agreement with the partner research library and shall provide a copy of such agreement to Google, which agreement shall: (a) contain limitations on the partner research library's use of the materials that correspond to and are at least as restrictive as the limitations placed on U of M's use of the U of M Digital Copy in section 4.4.1; and (b) shall expressly name Google as a third party beneficiary of that agreement, including the ability for Google to enforce the restrictions against the partner research library.
Another possible learning experience, or perhaps just a result of the particular negotiations between UC and Google, but the UC/Google contract is much more specific about uses of the files and the agreements that are required for UC to exchange part or all of the files with other parties. It does limit access to the digital files to UC Library patrons, which means that these will probably be treated similarly to licensed resources in the Library, which require a user ID login for access. Although the UC contract also contains the "robots.txt" language, it adds some stronger wording about creating technological protection measures for the files:
uc: 4.9 Use of University Digital Copy. University shall have the right to use the University Digital Copy, in whole or in part at University's sole discretion, subject to copyright law, as part of services offered to the University Library Patrons. University may not charge, receive payment or other consideration for the use of the University Digital Copy except that University may charge for use of any services supplemental to the original work that the University supplies that add value to the University Digital Copy (for example, University may charge University Library Patrons for access to annotations to works from professors and scholars but the original work will always be accessible without a fee), and to recover copying costs actually incurred. University agrees that to the extent it makes any portion of the University Digital Copy publicly available, that it will identify the works, in a statement on a web page or other access point to be mutually agreed to by the Parties, as "Digitized by Google" or in a substantially similar manner. University shall implement technological measures (e.g., through use of the robots.txt protocol) to restrict automated access to any portion of the University Digital Copy, or the portions of the University website on which any portion of the University Digital Copy is available. University shall also prevent third parties from (a) downloading or otherwise obtaining any portion of the University Digital Copy for commercial purposes, (b) redistributing any portions of the University Digital Copy, or (c) automated and systematic downloading from its website image files from the University Digital Copy. University shall develop methods and systems for ensuring that substantial portions of the University Digital Copy are not downloaded from the services offered on University's website or otherwise disseminated to the public at large.
University shall also implement security and handling procedures for the University Digital Copy which procedures shall be mutually agreed by the Parties. Except as expressly allowed herein, University will not share, provide, license, or sell the University Digital Copy to any third party.

The image coordinates, which UC seems to have "won" as a special deal, cannot be shared at all:
uc: 4.10 (a) University shall not share, provide, license, distribute or sell the Image Coordinates to any entity in any manner. University may use the Image Coordinates only as part of the University Digital Copy for the services provided to University Library Patrons set forth in Section 4.9 above.
What is particularly odd below in 4.10 (b) is that Google states that UC can distribute no more than 10% of the digital copy (which is, by definition, owned by Google), but it can distribute 100% of the digital copies of public domain works. I can imagine that UC insisted on this, but it seems to contradict the distinction that Google is making between the rights in the digital files created by Google and the rights in the underlying works.
(b) Subject to the restrictions contained herein, University shall have the right to distribute (1) no more than ten percent (10%) of the University Digital Copy (but not any portion of the Image Coordinates) to (i) other libraries and (ii) educational institutions, in each case for non-commercial research, scholarly or academic purposes and (2) all or any portion of public domain works contained in the University Digital Copy (but not any portion of the Image Coordinates) to research libraries for research, scholarly and academic purposes by those libraries and the faculty, students, scholars and staff authorized by said libraries to access their commercially licensed electronic information products. Any recipient of the University Digital Copy under this Section 4.10 is referred to herein as a "Recipient Institution." Prior to any distribution by University to a Recipient Institution, Google and the Recipient Institution must have entered into a written agreement on terms acceptable to Google governing the use of the University Digital Copy and that, among other things, provide an indemnity to Google. 
In addition, any distribution by University to a Recipient Institution is subject to a written agreement that (A) prohibits that Recipient Institution from redistributing without first obtaining the prior written consent of Google, (B) makes Google an express third party beneficiary of such agreement, (C) provides an indemnity to Google from the Recipient Institution for the Recipient Institution's use of the Selected Content, (D) contains limitations at least as restrictive as the restrictions on University set forth in Section 4.9, (E) contains limitations on the use of the University Digital Copy consistent with copyright law and the limitations set forth in clauses (1) and (2) above, and (F) requires each Recipient Institution, to the extent it makes any portion of the University Digital Copy publicly available, to identify the works, in a statement on the applicable web page or other access point, as "digitized by Google" or in a substantially similar manner.
Here I notice especially "(E) contains limitations on the use of the University Digital Copy consistent with copyright law" and I'm wondering to what this refers. It seems it either means that Google is asserting some intellectual property rights in the digital copies, or that they are reminding the University that it cannot re-distribute the digital copies beyond that allowed by fair use. Since the latter is a given, and not a matter of contract, it would appear that the first interpretation is correct. Yet I don't see a clear statement of Google's IP rights in the contract.

My final comment has to do with the fact that the licenses are for limited times. Michigan's extends until 2009, and UC's is stated as being for six years from the signing. Someone with more expertise in contract law will need to help me understand what this means for the restrictions given above. This may be clearer through a reading of the contracts, and I encourage anyone with the necessary skills to read them and let the rest of us know what some of this language means. Naturally, our concerns are about ownership and use, and getting a fair shake for library users.

10 comments:

JonathanNil said...

Calling robots.txt a "technological measure" sounds like an attempt to argue it's covered by DMCA, no? Or set up the assumption or expectation that it is, anyway, lay the groundwork for arguing it is in court.

If it's covered by the DMCA, then circumventing it would (possibly) be illegal.

Karen Coyle said...

That is the argument that is being made against the Internet Archive, but I don't believe we have a ruling on that yet.

Karen Coyle said...

As for my question about the expiration of the contract, someone pointed me to section 8.3:

"The following sections survive expiration or termination of this agreement: 1, 2.4, 2.5, 4 (excluding Section 4.6), 6, 8.3, 9, 10, and 11."

9, 10, and 11 are about warranties, liability, and indemnification. Four is the section on ownership and use of the digital copies, so it does appear that Google has perpetual ownership and that the restrictions on use are also perpetual. This, then, brings up the question of preservation of the files.

Sean said...

uc: University will use reasonable efforts to provide or provide Google with access to no less than three thousand (3,000) books (or such amount that is mutually agreed to by the Parties) of Selected Content per day to Digitize commencing on the sixty-first (61st) day after the Effective Date...

I suspect that this is probably something that wasn't a problem for UC. A large amount of the collection is out at the Regional Library Facilities and I imagine that much of the digitization operations will be based near or out of them.

Harrison said...

Thanks for doing this analysis, Karen. Very interesting stuff. I did find one inaccuracy in your write-up concerning access to licensed resources at UC. You state that:

[the contract] does limit access to the digital files to UC Library patrons, which means that these will probably be treated similarly to licensed resources in the Library, which require a user ID login for access.

In actuality, at Berkeley anyways, library patrons aren't required to provide a user ID to access most licensed resources as long as they originate from a valid campus IP address. In other words, members of the general public can enter a library and use one of our many public workstations to access licensed materials.

Peter Hirtle said...

You wrote:

"I had to transcribe much of this from the PDF files, which did not allow for text copy."

FYI, there is an HTML version at http://www.google-watch.org/foia/ucfoia.html.

You also wrote:

"That is the argument that is being made against the Internet Archive, but I don't believe we have a ruling on that yet."

It was just settled out of court. See my recent posting on blog.librarylaw.com.

Joseph Lorenzo Hall said...

Hi, I've posted an OCR'd version of the cooperative agreement here: http://dream.sims.berkeley.edu/~jhall/ucgoogle_cooperative_agreement_OCR.pdf.

AlexMoatti said...

Thank you very much. Serious work. Very interesting.

I have a question about the robots.txt that you mention: at a conference with the Bibliothèque municipale de Lyon (which works with Google) I heard the expression "no index", which should be put in each page of the library site to prevent indexing by other crawlers.

Are we talking about the same thing?

How effective is the "no index" instruction, in your opinion?

(Excuse my English.)

Karen Coyle said...

Alex -

My understanding of robots.txt v. "noindex" is that the latter is on an individual page, whereas the former refers to an entire directory. So it's a question of granularity. If you have a single file in a directory that you do not want indexed, you use a meta tag and "noindex"; if you want to exclude an entire directory, robots.txt is the best answer.

Neither of them provides technological controls -- that is, neither of them actually prevents indexing; they just let a responsible search engine know your preferences. Someone could ignore those preferences and index those items. However, social pressure on the Internet is fairly powerful, so I think that few honest search engines would risk the outcry that would result if the robots.txt and noindex were ignored.
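As a rough illustration of the per-page side of this, the sketch below shows how an indexer might look for a "noindex" request in a page's robots meta tag, using Python's standard `html.parser` module. The class name, function name, and sample HTML are invented for the example, and it does not describe how any particular search engine works.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a page carries <meta name="robots" content="...noindex...">."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        content = attributes.get("content") or ""
        directives = [d.strip().lower() for d in content.split(",")]
        if "noindex" in directives:
            self.noindex = True


def page_requests_noindex(html: str) -> bool:
    """True if the page asks not to be indexed. Honoring it is up to the indexer."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


tagged = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
untagged = "<html><head><title>Catalog</title></head></html>"
```

Note that the indexer has to fetch and parse the page before it can even see the tag, which underlines that this is a stated preference and not an access control.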

(p.s. You have absolutely no need to apologize for "your English" -- We native speakers should feel grateful that so many are willing to learn our awkward, rather illogical language!)
(p.p.s. I am very glad that you found my "post" useful!)