Tuesday, August 29, 2006

The dotted line

The University of California has released its agreement with Google ("uc:" in quotes below). Because UC is a public institution, all such contracts must be made publicly available on request. We similarly have access to the University of Michigan agreement ("um:" in quotes below), which gives us the ability to do some comparison.

As is always the case, the language of the contract does not entirely reveal the intentions of the Parties. It is instead a strange almost-Shakespearean courtship where neither Party wishes to say what they really want, and everyone pretends like it's all so wonderfully fine, while at the same time each player is hoping to pull the wool over the eyes of the other. So my interpretation here may reveal more about my own assumptions than any truth about the contracts.

[Note: I apologize for any typos, but I had to transcribe much of this from the PDF files, which did not allow for text copy. If you see errors, let me know and I'll fix them.]

Quality Control

The Michigan contract gives the library the quality control review function, and allows the library to actually hold up the digitizing process if its quality requirements are not met:
um: 2.4 Digitizing the Selected Content. Google will be responsible for Digitizing the Selected Content. Subject to handling constraints or procedures specified in the Project Plan, Google shall at its sole discretion determine how best to Digitize the Selected Content, so long as the resulting digital files meet the benchmarking guidelines agreed to by Google and U of M, and the U of M Digital Copy can be provided to U of M in a format agreed to by Google and U of M. U of M will engage in ongoing review (thorough sampling) of the resulting digital files, and shall inform Google of files that do not meet benchmarking guidelines or do not comply with the agreed-upon format, U of M may stop new work until this failure can be rectified.
Perhaps Google has learned its lesson about trying to meet the standards of libraries, because UC's contract is notably silent on the QC topic. The paragraph that begins ...
uc: 2.4 Digitizing the Selected Content. Subject to handling constraints or procedures specified in the Project Plan, Google shall in its sole discretion determine how best to Digitize the Selected Content.
... then goes on to talk about Google's responsibility in taking care of UC's books, and its promise to replace any that get damaged. Nothing more about the digital files that result.

[Note: Peter Brantley points out section 4.7.1, which refers to image standards in line with those established as a community of library partners, and which lets UC QA up to 250 books a month to "assess quality." However, there is no stated recourse, so UC and Google are relying on each other's good intentions here.]

What the Libraries Get

In this case, UC seems to have learned from the past experience of others, and negotiated for more from Google:
um: 2.5 ... the U of M Digital Copy will consist of a set of image and OCR files and associated information indicating at a minimum (1) bibliographic information consisting of the title and author of each Digitized work, (2) which image files correspond to that Digitized work, and (3) the logical order of these image files.
uc: 4.7 University Digital Copy. Unless otherwise agreed by the Parties in writing, the "University Digital Copy" means the digital copy of the selected content that is Digitized by Google consisting of (a) a set of image and OCR files, (b) associated meta-information about the files including bibliographic information consisting of title and author of each Digitized work and technical information consisting of the date of scanning the work, information about which image files correspond to what digitized work, and information pertaining to the logical order of image files that make up a Digitized work, (c) a list of works that are supplied for Digitization but not actually Digitized, and (d) the image coordinates for each Digitized Work ("Image Coordinates"); provided that Image Coordinates will only be provided (i) so long as University complies with the volume commitments set forth in Section 2.2 and (ii) pursuant to the restrictions on University's use and distribution of such Image Coordinates set forth in Section 4.10.
The "Image Coordinates" are what make it possible to locate a word in a page image, for example for the purpose of highlighting the query word on the screen. Michigan didn't get these coordinates with its files, and possibly the other four original Google library partners did not, either. We'll look at Section 4.10 and its limitations in a moment, but the "volume commitments" in 2.2 say that:
uc: University will use reasonable efforts to provide or provide Google with access to no less than three thousand (3,000) books (or such amount that is mutually agreed to by the Parties) of Selected Content per day to Digitize commencing on the sixty-first (61st) day after the Effective Date...
So it sounds like the University really, really, really wanted the coordinates and Google really, really, really wanted to make sure that the University would not drag its heels in terms of providing the books to Google. So these two inherently unrelated desires became bargaining chips.
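
To see why the Image Coordinates matter, here is a minimal sketch of how word-level coordinates let a viewer highlight a query word on a page image. Everything in it is an assumption for illustration: the contracts do not describe the format of the coordinates, so the `OcrWord` structure, the pixel values, and the `highlight_boxes` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One OCR'd word plus its bounding box on the page image (hypothetical layout)."""
    text: str
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


def highlight_boxes(words, query):
    """Return an (x, y, width, height) box for every word that matches the query."""
    needle = query.casefold()
    return [
        (w.x, w.y, w.width, w.height)
        for w in words
        # Ignore surrounding punctuation and letter case when comparing.
        if w.text.strip(".,;:!?\"'()").casefold() == needle
    ]


# Made-up coordinates for a few words on one scanned page.
page = [
    OcrWord("Washington", 120, 80, 210, 32),
    OcrWord("in", 340, 80, 30, 32),
    OcrWord("domestic", 380, 80, 150, 32),
    OcrWord("life.", 540, 80, 70, 32),
]

boxes = highlight_boxes(page, "life")
```

Without the coordinates, a library holding only the image and OCR files would know that "life" occurs on the page but not where to draw the highlight.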

Using the Files

Both contracts state that Google owns the Digital Copy, and both make it clear that neither Google nor the library is claiming any ownership of the underlying texts that have been digitized. This seems to be in keeping with US copyright law, although there is an inherent difficulty when you display a digital version of a public domain resource on a screen: at that point, any controls desired by the owner of the digital file are hard to enforce. In the Michigan contract, Google basically requires that the university not allow wholesale downloading of the files, and that it attempt to prevent any downloading for commercial purposes (as if they could tell that from a download action):
um: 4.4.1 Use of U of M Digital Copy on U of M Website. U of M shall have the right to use the U of M Digital Copy, in whole or in part at U of M's sole discretion, as part of services offered on U of M's website. U of M shall implement technological measures (e.g., through use of the robots.txt protocol) to restrict automated access to any portion of the U of M Digital Copy or the portions of the U of M website on which any portion of the U of M Digital Copy is available. U of M shall also make reasonable efforts (including but not limited to restrictions placed in Terms of Use for the U of M website) to prevent third parties from (a) downloading or otherwise obtaining any portion of the U of M Digital Copy for commercial purposes, (b) redistributing any portions of the U of M Digital Copy, or (c) automated and systematic downloading from its website image files from the U of M Digital Copy. U of M shall restrict access to the U of M Digital Copy to those persons having a need to access such materials and shall also cooperate in good faith with Google to mutually develop methods and systems for ensuring that the substantial portions of the U of M Digital Copy are not downloaded from the services offered on U of M's website or otherwise disseminated to the public at large.

There are a number of interesting items in the above paragraph. The first is that a robots.txt file is considered a "technological measure." In fact, it is at best a gentleman's agreement; nothing forces you to abide by the instructions in the robots.txt file, so the less gentlemanly are not stopped from accessing the items that it calls "disallowed." It is also, in theory, only a message center for web crawlers, not a way to limit access to sections of one's web site. More robust technology must be employed for that. The next is that access is to be restricted to "those persons having a need to access such materials," which is about the vaguest access condition that I can imagine. How could any of us show that we have such a need in relation to an information resource? Well, if nothing else, that language doesn't appear in the UC contract, so maybe they've re-thought that particular requirement.
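
The voluntary nature of robots.txt is easy to demonstrate. The sketch below uses Python's standard `urllib.robotparser` module; the robots.txt contents, the `PoliteBot` name, and the `library.example.edu` URLs are made up for illustration and are not taken from either contract or either university's site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a library site might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /digital-copy/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler asks before fetching and skips what it is told to skip.
polite_may_fetch = parser.can_fetch(
    "PoliteBot", "https://library.example.edu/digital-copy/page-0001.png"
)
public_may_fetch = parser.can_fetch(
    "PoliteBot", "https://library.example.edu/about.html"
)

# Nothing here blocks a request. A client that never calls can_fetch()
# can still ask the server for the "disallowed" URL; only server-side
# controls (such as a login) would actually refuse it.
```

The check lives entirely in the crawler's own code, which is why the file works as a request rather than as a lock.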

Next, the contract allows Michigan to use the digital files to provide services to its consortial partners, but basically leaves it up to Michigan to get the proper agreements from those libraries:
um: 4.4.2 Use of U of M Digital Copy in Cooperative Web Services. Subject to the restrictions set forth in this section, U of M shall have the right to use the U of M Digital Copy, in whole or in part at U of M's sole discretion, as part of services offered in cooperation with partner research libraries such as the institutions in the Digital Library Federation. Before making any such distribution, U of M shall enter into a written agreement with the partner research library and shall provide a copy of such agreement to Google, which agreement shall: (a) contain limitations on the partner research library's use of the materials that correspond to and are at least as restrictive as the limitations placed on U of M's use of the U of M Digital Copy in section 4.4.1; and (b) shall expressly name Google as a third party beneficiary of that agreement, including the ability for Google to enforce the restrictions against the partner research library.
Whether it is another learning experience or just a result of the particular negotiations between UC and Google, the UC/Google contract is much more specific about uses of the files and the agreements that are required for UC to exchange part or all of the files with other parties. It does limit access to the digital files to UC Library patrons, which means that these will probably be treated similarly to licensed resources in the Library, which require a user ID login for access. Although the UC contract also contains the "robots.txt" language, it adds some stronger wording about creating technological protection measures for the files:
uc: 4.9 Use of University Digital Copy. University shall have the right to use the University Digital Copy, in whole or in part at University's sole discretion, subject to copyright law, as part of services offered to the University Library Patrons. University may not charge, receive payment or other consideration for the use of the University Digital Copy except that University may charge for use of any services supplemental to the original work that the University supplies that add value to the University Digital Copy (for example, University may charge University Library Patrons for access to annotations to works from professors and scholars but the original work will always be accessible without a fee), and to recover copying costs actually incurred. University agrees that to the extent it makes any portion of the University Digital Copy publicly available, that it will identify the works, in a statement on a web page or other access point to be mutually agreed to by the Parties, as "Digitized by Google" or in a substantially similar manner. University shall implement technological measures (e.g., through use of the robots.txt protocol) to restrict automated access to any portion of the University Digital Copy, or the portions of the University website on which any portion of the University Digital Copy is available. University shall also prevent third parties from (a) downloading or otherwise obtaining any portion of the University Digital Copy for commercial purposes, (b) redistributing any portions of the University Digital Copy, or (c) automated and systematic downloading from its website image files from the University Digital Copy. University shall develop methods and systems for ensuring that substantial portions of the University Digital Copy are not downloaded from the services offered on University's website or otherwise disseminated to the public at large.
University shall also implement security and handling procedures for the University Digital Copy which procedures shall be mutually agreed by the Parties. Except as expressly allowed herein, University will not share, provide, license, or sell the University Digital Copy to any third party.

The image coordinates, which UC seems to have "won" as a special deal, cannot be shared at all:
uc: 4.10 (a) University shall not share, provide, license, distribute or sell the Image Coordinates to any entity in any manner. University may use the Image Coordinates only as part of the University Digital Copy for the services provided to University Library Patrons set forth in Section 4.9 above.
What is particularly odd below in 4.10 (b) is that Google states that UC can distribute no more than 10% of the digital copy (which is, by definition, owned by Google), but it can distribute 100% of the digital copies of public domain works. I can imagine that UC insisted on this, but it seems to contradict the distinction that Google is making between the rights in the digital files created by Google and the rights in the underlying works.
(b) Subject to the restrictions contained herein, University shall have the right to distribute (1) no more than ten percent (10%) of the University Digital Copy (but not any portion of the Image Coordinates) to (i) other libraries and (ii) educational institutions, in each case for non-commercial research, scholarly or academic purposes and (2) all or any portion of public domain works contained in the University Digital Copy (but not any portion of the Image Coordinates) to research libraries for research, scholarly and academic purposes by those libraries and the faculty, students, scholars and staff authorized by said libraries to access their commercially licensed electronic information products. Any recipient of the University Digital Copy under this Section 4.10 is referred to herein as a "Recipient Institution." Prior to any distribution by University to a Recipient Institution, Google and the Recipient Institution must have entered into a written agreement on terms acceptable to Google governing the use of the University Digital Copy and that, among other things, provide an indemnity to Google. 
In addition, any distribution by University to a Recipient Institution is subject to a written agreement that (A) prohibits that Recipient Institution from redistributing without first obtaining the prior written consent of Google, (B) makes Google an express third party beneficiary of such agreement, (C) provides an indemnity to Google from the Recipient Institution for the Recipient Institution's use of the Selected Content, (D) contains limitations at least as restrictive as the restrictions on University set forth in Section 4.9, (E) contains limitations on the use of the University Digital Copy consistent with copyright law and the limitations set forth in clauses (1) and (2) above, and (F) requires each Recipient Institution, to the extent it makes any portion of the University Digital Copy publicly available, to identify the works, in a statement on the applicable web page or other access point, as "digitized by Google" or in a substantially similar manner.
Here I notice especially "(E) contains limitations on the use of the University Digital Copy consistent with copyright law" and I'm wondering to what this refers. It seems it either means that Google is asserting some intellectual property rights in the digital copies, or that they are reminding the University that it cannot re-distribute the digital copies beyond that allowed by fair use. Since the latter is a given, and not a matter of contract, it would appear that the first interpretation is correct. Yet I don't see a clear statement of Google's IP rights in the contract.

My final comment has to do with the fact that the licenses are for limited times. Michigan's extends until 2009, and UC's is stated as being for six years from the signing. Someone with more expertise in contract law will need to help me understand what this means for the restrictions given above. This may be clearer through a reading of the contracts, and I encourage anyone with the necessary skills to read them and let the rest of us know what some of this language means. Naturally, our concerns are about ownership and use, and getting a fair shake for library users.

Saturday, August 26, 2006

WebDewey: Keeping Users Uninformed

I was looking for a Dewey Decimal number to accompany a topic in an article I'm working on, and learned, although perhaps I should have known, that DDC is not available for open access. I wandered around OCLC's Dewey site, and came across the license that controls use of the WebDewey product. Some aspects of it surprised me.

The first was the definition of "Subscriber" in the WebDewey contract: "Subscriber means a library or not-for-profit information agency..." So does this mean that a corporate library cannot get a license to use the DDC? Or is it just that they must work only with the hard copy? What would be the purpose of limiting use to non-profits?

The next is from the grant of license. First, you are granted a license to use WebDewey to create bibliographic records, but "Such bibliographic records and metadata may display DDC numbers, but shall not display DDC captions." This basically eliminates the possibility of creating a rich classified display for a library. I find that it isn't enough to browse the shelf viewing only the classification number and the book titles, since the book titles alone do not reveal what the classification number means. I'd love to have a virtual shelflist that lets me know where I am, topically, and then shows me the titles in that area. But, no, you are not allowed to do that with the Dewey Classification... at least not unless you limit your display to "the DDC22 summaries," that is, the first three digits of the Dewey classification number. Since modern topics have necessitated a great lengthening of the Dewey numbers (for example, disaster relief efforts for earthquakes are classed in 363.3495095982, according to the Dewey Blog), being limited to the three-digit topics is nearly useless.
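
A rough sketch makes the loss concrete. It assumes, as described above, that the summaries correspond to the three digits before the decimal point; the `ddc_summary` helper is my own illustration and not part of WebDewey.

```python
def ddc_summary(number: str) -> str:
    """Return the three-digit summary class for a Dewey number (illustrative helper)."""
    return number.split(".")[0][:3]


full = "363.3495095982"   # the earthquake relief number cited above
summary = ddc_summary(full)

# Count how many digits of the classification are discarded.
digits_lost = len(full.replace(".", "")) - len(summary)
```

For the earthquake example, a display limited to the summaries can caption only "363"; the ten digits that actually narrow the topic are dropped.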

I realize that the DDC is a business, but the business of libraries is to inform, to help users find what they need, not to obscure our shelf order. Sheesh!

Friday, August 25, 2006

Do it yourself digital books

This user got tired of waiting for his book to appear on the Internet Archive so he just did it! Perhaps the Million Book Project just needs one million users like this:

Reviewer: papeters - 4 out of 5 stars - April 10, 2004
Subject: Good copy for PG

Tired of waiting for corrections, I got another copy of the book and made good scans. The book is available through PG at:

http://www.gutenberg.net/1/1/9/2/11926/11926-h/11926-h.htm (html)
or
http://www.gutenberg.net/1/1/9/2/11926/11926.txt (plain-text)

See this at http://www.archive.org/details/WashingtonInDomesticLife