Saturday, November 22, 2008

More on Google/AAP

Here are some more bits and thoughts on the agreement between Google and the AAP.

Library Involvement

Some librarians were involved in the settlement talks. The only one I have found so far who has come out about this is Georgia Harper. The librarians were working under a non-disclosure agreement (NDA), and therefore will not be able to reveal any details of the discussions. I have heard statements from others who I believe were privy to the negotiations, and they all seem to feel that the outcome was better for libraries due to the involvement of members of our "class." (Note that Google and AAP had high-end lawyers arguing their side, and we had hard-working librarians. I don't know how many of "our" representatives were also lawyers, but you can just imagine how greatly out-gunned they were.) Unfortunately that doesn't change my mind about the bait and switch move.

Google Books as Library

Some have begun to refer to Google Books as a library. We have to do some serious thinking about what the Google Book database really is. To begin with, it's not a research collection, at least not at this point. It's really a somewhat odd, almost random bunch of book "stuff." As you know, neither Google nor the libraries are selecting particular books for digitization. This is a "mass digitization" project that starts at one end of a library and plows through blindly to the other end. Some libraries have limited Google to public domain works, so in terms of any area of study there is an artificial cut-off of knowledge. Not to mention that some libraries, mainly the University of California, have been working with Google primarily to digitize books in their two storage facilities; that is, they have been digitizing the low use books that were stored remotely.

So the main reason why Google Books is not a library is that it isn't what we would call a "collection." The books have not been chosen to support a particular discipline or research area. Yet it will become a de facto collection because people will begin using it for research. Thus "all human knowledge" becomes something more like the elephant and the blind men: research in online resources and research that uses print materials will get very different views of human knowledge. (This is not a new phenomenon. I wrote about this in terms of some early digital projects I was involved in.) One of the big gaps in Google Books will be current materials, those that are still in print. Google will need to convince the publishers that it can increase their revenue stream for current books in order to get them to participate.

Subscribing to Google Books: Just Say No?

Beyond the (undoubtedly hard-won by library representatives) single terminal access in each public library in the US, libraries will be asked to subscribe to the Google Book service in order to give their users access to the text of the books (not just the search capability). This is one of the more painful aspects of the agreement because it seems to ignore the public costs that went into the purchase, organization, and storage of those works by libraries. (I'm not including privately funded libraries here, but many of the participants are publicly funded.) The parallels with the OCLC mess are ironic: libraries paying for access to their own materials. So, couldn't the libraries just refuse to subscribe? Not really. Publicly funded libraries have a mission to provide access to the world's intellectual output in a way that best serves their users. When something new comes along -- films on DVD, music on CD, the Internet -- libraries must do what they can to make sure that their users are not informationally underprivileged. Google now has the largest body of digitized full text, and there will be a kind of "information arms race" as institutions work to make sure that their users can compete using these new resources.

The (Somewhat Hidden) Carrot

I can't imagine that anyone thought that libraries and Google were digitizing books primarily so that people could read what are essentially photographs of book pages on a computer screen. Google initially stated that they were only interested in searching the full text of books. While interesting in itself, keyword searching of rather poor OCR text is not a killer app. What we gain by having a large number of digitized books is a large corpus on which we can do computational research. We can experiment with ideas like: can we follow the flow of knowledge through these texts? Can we create topic maps of fields of study? Can we identify the seminal works in some area? The ability to do this research is included in the agreement (section 7.2(d), The Research Corpus). There will be two copies of this corpus allowed under the agreement, although I don't see any detail as to what the "corpus" will consist of. Will it just be a huge file of digitized books and OCR? Will it be a set of services?

I have suspected for a while that Google was already doing research on the digital files that it holds. It only makes sense. For academics in areas like statistics, computer science, and linguistics, this corpus opens up a whole range of possibilities for research; and research means grants, and grants mean jobs (or tenure, as the case may be). This will be a strong motivation for institutions to want to participate in the Google Book product. Research will NOT be limited to participants; others can request access. What I haven't yet found is anything relating to pricing for the use of the research collection, or whether being a participating library grants less expensive access for your institution. If the latter is the case, then one motivation for libraries to agree to allow Google to scan their books (at some continuing cost to the library) will be that it favors the institution's researchers in this new and exciting area. Full participant libraries (the ones that get to keep the digital copies of their works) can treat their own corpus as research fodder. The other costs of being a full participant are such that I'll still be surprised if any libraries go that route, but if they do I think that this "hidden carrot" will be a big part of it.


There's lots of good blogging going on out there on this topic. It needs a cumulative page to help people find the posts. Please tell me you have time to work on that, so I don't have to take it on! (Or that it exists already and I've missed it.) (The PureInformation Blog has a good list.)

Note: the Internet Archive/OCA may take this on. I'll post if/when they do.


Jonathan Rochkind said...

At DLF, the GBS project manager was there and gave a talk. Can't remember his name, but he was smart and a good speaker. But anyway, he said that as far as Google and the publishers were concerned, there would be no charge for access to the research corpus--which can only be used for 'non-consumptive' purposes, which basically means data analysis, not reading of individual titles for research purposes.

He said it would be up to the yet to be determined two hosts of the research centers to decide how to control or provide access. I think it's theoretically possible they could charge for it on a 'cost recovery' basis, but I doubt they will unless 'we' (it's unclear exactly how they will be chosen, but he said it would be up to the participating libraries) choose poorly.

Jonathan Rochkind said...

He also made it clear that access to the research corpus--again as far as Google and the publishers' org were concerned--would NOT be limited to participating libraries, in any way. Anyone can have access as far as they are concerned, although the research corpus hosts may limit access for cost or administrative purposes; it would be up to them. The agreement doesn't put any limits on who can have access, as long as they are using it for 'non-consumptive' purposes. It also doesn't _require_ that anyone in particular have access; that choice is left to the yet-to-be-determined 'centers'.

Jonathan Rochkind said...

And I didn't realize that library representatives were involved at all in the negotiations, that's interesting, and perhaps explains some of the pro-library things in the agreement that don't meet the immediate interests of Google in the agreement.

I should certainly hope that there was library counsel involved. If the library representatives didn't forcefully negotiate for their representatives to include library counsel experienced in Intellectual Property law (or consortial counsel--perhaps someone like the ARL counsel who posted the GBS summary)--then, again, I'd blame libraries. But perhaps that ARL lawyer was in fact himself involved, one could hope.

If you go into business negotiations without taking them seriously, without having people on your side with the right expertise, without thinking about your long-term strategic interests--then you get what you deserve. Which seems to have been a problem with initial Google partner agreements, as extensively written about by Peter Brantley.

The security provisions in the agreement are definitely a burden, and obviously something the AAP put in, not Google. The limitation to only exactly _two_ research centers seems unfortunate to me too, although it's better than one.

Google, incidentally, plans to fund the startup of those research centers, I forget exactly how much money, but it was a substantial amount. According to the Google guy at DLF. That was not something in the settlement, just something Google decided to do.

If I remember correctly, Google also plans to pay for the printing charges the settlement says have to be paid, for any printing at academic libraries (maybe only partners? Not sure) for the first two years, up to a maximum amount. I don't think that's mentioned in the settlement, just something Google plans to do.

Karen Coyle said...

Thanks, Jonathan. I'm glad to hear from someone who was at DLF and heard the GBS talk.

It's in the agreement that anyone can use the corpus, they just have to apply -- rather like you would to enter an archive. I'm glad to hear that they aren't expecting to charge for use, but it so far isn't clear to me who will host the research corpus and how they will pay for it, although it will require Google's approval to be a host.

p. 15 defines a qualified user of the corpus and says: "A for-profit entity may only be a “Qualified User” if both the Registry and Google give their prior written consent."

And on page 81 we get: "(viii) No Commercial Use. Except with the express permission of the Registry and Google, direct, for profit, commercial use of information extracted from Books in the Research Corpus is prohibited."

Hmmm. Does that mean that Google can determine which for-profits can compete in this area? And Google itself can become a third host for the research corpus. Are they not allowed to make use of it in a way that could be construed as commercial?

And in the "shades of OCLC" category, p. 82 has:

"(ix) Use of Data. Use of data extracted from specific Books within the Research Corpus to provide services to the public or a third party that compete with services offered by the Rightsholder of those Books or by Google is prohibited."

So there is the "you can't compete with Google" clause.

The bats in my belfry are ringin' dem bells! Intruder alert!

Anonymous said...

I want to respond to the other parts of your commentary, Karen, and hopefully will get the chance to do so soon. But I did want to add my own plea for a consolidation of commentary about the GBS settlement. I have the intention to do it, and a whole slew of bookmarked items, but as of yet have not had the time to start...

Jonathan Rochkind said...

I think it's a fairly safe bet that the U of Michigan's HathiTrust will be one of the two research corpus locations, as they've been doing the legwork.

I have to process how HathiTrust getting in on the settlement like that might or might not limit things HathiTrust can do that I was hoping they would do! But it was always clear to me that HathiTrust's _main_ goal was always this sort of data mining analysis research, not the patron interfaces that I actually wanted from them! But getting in on the agreement might make those things--that before HathiTrust was interested in throwing in too as a kind of bonus--less likely. Also, HathiTrust had not previously planned on _limiting_ itself to Google-digitized books. Not sure if becoming a settlement-recognized host ends up making it more complicated to host non-Google texts too.

The Google guy at DLF said that it was up to 'the community' (of libraries) to identify the centers, but of course there's no formal process for 'the community' to make such a decision, and ultimately it's up to Google. But I have no reason to believe they're interested in micro-managing this selection or what the hosts do after they are selected. (The publishers, on the other hand, are interested in micro-managing security, thus the terms in the settlement.)

But HathiTrust had already started becoming such a host even before the settlement, I'd be shocked if they weren't chosen as one of the two centers.

Karen Coyle said...


Are you implying (or saying) that Hathi Trust folks (e.g. Michigan) were among the library folks negotiating with Google, and therefore had privileged information that might have led them to form the trust?

Erik Hetzner said...

Google is indeed doing research on the corpus. See, for instance, the work of Bill N. Schilit and Okan Kolak at JCDL 2008 and HT 2008 on mining quotations.

As far as I am concerned, the corpus should be made available without distinction to any researcher with a valid plan - certainly the public domain content should be. It would be a great tragedy if only Google were to be able to exploit this. On the other hand, I don't yet see the libraries being able to make this corpus available to researchers.

Jonathan Rochkind said...

No, I did not mean to imply that. I was considering that perhaps HathiTrust was involved in the negotiations, and that may have helped lead to the allowance for something like HathiTrust. Which I don't consider to be anything nefarious; I'd hope they were involved in the negotiations, and it would be good that they ensured their plans were provided for in them.