We have recently seen a court case that decided that HathiTrust's use of digitized books to provide an index to those books is fair use. There is another court case that will decide a similar question regarding Google's digitization of books for its Google Book Search. Note, however, that even if both of these are determined to be fair use, each is a particular situation in a particular context. Both organizations have developed their services in an attempt to meet what they judged to be the letter of the law, and yet there is a considerable difference in the services they provide.
HathiTrust stores copies of digitized books from the collections of member libraries. In this case, HT is not itself doing the digitization but is storing files for books mostly digitized by Google. A search in the full text database of OCR'd page images returns, for in-copyright items, the page numbers on which the terms were found, and the number of hits found on each page. There are no snippets and no view of the text unless the text itself is deemed to be out of copyright.
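To make that result format concrete, here is a minimal sketch of a search that reports only page numbers and hit counts, never the text itself. It is an illustration under stated assumptions, not HathiTrust's actual implementation: the `page_hits` function and the dictionary of OCR'd page text are hypothetical, and a real full-text index would not scan pages one by one.

```python
import re


def page_hits(pages, term):
    """Return {page_number: hit_count} for pages where the term occurs.

    `pages` is a hypothetical mapping of page number to OCR'd text.
    Only numbers and counts are returned; no text is exposed.
    """
    # Whole-word, case-insensitive match (an assumption made for this sketch).
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = {}
    for number, text in pages.items():
        count = len(pattern.findall(text))
        if count:
            hits[number] = count
    return hits


# Made-up sample pages, for illustration only.
sample_pages = {
    1: "Iron ore and iron tools.",
    2: "Nothing relevant here.",
    3: "An ironclad ship; iron.",
}
result = page_hits(sample_pages, "iron")
print(result)
```

The searcher learns where the term appears and how often, but sees none of the surrounding words.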
Google has a different approach. To begin with, Google has performed mass digitization of books (estimated at about 20 million) without first obtaining permission from rights holders. So the Google case includes the act of digitization itself, whereas the HathiTrust case begins with digital files obtained from Google, which means the act of digitizing was not a factor there. In terms of use of the digitized works, Google also provides keyword searching of the OCR'd digital images, but takes a different approach to the results that searchers can view. Google provides short (about 3-5 lines) snippets that show the search terms in context on a page.
Google, however, places specific restrictions to avoid letting users "game" the search to gain access to enough of the text to substitute for actually acquiring access to the book. Here is how Google describes this in its recent legal response:
"The information that appears in Google Books does not substitute for reading the book. Google displays no more than three snippets from a book in response to a search query, even if the search term appears many times in the book. ... Google also prevents users from view a full page, or even several contiguous snippets, by displaying only one snippet per page in response to a given search and by 'blacking' (i.e. making unable for snippet view in response to any search) at least one snippet per page and one out of ten pages in a book." p.8Google also exempts some types of books, like reference works, cookbooks, and poetry, from snippet display entirely.
The differences in the results returned by these two services reflect the differences in their contexts and their goals. HathiTrust has member institutions and their authorized users. The collection within HathiTrust reflects the holdings of the member institutions' libraries, which means that the authorized users should have access, either in their library or through inter-library loan, to the physical book that was scanned. The HathiTrust full text is a search on the members' "stuff." The decision to give only page numbers makes some sense in this context, although providing snippets to scholars might have been acceptable to the judge. The return of page numbers and hit counts within pages reflects, IMO, the interest in quantitative analysis of term use. It also gives scholars some idea of the weight the term has within the text.
Google's situation is different. Google has no institutions, no members, no libraries; it provides its service to the general public (at least to the US public). There is no reason to assume that all of the members of that public will have access to the hard copy of any particular digitized book. Google seems to have decided that promoting its service as having primarily a marketing function, with the snippets as "teasers," would mollify the various intellectual property owners. In its brief of November 9, Google reiterates that it does not put advertising on the Google Book Search results pages, nor does Google make any money off of its referrals to book purchasing sites.
So here are two organizations that have bent over backwards to stay within what they deemed to be the boundaries of fair use, and they have done so in significantly different ways. This means that the fair use determination of each of these could have different outcomes, and each will provide different clues as to how fair use is viewed for digitized works.
It of course bears mentioning that both of these solutions create hurdles for users. The HathiTrust user who is searching on a term that could have more than one meaning ("iron," "dive," "foot") does not have any context to help her understand if the results are relevant. The Google user, on the other hand, gets some context but cannot see all of the results and therefore does not know if there are key retrievals among those that have been blocked algorithmically. A use that is "fair" within copyright law may not seem "fair" to the user who is doing research. It makes you wonder if our idea of "fair use" couldn't be extended to be fair but also "useful."
Related posts
http://kcoyle.blogspot.com/2012/10/copyright-victories-part-ii.html