Sunday, July 29, 2012

Fair Use Déjà Vu

In its July 27 court filing,[1] Google has made the case for its fair use defense of the digitization of books in its Google Book Search (GBS) project.[2] As many of us had hoped, the case it makes appears strong. That it was necessary to throw libraries under the bus to achieve this is unfortunate, but I honestly do not see an alternative that wouldn't have weakened the case a bit.

Fair Use is Fair

The argument that Google has made from the beginning of its book scanning project is that copying for the purpose of providing keyword access to full texts is fair use. Fortunately, they are able to cite case law in support of this, including cases that allowed image search engines to copy entire images.

Among the reasons that they give for their fair use defense are:

1. Keyword search is not a substitute for the text itself. In fact, the copy of the text is necessary to provide a means for users to discover the existence of books and therefore for the books to fulfill their purpose of being read.
"Books exist to be read. Google Books exists to help readers find those books. Like a paper index or a card catalogue, it does not substitute for reading the books themselves..." (p. 2)

2. Google has elaborate protections in place to prevent users from reconstructing the text from its products. They reveal some of these protections, such as disabling snippet display for one instance of the keyword on each page, and disabling display of one page out of ten.
"One of the snippets on each page is blacklisted (meaning that it will not be shown). In addition, at least one out of ten entire pages in each book is blacklisted." (p. 10)
3. No advertising appears on the GBS pages, so Google earns no revenue from them that authors could claim as rightfully theirs.

4. The Authors Guild has no proof of harm resulting from the digitization of the books; indeed, Google suggests that a thorough study might show gains rather than losses in book sales. Even the Authors Guild (the Plaintiff in this case) advises authors to provide some of the text of their books (usually the first chapter) for browsing in online bookstores, and many rights holders participate voluntarily in Amazon's "Look inside" feature, which shows considerably more than the disputed snippets displayed in GBS. Google also notes that 45,000 (!) publishers have signed up to have their in-print books searchable in GBS, with varying amounts of text available to the searcher prior to purchase. All of this makes the case that search plus some text display is good for authors, not harmful.

5. Digital copies of books have never been "distributed to the public" (key wording in the copyright law). Only the libraries themselves that held the actual hard copies could receive a copy of the files resulting from the digitization.

Of course, all of this is done citing court cases in support of these arguments. The Authors Guild undoubtedly has counter-cases to present.

Libraries Under the Bus

One of the key copyright-related arguments that Google makes is that its full text search within books provides a public service and support of research that is unprecedented. In making these claims Google decided to particularly emphasize its superiority to library catalogs. (Google refers multiple times to "card catalogues" which seems oddly antiquated, but perhaps that was the intent.)
"The tool is not a substitute for the books themselves -- readers still must buy a book from a store or borrow it from a library to read it. Rather, Google Books is an important advance on the card-catalogue method of finding books. The advance is simply stated: unlike card catalogues, which are limited to a very small amount of bibliographic information, Google Books permits full-text search, identifying books that could never be found using even the most thorough card catalog." (p.1) [sic uses of "catalogue" and "catalog" in the same paragraph.]
"Google Books was born of the realization that much of the store of human knowledge lies in books on library shelves where it is very difficult to find....Despite the importance of this vast store of human knowledge, there exists no centralized way to search these texts to identify which might be germane to the interests of a particular reader." (p. 4)
As a librarian, I have to say that this dismissal of the library as inadequate really hurts. Yet I believe that Google is expressing an opinion that is probably quite common among information searchers today. One could counter with many examples where the library catalog entry succeeds and GBS fails, but of course that wouldn't bolster Google's arguments here. A reasonable analysis would put the two methods (full text and standards-based metadata) as complementary.

Google also argues that it did not give the libraries copies of the digital files resulting from its scanning. How this plays out is not only clever but shows real foresight on Google's part. Google developed a portal where a library could request that a copy of the files be made "on demand" for that library, encrypted specifically for it. The transmission of the files from Google to the libraries was then an act of the libraries, not of Google.
"Moreover, the undisputed facts show that it is the libraries that make the library copies, not Google, and that Google provides only a technological system that enables libraries to create digital copies of books in their collections. Under established Second Circuit precedent, Google cannot be held directly liable for infringement because Google itself has not engaged in any volitional act constituting distribution." (p. 33)
Clearly, Google designed the system (which goes by the acronym "GRIN") with this in mind.

I don't mind this, but I wish that Google hadn't included a dig at HathiTrust as part of this argument. The document would not have suffered, in my opinion, if Google had left the parenthetical phrase out of this sentence:
"No library may obtain a digital copy created from another library's book -- even if both libraries own identical copies of that book (although libraries may delegate that task to a technical service provider such as HathiTrust)." (p. 15)
It's one thing to claim innocence, but another to point the finger at others.

Omissions

There are a few glaring omissions from the document, some of which would weaken Google's case.

There is no mention of the computational uses that can be made of the digital corpus, something that was a strong focus in the failed settlement between Google and the authors and publishers. I have no doubt that Google is currently engaged in research using this corpus -- I don't see how they could resist doing so. They do mention the "n-gram" feature briefly, but as it is based on what appears to be a simple use of term frequency, it may not attract the court's attention.

In another omission, Google states that:
"Informed by the results of a search of that index, users can click on links in Google Books to locate a library from which to borrow those books ... " (p. 4)
Google fails to state that this is not a service provided by Google but one provided by OCLC, using exactly those card catalogues that Google finds so inadequate. Credit should be given where credit is due, but evidently there is an important battle to be won.

Bottom Line

The ability to create full text searches of printed works (and other physical materials) is so important to research and learning -- and should be such an obvious modern approach to searching these materials -- that a win for Google is a win for us all. Although some aspects of this document shot arrows into my librarian-ly heart, I hope with all of that wounded heart that they prevail in this suit.


[1] This points to the Scribd site, which unfortunately is now connected to Facebook and is therefore a huge privacy monster. The document should appear on the Public Index site shortly, with no login required.
[2] The term "product" could also be used to describe GBS.

Wednesday, July 25, 2012

Authorities and entities

In my previous post, I talked about the three database scenarios proposed by the JSC for RDA. These can be considered to be somewhat schematic because, of course, real databases are often modified for purposes of efficiency in searching and display, as well as to facilitate update. But the conceptual structures provided in the JSC document are useful ways to think about our data future.

There is one problem that I see, however, and that is the transition from authority control to entities. Because we have authority records for some of the same things that are entities in the entity-relationship model of FRBR, there seems to be a widespread assumption that an authority record is the same as an entity record. In fact, IFLA has developed "authority data" models for names and for subjects that are intended to somehow mix with the FRBR model to create a more complete view of the bibliographic description.

This may be a wholly misguided activity, because authority control and entities (in the entity-relation sense) are not at all the same thing.

The sole purpose of library authority control, and of the record that carries the information, is to provide the preferred heading representing the "thing" being identified (person, corporate body, subject heading). The record also provides non-preferred forms of the name or subject that a catalog user might include in a query for that thing. The rest of the information in the record exists solely to support the selection of the appropriate string, including documentation of the resources the cataloger used in making that decision. In knowledge organization terms, this would be considered a controlled list of terms.

To understand what an entity is, one might use the WEMI entities as examples. An entity is indeed about some defined "thing," and it contains a description of that thing that fulfills one or more intended uses of the metadata. In the WEMI case, we can cite the four FRBR user tasks of find, identify, select, obtain. So if Work is an entity and contains all of the relevant bibliographic information about that Work, then Person is an entity and should contain all of the relevant information about that person. One such piece of information could be a preferred form of the person's name for display in a particular community's bibliographic data, although I could also make the argument that library applications could continue to make use of brief records that support cataloging and display of controlled text strings if that is the only function that is required. In fact, in the VIAF union database of authority data, the data is treated as a controlled list of terms, not unlike a list of terms for languages or musical instruments.

What would be a Person entity? It could, of course, be as much or as little as you would like, but it would be a description of the Person for your purposes. It is this element of description that I think is key, and we could think of it in terms of the FRBR user tasks:

find - would help users find the Person using any form of the name, but also using other criteria like: 19th century French generals; Swedish mystery writers; translators of the Aeneid.

identify - would give users information to help disambiguate between Persons retrieved. This assumes that there would be some amount of biographical information as well as categorization that let users know who precisely this Person entity represents.

select - this is where the Person entity would differ from traditional FRBR, which seems to assume that one is already looking for bibliographic materials at this step. I suppose that here one might select between Charles Dodgson and Lewis Carroll, whose biographic information is similar but whose areas of activity are entirely different.

obtain - this step would lead one to the library's held works by and/or about that Person, but it could also lead to further information, like web pages, entries in an online database, etc.

If you are wondering what a Person entity might look like, it might look like a mashup between an entry in WorldCat Identities and Wikipedia. I suggest a mashup because Identities is limited to data already in bibliographic and authority records and therefore has little in the way of general biographical information. The latter is available, sometimes abundantly, in Wikipedia, and of course a link to that Wikipedia entry would be a logical addition to a library record for a Person entity.
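To make the contrast with the authority record concrete, here is a minimal sketch in Python; every field name is hypothetical, chosen only to show the difference in scope between a controlled term list and a description of a person.

    from dataclasses import dataclass, field

    # A hypothetical authority record: everything in it exists to support
    # the choice and maintenance of one preferred text string.
    @dataclass
    class AuthorityRecord:
        preferred_heading: str                 # "Carroll, Lewis, 1832-1898"
        variant_forms: list = field(default_factory=list)      # see-references
        sources_consulted: list = field(default_factory=list)  # cataloger's evidence

    # A hypothetical Person entity: a description of the person, of which
    # a preferred display form is only one piece.
    @dataclass
    class PersonEntity:
        names: list                            # all known forms, none privileged
        birth_date: str = ""
        death_date: str = ""
        biography: str = ""                    # e.g., drawn from Wikipedia
        fields_of_activity: list = field(default_factory=list)  # "mystery writers"
        see_also: list = field(default_factory=list)  # Wikipedia, web pages, etc.

The authority record answers the question "what string do we display?"; the entity answers "who is this?", which is what the user tasks above require.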

What this thinking leads me to conclude is:

1) the library authority file is a term list, not a set of entities, and therefore is not the Person entity implied in FRBR
2) having Person entities in our files could be a great service for our users, and it might be possible to create them to take the place of the simple term lists that our authority records now represent
3) the FRBR user tasks may need to be modified or reinterpreted to be focused less on seeking a particular document and more on seeking a particular person (agent) or subject

Monday, July 23, 2012

Futures and Options

No, I'm not talking about the stock market, but about the options that we have for moving beyond the MARC format for library data. You undoubtedly know that the Library of Congress has its Bibliographic Framework Transition Initiative that will consider these options. In an ALA Webinar last week I proposed my own set of options -- undoubtedly not as well-studied as LC's will be, but I offer them as one person's ideas.

It helps to remember the three database scenarios of RDA. These show a progressive view of moving from the flat record format of MARC to a relational database. The three RDA scenarios (which should be read from the bottom up) are:

  1. Relational database model -- In this model, data is stored as separate entities, presumably following the entities defined in FRBR. Each entity has a defined set of data elements and the bibliographic description is spread across these entities which are then linked together using FRBR-like relationships.
  2. Linked authority files -- The database has bibliographic records and has authority records, and there are machine-actionable links between them. These links should allow certain strings, like name headings, to be stored only once, and should reflect changes to the authority file in the related bibliographic records.
  3. Flat file model -- The database has bibliographic records and it has authority records, but there is no machine-actionable linking between the two. This is the design used by some library systems, but it is also a description of the situation that existed with the card catalog.

These move from #3, being the least desirable, to #1, being the intended format of RDA data. I imagine that the JSC may not precisely subscribe to these descriptions today because of course in the few years since the document was created the technology environment has changed, and linked data now appears to be the goal. The models are still interesting in the way that they show a progression.
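As an illustration of what scenario #2's machine-actionable link buys us over scenario #3, here is a minimal Python sketch; the authority key and the records are invented for the example.

    # Scenario #2 in miniature: the heading string lives once in the
    # authority "file" and bibliographic records point to it by key.
    authorities = {"n-0001": "Twain, Mark, 1835-1910"}   # invented key

    bib_records = [
        {"title": "Adventures of Huckleberry Finn", "creator": "n-0001"},
        {"title": "The Innocents Abroad",           "creator": "n-0001"},
    ]

    # Display resolves the link at read time...
    for rec in bib_records:
        print(rec["title"], "/", authorities[rec["creator"]])

    # ...so a change to the authority record is reflected in every
    # bibliographic record that links to it, with no record-by-record edit.
    authorities["n-0001"] = "Twain, Mark (Samuel L. Clemens), 1835-1910"

In the flat file of scenario #3, that same heading would be copied into each bibliographic record as a text string, and a change to the authorized form would require touching every record.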

I also have in mind something of a progression, or at least a set of three options that move from least to most desirable. To fully explain each of these in sufficient detail will require a significant document, and I will attempt to write up such an explanation for the Futurelib wiki site. Meanwhile, here are the options that I see, with their advantages and disadvantages. The order, in this case, is from what I see as least desirable (#3, in keeping with the RDA numbering) to most desirable (#1).

#3 Serialization of MARC in RDF

Advantages

  • mechanical - requires no change to the data
  • would be round-trippable, similar to MARCXML
  • requires no system changes, since it would just be an export format

Disadvantages

  • does not change the data at all -- all of the data remains as text strings, which do not link
  • keeps library data in a library-only silo
  • library data will not link to any non-library sources, and even linking to library sources will be limited because of the profusion of text strings

#2 Extraction of linked data from MARC records

Advantages

  • does not require major library system changes because it extracts data from the current MARC format
  • some things (e.g. "persons") can be given linkable identifiers that will link to other Web resources (see the sketch below)
  • the linked data can be re-extracted as we learn more, so we don't have to get it right or complete the first time
  • does not change the work of catalogers

Disadvantages

  • probably not round-trippable with MARC
  • the linked data is entirely created by programs and algorithms, so it doesn't get any human quality control (think: union catalog de-duping algorithms)
  • capabilities of the extracted data are limited by what we have in records today, similar to the limitations of attempting to create RDA in MARC
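To make the difference between options #3 and #2 concrete, here is a minimal sketch; the record URI is invented, the one-entry lookup table stands in for a real matching algorithm, and the VIAF URI is purely illustrative.

    # Option #3 in miniature: the heading is serialized as a text string,
    # so the object of the triple is a literal that links to nothing.
    marc_100a = "Twain, Mark, 1835-1910."
    triple_option_3 = ("<http://example.org/rec/123>", "dc:creator",
                       f'"{marc_100a}"')

    # Option #2 in miniature: an extraction program matches the string to
    # an identifier. A real project would use a matching algorithm (with
    # union-catalog-style de-duping risks); this dict is a stand-in.
    heading_to_uri = {
        "Twain, Mark, 1835-1910.": "http://viaf.org/viaf/50566653",  # illustrative
    }

    uri = heading_to_uri.get(marc_100a)
    if uri:
        # Now the object is a URI that can link beyond the library silo.
        triple_option_2 = ("<http://example.org/rec/123>", "dc:creator",
                           f"<{uri}>")
        print(triple_option_2)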

#1 Linked data "all the way down", i.e. working natively in linked data

Advantages

  • gives us the greatest amount of interoperability with web resources and the most integration with the information in that space
  • allows us to implement the intent of RDA
  • allows us to create interesting relationships between resources and possibly serve users better

Disadvantages

  • requires new library systems
  • will probably instigate changes in cataloging practice
  • presumably entails significant costs, but we have little ability to develop a cost/benefit analysis

There is a lot behind these three options that isn't explained here, and I am also interested in hearing other options that you see. I don't think that our options are only three -- there could be many points between them -- but this is intended to be succinct.

To conclude, I don't see much, if any, value in my option #3; #2 is already being done by the British Library, OCLC, and the National Library of Spain; I have no idea how far in our future #1 is, nor even if we'll get there before the next major technology change. If we can't get there in practice, we should at least explore it in theory because I believe that only #1 will give us a taste of a truly new bibliographic data model.

Sunday, July 15, 2012

Friends of HathiTrust

I have written before about the lawsuit of the Authors Guild (AG) against HathiTrust (HT). The tweet-sized explanation is that the AG claims that the digitized books in the HathiTrust corpus that are not in the public domain infringe copyright. HathiTrust claims that the digitized copies are justified under fair use. (It may be relevant that many of the digitized texts stored in HT are the result of the mass digitization done by Google.)

For analysis of the legal issues, please see James Grimmelmann's blog, in particular his post summarizing how the various arguments fit into copyright law's "four factors."

I want to focus on some issues that I think are of particular interest to librarians and scholars. In particular, I want to bring up some of the points from the amicus brief from the digital humanities and law scholars.

While scientists and others who work with quantifiable data (social scientists using census data, business researchers with huge amounts of stock market data, etc.) have these techniques at their disposal, those working in the humanities, whose raw material is printed text, have not been able to make use of the massive data mining techniques that are moving other areas of research forward. If you want to study how language has changed over time, or when certain concepts entered the vocabulary of mass media, the physical storage of this information makes it impossible to run these inquiries as calculations, and the size of the corpus makes it very difficult, if not impossible, to do the research in "human time." Thus, the only way for the "Digital Humanities" to engage in modern research is through the digitization of their primary materials.

This presumably speaks to the first factor of fair use:

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

As Grimmelmann says, "The Authors Guild focuses on the corpus itself; HathiTrust focuses on its uses." It may make sense that scholars should be allowed to make copies of any material they need to use in their research, but I can imagine objections, some of which the AG has already made: 1) you don't need to systematically copy every book in every library to do your research, and 2) that's fine, but can you guarantee that infringing copies will not be distributed?

It's a hard sell, yet it's also hard not to see the point of view of the humanities scholars who feel that they could make great progress (ok, and some good career moves) if they had access to this material.

The other argument that the digital humanities scholars make is that the data derived from the digitization process is not infringing because it is non-expressive metadata. Here it gets a bit confusing: although they refer to the data derived from digitization as "metadata," the examples they give range from the digitized copies themselves, to a database where all of this is stored, to the output of Google n-grams. If the database consists of metadata, then the Google n-grams are an example of the use of that metadata, not of the metadata itself. In fact, what digitization produces is a good graphic copy of each page of the book, plus a reproduction, word for word (with unfortunate but not deliberate imprecision), of the text itself. That this copy is essential for the desired research uses is undeniable, and the brief gives many good examples of quantitative research in the humanities. But I fear that the insistence that digitization produces mere "metadata" may not be convincing.

Here's a short version from the text:

"In ruling on the parties’ motions, the Court should recognize that text mining is a non-expressive use that presents no legally cognizable conflict with the statutory rights or interests of the copyright holders. Where, as here, the output of a database—i.e., the data it produces and displays—is noninfringing, this Court should find that the creation and operation of the database itself is likewise noninfringing. The copying required to convert paper library books into a searchable digital database is properly considered a “nonexpressive use” because the works are copied for reasons unrelated to their protectable expressive qualities; none of the works in question are being read by humans as they would be if sitting on the shelves of a library or bookstore." p. 2

They also talk about transformation of works; the legal issues here are complex, and my impression is that past legal decisions may not provide a clear path. They end the section with this quote:

"By contrast, the many forms of metadata produced by the library digitization at the heart of this litigation do not merely recast copyrightable expression from underlying works; rather, the metadata encompasses numerous uncopyrightable facts about the works, such as author, title, frequency of particular words or phrases, and the like." (p.17)

This, to me, comes completely out of left field. Anyone who has done a digitization project knows that most projects use human-produced library metadata for the authors and titles of the digitized works. In addition, the result of the OCR step of the digitization process is a large text file containing the text, from first word to last, in that order, plus possibly a mapping file giving the coordinates of each word on each OCR'd page. Any term frequency data is a few steps removed from the actual digitization process and its immediate output, and fits in perfectly with the earlier arguments about data mining.
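For a sense of just how far the frequency data sits from the digitization output, here is a minimal sketch; the file name is hypothetical and the tokenization deliberately crude.

    from collections import Counter

    # The OCR step produces the full text, first word to last -- still an
    # expressive copy of the work, not frequency data.
    with open("ocr_output.txt", encoding="utf-8") as f:   # hypothetical file
        words = f.read().lower().split()

    # Only this counting step reduces expression to non-expressive counts:
    # two-word-sequence frequencies of the kind behind the n-gram viewer.
    bigrams = Counter(zip(words, words[1:]))

    for pair, count in bigrams.most_common(10):
        print(" ".join(pair), count)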

I do sincerely hope that the court will permit digitization of texts for the purposes argued in this brief. An after-the-fact justification of Google's mass digitization project may, however, suffer from weaknesses inherent in that project: no prior negotiation was attempted with either authors or publishers, and once the court rejected the amended settlement between Google and the suing parties, there was no mutual agreement on uses, security, or compensation.

In addition, the economic and emotional impact of Google's role in this process cannot be ignored: this is a company so strong and so pervasive in our lives that mere nations struggle to protect their own (and their citizens') interests. When Google or Amazon or Facebook steps into your territory, the earth trembles, and fear is not an unreasonable response. I worry that the idea of digitization itself has been tainted, making it harder for scholars to make their case for the potential benefits of post-digitization research.

Thursday, July 05, 2012

ISBN as URI

One thing that is annoying in this early stage of attempting to create linked data is that we lack URI forms for some key elements of our data. One of those is the ISBN. At the linked data conference in Florence, I asked a representative of the International ISBN Agency about creating an "official" URI form for ISBNs and was told that it already exists: ISBN-A, which uses the DOI namespace.

The format of the ISBN URI is:
  • Handle System DOI name prefix = "10."
  • ISBN (GS1) Bookland prefix = "978." or "979."
  • ISBN registration group element and publisher prefix = variable length numeric string of 2 to 8 digits
  • Prefix/suffix divider = "/"
  • ISBN Title enumerator and checkdigit = maximum 6 digit title enumerator and 1 digit check digit
I was thrilled to find this, but as I look at it more closely I wonder how easy it will be to divide the ISBN at the right point between the publisher prefix and the title enumerator. In the case where an ISBN is recorded as a single string (which is true in many instances in library data):

9781400096237
9788804598770

there is nothing in the string to indicate where the divider should be, which in these two cases is:

978.14000/96237
978.8804/598770
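To make the difficulty concrete, here is a sketch in Python. The split can only be made with the publisher range data that the ISBN agency maintains; the two prefixes hard-coded below are just enough for the examples above and are not a real range table.

    def isbn_a(isbn13: str) -> str:
        """Form an ISBN-A DOI from a 13-digit ISBN string.

        The split between publisher prefix and title enumerator cannot
        be computed from the digits alone; it requires the ISBN agency's
        range tables. This toy table covers only the two examples.
        """
        bookland = isbn13[:3]     # "978" or "979"
        rest = isbn13[3:]         # group + publisher + title + check digit

        known_prefixes = ["14000", "8804"]   # stand-in for a real range table
        for p in sorted(known_prefixes, key=len, reverse=True):
            if rest.startswith(p):
                return f"10.{bookland}.{p}/{rest[len(p):]}"
        raise ValueError(f"no range data for {isbn13}")

    print(isbn_a("9781400096237"))   # 10.978.14000/96237
    print(isbn_a("9788804598770"))   # 10.978.8804/598770

Without that range data, the single string is ambiguous.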

I have two questions for anyone out there who wants to think about this:

1) Is this division into prefix/suffix practical for our purposes?
2) Would having a standard ISBN-A URI format relieve us of the need for a separate property (such as bibo:isbn) for these identifiers? In other words, is the URI format enough identification that it could be used directly, in a situation like:

<bookURI> RDVocab:hasManifestationIdentifier <http://10.978.14000/96237> .
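As a sketch of what question 2 amounts to, here are the two modeling options side by side using rdflib; the book URI and the RDVocab namespace are placeholders, and I have written the ISBN-A in its doi.org resolver form.

    from rdflib import Graph, Literal, Namespace, URIRef

    BIBO = Namespace("http://purl.org/ontology/bibo/")
    RDVOCAB = Namespace("http://example.org/RDVocab/")    # placeholder namespace
    book = URIRef("http://example.org/book/1")            # placeholder book URI

    g = Graph()

    # Option A: a dedicated ISBN property with the bare number as a literal.
    g.add((book, BIBO.isbn, Literal("9781400096237")))

    # Option B: the ISBN-A URI as the object of a generic identifier
    # property; the URI form itself says "this is an ISBN," which is
    # what might make a separate ISBN property unnecessary.
    g.add((book, RDVOCAB.hasManifestationIdentifier,
           URIRef("http://doi.org/10.978.14000/96237")))

    print(g.serialize(format="turtle"))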