Monday, November 23, 2009


The Google Books Settlement is causing a great deal of previously unexpressed bibliographic interest -- just how many books are there in the known universe? How many are published in the four countries now included in the settlement agreement (US, UK, Canada, Australia)? And how many are in the public domain?

Lorcan Dempsey and Brian Lavoie have recently published an article in D-Lib Magazine that looks at these figures using the world's largest database of bibliographic data, WorldCat. The data is fascinating, but I have already seen it misinterpreted, so I thought some clarification might be useful.

Dempsey and Lavoie are very clear that what they are measuring is "Manifestations." Folks outside of the library environment are unlikely to know what that means, therefore it is important to clarify what the numbers in the Dempsey/Lavoie article represent. Each “book” that is counted represents a published product at about the same level of granularity that today would be given an ISBN. Therefore if a publisher re-issues a book in their backlist after the previous print run has been exhausted (say, a decade later) and with a new introduction, it is considered a different book. The publication date that is fed into the study is the date of the new issuing of the book. Also, as publishers re-package and re-print public domain books, these also are considered separate products with new ISBNs and new dates.

Thus, if you look up a commonly re-published book like “Moby Dick, Or The Whale” in the Library of Congress catalog, you retrieve 40 items (and more if you use the short form of the name, simply “Moby Dick”), of which only one is pre-1923: the one published in 1851. Of the other thirty-nine instances of the publication of the work, which range from 1925 to 2006, some contain what GBS called “inserts” - that is, separately copyrightable intellectual property in the form of introductions, etc., but others may be a straight republication of the text. If you do the same lookup in FictionFinder, a work-based view of a portion of the WorldCat database, you find:

823 editions of "Moby Dick" (which combines the various versions of the title)
534 of which are in English

of these:

9 have an unknown date
60 have a date of 1923 or earlier
465 have dates after 1923

Looking through the list on FictionFinder it is easy to see that there are some duplicate records, both in the pre- and post-1923 entries.

Therefore, the question we now need to answer is: how many public domain works have been republished after the 1923 cut-off date?

Google appears to currently lack the ability to make the proper connection between the original text that is in the public domain and the many “manifestations” (as they are called in library-speak) that were published later — and are also in the public domain, at least as far as the primary text is concerned. This is a non-trivial exercise when one is working only with the metadata that describes the work, but may become more feasible with the ability to do a full text analysis of the contents of the various packages in which publishers have placed the original work of Melville. I assume that Google is working on this, although I cannot predict how it will affect their assessment of the PD/(c) split.
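To make the metadata-only version of this problem concrete, here is a minimal sketch in Python. It assumes a crude "work key" built from the author's surname and the main title with any subtitle dropped; that key, the function names, and the sample records are all illustrative assumptions, not how Google, OCLC, or FictionFinder actually cluster records. It groups manifestations into works and lists those dated after 1923 whose work first appeared in 1923 or earlier.

```python
import re
from collections import defaultdict

PD_CUTOFF_YEAR = 1923  # cut-off year discussed above, treated here as inclusive


def work_key(author, title):
    """Crude, hypothetical work identifier: surname plus main title, subtitle dropped."""
    main_title = re.split(r"[,:;]", title, maxsplit=1)[0]
    main_title = re.sub(r"[^a-z0-9]+", " ", main_title.lower()).strip()
    surname = author.split(",")[0].strip().lower()
    return (surname, main_title)


def later_issues_of_early_works(records):
    """Return manifestations dated after the cut-off whose work
    has at least one manifestation dated on or before it."""
    by_work = defaultdict(list)
    for rec in records:
        by_work[work_key(rec["author"], rec["title"])].append(rec)

    flagged = []
    for recs in by_work.values():
        years = [r["year"] for r in recs if r["year"] is not None]
        if years and min(years) <= PD_CUTOFF_YEAR:
            flagged.extend(
                r for r in recs
                if r["year"] is not None and r["year"] > PD_CUTOFF_YEAR
            )
    return flagged


# Illustrative records only; records with an unknown date are left unflagged.
records = [
    {"author": "Melville, Herman", "title": "Moby Dick, Or The Whale", "year": 1851},
    {"author": "Melville, Herman", "title": "Moby Dick", "year": 1925},
    {"author": "Melville, Herman", "title": "Moby-Dick; or, The Whale", "year": 2006},
    {"author": "Melville, Herman", "title": "Moby Dick", "year": None},
    {"author": "Auletta, Ken", "title": "Googled: the end of the world as we know it", "year": 2009},
]

flagged = later_issues_of_early_works(records)
for rec in flagged:
    print(rec["year"], rec["title"])
```

A key this simple will both over-merge (different works with the same short title) and under-merge (translations, variant spellings), and a flagged manifestation may still carry copyrighted "inserts" such as a new introduction. That is exactly why the metadata alone is not enough.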

What is clear, however, is that Google is going to need to identify Works (if not strictly in the FRBR sense, then at least in the sense that meets some definition that is valid for copyright law).

Saturday, November 14, 2009

Amended Google/AAP Settlement

The amended settlement has been issued (the best way to see the changes is in the redline version). I will summarize here the changes that I see as having the greatest impact on libraries and on the public. For legal issues, I suggest James Grimmelmann's blog. For business issues, probably the NY Times and Wall Street Journal.

Foreign Works Mostly Excluded

Undoubtedly due to the many complaints from foreign rights holders, the settlement now only includes (oddly enough) US, UK, Australian and Canadian works. This would include, as I interpret it, translations published in those four countries of works that originated elsewhere. This greatly changes the value of the institutional subscription for higher education, as well as the value of the 'research corpus' (essentially a database of the OCR'd texts that researchers can use for computational research).

Since we know that information seekers prefer accessing works online rather than in hard copy, I anticipate that the online service will be very popular. But it will contain almost exclusively these Anglo-American products, a narrow swath of the intellectual output of the planet. As it is, too many Americans are unaware of the world outside of those Anglo-American borders. This will just exacerbate that problem. It could change the content of education and research. As I've said before, availability is a significant determinant of what intellectual materials people use in their research.

Particular to Libraries

In general, the sections on libraries (both participation and use of the digital copies) remain unchanged. There are a few minor changes, some of which are puzzling.

Public Libraries

The statement about the free access for public libraries has been changed from:
in the case of each Public Library, no more than one terminal per Library Building
to:
in the case of each Public Library, one terminal per Library Building; provided, however, that the Registry may authorize one or more additional terminals in any Library Building under such further conditions as it may establish, acting in its sole discretion and in furtherance of the interests of all Rightsholders.
So it leaves the options open for giving some public libraries additional (free?) access. Still, there is no information on whether or how public libraries could subscribe in a way that would allow them to fully serve their communities.


The definition of "books" that could be digitized originally included microforms. The word "not" has been added:
hard copy (not including microform)
No idea why, but perhaps a look at the comments will reveal one from UMI or some other party related to microforms.

[Found it: The ProQuest letter states that dissertations should NOT be included as they are controlled through ProQuest's dissertation service. The letter mentions that some dissertations are in microform format, but that today many are available as print-on-demand or online. Although microforms were excluded, p. 327 of the redline document states:
What Material Is Covered?
"Books" include in-copyright written works, such as novels, textbooks, dissertations, and other writings...
So ProQuest did not get what it asked for.]

OCLC Networks

The original settlement had a strange exception that removed OCLC networks from the definition of "consortium":
"Institutional Consortium" means a group of libraries, companies, institutions or other entities located within the United States that is a member of the International Coalition of Library Consortia with the exception of Online Computer Library Center (OCLC) - affiliated networks.
That exception has been removed. I would love to know why it was there in the first place, but can only assume that one or both of these requests came about because of participation by OCLC in the settlement discussions.

[Note: I discovered that Lyrasis and Nylink filed an objection about this exception, which may be why it was removed. Their analysis was that it had come from OCLC and gave OCLC the ability to manage competition by determining which organizations would be excluded from participating in the business of brokering services for libraries. They assume that OCLC hopes to be in that business itself.]

Download Formats & Course Packs

In the original settlement, the only download format mentioned was PDF. As we know, since then Google has announced that it will provide e-books from the publisher partner content that it carries on GBS. E-book formats have been added to the settlement as possible download formats. At the same time, the product line described as:
Custom Publishing - Per-page pricing of Books, or portions thereof, for course materials, and other forms of custom publishing for the educational and professional markets
has been removed.


There are complex changes to the treatment of orphan works which I have not tried (yet) to absorb. Those will undoubtedly have some impact on libraries and the public but at the moment I have no thoughts on that.

The settlement now allows rightsholders to place a Creative Commons license on their works. I really don't see a great deal of significance in this, although it does emphasize that by participating in GBS your rights are now governed by contract law rather than copyright law.

And, last, Google admits to some of its own difficulties in bibliographic control when it states that "The inclusion of a work within the Books Database does not, in and of itself, mean that the work is a Book within the meaning of Section 1.19 (Book)." In other words: we threw a whole bunch of bib records into a database; don't assume anything from it.

Monday, November 09, 2009


Waiting for the next round of Google/AAP/AG settlement prose (which was due today, November 9, but has been moved back to Friday, November 13, when the parties will presumably present it to the judge), I have read Ken Auletta's book "Googled: the end of the world as we know it." It's mainly a business book, and primarily about media and advertising. I can sum up what it says about Google in three statements:
  1. Engineering can fix anything
  2. Information is neutral and measurable
  3. Advertising is information
OK, maybe that's a bit overly concise, but that is what it boils down to. I've often wondered how your motto can be "Don't be evil" when you are in the advertising business. It obviously works if you consider information to have meaning only in terms of numerical measures, and advertising to be just another kind of information. This engineer-based mentality as the guiding principle of the largest, richest advertising company in the world falls somewhere between Ayn Rand's objectivism and Bernie Madoff's Ponzi scheme. About 50% of Google's employees are engineers, and engineers, on average, earn twice what non-engineers earn.

Google has ramped up the advertising game by orders of magnitude, destabilizing huge, long-lived media companies, and it's all based on... winners win. Google sees its role as matching up users with things they are seeking, whether it's web sites, books, or a place to buy sneakers. It doesn't matter to Google what the information is.

There is something creepy about the way that Auletta refers to SergeyandLarry as "the founders." It sounds almost... cult-like. The fact that the book treats the founders and CEO Eric Schmidt as a three-some is just way too trinitarian for my taste.