Monday, September 07, 2009

GBS and Bad Metadata

Ever since Geoffrey Nunberg got up at the Google Books Colloquium at Berkeley on August 28, 2009, and showed the audience how bad the Google Books metadata is, some parts of the academic world have been buzzing about the topic. (See "Google's Book Search: A Disaster for Scholars," his article in The Chronicle of Higher Education, and "Google Books: The Metadata Mess," the slide presentation from the conference at UC Berkeley.)

Google representatives claim that their data comes from libraries and from other sources, but it is easy to show that Google is not including the library's bibliographic record in GBS. It might just be seen as a short-sighted decision on their part not to keep all of the data from the MARC records supplied by the libraries. After all, which of these do you think makes the most sense to the casual reader:
12 pages
12 p. 27 cm.

However, there is some evidence that Google is missing parts of the library bibliographic record. Here are some examples of subjects from GBS and the records from the very libraries that supplied the works:

GBS:
Indians of North America
Indian baskets

Library:
Indians of North America -- Languages.
Indians of North America -- California
Indian baskets -- North America

This is the same pattern that appeared in the records released by the University of Michigan for their public domain scanned books -- only the $a of the 6XX field was included. (I wrote about this: http://kcoyle.blogspot.com/2008/05/amputation.html). Many other fields are also excluded from those Michigan records, and one has to wonder if the same was true of the records received/used by Google.
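To make the effect concrete, here is a minimal sketch in Python of what keeping only the $a does to the three library headings above. It is my illustration, not Google's or Michigan's actual processing: the $x and $z subfield codes are my assumption about how these headings were coded, the function names are made up, and the de-duplication step is a guess that would explain why "Indians of North America" appears only once in GBS.

```python
# Each subject heading is a list of (subfield code, value) pairs, in the
# manner of a MARC 6XX field. The $x and $z codes here are assumed for
# illustration; the actual library records may be coded differently.
LIBRARY_SUBJECTS = [
    [("a", "Indians of North America"), ("x", "Languages")],
    [("a", "Indians of North America"), ("z", "California")],
    [("a", "Indian baskets"), ("z", "North America")],
]


def full_heading(subfields):
    """Render every subfield, joined the way library catalogs display them."""
    return " -- ".join(value for _code, value in subfields)


def truncate_to_a(subfields):
    """Keep only the $a subfield and drop all subdivisions."""
    return " -- ".join(value for code, value in subfields if code == "a")


def truncated_subjects(headings):
    """Truncate each heading to $a, then drop duplicates, preserving order."""
    seen = []
    for heading in headings:
        short = truncate_to_a(heading)
        if short and short not in seen:
            seen.append(short)
    return seen


if __name__ == "__main__":
    print("Library:")
    for heading in LIBRARY_SUBJECTS:
        print("  " + full_heading(heading))
    print("After truncation:")
    for short in truncated_subjects(LIBRARY_SUBJECTS):
        print("  " + short)
```

Under those assumptions, the three distinct library headings collapse into two generic ones, and the language and place subdivisions are gone.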

I know that it is possible to retrieve the full library records for the books because the Open Library is already doing so to get bibliographic data for the public domain books scanned by Google. Google is obviously capable of doing this, yet chooses not to.

This leaves us with a bit of a mystery, although I think I know the answer. The mystery is: why would Google only use limited metadata from the participating libraries? And why won't they answer the question that I asked at the Conference: "Do you have a contract with OCLC? And does it restrict what data you can use?" Because if the answer is "yes and yes" then we only have ourselves (as in "libraries") to blame. And Nunberg and his colleagues should be furious at us.

7 comments:

  1. You should also look at this from Google's point of view - once the bits are digitized, they are easier to manipulate. They can always go back and redo the cataloging - it is the scanning that takes the most effort.

  2. In a comment to "Google Books: A Metadata Train Wreck" (http://languagelog.ldc.upenn.edu/nll/?p=1701), Jon Orwant (Google) writes the following:

    We don't display the raw library metadata because of a contractual obligation with a library catalog aggregator that forbids us from doing so. Given how they pay their bills, this is understandable.

    Might this library catalog aggregator be OCLC?

  3. The Google Book Search-OCLC deal is old news, to some extent. Here's the press release from May 2008.

  4. Anon - thanks for pointing out that Orwant quote. However, it's not a question of "raw" -- it's a question of how much of the library data Google is allowed to use, as per its contract with OCLC. We know (as Eric points out) that Google and OCLC are working together. What we don't know is the nature of that agreement. If that agreement results in extremely poor metadata for Google records, and/or metadata that doesn't work well for libraries, then we have a problem. I personally think that truncating every subject heading after the first $a is pretty bad, and will make it hard to incorporate metadata from GBS in a metasearch with library data. If GBS got what Michigan put out in its metadata, it's lacking place of publication, series statements, all notes, and all authors except the main author. But I think that the truncation of the subject headings is the worst.

  5. hmmm, so if Google had free rein to use all of OCLC's metadata without restriction, my library would be paying me to catalog for Google Books. sweet deal!

  6. Anon,

    OCLC sells data to folks like Google, it doesn't give it away. That money goes back to OCLC, and OCLC's view (as stated in their Policy and FAQ, albeit now withdrawn) is that this activity helps support WorldCat and therefore is for the good of all. Whether or not your library should be getting something more than a more fiscally robust OCLC is a question for the OCLC membership. But it is perfectly reasonable for the for-profit sector to compensate libraries for the work they have done. Note that the participating GBS libraries have negotiated very hard with Google to get something in exchange for Google's use of their collections. For example, Michigan will get access to the subscription service for free, and other participating libraries will probably get deep discounts. If you have something of value, you don't have to give it away -- instead, you should see it as a bartering point.

  7. "it is perfectly reasonable for the for-profit sector to compensate libraries for the work they have done"

    Compensation for the work done and expenses, in my view, should be clearly distinct from monetization of assets.

    The settlement project doesn't seem very explicit in that respect, or am I wrong?

