Monday, May 26, 2008


OK, maybe I'm just in a particularly bad mood, but I guess I've just about had it with libraries shooting themselves in the foot. Then letting gangrene set in and going for amputation.

I'm talking about not sharing the data we have. We've all heard the complaints that the Web community is going forward "reinventing the wheel" in terms of bibliographic data. But then when anyone shows an interest in our bibliographic data, we withhold like the anal retentives we are. (Wow, I must be in an extraordinarily bad mood!)

We've just learned that OCLC is sharing its data with Google, and that libraries can download records for Google books from OCLC -- although there was no mention of the "usual fee," which I'm sure applies. I happen to know that Google receives a full bibliographic record with each book that it digitizes. I have no idea what they do with that data because it doesn't appear on the screen in Google Book Search. What is significant is that the quality data that has been created by libraries is still essentially invisible to most users of the Internet. Do you still wonder why we are overlooked by most of the information seeking population? How can they possibly know what we've done to organize the bibliographic world if we won't let them see?

So the latest thing that got my goat came across on the NGC4LIB list. First, it turns out that the University of Michigan has made available, via OAI-PMH, a file of records representing those of its books digitized by Google that are in the public domain. (The OAI Identify command.) This is a good thing, obviously. But in this post, we learn that the records have been "truncated" to meet the requirements of OCLC for record sharing. So I decided to see what "truncated" means. Here are some examples. The bolded text is what you get when you retrieve the records from Michigan. The text in italics is what shows in the Michigan catalog for the same item.

LDR nam 22003251i 4500
005 19901105000000.0
006 m d
007 cr bn ---auaua
008 901105s1980 dcu b f00010 eng d
020 |b pbk.
035 |a (OCoLC)ocm06624048
040 |a GPO |c GPO |d m.c. |d m/c |d EYM
074 |a 968-H-1
0860 |a J 26.2:C 73/10
1001 |a Villano, Clair E.
24510 |a Complaint and referral handling / |c by Clair E. Villano, Metropolitan Denver District Attorneys' Office of Consumer Fraud and Economic Crime.
260 |a [Washington] : |b Dept. of Justice, Law Enforcement Assistance Administration, |c 1980.
300 |a v, 25 p. ; |c 28 cm.
440 0 |a Operational guide to white-collar crime enforcement
500 |a At head of title: The National Center on White-Collar Crime.
500 |a Project supported by Grant No. 77-TA-99-0008, awarded to the Battelle Memorial Institute Law and Justice Study Center.
500 |a May 1980.
504 |a Includes bibliographical references.
533 |a Electronic text and image data |b Ann Arbor, Mich. : |c University of Michigan Library |d 2007 |e Includes both image files and keyword searchable text. |f [Michigan Digitization Project]
538 |a Mode of access: Internet.
650 0 |a Complaints (Criminal procedure) |z United States.
650 0 |a White collar crimes |z United States.
7101 |a United States. |b Law Enforcement Assistance Administration.
7102 |a National Center on White-Collar Crime (U.S.)
7102 |a Metropolitan Denver District Attorneys' Office of Consumer Fraud and Economic Crime.
7102 |a Battelle Law and Justice Study Center.
856 4 |u
|wmdp.39015034803505 |xeContent

LDR nam 2200301 a 4500
005 19901105000000.0
006 m d
007 cr bn ---auaua
008 901105s1980 dcu bc f00010 eng c
010 |a 81600912 //r90
035 |a (OCoLC)ocm06999418
040 |a DGPO/DLC |c DLC |d GPO |d EYM
043 |a n-us---
05000 |a Z6616.B3215 |b B37 |a E746.B37
074 |a 383-B
08200 |a 016.355/0092/4 |2 19
0860 |a D 214.13:B 26/859-930
1001 |a Bartlett, Merrill L.
24510 |a George Barnett, 1859-1930 : |b register of his personal papers / |c compiled by Merrill L. Bartlett.
260 |a Washington, D.C. : |b History and Museums Division, Headquarters, U.S. Marine Corps, |c 1980.
300 |a vii, 18 p. ; |c 27 cm.
504 |a Bibliography: p. 17-18.
533 |a Electronic text and image data |b Ann Arbor, Mich. : |c University of Michigan Library |d 2007 |e Includes both image files and keyword searchable text. |f [Michigan Digitization Project]
538 |a Mode of access: Internet.
60010 |a Barnett, George, |d 1850-1930 |x Manuscripts |x Catalogs.
61010 |a United States. |b Marine Corps. |x History |x Sources |x Manuscripts |x Catalogs.
61010 |a United States. |b Marine Corps. |b History and Museums Division |x Catalogs.
650 0 |a Manuscripts, American |z Washington (D.C.) |x Catalogs.
7101 |a United States. |b Marine Corps. |b History and Museums Division.
856 4\|u

There are obviously many other examples, but the trend is clear (well, partly so). The lack of a 245 $c (statement of responsibility) and all of the 7XX's (added entries) means that the record is incomplete in terms of authorship. You won't be able to see or search on anything but the main author. I'm baffled by the removal of the place of publication, since it's not used for retrieval (the coded place is in the 008 field). Ditto the 300 $c with the size in centimeters. The subject headings have been rendered entirely useless. As we know, the 6XX $a is not the top of some logical hierarchy, but is idiosyncratically the first term based on some rather complex rules. So in the first record we lose "United States" because it is the second term, but in the second record we get only "United States" and lose all references to "Marine Corps." which is the actual topic of the item.
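To make the subject heading problem concrete, here is a minimal sketch of what keeping only the $a of a 6XX field does. It assumes the display form used in the examples above ("|a ... |x ..."), and the function names are mine, purely for illustration. It is not the process Michigan actually used.

```python
def parse_subfields(field_text):
    """Split a display-form field such as '|a Foo |x Bar' into (code, value) pairs."""
    pairs = []
    for chunk in field_text.split("|")[1:]:
        if not chunk:
            continue  # skip empty pieces such as a doubled delimiter
        pairs.append((chunk[0], chunk[1:].strip()))
    return pairs


def keep_only_a(field_text):
    """Return just the $a values, which is what survives the truncation described above."""
    return [value for code, value in parse_subfields(field_text) if code == "a"]


# A 650 from the first record and a 610 from the second record.
first = "|a White collar crimes |z United States."
second = "|a United States. |b Marine Corps. |x History |x Sources |x Manuscripts |x Catalogs."

print(keep_only_a(first))   # the geographic subdivision is gone
print(keep_only_a(second))  # only the jurisdiction is left, not the Marine Corps
```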

I note also that the output records do not have the required 001/003/005 fields. (And not having required fields is a problem in itself.) These are needed to identify the source of the record (003) and provide a unique identity for the record in that source system (001 + 003). The 005 would give the date of the most recent update before the record was exported. The combination of these three would allow a receiving system to accept periodic updates to these records. I suspect (and this is just me) that the existence of the 001 would also allow one to retrieve the un-truncated record from Michigan's catalog using something like Z39.50. I admit, however, that the lack of these fields could be a simple error.
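Here is a rough sketch of how a receiving system might use those three fields to accept periodic updates. The dictionary-based record, the in-memory store, and the sample values are all assumptions made for illustration. The one thing it relies on from MARC is that the 005 is a fixed-length timestamp (yyyymmddhhmmss.f), so two values can be compared as plain strings.

```python
def apply_update(store, incoming):
    """Add or replace a record in the store, keyed by source (003) plus record ID (001).

    The incoming record replaces the stored one only if its 005 timestamp is later.
    Returns True if the store was changed.
    """
    key = (incoming["003"], incoming["001"])
    current = store.get(key)
    if current is None or incoming["005"] > current["005"]:
        store[key] = incoming
        return True
    return False


# Hypothetical values; a real 001 and 003 would come from the exporting system.
store = {}
original = {"001": "000123", "003": "SRC", "005": "19901105000000.0"}
stale = {"001": "000123", "003": "SRC", "005": "19890101000000.0"}
revised = {"001": "000123", "003": "SRC", "005": "20080526120000.0"}

print(apply_update(store, original))  # new record is added
print(apply_update(store, stale))     # older version is ignored
print(apply_update(store, revised))   # newer version replaces the stored one
```

Without the 001 and 003 there is no key to match on, and without the 005 there is no way to tell which version is newer.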

I'm not going to say that these records are completely useless, and I am talking to the Open Library folks about adding them to their database. But I do consider the maiming of these records to be an embarrassment, a kind of self-mutilation done by a profession with amazingly low self-esteem. Please, folks, let's find our power and stand up for what we know is right!

Wednesday, May 21, 2008


The Open Library is, among other things, an interesting experiment in the creation of a book catalog that mixes data from libraries, publishers, and online sites (currently only Amazon).

One of the big issues that comes up, of course, is that of author names. We know that author names are recorded differently in different sources. We also know that only the library data carries what could be considered an author identifier: a unique string for each unique author. (How "unique author" is defined is a discussion for another day.)

Because the Open Library creates a web page for each author, and that web page links to the books by the author, it is important not to split an author's works into separate pages for each form of the author's name. In other words, you wouldn't want a page for "Mark Twain" and another for "Twain, Mark." Although that would be a simple case.

As book data is added to the Open Library, the incoming bibliographic records are matched to those already in the database. It is only after a match is found that the authors are compared in some detail. The fact that the varying author names appear on the same bibliographic record allows you to make some inferences. So the library names can be switched around to "natural order" for comparison to Amazon data. The main question is, however, if you don't get an exact match at that point, what else would constitute a match?
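As a sketch of the "natural order" step, something like the following would do for the simple cases. It is not the Open Library's code. The function names are mine, and it deliberately ignores dates, titles, and multi-part surnames, which is exactly where the trouble starts.

```python
def to_natural_order(heading):
    """Turn 'Surname, Forenames' into 'Forenames Surname'.

    Only the first comma is treated as the inversion point, so headings with
    dates or titles after a second comma are not handled properly.
    """
    surname, comma, rest = heading.partition(",")
    if not comma:
        return heading.strip()
    return f"{rest.strip()} {surname.strip()}".strip()


def normalize(name):
    """Lowercase, drop periods, and collapse runs of whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def same_author(library_heading, other_name):
    """Exact comparison after inversion and light normalization."""
    return normalize(to_natural_order(library_heading)) == normalize(other_name)


print(to_natural_order("Twain, Mark"))
print(same_author("Villano, Clair E.", "Clair E Villano"))
print(same_author("Bartlett, Merrill L.", "Merrill Bartlett"))
```

The last comparison fails because of the missing middle initial, which is the kind of near miss the main question is about.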

There is an interesting set of data (created by data wrangler Edward Betts) that lists matched books that have un-matched authors. This small set is like a microcosm of "the author problem." Edward has added Jaro-Winkler values as a way to quantify the matches, although it isn't clear where the bright line is between match and no match. This is an interesting problem.
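For anyone who hasn't met it, here is a minimal sketch of the standard Jaro-Winkler calculation. It is a textbook version with the usual prefix weight of 0.1 and a four-character prefix limit, and it is not necessarily the implementation Edward used.

```python
def jaro(s1, s2):
    """Jaro similarity: 1.0 for identical strings, 0.0 for nothing in common."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order; mismatches are half-transpositions.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


print(round(jaro_winkler("Erofeev", "Erofeyev"), 3))
print(round(jaro_winkler("Tsernianski", "Crnjanski"), 3))
```

The score runs from 0 to 1, and the shared-prefix boost means that transliteration variants which differ only near the end score high, while those that differ at the first letter do not. Where to put the cutoff is still a judgment call.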

Here are some things that turn up in the small set:

No author vs. many authors
When a book has many authors, especially when it is a compilation of works by different authors, library catalog data does not record "an author" in the author position (MARC 100). Amazon lists all of the authors as authors. (#2, #14 in data set)

Libraries do not list "anonymous" as an author. Amazon does (and even has a link so you can click on it!). Compare Open Library and Amazon for the book "Primary Colors."

Transliterated name forms
Erofeev v. Erofeyev (#7), Tsernianski v. Crnjanski (#170). Also see #s 162 and 163 for this same case. Although there are some examples of misspellings of complex names, the issue here is mainly that the library cataloging standardizes on a particular form of the name, and the publisher and bookseller data probably uses the form that appears on the work in question. You can see that in the Crnjanski book which has "Tsernianski" in the "by" statement (which is the awkward term I came up with for the statement of responsibility, and suggestions are welcome -- although don't bother to suggest "statement of responsibility" because I consider that NOT for user consumption).

Bits and Pieces
Some names come with "bits and pieces" like titles of address (Dr., Mrs.) and multi-part names (see #9 for De Courcy, Catherine vs. Courcy Catherine De, and #76 for Gregory, Saint, Bishop of Tours vs. Gregory of Tours). The problems here seem to be a combination of not knowing when, where, and how to include some of these (like the "Jr." in William F. Buckley Jr., which isn't included in the library heading -- #155), and, once they are there, what order to put them in. This is an area where some data match formulas might be able to help out.

Just Plain Wrong?
Obviously, the data isn't always perfect, and in particular many Amazon entries seem to have been hastily input. (I wonder to what extent these represent the used book sellers on Amazon? How does that data get into Amazon's database? See #21, and its Amazon entry.) Also, there are many entries that are incomplete (e.g. just the author's last name).

I truly believe that the future will bring more instances when we will find ourselves needing to combine bibliographic data from different sources, or move data across traditional community boundaries. For this reason I have one specific request for the library community: Please provide the name as it appears on the work in a form that can be used for matching.
This means that burying it in the statement of responsibility with no further mark-up does not help. Meanwhile, we may want to consider string-matching within the SoR against names from other sources. Yeech!
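For what it's worth, a crude version of that string matching might look like the sketch below. The function names are mine, and the approach (strip diacritics and punctuation, then look for the name as a run of whole words) is just one assumption about how you might go about it.

```python
import re
import unicodedata


def normalize(text):
    """Strip diacritics, lowercase, and reduce punctuation to single spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", no_marks.lower()).strip()


def name_in_sor(name, sor):
    """True if the name appears in the statement of responsibility as whole, adjacent words."""
    wanted = normalize(name)
    if not wanted:
        return False
    return f" {wanted} " in f" {normalize(sor)} "


sor_1 = ("by Clair E. Villano, Metropolitan Denver District Attorneys' "
         "Office of Consumer Fraud and Economic Crime.")
sor_2 = "compiled by Merrill L. Bartlett."

print(name_in_sor("Clair E. Villano", sor_1))
print(name_in_sor("Merrill Bartlett", sor_2))
```

The second lookup fails on nothing more than a middle initial, which shows how fragile this is compared to having the name as it appears on the work in a field of its own.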

Saturday, May 03, 2008

An easy, online, social library catalog

One thing that I learned in my short visit to Kosovo is that there are many libraries there, and I'm sure in every region and in every country, that are small and have no catalog. (There are also large libraries without catalogs, but the solution for them is more difficult than what I am proposing here.) I went online to see what software might be available for these libraries, and came to the conclusion that 1) the software they need does not exist and 2) there's no reason for catalog creation to be as complex as we've made it. As a matter of fact, if we look around us there are many online systems that are free to users, or nearly so, require no training, and that function on a fairly large scale. What I'm proposing here is actually no more complex than most social networking systems, but with a library bent. Here's what we need:
  1. A social networking site where the society members are libraries, not individuals.
  2. The ability to capture copy cataloging from other libraries or create cataloging on the site itself.
  3. Full Unicode support, both for the interface and for the data.
  4. The ability to capture and create records using a MARC-compatible format.
  5. The ability to export the library catalog records in MARC format.
  6. A reports function that could print off the results of searches or even the library's inventory, so it could be used off-line.
  7. The creation of groups of "library friends," that is other libraries whose data should be included in searches and displays. This will facilitate sharing and also will serve users in areas where resources are scarce and scattered.
  8. A search and display interface that looks like a modern library catalog.
  9. It all has to be easy to use with no training required, and not require any technical support on the part of the library.
Sound impossible? Hardly. Essentially, I'm thinking of a cross between MySpace and LibraryThing, with a user interface that looks something like Scriblio. It could also be called a WorldCat with an easy cataloging interface and very, very low user fees. It may benefit from some of the features of the wiki world, with shared editing of bibliographic data, so I guess I should add the Open Library into the mix.

There are many people encouraging libraries to use Open Source systems like Koha, but the libraries I'm talking about here have no capability to run software, much less Unix-based software. They may have only one computer, and it has to be used for everything: Internet access, office applications like document creation, and, if they have the capability, the library catalog. For those that do have at least part-time Internet access, the ideal system would be run online, with no technical requirements on the library's part.

The MARC requirement is an important one. The system does not need to support the full MARC record, but support for a standard minimum record means that the libraries can use each other's data for copy cataloging, and that some time in the future they may be able to contribute their records to library systems or to regional union catalogs. The ability to form networks between libraries is essential to overcome the incredible scarcity that exists for people living in rural and under-developed areas.

We already have many of the parts of this system, and I'm confident that the technology is no problem. We need the organization and the sustainability. Please send along any suggestions you have for how we can get this done.