Wednesday, May 21, 2008


The Open Library is, among other things, an interesting experiment in the creation of a book catalog that mixes data from libraries, publishers, and online sites (currently only Amazon).

One of the big issues that comes up, of course, is that of author names. We know that author names are recorded differently in different sources. We also know that only the library data carries what could be considered an author identifier: a unique string for each unique author. (How "unique author" is defined is a discussion for another day.)

Because the Open Library creates a web page for each author, and that web page links to the books by the author, it is important not to split an author's works into separate pages for each form of the author's name. In other words, you wouldn't want a page for "Mark Twain" and another for "Twain, Mark." Although that would be a simple case.

As book data is added to the Open Library, the incoming bibliographic records are matched to those already in the database. It is only after a match is found that the authors are compared in some detail. The fact that the varying author names appear on the same bibliographic record allows you to make some inferences. So the library names can be switched around to "natural order" for comparison to Amazon data. The main question is, however, if you don't get an exact match at that point, what else would constitute a match?
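As a rough illustration of the "natural order" switch described above, here is a minimal Python sketch. The function name natural_order is my own invention, not anything from the Open Library code, and it assumes a heading shaped like "Surname, Forename" with any later comma-separated pieces (dates, titles) dropped.

```python
def natural_order(heading: str) -> str:
    """Turn an inverted heading such as 'Twain, Mark' into 'Mark Twain'.

    Hypothetical helper: assumes the text before the first comma is the
    surname and the next comma-separated segment is the forename.
    """
    surname, sep, rest = heading.partition(",")
    if not sep:
        # No comma, so assume the name is already in natural order.
        return heading.strip()
    # Anything after a second comma (dates, titles) is ignored here.
    forename = rest.split(",")[0].strip()
    return f"{forename} {surname.strip()}".strip()


if __name__ == "__main__":
    for heading in ("Twain, Mark", "De Courcy, Catherine", "Mark Twain"):
        print(heading, "->", natural_order(heading))
```

A heading like "Gregory, Saint, Bishop of Tours" shows the limits of this approach: the simple split has no idea that "Saint" is not a forename.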

There is an interesting set of data (created by data wrangler Edward Betts) that lists matched books that have un-matched authors. This small set is like a microcosm of "the author problem." Edward has added Jaro-Winkler values as a way to quantify the matches, although it isn't clear where the bright line is between match and no match. This is an interesting problem.
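For readers who have not met Jaro-Winkler before, the sketch below shows how such a score can be computed in plain Python. It is a textbook version of the measure with the usual defaults (a prefix weight of 0.1 and a prefix capped at four characters), not necessarily the exact code or settings Edward used, and it deliberately picks no threshold, since where the bright line sits is exactly the open question.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, from 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are this close in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


if __name__ == "__main__":
    for a, b in (("Erofeev", "Erofeyev"), ("Tsernianski", "Crnjanski")):
        print(a, b, round(jaro_winkler(a, b), 3))
```

Because the measure rewards a shared prefix, a pair like Erofeev and Erofeyev scores higher than Tsernianski and Crnjanski, even though both pairs are the same kind of transliteration difference.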

Here are some things that turn up in the small set:

No author vs. many authors
When a book has many authors, especially when it is a compilation of works by different authors, library catalog data does not record "an author" in the author position (MARC 100). Amazon lists all of the authors as authors. (#2, #14 in data set)

Libraries do not list "anonymous" as an author. Amazon does (and even has a link so you can click on it!). Compare Open Library and Amazon for the book "Primary Colors."

Transliterated name forms
Erofeev v. Erofeyev (#7), Tsernianski v. Crnjanski (# 170). Also see #s 162 and 163 for this same case. Although there are some examples of misspellings of complex names, the issue here is mainly that the library cataloging standardizes on a particular form of the name, and the publisher and bookseller data probably uses the form that appears on the work in question. You can see that in the Crnjanski book which has "Tsernianski" in the "by" statement (which is the awkward term I came up with for the statement of responsibility, and suggestions are welcome -- although don't bother to suggest "statement of responsibility" because I consider that NOT for user consumption).

Bits and Pieces
Names with "bits and pieces" like titles of address (Dr., Mrs.) and multi-part names (see #9 for De Courcy, Catherine vs. Courcy Catherine De, and #76 for Gregory, Saint, Bishop of Tours vs. Gregory of Tours). The problems here seem to be a combination of not knowing when, where, and how to include some of these (like the "Jr." in William F. Buckley Jr., which isn't included in the library heading -- #155), and, once they are there, what order to put them in. This is an area where some data match formulas might be able to help out.
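One possible "data match formula" for the bits-and-pieces cases is to compare names as unordered sets of words, after dropping titles of address. The sketch below is only that, a sketch: the IGNORED list and the function names are my own assumptions, and a real list of titles would need to be much longer.

```python
import re

# Hypothetical, deliberately short list of pieces to ignore when comparing.
IGNORED = {"dr", "mr", "mrs", "ms", "jr", "sr", "saint"}


def name_tokens(name: str) -> frozenset:
    """Lower-cased words of a name, minus ignored pieces, in no order."""
    words = re.findall(r"[^\W\d_]+", name.lower())
    return frozenset(w for w in words if w not in IGNORED)


def same_tokens(a: str, b: str) -> bool:
    """True when two names contain the same words in any order."""
    return name_tokens(a) == name_tokens(b)


if __name__ == "__main__":
    pairs = [
        ("De Courcy, Catherine", "Courcy Catherine De"),
        ("William F. Buckley Jr.", "Buckley, William F."),
        ("Gregory, Saint, Bishop of Tours", "Gregory of Tours"),
    ]
    for a, b in pairs:
        print(a, "|", b, "|", same_tokens(a, b))
```

This handles the De Courcy and Buckley cases, but the Gregory pair still fails because "Bishop" appears in only one form. A looser rule (one set of words contained in the other) would catch it, at the cost of more false matches.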

Just Plain Wrong?
Obviously, the data isn't always perfect, and in particular many Amazon entries seem to have been hastily input. (I wonder to what extent these represent the used book sellers on Amazon? How does that data get into Amazon's database? See #21, and its Amazon entry.) Also, there are many entries that are incomplete (e.g. just the author's last name).

I truly believe that the future will bring more instances when we will find ourselves needing to combine bibliographic data from different sources, or move data across traditional community boundaries. For this reason I have one specific request for the library community: Please provide the name as it appears on the work in a form that can be used for matching.
This means that burying it in the statement of responsibility with no further mark-up does not help. Meanwhile, we may want to consider string-matching within the SoR against names from other sources. Yeech!


Anonymous said...


Thanks for the interesting post. But to get all FRBR for a minute... When you say:

"Please provide the name as it appears on the work in a form that can be used for matching."

... we can't, can we? Because a name can't appear on a work, because it's an abstraction. The name can only appear on an item (which exemplifies a manifestation). So we can transcribe an author's name as written on the item and associate it with a manifestation; and then come up with a controlled form and associate it with a work. Amazon mushes the two together which is the reason for the confusion.

Perhaps the answer is for RDA to have an 'author' property of a manifestation that holds the name of an author as it appears on the manifestation. And the work continues to have the author property with the controlled form.


Anonymous said...

"Please provide the name as it appears on the work in a form that can be used for matching."

Aside from the FRBR/RDA terminology question...

In theory, the name should appear in the record, as it appears on the item, in the 245 subfield c. As far as I know, that is the ONLY place in a MARC/AACR record you'll find that. Whether that meets the criteria of "a form that can be used for matching", however...

I'm not sure of the best way to handle this question. Certainly there needs to be some way of linking works by an author independent of the exact form of the name on the work/manifestation/item/whatever.

I hate to suggest a uniform identifier of some sort - a number, probably, that the user would never see (unless they wanted to), but that would serve to link all forms of the name together...which would also avoid having an "authorized" form of the name, which tends to be pretty arbitrary. Don't know how workable that would be, though.

Anonymous said...

But doesn't the form of name that appears in 245c often also get recorded as a variant form in name authority records? Thom Hickey had an interesting blog post recently about his work in controlling names in WorldCat (see
He mentioned using other clues to try to sort out harder name matches. Might the folks at Open Library do the same? Something like: "This author who wrote on this particular subject is likely to be the same author who wrote this particular manifestation that I have in hand." Sounds doable, right?


Con said...

arkham, I don't see why you are resistant to providing a unique identifier for an author.

Seriously, this is in fact the only scalable solution. Name matching is always error prone. Do the name matching once, check it, and then assign a unique ID. Then go on to the next name without a unique ID.

Trying to use people's names as natural keys is an old-fashioned idea. Librarians need to get over their pride on this point and learn a lesson from computer scientists, who have known this for some time.

Thank god for the Germans who have managed to get control numbers included generally in MARC (in the $0 subfield). Yes! As of this year!