The Open Library is, among other things, an interesting experiment in the creation of a book catalog that mixes data from libraries, publishers, and online sites (currently only Amazon).
One of the big issues that comes up, of course, is that of author names. We know that author names are recorded differently in different sources. We also know that only the library data carries what could be considered an author identifier: a unique string for each unique author. (How "unique author" is defined is a discussion for another day.)
Because the Open Library creates a web page for each author, and that web page links to the books by the author, it is important not to split an author's works into separate pages for each form of the author's name. In other words, you wouldn't want a page for "Mark Twain" and another for "Twain, Mark," although that would be a simple case.
As book data is added to the Open Library, the incoming bibliographic records are matched to those already in the database. It is only after a match is found that the authors are compared in some detail. The fact that the varying author names appear on the same bibliographic record allows you to make some inferences. So the library names can be switched around to "natural order" for comparison to Amazon data. The main question is, however, if you don't get an exact match at that point, what else would constitute a match?
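As a rough illustration of the "natural order" switch, here is a minimal Python sketch. The function names (to_natural_order, same_name) are made up for this example and are not Open Library code, and it assumes the heading is a simple "Surname, Forename" string, possibly followed by dates.

```python
def to_natural_order(heading: str) -> str:
    """Turn an inverted heading such as "Twain, Mark" into "Mark Twain"."""
    surname, comma, rest = heading.partition(",")
    if not comma:
        # No comma: assume the name is already in natural order.
        return heading.strip()
    # Assumption: only the first segment after the surname is the forename;
    # later segments (dates, titles) are dropped.
    forename = rest.split(",")[0].strip()
    return f"{forename} {surname.strip()}".strip()


def same_name(library_heading: str, other_name: str) -> bool:
    """Exact comparison after inversion, ignoring case and extra spaces."""
    natural = to_natural_order(library_heading).casefold()
    return natural == " ".join(other_name.split()).casefold()
```

This only handles the easy case. A heading like "Gregory, Saint, Bishop of Tours" comes out as "Saint Gregory", which is exactly the kind of thing that fails the exact match and lands in the "what else would constitute a match?" pile.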
There is an interesting set of data (created by data wrangler Edward Betts) that lists matched books that have un-matched authors. This small set is like a microcosm of "the author problem." Edward has added Jaro-Winkler values as a way to quantify the matches, although it isn't clear where the bright line is between match and no match. This is an interesting problem.
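For readers who haven't met Jaro-Winkler, here is a small self-contained Python sketch of the textbook formula. It is not the code Edward used to produce the values in the data set, and the prefix weight of 0.1 and prefix limit of 4 characters are the conventional defaults, assumed here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matches."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk the matched characters in order and count those out of sequence.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)
```

A pair like "Erofeev" and "Erofeyev" scores very close to 1 with this formula, while names that share little beyond a few letters score much lower. Where to draw the cutoff between those two ends is the open question.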
Here are some things that turn up in the small set:
No author vs. many authors
When a book has many authors, especially when it is a compilation of works by different authors, library catalog data does not record "an author" in the author position (MARC 100). Amazon lists all of the authors as authors. (#2, #14 in data set)
Libraries do not list "anonymous" as an author. Amazon does (and even has a link so you can click on it!). Compare Open Library and Amazon for the book "Primary Colors."
Transliterated name forms
Erofeev v. Erofeyev (#7), Tsernianski v. Crnjanski (#170). Also see #162 and #163 for this same case. Although there are some examples of misspellings of complex names, the issue here is mainly that library cataloging standardizes on a particular form of the name, while the publisher and bookseller data probably use the form that appears on the work in question. You can see that in the Crnjanski book, which has "Tsernianski" in the "by" statement. ("By" statement is the awkward term I came up with for the statement of responsibility. Suggestions are welcome, although don't bother to suggest "statement of responsibility" because I consider that NOT for user consumption.)
Bits and Pieces
Some names come with "bits and pieces" like titles of address (Dr., Mrs.) or multiple parts (see #9 for De Courcy, Catherine vs. Courcy Catherine De, and #76 for Gregory, Saint, Bishop of Tours vs. Gregory of Tours). The problems here seem to be a combination of not knowing when, where, and how to include some of these (like the "Jr." in William F. Buckley Jr., which isn't included in the library heading -- #155), and, once they are there, what order to put them in. This is an area where some data match formulas might be able to help out.
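One possible "data match formula" for this case, sketched in Python: throw away the titles and suffixes, ignore word order, and compare what is left. The HONORIFICS list and the function names are invented for this illustration, and a real list would need to be much longer.

```python
import re

# Illustrative only: a tiny, assumed list of titles and suffixes to ignore.
HONORIFICS = {"dr", "mr", "mrs", "ms", "jr", "sr", "sir", "saint"}


def name_tokens(name: str) -> frozenset:
    """Lower-case the name, split it into words, and drop titles/suffixes."""
    words = re.findall(r"\w+", name.casefold())
    return frozenset(w for w in words if w not in HONORIFICS)


def same_tokens(a: str, b: str) -> bool:
    """True when both names contain the same words, in any order."""
    tokens_a, tokens_b = name_tokens(a), name_tokens(b)
    return bool(tokens_a) and tokens_a == tokens_b
```

With this, "De Courcy, Catherine" matches "Courcy Catherine De", and "Buckley, William F." matches "William F. Buckley Jr." It still does not match "Gregory, Saint, Bishop of Tours" to "Gregory of Tours", because "Bishop" is left over. Ignoring word order also risks false matches between different people whose names use the same words.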
Just Plain Wrong?
Obviously, the data isn't always perfect, and in particular many Amazon entries seem to have been hastily input. (I wonder to what extent these represent the used book sellers on Amazon? How does that data get into Amazon's database? See #21, and its Amazon entry.) Also, there are many entries that are incomplete (e.g. just the author's last name).
I truly believe that the future will bring more instances when we will find ourselves needing to combine bibliographic data from different sources, or move data across traditional community boundaries. For this reason I have one specific request for the library community: Please provide the name as it appears on the work in a form that can be used for matching.
This means that burying it in the statement of responsibility with no further mark-up does not help. Meanwhile, we may want to consider string-matching within the SoR against names from other sources. Yeech!
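For what it's worth, the string-matching fallback could be as crude as the following Python sketch, which checks whether a name from another source appears as a run of whole words inside the statement of responsibility. The function names are hypothetical, and the strings in the usage note are made up rather than taken from real records.

```python
import re


def normalize(text: str) -> str:
    """Lower-case the text and reduce it to words separated by one space."""
    return " ".join(re.findall(r"\w+", text.casefold()))


def name_in_sor(name: str, sor: str) -> bool:
    """True if the name's words appear together, in order, in the SoR."""
    wanted = normalize(name)
    # Pad with spaces so that only whole words can match.
    return bool(wanted) and f" {wanted} " in f" {normalize(sor)} "
```

So name_in_sor("Mark Twain", "by Mark Twain ; with illustrations") is true, while name_in_sor("Twain, Mark", "by Mark Twain") is false, since the inverted form never appears on the work. It also can't tell an author from a translator or illustrator named in the same statement.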