There's been talk and action lately around libraries making use of data provided by publishers or retailers. What little experience I have in this area leads me to understand that we need to do some serious studies of the bibliographic metadata that is created in situations outside of libraries. What I present here is a single bit of work, not a study, and the numbers should be considered valid only for this particular set of data. However, I think that this shows the value that real studies could produce in terms of understanding our relative approaches to metadata.
For those who prefer not to read further, let me give you my conclusions here:
1. Libraries focus on the title as it is given on the title page. Others (publishers, retailers) are more interested in the cover title, both in its promotional role and as that which the buyer and retailer see when handling the product and creating order on shelves.
2. While online bookstores rely heavily on the ISBN to identify the item, and therefore are motivated to correct the ISBN if needed, the Library of Congress records in this study appear to be less often updated to correct an ISBN. (Therefore, it would be interesting to do this comparison with OCLC records to see if they get corrected more frequently than LoC.)
3. Retailers and publishers use the form of the author's name that is on the book itself and do not concern themselves with the unique identification of authors. Only libraries use the authoritative name form, which may not match up to the form used by others.
4. Publishers and libraries have different data points for number of pages, with libraries using the numbered pages and publishers focusing on the total number of sheets.
I should also mention that everyone except libraries seems to use title case for titles. Does anyone know the logic behind the library decision not to use title case?
- 250,000 LC MARC records compared to Amazon online data, matching on ISBN, then comparing titles
- 71,000 records matched on ISBN
- of those, 67,000 also matched on title, or on partial title (left-anchored)
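For readers who want to try something similar, here is a minimal sketch of the kind of comparison described above: match on ISBN, then compare titles, accepting a left-anchored partial match. This is not the code used for these numbers. The function names and the normalization rules (lowercasing, dropping punctuation) are my assumptions for illustration, and the inputs are assumed to be simple dictionaries keyed by ISBN.

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(cleaned.split())

def titles_match(marc_title: str, amazon_title: str) -> bool:
    """True on an exact match, or when the shorter title is a
    left-anchored, whole-word prefix of the longer one."""
    a, b = normalize_title(marc_title), normalize_title(amazon_title)
    if not a or not b:
        return False
    shorter, longer = sorted((a, b), key=len)
    return longer == shorter or longer.startswith(shorter + " ")

def compare(marc_by_isbn: dict, amazon_by_isbn: dict) -> dict:
    """Count ISBN matches, then title matches among them.
    Both arguments are hypothetical {isbn: title} mappings."""
    shared = marc_by_isbn.keys() & amazon_by_isbn.keys()
    matched = {i for i in shared
               if titles_match(marc_by_isbn[i], amazon_by_isbn[i])}
    return {
        "isbn_matches": len(shared),
        "title_matches": len(matched),
        "mismatched": sorted(shared - matched),
    }
```

Under these rules "Texas" does not match "State Shapes: Texas", because the shared words are not at the left edge of the longer title. That is exactly the kind of case that falls into the non-match group.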
Reasons for Non-matches
Of those that didn't match, the reasons were as follows (based on an unscientific sample, so the percentages are just a rough guide):
1. The Amazon entry includes what libraries consider to be the series title as part of the title. These are often those "publisher series" that would be placed in a MARC 490. They generally appear prominently on the cover of the book, are presented with the title on the cover, and are carried in the cover design. Retailers also seem to add key information that would appear on the cover, such as the fact that the item includes a CD-ROM.
Amazon: State Shapes: Texas
MARC: Texas (series: State Shapes)
(Note, I've gotten a better look at ONIX data and it turns out that in many cases the series is coded as the title, and the book title is coded as the subtitle. So in the example above, the data that Amazon received would have had "State Shapes" as the title and "Texas" as the subtitle.)
Amazon: How to Prepare for the GMAT with CD-ROM
MARC: How to prepare for the graduate management admission test
(In this case there are two versions of the book, one with, one without the CD. MARC has the same title for both)
2. Minor differences in wording or spelling errors. These are often on Amazon titles, perhaps those that have been entered in by bookstores or small retailers that sell through Amazon. There are also some obvious differences in practice which may or may not be consistent in the retailer data.
Amazon: Literature of Memory
MARC: Literatures of memory
Amazon: One Eye Laughing, the Other Eye Weeping
MARC: One eye laughing, the other weeping
Amazon: Java(TM) Server and Servlets
MARC: Java server and servlets
3. The title in Amazon includes the name of the author; the MARC record separates these into author and title. (Amazon also includes the author name in the author field.) There are also times when this is reversed (e.g., MARC includes the author name in the title, Amazon does not):
Amazon: John Thelwall's the Peripatetic
MARC: The peripatetic
Amazon: BBC Walking with Dinosaurs
MARC: Walking with dinosaurs
Amazon: Southern Christmas
MARC: Emyl Jenkin's southern Christmas
4. Titles so entirely different that it appears to be a wrong ISBN. Often it is an ISBN from another book by the same publisher, possibly a mistaken re-use. Of these, the entries on Amazon appear to be correct, while those in LC records often contain an ISBN that retrieves more than one item from that publisher. I wouldn't be at all surprised to learn that the ISBN received by the CIP program is often not the actual final ISBN. When the book then arrives at LoC, it may be hard to determine that, or why, the ISBN has changed.
Amazon: Harriet Tubman
MARC: Paul Robeson
5. There are differences in the treatment of numbers and abbreviations that appear in titles. In some cases, the title on Amazon has been abbreviated beyond what appears on the book, probably by a bookseller saving keystrokes. It's also my guess that in some Amazon entries the abbreviation or number is spelled out to influence retrieval.
Amazon: Ten Best Teaching Practices
MARC: 10 best teaching practices
Amazon: God, Doctor Buzzard, and the Bolito Man
MARC: God, Dr. Buzzard, and the Bolito Man
6. Mysterious, undiagnosed, or possible errors in the comparison algorithm. I'll work more on these.
Amazon: Decisive Treatise and Epistle Dedicatory
MARC: The book of the decisive treatise determining the connection between the law and wisdom
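Two of the categories above, the series carried in the title (1) and the differences in numbers and abbreviations (5), look mechanical enough to detect before a pair is counted as a non-match. The following is only a sketch under my own assumptions: the lookup tables are deliberately tiny and illustrative, the function names are invented, and the series is assumed to be available as a plain string taken from the MARC 490.

```python
import re

# Illustrative lookup tables only; a real list would be much longer.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4",
                "five": "5", "six": "6", "seven": "7", "eight": "8",
                "nine": "9", "ten": "10"}
ABBREVIATIONS = {"dr": "doctor", "mr": "mister", "vol": "volume"}

def normalize(title: str) -> str:
    """Lowercase, drop punctuation, and map number words and
    abbreviations to a single form on both sides of the comparison."""
    tokens = re.findall(r"\w+", title.lower())
    return " ".join(NUMBER_WORDS.get(t, ABBREVIATIONS.get(t, t))
                    for t in tokens)

def classify(amazon_title: str, marc_title: str, marc_series: str = "") -> str:
    """Give a rough reason label for one Amazon/MARC title pair."""
    a, m = normalize(amazon_title), normalize(marc_title)
    if a == m:
        return "match after normalization"
    if marc_series and a == f"{normalize(marc_series)} {m}":
        return "series in title"
    return "unexplained"
```

With these tables, the "Ten Best Teaching Practices" and "Doctor Buzzard" pairs come out as matches, and "State Shapes: Texas" is labeled as a series-in-title case when the series string is supplied. A pair like "Harriet Tubman" and "Paul Robeson" stays unexplained, as it should.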
Link to the Data
This link takes you to a page with comparisons that link to the MARC record held at the Internet Archive and to the Amazon page for the book. You can see these and other differences by looking at the two sets of data. Again, note that this was a quick comparison and there are some errors in the comparison methodology that we are already aware of.
Some Other Observed Differences
Although not included in this group (which only compares titles), there are other differences that I have observed in ONIX data but haven't attempted to measure.
7. Authors. An obvious area of difference is that publisher and retailer data does not use the library name authorities form of the name ("Smith, John, 1837-"). Publishers tend to include a display form of the name in their data ("John Smith") and some of them also include the inverted form ("Smith, John"), but there is no concept of unique identification of authors across time and between different publishers. Talking with publisher representatives, I also have learned that the form of the author's name that will be used on the printed book and in publicity may be designated in the contract between the author and the publisher. This does not mean that publishers cannot include a version of the name as found in the LC name authorities file. However, it is likely that there will be multiple forms of the name in non-library data, rather than the single form found in library records.
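To compare names across the two kinds of data, one option is to reduce the library authority form to something like the publisher's display form before comparing. The sketch below is a simplification resting on my own assumptions: it handles only the "Surname, Forenames, dates" pattern shown above, and it ignores titles, fuller forms of names, and corporate authors. The function names are hypothetical.

```python
import re

# Matches date elements such as "1837-" or "1837-1901" (optionally with "?").
DATE_PART = re.compile(r"\d{4}\??-(\d{4}\??)?")

def display_form(name: str) -> str:
    """Turn 'Smith, John, 1837-' or 'Smith, John' into 'John Smith'.
    A name with no comma is returned unchanged."""
    parts = [p.strip() for p in name.split(",")]
    parts = [p for p in parts if p and not DATE_PART.fullmatch(p)]
    if len(parts) < 2:
        return " ".join(parts)
    surname, forenames = parts[0], parts[1]
    return f"{forenames} {surname}"

def names_agree(library_name: str, publisher_name: str) -> bool:
    """Case-insensitive comparison of the two display forms."""
    return display_form(library_name).lower() == display_form(publisher_name).lower()
```

This loses exactly what the authority form exists to provide: two different people named John Smith will compare as equal once the dates are dropped, so a match here is a hint, not an identification.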
8. Pagination. I was looking at pagination as part of a de-duping algorithm because it served us well when de-duping within library data as a way to distinguish different editions. This will not be the case between library and publisher data, at least not with the data that I have seen. Publishers have an entirely different measure of pages, and it is (logically) the actual number of pages in the physical book. This is clearly a matter of cost to them, and also a key piece of data about the manufacture of the book. Libraries, instead, record the printed page numbers. This latter is immediately visible to the cataloger, while the publisher count would mean having to actually hand count the pages in the book. In this case, libraries and publishers are each working with the information that is easily found at hand, but the results differ considerably.