Friday, November 30, 2007

Titles in Retail and Publisher Data

There's been talk and action lately around libraries making use of data provided by publishers or retailers. What little experience I have in this area leads me to understand that we need to do some serious studies of the bibliographic metadata that is created in situations outside of libraries. What I present here is a single bit of work, not a study, and the numbers should be considered valid only for this particular set of data. However, I think that this shows the value that real studies could produce in terms of understanding our relative approaches to metadata.

For those who prefer not to read further, let me give you my conclusions here:

1. Libraries focus on the title as it is given on the title page. Others (publishers, retailers) are more interested in the cover title, both in its promotional role and as that which the buyer and retailer see when handling the product and creating order on shelves.

2. While online bookstores rely heavily on the ISBN to identify the item, and therefore are motivated to correct the ISBN if needed, the Library of Congress records in this study appear to be less often updated to correct an ISBN. (Therefore, it would be interesting to do this comparison with OCLC records to see if they get corrected more frequently than LoC.)

3. Retailers and publishers use the form of the author's name that is on the book itself and do not concern themselves with the unique identification of authors. Only libraries use the authoritative name form, which may not match up to the form used by others.

4. Publishers and libraries have different data points for number of pages, with libraries using the numbered pages and publishers focusing on the total number of pages in the physical book.

I should also mention that everyone except libraries seems to use title case for titles. Does anyone know the logic behind the library decision not to use title case?

The Comparison

- 250,000 LC MARC records compared to Amazon online data, matching on ISBN, then comparing titles

The Numbers

- 71,000 records matched on ISBN
- of those, 67,000 also matched on title, or on partial title (left-anchored)
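
The exact matching rules used for this comparison are not spelled out here, so the following is only a minimal sketch of what "matched on title, or on partial title (left-anchored)" could look like. The function names and the normalization choices (lowercasing, dropping punctuation and accents) are assumptions for illustration, not the code that produced the numbers above.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace.

    Hypothetical normalization; the rules behind the figures above may differ.
    """
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(no_punct.split())


def titles_match(title_a, title_b):
    """True on an exact match, or when the shorter title is a
    left-anchored, whole-word prefix of the longer one."""
    a, b = normalize_title(title_a), normalize_title(title_b)
    if not a or not b:
        return False
    if a == b:
        return True
    shorter, longer = sorted((a, b), key=len)
    # Trailing space keeps "literature" from matching "literatures".
    return longer.startswith(shorter + " ")
```

Under these assumptions, "Walking with Dinosaurs" and "Walking with dinosaurs" match, while "Literature of Memory" and "Literatures of memory" do not, which is the kind of near miss described in the categories below.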

Reasons for Non-matches

For the records that didn't match on title, these were the reasons (based on an unscientific sample, so the percentages are only a rough guide):

1. The Amazon entry includes what libraries consider to be the series title as part of the title. These are often those "publisher series" that would be placed in a MARC 490. They generally appear prominently on the cover of the book, are presented with the title on the cover, and are carried in the cover design. Retailers also seem to add key information that would appear on the cover, such as the fact that the item includes a CD-ROM.

Amazon: State Shapes: Texas
MARC: Texas (series: State Shapes)

(Note, I've gotten a better look at ONIX data and it turns out that in many cases the series is coded as the title, and the book title is coded as the subtitle. So in the example above, the data that Amazon received would have had "State Shapes" as the title and "Texas" as the subtitle.)

Amazon: How to Prepare for the GMAT with CD-ROM
MARC: How to prepare for the graduate management admission test
(In this case there are two versions of the book, one with, one without the CD. MARC has the same title for both)

Number: ~45%
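
One way to catch this first category is to compare the retailer title against the MARC title combined with its series statement. The sketch below assumes the series titles have already been pulled out of the MARC 490 field into a list of strings; the function names are hypothetical.

```python
import re


def _norm(text):
    """Lowercase and keep only letters and digits, separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def candidate_titles(marc_title, series_titles):
    """Build the title forms a retailer might plausibly display.

    series_titles is assumed to be a list of strings taken from MARC 490.
    """
    candidates = {_norm(marc_title)}
    for series in series_titles:
        candidates.add(_norm(f"{series} {marc_title}"))  # "State Shapes: Texas"
        candidates.add(_norm(f"{marc_title} {series}"))  # "Texas (State Shapes)"
    return candidates


def matches_with_series(retail_title, marc_title, series_titles):
    """True if the retailer title equals any series/title combination."""
    return _norm(retail_title) in candidate_titles(marc_title, series_titles)
```

With this, "State Shapes: Texas" matches a MARC record whose title is "Texas" and whose series is "State Shapes". It does nothing for additions such as "with CD-ROM", which would need separate handling.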

2. Minor differences in wording or spelling errors. These are often on Amazon titles, perhaps those that have been entered in by bookstores or small retailers that sell through Amazon. There are also some obvious differences in practice which may or may not be consistent in the retailer data.

Amazon: Literature of Memory
MARC: Literatures of memory

Amazon: One Eye Laughing, the Other Eye Weeping
MARC: One eye laughing, the other weeping

Amazon: Java(TM) Server and Servlets
MARC: Java server and servlets

Number: ~27%

3. The title in Amazon includes the name of the author; the MARC record separates these into author and title. (Amazon also includes the author name in the author field.) There are also times when this is reversed (e.g., MARC includes the author name in the title, Amazon does not):

Amazon: John Thelwall's the Peripatetic
MARC: The peripatetic

Amazon: BBC Walking with Dinosaurs
MARC: Walking with dinosaurs

Amazon: Southern Christmas
MARC: Emyl Jenkin's southern Christmas

Number: ~10%

4. Titles so entirely different that it appears to be a wrong ISBN. Often it is an ISBN from another book by the same publisher, possibly a mistaken re-use. Of these, the entries on Amazon appear to be correct, while those in LC records often contain an ISBN that retrieves more than one item from that publisher. I wouldn't be at all surprised to learn that the ISBN received by the CIP program is often not the actual final ISBN. When the book then arrives at LoC, it may be hard to determine that or why the ISBN has changed.

Amazon: Harriet Tubman
MARC: Paul Robeson

Number: ~8%

5. There are differences in the treatment of numbers and abbreviations that appear in titles. In some cases, the title on Amazon has been abbreviated beyond what appears on the book, probably by a bookseller saving keystrokes. It's also my guess that in some Amazon entries the abbreviation or number is spelled out to influence retrieval.

Amazon: Ten Best Teaching Practices
MARC: 10 best teaching practices

Amazon: God, Doctor Buzzard, and the Bolito Man
MARC: God, Dr. Buzzard, and the Bolito Man

Number: ~6%
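
Differences of this kind could be reduced by expanding numbers and abbreviations to a common form before comparing. This is a rough sketch only: the two lookup tables are tiny, hypothetical samples, and some abbreviations are ambiguous ("Dr." can also mean "Drive"), so a real table would need care.

```python
import re

# Hypothetical sample tables; a real comparison would need far larger ones.
NUMBER_WORDS = {
    "1": "one", "2": "two", "3": "three", "4": "four", "5": "five",
    "6": "six", "7": "seven", "8": "eight", "9": "nine", "10": "ten",
}
ABBREVIATIONS = {"dr": "doctor", "vol": "volume"}


def expand_tokens(title):
    """Lowercase the title and spell out known numbers and abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    expanded = [NUMBER_WORDS.get(t, ABBREVIATIONS.get(t, t)) for t in tokens]
    return " ".join(expanded)


def same_after_expansion(title_a, title_b):
    """True if the two titles are identical once expanded."""
    return expand_tokens(title_a) == expand_tokens(title_b)
```

Both example pairs above compare as equal under this sketch, since "10" becomes "ten" and "Dr." becomes "doctor" on either side.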

6. Mysterious, undiagnosed, or possible errors in the comparison algorithm. I'll work more on these.

Amazon: Decisive Treatise and Epistle Dedicatory
MARC: The book of the decisive treatise determining the connection between the law and wisdom

Number: ~14%

Link to the Data

This link takes you to a page with comparisons that link to the MARC record held at the Internet Archive and to the Amazon page for the book. You can see these and other differences by looking at the two sets of data. Again, note that this was a quick comparison and there are some errors in the comparison methodology that we are already aware of.

Some Other Observed Differences

Although not included in this group (which only compares titles) there are other differences that I have observed in ONIX data but that I haven't attempted to measure.

7. Authors. An obvious area of difference is that publisher and retailer data does not use the library name authorities form of the name ("Smith, John, 1837-"). Publishers tend to include a display form of the name in their data ("John Smith") and some of them also include the inverted form ("Smith, John"), but there is no concept of unique identification of authors across time and between different publishers. Talking with publisher representatives, I also have learned that the form of the author's name that will be used on the printed book and in publicity may be designated in the contract between the author and the publisher. This does not mean that publishers cannot include a version of the name as found in the LC name authorities file. However, it is likely that there will be multiple forms of the name in non-library data, rather than the single form found in library records.
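
To compare across the two kinds of data, the library authority form can be reduced to something closer to the publisher display form. The sketch below is a simplification that assumes a personal name in the pattern "Surname, Forename, dates"; it treats any comma-separated part containing a digit as dates, and it does not attempt the harder problem of deciding whether two names refer to the same person. The function name is hypothetical.

```python
def display_form(authority_name):
    """Turn "Smith, John, 1837-" into "John Smith".

    Assumes "Surname, Forename[, extras][, dates]"; parts containing
    digits are treated as dates and dropped.
    """
    parts = [p.strip() for p in authority_name.split(",") if p.strip()]
    parts = [p for p in parts if not any(ch.isdigit() for ch in p)]
    if not parts:
        return ""
    if len(parts) == 1:
        return parts[0]
    surname, forename, extras = parts[0], parts[1], parts[2:]
    return " ".join([forename, surname] + extras)
```

The same function also handles the plain inverted form, so "Smith, John" and "Smith, John, 1837-" both become "John Smith".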

8. Pagination. I was looking at pagination as part of a de-duping algorithm because it served us well when de-duping within library data as a way to distinguish different editions. This will not be the case between library and publisher data, at least not with the data that I have seen. Publishers have an entirely different measure of pages, and it is (logically) the actual number of pages in the physical book. This is clearly a matter of cost to them, and also a key piece of data about the manufacture of the book. Libraries, instead, record the printed page numbers. This latter is immediately visible to the cataloger, while the publisher count would mean having to actually hand count the pages in the book. In this case, libraries and publishers are each working with the information that is easily found at hand, but the results differ considerably.

3 comments:

Anonymous said...

The first reason for non-matches above is interesting. It is obviously valuable to capture the semantic meaning of "series title," but it would seem to imply that you are actually going to do something with that relationship.

The most obvious thing to do would be to create a link out of the series name that, when clicked on, retrieves a list of all books that are part of the series. The data in the MARC record enables this and librarians seem to implicitly understand the value of coding this kind of relationship in our cataloging standard. However, I am not sure any library catalog interface provides such a simple, efficient and elegant display and user interaction opportunity out of such data. Instead we just change the way the title displays on a web page that may ironically break the connection to how it appears on the work itself.

This seems to be one of the major reasons for why library description differs so much in practice from the business world's bibliographic description. We describe things to achieve a maximum *possibility* for understanding the semantics of the thing, but we stop there and do not necessarily realize all possibilities. The business world employs a description that is semantically poorer but they get maximum return on investment out of the data elements they do have. The user experience of the business's customer is far richer even though the bibliographic description offered to the library patron is far deeper.

The situation seems to be summed up by the distinction between "getting it right" (libraries) and "doing it well" (businesses). I think some of the other data you describe here also supports this subtle difference.

Melissa said...

So, to continue what Steve was saying, what catalogers need is not to change what they do, but how they display it? Stylesheets for MARC.

The problem is that if we continue to catalog the way we do (ie, ignoring the publisher-created metadata) we're always going to miss out on the efficacy of using publisher data. Does it matter that we catalog deeply when the users don't understand what we create, or don't see it?

Anonymous said...

The library catalog interface that did a good job of handling series data was the card catalog. Online catalog displays have a long way to go to be able to fulfill the basic functions of the catalog.
That is a minor point, however. The real point is that publishers and librarians conceive of titles and authors differently. The publishers can be satisfied with "good enough" because their files are small; librarians need to get things "right" or at least accurate because their files are so large. In order to make use of the publisher data, we need to be much more generous in the use of variant titles and to allow for matchings on spellings and titles that are close to what is on the book. This can be done by machine processing.
What cannot be done so easily by machine processing is actually matching item and bibliographic data and doing the intellectual work to identify authors (in whatever form those names are displayed). I am excited about using the ONIX data, but "useful use" will require not only careful programming but some human intervention for some time yet.