Coyle's InFormation: 11/01/2007

Friday, November 30, 2007

Titles in Retail and Publisher Data

There's been talk and action lately around libraries making use of data provided by publishers or retailers. What little experience I have in this area leads me to understand that we need to do some serious studies of the bibliographic metadata that is created in situations outside of libraries. What I present here is a single bit of work, not a study, and the numbers should be considered valid only for this particular set of data. However, I think that this shows the value that real studies could produce in terms of understanding our relative approaches to metadata.

For those who prefer not to read further, let me give you my conclusions here:

1. Libraries focus on the title as it is given on the title page. Others (publishers, retailers) are more interested in the cover title, both in its promotional role and as that which the buyer and retailer see when handling the product and creating order on shelves.

2. While online bookstores rely heavily on the ISBN to identify the item, and are therefore motivated to correct the ISBN if needed, the Library of Congress records in this study appear to be updated less often to correct an ISBN. (It would therefore be interesting to do this comparison with OCLC records to see if they get corrected more frequently than LoC records.)

3. Retailers and publishers use the form of the author's name that is on the book itself and do not concern themselves with the unique identification of authors. Only libraries use the authoritative name form, which may not match up to the form used by others.

4. Publishers and libraries have different data points for number of pages, with libraries using the numbered pages and publishers focusing on the total number of sheets.

I should also mention that everyone except libraries seems to use title case for titles. Does anyone know the logic behind the library decision not to use title case?

The Comparison

- 250,000 LC MARC records compared to Amazon online data, matching on ISBN, then comparing titles

The Numbers

- 71,000 records matched on ISBN
- of those, 67,000 also matched on title, or on partial title (left-anchored)
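
A minimal sketch of how a comparison like this could be coded. It is not the program that produced the numbers above. The function names, the case-folding step, and the dictionary layout (ISBN mapped to title) are all assumptions made for illustration.

```python
from typing import Dict


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace (an assumed step, so that
    title case and sentence case do not count as a difference)."""
    return " ".join(title.lower().split())


def compare_titles(marc_title: str, amazon_title: str) -> str:
    """Classify a pair of titles as 'exact', 'partial', or 'none'."""
    a, b = normalize(marc_title), normalize(amazon_title)
    if a == b:
        return "exact"
    # Left-anchored partial match: one title begins with the other.
    if a and b and (a.startswith(b) or b.startswith(a)):
        return "partial"
    return "none"


def match_records(marc: Dict[str, str], amazon: Dict[str, str]) -> Dict[str, str]:
    """Join the two data sets on ISBN, then compare titles for each match.

    Both arguments are hypothetical {isbn: title} dictionaries.
    """
    return {
        isbn: compare_titles(marc_title, amazon[isbn])
        for isbn, marc_title in marc.items()
        if isbn in amazon
    }
```

With this sketch, a pair such as "Walking with dinosaurs" and "BBC Walking with Dinosaurs" is reported as a non-match, because neither title begins with the other.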

Reasons for Non-matches

Of those that didn't match, the reason was (based on an unscientific sample, so the percentages are just a rough guide):

1. The Amazon entry includes what libraries consider to be the series title as part of the title. These are often those "publisher series" that would be placed in a MARC 490. They generally appear prominently on the cover of the book, are presented with the title on the cover, and are carried in the cover design. Retailers also seem to add key information that would appear on the cover, such as the fact that the item includes a CD-ROM.

Amazon: State Shapes: Texas
MARC: Texas (series: State Shapes)

(Note, I've gotten a better look at ONIX data and it turns out that in many cases the series is coded as the title, and the book title is coded as the subtitle. So in the example above, the data that Amazon received would have had "State Shapes" as the title and "Texas" as the subtitle.)

Amazon: How to Prepare for the GMAT with CD-ROM
MARC: How to prepare for the graduate management admission test
(In this case there are two versions of the book, one with, one without the CD. MARC has the same title for both)

Number: ~45%

2. Minor differences in wording or spelling errors. These are often on Amazon titles, perhaps those that have been entered in by bookstores or small retailers that sell through Amazon. There are also some obvious differences in practice which may or may not be consistent in the retailer data.

Amazon: Literature of Memory
MARC: Literatures of memory

Amazon: One Eye Laughing, the Other Eye Weeping
MARC: One eye laughing, the other weeping

Amazon: Java(TM) Server and Servlets
MARC: Java server and servlets

Number: ~27%

3. The title in Amazon includes the name of the author; the MARC record separates these into author and title. (Amazon also includes the author name in the author field.) There are also times when this is reversed (e.g., MARC includes the author name in the title, Amazon does not):

Amazon: John Thelwall's the Peripatetic
MARC: The peripatetic

Amazon: BBC Walking with Dinosaurs
MARC: Walking with dinosaurs

Amazon: Southern Christmas
MARC: Emyl Jenkin's southern Christmas

Number: ~10%

4. Titles so entirely different that it appears to be a wrong ISBN. Often it is an ISBN from another book by the same publisher, possibly a mistaken re-use. Of these, the entries on Amazon appear to be correct, while those in LC records often contain an ISBN that retrieves more than one item from that publisher. I wouldn't be at all surprised to learn that the ISBN received by the CIP program is often not the actual final ISBN. When the book then arrives at LoC, it may be hard to determine that the ISBN has changed, or why.

Amazon: Harriet Tubman
MARC: Paul Robeson

Number: ~8%

5. There are differences in the treatment of numbers and abbreviations that appear in titles. In some cases, the title on Amazon has been abbreviated beyond what appears on the book, probably by a bookseller saving keystrokes. It's also my guess that in some Amazon entries the abbreviation or number is spelled out to influence retrieval.

Amazon: Ten Best Teaching Practices
MARC: 10 best teaching practices

Amazon: God, Doctor Buzzard, and the Bolito Man
MARC: God, Dr. Buzzard, and the Bolito Man

Number: ~6%

6. Mysterious, undiagnosed, or possible errors in the comparison algorithm. I'll work more on these.

Amazon: Decisive Treatise and Epistle Dedicatory
MARC: The book of the decisive treatise determining the connection between the law and wisdom

Number: ~14%
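
Some of the differences in categories 2 and 5 are mechanical enough that a normalization step before comparison could absorb them. The following is a sketch only. The lookup tables are hypothetical and deliberately tiny, and the choice to strip trademark markers and punctuation is my assumption, not something the comparison above actually did.

```python
import re

# Hypothetical, deliberately small lookup tables for illustration.
NUMBER_WORDS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}
ABBREVIATIONS = {"dr": "doctor"}


def normalize_title(title: str) -> str:
    """Reduce a title to a comparison key.

    Lowercases, drops trademark markers such as "(TM)", replaces
    punctuation with spaces, expands known abbreviations, and turns
    spelled-out numbers into digits.
    """
    text = title.lower()
    text = re.sub(r"\((tm|r)\)", "", text)   # "java(tm)" -> "java"
    text = re.sub(r"[^\w\s]", " ", text)     # punctuation -> space
    words = []
    for word in text.split():
        word = ABBREVIATIONS.get(word, word)
        word = NUMBER_WORDS.get(word, word)
        words.append(word)
    return " ".join(words)


def same_title(amazon_title: str, marc_title: str) -> bool:
    """True when the two titles share the same comparison key."""
    return normalize_title(amazon_title) == normalize_title(marc_title)
```

Under these assumptions the "Ten Best Teaching Practices", "Doctor Buzzard", and "Java(TM)" pairs above would compare as equal, while a genuine difference in wording such as "One Eye Laughing, the Other Eye Weeping" would still be reported as a non-match.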

Link to the Data

This link takes you to a page with comparisons that link to the MARC record held at the Internet Archive and to the Amazon page for the book. You can see these and other differences by looking at the two sets of data. Again, note that this was a quick comparison and there are some errors in the comparison methodology that we are already aware of.

Some Other Observed Differences

Although not included in this group (which only compares titles) there are other differences that I have observed in ONIX data but that I haven't attempted to measure.

7. Authors. An obvious area of difference is that publisher and retailer data does not use the library name authorities form of the name ("Smith, John, 1837-"). Publishers tend to include a display form of the name in their data ("John Smith") and some of them also include the inverted form ("Smith, John"), but there is no concept of unique identification of authors across time and between different publishers. Talking with publisher representatives, I also have learned that the form of the author's name that will be used on the printed book and in publicity may be designated in the contract between the author and the publisher. This does not mean that publishers cannot include a version of the name as found in the LC name authorities file. However, it is likely that there will be multiple forms of the name in non-library data, rather than the single form found in library records.

8. Pagination. I was looking at pagination as part of a de-duping algorithm because it served us well when de-duping within library data as a way to distinguish different editions. This will not be the case between library and publisher data, at least not with the data that I have seen. Publishers have an entirely different measure of pages, and it is (logically) the actual number of pages in the physical book. This is clearly a matter of cost to them, and also a key piece of data about the manufacture of the book. Libraries, instead, record the printed page numbers. This latter is immediately visible to the cataloger, while the publisher count would mean having to actually hand count the pages in the book. In this case, libraries and publishers are each working with the information that is easily found at hand, but the results differ considerably.

Thursday, November 15, 2007

Future of Bibliographic Control, LC, 11/13

Notes from the meeting on Nov. 13 of the Working Group on the Future of Bibliographic Control.

These are my notes and should NOT be taken to represent accurately the thoughts of the working group, only my quick recording of what I understood at the meeting. Also, I must add the disclaimer that I have been engaged as a consultant to the group for the writing of the report. I attempt in that work to be as faithful to the outcomes desired of the group as I can. However, I admit that pure objectivity is a chimera, so my own opinions may come through in the text below.

There was an introduction explaining the creation of the working group (which you can read about on the working group's web site: http://www.loc.gov/bibliographic-future/). The group presented an interim report to the Library of Congress. The full report will be available by December 1 for public comment. The comment period will end on December 15, and the final report will be presented on January 8, 2008.

The report was commissioned by the Library of Congress, but many of its recommendations involve the library community and other players in its environment. There are over 100 individual recommendations in five general areas.

The working group concluded that there are three major "sea changes" that are needed in the library community:

1. We must redefine bibliographic control broadly to include all materials, a widely diverse community of users, and a multiplicity of venues where information is sought.

2. We must redefine the bibliographic universe to include all stakeholders, including the for-profit organizations that are involved in information delivery and digitization.

3. The role of the Library of Congress must be redefined as a partner with other libraries and with non-library institutions, working to achieve the goals of the library community.

The five areas of recommendations are:

1. Increase the efficiency of bibliographic production for all libraries through cooperation and sharing of bibliographic records and through the use of data produced in the overall supply chain.

2. Transfer effort into high value activity. In particular, provide greater value for knowledge creation through leveraging access for unique materials held by libraries, materials that are currently hidden and under-used.

3. Position our technology by recognizing that the Web is our technology platform as well as the appropriate platform for our standards. Recognize that our users are not only people but also applications that interact with library data.

4. Position our community for the future by adding evaluative, qualitative and quantitative analyses of resources. Work to realize the potential provided by the FRBR framework.

5. Strengthen the library and information science profession through education and through development of metrics that will inform decision-making now and in the future.

Under each of these areas there are sets of recommendations. The full set of recommendations is fairly detailed, and the group presented high level groupings of recommendations in the first four areas. (Area five was not presented in detail at the meeting.)

In area 1, the recommendations are grouped:

1.1 Eliminate redundancies in the production of bibliographic metadata. This means making use of data that is created elsewhere in the supply chain, and increasing the sharing of bibliographic records and modifications to records. In particular, the group asks for an examination of barriers to sharing.

1.2 Increase the distribution of responsibility for bibliographic record production. Increase the number of institutions that participate in shared cataloging activities.

1.3 Collaborate on authority record creation. Similar to 1.2, this recommends that the number of participants in authority record creation be increased, but it also asks that we look at the possibility of sharing across sectors and internationally, to reduce the number of times that an authoritative heading must be created.

Area 2 is called "Enhance Access to Rare and Unique Materials." In this area the group states that any efficiencies gained in other areas should allow the redirection of energy to providing access to unique materials that are held by libraries and other cultural heritage institutions. In particular, the group recommends:

2.2 Integrate access to rare & unique materials with other library materials

2.3 Share bibliographic data relating to these materials. The sharing of bibliographic data must not be limited to those areas where copy cataloging is desired.

2.4 Encourage digitization to allow broad access

Area 3 is about technology and the Web:

3.1.1 Integrate library standards into the Web environment

3.1.2 Extend the use of standard identifiers for bibliographic entities, and include those identifiers in bibliographic records.

3.1.3 Develop a more flexible, extensible metadata carrier that can be readily exchanged with non-library applications.

Area 3 also addresses standards:

3.2.1 Develop standards with a focus on return on investment. Do analysis before beginning the standards process.

3.2.2 Incorporate usage data and lessons from use tests in the standards development process

Area 4 is about positioning the library community toward a more progressive future. In this area there are three main recommendation areas:

4.1 Design for today's and tomorrow's user. This means that we must design into our catalogs and other tools the ability to present evaluative information, and to allow and encourage users to interact with bibliographic data. We must also make use of statistical and other computationally-derived information in our user services.

4.2 Realize FRBR. The framework known as FRBR has great potential but so far is untested. It is being used as the basis for RDA, even though FRBR itself is not clearly understood. The working group recommends that no further work be done on RDA until there has been more investigation of FRBR and the basis it provides for bibliographic metadata. [Note: this recommendation is likely to change such that there will be specific recommendations relating to RDA; FRBR will be treated separately.]

4.3 Optimize LCSH for Use and Re-use. Encourage an analysis of LCSH that would move the system toward a more faceted subject system. Work to create more links between LCSH and other subject heading systems in use. Recognize that with the digitization of works the act of subject assignment may benefit from computational analysis.

In the time that I was at the meeting (I had to leave before the question period ended) there were two questions/comments. The first had to do with the fact that, while there are costs to today's methods of bibliographic control, changes in bibliographic control will have costs as well. (Here it would be good to listen again to the talk given by Rick Lugg at the meeting held at LoC. He spoke of the costs of NOT changing, something that is hard to measure but is very real.) The other comment (from Barbara Tillett) mentioned many of the recommendations and stated that LoC is already engaged in, or has rejected, analogous activities. It was acknowledged, however, that LoC had not made these activities public, so the community is generally unaware of the progress made. To me this points out one of the areas that we all need to work on, which is sharing information about our projects and their progress so that the community as a whole can benefit from work done by a single institution.

Wednesday, November 07, 2007

Hierarchy v. Relationships

The use of hierarchy as an organizing principle keeps coming up. I think we are attracted to hierarchy because of its neatness, even though in fact the real world is organized more like fuzzy sets. Fuzzy sets are hard to comprehend, nearly impossible to draw, and can't be slotted neatly into an application.

When people talk about FRBR, they are often focussed on the Group 1 entities, and those are seen as hierarchical. They tend to be shown as:
- Work
-- Expression
--- Manifestation
---- Item

as if we'll fit all of our intellectual works into such a neat hierarchy. T'ain't so. Of all of the relationships that are talked about in FRBR (I almost said "expressed" but that term has now been given a new meaning in this discussion) I think these are the least interesting. And they become even less interesting when we move beyond the traditional inventory control function of the library catalog and begin to see ourselves as navigating in a knowledge universe. But first let me tackle the Group 1 entities.

There are complaints (or remarks, depending on the context) that we don't have an agreed on definition of Work, and that the division between Work and Expression is unclear. They are unclear because in real life there isn't a neat hierarchy that just needs to be modeled. What is a Work is entirely contextual -- when I'm looking for an article, the article is a Work. When I'm subscribing to a journal, the journal is a work. When I'm on iTunes a song is a work, when I'm in the music store the album is a work. A Work is the content I am seeking at that time. In the imaginary universe where I get to create my bibliographic system, a Work will be defined as: anything you wish to talk about, point to, address. So a book-length text is a work, an article in a journal is a work, a journal is a work, a book chapter is a work -- all at the same time and in the same system. For one person the book Wizard of Oz and the movie Wizard of Oz will be a single Work. To a film buff, the director's cut of Blade Runner and the original release are distinct works. To be a Work, it just has to be definable and have a way to name it, that is it has to have an identifier. But anything can be a Work. As a matter of fact, I probably won't use the term Work at all in my universe.

As for Expressions, there will be very obvious Expressions of Works, and there will be fuzzier Expressions. There will be Expressions that express more than one Work. Expression is a relationship, not a subset. If you don't have to organize your bibliographic universe in a hierarchical way, then the need to slot each Expression under a Work goes away, although the relationship can remain.

I'm less sure about Manifestation and Item, even though these are the most concrete of the Group 1 entities. Are they a legitimate focus of a Knowledge Management system, or are they about managing physical objects? When I think about some of the uses of bibliographic data, for instance as citations in a text publication, Manifestation seems to be mainly about locating -- so if I've quoted a passage from a book, I need to cite the manifestation and the page because that's the only way that someone else can find that exact quote. When I include a URL in a document that links to a particular digital manifestation, I am giving the user a direct link to the location. Manifestations and Items will be of interest in some instances, say to rare book collectors, but I'm not at all sure that those instances justify the emphasis they have been given. And if the purpose is primarily inventory control, then I think those relationships will be managed to the extent that they matter to the library. For example, a public library may not terribly care which manifestation of the book Moby Dick is on its shelves, although its inventory system will need to know the barcode, and its acquisitions system will need to store how much the library paid for it and the provider.

The truly interesting relationships in FRBR are those between and among these entities, and those are ones that I have not seen explored. These are the relationships between things: thing1 is a translation of thing2; thing3 is an abridgment of thing4; thing5 extends thing6 in this certain way; thing7 cites thing1; thing8 continues thing3. This is where we get real value, where we provide various interesting paths through which seekers can navigate. This is what we don't provide explicitly in our catalogs today, although a human user may be able to intuit some of these relationships among the works we present.
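
To make the idea concrete, here is a small sketch of relationships stored as typed links between identified things, with no hierarchy imposed. The relationship names and the helper functions are hypothetical, chosen only to mirror the examples in the paragraph above. Nothing here comes from FRBR itself.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (subject, relationship, object), using the examples from the text.
RELATIONS: List[Tuple[str, str, str]] = [
    ("thing1", "is_translation_of", "thing2"),
    ("thing3", "is_abridgment_of", "thing4"),
    ("thing5", "extends", "thing6"),
    ("thing7", "cites", "thing1"),
    ("thing8", "continues", "thing3"),
]


def build_index(relations: List[Tuple[str, str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Index every relationship from both ends so it can be followed
    in either direction."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subject, relationship, obj in relations:
        index[subject].append((relationship, obj))
        index[obj].append(("inverse:" + relationship, subject))
    return index


def reachable(index: Dict[str, List[Tuple[str, str]]], start: str) -> Set[str]:
    """Every thing a seeker could reach from `start` by following
    relationships, in any direction and over any number of steps."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for _relationship, other in index.get(current, []):
            if other not in seen:
                seen.add(other)
                stack.append(other)
    seen.discard(start)
    return seen
```

Starting from thing7, a seeker arrives at thing1 (which it cites) and from there at thing2 (the original of the translation), a path that no Work/Expression/Manifestation/Item hierarchy would provide.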

We have so narrowly defined bibliographic control in libraries that it doesn't really include the relationships between intellectual products, except to the degree that we might make a note that one thing is a translation of another thing. But we see those relationships as "extra" or "secondary," and yet they are the very essence of knowledge creation. It astonishes me that we have focused so completely on the physical items that we have essentially missed what would make our catalogs intelligent.

Sunday, November 04, 2007

Our subject mess

Lately I've had occasion to work with a few different groups of people who are delving into library bibliographic data for the first time. Believe me, it is quite revealing to view it from the viewpoint of these novices. Novices only in this one area, because they generally are quite savvy about computing and data. Each new revelation gives me a chance to regale them with an amusing story about "how it got that way." I can explain (note: explain, not justify) why we have no identifiers for key elements like authors and works. I can pretty much explain why we seem more concerned about the package than the content. I can reminisce about moments in the history of library systems development that happened before some members of these groups were born. But I get totally stuck when they point out the mess that is our subject access.

We have two classification systems, Dewey (DDC) and Library of Congress (LCC). That in itself is not a problem, and it's fairly easy to explain how they developed in different contexts, always making sure to explain that these systems classify the items in a library, not the world of thought.

What is hard is to try to explain what either of them has to do with the Library of Congress Subject Headings (LCSH). Many folks assume that LCSH is the entry vocabulary into LCC. Thus they expect that if there is a classification code in a record that stands for "vocal music, choruses," there will be a heading in the record that is "vocal music, choruses," and vice versa. They also assume that the two subject systems (classification and subject headings) have the same structure, which would mean that you can "drill down" from music to vocal music then to choruses in either or both. Nothing could be further from the truth. So it is quite confusing to them when they see a record with a call number that would ostensibly be about "vocal music, choruses" based on the classification, but instead the subject heading is "Cantatas, Secular -- Scores." And they are equally confused when the record has another subject heading ("Funeral music") but only the one classification number.

I can't explain this disconnect between the subject headings and the classification scheme, except to say: that's how it is.

Recently, I was browsing through my beloved copy of the DDC from 1899 that still has both its numeric and alphabetical tabs relating respectively to the classification and the "Relativ Subject Index." The RSI is indeed an index to the classification scheme, and it appears that Dewey originally intended it also as the access to the collection:

"HOW TO USE THIS INDEX
Find the subject desired in its alphabetical place in the index. The number after it is its class number and refers to the place where the topic will be found, in numerical order of class numbers, on the shelves or in the subject catalog."

From this I can only presume that both the shelves and the subject catalog were in classification order, and that the alphabetical index was the index to that classification. My guess, from what he says here, is that the subject catalog also contained the verbal translation of what the decimal classification numbers meant.

"Under this class number will be found the resources of the library on the subject desired. Other subjects near the one sought may often be consulted with profit; e.g., Communism is the topic wanted and the index refers to 335.4, but 335, Socialism, and even the inclusive division 330, Political economy, also contain much on this subject. The reverse is equally true; the full material on socialism can only be had by looking at its divisions 335.3, Fourierism, 335.4, Communism, etc. The topics which are thus subdivided are plainly marked in the index by heavy faced type."

My copy is #3933, originally owned by the Roger Williams Park Museum in Providence, Rhode Island. The current incarnation of the institution appears to be the Museum of Natural History and Planetarium. My copy has many penciled notes in the area of Zoology (DDC 590), which would fit the natural history nature of the institution. (I don't see any evidence of a current library.) By 1900 the "dictionary catalog" would have taken root, so I don't know if the library would have followed Dewey's instructions for the creation of a classified catalog. But I do wonder how we got from a single system that had an alphabetical index to a classification system to a system with an alphabetical index and two classification systems, but in which the index and the classification have essentially each gone their own ways. This is obviously a gap in my education, which I will gladly rectify if you have suggestions for readings.
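
Returning to Dewey's Communism example quoted above: the "drill down" (or up) that my novices expect can be read directly off the class number. The sketch below is illustrative only. The function name is mine, and it assumes that each digit of a three-digit class number marks one level of the hierarchy, which Dewey's notation mostly, but not always, honors.

```python
from typing import List


def broader_classes(class_number: str) -> List[str]:
    """List the enclosing classes of a decimal class number, nearest
    first, e.g. 335.4 -> 335 -> 330 -> 300.

    Assumes a three-digit whole part with an optional decimal part.
    """
    whole, _, fraction = class_number.partition(".")
    chain: List[str] = []
    current = class_number
    # Drop decimal digits one at a time: 335.41 -> 335.4 -> 335
    while fraction:
        fraction = fraction[:-1]
        current = f"{whole}.{fraction}" if fraction else whole
        chain.append(current)
    # Then zero out the units digit and the tens digit: 335 -> 330 -> 300
    for keep in (2, 1):
        parent = whole[:keep] + "0" * (3 - keep)
        if parent != current:
            chain.append(parent)
            current = parent
    return chain
```

For 335.4 this yields 335 and 330, the two places Dewey tells the reader to look, followed by the main class 300. There is no comparable way to compute a path from that number to a subject heading such as "Cantatas, Secular -- Scores," which is the disconnect described above.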

Meanwhile, no wonder users are confused.