Sunday, October 28, 2007

Bibliographic ER

No, I'm not sending libraries to the emergency room, although there are days when I feel like we're at that point. The ER in the title refers to Entity-Relationship, a way to look at data that emphasizes the general viewpoint that there are things, and those things exist in relation to each other.

In one sense, this is what we have done for over a century with our library data. The bibliographic records that we create have in them many relationships: Person authored Book; Publishing House published Book; Book is in Series; Book has Topics. Those relationships are implicit in our records, but the data isn't formatted in an entity-relationship model. Our records, instead, talk about the relationships but don't make it easy to give the various entities their own existence. So we create a record that contains:

Book title.
Place, publisher, date
Subject A
Subject B

The record represents all of the information about the book, but there is no record that represents all of the information about the author, or all of the information about the publisher, etc. Instead, those "entities" are buried in bibliographic records scattered throughout the file.

An E-R model would give each of these entities an identity on which you could hang information about the entity.

OK, I can't draw worth beans. But basically the idea is that authors, subjects, publishers, topics, all become entries in their own right. This means that you can add information to the author record or the series record, because they have their own place in the design. It also makes it easy to look at your data from many different points of view, while still retaining all of the richness of the relationships. So from the point of view of the person who is the illustrator in the book above, the bibliographic world may look like this:
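As a rough sketch of the idea in code rather than in a drawing, the Python below gives each entity its own identity and stores relationships separately, so the same data can be viewed from any entity's side. Everything in it is an assumption made for illustration: the `Entity` and `Catalog` classes, the identifiers, and the placeholder labels are hypothetical and do not come from FRBR, RDA, or any real library system.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Anything with its own identity: a person, book, publisher, topic."""
    id: str
    kind: str
    label: str
    # Information "hung" on the entity itself, not on a book record.
    facts: dict = field(default_factory=dict)


class Catalog:
    """Hypothetical store of entities plus the relationships between them."""

    def __init__(self):
        self.entities = {}
        self.relationships = []  # (subject_id, verb, object_id) triples

    def add(self, entity):
        self.entities[entity.id] = entity
        return entity

    def relate(self, subject, verb, obj):
        self.relationships.append((subject.id, verb, obj.id))

    def view_from(self, entity):
        """Return every relationship that touches this entity, either way."""
        return [
            (self.entities[s].label, verb, self.entities[o].label)
            for s, verb, o in self.relationships
            if entity.id in (s, o)
        ]


# Placeholder data, invented for this example only.
catalog = Catalog()
illustrator = catalog.add(Entity("p1", "person", "Example Illustrator"))
book = catalog.add(Entity("b1", "book", "Example Book"))
house = catalog.add(Entity("h1", "publisher", "Example House"))
topic = catalog.add(Entity("t1", "topic", "Subject A"))

catalog.relate(illustrator, "illustrated", book)
catalog.relate(house, "published", book)
catalog.relate(book, "has topic", topic)

# Because the person is an entity, information can be attached directly.
illustrator.facts["biography"] = "placeholder text"

print(catalog.view_from(illustrator))
print(catalog.view_from(book))
```

The relationships are kept as simple subject-verb-object triples, which is one common way to let the "point of view" change without duplicating any data.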

This type of model is expressed in FRBR, but the E-R aspect of FRBR does not seem to be incorporated into RDA as it stands today. Instead, RDA appears to be aimed at creating the same flat structure that we have in library data today.

If you take a look at the OpenLibrary you will see that books get a page that is about the book, and authors get a separate page that is about the author. This is very simple, but it is also very important. It means that the catalog is no longer just a list of books with authors but can become a rich source of information about authors. You can add bios for authors, link to web sites about the author, launch a discussion group about a favorite author. Because the author is an entity, not just a data element in a record about the book, it becomes a potentially active part of your information system.

In the future, I hope that we can give life to many more entities in the OpenLibrary, and also that we can give them meaningful relationships with one another. This would mean taking a semantic web approach to library data. I don't have a clear picture of where we'll end up, but I'm glad that folks there are interested in experimenting. If you've already thought this through or have ideas in this direction, please step forward. I'd love to hear from you.

Saturday, October 20, 2007

Great Minds...

As if in response to my post on name authorities, OCLC has come up with a version of the Virtual International Authority File (acronym VIAF). Type in "Fitzgerald, Michael" and you'll see that each name has associated with it what they are calling a "sample title." The titles are unattractive, being normalized forms, but still give you some idea of what each author has written, and you might be able to sort the Michael Fitzgerald who writes on XSL from the one who has written the guide to better business letters. At this point, the fact that authority control has already determined that these are different people becomes incredibly valuable. That value was much harder to see when all you had were names and dates.
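To make the point concrete, here is a small Python sketch of why separate identities plus a sample title help. The record layout, the identifiers, and the placeholder titles are all hypothetical; this is not the VIAF data format or any OCLC API, only an illustration of two authority records that share one name string.

```python
# Hypothetical authority records: same name string, different identities.
# The ids and titles are placeholders, not real VIAF data.
authority_file = [
    {"id": "auth-0001", "name": "Fitzgerald, Michael",
     "sample_title": "placeholder title about xsl"},
    {"id": "auth-0002", "name": "Fitzgerald, Michael",
     "sample_title": "placeholder guide to business letters"},
]


def candidates(name, records):
    """Return every authority record whose name string matches exactly."""
    return [r for r in records if r["name"] == name]


matches = candidates("Fitzgerald, Michael", authority_file)
for m in matches:
    # The name alone cannot tell these apart; the sample title can.
    print(m["id"], "-", m["sample_title"])
```

The name search returns two candidates, and it is the identifier, not the name, that keeps them apart. The sample title is just a hint that lets a person choose the right one.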

Friday, October 12, 2007

Cataloging as Industry

Something pointed me to this paper by Alan Danskin of the British Library:
Tomorrow never knows: the end of cataloguing?

It has some well-spoken statements about the great increase in materials, the need to collaborate better with others in the publishing supply chain, etc. But what really stood out for me was this:
The future of cataloguing depends on transforming the process from a craft into an industry.

He qualifies this by saying:
This requires unambiguous identification at different levels of granularity to facilitate repurposing of metadata created at the different stages of the process of creating and publishing resources. It also means we may have to be less precious about some of our cherished practices.

I can't disagree with what he says here, but I have a different take on the industrialization of cataloging: we should consider taking cataloging out of the library and giving it to others who will actually industrialize it. Just as we don't hand-craft our own library shelves or our own library systems, perhaps we shouldn't be hand-crafting our own catalog records.

What I refer to here would probably come under the rubric of "outsourcing," some of which already takes place, especially for works in less common or more difficult languages. But what if, just what if, someone could develop a cataloging service that was cheaper than what libraries can do themselves, and had comparable quality? Is there any reason why we shouldn't go for it?