Sunday, December 17, 2006

Digitization and the Catalog

I have just posted the preprint of my current column for the Journal of Academic Librarianship, titled "Mass Digitization of Books." It takes about 4-6 months for the columns to be published, and as I read over this one I can see that things have already changed. For example, when I wrote the column, Google was not yet allowing the download of its public domain books.

However, I should have included one more very important issue in the article, one that hadn't occurred to me at the time: the effect of this mass digitization on our catalogs. The cataloging rules require that the digital copy be represented in the catalog with its own record. This means that a library that undertakes a mass digitization project on its book collection faces doubling the number of book records in its catalog. Leaving aside the issues of user display for now, and assuming that the creation of the records requires very little human intervention, we can probably still expect a significant cost in storage space (albeit cheap these days), the size of backups, the time to load and index all of those records, and general overhead in the underlying database.

This brings up the issue of creating catalog entries that represent "multiple versions," that is, having a single record that contains the information for all of the different formats in which the book is available -- regular print, e-book version, digitized copy, large print. There are good arguments both for and against, and it's a complex discussion, but I'll just say that I am convinced that we could structure our catalog records in a way that would make this work.
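To make the "multiple versions" idea concrete, here is a minimal sketch of a single record that carries every available format. It is only an illustration: the class names, the field names, and the sample values are all hypothetical and do not correspond to MARC or to any existing cataloging standard.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    """One format in which the book is available (hypothetical structure)."""
    format: str    # e.g. "print", "e-book", "digitized copy", "large print"
    location: str  # shelf location or URL; an assumed field for illustration


@dataclass
class WorkRecord:
    """A single catalog record shared by all versions of the same book."""
    title: str
    author: str
    versions: list[Version] = field(default_factory=list)

    def formats(self) -> list[str]:
        # List the formats in the order they were added to the record.
        return [v.format for v in self.versions]


# Placeholder data, not a real catalog entry.
record = WorkRecord(title="Example Title", author="Example Author")
record.versions.append(Version("print", "stacks"))
record.versions.append(Version("digitized copy", "https://example.org/scan"))
```

Under this assumption, digitizing the book adds one entry to `versions` rather than a second full record, so the shared description (title, author) is stored and indexed only once.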


Anonymous said...

Whether the records are in fact 'single records' or 'multiple records' seems to me not as important as how our systems present such related groups of records to the user. Any way you can think of to have things presented to the user---that way can work with either single or multiple records. So long as the system can unambiguously determine that the multiple records are different versions of the same 'work' or what have you!

We need to stop thinking in terms of the structure of our records _directly_ corresponding to display/interface. The structure of our records is important to capture the semantic information we want to capture. The determination of how to structure the records should be based on how to most efficiently and flexibly capture that semantic information--as well as, regrettably, how to fit into our 'legacy' systems. If the semantic information has been captured properly---many presentations, any presentation, is possible.

To me, whether they should be separate records or a single record is a non-issue. The important thing is that the system is capable of BOTH grouping them together AND separating them. Currently, this is of course not generally the case.
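A minimal sketch of what "both grouping and separating" could look like, assuming each record carries an identifier for the work it represents. The `work_key` field, the record layout, and the function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Separate records, one per format. "work_key" is an assumed field that
# says which records are versions of the same work.
records = [
    {"id": 1, "work_key": "w1", "format": "print"},
    {"id": 2, "work_key": "w1", "format": "digitized copy"},
    {"id": 3, "work_key": "w2", "format": "print"},
]


def group_by_work(records):
    """Group separate records into one list per work."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["work_key"]].append(rec)
    return dict(groups)


def separate(groups):
    """Flatten grouped records back into individual records."""
    return [rec for recs in groups.values() for rec in recs]


grouped = group_by_work(records)
flat = separate(grouped)
```

Either view can be derived from the other as long as the link between versions is recorded unambiguously, which is the point being made above.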

Jonathan Rochkind

Karen Coyle said...

Jonathan, I agree that the structure of the record is not the same as what the user should see in the display. But remember that we have more than one user: there is the user who must create the records (generally the copy cataloger), and the user who must maintain the system (the IT person) and who will have to respond in case of a disaster. There is also the manager who is responsible for overall costs. My concern is not for the end-user, because I think we can develop a friendly display (and already do in some catalogs), but for the fact that we don't seem to have thought through the consequences for systems maintenance and cost. How many libraries could double the size of their catalog and not feel the pinch?

Jason said...

I'm not a cataloger, I work in IT. But to me this seems like not seeing the forest for the trees, only in the sense that I would be less worried about the stress on the system than I would about the sheer manpower that it takes to enter such records. And I'm not just talking about a person creating duplicate format records manually; even loading them through an automated batch process and then re-indexing can be quite time-consuming.

I agree that it is an issue to be considered, although most systems could handle such growth if running on an enterprise database such as Oracle. Storage is relatively cheap these days. Labor is not cheap.