Tuesday, April 08, 2008


One of the big challenges for the RDA Vocabularies project is to define terms that have clear, unambiguous meaning so that library data can have a firm foundation for data processing. Library data as conceived in the 19th century (and carried through to today) was designed as a textual display, not as data elements. This meant that it only had to be clear to human beings, who are marvelous at the interpretation of text and quite accepting of the imprecision of utterances. Now that our data must be managed by machines, we have less room for free-wheeling language as the basis of our data. The text that follows was my attempt to explain this on the RDA-L mailing list.

In RDA, there is a field called "extent" that is defined as:

"Extent reflects the number of units and/or subunits making up a resource."

That's fine as meaning goes. Here are examples from RDA Chapter 3:

327 pages
1 sculpture
2 portfolios ([18] leaves; [24] leaves)

These are fairly clear: number + unit ("327" + "pages"). Then you get:

viii, 278 pages

This could be interpreted as:
viii pages
278 pages

But it has another meaning, which is that it is also conveying (I could say primarily conveying) the PAGINATION, that is, how the pages are numbered. Pagination is important for distinguishing editions, so this is good information, but it isn't the same as the number of pages in the item -- especially not to a computer.
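
To make the distinction concrete, here is a minimal Python sketch of what a program would have to do to read "viii, 278 pages" as pagination rather than as a single count. The function names, and the assumption that a statement is just comma-separated roman or arabic numerals followed by "pages" or "leaves", are illustrative only; nothing here comes from RDA, and real extent statements (bracketed counts, nested subunits) are far messier.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral):
    """Convert a roman numeral such as 'viii' to an integer (no validation)."""
    total = 0
    previous = 0
    for char in reversed(numeral.lower()):
        value = ROMAN_VALUES[char]
        # A smaller value standing before a larger one is subtracted ('iv' = 4).
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


def parse_pagination(statement):
    """Split 'viii, 278 pages' into (label, last_number) pairs, one per sequence."""
    match = re.fullmatch(r"\s*(.+?)\s+(pages|leaves)\s*", statement)
    if match is None:
        raise ValueError(f"unrecognized extent statement: {statement!r}")
    sequences = []
    for part in match.group(1).split(","):
        label = part.strip()
        if label.isdigit():
            sequences.append((label, int(label)))
        elif label and all(ch in ROMAN_VALUES for ch in label.lower()):
            sequences.append((label, roman_to_int(label)))
        else:
            raise ValueError(f"cannot interpret sequence: {label!r}")
    return sequences


sequences = parse_pagination("viii, 278 pages")
print(sequences)                             # [('viii', 8), ('278', 278)]
print(sum(last for _, last in sequences))    # 286
```

Even the sum, 286, is only the count of numbered pages. It says nothing about unnumbered pages, so it still cannot be treated as the physical page count of the item.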

Why does this matter?

It matters because it has a different meaning -- a meaning that humans can distinguish, but computers cannot. And even for humans the meaning is somewhat ambiguous -- since only numbered pages are included.

In the past our data was designed as text to be read by humans. We could rely on humans to make inferences about the data, which meant that there was tolerance for this kind of mixing of meanings. But if, for example, we want to be able to match up ONIX records and library records for the same item (and there are good reasons to do that both for acquisitions purposes and user service purposes), then this mixture of meanings makes it hard to compare the ONIX number of pages (which is literally the number of pages in the book -- that's after all what they pay the printer for) with the library pagination. (Believe me it's hard -- I've been working on this kind of match.) In essence, if we want to include number + unit AND pagination in the same record, we should distinguish between them.
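
One way to picture "distinguishing between them" is a record fragment with two separate fields. The sketch below is hypothetical: the `Extent` class, its field names, and the page count of 288 are invented for illustration and are not taken from RDA, MARC, or ONIX.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Extent:
    """Hypothetical record fragment that keeps the two meanings apart."""
    unit: str                                   # e.g. "pages"
    count: Optional[int] = None                 # physical number of units, if stated
    pagination: List[str] = field(default_factory=list)  # numbering as printed


def same_page_count(a: Extent, b: Extent) -> Optional[bool]:
    """Compare counts only when both records state one; None means 'cannot tell'."""
    if a.unit != b.unit or a.count is None or b.count is None:
        return None
    return a.count == b.count


# A library-style record that carries only pagination (illustrative values).
library = Extent(unit="pages", pagination=["viii", "278"])
# A publisher-style record that carries only a literal page count (made-up number).
publisher = Extent(unit="pages", count=288)

print(same_page_count(library, publisher))  # None
```

With the meanings in separate fields, the program can at least report that the two records are not comparable on page count, instead of guessing from a text string.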

And let's not go into a rant against the publishers ("They do it wrong!" No, they don't; they do what is right for them). There is a distinction often made also in abstracting and indexing data, where an article can have both page numbers ("43-47") and number of pages ("5"). This latter is used in ILL to estimate copying costs.
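
The article example shows how small the gap between the two meanings can look and still matter. A short Python sketch, assuming a simple "first-last" range with both numbers written out in full (abbreviated ranges such as "143-7" are not handled), derives the number of pages from the page numbers:

```python
def pages_in_range(page_range):
    """Return how many pages an inclusive range like '43-47' covers."""
    first_text, last_text = page_range.split("-")
    first, last = int(first_text), int(last_text)
    if last < first:
        raise ValueError(f"range runs backwards: {page_range!r}")
    # Both end pages are part of the article, hence the + 1.
    return last - first + 1


print(pages_in_range("43-47"))  # 5
```

The derivation only works in one direction: "5" can be computed from "43-47", but "43-47" cannot be recovered from "5", which is one reason such data carries both.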

So if we want our data to play well in the world of bibliographic information, we have to pay close attention to meaning. We have our habits (as I believe I showed when I asked about title case on the RDA list*, which drew many responses, though much of what came back was speculation), but those do not serve us well if we can't turn them into unambiguous data definitions. I think it is a shame that RDA is carrying forward some of our habits without thinking more about meaning.

* I'll summarize this in an upcoming post


F. Tim Knight said...

Hi Karen,
The pagination also tells you that there is preliminary info with this particular document: acknowledgments, toc, preface, foreword, etc. Not the same meaning you're referring to, and perhaps not particularly important in this context, but that's what came to mind when I read your post.

Karen Coyle said...

Good point, Tim. So there's a lot of value in this data, but it takes some interpretation (and not all users will be swift enough to "get it" - I didn't - of course, roman numerals aren't a natural for me, so I just tend to skip over them without interpreting them as numbers). Then we come to the question: is this valuable information? and if it is, is this the best way to convey it?