Sunday, May 13, 2012

RDA, DBMS, RDF

I have written before about some issues relating to RDA and RDF. Today I want to look at a few things that should cause us to question the concept of "RDA in RDF."

For many decades we have been using relational databases to store our bibliographic data, bibliographic data that we create and exchange using the MARC format. Doing so was not by any means natural or intuitive because there is nothing about the structure or content of the MARC record that lends itself to being stored and managed in a relational database. The results were often awkward, inefficient, and unsatisfying.

Part of the reason for this is the unitary and flat nature of MARC. In spite of the long history of creating separate authority files, each MARC record is a complete and closed document with no actual connections to data outside of itself. While some database implementations for MARC do create relational tables for headings, the degree to which a MARC record can be separated out into tables is minimal and gains us very little in terms of the functionality of an RDBMS.

The underlying problem, however, is not in the structure of the MARC record but in the content of our catalog records. Moving from the card to a database for our data requires more than adding mark-up coding around the catalog data; to do so successfully requires re-thinking the data in terms of relational database principles. Two basic principles drive relational database design: factoring out repetition, and combining elements at retrieval time.

To design for relational databases you look at your data to see what elements will be repeated in many different records. Rather than carrying those data elements in multiple records, you create a separate database table for each repeating element, and you store that element once. For example, if you are creating a database of mailing addresses, you see quickly that elements like state and zip code will appear in multiple records. You therefore create a table of state names and one of zip codes, and perhaps even one that links zip codes to city names. In this way, your database carries the string "Mississippi" only once, and that string is replaced in the records with a database pointer that uses much less internal storage. Ditto the zip code. And if the zip code is associated in a table with a city name, you also only store city names once, and each address record needs only a pointer to the zip code, not a city name. In fact, with a zip code you can get the city and state, and your design might look like:
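A minimal sketch of that design in SQLite, with table and column names invented here purely for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Each state and city name is stored exactly once; every other table
# refers to it by a key instead of repeating strings like "Mississippi".
cur.executescript("""
CREATE TABLE state (state_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE city  (city_id  INTEGER PRIMARY KEY, name TEXT,
                    state_id INTEGER REFERENCES state(state_id));
CREATE TABLE zip   (zip_code TEXT PRIMARY KEY,
                    city_id  INTEGER REFERENCES city(city_id));
-- An address record carries only the street and a pointer to the zip code;
-- city and state are reached through the zip and city tables.
CREATE TABLE address (address_id INTEGER PRIMARY KEY,
                      street     TEXT,
                      zip_code   TEXT REFERENCES zip(zip_code));
""")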



In this way you have saved a huge amount of storage space. You have also made selection of your records on zip code, city and state much more efficient than if they were stored in every address record, because a search on a zip code, for example, retrieves a single entry in the zip code table, and that entry has database-managed links to the relevant records.

In a database of customer orders that has your inventory information along with customer addresses, you use the tables in your database to search for things like "all customers in Mississippi who have ordered WidgetX in the last six months." Information about your inventory and information about purchases are all in appropriate sets of tables in your database and you can combine the data elements to develop different views of the data.
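In SQL terms that question is a join across the tables. A rough sketch, continuing the schema above and assuming hypothetical customer, product, and purchase tables whose names are invented here:

# Hypothetical tables, for illustration only:
#   customer(customer_id, name, zip_code)
#   product(product_id, name)
#   purchase(purchase_id, customer_id, product_id, order_date)
query = """
SELECT DISTINCT customer.name
FROM customer
JOIN zip      ON zip.zip_code         = customer.zip_code
JOIN city     ON city.city_id         = zip.city_id
JOIN state    ON state.state_id       = city.state_id
JOIN purchase ON purchase.customer_id = customer.customer_id
JOIN product  ON product.product_id   = purchase.product_id
WHERE state.name = 'Mississippi'
  AND product.name = 'WidgetX'
  AND purchase.order_date >= date('now', '-6 months');
"""
# cur.execute(query)  # runnable once those tables have been created and populated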

Where the goal in relational database design is to identify and isolate data elements that are the same, the goal in library cataloging is exactly the opposite: headings are developed primarily to differentiate at the point of data creation rather than to allow combination within the database management system. The goal is to have each data point be as unique as possible and assigned to as few records as possible. Thus, library cataloging creates headings whose purpose is to distinguish between entries:

Shakespeare, William, 1564-1616. As you like it
Shakespeare, William, 1564-1616. As you like it. 1905
Shakespeare, William, 1564-1616. As you like it. 1911.
Shakespeare, William, 1564-1616. As you like it. 1919.
Shakespeare, William, 1564-1616. As you like it. Czech
Shakespeare, William, 1564-1616. As you like it. French

These headings run counter to the functioning of a database management system. If moved to a database table to facilitate retrieval, each one will point to only one record, or a very small number of records. This negates the space-saving aspect of database management, and it does not facilitate the combination of data elements for retrieval. In the case of headings, the combination of elements is pre-coordinated in the data rather than post-coordinated by the database retrieval function.

A database approach might break this data into four tables:
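One guess at those tables, sketched in SQLite with invented names, plus a linking record table (my addition here) so that the elements can be recombined:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE author   (author_id   INTEGER PRIMARY KEY, name  TEXT UNIQUE);
CREATE TABLE title    (title_id    INTEGER PRIMARY KEY, title TEXT UNIQUE);
CREATE TABLE pub_date (date_id     INTEGER PRIMARY KEY, year  TEXT UNIQUE);
CREATE TABLE language (language_id INTEGER PRIMARY KEY, name  TEXT UNIQUE);
-- Each bibliographic record points once to each element instead of
-- carrying a pre-coordinated heading string.
CREATE TABLE record (record_id   INTEGER PRIMARY KEY,
                     author_id   INTEGER REFERENCES author(author_id),
                     title_id    INTEGER REFERENCES title(title_id),
                     date_id     INTEGER REFERENCES pub_date(date_id),
                     language_id INTEGER REFERENCES language(language_id));
""")

# Any combination of elements becomes a post-coordinated query, e.g. title + language:
cur.execute("""
SELECT record.record_id
FROM record
JOIN title    ON title.title_id       = record.title_id
JOIN language ON language.language_id = record.language_id
WHERE title.title = 'As you like it' AND language.name = 'French';
""")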




In this way one could search for this data by title, by title + author, by date + language, or by any other combination of these four data elements. To search the library headings as anything but a single keyword string, that is, to use these headings to perform searches on title or date or language, would be incredibly inefficient. The upshot is that library headings are not "relational" and do not contribute to the functionality that database management systems can provide. Instead, database management systems make use of the separately coded elements, such as date and language, for combinatorial retrieval. Names and titles, because they are text strings without an identified presence in the stored records, must be searched separately rather than being available for relational combination. The results of this type of searching are less than optimal in both speed and accuracy.

All of this may seem obvious to some of you, so you may be asking yourselves why I bring it up. I bring it up because even though RDA claims to have as its goal the creation of records in a relational design (see scenario one in this JSC document), it continues to instruct catalogers to create pre-coordinated headings like the ones above. Not only will these be neither efficient nor fruitful in a relational database, but their presence calls into question whether RDA is truly modeled on the principles it claims to embrace. If it is not, we have cause to worry: we cannot move forward with data that does not conform to a modern model.

Note that in this post I have been emphasizing the use of relational database design for the data. The current plans for a new bibliographic framework appear to call for a data model for RDA based on semantic web principles. Those principles represent yet another significant evolution, following on the database model, which is now considered a waning technology. Other communities, ones that have been designing their data around database management requirements for decades, are now looking at ways to transform that data to RDF. It is possible that we can skip the relational database phase of our data development and move directly into a semantic web model. However, to think that data created following RDA instructions, which is not even suitable for a relational database, could be made usable on the semantic web without major modifications is simply wrong. If we create a bibliographic framework that takes RDA as it has been described and ports it, unchanged, to RDF, we will create a data model that does not serve us, does not serve our users, and cannot reasonably interact with other linked data on the web.
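To make the contrast concrete, here is a rough sketch in rdflib of the difference between porting a heading as-is and stating the elements separately. The example.org vocabulary and property names are invented for illustration; they are not anything defined by RDA or the new framework:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/")  # invented vocabulary, purely for illustration

g = Graph()
work = URIRef("http://example.org/work/as-you-like-it-1905")

# Porting the heading as-is yields a single opaque literal that machines
# cannot combine or link to anything else:
g.add((work, RDFS.label,
       Literal("Shakespeare, William, 1564-1616. As you like it. 1905")))

# Element-level statements, by contrast, are separately linkable and combinable:
g.add((work, EX.creator,
       URIRef("http://example.org/person/shakespeare-william-1564-1616")))
g.add((work, EX.title, Literal("As you like it")))
g.add((work, EX.date, Literal("1905")))

print(g.serialize(format="turtle"))  # rdflib 6 and later returns a str here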

What we need is an analysis of our data, not a transformation of it "as is" to a new technology. If we aren't ready to admit that some traditional practices, like headings, are no longer useful or usable in today's technological environment, we cannot have any hope that our data will be relevant in the future.

(p.s. I anticipate that someone will state that headings are needed for alphabetical displays, to "collocate" the records. To that I reply: 1) you can do the same collocation using the data elements, and in fact you could devise multiple collocations by combining the elements in different ways, and 2) a linear, alphabetic display is so anachronistic given today's technology, and so seldom used when available, that it is hard to justify the use of human catalogers to create these fields. If you still believe that library records must contain hand-crafted headings, all I can say is: you can believe what you want, but some of us will be exploring other solutions.)
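On point 1, a rough sketch against the record tables above: the familiar collocated arrangement is generated on demand by a query, and a different arrangement is just a different ORDER BY.

# Using the author/title/pub_date/language tables sketched earlier: the
# heading-style display is computed at query time, not stored in the records.
cur.execute("""
SELECT author.name, title.title, pub_date.year, language.name
FROM record
JOIN author   ON author.author_id     = record.author_id
JOIN title    ON title.title_id       = record.title_id
JOIN pub_date ON pub_date.date_id     = record.date_id
JOIN language ON language.language_id = record.language_id
ORDER BY author.name, title.title, pub_date.year, language.name;
""")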

5 comments:

Jodi Schneider said...

This is a really good way of explaining the problem. I think that the type of "analysis" that's required is similar to the development of a crosswalk. So I don't think it should be off-putting, but it's certainly necessary!

Unknown said...

In my day job, I'm a DBA and a data architect. You did a great job breaking down a very complex subject. But I do have a couple of thoughts. The most obvious one is that what's needed for your reporting is OLAP, not OLTP. In other words, normalize the underlying data (your demo schema is not even in third normal form, particularly with the name table, though I understand completely why you couldn't present a fully normalized schema for this discussion) and then build a data warehouse to address the concerns you note. That way you flatten out the normalized schema and get the quick reporting you need that displays the data in the format you require.
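A rough sketch of that flattening, using the toy tables from the post and a view name invented here:

# The normalized tables remain the system of record; a flattened view (or, at
# scale, a separate warehouse table) serves the read-heavy reporting side.
cur.executescript("""
CREATE VIEW record_flat AS
SELECT record.record_id,
       author.name   AS author,
       title.title   AS title,
       pub_date.year AS year,
       language.name AS language
FROM record
JOIN author   ON author.author_id     = record.author_id
JOIN title    ON title.title_id       = record.title_id
JOIN pub_date ON pub_date.date_id     = record.date_id
JOIN language ON language.language_id = record.language_id;
""")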

The new ISTC field, in which you can link a given "text" to multiple ISBNs, seems like a partial solution to the issue you note. But it doesn't solve it completely, so I expect the ISTC field to experience some pain as the shortcomings of that implementation arise pretty quickly in the real world.

I confess that when I saw the definition of ISTC, I felt there were serious underlying issues and even wondered if anyone at Bowker had considered hiring a data architect to iron out the problems.

These are NOT uncommon problems at all and businesses solved them ages ago.

Yet again, I am shocked that my DBA life and my writing life are intersecting.

Billio said...

I wish the analysis of bibliographic data would start from a selection of accepted "use cases".

I agree linked data is a good solution, if we can find a satisfactory way to incorporate the legacy of millions of MARC records into the database.

However, complex headings do solve an important use case: a meaningful arrangement of works for a prolific and well-published author or an anonymous work such as the Bible.

Fortunately, such an arrangement can be replicated and easily processed in a linked data model, if we make the model flexible enough.

This doesn't do away with the need to present this linked data in a structured, linear way that mimics the arrangement provided by complex heading structures.

Karen Coyle said...

Billio - I totally agree that we need use cases. Unfortunately, I suspect that the cataloging community still has the card as the assumption in its collective mind, not the database and not the linked data web. In a sense, FRBR provides the "use cases" with its four user tasks (find, identify, select, obtain) but I personally find these to be totally inadequate; I want to add "stumble upon" or "follow link to" or "discover." But remember that FRBR was describing a new structure for our current (at that time, 1998) cataloging data.

I really, really like the idea of use cases. I think I'll start a page on the futurelib wiki and put out a call. Look for that soon.

Mike Cosgrave said...

You are mainly looking at the problem of finding the book in a library, and missing a very important use case: Prof Boring wants the class to look at page 23 of a particular edition. There, the date of publication would make more sense as the marker of the edition, off of which editor, publisher, and indeed language hang.

There is an underlying disconnect between merely solving a library problem of managing books and the broader problem of what the users do with the book. It is very similar to genealogy, where there is, or was, a common, dominant file format created for a particular use case that was not very satisfactory for other uses.