The first question to be answered is: What are our data elements? In theory, this should be one of the simpler questions, but it's not. I can create a list of all of the MARC fields, subfields, and fixed field elements (which I have, and they are linked from this page of the futurelib wiki), but that doesn't answer the question. Here's why:
Indicators
The indicators in the MARC fields are like a wild card in poker -- they can be used to utterly transform the play. Some of the indicators are simple and probably can be dismissed: the non-filing indicators and the indicators that control printing. Some are data elements in themselves: "Existence in NAL collection" is essentially a binary data element. Many further refine the meaning of the field, allowing the field to carry any one of a number of related subelements:
Second - Type of ringOthers name the source of the term, such as LCSH or MeSH. It'll take a fair amount of work to figure out what all of these qualifiers mean in terms of actual data elements.# - Not applicable0 - Outer ring1 - Exclusion ring
Redundancy
There is non-textual (although not non-string) data in the MARC record, primarily in the fixed fields (00X) but also in some of the number and code fields (0XX). Some of these, actually most of these, are redundant with display information in the body of the record. Should these continue to be separate data elements, or can we remove this redundancy and still have useful user displays? Basically, having the same information entered in two different ways in your data is just begging for trouble and we've all seen fixed field dates and display (260 $c) dates that contradict each other.
Inconsistency
Primarily due to the constraints of the MARC format, the same information has been coded differently in different fields. A personal author entry in the 100 field uses subfields abcdejqu; in the 760 linking entry field, all of that data is entered into subfield a. It's the same data element, and by that I mean that the some contents are contained in the concatenation of abcdejqu as in a. Bringing together all of these krufty bits into a more rational data definition is something I really long for.
And of course my favorite... data buried in text
So much of our data isn't data, it's text, or it's data buried in text. My favorite example is the ISBN. Everyone knows how important the ISBN is in all kinds of bibliographic linking operations. But there isn't a place in our record for the ISBN as a data element. Instead, there is a subfield that takes the ISBN as well as other information.
020 __ |a 0812976479 (pbk.)This means that every system that processes MARC records has to have code that separates out the actual ISBN from whatever else might be in the subfield. Other buried information includes things like pagination and size or other extents:
300 __ |a 1 sound disc : |b analog, 33 1/3 rpm, stereo. ; |c 12 in.
300 __ |a 376 p. ; |c 21 cm.
Once this analysis is done (and I do need help, yes, thank you!), it may be possible to compare MARC to the RDA elements and see where we do and don't have a match. I have a drafty web page where I am putting the lists I'm creating of RDA elements, but I will try to get it all written up on the futurelib wiki so it's all in one place. I encourage others to grab this data and play with it, or to start doing whatever you think you can do with the registered RDA vocabularies. And please post your results somewhere and let me know so that I can gather it all, probably on the wiki.