Wednesday, September 07, 2011

XML and Library Data Future

It is sometimes assumed that the future data carrier for library data will be XML. I think this assumption may be misleading, and I'm going to attempt to clarify how XML may fit into the library data future. Some of this explanation is necessarily over-simplified because a full exposition of the merits and de-merits of XML would be a tome, not a blog post.

What is XML?

The eXtensible Markup Language (XML) is a highly versatile markup language. A markup language is primarily a way to encode text or other expressions so that some machine-processing can be performed. That processing can manage display (e.g. presenting text in bold or italics) or it can be similar to metadata encoding of the meaning of a group of characters ("dateAndTime"). It makes the expression more machine-usable. It is not a data model in itself, but it can be used to mark up data based on a wide variety of models.*

XML is the flagship standard in a large family of markup languages, although not the first: it is an evolution of SGML, which had (perhaps necessary) complexities that rendered it very difficult for most mortals to use. SGML is also the conceptual granddaddy of HTML, a much simplified markup language that many of us take for granted.

Defining Metadata in XML

There is a difference between using XML as a markup for documents or data and using XML to define your data. XML has some inherent structural qualities that may not be compatible with what you want your data to be. There is a reason why XML "records" are generally referred to as "documents": they tend to be quite linear in nature, with a beginning, a middle, and an end, just like a good story.

XML's main structural functionality is that of nesting, or the creation of containers that hold separate bits of data together.

<paragraph>
   <sentence></sentence>
   <sentence></sentence> ...
</paragraph>

<name>
   <familyname></familyname>
   <forenames></forenames>
</name>

This is useful for document markup and also handy when marking up data. It is not unusual for XML documents to have nesting of elements many layers deep. This nesting, however, can be deceptive. Just because you have things inside other things does not mean that the relationship is anything more than a convenience for the application for which it was designed.

<customer>
    <customerNumber></customerNumber>
    <phoneNumber></phoneNumber>
</customer>

Nested elements are most frequently in a whole/part relationship, with the container representing the whole and holding the elements (parts) together as a unit (in particular a unit that can be repeated).

<address>
    <street1></street1>
    <street2></street2>
    <city></city>
    <state></state>
    <zip></zip>
</address>
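
To make the "unit that can be repeated" point concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The customer record and all of its values are invented for illustration, and read_addresses is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names follow the examples above, values are invented.
RECORD = """
<customer>
    <customerNumber>1001</customerNumber>
    <address>
        <street1>12 Main St</street1>
        <city>Springfield</city>
        <state>IL</state>
        <zip>62701</zip>
    </address>
    <address>
        <street1>PO Box 9</street1>
        <city>Chicago</city>
        <state>IL</state>
        <zip>60601</zip>
    </address>
</customer>
"""

def read_addresses(xml_text):
    """Return one dict per <address> container, keeping its parts together."""
    root = ET.fromstring(xml_text)
    return [
        {part.tag: part.text for part in address}
        for address in root.findall("address")
    ]

addresses = read_addresses(RECORD)
```

Each container comes back as its own unit, so the two zip codes never get mixed up with each other. That is the convenience of nesting; the cost is that a zip only makes sense by reference to the address that holds it.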

While usually not hierarchical in the sense of genus/species or broader/narrower, this nesting has some of the same data processing issues that we find in other hierarchical arrangements:
  • The difficulty of placing elements in a single hierarchy when many elements could be logically located in more than one place. That problem has to be weighed against the inconvenience and danger of carrying the same data more than once in a record or system and the chances that these redundant elements will not get updated together.
  • The need to traverse the whole hierarchy to get to "buried" elements. This was the pain-in-the-neck that caused most data processing shops to drop hierarchical database management systems for relational ones. XML tools make this somewhat less painful, but not painless.
  • Poor interoperability. The same data element can be in different containers in different XML documents, but the data elements may not be usable outside the context of the containing element (e.g. "street2").
Nesting, like hierarchy, is necessarily a separation of elements from each other, and XML does not provide a way to bring these together for a different view. Contrast the container structure of XML and a graph structure.



In the nested XML structure some of the same data is carried in separate containers and there isn't any inherent relationship between them. Were this data entered into a relational database it might be possible to create those relationships, somewhat like the graph view. But as a record the XML document has separate data elements for the same data because the element is not separate from the container. In other words, the XML document has two different data elements for the zip code:

  address:zip
  censusDistrict:zip
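
A rough sketch of the difference, in Python. The person wrapper, the identifiers, and all values are assumptions made up for the example, and the set of tuples is only a simplified stand-in for a graph, not real RDF.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: container names come from the example above, values are invented.
DOC = """
<person>
    <address>
        <city>Berkeley</city>
        <zip>94704</zip>
    </address>
    <censusDistrict>
        <name>District 5</name>
        <zip>94704</zip>
    </censusDistrict>
</person>
"""

def leaf_paths(element, prefix=""):
    """Yield (path, text) for every leaf element; the path names every container."""
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    children = list(element)
    if not children:
        yield path, (element.text or "").strip()
    for child in children:
        yield from leaf_paths(child, path)

# In the nested view, the same zip code lives at two unrelated paths.
xml_zips = [path for path, _ in leaf_paths(ET.fromstring(DOC)) if path.endswith("/zip")]

# In a graph-like view, "zip" is one property that any subject can use.
triples = {
    ("address1", "zip", "94704"),
    ("district5", "zip", "94704"),
    ("address1", "city", "Berkeley"),
}
shares_zip = {s for s, p, o in triples if p == "zip" and o == "94704"}
```

Here xml_zips holds two distinct paths, person/address/zip and person/censusDistrict/zip, and nothing in the document says they are the same kind of thing. In the tuple version a single query on the zip property finds both subjects.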

To use a library concept as an analogy, the nesting in XML is like pre-coordination in library subject headings. It binds elements together in a way that they cannot be readily used in any other context. Some coordination is definitely useful at the application level, but if all of your data is pre-coordinated it becomes difficult to create new uses for new contexts.

Avoid XML Pitfalls

XML does not make your data any better than it was, and it can be used to mark up data that is illogically organized and poorly defined. A misstep that I often see is data designers beginning to use XML before their data is fully described, and therefore letting the structure and limitations of XML influence what their data can express. Be very wary of any project that decides that the data format will be XML before the data itself has been fully defined.

XML and Library Data

If XML had been available in 1965, when Henriette Avram was developing the MARC format, it would have been a logical choice for that data. The task that Avram faced was to create a machine-readable version of the data on the catalog card that would allow cards to be printed that looked exactly like the cards that were created prior to MARC. It was a classic document mark-up situation. Had XML been used, our records could very well have evolved differently from what we have today, because XML would not have needed to separate fixed field data from variable field data, and expansion of some data areas might have been easier. But saying that XML would have been a good format in 1965 does not mean that it would be a good format in 2011.

For the future library data format, I can imagine that it will, at times, be conveyed over the internet in XML. If it can ONLY be conveyed in XML we will have created a problem for ourselves. Our data should be independent of any particular serialization and be designed so that it is not necessary to have any particular combination or nesting of elements in order to make use of the data. Applications that use the data can of course combine and structure the elements however they wish, but for our data to be usable in a variety of applications we need to keep the "pre-coordination" of elements to a minimum.
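
As a small illustration of what independence from serialization could look like, the sketch below keeps the data as a flat list of property/value statements and treats XML and JSON as two interchangeable ways of carrying it. Everything here is an assumption for the sake of the example: the property names, the values, the resource wrapper element, and the helper functions.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical statements about one resource; property names and values are invented.
STATEMENTS = [
    ("title", "Moby Dick"),
    ("creator", "Melville, Herman"),
    ("date", "1851"),
]

def to_xml(statements):
    """Carry the statements as child elements of a single wrapper element."""
    root = ET.Element("resource")
    for prop, value in statements:
        ET.SubElement(root, prop).text = value
    return ET.tostring(root, encoding="unicode")

def from_xml(xml_text):
    """Recover the statements from the XML carrier."""
    return [(child.tag, child.text) for child in ET.fromstring(xml_text)]

def to_json(statements):
    """Carry the statements as a JSON array of [property, value] pairs."""
    return json.dumps([list(pair) for pair in statements])

def from_json(json_text):
    """Recover the statements from the JSON carrier."""
    return [tuple(pair) for pair in json.loads(json_text)]
```

Both round trips return the same list, so an application can pick whichever carrier suits it without the data itself depending on that choice.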



* For example, there is an XML serialization (essentially a record format) of RDF that is frequently used to exchange linked data, although other serializations are also often available. It is used primarily because there is a wide range of software tools available for making use of XML data in applications, and there are far fewer tools available for the more "native" RDF expressions such as triples or Turtle. It encapsulates RDF data in a record format and I suspect that using XML for this data will turn out to be a transitional phase as we move from record-based data structures to graph-based ones.

3 comments:

Dorothea said...

By and large I agree with this, and I certainly agree with the conclusion, but XML is just a tiny bit more flexible than you're giving it credit for.

Given the zipcode problem, I'd encode it once, give it an xml:id (or other id) attribute, and have any other element point to it.

Karen Coyle said...

Dorothea, I agree. As I said, some things had to be simplified for the blog post. There are ways to get around many of the limitations, and if you have your data well-defined those "sticky points" will become evident. The danger is in fitting your data to XML, rather than XML to your data.

Unknown said...

I think the last part of this post cannot be overstated. People tend to think conceptually about bibliographic data in terms of MARC, turning a particular serialization into a mental model. It would be a gross error to define a schema by its output markup rather than its structure.

XML might be one way that some subset of library developers would want to send/receive next-generation bibliographic data, but I'd contend that JSON is becoming more important/standard at the client, server, and even database level.