As I mentioned previously, I am going to try to cover each of Martha Yee's questions from her ITAL article of June, 2009. The title of the article is: "Can Bibliographic data be Put Directly onto the Semantic Web?" Here are the first three. As always, these are my answers, which may be incorrect or incomplete, so I welcome discussion both of Yee's text as well as mine. (Martha's article is available here.)
Is there an assumption on the part of the Semantic Web developers that a given data element, such as publisher name, shuld be expressed as either a literal or using a URI ... but never both?The answer to this is "no," and is explained in greater detail in my post on RDF basics.
Yee goes on, however, to state that there is value in distinguishing the following types of data:
She then says that
- Copied as is from an artifact (transcribed)
- Supplied by a cataloger
- Categorized by a cataloger (controlled)
"For many data elements, therefore it will be important to be able to record both a literal (transcribed or composed form or both) and a URI (controlled form)."This distinction between types of data is important, and is one that we haven't made successfully in our current cataloging data. The example I usually give is that of the publisher name in the publisher statement area. Unless you know library cataloging, you might assume that is a controlled name that could be linked to, for example, a Publisher entity in a data model. That's not the case. The publisher name is a sort-of transcribed element, with a lot of cataloger freedom to not record it exactly as it appears. If we want to represent a publisher entity, we need to add it to our data set. There are various possible ways to do this. One would be to declare a publisher property that has a URI that identifies the publisher, and a literal that carries the sort-of transcribed element. But remember that there are two kinds of literals in Yee's list: transcribed and cataloger supplied. So a property that can take both a URI and a literal is still not going to allow us to make that distinction.
A better way to look at this is perhaps to focus more on the meaning of the properties that you wish to use to describe your resource. The transcribed publisher, the cataloger supplied publisher, and the identifier for the corporate body that is the publisher of the resource -- are these really the same thing? You may eventually wish to display them in the same area of your display, but that does not make them semantically the same. For the sake of clarity, if you have a need to distinguish between these different meanings of "publisher", then it would be best to treat them as three separate properties (a.k.a. "data elements").
Paying attention to the meaning of the property and the functionality that you hope to obtain with your data can go a long way toward solving some of these areas where you are dealing with what looks like a single complex data element. In library data that was meant primarily for display, making these distinctions was less important, and we have numerous instances of data elements that could either have values that aren't exactly alike or that were expected to perform more than one function. Look at the wide range of uniform titles, from a simple common title ("Hamlet") to the complex structured titles for music and biblical works. Or how the controlled main author heading functions as display, enforcement of sort order, and link to an authority record. There will be a limit to how precise data can be, but some of our traditional data elements may need a more rigorous definition to support new system functionality.
Will the Internet ever be fast enough to assemble the equivalent of our current records from a collection of hundreds or even thousands of URIs?I answered this in that same post, but would like to add what I think we might be doing with controlled lists in near-future systems. What we generally have today is a text document online that is updated by the relevant maintenance agency. The documents are human-readable, and updates generally require someone in the systems area of the library or vendor's support group to add new entries to the list. This is very crude considering the capabilities of today's technology.
I am assuming that in the future controlled lists will be available in a known and machine-actionable format (such as SKOS). With our lists online and in a coded form, the data could be downloaded automatically by library systems on a periodic basis (monthly, weekly, nightly -- it would depend on the type of list and needs of the community). The downloaded file could be processed into the library system without human intervention. The download could include the list term, display options, any definitions that are available, and a date on which the term becomes operational. Management of this kind of update is no different to what many systems do today to receive updated bibliographic records from LC or from other producers.
The use of SKOS or something functionally similar can give us advantages over what we have today. It could provide alternate display forms in different languages, links to cataloger documentation that could be incorporated into workstation software, and it could provide versioning and history so that it would be easier to process records created in different eras.
There could be similar advantages to be gained by using identifiers for what today we call "authority data." That's a bit more complex however, so I won't try to cover it in this short post. It's a great topic for a future discussion.