Tuesday, July 23, 2013

Linked Data First Steps & Catch-21

Often when I am with groups of librarians talking about linked data, this question comes up:
"What can we do TODAY to get ready for linked data?"
It's not really a hard question, because, at least in my mind, there is an obvious starting point: identifiers. We can begin today to connect the textual data in our bibliographic records with identifiers for the same thing or concept.

What identifiers exist? Thanks to the Library of Congress we have identifiers for all of our authority-controlled elements: names and subjects. (And if you are outside of the US, look to your national library for their work in this area, or connect to the Virtual International Authority File where you can.) LoC also provides identifiers for a number of the controlled lists used in MARC21 data.

The linked data standards require that identifiers be in the form of an HTTP-based URI. What this means is that your identifier looks like a URL. The identifier for me in the LC name authority file is:
http://id.loc.gov/authorities/names/n89613425
Any bibliographic data with my name in a field should also contain this identifier. (OK, admittedly that's not a lot of bib data.) That brings us to "Catch-21" -- the MARC21 record. Although a control subfield was added to MARC21 for identifiers ($0), that subfield requires the identifier to be in a MARC21-specific format:
The control number or identifier is preceded by the appropriate MARC Organization code (for a related authority record) or the Standard Identifier source code (for a standard identifier scheme), enclosed in parentheses.
The example in the MARC21 documentation is:
100 1#$aBach, Johann Sebastian.$4aut$0(DE-101c)310008891
Modified to use LC name authorities that would be:
100 1#$aBach, Johann Sebastian,$d1685-1750$0(LoC)n79021425
The contents of the $0 therefore are not a linked data identifier, even in those instances where we have a proper linked data identifier for the name. Catch-21. I therefore suggest that, as an act of Catch-21 disobedience, we all declare that we will ignore the absurdity of having recently added an anti-linked data identifier subfield to our standard, and use it instead for standard HTTP URIs:
100 1#$aBach, Johann Sebastian,$d1685-1750$0http://id.loc.gov/authorities/names/n79021425
Once we've gotten over this hurdle, we can begin to fill in identifiers for authority-controlled elements. Obviously we won't be doing this by hand, one record at a time. This should be part of a normal authority update service, or it may be feasible within systems that store and link national authority data to bibliographic records.

We should also insist that cataloging services that use the national authority files begin to include these subfields in bibliographic data as it is created/downloaded.

Note that because the linked data standard identifiers are HTTP URIs, aka URLs, by including these identifiers in your bibliographic data you have already created a link -- a link to the web page for that person or subject, and a link to a machine-readable form of the authority data in a variety of formats (MARCXML, JSON, RDF, and more). In the LC identifier service, the name authority data includes a link to the VIAF identifier for the person; the VIAF identifier for some persons is included in the Wikipedia page about the person; the Wikipedia identifier links you to DBpedia and the DBpedia identifier is used by Freebase ...
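As a rough sketch of what "a link to a machine-readable form" can mean in practice, the Python below builds (but does not send) an HTTP request that asks for a particular format through the Accept header. The function name, the format table, and the media types are assumptions made for illustration; check the identifier service's own documentation for the formats it actually serves.

```python
from urllib.request import Request

# Assumed media types, for illustration only.
ACCEPT = {
    "rdf": "application/rdf+xml",
    "json": "application/json",
}


def authority_request(uri: str, fmt: str = "rdf") -> Request:
    """Build a GET request asking for one machine-readable form of a record.

    Nothing is sent over the network here; pass the result to
    urllib.request.urlopen() to actually retrieve the data.
    """
    return Request(uri, headers={"Accept": ACCEPT[fmt]})


req = authority_request("http://id.loc.gov/authorities/names/n79021425", "json")
print(req.full_url, req.get_header("Accept"))
```

The point is that the identifier itself is all a program needs: no library-specific lookup step comes between the record and the data.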

That's how it all gets started, with one identifier that becomes a kind of identifier snowball rolling down hill, collecting more and more links as it goes along.

Pretty easy, eh?

13 comments:

John Mark Ockerbloom said...

So, if an organization with a MARC organization code is willing to sign onto this, couldn't this even be done with 100% conformance to the current MARC Spec?

For instance, the Internet Archive (including the Open Library) has MARC org code casfia. So if they aren't otherwise claiming their authority space, they could make it cover all URIs. In the example you give, then, subfield 0 would be (casfia)http://id.loc.gov/authorities/names/n79021425

Would this work for your purposes?

Ben Companjen said...

When you think about it, a link is only a link when the reader knows what to do when it encounters it. Putting a URI in an HTML document doesn't make it a (hyper)link, unless you put it in an anchor tag's href-attribute, so that the browser knows it should render it as a link and provide the "clickability".
A MARC record reader that recognises the content of the $0 field as an identifier for a name record can provide the same "clickability".

That said, URIs are supposed to be more meaningful in the wider world.

Karen Coyle said...

John, That's clever, but frustrating because you still are creating a string that is not a valid URI. Should we really have to go to such lengths?

Ben, a lot of software recognizes the "http://" string as a link. It would be great for us to be doing something that isn't limited to software that knows about MARC records.

My main interest, though, is getting the identifiers into the records so that they can be used by systems. Having something that directly matches the canonical URI for something is the best way to do that. The http URI form has become so ubiquitous a standard that it makes no sense to not be able to use it.

Thanks to both of you for trying!

Reinhold Heuvelmann said...

Karen, adding _real_ URIs to a MARC field is already possible, without any disobedience. Key is that $0 has been defined more broadly in 2010, alongside MARBI discussions of http://www.loc.gov/marc/marbi/2010/2010-dp03.html and http://www.loc.gov/marc/marbi/2010/2010-06.html . From then on, $0 is an "Authority record control number _or_ _standard_ _number_" (emphasis added), as described in "Appendix A: Control subfields" at http://www.loc.gov/marc/bibliographic/ecbdcntf.html . In parentheses, we may use one of the values out of the list of "Standard Identifier Source Codes" at http://www.loc.gov/standards/sourcelist/standard-identifier.html , containing "uri".

So, the example you gave should read:

100 1# $aBach, Johann Sebastian,$d1685-1750 $0(uri)http://id.loc.gov/authorities/names/n79021425

-- which is perfectly MARC compliant.

---

In addition, we can repeat $0, so we are able to add more than one $0 to the same field, each one containing a different URI pointing to pieces of information at different places, but describing exactly the same entity:

100 1# $aBach, Johann Sebastian,$d1685-1750
$0(uri)http://id.loc.gov/authorities/names/n79021425
$0(uri)http://d-nb.info/gnd/11850553X
$0(uri)http://viaf.org/viaf/12304462/

We _can_ do so, if we want. Whether we actually do so, or should do so, these may be different questions.

Karen Coyle said...

Reinhold,

(uri)http://viaf.org/viaf/12304462/ is not a valid URI. That's my point. You have to extract the URI from that string. It makes no sense to create a string with an embedded URI, rather than create a string that is a valid URI. The point of URIs is that they are self-defining, so you don't need (uri). Anything that begins "http:" is, by definition, a URI.

And it is interesting that, if this discussion took place in 2010, the documentation does not reflect it. That's three years... a long, long time in this fast-moving world.

John Mark Ockerbloom said...

Maybe I'm missing something, but I don't see extracting URIs from a string, either manually or automatically, to be a big deal. For instance, it's easy for us (or for a program) to pick out the URIs in Reinhold Heuvelmann's comment (both the ones in prose paragraphs and the ones in quoted MARC snippets), even though the comment itself isn't a URI.

On many blogging platforms, the URIs in his comment would in fact be automatically hyperlinked. (I don't know why Blogger doesn't do this, but lots of other systems do.)

Similarly, picking out and hyperlinking the full URIs in a MARC record is pretty easy too, if someone's inclined to write a program to do it. (As you point out, the "http://" is a dead giveaway.) The trickiest part of picking out URIs in arbitrary strings in practice can be figuring out where they end, but MARC's subfield structure simplifies that problem.
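A minimal sketch of that kind of program, in Python, might look like the following. It assumes the field is given in the display notation used in this thread, with "$" marking each subfield, and it would need adjusting for real MARC data, where the delimiter is a control character and a literal "$" can appear in the text. The function names are invented for the example.

```python
import re


def subfields(field: str) -> list:
    """Split '$aBach, ...$0(uri)http://...' into (code, value) pairs."""
    return [
        (m.group(1), m.group(2).strip())
        for m in re.finditer(r"\$([a-z0-9])([^$]*)", field)
    ]


def uris_in_dollar_zero(field: str) -> list:
    """Return the HTTP URIs found in the $0 subfields of one field."""
    found = []
    for code, value in subfields(field):
        if code != "0":
            continue
        # Drop an optional parenthetical prefix such as "(uri)".
        candidate = re.sub(r"^\([^)]*\)", "", value)
        if candidate.startswith(("http://", "https://")):
            found.append(candidate)
    return found


field = (
    "$aBach, Johann Sebastian,$d1685-1750"
    "$0(uri)http://id.loc.gov/authorities/names/n79021425"
    "$0(uri)http://d-nb.info/gnd/11850553X"
    "$0(DE-101c)310008891"
)
for uri in uris_in_dollar_zero(field):
    print(uri)
```

The subfield boundary does the work of finding where each URI ends; the prefix-stripping step is the part that would be unnecessary if $0 held a bare URI.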

Karen Coyle said...

John, it's not that it's a big deal to write code to extract the URI, it's that there is no reason to have to do that. It's "no big deal" to extract the ISBN from a subfield when it is followed by "(paperback)" but we can hardly say that the subfield represents the ISBN as an identifier -- it's a text field that has to be processed to extract the ISBN from it. That's absurd.

Every library-only practice shows a lack of understanding, or respect for, the Web standards. Every time you have to do something library-specific to work with our data, we get further from the mainstream.

We could offer our data via APIs in XML or JSON for linking and other uses, but we can't because we do not have a data element for ISBN or for URI. We have text fields that have those embedded in them.
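To make the ISBN case concrete, here is a small Python sketch of the processing such a text field requires before the ISBN can be used as an identifier. The function name and the sample values are made up for illustration, and the check only looks at the shape of the number (10 or 13 characters), not its check digit.

```python
import re
from typing import Optional, Tuple


def split_isbn(subfield_a: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (isbn, qualifier) from a value such as '0123456789 (paperback)'."""
    m = re.match(r"\s*([0-9Xx-]+)\s*(?:\((.*)\))?\s*$", subfield_a)
    if not m:
        return None, None
    isbn = m.group(1).replace("-", "").upper()
    # Shape check only: ISBN-10 or ISBN-13, check digit not verified.
    if not re.fullmatch(r"\d{9}[\dX]|\d{13}", isbn):
        return None, None
    return isbn, m.group(2)


print(split_isbn("0123456789 (paperback)"))
```

Every consumer of the data has to repeat a step like this, and each one may handle the edge cases differently.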

Mrs. White said...

This was timely and interesting! I am halfway through my introductory cataloging course at Drexel University, and it is fascinating to be a part of cataloging's transition into the Digital Age.

Reinhold Heuvelmann said...

Karen,

sorry for insisting, but my impression is that you are too harsh with MARC.

Regarding the documentation, Appendix A for me is clear enough about the ways to use $0. In addition, the page has a history at the bottom, which says that $0 has been defined in 2007, and redefined in 2010. You can go to the papers that I mentioned, or even read the minutes of MARBI meetings, e.g. of the 2010 Annual Meeting at http://www.loc.gov/marc/marbi/minutes/an-10.html .

Regarding the structure of $0: It is true, the content of $0 is a string containing a URI _and_ _something_ _more_ (this seems to be your point). The way I look at it is: MARC has ways to handle a lot more identifiers in $0, an authority record control number in 2007, and then from 2010 on all sorts of standard identifiers. So there has to be a designation of which kind of identifier is contained in $0. I see the parentheses as a sub-label, or sub-subfield, a way to express the kind of identifier. So to me $0(uri) ... is fine.

For me, a comparison to an ISBN in 020 is somewhat unfair. And, by the way, as not everyone reads MARBI papers, a subfield $q has recently been defined in 020 and some other fields, so there are more efforts to care for data.

Best -- Reinhold

Dan Scott said...

Given that we're talking about the MARC format, I'm not sure that it's worthwhile worrying too much about software that doesn't know about MARC records. We do, however, need software that understands the MARC format and can convert it into something meaningful for other systems. One of the primary cases is HTML display on the Web.

For example, it wouldn't be too difficult for me to teach Evergreen to look up the parenthesized org code and map the identifier to a URL that could then be used to create a link. We could start with a table in the database that maps "(LoC)n" to "http://id.loc.gov/authorities/names/n + number"; then the same with "(LoC)sh" to "http://id.loc.gov/authorities/subjects/sh + number".

Or if we're lucky enough to get a "(uri)" for an org-code-or-standard-identifier, then just throw in the link as appropriate.

Then tada - more linked data will be out there in the wild on the web.
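A sketch of that lookup in Python, independent of any particular library system, could look like this. The table contents follow the two "(LoC)" examples given above; the names URI_PATTERNS and dollar_zero_to_uri are invented for the example, and a real table would live in the database and cover more sources.

```python
import re
from typing import Optional

# Hypothetical mapping table: (source code, control number prefix, URI base).
URI_PATTERNS = [
    ("LoC", "n", "http://id.loc.gov/authorities/names/"),
    ("LoC", "sh", "http://id.loc.gov/authorities/subjects/"),
]


def dollar_zero_to_uri(value: str) -> Optional[str]:
    """Turn a $0 value like '(LoC)n79021425' into a URI, if a pattern is known."""
    m = re.match(r"\(([^)]+)\)(.+)$", value.strip())
    if not m:
        return None
    source, ident = m.group(1), m.group(2).strip()
    if source == "uri":
        # The lucky case: the identifier already is the link.
        return ident
    for code, prefix, base in URI_PATTERNS:
        if source == code and ident.startswith(prefix):
            return base + ident
    return None


print(dollar_zero_to_uri("(LoC)n79021425"))
print(dollar_zero_to_uri("(uri)http://viaf.org/viaf/12304462/"))
```

Values whose source code is not in the table simply return None, so a display layer can fall back to showing the heading as plain text.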

Reinhold Heuvelmann said...

It finally happened :-)

The PCC Task Group on URIs in MARC in consultation with the British Library submitted a paper to the MARC Advisory Committee (MAC), "Redefining Subfield $0 to Remove the Use of Parenthetical Prefix '(uri)' in the MARC 21 Authority, Bibliographic, and Holdings Formats", see

http://www.loc.gov/marc/mac/2016/2016-dp18.html

As the minutes of the MAC meetings
http://www.loc.gov/marc/mac/minutes/an-16.html
tell: "It was moved and approved to consider this Discussion Paper as a Proposal; the Proposal was approved [...]; 1 abstention."

The PCC Task Group's web page is at
http://www.loc.gov/aba/pcc/bibframe/TaskGroups/URI-TaskGroup.html

There's more to come.

Reinhold

denials said...

Great to see some action on this front! To follow up, on the Evergreen side we did indeed implement some of what I threw out in my previous comment, publishing linked open data where we can infer specific URI patterns.

However, I hold out hope that the MAC considers the proposal I put out a few years back to enable repeated $0 to identify individual subfields in, for example, a 264 to provide separate identifiers for "Cambridge" and "Elsevier".

Karen Coyle said...

One other suggestion for what @denials proposes is to have the $0 follow the subfield it modifies. However, one of the complications of MARC is that subfielding is not the same as a logical data element. Some logical elements are split across subfields, some have irrelevant subfields in the middle of them (245 $h), etc. What is coded in MARC doesn't easily translate to "things you wish to identify as such." We really need to get beyond MARC and code our data semantically, not typographically.