Thursday, October 18, 2012

Is Linked Data the Answer?

I recently gave keynote talks at Dublin Core 2012 and Emtacl12 with the title "Think 'Different'." Since the slides of my talks don't generally have much text on them, I wrote up the talk as a document. The document has a kind of appendix covering the point in my presentation where I took advantage of my position on stage to ask and answer what I think is a common question: Is linked data the answer?

Many would expect me to answer "yes" to this question, but my answer is a bit more complex. Linked data is a technology that I believe we will make use of to connect library data to other information resources. That's what the "linked" in linked data is all about -- creating a web of information by connecting bits of data in different documents and datasets. However, we have to be very cautious about having "an answer." When you have an answer you tend to stop looking at the questions that arise, and you also tend to ignore questions that aren't going to be solved by the answer you have chosen. There is no technology that will do everything that we need, so while linked data can be useful for some things we may need to do, it cannot be the answer to all of our technical requirements.

Note that I describe linked data as "connecting bits of data." The origin of the semantic web is in the need and desire to make actionable data that today is essentially hidden within the text of documents. For example, if I say:

"My name is Karen. I will be holding a webinar on June 4 at 3:00 Pacific time for anyone who wants to learn about my paperweight collection."

That's text. There is interesting information in there, but it isn't available for any computational uses. The Semantic Web, as implemented through linked data, would make that information actionable. There are various ways to do this, and one is through the use of microformats, which mark up data within a document. This could look something like:

<p>My name is <span class="name">Karen</span>. I will be holding a <span class="event">webinar</span> on <span class="datetime" title="2012-06-04T03:00-07:00">June 4 at 3:00 Pacific time</span> for anyone who wants to learn about my <span topic="paperweights">paperweight</span> collection.</p>

This text now also has bits of data that can be used for various purposes, including linking. The linking capabilities in this particular example are low, but some additional information, like standard identifiers for the person and for the topic, would then increase the linkability of this data.

<p>My name is <span class="name" id="http://viaf.org/viaf/48369992/">Karen</span>. I will be holding a <span class="event">webinar</span> on <span class="datetime" title="2012-06-04T03:00-07:00">June 4 at 3:00 Pacific time</span> for anyone who wants to learn about my <span topic="paperweights" id="http://id.loc.gov/authorities/subjects/sh85097666.html">paperweight</span> collection.</p>

This isn't a perfect example, but I wouldn't claim that we're heading toward perfect data. What we need is to get more out of the information we have. 
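
To make "actionable" a little more concrete, here is a minimal Python sketch of how a program might read the marked-up sentence and pull out the name, the date, and the identifiers. It is only an illustration, not how any particular search engine or library system works. The class name SpanExtractor and the function extract_spans are made up for this example, and the timestamp is written as a well-formed ISO 8601 value on the assumption that "Pacific time" in June means UTC-07:00.

```python
from datetime import datetime
from html.parser import HTMLParser

# The marked-up sentence from the example, with identifiers.
# Assumption: the timestamp is given as valid ISO 8601 with a UTC-07:00 offset.
MARKUP = (
    '<p>My name is <span class="name" id="http://viaf.org/viaf/48369992/">Karen</span>. '
    'I will be holding a <span class="event">webinar</span> on '
    '<span class="datetime" title="2012-06-04T03:00-07:00">June 4 at 3:00 Pacific time</span> '
    'for anyone who wants to learn about my '
    '<span topic="paperweights" id="http://id.loc.gov/authorities/subjects/sh85097666.html">'
    'paperweight</span> collection.</p>'
)


class SpanExtractor(HTMLParser):
    """Collect the label, text, and attributes of every <span> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = {}
        self._current = None  # the span currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            attrs = dict(attrs)
            # The example labels spans with either class="..." or a topic="..." attribute.
            label = attrs.get("class") or ("topic" if "topic" in attrs else "unlabelled")
            self._current = {"label": label, "attrs": attrs, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.items[self._current["label"]] = self._current
            self._current = None


def extract_spans(markup):
    """Return a dict mapping each span's label to its text and attributes."""
    parser = SpanExtractor()
    parser.feed(markup)
    parser.close()
    return parser.items


data = extract_spans(MARKUP)
# The title attribute is now a real date that software can compare, sort, or add to a calendar.
when = datetime.fromisoformat(data["datetime"]["attrs"]["title"])
print(data["name"]["text"], data["name"]["attrs"].get("id"))
print(data["event"]["text"], when.isoformat())
print(data["topic"]["attrs"]["topic"], data["topic"]["attrs"].get("id"))
```

The text is still readable by people, but the name now carries an identifier that can be matched against other datasets, and the date is a value rather than a phrase.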

I perceive an assumption in the library linked data movement that what the Web needs (because linked data is data on the Web) is our bibliographic data. I disagree. The Web is awash in bibliographic data - from Amazon to Google Books, from fan sites like IMDb or MusicBrainz, and from sharing sites like LibraryThing and Goodreads. Libraries may have some unique bibliographic data, but most of what we have would duplicate what is already there, many times over.

There's also the fact that much of bibliographic data isn't DATA in the linked data sense. It isn't actionable data elements for the most part. In fact, bibliographic data is more like a structured document: it mainly has text, and that text is to be displayed to humans. It is possible to extract actual data (dates of publication, numbers of pages, various identifiers), but the text itself is a large part of the point about bibliographic data.

What this means for us in libraries is that we shouldn't be thinking that linked data will replace bibliographic data. Instead, it will encode the aspects of bibliographic data that give us the most and the best links.

Then we need to ask: why are we linking? What will we get? Well, we can get connections between books and maps, between books and documents, and between search retrievals and libraries. This latter interests me especially. Google is experimenting with using microformat data, in particular the schema.org data that it is fostering along with Yahoo!, Bing, and Yandex (the Russian search engine). Schema.org microformat data allows a search engine like Google to enrich the snippets with more than just a block of text from the page. This is an example from the Google Webmaster pages on Rich Snippets:

Below is my conceptualization of what we could do with library data. The bibliographic data, as I've said before, often already exists on the Web and we may not be helping things by adding many more duplicate copies of that data. But what we have in libraries that no one else has is library holdings data. We know where Web users can find "stuff" in their local community. If that could be linked to the Web, a future rich snippet might look like:

Obviously there are steps to be taken to make this possible, but if you want to think about how library data might fit into the Web of data that information seekers make use of millions or billions of times a day, this is one option. It's a start, and it uses data we already have.

You can take a look at the schema.org data that is created for WorldCat records simply by doing a WorldCat search and scrolling down to the section called "Linked data." The number of holdings is included (and this in itself is something that might interest Google as a measure of popularity). Making the link to the holdings of an actual library, and making that possible for all libraries, not just OCLC member libraries, is something I consider a worthy experiment for linking library data.
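
As a rough sketch of what linking a book to one library's holdings could involve, the Python below builds a small schema.org-style description and prints it as JSON-LD. Everything in it is an assumption made for illustration: the function name holding_as_schema_org, the example title, author, library, URL, and call number are invented, and modeling a holding as a schema.org Offer made by a Library is just one possible choice, not a description of what WorldCat or any search engine actually uses.

```python
import json


def holding_as_schema_org(title, author, library_name, library_url, call_number, available):
    """Describe one book held by one library using schema.org terms (illustrative only)."""
    availability = "https://schema.org/InStock" if available else "https://schema.org/OutOfStock"
    return {
        "@context": "https://schema.org",
        "@type": "Book",
        "name": title,
        "author": {"@type": "Person", "name": author},
        # Assumption: the holding is modeled as an Offer made by a Library.
        "offers": {
            "@type": "Offer",
            "offeredBy": {"@type": "Library", "name": library_name, "url": library_url},
            "sku": call_number,  # assumption: the call number stands in for an item code
            "availability": availability,
        },
    }


# Hypothetical values; none of these refer to a real record or institution.
record = holding_as_schema_org(
    title="An Example Book",
    author="An Example Author",
    library_name="Example Public Library",
    library_url="https://library.example.org/",
    call_number="000.0 EXA",
    available=True,
)
print(json.dumps(record, indent=2))
```

The bibliographic part of the record is deliberately thin. The part that only a library can supply is the offer: who holds the item, where, and whether it is on the shelf.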

3 comments:

Anonymous said...

I'm interested in the commercial perspective (search engine optimization in its broadest interpretation), and the general thinking that appears prevalent is that the use of micro data/tags is at this point not yet ready for widespread use as it would generally fail the ROI test. The result is there doesn't appear to be very much attention being given to this issue in the commercial sphere.

On the other hand, the linking of data through techniques such as this is clearly center stage in the academic and research spheres, and I am seeing an increasing recognition of its growing importance when reading the comments of those whose typical vision is, like yours, a little past the current horizon.

So that leaves me to ponder:

- galling though it may be to the typical corporate masters of the universe, is the commercial world falling behind?

- if this is the case, history tends to indicate that a period of rapid 'catch-up' whether it be in technology, education, paradigms etc creates the sort of turbulence and dislocation that typically results in some leaders missing the mark and falling by the wayside, and new entrants seizing the opportunity to obtain a footprint in a new area very quickly

Just look at Kodak breathing its last breath coinciding with the relative neophyte Instagram being sold for $1 billion for graphic illustration of these two points.

The trouble is there are so many subjects that potentially could change the playing field so dramatically it's a challenge to remain current while also having a business to manage!

Karen Coyle said...

I think it all depends on what you consider to be "linked data." The commercial world is rapidly embracing microformats because they promise to enhance online sales. The other end of linked data, linking for research and academic purposes, is mainly taking place in universities and research institutions. An interesting question is: will these two uses link together at some point, creating a single linking environment? I won't even pretend to make a guess about that.

I suspect that the I in ROI for microdata is negligible for online vendors.

Henry said...

Re: Library holdings data, Eric Miller's presentation at ALA Midwinter 2013 takes a similar view of the use of library holdings data in BIBFRAME.

Re: Lots of bibliographic data being available outside libraries, why, then, are libraries attempting to publish their catalogs as linked data?

Henry