Monday, December 06, 2010

Online 2010 and SWIB

I'm just back from a lengthy trip that ended at the Semantic Web in Bibliotheken (SWIB) (#swib10) conference in Cologne, Germany, followed by Online Information 2010 in London (#online2010). These are some thoughts from those events.


I saw two examples of uses of FRBR that do not follow the structure provided in the FRBR documentation and both made good sense to me.
  • The Bibliotheque Nationale of France (BNF) is working to export its data in a linked data format. They are linking the Manifestation directly to the Work and to the Expression, rather than following the M -> E -> W order that is defined in FRBR. I need to think about this some more, but it seems to remove some of the rigidity of the linear WEMI.
  • The Deutsche Nationalbibliothek is using an identifier method that seems to resolve the (long) discussion I instigated on the FRBR list about identifying WEMI with a single identifier. They give an identifier to the single WEMI group (one work, one expression, one manifestation, and presumably one Item, but no one seems to be talking about items.) There is also an identifier for each W, E, M, I. This works well for input and output (and sharing). When a matching W or WE is found, a "merged" identifier is coined for the FRBR units. I couldn't follow the presentation, as it was in German, but from the slides it looked to me that all of these identifiers could co-exist, and therefore would represent different views simultaneously of the bibliographic data that would depend on the function in play (e.g. export of data about a book v. support of shared cataloging).
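
The identifier scheme in the second bullet can be sketched roughly as follows. This is only an illustration of the idea as described above, not the DNB's actual scheme: the class name `WemiGroup`, the function `coin_merged_ids`, the `merged:` prefix and all identifier values are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class WemiGroup:
    """One work, one expression, one manifestation, bundled under one ID."""
    group_id: str                  # hypothetical identifier for the bundle as a whole
    work_id: str                   # each entity also keeps its own identifier
    expression_id: Optional[str]   # may be absent in non-FRBR-ized data
    manifestation_id: str
    item_id: Optional[str] = None


def coin_merged_ids(groups: List[WemiGroup]) -> Dict[str, List[str]]:
    """Coin a hypothetical 'merged' ID for every work shared by several groups."""
    by_work: Dict[str, List[str]] = {}
    for group in groups:
        by_work.setdefault(group.work_id, []).append(group.group_id)
    return {
        f"merged:{work_id}": group_ids
        for work_id, group_ids in by_work.items()
        if len(group_ids) > 1
    }


groups = [
    WemiGroup("g1", "w1", "e1", "m1"),
    WemiGroup("g2", "w1", "e2", "m2"),   # same work as g1, different expression
    WemiGroup("g3", "w2", None, "m3"),   # expression unknown, still exportable
]
merged = coin_merged_ids(groups)
```

The group IDs, the per-entity IDs and the merged IDs all exist side by side here, which is the "different views simultaneously" property: export can use the group, shared cataloging can use the merged work.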
The key thing that I learned, though, was that there is a plethora of semantic web activities in libraries in Europe. Among these, the British Library has released the National Bibliography (1956-); the BNF will soon make data available, as will the German National Library. What do these libraries have in common? Among other things, their data is not bound by the OCLC record policy, so they are able to make it freely available.

Online 2010

I was the opening speaker on a panel about the Semantic Web at this conference and unfortunately that was the only bit of the conference I was able to attend other than the exhibits. Online Info is a combined publisher/library conference, with the publishing side being primary. At the conference one of the three tracks was "Exploiting Open and Linked Data." In the exhibits the term "semantic" was everywhere. I would like to attend this conference (because I can't really say that I have) to get a view of linked data from another industry's perspective.

My co-speakers were Sarah Bartlett of Talis, and Martin Malmsten from the Swedish National Library. Sarah did something that had never occurred to me, but now I just think "Doh!" it's so obvious. Her talk walked through a literary, rather than bibliographic, view of some library materials. She showed how you could use linked data to support the humanities. It was, as the British say, brilliant. It's also a great way to teach people about linked data, and she advised everyone to come up with something they have a passion for and use it as an exercise in linking. Now I want to come up with some fun linking exercises for teaching purposes.

Martin talked about the motivation for making LIBRIS, the Swedish union catalog, open as linked data since 2008. He and I agreed that we really need a good linked data app that would allow people to explore the linked data space. He quoted Corey Harper saying that the killer app for linked data will probably be created by a 13-year-old, someone for whom the idea of open linking is neither novel nor new. I am really interested to see what the "linked open data" generation comes up with!


Jonathan Rochkind said...

Regarding the FRBR "innovations".

The first one seems clearly semantically identical to 'traditional' FRBR.

Since an M always has exactly one E which always has exactly one W -- an M always has exactly one W too. You can always find it by going through the E -- but if you want to put it directly on the M instead, it means the same thing. Except now you've kind of duplicated data, which you need to keep in sync. If that M ever claims to belong to a W that's _different_ than its E's W, you have invalid contradictory data. So every time you change an E's W, you have to make sure to change the W of all the corresponding manifestations, and every time you change a manifestation's W, you need to not only change the expression's W, but also change the W's of all other M's in that expression -- or split them into different expressions instead.

I'm not sure what you gain by duplicating your data, but if it's convenient, it's not a problem -- so long as you keep everything in sync. It may be that you never WILL change an M or E's W.... but I actually don't think that's a good assumption. In a properly functioning, healthy system, I think assignments between M, E and W would in fact change _all the time_, as people add more information.
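
The sync concern above can be sketched as a consistency check. This is a minimal sketch in Python, assuming the one-E-per-M reading used in this comment; the dictionaries and the function name `inconsistent_manifestations` are illustrative only and do not come from any FRBR vocabulary.

```python
from typing import Dict, List


def inconsistent_manifestations(
    m_to_e: Dict[str, str],   # manifestation -> its expression
    e_to_w: Dict[str, str],   # expression -> its work
    m_to_w: Dict[str, str],   # redundant direct link: manifestation -> work
) -> List[str]:
    """Return manifestations whose direct W differs from the W reached via E."""
    bad = []
    for m, direct_work in m_to_w.items():
        e = m_to_e.get(m)
        # Only compare when the M actually has an E to walk through.
        if e is not None and e_to_w.get(e) != direct_work:
            bad.append(m)
    return sorted(bad)


m_to_e = {"m1": "e1", "m2": "e1"}
e_to_w = {"e1": "w1"}
m_to_w = {"m1": "w1", "m2": "w1"}
in_sync = inconsistent_manifestations(m_to_e, e_to_w, m_to_w)

# Reassign the expression to another work without touching the direct links.
e_to_w["e1"] = "w2"
out_of_sync = inconsistent_manifestations(m_to_e, e_to_w, m_to_w)
```

After the reassignment both manifestations are flagged, which is the update cost described above: one change to the E's W invalidates every direct M-to-W link under that expression.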

In the second case, this is a bit different than what I thought we were talking about before. Your description now is that they are giving an identifier to the set (or 'triple', if you like) of a particular single M, single E, and single W (or four-ple if you include a single I too). What I thought we were talking about before is giving an identifier to the set of a W, ALL of its E's, ALL of those E's M's. (A set with different size depending on the W).

In the former case, of giving an identifier to a particular triple of one W,E,M -- this identifies exactly the same thing as the M. Because an M necessarily has one E which necessarily has one W. So pointing to the M is semantically the same as pointing to the M->E->W. I'm not sure what you gain by making a new identifier for the triple, instead of just using an identifier for the M? And the downside is, if an M _changes_ its E assignment -- what exactly does that triple identifier mean now? It's like a historical snapshot of that M _if_ it belonged to the E and W it no longer does?

And in the latter case, where the identifier is for the set of a W and _all_ its E's and M's -- it's semantically the same as just pointing to a W. Because a W inherently has E's which have M's. So again, I'm not sure what you gain from making a _new_ identifier which semantically is the same as the W was in the first place. But again, you've added somewhat confusing semantics in the case that an M or E changes its parent assignment.

Karen Coyle said...

Jonathan, if you look at the FRBR diagram you see that the E->W relationship is one-to-one, but the M-E relationship is many-to-many -- a manifestation can have multiple E's and thus multiple W's, and an E can be related to multiple M's.

Having the structure I described does not mean that you are duplicating your data, any more than it does in the many-to-many relationships in a relational database. I think it's the difference between logical structure and physical storage, and whether you duplicate things or not depends on many factors, as it does today (e.g. most databases end up doing some duplication to avoid joins).

I don't agree that assignments between MEW so much "change" as that new ones are added, which is the very nature of linked data. Adding new relationships does not change the old ones, although it adds new paths that you can follow.

"What I thought we were talking about before is giving an identifier to the set of a W, ALL of its E's, ALL of those E's M's." In some circumstances you may need that, in others you may need to identify a single WEMI. The DNB diagram gives a way to create manifestation-to-manifestation links with non-frbr-ized data. I can see the need for that. So there are two options: one in which you wish to do an M-to-M relationship with something like an MLA citation in a document, and another when you want to create a non-specific relationship between bibliographic "things" -- this is a pretty vague one, and linking to the W for this one may be sufficient.

And, once again, although pointing to the M is *semantically* the same as pointing to MEW, there is a problem with how FRBR has assigned properties, IMO. Essentially the way it is defined you cannot use FRBR properties with non-FRBR data. So this is why Ross S has developed two new relationships that are not FRBR relationships but general bibliographic relationships outside of FRBR. Now with the non-FRBR properties available, we can do that.

This is from Ross's post to ol-tech:

Ok, so following up on this, I've created:

which has the subproperties:

The first property can be used with a FRBR W or any FRBR WEMI entity, the commonManifestation with a FRBR M, etc. For the sake of export, however, I still favor a WEMI unit (one of each) since that is what will most likely be understandable by the non-FRBR bibliographic world. And packing it together with a single ID will be more understandable and more efficient than keeping only the separate IDs plus the links between them that must be followed. (The XC project has already reported efficiency problems having to recombine the FRBR separate entities for display and other functions.) It also allows the creation of bibliographic data with bits missing (e.g. no E) without having to create blank nodes, which are awkward at best.

Jonathan Rochkind said...

A given Manifestation can belong to more than one Expression? Maybe I've been misunderstanding FRBR. Is there an easy example?

If you are creating semantically ambiguous or duplicated data just to get around limitations in a particular vocabulary, it seems to me the vocabulary should be fixed, rather than creating semantically duplicative data. It was called the 'semantic web' for a reason, right? But we've been having this argument for a while, I still don't understand why you can't create semantically 'normalized' data with the current vocabularies -- if it's necessary, however, to create new properties, that's a lot better than creating new semantically duplicative entities. But like we both know, we've been having this argument for a while, we're probably not going to get on the same page with it. So let everyone experiment with what works for them, perfect. I just worry about introducing encouragement of semantic ambiguity into the standardized specification for FRBR itself. If we're not careful, we'll end up with a whole lot of library data that calls itself 'frbr' but is no easier to get unambiguous meaning out of than the MARC we've already got.

Karen Coyle said...

Jonathan, look at the FRBR diagram. Really. It's there. A manifestation can "Manifest" more than one expression. For example, if you have a book of essays, each essay is a work, and each work has an expression. The manifestation (the book of essays) manifests more than one expression. This is the aggregation issue, which is viewed differently, by the way, among catalogers of different materials. Music catalogers view a CD with a piece by Bach and another by Beethoven as a manifestation with multiple expression/works. Book catalogers tend to treat a book of essays as a single "thing" (and this is reflected in the FRBR working group reports on aggregations, which you might find interesting.) (Note, RDA does not provide a data element for tables of contents, since each entry in what we call today a table of contents should be an expression/work construct. This was discussed a while back on the RDA-L list.)
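
The book-of-essays case can be sketched with a small many-to-many mapping. This is a rough illustration in Python under stated assumptions: the identifiers, the dictionary names `manifests` and `realizes`, and the helper functions are all invented for the example and are not FRBR property names.

```python
from typing import Dict, Set

# Manifestation -> the expressions it manifests (many-to-many).
manifests: Dict[str, Set[str]] = {
    "m:book-of-essays": {"e:essay-1", "e:essay-2"},
    "m:essay-1-offprint": {"e:essay-1"},
}

# Expression -> the work it realizes.
realizes: Dict[str, str] = {
    "e:essay-1": "w:essay-1",
    "e:essay-2": "w:essay-2",
}


def works_of(manifestation: str) -> Set[str]:
    """All works reachable from a manifestation through its expressions."""
    return {realizes[e] for e in manifests.get(manifestation, set())}


def manifestations_of(expression: str) -> Set[str]:
    """All manifestations that manifest the given expression."""
    return {m for m, exprs in manifests.items() if expression in exprs}


essay_book_works = works_of("m:book-of-essays")
essay_1_carriers = manifestations_of("e:essay-1")
```

Under this reading a manifestation resolves to a set of works rather than to a single one, and one expression can appear in several manifestations.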

"A given Manifestation can belong to more than one Expression? Maybe I've been misunderstanding FRBR. Is there an easy example?"

Remember that manifestations do not "belong" to expressions, they "manifest" expressions, which I read as being an arrow pointing from a manifestation to an expression, and I believe that is a generally accepted concept.

"If you are creating semantically ambiguous or duplicated data just to get around limitations in a particular vocabulary, it seems to me the vocabulary should be fixed, rather than creating semantically duplicative data."

I absolutely agree. I think FRBR needs to be fixed. That's why I've stated that now that we try to work with FRBR it seems to me that everyone is having trouble fitting their data into the model. So either we don't understand the model, or the model needs to be changed. It could be that FRBR is fine as a conceptual model but we need something else as a way to structure our data. I don't know what the answer is, but I have yet to see any two different "frbr-izations" that are alike. This should be food for thought.

Curious but shy said...

Why do you say that the Bibliotheque Nationale of France and the Deutsche Nationalbibliothek are not bound by the OCLC record use policy - don't both libraries have their records in OCLC? Same for the British Library? Thanks.

Karen Coyle said...

Curious: they have said it themselves. They upload their records into OCLC but are not "full members" because they do not do their cataloging in OCLC, and therefore are not bound by the record use policy. I also presume that the national libraries have actual contracts with OCLC that clarify mutual obligations. Most of them seem to have quite active legal departments that see to those things.

In fact, the vast majority of records in OCLC are there as the result of batch upload, not cataloging on OCLC. (As per the 2009 OCLC annual report, only 13% of the records added in 2009 were the result of cataloging on OCLC; the remainder were batch uploaded.) This would seem to indicate that a large number of records in WorldCat are not covered by the record use policy, but I don't know if they can be identified. Presumably, any records with BL or DNB or BNF or even LC as the original cataloging agency would not be covered, but I've never heard anyone make this claim. It's an interesting idea, though.

Curious but shy said...

So then libraries that only load records are not bound by the record use policy? Do you have something from OCLC to point to that I could use?

How about something from those national libraries too where they've said they aren't covered by the policy. That would be good to see.

Karen Coyle said...

Nothing from OCLC, but here's what BL says:

"The British Library is not a member of OCLC and is therefore not
restricted by any terms which may apply to its membership. The Library
supplies records to a wide variety of union catalogues (e.g. SUNCAT &
COPAC) in order to increase the global visibility of its collections.
Contributing records to OCLC's WorldCat system is simply part of this
wider strategy to improve awareness and usage of the BL's holdings."


Curious but shy said...

Oh, but this doesn't mean that any library that batches isn't covered - they may have a specific agreement in place. Have the others said the same?

So then we can't conclude that it's libraries outside the US per your article, but it could just be these national libraries that have a specific agreement.

Karen Coyle said...

Curious: I don't think we have a way to know, with one quick search, who is and isn't covered by the agreement, but the fact is that libraries that are making their data available as linked data are notably not OCLC member libraries. While I can't fill in a whole list of who is and who isn't, I do know that the intersection of "open linked data available" and "OCLC member library" is a null set.

Curious said...

How do you know that?

Karen Coyle said...

Prove me wrong.

Curious said...

Didn't mean to make you defensive. You stated it so I'm just asking you to back up your statement. Shouldn't be hard to do since it was your words. Surely you aren't making statements without facts to support them...

Peter said...

Getting a little sarcastic are we, that's nasty. Not like you.

Julia said...

Agree, that's a bold statement if it's not fact-based.

Karen Coyle said...

It's not sarcasm, it's a plea to folks to do some *research*, not just answer blog posts. Get involved, put in some effort. Find information.

The only list I know of of open linked data is this:

If there are library data sets that are not on that list, then let me know and we can add them.

Peter said...

Sorry, but must agree with Curious. He/she has only asked you to share your "research" that supports your statements. You've made several without showing the supporting research. I don't see where Curious made fact-based claims like you have so no need to research - or do your research for you. He/she is doing a good reference interview of you and your statements - IMO.

Curious said...

First, there are only 19 institutions on the list... minuscule in comparison to all libraries in the world, not even statistically relevant. Second, several are not even libraries, e.g. Talis, biblios, openlibrary. Third, the 3 you mentioned are not on the list (BL, the Bibliotheque Nationale of France and the Deutsche Nationalbibliothek). So is this an oversight or are they not really participating yet, or did I misunderstand your blog? Fourth, I believe I did prove you wrong (as you asked me to do) because Univ of MI is on the list and they are an OCLC member b/c we borrow from them - symbol eym.

Karen Coyle said...

As I said, it's the only list that I know of, and it isn't complete. It would be great to have a complete list... but we don't. The three you have mentioned have been encouraged to add their data. Also, the W3C working group on library linked data has decided to make use of this list as the place to put announcements of linked library data, since they recognize that no complete list exists and it's hard to find out what efforts are going on. So hopefully it will get fleshed out.

If you read JP Wilkin's blog post, the Michigan data set only includes UM's original cataloging, as defined by the OCLC record use policy. So it's a limited set, and not a meaningful one. For example, BL has put out the British National Bibliography, starting from 1956. Although it isn't all of BL's data (the earlier data will take more work because it isn't uniform in nature), it is a meaningful set of data.

I know you will still find fault in anything I say, but I will modify my statement to be that no OCLC library has released their holdings as linked data. If they did, I would be interested to see the reaction, since that would violate the record use policy.

What we haven't defined is the purpose of linked data, and JPW's blog post shows that he has some frustration about that. The linked data that is being produced today, as I stated, is really just experimental. I think of it as "potentially linked data" because the linking isn't happening yet. However, this is a first step, and for that I am encouraged.

Curious said...

I won't find fault in anything you say, just things that aren't true :) Especially as librarians, we have a duty to be authoritative and that is all I look for in your post or anyone's. I hope that you understand and accept my view.

Adrian Pohl said...

I refer to this discussion on my blog as my post got too long for leaving it here as a comment, see


"I think the commenters attack the wrong person when they accuse Karen of making a statement that isn't fact-based. Maybe one should pose the question why it is so hard to find out the facts about who actually is an OCLC member and thus bound by the policy."