Monday, April 18, 2011

FRBR as cake

I keep trying to explain what bothers me about FRBR, and in particular about WEMI. I've recently thought about it with this image of a cake. I know this is a flawed analogy, but it works for me on some level. It goes like this:

When you make a cake, you have a number of ingredients:

When you mix them together to make a cake you don't get this:

You get this:

My point here, in case it isn't clear, is that the purpose of creating a bibliographic description using a number of different entities is to... well, to create a bibliographic description; something that as a whole has meaning. You can create it from individual "ingredients," like information about a Work and an Expression, but those do not need to remain separate entities in your final product; instead, that information can become part of your whole.

I know that people like the idea of a distributed bibliographic description with a single Work entity that links to many Expressions that then link to many Manifestations, etc., and that could be the underlying structure of one's data store. But just because there are Work entities (eggs) doesn't mean that our metadata keeps the Work entity "intact." In fact, our systems may use only a portion of the Work entity, and may use bits of it at different times in different contexts.

Leaving poorly-drawn analogies aside, creating our data as sets (or "graphs") of triples should give us maximum flexibility. One thing this means is that even a partial description is valid. Thus a full library catalog record and an abbreviated citation are both valid representations of a resource. They should connect to the larger linked data information space through any of the statements they contain, regardless of the structure of their graphs. And it is my guess that many bibliographic descriptions will be simple graphs with a single RDF subject (that means a single bibliographic resource). The highly structured bibliographic universe of FRBR will be a minority case, and the FRBR entities, like our eggs and sugar and flour, will be useful ingredients that disappear into actual creations.
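To make the "partial description is valid" point concrete, here is a minimal sketch in plain Python, treating a graph as nothing more than a set of (subject, property, value) statements. The `ex:` identifiers and property names are hypothetical, invented for illustration; they are not taken from any real vocabulary.

```python
# A graph is just a set of (subject, property, value) statements.
# All "ex:" names below are hypothetical, for illustration only.
BOOK = "ex:mobyDick1851"

# A fuller, catalog-style description.
full_record = {
    (BOOK, "ex:title", "Moby-Dick; or, The Whale"),
    (BOOK, "ex:creator", "Herman Melville"),
    (BOOK, "ex:publisher", "Harper & Brothers"),
    (BOOK, "ex:date", "1851"),
}

# An abbreviated citation: fewer statements, same single subject.
citation = {
    (BOOK, "ex:title", "Moby-Dick; or, The Whale"),
    (BOOK, "ex:creator", "Herman Melville"),
}


def subjects(graph):
    """Return the set of subjects described in a graph."""
    return {s for s, _, _ in graph}


# Either graph can connect to the other through any statement they share.
shared = full_record & citation

# Merging is a plain set union; no container structure has to match.
merged = full_record | citation
```

Both descriptions are complete as far as they go, and merging them needs no agreement on record structure, only on the statements themselves.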

7 comments:

Jonathan Rochkind said...

Ah, but metadata is not a cake, heh. And besides, what about layer cakes?

If you draw it like this, does it bother you less? I keep arguing that this is the better conceptual way to think about FRBR WEMI -- still the same model, just a better (IMO) way to think about it.



https://docs.google.com/drawings/pub?id=1T5hc6x_EEX7x4khBzjpAHud5MgG7BASRpuhT34uCUbA&w=960&h=720


I never quite understand what you are suggesting as an alternative. Let's say you've got a whole bunch of copies of Moby Dick. I mean, how many are there in the world? Depends on if by 'copies' you mean 'manifestations' or 'items' (and the model is what gives us the common language to even talk about this). Either thousands or millions depending. Sometimes you want to make an assertion about Moby Dick _as a work_ -- do you need to repeat that same assertion for each 'copy', so that every time a new copy shows up and gets an identifier, you have to make the assertion again for THAT copy of Moby Dick? That would be ridiculous -- you want to be able to identify the work Moby Dick with one identifier, make your assertions about that, and then make assertions that all of those copies are members of the Moby Dick work set.

Then when you find a new copy, you assert it's part of Moby Dick, and all the assertions about 'Moby Dick' in general come with it.
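The idea above can be sketched roughly as follows. This is only an illustration in plain Python: the `ex:` identifiers, the `ex:instanceOfWork` link, and the `describe` helper are all hypothetical names, not part of FRBR or any published vocabulary.

```python
# Hypothetical identifiers and property names, for illustration only.
WORK = "ex:work/moby-dick"

triples = {
    # Asserted once, about the work in general.
    (WORK, "ex:author", "Herman Melville"),
    # Each copy only needs a link to the work.
    ("ex:copy/1", "ex:instanceOfWork", WORK),
}


def describe(resource, graph):
    """Statements made directly about a resource, plus the statements
    made about any work it is linked to via ex:instanceOfWork."""
    own = {(p, o) for s, p, o in graph if s == resource}
    works = {o for p, o in own if p == "ex:instanceOfWork"}
    inherited = {(p, o) for s, p, o in graph if s in works}
    return own | inherited


# A new copy shows up: one new statement, and the work-level
# assertions come along without being repeated.
triples.add(("ex:copy/2", "ex:instanceOfWork", WORK))
```

Calling `describe("ex:copy/2", triples)` returns the link to the work together with the author statement, even though the author was never asserted for that copy.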

Likewise, sometimes you want to make an assertion about a particular translation of Moby Dick (edition). Or about a particular printing/binding (manifestation), or even for a particular item (signed by the author!).

It's the FRBR model that sets up a way of doing that, and that makes assertions expressed in this common model at least potentially compatible.

What is the alternative?

I always think you are objecting to a particular limited vision of FRBR, or even stereotype of FRBR, and not to the FRBR model itself.

Karen Coyle said...

Jonathan,

I realize I didn't make this explicit, but my objection is to the assumption that each FRBR entity is a separate "thing" and therefore will be a separate identified set of data when bibliographic descriptions are created. The IFLA model encourages this when it defines some relationships as being only between, say, two Expressions.

Even by saying "manifestations or items" you imply that these are separate things. Is there such a thing as an item that can be "or'd" with manifestation?

I agree that "... you want to be able to identify the work Moby Dick with one identifier, make your assertions about that, and then make assertions that all of those copies are members of the Moby Dick work set" although the use of the term "set" makes me ponder. Rather than "members of a set" I would rather see something like the "expresses" relationship, but without having to separate the properties of the bibliographic description into Works v. Expressions. I would like the bibliographic description to have a wholeness and still have the rich relationships that FRBR provides.

So what I see is a bibliographic description that has all of the properties that FRBR offers and more (I'm sure we can think of others), and all of the relationships, but not restricted to the linear structure of WEMI. In other words, I want FRBR to be conceptual but not to dictate the structure that we use to express our bibliographic data. I also want for a description to be able to contain any combination of descriptive elements, either chosen from library cataloging or from any other metadata space... all that without the need to conform to any particular E-R design.

It's interesting that you say "I keep arguing that this is the better conceptual way to think about FRBR WEMI -- still the same model, just a better (IMO) way to think about it." But that model is conceptually a structure, and I am arguing that that's the problem: that it is a structure not a concept. And I happen to think that structure is too confining, and that it will be a detriment to sharing of bibliographic data in an open information space like the linked data cloud.

Jonathan Rochkind said...

To me it doesn't seem confining at all -- it just says that when you make an assertion, you have to decide if it's about a work in general, about a particular text (ie, a particular translation or edition: expression), a particular print run (manifestation), or a particular concrete specific physical book (item).

That's the heart of the WEMI model. Any kind of metadata vocabulary or standard or model is "confining" to some extent, that's the nature of the beast. And this amount of confining here seems the appropriate amount to me -- without even that, how can different metadata be interoperable at all?

With that level of confinement, as an example, let's say a book review website provides data that is a list of reviews, and they say they are reviews of WORK X. (Because they're playing fast and loose and just organize reviews as all being about Moby Dick, they don't worry about which edition/manifestation -- no problem, we've got the data we've got, and now we know what we've got). If we want to relate that to our data, we just need to determine which of our work identifier(s) corresponds to work X, and now the connection is built.

Of course that's an expensive proposition relating all those works from two different 'namespaces', sure. But at least we know what we have to do. (Or maybe our data is so great that THEY want to do the relating too, or some mutual combination).

Without the W E M I---we've got no idea if the data is about the work in general, a particular version/translation, or a particular publication (viking 2002 paperback). We don't even know where to start to relate the data.

So that is the heart of W E M I -- being specific about what sort of thing the data is talking about, using the W E M I framework. And yeah, that requires you to "separate the properties of bibliographic description" into W E M I, in the sense that that's just another way of saying "be explicit about whether you're talking about a W, E, M, or I, when you 'say' something."

If you really agree that is still too confining, then indeed you disagree with W E M I. Although I don't know how we can begin to make bibliographic data interoperable without at least that much confinement. If you don't even know if an assertion is being made about a work in general (Moby Dick), or a particular manifestation (Viking 2002 paperback), how can you even begin to know what the data _means_?

On the other hand, if you think that basic heart is okay, but there are problems with the particular ways particular attributes and relationships of W E M I are being formally specified, if those are too confining -- then I think that's an implementation detail, which is not to say it's not important, it's very important, but it's not fundamentally a disagreement with the W E M I model, it's instead an argument that its particular formal implementation needs some work (as of course it does one way or another, this stuff is hard and few people have done serious work with the FRBR model yet).

Karen Coyle said...

Jonathan,

I think what we have here is the difference between a model that is like XML, and one that is like linked data. In XML, it is the structure that tells you what your data elements really mean, something like:

Work {
Title
Date }

Manifestation {
Title
Date }


In Linked data, the full semantics are inherent in the properties ("data elements") and the properties tell you what entity is being described:

If you have:
Title (with domain of Work)

then the subject of your property is a work. If you have:

Date (with domain of Manifestation)

then the subject of your property is a manifestation.

Because linked data is broken down into individual statements, you cannot assume that they will always be in the same structural context, which is why structure can't be used to define the properties, like in XML.
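A minimal sketch of that domain rule, in plain Python rather than a real RDF library: under RDFS semantics, if a property has domain C, then any subject used with that property is a C. The property and class names here (`ex:titleOfWork`, `ex:Work`, and so on) are hypothetical stand-ins, not actual RDA or FRBR terms.

```python
# Hypothetical property-to-domain declarations, for illustration only.
DOMAINS = {
    "ex:titleOfWork": "ex:Work",
    "ex:dateOfManifestation": "ex:Manifestation",
}


def infer_types(graph, domains):
    """Apply the domain rule: a statement (s, p, o) whose property p has
    a declared domain C yields the inferred statement (s, rdf:type, C)."""
    return {(s, "rdf:type", domains[p]) for s, p, _ in graph if p in domains}


# Two free-standing statements, with no enclosing record structure.
statements = {
    ("ex:a", "ex:titleOfWork", "Moby Dick"),
    ("ex:b", "ex:dateOfManifestation", "2002"),
}

inferred = infer_types(statements, DOMAINS)
```

Here `inferred` says that `ex:a` is a Work and `ex:b` is a Manifestation, and that information came from the properties alone, not from where the statements sat in a record.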

I think that FRBR was designed with a particular structure in mind, which makes it hard to use it in a linked data sense. If nothing else, it will be necessary (as has been done for the RDF version of RDA) to provide a non-FRBR-bound set of properties that can be used by folks who aren't following FRBR (e.g. most people adding citations to academic articles). That would at least allow non-FRBR and FRBR properties to play well together:

Date
more specific:
Date (with domain of Work)
Date (with domain of Expression)
Date (with domain of Manifestation)
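One way to picture that hierarchy is with the sub-property rule from RDFS: if a specific property is declared a sub-property of a generic one, every statement made with the specific property also holds for the generic one. The sketch below uses plain Python and hypothetical property names; it is an illustration of the idea, not the actual RDA property set.

```python
# Hypothetical sub-property declarations: specific -> generic.
SUBPROPERTY_OF = {
    "ex:dateOfWork": "ex:date",
    "ex:dateOfExpression": "ex:date",
    "ex:dateOfManifestation": "ex:date",
}


def with_generic(graph, subproperty_of):
    """Return the graph plus, for each statement using a specific
    property, the same statement restated with its generic property."""
    generic = {
        (s, subproperty_of[p], o) for s, p, o in graph if p in subproperty_of
    }
    return graph | generic


# One FRBR-style statement and one plain citation-style statement.
frbr_data = {("ex:m1", "ex:dateOfManifestation", "2002")}
citation_data = {("ex:c1", "ex:date", "1851")}

combined = with_generic(frbr_data | citation_data, SUBPROPERTY_OF)

# Anyone who only understands the generic property can now query both.
dates = {(s, o) for s, p, o in combined if p == "ex:date"}
```

The FRBR-bound statement keeps its specific meaning, while a consumer that knows only `ex:date` still sees both dates.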

Even with this, I think it would be good to rethink FRBR in light of RDF and linked data since it was developed with a different metadata model.

Jonathan Rochkind said...

I don't know, I get all confused when you start dropping details of RDF, I admit.

To me, it is clear that the basic domain model should be implementation-agnostic. It should WORK with RDF, yes, but it should not assume RDF.

To me it's also clear that our domain model rightfully _should_ constrain someone such as to _force_ them to be clear whether their assertions are about a work in general, or a particular manifestation, or a particular item.

If all we know is they're saying "URI-subject predicate object", but we have NO WAY of knowing if the URI-subject is a Work or a Manifestation or an Item -- this is fairly useless data. Knowing which it is is what makes it possible to link an assertion made in someone else's data set to an assertion made in mine -- by linking the "subject" URIs. If I don't even know if they think they are talking about a work in general or a particular edition of it -- it's useless data.

It's not clear to me if you agree or disagree with this basic point.

But as far as I can tell, the model (or lack thereof) you are suggesting leads to a situation where someone can say they are using the same vocabulary, but be entirely vague about whether the subject of their assertion is a Work or a Manifestation or neither. You seem to be specifically saying "No, I should NOT have to be clear if my subject is a Work or a Manifestation or something else, when I'm saying some predicate of this subject has some value."

That seems disastrous to me; what's the point of having a vocabulary if it doesn't define some entities? To me that's the starting point for a vocabulary regarding the bibliographic universe: defining the entities (that is, the classes of things that can be 'subjects' of your assertions).

That matters maybe more in the bibliographic universe than in other domains, because _everything_ but the physical item in the bibliographic universe is really just a concept, not an objective real thing.

If forcing people to be explicit about what class/entity/type of thing is their subject really does make your vocabulary too hard to use with RDF, as I think you are insisting -- that seems like a rather disastrous problem in RDF to me, rather than a problem with our modelling. It is perfectly appropriate and desirable that our domain model requires people to specify what entity/class/type of subject their data is about. Work vs Expression vs Manifestation vs Item.

Jonathan Rochkind said...

So, okay, let's get more concrete.

Yes, if you are adding a citation to an academic article, you (or rather the system which is supporting you adding this citation, or the designer of that system) DOES have to decide what is being cited --

-- are you citing a particular item (like, a particular physical copy of a journal issue, and the article in that copy?). Obviously not.

Are you citing the article as a Work, like if it gets revised you are citing the whole set of both revisions, not a particular revision/edition? Probably not.

Are you specifically citing the electronic instead of the print version? I dunno, if you are you're citing a manifestation, if you're not you're citing an expression.

I think it's perfectly appropriate that using the FRBR model requires a data provider to make this decision about _what_ is being cited. It is, true, a bit more confusing when we're talking about articles rather than books. The FRBR model was, admittedly, designed to be able to formally describe the bulk of our legacy data -- largely books. That's the focus, and this was a totally reasonable choice.

But if you don't make this decision about WHAT the subject of your data is -- we are going to have to make that decision for you ANYWAY when translating your data to interoperate with FRBR data. So what's the point of letting you use one of 'our' predicates (ie properties, or elements)? We might as well just translate 'your' predicate to ours anyway.

A model _has_ to be constraining in some way to be a useful model. If you really want incredibly few constraints, you can just use DCTerms, right? But translating our rich legacy data into DCTerms, we lose so much. The point of FRBR is to formally describe our rich data without losing it. And it's totally a reasonable and in my opinion correct decision to constrain such a model so that, yes, to express data in this model, you DO need to decide if you're talking about "Moby Dick" in general, or a particular edition/translation of Moby Dick. If you can't even decide this, then, yeah, you are not being clear enough for our model. Maybe you're being clear enough for DCTerms, use that, no problem.

Karen Coyle said...

Jonathan, since we keep going round and round, and since what I am talking about does indeed involve RDF, linked data and sharing of data on the semantic web, and since you say:

"I get all confused when you start dropping details of RDF,"

I'd like to suggest some reading that I found helpful:

Semantic Web for the Working Ontologist, by Dean Allemang. 2008. Morgan Kaufmann Publishers/Elsevier.

Semantic Web Programming, by John Hebeler. 2009. Wiley. ISBN: 978-0470418017

The latter includes a lot of conceptual material before it gets into programming, but it's possible that the programming examples will make that conceptual material clearer for you.