Wednesday, May 26, 2010

FRBR and Sharability

One of the possible advantages to using FRBR as a bibliographic model is that it can provide us with sharable bits in the form of the defined entities. I've been working on creating a test set of records to illustrate some linked data concepts, and so I began thinking about how the data would break out into sharable units. It turns out to be... an interesting question.

Work

Let's start with the Work, which I believe many people have high hopes for. I have a book in hand which I will use for this illustration. Because this is a book, there are only a few possible data elements in the Work, and these are:
Title of the work: Mort
Preferred title for the work: Mort
Date of work: 1987
Place of origin of the work: England, UK
As you can see, there isn't a lot of information in the Work entity itself. In many cases, a cataloger will not know the date of the work, and may not know where the work was written, in which case you could have just title, and the entire Work entity would be:
Title of the work: Mort
What is obviously missing here is the name of the author. That, however, is not an attribute of the Work in FRBR, but is an entity of its own, either Person, Corporate Body, or Family. It seems clear that without the name of the creator (where appropriate) the Work isn't terribly useful on its own. So I am going to add that creator from FRBR Group 2:
Work:
Title of the work: Mort
Preferred title for the work: Mort
Date of work: 1987
Place of origin of the work: England, UK

Person:
Author: Terry Pratchett
OK, now we are getting somewhere. We have an author and a title. This is a "unit" that someone could grab or link to and make use of. They aren't really separable, which is what puzzles me a bit about FRBR. It's not like you could re-use this Work for another book with the same title (and there are others with this same title). It's only the Work by Terry Pratchett that this Work entity can represent. As far as I am concerned, the creator entity and the work entity are inseparable in the description of a work. A creator can be associated with many works, but Work cannot be re-used with different creators. Once the creator(s) of the Work are defined, that relationship is fixed as part of the identity of the Work.

We could leave Work as it is here, but if you want to include subject headings in your sharing, they need to be included in the shared Work, because subject headings in FRBR are only associated with the Work. Given that, our sharable Work becomes:
Work:
Title of the work: Mort
Preferred title for the work: Mort
Date of work: 1987
Place of origin of the work: England, UK

Person:
Author: Terry Pratchett

Subject:
Topic: Fantasy fiction, English
Topic: Discworld (Imaginary place) -- Fiction
This is the unit that needs to be created so we can share Works.
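One way to picture this sharable unit is as a Work record bundled with the Group 2 and Group 3 entities that link to it. What follows is only a minimal Python sketch under stated assumptions, not anything FRBR itself defines: the class names, field names, and identifiers such as W:mort and P:pratchett are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Person:
    """Group 2 entity, reduced to an identifier and a name (assumed shape)."""
    identifier: str
    name: str


@dataclass(frozen=True)
class Topic:
    """Group 3 entity, reduced to an identifier and a label (assumed shape)."""
    identifier: str
    label: str


@dataclass
class Work:
    """Work attributes only. Note that there is no author field here."""
    identifier: str
    title: str
    preferred_title: Optional[str] = None
    date: Optional[str] = None
    place_of_origin: Optional[str] = None


@dataclass
class SharableWork:
    """The proposed unit: a Work plus its creator(s) and subject(s)."""
    work: Work
    creators: List[Person] = field(default_factory=list)
    subjects: List[Topic] = field(default_factory=list)


# All identifiers below are made up for this example.
mort = SharableWork(
    work=Work("W:mort", "Mort", "Mort", "1987", "England, UK"),
    creators=[Person("P:pratchett", "Terry Pratchett")],
    subjects=[
        Topic("T:1", "Fantasy fiction, English"),
        Topic("T:2", "Discworld (Imaginary place) -- Fiction"),
    ],
)

print(mort.work.title, "/", mort.creators[0].name)  # Mort / Terry Pratchett
```

Keeping Work free of any author field mirrors the FRBR attribute list, while the wrapper carries the relationships that make the unit useful to someone else.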

Expression


Now let's move on to the Expression, the real bugbear of FRBR. For books, Expression has few data elements. In this case we have:
Date of expression: 1987
Language of expression: English
All perfectly fine and well, but clearly not something that can stand alone. Similar to Work, this expression is not usable with just any English language work written in 1987 -- it's not sharable in that sense. This Expression must be associated irrevocably with a particular Work, in this case the Work we created above. There will be some link that essentially says:
E:identifier --> expresses --> W:identifier
Second thought: Expression can also have an important creator/agent role, such as translator, editor, adaptor -- and possibly others related to music that I'm not knowledgeable about -- so it, too, should include those for sharing. In fact, probably all of the Group2 to Group1 relationships need to be included in a sharing situation. So we get:
Expression:
Date of expression: 1987
Language of expression: French

Person:
Translator: J-P Sartre
The unit of sharing here must be the expanded Expression plus the expanded Work (with Group2 and Group3 entities). This illustrates something that has bothered me a bit about the Group1 FRBR entities, which is the dependency inherent in the hierarchy WEMI. WEMI essentially must be created as a single thing with multiple parts. This is true even of the Manifestation.

Manifestation

The Manifestation is seemingly the richest and therefore the most independent of the FRBR Group1 entities, but as we'll see, without the Work and Expression you do not get a useful set of data elements. Here is what we have for our Manifestation:
Title proper: Mort
Statement of responsibility: Terry Pratchett
Title proper of series: Discworld
Date of publication: 2001
Copyright date: 1987
Place of publication: New York, NY
Publisher's name: HarperTorch
Extent of text: 243 pages
Dimensions: 17 cm
Carrier type: volume
Mode of issuance: single unit
Media type: unmediated
What is lacking here? Well, there's no link to the entity for the author, which would provide an identification of the author and any variant forms of the author's name. There's no language of text, because that's in the Expression. And there are no subject headings, because those are associated with the Work. If this were a translation, there would be no link to the Work in the original title. The Manifestation entity is very readable, but if we are sharing for the purposes of copy cataloging, it has to be bundled with the Work and Expression to be usable.

Our Sharable Units

So this is what we get as sharable units:
  1. Work + Group 2 (creator) + Group 3 (subject)
  2. Expression + Group2 (creator) + Work + Group 2 (creator) + Group 3 (subject)
  3. Manifestation + Expression + Group2 (creator) + Work + Group 2 (creator) + Group 3 (subject)
With these three, it will be possible to build on Works and Expressions as needed, creating new Expressions and Manifestations for a Work. It will also be possible to "grab" a Manifestation and along with it get a full description including subjects and creators.
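The three units above can be sketched as a small recursive walk: start from any Group 1 record, attach its Group 2 and Group 3 links, then move up toward the Work. This is a rough Python illustration only; the record layout, the key names (creators, subjects, embodies, expresses), and the identifiers are assumptions made for the example.

```python
# Hypothetical records keyed by made-up identifiers.
RECORDS = {
    "W:1": {"type": "Work", "title": "Mort",
            "creators": ["P:1"], "subjects": ["T:1"]},
    "E:1": {"type": "Expression", "language": "English",
            "creators": [], "expresses": "W:1"},
    "M:1": {"type": "Manifestation", "title_proper": "Mort",
            "publisher": "HarperTorch", "embodies": "E:1"},
    "P:1": {"type": "Person", "name": "Terry Pratchett"},
    "T:1": {"type": "Topic", "label": "Fantasy fiction, English"},
}

ATTACHED = ("creators", "subjects")  # links to Group 2 and Group 3 entities
UPWARD = ("embodies", "expresses")   # Group 1 links that lead toward the Work


def sharable_unit(identifier, records):
    """Return the ids of a record and of everything it depends on."""
    unit = [identifier]
    record = records[identifier]
    for key in ATTACHED:
        unit.extend(record.get(key, []))
    for key in UPWARD:
        if key in record:
            unit.extend(sharable_unit(record[key], records))
    return unit


for start in ("W:1", "E:1", "M:1"):
    print(start, "->", sharable_unit(start, RECORDS))
```

Asking for the Manifestation pulls in the Expression and the Work along with their creators and subjects, which is the dependency described in this post.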

Now we just need a system to test this out.

18 comments:

Anonymous said...

The dependencies you're talking about aren't really something to be worried about - they're part and parcel of how a relational database (which is what we're actually creating with FRBR) works.

Essentially, there are two more types of "hidden" attribute for these entities beyond the ones you've described. What you've talked about are the actual metadata elements, as it were - they're what users of the system will view and be interested in. But to provide the links or dependencies that you have described, the entities will also contain attributes called "primary keys" and "foreign keys" in database-speak.

Each entity should have one primary key, which is simply a unique identifier differentiating this entity from all others of the same type. So, for example, Terry Pratchett could have the primary key 123456 - or it could be a text string such as is currently used in authority records including his name in a decided format and possibly discretionary information to distinguish him from other Terry Pratchetts. Either approach will do, although the former is probably to be preferred.

Then entities - such as works - which "have" an author need a "foreign key" as one of their attributes, which is precisely another element's primary key. So your two entities from your first example become in fact:

Work:
Primary key: 987654
Title: Mort
Foreign key for author: 123456

Person:
Primary key: 123456
Name: Terry Pratchett

These are the "bits" that will be shared under FRBR - there need be no mystery surrounding the link between them because it is in fact explicit in their attributes. It's just that users probably aren't interested in primary and foreign keys, and they probably won't be displayed by default in (e.g.) OPACs.
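As a rough illustration of the keys described above, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are assumptions made for this example; the key values are the ones used in this comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default

conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE work ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " author_id INTEGER NOT NULL REFERENCES person(id))"
)

conn.execute("INSERT INTO person VALUES (123456, 'Terry Pratchett')")
conn.execute("INSERT INTO work VALUES (987654, 'Mort', 123456)")

# The foreign key is what lets the two "bits" be joined back together.
row = conn.execute(
    "SELECT work.title, person.name FROM work"
    " JOIN person ON person.id = work.author_id"
).fetchone()
print(row)  # ('Mort', 'Terry Pratchett')

# A work pointing at a key nobody has defined is rejected.
dangling_rejected = False
try:
    conn.execute("INSERT INTO work VALUES (1, 'Orphan', 999)")
except sqlite3.IntegrityError:
    dangling_rejected = True
print(dangling_rejected)
```

The rejected insert is a small-scale version of what happens when shared records refer to keys that the receiving system does not know about.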

Your point remains ultimately valid though. As I'm sure you'll see immediately, everybody must be using the same primary keys for their entities, or the system will break as soon as records are shared. Centralized authority control is thus absolutely essential for the benefits of FRBR to be reaped.

Tom

carolslib said...

Perhaps I'm totally off-base (it happens) but it seems to me this is what occurs in the "super" record that John Espy from VTLS was discussing at ALA 2008. At least, this was my understanding of the whole thing. From your scenario, it is similar to the current bib/holding/item record - the holding record has very little info without the bib record and the item record works only with the holding record then the bib record. If the information in your scenario is parsed out into easily manipulated fields (such as a numeric field for the pagination in manifestation, a numeric field for date, etc. etc.), this should link easily out and if the data in the work/expression/manifestation is linked (are linked?) then I see lots of options. Or am I again walking down the wrong garden path?

Karen Coyle said...

Tom, I'm afraid I don't go for the "relational database" model. I'd rather have the entities and links be of the semantic web type, and therefore to be able to be used independently from a particular file, record, or database design. Because WEMI is hierarchical, it doesn't fit into this kind of a model; the dependencies are the barrier to that. So I think we'll need a "son of FRBR" to move into the semantic web world.

Carol, I think the "superwork" is one that encompasses everything related to the Work, and would bring together books, movies, operas, etc., which libraries consider to be separate works. The Bib/Holdings/Item is more like WEMI. But I agree with your take on the data and parsing, once we get the model part worked out.

Jonathan Rochkind said...

The author is indeed intrinsically attached to the work -- but it is quite PROPERLY modelled as a relationship to an author entity, not as embedded author literals in the work record. Isn't this the kind of thing you are always arguing FOR, Karen? Using relationships (whether expressed as URI or otherwise) instead of copy-and-pasted string literals?

This is true whether it's an "entity relational" model, a "relational database", "RDF-style", "entity-value-attribute", etc. It's just good modelling for a person to be a separate entity from a work, not just a bunch of literals attached to a work. This is NOT something unique to relational databases, although those used to relational databases can most easily understand it through that lens. It's just plain good modelling.

But maybe I'm misunderstanding your objection, because I'm surprised to see you disagreeing with this, I've seen you so often before arguing for it?

Whether a particular system includes the "literals" corresponding to a Person in a 'record export' of a Work that person is the creator of -- well, a system is certainly free to do that, but I'd hope it ALSO includes the "identifier" (whether URI, LCCN, or other) of the Person entity in relationship, so the consuming system can restore properly modelled data, not just string literals.

Jonathan Rochkind said...

And I still disagree that WEMI doesn't fit into an RDF/entity-value-attribute/semantic-web model. It certainly does/can. And even in that model, you still want Person and Work to be separate entities linked by a relationship -- the way computers link things through relationships is by "identifiers", of which URIs, database foreign keys, and LCCN authority numbers are all examples.

Even in, especially in, RDF-style, you don't want simple repeated string literals for the same name all over the place, you want relationships between entities.

Karen Coyle said...

Jonathan, perhaps I wasn't clear. I'm not objecting to anything, I'm trying to clarify what are the most useful units for sharing. And it seems to me that many folks haven't understood that the sharable unit is not the FRBR Work entity, but that entity with the creator and subjects that link to it. This is purely practical in terms of what would be useful in a shared environment.

You say "And I still disagree that WEMI doesn't fit into an RDF/entity-value-attribute/semantic-web model. It certainly does/can. And even in that model, you still want Person and Work to be separate entities linked by a relationship" -- Person and Work are not WEMI; WEMI is Group1, and I think Group1 has problems because the entities do not stand alone. I don't know how you get from this to string literals, so I think we're talking past each other. In this particular post I'm not talking about IDs (notice I left them out of my description) vs. literals. I'm talking instead about what data groups make sense in a shared bibliographic environment that is FRBR-based rather than MARC based. In MARC we use the whole MARC record as our share point. With FRBR we can share at different levels, and I'm trying to think through what "sets" make sense for sharing. That's all.

egh said...

I suspect that the appropriate model for sharing in FRBR is the linked data model.

Let me explain what I mean. Let us assume that I am cataloging the 2001 edition of your example. I know that you have done the work of creating a Work and Expression entry for it. Because we are in a linked data world, you have given them URLs, W:identifier and E:identifier.

Now I create a record:

manifestationOf: E:identifier
Place of publication: New York, NY
Publisher's name: New American Library
Extent of text: 181 pages
Dimensions: 21 cm
[...]

When this record is indexed for access, any needed information is pulled down from your URL, E:identifier, and added to our index. (Assume that this data can be cached to avoid a heavy hit on your servers.) Because you have exposed the data in a URL, a machine can access it. So my access system would pull down your Expression info and "follow its nose" to the Work, using any and all information it wanted to build an access system that is useful to my users. Because your links are other URLs, my indexer can fetch them if necessary. For example, your Work record would contain links to a Person record for Terry Pratchett, where I can pull down dates and pen-names. Additionally it could contain links to a Topic record for "Fantasy fiction" at http://id.loc.gov/authorities/sh85047114. We could pull in some broader terms, e.g. Fantasy literature, so that a user searching for fantasy literature would not be disappointed. We could also follow "Similar concepts from other vocabularies" to pull a similar French topic, Littérature fantastique.

I hope this argument makes sense. You can probably tell that I am a fan of FRBR and linked data. I think that the unit of sharing in a FRBR/linked data catalog is as large or small as the access system needs. If the access system is in Canada, it would probably want to pull in that French language topic heading for the book. If it is in the US, maybe not. The key is dereferenceable URLs and machine-readable data.
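To make the "follow its nose" idea a little more concrete, here is a minimal Python sketch. It does no real network access: the WEB dictionary stands in for dereferenceable URLs, the example.org addresses and the fetch function are hypothetical, and the contents given for the id.loc.gov URL are a stand-in rather than what that service actually returns.

```python
from collections import deque

# Stand-in for the web: URL -> (literal index terms, linked URLs).
WEB = {
    "http://example.org/E/1": (["English"], ["http://example.org/W/1"]),
    "http://example.org/W/1": (
        ["Mort"],
        ["http://example.org/P/1", "http://id.loc.gov/authorities/sh85047114"],
    ),
    "http://example.org/P/1": (["Terry Pratchett"], []),
    "http://id.loc.gov/authorities/sh85047114": (["Fantasy fiction"], []),
}


def fetch(url):
    """Hypothetical dereference; a real indexer would do an HTTP GET and cache."""
    return WEB[url]


def follow_your_nose(start, max_depth=3):
    """Breadth-first walk over links, collecting terms for the local index."""
    seen = {start}
    queue = deque([(start, 0)])
    terms = []
    while queue:
        url, depth = queue.popleft()
        literals, links = fetch(url)
        terms.extend(literals)
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return terms


print(follow_your_nose("http://example.org/E/1"))
print(follow_your_nose("http://example.org/E/1", max_depth=1))
```

The max_depth parameter is one possible way to express "as large or small as the access system needs": a shallower walk yields a smaller unit.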

Karen Coyle said...

response to egh -

hmmm. I wonder why I keep getting comments about IDs and linked data on this post.

Yes, I am assuming linked data. I am assuming URIs. That doesn't change the question of what minimum elements will be needed for sharing. THAT is the topic of the post.

Anyway, as I continue to think about it, it seems to me that we'll need all of the linked elements between G1 and G2 and G3, but probably NOT the G1/G1, G2/G2 or G3/G3.

egh said...

Karen - I responded because I read your blog. No conspiracy of linked-data folks. :)

I guess I don't understand the question, "what minimum elements will be needed for sharing".

My answer to that question, as I thought I understood it, was: one URL. If I have an item, and you have cataloged its manifestation, then all I need is the URL you used, right? I simply create a new Item in my system, give it the relationship . Then my access system follows URLs until it has added to my access system all information necessary for a user to search for [Pratchett Mort] or [Fantasy fiction] or [Littérature fantastique] and locate that item.

There is a lot of hand waving, I admit; a lot to be answered in the "follow URLs until it has added to my access system all information necessary" part. I guess that is probably what you are asking.

Thanks for your time.

-Erik

Karen Coyle said...

Erik,

My question has to do with how we will share cataloging copy. Maybe I didn't make that clear enough. If we have a bunch of linked data *somewhere*, and someone dips into it while cataloging, we will need to present the cataloger (I think, I could be wrong) with a sensible unit out of the giant pool of triples (or whatever it is that we have). FRBR is often cited as aiding the sharing of cataloging data. There are some examples of cataloging scenarios linked from http://dublincore.org/dcmirdataskgroup/. But those scenarios assume that there are things called 'work, expression, manifestation records' -- but I don't think we'll have records for WEMI entities because that doesn't make sense to me. (This all started when someone took a typical bib record and labeled it a FRBR:Manifestation, which of course it isn't if it has authors and subjects.) So I'm thinking about what makes sense for sharing cataloging, and it looks like I'm not thinking very clearly because I have not conveyed my thoughts very well.

egh said...

Hi Karen -

I am coming at this as somebody who does not know RDA at all, & who has never cataloged a book (in a MARC system, at least). So take what I say with a grain of salt.

My vision of a FRBR/linked data (FRBR/LD) cataloging system is informed a bit by a project I did for building a digital collection web site (https://launchpad.net/ervin).

It seems to me that a FRBR/LD cataloger is weaving a web, pulling together existing pieces at whatever level is necessary to describe the Item in hand. A Work from here, a Topic from there, a Person from somewhere else. In this way they should be able to do the minimum level of work possible. If a suitable Expression has been cataloged by somebody else, this might mean no cataloging proper. Or it could involve creating a record for each W, E, M, and I, but pulling in subjects from id.loc.gov and authors from openlibrary. Or in the case of very strange works, creating every single piece.

Doing a good job of this will involve good UIs for searching the pool of triples out there, as you say. There needs to be a good way of finding a Person URL that describes who you need.

While I agree with you that a Manifestation record proper does not contain author info, it is connected to author information. So while the triple does exist, we can create a serialization of a graph that contains manifestation information and the author, while clearly differentiating the two. E.g., in turtle syntax, please see here.

Thanks for bringing this up. There is a lot to figure out here.

-Erik

Karen Coyle said...

Erik,

That's the general idea, although creator is not an attribute of frbr:work. So instead of:

:w dct:creator :pratchett .
:pratchett a frbr:Person .

you need something more like:

:p pratchett
:r is creator of
:w

The creator is not contained at all within WEMI, but has a relationship to W. The scenarios on the DC-RDA page also put the creator within a W structure, but I think that is not what FRBR says. It may turn out that's the best way to do it, but I'm trying to puzzle through FRBR as it is before declaring it to be unworkable.

Also note that this is an area where FRBR is inconsistent, in my mind. The manifestation has "publisher" as an attribute, even though the diagrams show "publish" as a relationship between an agent entity and a manifestation (G2 and G1). But there isn't a creator attribute on work, which would also be a G2 and G1. I think this is because they were thinking in terms of traditional cataloging, and publishers are not treated as entities (e.g. they are not authority controlled) in traditional cataloging. In fact, publisher should be expressed as a relationship. Inconsistencies like these make it hard to model the data that catalogers will produce under FRBR.

egh said...

Hi Karen -

FRBR doesn't define an attribute for creator, but it does define the "created by" relationship in §5.2.2, or in the FRBR RDF schema, frbr:creator. I think that it is functionally equivalent to dc:creator.

Re. publisher, after a quick look back at the FRBR document I would think that publisher (the attribute) is equivalent to the statement of responsibility, while the produced by relationship is like the created by relationship.

-Erik

Karen Coyle said...

Erik, FRBR created by would be a narrowing of dc:creator, because it is required to have as its object (or subject, I'm not sure which way it intends to point) a FRBR Group 2 entity. dc:creator can take anything, so it is broader in its definition.

Statement of responsibility is an invention of library cataloging and has nothing to do with publishers. When you have a title that looks like:

Alice in Wonderland : Alice’s adventures in Wonderland and Through the looking-glass / by Lewis Carroll ; black and white illustrations by John Tenniel.

everything after the "/" is the statement of responsibility. The problems with publisher have to do with how it has been treated in library cataloging in the past, not any theoretical consideration.

egh said...

Hi Karen - I think I'm missing something about the creator. I see what you are saying about frbr:creator being a subclass of dc:creator. But I'm not sure what the problem is with modeling the relationship as:

W dc:creator P
P rdf:type frbr:Person

What I meant by bringing up the statement of responsibility is that the publisher attribute is to the realized by relationship as the statement of responsibility is to the created by relationship.

That is, FRBR provides for a controlled vocabulary/entity model for publishers in the realized by relationship, while also providing a free-form, uncontrolled string in the publisher attribute.

(I always imagined the statement of responsibility as a mostly faithful transcription of what a book says its author(s) is/are. Maybe this is wrong.)

So an expression might have the following triples:

E dc:publisher "A joint publication of the State Department of the United States and the British Foreign Office."
E frbr:realizer http://state.gov/
E frbr:realizer http://www.fco.gov.uk/

(As an aside, I think that the distinction between a controlled form and an uncontrolled one is a very useful one, and librarians ought to promote it to the RDF modelers more often. A fun excursion on this theme can be found in this blog post about Franco, master of soukous and rumba, a/k/a Franco et le O.K. Jazz, Franco et Le Tout Puissant OK Jazz, Franco et le TPOK Jazz, etc. A person should not have to choose between ensuring that an album is tied to the controlled form of its creator and transcribing faithfully the creator named on the work, which I think contains information that can be quite useful to the person looking for an item, such as that Franco has been promoted to Le Grand Maitre Franco.)

Anonymous said...

Hello again Karen, sorry I've missed a few days of this. When I read your reply to my initial comment I realized that I was indeed wrong to say that FRBR had to be creating a relational database; as Jonathan implied, I am used to relational databases and I do see things through that lens by default! I agree entirely, however, that the semantic web is the more appropriate model here.

So what happens in the semantic web model when you borrow a work record from someone without borrowing the author record, for example?

The answer - nothing. The work record on your catalogue will continue to point to the author record on theirs. And that is exactly how the semantic web is supposed to work. Indeed, the people you're borrowing from will probably have referenced an author record in a repository like LC Authorities. And in time, there may be such central repositories for work, manifestation, expression records as well - all your own catalogue will have to do is highlight the relevant bits in such publicly available resources, and add whatever you want to for the benefit of your particular user group.

Tom

Karen Coyle said...

Erik, your statement about the transcribed elements is true: there is a bunch of stuff in a library record that attempts to transcribe faithfully what is found on the item. That reflects the role of the bib record as a surrogate for the actual item. Those things won't/can't be controlled as vocabularies. (Although I think it may be time to rethink which fields this should apply to, and how useful it is.) The "publisher name", which is an element in RDA, was not an actual transcription in the previous cataloging rules, but was allowed to be abbreviated based on the cataloger's judgment. In RDA, I believe it is transcribed. So in this sense you are right -- we should separate the transcribed text from ACTUAL DATA, and we could then include both. (My contention is that there are only a handful of fields that need to be transcribed.)

As for your:

W dc:creator P
P rdf:type frbr:Person

It is all a matter of precision. dc:creator can take any values, so this statement could be used in a situation where some P are frbr:Person and some are not. In that case, each instance would define itself through its value, and they could intermingle in a data set. For folks who are wanting to create FRBR-defined data only, they would want to be more precise and state that their property (not its value, as in your case) is in FRBR:creator, which can only take FRBR:Person as a value. In fact, the library world, through FRBR and RDA, is going for maximum precision, something that I believe is going to inhibit data exchange and keep library data in its silo. But that's the data I'm trying to work with at the moment.
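The difference in precision can be sketched as a simple check on what a property is allowed to point at. This is a deliberately simplified Python illustration, not how RDF tooling works (in RDF proper, a declared range is used for inference rather than for rejecting data). The set of allowed classes, the class names, and the identifiers are assumptions for the example.

```python
from typing import Dict, Optional, Set

# Assumed names for the Group 2 entity classes.
GROUP2 = {"frbr:Person", "frbr:CorporateBody", "frbr:Family"}

# None means "no restriction on the value".
RANGES: Dict[str, Optional[Set[str]]] = {
    "dc:creator": None,
    "frbr:creator": GROUP2,
}

# Hypothetical values and the class each one declares for itself.
TYPES = {
    "P:pratchett": "frbr:Person",
    "X:some-agent": "ex:SomethingElse",
}


def is_acceptable(prop, value, ranges=RANGES, types=TYPES):
    """True if the value's declared class fits the property's range."""
    allowed = ranges[prop]
    if allowed is None:
        return True
    return types.get(value) in allowed


print(is_acceptable("dc:creator", "X:some-agent"))    # True
print(is_acceptable("frbr:creator", "X:some-agent"))  # False
print(is_acceptable("frbr:creator", "P:pratchett"))   # True
```

With the broad property, mixed data sets can interoperate and each value describes itself; with the narrow one, the restriction lives in the property.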

Now that I think about it, your dc:creator example could help us out: there are non-frbr-ized properties defined in the RDA vocabularies at http://metadataregistry.org/rdabrowse.htm, and we could use those with frbr defined G2 and G3, which most folks don't have a big problem with (although we do need a definition of each of those outside of FRBR as well...)

ok, gotta think some more. Thanks.

egh said...

Hi Karen -

Thanks for the thoughtful reply and the info about cataloging serving as a surrogate. It is strange that FRBR uses the term realized by for what I would call published by, but I suppose they had their reasons.

-Erik