Coyle's InFormation: 05/01/2010

Wednesday, May 26, 2010

FRBR and Sharability

One of the possible advantages to using FRBR as a bibliographic model is that it can provide us with sharable bits in the form of the defined entities. I've been working on creating a test set of records to illustrate some linked data concepts, and so I began thinking about how the data would break out into sharable units. It turns out to be... an interesting question.

Work

Let's start with the Work, which I believe many people have high hopes for. I have a book in hand which I will use for this illustration. Because this is a book, there are only a few possible data elements in the Work, and these are:

Title of the work: Mort
Preferred title for the work: Mort
Date of work: 1987
Place of origin of the work: England, UK

As you can see, there isn't a lot of information in the Work entity itself. In many cases, a cataloger will not know the date of the work, and may not know where the work was written, in which case you could have just title, and the entire Work entity would be:

Title of the work: Mort

What is obviously missing here is the name of the author. That, however, is not an attribute of the Work in FRBR, but is an entity of its own, either Person, Corporate Body, or Family. It seems clear that without the name of the creator (where appropriate) the Work isn't terribly useful on its own. So I am going to add that creator from FRBR Group 2:

Work:
Title of the work: Mort
Preferred title for the work: Mort
Date of work: 1987
Place of origin of the work: England, UK

Person:
Author: Terry Pratchett

OK, now we are getting somewhere. We have an author and a title. This is a "unit" that someone could grab or link to and make use of. They aren't really separable, which is what puzzles me a bit about FRBR. It's not like you could re-use this Work for another book with the same title (and there are others with this same title). It's only the Work by Terry Pratchett that this Work entity can represent. As far as I am concerned, the creator entity and the work entity are inseparable in the description of a work. A creator can be associated with many works, but a Work cannot be re-used with different creators. Once the creator(s) of the Work are defined, that relationship is fixed as part of the identity of the Work.

We could leave Work as it is here, but if you want to include subject headings in your sharing, they need to be included in the shared Work, because subject headings in FRBR are only associated with the Work. Given that, our sharable Work becomes:

Work:
Title of the work: Mort
Preferred title for the work: Mort
Date of work: 1987
Place of origin of the work: England, UK

Person:
Author: Terry Pratchett

Subject:
Topic: Fantasy fiction, English
Topic: Discworld (Imaginary place) -- Fiction

This is the unit that needs to be created so we can share Works.
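
As a rough illustration, the sharable Work unit above could be modeled as a single immutable record. This is only a sketch: the class and field names (Agent, SharableWork, and so on) are assumptions made for illustration, not FRBR or RDA element names, and the second author is a made-up placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Agent:
    """A Group 2 entity (Person, Corporate Body, or Family) and its role."""
    name: str
    role: str  # e.g. "Author"


@dataclass(frozen=True)
class SharableWork:
    """Work attributes bundled with the creators and subjects they need."""
    title: str
    creators: tuple                       # Group 2: part of the Work's identity
    subjects: tuple = ()                  # Group 3: subjects attach only to the Work
    preferred_title: Optional[str] = None
    date: Optional[str] = None            # often unknown to the cataloger
    place_of_origin: Optional[str] = None


mort = SharableWork(
    title="Mort",
    preferred_title="Mort",
    date="1987",
    place_of_origin="England, UK",
    creators=(Agent("Terry Pratchett", "Author"),),
    subjects=(
        "Fantasy fiction, English",
        "Discworld (Imaginary place) -- Fiction",
    ),
)

# Hypothetical second book with the same title: because the creator is part
# of the record, it can never be mistaken for the Pratchett Work.
other_mort = SharableWork(
    title="Mort",
    creators=(Agent("Some Other Author", "Author"),),
)

print(mort == other_mort)
```

Because the record is frozen and compared by value, two Works are equal only when title, creators, and subjects all match, which mirrors the point that the creator relationship is fixed as part of the Work's identity.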

Expression

Now let's move on to the Expression, the real bugbear of FRBR. For books, Expression has few data elements. In this case we have:

Date of expression: 1987
Language of expression: English

All perfectly fine and well, but clearly not something that can stand alone. Similar to Work, this expression is not usable with just any English language work written in 1987 -- it's not sharable in that sense. This Expression must be associated irrevocably with a particular Work, in this case the Work we created above. There will be some link that essentially says:

E:identifier --> expresses --> W:identifier
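
One way to picture that link is as a three-part statement, with a check that each Expression points at exactly one Work. The identifiers below are invented placeholders, and the helper name work_for is an assumption for this sketch rather than anything defined by FRBR.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """A single statement: source --predicate--> target."""
    source: str
    predicate: str
    target: str


# Hypothetical identifiers, for illustration only.
EXPRESSION_ID = "E:mort-english-1987"
WORK_ID = "W:mort-pratchett"

links = [Link(EXPRESSION_ID, "expresses", WORK_ID)]


def work_for(expression_id, all_links):
    """Return the one Work an Expression expresses.

    Raises ValueError if the Expression is linked to no Work or to several,
    since an Expression cannot stand alone or float between Works.
    """
    targets = [
        link.target
        for link in all_links
        if link.source == expression_id and link.predicate == "expresses"
    ]
    if len(targets) != 1:
        raise ValueError(
            f"{expression_id} must express exactly one Work, found {len(targets)}"
        )
    return targets[0]


print(work_for(EXPRESSION_ID, links))
```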

Second thought: Expression can also have an important creator/agent role, such as translator, editor, adaptor -- and possibly others related to music that I'm not knowledgeable about -- so it, too, should include those for sharing. In fact, probably all of the Group2 to Group1 relationships need to be included in a sharing situation. So, for a hypothetical French translation, we get:

Expression
Date of expression: 1987
Language of expression: French

Person
Translator: J-P Sartre

The unit of sharing here must be the expanded Expression plus the expanded Work (with Group2 and Group3 entities). This illustrates something that has bothered me a bit about the Group1 FRBR entities, which is the dependency inherent in the hierarchy WEMI. WEMI essentially must be created as a single thing with multiple parts. This is true even of the Manifestation.

Manifestation

The Manifestation is seemingly the richest and therefore the most independent of the FRBR Group1 entities, but as we'll see, without the Work and Expression you do not get a useful set of data elements. Here is what we have for our Manifestation:

Title proper: Mort
Statement of responsibility: Terry Pratchett
Title proper of series: Discworld
Date of publication: 2001
Copyright date: 1987
Place of publication: New York, NY
Publisher's name: HarperTorch
Extent of text: 243 pages
Dimensions: 17 cm
Carrier type: volume
Mode of issuance: single unit
Media type: unmediated

What is lacking here? Well, there's no link to the entity for the author, which would provide an identification of the author and any variant forms of the author's name. There's no language of text, because that's in the Expression. And there are no subject headings, because those are associated with the Work. If this were a translation, there would be no link to the Work in the original title. The Manifestation entity is very readable, but if we are sharing for the purposes of copy cataloging, it has to be bundled with the Work and Expression to be usable.

Our Sharable Units

So this is what we get as sharable units:

Work + Group 2 (creator) + Group 3 (subject)
Expression + Group 2 (creator) + Work + Group 2 (creator) + Group 3 (subject)
Manifestation + Expression + Group 2 (creator) + Work + Group 2 (creator) + Group 3 (subject)

With these three, it will be possible to build on Works and Expressions as needed, creating new Expressions and Manifestations for a Work. It will also be possible to "grab" a Manifestation and along with it get a full description including subjects and creators.
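
The nesting of the three units can be sketched as records that each carry the one above them, so that grabbing a Manifestation brings the Expression and Work along. The class and function names here are assumptions for illustration; only a few of the data elements listed earlier are included to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class Work:
    title: str
    creators: list   # Group 2 agents attached to the Work
    subjects: list   # Group 3 subjects attached to the Work


@dataclass
class Expression:
    language: str
    date: str
    work: Work       # an Expression is never shared without its Work
    contributors: list = field(default_factory=list)  # e.g. translators


@dataclass
class Manifestation:
    title_proper: str
    publisher: str
    date_of_publication: str
    expression: Expression  # a Manifestation carries its Expression (and Work)


def full_description(manifestation):
    """Flatten a Manifestation bundle into one copy-cataloging record."""
    expression = manifestation.expression
    work = expression.work
    return {
        "title proper": manifestation.title_proper,
        "publisher": manifestation.publisher,
        "date of publication": manifestation.date_of_publication,
        "language": expression.language,          # lives on the Expression
        "creators": work.creators + expression.contributors,
        "subjects": work.subjects,                # live on the Work
    }


work = Work(
    title="Mort",
    creators=["Terry Pratchett"],
    subjects=["Fantasy fiction, English", "Discworld (Imaginary place) -- Fiction"],
)
expression = Expression(language="English", date="1987", work=work)
book = Manifestation("Mort", "HarperTorch", "2001", expression)

for label, value in full_description(book).items():
    print(f"{label}: {value}")
```

A new Manifestation of the same Expression would simply reuse the existing expression object, which is the "building on" that the sharable units are meant to allow.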

Now we just need a system to test this out.

Monday, May 03, 2010

Bib data and the Semantic Web

I know that I've gone on and on about transforming bibliographic data into a semantic web format. And whenever folks have asked me: "What will it look like?" I haven't had a good response. Now there is something to show you: Freebase.

Freebase is a database of interlinked semantic web "statements," essentially what SemWeb types call "triples." The statements come from a variety of open data sources such as Wikipedia, TVDB.com, a science fiction fan database, and Open Library. By placing a user interface over these data they now have a searchable, navigable site that can link books to movies to (theoretically) music to science to... well, anything where linked data is available.
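
To make "triples" a little more concrete, here is a minimal sketch of a handful of statements and the kind of navigation they allow. It does not use Freebase's actual data model or API; the identifiers and predicate names are invented for illustration, and the second book is added here only so there is something to navigate to.

```python
# Each statement is a (subject, predicate, object) triple.
triples = [
    ("book:Mort", "author", "person:Terry_Pratchett"),
    ("book:Mort", "series", "series:Discworld"),
    # Added for illustration: another Discworld novel by the same author.
    ("book:Sourcery", "author", "person:Terry_Pratchett"),
    ("book:Sourcery", "series", "series:Discworld"),
]


def objects(subject, predicate, store):
    """Everything that `subject` points to via `predicate`."""
    return [o for s, p, o in store if s == subject and p == predicate]


def subjects(predicate, obj, store):
    """Everything that points to `obj` via `predicate`."""
    return [s for s, p, o in store if p == predicate and o == obj]


def other_books_by_same_author(book, store):
    """Follow book -> author -> books: navigation purely by links."""
    found = set()
    for author in objects(book, "author", store):
        found.update(subjects("author", author, store))
    found.discard(book)
    return sorted(found)


print(other_books_by_same_author("book:Mort", triples))
```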

Their book data isn't as strong as it should be, given that they claim to have imported the Open Library file (I suspect it was only partially imported). When you look at the Freebase entry for Emily Dickinson you only see two works listed. Open Library has 137 Works for Dickinson, and WorldCat Identities lists 3,388. Also, their approach is more "popular" than rigorous. However, there is no reason why this same technique could not be used with "pure" library data, and library catalogs could make use of any of the data in such a database because it is all available through linking and APIs. A database like Freebase essentially serves as a huge pot of available, re-usable information.

In its current form, Freebase would not be sufficient for library data sharing, although it could provide an interesting testing ground. What we need to work out for libraries is a way to version and source content so that you know who provided each statement and when, and to make it easy to contribute new information or improvements to the information in a sensible and automated way. There is no reason why we could not create a "LibBase" that consists solely of what libraries would consider to be authoritative information; a kind of linked data WorldCat. That data would have to be able to interact with other data on the Web, and by doing so libraries would become discoverable on the Web. It would be logical for projects like Freebase to link to the library data. Library users would have a rich, navigable information base that could help them follow (or even make) connections between library resources -- connections that are much less evident in today's catalogs. Some technical magic would need to occur to allow users to move seamlessly from the whole world to their local library, but I don't think that's going to take rocket science to solve.
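
The "version and source" requirement might look something like the sketch below: every statement records who provided it and when, and improvements are appended rather than overwriting what was there. Everything in it is hypothetical, including the library names, the dates, the values, and the StatementLog class itself; it is one possible shape, not a description of any existing system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    value: str
    source: str      # who provided the statement
    asserted: date   # when it was provided


class StatementLog:
    """Append-only store: earlier versions are kept, never overwritten."""

    def __init__(self):
        self._statements = []

    def contribute(self, statement):
        self._statements.append(statement)

    def history(self, subject, predicate):
        """All versions of one statement, oldest first."""
        matches = [
            s for s in self._statements
            if s.subject == subject and s.predicate == predicate
        ]
        return sorted(matches, key=lambda s: s.asserted)

    def current(self, subject, predicate):
        """The most recently asserted version, or None if there is none."""
        versions = self.history(subject, predicate)
        return versions[-1] if versions else None


log = StatementLog()
# Hypothetical contributions from two made-up libraries.
log.contribute(Statement("work:Mort", "preferred title", "Mort (Novel)",
                         "Library A", date(2010, 5, 3)))
log.contribute(Statement("work:Mort", "preferred title", "Mort",
                         "Library B", date(2010, 5, 26)))

latest = log.current("work:Mort", "preferred title")
print(latest.value, "from", latest.source, "on", latest.asserted.isoformat())
```

Keeping the full history means a library that trusts one contributor over another can still pick the version it prefers instead of simply taking the latest.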

There is a group of interested souls planning to get together on the Friday morning of ALA DC to begin some exploration of how we might make semantic web technology work for libraries. There will be announcements on various lists (I'm guessing NGC4LIB, CODE4LIB, LITA-L and RDA-L, at the very least). If you can get to ALA a little early, please mark that slot on your calendar. It'll be a free-floating, working, barcamp-style meeting, as I understand it.