Saturday, September 13, 2008

Thinking About Linking

In my previous post on affordances, I included inter- and intra-metadata links. I feel like there's a lot of confusion in this area (to some of which I may myself have contributed), so I'm going to do a bit of a disorganized brain dump here as an attempt to start a conversation in this area, and see if maybe we (or I) can't arrive at some clarity.

In the FRBR vision that RDA has embraced, there is something called the "relational/object-oriented model." I have some basic problems with this because I perceive relational and object-oriented designs to be quite distinct. This concept of relational/object-oriented gives me one of those "blank brain" moments -- when something sounds like it should make sense but I just can't make sense out of it. So I'm going to treat it as a set of relationships within a bibliographic record.

In the FRBR/RDA model there are entities: Work, Expression, Manifestation, Item (WEMI), and Person, Corporate body, Concept, Object, Event, Place. The interesting thing about these is that none of them is intended to stand alone. This is a very inter-dependent group of entities, not a set of separate records. This is hard for us to imagine because today's model is indeed of separate records for bibliographic data and authority data (covering names and subjects). However, our view is colored by the fact that the bibliographic record carries headings from the authority records, and therefore is complete in itself. Authority records, if you think about them, even those for names, are of the nature of a controlled vocabulary. The view of these vocabularies as contributing to the bibliographic description means that we have to have a way to express both the entities themselves and the links between them.

In addition, we have to decide what one defines as a record. If, to describe a work, one must also describe the creator, then it does seem that the Work entity and Person (or Corporate) entity must be part of the same record. Otherwise, the record cannot stand alone. So what does it mean to include the Person entity, and where does that entity reside? Or is an unresolved link to a (presumed) entity sufficient to complete the bibliographic record? In other words, if the bibliographic record has, as part of the work, a link to a Person entity that resides elsewhere, is that bibliographic record complete?

Note: I read back through FRBR and FRANAR regarding the Person entity. FRBR includes only the "name heading" in its Person entity, while the FRANAR Person entity has many more elements. This parallels today's difference between the personal name field and the name authority record.
There are other kinds of relationships that are between bibliographic entities. To my mind there are two types of relationships here: dependent and independent. The dependent relationships are between the WEMI entities, none of which is considered complete in itself. In fact, I consider the WEMI to be a single entity with dependent parts. (Admittedly, this is how current library cataloging views it, with a single flat record that contains information on all of these bibliographic levels which exist simultaneously in a single object.) To me, these are indivisible -- you can't have any one of them without the others.
[Note that I consider the WEMI to be a single entity in terms of library cataloging records. The levels of this entity do have meaning on their own. For example, a literary critic will often refer to the Work, perhaps to the Expression. A publisher or bookstore advertises the Manifestation. A library identifies and circulates the Item, and a rare book seller deals almost exclusively in Items.]

The independent relationships are those between different bibliographic entities -
  • Work-Work, two works that reflect or reference each other (cited, cites; works based on other works, like parodies or sequels)
  • Whole-Part, works in which one can be contained in the other (article and journal, chapter and book, volume and series)
  • Item-Item, reproductions of all types
To a large degree, these relationships can all be expressed as properties: isCreatorOf, isExpressionOf, isCitedBy. But I can't shake the feeling that there are at least two distinct kinds of relationships: those that fill in what otherwise would be gaps in a metadata record, and those that inform relationships between bibliographic items. I also wonder about links with and between complex entities. For example, imagine a bibliographic record that links to a member of a subject vocabulary that is stored in SKOS format. The SKOS record has numerous fields covering preferred and alternate headings, definitions, links to broader and narrower terms, and all of this in various languages. What if the property in the bibliographic record has the meaning "definition of term in French"? What does one link to? Or is the only possible link to the vocabulary member as a whole?
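As a rough sketch of what "expressed as properties" could look like, each relationship can be written as a subject-property-object statement. Only the property names isCreatorOf, isExpressionOf, and isCitedBy come from the paragraph above; the identifiers, the helper function, and the way the properties are sorted into two kinds are assumptions made for illustration.

```python
from collections import namedtuple

# A statement links two entity identifiers through a named property.
Triple = namedtuple("Triple", ["subject", "prop", "obj"])

# Hypothetical identifiers; none of these refer to real records.
triples = [
    Triple("person:austen", "isCreatorOf", "work:pride"),
    Triple("expr:pride-en", "isExpressionOf", "work:pride"),
    Triple("work:pride", "isCitedBy", "work:some-study"),
]

# Assumption for this sketch: these properties fill gaps inside one
# description, while any other property relates separate bibliographic items.
GAP_FILLING = {"isCreatorOf", "isExpressionOf"}


def split_by_kind(statements):
    """Return (gap_filling, item_to_item) lists of triples."""
    gap_filling = [t for t in statements if t.prop in GAP_FILLING]
    item_to_item = [t for t in statements if t.prop not in GAP_FILLING]
    return gap_filling, item_to_item


gap_filling, item_to_item = split_by_kind(triples)
```

The triple form itself does not distinguish the two kinds of relationship; in this sketch the distinction lives entirely in the hand-made GAP_FILLING set, which is one way of stating the open question rather than an answer to it.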

So these are a few of the questions I have. Hopefully some of them can be cleared up quickly. I'm interested in hearing how others think about these issues. For those attending DC2008, if this interests you I'm game for some discussion.


Anonymous said...

"If, to describe a work, one must also describe the creator, then it does seem that the Work entity and Person (or Corporate) entity must be part of the same record. Otherwise, the record cannot stand alone."

Wait, why is it assumed that the record must stand alone? I'd say that if to describe a work, one must also describe the creator, then for every work record, a creator record must exist for its creator, and every system that includes a work record must also include the record for its creator(s).

That isn't the same thing as saying the work and creator must both be part of the same record--the same data file, the same indivisible computer file.

Of course, with the authority record being a very rudimentary creator record, we ALREADY are functioning in an environment where the work record and the creator record are not one and the same. While there are many problems with our current environment, I don't think this is one of them---or rather, it's not that the work and creator records are separate, but that our practice is not one of using good context-independent identifiers to link the work record to a creator record (or ideally, to more than one creator record). (And role identifiers too).

But deciding that somehow a creator record can't exist independently of a work record would be taking a step backwards.

Karen Coyle said...

Jonathan, I warned you that my thoughts were a bit scattered on this!

Internal to any system or database, the data can be divided out into inter-related bits. But do we want unresolved dependent links in records that we communicate? If you retrieve a bib record from a database using SRU, what do you get? A bib with identifiers in place of some entities? A bib and the entity records within a wrapper? Do you get just the "preferred heading" for the creator (which is all that is defined in the rules for the bib record) or do you get all of the data elements relating to the creator? OR can we assume a publicly available pool of creator records that is stable enough that anyone receiving a bib record will have access to it?

So I guess where this question breaks down is between a local system, which can guarantee that linked data is available, and the larger public exchange of bib data. For the latter, we need a communications format, and I assume that there are advantages to communicating a record (which can be complex) that has its dependencies resolved.

Jonathan said...

If a record includes a "dependent link", then absolutely, there needs to be a predictable conventional way to automatically retrieve the record identified by the identifier that stands in for the "dependent link". Whether that retrieval method (or 'resolution') is actually built into the identifier itself (like an http identifier, resolution of which over http gives you the record), or whether it's simply a convention (If I have this LCCN for an authority record, I know, predictably, that I can access this record identified by that LCCN like so)---either way.

Or for an alternate approach: If it really is important that a group of records always be downloaded together, then the system providing that download can always provide them together---even if they are separate records. When a system asks to retrieve the record for Pride and Prejudice, what it gets delivered is a package including a manifestation record for Pride and Prejudice, a person entity record for Jane Austen, a work record for Pride and Prejudice, maybe an expression record for Pride and Prejudice too. That package is ALWAYS what's delivered when you ask for Pride and Prejudice--but it's still a package of several records, not one big record.

[And then when you ask for Sense and Sensibility, you get a package that includes the person entity record for Jane Austen again, which you already had and didn't need, but oh well.]
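The package approach can be sketched in a few lines. This is only an illustration under stated assumptions: the record store, the identifiers, the field names (creator, embodies), and the build_package helper are all invented here and do not describe any real system or format.

```python
# Hypothetical in-memory store: each record has its own identifier and
# refers to other records only by identifier.
RECORDS = {
    "person:austen": {"type": "Person", "name": "Jane Austen"},
    "work:pride": {"type": "Work", "title": "Pride and Prejudice",
                   "creator": "person:austen"},
    "manif:pride-1": {"type": "Manifestation", "embodies": "work:pride"},
    "work:sense": {"type": "Work", "title": "Sense and Sensibility",
                   "creator": "person:austen"},
    "manif:sense-1": {"type": "Manifestation", "embodies": "work:sense"},
}

# Fields whose values are identifiers of other records (an assumption).
LINK_FIELDS = ("creator", "embodies")


def build_package(record_id, store=RECORDS):
    """Collect a record plus every record reachable through its links."""
    package = {}
    pending = [record_id]
    while pending:
        current = pending.pop()
        if current in package:
            continue  # already collected; avoids loops and repeats
        record = store[current]
        package[current] = record
        for field in LINK_FIELDS:
            if field in record:
                pending.append(record[field])
    return package


pride = build_package("manif:pride-1")
sense = build_package("manif:sense-1")

# The Jane Austen record travels in both packages, but it stays one
# identifiable record rather than data copied into each book record.
shared = sorted(set(pride) & set(sense))
```

Because the person record keeps its own identifier, a receiving system could notice that it already holds "person:austen" and skip or refresh it, which an indivisible record would not allow.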

There are a whole bunch of ways to approach this problem. But jamming everything into an indivisible record is not the answer. There needs to be a way to identify the person entity Jane Austen, precisely so you can tell that Pride and Prejudice and Sense and Sensibility are both related to this same person record. And so when someone finds a mistake in the Jane Austen person entity record and corrects it, that correction can be efficiently communicated to everyone by means of an identifier for that person record, rather than trying to advertise the change by referencing each of a couple dozen "book" records that are written by Jane Austen and all have that errant data duplicated in them. (Which is in fact more or less the state we are in now, despite the existence of under-used authority record architecture).

Be careful Karen, if you keep going down this road too far you're going to decide that all this modern identifier stuff is just a bunch of hooey, and we're fine with just MARC and AACR2, who needs anything else? I'm surprised, I'm used to disagreeing with you thinking you are a bit too 'fundamentalist' on 'semantic web' principles, but now I find myself disagreeing with you thinking you have abandoned them altogether!

Karen Coyle said...

Jonathan, I don't disagree with you, but also note that I am not proposing solutions, but problems that need to be solved. So this is a discussion that I feel we must have (and hopefully not just you and me!) as we redefine what data we will convey and how we will convey it as we rely further on linked data. I do not have preconceived ideas on where we should end up. I'd like to put forth a whole lot of different options, even though many of them will be knocked down as we go forward.

So thanks for your suggestions/scenario, and send in others if they occur to you.

Andrew Hankinson said...

I used to buy into FRBR, but after having seen some of the ways in which automated systems can extrapolate and draw relationships without them being explicitly defined, I'm starting to become more than a little skeptical. If you'll bear with me...

To me, both the strongest and weakest qualities of FRBR are its abilities to draw relationships between entities, and to treat separate entities as a part of a larger bibliographic universe. In the bibliographic model we have one work with clear relationships to one (or several) expressions, which in turn is related to one (or several) manifestations, and each of these can be related to other works, or creators, or any other thing that we choose. In theory, this would allow us to jump from a musical work to finding both a score and recording, or by only knowing an author's name we can find everything written by/about him.

From what I've seen, however, this breaks down at a couple very practical and not-too-surprising points.

The first, and most important, is the point of inter-cataloguer consistency. (i.e. the famous "Whose FRBR do you trust?") I liken creating relationships to the assigning of subject headings in the current system. In the practical, multi-million record bibliographic universe, choosing between relating two items and skipping over several, possibly more appropriate relationships, is a very real possibility. How item x and item y are related can be multifaceted and can differ depending on whether a subject expert is forming these relationships. I believe this will translate into bibliographic "islands": clusters of relationships that should be related to other records, but that exist outside of them. Practically speaking, we then can't trust that a search for "Edgar Allan Poe" will also return everything we have by / about him. To me, this is no different from the situation we have today.

The second part where I believe this breaks down, which is somewhat tangential to the original topic, is that it is human nature to do the bare minimum required. (See, for example, Zipf's principle of least effort.) With such overwhelming possibilities for forming relationships and creating complex data structures, I don't believe we will achieve the type of bibliographic utopia we're looking for. Instead, cataloguers will continue on doing the best they can, but with an eye to the increasing stack of books they have piled on their desk, will only do what's necessary to make the item findable. And where do they choose to stop? Will records created on a Wednesday morning have more relationships than those created on a Friday afternoon, just because the person wants to go home?

And so, back to my original point: why are we doing all this work, when we have machines that are almost designed to do this for us? Why are we worrying about the different parts of a bibliographic work, and how to create these explicit relationships, when all people really need is a decent way to access the data we have? "Item-in-hand", flat-file cataloguing is, in my opinion, still a viable way for cataloguing resources, if for no other reason than it removes much of the decision making and thus inconsistencies of a more complex model.

If we can develop smart ways of linking records via machine learning algorithms and advanced data mining, do we need FRBR? And if we develop a global, dynamic bibliographic database that moves beyond the "download-and-store" local records systems we have in place now, will FRBR collapse under its own weight?

I'd sincerely love to hear thoughts about these, as this is what's causing me to yearn for simpler times... (and I'm still young!)

Anonymous said...


You said some interesting things, and I just have a few questions for you. Before we get to handling the more complex relationships that Karen mentions, we should probably deal with the relationships we already have trouble with now. Do you think that these machine learning algorithms and the advanced data mining could help us sort out the explicit relationships we already create, say for example between a monograph and the series to which it belongs? Do you think that they could recognize when a monograph is part of a series and then place it with other monographs in that same series? Catalogers also rely heavily on the contents note to mention all the separately authored works within an anthology. Do you think these same methods could be used to retrospectively extract and make explicit the relationships between the parts and the whole?


Karen Coyle said...

Thanks for the comments, Andrew and Bryan. I think I'm in agreement with Andrew, if I understand his thinking, in that what really matters is not the fixed hierarchy of FRBR but the many and varied relationships between bibliographic entities. The Dublin Core folks who are heavily into the Semantic Web talk a lot about making 'inferences' based on relationships. I'm not sure how much this can help us, but here's one possible example: if you identify two items as being of the same series, you can infer that they have the same publisher. That will be correct a high proportion of the time, enough to make the inference generally useful. You cannot, however, infer that because two items have the same publisher that they are of the same series. So inferring, while interesting, is an imprecise machine action that can suggest relationships which, however, will often need to be confirmed by a person.
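The series example can be made concrete with a small sketch. Everything in it is hypothetical: the item records, the field names, and the infer_publishers helper are invented to illustrate the one-directional inference described above, and the output is a set of suggestions to be confirmed, not asserted facts.

```python
# Hypothetical item records; None marks a value the cataloger left out.
items = [
    {"id": "item:1", "series": "series:A", "publisher": "Example Press"},
    {"id": "item:2", "series": "series:A", "publisher": None},
    {"id": "item:3", "series": None, "publisher": "Example Press"},
]


def infer_publishers(records):
    """Suggest a publisher for items sharing a series with a described item."""
    known = {}
    for record in records:
        if record["series"] and record["publisher"]:
            # Keep the first publisher seen for each series.
            known.setdefault(record["series"], record["publisher"])

    suggestions = {}
    for record in records:
        if record["publisher"] is None and record["series"] in known:
            suggestions[record["id"]] = known[record["series"]]
    return suggestions


suggestions = infer_publishers(items)
```

Here item:2 picks up a suggested publisher from its series, while item:3 gets no suggested series even though it shares a publisher with item:1, since the sketch deliberately has no rule running in that direction.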

Where I think we get ourselves tripped up is in our assumption that bibliographic 'control' is an act of precision, not a relative statement. We have no way, in our data, to say: 'this book is a lot like this other one, but we aren't sure if one derives from the other.' Librarians operate almost entirely on that assumption of precision, but my gut feeling is that users move about comfortably in a state of relativity. They are fine with the vague relationships inherent in Amazon's 'if you like A you might want to read B.' They follow all kinds of different relationships that are not precise, including browsing the shelves of the library (where they may even move from one subject to another without being aware of it).

I don't think that precision in cataloging and the creation of a wide variety of relationships are in conflict with one another. However, in the past we have emphasized precision of description and haven't done a very good job of creating relationships, and I personally think that the latter is superior for users in terms of the discovery of information and ideas.