Saturday, September 29, 2007

Name authority control, aka name identification

Libraries do something they call "name authority control". For most people in IT, this would be called "assigning unique identifiers to names." Identifying authors is considered one of the essential aspects of library cataloging, and it isn't done in any other bibliographic environment, as far as I know. When a user goes to a library catalog, they will find all of the works of T.C. Boyle under a single name, even though he has variously used T.C. Boyle and T. Coraghessan Boyle on his books, and was born with the name Thomas John Boyle. Authority control puts all of his works under one name, with references from other forms of his name: TC Boyle, see: T. Coraghessan Boyle. When there are two authors with the same name, one of them (the second one to be added to the authority file, generally) is distinguished using a middle initial or the year of birth. Thus you can have:
Smith, John
Smith, John 1709
Smith, John 1936
Smith, John A.
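
For readers coming from the IT side, the two mechanisms just described (references from variant forms, and a date added when a name is already taken) can be sketched roughly as below. This is only an illustration, not how any library system actually works; SEE_REFERENCES, resolve and unique_heading are names made up for the example.

```python
# Hypothetical sketch: variant name forms point ("see:") to one authorized form.
SEE_REFERENCES = {
    "TC Boyle": "T. Coraghessan Boyle",
    "T.C. Boyle": "T. Coraghessan Boyle",
    "Thomas John Boyle": "T. Coraghessan Boyle",
}


def resolve(name):
    """Follow a 'see' reference if one exists, else return the name unchanged."""
    return SEE_REFERENCES.get(name, name)


def unique_heading(name, existing, birth_year=None):
    """Return `name` if it is unused; otherwise append a birth year to distinguish it."""
    if name not in existing:
        return name
    if birth_year is None:
        # This is the expensive step: someone has to go and find the date.
        raise ValueError("a distinguishing date is needed for " + name)
    return f"{name} {birth_year}"


headings = {"Smith, John"}
print(resolve("TC Boyle"))
print(unique_heading("Smith, John", headings, 1936))
```

Note that the second function cannot finish its job without a date of birth, which is exactly where the trouble starts.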

There are some problems with the current method used by libraries to realize authority control, not the least of which is that it is a difficult and expensive process and the number of authors is growing rapidly as we all become creators in this information age. I want to address here three aspects of name authority control that are especially non-functional: 1) the use of dates as distinguishing characteristics is not easy for the catalogers creating the authority record; 2) the use of dates as distinguishing characteristics does not help the users; 3) the name heading is not a legitimate identifier because it may change.

Date of Birth is Hard for Catalogers

We hear that authority control, including name authority control, is responsible for upwards of 40-50% of the time it takes to catalog a book. Part of this is in determining if you do indeed have a new author to enter into the system. Another part is in creating the unique entry. Take the case of Michael Fitzgerald, editor of a book called Touching All Bases. Touching All Bases is a collection of columns by sports writer Ray Fitzgerald. His sons, Michael and Kevin, gathered the columns after their father's death in 1982 and published them. Because there have been other Michael Fitzgeralds as authors, the year of his birth had to be added to his name. Here's the authority record for Michael:

LC Control Number: n  83124260
Cancel/Invalid LCCN: n 97055382 no 90013838
HEADING: Fitzgerald, Michael, 1955-
Found In: Fitzgerald, R. Touching all bases, c1983 (a.e.) CIP t.p.
(Michael Fitzgerald)
Call to publisher, 6/27/83 (Accountant, b. 2/22/1955)

Michael Fitzgeralds seem to be in great abundance. There was even another one who wrote a book and was also born in 1955. To distinguish between them, Michael Fitzgerald 1955 #2 has his full date of birth added to his name:

LC Control Number: n 2003097483
LC Class Number: PS3556.I8345
HEADING: Fitzgerald, Michael, 1955 June 11-
Found In: The Creative circle, 1989: t.p. (Michael Fitzgerald) p. 241
(teaches at Shenandoah College in Virginia)
Earth circle, c2003: CIP t.p. (Michael Fitzgerald) data
sheet (b. 06-11-55)

His book, The Creative Circle, is about art, music and literature from a Baha'i perspective. We see that at the time the authority record was created he was teaching at Shenandoah College.

So here we have two authors whose works would never be mistaken for each other, yet who have the same name. The authority records are evidence of why it is so time-consuming to create these identifiers. Because the date of birth is generally not included either in the book or in the promotional material provided by publishers, the librarians establishing the name heading often must resort to contacting the publisher or the author or the author's institution to determine that information.

Date of Birth May Not Help Users

In a time when few people wrote books, and when users may have come to the library with some knowledge of the famous intellectual whose works they were seeking, the distinction between two John Smiths, one born in 1709 and one born in 1936, may have been an obvious one. We are now, however, in a time of author abundance. Anyone can, and many do, write books, and many of those writing are not known in wider circles. Reading is now considered a "popular" activity, as the bookshelves of any chain bookstore will evidence. So a user of a library catalog may find himself facing a daunting choice among authors, such as these, all named "Michael Fitzgerald":

Fitzgerald, Michael
Fitzgerald, Michael, 1768-1831
Fitzgerald, Michael, 1859-
Fitzgerald, Michael, 1918-
Fitzgerald, Michael, 1937-
Fitzgerald, Michael, 1946-
Fitzgerald, Michael, 1955-
Fitzgerald, Michael, 1955 June 11-
Fitzgerald, Michael, 1957-
Fitzgerald, Michael, 1958-
Fitzgerald, Michael, 1959-
Fitzgerald, Michael, 1970-
FitzGerald, Michael A.

The Michael Fitzgerald born June 11, 1955, will be able to find himself in this list, but other than members of his immediate family, no one else will know which of these he is. Catalogers have to call publishers or authors to find out the author's date of birth because it's not included on the book, so there is no reason to believe that the date is available to users of the library catalog. All of that time and effort is expended to create a distinction that often doesn't help the user.

All That, and It's Not Even a Valid Identifier

The final blow to name authority control is that the name heading (as the name entry is called, e.g. Smith, John A.) can change. Sometimes it changes because a mistake was made in creating the heading, or even in the printing of the book; other times it changes because the library rules for creating name headings change. The heading performs multiple functions: it is the display form in displays of the book's data, it is used as the string to search on in a catalog, and it identifies the author. If a new display form is needed, then the identifier itself changes. When this happened on a grand scale a few decades ago, due to a change in the library cataloging rules, all of the connections between names and books broke, and names in library records all over the country (and beyond) had to be changed. A true identifier only identifies, and if display forms change the identifier stays the same. John Smith is the same person even if the library entry changes from Smith, John A. to Smith, John Arthur.
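
To make the difference concrete, here is a minimal sketch of what separating the identifier from the display form could look like. The number 1001, the placeholder book title and the names Author, authors, books and byline are all invented for illustration; nothing here describes an existing library system.

```python
from dataclasses import dataclass


@dataclass
class Author:
    author_id: int     # arbitrary identifier (hypothetical); it never changes
    display_name: str  # the heading; free to change when the rules change


# Assumed toy data: one author and one book linked by identifier, not by name.
authors = {1001: Author(1001, "Smith, John A.")}
books = [{"title": "An Example Title", "author_id": 1001}]


def byline(book):
    """Look up the current display form through the stable identifier."""
    return authors[book["author_id"]].display_name


print(byline(books[0]))                            # the heading before the change
authors[1001].display_name = "Smith, John Arthur"  # a rule change alters the heading
print(byline(books[0]))                            # the book record itself was never touched
```

Because the book carries only the identifier, changing the heading is a single update in one place, and no links between names and books break.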

What Now?

It seems pretty clear that we won't be able to deal with our author abundance using the current name authority methods. There are too many new authors appearing for us to spend time calling around to determine birthdates. There are also too many new authors for those dates of birth to be useful as a way to distinguish between persons. To add to that, we really need a true identifier for authors.

Library catalogs attempt to maintain uniformity throughout, so the idea of treating contemporary authors differently from historical ones is a very disruptive concept. However, the notion is beginning to circulate that we could have contemporary authors identify themselves in some way. Something to the effect of: Yes, I am the same Michael Fitzgerald who authored that book on art, and that's the identifier for me. After all, who better than the author knows his own identity?

That doesn't solve the problem that users have of identifying the author they seek from a long list of persons with essentially the same name. Perhaps the days of looking at lists of authors' names are over. Maybe users need to see a cloud of authors connected to topic areas in which they have published, or related to book titles or institutional affiliations. In this time of author abundance, names are not meaningful without some context.
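
As a rough idea of what "names with context" might look like in a display, the sketch below pairs each heading with a couple of titles and dates taken from the two authority records quoted earlier. The structure and the names AUTHORS and context_line are assumptions made up for this example, not a proposal for a record format.

```python
# Headings and titles are from the two authority records quoted above;
# the data layout itself is hypothetical.
AUTHORS = [
    ("Fitzgerald, Michael, 1955-", [("Touching all bases", 1983)]),
    ("Fitzgerald, Michael, 1955 June 11-",
     [("The Creative circle", 1989), ("Earth circle", 2003)]),
]


def context_line(heading, works, limit=2):
    """Show a heading together with up to `limit` of the author's titles."""
    shown = ", ".join(f"{title} ({year})" for title, year in works[:limit])
    return f"{heading}: {shown}"


for heading, works in AUTHORS:
    print(context_line(heading, works))
```

A user who has never heard of either man's birthday can still tell the editor of a sports-column collection from the author of a book on art.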


searchtools said...

OpenID? I know it sounds facile, but it's gaining traction: Orange telecom is going to give OpenIDs to all its users. How we'll manage multiple ones is gonna be a problem, but it does seem like a way to identify people. Maybe they'll print it in the CIP data in books, along with the ISBN.

(I was just researching this for InfoToday's newsbreaks, should be out on Monday. The best article I found was oldish but very helpful: Open ID for non-SuperUsers)


Leo Klein said...

First, I remember a library whose name will go unmentioned that had a list of close to 20 different entries for Louis XIV.

I sent them a note with the subj: "Whole Lot of Louis Going On." I meant it as a joke but the head cataloger almost collapsed in tears.

Anyway, moving right along, the advantage of "unique identifiers" is that each "author" can be represented by an arbitrary yet unique sequence of numbers.

This doesn't help in the recognition department but the real name is pulled out of the database before the user sees it.

Anyway, the point is, the identifier doesn't rely on some human-significant characteristic.

Cory Nimer said...

As you put it at the end of the entry, "in this time of author abundance, names are not meaningful without some context." Current authority control practices, with their focus on unique headings for collocation, provide neither adequate context for distinguishing who is who nor persistent identifiers for stable database linking.

At a certain level, the library community has already begun to recognize the lack of context as it has tried to address the concerns of archives. The latest draft of FRAD incorporates ISAAR(CPF) elements, and divides name records from person/corporate body/family records, allowing the linked record to provide the context. In the end, searching would probably be done on these entity records rather than in a name file.

This vision of linked records reinforces the need to use something other than the name form itself to associate bibliographic and authority records. Although there has been talk about using ISADN for years, it seems to have been left off of the table by FRANAR in favor of national authority numbers and projects such as LEAF. Still, current ILSs don't even make use of existing LCCNs for linking records.

It would be nice to think that this will be addressed soon, but based on the slow adoption of FRBR I think this is some way off.

Anonymous said...

"All of that time and effort is expended to create a distinction that often doesn't help the user." Apparently saying that adding death dates doesn't help the catalog user if the users don't know when the author was born. I beg to differ. It greatly helps the user, because it distinguishes the authors with the same name from one another, so that the works of one author can be found together in the catalog, not mixed up with all the others. That is not helpful? -- Jack Hall, cataloger, University of Houston Libraries

Karen Coyle said...

No, Jack. Using dates of birth does not help the user if the user does not know the birth year of the person they are seeking. Yes, this might clump them together in a long list, but users are doing searching and selecting, not browsing through a card catalog. If I am given a long list of Michael Fitzgeralds, how do I know which one to select? To distinguish in a way that is useful, you have to distinguish using something that helps the user *recognize* the author he is seeking. Users do not generally know the date of the author's birth. And if they have to find that out in order to use the catalog, believe me, they'll give up on the catalog as a tool.

avirr said...

Ah, the light dawns. I really didn't get it before. But someone just posted to one of my mailing lists making fun of Amazon for recommending they buy a neurology textbook, which happened to be written by a less-popular Jon Stewart.

Context is the key. It's the Michael Fitzgerald who wrote this, not the one who filmed that. Identity is tied to information object, if there are existing ones, or created new if there are not.

Netflix just added lists of movies to their search results for people. That's a dynamic context that works pretty well, IMNSHO

Karen Coyle said...

"avirr" -- exactly. I'd rather see a list of authors with one or more of the items they authored, perhaps with the date of publication/creation. That gives me a context that is more likely to be meaningful to my search than the author's date of birth.

Jonathan Gorman said...

In "The Inmates are Running The Asylum" Cooper calls this "Uninformed consent", I believe. (I'm in the middle of reading it now, but the book isn't right beside me.)

It happens in software because it closely models the reality of the most immediate underlying structure. It forces the user to navigate the system and get nested information that doesn't really have to be nested.

It would seem at least we could add some titles. I mean, we already know this. I've mused about similar ideas and had one cataloger mention that the issue would be uncontrolled headings. I suspect that it would be better to accidentally list the same author as two separate ones. At least that would narrow down to just a few clicks as opposed to literally several hundred clicks for large libraries. (So say I know John Smith was born in the early 20th century...that could still be hundreds of clicks).

Hooking up authors with some sort of unique resolver would also be nice. Maybe one of these days I'll get a chance to try to implement something similar. Meanwhile I'll keep an eye on the WorldCat Identities project.

avirr said...

I've been mulling this over, and found that someone wrote a paper on using Wikipedia for disambiguation. It's got a writeup at Greg Linden's blog and conveniently online: the PDF paper by Silviu Cucerzan. He has a nice survey of the recent research as well. At least Microsoft is throwing some of its money in a useful spot.

Anonymous said...

I would not agree that the days of looking at lists of authors' names are over. By all means let's add new methods of indicating to users which is the name they are most likely to want. But we need to *add* the new methods, not substitute them for the old ones. If we provide new ones, and lose the existing ones, where is the aggregate advantage?

Arrangement by date is invaluable if you know that the person you are looking for lived around 1800. You can eliminate all the "19XX" people at a glance.

Try to find in VIAF the Thomas Thornton who went on a tour of France in around 1802. The names aren't even in alphabetical order!

It should become possible for the user to sort a browse list by surname and then date, and lots of other ways. In the meantime, sorting by name+forenames, and then by date, is the next best thing, and in many systems it's available now!

Thanks for interesting blog