Monday, June 23, 2008

The "Mao" problem

I've been assisting the Internet Archive on its Open Library project, my role being primarily to help them understand library data. It's fascinating watching non-librarians encounter library data -- so much that we take for granted isn't obvious at all to others. I'm thinking that it's time for a "Library Data for Dummies." I am seriously considering setting it up as a wiki so we can all contribute to it.

Most recently on the OL project we ran into what I like to call the "Mao" problem. It begins like this: the database uses bibliographic records from libraries and from Amazon. The Amazon data presents author names in natural order ("John Smith"), while the library records use the inverted order with the family name first ("Smith, John"). It's best for users of the service to see the names presented uniformly (the mixture is quite jarring). If you think about it for a moment, you realize that converting the natural order names to inverted order will be problematic, since there is nothing to tell you where the family name begins ("Oscar de la Renta"). So the solution is to un-invert the inverted names, something that is purely mechanical.
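To show just how mechanical, here is a minimal sketch in Python. It is an illustration only, not the Open Library code: the function name `uninvert` is made up, and it assumes the first comma in a heading separates the family name from everything else.

```python
def uninvert(heading: str) -> str:
    """Turn an inverted heading ("Smith, John") into natural order ("John Smith").

    Assumption: the first comma separates the family name from the given
    names. A heading with no comma is returned as-is (minus outer spaces).
    """
    family, _, given = heading.partition(",")
    family, given = family.strip(), given.strip()
    return f"{given} {family}" if given else family


print(uninvert("Smith, John"))  # John Smith
```

Real library headings often carry more than two parts (dates, titles), so a production version would need to deal with those before swapping anything.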

Until you encounter Mao, Zedong -- and the thousands of other authors for whom "natural order" is family name followed by a given name. I find that Mao is the example that hits the "Aha!" button for most people. Obviously, presenting the name as "Zedong Mao" pretty much makes it unrecognizable. So what to do?

Well, I suppose it helps to NOT think like a librarian. Edward Betts, the coder on this project, came up with an ingenious idea: he compared the names in the Open Library records with names on Amazon and on Wikipedia, and has made a list of names that generally appear in family name first order with a link to the source where it was found. For famous authors or historical figures, Wikipedia contains many of the names and is good about presenting various name forms. It gives the traditional and simplified Chinese forms, and sometimes both Wade-Giles and Pinyin transliterations. It also often has the note:
"This is a Chinese name; the family name is Chen."
Naturally, an automated solution of this kind will produce some false hits, but that's why the Open Library is designed as a wiki -- so errors can be corrected. I'm beginning to think, though, that a link from author names to Wikipedia is not a bad idea in itself. The articles are often quite comprehensive and definitely are more useful than a link to a name authority record. I'd also favor a link to OCLC's Worldcat Identities pages, which are quite rich and link well to library data, since that's what they are based on. Presumably one could launch a search to either from a name heading. Has anyone tried this yet?
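The list-based approach might look something like the sketch below. This is not Edward's actual code: the `FAMILY_FIRST` table, its single entry, and the function name `display_name` are hypothetical stand-ins for a list built by matching names against outside sources.

```python
# Hypothetical exception list: inverted heading -> source where the
# family-name-first form was found. The entry is for illustration only.
FAMILY_FIRST = {
    "Mao, Zedong": "Wikipedia",
}


def display_name(heading: str) -> str:
    """Return a display form for an inverted heading such as "Mao, Zedong"."""
    family, _, given = heading.partition(",")
    family, given = family.strip(), given.strip()
    if not given:
        return family
    if heading in FAMILY_FIRST:
        # Family name stays first; only the comma is dropped.
        return f"{family} {given}"
    return f"{given} {family}"


print(display_name("Mao, Zedong"))  # Mao Zedong
print(display_name("Smith, John"))  # John Smith
```

Keeping the source alongside each entry means that anyone correcting a false hit can see where the name form came from.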

11 comments:

Anonymous said...

The Germans have been doing it from Wikipedia into the DNB catalog for quite a while now.

http://www.d-nb.de/aktuell/presse/pressemitt_wikipedia.htm [In German]

See also: A news note in CCQ 42 (1) 2006 alerts us to a press release dated 3 Aug 2005 reporting on Die Deutsche Bibliothek (DDB) linking thousands of personal name authorities to biographical articles in the German-language version of Wikipedia.

Perhaps something similar could be done with some of the recent projects to provide links to authorities, whether from OCLC, LC, or Worldcat Identities.

For example, see this page on Hegel. http://de.wikipedia.org/wiki/Hegel Scroll to the bottom and see the 1st link under the heading: Literaturverzeichnisse.

Anonymous said...

This is one of those things that really excite me about Wikipedia and other electronic resources. It's often bothered me that we do not seem to use a lot of the information out there already in digital form when trying to automate or process our library data. I have not yet had my morning caffeine, so I'm drawing a blank on other examples. The only one I can think of right now is using atlases and other geographical data to guess at place names.

Dorothea said...

I have an old presentation from Extreme Markup, um, 2004? on library data and data structures, aimed at smart people (markup geeks) who don't know anything about library innards. I don't think I have it in electronic form because I remember doing a lot of cutting-and-pasting of paper, but I'd be happy to dig it up, scan it, and contribute it to the cause.

Th said...

The production WorldCat Identities service is at http://worldcat.org/identities/. I did a posting about linking to it which is still mostly right. If you have an LCCN, linking is pretty easy: http://worldcat.org/identities/lccn-n79-32879, and there is a full SRU interface underneath it all.

--Th

Karen Coyle said...

Thanks, Thom, for the correct URL. I thought it had changed but didn't find that one in my simple Google search. Worldcat Identities is one of my favorite examples of what we can do with bib data if we think beyond the individual record.

John Mark Ockerbloom said...

On The Online Books Page, I uninvert the names on many of the displays (actually, all of them except for listings by author, and the full metadata screen). But to make it work right, I have to annotate certain names (like Mao's) with the specific form to use for the "normal" uninverted form. (I do have a shortcut macro indicating that the normal form doesn't get uninverted).

I was surprised not to find anything already in authority records or other standard library bibliographic metadata saying what a person's "normal" name should be. (The title statement of responsibility in a full MARC record often shows it, but not in a way that allows the name to be unambiguously and reliably picked out. And the MARC 100 field indicator isn't sufficient either -- it doesn't work the way you'd expect for names like Mao's.)

If libraries want to track the best "normal" name form to use, I'd think the easiest place to keep track of this would be some authority record field (so you'd only have to record it once, and then look it up any time you needed it). Is there any field, or any proposal of a field, to use for this?

Karen Coyle said...

JMO - I haven't encountered any interest on the part of librarians in natural order display of names. In fact, some are downright hostile to the idea. It appears that we've been looking at these reverse-order forms for so long that they look natural to us. I suspect that's not the case with many of our users, however. Surely they must puzzle over some of the odd constructions that library cataloging produces ("Augustine, Saint, Bishop of Hippo").

The "statement of responsibility" is the most puzzling thing about library cataloging for me. Along with the title it carefully reconstructs the title page, but its lack of coding makes it virtually useless. It has the names in the form that they appear on the book itself, as well as the roles, as stated (e.g. "illustrated by"), but all you can do is display it after the title for a person to read. All that good information, but unusable. Yet... it will continue to be that way in the next cataloging rules.

Unknown said...

Of course, context is everything. Not just the name's context, but the user's context. A user in Hungary will consider "Bartok Bela" to be natural order while a user in the United States will consider "Bela Bartok" to be natural order...

Karen Coyle said...

edphrens -- you are absolutely right. This is why we need to have names coded in a way that you can adjust display for whatever context you are in. Or maybe even give users a choice: librarians could continue to see the inverted order while other users could choose whatever "natural order" they see as "natural."

Dawn Lawson said...

At NYU we are implementing ExLibris' Primo, which also has the "Mao problem." It's especially jarring when the Chinese characters appear with the name, in the customary order. It's also a problem with Arabic and other languages, and as you noted, even with names that appear in romanization only. For the short term, we were going to see if we could defeat the system's name reversal in records with an 066 field, but obviously a wider-scale solution is in order.

Sherman Clarke said...

The Union List of Artist Names (one of the Getty vocabularies) does include a display version of a name. It's not the index form, and ULAN avoids saying preferred form. Cataloging Cultural Objects uses display forms in direct order (or normal order, if you will).