Coyle's InFormation: 06/01/2008

Sunday, June 29, 2008

RDA Update at ALA

As usual, the ALA Annual meeting here in Anaheim is a 4-day meeting crammed into two days. It is impossible to get to more than 3 or 4 meetings a day, even if you have the energy to do so, and there are at least two more at each time slot that you would like to attend.

I did make it to the RDA Update session, which was in a very large auditorium. It started out quite well because, walking in with Diane Hillmann, co-author with me on the scathing article "Resource Description and Access (RDA): Cataloging Rules for the 20th Century," we came upon a table holding RDA promotional brochures with an enticingly similar title. Yes, this time the century has been incremented by one.

Unfortunately, the first announcement was that RDA has been delayed once again, this time by two months. The public review version will not be available until October, and I had the feeling that "late October" would be an accurate statement. In case you hadn't heard, the final review will be done using the online system that is being developed, not through the familiar PDF documents that we have seen so far. This means that the chapters that we have not yet seen will remain unknown to us until that October date.

The good news is that ALA hired the smartest woman in the world, Nannette Naught, to create the online system and she has actually taken RDA, as we have seen it in its paper-ish form, and turned it into a huge complex of entities and relationships with their related instructions, scope notes and examples. It will be possible to create customized views and workflows within the system, and even add instructions relating to your library. (Since I can't see the purpose of each library doing this, I'm wondering if there won't be a market for customizing that ALA can respond to.)

The bad news is that this online subscription service will be the only way to access RDA. In other words, there will be no public access to the key library standard. Most of the questions at the session were from librarians worried about how this would be priced -- because where once they would buy a copy of AACR2, they will now have an ongoing cost to access the cataloging standard. And, yes, although the ALA rep Don Chatham was loath to admit it, ongoing services do increase in price over time as a rule. (Note that ALA publishing is considering "derivative print works" if there is a market for them, which is logical. But there are no immediate plans for such works while all of the focus is on the primary product.)

I think the RDA product looks great, and I intend to spend much time with it during the review period -- in part because that is probably the last time I will be able to use it. I will be one of the many people who are interested in library data, even working with library data, but because I am not in a traditional institution I will not have access to the cataloging standard. I don't mind that I won't have access to the nifty tool designed for catalogers -- I don't need that. I do need to know what the rules are, however, so that I can continue to help people interpret library data.

Monday, June 23, 2008

The "Mao" problem

I've been assisting the Internet Archive on its Open Library project, my role being primarily to help them understand library data. It's fascinating watching non-librarians encounter library data -- so much that we take for granted isn't obvious at all to others. I'm thinking that it's time for a "Library Data for Dummies." I am seriously considering setting it up as a wiki so we can all contribute to it.

Most recently on the OL project we ran into what I like to call the "Mao" problem. It begins like this: the database uses bibliographic records from libraries and from Amazon. The Amazon data presents author names in natural order ("John Smith"), while the library records use the inverted order with the family name first ("Smith, John"). It's best for users of the service to see the names presented uniformly (the mixture is quite jarring). If you think about it for a moment, you realize that converting the natural order names to inverted order will be problematic, since there is nothing to tell you where the family name begins ("Oscar de la Renta"). So the solution is to un-invert the inverted names, something that is purely mechanical.

Until you encounter Mao, Zedong -- and the thousands of other authors for whom "natural order" is family name followed by a given name. I find that Mao is the example that hits the "Aha!" button for most people. Obviously, presenting the name as "Zedong Mao" pretty much makes it unrecognizable. So what to do?

Well, I suppose it helps to NOT think like a librarian. Edward Betts, the coder on this project, came up with an ingenious idea: he compared the names in the Open Library records with names on Amazon and on Wikipedia, and has made a list of names that generally appear in family name first order with a link to the source where it was found. For famous authors or historical figures, Wikipedia contains many of the names and is good about presenting various name forms. It gives the traditional and simplified Chinese forms, and sometimes both Wade-Giles and Pinyin transliterations. It also often has the note:

This is a Chinese name; the family name is Chen.

Naturally, an automated solution of this kind will produce some false hits, but that's why the Open Library is designed as a wiki -- so errors can be corrected. I'm beginning to think, though, that a link from author names to Wikipedia is not a bad idea in itself. The articles are often quite comprehensive and definitely are more useful than a link to a name authority record. I'd also favor a link to OCLC's WorldCat Identities pages, which are quite rich and link well to library data, since that's what they are based on. Presumably one could launch a search to either from a name heading. Has anyone tried this yet?
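For the non-librarians, here is roughly what the un-inversion with an exception list might look like. This is only a sketch in Python, not the Open Library code: the function name display_name and the FAMILY_NAME_FIRST set are made up for illustration, the list is hand-written rather than harvested from Amazon or Wikipedia, and it assumes a bare "Family, Given" heading with no dates or titles tacked on (real headings often have them).

```python
# Hypothetical exception list: headings whose natural order is family name
# first. In the approach described above it would be built by comparing
# against Amazon and Wikipedia; here it is hand-written for illustration.
FAMILY_NAME_FIRST = {"Mao, Zedong"}


def display_name(heading, family_first=FAMILY_NAME_FIRST):
    """Convert an inverted heading such as "Smith, John" to display order.

    Assumes a bare "Family, Given" heading with no dates or titles attached.
    """
    if "," not in heading:
        # Single-part names ("Homer") have nothing to un-invert.
        return heading.strip()
    family, given = (part.strip() for part in heading.split(",", 1))
    if not given:
        return family
    if f"{family}, {given}" in family_first:
        # Natural order is already family name first; just drop the comma.
        return f"{family} {given}"
    # The purely mechanical case: move the given name to the front.
    return f"{given} {family}"


for heading in ["Smith, John", "Mao, Zedong", "Homer"]:
    print(heading, "->", display_name(heading))
```

The mechanical part really is trivial; all of the difficulty lives in how that exception list gets built and corrected, which is where the wiki comes in.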

Monday, June 09, 2008

More patent insanity: Google's virtual bookshelf

I've only read a few patents in my time, and they are very strange documents. Stranger even because they have a real effect on the world.

I don't know if there is a specific language and style of patents, but the ones that I have read are amazingly vague. That is especially frightening given that patents describe technologies -- things you can create in some 'real world' fashion. The latest patent to make it through the magical and mysterious process at the Patent Office that turns it from "nutty idea" to "take it to court" is the Google patent called "Computer-implemented interactive, virtual bookshelf system and method."

"A computer-implemented method and system for realizing an interactive, virtual bookshelf representing physical books and digitally stored books of the user. Using a search query, the Web is searched using search metadata to identify a desired book. Library metadata corresponding to the physical books and digitally stored books of the user is then searched using the search metadata to determine whether the book is present in the virtual online bookshelf. Results indicative of whether the desired book is present on the virtual on-line bookshelf can be displayed."

That's the abstract, but even after reading the details and looking at the diagrams, there are many things that are not clear. Here's one flow:

Search term or query of search data and/or search metadata
Search hits metadata of desired books

OK, so far we have exactly what happens in a library catalog (but not what happens using Google Book Search, which is based on actual text).

Filter or compare to library/metadata of selected virtual bookshelf

If found in virtual bookshelf, "User acquires physical or digitally-stored copy of desired book from physical bookshelf of selected virtual bookshelf."

Now this last one is just nonsense. There's something called Physical Bookshelf that somehow the user accesses from an Internet search. Does this mean that the user gets a call number and goes to the shelf? And in the diagrams, the Physical Bookshelf contains a "Memory of Digitally Stored Books." So this must be magic, because I don't know of any physical bookshelves with memory. Well, at least none outside of L-Space.

The last thing that happens in this very odd flow is that if one does not find the book in the virtual book shelf, the question is put:

Acquire physical or digitally stored copy of desired book?

If the answer to that is yes, then the user acquires metadata of the desired book, which is then compared to the virtual bookshelf. The same virtual bookshelf where the item wasn't found.

Since we know that patents tend to be interpreted very broadly, this patent could be seen to cover any search of metadata that results in finding books that can be either digital or physical. That is essentially every library catalog in the nation, and beyond. And indeed what is a library catalog but a "virtual bookshelf"? The one caveat is that it is the Web that is searched, not a library database. But if we go forward with our ideas to have library metadata searchable over the Web, then ...

Patents today are rarely used by their inventors to create actual products. Instead they are used to bludgeon competitors who are also working in the same approximate service space. The patents are ends in themselves and are designed to prevent invention. Quite honestly, if something isn't done about this, we'll find ourselves completely unable to innovate.

At this point I should come up with some clever, satirical example of outrageous patents, but it's really impossible to one-up reality in this particular area.

For a more positive view of the patent, see the SEO by the Sea blog post, and a post about that post by Lorcan Dempsey.