Speaker: Andrew Pace
Andrew Pace is Head of Information Technology, North Carolina State University Libraries, the folks who created one of the first faceted library user interfaces using Endeca technology.
Title: The Promise and Paradox of Bibliographic Control
NCSU's catalog
Pace starts off with "Rumsfeld's law" (which he claims he will now retire): You search the data you have, not the data you want to have. (I didn't get that right - Andrew, please correct)
The now famous NCSU/Endeca catalog was designed to overcome some "regular" library catalog problems:
- Known-item searching works well, but topical searches don't.
- No relevance ranking.
- Unforgiving on spelling errors.
- Response times weren't good enough.
- Some of the good data is in the item record, but usually not well used.
Andrew quoted Roy Tennant saying that the library catalog "should be removed from public view."
Catalogs are going to change more frequently than they have in the past, and they have to adapt to new technologies and different kinds of screens. They need to be flexible.
The "next gen" catalog is really responding to "this gen" users. (By the time we get to "next" we'll be waaaaaay behind.)
Data Reality Check
In the "old" catalog, 80 MARC fields were indexed in the keyword index -- 33 of those are not publicly displayed. There are 37 different labels in the display. In Endeca they indexed 50 MARC fields.
Simple data are best. They were thinking of going to XML, but Endeca preferred a flat file, basically a display form of the MARC record, with punctuation removed.
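To make the "flat file" idea concrete, here is a minimal Python sketch of that kind of flattening step. The chosen MARC tags, the delimiter, and the strip_punct helper are my own illustration, not NCSU's or Endeca's actual load format.

```python
import re

def strip_punct(value: str) -> str:
    """Drop trailing ISBD-style punctuation ( / : ; , . ) from a field value."""
    return re.sub(r"[\s/:;,.]+$", "", value).strip()

def flatten_record(record: dict) -> str:
    """Turn a {MARC tag: display string} mapping into one flat, delimited line."""
    wanted_tags = ["245", "100", "260", "650"]   # illustrative subset of fields
    parts = [f"{tag}={strip_punct(record[tag])}" for tag in wanted_tags if tag in record]
    return " | ".join(parts)

example = {
    "245": "The promise and paradox of bibliographic control /",
    "100": "Pace, Andrew K.,",
    "260": "Raleigh, N.C. : NCSU Libraries, 2007.",
    "650": "Online library catalogs.",
}
print(flatten_record(example))
# 245=The promise and paradox of bibliographic control | 100=Pace, Andrew K | ...
```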
With the Endeca system they were able to re-index their entire database every night without bringing down the existing system. This meant that they could tweak the relevance algorithms many times to get them right. (How many of us don't think of a "re-index" as a two-week job?) This kind of ability to manipulate the data makes a huge difference in how we can perfect what we do.
Andrew then gave the usual, impressive demo of NCSU catalog, and the facets. It's easy to see how far superior this is to the standard library catalog.
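For anyone who hasn't seen a faceted catalog, the mechanics are easy to sketch: count the distinct values of a field (subject, LC class, format) across the current result set, display those counts, and let a click on a value narrow the set. A toy Python version, purely illustrative of the idea rather than how Endeca actually does it at scale:

```python
from collections import Counter

# Toy result set: each "record" lists its subject headings.
results = [
    {"title": "Book A", "subjects": ["United States--History", "Politics"]},
    {"title": "Book B", "subjects": ["Politics"]},
    {"title": "Book C", "subjects": ["United States--History"]},
]

# Facet counts: how many records in the current set carry each subject value.
facets = Counter(s for rec in results for s in rec["subjects"])
print(facets.most_common())   # [('United States--History', 2), ('Politics', 2)]

# "Clicking" a facet value simply filters the current result set.
narrowed = [rec for rec in results if "Politics" in rec["subjects"]]
print([rec["title"] for rec in narrowed])   # ['Book A', 'Book B']
```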
How to Relevance Rank
Slide: Relevance ranking TF/IDF not adequate (Andrew, what does TF/IDF mean?)
Basically, we haven't really figured out how to do ranking with library metadata. The NCSU catalog used some dynamic ranking (phrase, rank of the field, weights), plus static ordering based on pub date.
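TF-IDF (term frequency times inverse document frequency) is the standard text-retrieval weighting the slide refers to, and on short, structured bibliographic records it isn't enough on its own. Below is a minimal Python sketch of the layered approach described above: field-weighted term matches, a phrase boost, and a static pub-date tiebreak. The field names, weights, and boost values are invented for the example, and the IDF component is omitted; this is not NCSU's or Endeca's actual algorithm.

```python
# Invented weights for illustration: a match in the title counts more than one in a note.
FIELD_WEIGHTS = {"title": 3.0, "subject": 2.0, "notes": 1.0}
PHRASE_BOOST = 2.0

def score(record: dict, query: str) -> float:
    q = query.lower()
    q_terms = q.split()
    s = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        text = record.get(field, "").lower()
        tf = sum(text.split().count(t) for t in q_terms)   # simple term-frequency count
        s += weight * tf
        if q in text:                                      # exact phrase match boost
            s += weight * PHRASE_BOOST
    return s

def rank(records: list, query: str) -> list:
    # Dynamic score first; static ordering by publication date breaks ties (newest first).
    return sorted(records, key=lambda r: (score(r, query), r.get("pub_date", 0)), reverse=True)

books = [
    {"title": "Civil war history", "subject": "United States", "pub_date": 1998},
    {"title": "The civil war", "subject": "United States Civil War", "pub_date": 2005},
]
print([b["title"] for b in rank(books, "civil war")])   # ['The civil war', 'Civil war history']
```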
Andrew gave some interesting statistics about use of the catalog:
- 2/3 of users do a plain search and don't use refinements.
- 25% do a search plus some navigation.
- 8% do pure navigation (browse, no query), mainly using the "new books" option.
The two most frequently used "options" are LC classification and subject headings. Subject-based navigation is nearly half of all navigation. The order of the dimensions (facets) doesn't appear to determine usage: statistics from the NCSU catalog show that users select facets that appear lower on the page.
Most searches in the catalog are keyword searches. Subject searches are very small (4%). Author searches only 8%. [Note: later in the day, someone suggested that the committee should gather stats about actual use of catalogs to inform the discussion. Duh!]
The definition of "most popular" (an option selected 12% of the time) is based on circulation figures. Call number, title, and author searches are each used at about the same rate, around 10%.
We still have a natural language problem -- and LCSH isn't very good for this. Andrew gave the example of the common term "Revolutionary War" vs. an LC subject heading that reads: United States--History--Revolution, 1775-1783. [Look this up in any library catalog -- the dates vary so it's really hard to tell which subject heading defines this topic.]
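One way to bridge that gap without abandoning the controlled heading is a cross-reference layer that maps common terms to LCSH before the search runs (one of the comments below describes a subject map that does exactly this). A toy Python sketch; the lookup table and function are invented for the illustration:

```python
# Toy "see-reference" lookup: swap a common natural-language term for the controlled
# LCSH heading before searching. The table below is invented for the illustration.
SEE_REFERENCES = {
    "revolutionary war": "United States--History--Revolution, 1775-1783",
    "civil war": "United States--History--Civil War, 1861-1865",
}

def expand_query(user_query: str) -> str:
    """Return the controlled heading for a common term when we know one."""
    return SEE_REFERENCES.get(user_query.lower().strip(), user_query)

print(expand_query("Revolutionary War"))
# United States--History--Revolution, 1775-1783
```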
The new discovery tools point out inadequacies in the data. What could replace LCSH? User tagging is interesting, but there's the difficulty that the same tag gets added to many items, and the retrieved set is huge.
Will we be able to make sense out of full text? Right now our store of digital materials is incomplete so it is very hard to draw any conclusions from the full text works that we have.
Andrew presented a wish list:
- A faceted classification system, or one that enables faceted navigation.
- A work identifier for books and serials.
- Something other than LC name authority for organizations (publishers, licensors, etc.)
- Physical descriptions that help libraries send books to off-site shelving and to patrons' mailboxes. We used to care about the height of books and whether they would fit on a shelf. Now we need to care about width and weight, for shelving and mailing.
- Something other than MARC in which to encode all of the above
- Systems that can use this encoding
6 comments:
Thanks for all your blogging on this meeting. It is very helpful.
TF-IDF stands for "term frequency/inverse document frequency". The article in Wikipedia on it is pretty good.
Andrew Pace's presentation [PPT] is now up, linked from his blog post take on the meeting.
He says "[Karen's] summary of my own presentation (PowerPoint) was up before I even took my seat."
Kudos for this live blogging, which is giving this event wide and immediate distribution!
-Jodi
Thank you very much for the liveblog of this meeting!
The example you bring up is interesting. I just tried it, looking up "Revolutionary War" in my online book subject catalog:
http://onlinebooks.library.upenn.edu/subjects.html
and I got right at the top a clickable link to the official subject heading. And when I clicked on *that*, I got not only the items that were under that subject heading, but also suggestions and example items for a bunch of related headings. Including terms that have no facet-based relationship to the original, such as "American loyalists" and "Boston Tea Party, 1773".
Okay, it's a fairly small collection, and the subject cataloging is not always as precise as it could be. Though the tradeoff on the latter point is that the subject cataloging was cheap; what you see there is a mix of subject headings grabbed from other libraries' cataloging, and subject headings automatically assigned based on other data at hand (in this case, call number data, but you can imagine it coming from other sources, such as tagging or automated text analysis.)
My point is not that LCSH is great as it is (far from it!) but that before we propose "burning the catalog" and starting over without LCSH, we should consider whether we can still get some useful leverage from it if we improve our tools for using it; and also consider whether we have acceptable substitutes for it if we decide to do without. (Even in Timothy Burke's "burn the catalog" post, he admits that Amazon's subject headings are no better than LCSH's. And while other techniques such as full text search, tagging, and facet classification are all useful for many kinds of discovery scenarios, it remains to be seen whether they can support as well the sorts of research problems that LCSH was designed to help with.)
The announcement of the meeting came the day after I had scheduled a vacation for that date :-( So many, many thanks for blogging this, Karen!
I am glad that Andrew Pace mentioned the need for more physical description data. Only the height tends to appear on a record, the depth in special situations. Of course we need the width. I hope practice changes soon, because those of us having to manage shelving and shifting could use more information.

There is hope. Many publishers who use ONIX list all three dimensions. In fact, according to Product Metadata Best Practices (1.1) (see p. 67; http://www.bisg.org/docs/Best_Practices_Document.pdf), dimensions data is mandatory. The book industry needs the data for their inventory management systems. Could we adapt the information for ours? I am not sure how exactly it's used.

I'd like to imagine a day when a function in a library inventory management system can depict on a computer screen actual shelf usage in three dimensions. I'd like to be able to anticipate in real time whether I am going to have enough space on hand to accommodate new purchases, especially if the collections librarians decide to go on a spending spree, rather than having to sort it out after the fact. The system could tell me when it expects areas to fill up. Having that data on hand from the start would also reduce the labor it takes to measure books shipped to off-site storage facilities; the system could just assign the location based on the dimensions data already in the record.

Of course, it's not current cataloging practice to add all three dimensions, but isn't it time? I have written the editors of RDA suggesting that dimensions data be mandatory, and we now have a potential source for that data. Making data work harder.
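As a back-of-the-envelope sketch of the kind of check described above, assuming width data were actually carried in the records (the shelf length, field names, and sample values are invented for the illustration):

```python
# Could the books slated for a shelf actually fit, given spine-width data?
SHELF_LENGTH_CM = 90.0

def remaining_space(shelf_length_cm: float, incoming: list) -> float:
    """Shelf space left after placing the incoming books, measured by spine width."""
    used = sum(book["width_cm"] for book in incoming)
    return shelf_length_cm - used

new_purchases = [
    {"title": "Book A", "width_cm": 3.2},
    {"title": "Book B", "width_cm": 4.5},
    {"title": "Book C", "width_cm": 2.8},
]

left = remaining_space(SHELF_LENGTH_CM, new_purchases)
print(f"{left:.1f} cm of shelf space remaining")
if left < 0:
    print("Not enough room: plan a shift or route volumes to off-site storage.")
```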
These views are my own, not my employer's.
Bryan Campbell
Library Assistant
Nashville, TN
classz696@yahoo.com