Speaker: Andrew Pace
Andrew Pace is Head of Information Technology, North Carolina State University Libraries, the folks who created one of the first faceted library user interfaces using Endeca technology.
Title: The Promise and Paradox of Bibliographic Control
Pace starts off with "Rumsfeld's law" (which he claims he will now retire): You search the data you have, not the data you want to have. (I didn't get that right - Andrew, please correct)
The now famous NCSU/Endeca catalog was designed to overcome some "regular" library catalog problems:
- Known-item searching works well, but topical searches don't.
- No relevance ranking.
- Unforgiving on spelling errors.
- Response times weren't good enough.
- Some of the good data is in the item record, but usually not well used.
Andrew quoted Roy Tennant saying that the library catalog "should be removed from public view."
Catalogs are going to change more frequently than they have in the past, and have to adapt to new technology, including different kinds of screens. They need to be flexible.
The "next gen" catalog is really responding to "this gen" users. (By the time we get to "next" we'll be waaaaaay behind.)
Data Reality Check
In the "old" catalog, 80 MARC fields were indexed in the keyword index -- 33 of those are not publicly displayed. There are 37 different labels in the display. In Endeca they indexed 50 MARC fields.
Simple data are the best. They were thinking of going to XML, but Endeca preferred a flat file, basically a display form of the MARC record. They removed punctuation.
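To make the "flat file" idea concrete, here is a minimal sketch of flattening a MARC-like record into labeled display lines with the trailing punctuation stripped. The tag-to-label mapping, the `Label|value` layout, and the sample record are all assumptions for illustration; the notes do not say what NCSU's actual export looked like.

```python
import re

# Hypothetical mapping from MARC tags to display labels (not NCSU's real one).
FIELD_LABELS = {"100": "Author", "245": "Title", "260": "Publisher", "650": "Subject"}

# ISBD-style punctuation that tends to trail MARC subfields.
TRAILING_PUNCT = re.compile(r"[\s/:;,.]+$")


def clean(value: str) -> str:
    """Collapse internal whitespace and strip trailing ISBD punctuation."""
    return TRAILING_PUNCT.sub("", " ".join(value.split()))


def flatten(record: dict) -> list:
    """Turn {tag: [[subfield values], ...]} into 'Label|value' lines."""
    lines = []
    for tag, label in FIELD_LABELS.items():
        for subfields in record.get(tag, []):
            parts = [clean(s) for s in subfields]
            value = " ".join(p for p in parts if p)
            if value:
                lines.append(f"{label}|{value}")
    return lines


# Made-up sample record, keyed by MARC tag.
record = {
    "245": [["Pride and prejudice /", "Jane Austen."]],
    "100": [["Austen, Jane,", "1775-1817."]],
    "650": [["Courtship", "Fiction."]],
}
flat = flatten(record)
for line in flat:
    print(line)
```

A rule this blunt also strips the period from a trailing initial ("Smith, John Q." becomes "Smith, John Q"), so a real export would likely need field-specific handling.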
With the Endeca system they were able to re-index their entire database every night, without bringing down the existing system. This meant that they were able to tweak relevance algorithms many times to get it right. (How many of us don't think of a "re-index" as a two-week job?) This kind of ability to manipulate the data makes a huge difference in how we can perfect what we do.
Andrew then gave the usual, impressive demo of the NCSU catalog, and the facets. It's easy to see how far superior this is to the standard library catalog.
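For readers who haven't seen a faceted interface, the core mechanism can be sketched in a few lines: count the values of each dimension across the current result set, show those counts as links, and filter when one is chosen. The records and dimension names below are made up, and this says nothing about how Endeca actually implements it.

```python
from collections import Counter

# Made-up records; the dimensions are illustrative, not NCSU's actual facets.
RECORDS = [
    {"id": 1, "format": "Book", "subject": "History", "language": "English"},
    {"id": 2, "format": "Book", "subject": "Science", "language": "English"},
    {"id": 3, "format": "DVD", "subject": "History", "language": "French"},
    {"id": 4, "format": "Book", "subject": "History", "language": "French"},
]


def facet_counts(records, dimensions):
    """Count how many records in the result set carry each value of each dimension."""
    return {dim: Counter(r[dim] for r in records if dim in r) for dim in dimensions}


def refine(records, dimension, value):
    """Narrow the result set to records matching one chosen facet value."""
    return [r for r in records if r.get(dimension) == value]


# The user picks subject = History; the remaining facets are recounted.
history = refine(RECORDS, "subject", "History")
counts = facet_counts(history, ["format", "language"])
print(counts)
```

The counts are recomputed after every refinement, which is why the facet lists shrink as the user narrows the set.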
How to Relevance Rank
Slide: Relevance ranking TF/IDF not adequate (Andrew, what does TF/IDF mean?)
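For reference, TF/IDF usually stands for "term frequency / inverse document frequency": a term scores higher the more often it appears in a record and the rarer it is across the whole collection. The sketch below uses one common textbook variant (relative term frequency times the natural log of N/df); the tiny corpus is made up, and nothing here is taken from the slide.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Score one term in one document against a corpus of tokenized documents."""
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[term] / len(doc_tokens)   # term frequency
    df = sum(1 for doc in corpus if term in doc)        # document frequency
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)              # tf * idf


# Made-up "records", each just a handful of title words.
corpus = [
    ["american", "revolution", "history"],
    ["history", "of", "science"],
    ["world", "history", "atlas"],
]

common = tf_idf("history", corpus[0], corpus)      # appears everywhere
rare = tf_idf("revolution", corpus[0], corpus)     # appears in one record
print(common, rare)
```

One plausible reading of "not adequate" is visible even here: bibliographic records are so short that a term's frequency within a record is almost always zero or one, so the measure has little to work with.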
Basically, we haven't really figured out how to do ranking with library metadata. The NCSU catalog used some dynamic ranking (phrase, rank of the field, weights), plus static ordering based on pub date.
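A rough sketch of what "dynamic ranking plus static ordering" could look like: score each record by which field matched and whether the query matched as a phrase, then break ties by publication date, newest first. The field weights, the phrase bonus, and the sample records are all hypothetical; the talk did not give NCSU's actual values.

```python
# Hypothetical weights: a match in the title counts for more than one in the notes.
FIELD_WEIGHTS = {"title": 3.0, "subject": 2.0, "notes": 1.0}
PHRASE_BONUS = 2.0  # extra credit when the query words appear together, in order


def dynamic_score(query, record):
    """Sum field weights for fields containing all query terms, plus a phrase bonus."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    phrase = " ".join(terms)
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        text = record.get(field, "").lower()
        words = text.split()
        if all(t in words for t in terms):
            score += weight
            if phrase in text:  # simplification: plain substring test
                score += PHRASE_BONUS
    return score


def rank(query, records):
    """Order hits by dynamic score, then by publication year (newest first)."""
    scored = [(dynamic_score(query, r), r) for r in records]
    hits = [(s, r) for s, r in scored if s > 0]
    hits.sort(key=lambda sr: (-sr[0], -sr[1]["year"]))
    return [r for _, r in hits]


# Made-up records.
RECORDS = [
    {"id": "A", "title": "Civil war history", "year": 1990},
    {"id": "B", "title": "War and civil society", "year": 2005},
    {"id": "C", "title": "The civil war", "year": 2001},
    {"id": "D", "title": "Reconstruction", "subject": "Civil war", "year": 2010},
]
ranked_ids = [r["id"] for r in rank("civil war", RECORDS)]
print(ranked_ids)
```

Because every number in a scheme like this is a judgment call, being able to re-index and re-test quickly matters a great deal when tuning it.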
Andrew gave some interesting statistics about use of the catalog:
- 2/3 of users do a plain search and don't use refinements.
- 25% do a search plus some navigation.
- 8% do pure navigation (browse, no query), mainly looking at the "new books" option.
The two most frequently used "options" are LC classification and subject headings. Subject-based navigation is nearly 1/2 of the navigation. It doesn't appear that the order of the dimensions (facets) determines usage: the statistics from the NCSU catalog show that users are selecting facets that appear lower on the page.
Most searches in the catalog are keyword searches. Subject searches are very small (4%). Author searches only 8%. [Note: later in the day, someone suggested that the committee should gather stats about actual use of catalogs to inform the discussion. Duh!]
The definition of "most popular" (which is an option selected 12% of the time) is based on circulation figures. Call number, title, and author searches are each used about equally, around 10% apiece.
We still have a natural language problem -- and LCSH isn't very good for this. Andrew gave the example of the common term "Revolutionary War" vs. an LC subject heading that reads: United States--History--Revolution, 1775-1783. [Look this up in any library catalog -- the dates vary so it's really hard to tell what subject heading defines this topic.]
The new discovery tools point out inadequacies in the data. What could replace LCSH? User tagging is interesting, but there's the difficulty that the same tag gets added to many items, and the retrieved set is huge.
Will we be able to make sense out of full text? Right now our store of digital materials is incomplete so it is very hard to draw any conclusions from the full text works that we have.
Andrew presented a wish list:
- A faceted classification system, or one that enables faceted navigation.
- A work identifier for books and serials.
- Something other than LC name authority for organizations (publishers, licensors, etc.).
- Physical descriptions that help libraries send books to off-site shelving and to patrons' mailboxes. We used to care about the height of books and whether they would fit on a shelf. Now we need to care about width and weight, for shelving and mailing.
- Something other than MARC in which to encode all of the above.
- Systems that can use this encoding.