Friday, March 30, 2007

Podcast on Libraries and Standards

I was interviewed by Scott Mace for his Open Source Conversations podcast. Although the title says "Libraries and Standards," we in fact ranged over a number of subjects: DRM, restrictions on the use of Google Books content, Library 2.0 (although we never call it that; we talk about web services and taking the library to the user's work space), and the future of NISO and library standards.

As is always the case in this kind of interview, listening to it afterward I cringe to hear things that I got wrong, or where I mis-spoke. If you've gone through the interview process you know how hard it is to be both spontaneous and correct. Feel free to point out errors -- this can be the errata page.

Friday, March 23, 2007

There's always something new

I was looking over the impressively improved RDA Part A Chapter 3. I still have some issues (natch!), but it's clear that there has been a lot of thought about data elements and how they differ from simple strings of text.

One of the areas that needs to be thought through is how RDA will be able to change over time, something that is often called "extensibility" of a data standard. I was looking, for example, at the carrier types (3.3.0.2.2), trying to think of what future technology might not be covered. Then I ran into it at the office supply store, where I was replenishing my store of brightly colored sticky notes: you can now purchase tax software on a thumb drive.

(Note: of the carriers listed in RDA A/3 there is something called a "computer chip cartridge". This could be the term that would be used for the thumb drive, but the only examples I can find relate to Nintendo cartridges. So I'm going to pretend, for the purposes of this discussion, that thumb drives aren't covered. Even if they are, something else new will come along, and probably soon.)

RDA has a list of carriers, broken up into large categories like "audio carriers" and "computer carriers." If the item you have in hand isn't on the list, then you are instructed to use "other audio carrier," "other computer carrier," etc. This means that anything really new will end up in the "other" category, which isn't terribly useful. It also means that something coded as "other" today will have to be updated when the list catches up, and in the meantime a search on "other computer carrier" will bring up a list of items that may be very different from each other. So there needs to be a way that such lists can be extended quickly, even in a provisional way, to keep up with this fast-changing world. There also needs to be a way for people in the field to find out that the list has been updated.

There are many different ways that you can develop extensibility for a set of data elements. The main thing is that you want the newly minted term to have a clear context (what list does it belong to?), and you want to be able to get people to the definition of the term when they encounter it. In this case, the context is that it is a carrier of information, and it is specifically a new kind of computer carrier. It is also extending an existing list, say, the RDA carrier list.

Let's pretend that we have a registry of terms. And let's pretend that the registry has some management mechanism, such as a small group of participants that oversees the various lists in the registry (so it's not total anarchy). Our thumb drive could be added such that:

http://authoritylists.info/RDA:carrier:computer_carrier:USB_flash_drive

returns this information in a machine-readable format:

owner: RDA
list: carrier
sublist: computer carrier
element: USB flash drive
status: provisional
date added: 2007-03-30
description: "USB flash drives are NAND-type flash memory data storage devices integrated with a USB (universal serial bus) interface." (quotes because I took that from Wikipedia, but generally the expert adding the term would write a suitable description.)
synonyms: thumb drive, jump drive, flash drive

The many products based on RDA would make use of the registry to support the creation of new records and the reading of existing records. With some periodicity, these systems would check that their lists are up to date (like the automatic update of virus lists in your anti-virus software). Such a system could decide that provisional entries would be flagged in some way (maybe they would show up as red on the screen). Or a system receiving a record with a previously unknown item in an authority list could quickly grab the description from the registry and use that to provide services, like definitions and synonyms, to its users.
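
To make that concrete, here is a rough sketch of what the client-side check might look like. Everything in it is hypothetical: the registry URL, the assumption that it answers in JSON, and the field names (taken from the example record above) are all just for illustration.

import json
import urllib.request

# Hypothetical registry endpoint for the RDA computer carrier sublist.
REGISTRY_URL = "http://authoritylists.info/RDA/carrier/computer_carrier"

def fetch_registry_terms(url=REGISTRY_URL):
    """Fetch the current terms; assumes the registry answers in JSON."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)  # e.g. a list of records like the one above

def refresh_local_list(local_terms, registry_terms):
    """Merge registry terms into the local cache, flagging provisional ones."""
    for term in registry_terms:
        local_terms[term["element"]] = {
            "definition": term.get("description", ""),
            "synonyms": term.get("synonyms", []),
            # A cataloging interface might show provisional terms in red,
            # as suggested above.
            "provisional": term.get("status") == "provisional",
        }
    return local_terms

# Run on whatever schedule suits the system (nightly, weekly), much like
# the automatic update of virus definitions:
# carrier_terms = refresh_local_list(load_cached_terms(), fetch_registry_terms())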

OK, I'm sure that there are geekier folks out there who could (and hopefully will) point out what parts of this don't work, but I'm mainly interested in exploring the general concept: can we get away from text lists and create something that is dynamic, machine-actionable, and useful?

Thursday, March 22, 2007

ARL statistics and the increase in holdings

OCLC Research and ARL have published some interesting data in Changing Global Book Collection Patterns in ARL Libraries. The thrust of the paper is about the pattern of holdings of foreign publications. However, what I grabbed onto in the first few pages were the figures on titles and holdings in general. Figure 1 on page 2 shows the number of records in OCLC for all books published before 1980 and after 1980:
Pre-1980: 15,222,793
1980-2004: 13,210,095

Basically, the number of titles for all books before 1980 is only barely greater than the number of titles in the 24 years since 1980.

Figure 4 shows numbers of holdings in the entire OCLC catalog:
Pre-1980: 330,291,378
1980-2004: 388,721,240
ARL libraries aren't as top-heavy in the 1980-2004 level as are other OCLC libraries, which makes sense. But what this does tell me is that libraries are facing a huge increase in volumes that they must manage, and it must be because there is a lot more being published.

Wikipedia reports these figures for the number of titles published per year in the UK and US. Presumably this is book titles only:
  1. United Kingdom: 107,263 (1996); 206,000 (2005)
  2. United States: 68,175 (1996); 172,000 (2005)
Note the near doubling in the UK (206,000 / 107,263 ≈ 1.9) and the near trebling in the US (172,000 / 68,175 ≈ 2.5). This may not be news, but I have never before seen figures that confirmed my general feeling that it's all speeding up.

Monday, March 19, 2007

Unintended consequences

I was working at the University of California in the early 1980's when the UC union catalog, MELVYL, was developed. Shortly after MELVYL became available as one of the early online public access catalogs, we obtained a copy of NLM's Medline database, consisting of articles and books in a wide variety of fields related to medicine. This was the first article database that we were making available. At that time, the only people who had access to Medline were medical researchers in the four or five medical centers at the university, and they accessed it via the Dialog search service. Dialog charged by the minute and was relatively pricey. Few members of the university community had access to Dialog's databases in any subject area because of the cost.

I don't know how many minutes or hours of searching were done monthly on Medline before we added the database to the university's library system, but within a few years the number of searches on Medline was rivaling the number of searches in the entire union catalog of the 9 university campuses. It was heavily used even on those campuses that had no medical school. Had everyone developed a sudden interest in the bio-medical sciences? Perhaps, but not to that degree. I think that we had created a monster of "availability." As the only freely available online database of articles, Medline became the one everyone searched.

(If anyone is looking for a master's thesis, try looking at the citations in dissertations granted at the University of California for the period 1985-1995, and compare them to the previous decade. Count how many of the cited articles come from journals that are indexed in NLM's database. I have a feeling that it will be possible to see the "Medline-ization" of the knowledge produced by that generation of scholars, from architecture to zoology.)

When we make materials available, or when we make them more available than they have been in the past, we aren't just providing more service -- we are actually changing what knowledge will be created. We tend to pay attention to what we are making available, but not to think about how differing availability affects the work of our users. We are very aware that many of us are searching online and not looking beyond the first two screens, which produces an idiosyncratic view of the information universe. But we don't see when libraries create a similar situation by making certain materials more available than others (for example scanning all of their out-of-copyright works and making them available freely as digital texts, while the in-copyright books remain on the shelves of the library).

There's a discussion going on at the NGC4LIB list about the meaning of "collection." What is a library collection today? Is it just what the library owns? Is it what the library owns and also what it licenses? Does it include some carefully selected Internet resources? Some have offered that the collection is whatever the library users can access through the library's interface. I am beginning to think that access is a tricky concept and it is inevitably tied up with the realities of a library collection. Users will view the library's collection through the principle of least effort. In the user view, ease of access trumps all -- it trumps quality, it trumps collection, it trumps organization. So we can't just look at what we have -- we have to look at what the user will perceive as what we have, and that perception will necessarily be tempered by effort and attention. To our users, what the library has will be what is easiest to locate and fastest to arrive.

In other words, our collection is not a quantity of materials. The collection is a set of services built around a widely divergent set of resources. To the user, the services are the library, especially because any one user will see only a tiny portion of what the library has to offer. The actual collection -- those thousands or millions of library-owned items -- is not what the user sees. The user sees the first two screens of any search.

Hopefully, they are not in main-entry alphabetical order.

Saturday, March 17, 2007

Users and Uses - The Official Summary

It probably helps to be an insider. This report on the LoC page for the Working Group does a better job of bringing together the messages of that day than I did.
Based on the meeting presentations and comments, two main information user and use environments for bibliographic data are apparent: a consumer environment and a management environment. The consumer environment relates to the end-user of the bibliographic data, the information consumer, as described by Karen Markey and Timothy Burke, and services that are designed to assist the end-user in finding relevant information, from search engines to specialized catalog interfaces. The management environment pertains to resource collection management. Although these two environments represent different perspectives of bibliographic data, they are interrelated...

This tension between the user view and the management view is something that I keep coming up against. Whenever the question comes up of how we will define the library catalog of the future, most librarians exhibit an interesting schizophrenia, trying simultaneously to satisfy the library management need of inventory control and the much broader needs of users who simply want the best information now, no matter who owns it. We really must resolve this conflict if we are to move forward.

Friday, March 09, 2007

Users and Uses - Karen's Summary

This meeting was announced about two weeks ago, catching many of us by surprise. As I noted in my setup post, the meeting was originally intended to have an "invitation only" audience. The switch to "open to the public" may have come late in the planning. There were about 50 people there, most of them from the immediate area. The members of the LoC committee were also there.

This committee has a huge task: to define the future of "bibliographic control." No one defined the term bibliographic control during this meeting, and in fact it was rarely voiced as a term. That may be for the better, because it describes something that libraries have traditionally done, and at least some people are suggesting that we shouldn't do it in the future. Thus the "future of bibliographic control" may be an oxymoron.

By the end of the day, however, none of us in the audience could have made a clear statement about the day's topic. The speaker who seemed most on track (and who was the most interesting, IMO) was Timothy Burke, professor of history. Burke talked about how he searches for information, but most importantly he talked about why he searches for information. Some examples he gave were:
  • to find a single book to read on a topic that is current and popular
  • to find a book that is 1) in print 2) affordable 3) that he can teach to
  • to confirm a memory (what was that author's first name? did the title say "about" or "of"??)
He also talked about the sociology of knowledge: needing to know who is authoritative, what work has influenced other work, what the opposing camps are in a field, and how a line of thought has developed. In the discussion afterward, libraries were described as static while information is social and dynamic. Later, Lorcan Dempsey summarized this with a concept from Eric Hellman: the difference between lakes and rivers. Libraries are lakes: a little comes in, a little goes out, but it pretty much stays the same. Information as we use it in the networked world is a river: fast moving, and you never step into the same place twice. Burke offered that libraries could decide that they will specialize in the static, stable part of our information use and leave the rest to others, but he acknowledged that would not be a good idea. (Unfortunately, that is our status today.)

When Andrew Pace showed the NCSU Endeca catalog, I could see some of Burke's dynamism taking place in the ability to get the information from different angles. Pace, however, began his talk by explaining that the catalog was designed to work with the data that they had, that is, standard MARC records. See his wishlist at the end of his talk.

Two speakers made specific comments about problems with MARC. Bernie Hurley showed that much of the detail of MARC is never used, and at the same time that creating indexes from the used fields is very complex because the data for a single index is scattered over many fields and subfields. (Think "title.") Oren Beit-Arie of Ex Libris had a list of MARC problems, including the resource types (scattered throughout the LDR, 006, 007, and 008), uniform titles (which he thinks are unnecessary), and internationalization, which MARC handles insufficiently.

There was some interesting discussion about full text. Dan Clancy, of Google Book Search, talked about the difficulties of doing ranking with full text books. He stated that Google does not organize web information - the web contains its own organization in the form of links and link text, which give you both the connection between documents and the context for that connection. The main revelation in this talk for me was that they are experimenting with full text scans for de-duplication. This is intriguing when you think about how you could map "likeness" when you have the full text and the images in a large body of books.

Some brave statements were made:

Burke: we may have to forget about backward compatibility
Hurley: we have to simplify the MARC format
Pace: we need a faceted classification system, not the "facets" in LCSH
Beit-Arie: we need to split the end-user services from the back-office systems

It was a provocative day, and although there wasn't a lot that was really new it was interesting to see that there is some commonality of thought coming from what are essentially different perspectives. As I process more of this, I will add ideas from this day to the futurelib wiki so we can work with them there.

Wednesday, March 07, 2007

Users and Uses, Vendor perspective

Speaker: Oren Beit-Arie

Oren Beit-Arie is Chief Strategy Officer at Ex Libris, one of the key library systems vendors (an ever-diminishing group). Oren created the first OpenURL resolution service to be offered commercially, and has been active in metasearch, electronic resource management, and other ILS developments.

He began by saying: "I'm just glad that there still is something called the library vendor."

Vendors are affected by the changing economics of libraries and the information area.

Libraries are focusing more on their core role, not just their core competencies. Focusing on what they "ought" to do.

Users are:
- library services users
-- end users: differ in community, and have different skills and needs
-- agents/applications: personal productivity tools; the web; courseware
- library management
-- end users
-- agents/applications

Uses:
Discovery & Delivery
Solutions need to take into account other languages and other cultures; differences in workflows.
What we don't do well:
- selection and evaluation
- organization and creation (data mining, citations)

Economics -- we need more mid-level collaboration
Library catalogs and other services are in decline:
- users seek content beyond bundles (books, journals...); looking for more granular information (wikipedia article); catalog was designed for a different paradigm.
-- just fixing the catalog is not enough

There is more content and more content types
Can't be isolated - have to interoperate (and that needs to go both ways)

Role of the library
- connect users to content; be the "last mile"
- can provide services tailored to a specific community (rather than global)
- there are new opportunities for libraries, esp. in research areas

Needs:
- de-silo content
- provide ranking
- provide easy delivery
- enrichment (content; description)
- organization and navigation
- social networking

Challenge:
End-user services are tightly tied to back-office operations; this isn't going to work. The overall architecture has to change. We need to decouple the user experience from the back-office operations.

Role of metadata
- moving away from a metadata-centric model to an object-centric design; metadata shouldn't be the center of the design; services should be the center.
- the back office can be better if it isn't also doing the user service
- as new formats have come in, they've been added to the library management in a way that creates redundancy
- look for more efficiencies in terms of local v. consortial functions. Centralize whatever you can.

What you see is NOT what you get; what you get is NOT what you see. Decoupling complexity and user view.

Parts of MARC that just don't work:
- Resource type (006, 007, LDR); see the sketch after this list
- Uniform titles (240, 130): the practice is problematic; doesn't help grouping
- Collections of separate works (analytics); MARC doesn't have a good structure for this. MAB (the German bibliographic format) does this better.
- Multiple languages; we need a language designator at the field level.
- Need to integrate data from multiple sources, especially in terms of ranking and facets. How can we do this when they have different data?
- We need identifiers (for many different things)
- Role of controlled vocabularies - need more of this.
- e-Books; they are very different from books. Need more analysis.
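
The resource type problem is easy to see in code. Here is a simplified sketch (mine, not Beit-Arie's) of what a system has to do just to answer "what kind of thing is this?", assuming the pymarc library; the type table is abbreviated and the fallback logic is only illustrative.

from pymarc import MARCReader

# Partial mapping for Leader position 06, "type of record."
LEADER_TYPE = {
    "a": "language material",
    "c": "notated music",
    "e": "cartographic material",
    "g": "projected medium",
    "j": "musical sound recording",
    "m": "computer file",
}

def resource_type(record):
    """Guess a coarse resource type by consulting LDR/06, then 007, then 006."""
    rtype = LEADER_TYPE.get(record.leader[6], "unknown")
    # 007 position 0 gives the category of carrier; 'c' = electronic resource.
    for f007 in record.get_fields("007"):
        if f007.data and f007.data[0] == "c":
            rtype += " (electronic)"
            break
    # 006 repeats "additional material characteristics" that can supplement
    # the Leader; shown here only as one more place to look.
    for f006 in record.get_fields("006"):
        if f006.data and f006.data[0] in LEADER_TYPE:
            rtype += " + " + LEADER_TYPE[f006.data[0]]
    return rtype

with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        print(resource_type(record))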

Full text adds a lot to the mix, but it's a very different beast. It isn't clear what the role of metadata is in the full text world. There seems to still be room for manual processes to clarify semantics. How can libraries benefit from full text without taking on the whole expense of storage and organization?

Lower barriers have a better chance for success; but some radical change can be handled.

Users and Uses, Research Libraries

Speaker: Bernie Hurley

Bernie Hurley is the Director for Library Technology at the UC Berkeley library.

Opening screen: ;-)

Title:
245 00
$aBibliographic control
$h[electronic resource]
$bA perspective from a Research Library
$cBernard J Hurley

Most of what bibliographic control is to libraries is: MARC.

"I'm desensitized to MARC. Or thought I was; I actually have some deep feelings about it. The title [of this talk] is a metaphor."

Metadata Needs for Research Libraries
- purpose of university is to confer tenure. This means: teaching, research, and publishing.

Metadata is used for
-discovery
-identification (also for inventory)
- delivery

How we index things:
2/3 of searching is done in 3 indexes: title keyword, personal author, subject keyword.
Limiting by "location" is the most frequently used limit in the UCB catalog.

We are maintaining access points that are rarely used. This is a question of where we put our resources -- we should put more energy into keyword indexes.

MARC is not only encoding, but what we encode. MARC 245 has information about the title, but also information about the author, dates, medium form, version. This makes indexing complex, as indexes pull from individual subfields all over the record.
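
As a rough illustration of that point (my sketch, not Hurley's), here is what assembling a single "title keyword" index entry can involve, assuming the pymarc library; the choice of fields and subfields below is just a sample of what typically feeds such an index.

from pymarc import MARCReader

# A sample of the fields/subfields that commonly feed a title keyword index.
TITLE_SOURCES = {
    "245": ["a", "b", "p"],   # title statement
    "240": ["a"],             # uniform title
    "246": ["a", "b"],        # varying form of title
    "740": ["a"],             # added entry: uncontrolled title
}

def title_keywords(record):
    """Gather title words scattered across many fields into one bag of keywords."""
    words = set()
    for tag, codes in TITLE_SOURCES.items():
        for field in record.get_fields(tag):
            for value in field.get_subfields(*codes):
                words.update(value.lower().strip(" /:;.,").split())
    return words

with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        print(sorted(title_keywords(record)))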

Display
simple displays use very few fields.
Our catalog displays 75 of the 175 MARC fields; it maps those into 27 labels. Display loses a lot of detail.

Delivery:
digital: 856 with the URL works pretty well; but the 856 also has lots of other information
Print: leads users to shelf

There is a mismatch between the richness of MARC and how we serve our users
1) we create many indexes, but catalogs use only a few
2) we have to dismember MARC to create our indexes (fields don't correspond to indexes)
3) the "meaning" of MARC is not being translated to library users.

Can we make it work harder? Maybe MARC isn't the *right* metadata. (Oh, horrors!)

It's expensive to create MARC records. It's expensive to create the MARC format. MARC sucks up all of the resources available for metadata creation. At Berkeley, the technical services staff doesn't have time to do metadata creation for the digital library, so the digital library is setting up its own metadata creation function.

The UC Bibliographic Services Task Force Report
- Enhancing Search and Retrieval
- Ranking; better navigation of search results; better searching for non-Roman materials
- Recommenders; customization

MARC isn't flexible - it's hard to integrate new metadata into MARC.
Things like faceted browsing, full indexing, etc. are hard to do with MARC
We need to radically simplify MARC - we aren't using most of it. It could be used with other metadata, like DC, ONIX, LOM. METS already packages these together. It's not just MARC anymore.

Best quote:
"Research libraries are spending a fortune on creating metadata that is mismatched to our users' needs."

New services make the MARC mis-match worse; we can't fit new stuff into MARC.

Users and Uses, Google Scholar

Speaker: Anurag Acharya

Anurag Acharya is the Principal Engineer at Google Scholar.

Anurag could not be here so Dan Clancy from Google Book Search is taking Anurag's place.

In the early Internet days, Yahoo started out emulating a traditional catalog with its subject categories, but people seem to prefer the search method. The search method works because the web itself provides the organization through links. Google doesn't organize the web; instead, it makes use of the organization that web pages provide. Google makes heavy use of the anchor text that defines links. These anchors provide the meaning behind the link; essentially, aboutness. A link is an assertion about the relationship. It is also a kind of metadata.
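
Here's a tiny illustration (mine, not Google's code) of the idea: collect the words other pages use when they link to a target, and the target becomes searchable by words it may never use about itself. The pages and anchors below are invented.

from collections import defaultdict

# Hypothetical (source page, anchor text, target URL) triples.
links = [
    ("blog.example/post1", "a great history of the Revolutionary War", "press.example/book42"),
    ("review.example/r9", "readable Revolutionary War survey", "press.example/book42"),
    ("press.example/home", "our catalog", "press.example/book42"),
]

anchor_index = defaultdict(list)
for source, anchor, target in links:
    # The anchor text describes the *target*, so index it under the target.
    anchor_index[target].extend(anchor.lower().split())

print(anchor_index["press.example/book42"])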

Google Book Search currently relies on things like the title for ranking, not links. On the web, people consider search to work well, but without those links, search is not a "solved problem."

One of the questions that a system like Google must address is: What is an object?
The answer is not simply: "a web page is an object." There are many "same" pages on the web, so even the web needs to be de-duped. How do you determine "sameness"? It's not pure equivalence; sameness is a fuzzy function. In the end, things are determined to be "effectively equivalent."

Apply this to books. It depends on the context. Google needs to algorithmically determine equivalence.
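
To make "effectively equivalent" concrete, here is a toy sketch (mine; the talk gave no algorithm) that treats two book records as the same when their normalized titles and authors overlap enough.

import re

def normalize(text):
    """Lowercase, drop punctuation, and return a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    """Jaccard overlap between two word sets: 0 = disjoint, 1 = identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def effectively_equivalent(rec1, rec2, threshold=0.8):
    """Fuzzy sameness: a weighted blend of title and author similarity."""
    title_sim = similarity(normalize(rec1["title"]), normalize(rec2["title"]))
    author_sim = similarity(normalize(rec1["author"]), normalize(rec2["author"]))
    return 0.7 * title_sim + 0.3 * author_sim >= threshold

a = {"title": "Moby-Dick, or, The Whale", "author": "Melville, Herman"}
b = {"title": "Moby Dick; or The whale", "author": "Herman Melville"}
print(effectively_equivalent(a, b))  # True: same work despite different strings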

Authority: Who is an Expert?
Authority used to be easier to determine -- professors, where they work, what degrees they have. Doesn't work on the web. He calls the web a "democracy." The only way to get authority is to take advantage of the masses; there's too much stuff for you to be able to make determinations any other way.

The cost of asserting opinions determines value: it costs more to maintain a web page than to write a blog; it costs more to write a blog than to tag photos in flickr.

Searching things other than the web.
- number of objects is smaller than the number of pages on the web. Do the same lessons apply?
Some do; some don't.
Google started at the other end from library catalogs: keywords and relevance ranking.
"Full text and keywords is not always the answer." But it is part of the answer.

How to decide?: Listen to your users
Let the users tell you what they want to do.
That doesn't mean that you can't also serve minority groups. (kc: This implies that "average" users like Google Scholar, while specialist users prefer library or vendor databases.)

Clancy gives some examples in a demo:
- "searches related to" at the bottom of the page
- "Refine results" at the top of the page, with aspects or synonyms. This is done with human tagging, and authoritative users.
- spelling correction

Google Book Search
Features:
- work level clustering (FRBR expression, not work)
- find it in a library
- Libcat results (a full catalog search)
- Integration into regular Google search results
-
Metadata problems are a big issue (?). He didn't say much to support this (we should get him to elaborate).

Duplication
How do you determine if these two books are the same? (Two books from different libraries)
It's easier to figure out once you've scanned the same book twice. (This implies that they use the scans to determine duplicates. Intriguing idea!)

How do you create an ontology of non-web objects?
- FRBR
- References
- authorship
- criticism and review
...

Lessons learned:

Discovery must be universal - if you can't find it, you can't access it

Make the common case easy -
especially time to task completion
the faster things work, the more often users come back
complex interfaces take longer to learn

Ranking can adapt to many problems

A consistent experience is crucial

Metasearch of diverse sources is a dead end.
- ranked lists are hard to merge, even when you know the ranking functions (see the sketch after this list)
- can't do whole-corpus analyses
- speed is defined by the slowest search
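
A small worked example of the first point (mine, not Clancy's): two sources return scores on different scales, so a naive merge is dominated by whichever source uses bigger numbers, and even a crude per-source normalization leaves the combined order essentially arbitrary.

catalog_hits = [("Book A", 0.92), ("Book B", 0.55)]       # scores on a 0..1 scale
article_hits = [("Article X", 41.0), ("Article Y", 7.3)]  # arbitrary engine scores

# Naive merge: sort by raw score -- the article source always "wins."
naive = sorted(catalog_hits + article_hits, key=lambda hit: hit[1], reverse=True)
print([title for title, score in naive])
# ['Article X', 'Article Y', 'Book A', 'Book B']

def min_max(hits):
    """Rescale one source's scores to 0..1 so they at least look comparable."""
    scores = [s for _, s in hits]
    lo, hi = min(scores), max(scores)
    return [(t, (s - lo) / (hi - lo) if hi > lo else 0.5) for t, s in hits]

merged = sorted(min_max(catalog_hits) + min_max(article_hits),
                key=lambda hit: hit[1], reverse=True)
print([title for title, score in merged])
# Both top items now score exactly 1.0; the tie-break is arbitrary, which is the point.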

Google Books is an opportunity to help users
"We have the opportunity to help users find this wealth [in libraries]"

Roy asks: what do *you* mean by metadata?
"Things that describe this book."

Person from Bowker:
ISTC is coming along; will Google use it?
Clancy: sure, if it helps users.

Lorcan:
Do they have any authority files for persons, places, etc.
Clancy: We probably do not use authority files to the extent we should. We mainly work with the text.

Users and Uses, New Services

Speaker: Andrew Pace

Andrew Pace is Head of Information Technology, North Carolina State University Libraries, the folks who created one of the first faceted library user interfaces using Endeca technology.

Title: The Promise and Paradox of Bibliographic Control

NCSU's catalog

Pace starts off with "Rumsfeld's law" (which he claims he will now retire): You search the data you have, not the data you want to have. (I didn't get that right - Andrew, please correct.)

The now famous NCSU/Endeca catalog was designed to overcome some "regular" library catalog problems:
- known item searching works well, but topical searches don't.
- No relevance ranking.
- Unforgiving on spelling errors.
- Response times weren't good enough.
- Some of the good data is in the item record, but usually not well used.

Andrew quoted Roy Tennant saying that the library catalog "should be removed from public view."

Catalogs are going to change more frequently than they have in the past, and will have to adapt to new technologies and different kinds of screens. They need to be flexible.

The "next gen" catalog is really responding to "this gen" users. (By the time we get to "next" we'll be waaaaaay behind.)

Data Reality Check

In the "old" catalog, 80 MARC fields were indexed in the keyword index -- 33 of those are not publicly displayed. There are 37 different labels in the display. In Endeca they indexed 50 MARC fields.

Simple data are best. They were thinking of going to XML, but Endeca preferred a flat file, basically a display form of the MARC record, with punctuation removed.
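
Here is my guess at the general shape of that flattening step, assuming the pymarc library; the field choices and the tab-delimited layout are mine, not NCSU's actual ingest code.

import string
from pymarc import MARCReader

PUNCT = str.maketrans("", "", string.punctuation)

def clean(value):
    """Strip punctuation and collapse whitespace."""
    return " ".join(value.translate(PUNCT).split())

def first_value(record, tag, *codes):
    """Concatenate chosen subfields from the first occurrence of a field."""
    fields = record.get_fields(tag)
    return " ".join(fields[0].get_subfields(*codes)) if fields else ""

def flatten(record):
    """One tab-delimited display-form row: title, author, subjects."""
    title = clean(first_value(record, "245", "a", "b"))
    author = clean(first_value(record, "100", "a"))
    subjects = "; ".join(clean(" ".join(f.get_subfields("a", "x", "y", "z")))
                         for f in record.get_fields("650"))
    return "\t".join([title, author, subjects])

with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        print(flatten(record))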

With the Endeca system they were able to re-index their entire database every night, without bringing down the existing system. This meant that they were able to tweak relevance algorithms many times to get it right. (How many of us don't think of a "re-index" as a two-week job?) This kind of ability to manipulate the data makes a huge difference in how we can perfect what we do.

Andrew then gave the usual, impressive demo of NCSU catalog, and the facets. It's easy to see how far superior this is to the standard library catalog.

How to Relevance Rank

Slide: Relevance ranking TF/IDF not adequate (Andrew, what does TF/IDF mean?)

Basically, we haven't really figured out how to do ranking with library metadata. The NCSU catalog used some dynamic ranking (phrase, rank of the field, weights), plus static ordering based on pub date.
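
(TF/IDF is term frequency / inverse document frequency, the classic keyword-weighting scheme from information retrieval; with records this short it has little to work with.) Here is a toy sketch of the kind of ranking described: a dynamic score from field weights and phrase matches, with publication date as the static tie-breaker. The field names and weights are invented; this is not NCSU's algorithm.

# Invented field weights; a title hit counts more than a notes hit.
FIELD_WEIGHTS = {"title": 3.0, "subject": 2.0, "notes": 1.0}

def dynamic_score(query, record):
    """Weight each term hit by the field it occurs in; bonus for a phrase match."""
    terms = query.lower().split()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        text = record.get(field, "").lower()
        score += weight * sum(text.count(term) for term in terms)
        if query.lower() in text:  # exact phrase appears in this field
            score += 2 * weight
    return score

def rank(query, records):
    """Sort by dynamic score, then by publication date (newest first)."""
    return sorted(records,
                  key=lambda r: (dynamic_score(query, r), r.get("pub_date", 0)),
                  reverse=True)

records = [
    {"title": "Revolutionary War narratives", "subject": "United States History", "pub_date": 1998},
    {"title": "The American Revolution", "subject": "Revolutionary War", "pub_date": 2005},
]
for r in rank("revolutionary war", records):
    print(r["pub_date"], r["title"])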

Andrew gave some interesting statistics about use of the catalog:
2/3 of users do a plain search, don't use refinements.
25% doing search and some navigation
8% are doing pure navigation (browse, no query); mainly looking at "new books" option.

The two most frequently used "options" are LC classification and subject headings. Subject-based navigation is nearly 1/2 of the navigation. It doesn't appear that the order of the dimensions (facets) determines usage. The statistics from the NCSU catalog show that users are selecting facets that appear lower on the page.

Most searches in the catalog are keyword searches. Subject searches are very small (4%). Author searches only 8%. [Note: later in the day, someone suggested that the committee should gather stats about actual use of catalogs to inform the discussion. Duh!]

The definition of "most popular" (which is an option selected 12% of the time) is based on circulation figures. Call number search, title and author search are used at about the same amount, each around 10%

We still have a natural language problem -- and LCSH isn't very good for this. Andrew gave the example of the common term "Revolutionary War" vs. an LC subject heading that reads: United States--History--Revolution, 1775-1783. [Look this up in any library catalog -- the dates vary so it's really hard to tell what subject heading defines this topic.]

The new discovery tools point out inadequacies in the data. What could replace LCSH? User tagging is interesting, but there's the difficulty that the same tag gets added to many items, and the retrieved set is huge.

Will we be able to make sense out of full text? Right now our store of digital materials is incomplete so it is very hard to draw any conclusions from the full text works that we have.

Andrew presented a wish list:
- A faceted classification system - or one that enables faceted navigation.
- A work identifier for books and serials.
- Something other than LC name authority for organizations (publishers, licensors, etc.)
- Physical descriptions that help libraries send books to off-site shelving and to patrons' mailboxes. We used to care about the height of books and whether they would fit on a shelf. Now we need to care about width and weight, for shelving and mailing.
- Something other than MARC in which to encode all of the above
- Systems that can use this encoding

Users and Uses, Social Networking

Speaker: Tony Hammond

Tony does technology development with the Nature Publishing Group, was on the NISO OpenURL standard committee, and is the creator of Connotea, a social tagging system for scientists.

Talk title: Agile Descriptions

Tony's talk was a review of the various Web 2.0 microformats available. He refers to this as "Rivers of Metadata."

He distinguishes between "Markup of documents (semantics)" and "Exposed metadata (microformats)"

Markup includes:
- Embedded metadata (hidden from the user); on the Web this is in the HTML, and may be in comments.
- Embedded metadata in PDFs, in JPEGs, in MP3 files.

Exposed metadata (microformats) includes:
- A way to qualify content that is displayed on a web page. Because the metadata is exposed it is less prone to abuse (i.e. for search engines)
- "design patterns with semantics"

Exposed metadata could replace custom APIs for metadata exchange. Pages that are marked up with microformats can be turned into RDF for use in the semantic web.
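
As a small illustration of "exposed metadata," here is a sketch (mine, not from the talk) that pulls hCard properties out of a marked-up page with Python's standard HTML parser; the class names "vcard," "fn," and "org" come from the hCard microformat, and the sample page is invented.

from html.parser import HTMLParser

SAMPLE = ('<p class="vcard">Reviewed by <span class="fn">Timothy Burke</span>, '
          '<span class="org">Swarthmore College</span>.</p>')

class HCardExtractor(HTMLParser):
    """Collect the text inside elements whose class names are hCard properties."""
    PROPERTIES = {"fn", "org", "url"}

    def __init__(self):
        super().__init__()
        self._open = []    # for each open element, the hCard property it carries (or None)
        self.found = {}    # property name -> accumulated text

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self._open.append(next((c for c in classes if c in self.PROPERTIES), None))

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        if self._open and self._open[-1]:
            prop = self._open[-1]
            self.found[prop] = self.found.get(prop, "") + data

parser = HCardExtractor()
parser.feed(SAMPLE)
print(parser.found)  # {'fn': 'Timothy Burke', 'org': 'Swarthmore College'}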

Here he goes through various microformats, some of which connect to the kinds of things that Burke was talking about: hCard, hReview, hCite -- which allow one to make connections between things. xFolk for bookmarks. Although all of these can be used to make connections, it isn't always clear what the connection is. This came up in the discussion after Burke's talk: ranking things by popularity can be mis-used... but Burke pointed out that popularity, even if you don't know WHY, tells you something about the sociology of the knowledge.

He describes tags (as in social tagging) as "simple labels" and as personal "aides-memoire." Burke talked about how some of the searching he does is to confirm a memory -- we seem to do this a lot; we leave bread crumbs all over, but generally they don't connect to each other. Microformats are turning into usable bread crumb paths.

Now he's showing a topic map based on the author-assigned keywords from some Nature journals. In the topic map, the tag "pediatric urology" is a larger blob than "urology." He explains this by saying that "tags are created in a context." You can see this with Flickr -- the tags something is given are within the context of the person putting the picture on the site. At the time, they are looking just at that one photo. They aren't making connections in the sense that Burke wanted, and the tags probably only make sense in that context -- but the context is not knowable to anyone but the tagger. The upshot is, however, that a topic map made from tags will not look like a topic map done as a general exercise or using a normal topical hierarchy.

Users and Uses, Research 2

Speaker, Timothy Burke

Note: this was really the stellar talk in this meeting. Not only was this guy the only non-librarian, he was the thoughtful user that we all hope to meet.

Dr. Burke is Associate Professor in the Department of History at Swarthmore College. He wrote a piece in 2004 titled "Burn the Catalog." In it he says:
I’m to the point where I think we’d be better off to just utterly erase our existing academic catalogs and forget about backwards-compatibility, lock all the vendors and librarians and scholars together in a room, and make them hammer out electronic research tools that are Amazon-plus, Amazon without the intent to sell books but with the intent of guiding users of all kinds to the books and articles and materials that they ought to find, a catalog that is a partner rather than an obstacle in the making and tracking of knowledge.

Burke presents himself as "the outsider": an academic, but not in the library or information fields. His talk (excellent) was about how he gets/uses/searches for information. He started with a story about helping a student search in an area with which he wasn't terribly familiar. The topic involved economics, politics, and China. He said that they began with a World Bank report that had some citations. But they needed some context: who is a trusted source in this area? Who is authoritative? They tried the library catalog and LCSH, and finally went to Amazon for a current book on the Chinese economy. Why Amazon? It was the easiest place to find what's new and what people are reading. From there they went into articles with author names, and only then did they turn to Google, because they needed some knowledge about the topic in order to interpret the "torrent of results" that would be retrieved.

How/why he searches: (He has obviously thought about this a lot)

- he very rarely searches in his own area of expertise; he already knows who is writing, what new info is coming out; he only searches to remember something forgotten, to confirm a memory;
- he searches to stretch beyond his area of specialization - something near his area or related topics; "search is a prop to knowledge I already possess." He uses search to "put top layers on foundations I have already built;" what's new? what's most authoritative? Where does this fit in to a larger world?
- "coveting my neighbor's knowledge" -- moving into other fields where he needs knowledge. Searching to "not look stupid," not to claim deep expertise; to be able to begin a conversation; who's big, who's important, who are people talking about in this area? Scholarly information doesn't give you this perspective. This is a conversation, and in the academic world conversations happen at conferences, not in the literature.
- when he is involved in the production of knowledge; when writing a paper or book; searching is citationally specific in this case, and he wants to output citations from it; the search has to be comprehensive; this is the kind of search that library catalogs were built to service.
- serendipitous search. Looking to find something he doesn't know, that he can't produce a connection to. Amazon's 'people who bought this....' "This is what the web has done for me as an intellectual." It has given exploration, not a focused search; the thing that doesn't make sense from the search query.
- syllabus construction - what's in print, how much does it cost, what's teachable? Not the same books as he would use for research.
- searching to help other people do research.

The tools he needs:
- tools that recognize existing clusters of knowledge; if you find a book using LCSH, you probably already knew it existed. Tools that recognize the conversation the book was in: those works written after the book came out that have continued the conversation.
- tools that know lines of descent; chronology of publications; later readers determine connection between texts
- tools that find unknown connections (full text search; topic maps?)
- tools that produce serendipity -- hidden connections.
- tools that inform me of authority
- tools that know about real world usage (those who bought x bought y; how many people checked this out?)
- tools that know about the sociology of knowledge; the pedigrees of authors: who were they trained by, how long ago; how trustworthy is this institution?

What's not out there?
- clusters of contextuality
-
What search can't do and shouldn't try to do: tell me in advance the key words I need to do my searches. A necessary permanent feature is that search is a multi-step practice; search teaches you something.

Users and Uses, Research

Speaker: Karen Markey

Dr. Markey is a professor at the School of Information at the University of Michigan. She recently published an article in D-Lib Magazine on the future of the catalog that had some very provocative statements: "The Online Library Catalog: Paradise Lost and Paradise Regained?" D-Lib Magazine 13, no. 1/2 (2007).

Markey didn't make it. This is a quick review of her paper, which will be posted on the web.

The paper creates a matrix that identifies types of information seekers, from low domain expertise and low procedural knowledge to high expertise and high knowledge. She concludes that the real novices account for 77% of information seekers; the real experts account for .5% of users.

The recommendations are similar to those in her D-Lib article.

Users and Uses, Introduction

Speaker: Deanna Marcum

Marcum is the Associate Librarian for Library Services at the Library of Congress -- in other words, very high up in that organization and known as a real mover and shaker. She is probably pretty much single-handedly responsible for the fact that this group exists and that the discussion is taking place under the auspices of the Library of Congress.

Marcum gave a general introduction, and said that the meeting is being video-captured and may appear as a webcast, but no specific access information was given.

Preamble: Users and Uses Meeting

I'll be attending the meeting of the Bibliographic Control Working Group taking place on March 8, and will blog it here. The impetus for the meeting (and for the two that will follow) can be found in the minutes of the first meeting of the group. In that document, the members discussed:
... the critical importance of ensuring that all voices in the community have a chance to be heard. The Working Group therefore concluded that, rather than planning a single summit meeting on the future of bibliographic control, it would schedule several regional meetings during 2007.
In that first meeting the group was talking about having the regional meetings be "invitation only," but this meeting has been advertised as open to the public. There will be another meeting on May 9 in Chicago on "Structures and Standards for Bibliographic Data," and a final one on July 9 on the East Coast on "Economics and Organization of Bibliographic Data."

It's not clear from the agenda (below) how interactive this meeting will be. It appears to be structured as a series of talks, with two 25-minute sessions for discussion. Let's hope that there is good facilitation and a willingness to open up the discussion to all.

Agenda

Location: Google Headquarters, 1500 Plymouth Street, Mountain View, CA 94043

9:00-9:10 a.m. Deanna Marcum, Introduction

9:10-9:20 a.m. José-Marie Griffiths, The Issues

9:20-10:00 a.m. Karen Markey, Research in Online Library Catalogs

10:00-10:40 a.m. Timothy Burke, Research Using Library Catalogs

10:55-11:35 a.m. Tony Hammond, Social Networking

11:35-12:00 p.m. Public comment, Questions

1:00-1:40 p.m. Andrew Pace, New Library Services

1:40-2:10 p.m. Anurag Acharya, Google Scholar Perspective

2:10-2:50 p.m. Bernie Hurley, Research University Perspective

3:00-3:40 p.m. Oren Beit-Arie, Library System Support Perspective

3:40-4:15 p.m. Questions; Wrap-up

Speakers

  • Deanna Marcum, Associate Librarian for Library Services, Library of Congress

  • José-Marie Griffiths, Working Group Chair; Dean and Professor, SILS, University of North Carolina at Chapel Hill

  • Karen Markey, Professor, School of Information, University of Michigan

  • Timothy Burke, Associate Professor, Department of History, Swarthmore College

  • Tony Hammond, New Technology, Nature Publishing Group

  • Andrew Pace, Head, Information Technology, North Carolina State University Libraries

  • Anurag Acharya, Principal Engineer, Google Scholar

  • Bernie Hurley, Director for Library Technologies, Director of the Northern Regional Library Facility, The UC Berkeley Library

  • Oren Beit-Arie, Chief Strategy Officer, Ex Libris Group