Sunday, July 10, 2016

Catalogs and Context: Part V

Previous posts: Part I, Part II, Part III, Part IV, Interlude

Before we can posit any solutions to the problems that I have noted in these posts, we need to at least know what questions we are trying to answer. To me, the main question is:

What should happen between the search box and the bibliographic display?

Or as Pauline Cochrane asked: "Why should a user ever enter a search term that does not provide a link to the syndetic apparatus and a suggestion about how to proceed?"[1] I really like the "suggestion about how to proceed" that she included there. Although I can think of some exceptions, I do consider this an important question.

If you took a course in reference work at library school (and perhaps such a thing is no longer taught - I don't know), then you learned a technique called "the reference interview." The Wikipedia article on this is not bad, and defines the concept as an interaction at the reference desk "in which the librarian responds to the user's initial explanation of his or her information need by first attempting to clarify that need and then by directing the user to appropriate information resources." The assumption of the reference interview is that the user arrives at the library with either an ill-formed query, or one that is not easily translated to the library's sources. Bill Katz's textbook "Introduction to Reference Work" makes the point bluntly:

"Be skeptical of the information the patron presents" [2]

If we're so skeptical that the user could approach the library with the correct search in mind/hand, why then do we think that giving the user a search box in which to put that poorly thought out or badly formulated search is a solution? This is another mind-boggler to me.

So back to our question, what SHOULD happen between the search box and the bibliographic display? This is not an easy question, and it will not have a simple answer. Part of the difficulty of the answer is that there will not be one single right answer. Another difficulty is that we won't know a right answer until we try it, give it some time, open it up for tweaking, and carefully observe. That's the kind of thing that Google does when they make changes in their interface, but we have neither Google's money nor its network (we depend on vendor systems, which define what we can and cannot do with our catalog).

Since I don't have answers (I don't even have all of the questions) I'll pose some questions, but I really want input from any of you who have ideas on this, since your ideas are likely to be better informed than mine. What do we want to know about this problem and its possible solutions?

(Some of) Karen's Questions

Why have we stopped evolving subject access?

Is it that keyword access is simply easier for users to understand? Did the technology deceive us into thinking that a "syndetic apparatus" is unnecessary? Why have the cataloging rules and bibliographic description been given so much more of our profession's time and development resources than subject access has? [3]

Is it too late to introduce knowledge organization to today's users?

The user of today is very different to the user of pre-computer times. Some of our users have never used a catalog with an obvious knowledge organization structure that they must/can navigate. Would they find such a structure intrusive? Or would they suddenly discover what they had been missing all along? [4]

Can we successfully use the subject access that we already have in library records?

Some of the comments in the articles organized by Cochrane in my previous post were about problems in the Library of Congress Subject Headings (LCSH), in particular that the relationships between headings were incomplete and perhaps poorly designed.[5] Since LCSH is what we have as headings, could we make them better? Another criticism was the sparsity of "see" references, once dictated by the difficulty of updating LCSH. Can this be ameliorated? Crowdsourced? Localized?

We still do not have machine-readable versions of the Library of Congress Classification (LCC), and the machine-readable Dewey Decimal Classification (DDC) has been taken off-line (and may be subject to licensing). Could we make use of LCC/DDC for knowledge navigation if they were available as machine-readable files?

Given that both LCSH and LCC/DDC have elements of post-composition and are primarily instructions for subject catalogers, could they be modified for end-user searching, or do we need to develop a different instrument altogether?

How can we measure success?

Without Google's user laboratory apparatus, the answer to this may be: we can't. At least, we cannot expect to have a definitive measure. How terrible would it be to continue to do as we do today and provide what we can, and presume that it is better than nothing? Would we really see, for example, a rise in use of library catalogs that would confirm that we have done "the right thing?"


Notes

[1]*Modern Subject Access in the Online Age: Lesson 3
Author(s): Pauline A. Cochrane, Marcia J. Bates, Margaret Beckman, Hans H. Wellisch, Sanford Berman, Toni Petersen and Stephen E. Wiberley, Jr.
Source: American Libraries, Vol. 15, No. 4 (Apr., 1984), pp. 250-252, 254-255
Stable URL: http://www.jstor.org/stable/25626708

[2] Katz, Bill. Introduction to Reference Work: Reference Services and Reference Processes. New York: McGraw-Hill, 1992. p. 82 http://www.worldcat.org/oclc/928951754. Cited in: Brown, Stephanie Willen. The Reference Interview: Theories and Practice. Library Philosophy and Practice 2008. ISSN 1522-0222

[3] One answer, although it doesn't explain everything, is economic: the cataloging rules are published by the professional association and are a revenue stream for it. That provides an incentive to create new editions of rules. There is no economic gain in making updates to the LCSH. As for the classifications, the big problem there is that they are permanently glued onto the physical volumes making retroactive changes prohibitive. Even changes to descriptive cataloging must be moderated so as to minimize disruption to existing catalogs, which we saw happen during the development of RDA, but with some adjustments the new and the old have been made to coexist in our catalogs.

[4] Note that there are a few places online, in particular Wikipedia, where there is a mild semblance of organized knowledge and with which users are generally familiar. It's not the same as the structure that we have in subject headings and classification, but users are prompted to select pre-formed headings, with a keyword search being secondary.

[5] Simon Spero did a now famous (infamous?) analysis of LCSH's structure that started with Biology and ended with Doorbells.

Monday, July 04, 2016

Catalogs and Content: an Interlude

Part I, Part II, Part III, Part IV

"Editor's note. Providing subject access to information is one of the most important professional services of librarians; yet, it has been overshadowed in recent years by AACR2, MARC, and other developments in the bibliographic organization of information resources. Subject access deserves more attention, especially now that results are pouring in from studies of online catalog use in libraries."
American Libraries, Vol. 15, No. 2 (Feb., 1984), pp. 80-83
Having thought and written about the transition from card catalogs to online catalogs, I began to do some digging in the library literature, and struck gold. In 1984, Pauline Atherton Cochrane, one of the great thinkers in library land, organized a six-part "continuing education" series to bring librarians up to date on the thinking regarding the transition to new technology. (Dear ALA - please put these together into a downloadable PDF for open access. It could make a difference.) What is revealed here is both stunning and disheartening, as the quote above shows; in terms of catalog models, very little progress has been made, and we are still spending more time organizing atomistic bibliographic data while ignoring subject access.

The articles are primarily made up of statements by key library thinkers of the time, many of whom you will recognize. Some responses contradict each other, others fall into familiar grooves. Library of Congress is criticized for not moving faster into the future, much as it is today, and yet respondents admit that the general dependency on LC makes any kind of fast turn-around of changes difficult. Some of the desiderata have been achieved, but not the overhaul of subject access in the library catalog.

The Background

If you think that libraries moved from card catalogs to online catalogs in order to serve users better, think again. Like other organizations that had a data management function, libraries in the late 20th century were reaching the limits of what could be done with analog technology. In fact, as Cochrane points out, by the mid-point of that century libraries had given up on the basic catalog function of providing cross references from unused to used terminology, as well as from broader and narrower terms in the subject thesaurus. It simply wasn't possible to keep up with these, not to mention that although the Library of Congress and service organizations like OCLC provided ready-printed cards for bibliographic entries, they did not provide the related reference cards. What libraries did (and I remember this from my undergraduate years) is they placed near the card catalog copies of the "Red Book". This was the printed Library of Congress Subject Heading list, which by my time was in two huge volumes, and, yes, was bound in red. Note that this was the volume that was intended for cataloging librarians who were formulating subject headings for their collections. It was never intended for the end-users of the catalog. The notation ("x", "xx", "sa") was far from intuitive. In addition, for those users who managed to follow the references, it pointed them to the appropriate place in LCSH, but not necessarily in the catalog of the library in which they were searching. Thus a user could be sent to an entry that simply did not exist.

The "Red Book" today
From my own experience, when we brought up the online catalog at the University of California, the larger libraries had for years had difficulty keeping the card catalog up to date. The main library at the University of California at Berkeley regularly ran from 100,000 to 150,000 cards behind in filing into the catalog, which filled two enormous halls. That meant that a book would be represented in the catalog about three months after it had been cataloged and shelved. For a research library, this was a disaster. And Berkeley was not unusual in this respect.

Computerization of the catalog was both a necessary practical solution and a kind of holy grail. At the time that these articles were written, only a few large libraries had an online catalog, and that catalog represented only a recent portion of the library's holdings. (Retrospective conversion of the older physical card catalog to machine-readable form came later, culminating in the 1990's.) Abstracting and indexing databases (DIALOG, PRECIS, and others) had preceded libraries in automating, and these gave librarians their first experience in searching computerized bibliographic data.

This was the state of things when Cochrane presented her 6-part "continuing education" series in American Libraries.

Subject Access

The series of articles was stimulated by an astonishingly prescient article by Marcia Bates in 1977. In that article she articulates both concerns and possibilities that, quite frankly, we should all take to heart today. In Lesson 3 of Cochrane's articles, Bates is quoted from 1977 saying:
"...with automation, we have the opportunity to introduce many access points to a given book. We can now use a subject approach... that allows the naive user, unconscious of and uninterested in the complexities of synonymy and vocabulary control, to blunder on to desired subjects, to be guided, without realizing it, by a redundant but carefully controlled subject access system." 
and
"And now is the time to change -- indeed, with MARC already so highly developed, past time. If we simply transfer the austerity-based LC subject heading approach to expensive computer systems, then we have used our computers merely to embalm the constraints that were imposed on library systems back before typewriters came into use!"

This emphasis on subject access was one of the stimuli for the AL lessons. In the early 1980's, studies done at OCLC and elsewhere showed that over 50% of the searches being done in the online catalogs of that day were subject searches, even those going against title indexes or mixed indexes. (See footnotes to Lesson 3.) Known item searching was assumed to be under control, but subject searching posed significant problems. Comments in the article include:
"...we have not yet built into our online systems much of the structure for subject access that is already present in subject cataloging. That structure is internal and known by the person analyzing the work; it needs to be external and known by the person seeking the work."
"Why should a user ever enter a search term that does not provide a link to the syndetic apparatus and a suggestion about how to proceed?"
Interestingly, I don't see that any of these problems has been solved in today's systems.

As a quick review, here are some of the problems, some proposed solutions, and some hope for future technologies that are presented by the thinkers that contributed to the lessons.

Problems noted

Many problems were surfaced, some with fairly simple solutions, others that we still struggle with.
  • LCSH is awkward, if not nearly unusable, both for its vocabulary and for the lack of a true hierarchical organization
  • Online catalogs' use of LCSH lacks syndetic structure (see, see also, BT, NT). This is true not only for display but also for retrieval: a search on a broader term does not retrieve items with a narrower term (which would be logical to at least some users)
  • Libraries assign too few subject headings
  • For the first time, some users are not in the library while searching so there are no intermediaries (e.g. reference librarians) available. (One of the flow diagrams has a failed search pointing to a box called "see librarian", something we would not think to include today.)
  • Lack of a professional theory of information seeking behavior that would inform systems design. ("Without a blueprint of how most people want to search, we will continue to force them to search the way we want to search." Lesson 5)
  • Information overload, aka overly large results, as well as too few results on specific searches
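
The retrieval gap noted in the second bullet can be illustrated with a small sketch. The thesaurus links and record headings below are invented for illustration and are not taken from LCSH; the only point is that following narrower-term (NT) links before matching lets a search on a broader term reach records indexed under narrower ones.

```python
# Hypothetical mini-thesaurus: broader term -> narrower terms (NT links).
NARROWER = {
    "Boats and boating": ["Canoes and canoeing", "Sailing"],
    "Sailing": ["Yacht racing"],
}

# Hypothetical catalog: record id -> assigned subject heading.
RECORDS = {
    1: "Boats and boating",
    2: "Sailing",
    3: "Yacht racing",
    4: "Whaling",
}

def expand(term):
    """Return the term plus every term reachable through NT links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(NARROWER.get(current, []))
    return seen

def search(term, use_syndetic=True):
    """Match records on the exact heading, optionally expanded via NT links."""
    wanted = expand(term) if use_syndetic else {term}
    return sorted(rid for rid, heading in RECORDS.items() if heading in wanted)
```

Calling `search("Boats and boating", use_syndetic=False)` behaves like the catalogs criticized in the lessons and finds only the record carrying that exact heading, while the default call also picks up the records under the narrower terms.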

Proposed solutions

Some proposed solutions were mundane (add more subject headings to records) while others would require great disruption to the library environment.
  • Add more subject headings to MARC records
  • Use keyword searching, including keywords anywhere in the record.
  • Add uncontrolled keywords to the records.
  • Make the subject authority file machine-readable and integrate it into online catalogs.
  • Forget LCSH, instead use non-library bibliographic files for subject searching, such as A&I databases.
  • Add subject terms from non-library sources to the library catalog, and/or do (what today we call) federated searching
  • LCSH must provide headings that are more specific as file sizes and retrieved sets grow (in the document, a retrieved set of 904 items was noted with an exclamation point)

Future thinking 

As is so often the case when looking to the future, some potential technologies were seen as solutions. Some of these are still seen as solutions today (cf. artificial intelligence), while others have been achieved (storage of full text).
  • Full text searching, natural language searches, and artificial intelligence will make subject headings and classification unnecessary
  • We will have access to back-of-the-book indexes and tables of contents for searching, as well as citation indexing
  • Multi-level systems will provide different interfaces for experts and novices
  • Systems will be available 24x7, and there will be a terminal in every dorm room
  • Systems will no longer need to use stopwords
  • Storage of entire documents will become possible

End of Interlude

Although systems have allowed us to store and search full text, to combine bibliographic data from different sources, and to deliver world-wide, 24x7, we have made almost no progress in the area of subject access. There is much more to be learned from these articles, and it would be instructive to do an in-depth comparison of them to where we are today. I greatly recommend reading them; each is only a few pages long.

----- The Lessons -----

*Modern Subject Access in the Online Age: Lesson 1
by Pauline Atherton Cochrane
Source: American Libraries, Vol. 15, No. 2 (Feb., 1984), pp. 80-83
Stable URL: http://www.jstor.org/stable/25626614

*Modern Subject Access in the Online Age: Lesson 2
Author(s): Pauline A. Cochrane
Source: American Libraries, Vol. 15, No. 3 (Mar., 1984), pp. 145-148, 150
Stable URL: http://www.jstor.org/stable/25626647

*Modern Subject Access in the Online Age: Lesson 3
Author(s): Pauline A. Cochrane, Marcia J. Bates, Margaret Beckman, Hans H. Wellisch, Sanford Berman, Toni Petersen and Stephen E. Wiberley, Jr.
Source: American Libraries, Vol. 15, No. 4 (Apr., 1984), pp. 250-252, 254-255
Stable URL: http://www.jstor.org/stable/25626708

*Modern Subject Access in the Online Age: Lesson 4
Author(s): Pauline A. Cochrane, Carol Mandel, William Mischo, Shirley Harper, Michael Buckland, Mary K. D. Pietris, Lucia J. Rather and Fred E. Croxton
Source: American Libraries, Vol. 15, No. 5 (May, 1984), pp. 336-339
Stable URL: http://www.jstor.org/stable/25626747

*Modern Subject Access in the Online Age: Lesson 5
Author(s): Pauline A. Cochrane, Charles Bourne, Tamas Doczkocs, Jeffrey C. Griffith, F. Wilfrid Lancaster, William R. Nugent and Barbara M. Preschel
Source: American Libraries, Vol. 15, No. 6 (Jun., 1984), pp. 438-441, 443
Stable URL: http://www.jstor.org/stable/25629231

*Modern Subject Access In the Online Age: Lesson 6
Author(s): Pauline A. Cochrane, Brian Aveney and Charles Hildreth
Source: American Libraries, Vol. 15, No. 7 (Jul. - Aug., 1984), pp. 527-529
Stable URL: http://www.jstor.org/stable/25629275

Wednesday, June 29, 2016

Catalog and Context, Part IV

Part I, Part II, Part III

(I fully admit that this topic deserves a much more extensive treatment than I will give it here. My goal is to stimulate discussion that would lead to efforts to develop models of the catalog that support a better user experience.)

Recognizing that users need a way to make sense out of large result sets, some library catalogs have added features that attempt to provide context for the user. The main such effort that I am aware of is the presentation of facets derived from some data in the bibliographic records. Another model, although I haven't seen it integrated well into library catalogs, is data mining; doing an overall analysis combining different data in the records, and making this available for search. Lastly, we have the development of entities that are catalog elements in their own right; this generally means the treatment of authors, subjects, etc., as stand-alone topics to be retrieved and viewed, even apart from their role in the retrieval of a bibliographic item. Treating these as "first-class entities" is not the same as the heading layer over bibliographic records, but it may be exploitable to provide a kind of context for users.

Facets

Faceted classification was all the rage when I attended library school in the early 1970's, having been bolstered by the work of the UK-based Classification Research Group, although the prime mover of this type of classification was S R Ranganathan who thoroughly explicated the concept in the 1930s. Faceted classification was to 1970's knowledge organization what KWIC and KWOC were to text searching: facets potentially provided a way to create complex subject headings whose individual parts could be the subject of access on their own or in context.

In library systems "faceting" has exploited information from the bibliographic record that can be discretely applied to a retrieved set. Facets are all "accidents" of the existing data, as catalog record creation is not based on faceted cataloging.

In general, facets are fixed data elements, or whole or part heading strings. Authors are used as facets, generally showing the top-occurring author names with counts.
Authors as facets
Date of publication is also a commonly used facet, not so much because it is inherently useful but mainly because "it exists."

Dates as facets
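
The author and date facets described above amount to a tally over the retrieved set. A minimal sketch, assuming each retrieved record is a dictionary; the field names `author` and `year` and the sample records are invented for illustration and do not reflect any particular catalog system.

```python
from collections import Counter

# Invented result set; "author" and "year" are hypothetical field names.
RESULTS = [
    {"author": "Darwin, Charles, 1809-1882", "year": 1859},
    {"author": "Darwin, Charles, 1809-1882", "year": 1871},
    {"author": "Darwin, Emma Wedgwood, 1808-1896", "year": 1904},
    {"author": "Darwin, Charles, 1809-1882", "year": 1859},
]

def facet(records, field, top=5):
    """Count the values of one field and return the most frequent ones."""
    return Counter(r[field] for r in records).most_common(top)
```

`facet(RESULTS, "author")` gives the top-occurring names with counts. Nothing in the tally depends on what the field means, which is one way to see why such facets are "accidents" of the data.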

Subject Facets

Faceting is, to a degree, already incorporated into our subject access systems. Library of Congress subject headings are faceted to some extent, with topic facets, geographic facets, and time facets. The Library of Congress Classification and the Dewey Decimal Classification make some use of facets where they allow entries in the classification to be extended by place, time, or other re-usable subdivisions.

Some systems have taken a page from the FAST book. FAST is Faceted Application of Subject Terminology, and it creates facets by breaking apart the segments of a Library of Congress subject heading such that topics, geographical entries, and time periods become separate entries. FAST itself does more than this, including turning some inverted headings (Lake, Erie) back to their natural order, and other changes. One of the main criticisms of FAST, however, is that it loses the very context that is provided by the composite subject heading. Thus the headings on Moby Dick become Whales / Whaling / Mentally Ill / Fiction, leaving it unclear who or what is mentally ill in this example. (I'm sure there are better examples - send them in!)
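
The loss of context can be seen in a small sketch. It assumes each heading arrives as a list of (subfield code, value) pairs using the MARC 650 subfield codes ($a topical term, $x general subdivision, $y chronological, $z geographic, $v form). The code-to-facet mapping is a deliberate simplification and not the actual FAST conversion, and both headings are invented.

```python
# MARC 650 subfield codes mapped to a facet type. The mapping is a
# simplification for illustration, not the actual FAST conversion rules.
FACET_TYPE = {"a": "topic", "x": "topic", "y": "period", "z": "place", "v": "form"}

def split_heading(subfields):
    """Break one pre-coordinated heading into independent (type, value) facets."""
    return [(FACET_TYPE[code], value) for code, value in subfields]

# Two invented headings assigned to the same hypothetical record.
headings = [
    [("a", "Whaling"), ("z", "Massachusetts"), ("v", "Fiction")],
    [("a", "Ship captains"), ("x", "Mental health"), ("v", "Fiction")],
]

# Pool and de-duplicate the facets from every heading on the record.
facets = sorted({f for h in headings for f in split_heading(h)})
```

Once the facets are pooled, nothing records that "Mental health" belonged with "Ship captains" rather than with "Whaling", which is the criticism made above.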

Summon system use of facets

The Open Library created subject facets from Library of Congress subject headings, and categorizes each by its facet "type":
Open Library subject facets


Although these are laudable attempts to give the user a way to both understand and further refine the retrieved set, there are a number of problems with these implementations, not the least of which is that many of these are not actually facets in the knowledge organization sense of that term. Facets need to be conceptual divisions of the landscape that help a user understand that landscape.
Online sales sites use something that they call faceted classification, although it varies considerably from the concept of faceted classification that originated with S. R. Ranganathan in the 1930's. On a sales site, facets divide the site's products into categories so that users can add those categories to their searches. A search for shoes in general is less useful than a search for shoes done under the categories "men's", "women's" or "children's". In the online sales sense, a facet is a context for the keyword search. For all that the overall universe that these facets govern is much simpler than the entire knowledge universe that libraries must try to handle, at least the concept of context is employed to help the user.

Amazon's facets
While it may be helpful to see who are the most numerous authors in a retrieved set, authorship does not provide a conceptual organization for the user. Next, not everything that can be exploited in a bibliographic record to narrow a result set is necessarily useful. The list of publication dates from the retrieved set is not only too granular to be a useful facet (think of how many different dates there could be) but the likelihood that a user's query can be fulfilled by a publication year datum is scant indeed.

The last problem is really the key here, which is that while isolated bits of data like date or place may help narrow a large result set they do not provide the kind of overall context for searches that a truly faceted system might. However, providing such a view requires that the entries in the library catalog have been classified using a faceted classification system, and that is simply not the case.

Data Mining


I include this because I think it is interesting, although the only real instances of it that I am aware of come from OCLC, which is uniquely positioned to do the kind of "big data" work that this implies. The WorldCat Identities project shows the kind of data that one can extract from a large bibliographic database. Data mining applies best to the bibliographic universe as a whole, rather than individual catalogs, since those latter are by definition incomplete. It would, however, be interesting to see what uses could be made of mined data like WorldCat Identities, for example giving users of individual catalogs information about sources that the library does not hold. It is also a shame that WorldCat Identities appears to have been a one-off and is not being kept up to date.
Emily Dickinson at WorldCat Identities

First Class Objects

A potential that linked data brings (but does not guarantee) is the development of some of the key bibliographic entities into "first class objects". By that I mean that some entities could be the focus of searches on their own, not just as index entries to bibliographic records. Having some entities be first class objects means that, for example, you can have a page for a person that is truly about the person, not just a heading with the personal name in it. This allows you to present the user with additional information, either similar to WorldCat Identities, if you have that information available to you, or taking text from sources like Wikipedia, like Open Library did:
Open Library author page

This was also the model used in the linked data database Freebase (which has now been killed by Google), and is not entirely unlike Google's use of Wikipedia (and other sources) to create its "knowledge graph."
Google Knowledge Graph

The treatment of some things as first class objects is perhaps a step toward the catalog of headings, but the person as an object is not itself a replication of the heading system that is found in bibliographic records, which go beyond the person's name in their organizational function:

Dickens, Charles, 1812-1870--Adaptations.
Dickens, Charles, 1812-1870--Adaptations--Comic books, strips, etc.
Dickens, Charles, 1812-1870--Adaptations--Congresses.
Dickens, Charles, 1812-1870--Aesthetics.
Dickens, Charles, 1812-1870--Anecdotes.
Dickens, Charles, 1812-1870--Anniversaries, etc.
Dickens, Charles, 1812-1870--Appreciation.
Dickens, Charles, 1812-1870--Appreciation--Croatia.

For subject headings, a key aspect of the knowledge map is the inclusion of relationships from broader and narrower terms and related terms. I will not pretend that the existing headings are perfect, as we know they are not, but it is hard to imagine a knowledge organization system that will not make use of these taxonomic concepts in one way or another.

Lake Erie
    See: Erie, Lake
Lake Erie, Battle of, 1813.
    BT: United States--History--War of 1812--Campaigns
Lake Erie, Battle of, 1813--Bibliography.
Lake Erie, Battle of, 1813--Commemoration.
Lake Erie, Battle of, 1813--Fiction.
Lake Erie, Battle of, 1813--Juvenile fiction.
Lake Erie, Battle of, 1813--Juvenile literature.
Lake Erie Transportation Company
    See Also: Erie Railroad Company.

This information is now available through the Library of Congress linked data service, id.loc.gov, and surely, with some effort, these aspects of the "first class entity" (person, place, topic, etc.) could be recovered and made available to the user. Unfortunately (how often have I said that in these posts?), the subject heading authorities were designed as a model for subject heading creation, not as a full list of all possible subject headings, and connecting the authority file, which contains the relationships between terms, mechanically to the headings in bibliographic records is not a snap. Again, what was modeled for the card catalog and worked well in that technology does not translate perfectly to the newer technologies.
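
One way to picture the mismatch: a heading in a bibliographic record often extends an established heading with subdivisions that have no authority record of their own, so an exact string match fails. The sketch below is a naive approach, assuming "--" separates subdivisions; it tries the full string and then drops trailing subdivisions. The `AUTHORITY` dictionary is a hypothetical stand-in populated from the Lake Erie example above, not the structure of the real authority file or of id.loc.gov.

```python
# Tiny stand-in for an authority file: established heading -> relationships.
# Entries are copied from the example above; the structure is hypothetical.
AUTHORITY = {
    "Lake Erie, Battle of, 1813": {
        "BT": ["United States--History--War of 1812--Campaigns"],
    },
    "Erie, Lake": {"UF": ["Lake Erie"]},
}

def find_authority(heading):
    """Drop trailing '--' subdivisions until an established heading matches."""
    parts = heading.rstrip(".").split("--")
    while parts:
        candidate = "--".join(parts)
        if candidate in AUTHORITY:
            return candidate, AUTHORITY[candidate]
        parts.pop()
    return None
```

This recovers the broader term for "Lake Erie, Battle of, 1813--Juvenile fiction." but returns nothing for "Boats and boating--Erie, Lake--Maps.", where the lake sits in the middle of the string. Even this small case suggests why the connection "is not a snap".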

Note that the emphasis on bibliographic entities in FRBR, RDA and BIBFRAME could facilitate such a solution. All three encourage an entity view of data that has traditionally been included in bibliographic records and that is not entirely opposed to the concept of the separation of bibliographic data and authorities. In addition, FRBR provides a basis for conceptualizing works and editions (FRBR's expression) as separate entities. These latter exist already in many forms in the "real world" as objects of critical thinking, description, and point of sale. The other emphasis in FRBR is on bibliographic relationships. This has helped us understand that relationships are important, although these bibliographic relationships are the tip of the iceberg if we look at user service as a whole.

Next 

Next I want to talk about possibilities. But because I do not have the answers, I am going to present them in the form of questions - because we need first to have questions before we can posit any answers.

Thursday, June 23, 2016

Catalog and Context Part III

Part I
Part II

In the previous two parts, I explained that much of the knowledge context that could and should be provided by the library catalog has been lost as we moved from cards to databases as the technologies for the catalog. In this part, I want to talk about the effect of keyword searching on catalog context.

KWIC and KWOC

If you weren't at least a teenager in the 1960's you probably missed the era of KWIC and KWOC (neither a children's TV show nor a folk music duo). These meant, respectively, KeyWords In Context, and KeyWords Out of Context. These were concordance-like indexes to texts, but the first done using computers. A KWOC index would be simply a list of words and pointers (such as page numbers, since hyperlinks didn't exist yet). A KWIC index showed the keywords with a few words on either side, or rotated a phrase such that each term appeared once at the beginning of the string, and then were ordered alphabetically.

If you have the phrase "KWIC is an acronym for Key Word in Context", then your KWIC index display could look like:

 KWIC is an acronym for Key Word In Context
Key Word In Context
acronym for Key Word In Context
            KWIC is an acronym for 
acronym for Key Word In Context

To us today these are unattractive and not very useful, but to the first users of computers these were an exciting introduction to the possibility that one could search by any word in a text.
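
For readers who never met these indexes, here is a minimal sketch of both forms. The stopword list and the "/" used to mark where a rotated phrase wraps are assumptions made for illustration; historical KWIC programs differed in their details.

```python
def kwoc(pages):
    """KWOC: map each word to the sorted page numbers where it occurs."""
    index = {}
    for page, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(page)
    return {word: sorted(nums) for word, nums in sorted(index.items())}

def kwic(phrase, stopwords=frozenset({"is", "an", "for", "in"})):
    """KWIC by rotation: bring each keyword to the front, then alphabetize."""
    words = phrase.split()
    rotations = []
    for i, word in enumerate(words):
        if word.lower() in stopwords:
            continue  # stopwords never lead an index line
        head, tail = words[:i], words[i:]
        # "/" marks the point where the original phrase wraps around.
        rotations.append(" ".join(tail + (["/"] + head if head else [])))
    return sorted(rotations, key=str.lower)

for line in kwic("KWIC is an acronym for Key Word in Context"):
    print(line)
```

Each keyword in the phrase gets one line beginning with that keyword, so a reader scanning the alphabetical list finds the phrase under any of its significant words.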

It wasn't until the 1980's, however, that keyword searching could be applied to library catalogs.

Before Keywords, Headings


Before keyword searching, when users were navigating a linear, alphabetical index, they were faced with the very difficult task of deciding where to begin their entry into the catalog. Imagine someone looking for information on Lake Erie. That seems simple enough, but entering the catalog at L-A-K-E E-R-I-E would not actually yield all of the entries that might be relevant. Here are some headings with LAKE ERIE:

Boats and boating--Erie, Lake--Maps. 
Books and reading--Lake Erie region.
Lake Erie, Battle of, 1813.
Erie, Lake--Navigation

Note that the lake is entered under Erie, the battle under Lake, and some instances are fairly far down in the heading string. All of these headings follow rules that ensure a kind of consistency, but because users do not know those rules, the consistency here may not be visible. In any case, the difficulty for users was knowing with what terms to begin the search, which was done on left-anchored headings.
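
A left-anchored lookup over the four headings above can be sketched as follows. Sorting by lowercased string is an assumption standing in for real filing rules, which are considerably more involved.

```python
from bisect import bisect_left

# The four headings listed above, filed alphabetically (case-insensitive).
HEADINGS = sorted([
    "Boats and boating--Erie, Lake--Maps.",
    "Books and reading--Lake Erie region.",
    "Lake Erie, Battle of, 1813.",
    "Erie, Lake--Navigation",
], key=str.lower)

def browse(prefix):
    """Return headings that begin with the prefix, as a card drawer would."""
    keys = [h.lower() for h in HEADINGS]
    start = bisect_left(keys, prefix.lower())
    hits = []
    for heading in HEADINGS[start:]:
        if not heading.lower().startswith(prefix.lower()):
            break  # past the run of matching entries in the file
        hits.append(heading)
    return hits
```

Entering at "Lake Erie" finds only the battle, and entering at "Erie, Lake" finds only the navigation heading; the two headings that mention the lake further down the string are unreachable from either starting point.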

One might assume that finding names of people would be simple, but that is not the case either. Names can be quite complex with multiple parts that are treated differently based on a number of factors having to do with usage in different cultures:

De la Cruz, Melissa
Cervantes Saavedra, Miguel de


Because it was hard to know where to begin a search, see and see also references existed to guide the user from one form of a name or phrase to another. However, it would inflate a catalog beyond utility to include every possible entry point that a person might choose, not to mention that this would make the cataloger's job onerous. Other than the help of a good reference librarian, searching in the card catalog was a kind of hit or miss affair.

When we brought up the University of California online catalog in 1982, you can imagine how happy users were to learn that they could type in LAKE ERIE and retrieve every record with those terms in it regardless of the order of the terms or where in the heading they appeared. Searching was, or seemed, much simpler. Because it feels simpler, we all have tended to ignore some of the downsides of keyword searching. First, words are just strings, and in a search strings have to match (with some possible adjustment like combining singular and plural terms). So a search on "FRANCE" for all information about France would fail to retrieve other versions of that word unless the catalog did some expansion:

Cooking, French
France--Antiquities
Alps, French (France)
French--America--History
French American literature
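Both the gain and the loss can be seen in a minimal Python sketch of keyword matching. The tokenizer and the function names here are assumptions for illustration, and no stemming or term expansion is attempted.

```python
import re


def tokens(text):
    """Split text into lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(headings, query):
    """Return headings containing every query word, in any order or position."""
    wanted = tokens(query)
    return [h for h in headings if wanted <= tokens(h)]


lake = [
    "Boats and boating--Erie, Lake--Maps.",
    "Books and reading--Lake Erie region.",
    "Lake Erie, Battle of, 1813.",
    "Erie, Lake--Navigation",
]
france = [
    "Cooking, French",
    "France--Antiquities",
    "Alps, French (France)",
    "French--America--History",
    "French American literature",
]

print(len(keyword_search(lake, "LAKE ERIE")))  # 4: word order no longer matters
print(keyword_search(france, "FRANCE"))        # the headings with only "French" are missed
```

Without some expansion of the search term, only the two headings that contain the literal string "France" are retrieved.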

The next problem is that retrieval with keywords, and especially the "keyword anywhere" search which is the most popular today, entirely misses any context that the library catalog could provide. A simple keyword search on the word "darwin" brings up a wide array of subjects, authors, and titles.

Subjects:
Darwin, Charles, 1809-1882 – Influence
Darwin, Charles, 1809-1882 — Juvenile Literature
Darwin, Charles, 1809-1882 — Comic Books, Strips, Etc
Darwin Family
Java (Computer program language)
Rivers--Great Britain
Mystery Fiction
DNA Viruses — Fiction
Women Molecular Biologists — Fiction

Authors:
Darwin, Charles, 1809-1882
Darwin, Emma Wedgwood, 1808-1896
Darwin, Ian F.
Darwin, Andrew
Teilhet, Darwin L.
Bear, Greg
Byrne, Eugene

Titles:
Darwin
Darwin; A Graphic Biography : the Really Exciting and Dramatic 
    Story of A Man Who Mostly Stayed at Home and Wrote Some Books
Darwin; Business Evolving in the Information Age
Emma Darwin, A Century of Family Letters, 1792-1896
Java Cookbook
Canals and Rivers of Britain
The Crimson Hair Murders
Darwin's Radio

It wouldn't be reasonable for us to expect a user to make sense of this, because quite honestly it does not make sense.

 In the first version of the UC catalog, we required users to select a search heading type, such as AU, TI, SU. That may have lessened the "false drops" from keyword searches, but it did not eliminate them. In this example, using a title or subject search the user still would have retrieved items with the subjects DNA Viruses — Fiction, and Women Molecular Biologists — Fiction, and an author search would have brought up both Java Cookbook and Canals and Rivers of Britain. One could see an opportunity for serendipity here, but it's not clear that it would balance out the confusion and frustration. 
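A rough Python sketch of such typed searching follows. The pairing of authors, titles and subjects into three records is inferred from the lists above and should be read as an assumption, as should the record layout and the function name `search`; this is not the UC catalog's actual design.

```python
import re

# Three records assembled from the example lists (the pairings are assumed).
records = [
    {"AU": ["Darwin, Ian F."], "TI": ["Java Cookbook"],
     "SU": ["Java (Computer program language)"]},
    {"AU": ["Darwin, Andrew"], "TI": ["Canals and Rivers of Britain"],
     "SU": ["Rivers--Great Britain"]},
    {"AU": ["Bear, Greg"], "TI": ["Darwin's Radio"],
     "SU": ["DNA Viruses--Fiction", "Women Molecular Biologists--Fiction"]},
]


def words(text):
    """Split text into lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(records, term, field=None):
    """Return titles of records where `term` occurs in `field`
    (AU, TI or SU), or in any field when `field` is None."""
    fields = [field] if field else ["AU", "TI", "SU"]
    hits = []
    for rec in records:
        if any(term.lower() in words(value) for f in fields for value in rec[f]):
            hits.append(rec["TI"][0])
    return hits


print(search(records, "darwin"))        # keyword anywhere: all three records
print(search(records, "darwin", "AU"))  # the Java and canal books
print(search(records, "darwin", "TI"))  # the novel, with its fiction subjects
```

Limiting the search to one heading type narrows the set, but every record retrieved still arrives with headings that have nothing to do with what the user had in mind.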

Right now you may be thinking "But Google uses keyword searching and the results are good." Note that Google now relies heavily on Wikipedia and other online reference books to provide relevant results. Wikipedia is a knowledge organization system, organized by people, and it often has a default answer for search that is more likely to match the user's assumptions. A search on the single word "darwin" brings up:

In fact, Google has always relied on humans to organize the web by following the hyperlinks that they create. Although the initial mechanism of the search is a keyword search, Google's forte is in massaging the raw keyword result to bring potentially relevant pages to the top. 

Keywords, Concluded

The move from headings to databases to un-typed keyword searching has all but eliminated the visibility and utility of headings in the catalog. The single search box has become the norm for library catalogs and many users have never experienced the catalog as an organized system of headings. Default displays are short and show only a few essential fields, mainly author, title and date. This means that there may even be users who are unaware that there is a system of headings in the catalog.

Recent work in cataloging, from ISBD to FRBR to RDA and BIBFRAME, focuses on modifications to the bibliographic record, but does nothing to model the catalog as a whole. With these efforts, the organized knowledge system that was the catalog is slipping further into the background. And yet, we have no concerted effort taking place to remedy this. 

What is most astonishing to me, though, is that catalogers continue to create headings, painstakingly, sincerely, in spite of the fact that they are not used as intended in library systems, and have not been used in that way since the first library systems were developed over 30 years ago. The headings are fodder for the keyword search, but no more so than a simple set of tags would be. The headings never perform the organizing function for which they were intended. 

Next


Part IV will look at some attempts to create knowledge context from current catalog data, and will present some questions that need to be answered if we are to address the quality of the catalog as a knowledge system.

Monday, June 20, 2016

Catalog and Context, Part II

In the previous post, I talked about book and card catalogs, and how they existed as a heading layer over the bibliographic description representing library holdings. In this post, I will talk about what changed when that same data was stored in database management systems and delivered to users on a computer screen.

Taking a very simple example, in the card catalog a single library holding with author, title and one subject becomes three separate entries, one for each heading. These are filed alphabetically in their respective places in the catalog.

In this sense, the catalog is composed of cards for headings that have attached to them the related bibliographic description. Most items in the library are represented more than once in the library catalog. The catalog is a catalog of headings.

In most computer-based catalogs, the relationship between headings and bibliographic data is reversed: the record, with bibliographic and heading data, is stored once; access points, analogous to the headings of the card catalog, are extracted to indexes that all point to the single record.
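The reversal can be sketched in Python. The record below is hypothetical, and the function names and dictionary layout are assumptions for illustration rather than the design of any particular system.

```python
from collections import defaultdict

# A hypothetical holding with one author, one title and one subject.
store = {
    1: {
        "author": "Doe, Jane",
        "title": "Cats of the world",
        "subjects": ["Cat breeds"],
        "description": "Cats of the world / Jane Doe. 1975. 200 p.",
    },
}


def card_entries(record):
    """Card catalog model: one filed entry per heading, each one
    carrying the bibliographic description with it."""
    headings = [record["author"], record["title"], *record["subjects"]]
    return sorted((heading, record["description"]) for heading in headings)


def build_indexes(store):
    """Database model: the record is stored once, and each index maps
    a heading to the ids of the records it points to."""
    indexes = {name: defaultdict(list) for name in ("author", "title", "subject")}
    for rec_id, rec in store.items():
        indexes["author"][rec["author"]].append(rec_id)
        indexes["title"][rec["title"]].append(rec_id)
        for subject in rec["subjects"]:
            indexes["subject"][subject].append(rec_id)
    return indexes


print(len(card_entries(store[1])))                # 3 filed entries for one holding
print(build_indexes(store)["subject"]["Cat breeds"])  # [1]: a pointer, not a card
```

In the first model the headings are the catalog and the description rides along with each of them; in the second the headings survive only as index keys that lead back to a single stored record.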

This in itself could be just a minor change in the mechanism of the catalog, but in fact it turns out to be more than that.

First, the indexes of the database system are not visible to the user. This is the opposite of the card catalog where the entry points were what the user saw and navigated through. Those entry points, at their best, served as a knowledge organization system that gave the user a context for the headings. Those headings suggest topics to users once the user finds a starting point in the catalog.

When this system worked well for the user, she had some understanding of where she was in the virtual library that the catalog created. This context could be a subject area or it could be a bibliographic context such as the editions of a work.

Most, if not all, online catalogs do not present the catalog as a linear, alphabetically ordered list of headings. Database management technology encourages the use of searching rather than linear browsing. Even if one searches in headings as a left-anchored string of characters, a search results in a retrieved set of matching entries, not a point in an alphabetical list. There is no way to navigate to nearby entries. The bibliographic data is therefore not provided either in the context or the order of the catalog. After a search on "cat breeds" the user sees a screen-full of bibliographic records, but these lack context because most default displays do not show the user the headings or text that caused the item to be retrieved.

Although each of these items has a subject heading containing the words "Cat breeds" the order of the entries is not the subject order. The subject headings in the first few records read, in order:

  1. Cat breed
  2. Cat breeds
  3. Cat breeds - History
  4. Cat breeds - Handbooks, manuals, etc.
  5. Cat breeds
  6. Cat breeds - Thailand
  7. Cat breeds

Even if the catalog uses a visible and logical order, like alphabetical by author and title, or most recent by date, there is no way from the displayed list for the user to get the sense of "where am I?" that was provided by the catalog of headings.

In the early 1980's, when I was working on the University of California's first online catalog, the catalogers immediately noted this as a problem. They would have wanted the retrieved set to be displayed as:

(Note how much this resembles the book catalog shown in Part I.) At the time, and perhaps still today, there were technical barriers to such a display, mainly because of limitations on the sorting of large retrieved sets. (Large, at that time, was anything over a few hundred items.) Another issue was that any bibliographic record could be retrieved more than once in a single retrieved set, and presenting the records more than once in the display, given the database design, would be tricky. I don't know if starting afresh today some of these features would be easier to produce, but the pattern of search and display seems not to have progressed greatly from those first catalogs.
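A heading-ordered display of a retrieved set might be sketched as follows in Python. The titles are hypothetical placeholders, and the function name and the plain string sort are assumptions; this is one reading of what such a display would involve, not the design the catalogers proposed.

```python
# Hypothetical retrieved set; subject headings are taken from the list above.
retrieved = [
    {"title": "Book A", "subjects": ["Cat breeds - History"]},
    {"title": "Book B", "subjects": ["Cat breeds", "Cat breeds - Thailand"]},
    {"title": "Book C", "subjects": ["Cat breeds - Handbooks, manuals, etc."]},
]


def heading_display(records, query):
    """List each record under every subject heading that begins with the
    query, with the headings in alphabetical order."""
    q = query.lower()
    entries = []
    for rec in records:
        for heading in rec["subjects"]:
            if heading.lower().startswith(q):
                # One display line per matching heading, so a record with
                # two matching headings is shown twice.
                entries.append((heading, rec["title"]))
    return sorted(entries)


for heading, title in heading_display(retrieved, "cat breeds"):
    print(f"{heading}: {title}")
```

Book B appears twice, once under each of its matching headings, which is the duplication that made this display tricky given the database design; sorting the full list of entries is the other cost mentioned above.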

In addition, it is in any case questionable whether a set of bibliographic items retrieved from a database on some query would reproduce the presumably coherent context of the catalog. This is especially true because of the third major difference between the card catalog and the computer catalog: the ability to search on individual words in the bibliographic record rather than being limited to seeking on full left-anchored headings. The move to keyword searching was both a boon and a bane because it was a major factor in the loss of context in the library catalog.

Keyword searching will be the main topic of Part III of this series.


Catalog and Context, Part I

This multi-part post is based on a talk I gave in June, 2016 at ELAG in Copenhagen.

Imagine that you do a search in your GPS system and are given the exact point of the address, but nothing more.

Without some context showing where on the planet the point exists, having the exact location, while accurate, is not useful.



In essence, this is what we provide to users of our catalogs. They do a search and we reply with bibliographic items that meet the letter of that search, but with no context about where those items fit into any knowledge map.

Because we present the catalog as a retrieval tool for unrelated items, users have come to see the library catalog as nothing more than a tool for known item searching. They do not see it as a place to explore topics or to find related works. The catalog wasn't always just a known item finding tool, however. To understand how it came to be one, we need a short visit to Catalogs Past.

Catalogs Past


We can't really compare the library catalog of today to the early book catalogs, since the problem that they had to solve was quite different to what we have today. However, those catalogs can show us what a library catalog was originally meant to be.
[Image: book catalog entry]

A book catalog was a compendium of entry points, mainly authors but in some cases also titles and subjects. The bibliographic data was kept quite brief as every character in the catalog was a cost in terms of type-setting and page real estate. The headings dominated the catalog, and it was only through headings that a user could approach the bibliographic holdings of the library. An alphabetical author list is not much "knowledge organization", but the headings provided an ordered layer over the library's holdings, and were also the only access mechanism to them.

Some of the early card catalogs had separate cards for headings and for bibliographic data. If entries in the catalog had to be hand-written (or later typed) onto cards, the easiest thing was to slot the cards into the catalog behind the appropriate heading without adding heading data to the card itself.

Often there was only one card with a full bibliographic description, and that was the "main entry" card. All other cards were references to a point in the catalog, for example the author's name, where more information could be found.

Again, all bibliographic data was subordinate to a layer of headings that made up the catalog. We can debate how intellectually accurate or useful that heading layer was, but there is no doubt that it was the only entry to the content of the library.

The Printed Card


In 1902 the Library of Congress began printing cards that could be purchased by libraries. The idea was genius. For each item cataloged by LC a card was printed in as many copies as needed. Libraries could buy the number of catalog card "blanks" they required to create all of the entries in their catalogs. The libraries would use as many as needed of the printed cards and type (or write) the desired headings onto the top of the card. Each of these would have the full bibliographic information - an advantage for users who then would no longer need to follow "see" references from headings to the one full entry card in the catalog.


These cards introduced something else that was new: the card would have at the bottom a tracing of the headings that LC was using in its own catalog. This was a savings for the libraries as they could copy LC's practice without incurring their own catalogers' time. This card, for the first time, combined both bibliographic information and heading tracings in a single "record", with the bibliographic information on the card being an entry point to the headings.


Machine-Readable Card Printing


The MAchine Readable Cataloging (MARC) project of the Library of Congress was a major upgrade to card printing technology. By including all of the information needed for card printing in a computer-processable record, LC could take advantage of new technology to streamline its card production process, and even move into a kind of "print on demand" model. The MARC record was designed to have all of the information needed to print the set of cards for a book; author, title, subjects, and added entries were all included in the record, as well as some additional information that could be used to generate reports such as "new acquisitions" lists.

Here again the bibliographic information and the heading information were together in a single unit, and it even followed the card printing convention of the order of the entries, with the bibliographic description at top, followed by headings. With the MARC record, it was possible to not only print sets of cards, but to actually print the headers on the cards, so that when libraries received a set they were ready to go into the catalog at their respective places.

Next, we'll look at the conversion from printed cards to catalogs using database technology.

-> Part II

Wednesday, June 01, 2016

This is what sexism looks like, # 3

I spend a lot of time in technical meetings. This is no one's fault but my own since these activities are purely voluntary. At the end of many meetings, though, I vow to never attend one again. This story is about one.

There was no ill-preparedness or bad faith on the part of either the organizers or the participants at this meeting. There is, however, reality, and no amount of good will changes that.

This took place at a working meeting that was not a library meeting but at which some librarians were present. At lunch one day, three librarians, myself and two others, all female, were sitting together. I can say that we are all well-known and well seasoned in library systems and standards. You would recognize our names. As lunch was winding down, the person across from us opened a conversation with this (all below paraphrased):

P: Libraries should get involved with the Open Access movement; they are in a position to have an effect.

us: Libraries *are* heavily involved in the OA movement, and have been for at least a decade.

P: (Going on.) If you'd join together you could fight for OA against the big publishers.

us: Libraries *have* joined together and are fighting for OA. (Beginning to get annoyed at this point.)

P: What you need to do is... [various iterations here]

us: (Visibly annoyed now) We have done that. In some cases, we have started an effort that is going forward. We have organizations dedicated to that, we hold whole conferences on these topics. You are preaching to the choir here - these aren't new ideas for us, we know all of this. You don't need to tell us.

P: (Going on, no response to what we have said.) You should set a deadline, like 2017, after which you should drop all journals that are not OA.

us: [various statements about a) setting up university-wide rules for depositing articles; b) the difference in how publishing matters in different disciplines; c) the role of tenure, etc.]

P: (Insisting) If libraries would support OA, publishers like Elsevier could not survive.

us: [oof!]

me: You are sitting here with three professionals with a combined experience in this field of well over 50 years, but you won't listen to us or believe what we say. Why not?

P: (Ignoring the question.) I'm convinced that if libraries would join in, we could win this one. You should...

At this point, I lost it. I literally head-desked and groaned out "Please stop with the mansplaining!" That was a mistake, but it wasn't wrong. This was a classic case of mansplaining. P hopped up and stalked out of the room. Twenty minutes later I am told that I have violated the "civility code" of the conference. I have become the perpetrator of abuse because I "accused him" of being sexist.

I don't know what else we could have done to stop what was going on. In spite of a good ten minutes of us replying that libraries are "on it" not one of our statements was acknowledged. Not one of P's statements was in response to what we said. At no point did P acknowledge that we know more about what libraries are doing than he does, and perhaps he could learn by listening to us or asking us questions. And we actually told him, in so many words, he wasn't listening, and that we are knowledgeable. He still didn't get it.

This, too, is a classic: Catch-22. A person who is clueless will not get the "hints" but you cannot clue them or you are in the wrong.

Thanks to the men's rights movement, standing up against sexism has become abuse of men, who are then the victims of what is almost always characterized as "false accusations". Not only did this person tell me, in the "chat" we had at his request, "I know I am not sexist" he also said, "You know that false accusations destroy men's lives." It never occurred to him that deciding true or false wasn't de facto his decision. He didn't react when I said that all three of us had experienced the encounter in the same way. The various explanations P gave were ones most women have heard before: "If I didn't listen, that's just how I am with everybody." "Did I say I wasn't listening because you are women? so how could it be sexist?" And "I have listened to you in our meetings, so how can you say I am sexist?" (Again, his experience, his decision.) During all of this I was spoken to, but no interest was shown in my experience, and I said almost nothing. I didn't even try to explain it. I was drubbed.

The only positive thing that I can say about this is that in spite of heavy pressure over 20 minutes, one on one, I did not agree to deny my experience. He wanted me to tell him that he hadn't been sexist. I just couldn't do that. I said that we would have to agree to disagree, but apologized for my outburst.

When I look around meeting rooms, I often think that I shouldn't be there. I often vow that the next time I walk into a meeting room and it isn't at least 50% female, I'm walking out. Unfortunately, that meeting room does not exist in the projects that I find myself in.

Not all of the experience at the meeting was bad. Much of it was quite good. But the good doesn't remove the damage of the bad. I think about the fact that in Pakistan today men are arguing that it is their right to physically abuse the women in their home and I am utterly speechless. I don't face anything like that. But the wounds from these experiences take a long time to heal. Days afterward, I'm still anxious and depressed. I know that the next time I walk into a meeting room I will feel fear; fear of further damage. I really do seriously think about hanging it all up, never going to another meeting where I try to advocate for libraries.

I'm now off to join friends and hopefully put this behind me. I wish I could know that it would never happen again. But I get that gut punch just thinking about my next meeting.