Wednesday, June 29, 2016

Catalog and Context, Part IV

Part I, Part II, Part III

(I fully admit that this topic deserves a much more extensive treatment than I will give it here. My goal is to stimulate discussion that would lead to efforts to develop models of the catalog that support a better user experience.)

Recognizing that users need a way to make sense out of large result sets, some library catalogs have added features that attempt to provide context for the user. The main such effort that I am aware of is the presentation of facets derived from some data in the bibliographic records. Another model, although I haven't seen it integrated well into library catalogs, is data mining: doing an overall analysis that combines different data in the records, and making the results available for search. Lastly, we have the development of entities that are catalog elements in their own right; this generally means the treatment of authors, subjects, etc., as stand-alone topics to be retrieved and viewed, even apart from their role in the retrieval of a bibliographic item. Treating these as "first-class entities" is not the same as the heading layer over bibliographic records, but it may be exploitable to provide a kind of context for users.

Facets

Faceted classification was all the rage when I attended library school in the early 1970s, having been bolstered by the work of the UK-based Classification Research Group, although the prime mover of this type of classification was S. R. Ranganathan, who thoroughly explicated the concept in the 1930s. Faceted classification was to 1970s knowledge organization what KWIC and KWOC were to text searching: facets potentially provided a way to create complex subject headings whose individual parts could be the subject of access on their own or in context.

In library systems, "faceting" has exploited information from the bibliographic record that can be discretely applied to a retrieved set. Facets are all "accidents" of the existing data, as catalog record creation is not based on faceted cataloging.

In general, facets are fixed data elements, or whole or part heading strings. Authors are used as facets, generally showing the top-occurring author names with counts.
Authors as facets
Date of publication is also a commonly used facet, not so much because it is inherently useful but mainly because "it exists."

Dates as facets
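To make the mechanics concrete, here is a minimal sketch of how this kind of faceting might be computed over a retrieved set. The records, field names, and the functions `facet_counts` and `narrow` are all invented for illustration; they do not describe any particular catalog system.

```python
from collections import Counter

# Hypothetical retrieved set: a few fields pulled from bibliographic records.
# These records are invented for illustration only.
records = [
    {"author": "Dickens, Charles, 1812-1870", "year": 1998},
    {"author": "Dickens, Charles, 1812-1870", "year": 2003},
    {"author": "Dickinson, Emily, 1830-1886", "year": 1998},
    {"author": "Melville, Herman, 1819-1891", "year": 2011},
]

def facet_counts(result_set, field, top=10):
    """Count the values of one field across a result set, most frequent first."""
    counts = Counter(r[field] for r in result_set if r.get(field) is not None)
    return counts.most_common(top)

def narrow(result_set, field, value):
    """Apply a facet: keep only the records whose field equals the chosen value."""
    return [r for r in result_set if r.get(field) == value]

author_facet = facet_counts(records, "author")   # top authors with counts
year_facet = facet_counts(records, "year")       # one entry per distinct year
dickens_only = narrow(records, "author", "Dickens, Charles, 1812-1870")
```

Nothing here requires that the records were cataloged with facets in mind: any field with repeatable values can be counted and used as a filter, which is the sense in which these facets are "accidents" of the data.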

Subject Facets

Faceting is, to a degree, already incorporated into our subject access systems. Library of Congress subject headings are faceted to some extent, with topic facets, geographic facets, and time facets. The Library of Congress Classification and the Dewey Decimal Classification make some use of facets where they allow entries in the classification to be extended by place, time, or other re-usable subdivisions.

Some systems have taken a page from the FAST book. FAST is Faceted Application of Subject Terminology, and it creates facets by breaking apart the segments of a Library of Congress subject heading such that topics, geographical entries, and time periods become separate entries. FAST itself does more than this, including turning some inverted headings (Erie, Lake) back to their natural order, and other changes. One of the main criticisms of FAST, however, is that it loses the very context that is provided by the composite subject heading. Thus the headings on Moby Dick become Whales / Whaling / Mentally Ill / Fiction, leaving it unclear who or what is mentally ill in this example. (I'm sure there are better examples - send them in!)

Summon system use of facets
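The splitting of a composite heading into typed facets can be sketched roughly as follows. This is a toy illustration and not FAST's actual procedure. It assumes the heading arrives as MARC-style subfields (a = main term, x = general subdivision, y = chronological, z = geographic, v = form), and the subfield coding I give for the sample heading is my own assumption. The function name `split_heading` is hypothetical.

```python
# Map MARC subject subdivision codes to a facet type.
FACET_TYPE = {"x": "topic", "y": "time", "z": "place", "v": "form"}

def split_heading(tag, subfields):
    """Break one composite heading into (facet_type, term) pairs.

    tag: MARC field tag, e.g. "650" (topical) or "651" (geographic).
    subfields: list of (code, value) pairs in heading order.
    """
    main_type = "place" if tag == "651" else "topic"
    facets = []
    for code, value in subfields:
        if code == "a":
            facets.append((main_type, value))
        elif code in FACET_TYPE:
            facets.append((FACET_TYPE[code], value))
    return facets

# Assumed coding of a heading, for illustration only.
heading = ("651", [("a", "United States"), ("x", "History"),
                   ("y", "War of 1812"), ("x", "Campaigns")])

# The composite string keeps the order, and therefore the context.
composite = "--".join(value for _, value in heading[1])

# The split version yields four independent entries.
facets = split_heading(*heading)
```

After the split, "Campaigns" and "History" are free-floating topics; nothing in the output records that the campaigns in question belong to the War of 1812. That is the loss of context described above.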

The Open Library created subject facets from Library of Congress subject headings, and categorizes each by its facet "type":
Open Library subject facets


Although these are laudable attempts to give the user a way to both understand and further refine the retrieved set, there are a number of problems with these implementations, not the least of which is that many of these are not actually facets in the knowledge organization sense of that term. Facets need to be conceptual divisions of the landscape that help a user understand that landscape.
Online sales sites use something that they call faceted classification, although it varies considerably from the concept of faceted classification that originated with S. R. Ranganathan in the 1930s. On a sales site, facets divide the site's products into categories so that users can add those categories to their searches. A search for shoes in general is less useful than a search for shoes done under the categories "men's", "women's" or "children's". In the online sales sense, a facet is a context for the keyword search. Although the universe that these facets govern is much simpler than the entire knowledge universe that libraries must try to handle, at least the concept of context is employed to help the user.

Amazon's facets
While it may be helpful to see who are the most numerous authors in a retrieved set, authorship does not provide a conceptual organization for the user. Next, not everything that can be exploited in a bibliographic record to narrow a result set is necessarily useful. The list of publication dates from the retrieved set is not only too granular to be a useful facet (think of how many different dates there could be) but the likelihood that a user's query can be fulfilled by a publication year datum is scant indeed.

The last problem is really the key here, which is that while isolated bits of data like date or place may help narrow a large result set they do not provide the kind of overall context for searches that a truly faceted system might. However, providing such a view requires that the entries in the library catalog have been classified using a faceted classification system, and that is simply not the case.

Data Mining


I include this because I think it is interesting, although the only real instances of it that I am aware of come from OCLC, which is uniquely positioned to do the kind of "big data" work that this implies. The WorldCat Identities project shows the kind of data that one can extract from a large bibliographic database. Data mining applies best to the bibliographic universe as a whole, rather than individual catalogs, since the latter are by definition incomplete. It would, however, be interesting to see what uses could be made of mined data like WorldCat Identities, for example giving users of individual catalogs information about sources that the library does not hold. It is also a shame that WorldCat Identities appears to have been a one-off and is not being kept up to date.
Emily Dickinson at WorldCat Identities

First Class Objects

A potential that linked data brings (but does not guarantee) is the development of some of the key bibliographic entities into "first class objects". By that I mean that some entities could be the focus of searches on their own, not just as index entries to bibliographic records. Having some entities be first class objects means that, for example, you can have a page for a person that is truly about the person, not just a heading with the personal name in it. This allows you to present the user with additional information, either similar to WorldCat Identities, if you have that information available to you, or taking text from sources like Wikipedia, as Open Library did:
Open Library author page

This was also the model used in the linked data database Freebase (which has now been killed by Google), and is not entirely unlike Google's use of Wikipedia (and other sources) to create its "knowledge graph."
Google Knowledge Graph

The treatment of some things as first class objects is perhaps a step toward the catalog of headings, but the person as an object is not itself a replication of the heading system that is found in bibliographic records, which go beyond the person's name in their organizational function:
Dickens, Charles, 1812-1870--Adaptations.
Dickens, Charles, 1812-1870--Adaptations--Comic books, strips, etc.
Dickens, Charles, 1812-1870--Adaptations--Congresses.
Dickens, Charles, 1812-1870--Aesthetics.
Dickens, Charles, 1812-1870--Anecdotes.
Dickens, Charles, 1812-1870--Anniversaries, etc.
Dickens, Charles, 1812-1870--Appreciation.
Dickens, Charles, 1812-1870--Appreciation--Croatia.
For subject headings, a key aspect of the knowledge map is the inclusion of relationships from broader and narrower terms and related terms. I will not pretend that the existing headings are perfect, as we know they are not, but it is hard to imagine a knowledge organization system that will not make use of these taxonomic concepts in one way or another.
Lake Erie
    See: Erie, Lake
Lake Erie, Battle of, 1813.
    BT: United States--History--War of 1812--Campaigns
Lake Erie, Battle of, 1813--Bibliography.
Lake Erie, Battle of, 1813--Commemoration.
Lake Erie, Battle of, 1813--Fiction.
Lake Erie, Battle of, 1813--Juvenile fiction.
Lake Erie, Battle of, 1813--Juvenile literature.
Lake Erie Transportation Company
    See Also: Erie Railroad Company.
This information is now available through the Library of Congress linked data service, id.loc.gov, and surely, with some effort, these aspects of the "first class entity" (person, place, topic, etc.) could be recovered and made available to the user. Unfortunately (how often have I said that in these posts?), the subject heading authorities were designed as a model for subject heading creation, not as a full list of all possible subject headings, and connecting the authority file, which contains the relationships between terms, mechanically to the headings in bibliographic records is not a snap. Again, what was modeled for the card catalog and worked well in that technology does not translate perfectly to the newer technologies.
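A rough sketch of why that connection is not mechanical: a heading in a bibliographic record often carries subdivisions that have no authority record of their own, so an exact string match fails and some kind of fallback is needed. The toy authority file below contains only the one broader-term relationship from the Lake Erie example, and the function `match_authority` is hypothetical. It illustrates the problem and is not a description of how id.loc.gov or any catalog actually resolves headings.

```python
# Toy authority file: established heading -> list of broader terms (BT).
AUTHORITY = {
    "Lake Erie, Battle of, 1813": [
        "United States--History--War of 1812--Campaigns",
    ],
}

def match_authority(bib_heading, authority=AUTHORITY, sep="--"):
    """Return (matched_authority_heading, leftover_subdivisions).

    Tries the full heading string first, then drops trailing subdivisions
    one at a time until an authority entry matches. Returns
    (None, all_parts) when nothing matches.
    """
    parts = bib_heading.split(sep)
    for end in range(len(parts), 0, -1):
        candidate = sep.join(parts[:end])
        if candidate in authority:
            return candidate, parts[end:]
    return None, parts

# A heading as it might appear in a bibliographic record.
matched, leftover = match_authority("Lake Erie, Battle of, 1813--Juvenile fiction")
broader = AUTHORITY.get(matched, [])
```

Even this naive prefix matching recovers the broader term for the subdivided heading, but it depends on exact strings and consistent punctuation, and it says nothing about what the leftover subdivision means. Real data is rarely that tidy.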

Note that the emphasis on bibliographic entities in FRBR, RDA and BIBFRAME could facilitate such a solution. All three encourage an entity view of data that has traditionally been included in bibliographic records, and that is not entirely opposed to the concept of the separation of bibliographic data and authorities. In addition, FRBR provides a basis for conceptualizing works and editions (FRBR's expression) as separate entities. These latter exist already in many forms in the "real world" as objects of critical thinking, description, and point of sale. The other emphasis in FRBR is on bibliographic relationships. This has helped us understand that relationships are important, although these bibliographic relationships are the tip of the iceberg if we look at user service as a whole.

Next 

Next I want to talk about possibilities. But because I do not have the answers, I am going to present them in the form of questions - because we need first to have questions before we can posit any answers.
