Coyle's InFormation: knowledge organization

Showing posts with label knowledge organization. Show all posts

Monday, April 08, 2019

I, too, want answers

Around 1966-67 I worked on the reference desk at my local public library. For those too young to remember, this was a time when all information was in paper form, and much of that paper was available only at the library. The Internet was just a twinkle in the eye of some scientists at DARPA, and none of us had any idea what kind of information environment was in our future.* The library had a card catalog and the latest thing was that check-outs were somehow recorded on microfilm, as I recall.

As you entered the library the reference desk was directly in front of you in the prime location in the middle of the main room. A large number of library users went directly to the desk upon entering. Some of these users had a particular research in mind: a topic, an author, or a title. They came to the reference desk to find the quickest route to what they sought. The librarian would take them to the card catalog, would look up the entry, and perhaps even go to the shelf with the user to look for the item.**

There was another type of reference request: a request for facts, not resources. If one wanted to know what was the population of Milwaukee, or how many slot machines there were in Saudia Arabia***, one turned to the library for answers. At the reference desk we had a variety of reference materials: encyclopedias, almanacs, dictionaries, atlases. The questions that we could answer quickly were called "ready reference." These responses were generally factual.

Because the ready reference service didn't require anything of the user except to ask the question, we also provided this service over the phone to anyone who called in. We considered ourselves at the forefront of modern information services when someone would call and ask us: "Who won best actor in 1937?" OK, it probably was a bar bet or a crossword puzzle clue but we answered, proud of ourselves.

I was reminded of all this by a recent article in Wired magazine, "Alexa, I Want Answers."[1] The argument as presented in the article is that what people REALLY want is an answer; they don't want to dig through books and journals at the library; they don't even want an online search that returns a page of results; what they want is to ask a question and get an answer, a single answer. What they want is "ready reference" by voice, in their own home, without having to engage with a human being. The article is about the development of the virtual, voice-first, answer machine: Alexa.

There are some obvious observations to be made about this. The glaringly obvious one is that not all questions lend themselves to a single, one sentence answer. Even a question that can be asked concisely may not have a concise answer. One that I recall from those long-ago days on the reference desk was the question: "When did the Vietnam War begin?" To answer this you would need to clarify a number of things: on whose part? US? France? Exactly what do you mean by begin? First personnel? First troops? Even with these details in hand experts would differ in their answers.

Another observation is that in the question/answer method over a voice device like Alexa, replying with a lengthy answer is not foreseen. Voice-first systems are backed by databases of facts, not explanatory texts. Like a GPS system they take facts and render them in a way that seems conversational. Your GPS doesn't reply with the numbers of longitude and latitude, and your weather app wraps the weather data in phrases like: "It's 63 degrees outside and might rain later today." It doesn't, however, offer a lengthy discourse on the topic. Just the facts, ma'am.[3]

It is very troubling that we have no measure of the accuracy of these answers. There are quite a few anecdotes about wrong answers (especially amusing ones) from voice assistants, but I haven't seen any concerted studies of the overall accuracy rate. Studies of this nature were done in the 1970's and 1980's on library reference services, and the results were shocking. Even though library reference was done by human beings who presumably would be capable of detecting wrong answers, the accuracy of answers hovered around 50-60%.[2] Repeated studies came up with similar results, and library journals were filled with articles about this problem. The solution offered was to increase training of reference staff. Before the problem could be resolved, however, users who previously had made use of "ready reference" had moved on to in-sourcing their own reference questions by using the new information system: the Internet. If there still is ready reference occuring in libraries, it is undoubtedly greatly reduced in the number of questions asked, and it doesn't appear that studying the accuracy is on our minds today.

I have one final observation, and that is that we do not know the source(s) of the information behind the answers given by voice assistants. The companies behind these products have developed databases that are not visible to us, and no source information is given for individual answers. The voice-activated machines themselves are not the main product: they are mere user interfaces, dressed up with design elements that make them appealing as home decor. The data behind the machines is what is being sold, and is what makes the machines useful. With all of the recent discussion of algorithmic bias in artificial intelligence we should be very concerned about where these answers come from, and we should seriously consider if "answers" to some questions are even appropriate or desirable.

Now, I have question: how is it possible that so much of our new technology is based on so little intellectual depth? Is reductionism an essential element of technology, or could we do better? I'm not going to ask Alexa**** for an answer to that.

[1] Vlahos, James. “Alexa, I Want Answers.” Wired, vol. 27, no. 3, Mar. 2019, p. 58. (Try EBSCO)
[2] Weech, Terry L. “Review of The Accuracy of Telephone Reference/Information Services in Academic Libraries: Two Studies.” The Library Quarterly: Information, Community, Policy, vol. 54, no. 1, 1984, pp. 130–31.
[3] https://en.wikipedia.org/wiki/Joe_Friday

* The only computers we saw were the ones on Star Trek (1966), and those were clearly a fiction.
** This was also the era in which the gas station attendent pumped your gas, washed your windows, and checked your oil while you waited in your car.
*** The question about Saudia Arabia is one that I actually got. I also got the one about whether there were many "colored people" in Haiti. I don't remember how I answered the former, but I do remember that the user who asked the latter was quite disappointed with the answer. I think he decided not to go.
**** Which I do not have; I find it creepy even though I can imagine some things for which it could be useful.

Tuesday, August 14, 2018

Libraryland, We Have a Problem

The first rule of every multi-step program is to admit that you have a problem. I think it's time for us librarians to take step one and admit that we do have a problem.

The particular problem that I have in mind is the disconnect between library data and library systems in relation to the category of metadata that libraries call "headings." Headings are the strings in the library data that represent those entities that would be entry points in a linear catalog like a card catalog.

It pains me whenever I am an observer to cataloger discussions on the proper formation of headings for items that they are cataloging. The pain point is that I know that the value of those headings is completely lost in the library systems of today, and therefore there are countless hours of skilled cataloger time that are being wasted.

The Heading

Both book and card catalogs were catalogs of headings. The catalog entry was a heading followed by one or more bibliographic entries. Unfortunately, the headings serve multiple purposes, which is generally not a good data practice but is due to the need for parsimony in library data when that data was analog, as in book and card catalogs.

A heading is a unique character string for the "thing" – the person, the corporate body, the family – essentially an identifier.

Tolkien, J. R. R. (John Ronald Reuel), 1892-1973

It supports the selection of the entity in the catalog from among the choices that are presented (although in some cases the effectiveness of this is questionable)

It is an access point, intended to be the means of finding, within the catalog, those items held by the library that meet the need of the user.

It provides the sort order for the catalog entries (which is why you see inverted forms like "Tolkien, J. R. R.")

United States. Department of State. Bureau for Refugee Programs
United States. Department of State. Bureau of Administration
United States. Department of State. Bureau of Administration and Security
United States. Department of State. Bureau of African Affairs

That sort order, and those inverted headings, also have a purpose of collocation of entries by some measure of "likeness"

Tolkien, J. R. R. (John Ronald Reuel), 1892-1973

Tolkien Society

Tolkien Trust

The last three functions, providing a sort order, access, and collocation, have been lost in the online catalog. The reasons for this are many, but the main explanation is that keyword searching has replaced alphabetical browse as a way to locate items in a library catalog.

The upshot is that many hours are spent during the cataloging process to formulate a left-anchored, alphabetically order-able heading that has no functionality in library catalogs other than as fodder for a context-less keyword search.

Once a keyword search is done the resulting items are retrieved without any correlation to headings. It may not even be clear which headings one would use to create a useful order. The set of retrieved bibliographic resources from a single keyword search may not provide a coherent knowledge graph. Here's an illustration using the keyword "darwin":

Gardiner, Anne.

Melding of two spirits : from the "Yiminga" of the Tiwi to the "Yiminga" of Christianity / by Anne Gardiner ; art work by

Darwin : State Library of the Northern Territory, 1993.

Christianity--Australia--Northern Territory.

Tiwi (Australian people)--Religion.

Northern Territory--Religion.

Crabb, William Darwin.

Lyrics of the golden west. By W. D. Crabb.

San Francisco, The Whitaker & Ray company, 1898

West (U.S.)--Poetry.

Darwin, Charles, 1809-1882.

Origin of species by means of natural selection; or, The preservation of favored races in the struggle for life and The descent of man and selection in relation to sex, by Charles Darwin.

New York, The Modern library [1936]

Evolution (Biology)

Natural selection.

Heredity.

Human evolution.

Bear, Greg, 1951-

Darwin's radio / Greg Bear.

New York : Ballantine Books, 2003.

Women molecular biologists--Fiction.

DNA viruses--Fiction.

No matter what you would choose as a heading on which to order these, it will not produce a sensible collocation that would give users some context to understand the meaning of this particular set of items – and that is because there is no meaning to this set of items, just a coincidence of things named "Darwin."

Headings that have been chosen to be controlled strings should offer a more predictable user search experience than free text searching, but headings do not necessarily provide collocation. As an example, Wikipedia uses the names of its pages as headings, and there are some rules (or at least preferred practices) to make the headings sensible. A search in Wikipedia is a left-to-right search on a heading string that is presented as a drop-down list of a handful of headings that match the search string:

Included in the headings in the drop-down are "see"-type terms that, when selected, take the user directly to the entry for the preferred term. If there is no one preferred term Wikipedia directs users to disambiguation pages to help users select among similar headings:

The Wikipedia pages, however, only provide accidental collocation, not the more comprehensive collocation that libraries aim to attain. That library-designed collocation, however, is also the source of the inversion of headings, making those strings unnatural and unintuitive for users. Although the library headings are admirably rules based, they often use rules that will not be known to many users of the catalog, such as the difference in name headings with prepositions based on the language of the author. To search on these names, one therefore needs to know the language of the author and the rule that is applied to that language, something that I am quite sure we can assume is not common knowledge among catalog users.

De la Cruz, Melissa

Cervantes Saavedra, Miguel de

I may be the only patron of my small library branch that has known to look for the mysteries by Icelandic author Arnaldur Indriðason under "A" not "I".

What Is To Be Done?

There isn't an easy (or perhaps not even a hard) answer. As long as humans use words to describe their queries we will have the problem that words and concepts, and words and relationships between concepts, do not neatly coincide.

I see a few techniques that might be used if we wish to save collocation by heading. One would be to allow keyword searching but for the system to use that to suggest headings that then can be used to view collocated works. Some systems do allow users to retrieve headings by keyword, but headings, which are very terse, are often not self-explanatory without the items they describe. A browse of headings alone is much less helpful that the association of the heading with the bibliographic data it describes. Remember that headings were developed for the card catalog where they were printed on the same card that carried the bibliographic description.

Another possible area of investigation would be to look to the classified catalog, a technique that has existed alongside alphabetical catalogs for centuries. The Decimal Classification of Dewey was a classified approach to knowledge with a language-based index (his "Relativ Index") to the classes. (It is odd that the current practice in US libraries is to have one classification for items on shelves and an unrelated heading system (LCSH) for subject access.)

The classification provides the intellectual collocation that the headings themselves do not provide. The difficulty with this is that the classification collocates topically but, at least in its current form, does not collocate the name headings in the catalog that identify people and organizations as entities.

Conclusion (sort of)

Controlled headings as access points for library catalogs could provide better service than keyword search alone. How to make use of headings is a difficult question. The first issue is how to exploit the precision of headings while still allowing users to search on any terms that they have in mind. Keyword search is, from the user's point of view, frictionless. They don't have to think "what string would the library have used for this?".

Collocation of items by topical sameness or other relationships (e.g. "named for", "subordinate to") is possibly the best service that libraries could provide, although it is very hard to do this through the mechanism of language strings. Dewey's original idea of a classified order with a language-based index is still a good one, although classifications are hard to maintain and hard to assign.

If challenged to state what I think the library catalog should be, my answer would be that it should provide a useful order that illustrates one or more intellectual contexts that will help the user enter and navigate what the library has to offer. Unfortunately I can't say today how we could do that. Could we think about that together?

Readings

Dewey, Melvil. Decimal classification and relativ index for libraries, clippings, notes, etc. Edition 7. Lake Placid Club, NY., Forest Press, 1911. https://archive.org/details/decimalclassifi00dewegoog

Shera, Jesse H, Margaret E. Egan, and Jeannette M. Lynn. The Classified Catalog: Basic Principles and Practices. Chicago, Ill: American Library Association, 1956

Sunday, December 18, 2016

Transparency of judgment

The Guardian, and others, have discovered that when querying Google for "did the Holocaust really happen", the top response is a Holocaust denier site. They mistakenly think that the solution is to lower the ranking of that site.

The real solution, however, is different. It begins with the very concept of the "top site" from searches. What does "top site" really mean? It means something like "the site most often pointed to by other sites that are most often pointed to." It means "popular" -- but by an unexamined measure. Google's algorithm doesn't distinguish fact from fiction, or scientific from nutty, or even academically viable from warm and fuzzy. Fan sites compete with the list of publications of a Nobel prize-winning physicist. Well, except that they probably don't, because it would be odd for the same search terms to pull up both, but nothing in the ranking itself makes that distinction.

The primary problem with Google's result, however, is that it hides the relationships that the algorithm itself uses in the ranking. You get something ranked #1 but you have no idea how Google arrived at that ranking; that's a trade secret. By not giving the user any information on what lies behind the ranking of that specific page you eliminate the user's possibility to make an informed judgment about the source. This informed judgment is not only about the inherent quality of the information in the ranked site, but also about its position in the complex social interactions surrounding knowledge creation itself.

This is true not only for Holocaust denial but every single site on the web. It is also true for every document that is on library shelves or servers. It is not sufficient to look at any cultural artifact as an isolated case, because there are no isolated cases. It is all about context, and the threads of history and thought that surround the thoughts presented in the document.

There is an interesting project of the Wikimedia Foundation called "Wikicite." The goal of that project is to make sure that specific facts culled from Wikipedia into the Wikidata project all have citations that support the facts. If you've done any work on Wikipedia you know that all statements of fact in all articles must come from reliable third-party sources. These citations allow one to discover the background for the information in Wikipedia, and to use that to decide for oneself if the information in the article is reliable, and also to know what points of view are represented. A map of the data that leads to a web site's ranking on Google would serve a similar function.

Another interesting project is CITO, the Citation Typing Ontology. This is aimed at scholarly works, and it is a vocabulary that would allow authors to do more than just cite a work - they could give a more specific meaning to the citation, such as "disputes", "extends", "gives support to". A citation index could then categorize citations so that you could see who are the deniers of the deniers as well as the supporters, rather than just counting citations. This brings us a small step, but a step, closer to a knowledge map.

All judgments of importance or even relative position of information sources must be transparent. Anything else denies the value of careful thinking about our world. Google counts pages and pretends not to be passing judgment on information, but they operate under a false flag of neutrality that protects their bottom line. The rest of us need to do better.