Coyle's InFormation: 2016

Sunday, December 18, 2016

Transparency of judgment

The Guardian, and others, have discovered that when querying Google for "did the Holocaust really happen", the top response is a Holocaust denier site. They mistakenly think that the solution is to lower the ranking of that site.

The real solution, however, is different. It begins with the very concept of the "top site" from searches. What does "top site" really mean? It means something like "the site most often pointed to by other sites that are most often pointed to." It means "popular" -- but by an unexamined measure. Google's algorithm doesn't distinguish fact from fiction, or scientific from nutty, or even academically viable from warm and fuzzy. Fan sites compete with the list of publications of a Nobel prize-winning physicist. Well, except that they probably don't, because it would be odd for the same search terms to pull up both, but nothing in the ranking itself makes that distinction.
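The "pointed to by sites that are themselves pointed to" idea can be sketched as a small link-popularity iteration. This is only a textbook-style sketch in the spirit of PageRank, not Google's actual ranking; the site names, the link graph, and the damping value are all assumptions made for illustration.

```python
def link_popularity(links, damping=0.85, iterations=50):
    """Score pages by incoming links, counting links from high-scoring pages for more."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_scores = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * scores[page] / n
                for p in pages:
                    new_scores[p] += share
        scores = new_scores
    return scores


# Hypothetical link graph: who points to whom.
links = {
    "fan-site": ["popular-hub"],
    "blog": ["popular-hub", "physicist-publications"],
    "popular-hub": ["fan-site"],
    "physicist-publications": [],
}

scores = link_popularity(links)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Nothing in the calculation looks at what a page says; the scores come entirely from the shape of the link graph.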

The primary problem with Google's result, however, is that it hides the relationships that the algorithm itself uses in the ranking. You get something ranked #1 but you have no idea how Google arrived at that ranking; that's a trade secret. By not giving the user any information on what lies behind the ranking of that specific page, Google eliminates the user's ability to make an informed judgment about the source. This informed judgment is not only about the inherent quality of the information in the ranked site, but also about its position in the complex social interactions surrounding knowledge creation itself.

This is true not only for Holocaust denial but for every single site on the web. It is also true for every document that is on library shelves or servers. It is not sufficient to look at any cultural artifact as an isolated case, because there are no isolated cases. It is all about context, and the threads of history and thought that surround the thoughts presented in the document.

There is an interesting project of the Wikimedia Foundation called "Wikicite." The goal of that project is to make sure that specific facts culled from Wikipedia into the Wikidata project all have citations that support the facts. If you've done any work on Wikipedia you know that all statements of fact in all articles must come from reliable third-party sources. These citations allow one to discover the background for the information in Wikipedia, and to use that to decide for oneself if the information in the article is reliable, and also to know what points of view are represented. A map of the data that leads to a web site's ranking on Google would serve a similar function.

Another interesting project is CITO, the Citation Typing Ontology. This is aimed at scholarly works, and it is a vocabulary that would allow authors to do more than just cite a work - they could give a more specific meaning to the citation, such as "disputes", "extends", "gives support to". A citation index could then categorize citations so that you could see who are the deniers of the deniers as well as the supporters, rather than just counting citations. This brings us a small step, but a step, closer to a knowledge map.
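A citation index that categorizes citations rather than just counting them might look something like the following. This is a minimal sketch: the relation names are plain strings modeled on the citation types mentioned above, and the paper identifiers are hypothetical.

```python
from collections import defaultdict

# (citing work, type of citation, cited work); all identifiers are hypothetical.
citations = [
    ("paper-B", "disputes", "paper-A"),
    ("paper-C", "supports", "paper-A"),
    ("paper-D", "disputes", "paper-B"),
    ("paper-E", "extends", "paper-A"),
]


def typed_citation_index(citations):
    """Group citing works by the work they cite and by the type of citation."""
    index = defaultdict(lambda: defaultdict(list))
    for citing, relation, cited in citations:
        index[cited][relation].append(citing)
    return index


index = typed_citation_index(citations)

# A plain citation count treats support and dispute alike.
raw_count = sum(len(citing) for citing in index["paper-A"].values())

# A typed index can separate them, and can follow the chain one step further.
disputers = index["paper-A"]["disputes"]
deniers_of_deniers = [c for d in disputers for c in index[d]["disputes"]]

print(raw_count, disputers, deniers_of_deniers)
```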

All judgments of importance or even relative position of information sources must be transparent. Anything else denies the value of careful thinking about our world. Google counts pages and pretends not to be passing judgment on information, but they operate under a false flag of neutrality that protects their bottom line. The rest of us need to do better.

Tuesday, December 13, 2016

All the (good) books

I have just re-read Ray Bradbury's Fahrenheit 451, and it was better than I had remembered. It holds up very well for a book first published in 1953. I was reading it as an example of book worship, as part of my investigation into examples of an irrational love of books. What became clear, however, is that this book does not describe an indiscriminate love, not at all.

I took note of the authors and the individual books that are actually mentioned in Fahrenheit 451. Here they are (hopefully a complete list):

Authors: Dante, Swift, Marcus Aurelius, Shakespeare, Plato, Milton, Sophocles, Thomas Hardy, Ortega y Gasset, Schweitzer, Einstein, Darwin, Gandhi, Gautama Buddha, Confucius, Thomas Love Peacock, Thomas Jefferson, Lincoln, Tom Paine, Machiavelli, Christ, Bertrand Russell.

Books: Little Black Sambo, Uncle Tom's Cabin, the Bible, Walden

I suspect that by the criteria with which Bradbury chose his authors, he himself, merely an author of popular science fiction, would not have made his own list. Of the books, the first two were used to illustrate books that offended.

"Don't step on the toes of the dog lovers, the cat lovers, doctors, lawyers, merchants, chiefs, Mormons, Baptists, Unitarians, second-generation Chinese, Swedes, Italians, Germans, Texans, Brooklynites, Irishmen, people from Oregon or Mexico."
...
"Colored people don't like Little Black Sambo. Burn it. White people don't feel good about Uncle Tom's Cabin. Burn it. Someone's written a book on tobacco and cancers of the lungs? The cigarette people are weeping? Burn the book. Serenity, Montag."

The other two were examples of books that were being preserved.

Bradbury was a bit of a social curmudgeon, and in terms of books decidedly a traditionalist. He decried the dumbing down of American culture, with digests of books (perhaps prompted by Reader's Digest Condensed Books, which began in 1950), then "digest-digests, digest-digest-digests," then with books being reduced to one or two sentences, and television keeping people occupied but without any perceptible content. (Although he pre-invents a number of recognizable modern technologies, such as earbuds, he fails to anticipate the popularity of George R. R. Martin and other writers of brick-sized tomes.)

Fahrenheit 451 is not a worship of books, but of their role in preserving a certain culture. The "book-people" who each had memorized a book or a chapter hoped to see those become the canon once the new "dark ages" had ended. This was not a preservation of all books but of a small selection of books. That is, of course, exactly what happened in the original dark ages, although the potential corpus then was much smaller: only those texts that had been carefully copied and preserved, and in small numbers, were available for distribution once printing technology became available. Those manuscripts were converted to printed texts, and the light came back on in Europe, albeit with some dark corners un-illuminated where texts had been lost.

Another interesting author on the topic of preservation, but less well-known, is Louis-Sébastien Mercier, writing in 1772 in his utopian novel of the future, Memoirs of the Year Two Thousand Five Hundred.* In his book he visits the King's Library in that year to find that there is only a small cabinet holding the entire book collection. He asks the librarians whether some great fire had destroyed the books, but they answered instead that it was a conscious selection.

"Nothing leads the mind farther astray than bad books; for the first notions being adopted without attention, the second become precipitate conclusion; and men thus go on from prejudice to prejudice, and from error to error. What remained for us to do, but to rebuild the structure of human knowledge?" (v. 2, p. 5)

The selection criteria eliminated commentaries ("works of envy or ignorance") but kept original works of discovery or philosophy. These people also saw a virtue in abridging works to save the time of the reader. Not all works that we would consider "classics" were retained:

"In the second division, appropriated to the Latin authors, I found Virgil, Pliny, and Titus Livy entire; but they had burned Lucretius, except some poetic passages, because his physics they found false, and his morals dangerous." (v. 2, p.9)

In this case, books are selectively burned because they are considered inferior, a waste of the reader's time or tending to lead one in a less than moral direction. Although Mercier doesn't say so, he is implying a problem of information overload.

In Bradbury's book the goal was to empty the minds of the population, make them passive, not thinking. Mercier's world was gathering all of the best of human knowledge, perhaps even re-writing it, as Paul Otlet proposed. (More on him in a moment.) Mercier's year 2500 world eliminated all the works of commentary on other works, treating them like unimportant rantings on today's social networks. Bradbury also did not mention secondary sources; he names no authors of history (although we don't know how he thought of Bertrand Russell, as philosopher or also a historian) or works of literary criticism.

Both Bradbury and Mercier would be considered well-read. But we are all like the blind men and the elephant. We all operate based on the information we have. Bradbury and Mercier each had very different minds because they had been informed by what they had read. For the mind it is "you are what you see and read." Mercier could not have named Thoreau and Bradbury did not mention any French philosophers. Had they each saved a segment of the written output of history their choices would have been very different with little overlap, although they both explicitly retain Shakespeare. Their goals, however, run in parallel, and in both cases the goal is to preserve those works that merit preserving so that they can be read now and in the future.

In another approach to culling the mass of books and other papers, Kurt Vonnegut, in his absurdist manner, addressed the problem as one of information overload:

"In the year Ten Million, according to Koradubian, there would be a tremendous house-cleaning. All records relating to the period between the death of Christ and the year One Million A.D. would be hauled to dumps and burned. This would be done, said Koradubian, because museums and archives would be crowding the living right off the earth.

The million-year period to which the burned junk related would be summed up in history books in one sentence, according to Koradubian: Following the death of Jesus Christ, there was a period of readjustment that lasted for approximately one million years." (Sirens of Titan, p. 46)

While one hears often about a passion for books, some disciplines rely on other types of publications, such as journal articles and conference papers. The passion for books rarely includes these except occasionally by mistake, such as the bound journals that were scanned by Google in its wholesale digitization of library shelves, and the aficionados of non-books are generally limited to specific forms, such as comic books. In the late 19th and early 20th century, Belgian Paul Otlet, a fascinating obsessive whose lifetime and interests coincided with those of our own homegrown bibliographic obsessive, Melvil Dewey, began work leading to his creation of what was intended to be a universal bibliography that included both books and journal articles, as well as other publications. Otlet's project was aimed at all knowledge, not just that contained in books, and his organization solicited books and journals from European and North American learned societies, especially those operating in scientific areas. As befits a project with the grandiose goal of cataloging all of the world's information, Otlet named it the Mundaneum. Otlet represents another selection criterion, because his Mundaneum appears to have been limited to academic materials and serious works; at the least, there is no mention of fiction or poetry in what I have read on the topic.

Among Otlet's goals was to pull out information buried in books and bring related bits of information together. He called the result of this a Biblion. This Biblion sounds somewhat related to the abridgments and re-gatherings of information that Mercier describes in his book. It also sounds like what motivated the early encyclopedists. To Otlet, the book format was a barrier, since his goal was not the preservation of the volumes themselves, but the creation of a centralized knowledge base.

So now we have a range of book preservation goals, from all the books to all the good books, and then to the useful information in books. Within the latter two we see that each selection represents a fairly limited viewpoint that would result in a loss of a large number of the books and other materials that are held in research libraries today. For those of us in libraries and archives, the need is to optimize quality without being arbitrary, and at the same time to serve a broad intellectual and creative base. We won't be as perfect as Otlet or as strict as the librarians in the year 2500, but hopefully our preservation practices will be more predictable than the individual choices made by Bradbury's "human books."

* In the original French, the title referred to the year 2440 ("L'An 2440, rêve s'il en fut jamais"). I have no idea why it was rounded up to 2500 in the English translation.

Works cited or used

Bradbury, Ray. Fahrenheit 451. New York: Ballantine, 1953

Mercier, Louis-Sébastien. Memoirs of the year two thousand five hundred, London, Printed for G. Robinson, 1772 (HathiTrust copy)

Vonnegut, Kurt. The Sirens of Titan. New York: Dial Press Trade Paperbacks, 2006

Wright, Alex. Cataloging the World: Paul Otlet and the Birth of the Information Age. New York, NY : Oxford University Press, 2014.

Monday, November 28, 2016

All the Books

I just joined the Book of the Month Club. This is a throwback to my childhood, because my parents were members when I was young, and I still have some of the books they received through the club. I joined because my reading habits are narrowing, and I need someone to recommend books to me. And that brings me to "All the Books."

"All the Books" is a writing project I've had on my computer and in notes ever since Google announced that it was digitizing all the books in the world. (It did not do this.) The project was lauded in an article by Kevin Kelly in the New York Times Magazine of May 14, 2006, which he prefaced with:

"What will happen to books? Reader, take heart! Publisher, be very, very afraid. Internet search engines will set them free. A manifesto."

There are a number of things to say about All the Books. First, one would need to define "All" and "Books". (We can probably take "the" as it is.) The Google scanning projects defined this as "all the bound volumes on the shelves of certain libraries, unless they had physical problems that prevented scanning." This of course defines neither "All" nor "Books".

Next, one would need to gather the use cases for this digital corpus. Through the HathiTrust project we know that a small number of scholars are using the digital files for research into language usage over time. Others are using the files to search for specific words or names, discovering new sources of information about possibly obscure topics. As far as I can tell, no one is using these files to read books. The Open Library, on the other hand, is lending digitized books as ebooks for reading. This brings us to the statement that was made by a Questia sales person many years ago, when there were no ebooks and screens were those flickery CRTs: "Our books are for research, not reading." Given that their audience was undergraduate students trying to finish a paper by 9:30 the next morning, this was an actual use case with actual users. But the fact that one does research in texts one does not read is, of course, not ideal from a knowledge acquisition point of view.

My biggest beef with "All the Books" is that it treats them as an undifferentiated mass, as if all the books are equal. I always come back to the fact that if you read one book every week for 60 years (which is a good pace) you will have read 3,120. Up that to two books a week and you've covered 6,240 of the estimated 200-300 million books represented in WorldCat. The problem isn't that we don't have enough books to read; the problem is finding the 3,000-6,000 books that will give us the knowledge we need to face life, and be a source of pleasure while we do so. "All the Books" ignores the heights of knowledge, of culture, and of art that can be found in some of the books. Like Sarah Palin's response to the question "Which newspapers form your world view?", "all of them" is inherently an anti-intellectual answer, either by someone who doesn't read any of them, or who isn't able to distinguish the differences.
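The arithmetic behind those figures is easy to check. The sketch below assumes 52 weeks in a year and uses the low end of the WorldCat estimate, which gives the most generous share.

```python
WEEKS_PER_YEAR = 52        # assumption: no weeks off
READING_YEARS = 60
WORLDCAT_LOW_ESTIMATE = 200_000_000  # low end of the 200-300 million estimate

one_a_week = 1 * WEEKS_PER_YEAR * READING_YEARS
two_a_week = 2 * WEEKS_PER_YEAR * READING_YEARS

# Fraction of the estimated corpus covered by even the faster reader.
share = two_a_week / WORLDCAT_LOW_ESTIMATE

print(one_a_week, two_a_week, f"{share:.6%}")
```

Even at two books a week, a lifetime of reading covers a few thousandths of one percent of the estimated total.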

"All the Books" is a complex concept. It includes religious identity; the effect of printing on book dissemination; the loss of Latin as a universal language for scholars; the rise of non-textual media. I hope to hunker down and write this piece, but meanwhile, this is a taste.

Monday, September 26, 2016

2 Mysteries Solved!

One of the disadvantages of a long tradition is that the reasons behind certain practices can be lost over time. This is definitely the case with many practices in libraries, and in particular in practices affecting the library catalog. In U.S. libraries we tend to date our cataloging practices back to Panizzi, in the 1830's, but I suspect that he was already building on practices that preceded him.

A particular problem with this loss of history is that without the information about why a certain practice was chosen it becomes difficult to know if or when you can change the practice. This is compounded in libraries by the existence of entries in our catalogs that were created long before us and by colleagues whom we can no longer consult.

I was recently reading through volume one of the American Library Journal from the year 1876-1877. The American Library Association had been founded in 1876 and had its first meeting in Philadelphia in September, 1876. U.S. librarianship finally had a focal point for professional development. From the initial conference there were a number of ALA committees working on problems of interest to the library community. A Committee on Cooperative Cataloguing, led by Melvil Dewey (who had not yet been able to remove the "u" from "cataloguing"), was proposing that cataloging of books be done once, centrally, and shared, at a modest cost, with other libraries that purchased the same book. This was realized in 1902 when the Library of Congress began selling printed card sets. We still have cooperative cataloging, 140 years later, and it has had a profound effect on the ability of American libraries to reduce the cost of catalog creation.

Other practices were set in motion in 1876-1877, and two of these can be found in that inaugural volume. They are also practices whose rationales have not been obvious to me, so I was very glad to solve these mysteries.

Title case

Some time ago I asked on Autocat, out of curiosity, why libraries use sentence case for titles. No one who replied had more than a speculative answer. In 1877, however, Charles Ammi Cutter reports on The Use of Capitals in library cataloging and defines a set of rules that can be followed. His main impetus is "readability;" that "a profusion of capitals confuses rather than assists the eye...." (He also mentions that this is not a problem with the Bodleian library catalog, as that is written in Latin.)

Cutter would have preferred that capitals be confined to proper names, eschewing their use for titles of honor (Rev., Mrs., Earl) and initialisms (A.D.). However, he said that these uses were so common that he didn't expect to see them changed, and so he conceded them.

All in all, I think you will find his rules quite compelling. I haven't looked at how they compare to any such rules in RDA. So much still to do!

Centimeters

I have often pointed out, although it would be obvious to anyone who has the time to question the practice, that books are measured in centimeters in Anglo-American catalogs, although there are few cultures as insistent on measuring in inches and feet as those. It is particularly un-helpful that books in libraries are cataloged with a height measurement in centimeters while the shelves that they are destined for are measured in inches. It is true that the measurement forms part of the description of the book, but at least one use of that is to determine on which shelves those books can be placed. (Note that in some storage facilities, book shelves are more variable in height than in general library collections and the size determination allows for more compact storage.) If I were to shout out to you "37 centimeters" you would probably be hard-pressed to reply quickly with the same measurement in inches. So why do we use centimeters?
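For anyone who does want the reply in inches, the conversion can be sketched as follows. It relies on the standard definition of 2.54 centimeters to the inch; rounding to the nearest sixteenth is an assumption made to mimic a ruler.

```python
from fractions import Fraction

CM_PER_INCH = Fraction(254, 100)  # one inch is defined as exactly 2.54 cm


def cm_to_inches(cm, denominator=16):
    """Convert centimeters to inches, rounded to the nearest 1/denominator inch."""
    inches = Fraction(cm) / CM_PER_INCH
    return Fraction(round(inches * denominator), denominator)


height = cm_to_inches(37)
whole = height.numerator // height.denominator  # full inches
remainder = height - whole                      # leftover fraction of an inch

print(f"37 cm is about {whole} {remainder} in")
```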

The newly-formed American Library Association had a Committee on Sizes. This committee had been given the task of developing a set of standard size designations for books. The "size question" had to do with the then-current practice of listing sizes as folio, quarto, etc. Apparently the rise of modern paper making and printing meant that those were no longer the actual sizes of books. In his article (pp. 56-61), Charles Evans argued that actual measurements of the books, in inches, should replace the previous list of standard sizes. Later, however, the use of inches was questioned. At the ALA meeting, W.F. Poole (of Poole's indexes) made the following statement (p. 109):

"The expression of measure in inches, and vulgar fractions of an inch, has many disadvantages, while the metric decimal system is simple, and doubtless will soon come into general use."

The committee agreed with this approach, and concluded:

"The committee have also reconsidered the expediency of adopting the centimeter as a unit, in accordance with the vote at Philadelphia, querying whether it were really best to substitute this for the familiar inch. They find on investigation that even the opponents of the metric system acknowledge that it is soon to come into general use in this country; that it is already adopted by nearly every other country of importance except England; that it is in itself a unit better adapted to our wants than the inch, which is too large for the measurement of books." (p. 180)

The members of the committee were James L. Whitney, Charles A. Cutter, and Melvil Dewey, the latter having formed the American Metric Bureau in July of 1876, both a kind of lobbying organization and a sales point for metric measures. My guess is that the "investigation" was a chat amongst themselves, and that Dewey was unmovable when it came to using metric measures, although he appears not to have been alone in that. I do love the fact that the inch is "too large," and that its fractions (1/16, etc.) are "vulgar."

Dewey and cohort obviously weren't around when compact discs came on the scene, because those are measured in inches ("1 sound disc : digital ; 4 3/4 in"). However, maps get the metric treatment: "1 map : col. ; 67 x 53 cm folded to 23 x 10 cm". Somewhere there is a record of these decisions, and I hope to come across it.

It would have been ideal if the U.S. had gone metric when Dewey encouraged that move. I suspect that our residual umbilical cord linking us to England is what scuppered that. Yet it is a wonder that we still use those too large, vulgar measurements. Dewey would be very disappointed to learn this.

So there it is, two of the great mysteries solved in the record of the very first year of the American library profession. Here are the readings; I created separate PDFs for the two most relevant sections:

American Library Journal, volume 1, 1876-1877 (from the Internet Archive)
Cutter, Charles A. The use of capitals. American Library Journal, v.1, n. 4-5, 1877. pp. 162-166
The Committee on Sizes of Books, American Library Journal, v.1, n. 4-5, 1877, pages 178-181

Also note that beginning on page 92 there is a near verbatim account of every meeting at the first American Library Association conference in Philadelphia, September, 1876. So verbatim that it includes the mention of who went out for a smoke and missed a key vote. And the advertisements! Give it a look.

Wednesday, August 31, 2016

User tasks, Step one

Brian C. Vickery, one of the greats of classification theory and a key person in the work of the Classification Research Group (active from 1952 to 1968), gave this list of the stages of "the process of acquiring documentary information" in his 1959 book Classification and Indexing in Science[1]:

Identifying the subject of the search.
Locating this subject in a guide which refers the searcher to one or more documents.
Locating the documents.
Locating the required information in the documents.

These overlap somewhat with FRBR's user tasks (find, identify, select, obtain) but the first step in Vickery's group is my focus here: Identifying the subject of the search. It is a step that I do not perceive as implied in the FRBR "find", and is all too often missing from library/user interactions today.

A person walks into a library...

Presumably, libraries are an organized knowledge space. If they weren't, the books would just be thrown onto the nearest shelf, and subject cataloging would not exist. However, if this organization isn't both visible to and comprehended by users, we are, firstly, not getting the return on our cataloging investment and, secondly, users are not getting the full benefit of the library.

In Part V of my series on Catalogs and Context, I had two salient quotes. One by Bill Katz: "Be skeptical of the information the patron presents"[2]; the other by Pauline Cochrane: "Why should a user ever enter a search term that does not provide a link to the syndetic apparatus and a suggestion about how to proceed?"[3]. Both of these address the obvious, yet often overlooked, primary point of failure for library users, which is the disconnect between how the user expresses his information need and the terms assigned by the library to the items that may satisfy that need.

Vickery's Three Issues for Stage 1

Issue 1: Formulating the topic

Vickery talks about three issues that must be addressed in his first stage, identifying the subject on which to search in a library catalog or indexing database. The first one is "...the inability even of specialist enquirers always to state their requirements exactly..." [1 p.1] That's the "reference interview" problem that Katz writes about: the user comes to the library with an ill-formed expression of what they need. We generally consider this to be outside the boundaries of the catalog, which means that it only exists for users who have an interaction with reference staff. Given that most users of the library today are not in the physical library, and that online services (from Google to Amazon to automated courseware) have trained users that successful finding does not require human interaction, these encounters with reference staff are a minority of the user-library sessions.

In online catalogs, we take what the user types into the search box as an appropriate entry point for a search, even though another branch of our profession is based on the premise that users do not enter the library with a perfectly formulated question, and need an intelligent intervention to have a successful interaction with the library. Formulating a precise question may not be easy, even for experienced researchers. For example, in a search about serving persons who have been infected with HIV, you may need to decide whether the research must distinguish persons who are HIV positive from those who have moved along the spectrum to a medical diagnosis of AIDS. This decision is directly related to the search that will need to be done:

HIV-positive persons--Counseling of
AIDS (Disease)--Patients--Counseling of

Issue 2: from topic to query

The second of Vickery's caveats is that "[The researcher] may have chosen the correct concepts to express the subject, but may not have used the standard words of the index."[1 p.4] This is the "entry vocabulary" issue. What user would guess that the question "Where all did Dickens live?" would be answered with a search using "Dickens, Charles -- Homes and haunts"? And that all of the terms listed as "use for" below would translate to the term "HIV (Viruses)" in the catalog? (h/t Netanel Ganin):

As Pauline Cochrane points out[4], beginning in the latter part of the 20th century, libraries found themselves unable to include the necessary cross-reference information in their card catalogs, due to the cost of producing the cards. Instead, they asked users to look up terms in the subject heading reference books used by catalog librarians to create the headings. These books are not available to users of online catalogs, and although some current online catalogs include authorized alternate entry points in their searches, many do not.* This means that we have multiple generations of users who have not encountered "term switching" in their library catalog usage, and who probably do not understand its utility.
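The "term switching" described here can be sketched as a lookup that runs before the search itself. This is a minimal sketch, not a description of how any particular catalog system works: the single cross reference (carcinoma to Cancer) is the one discussed in the note at the end of this post, while the records and the function names are hypothetical.

```python
# Cross references: the user's entry term on the left, the authorized heading on the right.
USE_FOR = {
    "carcinoma": "Cancer",
}

# Hypothetical bibliographic records with their assigned subject headings.
RECORDS = [
    {"title": "Hypothetical title one", "subjects": ["Cancer"]},
    {"title": "Hypothetical title two", "subjects": ["Cancer", "Nutrition"]},
    {"title": "Hypothetical title three", "subjects": ["French literature"]},
]


def switch_term(term):
    """Replace an entry term with its authorized heading, if a cross reference exists."""
    cleaned = term.strip()
    return USE_FOR.get(cleaned.lower(), cleaned)


def subject_search(term, records, switch=True):
    """Return titles whose subject headings match the (optionally switched) term."""
    heading = switch_term(term) if switch else term.strip()
    return [
        record["title"]
        for record in records
        if any(subject.lower() == heading.lower() for subject in record["subjects"])
    ]


with_switching = subject_search("carcinoma", RECORDS)
without_switching = subject_search("carcinoma", RECORDS, switch=False)
print(len(with_switching), len(without_switching))
```

Without the cross reference the user's perfectly reasonable term retrieves nothing, and nothing tells the user why.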

Even with such a terminology-switching mechanism, finding the proper entry in the catalog is not at all simple. The article by Thomas Mann (of Library of Congress, not the German author) on “The Peloponnesian War and the Future of Reference, Cataloging, and Scholarship in Research Libraries” [5] shows not only how complex that process might be, but it also indicates that the translation can only be accomplished by a library-trained expert. This presents us with a great difficulty because there are not enough such experts available to guide users, and not all users are willing to avail themselves of those services. How would a user discover that literature is French, but performing arts are in France?:

French literature
Performing arts -- France -- History

Or, using the example in Mann's piece, the searcher looking for information on tribute payments in the Peloponnesian war needed to look under "Finance, public–Greece–Athens". This type of search failure fuels the argument that full text search is a better solution, and a search of Google Books on "tribute payments Peloponnesian war" does yield some results. The other side of the argument is that full text searches fail to retrieve documents not in the search language, while library subject headings apply to all materials in all languages. Somehow, this latter argument, in my experience, doesn't convince.

Issue 3: term order

The third point by Vickery is one that keyword indexing has solved, which is "...the searcher may use the correct words to express the subject, but may not choose the correct combination order."[1 p.4] In 1959, when Vickery was writing this particular piece, having the wrong order of terms resulted in a failed search. Mann, however, would say that with keyword searching the user does not encounter the context that the pre-coordinated headings provide; thus keyword searching is not a solution at all. I'm with him part way, because I think keyword searching as an entry to a vocabulary can be useful if the syndetic structure is visible with such a beginning. Keyword searching directly against bibliographic records, less so.
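The difference between the two kinds of lookup can be sketched as follows. The headings are the ones quoted earlier in this post; treating a card-catalog lookup as a match on the start of the heading is a simplifying assumption, and the function names are hypothetical.

```python
import re

HEADINGS = [
    "Performing arts -- France -- History",
    "French literature",
]


def words(text):
    """Lower-cased set of the words in a heading or query."""
    return set(re.findall(r"\w+", text.lower()))


def ordered_match(query, headings):
    """Card-catalog style: the query must match the start of the heading, in order."""
    return [h for h in headings if h.lower().startswith(query.lower())]


def keyword_match(query, headings):
    """Keyword style: every query word must appear in the heading, in any order."""
    return [h for h in headings if words(query) <= words(h)]


wrong_order = ordered_match("France -- Performing arts", HEADINGS)
any_order = keyword_match("France performing arts", HEADINGS)
wrong_vocabulary = keyword_match("French performing arts", HEADINGS)

print(wrong_order, any_order, wrong_vocabulary)
```

Keyword matching removes the order problem, but a user who types "French" where the heading says "France" still comes up empty, which is the entry vocabulary issue again.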

Comparison to FRBR "find"

FRBR's "find" is described as "to find entities that correspond to the user’s stated search criteria". [6 p. 79] We could presume that in FRBR the "user's stated search criteria" has either been modified through a prior process (although I hardly know what that would be, other than a reference interview), or that the library system has the capability to interact with the user in such a way that the user's search is optimized to meet the terminology of the library's knowledge organization system. This latter would require some kind of artificial intelligence and seems unlikely. The former simply does not happen often today, with most users being at a computer rather than a reference desk. FRBR's find seems to carry the same assumption as has been made functional in online catalogs, which is that the appropriateness of the search string is not questioned.

Summary

There are two take-aways from this set of observations:

We are failing to help users refine their query, which means that they may actually be basing their searches on concepts that will not fulfill their information need in the library catalog.
We are failing to help users translate their query into the language of the catalog(s).

I would add that the language of the catalog should show users how the catalog is organized and how the knowledge universe is addressed by the library. This is implied in the second take-away, but I wanted to bring it out specifically, because it is a failure that particularly bothers me.

Notes

*I did a search in various catalogs on "cancer" and "carcinoma". Cancer is the form used in LCSH-cataloged bibliographic records, and carcinoma is a cross reference. I found a local public library whose Bibliocommons catalog did retrieve all of the records with "cancer" in them when the search was on "carcinoma", while the same search in the Harvard Hollis system did not (carcinoma: 1,889 retrievals; cancer: 21,311). These are just two catalogs, and not a representative sample, to say the least, but the fact seems to be shown.
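To make the difference between the two catalogs concrete, here is a minimal sketch of what following a "see" reference at search time could look like. Everything in it is an assumption for illustration: the tiny catalog, the cross-reference table, and the function names are hypothetical, and it does not describe how Bibliocommons or Hollis actually work.

```python
# Hypothetical "see" references: unused term -> authorized heading.
SEE_REFERENCES = {"carcinoma": "cancer"}

# Hypothetical toy catalog: record id -> assigned subject headings.
CATALOG = {
    1: ["cancer"],
    2: ["cancer", "nutrition"],
    3: ["nutrition"],
}


def authorized_form(term):
    """Map a searcher's term to the heading the catalog actually uses."""
    term = term.strip().lower()
    return SEE_REFERENCES.get(term, term)


def subject_search(term, follow_see=True):
    """Return ids of records carrying the heading, optionally via 'see' references."""
    wanted = authorized_form(term) if follow_see else term.strip().lower()
    return sorted(rid for rid, headings in CATALOG.items() if wanted in headings)


print(subject_search("carcinoma"))                    # follows the cross reference
print(subject_search("carcinoma", follow_see=False))  # matches the literal term only
```

With the cross reference followed, a search on "carcinoma" reaches the same records as a search on "cancer"; without it, the searcher gets only whatever happens to carry the literal term.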

References

[1] Vickery, B C. Classification and Indexing in Science. New York: Academic Press, 1959.
[2] Katz, Bill. Introduction to Reference Work: Reference Services and Reference Processes. New York: McGraw-Hill, 1992. p. 82 http://www.worldcat.org/oclc/928951754. Cited in: Brown, Stephanie Willen. The Reference Interview: Theories and Practice. Library Philosophy and Practice 2008. ISSN 1522-0222
[3] Cochrane, Pauline A., Marcia J. Bates, Margaret Beckman, Hans H. Wellisch, Sanford Berman, Toni Petersen, and Stephen E. Wiberley, Jr. Modern Subject Access in the Online Age: Lesson 3. American Libraries, Vol. 15, No. 4 (Apr., 1984), pp. 250-252, 254-255. http://www.jstor.org/stable/25626708
[4] Cochrane, Pauline A. Modern Subject Access in the Online Age: Lesson 2. American Libraries, Vol. 15, No. 3 (Mar., 1984), pp. 145-148, 150. http://www.jstor.org/stable/25626647
[5] Thomas Mann, “The Peloponnesian War and the Future of Reference, Cataloging, and Scholarship in Research Libraries” (June 13, 2007). PDF, 41 pp. http://guild2910.org/Pelopponesian%20War%20June%2013%202007.pdf
[6] IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records, 2009. http://archive.ifla.org/VII/s13/frbr/frbr_2008.pdf.

Sunday, August 21, 2016

Wikipedia and the numbers fallacy

One of the main attempts at solutions to the lack of women on Wikipedia is to encourage more women to come to Wikipedia and edit. The idea is that greater numbers of women on Wikipedia will result in greater equality on the platform; that there will be more information about women and women's issues, and a hoped for "civilizing influence" on the brutish culture.

This argument is so obviously specious that it is hard for me to imagine that it is being put forth by educated and intelligent people. Women are not a minority - we are around 52% of the world's population and, with a few pockets of exception, we are culturally, politically, sexually, and financially oppressed throughout the planet. If numbers created more equality, where is that equality for women?

The "woman problem" is not numerical and it cannot be solved with numbers. The problem is cultural; we know this because attacks against women can be traced to culture, not numbers: the brutal rapes in India, the harassment of German women by recently arrived immigrant men at the Hamburg railway station on New Year's Eve, the racist and sexist attacks on Leslie Jones on Twitter -- none of these can be explained by numbers. In fact, the stats show that over 60% of Twitter users are female, and yet Jones was horribly attacked. Gamergate arose at a time when the number of women in gaming is quite high, with data varying from 40% to over 50% of gamers being women. Women gamers are attacked not because there are too few of them, and there does not appear to be any safety in numbers.

The numbers argument is not only provably false, it is dangerous if mis-applied. Would women be safer walking home alone at night if we encouraged more women to do it? Would having more women at frat parties reduce the rape culture on campus? Would women on Wikipedia be safer if there were more of them? (The statistics from 2011 showed that 13% of editors were female. The Wikimedia Foundation had a goal to increase the number to 25% by 2015, but Jimmy Wales actually stated in 2015 that the number of women was closer to 10% than 25%.) I think that gamergate and Twitter show us that the numbers are not the issue.

In fact, Wikipedia's efforts may have exacerbated the problem. The very public efforts to bring more women editors into Wikipedia (there have been and are organized campaigns both for women and about women) and the addition of more articles by and about women are going to be threatening to some members of the Wikipedia culture. In a recent example, an edit-a-thon produced twelve new articles about women artists. They were immediately marked for deletion, and yet, after analysis, ten of the articles were determined to be suitable, and only two were lost. It is quite likely that twelve new articles about male scientists (Wikipedia greatly values science over art, another bias) would not have produced this reaction; in fact, they might have sailed into the encyclopedia space without a hitch. Some editors are rebelling against the addition of information about women on Wikipedia, seeing it as a kind of reverse sexism (something that came up frequently in the attack on me).

Wikipedia's culture is a "self-run" society. So was the society in the Lord of the Flies. If you are one of the people who believe that we don't need government, that individuals should just battle it out and see who wins, then Wikipedia might be for you. If, instead, you believe that we have a social obligation to provide a safe environment for people, then this self-run society is not going to be appealing. I've felt what it's like to be "Piggy" and I can tell you that it's not something I would want anyone else to go through.

I'm not saying that we do not want more women editing Wikipedia. I am saying that more women does not equate to more safety for women. The safety problem is a cultural problem, not a numbers problem. One of the big challenges is how we can define safety in an actionable way. Title IX, the US statute mandating equality of the sexes in education, revolutionized education and education-related sports. Importantly, it comes under the civil rights area of the Department of Justice. We need a Title IX for the Internet; one that requires those providing public services to make sure that there is no discrimination based on sex. Before we can have such a solution, we need to determine how to define "non-discrimination" in that context. It's not going to be easy, but it is a pre-requisite to solving the problem.

Wednesday, August 17, 2016

Classification, RDF, and promiscuous vowels

"[He] provided (i) a classified schedule of things and concepts, (ii) a series of conjunctions or 'particles' whereby the elementary terms can be combined to express composite subjects, (iii) various kinds of notational devices ... as a means of displaying relationships between terms." [1]

"By reducing the complexity of natural language to manageable sets of nouns and verbs that are well-defined and unambiguous, sentence-like statements can be interpreted...."[2]

The "he" in the first quote is John Wilkins, and the date is 1668.[3] His goal was to create a scientifically correct language that would have one and only one term for each thing, and then would have a set of particles that would connect those things to make meaning. His one and only one term is essentially an identifier. His particles are linking elements.

The second quote is from a publication about OCLC's linked data experiments, and is about linked data, or RDF. The goals are so obviously similar that it can't be overlooked. Of course there are huge differences, not the least of which is the technology of the time.*

What I find particularly interesting about Wilkins is that he did not distinguish between classification of knowledge and language. In fact, he was creating a language, a vocabulary, that would be used to talk about the world as classified knowledge. Here we are at a distance of about 350 years, and the language basis of both his work and the abstract grammar of the semantic web share a lot of their DNA. They are probably proof of some Chomskian theory of our brain and language, but I'm really not up to reading Chomsky at this point.

The other interesting note is how similar Wilkins is to Melvil Dewey. He wanted to reform language and spelling. Here's the section where he decries alphabetization because the consonants and vowels are "promiscuously huddled together without distinction." This was a fault of language that I have not yet found noted in Dewey's work. Could he have missed some imperfection?!

*Also, Wilkins was a Bishop in the Anglican church, and so his description of the history of language is based literally on the Bible, which makes for some odd conclusions.

[1]Schulte-Albert, Hans G. Classificatory Thinking from Kinner to Wilkins: Classification and Thesaurus Construction, 1645-1668. Quoting from Vickery, B. C. "The Significance of John Wilkins in the History of Bibliographical Classification." Libri 2 (1953): 326-43.
[2]Godby, Carol J, Shenghui Wang, and Jeffrey Mixter. Library Linked Data in the Cloud: OCLC's Experiments with New Models of Resource Description. 2015.
[3] Wilkins, John. Essay Towards a Real Character, and a Philosophical Language. S.l: Printed for Sa. Gellibrand, and for John Martyn, 1668.

Tuesday, August 16, 2016

The case of the disappearing classification

I'm starting some research into classification in libraries (now that I have more time due to having had to drop social media from my life; see previous post). The main question I want to answer is: why did research into classification drop off at around the same time that library catalogs computerized? This timing may just be coincidence, but I'm suspecting that it isn't.

I was in library school in 1971-72, and then again in 1978-80. In 1971 I took the required classes of cataloging (two semesters), reference, children's librarianship, library management, and an elective in law librarianship. Those are the ones I remember. There was not a computer in the place, nor do I remember anyone mentioning them in relation to libraries. I was interested in classification theory, but not much was happening around that topic in the US. In England, the Classification Research Group was very active, with folks like D.J. Foskett and Brian Vickery as mainstays of thinking about faceted classification. I wrote my first published article about a faceted classification being used by a UN agency.[1]

In 1978 the same school had only a few traditional classes. I'd been out of the country, so the change to me was abrupt. Students learned to catalog on OCLC. (We had typed cards!) I was hired as a TA to teach people how to use DIALOG for article searching, even though I'd never seen it used, myself. (I'd already had a job as a computer programmer, so it was easy to learn the rules of DIALOG searching.) The school was now teaching "information science". Here's what that consisted of at the time: research into term frequency of texts; recall and precision; relevance ranking; database development.

I didn't appreciate it at the time, but the school had some of the bigger names in these areas, including William Cooper and M. E. "Bill" Maron. (I only just today discovered why he called himself Bill - the M. E., which is what he wrote under in academia, stands for "Melvin Earl". Even for a nerdy computer scientist, that was too much nerdity.) 1978 was still the early days of computing, at least unless you were on a military project grant or worked for the US Census Bureau. The University of California, Berkeley, did not have visible Internet access. Access to OCLC or DIALOG was via dial-up to their proprietary networks. (I hope someone has or will write that early history of the OCLC network. For its time it must have been amazing.)

The idea that one could search actual text was exciting, but how best to do it was (and still is, to a large extent) unclear. There was one paper, although I so far have not found it, that was about relevance ranking, and was filled with mathematical formulas for calculating relevance. I was determined to understand it, and so I spent countless hours on that paper with a cheat sheet beside me so I could remember what uppercase italic R was as opposed to lower case script r. I made it through the paper to the very end, where the last paragraph read (as I recall): "Of course, there is no way to obtain a value for R[elevance], so this theory cannot be tested." I could have strangled the author (one of my profs) with my bare hands.

Looking at the articles, now, though, I see that they were prescient; or at least that they were working on the beginnings of things we now take for granted. One statement by Maron especially strikes me today:

A second objective of this paper is to show that about is, in fact, not the central concept in a theory of document retrieval. A document retrieval system ought to provide a ranked output (in response to a search query) not according to the degree that they are about the topic sought by the inquiring patron, but rather according to the probability that they will satisfy that person’s information need. This paper shows how aboutness is related to probability of satisfaction.[2]

This is from 1977, and it essentially describes the basic theory behind Google ranking. It doesn't anticipate hyperlinking, of course, but it does anticipate that "about" is not the main measure of what will satisfy a searcher's need. Classification, in the traditional sense, is the quintessence of about. Is this the crux of the issue? As yet, I don't know. More to come.
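The distinction between "about" and "probability of satisfaction" can be shown with a toy example. This is only a sketch under loud assumptions: the documents, the usage counts, and the add-one smoothing are all hypothetical choices of mine, and nothing here reproduces Maron's actual formulation or Google's ranking.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    about_topic: bool  # does an assigned subject heading match the query topic?
    shown: int         # hypothetical log: times shown to patrons with this need
    satisfied: int     # hypothetical log: times those patrons were satisfied


def p_satisfy(doc):
    """Estimate probability of satisfaction with add-one (Laplace) smoothing."""
    return (doc.satisfied + 1) / (doc.shown + 2)


def rank_by_satisfaction(docs):
    """Order documents by estimated probability of satisfying the need."""
    return sorted(docs, key=p_satisfy, reverse=True)


def filter_by_aboutness(docs):
    """The classification-style answer: keep only documents 'about' the topic."""
    return [d for d in docs if d.about_topic]


DOCS = [
    Doc("On-topic but rarely useful", about_topic=True, shown=100, satisfied=20),
    Doc("Off-topic heading, often useful", about_topic=False, shown=50, satisfied=40),
    Doc("On-topic, no usage data yet", about_topic=True, shown=0, satisfied=0),
]

for doc in rank_by_satisfaction(DOCS):
    print(f"{p_satisfy(doc):.2f}  {doc.title}")
```

In this made-up data the document that best satisfies searchers is the one that an aboutness filter would drop entirely, which is the tension the quote points at.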

[1]Coyle, Karen (1975). "A Faceted Classification for Occupational Safety and Health". Special Libraries. 66 (5-6): 256–9.
[2]Maron, M. E. (1977) "On Indexing, Retrieval, and the Meaning of About". Journal of the American Society for Information Science, January, 1977, pp. 38-43

Thursday, August 11, 2016

This is what sexism looks like: Wikipedia

We've all heard that there are gender problems on Wikipedia. Honestly there are a lot of problems on Wikipedia, but gender disparity is one of them. Like other areas of online life, on Wikipedia there are thinly disguised and not-so thinly disguised attacks on women. I am at the moment the victim of one of those attacks.

Wikipedia runs on a set of policies that are used to help make decisions about content and to govern behavior. In a sense, this is already a very male approach, as we know from studies of boys and girls at play: boys like a sturdy set of rules, and will spend considerable time arguing whether or not rules are being followed; girls begin play without establishing a set of rules, develop agreed rules as play goes on if needed, but spend little time on discussion of rules.

If you've been on Wikipedia and have read discussions around various articles, you know that there are members of the community that like to "wiki-lawyer" - who will spend hours arguing whether something is or is not within the rules. Clearly, coming to a conclusion is not what matters; this is blunt force, nearly content-less arguing. It eats up hours of time, and yet that is how some folks choose to spend their time. There are huge screaming fights that have virtually no real meaning; it's a kind of fantasy sport.

Wiki-lawyering is frequently used to harass. It is currently going on to an amazing extent in harassment of me, although since I'm not participating, it's even emptier. The trigger was that I sent back for editing two articles about men that two wikipedians thought should not have been sent back. Given that I have reviewed nearly 4000 articles, sending back 75% of those for more work, these two are obviously not significant. What is significant, of course, is that a woman has looked at an article about a man and said: "this doesn't cut it". And that is the crux of the matter, although the only person to see that is me. It is all being discussed as violations of policy, although there are none. But sexism, as with racism, homophobia, transphobia, etc., is almost never direct (and even when it is, it is often denied). Regulating what bathrooms a person can use, or denying same sex couples marriage, is a kind of lawyering around what the real problem is. The haters don't say "I hate transsexuals"; they just try to make them as miserable as possible by denying them basic comforts. In the past, and even the present, no one said "I don't want to hire women because I consider them inferior"; they said "I can't hire women because they just get pregnant and leave."

Because wiki-lawyering is allowed, this kind of harassment is allowed. It's now gone on for two days and the level of discourse has gotten increasingly hysterical. Other than one statement in which I said I would not engage because the issue is not policy but sexism (which no one can engage with), it has all been between the wiki-lawyers, who are working up to a lynch mob. This is gamer-gate, in action, on Wikipedia.

It's too bad. I had hopes for Wikipedia. I may have to leave. But that means one less woman editing, and we were starting to gain some ground.

The best read on this topic, mainly about how hard it is to get information that is threatening to men (aka about women) into Wikipedia: WP:THREATENING2MEN: Misogynist Infopolitics and the Hegemony of the Asshole Consensus on English Wikipedia

I have left Wikipedia, and I also had to delete my Twitter account because they started up there. I may not be very responsive on other media for a while. Thanks to everyone who has shown support, but if by any chance you come across a kinder, gentler planet available for habitation, do let me know. This one's desirability quotient is dropping fast.

Sunday, July 10, 2016

Catalogs and Context: Part V

This entire series is available as a single file on my web site.

Before we can posit any solutions to the problems that I have noted in these posts, we need to at least know what questions we are trying to answer. To me, the main question is:

What should happen between the search box and the bibliographic display?

Or as Pauline Cochrane asked: "Why should a user ever enter a search term that does not provide a link to the syndetic apparatus and a suggestion about how to proceed?"[1] I really like the "suggestion about how to proceed" that she included there. Although I can think of some exceptions, I do consider this an important question.

If you took a course in reference work at library school (and perhaps such a thing is no longer taught - I don't know), then you learned a technique called "the reference interview." The Wikipedia article on this is not bad, and defines the concept as an interaction at the reference desk "in which the librarian responds to the user's initial explanation of his or her information need by first attempting to clarify that need and then by directing the user to appropriate information resources." The assumption of the reference interview is that the user arrives at the library with either an ill-formed query, or one that is not easily translated to the library's sources. Bill Katz's textbook "Introduction to Reference Work" makes the point bluntly:

"Be skeptical of the information the patron presents" [2]

If we're so skeptical that the user could approach the library with the correct search in mind/hand, why then do we think that giving the user a search box in which to put that poorly thought out or badly formulated search is a solution? This is another mind-boggler to me.

So back to our question, what SHOULD happen between the search box and the bibliographic display? This is not an easy question, and it will not have a simple answer. Part of the difficulty of the answer is that there will not be one single right answer. Another difficulty is that we won't know a right answer until we try it, give it some time, open it up for tweaking, and carefully observe. That's the kind of thing that Google does when they make changes in their interface, but we have neither Google's money nor its network (we depend on vendor systems, which define what we can and cannot do with our catalog).

Since I don't have answers (I don't even have all of the questions) I'll pose some questions, but I really want input from any of you who have ideas on this, since your ideas are likely to be better informed than mine. What do we want to know about this problem and its possible solutions?

(Some of) Karen's Questions

Why have we stopped evolving subject access?

Is it that keyword access is simply easier for users to understand? Did the technology deceive us into thinking that a "syndetic apparatus" is unnecessary? Why have the cataloging rules and bibliographic description been given so much more of our profession's time and development resources than subject access has? [3]

Is it too late to introduce knowledge organization to today's users?

The user of today is very different to the user of pre-computer times. Some of our users have never used a catalog with an obvious knowledge organization structure that they must/can navigate. Would they find such a structure intrusive? Or would they suddenly discover what they had been missing all along? [4]

Can we successfully use the subject access that we already have in library records?

Some of the comments in the articles organized by Cochrane in my previous post were about problems in the Library of Congress Subject Headings (LCSH), in particular that the relationships between headings were incomplete and perhaps poorly designed.[5] Since LCSH is what we have as headings, could we make them better? Another criticism was the sparsity of "see" references, once dictated by the difficulty of updating LCSH. Can this be ameliorated? Crowdsourced? Localized?

We still do not have machine-readable versions of the Library of Congress Classification (LCC), and the machine-readable Dewey Decimal Classification (DDC) has been taken off-line (and may be subject to licensing). Could we make use of LCC/DDC for knowledge navigation if they were available as machine-readable files?

Given that both LCSH and LCC/DDC have elements of post-composition and are primarily instructions for subject catalogers, could they be modified for end-user searching, or do we need to develop a different instrument altogether?

How can we measure success?

Without Google's user laboratory apparatus, the answer to this may be: we can't. At least, we cannot expect to have a definitive measure. How terrible would it be to continue to do as we do today and provide what we can, and presume that it is better than nothing? Would we really see, for example, a rise in use of library catalogs that would confirm that we have done "the right thing?"

Notes

[1] Modern Subject Access in the Online Age: Lesson 3
Author(s): Pauline A. Cochrane, Marcia J. Bates, Margaret Beckman, Hans H. Wellisch, Sanford Berman, Toni Petersen, and Stephen E. Wiberley, Jr.
Source: American Libraries, Vol. 15, No. 4 (Apr., 1984), pp. 250-252, 254-255
Stable URL: http://www.jstor.org/stable/25626708

[2] Katz, Bill. Introduction to Reference Work: Reference Services and Reference Processes. New York: McGraw-Hill, 1992. p. 82 http://www.worldcat.org/oclc/928951754. Cited in: Brown, Stephanie Willen. The Reference Interview: Theories and Practice. Library Philosophy and Practice 2008. ISSN 1522-0222

[3] One answer, although it doesn't explain everything, is economic: the cataloging rules are published by the professional association and are a revenue stream for it. That provides an incentive to create new editions of rules. There is no economic gain in making updates to the LCSH. As for the classifications, the big problem there is that they are permanently glued onto the physical volumes making retroactive changes prohibitive. Even changes to descriptive cataloging must be moderated so as to minimize disruption to existing catalogs, which we saw happen during the development of RDA, but with some adjustments the new and the old have been made to coexist in our catalogs.

[4] Note that there are a few places online, in particular Wikipedia, where there is a mild semblance of organized knowledge and with which users are generally familiar. It's not the same as the structure that we have in subject headings and classification, but users are prompted to select pre-formed headings, with a keyword search being secondary.

[5] Simon Spero did a now famous (infamous?) analysis of LCSH's structure that started with Biology and ended with Doorbells.

Monday, July 04, 2016

Catalogs and Content: an Interlude

This entire series is available as a single file on my web site.

"Editor's note. Providing subject access to information is one of the most important professional services of librarians; yet, it has been overshadowed in recent years by AACR2, MARC, and other developments in the bibliographic organization of information resources. Subject access deserves more attention, especially now that results are pouring in from studies of online catalog use in libraries."
American Libraries, Vol. 15, No. 2 (Feb., 1984), pp. 80-83

Having thought and written about the transition from card catalogs to online catalogs, I began to do some digging in the library literature, and struck gold. In 1984, Pauline Atherton Cochrane, one of the great thinkers in library land, organized a six-part "continuing education" series to bring librarians up to date on the thinking regarding the transition to new technology. (Dear ALA - please put these together into a downloadable PDF for open access. It could make a difference.) What is revealed here is both stunning and disheartening, as the quote above shows; in terms of catalog models, very little progress has been made, and we are still spending more time organizing atomistic bibliographic data while ignoring subject access.

The articles are primarily made up of statements by key library thinkers of the time, many of whom you will recognize. Some responses contradict each other, others fall into familiar grooves. Library of Congress is criticized for not moving faster into the future, much as it is today, and yet respondents admit that the general dependency on LC makes any kind of fast turn-around of changes difficult. Some of the desiderata have been achieved, but not the overhaul of subject access in the library catalog.

The Background

If you think that libraries moved from card catalogs to online catalogs in order to serve users better, think again. Like other organizations that had a data management function, libraries in the late 20th century were reaching the limits of what could be done with analog technology. In fact, as Cochrane points out, by the mid-point of that century libraries had given up on the basic catalog function of providing cross references from unused to used terminology, as well as from broader and narrower terms in the subject thesaurus. It simply wasn't possible to keep up with these, not to mention that although the Library of Congress and service organizations like OCLC provided ready-printed cards for bibliographic entries, they did not provide the related reference cards. What libraries did (and I remember this from my undergraduate years) is they placed near the card catalog copies of the "Red Book". This was the printed Library of Congress Subject Heading list, which by my time was in two huge volumes, and, yes, was bound in red. Note that this was the volume that was intended for cataloging librarians who were formulating subject headings for their collections. It was never intended for the end-users of the catalog. The notation ("x", "xx", "sa") was far from intuitive. In addition, for those users who managed to follow the references, it pointed them to the appropriate place in LCSH, but not necessarily in the catalog of the library in which they were searching. Thus a user could be sent to an entry that simply did not exist.

The "RedBook" today

From my own experience, when we brought up the online catalog at the University of California, the larger libraries had for years had difficulty keeping the card catalog up to date. The main library at the University of California at Berkeley regularly ran from 100,000 to 150,000 cards behind in filing into the catalog, which filled two enormous halls. That meant that a book would be represented in the catalog about three months after it had been cataloged and shelved. For a research library, this was a disaster. And Berkeley was not unusual in this respect.

Computerization of the catalog was both a necessary practical solution and a kind of holy grail. At the time that these articles were written, only a few large libraries had an online catalog, and that catalog represented only a recent portion of the library's holdings. (Retrospective conversion of the older physical card catalog to machine-readable form came later, culminating in the 1990's.) Abstracting and indexing databases (DIALOG, PRECIS, and others) had preceded libraries in automating, and these gave librarians their first experience in searching computerized bibliographic data.

This was the state of things when Cochrane presented her 6-part "continuing education" series in American Libraries.

Subject Access

The series of articles was stimulated by an astonishingly prescient article by Marcia Bates in 1977. In that article she articulates both concerns and possibilities that, quite frankly, we should all take to heart today. In Lesson 3 of Cochrane's articles, Bates is quoted from 1977 saying:

"...with automation, we have the opportunity to introduce many access points to a given book. We can now use a subject approach... that allows the naive user, unconscious of and uninterested in the complexities of synonymy and vocabulary control, to blunder on to desired subjects, to be guided, without realizing it, by a redundant but carefully controlled subject access system."

and

"And now is the time to change -- indeed, with MARC already so highly developed, past time. If we simply transfer the austerity-based LC subject heading approach to expensive computer systems, then we have used our computers merely to embalm the constraints that were imposed on library systems back before typewriters came into use!"

This emphasis on subject access was one of the stimuli for the AL lessons. In the early 1980's, studies done at OCLC and elsewhere showed that over 50% of the searches being done in the online catalogs of that day were subject searches, even those going against title indexes or mixed indexes. (See footnotes to Lesson 3.) Known item searching was assumed to be under control, but subject searching posed significant problems. Comments in the article include:

"...we have not yet built into our online systems much of the structure for subject access that is already present in subject cataloging. That structure is internal and known by the person analyzing the work; it needs to be external and known by the person seeking the work."
"Why should a user ever enter a search term that does not provide a link to the syndetic apparatus and a suggestion about how to proceed?"

Interestingly, I don't see that any of these problems has been solved in today's systems.

As a quick review, here are some of the problems, some proposed solutions, and some hope for future technologies that are presented by the thinkers that contributed to the lessons.

Problems noted

Many problems were surfaced, some with fairly simple solutions, others that we still struggle with.

LCSH is awkward, if not nearly unusable, both for its vocabulary and for the lack of a true hierarchical organization
Online catalogs' use of LCSH lacks syndetic structure (see, see also, BT, NT). This is true not only for display but also for retrieval: a search on a broader term does not retrieve items with a narrower term (which would be logical to at least some users)
Libraries assign too few subject headings
For the first time, some users are not in the library while searching, so there are no intermediaries (e.g. reference librarians) available. (One of the flow diagrams has a failed search pointing to a box called "see librarian", something we would not think to include today.)
Lack of a professional theory of information seeking behavior that would inform systems design. ("Without a blueprint of how most people want to search, we will continue to force them to search the way we want to search." Lesson 5)
Information overload, aka overly large results, as well as too few results on specific searches
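The retrieval side of the syndetic-structure problem can be sketched in a few lines. This is only an illustration: the narrower-term links and the catalog records below are invented for the example, and the function names are hypothetical rather than taken from any library system.

```python
from collections import deque

# Hypothetical narrower-term (NT) links, invented for illustration.
NARROWER = {
    "Boats and boating": ["Canoes and canoeing", "Sailing"],
    "Sailing": ["Yacht racing"],
}

# Hypothetical catalog: record title -> assigned subject headings.
RECORDS = {
    "Paddling the Great Lakes": ["Canoes and canoeing"],
    "Racing under sail": ["Yacht racing"],
    "A boater's handbook": ["Boats and boating"],
    "Whale stories": ["Whales"],
}


def expand(term):
    """Return the term plus every term reachable through NT links."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in NARROWER.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def search(term, use_hierarchy):
    """Return titles whose headings match the term, optionally with its NTs."""
    wanted = expand(term) if use_hierarchy else {term}
    return sorted(t for t, subjects in RECORDS.items() if wanted & set(subjects))
```

With `use_hierarchy=False`, a search on "Boats and boating" finds only the one record carrying exactly that heading; with `use_hierarchy=True` it also finds the records on canoeing and yacht racing.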

Proposed solutions

Some proposed solutions were mundane (add more subject headings to records) while others would require great disruption to the library environment.

Add more subject headings to MARC records
Use keyword searching, including keywords anywhere in the record.
Add uncontrolled keywords to the records.
Make the subject authority file machine-readable and integrate it into online catalogs.
Forget LCSH, instead use non-library bibliographic files for subject searching, such as A&I databases.
Add subject terms from non-library sources to the library catalog, and/or do (what today we call) federated searching
LCSH must provide headings that are more specific as file sizes and retrieved sets grow (in the document, a retrieved set of 904 items was noted with an exclamation point)

Future thinking

As is so often the case when looking to the future, some potential technologies were seen as solutions. Some of these are still seen as solutions today (e.g. artificial intelligence), while others have been achieved (storage of full text).

Full text searching, natural language searches, and artificial intelligence will make subject headings and classification unnecessary
We will have access to back-of-the-book indexes and tables of contents for searching, as well as citation indexing
Multi-level systems will provide different interfaces for experts and novices
Systems will be available 24x7, and there will be a terminal in every dorm room
Systems will no longer need to use stopwords
Storage of entire documents will become possible

End of Interlude

Although systems have allowed us to store and search full text, to combine bibliographic data from different sources, and to deliver world-wide, 24x7, we have made almost no progress in the area of subject access. There is much more to be learned from these articles, and it would be instructive to do an in-depth comparison of them to where we are today. I highly recommend reading them; each is only a few pages long.

----- The Lessons -----

*Modern Subject Access in the Online Age: Lesson 1
Author(s): Pauline Atherton Cochrane
Source: American Libraries, Vol. 15, No. 2 (Feb., 1984), pp. 80-83
Stable URL: http://www.jstor.org/stable/25626614

*Modern Subject Access in the Online Age: Lesson 2
Author(s): Pauline A. Cochrane
Source: American Libraries, Vol. 15, No. 3 (Mar., 1984), pp. 145-148, 150
Stable URL: http://www.jstor.org/stable/25626647

*Modern Subject Access in the Online Age: Lesson 3
Author(s): Pauline A. Cochrane, Marcia J. Bates, Margaret Beckman, Hans H. Wellisch, Sanford Berman, Toni Petersen and Stephen E. Wiberley, Jr.
Source: American Libraries, Vol. 15, No. 4 (Apr., 1984), pp. 250-252, 254-255
Stable URL: http://www.jstor.org/stable/25626708

*Modern Subject Access in the Online Age: Lesson 4
Author(s): Pauline A. Cochrane, Carol Mandel, William Mischo, Shirley Harper, Michael Buckland, Mary K. D. Pietris, Lucia J. Rather and Fred E. Croxton
Source: American Libraries, Vol. 15, No. 5 (May, 1984), pp. 336-339
Stable URL: http://www.jstor.org/stable/25626747

*Modern Subject Access in the Online Age: Lesson 5
Author(s): Pauline A. Cochrane, Charles Bourne, Tamas Doszkocs, Jeffrey C. Griffith, F. Wilfrid Lancaster, William R. Nugent and Barbara M. Preschel
Source: American Libraries, Vol. 15, No. 6 (Jun., 1984), pp. 438-441, 443
Stable URL: http://www.jstor.org/stable/25629231

*Modern Subject Access In the Online Age: Lesson 6
Author(s): Pauline A. Cochrane, Brian Aveney and Charles Hildreth
Source: American Libraries, Vol. 15, No. 7 (Jul. - Aug., 1984), pp. 527-529
Stable URL: http://www.jstor.org/stable/25629275

Wednesday, June 29, 2016

Catalog and Context, Part IV

Part I, Part II, Part III

(I fully admit that this topic deserves a much more extensive treatment than I will give it here. My goal is to stimulate discussion that would lead to efforts to develop models of the catalog that support a better user experience.)

Recognizing that users need a way to make sense out of large result sets, some library catalogs have added features that attempt to provide context for the user. The main such effort that I am aware of is the presentation of facets derived from some data in the bibliographic records. Another model, although I haven't seen it integrated well into library catalogs, is data mining: doing an overall analysis that combines different data in the records, and making this available for search. Lastly, we have the development of entities that are catalog elements in their own right; this generally means the treatment of authors, subjects, etc., as stand-alone topics to be retrieved and viewed, even apart from their role in the retrieval of a bibliographic item. Treating these as "first-class entities" is not the same as the heading layer over bibliographic records, but it may be exploitable to provide a kind of context for users.

Facets

Faceted classification was all the rage when I attended library school in the early 1970's, having been bolstered by the work of the UK-based Classification Research Group, although the prime mover of this type of classification was S. R. Ranganathan, who thoroughly explicated the concept in the 1930s. Faceted classification was to 1970's knowledge organization what KWIC and KWOC were to text searching: facets potentially provided a way to create complex subject headings whose individual parts could be the subject of access on their own or in context.

In library systems "faceting" has exploited information from the bibliographic record that can be discretely applied to a retrieved set. Facets are all "accidents" of the existing data, as catalog record creation is not based on faceted cataloging.

In general, facets are fixed data elements, or whole or part heading strings. Authors are used as facets, generally showing the top-occurring author names with counts.

Authors as facets

Date of publication is also a commonly used facet, not so much because it is inherently useful but mainly because "it exists."

Dates as facets
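Facet lists of this kind amount to counting the values of one field across the retrieved set. The sketch below is a minimal illustration under stated assumptions: the record structure, the field names `author` and `year`, and the sample data are all hypothetical, not how any particular catalog stores its records.

```python
from collections import Counter

# Hypothetical retrieved set; field names are assumptions for illustration.
RESULTS = [
    {"author": "Darwin, Charles, 1809-1882", "year": 1859},
    {"author": "Darwin, Charles, 1809-1882", "year": 1871},
    {"author": "Bear, Greg", "year": 1999},
    {"author": "Darwin, Ian F.", "year": 2001},
]


def facet(records, field, top=10):
    """Count the values of one field and return the most frequent first."""
    return Counter(r[field] for r in records).most_common(top)


def narrow(records, field, value):
    """Apply a chosen facet value as a filter on the retrieved set."""
    return [r for r in records if r[field] == value]
```

Here `facet(RESULTS, "author")` lists Charles Darwin first with a count of 2, while `facet(RESULTS, "year")` returns four different years with one record each, which shows how granular a date facet can become.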

Subject Facets

Faceting is, to a degree, already incorporated into our subject access systems. Library of Congress subject headings are faceted to some extent, with topic facets, geographic facets, and time facets. The Library of Congress Classification and the Dewey Decimal Classification make some use of facets where they allow entries in the classification to be extended by place, time, or other re-usable subdivisions.

Some systems have taken a page from the FAST book. FAST is Faceted Application of Subject Terminology, and it creates facets by breaking apart the segments of a Library of Congress subject heading such that topics, geographical entries, and time periods become separate entries. FAST itself does more than this, including turning some inverted headings (Erie, Lake) back to their natural order, and other changes. One of the main criticisms of FAST, however, is that it loses the very context that is provided by the composite subject heading. Thus the headings on Moby Dick become Whales / Whaling / Mentally Ill / Fiction, which leaves it unclear who or what is mentally ill in this example. (I'm sure there are better examples - send them in!)

Summon system use of facets
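The loss of context that comes from breaking headings apart can be seen in a small sketch. This is not FAST itself, which also assigns a type to each facet and makes other changes; it only splits heading strings on the "--" delimiter, using two Lake Erie headings that appear elsewhere in this series. The function names are hypothetical.

```python
def split_heading(heading):
    """Break a subdivided heading string into its separate segments."""
    # Dropping a trailing period is a simplification for this example.
    return [part.strip() for part in heading.rstrip(".").split("--")]


def pooled_segments(headings):
    """Collect the distinct segments of all headings, discarding their order."""
    return sorted({seg for h in headings for seg in split_heading(h)})


HEADINGS = [
    "Boats and boating--Erie, Lake--Maps.",
    "Erie, Lake--Navigation",
]
```

`pooled_segments(HEADINGS)` yields "Boats and boating", "Erie, Lake", "Maps" and "Navigation" as four independent entries. Nothing in that pooled list says whether the maps are maps of boating on the lake or maps of something else.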

The Open Library created subject facets from Library of Congress subject headings, and categorizes each by its facet "type":

Open Library subject facets

Although these are laudable attempts to give the user a way to both understand and further refine the retrieved set, there are a number of problems with these implementations, not the least of which is that many of these are not actually facets in the knowledge organization sense of that term. Facets need to be conceptual divisions of the landscape that help a user understand that landscape.

Online sales sites use something that they call faceted classification, although it varies considerably from the concept of faceted classification that originated with S. R. Ranganathan in the 1930's. On a sales site, facets divide the site's products into categories so that users can add those categories to their searches. A search for shoes in general is less useful than a search for shoes done under the categories "men's", "women's" or "children's". In the online sales sense, a facet is a context for the keyword search. For all that the overall universe that these facets govern is much simpler than the entire knowledge universe that libraries must try to handle, at least the concept of context is employed to help the user.

Amazon's facets

While it may be helpful to see which authors appear most often in a retrieved set, authorship does not provide a conceptual organization for the user. Next, not everything that can be exploited in a bibliographic record to narrow a result set is necessarily useful. The list of publication dates from the retrieved set is not only too granular to be a useful facet (think of how many different dates there could be) but the likelihood that a user's query can be fulfilled by a publication year datum is scant indeed.

The last problem is really the key here, which is that while isolated bits of data like date or place may help narrow a large result set they do not provide the kind of overall context for searches that a truly faceted system might. However, providing such a view requires that the entries in the library catalog have been classified using a faceted classification system, and that is simply not the case.

Data Mining

I include this because I think it is interesting, although the only real instances of it that I am aware of come from OCLC, which is uniquely positioned to do the kind of "big data" work that this implies. The WorldCat Identities project shows the kind of data that one can extract from a large bibliographic database. Data mining applies best to the bibliographic universe as a whole, rather than to individual catalogs, since the latter are by definition incomplete. It would, however, be interesting to see what uses could be made of mined data like WorldCat Identities, for example giving users of individual catalogs information about sources that the library does not hold. It is also a shame that WorldCat Identities appears to have been a one-off and is not being kept up to date.

Emily Dickinson at WorldCat Identities

First Class Objects

A potential that linked data brings (but does not guarantee) is the development of some of the key bibliographic entities into "first class objects". By that I mean that some entities could be the focus of searches on their own, not just as index entries to bibliographic records. Having some entities be first class objects means that, for example, you can have a page for a person that is truly about the person, not just a heading with the personal name in it. This allows you to present the user with additional information, either similar to WorldCat Identities, if you have that information available to you, or taking text from sources like Wikipedia, like Open Library did:

Open Library author page

This was also the model used in the linked data database Freebase (which has now been killed by Google), and is not entirely unlike Google's use of Wikipedia (and other sources) to create its "knowledge graph."

Google Knowledge Graph

The treatment of some things as first class objects is perhaps a step toward the catalog of headings, but the person as an object is not itself a replication of the heading system that is found in bibliographic records, which go beyond the person's name in their organizational function:


Dickens, Charles, 1812-1870--Adaptations.
Dickens, Charles, 1812-1870--Adaptations--Comic books, strips, etc.
Dickens, Charles, 1812-1870--Adaptations--Congresses.
Dickens, Charles, 1812-1870--Aesthetics.
Dickens, Charles, 1812-1870--Anecdotes.
Dickens, Charles, 1812-1870--Anniversaries, etc.
Dickens, Charles, 1812-1870--Appreciation.
Dickens, Charles, 1812-1870--Appreciation--Croatia.

For subject headings, a key aspect of the knowledge map is the inclusion of relationships from broader and narrower terms and related terms. I will not pretend that the existing headings are perfect, as we know they are not, but it is hard to imagine a knowledge organization system that will not make use of these taxonomic concepts in one way or another.


Lake Erie
 See: Erie, Lake
Lake Erie, Battle of, 1813.
 BT: United States--History--War of 1812--Campaigns
Lake Erie, Battle of, 1813--Bibliography.
Lake Erie, Battle of, 1813--Commemoration.
Lake Erie, Battle of, 1813--Fiction.
Lake Erie, Battle of, 1813--Juvenile fiction.
Lake Erie, Battle of, 1813--Juvenile literature.
Lake Erie Transportation Company
 See Also: Erie Railroad Company.

This information is now available through the Library of Congress linked data service, id.loc.gov, and surely, with some effort, these aspects of the "first class entity" (person, place, topic, etc.) could be recovered and made available to the user. Unfortunately (how often have I said that in these posts?), the subject heading authorities were designed as a model for subject heading creation, not as a full list of all possible subject headings, and connecting the authority file, which contains the relationships between terms, mechanically to the headings in bibliographic records is not a snap. Again, what was modeled for the card catalog and worked well in that technology does not translate perfectly to the newer technologies.
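One way to picture the difficulty of connecting the two is the sketch below. It is a hedged illustration only, not how id.loc.gov or any library system works: the one-entry authority table is built from the Lake Erie example above, and the function name and matching strategy (trimming subdivisions until an established heading is found) are assumptions made for the example.

```python
# A one-entry stand-in for an authority file, taken from the example above.
AUTHORITY = {
    "Lake Erie, Battle of, 1813": {
        "BT": ["United States--History--War of 1812--Campaigns"],
    },
}


def find_authority(heading, authority=AUTHORITY):
    """Return the longest established heading that begins `heading`, or None."""
    # Dropping a trailing period is a simplification; real headings such as
    # "Anniversaries, etc." would need more careful normalization.
    parts = heading.rstrip(".").split("--")
    while parts:
        candidate = "--".join(parts)
        if candidate in authority:
            return candidate
        parts.pop()  # drop the last subdivision and try again
    return None
```

A bibliographic heading such as "Lake Erie, Battle of, 1813--Juvenile fiction." is not itself in the table, so the link to the broader term is found only after the subdivision is stripped away. A heading with no matching entry, such as "Erie, Lake--Navigation" in this tiny sample, returns `None`.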

Note that the emphasis on bibliographic entities in FRBR, RDA and BIBFRAME could facilitate such a solution. All three encourage an entity view of data that has traditionally been included in bibliographic records, a view that is not entirely opposed to the concept of the separation of bibliographic data and authorities. In addition, FRBR provides a basis for conceptualizing works and editions (FRBR's expression) as separate entities. These latter exist already in many forms in the "real world" as objects of critical thinking, description, and point of sale. The other emphasis in FRBR is on bibliographic relationships. This has helped us understand that relationships are important, although these bibliographic relationships are the tip of the iceberg if we look at user service as a whole.

Next I want to talk about possibilities. But because I do not have the answers, I am going to present them in the form of questions - because we need first to have questions before we can posit any answers.

Thursday, June 23, 2016

Catalog and Context Part III

This entire series is available as a single file on my web site.

In the previous two parts, I explained that much of the knowledge context that could and should be provided by the library catalog has been lost as we moved from cards to databases as the technologies for the catalog. In this part, I want to talk about the effect of keyword searching on catalog context.

KWIC and KWOC

If you weren't at least a teenager in the 1960's you probably missed the era of KWIC and KWOC (neither a children's TV show nor a folk music duo). These meant, respectively, KeyWords In Context, and KeyWords Out of Context. These were concordance-like indexes to texts, but the first to be produced using computers. A KWOC index would be simply a list of words and pointers (such as page numbers, since hyperlinks didn't exist yet). A KWIC index showed the keywords with a few words on either side, or rotated a phrase so that each term appeared once at the beginning of the string; the rotated strings were then ordered alphabetically.

If you have the phrase "KWIC is an acronym for Key Word in Context", then your KWIC index display could look like:

 KWIC is an acronym for Key Word In Context
Key Word In Context
acronym for Key Word In Context
            KWIC is an acronym for 
acronym for Key Word In Context

To us today these are unattractive and not very useful, but to the first users of computers these were an exciting introduction to the possibility that one could search by any word in a text.
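For readers who want to see the mechanics, the two index styles can be sketched as follows. This is a modern illustration rather than a reconstruction of any 1960's program: the stopword list is an assumption, word positions stand in for page numbers, and the KWIC version shows words on either side of the keyword rather than the rotated form.

```python
STOPWORDS = {"is", "an", "for", "in"}  # assumed stoplist for this example


def kwoc(text):
    """KWOC-style index: each keyword with the word positions where it occurs."""
    index = {}
    for pos, word in enumerate(text.split()):
        key = word.lower()
        if key not in STOPWORDS:
            index.setdefault(key, []).append(pos)
    return dict(sorted(index.items()))


def kwic(text, width=2):
    """KWIC-style index: each keyword with `width` words on either side."""
    words = text.split()
    entries = []
    for pos, word in enumerate(words):
        if word.lower() in STOPWORDS:
            continue
        left = " ".join(words[max(0, pos - width):pos])
        right = " ".join(words[pos + 1:pos + 1 + width])
        entries.append((word.lower(), left, word, right))
    entries.sort(key=lambda entry: entry[0])  # order alphabetically by keyword
    return [(left, word, right) for _, left, word, right in entries]


TEXT = "KWIC is an acronym for Key Word In Context"
```

For the sample phrase, `kwoc(TEXT)` gives the keywords acronym, context, key, kwic and word, each with its position, and the first entry of `kwic(TEXT)` is "acronym" with "is an" to its left and "for Key" to its right.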

It wasn't until the 1980's, however, that keyword searching could be applied to library catalogs.

Before Keywords, Headings

Before keyword searching, when users were navigating a linear, alphabetical index, they were faced with the very difficult task of deciding where to begin their entry into the catalog. Imagine someone looking for information on Lake Erie. That seems simple enough, but entering the catalog at L-A-K-E E-R-I-E would not actually yield all of the entries that might be relevant. Here are some headings with LAKE ERIE:

Boats and boating--Erie, Lake--Maps. 
Books and reading--Lake Erie region.
Lake Erie, Battle of, 1813.
Erie, Lake--Navigation

Note that the lake is entered under Erie, the battle under Lake, and some instances are fairly far down in the heading string. All of these headings follow rules that ensure a kind of consistency, but because users do not know those rules, the consistency here may not be visible. In any case, the difficulty for users was knowing with what terms to begin the search, which was done on left-anchored headings.
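The effect of left-anchored entry can be shown with the four headings above. This is a minimal sketch that assumes a plain case-insensitive sort; real card catalog filing rules were more elaborate, and the function name is hypothetical.

```python
import bisect

# The four headings from the example above, in simple alphabetical order.
HEADINGS = sorted(
    [
        "Boats and boating--Erie, Lake--Maps.",
        "Books and reading--Lake Erie region.",
        "Lake Erie, Battle of, 1813.",
        "Erie, Lake--Navigation",
    ],
    key=str.upper,
)


def browse(entry, headings=HEADINGS):
    """Return the headings that begin with `entry`, as in an alphabetical file."""
    keys = [h.upper() for h in headings]
    start = bisect.bisect_left(keys, entry.upper())
    hits = []
    for heading in headings[start:]:
        if not heading.upper().startswith(entry.upper()):
            break
        hits.append(heading)
    return hits
```

Entering the file at "Lake Erie" finds only the battle; entering at "Erie, Lake" finds only the navigation heading. The two headings where the lake appears further down the string are never reached from either starting point.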

One might assume that finding names of people would be simple, but that is not the case either. Names can be quite complex with multiple parts that are treated differently based on a number of factors having to do with usage in different cultures:

De la Cruz, Melissa
Cervantes Saavedra, Miguel de

Because it was hard to know where to begin a search, see and see also references existed to guide the user from one form of a name or phrase to another. However, it would inflate a catalog beyond utility to include every possible entry point that a person might choose, not to mention that this would make the cataloger's job onerous. Without the help of a good reference librarian, searching in the card catalog was a hit-or-miss affair.

When we brought up the University of California online catalog in 1982, you can imagine how happy users were to learn that they could type in LAKE ERIE and retrieve every record with those terms in it regardless of the order of the terms or where in the heading they appeared. Searching was, or seemed, much simpler. Because it feels simpler, we all have tended to ignore some of the downsides of keyword searching. First, words are just strings, and in a search strings have to match (with some possible adjustment like combining singular and plural terms). So a search on "FRANCE" for all information about France would fail to retrieve other versions of that word unless the catalog did some expansion:

Cooking, French
France--Antiquities
Alps, French (France)
French--America--History
French American literature
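Both the gain and the limitation can be seen in a small sketch of keyword matching. It assumes the simplest possible approach, exact matching of whole words with no stemming or expansion, and the function names are hypothetical; it is not a description of how the UC catalog was implemented.

```python
import re


def tokens(text):
    """Lower-case a heading and split it into alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_search(query, headings):
    """Return headings containing every query word, in any order or position."""
    wanted = tokens(query)
    return [h for h in headings if wanted <= tokens(h)]


LAKE_ERIE = [
    "Boats and boating--Erie, Lake--Maps.",
    "Books and reading--Lake Erie region.",
    "Lake Erie, Battle of, 1813.",
    "Erie, Lake--Navigation",
]

FRANCE = [
    "Cooking, French",
    "France--Antiquities",
    "Alps, French (France)",
    "French--America--History",
    "French American literature",
]
```

`keyword_search("LAKE ERIE", LAKE_ERIE)` returns all four headings regardless of word order, but `keyword_search("FRANCE", FRANCE)` returns only the two headings that contain the literal string "France" and misses the three that say "French".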

The next problem is that retrieval with keywords, and especially the "keyword anywhere" search which is the most popular today, entirely misses any context that the library catalog could provide. A simple keyword search on the word "darwin" brings up a wide array of subjects, authors, and titles.

Subjects:

Darwin, Charles, 1809-1882 – Influence
Darwin, Charles, 1809-1882 — Juvenile Literature
Darwin, Charles, 1809-1882 — Comic Books, Strips, Etc
Darwin Family
Java (Computer program language)
Rivers--Great Britain
Mystery Fiction
DNA Viruses — Fiction
Women Molecular Biologists — Fiction

Authors:

Darwin, Charles, 1809-1882
Darwin, Emma Wedgwood, 1808-1896
Darwin, Ian F.
Darwin, Andrew
Teilhet, Darwin L.
Bear, Greg
Byrne, Eugene

Titles:

Darwin
Darwin; A Graphic Biography : the Really Exciting and Dramatic Story of A Man Who Mostly Stayed at Home and Wrote Some Books
Darwin; Business Evolving in the Information Age
Emma Darwin, A Century of Family Letters, 1792-1896
Java Cookbook
Canals and Rivers of Britain
The Crimson Hair Murders
Darwin's Radio

It wouldn't be reasonable for us to expect a user to make sense of this, because quite honestly it does not make sense.

In the first version of the UC catalog, we required users to select a search heading type, such as AU, TI, SU. That may have lessened the "false drops" from keyword searches, but it did not eliminate them. In this example, using a title or subject search the user still would have retrieved items with the subjects DNA Viruses — Fiction, and Women Molecular Biologists — Fiction, and an author search would have brought up both Java Cookbook and Canals and Rivers of Britain. One could see an opportunity for serendipity here, but it's not clear that it would balance out the confusion and frustration.

You may be thinking right now "But Google uses keyword searching and the results are good." Note that Google now relies heavily on Wikipedia and other online reference books to provide relevant results. Wikipedia is a knowledge organization system, organized by people, and it often has a default answer for search that is more likely to match the user's assumptions. A search on the single word "darwin" brings up:

In fact, Google has always relied on humans to organize the web by following the hyperlinks that they create. Although the initial mechanism of the search is a keyword search, Google's forte is in massaging the raw keyword result to bring potentially relevant pages to the top.

Keywords, Concluded

The move from headings to databases to un-typed keyword searching has all but eliminated the visibility and utility of headings in the catalog. The single search box has become the norm for library catalogs and many users have never experienced the catalog as an organized system of headings. Default displays are short and show only a few essential fields, mainly author, title and date. This means that there may even be users who are unaware that there is a system of headings in the catalog.

Recent work in cataloging, from ISBD to FRBR to RDA and BIBFRAME, focuses on modifications to the bibliographic record, but does nothing to model the catalog as a whole. With these efforts, the organized knowledge system that was the catalog is slipping further into the background. And yet, we have no concerted effort taking place to remedy this.

What is most astonishing to me, though, is that catalogers continue to create headings, painstakingly, sincerely, in spite of the fact that they are not used as intended in library systems, and have not been used in that way since the first library systems were developed over 30 years ago. The headings are fodder for the keyword search, but no more so than a simple set of tags would be. The headings never perform the organizing function for which they were intended.

Part IV will look at some attempts to create knowledge context from current catalog data, and will present some questions that need to be answered if we are to address the quality of the catalog as a knowledge system.