Monday, April 26, 2010

Social aspects of subject headings

You've probably played the "my favorite subject heading" game when geeking out with librarian friends. Here's some additional fuel in case you've run out of zingers.

The Open Library takes the LC subject headings and breaks them apart at the subfield level into subjects, persons, places, genres, and times. It also includes some BISAC headings retrieved from Amazon, so the subject list is not "pure." The separate subject entries obtained are similar to, but not the same as, OCLC's FAST headings, and look much like some facets that appear in library catalogs.
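The subfield-level split described above can be sketched in a few lines of Python. This is only an illustration, not the Open Library code: the function name `split_heading`, the `FACET_BY_SUBFIELD` table, and the choice of which subfield feeds which facet are assumptions based on the standard MARC 6XX subfield meanings ($v form, $x general, $y chronological, $z geographic).

```python
# Hypothetical mapping from MARC 6XX subfield codes to facet names.
# The actual rules used by the Open Library are not given in this post.
FACET_BY_SUBFIELD = {
    "a": "subjects",  # main topical term
    "x": "subjects",  # general subdivision
    "v": "genres",    # form/genre subdivision
    "y": "times",     # chronological subdivision
    "z": "places",    # geographic subdivision
}


def split_heading(tag, subfields):
    """Break one 6XX field into separate facet entries.

    tag       -- MARC tag as a string, e.g. "650"
    subfields -- list of (code, value) pairs in field order
    """
    facets = {}
    for code, value in subfields:
        if tag == "600" and code == "a":
            facet = "persons"   # personal name used as a subject
        elif tag == "651" and code == "a":
            facet = "places"    # geographic name used as a subject
        else:
            facet = FACET_BY_SUBFIELD.get(code)
        if facet is None:
            continue            # ignore subfields this sketch does not map
        facets.setdefault(facet, []).append(value.strip())
    return facets


heading = [
    ("a", "Architecture, Modern"),
    ("y", "20th century"),
    ("z", "United States"),
    ("v", "Bibliography"),
]
print(split_heading("650", heading))
```

Each subfield value becomes its own entry, so the heading as a string is lost, which is where the result starts to resemble catalog facets.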

The Open Library database currently holds about 24 million records for books (at least partially de-duped). In a recent dump of subjects, the total number of different subjects came out as 1,278,539. Of those, 336,638 were of the "topical" variety, that is, either a 650 $a or a 65X $x. The top 25 are as follows:

825168 History
322928 Biography
212822 Politics and government
206519 Congresses
192968 History and criticism
184183 Fiction
123838 Law and legislation
119333 Bibliography
95555 Juvenile literature
93364 Description and travel
90866 Economic conditions
84787 Criticism and interpretation
74878 Claims
71468 Social life and customs
70926 Social conditions
70563 Catalogs
69205 Private Bills
69191 Private bills
66480 Education
63410 Exhibitions
63301 World War, 1939-1945
60235 Foreign relations
60068 Philosophy
56219 Dictionaries
55460 Study and teaching

I find it interesting that with the exception of "World War, 1939-1945" these appear to have the function of qualifiers, and I'm thinking that it would be interesting to contrast the $a and $x terms. My guess is that these are $x, but that not all $x are of this nature.

Of the subfields, 164,342 appear only once in the database. These are a great source of interesting and unusual headings, including "Social aspects of adzes" and "Deer as pets." In fact, the "Social aspects..." tail is so amusing that I have made a file of those with a count of 1.

The full file of topical subjects is 8 megabytes, but can probably yield innumerable hours of library cocktail hour amusement. (The text is in the format "count - tab - subject".) I will also look into names, organizations, places and times as subjects.
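For anyone who wants to poke at the file, here is a minimal Python sketch for reading it. It assumes each line is a count, a single tab character, and then the subject; the function names `parse_subject_lines` and `singletons` are made up for this example.

```python
from collections import Counter


def parse_subject_lines(lines):
    """Read "count<TAB>subject" lines into a Counter keyed by subject."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        count, sep, subject = line.partition("\t")
        if not sep:
            continue  # skip blank lines or lines without a tab
        # Headings differing only in case (e.g. "Private Bills" and
        # "Private bills") stay separate, as they do in the dump.
        counts[subject] += int(count)
    return counts


def singletons(counts, prefix=""):
    """Return subjects that occur exactly once, optionally by prefix."""
    return sorted(
        s for s, n in counts.items() if n == 1 and s.startswith(prefix)
    )


sample = [
    "825168\tHistory\n",
    "322928\tBiography\n",
    "1\tSocial aspects of adzes\n",
    "1\tDeer as pets\n",
]
counts = parse_subject_lines(sample)
print(counts.most_common(2))
print(singletons(counts, "Social aspects"))
```

With the real file, pass an open file object instead of the `sample` list; `most_common(25)` would then give a top-25 list like the one above.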

8 comments:

Jonathan Rochkind said...

MANY of those examples are -- if they are from at all modern cataloging after the 6xx subfields actually existed -- $z rather than $x. They're form/genre subdivisions, which go in $z.

I seem to recall that someone somewhere in the cataloging world is proposing taking those OUT of $z, and just putting them in 655$a where they belong. I forget if this had anything to do with RDA or what.

Jonathan Rochkind said...

I'm also curious how many of those "Social Aspects of X" headings have an LC authority file with an established relationship to "X" as a broader term.

If it's not actually in the LC authorities, this would still be a fine candidate for easy machine adding of relationships from "Social Aspects of [any X]" to "X".

As we start to actually have good interfaces which can display hierarchical relationships in a reasonable way (not quite there yet), we'll probably be looking for candidates like this for automated relationship establishment.
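A rough sketch of what that machine step might look like, in Python. The function name `broader_term_candidates` is invented here, and the matching rule (case-insensitive comparison of the tail against the other headings) is an assumption; real authority matching would need to handle plurals, inverted forms, and qualifiers.

```python
import re

# Matches headings of the form "Social aspects of X" and captures X.
SOCIAL_ASPECTS = re.compile(r"^Social aspects of (?P<topic>.+)$", re.IGNORECASE)


def broader_term_candidates(headings):
    """Pair each "Social aspects of X" heading with an existing heading "X".

    Only proposes a pair when X is itself present in the list, compared
    case-insensitively. The output is a list of candidates for review,
    not a set of established relationships.
    """
    by_lower = {h.lower(): h for h in headings}
    pairs = []
    for heading in headings:
        match = SOCIAL_ASPECTS.match(heading)
        if not match:
            continue
        broader = by_lower.get(match.group("topic").lower())
        if broader is not None:
            pairs.append((heading, broader))
    return pairs


headings = [
    "Adzes",
    "Social aspects of adzes",
    "Deer as pets",
    "Social aspects of quilting",
]
print(broader_term_candidates(headings))
```

Headings whose "X" has no counterpart in the list (here, "quilting") are simply left alone.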

Karen Coyle said...

Jonathan, yes, some of the subjects come out of "genre" subfields. That one is tricky: when we had a separate "genre" index in the UC catalog, it was almost never used -- I don't think users generally understood it. They tend to think of "Dictionaries" as a subject, like "I want an English-Spanish dictionary." At the same time, there are genres that are confusing -- e.g. books of fiction v. books about fiction. I'm thinking that the "about" v. "is a" is something that needs to be addressed in a faceted display, where users can see and understand the context.

The Social Aspects of... headings, I have just learned, were originally "Something -- Social aspects" and got turned around in OL as a way to keep them together. (Similar to some of the combination of subheadings that was done in FAST, but I currently cannot find that documentation on OCLC's site.) It might be good to develop some kind of "best practices" for taking apart LC subject headings when one doesn't want to use the "--" form. FAST is a beginning if there can be some community input on development. FAST seems to have retained many of the "--" forms ("Care of the sick--Social aspects"), for reasons that I do not understand (unless they anticipate using headings in an alphabetic sort order). In any case, I think there's potential, but am not quite sure how to move on it.

Jonathan Rochkind said...

Interesting. To me, that seems like almost the OPPOSITE of what FAST intends to do. FAST tries to take LCSH and break it down into facets. You've taken what were already facets, and precoordinated them into single multi-facet strings! From "X -- Social Aspects Of" to "Social Aspects of X".

The interesting thing about it is why you did it: "To bring these together" -- meaning that OL is still stuck in the same place as all of us, "bringing things together" by the alphabetical listing of pre-coordinated strings! They've made a different decision about what things to bring together (all "social aspects of" together, instead of all things regarding "X" together) is all.

Karen Coyle said...

Sorry, I wasn't clear. There is a strong desire to be "--"-free - no dash-dash. So the 'bring together' was to bring together the two subfields into one string, and mainly for greater readability. Similarly, the subfields with reversed terms ("Authors, German" becomes "German Authors") were changed to natural order. To the non-librarian eye it appears that it is much easier to read these uninterrupted phrases.
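To make the two rewrites concrete, here is a small Python sketch. It is not the OL code, which I have not seen; the function names and the rules (swap around a single comma, skip anything with digits, lowercase the first letter of the main heading) are assumptions made for illustration.

```python
def natural_order(term):
    """Turn an inverted heading into natural order.

    "Authors, German" -> "German Authors". Terms with no comma, more than
    one comma, or digits in the qualifier (e.g. "World War, 1939-1945")
    are returned unchanged, because a plain swap would garble them.
    """
    head, sep, qualifier = term.partition(", ")
    if not sep or ", " in qualifier:
        return term
    if any(ch.isdigit() for ch in qualifier):
        return term
    return f"{qualifier} {head}"


def join_subdivision(main, subdivision):
    """Join "X" and "Social aspects" into "Social aspects of x".

    Lowercasing the first letter is a simplification; it would be wrong
    for proper nouns.
    """
    topic = main[:1].lower() + main[1:]
    return f"{subdivision} of {topic}"


print(natural_order("Authors, German"))
print(natural_order("World War, 1939-1945"))
print(join_subdivision("Adzes", "Social aspects"))
```

The guard clauses show why this kind of rewrite needs care: the comma in "World War, 1939-1945" is not an inversion at all.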

I wasn't involved in this decision so I don't know if other changes of this nature were made. However, I think it is telling that folks who could well be more representative of our users than we ourselves are find LCSH basically unreadable. I suspect that many "catalog 2.0" users have no idea that the facets are actually parts of subject headings, because they never gave subject headings more than a simple glance -- and then gave up on them as too complex to read. So, if we could code data rather than write displays, a subject like:

Architecture, Modern -- 20th century -- United States -- Bibliography

could possibly be presented to users as:

Bibliography of 20th century American architecture

while individual facets could also be available for those functions where faceting is useful.

Anonymous said...

Quoting Jonathan: MANY of those examples are -- if they are from at all modern cataloging after the 6xx subfields actually existed -- $z rather than $x. They're form/genre subdivisions, which go in $z.

That should be $v for form/genre subdivisions. $z is used for geographic subdivisions.

Harry Lawen said...

For me, it is a surprise there are 192968 items of "History and criticism" and not medicine items in this top-tier list.

Karen Coyle said...

Harry, most information in the field of medicine comes out in journal articles, which libraries do not index individually. In fact, it would be very interesting to get a picture of the difference (subject-wise) between book publishing and article publishing. My guess is that most current scientific content is found in articles, and that books will tend to be in the history-sociology-literature areas.