Wednesday, January 02, 2013

OCLC Top 50

OCLC recently released a file of 1.2 million metadata records for the most widely held items in its catalog. These are all items with 250 library holdings or more. I created a list on WorldCat of the top 50, mostly out of curiosity. I was quite surprised at the results, however.

Here's how it breaks down:
  • 16 periodicals, with Time and Newsweek being numbers 1 and 2, respectively
  • 29 kid and YA books, four of which (ranking very high even in this small list) are from the Diary of a Wimpy Kid series
  • 5 adult books
The five adult books are:
  1. McCullough, D. G. (1992). Truman. New York: Simon & Schuster. 
  2. Brown, D. (2003). The Da Vinci code: A novel. New York: Doubleday.
  3. Johnson, S. (1998). Who moved my cheese?: An a-mazing way to deal with change in your work and in your life. New York: Putnam. 
  4. Haley, A. (1976). Roots. Garden City, N.Y: Doubleday.  
  5. Peters, T. J., & Waterman, R. H. (1982). In search of excellence: Lessons from America's best-run companies. New York: Harper & Row.
This small set gives me many ideas of things to investigate in the full set. First, the monographs in this set all have recent dates, the oldest being 1976 and most falling after 2000.
I am hoping to graph the full set by date. What I expect is that the items will be overwhelmingly recent publications because libraries tend to hold what people read, and my guess is that readers are mainly reading new books. Also, libraries buy from the set of things that are in print, so even if they are buying a so-called classic (as they do every time yet another movie is made of Pride and Prejudice) they are buying a current edition which will have a recent date.
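
As a starting point for that graph, here is a minimal sketch in Python that counts items per decade and prints a crude text histogram. It assumes the publication years have already been extracted as integers; the function name count_by_decade is my own invention, and the sample input is just the five adult books listed above, not the full file.

```python
from collections import Counter


def count_by_decade(years):
    """Count items per decade, so 1976 is tallied under 1970."""
    return Counter((year // 10) * 10 for year in years)


# Publication years of the five adult books in the top 50.
sample_years = [1992, 2003, 1998, 1976, 1982]

# One row per decade, one '#' per item.
for decade, count in sorted(count_by_decade(sample_years).items()):
    print(f"{decade}s {'#' * count}")
```

Because the function accepts any iterable, the same tally should work on the full set without loading everything into a spreadsheet.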

The next obvious bit of information would be the correlation between holdings and date, which I expect to be high for the very reasons given above.
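
One way to put a number on that is a Pearson correlation coefficient. The sketch below is only an illustration: the function name is my own, and the (year, holdings) pairs are invented to show the call, not taken from the OCLC file. With a distribution as skewed as holdings counts, a rank-based measure might turn out to be a better fit, so treat this as a first look.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Invented (year, holdings) pairs, for illustration only.
pairs = [(1976, 2100), (1982, 2300), (1992, 2600), (1998, 3000), (2003, 3400)]
years = [year for year, _ in pairs]
holdings = [held for _, held in pairs]
print(round(pearson(years, holdings), 3))
```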

The overall distribution of holdings is unsurprising, starting high (at almost 7000 holdings), dropping off dramatically, and creating a long tail. (I had managed to coax a chart out of ooCalc but it crashed before I captured it. Am now studying how to deal with large files and visualization. Advice gladly received.) Of course, the tail would be very, very long if you could chart the entire WorldCat database. (Anyone know how many items in WC are held by only one library? I can't find that in the available WC stats.)
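
One approach that avoids the spreadsheet altogether is to stream the holdings counts and tally them into fixed-width bins, so only the bin totals are ever held in memory. This is a sketch under assumptions: the function name and the bin width of 500 are arbitrary choices of mine, and the sample numbers are invented to mimic a long tail.

```python
from collections import Counter


def bin_holdings(holdings, width=500):
    """Tally holdings counts into bins of the given width.

    Accepts any iterable (including a generator reading a large file),
    so the full list of counts never has to be in memory at once.
    """
    bins = Counter()
    for held in holdings:
        bins[(held // width) * width] += 1
    return bins


# Invented holdings counts, for illustration only.
sample = [6900, 3100, 1200, 800, 640, 410, 300, 260, 255, 250]

# One row per bin: lower bound of the bin, then one '#' per item.
for lower, count in sorted(bin_holdings(sample).items()):
    print(f"{lower:>5} {'#' * count}")
```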

I think it would be interesting to be able to analyze library holdings in correlation with the FRBR-ization that OCLC has done. In fact, I would really like to see the top 1% (or .5%) of FRBR-ized items. Related to FRBR I am mainly wondering if we can estimate how frequently FRBR might fulfill its promise of saving the time of the cataloger. But that's for another day.


Steve Thomas said...

I'd love to see the titles in a "smallest number" list; a kind of "rare and endangered species" list.

ssp said...

Interesting idea.

I ran a few Unix commands on the OCLC file based on the predicate (which only seems to exist for 680K of the items) and made a graph from that.

Google Spreadsheet with Graph

Commands at github

So what happened in the late 70s?

Karen Coyle said...

ssp - I have no idea why there's a big drop in the 70's (a recession?). As for the pub date, it looks like another 488,760 have schema:copyrightYear, and from looking at a few examples they appear to be ones that have a copyright symbol or a "c" before the date in the MARC record.* That still leaves many items without dates, which seems unusual for highly held items. If you're willing to run your commands again with both datePublished and copyrightYear, it would be interesting to see if the 70's look any better.

Also note that the publication dates for the OCLC work are currently being pulled from the 260 $c, which is a textual field. That means that there are non-numeric strings like "1960?" and "1923-" and "1976, c1980". I have talked to Jeff Young (who did the programming for this and for VIAF) about using the normalized dates from the 008, either instead of the 260 $c dates or along with them. (I prefer instead, just because I'd rather not have to deal with multiple dates when playing with the file.)
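
In the meantime, a rough workaround for anyone playing with the file is to pull the first plausible four-digit year out of the text. The sketch below is not what OCLC does; the function name, the accepted range of 1500 to 2099, and the choice to take the first year when there are several are all my assumptions.

```python
import re

# A four-digit year from 1500 to 2099 that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d\d|20\d\d)(?!\d)")


def first_year(text):
    """Return the first plausible year in a 260 $c style string, or None."""
    match = YEAR.search(text)
    return int(match.group(1)) if match else None


# The three example strings mentioned above.
for raw in ["1960?", "1923-", "1976, c1980"]:
    print(repr(raw), "->", first_year(raw))
```

Taking the first year means "1976, c1980" is counted as 1976; whether the publication date or the copyright date is the better choice depends on what you want to graph.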

* Have I mentioned often enough how nice it would be to have actual documentation that accompanies files of data? :-)

ssp said...

Thanks for the hint about copyrightYear, Karen. I put that into the command and results are much more complete now. I updated the content of the links I provided to take this into account.

This also seems to fill the 'drop' in the 70s. Perhaps just a sign of a change in cataloguing styles? As I have no experience with this data set, nor the conversions performed on this, I’m probably not the most qualified person to guess on this.

Probably you need to mention the documentation a few extra times…