Tuesday, May 19, 2009

LCSH as linked data: beyond "dash-dash"

The SKOS version of LCSH developed by LC has made some choices about how LCSH is presented in a linked-data format. One of these choices is that the complex headings (which are the vast majority of them) are treated as a single string:

While this might fit appropriately as a SKOS vocabulary, in my opinion it does not work as linked data. I'm going to try to explain why, although it's quite complex. Part of that complexity is that LCSH is itself complex, primarily because there are many exceptions to any pattern that you might care to describe. (For more on this, I suggest Lois Mai Chan's Library of Congress Subject Headings, 4th edition, the chapter on geographic subject headings, pp. 67-89.)

Taking the heading above, as I mentioned in my previous post, the geographic term Italy is not in LCSH even though it can indeed be used as a subject heading. Instead, Italy is defined as a name heading in the LC name authorities file. In that file, and only in the name file, alternate forms of the name are included (altLabels, in SKOS terminology):
451 __ |a Repubblica italiana (1946- )
451 __ |a Italian Republic (1946- )
451 __ |a Włochy
451 __ |a Regno d’Italia (1861-1946)
451 __ |a Iṭalyah
451 __ |a Italia
451 __ |a Italie
451 __ |a Italien
451 __ |a Italii︠a︡
451 __ |a Kgl. Italienische Regierung
451 __ |a Königliche Italienische Regierung

There are no altLabels in the LCSH entry for Italy--etc. And because the term Italy is buried in an undifferentiated string, there is no linked data way to say that the Italy in Italy--History--1492-1559--Fiction is the same as http://id.loc.gov/authorities/n79021783, which will presumably be the URI for the name.
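To make the point concrete, here is a minimal sketch of what it would take to pull the lead term out of the string and connect it to a name identifier. The lookup table and function names are hypothetical, invented for illustration; the one URI in it is the one I presume above for Italy. This is not how id.loc.gov works, only a picture of the link that the single-string form leaves unexpressed.

```python
# Hypothetical lookup from an established name heading to its identifier.
# The single entry is the URI presumed in the post for Italy.
NAME_URIS = {
    "Italy": "http://id.loc.gov/authorities/n79021783",
}


def split_heading(heading):
    """Split a pre-coordinated heading string on the "--" separator."""
    return [part.strip() for part in heading.split("--")]


def link_lead_term(heading, name_uris=NAME_URIS):
    """Return (lead term, name URI or None, list of subdivisions)."""
    parts = split_heading(heading)
    lead, subdivisions = parts[0], parts[1:]
    return lead, name_uris.get(lead), subdivisions


lead, uri, subdivisions = link_lead_term("Italy--History--1492-1559--Fiction")
print(lead, uri, subdivisions)
```

Note that the date range 1492-1559 survives intact because only the double hyphen is treated as a separator. A lead term that is not in the table simply comes back with no URI.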

It is assumed in LC authorities that the altLabels for a name term that appears in a subject heading apply to both the name used as a name and the name used as a subject heading. In the card catalog, where the name alone would appear first in the alphabetical browse of the cards, it was only necessary to make references to that "head" of the list, which would, in our case, be Italy alone. This has caused great problems in online catalogs where searching is by keyword, not a linear alphabetical search. Some systems manage to get around this by doing a string compare to the same subfields in name headings and subject headings, and then transferring the altLabel forms to the related subject headings.
$a Shakespeare, William, $d 1564-1616
$a Shakespeare, William, $d 1564-1616 $v Adaptations $v Periodicals
In this case, the $a and $d subfields represent the same authoritative entity. The rules say that they are, and must be, the same authoritative entity. If they don't match exactly then someone has done something wrong. They are both instances of a name identified as "n 78095332", and which will presumably be given the URI http://id.loc.gov/authorities/n78095332. There is no question about that.
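The string compare that some systems do can be sketched roughly as follows. This is an illustration under assumptions, not any particular system's code: the function names are made up, the parsing handles only the simple "$x value" notation used above, and the subfield coding I give the Italy subject heading ($x, $y, $v) is my own guess at how it would be coded. The alternate forms are taken from the 451 fields listed earlier.

```python
import re

# Subfield codes that carry subject subdivisions in MARC subject headings.
SUBDIVISION_CODES = {"v", "x", "y", "z"}


def parse_subfields(field):
    """Parse '$a Foo, $d 1564-1616 $v Bar' into [(code, value), ...]."""
    return [(code, value.strip())
            for code, value in re.findall(r"\$(\w)\s*([^$]*)", field)]


def name_portion(field):
    """Keep the subfields before the first subdivision, normalized for comparison."""
    portion = []
    for code, value in parse_subfields(field):
        if code in SUBDIVISION_CODES:
            break
        # Trailing punctuation varies with what follows, so drop it on both sides.
        portion.append((code, value.rstrip(" ,.")))
    return tuple(portion)


def inherit_alt_labels(subject_field, name_records):
    """Return the alternate forms of the name heading that the subject starts with."""
    target = name_portion(subject_field)
    for name_field, alt_labels in name_records.items():
        if name_portion(name_field) == target:
            return alt_labels
    return []


# A few of the alternate forms from the name record for Italy shown above.
name_records = {"$a Italy": ["Italia", "Italie", "Italien"]}

# Assumed subfield coding for Italy--History--1492-1559--Fiction.
print(inherit_alt_labels("$a Italy $x History $y 1492-1559 $v Fiction", name_records))
```

The same comparison treats the two Shakespeare fields above as one entity, since their $a and $d subfields match once the subdivisions are set aside.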

There is also no question that when the name is used in a subject heading it has the full meaning that it is given in the name heading record, including alternate forms of the name and the many notes fields provided by the catalogers that created the authority record. That these don't appear in the LCSH file does not mean that it is not the case: it means only that the LCSH record assumes that the name record exists and provides that information, and that the information is applied to the name in the subject entry through the linear nature of the dictionary catalog.

We mustn't confuse the form with the meaning. That LCSH has a rather arrested form is unfortunate, but it was never intended to be used outside of the context of the full set of authorities that gives full treatment to those things that have "proper names." (cf. Chan, chapter 4)

If we wish for the LC authorities to be used in a linked data environment, then we have to make sure that the linking capabilities are there. Although I agree that each LCSH record has an identifier, and that identifier should be used, I don't agree that what is expressed in the LCSH record is a dumb, undifferentiated string. In this post I have addressed the relation to name headings, but there are other uses of controlled vocabularies within the subject headings that I haven't fully investigated yet.

Wednesday, May 13, 2009

LCSH as linked data: what is an LC Subject Heading?

The Library of Congress Subject Headings have been placed online in SKOS. You can search within the set or download the entire thing in RDF/XML or N-Triples. This is a welcome development.

I must say that I would also welcome some documentation on the decisions that were made, as viewing the actual data has left me with a number of questions. I'm going to begin my comments with a question about scope, and some confusion that it is causing me as I think about how I would want to use this data.

What's an LC Subject Heading?

It appears that the LCSH file that is online represents those authority records whose LC control number begins with "sh", as in: sh 00009880. (Numbering 342,684 records.) However, if you do a Subject Authority Headings search in the LC authorities database you will retrieve any authority record that can be used as a subject. This means that you will retrieve personal names, corporate names, and geographic entities that can be used as subjects. (Note, this is probably a large portion of the name authority file.) This is a mixture of records with LCCNs that begin "n" (for name file) and those that begin "sh" (for subject heading file). I'm at a loss to explain/understand what determines whether a heading has an LCCN beginning with "sh" and would love to get an explanation.
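As a rough illustration of the split, this is how one might sort control numbers into the two files by prefix alone. The function names and the labels "subject" and "name" are mine, and the rule (exactly "sh" for the subject file, anything beginning with "n" for the name file) is only my reading of the data, not a documented LC rule.

```python
import re


def lccn_prefix(lccn):
    """Return the leading alphabetic prefix of an LC control number, lowercased."""
    match = re.match(r"\s*([A-Za-z]+)", lccn)
    return match.group(1).lower() if match else ""


def authority_file(lccn):
    """Guess which authority file a record belongs to from its LCCN prefix."""
    prefix = lccn_prefix(lccn)
    if prefix == "sh":
        return "subject"
    if prefix.startswith("n"):
        return "name"
    return "unknown"


for lccn in ["sh 00009880", "n 78095332", "sh 85069035"]:
    print(lccn, "->", authority_file(lccn))
```

What the prefix cannot tell you is whether the record may be used as a subject, which is exactly the question a subject search in the authorities database answers differently.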

The result is that a search in the LCSH file on the word "Italy" brings up 3,516 headings, with the word somewhere in the heading. However, the heading "Italy" alone is not included. You do have:
Italy, Central
Italy, Northern
Italy, Southern
and you have:
Italy, Northern--Civilization
Italy, Northern--Civilization--Germanic influences
But not "Italy."

A search in the name heading database on LC's online authority file yields a name heading entry for "Italy." That database (whose response is in the form of a browse list) has innumerable pages for corporate names under the initial term "Italy":
Italy. Ambasciata (India)
Italy. Confederazione fascista degli industriali.
It also includes "Italy, Southern" with its LC control number "sh 85069035".

The upshot is that the LC Subject heading file at http://id.loc.gov is not the same as a subject heading search in the online authorities database. It also isn't always logical which file a heading falls into. The heading "Italy. Ambasciata (India)" is in the name heading file as a corporate name, but "Palazzo Dell'Ambasciata di Spagna (Rome, Italy)" is in the subject heading file as a corporate name. There undoubtedly is a set of rules that explains all of this, but it seems to me that a separation of the subject file and the name files creates a split between headings that will not be mirrored in actual use.

This may not matter if the files are combined in the end, and the URI makes it look like all authorities will have ids that directly follow "/authorities/" in the URI. However, although they are both coded as corporate names, the "Palazzo... " record gets the "cool URI" http://id.loc.gov/authorities/sh2002000509#concept. Note the ending in "concept". I don't know what hash ending will be given to entries from the names file, but I do find it odd that corporate names could have two different hash endings, depending on which file they are from. To be frank, especially since the division into different files doesn't seem terribly logical, and that many items in the name file can also be used as concepts, I would prefer that the "#" indicate the type of heading (personal name, corporate name, conference, geographical name, topic) rather than the file that it comes from. That is, that the "#" would reflect the MARC tag - 100, 110, 111, 150, 151.
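Here is a small sketch of what I have in mind. The fragment names are hypothetical, made up for this example; LC has not proposed them. Only the base URI pattern and the tag meanings (100 personal name, 110 corporate name, 111 conference, 150 topic, 151 geographic name) come from the existing data and the MARC authority format.

```python
# Hypothetical hash endings keyed by MARC authority tag.
HEADING_TYPES = {
    "100": "personalName",
    "110": "corporateName",
    "111": "conference",
    "150": "topic",
    "151": "geographicName",
}

BASE = "http://id.loc.gov/authorities/"


def typed_uri(lccn, marc_tag):
    """Build a URI whose fragment reflects the heading type, not the source file."""
    identifier = lccn.replace(" ", "")
    return f"{BASE}{identifier}#{HEADING_TYPES[marc_tag]}"


# The "Palazzo..." record is coded as a corporate name (110) in the subject file.
print(typed_uri("sh2002000509", "110"))
# A personal name from the name file gets its fragment by the same rule.
print(typed_uri("n 78095332", "100"))
```

Under this scheme a corporate name gets the same hash ending whether its control number begins with "sh" or "n".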

Sunday, May 10, 2009

Walt Crawford should read the document

In his March 2009 Cites & Insights, Walt Crawford does a roundup of comments on the Google/AAP settlement, and gets very agitated when reviewing some of my posts. I'm used to that. But agitation tends to cancel out reason, and Walt gets some things wrong that he might have understood better if he had kept a clear head.

In response to my criticism that Google is digitizing without regard to collection building, Walt says:
"I don’t know of any big academic library or public library that’s a single disciplinary collection—or, realistically, a set of well-curated collections. "
I'd like to hear from academic librarians on this one. My understanding was that an academic library is INDEED a set of well-curated collections.

"I don’t remember public universities admitting to substantial costs in cooperating with Google."
What's the cost? Dan Greenstein estimated $1-2 per book. Cheap, but still considerable for a library scanning millions of books. The cost is primarily in staff time, shelving and reshelving books. Under this agreement, there is also the cost of meeting the security requirements that are imposed. (That's in Appendix D) These requirements, which are possibly quite reasonable, will have a greater cost than what most libraries do today for digital materials, and will be one of the primary reasons why some libraries do not contract to receive copies of the digitized items. (Note that some of the potential library partners are working hard to collaborate on the Hathi Trust, which does appear to meet the standards of the agreement; others, however, have decided that they will not attempt to store digital copies.)

In a post I argued that had libraries gone ahead and digitized their own collections (for the purposes of indexing and searching), this probably would have been considered fair use.

"Well…this is not a judicial finding. I find it unfortunate that Google didn’t fight the good fight, and I think it will make things much harder for another commercial entity to attempt similar digitization and use—but I don’t see that library use of “their own materials” has changed in any way."
Not of their hard copy materials, but legal minds think that this changes the landscape for digitization and the use of digitized materials, even closing some options that might have been available before.
"The proposed settlement agreement would give Google a monopoly on the largest digital library of books in the world. It and BRR, which will also be a monopoly, will have considerable freedom to set prices and terms and conditions for Book Search’s commercial services.... If asked, the authors of orphan books in major research libraries might well prefer for their books to be available under Creative Commons licenses or put in the public domain so that fellow researchers could have greater access to them. The BRR will have an institutional bias against encouraging this or considering what terms of access most authors of books in the corpus would want." Pam Samuelson
And to my statement:
"The digitization of books by Google is a massive project that will result in the privatization of a public good: the contents of libraries. While the libraries will still be there, Google will have a de facto monopoly on the online version of their contents."
Walt first prefaces it with:
"I take issue with the very first sentence, as I’ve taken issue consistently with the same claim by others with even higher profiles than Coyle (who are even less likely to ever admit they could be mistaken)."
Well, it would have been nice if he had said who they are. But thanks for letting me know that you consider me a "lower profile" person, Walt. He goes on to say:
"Nonsense. Sheer, utter nonsense. The libraries and contents will still be there. OCA will still be there. I’m sorry, but this one just drives me nuts: It’s demonization of the worst kind and an abuse of the language."
Well, I'm not sure how this abuses language, but there is general agreement that Google gets a monopoly... at least on out-of-print books, which is the vast majority of books in libraries. (Not on public domain books, which is what the OCA digitizes, but anyone can digitize public domain books.) So although the libraries and their contents will still be there, and can be used in hard copy as they are today, no one but Google can digitize the in-copyright works without incurring liability. So "monopoly on online version of their contents" is a factual statement, if you understand that public domain is public domain. (Note, this settlement agreement is extremely complex, with some real zingers hidden in its 134 pages. It's not possible to cover it all in a blog post, so anyone who is interested really needs to read the document itself, painful as that process is.)

In terms of preservation and longevity concerns, Walt asks:
"Won’t the fully-participating libraries have digital copies? I can’t think of institutions with better longevity."
To begin with, only fully participating libraries will have digital copies, and we don't yet know how many libraries will choose that option. Other libraries, even those that are only allowing Google to digitize public domain books, do not get to keep copies of the digital files. (Not only that, public domain libraries that have been cooperating with Google have to delete all of their copies of the files that they hold today, as per this agreement. See Appendix B-3.) The only party with copies of all of the files will be Google.

There are statements in the settlement about what happens if Google "fails to meet the Required Library Services Requirement" or simply decides not to continue. I refer you to page 84 of the settlement, and hope that someone can make sense out of it. The way I read it, libraries can then engage a third-party provider, who will receive the files from Google.

The key thing here is that even in the event of the failure of Google, libraries are not allowed to make uses of their own scans, such as those that are permitted to Google by this settlement. The restriction to "computational uses" and some other minor uses stands, even in that eventuality.

When I say:
"Google should be required to carry all digital Books without discrimination and without liability."
Walt replies:
"You mean “all digital books that Google’s scanned”? I suspect Google wouldn’t argue with this."
That is exactly what I mean, and Google does indeed argue with it. As a matter of fact, the settlement only obligates Google to provide access to at least 85% of the books it scans. That "access" refers to the subscription service that will be available to libraries and other institutions. The settlement says:
"Google may, at its discretion, exclude particular Books from one or more Display Uses for editorial or non-editorial reasons." p.36
That's followed by an affirmation of the "value of the principle of freedom of expression," which I must say rings a bit hollow in this context. Google has to notify the Registry if it has excluded a book, and to provide a digital copy of that book to the Registry. The Registry can then seek out a third party to provide services for excluded books. Here, however, is James Grimmelmann's concern on that front:
"The second is that no one besides the Registry might ever find out that Google has chosen to de-list a book. If the Registry doesn’t or can’t engage a replacement for Google, the book would genuinely vanish from this new Library of Alexandria. Perhaps that should happen for some books, but decisions like that shouldn’t be made in secret. When Google choses to exclude a book for editorial reasons, it should be [R13] required to inform the copyright owner and the general public, not just the Registry. "
What might Google exclude? Perhaps very little, but at the ALA panel in Denver in January 2009, Dan Clancy of Google gave an off-the-cuff remark that, as I recall, had the word "pornography" in it. Given the recent embarrassment of Amazon when it had to face the fact that many of its best sellers are rather salacious in nature, I can imagine Google also developing concern about the visibility of the texts that make us uncomfortable.

There are a lot of legitimate reasons for concern about this proposed settlement. And I don't think that anything that I have said is "nonsense."