Friday, July 10, 2009

Yee: Questions 1-2

[This entire thread has been combined on the futurelib wiki. Please add your comments and ideas there.]

As I mentioned previously, I am going to try to cover each of Martha Yee's questions from her ITAL article of June, 2009. The title of the article is: "Can Bibliographic data be Put Directly onto the Semantic Web?" Here are the first two. As always, these are my answers, which may be incorrect or incomplete, so I welcome discussion both of Yee's text as well as mine. (Martha's article is available here.)

Question 1
Is there an assumption on the part of the Semantic Web developers that a given data element, such as publisher name, should be expressed as either a literal or using a URI ... but never both?
The answer to this is "no," and is explained in greater detail in my post on RDF basics.

Yee goes on, however, to state that there is value in distinguishing the following types of data:
  • Copied as is from an artifact (transcribed)

  • Supplied by a cataloger

  • Categorized by a cataloger (controlled)
She then says that
"For many data elements, therefore it will be important to be able to record both a literal (transcribed or composed form or both) and a URI (controlled form)."
This distinction between types of data is important, and is one that we haven't made successfully in our current cataloging data. The example I usually give is that of the publisher name in the publisher statement area. Unless you know library cataloging, you might assume that is a controlled name that could be linked to, for example, a Publisher entity in a data model. That's not the case. The publisher name is a sort-of transcribed element, with a lot of cataloger freedom to not record it exactly as it appears. If we want to represent a publisher entity, we need to add it to our data set. There are various possible ways to do this. One would be to declare a publisher property that has a URI that identifies the publisher, and a literal that carries the sort-of transcribed element. But remember that there are two kinds of literals in Yee's list: transcribed and cataloger supplied. So a property that can take both a URI and a literal is still not going to allow us to make that distinction.

A better way to look at this is perhaps to focus more on the meaning of the properties that you wish to use to describe your resource. The transcribed publisher, the cataloger supplied publisher, and the identifier for the corporate body that is the publisher of the resource -- are these really the same thing? You may eventually wish to display them in the same area of your display, but that does not make them semantically the same. For the sake of clarity, if you have a need to distinguish between these different meanings of "publisher", then it would be best to treat them as three separate properties (a.k.a. "data elements").
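To make the distinction concrete, here is a minimal sketch in plain Python, with triples represented as simple tuples. The property names (ex:transcribedPublisher, ex:catalogerSuppliedPublisher, ex:publisherAgent) and identifiers are invented for illustration, not drawn from any published vocabulary; the point is only that one resource can carry all three publisher properties at once, each with its own meaning:

```python
# Hypothetical RDF-style triples (subject, property, value) for one resource,
# using three distinct publisher properties rather than one overloaded one.
# All property and entity names here are invented for illustration.
triples = [
    ("book:123", "ex:transcribedPublisher", "Harper & Bros."),           # copied from the item
    ("book:123", "ex:catalogerSuppliedPublisher", "Harper and Brothers"),  # composed by the cataloger
    ("book:123", "ex:publisherAgent", "agent:harper"),                   # URI for the corporate body
]

def values_of(subject, prop):
    """Return all values of a given property for a subject."""
    return [v for s, p, v in triples if s == subject and p == prop]

# Each meaning of "publisher" can now be queried (and displayed) separately.
print(values_of("book:123", "ex:publisherAgent"))
```

A display application could still assemble all three into one "publication" area; the separation costs nothing at display time but preserves the semantics for machine processing.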

Paying attention to the meaning of the property and the functionality that you hope to obtain with your data can go a long way toward solving some of these areas where you are dealing with what looks like a single complex data element. In library data that was meant primarily for display, making these distinctions was less important, and we have numerous instances of data elements that could either have values that aren't exactly alike or that were expected to perform more than one function. Look at the wide range of uniform titles, from a simple common title ("Hamlet") to the complex structured titles for music and biblical works. Or how the controlled main author heading functions as display, enforcement of sort order, and link to an authority record. There will be a limit to how precise data can be, but some of our traditional data elements may need a more rigorous definition to support new system functionality.

Question 2

Will the Internet ever be fast enough to assemble the equivalent of our current records from a collection of hundreds or even thousands of URIs?
I answered this in that same post, but would like to add what I think we might be doing with controlled lists in near-future systems. What we generally have today is a text document online that is updated by the relevant maintenance agency. The documents are human-readable, and updates generally require someone in the systems area of the library or vendor's support group to add new entries to the list. This is very crude considering the capabilities of today's technology.

I am assuming that in the future controlled lists will be available in a known and machine-actionable format (such as SKOS). With our lists online and in a coded form, the data could be downloaded automatically by library systems on a periodic basis (monthly, weekly, nightly -- it would depend on the type of list and needs of the community). The downloaded file could be processed into the library system without human intervention. The download could include the list term, display options, any definitions that are available, and a date on which the term becomes operational. Management of this kind of update is no different from what many systems do today to receive updated bibliographic records from LC or from other producers.
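A sketch of what that automated processing might look like, assuming a hypothetical record layout (term code, display form, effective date) for the downloaded list; a real download would arrive in SKOS or a similar format, but the processing logic would be much the same:

```python
from datetime import date

# Hypothetical entries from a downloaded controlled list. The field names
# are invented for illustration; note the effective date on each term.
downloaded = [
    {"term": "vid", "display": "Video recording", "effective": date(2009, 1, 1)},
    {"term": "hol", "display": "Hologram", "effective": date(2099, 1, 1)},  # not yet operational
]

def load_operational(entries, today):
    """Build a code-to-display table from entries whose effective date has arrived."""
    return {e["term"]: e["display"] for e in entries if e["effective"] <= today}

table = load_operational(downloaded, date(2009, 7, 10))
print(table)  # the not-yet-operational term is held back automatically
```

No human intervention is needed: terms announced in advance simply become active when their date arrives.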

The use of SKOS or something functionally similar can give us advantages over what we have today. It could provide alternate display forms in different languages, links to cataloger documentation that could be incorporated into workstation software, and it could provide versioning and history so that it would be easier to process records created in different eras.

There could be similar advantages to be gained by using identifiers for what today we call "authority data." That's a bit more complex however, so I won't try to cover it in this short post. It's a great topic for a future discussion.

Tuesday, July 07, 2009

Yee on RDF and Bibliographic Data

[This entire thread has been combined on the futurelib wiki. Please add your comments and ideas there.]

I've been thinking for a while about how I could respond to some of the questions in Martha Yee's recent article in Information Technology and Libraries (June 2009, pp. 55-80). Even the title is a question: "Can Bibliographic data be Put Directly onto the Semantic Web?" (Answer: it already is.) Martha is conducting an admirable gedanken experiment about the future of cataloging, creating her own cataloging code and trying to mesh her ideas with concepts coming out of the semantic web community. The article's value is not only in her conclusions but in the questions that she raises. In its unfinished state, Martha's thinking is provocative and just begging for further discussion and development. (Note: I hope Martha is allowed to put her article online, because otherwise access is limited to LITA members.) (Martha's article is available here.)

The difficulty that I am having at the moment is that it appears to me that there are some fundamental misunderstandings in Yee's attempt to grapple with an RDF model for library data. In addition, she is trying to work with FRBR and RDA, both of which have some internal inconsistencies that make a rigorous analysis difficult. (In fact, Yee suggests an improvement to FRBR that I think IFLA should seriously consider: that subject in FRBR should be a relationship, and that the entities in Group 3 should be usable in any relevant situation, not just as subjects. p. 66, #6. After that, maybe they'll consider my similar suggestion regarding the Group 1 entities.)

I'm trying to come up with an idea of how to chunk Yee's questions so that we can have a useful but focused discussion.

I'm going to try to begin this with a few very basic statements that are based on my understanding of the semantic web. I do not consider myself an expert in RDF, but I also suspect that there are few real experts among us. If any of you reading this want to disagree with me, or chime in with your own favorite "RDF basics," please do.

1. RDF is not a record format; it isn't even a data format


Those of us in libraries have always focused on the record -- essentially a complex document that acts as a catalog surrogate for a complex thing, such as a book or a piece of recorded music. RDF says nothing about records. All that RDF says is that there is data that represents things and there are relationships between those things. What is often confusing is that anything can be an RDF thing, so the book, the author, the page, the word on the page -- if you wish, any or all of these could be things in your universe.

Many questions that I see in library discussions of the possible semantic web future are about records and applications: Will it be possible to present data in alphabetical order? What will be displayed? None of these are directly relevant to RDF. Instead, they are questions about the applications that you build out of your data. You can build records and applications using data that has "RDF Nature." These records and applications may look different from the ones we use today, and they may provide some capabilities in terms of linking and connecting data that we don't have today, but if you want your application to do it, it should be possible to do it using data that follows the RDF model. However, if you want to build systems that do exactly what today's library systems do, there isn't much reason to move to semantic web technology.

2. A URI is an identifier; it identifies


There is a lot of angst in the library world about using URI-structured identifiers for things. The concern is mainly that something like "Mark Twain" will be replaced with "http://id.loc.gov/authorities/n79021164" in library data, and that users will be shown a bibliographic record that goes like:
http://id.loc.gov/authorities/n79021164
Adventures of Tom Sawyer
or will have to wait for half an hour for their display because the display form must be retrieved from a server in Vanuatu. This is a misunderstanding about the purpose of using identifiers. A URI is not a substitute for a human-readable display form. It is an identifier. It identifies. Although my medical plan may identify me as p37209372, my doctor still knows me as Karen. The identifier, however, keeps me distinct from the many other Karens in the medical practice. Whether or not your application carries just identifiers in its data, carries an identifier and a preferred display form, or an identifier and some number of different display forms (e.g. in different languages) is up to the application and its needs. The point is that the presence of an identifier does not preclude having human-readable forms in your data record or database.

So why use identifiers? An identifier gives you precision in the midst of complexity. Author n79021164 may be "Mark Twain" to my users, and "Ma-kʻo Tʻu-wen" to someone else's, but we will know it is the same author if we use the same identifier. And Pluto the planet-like object will have a different identifier from Pluto the animated character because they are different things. It doesn't matter that they have the same name in some languages. The identifier is not intended for human consumption, but is needed because machines are not (yet?) able to cope with the ambiguities of natural language. Using identifiers it becomes possible for machines to process statements like "Herman Melville is the author of Moby Dick" without understanding one word of what that means. If Melville is A123 and Moby Dick is B456 and authorship is represented by x->, then a machine can answer a question like: "what are all of the entities with A123 x->?", which to a human translates to: "What books did Herman Melville write?"
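The Melville example can be run as a few lines of Python, with triples as tuples. The identifiers C789 and Z999 are added hypothetically to show that the same mechanism distinguishes works and authors without any names at all:

```python
# "Herman Melville is the author of Moby Dick" reduced to opaque identifiers:
# Melville is A123, Moby Dick is B456, authorship is the relation "x".
# The machine answers the question without understanding any of the names.
triples = [
    ("A123", "x", "B456"),
    ("A123", "x", "C789"),   # a second, hypothetical work by the same author
    ("Z999", "x", "B456"),   # a different author identifier, hypothetically linked to the same work
]

def objects(subject, relation):
    """All entities that `subject` stands in `relation` to."""
    return [o for s, r, o in triples if s == subject and r == relation]

# "What are all of the entities with A123 x->?" -- i.e., what did Melville write?
print(objects("A123", "x"))  # → ['B456', 'C789']
```

Display forms ("Mark Twain", "Ma-kʻo Tʻu-wen") would be attached to the identifiers by further triples, chosen by the application at display time.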

As we know from our own experience, creating identities is tricky business. As we rely more on identifiers, we need to be aware of how important it is to understand exactly what an identifier identifies. When a library creates an authority record for "Twain, Mark," it may appear to be identifying a person; in fact, it is identifying a "personal author," who can be the same as a person, but could be just one of many names that a natural person writes under, or could be a group of people who write as a single individual. This isn't the same definition of person that would be used by, for example, the IRS or your medical plan. We can also be pretty sure that, barring a miracle, we will not have a situation where everyone agrees on one single identifier or identifier system, so we will need switching systems that translate from one identifier space to another. These may work something like xISBN, where you send in one identifier and you get back one or more identifiers that are considered equivalent (for some definition of "equivalent").

3. The key to functional bibliographic systems is in the data

There is a lot of expressed disappointment about library systems. There is no doubt that the systems have flaws. The bottom line, however, is that a system works with data, and the key to systems functionality is in the data. Library data, although highly controlled, has been primarily designed for display to human readers, and a particular kind of display at that.

One of the great difficulties is with what libraries call "authority control." Certain entities (persons, corporate bodies, subjects) are identified with a particular human-readable string, and a record is created that can contain variant forms of that string and some other strings with relationships to the entity that the record describes. This information is stored separately from the bibliographic records that carry the strings in the context of the description of a resource. Unfortunately, the data in the authority records is not truly designed for machine-processing. It's hard to find simple examples, so I will give a simplistic one:

US (or U.S.) is an abbreviation for United States. The catalog needs to inform users that they must use United States instead of US, or must allow retrieval under either. The authority control record says:
"US see United States"

United States, of course, appears in a lot of names. You might assume, then, that everywhere you find "United States" you would find a reference, so that United States. Department of State would have a reference from U.S. Department of State steering the user away from that undesirable form of the name ... but it doesn't. The reference from U.S. to United States is supposed to somehow be generalized to all of the entries that have U.S. in them. Except, of course, for those to which it should not be applied, like US Tumbler Co. or US Telecomm Inc. (but it is applied to US Telephone Association). There's a pattern here, but probably not one that can be discerned by an algorithm, and quite possibly not obvious to all humans, either. What it comes down to, however, is that if you want machines to be able to do things with your data, you have to design your data in a way that machines can work with it using their plodding, non-sentient, aggravatingly dumb way of making decisions: "US" is either equal to "United States" or it isn't.
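A naive attempt to apply the "US see United States" reference by algorithm makes the problem visible. In this sketch the exception list is hypothetical; the point is that exceptions must be enumerated heading by heading, because no pattern lets a machine decide on its own:

```python
# A naive global application of the reference "US see United States".
# The exception set is hypothetical and must be maintained by hand --
# which is exactly the problem: "US" either equals "United States" or it doesn't.
SEE_REFERENCE = ("US", "United States")
EXCEPTIONS = {"US Tumbler Co.", "US Telecomm Inc."}  # headings where "US" is the established form

def expand(heading):
    """Rewrite a leading 'US' to 'United States' unless the heading is a known exception."""
    abbrev, full = SEE_REFERENCE
    if heading in EXCEPTIONS or not heading.startswith(abbrev + " "):
        return heading
    return full + heading[len(abbrev):]

print(expand("US Department of State"))   # → United States Department of State
print(expand("US Tumbler Co."))           # unchanged: a hand-listed exception
```

Note that even the "successful" expansion is imperfect: the established heading is actually "United States. Department of State", with a period that no simple substitution supplies.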

Another difficulty arises from the differences between the ideal data and real data. If you have a database in which only half of the records have an entry for the language of the work, providing a search on language guarantees that many records will never be retrieved by those searches even if they should be. We don't want to dumb down our systems to the few data elements that can reliably be expected in all records, but it is hard to provide for missing data. One advantage of having full text is that it probably will be possible to determine the predominant language of the work even if it isn't encoded in the metadata, but when you are working with metadata alone there often isn't much you can do.

A great deal of improvement could be possible with library systems if we would look at the data in terms of system needs. Not in an idealized form, because we'll never have perfect data, but looking at desired functionality and then seeing what could be done to support that functionality in the data. While the cataloging data we have today nicely supports the functionality of the card catalog, we have never made the transition to truly machine-actionable data. There may be some things we decide we cannot do, but I'm thinking that there will be some real "bang for the buck" possibilities that we should seriously consider.

Next... I'll try to get to the questions in Martha's article.

Tuesday, June 30, 2009

Even paranoids....

I'm not the most diligent of bloggers, by any means, and the contents of this blog are pretty narrow in terms of topics. Mostly I have written about Google books, about RDA and other library metadata developments, and recently about OCLC. Although each post is probably offensive to someone out there, the total number of enemies that I can make is probably quite small -- and compared to some bloggers nearly infinitesimal.

So imagine my surprise this morning when I received a notice from Google saying that my blog had been marked as Spam, and would be removed if I didn't take action. There are two ways that your blog can get the Spam qualification: 1) if it is caught by Google's automatic spam detectors and 2) if someone clicks on the "flag blog" link and reports it as spam.

Given the technical nature of my posts, I find the first possibility highly unlikely. This means that I must consider the latter. I hope it is only coincidence that my latest post (and one that has lingered here as the latest for a bit too long, perhaps) is a critique of OCLC and its record use policy. I would love to be able to say that I know that OCLC would not stoop to this kind of censorship, but unfortunately I have experience to the contrary.

Earlier this year I arrived in Dublin only to be refused admittance to a meeting that they had agreed that I could attend (and that I had flown all of the way to Ohio to attend). Then, a few months ago when OCLC was told that I would be writing an article for InfoToday on their "web-scale service" the journal's editor received numerous phone calls from OCLC's press person voicing OCLC management's "concern" that I had been chosen to write the article. What the editor was supposed to do about that concern wasn't articulated, but she kept me on the story and even resisted their request to review the article before it was published. It was a dramatic couple of days, and I'm very grateful to her for her unwavering defense of freedom of the press.

I admit that it is at least equally likely that some random person with a cosmic grudge decided to click on "this is spam," but you may understand why I'm beginning to be a bit paranoid, and wondering if I don't have real enemies.

Tuesday, June 09, 2009

OCLC Policy - What is the Question?

I have a difficult time understanding the discussion around the OCLC WorldCat Record Use Policy. At least one reason for my confusion is that I have yet to see an explanation of the problem that the policy is attempting to address. The recent talk by Jennifer Younger, on the initial recommendations of the WorldCat record use policy review board, left me with the same uncertainty: If this is the answer, then what is the question?

Many people have commented on Younger's slides, and in particular on the recommendation by the review board that OCLC abandon the November, 2008 policy and begin anew. Younger's talk, however, did not answer one of the key questions that she herself lists:

"Despite the statement on intent in the proposed policy and the use examples in the FAQ, respondents [to the review board's survey] indicated they do not see the problem, and where they do, they do not see how the proposed policy will address the problem." [Younger]

[Quotes in this post are taken from Peter Murray's much appreciated transcription of Younger's talk.]

Younger's brief definition of the charge of the review board is:
"The review board is charged with recommending principles on which a new policy should be based." [Younger]
I strongly suggest that before beginning work on the principles that the review board take as much time as it needs to clarify the nature of the problem that the policy wishes to address. Unless the problem is clear, the policy cannot provide a coherent solution.

That said, what do we know about the problems as addressed by the policy and the review board?

"... a legal document..."


The policy document itself does not have a clear problem statement; it limits itself to stating what the rules might be for use and re-use of WorldCat records. The FAQ goes a bit further toward defining a problem, although as stated it seems to be saying that the problem is a lack of a policy:
"To be successful negotiating and working with prospective partners, many of which are in the private sector, OCLC needs to move beyond the Guidelines to a policy that will be recognized by these organizations and others outside the library-archives-museum space as a legal document. Achieving clarity about the rights and conditions for using and transferring WorldCat data is a precondition to OCLC's sitting down to talk, on its members' behalf, with organizations that otherwise might have little or no interest in promoting the use and visibility of library, archival, and museum collections and services." [FAQ]

In this statement OCLC is arguing for its need to clarify the contract between OCLC and the libraries so that it can engage in revenue-producing deals with other entities. It seems unlikely that OCLC would not have the right or ability to make deals involving WorldCat records. If that is the question, then OCLC just needs an agreement with its membership that the organization can make use of the members' data in this way. But that's not the issue that the policy addressed. The issue isn't OCLC's use of the data but the libraries' use of the data. Without the policy, OCLC and the members are both free to monetize (or not monetize) the bibliographic records that they hold however they wish. The policy, however, specifically requires that all deals with other entities be controlled by OCLC. It also designates OCLC as the sole decision-maker for use of WorldCat records. It shouldn't be surprising that this may not be acceptable to all members.

What's Good for OCLC....

Statements by OCLC always equate the interests of OCLC with the interests of the members, a view that clearly isn't shared by all members. In her talk, Younger brings up this tension between OCLC and its members, which she calls "the gap problem":
"Perhaps then in this context we should not be surprised that 'the gap problem' emerged, that the proposed record use policy was perceived by many as putting OCLC's interests ahead of those of member libraries." [Younger]
Younger's talk places the solution to this problem squarely in the laps of the members, who are seen as the ones who need to change for the gap to be closed:
"Second, the equilibrium has been disrupted. We must revisit the social contract between OCLC and its members. ... But as new generations of members come into our ranks, it becomes more difficult to explain the social contract that is OCLC. Just as in ballroom dancing it takes two people to tango. We need to work together — OCLC and its members — to solve the gap problem as it relates to the past but more importantly to the future. We need to understand our respective roles in reinforcing the values for working within the OCLC collaborative and understand how those values can support working with other partners in the information ecosystem." [Younger]

As presented, the "gap" is that some members do not agree that what benefits OCLC always benefits the entire membership. The solution is to "reinforce the values for working within the OCLC collaborative." The assumption that what is good for OCLC is good for the members is presented without question. The 2008 policy had no provisions for member input into the decision process relating to particular uses of WorldCat records, nor did it provide a mechanism for members to remedy decisions that they feel do not benefit them. Policy statements that give OCLC the sole decision-making power on record use ("may be withheld by OCLC, without liability, within its sole discretion" D. 3) understandably make some people nervous.

The "gap" is presented as a difference in perception, but never a difference in actual benefits. For example, in the FAQ, OCLC explains that one of the needs is for a policy that ensures a "fair return" to OCLC members. That "fair return" is revenue that goes to OCLC.

"The existing Guidelines, and now the revised Policy seek to support WorldCat’s continued value by ensuring that the use and transfer of WorldCat data outside the OCLC cooperative provides a fair return to OCLC members and benefits libraries, archives, and museums in general." [FAQ]
It shouldn't be surprising that a phrase like "fair return to OCLC members" causes members to ask what it is that they are getting. I can find nothing that explains what revenues have been received by OCLC for WorldCat data, and nothing specifically about how those revenues have benefited the cooperative. This points to what I perceive as one of the causes for the "gap problem" and that is a general lack of transparency about OCLC's business. It is not going to work to simply say to the membership "trust us, we have your interests in mind." The members have every right to ask for proof of that. OCLC, for its part, should be quite happy to show members how uses of WorldCat data have been for their benefit. (Note, OCLC's members council may have this information, but I don't find it in the annual reports, nor in the IRS 990 form, of which the most recent is the one for 2006.)

The Value Proposition

The one thing OCLC cannot rely on is any argument that OCLC must be supported and maintained simply because it is OCLC. Everything hinges on a convincing argument of WorldCat's "value." The value of WorldCat is invoked frequently, but no elaboration on the nature of the value is given:
"We must focus on the value of sustaining WorldCat for the benefit of members and non-members, for OCLC, libraries and other memory institutions, and other partners in the information ecosystem." [Younger]

Threats

The policy implied -- but never stated -- that OCLC was responding at least in part to some perceived threats. Younger uses the term "threat," but without elaborating or giving examples:

"One intent expressed in the proposed policy was to protect the members’ investment in WorldCat and ensure the use of WorldCat records would benefit the membership. We need to identify the major encroachments that threaten WorldCat." [Younger]
There are some hints that threats would be to the size, comprehensiveness and quality of the database. The policy's definition of "reasonable use" stated that reasonable use:

"... would not include any Use of WorldCat Records that:
a. discourages the contribution of bibliographic and holdings data to WorldCat, thus damaging OCLC Members' investment in WorldCat, and/or
b. substantially replicates the function, purpose, and/or size of WorldCat."

Younger expands briefly on the threat question, talking about comprehensiveness of the database, and cash flow:
"Would the proposed uses by an OCLC member, consortia, or other players lead to a less-comprehensive or authoritative WorldCat? Would the proposed uses draw a significant cash flow away from maintaining WorldCat? Would the proposed uses benefit some segment of the library and other memory institution community without materially diminishing the benefit and use of WorldCat by other members?" [Younger]
This last statement at least introduces the possibility that there could be uses of library bibliographic data that do not threaten WorldCat. Defining this line between threat and non-threat seems to be key to decision-making around record use. While possibly not appropriate for the policy itself, it could be defined in some detail in operational documents that are available to members and potential users of WorldCat records.

The Control Issue

There is an implication that allowing bibliographic data to "go wild" on the Internet would weaken OCLC. It seems obvious that the policy is designed to give OCLC control over the use of records in order to obtain revenue from the uses. Throughout Younger's talk she refers to "members" and "partners."
"Members see a future in which WorldCat is available for reasonable use on a non-discriminatory basis to members as well as to other partners." [Younger]
There is never any mention of the general public and no mention of open access. The vision here is clearly a closed system with usage controls in place. This was the problem with the policy as written, and it will continue to be a problem if OCLC (with or without its members) insists on maintaining control. There is no hint here that the OCLC model may need to change, that a walled bibliographic city no longer makes sense. It will be very disappointing if the review board does not challenge this basic assumption, if it does not explore other options for the future.

Conclusion

I hope that OCLC's members will insist on a clarification of the goals of the policy as well as on how those goals will be managed over time. Sticking my neck out, I conclude that:
  • there cannot be a workable policy without a clear problem statement to guide it
  • a library data silo is quite possibly not the best thing for the library community today, and this needs to be addressed
  • the idea that "what is good for OCLC is always good for OCLC's members" is unreasonable; no contract should be accepted that doesn't provide for negotiation between the library members and OCLC regarding uses of the WorldCat records

Tuesday, May 19, 2009

LCSH as linked data: beyond "dash-dash"

The SKOS version of LCSH developed by LC has made some choices in how LCSH would be presented in a linked-data format. One of these choices is that the complex headings (which are the vast majority of them) are treated as a single string:
Italy--History--1492-1559--Fiction

While this might fit appropriately as a SKOS vocabulary, in my opinion it does not work as linked data. I'm going to try to explain why, although it's quite complex. Part of that complexity is that LCSH is itself complex, primarily because there are many exceptions to any pattern that you might care to describe. (For more on this, I suggest Lois Mai Chan's Library of Congress Subject Headings, 4th edition, the chapter on geographic subject headings, pp. 67-89)

Taking the heading above, as I mentioned in my previous post, the geographic term Italy is not in LCSH even though it can indeed be used as a subject heading. Instead, Italy is defined as a name heading in the LC name authorities file. In that file, and only in the name file, alternate forms of the name are included (altLabels, in SKOS terminology):
451 __ |a Repubblica italiana (1946- )
451 __ |a Italian Republic (1946- )
451 __ |a Włochy
451 __ |a Regno d'Italia (1861-1946)
451 __ |a Iṭalyah
451 __ |a Italia
451 __ |a Italie
451 __ |a Italien
451 __ |a Italii͡a
451 __ |a Kgl. Italienische Regierung
451 __ |a Königliche Italienische Regierung

There are no altLabels in the LCSH entry for Italy--etc. And because the term Italy is buried in an undifferentiated string, there is no linked data way to say that the Italy in Italy--History--1492-1559--Fiction is the same as http://id.loc.gov/authorities/n79021783, which will presumably be the URI for the name.

It is assumed in LC authorities that the altLabels for a name term that appears in a subject heading apply to both the name used as a name and the name used as a subject heading. In the card catalog, where the name alone would appear first in the alphabetical browse of the cards, it was only necessary to make references to that "head" of the list, which would, in our case, be Italy alone. This has caused great problems in online catalogs where searching is by keyword, not a linear alphabetical search. Some systems manage to get around this by doing a string compare to the same subfields in name headings and subject headings, and then transferring the altLabel forms to the related subject headings.
$a Shakespeare, William, $d 1564-1616
$a Shakespeare, William, $d 1564-1616 $v Adaptations $v Periodicals
In this case, the $a and $d subfields represent the same authoritative entity. The rules say that they are, and must be, the same authoritative entity. If they don't match exactly then someone has done something wrong. They are both instances of a name identified as "n 78095332", and which will presumably be given the URI http://id.loc.gov/authorities/n78095332. There is no question about that.
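The string-compare workaround described above can be sketched as follows. This pulls the $a and $d subfields out of a heading and tests whether a subject heading's name portion matches a name heading exactly; the input format is a simplified "$x value" notation, and real MARC processing is more involved.

```python
# Sketch of the string-compare workaround: extract the $a and $d subfields
# from a heading and test whether a subject heading's name portion matches
# a name heading exactly. Subfields here are simplified "$x value" tokens.
import re

def name_portion(heading: str) -> tuple:
    """Return the ($a, $d) values of a heading as a tuple."""
    subfields = dict(re.findall(r"\$([ad]) ([^$]+)", heading))
    return (subfields.get("a", "").strip(), subfields.get("d", "").strip())

name = "$a Shakespeare, William, $d 1564-1616"
subject = "$a Shakespeare, William, $d 1564-1616 $v Adaptations $v Periodicals"

# If these match, the altLabels from the name authority record can be
# transferred to the related subject heading.
print(name_portion(name) == name_portion(subject))  # True
```

This is the "same subfields" comparison some systems use; anything beyond $a/$d ($v, $x, etc.) is ignored, since only the name portion is controlled by the name authority.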

There is also no question that when the name is used in a subject heading it has the full meaning that it is given in the name heading record, including alternate forms of the name and the many notes fields provided by the catalogers that created the authority record. That these don't appear in the LCSH file does not mean that it is not the case: it means only that the LCSH record assumes that the name record exists and provides that information, and that the information is applied to the name in the subject entry through the linear nature of the dictionary catalog.

We mustn't confuse the form with the meaning. That LCSH has a rather arrested form is unfortunate, but it was never intended to be used outside of the context of the full set of authorities that gives full treatment to those things that have "proper names." (cf. Chan, chapter 4)

If we wish for the LC authorities to be used in a linked data environment, then we have to make sure that the linking capabilities are there. Although I agree that each LCSH record has an identifier, and that identifier should be used, I don't agree that what is expressed in the LCSH record is a dumb, undifferentiated string. In this post I have addressed the relation to name headings, but there are other uses of controlled vocabularies within the subject headings that I haven't fully investigated yet.

Wednesday, May 13, 2009

LCSH as linked data: what is an LC Subject Heading?

The Library of Congress Subject Headings have been placed online in SKOS. You can search within the set or download the entire thing in RDF/XML or as n-triples. This is a welcome development.

I must say that I would also welcome some documentation on the decisions that were made, as viewing the actual data has left me with a number of questions. I'm going to begin my comments with a question about scope, and some confusion I'm having as I think about how I would want to use this data.

What's an LC Subject Heading?

It appears that the LCSH file that is online represents those authority records whose LC control numbers begin with "sh", as in: sh 00009880. (That numbers 342,684 records.) However, if you do a Subject Authority Headings search in the LC authorities database you will retrieve any authority record that can be used as a subject. This means that you will retrieve personal names, corporate names, and geographic entities that can be used as subjects. (Note that this is probably a large portion of the name authority file.) This is a mixture of records with LCCNs that begin with "n" (for the name file) and those that begin with "sh" (for the subject heading file). I'm at a loss to explain what determines whether a heading gets an LCCN beginning with "sh" and would love to get an explanation.
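The split described above comes down to the alphabetic prefix of the control number. A small sketch, using the sample numbers from these posts (the prefix rule is just the observed convention, not official documentation):

```python
# Sketch: partition LC control numbers into the "sh" (subject heading) and
# "n" (name) files by their alphabetic prefix. Sample LCCNs are from the
# post; the rule is the observed convention, not official documentation.

def lccn_prefix(lccn: str) -> str:
    """Return the leading alphabetic prefix of an LC control number."""
    return "".join(ch for ch in lccn if ch.isalpha())

records = ["sh 00009880", "sh 85069035", "n 79021783", "n 78095332"]
by_file = {}
for lccn in records:
    by_file.setdefault(lccn_prefix(lccn), []).append(lccn)

print(by_file)  # {'sh': [...], 'n': [...]}
```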

The result is that a search in the LCSH file on the word "Italy" brings up 3,516 headings, with the word somewhere in the heading. However, the heading "Italy" alone is not included. You do have:
Italy, Central
Italy, Northern
Italy, Southern
and you have:
Italy, Northern--Civilization
Italy, Northern--Civilization--Germanic influences
etc.
But not "Italy."

A search in the name heading database on LC's online authority file yields a name heading entry for "Italy." That database (whose response is in the form of a browse list) has innumerable pages for corporate names under the initial term "Italy":
Italy.
Italy. Ambasciata (India)
Italy. Confederazione fascista degli industriali.
It also includes "Italy, Southern" with its LC control number "sh 85069035".

The upshot is that the LC Subject Heading file at http://id.loc.gov is not the same as a subject heading search in the online authorities database. It also isn't always logical which file a heading falls into. "Italy. Ambasciata (India)" is in the name heading file as a corporate name, but "Palazzo Dell'Ambasciata di Spagna (Rome, Italy)" is in the subject heading file as a corporate name. There undoubtedly is a set of rules that explains all of this, but it seems to me that separating the subject file from the name files creates a split between headings that will not be mirrored in actual use.

This may not matter if the files are combined in the end, and the URI scheme makes it look like all authorities will have IDs that directly follow "/authorities/" in the URI. However, although they are both coded as corporate names, the "Palazzo... " record gets the "cool URI" http://id.loc.gov/authorities/sh2002000509#concept. Note the ending in "concept". I don't know what hash ending will be given to entries from the names file, but I do find it odd that corporate names could have two different hash endings, depending on which file they are from. To be frank, since the division into different files doesn't seem terribly logical, and since many items in the name file can also be used as concepts, I would prefer that the "#" indicate the type of heading (personal name, corporate name, conference name, topic, geographic name) rather than the file that the heading comes from. That is, the "#" would reflect the MARC tag: 100, 110, 111, 150, 151.
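The alternative I'm suggesting could be sketched like this: derive the URI fragment from the MARC 1XX tag of the authority record (the type of heading) rather than from the file the record lives in. The fragment names themselves are hypothetical.

```python
# Sketch of the proposal above: derive the URI fragment from the MARC 1XX
# tag (the type of heading) rather than from the source file. The
# tag-to-fragment names are hypothetical, not an LC scheme.

TAG_TO_FRAGMENT = {
    "100": "personalname",
    "110": "corporatename",
    "111": "conference",
    "150": "topic",
    "151": "geographicname",
}

def authority_uri(lccn: str, tag: str) -> str:
    base = "http://id.loc.gov/authorities/"
    return f"{base}{lccn.replace(' ', '')}#{TAG_TO_FRAGMENT[tag]}"

# "Palazzo Dell'Ambasciata di Spagna (Rome, Italy)" is a corporate name
# (110) that happens to live in the subject file; under this scheme its
# fragment reflects the tag, not the file.
print(authority_uri("sh2002000509", "110"))
```

Under this scheme a corporate name gets the same kind of fragment whether its record carries an "sh" or an "n" control number, which is the point: the type of entity is stable, the file assignment is not.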

Sunday, May 10, 2009

Walt Crawford should read the document

In his March, 2009 Cites & Insights, Walt Crawford does a roundup of comments on the Google/AAP settlement, and gets very agitated when reviewing some of my posts. I'm used to that. But agitation tends to cancel out reason, and Walt gets some things wrong that he might have understood better if he had kept a clear head.

In response to my criticism that Google is digitizing without regard to collection building, Walt says:
"I don’t know of any big academic library or public library that’s a single disciplinary collection—or, realistically, a set of well-curated collections. "
I'd like to hear from academic librarians on this one. My understanding was that an academic library is INDEED a set of well-curated collections.

Walt:
"I don’t remember public universities admitting to substantial costs in cooperating with Google."
What's the cost? Dan Greenstein estimated $1-2 per book. Cheap, but still considerable for a library scanning millions of books. The cost is primarily in staff time, shelving and reshelving books. Under this agreement, there is also the cost of meeting the security requirements that are imposed. (That's in Appendix D.) These requirements, which are possibly quite reasonable, will have a greater cost than what most libraries do today for digital materials, and will be one of the primary reasons why some libraries do not contract to receive copies of the digitized items. (Note that some of the potential library partners are working hard to collaborate on the Hathi Trust, which does appear to meet the standards of the agreement; others, however, have decided that they will not attempt to store digital copies.)
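Greenstein's per-book figure scales up quickly. A back-of-the-envelope calculation, where the 5-million-volume collection size is my hypothetical, not a figure from the settlement:

```python
# Back-of-the-envelope cost of partner scanning at Dan Greenstein's
# $1-2 per book estimate. The 5-million-volume collection is hypothetical.

low, high = 1, 2            # dollars per book (Greenstein's estimate)
volumes = 5_000_000         # hypothetical large research library

print(f"${low * volumes:,} to ${high * volumes:,}")  # $5,000,000 to $10,000,000
```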

In a post I argued that had libraries gone ahead and digitized their own collections (for the purposes of indexing and searching), that this probably would have been considered fair use.

Walt:
"Well…this is not a judicial finding. I find it unfortunate that Google didn’t fight the good fight, and I think it will make things much harder for another commercial entity to attempt similar digitization and use—but I don’t see that library use of “their own materials” has changed in any way."
Not of their hard copy materials, but legal minds think that this changes the landscape for digitization and the use of digitized materials, even closing some options that might have been available before.
"The proposed settlement agreement would give Google a monopoly on the largest digital library of books in the world. It and BRR, which will also be a monopoly, will have considerable freedom to set prices and terms and conditions for Book Search’s commercial services.... If asked, the authors of orphan books in major research libraries might well prefer for their books to be available under Creative Commons licenses or put in the public domain so that fellow researchers could have greater access to them. The BRR will have an institutional bias against encouraging this or considering what terms of access most authors of books in the corpus would want." Pam Samuelson
And to my statement:
"The digitization of books by Google is a massive project that will result in the privatization of a public good: the contents of libraries. While the libraries will still be there, Google will have a de facto monopoly on the online version of their contents."
Walt first prefaces it with:
"I take issue with the very first sentence, as I’ve taken issue consistently with the same claim by others with even higher profiles than Coyle (who are even less likely to ever admit they could be mistaken)."
Well, it would have been nice if he had said who they are. But thanks for letting me know that you consider me a "lower profile" person, Walt. He goes on to say:
"Nonsense. Sheer, utter nonsense. The libraries and contents will still be there. OCA will still be there. I’m sorry, but this one just drives me nuts: It’s demonization of the worst kind and an abuse of the language."
Well, I'm not sure how this abuses language, but there is general agreement that Google gets a monopoly... at least on out-of-print books, which is the vast majority of books in libraries. (Not on public domain books, which is what the OCA digitizes, but anyone can digitize public domain books.) So although the libraries and their contents will still be there, and can be used in hard copy as they are today, no one but Google can digitize the in-copyright works without incurring liability. So "monopoly on online version of their contents" is a factual statement, if you understand that public domain is public domain. (Note, this settlement agreement is extremely complex, with some real zingers hidden in its 134 pages. It's not possible to cover it all in a blog post, so anyone who is interested really needs to read the document itself, painful as that process is.)

In terms of preservation and longevity concerns, Walt asks:
"Won’t the fully-participating libraries have digital copies? I can’t think of institutions with better longevity."
To begin with, only fully participating libraries will have digital copies, and we don't yet know how many libraries will choose that option. Other libraries, even those that are only allowing Google to digitize public domain books, do not get to keep copies of the digital files. (Not only that, public domain libraries that have been cooperating with Google have to delete all of their copies of the files that they hold today, as per this agreement. See Appendix B-3.) The only party with copies of all of the files will be Google.

There are statements in the settlement about what happens if Google "fails to meet the Required Library Services Requirement" or simply decides not to continue. I refer you to page 84 of the settlement, and hope that someone can make sense out of it. The way I read it, libraries can then engage a third-party provider, who will receive the files from Google.

The key thing here is that even in the event of the failure of Google, libraries are not allowed to make uses of their own scans, such as those that are permitted to Google by this settlement. The restriction to "computational uses" and some other minor uses stands, even in that eventuality.

When I say:
"Google should be required to carry all digital Books without discrimination and without liability."
Walt replies:
"You mean “all digital books that Google’s scanned”? I suspect Google wouldn’t argue with this."
That is exactly what I mean, and Google does indeed argue with it. As a matter of fact, the settlement only obligates Google to provide access to at least 85% of the books it scans. That "access" refers to the subscription service that will be available to libraries and other institutions. The settlement says:
"Google may, at its discretion, exclude particular Books from one or more Display Uses for editorial or non-editorial reasons." p.36
That's followed by an affirmation of the "value of the principle of freedom of expression," which I must say rings a bit hollow in this context. Google has to notify the Registry if it has excluded a book, and to provide a digital copy of that book to the Registry. The Registry can then seek out a third party to provide services for excluded books. Here, however, is James Grimmelmann's concern on that front:
"The second is that no one besides the Registry might ever find out that Google has chosen to de-list a book. If the Registry doesn’t or can’t engage a replacement for Google, the book would genuinely vanish from this new Library of Alexandria. Perhaps that should happen for some books, but decisions like that shouldn’t be made in secret. When Google chooses to exclude a book for editorial reasons, it should be required to inform the copyright owner and the general public, not just the Registry."
What might Google exclude? Perhaps very little, but at the ALA panel in Denver in January, 2009, Dan Clancy of Google made an off-the-cuff remark that, as I recall, had the word "pornography" in it. Given the recent embarrassment of Amazon when it had to face the fact that many of its best sellers are rather salacious in nature, I can imagine Google also developing concern about the visibility of the texts that make us uncomfortable.

There are a lot of legitimate reasons for concern about this proposed settlement. And I don't think that anything that I have said is "nonsense."