Coyle's InFormation: 2010

Saturday, December 25, 2010

Signs of success

Either this:

(unavailable)

Or this:

(reduced to using a raw ip address)

Tuesday, December 14, 2010

OCLC Motion to Dismiss, Pt II

Continuing on...

Rights

Here's a somewhat extended quote from the Motion that quotes the original complaint:

"At other points in the Complaint, without addressing the text of the records use policy, Plaintiffs characterize the policy as placing broad restriction on a library's use of its own records. ([Complaint] paras. 34-36) However, these conclusory allegations are belied by the actual terms of the records use policy pled above.. For example, Plaintiffs claim that 'a member library may not transfer or share records of its own holdings with commercial firms' ([complaint] para 35), but the records use policy states no such thing. Throughout these allegations, moreover, Plaintiffs confuse and obscure the terms 'OCLC records' and 'library records.' In reality, the situation is simple: OCLC does not prohibit a library from sharing its original cataloging records with whomever it pleases; it does, consistent with the fact that the WorldCat database is copyright, claim a legal right to the unique identifier information used to link and make usable records in WorldCat." (Motion, pp 7-8)

"Again, at most, the Complaint pleads only that libraries cannot share OCLC's records, not that they cannot share the records they themselves created." (Motion, p. 14)

This is a very interesting set of statements. First, it plays with the ambiguity in talking about "library records," denying that libraries cannot convey records of their holdings, as stated in the Complaint, then stating that they can share their original cataloging records, which is not what most in the library world would consider equivalent to "library holdings." What it comes down to is the ownership of the records in the library catalogs that represent the holdings of the library. By "the holdings of the library" I understand not just some holdings, but either all of the holdings or some useful set of those holdings. The set of records that were originally cataloged by the library is a somewhat random set, and not useful as "library holdings." OCLC claims ownership in all records in a library's catalog that were not created as original cataloging by that library. Although this is a distinction, it is not one that relates to any particular functionality or to any useful library project involving holdings. It's useless nonsense, is what it is, nitpicky, and proof that OCLC was boxed into a corner as it tried to claim ownership over the millions of records created by libraries around the world.

OCLC also states in the second quote above that those records in the library data are "OCLC's records" and are not records that the libraries created. Here, "created" is a key verb. Any library that has done significant modification and upgrading to a record can probably claim at least an amount of co-creation with other libraries. The claim that those records belong to OCLC is an insult to the libraries that have put so much effort into the shared pool of bibliographic data. Of course, OCLC would counter that the libraries and OCLC are one and the same. The unilateral actions of OCLC around the record use policy definitively shattered that view.

Equally interesting is the claim of copyright on the database, a claim that has not been challenged and that might not survive a challenge. A database of bibliographic data may just be seen as a compilation of facts, essentially sweat of the brow rather than a creative output. Add to that the fact that much of the sweat was not OCLC's but was on the part of thousands of libraries, and the copyright claim looks thin. Ditto the claim to the OCLC number, which is purely a sequential number assigned to records as they enter the system. The claim that the OCLC identifier makes OCLC records usable is not defensible, IMO, in that every database assigns numbers to things as part of the mechanical database management process. There's nothing new or creative about the fact that OCLC records have OCLC database numbers.

Remember, though, that these statements are not meant for you and me; they are addressed to a court that may have very little knowledge in these matters. Obfuscation of the facts is undoubtedly part of the trial process, and on the part of all parties involved. Unfortunately, OCLC's motion goes beyond obfuscation -- it gets nasty.

Sarcasm and Nastiness

I've only read the legal documents for a few cases that I'm particularly interested in, so my experience here is limited. However, I would assume that a court case would best be won on cleverness, wily strategies and the ability to outwit one's opponent. In this as in other professional and public endeavors, I would expect the participants to affect a tone of detached politeness, even while skewering their rival. The OCLC motion plummets into sarcasm and nastiness. Here are some quotes:

"...Plaintiffs have thrown a plethora of allegations of OCLC's purportedly anticompetitive actions into the Complain to see if any stick..." (Motion, pp. 1-2)

"While OCLC denies that either of these libraries has suffered as the result of anything other than purchasing the Plaintiff's inferior cataloging software..." (Motion, p. 17)

"... vigorous competition against a company offering less expensive, but inferior products, is perfectly lawful." (Motion, p. 1)

"Nevertheless, what is sauce for the goose is sauce for the gander -- having pled a fiction that undercuts the existence of any claims they can pursue, Plaintiffs cannot claim to have been injured..." (Motion, p. 4, footnote)

"Nothing in the antitrust laws requires OCLC to subsidize SkyRiver's inferior product by setting its pricing for registering holdings into WorldCat as low as possible." (Motion, p. 28)

I find these statements to be embarrassingly unprofessional in nature, although for all I know this is the norm in legal arguments.

Separate Realities

I suppose that one of the main skills for legal argumentation is the ability to present "facts" in ways that benefit your client, regardless of the facts. (If I were a judge and had to listen to this stuff, I'm sure I'd be driven to homicide.) Here are some examples from the motion to dismiss:

1. The named libraries, Michigan State and Cal State Long Beach, were not harmed by OCLC; they simply declined to purchase OCLC's record upload service. This is cited as proof that they were not coerced into making a purchase (which appears to be one of the antitrust offenses). (p. 29) There is no mention that the libraries could not afford the price that OCLC offered, that the price changed without warning, etc.

2. WorldCat Local is not a competitor to ILS systems because it exists in addition to the ILS system. The Motion of course completely fails to connect WC Local, its attempt to limit use of the bibliographic data, and the upcoming "in the cloud" library systems platform. Are they worried that it might actually look like improper use of the WorldCat database?

3. SkyRiver does have bibliographic records, so OCLC cannot be accused of having a monopoly on bibliographic records. (As if any bunch of bibliographic records will do.) Elsewhere in the document they boast of having the largest bibliographic database. Are we back to the Goose and the Gander?

_____

These are just a few of the topics in the Motion, and just the ones that I found most interesting. They may not even be the most relevant topics relating to the lawsuit. I suggest that you read the Motion and other documents for yourself.

OCLC Motion to Dismiss, Pt I

OCLC has filed a motion to dismiss in the antitrust lawsuit brought by SkyRiver/III. I presume that this is Standard Operating Procedure in cases of this type. As someone who is not versed in the complexities of antitrust law, I have no idea if OCLC makes a good case in its motion. My impression is that the OCLC lawyers are quite adept, and that bodes well for OCLC in the case.

I will comment on some interesting text and subtext of the motion. Since this will get long, here is a quick summary of what follows:

The motion states that SkyRiver has so far offered little proof of harm due to OCLC's business practices.
The motion may play on the court's ignorance of the library world and of OCLC's definitions.
OCLC makes some interesting claims to rights.
The motion makes claims that twist the words of SkyRiver's complaint.
The motion contains some unfortunate use of sarcasm and nastiness.
The motion undermines some previous OCLC claims as to the force of the Record Use policy.

Little Proof

The motion claims that the SkyRiver complaint contains few hard facts that could be used to back up the antitrust claims. (Although I have no idea how detailed such a complaint is supposed to be.) It doesn't explain the library market and OCLC's role in it. What I find particularly lacking is that there is no comparison of pricing for record uploads between the libraries that moved to SkyRiver for cataloging and other libraries that upload records to OCLC. (According to the 2009 annual report, only 12% of records added to WorldCat were added via cataloging on OCLC; the rest were batch loaded.)

Ignorance and Definitions

OCLC plays heavily on the confusion between WorldCat, the database, and the records in libraries' catalogs. This is not an easy concept to grasp, and it is not explained well in the SkyRiver complaint. Wherever SkyRiver's complaint refers to "library records" OCLC counters using "WorldCat" in its place. It makes a huge difference to be talking about the records in a library's catalog vs. the entire WorldCat database. OCLC claims that SkyRiver is demanding that OCLC make all of WorldCat available for free to competitors. What is actually said is:

"Library records should be freely and openly available for use and re-use either in the public domain or by reasonable means of access for all, including for-profit library services firms." (Complaint, para. 76)

But OCLC re-words this in its response as:

"... (a) library records should be free, regardless of OCLC's inestment in aggregating, normalizing, enhancing, maintaing(sic), and delivering services based on them..." (Motion, p. 10)

OCLC also says:

"Plaintiffs pled, at most, only that libraries cannot share OCLC records, not that they are prevented from sharing records they created." (Motion, p. 21)

What is clear here, as it is throughout the motion document, is that SkyRiver is talking about the records that are in library catalogs, and OCLC is talking about "OCLC" or "WorldCat" records. By referring to the records in library catalogs as "OCLC" records, OCLC thus claims ownership to those records. In the former meaning, the libraries are prevented from making use of the records in their catalogs as they wish; in the latter, OCLC is the owner of a database and claims are being made against that database. Unless these definitions are cleared up, the two parties are just talking past each other, and no member of the court is going to make sense of it all. That, of course, would probably be to OCLC's advantage.

Record Use Policy

The original complaint cites the OCLC record use policy as a means by which OCLC maintains

"strict control over its members' access and use of the WorldCat database...". (Complaint, para. 33)

OCLC's motion first complains that SkyRiver did not attach a copy of the Policy to its original filing (but did attach one to its response to the Motion to Transfer). This is irrelevant to the case, I believe, and therefore is a bit of sniping at SkyRiver's lawyers, hinting that they aren't doing a good job. Anyway, here's how OCLC replies to that:

"The nature of these documents is not pled: it is not claimed that these documents are anything other than 'guidelines' OCLC publishes or that OCLC has ever used these documents to prevent a library from providing its catalog records to Plaintiffs or any other entity." (Motion, p. 7)

There's more, but let's first examine this statement. During the big brouhaha about the policy, Karen Calhoun published "Notes on OCLC's updated Record Use Policy" on the OCLC blog, and stated:

"The updated policy is a legal document. Being a player on the Web, working on behalf of libraries, requires that the policy be a legal document."

That is of course the opposite of what is said in the motion.
(See comment below by Jennifer Younger: "The new 2010 policy is correctly characterized in OCLC's Motion to Dismiss as a code of good practice to guide members' choices about how they share their copies of WorldCat records.")

What is sad, however, is the statement, true as far as I know, that OCLC has never used these documents to prevent libraries from sharing their records. It hasn't had to, because the mere threat has been enough to prevent libraries from acting. The libraries that have released their records have done so unscathed, but they are few. There are of course two ways to interpret this: either libraries are afraid to release their records, fearing retribution, or libraries agree with OCLC's argument that WorldCat would be endangered should library records be openly shared.

I'll pause here and take up again shortly.

Monday, December 06, 2010

Online 2010 and SWIB

I'm just back from a lengthy trip that ended at the Semantic Web in Bibliotheken (SWIB) conference (#swib10) in Cologne, Germany, followed by Online Information 2010 in London (#online2010). These are some thoughts from those events.

SWIB

I saw two examples of uses of FRBR that do not follow the structure provided in the FRBR documentation and both made good sense to me.

The Bibliotheque Nationale of France (BNF) is working to export its data in a linked data format. They are linking the Manifestation directly to the Work and to the Expression, rather than following the M -> E -> W order that is defined in FRBR. I need to think about this some more, but it seems to remove some of the rigidity of the linear WEMI.
The Deutsche Nationalbibliothek is using an identifier method that seems to resolve the (long) discussion I instigated on the FRBR list about identifying WEMI with a single identifier. They give an identifier to the single WEMI group (one work, one expression, one manifestation, and presumably one Item, but no one seems to be talking about items.) There is also an identifier for each W, E, M, I. This works well for input and output (and sharing). When a matching W or WE is found, a "merged" identifier is coined for the FRBR units. I couldn't follow the presentation, as it was in German, but from the slides it looked to me that all of these identifiers could co-exist, and therefore would represent different views simultaneously of the bibliographic data that would depend on the function in play (e.g. export of data about a book v. support of shared cataloging).

The key thing that I learned, though, was that there is a plethora of semantic web activities in libraries in Europe. Among these, the British Library has released the National Bibliography (1956-); the BNF will soon make data available, as will the German National Library. What do these libraries have in common? Among other things, their data is not bound by the OCLC record policy, so they are able to make it freely available.

Online 2010

I was the opening speaker on a panel about the Semantic Web at this conference and unfortunately that was the only bit of the conference I was able to attend other than the exhibits. Online Info is a combined publisher/library conference, with the publishing side being primary. At the conference one of the three tracks was "Exploiting Open and Linked Data." In the exhibits the term "semantic" was everywhere. I would like to attend this conference (because I can't really say that I have) to get a view of linked data from another industry's perspective.

My co-speakers were Sarah Bartlett of Talis, and Martin Malmsten from the Swedish National Library. Sarah did something that had never occurred to me, but now I just think "Doh!" it's so obvious. Her talk walked through a literary, rather than bibliographic, view of some library materials. She showed how you could use linked data to support the humanities. It was, as the British say, brilliant. It's also a great way to teach people about linked data, and she advised everyone to come up with something they have a passion for and use it as an exercise in linking. Now I want to come up with some fun linking exercises for teaching purposes.

Martin talked about the motivation for making LIBRIS, the Swedish union catalog, open as linked data since 2008. He and I agreed that we really need a good linked data app that would allow people to explore the linked data space. He quoted Corey Harper saying that the killer app for linked data will probably be created by a 13-year-old, someone for whom the idea of open linking is neither novel nor new. I am really interested to see what the "linked open data" generation comes up with!

Response to JPW

Note: John Price Wilkin of Michigan wrote a post on the Open Knowledge Foundation blog that is very critical of the library linked data movement and the creation of numerous disjoint files of bib data in linked data formats. I admit that it isn't clear to me what he thinks should happen, but it seems to be something like this photo, which I took at the Online 2010 exhibit hall. This is OCLC's booth.

A separate cloud for libraries. Totally the wrong idea.

I must say that I see things quite differently from JPW. Although I agree that a bunch of static bibliographic files do not open library linked data make, my view is:

1) Each file represents a person or group who got interested in transforming library data and went through the learning process of actually doing it. Therefore each file is a contribution to our collective knowledge about linked data. When we add these files to heterogeneous stores like Open Library or Freebase, we exercise that knowledge.

2) These files are the fodder for further experimentation with mixing library data and non-library data, which to me is one of the main points of linked library data. We are in the "training wheels" stage of this change, and like training wheels these early files may end up being discarded when we finally learn to ride. I see no harm in that.

3) This experimentation is taking place primarily outside of the US in places where the OCLC record use policy does not apply. The British Library, the National Library of Sweden, soon the Bibliotheque Nationale, and a handful of German libraries are at the forefront of this. If you cannot release your bibliographic data openly, you cannot participate in the linked data movement.

4) I do think that we will have library systems that make use of a different data format to the one we have today, but those are not the same as linked data, and are definitely not the linked open data that is the main focus of the linked data activity. How we manage our data for ourselves may well be different from how we share it with the world. We do need a well-ordered library data universe where we do our bibliographic work. That should exist in parallel with open sharing that reaches beyond the library cataloging community.

Friday, October 29, 2010

SkyRiver/OCLC suit moved to Ohio court

The judge in the federal district court in San Francisco (the Northern District of California, which sits within the Ninth Circuit) has agreed to OCLC's request to transfer the proceedings in the SkyRiver/OCLC suit to the Southern District of Ohio. In an impressively thoughtful 10-page document, the judge weighs the various arguments by the parties relating to the request to transfer. In the end, the decision was based on two things:

A majority of the potential witnesses that are neither SkyRiver nor OCLC employees (e.g. libraries that can give evidence) are closer to Ohio than to California.
In terms of documentation as evidence, most of this documentation will need to come out of OCLC's file cabinets, since the suit refers to OCLC business practices over a significant period of time.

I was hoping to be able to sit in on some of the action in the San Francisco court, although more experienced folks have told me that it could be deadly dull. Now we need to find possible bloggers in the Ohio area to cover this. Any volunteers?

Tuesday, October 12, 2010

Beyond MARC-up

In the recent Code4lib journal, Jason Thomale has published an article "Interpreting MARC: Where’s the Bibliographic Data?" in which he struggles to find the separate logical elements in a MARC 245 field. I must admit that I'm not entirely clear on what he means by 'bibliographic data' but I empathize with his attempts to find the data in MARC. In his conclusion he says:

... MARC has as much in common with a textual markup language (such as SGML or HTML) as it does with what we might consider to be “structured data.”

I have myself often referred to MARC as a markup language, to distinguish it from what a computer scientist would call "data." We took the catalog card and marked it up so that we could store the text in a machine-readable form and re-create the card format as precisely as possible. Along the way, a few fields (publication date, language, format) were considered in need of being expressed as actual data, and so the fixed fields were designed to hold those. Oddly enough, though, in most cases the same information was available in the text, meaning that the information had to be entered twice: once as text, and once as data.

008 pos. 07/10 = 1984
260 $c 1984

This fact is proof that at one point the MARC developers were fully aware that the text in the variable fields was ill-suited to machine operations other than printing on a card (or display on a screen).
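The duplication can be checked mechanically. What follows is a minimal Python sketch, assuming the 008 field is available as a plain string and the 260 $c subfield as free text. The function names and the sample values are invented for illustration and do not come from any MARC library.

```python
import re


def date1_from_008(field_008):
    """Return Date 1, i.e. character positions 07-10 of an 008 field."""
    return field_008[7:11]


def year_from_260c(subfield_c):
    """Return the first four-digit year found in 260 $c text, or None."""
    match = re.search(r"\d{4}", subfield_c)
    return match.group(0) if match else None


def dates_agree(field_008, subfield_c):
    """True when the coded date and the transcribed date name the same year."""
    return date1_from_008(field_008) == year_from_260c(subfield_c)


# Hypothetical sample values: the same year entered once as data, once as text.
sample_008 = "840514s1984    nyu"
sample_260c = "c1984."
print(dates_agree(sample_008, sample_260c))
```

The regular expression is needed only for the text side: the coded date sits at a fixed offset, while the transcribed date has to be dug out of punctuation and prefixes.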

I have been working off and on for a number of years on an analysis of MARC that is perhaps similar to Thomale's search for the bibliographic data of MARC. I characterize my project as an attempt to define the data elements of the MARC record. The logic goes like this: if we want to create a new, more flexible format for library data, one way to begin that process is to break MARC data up into its data elements. These can then be re-combined using a new data carrier. The converse is that if we cannot break MARC up into its data elements, then any new carrier will surely be saddled with some of the problematic aspects of MARC, such as:

redundancy, especially the repeat of the same content in many different fields
inconsistency, where the content in those different fields is coded differently or with a different level of granularity
potential contradiction between data in fixed fields and textual data

I am still just at the beginning of my analysis, but for anyone who wants to follow along and comment/cajole/criticize, I am doing my thinking out loud on the futurelib wiki. I thought I would start with the 0XX fields, but decided to drop back and start with 007/008. I have a database of all of the 007/008 elements and their values (linked in tab-delimited format on this wiki page), so I've been able to sort and eliminate and do other database-y things that help me see what's there.

I'm not interested in replicating MARC, so I do not want to create something that is one-to-one with MARC fields and subfields. As an example, some fixed field data elements and their values appear more than once in the MARC format, such as the 008 "Government publication" element which is identical in the 008 for books, computer files, maps, continuing resources and visual materials. As far as I'm concerned it is a single data element. On the other hand, an element named "Color" appears in more than one 007 field, but in each case the values that are valid for the data element are different. These then are different data elements.
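That identity rule (same name and same valid values means one data element; same name with different values means different elements) can be sketched in a few lines of Python. The locations and value lists below are abbreviated placeholders made up for the example, not copied from the MARC documentation, and `group_elements` is a hypothetical helper name.

```python
from collections import defaultdict

# (where the element occurs, element name, set of valid values)
# The value sets are illustrative placeholders only.
occurrences = [
    ("008 books", "Government publication", {"f", "s", "l"}),
    ("008 maps", "Government publication", {"f", "s", "l"}),
    ("007 map", "Color", {"a", "c"}),
    ("007 motion picture", "Color", {"b", "c", "h"}),
]


def group_elements(occurrences):
    """Group occurrences that share both a name and a set of valid values."""
    groups = defaultdict(list)
    for location, name, values in occurrences:
        # frozenset makes the value list usable as part of a dictionary key
        groups[(name, frozenset(values))].append(location)
    return dict(groups)


for (name, values), locations in group_elements(occurrences).items():
    print(name, sorted(values), locations)
```

With these placeholder values the two "Government publication" occurrences collapse into one element, while the two "Color" occurrences stay separate.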

I am struggling with how to create usable output from my investigations. I may code some things in the Open Metadata Registry, but at the moment that would have to be done by hand and I need something more automated. I would like to represent the controlled lists in the fixed fields in an RDF-compatible way using SKOS. This should be relatively simple once certain decisions are made (naming, URIs, etc.).
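As a rough idea of what the SKOS output might look like, here is a small Python sketch that writes one controlled list as Turtle. The base URI under example.org, the codes, the labels and the function name are all placeholders chosen for the example; the naming and URI decisions mentioned above are exactly what this sketch leaves open.

```python
def skos_concepts(scheme_uri, values):
    """Render a code-to-label mapping as a SKOS concept scheme in Turtle.

    Labels are assumed to contain no double quotes or backslashes.
    """
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{scheme_uri}> a skos:ConceptScheme .",
    ]
    for code, label in sorted(values.items()):
        lines.append("")
        lines.append(f"<{scheme_uri}/{code}> a skos:Concept ;")
        lines.append(f'    skos:notation "{code}" ;')
        lines.append(f'    skos:prefLabel "{label}"@en ;')
        lines.append(f"    skos:inScheme <{scheme_uri}> .")
    return "\n".join(lines)


# Hypothetical scheme URI and placeholder values for one fixed-field list.
turtle = skos_concepts(
    "http://example.org/marc/007microform05",
    {"a": "Placeholder label A", "b": "Placeholder label B"},
)
print(turtle)
```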

A big question is how to link all of this back to MARC. For the fixed fields it's relatively easy to create a string that represents the MARC origins of the data, for example:

007microform05 to represent the data element (field 007, category of material Microform, position 05)
007microform05f to represent the actual value (field 007, category of material Microform, position 05, value=f)
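The two strings above follow a simple pattern: field tag, category of material, two-digit position, and optionally the value. A minimal Python sketch of that pattern, with function names that are my own invention:

```python
def element_id(field, category, position):
    """Build the string for a fixed-field data element, e.g. 007microform05."""
    return f"{field}{category.lower()}{position:02d}"


def value_id(field, category, position, value):
    """Build the string for one value of that data element, e.g. 007microform05f."""
    return element_id(field, category, position) + value


print(element_id("007", "Microform", 5))
print(value_id("007", "Microform", 5, "f"))
```

Elements that span a range of positions would need a convention of their own; this sketch handles only the single-position case shown in the two examples.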

When it comes to the variable fields this is going to be more difficult because, as Thomale points out in his article, a logical element may span more than one field/subfield, and there may also be multiple elements in a single subfield. Working that out is going to be very, very difficult. So it seems best to go for the low-hanging fruit of the fixed fields.

Note that there have been other good starts at defining the MARC fixed fields in SKOS, and eventually we may be able to bring this all together. Meanwhile, I did grab marc21.info for the URI portion of this work and obviously am working toward dereferenceable URIs.

Friday, September 10, 2010

Libraries, FOAF, and community

Note: this is being posted simultaneously on two blogs: Metadata Matters and Coyle's InFormation

“Why don’t libraries just use FOAF for their Person metadata? Why do they insist on creating their own?”

We don’t know how many times we have heard this on various lists. It often is not really posed as a question; in other words, it isn’t asking for an explanation of why libraries do not choose to use FOAF. It’s more rhetorical, along the lines of “Why can’t we all just get along?” But it is worthy of being asked as a real question, and of getting a real answer.

[Note first that the question of FOAF comes up not so much as we consider the current library standards, but in discussions of upcoming standards that will hopefully be based on the FR** family of standards (FRBR, FRAD, FRSAR). ]

A comparison of FOAF Person and the library Person entity (either in MARC authority files, or RDA, or FRAD) shows that there is not one defined element (or “property” as it is called in Semantic Web-ese) that the two have in common. This is not a coincidence; the two vocabularies serve significantly different communities and purposes. This does not mean that they are irreconcilable; the question therefore becomes: what keeps them apart, and can that be overcome?

The key is in the nature of the two communities.

FOAF stands for ‘Friend of a Friend’, which is a clue to its context: the schema is primarily for use in social networking situations. Its focus is on people who are alive and online, and it includes online contact information like email addresses, web sites, work web sites, Facebook IDs, Skype IDs, etc. The name of the person in FOAF is not an identifier, but presumes that the name of the person plus one or more of the contact IDs is enough to distinguish most humans from one another.

Library name data (which is a form of controlled vocabulary, called “name authority data” in library terms) is focused on creating a unique identifier that brings together the different forms of a name used in published materials under one form. Library users, therefore, can expect to find all of the works by or about a named person under a single entry regardless of the various forms of the name that exist in real data. Uniqueness of names is enforced by adding information to a non-unique name, usually the year of birth, but when that isn’t known (especially for persons of antiquity) titles or even areas of endeavor (“poet”) can be added.

To accommodate both the FOAF (social) function and the libraries’ identification function, at the very least the libraries would need to define a sub-property of FOAF Person, one that has a more strict definition and usage. However, for the library “Person” to be designated as more specific than FOAF:Person does not require that these two be in the same vocabulary. That is one of the important features of Semantic Web properties: like any other resource, they can be linked and related to any other resources on the Web.
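To make the cross-vocabulary link concrete, here is a minimal Python sketch that states the relationship as a single triple and prints it in N-Triples form. It assumes the relationship is expressed with rdfs:subClassOf, since foaf:Person is defined as a class. The FOAF and RDFS namespaces are the published ones; the library namespace under example.org is purely hypothetical.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
LIB = "http://example.org/library/"  # hypothetical library vocabulary

# One statement: the library's Person is a narrower kind of foaf:Person.
# The subject and object live in different vocabularies.
triples = [
    (LIB + "Person", RDFS + "subClassOf", FOAF + "Person"),
]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) URI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


print(to_ntriples(triples))
```

Nothing in the statement requires FOAF to change or the library vocabulary to be merged into it; each side keeps its own namespace and its own maintenance process.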

Why not combine the library and FOAF properties into a single metadata vocabulary? The answer has little to do with technology, but instead relates to the functioning of communities. Metadata standards need to be developed by (and for) actual communities. The FOAF and library communities clearly have different needs, different goals, and are working with fundamentally different use cases. They also are significantly different as communities.

FOAF is being developed by an informal group of developers, and is quite recent in origin. The group is small: the FOAF development email list has about 350 members. Another 350 individuals are listed on the FOAF wiki pages as having a FOAF profile available on the Web. This is obviously not the full extent of FOAF usage, but these numbers reflect the recent development of this kind of metadata.

The library community has hundreds of years of investment in the creation of metadata (even though it was not called that when libraries began to create it). There are at least tens of thousands of libraries in the world, many of which have been in existence for centuries. Library data has its origins in early 19th century book catalogs but has been created in a machine-readable format since the late 1960s. Library data is created following formal rules governed in part by international agreements, and there are many hundreds of millions of machine-readable bibliographic records in existence that were created based on these library cataloging principles.

Libraries have engaged in widespread data sharing for centuries, and with the global networking capabilities of today libraries are actually able to exchange and re-use data on a huge scale. Libraries do not each create metadata for the same book or item, but instead share the metadata created by one library in cooperative efforts oriented towards resource sharing and efficiency.

This sharing is built into the very core of library data management. The ability to use data created by others is supported by standards and those standards form the basis for the library systems. While most users see only the library catalog available to the public, that is only one function of a system that supports purchasing, fund accounting, inventory control, circulation and patron management, and collection analysis. In the Western world these systems are not created and maintained by libraries but by a small number of specialized commercial vendors whose products are specifically created for the library customers using agreed library standards. Thus the very same system can be sold to hundreds or thousands of libraries, creating a viable market base for system development.

A number of the 70,000 libraries contributing to OCLC are using a single standard, MARC21, and others are following international standards such as ISBD, which produces standardized bibliographic description. The development of these standards is based on a large-scale community process with international participation. It is not a perfect process by any means, and clearly must be updated to meet modern needs and new technologies that have changed the way we work, but the degree of data sharing libraries depend on requires that a formal process be in place to support the standards of this community.

Sharing of data on a large scale is necessitated by the economic reality of the library sector. Libraries face increasingly shrinking budgets while coping with an upswing in demand for their services. Realistically, this means that changes to library data must be carefully coordinated in order to minimize disruption to the complex network of data sharing that makes cost-effective library services management, based on this data, possible. Libraries may appear to be mistrustful of change agents, and in some cases they certainly are, but there is a real need to minimize risk for the community as a whole in order to assure the health of these often financially fragile institutions.

So we come back to the question of libraries and FOAF. In the final analysis, we’re not at all sure that there’s much gain in trying to combine these two approaches, with the differences in their communities and functions. It could be like trying to combine oil and water, requiring compromises that in the end would be less than satisfactory for both communities. One could argue that the difference between the vocabularies and their contexts is a positive, allowing more than one view of the Person entity. As two separately maintained metadata vocabularies, anyone creating metadata can choose from either as needed without sacrificing precision. One can also imagine other views that will arise, such as Persons in medical data or financial data, which would each carry data elements that are neither in FOAF nor library data, from blood type to bank balance. The important thing is to make sure that these vocabularies are properly described and related to each other where possible. That way, each community can manage its own process based on its needs for standards integration, but data can be shared where appropriate.

We could begin with a more detailed discussion between the FOAF and the library communities about their metadata needs. With hundreds of years of experience in representing names in library catalogs, we feel confident that the library community’s knowledge could contribute in general to the use of personal names in the Semantic Web.

Tuesday, August 17, 2010

OCLC, SkyRiver, and the slow arm of the law

I suppose one could be gratified to learn that there are institutions that move at least as slowly as libraries, but I'm not happy about the delayed gratification that entails, nor the fact that it means that we will have to try to move forward as a community without having answers for quite a while.

The recent documents that have been filed with the court in the SkyRiver/OCLC case have the following actions and dates in them:

First, OCLC will request that the suit be moved from Northern California to the Southern District of Ohio. Just to cover that motion will take us through October, 2010.

If that does not derail the current calendar (and I presume it could cause this date to be moved back), then the Case Management Conference will be on January 14, 2011 in the San Francisco courtroom.

No, I have no idea what a "case management conference" is but it sounds like something preliminary. I would love it if someone with a legal background could offer some occasional commentary on what some of these steps mean. Right now I presume that all of this is par for the course for lawsuits of this nature, but never having observed such a case before, I really have no idea. Anyone know some law librarians who can chime in?

Friday, July 30, 2010

SkyRiver/III v. OCLC, Part II

In my previous post I covered what I saw as the stronger arguments made in the complaint. In this post I will cover points that either puzzled me or seemed to be off the mark.

The OCLC Number
The complaint states that

"This OCLC number has permitted OCLC to police its members to ensure that their records are not shared with unauthorized users." (p. 5)

Since anyone can add or delete an OCLC number from a MARC record in their own database, I don't see how this could be the case. I would like to see how this claim is supported.
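To make the point concrete, here is a minimal sketch of how little it takes to remove or add the number. It assumes a record held as a plain Python dictionary (not any real ILS or MARC library), and it relies on the common convention that OCLC control numbers are carried in field 035 with an "(OCoLC)" prefix. The record contents, the function names, and the number itself are made up for illustration.

```python
OCLC_PREFIX = "(OCoLC)"

def strip_oclc_numbers(record):
    """Return a copy of the record without any 035 values carrying the OCLC prefix."""
    cleaned = dict(record)
    kept = [v for v in record.get("035", []) if not v.startswith(OCLC_PREFIX)]
    if kept:
        cleaned["035"] = kept
    else:
        cleaned.pop("035", None)
    return cleaned

def add_oclc_number(record, number):
    """Return a copy of the record with an OCLC-style 035 value appended."""
    updated = dict(record)
    updated["035"] = list(record.get("035", [])) + [f"{OCLC_PREFIX}{number}"]
    return updated

# Hypothetical record: tag -> list of field values.
record = {"245": ["Mort"], "035": ["(OCoLC)00000001", "(LOCAL)abc123"]}
without = strip_oclc_numbers(record)
restored = add_oclc_number(without, "00000001")
print(without["035"])
print(restored["035"])
```

Nothing in the record itself resists the edit, which is why I would like to see how the policing claim is supported.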

The ILS Market

"OCLC is rapidly gaining market share in the ILS market by leveraging its monopoly power over its bibliographic database... " (p. 6)

Can they supply the figures to support this rapid gain in market share? They do state the number of WorldCat Local installations ("624", p. 22), but WCL is not an ILS (even though it may eventually become the basis for one).

Academic Libraries only
The complaint appears to only address academic libraries. (p.7) This could be because the evidence that they claim to have only relates to academic libraries, but both OCLC and III serve many public libraries. The complaint also states that:

"The relevant geographic market ... is the United States, because academic libraries cannot turn to suppliers of these products in other countries to meet their needs." (p. 10)

This may just be poorly worded, but if it intends to mean that there are no extra-US companies providing the service then it should have said so. The way it is worded it sounds like there are prohibitions on using non-US suppliers that pertain to academic libraries... could that be so?

New Products
In numerous places in the document, the complaint states that OCLC members are required to participate in product development as part of their membership obligation:

"Membership also obligates libraries to assist OCLC in developing new products and services to compete with for-profit firms." (p. 5)

"OCLC developed, and is still developing, WorldCat Local and WorldCat Local "quick start" through pilot programs in which many of its member university libraries have agreed to participate, without compensation, purportedly to meet the requirements of their membership in OCLC." (p. 20)

I have never heard of this requirement, and would be interested in hearing from institutions who did find themselves essentially forced to participate in pilots as part of their membership.

Acquisition of Other Companies
The complaint states that over time OCLC has expanded by acquiring 19 library industry companies, 14 of which were for-profit. (They fail to mention that at least some of those companies magically became non-profit when acquired by OCLC, cf. netLibrary.) The remainder of the sentence reads:

"... either to obtain software and other products that enable it to offer library services in competition with the remaining for-profit providers or simply to eliminate products from the marketplace." (p. 23)

These are strong words that the complainants should be prepared to prove. I'm not saying that it isn't true. However, in the few cases of which I am aware (WLN, netLibrary, RLG) the acquired company was in financial free-fall and OCLC's purchase was viewed at the time as a rescue that benefited the library community as a whole. In the case of netLibrary, OCLC had agreed to be the escrow agent for the ebooks purchased by libraries, to be called upon should netLibrary go out of business. In that case, OCLC was pretty much pre-obligated to rescue netLibrary or provide some service of its own. (I don't know what the monetary arrangements of the escrow were.) As for WLN and RLG, it's hard to know what would have happened if OCLC hadn't purchased those agencies. I suspect that the libraries using those services would have had to become OCLC members in any case in order to continue functioning as libraries. This only covers three of the 19, and may or may not be representative of OCLC's acquisitions.

[Partial list of acquisitions, gleaned from press releases and annual report:
Dewey Decimal System (1988), Information Dimensions (1993) [sold in 1997], Public Affairs Information Service/PAIS (1999), WLN (1998), netLibrary (2002, with MetaText eTextbook Division, a for-profit subsidiary), Openly Informatics (2006, OpenURL services), RLG (2006), EZproxy (2008), Amlib (2008, Australian web-based ILS), PICA (1997), Fretwell-Downing (2005), Sisis Information Systems (2005). Note: these may not be the same companies referred to in the complaint. This is my cobbled together list, and should only be seen as such.]

Head-hunting
Another strange statement is about OCLC's use of head-hunters to hire staff away from other companies:

"In addition to acquiring for-profit companies, OCLC also uses headhunters to identify and recruit employees from for-profit firms. Plaintiffs are informed and believe and based thereon allege that OCLC is using its tax-free dollars to recruit employees of for-profit vendors of library services to eliminate competition and extend OCLC's monopoly to the ILS market." (p. 26)

There's obviously a story here, but I don't know what it is. Using headhunters is standard industry practice for a well-heeled high-tech organization. Has OCLC engaged in predatory hiring behavior? And can that allegation be proved?

Access to WorldCat
The strangest thing in this complaint is the repeated insistence that OCLC should give access to the WorldCat database to potential competitors.

"...As a result of OCLC's conduct... Innovative [and SkyRiver, in another paragraph] has suffered and will continue to suffer irreparable harm ... unless this Court orders defendant OCLC to provide access to the WorldCat database to Innovative and other competitors, on such terms as are just and reasonable." (p. 31; same but ref. to SkyRiver p. 29)

This argument comes as a surprise to me. I had always assumed that the goal was to allow libraries to provide their bibliographic records freely to anyone they wished, including for-profit companies. I see that as very different from giving competitors direct access to WorldCat. It seems to me that the former goal would be very easy to argue, but direct access to OCLC's own database seems much more difficult to justify. I'm quite puzzled by this, unless I am drawing the wrong conclusion about what it means.

There's a part of me that wants this to go to court so that we can get answers to these intriguing questions. There's another part of me that sees the possibility that this could be a lose-lose proposition. Given the overall stress in the library community, both monetary and technological, in-fighting looks to be the worst thing we could do to ourselves.

There is no doubt that a large, union catalog of library holdings is key to providing the kind of web-scale (sorry, but I couldn't think of another word) services that libraries absolutely must provide today. That said, that database does not have to be WorldCat, although WorldCat performs that function at this moment in time. The main thing is that we must have a union/universal catalog that serves libraries and their users. It shouldn't be a limited access asset that is being fought over for market share. I don't have a solution to offer, but it's clear to me that the solution is: FREE THE DATA.

SkyRiver/III v. OCLC: the lawsuit

I have now had a chance to read the legal complaint that SkyRiver/III have filed against OCLC. Marshall Breeding does a good overview of the complaint in a Library Journal piece. I'm going to focus on highlights and lowlights, what I think works and what I think doesn't. The caveat is that I do not know enough about anti-trust law to understand whether the suit is convincing on that score. So what follows is my reading of the complaint today, and I welcome corrections, other views, and any commentary.

Smoking Guns

The complaint has what I see as two smoking guns:

the use of differential pricing to specifically prevent OCLC members from becoming SkyRiver customers
the claim that OCLC paid cash "inducements" to university officials and paid for "luxury trips to expensive resorts to obtain their commitments to promote OCLC products..." (p. 21)

Both of these are extremely damaging to OCLC if they are true. The latter is possibly not illegal on OCLC's part, although it may have been illegal on the part of the officials who accepted such favors in exchange for a contract with OCLC. This, however, should come to the attention of OCLC's members, who, if this is proven to be true, will undoubtedly find this activity unacceptable for their organization.

The arguments about differential pricing are less sensational but could be equally damaging. Differential pricing is a normal practice in business, often based on concrete aspects like volume of trade or length of contract. Whether or not it is normal for a non-profit I don't know. Member libraries have accepted that each one forges a contract with OCLC which is considered confidential (although I suspect that librarians talk informally with each other about what they pay to OCLC). SkyRiver/III claims to have proof that OCLC has used this differential pricing to punish libraries that have moved their cataloging activity from OCLC to SkyRiver. (The MSU case, as one example.) They also claim to have proof that OCLC lowered cataloging charges for some libraries that were intending to move to SkyRiver, and thus kept them as customers. (See pp. 14-19) This alone may not be illegal, but in this complaint it is described as an unfair use of OCLC's current monopoly position on cataloging services.

[Note: There appear to be more libraries that batch load their records into OCLC than ones that catalog on OCLC. In the 2008/2009 annual report, OCLC states that it has 11,810 member libraries, and 72,035 participating libraries. (I'm not sure of the difference.) In that same time frame, "the number of items cataloged by batch loading increased to 241.8 million, up from 212.1 the previous year...." They also state (p.2) that the total of cataloged records plus batch loaded records was 278.3 million, meaning that batch loading accounted for 87% of the records added to OCLC that fiscal year.]
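For anyone who wants to check the 87% figure, the arithmetic is short. This is just a sketch using the two numbers quoted from the annual report, and it assumes that both are in millions of records.

```python
# Figures quoted from the OCLC 2008/2009 annual report.
# Assumption: both numbers are in millions of records.
batch_loaded = 241.8
total_added = 278.3

batch_share = batch_loaded / total_added
cataloged_directly = total_added - batch_loaded

print(f"batch loaded share: {batch_share:.1%}")
print(f"cataloged directly: {cataloged_directly:.1f} million")
```

The share comes out just under 87%, which leaves roughly 36.5 million records that were cataloged directly rather than batch loaded.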

Solid Arguments

The complaint has a number of solid arguments about OCLC's behavior that may be significant should this go to court. Briefly, these are:

OCLC does not act like a non-profit or a cooperative. Throughout the document the complaint uses terms like "purported member-based cooperative" when referring to OCLC. In particular, it says:

"Plaintiffs are informed and believe and based thereon allege that OCLC is not a true cooperative in that its members do not share its revenues or control its management, operations or policies. A majority of its Board of Trustees is elected by the Board itself. ... Rather than operating with transparency as a cooperative would be expected to do, OCLC charges different prices to its members for the same services and conceals those differences from its members." p. 5

The complaint also speaks to OCLC's revenue:

"An insignificant percentage of OCLC's revenues come from membership, grants or charitable contributions." (p. 26)

This is followed by a table of revenues, expenses and corporate equity (in 9-digit figures).

It isn't clear to me that this is a convincing argument. Non-profits are not required to obtain their revenue through contributions, and there are probably many non-profits that receive considerable income from services. Perhaps OCLC's "mix" of revenues is off the normal curve? That's data that would be interesting to see. However, the degree of competitive behavior against for-profit companies does seem to belie the nonprofit status of the organization.

OCLC competes directly with for-profit companies. This argument is for a large part about OCLC's entry into the ILS market with its web-based services, but also relates to its inter-library loan (ILL) services, which compete with III's ILL. The main thrust, though, is that OCLC has announced that it will go into direct competition with the primary services of commercial vendors who serve the library market with library systems. The argument is that as a non-profit OCLC has an unfair advantage because it does not pay the federal taxes that are required of its for-profit competitors. Repeatedly the complaint refers to OCLC's "tax-free profits." (see p. 2, 9, 21)

OCLC is a monopoly, and is taking advantage of its monopoly position. I believe that the unfair use of a monopoly position is essential to the anti-trust aspect of this lawsuit. I also believe that this is a point that is hard to prove. To begin with, there is nothing illegal about having a monopoly position in a market if one has acquired that position with normal dealing. And some of the accusations in the complaint may not be anything other than regular business practices, such as providing some services for free (WorldCat Local quickstart, as an example) as a way to induce customers to buy into for-fee services, or to reward customers for their loyalty. The use of pricing to make it financially untenable for its own customers to contract for non-OCLC services is probably the most damaging argument in this area.

OCLC has used its position to avoid the public procurement process. As we know, most public institutions have to go through a cumbersome process in order to procure goods and services. This process is designed to make sure that public money is spent fairly and under controlled conditions that are designed to minimize corruption. The complaint claims that OCLC has obtained contracts for WorldCat Local with public institutions without going through that procurement process. (p. 20)

Trustees are also members. There is a claim of conflict of interest in the fact that high-level employees of OCLC member institutions also sit on OCLC's board. What isn't mentioned here, oddly enough, is that some of those members draw salaries from OCLC (in addition to the salaries received from their institutions -- see any recent IRS 990 form from OCLC, which lists salaried officers). The conflict of interest is that these same individuals may have decision-making roles in their institution for the purchase of library vendor services. "By agreeing to advance the interests and products of OCLC they are effectively excluding competitors." (p. 27) This may be an issue for OCLC, but it seems that it should also be an issue for the institutions that employ these folks.

Coming next: Some odd claims, and some misses

Thursday, July 29, 2010

SkyRiver Sues OCLC over Anti-Trust

(Full document now here! Thanks Marshall Breeding!)

The newly created competitor to OCLC's cataloging services, SkyRiver, is suing OCLC in federal court in San Francisco. (Press release, PDF) I have only seen the press release, so until someone figures out how to free up the actual legal document, what we know is:

SkyRiver is claiming that OCLC is attempting to "monopolize the markets for cataloging services, interlibrary lending, and bibliographic data, and attempting to monopolize the market for integrated library systems, by anticompetitive and exclusionary practices." The press release refers to OCLC's "tax-free profits," and says that OCLC has used those profits to purchase 14 for-profit companies.

The press release quotes Leslie Straus, President of SkyRiver, as saying:

“In the process OCLC has punished its own members who have tried to seek out lower cost alternatives like SkyRiver.”

Which undoubtedly refers to the Michigan State issue, which I reported on here. In that case, OCLC appears to have charged MSU an unusually large fee for uploading records to WorldCat after MSU began cataloging on SkyRiver instead of OCLC.

Undoubtedly, a good part of the concern here is over OCLC's plans to provide Web services that comprise the full functionality of an integrated library system (ILS), thus competing with current ILS vendors. You probably know that SkyRiver was started by Jerry Kline, owner of Innovative Interfaces. If OCLC successfully launches a full-service option for libraries, Innovative and other ILS's will suffer. As the representative of a major ILS company explained to me a few years ago, the library market is a zero-sum game: every time one vendor wins, others must lose, because the number of customers is not growing. The library market is a pie that can be divided into any number of slices, but the pie remains the same. This makes the rise of any one company a threat to all. In the commercial marketplace, the vendors compete over functionality and price. With its non-profit status OCLC has a distinct advantage: it doesn't pay federal income tax on the revenues it brings in. That said, given its size and depth of its involvement in day-to-day library operations, it is plausible that even without its non-profit status OCLC would be a formidable competitor for ILS vendors.

I cannot comment on the charges of anti-trust because the press release does not give enough information. Hopefully we will get more details about this suit in the near future.

Sunday, July 04, 2010

Catching up: OCLC, GBS, LOD

Some short comments on recurring themes:

OCLC Record Use Policy

OCLC has finalized its record use policy. The content is substantially the same as it was in the previous draft, which I commented on. There is one important improvement, however: the text clarifies OCLC's claims to copyright.

While, on behalf of its members, OCLC claims copyright rights in WorldCat as a compilation, it does not claim copyright ownership of individual records.

Of course, claiming copyright and actually having the right are not the same thing, especially with databases. Here's what BitLaw says:

Databases as Compilations: Databases are generally protected by copyright law as compilations. Under the Copyright Act, a compilation is defined as a "collection and assembling of preexisting materials or of data that are selected in such a way that the resulting work as a whole constitutes an original work of authorship." 17 U.S.C. § 101.

Generally, carefully selected compilations may make the "original work of authorship" cut; I'm not convinced that a union catalog of library holdings does.

Google Books

We are still waiting to hear from the judge in the Google Books case. (Every time I write that I check to see if it hasn't been released in the last hour.) Meanwhile, GBS continues to function in Internet time. Google has many publishers on board with its partners program, enough that GBS is becoming a serious rival to Amazon. It has even announced that it will begin selling e-books. The opening screen is the exact opposite of the Google Search screen -- it loads up many dozens of book covers and requires significant scrolling to browse to the bottom. Google has added personalization options ("my library") and lets you create multiple "shelves" to organize your materials.

Google was first sued in 2005. Five years is a very long time where technology is concerned. In 2005 the ebook was considered dead; now with the Kindle and the iPad, ebooks are alive and well and everyone is trying to get into that game. In that time since 2005, Google has pretty much shown the publishing industry that they can benefit from the online presence that Google is providing. The settlement reads like it was written in another era, trying to solve problems that may not really be considered problems today. The only issue remaining is that of orphan works, and if we could do a decent analysis of copyright holdings, I suspect that the number of orphan works would not be all that large.

Linked Library Data

At ALA there was a one-day preconference on linked data, and a half day un-conference attended by about 50 people. There are notes from the un-conference, which broke out barcamp-style into 6 groups for discussion.

The World Wide Web Consortium has an incubator group on linked library data. This group is tasked to spend one year figuring out how to jump-start the creation of linked data in the library world.

There are ongoing efforts at Library of Congress to produce vocabularies, and of course the RDA vocabularies are available (and almost finalized). Ross Singer has announced that some of the MARC codes are available (I presume on his own site). FRBR is being defined in linked data form by IFLA.

We've got just about everything but ... linked data. I'm thrilled that things are moving forward, but frustrated that I still can't see usable results. Deep breath; patience.

Wednesday, May 26, 2010

FRBR and Sharability

One of the possible advantages to using FRBR as a bibliographic model is that it can provide us with sharable bits in the form of the defined entities. I've been working on creating a test set of records to illustrate some linked data concepts, and so I began thinking about how the data would break out into sharable units. It turns out to be... an interesting question.

Work

Let's start with the Work, which I believe many people have high hopes for. I have a book in hand which I will use for this illustration. Because this is a book, there are only a few possible data elements in the Work, and these are:

Title of the work: Mort
Preferred title for the work: Mort
Date of work: 1987
Place of origin of the work: England, UK

As you can see, there isn't a lot of information in the Work entity itself. In many cases, a cataloger will not know the date of the work, and may not know where the work was written, in which case you could have just title, and the entire Work entity would be:

Title of the work: Mort

What is obviously missing here is the name of the author. That, however, is not an attribute of the Work in FRBR, but is an entity of its own, either Person, Corporate Body, or Family. It seems clear that without the name of the creator (where appropriate) the Work isn't terribly useful on its own. So I am going to add that creator from FRBR Group 2:

Work:
Title of the work: Mort
Preferred title for the work: Mort
Date of work: 1987
Place of origin of the work: England, UK

Person:
Author: Terry Pratchett

OK, now we are getting somewhere. We have an author and a title. This is a "unit" that someone could grab or link to and make use of. They aren't really separable, which is what puzzles me a bit about FRBR. It's not like you could re-use this Work for another book with the same title (and there are others with this same title). It's only the Work by Terry Pratchett that this Work entity can represent. As far as I am concerned, the creator entity and the work entity are inseparable in the description of a work. A creator can be associated with many works, but Work cannot be re-used with different creators. Once the creator(s) of the Work are defined, that relationship is fixed as part of the identity of the Work.

We could leave Work as it is here, but if you want to include subject headings in your sharing, they need to be included in the shared Work, because subject headings in FRBR are only associated with the Work. Given that, our sharable Work becomes:

Work:
Title of the work: Mort
Preferred title for the work: Mort
Date of work: 1987
Place of origin of the work: England, UK

Person:
Author: Terry Pratchett

Subject:
Topic: Fantasy fiction, English
Topic: Discworld (Imaginary place) -- Fiction

This is the unit that needs to be created so we can share Works.
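As a rough illustration of that unit, here is a minimal sketch in Python. The class and field names are my own shorthand, not FRBR or RDA element names, and the rule in is_sharable (a title plus at least one creator) is an assumption drawn from the argument above, not something FRBR itself states.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class Person:
    name: str

@dataclass(frozen=True)
class Subject:
    topic: str

@dataclass
class Work:
    title: str
    preferred_title: str
    date: Optional[str] = None             # often unknown to the cataloger
    place_of_origin: Optional[str] = None  # often unknown as well
    creators: List[Person] = field(default_factory=list)   # FRBR Group 2
    subjects: List[Subject] = field(default_factory=list)  # FRBR Group 3

    def is_sharable(self) -> bool:
        # Assumption: a Work is only useful to share with a title and a creator.
        return bool(self.title) and bool(self.creators)

bare = Work(title="Mort", preferred_title="Mort")
mort = Work(
    title="Mort",
    preferred_title="Mort",
    date="1987",
    place_of_origin="England, UK",
    creators=[Person("Terry Pratchett")],
    subjects=[
        Subject("Fantasy fiction, English"),
        Subject("Discworld (Imaginary place) -- Fiction"),
    ],
)
print(bare.is_sharable(), mort.is_sharable())
```

The title-only Work fails the check, while the Work bundled with its creator and subjects passes.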

Expression

Now let's move on to the Expression, the real bugbear of FRBR. For books, Expression has few data elements. In this case we have:

Date of expression: 1987
Language of expression: English

All perfectly fine and well, but clearly not something that can stand alone. Similar to Work, this expression is not usable with just any English language work written in 1987 -- it's not sharable in that sense. This Expression must be associated irrevocably with a particular Work, in this case the Work we created above. There will be some link that essentially says:

E:identifier --> expresses --> W:identifier

Second thought: Expression can also have an important creator/agent role, such as translator, editor, adaptor -- and possibly others related to music that I'm not knowledgeable about -- so it, too, should include those for sharing. In fact, probably all of the Group2 to Group1 relationships need to be included in a sharing situation. So we get:

Expression
Date of expression: 1987
Language of expression: French

Person
Translator: J-P Sartre

The unit of sharing here must be the expanded Expression plus the expanded Work (with Group2 and Group3 entities). This illustrates something that has bothered me a bit about the Group1 FRBR entities, which is the dependency inherent in the hierarchy WEMI. WEMI essentially must be created as a single thing with multiple parts. This is true even of the Manifestation.
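The dependency can be sketched in a few lines. This is only an illustration, assuming simple in-memory dictionaries and made-up identifiers like "W:1" and "E:1"; the point is that an Expression cannot be created unless the Work it expresses already exists.

```python
# Hypothetical in-memory stores keyed by made-up identifiers.
works = {"W:1": {"title": "Mort", "creators": ["Terry Pratchett"]}}
expressions = {}

def add_expression(expr_id, work_id, language, date, contributors=None):
    """Create an Expression, refusing if the Work it expresses is unknown."""
    if work_id not in works:
        raise KeyError(f"cannot create {expr_id}: no Work {work_id}")
    expressions[expr_id] = {
        "expresses": work_id,                 # E:identifier --> expresses --> W:identifier
        "language": language,
        "date": date,
        "contributors": contributors or [],   # translator, editor, adaptor, ...
    }
    return expressions[expr_id]

add_expression("E:1", "W:1", "English", "1987")

try:
    add_expression("E:2", "W:99", "English", "1987")
except KeyError as err:
    print("rejected:", err)
```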

Manifestation

The Manifestation is seemingly the richest and therefore the most independent of the FRBR Group1 entities, but as we'll see, without the Work and Expression you do not get a useful set of data elements. Here is what we have for our Manifestation:

Title proper: Mort
Statement of responsibility: Terry Pratchett
Title proper of series: Discworld
Date of publication: 2001
Copyright date: 1987
Place of publication: New York, NY
Publisher's name: HarperTorch
Extent of text: 243 pages
Dimensions: 17 cm
Carrier type: volume
Mode of issuance: single unit
Media type: unmediated

What is lacking here? Well, there's no link to the entity for the author, which would provide an identification of the author and any variant forms of the author's name. There's no language of text, because that's in the Expression. And there are no subject headings, because those are associated with the Work. If this were a translation, there would be no link to the Work in the original title. The Manifestation entity is very readable, but if we are sharing for the purposes of copy cataloging, it has to be bundled with the Work and Expression to be usable.

Our Sharable Units

So this is what we get as sharable units:

Work + Group 2 (creator) + Group 3 (subject)
Expression + Group 2 (creator) + Work + Group 2 (creator) + Group 3 (subject)
Manifestation + Expression + Group 2 (creator) + Work + Group 2 (creator) + Group 3 (subject)

With these three, it will be possible to build on Works and Expressions as needed, creating new Expressions and Manifestations for a Work. It will also be possible to "grab" a Manifestation and along with it get a full description including subjects and creators.
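Here is one way that "grab" might look, as a minimal sketch rather than a proposal for a real system. It assumes each entity lives in its own dictionary and points upward by identifier. The identifiers and the link names "embodies" and "expresses" are my own labels for illustration.

```python
# Hypothetical stores; the data values come from the example above.
WORKS = {
    "W1": {
        "title": "Mort",
        "creators": ["Terry Pratchett"],
        "subjects": ["Fantasy fiction, English",
                     "Discworld (Imaginary place) -- Fiction"],
    }
}
EXPRESSIONS = {
    "E1": {"language": "English", "date": "1987",
           "contributors": [], "expresses": "W1"},
}
MANIFESTATIONS = {
    "M1": {"title_proper": "Mort", "publisher": "HarperTorch",
           "date_of_publication": "2001", "embodies": "E1"},
}

def sharable_unit(manifestation_id):
    """Follow the links upward and return Manifestation + Expression + Work together."""
    manifestation = MANIFESTATIONS[manifestation_id]
    expression = EXPRESSIONS[manifestation["embodies"]]
    work = WORKS[expression["expresses"]]
    return {"manifestation": manifestation, "expression": expression, "work": work}

unit = sharable_unit("M1")
print(unit["work"]["creators"], unit["expression"]["language"])
```

A copy cataloger who picks up M1 gets the language, the creator, and the subjects along with it, none of which are in the Manifestation itself.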

Now we just need a system to test this out.

Monday, May 03, 2010

Bib data and the Semantic Web

I know that I've gone on and on about transforming bibliographic data into a semantic web format. And whenever folks have asked me: "What will it look like?" I haven't had a good response. Now there is something to show you: Freebase.

Freebase is a database of interlinked semantic web "statements": essentially what the SemWeb types call "triples." The statements come from a variety of open data sources such as Wikipedia, TVDB.com, a science fiction fan database, and Open Library. By placing a user interface over these data they now have a searchable, navigable site that can link books to movies to (theoretically) music to science to... well, anything where linked data is available.
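If "triples" is an unfamiliar idea, a tiny sketch may help. Each statement is just subject, predicate, object, and navigation means following an object that is itself the subject of other statements. The identifiers and predicate names below are invented for illustration and are not Freebase's own.

```python
# Each statement is a (subject, predicate, object) triple.
# Identifiers and predicate names are hypothetical.
triples = [
    ("work:mort", "title", "Mort"),
    ("work:mort", "creator", "person:pratchett"),
    ("work:mort", "subject", "Fantasy fiction, English"),
    ("person:pratchett", "name", "Terry Pratchett"),
]

def objects(subject, predicate):
    """All objects stated for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def creator_names(work_id):
    """Follow the creator link from a work to each person's name."""
    return [name
            for person in objects(work_id, "creator")
            for name in objects(person, "name")]

print(creator_names("work:mort"))
```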

Their book data isn't as strong as it should be, given that they claim to have imported the Open Library file (I suspect it was only partially imported). When you look at the Freebase entry for Emily Dickinson you only see two works listed. Open Library has 137 Works for Dickinson, and WorldCat Identities lists 3,388. Also, their approach is more "popular" than rigorous. However, there is no reason why this same technique could not be used with "pure" library data, and library catalogs could make use of any of the data in such a database because it is all available through linking and APIs. A database like Freebase essentially serves as a huge pot of available, re-usable information.

In its current form, Freebase would not be sufficient for library data sharing, although it could provide an interesting testing ground. What we need to work out for libraries is a way to version and source content so that you know who provided each statement and when, and to make it easy to contribute new information or improvements to the information in a sensible and automated way. There is no reason why we could not create a "LibBase" that consists solely of what libraries would consider to be authoritative information; a kind of linked data WorldCat. That data would have to be able to interact with other data on the Web, and by doing so libraries would become discoverable on the Web. It would be logical for projects like Freebase to link to the library data. Library users would have a rich, navigable information base that could help them follow (or even make) connections between library resources -- connections that are much less evident in today's catalogs. Some technical magic would need to occur to allow users to move seamlessly from the whole world to their local library, but I don't think that's going to take rocket science to solve.
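The "who provided each statement and when" part could be as simple as carrying a source and a date on every statement. What follows is a sketch under that assumption, not a design. The library names, dates, and identifiers are placeholders, and "latest statement wins" is only one possible policy.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    value: str
    source: str        # who provided the statement
    asserted_on: date  # when it was provided

# Placeholder contributions from two hypothetical libraries.
statements = [
    Statement("work:mort", "date", "1987", "Library A", date(2010, 1, 15)),
    Statement("work:mort", "place", "England", "Library A", date(2010, 1, 15)),
    Statement("work:mort", "place", "England, UK", "Library B", date(2010, 4, 2)),
]

def history(subject, predicate):
    """Every version of a statement, oldest first, with its source kept."""
    matching = [s for s in statements
                if s.subject == subject and s.predicate == predicate]
    return sorted(matching, key=lambda s: s.asserted_on)

def latest(subject, predicate):
    """Assumed policy: the most recently asserted statement is the current one."""
    versions = history(subject, predicate)
    return versions[-1] if versions else None

current = latest("work:mort", "place")
print(current.value, "from", current.source)
```

Because nothing is overwritten, an improvement from one library sits alongside the earlier statement and the history remains visible.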

There is a group of interested souls planning to get together on the Friday morning of ALA DC to begin some exploration of how we might make semantic web technology work for libraries. There will be announcements on various lists (I'm guessing NGC4LIB, CODE4LIB, LITA-L and RDA-L, at the very least). If you can get to ALA a little early, please mark that slot on your calendar. It'll be a free-floating, working, barcamp-style meeting, as I understand it.

Monday, April 26, 2010

Social aspects of subject headings

You've probably played the "my favorite subject heading" game when geeking out with librarian friends. Here's some additional fuel in case you've run out of zingers.

The Open Library takes the LC subject headings and breaks them apart at the subfield level into subjects, persons, places, genres, and times. It also includes some BISAC headings retrieved from Amazon, so the subject list is not "pure." The separate subject entries obtained are similar to, but not the same as, OCLC's FAST headings, and look much like some facets that appear in library catalogs.

The Open Library database currently holds about 24 million records for books (at least partially de-duped). In a recent dump of subjects, the total number of different subjects came out as 1,278,539. Of those, 336,638 were of the "topical" variety, that is either a 650 $a or a 65X $x. The top 25 are as follows:

825168 History
322928 Biography
212822 Politics and government
206519 Congresses
192968 History and criticism
184183 Fiction
123838 Law and legislation
119333 Bibliography
95555 Juvenile literature
93364 Description and travel
90866 Economic conditions
84787 Criticism and interpretation
74878 Claims
71468 Social life and customs
70926 Social conditions
70563 Catalogs
69205 Private Bills
69191 Private bills
66480 Education
63410 Exhibitions
63301 World War, 1939-1945
60235 Foreign relations
60068 Philosophy
56219 Dictionaries
55460 Study and teaching

I find it interesting that with the exception of "World War, 1939-1945" these appear to have the function of qualifiers, and I'm thinking that it would be interesting to contrast the $a and $x terms. My guess is that these are $x, but that not all $x are of this nature.
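
For anyone who wants to try the $a versus $x comparison, the splitting and counting could look something like the Python sketch below. The input shape (each 650 field as a list of (subfield code, value) pairs) and the sample headings are assumptions made for illustration; this is not the actual Open Library code or data.

```python
from collections import Counter

# Standard MARC 650 subfield codes mapped to the categories mentioned above.
CATEGORY = {"a": "subject", "x": "subject", "v": "genre", "y": "time", "z": "place"}


def count_parts(fields):
    """Count each heading part separately, keyed by (subfield code, value)."""
    counts = Counter()
    for subfields in fields:
        for code, value in subfields:
            if code in CATEGORY:
                # Drop trailing punctuation so "Fiction." and "Fiction" match.
                counts[(code, value.strip(" .,;"))] += 1
    return counts


# Made-up sample headings, each a list of (subfield code, value) pairs.
sample = [
    [("a", "World War, 1939-1945"), ("x", "Fiction.")],
    [("a", "Deer as pets.")],
    [("a", "Fiction"), ("x", "History and criticism.")],
]

counts = count_parts(sample)
print(counts[("a", "Fiction")], counts[("x", "Fiction")])
```

Keeping the subfield code in the key is the point of the design: the same string can then be counted once as a main topic and once as a qualifier, which is exactly the contrast that a combined list hides.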

Of the subfields, 164,342 appear only once in the database. These are a great source of interesting and unusual headings, including "Social aspects of adzes" and "Deer as pets." In fact, the "Social aspects...." tail is so amusing that I have made a file of those with a count of 1.

The full file of topical subjects is 8 megabytes, but can probably yield innumerable hours of library cocktail hour amusement. (text in format "count - tab - subject") I will also look into names, organizations, places and times as subjects.

Friday, April 09, 2010

OCLC record use policy

OCLC has issued a new draft of its record use policy for member comment. As others have remarked, while better worded and seemingly less draconian than the previous policy (the one that was withdrawn) the substance has not changed one iota. There are many things wrong with the policy itself, but the primary problem with it is not the text of the policy but the way that OCLC has chosen to define the problem it is trying to solve. Here are some of the issues I have with the approach:

1. Pushing the river
The central issue is that OCLC wants to limit downstream use of bibliographic data that is stored in WorldCat. This simply cannot be done. The same data is also stored in individual library catalogs, some union or consortial catalogs, and in bibliographic software used by many hundreds of thousands of researchers around the world. It also often closely resembles data created outside of OCLC's sphere, such as through publisher and retailer channels. Sharing of this data is absolutely necessary for the furtherance of intellectual pursuits and scientific progress, as well as the market for new and used items. Ironically, the policy would restrict use of the data by OCLC members without restricting its use by the multitude of non-members. It would be unacceptable even if it were workable, which it isn't.

2. One-sided
The policy has a section on member rights and responsibilities, but no such section on OCLC's rights and responsibilities. (Nope, I was wrong about that. The section does exist, I must have missed it.) The policy carries the assumption that, if anything, members are the problem, OCLC the solution, and gives no sense of the policy being the result of an agreement between the parties. OCLC can make unilateral decisions about record use, such as its agreement with Google, but members must ask permission of OCLC for many uses. There is nothing here that acknowledges that there could be a situation where the interests of a library and the interests of OCLC are in conflict, nor how that would be resolved. All-in-all, it reads as if the purpose of membership were to sustain OCLC (instead of the purpose of OCLC being to support libraries).

3. Transparency
OCLC, or one of OCLC's governing groups, will make decisions. Yet there are no criteria given for making these decisions, no timelines, no reporting back to members, no mechanism for feedback. Will members know how "their" WorldCat records are being used? Will they have any choice in the matter? Will there be a way to know what requests for use have come in to OCLC, which ones have been accepted, which turned down? If WorldCat is such a "community good" shouldn't the community at least have this information about the use of that good?

4. No options
In most agreements there is some give and take. If you do X, you will get Y. The OCLC record use policy does not give members options. An example of an option would be: if you do your cataloging on OCLC, ILL will cost you $X; if you do not do your cataloging on OCLC, uploading your records will cost you $Y and ILL will cost you $Z. With clear options, libraries can decide what is best for them in their particular situation. Without clear options libraries have no way to make rational decisions about their participation in OCLC. It's not a religion, it's a business relationship, and it should be treated like one.

5. Avoids facing the problem
The problem that OCLC is trying to fix arises, as far as I can tell, because of OCLC's particular mix of revenue and expenses. Most of the revenue comes in to OCLC from its cataloging service, so having members choose to catalog elsewhere is the problem. Exhorting members to keep their records in their databases so that others cannot create a large database of bibliographic data is not a solution to this problem. Large bibliographic databases do and will exist. If their existence is a threat to OCLC, then the jig is already up. Rather than stew about what others are doing with bibliographic data, OCLC needs to find a balance of income and expenses that meets the needs of its member libraries, and that might include making some hard decisions about OCLC services.

6. Ignores market forces
If someone can do it better, cheaper, more conveniently, why should libraries stick with OCLC as their vendor? For the purchase of materials or library systems or other services, libraries move to new vendors when they see advantages. With the economic downturn there is a scramble by libraries to cut costs wherever they can. No amount of loyalty to the "collective" can overcome the economic situation libraries find themselves in today. In a sense, OCLC seems to expect the libraries to act irrationally by sticking with the service even if something more economical comes along. Libraries obviously cannot afford to do this.

I cannot tell what steps OCLC's members can take at this point. The web site points to a community forum where people can post comments, but posting comments on the policy doesn't begin to solve the underlying problems as presented here. If I were a member, I think I would feel like a row boat hitching a ride behind the Titanic, hoping it will get me through the ice floes. Nothing is unsinkable, as we have unfortunately found out in the past.

Wednesday, April 07, 2010

After MARC

The report on the Future of Bibliographic Control made it clear that the members of that committee felt that it was time to move beyond MARC:

"The existing Z39.2/MARC “stack” is not an appropriate starting place for a new bibliographic data carrier because of the limitations placed upon it by the formats of the past." p. 24

The recent report from the RLG/OCLC group Implications of MARC Tag Usage on Library Metadata Practices comes to a similar conclusion:

"5. MARC itself is arguably too ambiguous and insufficiently structured to facilitate machine processing and manipulation." p.27

We seem to be reaching a point of consensus in our profession that it is time to move beyond MARC. When faced with that possibility, many librarians will wonder if we have the technical chops to make this transition. I don't have that worry; I am confident that we do. What worries me, however, is the complete lack of leadership for this essential endeavor.

Where could/should this leadership come from? Library of Congress, the maintenance agency for the current format, and OCLC, the major provider of records to libraries, both have a very strong interest in not facilitating (and perhaps even in preventing) a disruptive change. So far, neither has shown any interest in letting go of MARC. The American Library Association has just invested a large sum of money in the development of a new cataloging code. It has neither the funds nor the technical expertise to take the logical next step and help create the carrier for that data. Yet, a code without a carrier is virtually useless in today's computer-driven networked world. NISO, the official standards body for everything "information," is in the same situation as ALA: it cannot fund a large effort, and it has no technical staff to guide such a project.

It seems ironic that there have been projects funded recently to develop library-related software based on MARC even though we consider this format to be overdue for replacement. The one effort I'm aware of to obtain funding for the development of a new carrier was rejected on the grounds that it wasn't technically interesting. In fact, the technology of such an effort isn't all that interesting; the effort requires the creation of a social structure that will nurture and maintain our shared data standard (or standards, as the case may be). It requires an ongoing commitment, broad participation, and stability. Above all, however, it requires vision and leadership. Those are the qualities that are hard to come by.

Friday, March 05, 2010

MARC: from mark-up to data

The main reason that I keep pushing the semantic web is not that I think the semantic web is the answer to all of our problems -- but I do think we need to have something to be moving toward in terms of transforming our data carrier to something both more modern and web-compatible. The semantic web gives us some basic concepts of data design. I'm not sure that the semantic web concepts will hold for data as complex as the library bibliographic record, but there's only one way to find out: do it. That's a huge task, of course.

The first question to be answered is: What are our data elements? In theory, this should be one of the simpler questions, but it's not. I can create a list of all of the MARC fields, subfields, and fixed field elements (which I have, and they are linked from this page of the futurelib wiki), but that doesn't answer the question. Here's why:

Indicators

The indicators in the MARC fields are like a wild card in poker -- they can be used to utterly transform the play. Some of the indicators are simple and probably can be dismissed: the non-filing indicators and the indicators that control printing. Some are data elements in themselves: "Existence in NAL collection" is essentially a binary data element. Many further refine the meaning of the field, allowing the field to carry any one of a number of related subelements:

Second - Type of ring
# - Not applicable

0 - Outer ring

1 - Exclusion ring

Others name the source of the term, such as LCSH or MeSH. It'll take a fair amount of work to figure out what all of these qualifiers mean in terms of actual data elements.
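
One way to start on that work is to write the indicator values down as an explicit lookup table, so that each combination of field, indicator position, and value names the data element it implies. A small Python sketch follows. The element names on the right are invented labels, and the ring indicator is assumed here to be the second indicator of field 034.

```python
# (tag, indicator position, value) -> hypothetical data element name.
# "#" stands for a blank indicator, as in the MARC documentation.
INDICATOR_ELEMENTS = {
    ("034", 2, "#"): None,                       # not applicable
    ("034", 2, "0"): "outerRingCoordinates",
    ("034", 2, "1"): "exclusionRingCoordinates",
    ("650", 2, "0"): "subjectLCSH",              # source of term: LCSH
    ("650", 2, "2"): "subjectMeSH",              # source of term: MeSH
}


def elements_for(tag, ind1, ind2):
    """Return the element names implied by a field's two indicators."""
    names = []
    for position, value in ((1, ind1), (2, ind2)):
        key = (tag, position, value.strip() or "#")
        name = INDICATOR_ELEMENTS.get(key)
        if name:
            names.append(name)
    return names


print(elements_for("650", " ", "0"))
print(elements_for("034", "1", "1"))
```

A table like this also makes the gaps visible: any indicator value that has no entry is one whose meaning as a data element still has to be worked out.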

Redundancy

There is non-textual (although not non-string) data in the MARC record, primarily in the fixed fields (00X) but also in some of the number and code fields (0XX). Some of these, actually most of these, are redundant with display information in the body of the record. Should these continue to be separate data elements, or can we remove this redundancy and still have useful user displays? Basically, having the same information entered in two different ways in your data is just begging for trouble and we've all seen fixed field dates and display (260 $c) dates that contradict each other.
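
A check for this kind of contradiction is easy to sketch. The Python below assumes the record is already available as a plain dict holding the 008 string and the 260 $c string (the dict layout and the sample values are invented for the example); the only MARC fact it relies on is that Date 1 sits in positions 07-10 of the 008.

```python
import re


def dates_disagree(record):
    """True if the 008 Date 1 and the first year in 260 $c are both present and differ."""
    fixed = record["008"][7:11]                       # 008 positions 07-10: Date 1
    match = re.search(r"\d{4}", record.get("260c", ""))
    if not fixed.isdigit() or match is None:
        return False                                  # nothing reliable to compare
    return fixed != match.group(0)


# Invented 40-character 008: date entered, type of date "s", Date 1, blank Date 2,
# place, blank book-specific positions, language, modified-record flag, source.
f008 = "100305" + "s" + "2008" + " " * 4 + "nyu" + " " * 17 + "eng" + " " + "d"

consistent = {"008": f008, "260c": "c2008."}
contradictory = {"008": f008, "260c": "2009."}

print(dates_disagree(consistent), dates_disagree(contradictory))
```

The sketch can only report that the two dates differ. It cannot say which one is right, which is the real cost of entering the same information twice.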

Inconsistency

Primarily due to the constraints of the MARC format, the same information has been coded differently in different fields. A personal author entry in the 100 field uses subfields abcdejqu; in the 760 linking entry field, all of that data is entered into subfield a. It's the same data element, and by that I mean that the same contents are contained in the concatenation of abcdejqu as in a. Bringing together all of these krufty bits into a more rational data definition is something I really long for.
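
To make the equivalence concrete: if the name is stored as separate subfields, the single-string form can be derived by simple concatenation, while going the other way requires parsing. A minimal Python sketch, assuming the subfields arrive as ordered (code, value) pairs that already carry their punctuation; the function name is hypothetical.

```python
NAME_SUBFIELDS = set("abcdejqu")   # the 100 subfields listed above


def flatten_name(subfields):
    """Join the name subfields of a 100 field, in record order, into one string."""
    parts = [value.strip() for code, value in subfields if code in NAME_SUBFIELDS]
    return " ".join(parts)


field_100 = [("a", "Dickinson, Emily,"), ("d", "1830-1886.")]
print(flatten_name(field_100))   # Dickinson, Emily, 1830-1886.
```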

And of course my favorite... data buried in text

So much of our data isn't data, it's text, or it's data buried in text. My favorite example is the ISBN. Everyone knows how important the ISBN is in all kinds of bibliographic linking operations. But there isn't a place in our record for the ISBN as a data element. Instead, there is a subfield that takes the ISBN as well as other information.

020 __ |a 0812976479 (pbk.)

This means that every system that processes MARC records has to have code that separates out the actual ISBN from whatever else might be in the subfield. Other buried information includes things like pagination and size or other extents:

300 __ |a 1 sound disc : |b analog, 33 1/3 rpm, stereo. ; |c 12 in.

300 __ |a 376 p. ; |c 21 cm.
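
The kind of code that every system ends up carrying for the 020 can be sketched in a few lines of Python. The function name and the shape of the return value are assumptions for illustration; real systems also have to cope with hyphens, multiple ISBNs, and invalid check digits, all of which this sketch ignores.

```python
import re

# A 13-digit ISBN, or a 10-character ISBN whose last character may be X.
ISBN_AT_START = re.compile(r"^\s*(\d{13}|\d{9}[\dXx])\b\s*(.*)$")


def split_020a(value):
    """Split an 020 $a value into (isbn, qualifier); either part may be None."""
    match = ISBN_AT_START.match(value)
    if match is None:
        return None, value.strip() or None
    isbn = match.group(1).upper()
    qualifier = match.group(2).strip().strip("()").strip() or None
    return isbn, qualifier


print(split_020a("0812976479 (pbk.)"))   # ('0812976479', 'pbk.')
```

If the ISBN and its qualifier were separate data elements in the record, none of this pattern matching would be needed.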

Once this analysis is done (and I do need help, yes, thank you!), it may be possible to compare MARC to the RDA elements and see where we do and don't have a match. I have a drafty web page where I am putting the lists I'm creating of RDA elements, but I will try to get it all written up on the futurelib wiki so it's all in one place. I encourage others to grab this data and play with it, or to start doing whatever you think you can do with the registered RDA vocabularies. And please post your results somewhere and let me know so that I can gather it all, probably on the wiki.

Thursday, March 04, 2010

The Letters Keep Coming In

Today I received a copy of a letter written by Roman Kochan, Dean and Director of Library Services at the California State University, Long Beach (CSULB). It's the perfect day for this, because today is the national day of protest in support of education. This movement has blossomed (exploded?) over the deep cuts the California state legislature has made to the education budget in the state, cuts which are having a devastating effect on the CSU system, with the libraries extremely hard hit.

The letter is addressed to "Link+™ Member Libraries and ILL Partners." The subject line on Kochan's letter reads: Threat to CSULB Library's ILL Participation. He states that faced with budget cuts, not only this year but foreseeable for many years to come, CSULB decided to move to SkyRiver™ as their cataloging utility, with anticipated significant savings.

The next three paragraphs are worth quoting in their entirety:

"We notified OCLC of this decision, while at the same time advising them of the Library's intent to continue membership in OCLC, to continue to make use of OCLC interlibrary loan services, and to contribute records for our current and future acquisitions to OCLC for batch upload. OCLC's charge for batch upload was (until recently) posted on the OCLC website as 23¢ per record. That is the amount I referred to in my letter to the organization. I have subsequently learned that:
The price schedule for batch downloading [sic, read: uploading] that contained the 23¢ charge has suddenly and mysteriously disappeared from the OCLC website
Another academic library that chose to displace OCLC with SkyRiver reports that OCLC has quoted a revised charge for downloading their records that amounts to about $2.85 per record; it is a charge that they report would effectively (and one might not think coincidentally) offset the savings accrued from their change to SkyRiver."

The irony in all of this is that CSULB will still be able to have up-to-date ILL services using INN-Reach and Link+, the Innovative Interfaces (III) ILL service. It's ironic because SkyRiver was founded by Jerry Kline, the owner of III. Link+ is undoubtedly of smaller reach than OCLC's ILL services, but may in the long run grow if more III libraries move to SkyRiver.

Offsetting the cost of having a library move to another vendor may make some economic sense, but this is a matter that will need to get cleared up before other libraries move to SkyRiver thinking that they'll be able to upload their records to OCLC for $.23. MSU and CSULB were caught by surprise, which is very unfortunate.

Friday, February 26, 2010

Yet more OCLC

I have in hand a letter from Clifford H. Haka, Director of the Michigan State University Libraries, addressed to "ILL Partners" and dated February 24, 2010. The letter is a response to Larry Alford's document in my previous post. I will try to represent the facts he presents here as accurately as possible, and to distinguish those from my own opinions.

FACTS (from the letter)

MSU libraries chose to move their cataloging from OCLC to SkyRiver in a cost saving effort. They expect to save about $80,000 per year. Because MSU uses OCLC for ILL, they intended to pay to have their records loaded into OCLC. The OCLC service charge list gives the price for this service as $0.23 per record.

However, when MSU requested the upload service, OCLC offered them a price of $54,000 for five months (presumably end of fiscal year?), which would amount to $74,000 per year for 26,000 records, or $2.85 per record. (Some of this would be offset by cataloging credits.)

MSU has decided that they cannot afford this, and therefore will not be uploading current cataloging into OCLC. Haka says: "While we will continue with OCLC for ILL, I regret that our newer holdings will not be available for others to consult."

Now My Take

I find it astonishing that any corporation would choose to punish customers rather than to work to win them back. I also find it astonishing that OCLC is willing to keep current customers through threats and fear. Essentially, MSU is being made an example of: if you move your cataloging to a competitor, we'll cut you out of OCLC services. This is a lesson for anyone else thinking of moving to SkyRiver or some other service.

As Haka points out in his letter, the OCLC database has a huge number of records that were not created through OCLC cataloging services. When the RLIN cataloging service still existed, many libraries that did their cataloging in RLIN uploaded those records to OCLC so that they could use the OCLC ILL service. They paid an amount similar to the $0.23 that Haka quoted from the current price list. This ability to upload (economically, I should add) is directly in support of the stated goal of maintaining WorldCat's value as a union catalog. The more complete the catalog, the more value it has for services like ILL, resource sharing, and collection development. Yet it is OCLC's action that is devaluing WorldCat by deliberately setting an upload price that MSU obviously cannot support economically. This tells me that the real issue is not the "value of WorldCat" but the revenue that OCLC receives from cataloging.

Business 101 would tell you that the existence of a competitor brings prices down in the sector. If you can't meet your competitor's price, then you can try to keep your customers through a superior product and better services, but for some, price will be the main factor. If someone else can provide the same service at a better price, your customers will go there.

It seems to me, and Haka alludes to this, that OCLC's reliance on cataloging revenue may be in trouble, not just because of SkyRiver but also because of the Internet: it is now very easy for anyone to store and move metadata on the public Internet. The number of sites dedicated to the same materials that one finds in libraries is increasing rapidly. We have Amazon, Google Books, LibraryThing, Open Library, IMDB, and on and on. They all have metadata describing the things in their focus. It's not the same as library metadata, but the library catalog is no longer, and not by any means, an exclusive source of description for books, films, or music.

What OCLC has that is unique is not just the quantity of metadata but the library holdings information. And they seem to be aware of this as they load in both records and holdings from many libraries that do not do their cataloging on OCLC. OCLC's value is in the whole package, but it still relies on cataloging as its primary revenue (although shrinking as a percentage of the total income, as you can see in their annual reports).

The services, like ILL, that OCLC provides for libraries are incredibly valuable and it would be a great detriment to the library community to lose them. It does appear, however, that there has been a shift in the marketplace; a shift that has nothing to do with library loyalty to the OCLC collective, but one of changing technology and economics. OCLC is trying to push water upriver, when it should be seeking a new balance in its revenue stream. Instead, OCLC is making a real mess of its relationship with its members -- first with the horribly botched record use policy (which isn't going to solve this problem anyway), and now with acting punitively toward members who make the kinds of economic decisions that we all make every day. I believe the "collective" can be saved, but only if OCLC decides to work with, not against, its members.

More thoughts (added later)

I realize now that I have many other questions about record loading on OCLC. For example, many libraries get some of their records from their book vendors, and those do get loaded into OCLC. Is that charged as cataloging, or as record loading? Are there different fees for loading records if you are doing your cataloging on OCLC vs. if you are not? Are there "load only" libraries who load their records in order to participate in ILL and other services? If so, what are they charged for record loading?

I say this because it makes sense to me that libraries that do not do their cataloging on OCLC would be encouraged to load their records so that they can participate in other services. It also makes sense that the price for this would be commensurate with that of adding your holdings online (or maybe a bit cheaper if it's more economical for OCLC to batch load rather than provide cataloging online). In fact, what difference does it make how you get your records into OCLC? The most important thing is that your records are there as part of WorldCat.

What the MSU letter tells me is that the OCLC economics are such that cataloging on OCLC is paying for other services, like record uploads, which may be under-priced. A different upload charge for non-cataloging libraries makes sense, and if that's the case then OCLC needs to make that clear. However, it wouldn't surprise me if that made alternative cataloging services unmarketable, because as the MSU case shows, the total for cataloging elsewhere plus loading on OCLC would favor doing cataloging on OCLC. This makes perfect sense to me, but it appears that members haven't been informed of this pricing practice. Really, a little more transparency about pricing could go a long way toward avoiding situations like the MSU one.