Monday, May 26, 2008

Amputation

OK, maybe I'm just in a particularly bad mood, but I guess I've just about had it with libraries shooting themselves in the foot. Then letting gangrene set in and going for amputation.

I'm talking about not sharing the data we have. We've all heard the complaints that the Web community is going forward "reinventing the wheel" in terms of bibliographic data. But then when anyone shows an interest in our bibliographic data, we withhold like the anal retentives we are. (Wow, I must be in an extraordinarily bad mood!)

We've just learned that OCLC is sharing its data with Google, and that libraries can download records for Google books from OCLC -- although there was no mention of the "usual fee," which I'm sure applies. I happen to know that Google receives a full bibliographic record with each book that it digitizes. I have no idea what they do with that data because it doesn't appear on the screen in Google Book Search. What is significant is that the quality data that has been created by libraries is still essentially invisible to most users of the Internet. Do you still wonder why we are overlooked by most of the information seeking population? How can they possibly know what we've done to organize the bibliographic world if we won't let them see?

So the latest thing that got my goat came across on the NGC4LIB list. First, it turns out that the University of Michigan has made available, via OAI-PMH, a file of records representing those of its books digitized by Google that are in the public domain. (The OAI Identify command.) This is a good thing, obviously. But in this post, we learn that the records have been "truncated" to meet the requirements of OCLC for record sharing. So I decided to see what "truncated" means. Here are some examples. The bolded text is what you get when you retrieve the records from Michigan. The text in italics is what shows in the Michigan catalog for the same item.

1.
LDR nam 22003251i 4500
005 19901105000000.0
006 m d
007 cr bn ---auaua
008 901105s1980 dcu b f00010 eng d
020 |b pbk.
035 |a (OCoLC)ocm06624048
040 |a GPO |c GPO |d m.c. |d m/c |d EYM
074 |a 968-H-1
0860 |a J 26.2:C 73/10
1001 |a Villano, Clair E.
24510 |a Complaint and referral handling / |c by Clair E. Villano, Metropolitan Denver District Attorneys' Office of Consumer Fraud and Economic Crime.
260 |a [Washington] : |b Dept. of Justice, Law Enforcement Assistance Administration, |c 1980.
300 |a v, 25 p. ; |c 28 cm.
440 0 |a Operational guide to white-collar crime enforcement
500 |a At head of title: The National Center on White-Collar Crime.
500 |a Project supported by Grant No. 77-TA-99-0008, awarded to the Battelle Memorial Institute Law and Justice Study Center.
500 |a May 1980.
504 |a Includes bibliographical references.
533 |a Electronic text and image data |b Ann Arbor, Mich. : |c University of Michigan Library |d 2007 |e Includes both image files and keyword searchable text. |f [Michigan Digitization Project]
538 |a Mode of access: Internet.
650 0 |a Complaints (Criminal procedure) |z United States.
650 0 |a White collar crimes |z United States.
7101 |a United States. |b Law Enforcement Assistance Administration.
7102 |a National Center on White-Collar Crime (U.S.)
7102 |a Metropolitan Denver District Attorneys' Office of Consumer Fraud and Economic Crime.
7102 |a Battelle Law and Justice Study Center.
856 4 |uhttp://hdl.handle.net/2027/mdp.39015034803505
|wmdp.39015034803505 |xeContent

2.
LDR nam 2200301 a 4500
005 19901105000000.0
006 m d
007 cr bn ---auaua
008 901105s1980 dcu bc f00010 eng c
010 |a 81600912 //r90
035 |a (OCoLC)ocm06999418
040 |a DGPO/DLC |c DLC |d GPO |d EYM
043 |a n-us---
05000 |a Z6616.B3215 |b B37 |a E746.B37
074 |a 383-B
08200 |a 016.355/0092/4 |2 19
0860 |a D 214.13:B 26/859-930
1001 |a Bartlett, Merrill L.
24510 |a George Barnett, 1859-1930 : |b register of his personal papers / |c compiled by Merrill L. Bartlett.
260 |a Washington, D.C. : |b History and Museums Division, Headquarters, U.S. Marine Corps, |c 1980.
300 |a vii, 18 p. ; |c 27 cm.
504 |a Bibliography: p. 17-18.
533 |a Electronic text and image data |b Ann Arbor, Mich. : |c University of Michigan Library |d 2007 |e Includes both image files and keyword searchable text. |f [Michigan Digitization Project]
538 |a Mode of access: Internet.
60010 |a Barnett, George, |d 1850-1930 |x Manuscripts |x Catalogs.
61010 |a United States. |b Marine Corps. |x History |x Sources |x Manuscripts |x Catalogs.
61010 |a United States. |b Marine Corps. |b History and Museums Division |x Catalogs.
650 0 |a Manuscripts, American |z Washington (D.C.) |x Catalogs.
7101 |a United States. |b Marine Corps. |b History and Museums Division.
856 4 |uhttp://hdl.handle.net/2027/mdp.39015035037590
|wmdp.39015035037590 |xeContent


There are obviously many other examples, but the trend is clear (well, partly so). The lack of a 245 $c (statement of responsibility) and all of the 7XX's (added entries) means that the record is incomplete in terms of authorship. You won't be able to see or search on anything but the main author. I'm baffled by the removal of the place of publication, since it's not used for retrieval (the coded place is in the 008 field). Ditto the 300 $c with the size in centimeters. The subject headings have been rendered entirely useless. As we know, the 6XX $a is not the top of some logical hierarchy, but is idiosyncratically the first term based on some rather complex rules. So in the first record we lose "United States" because it is the second term, but in the second record we get only "United States" and lose all references to "Marine Corps." which is the actual topic of the item.
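
To make the effect on the subject headings concrete, here is a minimal sketch in Python. The truncation rules in it are only my reading of the two sample records above (drop the 7XX fields, drop 245 $c, 260 $a and 300 $c, keep only $a in the 6XX fields); they are an assumption, not a published OCLC or Michigan specification, and the field representation is a hypothetical simplification rather than real MARC parsing.

```python
# Hypothetical, simplified model: a field is (tag, [(subfield_code, value), ...]).
# The rules are inferred from the two sample records, not from any
# OCLC or University of Michigan documentation.

def truncate_field(tag, subfields):
    """Return the subfields that survive, or None if the whole field is dropped."""
    if tag.startswith("7"):
        return None  # all 7XX added entries are removed
    if tag == "245":
        return [(c, v) for c, v in subfields if c != "c"]  # no statement of responsibility
    if tag == "260":
        return [(c, v) for c, v in subfields if c != "a"]  # no place of publication
    if tag == "300":
        return [(c, v) for c, v in subfields if c != "c"]  # no size in centimeters
    if tag.startswith("6"):
        return [(c, v) for c, v in subfields if c == "a"]  # only the first term survives
    return list(subfields)


def truncate_record(fields):
    """Apply truncate_field to every field, skipping the dropped ones."""
    out = []
    for tag, subfields in fields:
        kept = truncate_field(tag, subfields)
        if kept is not None:
            out.append((tag, kept))
    return out


# Subject headings taken from the two sample records above.
subject_1 = ("650", [("a", "Complaints (Criminal procedure)"),
                     ("z", "United States.")])
subject_2 = ("610", [("a", "United States."),
                     ("b", "Marine Corps."),
                     ("x", "History"),
                     ("x", "Sources"),
                     ("x", "Manuscripts"),
                     ("x", "Catalogs.")])
added_entry = ("710", [("a", "United States."),
                       ("b", "Marine Corps."),
                       ("b", "History and Museums Division.")])

for tag, subfields in truncate_record([subject_1, subject_2, added_entry]):
    print(tag, " ".join("|" + c + " " + v for c, v in subfields))
```

Under these assumed rules the first heading keeps its topic but loses "United States," while the second keeps only "United States." and loses "Marine Corps." altogether, and the 710 disappears completely.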

I note also that the output records do not have the required 001/003/005 fields. (And not having required fields is a problem in itself.) These are needed to identify the source of the record (003) and provide a unique identity for the record in that source system (001 + 003). The 005 would give the date of the most recent update before the record was exported. The combination of these three would allow a receiving system to accept periodic updates to these records. I suspect (and this is just me) that the existence of the 001 would also allow one to retrieve the un-truncated record from Michigan's catalog using something like Z39.50. I admit, however, that the lack of these fields could be a simple error.
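
As a rough illustration of why those three fields matter for updates, here is a small Python sketch. It assumes a receiving system that stores each record's control fields in a plain dictionary; the function names and the sample 001/003 values are hypothetical, and only the 005 layout (yyyymmddhhmmss.f, as in the sample records above) comes from the MARC format itself.

```python
from datetime import datetime


def record_key(control):
    """Identity of a record: source system (003) plus its number there (001)."""
    return (control["003"], control["001"])


def parse_005(value):
    """Parse a MARC 005 timestamp such as '19901105000000.0' (tenths of a second ignored)."""
    return datetime.strptime(value[:14], "%Y%m%d%H%M%S")


def should_replace(existing, incoming):
    """True if incoming is the same record as existing and was updated more recently."""
    if record_key(existing) != record_key(incoming):
        return False
    return parse_005(incoming["005"]) > parse_005(existing["005"])


# Hypothetical control-field values, for illustration only.
stored = {"001": "000123456", "003": "EXAMPLE-SOURCE", "005": "19901105000000.0"}
update = {"001": "000123456", "003": "EXAMPLE-SOURCE", "005": "20080526120000.0"}

print(should_replace(stored, update))
```

Without the 001 and 003 there is nothing to match on, and without the 005 there is no way to tell which copy of a record is the newer one.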

I'm not going to say that these records are completely useless, and I am talking to the Open Library folks about adding them to their database. But I do consider the maiming of these records to be an embarrassment, a kind of self-mutilation done by a profession with amazingly low self-esteem. Please, folks, let's find our power and stand up for what we know is right!

13 comments:

Simon Spero said...

Now, to my non-lawyer eyes, the University of Michigan would seem to be required by state law (MCL §§15.231-246 [1]) to provide access to the non-copyrighted parts of its bibliographic database.

The fact that the non-copyrighted portions of those records may be available for purchase elsewhere does not affect this duty. (See Attorney General's opinion No. 5500 [2])

Where records contain exempted and non-exempted material, the agency must separate the exempted materials from the non-exempted materials (MCL §15.244 [1]).

The cost of such separation may not be charged to the requester unless "failure to charge a fee would result in unreasonably high costs to the public body because of the nature of the request in the particular instance, and the public body specifically identifies the nature of these unreasonably high costs"

If records are requested in electronically readable form in MARC format, the agency may not substitute printed records ([3]).

Presumably the procedure involved would be to file a request with the university and, if denied, file in the federal court for the relevant district as a federal question (probably the Eastern District of Michigan).


Just a thought...

[1] http://www.legislature.mi.gov/(S(2l2nxiysu4vrqp55451qu245))/mileg.aspx?page=getobject&objectname=mcl-act-442-of-1976

[2] http://www.ag.state.mi.us/opinion/datafiles/1970s/op05500.htm

[3] Farrell v Detroit, 209 Mich App 7; 530 NW2d 105 (1995).

Diane Hillmann said...

Hear, hear! This is an issue that must be addressed, and soon, for the community to move ahead. OCLC must be strongly, STRONGLY, encouraged to develop a business plan that does not disable data or data sharing, but yet allows them to legitimately receive compensation for services rendered. Otherwise, libraries will continue to confront the future with amputated limbs and the fear of that 8000 lb. gorilla standing in the road ahead, machete in hand.

Free the DATA ...

Unknown said...

When I first saw that the Michigan records were available, like you I downloaded the file, examined a sample, and came to the same conclusion: that they were a joke in terms of usefulness. I can think of few better ways to generate negative publicity.

Karen Coyle said...

Answering Simon, I'm going to do another post on the "copyrightability" of library records. I think that there are actually few fields in library records that are purely factual, for various reasons, including decisions about what is the title and what authors to include, etc. The resulting record would be less, not more, than what Michigan is providing now.

Simon Spero said...

Remember that mechanical application of rules, even if requiring skilled craftsmanship, does not satisfy the constitutional requirement of minimum creativity.

If the choice of titles or authors is governed by a set of common rules, such as AACR2, LCRI and SCM, such that multiple catalogers applying the rules properly would select the same elements, then it is unlikely that mechanical application of those rules would meet minimum originality, no matter the skill required to apply them.

This matters because one of the primary purposes of standardized bibliographic control is to maximize that degree of inter-cataloger and inter-indexer consistency.

For example, the IFLA cataloguing principles define one of the objectives for the cataloging codes as "Standardization. Descriptions and construction of access points should be standardized to the extent and level possible. This enables greater consistency which in turn increases the ability to share bibliographic and authority records."

I will post more later (I left Svenonius and McGarry's Lubetzky in the car - dash it.)

Karen Coyle said...

I understand your point, Simon, but I don't consider many of the rules in AACR2 to be actually "mechanical." There is often judgment involved, considerations for your community of users, etc. Things like: use the name the person is most commonly known as. But I'll try to explore all of this in a post, because I think it's worth thinking about.

Simon Spero said...

Quick followup before bedtime, now that I've remembered a couple of cases I meant to cite earlier:

The toy iron bank case is:

L. Batlin & Son, Inc., Appellee, v. Jeffrey Snyder d/b/a/ J.S.N.Y. and Etna Products, 536 F.2d 486; 1976 U.S. App. LEXIS 11846; 189 U.S.P.Q. (BNA) 753

This case is also rather on point:
MATTHEW BENDER & COMPANY, INC., Plaintiff, HYPERLAW, INC., Intervenor-Plaintiff-Appellee, v. WEST PUBLISHING CO. and WEST PUBLISHING CORPORATION, Defendants-Appellants, 158 F.3d 674; 1998 U.S. App. LEXIS 30790; 48 U.S.P.Q.2D (BNA) 1560

Anonymous said...

I comment somewhat pre-emptively in relation to your contemplated forthcoming posting. I imagine that your position would be that we must find the means to share records despite the applicability of copyright law, but please consider with all due care whether copyright law applies at all. I hope that you will do some legal research and not give rise to further confusion in the already deeply confused notions of copyright law typically held by library administrators.

[Although I made no comment at the time, I thought that your post about Google's copying of in-copyright books for indexing purposes had insufficiently considered the subtleties of copyright law. Individual catalogue records are a less complex case but attention to the details of how copyright law is applied is still important.]

1. ORIGINALITY FOR COPYRIGHT.

For a work to be copyrightable under the copyright law of any country, the work must meet a standard of original authorship. The standard may not be especially high and may vary with the laws and court interpretations of particular jurisdictions but the lack of perfect uniformity does not mean lack of some minimal standard.

Mere facts about the world do not constitute the requisite originality. The originality required is a creative originality not one of mere difference from others. Facts need not be determined by the application of rigid mechanical process to fail the originality test. Even if significant understanding and interpretation must be applied to discern a set of facts where the understanding and interpretation may differ from one observer to another, those differences do not transform mere facts into an original work subject to copyright.

Originality in the form in which the facts are presented is required for copyright law to apply to a particular presentation of facts. Sufficiently large aggregations of facts which form an original selection serving a creative purpose, such as an entire database of carefully selected records may be copyrightable. However, US courts have increasingly been interpreting complete databases consisting of aggregations of mere factual records as insufficiently original in compilation.

While a catalogue record may contain something more than mere facts, such records are unusual. An individual catalogue record may contain an original work in some fields designed for that purpose, such as fields which may include the full text or significant extracts of a work catalogued.

The creation of records of mere facts is sweat-of-the-brow work which, even if not copyrightable, has value which deserves compensation to ensure its continued creation. As you argue, records have more value to the world the more widely they are shared.

The absence of copyright applicability does not stop empty claims of copyright applicability or the use of other means to limit the exchange of work which cannot be controlled by copyright.

2. CONTRACTUAL RESTRICTIONS.

Some contracts may impinge upon the free exchange of records of mere facts to maintain a funding model for record creation. When such contracts are the only adequate means for securing necessary funding, then such contracts are appropriate. However, the most pervasive of such contracts impose restrictions which are now contributing to the marginalisation of libraries in the culture. This marginalisation is significantly diminishing the prospects for funding the work of libraries. At some point, I will take the time to fully articulate how organisations such as OCLC could continue to fund their good works from brokering record services, instead of obstructing the exchange of records to the detriment of their own members.

jpw said...

Hey, Karen. I'm glad you added the possibility that some of the decisions "could be a simple error." Three quick things here: (1) Yes, in many of the cases you point out, the omissions are simple short-sightedness. (2) We're at work on something that we intend to be more sustainable, facilitating updates and getting at sources like OCLC and the individual member library's catalog (e.g., via z39.50). (3) There are many reasons not to want to get into the record distribution business and to ensure OCLC's role, not least of which is their prohibition against sharing contributed/member cataloging. I've had a preliminary conversation with Bill Carney at OCLC to try to see if they'll serve this function for us. To my mind, that would be the best possible outcome.

Karen Coyle said...

JPW - I fully agree that there's more to the distribution of records than most institutions want to take on. (People do tend to forget about updates, and sustainability over time, both of which require a lot of effort.) So it would be ideal for this distribution to be done by an organization that already has that kind of mechanism in place. I really hope it happens. I also hope the records that are distributed are complete.

As for "simple errors" -- I figure that in my lifetime I have overseen the processing of something like 100 million bib records. I feel like I've seen everything, including a record with only one field (and it wasn't the title). Nobody's perfect, and for some reason when it comes to bib data, imperfection seems to be the norm. (smiley face here)

Jeffrey Beall said...

(I realize this conversation took place back in May.)

I loaded over one hundred thousand Mbooks records into my library's online catalog, despite the crummy, OCLC-amputated data. I have used the OPAC's global update feature to add missing dates for the voluminous authors and persons as subjects; this is ongoing as I stumble on the split files. The reaction has been positive. The tradeoff between wonderful content and low-quality metadata seems to favor the content. We would be willing to pay for better-quality records, I think, given that the content is free and valuable.

Jeffrey Beall said...

I forgot to add this: If you want to see some of the records in the online catalog, please go here:

http://skyline.cudenver.edu/search/o?SEARCH=mbooks2&Submit=Search

Thanks.

Karen Coyle said...

Jeffrey, thanks for this update. It's great to hear that you are making use of these records. Perhaps we could arrange a kind of pooled 'exchange' for upgraded records so others can take advantage... ideas on how to do this are very much welcome.