Friday, December 04, 2009

Girls are still icky


In 1996 I wrote a piece that was published in a book called Wired Women about the gender gap in computer advertising. (Online version) At the time, the computer world was so awash in testosterone that the back pages of computer user magazines, like PC Magazine, were mainly ads for pornography. The message for women was clear: No Girls Allowed. Like this postcard ad for a mid-1990s bulletin board system, which said on the back:

No wasted time.
No garbage.
No noise.
No irrelevant clutter.
For serious computer programmers and developers, BIX is the exclusive online club.


And one could add: no girls allowed.

I thought all of that was in the past, but ... nooooooooo. This Verizon Droid ad is just an updated version of that 1996 BBS advertisement.


Not only should the phone not be pretty (so maybe Ugly Betty would be acceptable), but it is not a "tiara-wearing digitally clueless beauty pageant queen" and, as the woman puts on her lipstick in the street, "it's not a princess, it's a robot." I don't want to go so far as to assume that the attack on the fashion-plate male mannequins equals homophobia, but the ending salvo of "It trades hair-do for can-do" is eerily reminiscent of the image above.

Monday, November 23, 2009

1923

The Google Books Settlement is causing a great deal of previously unexpressed bibliographic interest -- just how many books are there in the known universe? How many are published in the four countries now included in the settlement agreement (US, UK, Canada, and Australia)? And how many are in the public domain?

Lorcan Dempsey and Brian Lavoie have recently published an article in D-Lib that looks at these figures using the world's largest database of bibliographic data, WorldCat. The data is fascinating, but I have already seen it misinterpreted, so I thought some clarification might be useful.

Dempsey and Lavoie are very clear that what they are measuring is "Manifestations." Folks outside of the library environment are unlikely to know what that means, so it is important to clarify what the numbers in the Dempsey/Lavoie article represent. Each “book” that is counted represents a published product at about the same level of granularity that today would be given an ISBN. Therefore, if a publisher re-issues a book in its backlist after the previous print run has been exhausted (say, a decade later) and with a new introduction, it is considered a different book. The publication date that is fed into the study is the date of the new issuing of the book. Also, as publishers re-package and re-print public domain books, these also are considered separate products with new ISBNs and new dates.

Thus, if you look up a commonly re-published book like “Moby Dick, Or The Whale” in the Library of Congress catalog, you retrieve 40 items (and more if you use the short form of the name, simply “Moby Dick”), of which only one is pre-1923 — that one was published in 1851. Of the other 39 instances of the publication of the work, which range from 1925 to 2006, some contain what GBS called “inserts” - that is, separately copyrightable intellectual property in the form of introductions, etc., but others may be a straight republication of the text. If you do the same lookup in FictionFinder, a work-based view of a portion of the WorldCat database, you find:

823 editions of "Moby Dick" (which combines the various versions of the title)
534 of which are in English

of these:

9 have an unknown date
60 have a date of 1923 or earlier
465 have dates after 1923

Looking through the list on FictionFinder it is easy to see that there are some duplicate records, both in the pre- and post-1923 entries.

Therefore, the question we now need to answer is: how many public domain works have been republished after the 1923 cut-off date?

Google appears to currently lack the ability to make the proper connection between the original text that is in the public domain and the many “manifestations” (as they are called in library-speak) that were published later — and are also in the public domain, at least as far as the primary text is concerned. This is a non-trivial exercise when one is working only with the metadata that describes the work, but may become more feasible with the ability to do a full text analysis of the contents of the various packages in which publishers have placed the original work of Melville. I assume that Google is working on this, although I cannot predict how it will affect their assessment of the PD/(c) split.
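To make the idea concrete, here is a minimal sketch of one way such a full-text comparison could work. This is purely illustrative -- I have no idea what technique Google actually uses. It compares sets of word "shingles" from two OCR'd texts, so added introductions and other "inserts" only lower the overlap a little:

```python
# Illustrative sketch only: Google's actual matching method is not public.
# A high Jaccard overlap between word-shingle sets suggests that two
# "manifestations" share the same underlying (possibly public domain) text.

def shingles(text, k=8):
    """Return the set of k-word shingles in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def same_primary_text(scan1, scan2, threshold=0.85):
    # New introductions, notes, etc. lower the overlap a bit,
    # so the threshold is deliberately below 1.0.
    return jaccard(shingles(scan1), shingles(scan2)) >= threshold
```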

What is clear, however, is that Google is going to need to identify Works (if not strictly in the FRBR sense, then at least in the sense that meets some definition that is valid for copyright law).

Saturday, November 14, 2009

Amended Google/AAP Settlement

The amended settlement has been issued (the best way to see the changes is in the redline version). I will summarize here the changes that I see as having the greatest impact on libraries and on the public. For legal issues, I suggest James Grimmelmann's blog. For business issues, probably the NY Times and Wall Street Journal.

Foreign Works Mostly Excluded

Undoubtedly due to the many complaints from foreign rights holders, the settlement now only includes (oddly enough) US, UK, Australian and Canadian works. This would include, as I interpret it, translations of works from other countries that were published in those four countries. This greatly changes the value of the institutional subscription for higher education, as well as the value of the 'research corpus' (essentially a database of the OCR'd texts that researchers can use for computational research).

Since we know that information seekers prefer accessing works online rather than in hard copy, I anticipate that the online service will be very popular. But it will contain almost exclusively these Anglo-American products, a narrow swath of the intellectual output of the planet. As it is, too many Americans are unaware of the world outside of those Anglo-American borders. This will just exacerbate that problem. It could change the content of education and research. As I've said before, availability is a significant determinant of what intellectual materials people use in their research.

Particular to Libraries

In general, the sections on libraries (both participation and use of the digital copies) remain unchanged. There are a few minor changes, some of which are puzzling.

Public Libraries

The statement about the free access for public libraries has been changed from:
in the case of each Public Library, no more than one terminal per Library Building

to
in the case of each Public Library, one terminal per Library Building; provided, however, that the Registry may authorize one or more additional terminals in any Library Building under such further conditions as it may establish, acting in its sole discretion and in furtherance of the interests of all Rightsholders.
So it leaves the options open for giving some public libraries additional (free?) access. Still, there is no information on whether or how public libraries could subscribe in a way that would allow them to fully serve their communities.

Microforms

The definition of "books" that could be digitized originally included microforms. The word "not" has been added:
hard copy (not including microform)
No idea why, but perhaps a look at the comments will reveal one from UMI or some other party related to microforms.

[Found it: The ProQuest letter states that dissertations should NOT be included as they are controlled through ProQuest's dissertation service. The letter mentions that some dissertations are in microform format, but that today many are available as print-on-demand or online. Although microforms were excluded, p. 327 of the redline document states:
"
What Material Is Covered?
"Books” include in-copyright written works, such as novels, textbooks, dissertations, and other writings...".
So ProQuest did not get what it asked for.]


OCLC Networks

The original settlement had a strange exception that removed OCLC networks from the definition of "consortium":
"Institutional Consortium” means a group of libraries, companies, institutions or other entities located within the United States that is a member of the International Coalition of Library Consortia with the exception of Online Computer Library Center (OCLC) - affiliated networks.
That exception has been removed. I would love to know why it was there in the first place, but can only assume that it came about because of OCLC's participation in the settlement discussions.

[Note: I discovered that Lyrasis and Nylink filed an objection about this exception, which may be why it was removed. Their analysis was that it had come from OCLC and gave OCLC the ability to manage competition by determining which organizations would be excluded from participating in the business of brokering services for libraries. They assume that OCLC hopes to be in that business itself.]

Download Formats & Course Packs

In the original settlement, the only download format mentioned was PDF. As we know, since then Google has announced that it will provide e-books from the publisher partner content that it carries on GBS. Ebook formats have been added to the settlement as possible download formats. At the same time, the product line described as:
Custom Publishing - Per-page pricing of Books, or
portions thereof, for course materials, and other forms of custom
publishing for the educational and professional markets
has been removed.

Other?

There are complex changes to the treatment of orphan works which I have not tried (yet) to absorb. Those will undoubtedly have some impact on libraries and the public but at the moment I have no thoughts on that.

The settlement now allows rightsholders to place a Creative Commons license on their works. I really don't see a great deal of significance in this, although it does emphasize that by participating in GBS your rights are now governed by contract law rather than copyright law.

And, last, Google admits to some of its own difficulties in bibliographic control when it states that "The inclusion of a work within the Books Database does not, in and of itself, mean that the work is a Book within the meaning of Section 1.19 (Book)." In other words: we threw a whole bunch of bib records into a database; don't assume anything from it.

Monday, November 09, 2009

Googled

Waiting for the next round of Google/AAP/AG settlement prose (which was due today, November 9, but has been moved back to Friday, November 13, when the parties will presumably present it to the judge), I have read Ken Auletta's book "Googled: the end of the world as we know it." It's mainly a business book, and primarily about media and advertising. I can sum up what it says about Google in three statements:
  1. Engineering can fix anything
  2. Information is neutral and measurable
  3. Advertising is information
OK, maybe that's a bit overly concise, but that is what it boils down to. I've often wondered how your motto can be "Don't be evil" when you are in the advertising business. It obviously works if you consider information to only have meaning based on numerical measures, and that advertising is just another kind of information. This engineer-based mentality as the guiding principle of the largest, richest advertising company in the world falls somewhere between Ayn Rand's objectivism and Bernie Madoff's Ponzi scheme. About 50% of Google's employees are engineers, and engineers, on average, earn twice what non-engineers earn.

Google has ramped up the advertising game by orders of magnitude, destabilizing huge, long-lived media companies, and it's all based on... winners win. Google sees its role as matching up users with things they are seeking, whether it's web sites, books, or a place to buy sneakers. It doesn't matter to Google what the information is.

There is something creepy about the way that Auletta refers to SergeyandLarry as "the founders." It sounds almost... cult-like. The fact that the book treats the founders and CEO Eric Schmidt as a threesome is just way too trinitarian for my taste.

Friday, October 23, 2009

Objecting to GBS/AAP/AG Settlement

ALA Washington Office has posted an analysis of who filed comments/briefs to the court relating to the Google/AAP/AG settlement. Of the "class member objectors," i.e., authors and publishers, 82 US parties filed objections. Astonishingly, there were 295 objections filed by foreign "class members," including the publisher organizations in a number of countries. The objections range from the seemingly trivial (the poor quality of the translations of the notice that were provided) to concrete descriptions of how the settlement violates the rights of rights holders under the Berne Convention. I'll sum up some of these objections:

  1. The class -- members of the class were not given sufficient notice, nor were they able to read the actual settlement documents, which were not provided in translation.
  2. Moral rights -- Berne includes moral rights, that is, the right of the author to control the use of one's work. This is interpreted quite liberally in some countries, to include things like cover images used in sales, metadata, etc. While these may seem unimportant, the Italian publishers' organization AIE was horrified to find one of its newsletters listed with an author of "Fascist Federation of Publishers". This was a previous name of the organization, but one the organization now finds offensive.
  3. Registration requirements -- Berne states clearly that "... exercise of these rights shall not be subject to any formality..." It was this aspect of Berne that ended the copyright registration requirement in the US. Objectors claim that the need to register with the Books Rights Registry violates this aspect of Berne. The logic is that you are the copyright holder regardless of any action you take to assert it.
  4. Definition of "out of print" -- This is probably being revised by the main parties, but the original settlement document stated that "Google will use the publishing status, product availability and/or availability codes to determine whether or not the particular database being used considers that Book to be offered for sale new through one or more then-customary channels of trade in the United States." Various objectors were able to show that Google's determination (as available in the database managed by Google today) was wrong in a majority of cases.
  5. Definition of "in print" -- This one also might be undergoing revision. The settlement defines "in print" as "be offered for sale new." Some objectors pointed out that there are books that are free, that are online for open access, etc. The argument is that these cannot be considered out of print.
  6. Representation -- None of the foreign class members consider either the AAP or the AG to represent them. Some ask that there at least be foreign class members on the board of the Rights Registry. Others simply consider the class membership to be invalid.
  7. The burden on publishers -- The burden of identification has been placed on publishers. For a publisher with an active list of titles, this could be a considerable amount of work. Google offered that if publishers would provide ONIX metadata, they would do an automated matching against the database. Apparently this has failed to provide relief, most likely because of differences between the publishers' metadata and that of Google.
  8. The effect of secrecy -- Because Google works heavily in "trade secret mode," it is very difficult for the rights holders to find and diagnose problems relating to their works. Yet the settlement does not hold Google accountable for errors in the data.
  9. Privacy -- the EU has rather strict privacy rules. This argument is a bit contorted because at the moment there is no plan to allow EU users to access the books covered by the settlement, since the settlement is only valid in the US. But at least one objector acknowledged that users would gain access by going through US proxy servers. It isn't clear to me if one can apply local law when masquerading as someone else through a proxy.
  10. Local digitization laws -- At least one country, Germany, has made provisions for library digitization of works (and in-library access) which requires that the library obtain permission from the rights holder. This objection is a bit indirect, but it seems to be one of indignation that Google could be digitizing works that the national library of the country where the work was published cannot.
  11. Censorship -- Many are concerned that Google may eliminate books from its service "for editorial reasons" without having to justify itself. This is an interesting and difficult argument -- it's like saying you're against the service, and you're afraid it won't have everything. It makes sense, however, because if Google becomes the predominant access point to books in the US and can censor without recourse, then a single company gets a great deal of control over both information and culture. There should be more objection to this from within the US.
  12. General moral and cultural indignation -- I read about a dozen of the foreign objections. In some cases, I may have been reading into the text an undertone of moral and cultural indignation. Not in the case of Germany and France, however, who were quite clear on their objection to the monetization of their cultural heritage. Here are some quotes:
"... the proposed settlement homogenizes (or "Googlizes") and demeans those special elements that distinguish the unique cultural tradition of France by turning books into a merely industrial by-product of a computer database."

"France's concern for its authors is only heightened by the proposed settlement's shroud of secrecy and hint of an uncontrolled, autocratic concentration of power in a single corporate entity, Google, that generates more revenue than many countries."
"The Federal Republic of Germany is historically called "Das Land der Dichter und Denker" (the land of poets and thinkers). ... Germany can rightfully claim the mantle of birthplace of modern printing and publishing. ... [the settlement] will flout German laws that have been established to protect German authors and publishers... creating a new worldwide copyright regime without any input from those who will be greatly impacted -- German authors, publishers and digital libraries and German citizens who seek to obtain access to digital publications through the Google service. "

Wednesday, October 14, 2009

OCLC and "Competition"

The announcement of a new company, SkyRiver, providing cataloging services to libraries has sparked a number of comments about competing with OCLC and WorldCat. For a number of reasons, I don't think that the result of such a service is necessarily competitive, although I am very glad to see alternatives enter the marketplace, especially for those who do not use OCLC.

To begin with, OCLC is more than an online cataloging service. Admittedly, revenue from cataloging is OCLC's largest income source, so cataloging is not in any way just an incidental function from OCLC's point of view, but cataloging alone is not the point or purpose of OCLC to its users. I see OCLC as a kind of social network where the "beings" are libraries. The value of OCLC is directly related to the population it encompasses, and the social services it can provide based on that population. Shared cataloging copy is one service, but discovery and delivery options probably motivate OCLC members as much or even more than the cataloging effort. This was evident when RLG still existed, as some RLG member libraries who did their cataloging in RLIN also loaded their records into WorldCat in order to participate in the services that OCLC provided.

The value of the catalog copy on OCLC may be second to the value of the holdings information that OCLC maintains. Catalog copy, if that's all you want, can be found in innumerable library catalogs (including the Library of Congress), and some library systems allow you to export or retrieve a full MARC record that you can then add to your own catalog. Catalog copy can also often be found on the verso of the title page in the form of Cataloging in Publication (CIP), although not in MARC format and not as a complete record. But no one else, and no other service, has the combined holdings of some 60,000 libraries, and that's the main thing that OCLC brings to the table. It is only because of these holdings that WorldCat has value to individual searchers and to the libraries who serve them.

The view of OCLC as "the only game in town" for library cataloging ignores the fact that there are libraries who do not participate in OCLC, for a variety of reasons, but who still need to create bibliographic records. These libraries may not be able to afford OCLC's prices for cataloging services, or they may simply not wish to be bound by the standards of that society of libraries. Some libraries, in particular those in corporate settings, are not able to share their holdings publicly, and therefore are not able to participate in the social life of libraries that WorldCat represents.

There are also non-library providers of library catalog records, in particular the vendors who include catalog data with the products they sell to libraries. These vendors need a source of cataloging copy that is unrelated to particular holdings information.

If we can think further down the line, a database of bibliographic records, like that in SkyRiver or biblios.net, could become a resource for anyone who needs to work with bibliographic data. This could include anyone on a research project who wants to provide a quality bibliography with a minimum of effort. Although the bibliography will follow citation standards, the basic data is the same as that found in library records.

Another advantage that these and other bibliographic services may provide to us all in the library profession is that they could be a source of data for experimentation. What with RDA looming on the horizon and much talk about updating our data format from MARC to something else, we'll need data to work with. OCLC has historically been slow to change its data, and not without reason: OCLC is integrated into the workflows of tens of thousands of libraries that depend on it for everyday functionality. Although the OCLC research division comes up with innovative ideas, the OCLC core functionality is essentially the same as it was two or three decades ago. If we want to experiment with radical change, I for one expect it to come from the sidelines, not the center.

Sunday, September 20, 2009

DOJ drops bomb in Google/AAP settlement

On Friday, September 18, 2009 the Department of Justice delivered its long-awaited Statement of Interest in the proposed settlement between Google and the AAP/AG in the class action suit surrounding the Google Book Search product. The DOJ has some very specific requirements for modification of the settlement, some of which could result in significant changes in the nature of the agreement. The headline, however, is:
that "the court should reject the settlement in its current form," and reconsider after changes are made.

Beyond that, my summary is this:

1) the DOJ does not like that the settlement allows uses of orphan works that go beyond those allowed by copyright law, and especially that others will be profiting from those uses

2) the DOJ considers the settlement to be anti-competitive, and

3) between the lines, it appears that the DOJ can't decide between supporting the full access to scanned books for the good of mankind, and wanting the settlement to limit itself to the original scope of Google's project, which was to digitize for indexing only.

And I should add:

4) nothing here has a direct effect on libraries or the Google library partners, except, perhaps, in that it changes the product that Google will provide as its subscription service, and

5) that the DOJ letter clearly states that Google and the AAP/AG are already in the process of making changes to the settlement to respond to the DOJ's concerns.

The Concerns

The Class


The first has to do with the definition of the class of rights holders who are party to the class action suit. DOJ concludes that the settlement does not satisfy the rules for defining a class as set out in Rule 23, the rule that governs class action suits.

In this area, DOJ is mainly concerned with the potential rights holders of orphan works. It isn't easy to understand what solutions DOJ sees for finding the rights holders for these works, but the Department is uneasy that known rights holders will be the ones negotiating with the rights registry, and that they will also benefit from any money made on orphan works. In other words, it will be to the advantage of rights holders that the parents of those orphans NOT be found. DOJ suggests, among other things, that the money made on orphan works not be paid out to others, but be used to try to find rights holders.

It also suggests that not enough work was done to notify all potential members of the class, in particular foreign authors.

The Potential Uses, and Orphan and Out-of-Print Works

DOJ appears to be nervous about the open-endedness of the future uses that Google can make of both orphan and out-of-print works. To remedy this, it is suggested that out-of-print works (including orphans) be treated the same as in-print works, that is, that rights holders must opt-in to any uses that Google intends to make of the works. To me this makes sense from a legal point of view, since copyright does not distinguish between in- and out-of-print status. It makes less sense from a market point of view, because presumably there is less active interest in the out-of-print works on the part of the rights holder. However, we really do not know what in- and out-of-print mean in a predominantly digital environment, and it may be a mistake to be making decisions based on the analog market, as the settlement does.

There are some parts of the DOJ document that suggest what could be radical solutions, yet they appear almost as asides, such as when suggesting that out-of-print works should be subject to opt-in, they say:
"Such a revision would, of course, not give Google immediate authorization to use all out-of-print works beyond the digitization and scanning which is the foundation of the plaintiffs' Complaint in this matter." p. 14
This seems to indicate that DOJ would be more comfortable with a settlement that essentially authorized the current scope of the Google Book Search product, which was the basis for Google's claim of Fair Use: search and snippet display.

In another section, they voice concern over the fact that some rights holders will be earning money on the unclaimed works of others. They say:
"The risk of such improper leveraging might also be reduced by narrowing the scope of the license. A settlement that simply authorized Google to engage in scanning and snippet displays in the future would limit the profits that others could potentially derive from out-of-print works whose owners fail to learn of their right to claim those profits." p. 15
In fact, this would greatly limit the profit that Google could earn (and with it the rights holders' share), since the main source of expected profit for Google seems to be from the licensing of full views of the books (to libraries and other institutions) and the "sale" of books to individuals. If this is really what the DOJ means, then it is essentially suggesting that Google have no more use of orphaned works than it has today. With that limitation, it seems that Google might as well go forward with its Fair Use defense, if it would want to continue scanning books at all.

Competition


DOJ is concerned that the settlement doesn't allow for sufficient competition. It isn't clear to me, however, how that competition might be achieved. First the document states that the Registry does not have the power to give access to works to entities other than Google, since copyright law doesn't allow it. Then it says that the best solution is to make sure that other companies get equal access. To show that I'm not making this up (although I may be misinterpreting):
"The Proposed Settlement does not forbid the Registry from licensing these works to others. But the Registry can only act "to the extent permitted by law." S.A. 6.2(b). And the parties have represented to the United States that they believe the Registry would lack the power and ability to license copyrighted books without the consent of the copyright owner -- which consent cannot be obtained from the owners of orphan works." p. 23
"This risk of market foreclosure would be substantially ameliorated if the Proposed Settlement could be amended to provide some mechanism by which Google's competitors could gain comparable access to orphan works...." p. 25
As far as antitrust goes, the document states that although there are concerns about antitrust, the full analysis has not been completed. There are suggestions, however, that the main concerns have to do with the Book Rights Registry and the setting of prices for all works (instead of relying on competition to determine prices).

-------------

All in all, it seems to me that the DOJ has pointed out some of the same problems indicated by others, but unfortunately hasn't really given a clear direction for the settlement to take. What we do know is that we'll see a new version of the settlement sometime in the future... many more pages of dense text to ponder.

Monday, September 14, 2009

Google Books Metadata and Library Functions

In a recent post in the NGC4LIB list, we got a very welcome answer from Chip Nilges of OCLC about Google's use of WorldCat records:
To answer Karen's most recent post, Google can use any WC metadata field. And it's important to note as well that our agreement with Google is not exclusive. We're happy to work with others in the same way. The goal, as I said in my original post, is to support the efforts of our members to bring their collections online, make them discoverable, and drive traffic to library services.

Regards,

Chip

As we have seen from recent postings about the metadata being presented in the Google Books Search service, there are some problems. Although Google claims to have taken the metadata from its library partners, we can look at records in GBS and the record for that item in the library partner database and see how very different they are. It is clear that Google has not retained all of the fields that libraries have provided, and has made some very odd choices about what to keep. Perhaps what we need to do, to help Google improve the metadata, is to make clear what data elements we anticipate we will need in order to integrate the Google product with library services.

When you ask people what metadata is needed for a service, they will often reply something like "everything" or "more is better." I'm going to take a different approach here because I think it is a good idea to connect metadata needs with actual functionality. This not only justifies the metadata, but the functionality helps explain the nature of the metadata that is required. For example, if we say that we want "date of publication" in our metadata, it may seem that we could use the date from the publication statement, which can have dates like "c1956" or "[1924]." If, instead, we indicate that we want to use dates in computational research, then it is clear (hopefully) that we need the fixed field date (from the 008 field in the MARC record).
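To make that concrete, here's a minimal sketch of the difference. The field layout is standard MARC for books: character positions 07-10 of the 008 fixed field hold the normalized Date 1, while the 260 $c is whatever was transcribed from the piece:

```python
# A minimal sketch of why the fixed-field date matters for computation.
# The 260 $c is transcribed ("c1956", "[1924]") and needs cleanup first;
# the 008 carries Date 1 at positions 07-10, already normalized.
import re

def date_from_260c(subfield_c):
    """Best-effort year extraction from a transcribed publication statement."""
    match = re.search(r"\d{4}", subfield_c)
    return int(match.group()) if match else None

def date_from_008(field_008):
    """Date 1 occupies character positions 07-10 of the 008 fixed field."""
    date1 = field_008[7:11]
    return int(date1) if date1.isdigit() else None

print(date_from_260c("[1924]"))                     # 1924, after regex cleanup
print(date_from_008("850423s1924    nyu      eng "))  # 1924, read directly
```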

So here are the functions that come to my mind, and I welcome additions. (Do remember that at this point we are only talking about books, so many fields relating to other formats will not be included.) I'll add the related MARC fields as I get a chance.

Function: Scholarship
Need: A thorough description of the edition in question. This will include authors, titles, physical description, and series information.


Function: Metasearch
Need: To be able to combine searches with the same data elements in library catalogs. Generally this means "headings," from the bibliographic record (authors, titles, subject headings).


Function: Collection development
Need: To use GBS to fill in gaps (or make comparisons) in a library's holdings, usually using classification numbers.


Function: Linking to other bibliographic collections or databases
Need: Identifiers and headings that may be found in other collections that would allow linking.

Function: Computation
Need: Data elements that can mark a text in time and space (date and place of publication), as well as those that can help segment the file, like language. This function also may need to rely on combining editions into groupings of Works, since this research may need to distinguish Works from Manifestations. Computation will most likely use metadata as a controlled vocabulary, and the full text of the work as the "meat" of the research.
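As a rough illustration of that last function, here is a sketch of grouping Manifestation-level records into Work sets. The author/title key is my own simplification; real Work clustering would need far more robust matching:

```python
# A minimal sketch of grouping Manifestations into Work sets for
# computational research. The normalized author + uniform-title key is an
# assumption made for illustration, not a proven clustering method.
from collections import defaultdict

def work_key(record):
    """Normalize author + uniform title into a rough Work key."""
    author = record.get("author", "").lower().strip()
    title = record.get("uniform_title", record.get("title", "")).lower().strip()
    return (author, title)

def group_into_works(records):
    works = defaultdict(list)
    for rec in records:
        works[work_key(rec)].append(rec)
    return works

records = [
    {"author": "Melville, Herman", "title": "Moby Dick", "date": 1851},
    {"author": "Melville, Herman", "title": "Moby Dick, or, The Whale",
     "uniform_title": "Moby Dick", "date": 1925},
]
print(len(group_into_works(records)))  # 1 Work, holding 2 Manifestations
```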

Tuesday, September 08, 2009

GBS, according to Amazon

When I first read the settlement agreement between Google, the AAP and the Author's Guild, I immediately thought: "Wow. Jeff Bezos must be freaking out!" Because it is obvious that the settlement, as written, sets up a bookselling operation of unprecedented proportions. It also does so in a way that makes it hard if not impossible for any other company to compete in certain areas, particularly in relation to works that are out of print but not out of copyright.

Amazon has responded to the proposed settlement with a document for the court. (The document for Amazon was authored by David Nimmer, known for "Nimmer on Copyright", the primary text on the topic of US copyright -- and which sells for over $2,000. When it comes to "big guns" it's hard to get any bigger.) The document makes four major points relating to the settlement. I will paraphrase them here, but if you have an interest in what Amazon has to say you must read the document yourself, because my analysis undoubtedly reflects my non-expert reading of it.

  1. The settlement should be rejected because it makes changes to copyright law that should be decided by Congress, not a lawsuit.

  2. The settlement should be rejected because the Book Rights Registry that it creates is a cartel of rights holders, and violates anti-trust law.

  3. The settlement must be rejected because its expropriation of orphan works violates the copyright act.

  4. The settlement must be rejected because it would release Google from liability of future actions.

All of these seem like good arguments to me, but I am especially taken by the fourth one. The Amazon document explains in some detail that class action here is being used to allow future actions that are not part of the complaint.
"A class action settlement can only extinguish claims that arise from the same factual predicate as the class claims.... Future claims for future conduct cannot be released by a settlement agreement because they are not part of the same factual predicate as the purported claims." p. 35
What this says, in my interpretation, is that Google is being taken to court by the AAP and AG because it has, in the past, scanned and OCR'd books that are in copyright without asking permission of the rights holders. Yet, the settlement addresses actions that Google has not yet taken, such as the sale of institutional subscriptions, consumer sales of access to books, and a variety of possible revenue models such as print on demand. This is not redress for violation of rights but a kind of blanket agreement that gives Google rights over the materials for future developments.
"The sale of books or subscriptions to a database of scanned works is conduct in which Google has not yet engaged and, because of criminal sanctions, likely would never engage without a clear license to do so." p. 39
Nimmer's analysis seems to be that this is not appropriate in a lawsuit, and especially one in which members of the class are giving up future rights that cannot even be enumerated. The hypothetical example reads:
"... let us imagine that Google has already scanned Lonesome Dove and included it in the Google Books Program, that Technology X is invented in 2016, and that Google decides in 2020 to inaugurate widescale expoitation of books via that new technology including Lonesome Dove. To the extent that author Larry McMurtry objects to that exploitation in 2021 (in the same way that previous litigation contested the scope of his grant of books rights to his publisher in Lonesome Dove at the dawn of the age of audio books), a dispute may develop between author and publisher. The Settlement Agreement goes out of its way to immunize Google from any liability for copyright infringement under those circumstances." p. 39 footnote 29
I can neither confirm nor dispute this analysis, but there is something very frightening about giving up (or assigning, depending on how you see it) rights for an indefinite future when we have no idea what that future will bring. The Amazon comments have interpreted the settlement as having overly expansive concessions to Google that could have unintended consequences in the future.

Monday, September 07, 2009

GBS and Bad Metadata

Ever since Geoffrey Nunberg got up at the Google Books Colloquium at Berkeley on August 28, 2009, and showed the audience how bad the Google Books metadata is (Google's Book Search: A Disaster for Scholars, article in The Chronicle of Higher Education, Google Books: The Metadata Mess, the slide presentation from the Conference at UC Berkeley), some parts of the academic world have been buzzing about the topic.

Google representatives claim that their data comes from libraries and from other sources, but it is easy to show that Google is not including the library's bibliographic record in GBS. It might just be seen as a short-sighted decision on their part not to keep all of the data from the MARC records supplied by the libraries. After all, which of these do you think makes the most sense to the casual reader:
12 pages
12 p. 27 cm.
However, there is some evidence that Google is missing parts of the library bibliographic record. Here are some examples of subjects from GBS and the records from the very libraries that supplied the works:

GBS:
Indians of North America
Indian baskets

Library:
Indians of North America -- Languages.
Indians of North America -- California
Indian baskets -- North America

This is the same pattern that appeared in the records released by the University of Michigan for their public domain scanned books -- only the $a of the 6XX field was included. (I wrote about this: http://kcoyle.blogspot.com/2008/05/amputation.html). Many other fields are also excluded from those Michigan records, and one has to wonder if the same was true of the records received/used by Google.
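The "amputation" is trivial to reproduce in code. Here is a sketch (with made-up subfield data) of what keeping only $a does to a subdivided heading:

```python
# A sketch of the "amputation": MARC 6XX subject fields carry subdivisions
# in subfields ($x general, $z geographic, etc.); keeping only $a collapses
# distinct headings into one.
def amputate(subfields):
    """Keep only the $a portion of a subject field, as GBS appears to do."""
    return [value for code, value in subfields if code == "a"]

full = [("a", "Indians of North America"), ("z", "California")]
print(" -- ".join(v for _, v in full))  # Indians of North America -- California
print(amputate(full))                   # ['Indians of North America']
```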

I know that it is possible to retrieve the full library records for the books because the Open Library is using this technique to retrieve bibliographic data for the public domain books scanned by Google. Google is obviously capable of doing this, yet chooses not to.

This leaves us with a bit of a mystery, although I think I know the answer. The mystery is: why would Google only use limited metadata from the participating libraries? And why won't they answer the question that I asked at the Conference: "Do you have a contract with OCLC? And does it restrict what data you can use?" Because if the answer is "yes and yes" then we only have ourselves (as in "libraries") to blame. And Nunberg and his colleagues should be furious at us.

Sunday, August 23, 2009

Googlebooks: Innovation and the Future of the Book

There's a standard joke about a restaurant-goer who complains afterward: "The food was terrible... and there was so little of it!"

I'm reminded of this while reading the letter by the University of California faculty to the judge in the Google/AAP settlement case. First they argue that the class represented by the Author's Guild does not include academics, who are major, if not the major, producers of authored texts. Then they state their three primary concerns:
  1. to maximize access, prices should be reasonable
  2. there must be provision for open access choices for authors who want to maximize access to their works
  3. user privacy must be guaranteed
I find it unfortunate that the faculty chose to lead with the question of price... it sounds a bit like "This settlement is seriously flawed... and we might not be able to afford it!" The other two concerns suggest important modifications to the settlement as currently written.

All three of these concerns are premised on the acceptance of the settlement. There is another, perhaps more serious concern that isn't here, although it may not have been possible for this group because it could be incompatible with basic premises of the settlement. That concern is the question of INNOVATION.

Innovation

We may be at a crossroads in the evolution of the book, one that could change forever how one goes about the acts of scholarship and knowledge creation. What happens when some portion of our previously analog texts is available digitally? What changes take place to the nature of research? What are the potential unintended consequences?

We don't know the answers to these questions, in part because they are about the future. It is probably safe to assume, however, that the future of the book is not a linear progression from where we are today, but that it could go in a number of different directions, ending up somewhere that we can't even imagine at this moment in time. To get there, we must experiment, we must innovate. There will be trial and error. There may not be a single "killer app." Above all, the change will make use of technology but it will essentially be a cultural change. Perhaps a massive cultural change.

Some commentators have said that the production of texts digitally makes books irrelevant, that the stable, book-length text will cease to exist as we know it today. Instead, we may be returning to our medieval roots, before texts were fixed by the repetitive nature of the printing process. [1] Others see digitization of previously analog books as a way to reassert the impact of the last 500 years of thought on scholarship. It is easy to imagine the discovery of long-hidden gems from the stacks of university libraries, just as it is easy to imagine being overwhelmed by marginal and irrelevant retrievals. The main thing is that we won't know until it happens.

We can have some fun speculating on the types of things that we may be able to do with these previously analog texts in the digital future: integrating these texts with those that were "born digital"; creating hyperlinks between them (one's own personal Memex); recombining texts into new text; annotating and commenting in a public or semi-public sphere; mashing up text with sound and video and data. On a larger scale, we face the possibility of global topic maps that will show us previously undiscovered connections.

Which of these possibilities will be available to us, however, is up to Google, because under this settlement only Google has the right to innovate across the body of digitized texts. The rest of us, including the faculty of the universities whose libraries are providing the books, are merely consumers. We can buy the product, or not buy the product, but the raw materials, the digital copies of the library texts, belong to Google. The monetizing of the texts is the job of the Book Rights Registry (BRR) that will be formed, which will represent the rights holders. Google's job is to provide the product that will make monetization possible. Both of these entities, Google and the BRR, are focused on the price issue as their main concern. In that, they have much in common with the University of California faculty.

This is a perfect example of how the asking of a question shapes reality. The question surrounding the settlement is: are authors (as defined by the Author's Guild) served by the Google/AAP settlement -- yes or no? The bigger question -- what is the future of the book in our civilization? -- is not on the table. Yet, in the end, that may be the question that is answered by this settlement, whether that outcome serves authors or not.

[1] Recommended reading: The Future of the Book (Berkeley: University of California Press, 1996).

[Note: I am aware that there are serious issues of copyright law in all of this; that it's not just a question of technology. Whether or not this settlement helps or hinders the evolution of copyright law is a discussion better left to those with a legal background. But it is an active part of the discussion around the settlement.]

Academic publishing as a percentage of Google Books

A group representing University of California faculty has expressed its concerns about the Google/AAP settlement in a letter to the presiding judge. In that letter the group states one of its concerns as:
"Specifically, we are concerned that the Authors Guild negotiators likely prioritized maximizing profits over maximizing public access to knowledge, while academic authors would have reversed those priorities."
The next sentence says:
"We note that the scholarly books written by academic authors constitute a much more substantial part of the Book Search corpus than the Authors Guild members’ books."
I was disappointed that they didn't include any data to support the "substantial part" statement, and think that their letter would have been stronger if they had. (I am presuming that they meant "substantial" in a quantitative way, rather than qualitative. The latter would be hard to support.)

Edward Betts of the Open Library did an experiment in identifying publishers in the OL data. Because of the way publisher names are recorded in bibliographic data, he used ISBN publisher prefixes, where available, to bring together different forms of the name. He posted his results on the OL blog. The post links to his files. His data shows counts for each (presumed) individual publisher.
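For illustration, here's a minimal sketch of the prefix technique. This is my own simplification, not Edward's actual code; real matching consults the ISBN range tables, because publisher prefixes vary in length:

```python
# A sketch of grouping publisher name variants by ISBN prefix. The
# fixed-length prefix is a stand-in: real prefixes vary in length and
# require the official ISBN range tables to parse correctly.
from collections import defaultdict

def publisher_prefix(isbn10, length=3):
    """Crude stand-in: group + publisher digits from the front of an ISBN-10."""
    return isbn10.replace("-", "")[:length]

def group_by_publisher(records):
    groups = defaultdict(set)
    for isbn, publisher_name in records:
        groups[publisher_prefix(isbn)].add(publisher_name)
    return groups

records = [
    ("0-19-283376-2", "Oxford University Press"),
    ("0-19-860441-6", "Oxford U.P."),  # variant name, same "019" prefix
]
print(group_by_publisher(records))  # both name forms fall under one prefix
```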

I mentioned in a comment to Edward's blog post that it was interesting to me that a university press (Oxford UP) turned up in the #1 spot as the publisher with the greatest number of books in the OL. As a matter of fact, out of the top 20 publishers, five are university presses (UPs), and they make up over 1/4 of the books in that group. (Download a tab-delimited, ranked version of the data, but be sure to look at Edward's detailed data to understand what makes up each publisher entry.)
# of book records published by the top 20 publishers: 1,935,327
# of books in the top 20 by University Presses: 577,323
Out of the top 100, the UPs make up a little less than 25% of the file. I'm only including those presses with "University" in their names, meaning that the figure doesn't include Academic Press, Elsevier, Scholastic, etc., which primarily publish the output of academic writers.

This study of OL publisher data was just experimental, so these figures should be taken with a grain of salt. However, this shows that there is an interesting study to be done, if it can be done, quantifying the relative roles of academic and commercial publishing. Given that Google is digitizing books in university libraries, the tendency toward academic publications should be quite strong. (Note that OL has taken its records from the Library of Congress, online book sales, and some libraries, and probably is less heavy on academic presses than Google Books will be.)

The UC faculty's concern that the interests of academic writers are not well-served by the Author's Guild is compelling to me. I hope the judge takes it seriously.

Thursday, August 20, 2009

Greetings from Undefined!


My kinda place, Undefined. (From iGoogle page.)

Sunday, August 16, 2009

What is a (FRBR) Work?

"What is a Work?" is an oft-discussed question. Answers tend to come down on one side or another of what is essentially a philosophical reaction to the inherent abstractness of the nature of the FRBR Work.

I was poking around in the Futurelib wiki (much neglected of late, but I have recently gathered there all of my posts on Martha Yee's article on library cataloging and RDF), and came across an interesting comment by Kristin Antelman from over 2 years ago:

I realize this is a point of absolutely no controversy in the FRBR community, but I have never been happy with the title attribute in association with the abstract entities, work and expression. It seems contrary to the spirit of an abstract entity, not to mention creating practical problems (e.g., for serials). There obviously could be many titles associated [with] a work in its manifestations. Libraries may want to select one over another for the "work," or display, title. Works and expressions only need identifier attributes: for the work, author and subject.
RDA recognizes both a Work title and a Manifestation title (called title proper). I've been known to argue that these are distinct data elements: the Manifestation title is transcribed from the piece and is part of the surrogate for the manifestation; the Work title (which we used to call "Uniform title") acts as a unifier for all of the Expressions and Manifestations of the Work. Kristin's comment got me thinking about this again, and I agree with her: there is no title for the Work. The Uniform title served as an identifier for the Work in the days when we used things like titles to identify things. But FRBR and RDA recognize that entities will have identifiers that are separate from the display forms of names and titles that we have used in the past.

This "no title" solution actually helps out with one of the sticky problems in creating Work displays, particularly in a multi-lingual catalog. When you follow the concept of uniform titles, the Work title should be the title of the original. This means that we would be showing our users Война и мир as the title for the Work that most of them will know as War and Peace. We could show them the English language title, but what if your catalog users are global? What if some of them will only understand the title if you display it in French or Turkish or Chinese? If a Work has an identifier (which is only useful for machine processing, not for display to humans), then you can let users choose what language they prefer in Work displays. (Obviously having some default for the case where the user's preferred language isn't available.)

So I like the idea of not assigning a title to the work, but I must admit that I'm increasingly seeing the Work not as a thing but as a set; a set made up of things that claim to be Manifestations of the work. Each resource that claims to be a Manifestation of that Work (using the Work identifier) is then part of the Work set, and it is the set that defines the Work.
The Work set is not fixed - new items can add themselves to the set at any time. Thus, the Work is defined from the bottom up, from the contents of the set. The members of the set have titles and have subjects, and that means that the Work also has those titles and subjects.
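As a sketch (names and structure invented for illustration), the Work might be nothing more than an identifier plus an open set of members:

```python
# A sketch of the Work-as-set idea: the Work is nothing but the (open) set
# of resources claiming to be its Manifestations; its titles and subjects
# are derived bottom-up from the members.
from collections import namedtuple

Manifestation = namedtuple("Manifestation", "title subjects")

class Work:
    def __init__(self, work_id):
        self.id = work_id      # opaque identifier; the Work has no title
        self.members = set()

    def claim(self, manifestation):
        """A new resource adds itself to the set at any time."""
        self.members.add(manifestation)

    def titles(self):
        return {m.title for m in self.members}

    def subjects(self):
        return {s for m in self.members for s in m.subjects}

w = Work("work:moby-dick")  # hypothetical identifier
w.claim(Manifestation("Moby Dick", ("Whaling -- Fiction",)))
w.claim(Manifestation("Moby Dick, or, The Whale", ("Whaling -- Fiction",)))
print(w.titles())  # both titles; none of them *is* the Work
```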

This "solution" requires us still to make decisions about what we display to represent the Work. Do we show subject headings as related to the Work? What about reviews and excerpts? If you want to add a cover to the display, how do you select from among the various covers you may have?

A great advantage of this solution is that you can make different display decisions at different times, or in different contexts. A public library can point users directly to the shelf location from the Work display; a rare book archive can include key information about available editions; a social networking site can list the users who own versions of the Work. The concept of Work becomes somewhat fluid and malleable, which in my mind is closer to reality than a fixed thing that has only certain attributes.

Tuesday, August 11, 2009

FRSAD

I was reminded by Jenn Riley's post on FRSAD that I hadn't yet read the document. Jenn had some interesting concerns about the model, and now that I have read it, so do I.

The main thing that bothers me is that FRSAD's view of authority data appears to be that it names things, and by that I mean that it names things for the human reader. The introduction to FRSAD says:
The purpose of authority control is to ensure consistency in representing a value -- a name of a person, a place name, or a subject term -- in the elements used as access points in information retrieval.
The example given is that of World War II, which can be called by many different names in publications, but is brought together under a single heading in LCSH.

I think that the goal of authority control is to come up with a single representation for a concept or a thing. The nature of that representation is very important, however. If you choose the preferred display form as the representation of the entity, your metadata has a fatal flaw: any change to the display form creates a different entity. A display form simply is not a viable persistent identifier. Using the display form also makes it much more difficult to share your data across languages and across contexts. "World War II" and "Seconde Guerre mondiale" are the same thing conceptually, but if only the names are used to identify the topic those two terms are far apart. It would be simple to bring them together, however, if the topic had a true identifier, one that is independent of the preferred display form.

I am a bit perplexed that no one on the FRSAD committee was able to introduce the concept of identifier into the project. It seems to be such an obvious answer. Each topical entity must have an identifier. That identifier remains the same regardless of decisions about display. The determination of a single display may still be required for certain user functions, but the big plus is that you can decide to display the authorized form in English or Spanish, for adults or children, in a transliterated form or vernacular, without changing the identity of your entity.
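SKOS already models topics exactly this way. Here is a minimal sketch using Python's rdflib; the URI is invented, and the choice of SKOS is mine, not FRSAD's:

```python
# A sketch of the identifier-first approach using SKOS. The concept's
# identity (a URI, invented here) never changes; display labels hang off
# it per language.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import SKOS

g = Graph()
topic = URIRef("http://example.org/topic/world-war-ii")  # hypothetical URI

g.add((topic, SKOS.prefLabel, Literal("World War II", lang="en")))
g.add((topic, SKOS.prefLabel, Literal("Seconde Guerre mondiale", lang="fr")))

# Pick a display form without ever touching the identifier.
for label in g.objects(topic, SKOS.prefLabel):
    if label.language == "fr":
        print(label)  # Seconde Guerre mondiale
```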

Without an identifier, there is no way to represent an entity as metadata. The Work and the Thema (FRSAD's word for subject) have no existence in metadata without a machine-readable identity that allows them to have being. This is a basic rule of the Semantic Web, but it has always been a fact of metadata usage in machine-readable form. Those of us in libraries have struggled to create systems and programs that attempt to control identities with user display forms, and it is both a frustrating and flawed approach. We need to move FRSAD from a model where the display form stands for the entity to one where the display forms are flexible and aren't involved with identifying what our metadata is about. Display forms are for humans; identifiers are for machines. Identifiers are also language neutral and can facilitate sharing across languages and communities. It's really that simple.

Ebook sales through the roof

Reported ebook sales for the second quarter of 2009 are more than 3 times what they were for the same period in 2008. The Kindle may turn out to have been the killer app. So far, however, I haven't found good figures on ebook sales by vendor or ebook format. If you run into that data, please pass it along.

Monday, July 20, 2009

Yee: Questions 12-13

[This entire thread has been combined on the futurelib wiki. Please add your comments and ideas there.]

Question 12
How do we document record display decisions?
In a sense we've covered this in the answers to other questions, but to reiterate, in addition to the data elements (called "properties" in RDF) you will need to develop one or more record formats. The record formats will provide the application with the information that is needed to produce the desired displays. I'm assuming that most display rules will be implemented in the application, but I'd be interested in hearing other ideas.

Question 13
Can all bibliographic data be reduced to either a class or a property with a finite list of values?

The answer to this is "no," but I think there's a misunderstanding here about the RDF "class." Table 2 on page 64 of Yee's article equates the RDFS "Class" with RDF "Subject" and I don't think that this is correct. As I understand Class in RDF, it has a function somewhat like abstract classes in object-oriented programming: it essentially is the umbrella for a group of like things, but itself never has an actual value. Think of classes as the upper levels of a hierarchy where only the bottom elements actually are filled in with real "stuff." In the DCMI Metadata Terms there is a class called Agent. Particular properties like rightsHolder or creator are members of the Agent class. Agent itself isn't a property, it's an organizing feature.
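A small sketch may help. In the DCMI terms, expressed here with Python's rdflib, Agent appears only as the type of things that properties like creator point to, never as a property itself (the example URIs are invented):

```python
# A sketch of the Class/property distinction. Agent is a class: it never
# carries a value itself; properties like creator and rightsHolder have
# Agent as their range, i.e., they point at instances of it.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, FOAF, RDF

g = Graph()
book = URIRef("http://example.org/book/1")        # hypothetical URIs
melville = URIRef("http://example.org/agent/melville")

g.add((melville, RDF.type, DCTERMS.Agent))        # an instance of the class
g.add((melville, FOAF.name, Literal("Melville, Herman")))
g.add((book, DCTERMS.creator, melville))          # property whose range is Agent

# Note there is no triple of the form (x, DCTERMS.Agent, y): the class
# organizes; it is never itself used as a property.
```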

That said, and I can't claim I understand it fully, I'm still not sure if the FRBR entities work as classes. In some cases, like with Person, it seems to work, but in others, like WEMI, I'm less sure.

Back to answering the question: I think we'll have the following types of properties in our library metadata:

  1. plain strings, like the transcribed title or a note

  2. formatted strings, like dates

  3. controlled lists of values, like language lists or media type lists

Then we have one other type of data, and that is where we select a display form from what today we call an "authority record." This is often considered the same as #3 in the above list, but I think there is a significant difference because an authority record is more than a term in a list: it is a rich information resource of its own. This harks back to Yee's question #5 about using cross references in authority control. Yee asks: "how will we design our systems to take advantage of the richness of authority control?" while my question is "how can we design authority control so that systems can make use of it?"

Sunday, July 19, 2009

Yee: Questions 9-11

[This entire thread has been combined on the futurelib wiki. Please add your comments and ideas there.]

Question 9
How do we express the arrangement of elements that have a definite order?
Creating order, whether in display or in other system functions, is the job of the application. However, the information that the application needs to create that order must be present in the data. Simple orders, like alphabetical or numeric, are easy to do. Yee, however, gives an example of a non-simple ordering problem:
Could one define a property such as natural language order of forename, surname, middle name, patronymic, matronymic, and/or clan name of a person given that the ideal order of these elements might vary from one person to another?
Well, yes. If you have defined separately all of the elements that you need to take into account in your ordering, then your application can have rules for the order that it uses. If you wish to use more than one set of rules, then you also would have an element for the rule set to be applied.

Many of the problems that we have in today's data are due to the fact that we present the data as a string with the elements in a single sort order, but we don't fully explain what those elements are. Is "Anna Marie" a single forename that happens to be two words, or is her name "Anna" with a middle name "Marie"?
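
Here's a small sketch, with invented element names, of what it means to store the name parts separately so that rules in the application, not the data itself, decide the order:

  # Name elements coded separately; ordering is an application rule.
  name = {
      "forename": "Anna Marie",  # explicitly one two-word forename
      "surname": "Schmidt",
      "patronymic": None,
  }

  def natural_order(n):
      # One possible rule set; another community could swap in its own.
      parts = [n["forename"], n["patronymic"], n["surname"]]
      return " ".join(p for p in parts if p)

  def sort_key(n):
      return (n["surname"], n["forename"])

  print(natural_order(name))  # Anna Marie Schmidt
  print(sort_key(name))       # ('Schmidt', 'Anna Marie')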

At some point, however, we have to ask ourselves the question: is it worth it to code our data in such great detail? What do we gain from a particular capability of ordering, and what is its value to the users of our catalogs? Is there an easier way to help the users find what they are looking for? Detailed coding of data is expensive, and the cost of precise ordering may be more than the value we obtain from it.

Question 10

How do we link related data elements in such a way that effective indexing and displays are possible?
Yee wants to know how you can say "two oboes and three guitars" in a way that you don't retrieve this item when you search on "two guitars." Again, this isn't directly related to RDF but to the metadata record format you create. When your data is represented just as a character string the only way to prevent the false drop is with a phrase search. That has limitations (e.g. if you search on the phrase "three guitars and two oboes" you won't retrieve the record). With your data coded for machine processing, conceptually as
[instrument = guitar, number = 3]
[instrument = oboe, number = 2]

you can create an application that allows the user to query for the correct number of instruments.

The underlying RDF may not look anything like that example, and that's ok. The application will use the defined RDF entities and properties as it needs. RDF itself should be seen as the building blocks for a metadata record. This means that the element for "instrument" will be defined in RDF as a property that has as its value a selection from a list of terms. The application will create a structure that allows you to input all of the relevant instruments for whatever the metadata is describing, along with a number.
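
In rdflib, one way (of several) to pair each instrument with its number is a blank node per instrument/number pair. All of the URIs and property names below are invented for the example:

  from rdflib import Graph, URIRef, Literal, BNode, Namespace
  from rdflib.namespace import XSD

  EX = Namespace("http://example.org/vocab/")  # hypothetical vocabulary
  g = Graph()
  work = URIRef("http://example.org/work/quintet1")

  for instrument, number in [("guitar", 3), ("oboe", 2)]:
      part = BNode()  # keeps each instrument together with its count
      g.add((work, EX.instrumentation, part))
      g.add((part, EX.instrument, Literal(instrument)))
      g.add((part, EX.number, Literal(number, datatype=XSD.integer)))

  # "Exactly two guitars?" -- answered without any phrase matching.
  q = """SELECT ?w WHERE {
           ?w <http://example.org/vocab/instrumentation> ?p .
           ?p <http://example.org/vocab/instrument> "guitar" ;
              <http://example.org/vocab/number> 2 . }"""
  print(len(list(g.query(q))))  # 0: this work has three guitars, not two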

One sentence in this section of Yee's is puzzling, however:
The assumption seems to be that there will be no repeatable data elements.
I think this comes out of a confusion between RDF and the application that uses RDF properties. RDF itself is expressed in what are called "triples." Each triple works like a simple sentence: subject - verb - object. If you have more than one of any of those, you create another triple.
Dick and Jane wrote Fun with Dick and Jane.
becomes two triples, one that says:
Dick wrote Fun with Dick and Jane
Jane wrote Fun with Dick and Jane
This is really no different than creating a bibliographic record with one title field and two author fields. It's just a different way of organizing it under the hood. You actually can take a MARC record and reorganize it as triples.
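
In code (using Python's rdflib, with invented URIs) the two triples look like this; the "repeatable field" simply becomes a second triple:

  from rdflib import Graph, Namespace

  EX = Namespace("http://example.org/")  # hypothetical identifiers
  g = Graph()
  book = EX.FunWithDickAndJane

  g.add((EX.Dick, EX.wrote, book))  # Dick wrote Fun with Dick and Jane
  g.add((EX.Jane, EX.wrote, book))  # Jane wrote Fun with Dick and Jane

  # Ask for every subject of a "wrote" triple pointing at the book:
  print(list(g.subjects(EX.wrote, book)))  # both authors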

I think the main point here is that data creators and users may not even be aware that RDF is under the hood. Humans will not be presented with RDF triples -- those are for machines. Only the people creating the systems structures need to be aware of the RDF-ness of the metadata. (Think of this as the difference between programmers who work with fields defined as "character" or "numeric" vs. what users of the data see, such as titles and dates.)

Since RDF uses some fairly abstract concepts, a group of us are working to create design patterns for the most common situations that will be needed to define metadata elements: a simple string; an element that uses a controlled list of terms; etc. These then become the building blocks for metadata element definitions: title will be defined as a string of characters; language of the text will be a term taken from the standard list of languages. Once you have your metadata elements defined then you can begin to build applications.

Question 11

Can a property have a property in RDF?
This is a question about how you create elements like "publisher statement" that themselves contain elements like place, publisher, date of publication. This kind of structure is common in our bibliographic records today. Whether one should create similar structures in RDF is somewhat controversial. One solution is to define your place of publication, date of publication, and publisher as elements, and let the application gather them into a unit as desired. The publisher statement as an element is really just a way to collect them together for display, which could be considered to be the job of the application. If you define your data elements in some detail, there can be no ambiguity between, say, the date of publication and some other date in the same record. However, if you absolutely must gather the elements together as a unit for some reason, then RDF allows you to create something called a "blank node" for that purpose.
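
Here is a sketch of the "separate elements" solution, with invented property names: place, publisher, and date are independent properties, and the application assembles the familiar statement for display.

  from rdflib import Graph, URIRef, Literal, Namespace

  EX = Namespace("http://example.org/vocab/")  # hypothetical vocabulary
  g = Graph()
  book = URIRef("http://example.org/manifestation/m1")

  g.add((book, EX.placeOfPublication, Literal("New York")))
  g.add((book, EX.publisherName, Literal("Harper & Brothers")))
  g.add((book, EX.dateOfPublication, Literal("1851")))

  # Gathering them into a "publisher statement" is the application's job:
  statement = "{}: {}, {}".format(
      g.value(book, EX.placeOfPublication),
      g.value(book, EX.publisherName),
      g.value(book, EX.dateOfPublication),
  )
  print(statement)  # New York: Harper & Brothers, 1851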

Using RDF will require us to rethink some of our data practices. This is hard because we've worked with data that looks like a catalog record for our entire careers. It will be important for future systems that use these re-engineered data elements to present them in an easy-to-understand way to the cataloging community and to catalog users. I'm betting that you could put out an input form that looks exactly like today's MARC record but based on RDA data elements defined in RDF. That wouldn't gain us much in terms of functionality, but the internal guts of the data definitions don't dictate what catalogers or users see on the screen. What we should be looking forward to, though, is what new functionality we can have when we are able to express rich relationships between resources or between persons and resources. Replicating the "old" way of doing things would be a step backward.

Yee: Questions 6-8

[This entire thread has been combined on the futurelib wiki. Please add your comments and ideas there.]

(Martha's article is available here.)

Question 6
To recognize the fact that the subject of a book or a film could be a work, a person, a concept, an object, an event, or a place (all classes in the model), is there any reason we cannot define subject itself as a property (a relationship) rather than a class in its own right?
As I said in my earlier post, this makes perfect sense to me. In fact, we need to accept that FRBR, while it holds many important concepts, probably needs to be revised in light of the advances in thinking that have taken place since it was first developed in the late 1990's. FRBR is not RDF-compliant, and it has some vestiges of the record and database concept that guided thinking in the past.

This was one of the warnings of the report on the future of bibliographic control: that we put ourselves in a dangerous position by basing RDA on FRBR when it does not appear that FRBR has been thoroughly scrutinized, much less tested. You could say, however, that RDA is the test of FRBR, but in that case we must be prepared to do some revision based on what we learn when trying to use RDA for a FRBR view of bibliographic data.

Question 7

How do we distinguish between the corporate behavior of a jurisdiction and the subject behavior of a geographical location?
and...
To distinguish between the corporate behavior of a jurisdiction and the subject behavior of a geographical location, I have defined two different classes for place: Place as Jurisdictional Corporate Body and Place as Geographic Area.
This isn't directly related to RDF, but it's an interesting example of how one can approach the definition of metadata elements. I agree with Martha that jurisdictions and delineated areas on the planet are different entities. For data that is destined to be interpreted by humans, you can talk about, for example, "California state government" and "California rivers" without having to distinguish between political entity and geography. As we read those phrases we adjust our thinking accordingly. But for processing by machine, it is necessary to provide the information that humans derive automatically from the context or their own knowledge.

Political entities are a particularly interesting problem because 1) they are often entirely or partly coextensive with geographic entities and may commonly be called by the same name, and 2) they can have different meanings at different times. "Louisiana" used of an area in 1810 refers to something very different from the state of Louisiana that was formed in 1812. Geographic areas also have a time component, but they are much less volatile -- their changes take place in "geologic time."

When we relied entirely on humans to interpret our data, we could create data elements that depended on context and the human ability to read the data in that context. The more that we move toward machine processing of our data, and toward interaction between programs, the more we need to be precise in the definition of our data elements. We can see this somewhat in RDA, where work titles are defined differently from titles of expressions. In our MARC records, we treat these simply as titles, and assume that the people looking at our displays will make sense of them. Sense, of course, is exactly what a computer does not have, so there is an extra burden on us to be clear about our meanings.
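
A minimal sketch of Martha's two classes, declared in RDFS with invented URIs; the gain is that a machine can now tell the jurisdiction apart from the patch of ground, even though both may display as "California":

  from rdflib import Graph, Namespace
  from rdflib.namespace import RDF, RDFS

  EX = Namespace("http://example.org/vocab/")  # hypothetical vocabulary
  g = Graph()

  g.add((EX.JurisdictionalCorporateBody, RDF.type, RDFS.Class))
  g.add((EX.GeographicArea, RDF.type, RDFS.Class))

  # Two distinct entities that happen to share a name:
  g.add((EX.californiaGovt, RDF.type, EX.JurisdictionalCorporateBody))
  g.add((EX.californiaLand, RDF.type, EX.GeographicArea))
  g.add((EX.californiaGovt, EX.hasTerritory, EX.californiaLand))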

Question 8

What is the best way to model a bound-with or an issued-with relationship, or a part-whole relationship in which the whole must be located to obtain the part?
This is primarily a question about FRBR and RDA, but it is also an opportunity to think about how we might use relationships in future systems. The problem with bound-with is that of the logical entity (a book, a journal issue, a pamphlet) and the physical entity that the library holds. In today's catalog, we don't have a way to create relationships between catalog records -- "bound with" becomes a note. In FRBR "bound with" is an item-to-item relationship. Having a way to code explicit relationships between entities should make it possible to help users navigate our catalogs.
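
A sketch of bound-with as an explicit, machine-navigable item-to-item relationship rather than a note (the identifiers and the property name are invented):

  from rdflib import Graph, Namespace

  EX = Namespace("http://example.org/")  # hypothetical identifiers
  g = Graph()

  # Two logical items sharing one physical volume:
  g.add((EX.item1, EX.boundWith, EX.item2))

  # A system can navigate from either item to its volume-mates
  # instead of asking the user to interpret a note.
  print(list(g.objects(EX.item1, EX.boundWith)))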

Thursday, July 16, 2009

Yee: Questions 3-5

[This entire thread has been combined on the futurelib wiki. Please add your comments and ideas there.]

Question 3
Is RDF capable of dealing with works that are identified using their creators?
Yee goes on to say:
We need to treat author as both an entity in its own right and as a property of a work.... Is RDF capable of supporting the indexing necessary to allow a user to search using any variant of the author's name and any variant of the title of a work... etc.
I'm not entirely sure of the point of these questions, but they appear to me to be mainly about applications and system design, not RDF, which is the same answer she has gotten from others. Let me say that as I understand RDF, it is particularly suited to allowing entities like author to be used in a variety of relationships. So a person entity can be the author of one book, the illustrator of another, and the translator of yet another. But there's something here about "identifying a work using the creator" and I think that is entirely a question of how we decide to identify works, and is unrelated to the capabilities of RDF.

The identification of all of the FRBR Group 1 entities raises many interesting questions. The fact is that we do not have a real identifier for any of them, with the possible exception of the barcodes that libraries place on items. But Works, Expressions and Manifestations are lacking in true identifiers. As Jonathan Rochkind has pointed out, we use identifiers like OCLC numbers and LCCNs as pseudo-identifiers for manifestations because most of the time they work pretty well. Many systems rely heavily on ISBNs, which work reasonably well for modern published books and have the advantage of being printed on the books themselves, thus making a connection between the physical object and the metadata. Other than that, though, we're not very well set as far as identifiers go.

Yee talks about the use of the main author + title (or uniform title) as a work identifier, but even those are not a true identifier for the Work, at least not in the sense of a URI. As long as we rely on display forms we won't have an identifier that we can share with anyone whose author or title display may vary from ours (and even within the AACR2 community, there are differences in choices about names and a great gap in the actual use of uniform titles). It should be possible to create an authority-type record for name/title pairs that would include the variants from different practices, and assign a single identifier for it. But we have to stop thinking that we can create identifiers out of display forms -- that's not going to allow us to share our data outside of a tightly controlled cataloging tradition.
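
Such an authority-type resource for a work might look like the following sketch (the URI is invented and the labels are only illustrative): the differing name/title forms become labels on the work, not its identifier.

  from rdflib import Graph, URIRef, Literal
  from rdflib.namespace import SKOS

  g = Graph()
  work = URIRef("http://example.org/work/W42")  # the shareable identifier

  g.add((work, SKOS.prefLabel, Literal("Melville, Herman. Moby Dick")))
  g.add((work, SKOS.altLabel,
         Literal("Melville, Herman. Moby Dick, or, The whale")))

  # Two catalogs with different heading practices can still agree
  # that they are both talking about work W42.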

What I also read here is a frustration that our current systems do not produce a linear display that is analogous to the display in the card catalog (which is one of the goals of our cataloging practices). I'll pose my own question here, which is: can we create a system design that imitates the linear card catalog and at the same time provides us with the Catalog/Web 2.0 features that some members of our community desire? If not, how do we resolve these apparently conflicting requirements? (BTW, Beth Jefferson of Bibliocommons gave a talk at ALA in which she said that in their usability research, users invariably disliked -- or even hated -- the linear alphabetic display that so many librarians find necessary. I believe that statistics show that the browse function in current catalogs is seldom used. I suspect that most use is by library staff.)

Question 4
Do all possible inverse relationships need to be expressed explicitly, or can they be inferred?
If they are truly reciprocal, they can be inferred. It will require rules (the reciprocal of "parent of" is "child of"; the reciprocal of "is author of" is "has author"). How this is handled internally in applications is a separate matter: they may either materialize the inverse relationships in local storage or traverse them in either direction on the fly using the rules. But I see no need to create the inverse relationships in one's metadata standard.
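
A sketch of the rule-based approach, with invented property names: the inverses live in a rule table, and an application can materialize or traverse them on demand rather than storing them in the metadata.

  from rdflib import Graph, Namespace

  EX = Namespace("http://example.org/")  # hypothetical vocabulary
  INVERSE = {EX.parentOf: EX.childOf, EX.isAuthorOf: EX.hasAuthor}

  g = Graph()
  g.add((EX.Melville, EX.isAuthorOf, EX.MobyDick))

  def with_inverses(graph):
      yield from graph
      for s, p, o in graph:
          if p in INVERSE:
              yield (o, INVERSE[p], s)  # the inferred reciprocal triple

  for triple in with_inverses(g):
      print(triple)  # includes (MobyDick, hasAuthor, Melville)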

Question 5
Can RDF solve the problems we are having now because of the lack of transitivity or inheritance in the data models that underlie current ILSes, or will RDF merely perpetuate these problems?
I address this in my first post (point #3), where I talk about the inconsistencies in authority data that make it very hard to make the appropriate inferences about relationships between data elements. It is possible that we could use RDF as the basis of our data and create these same ambiguities, but I hope that we will use the opportunity of moving to a new set of rules and a new data format to restructure our data so that it has the functionality we want.

Friday, July 10, 2009

Yee: Questions 1-2

[This entire thread has been combined on the futurelib wiki. Please add your comments and ideas there.]

As I mentioned previously, I am going to try to cover each of Martha Yee's questions from her ITAL article of June, 2009. The title of the article is: "Can Bibliographic data be Put Directly onto the Semantic Web?" Here are the first two. As always, these are my answers, which may be incorrect or incomplete, so I welcome discussion both of Yee's text as well as mine. (Martha's article is available here.)

Question 1
Is there an assumption on the part of the Semantic Web developers that a given data element, such as publisher name, should be expressed as either a literal or using a URI ... but never both?
The answer to this is "no," and is explained in greater detail in my post on RDF basics.

Yee goes on, however, to state that there is value in distinguishing the following types of data:
  • Copied as is from an artifact (transcribed)

  • Supplied by a cataloger

  • Categorized by a cataloger (controlled)
She then says that
"For many data elements, therefore it will be important to be able to record both a literal (transcribed or composed form or both) and a URI (controlled form)."
This distinction between types of data is important, and is one that we haven't made successfully in our current cataloging data. The example I usually give is that of the publisher name in the publisher statement area. Unless you know library cataloging, you might assume that is a controlled name that could be linked to, for example, a Publisher entity in a data model. That's not the case. The publisher name is a sort-of transcribed element, with a lot of cataloger freedom to not record it exactly as it appears. If we want to represent a publisher entity, we need to add it to our data set. There are various possible ways to do this. One would be to declare a publisher property that has a URI that identifies the publisher, and a literal that carries the sort-of transcribed element. But remember that there are two kinds of literals in Yee's list: transcribed and cataloger supplied. So a property that can take both a URI and a literal is still not going to allow us to make that distinction.

A better way to look at this is perhaps to focus more on the meaning of the properties that you wish to use to describe your resource. The transcribed publisher, the cataloger supplied publisher, and the identifier for the corporate body that is the publisher of the resource -- are these really the same thing? You may eventually wish to display them in the same area of your display, but that does not make them semantically the same. For the sake of clarity, if you have a need to distinguish between these different meanings of "publisher", then it would be best to treat them as three separate properties (a.k.a. "data elements").
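
As a sketch, the three meanings could be three distinct properties. Only dcterms:publisher is a real Dublin Core term here; the other property names and URIs are invented for the example.

  from rdflib import Graph, URIRef, Literal, Namespace
  from rdflib.namespace import DCTERMS

  EX = Namespace("http://example.org/vocab/")  # hypothetical vocabulary
  g = Graph()
  book = URIRef("http://example.org/manifestation/m2")

  # 1. transcribed from the piece, warts and all:
  g.add((book, EX.publisherStatementTranscribed, Literal("Harper & Bros.")))
  # 2. supplied/normalized by the cataloger:
  g.add((book, EX.publisherNameSupplied, Literal("Harper & Brothers")))
  # 3. the corporate body itself, as an identifier:
  g.add((book, DCTERMS.publisher, URIRef("http://example.org/corp/harper")))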

Paying attention to the meaning of the property and the functionality that you hope to obtain with your data can go a long way toward solving some of these areas where you are dealing with what looks like a single complex data element. In library data that was meant primarily for display, making these distinctions was less important, and we have numerous instances of data elements whose values aren't exactly alike, or that were expected to perform more than one function. Look at the wide range of uniform titles, from a simple common title ("Hamlet") to the complex structured titles for music and biblical works. Or consider how the controlled main author heading functions as display, enforcement of sort order, and link to an authority record. There will be a limit to how precise data can be, but some of our traditional data elements may need a more rigorous definition to support new system functionality.

Question 2

Will the Internet ever be fast enough to assemble the equivalent of our current records from a collection of hundreds or even thousands of URIs?
I answered this in that same post, but would like to add what I think we might be doing with controlled lists in near-future systems. What we generally have today is a text document online that is updated by the relevant maintenance agency. The documents are human-readable, and updates generally require someone in the systems area of the library or vendor's support group to add new entries to the list. This is very crude considering the capabilities of today's technology.

I am assuming that in the future controlled lists will be available in a known and machine-actionable format (such as SKOS). With our lists online and in a coded form, the data could be downloaded automatically by library systems on a periodic basis (monthly, weekly, nightly -- it would depend on the type of list and needs of the community). The downloaded file could be processed into the library system without human intervention. The download could include the list term, display options, any definitions that are available, and a date on which the term becomes operational. Management of this kind of update is no different from what many systems do today to receive updated bibliographic records from LC or from other producers.

The use of SKOS or something functionally similar can give us advantages over what we have today. It could provide alternate display forms in different languages, links to cataloger documentation that could be incorporated into workstation software, and it could provide versioning and history so that it would be easier to process records created in different eras.
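
To give a flavor of this, here is a sketch of processing such a list with rdflib and SKOS. The "downloaded file" is inlined as a string; in practice it would be fetched from the maintenance agency's server on the update schedule.

  from rdflib import Graph
  from rdflib.namespace import SKOS

  downloaded = """
  @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
  <http://example.org/lang/fre> a skos:Concept ;
      skos:prefLabel "French"@en, "français"@fr .
  """

  g = Graph()
  g.parse(data=downloaded, format="turtle")

  # Load into a local table keyed by identifier, keeping every
  # language's display form -- no human intervention required.
  local_list = {}
  for concept, label in g.subject_objects(SKOS.prefLabel):
      local_list.setdefault(str(concept), {})[label.language] = str(label)

  print(local_list)
  # {'http://example.org/lang/fre': {'en': 'French', 'fr': 'français'}}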

There could be similar advantages to be gained by using identifiers for what today we call "authority data." That's a bit more complex however, so I won't try to cover it in this short post. It's a great topic for a future discussion.

Tuesday, July 07, 2009

Yee on RDF and Bibliographic Data

[This entire thread has been combined on the futurelib wiki. Please add your comments and ideas there.]

I've been thinking for a while about how I could respond to some of the questions in Martha Yee's recent article in Information Technology and Libraries (June 2009 - pp. 55-80). Even the title is a question: "Can Bibliographic data be Put Directly onto the Semantic Web?" (Answer: it already is.) Martha is conducting an admirable gedanken experiment about the future of cataloging, creating her own cataloging code and trying to mesh her ideas with concepts coming out of the semantic web community. The article's value is not only in her conclusions but in the questions that she raises. In its unfinished state, Martha's thinking is provocative and just begging for further discussion and development. (Note: I hope Martha is allowed to put her article online, because otherwise access is limited to LITA members.) (Martha's article is available here.)

The difficulty that I am having at the moment is that it appears to me that there are some fundamental misunderstandings in Yee's attempt to grapple with an RDF model for library data. In addition, she is trying to work with FRBR and RDA, both of which have some internal inconsistencies that make a rigorous analysis difficult. (In fact, Yee suggests an improvement to FRBR that I think IFLA should seriously consider, and that is that subject in FRBR should be a relationship, and that the entities in Group 3 should be usable in any relevant situation, not just as subjects. p. 66, #6. After that, maybe they'll consider my similar suggestion regarding the Group 1 entities.)

I'm trying to come up with an idea of how to chunk Yee's questions so that we can have a useful but focused discussion.

I'm going to try to begin this with a few very basic statements that are based on my understanding of the semantic web. I do not consider myself an expert in RDF, but I also suspect that there are few real experts among us. If any of you reading this want to disagree with me, or chime in with your own favorite "RDF basics," please do.

1. RDF is not a record format; it isn't even a data format


Those of us in libraries have always focused on the record -- essentially a complex document that acts as a catalog surrogate for a complex thing, such as a book or a piece of recorded music. RDF says nothing about records. All that RDF says is that there is data that represents things and there are relationships between those things. What is often confusing is that anything can be an RDF thing, so the book, the author, the page, the word on the page -- if you wish, any or all of these could be things in your universe.

Many questions that I see in library discussions of the possible semantic web future are about records and applications: Will it be possible to present data in alphabetical order? What will be displayed? None of these are directly relevant to RDF. Instead, they are questions about the applications that you build out of your data. You can build records and applications using data that has "RDF Nature." These records and applications may look different from the ones we use today, and they may provide some capabilities in terms of linking and connecting data that we don't have today, but if you want your application to do it, it should be possible to do it using data that follows the RDF model. However, if you want to build systems that do exactly what today's library systems do, there isn't much reason to move to semantic web technology.

2. A URI is an identifier; it identifies


There is a lot of angst in the library world about using URI-structured identifiers for things. The concern is mainly that something like "Mark Twain" will be replaced with "http://id.loc.gov/authorities/n79021164" in library data, and that users will be shown a bibliographic record that goes like:
http://id.loc.gov/authorities/n79021164
Adventures of Tom Sawyer
or will have to wait for half an hour for their display because the display form must be retrieved from a server in Vanuatu. This is a misunderstanding about the purpose of using identifiers. A URI is not a substitute for a human-readable display form. It is an identifier. It identifies. Although my medical plan may identify me as p37209372, my doctor still knows me as Karen. The identifier, however, keeps me distinct from the many other Karens in the medical practice. Whether or not your application carries just identifiers in its data, carries an identifier and a preferred display form, or an identifier and some number of different display forms (e.g. in different languages) is up to the application and its needs. The point is that the presence of an identifier does not preclude having human-readable forms in your data record or database.

So why use identifiers? An identifier gives you precision in the midst of complexity. Author n79021164 may be "Mark Twain" to my users, and "Ma-kʻo Tʻu-wen" to someone else's, but we will know it is the same author if we use the same identifier. And Pluto the planet-like object will have a different identifier from Pluto the animated character because they are different things. It doesn't matter that they have the same name in some languages. The identifier is not intended for human consumption, but is needed because machines are not (yet?) able to cope with the ambiguities of natural language. Using identifiers it becomes possible for machines to process statements like "Herman Melville is the author of Moby Dick" without understanding one word of what that means. If Melville is A123 and Moby Dick is B456 and authorship is represented by x->, then a machine can answer a question like: "what are all of the entities with A123 x->?", which to a human translates to: "What books did Herman Melville write?"
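
That question is directly expressible in code. Here is a small sketch in Python's rdflib, with A123 and B456 standing in for real identifiers:

  from rdflib import Graph, Namespace

  EX = Namespace("http://example.org/")  # invented identifier space
  g = Graph()
  g.add((EX.A123, EX.authorOf, EX.B456))  # "Melville is the author of Moby Dick"

  # "What are all of the entities with A123 x-> ?"
  for thing in g.objects(EX.A123, EX.authorOf):
      print(thing)  # http://example.org/B456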

As we know from our own experience, creating identities is tricky business. As we rely more on identifiers, we need to be aware of how important it is to understand exactly what an identifier identifies. When a library creates an authority record for "Twain, Mark," it may appear to be identifying a person; in fact, it is identifying a "personal author," who can be the same as a person, but could be just one of many names that a natural person writes under, or could be a group of people who write as a single individual. This isn't the same definition of person that would be used by, for example, the IRS or your medical plan. We can also be pretty sure that, barring a miracle, we will not have a situation where everyone agrees on one single identifier or identifier system, so we will need switching systems that translate from one identifier space to another. These may work something like xISBN, where you send in one identifier and you get back one or more identifiers that are considered equivalent (for some definition of "equivalent").

3. The key to functional bibliographic systems is in the data

There is a lot of expressed disappointment about library systems. There is no doubt that the systems have flaws. The bottom line, however, is that a system works with data, and the key to systems functionality is in the data. Library data, although highly controlled, has been primarily designed for display to human readers, and a particular kind of display at that.

One of the great difficulties is with what libraries call "authority control." Certain entities (persons, corporate bodies, subjects) are identified with a particular human-readable string, and a record is created that can contain variant forms of that string and some other strings with relationships to the entity that the record describes. This information is stored separately from the bibliographic records that carry the strings in the context of the description of a resource. Unfortunately, the data in the authority records is not truly designed for machine-processing. It's hard to find simple examples, so I will give a simplistic one:

US (or U.S.)
is an abbreviation for United States. The catalog needs to inform users that they must use United States instead of US, or must allow retrieval under either. The authority control record says:
"US see United States"

United States, of course, appears in a lot of names. You might assume then that every place where you find "United States" you'll find a reference, such that United States. Department of State would have a reference from U.S. Department of State that refers the user from that undesirable form of the name ... but it doesn't. The reference from U.S. to United States is supposed to somehow be generalized to all of the entries that have U.S. in them. Except, of course, for those to which it should not be applied, like US Tumbler Co. or US Telecomm Inc. (but it is applied to US Telephone Association). There's a pattern here, but probably not one that can be discerned by an algorithm and quite possibly not obvious to all humans, either. What it comes down to, however, is that if you want machines to be able to do things with your data, you have to design your data in a way that machines can work with it using their plodding, non-sentient, aggravatingly dumb way of making decisions: "US" is either equal to "United States" or it isn't.
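
For contrast, here is a sketch of a see-reference as machine-actionable data: "US" becomes an alternate label on the one entity it actually belongs to, instead of a string-rewriting rule that an algorithm must guess how to generalize. The URIs are invented.

  from rdflib import Graph, URIRef, Literal
  from rdflib.namespace import SKOS

  g = Graph()
  us = URIRef("http://example.org/auth/unitedStates")
  g.add((us, SKOS.prefLabel, Literal("United States")))
  g.add((us, SKOS.altLabel, Literal("US")))
  g.add((us, SKOS.altLabel, Literal("U.S.")))

  # "US Tumbler Co." is simply a different entity with its own labels,
  # so no algorithm has to decide where the "US" pattern applies.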

Another difficulty arises from the differences between ideal data and real data. If you have a database in which only half of the records have an entry for the language of the work, providing a search on language guarantees that many records will never be retrieved by those searches even when they should be. We don't want to dumb down our systems to the few data elements that can reliably be expected in all records, but it is hard to provide for missing data. One advantage of having full text is that it probably will be possible to determine the predominant language of the work even if it isn't encoded in the metadata, but when you are working with metadata alone there often isn't much you can do.

A great deal of improvement could be possible with library systems if we would look at the data in terms of system needs. Not in an idealized form, because we'll never have perfect data, but looking at desired functionality and then seeing what could be done to support that functionality in the data. While the cataloging data we have today nicely supports the functionality of the card catalog, we have never made the transition to truly machine-actionable data. There may be some things we decide we cannot do, but I'm thinking that there will be some real "bang for the buck" possibilities that we should seriously consider.

Next... I'll try to get to the questions in Martha's article.