Tuesday, December 30, 2008

Google's Gift of Books

As part of the settlement between the Association of American Publishers (AAP) and Google, each public library in the U.S. can get one free subscription to the Google Books Public Access Service. (That's defined as one "terminal" per library building.)

I'm sure that many folks are quite impressed at this generosity: free access to the public! What's not to like? Well, keep reading.

Nothing's Really Free
Some of you may remember the late 1990's when Microsoft donated computers (running Microsoft software) and modems to public libraries, making it possible for them to offer free Internet access to the public. This was a great boon for the libraries, but there were numerous hidden costs. First, the libraries had to scramble to find space for the workstation, and finding "extra" space in a library is enough to make one hate the law of physics that precludes two objects occupying the same place at the same time. Then they had to get phone line access to the place where the computer would sit, and this had to be a dedicated line because it would be in use most of the hours that the library was open. The librarians had to learn about the Internet so they could help the public, which was especially difficult because the same computer that served the public was the only one that the librarians could learn on. As the Internet access became more popular, the libraries had to manage the demand for the service, setting up ways for patrons to sign up for time on the computer and mediating disagreements about whose time it was. In libraries where often the staff didn't have printers attached to their own work computers, they also had to find a way to manage the fact that users who didn't have a computer at home needed a way to take away what they found online.

All of these were costs for the libraries. They may seem like minor costs, but if you're thinking that then you're probably not working in a public library. I often say that public libraries are like old-age pensioners: they're on a fixed income that doesn't keep up with inflation, much less the demand for more services. (And my impression is that they've already been living on dog food for a number of years now.)

Some costs were not so minor, however. For example, I discovered that the small branch of my public library nearest my home was paying for the phone line that this "free" internet access used. The problem was that library phone lines were considered business lines and they were being charged per-minute rates. This library was paying $2000 a month or more for the use of the phone line attached to its one public Internet workstation. That's more each month than Microsoft paid initially for the equipment it provided to the library. Yet Microsoft was considered "generous," while the story in the press ignored the costs to the library; costs we taxpayers were all bearing.

Nothing in this should be construed to demean the gift from Microsoft or the value of adding public Internet access to libraries. The story here is that free has costs, and those costs could be considerable. The story is also that some of those costs, perhaps many of those costs, get passed on to the public, even though the public doesn't have a say in the choice to support this service.

The First One is Always Free

That same small library that started with one of the Microsoft computers now has something like six public Internet access workstations that are rarely sitting unused. That initial gift led to the development of what today is an essential public library service. It is often the case, however, that in cash-strapped times libraries have to make trade-offs, dropping old services (like magazine subscriptions) to pay for new ones (like Internet access). Should the single access to the Google Books Public Access Service not suffice, libraries will need to add more subscriptions to meet the demand. It isn't known what this will cost, but unless it is ridiculously cheap, it eats into the already strained budgets of the libraries. Eventually, the cost will be absorbed into the budget as part of normal expenses, but there will be a painful phase at the beginning. Before they introduce this free service, libraries need to know what the costs will be in 2, 3 or possibly 5 years so they can begin the budget planning process that will allow them to provide full service to their users, if that's what they wish to do.

Just Say No

If taking advantage of the Google Books Public Access Service is going to strain library budgets, why don't the libraries just say no? They aren't being forced to accept the free service, after all. This creates a real dilemma for public libraries, the same dilemma that was created by the initial free Internet access: the mission of public libraries is to level the information access playing field for everyone. To do so, public libraries need to keep up with new information resources and services as they become available; to purchase or license these; and to give equal access to all. Generally, public libraries lag behind their richer cousins, the academic and research libraries, in providing information services. Academic libraries had access to the current crop of online versions of abstracting and information services about a decade before public libraries began to provide these to their users. But if public libraries don't provide these services as they become affordable, we end up with a two-tiered world of information haves, those with a connection to an academic institution, and information have-nots, the remainder of the public.

Equal Access for All

One option that libraries must consider when new services arise that are outside of their budget capabilities is whether they will choose to provide the service with a user fee attached. I remember this in academic libraries when the first article indexing services were available through Dialog in the 1970's. These services were quite expensive (they billed by the minute, if not the second, as I recall). Libraries tried budgeting a set amount to provide the service to their users, but the appeal of the "free" service was such that the entire year's budget was exhausted within months. For the remainder of the budget year, users had to foot the costs for the searches. Academic institutions can decide to give some users (professors, researchers) services that are not available to others (undergraduate students). They also can decide to charge fees, looking on this as part of the cost of attending the institution and making use of its facilities.

The public library mission of equal access to all, however, argues against requiring fees for services, other than those nominal fees designed to prevent squandering of resources (e.g. 25 cents for each book put on hold), or cost recovery for consumable materials, like photocopy services. But generally speaking, once a user has entered the library, it's an "all you can eat" situation. This is not the nature of Google's online book service. The settlement agreement is incredibly complex in terms of what is free and what is pay-for. For certain works, a certain number of pages can be viewed for free, after which one must purchase the book to see the rest. The number of pages that can be printed may be limited, and there may be charges for printing.

We do know that public libraries will not be able to offer remote access to their free subscription, only on-site access. That, of course, excludes many users. We also know that there may be advertising included in the service, and it may include the ability to purchase books (online or in hard copy) and additional services. In other words, the library's users become the service's customers.

Product Placement

When Microsoft began giving away software to libraries (actually, making them pay a pittance for the licenses), an article in Salon stated:
In the case of computer companies, giving away free product is a way to increase market share, influence future purchases, create good will at relatively low cost, and get a tax write-off for your efforts.
While possibly cynical, it's also true. Giving away samples of your product is a time-worn approach to building a customer base.

Charity is giving people what they need, not what you want them to have or what you would like them to buy in the future. While the provision of a free, one-user license to libraries may be generous, it is not charitable. It should be viewed in the same way that free samples of cereal are. Actually, the better analogy harks back to the days when cigarette companies gave away free packs of cigarettes on city streets, hoping to encourage non-smokers to become smokers. It is best to look on the free access to Google Books as part of an advertising campaign; it is definitely not Google and the AAP following in the footsteps of Carnegie. It's as if Carnegie had given each city enough steel (his product) to build part of a bridge.

Did Anyone Ask Public Libraries Before Deciding This?

One of the great difficulties that we have in understanding the Google/AAP settlement is that none of the participants can reveal the nature of the negotiations; they are all bound by a non-disclosure agreement. So we don't know who represented the libraries nor what they asked for. We don't know if the Google Public Access Service was offered by Google or demanded by library participants. We don't even know who the library participants were. A logical assumption would be that the library representatives in the discussions were limited to representatives of the current Google library partners. If that is the case, then they are all representatives of research and academic libraries. We don't know if any of them surveyed public libraries, even informally, about the desirability of this service, or about the burdens it might place on those libraries as it has been formulated. Could there have been a different deal that was better for public libraries and equally acceptable to the major players?
There is very little in the settlement that would allow one to imagine the precise nature of this service and how this service will be implemented and managed.

Public librarians I have talked to are very concerned about this matter. There is still plenty of time to work out details, but is there a plan to engage a representative group of public libraries to do the planning? What happens if the service, as envisioned in the negotiations, doesn't meet the needs of public libraries, or doesn't fit in with their current online systems? Are there different needs and capabilities in large urban public libraries and small rural ones? Will it be possible to serve these equally?

Where is the Public's Voice?

At the negotiations there were lawyers representing the AAP, lawyers representing Google, and lawyers and librarians representing libraries. But the public had no lawyer at that table, no representative. While each of the parties could have desires to serve the public, they each had a primary self-interest that they were there to serve. Without public representation, it is not possible to say that the public's interest has been served. Without public representation, the public's interest has not even been solicited, much less heard. Yet this settlement has a great effect on the public and its relationship with a major public resource, the collective wisdom contained in hundreds of years of publication of text on paper. I will address this in a forthcoming post.

Saturday, December 27, 2008

FRBR and Group 2 & 3 Oddities

You've probably realized by now that I cycle back to FRBR frequently, each time discovering something new. New to me, at least. Perhaps because I am not a cataloger, I seem to have missed some key concepts in earlier readings. This might help explain some misunderstandings between me and more catalog-savvy folks.

This time I was thinking about the way that the entities are used with the subject relationship. But before I get to that, there's always the publisher to torment me.

Creators and Publishers in FRBR and RDA

The Group 2 entities have what are called "responsibility relationships" with the Group 1 entities. The diagram (Figure 3.2, p. 14) shows the two Group 2 (G2) entities, person and corporate body, as related to the Group 1 entities in the following way:
Work is created by... G2
Expression is realized by ... G2
Manifestation is produced by ... G2
Item is owned by ... G2
(Note that I find it odd that FRBR limits the Group 1 to Group 2 relationships to only four, and only one per Group 1 entity, but that is how it is written. It makes me wonder what one does with, say, an illustrator of a particular expression of a book. Surely the addition of illustrations doesn't make it a new work?)
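
To make the shape of these four relationships concrete, here is a minimal sketch in Python. The class names (`Group1`, `Agent`, `Entity`) and the `link` and `statements` methods are my own inventions for illustration; nothing in FRBR prescribes this representation. It only encodes what the list above says: one named responsibility relationship per Group 1 entity, each pointing to a person or corporate body.

```python
from dataclasses import dataclass, field
from enum import Enum


class Group1(Enum):
    WORK = "Work"
    EXPRESSION = "Expression"
    MANIFESTATION = "Manifestation"
    ITEM = "Item"


# The single responsibility relationship listed for each Group 1 entity.
RESPONSIBILITY = {
    Group1.WORK: "created by",
    Group1.EXPRESSION: "realized by",
    Group1.MANIFESTATION: "produced by",
    Group1.ITEM: "owned by",
}


@dataclass(frozen=True)
class Agent:
    """A Group 2 entity (hypothetical class name)."""
    name: str
    kind: str  # "person" or "corporate body"


@dataclass
class Entity:
    """A Group 1 entity linked to the Group 2 entities responsible for it."""
    level: Group1
    agents: list = field(default_factory=list)

    def link(self, agent):
        # The relationship type is fixed by the level; the caller cannot pick one.
        self.agents.append(agent)

    def statements(self):
        rel = RESPONSIBILITY[self.level]
        return [f"{self.level.value} is {rel} {a.name}" for a in self.agents]


expression = Entity(Group1.EXPRESSION)
expression.link(Agent("Example Illustrator", "person"))  # made-up name
print(expression.statements())
```

In this sketch an illustrator can only be attached to the expression as "realized by", with no way to say that the contribution was illustration rather than, say, translation. That is the limitation the note above is pointing at.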

In section 4 of FRBR, the Group 2 entities are not included in the lists of attributes of the Group 1 entities. In other words, when you read the list of attributes of a work, there is no mention of creator, and the list of attributes of an item does not include owner.

I was therefore surprised to find among the attributes of a manifestation:
4.4.5 Publisher/Distributor
The publisher/distributor of the manifestation is the individual, group, or organization named in the manifestation as being responsible for the publication, distribution, issuing, or release of the manifestation. A manifestation may be associated with one or more publishers or distributors.
Since Group 2 entities are not listed as attributes in the Group 1 attribute lists, this pretty clearly states that publisher is not a person or corporate body entity.
Yet, the section on relationships between Group 1 and Group 2 entities says:
5.2.2 Relationships to Persons and Corporate Bodies
The entities in the second group (person and corporate body) are linked to the first group by four relationship types: the “created by” relationship that links both person and corporate body to work; the “realized by” relationship that links the same two entities to expression; the “produced by” relationship that links them to manifestation; and the “owned by” relationship that links them to item.
Essentially, this apparent inconsistency between the definitions of the entities and the attribute list for the manifestation has to do with the practice of transcribing data from the manifestation:
At first glance certain of the attributes defined in the model may appear to duplicate objects of interest that have been separately defined in the model as entities and linked to the entity in question through relationships. For example, the manifestation attribute “statement of responsibility” may appear to parallel the entities person and corporate body and the “responsibility” relationships that link those entities with the work and/or expression embodied in the manifestation. However, the attribute defined as “statement of responsibility” pertains directly to the labeling information appearing in the manifestation itself, as distinct from the relationship between the work contained in the manifestation and the person and/or corporate body responsible for the creation or realization of the work. (Section 4.1)
What this points out is that while FRBR supposedly puts forth an entity-relation model, in fact it is no more ER than our current bibliographic model with its mixture of transcribed data, cataloger supplied data, and controlled headings.

Then Comes Group 3

This is easier to explain, because it is very simple: The Group 3 entities (concept, object, event, place) can ONLY be used as subjects, e.g.:
For the purposes of this study places are treated as entities only to the extent that they are the subject of a work (e.g., the subject of a map or atlas, or of a travel guide, etc.). (section 3.2.10)
This eliminates any thought of using place as in "place of publication." Not to mention that each of these has a very limited attribute list; in fact, they each have exactly one attribute:
term for the concept/object/event/place
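
As a rough illustration of how little this gives us, here is a sketch in Python. All class and field names are hypothetical, chosen by me for the example, and the `Work` and `Manifestation` classes are stripped down to the bare minimum. The point is only that a Group 3 entity carries a single term and hangs off a work as a subject, while place of publication has to remain a plain string on the manifestation.

```python
from dataclasses import dataclass, field, fields
from enum import Enum


class Group3Kind(Enum):
    CONCEPT = "concept"
    OBJECT = "object"
    EVENT = "event"
    PLACE = "place"


@dataclass(frozen=True)
class Group3:
    """A Group 3 entity: its only descriptive attribute is the term."""
    kind: Group3Kind
    term: str


@dataclass
class Work:
    title: str
    subjects: list = field(default_factory=list)  # the only way Group 3 is used


@dataclass
class Manifestation:
    title: str
    # A plain string, not a link to a Group 3 place entity.
    place_of_publication: str


paris = Group3(Group3Kind.PLACE, "Paris")
guide = Work("An example travel guide", subjects=[paris])
edition = Manifestation("An example travel guide", place_of_publication="Paris")

# The place is an entity in one role and a bare string in the other.
print(guide.subjects[0].term == edition.place_of_publication)
print([f.name for f in fields(Group3)])
```

The two "Paris" values match as strings, but nothing in the model connects them.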

The Upshot

The upshot is that FRBR does not give us a true entity-relation model for our bibliographic data. This is frustrating for those of us trying to move library data in an ER direction, and it means that to achieve the ER model we will have to go beyond what exists today in FRBR, and beyond the version of FRBR that has been realized in RDA. I've kind of known this, but it's discouraging to have it confirmed in the FRBR document itself. Even more frustrating that it's been there the whole time and I missed it.

I've looked again at FRBR in RDF and the Scholarly Works Application Profile, and both make some interesting extensions to the FRBR concepts, taking them further along the ER road. It seems to me that the DC/RDA work will need also to deviate from FRBR in order to achieve its goals. The big question is: how far can we go and still be compatible with library data?

Tuesday, December 23, 2008

Monday, December 22, 2008

Google Replies on OCA Blog

The Open Content Alliance blog has a post on the Google/AAP agreement with a lengthy reply from Dan Clancy of Google Books, and my reply to Dan's reply.

LC forces take-down of lcsh.info

I am beside myself with fury. I hardly know where to begin. Not long ago, Ed Summers took the LCSH authority file and created an online site with the LC Subject Heading authority file re-formatted as a SKOS vocabulary. For the first time, Web services could link directly to LC subjects as represented in the authority file. And some did.

But the Library of Congress, our Federal, if not National, library, has required Ed to take down the site. A site that contained nothing more than LCSH in a usable form. Data that SHOULD be in the public domain, for anyone to use as they wish. This is an assault against libraries everywhere, an act of censorship.

You can read Ed's statement on lcsh.info.

I would very much like to hear LoC's statement about this. They should not be allowed to control the use of this data, data that belongs to all of us.

Ed couldn't refuse the Library's demand, but anyone who isn't an employee of LoC should have greater freedom. Let's gather around and find a new home for LCSH, one that can't be removed from the public.

Thursday, December 04, 2008

Google and Fair Use

There's some background to the Google/AAP settlement that I believe is key to understanding the subtext around it. This won't be news to most folks, but I thought it would be good to re-articulate it in the context of the settlement, lest we forget.

Google's first business is that of indexing resources that are on the web. I'll talk about them as if they were all texts because it's easier, but the same thing could be said for images and other resources.

To do the indexing, Google must make a copy of the web page or document. Using this copy, it adds the page to its search engine. As a good citizen, Google pays attention to the robots.txt file, and does not index pages where the site owner has opted out of being included in search engines.
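
For readers who haven't seen the opt-out mechanism in action, here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent, and the example.com URLs are all made up for illustration, and this shows only how a well-behaved crawler might check the file in general, not how Google's crawler actually works.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt in which the site owner opts one directory out.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/public/page.html"))
print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/private/page.html"))
```

Note the direction of the default: anything the file does not disallow is treated as fair game. The site owner has to act in order to be left out.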

This is all fine and unremarkable until you look at it from the point of view of copyright law. Copyright is specifically about... making copies, and it gives the right to make copies, or to authorize the making of copies, to the copyright holder. That can be the author, or someone to whom the author has passed along the right. Copyright holders must opt in to the making of copies: they have to give permission. The default in copyright law is that copies cannot be made unless the copyright holder gives approval.

So the big question is: Is Google violating copyright law by making copies of web pages without the permission of the copyright holders? There are two main ways of looking at this:
  1. The web is different from the print environment. Anyone who has put their works out on the web has agreed to copying because no one can even view the work without making a copy. If they don't want people copying, they need to hide their works behind a security screen. However, there is no such exception or wording in copyright law that would support this.
  2. The web is not different from the print environment. But Google is just producing an index and there is nothing in copyright law that would prevent someone from producing an index of words in texts. The incidental copies that Google makes in order to produce the index are allowed under the Fair Use aspects of the copyright law.
So then we move on to the Google Books project. Initially, Google claimed that it was doing the same thing with books as it does with the web: making incidental copies in order to create keyword indexes to the texts. In terms of copyright law, argument #1 is pretty much out because these works can be read without making a copy, so the copyright holders haven't agreed to let their works be copied. This leaves us with argument #2: it must be fair use.

In fact, Google did and does make the fair use argument. The libraries that partnered with Google also came to the fair use conclusion in at least some cases. The CIC project FAQ says:

University of Michigan said this in 2007:

Does this project comply with copyright law?

Yes. This project was undertaken with careful attention to the law and to the rights and responsibilities of the various parties involved. The purpose of copyright law is to promote progress in society. We are confident that the Books Library project is fully consistent with the fair use doctrine under U.S. copyright law and the principles underlying copyright law itself. Copyright law strikes a balance between rewarding creators of intellectual property for their creations and facilitating public access to these works in ways that do not create a business harm. For books, this means ensuring authors write books, publishers sell them and libraries lend them. By making books more discoverable, Google is enhancing the ability of authors and publishers to sell books to an audience beyond the traditional book market.

What was at stake with the AAP lawsuit was exactly this decision about Fair Use. If copying the books for the purpose of indexing were determined not to be fair use, then this decision could bleed over into the web. And of course it would mean the end of Google Book Search (which has now become Google Book Store). Although Google has always presented a confident posture to the public, declaring unwaveringly that what it does as a search engine is perfectly within copyright law, the idea of going to court over the issue would have put their entire operation at risk.

Now back to libraries. Fair use is not a list of things you can do but a judgment call relating to some complex factors. Some key factors have to do with whether your use is commercial in nature or could compete with the exploitation of works by the copyright holders. There are, in addition, exceptions in the copyright law relating to research and study, and special exceptions for libraries. In fact, in relation to copyright law, libraries and educational institutions get considerably more latitude in using works than do commercial enterprises. As an example, a teacher can make copies of an article for her students as part of a lesson, and that is generally considered fair use. A company manager who wants his staff to read an article cannot rely on fair use for copying, but must apply to the copyright holder (usually through an intermediary such as CCC) and pay a fee. (See the Texaco case.)

What happened with Google Book Search and the AAP is that the digitization of the libraries' books and subsequent use of those was judged not by the criteria that would be used normally for libraries, of course, but by the criteria that would be used for a commercial entity. That's totally logical, since although Google was partnered with the libraries, the primary use of the materials was to fuel Google Book Search, an obviously for-profit activity.

Libraries have gotten the short end of the stick because their use of their own materials became commercialized through their partnership with Google. If instead libraries had managed to digitize the books on their own, the outcome would likely have been entirely different (if any lawsuit had been brought, which might not have happened). I believe that libraries could be found to have a fair use case for digitizing their works for the purposes of searching, and could be allowed to use those digitized copies for the exceptions spelled out in section 108 of the copyright law (such as providing access to the sight impaired, or for replacement of deteriorated originals). Unfortunately, the concept of digitization of the contents of libraries has now been tainted with the air of commercialization and has earned the wrath of the publishers and authors. The Google/AAP settlement has created a mechanism that not only ignores the inherent rights of the libraries, but also makes it more difficult for them to justify undertaking their own digitization projects.

This is why I disagree heartily when I hear statements like:

We're delighted that this agreement creates new opportunities for libraries and universities to offer their patrons and students access to millions of books beyond their own collections. (from Google)
The settlement might look good from the point of view of a commercial entity facing copyright law, but it binds the non-profit educational and cultural heritage community to legal decisions designed for the for-profit sector. This is not only not a win for libraries, but it will hinder libraries in their efforts to make use of current technologies to further the arts and sciences.