Coyle's InFormation

Monday, January 02, 2012

Google Book Search Redux

The document I referred to in the previous post would have been so much clearer if I had read the two preceding documents. Now that I have, the story is even more dramatic.

On December 12, 2011, the Authors Guild filed a fourth amended complaint (PDF) against Google. This complaint is nearly identical to the first one, filed on September 20, 2005 (PDF). The two complaints between these (October 28, 2008 and November 16, 2009) included the Association of American Publishers, as did the two attempts at settling the case (October 28, 2008, and November 13, 2009). The publishers had had their own complaint in 2005 before combining forces with the Authors Guild. Now the Authors Guild is again standing alone against Google's book digitizing efforts.

This fourth amended complaint brings us pretty much back to square one, with the addition of the involvement of more libraries and the creation of HathiTrust as a way for the libraries to store their (allegedly) ill-gotten copies. The library copies are a key element of the suit because they are proof that Google has not only digitized the library books but has made copies (the purview of copyright law) and distributed them to others.

The most interesting document of this latest group, and the one with the greatest detail about Google's actions, is the Memorandum in support of the class certification. This document is the explanation of why the Authors Guild should be considered by the court to be a valid representative of all authors in a class action suit. The document has a number of quotable moments, of which my favorite is the "tell it like it is in plain language" opening:

This litigation arises from Google's business decision to gain a competitive edge over its rivals in the search engine market by making digital copies of millions of "offline" printed materials. ... Rather than obtaining licenses from copyright owners for the digital use of their printed works, Google instead entered into agreements with libraries to gain access to these works. A number of university libraries allowed Google to make digital copies of the books in the libraries' collections, including in-copyright books. In exchange, Google provided digital copies of the books to the libraries. Google refers to this massive copyright infringement as its "Library Project." (p.1)

The assumption on the part of most folks commenting on this latest development in this now 6-year-old case is that the settlement is dead. We are therefore back to the question of whether Google's book scanning is or is not Fair Use. This question, though, is only being asked on the part of authors, not publishers, and if anyone has inside knowledge on what approach the publishers are taking I would love to hear it. It is clear that the position of publishers in relation to Google has changed greatly over these past 5-6 years since the suit was originally filed. There are now reportedly thousands of publishers who are using Google Books to promote and sell their works. It also makes sense that publishers, as corporations, are better able to negotiate with Google than are individual authors. A large publisher with numerous books in print and in its backlist has clout that a single person does not have. In addition, large publishers have lawyers, or access to legal counsel. At least some publishers have made their peace with Google and are seeing the relationship as advantageous.

Looking at this from the library point of view I wonder what will happen to the millions of library books already scanned by Google. I also wonder what this awkward and failed attempt to create the overly broad settlement between Google and the AG/AAP will mean for future digitization projects. There are strong arguments for digitization for scholarly purposes, and the creation of a computational capability over millions of texts could be a positive step for research, especially in the social sciences and humanities. I hope that the botched attempt to commercialize the contents of libraries will not prejudice the future of digital research.

Monday, December 26, 2011

Google files motion to dismiss

"The claims of the associations should be dismissed without leave to amend because they lack standing as a matter of law, since they do not themselves own copyrights and do not meet the test for associational standing set forth in Hunt." p. 19

With that conclusion, Google has filed a motion asking that the copyright infringement lawsuits filed by the Authors Guild and the American Society of Media Photographers, Inc. be dismissed. The arguments made in the document are:

"Individual copyright owners' participation is necessary to establish a claim for copyright infringement." (p.1)
"Plaintiff associations do not own copyrights alleged to have been infringed, and do not have standing to sue for copyright infringement." (p.4)
"Every copyright, and every alleged copyright infringement, is different."(p.7)
"... a central issue in these cases is whether the conduct alleged in the Complaints constitute fair use under 17 U.S.C. 107. Litigating that issue will require the participation of individual association members, because many of the relevant facts are specific to the particular work in question." (p.11)

All of this sounds plausible to this legal novice, but there are a couple of puzzling issues. First, why did Google not make these arguments in 2005 when the Authors Guild filed suit? Instead, they negotiated with the association for six years, presumably in good faith, and those negotiations hinged on the acceptance of the AG as a representative of authors and their rights in their works. If Google had thought that the AG did not have standing, none of that negotiation would have made much sense.

Second, Google says in this document that fair use has to be determined on a case-by-case basis. They even quote from Campbell v. Acuff-Rose Music, Inc. that "Fair use must 'be judged case by case, in light of the ends of the copyright law....' It is 'not to be simplified with bright-line rules.'" (p.11) This seems to undermine Google's original defense that copying for the purposes of creating an index is itself fair use, not something that has to be determined on a case-by-case basis.

It isn't surprising that Google wants to bring an end to this case. It is now entering its seventh year (the original suit was filed in September of 2005), and has undoubtedly been costly for all parties. Google was moving ahead in putting into place the foundations for the settlement, including the creation of a large database of works and a means for owners to claim the copyrights. They had designated a director for the Book Rights Registry, which would administer the business agreed on in the settlement. The failure of the settlement and the amended settlement to get court approval meant that all of that effort was for naught. Yet it isn't clear to me (and I hope someone can speak to this) what practical outcome Google is seeking for its book digitization effort. A dismissal of this nature would put Google in the rather cynical position of continuing book scanning knowing that few individual authors will have the means to take Google to court, and those individual payments would probably be affordable for this multi-billion dollar company. If dismissal is rejected, then at least that aspect of the suit is clarified, but next steps surely will be that the suit goes forward as first entered.

The one thing that is clear is that negotiations between Google and the AG are no longer on the horizon.

Note, also, that the Authors Guild has filed suit against the HathiTrust for copyright infringement, and the decision here will no doubt reflect on that case as well.

Thursday, December 22, 2011

National Library of Sweden and OCLC fail to agree

In a blog post entitled "No deal with OCLC" the National Library of Sweden has announced that after five years they have ended negotiations with OCLC to become participants in WorldCat. The point of difference was over the OCLC record use policy. Sweden has declared the bibliographic data in the Swedish National Catalog, Libris, to be open for use without constraints.

"A fundamental condition for the entire Libris collaboration is voluntary participation. Libraries that catalogue in Libris can take out all their bibliographic records and incorporate them instead into another system, or use them in anyway the library finds suitable." (from the blog post)

This is an example of the down-stream constraint issues that we discussed while working on the Open Bibliography Principles for the Open Knowledge Foundation. While open data may appear to be primarily an ideological stance it in fact has real practical implications. A bibliographic database is made up of records and data elements that can have uses in many contexts. In addition, the same bibliographic data may exist in numerous databases managed by members of entirely different communities. Someone may wish to create a new database or service using data coming from a variety of sources. At times someone will want to use only portions of records and may mix and match individual data elements from different sources. Any kind of constraints on use of the data, including something as seemingly innocuous as allowing all non-commercial use, require the user of the data to keep track of the source of each record or data element. Practically this means that an application using the mix of data is effectively constrained by the most strict contract in the mix.
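To make the "strictest contract wins" point concrete, here is a minimal sketch in Python. The license names, their ranking, and the three sources are all invented for illustration; real data contracts rarely fall onto a single scale this neatly, so treat this as a picture of the bookkeeping problem rather than a model of any actual policy.

```python
from dataclasses import dataclass

# Hypothetical license terms, ranked from least to most restrictive.
RESTRICTIVENESS = {
    "open": 0,
    "attribution": 1,
    "non-commercial": 2,
    "members-only": 3,
}


@dataclass(frozen=True)
class DataElement:
    field: str
    value: str
    source: str   # provenance has to travel with every element
    license: str  # one of the keys in RESTRICTIVENESS


def effective_license(elements):
    """Return the most restrictive license found among the mixed elements."""
    return max((e.license for e in elements), key=RESTRICTIVENESS.__getitem__)


# A record assembled from three (made-up) sources.
record = [
    DataElement("title", "Havana nocturne", "Source A", "open"),
    DataElement("subject", "Mafia -- Cuba -- Havana", "Source B", "non-commercial"),
    DataElement("classification", "364.106", "Source C", "attribution"),
]

print(effective_license(record))
```

Even though two of the three elements are freely reusable, the record as a whole can only be used under the "non-commercial" terms, and the application has to carry the source and license fields around for every element to know that.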

The Swedish library was concerned that their participating libraries would be hindered in their future systems and activities if any limitations were placed on data use. In addition, they would not be able to share their data with the Europeana project, as Europeana requires that the data contributed be open precisely because of the complications of managing hundreds or thousands of different sources with different obligations.

As many of us pointed out during the discussions about the OCLC record use policy, the practical problems of controlling down-stream use of data are insurmountable. Some people argue that the record use policy hasn't affected libraries using WorldCat, but my experience is that the policy has a chilling effect on some libraries, and is making it more difficult for libraries to embrace the linked open data model. The Swedish National Library had to make the difficult decision between WorldCat services and future capabilities. It was undoubtedly a hard decision, but it is admirable that the National Library did not give up what it saw as important rights for its users.

Monday, December 12, 2011

Learning not to share

"Learning to share" used to be one of the basic lessons of childhood, with parents beaming the first time their offspring spontaneously handed half of a cookie to a playmate. But some time before that same child first puts fingers to keyboard she will have to learn a new lesson: not to share online.

The Facebook phenomenon has taken that simple concept of sharing with others to an industrial level. Any page you go to on the Web today connects into your online social life, so that while reading the news or watching a video you are exhorted to share your activity with your online "friends." I say "friends" in quotes because the way that Facebook involvement grows means that many of the people seeing your posts or learning about your activities are like second and third cousins; related to your friends but at least a step removed from the inner circle you relate to. It is easy to forget that those more distant relations are there, but bit by bit the links pull in more invitations and, since we have been told that it is impolite not to share, we rarely slam the digital door on those seeking our friendship.

To increase this digital sharing, the House has passed a revision to the Video Privacy Protection Act. You may not recall the "Bork law" of 1988. It was one of the fastest privacy laws ever passed in the U.S. legislature. Here's the description from the New York Times article:

In 1987, the Washington City Paper, a weekly newspaper, published the video rental records of Judge Robert H. Bork, who at the time was a nominee to the Supreme Court. One of the paper’s reporters had obtained the records from Potomac Video, a local rental store. Judge Bork’s choice of movies — he rented a number of classic feature films starring Cary Grant — may have seemed innocuous.

But the disclosure of Judge Bork’s cultural consumption so alarmed Congress that it quickly passed a law giving individuals the power to consent to have their records shared. The statute, nicknamed the “Bork law,” also made video services companies liable for damages if they divulged consumers’ records outside the course of ordinary business.

At the time the passage of the law had a comic aspect to it: you could imagine the thoughts going through the heads of members of Congress when they realized that any reporter could walk into their local video store and learn what they had rented. Zingo! New law!

The revised bill, stated in the article as being backed primarily by Netflix, would allow consumers (and that's all we are, right, consumers?) to sign a blanket waiver on their video privacy in order to facilitate sharing with friends.

The Times article has various quotes giving pros and cons, online services vs. privacy advocates, all talking about how much you do or don't want your "friends" to know about you. What the article fails to state, however, is that whether you like it or not, every site where you share is a de facto friend as well. If your Facebook friends get your Netflix picks, both Facebook and Netflix (and their advertising partners) also get your video viewing information. The more you share with your friends, the more you are sharing with an invisible network of corporations - who, by the way, you cannot "unfriend" even if you want to.

This is why we need to learn not to share: it's a lie, a deceit. We aren't really sharing with our friends; our friends are being used to get us to divulge information to faceless corporations who have insinuated themselves into our lives for the sole purpose of benefiting from our consumption. They have distorted the entire idea of "friend," and turned it into a buyer's club for their benefit.

Dear friends: I'm looking forward to seeing you ... offline.

Tuesday, November 01, 2011

Future Format: Goals and Measures

The LC report on the future bibliographic format (aka replacement for MARC) is out. The report is short and has few specifics, other than the selection of RDF as the underlying data format. A significant part of the report lists requirements; these, too, are general in nature and may not be comprehensive.

What needs to be done before we go much further is to begin to state our specific goals and the criteria we will use to determine if we have met those goals. Some goals we will discover in the course of developing the new environment, so this should be considered a growing list. I think it is important that every goal have measurements associated with it, to the extent possible. It makes no sense to make changes if we cannot know what those changes have achieved. Here are some examples of the kinds of things I am thinking of in terms of goals; these may not be the actual goals of the project, they are just illustrations that I have invented.

COSTS
- goal: it should be less expensive to create the bibliographic data during the cataloging process
   measurement: using time studies, compare cataloging in MARC and in the new format
- goal: it should be less expensive to maintain the format
   measurement: compare the total time required for a typical MARBI proposal to the time required for the new format
- goal: it should be less expensive for vendors to make required changes or additions
   measurement: compare the number of programmer hours needed to make a change in the MARC environment and the new environment

COLLABORATION
- goal: collaboration on data creation with a wider group of communities
   measurement: count the number of non-library communities that we are sharing data with before and after
- goal: greater participation of small libraries in shared data
   measurement: count number of libraries that were sharing before and after the change
- goal: make library data available for use by other information communities
   measurement: count use of library data in non-library web environments before and after

INNOVATION
- goal: library technology staff should be able to implement "apps" for their libraries faster and easier than they can today.
   measurement: either number of apps created, or a time measure to implement (this one may be hard to compare)
- goal: library systems vendors can develop new services more quickly and more cheaply than before
   measurement: number of changes made in the course of a year, or number of staff dedicated to those changes. Another measurement would be what libraries are charged and how many libraries make the change within some stated time frame

As you can tell from this list, most of the measurements require system implementation, not just the development of a new format. But the new format cannot be an end in itself; the goal has to be the implementation of systems and services using that format. The first MARC format that was developed was tested in the LC workflow to see if it met the needs of the Library. This required the creation of a system (called the "MARC Pilot Project") and a test period of one year. The testing that took place for RDA is probably comparable and could serve as a model. Some of the measurements will not be available before full implementation, such as the inclusion of more small libraries. Continued measurement will be needed.

So, now, what are the goals that YOU especially care about?

Monday, October 17, 2011

Relativ index

Most of us, when we hear "Dewey Decimal Classification" (DDC), think about the numbers that go onto the backs of books that then tell us where the book can be found on the library's shelves. The subject classification and its decimal notation were only part of Dewey's invention, however. The other part was the "Relativ Index." The Relativ Index was the entry vocabulary for the classification scheme. It was to be consulted by library users as the way to find topics in the library.

"The Index givs similar or sinonimus words, and the same words in different connections, any any intelijent person wil surely get the ryt number. A reader wishing to know sumthing of the tarif looks under T, and, at a glance, finds 337 as its number. This gyds him to shelvs, to all books and pamflets, to shelf catalog, to clast subject catalog on cards, to clast record of loans, and, in short, in simple numeric order, thruout the whole library to anything bearing on his subject." (Dewey, Edition 11, p. 10) (Yes, that is how he spelled things.)

The most recent version of DDC that I own is from 1922, so this example is an entry in the Relativ Index of Edition 11 under "Leaves:"

Leaves fertilizers 631.872
shapes of botany 581.4

In the schedules these classes are listed as:

631.872 : "Vegetable manures, Leaves" (coming right after "Vegetable manures, Muck").
581.4 : "Morfology comparativ anatomy"

You can see that the index is not just a repeat of the names of the points in the classification but is a kind of subject thesaurus on its own. It doesn't just point to the classification number but it gives some context ("fertilizers" "botany") to help the user decide which class number to select.
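As a rough illustration of what an entry vocabulary does, here is a small Python sketch that holds the index entries and schedule captions quoted above. The data structures and the function name `look_up` are my own invention for this example, not anything Dewey or the DDC editors defined.

```python
# Index entries transcribed from the Edition 11 examples quoted above:
# index term -> list of (context, class number).
RELATIV_INDEX = {
    "leaves": [
        ("fertilizers", "631.872"),
        ("shapes of, botany", "581.4"),
    ],
    "tarif": [
        ("", "337"),  # from Dewey's own example; no context is given for it
    ],
}

# Schedule captions, again only the ones quoted above.
SCHEDULES = {
    "631.872": "Vegetable manures, Leaves",
    "581.4": "Morfology comparativ anatomy",
}


def look_up(term):
    """Return (context, class number, schedule caption) for an index term."""
    entries = RELATIV_INDEX.get(term.lower(), [])
    return [
        (context, number, SCHEDULES.get(number, "(caption not available)"))
        for context, number in entries
    ]


for context, number, caption in look_up("Leaves"):
    print(f"Leaves / {context}: {number} ({caption})")
```

The point of the sketch is that the index and the schedules are two different things: the index carries the user's word plus a disambiguating context, and only then hands over a class number.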

What I find odd today in libraries (mainly public libraries) is that we do not have an entry vocabulary for the Dewey classification. Libraries in the U.S. use the Library of Congress Subject Headings even when their classification scheme is Dewey. While LC subject headings will lead you to a catalog entry that has a classification number, they aren't an index to that classification scheme.

There are more oddities, actually.

One oddity is that we never explain these classification numbers to the users. Yes, I can go from the catalog to the shelf and find books that are near the one I am seeking, but in a small public library I can encounter a number of different topics on a single shelf; and in a large academic library I can wander whole aisles without seeing a change in the initial class number and have no idea if I have exhausted my topic area on the shelf as decimal points three places out change. Yet there is nothing either at the shelf or anywhere else in the library to tell me what those numbers mean except usually at a very macro level. What I have before me are book spines and class numbers, and since I don't know what the class numbers mean I have to rely on the spine titles. So if I browse a shelf and see:

364.106 D26f   The first family
364.106 En36h Havana nocturne
364.106 En36i The Westies
364.106 En36p Paddy whacked

... it may not be clear to me what topic I am looking at. At the very least I would like to be able to type "364.106" into an app on my phone and get a display something like:

300 Social sciences
    360 Social problems & social services
        364 Criminology
            364.106....

(That example is truncated because the divisions to the right of the decimal point are not available to me. Presumably the display would take me down to .106, which would then have something to do with gangs and/or organized crime and/or mafia, but I'm just guessing at that.)
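The core of such an app is not complicated. Here is a minimal Python sketch that takes a class number, works out its broader classes from the notation itself, and prints whatever captions it has. The caption table holds only the three captions shown above; everything below the decimal point is reported as unavailable rather than guessed. Skipping steps that end in a zero after the decimal point (such as "364.10") is my own assumption about how the notation should be read, and the sketch also assumes a three-digit number before the decimal point.

```python
# Only the captions shown in the display above; nothing else is filled in.
CAPTIONS = {
    "300": "Social sciences",
    "360": "Social problems & social services",
    "364": "Criminology",
}


def ancestors(class_number):
    """List the broader classes implied by the notation, broadest first."""
    whole, _, decimals = class_number.partition(".")
    chain = [whole[0] + "00", whole[:2] + "0", whole]
    for i in range(1, len(decimals) + 1):
        prefix = decimals[:i]
        # Assumption: a step ending in 0 after the decimal point is not
        # treated as a class of its own, so it is skipped.
        if prefix.endswith("0"):
            continue
        chain.append(f"{whole}.{prefix}")
    # Remove repeats such as 300 -> 300, 300, 300 while keeping the order.
    unique = []
    for number in chain:
        if number not in unique:
            unique.append(number)
    return unique


def display(class_number):
    """Build an indented read-out of the hierarchy for one class number."""
    lines = []
    for depth, number in enumerate(ancestors(class_number)):
        caption = CAPTIONS.get(number, "(caption not available)")
        lines.append("    " * depth + f"{number} {caption}")
    return "\n".join(lines)


print(display("364.106"))
```

The hard part, of course, is not the code but the caption table: the app is only as useful as the schedule data it is allowed to carry.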

Even better, I'd like to point my phone camera at a book spine and get a similar read-out. Yes, I know that's not going to be simple.

Another oddity is that we put multiple subject headings on a bibliographic record, but only one classification number, reducing the role of classification to simply the ordering of books on the shelves. This means that there are subject headings on the records that would logically lead to class numbers other than the one that has been given.

Using my crime books as an example, the subject headings are clearly more diverse than the single classification code:

    Mafia -- United States -- History
    Mafia -- United States -- Biography
    Criminals -- United States -- Biography
    Organized crime -- United States -- Case studies

    Lansky, Meyer, 1902-
    Luciano, Lucky, 1897-1962
    Mafia -- Cuba -- Havana
    Cuba -- History -- 1933-1959
    Havana (Cuba) -- Social conditions -- 20th century

    Westies (Gang) -- History
    Gangs -- New York (State) -- New York -- History
    Irish American criminals -- New York (State) -- New York -- History
    Hell's Kitchen (New York, N.Y.)

    Organized crime -- United States -- History
    Irish American criminals -- United States -- History
    Gangsters -- United States -- History

This won't be a surprise to my readers, but this dual system is full of "gotchas" for users. If I look up "Irish American criminals" in the subject headings I retrieve some items in 364.106, some in the 920 area (biographies, but many users won't know that), and some in fiction (under the author's last name). It's not that there isn't a rhyme or reason, but there is nothing to explain the differences between these items to the library user that would justify going to three entirely different places in the library to explore this topic. My guess is that the system seems quite arbitrary.

Things are a bit better in libraries that use Library of Congress Classification (LCC) along with LCSH, since the two seem to be developed with some coordination. In his essay "The Peloponnesian War and the Future of Reference" Thomas Mann, of the Library of Congress, explains how LCSH and LCC work together:

"In order to find which areas of the bookstacks to browse, however, researchers need the subject headings in the library catalog to serve as the index to the class scheme. But the linkage between a subject heading and a classification number is usually dependent on the precoordination of multiple facets within the same string. For example, notice the specific linkages of the following precoordinated strings:

Greece–History–Peloponnesian War, 431-404 B.C.: DF229-DF230
Greece–History–19th century: DF803
Greece–History–Acarnanian Revolt, 1836: DF823.6
Greece–History–Civil War, 1944-1949: DF849.5"

This is the correlation that will appear in the LCSH documentation, but this is not what the user sees in the catalog. A search in LC's catalog for Greece-History-19th century brings up books with a variety of classification numbers, the first four being:

DF803 .H45
DF725 .A14
DF951.T45
DK508.95.O33

Again, the user is directed to different shelf locations from what seems to be a single subject heading, with no explanation of what these different locations mean.* It's got to be terribly confusing.
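The mismatch is easy to check mechanically. This Python sketch compares the four call numbers from the catalog search against the class number that the documentation links to the heading (DF803). The regular expression is a simplification I wrote for these examples; it reads only the class letters and the class number and ignores the rest of the call number, so it should not be taken as a full LCC parser.

```python
import re

# Simplified pattern: one to three class letters, then a class number
# that may have a decimal extension. Cutter numbers are ignored.
CALL_NUMBER = re.compile(r"^([A-Z]{1,3})\s*(\d+(?:\.\d+)?)")


def class_part(call_number):
    """Split off the class letters and class number of a call number."""
    match = CALL_NUMBER.match(call_number)
    if match is None:
        raise ValueError(f"not recognized as an LCC call number: {call_number!r}")
    return match.group(1), float(match.group(2))


def in_range(call_number, letters, low, high):
    """True if the call number falls in the class range letters+low..high."""
    found_letters, number = class_part(call_number)
    return found_letters == letters and low <= number <= high


# Greece--History--19th century is linked to DF803 in the quoted passage.
documented = ("DF", 803, 803)

# The first four call numbers from the catalog search described above.
retrieved = ["DF803 .H45", "DF725 .A14", "DF951.T45", "DK508.95.O33"]

elsewhere = [c for c in retrieved if not in_range(c, *documented)]
print(elsewhere)
```

Three of the four books sit somewhere other than the place the subject heading is documented as pointing to, and one of them is not even in the DF class.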

Compact notation is essential for the ordering of books on the shelf. But it seems truly odd that we order the books on the shelf but do not tell users what the order means. This can be seen as providing a delightful serendipity, but I presume that we could provide serendipity with less intellectual effort than has been dedicated to DDC and LCC, which are both enormously detailed and growing more so each year in an attempt to encompass the complexity of the published world. How much richer would the user's library experience be if she understood the relationship between the items on the shelf? Does it make sense to create detailed and complex relationships that then are not understood or used? What would a shelf system look like that was meaningful to library users? in a small library? in a large library? And, finally, can we use computing power to overcome the limitations that brought us to the situation we are in today in terms of organized subject access?

* Before someone explains to me that the first subject heading determines the class number... you know that, I know that, but millions of library users have no idea what the order of the subject headings means. Besides, library catalog users often don't see the full record with all of the subject headings. Even in the LC catalog subject headings are not included in the default display. We can't blame the users if they don't know what we don't help them know.

Monday, October 03, 2011

Organizing knowledge

At the LITA forum on Saturday I stated that classification and knowledge organization seem to have fallen off the library profession's radar. (LITA2011 keynote.) We have spent considerable amounts of time and money on making modifications to our cataloging rules (four times in about fifty years), but the discussion of how we organize information for our users has waned. I can illustrate what is at least my impression of this through some searches done against Google Books using its nGram service.

"Library classification" peaks around 1960, and drops off rapidly. (The chart ends at 2000.)

[Chart: Library classification]

[Chart: Faceted classification]

Faceted classification has a meteoric rise around the 1960's, but falls abruptly from 1970 to 1980. The rise possibly corresponds closely to the activities of the Classification Research Group, based in the UK, whose big interest was in faceted classification.

[Chart: Decimal Classification]

The decimal classifications, most likely both Dewey and Universal, rise steadily up until the mid-1960's then begin a steep decline.

[Chart: Keyword searching]

Keyword searching comes along slowly in the 1960's and 70's then takes off from 1980 to 2000. Today, as we know, it's basically the only kind of information retrieval being discussed.

Knowledge organization also has a steady rise through the 1970's and 80's, and seems to reach a peak that continues up to recent times.

This is hardly a scientific study, but it illustrates what my gut was telling me, which was that keyword searching has essentially replaced any kind of classed access. That does make me wonder what is being discussed under the rubric of "knowledge organization." Keyword indexing, per se, does not do any organization of knowledge; there are no classes or categories, no broader concepts or narrower concepts, no direction toward similar topics. It also has no facets, at least none based on the topic of the resource, only on its descriptive properties (date of publication, format, domain).

Keyword searching is not organized knowledge. Any topical organization takes place after retrieval by the searcher, who must look through the retrievals and select those that are relevant. This in part explains why Wikipedia is the perfect complement to keyword searches: Wikipedia is organized knowledge. A keyword search can pull up a Wikipedia page that will provide context, disambiguation, and pointers to related topics. I find increasingly that I begin my searches in Wikipedia when my searches are topical, leaving Google to function as my "internet phone book" when I need to find a specific person, company, product or document.

It makes sense for us to ask now: is there any reason (other than shelf placement) to continue library classification practices? Keep your eyes on this space for more about that.

Added note: Richard Urban offers this nGram view comparing all of the library classification phrases with the term "Ontologies":

As @repoRat tweeted: "Karen Coyle makes air whoosh out of my lungs. bit.ly/nArBBh Perhaps classification to be replaced by relationship metadata?" That's a distinct possibility, and we'd better get cracking on that! Many "ontologies" out there today are simple term lists, and few of them seem to have relationships that you can follow productively. What really excites me is the possibility of relationships that we haven't explored in the past, both between concepts and between resources; all of the "based on" "responds to" "often appear together" -- and lots more that my brain isn't sharp enough to even imagine.