Saturday, June 23, 2012

Europe and Library Linked Data

I have just returned from a conference on library linked data in Florence, Global Interoperability and Linked Data in Libraries.  The talks were fascinating and there is only a small amount that I can convey in a blog post, but I thought that a good way to start would be to highlight some differences I see in the European approach.

Cultural Heritage

The cultural timeline in Europe is on an entirely different scale from what we are used to in the US. The Library of Congress's "Historical Newspapers" collection covers 1836-1922. At the museum of the synagogue of Florence, the docent referred to an event in 1571 as "the first in modern times." We have history, but Europe has History. This means that there is a great emphasis on archives, manuscripts, and museums in all work done by cultural heritage organizations. Apart from one talk on Renaissance science, at no time during the conference was there discussion of "STM" materials (that's Science, Technology, Medicine) or of scholarly communication, both of which are often on programs in the US. (See talks on linked data and the Vatican library and Linked Heritage.) Note also that the shared European culture database, Europeana, uses linked data, which encourages all contributors to also move in that direction.

National Libraries

Closely connected to the view of cultural heritage is the role of the national library. In many countries, if not most, the national library's primary role is to conserve the written heritage of the nation. This fact could be used for better data sharing; for example, each national library could have the responsibility for its subset of name authorities, and the name file could be 'in the cloud.' (See talk by Malmsten.) Ditto for the cataloging of modern publications.

Government Data

Led by various European Union initiatives, there is currently a strong movement to make government data available. (Note: there is also an open government movement in the US, but it has less emphasis, as I see it, on the sharing aspects of releasing the data.) Government data from the member countries is needed to make possible the analyses needed for Europe-wide programs. Since this data must be shared and linked, providing it as linked data makes perfect sense. (See a catalog of government data catalogs, and talks by Morando, Moriando, Menduni.)

Rights and licenses

As countries decide to make their government data open, they must decide about rights, in particular data ownership. The European Union has recommendations for data rights, and individual countries are developing licenses that will be used for their released data. Because libraries are often government-funded institutions, these licenses will also apply to library data. This is not always a good thing. Some countries have released government data under an open license, but others, like the UK, Italy, and France, are using a license similar to CC-BY. The reasons for this have to do with the need to maintain provenance of government data, since that data often has an official role in decision-making. (I wonder if the addition of provenance to linked data will help, and that makes me wonder about provenance "spoofing" and how much of a problem that will be.) (See talk by Morando.)

International Standards

People talk about how insular countries like China and North Korea are, but I become deeply aware of how insular we are in the US when I attend meetings outside of this country. We are beginning here (finally) to pay more attention to the global network of libraries, but Europe follows IFLA standards "religiously," as well as EU standards for data sharing. The Web of Data as a global resource makes even more sense in the European context. This makes me wish I knew more about the remainder of the globe. 

My talk at the conference is here: English, italiano.

Friday, June 01, 2012

Google Books: TBD

The latest

The Google Book Search lawsuit is essentially back to square one. Judge Denny Chin has ruled on an important aspect of the post-(failed)settlement lawsuit of the Authors Guild against Google: Google's objection that the Authors Guild (AG) cannot represent all authors, since copyright must be determined on a case-by-case basis. (It has been widely noted that when it came to declaring the copying to be Fair Use, Google was happy to treat the works en masse, in direct contradiction of its response to this latest suit that a copyright suit claiming infringement would need to be individual.) Chin has ruled that the Authors Guild can proceed as representative of "authors" as a class. The class includes not only the members of the association, but all authors whose books were scanned by Google. This means that the AG suit against Google can move forward, and that sometime in (hopefully) the near future we will have a ruling on whether or not Google's scanning of books is within the guidelines of Fair Use.

Quick update

In about 2004, Google began scanning books in partnership with a handful of major libraries, stating its goal as creating search access to books in the same way that it provides search access to web pages. Google Book Search gave results looking much like those for Google's web search: minimal metadata and about three snippets from the book showing the context for the search terms. Google's claim was that copying the books solely for the purpose of search was a clear case of "fair use."

In 2005, the Authors Guild sued Google for copyright infringement in a class action lawsuit representing "authors" as a class. Shortly thereafter, the Association of American Publishers (AAP) brought its own lawsuit on behalf of publishers.

Scanning of books in the libraries continued through 2008 with no word about the lawsuit. Meanwhile, more libraries had been added to the program, and it is estimated that about 7 million books had been scanned. (The exact number is not known.)

In October of 2008, much to the surprise of nearly everyone, it was announced that Google had arrived at a settlement with the AG and the AAP. The settlement was far-reaching, and created a mechanism that would allow Google to scan out-of-print books and make them available for use or sale, returning revenues to the copyright holders. Copying continued.

The settlement received hundreds of responses, mostly negative, from authors and from publishers, in particular those not in the US. Initially the parties were sent back to revise the settlement to address certain concerns. They did so, but not to the judge's satisfaction: the settlement was rejected in court in March of 2011.

In late 2011, the Authors Guild (without the publishers) updated and revived its original lawsuit against Google, primarily demanding that all scanning stop. It also brought a suit against HathiTrust, the library-sponsored archival facility that houses many of the library books scanned by Google. The lawsuit claims that the copies in HathiTrust are not legal copies and demands that they be destroyed.

My opinion

I could imagine getting a fair use ruling based on the original definition of the project, which was the scanning of books solely for the purpose of allowing keyword searching on texts, with minimal metadata and a few short snippets shown to the public, although Google's for-profit status might have nixed such a decision. However, this was complicated by the active participation of the libraries, and by the fact that Google returned a copy of the digital scan (and sometimes also the OCR and the OCR "map" that carries the location of the text on the page) to the library that had offered the book. While the copying for search might be considered fair use, since no actual copies of the books are made available to anyone, the presentation of a copy to the libraries is a pretty clear act of copying.

The terms of the settlement were arrived at through negotiations between the parties, and included input from some of the library partners. It was during this time that library partner U of Michigan began planning the archive now known as HathiTrust. It is undoubtedly not a coincidence that such an archive was described in a fair amount of detail in the settlement document as a requirement for library archiving of their received digital copies. In addition, the settlement allowed for computational research on the corpus, something that would be of great benefit to researchers.

During the time between 2008 and 2011, when the parties presented the first settlement and then the amended settlement, there was a fair amount of optimism that the settlement would be accepted, and plans to engage in the terms of the settlement, including the creation of a special bureau to manage payments to copyright holders, went forward. Google appeared to be all-powerful, able to bend law and legislation in order to create an entirely new view of copyright and digitization.

With the rejection of the settlement, and this latest ruling that allows the AG lawsuit to go forward, the picture has entirely changed, but not necessarily for the better. Many were hoping that we would be able to digitize the entirety of our analog matter, increasing access and preservation capabilities.

Now we are facing the possibility not only that mass digitization may be declared in violation of copyright, at least in this instance, but also that the libraries may lose the digital copies of the items in their collections, and researchers will lose access to these items in HathiTrust and Google Book Search.

At the same time, for Google the lawsuit has become nearly moot. At the moment Google has tens of thousands of publishing partners that allow Google to index digital versions of their books and make them available for sale either in hard copy or as ebooks. Google could lose the out-of-print books in its collection for which it has no publisher agreement, but these books are not providing any revenue for Google.


