Thursday, October 18, 2012

Is Linked Data the Answer?

I recently gave keynote talks at Dublin Core 2012 and Emtacl12 with the title "Think 'Different'." Since the slides of my talks don't generally have much text on them, I wrote up the talk as a document. The document has a kind of appendix covering the point in my presentation where I took advantage of my position on stage to ask and answer what I think is a common question: Is linked data the answer?

Many would expect me to answer "yes" to this question, but my answer is a bit more complex. Linked data is a technology that I believe we will make use of to connect library data to other information resources. That's what the "linked" in linked data is all about -- creating a web of information by connecting bits of data in different documents and datasets. However, we have to be very cautious about having "an answer." When you have an answer you tend to stop looking at the questions that arise, and you also tend to ignore questions that aren't going to be solved by the answer you have chosen. There is no technology that will do everything that we need, so while linked data can be useful for some things we may need to do, it cannot be the answer to all of our technical requirements.

Note that I describe linked data as "connecting bits of data." The origin of the Semantic Web is in the need and desire to make actionable the data that today is essentially hidden within the text of documents. For example, if I say:

"My name is Karen. I will be holding a webinar on June 4 at 3:00 Pacific time for anyone who wants to learn about my paperweight collection."

That's text. There is interesting information in there, but it isn't available for any computational uses. The Semantic Web, as implemented through linked data, would make that information actionable. There are various ways to do this; one is the use of microformats, which mark up data within a document. This could look something like:

<p>My name is <span class="name">Karen</span>. I will be holding a <span class="event">webinar</span> on <span class="datetime" title="2012-06-04T03:00-09:0000">June 4 at 3:00 Pacific time</span> for anyone who wants to learn about my <span topic="paperweights">paperweight</span> collection.</p>

This text now also has bits of data that can be used for various purposes, including linking. The linking capabilities in this particular example are low, but some additional information, like standard identifiers for the person and for the topic, would then increase the linkability of this data.

<p>My name is <span class="name" id="http://viaf.org/viaf/48369992/">Karen</span>. I will be holding a <span class="event">webinar</span> on <span class="datetime" title="2012-06-04T03:00-09:0000">June 4 at 3:00 Pacific time</span> for anyone who wants to learn about my <span topic="paperweights" id="http://id.loc.gov/authorities/subjects/sh85097666.html">paperweight </span> collection.</p>

This isn't a perfect example, but I wouldn't claim that we're heading toward perfect data. What we need is to get more out of the information we have. 
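
Today this kind of markup would more likely be written in one of the standard syntaxes. Here is a sketch of the same sentence in schema.org microdata; the itemtype and itemprop terms are real schema.org vocabulary, but the event itself is still my invented example:

  <p itemscope itemtype="http://schema.org/Event">
    My name is <span itemprop="performer">Karen</span>. I will be holding a
    <span itemprop="name">webinar</span> on
    <time itemprop="startDate" datetime="2012-06-04T15:00-07:00">June 4 at 3:00 Pacific time</time>
    for anyone who wants to learn about my
    <span itemprop="about">paperweight</span> collection.
  </p>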

I perceive an assumption in the library linked data movement that what the Web needs (because linked data is data on the Web) is our bibliographic data. I disagree. The Web is awash in bibliographic data -- from Amazon to Google Books, from fan sites like IMDb and MusicBrainz, and from sharing sites like LibraryThing and GoodReads. Libraries may have some unique bibliographic data, but most of what we have would duplicate what is already there, many times over.

There's also the fact that much bibliographic data isn't DATA in the linked data sense. For the most part it doesn't consist of actionable data elements. In fact, a bibliographic record is more like a structured document: it mainly holds text, and that text is meant to be displayed to humans. It is possible to extract actual data (dates of publication, numbers of pages, various identifiers), but the text itself is a large part of the point of bibliographic data.

What this means for us in libraries is that we shouldn't be thinking that linked data will replace bibliographic data. Instead, it will encode the aspects of bibliographic data that give us the most and the best links.

Then we need to ask: why are we linking? What will we get? Well, we can get connections between books and maps, between books and documents, and between search retrievals and libraries. The latter especially interests me. Google is experimenting with structured data markup, in particular the schema.org vocabulary that it is fostering along with Yahoo!, Bing, and Yandex (the Russian search engine). Schema.org markup allows a search engine like Google to enrich its snippets with more than just a block of text from the page; Google's Webmaster pages on Rich Snippets show examples of this.
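
To give a flavor of what drives such a snippet, here is a minimal sketch of schema.org microdata for a book; the title, author, and rating values are invented for illustration:

  <div itemscope itemtype="http://schema.org/Book">
    <span itemprop="name">The Glass Paperweight</span> by
    <span itemprop="author">A. Author</span>
    <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
      Rated <span itemprop="ratingValue">4.2</span>/5 based on
      <span itemprop="ratingCount">38</span> ratings
    </div>
  </div>

With markup like this in the page, the search engine can display the author and the star rating right in the search result rather than just an excerpt of text.
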
Below is my conceptualization of what we could do with library data. The bibliographic data, as I've said before, often already exists on the Web and we may not be helping things by adding many more duplicate copies of that data. But what we have in libraries that no one else has is library holdings data. We know where Web users can find "stuff" in their local community. If that could be linked to the Web, a future rich snippet might look like:
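
Here is a sketch of what the underlying markup might look like. Schema.org does not have an agreed vocabulary for library holdings, so this borrows the generic Offer pattern, with the library standing in as the "seller"; all of the names and values are invented:

  <div itemscope itemtype="http://schema.org/Book">
    <span itemprop="name">The Glass Paperweight</span>
    <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
      <div itemprop="seller" itemscope itemtype="http://schema.org/Library">
        <span itemprop="name">Anytown Public Library</span>
      </div>
      <link itemprop="availability" href="http://schema.org/InStock" />
      On the shelf at your local library
    </div>
  </div>

The snippet could then tell the searcher not only what the book is, but that a copy is available a few blocks away.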

Obviously there are steps to be taken to make this possible, but if you want to think about how library data might fit into the Web of data that information seekers make use of millions or billions of times a day, this is one option. It's a start, and it uses data we already have.

You can take a look at the schema.org data that is created for WorldCat records simply by doing a WorldCat search and scrolling down to the section called "Linked data." The number of holdings is included (and this in itself is something that might interest Google as a measure of popularity). Making the link to the holdings of an actual library, and making that possible for all libraries, not just OCLC member libraries, is something I consider a worthy experiment for linking library data.
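
I won't copy WorldCat's actual markup here, and its syntax differs from the sketches above, but the statements exposed in that "Linked data" section have roughly this shape, paraphrased in microdata with invented values:

  <div itemscope itemtype="http://schema.org/Book">
    <span itemprop="name">An Example Title</span>
    <span itemprop="isbn">9780000000000</span>
    <link itemprop="author" href="http://viaf.org/viaf/48369992/" />
    <link itemprop="about" href="http://id.loc.gov/authorities/subjects/sh85097666" />
  </div>

The missing piece is a statement linking a description like this to holdings in an actual library, which is exactly the experiment I'm suggesting.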

Tuesday, October 16, 2012

Copyright Victories, Part II

I did a short factual piece for InfoToday on the Authors Guild v. HathiTrust decision that was issued last week. The Authors Guild brought the suit against HathiTrust because HathiTrust is storing copies of books, digitized by Google, that are still under copyright. Fortunately for HathiTrust, its partners, and all of us in libraries, the judge decided:
  • The digitization of books for the purposes of providing a searchable index is transformative, and therefore is a Fair Use under copyright law.
  • The provision of these search capabilities “promotes the Progress of Science and useful Arts” and thus supports the goals of U.S. copyright policy and law.
  • The provision of in-copyright texts for visually impaired students and researchers is in direct support of the Americans with Disabilities Act.
The decision in the case of the Authors Guild v. HathiTrust echoes some of the same thinking as the GSU case, in particular on the educational and research use of intellectual property. This case hinged on the use of the digitized texts for indexing rather than for reading. The judge determined that the books in HathiTrust were not substitutes for the books on the library shelves, since they are not presented to users as texts to be read. The "transformation" of the readable texts into a searchable index that returns only page numbers and the number of times a term appears on each page results in a new product, not an imitation of the hard copy.

The judge decided this for HathiTrust, but the same question is being asked in the Authors Guild lawsuit against Google. There are some obvious differences between the two situations, however. First, unlike HathiTrust, Google is a for-profit company, so it loses points on the first factor of the fair use test:

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

Because Google is digitizing works primarily from university libraries, both HathiTrust and Google do well on the second factor:

(2) the nature of the copyrighted work;

Works of a creative nature (defined as "prose fiction, poetry, and drama") are given greater protection than works of fact. HathiTrust reports that only 9% of its digital collection meets the "creative" definition.

The third factor:

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole

would seem to go against HathiTrust (and Google), but the judge looked at the two primary uses for the digital texts -- keyword indexing and providing digital copies to members of the community with sight disabilities -- and determined that neither could be accomplished with anything less than a complete copy. If the transformation argument is made in the case against Google, this factor should come out the same way.

Factor four is about the effect on the market:

(4) the effect of the use upon the potential market for or value of the copyrighted work

This one is a bit tricky because presumably HathiTrust will point its users to the library-owned hard copies of the books, especially since many of the digitized books will be out of print and unavailable from publishers. Therefore there isn't much interaction with the market at all. The judge added that, if anything, the greater amount of discovery might lead to sales, though I wouldn't hold out much hope for that. The other use is to provide access to the blind; this is a non-market for print materials if there ever was one. Google, on the other hand, has partnered with publishers to sell digitized books as ebooks, and therefore the positive market argument should be stronger in its case if Google can show that previously out-of-print books can be sold through its service.

Not mentioned anywhere that I can find is the question of digital "photographs" of pages vs. OCR'd text. The suit and the decision blend these together as "a digital copy." Having seen some of the results of Google's digitization, I can say that the text resulting from the OCR can be quite lossy depending on the page layout (tables of contents in particular come out quite badly) and the quality of the original book. The OCR'd text is also the "transformation" part of the copying; the photographs of the pages are simply copies of the page and are by their nature human-readable substitutes for the page itself. The judge seems to consider these "transitory," but in fact they are quite solidly real, and are stored in the HathiTrust repository. I suspect it is these pictures of the pages that the Authors Guild fears will be pirated should HathiTrust be hacked, less so the OCR'd pages, which are unattractive plain text. However, HathiTrust was able to show the judge that it takes security quite seriously, and the Authors Guild was unable to demonstrate any quantifiable risk.

What is heartening in this decision is the judge's enthusiasm for the role of libraries in furthering science and knowledge, and his great admiration for HathiTrust's service to scholars and to the blind. His decision is both factual and moral: he refers to the "invaluable contribution to the progress of science and the cultivation of the arts that at the same time effectuates the ideas espoused by the ADA." We could not have hoped for a better advocate for digital libraries than Judge Harold Baer Jr.

Copyright Victories, Part I

While we await the results of the long-standing Google Books digitization copyright suit, libraries have won some important copyright battles. The first was the Georgia State University e-reserves case. The second is the recent decision regarding HathiTrust. I'll cover GSU in this post and HathiTrust in a subsequent one, since my comments are long.

Earlier this year the case of publishers vs. Georgia State University regarding its e-reserves program resulted in a win for GSU, as 69 out of 74 copyright infringement claims were denied by the judge. The case was about the provision of course readings in digital form. The suit was brought by three academic publishers (Oxford, Cambridge, and Sage) but was bankrolled by the Association of American Publishers and the Copyright Clearance Center. Most readings were individual book chapters, and these were digitized by the library. Students in the class were able to access them through a password-protected site. The judge decided that for all but a few of the works the use was a fair use, based on the nature of the use (educational) and the amount being used (one chapter). Unfortunately the judge also decided to enforce a "bright line" test of not more than 10% of the work (and the amount used of the works in question averaged 10.1%). As we know, bright lines are not in the spirit of fair use, yet the judge clearly needed some way to make her decisions.

One of the more striking things in the GSU case was that the publishers were unable to prove their ownership of the copyrights for over a third of the original items. They had published the books, but many of them were essentially anthologies, and they could not find the appropriate paperwork for all of the individual pieces. This lack of proof in essence revealed these particular documents to be orphan works, even though the publisher of the book itself is known and some of the books may have been in print. I suspect that if you were to require actual proof of rights ownership for books or journal articles, the number of orphans would grow considerably. This would be especially true for articles, at least based on my experience: journal publishers are rather casual about getting signed agreements, and I have often modified agreements through strike-outs, which were never contested.

This is just more evidence that our copyright system is a huge mess. Proving ownership requires expensive research (the Copyright Office charges $165 per hour) and often does not solidly determine who holds the rights at this moment in time. Most of our action around intellectual property rights is based on claims and suppositions, not facts, and we often act as if there were evidence of held rights even though we have no such proof. In contrast, the patent system is fully documented with descriptions, drawings, and references to other patents, although by its own admission the patent office has about a three-year backlog, and filing and researching patents is time-consuming and expensive.

Another interesting aspect of the GSU case is that some of the works being copied were not covered by any available licensing scheme. This is especially notable since the publisher plaintiffs were backed by the Copyright Clearance Center, presumably the agency one would turn to when seeking to license a work. Licensing is a relatively big business: CCC earns over $200 million per year. The publishers in this case each earn something shy of $500K per year in fees from CCC licensing. Much of that, however, comes from the commercial printing of course packs, not from direct educational institution use. The judge determined that revenue from permissions for electronic course content would amount to .00046 (about five one-hundredths of one percent) of the average net revenue for any one of the publishers.

Reading this, it becomes rather obvious that the move from traditional course packs, which are produced by commercial copy shops, to digital course readings, which are produced by the library or the professor, would mean a loss of revenue for the academically-oriented publishers. Course packs got slammed by the copyright holders not because copies were being made but because copies were being sold, with all profit going to the copy shops and none to the rights holders. In fact, in this lawsuit there were files on reserve that were never downloaded by students in the class, and the judge removed these titles from the suit because they were not read. This is an interesting answer to the question: "What if you make a copy and no one sees it?" Another way of wording this is: "Is it a copy if it has only been viewed by a computer?" With course packs, every student purchases every item in the course pack, and you have no idea whether any of those items are read. With digital copies, every download can be counted. Although a download does not guarantee that the item has been read by the downloader, it is a quantifiable use in the same way that the number of course packs printed is quantifiable. A file online does not seem, in and of itself, to be the same as a physical copy. This could have implications for library digitization projects, and relates to the decision in the Authors Guild v. HathiTrust case.

There are some gotchas to the use of a copyright licensing agency because of the inherent nature of US fair use law. When one approaches CCC to license a work, no fair use determination is made as part of that request; it is up to the requestor to decide whether a license is needed. There are annual licenses available for educational institutions that cover a set of materials. None of these licenses are needed if the use is a fair use, and for educational institutions many uses are indeed fair uses, in particular classroom use. The CCC annual educational license may therefore be paying for uses that require no payment under copyright law. In essence, the license can be seen as a kind of insurance policy against infringement claims, but it may not be money well spent. As the judge in the GSU case states (p. 66): "In the absence of judicial precedent concerning the limits of fair use for nonprofit educational uses, colleges and universities have been guessing about the permissible extent of fair use."

The decision itself runs to 350 pages, much of which is taken up with the rulings on the 74 documents in question. The judge does a very nice job of defining the nature of a work and why individual chapters are viable on their own as part of a course syllabus. The decision that 10% of a "work" is permissible makes me wonder, however, whether publishers won't see the light and begin digital publishing of individual chapters rather than creating book-length anthologies.

More legal analysis is available on James Grimmelmann's blog.

Saturday, October 06, 2012

Google and publishers settle

Recent news announcements state that the AAP and Google have settled their lawsuit. Essentially this is a formality rather than an actual change in status between Google and the publishers. As you probably recall, the AAP engaged in a lawsuit against Google, in partnership with the Authors Guild, in 2005, and five specific publishers were named in that suit. Since then the lawsuit has gone through two failed attempts to settle and a massive number of pages of legal parrying. Meanwhile, the publishers realized that Google provided them with a new sales opportunity, and about 40,000 of them have entered into agreements with Google in which the publisher provides (or allows Google to provide) a digital copy of the book, which then can be sold as a Google eBook. Each contract specifies the amount of the book that potential buyers can browse. This browsing is designed to prevent users from satisfying their needs without a purchase: although up to 20% of a digital book may be browsable, Google denies access to sequences of more than 9 pages in a row.

When the Authors Guild revived its suit against Google in late 2011, the AAP was notably absent from the list of plaintiffs. By then the publishers had made a satisfactory agreement with Google and had no interest in suing their business partner. This current settlement appears to be a pro forma legal action to terminate the previous lawsuit and, as far as I can tell, makes no change to the current status of the business relationship between Google and publishers.

One question remains: what, if anything, does this mean in terms of copyright? The contractual arrangements between Google and the publishers are standard business agreements that publishers enter into as representatives of the rights holder, through their contracts with that rights holder. Sometimes the publisher also holds the copyright, but that isn't the salient point here. So unless I am missing something, this agreement has absolutely no effect on the questions of copyright and digitization that some of us have been so eager to hear about. It's just business as usual.

[After-note: Some authors and journalists question the settlement and want details made public.]