Friday, December 07, 2012

Invisible women 2: Cognitive Surplus

I've just read Clay Shirky's 2010 book Cognitive Surplus: Creativity and Generosity in a Connected Age. The short summary of the book is: since the 1950s people have had more and more leisure time. Until recently that leisure time was taken up with the passive activity of watching television. The Internet has given us the possibility to use our leisure time for social and creative activities, like creating Wikipedia, engaging in online discussion, and even creating lolcats.

Yet I have to ask: how could a smart, well-read professor write an entire book about what people do with their leisure time and not address the well-known and well-documented gender inequality in time available? The OECD did an entire report on what is called "unpaid labor":
"Most unpaid work is cooking and cleaning – on average 2 hours 8 minutes work per day across the OECD – followed by care for household members at 26 minutes per day. Shopping takes up 23 minutes per day across the OECD on average". Visualized, it looks like this:


(The left-hand column is minutes, and obviously not all countries are listed.  The full data is available as an Excel file.)

The Economist put it more bluntly (and I do think this image is unnecessarily demeaning, but I haven't found one with the same message that is more neutral):
If he had read Shirky's book, this fellow would have been sitting at his computer, updating Wikipedia pages or adding his cognitive surplus to a discussion group on health issues. But he still would have had more leisure time than a woman.

A social world based on "cognitive surplus" will be one that is not gender neutral. It will have more participation by males, and therefore will be socially skewed to the masculine -- at least until we have gender parity in taking care of the home, the children, the elderly, etc. That is something that I would expect an intelligent observer of society to notice. Not only notice, but to ponder: what does this tell us about the nature of the things being created with this cognitive surplus? Does this explain, in whole or in part, the masculine view of "hacking," the participation in Open Source, the gender nature of games and gaming? 

** I woke up this morning realizing something that is not addressed in the book and that I also hadn't mentioned: the relationship between leisure time and income level. I don't have any figures on that right now, but will do some investigation. My assumption is that leisure is not evenly distributed, and that the working poor have much less leisure time than the middle and upper classes.

Friday, November 23, 2012

Fair Use(-ful)

The beauty and the aggravation of Fair Use in US copyright law is that one cannot pre-define particular uses as "fair." The countries that have, instead, the legal concept of "Fair Dealing" have an enumerated set of uses that are considered fair, although there is obviously still some need for interpretation. The advantage to Fair Use is that it can be re-interpreted with the times without the need for modification of the law. As new technologies come along, such as digitization of previously analog works, courts can make a decision based on the same four factors that have been used for earlier technologies. However, until such a decision is made in a court of law, it isn't possible to be sure whether a use is fair or not.

We have recently seen a court case that decided that HathiTrust's use of digitized books to provide an index to those books is fair. There is another court case that will decide a similar question regarding Google's digitization of books for its Google Book Search. Note, however, that even if both of these are determined to be fair use, each is a particular situation in a particular context. Both organizations have developed their services in an attempt to meet what they judged to be the letter of the law, and yet there is a considerable difference in the services they provide.

HathiTrust stores copies of digitized books from the collections of member libraries. In this case, HT is not itself doing the digitization but is storing files for books mostly digitized by Google. A search in the full text database of OCR'd page images returns, for in-copyright items, the page numbers on which the terms were found, and the number of hits found on each page. There are no snippets and no view of the text unless the text itself is deemed to be out of copyright.

Google has a different approach. To begin with, Google has performed mass digitization of books (estimated at about 20 million) without first obtaining permission from rights holders. So the Google case includes the act of digitization, whereas the HathiTrust case begins with digital files obtained from Google. Therefore the act of digitizing was not a factor in that case. In terms of use of the digitized works, Google also provides keyword searching of the OCR'd digital images, but takes a different approach to the results viewable by the searchers. Google provides short (about 3-5 lines) snippets that show the search terms in context on a page.
Google, however, places specific restrictions to avoid letting users "game" the search to gain access to enough of the text to substitute for actually acquiring access to the book. Here is how Google describes this in its recent legal response:
"The information that appears in Google Books does not substitute for reading the book. Google displays no more than three snippets from a book in response to a search query, even if the search term appears many times in the book. ... Google also prevents users from view a full page, or even several contiguous snippets, by displaying only one snippet per page in response to a given search and by 'blacking' (i.e. making unable for snippet view in response to any search) at least one snippet per page and one out of ten pages in a book." p.8
Google also exempts some types of books, like reference works, cookbooks, and poetry, from snippet display entirely.

The differences in the results returned by these two services reflect the differences in their contexts and their goals. HathiTrust has member institutions and their authorized users. The collection within HathiTrust reflects the holdings of the member institutions' libraries which means that the authorized users should have access, either in their library or through inter-library loan, to the physical book that was scanned. The HathiTrust full text is a search on the members' "stuff." The decision to give only page numbers makes some sense in this context, although providing snippets to scholars might have been acceptable to the judge. The return of page numbers and full word counts within pages reflects, IMO, the interest in quantitative analysis of term use. It also gives scholars some idea of the weight the term has within the text.

Google's situation is different. Google has no institutions, no members, no libraries; it provides its service to the general public (at least to the US public). There is no reason to assume that all of the members of that public will have access to the hard copy of any particular digitized book. Google seems to have decided that promoting its service as having primarily a marketing function, with the snippets as "teasers," would mollify the various intellectual property owners. In its brief of November 9, Google reiterates that it does not put advertising on the Google Book Search results pages, nor does Google make any money off of its referrals to book purchasing sites.

So here are two organizations that have bent over backwards to stay within what they deemed to be the boundaries of fair use, and they have done so in significantly different ways. This means that the fair use determination of each of these could have different outcomes, and each will provide different clues as to how fair use is viewed for digitized works.

It of course bears mentioning that both of these solutions provide hurdles for users. The HathiTrust user who is searching on a term that could have more than one meaning ("iron" "dive" "foot") does not have any context to help her understand if the results are relevant. The Google user, on the other hand, gets some context but cannot see all of the results and therefore does not know if there are key retrievals among those that have been blocked algorithmically. A use that is "fair" within copyright law may not seem "fair" to the user who is doing research. It makes you wonder if our idea of "fair use" couldn't be extended to be fair but also "useful."

Related posts
http://kcoyle.blogspot.com/2012/10/copyright-victories-part-ii.html

Thursday, November 01, 2012

Turing's Cathedral, or Women Disappear

"She features significantly in computing historian George Dyson's book, Turing's Cathedral: The Origins of the Digital Universe, ISBN 978-0375422775."
From the Wikipedia article for Klara Dan von Neumann

Unfortunately, she features significantly mainly as von Neumann's wife, even though she also was "a pioneer computer programmer," as per the Wikipedia article. In fact, of the 35 women whose names are in the book's index, 24 are in the book as wives, including Klara. Klara is the only one who gets a full bio and a fair amount of ink. Much of the ink comes from her unfinished memoirs about her life as von Neumann's wife. She was also one of the primary programmers working on the ENIAC, and Dyson's book names her, along with her husband, as one of the first three people to program it (p. 104). Her work, however, is described as "help," one of the ways that women's activities are diminished in importance (men "do", women "help"):
"'With the help of Klari von Neumann,' says Metropolis, 'plans were revised and completed and we undertook to implement them on the ENIAC...'" p. 194
Yet she obviously provided more than "help." In fact, she invented:
"'Your code was described and was impressive,' von Neumann wrote to Klari from Los Alamos, discussing whether a routine she had developed should be coded as software or hardwired into the machine. 'They claim now, however, that making one more, 'fixed,' function table is so little work, that they want to do it. It was decided that they will build one, with the order soldered in." (p. 195)
Of the other women mentioned, one is a secretary, the other the manager of the cafeteria. The saddest story is that of Bernetta Miller, the fifth licensed woman pilot in the US, who was a demonstration pilot for an airplane company, volunteered for duty in WWI and was wounded, then became secretary to the director of the Institute for Advanced Study in Princeton. In the Dyson book, she is mainly remembered for her memoranda about dining room accounting, and for being fired by Oppenheimer. (pp. 91-92)

There are eight women, other than Klara, who are in the book in their professional positions. Three of them are mentioned in a single sentence as "computers," that is people (mainly women) who did the hard math by hand before the machine computers were up to the job. (see: http://en.wikipedia.org/wiki/Human_computer, and I highly recommend the books by Grier in the bibliography if you wish to learn of the sophistication of methods that were developed by the "girls.")

One woman, Mina Rees, is named twice as someone who was written to:
"... Goldstine had written to Mina Rees of the Office of Naval Research." p. 147
"... 'The best change for a real undersatnding of protein chemistry lies in the x-ray diffraction field,' he wrote to Mina Rees at the Office of Naval Research." p.229
Later there is a quote from a report that states,
"... was informed by Dr. Mina Rees and Colonel Oscar Maier, representing the Office of Naval Research, and the Air Material Command, respectively..." p. 321
In themselves these quotes are not important, but this is one of the few professional women who gets mentioned in the book, and this is all that is said about her. Dr. Mina Rees was an amazing character: "She earned her doctorate in 1931 with a thesis on "Division algebras associated with an equation whose group has four generators," published in the American Journal of Mathematics, Vol 54 (Jan. 1932), 51-65. Her advisor was Leonard Dickson." (Wikipedia article) At the time of these references she was head of the Mathematics Department at the Office of Naval Research.

There are some other minor mentions, like one of Meg Ryan in a parenthetical sentence about a named location that was later used in a movie, and one woman mathematician who was named with two male mathematicians in a single sentence. These obviously are not major characters in the book, and the book is wide-ranging with everything from Aldous Huxley to George Washington, also not major characters.

The real mystery woman is Hedvig Selberg.
"'... says Atle Selberg, whose wife, Hedi was hired by von Neumann on September 29, 1950, and remained with the computer project until its termination in 1958.'" p. 152
Later we get a short bio of her: born in 1919 in Transylvania, graduated with a master's degree in mathematics at the head of her class, and was the only family member to survive Auschwitz. She came to the U.S. and was hired to work on the first computer project. She seems to have worked closely with a Martin Schwarzschild on a complex model of stellar evolution that related to the radiation effects of the bomb that was being designed. Schwarzschild went on to fame, as did Selberg's husband, a mathematician. Hedvig didn't even rate an obituary in the big newspapers (nor a Wikipedia article), although she is mentioned in her husband's obit where he first marries her, then she dies (in 1995) and he remarries.
"His first wife, Hedvig Liebermann, a researcher at the institute and Princeton's Plasma Physics Laboratory, died in 1995." (NYT Aug 17, 2007)
(Note: Mina Rees did get a NY Times obit. )

Among the other striking aspects of this treatment of women (and this book isn't by any means unusual in this respect) is that women tend not to exist until they marry a man of interest, and then suddenly they appear on the scene. Men, on the other hand, have parents and educations and often interesting stories that are told in the book, both as character building and as bona fides. It is therefore a bit of a shock to learn in some aside that the wife has a PhD in "trans-sonic aerodynamics," as is the case with Kathleen Booth. (p. 133)

Admittedly the opportunities for women in science were very limited in the period being discussed in this book, the 1950s. However, the role of a historian is to go beyond the period's view of itself and tease out a deeper meaning from the privileged position of hindsight. I have read other histories of computing that also failed to notice that there were women involved in the invention of this field, but this one came out in 2012. Really, we didn't need another book on the topic written with male blinders. What a shame.

Suggested reading:
Noble, David F. The Religion of Technology: The Divinity of Man and the Spirit of Invention. 1st ed. New York: A.A. Knopf, 1997.
Mozans, H. J. Woman in Science; with an Introductory Chapter on Woman's Long Struggle for Things of the Mind. New York: D. Appleton and Company, 1913.
Toole, Betty A., and Ada King Lovelace. Ada, the Enchantress of Numbers: A Selection from the Letters of Lord Byron's Daughter and Her Description of the First Computer. 1st ed. Mill Valley, Calif.: Strawberry Press, 1992.
Grier, David Alan. When Computers Were Human. Princeton: Princeton University Press, 2005.



Thursday, October 18, 2012

Is Linked Data the Answer?

I recently gave keynote talks at Dublin Core 2012 and Emtacl12 with the title "Think 'Different'." Since the slides of my talks don't generally have much text on them, I wrote up the talk as a document. The document has a kind of appendix covering the point in my presentation where I took advantage of my position on stage to ask and answer what I think is a common question: Is linked data the answer?

Many would expect me to answer "yes" to this question, but my answer is a bit more complex. Linked data is a technology that I believe we will make use of to connect library data to other information resources. That's what the "linked" in linked data is all about -- creating a web of information by connecting bits of data in different documents and datasets. However, we have to be very cautious about having "an answer." When you have an answer you tend to stop looking at the questions that arise, and you also tend to ignore questions that aren't going to be solved by the answer you have chosen. There is no technology that will do everything that we need, so while linked data can be useful for some things we may need to do, it cannot be the answer to all of our technical requirements.

Note that I describe linked data as "connecting bits of data." The origin of the semantic web is in the need and desire to make actionable the data that today is essentially hidden within the text of documents. For example, if I say:

"My name is Karen. I will be holding a webinar on June 4 at 3:00 Pacific time for anyone who wants to learn about my paperweight collection."

That's text. There is interesting information in there, but it isn't available for any computational uses. The Semantic Web, as implemented through linked data, would make that information actionable. There are various ways to do this, and one is through the use of microformats which mark up data within a document. This could look something like:

<p>My name is <span class="name">Karen</span>. I will be holding a <span class="event">webinar</span> on <span class="datetime" title="2012-06-04T03:00-07:00">June 4 at 3:00 Pacific time</span> for anyone who wants to learn about my <span topic="paperweights">paperweight</span> collection.</p>

This text now also has bits of data that can be used for various purposes, including linking. The linking capabilities in this particular example are low, but some additional information, like standard identifiers for the person and for the topic, would then increase the linkability of this data.

<p>My name is <span class="name" id="http://viaf.org/viaf/48369992/">Karen</span>. I will be holding a <span class="event">webinar</span> on <span class="datetime" title="2012-06-04T03:00-07:00">June 4 at 3:00 Pacific time</span> for anyone who wants to learn about my <span topic="paperweights" id="http://id.loc.gov/authorities/subjects/sh85097666.html">paperweight</span> collection.</p>

This isn't a perfect example, but I wouldn't claim that we're heading toward perfect data. What we need is to get more out of the information we have. 

I perceive an assumption in the library linked data movement that what the Web needs (because linked data is data on the Web) is our bibliographic data. I disagree. The Web is awash in bibliographic data - from Amazon to Google Books, from fan sites like IMDB or MusicBrainz, and from sharing sites like LibraryThing and GoodReads. Libraries may have some unique bibliographic data, but most of what we have would duplicate what is already there, many times over.

There's also the fact that much of our bibliographic data isn't DATA in the linked data sense. For the most part it isn't made up of actionable data elements. In fact, bibliographic data is more like a structured document: it mainly has text, and that text is to be displayed to humans. It is possible to extract actual data (dates of publication, numbers of pages, various identifiers), but the text itself is a large part of the point of bibliographic data.

What this means for us in libraries is that we shouldn't be thinking that linked data will replace bibliographic data. It will encode the aspects of bibliographic data that will give us the most and the best links.

Then we need to ask: why are we linking? What will we get? Well, we can get connections between books and maps, between books and documents, and between search retrievals and libraries. This latter interests me especially. Google is experimenting with using microdata, in particular the schema.org markup that it is fostering along with Yahoo!, Bing, and Yandex (the Russian search engine). Schema.org markup allows a search engine like Google to enrich the snippets with more than just a block of text from the page. This is an example from the Google Webmaster pages on Rich Snippets:
Below is my conceptualization of what we could do with library data. The bibliographic data, as I've said before, often already exists on the Web and we may not be helping things by adding many more duplicate copies of that data. But what we have in libraries that no one else has is library holdings data. We know where Web users can find "stuff" in their local community. If that could be linked to the Web, a future rich snippet might look like:

Obviously there are steps to be taken to make this possible, but if you want to think about how library data might fit into the Web of data that information seekers make use of millions or billions of times a day, this is one option. It's a start, and it uses data we already have.

You can take a look at the schema.org data that is created for WorldCat records simply by doing a WorldCat search and scrolling down to the section called "Linked data." The number of holdings is included (and this in itself is something that might interest Google as a measure of popularity). Making the link to the holdings of an actual library, and making that possible for all libraries, not just OCLC member libraries, is something I consider a worthy experiment for linking library data.
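As a very rough sketch of what that experiment might look like in markup terms -- hypothetical only, using schema.org's Book and Offer types with a placeholder title, ISBN, and library name, and not anything WorldCat or Google actually emits today -- a catalog page exposing a holding might carry something like:

<!-- Hypothetical sketch: a book plus one library holding in schema.org microdata.
     Title, author, ISBN, and library name are placeholders, not real data. -->
<div itemscope itemtype="http://schema.org/Book">
  <span itemprop="name">A Sample Book Title</span> by
  <span itemprop="author">Jane Q. Author</span>
  (ISBN <span itemprop="isbn">0000000000</span>)
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    Borrow it from
    <span itemprop="seller" itemscope itemtype="http://schema.org/Library">
      <span itemprop="name">Anytown Public Library</span>
    </span>
    <!-- availability reuses the schema.org ItemAvailability values -->
    <link itemprop="availability" href="http://schema.org/InStock"/>
    (on the shelf now)
  </div>
</div>

Whether Offer and seller are really the right properties for a lending relationship is debatable; the point is only that holdings and availability can be expressed in the same vocabulary that the search engines already consume.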

Tuesday, October 16, 2012

Copyright Victories, Part II

I did a short factual piece for InfoToday on the Authors Guild v. HathiTrust decision that was issued last week. The Authors Guild brought the suit against HathiTrust because HathiTrust is storing copies of books, digitized by Google, that are still under copyright. Fortunately for HathiTrust, its partners, and all of us in libraries, the judge decided:
  • The digitization of books for the purposes of providing a searchable index is transformative, and therefore is a Fair Use under copyright law.
  • The provision of these search capabilities “promotes the Progress of Science and useful Arts” and thus supports the goals of U.S. copyright policy and law.
  • The provision of in-copyright texts for visually impaired students and researchers is in direct support of the Americans With Disabilities Act.
The decision in the case of the Authors Guild v. HathiTrust echoes some of the same thinking as the GSU case, in particular on the educational and research use of intellectual property. This case hinged on the use of the digitized texts for indexing rather than for reading. The judge determined that the books in HathiTrust were not substitutes for the books on the library shelves, since they are not presented to users as texts to be read. The "transformation" of the readable texts to a searchable index that returns only page numbers and the number of times a term appears on the page results in a new product, not an imitation of the hard copy.

The judge decided this for HathiTrust, but this is the same question that is being asked in the Authors Guild lawsuit against Google. There are some obvious differences between the two situations, however. First, unlike HathiTrust, Google is a for-profit company, so it loses points on the first factor of the fair use test:

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
Because Google is digitizing works primarily from university libraries, both HathiTrust and Google do well on the second factor:

(2) the nature of the copyrighted work;
 Works of a creative nature (defined as "prose fiction, poetry, and drama") are given greater protection than works of fact. HathiTrust reports that only 9% of its digital collection meets the "creative" definition.

The third factor:

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole

would seem to go against HathiTrust (and Google), but the judge looked at the two primary uses for the digital texts, keyword indexing and providing digital copies to members of the community with sight disabilities, and determined that they could not be done with anything less than a complete copy. If the argument of transformation is made in the case against Google, this factor should be the same.

Factor four is about the effect on the market:

(4) the effect of the use upon the potential market for or value of the copyrighted work

This is a bit tricky because presumably HathiTrust will point its users to the library-owned hard copies of the book, especially since many of the digitized books will be out of print and not available from publishers. Therefore there isn't much interaction with the market at all. The judge added that, if anything, the greater amount of discovery might lead to sales, but I wouldn't hold out much hope for that. The other use is to provide access to the blind; this is a non-market for print materials if there ever was one. Google, on the other hand, has partnered with publishers to sell digitized books as ebooks, and therefore the positive market force should be stronger in that case if Google can show that previously out-of-print books can be sold through its service.

Not mentioned anywhere that I can find is the question of digital "photographs" of pages vs. OCR'd text. The suit and the decision blend these together as "a digital copy." Having seen some of the results of Google's digitization I can say that the text resulting from the OCR can be quite lossy depending on the page layout (tables of contents in particular come out quite badly) and the quality of the original book. The OCR'd text is also the "transformation" part of the copying, since the photographs of the pages are simply copies of the page and are by their nature human-readable substitutes for the page itself. The judge seems to consider these "transitory" but in fact they are quite solidly real, and are stored in the HathiTrust repository. I suspect it is these pictures of the pages that the Authors Guild fears will be pirated should HathiTrust be hacked, less so the OCR'd pages, which are unattractive plain text. However, HathiTrust was able to show the judge that it takes security quite seriously, and the Authors Guild was unable to demonstrate any quantifiable risk.

What is heartening in this decision is the judge's enthusiasm for the role of libraries in furthering science and knowledge, and his great admiration of HathiTrust's service to scholars and to the blind. His decision is both factual and moral: he refers to the "invaluable contribution to the progress of science and the cultivation of the arts that at the same time effectuates the ideas espoused by the ADA." We could not have hoped for a better advocate for digital libraries than Judge Harold Baer Jr.

Copyright Victories, Part I

While we are awaiting the results of the long-standing Google Books digitization copyright suit, there have been some important copyright battles libraries have won. The first was the Georgia State University digital reserves. The second is the recent decision regarding HathiTrust. I'll cover GSU in this post, and HathiTrust in a subsequent one, since my comments are long.

Earlier this year the case of publishers vs. Georgia State University regarding their e-reserves program resulted in a win for GSU as 69 out of 74 copyright infringement claims were denied by the judge. The case was about the provision of course readings in digital form. The suit was brought by three academic publishers (Oxford, Cambridge and Sage) but was bankrolled by the Association of American Publishers and the Copyright Clearance Center. Most readings were individual book chapters, and these were digitized by the library. Students in the class were able to access these through a password-protected site. The judge decided that for all but a few of the works the use was a fair use based on the nature of the use (educational) and the amount being used (one chapter). Unfortunately the judge also decided to enforce a "bright line" test of not more than 10% of the work (and the amount used of the works in question averaged 10.1%). As we know, bright lines are not in the spirit of fair use, yet the judge clearly needed some way to make her decisions.

One of the more striking things in the GSU case was that the publishers were unable to prove their ownership of the copyrights for over a third of the original items. They had published the books, but many of them were essentially anthologies and they could not find the appropriate paperwork for all of the individual pieces. This lack of proof in essence revealed these particular documents to be orphan works, even though the publisher of the book itself is known and some of the books may have been in print. I suspect that if you were to require actual proof of rights ownership for books or journal articles that the number of orphans would grow considerably. This would be especially true for articles, at least based on my experience: journal publishers are rather casual about getting signed agreements, and I have often modified agreements through strike-outs which were never contested.

This is just more evidence that our copyright system is a huge mess. Proving ownership requires expensive research (the copyright office charges $165 per hour) and often does not solidly determine who holds the rights at this moment in time. Most of our action around intellectual property rights is based on claims and suppositions, not facts, and we often act as if there were evidence of held rights even though we have no such proof. In contrast, the patent system is fully documented with descriptions and drawings and references to other patents, although by its own admission the patent office has about a 3-year backlog, and filing and researching patents is time-consuming and expensive.

Another interesting aspect of the GSU case is that some of the works being copied were not covered by any available licensing scheme. This is especially interesting since the publisher plaintiffs were backed by the Copyright Clearance Center, presumably the agency that one would turn to when desiring to license a work for use. Licensing is a relatively big business: CCC earns over $200 million per year. The publishers included in this case each earn something shy of $500K per year in fees from CCC licensing. Much of that, however, comes from the commercial printing of course packs, not from direct educational institution use. The judge determined that the percentage of publisher revenue from electronic course content would be .00046 (five one-hundredths of one percent) of the average net revenue for any one of the publishers.

Reading this it becomes rather obvious that the move from traditional course packs, which are produced by commercial copy shops, to digital course readings, which are produced by the library or the professor, would mean a loss of revenue for the academically-oriented publishers. Course packs got slammed by the copyright holders not because copies were being made but because copies were being sold and all profit was going to the copy shops, none to the rights holders. In fact, in this lawsuit there were files on reserve that were never downloaded by students in the class, and the judge removed these titles from the suit because they were not read. This is an interesting answer to the question: "What if you make a copy and no one sees it?" Another way of wording this is: "Is it a copy if it has only been viewed by a computer?" With course packs, every student purchases every item in the course pack and you have no idea if any of those are read. With digital copies, every download can be counted. Although a download does not guarantee that the item has been read by the downloadee, it is a quantifiable use in the same way that the number of course packs printed is quantifiable. A file online does not seem, in and of itself, to be the same as a physical copy. This could have implications for library digitization projects, and relates to the decision in the Authors Guild v. HathiTrust case.

There are some gotchas to the use of a copyright licensing agency because of the inherent nature of the US fair use law. When one approaches CCC to license a work there is no fair use determination that is made as part of that request. It is up to the requestor to decide whether a license is needed or not. There are annual licenses available for educational institutions that cover a set of materials. None of these licenses are needed if the use is a fair use, and for educational institutions many uses are indeed fair uses, in particular classroom use. Therefore the CCC annual educational license may be paying for uses that do not require payment under copyright law. In essence, the license may be seen as a kind of insurance policy against infringement claims, but it may not be money well-spent. As the judge in the GSU case states (p. 66) "In the absence of judicial precedent concerning the limits of fair use for nonprofit educational uses, colleges and universities have been guessing about the permissible extent of fair use."

The decision itself runs to 350 pages, much of which is taken up with the decisions about the 74 documents in question. The judge does a very nice job of defining the nature of a work, and why individual chapters are viable on their own as part of a course syllabus. The decision that 10% of a "work" is permissible makes me wonder, however, if publishers won't see the light and begin digital publishing of individual chapters rather than creating book-length anthologies. 

More legal analysis is available on James Grimmelmann's blog.

Saturday, October 06, 2012

Google and publishers settle

Recent news announcements state that the AAP and Google have settled their lawsuit. Essentially this is a formality rather than an actual change in status between Google and the publishers. As you probably recall, the AAP engaged in a lawsuit against Google, in partnership with the Authors Guild, in 2005, and five specific publishers were named in that suit. Since then the lawsuit has gone through two failed attempts to settle and a massive number of pages of legal parrying. Meanwhile, the publishers realized that Google provided them with a new sales opportunity and about 40,000 of them have entered into agreements with Google in which the publisher provides (or allows Google to provide) a digital copy of the book, which then can be sold as a Google eBook. Each contract specifies the amount of the book that potential buyers can browse. This browsing is designed to prevent users from satisfying their needs without a purchase: although up to 20% of a digital book may be browsable, Google denies access to sequences of more than 9 pages in a row.

When the Authors Guild revived their suit against Google in late 2011, the AAP was notably absent from the list of plaintiffs. By then the publishers had made a satisfactory agreement with Google and had no interest in suing their business partner. This current settlement appears to be a pro forma legal action to terminate the previous lawsuit, and, as far as I can tell, makes no change to the current status of the business relationship between Google and publishers.

There is one question that remains which is what this means in terms of copyright, if anything. The contractual arrangements between Google and the publishers are standard business agreements which the publishers engage in as representatives of the rights holder through their contract with that rights holder. Sometimes the publisher also holds the copyright, but that isn't the salient point here. So unless I am missing something, this agreement has absolutely no effect on the questions of copyright and digitization that some of us have been so eager to hear about. It's just business as usual.

[After-note: Some authors and journalists question the settlement and want details made public.]

Monday, September 24, 2012

Library signage

After all of the hoopla about libraries converting to BISAC bookstore categories instead of using the Dewey Decimal System, a trip to Barnes and Noble one day last week made me wonder if it's really the categories that matter, or if it's all about the signage.
Here's some recent signage at Barnes and Noble:


Here's what the library signage in my local library looks like:


Which do you think is understood best by the people who step into those institutions?
I've referred to library cataloging as "the secret language of twins," understood by a small in-crowd and completely unknown to others. This library signage is even worse than that; it's as if the library decided to encrypt its subject access, and won't let the users have the key. There is no copy of DDC in the library for users to consult. (I know this because I looked for it.) You can get to a place on the shelf by doing a search in the catalog, but you can't find out what the numbers mean, and there is no natural language translation given in the library, other than "Fiction" and "Non-Fiction" over the doors to the main shelf areas.

How could this possibly be seen as functional?

Tuesday, September 18, 2012

Seven years, and waiting

Just a quick note to say that the Google Book Search lawsuit has entered a new "pause" and will thus be delayed further. Among the salvos that Google and the Authors Guild have fired back and forth are those around the question of whether the Authors Guild can legally represent all authors against Google. As I recall, the Authors Guild has something on the order of 5,000 members, all alive (as far as I know), while Google's digitization now covers somewhere upwards of 10 million books and probably nearly as many authors, from the days of early printed works to the present.

Google challenged the Authors Guild as class representative of all authors, but Judge Denny Chin, the judge who has seen this case through most of its life, allowed the suit to go forward with the Guild as class representative. This has been reversed by an appeals court judge, which means that the question of class representation must be decided before the suit can continue.

Friday, September 14, 2012

Rich snippets

At the recent Dublin Core annual meeting I heard Dan Brickley talk about Google's use of schema.org for rich snippets. Schema.org is commonly thought of as "search engine optimization" (SEO), which to most people means "how to get onto the first page of a Google results search." But the microdata in web sites can also be used to make the snippets shown more useful by incorporating more information from the web page. The examples above, from the Google rich snippets page, show features like ratings as well as links to actual content within the web page.

Now that WorldCat has schema.org markup, my first thought was: what kind of rich snippet would be good for library data? There is a rich snippet testing tool where you can plug in a URL and see 1) the snippet and 2) what microdata is visible to Google. You can plug in a WorldCat permalink and see what the rich snippet result is:

http://www.worldcat.org/oclc/874206  (opens in separate window)

There is no rich snippet displayed here, which tells us that Google hasn't yet developed a rich snippet model for our kind of data. But you can see, in great detail, all of the coded data that is available. (The red warnings indicate that there is data in the OCLC microdata that isn't part of schema.org. OCLC is talking to the schema.org developers to incorporate new elements, some of which show up as warnings here.)
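If you don't want to run the tool yourself, the general shape of that markup is roughly like the fragment below. This is a simplified, hypothetical sketch with placeholder values, not OCLC's actual output (which is richer and includes the non-schema.org extensions mentioned above):

<!-- Simplified, hypothetical sketch of schema.org Book markup on a catalog record page.
     All values are placeholders. -->
<div itemscope itemtype="http://schema.org/Book">
  <h1 itemprop="name">A Sample Book Title</h1>
  by <span itemprop="author">Jane Q. Author</span>.
  <span itemprop="publisher">Example Press</span>,
  <span itemprop="datePublished">1972</span>.
  <span itemprop="numberOfPages">243</span> p.
  <link itemprop="bookFormat" href="http://schema.org/Hardcover"/>
  Subject: <span itemprop="about">Paperweights</span>
  ISBN: <span itemprop="isbn">0000000000</span>
</div>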

I began to think about how I would like this data used. It could be used to format a more bibliographic display, adding author, publisher, and pagination. The ISBN could of course link to key online bookstores. (That would also bring in revenue for Google, so might be a popular choice for the search engine.) But what about libraries? How could rich snippets help libraries and library users?

The snippet could  lead back to WorldCat where the user could find a nearby library, but... wait! Google often knows your approximate location, and WorldCat knows whether libraries in your area have the book. AND the library catalog often has information about availability. I don't know how this data would interact with the WorldCat tool, but here's what I would like to see in the snippet:


This definitely goes beyond what "rich snippet" means today, but is not inconsistent with retrievals that pull data from multiple online sales outlets. In the sales model, Google's assumption is that the searcher wants to obtain (in FRBR-speak) the item, and therefore various outlets that could provide that item are listed. This same logic could apply to libraries, of course. Libraries are a local source of many of the same things that are sold online, so the obtain logic fits.

This analysis of mine obviously ignores the question of Google's economic incentive to provide library holdings, especially since those listings would be seen as competing with sales. I'm just dreaming here, doing the "what if" thing without the practical limitations.






Saturday, August 11, 2012

The success paradox

An article entitled "Study: Public Awareness Gap on Ebooks in Libraries" in the July/August 2012 issue of American Libraries reports that 62% of Americans polled did not know if their library (presumably their local public library) lends ebooks. This statement was followed by a quote from Molly Raphael about how libraries should increase public awareness of their services. I would naturally be inclined to agree with her except for the other statistics that were cited: 56% of those who do borrow ebooks were unable to borrow a particular book they were seeking, and 52% had encountered wait lists for books.

Given that data, you have to wonder what the results would be if libraries did make the public more aware of the availability of ebooks.

An institution with a fixed budget cannot afford to be too successful, or at least not successful in the sense of encouraging more use of its services. The more successful the institution is, the more it will fail its users because demand will overwhelm its ability to serve them. As we see in the ebook example, libraries are failing to serve well even the minority of people who are aware that they have ebooks to lend. What if everyone was made aware of the availability of "free" ebooks from the library? The ebook lending service would be worse from the user's perspective.

Whereas each purchase of a book makes money for the bookseller, each demand on the library results in a cost rather than a revenue gain. This is because the library is on a fixed income and the basis of the library's budget is only tangentially related to the number of 'customers' it serves. A library with an annual budget of, say, $500,000 has that amount to spend even if use of the library increases greatly during that year. Such an increase of use does not guarantee that the library's budget will be increased when the next fiscal year's budget is decided on by the governing authority (such as a city, state, or college). In fact, as we have seen in these hard economic times, increased use and decreased budgets can go hand-in-hand.

Because the library's model is to get more use out of a limited number of books (and DVDs and other items) by lending them sequentially to patrons, the direct result of high demand is an increase in the failure of the library to meet the demand, evidenced by long waiting lists for the book and patrons who are unhappy with the library's service. My local library in the city of Berkeley today has nine copies of "50 Shades of Grey" with 70 holds. This is in a small city with a population of about 120,000. The library of the city of Santa Clara in Silicon Valley, which is similar in size to Berkeley, also has nine copies, but 119 holds. New York Public Library has "Holds: 1657 on 131 copies."

In this sense, success -- that is, many people turning to the library with their desire to read a highly popular book -- is in fact the cause of failure; the failure of the library to meet that demand.

I come around to these thoughts when I find myself frustrated at the reluctance or inability of libraries to promote their services despite obvious opportunities to do so. Then I think about what it looks like inside a branch of my local library, with fewer and fewer staff available and the obvious strain, as evidenced by cart after cart of books that have been checked in but not yet returned to the shelves, by long lines to use the public computers despite the fact that those now take up a significant amount of floor space, and by shelf after shelf of sadly worn trade paperbacks on topics that were fleeting fads a decade or so ago.

Really, why would an institution so stretched in its resources want to stimulate more demand?

Libraries, of course, are not the only such institutions; few if any inner-city emergency rooms would consider it a good idea to stimulate the arrival of more patients. I've been in a higher end hospital emergency room in my town (fortunately very seldom) and even that had patients on gurneys in the hallways because every more appropriate space was filled. Again here, high demand promotes failure.

Libraries from their beginnings developed in response to scarcity. It would not be unreasonable to suggest that the library model based on scarcity is not well suited to the current climate of media abundance. Yet, any public institution that would base its services on scarce but rarely sought goods is on a suicide mission, especially in today's economic and social climate.

Is there a solution to this dilemma? In a perfect world (obviously not the one we live in) government and institutions of higher learning would recognize the value of a well-stocked, vibrant community information space. Instead, library budgets are being sharply cut, and library services are not perceived as having high value. If libraries have lost support among their traditional communities, it may have something to do with the current measures of success: number of items checked out being the primary one for public libraries; number of volumes owned for research libraries.

Whatever success looks like in the future, it simply cannot be based on increasing the number of holds on materials or on providing services that you hope only a few people will discover and make use of. Circulation figures should not be the main measure of the library's value; we need not only new services but new measures of the library's worth to the community. There is a pressing need for actual information services, not just the storage and circulation of items. There is also a desire for participation, as evidenced by social information networking like Wikipedia, LibraryThing, GoodReads, and others. Perhaps the answer lies in asking what the users can contribute to the library so that use adds value as well as incurs costs. Perhaps the library of today should serve its users by giving them a means to crowd-source solutions to their information problems, with the library contributing knowledge organization skills, its awareness of community needs, and a commitment to quality information services. Maybe the library of the future should be less about circulating books and DVDs and more about helping people make sense of the information glut that we live in; less about keeping up with global bestsellers and more about learning with and within the community.

Then maybe success can be success, not failure.


Sunday, July 29, 2012

Fair Use Déjà Vu

In its July 27 court filing,[1] Google has made the case for its fair use defense for the digitization of books in its Google Book Search (GBS) project. [2] As many of us have hoped, the case it makes appears strong. That it was necessary to throw libraries under the bus to achieve this is unfortunate, but I honestly do not see an alternative that wouldn't weaken the case a bit.

Fair Use is Fair

The argument that Google has made from the beginning of its book scanning project is that copying for the purpose of providing keyword access to full texts is fair use. They are fortunately able to cite case law to defend this, including case law allowing the copying of entire images by image search engines.

Among the reasons that they give for their fair use defense are:

1. Keyword search is not a substitute for the text itself. In fact, the copy of the text is necessary to provide a means for users to discover the existence of books and therefore for the books to fulfill their purpose of being read.
"Books exist to be read. Google Books exists to help readers find those books. Like a paper index or a card catalogue, it does not substitute for reading the books themselves..." (p. 2)

2. Google has elaborate protections in place to prevent users from reconstructing the text from its products. They reveal some of these protections, such as disabling snippet display for one instance of the keyword on each page, and disabling display of one page out of ten.
"One of the snippets on each page is blacklisted (meaning that it will not be shown). In addition, at least one out of ten entire pages in each book is blacklisted." (p. 10)
3. No advertising appears on the GBS pages. This implies that Google is not making any money that could be claimed by authors as being theirs.

4. The Authors Guild has no proof of harm that has come from the digitization of the books. It is suggested that a thorough study might show that there have been gains rather than losses in terms of book sales. Even the Authors Guild (the Plaintiff in this case) advises authors to provide some of the text of their books (usually the first chapter) for browsing in online bookstores, and many rights holders participate voluntarily in Amazon's "Look inside" feature that shows considerably more than the disputed snippets that are displayed in GBS. And Google notes that 45,000 (!) publishers have signed up to have their in-print books searchable in GBS, with varying amounts of text available to the searcher prior to purchase. This makes the case that search and some text display is good for authors, not harmful.

5. Digital copies of books have never been "distributed to the public" (key wording in the copyright law). Only the libraries themselves that held the actual hard copies could receive a copy of the files resulting from the digitization.

Of course, all of this is done citing court cases in support of these arguments. The Authors Guild undoubtedly has counter-cases to present.

Libraries Under the Bus

One of the key copyright-related arguments that Google makes is that its full text search within books provides a public service and support of research that is unprecedented. In making these claims Google decided to particularly emphasize its superiority to library catalogs. (Google refers multiple times to "card catalogues" which seems oddly antiquated, but perhaps that was the intent.)
"The tool is not a substitute for the books themselves -- readers still must buy a book from a store or borrow it from a library to read it. Rather, Google Books is an important advance on the card-catalogue method of finding books. The advance is simply stated: unlike card catalogues, which are limited to a very small amount of bibliographic information, Google Books permits full-text search, identifying books that could never be found using even the most thorough card catalog." (p.1) [sic uses of "catalogue" and "catalog" in the same paragraph.]
"Google Books was born of the realization that much of the store of human knowledge lies in books on library shelves where it is very difficult to find....Despite the importance of this vast store of human knowledge, there exists no centralized way to search these texts to identify which might be germane to the interests of a particular reader." (p. 4)
As a librarian, I have to say that this dismissal of the library as inadequate really hurts. Yet I believe that Google is expressing an opinion that is probably quite common among information searchers today. One could counter with many examples where the library catalog entry succeeds and GBS fails, but of course that wouldn't bolster Google's arguments here. A reasonable analysis would put the two methods (full text and standards-based metadata) as complementary.

Google also argues that it did not give copies of the digital files resulting from its scanning to the libraries. How this plays out is not only clever, but it shows some real foresight on Google's part. They developed a portal where the libraries could request that a copy of the files be made "on demand" for the library, using encryption specific to that library. The transmission of the files from Google to the libraries was then an act of the libraries, not of Google.
"Moreover, the undisputed facts show that it is the libraries that make the library copies, not Google, and that Google provides only a technological system that enables libraries to create digital copies of books in their collections. Under established Second Circuit precedent, Google cannot be held directly liable for infringement because Google itself has not engaged in any volitional act constituting distribution." (p. 33)
Clearly, Google designed the system (which goes by the acronym "GRIN") with this in mind.

I don't mind this, but wish that Google hadn't included a dig at HathiTrust as part of this argument. The document would not have suffered, in my opinion, if Google had left the parenthetical phrase off of this sentence:
"No library may obtain a digital copy created from another library's book -- even if both libraries own identical copies of that book (although libraries may delegate that task to a technical service provider such as HathiTrust)." (p. 15)
It's one thing to claim innocence, but another to point the finger at others.

Omissions

There are a few glaring omissions from the document, some of which would weaken Google's case.

There is no mention of the computational uses that can be made of the digital corpus, something that was a strong focus in the failed settlement between Google and the authors and publishers. I have no doubts that Google is currently engaged in research using this corpus -- I don't see how they could resist doing so. They do mention the "n-gram" feature briefly, but as this is based on what appears to be a simple use of term frequency, it may not attract the court's attention.

In another omission, Google states that:
"Informed by the results of a search of that index, users can click on links in Google Books to locate a library from which to borrow those books ... " (p. 4)
Google fails to state that this is not a service provided by Google but one provided by OCLC using exactly those card catalogues that Google finds so inadequate. Credit should be given where credit is due, but there is an important battle to be won.

Bottom Line

The ability to create full text searches of printed works (and other physical materials) is so important to research and learning -- and should be such an obvious modern approach to searching these materials -- that a win for Google is a win for us all. Although some aspects of this document shot arrows into my librarian-ly heart, I hope with all of that wounded heart that they prevail in this suit.


[1] This points to the Scribd site which unfortunately is now connected to Facebook and therefore is a huge privacy monster. The document should appear on the Public Index site shortly, with no login required.
[2] The term "product" could also be used to describe GBS.

Wednesday, July 25, 2012

Authorities and entities

In my previous post, I talked about the three database scenarios proposed by the JSC for RDA. These can be considered to be somewhat schematic because, of course, real databases are often modified for purposes of efficiency in searching and display, as well as to facilitate update. But the conceptual structures provided in the JSC document are useful ways to think about our data future.

There is one problem that I see, however, and that is the transition from authority control to entities. Because we have authority records for some of the same things that are entities in the entity-relationship model of FRBR, there seems to be a widespread assumption that an authority record is the same as an entity record. In fact, IFLA has developed "authority data" models for names and for subjects that are intended to somehow mix with the FRBR model to create a more complete view of the bibliographic description.

This may be a wholly misguided activity, because authority control and entities (in the entity-relationship sense) are not at all the same thing.

Library authority control, and the authority record that carries the information, has as its sole purpose providing the preferred heading that represents the "thing" being identified (person, corporate body, subject heading). It also provides non-preferred forms of the name or subject that a catalog user might include in a query for that thing. The rest of the information contained in the record exists solely in support of the process of selecting the appropriate string, including documentation of the resources used by the cataloger in making that decision. In knowledge organization thinking, this would be considered a controlled list of terms.

To understand what an entity is, one might use the WEMI entities as examples. An entity is indeed about some defined "thing," and it contains a description of that thing that fulfills one or more intended uses of the metadata. In the WEMI case, we can cite the four FRBR user tasks of find, identify, select, obtain. So if Work is an entity and contains all of the relevant bibliographic information about that Work, then Person is an entity and should contain all of the relevant information about that person. One such piece of information could be a preferred form of the person's name for display in a particular community's bibliographic data, although I could also make the argument that library applications could continue to make use of brief records that support cataloging and display of controlled text strings if that is the only function that is required. In fact, in the VIAF union database of authority data, the data is treated as a controlled list of terms, not unlike a list of terms for languages or musical instruments.

What would be a Person entity? It could, of course, be as much or as little as you would like, but it would be a description of the Person for your purposes. It is this element of description that I think is key, and we could think of it in terms of the FRBR user tasks:

find - would help users find the Person using any form of the name, but also using other criteria like: 19th century French generals; Swedish mystery writers; translators of the Aeneid.

identify - would give users information to help disambiguate between Persons retrieved. This assumes that there would be some amount of biographical information as well as categorization that let users know who precisely this Person entity represents.

select - this is where the Person entity would differ from traditional FRBR, which seems to assume that one is already looking for bibliographic materials at this step. I suppose that here one might select between Charles Dodgson and Lewis Carroll, whose biographical information is similar but whose areas of activity are entirely different.

obtain - this step would lead one to the library's held works by and/or about that Person, but it could also lead to further information, like web pages, entries in an online database, etc.

If you are wondering what a Person entity might look like, imagine a mashup between an entry in WorldCat Identities and Wikipedia. I suggest a mashup because Identities is limited to data already in bibliographic and authority records and therefore has little in the way of general biographical information. The latter is available, sometimes abundantly, in Wikipedia, and of course a link to that Wikipedia entry would be a logical addition to a library record for a Person entity.
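
To make that concrete, here is a rough sketch of what such a Person entity might contain. This is entirely my own invention -- the field names and links are illustrative, not a proposed standard:

# A hypothetical Person entity, sketched as a plain Python dict.
person_entity = {
    "preferred_name": "Carroll, Lewis, 1832-1898",          # display form for this community
    "variant_names": ["Dodgson, Charles Lutwidge"],          # forms a user might search (find)
    "dates": {"birth": "1832", "death": "1898"},
    "categories": ["English authors", "mathematicians"],     # supports searches like "Swedish mystery writers"
    "short_bio": "English writer of children's fiction; also a logician and photographer.",  # identify
    "links": {
        "wikipedia": "https://en.wikipedia.org/wiki/Lewis_Carroll",   # further information (obtain)
        "viaf": "http://viaf.org/viaf/XXXXXXXX",                      # placeholder, not a real identifier
    },
}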

What this thinking leads me to conclude is:

1) the library authority file is a term list, not a set of entities, and therefore is not the Person entity implied in FRBR
2) having person entities in our files could be a great service for our users, and it might be possible to create them to take the place of the simple term lists that our authority records now represent
3) the FRBR user tasks may need to be modified or reinterpreted to be focused less on seeking a particular document and more on seeking a particular person (agent) or subject

Monday, July 23, 2012

Futures and Options

No, I'm not talking about the stock market, but about the options that we have for moving beyond the MARC format for library data. You undoubtedly know that the Library of Congress has its Bibliographic Framework Transition Initiative that will consider these options. In an ALA Webinar last week I proposed my own set of options -- undoubtedly not as well-studied as LC's will be, but I offer them as one person's ideas.

It helps to remember the three database scenarios of RDA. These show a progressive view of moving from the flat record format of MARC to a relational database. The three RDA scenarios (which should be read from the bottom up) are

  1. Relational database model -- In this model, data is stored as separate entities, presumably following the entities defined in FRBR. Each entity has a defined set of data elements and the bibliographic description is spread across these entities which are then linked together using FRBR-like relationships.
  2. Linked authority files -- The database has bibliographic records and has authority records, and there are machine-actionable links between them. These links should allow certain strings, like name headings, to be stored only once, and should reflect changes to the authority file in the related bibliographic records.
  3. Flat file model -- The database has bibliographic records and it has authority records, but there is no machine-actionable linking between the two. This is the design used by some library systems, but it is also a description of the situation that existed with the card catalog.

These move from #3, being the least desirable, to #1, being the intended format of RDA data. I imagine that the JSC may not precisely subscribe to these descriptions today because of course in the few years since the document was created the technology environment has changed, and linked data now appears to be the goal. The models are still interesting in the way that they show a progression.
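
To make the difference between scenario #3 and scenario #2 concrete, here is a minimal sketch (my own illustration, not from the JSC document) of a heading stored as copied text versus a heading reached through a machine-actionable link; the identifier used as the link is made up:

# Scenario #3 (flat file): the heading string is copied into the bib record.
bib_flat = {"title": "Pride and prejudice", "author_heading": "Austen, Jane, 1775-1817"}

# Scenario #2 (linked authority file): the bib record carries only a link.
authorities = {"auth001": {"heading": "Austen, Jane, 1775-1817"}}   # "auth001" is a made-up key
bib_linked = {"title": "Pride and prejudice", "author_id": "auth001"}

def display_author(bib):
    # In the linked model the heading is resolved at display time, so a change
    # to the authority record is reflected everywhere it is used.
    if "author_heading" in bib:
        return bib["author_heading"]
    return authorities[bib["author_id"]]["heading"]

print(display_author(bib_flat), "|", display_author(bib_linked))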

I also have in mind something of a progression, or at least a set of three options that move from least to most desirable. To fully explain each of these in sufficient detail will require a significant document, and I will attempt to write up such an explanation for the Futurelib wiki site. Meanwhile, here are the options that I see, with their advantages and disadvantages. The order, in this case, is from what I see as least desirable (#3, in keeping with the RDA numbering) to most desirable (#1).

#3 Serialization of MARC in RDF

Advantages

  • mechanical - requires no change to the data
  • would be round-trippable, similar to MARCXML
  • requires no system changes, since it would just be an export format

Disadvantages

  • does not change the data at all - all of the data remains as text strings, which do not link
  • keeps library data in a library-only silo
  • library data will not link to any non-library sources, and even linking to library sources will be limited because of the profusion of text strings
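
As an illustration of why this option is so limited, here is a rough sketch (my own, using invented property names rather than any existing vocabulary) of what a direct serialization of a MARC record into triples might look like. Every object is just the original text string:

# One MARC record, reduced to a few subfields (the tags follow real MARC
# conventions, but the "marc:" property names are invented for the example).
record = {
    "100a": "Austen, Jane,",
    "100d": "1775-1817.",
    "245a": "Pride and prejudice /",
}

record_uri = "<http://example.org/bib/12345>"   # made-up record identifier
for tag, value in record.items():
    print(record_uri, f"<marc:{tag}>", f'"{value}"', ".")
# Every object is a quoted literal: nothing here can link to anything else,
# which is the central disadvantage listed above.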

#2 Extraction of linked data from MARC records

Advantages

  • does not require major library system changes because it extracts data from the current MARC format
  • some things (e.g. "persons") can be given linkable identifiers that will link to other Web resources
  • the linked data can be re-extracted as we learn more, so we don't have to get it right or complete the first time
  • does not change the work of catalogers

Disadvantages

  • probably not round-trippable with MARC
  • the linked data is entirely created by programs and algorithms, so it doesn't get any human quality control (think: union catalog de-duping algorithms)
  • capabilities of the extracted data are limited by what we have in records today, similar to the limitations of attempting to create RDA in MARC
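
Here is a rough sketch of the kind of extraction this option implies -- again my own illustration, with invented URIs and a toy lookup table standing in for a real matching service such as VIAF or id.loc.gov. The program pulls the name string out of the record and, where it can find a match, replaces it with a linkable identifier:

# The same toy record as in the option #3 sketch.
record = {"100a": "Austen, Jane,", "100d": "1775-1817.", "245a": "Pride and prejudice /"}

# A stand-in for a real identifier lookup; the URI is invented.
name_to_uri = {"Austen, Jane, 1775-1817": "<http://example.org/person/austen-jane>"}

def extract_person(rec):
    # Normalize the MARC string (strip trailing punctuation) before matching.
    name = f'{rec["100a"]} {rec["100d"]}'.strip(" .,")
    # Fall back to the literal string when no match is found -- this is where
    # the algorithmic, no-human-review nature of the extraction shows.
    return name_to_uri.get(name, f'"{name}"')

print("<http://example.org/bib/12345>", "<dct:creator>", extract_person(record), ".")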

#1 Linked data "all the way down", e.g. working in linked data natively

Advantages

  • gives us the greatest amount of interoperability with web resources and the most integration with the information in that space
  • allows us to implement the intent of RDA
  • allows us to create interesting relationships between resources and possibly serve users better

Disadvantages

  • requires new library systems
  • will probably instigate changes in cataloging practice
  • presumably entails significant costs, but we have little ability to develop a cost/benefit analysis
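
And one last sketch (my own, with invented example.org URIs; the DBpedia URI is a real one, included only to show the kind of external link that becomes possible) of what "all the way down" means: the description is created as linked data from the start, so the work, the agent, and the relationships between them are all identifiers, and some of them point outside the library silo:

triples = [
    ("<http://example.org/work/1>",   "<dct:title>",   '"Pride and prejudice"'),
    ("<http://example.org/work/1>",   "<dct:creator>", "<http://example.org/person/austen-jane>"),
    ("<http://example.org/person/austen-jane>", "<owl:sameAs>",
     "<http://dbpedia.org/resource/Jane_Austen>"),
]
for s, p, o in triples:
    print(s, p, o, ".")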

There is a lot behind these three options that isn't explained here, and I am also interested in hearing other options that you see. I don't think that our options are only three -- there could be many points between them -- but this is intended to be succinct.

To conclude, I don't see much, if any, value in my option #3; #2 is already being done by the British Library, OCLC, and the National Library of Spain; I have no idea how far in our future #1 is, nor even if we'll get there before the next major technology change. If we can't get there in practice, we should at least explore it in theory because I believe that only #1 will give us a taste of a truly new bibliographic data model.

Sunday, July 15, 2012

Friends of HathiTrust

I have written before about the lawsuit of the Author's Guild (AG) against HathiTrust (HT). The tweet-sized explanation is that the AG claims that the corpus of digitized books in the HathiTrust that are not in the public domain are infringements of copyright. HathiTrust claims that the digitized copies are justified under fair use. (It may be relevant that many of the digitized texts stored in HT are the result of the mass digitization done by Google.)

For analysis of the legal issues, please see James Grimmelman's blog, in particular his post summarizing how the various arguments fit into the copyright law's "four factors."

I want to focus on some issues that I think are of particular interest to librarians and scholars. In particular, I want to bring up some of the points from the amicus brief from the digital humanities and law scholars.

While scientists and others who work with quantifiable data (social scientists using census data, business researchers with huge amounts of stock market data, etc.) have been able to apply large-scale computation to their raw material, those working in the humanities, whose raw material is in printed texts, have not been able to make use of the massive data mining techniques that are moving other areas of research forward. If you want to study how language has changed over time, or when certain concepts entered the vocabulary of mass media, the physical storage of this information makes it impossible to run these studies as calculations, and the size of the corpus makes it very difficult, if not impossible, to do the research in "human time". Thus, the only way for the "Digital Humanities" to engage in modern research is after the digitization of their primary materials.

This presumably speaks to the first factor of fair use:

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

As Grimmelman says "The Authors Guild focuses on the corpus itself; HathiTrust focuses on its uses." It may make sense that scholars should be allowed to make copies of any material they need to use in their research, but I can imagine objections, some of which the AG has already made: 1) you don't need to systematically copy every book in every library to do your research and 2) that's fine, but can you guarantee that infringing copies will not be distributed?

It's a hard sell, yet it's also hard not to see the point of view of the humanities scholars who feel that they could make great progress (ok, and some good career moves) if they had access to this material.

The other argument that the digital humanities scholars make is that the data derived from the digitization process is not infringing because it is non-expressive metadata. Here it gets a bit confusing because, although they refer to the data derived from digitization as "metadata," the examples they give range from the digitized copies themselves, to a database where all of this is stored, to the output of Google n-grams. If the database consists of metadata, then the Google n-grams are an example of the use of that metadata, but are not an example of the metadata itself. In fact, the "metadata" produced from digitization is a good graphic copy of each page of the book, plus a word-for-word reproduction (with unfortunate but not deliberate imprecision) of the text itself. That this copy is essential for the desired research uses is undeniable, and the brief gives many good examples of quantitative research in the humanities. But I fear that their insistence that digitization produces mere "metadata" may not be convincing.

Here's a short version from the text:

"In ruling on the parties’ motions, the Court should recognize that text mining is a non-expressive use that presents no legally cognizable conflict with the statutory rights or interests of the copyright holders. Where, as here, the output of a database—i.e., the data it produces and displays—is noninfringing, this Court should find that the creation and operation of the database itself is likewise noninfringing. The copying required to convert paper library books into a searchable digital database is properly considered a “nonexpressive use” because the works are copied for reasons unrelated to their protectable expressive qualities; none of the works in question are being read by humans as they would be if sitting on the shelves of a library or bookstore." p. 2

They also talk about transformation of works; the legal issues here are complex, and my impression is that past legal decisions may not provide a clear path. They then end a section with this quote:

"By contrast, the many forms of metadata produced by the library digitization at the heart of this litigation do not merely recast copyrightable expression from underlying works; rather, the metadata encompasses numerous uncopyrightable facts about the works, such as author, title, frequency of particular words or phrases, and the like." (p.17)

This, to me, comes completely out of left field. Anyone who has done digitization projects is aware that most projects use human-produced library metadata for the authors and titles of the digitized works. In addition, the result of the OCR step of the digitization process is a large text file that is the text, from first word to last, in that order, and possibly a mapping file that gives the coordinates of the location of each word on each OCR'd page. Any term frequency data is a few steps away from the actual digitization process and its immediate output, and fits in perfectly with the earlier arguments around the use of datamining.
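
To put it another way, here is a rough sketch (my own illustration, with made-up text and coordinates) of what the immediate output of digitization looks like, and of the fact that term frequencies are a later computation over that output rather than something the scanning itself produces:

from collections import Counter

# The immediate output of digitization for one page: an image file (not shown),
# the OCR'd text in reading order, and word coordinates on the page.
ocr_page = {
    "page": 1,
    "text": "it is a truth universally acknowledged that a single man in possession of a good fortune",
    "word_boxes": [("it", (102, 240, 118, 258)), ("is", (124, 240, 136, 258))],  # (x1, y1, x2, y2)
}

# Term frequency is a separate step performed later over that text:
frequencies = Counter(ocr_page["text"].split())
print(frequencies.most_common(3))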

I do sincerely hope that digitization of texts will be permitted by the court for the purposes argued in this paper. An attempt to justify Google's mass digitization project after the fact may, however, suffer from weaknesses inherent in that project: no prior negotiation was attempted with either authors or publishers, and once the amended settlement between Google and the suing parties was rejected by the court, there is no mutual agreement on uses, security, or compensation.

In addition, the economic and emotional impact of Google's role in this process cannot be ignored: this is a company that is so strong and so pervasive in our lives that mere nations struggle to protect their own (and their citizens') interests. When Google or Amazon or Facebook steps into your territory, the earth trembles, and fear is not an unreasonable response. I worry that the idea of digitization itself has been tainted, making it harder for scholars to make their case for the potential benefits of post-digitization research.

Thursday, July 05, 2012

ISBN as URI

One thing that is annoying in this early stage of attempting to create linked data is that we lack URI forms for some key elements of our data. One of those is the ISBN. At the linked data conference in Florence, I asked a representative of the international ISBN agency about creating an "official" URI form for ISBNs and was told that it already exists: ISBN-A, which uses the DOI namespace.

The format of the ISBN URI is:
  • Handle System DOI name prefix = "10."
  • ISBN (GS1) Bookland prefix = "978." or "979."
  • ISBN registration group element and publisher prefix = variable length numeric string of 2 to 8 digits
  • Prefix/suffix divider = "/"
  • ISBN Title enumerator and checkdigit = maximum 6 digit title enumerator and 1 digit check digit
I was thrilled to find this, but as I look at it more closely I wonder how easy it will be to divide the ISBN at the right point between the publisher prefix and the title enumerator. In the case where an ISBN is recorded as a single string (which is true in many instances in library data):

9781400096237
9788804598770

there is nothing in the string to indicate where the divider should be, which in these two cases is:

978.14000/96237
978.8804/598770
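
Here is a small sketch of the conversion, which makes the problem obvious: the split point has to be supplied from outside the string, because the digits alone don't tell you where the publisher prefix ends. In practice that split would have to come from something like the ISBN range data published by the international ISBN agency:

def isbn13_to_isbn_a(isbn13, prefix_len):
    # prefix_len = length of the registration group + publisher prefix,
    # which cannot be deduced from the digits themselves.
    ean = isbn13[:3]                      # "978" or "979" (Bookland prefix)
    prefix = isbn13[3:3 + prefix_len]     # registration group + publisher
    suffix = isbn13[3 + prefix_len:]      # title enumerator + check digit
    return f"10.{ean}.{prefix}/{suffix}"

print(isbn13_to_isbn_a("9781400096237", 5))   # 10.978.14000/96237
print(isbn13_to_isbn_a("9788804598770", 4))   # 10.978.8804/598770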

I have two questions for anyone out there who wants to think about this:

1) Is this division into prefix/suffix practical for our purposes?
2) Would having a standard ISBN-A URI format relieve us of the need to have a separate property (such as bibo:ISBN) for these identifiers? In other words, is the URI format itself sufficient identification that it could be used directly in a situation like:

<bookURI> <RDVocab:hasManifestationIdentifier> <http://10.978.14000/96237> .


Saturday, June 23, 2012

Europe and Library Linked Data

I have just returned from a conference on library linked data in Florence, Global Interoperability and Linked Data in Libraries.  The talks were fascinating and there is only a small amount that I can convey in a blog post, but I thought that a good way to start would be to highlight some differences I see in the European approach.

Cultural Heritage

The cultural timeline in Europe is on an entirely different scale from what we are used to in the US. Library of Congress's "Historical Newspapers" collection covers 1836-1922. At the museum of the synagogue of Florence, the docent referred to an event in 1571 as "the first in modern times." We have history, but Europe has History. This means that there is a great emphasis on archives, manuscripts, and museums in all work done by cultural heritage organizations. At no time during the conference was there discussion of "STM" materials (that's Science, Technology, Medicine), apart from a talk on Renaissance science, or of scholarly communication, both of which are often on conference programs in the US. (See talks on linked data and the Vatican library and Linked Heritage.) Note also that the shared European culture database, Europeana, uses linked data, which encourages all contributors to also move in that direction.

National Libraries

Closely connected to the view of cultural heritage is the role of the national library. In many countries, if not most, the national library's primary role is to conserve the written heritage of the nation. This fact could be used for better data sharing; for example, each national library could have the responsibility for its subset of name authorities, and the name file could be 'in the cloud.' (See talk by Malmsten.) Ditto for the cataloging of modern publications.

Government Data

Led by various European Union initiatives, there is currently a strong movement to make government data available. (Note: there is also an open government movement in the US, but it has less emphasis, as I see it, on the sharing aspects of releasing the data.) Government data from the member countries is needed to make possible the analyses required for Europe-wide programs. Since this data must be shared and linked, providing it as linked data makes perfect sense. (See: datacatalogs.org, a catalog of government data catalogs, and talks by Morando, Moriando, Menduni.)

Rights and licenses

As countries decide to make their government data open, they must decide about rights, in particular data ownership. The European Union has recommendations for data rights, and individual countries are developing licenses that will be used for their released data. Because libraries are often government-funded institutions, these licenses will also apply to library data. This is not always a good thing. Some countries have placed their government data under a fully open license, but others, like the UK, Italy, and France, are using a license similar to CC-BY. The reasons for this have to do with the need to maintain provenance of government data, since that data often has an official role in decision-making. (I wonder if the addition of provenance to linked data will help, and that makes me wonder about provenance "spoofing" and how much of a problem that will be.) (See talk by Morando.)

International Standards

People talk about how insular countries like China and North Korea are, but I become deeply aware of how insular we are in the US when I attend meetings outside of this country. We are beginning here (finally) to pay more attention to the global network of libraries, but Europe follows IFLA standards "religiously," as well as EU standards for data sharing. The Web of Data as a global resource makes even more sense in the European context. This makes me wish I knew more about the remainder of the globe. 

My talk at the conference is here: English, italiano.


Friday, June 01, 2012

Google Books: TBD

The latest

The Google Book Search lawsuit is essentially back to square one. Judge Denny Chin has ruled on an important aspect of the post-(failed)settlement lawsuit of the Author's Guild against Google: Google's objection that the Author's Guild (AG) cannot represent all authors, since copyright must be determined on a case-by-case basis. (It has been widely noted that when it came to declaring the copying to be Fair Use, Google was happy to treat the works en masse, in direct contradiction of its argument in this latest suit that infringement claims would need to be decided individually.) Chin has ruled that the Author's Guild can proceed as representative of "authors" as a class. The class includes not only members of the association, but all authors whose books were scanned by Google. This means that the AG suit against Google can move forward, and that sometime in (hopefully) the near future we will have a ruling on whether or not Google's scanning of books falls within the guidelines of Fair Use.

Quick update

In about 2004, Google began scanning books in partnership with a handful of major libraries, stating its goal as creating search access to books in the same way that it provides search access to web pages. Google Book Search gave results looking much like those for Google's web search: minimal metadata and about three snippets from the book showing the context for the search terms. Google's claim was that copying the books solely for the purpose of search was a clear case of "fair use."

In 2005, the Author's Guild sued Google for copyright infringement in a class action lawsuit representing "authors" as a class. Shortly thereafter, the Association of American Publishers brought their own lawsuit on the part of publishers.

Scanning of books in the libraries continued through 2008 with no word about the lawsuit. Meanwhile, more libraries had been added to the program, and it is estimated that about 7 million books had been scanned. (The exact number is not known.)

In October of 2008, much to the surprise of nearly everyone, it was announced that Google had arrived at a settlement with the AG and the AAP. The settlement was far-reaching, and created a mechanism that would allow Google to scan out-of-print books and make them available for use or sale, returning revenues to the copyright holders. Copying continued.

The settlement received hundreds of responses, mostly negative, from authors and from publishers, in particular those not in the US. Initially the parties were sent back to revise the settlement to address certain concerns. They did so, but not to the judge's satisfaction: the settlement was rejected in court in March of 2011.

In late 2011, the Author's Guild (without the publishers) updated and reprised its original lawsuit against Google, primarily demanding that all scanning stop. It also brought a suit against HathiTrust, the library-sponsored archival facility that houses many of the library books scanned by Google. The lawsuit claims that the copies in HathiTrust are not legal copies and demands that they be destroyed.

My opinion

I could imagine getting a fair use ruling based on the original definition of the project, which was the scanning of books solely for the purpose of allowing keyword searching on texts, with minimal metadata and a few short snippets shown to the public, although Google's for-profit status might have nixed such a decision. However, this was complicated by the active participation of the libraries, and by the fact that Google returned a copy of the digital scan (and sometimes also the OCR and the OCR "map" that carries the location of the text on the page) to the library that had offered the book. While the copying for search might be considered fair use, since no actual copies of the books are made available to anyone, the presentation of a copy to the libraries is a pretty clear act of copying.

The terms of the settlement were arrived at through negotiations between the parties, with input from some of the library partners. It was during this time that library partner U of Michigan began planning the archive now known as HathiTrust. It is undoubtedly not a coincidence that such an archive was described in a fair amount of detail in the settlement document as a requirement for library archiving of their received digital copies. In addition, the settlement allowed for computational research on the corpus, something that would be of great benefit to researchers.

During the time between 2008 and 2011, when the parties presented the first settlement and then the amended settlement, there was a fair amount of optimism that the settlement would be accepted, and plans to implement its terms, including the creation of a special bureau to manage payments to copyright holders, went forward. Google appeared to be all-powerful, able to bend law and legislation in order to create an entirely new view of copyright and digitization.

With the rejection of the settlement, and this latest ruling that allows the AG lawsuit to go forward, the picture has entirely changed, but not necessarily for the better. Many were hoping that we would be able to digitize the entirety of our analog materials, increasing access and preservation capabilities.

Now we are facing the possibility not only that mass digitization may be declared in violation of copyright, at least in this instance, but that the libraries may lose the digital copies of the items in their collections, and that researchers will lose access to these items in HathiTrust and Google Book Search.

At the same time, for Google the lawsuit has become nearly moot. At the moment Google has tens of thousands of publishing partners that allow Google to index digital versions of their books and make them available for sale either in hard copy or as ebooks. Google could lose the out-of-print books in its collection for which it has no publisher agreement, but these books are not providing any revenue for Google.

Resources


The Public Index - all of the filed documents for the lawsuits, beginning in 2005

HathiTrust Information about the AG lawsuit

James Grimmelman's analysis of yesterday's decision

My posts are found under tag "googlebooks"