Sunday, August 23, 2009

Googlebooks: Innovation and the Future of the Book

There's a standard joke about a restaurant-goer who complains afterward: "The food was terrible... and there was so little of it!"

I'm reminded of this while reading the letter by the University of California faculty to the judge in the Google/AAP settlement case. First they argue that the class represented by the Authors Guild does not include academics, who are major, if not the major, producers of authored texts. Then they state their three primary concerns:
  1. to maximize access, prices should be reasonable
  2. there must be provision for open access choices for authors who want to maximize access to their works
  3. user privacy must be guaranteed
I find it unfortunate that the faculty chose to lead with the question of price... it sounds a bit like "This settlement is seriously flawed... and we might not be able to afford it!" The other two concerns suggest important modifications to the settlement as currently written.

All three of these concerns are premised on the acceptance of the settlement. There is another, perhaps more serious concern that isn't here, although it may not have been possible for this group to raise it, because it could be incompatible with basic premises of the settlement. That concern is the question of INNOVATION.

Innovation

We may be at a crossroads in the evolution of the book, one that could change forever how one goes about the acts of scholarship and knowledge creation. What happens when some portion of our previously analog texts is available digitally? What changes take place to the nature of research? What are the potential unintended consequences?

We don't know the answers to these questions, in part because they are about the future. It is probably safe to assume, however, that the future of the book is not a linear progression from where we are today, but that it could go in a number of different directions, ending up somewhere that we can't even imagine at this moment in time. To get there, we must experiment, we must innovate. There will be trial and error. There may not be a single "killer app." Above all, the change will make use of technology but it will essentially be a cultural change. Perhaps a massive cultural change.

Some commentators have said that the production of texts digitally makes books irrelevant, that the stable, book-length text will cease to exist as we know it today. Instead, we may be returning to our medieval roots, before texts were fixed by the repetitive nature of the printing process. [1] Others see digitization of previously analog books as a way to reassert the impact of the last 500 years of thought on scholarship. It is easy to imagine the discovery of long-hidden gems from the stacks of university libraries, just as it is easy to imagine being overwhelmed by marginal and irrelevant retrievals. The main thing is that we won't know until it happens.

We can have some fun speculating on the types of things that we may be able to do with these previously analog texts in the digital future: integrating these texts with those that were "born digital"; creating hyperlinks between them (one's own personal Memex); recombining texts into new text; annotating and commenting in a public or semi-public sphere; mashing up text with sound and video and data. On a larger scale, we face the possibility of global topic maps that will show us previously undiscovered connections.

Which of these possibilities will be available to us, however, is up to Google, because under this settlement only Google has the right to innovate across the body of digitized texts. The rest of us, including the faculty of the universities whose libraries are providing the books, are merely consumers. We can buy the product, or not buy the product, but the raw materials, the digital copies of the library texts, belong to Google. The monetizing of the texts is the job of the Book Rights Registry (BRR) that will be formed, which will represent the rights holders. Google's job is to provide the product that will make monetization possible. Both of these entities, Google and the BRR, are focused on the price issue as their main concern. In that, they have much in common with the University of California faculty.

This is a perfect example of how the asking of a question shapes reality. The question surrounding the settlement is: are authors (as defined by the Authors Guild) served by the Google/AAP settlement -- yes or no? The bigger question, "What is the future of the book in our civilization?", is not on the table. Yet, in the end, that may be the question that is answered by this settlement, whether that outcome serves authors or not.

[1] Recommended reading: The Future of the Book (Berkeley: University of California Press, 1996).

[Note: I am aware that there are serious issues of copyright law in all of this; that it's not just a question of technology. Whether or not this settlement helps or hinders the evolution of copyright law is a discussion better left to those with a legal background. But it is an active part of the discussion around the settlement.]

Academic publishing as a percentage of Google Books

A group representing University of California faculty has expressed concerns about the Google/AAP settlement in a letter to the presiding judge. In that letter they state one of their concerns as:
"Specifically, we are concerned that the Authors Guild negotiators likely prioritized maximizing profits over maximizing public access to knowledge, while academic authors would have reversed those priorities."
The next sentence says:
"We note that the scholarly books written by academic authors constitute a much more substantial part of the Book Search corpus than the Authors Guild members’ books."
I was disappointed that they didn't include any data to support the "substantial part" statement, and think that their letter would have been stronger if they had. (I am presuming that they meant "substantial" in a quantitative way, rather than qualitative. The latter would be hard to support.)

Edward Betts of the Open Library did an experiment in identifying publishers in the OL data. Because of the way publisher names are recorded in bibliographic data, he used ISBN publisher prefixes, where available, to bring together different forms of the name. He posted his results on the OL blog. The post links to his files. His data shows counts for each (presumed) individual publisher.
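To illustrate the general idea (this is not Edward's actual code, which I haven't reproduced here), the following is a minimal sketch of grouping name variants by ISBN publisher prefix. It assumes hyphenated ISBN-10s, where the first two segments are the group and publisher identifiers. The sample records, the function names, and the `name:` fallback key are all made up for illustration, and the check digits are not validated.

```python
from collections import Counter, defaultdict

# Hypothetical sample records: (publisher name as recorded, hyphenated ISBN-10 or None).
RECORDS = [
    ("Oxford University Press", "0-19-000000-1"),
    ("Oxford Univ. Pr.", "0-19-000001-X"),
    ("OUP", "0-19-000002-8"),
    ("Cambridge University Press", "0-521-00000-3"),
    ("Cambridge U.P.", "0-521-00001-1"),
    ("Some Publisher", None),
]


def publisher_prefix(isbn):
    """Return 'group-publisher' from a hyphenated ISBN-10, or None if unusable."""
    if not isbn:
        return None
    parts = isbn.split("-")
    if len(parts) != 4:
        return None
    return "-".join(parts[:2])


def group_by_prefix(records):
    """Map each prefix (or a name-based fallback key) to a Counter of name forms."""
    groups = defaultdict(Counter)
    for name, isbn in records:
        prefix = publisher_prefix(isbn)
        # Without an ISBN there is nothing to unify on, so fall back to the name.
        key = prefix if prefix is not None else "name:" + name
        groups[key][name] += 1
    return groups


def ranked_counts(groups):
    """Return (record count, most frequent name form, key), largest first."""
    rows = [
        (sum(names.values()), names.most_common(1)[0][0], key)
        for key, names in groups.items()
    ]
    return sorted(rows, key=lambda row: row[0], reverse=True)


groups = group_by_prefix(RECORDS)
ranking = ranked_counts(groups)
for count, name, key in ranking:
    print(count, name, key)
```

Real ISBNs are often recorded without hyphens, in which case finding the prefix boundary requires the published ISBN range tables; that step is left out of this sketch.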

I mentioned in a comment to Edward's blog post that it was interesting to me that a university press (Oxford UP) turned up in the #1 spot as the publisher with the greatest number of books in the OL. As a matter of fact, out of the top 20 publishers, five are university presses (UPs), and they make up over 1/4 of the books in that group. (Download a tab-delimited, ranked version of the data, but be sure to look at Edward's detailed data to understand what makes up each publisher entry.)
# of book records published by the top 20 publishers: 1,935,327
# of books in the top 20 by University Presses: 577,323
Out of the top 100, the UPs make up a little less than 25% of the file. I'm only including those presses with "University" in their names, meaning that the figure doesn't include Academic Press, Elsevier, Scholastic, etc., which primarily publish the output of academic writers.
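The "over 1/4" figure for the top 20 can be checked directly from the two counts above. A quick sketch (the variable names are mine):

```python
# Counts quoted above for the top 20 publishers in the OL data.
TOP20_TOTAL = 1_935_327  # all book records from the top 20 publishers
TOP20_UP = 577_323       # records from the university presses among them

share = TOP20_UP / TOP20_TOTAL
print(f"University press share of the top 20: {share:.1%}")
```

That works out to a bit under 30%, so "over 1/4" is, if anything, conservative.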

This study of OL publisher data was just experimental, so these figures should be taken with a grain of salt. However, this shows that there is an interesting study to be done, if it can be done, quantifying the relative roles of academic and commercial publishing. Given that Google is digitizing books in university libraries, the tendency toward academic publications should be quite strong. (Note that OL has taken its records from the Library of Congress, online book sales, and some libraries, and probably is less heavy on academic presses than Google Books will be.)

The UC faculty's concern that the interests of academic writers are not well-served by the Authors Guild is compelling to me. I hope the judge takes it seriously.

Thursday, August 20, 2009

Greetings from Undefined!


My kinda place, Undefined. (From iGoogle page.)

Sunday, August 16, 2009

What is a (FRBR) Work?

"What is a Work?" is an oft-discussed question. Answers tend to come down on one side or another of what is essentially a philosophical reaction to the inherent abstractness of the nature of the FRBR Work.

I was poking around in the Futurelib wiki (much neglected of late, but I have recently gathered there all of my posts on Martha Yee's article on library cataloging and RDF), and came across an interesting comment by Kristin Antelman from over 2 years ago:

"I realize this is a point of absolutely no controversy in the FRBR community, but I have never been happy with the title attribute in association with the abstract entities, work and expression. It seems contrary to the spirit of an abstract entity, not to mention creating practical problems (e.g., for serials). There obviously could be many titles associated a work in its manifestations. Libraries may want to select one over another for the 'work,' or display, title. Works and expressions only need identifier attributes: for the work, author and subject."

RDA recognizes both a Work title and a Manifestation title (called title proper). I've been known to argue that these are distinct data elements: the Manifestation title is transcribed from the piece and is part of the surrogate for the manifestation; the Work title (which we used to call "Uniform title") acts as a unifier for all of the Expressions and Manifestations of the Work. Kristin's comment got me thinking about this again, and I agree with her: there is no title for the Work. The Uniform title served as an identifier for the Work in the days when we used things like titles to identify things. But FRBR and RDA recognize that entities will have identifiers that are separate from the display forms of names and titles that we have used in the past.

This "no title" solution actually helps out with one of the sticky problems in creating Work displays, particularly in a multi-lingual catalog. When you follow the concept of uniform titles, the Work title should be the title of the original. This means that we would be showing our users Война и мир as the title for the Work that most of them will know as War and Peace. We could show them the English language title, but what if your catalog users are global? What if some of them will only understand the title if you display it in French or Turkish or Chinese? If a Work has an identifier (which is only useful for machine processing, not for display to humans), then you can let users choose what language they prefer in Work displays. (Obviously, you would need some default for the case where the user's preferred language isn't available.)

So I like the idea of not assigning a title to the work, but I must admit that I'm increasingly seeing the Work not as a thing but as a set; a set made up of things that claim to be Manifestations of the work. Each resource that claims to be a Manifestation of that Work (using the Work identifier) is then part of the Work set, and it is the set that defines the Work.
The Work set is not fixed - new items can add themselves to the set at any time. Thus, the Work is defined from the bottom up, from the contents of the set. The members of the set have titles and have subjects, and that means that the Work also has those titles and subjects.
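Here is a rough sketch of what I mean by the Work as a set, written in Python. Everything in it is an assumption made for illustration: the `work:0001` identifier, the class and method names, and the sample subject term are hypothetical, and nothing here comes from FRBR or RDA themselves. The Work has no title of its own; its titles and subjects are whatever its members bring with them, and the display title is chosen by the user's language with a default as fallback.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifestation:
    work_id: str   # the Work this resource claims to be a Manifestation of
    title: str     # title as transcribed from the piece
    language: str  # language code of that title
    subjects: frozenset = frozenset()


class WorkSet:
    """A Work defined bottom-up as the set of Manifestations that claim it."""

    def __init__(self, work_id):
        self.work_id = work_id
        self.members = []

    def add(self, manifestation):
        # Membership is by claim: the resource names the Work identifier.
        if manifestation.work_id != self.work_id:
            raise ValueError("manifestation claims a different Work")
        self.members.append(manifestation)

    def titles(self):
        return {m.title for m in self.members}

    def subjects(self):
        return set().union(*(m.subjects for m in self.members))

    def display_title(self, preferred, default="en"):
        """Pick a member title in the preferred language, else the default."""
        for lang in (preferred, default):
            for m in self.members:
                if m.language == lang:
                    return m.title
        # Last resort: any title at all, or None for an empty set.
        return self.members[0].title if self.members else None


work = WorkSet("work:0001")  # hypothetical identifier
work.add(Manifestation("work:0001", "Война и мир", "ru", frozenset({"Russia -- History"})))
work.add(Manifestation("work:0001", "War and Peace", "en"))
work.add(Manifestation("work:0001", "Guerre et Paix", "fr"))

print(work.display_title("fr"))
print(work.display_title("tr"))
```

A user who prefers French gets the French title; a user who prefers Turkish, for which this set has no member yet, falls back to the English default until a Turkish edition adds itself to the set.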

This "solution" requires us still to make decisions about what we display to represent the Work. Do we show subject headings as related to the Work? What about reviews and excerpts? If you want to add a cover to the display, how do you select from among the various covers you may have?

A great advantage of this solution is that you can make different display decisions at different times, or in different contexts. A public library can point users directly to the shelf location from the Work display; a rare book archive can include key information about available editions; a social networking site can list the users who own versions of the Work. The concept of Work becomes somewhat fluid and malleable, which in my mind is closer to reality than a fixed thing that has only certain attributes.

Tuesday, August 11, 2009

FRSAD

I was reminded by Jenn Riley's post on FRSAD that I hadn't yet read the document. Jenn had some interesting concerns about the model, and now that I have read it, so do I.

The main thing that bothers me is that the FRSAD's view of authority data appears to be that it names things, and by that I mean that it names things for the human reader. The introduction to FRSAD says:
"The purpose of authority control is to ensure consistency in representing a value -- a name of a person, a place name, or a subject term -- in the elements used as access points in information retrieval."
The example given is that of World War II, which can be called by many different names in publications, but is brought together under a single heading in LCSH.

I think that the goal of authority control is to come up with a single representation for a concept or a thing. The nature of that representation is very important, however. If you choose the preferred display form as the representation of the entity, your metadata has a fatal flaw: any change to the display form creates a different entity. A display form simply is not a viable persistent identifier. Using the display form also makes it much more difficult to share your data across languages and across contexts. "World War II" and "Seconde Guerre mondiale" are the same thing conceptually, but if only the names are used to identify the topic those two terms are far apart. It would be simple to bring them together, however, if the topic had a true identifier, one that is independent of the preferred display form.

I am a bit perplexed that no one on the FRSAD committee was able to introduce the concept of identifier into the project. It seems to be such an obvious answer. Each topical entity must have an identifier. That identifier remains the same regardless of decisions about display. The determination of a single display may still be required for certain user functions, but the big plus is that you can decide to display the authorized form in English or Spanish, for adults or children, in a transliterated form or vernacular, without changing the identity of your entity.

Without an identifier, there is no way to represent an entity as metadata. The Work and the Thema (FRSAD's word for subject) have no existence in metadata without a machine-readable identity that allows them to have being. This is a basic rule of the Semantic Web, but it has always been a fact of metadata usage in machine-readable form. Those of us in libraries have struggled to create systems and programs that attempt to control identities with user display forms, and it is both a frustrating and flawed approach. We need to move FRSAD from a model in which the display form identifies the entity to one where the display forms are flexible and aren't involved with identifying what our metadata is about. Display forms are for humans; identifiers are for machines. Identifiers are also language neutral and can facilitate sharing across languages and communities. It's really that simple.
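A small sketch in Python may make the separation concrete. The identifier `thema:0001`, the record structure, and the function name are hypothetical and are not taken from FRSAD. Bibliographic records point at the identifier only; display forms live in a separate table keyed by language.

```python
# Display forms, keyed by a hypothetical topic identifier and then by language.
LABELS = {
    "thema:0001": {
        "en": "World War II",
        "fr": "Seconde Guerre mondiale",
    },
}

# Records refer to the topic by identifier, never by its display form.
RECORDS = [
    {"title": "A history of the war", "subject": "thema:0001"},
    {"title": "Une histoire de la guerre", "subject": "thema:0001"},
]


def display_form(subject_id, lang, default="en"):
    """Return the label in the requested language, else in the default one."""
    forms = LABELS[subject_id]
    return forms.get(lang, forms[default])


def records_about(subject_id):
    return [r for r in RECORDS if r["subject"] == subject_id]


before = records_about("thema:0001")

# Changing the preferred English display form touches only the label table.
LABELS["thema:0001"]["en"] = "World War, 1939-1945"

after = records_about("thema:0001")
print(display_form("thema:0001", "fr"))
print(display_form("thema:0001", "es"))
print(before == after)
```

The English and French records are about the same topic because they share an identifier, and a change of heading leaves every record untouched.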

Ebook sales through the roof

Reported ebook sales for the second quarter of 2009 are more than 3 times what they were for the same period in 2008. The Kindle may turn out to have been the killer app. So far, however, I haven't found good figures on ebook sales by vendor or ebook format. If you run into that data, please pass it along.