Monday, November 23, 2009

1923

The Google Books Settlement is causing a great deal of previously unexpressed bibliographic interest: just how many books are there in the known universe? How many are published in the four countries now included in the settlement agreement (US, UK, Canada, Australia)? And how many are in the public domain?

Lorcan Dempsey and Brian Lavoie have recently published an article in D-Lib Magazine that looks at these figures using the world's largest database of bibliographic data, WorldCat. The data is fascinating, but I have already seen it misinterpreted, so I thought some clarification might be useful.

Dempsey and Lavoie are very clear that what they are measuring is "Manifestations." Folks outside of the library environment are unlikely to know what that means, so it is worth spelling out what the numbers in the Dempsey/Lavoie article represent. Each “book” that is counted represents a published product at about the same level of granularity that today would be given an ISBN. Therefore, if a publisher re-issues a book in their backlist after the previous print run has been exhausted (say, a decade later) and with a new introduction, it is considered a different book. The publication date that is fed into the study is the date of the new issuing of the book. Also, as publishers re-package and re-print public domain books, these also are considered separate products with new ISBNs and new dates.

Thus, if you look up a commonly re-published book like “Moby Dick, Or The Whale” in the Library of Congress catalog, you retrieve 40 items (and more if you use the short form of the name, simply “Moby Dick”), of which only one is pre-1923; that one was published in 1851. Of the other thirty-nine instances of the publication of the work, which range from 1925 to 2006, some contain what GBS called “inserts” (that is, separately copyrightable intellectual property in the form of introductions, etc.), but others may be a straight republication of the text. If you do the same lookup in FictionFinder, a work-based view of a portion of the WorldCat database, you find:

823 editions of "Moby Dick" (which combines the various versions of the title)
534 of which are in English

of these:

9 have an unknown date
60 have a date of 1923 or earlier
465 have dates after 1923

Looking through the list on FictionFinder it is easy to see that there are some duplicate records, both in the pre- and post-1923 entries.
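Taking the FictionFinder counts at face value (duplicates included), the split can be checked with a few lines of Python. This is only a rough tally of the figures quoted above; the variable names are mine, and the percentages are no more reliable than the underlying records.

```python
# Counts for English-language "Moby Dick" editions, as quoted from FictionFinder above.
english_editions = 534
by_date = {"unknown": 9, "1923 or earlier": 60, "after 1923": 465}

# The three date buckets should account for every English edition.
assert sum(by_date.values()) == english_editions

# Share of each bucket, as a percentage rounded to one decimal place.
shares = {
    bucket: round(100 * count / english_editions, 1)
    for bucket, count in by_date.items()
}

for bucket, pct in shares.items():
    print(f"{bucket}: {pct}%")
```

In other words, close to nine out of ten English-language manifestations of this one public domain work carry a post-1923 date.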

Therefore, the question we now need to answer is: how many public domain works have been republished after the 1923 cut-off date?

Google appears to currently lack the ability to make the proper connection between the original text that is in the public domain and the many “manifestations” (as they are called in library-speak) that were published later — and are also in the public domain, at least as far as the primary text is concerned. This is a non-trivial exercise when one is working only with the metadata that describes the work, but may become more feasible with the ability to do a full text analysis of the contents of the various packages in which publishers have placed the original work of Melville. I assume that Google is working on this, although I cannot predict how it will affect their assessment of the PD/(c) split.
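To give a feel for why metadata-only matching is hard, here is a minimal sketch of grouping manifestations into candidate works by a normalized author/title key. Everything in it is an assumption for illustration: the records are made up, `work_key` is a hypothetical helper, the normalization is deliberately crude, and treating "before 1923" as the boundary is a simplification. It is not a description of how Google, WorldCat, or FictionFinder actually cluster records.

```python
import re
from collections import defaultdict

# Hypothetical manifestation records (title, author, year); not real catalog data.
records = [
    ("Moby-Dick; or, The Whale", "Melville, Herman", 1851),
    ("Moby Dick", "Melville, Herman", 1925),
    ("Moby Dick, or The Whale", "Melville, Herman", 2006),
    ("Typee", "Melville, Herman", 1846),
]

def work_key(title, author):
    """Crude work-level key: author surname plus the title with punctuation
    removed and any alternative title after ' or ' dropped.
    This would wrongly truncate titles that legitimately contain ' or '."""
    t = re.sub(r"[^a-z0-9 ]", " ", title.lower())
    t = re.sub(r"\s+", " ", t).strip()
    t = t.split(" or ")[0]
    surname = author.lower().split(",")[0].strip()
    return (surname, t)

# Group publication years by candidate work.
works = defaultdict(list)
for title, author, year in records:
    works[work_key(title, author)].append(year)

CUTOFF = 1923  # assumed boundary for this sketch

# A work is a public domain candidate if its earliest known manifestation
# predates the cutoff; later manifestations are then candidate reissues.
pd_candidates = {k: sorted(v) for k, v in works.items() if min(v) < CUTOFF}
later_reissues = {
    k: [y for y in years if y >= CUTOFF] for k, years in pd_candidates.items()
}

for key, years in later_reissues.items():
    print(key, "reissued after cutoff:", years)
```

Even when the grouping succeeds, it says nothing about whether a given reissue contains newly copyrightable "inserts" such as an introduction, which is where analysis of the full text would have to take over.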

What is clear, however, is that Google is going to need to identify Works (if not strictly in the FRBR sense, then at least in the sense that meets some definition that is valid for copyright law).

4 comments:

MJ Suhonos said...

Great post, Karen! So much of the debate (certainly on the NGC4lib mailing list) around FRBR has been relatively academic -- but the issue of copyright vs. public domain and the need to identify things at the Work level is a great, pragmatic, real-world example.

jm said...

Nice post, and also brings up an issue that I've been worrying about with our computer games archiving project. The relationships which should exist between intellectual property constructs and FRBR entities are not, to my mind, at all clear, and it could be that the legal worldview and the bibliographic one are fundamentally at odds here. Copyright protects "original works of authorship fixed in any tangible medium of expression" and specifically does not protect ideas or concepts, which would appear to mean that copyright adheres at the expression level in FRBR terms. But there are distinctions we make between different manifestations that would not be considered as separate, independently copyrighted works in the legal sense. And accurately dealing with the issues of copyrights in composite works such as you describe, and issues such as dependent copyrights in translations, make everything even nastier. It was interesting to see Blizzard and Microsoft give their respective machinima communities blanket licenses for use of their IP in creating machinima, but also having to tell the community that they don't actually own the music in their games so they can't grant permissions for that. I'd really like to see someone/some group pick up the issue of intellectual property law and FRBR and provide some guidance on how we should be recording this information.

Karen Coyle said...

jm,

I did some work on metadata for copyright assessment when I was with the California Digital Library. This page links to documentation and a schema:

copyrightMD

The key thing is that it doesn't attempt to determine if something is or isn't under copyright -- instead, the goal is to provide users with whatever information the archive or library does have, rather than just leaving users on their own to figure things out. Presumably, over time more information could be filled in and it would get easier for users to know the status of the item. There's no reason why this couldn't be associated with expressions, in the FRBR sense.

Kevin Hawkins said...

My understanding is that reproductions of public domain works receive copyright protection in the UK and possibly other countries. Therefore, Google would only be able to make available the full text of reprints not published in such countries.

It's worth noting that HathiTrust is reviewing the copyright status of many manifestations and will review a particular one if a user submits a request through the HathiTrust interface. So anyone is welcome to do so for copies of Moby Dick that they believe to be in the public domain.