Lorcan Dempsey and Brian Lavoie have recently published an article in DLib that looks at these figures using the world's largest database of bibliographic data, WorldCat. The data is fascinating, but I have already seen it mis-interpreted, so I thought some clarification might be useful.
Dempsey and Lavoie are very clear that what they are measuring is "Manifestations." Folks outside of the library environment are unlikely to know what that means, therefore it is important to clarify what the numbers in the Dempsey/Lavoie article represent. Each “book” that is counted represents a published product at about the same level of granularity that today would be given an ISBN. Therefore if a publisher re-issues a book in their backlist after the previous print run has been exhausted (say, a decade later) and with a new introduction, it is considered a different book. The publication date that is fed into the study is the date of the new issuing of the book. Also, as publishers re-package and re-print public domain books, these also are considered separate products with new ISBNs and new dates.
Thus, if you look up a commonly re-published book like “Moby Dick, Or The Whale” in the Library of Congress catalog, you retrieve 40 items (and more if you use the short form of the name, simply “Moby Dick”), of which only one is pre-1923 — that one was published in 1851. Of the other thirty-nine instances of the publication of the work, which range from 1925 to 2006, some contain what GBS called “inserts” - that is, separately copyrightable intellectual property in the form of introductions, etc., but others may be a straight republication of the text. If you do the same lookup in FictionFinder, a work-based view of a portion of the WorldCat database. you find:
823 editions of "Moby Dick" (which combines the various versions of the title)
534 of which are in English
9 have an unknown date
60 have a date of 1923 or earlier
465 have dates after 1923
Looking through the list on FictionFinder it is easy to see that there are some duplicate records, both in the pre- and post-1923 entries.
Therefore, the question we now need to answer is: how many public domain works have been republished after the 1923 cut-off date?
Google appears to currently lack the ability to make the proper connection between the original text that is in the public domain and the many “manifestations” (as they are called in library-speak) that were published later — and are also in the public domain, at least as far as the primary text is concerned. This is a non-trivial exercise when one is working only with the metadata that describes the work, but may become more feasible with the ability to do a full text analysis of the contents of the various packages in which publishers have placed the original work of Melville. I assume that Google is working on this, although I cannot predict how it will affect their assessment of the PD/(c) split.