Tuesday, October 24, 2006

Google Book Search is NOT a Library Backup

I have seen various quotes from library managers claiming that the Google Book Search program, which is digitizing books from about a dozen large research libraries, now provides a backup to the library itself. This is simply not the case. Google Book Search is, or at least began as, a keyword search capability for books, not a preservation project. This means that "good enough" really is good enough, as long as users can discover a book by its keywords. A few key facts about GBS:

1) it uses uncorrected OCR. This means that many errors remain in the extracted text. A glaring example is that all hyphenated words that break across a line are treated as separate words, e.g. "re-create" is in the text as "re" and "create". And the OCR has particular trouble with title pages and tables of contents:

Copyright, 18w,

B@ DODD, MEAD AND COMPANY,

411 r@h @umieS

@n(Wr@ft@ @rr@

5 OHN WILSON AND SON, CAMBRIDGE, U. S. A.

Here's the table of contents page:

(@t'

@ 1@ -r: @

@Je@ @3(

CONTENTS

CHAPTER PAGS

I. MATERIAL AND METHOD . . 7
II. TIME AND PLACE 20
III. MEDITATION AND IMAGINATION 34
IV. THE FIRST DELIGHT . . . 51
V. THE FEELING FOR LITERATURE 63
VI. THE BOOKS OF LIFE . . . 74
Vii. FROM THE BOOK TO THE READER 8@
VIII. BY WAY OF ILLUSTRATION . 95
IX. PERSONALITY 109
X. LIBERATION THROUGH IDEAS . 121
XI. THE LOGIC OF FREE LIFE. . 132
XII. THE IMAGINATION 143
XIII. BREADTH OF LIFE 154
XIV. RACIAL EXPRESSION . . . i65
XV. FRESHNESS OF FEELING. . . 174

2) it will not digitize all items from the libraries. Some will be considered too delicate for the scanning process, others will present problems because of size or layout. It isn't clear how they will deal with items that are off the shelf when that shelf is being digitized.

3) quality control is generally low. I have heard that some of the libraries are trying to work with Google on this, but the effort by the library to QC each digitized book would be extremely costly. People have reported blurred or missing pages, but my favorite is:

"Venice in Sweden"
Search isbn:030681286X (Stones of Venice, by Ruskin)
Click on the link and you see a page of Stones of Venice. Click on the Table of Contents and you're at page two or so of a guidebook on Sweden. Click forward and backward and move seamlessly from Venice to Sweden and back again. Two! Two! Two books in one! (I reported this to G months ago.)

4) the downloaded books aren't always identical to the book available online (which in turn may be different to the actual physical book due to scanning abnormalities). Look at this version of "Old Friends" both online and after downloading, and you'll see that most of the plates are missing from the downloaded version. Not necessarily a back-up problem, but it doesn't instill confidence that copies made from their originals will be complete.

Note that these examples may not affect the usefulness of the search function provided by Google, but they do affect the assumption that these books back up the library.
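To make the hyphenation problem from item 1 concrete, here is a minimal sketch in Python of what correcting it would involve. Nothing here reflects how Google processes its text; the function name `rejoin_hyphenated` and the `known_words` word list are hypothetical, introduced only for illustration. It also shows why the fix is not trivial: without a dictionary, there is no way to know whether the hyphen belongs to the word ("re-create") or was only added by the typesetter ("li-brary").

```python
import re

# Matches a word fragment, a hyphen at the end of a line, and the
# fragment that starts the next line.
_LINE_BREAK_HYPHEN = re.compile(r"(\w+)-\s*\n\s*(\w+)")


def rejoin_hyphenated(text, known_words):
    """Rejoin words that were split by a hyphen at a line break.

    known_words is a hypothetical set of lowercase dictionary words.
    If the two fragments glued together form a known word, the hyphen
    is assumed to be a typesetting artifact and is dropped. Otherwise
    the hyphen is kept, on the assumption that it is part of the word.
    """
    def _join(match):
        left, right = match.group(1), match.group(2)
        if (left + right).lower() in known_words:
            return left + right
        return left + "-" + right

    return _LINE_BREAK_HYPHEN.sub(_join, text)


sample = "we must re-\ncreate the index of the li-\nbrary"
print(rejoin_hyphenated(sample, {"library"}))
```

With a word list that contains "library" but not "recreate", the sample comes back with "re-create" and "library" intact. With a list that contains both, "re-create" would be wrongly collapsed to "recreate", which is exactly the kind of judgment that uncorrected OCR never makes.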

2 comments:

Anonymous said...

In the case of hyphenated words, we can draw some comfort from the fact that in the pre-digital age, hyphenation tended to follow predictable rules, and future search algorithms for OCR digitized texts may incorporate these rules (at least for "advanced" searches). Of greater concern is misinterpretation of scanned text. Of necessity, OCR programs are probabilistic, and, for example, may decide that what looks like "Dori" is more likely to be "Don." This can be especially frustrating when the text being scanned is in a language other than English and the OCR program assumes otherwise. For example, a remarkable number of French books in Google Book Search contain the heretofore unknown word "oonclure" ("conclure").

Anonymous said...

An Ars Technica review of the 'first official alpha version of Google's OCRopus scanning software for Linux' gives samples showing the 'typical output quality'.

The review points out 'several common errors':
'For instance, the letter "e" is often interpreted as a "c" and the letter "o" is often interpreted as a "0" in scanned documents. OCRopus provides better results when scanning text that is printed with a sans serif font, and the size of the font also has a significant effect on accuracy.'
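
The character substitutions mentioned in these comments ("e" read as "c", "o" read as "0", and the "c" read as "o" that produces "oonclure") suggest one way a search tool could compensate: expand a query term into the misreadings the OCR might have produced. The Python sketch below is only an illustration under that assumption. The `OCR_CONFUSIONS` table and the `ocr_variants` function are hypothetical names, and the table holds just the three substitutions named above, not a real error model.

```python
from itertools import product

# Assumed confusion table: correct character -> characters the OCR
# may have emitted instead. Only the pairs mentioned in the comments.
OCR_CONFUSIONS = {"e": ["c"], "o": ["0"], "c": ["o"]}


def ocr_variants(term, confusions=OCR_CONFUSIONS, max_variants=64):
    """Return the term plus spellings the OCR might have produced.

    Each character is either kept or swapped for one of its assumed
    misreadings. The original term always comes first. max_variants
    caps the list, since the count grows quickly for long terms.
    """
    choices = [[ch] + confusions.get(ch, []) for ch in term]
    variants = []
    for combo in product(*choices):
        variants.append("".join(combo))
        if len(variants) >= max_variants:
            break
    return variants


variants = ocr_variants("conclure")
print(len(variants), "oonclure" in variants)
```

For "conclure" this yields 16 candidate spellings, "oonclure" among them. The cost is obvious: every extra confusion pair multiplies the number of variants, and many of them will match words that were never misread at all.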