Wednesday, April 25, 2012

Digital Urtext

As we reach a point where many of the classic books of literature and science published before the magical date of 1923 have been digitized, it is time to consider the quality of those copies and the issue of redundancy.

A serious concern in the times before printing was that copying -- and it was hand-copying in those times -- introduced errors into the text. When you received a copy of a Greek or Latin work you might be reading a text with key words missing or misrepresented. In our digitizing efforts we have reproduced this problem, and are in a situation similar to that of the Aldine Press when it set out to reproduce the classics for the first time in printed form: we need to carry the older texts into the new technology as accurately as possible.

While the digitized images of pages may be relatively accurate, the underlying (and, for the most part, uncorrected) OCR introduces errors into the text. The amount of error is often determined by the quality of the original or the vagaries of older fonts. If your OCR is 99.9% accurate, you still have one error for every 1,000 characters. A modern book has about 1,500 characters on a page, so that means roughly one to two errors on every page. Also, there are particular problems in book scanning, especially where text doesn't flow easily on the page. Tables of contents seem to be full of errors:
IX. Tragedy in the Gra\'eyard 80
X. Dire Prophecy of the Howling Dog .... 89
XL Conscience Racks Tom 98
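The arithmetic behind that error rate is simple enough to sketch. A minimal calculation in Python (the post names no language; the figures are the rough estimates from the text above, not measurements):

```python
# Back-of-the-envelope estimate of OCR errors per page.
def expected_errors_per_page(accuracy: float, chars_per_page: int) -> float:
    """Expected number of character-level errors on one page of OCR output."""
    return (1.0 - accuracy) * chars_per_page

# Rough figures from the text: 99.9% accuracy, ~1,500 characters per page.
print(expected_errors_per_page(0.999, 1500))  # roughly 1.5 errors per page
```

At 500 pages, that is several hundred errors per book, which is why "99.9% accurate" sounds better than it reads.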
In addition, older books have a tendency to use hyphenated line breaks a great deal:

and declined. At last the enemy's mother ap-
peared, and called Tom a bad, vicious, vulgar child, 

These remain on separate lines in the OCR'd text, which is faithful to the original but causes problems for searching and any word analysis.
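Rejoining such hyphenated line breaks is mechanically straightforward, though deciding which hyphens to keep is not. A minimal sketch in Python (the regular expression is illustrative; it would wrongly join a word that is legitimately hyphenated across a line break, such as "well-known"):

```python
import re

def dehyphenate(text: str) -> str:
    """Join words split across lines with a trailing hyphen."""
    # Remove a hyphen at end of line plus the line break and any leading
    # whitespace on the next line, fusing the two word halves together.
    return re.sub(r"-\n\s*", "", text)

ocr = "At last the enemy's mother ap-\npeared, and called Tom a bad child,"
print(dehyphenate(ocr))
```

This restores "appeared" as a single searchable word; a production tool would check the joined form against a dictionary before removing the hyphen.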

The other issue is that for many classic works we have multiple digital copies. Some of these are different editions, some are digitizations (and OCR-ing) of the same edition. Each has different errors.

For the purposes of study, it would be useful to have a certified "Urtext" version: a quality digitization with corrected OCR that scholars agree represents the text as closely and accurately as possible. This might be a digital copy of the first edition, or it might be a digital copy of an agreed "definitive" edition.

We have a notion of "best edition" (or "editions") for many ancient texts. Determining one or a small number of best editions for modern texts should not be nearly as difficult. Having a certified version of such texts must be superior to having students and scholars reading from and studying a wide variety of flawed versions. Professors could assign the Urtext version to their classes, knowing that every one of the students was encountering the same high quality text.

(I realize that Project Gutenberg may be an example of a quality control effort -- unfortunately those texts are not coordinated with the digital images, and often do not have page numbers or information about the edition represented. But they are to be praised for thinking about quality.)


Mark McDayter said...

You are, of course, absolutely correct to note that the error rate in many online texts is very high: this is true even of the transcriptions of texts in such well-established databases as those of the TCP, which also contain many lacunae.

I'm a little uncomfortable with the idea of establishing and "certifying" URtexts, however. What criteria would one apply to an electronic edition, beyond accuracy? Should it use modern spelling? Should it include typographical conventions such as the "long s," or not? Different electronic editions have been created to "do" different things: which function is most desirable?

I am even more uncomfortable with the notion of establishing a "best edition," assuming that you are speaking now of the transcribed original. We are back to the terms of the debate between McGann and Renear again. There is no such thing as an "URtext" in the sense of a "best edition" or neo-Platonic ideal: every instantiation of a text has its own validity within its own context. Which Hamlet is "UR"? Which Pamela, the one Richardson first published in 1740, or the one(s) he later revised in reaction to audience response? Each of these is "different," not better.

There is some danger here of flattening textual histories, or of eliding the impact of material culture on reading and writing. Let's by all means produce and identify more accurate transcriptions, but we need to avoid reductive and out-of-date theories of what constitutes "text" too.

Karen Coyle said...

Mark, I see this from a very practical view. FRBR encourages us to identify Works and Expressions, not only for the convenience of catalogers but because it tells users that these things are the "same" by some definition.

The Open Library shows 306 different Manifestations of "Tom Sawyer," of which 23 have been digitized. None of those 23 has been reviewed carefully and corrected against the hard copy version. If I were to teach a class and ask my students to read this book, I would choose an available print edition and have my students read that. We would all be reading the same words on the same page. I would like for there to be a digital edition that I know has been checked and corrected, and that I could recommend to my class.

For some works it may make sense to have multiple corrected digital versions, and we may eventually correct all of them. But for now, we need at least one reliable digital version that we know we can trust.

I know that there will be theoretical debates. These shouldn't keep us from acting. Creating a first "good" version of a digitized text does not prevent us from eventually creating corrected digital texts of other versions. But the sooner we have reliable versions the sooner we can begin to incorporate these digital works into our lives. In their current state they are unreliable.

PLS said...

One way to see the relevance of FRBR to the process of posting digital representations of printed works is to note that an image of a print work is the image of an "Item"--one sole physical copy of one manifestation of one expression of the work. Thus, one can never equate the posted image with the work; but rather it equates (sort of) to the physical item of which it is an image.
Another way to see how FRBR relates to digital representations is to note that regardless of how a transcription is created -- by OCR or typing -- it is a new manifestation of the expression to which the source item belonged. It is never closer to the source, or to any other form of the work, than that.
That a transcription that is unproofread or poorly proofread could be considered by anyone to represent a work adequately is beyond me.
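That hierarchy can be made concrete with a toy data model. A sketch in Python (all class and field names here are illustrative, not part of any FRBR standard or library) of why a page image attaches to one physical Item, while an OCR transcription constitutes a new Manifestation of the Expression:

```python
from dataclasses import dataclass

# Toy model of FRBR's Group 1 entities: Work -> Expression ->
# Manifestation -> Item. Names and fields are illustrative only.
@dataclass
class Work:
    title: str

@dataclass
class Expression:
    work: Work
    language: str

@dataclass
class Manifestation:
    expression: Expression
    edition: str

@dataclass
class Item:
    manifestation: Manifestation
    copy_id: str

tom_sawyer = Work("The Adventures of Tom Sawyer")
english = Expression(tom_sawyer, "en")
first_edition = Manifestation(english, "American Publishing Co., 1876")
library_copy = Item(first_edition, "shelf copy no. 1")

# A page image depicts exactly one physical Item...
scanned_image_of = library_copy
# ...while an OCR transcription is a *new* Manifestation of the Expression,
# sibling to the print edition rather than identical with it.
ocr_transcription = Manifestation(english, "OCR transcription of shelf copy no. 1")
```

The point the model makes is structural: the transcription and the print edition share an Expression, but equating the transcription with the Work (or even with the scanned Item) skips two levels of the hierarchy.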
You will find in the "White Paper" for the NEH sponsored HRIT project a more detailed approach to digital representations of literary texts:

Mark Pumphrey said...

Just to clarify one small point: In order to retain the entire publishing history of a work of literature, it is important to have as a goal the eventual digitization of all extant copies of the work, even flawed editions. Would you agree? Would this work continue unabated after the proposed certified URtext and best edition are identified?

Karen Coyle said...

Mark, I see two different things here:
the first is the use of the text by readers interested in the content and ideas; the second is the treatment of the "thing" itself, and thus the publishing history. I assume that there will be scholars who are interested in the history of the text; I'm not sure there will be enough of them in the near future to warrant the cleanup of all of the digital versions, or to crowdsource all of them. These editions are already being digitized (often in multiple copies), and the digital versions are being used in spite of the flawed OCR behind them. I don't think that anyone has suggested that we digitize only particular editions. In fact, the more comprehensive digitization projects are wholly indiscriminate, digitizing entire shelves of books with no attempt to make selections.

My concern is that for all of this digitizing we do not currently have even one corrected OCR version of these texts. That means that there is no version that can be studied with any kind of accuracy, at least not the underlying text. This affects both full-text searching and any computational uses of the texts. I suspect that the majority of users of Google book search are unaware that their searches run against uncorrected OCR, and even those who are aware cannot quantify the degree of accuracy of any of the texts. I just don't see how we can consider these texts usable in their current state, but I also don't know where we'll get the funding to correct them all. Thus my flawed albeit practical suggestion.