A serious concern in the era before printing was that copying -- and it was hand-copying in those days -- introduced errors into the text. When you received a copy of a Greek or Latin work, you might be reading a text with key words missing or misrepresented. In our digitizing efforts we have reproduced this problem, and we are in a situation similar to that of the Aldine Press when it set out to reproduce the classics for the first time in printed form: we need to carry the older texts into the new technology as accurately as possible.
While the digitized images of pages may be relatively accurate, the underlying (and, for the most part, uncorrected) OCR introduces errors into the text. The amount of error often depends on the quality of the original or the vagaries of older fonts. Even if your OCR is 99.9% accurate, you still have one error for every 1,000 characters. A modern book has about 1,500 characters on a page, so that means an average of one to two errors on every page. There are also particular problems in book scanning, especially where text doesn't flow easily on the page. Tables of contents seem to be full of errors:
IX. Tragedy in the Gra\'eyard 80 X. Dire Prophecy of the Howling Dog .... 89 XL Conscience Racks Tom 98
In addition, older books have a tendency to use hyphenated line breaks a great deal:
and declined. At last the enemy's mother ap-
peared, and called Tom a bad, vicious, vulgar child,
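The two halves of the word remain on separate lines in the OCR'd text, which is faithful to the original but causes problems for searching and for any word analysis: a search for "appeared" will not match "ap-" at the end of one line and "peared" at the start of the next.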
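One way to work around this, without giving up the line-faithful transcription, is to derive a second text for searching in which the split words are rejoined. The following is only a minimal sketch in Python; the function name `join_hyphenated_lines` is made up for illustration, and the rule it applies (lowercase letter, hyphen, line break, lowercase letter) is an assumption that will wrongly merge genuine compounds such as "well-known" when they happen to break at the hyphen.

```python
import re


def join_hyphenated_lines(text: str) -> str:
    """Rejoin words that were split by a hyphen at the end of a line.

    Assumption (not from any particular OCR tool): a lowercase letter,
    a hyphen, a newline, and another lowercase letter mark a split word.
    The hyphen and the line break are both dropped.
    """
    return re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)


ocr_lines = (
    "and declined. At last the enemy's mother ap-\n"
    "peared, and called Tom a bad, vicious, vulgar child,"
)

search_text = join_hyphenated_lines(ocr_lines)
print(search_text)
```

The original lines would be kept as the record of what is on the page; only the derived text would be fed to the search index or the word counter.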
The other issue is that for many classic works we have multiple digital copies. Some of these are different editions; others are separate digitizations (and separate OCR runs) of the same edition. Each has its own errors.
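Because each copy has different errors, comparing two OCR'd copies of the same edition word by word is one way to surface places where at least one of them is probably wrong. The sketch below uses Python's standard `difflib` module; the function name `differing_words` and the second sample string are hypothetical, invented here for illustration, and a real comparison would first need the two copies aligned page by page.

```python
import difflib


def differing_words(copy_a: str, copy_b: str) -> list[tuple[str, str]]:
    """Return the word runs where two transcriptions disagree.

    Each item pairs the words from copy_a with the words from copy_b
    at the same position; an empty string means one copy has nothing there.
    """
    words_a = copy_a.split()
    words_b = copy_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return diffs


# First string: a table-of-contents entry as one OCR run read it.
# Second string: a hypothetical reading from another copy (an assumption).
print(differing_words("XL Conscience Racks Tom 98",
                      "XI. Conscience Racks Tom 98"))
```

A disagreement only shows that the copies differ, not which one is right; deciding that still means going back to the page image.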
For the purposes of study, it would be useful to have a certified "Urtext" version of each of these texts: a quality digitization with corrected OCR that scholars agree represents the text as closely and accurately as possible. This might be a digital copy of the first edition, or it might be a digital copy of an agreed "definitive" edition.
We have a notion of "best edition" (or "editions") for many ancient texts. Determining one or a small number of best editions for modern texts should not be nearly as difficult. Having a certified version of such texts must be superior to having students and scholars reading from and studying a wide variety of flawed versions. Professors could assign the Urtext version to their classes, knowing that every student was encountering the same high-quality text.
(I realize that Project Gutenberg may be an example of a quality control effort. Unfortunately, those texts are not coordinated with the digital images, and they often lack page numbers or information about the edition represented. But they are to be praised for thinking about quality.)