Wednesday, April 25, 2012

Digital Urtext

As we reach a point where many of the classic books of literature and science published before the magical date of 1923 have been digitized, it is time to consider the quality of those copies and the issue of redundancy.

A serious concern in the times before printing was that copying -- and it was hand-copying in those times -- introduced errors into the text. When you received a copy of a Greek or Latin work, you might be reading a text with key words missing or misrepresented. In our digitizing efforts we have reproduced this problem, and we are in a situation similar to that of the Aldine Press when it set out to reproduce the classics in printed form for the first time: we need to carry the older texts into the new technology as accurately as possible.

While the digitized images of pages may be relatively accurate, the underlying (and, for the most part, uncorrected) OCR introduces errors into the text. The amount of error is often determined by the quality of the original or the vagaries of older fonts. If your OCR is 99.9% accurate, you still have one error for every 1,000 characters. A modern book has about 1,500 characters on a page, so that means at least one error on every page (a quick calculation follows the examples below). There are also particular problems in book scanning, especially where text doesn't flow easily on the page. Tables of contents seem to be full of errors:
IX. Tragedy in the Gra\'eyard 80 

X. Dire Prophecy of the Howling Dog .... 89 
XL Conscience Racks Tom 98 
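
To make that error-rate arithmetic concrete, here is a minimal sketch; the page and book sizes are rough illustrative figures, not measurements:

# Back-of-the-envelope estimate of OCR errors at 99.9% character accuracy.
accuracy = 0.999
chars_per_page = 1500   # roughly a modern book page
pages = 300             # an illustrative book length

errors_per_page = chars_per_page * (1 - accuracy)
print(f"about {errors_per_page:.1f} errors per page,"
      f" {errors_per_page * pages:.0f} in the book")
# about 1.5 errors per page, 450 in the book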
 
In addition, older books have a tendency to use hyphenated line breaks a great deal:

and declined. At last the enemy's mother ap-
peared, and called Tom a bad, vicious, vulgar child, 

These remain on separate lines in the OCR'd text, which is accurate to the original but which causes problems for searching and any word analysis.
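
As an illustration of the cleanup that searching and word analysis require, here is a minimal sketch of rejoining hyphenated line breaks; it is my own toy example, not any scanning project's actual pipeline, and a real tool would consult a wordlist so that genuine hyphenated compounds survive:

import re

def join_hyphenated_lines(text):
    # Naively rejoin words split across lines with a trailing hyphen,
    # e.g. "ap-\npeared" becomes "appeared".
    return re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)

ocr = "the enemy's mother ap-\npeared, and called Tom a bad, vicious, vulgar child"
print(join_hyphenated_lines(ocr))
# the enemy's mother appeared, and called Tom a bad, vicious, vulgar child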

The other issue is that for many classic works we have multiple digital copies. Some of these are different editions, some are digitizations (and OCR-ing) of the same edition. Each has different errors.

For the purposes of scholarship, and for the use of these texts in teaching, it would be useful to have a certified "Urtext" version: a quality digitization with corrected OCR that scholars agree represents the text as closely and accurately as possible. This might be a digital copy of the first edition, or it might be a digital copy of an agreed "definitive" edition.

We have a notion of "best edition" (or "editions") for many ancient texts. Determining one or a small number of best editions for modern texts should not be nearly as difficult. Having a certified version of such texts must be superior to having students and scholars reading from and studying a wide variety of flawed versions. Professors could assign the Urtext version to their classes, knowing that every one of the students was encountering the same high-quality text.

(I realize that Project Gutenberg may be an example of a quality-control effort -- unfortunately those texts are not coordinated with the digital images, and often do not have page numbers or information about the edition represented. But they are to be praised for thinking about quality.)

Thursday, April 19, 2012

Clarification from Sweden on OCLC negotiations

The National Library of Sweden has issued a short blog post clarifying their objections to the WorldCat Rights and Responsibilities (WCRR) policy. The inability of the two parties to reconcile these issues led the library to break off contract negotiations with OCLC. I find the Library's objections to be logical and undeniable:

1. The relationship with OCLC around record use is asymmetrical, with OCLC having the right to do whatever it wishes with the records while library use is restricted by the policy.

2. The policy actually requires libraries to favor WorldCat over other services, thus hindering competition, which is not appropriate for a national library. [kc: This may even be illegal for publicly funded libraries in the US.]

3. Open data is of strategic importance for libraries.

They conclude with:

To this end we urge OCLC to allow members to treat downloaded records as their own, including releasing them under any open license such as CC0. We feel that this would strengthen rather than diminish OCLC's strong status as a service provider to the library community.

Sunday, April 08, 2012

Content and carrier

In the midst of a discussion regarding the description of extents in RDA, I came to a realization that I might have noticed sooner if I did cataloging. As it is, I am probably coming to this a bit late.

RDA chapter three describes carriers. This is where you find all of the terms of measurement that appear in library data, things like:

12 slides
1 audiocassette
1 map
box 16 × 30 × 20 cm

There is a controlled vocabulary in RDA for carriers. It has 54 entries that are in 8 categories:
audio carriers
computer carriers
microform carriers
microscopic carriers
projected image carriers
stereographic carriers
unmediated carriers
video carriers

Note that one of the examples above, "map," is not included in the list of carriers. Nor is the most common extent used, "pages."* These are described in their own lists, "Extent of cartographic resource" and "Extent of text."** Why are these separate from other carriers? The answer is: Because they are not carriers, they are types of content. The carrier of a map is either a globe or a sheet, but map is not a carrier, it is a type of Expression, as is text.

It turns out that cataloging has been mixing content and carrier descriptions in the extent area for ... well, perhaps forever:
1 map on 4 sheets
1 atlas (xvii, 37 pages, 74 leaves of plates)
1 vocal score (x, 190 pages)

In addition, when describing books the carrier isn't mentioned at all, just the content:
xvii, 323 pages
unless there is no extent of the content to record, at which point the book is called a "volume":
1 volume (unpaged)
I have no doubt that there are clear rules that cover all of this, telling catalogers how to formulate these statements. Yet I am totally perplexed about how to turn this into a coherent data format. In FRBR, there is something called "extent of content" as an attribute of the Expression entity:

4.3.8 Extent of the Expression
The extent of an expression is a quantification of the intellectual content of the expression (e.g., number of words in a text, statements in a computer program, images in a comic strip, etc.). For works expressed as sound and/or motion the extent may be a measure of duration (e.g., playing time).

while "extent of carrier" is an attribute of the Manifestation entity:

4.4.10 Extent of the Carrier
The extent of the carrier is a quantification of the number of physical units making up the carrier (e.g., number of sheets, discs, reels, etc.).

RDA does not have "extent of content," in part (I am told) because it would have separated the instructions for formulating the extent of content and carrier between chapters 7 and 3, respectively, and thus made it difficult for catalogers to create this mixed statement. Of course, one possible response might be that we shouldn't be creating a mixed statement, but two separate statements that could be displayed together as desired. These statements should probably also be linked to the content or carrier vocabulary term that is now carried in MARC 336, 337, or 338.
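
As a sketch of what those two separate statements might look like as data -- the field names and vocabulary URIs here are invented placeholders, not anything from RDA or MARC -- consider:

# Extent of content and extent of carrier as two separate statements,
# each linked to a (placeholder) vocabulary term, instead of the mixed
# string "1 map on 4 sheets".
description = {
    "extent_of_content": {
        "value": 1, "unit": "map",
        "content_type": "http://example.org/content/cartographic-image",
    },
    "extent_of_carrier": {
        "value": 4, "unit": "sheet",
        "carrier_type": "http://example.org/carrier/sheet",
    },
}

# A display layer can still assemble the familiar combined statement:
c = description["extent_of_content"]
k = description["extent_of_carrier"]
print(f'{c["value"]} {c["unit"]} on {k["value"]} {k["unit"]}s')
# 1 map on 4 sheets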

I looked at ONIX to see how this might have been handled by another bibliographic schema, and it appears that ONIX has two different measures: extent, which is used for extent of the content, and measure, which measures the physical item.

We have to clear up inconsistencies of this nature if we hope to produce a rational format or framework for bibliographic data. Dragging along practices from the past will result in poor-quality data that cannot interact well with data from any other source.

I will add this to the analysis of MARC on the futurelib wiki.

* I can't find "box" anywhere in any list, but perhaps I am missing something.

** Extent of Text is even more complex than I thought. Here is the list of terms to be used:
approximately
case
column
folded
in various foliations
in various numberings
in various pagings
incomplete
leaf
page
portfolio
sheet
unnumbered pages
unnumbered sequence of pages
volume
volume (loose-leaf)
These seem to describe not the extent of the text itself but the gathering of paper that something (mostly, though not necessarily, text) is printed on. Volume is a carrier, as is leaf or page or case. However, approximately is totally out of place. Incomplete seems to be a statement about the content, although I suppose you could say that the carrier is incomplete when pages are missing. Note that sheet is here, but not in the list for cartographic resources, so it seems that in describing the carrier for a map one would use sheet from Extent of Text.

Friday, April 06, 2012

If not RDF, then what?

There's no question that the data format known as RDF is darned difficult. Let's suppose that we in the library world decide not to hitch our wagon to RDF, but would still like to create a new bibliographic framework. After all, if MARC simply won't work for the creation of RDA records, we still need something besides MARC that we can use to create data. And even if (although this is unlikely) we should decide not to move to RDA, our records still need some upgrading to fit better into current data processing models. We still need to:
  • define our entities
  • use data wherever possible, not text
  • use identifiers for things
  • relate attributes to entities (that is, say things about some thing)
  • use a mainstream serialization

Should we do this, the mainstream serialization could be anything from JSON to XML to RDF. In fact, it could be all of those if we play our cards right and define our data in a format-neutral way. RDA does some of this for us, but not all. In particular, RDA does not distinguish between data and text, and although it allows for the use of identifiers it doesn't give any guidance on how to use them. RDA is probably fine as guidance rules for decision-making, but it needs the corresponding data definition before it becomes useful. Having that data definition could help to clarify some ambiguities in RDA. We have to expect that there will need to be some iteration between RDA and a data definition. (I will post shortly on a problem that I have run into.)
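
Here is a toy illustration of how one format-neutral definition of the data could drive more than one serialization; the element names are invented for the sketch:

import json
from xml.etree.ElementTree import Element, SubElement, tostring

# One definition of the data: an entity with an identifier, a text title,
# and a creator recorded as an identifier rather than a text string.
work = {
    "id": "http://example.org/work/42",
    "title": "The Adventures of Tom Sawyer",
    "creator": "http://example.org/person/twain",
}

# The same definition serialized as JSON...
print(json.dumps(work, indent=2))

# ...and as XML.
elem = Element("work", {"id": work["id"]})
SubElement(elem, "title").text = work["title"]
SubElement(elem, "creator", {"resource": work["creator"]})
print(tostring(elem, encoding="unicode"))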

It also seems to me that we have everything to gain by beginning our work on a data format with no particular serialization in mind. We could go from RDA to RDA-as-data and then on to RDA-as-RDF. I see some dangers in skipping the middle step, mainly that we could end up making some decisions that fit RDA into RDF but that are problematic for other serializations.

VIAF gets serious

There has been an announcement by OCLC that the Virtual International Authority File (VIAF) is "transitioning" to OCLC. Since it was already being run by OCLC, this may seem like no news, but in fact it is a sign of commitment on OCLC's part both to VIAF and to linked data. The announcement states:
The change has been made to assure that VIAF will be well-positioned to scale efficiently as a long-term, cooperative activity. The transition also assures that http://viaf.org will continue to have appropriate infrastructure to respond to rising levels of traffic as VIAF gains momentum and popularity as a resource for library authority work and linked data activities.
One of the many advantages of linked data is that you can roll out your implementation gradually. This announcement about VIAF leads me to wonder if library linked data can't begin with names, followed perhaps closely by subject lists. The question then becomes how we can link from name data on the web to library catalogs. There are already links from VIAF to Wikipedia and from Wikipedia to VIAF. This means that there are also links from DBPedia to VIAF, since DBPedia is a linked data form of Wikipedia. DBPedia is the center point of the linked data cloud, thus assuring maximum linking. After that, we reach a dead end, at least as far as library data is concerned, because many library systems do not make use of the authority file identifiers for names, and none, as far as I know, support the use of URIs as identifiers. We would need to pass through VIAF to get the appropriate text string to match against library data.
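
To sketch that last step -- passing through VIAF to recover the text strings that match a catalog's headings -- here is a toy example; the lookup table stands in for a real call to viaf.org, and the identifier, name forms, and record numbers are all invented for the illustration:

# Invented stand-in for a VIAF lookup: one identifier, several name forms.
viaf_forms = {
    "http://viaf.org/viaf/00000000": [
        "Twain, Mark, 1835-1910",
        "Clemens, Samuel Langhorne, 1835-1910",
    ],
}

# A catalog that stores only text headings, not identifiers.
catalog = {"Twain, Mark, 1835-1910": ["record-001", "record-002"]}

def records_for(viaf_uri):
    # Try each known form of the name until one matches a catalog heading.
    for form in viaf_forms.get(viaf_uri, []):
        if form in catalog:
            return catalog[form]
    return []

print(records_for("http://viaf.org/viaf/00000000"))
# ['record-001', 'record-002']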

As for subject access, there are a number of library subject heading lists that have been coded in SKOS and that have some interlinking; the ones I know about can be found at the Data Hub under the "Bibliographic" group.
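
A minimal sketch of what that interlinking looks like, using the rdflib library and two invented placeholder URIs rather than headings from any real list:

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import SKOS

g = Graph()
heading_a = URIRef("http://example.org/listA/0001")
heading_b = URIRef("http://example.org/listB/0001")

g.add((heading_a, SKOS.prefLabel, Literal("Cats", lang="en")))
g.add((heading_b, SKOS.prefLabel, Literal("Chats", lang="fr")))
# The interlinking: the two headings are declared equivalent.
g.add((heading_a, SKOS.exactMatch, heading_b))

print(g.serialize(format="turtle"))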

(I should mention that VIAF is being licensed as ODC-BY, meaning even commercial use is allowed. What isn't immediately clear is whether the 'BY' -- that is, attribution -- applies to the service as a whole or to individual records or even to individual headings. Attribution adds a small but significant burden on downstream users, so it would be great if the only requirement were that one acknowledge VIAF in appropriate documentation.)

Ideas on how to proceed from this point are very welcome.