Sunday, December 17, 2006

Digitization and the Catalog

I have just posted the preprint of my current column for the Journal of Academic Librarianship, titled "Mass Digitization of Books." It takes about 4-6 months for the columns to be published, and as I read over this one I can see that things have already changed. For example, when I wrote the column, Google was not yet allowing the download of its public domain books.

However, I should have included one more very important issue in the article, but it hadn't occurred to me at the time: the effect of this mass digitization on our catalogs. The cataloging rules require that the digital copy be represented in the catalog with its own record. This means that a library that undergoes a mass digitization project on its book collection faces doubling the number of book records in its catalog. Leaving aside the issues of user display for now, and assuming that the creation of the records requires very little human intervention, we can probably still calculate a significant cost in storage space (albeit cheap these days), the size of backups, the time to load and index all of those records, and a general overhead in the underlying database.

This brings up the issue of creating catalog entries that represent "multiple versions," that is, having a single record that contains the information for all of the different formats in which the book is available -- regular print, e-book version, digitized copy, large print. There are good arguments both for and against, and it's a complex discussion, but I'll just say that I am convinced that we could structure our catalog records in a way that would make this work.
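To make the "multiple versions" idea concrete, here is a minimal sketch of what a single record carrying several formats might look like. The class names, field names, and identifiers are all invented for illustration; nothing here comes from MARC or from any cataloging rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Version:
    """One format in which the book is available (hypothetical fields)."""
    format: str        # e.g. "print", "e-book", "digitized copy", "large print"
    identifier: str    # placeholder identifier invented for this sketch


@dataclass
class BookRecord:
    """A single catalog record shared by every format of the same book."""
    title: str
    author: str
    versions: List[Version] = field(default_factory=list)

    def add_version(self, format: str, identifier: str) -> None:
        self.versions.append(Version(format, identifier))

    def formats(self) -> List[str]:
        return [v.format for v in self.versions]


record = BookRecord(title="Example Title", author="Example, Author")
record.add_version("print", "copy-001")
record.add_version("digitized copy", "scan-001")

catalog = [record]
print(len(catalog))       # the catalog still holds one record after digitization
print(record.formats())
```

In this arrangement, digitizing the book adds an entry to the record's list of versions rather than a second record to the catalog.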

Wednesday, December 13, 2006

Section 108, oh my!

The Library of Congress Study Group on Section 108 (of Title 17, the US copyright law) has issued a "notice of a public roundtable with a request for comments" in the ever-popular Federal Register. Which we all read daily, right? (I checked - no RSS feed that I could find, thank you very much.)

I have only read through the section on Topic A (there's also a Topic B), but I don't think I can go any further. This is about the worst mish-mash I have ever seen. If this is intended to clarify things, we are in deep doo-doo. (Believe me, I'm trying hard not to sound any less professional than that.)

OK, first, Section 108 is the section of the US copyright law with exceptions for libraries. In essence, Section 108 allows libraries to make copies of items that are still under copyright in certain prescribed cases. The Library of Congress formed a group to study Section 108 and make recommendations on how to update it for the digital environment. The group has been meeting, behind closed doors, for over a year. The group consists of lawyers, librarians, publishers, and lawyers. Oh, I said that, didn't I? They have held public meetings and have issued a document outlining what they see as the issues. This most recent call is proof that the study group is getting absolutely nowhere.

The subsections of Section 108 under question in this "notice" are the two that allow copying for lending, both within the library and over interlibrary loan. Because the study group's meetings are not open to the public, and because this is a highly political issue, the notice asks many questions that are suspiciously leading but there is no clue as to WHOSE issue it is. There also isn't much to explain the assumptions about technology that are behind some of the questions, so I often find myself unable to understand WHY a certain question is being asked.

That said, here are some examples of what I think are very strange statements and questions:

  • There is a great deal of concern about users receiving a copy of an item from a library through Interlibrary Loan without going through their own library. In other words, direct user borrowing. This violates what someone sees as the "natural friction" of ILL:

    it was presumed that users had to go to their local library to make an interlibrary loan request. ... for any user electronically to request free copies from any library from their desks, that natural friction would break down, as would the balance originally struck by the provision.

    Now this is just weird. Essentially they are implying that ILL was ok, even digital delivery, as long as it was inefficient and costly. If it becomes efficient, then it's just too much, and competes with sales. (I don't really see a difference between a user sending a request to their own library for an ILL rather than directly to the lending library -- except for the cost to the local library to pass the request through. And if that becomes efficient enough, the user won't even know how many middle-men there are in her request.)

  • Question 1:

    How can copyright law better facilitate the ability of libraries and archives to make copies for users in the digital environment without unduly interfering with the interests of rightsholders?

    What? Isn't this exactly what the study group has been discussing for 18 months? Now they put out a public notice asking the rest of us to answer the question? Haven't they at least worked it out to a set of choices or options? What have they been doing?

  • Question 3 (and Question 4 is very similar)

    How prevalent is library and archives use of subsection (d) for direct copies for their own users? For interlibrary loan copies? How would usage be affected if digital reproduction and/or delivery were explicitly permitted?

    Uh, isn't this something that someone should study? I mean, this is not something you ask people's (even educated) opinions on -- you've got to get facts and figures. It would be very interesting to know how much digital copying and delivery does go on in libraries. Without that information, we're just jabbering into the wind here, aren't we?

  • Question 5
    ... should there be any conditions on digital distribution that would prevent users from further copying or distributing the materials for downstream use?

    Well, there are conditions, and they are called copyright law. And of course they deter more than they prevent, but this really seems to be a silly question.
    Should persistent identifiers on digital copies be required?

    I wonder what they think that identifiers will accomplish? Do they see them as acting like watermarks, that would identify whose digital copy it is?

  • Question 7
    Should subsections (d) and (e) be amended to clarify that interlibrary loan transactions of digital copies require the mediation of a library or archives on both ends, and to not permit direct electronic requests from, and/or delivery to, the user from another library or archives?


OK, I'll stop here. As I have said, these statements and questions are so odd that I have no idea what happened in that closed room but it was weird.
Let me remind you that anonymous comments are allowed on this blog. So if you have some inside information on what the real problems are that are behind these questions, I would love to hear from you.

Sunday, December 10, 2006

The keyboard

I spend a lot of time each day "working the keyboard." It's easy to take it for granted; I learned to touch type in junior high school when the ability to type with speed and accuracy was part of a common job description. Little did we know at the time that we were heading into a future when everyone typed, and that typing would no longer be considered a special skill. (Nor would it be considered something "girly".)

There has been some questioning of the keyboard in the form of criticism of the QWERTY design. I tried switching to a Dvorak keyboard for a while, but didn't have the patience to work up to an approximation of the unconscious ease with which I type today. Recent ads I've seen are touting voice recognition as the replacement for typing, but I don't want to say all of my thoughts out loud, and in most offices with open or cubicled designs voice recognition would lead to cacophony. No, I'm happy to type, I just want it to be more efficient.

What I haven't seen questioned, yet it must have occurred to someone, is why we are still typing every letter when software could fill in or complete most words for us. Remember the ads that used to be on the back of magazines: "if u cn rd ths u cn gt a gd jb"? That's how I'd like to type. Yes, I can add those into my MS Word autocorrect, and I have placed a select number of long words I hate to type into the list. But we know that our language is very predictable and we should be able to take advantage of that. There are interesting IM keyboard options like T9 Word -- although obviously, the IM vocabulary doesn't need a large dictionary behind it. Open Office tries to help out by auto-completing words as you type, but this is useless for a touch typist because you have to 1) watch the screen (I often type while staring into space) and 2) take your fingers off their normal home row positions to hit the enter key. The Open Office method might work with a re-organized keyboard with a special key that means "go for it" when the screen shows the correct word, but I still think that would be slower than touch typing.
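As a rough illustration of the "if u cn rd ths" style of typing, here is a small sketch that expands vowel-less shorthand against a word list. The vocabulary, the `overrides` table (standing in for hand-made autocorrect entries), and the function names are all my own assumptions. A real dictionary would produce many ambiguous skeletons ("rd" could be read, road, or ride), which this sketch simply leaves unexpanded.

```python
from collections import defaultdict

VOWELS = set("aeiou")


def skeleton(word):
    """Drop the vowels from a word; all-vowel words are kept whole."""
    consonants = "".join(ch for ch in word if ch not in VOWELS)
    return consonants or word


def build_index(vocabulary):
    """Map each consonant skeleton to the set of words that share it."""
    index = defaultdict(set)
    for word in vocabulary:
        index[skeleton(word)].add(word)
    return index


def expand(text, index, overrides):
    """Expand tokens that map to exactly one word; leave the rest alone."""
    out = []
    for token in text.split():
        if token in overrides:
            out.append(overrides[token])
            continue
        candidates = index.get(token, set())
        out.append(next(iter(candidates)) if len(candidates) == 1 else token)
    return " ".join(out)


vocabulary = ["can", "read", "this", "get", "good", "job"]
overrides = {"u": "you"}   # shorthand that is not a simple vowel drop
index = build_index(vocabulary)

print(expand("if u cn rd ths u cn gt a gd jb", index, overrides))
# if you can read this you can get a good job
```

The hard part is not the lookup but the ambiguity: the larger the dictionary, the more skeletons map to several words, and the software would need context to choose among them.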

A neighbor of mine is a court reporter. She has the chorded court reporter "typewriter" which today hooks into a computer that auto-translates from the shorthand coming out of the device to words. The output isn't perfect, but it's good enough to be used in a courtroom in real time to feed the text to lawyers. That shows me that it can be done. Yes, of course, we'd all have to learn something new. But upcoming generations would benefit from a better solution to getting words onto a screen.

Sunday, November 26, 2006

Authorities and Authors

I was reading through some chapters of Joanna Russ's book "How to Suppress Women's Writing," when I had some ideas about authority control. Russ's book is one that I re-read often. It speaks to more than just women's writing -- it is a general description of how the accomplishments of a non-dominant group in any society can be ignored or devalued. Russ mentions many women authors who never make the "top 100" list, or the "Anthology of [whatever] Literature." She also states that a majority of writers in the 19th century were women.

I immediately thought how it would be interesting to use a large database, such as LoC's file, or WorldCat, to retrieve authors either by gender or country of origin. It then occurred to me that this is information that we do not include in authority records, even though it is probably available in a majority of cases. I also recall -- although I cannot place -- a discussion about adding to the authority record for an author the names of all of the author's works. In this sense, the authority record would be more than a controlled form of the author's name, it would actually contain information about the author that would be of interest to catalog users. There is talk of adding links from author names in catalog records to their Wikipedia entries. Those entries are surely more of interest to users than the authority record, which is just a list of variant forms of the name. For example, look at the wikipedia entry for Joanna Russ, and compare that to the authority record for the same person.

So imagine an "author" record, either related to or in place of the authority record. It could help users understand who the author is, and to place the author in a historical period (even if the authoritative form of the name doesn't include dates). If coded well, a database of author records could provide some interesting information for various areas of study.

Wednesday, November 15, 2006

Cataloging v. Metadata

The joke goes:
Metadata is cataloging done by men.
The point of the joke (yes, I know it isn't funny if you have to explain a joke) is that the term metadata was coined to make cataloging palatable to the computer community, and that they're really the same thing. We often use the terms interchangeably, but my recent forays into the world of RDA development have led me to look more closely at the meaning of the term cataloging, and I have concluded that metadata creation and cataloging are very different activities. I do consider a catalog record to be a form of metadata, but not all metadata, and not even all bibliographic metadata, is cataloging.

It's not just a matter of having rules or not having rules, however. Although it seems obvious to say this, the goal of cataloging is the creation of a catalog. Catalogers create entries for the catalog using rules that are designed not only to produce a certain coherent body of data, but also to enforce a particular set of functions (access via alphabetically displayed headings is an example of a function) that are required to support the catalog. The catalog entries create the catalog.

Library cataloging, as a form of metadata, has traditionally had well-defined goals. The catalog records were defined in terms of the catalog's physical structure and functions. With the card-based cataloging rules, from the ALA rules through AACR2, each catalog entry was a precise, unchangeable unit of the catalog, a cell in a well-designed body. As the cataloger created the record, she could know exactly how headings would be used, exactly where they would be filed in the catalog, exactly what actions users would have to take to navigate to them.

This precision ended with the creation of the online catalog. Catalog entries that had been created for a linear card file were being accessed through keyword searches; displays no longer followed the "main entry" concept; the purposeful unit created by the cards in the catalog was destroyed. There was no longer a match between the catalog and the cataloging.

Which is why many of us are having a hard time with the development of our future cataloging rules, RDA. RDA doesn't define the catalog that it is creating catalog entries for, which brings into question how decisions are being made. What is the concept of a catalog in this highly mutable world of ours? I can't imagine that we can create cataloging rules until we define the catalog (or catalogs) that the rules pertain to. It may not be the highly structured system that we had with the card catalog, where access points were THE access points and every card had its one place in the ordered universe that was the catalog. Still, we need to define the catalog before we can expect to create the entries that go into it. We can start with the FRBR "find, identify, select, obtain," but we must close the enormous gap between those functions and actual catalog entries.

Sunday, November 05, 2006

FRBRoo (Object-Oriented)

The folks working on the Conceptual Reference Model for museums (CIDOC) have produced an analysis of FRBR as object-oriented, to match their own OO model. The FRBR final report states that FRBR is based on a relational model, but I have always thought that its hierarchical nature (at least that of Group 1 entities) lends itself better to an OO form. (I portrayed it as OO in my 2004 Library Hi-Tech article.) This allows inheritance from the Work to the Expression, etc.
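One way to picture inheritance down the Group 1 hierarchy is the sketch below. It is only my reading of the idea, not anything defined in FRBR or in the CIDOC analysis: each entity points to its parent and looks up any attribute it lacks there, which is delegation rather than class inheritance. The attribute names and values are hypothetical.

```python
class Entity:
    """Base class for this sketch: a missing attribute is looked up on the parent."""

    def __init__(self, parent=None, **attributes):
        self.parent = parent
        self.__dict__.update(attributes)

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails,
        # so values set on the entity itself always win.
        parent = self.__dict__.get("parent")
        if parent is None:
            raise AttributeError(name)
        return getattr(parent, name)


class Work(Entity):
    pass


class Expression(Entity):
    pass


class Manifestation(Entity):
    pass


work = Work(title="Der Zauberberg", creator="Mann, Thomas")
expression = Expression(parent=work, language="eng", title="The Magic Mountain")
manifestation = Manifestation(parent=expression, carrier="print")

print(manifestation.creator)   # found on the Work
print(manifestation.title)     # found on the Expression, which overrides the Work
print(manifestation.carrier)   # set on the Manifestation itself
```

The creator is recorded once, on the Work, and every Expression and Manifestation below it can answer the question "who wrote this?" without repeating the data.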

An OO model forces a certain rigor on the data, and in doing their analysis the CIDOC folks found that they needed to redefine some elements of FRBR, in particular the definition of the Work.
FRBRER was flawed with some logical inconsistencies, in particular with regard to its “Group 1 of entities,” those entities that account for the content of a catalogue record. (p.9)
Their problem with the FRBRer (entity-relationship) model's definition of Work seems to be the same as has been bothering me recently, especially in the reading I have done on the difficulties of applying the FRBRer model to serials.
The Work entity such as defined in FRBRER seemed to cover various realities with distinct properties. While the main interpretation intended by the originators of FRBRER seems to have been that of a set of concepts regarded as commonly shared by a number of individual sets of signs (or “Expressions”), other interpretations were possible as well: that of the set of concepts expressed in one particular set of signs, independently of the materialisation of that set of signs; and that of the overall abstract content of a given publication. FRBROO retains the vague notion of “Work” as a superclass for the various possible ways of interpreting the FRBRER definitions: F46 Individual Work corresponds to the concepts associated to one complete set of signs (i.e., one individual instance of F20 Self-Contained Expression); F43 Publication Work comprises publishers’ intellectual contribution to a given publication; and F21 Complex Work is closer to what seems to have been the main interpretation intended in FRBRER. Additionally, a further subclass is declared for F1 Work: F48 Container Work, which provides a framework for conceptualising works that consist in gathering sets of signs, or fragments of sets of signs, of various origins (“aggregates”). (p. 9-10)

If I may paraphrase, the FRBRer Work includes both individual works of creative effort and publishers' containers for groups of works. The CIDOC solution is perhaps more complex than I had imagined, but the distinction between the creative output and what publishers (and editors) have chosen to place together in the same container is an important one for our user service goal, especially in the areas of journal publishing and music publishing. Their "container" is what I've been calling the "package" in discussions on the next gen catalog list.

I will think about how this analysis might help us design a bibliographic record. The diagram on page 10 of this report implies that there are two forks to the description: the author's context and the publisher's context. It seems that today's cataloging rules (and perhaps RDA as well) conflate those two, and that when those contexts differ the rules emphasize the publisher's.
In other words, descriptive cataloging is describing the published Work, not the author's Work. If we see those as separate, would our catalog look more FRBR-like?

Thursday, November 02, 2006

Relators

Much of the buzz related to FRBR is its emphasis on relationships. There are the relationships between works, between works and expressions, etc. down the FRBR model through manifestations and items. These are the Group 1 entities in FRBR. Less is said about the Group 2 entities and their "Responsibility" relationships ("is created by Person" "is realized by Corporate Body"). These look a lot like the RDF "triples" that many developers are fond of as semantic organizing principles for data. This is also very similar to the relator codes that we have in the cataloging rules and in MARC21: Smith, Jane, ed.

I have often been frustrated that searches in library catalogs do not allow me to include (or exclude) roles, such as "editor" or "translator." I am annoyed when a search on "Nabokov, Vladimir Vladimirovich, 1899-1977" in the so-called "author" field brings up numerous editions of Lewis Carroll's Alice in Wonderland, translated into Russian by Mr. Nabokov. Yet when I look at the detailed records, in most he is listed simply as an added entry by his full name, but with no relator code. In essence, the catalog has no way to distinguish between works he wrote (or co-wrote, thus the use of the added entry) and those he translated. Unfortunately, it is clear when you look at records in library catalogs that those role codes have not been assigned consistently.
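The kind of search I have in mind can be sketched in a few lines. The record structure and the sample records below are invented for illustration and are much simpler than real MARC data; only the relator codes "aut" and "trl" are real codes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NameEntry:
    name: str
    relator: Optional[str] = None   # relator code such as "aut" or "trl"; None if uncoded


@dataclass
class Record:
    title: str
    names: List[NameEntry]


def search_by_name(records, name, role=None):
    """Return titles on which `name` appears, optionally only in the given role."""
    hits = []
    for record in records:
        for entry in record.names:
            if entry.name == name and (role is None or entry.relator == role):
                hits.append(record.title)
                break
    return hits


NABOKOV = "Nabokov, Vladimir Vladimirovich, 1899-1977"

# Illustrative records only, not copied from any catalog.
records = [
    Record("Despair", [NameEntry(NABOKOV, "aut")]),
    Record("Alice in Wonderland (Russian, coded)",
           [NameEntry("Carroll, Lewis", "aut"), NameEntry(NABOKOV, "trl")]),
    Record("Alice in Wonderland (Russian, uncoded)",
           [NameEntry("Carroll, Lewis"), NameEntry(NABOKOV)]),
]

print(search_by_name(records, NABOKOV))               # all three records
print(search_by_name(records, NABOKOV, role="aut"))   # only the work he wrote
print(search_by_name(records, NABOKOV, role="trl"))   # misses the uncoded record
```

The last search shows the catch: a role filter is only as good as the coding. The uncoded translation drops out of the "translator" results even though it belongs there.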

Thom Hickey did a study of relator codes and relator names ($4 v. $e), which he reported in his blog, and came up with the figures below. His interest was in the interaction between the code and the name. Since his study was done in the OCLC WorldCat catalog, I think it points out that these key roles are not being coded in our records, which essentially results in a lot of false hits for our users. If we can't get these simple relationships coded into our data today, what hope do we have for a relationship-oriented bibliographic view in the future?

Thom's list of top terms:

Relator codes:

prf (1,080,900)
cnd (203,921)
voc (78,921)
itr (77,058)
aut (72,700)
act (56,518)
arr (50,621)
edt (49,205)
trl (43,608)
ill (42,657)

and the top relator terms:

ed (629,083)
joint author (474,307)
ill (214,764)
tr (172,801)
comp (123,239)
printer (60,070)
photographer (45,115)
orient (40,064)
illus (38,201)
former owner (34,892)

From this it seems obvious to me that
  1. These codes are being used primarily in certain cataloging sub-cultures (music and archival works are my best guess)
  2. Some are obviously under-used, in particular "joint author" and "translator"
Here are the added entries for a record for Rainer Werner Fassbinder's film based on Nabokov's novel "Despair":

Bogarde, Dirk, 1921-1999
Ferréol, Andrea
Spengler, Volker
Märthesheimer, Peter, 1938-
Fassbinder, Rainer Werner, 1946-1982
Stoppard, Tom
Nabokov, Vladimir Vladimirovich, 1899-1977. Otchai︠a︡nie
Löwitsch, Klaus, 1936-

All of these have the same coded relationship to the bibliographic work being described in the MARC21 record: personal added entry.

I very much like the idea of distinguishing roles and making those relationships part of the user experience. But if we aren't taking advantage of the ability we have today, I don't have much hope that we will code these relationships in the future.


Wednesday, November 01, 2006

Cat-a-log(gue)

I was taken to task in the FRBR Blog for saying that FRBR is not about catalogs.
I don’t understand the statement that FRBR isn’t about catalogues. It’s Functional Requirements for Bibliographic Records, and when bibliographic records are shown to users, that’s called a catalogue.
I realized from this comment that I am using the wrong term when I say catalog. A catalog is defined as a list of items, coming from word roots that mean "list, to count up." Examples are a library catalog, a list of items offered for sale, or the catalog of a museum exhibit that lists the items in the exhibit. In the library, the catalog is an inventory of items owned or held by the library, and it was once equal to what users could access in the library. From the early or mid-1800s, the physical library catalog also served as the user interface to the library's holdings. (Prior to that time, catalogs were mainly used by librarians, not by members of the library's public.)

The as-yet-unnamed thing that I wish to define (and which I mistakenly called a catalog) serves these functions (please add on what I have forgotten):
  1. A list of items owned by the library. This list is at a macro level (e.g. serial titles but not the articles in the serial). Often that level is determined by the purchase unit, since this list interacts with the library's acquisitions function.
  2. Serial issues received. This is usually found in a separate module called a serials check-in system (which replaced the old Kardex)
  3. Licensed resources. These may be listed in the catalog, but they may either/also be found in a database used by an OpenURL resolver or in an ERM system (which is not accessible to users). In some cases, these are listed on a web site managed by the library.
  4. Journal article indexes. These used to be hard-copy reference books. They are now often electronic databases. User interface to these varies.
  5. Items available via ILL. This could be a union catalog of libraries in a borrowing unit. It also is a function that interacts with OCLC's ILL system. This latter usually isn't visible in the user view of the library.
  6. Links and connections from information systems not hosted by the library, such as the ability to link from an article in a licensed database to the full text of the article from another source; or a link from an Internet search engine to library-managed resources through a browser plug-in or a web service.
  7. Location and circulation status information, plus the ability for users to place holds on items or to request delivery of items.
  8. Interaction with institutional services such as courseware.
  9. One or more user interfaces. Many of these services above will have a separate interface just for that service, but there are also meta-interfaces that will combine services.
This is a first stab. I'll post this on the futurelib wiki where it can be easily modified.

Wednesday, October 25, 2006

FRBR-izing

There's nothing like inter-continental travel to provide you with those sleepless nights that are ideal for a re-reading of the FRBR document. I am not a cataloger, so I assume that I am missing some or much of the importance of the FRBR analysis in relation to how catalogers view their activity. My reading of FRBR is that it is a rather unrevolutionary macro analysis of what cataloging already is. As a theoretical framework, it gives the cataloging community a new way to talk about what they do and why they do it. From what I've understood of the RDA work, FRBR has brought clarity to that discussion, and that's a Good Thing.

Then I hear about people FRBR-izing their catalogs, and I have to say that I can find nothing in the FRBR analysis that would support or encourage that activity. FRBR is not about catalogs, it's not even about creating cataloging records, and it definitely does not advocate the clustering of works for user displays. I'm not sure where FRBR-izing came from, but it definitely didn't come from FRBR. FRBR defines something called the Work, but does not tell you what to do with it. In addition, the Work is not a new idea (see section 25.2 of AACR2 where it describes the use of Uniform Titles).

I think that those of us in the systems design arena have confused FRBR, or perhaps co-opted it, to solve two pressing problems of our own: 1) the need to provide a better user interface to the minority of prolific works, that is, the Shakespeares and the oft-translated works; and 2) the need to manage works that appear in many physical formats, such as a printed journal and the microform copy of that journal, or an article that is available in both HTML and PDF. We can find elements of FRBR that help us communicate about these issues; we can talk about Works (in the FRBR sense) and Manifestations. But solving these problems is not a FRBR-ization of the catalog.

The first problem, that of prolific works, had at least a partial solution in the card catalog: the Uniform Title. It was that title that brought together all of the Hamlets, or all of the translations of Mann's "Zauberberg." While RDA may in the future define work somewhat differently from AACR2 and may expand the breadth of the groupings of bibliographic records, this isn't a new concept. Interestingly, I find that Uniform Titles are often not assigned in catalog records, which limits their usefulness. So here we are hailing FRBR when we aren't making use of a mechanism (UTs) we already have. In any case, we are finally trying to cluster works in a way that should have already been part of our online catalog. Although the definition of Work may have changed, the idea of grouping by work is not new.
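Grouping by Uniform Title is simple enough to sketch. The records here are bare dictionaries with field names I made up, and the uniform title is reduced to a single string (a real one would carry language and other additions). Records with no uniform title fall back to their own title, which is exactly why they fail to cluster.

```python
from collections import defaultdict


def cluster_by_uniform_title(records):
    """Group record titles under their uniform title, if one was assigned."""
    clusters = defaultdict(list)
    for record in records:
        key = record.get("uniform_title") or record["title"]
        clusters[key].append(record["title"])
    return dict(clusters)


# Illustrative records; the field names are hypothetical.
records = [
    {"title": "Der Zauberberg", "uniform_title": "Zauberberg"},
    {"title": "The Magic Mountain", "uniform_title": "Zauberberg"},
    {"title": "La montagne magique"},   # no uniform title assigned
]

for key, titles in cluster_by_uniform_title(records).items():
    print(key, "->", titles)
```

The German and English editions come together under "Zauberberg," while the French translation, lacking a uniform title, sits alone in a cluster of one.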

The next problem is one I hoped would be addressed in RDA but it appears that it isn't (well, I can't find it in the drafts): should we catalog different physical formats as separate items, or could we have a hierarchical view of our catalog entry that would allow different physical formats to be listed as a single item? Physical formats are important because the format can determine the user's ability to make use of the item. This is conceptually a cataloging question, but it's also a systems design issue, which is one of the reasons why I would like to see some work between the RDA committees and a group of systems designers. From this latter point of view, my preference would be to create a multi-level record that allows for manifestation and copy-level information to be carried with (or linked to) the bibliographic data. The MARC21 Holdings Format gives us one model for a solution, but in my experience it needs a make-over (and another level of hierarchy) in order to play this role.

I'd like to see discussion of both of these issues and their possible solutions. It is clear to me that post-processing of current catalog records is not sufficient to create the kind of user services that we want to provide. We are going to have to talk about what we want our data to look like in order to serve users of our catalogs.

Tuesday, October 24, 2006

Google Book Search is NOT a Library Backup

I have seen various quotes from library managers that the Google Book Search program, which is digitizing books from about a dozen large research libraries, now provides a backup to the library itself. This is simply not the case. Google is, or at least began as, a keyword search capability for books, not a preservation project. This means that "good enough" is good enough for users to discover a book by the keywords. A few key facts about GBS:

1) it uses uncorrected OCR. This means that there are many errors that remain in the extracted text. A glaring example is that all hyphenated words that break across a line are treated as separate words, e.g. re-create is in the text as "re" and "create". And the OCR has particular trouble with title pages and tables of contents:

Copyright, 18w,

B@ DODD, MEAD AND COMPANY,

411 r@h @umieS

@n(Wr@ft@ @rr@

5 OHN WILSON AND SON, CAMBRIDGE, U. S. A.

Here's the table of contents page:

(@t'

@ 1@ -r: @

@Je@ @3(

CONTENTS

CHAPTER PAGS

I. MATERIAL AND METHOD . . 7
II. TIME AND PLACE 20
III. MEDITATION AND IMAGINATION 34
IV. THE FIRST DELIGHT . . . 51
V. THE FEELING FOR LITERATURE 63
VI. THE BOOKS OF LIFE . . . 74
Vii. FROM THE BOOK TO THE READER 8@
VIII. BY WAY OF ILLUSTRATION . 95
IX. PERSONALITY 109
X. LIBERATION THROUGH IDEAS . 121
XI. THE LOGIC OF FREE LIFE. . 132
XII. THE IMAGINATION 143
XIII. BREADTH OF LIFE 154
XIV. RACIAL EXPRESSION . . . i65
XV. FRESHNESS OF FEELING. . . 174

2) it will not digitize all items from the libraries. Some will be considered too delicate for the scanning process, others will present problems because of size or layout. It isn't clear how they will deal with items that are off the shelf when that shelf is being digitized.

3) quality control is generally low. I have heard that some of the libraries are trying to work with Google on this, but the effort by the library to QC each digitized book would be extremely costly. People have reported blurred or missing pages, but my favorite is:

"Venice in Sweden"
Search isbn:030681286X (Stones of Venice, by Ruskin)
Click on the link and you see a page of Stones of Venice. Click on the Table of Contents and you're at page two or so of a guidebook on Sweden. Click forward and backward and move seamlessly from Venice to Sweden and back again. Two! Two! Two books in one! (I reported this to G months ago.)

4) the downloaded books aren't always identical to the book available online (which in turn may be different to the actual physical book due to scanning abnormalities). Look at this version of "Old Friends" both online and after downloading, and you'll see that most of the plates are missing from the downloaded version. Not necessarily a back-up problem, but it doesn't instill confidence that copies made from their originals will be complete.

Note that these examples may not affect the usefulness of the search function provided by Google, but they do affect the assumption that these books back up the library.
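The hyphenation problem in point 1 is the sort of thing that post-processing could address, at least partly. The sketch below is my own guess at an approach and says nothing about how Google actually handles its OCR text. It rejoins a word split by a hyphen at a line end, dropping the hyphen only when the joined form appears in a word list; the `known_words` set is a stand-in for a real dictionary.

```python
import re

# A word fragment, a hyphen at the end of the line, then the rest of the word.
LINE_BREAK_HYPHEN = re.compile(r"(\w+)-\n\s*(\w+)")


def rejoin(text, known_words):
    """Rejoin words broken across lines, keeping hyphens that look genuine."""
    def repl(match):
        first, second = match.group(1), match.group(2)
        joined = first + second
        if joined.lower() in known_words:
            return joined               # the hyphen was only a line-break hyphen
        return first + "-" + second     # keep the hyphen, as in "re-create"
    return LINE_BREAK_HYPHEN.sub(repl, text)


page = "an attempt to re-\ncreate the catalog with new meta-\ndata records"
known_words = {"metadata"}

print(rejoin(page, known_words))
# an attempt to re-create the catalog with new metadata records
```

Even this small step depends on a good dictionary, and it does nothing for the garbled title pages, which is rather the point: search can tolerate such noise, but a backup cannot.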

Monday, October 23, 2006

Internet Filters and Strange Bedfellows

In the legal battle against the Children's Internet Protection Act (CIPA), the government's position was to mandate filters on library computers as a way to protect children. The ALA and the ACLU argued that such filters were unconstitutional as they blocked speech protected by the first amendment, but also that the filters were ineffective to the purpose intended, letting some "inappropriate" material through. Judith Krug testified, saying: "Even the filtering manufacturers admit it is impossible to block all undesirable material." The government, of course, argued for filters.

Now, the Child Online Protection Act (COPA) from 1998 is going to court. This law requires that Internet sites that carry material that may be harmful to children use some method (such as requiring a credit card number) to prevent children from accessing the material, or face criminal charges if children access their site. In this case, the ACLU is expected to argue that filters are a better way to prevent children from seeing the offending material (having evolved since CIPA days), and the government will argue that filters are ineffective because a fair amount of pornography slips through them.

*sigh* It's the absurdity of it all that gets to me. That and my paranoid fear that it's all a plot to engage our limited resources while our rights erode on so many fronts.

Thursday, October 05, 2006

Hiatus

For the next two weeks I will be in Venice, Italy, where I intend to contemplate the thing called the "book" and other wonders of the world. I am readying some posts that I may not be able to complete before leaving, one on DRM in particular.

In the meanwhile, if you are interested in the future of the library catalog and the related future of the MARC record, I invite you to add your thoughts, in draft form, to the futurelib wiki. The password is dewey76. You can add new pages, or add information to the pages that are there. Many pages have sections labeled "cooked" and "raw." The raw sections are places where you can put any ideas, even if not well formed nor coordinated with other text there. Consider it a storage box for ideas we don't want to lose. There are also places for bibliographies, if anyone wants to fill those in. And if you work on a system that is innovative (note the small "i"), add it to the list of examples.

Thursday, September 21, 2006

Description and Access?

The revision of the library cataloging rules that is underway is being called "Resource Description and Access" or RDA. Although it is undoubtedly an unpopular viewpoint, I would like to suggest that description and access are two very different functions and that they should not be covered by a single set of rules, nor should they necessarily be performed by a single metadata record.

The pairing of description and access is functionality based on card catalog technology. A main purpose of the 19th and 20th century card catalog was access. Indeed, great discussion took place in the late 19th century about the provision of a public access point for library users: access through authors, titles, and subjects. The descriptive element, the main body of the card, was essentially a bibliographic surrogate helping users make their decision on whether to go over to the shelf to look for the book. Before easy reproduction of cards, that is, before LoC began selling card sets early in the 20th century, access cards did not carry the full description of the book. Instead, all catalog cards except the main entry card had a brief entry that would allow the user to find the main entry card which had the full description.

The combination of description and access is a habit that has carried over from the card catalog and has left a legacy that to many of us is so natural that we have trouble seeing it for what it is. For example, it is because of this combination that we create artificial "headings" that cause us to display author names in the famed "last name first" order. The heading is designed for access in a system where the means of finding items is through a linear alphabetical order, which even in library systems is no longer the predominant finding method. These headings set library systems apart from popular information systems such as Amazon, Barnes & Noble, Google Books. As a matter of fact, you can find examples of library catalogs that attempt a popular display by displaying the title and statement of responsibility as the main display, hiding the now odd-looking headings. What these headings say to anyone who is tech-savvy is that libraries are hindered by an obsolete technology. Libraries still create these contorted headings when markup of data can make display and ordering of data flexible and user friendly.
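To make the point about markup concrete, here is a minimal sketch in Python, assuming a name is stored as separate family and given parts rather than as one pre-formatted heading. The class and field names are my own inventions for illustration, not part of MARC or any cataloging standard. Once the parts are marked up, the natural-order display, the "last name first" heading, and the sort order are all just different views of the same data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonalName:
    """A name kept as marked-up parts (hypothetical structure)."""
    family: str
    given: str

    def display(self) -> str:
        # Natural order, as a bookstore site would show it.
        return f"{self.given} {self.family}"

    def heading(self) -> str:
        # Traditional card-catalog style, derived on demand.
        return f"{self.family}, {self.given}"

    def sort_key(self) -> tuple[str, str]:
        # Ordering no longer depends on how the name is displayed.
        return (self.family.casefold(), self.given.casefold())


authors = [PersonalName("Ruskin", "John"), PersonalName("Austen", "Jane")]
display_order = [a.display() for a in sorted(authors, key=PersonalName.sort_key)]
print(display_order)
```

The list is sorted by family name but shown in natural order, which is exactly what the inverted heading was invented to achieve in a linear card file.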

Not only does our use of arcane headings set libraries apart from more popular information resources, our concepts of "description" and "access" are not serving our users. The description provided by libraries might serve to identify the work bibliographically (something that matters to libraries for collection development purposes but is not of great interest to library users), but it doesn't describe the work to users in a way that can help them make a selection. We need at least reviews, thumbnails of images, sample chapters, and even local commentary ("Required reading for Professor Smith's class in European History"). And as for access, we know that the library-assigned subject headings are woefully inadequate discovery tools.

RDA claims that its purpose in the area of description "should enable the user to: a) identify the resource described...." Yet today we are in dire need of machine-to-machine identification, which RDA does not address. Increasingly our catalogs are interacting with other sources of discovery, such as web sites, search engines, and courseware. "Identification" that must be interpreted by a human being is going to be less and less useful as we go forward into an increasingly digital and networked information environment.

We are also greatly in need of an ability to share our data with systems that are not based on library cataloging. Each rule that varies from what would be common practice moves libraries further from the information world that our users occupy in their daily life. It is somewhat ironic that many pages of rules instruct catalogers on the choice of the "title proper," which is then marred by the addition of the statement of responsibility, a bit of library arcana that no one else considers to be part of the title of the work. And who else would create a title heading "I [heart symbol] New York"?

All this to say that the next generation library catalog cannot succeed if it is to be based on a set of rules that still carry artifacts from the days of the physical card catalog. It's time to get over the concepts of description and access that were developed in the 19th century. Let's move on, for goodness sake.

Wednesday, September 13, 2006

WiFi and the children

CNET has an article about some concerns arising around ubiquitous WiFi and children's access to the Internet. There's no mention of libraries or what they went through, but it will be interesting to see if cities that are providing open WiFi will face some of the challenges that libraries did. And if not, why not? Isn't ubiquitous, open WiFi Internet access the "worst case scenario" for those who so hotly opposed open Internet access in libraries?

I see that some libraries that offer WiFi are only allowing WiFi access to "older" children. Loudoun County, VA, has a rule: "Patrons age 17 and under must have a parent or guardian sign the necessary forms." Others, like Lansing Public Library, are making their WiFi entirely open: "Our wireless Internet access is open to patrons of all ages; parents or guardians of children under the age of 18 are responsible for supervising and guaranteeing their child's proper and safe use of the Internet." This brings up all kinds of interesting questions in my mind about the differences between accessing the Internet via the library's computer stations and accessing it from your own laptop. Is the obligation to protect children related to who provides the hardware? Are schools providing wireless, and if so are they using filters?

Thursday, September 07, 2006

Google Books and Federal Documents

The Google Books blog today announces with some fanfare that Diane Publications, a publishing house that specializes in (re)publishing Federal documents, is making all of its documents available for full viewing. The publisher states:
The free flow of government information to a democratic society is utmost in our mind.
So I did a publisher search on Google and found a publication called "Marijuana Use in America" -- which is a reprint of a 104th Congress hearing, and on each page there is a watermark that says:
Copyrighted material

Now you all know that this is wrong, because Federal documents are in the public domain, but nowhere does the Google blog or the publisher mention the "PD" word. This troubles me because it will now require effort to undo this misinformation.

And, of course, just to add salt to the wound, I was easily able to find a book that the Diane Publishing company sells for $30 that you can get from GPO for $2.95. This really hurts.

Tuesday, September 05, 2006

MARC - We Can Do It!

It seems pretty clear that there is interest in exploring the mur... uh, morphing of MARC. There are now some active email discussions (on the MARC list and on the ngc4lib list). However, neither the blog format nor email discussion is going to move us forward very well. We need a collaborative workplace where we can all add to lists of requirements, where we can share "prior art" links, and where we can mock-up solutions. I cannot easily run a wiki on my web site, so this is a call for a donation of a well-run sandbox where we can take these ideas a bit further, or information as to where one can do this.

Meanwhile, I'll gather up comments from the various discussions and post a summary.

Thank you all!

Friday, September 01, 2006

Murdering MARC

It's been almost four years now since Roy Tennant's rallying cry of "MARC Must Die" and little has been done to further that goal. It seems pretty clear that the MARC format will not expire of its own accord, so it may be time to contemplate murder. (I'm not usually given to violent actions. Perhaps I've been reading a bit too much medieval history of late.)

There's understandably a great reluctance to tackle a change that will have such a wide-ranging effect on our everyday library operations. However, like all large tasks, it becomes more manageable when it has been analyzed into a number of smaller tasks, and I'm convinced that if we put our minds to it we can move on to a bibliographic record format that meets our modern needs. I'm also convinced that we can transition systems from the current MARC format to its successor without having to undergo a painful revolution.

The alternative to change is that library systems will cobble on kludge after kludge in their attempts to provide services that MARC does not support. It will be very costly for us to interact with non-library services and we'll continue to be viewed as adhering to antiquated technology. Since I don't like this alternative, I propose that we begin the "Death to MARC" process ASAP. It should start with a few analysis tasks, some of which are underway:

a) Analysis of the data elements in the MARC record. I have done some work on this, although as yet unpublished. But I will share here some of the data I have gathered over the past few years.
b) Analysis of how the MARC record has actually been used. This is underway at the University of North Texas where Bill Moen and colleagues are studying millions of OCLC records to discover the frequency with which data elements are actually used. This data is important because we absolutely will have to be able to transition our current bibliographic data to the new format or formats. (Yes, I said "formats.") Another aspect of this would be to investigate how library systems have made use of the data elements in the MARC record, with the hope of identifying ones that may not be needed in the future, or whose demise would have a minimum impact.
c) A functional analysis of library systems. There's a discussion taking place on a list called ngc4lib about the "Next Generation Catalog" for libraries. In that discussion it quickly becomes clear that the catalog is no longer (if it ever was) a discrete system, and our "bibliographic record" is really serving a much broader role than providing an inventory of library holdings. This was the bee in my bonnet when I wrote my piece for the Library Hi Tech special issue on the future of MARC. I'm not sure I could still stand by everything I said in that article, but the issues that were bugging me then are bugging me still now. If we don't understand the functions we need to serve, our basic bibliographic record will not further our goals.
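The kind of usage analysis described in (b) can be sketched in a few lines of Python. This is only an illustration of the idea, not the method used in the North Texas study: the records here are invented and reduced to bare lists of MARC field tags, and the function name is hypothetical. It counts how many records contain each tag at least once.

```python
from collections import Counter


def tag_frequency(records):
    """Count, for each field tag, the number of records that use it."""
    counts = Counter()
    for record in records:
        # set() so a repeated field (e.g. two 650s) counts once per record
        counts.update(set(record))
    return counts


# Invented sample data: each record is just its list of field tags.
sample_records = [
    ["100", "245", "260", "650"],
    ["245", "260", "650", "650"],
    ["100", "245", "300"],
]

freq = tag_frequency(sample_records)
for tag, n in sorted(freq.items()):
    print(tag, n, f"{n / len(sample_records):.0%}")
```

Run over millions of real records, a table like this would show which data elements a successor format has to carry forward and which are rare enough to be candidates for retirement.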

If there's interest in this topic, perhaps we can get some discussion going that will lead to action. I'm all ears (in the immortal words of Ross Perot).

Tuesday, August 29, 2006

The dotted line

The University of California has released its agreement with Google ("uc:" in quotes below). Because UC is a public institution, all such contracts must be made publicly available on request. We similarly have access to the University of Michigan agreement ("um:" in quotes below), which gives us the ability to do some comparison.

As is always the case, the language of the contract does not entirely reveal the intentions of the Parties. It is instead a strange almost-Shakespearean courtship where neither Party wishes to say what they really want, and everyone pretends like it's all so wonderfully fine, while at the same time each player is hoping to pull the wool over the eyes of the other. So my interpretation here may reveal more about my own assumptions than any truth about the contracts.

[Note: I apologize for any typos, but I had to transcribe much of this from the PDF files, which did not allow for text copy. If you see errors, let me know and I'll fix them.]

Quality Control

The Michigan contract gives the library the quality control review function, and allows the library to actually hold up the digitizing process if its quality requirements are not met:
um: 2.4 Digitizing the Selected Content. Google will be responsible for Digitizing the Selected Content. Subject to handling constraints or procedures specified in the Project Plan, Google shall at its sole discretion determine how best to Digitize the Selected Content, so long as the resulting digital files meet the benchmarking guidelines agreed to by Google and U of M, and the U of M Digital Copy can be provided to U of M in a format agreed to by Google and U of M. U of M will engage in ongoing review (through sampling) of the resulting digital files, and shall inform Google of files that do not meet benchmarking guidelines or do not comply with the agreed-upon format. U of M may stop new work until this failure can be rectified.
Perhaps Google has learned its lesson about trying to meet the standards of libraries, because UC's contract is notably silent on the QC topic. The paragraph that begins ...
uc: 2.4 Digitizing the Selected Content. Subject to handling constraints or procedures specified in the Project Plan, Google shall in its sole discretion determine how best to Digitize the Selected Content.
... then goes on to talk about Google's responsibility in taking care of UC's books, and its promise to replace any that get damaged. Nothing more about the digital files that result.

[Note: Peter Brantley points out section 4.7.1, which refers to image standards in line with those established as a community of library partners, and lets UC QA up to 250 books a month to "assess quality." However, there is no stated recourse so UC and Google are relying on each other's good intentions here.]

What the Libraries Get

In this case, UC seems to have learned from the past experience of others, and negotiated for more from Google:
um: 2.5 ... the U of M Digital Copy will consist of a set of image and OCR files and associated information indicating at a minimum (1) bibliographic information consisting of the title and author of each Digitized work, (2) which image files correspond to that Digitized work, and (3) the logical order of these image files.
uc: 4.7 University Digital Copy. Unless otherwise agreed by the Parties in writing, the "University Digital Copy" means the digital copy of the selected content that is Digitized by Google consisting of (a) a set of image and OCR files, (b) associated meta-information about the files including bibliographic information consisting of title and author of each Digitized work and technical information consisting of the date of scanning the work, information about which image files correspond to what digitized work, and information pertaining to the logical order of image files that make up a Digitized work, (c) a list of works that are supplied for Digitization but not actually Digitized, and (d) the image coordinates for each Digitized Work ("Image Coordinates"); provided that Image Coordinates will only be provided (i) so long as University complies with the volume commitments set forth in Section 2.2 and (ii) pursuant to the restrictions on University's use and distribution of such Image Coordinates set forth in Section 4.10.
The "Image Coordinates" are what make it possible to locate a word in a page image, for example for the purpose of highlighting the query word on the screen. Michigan didn't get these coordinates with its files, and possibly the other four original Google library partners did not, either. We'll look at Section 4.10 and its limitations in a moment, but the "volume commitments" in 2.2 say that:
uc: University will use reasonable efforts to provide or provide Google with access to no less than three thousand (3,000) books (or such amount that is mutually agreed to by the Parties) of Selected Content per day to Digitize commencing on the sixty-first (61st) day after the Effective Date...
So it sounds like the University really, really, really wanted the coordinates and Google really, really, really wanted to make sure that the University would not drag its heels in terms of providing the books to Google. So these two inherently unrelated desires became bargaining chips.
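For readers wondering why the coordinates matter so much, here is a rough sketch in Python of what they make possible. The contract does not describe the actual format of Google's coordinate data, so the layout below (one box per OCR word, with pixel position and size) and all of the names are assumptions for illustration only.

```python
from typing import NamedTuple


class WordBox(NamedTuple):
    """One OCR word and its rectangle on the page image (assumed layout)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def boxes_for_query(page_words, query):
    """Return the boxes whose word matches the query, ignoring case
    and surrounding punctuation."""
    wanted = query.casefold()
    return [
        w for w in page_words
        if w.text.strip(".,;:!?\"'").casefold() == wanted
    ]


# Invented coordinates for one line of a page image.
page = [
    WordBox("The", 40, 100, 30, 12),
    WordBox("Stones", 75, 100, 52, 12),
    WordBox("of", 132, 100, 14, 12),
    WordBox("Venice.", 150, 100, 58, 12),
]

hits = boxes_for_query(page, "venice")
print(hits)
```

A viewer would draw a highlight over each returned rectangle. Without the coordinates, a library holding only page images and plain OCR text can tell you that a word is on the page but not where.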

Using the Files

Both contracts state that Google owns the Digital Copy, and make it clear that neither Google nor the library is claiming any ownership of the underlying texts that have been digitized. This seems to be in keeping with US copyright law, although there is the inherent difficulty that occurs when you display a digital version of a public domain resource on a screen. At that point, any controls desired by the owner of the digital file are hard to enforce. In the Michigan contract, Google basically states that the university will not allow wholesale downloading of the files, and will attempt to prevent any downloading for commercial purposes (as if they could tell that from a download action):
um: 4.4.1 Use of U of M Digital Copy on U of M Website. U of M shall have the right to use the U of M Digital Copy, in whole or in part at U of M's sole discretion, as part of services offered on U of M's website. U of M shall implement technological measures (e.g., through use of the robots.txt protocol) to restrict automated access to any portion of the U of M Digital Copy or the portions of the U of M website on which any portion of the U of M Digital Copy is available. U of M shall also make reasonable efforts (including but not limited to restrictions placed in Terms of Use for the U of M website) to prevent third parties from (a) downloading or otherwise obtaining any portion of the U of M Digital Copy for commercial purposes, (b) redistributing any portions of the U of M Digital Copy, or (c) automated and systematic downloading from its website image files from the U of M Digital Copy. U of M shall restrict access to the U of M Digital Copy to those persons having a need to access such materials and shall also cooperate in good faith with Google to mutually develop methods and systems for ensuring that the substantial portions of the U of M Digital Copy are not downloaded from the services offered on U of M's website or otherwise disseminated to the public at large.

There are a number of interesting items in the above paragraph. First, that a robots.txt file is considered a "technological measure." In fact, it is at best a gentleman's agreement; there is nothing that forces you to abide by the instructions in the robots.txt file so the less gentlemanly are not stopped from accessing the items that it calls "disallowed." It also is theoretically only a message center for web crawlers, and not a way to limit access to sections of one's web site. More robust technology must be employed for that. The next is that access is to be restricted to "those persons having a need to access such materials" which is about the vaguest access condition that I can imagine. How could any of us show that we have such a need in relation to an information resource? Well, if nothing else, that language doesn't appear in the UC contract, so maybe they've re-thought that particular requirement.
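The "gentleman's agreement" nature of robots.txt is easy to see in code. The sketch below uses Python's standard urllib.robotparser module; the rules, the crawler name, and the library.example.edu URLs are made up for illustration. Notice that the check happens entirely on the crawler's side: the crawler asks the rules whether it may fetch a URL, and nothing on the server stops a crawler that never asks.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content a library might publish.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /digitized/",
])


def polite_crawler_may_fetch(url, agent="ExampleBot"):
    """A well-behaved crawler consults the rules before requesting a URL.
    This is voluntary; an ill-behaved crawler simply skips this call."""
    return rules.can_fetch(agent, url)


blocked = polite_crawler_may_fetch("https://library.example.edu/digitized/book1.pdf")
allowed = polite_crawler_may_fetch("https://library.example.edu/catalog/")
print(blocked, allowed)
```

Actually restricting access takes something enforced by the server, such as a login requirement, which is a different kind of measure altogether.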

Next, the contract allows Michigan to use the digital files to provide services to its consortial partners, but basically leaves it up to Michigan to get the proper agreements from those libraries:
um: 4.4.2 Use of U of M Digital Copy in Cooperative Web Services. Subject to the restrictions set forth in this section, U of M shall have the right to use the U of M Digital Copy, in whole or in part at U of M's sole discretion, as part of services offered in cooperation with partner research libraries such as the institutions in the Digital Library Federation. Before making any such distribution, U of M shall enter into a written agreement with the partner research library and shall provide a copy of such agreement to Google, which agreement shall: (a) contain limitations on the partner research library's use of the materials that correspond to and are at least as restrictive as the limitations placed on U of M's use of the U of M Digital Copy in section 4.4.1; and (b) shall expressly name Google as a third party beneficiary of that agreement, including the ability for Google to enforce the restrictions against the partner research library.
Another possible learning experience, or perhaps just a result of the particular negotiations between UC and Google, but the UC/Google contract is much more specific about uses of the files and the agreements that are required for UC to exchange part or all of the files with other parties. It does limit access to the digital files to UC Library patrons, which means that these will probably be treated similarly to licensed resources in the Library, which require a user ID login for access. Although the UC contract also contains the "robots.txt" language, it also contains some stronger wording about creating technological protection measures for the files:
uc: 4.9 Use of University Digital Copy. University shall have the right to use the University Digital Copy, in whole or in part at University's sole discretion, subject to copyright law, as part of services offered to the University Library Patrons. University may not charge, receive payment or other consideration for the use of the University Digital Copy except that University may charge for use of any services supplemental to the original work that the University supplies that add value to the University Digital Copy (for example, University may charge University Library Patrons for access to annotations to works from professors and scholars but the original work will always be accessible without a fee), and to recover copying costs actually incurred. University agrees that to the extent it makes any portion of the University Digital Copy publicly available, that it will identify the works, in a statement on a web page or other access point to be mutually agreed to by the Parties, as "Digitized by Google" or in a substantially similar manner. University shall implement technological measures (e.g., through use of the robots.txt protocol) to restrict automated access to any portion of the University Digital Copy, or the portions of the University website on which any portion of the University Digital Copy is available. University shall also prevent third parties from (a) downloading or otherwise obtaining any portion of the University Digital Copy for commercial purposes, (b) redistributing any portions of the University Digital Copy, or (c) automated and systematic downloading from its website image files from the University Digital Copy. University shall develop methods and systems for ensuring that substantial portions of the University Digital Copy are not downloaded from the services offered on University's website or otherwise disseminated to the public at large. 
University shall also implement security and handling procedures for the University Digital Copy which procedures shall be mutually agreed by the Parties. Except as expressly allowed herein, University will not share, provide, license, or sell the University Digital Copy to any third party.

The image coordinates, which UC seems to have "won" as a special deal, cannot be shared at all:
4.10: (a) University shall not share, provide, license, distribute or sell the Image Coordinates to any entity in any manner. University may use the Image Coordinates only as part of the University Digital Copy for the services provided to University Library Patrons set forth in Section 4.9 above.
What is particularly odd below in 4.10 (b) is that Google states that UC can distribute no more than 10% of the digital copy (which is, by definition, owned by Google), but it can distribute 100% of the digital copies of public domain works. I can imagine that UC insisted on this, but it seems to contradict the distinction that Google is making between the rights in the digital files created by Google and the rights in the underlying works.
(b) Subject to the restrictions contained herein, University shall have the right to distribute (1) no more than ten percent (10%) of the University Digital Copy (but not any portion of the Image Coordinates) to (i) other libraries and (ii) educational institutions, in each case for non-commercial research, scholarly or academic purposes and (2) all or any portion of public domain works contained in the University Digital Copy (but not any portion of the Image Coordinates) to research libraries for research, scholarly and academic purposes by those libraries and the faculty, students, scholars and staff authorized by said libraries to access their commercially licensed electronic information products. Any recipient of the University Digital Copy under this Section 4.10 is referred to herein as a "Recipient Institution." Prior to any distribution by University to a Recipient Institution, Google and the Recipient Institution must have entered into a written agreement on terms acceptable to Google governing the use of the University Digital Copy and that, among other things, provide an indemnity to Google. 
In addition, any distribution by University to a Recipient Institution is subject to a written agreement that (A) prohibits that Recipient Institution from redistributing without first obtaining the prior written consent of Google, (B) makes Google an express third party beneficiary of such agreement, (C) provides an indemnity to Google from the Recipient Institution for the Recipient Institution's use of the Selected Content, (D) contains limitations at least as restrictive as the restrictions on University set forth in Section 4.9, (E) contains limitations on the use of the University Digital Copy consistent with copyright law and the limitations set forth in clauses (1) and (2) above, and (E) requires each Recipient Institution, to the extent it makes any portion of the University Digital Copy publicly available, to identify the works, in a statement on the applicable web page or other access point, as "digitized by Google" or in a substantially similar manner.
Here I notice especially "(E) contains limitations on the use of the University Digital Copy consistent with copyright law" and I'm wondering to what this refers. It seems it either means that Google is asserting some intellectual property rights in the digital copies, or that they are reminding the University that it cannot re-distribute the digital copies beyond that allowed by fair use. Since the latter is a given, and not a matter of contract, it would appear that the first interpretation is correct. Yet I don't see a clear statement of Google's IP rights in the contract.

My final comment has to do with the fact that the licenses are for limited times. Michigan's extends until 2009, and UC's is stated as being for six years from the signing. Someone with more expertise in contract law will need to help me understand what this means for the restrictions given above. This may be clearer through a reading of the contracts, and I encourage anyone with the necessary skills to read them and let the rest of us know what some of this language means. Naturally, our concerns are about ownership and use, and getting a fair shake for library users.

Saturday, August 26, 2006

WebDewey: Keeping Users Uninformed

I was looking for a Dewey Decimal number to accompany a topic in an article I'm working on, and learned, although perhaps I should have known, that DDC is not available for open access. I wandered around OCLC's Dewey site, and came across the license that controls use of the WebDewey product. Some aspects of it surprised me.

The first was the definition of "Subscriber" in the WebDewey contract: "Subscriber means a library or not-for-profit information agency..." So does this mean that a corporate library cannot get a license to use the DDC? Or is it just that they must work only with the hard copy? What would be the purpose of limiting use to non-profits?

The next is from the grant of license. First, you are granted a license to use WebDewey to create bibliographic records, but "Such bibliographic records and metadata may display DDC numbers, but shall not display DDC captions." This basically eliminates the possibility of creating a rich classified display for a library. I find that it isn't enough to browse the shelf viewing only the classification number and the book titles, since the book titles alone do not reveal what the classification number means. I'd love to have a virtual shelflist that lets me know where I am, topically, and then shows me the titles in that area. But, no, you are not allowed to do that with the Dewey Classification... at least not unless you limit your display to "the DDC22 summaries," that is the first three digits of the Dewey classification number. Since modern topics have necessitated a great lengthening of the Dewey numbers (such as: Disaster relief efforts for earthquakes are classed in 363.3495095982, according to the Dewey Blog), being limited to the three digit topics is nearly useless.
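To show how much is lost, here is a tiny Python sketch that reduces a full Dewey number to the three-digit level that, as I read the license, is all a display may caption. The function name is my own, and the sketch deliberately includes no caption text.

```python
def ddc_summary(number: str) -> str:
    """Truncate a Dewey number to its first three digits."""
    return number.strip().split(".", 1)[0][:3]


full = "363.3495095982"
print(ddc_summary(full))
```

Every one of the ten digits after the decimal point in that earthquake-relief number, which is where the specific topic lives, drops out of the captioned display.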

I realize that the DDC is a business, but the business of libraries is to inform, to help users find what they need, not to obscure our shelf order. Sheesh!

Friday, August 25, 2006

Do it yourself digital books

This user got tired of waiting for his book to appear on the Internet Archive so he just did it! Perhaps the Million Book Project just needs one million users like this:

Reviewer: papeters - 4 out of 5 stars - April 10, 2004
Subject: Good copy for PG

Tired of waiting for corrections, I got another copy of the book and made good scans. The book is available through PG at:

http://www.gutenberg.net/1/1/9/2/11926/11926-h/11926-h.htm (html)
or
http://www.gutenberg.net/1/1/9/2/11926/11926.txt (plain-text)

See this at http://www.archive.org/details/WashingtonInDomesticLife