Tuesday, February 27, 2007

Ebooks in XML - The IDPF/OEB standards

I received a notice today about a conference being held by O'Reilly on digital publishing. The conference has some tutorial sessions on using XML to create digital books, but I fear that these will not include the work being done to create an XML standard for ebooks. This is once again proof that the East coast (where "traditional" publishing takes place) and the West coast (where technology happens) are very far apart.

The International Digital Publishing Forum (IDPF - once known as the Open eBook Forum) has recently announced a beta version of its e-book coding standard. I've been watching, and sometimes participating in, this group for a while, and I really think they deserve our attention and support.

To begin with, the IDPF publication structure standard (still termed OEBPS - Open Ebook Publication Structure) is designed to be used by publishers in the preparation of files that will be sent to the technology companies that transform the raw files into actual ebooks. As you know, there are dozens of e-book formats (PDF, Microsoft Reader, Mobipocket, Palm Reader, etc.). The publishers need to create a single file that can be transformed into all of those formats, and the OEBPS standard is designed to meet that need. It is also designed to be an ebook format in its own right, and the upcoming Adobe ebook reader, "Adobe Digital Editions," based on Adobe's Flash technology, will be able to display books in the OEBPS format.

The standard will seem overly simple to many people. It is that way on purpose. The original OEBPS standard used HTML, based on the assumption that even the publishers, who are notoriously lacking in technology chops, would have someone on board who knows HTML. The second version of the standard, the one out for comment, uses XHTML and CSS. I think this is brilliant. It means that 1) anyone can create a book and 2) anyone can display it, even in a simple browser. The KISS principle is essential for industry acceptance of the standard.

Another key thing to mention is that the OEBPS has been greatly influenced by members of the accessibility community who participate in the IDPF. The Digital Talking Book standard, which was first developed by the DAISY consortium and is now NISO standard Z39.86, uses an earlier version of the OEBPS as its book structure. This is the format that allows synchronization between a text and a reading of the text by a human reader, making it ideal for sighted and non-sighted readers alike (read it in bed, then continue listening in the car).

There is a DTD for the publication structure, although I am currently unable to get it to validate and behave. I have a question out to the authors of the DTD and will post here when I get an answer. Meanwhile, you can comment out the offending entity definition and play with the DTD.

A companion to the ebook standard is the Open Packaging Format (OPF). The easiest way to understand this is to take a look at it. Download Thoughts.epub.
Now open it in WinZip -- yes, it is a simple zipped file. In it you can find the raw XHTML of the publication; an OPF file that is the manifest for the package, and contains Dublin Core metadata for the item; a file that contains the mimetype; any images or other files that are required by the document; and an XML document that defines the overall container. Note that this is a very simple publication. The examples in the documentation show how you would create a document with multiple chapters, cover art, and illustrations. The documentation also covers the areas of encryption and keys, for files that will be transmitted in protected formats. There is a nifty tutorial that steps you through the creation of an OCF file using WinZip.
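
If you would rather poke at the package with a script than with a zip utility, here is a minimal sketch in Python. It is not taken from the IDPF documentation. The container file name (META-INF/container.xml) and its namespace come from my reading of the container specification, and the toy archive, including the names content.opf and thoughts.xhtml, is invented so that the example runs without downloading anything.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the container document (assumption: per the OCF spec).
CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"

# Invented container document pointing at an invented OPF file name.
CONTAINER_XML = (
    '<container version="1.0" xmlns="%s">'
    '<rootfiles>'
    '<rootfile full-path="content.opf" '
    'media-type="application/oebps-package+xml"/>'
    '</rootfiles>'
    '</container>' % CONTAINER_NS
)


def build_toy_epub():
    """Build a tiny, hypothetical package in memory for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("content.opf", "<package/>")
        zf.writestr("thoughts.xhtml", "<html/>")
    buf.seek(0)
    return buf


def describe_epub(source):
    """Return (list of member names, path of the OPF file) for a package."""
    with zipfile.ZipFile(source) as zf:
        names = zf.namelist()
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find(".//{%s}rootfile" % CONTAINER_NS)
        opf_path = rootfile.get("full-path") if rootfile is not None else None
    return names, opf_path


names, opf_path = describe_epub(build_toy_epub())
for name in names:
    print(name)
print("OPF manifest:", opf_path)
```

To try it on a real file, pass the path of the downloaded package to describe_epub instead of the toy archive.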

If you have comments, suggestions, questions, or whatever, go to the discussion area of the IDPF web site and say your piece. And let me know if you have any thoughts on these standards, especially as to how they might be applicable to digital libraries.

Wednesday, February 21, 2007

FRBR User Tasks

I have long had a hard time with the FRBR user tasks, but haven't been able to quite articulate it. This is an attempt to do so.

The steps defined by FRBR are: find, identify, select, obtain. To me this describes the actions of a user approaching a library catalog with a known item in mind, and the way the tasks are defined confirms this to me.
  • to find entities that correspond to the user's search criteria
All actions begin with a search, and the search is initiated by a user. This eliminates a number of ways that catalogs can be used to link to other resources, or to analyze relationships, and a host of other functions that don't involve a user (including FRBR-izing, ironically enough). It also doesn't seem to take into account browsing, serendipity, or other ways that users encounter resources. All of the steps prior to "find" are ignored, as if the catalog doesn't intend to address them.

  • to identify an entity (i.e., to confirm that the entity described corresponds to the entity sought, or to distinguish between two or more entities with similar characteristics)
The trouble I have with this one is that there is an "entity sought." This assumes that the user has some idea of what they will find when they do the search. I don't think that's always the case. The classic user task as defined by Cutter is: "What does my library have on this topic?" To me, that general question doesn't necessarily have a "sought entity" as its goal.

This also seems to skip a step, which is the evaluation of the search results. For someone who is exploring a topic rather than looking for a specific item, the retrieved set itself needs analysis. It could have items that are incongruous as a set (Java the country and Java the programming language). It could have both highly popular and totally obscure works. It could have items suitable for the user (a current edition of Mark Twain's Huckleberry Finn) or quite unsuitable (a translation of Huckleberry Finn into Latvian). The user may need to see the set analyzed into facets (geography, programming languages), or may need a ranked display showing which items have been selected most often by other users, or which are most commonly held in similar libraries. For some users, there will not be an entity to identify, and the retrieved set will often be too large for the user to look at each one individually.
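
The kind of set analysis described above can be sketched in a few lines of Python. This is only an illustration, not a description of how any catalog actually works: the record layout, the field names "topic" and "times_selected", and every value in the sample set are invented.

```python
from collections import Counter, defaultdict


def facet_counts(records, facet_fields):
    """Count how many retrieved records fall under each value of each facet."""
    counts = defaultdict(Counter)
    for record in records:
        for name in facet_fields:
            value = record.get(name)
            if value is not None:
                counts[name][value] += 1
    return {name: dict(counts[name]) for name in facet_fields}


def rank_by_selection(records):
    """Order records by how often other users selected them, highest first."""
    return sorted(records, key=lambda r: r.get("times_selected", 0), reverse=True)


# Hypothetical retrieved set for the search "java"; all values are invented.
retrieved = [
    {"title": "Record A", "topic": "geography", "times_selected": 12},
    {"title": "Record B", "topic": "programming languages", "times_selected": 87},
    {"title": "Record C", "topic": "programming languages", "times_selected": 40},
]

facets = facet_counts(retrieved, ["topic"])
ranked = rank_by_selection(retrieved)
print(facets)
print([r["title"] for r in ranked])
```

The point is that both views operate on the retrieved set as a whole, before the user has any single entity to "identify."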

  • to select an entity that is appropriate to the user's needs (i.e., to choose an entity that meets the user's requirements with respect to content, physical format, etc....)
This is the librarian's dream user -- one who knows what he needs and can interpret the entry in the catalog to that end. In fact, I suspect that most selection has an element of randomness, with the exception, perhaps, of the selection on format or availability. The user may be looking for a movie, or may specifically NOT be looking for a sound recording, but I wouldn't want my salary based on how many times a user has made a selection from the catalog based on a note saying "Includes bibliographical references (p. [403]-449) and index." Also, in an environment where many items are available electronically, selection may follow "obtain," as it does in bookstores or in the open shelves of libraries. It is interesting that the FRBR user tasks ignore the interaction between the catalog and the shelves (analog or virtual). We know that users may look in the catalog under subjects to find a general shelf area to browse in; they may browse the shelf and then turn to the catalog. In fact, the FRBR user tasks have no iteration at all, whereas many of us can attest based on our own experience that many searches may be performed before the user begins to look at specific items in retrieved sets.

  • to acquire or obtain access to the entity described (i.e. to acquire an entity through purchase, loan, etc., or to access an entity electronically...)
I suppose we can ignore the person who is compiling a bibliography and will return at some point to actually access some works, or the one who was looking up a book to answer a question in a crossword puzzle (one of my main uses of catalogs, I admit). What is interesting here is the implication ("through purchase, loan, etc.") that the library catalog leads users to a wide variety of resources beyond the library itself. Yet, where library catalogs show their weakness is in limiting themselves to the works held (or licensed) by the library. There was a time when that was sufficient, since books and other works were scarce. That time ended around 1900, however, with a robust publishing industry. Now, with a highly networked electronic environment, scarcity is definitely in the past. A catalog of what is held by the library is useful to the person in the library who wants to go to the shelf and pick something up right now, but it is no more an information system than the inventory database at the local Barnes & Noble store. Users may discover that something is available through interlibrary loan precisely because it is NOT in the library's catalog, but only if they have gone to the library with a specific citation in hand. In fact, this entire FRBR task set ignores that the library catalog is but one information source in a huge sea of information sources. A catalog that represents the information resources in a single library is an anachronism in today's global information society. Looking for a particular book on a particular shelf is the very tail end of a complex process, a process that libraries seem to have stepped away from.

What it all comes down to is that the FRBR user tasks are limited in scope, and as such they limit how we think about users and catalogs. I'm thinking about an expanded set of user tasks (and even non-user tasks), and invite suggestions, both here and at http://futurelib.pbwiki.com.

Monday, February 19, 2007

DRM and Patents

Most people associate Digital Rights Management (DRM) with music or movie companies. The motivation for DRM supposedly comes from the anti-piracy efforts of the RIAA or MPAA. In this world view, digital resources are locked up by greedy companies wanting to make maximum profits from the creative efforts of (generally) exploited artists.

That's one view. Another is that DRM exists because it can make big bucks for the technology companies that own the DRM technology. Evidence for this view can be found in the patent wars that are taking place around DRM and its components.

The story is long and complex, but suffice it to say that Microsoft is one of the larger holders of patents in the area of DRM. However, their attempts to monopolize the DRM market have so far been stymied. One of the more glaring incidents relates to the patent they purchased when they bought into Xerox's ContentGuard, the company that developed XrML (which later became the rights language portion of the ISO standard based on MPEG-21) and the holder of 49 US patents plus numerous others in Europe and Japan. In particular, they claim to hold the patent on rights expression languages, such that any rights management scheme that uses a rights expression language must pay them fees for the use of their technology. This includes any use of the rights expression technology that is described in what is now the international standard, ISO 21000. Basically, use the standard, pay a fee.

To that end, Microsoft and other holders of related patents (notably Intertrust, which claims to hold 61 US patents in the area of "trusted computing") formed the MPEG-LA, that is the MPEG licensing agency. It describes itself as "... the world leader in one-stop technology platform patent licenses, enabling users to acquire patent rights necessary for a particular technology standard or platform from multiple patent holders in a single transaction." The logo on its web site has the words "fair, reasonable, non-discriminatory access to fundamental technologies," and "one-stop technology standards licensing."

Once the ISO standard 21000 was passed, MPEG-LA announced that it would now provide licensing for any patents relating to DRM. Shortly thereafter the companies providing content primarily via cell phone (and mostly in Europe), the Open Mobile Alliance (OMA), received word that MPEG-LA had developed a license for them. The members of OMA were not using the new ISO standard. They were using OMA's own DRM standard based on the Open Digital Rights Language (ODRL), a patent- and license-free technology. With the assertion that all rights languages are covered by ContentGuard's patent, OMA was told that it now owed $1 per device (think: number of cell phones in use x $1), plus 1% of all revenue from content-related transactions for using its own technology. Ah, the beauty of the patent system.

What followed was about two years in which the OMA members went through various stages of response, from incredulity (Huh?!), to fear (can they really do this to us?), to bargaining, to stalling, and now, perhaps, to open rebellion. MPEG-LA has offered reduced prices (US $0.65 per device and $0.25 per transaction) in a second version of the license that no one wants, as if paying a little less for something you don't want to pay for at all will make the price more palatable. For all of its confident (you might say even arrogant) claims to be the rightful holder of all things DRM, it appears that no one today is paying MPEG-LA anything at all. And now Intertrust, a major partner in the MPEG-LA, is offering its own, separate deal with content providers. And others who wish to use DRM are designing that DRM to avoid those technologies for which patent rights have been claimed, such as the use of rights expression languages. In essence, the patent demands have caused the technology development to shift, and it's too early to tell what the effects of that will be.
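
To get a feel for what the two license versions described above would cost, here is a back-of-the-envelope sketch in Python. The per-device, percentage, and per-transaction rates are the ones quoted in this post. The provider's device count, content revenue, and transaction count are entirely hypothetical, chosen only to make the arithmetic visible.

```python
from decimal import Decimal


def license_v1_fee(devices, content_revenue):
    """First license: $1 per device plus 1% of content-related revenue."""
    return Decimal("1.00") * devices + Decimal("0.01") * content_revenue


def license_v2_fee(devices, transactions):
    """Second license: $0.65 per device plus $0.25 per transaction."""
    return Decimal("0.65") * devices + Decimal("0.25") * transactions


# Hypothetical provider; none of these figures come from OMA or MPEG-LA.
devices = 1_000_000
content_revenue = Decimal("5000000")  # US dollars
transactions = 2_000_000

v1 = license_v1_fee(devices, content_revenue)
v2 = license_v2_fee(devices, transactions)
print("Version 1 fee:", v1)
print("Version 2 fee:", v2)
```

With these made-up numbers the "reduced" second version actually costs more, because a flat per-transaction charge can outweigh a percentage of revenue when individual transactions are cheap. Different assumptions would give a different result.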

Patent wars seem to be all about blustering, making demands, and acting threatening. It's a legal version of claiming to own the water rights in an arid landscape -- people downstream tend to pay you off, mainly because the cost to fight you will be as much or more than what you demand. What is interesting about this particular war is that the big bully does not seem to be winning. Of course, we still have to wait and see what will happen now that Microsoft has brought its DRM-enabled operating system to market (Vista), but it is just possible that this time the demands were so ridiculous that instead of paying up people are just walking away, shaking their heads in disbelief.

Thursday, February 08, 2007

Catalog use data - A collective effort?

In all of the discussion around the next generation catalog and what library users really want, I've had a hunger for some real catalog use data. My first instinct was to think: "Someone needs to get a big grant to gather up statistics and analyze them." Then I remembered: we're in the "let's all get together and do it ourselves" era. I also realized that I do not want or need the perfect, scientifically sound study of catalog use; I want some data, reasonably reliable, but also growing and being updated over time. I want something to think about, not something that pretends to give me all of the answers. So...

Can we (the collective we, not the royal we) create a simple way that some libraries can contribute their catalog use data to a common storehouse? In particular, can we create a way that libraries can create a profile and upload data easily?

Here's the kind of data I have in mind:

  • What search types are available? Which of these search types is a default? How often is each search type used?
  • If there is an advanced search page, how often is it used?
  • If there are sort options, what is the default and how often is each one used?
  • What is the default display, and how often is each display option used?
  • How many records are displayed on average?
  • If there are facets, how often is each facet type selected?
  • What system are you using? (Vendor, brand)
  • What type of library is it? (public, academic, special)
In terms of making it easy, there could be pre-defined sets for known systems that tend to look a lot alike, like III. There can also be a list of search types, perhaps the ones defined for Z39.50.
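
To make the idea of a profile concrete, here is one possible shape for it, sketched in Python. Everything in it is an assumption for discussion: the class name, the field names, and the sample numbers are invented, and no such shared storehouse or upload format exists yet.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict


@dataclass
class CatalogUseProfile:
    """Hypothetical record one library might contribute."""
    library_type: str                      # "public", "academic", "special"
    system_vendor: str
    system_brand: str
    default_search_type: str
    search_type_counts: Dict[str, int] = field(default_factory=dict)
    advanced_search_count: int = 0
    default_sort: str = ""
    sort_option_counts: Dict[str, int] = field(default_factory=dict)
    default_display: str = ""
    display_option_counts: Dict[str, int] = field(default_factory=dict)
    average_records_displayed: float = 0.0
    facet_counts: Dict[str, int] = field(default_factory=dict)

    def search_type_shares(self):
        """Return each search type's share of all recorded searches."""
        total = sum(self.search_type_counts.values())
        if total == 0:
            return {}
        return {name: count / total
                for name, count in self.search_type_counts.items()}

    def to_json(self):
        """Serialize the profile so it could be uploaded or shared."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Invented example values, for illustration only.
profile = CatalogUseProfile(
    library_type="academic",
    system_vendor="Example Vendor",
    system_brand="Example ILS",
    default_search_type="keyword",
    search_type_counts={"keyword": 600, "title": 300, "author": 100},
    advanced_search_count=45,
)

print(profile.search_type_shares())
print(profile.to_json())
```

A flat structure like this would be easy to fill in by hand or to generate from system logs, and the pre-defined sets for known systems could simply be profiles with the search types and display options already listed.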

And not everyone would have to participate, just those who are interested in the future of the catalog. No coercion, no requirements. I'd be happy to see the data from a couple of dozen catalogs. Could we do it? (I volunteer to write specs and documentation, which are in my skill set.)