Friday, July 20, 2007

Copies, duplicates, identification

In at least three projects I'm working on now I am seeing problems with the conflict between managing copies (which libraries do) and managing content (which users want). Even before we go chasing after the FRBR concept of the work, we are already dealing with what FRBR-izers would call "different items of the same manifestation." Given that the items we tend to hold were mass produced, and thus there are many copies of them, it seems odd that we have never found a way to identify the published set that those items belong to.

"Ah," you say, "what about the ISBN?" The ISBN is a good manifestation identifier for things published after 1968 (not to mention some teddy bears and fancy chocolates), but it doesn't help us for anything earlier than that.

You probably aren't saying, "What about the BICI?" which was an admirable attempt to create a book identifier similar to the SICI (which covers serials, serials issues, and serials articles). The BICI never got beyond being a draft NISO standard, presumably because no one was interested in using it. The SICI is indeed a full NISO standard, but it seems to be falling out of use. Both of these were identifiers that could be derived either from the piece or from metadata, which is in itself not a bad idea. What was a less than good idea is that the BICI could be derived only for books that have ISBNs, and if you've got an ISBN you haven't a whole lot of use for a BICI, although it would allow you to identify individual chapters or sections of the book. But as a book identifier, it doesn't do much for us.

Now that we're moving into a time of digitization of books, I'm wondering if we can't at least find a way to identify the duplicate digital copies (of which there will be many as the various digitization projects go forward, madly grabbing books off of shelves and rushing them to scanners). Early books were identified using incipits, usually a few characters of beginning and ending text. Today's identifier would have to be more clever, but surely with the ability to run a computation on the digitized book there would be some way to derive an identifier that is accurate enough for the kind of operation where lives aren't usually at stake. There would be the need to connect the derived book identifier to the physical copies of the book, but I'm confident we can do that, even if over a bit of time.

Both Google and the Internet Archive are assigning unique identifiers to digitized books, but we have to presume that these are internal copy level identifiers, not manifestation-specific. The Archive seems to use some combination of the title and the author. Thus "Venice" by Mortimer Menpes is venicemenpes00menpiala while "Venice" by Beryl De Zoete is venicedeselincou00dezoiala and "Venice" by Daniel Pidgeon is venicepidgeon00pidgiala. The zeroes in there lead me to believe that if they received another copy it would get identified as "01." Google produces an impenetrable identifier for the Mortimer Menpes book: id=4XsKAAAAIAAJ, which may or may not be derivable from the book itself. I suspect not. And we know that Google will have duplicates so we also know that each item will be identified, not each manifestation.

Meanwhile, there is a rumor circulating that there is discussion taking place at Bowker, the ISBN agency, on the feasibility of assigning ISBNs to pre-1968 works, especially as they get digitized. I'm very interested in how (if?) we can attach such an identifier to the many copies of the books that already exist, and to their metadata. (This sounds like a job for WorldCat, doesn't it, since they have probably the biggest and most accurately de-duped database of manifestations.)

I know nothing more about it than that, but will pass along any info if I get it. And I'd love to hear from anyone who does know more.


Andy said...

For an identifier, how about the MD5 hash (or similar)? It's something which can be derived from the digital object via a well-known and open algorithm, and should be unique. It also avoids the duplication problems bound to occur when metadata is used to create an identifier.
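
To make the suggestion concrete, here is a minimal sketch in Python of deriving an identifier from a digital object with MD5. The function name md5_identifier and the sample byte strings are made up for illustration; a real scan would be read from a file. Note that the digest describes the exact bytes of the file, so two files that differ by even one byte get unrelated identifiers.

```python
import hashlib


def md5_identifier(data: bytes) -> str:
    """Return the hex MD5 digest of a digital object's bytes.

    Hypothetical helper for illustration only.
    """
    return hashlib.md5(data).hexdigest()


# Two stand-ins for scans of the same printed book (invented sample data).
scan_a = b"Venice, by Mortimer Menpes. Page 1 ..."
scan_b = b"Venice, by Mortimer Menpes. Page 1 ...."  # one extra byte

# The same bytes always give the same identifier.
print(md5_identifier(scan_a) == md5_identifier(scan_a))

# Any difference in the bytes gives a different identifier.
print(md5_identifier(scan_a) == md5_identifier(scan_b))
```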

Karen Coyle said...


The problem is that the digital objects will not be identical; only the printed object that they are copies of will be. Somehow, the identifier has to be based on that underlying object. That's why I was thinking about incipits -- something that would identify the actual printed text. I'm thinking something like the first word on every nth page. Of course, if the target word gets garbled in the OCR, then ... OK, it's complicated, but I'm sure that there is a solution.
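
As a rough sketch of the "first word on every nth page" idea, and nothing more than that, something like the following Python could work. Everything here is an assumption for illustration: the names incipit_key and incipit_identifier, the choice of n, the rule for what counts as a word, and the use of SHA-1 to shorten the result. It assumes the OCR text is already split into a list of page strings.

```python
import hashlib
import re


def incipit_key(pages, n=10):
    """Collect the first word of every nth page, lowercased.

    `pages` is a list of OCR page texts (assumed input format).
    Pages with no recognizable word contribute an empty string.
    """
    words = []
    for page in pages[::n]:  # pages at index 0, n, 2n, ...
        match = re.search(r"[A-Za-z]+", page)
        words.append(match.group(0).lower() if match else "")
    return "|".join(words)


def incipit_identifier(pages, n=10):
    """Shorten the key to a fixed-length identifier (hypothetical scheme)."""
    key = incipit_key(pages, n)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]


# Two invented "scans" of the same text with different spacing and case.
pages_a = ["The city of Venice", "rises from the sea", "and its canals"]
pages_b = ["  THE city of  Venice.", "rises  from the sea", "And its canals"]

# One OCR error in a sampled word is enough to break the match.
pages_c = ["Tbe city of Venice", "rises from the sea", "and its canals"]

print(incipit_identifier(pages_a, n=2) == incipit_identifier(pages_b, n=2))
print(incipit_identifier(pages_a, n=2) == incipit_identifier(pages_c, n=2))
```

The third sample shows the OCR problem directly: a single misread letter in a sampled word changes the identifier, so a workable scheme would need some tolerance for errors that this sketch does not attempt.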