Coyle's InFormation: identifiers

Saturday, November 05, 2022

Cautions on ISBN and a bit on DOI

I have been reading through the documents relating to the court case that Hachette has brought against the Internet Archive's "controlled digital lending" program. I wrote briefly about this before. In this recent reading I am once again struck by the use and over-use of ISBNs as identifiers. Most of my library peeps know this, but for others, I will lay out the issues and problems with these so-called "identifiers".

"BOOK"

The "BN" of the ISBN stands for "BOOK NUMBER." The "IS" is for "INTERNATIONAL STANDARD"; the standard is issued by the International Organization for Standardization (ISO), whose documents are unfortunately paywalled. But the un-paywalled page defines the target of an ISBN as:

[A] unique international identification system for each product form or edition of a separately available monographic publication published or produced by a specific publisher that is available to the public.

What isn't said here in so many words is that the ISBN does not define a specific content; it defines a salable product instance in the same way that a UPC code is applied to different sizes and "flavors" of Dawn dish soap. What many people either do not know or may have forgotten is that every book product is given a different ISBN. This means that the hardback book, the trade paperback, the mass-market paperback, the MOBI ebook, the EPUB ebook, even if all brought to market by a single publisher, all have different ISBNs.

The word "book" is far from precise and it is a shame that the ISBN uses that term. Yes, it is applied to the book trade, but it is not applied to a "book" except in a common sense of that word. When you say "I read a book" you do not often mean the same thing as the B in ISBN. Your listener has no idea if you are referring to a hard back or a paperback copy of the text. It would be useful to think of the ISBN as the ISBpN - the International Standard Book product Number.

Emphasizing the ISBN's use as a product code, bookstores at one point were assigning ISBNs to non-book products like stuffed animals and other gift items. This was because the retail system that the stores used required ISBNs. I believe that this practice has been quashed, but it does illustrate that the ISBN is merely a bar code at a sales point.

1970

The ISBN became a standard product number in the book trade in 1970, in the era when the Universal Product Code (UPC) concept was being developed in a variety of sales environments. This means that every book product that appeared on the market before that date does not have an ISBN. This doesn't mean that a text from before that date cannot have an ISBN - as older works are re-issued for the current market, they, too, are given ISBNs as they are prepared for the retail environment. Even some works that are out of copyright (pre-1925) may be found to have ISBNs when they have been reissued.

The existence of an ISBN on the physical or electronic version of a book tells you nothing about its copyright status and does not mean that the book content is currently in print. It has the same meaning as the bar code on your box of cereal - it is a product identifier that can be used in automated systems to ring up a purchase.

The Controlled Digital Lending Lawsuit

The lawsuit between a group of publishers led by Hachette and the Internet Archive is an example of two different views: that of selling and that of reading.

In the lawsuit the publishers quantify the damage done to them in terms of numbers of ISBNs. This implies that the lawsuit does not include back titles that are pre-ISBN. Because the concern is economic, items that are long out of print don't seem to be included in the lawsuit.

The difference between the book as product and the book as content shows up in how ISBNs are used. The publishers’ expert notes that many metadata records at the archive have multiple ISBNs and surmises that the archive is adding these to the records. What this person doesn’t know is that library records, which the archive is using, often contain ISBNs for multiple book products which the libraries consider interchangeable. The library user is seeking specific content and is not concerned with whether the book is a hardback, has library binding, or is one of the possible soft covers. The “book” that the library user is seeking is an information vessel.

It is the practice in libraries, where there is more than one physical book type available, to show the user a single metadata record that doesn’t distinguish between them. The record may describe a hard bound copy even though the library has only the trade paperback. This may not be ideal but the cost-benefit seems defensible. Users probably pay little attention to the publication details that would distinguish between these products.

From a single library metadata record

Where libraries do differentiate is between forms that require special hardware or software. Even here however the ISBN cannot be used for the library’s purpose because services that manage these materials can provide the books in the primary digital reading formats based off a single metadata record, even though each ebook format is assigned its own ISBN for the purpose of sales.

The product view is what you see on Amazon. The different products have different prices which is one way they are distinguished. A buyer can see the different prices for hard copy, paperback, or kindle book, and often a range of prices for used copies. Unlike the library user, the Amazon customer has to make a choice, even if all of the options have the same content. For sales to be possible, each of the products has its own ISBN.

Different products have different prices

Counting ISBNs may be the correct quantifier for the publishers, but ISBNs feature only minimally in the library environment. Putting multiple ISBNs on a single library metadata record is not an attempt to hide publisher products by lumping them together; it's good library practice for serving readers. Users coming to the library with an ISBN will be directed to the content they seek regardless of the particular binding the library owns. Counting the ISBNs in the Internet Archive's metadata will not be a good measure of the number of "books" there using the publisher's definition of "book."
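To make the counting problem concrete, here is a minimal sketch in Python. The records and the ISBN placeholder strings are invented purely for illustration (they are not real ISBNs and not real Internet Archive data); the point is only that the number of ISBNs and the number of records measure different things, and that any of a record's ISBNs leads to the same content.

```python
# Hypothetical library-style records: one record per content, with
# every product ISBN the library considers interchangeable.
# The ISBN values are placeholders, not real numbers.
records = [
    {"title": "Title A",
     "isbns": ["isbn-A-hardback", "isbn-A-paperback", "isbn-A-epub"]},
    {"title": "Title B",
     "isbns": ["isbn-B-paperback"]},
]


def count_isbns(recs):
    """The publisher's view: every product ISBN is counted."""
    return sum(len(r["isbns"]) for r in recs)


def count_records(recs):
    """The library's view: one record per content sought by the reader."""
    return len(recs)


def build_isbn_index(recs):
    """Map each ISBN to its record, so any product ISBN finds the content."""
    return {isbn: r for r in recs for isbn in r["isbns"]}


index = build_isbn_index(records)
print(count_isbns(records), "ISBNs on", count_records(records), "records")
print(index["isbn-A-epub"]["title"])
```

In this toy data the two counts differ by a factor of two, which is the kind of gap that makes an ISBN tally a poor proxy for a count of "books."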

Digital Object Identifier (DOI)

I haven't done a deep study of the use of DOIs, but again there seems to be a great enthusiasm for the DOI as an identifier yet I see little discussion of the limitations of its reach. DOI began in 2000 so it has a serious time limit. Although it has caught on big with academic and scientific publications, it has less reach with social sciences, political writing, and other journalism. Periodicals that do not use DOIs may well be covering topics that can also be found in the DOI-verse. Basing an article research system on the presence of DOIs is an arbitrary truncation of the knowledge universe.

The End

Identifiers are useful. Created works are messy. Metadata is often inadequate. As anyone who has tried to match up metadata from multiple sources knows, working without identifiers makes that task much more difficult. However, we must be very clear, when using identifiers, to recognize what they identify.

Wednesday, August 17, 2016

Classification, RDF, and promiscuous vowels

"[He] provided (i) a classified schedule of things and concepts, (ii) a series of conjunctions or 'particles' whereby the elementary terms can be combined to express composite subjects, (iii) various kinds of notational devices ... as a means of displaying relationships between terms." [1]

"By reducing the complexity of natural language to manageable sets of nouns and verbs that are well-defined and unambiguous, sentence-like statements can be interpreted...."[2]

The "he" in the first quote is John Wilkins, and the date is 1668.[3] His goal was to create a scientifically correct language that would have one and only one term for each thing, and then would have a set of particles that would connect those things to make meaning. His one and only one term is essentially an identifier. His particles are linking elements.

The second quote is from a publication about OCLC's linked data experiments, and is about linked data, or RDF. The goals are so obviously similar that it can't be overlooked. Of course there are huge differences, not the least of which is the technology of the time.*

What I find particularly interesting about Wilkins is that he did not distinguish between classification of knowledge and language. In fact, he was creating a language, a vocabulary, that would be used to talk about the world as classified knowledge. Here we are at a distance of about 350 years, and the language basis of both his work and the abstract grammar of the semantic web share a lot of their DNA. They are probably proof of some Chomskian theory of our brain and language, but I'm really not up to reading Chomsky at this point.

The other interesting note is how similar Wilkins is to Melvil Dewey. He wanted to reform language and spelling. Here's the section where he decries alphabetization because the consonants and vowels are "promiscuously huddled together without distinction." This was a fault of language that I have not yet found noted in Dewey's work. Could he have missed some imperfection?!

*Also, Wilkins was a Bishop in the Anglican church, and so his description of the history of language is based literally on the Bible, which makes for some odd conclusions.

[1] Schulte-Albert, Hans G. Classificatory Thinking from Kinner to Wilkins: Classification and Thesaurus Construction, 1645-1668. Quoting from Vickery, B. C. "The Significance of John Wilkins in the History of Bibliographical Classification." Libri 2 (1953): 326-43.
[2] Godby, Carol J., Shenghui Wang, and Jeffrey Mixter. Library Linked Data in the Cloud: OCLC's Experiments with New Models of Resource Description. 2015.
[3] Wilkins, John. Essay Towards a Real Character, and a Philosophical Language. S.l.: Printed for Sa. Gellibrand, and for John Martyn, 1668.

Tuesday, July 23, 2013

Linked Data First Steps & Catch-21

Often when I am with groups of librarians talking about linked data, this question comes up:

"What can we do TODAY to get ready for linked data?"

It's not really a hard question, because, at least in my mind, there is an obvious starting point: identifiers. We can begin today to connect the textual data in our bibliographic records with identifiers for the same thing or concept.

What identifiers exist? Thanks to the Library of Congress we have identifiers for all of our authority controlled elements: names and subjects. (And if you are outside of the US, look to your national library for their work in this area, or connect to the Virtual International Authority File where you can.) LoC also provides identifiers for a number of the controlled lists used in MARC21 data.

The linked data standards require that identifiers be in the form of an HTTP-based URI. What this means is that your identifier looks like a URL. The identifier for me in the LC name authority file is:

http://id.loc.gov/authorities/names/n89613425

Any bibliographic data with my name in a field should also contain this identifier. (OK, admittedly that's not a lot of bib data.) That brings us to "Catch-21" -- the MARC21 record. Although a control subfield was added to MARC21 for identifiers ($0), that subfield requires the identifier to be in a MARC21-specific format:

The control number or identifier is preceded by the appropriate MARC Organization code (for a related authority record) or the Standard Identifier source code (for a standard identifier scheme), enclosed in parentheses.

The example in the MARC21 documentation is:

100 1#$aBach, Johann Sebastian.$4aut$0(DE-101c)310008891

Modified to use LC name authorities that would be:

100 1#$aBach, Johann Sebastian,$d1685-1750$0(LoC)n79021425

The contents of the $0 therefore are not a linked data identifier, even in those instances where we have a proper linked data identifier for the name. Catch-21. I therefore suggest that, as an act of Catch-21 disobedience, we all declare that we will ignore the absurdity of having recently added an anti-linked data identifier subfield to our standard, and use it instead for standard HTTP URIs:

100 1#$aBach, Johann Sebastian,$d1685-1750$0http://id.loc.gov/authorities/names/n79021425
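The rewrite from the parenthesized form to the URI form is mechanical, which is why it could be done in batch. Here is a rough sketch in Python, working on the plain-text field display used in the examples above rather than on real MARC records. The `URI_BASES` table and the function name are assumptions made for illustration; the "LoC" key simply mirrors the example above, and a real service would need a vetted mapping for each source code.

```python
import re

# Hypothetical mapping from the parenthesized source code to a URI base.
# Only the LC name authority example from the text is covered here.
URI_BASES = {
    "LoC": "http://id.loc.gov/authorities/names/",
}

# Matches "$0(SOURCE)CONTROLNUMBER" up to the next subfield delimiter.
SUBFIELD_0 = re.compile(r"\$0\(([^)]+)\)([^$]+)")


def subfield0_to_uri(field: str) -> str:
    """Replace a '(source)number' $0 value with an HTTP URI when the
    source is known; leave the subfield untouched otherwise."""
    def replace(match):
        source, control_number = match.group(1), match.group(2)
        base = URI_BASES.get(source)
        if base is None:
            return match.group(0)  # unknown source: keep as recorded
        return "$0" + base + control_number
    return SUBFIELD_0.sub(replace, field)


field = "100 1#$aBach, Johann Sebastian,$d1685-1750$0(LoC)n79021425"
print(subfield0_to_uri(field))
```

Leaving unknown sources unchanged is a deliberate choice in this sketch: it is safer to keep the original control number than to guess at a URI.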

Once we've gotten over this hurdle, we can begin to fill in identifiers for authority-controlled elements. Obviously we won't be doing this by hand, one record at a time. This should be part of a normal authority update service, or it may be feasible within systems that store and link national authority data to bibliographic records.

We should also insist that cataloging services that use the national authority files begin to include these subfields in bibliographic data as it is created/downloaded.

Note that because the linked data standard identifiers are HTTP URIs, aka URLs, by including these identifiers in your bibliographic data you have already created a link -- a link to the web page for that person or subject, and a link to a machine-readable form of the authority data in a variety of formats (MARCXML, JSON, RDF, and more). In the LC identifier service, the name authority data includes a link to the VIAF identifier for the person; the VIAF identifier for some persons is included in the Wikipedia page about the person; the Wikipedia identifier links you to DBpedia and the DBpedia identifier is used by Freebase ...

That's how it all gets started, with one identifier that becomes a kind of identifier snowball rolling down hill, collecting more and more links as it goes along.

Pretty easy, eh?

Thursday, July 05, 2012

ISBN as URI

One thing that is annoying in this early stage of attempting to create linked data is that we lack URI forms for some key elements of our data. One of those is the ISBN. At the linked data conference in Florence, I asked a representative of the international ISBN agency about creating an "official" URI form for ISBNs and was told that it already exists: ISBN-A, which uses the DOI namespace.

The format of the ISBN URI is:

Handle System DOI name prefix = "10."
ISBN (GS1) Bookland prefix = "978." or "979."
ISBN registration group element and publisher prefix = variable length numeric string of 2 to 8 digits
Prefix/suffix divider = "/"
ISBN Title enumerator and checkdigit = maximum 6 digit title enumerator and 1 digit check digit

I was thrilled to find this, but as I look at it more closely I wonder how easy it will be to divide the ISBN at the right point between the publisher prefix and the title enumerator. In the case where an ISBN is recorded as a single string (which is true in many instances in library data):

9781400096237
9788804598770

there is nothing in the string to indicate where the divider should be, which in these two cases is:

978.14000/96237
978.8804/598770
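When the ISBN is recorded with its hyphens, the division point is already known and the conversion is simple. The sketch below, in Python, assumes hyphenated ISBN-13 input with the usual five parts; the function names are made up for illustration. It does not solve the harder case of an unhyphenated string, which would need the ISBN agency's range tables to locate the publisher prefix.

```python
def isbn13_check_digit(first12: str) -> str:
    """Compute the ISBN-13 check digit from the first 12 digits
    (weights alternate 1, 3, 1, 3, ...)."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def isbn_a_from_hyphenated(isbn: str) -> str:
    """Build the ISBN-A form from a hyphenated ISBN-13 such as
    '978-1-4000-9623-7' (prefix-group-publisher-title-check)."""
    parts = isbn.strip().split("-")
    if len(parts) != 5:
        raise ValueError("expected five hyphen-separated parts")
    prefix, group, publisher, title, check = parts
    digits = "".join(parts)
    if len(digits) != 13 or not digits.isdigit():
        raise ValueError("expected 13 digits in total")
    if isbn13_check_digit(digits[:12]) != check:
        raise ValueError("check digit does not match")
    # 10.<Bookland prefix>.<group + publisher>/<title enumerator + check>
    return f"10.{prefix}.{group}{publisher}/{title}{check}"


print(isbn_a_from_hyphenated("978-1-4000-9623-7"))
print(isbn_a_from_hyphenated("978-88-04-59877-0"))
```

Run on the two ISBNs above, this yields the same divided forms shown in the text, but only because the hyphens supplied the split.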

I have two questions for anyone out there who wants to think about this:

1) Is this division into prefix/suffix practical for our purposes?
2) Would having a standard ISBN-A URI format relieve us of the need to have a separate property (such as bibo:ISBN) for these identifiers? In other words, is the URI format identification enough that this could be used directly in a situation like:

<bookURI> <RDVocab:hasManifestationIdentifier><http://10.978.14000/96237>

Friday, July 20, 2007

Copies, duplicates, identification

In at least three projects I'm working on now I am seeing problems with the conflict between managing copies (which libraries do) and managing content (which users want). Even before we go chasing after the FRBR concept of the work, we are already dealing with what FRBR-izers would call "different items of the same manifestation." Given that the items we tend to hold were mass produced, and thus there are many copies of them, it seems odd that we have never found a way to identify the published set that those items belong to.

"Ah," you say, "what about the ISBN?" The ISBN is a good manifestation identifier for things published after 1968 (not to mention some teddy bears and fancy chocolates), but it doesn't help us for anything earlier than that.

You probably aren't saying, "What about the BICI?" which was an admirable attempt to create a book identifier similar to the SICI (which covers serials, serials issues, and serials articles). The BICI never got beyond being a draft NISO standard, presumably because no one was interested in using it. The SICI is indeed a full NISO standard, but it seems to be falling out of use. Both of these were identifiers that could be derived either from the piece or from metadata, which is in itself not a bad idea. What was a less than good idea is that the BICI only could be derived for books that have ISBNs, but if you've got an ISBN you haven't a whole lot of use for a BICI, although it would allow you to identify individual chapters or sections of the book. But as a book identifier, it doesn't do much for us.

Now that we're moving into a time of digitization of books, I'm wondering if we can't at least find a way to identify the duplicate digital copies (of which there will be many as the various digitization projects go forward, madly grabbing books off of shelves and rushing them to scanners). Early books were identified using incipits, usually a few characters of beginning and ending text. Today's identifier would have to be more clever, but surely with the ability to run a computation on the digitized book there would be some way to derive an identifier that is accurate enough for the kind of operation where lives aren't usually at stake. There would be the need to connect the derived book identifier to the physical copies of the book, but I'm confident we can do that, even if over a bit of time.
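As a thought experiment only, a computed "incipit" might look like the Python sketch below: normalize the digitized text, take a span from the beginning and the end, and hash the two together. Everything here is an assumption for illustration (the function names, the 200-character span, the use of SHA-256), and an exact hash like this would break on OCR differences between two scans, so a real derived identifier would need to be more tolerant.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and drop everything except letters
    and digits, so layout and punctuation differences do not matter."""
    text = unicodedata.normalize("NFKD", text).lower()
    return re.sub(r"[^a-z0-9]", "", text)


def derived_identifier(text: str, span: int = 200) -> str:
    """Hash the opening and closing spans of the normalized text,
    a computed cousin of the incipit used for early books."""
    clean = normalize(text)
    beginning, ending = clean[:span], clean[-span:]
    digest = hashlib.sha256(f"{beginning}|{ending}".encode("ascii"))
    return digest.hexdigest()[:16]


scan_one = "Venice is a city.\n\nThe End."
scan_two = "VENICE  is a city. The end"
print(derived_identifier(scan_one) == derived_identifier(scan_two))
```

Two scans that differ only in case, spacing, and punctuation get the same identifier here; two scans with different OCR errors in the opening or closing text would not.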

Both Google and the Internet Archive are assigning unique identifiers to digitized books, but we have to presume that these are internal copy level identifiers, not manifestation-specific. The Archive seems to use some combination of the title and the author. Thus "Venice" by Mortimer Menpes is venicemenpes00menpiala while "Venice" by Beryl De Zoete is venicedeselincou00dezoiala and "Venice" by Daniel Pidgeon is venicepidgeon00pidgiala. The zeroes in there lead me to believe that if they received another copy it would get identified as "01." Google produces an impenetrable identifier for the Mortimer Menpes book: id=4XsKAAAAIAAJ, which may or may not be derivable from the book itself. I suspect not. And we know that Google will have duplicates so we also know that each item will be identified, not each manifestation.

Meanwhile, there is a rumor circulating that there is discussion taking place at Bowker, the ISBN agency, on the feasibility of assigning ISBNs to pre-1968 works, especially as they get digitized. I'm very interested in how (if?) we can attach such an identifier to the many copies of the books that already exist, and to their metadata. (This sounds like a job for WorldCat, doesn't it, since they have probably the biggest and most accurately de-duped database of manifestations.)

I know nothing more about it than that, but will pass along any info if I get it. And I'd love to hear from anyone who does know more.