Thursday, July 05, 2012

ISBN as URI

One thing that is annoying in this early stage of attempting to create linked data is that we lack URI forms for some key elements of our data. One of those is the ISBN. At the linked data conference in Florence, I asked a representative of the international ISBN agency about creating an "official" URI form for ISBNs and was told that it already exists: ISBN-A, which uses the DOI namespace.

The format of the ISBN URI is:
  • Handle System DOI name prefix = "10."
  • ISBN (GS1) Bookland prefix = "978." or "979."
  • ISBN registration group element and publisher prefix = variable length numeric string of 2 to 8 digits
  • Prefix/suffix divider = "/"
  • ISBN Title enumerator and checkdigit = maximum 6 digit title enumerator and 1 digit check digit

I was thrilled to find this, but as I look at it more closely I wonder how easy it will be to divide the ISBN at the right point between the publisher prefix and the title enumerator. In the case where an ISBN is recorded as a single string (which is true in many instances in library data):

9781400096237
9788804598770

there is nothing in the string to indicate where the divider should be, which in these two cases is:

978.14000/96237
978.8804/598770
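
When the ISBN is still stored with its hyphens, the format listed above can be assembled mechanically. The following is a minimal sketch, not an official algorithm: the function name is hypothetical, and the hyphen positions in the sample inputs are inferred from the two splits shown above rather than taken from any registry.

```python
def isbn_a_from_hyphenated(isbn: str) -> str:
    """Build an ISBN-A style DOI name from a hyphenated ISBN-13.

    Assumes the five standard elements are present in order:
    prefix, registration group, publisher, title enumerator, check digit.
    """
    parts = isbn.strip().split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 hyphen-separated elements, got {len(parts)}")
    if not all(p.isdigit() for p in parts) or sum(len(p) for p in parts) != 13:
        raise ValueError("an ISBN-13 must contain exactly 13 digits")
    prefix, group, publisher, title, check = parts
    if prefix not in ("978", "979"):
        raise ValueError(f"unexpected Bookland prefix: {prefix}")
    # "10." + Bookland prefix + "." + group and publisher + "/" + title and check digit
    return f"10.{prefix}.{group}{publisher}/{title}{check}"


print(isbn_a_from_hyphenated("978-1-4000-9623-7"))  # 10.978.14000/96237
print(isbn_a_from_hyphenated("978-88-04-59877-0"))  # 10.978.8804/598770
```

The hard part is not this formatting step but recovering the hyphen positions when only the 13-digit string was recorded.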

I have two questions for anyone out there who wants to think about this:

1) Is this division into prefix/suffix practical for our purposes?
2) Would having a standard ISBN-A URI format relieve us of the need to have a separate property (such as bibo:ISBN) for these identifiers? In other words, is the URI format identification enough that this could be used directly in a situation like:

<bookURI> <RDVocab:hasManifestationIdentifier><http://10.978.14000/96237>


14 comments:

Jörg Prante said...

A DOI, which is a registered identifier, comes with a concept of "actionable" identifiers. But this is different from what is required for URIs in Linked Data. You do not need DOIs or "actionable" identifiers for the representation of ISBNs in the Semantic Web.

Instead, the URN concept could be revised. URNs are location-independent resource names that represent persistent web resources. There is no reason why URNs should not describe stable identifiers as part of the Semantic Web. ISBNs have been embedded in the URN space for many years.

But interpreting ISBNs as URIs will sometimes fail because they are not unique.

See RFC 3187: "ISBN that has been assigned once should never be re-used. Nevertheless, publishers do occasionally re-use the same number. From the point of the URN resolution system proposed here, this will typically cause retrieval of two bibliographic records. A user can choose the correct publication using the data in the record, such as the author or title."

So the conclusion is: do not put ISBNs into URIs. Use URNs with a resolution to a "current resource".

Ian Davis said...

What's wrong with using the ISBN URN scheme? A URN is a valid URI.

See https://www.ietf.org/rfc/rfc3187.txt
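
For comparison, a URN form needs no knowledge of where the publisher prefix ends. A minimal sketch, assuming a hypothetical helper name and assuming that the same "urn:isbn:" prefix is applied to 13-digit ISBNs as to 10-digit ones; stripping the hyphens is a normalization choice made here for easier matching, not something the scheme is known to require:

```python
def isbn_urn(isbn: str) -> str:
    """Return a urn:isbn: URI for a 10- or 13-character ISBN."""
    # Drop hyphens and spaces; upper-case a possible ISBN-10 "x" check character.
    compact = isbn.replace("-", "").replace(" ", "").upper()
    if len(compact) not in (10, 13):
        raise ValueError(f"not a 10- or 13-character ISBN: {isbn!r}")
    return "urn:isbn:" + compact


print(isbn_urn("978-1-4000-9623-7"))  # urn:isbn:9781400096237
```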

Thomas Berger said...

Maybe I don't get your point, but

1. what is wrong with the IANA registered urn:isbn: URI Namespace?

2. ISBN-A, as the factsheet explains, does not cover every assigned ISBN, only those that have additionally and actively been registered as a DOI by their respective publisher.

Karen Coyle said...

To all --

For linked data, are we not desiring to use http URIs, as per the Design issues document?

Thomas, I was told by the ISBN agency rep that the agencies are supporting the DOI form, even if it is derived from an existing ISBN. You are right that the document says:

"ISBN-As are only assigned by DOI Registration Agencies which are also ISBN agencies (if they choose to offer this service)."

The rep assured me that this was a service of the ISBN agencies. So we need to find out if this is being offered as a general URI service by the ISBN agencies, yet looking at a couple of agency web pages I see no mention of this service. It is possible that the person I asked misunderstood my question or was highly optimistic about agency application of something she considered a standard. I'll try to find out more. (Of course, I don't remember the name of the person I spoke with... unfortunately typical for me.)

Jörg Prante said...

I think the URN ISBN embedding has to be revised to turn ISBNs into first-class identifiers for Linked Data. That is, ISBNs as URNs can't yet do all the nice things we expect, such as content negotiation.

For DOIs, an effort has been made to establish a Linked Data Service with content negotiation. See http://www.crossref.org/CrossTech/2011/04/content_negotiation_for_crossr.html

Thomas Berger said...

As to the fact sheet: actually, it is not as clear as I thought at first reading. But all the examples stress that "publishers" control where the DOIs resolve to. And since the agencies control the assignment of ISBN sub-namespaces to publishers and do not even know which actual ISBNs have already been assigned, I doubt that even the "participating" agencies will unilaterally register *all* ISBNs under their control as DOIs: this would be too embarrassing for many publishers.

Of course, in a linked data context even http-URIs don't have to be resolvable, and one could argue "if this item with an ISBN had an ISBN-A registered (we don't care if it really does), it would be the DOI 10.978.12345/99990". Alas, this is not a URI. Thus make it info:doi/10.978.12345/99990. But isn't the info Registry shutting itself down exactly because http-URIs have so many advantages?

Joe Montibello said...

Hi Karen,

"there is nothing in the string to indicate where the divider should be"

This got me curious. A little digging brought me to this page about ISBNs:

http://waltshiel.com/2008/06/22/isbn-how-to-decode-it/

So if Walt is right (and I only know him as one of many Google search results), it should be pretty easy to write code that takes an unhyphenated ISBN and rehyphenates it. Once the hyphens are back in place, it should be trivial to reformat an ISBN as an ISBN-A.

None of this answers the real (and interesting!) questions you're asking here, but I thought it was worth noting.

Take care,
Joe Montibello
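
The rehyphenation Joe describes can be sketched as a lookup against a table of publisher-prefix ranges. This is only an illustration under stated assumptions: the function and table names are hypothetical, and the two table entries are sample values chosen to be consistent with the two splits shown in the post. A real implementation would need the full, current range table published by the International ISBN Agency.

```python
# Sample rules only (assumption, not an authoritative table):
# (Bookland prefix + registration group, first publisher prefix, last publisher prefix)
SAMPLE_RANGES = [
    ("9781", "4000", "5499"),
    ("97888", "00", "19"),
]


def rehyphenate_isbn13(isbn: str, ranges=SAMPLE_RANGES) -> str:
    """Insert hyphens into a 13-digit ISBN using a table of publisher ranges."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        raise ValueError(f"not a 13-digit ISBN: {isbn!r}")
    for group_prefix, first, last in ranges:
        if not digits.startswith(group_prefix):
            continue
        start = len(group_prefix)
        publisher = digits[start:start + len(first)]
        # Equal-length digit strings compare correctly as plain strings.
        if first <= publisher <= last:
            title = digits[start + len(first):12]
            return "-".join([digits[:3], digits[3:start], publisher, title, digits[12]])
    raise LookupError(f"no range rule covers {digits}")


print(rehyphenate_isbn13("9781400096237"))  # 978-1-4000-9623-7
print(rehyphenate_isbn13("9788804598770"))  # 978-88-04-59877-0
```

Any ISBN that falls outside the table raises an error rather than guessing, which seems the safer behavior when the result is going to be used as an identifier.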

Karen Coyle said...

Thomas, yes, I was thinking that if the DOI "format" was valid that we could use it even though it does not resolve. After all, the URN:ISBN does not resolve, and the ISBNs that we are creating under bibo:ISBN don't resolve. But we really need one "http://" URI format for the ISBN (or a small number, though one major one would be best) for easy linking. Right now, linking on ISBNs is a bit of a mess, yet it is such an important identifier for books.

That the ISBN is controlled by the publishers is one of the weaknesses of using ISBN as an identifier. The ISBN is a product number for publishers, rightfully so, and using it for other purposes, like identification of "sought texts" is going to have some mis-matches of what is identified. One easy example is that hardback, trade paperback, and paperback all get different ISBNs because they are different products, but there is no identifier that unites them as the same text republished on different paper. (Eventually, the ISTC.... but not yet.)

My original question to the ISBN rep was whether the ISBN agencies could provide a standard http URI format for ISBNs, since it would be best if that identifier originated from the "owning" agency. They control the non-URI format (as Joe helps us with in his comment), so it would make sense for them to be the source of the linking URI.

Joe, thanks for the re-hyphenation code. That should take care of the positioning of the slash. That said, I'm not sure why the DOI requires the slash, although it does easily "identify" the publisher portion of the code. I'm just not sure why that matters for the DOI application, but it doesn't seem to be a deal-breaker.

LJNDawson said...

Hi, Karen -

I'm the product manager for identifiers at Bowker. We are just spec-ing out the requirements for the DOI registration system and building it this fiscal year.

The question is always going to be "who decides where the link points to?" And publishers, authors and agents will very much insist they should be the arbiters. We're seeing some of that now as we begin implementing ISTC (yes, we are doing that! woohoo!). Publishers are a bit concerned that they don't "own" ISTCs and therefore don't control which edition links to which other edition; they don't control the relationships between editions.

But we're entering a world where control over anything is diminished. It's a hard lesson.

If you'd like to reach out, I am laura.dawson@bowker.com.

Jonathan Rochkind said...

That's quite odd. I was under the impression that the party line on ISBNs was that you should _not_ try to divide between the "publisher" segment and the title segment, and that they should be treated as opaque identifiers. As I understood it, there was not necessarily a reliable, internationally valid way to get the 'publisher' component out, and the 'publisher' component shouldn't be assumed to always correspond to an actual publisher; that was an internal detail of assignment.

So I find it very odd to see this division enshrined in the ISBN-A DOI. I suspect the ISBN folks will come to see this as a historical mistake, if they haven't already. At any rate, I think it was a mistake. ISBNs should be treated as opaque identifiers; the system does not reliably support separating the 'publisher' portion from the title portion (nor _should_ it. While this can indeed be convenient sometimes, it's best to leave an identifier as purely an identifier and not try to load it with other meaning embedded in the identifier, which often leads to all sorts of problems down the line).

Karen Coyle said...

Laura, thanks so much for commenting. As you can see, there is a community outside of publishers that needs a web-based identifier for the ISBN (warts and all!). So I'm glad that Bowker is looking into that, and I hope you can reach out to us as you develop your standards.

I believe that the library linked data (and the general bibliographic linked data) community will find the ISBN useful as an identifier regardless of what it resolves to, although I understand how important that question is for publishers. I'll hang on to your contact info, and you should feel free to get in touch with me or any others in this discussion for more info on our intended uses of an HTTP URI for the ISBN.

Karen Coyle said...

Jonathan, for the use that you and I have for the ISBN it probably is sufficient to treat it as an opaque identifier. The DOI is designed to serve e-commerce, and I can imagine that there may be particular functions that are specific to a publisher as identified in the first two segments of the ISBN.

I'm not thrilled at the use of DOI as the identifier for the ISBN precisely because of its e-commerce "bent." Bibliographic applications will need to treat ISBNs from out-of-print materials and even materials whose publishers no longer exist. It isn't likely that the DOI will be used for these. All along I've been hoping for something more neutral, maybe an agreement by the ISBN agencies to implement a simple http://isbn.info/... . This could re-direct to DOI for current items in commerce, but could also be "minted" on the fly from metadata that carries the raw ISBN. We just need to be able to say "this is an ISBN" in a way that is generally understood.

Jörg Prante said...

Since the ISBN is a globally specified identifier of the publishing industry, the aspect of control is important. The national ISBN agencies and the international ISBN agency are the policy makers and we all have to follow their rules. Otherwise, ISBN will no longer work as a consistent number scheme for the monographic book production process (which historically was never completely free of mistakes).

So the library community and other communities like data providers perceive ISBNs as externally generated identifiers. For example, Freebase invented "soft keys" for a "gardening" process to promote ambiguous and partial ISBNs to a "best" one. See http://wiki.freebase.com/wiki/Soft_key

ISBNs, when printed in library catalogs, are strictly speaking outside of their original domain and life cycle: the book has been produced and sold. When libraries offer services for users to search books by ISBN, or to link catalog data by ISBN, that is reasonable given the enormous popularity of the ISBN. But it is not a library catalog identifier.

Just as Freebase shows how "gardening soft key data" can make a bunch of related identifiers usable for accessing linked data, libraries need to be aware that all kinds of ISBNs need to co-exist even in Linked Data environments. Obsolete ones, wrongly printed ones, ones with wrong check digits, equivalent ones (ISBN-10 and ISBN-13)... and I doubt publishers will care about the archival aspects of identifiers.
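
The ISBN-10/ISBN-13 equivalence mentioned above is the one case that can be handled mechanically. A minimal sketch, assuming a hypothetical function name: prefix the first nine digits with "978" and recompute the check digit with the ISBN-13 weighting (alternating 1 and 3, modulo 10). The old ISBN-10 check character is discarded without being verified.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 to its equivalent 978-prefixed ISBN-13."""
    chars = isbn10.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10 or not chars[:9].isdigit():
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    # Keep the nine data digits; the ISBN-10 check character is not reused.
    body = "978" + chars[:9]
    # ISBN-13 weights alternate 1, 3, 1, 3, ... from the left.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(body))
    check = (10 - total % 10) % 10
    return body + str(check)


print(isbn10_to_isbn13("1400096235"))  # 9781400096237
```

Normalizing both forms to the 13-digit string before comparing is one way to let equivalent identifiers co-exist in the same data set.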

Moreover, the "current resource" of the referenced ISBN is determined by the context the user prefers from a subjective viewpoint, not (only) by the book production process or by library staff. For example, a single ISBN in a union catalog can represent a list of book items held by several libraries, where picking the identified book by ISBN is up to the user.

Karen Coyle said...

Jörg,

Thanks for your comments. Yes, I believe we all agree that the ISBN is not a library identifier. But it is useful in our work flow.

I'm not sure what your relationship is to libraries, so what follows may be redundant with your knowledge, but may still add to the discussion.

We do have library-generated identifiers that identify the description of the resource (that is, the metadata). The main ones are the OCLC number, which covers the metadata contributed by about 70K libraries world-wide, and the National Bibliography numbers of national libraries. These have the advantage that they cover all library resources, including those that precede the 1968 date of the first uses of the ISBN. The disadvantage is that they don't help us link outside of the library environment, and this latter is today's big concern. There are systems that operate on equivalencies between some of the library numbers and ISBNs in order to facilitate linking, but in general no single identifier is sufficient and we use all that are available. (Jonathan can speak better to this as a creator of a system whose main function is to link between resources.)

We are well aware of "bad" ISBNs and even have special coding for them. (Actually, we have at least two kinds of "bad" -- structurally bad and erroneously assigned.) I consider it normal QC to be able to detect some instances of erroneous identifiers through the use of additional data elements.
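
The "structurally bad" kind can be caught with the check digit alone. A minimal sketch, assuming a hypothetical function name: an ISBN-13 is structurally valid when its thirteen digits, weighted alternately 1 and 3, sum to a multiple of 10. An erroneously assigned ISBN (a valid number on the wrong book) passes this test and can only be caught by comparing other data elements.

```python
def is_structurally_valid_isbn13(isbn: str) -> bool:
    """Return True if the string is 13 digits with a correct check digit."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        return False
    # Weights alternate 1, 3, 1, 3, ...; the check digit itself gets weight 1.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


print(is_structurally_valid_isbn13("9781400096237"))  # True
print(is_structurally_valid_isbn13("9781400096238"))  # False
```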

I'm not sure what your final paragraph means, sorry. In my experience, the ISBN isn't used much for retrieval in library public catalogs. I've seen it used in de-duplication algorithms for union catalogs, and it is of course used in libraries for purchasing. Library inventory and circulation use library-assigned item numbers, unrelated to the ISBN.

Where ISBNs exist they can be useful, and where they don't we must resort to other means. To be sure, in the case of current popular items, which are heavily sought in public libraries, the ISBN is an essential tool.