Monday, February 22, 2010

Shameless Self-Promotion

The American Library Association has published two reports that I prepared on metadata and the semantic web.

Report 1 is called: Understanding the Semantic Web: Bibliographic Data and Metadata. This is a broad overview of new concepts, aimed especially at those who are new to the semantic web and to web-based metadata. (Note: To understand the diagrams, you will need a copy of the Errata page, since a key set of the diagrams was borked.)

Report 2 has the catchy title of: RDA Vocabularies for a Twenty-First-Century Data Environment. This builds on the first report, and gives more information about building semantic web vocabularies. This report is for all of you who are wondering what on earth it is that Diane Hillmann and I keep going on about when we talk about registering RDA for semantic web use. It is not overly technical, so anyone who reads through Report 1 should be able to understand the general direction that we are advocating.

Feel free to ask questions, make comments, argue with me, or tell me why I'm wrong. I don't claim to have the final answers, and I want very much to have a dialog about these concepts that will lead us to a new and interesting place in library technology.

9 comments:

Bill Dueber said...

I keep getting stuck at the "do something about it" stage. I mean, I've got 6.5M MARC records kicking around that I'd love to turn into something less brain-dead, but how do I even approach getting the basics into an RDF or RDFa document? Right now all I've got is lccn and isbn (see, e.g., http://catalog.hathitrust.org/Record/005826915 and its lame RDF at http://catalog.hathitrust.org/Record/005826915.rdf).

There's lots and lots in the library world explaining why the semantic web is the bee's knees, but precious little telling us how to *do* anything about it.

Karen Coyle said...

Believe me, Bill, a number of us share your pain! It's just that it's a bit of a long haul.

First, we need URIs for EVERYTHING. It would be great to have URIs for all of the MARC fields and subfields, but we don't have that at this time. From the current cataloging world, all we have so far is URIs for LC subject authority records (at http://id.loc.gov).

In an attempt to move forward rather than backward, a group of us have created URIs for all of the RDA data elements, and all of the RDA controlled vocabularies. (See the Metadata Registry.)

The next step is to create actual DATA. Work has been started on the DCMI/RDA task group wiki. We began with use cases and scenarios, and some of the scenarios have *turtle* instantiations. This is about a year old by now, however, and we've all learned a bit since then.

I'm thinking of using a section of the futurelib wiki for continuing work in this area, mainly because it's easier to edit than the DC wiki, but either would be fine. The second report of mine has some schematic examples, easier to read than actual code, and we could try to move on from there.

I encourage anyone who is so inclined to hack away at solutions, and to make them available for review. And I also encourage everyone to be kind as we make awkward attempts to get somewhere... it may not always be pretty!
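
As one small starting point for that kind of hacking, here is a rough sketch (not a recommended mapping) of how the two identifiers Bill mentions could be written out as N-Triples. Everything except the record URI is an assumption: the `EX` namespace and its `lccn` and `isbn` properties are hypothetical placeholders, and the values are dummies, not the real identifiers of that record. Real data would use registered element URIs instead.

```python
# Hypothetical namespace for illustration only; real work would use
# registered element URIs rather than example.org.
EX = "http://example.org/terms/"


def literal(value):
    """Return an N-Triples literal with backslashes and quotes escaped."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def record_to_ntriples(record_uri, fields):
    """Emit one N-Triples line per (field name, value) pair, sorted by name."""
    lines = []
    for name, value in sorted(fields.items()):
        lines.append(f"<{record_uri}> <{EX}{name}> {literal(value)} .")
    return lines


# Placeholder values, NOT the identifiers of the real record.
triples = record_to_ntriples(
    "http://catalog.hathitrust.org/Record/005826915",
    {"lccn": "00-000000", "isbn": "0000000000"},
)
for line in triples:
    print(line)
```

The point is only that each statement needs a URI for the thing described and a URI for the property; the literal value is the easy part.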

Graham said...

Are there summaries available for the general public anywhere?

Thanks
Graham

Karen Coyle said...

Graham, ah, the frustration of RIGHTS, which aren't mine to give. The presentation that started all of this, in PDF with speaker's notes, is on my web site here. I am working on screencasts of that talk, which will appear on YouTube under my ID: kcoylenet. Right now I'm struggling with YouTube's limit of ten minutes per chunk. I'll announce these as soon as I have a few out there.

Jeffrey Beall said...

I finally finished reading “Understanding the Semantic Web: Bibliographic Data and Metadata,” and found it well-written and provocative. You’re a good writer, and I am glad you don’t take the unproductive “let’s kill MARC” stance that so many others have. Your tone is positive and upbeat. Did you ever think that you would be doing so much writing about cataloging?

The article leaves me with the following questions:

1. One of the criticisms of MARC cataloging is that it doesn’t scale to the size of the web. Would this be any different for linked data? Computers still aren’t very good at distinguishing multiple people with the same name (as you say), so linked data would need humans to assign the correct URI to, for example, each occurrence of “George Bush” (the father or the son?) in documents and in metadata. The same is true for homonyms such as bars (drinking) and bars (gymnastics). So how would linked data in libraries be able to scale to all resources that libraries make available?

2. We recently observed the record-reuse policy disaster that OCLC tried to thrust on its member libraries. If other suppliers of metadata become as protective and proprietary with their linked data as OCLC is with its metadata, won’t that decrease the chance of success for linked data-based systems in libraries since the trend seems to be towards restricting proprietary data sharing rather than promoting it?

3. What about spam? It seems like every time something new comes along in the networked environment, someone figures out a way to spam it. We have spam email, keyword stuffing, and even PURL servers have been spammed. How will linked data be any different? What safeguards will there be to prevent spamming?

4. Finally, could you give an example of an information retrieval / discovery system currently in use (in any domain) that uses linked data in a manner similar to the way you describe in your article? Thanks.

Karen Coyle said...

Jeffrey, thanks for your comments. I'll try to answer the questions here, but some are very complex (both the questions and the answers).

1. I don't think the issue is really about "scale" but maybe I don't understand what you mean by that. Linked data doesn't itself overcome the issue of identification of people and things, but it does facilitate the use of computing power to fill in where humans haven't provided the identifiers. The semantic web folks call this "making inferences." Essentially, if you have 80 instances of "Moby Dick is a book" and 79 instances of "Herman Melville wrote Moby Dick," then you can infer that he also wrote that 80th instance. It's not 100% accurate, but it's much like what Google does with their "Did you mean?..." These techniques can also be used to help humans find errors and fix them, like we do with Captcha fixing OCR errors.

2. Linked data can be used in a proprietary way, so OCLC could turn its database into a linked data store. But the really fun stuff will happen when data from different sources combines in new and interesting ways to give us new information. So the LOD movement, Linked Open Data, is promoting recombinatory information, and I think it will lead to some interesting creativity.

3. As in email, etc., spam will have to be handled algorithmically... But I think we need to introduce some method of authentication to our data. This has been suggested for email, but we probably need it for the web in general. I believe the IETF has looked into this, but it means re-engineering the entire Internet, which is daunting, to say the least.

4. The RDF discovery systems right now are very ugly, but you can find some links on the linked data site. DBpedia has some sample searches but I haven't yet found a really good search interface for it. (If anyone does know of one, speak up!) The best example using library data so far is the Open Library's new site (incomplete, but getting there). It isn't "strict" in terms of linked data, but it has the flavor that I think we are aiming for.
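
To make the Moby Dick example in answer 1 a little more concrete, here is a minimal sketch of that kind of gap-filling. It is a simple majority-vote heuristic that I am assuming for illustration, not the formal inference rules of RDFS or OWL; the function name `infer_author` and the `min_share` threshold are made up for this example.

```python
from collections import Counter


def infer_author(records, title, min_share=0.9):
    """Guess the author for `title` from the records that do state one.

    Returns (author, share) when a single author accounts for at least
    `min_share` of the stated authors, otherwise None.
    """
    stated = Counter(a for t, a in records if t == title and a is not None)
    if not stated:
        return None
    author, count = stated.most_common(1)[0]
    share = count / sum(stated.values())
    return (author, share) if share >= min_share else None


# 79 records name the author; the 80th has no author statement.
records = [("Moby Dick", "Herman Melville")] * 79 + [("Moby Dick", None)]

guess = infer_author(records, "Moby Dick")
print(guess)  # ('Herman Melville', 1.0)
```

Because the guess comes with a share rather than a certainty, it can be queued for a human to confirm instead of being written straight into the data.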

Diane Hillmann said...

Bill, take a look at some of the eXtensible Catalog Project's Metadata Services Toolkit information. Jennifer Bowen talks about it and links to some of their work in a blog for ALA TechSource: http://www.alatechsource.org/blog/2010/02/focus-on-metadata-jennifer-bowen-on-the-new-metadata-environment.html

Jeffrey Beall said...

I just finished reading “RDA Vocabularies for a Twenty-First-Century Data Environment” and found it understandable and informative.

Please let me ask these questions:

1. Why are we not seeing so much commotion about the Semantic Web and linked data in other domains? For example, from your description it seems like this would benefit online news sites. They could link a single person, event, etc. over time in many different stories and publications. Also, it seems like this would be helpful in the sciences, such as geography, which has many different place names and geographical features which could be linked. Scientific nomenclature seems like another natural fit for this, separating out things like daisy the flower and Daisy the girl. Are we ahead of the curve?

2. How will linked data work with documents and information that is behind a search interface or that requires authentication to access?

3. On p. 13 of the first part, you justify RDF’s not using natural language. I find it ironic that LCSH has also been criticized for this (Criminal justice, Administration of). It’s like we’ve come full circle; it's been undesirable to use non-natural language, but now it’s cool.

4. Also, you say, “It [RDF] models knowledge as classes of things and relationships between things. Members of a class all have the characteristics that define the class.” But in libraries, users ask for resources that are about things. For example, “I want a book about how to manufacture synthetic gems.” I think that just putting things in classes is more like form/genre than subject metadata. It’s more about “isness” than “aboutness.”

5. You talk a lot about how the Semantic web will help solve the problem of ambiguity in language, and I agree and think this will be a great improvement. I was surprised that you talked a lot about disambiguation but not about collocation. We’ll need a system that links variant terms (dentures, false teeth).

6. Small thing: you used the term “usage guidelines” when I would have used content standard (p. 19). Are they different?

7. In the triple
/Through the Looking Glass/ /has illustrator/ /John Tenniel/

... what do you do if the first element is a manifestation that has a different illustrator? Do you have to make a separate RDF record for every manifestation?

Thanks,
Jeffrey Beall

Karen Coyle said...

Jeffrey, I don't think I can answer all of your questions in this little box... but I'll do what I can:

1. Go to http://linkeddata.org to see who is putting out their data in semantic web format. There are hundreds.

2. Not relevant. Don't know how to answer.

3. "Not natural language" doesn't mean awkward language. Natural language means human language as opposed to codes, formulas, etc. LCSH uses natural language, but unnaturally.

4. We have a problem here, IMO. Libraries manage things, but in general users are looking for information. They have learned that they have to ask for things in libraries. I think we would serve users better by responding to their information needs rather than organizing ourselves entirely around things.

5. Semantic Web does a very good job of linking variant terms -- take a look at SKOS, which is designed for vocabularies and thesauri. The German National Library is adding their "variant" to the RDA vocabularies. For example, go here and click on any of the entries:
http://metadataregistry.org/concept/list/vocabulary_id/45.html
and you'll see that they have added the German equivalent for the terms. You can also add alternates in the same language.

6. "Usage guidelines" is the term the Dublin Core community has used -- I think it means the same as content standard.

7. It's probably best not to dwell too long on the triples. It seems to be necessary to describe them when talking about the semantic web, but in reality the REAL triples will be buried down at a machine level that folks do not see. As for your question, in fact the use of text here is deceptive, but not using text would be harder to read. It would go something like:

BookA has illustrator PersonX
BookB has illustrator PersonY
and you could have
BookC has illustrator PersonX

Now, what constitutes "BookA" vs. "BookB" depends on your specific rules and what you are coding. If the left side of this is a manifestation, then even if they have the same name, they are different things, and therefore they would have different identifiers. Basically, it all depends on how your metadata defines the things that you will describe and identify. (Something that we are currently arguing about vehemently on the RDA-L list, because it's not easy to decide at what point what you have results in a new Work or Expression.)
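
For anyone who finds it easier to see this in code, here is a small sketch of the BookA/BookB/BookC statements in answer 7. All of the identifiers (`book:A`, `person:X`, and so on), the `hasIllustrator` property name, and the "Another Illustrator" label are hypothetical stand-ins; real data would use URIs. It is meant only to show that two things can share a display name and still be different things with different statements attached.

```python
# Hypothetical identifiers standing in for URIs; the labels are for display only.
labels = {
    "book:A": "Through the Looking Glass",
    "book:B": "Through the Looking Glass",  # same name, different manifestation
    "book:C": "Alice's Adventures in Wonderland",
    "person:X": "John Tenniel",
    "person:Y": "Another Illustrator",  # placeholder name
}

# Each triple is (subject, property, object), all expressed as identifiers.
triples = [
    ("book:A", "hasIllustrator", "person:X"),
    ("book:B", "hasIllustrator", "person:Y"),
    ("book:C", "hasIllustrator", "person:X"),
]


def illustrators_of(book_id):
    """Return the identifiers of everyone stated to illustrate `book_id`."""
    return [o for s, p, o in triples if s == book_id and p == "hasIllustrator"]


def books_illustrated_by(person_id):
    """Return the identifiers of every book stated to be illustrated by `person_id`."""
    return [s for s, p, o in triples if o == person_id and p == "hasIllustrator"]


# Print the statements the way a person would read them.
for s, p, o in triples:
    print(f"{labels[s]} has illustrator {labels[o]}")
```

Printed with labels, the first two lines look like statements about the same book; at the identifier level they are clearly not, and a lookup on `person:X` finds both of that person's books without any name matching.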