Friday, September 10, 2010

Libraries, FOAF, and community

Note: this is being posted simultaneously on two blogs: Metadata Matters and Coyle's InFormation

“Why don’t libraries just use FOAF for their Person metadata? Why do they insist on creating their own?”

We don’t know how many times we have heard this on various lists. It often is not really posed as a question; in other words, it isn’t asking for an explanation of why libraries do not choose to use FOAF. It’s more rhetorical, along the lines of “Why can’t we all just get along?” But it is worthy of being asked as a real question, and of getting a real answer.

[Note first that the question of FOAF comes up not so much when we consider the current library standards, but in discussions of upcoming standards that will hopefully be based on the FR** family of standards (FRBR, FRAD, FRSAR).]

A comparison of FOAF Person and the library Person entity (either in MARC authority files, or RDA, or FRAD) shows that there is not one defined element (or “property” as it is called in Semantic Web-ese) that the two have in common. This is not a coincidence; the two vocabularies serve significantly different communities and purposes. This does not mean that they are irreconcilable; the question therefore becomes: What keeps them apart, and can that be overcome?

The key is in the nature of the two communities.

FOAF stands for ‘Friend of a Friend’, which is a clue to its context: the schema is primarily for use in social networking situations. Its focus is on people who are alive and online, and it includes online contact information like email addresses, web sites, work web sites, Facebook IDs, Skype IDs, etc. FOAF does not treat the name of the person as an identifier; instead it presumes that the name plus one or more of the contact IDs is enough to distinguish most humans from one another.

Library name data (which is a form of controlled vocabulary, called “name authority data” in library terms) is focused on creating a unique identifier that brings together the different forms of a name used in published materials under one form. Library users, therefore, can expect to find all of the works by or about a named person under a single entry regardless of the various forms of the name that exist in real data. Uniqueness of names is enforced by adding information to a non-unique name, usually the year of birth, but when that isn’t known (especially for persons of antiquity) titles or even areas of endeavor (“poet”) can be added.
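To make the idea concrete, here is a minimal sketch in Python of how a name authority record gathers variant forms under one heading, and how a qualifier keeps two people with the same name apart. The class and function names (`AuthorityRecord`, `build_index`) are hypothetical, and the sample records are illustrative only; they are not taken from any actual authority file, and real authority data is far richer than this.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorityRecord:
    """One person, one authorized heading (hypothetical, simplified model)."""
    name: str                 # the name as established, e.g. "Smith, John"
    qualifier: str = ""       # birth/death years, a title, or an area of endeavor
    variants: list = field(default_factory=list)  # other forms found in publications

    @property
    def heading(self):
        # The qualifier is what makes a non-unique name unique.
        return f"{self.name}, {self.qualifier}" if self.qualifier else self.name


def build_index(records):
    """Map every known form of a name (lowercased) to the heading(s) it may refer to."""
    index = {}
    for record in records:
        for form in [record.name, record.heading, *record.variants]:
            index.setdefault(form.lower(), set()).add(record.heading)
    return index


# Illustrative sample data only.
records = [
    AuthorityRecord("Twain, Mark", "1835-1910", ["Clemens, Samuel Langhorne"]),
    AuthorityRecord("Smith, John", "1580-1631"),
    AuthorityRecord("Smith, John", "poet"),
]
index = build_index(records)

# A variant form leads to the single authorized heading.
twain_headings = index["clemens, samuel langhorne"]

# The bare name is ambiguous; only the qualified headings are unique.
smith_headings = sorted(index["smith, john"])
```

In this sketch a search on the variant form lands on exactly one heading, while the unqualified "Smith, John" matches two, which is the situation the added qualifier is meant to resolve.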

To accommodate both the FOAF (social) function and the libraries’ identification function, at the very least the libraries would need to define a sub-property of FOAF Person, one that has a more strict definition and usage. However, for the library “Person” to be designated as more specific than FOAF:Person does not require that these two be in the same vocabulary. That is one of the important features of Semantic Web properties: like any other resource, they can be linked and related to any other resources on the Web.
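As a rough illustration of linking across vocabularies, the Python sketch below states that a library-defined Person class is more specific than foaf:Person and then derives the broader class membership from that statement. Since foaf:Person is a class, the relationship is expressed here with `rdfs:subClassOf`. The library class URI and the URI for Darwin are hypothetical placeholders under `example.org`, the function name `classes_of` is invented for this example, and triples are modeled as plain tuples rather than with an RDF library.

```python
# Real, published URIs from the RDF, RDFS, and FOAF vocabularies.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"

# Hypothetical URIs, assumed purely for illustration.
LIB_PERSON = "http://example.org/library/Person"
DARWIN = "http://example.org/library/person/darwin"

# Each triple is (subject, predicate, object).
triples = {
    # The library vocabulary declares its class narrower than FOAF's,
    # without either class having to live in the other's vocabulary.
    (LIB_PERSON, RDFS_SUBCLASS_OF, FOAF_PERSON),
    # Library data describes Darwin using the library class only.
    (DARWIN, RDF_TYPE, LIB_PERSON),
}


def classes_of(resource, triples):
    """Return the resource's asserted classes plus all of their superclasses."""
    found = set()
    pending = [o for s, p, o in triples if s == resource and p == RDF_TYPE]
    while pending:
        cls = pending.pop()
        if cls in found:
            continue  # already handled; also guards against subclass cycles
        found.add(cls)
        pending.extend(
            o for s, p, o in triples if s == cls and p == RDFS_SUBCLASS_OF
        )
    return found


darwin_classes = classes_of(DARWIN, triples)
```

Here `darwin_classes` contains both the library class and foaf:Person, even though the library data never mentions FOAF directly; the single linking statement carries the relationship, and each vocabulary stays under the control of its own community.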

Why not combine the library and FOAF properties into a single metadata vocabulary? The answer has little to do with technology, but instead relates to the functioning of communities. Metadata standards need to be developed by (and for) actual communities. The FOAF and library communities clearly have different needs, different goals, and are working with fundamentally different use cases. They also are significantly different as communities.

FOAF is being developed by an informal group of developers, and is quite recent in origin. The group is small: the FOAF development email list has about 350 members. Another 350 individuals are listed on the FOAF wiki pages as having a FOAF profile available on the Web. This is obviously not the full extent of FOAF usage, but these numbers reflect the recent development of this kind of metadata.

The library community has hundreds of years of investment in the creation of metadata (even though it was not called that when libraries began to create it). There are at least tens of thousands of libraries in the world, many of which have been in existence for centuries. Library data has its origins in early 19th-century book catalogs but has been created in a machine-readable format since the late 1960s. Library data is created following formal rules governed in part by international agreements, and there are many hundreds of millions of machine-readable bibliographic records in existence that were created based on these library cataloging principles.

Libraries have engaged in widespread data sharing for centuries, and with the global networking capabilities of today libraries are actually able to exchange and re-use data on a huge scale. Libraries do not each create metadata for the same book or item, but instead share the metadata created by one library in cooperative efforts oriented towards resource sharing and efficiency.

This sharing is built into the very core of library data management. The ability to use data created by others is supported by standards and those standards form the basis for the library systems. While most users see only the library catalog available to the public, that is only one function of a system that supports purchasing, fund accounting, inventory control, circulation and patron management, and collection analysis. In the Western world these systems are not created and maintained by libraries but by a small number of specialized commercial vendors whose products are specifically created for the library customers using agreed library standards. Thus the very same system can be sold to hundreds or thousands of libraries, creating a viable market base for system development.

Many of the 70,000 libraries contributing to OCLC are using a single standard, MARC21, and others are following international standards such as ISBD, which produces standardized bibliographic description. The development of these standards is based on a large-scale community process with international participation. It is not a perfect process by any means, and clearly must be updated to meet modern needs and new technologies that have changed the way we work, but the degree of data sharing libraries depend on requires that a formal process be in place to support the standards of this community.

Sharing of data on a large scale is necessitated by the economic reality of the library sector. Libraries face increasingly shrinking budgets while coping with an upswing in demand for their services. Realistically, this means that changes to library data must be carefully coordinated in order to minimize disruption to the complex network of data sharing that makes cost-effective library services management, based on this data, possible. Libraries may appear to be mistrustful of change agents, and in some cases they certainly are, but there is a real need to minimize risk for the community as a whole in order to assure the health of these often financially fragile institutions.

So we come back to the question of libraries and FOAF. In the final analysis, we’re not at all sure that there’s much gain in trying to combine these two approaches, with the differences in their communities and functions. It could be like trying to combine oil and water, requiring compromises that in the end would be less than satisfactory for both communities. One could argue that the difference between the vocabularies and their contexts is a positive, allowing more than one view of the Person entity. As two separately maintained metadata vocabularies, anyone creating metadata can choose from either as needed without sacrificing precision. One can also imagine other views that will arise, such as Persons in medical data or financial data, which would each carry data elements that are neither in FOAF nor library data, from blood type to bank balance. The important thing is to make sure that these vocabularies are properly described and related to each other where possible. That way, each community can manage its own process based on its needs for standards integration, but data can be shared where appropriate.

We could begin with a more detailed discussion between the FOAF and the library communities about their metadata needs. With hundreds of years of experience in representing names in library catalogs, we feel confident that the library community’s knowledge could contribute in general to the use of personal names in the Semantic Web.


Erik Hetzner said...

These are all great points. There are definitely some things lost when using FOAF for library data.


danbri said...

Hi Karen! Thanks for writing this up.

It deserves a longer response. I just wanted to make a quick reply for now.

Firstly, a dull nitpick: 'sub-property of FOAF Person'. I guess you mean sub-class there, as Person is a class.

Secondly, 'use FOAF' is often ambiguous. FOAF standing solo is a very boring (and lumpy) RDF vocabulary. We originally called the FOAF project "RDFWeb" and the emphasis was always on linked and decentralised description, rather than FOAF's own 'utility vocabulary' carrying the sole burden of listing all the things you'd want to say about people, groups, organizations etc.

I certainly encourage the library community to create expressive RDF models that capture as much of their data as they care to, and don't want FOAF's existence to discourage anyone from doing that. And I am distressed whenever I see phrases like "x *should* be using FOAF".

All that said, the class foaf:Person covers Darwin and Dickens just as well as it covers Alice and Bob of some social network. "Covers" here does not mean "describes" exactly; it means that all those individuals are equally *people* and therefore members of the class FOAF calls "Person".

They're doubtless members of lots of other classes too, e.g. dct:Agent. This being true doesn't mean that FOAF's descriptive vocabulary alone is useful for describing those people. As you note, there are lots more things to say about people, especially creators, than are supplied in FOAF currently (or likely ever).

Despite those concerns I do think it is worthwhile to look at creating a FOAF view of bibliographic data: one that represents those famous dead folk as part of a great network of collaboration, cooperation and friendship that flows back from the beginnings of literature, through the early years of science and up until the present day of scientists and young researchers, writers, artists and the rest collaborating through the Web. In that setting, having different kinds of description for different people is only natural; but they are still a single network of people and I hope we'll find some value from representing them as such...

Excuse the messy brain dump, hope to follow this up sometime with more specifics! --Dan (FOAF spec co-editor/author)

Karen Coyle said...

Although FOAF person CAN be used for author persons, it does not define Person in the same way as libraries, so a sub-class (I can fix that :-)) would be needed. But even if libraries did use foaf:person, there is nothing else they could use without creating a sub-class/property or a new class/property.

You know, given that libraries have been doing this for a very, very long time, one could ask why foaf didn't take its definitions from libraries, since those definitions already existed, and libraries have some real expertise in dealing with names. If you do ask that question, then I think you will come to the same answer: it doesn't meet your needs.

But the real issue, the focus of the post, is that of community. It's not about the vocabulary or particular terms, but how one reaches agreement in order to create a shared vocabulary. If you have any sense, you'll not want to do it with libraries, unless you think trying to turn the Queen Mary on a dime sounds like fun. The size, breadth, and complexity of the library community requires that the standards process be aware of a wide range of institutional needs and situations. In the end, the library community may use foaf for some information, and that is a Good Thing, but if you look at FRAD (in RDF here) you see that the library's use of name information is far, far from foaf.

If you do want to explore the library approach to names, I recommend Chapter 9 of RDA. You can download a Zip of the full text from the Internet Archive. Also, there are some high level diagrams that Gordon Dunsire created of the FRs that include FRAD. It will help our discussion if the FOAF community could learn a bit about library data. To this end, there are two pages linked under "Resources" on the home page of the W3C Linked Library Data group.