Wednesday, October 09, 2013

Dublin Core usage in LOD

Thanks to some projects that gather statistics on the growth of linked data, we can find out various interesting things about the vocabularies being used and the degree of linking between data sets from different communities. The data I report here comes from LODstats via the Linked Open Vocabularies (LOV) project.

The LOV project looks particularly at the interrelations between vocabularies. It can show, for instance, which vocabularies use terms from other vocabularies. This crossover of terms is one of the things that makes links between datasets possible. For example, the LOV visualization for the geoSpecies vocabulary shows that it is not itself referenced by other vocabularies, but that it can link to other data through its use of vocabularies like FOAF and Dublin Core. You can watch the visualization grow here.
In contrast, this is what the Dublin Core terms vocabulary looks like at LOV:

With the animated visualization here.

Dublin Core does seem to have fulfilled its role as a core vocabulary that many different communities have found useful, at least in part. The set of terms often abbreviated as "dcterms" (or sometimes "dct"), whose namespace is http://purl.org/dc/terms/, has been used approximately 192 million times as reported in the LOD statistics. This counts only the usage in the 2289 linked data datasets covered by that project. The earlier set of Dublin Core terms, the original fifteen, whose namespace is http://purl.org/dc/elements/1.1/, has been used 24.2 million times. This gives us a total of approximately 216 million uses of Dublin Core in this particular count.
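To make the arithmetic explicit, here is a minimal sketch that adds the two rounded figures quoted above and works out each namespace's share of the total. The constant and function names are my own, and the inputs are the rounded numbers from this post, not exact LODstats counts.

```python
# Rounded usage counts quoted in the post (not exact LODstats values).
DCTERMS_USES = 192_000_000      # http://purl.org/dc/terms/
DC_ELEMENTS_USES = 24_200_000   # http://purl.org/dc/elements/1.1/


def usage_summary(dcterms_uses, elements_uses):
    """Return the combined total and each namespace's share of it."""
    total = dcterms_uses + elements_uses
    return {
        "total": total,
        "dcterms_share": dcterms_uses / total,
        "elements_share": elements_uses / total,
    }


summary = usage_summary(DCTERMS_USES, DC_ELEMENTS_USES)
print(f"total uses:     {summary['total']:,}")
print(f"dcterms share:  {summary['dcterms_share']:.1%}")
print(f"elements share: {summary['elements_share']:.1%}")
```

On these rounded inputs, the "dcterms" namespace accounts for roughly nine out of every ten uses.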

The interesting question, then, is which parts of DC are heavily used. I have a list of all terms in the http://purl.org/dc/ namespace, sorted from most to least used. The top fifteen terms are all from the "dcterms" namespace:

   count    term
24147876    subject
22575133    identifier
17120343    title
17065873    issued
14459601    publisher
11605978    language
 9930733    medium
 9795117    format
 9792064    BibliographicResource
 7700745    isPartOf
 7371553    creator
 7241777    contributor
 6590791    description
 6184994    type
 5983236    extent

Of this list, only five were not part of the original "Dublin Core 15" vocabulary: issued, medium, BibliographicResource, isPartOf, and extent. The remaining terms of that original vocabulary cluster together, beginning right after the last term in the above list. I believe this provides an interesting affirmation that the original fifteen terms were a fair definition of "core."
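The comparison with the original element set can be checked mechanically. The sketch below assumes that matching on the local term name (the part after the namespace) is enough; the DC15 set holds the fifteen element names from the Dublin Core Metadata Element Set 1.1, and the variable and function names are hypothetical.

```python
# The fifteen elements of the Dublin Core Metadata Element Set, version 1.1.
DC15 = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

# Top fifteen "dcterms" terms and their counts, copied from the table above.
TOP_FIFTEEN = [
    ("subject", 24147876), ("identifier", 22575133), ("title", 17120343),
    ("issued", 17065873), ("publisher", 14459601), ("language", 11605978),
    ("medium", 9930733), ("format", 9795117),
    ("BibliographicResource", 9792064), ("isPartOf", 7700745),
    ("creator", 7371553), ("contributor", 7241777),
    ("description", 6590791), ("type", 6184994), ("extent", 5983236),
]


def outside_original(terms, original=DC15):
    """Return the term names, in ranked order, that are not in `original`."""
    return [name for name, _count in terms if name not in original]


newer_terms = outside_original(TOP_FIFTEEN)
print(newer_terms)
```

The match is case-sensitive, so a class name such as BibliographicResource can never collide with a lowercase element name.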

However, these terms in the "dcterms" namespace each got fewer than ten uses, and some got none at all:

accrualPeriodicity
Frequency
AgentClass
dateSubmitted
isRequiredBy
Jurisdiction
LicenseDocument
LinguisticSystem
MediaType
MediaTypeOrExtent
PeriodOfTime
PhysicalResource
RightsStatement

The last term, RightsStatement, which got zero uses in the LOD calculations, is particularly interesting because the element "rights" in the original "DC 15" got 398,361 uses and is ranked 39th in the list of elements in the overall http://purl.org/dc/ namespace.

Next, I'll take a quick look at which datasets are contributing to the use of Dublin Core terms, and who is creating those datasets.


