Friday, October 25, 2013

Instant WayBack URL

Last night I attended festivities at the Internet Archive where they made a number of announcements about projects and improvements. One that particularly struck me was the ability to push a page to the WayBack Machine and instantly get a permanent WayBack URL for that page. This is significant in a number of ways but the main advantages I see are:
  1. putting permalinks in your documents rather than URLs that can break
  2. linking to a particular version of a document when citing
You will not want to use this technique if you are intending to link to, for example, a general home page where you want your link always to go to the current version of that page. But if you are quoting something, or linking to a page that you think has a limited lifetime, this ability will make a huge difference.

When you go to the WayBack machine (whose home page has changed considerably) you will see this option:

Once you provide the URL, the system echoes back the WayBack Machine URL for that page at that moment in time:

You can also view the page on the WayBack machine, to make sure you captured the right one:
The page is available through the URL immediately, and will be available through the regular WayBack machine index within hours. This has great implications for scholarship and for news reporting. Note that the WayBack Machine will not capture pages that are closed to crawlers, so if you are on a commercial site, this probably will not work. I'm still very enthused about it.
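
For anyone who wants to build these links in a script, here is a minimal sketch of how the two kinds of URL appear to be put together. It rests on my own assumptions rather than anything announced at the event: that a capture is requested through a `/save/` path, and that a permanent link has the form `/web/<14-digit UTC timestamp>/<original URL>`. The function names are hypothetical, and the code only builds strings; it does not contact the service.

```python
from datetime import datetime, timezone

# Assumed base address of the WayBack Machine; not stated in the post.
WAYBACK_BASE = "https://web.archive.org"


def save_request_url(page_url: str) -> str:
    """Build the URL that (by assumption) asks for a capture of page_url."""
    return f"{WAYBACK_BASE}/save/{page_url}"


def snapshot_url(page_url: str, captured_at: datetime) -> str:
    """Build a permanent link to the capture of page_url made at captured_at.

    captured_at must be timezone-aware; it is converted to UTC and
    formatted as YYYYMMDDhhmmss, the assumed timestamp layout.
    """
    if captured_at.tzinfo is None:
        raise ValueError("captured_at must be timezone-aware")
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_BASE}/web/{stamp}/{page_url}"


if __name__ == "__main__":
    when = datetime(2013, 10, 25, 12, 0, 0, tzinfo=timezone.utc)
    print(save_request_url("http://example.com/"))
    print(snapshot_url("http://example.com/", when))
```

The timestamp in the permanent link is what makes it suitable for citation: it pins the link to one version of the page rather than to whatever the page says today.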

Monday, October 14, 2013

Who uses Dublin Core? - the original 15

The original 15 Dublin Core elements are included in the Dublin Core Metadata Terms under the /dc/elements/1.1/ namespace. There is an "updated" version of each of the original terms in the /dc/terms/ namespace (dcterms). The difference is that the /dc/terms/ versions include formal domains and ranges, in conformance with linked data standards; the original 15 elements in the /dc/elements/1.1/ namespace have no domain or range constraints defined. This means that the original 15, often given the namespace prefix "dc:" or "dce:", are compatible with legacy uses of the Dublin Core elements.
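
To make the two "generations" concrete, here is a small sketch that expands prefixed names into full URIs. The namespace URIs are the ones I understand DCMI to publish; the prefix table and the `expand` function are my own illustration, not part of any Dublin Core tooling.

```python
# Prefix-to-namespace table. "dc" and "dce" are both common prefixes
# for the original fifteen elements; "dcterms" is the newer vocabulary.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dce": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def expand(prefixed_name: str) -> str:
    """Expand a name such as 'dcterms:title' into its full URI."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or not local:
        raise ValueError(f"not a prefixed name: {prefixed_name!r}")
    if prefix not in NAMESPACES:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return NAMESPACES[prefix] + local


if __name__ == "__main__":
    # The same local name yields two different properties.
    print(expand("dce:creator"))
    print(expand("dcterms:creator"))
```

The two results share a local name but are different properties: only the one in the dcterms namespace carries a formal domain or range.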

In the first post of this series, I showed that the most used terms are from the dcterms vocabulary, followed immediately by a cluster of terms from the dce namespace. In addition, the majority of the top dcterms are the linked data equivalents of the dce terms, thus confirming the "coreness" of the original Dublin Core 15.

From this explanation one might expect that the uses of dce in the wilds of linked data would be limited to legacy data. That does not, however, seem to be the case. Out of a total of 125 datasets from the Linked Open Vocabularies, nearly half (60) use both the linked data vocabulary (dcterms) and the dce terms. Of the top five datasets with the greatest number of uses of dce, only one, "Wikipedia 3," does not also use the dcterms.

  • Europeana Linked Open Data
  • Wikipedia 3
  • Linked Open Data Camera dei deputati
  • B3Kat - Library Union Catalogues of Bavaria, Berlin and Brandenburg
  • Yovisto - academic video search
There are reasons why datasets may use both "generations" of the Dublin Core vocabulary. One is that their data contains a mix of legacy metadata and linked data, either because the dataset has grown over time, or because the set combines data from different sources. Another is that there may be situations in which the dcterms use of domains and ranges is too restrictive for the needs of the data creators.

The LOV dataset of dce usage has over 24 million uses (compared to 192 million uses of dcterms). Library and bibliographic data is again by far the majority of the use, although it is rivaled by government data, in part because of the over 4 million uses contributed by the Italian Camera dei deputati, which also uses dcterms but to a lesser extent. In fact, government data is overall a strong contender in the dce space.

My overall conclusion from looking at this data is that Dublin Core is used widely for bibliographic and non-bibliographic data; that there is a new "core" based on usage that overlaps greatly with the old core; some dcterms elements are hardly used at all in these datasets; and finally that both the linked data dcterms and the legacy dce elements show themselves to be useful, even in the linked data environment.

Friday, October 11, 2013

Who uses Dublin Core - dcterms?

In my previous post I gave some data on Dublin Core field use. Today I look at who is using Dublin Core's dcterms vocabulary.

The LOV statistics show 212 datasets that use the dcterms vocabulary, along with the number of instances of usage. I did some "back of the envelope" counts on what types of organizations or projects use the terms, and also the type of use. By these calculations, the highest use was from libraries (> 60%). The next highest use was in a single language study called Semantic Quran (~10%). Third was the use in government data at less than 1%.

If one looks at the type of data, bibliographic data makes up nearly 90% of the usage. In this category I included archives, eprint repositories, and a few databases of videos and teaching materials.

From this one might conclude that dcterms isn't used much outside of the bibliographic world, but in fact traditional libraries provide only 28 of the 212 datasets on this list. The range of users and uses is impressive. Here are a few to pique your interest:
  • Southampton University has a number of datasets of civic information, including a list of bus stops.
  • There is a biomedical data service called eagle-i used by 24 universities or departments that provides information on specimens, reagents and services. This contributes nearly 500,000 instances of dcterms usage.
  • The New York Times linked data service uses dcterms. This service consists of topics (persons, organizations, locations, topics) covered by the newspaper.
  • I've mentioned the Semantic Quran. This is a linguistic database consisting of 43 translations of the Quran. It contributes over 6 million instances of dcterms use.
  • There is government data covering a wide range of topic areas. By my estimate there are at least 70 sets of government data in this compilation (including international), with everything from the aforementioned bus stops to election data, patents, economic indicators and scientific information.
If one is to draw conclusions from this evidence, it could be said that the dcterms vocabulary is a core vocabulary for the description of intellectual resources, such as the holdings of libraries and archives, but that it also provides functionality for a wide range of data types.

There are also users of the original Dublin Core vocabulary, now referred to as "1.1". I will cover that usage next.

Wednesday, October 09, 2013

Dublin Core usage in LOD

Thanks to some projects that gather statistics on the growth of linked data, we can find out various interesting things about the vocabularies being used and the degree of linking between data sets from different communities. The data I report here comes from LODstats via the Linked Open Vocabularies (LOV) project.

The LOV project looks particularly at the interrelations between vocabularies. For example, it can show which vocabularies use terms from other vocabularies. This crossover of terms is one of the things that makes links between datasets possible. For example, this shows that the geoSpecies vocabulary is not itself referenced by other vocabularies, but can link through its use of vocabularies like FOAF and Dublin Core. You can watch the visualization grow here.
In contrast, this is what Dublin Core terms looks like at LOV:

With the animated visualization here.

Dublin Core does seem to have fulfilled its role as a core vocabulary that many different communities have found useful, at least in part. The set of terms often abbreviated as "dcterms" (or sometimes "dct") has been used approximately 192 million times as reported in the LOD statistics. This is only the usage in the 2289 linked data datasets used by that project. The earlier set of Dublin Core terms, the original fifteen, has been used 24.2 million times. This gives us a total of 216 million uses of Dublin Core in this particular count.

The interesting question, then, is what parts of DC are heavily used? I have a sorted list, from most to least, of all terms in the namespace. The top fifteen terms are all from the "dcterms" namespace:

count          term
24147876    subject
22575133    identifier
17120343    title
17065873    issued
14459601    publisher
11605978    language
9930733    medium
9795117    format
9792064    BibliographicResource
7700745    isPartOf
7371553    creator
7241777    contributor
6590791    description
6184994    type
5983236    extent

Of this list, only five were not part of the original "Dublin Core 15" vocabulary: issued, medium, BibliographicResource, isPartOf, and extent. The terms of that original vocabulary cluster together beginning right after the last term in the above list. I believe this provides an interesting affirmation that the original fifteen terms were a fair definition of "core."
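
A tally like this can be checked mechanically by comparing the table against the fifteen element names. The sketch below assumes the element names as I understand them from the Dublin Core Metadata Element Set 1.1; the names `ORIGINAL_15`, `TOP_15`, and `split_by_origin` are mine. By this check, `extent` joins the terms that fall outside the original fifteen.

```python
# The fifteen element names of the original Dublin Core element set
# (as I understand the 1.1 specification).
ORIGINAL_15 = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

# The top fifteen dcterms terms and their counts, copied from the table.
TOP_15 = [
    ("subject", 24147876),
    ("identifier", 22575133),
    ("title", 17120343),
    ("issued", 17065873),
    ("publisher", 14459601),
    ("language", 11605978),
    ("medium", 9930733),
    ("format", 9795117),
    ("BibliographicResource", 9792064),
    ("isPartOf", 7700745),
    ("creator", 7371553),
    ("contributor", 7241777),
    ("description", 6590791),
    ("type", 6184994),
    ("extent", 5983236),
]


def split_by_origin(terms):
    """Split term names into (in the original 15, added later), keeping order."""
    in_core = [name for name, _ in terms if name in ORIGINAL_15]
    added = [name for name, _ in terms if name not in ORIGINAL_15]
    return in_core, added


if __name__ == "__main__":
    in_core, added = split_by_origin(TOP_15)
    print(len(in_core), "of the top fifteen are in the original element set")
    print("not in the original element set:", ", ".join(added))
```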

However, these terms in the "dcterms" namespace got fewer than ten uses each, and some got none at all:


The last term, which got zero in the LOD calculations, is particularly interesting because the element "rights" in the original "DC 15" got 398,361 uses, and is ranked 39th in the list of elements in the overall namespace.

Next, I'll take a quick look at which datasets are contributing to the use of Dublin Core terms, and who is creating those datasets.

Tuesday, October 08, 2013

Women in Science

Today's New York Times has an excellent article on women in science -- that is, of course, the lack of -- entitled Why are there still so few women in science? Coincidentally, this image hit the pages of Google+ in recent days even though it has been around since at least 2010:

The picture has set off a huge argument about religion and atheism on the post that carried it. There has been some less rancorous discussion of who really should or should not be in the picture. One woman posted that there are other women who should be there, but no one seems to have noticed the portrayal of the one woman who is there, Marie Curie.

Of course it is absolutely right that Marie Curie be portrayed - as one of the few people who have earned more than one Nobel Prize, and the only person to have been awarded Nobels in two different scientific areas, she is obviously qualified. However, the problem is the portrait, which is not of Marie Curie, but of Marie Curie as portrayed by Susan Marie Frontczak in "Manya," a one-woman drama on the life of Marie Curie. This is as if the portrait of Einstein had actually been that of the actor who played him in episodes of Alien Nation. There are plenty of photographs of Marie Curie, from her early days to her later years.

I think it is only fitting that Curie have her own identity, and not be given the image of an actress portraying her. So this is just a heads up for the Marie Curie fans among us: you can find plenty of photos with an image search, although you will have to color them yourself. That's hopefully not too much to ask as a way to honor such a significant scientist.

Tuesday, October 01, 2013

Cataloging as Observation*

"Last, there has been some spectacularly misguided and misinformed discussion of the need to create 'master records' for works that are manifested in different physical forms. It is hard for me to believe that this notion has been put about by people who are cataloguers. Let me spell it out. Descriptions are of physical objects (and, nowadays, of defined assemblages of electronic data). It is literally impossible to have a single description of two or more different physical objects…"

Michael Gorman, "AACR3? Not!" in: Schottlaender, Brian. The Future of the Descriptive Cataloging Rules: Papers from the ALCTS Preconference, AACR2000, American Library Association Annual Conference, Chicago, June 22, 1995. Chicago: American Library Association, 1998. p. 27

When I first read this aside in Michael Gorman's highly charged article in opposition to the cataloging rules that would succeed the AACR2 rules (which he edited), I was shocked that anyone would say that cataloging is primarily a "description of physical objects." I thought of library catalogs as being about content, about knowledge. But as Gorman surely has a finely honed grasp of the purposes behind library cataloging, it seemed best not to dismiss such a statement, and I marked it in my copy of the book and tucked it away in my memory.

It has come back to me not only as I've pondered RDA and the state of library catalogs, but also in my attempts to explain library cataloging to non-librarians. At the meeting regarding the question of bibliographic metadata and copyright, I described library cataloging as analogous to a medical diagnosis: a great deal of testing, expert knowledge, and judgment result in a few scribbled lines in a medical file and a prescription. If you consider these latter two the "metadata" of the situation, you see that what is visible is merely the tip of the iceberg, with a great deal of intellectual activity hidden below the surface. What I didn't mention at that time, because it wasn't yet clear to me, was the role of observation in the two activities being compared. A good physician knows how to observe and analyze a patient, and a good cataloger knows how to observe and analyze a cultural artifact.

Cataloging rules actually instruct their users on how to observe. In fact, the very first rule in AACR2 (1.0.A) defines the sources of information for the catalog entry: the preferred source of information is always the thing being described. In essence the thing itself is the primary informant for the catalog record.

There are a couple of important things we can conclude from this. The first is that the act of cataloging is an act of describing what is being observed. This makes cataloging something like the act of a biologist who is describing a specimen before her. In theory, if both librarians and biologists follow the rules of their disciplines, the same specimen or artifact would be described similarly by two different professionals. (In fact, there are always edge cases that defy simple application of the rules, but these are also the cases that make the professional activity interesting.)

The next important aspect about library cataloging is that the content of the catalog record is in large part the expression of those who created the artifact itself, not that of the cataloger. As RDA (chapter 2) says:
"The elements reflect the information typically used by the producers of resources to identify their products—title, statement of responsibility, edition statement, etc. "
Significant parts of the cataloging description are either quotes from the thing or paraphrases of observable content. I am unaware that anyone has ever challenged the right of catalogers to copy this information from the artifact to the catalog record.

What I have addressed to this point follows a fairly strict definition of "descriptive cataloging" and presumably not terribly far from the division between description and access that is made by RDA, or from the division of AACR2 into description and headings. The access or heading portion of the catalog record adds information beyond the observations of the physical piece. Headings and access points are standardized forms of proper names (including persons, corporations, government bodies, and some titles). The standardized form of the name serves as an identifier for the named entity, and also normalizes display. Here are a few examples:
On the artifact: J. R. R. Tolkien
Heading: Tolkien, J. R. R. (John Ronald Reuel), 1892-1973
On the artifact: Beethoven's Ninth Symphony
Heading: Symphonies, no. 9, op. 125, D minor
On the artifact: T. C. Boyle
Heading: Boyle, T. Coraghessan.
Note that some of these add more information than is on the actual piece, and that additional information requires research. That doesn't mean necessarily that the additional information requires the level of creativity that qualifies it for copyright protection. This is one of the areas of bibliographic metadata that needs to be analyzed further. However, I think we can conclude that some portion of "descriptive cataloging" consists primarily of observations about real world objects; and some portion normalizes those observations to create standard identifiers for bibliographic entities that exist entirely independently of the cataloging act.
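
The split between what is observed and what is normalized can be pictured as a small data structure. This is only a toy sketch: the `AccessPoint` class, the `AUTHORITY` table, and the `describe` function are hypothetical names of mine, the table holds just the three examples above, and real authority work involves research rather than a lookup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessPoint:
    transcribed: str        # the name as observed on the artifact
    heading: Optional[str]  # the standardized form, if one is established


# Toy authority table built from the three examples in this post.
AUTHORITY = {
    "J. R. R. Tolkien": "Tolkien, J. R. R. (John Ronald Reuel), 1892-1973",
    "Beethoven's Ninth Symphony": "Symphonies, no. 9, op. 125, D minor",
    "T. C. Boyle": "Boyle, T. Coraghessan.",
}


def describe(transcribed: str) -> AccessPoint:
    """Pair an observed name with its standardized heading, if known."""
    return AccessPoint(transcribed, AUTHORITY.get(transcribed))


if __name__ == "__main__":
    for name in ("J. R. R. Tolkien", "T. C. Boyle", "An Unknown Author"):
        point = describe(name)
        print(point.transcribed, "->", point.heading)
```

The `transcribed` field is pure observation. The `heading` field is where the added research shows up, and it stays empty when no standardized form has been established.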

You have undoubtedly noticed that I have not mentioned subject headings or classification in this post. Subject analysis, although recorded on the same catalog entries as the bibliographic data, is a separate activity in the cataloging workflow, and is not covered by the above-mentioned cataloging rules.  As a topic in the "metadata and copyright" discussion it should be covered separately.

* My thanks to Tom Baker who, as I struggled to find another way to say "bibliographic description," suggested that catalogers make observations about things.