Thursday, November 14, 2013

It's FAIR!

"In my view, Google Books provides significant public benefits. It advances the progress of the arts and sciences, while maintaining respectful consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders. It has become an invaluable research tool that permits students, teachers, librarians, and others to more efficiently identify and locate books. It has given scholars the ability, for the first time, to conduct full-text searches of tens of millions of books. It preserves books, in particular out-of-print and old books that have been forgotten in the bowels of libraries, and it gives them new life. It facilitates access to books for print-disabled and remote or underserved populations. It generates new audiences and creates new sources of income for authors and publishers. Indeed, all society benefits." p. 26
With that statement, Judge Denny Chin has ruled (PDF) that Google's digitization of books from libraries is a fair use.  And a very long saga ends.

Google was first brought to court in 2005 by the Authors Guild in a copyright infringement suit for its mass digitization of library holdings. Since then the matter has gone back to the court a number of times. Most significantly, Google, authors, and publishers developed two complex proposed settlements that were, however, so fraught with problems that the Department of Justice weighed in. Finally, the publishers bowed out and the original Authors Guild suit was revived. At that point, the question became: Is Google's digitization of books for the purposes of indexing (and showing snippets as search results) fair use?

Of course, much happened between 2005 and 2013. One important thing that happened was the development of HathiTrust, the digital repository where libraries can store the digital copies that they received from Google of their own books. The same Authors Guild sued HathiTrust for copyright infringement, but Judge Baer in that case decided for fair use.

I cannot over-emphasize either the role of libraries in this case or the support that both judges expressed for libraries and for their promotion of "progress and the useful arts." Chin refers frequently to the amicus brief (PDF) presented by the American Library Association, as well as the conclusions in the HathiTrust case. Both judges clearly admire the mission of libraries, and it seems clear to me that the educational use of the materials by libraries was seen to offset the for-profit use by Google. In fact, Judge Chin reverses the roles of Google and the libraries when he says:
"Google provides the libraries with the technological means to make digital copies of books that they already own. The purpose of the library copies is to advance the libraries' lawful uses of the digitized books consistent with the copyright law." p. 26
In those terms, Google has simply helped libraries do what they do, better. Google's digitization of the library books is thus a public service.
"Google Books helps to preserve books and give them new life. Older books, many of which are out-of-print books that are falling apart buried in library stacks, are being scanned and saved." p. 12
Note that Google and the libraries (in HathiTrust) are exceedingly careful to stay within the letter of the law. Google's snippet display algorithm is rococo in design, making it effectively impossible to reconstruct a book from the snippets it displays. So much so that it would probably take less time to re-scan the book at home on your page-at-a-time desktop scanner.

The full impact of this ruling is impossible (for me) to predict, but there are many among us who are breathing a great sigh of relief today. This opens the door for us to rethink digital scholarship based on materials produced before information was in digital form. 

I do have a wishlist, however, and at the top of that is for us to turn our attention to making the digitized texts even more useful by turning that uncorrected OCR into a more faithful reproduction of the original book. While large-scale linguistic studies may be valid in spite of a small percentage of errors, the use of the digitized materials for reading, in the case of those works in the public domain, and for listening, in the case of works made available to VIPs (visually impaired persons), is greatly hampered by the number and kinds of errors that result. In a future post I will give the results of a short study that I have done in that area.

See all my posts on Google Books

Friday, October 25, 2013

Instant WayBack URL

Last night I attended festivities at the Internet Archive where they made a number of announcements about projects and improvements. One that particularly struck me was the ability to push a page to the WayBack Machine and instantly get a permanent WayBack URL for that page. This is significant in a number of ways but the main advantages I see are:
  1. putting permalinks in your documents rather than URLs that can break
  2. linking to a particular version of a document when citing
You will not want to use this technique if you are intending to link to, for example, a general home page where you want your link always to go to the current version of that page. But if you are quoting something, or linking to a page that you think has a limited lifetime, this ability will make a huge difference.
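The flow described above can be sketched in a few lines. This is a minimal illustration, assuming the Internet Archive's public web.archive.org/save/ endpoint and the familiar timestamped permalink format; the function names and the example URL are my own, for demonstration only.

```python
# Sketch of the "push a page to the WayBack Machine" flow.
# Assumes the public https://web.archive.org/save/ endpoint and the
# timestamped permalink format https://web.archive.org/web/<ts>/<url>.

ARCHIVE_SAVE = "https://web.archive.org/save/"
ARCHIVE_WEB = "https://web.archive.org/web/"

def save_url(page_url):
    """Build the URL that asks the WayBack Machine to capture a page now."""
    return ARCHIVE_SAVE + page_url

def wayback_url(page_url, timestamp):
    """Build the permanent WayBack URL echoed back after a capture.

    timestamp is a 14-digit string, YYYYMMDDhhmmss, as used in
    WayBack permalinks."""
    return ARCHIVE_WEB + timestamp + "/" + page_url

# Example: capturing a (hypothetical) page and citing that capture.
page = "http://example.com/article.html"
print(save_url(page))
print(wayback_url(page, "20131025120000"))
```

Fetching the first URL triggers the capture; the second is the permanent, citable form you would put in a document.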

When you go to the WayBack Machine (whose home page has changed considerably) you will see this option:

Once you provide the URL, the system echoes back the WayBack Machine URL for that page at that moment in time:

You can also view the page on the WayBack Machine, to make sure you captured the right one:
The page is available through the URL immediately, and will be available through the regular WayBack Machine index within hours. This has great implications for scholarship and for news reporting. Note that the WayBack Machine will not capture pages that are closed to crawlers, so if you are on a commercial site, this probably will not work. I'm still very enthused about it.

Monday, October 14, 2013

Who uses Dublin Core? - the original 15

The original 15 Dublin Core elements are included in the Dublin Core Metadata Terms using the namespace http://purl.org/dc/elements/1.1/. There is an "updated" version of each of the original terms in the namespace http://purl.org/dc/terms/ (dcterms). The difference is that /dc/terms/ includes formal domains and ranges, in conformance with linked data standards; the original 15 elements in the /dc/elements/1.1/ namespace have no domain or range constraints defined. This means that the original 15, often given the namespace prefix of "dc:" or "dce:", are compatible with legacy uses of the Dublin Core elements.
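To make the namespace distinction concrete: the same element name, such as "title", yields two different term URIs depending on which vocabulary is meant. A small sketch; the helper function is my own, for illustration.

```python
# The two Dublin Core namespaces discussed above. A term like "title"
# exists in both; only the namespace (and thus the formal domain/range
# machinery) differs.
DCE = "http://purl.org/dc/elements/1.1/"   # original 15, no domains/ranges
DCTERMS = "http://purl.org/dc/terms/"      # later vocabulary, with domains/ranges

def term_uri(namespace, local_name):
    """Full URI for a Dublin Core term in a given namespace."""
    return namespace + local_name

legacy_title = term_uri(DCE, "title")          # the "dc:" / "dce:" form
linked_data_title = term_uri(DCTERMS, "title")  # the "dcterms:" form

print(legacy_title)        # http://purl.org/dc/elements/1.1/title
print(linked_data_title)   # http://purl.org/dc/terms/title
```

A dataset that mixes legacy and linked-data metadata will contain triples using both URIs, which is exactly what the LOV counts below reflect.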

In the first post of this series, I showed that the most used terms are from the dcterms vocabulary, followed immediately by a cluster of terms from the dce namespace. In addition, the majority of the top dcterms are the linked data equivalents of the dce terms, thus confirming the "coreness" of the original Dublin Core 15.

From this explanation one might expect that the uses of dce in the wild in linked data would be limited to legacy data. That does not, however, seem to be the case. Out of a total of 125 datasets from the Linked Open Vocabularies, nearly half (60) use both the linked data vocabulary (dcterms) and the dce terms. Of the top five datasets with the greatest number of uses of dce, only one, "Wikipedia 3," does not also use the dcterms.

  • Europeana Linked Open Data
  • Wikipedia 3
  • Linked Open Data Camera dei deputati
  • B3Kat - Library Union Catalogues of Bavaria, Berlin and Brandenburg
  • Yovisto - academic video search
There are reasons why datasets may use both "generations" of the Dublin Core vocabulary. One is that their data contains a mix of legacy metadata and linked data, either because the dataset has grown over time, or because the set combines data from different sources. Another is that there may be situations in which the dcterms use of domains and ranges is too restrictive for the needs of the data creators.

The LOV dataset of dce usage has over 24 million uses (compared to 192 million uses of dcterms). Library and bibliographic data is again by far the majority of the use, although it is rivaled by government data, in part because of the over 4 million uses contributed by the Italian Camera dei deputati, which also uses dcterms but to a lesser extent. In fact, government data is overall a strong contender in the dce space.

My overall conclusions from looking at this data are that Dublin Core is used widely for bibliographic and non-bibliographic data; that there is a new "core" based on usage that overlaps greatly with the old core; that some dcterms elements are hardly used at all in these datasets; and finally that both the linked data dcterms and the legacy dce elements show themselves to be useful, even in the linked data environment.



Friday, October 11, 2013

Who uses Dublin Core - dcterms?

In my previous post I gave some data on Dublin Core field use. Today I look at who is using Dublin Core's dcterms vocabulary.

The LOV statistics show 212 datasets that use the vocabulary at http://purl.org/dc/terms/, and the number of instances of usage. I did some "back of the envelope" counts on what types of organizations or projects use the terms, and also the type of use. By these calculations, the highest use was from libraries (> 60%). The next highest use was in a single language study called Semantic Quran (~10%). Third was the use in government data at less than 1%.

If one looks at the type of data, bibliographic data makes up nearly 90% of the usage. In this category I included archives, eprint repositories, and a few databases of videos and teaching materials.

From this one might conclude that dcterms isn't used much outside of the bibliographic world, but in fact traditional libraries provide only 28 of the 212 datasets on this list. The range of users and uses is impressive. Here are a few to pique your interest:
  • Southampton University has a number of datasets of civic information, including a list of bus stops.
  • There is a biomedical data service called eagle-i used by 24 universities or departments that provides information on specimens, reagents and services. This contributes nearly 500,000 instances of dcterms usage.
  • The New York Times linked data service uses dcterms. This service consists of topics (persons, organizations, locations, topics) covered by the newspaper.
  • I've mentioned the Semantic Quran. This is a linguistic database consisting of 43 translations of the Quran. It contributes over 6 million instances of dcterms use.
  • There is government data covering a wide range of topic areas. By my estimate there are at least 70 sets of government data in this compilation (including international), with everything from the aforementioned bus stops to election data, patents, economic indicators and scientific information.
If one is to make conclusions from this evidence, it could be said that the dcterms vocabulary is a core vocabulary for the description of intellectual resources, such as the holdings of libraries and archives, but that it also provides functionality for a wide range of data types. 

There are also users of the original Dublin Core vocabulary, now referred to as "1.1". I will cover that usage next.

Wednesday, October 09, 2013

Dublin Core usage in LOD

Thanks to some projects that gather statistics on the growth of linked data, we can find out various interesting things about the vocabularies being used and the degree of linking between data sets from different communities. The data I report here comes from LODstats via the Linked Open Vocabularies (LOV) project.

The LOV project looks particularly at the interrelations between vocabularies. For example, it can show which vocabularies use terms from other vocabularies. This crossover of terms is one of the things that makes links between datasets possible. For instance, the visualization shows that the geoSpecies vocabulary is not itself referenced by other vocabularies, but can link through its use of vocabularies like FOAF and Dublin Core. You can watch the visualization grow here.
In contrast, this is what Dublin Core terms looks like at LOV:

With the animated visualization here.

Dublin Core does seem to have fulfilled its role as a core vocabulary that many different communities have found useful, at least in part. The set of terms often abbreviated as "dcterms" (or sometimes "dct") and whose namespace is http://purl.org/dc/terms/ has been used approximately 192 million times as reported in the LOD statistics. This is only the usage in the 2289 linked data datasets used by that project. The earlier set of Dublin Core terms, the original fifteen terms, whose namespace is http://purl.org/dc/elements/1.1/, has been used 24.2 million times. This gives us a total of 216 million uses of Dublin Core in this particular count.

The interesting question, then, is what parts of DC are heavily used? I have a sorted list, from most to least, of all terms in the http://purl.org/dc/ namespace. The top fifteen terms are all from the "dcterms" namespace:

count          term
24147876    subject
22575133    identifier
17120343    title
17065873    issued
14459601    publisher
11605978    language
9930733    medium
9795117    format
9792064    BibliographicResource
7700745    isPartOf
7371553    creator
7241777    contributor
6590791    description
6184994    type
5983236    extent

Of this list, only five were not part of the original "Dublin Core 15" vocabulary: issued, medium, BibliographicResource, isPartOf, and extent. The remaining terms of that original vocabulary cluster together beginning right after the last term in the above list. I believe this provides an interesting affirmation that the original fifteen terms were a fair definition of "core."
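The membership claim above can be checked mechanically. A quick sketch: the counts and term names come from the list in this post, and the fifteen element names are the standard original Dublin Core set.

```python
# Which of the fifteen most-used dcterms (from the list above) are not
# among the original Dublin Core 15 elements?
ORIGINAL_DC15 = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

# Top fifteen dcterms from the LOD counts, in descending order of use.
TOP_DCTERMS = [
    "subject", "identifier", "title", "issued", "publisher", "language",
    "medium", "format", "BibliographicResource", "isPartOf", "creator",
    "contributor", "description", "type", "extent",
]

not_in_original = [t for t in TOP_DCTERMS if t not in ORIGINAL_DC15]
print(not_in_original)
# ['issued', 'medium', 'BibliographicResource', 'isPartOf', 'extent']
```

Ten of the fifteen most-used terms, then, are the linked-data versions of original DC-15 elements.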

However, these terms in the "dcterms" namespace got fewer than ten uses each, and some got none at all:

accrualPeriodicity
Frequency
AgentClass
dateSubmitted
isRequiredBy
Jurisdiction
LicenseDocument
LinguisticSystem
MediaType
MediaTypeOrExtent
PeriodOfTime
PhysicalResource
RightsStatement

The last term in that list, RightsStatement, which got zero in the LOD calculations, is particularly interesting because the element "rights" in the original "DC 15" got 398,361 uses, and is ranked 39th in the list of elements in the overall http://purl.org/dc namespace.

Next, I'll take a quick look at which datasets are contributing to the use of Dublin Core terms, and who is creating those datasets.



Tuesday, October 08, 2013

Women in Science

Today's New York Times has an excellent article on women in science -- that is, of course, the lack of -- entitled Why are there still so few women in science? Coincidentally, this image hit the pages of Google+ in recent days even though it has been around since at least 2010:


The picture has set off a huge argument about religion and atheism on the post that carried it. There has been some less rancorous discussion of who really should or should not be in the picture. One woman posted that there are other women who should be there, but no one seems to have noticed the portrayal of the one woman who is there, Marie Curie.

Of course it is absolutely right that Marie Curie be portrayed - as one of the few people who have earned more than one Nobel Prize, and the only person to have been awarded Nobels in two different scientific areas, she is obviously qualified. However, the problem is the portrait, which is not of Marie Curie, but of Marie Curie as portrayed by Susan Marie Frontczak in "Manya," a one-woman drama on the life of Marie Curie. This is as if the portrait of Einstein had actually been that of the actor who played him in episodes of Alien Nation. There are plenty of photographs of Marie Curie, from her early days to her later years.

I think it is only fitting that Curie have her own identity, and not be given the image of an actress portraying her.  So this is just a heads up for the Marie Curie fans among us: you can find plenty of photos with an image search, although you will have to color them yourself. That's hopefully not too much to ask as a way to honor such a significant scientist.

Tuesday, October 01, 2013

Cataloging as Observation*

"Last, there has been some spectacularly misguided and misinformed discussion of the need to create 'master records' for works that are manifested in different physical forms. It is hard for me to believe that this notion has been put about by people who are cataloguers. Let me spell it out. Descriptions are of physical objects (and, nowadays, of defined assemblages of electronic data). It is literally impossible to have a single description of two or more different physical objects…"

Michael Gorman, AACR3? Not! in: Schottlaender, Brian. The Future of the Descriptive Cataloging Rules: Papers from the Alcts Preconference, Aacr2000, American Library Association Annual Conference, Chicago, June 22, 1995. Chicago: American Library Association, 1998. p. 27
When I first read this aside in Michael Gorman's highly charged article in opposition to the cataloging rules that would succeed the AACR2 rules (that he edited), I was shocked that anyone would say that cataloging is primarily a "description of physical objects." I thought of library catalogs as being about content, about knowledge. But as Gorman surely has a finely honed grasp of the purposes behind library cataloging, it seemed best not to dismiss such a statement, and I marked it in my copy of the book and tucked it away in my memory.

It has come back to me as I've pondered not only RDA and the state of library catalogs, but also as I've attempted to explain library cataloging to non-librarians. At the meeting regarding the question of bibliographic metadata and copyright, I described library cataloging as analogous to a medical diagnosis: a great deal of testing, expert knowledge, and judgment result in a few scribbled lines in a medical file and a prescription. If you consider these latter two the "metadata" of the situation, you see that what is visible is merely the tip of the iceberg, with a great deal of intellectual activity hidden below the surface. What I didn't mention at that time, because it wasn't yet clear to me, was the role of observation in the two activities being compared. A good physician knows how to observe and analyze a patient, and a good cataloger knows how to observe and analyze a cultural artifact.

Cataloging rules actually instruct their users on how to observe. In fact, the very first rule in AACR2 (1.0.A) defines the sources of information for the catalog entry: the preferred source of information is always the thing being described. In essence the thing itself is the primary informant for the catalog record.

There are a couple of important things we can conclude from this. The first is that the act of cataloging is an act of describing what is being observed. This makes cataloging something like the act of a biologist who is describing a specimen before her. In theory, if both librarians and biologists follow the rules of their disciplines, the same specimen or artifact would be described similarly by two different professionals. (In fact, there are always edge cases that defy simple application of the rules, but these are also the cases that make the professional activity interesting.)

The next important aspect about library cataloging is that the content of the catalog record is in large part the expression of those who created the artifact itself, not that of the cataloger. As RDA (chapter 2) says:
"The elements reflect the information typically used by the producers of resources to identify their products—title, statement of responsibility, edition statement, etc. "
Significant parts of the cataloging description are either quotes from the thing or paraphrases of observable content. I am unaware that anyone has ever challenged the right of catalogers to copy this information from the artifact to the catalog record.

What I have addressed to this point follows a fairly strict definition of "descriptive cataloging" and presumably not terribly far from the division between description and access that is made by RDA, or from the division of AACR2 into description and headings. The access or heading portion of the catalog record adds information beyond the observations of the physical piece. Headings and access points are standardized forms of proper names (including persons, corporations, government bodies, and some titles). The standardized form of the name serves as an identifier for the named entity, and also normalizes display. Here are a few examples:
On the artifact: J. R. R. Tolkien
Heading: Tolkien, J. R. R. (John Ronald Reuel), 1892-1973
On the artifact: Beethoven's Ninth Symphony
Heading: Symphonies, no. 9, op. 125, D minor
On the artifact: T. C. Boyle
Heading: Boyle, T. Coraghessan.
Note that some of these add more information than is on the actual piece, and that additional information requires research. That doesn't mean necessarily that the additional information requires the level of creativity that qualifies it for copyright protection. This is one of the areas of bibliographic metadata that needs to be analyzed further. However, I think we can conclude that some portion of "descriptive cataloging" consists primarily of observations about real world objects; and some portion normalizes those observations to create standard identifiers for bibliographic entities that exist entirely independently of the cataloging act.

You have undoubtedly noticed that I have not mentioned subject headings or classification in this post. Subject analysis, although recorded on the same catalog entries as the bibliographic data, is a separate activity in the cataloging workflow, and is not covered by the above-mentioned cataloging rules.  As a topic in the "metadata and copyright" discussion it should be covered separately.



* My thanks to Tom Baker who, as I struggled to find another way to say "bibliographic description" suggested that catalogers make observations about things.

Tuesday, September 24, 2013

Hopes and fears for Google Books case

We're back in the saddle of the now epic lawsuit against Google for its massive scanning of the books held by libraries. I have very mixed feelings about the case and its outcomes, and the news reports from yesterday's hearing (transcript) in Judge Denny Chin's court are not making me feel any better about it. In brief, the Authors Guild is claiming that Google infringed copyright by scanning in-copyright books. Since that act alone is not sufficient to overcome a defense of fair use, they also state (correctly, in my view) that although Google is not providing advertising on the individual book pages, it overall makes money off of the scanned books because that digital corpus reinforces its position against other search engines. There are some things that the Authors Guild has right (such as, that Google makes money off of search results pages that can include links to Google Books), but they miss the mark in other arguments:
"For all intents and purposes, it paid libraries for the right to digitize and copy much of our nation’s literary heritage and then used the resulting digital library to gain a competitive advantage over search engine competitors that respected the rights of authors by limiting their digitization programs to books that were either licensed or were no longer protected by copyright. Aided by its infringing conduct, Google’s search engine has proven remarkably successful—to the point where “google” has become a widely used verb in the English language.
First, the addition of Google Books to Google's search results took place long after we were all "googling." Google's main value still comes from providing access to open web resources that otherwise would just be a massive digital junk heap. I suspect that those who are interested in using Google to search within the text of "closed" books (ones that are not available as full text online) consciously go to the Google Book Search pages. I don't know this for a fact, but I'd be willing to bet that the user intent behind most Google searches is to access the actual content of a web page or document, not to be given a reference to an off-line resource.

Next is the statement that Google "paid libraries for the right to digitize..." This makes it sound like Google gave the libraries money, and that there was no cost to the libraries. The agreement between Google and libraries was an exchange that had costs for both (less for the libraries, more for Google) and benefits for both (less for the libraries, more for Google). In the end, Google got the better part of the deal, but libraries got something, albeit something from which they have not yet been able to benefit greatly: libraries got copies of the scans at a lower price than if they had done the digitization themselves. Unfortunately, due to both copyright issues and the nature of the agreement between Google and the libraries, there are significant barriers to making the kinds of uses that would make this a truly transformative corpus for research.

All of the news reports emphasized some comments by Judge Chin to the effect that Google Books appears to be both transformative (in the copyright law sense) and a benefit to society. What worries me a bit is that Judge Chin is not looking beyond the use of the resulting digital texts for search. I consider search to be the tip of the iceberg, and the visible part of Google Books that Google would like everyone to focus on. My assumption is that Google has a research interest in having exclusive access to 20 million non-Web digital texts in a myriad of languages, and that this research is aimed not only at search but at Google's desire to be THE interface between man and machine, which means that machines have to get better at human languages.

If Judge Chin rules that Google's book digitization is fair use, it's a huge win, not only for Google but also for libraries. After all, if it is fair use for Google to digitize works for the purposes of searching, there is no question that it is also fair use for libraries to do the same. If Judge Chin rules that Google's book digitization is NOT fair use because of profit-making, then we still do not know for sure whether library digitization would be considered fair use (although much would depend on exactly how the decision is worded). This of course makes me want to cheer on Chin toward the "is fair use" decision, but at the same time I know that this means that any research that Google is doing on its private cache of digital texts will continue, giving them great advantages over competitors in the arms race of technology advancement.

Once again, I so wish that large-scale digitization for search and research had been undertaken by libraries, not Google. The questions of "not for profit" and social value would be a slam-dunk, and I'd not be harboring this fear that there is a hidden agenda behind the project. Maybe if libraries had done this we'd only have two or three million digitized books, not 20 million (as is claimed for Google), but they'd be untainted, in my mind, and I could still consider them a cultural heritage resource rather than a commercial product.

Sunday, September 22, 2013

Copyright, Metadata, and Attribution

The Berkeley Center for Law and Technology (BCLT) has done some interesting research on copyright, including a white paper that details the issues of performing "due diligence" in a determination of orphan works.

Recently I attended a small meeting co-sponsored by BCLT and the DPLA to begin a discussion of the issues around copyright in metadata, with a particular emphasis on bibliographic metadata. Much of the motivation for this is the uncertainty in the library and archival community about whether they can freely share their metadata. As long as this question remains unanswered, there are barriers to the free flow of data from and between cultural heritage institutions.

At the conclusion of the meeting it was clear that it will take some research to fully define the problem space. Fortunately for all of us, BCLT may be able to devote resources to undertake such a study, similar to what they have done around orphan works.

One of the first questions to undertake is whether bibliographic metadata is copyrightable in the first place. If not, then no further steps need to be taken -- not even putting a CC0 license on the data. In fact, some knowledgeable folks worry that using CC0 implies that there do exist intellectual property rights that must be addressed.

However, before you can attempt to determine if bibliographic metadata can be argued to be a set of facts which, under US copyright law, do not enjoy protection, you must be able to define "bibliographic metadata." During the meeting we did not attempt to create such a definition, but discussion ranged from "anything about a resource" to a specific set of descriptive elements. As there were representatives of archives in the room, we also talked about some of the implications of describing unpublished materials, which have a different legal standing but also provide less self-identification than resources that have been published. Drawing the line between fact and embellishment in bibliographic metadata is not going to be easy. Nor will it be easy to determine the level of creativity of the data, a necessary part of the analysis under US law. Note that other types of metadata were also discussed, such as rights metadata and preservation metadata, as well as a recognition that the exchange of metadata will of course cross national boundaries. Any study will have to determine where it will draw the "metadata" line, and also whether one can address the question with an international scope.

Another complexity is that bibliographic data is already "crowd-sourced" in a sense. For any given bibliographic record,  different contributions have been made by different librarians from different institutions and at different times. This recognition makes it hard to ascribe intellectual ownership to any one party. And while library catalog data may be considered to be factual, it is much more than a simple rendering of facts, as the complexity of the cataloging rules attests. I likened library cataloging to a medical diagnosis: the end result (some scribbles in a file and perhaps a prescription given to the patient) does not reveal all of the knowledge and judgment that went into the decision. Metadata is the tip of an iceberg. That may not change its legal status, but I think that unless you have delved into the intricacies of cataloging it is hard to appreciate all that goes into the fairly simple display that users see on the screen.

The legal question is difficult, and to me it isn't entirely clear that solving the question on the legality of bibliographic data exchange will be sufficient to break the logjam. In a sense, projects like DPLA and Europeana, both of which have declared their metadata to be available with a CC0 license, might have more real impact than a determination based in law. Significant discussion at the meeting was about the need for attribution on the part of cultural heritage institutions. Like academics, the reputation and standing of such institutions depends on their getting recognition for their work. Releasing metadata (including thumbnails in the case of visual materials) needs to increase the visibility of those institutions, and to raise public awareness of the value of their collections. It is possible that solving the attribution problem could essentially dissolve the barriers to metadata sharing, since the gain to the institutions would be obvious.

Perhaps my one unique contribution to the group discussion was this:

We all know the © symbol and what it means. What we need now is an equally concise and recognizable symbol for attribution. Something like "(@)The Bancroft Library" or "(@)Dr. Seuss Collection". This would shorten attribution statements but also make them quickly recognizable, and a statement could also be a link to the appropriate web page. Standardizing attribution in this way should make adding attributions easier, and would demonstrate a culture of "giving credit where credit is due." The symbol needs to be simple, and should be easy to understand. It's time to comb through the Unicode charts for just the right one. Any suggestions?
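Combing the charts can even be automated: Python's standard unicodedata module lets you search characters by name. A small sketch (the scanned range is just an illustrative slice of the symbol blocks):

```python
import unicodedata

# Print every assigned character in a slice of the symbol blocks whose
# Unicode name mentions FLAG -- candidates for an attribution mark.
candidates = [
    (hex(cp), unicodedata.name(chr(cp)))
    for cp in range(0x1F300, 0x1F700)
    if "FLAG" in unicodedata.name(chr(cp), "")
]
for cp, name in candidates:
    print(cp, name)  # e.g. 0x1f6a9 TRIANGULAR FLAG ON POST
```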

See Also:


Unicode 1F6A9 - Triangular flag meaning "location"

Friday, August 09, 2013

Green paper on copyright, II


"To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries" (US Constitution)
In 1993 the US government issued a green paper on copyright and the Internet. At the time the latter was being referred to as the "National Information Infrastructure" or NII. That green paper led to a white paper, Intellectual Property and the National Information Infrastructure. The conclusions in that paper led to government efforts like the DMCA, as well as the still-unresolved questions about the library exceptions in section 108 of the copyright law.

Twenty years later we have another green paper on copyright and the Internet. It is, as are nearly all government documents, long and complex, and I hope to find time to do a comparison of the two papers to see if we have progressed in this area. But from a first reading I believe I can say that there is something that the two papers have in common: who they consider to be a creator and a rights holder.

It won't surprise you to learn that the emphasis in papers I and II is on the commercial content production communities: books, movies, music, newspapers.   As quoted in the press release,
“We see a digital future in which the relationship among digital technology, the Internet, and creative industries becomes increasingly symbiotic,” said Assistant Secretary of Commerce for Communications and Information and NTIA Administrator Lawrence E. Strickling. “In this digital future, the rights of creators and copyright owners are appropriately protected; creative industries continue to make their substantial contributions to the nation’s economic competitiveness; online service providers continue to expand the variety and quality of their offerings; technological innovation continues to thrive; and consumers have access to the broadest possible range of creative content.”
Oddly, he seems to be describing the Internet that I interact with today, but calls it the digital future. Nowhere, however, is there a mention of the rights of creators of Facebook pages, Google+ entries, Youtube videos, or tweets. A distinction is often made between creators and consumers, yet through social media and Internet publication methods (like blogging), that division is less valid than it was in the past.

Nor does the report admit that there is a vast amount of content that is user-supplied and highly consumed, like Youtube, and that this content exceeds, both in publication and in consumption, the commercial offerings that the authors of the report are so concerned about. Hearings held to discuss the first green paper had industry representatives saying that the Internet would not be a "success" if commercial and entertainment content was not available, as if the Internet at the time (1994) were just an empty shell waiting for Time-Warner and Disney and Thomson-Reuters to come along and fill it up. Bruce Lehman, chairman of the committee producing the green and white papers, said to Congress in 1995:
 Creators, publishers and distributors of works will be wary of the electronic marketplace unless the law provides them the tools to protect their property against unauthorized use. Thus, the full potential of the NII will not be realized if the education, information and entertainment products protected by intellectual property laws are not protected effectively when disseminated via the NII.
In case you weren't there at the time, the Internet in 1994 was a thriving community with huge amounts of content. It was less flashy than today's Internet, since the technology did not yet allow for the efficient streaming of video and sound, and downloading a photograph could take a while. But it was not languishing for lack of content. This goes many-fold for the Internet today, but the representatives of the commercial content industries are somehow oblivious to any content that isn't making money for them. And that includes every Youtube upload, every Facebook page, bazillions of Flickr photographs, countless tweets.

Here are some stats from Youtube and Hulu, taken from their sites. Admittedly, Youtube is not 100% non-commercial and Hulu is not 100% pay-per-view, but this still shows that Youtube and its crowd-sourced content should be counted as real content in the sense of the green paper on copyright. But it isn't.

Hulu
 Number of Hulu video views in the past year     457 million
Total number of people who watched Hulu at least once in the past year     38 million
Average number of videos a Hulu watcher views     12
Average length of time a person spends on Hulu     1hr 13min
Number of devices in use that Hulu is available on     120 million
Percentage of people who use Hulu and only watch television shows     73 %
Percent of people who use Hulu to watch movies     9 %
Percent of videos viewed online that Hulu makes up     4 %
Percent growth for Hulu from 2010 to 2011     60 %
Total revenue made in 2011 by Hulu     $420 million
Youtube
More than 1 billion unique users visit YouTube each month
Over 6 billion hours of video are watched each month on YouTube—that's almost an hour for every person on Earth, and 50% more than last year
100 hours of video are uploaded to YouTube every minute
70% of YouTube traffic comes from outside the US
YouTube is localized in 56 countries and across 61 languages
According to Nielsen, YouTube reaches more US adults ages 18-34 than any cable network
So Youtube draws a billion unique visitors every month, against Hulu's 457 million views in a year. You'd think data like that would have an impact on the thinking of these folks. And the fact that Youtube reaches more viewers in the key demographic than cable TV should also shake up some media executives. In addition, those who upload content to Youtube are rights holders, as the Youtube terms of use make clear:
For clarity, you retain all of your ownership rights in your Content. However, by submitting Content to YouTube, you hereby grant YouTube a worldwide, non-exclusive, royalty-free, sublicenseable and transferable license to use, reproduce, distribute, prepare derivative works of, display, and perform the Content in connection with the Service
You would also think that the report would address these Youtube content providers as creators, just as those who add content to Facebook pages are creators. But these creators are somehow not real creators in the minds of the writers of the report, and their needs are not addressed. Nowhere does the report decry the exploitation, without compensation, of the work of these creators by industry giants like Facebook and Google. If someone else is making money off of your content it's bad, unless you aren't one of us, in which case it's fine. The 99.99% of us who do not own the means of production and distribution must cede our rights in order to participate in the creation of content.

The report urges Congress to criminalize the unauthorized streaming of copyrighted works. Copyright violations are currently a civil matter, so this greatly ups the threat level. It also revives one of the disputed and hated aspects of the famed "Stop Online Piracy Act" (SOPA), which was defeated through massive activism.

The upshot is that copyright will benefit those who create industrially, but not the millions who create individually. We should just admit that copyright law no longer has anything to do with creativity or social impact, and instead rename it to the "copyright industries law." As the 2013 green paper explains:
The industries that rely on copyright law are today an integral part of our economy, accounting for 5.1 million U.S. jobs in 2010—a figure that has grown dramatically over the past two decades. In that same year, these industries contributed 4.4 percent of U.S. GDP, or approximately $641 billion.
That's what this green paper, and the previous green paper, and all of the changes to copyright in the centuries since the time of the US Constitution, are really saying. Very little of what is protected today, and very little of what contributes to that $641 billion, is either Science or a useful Art. The founding fathers were writing at a time when science and technology defined progress, and progress defined prosperity. They did not, and could not have, anticipated the rise of leisure time and the media that would allow the creation of a huge economic center based on entertainment (including infotainment, which covers much news reporting today). Ironically, real Science is being encouraged to provide its content as Open Access. It's time to openly state that copyright law today is not what the founding fathers had in mind.

Update: Excellent blog post on the Green Paper and copyright by Kevin Smith.


Addendum: I just re-discovered the CPSR statement on the NII from 1994, and it contains this paragraph which is so sadly insightful:
An imaginative view of the risks of an NII designed without sufficient attention to public-interest needs can be found in the modern genre of dystopian fiction known as "cyberpunk." Cyberpunk novelists depict a world in which a handful of multinational corporations have seized control, not only of the physical world, but of the virtual world of cyberspace. The middle-class in these stories is sedated by a constant stream of mass-market entertainment that distracts them from the drudgery and powerlessness of their lives. It doesn't take a novelist's imagination to recognize the rapid concentration of power and the potential danger in the merging of major corporations in the computer, cable, television, publishing, radio, consumer electronics, film, and other industries. We would be distressed to see an NII shaped solely by the commercial needs of the entertainment, finance, home shopping, and advertising industries.

Tuesday, July 30, 2013

Wikipedia as a learning experience

I have recently attended a few Wikipedia editing sessions and become interested in contributing more to Wikipedia. There is much editing to be done on pages relating to libraries and librarians; some of those pages are quite inadequate, and many have been marked as such using the Wikipedia coded messages that point out problems. The page for the LCCN is a stub, for example. Search on Sears Subject headings and what you get is a pretty poor page for Minnie Earl Sears with some information about the subject headings. Lately I've been updating the page on the Dewey Decimal Classification, which had little background information and did not have appropriate citations. I hope to move from there to the rather strange page that "compares" DDC and the LC Classification.

I estimate that I spent between 20 and 40 hours doing the research for my updates to the DDC page. The reason for that is that the Wikipedia standard requires that all facts be sourced. Add to that the requirement for a neutral point of view (called NPOV in wiki-speak), and a good Wikipedia page is a set of sourced facts, with some clear writing connecting them. (And, yes, there are a lot of not-good Wikipedia pages.)

It occurred to me that if I were a teacher I could use Wikipedia as a learning experience. Wanting your favorite topic to be well-represented in Wikipedia is a great motivator. Having to source all of your facts (and being pretty much limited to facts) means having to do research. Doing research becomes a good activity for discussing how to find sources and how to evaluate them.

Then I thought: wouldn't it be great to run a Wikipedia editing session in a library? What better place to have access to the sources? An editing session in a library with reference librarians on hand sounds like a Wikipedian's dream, and it could be used to teach people how to use the library.

Have you done this? I'd like to know.






Tuesday, July 23, 2013

Linked Data First Steps & Catch-21

Often when I am with groups of librarians talking about linked data, this question comes up:
"What can we do TODAY to get ready for linked data?"
It's not really a hard question, because, at least in my mind, there is an obvious starting point: identifiers. We can begin today to connect the textual data in our bibliographic records with identifiers for the same thing or concept.

What identifiers exist? Thanks to the Library of Congress we have identifiers for all of our authority controlled elements: names and subjects. (And if you are outside of the US, look to your national library for their work in this area, or connect to the Virtual International Authority File where you can.) LoC also provides identifiers for a number of the controlled lists used in MARC21 data.

The linked data standards require that identifiers be in the form of an HTTP-based URI. What this means is that your identifier looks like a URL. The identifier for me in the LC name authority file is:
http://id.loc.gov/authorities/names/n89613425
Any bibliographic data with my name in a field should also contain this identifier. (OK, admittedly that's not a lot of bib data.) That brings us to "Catch-21" -- the MARC21 record. Although a control subfield was added to MARC21 for identifiers ($0), that subfield requires the identifier to be in a MARC21-specific format:
The control number or identifier is preceded by the appropriate MARC Organization code (for a related authority record) or the Standard Identifier source code (for a standard identifier scheme), enclosed in parentheses.
The example in the MARC21 documentation is:
100 1#$aBach, Johann Sebastian.$4aut$0(DE-101c)310008891
Modified to use LC name authorities that would be:
 100 1#$aBach, Johann Sebastian,$d1685-1750$0(LoC)n79021425
The content of the $0 is therefore not a linked data identifier, even in those instances where a proper linked data identifier exists for the name. Catch-21. I therefore suggest that, as an act of Catch-21 disobedience, we all declare that we will ignore the absurdity of having recently added an anti-linked-data identifier subfield to our standard, and use it instead for standard HTTP URIs:
100 1#$aBach, Johann Sebastian,$d1685-1750$0http://id.loc.gov/authorities/names/n79021425
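To show how mechanical the conversion could be, here is a minimal sketch (mine, not any official utility) that rewrites the $0 of a textual MARC field into the HTTP URI form. The mapping from the "(LoC)" organization code to the id.loc.gov base URI is my own assumption for illustration; a real service would need a complete table of organization codes:

```python
import re

# Map MARC Organization codes to linked data URI bases (illustrative;
# only LoC names are handled here)
PREFIXES = {
    "LoC": "http://id.loc.gov/authorities/names/",
}

def zero_to_uri(field):
    """Rewrite $0(ORG)number subfields as $0 HTTP URIs where we can."""
    def repl(m):
        org, local = m.group(1), m.group(2)
        base = PREFIXES.get(org)
        return "$0" + base + local if base else m.group(0)
    return re.sub(r"\$0\(([^)]+)\)(\S+)", repl, field)

print(zero_to_uri("100 1#$aBach, Johann Sebastian,$d1685-1750$0(LoC)n79021425"))
# -> 100 1#$aBach, Johann Sebastian,$d1685-1750$0http://id.loc.gov/authorities/names/n79021425
```

Unmapped codes are left untouched, so the rewrite can be run safely over mixed data.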
Once we've gotten over this hurdle, we can begin to fill in identifiers for authority-controlled elements. Obviously we won't be doing this by hand, one record at a time. This should be part of a normal authority update service, or it may be feasible within systems that store and link national authority data to bibliographic records.

We should also insist that cataloging services that use the national authority files begin to include these subfields in bibliographic data as it is created/downloaded.

Note that because the linked data standard identifiers are HTTP URIs, aka URLs, by including these identifiers in your bibliographic data you have already created a link -- a link to the web page for that person or subject, and a link to a machine-readable form of the authority data in a variety of formats (MARCXML, JSON, RDF, and more). In the LC identifier service, the name authority data includes a link to the VIAF identifier for the person; the VIAF identifier for some persons is included in the Wikipedia page about the person; the Wikipedia identifier links you to DBpedia and the DBpedia identifier is used by Freebase ...
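As a small illustration of that machine-readable access, format-specific URLs can be constructed directly from the identifier. The suffix conventions below are ones I have seen the id.loc.gov service honor; treat the exact list as an assumption, not a specification:

```python
# Turn one authority URI into format-specific URLs by appending a
# serialization suffix (assumed conventions, for illustration only).
def format_urls(uri, suffixes=(".json", ".nt", ".rdf")):
    return [uri + s for s in suffixes]

for url in format_urls("http://id.loc.gov/authorities/names/n79021425"):
    print(url)
```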

That's how it all gets started, with one identifier that becomes a kind of identifier snowball rolling down hill, collecting more and more links as it goes along.

Pretty easy, eh?

Sunday, July 21, 2013

Librarians and the JK Rowling Effect

I'm sure that by now you've heard the story: a book by unknown author Robert Galbraith got good reviews but made only modest sales, until it was revealed that Galbraith was a pseudonym for JK Rowling. Within days it was "#1 with a bullet" on Amazon.

The book had reportedly sold only 500 copies in the US. The publisher most likely did not do a large print run. The hard copy has not yet made the New York Times best seller list, which is determined by sales. However, it is #1 on Amazon due to the infinite expandability of Kindle ebook copies.

This may be a kind of watershed moment for ebooks, a proof that in this world of instant access the ebook is not only good for readers (as in humans who read), but has definite advantages for publishers. The primary message here, though, is that reputation sells. This is something that advertisers have known forever. Therefore I was surprised to come upon a short article in The Nation magazine from 1897(*) that blames librarians for making authors famous by naming them, and clearly disdains this fact.
"The role of the librarians in this country as critics of literature and arbiters of literary reputation is growingly apparent. 'Poole's Index to Periodical Literature' is of necessity selective, and the selection from each periodical embraced in the Index appertains to the particular librarian or library assistant specially charged with the care of that periodical. In most cases, the name of the author is ascertained and appended to the title, and so the aristocracy of current letters is called into being. Writers in this way come to be known for their range of subjects and interests; their weight is suggested by frequency of titles; editors and publishers will naturally apply to them as authorities."
Not only did librarians select literature for the index, they actually created recommended bibliographies!

"Not satisfied with this control at once of fame and research, the associated librarians got up a list of works recommendable for a library of five thousand volumes... Mr. Melvil Dewey... 'submitted to the librarians of the State [of New York] and others to obtain an expression of opinion respecting the best fifty books ... to be added to a village library."
The author of the piece (not named, by the way, at least not in the pages that I retrieved) saw this as "a pretty generous advertisement." My concern about the influence of libraries and librarians is quite different: access to knowledge determines future knowledge. Well, perhaps "determines" is too strong of a word, but let's say that what we can produce as new knowledge depends greatly on which giants are available to us, in the Newtonian sense of "standing on the shoulders of giants."

Oftentimes, the library has a role in providing those giants. In a small library, what the library owns will necessarily be a subset of the knowledge on a topic. In a large library, where the number of documents on a topic is way beyond the capability of most researchers to absorb, the organization of the materials will determine what researchers discover. Even if collection development were a perfect process, with unlimited funds, unlimited space, and absolute neutrality, the library in some way has an effect on future knowledge.

In that sense, Amazon and other booksellers have it easy: all that matters for them is the simple measure of sales. They don't have to, and probably do not, wonder if the world is or is not a better place now that we have Fifty Shades of Dubious Personal Interaction, and a new JK Rowling bestseller. Ironically, counting sales or hits or links is considered "neutral" while attempting to make a selection of the most important works in a subject area within a limited budget is looked at askance.

------------

* The Nation. vol. 64, no. 1663, May 13, 1897, p. 860

Saturday, June 29, 2013

FRBR and schema.org

The FRBR structure for what it calls the Group 1 entities (Work, Expression, Manifestation, and Item, hereafter written as WEMI) presents quite a few problems for data modeling. Of the many issues this brings up, there is the fact that this division is not universally recognized, not even in library data, and definitely is not recognized outside of libraries. This has particular impact for library data as part of the linked data space, where a primary goal is interlinking with data from diverse resources. It is unlikely that online bookstores or academic citations will begin to use the WEMI structure.

One area where library bibliographic data and bibliographic data from other sources may mingle is in schema.org markup in web pages. Schema.org already has a basic class that can be used for bibliographic data, called "CreativeWork." CreativeWork contains the common elements for this type of description, like author, title, publisher, pages, subject, etc. Problems arise, therefore, when trying to express either WEMI or the simplified BIBFRAME Work and Instance (hereafter bf:Work, bf:Instance) in this model. CreativeWork is a unified model that includes all descriptive elements in a single set; BIBFRAME separates those elements into two entities, and each entity contains only a defined set of the descriptive elements. Thus, where CreativeWork will have information for author, title, publisher, pages, and subject, in BIBFRAME author and subject must be described in the bf:Work entity, and title, publisher, and pages in the bf:Instance entity. Between MARC, FRBR, BIBFRAME, and schema.org, a full bibliographic description may require one, two, or four separate entities.

comparison of marc, frbr, bibframe, and CreativeWork
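To make the contrast concrete, here is the same description sketched in both models as plain Python dictionaries. The property names and values are illustrative, not the official vocabulary terms:

```python
# schema.org: one CreativeWork entity holds all of the elements
creative_work = {
    "author": "Austen, Jane",
    "name": "Pride and prejudice",
    "publisher": "Modern Library",
    "numberOfPages": 279,
    "about": "England -- Fiction",
}

# BIBFRAME: author and subject belong to the Work ...
bf_work = {
    "creator": "Austen, Jane",
    "subject": "England -- Fiction",
}

# ... while title, publisher, and extent belong to the Instance,
# which points back at its Work
bf_instance = {
    "title": "Pride and prejudice",
    "publisher": "Modern Library",
    "extent": "279 p.",
    "instanceOf": "bf_work",
}
```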


The OCLC report on BIBFRAME and schema.org proposes that one could use CreativeWork for different FRBR (or presumably BIBFRAME) entities, making the determination based on what fields are present:
"In this scheme, it would be possible to say that when only titles, subjects, and creators are mentioned, the description for a Schema:CreativeWork refers to a FRBR Work; and when copyright dates and genres are present, the description is equivalent to a FRBR Expression." (p. 14)
While that makes sense from a pure logic point of view, and would probably work in a library database, it has problems within the web and linked data contexts of schema.org. I should note, before going on, that schema.org is metadata markup for any web site, and CreativeWork will be used for books, films, music, art, and other forms of creation by anyone and everyone on the web. This is not a library-specific standard.
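For readers who have not seen it, schema.org markup in its microdata form, the kind typically embedded in web pages, looks something like this hypothetical fragment for a book:

```html
<!-- hypothetical page fragment; property names from schema.org/Book -->
<div itemscope itemtype="http://schema.org/Book">
  <span itemprop="name">Walden</span>, by
  <span itemprop="author">Henry David Thoreau</span>.
  Publisher: <span itemprop="publisher">Ticknor and Fields</span>.
</div>
```

Any web site owner can sprinkle these properties into an ordinary page; there is no library-specific structure behind them.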

First, there are many sites that have a search response page with limited information about the item, requiring the user to click through for details. A search results page for books on Amazon or Ebay gives only the author and title, but does not represent the Work -- it merely doesn't give the user the full data on that page in order to fit more results onto the page. Therefore, the lack of information on one web page does not mean that the description there is complete.

Second, there is no "record" in schema.org, merely a number of coded statements with values within a web page. Any web page can contain information about any number of "things" and information about those things may be placed anywhere on the page, possibly far from each other and not coded as a single unit. It may not be possible to know how complete a description is.

Third, web site owners can opt to mark up only part of their data. In schema.org markup that I have encountered on commercial sites, the markup reflects the owner's view. For example, Google (one of the originators of schema.org) does not mark up the bibliographic data in its Books pages, but instead emphasizes user ratings, images, and subjects. (This shows the markup using the Google rich snippet testing tool.) In comparison, the extracted schema.org elements for an IMDB page are much more detailed, an indication that IMDB considers itself an information site more than a sales site.

Finally, although this is somewhat beyond schema.org, should the data in web pages be incorporated into the linked data space, it will go there as individual triples that are part of a huge graph of data. That graph is theoretically limitless and makes use of a principle called the "open world assumption." In an open world it is not possible to base your assumptions on what is missing from the graph. The open world does not have a concept of completeness because there is always the possibility that there is more information than what you are seeing at any given moment in time.
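A tiny sketch of why inference from absence is unstable in an open world: the same book is described partially on two different pages, and an OCLC-style "Work if no dates or genres are present" rule flips its answer as soon as the second page's statements join the graph. The property names and the rule itself are simplified for illustration:

```python
# Statements about resource "b1" harvested from two different pages
page_a = {("b1", "name"): "Walden",
          ("b1", "creator"): "Thoreau, Henry David"}
page_b = {("b1", "copyrightYear"): "1854",
          ("b1", "genre"): "Essays"}

# In the open world, the graph is simply the union of what is known
graph = {**page_a, **page_b}

def looks_like_frbr_work(properties):
    # the inference rule: call it a Work if no dates or genres appear
    return not any(p in properties for p in ("copyrightYear", "genre"))

props_a = {p for (_, p) in page_a}
props_all = {p for (_, p) in graph}
print(looks_like_frbr_work(props_a))    # True, from page A alone
print(looks_like_frbr_work(props_all))  # False, once more triples arrive
```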

These may not be the only arguments against using CreativeWork for different FRBR or BIBFRAME entities, but in my mind they are sufficient to make the case that, if FRBR or BIBFRAME entities are to be encoded in schema.org, they must be represented by distinct schema.org classes and cannot be inferred from the data elements present in a CreativeWork.

Before I end, let me make clear that I do not favor an imposition of FRBR-like separations of bibliographic data on the linked data world. Even the BIBFRAME two-part bibliographic description will have problems interacting with the one-entity model that is used outside of libraries. I do think that we can find a way to talk virtually about works without stripping such key elements as authors and subjects from the description of the package that carries the content. That package is, after all, what I hold in my hand when I read something, and it is a whole, with author, title, subjects, pages, binding, publisher, etc. That is, however, a topic for another post.

Wednesday, June 19, 2013

Spying, the old-fashioned way

While the news debates the NSA's PRISM program, a massive collection of data points of electronic communication, the more human side of spying is being pushed to the background. Yet if you are fearful of privacy invasion, there is nothing more chilling than a reading of FBI files with accounts of informants and statements about "Communist leanings" and "pro-Russian" attitudes. You can get a taste for this in the FBI Vault, a public file of de-classified documents, most of which are revealed upon the death of the target. The A-Z list (which appears to be a selection, as other names can be found with searches) is full of famous names, from Al Capone to Al Gore, from George Burns to Marilyn Monroe, and from Helen Keller to Leon Trotsky. (The list is only marginally alphabetical, by first name, which is almost as shocking as the contents of the files.)

This is old-fashioned stuff for the most part. Type-written letters, lots of scribbled initials, and whole chunks of documents blacked out with what must be a special FBI-invented marker.


On a more modern note, who could resist adding their favorite FOIA file to Facebook?



Not everyone in the Vault is a potential "enemy of the state." Some are there because they were threatened, and the FBI was doing its "protect and serve" job. But coming to the attention of the FBI is often not a good thing. In the case of Bradbury, an informant tipped off the Bureau that Bradbury may have attended a writers' meeting in Cuba. This set off an investigation.
"Investigation conducted in the neighborhood of 10265 Cheviot Drive, Los Angeles, California, disclosed that a RAY DOUGLAS BRADBURY, date of birth 8/22/20, Waukegan, Illinois, resides at this address. He is a known writer and Los Angeles indices have numerous references on RAY DOUGLAS BRADBURY."
The report is itself a boon for any biographers (and Wikipedians). It gives his family history (back to 1630), location and occupation of all living family members, information about his spouse (including the location of the church where they were married), and of course his yearly income. Given that this report is from 1959, you can just imagine how much more information the FBI would have today. There are whole pages that would do a reference librarian proud: a list of his professional memberships, a complete bibliography, film credits. There is even some level of literary analysis:
"... BRADBURY was probably sympathetic with certain pro-Communist elements in the [Writers Guild of America, West]... [Informant] stated it has been his observation that some of the writers suspected of having Communist backgrounds have been writing in the field of science fiction and it appears that science fiction may be a lucrative field for the introduction of Communist ideologies."
Admittedly, this was high "red scare" time still, and Bradbury was working in and around Hollywood. However, the informant seems to have been unfamiliar with the work of L. Ron Hubbard.

Every file here has some gems worth reading. And don't forget to check the category "Unexplained Phenomenon".

Tuesday, May 14, 2013

BIBFRAME Authorities

There is a discussion taking place on the BIBFRAME listserv about the draft proposal for BIBFRAME Authorities. I've made some comments, but this is a topic that requires diagrams, and therefore doesn't work well in email. This blog post is an illustrated comment on the BIBFRAME Authorities proposal.

The way I read the proposal, this diagram represents the current thinking on BIBFRAME Authorities:

Here is an example of a BIBFRAME authority representation from the document:

<!--  BIBFRAME Authority -->
<Person id="http://bibframe/auth/person/franklin">
      <label>Franklin, Benjamin, 1706-1790</label>
      <hasIDLink resource="http://id.loc.gov/authorities/names/n79043402" />
      <hasVIAFLink resource="http://viaf.org/viaf/56609913" />
      <hasDNBLink resource="http://d-nb.info/gnd/118534912" />
</Person>
 
It is unclear to me what role or functionality the VIAF and DNB links are expected to have; that is one question I have. I also don't know what "hasIDLink" means: whether it is specific to LCNA or means "this is the authority file." If it does not mean the latter, then this does not link the BIBFRAME name display form to the actual authority file that defined it. If it does, then the three authority files are not treated consistently.

In addition, it does not appear that alternate name forms are included in the BIBFRAME Authority, so they are not available for indexing. That could just be something missing from the examples, however.

If a BIBFRAME authority is needed in the BIBFRAME structure, it would make more sense to me to make a few changes. First, the alternate name forms would be included in the BIBFRAME authority, primarily for indexing. The preferred form of the name is obviously there for the purposes of display and indexing. The alternate forms are not displayed, but should be used in retrieval.

Another possible change is to make a direct link from the BIBFRAME authority to the library authority entry, in this case LCNA. Without this, it isn't clear how the two will be kept in sync as the LCNA file is updated. Links to other library authority files would be from the authority of record, which is what they are "nearly equivalent" to.
Note that this still links the annotations to the BIBFRAME authority. Other libraries using the LCNA data would not necessarily have access to annotations linked directly to the BIBFRAME authority, but that depends on how those authorities are shared. The advantage of this is that it shelters the "true authority" from possibly inappropriate stuff that might be associated with the BIBFRAME authority.

<!--  BIBFRAME Authority -->
<Person id="http://bibframe/auth/person/franklin">
      <label>Franklin, Benjamin, 1706-1790</label>
      <altLabel>Franklin, V. (Venīamin), 1706-1790</altLabel>
      <authority resource="http://id.loc.gov/authorities/names/n79043402" />
</Person>
 
<!-- LC Name Authority -->
 
  <madsrdf:PersonalName rdf:about="http://id.loc.gov/authorities/names/n79043402">

      <madsrdf:authoritativeLabel xml:lang="en">Franklin, Benjamin, 1706-1790
        </madsrdf:authoritativeLabel>
      <madsrdf:variantLabel xml:lang="en">Franklin, V. (Venīamin), 1706-1790
        </madsrdf:variantLabel> 
      <hasVIAFLink resource="http://viaf.org/viaf/56609913" />
      <hasDNBLink resource="http://d-nb.info/gnd/118534912" /> 
  </madsrdf:PersonalName>

The last option that I can propose is simply using the library authority. I believe the argument against this is that such data may not always be available for record displays. As far as I know, though, nothing prevents caching of high-use metadata statements ("triples," because it's all just triples, after all) and refreshing them periodically to make sure one has the latest version. In fact, it is probable that the linked data space will take a lesson from the Domain Name System, where a system of mirrors and backups distributes the DNS worldwide, syncs changes, and provides almost 100% availability. In that case, there would be no reason not to use one's stated authority, with similarly coded local data existing for the occasional case where no authority data exists.
<!--  BIBFRAME Authority -->
<Person id="http://id.loc.gov/authorities/names/n79043402"></Person>
 
<!-- LC Name Authority -->
 
  <madsrdf:PersonalName rdf:about="http://id.loc.gov/authorities/names/n79043402">
      <madsrdf:authoritativeLabel xml:lang="en">Franklin, Benjamin, 1706-1790
        </madsrdf:authoritativeLabel>
      <madsrdf:variantLabel xml:lang="en">Franklin, V. (Venīamin), 1706-1790
        </madsrdf:variantLabel> 
      <hasVIAFLink resource="http://viaf.org/viaf/56609913" />
      <hasDNBLink resource="http://d-nb.info/gnd/118534912" /> 
  </madsrdf:PersonalName>
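If caching with periodic refresh is the worry, the mechanics are simple enough to sketch. The following is a minimal illustration in Python -- not any actual BIBFRAME or LC service, and the fetch function and TTL are invented for the example: labels fetched from an authority URI are held locally and re-fetched after a time-to-live expires, so displays don't depend on the remote file being up at query time.

```python
import time

class TripleCache:
    """Cache high-use label 'triples' locally, refreshing after a TTL."""

    def __init__(self, fetch, ttl_seconds=86400):
        self.fetch = fetch          # callable: uri -> label (remote lookup)
        self.ttl = ttl_seconds
        self.store = {}             # uri -> (label, fetched_at)

    def label(self, uri):
        entry = self.store.get(uri)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]         # fresh cached copy, no network needed
        value = self.fetch(uri)     # refresh from the authority of record
        self.store[uri] = (value, time.time())
        return value

# Stand-in for a remote lookup against the LC Name Authority File.
def fake_fetch(uri):
    data = {"http://id.loc.gov/authorities/names/n79043402":
            "Franklin, Benjamin, 1706-1790"}
    return data[uri]

cache = TripleCache(fake_fetch, ttl_seconds=3600)
print(cache.label("http://id.loc.gov/authorities/names/n79043402"))
```

Within the TTL, repeated displays are served from the local copy; once it expires, the next lookup refreshes from the authority of record.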

To be sure, I am making some assumptions that should be explicit.
  1. It's all triples. It's easy to forget this when looking at graphs.
  2. Availability is a technical issue for which there is an answer (or more than one answer).
  3. The main action in a linked data space is a query. This is not only for traditional discovery, but also for forming displays. Displaying a person's name will mean querying for a "label" linked to a URI. It doesn't matter whether the URI is a BIBFRAME authority URI, an LCNA URI, or a DNB URI -- each of those labels is a triple in linked data space.
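Assumption 3 can be made concrete with a toy model. In the sketch below (plain Python, with one invented URI and a schematic "label" predicate rather than a real vocabulary), the data is nothing but (subject, predicate, object) triples, and display is just a query for a label attached to whichever URI is at hand.

```python
# A toy triple store: every statement is a (subject, predicate, object) tuple.
triples = [
    ("http://bibframe/auth/person/franklin", "label",
     "Franklin, Benjamin, 1706-1790"),
    ("http://id.loc.gov/authorities/names/n79043402", "label",
     "Franklin, Benjamin, 1706-1790"),
    ("http://id.loc.gov/authorities/names/n79043402", "variantLabel",
     "Franklin, V. (Venīamin), 1706-1790"),
]

def labels(uri, predicate="label"):
    """Query: all objects of `predicate` for the given subject URI."""
    return [o for s, p, o in triples if s == uri and p == predicate]

# It doesn't matter which authority URI we hold -- the query is the same.
for uri in ("http://bibframe/auth/person/franklin",
            "http://id.loc.gov/authorities/names/n79043402"):
    print(labels(uri)[0])
```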

Friday, April 05, 2013

The "Mellen Mess" and the changing role of publishers

Reading about the "Mellen Mess" -- the case of the publisher that is suing a librarian who criticized the quality of the house's output -- I found the most interesting discussion to have taken place in the comments area of the original post (available via the Wayback Machine). One poster says:
On the other hand, I would say that few if any publishers do not publish a number of books that I would not buy.
To which Dale Askey replies:
The fact is, however, that libraries have to be able to trust presses to turn out good titles, or our work becomes impossible given the sheer global output of scholarship... libraries lack enough qualified subject expertise to make such judgments at the necessarily granular level, and the trend here is not encouraging. Subject librarianship is dismissed as a relic of a past age, and we now talk about “patron-driven” acquisition as if it were the Holy Grail. Having spent a brief but wonderful portion of my career as a focused subject librarian for an area where I have expertise, I know the benefit of reading substantive reviews and making intelligent choices about individual titles, but even that library no longer has the funds (or perhaps just lacks the will to commit the funds) for such esoteric enterprises.
What I think we see here is evidence of a substantial change in what it means to be a publisher in this age of "everyone can be a publisher." First, a little history.

Turin book fair, 2007
The first followers of Gutenberg were equal parts scholar, technician and businessman. There was never any question that producing print was a for-profit activity, and the same printers who turned out carefully edited classics also printed the first advertisements as well as a large number of indulgences to be sold to wealthy (but not well-behaved) Catholics. Well into the late 19th century, publishers were also printers, and often saw themselves as having a key role in scholarship and culture. The reputation of the publisher was what made the introduction of new, unknown authors possible.

Turin book fair, 2007
Although I am at my very core a "book person," I was unaware of the culture of publishers before visiting Europe, where I browsed bookstores and attended a few book fairs. What struck me immediately was that the book covers represented the publisher more than the book itself. Near a university I found a bookstore that was organized entirely by publisher -- not by topic -- so that the only access other than "known item" was browsing by publisher.

By my own observation, by the 1950's the role of the publisher in the US had become subordinate to the book, preferably a best-seller. We could all name key books (Catcher in the Rye, To Kill a Mockingbird, The Spy Who Came in from the Cold), but I doubt that many of us could name the publishing houses that issued them.

As Epstein and Schiffrin explain (see Further Reading), the purchase of publishing houses in the late 20th century by companies with a primary interest in profits, unhindered by cultural concerns, has made the publishing house no more than another business. Of scholar-printer-businessman, only the businessman remains. If "best-selling" is your idea of quality, then these publishers can be considered consistent and trustworthy. If you are looking for greater cultural pursuits, you will probably be disappointed.

While that describes popular publishing, scholarly publishing has retained the publisher's reputation... at least until very recently. While there are still known scholarly publishers whose output can be trusted sight unseen (as Askey explains), there are many new entrants in this business area whose primary goal is income, not scholarship itself. This seems to be following a path similar to that of popular publishing, but with a twist: scholars must publish. The real culprit in this story is the "publish or perish" culture of academia. It matters not that there is no audience for a scholar's work; in fact, actually being read is merely icing on the cake. The main thing is that a scholar must get his or her work produced by someone acting as a publisher. It is therefore unsurprising that publishers have come on the scene to address this market.

The big "however" here is that while author fees may cover the cost (plus profit) of publishing an open access article, printed books still need to have some sales. Throughout the history of publishing, vanity books have been known as money-losers,* and some publishers have contracted with authors to require them to buy back any unsold copies. That is more than an untenured faculty member can afford, however, so the business of publishing books by academics is one that wise investors would avoid.

The upshot of the story here is that we've gotten ourselves into an untenable position between the pressure to publish and the actual market for published works. Something has to give, and it has to give at both ends of the equation.

The next step, then, is improving the social media that the academic community uses, so that "post-publication peer review" becomes the filter for quality and importance.

---------------

* I ran into a great rant by a 19th c. Italian publisher about vanity publishing while doing research on Natale Battezzati. I unfortunately didn't mark it, but if I find it again I will link it here.


Further Reading

Epstein, Jason. Book Business: Publishing Past, Present, and Future. New York: W.W. Norton, 2001.

Schiffrin, André. The Business of Books: How International Conglomerates Took Over Publishing and Changed the Way We Read. London: Verso, 2000.

Saturday, March 30, 2013

By way of explanation

“Readers who are familiar with conventional logical semantics may find it useful to think of RDF as a version of existential binary relational logic in which relations are first-class entities in the universe of quantification. Such a logic can be obtained by encoding the relational atom R(a,b) into a conventional logical syntax, using a notional three-place relation Triple(a,R,b); the basic semantics described here can be reconstructed from this intuition by defining the extension of y as the set { <x,z> : Triple(x,y,z)} and noting that this would be precisely the denotation of R in the conventional Tarskian model theory of the original form R(a,b) of the relational atom. This construction can also be traced in the semantics of the Lbase axiomatic description.”
        From the RDF Semantics document

"Doubts about the ability to know the order of the world catalyzed a crucial change, away from taxonomic forms of information storage based on natural language and toward new ones based on a symbolic language of analytical abstraction. Mathematics promised a new vision of order for both the natural and the moral worlds, where confusion was resolved by jettisoning whatever could not be known with certainty."
     Hobart, Michael E. Information Ages: Literacy, Numeracy, and the Computer Revolution. Baltimore: Johns Hopkins University Press, 1998. p. 90

“We could try to feed it algorithms for everything. There are only slightly more of them than there are particles in the universe. It would be like building a heart muscle molecule by molecule. And we’d still have a hell of an indexing and retrieval problem at the end. Even then, talking to such a decision tree would be like talking to a shopping list. It’d never get any smarter than a low-ranking government bureaucrat.”
     Richard Powers, Galatea 2.2, 1st Perennial Ed., 1996. p. 78


Thursday, March 14, 2013

Battezzati's Cartollini

Beginning with the first edition of the Dewey Decimal System and Relativ Index, Melvil Dewey includes this intriguing acknowledgment:
Perhaps the most fruitful source of ideas was the Nuovo sistema di Catalogo Bibliografico Generale of Natale Battezzati, of Milan. Certainly he [Dewey] is indebted to this system adopted by the Italian publishers in 1871, though he has copied nothing from it.
It so happens that I did some research on this in the national library in Milan in the mid-1970's and never published what I learned. This blog post makes use of notes and photocopies from that time.

There are a number of puzzling things about Dewey's mention of Battezzati's system. One is that it had little or nothing to do with classification. It was, however, an ingenious card system. The story, in brief, goes like this:

Natale Battezzati was a printer and publisher in Milan from the mid-1800's onward. The Italian publishers had a bi-monthly publication, the Bibliografia italiana, that carried information on new books in print. The publication was used by booksellers and customers to find books of interest. However, unless a bookseller had a perfect memory, looking for a specific book or a book on a specific topic meant combing through numerous back issues. Battezzati, a member of the Associazione libreria italiana (the Italian Bookseller's Association), came up with the idea of reprinting the title pages of books on card stock so that the cards could be kept and interfiled as a kind of "books in print" card catalog within each bookstore.

The genius of the card system was that each card carried, printed on each of three sides:
  1. the name of the publisher 
  2. the name of the author
  3. a subject classification based on Brunet
The cards could also be overprinted with a table of contents or a summary of the contents. Each publication was also to be given, in the upper right corner, a number that could be used by the bookstores in ordering.

Thus, the bookseller would receive three copies of the card for each new book, and could create three card files. Battezzati's purpose was to increase sales by making it easier for a bookseller to satisfy the needs of the customer.

So much of this seems familiar to us today: a single "unit" card with multiple headings, a unique numbering system for books ... it's no wonder that Dewey was impressed, but it's still unclear why a reference to the system would be included in all fourteen editions of DDC that Dewey personally oversaw.

Of even greater mystery is a statement by Battezzati, in one issue of the association's journal, that Dewey, sent by his government to the World Exposition in Vienna in 1873, saw the cards demonstrated there. This is probably a misinterpretation by Battezzati of a letter sent to him by Dewey, since it is highly unlikely that Dewey, at the time a 22-year-old college student, would have been sent to represent the United States at such an event in Europe. It is more likely that Dewey saw the cards in the articles in the Bibliografia italiana, which was held by a few major libraries on the East Coast, such as the Boston Atheneum (Dewey was attending Amherst College in 1873). There were other misunderstandings on Battezzati's part as well: he referred to Dewey as the secretary of the "Associazione dei Libraj d'America" -- that is, the Association of American Booksellers.


Monday, March 04, 2013

Sergey Brin's Masculinity

At first I thought it was a joke: "Speaking at the TED Conference today in Long Beach, Calif., Brin told the audience that smartphones are 'emasculating.' 'You're standing around and just rubbing this featureless piece of glass,' he said." Perhaps I didn't believe it was true because I first encountered it in the form of a BoingBoing parody for "Mandroid: Google's remasculating new operating system." Another one of those moments when reality and parody are just soooooo close.

The TED talk won't be available for a while, so I don't know if he said this with any hint of humor. (I rather hope so, but I fear not.) The talk was about the Google Glass product, which he was demonstrating and promoting. But even if he meant the statement as something of a joke, there are things that need to be said about the not-so-subtle subtext.

1. Using "emasculating" to deride a competitor's product when neither product has anything to do with gender is just a cheap shot. It's like Coke saying that Pepsi is "emasculating."

2. The ongoing attempt to raise the testosterone levels of electronic equipment has gotten out of hand. Yet, unfortunately, products must make an appeal to identity in order to sell. Apple pushes an identity of design and sophistication that was once considered "un-manly" by early Mac reviewers. Brin's remark, albeit nonsensical, pushes back against Apple's more gender-neutral image.

3. It makes little sense to eliminate women from your market, and promoting a product as a kind of "technology viagra" is not going to win over female consumers. Brin's remark shows that he's more concerned with promoting a masculine image that he is comfortable with than with following good marketing practice.

Some reading:

Wikipedia Women in Computing
Gender codes: why women are leaving computing edited by Thomas Misa
How to market to women, by Carol Nelson (1994, so a little out of date, but still useful)

Thursday, February 21, 2013

Open to Creativity

The brilliance of Google's PageRank is not the computational methods behind it but the target of those methods: links created by people in the course of making something meaningful on the Web. Without that human input, Google (and Bing and Yahoo) would simply be counting term frequencies and perhaps analyzing linguistic characteristics, but would miss most of what makes Web searching work. The results would be at least as bad as the ranking in Google Books, because they would lack the significant human-made connections between pages.
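For readers curious how little machinery this takes, here is a toy power-iteration version of the PageRank idea, with an invented three-page link graph (nothing here reflects Google's actual implementation). The point is that the ranking is computed entirely from human-made links, never from the text of the pages.

```python
# Invented link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
damping = 0.85
rank = {page: 1.0 / len(links) for page in links}

for _ in range(50):  # iterate until the ranks stabilize
    new_rank = {}
    for page in links:
        # Sum the rank flowing in from every page that links here,
        # divided by how many links each of those pages gives out.
        incoming = sum(rank[p] / len(out) for p, out in links.items()
                       if page in out)
        new_rank[page] = (1 - damping) / len(links) + damping * incoming
    rank = new_rank

# "C" collects links from both A and B, so it ranks highest.
print(max(rank, key=rank.get))
```

No word of any page is ever examined; the scores come entirely from who chose to link to whom.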

Although Google's mission statement is "... to organize the world's information and make it universally accessible," Google is not really organizing anything. It is reading the organization provided by the Web's population. Similarly, Facebook is reading the relationships between people that are made in the course of using its software, information that it would not have otherwise.

What has made the web the rich environment that it is is that anyone can link anything to anything else. That linking is an expression, and even though we might not be able to characterize it in a few words (what does it mean that page A links to page B?), Google has shown that we can make use of those patterns of linking to help people find stuff on the Web that meets their needs.

In this regard, the Semantic Web has a serious problem. Much of the focus of Semantic Web work is the creation of vocabularies of defined relationships between things, with the intention that these relationships can be traversed and manipulated by algorithms. That is fine in itself, but the Semantic Web enthusiasts are primarily creating pre-determined, fixed relationships between things, mainly by defining each thing in a class/sub-class relationship with other things. The Semantic Web vocabularies also define constraints called "ranges," which limit what a specific element of your vocabulary can link to. [1]
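A schematic illustration of what a range constraint does in practice (the check, the property, and the data below are invented for this example, not drawn from any real vocabulary): an object that is not a URI identifying a typed Person is simply not allowed as a creator.

```python
def is_valid_creator(obj, known_persons):
    # Range constraint: the object must be a URI that identifies
    # something typed as a Person -- a bare name string is rejected.
    return obj.startswith("http") and obj in known_persons

# URIs that our (hypothetical) data has typed as Person.
persons = {"http://kcoyle.net/kcoyle.rdf"}

print(is_valid_creator("http://kcoyle.net/kcoyle.rdf", persons))  # allowed
print(is_valid_creator("Karen Coyle", persons))                   # rejected
```

The string "Karen Coyle" fails not because it is wrong, but because the vocabulary's rules leave no way to say it.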

The tendency to pre-define vocabularies with strict rules is a carry-over from the Artificial Intelligence (AI) environment from which the Semantic Web arose. If you expect to work with machines, and only machines, then you have to define for them exactly what they can and cannot do, and you must present all decisions as formulas that can be calculated.

For reasons that are unclear to me, the Semantic Web work seems unaware of the great success that Google has had in using human information-sharing activity to create a meaningful web of links. Little attention is paid to the fact that relationships can be established through linking by humans, much less that the best and most useful linking will be done by humans. Human information linking is not definable a priori -- in fact, an a priori definition of allowed links essentially limits the future to re-running past concepts in new variations. It is an absolute barrier to creativity if you can only act on what has been pre-defined.

It's the difference between a prefab house, which can only become what it was designed to become (with some minor modifications that don't change its essence) and a box of blocks that can be used to create anything within the realm of physics (although superglue could extend the possibilities).

In part this is why I tend to speak of "linked data" rather than "Semantic Web." Linked data, as it has evolved as the description of metadata activity, carries less of the AI baggage than Semantic Web does.

To me what will be exciting about linked data is what people will do with it; what they will create, what they will experiment with, and both successes and failures. However, for people to do something with linked data they need tools -- tools that are as easy to use as those used to create Web pages and the links between them.

They also need a box of blocks to work with. These blocks need to be as free of predetermined rules as possible. [2] The terms need to be defined just enough to make them usable. The same is true of the relationships, or links. Using our box-of-blocks metaphor, we want to be able to put the blocks in relationships like "above," "below," and "near" each other. If the square blocks are defined as always "below" the rectangular blocks, that limits what we can create.

The Semantic Web as a machine environment guided by AI formalities appeals to some because it promises to be neat and unambiguous. It will, however, foster only a very constrained kind of creativity, and will not be able to satisfy the full range of human curiosity.

It is a shame that many Semantic Web enthusiasts have little faith (or little interest) in the human potential that linking and openness can unleash. I, for one, am looking for partners in the development of a messy, intelligent, quirky technology that can produce surprising results, created by people using linked data as a tool of expression. I am particularly excited by the fact that we don't know what forms that expression will take.



[1] For example, you can define a creator as having to be of type "Person," and that Person must be expressed as a URI, like http://kcoyle.net/kcoyle.rdf. In that case, you can't have a creator that is "Karen Coyle," because "Karen Coyle" is just a string of characters, not an identified entity. This means that if you don't have identifiers for your creators, you can't create data about what they created.

[2] This is referred to as "minimum ontological commitment," introduced in Toward Principles for the Design of Ontologies Used for Knowledge Sharing, by T. Gruber [http://www2.iath.virginia.edu/time/readings/ontology-semantics-metaphor/designing-ontologies.pdf]