Thursday, March 24, 2011

Open Data II

In this post I want to talk about some of the Open Government Data (OGD) projects taking place around the world.

Many in the US take open government data as a given because our copyright law states that works of the federal government are not covered by copyright. (The situation in individual US states varies, but the federal government's declaration sets the tone.) In other countries the situation is less clear, and governments do not have a mandate to make data open. Even so, the open government data movement has spurred a number of fast-moving activities, many sponsored by governments themselves, that encourage citizens to download and use government data.

The UK government has a site, Opening up government, where it not only shares data but encourages people to develop apps that use it. Apps there can alert you to new building and planning projects in your area or give you real-time public transportation information.

The EU has its own Open Government Data Initiative. It provides the data under these terms of use:
All Data on dev.govdata.eu is available under a worldwide, royalty-free, non-exclusive license to use, modify, and distribute the datasets in all current and future media and formats for any lawful purpose and that this license does not give you a copyright or other proprietary interest in the datasets.
There is also a European site for public sector information, the European Public Sector Information Platform: Europe's One-Stop Shop on Public Sector Information Re-use. You can search by country and see news and developments relating to public data, much of which is available for re-use. Because many countries do not have an explicit statement in their copyright laws covering government data, one of the important early steps for these jurisdictions is to develop blanket licenses that they can apply to the data. So when you visit the site you see recent news that Norway has developed a license for its government data and is asking for feedback (if you read Norwegian).

As a measure of the reach of this movement, even Albania and Bulgaria are said to be on the verge of opening some government data.

The Obama administration announced its Open Government effort on its first day in office.
To the extent practicable and subject to valid restrictions, agencies should publish information online in an open format that can be retrieved, downloaded, indexed, and searched by commonly used web search applications. An open format is one that is platform independent, machine readable, and made available to the public without restrictions that would impede the re-use of that information.
Wired has a US-oriented "how-to" wiki on OGD. (Being Wired, they include MarijuanaLobby.org among their "how-to" examples, but it's a good illustration of the range of uses for OGD.)

Not all data is at the country level, of course, and the movement is reaching into lower levels of government. Paris has an open data portal, while Enschede, in the Netherlands, has an open data declaration for its information. In Italy, the government of the Piemonte Region has a website for its open data.

The government open data movement is heavily influenced by grassroots efforts to convince governments that open data is a good thing -- not just for government watchdogs and opposition movements, but for healthy government and strong business. In the UK there is the Working Group on Open Government Data of the Open Knowledge Foundation, an independent not-for-profit that promotes, as its name says, open knowledge. In Italy there is the wonderfully named "Spaghetti Open Data." Spain has a broad coalition of non-profits that form the "Coalición Pro Acceso." The CKAN web site, a general archive of available datasets of all kinds, lists OGD under a number of tags, such as "gov". [Just out: Open Government Data video.]

We hear a lot about problems with copyright, with DRM, with information providers who want to lock down their products. Government data covers a huge variety of information types and is often the key information needed for a lot of civic and scientific decision-making. OGD can generate a mountain of new knowledge, and then tell you how high the mountain is.

Tuesday, March 22, 2011

Judge Chin rejects AAP/Google settlement

I'll say more when I've read it, but I put a copy on the Internet Archive.


After reading:

The judge's decision holds no real surprises. His analysis is fully consistent with the reactions of the interested parties to the case. He rejects the settlement primarily on these grounds:
  • It seems that a significant segment of the class of authors/publishers is not happy with the settlement. "Some 6800 class members opted out." (p.10) Also, a majority of the comments on the proposed settlement were negative, many coming from non-US copyright holders who did not identify with the class.
  • The settlement would make significant alterations to the current copyright regime, which should be a matter for Congress rather than the court.
  • The settlement would go beyond the scope of the original lawsuit, which concerned Google's digitization of in-copyright works and its display of snippets in response to searches. The settlement would allow sales of full-text works, which was never at issue in the original lawsuit.
Although he rejects the settlement on numerous grounds, the judge concludes by saying "...many of the concerns raised in the objections would be ameliorated if the ASA were converted from an "opt-out" settlement to an "opt-in" settlement." (p. 46) This leaves the door open for yet another settlement attempt between the parties.

It is important to note that the landscape of digitization and ebooks today is vastly different from what it was in 2005, when the authors and publishers first sued Google over its library digitization project. If the question of Google's digitizing were raised for the first time today, the actions of the parties and the outcome might be quite different. This is clearly a case where technology moved forward at a rapid pace while the courts were contemplating an agreement that was standing still.

What now?

It's hard to believe that Google and the AAP/AG have not prepared themselves for this possibility. Yet, certain activities have gone forward as if the settlement were already approved.
  • A form of the Book Rights Registry is in place in the sense that there is a database of digitized works and a way to claim them to receive the proposed one-time payment. Presumably that payment is now not going to happen, but meanwhile Google has a large database with copyright holder information (including contact info, if I remember the form correctly).
  • The BRR has a director designate (Michael Healy).
  • It isn't clear if Google has continued digitizing books that are under copyright without specific permission. To be sure, it has made many deals with publishers and with libraries to digitize works since the settlement was first proposed in 2008, and digitization has gone forward.
  • Some libraries that had partnered with Google prior to the lawsuit have negotiated new contracts that are compatible with some of the conditions contained in the settlement. I don't know if these contracts have been signed or have been awaiting the result of the lawsuit, but I do recall that the libraries obtained fewer rights to retain copies of their digitized books under the new contracts than under the old. The upshot is that it isn't clear where this leaves the partner libraries, nor organizations like HathiTrust that are involved in the storage and possible uses of the Google-digitized books.
  • For libraries and institutions that were looking forward to subscription access to the books, that access is now a big question mark, since it depended on conditions in the settlement.
There are undoubtedly many other issues that are now open questions. When the settlement was first announced I began a "question list." It might be a good idea to revive that given this new perspective. And for those wondering "what now?" (that is, all of us) there's a flow chart.

Friday, March 04, 2011

Open Data I

The idea of open data has gone from an extremist rallying cry to a mainstream movement. In the next few posts I'll highlight just an iceberg tip's worth, but expect to see more about this with each passing day.

The UK's educational research arm, JISC (something like the NSF, but oriented more toward education than pure science), and the research libraries' organization, RLUK, undertook a study of the advantages and possibilities afforded by opening up data from libraries, archives and museums. They have produced the Open Bibliographic Data Guide, which investigates the business case for providing bibliographic data that can be re-used.


This is a practical, not a utopian, vision of open data.
"In earlier times, observers may have considered the ‘open data movement’ as the preserve of a certain type of fanaticism also associated with Open Source Software (OSS) and Open Content, emotionally and ideologically linked to the spirit of 1969.

However, OSS and Open Content have now morphed in to propositions with clear business cases of interest to corporations, institutions and governments. National strategies and Chief Information Officers espouse Open Source Software for financial and business benefit, whilst academic leaders are supporting Open Access Journals and Open Educational Resources (OER)."(link)
The report gives 17 different use cases -- situations in which an institution might want to provide its data with some degree of openness.

1 – Publish data for unspecified use
2 – Publish open linked data for unspecified use
3 – Supply data for Physical Union Catalogue
4 – Allow Physical Union Catalogue to publish data
5 – Expose data for federation into Virtual Union Catalogue
6 – Publish grey literature data
7 – Contribute data to Google Scholar
8 – Publish activity data
9 – Supply holdings data for Collection
10 – Expose holdings / availability data for Closest Copy location
11 – Share data for Collaborative Cataloguing
12 – Supply data for Crowd Sourced Cataloguing
13 – Supply data to be enhanced for own
14 – Publish data for LIS research
15 – Allow personal use of data for Reference Management
16 – Publish data for lightweight application development
17 – Allow commercial use of data in mobile application

For each of the cases the report discusses pros and cons for the institution, its users, and the world, as well as the business case for making one's data open. The authors acknowledge the complexity of our current environment of bibliographic data ownership:
"Our problems with bibliographic metadata are quite specific:
  • Non-profit and commercial players have built businesses around datasets of MARC records, indexing / TOC services and journal Knowledge Bases – but what is original about those accumulations?
  • Bibliographic records in the circulation amongst libraries are of uncertain and complex provenance, with the exceptions of those explicitly tagged by a ‘vendor’ or exclusive to a special collection" (link)
JISC doesn't stop at this report but is sponsoring projects and ongoing activities in this area. Already the British Library has released its British National Bibliography data openly for reuse. You can keep up with these activities through the newsletter (subscribe here) whose logo reads: One to many; Many to one: Towards a virtuous flow of library, archival and museum data.
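To give a flavor of what openly published bibliographic data can look like as linked data, here is a minimal sketch. The record URI is one I have invented for illustration (this is not the BL's actual modeling); the properties are standard Dublin Core terms:

@prefix dcterms: <http://purl.org/dc/terms/> .

# A hypothetical record URI; a real publisher would mint its own.
<http://example.org/bib/12345>
    dcterms:title    "On the Origin of Species" ;
    dcterms:creator  "Darwin, Charles, 1809-1882" ;  # a URI would be better than a string here
    dcterms:issued   "1859" ;
    dcterms:language <http://id.loc.gov/vocabulary/iso639-2/eng> .

Trivial as it looks, data in this form can be merged, queried and linked by anyone, which is the point of most of the use cases above.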

Sunday, February 06, 2011

Skyriver Replies

Following up on these early stages of what will probably be an interminable legal case (it's easy to understand why one should avoid going to court whenever possible), SkyRiver has replied to OCLC's Motion to Dismiss.[1] [2] This is the first document I have seen that clearly lays out SkyRiver's basic contentions. Note that the major part of the document is the usual lawyerly recitation of cases supporting one statement or another, and I have no idea what the legal arguments mean or whether they are convincing. But here are SkyRiver's primary claims as the document lays them out:

1. OCLC has monopolies in the US academic library market
"OCLC is monopolizing three product or service markets—bibliographic data of libraries’ holdings; cataloging service; and interlibrary lending service (ILL). OCLC is attempting to monopolize a fourth service market—integrated library systems (ILS)." p. 1
2. OCLC has used those monopoly positions to prevent competition
"Since at least 1987, OCLC has demanded that its member libraries agree to terms of membership that prohibit sharing the metadata of their own library holdings contributed to OCLC’s bibliographic database known as WorldCat with any for-profit firms for commercial use and require member libraries to use OCLC’s services. OCLC has imposed these membership terms to prevent the development of competing bibliographic databases, cataloging services or ILL services by erecting barriers to entry in these three markets. OCLC is also using its monopoly power in these three markets in its attempt to monopolize the ILS market." p.1
3. OCLC has targeted SkyRiver's business by using punitive pricing for libraries that use SkyRiver's cataloging services
"OCLC’s conduct has injured SkyRiver by deterring libraries from using its service, and has injured libraries that are using SkyRiver to reduce costs by preventing those libraries from uploading their new records into WorldCat at the price charged to everyone except SkyRiver users." p. 2
Beyond that, the arguments become more complex. In particular, there is the issue of the 20+ years during which OCLC has built up WorldCat under a policy that (according to the response, p. 4) has prohibited libraries from sharing their cataloging data with for-profit entities. With no other non-profit entity providing cataloging services to US academic libraries, the records are essentially locked up in WorldCat and no one else can enter the market.

This brings me to a point that I got wrong in a previous post: SkyRiver is asking for access to the WorldCat database. The argument, if I read it correctly, is that WorldCat is the only major source of academic library holdings that can support an effective ILL service, and that WorldCat is the result of monopoly practices. To allow for competition, WorldCat (that is, the bibliographic data and holdings) should be made available at a reasonable price to competing ILL providers. While this seems jarring at first, the more I think about it the more sense it makes.

What the response does not say explicitly, and perhaps it would be irrelevant in a legal case, is that one could look on WorldCat as a shared community resource, not the property of OCLC. In fact, OCLC uses this kind of argument in its record use policy, but somehow arrives at the conclusion that WorldCat should not be used to foster non-OCLC library services. It seems easy to make the opposite argument: that WorldCat could be the basis for a wide range of services that would benefit libraries, even if they do not come from OCLC. Imagine if OCLC were to set non-discriminatory pricing for use of WorldCat and anyone could make use of the WorldCat data. There could be a "share-alike" clause requiring those users to return pertinent information to the bibliographic collective. WorldCat would grow, and the range of products and services available to libraries would grow. This seems like a GOOD THING.

I realize it may not be easy to do the analysis that would lead to pricing that both fosters sharing and makes it possible even for small businesses* to arise in the library market. It should be possible, given today's technology, to do this efficiently but we know very little about the cost structure of WorldCat. It is clear that there are many activities relating to the care and management of that database, all intertwined with OCLC services and valuable research projects, as well as linked deeply into tens of thousands of library systems around the world. Should the court require OCLC to open WorldCat for use, we need to see a transition that is non-destructive to the library ecology.

* The reason I emphasize small businesses is that I believe that smaller, more nimble vendors could exist to serve the needs of specialized and smaller libraries which are not OCLC members at this time. I see the potential to widen the community of sharing, even to include more non-library institutions and businesses. Another GOOD THING, IMO.

Tuesday, February 01, 2011

Knowledge Organization in Norway

Last week I attended Kunnskapsorganisasjonsdagene 2011 (the Knowledge Organization Days 2011 conference) in Oslo. The topics ranged over linked data, the FRs (FRBR, FRAD, FRSAD), and RDA. I will try to give some flavor of the event as I experienced it. That caveat is necessary because only three of the presentations were in English; the rest were in Norwegian, and how much I understood really depended on whether there were slides with a lot of diagrams. I was somewhat in the position of the dog in this cartoon:


with "Ginger" being replaced by "RDA", "MARC", and "Karen Coyle."

I was the first speaker on day 1, presenting on RDA and linked data. The next talk was from the Pode project, a research project bringing together FRBR and RDF concepts and linking data to dbpedia, VIAF, and Dewey in RDF. I got the impression that while experimental, the results are sophisticated, particularly because of the mix of data sources the project is working with. The afternoon had an introduction to (and, judging from the moments of laughter, some commentary on) RDA by Unni Knutsen. There appears to be an equal measure of interest and skepticism about RDA. I am not sure that AACR had this same effect outside the Anglo-American library community, and I would be very interested to hear more about the impact of Anglo-American cataloging rules, especially whether that impact has been magnified by the international sharing of bibliographic data.
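To illustrate the kind of linking involved (this is my own sketch, not the Pode project's actual data; the work URI and the VIAF number are invented, and the Dewey URI pattern is approximate), connecting a description to dbpedia, VIAF and Dewey is a matter of asserting relationships to their URIs:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .

<http://example.org/work/peer-gynt>
    dcterms:creator <http://viaf.org/viaf/12345> ;             # a VIAF author URI (number invented)
    dcterms:subject <http://dewey.info/class/839.8/> ;         # a Dewey class as a URI
    owl:sameAs      <http://dbpedia.org/resource/Peer_Gynt> .  # the corresponding dbpedia resource

Once such links are in place, anyone traversing the data can pick up the richer descriptions held at the other end.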

Maja Žumer of the University of Ljubljana, Slovenia, a member of the FRSAD working group, gave the best explanation I have yet heard of the meaning behind FRSAD's "thema" and "nomen." It is beginning to make sense. Maja is the co-author of a study on FRBR and library users' mental models that was published in the Journal of Documentation in two parts. (Preprints [1] [2]) I will link to her slides when they are made available. A key take-away is that FRBR, FRAD and FRSAD have taken very different approaches that will now need to be reconciled. FRBR presents a closed universe of bibliographic data, with only FRBR entities allowed to be subjects of bibliographic resources; FRSAD essentially opens that up to anything in the known universe. Among other things this creates the possibility of linking non-bibliographic concepts to described bibliographic entities. Or, at least, that's how I read it.
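In linked data terms the difference is easy to show. Under a closed model, the object of a subject relationship must itself be a bibliographic entity; under the FRSAD approach, any thing with a URI can serve. A minimal sketch (the work URI is invented):

@prefix dcterms: <http://purl.org/dc/terms/> .

# The subjects here are not bibliographic entities at all,
# just things in the known universe that happen to have URIs.
<http://example.org/work/silent-spring>
    dcterms:subject <http://dbpedia.org/resource/DDT> ;
    dcterms:subject <http://dbpedia.org/resource/Pesticide> .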

I was asked to do a short wrap-up of the first day, and as I usually do I turned to the audience for their ideas. Since we realized we were short on answers and long on questions, we decided to gather some of the burning questions. Here are the ones I wrote down:

  • If not RDA, what else is there?
  • Are things on hold waiting for RDA? Are people and vendors waiting to see what will happen?
  • Why wasn't RDA simplified?
  • How long will we pay for it?
  • Will communities other than those in the JSC use it?
  • Can others join JSC to make this a truly international code?
  • Should we just forget about this library-specific stuff and use Dublin Core?

I suspect that there are many others wondering these same things.

The next day there were more interesting talks. One, by Magnus Enger of Libriotech, was entitled "Må MARC dø?" ("Must MARC die?"). The first slide needed little translation. It said simply:

JA! (YES!)

Tom Scott of the BBC gave a visually stunning talk about the data he manages around the BBC's nature and wildlife programming. He explained the reasons for pulling data from a variety of sources, including Wikipedia. (See this page, and note that it encourages readers to improve the Wikipedia entry if they find it incorrect or insufficient.)
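The pattern, roughly, is that each programme or species page has its own URI and states what it is about, with the descriptive text drawn from Wikipedia. Here is a sketch under my own assumptions (these URIs are illustrative, not the BBC's actual data):

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .

# An illustrative species page whose topic is anchored to dbpedia,
# so its description can be pulled from (and corrected in) Wikipedia.
<http://example.org/nature/species/lion>
    foaf:primaryTopic   <http://dbpedia.org/resource/Lion> ;
    dcterms:description "Introductory text drawn from the Wikipedia article." .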

In another excellent talk, which I hope will come out in an English translation, Kim Tallerås and David Massey gave a step-by-step walkthrough of moving from MARC-encoded data to a fully linked data format, complete with URIs. Another talk focused on the Norwegian webDewey from the national library, with examples of converting that data to RDF.
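Both talks point in the same direction: MARC-encoded data re-expressed as RDF statements about a resource with its own URI. A simplified sketch of my own (not their actual mapping):

# MARC source (simplified):
#   100 1# $a Hamsun, Knut
#   245 10 $a Sult
#   260 ## $c 1890

@prefix dcterms: <http://purl.org/dc/terms/> .

# The same data as linked data, with a minted (hypothetical) URI:
<http://example.org/resource/0001>
    dcterms:creator "Hamsun, Knut" ;
    dcterms:title   "Sult" ;
    dcterms:issued  "1890" .

The interesting work, of course, is in the mapping rules and in deciding which of these strings should become URIs.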

About that time I ran out of steam, but I will post a link here when the presentations are up online. In spite of the language barrier, much of the content of these talks is accessible.

As is often the case, I was very impressed by the quality of the experimentation being done by people who really want to see library data transformed and made web-able. I think we are at the start of a new and highly fruitful phase for libraries.

Friday, January 21, 2011

Analysis of MARC fixed fields

I've gone on and on on various lists about trying to analyze MARC as data elements, to the point that I'm sure many people just wish I'd shut up. The best way to shut me up is for me to either finish the analysis or at least take it as far as I can. To that end, I now have a wiki page that gives my analysis of the MARC fixed fields (007-008). (Home page for the project.)

These fields are pretty straightforward since they are by definition already discrete data elements. The only tricky bit is that each data element is tied to a particular resource format (e.g. text, map, video).

I have captured the link between the elements and MARC by creating a key that combines the tag, the format, and the position within the tag:

007microform05

For the actual values I follow the same format (my analysis includes the values, although I skipped things like "no attempt to code"; those could be added in later):

007microform05a
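If I have my MARC right, position 05 of the microform 007 is the reduction ratio range, and the value "a" means low reduction. Expressed in Turtle with an invented namespace (the project's own vocabulary and URIs differ), the element and one of its values might look like this:

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# The element: tag 007, format "microform", character position 05.
<http://example.org/marc/007microform05> a rdf:Property ;
    rdfs:label "Reduction ratio range (007, microform, position 05)"@en .

# One of its values, keyed by appending the MARC code.
<http://example.org/marc/007microform05a> a skos:Concept ;
    skos:prefLabel "Low reduction"@en .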

There is an example of a single element in Turtle and RDF/XML on the wiki page. There is also a link to the whole in Turtle and RDF/XML, which I might as well link to here:

Turtle

RDF/XML

Note that these files, and in fact the whole project, are basically an attempt at a proof of concept. Treat it all as totally BETA and make any comments, suggestions, etc. that you want. If you see this as being useful, feel free to volunteer some time to make it better.

Special thanks to Gordon Dunsire for turning my text into code.

Note:

I've added to that page a link to an HTML display of the first analysis of the 0XX fields. The trick here is to figure out which subfields can stand alone as data elements, and which have dependencies (like "source of code") that require them to be treated as compound elements.
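To show the problem concretely, here is a sketch of one possible compound treatment (my own, not necessarily what the analysis will settle on). MARC field 084 carries a classification number in $a whose meaning depends on the scheme named in $2, so the pair has to travel together. The URIs and the source code are invented:

# MARC source: 084 ## $a 839.82 $2 xyz
# The value and its "source of code" grouped as one compound node:

<http://example.org/resource/0001>
    <http://example.org/marc/084> [
        <http://example.org/marc/084a> "839.82" ;   # the classification number ($a)
        <http://example.org/marc/0842> "xyz"        # the source of the scheme ($2)
    ] .

A subfield with no such dependency could instead be published as a simple, freestanding property.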


Saturday, December 25, 2010

Signs of success

Either this:

[image unavailable]

Or this:

[image: the site reduced to using a raw IP address]