Saturday, June 18, 2011

Opportunity knocks

There will soon be a call for reviews of the draft report by the W3C Incubator Group on Library Linked Data. As a member of that group I have had a hand in writing the draft, and I can tell you that it has been a struggle. Now we seriously need to hear from you, not least because the group is not fully representative of the library world; in fact, it leans heavily toward techy-ness and large libraries and services. We need to hear from a wide range of libraries and librarians: public, small, medium, special, management, people who worry about budgets, people who have face time with users. We also need to hear from the library vendor community, since little can happen with library data that does not involve that community. (Note: a site is being set up to take comments, and I am hoping it will be possible to post anonymously, or at least pseudonymously, for those who cannot appear to be speaking for their employer.)

In thinking about the possibility of moving to a new approach to bibliographic data in libraries, I created a diagram (which will not be in the report; it was just my thinking) that to me represents a kind of needs assessment. This pyramid applies not just to linked data but to any data format we might adopt to take the place of the card catalog markup we use today.

We could use this to address the recent LC announcement on replacing MARC. Here's how I see that analysis, starting with the bottom of the pyramid:
  • Motivation: Our current data model lacks the flexibility that we need, and is keeping us from taking advantage of some modern technologies that could help us provide better user service. Libraries are becoming less and less visible as information providers, in part because our data does not play well on the web, and it is difficult for us to make use of web content.
  • Leadership: Creating a new model is going to take some serious coordination among all of the parties. Who should/could provide that leadership, and how can we fund this effort? Although LC has announced its intention to collaborate, for various reasons a more neutral organization might be desirable, one that is truly global in scope. Yet who can both lead the conversion effort and remain available to provide stability for the long-term maintenance that a library data carrier will require? And how can we be collaborative without being glacially slow?
  • Skills: Many of us went through library school before the term "metadata" was in common usage. We learned to follow the cataloging rules, but not to understand the basic principles of data modeling and creation. This is one of the reasons it is hard for us to change: we are one-trick ponies in the metadata world. The profession needs new skills, and it's not enough for only a few to acquire them: we all need to understand the world we are moving into.
  • Means: This is the really hard one: how do we get the time and funding to make this much-needed change? Both will need to be justified with some clear examples of what we gain by this effort. I favor some demonstration projects, if we can find a way to create them.
  • Opportunity: The opportunity is here now. We could have made this change any time over the past decade or two while cataloging with AACR2, but RDA definitely gives us that golden moment when not changing no longer makes sense.

Tuesday, May 31, 2011

All the ____ in the world

"All the ___ in the world"
"Every ____ ever created"
"World's largest ____ "
"Repository of all knowledge in ____"

There's something compelling about completeness, about the idea that you could gather ALL of something, anything, together into a single system or database or even, as in the ancient library of Alexandria, physical space. Perhaps it's because we want the satisfaction of being finished. Perhaps it's something primitive in our brain stems that has the evolutionary advantage of keeping us from declaring victory with a job half done. (Well, at least some of us.) To be sure, setting your goal to gather all of something means you don't have to make awkward choices about what to gather/keep and what to discard. The indiscriminate everything may be the easier target.

WorldCat has 229,322,364 bibliographic records.
Open Library has over 20 million records and 1.7 million fulltext books.
LibraryThing has records for 6,102,788 unique works.
If you read one book a week for 60 years, you will have read 3,120 books. If you read one book a day for that same length of time, you will have read 21,900 (not counting leap years).
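
(Checking myself with a couple of lines of Python, since nothing undermines a point like bad arithmetic:)

    # one book a week, and one book a day, for 60 years (ignoring leap days)
    print(52 * 60)    # 3,120 books
    print(365 * 60)   # 21,900 books
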
The trick, obviously, is to discover the set of books, articles, etc., that will enhance your brief time on this planet. To do this, we search in these large databases. By having such large databases to search we increase our odds of finding everything in the world about our topic. Of course, we probably do not want everything in the world about our topic; we want the right books (articles, etc.) for us.

There are some downsides to this everything approach, not surprisingly. The first is that any search in a large database retrieves an unwieldy, if not unusable, set of stuff. For this reason, many user interfaces give us ways to reduce the set with additional searches, often in the form of facets. Yet even then one is likely to be overwhelmed.
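
Mechanically, a facet is nothing exotic: it is just a count of the values of one data element across the retrieved set. Here is a minimal sketch in Python, with made-up records, of the counting that sits behind those narrowing links:

    from collections import Counter

    # a hypothetical retrieved set; "subject" and "year" stand in for any
    # data elements an interface might offer as facets
    results = [
        {"title": "A", "subject": "History",   "year": 1999},
        {"title": "B", "subject": "History",   "year": 2003},
        {"title": "C", "subject": "Chemistry", "year": 2003},
    ]

    facets = {field: Counter(r[field] for r in results)
              for field in ("subject", "year")}

    print(facets["subject"].most_common())  # [('History', 2), ('Chemistry', 1)]
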

Everything includes the key works as well as odd bits and pieces of dubious repute and utility. Retrieving everything places a great burden on the user to sort the wheat from the chaff. This is especially difficult when you are investigating an area where you are not an expert. Ranking may highlight the most popular items, but those may not be what you are seeking. In fact, they may be items that you have retrieved before, even multiple times, because every search begins with a tabula rasa.

Another downside is that although computers are more powerful than ever and storage space is inexpensive, these large databases tend to collapse under the demands of just a few complex queries. Because of this, what users can and cannot do is controlled by the user interface, which protects the system by steering users to safe functions. Users can often create their own lists, add tags, and make changes to the underlying data, but they cannot reorder the retrieved set by an arbitrary data element, compare their retrieved set against items they have already saved or seen, or run analyses like topic maps on the set to better understand what is there.

I conclude, therefore, that it would be useful to treat these large databases as warehouses of raw material, and to provide software that allows users to select from them to create a personal database. This personal database software would resemble, ta da!, Vannevar Bush's Memex, a combination database and information use system. I can see it having components that are analogous to some systems we already have.
The personal database would be able to interact with the world of raw material and with other databases. I can imagine functions like: "get me all of the books and articles from this item's bibliography." Or: "compare my library to The Definitive Bibliography of [some topic]." Or: "Check my library and tell me if there are new editions to any of my books." In other words, it's not enough to search and get; in fact, searching and getting should be the least of what we are able to do.
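
As a sketch of how little machinery some of this would take: once both sides are expressed as shared identifiers, "compare my library to The Definitive Bibliography" is just set arithmetic. (The ISBNs below are made up for illustration; a real system would use whatever identifiers the two collections share.)

    # my personal database and a published bibliography, as sets of identifiers
    my_library = {"9780000000017", "9780000000024", "9780000000031"}
    definitive_bibliography = {"9780000000024", "9780000000031", "9780000000048"}

    still_to_read = definitive_bibliography - my_library   # in the bibliography, not in my library
    already_held = definitive_bibliography & my_library    # the overlap

    print(sorted(still_to_read))
    print(sorted(already_held))
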

There are a whole lot of resource management functions that a student or researcher could find useful, because within a selected set there is still much to discover. These smaller, personal databases should also be able to interact with each other, doing comparisons and cross-database queries. We should be able to make notes, create relationships, and share them (a Memex feature). The personal database should be associated with a person, not a particular library or institution, and must work across institutions and services. I can't imagine what it must be like today to graduate and lose not only the privileged access that members of institutions enjoy but also the entire personal space that one created while attached to that institution.

In short, it's not about the STUFF, it's about the services. It doesn't matter how much STUFF you have; it's what people can DO with it. Verb, not noun. Quality, not quantity.

Tuesday, May 24, 2011

From MARC to Principled Metadata

The Library of Congress has announced its intention to "review the bibliographic framework to better accommodate future needs." The translation of this into plain English is that they are (finally!) thinking about replacing the MARC format with something more modern. This is obviously something that desperately needs to be done.

I want to encourage LC and the entire library community to build their future bibliographic data on solid principles. Among these principles would be:

  • Use data, not text. Wherever possible, the stuff of bibliographic description should be computable data, not human-interpretable text. Any part of your metadata that cannot be used in machine algorithms is of limited utility in user services.
  • Give your things identifiers, not language tags. Identification allows you to share meaning without language barriers. Anything that has been identified can be displayed to users in any language of your (or the user's) choice. (A sketch of this follows the list.)
  • Adopt mainstream metadata standards. This applies not only to data formats but also to the data itself. If other metadata creators are using a particular standard list of languages or geographic names, use those same terms. If there are existing metadata elements for common things like colors or sizes or places or [whatever], use those. Work with international communities to extend metadata where necessary, but do not create library-specific versions.
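
To make the first two principles concrete, here is a minimal sketch (in Python, using the rdflib library; the vocabulary URI is invented for illustration) of data built on an identifier rather than on text. The thing itself is a URI; the human-readable labels, in as many languages as you like, hang off of it and are chosen at display time:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDFS

    EX = Namespace("http://example.org/carriers/")   # invented vocabulary URI

    g = Graph()
    g.add((EX.hardback, RDFS.label, Literal("hardback", lang="en")))
    g.add((EX.hardback, RDFS.label, Literal("tapa dura", lang="es")))
    g.add((EX.hardback, RDFS.label, Literal("cartonné", lang="fr")))

    # systems store and exchange only the identifier EX.hardback;
    # display picks the label that matches the user's language
    for label in g.objects(EX.hardback, RDFS.label):
        print(f"{label} ({label.language})")
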

There is much more to be said, and fortunately a great deal of it is being included in the report of the W3C Incubator Group on Library Linked Data. Although it is still in draft form, you can see the current state of that group's recommendations, many of which address the transition that LC appears about to embark on. A version of the report for comments will be available later this summer.

The existence of this W3C group, however, is proof of something very important that the Library of Congress must embrace: bibliographic data is not solely of interest to libraries, and the future of library data should not be created as a library standard but as an information standard. This means that its development must include collaboration with the broader information community, and that collaboration will only be successful if libraries are willing to compromise in order to be part of the greater info-sphere. That's the biggest challenge we face.

Friday, May 13, 2011

Dystopias

In the 1990s I wrote often about information dystopias. In 1994 I said:

It's clear to me that the information highway isn't much about information. It's about trying to find a new basis for our economy. I'm pretty sure I'm not going to like the way information is treated in that economy. We know what kind of information sells, and what doesn't.

In 1995 I painted a surprisingly accurate picture of 2015 that included:

Big boys, like Disney and Time/Warner/Turner put out snippets of their films and have enticed viewers to upgrade their connection to digital movie quality. News programs have truly found their place on the Net, offering up-to-the second views of events happening all over the world, perfectly selected for your interests.... Online shopping allows 3-D views of products and virtual walk-throughs of vacation paradises.

If there were a stock market for cynical investments, I'd be sitting pretty right now. But wait... there's more! Because there's always a future, and therefore more dystopia to predict.

My latest concern is about searching and finding. And of course that means that I am concerned about Google, but in a new context. I have spent the last five years trying to convince libraries that we need to be of the web -- not only on the web but truly web resources. I strongly believe this is the only possible way to keep libraries relevant to new generations of information seekers. This has been interpreted by many as a digitization project that will result in getting the stuff of libraries (books, mainly) onto the web, and getting the metadata about that stuff out of library catalogs and onto the web. HathiTrust, for example, is a massive undertaking that will store and preserve huge amounts of digitized books. The Digital Public Library of America (DPLA), just in its early planning stages today, wants to make all books available to everyone for "free."

All of these are highly commendable projects, but there is a reality that we don't seem to have embraced: searching and finding are as important to the information-seeking process as the actual underlying materials. As we can easily see with Google, the search engine is the gate-keeper to content. If content cannot be found, then it does not exist. And determining what content will be accessed is real power in the information cloud. [cf. Siva Vaidhyanathan, The Googlization of Everything.]

There is a danger that when this mass of library materials becomes of the web, we could entirely lose control of its discovery. But it isn't just a question of library materials; this is true for the entire linked data cloud: who will create the search engine that makes all of that data findable? With its purchase of freebase.com, Google clearly has at least an eye on the linked data space. And of course Google has the money, the servers, and the technology to do this. We know, however, from our experience with the current Google search engine that the application of Google's values to search produces a particular result. We also know that Google's main business model is based on making a connection between searchers and advertisers. [cf. Ken Auletta, Googled.]

It's not enough for libraries to gather, store, and preserve huge masses of information resources. We have to be actively engaged with users and potential users, and that engagement includes providing ways for them to find and to use the resources libraries have. We must provide the entry point that brings users to information materials without that access being mediated through a commercial revenue model. So for every HathiTrust or DPLA that focuses on the resources, we need a related project -- equally well-funded -- that focuses on users and access: not just a traditional library-type catalog, but a whole host of services that will help users find and explore the digital library. This interface needs to be part search engine, part individual work space, and part social network. Users should be able to do their research, store their personal library (getting into Memex territory here), share their work with others, engage in conversations, and perhaps even manage complex research projects. It could be like a combination of Zotero, VIVO, Zoho, Yahoo Pipes, Dabble, and MIT's OpenCourseWare.

Really, if we don't do this, the future of libraries and research will be decided by Google. There, I said it.

Sunday, April 24, 2011

Visualizing linked data

Chris Oliver, Diane Hillmann and I will be reprising (and updating) our three-part webinar on RDA and the future of library metadata starting on May 11. As before, Chris will cover the principles behind RDA and why RDA is different from other cataloging codes; I will talk about the Semantic Web and why it is important for libraries to be part of the web of data (May 18); Diane will show how the Open Metadata Registry makes possible a Semantic Web-compatible version of RDA (May 25).

One of the questions I always get when talking about the Semantic Web is "What does it look like?" This is kind of like asking what electricity looks like: it doesn't so much look like anything as make certain things possible. But I fully understand that people need to see something for this all to make sense, so when the webinar technology allows it I have started showing some web pages. When it doesn't, I send people to links they can explore on their own. Since some of you may have this same question, here are a few illustrations using two sites that can present authors in a Semantic Web form.

When you do a search for an author on the Open Library you retrieve a page for the author. This is a page for the author Barbara Cartland. The page has not been hand-coded by a human but is derived "on the fly" from the information in the Open Library database.

That same information is available in a Semantic Web format: RDF in XML. (Note: it is common to code Semantic Web data in XML, but that's not the only possible data format. There is nothing inherent in the Semantic Web that would make it XML-like; it's just a convenience.) This is not intended to be human-friendly -- it is code to be used by programs. You should notice that it makes use of identifiers that look like URLs:

<foaf:Person rdf:about="http://openlibrary.org/authors/OL22022A"></foaf:Person>
The above establishes the primary identifier for all of the information that follows in the XML.
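
To underline that the XML is only a packaging, here is a small sketch (Python with the rdflib library; the author URI is the Open Library identifier above, the name literal is my own example) that builds a single triple and serializes it two different ways. The triples are the data; the syntax is interchangeable:

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import FOAF

    g = Graph()
    author = URIRef("http://openlibrary.org/authors/OL22022A")
    g.add((author, FOAF.name, Literal("Barbara Cartland")))

    # same graph, two serializations (older rdflib versions return bytes)
    print(g.serialize(format="xml"))      # RDF/XML, as in the example above
    print(g.serialize(format="turtle"))   # Turtle, a more readable syntax
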

You will also see that, like other applications using XML, it allows you to mix data elements from different "namespaces." The Open Library RDF uses a mix of elements from Dublin Core, Friend-of-a-Friend (FOAF), the Bibliographic Ontology, and RDA Vocabularies.
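
If you want to see that mix for yourself, a short script can list the namespaces of all the predicates in the record. (This is a sketch, not a supported API, and it assumes Open Library still serves the RDF/XML at the .rdf form of the author URL:)

    from rdflib import Graph

    def namespace_of(uri: str) -> str:
        # crude split on the last '#' or '/'; good enough for a quick look
        sep = "#" if "#" in uri else "/"
        return uri.rsplit(sep, 1)[0] + sep

    g = Graph()
    g.parse("http://openlibrary.org/authors/OL22022A.rdf", format="xml")

    # expect Dublin Core, FOAF, Bibliographic Ontology, and RDA namespaces
    for ns in sorted({namespace_of(str(p)) for p in g.predicates()}):
        print(ns)
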

Another database that provides its data in RDF is the Virtual International Authority File, VIAF. VIAF combines the name authority data from about twenty national authority files, making it possible to translate between different name display forms when exchanging data. Here is part of the VIAF display for Barbara Cartland:


You can retrieve or export the metadata for this author in various formats, including MARC and RDF/XML. Once again you will see that the RDF form of the data makes use of FOAF, a standard called "Simple Knowledge Organization System" or SKOS, and the RDA vocabularies for the FRBR Group 2 entities from the Open Metadata Registry.

You can look at more examples on my links page, but I hope that this takes some of the mystery out of Semantic Web data, or at least makes the mystery a known rather than unknown puzzler.

Monday, April 18, 2011

FRBR as cake

I keep trying to explain what bothers me about FRBR, and in particular about WEMI. I've recently thought about it with this image of a cake. I know this is a flawed analogy, but it works for me on some level. It goes like this:

When you make a cake, you have a number of ingredients:

[Image: separate baking ingredients -- eggs, sugar, flour]

When you mix them together to make a cake you don't get this:

[Image: those same ingredients, still sitting side by side]

You get this:

[Image: a finished cake]

My point here, in case it isn't clear, is that the purpose of creating a bibliographic description using a number of different entities is to... well, to create a bibliographic description; something that as a whole has meaning. You can create it from individual "ingredients," like information about a Work and an Expression, but those do not need to remain separate entities in your final product; instead, that information can become part of your whole.

I know that people like the idea of a distributed bibliographic description with a single Work entity that links to many Expressions that then link to many Manifestations, etc., and that could be the underlying structure of one's data store. But just because there are Work entities (eggs) doesn't mean that our metadata keeps the Work entity "intact." In fact, our systems may use only a portion of the Work entity, and may use bits of it at different times in different contexts.

Leaving poorly-drawn analogies aside, creating our data as sets (or "graphs") of triples should give us maximum flexibility. One thing this means is that even a partial description is valid. Thus a full library catalog record and an abbreviated citation are both valid representations of a resource. They should connect to the larger linked data information space through any of the statements they contain, regardless of the structure of their graphs. And it is my guess that many bibliographic descriptions will be simple graphs with a single RDF subject (that means a single bibliographic resource). The highly structured bibliographic universe of FRBR will be a minority case, and the FRBR entities, like our eggs and sugar and flour, will be useful ingredients that disappear into actual creations.
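
For the flavor of that point, here is what such a simple graph looks like in practice: a brief citation as three triples about a single subject (sketched with Python's rdflib; the URI and values are illustrative, not real records):

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DC

    book = URIRef("http://example.org/resource/1")   # made-up identifier

    g = Graph()
    g.add((book, DC.title,   Literal("A Hazard of Hearts")))
    g.add((book, DC.creator, Literal("Barbara Cartland")))
    g.add((book, DC.date,    Literal("1949")))

    # even this partial description is valid linked data: any one of its
    # statements can connect it to the larger graph
    print(g.serialize(format="turtle"))
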

Thursday, March 24, 2011

Open Data II

In this post I want to talk about some of the Open Government Data (OGD) projects taking place around the world.

Open government data is taken as a given by many in the US because our copyright law states that federal government data is not covered by copyright. (The situation in US states can vary, but the federal government's declaration sets the tone.) In other countries the situation is less clear, and governments do not have a mandate to make data open. However, the open government data movement has spurred a number of fast-moving activities, many sponsored by governments themselves, that encourage citizens to download and use government data.

The UK government has a site, Opening up government, where it not only shares data but encourages people to develop apps that use the data. Apps here can alert you to new building and planning projects in your area, and give you real-time public transportation information.

The EU has its own Open Government Data Initiative. It provides the data under these terms of use:
All Data on dev.govdata.eu is available under a worldwide, royalty-free, non-exclusive license to use, modify, and distribute the datasets in all current and future media and formats for any lawful purpose and that this license does not give you a copyright or other proprietary interest in the datasets.
There is a European site for public sector information, the European Public Sector Information Platform: Europe's One-Stop Shop on Public Sector Information Re-use. You can search by country and see news and developments relating to public data, much of which is available for re-use. Because many countries do not have an explicit statement in their copyright laws covering government data, one of the important early steps for these jurisdictions is to develop blanket licenses that they can apply to the data. So when you visit the site you see recent news that Norway has developed a license for its government data and is asking for feedback (if you read Norwegian).

As a measure of the force of this movement, even Albania and Bulgaria are said to be on the verge of opening some government data.

The Obama administration announced its Open Government effort on its first day in office.
To the extent practicable and subject to valid restrictions, agencies should publish information online in an open format that can be retrieved, downloaded, indexed, and searched by commonly used web search applications. An open format is one that is platform independent, machine readable, and made available to the public without restrictions that would impede the re-use of that information.
Wired has a US-oriented "how-to" wiki on OGD. (Being Wired, they of course include MarijuanaLobby.org among their "how-to" examples, but it's a good illustration of the range of utility of OGD.)

Not all data is at the country level, of course, and the movement is reaching into lower levels of government. Paris has an open data portal, while Enschede, Netherlands, has an open data declaration for its information. In Italy, the government of the Piemonte Region has a website for its open data.

The government open data movement is heavily influenced by grassroots efforts to convince governments that open data is a good thing -- not just for government watchdogs and opposition movements, but for healthy government and strong business. In the UK there is a Working Group on Open Government Data of the Open Knowledge Foundation, an independent not-for-profit that promotes, as its name says, open knowledge. In Italy there is the wonderfully named "Spaghetti Open Data." Spain has a broad coalition of non-profits that form the "Coalición Pro Acceso." The CKAN web site, which is a general archive of available datasets of all kinds, has OGD under a number of tags, such as "gov". [Just out: Open Government Data video.]
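
CKAN installations share a simple JSON API, so pulling a list of government datasets takes only a few lines. (A sketch against a placeholder site: substitute the base URL of any CKAN instance, and note that tag names vary from site to site.)

    import json
    import urllib.request

    CKAN_BASE = "http://ckan.example.org"   # placeholder: any CKAN instance
    url = CKAN_BASE + "/api/3/action/package_search?fq=tags:gov&rows=5"

    with urllib.request.urlopen(url) as response:
        result = json.load(response)["result"]

    print(result["count"], "datasets tagged 'gov'")
    for package in result["results"]:
        print("-", package["title"])
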

We hear a lot about problems with copyright, with DRM, with information providers who want to lock down their products. Government data covers a huge variety of information types and is often the key information needed for a lot of civic and scientific decision-making. OGD can generate a mountain of new knowledge, and then tell you how high the mountain is.