Coyle's InFormation: 05/01/2011

Tuesday, May 31, 2011

All the ____ in the world

"All the ___ in the world"
"Every ____ ever created"
"World's largest ____ "
"Repository of all knowledge in ____"

There's something compelling about completeness, about the idea that you could gather ALL of something, anything, together into a single system or database or even, as in the ancient library of Alexandria, physical space. Perhaps it's because we want the satisfaction of being finished. Perhaps it's something primitive in our brain stems that has the evolutionary advantage of keeping us from declaring victory with a job half done. (Well, at least some of us.) To be sure, setting your goal to gather all of something means you don't have to make awkward choices about what to gather/keep and what to discard. The indiscriminate everything may be the easier target.

Worldcat has 229,322,364 bibliographic records.
OpenLibrary has over 20 million records and 1.7 million fulltext books.
LibraryThing has records for 6,102,788 unique works.

If you read one book a week for 60 years, you will have read 3,120 books. If you read one book a day for that same length of time, you will have read 21,900 (not counting leap years).
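The arithmetic is easy to check. Here is a minimal sketch in Python, assuming 52 weeks and 365 days to the year; the names are mine, for illustration only, and the WorldCat figure is the one quoted above.

```python
WEEKS_PER_YEAR = 52
DAYS_PER_YEAR = 365  # leap years ignored
READING_YEARS = 60

def lifetime_books(books_per_year, years=READING_YEARS):
    """Total books read at a steady pace over the given number of years."""
    return books_per_year * years

weekly_reader = lifetime_books(WEEKS_PER_YEAR)  # 52 * 60
daily_reader = lifetime_books(DAYS_PER_YEAR)    # 365 * 60

worldcat_records = 229_322_364
share = daily_reader / worldcat_records

print(weekly_reader, daily_reader)
print(f"share of WorldCat for the daily reader: {share:.4%}")
```

Even the book-a-day reader gets through roughly one hundredth of one percent of the records in WorldCat.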

The trick, obviously, is to discover the set of books, articles, etc., that will enhance your brief time on this planet. To do this, we search in these large databases. By having such large databases to search we are increasing our odds of finding everything in the world about our topic. Of course, we probably do not want everything in the world about our topic; we want the right books (articles, etc.) for us.

There are some downsides to this everything approach, not surprisingly. The first is that any search in a large database retrieves an unwieldy, if not unusable, large set of stuff. For this reason, many user interfaces give us ways to reduce the set using additional searches, often in the form of facets. Yet even then one is likely to be overwhelmed.

Everything includes key works and the odd bits and pieces of dubious repute and utility. Retrieving everything places a great burden on the user to sort out the wheat from the chaff. This is especially difficult when you are investigating an area where you are not an expert. Ranking may highlight the most popular items but those may not be what you are seeking. In fact, they may be items that you have retrieved before, even multiple times, because every search begins with a tabula rasa.

Another downside is that although computers are more powerful than ever and storage space is inexpensive, these large databases tend to collapse under the demands of just a few complex queries. Because of this, what users can and cannot do is controlled by the user interface, which serves to protect the system by steering users to safe functions. Users often can create their own lists, can add tags, can make changes to the underlying data, but they cannot reorder the retrieved set by an arbitrary data element, they can't compare their retrieved set against items they have already saved or seen previously, and they can't run analyses like topic maps on their retrieved set to better understand what is there.
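None of these missing functions is hard once the retrieved set is in the user's own hands. A rough sketch in Python, assuming each retrieved item arrives as a plain dictionary with an "id" key; the record shape and the function names are hypothetical, not taken from any real system.

```python
def unseen(results, seen_ids):
    """Keep only the retrieved items the user has not saved or seen before."""
    return [r for r in results if r["id"] not in seen_ids]

def reorder(results, field, reverse=False):
    """Sort the retrieved set by any data element; items lacking it go last."""
    have = [r for r in results if field in r]
    lack = [r for r in results if field not in r]
    return sorted(have, key=lambda r: r[field], reverse=reverse) + lack

# Invented sample data, for illustration only.
results = [
    {"id": "a", "year": 2001},
    {"id": "b"},
    {"id": "c", "year": 1995},
]
seen_before = {"a"}

print([r["id"] for r in unseen(results, seen_before)])
print([r["id"] for r in reorder(results, "year")])
```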

I conclude, therefore, that what would be useful would be to treat these large databases as warehouses or raw materials, and provide software that allows users to select from these to create a personal database. This personal database software would resemble, ta da!, Vannevar Bush's Memex, a combination database and information use system. I can see it having components that are analogous to some systems we already have:

automated download of data from the big warehouses, like LibraryThing
an easy visual way to do interesting queries, like Yahoo! Pipes
the ability to ask questions, like Wolfram Alpha

The personal database would be able to interact with the world of raw material and with other databases. I can imagine functions like: "get me all of the books and articles from this item's bibliography." Or: "compare my library to The Definitive Bibliography of [some topic]." Or: "Check my library and tell me if there are new editions to any of my books." In other words, it's not enough to search and get; in fact, searching and getting should be the least of what we are able to do.
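To make a couple of these functions concrete, here is a minimal sketch in Python. It assumes every edition carries a shared identifier for the work it belongs to; the Edition class, the identifiers, and the sample data are all invented for illustration and describe no existing service.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edition:
    work_id: str  # hypothetical shared identifier for the work
    title: str
    year: int

def missing_from_library(library, bibliography):
    """'Compare my library to The Definitive Bibliography': works I lack."""
    owned = {e.work_id for e in library}
    return [e for e in bibliography if e.work_id not in owned]

def newer_editions(library, warehouse):
    """'Tell me if there are new editions to any of my books.'"""
    held = {}
    for e in library:
        # Remember the most recent edition held for each work.
        held[e.work_id] = max(held.get(e.work_id, e.year), e.year)
    return [e for e in warehouse
            if e.work_id in held and e.year > held[e.work_id]]

# Invented sample data.
my_library = [Edition("w1", "Title A", 1990), Edition("w2", "Title B", 2001)]
bibliography = [Edition("w1", "Title A", 1990), Edition("w3", "Title C", 2005)]
warehouse = [
    Edition("w2", "Title B", 1995),
    Edition("w2", "Title B", 2010),
    Edition("w4", "Title D", 2011),
]

print(missing_from_library(my_library, bibliography))
print(newer_editions(my_library, warehouse))
```

Both functions are only set comparisons over identifiers, which is the point: once the data is local and identified, searching and getting really are the least of it.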

There are a whole lot of resource management functions that a student or researcher could find useful because within a selected set there is still much to discover. These smaller, personal databases should also be able to interact with each other, doing comparisons and cross-database queries. We should be able to make notes and create relationships and share them (a Memex feature). The personal database should be associated with a person, not a particular library or institution, and must work across institutions and services. I can't imagine what it must be like today to graduate and to lose not only the privileged access that members of institutions enjoy but also the entire personal space that one has created while attached to that institution.

In short, it's not about the STUFF, it's about the services. It doesn't matter how much STUFF you have; it's what people can DO with it. Verb, not noun. Quality, not quantity.

Tuesday, May 24, 2011

From MARC to Principled Metadata

Library of Congress has announced its intention to "review the bibliographic framework to better accommodate future needs." The translation of this into plain English is that they are (finally!) thinking about replacing the MARC format with something more modern. This is obviously something that desperately needs to be done.

I want to encourage LC and the entire library community to build its future bibliographic data on solid principles. Among these principles would be:

Use data, not text. Wherever possible, the stuff of bibliographic description should be computable data, not human-interpretable text. Any part of your metadata that cannot be used in machine algorithms is of limited utility in user services.
Give your things identifiers, not language tags. Identification allows you to share meaning without language barriers. Anything that has been identified can be displayed in language terms to users in any language of your (or the user's) choice.
Adopt mainstream metadata standards. This is not only for the data formats but also in terms of the data itself. If other metadata creators are using a particular standard language list or geographic names, use those same terms. If there are metadata elements for common things like colors or sizes or places or [whatever], use those. Work with international communities to extend metadata if necessary, but do not create library-specific versions.
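The first two principles can be illustrated with a small sketch in Python. The record layout, field names, and label table below are assumptions made up for the example; the only borrowed piece is the use of ISO 639-1 two-letter codes ("fr", "de") as language identifiers.

```python
# Text-style description: fine for a human reader, awkward for a program.
text_record = {"language": "French", "extent": "356 p.", "date": "c1999."}

# Data-style description: an identifier for the language, numbers for the rest.
data_record = {"language": "fr", "pages": 356, "year": 1999}

# Display labels live outside the record, keyed by identifier.
LABELS = {
    "fr": {"en": "French", "fr": "français", "de": "Französisch"},
    "de": {"en": "German", "fr": "allemand", "de": "Deutsch"},
}

def display_language(record, ui_lang):
    """Show the identified language in whatever language the user prefers."""
    return LABELS[record["language"]][ui_lang]

def published_before(records, year):
    """A trivial computation that text like 'c1999.' cannot support directly."""
    return [r for r in records if r["year"] < year]

print(display_language(data_record, "en"))
print(display_language(data_record, "de"))
print(published_before([data_record], 2000))
```

The same identified record serves an English-speaking and a German-speaking user without any change to the metadata itself.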

There is much more to be said, and fortunately a great deal of it is being included in the report of the W3C Incubator Group on Library Linked Data. Although still in draft form you can see the current state of that group's recommendations, many of which address the transition that LC appears to be about to embark on. A version of the report for comments will be available later this summer.

The existence of this W3C group, however, is the proof of something very important that the Library of Congress must embrace: that bibliographic data is not solely of interest to libraries, and the future of library data should not be created as a library standard but as an information standard. This means that its development must include collaboration with the broader information community, and that collaboration will only be successful if libraries are willing to compromise in order to be part of the greater info-sphere. That's the biggest challenge we face.

Friday, May 13, 2011

Dystopias

In the 1990s I wrote often about information dystopias. In 1994 I said:

It's clear to me that the information highway isn't much about information. It's about trying to find a new basis for our economy. I'm pretty sure I'm not going to like the way information is treated in that economy. We know what kind of information sells, and what doesn't.

In 1995 I painted a surprisingly accurate picture of 2015 that included:

Big boys, like Disney and Time/Warner/Turner put out snippets of their films and have enticed viewers to upgrade their connection to digital movie quality. News programs have truly found their place on the Net, offering up-to-the second views of events happening all over the world, perfectly selected for your interests....Online shopping allows 3-D views of products and virtual walk-throughs of vacation paradises.

If there were a stock market for cynical investments, I'd be sitting pretty right now. But wait... there's more! Because there's always a future, and therefore more dystopia to predict.

My latest concern is about searching and finding. And of course that means that I am concerned about Google, but this is in a new context. I have spent the last five years trying to convince libraries that we need to be of the web -- not only on the web but truly web resources. I strongly believe this is the only possible way to keep libraries relevant to new generations of information seekers. This has been interpreted by many as a digitization project that will result in getting the stuff of libraries (books mainly) onto the web, and getting the metadata about that stuff out of library catalogs and onto the web. HathiTrust, for example, is a massive undertaking that will store and preserve huge amounts of digitized books. The Digital Public Library of America (DPLA), just in its early planning stages today, wants to make all books available to everyone for "free."

All of these are highly commendable projects, but there is a reality that we don't seem to have embraced, and that is that searching and finding are as important to the information seeking process as the actual underlying materials. As we can easily see with Google, the search engine is the gate-keeper to content. If content cannot be found then it does not exist. And determining what content will be accessed is real power in the information cloud. [cf. Siva Vaidhyanathan, Googlization of Everything.]

There is a danger that when this mass of library materials becomes of the web, we could entirely lose control of its discovery. But it isn't just a question of library materials; this is true for the entire linked data cloud: who will create the search engine that makes all of that data findable? With its purchase of freebase.com, it is clear that Google has at least an eye on LD space. And of course Google has the money, the servers, the technology to do this. We know, however, from our experience with the current Google search engine that the application of Google's values to search produces a particular result. We also know that Google's main business model is based on making a connection between searchers and advertisers. [cf. Ken Auletta, Googled.]

It's not enough for libraries to gather, store and preserve huge masses of information resources. We have to be actively engaged with users and potential users, and that engagement includes providing ways for them to find and to use the resources libraries have. We must provide the entry point that brings users to information materials without that access being mediated through a commercial revenue model. So for every HathiTrust or DPLA that focuses on the resources we need a related project -- equally well-funded -- that focuses on users and access. Not just creating a traditional library-type catalog but providing a whole host of services that will help users find and explore the digital library. This interface needs to be part search engine, part individual work space, and part social networking. Users should be able to do their research, store their personal library (getting into Memex territory here), share their work with others, engage in conversations, and perhaps even manage complex research projects. It could be like a combination of Zotero, VIVO, Zoho, Yahoo! Pipes, Dabble, and MIT's OpenCourseWare.

Really, if we don't do this, the future of libraries and research will be decided by Google. There, I said it.