Sunday, July 31, 2011

User-friendliness, a lesson

I was looking for Melvil Dewey's first published version of his classification system. My first instinct was to head to Google Book Search, but I decided instead to use HathiTrust as a kind of gesture toward non-commercial access. I did find what I was looking for, his 1876 pamphlet, opened it in their reader, and looked through it. I knew I'd want a copy, so I found the "download as PDF" link. That popped up a box telling me to "Login to determine whether you can download this book." The copyright is listed as "public domain in the United States," so I don't see why I need to log in, and I downloaded it from GBS instead -- without logging in, but adding to the slime trail of my life that Google owns. The added step of logging in (which begins with creating yet another login on a system I will use only occasionally), however much it may be no more or even less invasive of my privacy, is not user-friendly. It also didn't make sense to me at the time, and I was given nothing to convince me that logging in was beneficial ... to me.

Yes, it's all about ME, me the user, me the person at the other end of the connection. I'm also not just any user, I am an advocate of libraries, a librarian, and I made the effort to go to HathiTrust -- a site that has not shown up for me in search engines.

This seems to be such a basic lesson that I do not understand why libraries can't learn it. User-friendliness.


Ooof! It just gets worse. I decided to see what login is about. To get to login, you have to search, select a book, and click on login. On the book page, you may see that a book is "Public Domain," or it may say "Public Domain, Google-digitized". When you log in, you log in either as someone from a member institution or as a guest. The guest login form states:
Does NOT provide access to full PDF downloads of public domain & open access items where not publicly available
However, it turns out that it DOES provide access to PD books (see comment by anonymous) if the book was not digitized by Google -- but that isn't what you've been told. "... not publicly available" isn't what you see on the book page; there you see "Google-digitized." The page on policies has two different categories, "Open Access" and "Open Access, Google-digitized," and nothing in the definitions of those categories mentions member and guest downloading.

Basically, HathiTrust turns out to be a tiered system with member and non-member access. You don't encounter this until you try to download something that is PD but not "publicly available." Nothing on the home page mentions that this is a member-based service, so you don't know that as a non-member you will encounter walls.

OK, it is resolved, that from now on I will always go first to the Open Library, a site where Open means what I think it should.

Monday, July 25, 2011

RDA in XML - why not give it a shot?

Example of RDA in XML / Example2 of RDA in XML

There's a lot of talk about what we will do with RDA as data - what format we will use, how it will look to users, etc etc etc. In fact, the options are legion. The key point is that we don't have to decide on just ONE WAY to carry and store RDA data elements, as long as we follow a few rules.

As an experiment, I have coded a very simple bibliographic record using two different possible ways to encode RDA in XML. For the XML data elements I use the RDA elements from the Open Metadata Registry. These elements are defined in OWL, and therefore are compatible with semantic web applications. Their use in XML (and by that I mean non-RDF XML) may be a bit questionable, yet at the same time XML may be a good transition format from our current data to a full RDF-based implementation. I created two XML files: one in which I used text values, much as one would in MARC, and one in which I used URIs for values that have been encoded as vocabularies. Neither has a schema, because creating a schema for RDA is a huge undertaking. If there is interest in this method, however, it might be worth... undertaking.

The resulting files don't fit well in a blog post, so I created a page with a side-by-side comparison. Please have a look. Feel free to comment or send me suggestions, corrections, or other ideas on how to do this better.
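As a rough sketch of the two styles described above, here is how the same record might be built programmatically. The namespace and element names below are placeholders, not the actual element URIs registered in the Open Metadata Registry; the point is only the contrast between a literal text value and a URI value for a controlled vocabulary term.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace standing in for the RDA element set in the
# Open Metadata Registry; the real element URIs differ.
RDA = "http://example.org/rda/elements/"

def record_with_text_values():
    """Style 1: literal text values, much as one would in MARC."""
    rec = ET.Element("record")
    ET.SubElement(rec, f"{{{RDA}}}titleProper").text = (
        "A classification and subject index")
    ET.SubElement(rec, f"{{{RDA}}}contentType").text = "text"
    return rec

def record_with_uri_values():
    """Style 2: controlled values carried as URIs, not literals."""
    rec = ET.Element("record")
    ET.SubElement(rec, f"{{{RDA}}}titleProper").text = (
        "A classification and subject index")
    ct = ET.SubElement(rec, f"{{{RDA}}}contentType")
    # Invented vocabulary URI; a real record would point at the
    # registered RDA content type term.
    ct.set("valueURI", "http://example.org/rda/content/text")
    return rec

print(ET.tostring(record_with_text_values(), encoding="unicode"))
print(ET.tostring(record_with_uri_values(), encoding="unicode"))
```

Note that in the second style the title remains a text value (titles are inherently textual), while the content type becomes a URI that a machine can match without caring what language the label is in.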

Wednesday, July 20, 2011

Unequal Access

With the recent indictment of an advocate for open information access who had set up a way to download about 4 million JSTOR articles, presumably with the intent to liberate them from their native closed access, we need to step back and look at how unequal information access is in this world. In major universities in the US, academics and students log on to their computers in their offices or at home and a whole world opens up to them. That's not some kind of accident. The prime goal of university libraries is to make good on "seek and ye shall find." The proof of the success of these libraries is that researchers are oblivious to the complexity of the system that serves them. I would guess that many members of the US university community have no idea how their access to journals is managed and controlled. They don't see the contract negotiations with information providers, the continual development of software that makes single-point searching possible, the multi-faceted delivery systems that blend (or attempt to) digital and paper resources into a single stream. And they don't think about how different it would be if they weren't members of that privileged community.

Contrast that to the access available to a member of the US public who is not part of this academic sector. Like myself. Like the majority of people in this country. There is no access to JSTOR. No openURL server gives me multiple access options. The local public library does have some electronic materials, but these are much less extensive (and less expensive) than the ones in academic libraries. I may have to wait weeks to get a book that isn't in my local library's collection, if I can get it at all. I am often in the embarrassing position of not being able to access articles that I would like to read or quote from, including ones that I myself have authored.

In spite of this, I know that my information access, as a mere member of the US public, is far superior to that found in other countries; countries where serious researchers struggle to participate in research because they do not have the access that many academics here take for granted. Two anecdotes:

-- When I lived in Italy in the 1970s, my friends were mainly college students or recent graduates. University education was free, but it was generally accepted that the only way to complete one's final thesis was to be able to afford to go abroad for two or three months. The purpose of this trip was to spend time in a country with a good library system, since libraries in Italy were limited. This was true not just for students studying foreign literatures, but even for those studying sciences, history, and art. These kids were essentially "library tourists." I don't know if this continues today.

-- During the time I worked at UC I was in a conversation with someone involved in the licensing of databases. For some reason we got talking about enforcement of contractual clauses having to do with excessive downloading and/or piracy. This person told me that all access to one of the UC campuses had recently been cut off for a few days because it was discovered that someone was systematically downloading entire journal runs. The culprit turned out to be a foreign graduate student who would soon be returning home. Knowing that leaving the UC system would mean losing access to the journals he would need to continue his research, he was making himself a copy to take home.

It occurs to me as I write this that the "Digital Public Library of America" could create an information revolution in this country by upgrading the access of the general public to that of an academic or student in a large college or university, without ever digitizing a single page. What makes Stanford "Stanford" or Harvard "Harvard" is not just its famed faculty but the full range of information that is shared by that community. Everything they do, every bit of research, every new idea, is facilitated by the library and its services.

The information access gap between a university researcher and the average person on the street is immense. We have an information elite that, like most elites, considers its position to be earned, just, and reasonable. Few in academia worry that the access they have isn't widely shared. If they did, they would hopefully decide that something should be done.

Saturday, June 18, 2011

Opportunity knocks

There will soon be a call for reviews of the draft report by the W3C Incubator Group on Library Linked Data. As a member of that group I have had a hand in writing that draft, and I can tell you that it has been a struggle. Now we seriously need to hear from you, not least because the group is not fully representative of the library world; in fact, it leans heavily toward techy-ness and large libraries and services. We need to hear from a wide range of libraries and librarians: public, small, medium, special, management, people who worry about budgets, people who have face time with users. We also need to hear from the library vendor community, since little can happen with library data that will not involve that community. (Note: a site is being set up to take comments, and I am hoping it will be possible to post anonymously or at least pseudonymously, for those who cannot appear to be speaking for their employer.)

In thinking about the possibility of moving to a new approach to bibliographic data in libraries, I created this diagram (which will not be in the report; it was just my thinking) that to me represents a kind of needs assessment. This pyramid is not just related to linked data but to any data format that we might adopt to take the place of the card catalog mark-up that we use today.

We could use this to address the recent LC announcement on replacing MARC. Here's how I see that analysis, starting with the bottom of the pyramid:
  • Motivation: Our current data model lacks the flexibility that we need, and is keeping us from taking advantage of some modern technologies that could help us provide better user service. Libraries are becoming less and less visible as information providers, in part because our data does not play well on the web, and it is difficult for us to make use of web content.
  • Leadership: Creating a new model is going to take some serious coordination among all of the parties. Who should/could provide that leadership, and how can we fund this effort? Although LC has announced its intention to collaborate, for various reasons a more neutral organization might be desirable, one that is truly global in scope. Yet who can both lead the conversion effort and remain available to provide stability for the long-term maintenance that a library data carrier will require? And how can we be collaborative without being glacially slow?
  • Skills: Many of us went through library school before the term "metadata" was in common usage. We learned to follow the cataloging rules, but not to understand the basic principles of data modeling and creation. This is one of the reasons why it is hard for us to change: we are one-trick ponies in the metadata world. The profession needs new skills, and it's not enough for only a few to acquire them: we all need to understand the world we are moving into.
  • Means: This is the really hard one: how do we get the time and funding to make this much-needed change? Both will need to be justified with some clear examples of what we gain by this effort. I favor some demonstration projects, if we can find a way to create them.
  • Opportunity: The opportunity is here now. We could have made this change any time over the past decade or two while cataloging with AACR2, but RDA definitely gives us that golden moment when not changing no longer makes sense.

Tuesday, May 31, 2011

All the ____ in the world

"All the ___ in the world"
"Every ____ ever created"
"World's largest ____ "
"Repository of all knowledge in ____"

There's something compelling about completeness, about the idea that you could gather ALL of something, anything, together into a single system or database or even, as in the ancient library of Alexandria, physical space. Perhaps it's because we want the satisfaction of being finished. Perhaps it's something primitive in our brain stems that has the evolutionary advantage of keeping us from declaring victory with a job half done. (Well, at least some of us.) To be sure, setting your goal to gather all of something means you don't have to make awkward choices about what to gather/keep and what to discard. The indiscriminate everything may be the easier target.

Worldcat has 229,322,364 bibliographic records.
OpenLibrary has over 20 million records and 1.7 million fulltext books.
LibraryThing has records for 6,102,788 unique works.
If you read one book a week for 60 years, you will have read 3,120 books. If you read one book a day for that same length of time, you will have read 21,900 (not counting leap years).
The trick, obviously, is to discover the set of books, articles, etc., that will enhance your brief time on this planet. To do this, we search in these large databases. By having such large databases to search we are increasing our odds of finding everything in the world about our topic. Of course, we probably do not want everything in the world about our topic, we want the right books (articles, etc.) for us.

There are some down sides to this everything approach, not surprisingly. The first is that any search in a large database retrieves an unwieldy, if not unusable, large set of stuff. For this reason, many user interfaces give us ways to reduce the set using additional searches, often in the form of facets. Yet even then one is likely to be overwhelmed.
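The facet mechanism mentioned above is, at bottom, just counting and filtering. A toy illustration, with an invented three-record result set standing in for the millions a real search would return:

```python
from collections import Counter

# A tiny stand-in for a retrieved set; real systems compute facets
# over millions of records at query time.
results = [
    {"title": "A", "subject": "history", "year": 1990},
    {"title": "B", "subject": "history", "year": 2001},
    {"title": "C", "subject": "science", "year": 2001},
]

# Tally the values of one field across the set -- these counts are
# what the interface displays as clickable facets.
subject_facets = Counter(r["subject"] for r in results)

# Clicking a facet value narrows the set with a filter.
narrowed = [r for r in results if r["subject"] == "history"]
```

Each click replaces one large set with a smaller one, which is why facets help but, as noted, may still leave the user with more than they can absorb.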

Everything includes key works and the odd bits and pieces of dubious repute and utility. Retrieving everything places a great burden on the user to sort out the wheat from the chaff. This is especially difficult when you are investigating an area where you are not an expert. Ranking may highlight the most popular items but those may not be what you are seeking. In fact, they may be items that you have retrieved before, even multiple times, because every search begins with a tabula rasa.

Another down side is that although computers are more powerful than ever and storage space is inexpensive, these large databases tend to collapse under the demands of just a few complex queries. Because of this, what users can and cannot do is controlled by the user interface, which serves to protect the system by steering users to safe functions. Users often can create their own lists, add tags, and make changes to the underlying data, but they cannot reorder the retrieved set by an arbitrary data element, they can't compare their retrieved set against items they have already saved or seen previously, and they can't run analyses like topic maps on their retrieved set to better understand what is there.

I conclude, therefore, that what would be useful would be to treat these large databases as warehouses of raw materials, and provide software that allows users to select from these to create a personal database. This personal database software would resemble, ta da!, Vannevar Bush's Memex, a combination database and information use system. I can see it having components that are analogous to some systems we already have.
The personal database would be able to interact with the world of raw material and with other databases. I can imagine functions like: "get me all of the books and articles from this item's bibliography." Or: "compare my library to The Definitive Bibliography of [some topic]." Or: "Check my library and tell me if there are new editions to any of my books." In other words, it's not enough to search and get; in fact, searching and getting should be the least of what we are able to do.
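Those functions are easy to sketch in outline. Everything below is hypothetical (the class, the method names, the toy records); a real Memex-style tool would draw its "warehouse" data from services like WorldCat or Open Library rather than from in-memory lists.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Work:
    """A drastically simplified bibliographic record."""
    title: str
    author: str
    year: int

class PersonalLibrary:
    """Hypothetical personal database selected from larger warehouses."""

    def __init__(self):
        self.works = set()

    def add(self, work):
        self.works.add(work)

    def missing_from(self, bibliography):
        """'Compare my library to The Definitive Bibliography of X':
        which listed works do I not yet have?"""
        return set(bibliography) - self.works

    def newer_editions(self, warehouse):
        """'Tell me if there are new editions of any of my books':
        same title and author in the big database, but a later year."""
        return {w2 for w in self.works for w2 in warehouse
                if (w2.title, w2.author) == (w.title, w.author)
                and w2.year > w.year}
```

The point of the sketch is that these are set operations over a small personal collection, cheap to run, exactly the kind of query the big shared databases cannot afford to offer every user.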

There are a whole lot of resource management functions that a student or researcher could find useful, because within a selected set there is still much to discover. These smaller, personal databases should also be able to interact with each other, doing comparisons and cross-database queries. We should be able to make notes and create relationships and share them (a Memex feature). The personal database should be associated with a person, not a particular library or institution, and must work across institutions and services. I can't imagine what it must be like today to graduate and to lose not only the privileged access that members of institutions enjoy but also the entire personal space that one has created while attached to that institution.

In short, it's not about the STUFF, it's about the services. It doesn't matter how much STUFF you have; it's what people can DO with it. Verb, not noun. Quality, not quantity.

Tuesday, May 24, 2011

From MARC to Principled Metadata

The Library of Congress has announced its intention to "review the bibliographic framework to better accommodate future needs." The translation of this into plain English is that they are (finally!) thinking about replacing the MARC format with something more modern. This is obviously something that desperately needs to be done.

I want to encourage LC and the entire library community to build its future bibliographic data on solid principles. Among these principles would be:

  • Use data, not text. Wherever possible, the stuff of bibliographic description should be computable data, not human-interpretable text. Any part of your metadata that cannot be used in machine algorithms is of limited utility in user services.
  • Give your things identifiers, not language tags. Identification allows you to share meaning without language barriers. Anything that has been identified can be displayed in language terms to users in any language of your (or the user's) choice.
  • Adopt mainstream metadata standards. This is not only for the data formats but also in terms of the data itself. If other metadata creators are using a particular standard language list or geographic names, use those same terms. If there are metadata elements for common things like colors or sizes or places or [whatever], use those. Work with international communities to extend metadata if necessary, but do not create library-specific versions.
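The second principle, identifiers instead of language tags, can be made concrete with a small sketch. The URIs and labels below are invented for illustration; the mechanism is what matters: the record carries only an identifier, and the label shown to the user is looked up in the user's language at display time.

```python
# Invented label table: one identifier, display labels in several
# languages. In a real system these would come from a published
# vocabulary, not a hard-coded dict.
LABELS = {
    "http://example.org/terms/carrier/volume":
        {"en": "volume", "fr": "volume", "de": "Band", "it": "volume"},
}

# The record itself stores no display text for the controlled value,
# only the identifier.
record = {
    "title": "A classification and subject index",
    "carrierType": "http://example.org/terms/carrier/volume",
}

def label_for(record, field, lang):
    """Render the identified value in the user's language."""
    labels = LABELS[record[field]]
    return labels.get(lang, labels["en"])  # fall back to English

print(label_for(record, "carrierType", "de"))  # -> Band
```

Two systems that exchange this record agree on the identifier even if they never display the same word for it, which is the "share meaning without language barriers" point above.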

There is much more to be said, and fortunately a great deal of it is being included in the report of the W3C Incubator Group on Library Linked Data. Although the report is still in draft form, you can see the current state of that group's recommendations, many of which address the transition that LC appears to be about to embark on. A version of the report for comments will be available later this summer.

The existence of this W3C group, however, is the proof of something very important that the Library of Congress must embrace: that bibliographic data is not solely of interest to libraries, and the future of library data should not be created as a library standard but as an information standard. This means that its development must include collaboration with the broader information community, and that collaboration will only be successful if libraries are willing to compromise in order to be part of the greater info-sphere. That's the biggest challenge we face.

Friday, May 13, 2011

Dystopias

In the 1990s I wrote often about information dystopias. In 1994 I said:

It's clear to me that the information highway isn't much about information. It's about trying to find a new basis for our economy. I'm pretty sure I'm not going to like the way information is treated in that economy. We know what kind of information sells, and what doesn't.

In 1995 I painted a surprisingly accurate picture of 2015 that included:
Big boys, like Disney and Time/Warner/Turner put out snippets of their films and have enticed viewers to upgrade their connection to digital movie quality. News programs have truly found their place on the Net, offering up-to-the-second views of events happening all over the world, perfectly selected for your interests.... Online shopping allows 3-D views of products and virtual walk-throughs of vacation paradises.

If there were a stock market for cynical investments, I'd be sitting pretty right now. But wait... there's more! Because there's always a future, and therefore more dystopia to predict.

My latest concern is about searching and finding. And of course that means that I am concerned about Google, but in a new context. I have spent the last five years trying to convince libraries that we need to be of the web -- not only on the web but truly web resources. I strongly believe this is the only possible way to keep libraries relevant to new generations of information seekers. This has been interpreted by many as a digitization project that will result in getting the stuff of libraries (books, mainly) onto the web, and getting the metadata about that stuff out of library catalogs and onto the web. HathiTrust, for example, is a massive undertaking that will store and preserve huge amounts of digitized books. The Digital Public Library of America (DPLA), just in its early planning stages today, wants to make all books available to everyone for "free."

All of these are highly commendable projects, but there is a reality that we don't seem to have embraced: searching and finding are as important to the information-seeking process as the actual underlying materials. As we can easily see with Google, the search engine is the gate-keeper to content. If content cannot be found, then it does not exist. And determining what content will be accessed is real power in the information cloud. [cf. Siva Vaidhyanathan, The Googlization of Everything.]

There is a danger that when this mass of library materials becomes of the web, we could entirely lose control of its discovery. But it isn't just a question of library materials; this is true for the entire linked data cloud: who will create the search engine that makes all of that data findable? With its purchase of freebase.com, it is clear that Google has at least an eye on the linked data space. And of course Google has the money, the servers, and the technology to do this. We know, however, from our experience with the current Google search engine that the application of Google's values to search produces a particular result. We also know that Google's main business model is based on making a connection between searchers and advertisers. [cf. Ken Auletta, Googled.]

It's not enough for libraries to gather, store and preserve huge masses of information resources. We have to be actively engaged with users and potential users, and that engagement includes providing ways for them to find and to use the resources libraries have. We must provide the entry point that brings users to information materials without that access being mediated through a commercial revenue model. So for every HathiTrust or DPLA that focuses on the resources we need a related project -- equally well-funded -- that focuses on users and access. Not just creating a traditional library-type catalog, but providing a whole host of services that will help users find and explore the digital library. This interface needs to be part search engine, part individual work space, and part social network. Users should be able to do their research, store their personal library (getting into Memex territory here), share their work with others, engage in conversations, and perhaps even manage complex research projects. It could be like a combination of Zotero, VIVO, Zoho, Yahoo Pipes, Dabble, and MIT's OpenCourseWare.

Really, if we don't do this, the future of libraries and research will be decided by Google. There, I said it.