Friday, August 26, 2011

New bibliographic framework: there is a way

Since my last post undoubtedly left readers with the idea that I have my head in the clouds about the future of bibliographic metadata, I wish to present here some of the reasons why I think this can work. Many of you were probably left thinking: Yeah, right. Get together a committee of a gazillion different folks and decide on a new record format that works for everyone. That, of course, would not be possible. But that's not the task at hand. The task at hand is actually about the opposite of that. Here are a few parameters.

#1 What we need to develop is NOT a record format

The task ahead of us is to define an open set of data elements. Open, in this case, means usable and re-usable in a variety of metadata contexts. What wrapper (read: record format) you put around them does not change their meaning. Your chicken soup can be in a can, in a box, or in a bowl, but it is still chicken soup. That's the model we need for metadata. Content, not carrier. Meaning, not record format. Usable in many different situations.
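To make the "content, not carrier" point concrete, here is a minimal Python sketch that puts the same two data elements into a JSON wrapper and an XML wrapper and reads them back unchanged. The element identifiers, the sample values, and the `record`/`element` tag names are all invented for illustration; they do not come from any real vocabulary or record format.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical element identifiers and values, for illustration only.
description = {
    "http://example.org/elements/title": "Chicken Soup for Catalogers",
    "http://example.org/elements/creator": "A. Librarian",
}


def to_json(elements):
    """Wrap the elements in a JSON 'can'."""
    return json.dumps(elements, sort_keys=True)


def from_json(text):
    return json.loads(text)


def to_xml(elements):
    """Wrap the same elements in an XML 'box'."""
    root = ET.Element("record")
    for identifier, value in sorted(elements.items()):
        child = ET.SubElement(root, "element", {"id": identifier})
        child.text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    return {el.get("id"): el.text for el in ET.fromstring(text)}


# Whatever the wrapper, the same identified elements come back out.
same_soup = from_json(to_json(description)) == from_xml(to_xml(description))
```

The wrappers differ in syntax only; what carries the meaning is the identifier attached to each value.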

#2 Everyone doesn't have to agree to use the exact same data elements

We only need to know the meaning of the data elements and what relationships exist between different data elements. For example, we need to know that my author and your composer are both persons and are both creators of the resource being described. That's enough for either of us to use the other's data under some circumstances. It isn't hard to find overlapping bits of meaning between different types of bibliographic metadata.
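As a rough sketch of how that could work in practice, the Python below records that two differently named elements both fall under a shared "creator" element, and then uses that relationship to pull creators out of either community's data. Every element name here (`mine:author`, `yours:composer`, `shared:creator`) and the sample record values are assumptions made up for the example.

```python
# Hypothetical declarations: each element points to the broader element
# it is a kind of. None of these names come from a real vocabulary.
BROADER = {
    "mine:author": "shared:creator",
    "yours:composer": "shared:creator",
}


def ancestors(element):
    """Return the element itself followed by every broader element."""
    chain = [element]
    while chain[-1] in BROADER:
        chain.append(BROADER[chain[-1]])
    return chain


def values_for(record, wanted):
    """Collect values whose element is, or is a kind of, `wanted`."""
    return [value for element, value in record.items()
            if wanted in ancestors(element)]


my_record = {"mine:author": "Example Author", "mine:title": "Example Book"}
your_record = {"yours:composer": "Example Composer"}

creators = (values_for(my_record, "shared:creator")
            + values_for(your_record, "shared:creator"))
```

Neither community had to rename anything; each only had to state how its element relates to the shared one.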

Not all metadata elements will overlap between communities. The cartographic community will have some elements that the music library community will never use, and vice versa. That's fine. That's even good. Each specialist community can expand its metadata to the level of detail that it needs in its area. If the music library finds a need to catalog a map, they can "borrow" what they need from the cartographic folks.

Where data elements are equivalent or are functionally similar, data definitions should include this information. Although defined differently, you can see that there are similarities among these data elements.
  • pbcoreTitle = a name given to the media item you are cataloging
  • RDA:titleProper = A word, character, or group of words and/or characters that names a resource or a work contained in it.
  • MARC:245 $a = title of a work
  • dublincore:title = A name given to the resource
All of these are types of titles, and have a similar role in the descriptive cataloging of their respective communities: each names the target resource. These elements therefore can be considered members of a set that could be defined as: data elements that name the target resource. Having this relationship defined makes it possible to use this data in different contexts and even to bring these titles together into a unified display. This is no different to the way we create web pages with content from different sources like Flickr, YouTube, and a favorite music artist's web site, like the image here.

In this "My Favorites" case, the titles come from the Internet Movie Database, a library catalog display, the Billboard music site, and Facebook. It doesn't matter where they came from or what the data element was called at that site; what matters is that we know which part is the "name-of-the-thing" that we want to display here.
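Here is one way the unified display could be sketched in Python, assuming the four title elements listed above have been declared members of a "names the target resource" set. The sample records and the function name are invented for the example; only the four title element names come from the list above.

```python
# The set of elements that name the target resource, using the four
# title elements listed above.
NAMES_THE_RESOURCE = {
    "pbcoreTitle",
    "RDA:titleProper",
    "MARC:245 $a",
    "dublincore:title",
}


def name_of_the_thing(record):
    """Return the first value whose element names the resource, if any."""
    for element, value in record.items():
        if element in NAMES_THE_RESOURCE:
            return value
    return None


# Invented sample records from three different metadata communities.
favorites = [
    {"dublincore:title": "Casablanca", "dublincore:date": "1942"},
    {"MARC:245 $a": "Moby Dick", "MARC:100 $a": "Melville, Herman"},
    {"pbcoreTitle": "Nova"},
]

titles = [name_of_the_thing(record) for record in favorites]
```

The display code never needs to know which community a record came from, only that the element belongs to the naming set.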

#3 You don't have to create all new data elements for your resources if appropriate ones already exist

When data elements are defined within the confines of a record, each community has to create an entire data element schema of their own, even if they would be coding some elements that are also used by other communities. Yet, there is no reason for different communities to each define a data element for an element like the ISBN because one will do. When data elements themselves are fully defined apart from any particular record format you can mix and match, borrowing from others as needed. This not only saves some time in the creation of metadata schemas but it also means that those data elements are 100% compatible across the metadata instances that use them.

In addition, if there are elements that you need only rarely for less common materials in your environment, it may be more economical to borrow data elements created by specialist communities when they are needed, saving your community the effort of defining additional elements under your metadata name space.

To do all of this, we need to agree on a few basic rules.

1) We need to define our data elements in a machine-readable and machine-actionable way, preferably using a widely accepted standard.

This requires a data format for data elements that contains the minimum needed to make use of a defined data element. Generally, this minimum information is:
  • a name (for human readers)
  • an identifier (for machines)
  • a human-readable definition
  • both human and machine-readable definitions of relationships to other elements (e.g. "equivalent to" "narrower than" "opposite of")
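The four-part minimum above could be captured in something as small as the following Python sketch. The class name, field names, identifiers, and the sample relationship are assumptions for illustration; in practice a widely accepted standard would supply this structure rather than a home-grown class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    name: str            # for human readers
    identifier: str      # for machines
    definition: str      # human-readable definition
    # (relation, target identifier) pairs, e.g. ("narrower than", "...")
    relationships: tuple = ()

    def related(self, relation):
        """Identifiers of the elements this one has `relation` to."""
        return [target for rel, target in self.relationships
                if rel == relation]


# Hypothetical identifiers under example.org, not a real namespace.
title_proper = DataElement(
    name="Title proper",
    identifier="http://example.org/elements/titleProper",
    definition="The chief name of the resource.",
    relationships=(
        ("narrower than", "http://example.org/elements/title"),
    ),
)

broader = title_proper.related("narrower than")
```

The relation label doubles as the human-readable form here; a fuller version would give each relation its own identifier as well.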

2) We must have the willingness and the right to make our decisions open and available online so others can re-use our metadata elements and/or create relationships to them.

3) We also must have a willingness to hold discussions about areas of mutual interest with other metadata creators and with metadata users. That includes the people we think of today as our "users": writers, scholars, researchers, and social network participants. Open communication is the key. Each of us can teach, and each of us can learn from others. We can cooperate on the building of metadata without getting in each other's way. I'm optimistic about this.

Thursday, August 25, 2011

Bibliographic Framework Transition Initiative

The Internet began as a U.S.-sponsored technology initiative that went global while still under U.S. government control. The transition of the Internet to a world-wide communication facility is essentially complete, and few would argue that U.S. control of key aspects of the network is appropriate today. It is, however, hard for those once in control to give it up, and we see that in ICANN, the body charged with making decisions about the name and numbering system that is key to Internet functioning. ICANN is under criticism from a number of quarters for continuing to be U.S.-centric in its decision-making. Letting go is hard, and being truly international is a huge challenge.

I see a parallel here with Library of Congress and MARC. While there is no question that MARC was originally developed by the Library of Congress, and has been maintained by that body for over 40 years, it is equally true that the format is now used throughout the world and in ways never anticipated by its original developers. Yet LC retains a certain ownership of the format, in spite of its now global nature, and it is surely time for that control to pass to a more representative body.

Some Background

MARC began in the mid-1960's as an LC project at a time when the flow of bibliographic data was from LC to U.S. libraries in the form of card sets. MARC happened at a key point in time when some U.S. libraries were themselves thinking of making use of bibliographic data in a machine-readable form. It was the right idea at the right time.

In the following years numerous libraries throughout the world adopted MARC or adapted MARC to their own needs. By 1977 there had been so much diverse development in this area that libraries used the organizing capabilities of IFLA to create a unified standard called UNIMARC. Other versions of the machine-readable format continued to be created, however.

The tower of Babel that MARC spawned originally has now begun to consolidate around the latest version of the MARC format, MARC21. The reasons for this are multifold. First there are economic reasons: library vendor systems have been having to support this cacophony of data formats now for decades, which increases their costs and decreases their efficiency. Having more libraries on a single standard means that the vendor has fewer different code bases to develop and maintain. The second reason is the increased amount of sharing of metadata between libraries. It is much easier to exchange bibliographic data between institutions using the same data format.

Today, MARC records, or at least MARC-like records, abound in the library sphere, and pass from one library system to another like packets over the Internet. OCLC has a database that consists of about 200 million records that are in MARC format, with data received from some 70,000 libraries, admittedly not all of which use MARC in their own systems. The Library of Congress has contributed approximately 12 million of those. Within the U.S. the various cooperative cataloging programs have distributed the effort of original cataloging among hundreds of institutions. Many national libraries freely exchange their data with their cohorts in other countries as a way to reduce cataloging costs for everyone. The directional flow of bibliographic data is no longer from LC to other libraries, but is a many-to-many web of data creation and exchange.

Yet, much like ICANN and the Internet, LC remains the controlling agency over the MARC standard. The MARC Advisory Committee, which oversees changes to the format, has grown and has added members from Library and Archives Canada, the British Library, and the Deutsche Nationalbibliothek. However, the standard is still primarily maintained by and issued by LC.

Bibliographic Framework Transition Initiative

LC recently announced the Bibliographic Framework Transition initiative to "determine a transition path for the MARC21 exchange format."
"This work will be carried out in consultation with the format's formal partners -- Library and Archives Canada and the British Library -- and informal partners -- the Deutsche Nationalbibliothek and other national libraries, the agencies that provide library services and products, the many MARC user institutions, and the MARC advisory committees such as the MARBI committee of ALA, the Canadian Committee on MARC, and the BIC Bibliographic Standards Group in the UK."
In September we should see the issuance of their 18-month plan.

Not included in LC's plan as announced are the publishers, whose data should feed into library systems and does feed into bibliographic systems like online bookstores. Archives and museums create metadata that could and should interact well with library data, and they should be included in this effort. Also not included are the academic users of bibliographic data, users who are so frustrated with library data that they have developed numerous standards of their own, such as BIBO, the Bibliographic Ontology; BibJSON, a JSON format for bibliographic data; and FaBiO, the FRBR-Aligned Bibliographic Ontology. Nor are there representatives of online sites like Wikipedia and Google Books, which have an interest in using bibliographic data as well as a willingness to link back to libraries where that is possible. Media organizations, like the BBC and the U.S. public broadcasting community, have developed metadata for their video and sound resources, many of which find their way into library collections. And I almost forgot: library systems vendors. Although there is some representation on the MARC Advisory Committee, they need to have a strong voice given their level of experience with library data and their knowledge of the costs and affordances.

Issues and Concerns

There is one group in particular that is missing from the LC project as announced: information technology (IT) professionals. In normal IT development the users do not design their own system. A small group of technical experts design the system structure, including the metadata schema, based on requirements derived from a study of the users' needs. This is exactly how the original MARC format was developed: LC hired a computer scientist to study the library's needs and develop a data format for their cataloging. We were all extremely fortunate that LC hired someone who was attentive and brilliant. The format was developed in a short period of time, underwent testing and cost analysis, and was integrated with work flows.

It is obvious to me that standards for bibliographic data exchange should not be designed by a single constituency, and should definitely not be led by a small number of institutions that have their own interests to defend. The consultation with other similar institutions is not enough to make this a truly open effort. While there may be some element of not wanting to give up control of this key standard, it also is not obvious to whom LC could turn to take on this task. LC is to be commended for committing to this effort, which will be huge and undoubtedly costly. But this solution is imperfect, at best, and at worst could result in a data standard that does not benefit the many users of bibliographic information.

The next data carrier for libraries needs to be developed as a truly open effort. It should be led by a neutral organization (possibly ad hoc) that can bring together the wide range of interested parties and make sure that all voices are heard. Technical development should be done by computer professionals with expertise in metadata design. The resulting system should be rigorous yet flexible enough to allow growth and specialization. Libraries would determine the content of their metadata, but ongoing technical oversight would prevent the introduction of implementation errors such as those that have plagued the MARC format as it has evolved. And all users of bibliographic data would have the capability of metadata exchange with libraries.

Friday, August 19, 2011

Metadata Seminar at ASIST

I'm going to be giving a half-day seminar on October 12 in New Orleans in association with ASIST. This is something I have been wanting to do for a while. I feel like I've spent the past two years presenting Semantic Web 101 in 45-minute segments, and I really want to start moving on to 102, 103, etc. I'm hoping this seminar will fill that gap.

The topics I will cover at that seminar are:
  • Understanding data, data types, and data uses
  • Identifiers, URIs and http URIs
  • Statements and triples and their role in the 'web of data'
  • Defining properties and vocabularies that can be used effectively on the web
  • Brief introduction to semantic web standards
There will be hands-on exercises throughout the morning that give attendees a chance to learn by doing. I'm hoping that the exercises will also be fun. If you're going to ASIST and have any questions about the seminar, please contact me.

Sunday, August 14, 2011

Men, Women: Different

The title of this post was a teaser headline on the cover of USA Today -- no, I don't remember when, but the statement definitely struck me. Yes, we are different. Our different points of view are so deep that it's often hard to explain why something matters.

This ad for the cordless mouse clearly made it all of the way through the company's management structure without raising an eyebrow, but many women I have shown this to have had a visceral reaction, since "cut the cord" brings up thoughts of childbirth, which makes this photo of butchers pretty gruesome.

Around this same time (and I'm again talking about the mid 1990's) women began complaining about the title of the back page of PC Magazine: Abort, Retry, Fail. It was a page of bloopers and idiotic error messages. You have to be of the older generation to remember what ARF was about, since it was a DOS error code. There are some examples here in case you are either a) too young to remember, or b) wanting a nostalgia trip.

In any case, some women objected to the use of Abort, Retry, Fail for a humor page because the term abort wasn't at all humorous to them. Men didn't understand this at all. They also seemed to think that the meaning "to end a process" was the main usage of the term, and that its association with a failed pregnancy was just a minor nit, hardly worth noticing. It obviously all depends on which meaning of abort has had the greatest effect on your life. PC Magazine did change the name to Backspace, and changed it back again to Abort, Retry, Fail in 2006. The magazine didn't survive much longer, though that had nothing to do with its back page, I'm sure.

This last image (unless I get ambitious later and scan some more) could almost have been used as a test for "Are you a man or a woman?" Chameleon was software for managing your Internet connection, and darned good software at that. We used it in my place of work for years. If you see a man on a motorcycle and nothing else, you just might be a man. If you see some high heels flying in the air and get an image of a woman having just been dumped on the road, you're either a woman or would make a great boyfriend. In my talks and writing I called this image: Woman as roadkill on the information highway.

It always amazes me how separate the realities can be for men and women, although there's a chance they are no more distant than those of rich and poor, abled and disabled, or any other human dichotomy you can come up with. I can say that having experienced the world of computing for nearly forty years as a woman, these differences in perception have a real effect on getting along and getting things done. One of my favorite statements is from Professor Ellen Spertus who teaches and encourages women in computer science, and who says: "You can be both rigorous and nurturing." My translation of this is: women's views count, too.

Throbnet, 1995

One of the characteristics of PC Magazine in the mid-1990's was its adult classified section. This went on for many pages; many, many offensive pages. Remember, this was a magazine that many of us read in our professional capacity, since it was the main way to get information about new products and trends. PC Magazine was the primary source of hardware and software reviews; their special printer issue was the place to go before buying a printer. But unfortunately, it also came with these pages.

This is a fairly mild example. I don't remember my rationale but I probably didn't feel comfortable showing the raunchier ads to my audiences, so I used this one. There were more explicit examples like the ads for Throbnet (a name that is still used in online porn).

But the real clincher for me was when I went to my first computer show. My memory has it that it was a MacWorld, but I can't be sure of that. It was in San Francisco, around 1995. Included among the exhibitors were some of the porn vendors who advertised in these pages. Their draw was that they had the actresses there in the booth. I distinctly remember the line of guys waiting to have their copy of "Anal ROM" signed. It was a very uncomfortable place for a woman working in the computer field.

No hairstyling tips, 1995

(The next few posts will be feminist in nature. If that type of thing annoys you, I suggest you skip them, and I'll be back to librarianship in a trice.)

There is a new generation of women dealing with the nature of computing culture. Fortunately, they have social media to help them cope. (Example1, Example2) Reading their posts reminded me that in the mid-90's I did talks about the portrayal of women in computer magazines, and that I might have some illustrations that were still usable. I have only a few since most of my examples ended up as black and white transparencies that aren't scan-able. But in the next few posts I'll offer what I do have, all from about 1995.

The above image is of a postcard that I received in the mid-nineties from a bulletin board system (BBS) called "BIX". BBSs were the only way to get online in those days, although by the mid-nineties most gave you a pass-through to the Internet. A BBS was a kind of mini-AOL: an online gathering and posting place that was a walled community. The first BBS I joined was CompuServe, since that was the main place for technical information about PC hardware and software. (Note: there was little or no product information on the Internet, which in the 1980's and early 90's was strictly limited to academic activities and research.)

I must have gotten the BIX card because I subscribed to PC Magazine. The message, however, was far from inviting. Here's the back of the card:

The main text says:
No garbage.
No noise.
No irrelevant clutter.

Which, as the card illustrates, obviously meant: no girls.

I have more examples of this "boys' club" atmosphere in my 1996 article in Wired Women: How hard can it be? (Available on my site.)

Friday, August 12, 2011

Models of bibliographic data

There are two main models of bibliographic data that most of us are familiar with today. One is ISBD, which models bibliographic description. ISBD is a flat list of data areas:
  1. Title and statement of responsibility
  2. Edition
  3. Material type
  4. Publication, distribution, etc.
  5. Physical description
  6. Series
  7. Notes
  8. Identifier
In part, the MARC21 record implements ISBD description because AACR2, on which it is based, is compatible with ISBD but includes additional data such as headings (also known as "access points"). While I haven't seen a diagrammatic visualization of MARC, I believe it would be flat, much like ISBD.

The other primary model is FRBR. There aren't yet many examples of FRBR-based data, although there are partial examples such as the Work views in WorldCat and the Work and Personal author views in Open Library. The most fully FRBR-ized data appears to be in the VTLS Virtual database and their RDA sandbox, but I admit I haven't spent much time looking at this as it is a "pay fer" offering.

The FRBR model isn't flat, but can be drawn as three groups of inter-related entities. The actual FRBR diagrams are too complex to fit in this blog post, but here's a simplified one that I have used in slide sets:

There is a certain amount of movement in FRBR compared to the flat models of ISBD and MARC. In particular, FRBR offers the possibility of creating paths through data by following the relationships of a single entity through the descriptions of different resources. It also allows something like a Person entity to be treated as a resource on its own and therefore to be the focus of attention for some data view.
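The idea of following one entity's relationships from resource to resource can be sketched with a handful of statements in Python. The identifiers, the `createdBy` relationship name, and the sample data are simplified inventions for the example, not actual FRBR or RDA element names.

```python
# Each statement is (subject, relationship, object). All values are
# hypothetical and exist only to show the navigation.
statements = [
    ("work:1", "createdBy", "person:1"),
    ("work:2", "createdBy", "person:1"),
    ("work:3", "createdBy", "person:2"),
    ("person:1", "name", "Example Author"),
]


def objects(subject, relationship):
    return [o for s, r, o in statements
            if s == subject and r == relationship]


def subjects(relationship, obj):
    return [s for s, r, o in statements
            if r == relationship and o == obj]


def other_works_by_same_creator(work):
    """Follow work -> creator -> works, leaving out the starting work."""
    found = []
    for person in objects(work, "createdBy"):
        for other in subjects("createdBy", person):
            if other != work and other not in found:
                found.append(other)
    return found


path = other_works_by_same_creator("work:1")

# The Person can also be the focus of a view in its own right.
person_view = {
    "name": objects("person:1", "name"),
    "works": subjects("createdBy", "person:1"),
}
```

In a flat record the creator is a string inside each description; here it is a node you can stand on and look outward from.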

The British Library recently announced free and open versions of their British National Bibliography, with records available in a linked data format. Their analysis of the BL data, done in collaboration with Talis, a UK library systems company that is very active in linked data space, resulted in a data model (PDF) that is unlike any we have seen before. What I give below isn't readable in its details, but I wanted to highlight the key sections or groupings that are revealed in the analysis.

There are a number of interesting aspects to this. To begin with, just by virtue of the diagramming of entities (which each get represented by an oval) you can see how much of the record is represented by named and identified entities rather than plain text. The plain text fields are on the bottom right of the diagram in the lavender boxes. Presented this way, they seem to have less importance than they do in traditional views. In sheer diagram real estate, subjects come out as the largest group, and authors appear to be more substantial than they seem in MARC models where they are reduced to short strings.

I also find it very interesting that publication is represented as an event. This makes sense to me. In FRBR, publication isn't an action but a static description of when and where and who, and the various publications are treated as separate events unrelated to a history of how the Work resurfaces over time for new generations. I like the view that a work comes to us through a series of events, not separate and unrelated manifestations.
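To illustrate what treating publication as an event buys you, here is a small Python sketch in which each publication is an event attached to a work, so that the work's history can be read off as a sequence. The class, its fields, and the sample events are assumptions for the example; they are not the classes or properties of the British Library model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicationEvent:
    work: str    # identifier of the work being published
    year: int
    place: str
    agent: str   # who published it


# Invented events, deliberately out of order.
events = [
    PublicationEvent("work:1", 1930, "Chicago", "Publisher B"),
    PublicationEvent("work:1", 1851, "New York", "Publisher A"),
    PublicationEvent("work:2", 1999, "London", "Publisher C"),
    PublicationEvent("work:1", 2001, "London", "Publisher C"),
]


def history(work, all_events):
    """The work's publication events in chronological order."""
    return sorted((e for e in all_events if e.work == work),
                  key=lambda e: e.year)


timeline = [(e.year, e.place) for e in history("work:1", events)]
```

Modeled this way, "how has this work resurfaced over time?" becomes a simple question to ask of the data.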

I would like to suggest that we explore a variety of models for our data. I don't think we have to adopt one single model, but we should design our data such that it can be used in different views depending on the service being provided. I also think that we should explore these models before we put all of our eggs in the FRBR basket. We might learn something vital that should be taken into consideration for our future bibliographic data.

Monday, August 01, 2011

Suggestions for HathiTrust UI

Here are my concrete suggestions for improvements to the HathiTrust user interface. This is based on my own experience and should not be considered to be complete or universal. These are simply the things that would have made my experience better:

On the home page, there should be two links:
  • member login
  • guest login
By each there should be a link to help (one of those question mark circles, for example).

Member help will explain: that you must be someone associated with one of these institutions (link) with an institutional id. Members can: [whatever they can do - view everything, download all PD materials, create bibliographies...]

Guest help will explain: that HT is a member-sponsored db. Guests can search and can view the full text of some materials. A guest account allows you to create a persistent bibliography.

On the page for a work, do NOT say: Public domain, Google-digitized. Instead, say what the user needs to know:
Public domain; member-only download.
Public domain; anyone can download.

If you ask for a login at the time of download, ONLY ask for a member login since a guest login does not provide access at this point. The message ("member-only download") may be enough, but the login request could read: requires member login.

This was as far as I got in HT, and I'm not going to be spending much more time there, since as a non-member I am actually served better on other sites. It's a superficial look from a first-time, non-member user.