Monday, December 26, 2011

Google files motion to dismiss

"The claims of the associations should be dismissed without leave to amend because they lack standing as a matter of law, since they do not themselves own copyrights and do not meet the test for associational standing set forth in Hunt." p. 19
With that conclusion, Google has filed a motion requesting that the copyright infringement lawsuits filed by the Authors' Guild and the American Society of Media Photographers, Inc. be dismissed. The arguments made in the document are:
  • "Individual copyright owners' participation is necessary to establish a claim for copyright infringement." (p.1)
  • "Plaintiff associations do not own copyrights alleged to have been infringed, and do not have standing to sue for copyright infringement." (p.4)
  • "Every copyright, and every alleged copyright infringement, is different."(p.7)
  • "... a central issue in these cases is whether the conduct alleged in the Complaints constitute fair use under 17 U.S.C. 107. Litigating that issue will require the participation of individual association members, because many of the relevant facts are specific to the particular work in question." (p.11)
All of this sounds plausible to this legal novice, but there are a couple of puzzling issues. First, why did Google not make these arguments in 2005 when the Authors' Guild filed suit? Instead, they negotiated with the association for six years, presumably in good faith, and those negotiations hinged on the acceptance of the AG as a representative of authors and their rights in their works. If Google had thought that the AG did not have standing, none of that negotiation would have made much sense.

Second, Google says in this document that fair use has to be determined on a case-by-case basis. They even quote from Campbell v. Acuff-Rose Music, Inc. that "Fair use must 'be judged case by case, in light of the ends of the copyright law....' It is 'not to be simplified with bright-line rules.'" (p.11) This seems to undermine Google's original defense that copying for the purpose of creating an index is itself fair use, not something that has to be determined case by case.

It isn't surprising that Google wants to bring an end to this case. It is now entering its seventh year (the original suit was filed in September of 2005), and has undoubtedly been costly for all parties. Google had been putting into place the foundations for the settlement, including the creation of a large database of works and a means for owners to claim their copyrights. They had designated a director for the Book Rights Registry, which would have administered the business agreed on in the settlement. The failure of the settlement and the amended settlement to get court approval meant that all of that effort was for naught. Yet it isn't clear to me (and I hope someone can speak to this) what practical outcome Google is seeking for its book digitization effort. A dismissal of this nature would put Google in the rather cynical position of continuing book scanning knowing that few individual authors have the means to take Google to court, and that any individual judgments would probably be affordable for this multi-billion dollar company. If the motion is denied, then at least the question of standing is settled, and presumably the suit will go forward as originally filed.

The one thing that is clear is that negotiations between Google and the AG are no longer on the horizon.

Note, also, that the Authors Guild has filed suit against the HathiTrust for copyright infringement, and the decision here will no doubt bear on that case as well.

Thursday, December 22, 2011

National Library of Sweden and OCLC fail to agree

In a blog post entitled "No deal with OCLC" the National Library of Sweden has announced that after five years they have ended negotiations with OCLC to become participants in WorldCat. The point of difference was over the OCLC record use policy. Sweden has declared the bibliographic data in the Swedish National Catalog, Libris, to be open for use without constraints.
"A fundamental condition for the entire Libris collaboration is voluntary participation. Libraries that catalogue in Libris can take out all their bibliographic records and incorporate them instead into another system, or use them in anyway the library finds suitable." (from the blog post)
This is an example of the down-stream constraint issues that we discussed while working on the Open Bibliography Principles for the Open Knowledge Foundation. While open data may appear to be primarily an ideological stance, it in fact has real practical implications. A bibliographic database is made up of records and data elements that can have uses in many contexts. In addition, the same bibliographic data may exist in numerous databases managed by members of entirely different communities. Someone may wish to create a new database or service using data coming from a variety of sources. At times someone will want to use only portions of records and may mix and match individual data elements from different sources. Any kind of constraint on use of the data, including something as seemingly innocuous as allowing all non-commercial use, requires the user of the data to keep track of the source of each record or data element. Practically, this means that an application using the mix of data is effectively constrained by the most strict contract in the mix.
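The "strictest contract wins" effect can be sketched in a few lines of code. This is purely illustrative: the license names and the strictness ranking are invented for the example, not a legal ordering.

```python
# Illustrative sketch: records merged from several sources inherit the
# most restrictive license in the mix. The strictness ranking here is
# invented for illustration, not a legal or standard ordering.
STRICTNESS = {"public-domain": 0, "attribution": 1, "non-commercial": 2}

def effective_license(records):
    """Return the license that effectively governs a merged dataset."""
    return max((r["license"] for r in records), key=STRICTNESS.__getitem__)

mixed = [
    {"title": "Record A", "license": "public-domain"},
    {"title": "Record B", "license": "attribution"},
    {"title": "Record C", "license": "non-commercial"},
]
print(effective_license(mixed))  # prints non-commercial
```

One non-commercial record in a mix of thousands is enough to constrain the whole aggregate, which is why Europeana and the Swedish National Library insist on unconstrained data up front.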

The Swedish library was concerned that their participating libraries would be hindered in their future systems and activities if any limitations were placed on data use. In addition, they would not be able to share their data with the Europeana project, as Europeana requires that the data contributed be open precisely because of the complications of managing hundreds or thousands of different sources with different obligations.

As many of us pointed out during the discussions about the OCLC record use policy, the practical problems of controlling down-stream use of data are insurmountable. Some people argue that the record use policy hasn't affected libraries using WorldCat, but my experience is that the policy has a chilling effect on some libraries, and is making it more difficult for libraries to embrace the linked open data model. The Swedish National Library had to make the difficult decision between WorldCat services and future capabilities. It was undoubtedly a hard decision, but it is admirable that the National Library did not give up what it saw as important rights for its users.

Monday, December 12, 2011

Learning not to share

"Learning to share" used to be one of the basic lessons of childhood, with parents beaming the first time their offspring spontaneously handed half of a cookie to a playmate. But some time before that same child first puts fingers to keyboard she will have to learn a new lesson: not to share online.

The Facebook phenomenon has taken that simple concept of sharing with others to an industrial level. Any page you go to on the Web today connects into your online social life, so that while reading the news or watching a video you are exhorted to share your activity with your online "friends." I say "friends" in quotes because the way that Facebook involvement grows means that many of the people seeing your posts or learning about your activities are like second and third cousins: related to your friends but at least a step removed from the inner circle you relate to. It is easy to forget that those more distant relations are there, but bit by bit the links pull in more invitations and, since we have been told that it is impolite not to share, we rarely slam the digital door on those seeking our friendship.

To increase this digital sharing, the House has passed a revision to the Video Privacy Protection Act. You may not recall the "Bork law" of 1988. It was one of the fastest privacy laws ever passed in the U.S. legislature. Here's the description from the New York Times article:

In 1987, the Washington City Paper, a weekly newspaper, published the video rental records of Judge Robert H. Bork, who at the time was a nominee to the Supreme Court. One of the paper’s reporters had obtained the records from Potomac Video, a local rental store. Judge Bork’s choice of movies — he rented a number of classic feature films starring Cary Grant — may have seemed innocuous.

But the disclosure of Judge Bork’s cultural consumption so alarmed Congress that it quickly passed a law giving individuals the power to consent to have their records shared. The statute, nicknamed the “Bork law,” also made video services companies liable for damages if they divulged consumers’ records outside the course of ordinary business.
At the time the passage of the law had a comic aspect to it: you could imagine the thoughts going through the heads of members of Congress when they realized that any reporter could walk into their local video store and learn what they had rented. Zingo! New law!

The revised bill, stated in the article as being backed primarily by Netflix, would allow consumers (and that's all we are, right, consumers?) to sign a blanket waiver on their video privacy in order to facilitate sharing with friends.

The Times article has various quotes giving pros and cons, online services vs. privacy advocates, all talking about how much you do or don't want your "friends" to know about you. What the article fails to state, however, is that whether you like it or not, every site where you share is a de facto friend as well. If your Facebook friends get your Netflix picks, both Facebook and Netflix (and their advertising partners) also get your video viewing information. The more you share with your friends, the more you are sharing with an invisible network of corporations - who, by the way, you cannot "unfriend" even if you want to.

This is why we need to learn not to share: it's a lie, a deceit. We aren't really sharing with our friends, our friends are being used to get us to divulge information to faceless corporations who have insinuated themselves into our lives for the sole purpose of benefiting from our consumption. They have distorted the entire idea of "friend," and turned it into a buyer's club for their benefit.

Dear friends: I'm looking forward to seeing you ... offline.

Tuesday, November 01, 2011

Future Format: Goals and Measures

The LC report on the future bibliographic format (aka replacement for MARC) is out. The report is short and has few specifics, other than the selection of RDF as the underlying data format. A significant part of the report lists requirements; these, too, are general in nature and may not be comprehensive.

What needs to be done before we go much further is to state our specific goals and the criteria we will use to determine whether we have met them. Some goals we will discover in the course of developing the new environment, so this should be considered a growing list. I think it is important that every goal have measurements associated with it, to the extent possible. It makes no sense to make changes if we cannot know what those changes have achieved. Here are some examples of the kinds of things I am thinking of; they are not the actual goals of the project, just illustrations that I have invented.

 - goal: it should be less expensive to create the bibliographic data during the cataloging process
   measurement: using time studies, compare cataloging in MARC and in the new format
 - goal: it should be less expensive to maintain the format
   measurement: compare the total time required for a typical MARBI proposal to the time required for the new format
 - goal: it should be less expensive for vendors to make required changes or additions
   measurement: compare the number of programmer hours needed to make a change in the MARC environment and the new environment

 - goal: collaboration on data creation with a wider group of communities
   measurement: count the number of non-library communities that we are sharing data with before and after
 - goal: greater participation of small libraries in shared data
   measurement: count number of libraries that were sharing before and after the change
 - goal: make library data available for use by other information communities
   measurement: count use of library data in non-library web environments before and after

 - goal: library technology staff should be able to implement "apps" for their libraries faster and easier than they can today.
   measurement: either number of apps created, or a time measure to implement (this one may be hard to compare)
 - goal: library systems vendors can develop new services more quickly and more cheaply than before
   measurement: number of changes made in the course of a year, or number of staff dedicated to those changes. Another measurement would be what libraries are charged and how many libraries make the change within some stated time frame

As you can tell from this list, most of the measurements require system implementation, not just the development of a new format. But the new format cannot be an end in itself; the goal has to be the implementation of systems and services using that format. The first MARC format that was developed was tested in the LC workflow to see if it met the needs of the Library. This required the creation of a system (called the "MARC Pilot Project") and a test period of one year. The testing that took place for RDA is probably comparable and could serve as a model. Some of the measurements will not be available before full implementation, such as the inclusion of more small libraries. Continued measurement will be needed.

So, now, what are the goals that YOU especially care about?

Monday, October 17, 2011

Relativ index

Most of us, when we hear "Dewey Decimal Classification" (DDC) think about the numbers that go onto the backs of books that then tell us where the book can be found on the library's shelves. The subject classification and its decimal notation was only part of Dewey's invention, however. The other part was the "Relativ Index." The Relativ Index was the entry vocabulary for the classification scheme. It was to be consulted by library users as the way to find topics in the library.
"The Index givs similar or sinonimus words, and the same words in different connections, and any intelijent person wil surely get the ryt number. A reader wishing to know sumthing of the tarif looks under T, and, at a glance, finds 337 as its number. This gyds him to shelvs, to all books and pamflets, to shelf catalog, to clast subject catalog on cards, to clast record of loans, and, in short, in simple numeric order, thruout the whole library to anything bearing on his subject." (Dewey, Edition 11, p. 10) (Yes, that is how he spelled things.)
The most recent version of DDC that I own is from 1922, so this example is an entry in the Relativ Index of Edition 11 under "Leaves:"

Leaves   fertilizers  631.872
              shapes of     botany 581.4

In the schedules these classes are listed as:

631.872 : "Vegetable manures, Leaves" (coming right after "Vegetable manures, Muck").
581.4 : "Morfology   comparativ anatomy"

You can see that the index is not just a repeat of the names of the points in the classification but is a kind of subject thesaurus on its own. It doesn't just point to the classification number but it gives some context ("fertilizers" "botany") to help the user decide which class number to select.

What I find odd today in libraries (mainly public libraries) is that we do not have an entry vocabulary for the Dewey classification. Libraries in the U.S. use the Library of Congress Subject Headings even when their classification scheme is Dewey. While LC subject headings will lead you to a catalog entry that has a classification number, they aren't an index to that classification scheme.

There are more oddities, actually.

One oddity is that we never explain these classification numbers to the users. Yes, I can go from the catalog to the shelf and find books that are near the one I am seeking, but in a small public library I can encounter a number of different topics on a single shelf; and in a large academic library I can wander whole aisles without seeing a change in the initial class number, with no idea whether I have exhausted my topic area on the shelf as decimal points three places out change. Yet there is nothing, either at the shelf or anywhere else in the library, to tell me what those numbers mean, except usually at a very macro level. What I have before me are book spines and class numbers, and since I don't know what the class numbers mean I have to rely on the spine titles. So if I browse a shelf and see:

364.106 D26f   The first family
364.106 En36h  Havana nocturne
364.106 En36i  The Westies
364.106 En36p  Paddy whacked

... it may not be clear to me what topic I am looking at. At the very least I would like to be able to type "364.106" into an app on my phone and get a display something like:

300  Social sciences
    360  Social problems & social services
        364  Criminology

(That example is truncated because the divisions to the right of the decimal point are not available to me. Presumably the display would take me down to .106, which would then have something to do with gangs and/or organized crime and/or mafia, but I'm just guessing at that.)

Even better, I'd like to point my phone camera at a book spine and get a similar read-out. Yes, I know that's not going to be simple.

Another oddity is that we put multiple subject headings on a bibliographic record, but only one classification number, reducing the role of classification to simply the ordering of books on the shelves. This means that there are subject headings on the records that would logically lead to class numbers other than the one that has been given.

Using my crime books as an example, the subject headings are clearly more diverse than the single classification code:

    Mafia -- United States -- History
    Mafia -- United States -- Biography
    Criminals -- United States -- Biography
    Organized crime -- United States -- Case studies

    Lansky, Meyer, 1902-
    Luciano, Lucky, 1897-1962
    Mafia -- Cuba -- Havana
    Cuba -- History -- 1933-1959
    Havana (Cuba) -- Social conditions -- 20th century

    Westies (Gang) -- History
    Gangs -- New York (State) -- New York -- History
    Irish American criminals -- New York (State) -- New York -- History
    Hell's Kitchen (New York, N.Y.)

    Organized crime -- United States -- History
    Irish American criminals -- United States -- History
    Gangsters -- United States -- History

This won't be a surprise to my readers, but this dual system is full of "gotchas" for users. If I look up "Irish American criminals" in the subject headings I retrieve some items in 364.106, some in the 920 area (biographies, but many users won't know that), and some in fiction (under the author's last name). It's not that there isn't a rhyme or reason, but there is nothing to explain the differences between these items to the library user that would justify going to three entirely different places in the library to explore this topic. My guess is that the system seems quite arbitrary.

Things are a bit better in libraries that use Library of Congress Classification (LCC) along with LCSH, since the two seem to be developed with some coordination. In his essay "The Peloponnesian War and the Future of Reference" Thomas Mann, of the Library of Congress, explains how LCSH and LCC work together:
"In order to find which areas of the bookstacks to browse, however, researchers need the subject headings in the library catalog to serve as the index to the class scheme. But the linkage between a subject heading and a classification number is usually dependent on the precoordination of multiple facets within the same string. For example, notice the specific linkages of the following precoordinated strings:

Greece–History–Peloponnesian War, 431-404 B.C.: DF229-DF230
Greece–History–19th century: DF803
Greece–History–Acarnanian Revolt, 1836: DF823.6
Greece–History–Civil War, 1944-1949: DF849.5"
This is the correlation that will appear in the LCSH documentation, but this is not what the user sees in the catalog. A search in LC's catalog for Greece-History-19th century brings up books with a variety of classification numbers; the first results include:
DF803 .H45
DF725 .A14
Again, the user is directed to different shelf locations from what seems to be a single subject heading, with no explanation of what these different locations mean.* It's got to be terribly confusing.

Compact notation is essential for the ordering of books on the shelf. But it seems truly odd that we order the books on the shelf but do not tell users what the order means. This can be seen as providing a delightful serendipity, but I presume that we could provide serendipity with less intellectual effort than has been dedicated to DDC and LCC, which are both enormously detailed and growing more so each year in an attempt to encompass the complexity of the published world. How much richer would the user's library experience be if she understood the relationship between the items on the shelf? Does it make sense to create detailed and complex relationships that then are not understood or used? What would a shelf system look like that was meaningful to library users? in a small library? in a large library? And, finally, can we use computing power to overcome the limitations that brought us to the situation we are in today in terms of organized subject access?

* Before someone explains to me that the first subject heading determines the class number... you know that, I know that, but millions of library users have no idea what the order of the subject headings means. Besides, library catalog users often don't see the full record with all of the subject headings. Even in the LC catalog subject headings are not included in the default display. We can't blame the users if they don't know what we don't help them know.

Monday, October 03, 2011

Organizing knowledge

At the LITA forum on Saturday I stated that classification and knowledge organization seem to have fallen off the library profession's radar. (LITA2011 keynote.) We have spent considerable amounts of time and money on making modifications to our cataloging rules (four times in about fifty years), but the discussion of how we organize information for our users has waned. I can illustrate what is at least my impression of this through some searches done against Google Books using its nGram service.

"Library classification" peaks around 1960, and drops off rapidly. (The chart ends at 2000.) 
Library classification

Faceted classification
Faceted classification has a meteoric rise around the 1960's, but falls abruptly from 1970 to 1980. The rise possibly corresponds closely to the activities of the Classification Research Group, based in the UK, whose big interest was in faceted classification.

Decimal Classification
The decimal classifications, most likely both Dewey and Universal, rise steadily up until the mid-1960's then begin a steep decline.
Keyword searching
Keyword searching comes along slowly in the 1960's and 70's then takes off from 1980 to 2000. Today, as we know, it's basically the only kind of information retrieval being discussed.
Knowledge organization also has a steady rise through the 1970's and 80's, and seems to reach a peak that continues up to recent times.

This is hardly a scientific study, but it illustrates what my gut was telling me, which was that keyword searching has essentially replaced any kind of classed access. That does make me wonder what is being discussed under the rubric of "knowledge organization."  Keyword indexing, per se, does not do any organization of knowledge; there are no classes or categories, no broader concepts or narrower concepts, no direction toward similar topics. It also has no facets, at least none based on the topic of the resource, only on its descriptive properties (date of publication, format, domain).

Keyword searching is not organized knowledge. Any topical organization takes place after retrieval by the searcher, who must look through the retrievals and select those that are relevant. This in part explains why Wikipedia is the perfect complement to keyword searches: Wikipedia is organized knowledge. A keyword search can pull up a Wikipedia page that will provide context, disambiguation, and pointers to related topics. I find increasingly that I begin my searches in Wikipedia when my searches are topical, leaving Google to function as my "internet phone book" when I need to find a specific person, company, product or document.

It makes sense for us to ask now: is there any reason (other than shelf placement) to continue library classification practices? Keep your eyes on this space for more about that.

Added note: Richard Urban offers this nGram view comparing all of the library classification phrases with the term "Ontologies":
As @repoRat tweeted: Karen Coyle makes air whoosh out of my lungs. Perhaps classification to be replaced by relationship metadata?  That's a distinct possibility, and we'd better get cracking on that! Many "ontologies" out there today are simple term lists, and few of them seem to have relationships that you can follow productively. What really excites me is the possibility of relationships that we haven't explored in the past, both between concepts and between resources; all of the "based on" "responds to" "often appear together" -- and lots more that my brain isn't sharp enough to even imagine.

Wednesday, September 28, 2011

Europe's national libraries support Open Data licensing

 "Meeting at the Royal Library of Denmark, the Conference of European National Librarians (CENL), has voted overwhelmingly to support the open licensing of their data. CENL represents Europe’s 46 national libraries, and are responsible for the massive collection of publications that represent the accumulated knowledge of Europe.

What does that mean in practice?
It means that the datasets describing all the millions of books and texts ever published in Europe – the title, author, date, imprint, place of publication and so on, which exists in the vast library catalogues of Europe – will become increasingly accessible for anybody to re-use for whatever purpose they want."

From an announcement by the Conference of European National Libraries.

Sunday, September 18, 2011

Meaning in MARC: Indicators

I have been doing a study of the semantics of MARC data on the futurelib wiki. An article on what I learned about the fixed fields (00X) and the number and code fields (0XX) appeared in the code4lib journal, issue 14, earlier this year. My next task was to tackle the variable fields in the MARC range 1XX-8XX.

This is a huge task, so I started by taking a look at the MARC indicators in this tag range, and have expanded this to a short study of the role that indicators play in MARC. I have to say that it is amazing how much one can stretch the MARC format with one or two single-character data elements.

Indicators have a large number of effects on the content of the MARC fields they modify. Here is the categorization that I have come up with, although I'm sure that other breakdowns are equally plausible.

I. Indicators that do not change the meaning of the field

There are indicators that have a function, but it does not change the meaning of the data in the field or subfields.
  • Display constants: some, but not all, display constants merely echo the meaning of the tag, e.g. 775 Other Edition Entry, Second Indicator
    Display constant controller
    # - Other edition available
    8 - No display constant generated
  • Trace/Do not trace: I consider these indicators to be carry-overs from card production.
  • Non-filing indicators: similar to indicators that control displays, these indicators make it possible to sort (was filing) titles properly, ignoring the initial articles ("The ", "A ", etc.). 
  • Existence in X collection: there are indicators in the 0XX range that let you know if the item exists in the collection of a national library. 
II. Indicators that do change the meaning of the field

Many indicators serve as a way to expand the meaning of a field without requiring the definition of a new tag.
  • Identification of the source or agency: a single field like a 650 topical subject field can have content from an unlimited list of controlled vocabularies because the indicator (or the indicator plus the $2 subfield) provides the identity of the controlled vocabulary.
  • Multiple types in a field: some fields can have data of different types, controlled by the indicator. For example, the 246 (Varying form of title) has nine different possible values, like Cover title or Spine title, controlled by a single indicator value. 
  • Pseudo-display controllers: the same indicator type that carries display constants that merely echo the meaning of the field also has a number of instances where the display constant actually indicates a different meaning for the field. One example is the 520 (Summary, etc.) field with display constants for "subject," "review," "abstract," and others. 
Some Considerations

Given the complexity of the indicators there isn't a single answer to how this information should be interpreted in a semantic analysis of MARC. I am inclined to consider the display constants and tracing indicators in section I to not have meaning that needs to be addressed. These are parts of the MARC record that served the production of card sets but that should today be functions of system customization. I would argue that some of these have local value but are possibly not appropriate for record sharing.

The non-filing indicators are a solution to a problem that is evident in so many bibliographic applications. When I sort by title in Zotero or Mendeley, a large portion of the entries are sorted under "The." The world needs a solution here, but I'm not sure what it is going to be. One possibility is to create two versions of a title: one for display, with the initial article, and one for sorting, without it. Systems could do the first pass at this, as they often do today when taking author names and inverting them into "familyname, forenames" order. Of course, humans would have to be able to make corrections where the system got it wrong.
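A system's first pass at the two-version approach might look like the sketch below: keep the title for display, derive a sort key by stripping a leading article. The article list here is English-only and will guess wrong for other languages (and for titles like "A Is for Alibi" where the article is part of the name), which is exactly the ambiguity the MARC non-filing indicator resolves by recording a character count per field.

```python
# Sketch: derive a sort key by stripping a leading English article.
# This is the error-prone automatic first pass described above; MARC's
# non-filing indicators exist because cataloger judgment is more reliable.
ARTICLES = ("the ", "a ", "an ")

def sort_key(title):
    lowered = title.lower()
    for article in ARTICLES:
        if lowered.startswith(article):
            return title[len(article):]
    return title

titles = ["The Westies", "Havana nocturne", "The first family", "Paddy whacked"]
print(sorted(titles, key=lambda t: sort_key(t).lower()))
```

Run on the crime-shelf titles above, this sorts "The Westies" under W and "The first family" under F instead of piling everything under T.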

The indicators that identify the source of a controlled vocabulary could logically be transformed into a separate data element for each vocabulary (e.g. "LCSH," "MeSH"). However, the number of different vocabularies is, while not infinite, very large and growing (as evidenced by the practice in MARC to delegate the source to a separate subfield that carries codes from a controlled list of sources), so producing a separate data element for each vocabulary is unwieldy, to say the least. At some future date, when controlled vocabularies "self-identify" using URIs this may be less of a problem. For now, however, it seems that we will need to have multi-part data elements for controlled vocabularies that include the source with the vocabulary terms.
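What the multi-part data element might look like can be sketched for the 650 field, whose second indicator names the vocabulary directly for a few common cases and defers to the $2 subfield (value "7") for everything else. The indicator-to-vocabulary table below covers only a handful of the defined values, and the output shape is my own invention for illustration.

```python
# Sketch: turn the 650 second indicator into an explicit "source" element.
# SOURCE_BY_INDICATOR covers only a few defined values; indicator "7" defers
# to the $2 subfield, which is MARC's escape hatch for the ever-growing
# list of controlled vocabularies.
SOURCE_BY_INDICATOR = {"0": "lcsh", "2": "mesh", "3": "nal"}

def subject_with_source(indicator2, term, subfield_2=None):
    """Return a subject term bundled with the vocabulary it came from."""
    if indicator2 == "7":
        source = subfield_2  # vocabulary named explicitly in $2
    else:
        source = SOURCE_BY_INDICATOR.get(indicator2, "unspecified")
    return {"term": term, "source": source}

print(subject_with_source("0", "Mafia--United States--History"))
print(subject_with_source("7", "Organized crime", subfield_2="fast"))
```

Once vocabularies self-identify with URIs, the `source` value could simply be the vocabulary's URI and the lookup table disappears.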

The indicators that further sub-type a field, like the 520 Summary field, can be fairly easily given their own data element since they have their own meaning. Ideally there would be a "type/sub-type" relationship where appropriate.

And Some Problems

There are a number of areas that are problematic when it comes to the indicator values. In many cases, the MARC standard does not make clear if the indicator modifies all subfields in the field, or only a select few. In some instances we can reason this out: the non-filing indicators only refer to the left-most characters of the field, so they can only refer to the $a (which is mandatory in each of those fields). On the other hand, for the values in the subject area (6XX) of the record, the source indicator relates to all of the subject subfields in the field. I assume, however, that in all cases the control subfields $3-$8 perform functions that are unrelated to the indicator values. I do not know at this point if there are fields in which the indicators function on some other subset of the subfields between $a and $z. That's something I still need to study.

I also see a practical problem in making use of the indicator values in any kind of mapping from MARC to just about anything else. In 60% of MARC tags, one or both indicator positions are undefined. Undefined indicators are represented in the MARC record with blanks. Unfortunately, there are also defined indicators that have a meaning assigned to the character "blank." There is nothing in the record itself to differentiate blank indicator values from undefined indicators. Any transformation from MARC to another format has to have knowledge about every tag and its indicators in order to do anything with these elements. This is another example of the complexity of MARC for data processing, and yet another reason why a new format could make our lives easier.
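A sketch of what any such transformation ends up carrying around: a tag-by-tag table of which indicator positions are defined, since the record itself cannot distinguish a meaningful blank from an undefined one. The table values here are illustrative, not a faithful excerpt of the MARC standard:

```python
# Per-tag knowledge that a MARC transformation must have on hand.
# (Illustrative entries only -- a real table covers every tag.)
INDICATOR_DEFINED = {
    #  tag:  (indicator 1 defined?, indicator 2 defined?)
    "245": (True, True),    # e.g. ind2 = non-filing characters
    "520": (True, False),   # e.g. ind1 = display constant controller
    "500": (False, False),  # both positions undefined
}

def interpret_indicator(tag: str, position: int, value: str):
    """Return the raw value if this position is defined for the tag,
    else None -- even when the raw value is the same blank character."""
    defined = INDICATOR_DEFINED.get(tag, (False, False))[position - 1]
    return value if defined else None
```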

More on the Wiki

For anyone else who obsesses on these kinds of things there is more detail on all of this on the futurelib wiki. I welcome comments here, and on the wiki. If you wish to comment on the wiki, however, I need to add your login to the site (as an anti-spam measure). I will undoubtedly continue my own obsessive behavior related to this task, but I really would welcome collaboration if anyone is so inclined. I don't think that there is a single "right answer" to the questions I am asking, but am working on the principle that some practical decisions in this area can help us as we work on a future bibliographic carrier.

Friday, September 16, 2011

European Thoughts on RDA

Some European libraries are asking the question: "Should we adopt RDA as our cataloging code?" The discussion is happening through the European RDA Interest Group (EURIG). Members of EURIG are preparing reports on what they see as the possibilities that RDA could become a truly international cataloging code. With the increased sharing of just about everything between Europe's countries -- currency, labor force, media, etc. -- the vision of Europe's libraries as a cooperating unit seems to be a no-brainer.

There are interesting comments in the presentations available from the EURIG meeting. For example:

Spain has done comparisons with current cataloging and some testing using MARC21. They conclude: "Our decision will probably depend on the flexibility to get the different lists, vocabularies, specific rules... that we need." In other words, it all depends on being able to customize RDA to local practice.

Germany sees RDA as having the potential to be an international set of rules for data sharing (much like ISBD today), with national profiles for internal use. Germany has started translating the RDA vocabulary terms in the Open Metadata Registry, but notes that translation of the text must be negotiated with the co-publishers of RDA, that is, the American Library Association, the Canadian Library Association, and CILIP.

The most detail, though, comes from a report by the French libraries. (The French are totally winning my heart as a smart and outspoken people. Their response to the Google Books Settlement was wonderful.) This report brings up some key issues about RDA from outside the JSC.

First, it is said in this report, and also in some of the EURIG presentations from their meeting, that it is RDA's implementation of FRBR that makes it a candidate for an international cataloging code. FRBR is seen as the model that will allow library metadata to have a presence on the Web, and many in the library profession see getting a library presence on the Web as an essential element of being part of the modern information environment. One irony of this, though, is that Italy already has a cataloging code based on FRBR, REICAT, but that has gotten little attention. (A large segment of it is available in English if you are curious about their approach.)

The French interest in FRBR is specifically about Scenario 1 as defined in RDA: a model with defined entities and links between them. An implementation of Scenario 2, which links authority records to bibliographic records, would be a mere replication of what already exists in France's catalogs. In other words, they have already progressed to Scenario 2 while U.S. libraries are still stuck in Scenario 3, the flat data model.

Although the French libraries see an advantage to using RDA, they also have some fairly severe criticisms. Key ones are:
  • it ignores ISO standards, and does not follow IFLA standards, such as Names of person, or Anonymous classics*
  • it is a follow-on to, and makes concessions to, AACR(1 and 2), which is not used by the French libraries
  • it proposes one particular interpretation of FRBR, not allowing for others, and defines each element as being exclusively for use with a single FRBR entity
They recommend considering the possibility of creating a European profile of RDA scenario 1. This would give the European libraries a cataloging code based on RDA but under their control. They do ask, however, what the impact on global sharing will be if different library communities use different interpretations of FRBR. (My answer: define your data elements and exchange data elements; implement FRBR inside systems, but make it possible to share data apart from any particular FRBR structuring.)

* There is a strong adherence to ISO and IFLA standards outside of the U.S. I don't know why we in the U.S. feel less need to pay attention to those international standards bodies, but it does separate us from the greater library community.

(Thanks to John Hostage of Harvard for pointing out the recent EURIG activity on the RDA-L list.)

Due diligence do-over

In what I see as both a brave and an appropriate move, the University of Michigan admitted publicly that the Authors Guild had found some serious flaws in its process for identifying orphan works. The statement reaffirms the need to identify orphan works, and promises to revise its procedures.
"Having learned from our mistakes—we are, after all, an educational institution—we have already begun an examination of our procedures to identify the gaps that allowed volumes that are evidently not orphan works to be added to the list. Once we create a more robust, transparent, and fully documented process, we will proceed with the work, because we remain as certain as ever that our proposed uses of orphan works are lawful and important to the future of scholarship and the libraries that support it."
Among other things, what I find interesting in all this is that no one seems to be wondering why our copyright registration process is so broken that sometimes even the rights holders themselves don't know that they are the rights holders. It really shouldn't be this hard to find out if a work is covered by copyright. Larry Lessig covered this in his book "Free Culture," which is available online. The basic process of identifying copyrights is broken, and the burden is being placed on those who wish to make use of works. This is a clear anti-progress, pro-market bias in our copyright system.

Thursday, September 15, 2011

Diligence due

Oooof! Talk about making a BIG, public mistake.

HathiTrust's new Orphan Works Project proposed to do due diligence on works, then post them on the HT site for 90 days, after which those works would be assumed to be orphans and would then be made available (in full text) to members of the HT cooperating institutions. Sounds good, right? (Well, maybe other than the fact of posting the works on a site that few people even know about...)

The Authors Guild blog posted yesterday that it had found the rights holder of one of the books on HT's orphan works list in a little over 2 minutes using Google. (It's hard to believe that they didn't know this when the suit was filed on September 13 -- this is brilliant PR, if I ever saw it.) They then reported finding two others.

James Grimmelman, Associate Professor at New York Law School and someone considered expert on the Google Books case, has titled his blog post on this: "HathiTrust Single-Handedly Sinks Orphan Works Reform," stating that this incident will be brought up whenever anyone claims to have done due diligence on orphan works. I'm not quite as pessimistic as James, but I do believe this will be brought up in court and will work against HT.

Wednesday, September 14, 2011

Authors Guild in Perspective

In its suit against HathiTrust the three authors guilds claim that there are digitized copies of millions of copyrighted books in HathiTrust, and that these should be removed from the database and stored in escrow off-line.

A relevant question is: who do the authors guilds represent, and how many of those books belong to the represented authors?

The combined membership of the three authors guilds is about 13,000. That seems like a significant number, but the Library of Congress name authority file has about 8 million names. That file also contains name/title combinations, and I don't have any statistics that tell me how many of those there are. (If anyone out there has a copy of the file and can run some stats on it, I'd greatly appreciate it.) Some of the names are for writers whose works are all in the public domain. Yet no matter how we slice it, the authors guilds of the lawsuit represent a small percentage of authors whose in-copyright works are in the HathiTrust database.

The legal question then is: does this lawsuit pertain to all in-copyright works in HathiTrust, or only those by the represented authors? Could I, for example, sue HathiTrust for violating Fay Weldon's copyright?

Reply to this from James Grimmelman on his blog:
Good question, Karen, and one I plan to address in more detail in a civil procedure post in the next few days. In brief, you couldn’t sue to enforce Fay Weldon’s copyright, as you aren’t an “owner” of any of the rights in it. The Authors Guild and other organizations can sue on behalf of their members, but the details of associational standing are complicated. There is also the question of the scope of a possible injunction (e.g. could Fay Weldon win as to one of her works and obtain an injunction covering others, or works by others), where there are also significant limits on how far the court can go. Again, more soon.

As I suspected, the legal issues are complex. Keep an eye on James' blog for more on this.

Monday, September 12, 2011

Authors Guild Sues HathiTrust

There has been a period of limbo since Judge Chin rejected the proposed settlement between the Author's Guild/Association of American Publishers and Google. In fact, a supposedly final meeting between the parties is scheduled for this Thursday, 9/15, in the judge's court.

Monday, 9/12, the Author's Guild (and partners) filed suit against HathiTrust (and partners) for some of the same "crimes" of which it had accused Google: essentially making unauthorized copies of in-copyright texts. In addition, the recent announcement that the libraries would allow their users to access items that had been deemed to be orphan works figures in the suit. That this suit comes more than six years after the original suit against Google is in itself interesting. Nearly all of the actions of HathiTrust and its member libraries fall within what would have been allowed if the agreement that came out of that suit had been approved by the court. Although we do not know the final outcome of that suit (and anxiously await Thursday's meeting to see if it is revelatory), this suit against the libraries is surely a sign that AG/AAP and Google have not come to a reconciliation.

The Suit

First, the suit establishes that the libraries received copies of Google-digitized items from Google, and have sent copies of these items to HathiTrust, which in turn makes some number of copies as part of its archival function. This is followed by a somewhat short exposition of the areas of copyright law that are pertinent, with an emphasis on section 108, which allows libraries to make limited copies to replace deteriorating works. The suit states that the copying being done is not in accord with section 108. Then it refers to the Orphan Works Project that several libraries are partnering in, and the plan on the part of the libraries to make the full text of orphan works available to institutional users.

Since most of these institutions (if not all of them) are state institutions that have protections against paying out large sums in a lawsuit of this nature, the goal is to regain the control of the works by forcing HathiTrust (and the named libraries) to transfer their digital copies of in-copyright works to a "commercial grade" escrow agency with the files held off network "pending an appropriate act of Congress."

As James Grimmelman comments in his blog post on the suit, there's a lot of mixing up between the orphan works and owned works in the suit. He points out that a group of organizations representing authors could hardly make a case for orphan works since, by definition, the lack of ownership of the orphans means they can't be represented by a guild of people defending their own works.

The Problems

There are numerous problems that I see in the text of this suit. (IANAL, just a Librarian.)
  • The suit mentions large numbers of books that have been copied without permission, but makes no attempt to state how many of those books belong to the members of the plaintiff organizations.
  • The suit throws around large numbers without clearly stating whether those numbers include Public Domain works. It isn't clear, therefore, what the numbers represent: the entire holdings of HathiTrust, or just the in-copyright holdings. Also, in relation to the latter, many post-1923 works are themselves in the Public Domain, though determining which ones requires considerable work: cutting off at that year does not account for works whose copyrights were not renewed, or that were never copyrighted. I also doubt that anyone has a clear idea how many of the works in question are Public Domain because they are US Federal documents. This imprecision on the copyright status of works is very frustrating, but HathiTrust is not to blame for this state of affairs.
  • Some of their claims do not seem to me to be within legal bounds. For example, in one section they claim that although HathiTrust is not giving users access to in-copyright works, they potentially could. Where does that fit in?
  • They also claim that there is a risk of unauthorized access. However, the security at HathiTrust meets the security standards that the Author's Guild agreed to in the (unapproved) settlement with Google. If it was good enough then, why is it now too risky?
  • They claim that the libraries themselves have been digitizing in-copyright books. I wasn't aware of this, and would like to know if this is the case.
  • They state that the libraries said that before Google it was costing them $100 a book for digitization. Then the plaintiffs say that this means that the value of the digital files is in the hundreds of millions of dollars. First, I have heard figures that are more like $30 a book. Second, I don't see how the cost to digitize can translate into a value that is relevant to the complaint.
  • Although the legislature has failed to pass an orphan works law that would allow the use of these materials and still benefit owners if they do come forth, it seems like a poor strategy to complain about a well-designed program of due diligence and notification, which is what the libraries have designed. Orphan works are the least available works: if you have an owner you can ask permission; if there is no owner you cannot ask permission and therefore there is no way to use the work if your use falls outside of fair use. It's hard to argue for taking these works entirely out of the cultural realm simply because we have a poorly managed copyright ownership record.
  • There are a few odd sections where they make reference to bibliographic data as though that were part of the "unauthorized digitization" rather than data that was created by and belongs to the libraries. There's an odd attempt to make bibliographic data searching seem nefarious.
Plaintiffs: The Author's Guild, Inc.; The Australian Society of Authors Limited; Union des Écrivaines et des Écrivains Québécois; Pat Cummings; Angelo Loukakis; Roxana Robinson; André Roy; James Shapiro; Danièle Simpson; T.J. Stiles; and Fay Weldon. (Links are to some sample HathiTrust records.)

Defendants: HathiTrust; The Regents of the University of California, The Board of Regents of the University of Wisconsin System; The Trustees of Indiana University; and Cornell University.

Boing Boing: Authors Guild declares war on university effort to rescue orphaned books
Library Journal: Copyright Clash

Friday, September 09, 2011


As LC ponders the task of moving to a bibliographic framework, I can't help but worry about how much the past is going to impinge on our future. It seems to me that we have two potentially incompatible needs at the moment: the first is to fix MARC, and the second is to create a carrier for RDA.

Fixing MARC

For well over a decade some of us have been suggesting that we need a new carrier for the data that is currently stored in the MARC format. The record we work with today is full of kludges brought on by limitations in the data format itself. To give a few examples:
  • 041 Language Codes - We have a single language code in the 008 and a number of other language codes (e.g. for original language of an abstract) in 041. The language code in the 008 is not "typed" so it must be repeated in the 041 which has separate subfields for different language codes. However, 041 is only included when more than one language code is needed. This means that there are always two places one must look to find language codes.
  • 006 Fixed-Length Data Elements, Additional Material Characteristics - The sole reason for the existence of the 006 is that the 008 is not repeatable. The fixed-length data elements in the 006 are repeats of format-specific elements in the 008 so that information about multi-format items can be encoded.
  • 773 Host Item Entry - All of the fields for related resources (76X-78X) have the impossible task of encoding an entire bibliographic description in a single field. Because there are only 26 possible subfields (a-z) available for the bibliographic data, data elements in these fields are not coded the same as they are in other parts of the record. For example, in the 773 the entire main entry is entered in a single subfield ("$aDesio, Ardito, 1897-") as opposed to the way it is coded in any X00 field ("$aDesio, Ardito,$d1897-").
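The first of these kludges, the 008/041 split, can be sketched in a few lines. The record here is modeled as a plain dict rather than real MARC, and the field shapes are hypothetical; the point is only that a consumer must always check two places:

```python
# Hypothetical simplified record: "008" is the fixed-length field as a
# string (language code at positions 35-37), "041" is present only
# when more than one language code applies.
def language_codes(record: dict) -> list:
    codes = []
    f041 = record.get("041")
    if f041:                          # typed codes, only when present
        codes.extend(f041.get("a", []))
    else:                             # otherwise fall back to 008/35-37
        codes.append(record["008"][35:38])
    return codes

rec_one_lang = {"008": " " * 35 + "eng" + " " * 2}
rec_two_langs = {"008": " " * 35 + "eng" + " " * 2,
                 "041": {"a": ["eng", "fre"]}}
```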
Had we "fixed" MARC ten years ago, there might be less urgency today to move to a new carrier. As it is, data elements that were added so that the RDA testing could take place have made the format look more and more like a Rube Goldberg contraption. The MARC record is on life support, kept alive only through the efforts of the poor folks who have to code into this illogical format.

A Carrier for RDA

The precipitating reason for LC's bibliographic framework project is RDA. One of the clearest results of the RDA tests that were conducted in 2010 was that MARC is not a suitable carrier for RDA. If we are to catalog using the new code, we must have a new carrier. I see two main areas where RDA differs "record-wise" from the cataloging codes that informed the MARC record:
  • RDA implements the FRBR entities
  • RDA allows the use of identifiers for entities and terms
Although many are not aware of it, there already is a solid foundation for an RDA carrier in the registered elements and vocabularies in the Open Metadata Registry. Not long ago I was able to show that one could use those elements and vocabularies to create an RDA record. A full implementation of RDA will probably require some expansion of the data elements of RDA because the current list that one finds in the RDA Toolkit was not intended to be fully detailed.

To my mind, the main complications about a carrier for RDA have to do with FRBR and how we can most efficiently create relationships between the FRBR entities and manage them within systems. I suspect that we will need to accommodate multiple FRBR scenarios, some appropriate to data storage and others more appropriate to data transmission.

Can We Do Both?

This is my concern: creating a carrier for RDA will not solve the MARC record problem; solving the MARC record problem will not provide a carrier for RDA. There may be a way to combine these two needs, but I fear that a combined solution would end up creating a data format that doesn't really solve either problem because of the significant difference between the AACR conceptual model and that of RDA/FRBR.

It seems that if we want to move forward, we may have to make a break with the past. We may need to freeze MARC for those users continuing to create pre-RDA bibliographic data, and create an RDA carrier that is true to the needs of RDA and the systems that will be built around RDA data, with any future enhancements taking place only to the new carrier. This will require a strategy for converting data in MARC to the RDA carrier as libraries move to systems based on RDA.

Next: It's All About the Systems

In fact, the big issue is not data conversion but what the future systems will require in order to take advantage of RDA/FRBR. This is a huge question, and I will take it up in a new post, but just let me say here that it would be folly to devise a data format that is not based on an understanding of the system requirements that can fulfill desired functionality and uses.

Wednesday, September 07, 2011

XML and Library Data Future

There is sometimes the assumption that the future data carrier for library data will be XML. I think this assumption may be misleading and I'm going to attempt to clarify how XML may fit into the library data future. Some of this explanation is necessarily over-simplified because a full exposition of the merits and de-merits of XML would be a tome, not a blog post.

What is XML?

The eXtensible Markup Language (XML) is a highly versatile markup language. A markup language is primarily a way to encode text or other expressions so that some machine-processing can be performed. That processing can manage display (e.g. presenting text in bold or italics) or it can be similar to metadata encoding of the meaning of a group of characters ("dateAndTime"). It makes the expression more machine-usable. It is not a data model in itself, but it can be used to mark up data based on a wide variety of models.*

XML is the flagship standard in a large family of markup languages, although not the first: it is an evolution of SGML, which had (perhaps necessary) complexities that rendered it very difficult for most mortals to use. SGML is also the ancestor of HTML, the much simplified markup language that many of us take for granted.

Defining Metadata in XML

There is a difference between using XML as a markup for documents or data and using XML to define your data. XML has some inherent structural qualities that may not be compatible with what you want your data to be. There is a reason why XML "records" are generally referred to as "documents": they tend to be quite linear in nature, with a beginning, a middle, and an end, just like a good story.

XML's main structural functionality is that of nesting, or the creation of containers that hold separate bits of data together.

   <paragraph>
      <sentence> ... </sentence>
      <sentence> ... </sentence>
   </paragraph>


This is useful for document markup and also handy when marking up data. It is not unusual for XML documents to have nesting of elements many layers deep. This nesting, however, can be deceptive. Just because you have things inside other things does not mean that the relationship is anything more than a convenience for the application for which it was designed.


Nested elements are most frequently in a whole/part relationship, with the container representing the whole and holding the elements (parts) together as a unit (in particular a unit that can be repeated).


While usually not hierarchical in the sense of genus/species or broader/narrower, this nesting has some of the same data processing issues that we find in other hierarchical arrangements:
  • The difficulty of placing elements in a single hierarchy when many elements could be logically located in more than one place. That problem has to be weighed against the inconvenience and danger of carrying the same data more than once in a record or system and the chances that these redundant elements will not get updated together.
  • The need to traverse the whole hierarchy to get to "buried" elements. This was the pain-in-the-neck that caused most data processing shops to drop hierarchical database management systems for relational ones. XML tools make this somewhat less painful, but not painless.
  • Poor interoperability. The same data element can be in different containers in different XML documents, but the data elements may not be usable outside the context of the containing element (e.g. "street2").
Nesting, like hierarchy, is necessarily a separation of elements from each other, and XML does not provide a way to bring these together for a different view. Contrast the container structure of XML with a graph structure, in which the same node can be linked from any number of places.

In the nested XML structure some of the same data is carried in separate containers and there isn't any inherent relationship between them. Were this data entered into a relational database it might be possible to create those relationships, somewhat like the graph view. But as a record the XML document has separate data elements for the same data because the element is not separate from the container. In other words, the XML document has two different data elements for the zip code, one inside each address container.
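A sketch of the zip-code duplication, using a hypothetical address document and Python's standard-library ElementTree:

```python
import xml.etree.ElementTree as ET

# Hypothetical customer record: the same zip code appears under two
# different containers, and nothing in the XML itself says that
# billingAddress/zip and shippingAddress/zip are "the same" element.
doc = ET.fromstring("""
<customer>
  <billingAddress><zip>94703</zip></billingAddress>
  <shippingAddress><zip>94703</zip></shippingAddress>
</customer>""")

# Each zip is reachable only through its container -- the element is
# "pre-coordinated" with the context it was designed for.
paths = ["billingAddress/zip", "shippingAddress/zip"]
zips = [doc.find(p).text for p in paths]
```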


To use a library concept as an analogy, the nesting in XML is like pre-coordination in library subject headings. It binds elements together in a way that they cannot be readily used in any other context. Some coordination is definitely useful at the application level, but if all of your data is pre-coordinated it becomes difficult to create new uses for new contexts.

Avoid XML Pitfalls

XML does not make your data any better than it was, and it can be used to mark up data that is illogically organized and poorly defined. A misstep that I often see is data designers beginning to use XML before their data is fully described, and therefore letting the structure and limitations of XML influence what their data can express. Be very wary of any project that decides that the data format will be XML before the data itself has been fully defined.

XML and Library data

If XML had been available in 1965 when Henriette Avram was developing the MARC format it would have been a logical choice for that data. The task that Avram faced was to create a machine-readable version of the data on the catalog card that would allow cards to be printed that looked exactly like the cards that were created prior to MARC. It was a classic document mark-up situation. Had that been the case our records could very well have evolved in a way that is different from what we have today, because XML would not have had the need to separate fixed field data from variable field data, and expansion of some data areas might have been easier. But saying that XML would have been a good format in 1965 does not mean that it would be a good format in 2011.

For the future library data format, I can imagine that it will, at times, be conveyed over the internet in XML. If it can ONLY be conveyed in XML we will have created a problem for ourselves. Our data should be independent of any particular serialization and be designed so that it is not necessary to have any particular combination or nesting of elements in order to make use of the data. Applications that use the data can of course combine and structure the elements however they wish, but for our data to be usable in a variety of applications we need to keep the "pre-coordination" of elements to a minimum.
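As a sketch of that serialization independence: the same hypothetical element set emitted as both JSON and XML, with neither form privileged over the other:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical element set: flat, with no required nesting or ordering.
elements = {"title": "Free Culture", "creator": "Lawrence Lessig"}

# One possible serialization: JSON.
as_json = json.dumps(elements, sort_keys=True)

# Another: XML.  The wrapper element is an application choice,
# not part of the data's meaning.
root = ET.Element("record")
for name, value in elements.items():
    ET.SubElement(root, name).text = value
as_xml = ET.tostring(root, encoding="unicode")
```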

* For example, there is an XML serialization (essentially a record format) of RDF that is frequently used to exchange linked data, although other serializations are also often available. It is used primarily because there is a wide range of software tools available for making use of XML data in applications, and there are many fewer tools available for the more "native" RDF expressions such as triples or turtle. It encapsulates RDF data in a record format and I suspect that using XML for this data will turn out to be a transitional phase as we move from record-based data structures to graph-based ones.

Friday, August 26, 2011

New bibliographic framework: there is a way

Since my last post undoubtedly left readers with the idea that I have my head in the clouds about the future of bibliographic metadata, I wish to present here some of the reasons why I think this can work. Many of you were probably left thinking: Yea, right. Get together a committee of a gazillion different folks and decide on a new record format that works for everyone. That, of course, would not be possible. But that's not the task at hand. The task at hand is actually about the opposite of that. Here are a few parameters.

#1 What we need to develop is NOT a record format

The task ahead of us is to define an open set of data elements. Open, in this case, means usable and re-usable in a variety of metadata contexts. What wrapper (read: record format) you put around them does not change their meaning. Your chicken soup can be in a can, in a box, or in a bowl, but it is still chicken soup. That's the model we need for metadata. Content, not carrier. Meaning, not record format. Usable in many different situations.

#2 Everyone doesn't have to agree to use the exact same data elements

We only need to know the meaning of the data elements and what relationships exist between different data elements. For example, we need to know that my author and your composer are both persons and are both creators of the resource being described. That's enough for either of us to use the other's data under some circumstances. It isn't hard to find overlapping bits of meaning between different types of bibliographic metadata.

Not all metadata elements will overlap between communities. The cartographic community will have some elements that the music library community will never use, and vice versa. That's fine. That's even good. Each specialist community can expand its metadata to the level of detail that it needs in its area. If the music library finds a need to catalog a map, they can "borrow" what they need from the cartographic folks.

Where data elements are equivalent or are functionally similar, data definitions should include this information. Although defined differently, you can see that there are similarities among these data elements.
pbcoreTitle = a name given to the media item you are cataloging
RDA:titleProper = A word, character, or group of words and/or characters that names a resource or a work contained in it.
MARC:245 $a = title of a work
dublincore:title = A name given to the resource
All of these are types of titles, and have a similar role in the descriptive cataloging of their respective communities: each names the target resource. These elements therefore can be considered members of a set that could be defined as: data elements that name the target resource. Having this relationship defined makes it possible to use this data in different contexts and even to bring these titles together into a unified display. This is no different from the way we create web pages with content from different sources like Flickr, YouTube, and a favorite music artist's web site, like the image here.

In this "My Favorites" case, the titles come from the Internet Movie Database, a library catalog display, the Billboard music site, and Facebook. It doesn't matter where they came from or what the data element was called at that site, what matters is that we know which part is the "name-of-the-thing" that we want to display here.
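A sketch of that "name-of-the-thing" idea: a shared mapping records that each community's element plays the title role, so a mixed display can be assembled without converting anyone's records. The element names and the mapping table here are hypothetical:

```python
# Hypothetical role mapping: each community keeps its own element
# name; the mapping says they all "name the target resource."
TITLE_ROLE = {
    "pbcoreTitle": "title",
    "rda:titleProper": "title",
    "marc:245a": "title",
    "dc:title": "title",
}

def unified_titles(items: list) -> list:
    """Pull out whichever element plays the title role in each item."""
    out = []
    for item in items:
        for element, value in item.items():
            if TITLE_ROLE.get(element) == "title":
                out.append(value)
    return out

favorites = [{"dc:title": "Blade Runner"},
             {"marc:245a": "Neuromancer"}]
```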

#3 You don't have to create all new data elements for your resources if appropriate ones already exist

When data elements are defined within the confines of a record, each community has to create an entire data element schema of its own, even when some of its elements are also used by other communities. Yet there is no reason for different communities to each define a data element for something like the ISBN, because one will do. When data elements are fully defined apart from any particular record format, you can mix and match, borrowing from others as needed. This not only saves time in the creation of metadata schemas, but it also means that those data elements are 100% compatible across the metadata instances that use them.
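A record that borrows elements rather than redefining them might look like the sketch below. The prefixes follow common usage (dcterms for Dublin Core Terms, bibo for the Bibliographic Ontology), but the record structure, the "mylib:" namespace, and the values are invented for illustration.

```python
# Hypothetical record mixing elements from different namespaces rather
# than redefining them locally. Prefixes reflect common usage; the
# record itself and the "mylib:" namespace are invented.

record = {
    "dcterms:title": "Understanding Metadata",  # reused from Dublin Core
    "bibo:isbn13": "9780000000000",             # borrowed from BIBO (fake ISBN)
    "mylib:shelfLocation": "Z666.7 .U53",       # locally defined element
}

# Because each key carries its namespace, another system can pick out the
# elements it shares with us without knowing anything about "mylib:".
shared = {k: v for k, v in record.items() if not k.startswith("mylib:")}
print(sorted(shared))
```

The borrowed elements cost the local community nothing to define, and they are immediately intelligible to every other community that uses those namespaces.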

In addition, if there are elements that you need only rarely for less common materials in your environment, it may be more economical to borrow data elements created by specialist communities when they are needed, saving your community the effort of defining additional elements under your metadata name space.

To do all of this, we need to agree on a few basic rules.

1) We need to define our data elements in a machine-readable and machine-actionable way, preferably using a widely accepted standard.

This requires a format for describing data elements that contains the minimum information needed to make use of a defined data element. Generally, this minimum is:
  • a name (for human readers)
  • an identifier (for machines)
  • a human-readable definition
  • both human- and machine-readable definitions of relationships to other elements (e.g. "equivalent to," "narrower than," "opposite of")
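The minimum information listed above can be expressed as plain data. This is a sketch only: the URIs and field names are invented, and in practice one would use an established standard such as RDF Schema or SKOS rather than an ad hoc structure like this.

```python
# Hypothetical machine-actionable element definition carrying the four
# minimum pieces of information listed above. URIs and field names are
# invented; a real registry would use RDF Schema, SKOS, or similar.

elements = [
    {
        "name": "title",                                # for human readers
        "id": "http://example.org/elements/title",      # for machines
        "definition": "A name given to the resource.",  # human-readable
        "relationships": [
            # machine-actionable links to other communities' elements
            {"type": "equivalentTo",
             "target": "http://purl.org/dc/terms/title"},
        ],
    },
]

# A machine can now answer: which local elements map to Dublin Core's title?
dc_title = "http://purl.org/dc/terms/title"
matches = [
    e["id"]
    for e in elements
    for rel in e["relationships"]
    if rel["type"] == "equivalentTo" and rel["target"] == dc_title
]
print(matches)
```

With definitions in this form, the "set of elements that name the target resource" from the title example becomes something a program can compute, not just something a cataloger knows.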

2) We must have the willingness and the right to make our decisions open and available online so others can re-use our metadata elements and/or create relationships to them.

We also must have a willingness to hold discussions about areas of mutual interest with other metadata creators and with metadata users. That includes the people we think of today as our "users": writers, scholars, researchers, and social network participants. Open communication is the key. Each of us can teach, and each of us can learn from others. We can cooperate on the building of metadata without getting in each other's way. I'm optimistic about this.

Thursday, August 25, 2011

Bibliographic Framework Transition Initiative

The Internet began as a U.S.-sponsored technology initiative that went global while still under U.S. government control. The transition of the Internet to a world-wide communication facility is essentially complete, and few would argue that U.S. control of key aspects of the network is appropriate today. It is, however, hard for those once in control to give it up, and we see that in ICANN, the body charged with making decisions about the name and numbering system that is key to Internet functioning. ICANN is under criticism from a number of quarters for continuing to be U.S.-centric in its decision-making. Letting go is hard, and being truly international is a huge challenge.

I see a parallel here with the Library of Congress and MARC. While there is no question that MARC was originally developed by the Library of Congress, and has been maintained by that body for over 40 years, it is equally true that the format is now used throughout the world and in ways never anticipated by its original developers. Yet LC retains a certain ownership of the format, in spite of its now global nature, and it is surely time for that control to pass to a more representative body.

Some Background

MARC began in the mid-1960's as an LC project at a time when the flow of bibliographic data was from LC to U.S. libraries in the form of card sets. MARC happened at a key point in time when some U.S. libraries were themselves thinking of making use of bibliographic data in a machine-readable form. It was the right idea at the right time.

In the following years numerous libraries throughout the world adopted MARC or adapted MARC to their own needs. By 1977 there had been so much diverse development in this area that libraries used the organizing capabilities of IFLA to create a unified standard called UNIMARC. Other versions of the machine-readable format continued to be created, however.

The tower of Babel that MARC originally spawned has now begun to consolidate around the latest version of the format, MARC21. The reasons for this are multifold. First, there are economic reasons: library system vendors have had to support this cacophony of data formats for decades, which increases their costs and decreases their efficiency. Having more libraries on a single standard means that a vendor has fewer code bases to develop and maintain. The second reason is the increased sharing of metadata between libraries: it is much easier to exchange bibliographic data between institutions using the same data format.

Today, MARC records, or at least MARC-like records, abound in the library sphere and pass from one library system to another like packets over the Internet. OCLC has a database of about 200 million records in MARC format, with data received from some 70,000 libraries, admittedly not all of which use MARC in their own systems. The Library of Congress has contributed approximately 12 million of those. Within the U.S., the various cooperative cataloging programs have distributed the effort of original cataloging among hundreds of institutions. Many national libraries freely exchange their data with their counterparts in other countries as a way to reduce cataloging costs for everyone. The directional flow of bibliographic data is no longer from LC to other libraries, but is a many-to-many web of data creation and exchange.

Yet, much like ICANN and the Internet, LC remains the controlling agency over the MARC standard. The MARC Advisory Committee, which oversees changes to the format, has grown and has added members from Library and Archives Canada, the British Library, and the Deutsche Nationalbibliothek. However, the standard is still primarily maintained and issued by LC.

Bibliographic Framework Transition Initiative

LC recently announced the Bibliographic Framework Transition initiative to "determine a transition path for the MARC21 exchange format."
"This work will be carried out in consultation with the format's formal partners -- Library and Archives Canada and the British Library -- and informal partners -- the Deutsche Nationalbibliothek and other national libraries, the agencies that provide library services and products, the many MARC user institutions, and the MARC advisory committees such as the MARBI committee of ALA, the Canadian Committee on MARC, and the BIC Bibliographic Standards Group in the UK."
In September we should see the issuance of their 18-month plan.

Not included in LC's plan as announced are the publishers, whose data should feed into library systems and does feed into bibliographic systems like online bookstores. Archives and museums create metadata that could and should interact well with library data, and they should be included in this effort. Also not included are the academic users of bibliographic data, users who are so frustrated with library data that they have developed numerous standards of their own, such as BIBO (the Bibliographic Ontology), BibJSON (a JSON format for bibliographic data), and FaBiO (the FRBR-aligned Bibliographic Ontology). Nor are there representatives of online sites like Wikipedia and Google Books, which have an interest in using bibliographic data as well as a willingness to link back to libraries where that is possible. Media organizations, like the BBC and the U.S. public broadcasting community, have developed metadata for their video and sound resources, many of which find their way into library collections. And I almost forgot: library system vendors. Although they have some representation on the MARC Advisory Committee, they need a strong voice given their level of experience with library data and their knowledge of the costs and affordances.

Issues and Concerns

There is one group in particular that is missing from the LC project as announced: information technology (IT) professionals. In normal IT development the users do not design their own system. A small group of technical experts designs the system structure, including the metadata schema, based on requirements derived from a study of the users' needs. This is exactly how the original MARC format was developed: LC hired a computer scientist, Henriette Avram, to study the library's needs and develop a data format for its cataloging. We were all extremely fortunate that LC hired someone who was attentive and brilliant. The format was developed in a short period of time, underwent testing and cost analysis, and was integrated with work flows.

It is obvious to me that standards for bibliographic data exchange should not be designed by a single constituency, and should definitely not be led by a small number of institutions that have their own interests to defend. Consultation with other, similar institutions is not enough to make this a truly open effort. While there may be some element of not wanting to give up control of this key standard, it also is not obvious to whom LC could turn to take on this task. LC is to be commended for committing to this effort, which will be huge and undoubtedly costly. But this solution is imperfect at best, and at worst could result in a data standard that does not benefit the many users of bibliographic information.

The next data carrier for libraries needs to be developed as a truly open effort. It should be led by a neutral organization (possibly ad hoc) that can bring together the wide range of interested parties and make sure that all voices are heard. Technical development should be done by computer professionals with expertise in metadata design. The resulting system should be rigorous yet flexible enough to allow growth and specialization. Libraries would determine the content of their metadata, but ongoing technical oversight would prevent the introduction of implementation errors such as those that have plagued the MARC format as it has evolved. And all users of bibliographic data would have the capability of metadata exchange with libraries.

Friday, August 19, 2011

Metadata Seminar at ASIST

I'm going to be giving a half-day seminar on October 12 in New Orleans in association with ASIST. This is something I have been wanting to do for a while. I feel like I've spent the past two years presenting Semantic Web 101 in 45-minute segments, and I really want to start moving on to 102, 103, etc. I'm hoping this seminar will fill that gap.

The topics I will cover at that seminar are:
  • Understanding data, data types, and data uses
  • Identifiers, URIs and http URIs
  • Statements and triples and their role in the 'web of data'
  • Defining properties and vocabularies that can be used effectively on the web
  • Brief introduction to semantic web standards
There will be hands-on exercises throughout the morning that give attendees a chance to learn by doing. I'm hoping that the exercises will also be fun. If you're going to ASIST and have any questions about the seminar, please contact me.

Sunday, August 14, 2011

Men, Women: Different

The title of this post was a teaser headline on the cover of USA Today -- no, I don't remember when, but the statement definitely struck me. Yes, we are different. Our differences in point of view run so deep that it's often hard to explain why something matters.

This ad for the cordless mouse clearly made it all of the way through the company's management structure without raising an eyebrow, but many women I have shown this to have had a visceral reaction, since "cut the cord" brings up thoughts of childbirth, which makes this photo of butchers pretty gruesome.

Around this same time (and I'm again talking about the mid 1990's) women began complaining about the title of the back page of PC Magazine: Abort, Retry, Fail. It was a page of bloopers and idiotic error messages. You have to be of the older generation to remember what ARF was about, since it was a DOS error code. There are some examples here in case you are either a) too young to remember, or b) wanting a nostalgia trip.

In any case, some women objected to the use of Abort, Retry, Fail for a humor page because the term abort wasn't at all humorous to them. Men didn't understand this at all. They also seemed to think that the meaning "to end a process" was the main usage of the term, and that its association with a failed pregnancy was a minor nit, hardly worth noticing. It obviously all depends on which meaning of abort has had the greatest effect on your life. PC Magazine did change the name to Backspace, then changed it back to Abort, Retry, Fail in 2006. The magazine didn't survive much longer, though that had nothing to do with its back page, I'm sure.

This last image (unless I get ambitious later and scan some more) could almost have been used as a test for "Are you a man or a woman?" Chameleon was software for managing your Internet connection, and darned good software at that. We used it in my place of work for years. If you see a man on a motorcycle and nothing else, you just might be a man. If you see some high heels flying in the air and get an image of a woman having just been dumped on the road, you're either a woman or would make a great boyfriend. In my talks and writing I called this image: Woman as roadkill on the information highway.

It always amazes me how separate the realities can be for men and women, although there's a chance they are no more distant than those of rich and poor, abled and disabled, or any other human dichotomy you can come up with. I can say that having experienced the world of computing for nearly forty years as a woman, these differences in perception have a real effect on getting along and getting things done. One of my favorite statements is from Professor Ellen Spertus, who teaches and encourages women in computer science and who says: "You can be both rigorous and nurturing." My translation of this is: women's views count, too.

Throbnet, 1995

One of the characteristics of PC Magazine in the mid-1990's was its adult classified section. This went on for many pages; many, many offensive pages. Remember, this was a magazine that many of us read in our professional capacity, since it was the main way to get information about new products and trends. PC Magazine was the primary source of hardware and software reviews; their special printer issue was the place to go before buying a printer. But unfortunately, it also came with these pages.

This is a fairly mild example. I don't remember my rationale but I probably didn't feel comfortable showing the raunchier ads to my audiences, so I used this one. There were more explicit examples like the ads for Throbnet (a name that is still used in online porn).

But the real clincher for me was when I went to my first computer show. My memory has it that it was a MacWorld, but I can't be sure of that. It was in San Francisco, around 1995. Included among the exhibitors were some of the porn vendors who advertised in these pages. Their draw was that they had the actresses there in the booth. I distinctly remember the line of guys waiting to have their copy of "Anal ROM" signed. It was a very uncomfortable place for a woman working in the computer field.

No hairstyling tips, 1995

(The next few posts will be feminist in nature. If that type of thing annoys you, I suggest you skip them, and I'll be back to librarianship in a trice.)

There is a new generation of women dealing with the nature of computing culture. Fortunately, they have social media to help them cope. (Example1, Example2) Reading their posts reminded me that in the mid-90's I did talks about the portrayal of women in computer magazines, and that I might have some illustrations that were still usable. I have only a few, since most of my examples ended up as black and white transparencies that aren't scannable. But in the next few posts I'll offer what I do have, all from about 1995.

The above image is of a postcard that I received in the mid-nineties from a bulletin board system (BBS) called "BIX". BBSs were the only way to get online in those days, although by the mid-nineties most gave you a pass-through to the Internet. A BBS was a kind of mini-AOL: an online gathering and posting place that was a walled community. The first BBS I joined was CompuServe, since that was the main place for technical information about PC hardware and software. (Note: there was little or no product information on the Internet, which in the 1980's and early 90's was strictly limited to academic activities and research.)

I must have gotten the BIX card because I subscribed to PC Magazine. The message, however, was far from inviting. Here's the back of the card:

The main text says:
No garbage.
No noise.
No irrelevant clutter.

Which, as the card illustrates, obviously meant: no girls.

I have more examples of this "boy's club" atmosphere in 1996 in my article in Wired Women: How hard can it be? (Available on my site.)