Monday, June 27, 2022

The OCLC v Clarivate Dilemma

OCLC has filed suit against Clarivate, the company that owns ProQuest and Ex Libris. The suit focuses on a metadata service proposed by Ex Libris called "MetaDoor." MetaDoor isn't a bibliographic database à la WorldCat; it is a peer-to-peer service that allows its users to find quality records in the catalog systems of other libraries. ("MetaDoor" is a terrible name for a product, by the way.)

What specifically seems to have OCLC's dander up is that Ex Libris states that it will allow any and all libraries, not just its Alma customers, to use this service for free. Because the service does not yet exist, it is unknown how it could affect the library metadata sharing environment. It may succeed, it may fail. If it succeeds, the technology that Ex Libris develops will be a logical next step in bibliographic data sharing, but its effect on OCLC is hard to predict.

Yesterday's and Today's Technology

WorldCat is yesterday's technology: a huge, centralized database. Peer-to-peer sharing of bibliographic records has been possible since the development of the Z39.50 protocol in the 1980s, and presumably libraries have made considerable use of that protocol to obtain records from other libraries. Over the years many programs and systems have been developed to make use of Z39.50, and the protocol is built into library systems, both for obtaining records and for sharing them.
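The peer-to-peer idea is simple enough to sketch. Here is a toy illustration in Python: ask each known peer, in turn, for a record matching an identifier. All peer names and records below are invented; a real client would speak Z39.50 (or its web-era successor, SRU) to each peer's catalog rather than read a local dictionary.

```python
# Invented sample data: two peer catalogs keyed by ISBN.
PEER_CATALOGS = {
    "library-a": {"9780000000001": "MARC record from library-a"},
    "library-b": {"9780000000001": "MARC record from library-b",
                  "9780000000002": "MARC record held only by library-b"},
}

def fetch_record(isbn, peers=PEER_CATALOGS):
    """Query each peer in turn; return (peer, record) for the first hit."""
    for peer, catalog in peers.items():
        if isbn in catalog:
            return peer, catalog[isbn]
    return None, None

peer, record = fetch_record("9780000000002")
```

What this toy version lacks, of course, is exactly what the post goes on to discuss: any consolidated view of *who* holds *what*.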

The actual extent of peer-to-peer sharing of bibliographic records today does not seem to be known, although I did only a brief search for that information. It is definitely in use in library environments where participation in OCLC is unaffordable; articles vaunt its use in Russia, India, Korea, and other countries. It is built into the open source library system Koha, which is aimed at libraries that are priced out of the mainstream library systems market. Where libraries have known peers, such as the national library of a country, peer-to-peer makes good sense.

What OCLC's centralized database has that peer-to-peer lacks (at least to date) is consolidated library holdings information. As Kyle Banerjee said on Twitter, the real value in WorldCat is the holdings. This is used by interlibrary loan systems, and it is what appears on the screen when you do a WorldCat search. Cleverly, OCLC has recorded the geographical location of all of its holding libraries and can give you a list of libraries relative to your location. In the past this type of service was only available through a central database, but we may have arrived at a point where peer-to-peer could provide this as well.

A couple of other things before I look at some specific points in the lawsuit. One is that WorldCat is not the only bibliographic database used for sharing of metadata. Some smaller library companies also have their own shared databases. These are much smaller than WorldCat, and the libraries that use them generally 1) are unable to afford OCLC's member fees and 2) do not need the depth or breadth of WorldCat's bibliographic data. For example, the CARL database from the company TLC has 77 million records, far fewer than WorldCat's more than 500 million. Even the Library of Congress catalog is only 20 million strong. The value for some libraries is that WorldCat contains the long tail; for others, that long tail is not needed. It's the difference between the Harvard library and your local public library. Harvard may well need metadata for a Lithuanian poetry journal; your local public library can do just fine with a peer database of popular works published in the US.

And another: we're slowly moving from a "thing"-based world to a "data"-based world. Yes, scholars still need books and journals, but increasingly our information seeking returns tiny bites, not big thoughts. You can rue that, but I think it's only going to get worse. It's like the difference between a Ken Burns 10-part documentary on the Civil War and TikTok. The metadata creation activity suited to the deep thoughts of books and articles is not viable for YouTube, Instagram, TikTok or even Facebook. We "book people" are hanging on to a vast repository that is less and less looking forward and more and more becoming dusty and crusty. We don't want to lose that valuable archive, but it is hard to claim that we are not a fading culture.

OK, to the lawsuit.

What is the Nut of this Case?

OCLC claims in its suit that Clarivate is undertaking MetaDoor as a malicious act, targeting WorldCat with a desire to destroy it. I don't think you need to be malicious to come up with a project to create an efficient system for sharing bibliographic records. Creating a shared database at this time is simply a logical need for any data service. 

The main claim behind OCLC's suit is that uploading library catalog bibliographic records to MetaDoor violates the libraries' contracts with OCLC, and that Clarivate/Ex Libris is encouraging libraries to violate those contracts. As Clarivate has no such contract with OCLC, the suit uses terms like "conspiracy" and a lot of "tortious" to argue that Clarivate/Ex Libris is breaking some law of competition by encouraging OCLC customers to violate their contracts.

I'm not sure how that will play in court but you can see on the Clarivate site that one of their main areas of expertise is in intellectual property around data. Regardless of the outcome of this suit we may get to see some interesting arguments around data ownership. It's still a wide-open area where some smart discussion would be very welcome.

The ILS Market

The lawsuit complains that Clarivate has become the largest player in the ILS market through its purchase of companies like Ex Libris and ProQuest. (It isn't clear to me how "large" is defined here.) It also bemoans the consolidation of the library market. The library market is hardly unique in this; consolidation of this type is the normal course of things in our barely-regulated capitalism. It is, as always, hard to understand just what Clarivate owns, because Clarivate owns ProQuest, which owns Ex Libris, which owns Innovative Interfaces, which owns SkyRiver and VTLS, among others. The number of players in the library market, which was once a handful of independent companies, is shrinking at a rapid rate, and this has been a worry in the library world for decades now.

On its web site Clarivate presents itself as a research data and analytics company. It includes ProQuest and Web of Science in its list of offerings, but interestingly makes no mention of Ex Libris. I've always wondered why anyone with any business sense would want to enter the library cataloging systems market. In fact, Clarivate inherited Ex Libris when it purchased ProQuest, and the Clarivate press release upon acquiring ProQuest makes no mention of Ex Libris or other library systems.

Speaking of market consolidation, one must remember that at one time OCLC had two rather large competitors in the library cataloging market: the Western Library Network and Research Libraries Information Network. OCLC purchased both of these, and they then ceased to exist. That was itself a consolidation that concerned many because at the time few library cataloging systems provided a significantly large database to support the cataloging activity. Also, take a gander at this chart from Marshall Breeding's Library Tech Guides that shows the "mergers and acquisitions" of OCLC:


(more readable on Marshall's site, so hop over there for details)

What is a WorldCat record?

The lawsuit speaks of the "theft" of WorldCat records by Ex Libris for its MetaDoor product (an odd accusation, as the records will be voluntarily offered by the participating libraries). The peer-to-peer action of MetaDoor, however, does not touch the WorldCat database directly. As I understand it from the Ex Libris web site, libraries using the Ex Libris system agree to have that system harvest records from their databases. Information from those records will be indexed in MetaDoor, but the records themselves will not be stored there. Users of MetaDoor will discover records they need for cataloging through MetaDoor, and the records will be retrieved from the library system holding the record. Without a doubt, some of those records will have been downloaded by libraries during cataloging on OCLC. The lawsuit refers to these as "WorldCat records."

Here's the hitch: these records are distributed among individual library databases. Each MARC record is a character string, any part of which can be modified using software written for that purpose. That software may be part of the library catalog system, or it may be standalone software like the freely available MarcEdit. Other software, like OpenRefine, has been incorporated into batch workflows for MARC records to make changes to records. Basically, the records undergo a lot of changes: both the "enhancements" in WorldCat that the lawsuit refers to and an unknown quantity of modifications once individual libraries obtain them.

Note that some libraries do not use OCLC and therefore have no WorldCat records, and many libraries have multiple sources of bibliographic data. It simply isn't possible to say "all your MARC belong to us." It's much more complicated than that. Although there is nominally both provenance and versioning data in the MARC records, these fields are as editable as all others. In addition, some systems ignore these and do not attempt to update those details as the records are edited. This means that there is no way to look at a record in a library database and determine precisely from where it was originally obtained prior to being in that database. If library A uses OCLC to create a catalog record and library B (not an OCLC cataloging customer) uses its catalog system's Z39.50 option to copy that record from Library A, modifying the record for its own purposes, then library C obtains the record from Library B ... well, you see the problem. These records may flow throughout the library catalog universe, losing their identity as WorldCat records with each step.

OCLC appears to claim in its suit that the OCLC number confers some kind of ownership stamp on the records. In one of the later paragraphs of a very insightful Scholarly Kitchen blog post, Todd Carpenter reminds us that OCLC has not claimed restrictions on the identification number. Also, like everything else in the MARC record, that number can be deleted, modified, or added to a record at the whim of the cataloger. (OK, I admit that "whim" and "cataloger" probably shouldn't be used in the same sentence.) 
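To make the point concrete, here is a minimal sketch of a record as editable data. The record is modeled as a simple list of (tag, value) pairs rather than real MARC; field 035 conventionally carries system control numbers such as the OCLC number with the "(OCoLC)" prefix, and like any other field it can simply be dropped. The record content itself is invented.

```python
# A toy record: a list of (field tag, value) pairs standing in for MARC.
record = [
    ("001", "on1234567890"),                  # local control number (invented)
    ("035", "(OCoLC)1234567890"),             # OCLC number lives in an 035
    ("245", "An example title / A. Author."),
]

def strip_oclc_numbers(fields):
    """Return a copy of the record with OCLC control numbers removed."""
    return [(tag, value) for tag, value in fields
            if not (tag == "035" and value.startswith("(OCoLC)"))]

cleaned = strip_oclc_numbers(record)
```

Nothing in the record format prevents this edit, which is exactly the problem with treating the OCLC number as an ownership stamp.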

Rather than flinging lawsuits around, it would be very interesting to use that money to hire one of those people who looks out 20 years to tell you what the environment will be and what you should be investing in today. I can cover a certain amount of the past, but the future is a fog to me. I hope someone has ideas.

-----

As with many lawsuits, there's a lot of flinging documentation back and forth. Check out this site to keep an eye on things. I welcome recommendations of other resources.

Wednesday, January 26, 2022

What's in a Name?

This is an essay about the forms of names and their representation in metadata. It is not by any means complete, nor am I an expert in this very complex area. These are my observations and a few suggestions for future work. All comments welcome.

[Because this is huge, and printing from here is oddly difficult, here's a PDF.]

If you do anything online, and surely you do, you have filled in countless forms with your name and address. Within the Western and English-speaking world, these have some minor (and occasionally annoying) variations. You might be asked for a first name and last name, or a given name and a family name, or just a name in a particular order.


There are variations, of course. Some recognize the practice of giving a person a "middle" name, that is, a second, and perhaps secondary, additional name.

Because these forms are often used in commercial sites and the companies wish to have a polite relationship with their customers, you might be asked about your preferred form of address. 

These forms of address have cultural significance, and the list itself can reveal quite a bit about a culture. This is the list from the British Airways site:

We'll come back to some of these below.

The above examples come from commerce sites, where the use of names is mostly social. Even on a site like a bank's, the name plays only a minor role in identification because security relies on user names, passwords, and two-factor authentication. Names themselves are poor identifiers because they are far from unique across a population. Even if you think you have an unusual name, you will find others with your name in the vastness of the Internet.*

If you think about times you've been on the phone with a bank or service, they invariably ask you to provide a telephone number, an email address, or a unique identifier like a social security number as a way to identify you. Only after they have located a record with that identifier do they use your name, both as confirmation that you've given (and they've entered) the correct number and so that they can cheerily refer to you by name.

Names in Cultural Heritage

Where commercial organizations use names to effect a relationship with their current customers, cultural heritage institutions have a different set of needs. They must cover not only the names of modern persons but names from around the world and from previous eras. An organization must be able to encode this full range of names in a way that is useful today but that is, to the extent possible, faithful to the cultural and historical context of the person. Royalty, religious figures, even characters in mythology all hold a very tender place in their respective cultures. To treat them otherwise is to dismiss their cultural importance. You wouldn't want to provide metadata for Benedict XVI without also including that his title and role in the church is "Pope". You most certainly would not simply name him "Joseph Ratzinger" unless you were giving a very specific, pre-Pope, context. I don't know what name Queen Elizabeth II would provide when signing up for an Amazon account ("Elizabeth Windsor"?), as there is unlikely to be an input box appropriate for her royal name, but culturally and historically she is Elizabeth II, Queen of Great Britain.

There is also the question of giving people their due rank in whatever hierarchy the particular culture values. As you can see in the list of titles offered by British Airways above, titles of nobility are important in the UK, whereas US-based airlines limit the titles to Mr., Mrs., Ms., and Dr. We can presume that to "mis-title" a person would be a social faux pas in most cultures, but there is also a historical context embedded in titles that one would not want to lose.

The "firstname, lastname" Problem

Not all names fit the "firstname, lastname" model. A primary reason to identify these parts of names is to support displays in alphabetical order by the "last name". This assumes that the last name is a family name, and that common usage is to gather together all persons with that family name in a display. In reality, this singular "family name" is only one possible name pattern. 

As the term "family name" implies, this positions a person within a group of persons with a particular relationship. In the dominant Western world, the name is paternal and denotes a line of inheritance. But this is by no means the only name pattern that exists. There are cultures where the child's name includes the family names of both the mother and the father, and sometimes other ancestors in the family line. This is how Juan Rodríguez y García-Lozano and María de la Purificación Zapatero Valero have a son named José Luis Rodríguez Zapatero. Treating "Rodríguez Zapatero" as the family name would not bring together the alphabetical entries of the father and son.

There are other cultures that have a given name and a patronymic. While a patronymic may look like a family name, it is not. The singer Björk may have seemed to be using a single name as part of her art, like Cher or Madonna, but in fact in the Icelandic culture persons are known by a single name. When a more precise designation is needed, that name is enhanced with a name based on the given name of their father. In this case, Björk has a more "official" name of Björk Guðmundsdóttir, which is "Björk daughter of Guðmundur". Her father's name was Guðmundur Gunnarsson, he being the son of Gunnar. The author Arnaldur Indriðason is "Arnaldur the son of Indriði." In this practice, creating an order based on the patronymic would result in just a jumble of individual parental names, and persons are almost always called solely by their "first-and-only" name.
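The Spanish and Icelandic examples above can be sketched as a name structure that does not assume a single family name. This is purely illustrative; the field names and the sort rule are my own invention, not any standard's.

```python
from dataclasses import dataclass, field

@dataclass
class PersonName:
    given: list                                  # one or more given names
    family: list = field(default_factory=list)   # zero, one, or two family names
    patronymic: str = ""                         # father's-name form, not a family name

# Two family names (Spanish practice): father's first, mother's second.
zapatero = PersonName(given=["José", "Luis"], family=["Rodríguez", "Zapatero"])

# No family name at all (Icelandic practice): given name plus patronymic.
bjork = PersonName(given=["Björk"], patronymic="Guðmundsdóttir")

def sort_key(name):
    """Sort by the first family name if there is one, else by given name --
    Icelanders are alphabetized by given name, not by patronymic."""
    return name.family[0] if name.family else name.given[0]
```

A single "lastname" slot would force both of these into a shape they don't have.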

Yet another exception to the firstname/lastname conundrum relates to the names of royalty as mentioned above. Charles, Prince of Wales is the son of Elizabeth II. Their names do not connect them which is somewhat ironic given how important family relationships are to royal lineage. Both are of the house of Windsor but you wouldn't know that from their names. Like a Pope, the cultural or political position in these cases outweighs the personal. In addition, the title by which someone is officially known can change over time, making identification even more confusing, with titles being inherited or bestowed as circumstances change. Some people hold a plethora of titles: in addition to Prince of Wales, Charles is Earl of Chester, Duke of Cornwall, Duke of Rothesay, Earl of Merioneth and Baron Greenwich. This is as bad as the name proliferation in Russian novels, and just as confusing.

And there are the "one name" instances.  We have historical figures with only a single name ("Homer", "Aesop") but there are also current cultures in which members have only one name.


Any metadata that strictly requires both a given name and a family name will be unable to accommodate these and it is not unusual for people with only one name to be required to provide a second name to conform to the given/family name expectation in other cultures. There may even be local traditions for how one invents such a name. Yet they would not use that invented name in their own home country.

Names and Language

It is hard to separate language from culture, but there are some name situations in which the name is translated into the "receiving" language. Catherine (the Great) is Catherine in French, Caterina in Italian, etc.  The same is true of Popes:

Papa Franciscus (Latin)

Papa Francesco (Italian)

Papa Francisco (Spanish)

Pope Francis (English)

Another twist is that scientists and other cognoscenti of the late medieval and early modern times communicated with each other in Latin, and, probably as a form of showing that they were members of this elite club, often converted their names to a Latin form. Thus, one Aldo Pio Manuzio, a Venetian scholar and a very early book printer, took the name Aldus Pius Manutius. Francis Bacon published his "Novum Organum" (which was in Latin) as "Franciscus Baconis". 

Things get doubly complex as people and their names move from one culture to another. Many people of Chinese origin reverse the order of their names from family name first then given name to the preferred order in Western countries that places the family name last. In some cases, as with science fiction author Liu Cixin, a change for the Western marketplace creates a bit of confusion for anyone wanting to correctly encode this Chinese name.


Note that his translator, Ken Liu, an American, uses the Western form of his own name. So this book cover is a good illustration of the name problem across cultures.

Names in Metadata

How we handle names in metadata design depends mainly on the intended application functions for the data. Below are four key purposes for names and their encoding in metadata; the list is surely incomplete:

1. Display - Names get displayed in a number of different contexts, from phone books to faculty listings on a web site to conference name tags. Displays may use all or only part of a name, and there are a variety of ways that one can order the name parts.

2. Disambiguation - Which Mary Jones is this? How do I identify and find the one that I am looking for?

3. Addressing - We do want to address people appropriately, and we also want to talk about them appropriately.

4. Finding - Searching via keyword is without context, so I'll assume that all name forms can be searched in that way. I will describe "finding" as meaning a search for a specific, known name.

I'll trace these through some metadata schemas to illustrate the metadata capabilities one might have.

Library of Congress (and other libraries)

Libraries have been dealing with names and name forms for, well, forever; as long as there have been libraries. The set of rules for determining what name to enter for someone in the library catalog is many, many tens of pages long, and there are separate rules for personal names, corporate names, and family names. Yet library name practices have their limitations, in particular that names are entered as strings that are to be used to create a specific alphabetical sort order that begins with the surname, followed by a comma, and then the forename(s). 

Dempsey, Martin, 1904-
Dempsey, Martin E.
Dempsey, Mary.
Dempsey, Mary A.

Display by family name works well for Western names with family names, but not for Eastern names that place the family name first.

  Mao, Zedong

Following Chinese name practice, his name would naturally be given as "Mao Zedong" because the family name is always given first. If one attempts to use the comma to restore names to their natural order, say from "Smith, Jane" to "Jane Smith", then you would also end up with "Zedong Mao", which is not correct in that cultural context. A culturally sensitive "natural order" display is not supported by this metadata.
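The inversion failure is easy to demonstrate. In this sketch, a per-name flag (my invention, not part of any cataloging standard) is the minimum extra information needed to display each heading correctly:

```python
def natural_order(heading, surname_first=False):
    """Rebuild a display form from a 'Family, Given' catalog heading."""
    family, _, given = (part.strip() for part in heading.partition(","))
    if surname_first:            # e.g. Chinese practice: no inversion, no comma
        return f"{family} {given}"
    return f"{given} {family}"

assert natural_order("Smith, Jane") == "Jane Smith"
assert natural_order("Mao, Zedong", surname_first=True) == "Mao Zedong"
# Without the flag, the same logic would wrongly produce "Zedong Mao".
```

The heading string alone cannot tell you which rule applies; that fact has to be recorded somewhere.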

The primary display form is the Western one of lastname-comma-first names, but there are exceptions for entry by forename, which is given specific coding: 

Arnaldur Indriðason, 1961-
Homer

As I've shown in the Mao Zedong example, the encoding of name parts in library data does not provide what you might need to create other display forms. In the case of Arnaldur Indriðason, outside of the library's need to alphabetize its entries, you may want to know that Indriðason is a patronymic if you intend to use the name to address the person as he would be addressed in his culture. The example of "Mao, Zedong" lacks the information that this is a name from a culture that regularly refers to people with the surname preceding the given name (and without a comma). You would want to know that this should be rendered as "Mao Zedong" when used in that context.

As you can see in the examples above, the Library of Congress name practice goes beyond just the name and adds elements that are meant to inform and clarify. It includes dates (birth, death); titles and other terms associated with a name (Pope, Jr., illustrator); enumeration (II); and fuller form of the name, which fills in portions of the name that use initials ("Boyle, Timothy D. (Timothy Dale)").  Interestingly, the "III" in Pope Pius III is an enumeration, while the "III" in "John R Kennedy, III" is an "other term associated with a name." I'm going to guess that this primarily relates to the positioning of the "III" in the display. This illustrates a tension between identifying parts of the name and providing the desired display of those parts.
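As a toy illustration of how much is packed into these heading strings, here is a sketch that pulls the surname, forename, and life dates out of the simple "Surname, Forename, dates" pattern shown above. Real library data encodes these parts in MARC subfields; this regex handles only the simple examples and is in no way a general heading parser.

```python
import re

HEADING = re.compile(
    r"^(?P<surname>[^,]+),\s*(?P<forename>[^,]+?)"   # 'Dempsey, Martin'
    r"(?:,\s*(?P<dates>\d{4}-(?:\d{4})?))?\.?$"      # optional ', 1904-' or ', 1405-1464'
)

def parse_heading(heading):
    """Split a simple surname-entry heading into its parts, or None."""
    m = HEADING.match(heading)
    return m.groupdict() if m else None
```

Note that a forename-entry heading like "Arnaldur Indriðason, 1961-" would defeat this parser, which is the point: the string alone does not say which kind of heading it is.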

There is another problem with "title and other terms": it is a catchall element that doesn't distinguish between some very different types of data. The documentation lists:

  • titles designating rank, office, or nobility, e.g., Sir
  • terms of address, e.g., Mrs.
  • initials of an academic degree or denoting membership in an organization, e.g., F.L.A. 
  • a roman numeral used with a surname
  • other words or phrases associated with the name, e.g., clockmaker, Saint.

As you can see, some of these would display before the name in a "natural order" display:

  • Sir Paul McCartney
  • Mrs. Harriet Ward

While others display afterward:

  • John Kennedy, Jr.
  • John Kennedy, III

 And some can be either or both:

  • Dr. Paul Johnson, DDS
  • Dr. Sophie Jones, Ph.D., F.I.P.A.

There is always the need to disambiguate between people with the same name. Some of these "other terms"  work well in identifying a person:

Boyle, Tom (Professor)

Boyle, Tom (Spiritualist)

However, the device used most often in library name data to distinguish identical names is the dates of birth and death. These used to be included only when necessary to distinguish between identical names, but the information is now included whenever it is available to the cataloger. This makes the dates an integral part of the name, much like the roman numerals in the names of Popes.

Pius I, Pope, d. ca. 154 

Pius II, Pope, 1405-1464

Although perhaps once useful for the purpose of distinguishing otherwise identical names, the sheer number of people who are included in library catalogs has greatly limited the utility of these dates for disambiguation.

Kennedy, John, 1919-1945
Kennedy, John, 1921-
Kennedy, John, 1926-1994
Kennedy, John, 1928-
Kennedy, John, 1931-
Kennedy, John, 1931-2004
Kennedy, John, 1934-2012
Kennedy, John, 1939-
Kennedy, John, 1940-
Kennedy, John, 1947-
Kennedy, John, 1948-
Kennedy, John, 1951-
Kennedy, John, 1953-
Kennedy, John, 1956-
Kennedy, John, 1959-
Kennedy, John, 1963-
Kennedy, John, 1965-
Kennedy, John, 1973-
Kennedy, John, -1988.

There is provision for alternate versions of names in library practice although these reside in a separate file and are not always linked to the primary name in library databases.

Boyle, Thomas John
    see: Boyle, T. C. 

The library name practices, although probably the most detailed of any metadata name schemes, are not very generalizable; they serve one designated application, which is the alphabetical order of the entries in the library catalog. 

Dublin Core

Dublin Core is absolutely minimal when it comes to names, as "core" implies. It provides only one property, dct:creator, without further detail. It also does not distinguish between persons and organizations: both can be coded as "creator" with an implicit class of Agent. Any further intelligence must be provided elsewhere in a metadata scheme that makes use of Dublin Core.

Dublin Core does allow the value of the dct:creator property to be a literal, an IRI, or a blank node, and an IRI value could resolve to a more precise name form. Using an IRI could also be a method for providing a unique identity for the creator.
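The two options look like this, written here as bare (subject, predicate, object) triples in Python rather than with a full RDF library. The Dublin Core namespace is real; the book identifier and the person IRI are placeholders of my own.

```python
DCT = "http://purl.org/dc/terms/"   # the real Dublin Core terms namespace

# Option 1: the creator as a literal -- display-ready, but ambiguous.
literal_triple = ("urn:example:book1", DCT + "creator", "Mary Jones")

# Option 2: the creator as an IRI -- unambiguous and dereferenceable for
# more detail, but requiring a lookup before anything can be displayed.
iri_triple = ("urn:example:book1", DCT + "creator",
              "https://example.org/person/mary-jones")
```

Which option a metadata community chooses decides whether disambiguation is possible at all, which is why the "further intelligence" has to come from somewhere beyond Dublin Core itself.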

FOAF

The "Friend of a Friend" vocabulary is about people, their names, and some modern social connectivity: email address, web site, etc. FOAF has three name properties:

  • foaf:name - can be used for an entire name, undifferentiated in terms of name types
  • foaf:familyName & foaf:givenName - intended to be used together (though with no mechanism to enforce that), these allow an obvious separation between the names. How they would display is left to the applications that make use of them.

The foaf:familyName and foaf:givenName properties cover a limited set of name forms. In the context of many online sites this may suffice, especially where there is no enforcement of "real" names. Given that FOAF was developed for use within and between online social sites, it avoids the need for historical forms of names.

All of these are defined as taking literal values, which we know does not provide an unambiguous identity for a person. There are properties defined in FOAF under the "Social Web" rubric, such as an email address, that should serve to disambiguate persons in a particular social context. These are not, however, part of the name itself.

schema.org

The vocabulary schema.org was developed to provide "structured data on the Internet". (This is exactly the original impetus behind Dublin Core. How that went south, and what schema.org attempts to do instead, is beyond this post.) The vocabulary listed under the person schema is extensive, although only a few elements are directly related to names:

sdo:familyName, sdo:givenName, sdo:additionalName

sdo:givenName is defined as the "first name" and sdo:familyName as the "last name". sdo:additionalName is "An additional name for a Person, can be used for a middle name". This latter is highly flexible but at the same time non-specific. It also creates some confusion about the order of names for anyone whose name does not fit the exact "first-middle-last" pattern. As shown above, it is not uncommon for more than one name to fit into any of those particular buckets. Presumably the properties are repeatable, but they are defined with the singular term "name". Nor do they clarify a display order.

sdo:givenName "T."
sdo:familyName "Boyle"
sdo:additionalName "C."

Schema.org does have properties for both pre-name and post-name honorifics. The examples given for these are: sdo:honorificPrefix (Dr., Mrs.); sdo:honorificSuffix (M.D., PhD).  These examples don't make it clear if it might be possible to encode:

sdo:givenName "Charles"
sdo:honorificSuffix "Prince of Wales"

or

sdo:givenName "Pius"
sdo:honorificPrefix "Pope"
sdo:honorificSuffix "II"

In any case it appears that this would not distinguish between informal honorifics like "Esq." and those that are essential parts of the name, such as titles of nobility. Nor does there seem to be an obvious way to encode non-honorific suffixes such as "Jr." or "III".

Without some strong guidance, it would be hard to know which of these properties would be used for the parts of a name like María de la Purificación Zapatero Valero. We'll see a possible solution to this with Wikidata, below.
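To see the guidance gap, consider two equally plausible schema.org encodings of that name, written here as JSON-LD-style Python dicts. Nothing in the vocabulary says which is right; both the lumping in the first and the repeated property in the second are my own guesses at what a metadata creator might do.

```python
# Encoding A: treat everything after the given name as surname material.
option_a = {
    "@type": "Person",
    "givenName": "María",
    "additionalName": "de la Purificación",
    "familyName": "Zapatero Valero",        # both surnames lumped together
}

# Encoding B: treat the compound given name as one unit and repeat the
# familyName property, one value per surname.
option_b = {
    "@type": "Person",
    "givenName": "María de la Purificación",
    "familyName": ["Zapatero", "Valero"],
}
```

Two well-meaning catalogers could produce these two records for the same person, and an application has no way to know they encode the same name.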

Wikipedia

Wikipedia probably has millions of articles about people and therefore has to deal with the question of names. Its search does not distinguish between names and other article topics, and all are searched in left-to-right natural order in a drop-down box. Names are article titles just as any topic can be an article title.

There is no special coding of the name or parts of the name - it is simply a string of characters. Where more than one person has the same name, article creators must add something to disambiguate it, usually an area of activity and perhaps a location associated with that activity:


Wikipedia also has a special type of page where topics that have common terms, including names, can be further defined.

 
These pages allow an explanation to distinguish between people who share a name. It goes beyond the parenthetical phrases that are used to create unique article names for persons with the same name, and is much more human-friendly than the birth and death dates that library cataloging relies on. Yet while Wikipedia excels in disambiguation, its encoding for names is limited to a single property, "name", in the infobox for a person, although it also allows for honorifics and for alternative forms of the name. 
 

Because the various Wikipedias are divided by language, there are properties for translations and transliterations of names, and it allows for name changes over the course of a person's life. 

Wikidata

Wikidata began by extracting data points from Wikipedia entries, primarily from the infoboxes, but has grown beyond that to a database of facts that is edited directly. Perhaps because it is massively crowd-sourced, a long list of name properties has been developed. In addition to the usual given name and family name there are terms like demonym (a name representing a place), second family name in Spanish name, Roman cognomen (ancient surname), patronym or matronym (names representing the person's father or mother), first family name in Portuguese name, and many others.

Also because it is crowd-sourced, there should be no expectation that this list is complete or balanced. It most likely reflects the particular interests of the participants who built it.

Conclusion (?)

Any solution in this area needs to recognize that one size does not fit all. For some applications a single "name=[string]" will be sufficient, and it would be seriously counter-productive to force those users into detailed encoding. Another barrier to detailed encoding is that few people have the knowledge to encode the universe of name forms at a detailed level. Requiring metadata creators to make distinctions outside of their understanding would only result in error-ridden metadata: better a blind single string than mis-coded details. Yet there will be applications and metadata communities that can or must make use of subtleties of name details that are not of interest to others.

Because of both the great variety of name forms and the variability of applications that make use of names, I recommend a metadata vocabulary that follows the principle of minimum semantic commitment. This means a vocabulary that includes broad classes and properties that can be used as is where detailed coding is not needed or desired, but which can be extended to accommodate many different contexts.

The trick, then, is to define broad classes that help establish semantics while imposing few restrictions: a class like "Agent", say, with subclasses for "Person", "Groups of Persons", and perhaps "Non-persons". Properties could begin with "name", which could be subdivided into any definable part of a name that people find useful. Further specificity can be provided by application profiles that define requirements such as cardinality or value types for the various properties. Applications themselves could contain rules for the displays needed by their use cases.
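To make the idea concrete, here is a minimal sketch (my own illustration, not any existing standard) of what "minimum semantic commitment" could look like in practice: broad classes, a generic "name" property with optional subproperties, and cardinality constraints pushed out to an application profile. All class, property, and function names here are hypothetical.

```python
# A toy "minimum semantic commitment" vocabulary: broad classes and a
# generic "name" property, with constraints left to application profiles.
# Everything here is illustrative, not normative.
from dataclasses import dataclass, field

# Class hierarchy: subclass -> superclass.
CLASS_TREE = {"Person": "Agent", "GroupOfPersons": "Agent", "NonPerson": "Agent"}

# Property hierarchy: specific name parts all roll up to the generic "name".
PROPERTY_TREE = {"givenName": "name", "familyName": "name", "patronym": "name"}

@dataclass
class Description:
    cls: str
    properties: dict = field(default_factory=dict)  # property -> list of values

@dataclass
class ApplicationProfile:
    """A profile adds constraints (here, just minimum cardinality)
    without changing the vocabulary itself."""
    min_card: dict  # property -> minimum number of values required
    def validate(self, desc: Description) -> list:
        errors = []
        for prop, minimum in self.min_card.items():
            if len(desc.properties.get(prop, [])) < minimum:
                errors.append(f"{prop}: expected at least {minimum} value(s)")
        return errors

def values_for(desc: Description, prop: str) -> list:
    """Collect values of a property plus its subproperties, so an
    application that only understands "name" still sees "givenName" data."""
    vals = list(desc.properties.get(prop, []))
    for sub, parent in PROPERTY_TREE.items():
        if parent == prop:
            vals.extend(desc.properties.get(sub, []))
    return vals

# A minimal record uses just "name"; a detailed one adds subproperties.
simple = Description("Person", {"name": ["Lorcan Dempsey"]})
detailed = Description("Person", {"givenName": ["Lorcan"], "familyName": ["Dempsey"]})

strict = ApplicationProfile(min_card={"familyName": 1})
print(values_for(detailed, "name"))   # subproperty values roll up to "name"
print(strict.validate(simple))        # the simple record fails the strict profile
print(strict.validate(detailed))     # the detailed record passes
```

The design point is that the simple record is perfectly valid against the vocabulary itself; only a community that opts into the stricter profile is required to supply the finer-grained parts.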

The challenge now is to find a standards group that is interested in taking this on.

-------

* With perhaps a few exceptions. I once heard Lorcan Dempsey opine that person's names would be much more useful if parents would just give their children unique names, "... like Lorcan Dempsey."



Thursday, August 12, 2021

Phil Agre and the gendered Internet

There is an article today in the Washington Post about the odd disappearance of a computer science professor named Phil Agre. The article, entitled "He predicted the dark side of the Internet 30 years ago. Why did no one listen?", reminded me of a post by Agre in 1994 after a meeting of Computer Professionals for Social Responsibility. Although it annoyed me at the time, a talk that I gave there triggered in him thoughts of gender issues; as a woman I was very much in the minority at the meeting, but that was not the topic of my talk. My talk also gave Agre thoughts about the missing humanity on the Web.

I had a couple of primary concerns, perhaps not perfectly laid out, in my talk, "Access, not Just Wires." I was concerned about what was driving the development of the Internet and the lack of a service ethos regarding society. Access at the time was talked about in terms of routers, modems, and T-1 lines. There was no thought of organizing or preserving online information. There was no concept of "equal access". There was no thought of how we would democratize the Web so that you didn't need a degree in computer science to find what you needed.

I was also very concerned about the commercialization of information. I was frustrated watching the hype as information was touted as the product of the information age. (This was before we learned that "you are the product, not the user" in this environment.) Seen from the tattered clothes and barefoot world of libraries, the money thrown at the jumble of un-curated and unorganized "information" on the web was heartbreaking. I said:

"It's clear to me that the information highway isn't much about information. It's about trying to find a new basis for our economy. I'm pretty sure I'm not going to like the way information is treated in that economy. We know what kind of information sells, and what doesn't. So I see our future as being a mix of highly expensive economic reports and cheap online versions of the National Inquirer. Not a pretty picture." - kcoyle in Access, not Just Wires

 Little did I know how bad it would get.

Like many or most people, Agre heard "libraries" and thought "female." But at least this caused him to think, earlier than many, about how our metaphors for the Internet were inherently gendered.

"Discussing her speech with another CPSR activist ... later that evening, I suddenly connected several things that had been bothering me about the language and practice of the Internet. The result was a partial answer to the difficult question, in what sense is the net "gendered"?" -  Agre, TNO, October 1994

This led Agre to think about how we spoke then about the Internet, which was mainly as an activity of "exploring." That metaphor is still alive with Microsoft's Internet Explorer, but was also the message behind the main Web browser software of the time, Netscape Navigator. He suddenly saw how "explore" was a highly gendered activity:

"Yet for many people, "exploring" is close to defining the experience of the net. It is clearly a gendered metaphor: it has historically been a male activity, and it comes down to us saturated with a long list of meanings related to things like colonial expansion, experiences of otherness, and scientific discovery. Explorers often die, and often fail, and the ones that do neither are heroes and role models. This whole complex of meanings and feelings and strivings is going to appeal to those who have been acculturated into a particular male-marked system of meanings, and it is not going to offer a great deal of meaning to anyone who has not. The use of prestigious artifacts like computers is inevitably tied up with the construction of personal identity, and "exploration" tools offer a great deal more traction in this process to historically male cultural norms than to female ones." - Agre, TNO, October 1994
He decried the lack of social relationships on the Internet, saying that although you know that other people are there, you cannot see them.

"Why does the space you "explore" in Gopher or Mosaic look empty even when it's full of other people?" - Agre, TNO, October 1994

None of us knew at the time that in the future some people would experience the Internet entirely and exclusively as full of other people, in the form of Facebook, Twitter, and all of the other sites that grew out of the embryos of bulletin board systems, the Well, and AOL. We feared that the future Internet would not have the even-handedness of libraries, but never anticipated that Russian bots and Qanon promoters would reign over what had once been a network for the exchange of scientific information.

It hurts now to read through Agre's post arguing for a more library-like online information system, because it is pretty clear that we blew through that possibility even before the 1994 meeting and were already taking the first steps toward where we are today.

Agre walked away from his position at UCLA in 2009 and has not resurfaced, although there have been reports at times (albeit not recently) that he is okay. Looking back, it should not surprise us that someone with so much hope for an online civil society should have become discouraged enough to leave it behind. Agre was hoping for reference services and an Internet populated with users with:

"...the skills of composing clear texts, reading with an awareness of different possible interpretations, recognizing and resolving conflicts, asking for help without feeling powerless, organizing people to get things done, and embracing the diversity of the backgrounds and experiences of others." - Agre, TNO, October 1994

Oh, what a world that would be!

Monday, March 01, 2021

Digitization Wars, Redux

 (NB: IANAL) 

 Because this is long, you can download it as a PDF here.

From 2004 to 2016 the book world (authors, publishers, libraries, and booksellers) was involved in the complex and legally fraught activities around Google’s book digitization project. Once known as “Google Book Search,” the company claimed that it was digitizing books to be able to provide search services across the print corpus, much as it provides search capabilities over texts and other media that are hosted throughout the Internet. 

Both the US Authors Guild and the Association of American Publishers sued Google (both separately and together) for violation of copyright. These suits took a number of turns including proposals for settlements that were arcane in their complexity and that ultimately failed. Finally, in 2016 the legal question was decided: digitizing to create an index is fair use as long as only minor portions of the original text are shown to users in the form of context-specific snippets. 

We now have another question about book digitization: can books be digitized for the purpose of substituting remote lending in the place of the lending of a physical copy? This has been referred to as “Controlled Digital Lending (CDL),” a term developed by the Internet Archive for its online book lending services. The Archive has considerable experience with both digitization and providing online access to materials in various formats, and its Open Library site has been providing digital downloads of out-of-copyright books for more than a decade. Controlled digital lending applies solely to works that are presumed to be in copyright.

Controlled digital lending works like this: the Archive obtains and retains a physical copy of a book. The book is digitized and added to the Open Library catalog of works. Users can borrow the book for a limited time (2 weeks) after which the book “returns” to the Open Library. While the book is checked out to a user no other user can borrow that “copy.” The digital copy is linked one-to-one with a physical copy, so if more than one copy of the physical book is owned then there is one digital loan available for each physical copy. 
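The one-to-one mechanics described above can be sketched in a few lines of code. This is my own rough model, under the stated assumptions (a fixed two-week loan period, one concurrent loan slot per owned physical copy); the class and method names are hypothetical, not the Archive's actual software.

```python
# A rough model of the CDL one-to-one rule: each owned physical copy backs
# exactly one concurrent digital loan, for a fixed loan period.
from datetime import datetime, timedelta

LOAN_PERIOD = timedelta(weeks=2)

class CDLTitle:
    def __init__(self, physical_copies: int):
        self.physical_copies = physical_copies
        self.loans = {}  # user -> due date

    def _expire(self, now: datetime):
        # Loans past their due date "return" automatically.
        self.loans = {u: due for u, due in self.loans.items() if due > now}

    def borrow(self, user: str, now: datetime) -> bool:
        """Grant a loan only if a copy slot is free; True on success."""
        self._expire(now)
        if user in self.loans or len(self.loans) >= self.physical_copies:
            return False
        self.loans[user] = now + LOAN_PERIOD
        return True

    def return_copy(self, user: str):
        self.loans.pop(user, None)

now = datetime(2020, 1, 1)
title = CDLTitle(physical_copies=1)
print(title.borrow("alice", now))                # True: the one copy is free
print(title.borrow("bob", now))                  # False: copy is checked out
title.return_copy("alice")
print(title.borrow("bob", now))                  # True after the "return"
print(title.borrow("carol", now + LOAN_PERIOD))  # True: bob's loan has expired
```

What the National Emergency Library did, in terms of this model, was effectively remove the `len(self.loans) >= self.physical_copies` check, which is precisely the change the publishers seized on.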

The Archive is not alone in experimenting with lending of digitized copies: some libraries have partnered with the Archive’s digitization and lending service to provide digital lending for library-owned materials. In the case of the Archive the physical books are not available for lending. Physical libraries that are experimenting with CDL face the added step of making sure that the physical book is removed from circulation while the digitized book is on loan, and reversing that on return of the digital book. 

Although CDL has an air of legality due to limiting lending to one user at a time, authors' and publishers' associations had raised objections to the practice. [nwu] However, in March of 2020 the Archive took a daring step that pushed its version of CDL into litigation: using the closing of many physical libraries due to the COVID pandemic as its rationale, the Archive renamed its lending service the National Emergency Library [nel] and eliminated the one-to-one link between physical and digital copies. Ironically, this meant that the Archive was then actually doing what the book industry had accused it of (whether out of misunderstanding or as an exaggeration of the threat posed): making and lending digital copies beyond its physical holdings. The Archive stated that the National Emergency Library would last only until June of 2020, presumably because by then the COVID danger would have passed and libraries would have re-opened. In June the Archive's book lending service returned to the one-to-one model. Also in June, a suit was filed by four publishers (Hachette, HarperCollins, Penguin Random House, and Wiley) in the US District Court for the Southern District of New York. [suit]

Controlled digital lending, like the Google Books project, raises many interesting questions about the nature of "digital vs. physical," not only in a legal sense but in the sense of what it means to read and to be a reader today. The lawsuit not only fails to further our understanding of this fascinating question; it sinks immediately into hyperbole, fear-mongering, and either mis-information or mis-direction. That is, admittedly, the nature of a lawsuit. What follows here is not that analysis, but a few of the questions that are foremost in my mind.

 Apples and Oranges 

Each of the players in this drama has admirable reasons for their actions. The publishers explain in their suit that they are acting in support of authors, in particular to protect the income of authors so that they may continue to write. The Authors Guild provides some data on author income, and by their estimate the average full-time author earns less than $20,000 per year, putting them at poverty level. [aghard] (If that average includes the earnings of highly paid best-selling authors, then the actual earnings of many authors are quite a bit less than that.)

The Internet Archive is motivated to provide democratic access to the content of books to anyone who needs or wants it. Even before the pandemic caused many libraries to close, the collection housed at the Archive contained some works that are available only in a few research libraries. This is because many of the books were digitized during the Google Books project, which digitized books from a small number of very large research libraries whose collections differ significantly from those of the public libraries available to most citizens.

Where the pronouncements of both parties fail is in making a false equivalence between some authors and all authors, and between some books and all books; the result is a lawsuit pitting apples against oranges. We saw in the lawsuits against Google that some academic authors, who may gain status from their publications but very little if any income, did not see themselves as among those harmed by the book digitization project. Notably, the authors in this current suit, as listed in the bibliography of pirated books in the appendix to the lawsuit, are ones whose works would best be characterized as “popular” and “commercial,” not academic: James Patterson, J. D. Salinger, Malcolm Gladwell, Toni Morrison, Laura Ingalls Wilder, and others. Not only do the living authors here earn above the poverty level, all of them provide significant revenue for the publishers themselves. And all of the books listed are in print and available in the marketplace. No mention is made of out-of-print books, and no academic publishers seem to be involved.

On the part of the Archive, they state that their digitized books fill an educational purpose, and that their collection includes books that are not available in digital format from publishers:

“While Overdrive, Hoopla, and other streaming services provide patrons access to latest best sellers and popular titles, the long tail of reading and research materials available deep within a library’s print collection are often not available through these large commercial services. What this means is that when libraries face closures in times of crisis, patrons are left with access to only a fraction of the materials that the library holds in its collection.” [cdl-blog]

This is undoubtedly true for some of the digitized books, but the main thesis of the lawsuit points out that the Archive has digitized and is also lending current popular titles. The list of books included in the appendix of the lawsuit shows that there are in-copyright and most likely in-print books of a popular reading nature that have been part of the CDL. These titles are available in print and may also be available as ebooks from the publishers. Thus while the publishers are arguing that current, popular books should not be digitized and loaned (apples), the Archive is arguing that they are providing access to items not available elsewhere, and for educational purposes (oranges). 

The Law 

The suit states that publishers are not questioning copyright law, only violations of the law.

“For the avoidance of doubt, this lawsuit is not about the occasional transmission of a title under appropriately limited circumstances, nor about anything permissioned or in the public domain. On the contrary, it is about IA’s purposeful collection of truckloads of in-copyright books to scan, reproduce, and then distribute digital bootleg versions online.” ([Suit] Page 3).

This brings up a whole range of legal issues in regard to distributing digital copies of copyrighted works. There have been lengthy arguments about whether copyright law could permit first sale rights for digital items, and the answer has generally been no; some copyright holders have made the argument that since transfer of a digital file is necessarily the making of a copy there can be no first sale rights for those files. [1stSale] [ag1] Some ebook systems, such as the Kindle, have allowed time-limited person-to-person lending for some ebooks. This is governed by license terms between Amazon and the publishers, not by the first sale rights of the analog world. 

Section 108 of the copyright law does allow libraries and archives to make a limited number of copies. The first point of section 108 states that libraries can make a single copy of a work as long as 1) it is not for commercial advantage, 2) the collection is open to the public, and 3) the reproduction includes the copyright notice from the original. This sounds like what the Archive is doing. However, the next two sections (b and c) provide limitations on that first section that appear to put the Archive in legal jeopardy: section “b” clarifies that copies may be made for preservation or security; section “c” states that copies can be made if the original item is deteriorating and a replacement can no longer be purchased. Neither of these applies to the Archive’s lending.

In addition to its lending program, the Archive provides downloads of scanned books in DAISY format for those who are certified as visually impaired by the National Library Service for the Blind and Physically Handicapped in the US. This is covered in section 121A of the copyright law, Title 17, which allows the distribution of copyrighted works in accessible formats. This service could possibly be cited as a justification for the scanning of in-copyright works at the Archive, although without mitigating the complaints about lending those copies to others. It is a laudable service if the scans are usable by the visually impaired, but the DAISY-compatible files are based on the OCR’d text, which can be quite dirty. Without data on downloads under this program it is hard to know the extent to which it benefits visually impaired readers.

 Lending 

Most likely as part of the strategy of the lawsuit, very little mention is made of “lending.” Instead the suit uses terms like “download” and “distribution,” which imply that the user of the Archive’s service is given a permanent copy of the book:

“With just a few clicks, any Internet-connected user can download complete digital copies of in-copyright books from Defendant.” ([suit] Page 2). “... distributing the resulting illegal bootleg copies for free over the Internet to individuals worldwide.” ([suit] Page 14).

Publishers were reluctant to allow the creation of ebooks for many years, until they saw that DRM would protect the digital copies. It then took another couple of years before they could feel confident about lending - and by lending I mean lending by libraries. It appears that Overdrive, the main library lending platform for ebooks, worked closely with publishers to gain their trust. The lawsuit questions whether the lending technology created by the Archive can be trusted:

“...Plaintiffs have legitimate fears regarding the security of their works both as stored by IA on its servers” ([suit] Page 47).

In essence, the suit accuses IA of a lack of transparency about its lending operation. Of course, any collaboration between IA and publishers around the technology is not possible because the two are entirely at odds and the publishers would reasonably not cooperate with folks they see as engaged in piracy of their property. 

Even if the Archive’s lending technology were proven to be secure, lending alone is not the issue: the Archive copied the publishers’ books without permission prior to lending. In other words, they were lending content that they neither owned (in digital form) nor had licensed for digital distribution. Libraries pay, and pay dearly, for the ebook lending service that they provide to their users. The restrictions on ebooks may seem to be a money-grab on the part of publishers, but from their point of view it is a revenue stream that CDL threatens. 

Is it About the Money?

“... IA rakes in money from its infringing services…” ([suit] Page 40). (Note: publishers earn, IA “rakes in”)

“Moreover, while Defendant promotes its non-profit status, it is in fact a highly commercial enterprise with millions of dollars of annual revenues, including financial schemes that provide funding for IA’s infringing activities.” ([suit] Page 4).

These arguments directly address section (a)(1) of Title 17, section 108: “(1) the reproduction or distribution is made without any purpose of direct or indirect commercial advantage”. 

At various points in the suit there are references to the Archive’s income, both for its scanning services and donations, as well as an unveiled show of envy at the over $100 million that Brewster Kahle and his wife have in their jointly owned foundation. This is an attempt to show that the Archive derives “direct or indirect commercial advantage” from CDL. Non-profit organizations do indeed have income, otherwise they could not function, and “non-profit” does not mean a lack of a revenue stream, it means returning revenue to the organization instead of taking it as profit. The argument relating to income is weakened by the fact that the Archive is not charging for the books it lends. However, much depends on how the courts will interpret “indirect commercial advantage.” The suit argues that the Archive benefits generally from the scanned books because this enhances the Archive’s reputation which possibly results in more donations. There is a section in the suit relating to the “sponsor a book” program where someone can donate a specific amount to the Archive to digitize a book. How many of us have not gotten a solicitation from a non-profit that makes statements like: “$10 will feed a child for a day; $100 will buy seed for a farmer, etc.”? The attempt to correlate free use of materials with income may be hard to prove. 

Reading 

Decades ago, when the service Questia was just being launched (Questia ceased operation December 21, 2020), Questia sales people assured a group of us that their books were for “research, not reading.” Google used a similar argument to support its scanning operation, something like “search, not reading.” The court decision in Google’s case decided that Google’s scanning was fair use (and transformative) because the books were not available for reading, as Google was not presenting the full text of the book to its audience.[suit-g] 

The Archive has taken the opposite approach, a “books are for reading” view. Beginning with public domain books, many from the Google Books project, and then with in-copyright books, the Archive has promoted reading. It developed its own in-browser reading software to facilitate reading of the books online. [reader] (*See note below)

Although the publishers sued Google for its scanning, they lost due to the “search, not reading” aspect of that project. The Archive has been very clear about its support of reading, which takes the Google justification off the table. 

“Moreover, IA’s massive book digitization business has no new purpose that is fundamentally different than that of the Publishers: both distribute entire books for reading.” ([suit] Page 5). 

However, the Archive's statistics on loaned books show that a large proportion of the books are used for 30 minutes or less.

“Patrons may be using the checked-out book for fact checking or research, but we suspect a large number of people are browsing the book in a way similar to browsing library shelves.” [ia1] 

 In its article on the CDL, the Center for Democracy and Technology notes that “the majority of books borrowed through NEL were used for less than 30 minutes, suggesting that CDL’s primary use is for fact-checking and research, a purpose that courts deem favorable in a finding of fair use.” [cdt] The complication is that the same service seems to be used both for reading of entire books and as a place to browse or to check individual facts (the facts themselves cannot be copyrighted). These may involve different sets of books, once again making it difficult to characterize the entire set of digitized books under a single legal claim. 

The publishers claim that the Archive is competing with them using pirated versions of their own products. That leads us to the question of whether the Archive’s books, presented for reading, are effectively substitutes for those of the publishers. Although the Archive offers actual copies, those copies are significantly inferior to the originals. However, the question of quality did not change the judgment in the lawsuit against copying of texts by Kinko’s [kinkos], which produced mediocre photocopies from printed and bound publications. It seems unlikely that the quality differential will absolve the Archive of copyright infringement, even though the poor quality of some of the books interferes with their readability.

Digital is Different

Publishers have found a way to monetize digital versions, in spite of some risks, by taking advantage of the ability to control digital files with technology and by licensing, not selling, those files to individuals and to libraries. It’s a “new product” that gets around First Sale because, as it is argued, every transfer of a digital file makes a copy, and it is the making of copies that is covered by copyright law. [1stSale] 

The upshot of this is that because a digital resource is licensed, not sold, the right to pass along, lend, or re-sell a copy (as per Title 17 section 109) does not apply, even though technology solutions that would delete the sender’s copy as the file safely reaches the recipient are not only plausible but have been developed. [resale]

“Like other copyright sectors that license education technology or entertainment software, publishers either license ebooks to consumers or sell them pursuant to special agreements or terms.” ([suit] Page 15)

“When an ebook customer obtains access to the title in a digital format, there are set terms that determine what the user can or cannot do with the underlying file.”([suit] Page 16)

This control goes beyond the copyright holder’s rights in law: DRM can exercise controls over the actual use of a file, limiting it to specific formats or devices, allowing or not allowing text-to-speech capabilities, even limiting copying to the clipboard.

Publishers and Libraries 

The suit claims that publishers and libraries have reached an agreement, an equilibrium.

“To Plaintiffs, libraries are not just customers but allies in a shared mission to make books available to those who have a desire to read, including, especially, those who lack the financial means to purchase their own copies.” ([suit] Page 17).

In the suit, publishers contrast the Archive’s operation with the relationship that publishers have with libraries. In contrast with the Archive’s lending program, libraries are the “good guys.”

“... the Publishers have established independent and distinct distribution models for ebooks, including a market for lending ebooks through libraries, which are governed by different terms and expectations than print books.” ([suit] Page 6).

These “different terms” include charging much higher prices to libraries for ebooks and limiting the number of times an ebook can be loaned. [pricing1] [pricing2]

“Legitimate libraries, on the other hand, license ebooks from publishers for limited periods of time or a limited number of loans; or at much higher prices than the ebooks available for individual purchase.” [agol]

The equilibrium of which publishers speak looks less equal from the library side of the equation: library literature is replete with stories about the avarice of publishers in relation to library lending of ebooks. Some authors and publishers even speak out against library lending of ebooks, claiming that it cuts into sales. (The same argument has been made for physical books.)

“If, as Macmillan has determined, 45% of ebook reads are occurring through libraries and that percentage is only growing, it means that we are training readers to read ebooks for free through libraries instead of buying them. With author earnings down to new lows, we cannot tolerate ever-decreasing book sales that result in even lower author earnings.” [aglibend] [ag42]

The ease of access to digital books has become a boon for book sales, and ebook sales are now rising while hard copy sales fall. This economic factor is a motivator for any of those engaged with the book market. The Archive’s CDL is a direct affront to the revenue stream that publishers have carved out for specific digital products. There are indications that the ease of borrowing of ebooks - not even needing to go to the physical library to borrow a book - is seen as a threat by publishers. This has already played out in other media, from music to movies. 

It would be hard to argue that access to the Archive’s digitized books is merely a substitute for library access. Many people do not have actual physical library access to the books that the Archive lends, especially those digitized from the collections of academic libraries. This is particularly true when you consider that the Archive’s materials are available to anyone in the world with access to the Internet. If you don’t have an economic interest in book sales, and especially if you are an educator or researcher, this expanded access could feel long overdue. 

We need numbers 

We really do not know much about the uses of the Archive’s book collection. The lawsuit cites some statistics of “views” to show that infringement has taken place, but the page in question does not explain what is meant by a “view”. Archive pages for downloadable files of metadata records also report “views”, which most likely reflect views of the web page itself, since there is nothing viewable other than the page. Open Library book pages give “currently reading” and “have read” stats, but these are tags that users can manually add to the page for a work. To compound things, the 127 books cited in the suit have been removed from the lending service (and are identified in the Archive as being in the collection “litigation works”).

Although numbers may not affect the legality of controlled digital lending, the social impact of the Archive’s contribution to reading and research would be clearer if we had this information. The Archive has provided a small number of testimonials, but proof of use in educational settings would bolster the claims of social benefit, which in turn could strengthen a fair use defense.

Notes

(*) The NWU has a slide show [nwu2] that explains what it calls Controlled Digital Lending at the Archive. Unfortunately this document conflates the Archive's book Reader with CDL and therefore muddies the water. It muddies it because it does not distinguish between sending files to dedicated devices (which is what the Kindle is) or to dedicated reading software (such as Libby, which libraries use), and the Archive's use of a web-based reader. It is not beyond reason to suppose that the Archive's Reader software does not fully secure loaned items. The NWU claims that files representing all book pages viewed are left in the browser cache: "There’s no attempt whatsoever to restrict how long any user retains these images". (I cannot reproduce this. In my minor experiments those files disappear at the end of the lending period, but this requires more concerted study.) However, this is not a fault of CDL but a fault of the Reader software. The Reader is software that works within a browser window, and in general, electronic files that require secure and limited use are not used within browsers, which are general-purpose programs.

Conflating the Archive's Reader software with Controlled Digital Lending will only hinder understanding. CDL already has multiple components:

  1. Digitization of in-copyright materials
  2. Lending of digital copies of in-copyright materials that are owned by the library in a 1-to-1 relation to physical copies

We can add #3, the leakage of page copies via the browser cache, but I maintain that poorly functioning software does not automatically moot points 1 and 2. I would prefer that we take each point on its own in order to get a clear idea of the issues.

The NWU slides also refer to the Archive's API, which allows linking to individual pages within books. This is an interesting legal area because it may be determined to be fair use regardless of the legality of the underlying copy. This becomes yet another issue to be discussed by the legal teams, but it is separate from the question of controlled digital lending. Let's stay focused.

The International Federation of Library Associations has issued its own statement on Controlled Digital Lending at https://www.ifla.org/publications/node/93954

Citations

[1stSale] https://abovethelaw.com/2017/11/a-digital-take-on-the-first-sale-doctrine/ 

[ag1] https://www.authorsguild.org/industry-advocacy/reselling-a-digital-file-infringes-copyright/

[ag42] https://www.authorsguild.org/industry-advocacy/authors-guild-survey-shows-drastic-42-percent-decline-in-authors-earnings-in-last-decade/ 

[aghard] https://www.authorsguild.org/the-writing-life/why-is-it-so-goddamned-hard-to-make-a-living-as-a-writer-today/

[aglibend] https://www.authorsguild.org/industry-advocacy/macmillan-announces-new-library-lending-terms-for-ebooks/

[agol] https://www.authorsguild.org/industry-advocacy/update-open-library/ 

[cdl-blog] https://blog.archive.org/2020/03/09/controlled-digital-lending-and-open-libraries-helping-libraries-and-readers-in-times-of-crisis

[cdt] https://cdt.org/insights/up-next-controlled-digital-lendings-first-legal-battle-as-publishers-take-on-the-internet-archive/ 

[kinkos] https://law.justia.com/cases/federal/district-courts/FSupp/758/1522/1809457

[nel] http://blog.archive.org/national-emergency-library/

[nwu] "Appeal from the victims of Controlled Digital Lending (CDL)". (Retrieved 2021-01-10) 

[nwu2] "What is the Internet Archive doing with our books?" https://nwu.org/wp-content/uploads/2020/04/NWU-Internet-Archive-webinar-27APR2020.pdf

[pricing1] https://www.authorsguild.org/industry-advocacy/e-book-library-pricing-the-game-changes-again/ 

[pricing2] https://americanlibrariesmagazine.org/blogs/e-content/ebook-pricing-wars-publishers-perspective/ 

[reader] Bookreader 

[resale] https://www.hollywoodreporter.com/thr-esq/appeals-court-weighs-resale-digital-files-1168577 

[suit] https://www.courtlistener.com/recap/gov.uscourts.nysd.537900/gov.uscourts.nysd.537900.1.0.pdf 

[suit-g] https://cases.justia.com/federal/appellate-courts/ca2/13-4829/13-4829-2015-10-16.pdf?ts=1445005805

Thursday, June 25, 2020

Women designing


Those of us in the library community are generally aware of our premier "designing woman," the so-called "Mother of MARC," Henriette Avram. Avram designed the MAchine-Readable Cataloging (MARC) record in the mid-1960's, a record format that is still in use today. MARC was well ahead of its time, with variable-length data fields and a unique character set sufficient for most European languages, all thanks to Avram's vision and skill. I'd like to introduce you here to some of the designing women of the University of California library automation project, the project that created one of the first online catalogs, MELVYL, at the beginning of the 1980's. Briefly, MELVYL was a union catalog that combined data from the libraries of the nine (at that time) University of California campuses. It was first brought up as a test system in 1980 and went "live" to the campuses in 1982.
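To give a sense of what "variable-length data fields" meant in practice: a MARC record carries a fixed 24-byte leader, followed by a directory of 12-byte entries (a 3-byte tag, a 4-byte field length, and a 5-byte start offset) pointing into the variable-length field data that follows. Here is a minimal sketch in Python; it is a simplified illustration of the record structure, not a full MARC implementation, and the sample field values are invented:

```python
# Simplified sketch of MARC's variable-length record structure:
# 24-byte leader, directory of 12-byte entries (3-byte tag,
# 4-byte field length, 5-byte start offset), then the field data.
FT = "\x1e"  # field terminator
RT = "\x1d"  # record terminator

def build_marc(fields):
    """Build a simplified MARC-style record from (tag, data) pairs."""
    data_parts = [d + FT for _, d in fields]
    directory, offset = "", 0
    for (tag, _), part in zip(fields, data_parts):
        directory += f"{tag}{len(part):04d}{offset:05d}"
        offset += len(part)
    directory += FT
    base = 24 + len(directory)            # base address of data
    body = directory + "".join(data_parts) + RT
    total = 24 + len(body)
    # Only the record length and base address are filled in here;
    # a real leader carries status, type, and encoding bytes too.
    leader = f"{total:05d}       {base:05d}       "
    return leader + body

def parse_marc(record):
    """Parse a simplified MARC-style record into (tag, data) pairs."""
    base = int(record[12:17])             # base address from the leader
    directory = record[24:base - 1]       # drop the directory's terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        fields.append((tag, record[base + start: base + start + length - 1]))
    return fields

rec = build_marc([("245", "aThe MELVYL catalog"), ("260", "aBerkeley")])
print(parse_marc(rec))  # [('245', 'aThe MELVYL catalog'), ('260', 'aBerkeley')]
```

The directory is what makes the format work on sequential media: a program can find any field from the fixed-size leader and directory without scanning the variable-length data itself.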

Work on the catalog began in or around 1980, and various designs were put forward and tested. Key designers were Linda Gallaher-Brown, who had one of the first master's degrees in computer science from UCLA, and Kathy Klemperer, who, like many of us, was a librarian turned systems designer.

We were struggling with how to create a functional relational database of bibliographic data (as defined by the MARC record) with computing resources that today would seem laughable but were "cutting edge" for that time. I remember Linda remarking that during one of her school terms she returned to her studies to learn that the newer generation of computers would have this thing called an "operating system" and she thought "why would you need one?" By the time of this photo she had come to appreciate what an operating system could do for you. The one we used at the time was IBM's OS 360/370.

Kathy Klemperer was the creator of the database design diagrams that were so distinctive we called them "Klemperer-grams." Here's one from 1985:
MELVYL database design Klemperer-gram, 1985
Drawn and lettered by hand, these not only described a workable database design, they were impressively beautiful. Note that this diagram not only predates the proposed 2009 RDA "database scenario" for a relational bibliographic design by 24 years, it provides a more detailed and most likely more accurate such design.
RDA "Scenario 1" data design, 2009
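The kind of relational decomposition such diagrams describe can be sketched with a few tables. This is purely illustrative; the table and column names below are my own inventions, not those of the MELVYL design or the RDA scenario:

```python
# A toy relational schema in the spirit of the bibliographic designs
# discussed above -- illustrative only, using SQLite for brevity.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE bib_record (
    bib_id INTEGER PRIMARY KEY,
    title  TEXT NOT NULL
);
CREATE TABLE campus (
    campus_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
-- A union catalog's core relation: one bibliographic record can be
-- held by many campuses, and each campus holds many records.
CREATE TABLE holding (
    bib_id      INTEGER REFERENCES bib_record(bib_id),
    campus_id   INTEGER REFERENCES campus(campus_id),
    call_number TEXT,
    PRIMARY KEY (bib_id, campus_id)
);
""")
con.execute("INSERT INTO bib_record VALUES (1, 'A shared title')")
con.executemany("INSERT INTO campus VALUES (?, ?)",
                [(1, "Berkeley"), (2, "UCLA")])
con.executemany("INSERT INTO holding VALUES (?, ?, ?)",
                [(1, 1, "Z678.9"), (1, 2, "Z678.9")])

# Which campuses hold record 1?
rows = con.execute("""
    SELECT c.name FROM holding h JOIN campus c USING (campus_id)
    WHERE h.bib_id = 1 ORDER BY c.name
""").fetchall()
print(rows)  # [('Berkeley',), ('UCLA',)]
```

The hard part in 1985 was not expressing this structure but making joins over it perform acceptably on the hardware of the day, which is exactly what the hand-drawn designs had to account for.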
In the early days of the catalog we had a separate file and interface for the cataloged serials, based on a statewide project (including the California State Universities). Although it was possible to catalog serials in the MARC format, the detailed information about which issues each library held was stored in serials control databases separate from the library catalog, and many serials were represented only by crusty cards that had been created decades before library automation. The group below developed and managed the CALLS (California Academic Library List of Serials). Four of those pictured were programmers, two were serials data specialists, and four had library degrees; obviously, these are overlapping sets. The project heads were Barbara Radke (right) and Theresa Montgomery (front, second from right).

At one point while I was still working on the MELVYL project, probably around the very late 1990's or early 2000's, I gathered up some organization charts that had been issued over the years and quickly calculated that over its history the technical staff that created this early marvel had varied from two-thirds to three-quarters female. I did some talks at various conferences in which I called MELVYL a system "created by women." At my retirement in 2003 I said the same thing in front of the entire current staff, and it was not well received by all. In that audience was one well-known member of the profession who later declared that he felt women needed more mentoring in technology because he had always worked primarily with men, even though he had in fact worked in an organization with a predominantly female technical staff. Also present was a colleague who was incredulous when I once stated that women are not a minority but over 50% of the world's population. He just couldn't believe it.

While outright discrimination and harassment of women are issues that need to be addressed, the invisibility of women in the eyes of their colleagues and institutions is horribly damaging. There are many interesting projects, not least Wikipedia's Women in Red, that aim to show that there is no lack of accomplished women in the world; it is the acknowledgment of their accomplishments that falls short. In the library profession we have many women whose stories are worth telling. Please, let's make sure that future generations know that they have foremothers to look to for inspiration.

Monday, May 25, 2020

1982

I've been trying to capture what I remember about the early days of library automation. Mostly my memory is about fun discoveries in my particular area (processing MARC records into the online catalog). I did run into an offprint of some articles in ITAL from 1982 (*) which provide very specific information about the technical environment, and I thought some folks might find that interesting. This refers to the University of California MELVYL union catalog, which at the time had about 800,000 records.

Operating system: IBM 360/370
Programming language: PL/I
Memory: 24 megabytes
Storage: 22 disk drives, ~ 10 gigabytes
DBMS: ADABAS

The disk drives were each about the size of an industrial washing machine. In fact, we referred to the room that held them as "the laundromat."

Telecommunications was a big deal because there was no telecommunications network linking the libraries of the University of California; there wasn't even one connecting the campuses at all. The article talks about the various possibilities, from an X.25 network to the new TCP/IP protocol that allows "internetwork communication." The first network was a set of dedicated lines leased from the phone company: a 9600 baud line to each campus, shared by about 8 ASCII terminals, delivering about 120 characters per second (one character = one byte) to each terminal. There was hope of being able to double the number of terminals.
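The arithmetic behind those figures works out if we assume the usual asynchronous serial framing of roughly 10 bits per character (one start bit, 8 data bits, one stop bit) -- an assumption on my part, not something stated in the article:

```python
# Back-of-the-envelope check on the leased-line figures above.
line_baud = 9600      # shared 9600 baud line to each campus
bits_per_char = 10    # start bit + 8 data bits + stop bit (assumed framing)
terminals = 8         # ASCII terminals sharing the line

total_cps = line_baud / bits_per_char      # characters/sec on the whole line
per_terminal_cps = total_cps / terminals   # characters/sec per terminal

print(total_cps, per_terminal_cps)  # 960.0 120.0
```

So 120 characters per second per terminal is exactly what an evenly shared 9600 baud line would yield, about two lines of an 80-column screen per second.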

In the speculation about the future, there was doubt that it would be possible to open the catalog to users outside of the UC campuses, much less internationally. (In fact, MELVYL became one of the first library catalogs to be openly accessible worldwide over the Internet just a few years later.) It was also thought that libraries would charge other libraries to view their catalogs, somewhat like inter-library loan.

And for anyone who has an interest in Z39.50, one section of the article by David Shaughnessy and Clifford Lynch on telecommunications outlines a need for catalog-to-catalog communication which sounds very much like the first glimmer of that protocol.

-----

(*) Various authors in a special edition: (1982). In-Depth: University of California MELVYL. Information Technology and Libraries, 1(4)

I wish I could give a better citation but my offprint does not have page numbers and I can't find this indexed anywhere. (Cue here the usual irony that libraries are terrible at preserving their own story.)