"Meeting at the Royal Library of Denmark, the Conference of European National Librarians (CENL), has voted overwhelmingly to support the open licensing of their data. CENL represents Europe’s 46 national libraries, and are responsible for the massive collection of publications that represent the accumulated knowledge of Europe.
What does that mean in practice?
It means that the datasets describing all the millions of books and texts ever published in Europe – the title, author, date, imprint, place of publication and so on, which exists in the vast library catalogues of Europe – will become increasingly accessible for anybody to re-use for whatever purpose they want."
From an announcement by the Conference of European National Librarians.
Wednesday, September 28, 2011
Sunday, September 18, 2011
Meaning in MARC: Indicators
I have been doing a study of the semantics of MARC data on the futurelib wiki. An article on what I learned about the fixed fields (00X) and the number and code fields (0XX) appeared in the Code4Lib Journal, issue 14, earlier this year. My next task is to tackle the variable fields in the MARC range 1XX-8XX.
This is a huge task, so I started by taking a look at the MARC indicators in this tag range, and have expanded this to a short study of the role that indicators play in MARC. I have to say that it is amazing how much one can stretch the MARC format with one or two single-character data elements.
Indicators have a large number of effects on the content of the MARC fields they modify. Here is the categorization that I have come up with, although I'm sure that other breakdowns are equally plausible.
I. Indicators that do not change the meaning of the field
There are indicators that have a function, but it does not change the meaning of the data in the field or subfields.
- Display constants: some, but not all, display constants merely echo the meaning of the tag, e.g. 775 Other Edition Entry, Second Indicator (Display constant controller):
  # - Other edition available
  8 - No display constant generated
- Trace/Do not trace: I consider these indicators to be carry-overs from card production.
- Non-filing indicators: similar to indicators that control displays, these indicators make it possible to sort (was filing) titles properly, ignoring the initial articles ("The ", "A ", etc.).
- Existence in X collection: there are indicators in the 0XX range that let you know if the item exists in the collection of a national library.
Many indicators serve as a way to expand the meaning of a field without requiring the definition of a new tag.
- Identification of the source or agency: a single field like a 650 topical subject field can have content from an unlimited list of controlled vocabularies because the indicator (or the indicator plus the $2 subfield) provides the identity of the controlled vocabulary.
- Multiple types in a field: some fields can have data of different types, controlled by the indicator. For example, the 246 (Varying form of title) has nine different possible values, like Cover title or Spine title, controlled by a single indicator value.
- Pseudo-display controllers: the same indicator type that carries display constants that merely echo the meaning of the field also has a number of instances where the display constant actually indicates a different meaning for the field. One example is the 520 (Summary, etc.) field with display constants for "subject," "review," "abstract," and others.
Given the complexity of the indicators there isn't a single answer to how this information should be interpreted in a semantic analysis of MARC. I am inclined to consider the display constants and tracing indicators in section I to not have meaning that needs to be addressed. These are parts of the MARC record that served the production of card sets but that should today be functions of system customization. I would argue that some of these have local value but are possibly not appropriate for record sharing.
The non-filing indicators are a solution to a problem that is evident in so many bibliographic applications. When I sort by title in Zotero or Mendeley, a large portion of the entries are sorted under "The." The world needs a solution here, but I'm not sure what it is going to be. One possibility is to create two versions of a title: one for display, with the initial article, and one for sorting, without. Systems could do the first pass at this, as they often do today when taking author names and inverting them into "familyname, forenames" order. Of course, humans would have to have the ability to make corrections where the system got it wrong.
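To make the two-title idea concrete, here is a minimal Python sketch. It assumes the non-filing indicator is a single digit giving the number of leading characters to skip (as in the 245 second indicator); the function name `sort_title` and the sample titles are made up for illustration and are not part of MARC or of any existing system.

```python
def sort_title(display_title: str, nonfiling_chars: int) -> str:
    """Derive a sort key by skipping the leading non-filing characters.

    nonfiling_chars plays the role of the non-filing indicator: a count
    from 0 to 9 of characters (article plus trailing space) to ignore.
    """
    if not 0 <= nonfiling_chars <= 9:
        raise ValueError("non-filing count must be between 0 and 9")
    # casefold() gives a case-insensitive key; the display title is untouched.
    return display_title[nonfiling_chars:].strip().casefold()


# Hypothetical (display title, non-filing count) pairs.
records = [("The Hobbit", 4), ("A Tale of Two Cities", 2), ("Emma", 0)]

for title, count in sorted(records, key=lambda r: sort_title(*r)):
    print(title)
```

The display form is kept as entered and the sort form is derived from it, so a human correction only needs to change the count, not retype the title.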
The indicators that identify the source of a controlled vocabulary could logically be transformed into a separate data element for each vocabulary (e.g. "LCSH," "MeSH"). However, the number of different vocabularies is, while not infinite, very large and growing (as evidenced by the practice in MARC to delegate the source to a separate subfield that carries codes from a controlled list of sources), so producing a separate data element for each vocabulary is unwieldy, to say the least. At some future date, when controlled vocabularies "self-identify" using URIs this may be less of a problem. For now, however, it seems that we will need to have multi-part data elements for controlled vocabularies that include the source with the vocabulary terms.
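A multi-part data element of this kind might look something like the following Python sketch. The class and function names are hypothetical, the indicator table is deliberately partial (only the two vocabularies named above, plus the "not specified" and "see $2" values), and treating subfields as a simple dictionary ignores the fact that real MARC subfields can repeat.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Partial, illustrative mapping of 650 second-indicator values to source codes.
INDICATOR_SOURCES: Dict[str, Optional[str]] = {
    "0": "lcsh",  # Library of Congress Subject Headings
    "2": "mesh",  # Medical Subject Headings
    "4": None,    # source not specified
}


@dataclass(frozen=True)
class SubjectTerm:
    """A subject heading that carries its vocabulary along with the term."""
    source: Optional[str]
    term: str


def subject_from_650(ind2: str, subfields: Dict[str, str]) -> SubjectTerm:
    """Build a SubjectTerm from a 650 second indicator and its subfields."""
    if ind2 == "7":
        # Value 7 delegates the source to subfield $2.
        source = subfields.get("2")
    else:
        source = INDICATOR_SOURCES.get(ind2)
    return SubjectTerm(source=source, term=subfields["a"])


print(subject_from_650("0", {"a": "Cataloging"}))
print(subject_from_650("7", {"a": "Cataloging", "2": "fast"}))
```

Either path ends up in the same two-part element, so the consumer never needs to know whether the source came from the indicator or from $2.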
The indicators that further sub-type a field, like the 520 Summary field, can be fairly easily given their own data element since they have their own meaning. Ideally there would be a "type/sub-type" relationship where appropriate.
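As a rough illustration of a "type/sub-type" relationship, the sketch below turns the 520 first indicator into an explicit sub-type. The value table reflects my reading of the MARC 21 documentation and should be checked against it; the names `TypedNote` and `note_from_520` and the type label "summary-note" are invented for the example.

```python
from typing import NamedTuple, Optional

# 520 first-indicator values as sub-types (verify against the MARC 21 format).
SUMMARY_SUBTYPES = {
    " ": "summary",
    "0": "subject",
    "1": "review",
    "2": "scope and content",
    "3": "abstract",
    "4": "content advice",
}


class TypedNote(NamedTuple):
    note_type: str           # the general type, taken from the tag
    subtype: Optional[str]   # the refinement, taken from the indicator
    text: str


def note_from_520(ind1: str, text: str) -> TypedNote:
    """Map a 520 field to a typed note; unknown or '8' values get no sub-type."""
    return TypedNote("summary-note", SUMMARY_SUBTYPES.get(ind1), text)


print(note_from_520("3", "An abstract of the work."))
```

A system that only understands the general type can ignore the sub-type, which is the practical benefit of keeping the two related rather than minting unrelated elements.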
And Some Problems
There are a number of areas that are problematic when it comes to the indicator values. In many cases, the MARC standard does not make clear if the indicator modifies all subfields in the field, or only a select few. In some instances we can reason this out: the non-filing indicators only refer to the left-most characters of the field, so they can only refer to the $a (which is mandatory in each of those fields). On the other hand, for the values in the subject area (6XX) of the record, the source indicator relates to all of the subject subfields in the field. I assume, however, that in all cases the control subfields $3-$8 perform functions that are unrelated to the indicator values. I do not know at this point if there are fields in which the indicators function on some other subset of the subfields between $a and $z. That's something I still need to study.
I also see a practical problem in making use of the indicator values in any kind of mapping from MARC to just about anything else. In 60% of MARC tags either one or both indicator positions is undefined. Undefined indicators are represented in the MARC record with blanks. Unfortunately there are also defined indicators that have a meaning assigned to the character "blank." There is nothing in the record itself to differentiate blank indicator values from undefined indicators. Any transformation from MARC to another format has to have knowledge about every tag and its indicators in order to do anything with these elements. This is another example of the complexity of MARC for data processing, and yet another reason why a new format could make our lives easier.
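The blank-versus-undefined problem can be seen in a small Python sketch. The table below is a hand-built, three-tag stand-in for the knowledge a converter would need about every tag; the names `INDICATOR_DEFS` and `interpret_indicator` are hypothetical, and the value labels are abbreviated from the MARC 21 documentation.

```python
from typing import Dict, Optional, Tuple

# None means the indicator position is undefined for that tag.
IndicatorDef = Optional[Dict[str, str]]

NONFILING = {str(n): f"skip {n} nonfiling characters" for n in range(10)}

# Illustrative subset only: (first indicator, second indicator) per tag.
INDICATOR_DEFS: Dict[str, Tuple[IndicatorDef, IndicatorDef]] = {
    "020": (None, None),
    "245": ({"0": "No added entry", "1": "Added entry"}, NONFILING),
    "520": ({" ": "Summary", "0": "Subject", "1": "Review", "3": "Abstract"}, None),
}


def interpret_indicator(tag: str, position: int, value: str) -> str:
    """Explain one indicator byte using per-tag knowledge from outside the record."""
    if position not in (1, 2):
        raise ValueError("position must be 1 or 2")
    if tag not in INDICATOR_DEFS:
        raise LookupError(f"no indicator definitions for tag {tag}")
    definition = INDICATOR_DEFS[tag][position - 1]
    if definition is None:
        return "undefined"
    return definition.get(value, "invalid value")


# The same blank byte, two different answers:
print(interpret_indicator("520", 1, " "))  # a defined value that happens to be blank
print(interpret_indicator("520", 2, " "))  # an undefined position
```

Nothing in the record distinguishes the two cases; the answer comes entirely from the external table.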
More on the Wiki
For anyone else who obsesses on these kinds of things there is more detail on all of this on the futurelib wiki. I welcome comments here, and on the wiki. If you wish to comment on the wiki, however, I need to add your login to the site (as an anti-spam measure). I will undoubtedly continue my own obsessive behavior related to this task, but I really would welcome collaboration if anyone is so inclined. I don't think that there is a single "right answer" to the questions I am asking, but am working on the principle that some practical decisions in this area can help us as we work on a future bibliographic carrier.
Friday, September 16, 2011
European Thoughts on RDA
Some European libraries are asking the question: "Should we adopt RDA as our cataloging code?" The discussion is happening through the European RDA Interest Group (EURIG). Members of EURIG are preparing reports on what they see as the possibilities that RDA could become a truly international cataloging code. With the increased sharing of just about everything between Europe's countries -- currency, labor force, media, etc. -- the vision of Europe's libraries as a cooperating unit seems to be a no-brainer.
There are interesting comments in the presentations available from the EURIG meeting. For example:
Spain has done comparisons with current cataloging and some testing using MARC21. They conclude: "Our decision will probably depend on the flexibility to get the different lists, vocabularies, specific rules... that we need." In other words, it all depends on being able to customize RDA to local practice.
Germany sees RDA as having the potential to be an international set of rules for data sharing (much like ISBD today), with national profiles for internal use. Germany has started translating the RDA vocabulary terms in the Open Metadata Registry, but notes that translation of the text must be negotiated with the co-publishers of RDA, that is, the American Library Association, the Canadian Library Association, and CILIP.
The most detail, though, comes from a report by the French libraries. (The French are totally winning my heart as a smart and outspoken people. Their response to the Google Books Settlement was wonderful.) This report brings up some key issues about RDA from outside the JSC.
First, it is said in this report, and also in some of the presentations from the EURIG meeting, that it is RDA's implementation of FRBR that makes it a candidate for an international cataloging code. FRBR is seen as the model that will allow library metadata to have a presence on the Web, and many in the library profession see getting a library presence on the Web as an essential element of being part of the modern information environment. One irony of this, though, is that Italy already has a cataloging code based on FRBR, REICAT, but that has gotten little attention. (A large segment of it is available in English if you are curious about their approach.)
The French interest in FRBR is specifically about Scenario 1 as defined in RDA: a model with defined entities and links between them. An implementation of Scenario 2, which links authority records to bibliographic records, would be a mere replication of what already exists in France's catalogs. In other words, they have already progressed to Scenario 2 while U.S. libraries are still stuck in Scenario 3, the flat data model.
Although the French libraries see an advantage to using RDA, they also have some fairly severe criticisms. Key ones are:
- it ignores ISO standards, and does not follow IFLA standards, such as Names of person, or Anonymous classics*
- it is a follow-on to, and makes concessions to, AACR(1 and 2), which is not used by the French libraries
- it proposes one particular interpretation of FRBR, not allowing for others, and defines each element as being exclusively for use with a single FRBR entity
* There is a strong adherence to ISO and IFLA standards outside of the U.S. I don't know why we in the U.S. feel less need to pay attention to those international standards bodies, but it does separate us from the greater library community.
(Thanks to John Hostage of Harvard for pointing out the recent EURIG activity on the RDA-L list.)
Due diligence do-over
In what I see as both a brave and an appropriate move, the University of Michigan admitted publicly that the Authors Guild had found some serious flaws in its process for identifying orphan works. The statement reaffirms the need to identify orphan works, and promises to revise its procedures.
"Having learned from our mistakes—we are, after all, an educational institution—we have already begun an examination of our procedures to identify the gaps that allowed volumes that are evidently not orphan works to be added to the list. Once we create a more robust, transparent, and fully documented process, we will proceed with the work, because we remain as certain as ever that our proposed uses of orphan works are lawful and important to the future of scholarship and the libraries that support it."
Among other things, what I find interesting in all this is that no one seems to be wondering why our copyright registration process is so broken that sometimes even the rights holders themselves don't know that they are the rights holders. It really shouldn't be this hard to find out if a work is covered by copyright. Larry Lessig covered this in his book "Free Culture," which is available online. The basic process of identifying copyrights is broken, and the burden is being placed on those who wish to make use of works. This is a clear anti-progress, pro-market bias in our copyright system.
Thursday, September 15, 2011
Diligence due
Oooof! Talk about making a BIG, public mistake.
HathiTrust's new Orphan Works Project proposed to do due diligence on works, then post them on the HT site for 90 days, after which those works would be assumed to be orphans and would then be made available (in full text) to members of the HT cooperating institutions. Sounds good, right? (Well, maybe other than the fact of posting the works on a site that few people even know about...)
The Authors Guild blog posted yesterday that it had found the rights holder of one of the books on HT's orphan works list in a little over 2 minutes using Google. (It's hard to believe that they didn't know this when the suit was filed on September 13 -- this is brilliant PR, if I ever saw it.) They then reported finding two others.
James Grimmelmann, Associate Professor at New York Law School and someone considered an expert on the Google Books case, has titled his blog post on this: "HathiTrust Single-Handedly Sinks Orphan Works Reform," stating that this incident will be brought up whenever anyone claims to have done due diligence on orphan works. I'm not quite as pessimistic as James, but I do believe this will be brought up in court and will work against HT.
Wednesday, September 14, 2011
Authors Guild in Perspective
In their suit against HathiTrust, the three authors guilds claim that there are digitized copies of millions of copyrighted books in HathiTrust, and that these should be removed from the database and stored in escrow off-line.
A relevant question is: who do the authors guilds represent, and how many of those books belong to the represented authors?
The combined membership of the three authors guilds is about 13,000. That seems like a significant number, but the Library of Congress name authority file has about 8 million names. That file also contains name/title combinations, and I don't have any statistics that tell me how many of those there are. (If anyone out there has a copy of the file and can run some stats on it, I'd greatly appreciate it.) Some of the names are for writers whose works are all in the public domain. Yet no matter how we slice it, the authors guilds in the lawsuit represent a small percentage of the authors whose in-copyright works are in the HathiTrust database.
The legal question then is: does this lawsuit pertain to all in-copyright works in HathiTrust, or only those by the represented authors? Could I, for example, sue HathiTrust for violating Fay Weldon's copyright?
Reply to this from James Grimmelmann on his blog:
Good question, Karen, and one I plan to address in more detail in a civil procedure post in the next few days. In brief, you couldn’t sue to enforce Fay Weldon’s copyright, as you aren’t an “owner” of any of the rights in it. The Authors Guild and other organizations can sue on behalf of their members, but the details of associational standing are complicated. There is also the question of the scope of a possible injunction (e.g. could Fay Weldon win as to one of her works and obtain an injunction covering others, or works by others), where there are also significant limits on how far the court can go. Again, more soon.
As I suspected, the legal issues are complex. Keep an eye on James' blog for more on this.
Monday, September 12, 2011
Authors Guild Sues HathiTrust
There has been a period of limbo since Judge Chin rejected the proposed settlement between the Authors Guild/Association of American Publishers and Google. In fact, a supposedly final meeting between the parties is scheduled for this Thursday, 9/15, in the judge's court.
Monday, 9/12, the Authors Guild (and partners) filed suit against HathiTrust (and partners) for some of the same "crimes" of which it had accused Google: essentially making unauthorized copies of in-copyright texts. In addition, the recent announcement that the libraries would allow their users to access items that had been deemed to be orphan works figures in the suit. That this suit has come over 6 years after the original suit against Google is in itself interesting. Nearly all of the actions of HathiTrust and its member libraries fall within what would have been allowed if the agreement that came out of that suit had been approved by the court. Although we do not know the final outcome of that suit (and anxiously await Thursday's meeting to see if it is revelatory), this suit against the libraries is surely a sign that AG/AAP and Google have not come to a reconciliation.
The Suit
First, the suit establishes that the libraries received copies of Google-digitized items from Google, and have sent copies of these items to HathiTrust, which in turn makes some number of copies as part of its archival function. This is followed by a somewhat short exposition of the areas of copyright law that are pertinent, with an emphasis on section 108, which allows libraries to make limited copies to replace deteriorating works. The suit states that the copying being done is not in accord with section 108. Then it refers to the Orphan Works Project that several libraries are partnering in, and the plan on the part of the libraries to make the full text of orphan works available to institutional users.
Since most of these institutions (if not all of them) are state institutions that have protections against paying out large sums in a lawsuit of this nature, the goal is to regain the control of the works by forcing HathiTrust (and the named libraries) to transfer their digital copies of in-copyright works to a "commercial grade" escrow agency with the files held off network "pending an appropriate act of Congress."
As James Grimmelmann comments in his blog post on the suit, there's a lot of mixing up between the orphan works and owned works in the suit. He points out that a group of organizations representing authors could hardly make a case for orphan works since, by definition, the lack of ownership of the orphans means they can't be represented by a guild of people defending their own works.
The Problems
There are numerous problems that I see in the text of this suit. (IANAL, just a Librarian.)
Plaintiffs: The Author's Guild, Inc.; The Australian Society of Authors limited ; Union des Erivaines Quebecois; Pat Cummings; Angelo Loukakis; Roxana Robinson; Andre Roy; James Shapiro; Daniele Simpson; T.J. Stiles; and Fay Weldon. (Links are to some sample HathiTrust records.)
Defendants: HathiTrust; The Regents of the University of California, The Board of Regents of the University of Wisconsin System; The Trustees of Indiana University; and Cornell University.
Links
Boing Boing: Authors Guild declares war on university effort to rescue orphaned books
Library Journal: Copyright Clash
Monday, 9/12, the Author's Guild (and partners) filed suit against HathiTrust (and partners) for some of the same "crimes" of which it had accused Google: essentially making unauthorized copies of in-copyright texts. In addition, the recent announcement that the libraries would allow their users to access items that had been deemed to be orphan works figures in the suit. That this suit has come over 6 years since the original suit against Google is in itself interesting. Nearly all of the actions of HathiTrust and its member libraries fall within what would have been allowed if the agreement that came out of that suit had been approved by the court. Although we do not know the final outcome of that suit (and anxiously await Thursdays meeting to see if it is revelatory), this suit against the libraries is surely a sign that AG/AAP and Google have not come to a reconciliation.
The Suit
First, the suit establishes that the libraries received copies of Google-digitized items from Google, and have sent copies of these items to HathiTrust, which in turn makes some number of copies as part of its archival function. This is followed by a somewhat short exposition of the areas of copyright law that are pertinent, with an emphasis on section 108, which allows libraries to make limited copies to replace deteriorating works. The suit states that the copying being done is not in accord with section 108. Then it refers to the Orphan Works Project that several libraries are partnering in, and the plan on the part of the libraries to make the full text of orphan works available to institutional users.
Since most of these institutions (if not all of them) are state institutions that have protections against paying out large sums in a lawsuit of this nature, the goal is to regain the control of the works by forcing HathiTrust (and the named libraries) to transfer their digital copies of in-copyright works to a "commercial grade" escrow agency with the files held off network "pending an appropriate act of Congress."
As James Grimmelmann comments in his blog post on the suit, the suit repeatedly mixes up orphan works and owned works. He points out that a group of organizations representing authors could hardly make a case for orphan works since, by definition, the lack of ownership of the orphans means they can't be represented by a guild of people defending their own works.
The Problems
There are numerous problems that I see in the text of this suit. (IANAL, just a Librarian.)
- The suit mentions large numbers of books that have been copied without permission, but makes no attempt to state how many of those books belong to the members of the plaintiff organizations.
- The suit throws around large numbers without clearly stating that none of the statements include Public Domain works. It isn't clear, therefore, what the numbers represent: the entire holdings of HathiTrust, or just the in-copyright holdings. Also, in relation to the latter, unless one has done a considerable amount of work there are many works that are post-1923 that are also in the Public Domain. Cutting off at that year does not account for works that were not renewed, or were never copyrighted. I also doubt if anyone has a clear idea how many of the works in question are Public Domain because they are US Federal documents. This imprecision on the copyright status of works is very frustrating, but HathiTrust is not to blame for this state of affairs.
- Some of their claims do not seem to me to be within legal bounds. For example, in one section they claim that although HathiTrust is not giving users access to in-copyright works, they potentially could. Where does that fit in?
- They also claim that there is a risk of unauthorized access. However, the security at HathiTrust meets the security standards that the Authors Guild agreed to in the (unapproved) settlement with Google. If it was good enough then, why is it now too risky?
- They claim that the libraries themselves have been digitizing in-copyright books. I wasn't aware of this, and would like to know if this is the case.
- They state that the libraries said that before Google it was costing them $100 a book for digitization. Then the plaintiffs say that this means that the value of the digital files is in the hundreds of millions of dollars. First, I have heard figures that are more like $30 a book. Second, I don't see how the cost to digitize can translate into a value that is relevant to the complaint.
- Although the legislature has failed to pass an orphan works law that would allow the use of these materials and still benefit owners if they do come forth, it seems like a poor strategy to complain about a well-designed program of due diligence and notification, which is what the libraries have designed. Orphan works are the least available works: if you have an owner you can ask permission; if there is no owner you cannot ask permission and therefore there is no way to use the work if your use falls outside of fair use. It's hard to argue for taking these works entirely out of the cultural realm simply because we have a poorly managed copyright ownership record.
- There are a few odd sections where they make reference to bibliographic data as though that were part of the "unauthorized digitization" rather than data that was created by and belongs to the libraries. There's an odd attempt to make bibliographic data searching seem nefarious.
Plaintiffs: The Authors Guild, Inc.; The Australian Society of Authors Limited; Union des Écrivaines et des Écrivains Québécois; Pat Cummings; Angelo Loukakis; Roxana Robinson; André Roy; James Shapiro; Danièle Simpson; T.J. Stiles; and Fay Weldon. (Links are to some sample HathiTrust records.)
Defendants: HathiTrust; The Regents of the University of California; The Board of Regents of the University of Wisconsin System; The Trustees of Indiana University; and Cornell University.
Links
Boing Boing: Authors Guild declares war on university effort to rescue orphaned books
Library Journal: Copyright Clash
Friday, September 09, 2011
MARC vs RDA
As LC ponders the task of moving to a bibliographic framework, I can't help but worry about how much the past is going to impinge on our future. It seems to me that we have two potentially incompatible needs at the moment: the first is to fix MARC, and the second is to create a carrier for RDA.
Fixing MARC
For well over a decade some of us have been suggesting that we need a new carrier for the data that is currently stored in the MARC format. The record we work with today is full of kludges brought on by limitations in the data format itself. To give a few examples:
- 041 Language Codes - We have a single language code in the 008 and a number of other language codes (e.g. for original language of an abstract) in 041. The language code in the 008 is not "typed" so it must be repeated in the 041 which has separate subfields for different language codes. However, 041 is only included when more than one language code is needed. This means that there are always two places one must look to find language codes.
- 006 Fixed-Length Data Elements, Additional Material Characteristics - The sole reason for the existence of the 006 is that the 008 is not repeatable. The fixed-length data elements in the 006 are repeats of format-specific elements in the 008 so that information about multi-format items can be encoded.
- 773 Host Item Entry - All of the fields for related resources (76X-78X) have the impossible task of encoding an entire bibliographic description in a single field. Because there are only 26 possible subfields (a-z) available for the bibliographic data, data elements in these fields are not coded the same as they are in other parts of the record. For example, in the 773 the entire main entry is entered in a single subfield ("$aDesio, Ardito, 1897-") as opposed to the way it is coded in any X00 field ("$aDesio, Ardito,$d1897-").
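The 041 example above can be sketched in a few lines of Python. This is only an illustration, not a MARC parser: the function name `language_codes`, the sample 008 string, and the representation of the 041 as a list of (subfield code, value) pairs are all assumptions made for the sketch. The one detail taken from the format itself is that the language code sits in positions 35-37 of the 008.

```python
def language_codes(control_008, field_041_subfields):
    """Collect language codes from both places a record may carry them.

    control_008: the 008 field as a string; positions 35-37 hold one
        untyped language code.
    field_041_subfields: (subfield_code, value) pairs from an 041 field,
        or an empty list when the record has no 041.
    """
    codes = []
    primary = control_008[35:38].strip()
    if primary:
        # The 008 code has no "type": it only says "a language of the item".
        codes.append(("008", primary))
    for code, value in field_041_subfields:
        # The 041 types each code by subfield (e.g. $a text, $b summary).
        codes.append(("041$" + code, value))
    return codes


# Hypothetical sample data: only positions 35-37 of this 008 are meaningful.
sample_008 = " " * 35 + "eng" + " d"
sample_041 = [("a", "eng"), ("b", "fre")]

codes = language_codes(sample_008, sample_041)
for source, value in codes:
    print(source, value)
```

Any program that wants "the languages of this item" has to merge the two sources along these lines, and has to decide what to do when the 008 code is repeated in the 041.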
A Carrier for RDA
The precipitating reason for LC's bibliographic framework project is RDA. One of the clearest results of the RDA tests that were conducted in 2010 was that MARC is not a suitable carrier for RDA. If we are to catalog using the new code, we must have a new carrier. I see two main areas where RDA differs "record-wise" from the cataloging codes that informed the MARC record:
- RDA implements the FRBR entities
- RDA allows the use of identifiers for entities and terms
To my mind, the main complications about a carrier for RDA have to do with FRBR and how we can most efficiently create relationships between the FRBR entities and manage them within systems. I suspect that we will need to accommodate multiple FRBR scenarios, some appropriate to data storage and others more appropriate to data transmission.
Can We Do Both?
This is my concern: creating a carrier for RDA will not solve the MARC record problem; solving the MARC record problem will not provide a carrier for RDA. There may be a way to combine these two needs, but I fear that a combined solution would end up creating a data format that doesn't really solve either problem because of the significant difference between the AACR conceptual model and that of RDA/FRBR.
It seems that if we want to move forward, we may have to make a break with the past. We may need to freeze MARC for those users continuing to create pre-RDA bibliographic data, and create an RDA carrier that is true to the needs of RDA and the systems that will be built around RDA data, with any future enhancements taking place only to the new carrier. This will require a strategy for converting data in MARC to the RDA carrier as libraries move to systems based on RDA.
Next: It's All About the Systems
In fact, the big issue is not data conversion but what the future systems will require in order to take advantage of RDA/FRBR. This is a huge question, and I will take it up in a new post, but just let me say here that it would be folly to devise a data format that is not based on an understanding of the system requirements that can fulfill desired functionality and uses.
Wednesday, September 07, 2011
XML and Library Data Future
There is sometimes the assumption that the future data carrier for library data will be XML. I think this assumption may be misleading and I'm going to attempt to clarify how XML may fit into the library data future. Some of this explanation is necessarily over-simplified because a full exposition of the merits and de-merits of XML would be a tome, not a blog post.
What is XML?
The eXtensible Markup Language (XML) is a highly versatile markup language. A markup language is primarily a way to encode text or other expressions so that some machine-processing can be performed. That processing can manage display (e.g. presenting text in bold or italics) or it can be similar to metadata encoding of the meaning of a group of characters ("dateAndTime"). It makes the expression more machine-usable. It is not a data model in itself, but it can be used to mark up data based on a wide variety of models.*
XML is the flagship standard in a large family of markup languages, although not the first: it is an evolution of SGML which had (perhaps necessary) complexities that rendered it very difficult for most mortals to use. It's also the conceptual granddaddy of HTML, a much simplified markup language that many of us take for granted.
Defining Metadata in XML
There is a difference between using XML as a markup for documents or data and using XML to define your data. XML has some inherent structural qualities that may not be compatible with what you want your data to be. There is a reason why XML "records" are generally referred to as "documents": they tend to be quite linear in nature, with a beginning, a middle, and an end, just like a good story.
XML's main structural functionality is that of nesting, or the creation of containers that hold separate bits of data together.
<paragraph>
<sentence></sentence>
<sentence></sentence> ...
</paragraph>
<name>
<familyname></familyname>
<forenames></forenames>
</name>
This is useful for document markup and also handy when marking up data. It is not unusual for XML documents to have nesting of elements many layers deep. This nesting, however, can be deceptive. Just because you have things inside other things does not mean that the relationship is anything more than a convenience for the application for which it was designed.
<customer>
<customerNumber></customerNumber>
<phoneNumber></phoneNumber>
</customer>
Nested elements are most frequently in a whole/part relationship, with the container representing the whole and holding the elements (parts) together as a unit (in particular a unit that can be repeated).
<address>
<street1></street1>
<street2></street2>
<city></city>
<state></state>
<zip></zip>
</address>
While usually not hierarchical in the sense of genus/species or broader/narrower, this nesting has some of the same data processing issues that we find in other hierarchical arrangements:
- The difficulty of placing elements in a single hierarchy when many elements could be logically located in more than one place. That problem has to be weighed against the inconvenience and danger of carrying the same data more than once in a record or system and the chances that these redundant elements will not get updated together.
- The need to traverse the whole hierarchy to get to "buried" elements. This was the pain-in-the-neck that caused most data processing shops to drop hierarchical database management systems for relational ones. XML tools make this somewhat less painful, but not painless.
- Poor interoperability. The same data element can be in different containers in different XML documents, but the data elements may not be usable outside the context of the containing element (e.g. "street2").
In the nested XML structure some of the same data is carried in separate containers and there isn't any inherent relationship between them. Were this data entered into a relational database it might be possible to create those relationships, somewhat like the graph view. But as a record the XML document has separate data elements for the same data because the element is not separate from the container. In other words, the XML document has two different data elements for the zip code:
address:zip
censusDistrict:zip
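To make the point about the two zip elements concrete, here is a small sketch using Python's standard xml.etree.ElementTree module. The sample document is invented for illustration: the `person` wrapper, `districtNumber`, and the values are assumptions; only `address`, `censusDistrict`, and `zip` come from the example above.

```python
import xml.etree.ElementTree as ET

# Invented sample document; the same zip value is carried in two containers.
doc = ET.fromstring("""
<person>
  <address>
    <street1>123 Main St</street1>
    <city>Springfield</city>
    <zip>99999</zip>
  </address>
  <censusDistrict>
    <districtNumber>12</districtNumber>
    <zip>99999</zip>
  </censusDistrict>
</person>
""")


def leaf_paths(element, prefix=""):
    """Return (path, text) for every leaf element, walking the whole tree."""
    path = prefix + "/" + element.tag if prefix else element.tag
    children = list(element)
    if not children:
        return [(path, (element.text or "").strip())]
    found = []
    for child in children:
        found.extend(leaf_paths(child, path))
    return found


paths = leaf_paths(doc)
# Two distinct "elements" that happen to hold the same zip code.
zip_paths = [path for path, value in paths if path.endswith("/zip")]
print(zip_paths)
```

Nothing in the document itself says that the two zip elements mean the same thing; an application has to know that in advance, and has to walk the tree to find both.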
To use a library concept as an analogy, the nesting in XML is like pre-coordination in library subject headings. It binds elements together in a way that they cannot be readily used in any other context. Some coordination is definitely useful at the application level, but if all of your data is pre-coordinated it becomes difficult to create new uses for new contexts.
Avoid XML Pitfalls
XML does not make your data any better than it was, and it can be used to mark up data that is illogically organized and poorly defined. A misstep that I often see is data designers beginning to use XML before their data is fully described, and therefore letting the structure and limitations of XML influence what their data can express. Be very wary of any project that decides that the data format will be XML before the data itself has been fully defined.
XML and Library data
If XML had been available in 1965 when Henriette Avram was developing the MARC format it would have been a logical choice for that data. The task that Avram faced was to create a machine-readable version of the data on the catalog card that would allow cards to be printed that looked exactly like the cards that were created prior to MARC. It was a classic document mark-up situation. Had that been the case our records could very well have evolved in a way that is different to what we have today, because XML would not have had the need to separate fixed field data from variable field data, and expansion of some data areas might have been easier. But saying that XML would have been a good format in 1965 does not mean that it would be a good format in 2011.
For the future library data format, I can imagine that it will, at times, be conveyed over the internet in XML. If it can ONLY be conveyed in XML we will have created a problem for ourselves. Our data should be independent of any particular serialization and be designed so that it is not necessary to have any particular combination or nesting of elements in order to make use of the data. Applications that use the data can of course combine and structure the elements however they wish, but for our data to be usable in a variety of applications we need to keep the "pre-coordination" of elements to a minimum.
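One way to picture data that is independent of its serialization is to hold it as flat statements and treat XML as just one of several ways to write those statements out. The sketch below is a toy, not a proposal for a format: the statement tuples, the identifiers `book1` and `person1`, and the element and attribute names are all invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Invented sample data: flat (subject, property, value) statements, no nesting.
statements = [
    ("book1", "title", "An Example Title"),
    ("book1", "language", "eng"),
    ("book1", "creator", "person1"),
    ("person1", "name", "Example, Author"),
]


def to_json(stmts):
    """Write the statements as a JSON array of objects."""
    return json.dumps(
        [{"subject": s, "property": p, "value": v} for s, p, v in stmts]
    )


def from_json(text):
    return [(d["subject"], d["property"], d["value"]) for d in json.loads(text)]


def to_xml(stmts):
    """Write the same statements as a flat list of XML elements."""
    root = ET.Element("statements")
    for s, p, v in stmts:
        node = ET.SubElement(root, "statement", subject=s, property=p)
        node.text = v
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    return [(n.get("subject"), n.get("property"), n.text)
            for n in ET.fromstring(text)]


# Either serialization round-trips to exactly the same data.
json_round_trip = from_json(to_json(statements))
xml_round_trip = from_xml(to_xml(statements))
print(json_round_trip == statements, xml_round_trip == statements)
```

An application is still free to regroup these statements into whatever nested structure suits it; the nesting is then a choice made by the application rather than a property of the data.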
* For example, there is an XML serialization (essentially a record format) of RDF that is frequently used to exchange linked data, although other serializations are also often available. It is used primarily because there is a wide range of software tools available for making use of XML data in applications, and there are many fewer tools available for the more "native" RDF expressions such as triples or turtle. It encapsulates RDF data in a record format and I suspect that using XML for this data will turn out to be a transitional phase as we move from record-based data structures to graph-based ones.