Tuesday, April 01, 2014

FRBR group 1: the gang of four

(This is a very delayed follow-on to my earlier post on FRBR groups 2 and 3. It's not that I haven't been thinking about it... and I hope soon to be able to post my talk from FSR2014 on FRBR, as well.)

Parts vs. views

Each of the three FRBR groups is defined briefly in the introduction to section 3 of the FRBR document. The second and third groups have fairly concrete definitions:
group 2 "...those responsible for the intellectual or artistic content, the physical production and dissemination, or the custodianship of the entities in the first group"

group 3 "...an additional set of entities that serve as the subjects of works"
The definition of Group 1 is more complex and considerably less clear:
"The entities in the first group (as depicted in Figure 3.1) represent the different aspects of user interests in the products of intellectual or artistic endeavour." [FRBR, p. 13]
Where groups 2 and 3 are made up of similar but independent things (which is a common definition of a class of things), group 1 consists of aspects of a single thing ("intellectual or artistic endeavor"). The term "aspects" can be defined as either parts of something or points of view about something. The difference between "parts" vs. "points of view" is important. Parts could be defined as simple, observable facts, such as the parts of a particular automobile (motor, chassis, wheels). These are characteristics of the thing itself, independent of the observer. Points of view, of course, vary for each viewer and perhaps each viewing. This would fit with the FRBR document's statement on the work:
"The concept of what constitutes a work and where the line of demarcation lies between one work and another may in fact be viewed differently from one culture to another. Consequently the bibliographic conventions established by various cultures or national groups may differ in terms of the criteria they use for determining the boundaries between one work and another." [FRBR p. 17]
That the FRBR document states that the entities are aspects of user interests rather than aspects of an intellectual endeavor implies that the entities of group 1 are not parts of the endeavor, but constructions in the minds of users. From the remainder of the FRBR document, in particular the areas where the attributes are defined for each entity, it is clear that the FRBR Study Group chose to define the bibliographic description of intellectual endeavors as a single point of view. For each entity, the Study Group has provided a set of elements that are each defined for only that one entity, with no concession made for different points of view or interests. This is, however, in spite of the statement above that communities may have different views.

To reinforce the view of group 1 as parts of a whole, there exist dependencies between the group 1 entities such that, with the exception of work, each can only exist in combination with certain others to which it is linked. Therefore none represents a whole on its own. (In fact, there is no concept of a whole bibliographic description in FRBR. That would need to come from a different analysis.) The definitions of the entities express these dependencies.
"work: a distinct intellectual or artistic creation." [FRBR, p.17]
"expression: the intellectual or artistic realization of a work in the form of alpha-numeric, musical, or choreographic notation, sound, image, object, movement, etc., or any combination of such forms." [FRBR p. 19]
"manifestation: the physical embodiment of an expression of a work." [FRBR p. 21]
"item: a single exemplar of a manifestation" [FRBR p. 24]
as does the description of the cardinality of the relationships:
"The relationships depicted in the diagram indicate that a work may be realized through one or more than one expression (hence the double arrow on the line that links work to expression). An expression, on the other hand, is the realization of one and only one work (hence the single arrow on the reverse direction of that line linking expression to work). An expression may be embodied in one or more than one manifestation; likewise a manifestation may embody one or more than one expression. A manifestation, in turn, may be exemplified by one or more than one item; but an item may exemplify one and only one manifestation." [FRBR pp. 13-14]
This directionality, or fixed order, of the dependencies is the source of the image of group 1 as a hierarchy, where each entity connects to the entity "above" it. But there is more than one interpretation of these definitions. Taniguchi [taniguchi] reads the description of the entities as a "Russian doll" with each succeeding entity containing the previous ones. In the definitions of expression, manifestation, and item, above, each entity appears to encapsulate the one or ones above it in the diagram ("the physical embodiment of an expression of a work"). When diagrammed, this view would look like:

(Note that this does not exclude the one-to-many and many-to-many relationships as long as both expressions and manifestations can be part of more than one nested structure.)
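As a rough sketch of this nested reading, each entity can be modeled as carrying the entity "above" it inside itself. The class names, field names, and sample data below are my own illustration; they are not taken from FRBR or from Taniguchi.

```python
from dataclasses import dataclass

# Hypothetical names for illustration only; FRBR defines no such classes.

@dataclass(frozen=True)
class Work:
    title: str

@dataclass(frozen=True)
class Expression:
    work: Work              # the expression contains its (single) work
    form: str

@dataclass(frozen=True)
class Manifestation:
    expressions: tuple      # one or more expressions, each containing a work
    carrier: str

@dataclass(frozen=True)
class Item:
    manifestation: Manifestation   # the item contains its manifestation
    copy_id: str

hamlet = Work("Hamlet")
english_text = Expression(hamlet, "English text")
german_text = Expression(hamlet, "German translation")

# The same expression can be part of more than one nested structure.
paperback = Manifestation((english_text,), "paperback")
bilingual = Manifestation((english_text, german_text), "hardcover")
copy_1 = Item(paperback, "copy 1")

def works_inside(item):
    """Unwrap the doll: item -> manifestation -> expressions -> works."""
    return {expr.work.title for expr in item.manifestation.expressions}
```

In this reading there is no way to hold an item without also holding its manifestation, expression, and work, which is the "Russian doll" quality of the verbal definitions.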

The other interpretation, which is the most common interpretation of the entity-relation diagrams, is similar to a database design where each entity represents a single set of data elements that can be shared in one-to-many or many-to-many relationships. This view presents the four aspects as separate entities with strict relationships between them.
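A minimal sketch of this second reading, assuming made-up identifiers and field names, keeps each entity as its own set of data elements and expresses the relationships only through links:

```python
# Hypothetical identifiers and field names, for illustration only.
works = {"w1": {"title": "Hamlet"}}

expressions = {
    # Each expression links to one and only one work.
    "e1": {"work": "w1", "form": "English text"},
    "e2": {"work": "w1", "form": "German translation"},
}

manifestations = {
    "m1": {"carrier": "paperback"},
    "m2": {"carrier": "hardcover"},
}

# Many-to-many: pairs of (manifestation id, expression id).
embodies = {("m1", "e1"), ("m2", "e1"), ("m2", "e2")}

# Each item links to one and only one manifestation.
items = {"i1": {"manifestation": "m1"}}

def expressions_of(work_id):
    """All expressions that realize the given work."""
    return sorted(e for e, row in expressions.items() if row["work"] == work_id)

def manifestations_of(expression_id):
    """All manifestations that embody the given expression."""
    return sorted(m for m, e in embodies if e == expression_id)
```

Here the work's title is stored once and shared by everything linked to it. No entity contains another, and no data element appears in more than one entity.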

I perceive a contradiction between the verbal definitions in the document and the diagrams, which one presumes are intended to represent the information in the text. The decision to represent the group 1 entities as separate parts and without any overlap in data elements is a conceptual reduction of the definitions that are given early in the document, and nowhere does the document state that such a decision was made. There could be good reasons to implement the FRBR group 1 concepts in a particular technology as a simplified structure, but it is clear to me that the model in the diagrams is not as rich as the concepts in the text would allow.


Some interpretations of FRBR treat the work, expression and manifestation as a process or continuum, moving from the idea in the creator’s mind, to an expression of that creation, and then to a manifestation where the expression becomes "manifest" or physical in nature.
"Content relationships can be viewed as a continuum from works/expressions/manifestations/items. Moving left to right along this continuum we start with some original work and related works and expressions and manifestations that can be considered ‘equivalent,’ that is, they share the same intellectual or artistic content as realized through the same mode of expression." [tillett p. 4]
The FRBR group 1 "continuum of entities" runs into problems when faced with the reality of publishing and packaging. While the line from work to expression to manifestation may follow some ideal logic, it may have been more functional to separate the description of the package from its intellectual contents. Instead, manifestation, as described in FRBR, is still based on the traditional catalog entry that mixes content and carrier by including creators, titles, and edition information, which better fit the definition of expression than manifestation.

Most explanations of work, expression, manifestation, item (WEMI) move from work to expression, then to manifestation, in that order, and most give only a slight nod to item. But in terms of cataloger workflow, WEM is a single unit that is encountered with the item in hand. While you may be able to store information about a work or expression separately in a database, you cannot separate the work from the expression or the manifestation in real life.


FRBR provides a static view of the bibliographic resource with little agency. The entities simply exist; they are not described as created as the result of an action. In fact, the entities seem to be actors themselves, as when the expression "realizes" the work. It would make more sense to say that the expression entity is the realization of the work, and that some sentient being acts to create the expression. Instead, in FRBR some unnamed magic occurs between the work and the expression. The same is true of the manifestation, which should be the result of some action that produces a physical manifestation of the expression of the work.

This static view is compatible with library cataloging, which is mainly interested in describing the item in hand as a single unit. The development of a model that emphasizes relationships between creative outputs begs for a more actor-centered view of the bibliographic universe. One could argue that it is precisely the intervention of specific actors that creates a differentiation between entities. The same music piece performed by different musicians, or the same musicians at a different time, must be a different expression. The "studio cut" and the "director's cut" of a film are either different expressions or different works (depending on your definition of work), based on the agent in control. Adding agent intervention to the model could be useful in developing clearer rules for the determination of separate entities during the cataloging process.
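To make the idea of agent intervention concrete, here is one possible rule, sketched under my own assumptions: two realizations count as the same expression only when the work, the acting agents, and the date all match. The names and the rule itself are hypothetical and are not part of FRBR.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Realization:
    work: str
    agents: frozenset   # who acted to realize the work (hypothetical field)
    date: str           # when they did so (hypothetical field)

def is_same_expression(a, b):
    """Illustrative rule only: same work, same agents, same occasion."""
    return (a.work, a.agents, a.date) == (b.work, b.agents, b.date)

players = frozenset({"Musician A", "Musician B"})
first_night = Realization("String piece", players, "2014-03-01")
second_night = Realization("String piece", players, "2014-03-02")
other_players = Realization("String piece", frozenset({"Musician C"}), "2014-03-01")
```

Under this rule the same musicians on a different night, or different musicians on the same night, produce a different expression. A cataloging community could choose a looser or stricter rule; the point is only that the rule is stated in terms of who acted.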


While FRBR groups 2 and 3 are composed of real world things (in the semantic web sense), group 1 appears to be an analysis of the current data of bibliographic records. The division of attributes into the four "boxes" of WEMI does not introduce new data elements but partitions the existing bibliographic record among the entities. The resulting group 1 picture resembles ISBD rather than AACR/MARC in that it is a static view of a bibliographic "done deal" with no indication of agents or subjects. Others have noticed that there are neither creator nor subject attributes among the listed attributes for the work -- instead, these are included in the model as relationships defined between groups 2 and 3 and group 1. This is a logical outcome of the use of the database design methodology where data is stored for subsequent use but is not part of a data creator or data user workflow.

In the past bibliographic description has been unitarian, with one record representing one, indivisible bibliographic thing. FRBR posits a quatritarian view of the same data. The difficulty, however, is that the FRBR group 1 is not like the division of an automobile into chassis, motor and wheels; instead, where one draws the line between the separate aspects of the FRBR quaternity -- or whether one prefers a unitarian, duotarian or even a quintitarian approach -- is not based on empirical data, but on one's particular point of view. That point of view is not arbitrary, but has many factors based on material type, organization type, and the needs of the users. FRBR's four-part bibliographic description is one possibility. It may represent a particular bibliographic view, but one cannot expect that it represents all bibliographic views, either in libraries or beyond them.

[FRBR] IFLA Study Group on the Functional Requirements for Bibliographic Records. (2009). Functional Requirements for Bibliographic Records. Retrieved from http://archive.ifla.org/VII/s13/frbr/frbr_2008.pdf

[taniguchi] Shoichi Taniguchi. “A conceptual model giving primacy to expression-level bibliographic entity in cataloging”, Journal of Documentation, Vol. 58 Iss: 4, 2002. pp.363 – 382. http://dx.doi.org/10.1108/00220410210431109

[tillett] Tillett, B. What is FRBR? A conceptual model for the bibliographic universe. (p. 8). Washington, DC. 2003. http://www.loc.gov/cds/downloads/FRBR.PDF

Monday, February 24, 2014

The FRBR Groups

FRBR has three groups of entities, numbered 1-3. Each group, however, has its own set of characteristics that are very different from each other, so different that they really are different kinds of groups. These differences make it hard to speak of them in one breath.

One of the key things to know about the groups is that they aren't classes in the data modeling sense. Why they are therefore grouped at all is not clear, except perhaps it was a convenient way to speak of them. The IFLA FRBR Review Group maintains that the groups do not represent classes and that the ten (or eleven, with family) entities represent the highest organizational level recognized by FRBR. Unfortunately the treatment of them as groups throughout the document tends to contradict this statement. This just adds to the confusion about the meaning and nature of the groups.

Group 2

I'm going to take Group 2 first because it is the simplest. Group 2 is a group of "agents" or "actors" that perform actions in the bibliographic environment. The original entities of the group were person and corporate body, but family has been added through the work that was done on the Functional Requirements for Authority Data (FRAD). In most kinds of data modeling the members of Group 2 would be members of a class, and the class would have certain characteristics that define the kinds of things that could be members and their shared characteristics. For example, one could say that all members are people or groups of people, that they generally have names, they perform certain actions, etc. These characteristics would therefore not need to be defined separately for each member of the class, and definitions of members would only include characteristics unique to that member. Because no classes exist in FRBR, each Group 2 entity is described separately through its own attributes, although there is a fair amount of overlap between them.
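A small sketch of what the class-based alternative could look like follows. The class `Agent`, its fields, and the member-specific fields are my own illustration of the idea; FRBR itself defines no such class.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Characteristics shared by every member, defined once (hypothetical class)."""
    name: str
    roles: list = field(default_factory=list)

    def add_role(self, role, entity):
        # Every agent can act on a Group 1 entity in some role.
        self.roles.append((role, entity))

@dataclass
class Person(Agent):
    dates: str = ""             # only what is specific to a person

@dataclass
class CorporateBody(Agent):
    place: str = ""             # only what is specific to a corporate body

@dataclass
class Family(Agent):
    type_of_family: str = ""    # only what is specific to a family

austen = Person("Jane Austen", dates="1775-1817")
austen.add_role("creator", "Emma")
```

The name and the ability to perform a role are stated once on the class, and each member adds only what distinguishes it. In FRBR as written, the equivalent of `name` is instead repeated in the attribute list of each Group 2 entity.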

Note that each of the Group 2 entities stands alone with no dependencies on any other entities. (This matters when we get to Group 1).

Group 3

Group 3 is an odd grouping because it has a rather miscellaneous nature. The entities that are described as Group 3 are ones that are needed for subject description in bibliographic data: concept; object; event; place. Not much is said about them because FRBR, not unlike cataloging rules (AACR, ISBD), does not really address subject assignment. It isn't clear to me where these particular entities come from because they are not equal to the "facets" of LCSH (form, chronological and geographic). It would be interesting to know how this particular set came about, since the FRBR Study Group was looking beyond North American practice.

What makes Group 3 odd, though, is not its composition but that it is only a partial listing of the subject elements; the full set also includes all of the members of groups 1 and 2. So the actual meaning of group 3 is: all of the subject entities that are not in other groups.
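Stated as simple set arithmetic, using the entity names from the FRBR report (the variable and function names are mine), Group 3 is what remains of the subject entities once groups 1 and 2 are taken away:

```python
GROUP_1 = {"work", "expression", "manifestation", "item"}
GROUP_2 = {"person", "corporate body", "family"}
GROUP_3 = {"concept", "object", "event", "place"}

# Anything in any group can serve as the subject of a work.
ALL_SUBJECT_ENTITIES = GROUP_1 | GROUP_2 | GROUP_3

def group_3_by_subtraction():
    """Group 3 read as 'the subject entities that are not in other groups'."""
    return ALL_SUBJECT_ENTITIES - GROUP_1 - GROUP_2
```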

It remains to be seen what will happen to these entities when FRBR and the Functional Requirements for Subject Authority Data (FRSAD) are merged. FRSAD takes a 30,000 foot view of subjects, and essentially concludes that if you can call it a subject and give it an identity and a name, it's a subject. This aspect of the original FRBR study, which was specifically directed at the charge of defining the elements of a core bibliographic record, could change as the model becomes more generalized, which seems to be what is happening.

Group 1

All that I will say about Group 1 here is that it is a group that represents one thing divided into four levels of abstraction. Group 1 deserves its own post, rather than making this one overly long. That post will be next.

Wednesday, February 19, 2014

FRBR goals: entities, relations, and a core level record

The FRBR study was motivated by a 1990 international seminar on cataloging held in Stockholm. The charge to the study group was approved by the IFLA Standing Committee of the Section of Cataloguing in 1992. That document, called the Terms of Reference for a Study of the Functional Requirements for Bibliographic Records, stated:
Today the expectations and constraints facing bibliographic control are more pressing than ever. All libraries, including national bibliographic agencies, are operating under increasing budgetary constraints and increasing pressures to reduce cataloging costs through minimal level cataloging. [1]
Or, as Olivia Madison, the chair of the FRBR Study Group from 1991-1993 and 1995-1997, put it:
The Stockholm Seminar addressed the general question: "Can cataloging be considerably simplified?" [2]
The Standing Committee decided that consultants with particular skills in the area of cataloging were needed in order to approach the task, and three (later four) consultants were engaged. The primary charge to the consultants was:
1. Determine the full range of functions of the bibliographic record and then state the primary uses of the record as a whole.
This is at the very least a daunting task. However, the Terms of Reference gave the consultants some guidance about how to go about their work. The remaining tasks for the consultants were:
2. Develop a framework that identifies and clearly defines the full range of entities (e.g., work, texts, subjects, editions and authors) that are the subject of interest to users of a bibliographic record and the types of relationships (e.g. part/whole, derivative, and chronological) that may exist between those entities.
3. For each of the entities in the framework, identify and define the functions (e.g., to describe, to identify, to differentiate, to relate) that the bibliographic record is expected to perform.
4. Identify the key attributes (e.g., title, date, and size) of each entity or relationship that are required for each specific function to be performed. Attribute requirements should relate specifically to the media or format of the bibliographic item where applicable.
The notions of entities, relationships, and attributes don't appear in traditional cataloging theory; they come instead from the world of database design, and in particular relational database design. Because these concepts were expected to be unfamiliar to members of the committee and perhaps also the consultants, the Terms of Reference provides definitions, using as its source the 1984 book Data Analysis: the Key to Data Base Design, by Richard C. Perkinson. (Note, some of this is re-iterated in the FRBR final report, in the section on methodology, where four books are cited as sources of information on entity-relation methodology.)

Those were the tasks for the consultants, the selected experts who would do the analysis and present the report to the Study Group. The Study Group itself had this task:
5. For the National Libraries: for bibliographic records created by the national bibliographic agencies, recommend a basic level of functionality that relates specifically to the entities identified in the framework the functions that are relevant to each.
It appears to be this last charge that directly addressed the needs expressed in the Stockholm seminar: the need for a core level record that would help cataloging agencies reduce their costs while still serving users. I read the charges to the consultants as mainly providing a working methodology that would allow the consultants to focus their energies on what amounts to a general rethinking of cataloging theory and practice.

The Terms of Reference is a rather bare bones statement of what needs to be done, and it says little about the why of the study. According to Tillett's 1994 report [3], some of the concerns that came out of Stockholm were:
"the mounting costs of cataloging," the proliferation of new media, "exploding bibliographic universe," the need to economize in cataloging, and "the continuing pressures to adapt cataloguing practices and codes to the machine environment."
The FRBR document states the motivation as:
"The purpose of formulating recommendations for a basic level national bibliographic record was to address the need identified at the Stockholm Seminar for a core level standard that would allow national bibliographic agencies to reduce their cataloguing costs through the creation, as necessary, of less-than-full-level records, but at the same time ensure that all records produced by national bibliographic agencies met essential user needs." [4] p.2
At this point, it is worth asking: did the FRBR study indeed result in a "core level standard" that would reduce cataloging costs for national bibliographic agencies? It definitely did define a core level standard, although that aspect of the FRBR report is not often discussed. Chapter 7 of the FRBR document, BASIC REQUIREMENTS FOR NATIONAL BIBLIOGRAPHIC RECORDS, lists the "basic level of functionality" for library catalogs:
Find all manifestations embodying:
  • the works for which a given person or corporate body is responsible
  • the various expressions of a given work
  • works on a given subject
  • works in a given series
Find a particular manifestation:
  • when the name(s) of the person(s) and/or corporate body(ies) responsible for the work(s) embodied in the manifestation is (are) known
  • when the title of the manifestation is known
  • when the manifestation identifier is known
Identify a work
Identify an expression of a work
Identify a manifestation
Select a work
Select an expression
Select a manifestation
Obtain a manifestation
This of course looks quite similar to the goals of a catalog developed over a century ago by Charles Ammi Cutter:
Section 7.3 of the chapter lists the descriptive and organizing elements (headings) that should make up a core bibliographic record. This chapter should be a key element of the FRBR study results, yet it isn't often mentioned in discussions of FRBR, which tend to focus on the ten (or eleven, if you add family) entities and their primary relationships to each other (is realization of, manifests, etc.), and the four user tasks (find, identify, select, obtain).

While most people can hold forth on the FRBR entities, few can discourse on this outcome of the report, which is a basic level national bibliographic record. Admittedly, the report itself does not emphasize this information. The elements of the basic level record use the terminology of ISBD, not of FRBR, which makes it difficult to see the direct connection with the rest of the report. (I haven't had the fortitude to work through the appendix comparing FRBR attributes with ISBD, GARE and GSARE but I assume that a matching was done. However, this does make the recommended core record hard to read in the context of FRBR.) For example, there are core descriptive elements relating to uniform titles ("addition to uniform title - numeric designation (music)") yet uniform titles are not mentioned among the FRBR attributes and the term "uniform title" is not included in the index.

It is not clear to me what has happened to the goal to provide a solution for cash-strapped cataloging agencies. The E-R model, which in my reading was offered as a methodology to support the analysis that needed to be done, has become what people think of as FRBR. If the FRBR Review Group, which is currently maintaining the results of the Study Group's work, does have activities that are aimed at helping national libraries do their work more effectively while saving them cataloging time, it isn't nearly as visible as the work being done to create a definition of bibliographic data that follows entity-relation modeling. In any case, I, for one, was actually surprised to discover Chapter 7 in my copy of the FRBR Study Group report.

[1] Terms of Reference for a Study of the Functional Requirements for Bibliographic Records. (1992). Available in: Le Boeuf, P. (2005). Functional Requirements for Bibliographic Records (FRBR): Hype or Cure-All?. New York: Haworth Information Press.
[2] Madison, Olivia M.A. The origins of the IFLA study on Functional Requirements for Bibliographic Records. In: LE BŒUF, Patrick. Ed. Functional Requirements for Bibliographic Records (FRBR): Hype, or Cure-All? [printed text]. Binghamton, NY: the Haworth Press, 2005.
[3] Tillett, B. B. (1994). IFLA Study on the Functional Requirements of Bibliographic Records: Theoretical and Practical Foundations, (April), 1–5.
[4] IFLA Study Group on the Functional Requirements for Bibliographic Records. (2009). Functional Requirements for Bibliographic Records. Retrieved from http://archive.ifla.org/VII/s13/frbr/frbr_2008.pdf

Friday, February 14, 2014

FRBR as a conceptual model

(I have been working on a very long and very detailed analysis of FRBR, probably more than anyone wants to know. But some parts of that analysis might be generally helpful in understanding FRBR, so I'm going to "leak" those ideas out through this blog.)

The FRBR document, in its section on Methodology, gives the reasoning behind the use of entity-relation modeling technique:
The methodology used in this study is based on an entity analysis technique that is used in the development of conceptual models for relational database systems. Although the study is not intended to serve directly as a basis for the design of bibliographic databases, the technique was chosen as the basis for the methodology because it provides a structured approach to the analysis of data requirements that facilitates the processes of definition and delineation that were set out in the terms of reference for the study.
E-R modeling is a multi-step technique that begins with a high-level conceptual analysis of the data universe that is being considered. Quoting the FRBR document again:
The first step in the entity analysis technique is to isolate the key objects that are of interest to users of information in a particular domain. These objects of interest or entities are defined at as high a level as possible. That is to say that the analysis first focuses attention not on individual data but on the "things" the data describe. Each of the entities defined for the model, therefore, serves as the focal point for a cluster of data. An entity diagram for a personnel information system, for example, would likely identify "employee" as one entity that would be of interest to the users of such a system.

This is a very good description of conceptual modeling. So it is either puzzling or disturbing that most readings of FRBR do not recognize this difference between a conceptual model and either a record format or a logical model. In part this is because few have done a close reading of the FRBR document, and unfortunately it is easy to view the diagrams there as statements of data structure rather than high level concepts about bibliographic data. (It's not surprising that people get their information about FRBR from the diagrams, rather than the text. There are three very simple diagrams in the document, and 142 pages of text. Yet even if a picture is worth a thousand words, those three are not equal to the text.)

One of the main assumptions about FRBR is that the entities listed there should be directly translated into records in any bibliographic data design that intends to implement FRBR. For example, there is much criticism of BIBFRAME for presenting a two-entity bibliographic model instead of the four entities of FRBR. This reflects the mistaken idea that each Group 1 entity must be a record in whatever future bibliographic formats are developed. Because these are entities in a conceptual model, there is absolutely no direct transfer from conceptual entities to data records. How best to create a record format that carries the concepts is something that would be arrived at after a further and more detailed technical analysis. In fact, the development of a record format might not seem to be a direct descendant of the E-R model, since the E-R modeling technique has a bias toward the structure of relational database management systems, not records, and the FRBR Study Group was not intending its work to be translated directly to a database design.

There are innumerable ways that one could implement a data design that fulfills the conceptual view of FRBR. In E-R modeling there are subsequent steps that build on the conceptual design to develop it into an actionable data design. These steps are actually more detailed and imposing than the conceptual design, which is often used to bridge the knowledge gap between operational staff and the technical staff that must create a working system. The step after the conceptual model is usually the logical design step that completes the list of attributes, and defines the types of data values that will be stored in the database tables (text, date, currency) and the cardinality of each data element (mandatory, optional, repeatable, etc.). It then normalizes the data to remove any duplication of data within the entire database. It also resolves relationships between data tables so that one-to-many and many-to-many relationships are correctly implemented for the applications that will make use of them. Although this is couched in terms of database design, an equally rigorous step would be needed to move from a conceptual view to a design for a format that could be used in library systems and for data exchange.
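To give a feel for what the logical design step adds, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and sample rows are my own assumptions; this is one of the innumerable possible designs, not a design prescribed by FRBR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # ask SQLite to enforce the links
conn.executescript("""
CREATE TABLE work (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL                    -- mandatory text value
);
CREATE TABLE expression (
    id       INTEGER PRIMARY KEY,
    work_id  INTEGER NOT NULL REFERENCES work(id),  -- one and only one work
    form     TEXT NOT NULL,
    language TEXT                          -- optional
);
CREATE TABLE manifestation (
    id        INTEGER PRIMARY KEY,
    publisher TEXT,
    pub_year  INTEGER
);
-- Junction table that resolves the many-to-many relationship.
CREATE TABLE embodiment (
    manifestation_id INTEGER NOT NULL REFERENCES manifestation(id),
    expression_id    INTEGER NOT NULL REFERENCES expression(id),
    PRIMARY KEY (manifestation_id, expression_id)
);
""")

# Invented sample data.
conn.execute("INSERT INTO work VALUES (1, 'Hamlet')")
conn.executemany(
    "INSERT INTO expression VALUES (?, ?, ?, ?)",
    [(1, 1, "text", "English"), (2, 1, "text", "German")],
)
conn.executemany(
    "INSERT INTO manifestation VALUES (?, ?, ?)",
    [(1, "Publisher A", 1992), (2, "Publisher B", 2003)],
)
conn.executemany("INSERT INTO embodiment VALUES (?, ?)", [(1, 1), (2, 1), (2, 2)])

def manifestations_of_work(title):
    """Follow work -> expression -> embodiment -> manifestation."""
    return conn.execute(
        """
        SELECT DISTINCT m.id, m.publisher
        FROM work w
        JOIN expression e     ON e.work_id = w.id
        JOIN embodiment em    ON em.expression_id = e.id
        JOIN manifestation m  ON m.id = em.manifestation_id
        WHERE w.title = ?
        ORDER BY m.id
        """,
        (title,),
    ).fetchall()
```

Note what had to be decided that the conceptual model leaves open: data types, which values are mandatory, and a fourth table that exists only to carry the many-to-many relationship. The work title is stored once, and a "record" appears only as the output of a query.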

As an illustration, here is a logical design for the bibliographic system MusicBrainz that stores information about recorded music. It has many of the same concepts as FRBR (works, performers, variant expressions), and must resolve the complex relationships between albums, songs, and performances (not unlike what a music library catalog must do):

With perhaps some difference in details you could say that this implements the concepts of FRBR. Still, this is a database design, and not a record format. For many databases, there is no single record that represents all of the stored data. Business databases are generally a combination of data from numerous departments and processes, and they can often output many different data combinations as needed.

It does say something about the state of technology awareness in the library profession that once a presumably successful conceptual model was developed there was no second step to make that model operational. What was the ultimate goal of FRBR, and did it fulfill that goal? Look for another post soon on that topic.

Thursday, November 14, 2013

It's FAIR!

"In my view, Google Books provides significant public benefits. It advances the progress of the arts and sciences, while maintaining respectful consideration for the rights of authors and other creative individuals, and without adversely impacting the rights of copyright holders. It has become an invaluable research tool that permits students, teachers, librarians, and others to more efficiently identify and locate books. It has given scholars the ability, for the first time, to conduct full-text searches of tens of millions of books. It preserves books, in particular out-of-print and old books that have been forgotten in the bowels of libraries, and it gives them new life. It facilitates access to books for print-disabled and remote or underserved populations. It generates new audiences and creates new sources of income for authors and publishers. Indeed, all society benefits." p. 26
With that statement, Judge Denny Chin has ruled (PDF) that Google's digitization of books from libraries is a fair use. And a very long saga ends.

Google was first brought to court in 2005 by the Authors Guild in a copyright infringement suit for its mass digitization of library holdings. Since then the matter has gone back to the court a number of times. Most significantly, Google, authors, and publishers developed two complex proposed settlements that were, however, so fraught with problems that the Department of Justice weighed in. Finally, the publishers bowed out and the original Authors Guild suit was revived. At that point, the question became: Is Google's digitization of books for the purposes of indexing (and showing snippets as search results) fair use?

Of course, much happened between 2005 and 2013. One important thing that happened was the development of HathiTrust, the digital repository where libraries can store the digital copies that they received from Google of their own books. The same Authors Guild sued HathiTrust for copyright infringement, but Judge Baer in that case decided for fair use.

I cannot over-emphasize either the role of libraries in this case or the support that both judges expressed for libraries and for their promotion of "progress and the useful arts." Chin refers frequently to the amicus brief (PDF) presented by the American Library Association, as well as the conclusions in the HathiTrust case. Both judges clearly admire the mission of libraries, and it seems clear to me that the educational use of the materials by libraries was seen to offset the for-profit use by Google. In fact, Judge Chin reverses the roles of Google and the libraries when he says:
"Google provides the libraries with the technological means to make digital copies of books that they already own. The purpose of the library copies is to advance the libraries' lawful uses of the digitized books consistent with the copyright law." p. 26
In those terms, Google has simply helped libraries do what they do, better. Google's digitization of the library books is thus a public service.
"Google Books helps to preserve books and give them new life. Older books, many of which are out-of-print books that are falling apart buried in library stacks, are being scanned and saved." p. 12
Note that Google and the libraries (in HathiTrust) are exceedingly careful to stay within the letter of the law. Google's snippet display algorithm is rococo in design, making it literally impossible to reconstruct a book from the snippets it displays. So much so that it would probably take less time to re-scan the book at home on your page-at-a-time desktop scanner.

The full impact of this ruling is impossible (for me) to predict, but there are many among us who are breathing a great sigh of relief today. This opens the door for us to rethink digital scholarship based on materials produced before information was in digital form. 

I do have a wishlist, however, and at the top of that is for us to turn our attention to making the digitized texts even more useful by turning that uncorrected OCR into a more faithful reproduction of the original book. While large-scale linguistic studies may be valid in spite of a small percentage of errors, the use of the digitized materials for reading, in the case of those works in the public domain, and for listening, in the case of works made available to VIPs (visually impaired persons), is greatly hampered by the number and kinds of errors that result. In a future post I will give the results of a short study that I have done in that area.

See all my posts on Google Books

Friday, October 25, 2013

Instant WayBack URL

Last night I attended festivities at the Internet Archive where they made a number of announcements about projects and improvements. One that particularly struck me was the ability to push a page to the WayBack Machine and instantly get a permanent WayBack URL for that page. This is significant in a number of ways but the main advantages I see are:
  1. putting permalinks in your documents rather than URLs that can break
  2. linking to a particular version of a document when citing
You will not want to use this technique if you are intending to link to, for example, a general home page where you want your link always to go to the current version of that page. But if you are quoting something, or linking to a page that you think has a limited lifetime, this ability will make a huge difference.

When you go to the WayBack machine (whose home page has changed considerably) you will see this option:

Once you provide the URL, the system echoes back the WayBack machine URL for that page at that moment in time:

You can also view the page on the WayBack machine, to make sure you captured the right one:
The page is available through the URL immediately, and will be available through the regular WayBack machine index within hours. This has great implications for scholarship and for news reporting. Note that the WayBack Machine will not capture pages that are closed to crawlers, so if you are on a commercial site, this probably will not work. I'm still very enthused about it.
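
As a rough illustration of what such a permanent link looks like, here is a minimal Python sketch. It assumes the commonly seen WayBack URL shape of a fixed prefix, a 14-digit UTC timestamp, and the original URL; that pattern is not spelled out in the announcement, and the function name `wayback_url` and the example page are made up for this sketch. It only builds the string and does not contact the Internet Archive.

```python
from datetime import datetime, timezone

# Assumed URL pattern (not taken from the announcement):
#   https://web.archive.org/web/<YYYYMMDDhhmmss>/<original URL>
WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(page_url: str, captured_at: datetime) -> str:
    """Build a WayBack-style link for page_url as captured at captured_at.

    captured_at must be timezone-aware; it is converted to UTC before
    being formatted as a 14-digit timestamp.
    """
    if captured_at.tzinfo is None:
        raise ValueError("captured_at must be timezone-aware")
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}{stamp}/{page_url}"


# Hypothetical page and capture time, for illustration only.
example = wayback_url(
    "http://example.org/page.html",
    datetime(2013, 10, 25, 12, 0, 0, tzinfo=timezone.utc),
)
print(example)
```

A citation would then carry this timestamped link instead of the live URL, so the reader sees the page as it was when it was quoted.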

Monday, October 14, 2013

Who uses Dublin Core? - the original 15

The original 15 Dublin Core elements are included in the Dublin Core Metadata Terms using the namespace http://purl.org/dc/elements/1.1/. There is an "updated" version of each of the original terms in the namespace http://purl.org/dc/terms/ (dcterms). The difference is that the /dc/terms/ namespace includes formal domains and ranges, in conformance with linked data standards; the original 15 elements in the /dc/elements/1.1/ namespace have no domain or range constraints defined. This means that the original 15, often given the namespace prefix of "dc:" or "dce:", are compatible with legacy uses of the Dublin Core elements.
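
The relationship between the two namespaces can be sketched in a few lines of Python. This is only an illustration: the helper name `to_dcterms` is hypothetical, the namespace URIs are written with their trailing slash, and the mapping relies on each of the original 15 elements having a same-named term in dcterms.

```python
# Namespace URIs for the two "generations" of Dublin Core.
DCE_NS = "http://purl.org/dc/elements/1.1/"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Local names of the original 15 elements.
ORIGINAL_15 = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_dcterms(uri: str):
    """Return the dcterms URI with the same local name as a dce element.

    Returns None when uri is not one of the original 15 dce elements.
    """
    if uri.startswith(DCE_NS):
        local_name = uri[len(DCE_NS):]
        if local_name in ORIGINAL_15:
            return DCTERMS_NS + local_name
    return None


print(to_dcterms(DCE_NS + "creator"))
```

The mapping only swaps the namespace. It does not check whether existing data satisfies the domains and ranges that the dcterms version declares, which is exactly where legacy data may not fit.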

In the first post of this series, I showed that the most used terms are from the dcterms vocabulary, followed immediately by a cluster of terms from the dce namespace. In addition, the majority of the top dcterms are the linked data equivalents of the dce terms, thus confirming the "coreness" of the original Dublin Core 15.

From this explanation one might expect that the uses of dce in the wilds of linked data would be limited to legacy data. That does not, however, seem to be the case. Out of a total of 125 datasets from the Linked Open Vocabularies, nearly half (60) use both the linked data vocabulary (dcterms) and the dce terms. Of the top five datasets with the greatest number of uses of dce, only one, "Wikipedia 3," does not also use the dcterms.

Europeana Linked Open Data 
Wikipedia 3 
Linked Open Data Camera dei deputati 
B3Kat - Library Union Catalogues of Bavaria, Berlin and Brandenburg 
Yovisto - academic video search 
There are reasons why datasets may use both "generations" of the Dublin Core vocabulary. One is that their data contains a mix of legacy metadata and linked data, either because the dataset has grown over time, or because the set combines data from different sources. Another is that there may be situations in which the dcterms use of domains and ranges is too restrictive for the needs of the data creators.
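
To make "uses both" concrete, here is a small Python sketch of how one might tally the two vocabularies in a list of property URIs drawn from a dataset. It is not how the LOV figures were produced; the function name `count_dc_usage` and the sample list are invented for illustration, and the namespace URIs are written with their trailing slash.

```python
from collections import Counter

DCE_NS = "http://purl.org/dc/elements/1.1/"
DCTERMS_NS = "http://purl.org/dc/terms/"


def count_dc_usage(predicates):
    """Count how many property URIs fall in each Dublin Core namespace."""
    counts = Counter()
    for uri in predicates:
        if uri.startswith(DCE_NS):
            counts["dce"] += 1
        elif uri.startswith(DCTERMS_NS):
            counts["dcterms"] += 1
    return counts


# Made-up sample of property URIs, for illustration only.
sample = [
    DCE_NS + "title",
    DCE_NS + "creator",
    DCTERMS_NS + "title",
    "http://xmlns.com/foaf/0.1/name",
]

usage = count_dc_usage(sample)
uses_both = usage["dce"] > 0 and usage["dcterms"] > 0
print(dict(usage), uses_both)
```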

The LOV dataset of dce usage has over 24 million uses (compared to 192 million uses of dcterms). Library and bibliographic data is again by far the majority of the use, although it is rivaled by government data, in part because of the over 4 million uses contributed by the Italian Camera dei deputati, which also uses dcterms but to a lesser extent. In fact, government data is overall a strong contender in the dce space.

My overall conclusion from looking at this data is that Dublin Core is used widely for bibliographic and non-bibliographic data; that there is a new "core" based on usage that overlaps greatly with the old core; that some dcterms elements are hardly used at all in these datasets; and finally that both the linked data dcterms and the legacy dce elements show themselves to be useful, even in the linked data environment.

Related posts: