Coyle's InFormation: 06/01/2013

Saturday, June 29, 2013

FRBR and schema.org

The FRBR structure for what it calls the Group 1 entities (Work, Expression, Manifestation, and Item, hereafter written as WEMI) presents quite a few problems for data modeling. Of the many issues this brings up, there is the fact that this division is not universally recognized, not even in library data, and definitely is not recognized outside of libraries. This has particular impact for library data as part of the linked data space, where a primary goal is interlinking with data from diverse resources. It is unlikely that online bookstores or academic citations will begin to use the WEMI structure.

One area where library bibliographic data and bibliographic data from other sources may mingle is in schema.org markup in web pages. Schema.org already has a basic class that can be used for bibliographic data, called "CreativeWork." CreativeWork contains the common elements for this type of description, like author, title, publisher, pages, subject, etc. Problems arise when trying to express either WEMI or the simplified BIBFRAME Work and Instance (hereafter bf:Work, bf:Instance) in this model. CreativeWork is a unified model that includes all descriptive elements in a single set; BIBFRAME separates those elements into two entities, and each entity contains only a defined set of the descriptive elements. Thus, where CreativeWork will have information for author, title, publisher, pages, subject, in BIBFRAME author and subject must be described in the bf:Work entity, and title, publisher and pages in the bf:Instance entity. Between MARC, FRBR, BIBFRAME, and schema.org, a full bibliographic description may require one, two, or four separate entities.
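
To make the contrast concrete, here is a minimal Python sketch. The field names and sample values are assumptions made for illustration, not actual schema.org or BIBFRAME property names; the split simply follows the division described above (author and subject on bf:Work; title, publisher, and pages on bf:Instance).

```python
# Illustrative only: plain dicts stand in for a schema.org CreativeWork
# description and for a BIBFRAME Work/Instance pair. Field names and
# sample values are assumptions made for this sketch.

WORK_FIELDS = {"author", "subject"}                # described on bf:Work
INSTANCE_FIELDS = {"title", "publisher", "pages"}  # described on bf:Instance


def split_description(creative_work):
    """Split one flat description into a (work, instance) pair."""
    work = {k: v for k, v in creative_work.items() if k in WORK_FIELDS}
    instance = {k: v for k, v in creative_work.items() if k in INSTANCE_FIELDS}
    return work, instance


def merge_description(work, instance):
    """Recombine the two entities into one flat description."""
    return {**work, **instance}


creative_work = {
    "author": "Example Author",
    "title": "Example Title",
    "publisher": "Example Press",
    "pages": 200,
    "subject": "Example subject",
}

work, instance = split_description(creative_work)
print(work)      # author and subject only
print(instance)  # title, publisher, and pages only
```

Neither half is a full description on its own; a consumer expecting the one-entity model has to merge both before it has the equivalent of a single CreativeWork.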

[Figure: comparison of MARC, FRBR, BIBFRAME, and CreativeWork]

The OCLC report on BIBFRAME and schema.org proposes that one could use CreativeWork for different FRBR (or presumably BIBFRAME) entities, making the determination based on what fields are present:

"In this scheme, it would be possible to say that when only titles, subjects, and creators are mentioned, the description for a Schema:CreativeWork refers to a FRBR Work; and when copyright dates and genres are present, the description is equivalent to a FRBR Expression." (p. 14)

While that makes sense from a pure logic point of view, and would probably work in a library database, it has problems within the web and linked data contexts of schema.org. I should note, before going on, that schema.org is metadata markup for any web site, and CreativeWork will be used for books, films, music, art, and other forms of creation by anyone and everyone on the web. This is not a library-specific standard.

First, there are many sites that have a search response page with limited information about the item, requiring the user to click through for details. A search results page for books on Amazon or eBay gives only the author and title, but does not represent the Work; it merely omits the full data in order to fit more results onto the page. The lack of information on one web page therefore does not mean that the description there is complete.
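
To see how a field-presence rule of the kind quoted from the OCLC report would misfire on such a page, here is a small Python sketch. The property names, the sample data, and the function `infer_frbr_entity` are hypothetical, written for this illustration; they are not taken from the report or from schema.org.

```python
# Sketch of a field-presence rule: guess the FRBR entity from which
# properties happen to be present. All names here are assumptions.

WORK_ONLY = {"title", "subject", "creator"}
EXPRESSION_MARKERS = {"copyright_date", "genre"}


def infer_frbr_entity(properties):
    """Guess a FRBR entity from the set of properties present."""
    present = set(properties)
    if present & EXPRESSION_MARKERS:
        return "Expression"
    if present and present <= WORK_ONLY:
        return "Work"
    return "undetermined"


# The same book as it might appear on two pages of one site.
search_result = {"title": "Example Title", "creator": "Example Author"}
detail_page = {
    **search_result,
    "copyright_date": "2013",
    "genre": "Fiction",
    "publisher": "Example Press",
}

print(infer_frbr_entity(search_result))  # Work
print(infer_frbr_entity(detail_page))    # Expression
```

The abbreviated listing is classified as a Work only because the page left out details, not because the site intended to describe a Work.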

Second, there is no "record" in schema.org, merely a number of coded statements with values within a web page. Any web page can contain information about any number of "things" and information about those things may be placed anywhere on the page, possibly far from each other and not coded as a single unit. It may not be possible to know how complete a description is.

Third, web site owners can opt to mark up only part of their data. In schema.org markup that I have encountered on commercial sites, markup reflects the owner's view. For example, Google (one of the originators of schema.org) does not mark up the bibliographic data in its Books pages, but instead emphasizes user ratings, images, and subjects. (This shows the markup using the Google rich snippet testing tool.) In comparison, the extracted schema.org elements for an IMDB page are much more detailed, an indication that it considers itself an information site more than a sales site.

Finally, although this is somewhat beyond schema.org, should the data in web pages be incorporated into the linked data space, it will go there as individual triples that are part of a huge graph of data. That graph is theoretically limitless and makes use of a principle called the "open world assumption." In an open world it is not possible to base your assumptions on what is missing from the graph. The open world does not have a concept of completeness because there is always the possibility that there is more information than what you are seeing at any given moment in time.
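
A toy Python sketch of why absence is not evidence in an open world. The URI, the predicate names (used loosely here), and the values are all hypothetical; a graph is modeled as a plain set of (subject, predicate, object) tuples rather than with any RDF library.

```python
# A graph is a set of (subject, predicate, object) triples.
# The URI, predicates, and values are made up for this sketch.

BOOK = "http://example.org/book/1"

# Statements found on one page.
page_a = {
    (BOOK, "name", "Example Title"),
    (BOOK, "author", "Example Author"),
}

# Statements about the same thing found elsewhere.
page_b = {
    (BOOK, "publisher", "Example Press"),
    (BOOK, "numberOfPages", 200),
}


def predicates_for(graph, subject):
    """Return the predicates currently asserted for a subject."""
    return {p for s, p, o in graph if s == subject}


def lacks(graph, subject, predicate):
    """Closed-world style test: treats 'not stated here' as 'not true'."""
    return predicate not in predicates_for(graph, subject)


graph = set(page_a)
before = lacks(graph, BOOK, "publisher")  # True: no publisher seen yet

graph |= page_b                           # more triples arrive
after = lacks(graph, BOOK, "publisher")   # False: the earlier answer no longer holds

print(before, after)
```

Any conclusion drawn from `lacks` is only as stable as the current snapshot of the graph, which is why a classification based on missing elements cannot be relied on.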

These may not be the only arguments against the use of CreativeWork for different FRBR or BIBFRAME entities, but in my mind they are sufficient to make the case: if it is desirable to encode FRBR or BIBFRAME entities in schema.org, they must be represented by different schema.org classes and cannot be inferred from the data elements present in CreativeWork.

Before I end, let me make clear that I do not favor an imposition of FRBR-like separations of bibliographic data on the linked data world. Even the BIBFRAME two-part bibliographic description will have problems interacting with the one-entity model that is used outside of libraries. I do think that we can find a way to talk virtually about works without stripping such key elements as authors and subjects from the description of the package that carries the content. That package is, after all, what I hold in my hand when I read something, and it is a whole, with author, title, subjects, pages, binding, publisher, etc. That is, however, a topic for another post.

Wednesday, June 19, 2013

Spying, the old-fashioned way

While the news debates the NSA's PRISM program, a massive collection of data points of electronic communication, the more human side of spying is being pushed to the background. Yet if you are fearful of privacy invasion, there is nothing more chilling than a reading of FBI files with accounts of informants and statements about "Communist leanings" and "pro-Russian" attitudes. You can get a taste of this in the FBI Vault, a public file of declassified documents, most of which are revealed upon the death of the target. The A-Z list (which appears to be a selection, as other names can be found with searches) is full of famous names, from Al Capone to Al Gore, from George Burns to Marilyn Monroe, and from Helen Keller to Leon Trotsky. (The list is only marginally alphabetical, by first name, which is almost as shocking as the contents of the files.)

This is old-fashioned stuff for the most part. Type-written letters, lots of scribbled initials, and whole chunks of documents blacked out with what must be a special FBI-invented marker.

On a more modern note, who could resist adding their favorite FOIA file to Facebook?

Not everyone in the Vault is a potential "enemy of the state." Some are there because they were threatened, and the FBI was doing its "protect and serve" job. But coming to the attention of the FBI is often not a good thing. In the case of Ray Bradbury, an informant tipped off the Bureau that Bradbury may have attended a writers' meeting in Cuba. This set off an investigation.

"Investigation conducted in the neighborhood of 10265 Cheviot Drive, Los Angeles, California, disclosed that a RAY DOUGLAS BRADBURY, date of birth 8/22/20, Waukegan, Illinois, resides at this address. He is a known writer and Los Angeles indices have numerous references on RAY DOUGLAS BRADBURY."

The report is itself a boon for any biographers (and Wikipedians). It gives his family history (back to 1630), location and occupation of all living family members, information about his spouse (including the location of the church where they were married), and of course his yearly income. Given that this report is from 1959, you can just imagine how much more information the FBI would have today. There are whole pages that would do a reference librarian proud: a list of his professional memberships, a complete bibliography, film credits. There is even some level of literary analysis:

"... BRADBURY was probably sympathetic with certain pro-Communist elements in the [Writers Guild of America, West]... [Informant] stated it has been his observation that some of the writers suspected of having Communist backgrounds have been writing in the field of science fiction and it appears that science fiction may be a lucrative field for the introduction of Communist ideologies."

Admittedly, this was high "red scare" time still, and Bradbury was working in and around Hollywood. However, the informant seems to have been unfamiliar with the work of L. Ron Hubbard.

Every file here has some gems worth reading. And don't forget to check the category "Unexplained Phenomenon".