Sunday, October 19, 2014

schema.org - where it works

In the many talks about schema.org, it seems that one topic that isn't covered, or isn't covered sufficiently, is "where do you do it?" That is, where does it fit into your data flow? I'm going to give a simple, typical example. Your actual situation may vary, but I think this will help you figure out your own case.

The typical situation is that you have a database with your data. Searches go against that database, the results are extracted, a program formats these results into a web page, and the page is sent to the screen. Let's say that your database has data about authors, titles and dates. These are stored in your database in a way that you know which is which. A search is done, and let's say that the results of the search are:
author:  Williams, R
title: History of the industrial sewing machine
date: 1996
This is where you are in your data flow:

The next thing that happens (and remember, I'm speaking very generally) is that the results then are fed into a program that formats them into HTML, probably within a template that has all your headers, footers, sidebars and branding and sends the data to the browser. The flow now looks like

Let's say that you will display this as a citation, that looks like:
Williams, R. History of the industrial sewing machine. 1996.
Without any fancy formatting, the HTML for this is:
<p>Williams, R. History of the industrial sewing machine. 1996.</p>
Now we can see the problem that schema.org is designed to fix. You started with an author, a title and a date, but what you are showing to the world is an undifferentiated string of characters. You have lost all the information about what these represent. To a machine, this is just another of many bazillions of paragraphs on the web. Even if you format your data like this:
<p>Author: Williams, R.</p>
<p>Title: History of the industrial sewing machine</p>
<p>Date: 1996</p>
What a machine sees is:
<p>blah: blah</p>
<p>blah: blah</p>
<p>blah: blah</p>  
What we want is for the program that is formatting the HTML to also include some metadata from schema.org that retains the meaning of the data you are putting on the screen. So rather than emitting only HTML formatting, it will also add markup from schema.org. Schema.org has metadata elements for many different types of data. Using our example, let's say that this is a book, and here's how you could mark that up in schema.org:
<div vocab="http://schema.org/">
  <div typeof="Book">
    <p>
      <span property="author">Williams, R.</span> <span property="name">History of the industrial sewing machine</span>. <span property="datePublished">1996</span>.
    </p>
  </div>
</div>
Again, this is a very simple example, but when we test this code in the Google Rich Snippet tool, we can see that even this very simple example has added rich information that a search engine can make use of:
To see a more complex example, this is what Dan Scott and I have done to enrich the files of the Bryn Mawr Classical Reviews.

The review as seen in a browser (includes schema.org markup)

The review as seen by a tool that reads the structured schema.org data.

From these you can see a couple of things. The first is that the schema.org markup does not change how your pages look to a user viewing your data in a browser. The second is that hidden behind that simple page is a wealth of rich information that was not visible before.
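Going back to the simple citation example, here is a minimal sketch of what the formatting step might look like once it adds the schema.org markup. This is only an illustration, not anyone's actual system: the function name `format_citation` and the shape of the `record` dictionary are my own assumptions, and a real program would do this inside its page template.

```python
from html import escape


def format_citation(record):
    """Render one search result as a citation with schema.org RDFa markup.

    `record` is a hypothetical dict with "author", "title" and "date" keys,
    standing in for whatever your database search returns.
    """
    # Escape the values so characters like & or < cannot break the HTML.
    author = escape(record["author"])
    title = escape(record["title"])
    date = escape(record["date"])
    return (
        '<div vocab="http://schema.org/">\n'
        '  <div typeof="Book">\n'
        '    <p>\n'
        # The citation style puts a period after the author's initial.
        f'      <span property="author">{author}.</span> '
        f'<span property="name">{title}</span>. '
        f'<span property="datePublished">{date}</span>.\n'
        '    </p>\n'
        '  </div>\n'
        '</div>'
    )


record = {
    "author": "Williams, R",
    "title": "History of the industrial sewing machine",
    "date": "1996",
}
page = format_citation(record)
print(page)
```

The point is that the program still knows which value is the author and which is the title at the moment it writes the HTML, so that is the natural place to add the markup.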

Now you are probably wondering: well, what's that going to do for me? Who will use it? At the moment, the users of this data are the search engines, and they use the data to display all of that additional information that you see under a link:


In this snippet, the information about stars, ratings, type of film and audience comes from schema.org mark-up on the page.

Because the data is there, many of us think that other users and uses will evolve. The reverse of that is that, of course, if the information isn't there then those as yet undeveloped possibilities cannot happen.
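To give a feel for how little it takes for some other program to pick up this data, here is a rough sketch that reads the property values back out of the simple book markup shown earlier. It uses only Python's standard `html.parser` module. The class name `PropertyExtractor` is my own invention, and this is a deliberate simplification: it handles only flat, un-nested properties like the ones in the example and is in no way a full RDFa parser.

```python
from html.parser import HTMLParser


class PropertyExtractor(HTMLParser):
    """Collect typeof/property values from simple, un-nested RDFa markup."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._current = None  # name of the property whose text we are reading

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "typeof" in attrs:
            self.item_type = attrs["typeof"]
        if "property" in attrs:
            self._current = attrs["property"]
            self.properties[self._current] = ""

    def handle_data(self, data):
        # Text outside a property element (spaces, periods) is ignored.
        if self._current is not None:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        # Assumes properties are not nested, so any end tag closes the current one.
        self._current = None


markup = """
<div vocab="http://schema.org/">
  <div typeof="Book">
    <p>
      <span property="author">Williams, R.</span> <span property="name">History of the industrial sewing machine</span>. <span property="datePublished">1996</span>.
    </p>
  </div>
</div>
"""

extractor = PropertyExtractor()
extractor.feed(markup)
print(extractor.item_type)
print(extractor.properties)
```

Run against the plain `<p>` version of the citation, the same program finds nothing at all, which is the "blah: blah" problem in a nutshell.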



Wednesday, October 01, 2014

This is what sexism looks like

[Note to readers: sick and tired of it all, I am going to report these "incidents" publicly because I just can't hack it anymore.]

I was in a meeting yesterday about RDF and application profiles, in which I made some comments, and was told by the co-chair: "we don't have time for that now", and the meeting went on.

Today, a man who was not in the meeting but who listened to the audio sent an email that said:
"I agree with Karen, if I correctly understood her point, that this is "dangerous territory".  On the call, that discussion was postponed for a later date, but I look forward to having that discussion as soon as possible because I think it is fundamental."
And he went on to talk about the issue, how important it is, and at one point referred to it as "The requirement is that a constraint language not replace (or "hijack") the original semantics of properties used in the data."

The co-chair (I am the other co-chair, although reconsidering, as you may imagine) replied:
"The requirement of not hijacking existing formal specification languages for expressing constraints that rely on different semantics has not been raised yet."
"Has not been raised?!" The email quoting me stated that I had raised it the very day before. But an important issue is "not raised" until a man brings it up. This in spite of the fact that the email quoting me made it clear that my statement during the meeting had indeed raised this issue.

Later, this co-chair posted a link to a W3C document in an email to me (on list) and stated:
"I'm going on holidays so won't have time to explain you, but I could, in theory (I've been trained to understand that formal stuff, a while ago)"
That is so f*cking condescending. This happened after I quoted from W3C documents to support my argument, and I believe I had a good point.

So, in case you haven't experienced it, or haven't recognized it happening around you, this is what sexism looks like. It looks like dismissing what women say, but taking the same argument seriously if a man says it, and it looks like purposely demeaning a woman by suggesting that she can't understand things without the help of a man.

I can't tell you how many times I have been subjected to this kind of behavior, and I'm sure that some of you know how weary I am of not being treated as an equal no matter how equal I really am.

Quiet no more, friends. Quiet no more.

(I want to thank everyone who has given me support and acknowledgment, either publicly or privately. It makes a huge difference.) 

Some links about "'Splaining"
http://scienceblogs.com/thusspakezuska/2010/01/25/you-may-be-a-mansplainer-if/
http://geekfeminism.wikia.com/wiki/Splaining

Tuesday, September 23, 2014

The book you scroll

I was traveling in Italy where I spend a lot of time in bookstores. I'm looking not only for books to read, but to discover new authors, since Italian bookstores are filled with translations of authors that I rarely see in the few bookstores remaining in my home town of Berkeley, CA. While there I came across something that I find fascinating: the flipback book. These books are small - the one I picked up is about 4 3/4" x 3 1/4". It feels like a good-sized package of post-it notes in your hand.

From the outside, other than its size, the book looks "normal," although the cover design is in landscape rather than portrait position.

The surprise is when you open the book. The first thing you notice is that you read the book top-to-bottom across two pages. It's almost like scrolling on a web page, because you move the pages up, not across.

The other thing is that they are incredibly compact. The paper is thin, and some of the books contained entire trilogies, although they were only about 2 to 2.5 inches thick.

Because there is no gutter between the two pages, you essentially get a quantity of text that is equal to what you get on a regular book page. Oddly, the two contiguous pages are numbered as separate pages, although only the odd numbers are actually printed, so you have pages 37, 39, 41, etc. However, the actual number of open pages is about the same as in a regular paperback book.

The font is a sans serif, similar to many used online, so the whole thing feels like a paper book imitating a computer screen.

I haven't read the book yet, so I don't know if the reading experience is pleasing. But I am amazed that someone has found a way to reinvent the print book after all these years. Patented, of course.

There are few titles available yet, but an Amazon search on "flipback" brings up a few.

Thursday, September 11, 2014

Philosophical Musings: The Work

We can't deny the idea of work - opera, oeuvre - as a cultural product, a meaningful bit of human-created stuff. The concept exists, the word exists. I question, however, whether we will ever have, or should ever have, precision in how works are bounded; that we'll ever be able to say clearly that the film version of Pride and Prejudice is or is not the same work as the book. I'm not even sure that we can say that the text of Pride and Prejudice is a single work. Is it the same work when read today that it was when first published? Is it the same work each time that one re-reads it? The reading experience varies based on so many different factors - the cultural context of the reader; the person's understanding of the author's language; the age and life experience of the reader.

The notion of work encompasses all of the complications of human communication and its consequent meaning. The work is a mystery, a range of possibilities and of possible disappointments. It has emotional and, at its best, transformational value. It exists in time and in space. Time is the more canny element here because it means that works intersect our lives and live on in our memories, yet as such they are but mere ghosts of themselves.

Take a book, say, Moby Dick; hundreds of pages, hundreds of thousands of words. We read each word, but we do not remember the words -- we remember the book as inner thoughts that we had while reading. Those could be sights and smells, feelings of fear, love, excitement, disgust. The words, external, and the thoughts, internal, are transformations of each other; from the author's ideas to words, and from the words to the reader's thoughts. How much is lost or gained during this process is unknown. All that we do know is that, for some people at least, the experience is a vivid one. The story takes on some meaning in the mind of the reader, if one can even invoke the vague concept of mind without torpedoing the argument altogether.

Brain scientists work to find the place in the maze of neuronic connections that can register the idea of "red" or "cold" while outside of the laboratory we subject that same organ to the White Whale, or the Prince of Denmark, or the ever elusive Molly Bloom. We task that organ to taste Proust's madeleine; to feel the rage of Ahab's loss; to become a neighbor in one of Borges' villages. If what scientists know about thought is likened to a simple plastic ping-pong ball, plain, round, regular, white, then a work is akin to a rainforest of diversity and discovery, never fully mastered, almost unrecognizable from one moment to the next.

As we move from textual works to musical ones, or on to the visual arts, the transformation from the work to the experience of the work becomes even more mysterious. Who hasn't passed quickly by an unappealing painting hanging on the wall of a museum before which stands another person rapt with attention? If the painting doesn't speak to us, then we have no possible way of understanding what it is saying to someone else.

Libraries are struggling to define the work as an abstract but well-bounded, nameable thing within the mass of the resources of the library. But a definition of work would have to be as rich and complex as the work itself. It would have to include the unknown and unknowable effect that the work will have on those who encounter it; who transform it into their own thoughts and experiences. This is obviously impractical. It would also be unbelievably arrogant (as well as impossible) for libraries to claim to have some concrete measure of "workness" for now and for all time. One has to be reductionist to the point of absurdity to claim to define the boundaries between one work and another, unless they are so far apart in their meaning that there could be no shared messages or ideas or cultural markers between them. You would have to have a way to quantify all of the thoughts and impressions and meanings therein and show that they are not the same, when "same" is a target that moves with every second that passes, every synapse that is fired.

Does this mean that we should not try to surface workness for our users? Hardly. It means that it is too complex and too rich to be given a one-dimensional existence within the current library system. This is, indeed, one of the great challenges that libraries present to their users: a universe of knowledge organized by a single principle as if that is the beginning and end of the story. If the library universe and the library user's universe find few or no points of connection, then communication between them fails. At best, like the user of a badly designed computer interface, if any communication is to take place it is the user who must adapt. This in itself should be taken as evidence of superior intelligence on the part of the user as compared to the inflexibility of the mechanistic library system.

Those of us in knowledge organization are obsessed with neatness, although few as much as the man who nearly single-handedly defined our profession in the late 19th century; the man who kept diaries in which he entered the menu of every meal he ate; whose wedding vows included a mutual promise never to waste a minute; the man enthralled with the idea that every library be ordered by the simple mathematical concept of the decimal.

To give Dewey due credit, he did realize that his Decimal Classification had to bend reality to practicality. As the editions grew, choices had to be made on where to locate particular concepts in relation to others, and in early editions, as the Decimal Classification was used in more libraries and as subject experts weighed in, topics were relocated after sometimes heated debate. He was not seeking a platonic ideal or even a bibliographic ideal; his goal was closer to the late 19th century concept of efficiency. It was a place for everything, and everything in its place, for the least time and money.

Dewey's constraints of an analog catalog, physical books on physical shelves, and a classification and index printed in book form forced the limited solution of just one place in the universe of knowledge for each book. Such a solution can hardly be expected to do justice to the complexity of the Works on those shelves. Today we have available to us technology that can analyze complex patterns, can find connections in datasets that are of a size way beyond human scale for analysis, and can provide visualizations of the findings.

Now that we have the technological means, we should give up the idea that there is an immutable thing that is the work for every creative expression. The solution then is to see work as a piece of information about a resource, a quality, and to allow a resource to be described with as many qualities of work as might be useful. Any resource can have the quality of the work as basic content, a story, a theme. It can be a work of fiction, a triumphal work, a romantic work. It can be always or sometimes part of a larger work, it can complement a work, or refute it. It can represent the philosophical thoughts of someone, or a scientific discovery. In FRBR, the work has authorship and intellectual content. That is precisely what I have described here. But what I have described is not based on a single set of rules, but is an open-ended description that can grow and change as time changes the emotional and informational context as the work is experienced.

I write this because we risk the petrification of the library if we embrace what I have heard called the "FRBR fundamentalist" view. In that view, there is only one definition of work (and of each other FRBR entity). Such a choice might have been necessary 50 or even 30 years ago. It definitely would have been necessary in Dewey's time. Today we can allow ourselves greater flexibility because the technology exists that can give us different views of the same data. Using the same data elements we can present as many interpretations of Work as we find useful. As we have seen recently with analyses of audio-visual materials, we cannot define work for non-book materials identically to that of books or other texts. [1] [2] Some types of materials, such as works of art, defy any separation between the abstraction and the item. Just where the line will fall between Work and everything else, as well as between Works themselves, is not something that we can pre-determine. Actually, we can, I suppose, and some would like to "make that so", but I defy such thinkers to explain just how such an uncreative approach will further new knowledge.

[1] Kara Van Malssen. BIBFRAME A-V modeling study
[2] Kelley McGrath. FRBR and Moving Images

Thursday, September 04, 2014

WP:NOTABILITY (and Women)

I've been spending quite a bit of time lately following the Wikipedia pages of "Articles for Deletion" or WP:AfD in Wikipedia parlance. This is a fascinating way to learn about the Wikipedia world. The articles for deletion fall mostly into a few categories:
  1. Brief mentions of something that someone once thought interesting (a favorite game character, a dearly loved soap opera star, a heartfelt local organization) but that has not been considered important by anyone else. In Wikipedian, it lacks WP:NOTABILITY.
  2. Highly polished P.R. intended to make someone or something look more important than it is, knowing that Wikipedia shows up high on search engine results, and that any site linked to from Wikipedia also gets its ranking boosted.
Some of #2 is actually created by companies that are paid to get their clients into Wikipedia along with promoting them in other places online. Another good example is that of authors of self-published books, some of whom appear to be more skilled in P.R. than they are in the literary arts.

In working through a few of the fifty or more articles proposed for deletion each day, you get to do some interesting sleuthing. You can see who has edited the article, and what else they have edited; any account that has only edited one article could be seen as a suspected bogus account created just for that purpose. Or you could assume that only one person in the English-speaking world has any interest in this topic at all.

Most of the work, though, is in seeing if you can establish notability. Notability is not a precise measure, and there are many pages of policy and discussion on the topic. The short form is that for something or someone to be notable, it has to be written about in respected, neutral, third-party publications. Thus a New York Times book review is good evidence of notability for a book, while a listing in the Amazon book department is not. The grey area is wide, however. Publishers Weekly may or may not indicate notability, since they publish only short paragraphs, and cover about 7,000 books a year. That's not very discriminating.

Notability can be tricky. I recently came across an article for deletion pointing to Elsie Finnimore Buckley, a person I had never heard of before. I discovered that her dates were 1882-1959, and she was primarily a translator of works from French into English. She did, though, write what appears to have been a popular book of Greek tales for young people.

As a translator, her works were listed under "E. F. Buckley." I can well imagine that if she had used her full name it would not have been welcome on the title page of the books she translated. Some of the works she translated appear to have a certain stature, such as works by Franz Funck-Brentano. She has an LC name authority file under "Buckley, E. F." although her full name is added in parentheses: "(Elsie Finnimore)".

To understand what it was like for women writers, one can turn to Linda Peterson's book "Becoming a Woman of Letters: Myths of Authorship and Facts of the Victorian Market." In that, she quotes a male reviewer of Buckley's Greek tales, which she did publish under her full name. His comments are enough to chill the aspirations of any woman writer. He said that writing on such serious topics is "not women's work" and that "a woman has neither the knowledge nor the literary tact necessary for it." (Peterson, p. 58) Obviously, her work as a translator is proof otherwise, but he probably did not know of that work.

Given this attitude toward women as writers (of anything other than embroidery patterns and luncheon menus) it isn't all that surprising that it's not easy to establish WP:NOTABILITY for women writers of that era. As Dale Spender says in "Mothers of the Novel; 100 good women writers before Jane Austen":
"If the laws of literary criticism were to be made explicit they would require as their first entry that the sex of the author is the single most important factor in any test of greatness and in any preservation for posterity." (p. 137)
That may be a bit harsh, but it illustrates the problem that one faces when trying to rectify the prejudices against women, especially from centuries past, while still wishing to provide valid proof that this woman's accomplishments are worthy of an encyclopedia entry.

We know well that many women writers had to use male names in order to be able to publish at all. Others, like E.F. Buckley, hid behind initials. Had her real identity been revealed to the reading public, she might have lost her work as a translator. Of late, J.K. Rowling has used both techniques, so this is not a problem that we left behind with the Victorian era. As I said in the discussion on Wikipedia:
"It's hard to achieve notability when you have to keep your head down."

Saturday, June 28, 2014

Linking, really linking

Some time ago I posted a bit of wishful thinking about linking search engine results to library holdings. Now Overdrive has made this a reality, at least in Bing:

This appears in the "extended information" area of a Bing search for The Girl with the Dragon Tattoo trilogy. This depends on your having an Overdrive account, and I believe you may also have to have given the browser permission to use your location.




I don't usually use Bing, and so I was unaware that Bing has made much better use of linked data (in part promoted by the use of schema.org standards) than Google. Here is the Google extended sidebar for the same book:
Google results
Now look at what Bing provides:
Bing results
 
If you can't see that there are advantages to linked data after looking at these examples, then, like the global warming deniers, you just don't want to be confused by the facts. Now to get on to how we can make library offerings as rich as this.

**HT to Eric Hellman for blogging this from ALA.

Saturday, May 10, 2014

Let's link!

Tom Johnson responded to a statement of mine in which I said that we need non-programmer tools for linked data work. He asked for case studies, and so I'm going to do some "off the top of my head" riffing here, just to see what might come of it.

First, let me say a little something about some analogous technology. Let's take HTML. I'm one of those folks who learned HTML many decades ago when all you needed was <p>, <i>, <b> and maybe <hr>. With these, you could create a web page. Web pages in those days didn't have banners, sidebars, tables ... adding an image was going all fancy. There were no WYSIWYG tools because it was very simple to create such a web page. Then we got more sophisticated formatting (server-side includes, side bars, tables(!)), and now there is a whole programming language called CSS to handle page creation (and destruction, since CSS is very complex.) It doesn't take much looking around to understand that most people today need TOOLS to create a web page and lots of tools exist, tools that cost little or nothing (WordPress, Drupal) and can be used by folks who've never written 50 lines of working code.

Essentially, on the web today, a few people are providing the structure and the tools, but most folks are providing content without knowing the guts of the technology. Content is, in my mind, the actual goal of the web; technology is the means.

I've spent a lot of time talking to folks about linked data, but linked data is not itself a goal. I've started trying to move the conversation from the underlying technology to what I see as the real goal: making connections -- connections between concepts, ideas, statements about things. This is inherently more social than technical, but of course it needs the technology behind it in order to work. The easier it is for people to make connections, the more connections they will make.

The problem that I'm seeing today in the linked data space is that we don't have an idea of what kinds of connections people will want to make, or what they will do with them. I don't think we're going to know until we apply some scientific method to the problem -- that is, try, fail, try again, rinse, repeat.

I created a really funky web page with one idea. The page looks like this:



It talks about having the ability to profile types of data that can then be selected in the content of the page. Each of these will then pull additional content (maps, term definitions, biographical information)  from available linkable data into the page. This could begin as a very simple application with only a small number of options, as a proof of concept. (WordPress?) I'm sure that others can greatly improve on it.  Take a look at the display in the FAO AGRIS catalog to get an idea of how this might look with a better display.
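Just to make the idea a bit more concrete, here is a toy sketch of the profile part. Everything in it is hypothetical: the `PROFILE` table, the `enrich` function and the placeholder strings are mine, and the stand-in dictionary takes the place of real linked data sources that an actual page would have to query over the web.

```python
# Hypothetical profile: which kind of extra content to pull for each type of data.
PROFILE = {
    "place": "map",
    "term": "definition",
    "person": "biography",
}

# Stand-in for real linked data sources (placeholder strings, not real data).
SAMPLE_SOURCES = {
    ("map", "Berkeley, CA"): "[map centered on Berkeley, CA]",
    ("definition", "linked data"): "[definition of 'linked data']",
    ("biography", "Melvil Dewey"): "[biographical note on Melvil Dewey]",
}


def enrich(selections):
    """For each (data_type, text) selected in the page, look up extra content.

    Returns one dict per selection; "extra" is None when the profile or the
    sources have nothing to offer.
    """
    enriched = []
    for data_type, text in selections:
        kind = PROFILE.get(data_type)
        extra = SAMPLE_SOURCES.get((kind, text))
        enriched.append({"text": text, "kind": kind, "extra": extra})
    return enriched


selected = [("place", "Berkeley, CA"), ("person", "Melvil Dewey")]
for item in enrich(selected):
    print(item["text"], "->", item["kind"], ":", item["extra"])
```

The person writing the page only marks "this is a place" or "this is a person"; the profile decides what gets pulled in, which is the non-programmer part of the idea.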

Have at it, please.