Tuesday, January 17, 2012

Google dashboard

Google has an ad in today's New York Times. Over a half page (and with lots of white space), it is a cartoon of a guy up to his waist in water calling a plumber. The plumber who answers says: "I'm on my way. See you in 15 hours." The rest of the text goes:

"You live in Peoria. Do you really need a plumber from New York? We didn't think so.... That's why search engines, including Google, give you results based on your city or region. They can do this by using your computer's IP address. It's a number like 209.85.229.147, which acts like a zip code to tell them the rough area your computer is in.

To find out more about how websites get to know you better go to google.com/goodtoknow"
The text vs. subtext in this ad is stunning. Although justifying a Google practice, it speaks of it in the third person: "they" use your IP address, it tells "them" the area your computer is in. The message is: everyone does it. It's not a Google thing, it's an Internet thing. Don't blame us.

The site at "goodtoknow" uses the same cartoon figures and has very little text; most information is given via videos. The site is a fairly good round-up of information topics, from phishing to securing your home wifi network. (The irony of that being that Google was caught picking up open wifi traffic in Germany.) I could imagine it as a "go-to" place for novices needing information on online privacy. Much of it isn't about Google at all: the video on "Stay safe online" gives five rules about passwords and avoiding phishing and never mentions Google. It also doesn't mention that when you log into a site with a secure password, everything you do is observable by the owner of the site. Believe it or not, many people do not understand that. They think that the password makes their activities private, even to the site owner.

The page on "Manage your information" includes a link to Google Dashboard, which was also mentioned in one of the videos, and which, if I had ever known about it, I had forgotten. Google Dashboard is a list of some of the things that Google knows about you, in particular which Google services you have accounts on. It shows your settings on these services. I found some services I had played with and forgotten about, which I can now delete.

Of course, Dashboard is only the tip of the iceberg in terms of what Google knows about us. I turned off Web history in 2007 so I don't see my searches there. If you are at all concerned about privacy, visit Dashboard and make some adjustments. Google warns you that you will get results that are less customized for your interests. However, if you are reading this you probably are an information professional, and my guess is that you can find the ad for that printer just as well searching privately (if real privacy really exists) without also letting Google know your political, sexual and religious interests.

You often hear that people don't really care about their privacy and they are quite happy to give Web sites their information in exchange for services. I also observe that behavior, but I'm not convinced that the majority of Web users are truly aware of how much information about them is being gathered. I also doubt that most users know how to take advantage of things like the private browsing options in browsers. (I'm not sure I trust that private browsing is truly private. I also don't know how to find out how private it really is.) I do find myself giving out information about myself to Web sites, but it's not because I don't care: it's because I get rushed and don't want to take the extra step, or I forget, or I'm not given a choice and I need to access that site right now. I don't believe in blaming users for the lack of privacy, because the privacy options are always opt-out, not opt-in, and are often hard to find.

And, yes, I know I am writing this on a Google-owned blog site. For a very, very long time my task list has included figuring out a way to port this content over to my own web site. It's not so much for privacy purposes (it'll still be a public blog) but because I want the content to be mine, even though I'm more likely to lose it than Google is. The Web has become my workplace, and the choice I make is not privacy vs. better ads but privacy vs. getting my work done. Making it all about advertising trivializes the reality that our personal and professional lives are intertwined with systems we have no control over. This dependency is as frightening as the privacy issue.




Wednesday, January 11, 2012

Bibliographic Framework: RDF and Linked Data

With the newly developed enthusiasm for RDF as the basis for library bibliographic data we are seeing a number of efforts to transform library data into this modern, web-friendly format. This is a positive development in many ways, but we need to be careful to make this transition cleanly without bringing along baggage from our past.

Recent efforts have focused on translating library record formats into RDF with the result that we now have:
    ISBD in RDF
    FRBR in RDF
    RDA in RDF

and will soon have
    MODS in RDF

In addition there are various applications that convert MARC21 to RDF, although none is "official." That is, none has been endorsed by an appropriate standards body.

Each of these efforts takes a single library standard and, using RDF as its underlying technology, creates a full metadata schema that defines each element of the standard in RDF. The result is that we now have a series of RDF silos, each defining data elements as if they belong uniquely to that standard. We have, for example, at least four different declarations of "place of publication": in ISBD, RDA, FRBR and MODS, each with its own URI. There are some differences between them (e.g. RDA separates place of publication, manufacture, production while ISBD does not) but clearly they should descend from a common ancestor:
    RDA: place of publication
    RDA: place of distribution
    RDA: place of manufacture
    FRBRer: has place of publication or distribution
    ISBD: has place of publication, production, distribution

This would be annoying, but not unworkable, if these different instances of "place of publication" could be treated as having some meaning in common such that one could link a FRBRer element to an ISBD element, but they cannot. The reason they cannot is that each of these constrains the elements in a particular way that defines its relationship to a single data context (what we generally think of as a "record structure"). The elements are not independent of that context, and this means that each can only be used within that particular context. This is the antithesis of the linked data concept, where data sets from diverse sources share metadata elements. It is this re-use of elements that creates the "link" in linked data. To achieve this, metadata elements need to be unconstrained by a particular context.
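
A minimal sketch of that point, using plain Python tuples to stand in for RDF triples. Every URI and identifier below is a made-up placeholder under example.org, not the real ISBD or RDA property, and the matching function is only an illustration of the idea, not how any RDF tool works. Two data sets that state the same fact with different, unrelated properties have nothing to join on; two that re-use one element do.

```python
# Triples are (subject, property, value). All URIs are hypothetical placeholders.
SILO_A = [
    ("a:book1", "http://example.org/isbd/placeOfPublication", "Stockholm"),
]
SILO_B = [
    ("b:item9", "http://example.org/rda/placeOfPublication", "Stockholm"),
]

# The same two statements, this time re-using a single shared element.
SHARED = "http://example.org/elements/placeOfPublication"
OPEN_A = [("a:book1", SHARED, "Stockholm")]
OPEN_B = [("b:item9", SHARED, "Stockholm")]


def linked_pairs(left, right):
    """Return pairs of subjects that share the same property and value."""
    index = {}
    for subject, prop, value in left:
        index.setdefault((prop, value), []).append(subject)
    pairs = []
    for subject, prop, value in right:
        for other in index.get((prop, value), []):
            pairs.append((other, subject))
    return pairs


silo_links = linked_pairs(SILO_A, SILO_B)    # no property in common, so nothing joins
shared_links = linked_pairs(OPEN_A, OPEN_B)  # same property URI, so the records meet
```

The values are identical in both cases; only the choice of property decides whether the two descriptions can find each other.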

Linking can also be achieved through vertical relationships, similar to "broader" and "narrower" in thesauri. This is less direct, but makes it possible to mix data sets that have differing levels of granularity. In our case, the ISBD "place of publication, production, distribution" could be defined as broader than the three RDA elements that treat those separately. Unfortunately that is not possible because of the way that ISBD and RDA have been defined in RDF. (I'll post more detail about this later for those who want more.)

The result is that we now have a series of RDF silos, expressions of our data in RDF that lack the linking capabilities of linked data because they are bound to specific data structures. Clearly we gain little in terms of linked data by creating mutually incompatible bibliographic views. Not only are these RDF schemes not compatible with each other, none will be linkable to bibliographic data from communities outside of libraries that publish their data on the Web. That means no linking to Amazon, to Wikipedia, to citations within documents.

Given where we are in the development of linked data for libraries, we now have two options:

1) Define 'super-elements' that float above the record formats and that are not bound by the constraints of the RDF-defined records. In this case there would be a general "place of publication" that is super- to all of the "place of publication" elements in the various records, and would be subordinate to a general concept of "place" that is widely used (possibly a property of GeoNames). To implement linking, each record element would be extrapolated to its super elements.

2) Define our data elements outside of any particular record format first, then use these in the record schemas. In this case there would be only one instance of "place of publication" and it would be used throughout the various bibliographic records whenever an element with that meaning is needed. Those records would be interchangeable as linked data using their component data elements, and would interact with other bibliographic data on the Web using the RDF-defined elements and their relationships.
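
Here is a rough sketch of what option 1 might look like in practice, again in plain Python rather than a real RDF toolkit. The property names and the "gen:" super-elements are invented for illustration; they are assumptions, not the actual ISBD, RDA or FRBRer identifiers. Each record element points to its super-element, and "extrapolating" means restating the data with every element above it in the chain.

```python
# Hypothetical property names; the real ISBD, RDA and FRBRer URIs differ.
SUPER_OF = {
    "rda:placeOfPublication": "gen:placeOfPublication",
    "isbd:placeOfPublicationProductionDistribution": "gen:placeOfPublication",
    "frbrer:placeOfPublicationOrDistribution": "gen:placeOfPublication",
    "gen:placeOfPublication": "gen:place",
}


def with_super_elements(prop):
    """Return prop followed by every super-element above it, nearest first."""
    chain = [prop]
    while chain[-1] in SUPER_OF:
        chain.append(SUPER_OF[chain[-1]])
    return chain


def extrapolate(triples):
    """Option 1: restate each triple using every super-element of its property."""
    return {
        (subject, broader, value)
        for subject, prop, value in triples
        for broader in with_super_elements(prop)
    }


rda_data = extrapolate([("rec1", "rda:placeOfPublication", "Peoria")])
isbd_data = extrapolate(
    [("rec2", "isbd:placeOfPublicationProductionDistribution", "Peoria")]
)

# Properties that both data sets have in common after extrapolation.
common = {p for _, p, _ in rda_data} & {p for _, p, _ in isbd_data}
```

Under option 2 the mapping table would not be needed at all, because every record schema would use the one shared element directly.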

My message here is that we need to be creating data, not records, and that we need to create the data first, then build records with it for those applications where records are needed. Those records will operate internally to library systems, while the data has the potential to make connections in linked data space. I would also suggest that we cease creating silo'd RDF record formats, as these will not move us forward. Instead, we should concentrate on discovering and defining the elements of our data, and begin looking outward at all of the data we want to link to in the vast information universe.


_____
* Note on RDA: RDA in RDF includes two "versions" of each data element: one bound to FRBR and one not. The latter has potential for re-use outside of a FRBR environment, and was designed for this purpose by the DCMI/RDA task force. Its relationship to "official" RDA is somewhat unclear at this time but hopefully will gain support as the linked data concept is absorbed into the bibliographic framework.



Monday, January 02, 2012

Google Book Search Redux

The document I referred to in the previous post would have been so much clearer if I had read the two preceding documents. Now that I have, the story is even more dramatic.

On December 12, 2011, the Authors Guild filed a fourth amended complaint (PDF) against Google. This complaint is nearly identical to the first one, filed on September 20, 2005 (PDF). The two complaints between these (October 28, 2008 and November 16, 2009) included the Association of American Publishers, as did the two attempts at settling the case (October 28, 2008, and November 13, 2009). The publishers had had their own complaint in 2005 before combining forces with the Authors Guild. Now the Authors Guild is again standing alone against Google's book digitizing efforts.

This fourth amended complaint brings us pretty much back to square one, with the addition of the involvement of more libraries and the creation of HathiTrust as a way for the libraries to store their (allegedly) ill-gotten copies. The library copies are a key element of the suit because they are proof that Google has not only digitized the library books but has made copies (the purview of copyright law) and distributed them to others.

The most interesting document of this latest group, and the one with the greatest detail about Google's actions, is the Memorandum in support of the class certification. This document is the explanation of why the Authors Guild should be considered by the court to be a valid representative of all authors in a class action suit. The document has a number of quotable moments, of which my favorite is the "tell it like it is in plain language" opening:
This litigation arises from Google's business decision to gain a competitive edge over its rivals in the search engine market by making digital copies of millions of "offline" printed materials. ... Rather than obtaining licenses from copyright owners for the digital use of their printed works, Google instead entered into agreements with libraries to gain access to these works. A number of university libraries allowed Google to make digital copies of the books in the libraries' collections, including in-copyright books. In exchange, Google provided digital copies of the books to the libraries. Google refers to this massive copyright infringement as its "Library Project." (p.1)
The assumption on the part of most folks commenting on this latest development in this now 6-year-old case is that the settlement is dead. We are therefore back to the question of whether Google's book scanning is or is not Fair Use. This question, though, is only being asked on the part of authors, not publishers, and if anyone has inside knowledge on what approach the publishers are taking I would love to hear it. It is clear that the position of publishers in relation to Google has changed greatly over these past 5-6 years since the suit was originally filed. There are now reportedly thousands of publishers who are using Google Books to promote and sell their works. It also makes sense that publishers, as corporations, are better able to negotiate with Google than are individual authors. A large publisher with numerous books in print and in its backlist has clout that a single person does not have. In addition, large publishers have lawyers, or access to legal counsel. At least some publishers have made their peace with Google and are seeing the relationship as advantageous.

Looking at this from the library point of view I wonder what will happen to the millions of library books already scanned by Google. I also wonder what this awkward and failed attempt to create the overly broad settlement between Google and the AG/AAP will mean for future digitization projects. There are strong arguments for digitization for scholarly purposes, and the creation of a computational capability over millions of texts could be a positive step for research, especially in the social sciences and humanities. I hope that the botched attempt to commercialize the contents of libraries will not prejudice the future of digital research.

Monday, December 26, 2011

Google files motion to dismiss

"The claims of the associations should be dismissed without leave to amend because they lack standing as a matter of law, since they do not themselves own copyrights and do not meet the test for associational standing set forth in Hunt." p. 19
With that conclusion, Google has filed a motion asking that the copyright infringement lawsuits filed by the Authors Guild and the American Society of Media Photographers, Inc. be dismissed. The arguments made in the document are:
  • "Individual copyright owners' participation is necessary to establish a claim for copyright infringement." (p.1)
  • "Plaintiff associations do not own copyrights alleged to have been infringed, and do not have standing to sue for copyright infringement." (p.4)
  • "Every copyright, and every alleged copyright infringement, is different."(p.7)
  • "... a central issue in these cases is whether the conduct alleged in the Complaints constitute fair use under 17 U.S.C. 107. Litigating that issue will require the participation of individual association members, because many of the relevant facts are specific to the particular work in question." (p.11)
All of this sounds plausible to this legal novice, but there are a couple of puzzling issues. First, why did Google not make these arguments in 2005 when the Authors Guild filed suit? Instead, they negotiated with the association for six years, presumably in good faith, and those negotiations hinged on the acceptance of the AG as a representative of authors and their rights in their works. If Google had thought that the AG did not have standing, none of that negotiation would have made much sense.

Second, Google says in this document that fair use has to be determined on a case-by-case basis. They even quote from Campbell v. Acuff-Rose Music, Inc. that "Fair use must 'be judged case by case, in light of the ends of the copyright law....' It is 'not to be simplified with bright-line rules.'" (p.11) This seems to undermine Google's original defense that copying for the purposes of creating an index is itself fair use, not something that has to be determined on a case-by-case basis.

It isn't surprising that Google wants to bring an end to this case. It is now entering its seventh year (the original suit was filed in September of 2005), and has undoubtedly been costly for all parties. Google was moving ahead in putting into place the foundations for the settlement, including the creation of a large database of works and a means for owners to claim the copyrights. They had designated a director for the Book Rights Registry, which would administer the business agreed on in the settlement. The failure of the settlement and the amended settlement to get court approval meant that all of that effort was for naught. Yet it isn't clear to me (and I hope someone can speak to this) what practical outcome Google is seeking for its book digitization effort. A dismissal of this nature would put Google in the rather cynical position of continuing book scanning knowing that few individual authors will have the means to take Google to court, and those individual payments would probably be affordable for this multi-billion dollar company. If dismissal is rejected, then at least that aspect of the suit is clarified, but the next step will surely be that the suit goes forward as first entered.

The one thing that is clear is that negotiations between Google and the AG are no longer on the horizon.

Note, also, that the Authors Guild has filed suit against the HathiTrust for copyright infringement, and the decision here will no doubt reflect on that case as well.

Thursday, December 22, 2011

National Library of Sweden and OCLC fail to agree

In a blog post entitled "No deal with OCLC" the National Library of Sweden has announced that after five years they have ended negotiations with OCLC to become participants in WorldCat. The point of difference was over the OCLC record use policy. Sweden has declared the bibliographic data in the Swedish National Catalog, Libris, to be open for use without constraints.
"A fundamental condition for the entire Libris collaboration is voluntary participation. Libraries that catalogue in Libris can take out all their bibliographic records and incorporate them instead into another system, or use them in anyway the library finds suitable." (from the blog post)
This is an example of the down-stream constraint issues that we discussed while working on the Open Bibliography Principles for the Open Knowledge Foundation. While open data may appear to be primarily an ideological stance, it in fact has real practical implications. A bibliographic database is made up of records and data elements that can have uses in many contexts. In addition, the same bibliographic data may exist in numerous databases managed by members of entirely different communities. Someone may wish to create a new database or service using data coming from a variety of sources. At times someone will want to use only portions of records and may mix and match individual data elements from different sources. Any kind of constraint on use of the data, including something as seemingly innocuous as allowing all non-commercial use, requires the user of the data to keep track of the source of each record or data element. Practically this means that an application using the mix of data is effectively constrained by the strictest contract in the mix.
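
To make the bookkeeping burden concrete, here is a small sketch in Python. The policy labels, their ordering, and the sample sources are all invented for illustration; they do not describe any real license or any particular provider's terms. The point is only that once a single element carries a constraint, every element has to carry its source and policy, and the whole record inherits the strictest one.

```python
from dataclasses import dataclass

# Hypothetical policy labels, ordered from least to most restrictive.
RESTRICTIVENESS = {"open": 0, "attribution": 1, "non-commercial": 2, "members-only": 3}


@dataclass(frozen=True)
class Element:
    name: str     # e.g. "title"
    value: str
    source: str   # where this element came from
    policy: str   # key into RESTRICTIVENESS


def effective_policy(elements):
    """Return the strictest policy among the elements of a mixed record."""
    return max((e.policy for e in elements), key=RESTRICTIVENESS.__getitem__)


mixed_record = [
    Element("title", "Example title", "source A", "open"),
    Element("subject", "Example subject", "source B", "non-commercial"),
    Element("publisher", "Example publisher", "source C", "attribution"),
]

policy = effective_policy(mixed_record)
```

If every source were open, the source and policy fields could be dropped entirely, which is the practical argument for unconstrained data.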

The Swedish library was concerned that their participating libraries would be hindered in their future systems and activities if any limitations were placed on data use. In addition, they would not be able to share their data with the Europeana project, as Europeana requires that the data contributed be open precisely because of the complications of managing hundreds or thousands of different sources with different obligations.

As many of us pointed out during the discussions about the OCLC record use policy, the practical problems of controlling down-stream use of data are insurmountable. Some people argue that the record use policy hasn't affected libraries using WorldCat, but my experience is that the policy has a chilling effect on some libraries, and is making it more difficult for libraries to embrace the linked open data model. The Swedish National Library had to make the difficult decision between WorldCat services and future capabilities. It was undoubtedly a hard decision, but it is admirable that the National Library did not give up what it saw as important rights for its users.

Monday, December 12, 2011

Learning not to share

"Learning to share" used to be one of the basic lessons of childhood, with parents beaming the first time their offspring spontaneously handed half of a cookie to a playmate. But some time before that same child first puts fingers to keyboard she will have to learn a new lesson: not to share online.

The Facebook phenomenon has taken that simple concept of sharing with others to an industrial level. Any page you go to on the Web today connects into your online social life, so that while reading the news or watching a video you are exhorted to share your activity with your online "friends." I say "friends" in quotes because the way that Facebook involvement grows means that many of the people seeing your posts or learning about your activities are like second and third cousins; related to your friends but at least a step removed from the inner circle you relate to. It is easy to forget that those more distant relations are there, but bit by bit the links pull in more invitations and, since we have been told that it is impolite not to share, we rarely slam the digital door on those seeking our friendship.

To increase this digital sharing, the House has passed a revision to the Video Privacy Act. You may not recall the "Bork law" of 1988. It was one of the fastest privacy laws ever passed in the U.S. legislature. Here's the description from the New York Times article:

In 1987, the Washington City Paper, a weekly newspaper, published the video rental records of Judge Robert H. Bork, who at the time was a nominee to the Supreme Court. One of the paper’s reporters had obtained the records from Potomac Video, a local rental store. Judge Bork’s choice of movies — he rented a number of classic feature films starring Cary Grant — may have seemed innocuous.

But the disclosure of Judge Bork’s cultural consumption so alarmed Congress that it quickly passed a law giving individuals the power to consent to have their records shared. The statute, nicknamed the “Bork law,” also made video services companies liable for damages if they divulged consumers’ records outside the course of ordinary business.
At the time the passage of the law had a comic aspect to it: you could imagine the thoughts going through the heads of members of Congress when they realized that any reporter could walk into their local video store and learn what they had rented. Zingo! New law!

The revised bill, stated in the article as being backed primarily by Netflix, would allow consumers (and that's all we are, right, consumers?) to sign a blanket waiver on their video privacy in order to facilitate sharing with friends.

The Times article has various quotes giving pros and cons, online services vs. privacy advocates, all talking about how much you do or don't want your "friends" to know about you. What the article fails to state, however, is that whether you like it or not, every site where you share is a de facto friend as well. If your Facebook friends get your Netflix picks, both Facebook and Netflix (and their advertising partners) also get your video viewing information. The more you share with your friends, the more you are sharing with an invisible network of corporations - who, by the way, you cannot "unfriend" even if you want to.

This is why we need to learn not to share: it's a lie, a deceit. We aren't really sharing with our friends; our friends are being used to get us to divulge information to faceless corporations who have insinuated themselves into our lives for the sole purpose of benefiting from our consumption. They have distorted the entire idea of "friend," and turned it into a buyer's club for their benefit.

Dear friends: I'm looking forward to seeing you ... offline.

Tuesday, November 01, 2011

Future Format: Goals and Measures

The LC report on the future bibliographic format (aka replacement for MARC) is out. The report is short and has few specifics, other than the selection of RDF as the underlying data format. A significant part of the report lists requirements; these, too, are general in nature and may not be comprehensive.

What needs to be done before we go much further is to begin to state our specific goals and the criteria we will use to determine if we have met those goals. Some goals we will discover in the course of developing the new environment, so this should be considered a growing list. I think it is important that every goal have measurements associated with it, to the extent possible. It makes no sense to make changes if we cannot know what those changes have achieved. Here are some examples of the kinds of things I am thinking of in terms of goals; these may not be the actual goals of the project, they are just illustrations that I have invented.

COSTS
 - goal: it should be less expensive to create the bibliographic data during the cataloging process
   measurement: using time studies, compare cataloging in MARC and in the new format
 - goal: it should be less expensive to maintain the format
   measurement: compare the total time required for a typical MARBI proposal to the time required for the new format
 - goal: it should be less expensive for vendors to make required changes or additions
   measurement: compare the number of programmer hours needed to make a change in the MARC environment and the new environment

COLLABORATION
 - goal: collaboration on data creation with a wider group of communities
   measurement: count the number of non-library communities that we are sharing data with before and after
 - goal: greater participation of small libraries in shared data
   measurement: count number of libraries that were sharing before and after the change
 - goal: make library data available for use by other information communities
   measurement: count use of library data in non-library web environments before and after

INNOVATION
 - goal: library technology staff should be able to implement "apps" for their libraries more quickly and easily than they can today.
   measurement: either number of apps created, or a time measure to implement (this one may be hard to compare)
 - goal: library systems vendors can develop new services more quickly and more cheaply than before
   measurement: number of changes made in the course of a year, or number of staff dedicated to those changes. Another measurement would be what libraries are charged and how many libraries make the change within some stated time frame
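
One possible way to keep goals and their measurements together is sketched below in Python. The class, its field names, and all of the numbers are invented for illustration, in the same spirit as the goals above; nothing here comes from the LC report. Each goal records a "before" value from the MARC environment and an "after" value from the new one, along with which direction counts as improvement.

```python
from dataclasses import dataclass


@dataclass
class Measure:
    goal: str
    unit: str
    before: float          # measured in the MARC environment
    after: float           # measured in the new environment
    lower_is_better: bool  # True for costs, False for counts we want to grow

    def change(self):
        """Relative change from before to after (before must be non-zero)."""
        return (self.after - self.before) / self.before

    def met(self):
        """True when the measurement moved in the desired direction."""
        if self.lower_is_better:
            return self.after < self.before
        return self.after > self.before


# Invented numbers, for illustration only.
measures = [
    Measure("cheaper cataloging", "minutes per record", 20.0, 15.0, True),
    Measure("wider collaboration", "non-library partners", 4.0, 4.0, False),
]

report = [(m.goal, m.met(), round(m.change(), 2)) for m in measures]
```

A goal whose number does not move, like the second one here, counts as not met, which is a reminder that "no measurable change" is itself a result worth recording.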

As you can tell from this list, most of the measurements require system implementation, not just the development of a new format. But the new format cannot be an end in itself; the goal has to be the implementation of systems and services using that format. The first MARC format that was developed was tested in the LC workflow to see if it met the needs of the Library. This required the creation of a system (called the "MARC Pilot Project") and a test period of one year. The testing that took place for RDA is probably comparable and could serve as a model. Some of the measurements will not be available before full implementation, such as the inclusion of more small libraries. Continued measurement will be needed.

So, now, what are the goals that YOU especially care about?