There's a fascinating video created by the Southern Poverty Law Center (in January 2017) that focuses on Google but is equally relevant to libraries. It is called The Miseducation of Dylann Roof.
In this video, the speaker shows that a Google search on "black on white violence" returns top results that are all from racist sites, and that each of these links only to other racist sites. The speaker claims that Google's algorithms favor sites similar to ones a user has visited from a Google search, so that eventually, in this case, the user's online searching becomes skewed toward racist sites. The claim is that this is what happened to Dylann Roof, the man who killed nine people at a historic African-American church: he entered a closed information system that consisted only of racist sites. The video ends by saying: "It's a fundamental problem that Google must address if it is truly going to be the world's library."
I'm not going to defend or deny the claims of the video, and you should watch it yourself because I'm not giving a full exposition of its premise here (it is short and very interesting). But I do want to question whether Google is or could be "the world's library," and also whether libraries do a sufficient job of presenting users with a well-rounded information space.
It's fairly easy to dismiss the first premise - that Google is or should be seen as a library. Google is operating in a significantly different information ecosystem from libraries. While there is some overlap between Google and library collections, primarily because Google now partners with publishers to index some books, there is much that is on the Internet that is not in libraries, and a significant amount that is in libraries but not available online. Libraries pride themselves on providing quality information, but we can't really take the lion's share of the credit for that; the primary gatekeepers are the publishers from whom we purchase the items in our collections. In terms of content, most libraries are pretty staid, collecting only from mainstream publishers.
I decided to test this out and went looking for works promoting Holocaust denial or Creationism in a non-random group of libraries. I was able to find numerous books about deniers and denial, but only research libraries seem to carry the books by the deniers themselves. None of these come from mainstream publishing houses. I note that the subject heading, Holocaust denial literature, is applied both to items written from the denial point of view and to ones analyzing or debating that view.
Creationism gets a bit more visibility; I was able to find some creationist works in public libraries in the Bible Belt. Again, there is a single subject heading, Creationism, that covers both the pro- and the con-. Finding pro- works in WorldCat is a kind of "needle in a haystack" exercise.
Don't dwell too much on my findings - this is purely anecdotal, although a true study would be fascinating. We know that libraries to some extent reflect their local cultures, such as the presence of the Gay and Lesbian Archives at the San Francisco Public Library. But you often hear that libraries "cover all points of view," which is not really true.
The common statement about libraries is that we gather materials on all sides of an issue. Another statement is that users will discover them because they will reside near each other on the library shelves. Is this true? Is this adequate? Does this guarantee that library users will encounter a full range of thoughts and facts on an issue?
First, just because the library has more than one book on a topic does not guarantee that a user will choose to engage with multiple sources. There are people who seek out everything they can find on a topic, but as we know from the general statistics on reading habits, many people will not read voraciously on a topic. So the fact that the library has multiple items with different points of view doesn't mean that the user reads all of those points of view.
Second, there can be a big difference between what the library holds and what a user finds on the shelf. Many public libraries have a high rate of circulation of a large part of their collection, and some books have such long holds lists that they may not hit the shelf for months or longer. I have no way to predict what a user would find on the shelf in a library that had an equal number of books expounding the science of evolution vs those promoting the biblical concept of creation, but it is frightening to think that what a person learns will be the result of some random library bookshelf.
But the third point is really the key one: libraries do not cover all points of view, if by points of view you include the kind of misinformation that is described in the SPLC video. Many points of view are not available from mainstream publishers, and many are not considered appropriate for anything but serious study. A researcher looking into race relations in the United States today would find the sites that attracted Roof to be important sources of insight, as the SPLC did, but you will not find that same information in a "reading" library.
Libraries have an idea of "appropriate" that they share with the publishing community. We are both scientific and moral gatekeepers, whether we want to admit it or not. Google is an algorithm functioning over an uncontrolled and uncontrollable number of conversations. Although Google pretends that its algorithm is neutral, we know that it is not. On Amazon, which does accept self-published and alternative press books, certain content like pornography is consciously kept away from promotions and best seller lists. Google has "tweaked" its algorithms to remove Holocaust denial literature from view in some European countries that forbid the topic. The video essentially says that Google should make wide-ranging cultural, scientific and moral judgments about the content it indexes.
I am of two minds about the idea of letting Google or Amazon be a gatekeeper. On the one hand, immersing a Dylann Roof in an online racist community is a terrible thing, and we see the result (although the cause and effect may be harder to prove than the video suggests). On the other hand, letting Google and Amazon decide what is and is not appropriate does not sit well at all. As I've said before, having gatekeepers whose motivations are trade secrets that cannot be discussed is quite dangerous.
There has been a lot of discussion lately about libraries and their supposed neutrality. I am very glad that we can have that discussion. With all of the current hoopla about fake news, Russian hackers, and the use of social media to target and change opinion, we should embrace the fact that we have collection policies, and admit widely that we and others have thought carefully about the content of the library. It won't be the most radical in many cases, but we care about veracity, and that's something that Google cannot say.
Monday, February 13, 2017
Sunday, December 18, 2016
Transparency of judgment
The Guardian, and others, have discovered that when querying Google for "did the Holocaust really happen", the top response is a Holocaust denier site. They mistakenly think that the solution is to lower the ranking of that site.
The real solution, however, is different. It begins with the very concept of the "top site" from searches. What does "top site" really mean? It means something like "the site most often pointed to by other sites that are most often pointed to." It means "popular" -- but by an unexamined measure. Google's algorithm doesn't distinguish fact from fiction, or scientific from nutty, or even academically viable from warm and fuzzy. Fan sites compete with the list of publications of a Nobel prize-winning physicist. Well, except that they probably don't, because it would be odd for the same search terms to pull up both, but nothing in the ranking itself makes that distinction.
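That circular definition ("the site most often pointed to by other sites that are most often pointed to") can be made concrete with a toy sketch in the spirit of PageRank. The graph, damping factor, and site names below are invented for illustration; this is not Google's actual algorithm, only the bare link-popularity idea:

```python
# Toy link-based ranking: a page is important if important pages point
# to it. Nothing here distinguishes fact from fiction -- only links.

def rank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * score[page] / len(targets)
                for t in targets:
                    new[t] += share
        # pages with no outlinks spread their score evenly over everyone
        dangling = sum(score[p] for p in pages if not links.get(p))
        for p in pages:
            new[p] += damping * dangling / len(pages)
        score = new
    return score

links = {
    "a.example": ["hub.example"],
    "b.example": ["hub.example"],
    "hub.example": ["a.example"],
}
scores = rank(links)
# The most-pointed-to page ranks highest, whatever its content says.
assert max(scores, key=scores.get) == "hub.example"
```

The point of the sketch is that the score is purely structural: a tightly interlinked cluster of sites, racist or otherwise, can rank its own members highly without any external measure of quality entering the computation.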
The primary problem with Google's result, however, is that it hides the relationships that the algorithm itself uses in the ranking. You get something ranked #1 but you have no idea how Google arrived at that ranking; that's a trade secret. By not giving the user any information on what lies behind the ranking of that specific page you eliminate the user's possibility to make an informed judgment about the source. This informed judgment is not only about the inherent quality of the information in the ranked site, but also about its position in the complex social interactions surrounding knowledge creation itself.
This is true not only for Holocaust denial but every single site on the web. It is also true for every document that is on library shelves or servers. It is not sufficient to look at any cultural artifact as an isolated case, because there are no isolated cases. It is all about context, and the threads of history and thought that surround the thoughts presented in the document.
There is an interesting project of the Wikimedia Foundation called "Wikicite." The goal of that project is to make sure that specific facts culled from Wikipedia into the Wikidata project all have citations that support the facts. If you've done any work on Wikipedia you know that all statements of fact in all articles must come from reliable third-party sources. These citations allow one to discover the background for the information in Wikipedia, and to use that to decide for oneself if the information in the article is reliable, and also to know what points of view are represented. A map of the data that leads to a web site's ranking on Google would serve a similar function.
Another interesting project is CITO, the Citation Typing Ontology. This is aimed at scholarly works, and it is a vocabulary that would allow authors to do more than just cite a work - they could give a more specific meaning to the citation, such as "disputes", "extends", "gives support to". A citation index could then categorize citations so that you could see who are the deniers of the deniers as well as the supporters, rather than just counting citations. This brings us a small step, but a step, closer to a knowledge map.
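A typed citation index of the kind described above can be sketched in a few lines. The works below are invented, and I have picked just three CITO-style relation names ("disputes", "supports", "extends") for illustration; the real ontology defines many more:

```python
# Sketch of a typed citation index in the spirit of CITO: instead of
# merely counting citations, group them by the relationship the citing
# author asserted.

from collections import defaultdict

citations = [
    # (citing work, relation, cited work) -- all invented examples
    ("paperA", "disputes", "denial-tract"),
    ("paperB", "disputes", "denial-tract"),
    ("paperC", "supports", "denial-tract"),
    ("paperD", "extends",  "paperA"),
]

def citation_profile(work, citations):
    """Return {relation: [citing works]} for one cited work."""
    profile = defaultdict(list)
    for citing, relation, cited in citations:
        if cited == work:
            profile[relation].append(citing)
    return dict(profile)

profile = citation_profile("denial-tract", citations)
# "Who disputes this?" becomes answerable, not just "how often is it cited?"
assert profile["disputes"] == ["paperA", "paperB"]
```

Plain citation counting would score the invented "denial-tract" as a well-cited work; the typed profile shows at a glance that most of those citations are refutations.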
All judgments of importance or even relative position of information sources must be transparent. Anything else denies the value of careful thinking about our world. Google counts pages and pretends not to be passing judgment on information, but they operate under a false flag of neutrality that protects their bottom line. The rest of us need to do better.
Monday, August 10, 2015
Google becomes Alphabet
I thought it was a joke, especially when the article said that they have two investment companies, Ventures and Capital. But it's all true, so I have this to say:
G is for Google, H is for cHutzpah. In addition to our investment companies Ventures and Capital, we are instituting a think tank, Brain, and a company focused on carbon-based life-based forms, Body. Servicing these will be three key enterprises: Food, Water, and Air. Support will be provided by Planet, a subsidiary of Universe. Of course, we'll also need to provide Light. Let there be. Singularity. G is for God.
Monday, March 04, 2013
Sergey Brin's Masculinity
At first I thought it was a joke: "Speaking at the TED Conference today in Long Beach, Calif., Brin told the audience that smartphones are 'emasculating.' 'You're standing around and just rubbing this featureless piece of glass,' he said." Perhaps I didn't believe it was true because I first encountered it in the form of a BoingBoing parody for "Mandroid: Google's remasculating new operating system." Another one of those moments when reality and parody are just soooooo close.
The TED talk won't be available for a while, so I don't know if he said this with any hint of humor. (I rather hope so, but I fear not.) The talk was about the Google Glass product, which he was demonstrating and promoting. But even if he meant the statement as something of a joke, there are things that need to be said about the not-so-subtle subtext.
1. Using "emasculating" to deride a competitor's product when neither product has anything to do with gender is just a cheap shot. It's like Coke saying that Pepsi is "emasculating."
2. The ongoing attempt to raise the testosterone levels of electronic equipment has gotten out of hand. Yet, unfortunately, products must make an appeal to identity in order to sell. Apple pushes an identity of design and sophistication that was once considered "un-manly" by early Mac reviewers. Brin's remark, albeit nonsensical, pushes back against Apple's more gender-neutral image.
3. It makes little sense to eliminate women from your market, and promoting a product as a kind of "technology viagra" is not going to win over female consumers. Brin's remark shows that he's more concerned with promoting a masculine image that he is comfortable with than with following good marketing practice.
Some reading:
Wikipedia Women in Computing
Gender codes: why women are leaving computing edited by Thomas Misa
How to market to women, by Carol Nelson (1994, so a little out of date, but still useful)
Friday, September 14, 2012
Rich snippets
At the recent Dublin Core annual meeting I heard Dan Brickley talk about Google's use of schema.org for rich snippets. Schema.org is commonly thought of as "search engine optimization" (SEO), which to most people means "how to get onto the first page of Google search results." But the microdata in web sites can also be used to make the snippets shown more useful by incorporating more information from the web page. The examples on the Google rich snippets page show features like ratings as well as links to actual content within the web page.
Now that WorldCat has schema.org markup, my first thought was: what kind of rich snippet would be good for library data? There is a rich snippet testing tool where you can plug in a URL and see (1) the snippet and (2) what microdata is visible to Google. You can plug in a WorldCat permalink and see what the rich snippet result is:
http://www.worldcat.org/oclc/874206 (opens in separate window)
There is no rich snippet displayed here, which tells us that Google hasn't yet developed a rich snippet model for our kind of data. But you can see, in great detail, all of the coded data that is available. (The red warnings indicate that there is data in the OCLC microdata that isn't part of schema.org. OCLC is talking to the schema.org developers to incorporate new elements, some of which show up as warnings here.)
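To make that "coded data" concrete, here is a minimal sketch of schema.org bibliographic data expressed as JSON-LD. The field values are invented; the type and property names (Book, author, numberOfPages, isbn) are real schema.org terms, but this is an illustration, not OCLC's actual markup:

```python
# Build a minimal schema.org Book description and serialize it as
# JSON-LD -- the kind of structured data a rich snippet could draw on.

import json

book = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "An Example Title",                       # invented
    "author": {"@type": "Person", "name": "Jane Example"},
    "publisher": {"@type": "Organization", "name": "Example Press"},
    "numberOfPages": 310,
    "isbn": "0000000000",                             # placeholder
}

markup = json.dumps(book, indent=2)
print(markup)
```

Embedded in a page, data like this is what lets a search engine display the author and publisher directly in the result rather than guessing at a text snippet.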
I began to think about how I would like this data used. It could be used to format a more bibliographic-like display, adding author, publisher, pagination. The ISBN could of course link to key online bookstores. (That would also bring in revenue for Google, so might be a popular choice for the search engine.) But what about libraries? How could rich snippets help libraries and library users?
The snippet could lead back to WorldCat where the user could find a nearby library, but... wait! Google often knows your approximate location, and WorldCat knows whether libraries in your area have the book. AND the library catalog often has information about availability. I don't know how this data would interact with the WorldCat tool, but here's what I would like to see in the snippet:
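As a purely hypothetical sketch of the logic such a snippet would need (every name, location, and holding below is invented), the idea is simple: filter libraries to those with an available copy, then pick the one nearest the searcher:

```python
# Hypothetical "find me a copy near me" logic: combine the searcher's
# approximate location, WorldCat-style holdings, and per-library
# availability to pick the closest library with the book on the shelf.

from math import dist

holdings = {
    # library -> ((x, y) location, copy currently available?)
    "Midtown Public":  ((0.0, 1.0), True),
    "Eastside Branch": ((0.2, 0.1), False),  # all copies checked out
    "County Central":  ((3.0, 4.0), True),
}

def best_library(user_location, holdings):
    """Closest library holding an available copy, or None."""
    available = [(dist(user_location, loc), name)
                 for name, (loc, ok) in holdings.items() if ok]
    return min(available)[1] if available else None

# Eastside is nearest but has no copy on the shelf, so the snippet
# would point the user to Midtown instead.
assert best_library((0.0, 0.0), holdings) == "Midtown Public"
```

The interesting design point is the availability filter: without it, the snippet sends the user to the nearest building; with it, it sends them to the nearest book.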
This definitely goes beyond what "rich snippet" means today, but is not inconsistent with retrievals that pull data from multiple online sales outlets. In the sales model, Google's assumption is that the searcher wants to obtain (in FRBR-speak) the item, and therefore various outlets that could provide that data are listed. This same logic could apply to libraries, of course. Libraries are a local source of many of the same things that are sold online, so the obtain logic fits.
This analysis of mine obviously ignores the economic incentive for Google to provide library holdings, especially since they would be seen as competing with sales. I'm just dreaming here, doing the "what if" thing without the practical limitations.
Monday, May 21, 2012
Google goes semantic
In a long-awaited move [1], Google has announced that its search will now be "semantic." They don't actually mean "semantic" in the sense of the semantic web, although there are similarities. While what Google is doing may not formally follow the W3C standards for the semantic web, there is no doubt that they are performing acts of "data linking" that make use of the concepts of linked data. The W3C standards for linked data are designed for openness, so that data from disparate communities can come together. Google has no obligation to play well with others and, as we saw with the development of schema.org, is in a position to make its own rules, many of which are known only within the giant Google-verse. They call their technology a "knowledge graph" and talk about "things not strings."
I've used this same phrase myself in numerous presentations on linked data.
Google has always been about using links between things on the web to determine its brand of "relevance" of a web resource to a search query. By using existing linked data, via large stores of links like DBpedia, Wikipedia, Freebase, and presumably others, Google can now expand its offerings from a single list of results to additional information about what might be the intended topic of the searcher. I say "might be" without any irony; whether in a web search engine or a library catalog, the communication between the searcher's mind and the device that provides results is always only approximate. What the additional data provides is not only more context but a fuller explanation of the topics that have been retrieved. No longer do users have to guess the meaning of the results from snippets; they can see a Wikipedia-like entry that not only gives them more information but contains links to other sources of information on the topic.
[Images: "Snippet"; "Knowledge Graph" result; "Knowledge Graph" detail]
At a meeting of the Northern California Technical Services Group in Berkeley last Friday, I said to the group:
Imagine that you have an 18-year-old user who finds a novel on your library's shelf by Oliver North. The user looks up the author in your catalog and sees that this person has written a few other books, but oddly always with a "co-author." Is someone so inept worth reading? Now imagine that your catalog also presents the user with the context: Ollie North, Iran Contra, and related persons. Suddenly the user sees where North fits into US history, has a chance to find out what an interesting character he is, and the books take on a whole new meaning.
That was before I saw this Google result.
We treat library users as if they are all-knowing; as if they know each author in our catalog, as if the title of the book and the number of pages is sufficient for them to decide if it is a good read or has the information they need. This is so obviously false that I am at a loss to explain how we continue to work under this illusion.
[1] Google purchased the only linked-data search system, Freebase, in July of 2010, thus tipping their hand that they were moving in that direction. Not only did they acquire Freebase and the skills of its employees, they eliminated a potential rival (although it may be silly to consider that anyone could really be a rival to Google).