Wednesday, March 07, 2007

Users and Uses, Google Scholar

Speaker: Anurag Acharya

Anurag Acharya is the Principal Engineer at Google Scholar.

Anurag could not be here so Dan Clancy from Google Book Search is taking Anurag's place.

In the early Internet days, Yahoo started out emulating a traditional catalog with its subject categories, but people seemed to prefer the search method. The search method works because the web itself provides the organization through links. Google doesn't organize the web; instead, it makes use of the organization that web pages provide. Google makes heavy use of the anchor text that defines links. These anchors provide the meaning behind the link: essentially, aboutness. A link is an assertion about a relationship. It is also a kind of metadata.
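To make the anchor-text idea concrete, here is a minimal sketch of treating inbound anchor text as metadata about the page being linked to. The link data, URLs, and function names are all made up for illustration, and the scoring is a plain word count; nothing here is claimed to be how Google actually ranks.

```python
from collections import Counter, defaultdict

# Hypothetical link data: (source page, target page, anchor text).
LINKS = [
    ("a.example/blog", "lib.example/catalog", "library catalog"),
    ("b.example/links", "lib.example/catalog", "search the library catalog"),
    ("c.example/home", "lib.example/catalog", "catalog"),
    ("a.example/blog", "news.example", "daily news"),
]

def anchor_profiles(links):
    """Collect, for each target page, the words other pages use when linking to it."""
    profiles = defaultdict(Counter)
    for _source, target, anchor in links:
        profiles[target].update(anchor.lower().split())
    return profiles

def search(query, profiles):
    """Rank targets by how often the query words occur in their inbound anchors."""
    words = query.lower().split()
    scores = {
        target: sum(counts[w] for w in words)
        for target, counts in profiles.items()
    }
    ranked = [(t, s) for t, s in scores.items() if s > 0]
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))

profiles = anchor_profiles(LINKS)
print(search("library catalog", profiles))
```

The point of the sketch is that the target page never has to describe itself: the "aboutness" comes entirely from what other pages say when they link to it.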

Google Book Search currently relies on things like the title for ranking, not links. On the web, people consider search to work well, but without those links, search is not a "solved problem."

One of the questions that a system like Google must address is: What is an object?
The answer is not simply: "a web page is an object." There are many "same" pages on the web, so even the web needs to be de-duped. How do you determine "sameness"? It's not pure equivalence; sameness is a fuzzy function. In the end, things are determined to be "effectively equivalent."
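One common way to express a fuzzy "sameness" function is to compare overlapping word windows (shingles) with Jaccard similarity and call two pages effectively equivalent above some threshold. The sketch below assumes that technique; the talk did not say which method Google uses, and the window size `k` and the `threshold` value are arbitrary choices for illustration.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def effectively_equivalent(text_a, text_b, threshold=0.8, k=3):
    """Fuzzy sameness: True when the shingle sets overlap at least `threshold`."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold

page_a = "The quick brown fox jumps over the lazy dog near the river bank"
page_b = "THE QUICK brown fox, jumps over the lazy dog near the river bank!"
page_c = "A completely different page about library catalogs and metadata"

print(effectively_equivalent(page_a, page_b))  # differs only in case and punctuation
print(effectively_equivalent(page_a, page_c))  # unrelated text
```

Where to set the threshold is exactly the "it depends on the context" question: a strict value separates editions, a loose one lumps them together.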

Apply this to books. It depends on the context. Google needs to algorithmically determine equivalence.

Authority: Who is an Expert?
Authority used to be easier to determine -- professors, where they work, what degrees they have. That doesn't work on the web. Clancy calls the web a "democracy." The only way to get authority is to take advantage of the masses; there's too much stuff for you to be able to make determinations any other way.

The cost of asserting opinions determines value: it costs more to maintain a web page than to write a blog; it costs more to write a blog than to tag photos in Flickr.

Searching things other than the web.
- number of objects is smaller than the number of pages on the web. Do the same lessons apply?
Some do; some don't.
Google started at the other end from library catalogs: keywords and relevance ranking.
"Full text and keywords is not always the answer." But it is part of the answer.

How to decide? Listen to your users
Let the users tell you what they want to do.
That doesn't mean that you can't also serve minority groups. (kc: This implies that "average" users like Google Scholar, while specialist users prefer library or vendor databases.)

Clancy gives some examples in a demo:
- "searches related to" at the bottom of the page
- "Refine results" at the top of the page, with aspects or synonyms. This is done with human tagging, and authoritative users.
- spelling correction

Google Book Search
Features:
- work level clustering (FRBR expression, not work)
- find it in a library
- Libcat results (a full catalog search)
- Integration into regular Google search results
Metadata problems are a big issue (?). He didn't say much to support this (we should get him to elaborate).

Duplication
How do you determine if these two books are the same? (Two books from different libraries)
It's easier to figure out once you've scanned the same book twice. (This implies that they use the scans to determine duplicates. Intriguing idea!)

How do you create an ontology of non-web objects?
- FRBR
- References
- authorship
- criticism and review
...

Lessons learned:

Discovery must be universal - if you can't find it, you can't access it

Make the common case easy -
especially time to task completion
the faster things work, the more often users come back
complex interfaces take longer to learn

Ranking can adapt to many problems

A consistent experience is crucial

Metasearch of diverse sources is a dead end.
- ranked lists are hard to merge, even when you know the ranking functions
- can't do whole-corpus analyses
- speed is defined by the slowest search
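
A small sketch of two of these points, using made-up backends, scores, and latencies (all names and numbers are hypothetical). Each backend scores on its own scale, so merging by raw score lets one source crowd out the other; interleaving by rank avoids comparing scores but throws away how strong each hit was. And if every backend is queried in parallel and awaited, the slowest one sets the response time.

```python
# Hypothetical results from two independent search backends.
# Each backend scores on its own scale, so raw scores are not comparable.
CATALOG = {"latency_s": 0.2, "hits": [("book-1", 0.91), ("book-2", 0.40)]}
ARTICLES = {"latency_s": 1.5, "hits": [("paper-1", 37.0), ("paper-2", 12.0)]}

def naive_merge(*backends):
    """Merge by raw score: the backend with the larger score scale always wins."""
    merged = [hit for b in backends for hit in b["hits"]]
    return [doc for doc, _score in sorted(merged, key=lambda h: -h[1])]

def interleave(*backends):
    """Round-robin by rank position: no score comparison, but hit strength is lost."""
    out = []
    depth = max(len(b["hits"]) for b in backends)
    for rank in range(depth):
        for b in backends:
            if rank < len(b["hits"]):
                out.append(b["hits"][rank][0])
    return out

def response_time(*backends):
    """With parallel queries that are all awaited, the slowest backend sets the time."""
    return max(b["latency_s"] for b in backends)

print(naive_merge(CATALOG, ARTICLES))
print(interleave(CATALOG, ARTICLES))
print(response_time(CATALOG, ARTICLES))
```

Neither merge is obviously right, which is the difficulty the notes point to; a single index over the whole corpus sidesteps it by ranking everything with one function.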

Google Books is an opportunity to help users
"We have the opportunity to help users find this wealth [in libraries]"

Roy asks: What do *you* mean by metadata?
"Things that describe this book."

Person from Bowker:
ISTC is coming along; will Google use it?
Clancy: sure, if it helps users.

Lorcan:
Do they have any authority files for persons, places, etc.?
Clancy: We probably do not use authority files to the extent we should. We mainly work with the text.

1 comment:

Scott said...

two possible metadata examples which we don't seem to be using when "describing" objects (books, articles, etc.) that it would be great for Google (and anyone else) to use for creating relevancy rankings:

* Citation analysis (okay, we DO use this... but not generally for web-based searching.)
* Publisher (some publishers specialize in certain kinds of material - why not have that as a relevancy factor?)