Speaker: Anurag Acharya
Anurag Acharya is the Principal Engineer at Google Scholar.
Anurag could not be here so Dan Clancy from Google Book Search is taking Anurag's place.
In the early Internet days, Yahoo started out emulating a traditional catalog with its subject categories, but people seemed to prefer searching. Search works because the web itself provides the organization through links. Google doesn't organize the web; instead, it makes use of the organization that web pages provide. Google makes heavy use of the anchor text that defines links. These anchors provide the meaning behind the link; essentially, aboutness. A link is an assertion about a relationship. It is also a kind of metadata.
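To make the "anchor text as metadata" idea concrete, here is a minimal sketch that indexes pages by the words other pages use when linking to them. The link data, the function names, and the counting scheme are all illustrative assumptions; the talk did not describe how Google actually weights anchor text.

```python
from collections import defaultdict

# Hypothetical link data: (source page, anchor text, target page).
LINKS = [
    ("a.example/home", "library catalog search", "c.example/opac"),
    ("b.example/blog", "search the catalog", "c.example/opac"),
    ("b.example/blog", "book scanning project", "d.example/scans"),
]

def build_anchor_index(links):
    """Map each anchor-text word to the target pages it was used to describe."""
    index = defaultdict(lambda: defaultdict(int))
    for _source, anchor, target in links:
        # Count each word once per link, so one long anchor cannot dominate.
        for word in set(anchor.lower().split()):
            index[word][target] += 1
    return index

def search(index, query):
    """Rank target pages by how many inbound anchors mention the query words."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for target, count in index.get(word, {}).items():
            scores[target] += count
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

index = build_anchor_index(LINKS)
print(search(index, "catalog search"))
```

Note that the target page's own text is never consulted: the description comes entirely from what other pages assert about it, which is the sense in which a link acts as metadata.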
Google Book Search currently relies on things like the title for ranking, not links. On the web, people consider search to work well, but without those links, search is not a "solved problem."
One of the questions that a system like Google must address is: What is an object?
The answer is not simply: "a web page is an object." There are many "same" pages on the web, so even the web needs to be de-duped. How do you determine "sameness"? It's not pure equivalence; sameness is a fuzzy function. In the end, things are determined to be "effectively equivalent."
Apply this to books. It depends on the context. Google needs to algorithmically determine equivalence.
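One common way to treat sameness as a fuzzy function is to compare overlapping word sequences ("shingles") and call two texts effectively equivalent when the overlap passes a threshold. The sketch below assumes that approach; the shingle size, the 0.5 threshold, and the function names are illustrative choices, not anything Clancy described.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given length."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Similarity between two sets: shared items divided by all items."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def effectively_equivalent(text_a, text_b, threshold=0.5):
    """Treat two texts as 'the same' when their shingle overlap is high enough."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold

first = "it was the best of times it was the worst of times"
second = "it was the best of times it was the worst of times indeed"
other = "call me ishmael some years ago never mind how long"

print(effectively_equivalent(first, second))  # near-identical texts
print(effectively_equivalent(first, other))   # unrelated texts
```

The threshold is where context enters: a system deduplicating web pages and one clustering editions of a book could reasonably pick different values for the same measure.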
Authority: Who is an Expert?
Authority used to be easier to determine -- professors, where they work, what degrees they have. That doesn't work on the web. Clancy calls the web a "democracy." The only way to get authority is to take advantage of the masses; there's too much stuff for you to be able to make determinations any other way.
The cost of asserting opinions determines value: it costs more to maintain a web page than to write a blog; it costs more to write a blog than to tag photos in Flickr.
Searching things other than the web.
- number of objects is smaller than the number of pages on the web. Do the same lessons apply?
Some do; some don't.
Google started at the other end from library catalogs: keywords and relevance ranking.
"Full text and keywords is not always the answer." But it is part of the answer.
How to decide? Listen to your users
Let the users tell you what they want to do.
That doesn't mean that you can't also serve minority groups. (kc: This implies that "average" users like Google Scholar, while specialist users prefer library or vendor databases.)
Clancy gives some examples in a demo:
- "searches related to" at the bottom of the page
- "Refine results" at the top of the page, with aspects or synonyms. This is done with human tagging, and authoritative users.
- spelling correction
Google Book Search
- work-level clustering (FRBR expression, not work)
- find it in a library
- Libcat results (a full catalog search)
- Integration into regular Google search results
Metadata problems are a big issue (?). He didn't say much to support this (we should get him to elaborate).
How do you determine if these two books are the same? (Two books from different libraries)
It's easier to figure out once you've scanned the same book twice. (This implies that they use the scans to determine duplicates. Intriguing idea!)
How do you create an ontology of non-web objects?
- criticism and review
Discovery must be universal - if you can't find it, you can't access it
Make the common case easy
- especially time to task completion
- the faster things work, the more often users come back
- complex interfaces take longer to learn
Ranking can adapt to many problems
A consistent experience is crucial
Metasearch of diverse sources is a dead end.
- ranked lists are hard to merge, even when you know the ranking functions
- can't do whole-corpus analyses
- speed is defined by the slowest search
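Two of these points can be shown with a small sketch. It assumes two hypothetical sources that are queried in parallel and that score on unrelated scales; the source names, latencies, scores, and function names are invented for illustration.

```python
# Hypothetical per-source results: response time in seconds plus
# (document, raw score) pairs. Each source scores on its own scale.
SOURCES = {
    "catalog": {"latency": 0.2, "hits": [("doc-a", 0.92), ("doc-b", 0.85)]},
    "vendor_db": {"latency": 1.5, "hits": [("doc-c", 41.0), ("doc-d", 12.0)]},
}

def metasearch_latency(sources):
    """If we wait for every source, the slowest one sets the response time."""
    return max(source["latency"] for source in sources.values())

def merge_by_raw_score(sources):
    """Naive merge: sort by raw score, ignoring that the scales differ."""
    merged = [hit for source in sources.values() for hit in source["hits"]]
    return sorted(merged, key=lambda hit: -hit[1])

def merge_round_robin(sources):
    """Interleave by rank position; says nothing about relevance across sources."""
    lists = [source["hits"] for source in sources.values()]
    merged = []
    for position in range(max(len(hits) for hits in lists)):
        for hits in lists:
            if position < len(hits):
                merged.append(hits[position])
    return merged

print(metasearch_latency(SOURCES))
print([doc for doc, _ in merge_by_raw_score(SOURCES)])
print([doc for doc, _ in merge_round_robin(SOURCES)])
```

Sorting by raw score lets the source with the larger numeric scale crowd out the other, while round-robin avoids that only by giving up on comparing relevance between sources at all. Neither merge is a real answer, which is the difficulty the list above points to.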
Google Books is an Opportunity to help users
"We have the opportunity to help users find this wealth [in libraries]"
Roy asks: what do *you* mean by metadata?
"Things that describe this book."
Person from Bowker:
ISTC is coming along; will Google use it?
Clancy: sure, if it helps users.
Do they have any authority files for persons, places, etc.?
Clancy: We probably do not use authority files to the extent we should. We mainly work with the text.