Archive for October, 2005

Implicit Feedback, Pseudo Feedback, Relevance Feedback and Active Feedback

Implicit feedback is a popular way to do personalized search, but a general audience may confuse it with pseudo feedback and relevance feedback, so it is worth clarifying the differences here.

Relevance feedback was proposed in information retrieval research in the 1970’s by Gerard Salton and his co-workers as a way to improve retrieval accuracy. It works in the following way. After the user submits a query, the retrieval system does a first run to rank documents and presents a few top-ranked documents for the user to explicitly judge for relevance. After getting the user’s relevance judgments, the retrieval system combines the judged documents with the original query through query expansion, does a second run, and presents the newly ranked documents to the user. Many empirical evaluations show that relevance feedback is an effective way to improve retrieval accuracy. The Rocchio feedback formula is the most popular way to do relevance feedback in the vector space model. Model-based feedback, proposed by ChengXiang Zhai in his CIKM 2001 paper, is a popular way to do relevance feedback with statistical language models.
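To make the query expansion step concrete, here is a minimal sketch of Rocchio-style feedback in the vector space model. Queries and documents are represented as plain term-to-weight dictionaries, and the function name and the default coefficients (alpha, beta, gamma) are illustrative assumptions rather than values taken from any particular system.

```python
from collections import Counter


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an expanded query as a term -> weight dict.

    query: term -> weight dict for the original query.
    relevant / nonrelevant: lists of term -> weight dicts for the documents
    the user judged. The coefficient defaults are assumptions for this sketch.
    """
    new_query = Counter()
    # Keep the original query, scaled by alpha.
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Move toward the centroid of relevant documents and away from the
    # centroid of non-relevant ones.
    for docs, coeff in ((relevant, beta), (nonrelevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += coeff * weight / len(docs)
    # Terms that end up with a non-positive weight are dropped.
    return {t: w for t, w in new_query.items() if w > 0}
```

With the query {"jaguar": 1.0}, one relevant document about cars and one non-relevant document about cats, the expanded query gains the term "car" and never picks up "cat".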

However, in many retrieval tasks such as web search, the user is not willing to provide relevance judgments to the retrieval system, so pseudo feedback was later proposed. Pseudo feedback works in the following way. After the user submits a query, the retrieval system does a first run to rank documents and picks a few top-ranked documents. The system assumes these top-ranked documents are relevant and combines them with the original query through query expansion to do a second run. It then presents the newly ranked documents to the user. Here we can clearly see that relevance feedback needs user involvement in the relevance judgment process while pseudo feedback does not. Many empirical evaluations show that pseudo feedback generally, but not always, outperforms the baseline retrieval. However, pseudo feedback is not as effective as relevance feedback.
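The two-run loop can be sketched as follows. This is only an illustration under simple assumptions: documents are term-frequency dictionaries, scoring is a plain dot product, and the names and parameters (k, n_terms, beta) are made up for the example.

```python
from collections import Counter


def score(query, doc):
    """Dot product between a weighted query and a term-frequency dict."""
    return sum(w * doc.get(t, 0) for t, w in query.items())


def rank(query, docs):
    """Return document ids sorted by decreasing score."""
    return sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)


def pseudo_feedback(query, docs, k=2, n_terms=2, beta=0.5):
    """Expand the query from the top-k documents of the first run.

    No user judgment is involved: the top-k documents are simply
    assumed to be relevant.
    """
    first_run = rank(query, docs)
    top = first_run[:k]
    # Average term frequencies over the assumed-relevant documents.
    centroid = Counter()
    for doc_id in top:
        for term, tf in docs[doc_id].items():
            centroid[term] += tf / len(top)
    # Add the strongest centroid terms to the original query.
    expanded = dict(query)
    for term, weight in centroid.most_common(n_terms):
        expanded[term] = expanded.get(term, 0.0) + beta * weight
    # The second run uses the expanded query.
    return expanded, rank(expanded, docs)
```

If the top-ranked documents for "jaguar" happen to be about cars, the expanded query drifts toward cars, which helps a user who wants cars and hurts one who wants the animal. This is one way to see why pseudo feedback does not always beat the baseline.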

Relevance feedback is not applicable in many search activities, while pseudo feedback totally excludes the user from the feedback process. So both relevance feedback and pseudo feedback have limitations. In interactive information retrieval such as web search, the user generally has many interactions with the retrieval system. During these interactions, the user gives a lot of hints that can help the retrieval system better infer the user’s information need. Thus implicit feedback was proposed. Implicit feedback works in the following way. The retrieval system stores user interaction data such as query and clickthrough history, uses this data to better infer the user’s information need, composes a new query to rank documents, and presents the ranked documents to the user. We can see that implicit feedback neither asks for the user’s explicit relevance judgment nor categorically assumes that the top-ranked documents of the baseline retrieval are relevant. Instead, implicit feedback infers the user’s information need from the hints the user implicitly provides. However, there is a caveat. We need to analyze those hints carefully so that we do not incorporate noise into the new query, which may even hurt retrieval performance. Read the paper Context-Sensitive Information Retrieval Using Implicit Feedback for more discussion and references.
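As a rough illustration of composing a new query from interaction data, the sketch below mixes the current query with earlier queries and the text of clicked results using fixed weights. The function names, the weights, and the simple unigram representation are assumptions for this example, not the method of the paper mentioned above.

```python
from collections import Counter


def term_dist(texts):
    """Unigram distribution over whitespace-separated, lowercased words."""
    counts = Counter(w for text in texts for w in text.lower().split())
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def implicit_feedback_query(current, past_queries, clicked_snippets,
                            w_history=0.2, w_clicks=0.3):
    """Blend the current query with implicit evidence from the session.

    The weights are hypothetical. Setting both to 0 falls back to the
    current query alone, which is the safe choice when the history looks
    noisy or unrelated.
    """
    parts = [
        (term_dist([current]), 1.0 - w_history - w_clicks),
        (term_dist(past_queries), w_history),
        (term_dist(clicked_snippets), w_clicks),
    ]
    model = Counter()
    for dist, weight in parts:
        for term, p in dist.items():
            model[term] += weight * p
    return dict(model)
```

For the ambiguous query "java", an earlier query "java island travel" and a clicked result about Bali hotels push weight onto "island", "bali" and "hotels" while "java" remains the dominant term.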

To summarize the differences among these three feedback techniques: relevance feedback asks the user for explicit relevance judgments; pseudo feedback assumes the top-ranked documents of the baseline retrieval are relevant; implicit feedback tries to better infer the user’s information need from the data implicitly provided by the user.

Active feedback was proposed in the paper Active Feedback in Ad-hoc Information Retrieval. Active feedback can be considered a kind of relevance feedback. But traditional relevance feedback focuses on how to incorporate judged documents into the new query (e.g., query term addition and query term reweighting), while active feedback studies which documents should be presented to the user for relevance judgment in order to maximize what the retrieval system learns from the user’s judgments. The paper proposes a general framework and derives several specific algorithms from it.


Motivation for Personalized Search

In research papers or presentations, people often use ambiguous queries to motivate contextual or personalized search. Commonly used examples of ambiguous queries are “bass” (fish or instrument), “java” (programming language, island or coffee), “jaguar” (animal, car or Apple software) and “IR application” (Infrared application or Information Retrieval application).

These ambiguous queries are indeed one motivation for contextual search. However, the motivation for contextual search is not limited to query disambiguation. In my SIGIR 2005 paper, I showed that for 30 hard topics selected from TREC (Text REtrieval Conference) topics 1-150, the search needs to be put in context. These topics are called hard topics because previous experiments show that traditional retrieval algorithms perform very poorly on them. Looking through these hard topics, I can see that most of them are hard not because they are ambiguous. Instead, these topics are inherently hard because 1) it is very hard for the user to specify the information need clearly, since the description of these topics is very complex; and 2) it is very hard for the retrieval system to find relevant documents, since there are very few relevant documents in the huge document collection. We demonstrated that using context information (query history and clickthrough data) can improve retrieval performance. Here is an example of those hard topics. Each TREC topic is composed of a topic number (unique ID), title, description, and narrative.

<topic>
<number> 2
<title> Acquisitions
<desc> Document discusses a currently proposed acquisition involving a U.S.
company and a foreign company.
<narr> To be relevant, a document must discuss a currently proposed acquisition (which may or may not be identified by type, e.g., merger, buyout, leveraged buyout, hostile takeover, friendly acquisition). The suitor and target must be identified by name; the nationality of one of the companies must be identified as U.S. and the nationality of the other company must be identified as NOT U.S.
</topic>

 

For this topic, the description of the information need is very complex and there are a lot of constraints. Moreover, there are only 283 relevant documents in the whole document collection (this TREC collection has 242918 documents). Here is a real query sequence (4 queries) submitted by a single user, along with the corresponding poor retrieval performance. MAP means Mean Average Precision, which is a good (but not intuitive) measure of overall retrieval performance. Pr@20docs is the fraction of the top 20 documents that are relevant, which is a good measure of web search performance since many users only care about the relevance of the top-ranked results.

First query: acquisition u.s. foreign company
MAP: 0.004; Pr@20docs: 0.000

Second query: acquisition merge takeover u.s. foreign company
MAP: 0.026; Pr@20docs: 0.100

Third query: acquire merge foreign abroad international
MAP: 0.004; Pr@20docs: 0.050

Fourth query: acquire merge takeover foreign european japan
MAP: 0.027; Pr@20docs: 0.200
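For readers who find these measures unintuitive, here is a small sketch of how they are usually computed. The function names are my own, and the code assumes a ranked list of document ids plus a set of known relevant ids. For a single query the first function gives average precision; MAP is its mean over several queries.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at each rank where a relevant
    document appears, divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def precision_at_k(ranking, relevant, k=20):
    """Fraction of the top k ranked documents that are relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

A Pr@20docs of 0.100 therefore means that 2 of the top 20 results were relevant. Average precision also penalizes relevant documents that are never retrieved, which is part of why the values for this topic are so low.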

To summarize, query disambiguation is one motivation for contextual or personalized search, but it is not the only one. For information seeking on hard topics, we also need to put the search in context.


Two Patents about Search Engine Personalization

There are two patent applications related to search engine personalization.

One is from Google, Variable personalization of search results in a search engine, which was demonstrated somewhere on the Google website before, although the demo has since disappeared. The basic idea is to have a slider button that lets the user tune the degree of personalization. Here is the abstract of the patent application.

This invention would enable a searcher to fill out a profile, perform a normal search, and then use a slider button to indicate how much his or her personal information from the profile should be used to modify (rerank) that search based upon the personalization information that they have entered into the profile, by sliding the button partially, or all the way to a full influence on the results.
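One simple way to picture the slider is as a linear blend between the normal ranking score and a profile-based score. The sketch below is only my guess at the general idea. The function name, the score ranges, and the linear formula are assumptions and are not taken from the patent application.

```python
def rerank(results, profile_scores, slider):
    """Rerank results by blending base and profile scores.

    results: list of (doc_id, base_score) pairs from the normal search.
    profile_scores: doc_id -> score derived from the user's profile
        (hypothetical; missing ids count as 0.0).
    slider: 0.0 means no personalization, 1.0 means full influence.
    """
    if not 0.0 <= slider <= 1.0:
        raise ValueError("slider must be between 0.0 and 1.0")

    def blended(item):
        doc_id, base = item
        return (1.0 - slider) * base + slider * profile_scores.get(doc_id, 0.0)

    return [doc_id for doc_id, _ in sorted(results, key=blended, reverse=True)]
```

With the slider at 0.0 the original order is kept, and as it moves toward 1.0 the documents that match the profile rise to the top.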

The other is from Yahoo!, Color Graphing and Personalization. Here is the abstract of the patent application.

In a search processing system, identifying input authority weights for a plurality of pages, wherein an input authority weight represents a user’s weight of a page in terms of interest; distributing a page’s input authority weight over one or more pages that are linked in a graph to the page; and using a resulting authority weight for a page in effecting a search result list. The search result list might comprise one or more of reordering search hits and highlighting search hits.


Some Discussion about Thorsten’s ACM SIGIR 2005 Paper

Jakob Nielsen has written an article about Thorsten’s ACM SIGIR 2005 paper (see my post of September 3, 2005 for more information about this paper), which has spurred some discussion at the Cre8site Forum. It is interesting to read the discussion about how to do research on user search behavior in an unbiased way, as well as about some of the paper’s research findings.
