Toward A Session-Based Search Engine Smitha Sriram, Xuehua Shen, ChengXiang Zhai Department of...

Toward A Session-Based Search Engine

Smitha Sriram, Xuehua Shen, ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign

U.S.A.

Motivation

• Information retrieval is inherently an interactive process

– A user’s information need is unlikely to be fully satisfied by a single query execution

– A user often needs to interact with the system several times through query reformulation and document-browsing

– Thus in general, a query exists in a search session

• A search session provides lots of contextual information for a query that can be exploited (e.g., previous queries and clickthrough data)

• Such contextual information is mostly ignored in existing search engines

• We aim to develop a session-based search engine that exploits such contextual information to improve retrieval

Traditional vs. Session-based Retrieval

Traditional (“1-query”) retrieval:

Query=“IR applications” → Retrieval System (over the Document Collection) → Results: D1 (infrared), D2 (infrared), D3 (retrieval), D4 (infrared), D5 (retrieval)

Session-based retrieval:

Query=“IR applications” plus session context (Previous query=“retrieval systems”; Frequency in viewed docs: Infrared: 0, Retrieval: 5; …) → Retrieval System → Results: D3 (retrieval), D5 (retrieval)

“IR” can mean either “information retrieval” or “infrared”.

The session-based system uses more contextual information and gives more accurate results.

Research Issues

• What is an appropriate architecture for supporting session-based retrieval?

– How to manage session information?

• How can we detect session boundaries?

• What contextual information should we exploit?

• How can we exploit such contextual information to improve document ranking?

• How can we display search results in the context of a session?

A Client-Server Architecture for Session-based IR

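[Figure: client-server architecture. Server side: Search Engine over Docs, which receives a query and returns the Top-N results (1. --- 2. --- 3. --- …). Client side: User, Personalized Agent, Session Manager, User model, Search context, Local Collection; the user issues a query and receives results.]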

Advantages of Server-Side Processing

• Persistent user profiles (imagine if a user often uses different machines)

• Have access to global user information

– Can exploit information about all users to identify common access patterns

– Can exploit information about similar users to help improve performance for any individual user

• Have access to all the documents

– Can perform more powerful statistical analysis (e.g., to identify most frequently accessed docs)

– Can improve document representation over time

Advantages of a Client-Side Agent

• Can capture more information about the user, and thus model the user more accurately

– Can exploit the complete interaction history (e.g., easily capture click-through information)

– Can exploit a user’s other activities (e.g., searching immediately after reading an email)

– Can detect session boundary more accurately

• More scalable (“distributed personalization”)

• Alleviate the problem of privacy for personalization

Session Boundary Detection

• Detection is generally easier if done on the client side

– More information about the user can be exploited

– E.g., knowing that “logout” and “login” happened between two queries

• The server side has access to query co-occurrence patterns, which can help judge query coherence

• Possible clues for session boundary detection

– Time interval between queries

– Query coherence (based on word relatedness and/or query log analysis)

– Activities in between two queries
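The clues above can be combined into a simple rule. The following is a minimal sketch under stated assumptions, not the method used in ACES: the `QueryEvent` type, the function names, the 30-minute gap, and the overlap threshold are all hypothetical, and plain term overlap stands in for a real query-coherence measure.

```python
from dataclasses import dataclass


@dataclass
class QueryEvent:
    text: str
    timestamp: float  # seconds, e.g. since the epoch


def term_overlap(q1: str, q2: str) -> float:
    """Jaccard overlap of lowercase term sets (a crude stand-in for query coherence)."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_session_boundary(prev: QueryEvent, cur: QueryEvent,
                        max_gap_seconds: float = 1800.0,   # assumed threshold
                        min_overlap: float = 0.2,          # assumed threshold
                        logged_out_between: bool = False) -> bool:
    """Return True if `cur` is judged to start a new session after `prev`."""
    # Clue 3: activities in between two queries (e.g. logout followed by login).
    if logged_out_between:
        return True
    # Clue 1: time interval between queries.
    if cur.timestamp - prev.timestamp > max_gap_seconds:
        return True
    # Clue 2: query coherence, approximated here by shared terms.
    return term_overlap(prev.text, cur.text) < min_overlap
```

Term overlap alone would split related queries that share no words, such as “retrieval systems” followed by “IR applications”, which is why word relatedness or query log analysis is suggested for judging coherence.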

Useful Session Context Information

• Previous queries in the same session

• Documents viewed and not viewed so far in the current session

• Other user activities during the same time as the current session

• Context information collected in a similar session by the current user or other users

• … …

Session-based Retrieval Models

• Framework: The risk minimization retrieval framework [Lafferty & Zhai 01, Zhai 02] can be naturally extended to support session-based retrieval

• One possible model (KL-divergence model)

– Retrieval = estimating a query model + estimating a doc model + computing their KL-divergence

– Session context information (and any other potentially useful information) can be used to estimate a better (session-based) query model

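\[
\hat{\theta}_D = \arg\max_{\theta} \, p(\theta \mid \text{Doc})
\]

\[
\hat{\theta}_Q = \arg\max_{\theta} \, p(\theta \mid \text{Query}, \text{User}, \text{CurrentSessionContext})
\]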

Refinement of this model leads to specific retrieval formulas
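To make the KL-divergence model concrete, here is a minimal sketch that ranks documents by the negative KL-divergence between a query model and a smoothed document model. It is an illustration under assumptions, not the ACES implementation: the function names, the whitespace-token input, and the choice of linear interpolation smoothing with weight `lam` are all hypothetical.

```python
import math
from collections import Counter


def mle(tokens):
    """Maximum-likelihood unigram model: word -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed_doc_model(doc_tokens, collection_model, lam=0.5):
    """Document model interpolated with the collection model (assumed smoothing)."""
    doc = mle(doc_tokens)
    return lambda w: (1 - lam) * doc.get(w, 0.0) + lam * collection_model.get(w, 0.0)


def neg_kl(query_model, doc_prob):
    """Score = -KL(query model || doc model); higher means a better match."""
    score = 0.0
    for w, pq in query_model.items():
        pd = doc_prob(w)
        if pq == 0.0 or pd == 0.0:
            continue  # skip words outside the collection vocabulary
        score -= pq * math.log(pq / pd)
    return score


# Toy usage: a session-based query model that favors "retrieval".
docs = {
    "D1": ["infrared", "sensor", "applications"],
    "D3": ["information", "retrieval", "applications"],
}
collection = mle([t for tokens in docs.values() for t in tokens])
query_model = {"retrieval": 0.5, "applications": 0.5}
ranked = sorted(
    docs,
    key=lambda d: neg_kl(query_model, smoothed_doc_model(docs[d], collection)),
    reverse=True,
)
```

In this setup the session context only changes `query_model`; the scoring function itself stays the same, which is the appeal of estimating a better query model rather than changing the ranking function.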

Session-based Result Presentation

• Retrieval results can be displayed in the context of the current session

– Previous search results in the session can be exploited to show which document has been consistently moving up in ranking as the user is reformulating the query

– All the queries in the session can be combined and analyzed to generate a subtopic space for the user’s information need, and documents can be organized and displayed in this space

• Session-based result presentation can

– Help a user digest the search results more effectively and more efficiently

– Help a user to quickly focus on the important concept/topic dimensions

– Help a user to figure out how to better formulate a query
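One way to find documents that keep moving up as the query is reformulated is to record each document's rank under every query in the session. This is a minimal sketch under assumptions: the function names and the input format (one ranked list of document ids per query) are hypothetical.

```python
def rank_history(rankings):
    """rankings: ranked doc-id lists, one per query, in session order.

    Returns {doc_id: [rank per query]} with 1-based ranks, or None where
    the document was not retrieved for that query.
    """
    docs = {d for ranking in rankings for d in ranking}
    history = {d: [] for d in docs}
    for ranking in rankings:
        position = {d: i + 1 for i, d in enumerate(ranking)}
        for d in docs:
            history[d].append(position.get(d))
    return history


def consistently_rising(ranks):
    """True if the doc was retrieved every time and its rank improved each time."""
    if len(ranks) < 2 or any(r is None for r in ranks):
        return False
    return all(later < earlier for earlier, later in zip(ranks, ranks[1:]))
```

A display could then mark every document for which `consistently_rising(history[doc])` holds, or simply show the per-query ranks next to each result.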

ACES: A Contextual Engine for Search

• Architecture: server-side session management

• Session-boundary detection: probabilistic measure of query similarity

• Session-based ranking: use the KL-divergence retrieval model and estimate a query model based on

– Original query

– Displayed title and summary of viewed documents in the same session

– Previous queries in the same search session

• Session-based result display: show ranks of each doc w.r.t. all the previous queries

ACES System Architecture

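[Figure: ACES system architecture. The Web Browser sends the Query and Clickthrough Data over the Internet and receives the Search Result and Document Text. The Web/Application Server hosts the Search Engine (backed by a Text DB) and Profile Capture (backed by an RDBMS holding the User Profile).]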

Details of the Ranking Algorithm

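• Query model updating using past queries q1, q2,…, qk

\[
p(w \mid q') = p(w \mid q_1, q_2, \ldots, q_k) = \frac{1}{Z} \sum_{i=1}^{k} \alpha^{k-i} \, \frac{c(w, q_i)}{|q_i|}
\]

• Further query model updating using the displayed title and summary of the viewed documents s1, s2,…, sk

\[
p(w \mid q'') = \beta \, p(w \mid q') + (1 - \beta) \, \frac{1}{Z} \sum_{i=1}^{k} \alpha^{k-i} \, \frac{c(w, s_i)}{|s_i|}
\]

Here c(w, x) is the count of word w in x, |x| is the length of x, and Z = \sum_{i=1}^{k} \alpha^{k-i} normalizes the decay weights.

α is a decay factor to emphasize the most recent context; β is a parameter to control the influence of the clickthrough data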

Currently all parameters are set in an ad hoc way
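The two updating steps can be sketched as follows. This is an illustration under assumptions rather than the exact ACES formulas: the parameter names `alpha` (decay factor) and `beta` (clickthrough weight), the normalization of the decay weights, and the whitespace tokenization are all hypothetical choices.

```python
from collections import Counter


def unigram(text):
    """Relative word frequencies of a text; assumes a non-empty text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    n = len(tokens)
    return {w: c / n for w, c in counts.items()} if n else {}


def decayed_mixture(texts, alpha):
    """Mix unigram models of texts 1..k, weighting text i by alpha**(k - i).

    Weights are normalized to sum to 1, so the most recent text counts most
    when 0 < alpha < 1.
    """
    k = len(texts)
    weights = [alpha ** (k - i) for i in range(1, k + 1)]
    z = sum(weights)
    model = Counter()
    for weight, text in zip(weights, texts):
        for w, p in unigram(text).items():
            model[w] += weight / z * p
    return dict(model)


def session_query_model(queries, summaries, alpha=0.5, beta=0.5):
    """Step 1: model from past queries. Step 2: interpolate with viewed summaries."""
    q_model = decayed_mixture(queries, alpha)
    if not summaries:
        return q_model
    s_model = decayed_mixture(summaries, alpha)
    vocab = set(q_model) | set(s_model)
    return {w: beta * q_model.get(w, 0.0) + (1 - beta) * s_model.get(w, 0.0)
            for w in vocab}
```

For example, `session_query_model(["retrieval systems", "IR applications"], ["information retrieval survey"])` gives “retrieval” extra weight from the viewed summary, which is the kind of evidence that separates “information retrieval” from “infrared”.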

Demo: Exploiting Previous Queries in ACES

• TREC AP data + Topics 1-150 + judgments

• Allows us to compare traditional search and contextual search

ACES is still far away from a full-fledged session-based search engine…

Much further research needs to be done…

Architecture of Personalized System

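[Figure: architecture of the personalized system. Server side: Search Engine over Docs, which receives a query and returns the Top-N results (1. --- 2. --- 3. --- …). Client side: Personalized Agent, Session Manager, User model, Search context, Profile Collection; a query goes in and results come back.]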

[Figure: generative model diagram. Nodes: C, U, S, θQ, θD, q, d. Edge labels: Model Selection (for θQ and for θD), Query generation, Document generation.]


[Figure: ACES architecture. Components: Web Browser, Internet, Search Result, Document Text, Web/Application Server, Search Engine, Context Capturer, AP Text DB, RDBMS, User Profile.]
