Toward A Session-Based
Search Engine
Smitha Sriram, Xuehua Shen, ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
U.S.A.
Motivation
• Information retrieval is inherently an interactive process
– A user’s information need is unlikely to be fully satisfied by a single query execution
– A user often needs to interact with the system several times through query reformulation and document-browsing
– Thus in general, a query exists in a search session
• A search session provides lots of contextual information for a query that can be exploited (e.g., previous queries and clickthrough data)
• Such contextual information is mostly ignored in existing search engines
• We aim at developing a session-based search engine that can exploit such contextual information to improve retrieval
Traditional vs. Session-based Retrieval
Traditional (“1-query”) retrieval
– Retrieval System over a Document Collection
– Query = “IR applications”
– Results: D1 (infrared), D2 (infrared), D3 (retrieval), D4 (infrared), D5 (retrieval)
Session-based retrieval
– Query = “IR applications”
– Previous query = “retrieval systems”
– Frequency in viewed docs: infrared: 0, retrieval: 5
– …
– Results: D3 (retrieval), D5 (retrieval)
• “IR” can mean either “information retrieval” or “infrared”
• Session-based retrieval uses more contextual information and gives more accurate results
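The disambiguation in this example can be sketched in a few lines of Python. This is only an illustration of the idea, not the system's implementation: the function `pick_sense` and the indicator-term lists are hypothetical, and real sense detection would need more than raw term counts.

```python
from collections import Counter

def pick_sense(viewed_docs, sense_terms):
    """Return the sense whose indicator terms occur most often in viewed docs.

    viewed_docs: list of document texts viewed earlier in the session.
    sense_terms: dict mapping a sense label to its indicator terms (assumed given).
    """
    counts = Counter()
    for doc in viewed_docs:
        tokens = doc.lower().split()
        for sense, terms in sense_terms.items():
            counts[sense] += sum(tokens.count(t) for t in terms)
    # Ties are resolved in favor of the sense listed first.
    return max(sense_terms, key=lambda s: counts[s])

viewed = ["retrieval systems for text", "evaluating retrieval models"]
senses = {"information retrieval": ["retrieval"], "infrared": ["infrared"]}
chosen = pick_sense(viewed, senses)
```

With these viewed documents, "retrieval" occurs twice and "infrared" never, so the query "IR applications" would be read in the information-retrieval sense.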
Research Issues
• What is an appropriate architecture for supporting session-based retrieval?
– How to manage session information?
• How can we detect session boundaries?
• What contextual information should we exploit?
• How can we exploit such contextual information to improve document ranking?
• How can we display search results in the context of a session?
A Client-Server Architecture for Session-based IR
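[Figure: client-server architecture]
Server Side: Search Engine over Docs; receives the query and returns the Top-N results (1. --- 2. --- 3. --- …)
Client Side: Personalized Agent between the User and the Search Engine, with a Session Manager, Search context, User model, and Local Collection; it forwards the query and returns the results to the User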
Advantages of Server-Side Processing
• Persistent user profiles (imagine if a user often uses different machines)
• Have access to global user information
– Can exploit information about all users to identify common access patterns
– Can exploit information about similar users to help improve performance for any individual user
• Have access to all the documents
– Can perform more powerful statistical analysis (e.g., to identify most frequently accessed docs)
– Can improve document representation over time
Advantages of a Client-Side Agent
• Can capture more information about the user thus more accurate user modeling
– Can exploit the complete interaction history (e.g., easily capture click-through information)
– Can exploit a user’s other activities (e.g., searching immediately after reading an email)
– Can detect session boundary more accurately
• More scalable (“distributed personalization”)
• Alleviate the problem of privacy for personalization
Session Boundary Detection
• Detection is generally easier if done on the client side
– More information about the user can be exploited
– E.g., knowing that “logout” and “login” happened between two queries
• Server side has access to query co-occurrence patterns, which can help judge query coherence
• Possible clues for session boundary detection
– Time interval between queries
– Query coherence (based on word relatedness and/or query log analysis)
– Activities in between two queries
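As a rough illustration of how these clues could be combined, the following Python sketch uses the time interval, a crude word-overlap measure of query coherence, and an in-between activity flag. Every name and threshold here (`QueryEvent`, `is_new_session`, `max_gap`, `short_gap`, `min_overlap`) is an assumption made for the example; word overlap stands in for the word-relatedness or query-log measures named above.

```python
from dataclasses import dataclass

@dataclass
class QueryEvent:
    text: str
    timestamp: float  # seconds

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two query strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def is_new_session(prev: QueryEvent, cur: QueryEvent,
                   max_gap: float = 1800.0,   # assumed: 30 minutes always splits
                   short_gap: float = 300.0,  # assumed: under 5 minutes never splits
                   min_overlap: float = 0.2,  # assumed coherence threshold
                   logout_between: bool = False) -> bool:
    """Heuristic boundary test between two consecutive queries."""
    if logout_between:      # activity clue, e.g. a logout/login pair
        return True
    gap = cur.timestamp - prev.timestamp
    if gap > max_gap:       # time-interval clue
        return True
    if gap <= short_gap:    # quick reformulation: stay in the session
        return False
    # Medium gap: fall back on query coherence.
    return jaccard(prev.text, cur.text) < min_overlap

q1 = QueryEvent("retrieval systems", 0.0)
q2 = QueryEvent("IR applications", 60.0)
q3 = QueryEvent("retrieval systems evaluation", 600.0)
q4 = QueryEvent("weather", 600.0)
```

The short-gap rule keeps a fast reformulation such as "retrieval systems" followed by "IR applications" in one session even though the two queries share no words.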
Useful Session Context Information
• Previous queries in the same session
• Documents viewed and not viewed so far in the current session
• Other user activities during the same time as the current session
• Context information collected in a similar session by the current user or other users
• … …
Session-based Retrieval Models
• Framework: The risk minimization retrieval framework [Lafferty & Zhai 01, Zhai 02] can be naturally extended to support session-based retrieval
• One possible model (KL-divergence model)
– Retrieval = estimating a query model + estimating a doc model + computing their KL-divergence
– Session context information (and any other potentially useful information) can be used to estimate a better (session-based) query model
$$\hat{\theta}_D = \arg\max_{\theta_D} p(\theta_D \mid \text{Doc})$$
$$\hat{\theta}_Q = \arg\max_{\theta_Q} p(\theta_Q \mid \text{Query}, \text{User}, \text{CurrentSessionContext})$$
Refinement of this model leads to specific retrieval formulas
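A minimal Python sketch of KL-divergence ranking follows. It assumes unigram language models, maximum-likelihood estimates, and linear interpolation with a collection model for smoothing (the weight `lam` is an assumed value); the slides do not specify the estimation method, so these choices and all function names are illustrative only.

```python
import math
from collections import Counter

def mle(tokens):
    """Maximum-likelihood unigram model from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def doc_model(doc_tokens, collection_model, lam=0.5):
    """Document model smoothed by interpolation with the collection model."""
    doc = mle(doc_tokens)
    return lambda w: (1 - lam) * doc.get(w, 0.0) + lam * collection_model.get(w, 0.0)

def kl_score(query_model, p_doc):
    """Negative KL divergence D(query || doc); higher means a better match."""
    score = 0.0
    for w, pq in query_model.items():
        if pq > 0:
            pd = max(p_doc(w), 1e-12)  # floor for words unseen in the collection
            score -= pq * math.log(pq / pd)
    return score

def rank(query_model, docs, collection_model):
    """Return document ids sorted from best to worst match."""
    return sorted(docs,
                  key=lambda d: kl_score(query_model, doc_model(docs[d], collection_model)),
                  reverse=True)

docs = {
    "D1": "infrared sensor applications".split(),
    "D3": "information retrieval applications".split(),
}
collection = mle([t for toks in docs.values() for t in toks])
# A session-based query model that has shifted mass toward "retrieval".
query_model = {"applications": 0.5, "retrieval": 0.5}
ranked = rank(query_model, docs, collection)
```

Only the query model changes between traditional and session-based retrieval in this sketch; the document models and the scoring function stay the same.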
Session-based Result Presentation
• Retrieval results can be displayed in the context of the current session
– Previous search results in the session can be exploited to show which document has been consistently moving up in ranking as the user is reformulating the query
– All the queries in the session can be combined and analyzed to generate a subtopic space for the user’s information need, and documents can be organized and displayed in this space
• Session-based result presentation can
– Help a user digest the search results more effectively and more efficiently
– Help a user to quickly focus on the important concept/topic dimensions
– Help a user to figure out how to better formulate a query
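One way to find documents that keep moving up as the query is reformulated is to record each document's rank per query and look for strictly improving sequences. The sketch below is an assumption-laden illustration (the function names and the "strictly rising across all queries" criterion are not from the slides).

```python
def rank_history(rankings):
    """Map each doc id to its rank under every query of the session.

    rankings: list of ranked doc-id lists, one per query, in session order.
    A rank of None means the document was not returned for that query.
    """
    history = {}
    for step, ranking in enumerate(rankings):
        for pos, doc_id in enumerate(ranking, start=1):
            history.setdefault(doc_id, [None] * len(rankings))[step] = pos
    return history

def consistently_rising(history):
    """Doc ids whose rank improved at every reformulation."""
    rising = []
    for doc_id, ranks in history.items():
        if len(ranks) < 2 or None in ranks:
            continue
        if all(later < earlier for earlier, later in zip(ranks, ranks[1:])):
            rising.append(doc_id)
    return rising

session_rankings = [
    ["D1", "D2", "D3"],  # results for the first query
    ["D2", "D3", "D1"],  # results after one reformulation
    ["D3", "D2", "D1"],  # results after a second reformulation
]
history = rank_history(session_rankings)
rising = consistently_rising(history)
```

The per-document rank lists in `history` are also the data a display would need to show a document's rank with respect to every earlier query.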
ACES: A Contextual Engine for Search
• Architecture: server-side session management
• Session-boundary detection: probabilistic measure of query similarity
• Session-based ranking: use the KL-divergence retrieval model and estimate a query model based on
– Original query
– Displayed title and summary of viewed documents in the same session
– Previous queries in the same search session
• Session-based result display: show ranks of each doc w.r.t. all the previous queries
ACES System Architecture
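[Figure: ACES system architecture]
Web Browser: sends the Query and Clickthrough Data over the Internet; receives the Search Result and Document Text
Web/Application Server: contains the Search Engine (backed by a Text DB) and Profile Capture (backed by an RDBMS holding the User Profile)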
Details of the Ranking Algorithm
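• Query model updating using past queries q1, q2,…, qk
$$p(w \mid q') = p(w \mid q_1, q_2, \ldots, q_k) = \frac{1-\alpha}{1-\alpha^{k}} \sum_{i=1}^{k} \alpha^{k-i} \, \frac{c(w, q_i)}{|q_i|}$$
• Further query model updating using the displayed title and summary of the viewed documents s1, s2,…, sk
$$p(w \mid q'') = \beta \, p(w \mid q') + (1-\beta) \, \frac{1-\alpha}{1-\alpha^{k}} \sum_{i=1}^{k} \alpha^{k-i} \, \frac{c(w, s_i)}{|s_i|}$$
– c(w, x) is the count of word w in text x, and |x| is the length of x
• α is a decay factor to emphasize the most recent context
• β is a parameter to control the influence of the clickthrough data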
Currently all parameters are set in an ad hoc way
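The two updating steps can be sketched in Python as follows. This is a sketch under stated assumptions rather than the ACES code: it assumes the decay weight of the i-th of k items is `alpha ** (k - i)` with weights normalized to sum to one, and that the clickthrough model is mixed in by linear interpolation with weight `1 - beta`. The names `alpha`, `beta`, and all functions are placeholders, and the default values are arbitrary.

```python
from collections import Counter

def unigram(text):
    """Relative word frequencies c(w, x) / |x| for one text."""
    tokens = text.lower().split()
    n = len(tokens)
    return {w: c / n for w, c in Counter(tokens).items()} if n else {}

def decayed_mixture(texts, alpha):
    """Average the unigram models of texts, favoring the most recent ones.

    The i-th of k texts (1-based) gets weight alpha ** (k - i), normalized.
    """
    k = len(texts)
    weights = [alpha ** (k - i) for i in range(1, k + 1)]
    z = sum(weights)
    model = Counter()
    for wt, text in zip(weights, texts):
        for w, p in unigram(text).items():
            model[w] += wt / z * p
    return dict(model)

def session_query_model(queries, summaries, alpha=0.5, beta=0.7):
    """Query model from past queries, then interpolated with viewed summaries."""
    q_model = decayed_mixture(queries, alpha)
    if not summaries:
        return q_model
    s_model = decayed_mixture(summaries, alpha)
    vocab = set(q_model) | set(s_model)
    return {w: beta * q_model.get(w, 0.0) + (1 - beta) * s_model.get(w, 0.0)
            for w in vocab}

model = session_query_model(
    queries=["retrieval systems", "IR applications"],  # oldest first
    summaries=["information retrieval survey"],        # viewed title/summary text
)
```

In this example the current query's words ("ir", "applications") receive more mass than the older query's word "systems", while "retrieval" is boosted by the viewed summary.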
Demo: Exploiting Previous Queries in ACES
• TREC AP data + Topics 1-150 + judgments
• Allows us to compare traditional search and contextual search
ACES is still far away from a full-fledged session-based search engine…
Much further research needs to be done…
Architecture of Personalized System
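[Figure: architecture of the personalized system]
Server Side: Search Engine over Docs; receives the query and returns the Top-N results (1. --- 2. --- 3. --- …)
Client Side: Personalized Agent with a Session Manager, Search context, User model, and Profile Collection; it forwards the query and receives the results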
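[Figure: generative model with variables C, U, and S. Model selection yields the query model θQ and the document model θD; query generation produces q from θQ, and document generation produces d from θD]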
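[Figure: ACES architecture for the demo]
Web Browser: sends the Query and Clickthrough Data over the Internet; receives the Search Result and Document Text
Web/Application Server: contains the Search Engine (backed by the AP Text DB) and Context Capturer (backed by an RDBMS holding the User Profile)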