Designing States, Actions, and Rewards for Using POMDP in Session Search
DESIGNING STATES, ACTIONS, AND REWARDS FOR USING POMDP IN SESSION SEARCH
Jiyun Luo, Sicong Zhang, Xuchu Dong, Grace Hui Yang
InfoSense, Department of Computer Science
Georgetown University
{jl1749,sz303,xd47}@georgetown.edu
E.g., find what city and state Dulles airport is in; what shuttles, ride-sharing vans, and taxi cabs connect the airport to other cities; what hotels are close to the airport; what cheap off-airport parking is available; and which metro stops are close to Dulles airport.
DYNAMIC IR: A NEW PERSPECTIVE ON SEARCH
[Diagram: a user and a search engine interact repeatedly, driven by an information need]
CHARACTERISTICS OF DYNAMIC IR
• Trial-and-error
  ◦ q1 – "dulles hotels"
  ◦ q2 – "dulles airport"
  ◦ q3 – "dulles airport location"
  ◦ q4 – "dulles metro stop"
• Rich interactions
  ◦ Query formulation
  ◦ Document clicks
  ◦ Document examination
  ◦ Eye movements
  ◦ Mouse movements
  ◦ etc.
CHARACTERISTICS OF DYNAMIC IR
• Temporal dependency
[Diagram: an information need I drives n iterations; in iteration i the user issues query qi, the engine returns ranked documents Di, and the user clicks documents Ci]
REINFORCEMENT LEARNING (RL)
• Fits well in this trial-and-error setting
• Learning from repeated, varied attempts, continued until success
• The learner (also known as the agent) learns from its dynamic interactions with the world, rather than from a labeled dataset as in supervised learning
• The stochastic model assumes that the system's current state depends on the previous state and action in a non-deterministic manner
PARTIALLY OBSERVABLE MARKOV DECISION PROCESS (POMDP)
[Diagram: a chain of hidden states s0, s1, s2, s3, ...; taking action ai in state si yields reward ri and emits observation oi+1]
• Hidden states
• Actions
• Rewards
• Markov
• Long-term optimization
• Observations, beliefs
1 R. D. Smallwood et al., '73
GOAL OF THIS PAPER
• Study the design of states, actions, and reward functions for RL algorithms in session search
A MARKOV CHAIN OF DECISION MAKING STATES
[Luo, Zhang, and Yang SIGIR 2014]
WIN-WIN SEARCH: DUAL-AGENT STOCHASTIC GAME
• Partially Observable Markov Decision Process
• Two agents
  ◦ Cooperative game
  ◦ Joint optimization
• Hidden states
• Actions
• Rewards
• Markov
[Luo, Zhang, and Yang SIGIR 2014]
PARTIALLY OBSERVABLE MARKOV DECISION PROCESS (POMDP)
• A tuple (S, M, A, R, γ, O, Θ, B)
  ◦ S: state space
  ◦ M: state transition function
  ◦ A: actions
  ◦ R: reward function
  ◦ γ: discount factor, 0 < γ ≤ 1
  ◦ O: observations, symbols emitted according to the hidden state
  ◦ Θ: observation function; Θ(s, a, o) is the probability that o is observed when the system transitions into state s after taking action a, i.e., P(o|s, a)
  ◦ B: belief space; a belief is a probability distribution over the hidden states
1 R. D. Smallwood et al., '73
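The belief B is maintained by a Bayesian update that combines M and Θ. A minimal sketch of the generic POMDP belief update (not the authors' implementation; the two states, the "rerank" action, and all probabilities below are illustrative assumptions):

```python
# Generic POMDP belief update, in the (S, M, A, R, γ, O, Θ, B) notation:
#   b'(s') ∝ Θ(s', a, o) * Σ_s M(s, a, s') * b(s)
def update_belief(belief, action, obs, M, Theta):
    """belief: {state: prob}; M[(s, a, s2)] = P(s2|s, a); Theta[(s2, a, o)] = P(o|s2, a)."""
    new_belief = {}
    for s2 in belief:
        prior = sum(M.get((s, action, s2), 0.0) * p for s, p in belief.items())
        new_belief[s2] = Theta.get((s2, action, obs), 0.0) * prior
    z = sum(new_belief.values())  # normalizer: P(o | belief, a)
    return {s: p / z for s, p in new_belief.items()} if z > 0 else dict(belief)

# Illustrative two-state example: is the last result page relevant?
belief = {"REL": 0.5, "NONREL": 0.5}
M = {("REL", "rerank", "REL"): 0.9, ("REL", "rerank", "NONREL"): 0.1,
     ("NONREL", "rerank", "REL"): 0.3, ("NONREL", "rerank", "NONREL"): 0.7}
Theta = {("REL", "rerank", "click"): 0.8, ("NONREL", "rerank", "click"): 0.2}
belief = update_belief(belief, "rerank", "click", M, Theta)  # a click shifts mass toward REL
```

Observing a click raises the belief that the hidden state is "relevant", which is exactly the role B plays in the session-search models below.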
HIDDEN DECISION MAKING STATES
• SRT: Relevant & Exploitation
• SRR: Relevant & Exploration
• SNRT: Non-Relevant & Exploitation
• SNRR: Non-Relevant & Exploration
Example query transitions:
• scooter price → scooter stores
• collecting old US coins → selling old US coins
• Philadelphia NYC travel → Philadelphia NYC train
• Boston tourism → NYC tourism
[Luo, Zhang, and Yang SIGIR 2014]
ACTIONS
• User actions (Au)
  ◦ add query terms (+Δq)
  ◦ remove query terms (−Δq)
  ◦ keep query terms (qtheme)
• Search engine actions (Ase)
  ◦ increase / decrease / keep term weights
  ◦ switch on or off a search technique, e.g., to use or not to use query expansion
  ◦ adjust parameters in search techniques, e.g., select the best k for the top-k docs used in PRF
• Messages from the user (Σu)
  ◦ clicked documents
  ◦ SAT-clicked documents
• Messages from the search engine (Σse)
  ◦ top k returned documents
Messages are essentially documents that an agent believes are relevant.
[Luo, Zhang, and Yang SIGIR 2014]
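The user actions above (+Δq, −Δq, qtheme) can be read off two consecutive queries with simple set arithmetic. A minimal sketch (whitespace tokenization is an assumption; the models' actual tokenization may differ):

```python
def query_change(prev_query, cur_query):
    """Split a query transition into kept (qtheme), added (+Δq), and removed (−Δq) terms."""
    prev_terms, cur_terms = set(prev_query.split()), set(cur_query.split())
    return {
        "q_theme": prev_terms & cur_terms,  # terms the user kept
        "+dq": cur_terms - prev_terms,      # terms the user added
        "-dq": prev_terms - cur_terms,      # terms the user removed
    }

change = query_change("dulles airport", "dulles airport location")
# kept: {"dulles", "airport"}; added: {"location"}; removed: none
```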
2ND MODEL: QUERY CHANGE MODEL
• Based on the Markov Decision Process (MDP)
• States: queries (observable)
• Actions
  ◦ User actions: add / remove / keep query terms; these correspond nicely to our definition of query change
  ◦ Search engine actions: increase / decrease / keep term weights
• Rewards: nDCG
[Guan, Zhang, and Yang SIGIR 2013]
SEARCH ENGINE AGENT'S ACTIONS

Query change | term ∈ Di−1 | Action    | Example
qtheme       | Y           | increase  | "pocono mountain" in s6
qtheme       | N           | increase  | "france world cup 98 reaction" in s28: france world cup 98 reaction stock market → france world cup 98 reaction
+Δq          | Y           | decrease  | 'policy' in s37: Merck lobbyists → Merck lobbyists US policy
+Δq          | N           | increase  | 'US' in s37: Merck lobbyists → Merck lobbyists US policy
−Δq          | Y           | decrease  | 'reaction' in s28: france world cup 98 reaction → france world cup 98
−Δq          | N           | no change | 'legislation' in s32: bollywood legislation → bollywood law

[Guan, Zhang, and Yang SIGIR 2013]
QUERY CHANGE RETRIEVAL MODEL (QCM)
• The Bellman equation gives the optimal value for an MDP:

    V*(s) = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V*(s') ]

• The reward function is used as the document relevance score function and is derived backwards from the Bellman equation:

    Score(qi, d) = P(qi|d) + γ Σ_a P(qi|qi−1, Di−1, a) max_{Di−1} P(qi−1|Di−1)

where P(qi|d) is the current reward (relevance score), P(qi|qi−1, Di−1, a) is the query transition model, and max_{Di−1} P(qi−1|Di−1) is the maximum past relevance.

[Guan, Zhang, and Yang SIGIR 2013]
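The Bellman fixed point above can be computed by value iteration. A generic sketch on a toy two-state MDP (the states, rewards, and transition probabilities are made up for illustration; this is not the session-search instantiation):

```python
def value_iteration(states, actions, R, P, gamma=0.9, tol=1e-8):
    """Iterate V(s) = max_a [ R(s, a) + γ Σ_s' P(s'|s, a) V(s') ] to its fixed point."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(R[(s, a)] + gamma * sum(P[(s, a, s2)] * V[s2] for s2 in states)
                        for a in actions)
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

# Toy MDP: staying in s1 via a1 pays 2 per step, so V*(s1) = 2 / (1 - γ) = 20
states, actions = ["s0", "s1"], ["a0", "a1"]
R = {("s0", "a0"): 1.0, ("s0", "a1"): 0.0, ("s1", "a0"): 0.0, ("s1", "a1"): 2.0}
P = {("s0", "a0", "s0"): 1.0, ("s0", "a0", "s1"): 0.0,
     ("s0", "a1", "s0"): 0.0, ("s0", "a1", "s1"): 1.0,
     ("s1", "a0", "s0"): 1.0, ("s1", "a0", "s1"): 0.0,
     ("s1", "a1", "s0"): 0.0, ("s1", "a1", "s1"): 1.0}
V = value_iteration(states, actions, R, P)
```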
CALCULATING THE TRANSITION MODEL
• According to query change and search engine actions:

    Score(qi, d) = log P(qi|d)
        + α Σ_{t ∈ qtheme} [1 − P(t|D*i−1)] log P(t|d)        (increase weights for theme terms)
        − β Σ_{t ∈ +Δq, t ∈ D*i−1} P(t|D*i−1) log P(t|d)      (decrease weights for old added terms)
        + ε Σ_{t ∈ +Δq, t ∉ D*i−1} idf(t) log P(t|d)          (increase weights for novel added terms)
        − δ Σ_{t ∈ −Δq} P(t|D*i−1) log P(t|d)                 (decrease weights for removed terms)

where log P(qi|d) is the current reward (relevance score) and D*i−1 denotes the previously retrieved relevant documents.

[Guan, Zhang, and Yang SIGIR 2013]
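A Python sketch of this QCM term-weight adjustment (the probabilities, idf values, and the α, β, ε, δ defaults below are illustrative assumptions, not the paper's tuned settings):

```python
import math

def qcm_score(p_q_d, term_groups, p_t_prev, p_t_d, idf,
              alpha=2.2, beta=1.8, epsilon=0.07, delta=0.4):
    """QCM-style score: start from log P(qi|d), then adjust per term group."""
    score = math.log(p_q_d)
    for t in term_groups["theme"]:        # qtheme terms: increase weight
        score += alpha * (1 - p_t_prev[t]) * math.log(p_t_d[t])
    for t in term_groups["added_old"]:    # +Δq terms seen in D*_{i-1}: decrease
        score -= beta * p_t_prev[t] * math.log(p_t_d[t])
    for t in term_groups["added_novel"]:  # novel +Δq terms: increase via idf
        score += epsilon * idf[t] * math.log(p_t_d[t])
    for t in term_groups["removed"]:      # −Δq terms: decrease
        score -= delta * p_t_prev[t] * math.log(p_t_d[t])
    return score

# Toy query transition "dulles airport" → "dulles airport location"
term_groups = {"theme": ["dulles"], "added_old": [],
               "added_novel": ["location"], "removed": []}
p_t_prev = {"dulles": 0.3}                 # P(t|D*_{i-1}), illustrative
p_t_d = {"dulles": 0.1, "location": 0.05}  # P(t|d), illustrative
idf = {"location": 3.0}
score = qcm_score(0.01, term_groups, p_t_prev, p_t_d, idf)
```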
RELATED WORK
• Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. Balancing Exploration and Exploitation in Learning to Rank Online. In ECIR '11.
• Xiaoran Jin, Marc Sloan, and Jun Wang. Interactive Exploratory Search for Multi Page Search Results. In WWW '13.
• Xuehua Shen, Bin Tan, and ChengXiang Zhai. Implicit User Modeling for Personalized Search. In CIKM '05.
• Norbert Fuhr. A Probability Ranking Principle for Interactive Information Retrieval. In IRJ, 11(3), 2008.
STATE DESIGN OPTIONS
• (S1) Fixed number of states
  ◦ two binary relevance states: "Relevant" or "Irrelevant"
  ◦ four states: whether the previously retrieved documents are relevant, and whether the user desires to explore
• (S2) Varying number of states
  ◦ model queries as states: n queries → n states
  ◦ infinitely many states: document relevance score distributions as states, where one document corresponds to one state
ACTION DESIGN OPTIONS
• (A1) Technology selection
  ◦ a meta-level modeling of actions: implement multiple search methods and select the best method for each query
  ◦ select the best parameters for each method
• (A2) Term weight adjustment
  ◦ the actions are adjusted term weights
• (A3) Ranked list
  ◦ one possible ranking of a list of documents is one single action
  ◦ if the corpus size is N and the number of retrieved documents is n, then the size of the action space is:

    P(N, n) = N(N − 1)...(N − n + 1) = N! / (N − n)!
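A quick sanity check of how fast this count grows, using Python's math.perm (the corpus and list sizes are toy numbers):

```python
from math import perm  # perm(N, n) == N! / (N - n)!  (Python 3.8+)

N, n = 1000, 10          # toy sizes: corpus of 1,000 docs, ranked list of 10
size = perm(N, n)        # number of distinct ranked lists under (A3)
assert size > 10 ** 29   # already astronomically large for a tiny corpus
```

For ClueWeb-scale corpora and n = 2,000, the raw (A3) action space is therefore hopeless to enumerate, which is why ranked-list actions demand approximation.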
REWARD FUNCTION DESIGN OPTIONS
• (R1) Explicit feedback
  ◦ rewards generated from the user's relevance assessments: nDCG, MAP, etc.
• (R2) Implicit feedback
  ◦ rewards from implicit feedback obtained from user behavior: clicks, SAT clicks
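The explicit-feedback reward (R1) is typically nDCG. A minimal sketch using the standard exponential-gain formulation (the graded relevance labels are illustrative):

```python
import math

def dcg(rels, k):
    """DCG@k with the (2^rel - 1) / log2(rank + 1) gain form."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg(rels, k):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

# A perfectly ordered ranking scores 1.0; swapping documents lowers the reward
assert ndcg([2, 1, 0], 3) == 1.0
assert ndcg([0, 1, 2], 3) < 1.0
```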
SYSTEMS UNDER COMPARISON
• Luo et al. Win-Win Search: Dual-Agent Stochastic Game in Session Search. SIGIR '14
• Zhang et al. A POMDP Model for Content-Free Document Re-ranking. SIGIR '14
• Guan et al. Utilizing Query Change for Session Search. SIGIR '13
• Shen et al. Implicit User Modeling for Personalized Search. CIKM '05
• Jin et al. Interactive Exploratory Search for Multi Page Search Results. WWW '13
S1A1R1(win-win)
S1A3R2
S2A2R1(QCM)
S2A1R1(UCAIR)
S2A3R1(IES)
S1A1R2
S1A2R1
S2A1R1
EXPERIMENTS
• Evaluate on the TREC 2012 and 2013 Session Tracks
  ◦ The session logs contain: session topic, user queries, previously retrieved URLs and snippets, user clicks, dwell time, etc.
  ◦ Task: retrieve 2,000 documents for the last query in each session
  ◦ Evaluation is based on the whole session; metrics include nDCG@10, nDCG, nERR@10, and MAP, plus wall clock time, CPU cycles, and Big O notation
• Datasets
  ◦ ClueWeb09 CatB and ClueWeb12 CatB
  ◦ spam documents are removed
  ◦ duplicate documents are removed
EFFICIENCY VS. NUMBER OF ACTIONS ON TREC 2012
• As the number of actions increases, efficiency tends to drop dramatically
• S1A3R2, S1A2R1, S2A1R1 (UCAIR), S2A2R1 (QCM), and S2A1R1 are efficient
• S1A1R1 (win-win) and S1A1R2 are moderately efficient
• S2A3R1 (IES) is the slowest system
ACCURACY VS. EFFICIENCY
[Plots: accuracy vs. efficiency on TREC 2012 and TREC 2013]
• Accuracy tends to increase as efficiency decreases
• S2A1R1 (UCAIR) strikes a good balance between accuracy and efficiency
• S1A1R1 (win-win) gives impressive accuracy with a fair degree of efficiency
OUR RECOMMENDATION
• If the focus is on accuracy
• If the time limit is within one hour
• If you want a balance of accuracy and efficiency
• Note: the number of actions heavily affects efficiency and needs to be carefully designed
CONCLUSIONS
• POMDPs are well suited to modeling session search
  ◦ information-seeking behaviors
• Design questions
  ◦ States: what changes with each time step?
  ◦ Actions: how does our system change the state?
  ◦ Rewards: how can we measure feedback or effectiveness?
• Designing them lies somewhere between an art and empirical experimentation
• Balance between efficiency and accuracy
RESOURCES
• InfoSense
  ◦ http://infosense.cs.georgetown.edu/
• Dynamic IR website
  ◦ Tutorials: http://www.dynamic-ir-modeling.org/
• Live online search engine – Dumpling
  ◦ http://dumplingproject.org
• Upcoming book
  ◦ Dynamic Information Retrieval Modeling
• TREC 2015 Dynamic Domain Track
  ◦ http://trec-dd.org/
  ◦ Please participate if you are interested in interactive and dynamic search