Designing States, Actions, and Rewards for Using POMDP in Session Search
DESIGNING STATES, ACTIONS, AND REWARDS FOR USING POMDP IN SESSION SEARCH
Jiyun Luo, Sicong Zhang, Xuchu Dong, Grace Hui Yang
InfoSense, Department of Computer Science
Georgetown University
{jl1749,sz303,xd47}@georgetown.edu
E.g., find what city and state Dulles airport is in; what shuttles, ride-sharing vans, and taxi cabs connect the airport to other cities; what hotels are close to the airport; what cheap off-airport parking is available; and which metro stops are close to Dulles airport.
DYNAMIC IR: A NEW PERSPECTIVE ON SEARCH
[Diagram: a user and a search engine interact repeatedly, driven by an information need]
CHARACTERISTICS OF DYNAMIC IR
• Trial-and-error
  ◦ q1 – "dulles hotels"
  ◦ q2 – "dulles airport"
  ◦ q3 – "dulles airport location"
  ◦ q4 – "dulles metro stop"
• Rich interactions
  ◦ Query formulation
  ◦ Document clicks
  ◦ Document examination
  ◦ Eye movements
  ◦ Mouse movements
  ◦ etc.
CHARACTERISTICS OF DYNAMIC IR
• Temporal dependency
[Diagram: an information need I drives n iterations; in iteration i the user issues query qi, the engine returns ranked documents Di, and the user clicks documents Ci]
REINFORCEMENT LEARNING (RL)
• Fits well in this trial-and-error setting
• Learning from repeated, varied attempts, continued until success
• The learner (also known as the agent) learns from its dynamic interactions with the world, rather than from a labeled dataset as in supervised learning
• The stochastic model assumes that the system's current state depends on the previous state and action in a non-deterministic manner
PARTIALLY OBSERVABLE MARKOV DECISION PROCESS (POMDP)
[Diagram: a chain of hidden states s0, s1, s2, s3, ...; taking action ai in state si yields reward ri and emits observation oi+1]
• Hidden states
• Actions
• Rewards
• Markov
• Long-term optimization
• Observations, beliefs
1 R. D. Smallwood et al., '73
GOAL OF THIS PAPER
• Study the design of states, actions, and reward functions for RL algorithms in session search
A MARKOV CHAIN OF DECISION MAKING STATES
[Luo, Zhang, and Yang SIGIR 2014]
WIN-WIN SEARCH: DUAL-AGENT STOCHASTIC GAME
• Partially Observable Markov Decision Process
• Two agents
  ◦ Cooperative game
  ◦ Joint optimization
• Hidden states
• Actions
• Rewards
• Markov
[Luo, Zhang, and Yang SIGIR 2014]
PARTIALLY OBSERVABLE MARKOV DECISION PROCESS (POMDP)
• A tuple (S, M, A, R, γ, O, Θ, B)
  ◦ S: state space
  ◦ M: state transition function
  ◦ A: actions
  ◦ R: reward function
  ◦ γ: discount factor, 0 < γ ≤ 1
  ◦ O: observations, symbols emitted according to the hidden state
  ◦ Θ: observation function; Θ(s, a, o) is the probability that o is observed when the system transitions into state s after taking action a, i.e., P(o|s, a)
  ◦ B: belief space; a belief is a probability distribution over the hidden states
1 R. D. Smallwood et al., '73
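The belief B is maintained by a Bayesian update that combines M and Θ. A minimal sketch of the generic POMDP belief update (not the authors' implementation; the two states, the "rerank" action, and all probabilities below are illustrative assumptions):

```python
# Generic POMDP belief update, in the (S, M, A, R, γ, O, Θ, B) notation:
#   b'(s') ∝ Θ(s', a, o) * Σ_s M(s, a, s') * b(s)
def update_belief(belief, action, obs, M, Theta):
    """belief: {state: prob}; M[(s, a, s2)] = P(s2|s, a); Theta[(s2, a, o)] = P(o|s2, a)."""
    new_belief = {}
    for s2 in belief:
        prior = sum(M.get((s, action, s2), 0.0) * p for s, p in belief.items())
        new_belief[s2] = Theta.get((s2, action, obs), 0.0) * prior
    z = sum(new_belief.values())  # normalizer: P(o | belief, a)
    return {s: p / z for s, p in new_belief.items()} if z > 0 else dict(belief)

# Illustrative two-state example: is the last result page relevant?
belief = {"REL": 0.5, "NONREL": 0.5}
M = {("REL", "rerank", "REL"): 0.9, ("REL", "rerank", "NONREL"): 0.1,
     ("NONREL", "rerank", "REL"): 0.3, ("NONREL", "rerank", "NONREL"): 0.7}
Theta = {("REL", "rerank", "click"): 0.8, ("NONREL", "rerank", "click"): 0.2}
belief = update_belief(belief, "rerank", "click", M, Theta)  # a click shifts mass toward REL
```

Observing a click raises the belief that the hidden state is "relevant", which is exactly the role B plays in the session-search models below.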
HIDDEN DECISION MAKING STATES
• SRT: Relevant & Exploitation
• SRR: Relevant & Exploration
• SNRT: Non-Relevant & Exploitation
• SNRR: Non-Relevant & Exploration
Example query transitions:
• scooter price → scooter stores
• collecting old US coins → selling old US coins
• Philadelphia NYC travel → Philadelphia NYC train
• Boston tourism → NYC tourism
[Luo, Zhang, and Yang SIGIR 2014]
ACTIONS
• User actions (Au)
  ◦ add query terms (+Δq)
  ◦ remove query terms (−Δq)
  ◦ keep query terms (qtheme)
• Search engine actions (Ase)
  ◦ increase / decrease / keep term weights
  ◦ switch on or off a search technique, e.g., to use or not to use query expansion
  ◦ adjust parameters in search techniques, e.g., select the best k for the top-k docs used in PRF
• Messages from the user (Σu)
  ◦ clicked documents
  ◦ SAT-clicked documents
• Messages from the search engine (Σse)
  ◦ top k returned documents
Messages are essentially documents that an agent believes are relevant.
[Luo, Zhang, and Yang SIGIR 2014]
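The user actions above (+Δq, −Δq, qtheme) can be read off two consecutive queries with simple set arithmetic. A minimal sketch (whitespace tokenization is an assumption; the models' actual tokenization may differ):

```python
def query_change(prev_query, cur_query):
    """Split a query transition into kept (qtheme), added (+Δq), and removed (−Δq) terms."""
    prev_terms, cur_terms = set(prev_query.split()), set(cur_query.split())
    return {
        "q_theme": prev_terms & cur_terms,  # terms the user kept
        "+dq": cur_terms - prev_terms,      # terms the user added
        "-dq": prev_terms - cur_terms,      # terms the user removed
    }

change = query_change("dulles airport", "dulles airport location")
# kept: {"dulles", "airport"}; added: {"location"}; removed: none
```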
2ND MODEL: QUERY CHANGE MODEL
• Based on the Markov Decision Process (MDP)
• States: queries (observable)
• Actions
  ◦ User actions: add / remove / keep query terms; these correspond nicely to our definition of query change
  ◦ Search engine actions: increase / decrease / keep term weights
• Rewards: nDCG
[Guan, Zhang, and Yang SIGIR 2013]
SEARCH ENGINE AGENT'S ACTIONS

Query change | term ∈ Di−1 | Action    | Example
qtheme       | Y           | increase  | "pocono mountain" in s6
qtheme       | N           | increase  | "france world cup 98 reaction" in s28: france world cup 98 reaction stock market → france world cup 98 reaction
+Δq          | Y           | decrease  | 'policy' in s37: Merck lobbyists → Merck lobbyists US policy
+Δq          | N           | increase  | 'US' in s37: Merck lobbyists → Merck lobbyists US policy
−Δq          | Y           | decrease  | 'reaction' in s28: france world cup 98 reaction → france world cup 98
−Δq          | N           | no change | 'legislation' in s32: bollywood legislation → bollywood law

[Guan, Zhang, and Yang SIGIR 2013]
QUERY CHANGE RETRIEVAL MODEL (QCM)
• The Bellman equation gives the optimal value for an MDP:

    V*(s) = max_a [ R(s, a) + γ Σ_{s'} P(s'|s, a) V*(s') ]

• The reward function is used as the document relevance score function and is derived backwards from the Bellman equation:

    Score(qi, d) = P(qi|d) + γ Σ_a P(qi|qi−1, Di−1, a) max_{Di−1} P(qi−1|Di−1)

where P(qi|d) is the current reward (relevance score), P(qi|qi−1, Di−1, a) is the query transition model, and max_{Di−1} P(qi−1|Di−1) is the maximum past relevance.

[Guan, Zhang, and Yang SIGIR 2013]
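The Bellman fixed point above can be computed by value iteration. A generic sketch on a toy two-state MDP (the states, rewards, and transition probabilities are made up for illustration; this is not the session-search instantiation):

```python
def value_iteration(states, actions, R, P, gamma=0.9, tol=1e-8):
    """Iterate V(s) = max_a [ R(s, a) + γ Σ_s' P(s'|s, a) V(s') ] to its fixed point."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(R[(s, a)] + gamma * sum(P[(s, a, s2)] * V[s2] for s2 in states)
                        for a in actions)
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

# Toy MDP: staying in s1 via a1 pays 2 per step, so V*(s1) = 2 / (1 - γ) = 20
states, actions = ["s0", "s1"], ["a0", "a1"]
R = {("s0", "a0"): 1.0, ("s0", "a1"): 0.0, ("s1", "a0"): 0.0, ("s1", "a1"): 2.0}
P = {("s0", "a0", "s0"): 1.0, ("s0", "a0", "s1"): 0.0,
     ("s0", "a1", "s0"): 0.0, ("s0", "a1", "s1"): 1.0,
     ("s1", "a0", "s0"): 1.0, ("s1", "a0", "s1"): 0.0,
     ("s1", "a1", "s0"): 0.0, ("s1", "a1", "s1"): 1.0}
V = value_iteration(states, actions, R, P)
```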
CALCULATING THE TRANSITION MODEL
• According to query change and search engine actions:

    Score(qi, d) = log P(qi|d)
        + α Σ_{t ∈ qtheme} [1 − P(t|D*i−1)] log P(t|d)        (increase weights for theme terms)
        − β Σ_{t ∈ +Δq, t ∈ D*i−1} P(t|D*i−1) log P(t|d)      (decrease weights for old added terms)
        + ε Σ_{t ∈ +Δq, t ∉ D*i−1} idf(t) log P(t|d)          (increase weights for novel added terms)
        − δ Σ_{t ∈ −Δq} P(t|D*i−1) log P(t|d)                 (decrease weights for removed terms)

where log P(qi|d) is the current reward (relevance score) and D*i−1 denotes the previously retrieved relevant documents.

[Guan, Zhang, and Yang SIGIR 2013]
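A Python sketch of this QCM term-weight adjustment (the probabilities, idf values, and the α, β, ε, δ defaults below are illustrative assumptions, not the paper's tuned settings):

```python
import math

def qcm_score(p_q_d, term_groups, p_t_prev, p_t_d, idf,
              alpha=2.2, beta=1.8, epsilon=0.07, delta=0.4):
    """QCM-style score: start from log P(qi|d), then adjust per term group."""
    score = math.log(p_q_d)
    for t in term_groups["theme"]:        # qtheme terms: increase weight
        score += alpha * (1 - p_t_prev[t]) * math.log(p_t_d[t])
    for t in term_groups["added_old"]:    # +Δq terms seen in D*_{i-1}: decrease
        score -= beta * p_t_prev[t] * math.log(p_t_d[t])
    for t in term_groups["added_novel"]:  # novel +Δq terms: increase via idf
        score += epsilon * idf[t] * math.log(p_t_d[t])
    for t in term_groups["removed"]:      # −Δq terms: decrease
        score -= delta * p_t_prev[t] * math.log(p_t_d[t])
    return score

# Toy query transition "dulles airport" → "dulles airport location"
term_groups = {"theme": ["dulles"], "added_old": [],
               "added_novel": ["location"], "removed": []}
p_t_prev = {"dulles": 0.3}                 # P(t|D*_{i-1}), illustrative
p_t_d = {"dulles": 0.1, "location": 0.05}  # P(t|d), illustrative
idf = {"location": 3.0}
score = qcm_score(0.01, term_groups, p_t_prev, p_t_d, idf)
```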
RELATED WORK
• Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. Balancing Exploration and Exploitation in Learning to Rank Online. In ECIR '11.
• Xiaoran Jin, Marc Sloan, and Jun Wang. Interactive Exploratory Search for Multi Page Search Results. In WWW '13.
• Xuehua Shen, Bin Tan, and ChengXiang Zhai. Implicit User Modeling for Personalized Search. In CIKM '05.
• Norbert Fuhr. A Probability Ranking Principle for Interactive Information Retrieval. In IRJ, 11(3), 2008.
STATE DESIGN OPTIONS
• (S1) Fixed number of states
  ◦ two binary relevance states: "Relevant" or "Irrelevant"
  ◦ four states: whether the previously retrieved documents are relevant, and whether the user desires to explore
• (S2) Varying number of states
  ◦ model queries as states: n queries → n states
  ◦ infinitely many states: document relevance score distributions as states, where one document corresponds to one state
ACTION DESIGN OPTIONS
• (A1) Technology selection
  ◦ a meta-level modeling of actions: implement multiple search methods and select the best method for each query
  ◦ select the best parameters for each method
• (A2) Term weight adjustment
  ◦ the actions are adjusted term weights
• (A3) Ranked list
  ◦ one possible ranking of a list of documents is one single action
  ◦ if the corpus size is N and the number of retrieved documents is n, then the size of the action space is:

    P(N, n) = N(N − 1)...(N − n + 1) = N! / (N − n)!
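A quick sanity check of how fast this count grows, using Python's math.perm (the corpus and list sizes are toy numbers):

```python
from math import perm  # perm(N, n) == N! / (N - n)!  (Python 3.8+)

N, n = 1000, 10          # toy sizes: corpus of 1,000 docs, ranked list of 10
size = perm(N, n)        # number of distinct ranked lists under (A3)
assert size > 10 ** 29   # already astronomically large for a tiny corpus
```

For ClueWeb-scale corpora and n = 2,000, the raw (A3) action space is therefore hopeless to enumerate, which is why ranked-list actions demand approximation.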
REWARD FUNCTION DESIGN OPTIONS
• (R1) Explicit feedback
  ◦ rewards generated from the user's relevance assessments: nDCG, MAP, etc.
• (R2) Implicit feedback
  ◦ rewards from implicit feedback obtained from user behavior: clicks, SAT clicks
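The explicit-feedback reward (R1) is typically nDCG. A minimal sketch using the standard exponential-gain formulation (the graded relevance labels are illustrative):

```python
import math

def dcg(rels, k):
    """DCG@k with the (2^rel - 1) / log2(rank + 1) gain form."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg(rels, k):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

# A perfectly ordered ranking scores 1.0; swapping documents lowers the reward
assert ndcg([2, 1, 0], 3) == 1.0
assert ndcg([0, 1, 2], 3) < 1.0
```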
SYSTEMS UNDER COMPARISON
• Luo et al. Win-Win Search: Dual-Agent Stochastic Game in Session Search. SIGIR '14
• Zhang et al. A POMDP Model for Content-Free Document Re-ranking. SIGIR '14
• Guan et al. Utilizing Query Change for Session Search. SIGIR '13
• Shen et al. Implicit User Modeling for Personalized Search. CIKM '05
• Jin et al. Interactive Exploratory Search for Multi Page Search Results. WWW '13
S1A1R1(win-win)
S1A3R2
S2A2R1(QCM)
S2A1R1(UCAIR)
S2A3R1(IES)
S1A1R2
S1A2R1
S2A1R1
EXPERIMENTS
• Evaluate on the TREC 2012 and 2013 Session Tracks
  ◦ The session logs contain: session topic, user queries, previously retrieved URLs and snippets, user clicks, dwell time, etc.
  ◦ Task: retrieve 2,000 documents for the last query in each session
  ◦ Evaluation is based on the whole session; metrics include nDCG@10, nDCG, nERR@10, and MAP, plus wall clock time, CPU cycles, and Big O notation
• Datasets
  ◦ ClueWeb09 CatB and ClueWeb12 CatB
  ◦ spam documents are removed
  ◦ duplicate documents are removed
EFFICIENCY VS. NUMBER OF ACTIONS ON TREC 2012
• As the number of actions increases, efficiency tends to drop dramatically
• S1A3R2, S1A2R1, S2A1R1 (UCAIR), S2A2R1 (QCM), and S2A1R1 are efficient
• S1A1R1 (win-win) and S1A1R2 are moderately efficient
• S2A3R1 (IES) is the slowest system
ACCURACY VS. EFFICIENCY
[Plots: accuracy vs. efficiency on TREC 2012 and TREC 2013]
• Accuracy tends to increase as efficiency decreases
• S2A1R1 (UCAIR) strikes a good balance between accuracy and efficiency
• S1A1R1 (win-win) gives impressive accuracy with a fair degree of efficiency
OUR RECOMMENDATION
• If the focus is on accuracy
• If the time limit is within one hour
• If you want a balance of accuracy and efficiency
• Note: the number of actions heavily affects efficiency and needs to be carefully designed
CONCLUSIONS
• POMDPs are well suited to modeling session search
  ◦ information-seeking behaviors
• Design questions
  ◦ States: what changes with each time step?
  ◦ Actions: how does our system change the state?
  ◦ Rewards: how can we measure feedback or effectiveness?
• Designing them lies somewhere between an art and empirical experimentation
• Balance between efficiency and accuracy
RESOURCES
• InfoSense
  ◦ http://infosense.cs.georgetown.edu/
• Dynamic IR website
  ◦ Tutorials: http://www.dynamic-ir-modeling.org/
• Live online search engine – Dumpling
  ◦ http://dumplingproject.org
• Upcoming book
  ◦ Dynamic Information Retrieval Modeling
• TREC 2015 Dynamic Domain Track
  ◦ http://trec-dd.org/
  ◦ Please participate if you are interested in interactive and dynamic search