Batch Learning from Bandit Feedback
CS7792 Counterfactual Machine Learning – Fall 2018
Thorsten Joachims
Departments of Computer Science and Information Science, Cornell University
• A. Swaminathan, T. Joachims. Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization. JMLR Special Issue in Memory of Alexey Chervonenkis, 16(1):1731–1755, 2015.
• T. Joachims, A. Swaminathan, M. de Rijke. Deep Learning with Logged Bandit Feedback. In ICLR, 2018.
Interactive Systems
• Examples
  – Ad placement
  – Search engines
  – Entertainment media
  – E-commerce
  – Smart homes
• Log files
  – Measure and optimize performance
  – Gathering and maintenance of knowledge
  – Personalization
Batch Learning from Bandit Feedback
• Data: $S = ((x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n))$ ("bandit" feedback)
• Properties
  – Contexts $x_i$ drawn i.i.d. from unknown $P(X)$
  – Actions $y_i$ selected by existing system $\pi_0: X \rightarrow Y$
  – Losses $\delta_i$ drawn i.i.d. from unknown $P(\delta \mid x_i, y_i)$
• Goal of learning
  – Find new system $\pi$ that selects $y$ with better $\delta$
[Figure: a context $x$ is fed to the logging policy $\pi_0$, which takes an action $y$ with some propensity and observes a reward/loss]
[Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2014]
Learning Settings

                   Full-Information (Labeled) Feedback    Partial-Information (e.g. Bandit) Feedback
Online Learning    Perceptron, Winnow, etc.               EXP3, UCB1, etc.
Batch Learning     SVM, Random Forests, etc.              ?
Comparison with Supervised Learning

Batch Learning from Bandit Feedback vs. Conventional Supervised Learning:
• Train example: $(x, y, \delta)$ vs. $(x, y^*)$
• Context $x$: drawn i.i.d. from unknown $P(X)$ in both settings
• Action $y$: selected by existing system $\pi_0: X \rightarrow Y$ vs. N/A
• Feedback $\delta$: observe $\delta(x, y)$ only for the $y$ chosen by $\pi_0$ vs. assume a known loss function $\Delta(y, y^*)$, so the feedback $\delta(x, y)$ is known for every possible $y$
Outline of Lecture
• Batch Learning from Bandit Feedback (BLBF)
  $S = ((x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n))$; find new policy $\pi$ that selects $y$ with better $\delta$
• Learning Principle for BLBF
  – Hypothesis space, risk, empirical risk, and overfitting
  – Learning principle: Counterfactual Risk Minimization
• Learning Algorithms for BLBF
  – POEM: bandit training of CRF policies for structured outputs
  – BanditNet: bandit training of deep network policies
Hypothesis Space
Definition [Stochastic Hypothesis / Policy]: Given context $x$, the hypothesis/policy $\pi$ selects action $y$ with probability $\pi(y \mid x)$.
Note: stochastic prediction rules $\supseteq$ deterministic prediction rules.
[Figure: two example action distributions $\pi_1(Y \mid x)$ and $\pi_2(Y \mid x)$ over $Y$]
Risk
Definition [Expected Loss (i.e. Risk)]: The expected loss / risk $R(\pi)$ of policy $\pi$ is
$$R(\pi) = \int\!\!\int \delta(x, y)\, \pi(y \mid x)\, P(x)\, dx\, dy$$
[Figure: example action distributions $\pi_1(Y \mid x)$ and $\pi_2(Y \mid x)$]
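To make the definition concrete, here is a minimal sketch (assuming a small discrete context and action space with hypothetical arrays `P_x`, `delta`, and `pi`, none of which come from the lecture) that computes $R(\pi)$ by summing over the joint distribution:

```python
import numpy as np

# Hypothetical discrete setup: 3 contexts, 4 actions.
P_x = np.array([0.5, 0.3, 0.2])                        # P(x), sums to 1
delta = np.random.default_rng(0).normal(size=(3, 4))   # loss table delta(x, y)
pi = np.full((3, 4), 0.25)                             # policy pi(y|x), rows sum to 1

# R(pi) = sum_x sum_y delta(x,y) * pi(y|x) * P(x)
risk = np.sum(delta * pi * P_x[:, None])
print(f"R(pi) = {risk:.4f}")
```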
Evaluating Online Metrics Offline
• Online: on-policy A/B test. Draw $S_1$ from $\pi_1$ to estimate $\hat{R}(\pi_1)$, draw $S_2$ from $\pi_2$ to estimate $\hat{R}(\pi_2)$, and so on: one fielded experiment per candidate policy.
• Offline: off-policy counterfactual estimates. Draw a single log $S$ from $\pi_0$ and estimate $\hat{R}(\pi_1), \hat{R}(\pi_2), \ldots$ for all candidate policies from that one log.
Approach 1: Direct Method
• Data: $S = ((x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n))$
1. Learn reward predictor $\hat{\delta}: X \times Y \rightarrow \Re$
   – Represent via features $\Psi(x, y)$
   – Learn a regression based on $\Psi(x, y)$ from $S$ collected under $\pi_0$
2. Derive policy: $\pi(x) = \operatorname{argmax}_y\, \hat{\delta}(x, y)$
[Figure: regression fit $\hat{\delta}(x, y)$ to the observed losses $\delta(x, y)$ in feature space $(\Psi_1, \Psi_2)$]
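As a rough illustration of the direct method, a sketch under assumed placeholders (the feature map, ridge regressor, and random data below are illustrative choices, not the lecture's setup):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d, n_actions = 1000, 5, 4

# Hypothetical log collected under pi_0: contexts, chosen actions, observed losses.
X = rng.normal(size=(n, d))
y_logged = rng.integers(n_actions, size=n)
delta = rng.normal(size=n)

def features(x, y):
    """Joint feature map Psi(x, y): context features placed in the block of action y."""
    psi = np.zeros(d * n_actions)
    psi[y * d:(y + 1) * d] = x
    return psi

# Step 1: learn the predictor delta_hat via regression on Psi(x, y).
Psi = np.array([features(x, y) for x, y in zip(X, y_logged)])
reg = Ridge(alpha=1.0).fit(Psi, delta)

# Step 2: derived policy acts greedily on the predictions.
def pi_direct(x):
    preds = [reg.predict(features(x, y)[None, :])[0] for y in range(n_actions)]
    # argmin because delta is a loss here; the slide phrases the same step as
    # argmax of a reward predictor (sign flipped).
    return int(np.argmin(preds))

print(pi_direct(X[0]))
```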
Approach 2: Off-Policy Risk Evaluation
Given $S = ((x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n))$ collected under $\pi_0$, the inverse propensity score (IPS) estimator
$$\hat{R}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \delta_i\, \frac{\pi(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}$$
is an unbiased estimate of the risk, if the propensity $\pi_0(y_i \mid x_i)$ is nonzero everywhere (where it matters).
[Horvitz & Thompson, 1952] [Rosenbaum & Rubin, 1983] [Zadrozny et al., 2003] [Langford & Li, 2009]
[Figure: propensity $p_i = \pi_0(y_i \mid x_i)$; distributions $\pi_0(Y \mid x)$ and $\pi(Y \mid x)$]
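A minimal sketch of the IPS estimator (assuming arrays `delta`, `prop_new`, and `prop_log` holding $\delta_i$, $\pi(y_i \mid x_i)$, and $\pi_0(y_i \mid x_i)$ for the logged sample; the numbers below are made up):

```python
import numpy as np

def ips_risk(delta, prop_new, prop_log):
    """Inverse propensity score estimate of R(pi) from a log collected under pi_0.

    delta:     observed losses delta_i
    prop_new:  pi(y_i | x_i) under the policy being evaluated
    prop_log:  pi_0(y_i | x_i), the logged propensities (must be > 0)
    """
    return np.mean(delta * prop_new / prop_log)

# Toy usage:
delta = np.array([1.0, 0.0, 1.0, 0.0])
prop_new = np.array([0.2, 0.7, 0.1, 0.9])
prop_log = np.array([0.5, 0.5, 0.25, 0.25])
print(ips_risk(delta, prop_new, prop_log))
```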
Partial-Information Empirical Risk Minimization
• Setup
  – Stochastic logging using $\pi_0$, with propensities $p_i = \pi_0(y_i \mid x_i)$
  – Data: $S = ((x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n))$
  – Stochastic prediction rules $\pi \in H$: $\pi(y \mid x)$
• Training
$$\hat{\pi} \in \operatorname{argmin}_{\pi \in H}\; \sum_{i=1}^{n} \frac{\pi(y_i \mid x_i)}{p_i}\, \delta_i$$
[Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2014]
[Figure: candidate policies $\pi_0(Y \mid x), \pi_1(Y \mid x), \ldots, \pi_{237}(Y \mid x)$]
Generalization Error Bound for BLBF
• Theorem [Generalization Error Bound]
  For any hypothesis space $H$ with capacity $C$, and for all $\pi \in H$, with probability $1 - \eta$:
$$R(\pi) \le \hat{R}(\pi) + O\!\left(\sqrt{\widehat{Var}(\pi)/n}\right) + O(C)$$
  with
$$\hat{R}(\pi) = \frac{1}{n} \sum_i \frac{\pi(y_i \mid x_i)}{p_i}\, \delta_i, \qquad \widehat{Var}(\pi) = \widehat{Var}\!\left(\frac{\pi(y_i \mid x_i)}{p_i}\, \delta_i\right)$$
The three terms combine the unbiased risk estimator, variance control, and capacity control; the bound accounts for the fact that the variance of the risk estimator can vary greatly between different $\pi \in H$.
[Swaminathan & Joachims, 2015]
Counterfactual Risk Minimization
• Theorem [Generalization Error Bound]
$$R(\pi) \le \hat{R}(\pi) + O\!\left(\sqrt{\widehat{Var}(\pi)/n}\right) + O(C)$$
• Constructive principle for designing learning algorithms:
$$\hat{\pi}^{crm} = \operatorname{argmin}_{\pi \in H_\theta} \left\{ \hat{R}(\pi) + \lambda_1 \sqrt{\widehat{Var}(\pi)/n} + \lambda_2\, C(H_\theta) \right\}$$
  with
$$\hat{R}(\pi) = \frac{1}{n} \sum_i \frac{\pi(y_i \mid x_i)}{p_i}\, \delta_i, \qquad \widehat{Var}(\pi) = \frac{1}{n} \sum_i \left(\frac{\pi(y_i \mid x_i)}{p_i}\, \delta_i\right)^{\!2} - \hat{R}(\pi)^2$$
[Swaminathan & Joachims, 2015]
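A sketch of how the three CRM terms can be computed from a log (the L2 norm standing in for the capacity term $C(H_\theta)$ is an assumption for illustration, as are the default $\lambda$ values):

```python
import numpy as np

def crm_objective(delta, prop_new, prop_log, w, lambda1=0.5, lambda2=0.01):
    """CRM objective: R_hat + lambda1 * sqrt(Var_hat / n) + lambda2 * capacity."""
    z = delta * prop_new / prop_log   # per-example IPS terms
    n = len(z)
    risk_hat = z.mean()               # unbiased risk estimate
    var_hat = z.var()                 # empirical variance of the IPS terms
    capacity = np.dot(w, w)           # stand-in capacity term C(H)
    return risk_hat + lambda1 * np.sqrt(var_hat / n) + lambda2 * capacity
```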
Outline of Lecture
• Batch Learning from Bandit Feedback (BLBF)
  $S = ((x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n))$; find new policy $\pi$ that selects $y$ with better $\delta$
• Learning Principle for BLBF
  – Hypothesis space, risk, empirical risk, and overfitting
  – Learning principle: Counterfactual Risk Minimization
• Learning Algorithms for BLBF
  – POEM: bandit training of CRF policies for structured outputs
  – BanditNet: bandit training of deep network policies
POEM Hypothesis Space
Hypothesis space: stochastic policies
$$\pi_w(y \mid x) = \frac{1}{Z(x)} \exp\left(w \cdot \Phi(x, y)\right)$$
with
– $w$: parameter vector to be learned
– $\Phi(x, y)$: joint feature map between input and output
– $Z(x)$: partition function
Note: same form as a CRF or structural SVM.
POEM Learning Method
• Policy Optimizer for Exponential Models (POEM)
  – Data: $S = ((x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n))$
  – Hypothesis space: $\pi_w(y \mid x) = \exp(w \cdot \Phi(x, y)) / Z(x)$
  – Training objective: let $z_i(w) = \pi_w(y_i \mid x_i)\, \delta_i / p_i$; then
$$\hat{w} = \operatorname{argmin}_{w \in \Re^N}\; \frac{1}{n} \sum_{i=1}^{n} z_i(w) \;+\; \lambda_1 \sqrt{\frac{1}{n}\left(\frac{1}{n} \sum_{i=1}^{n} z_i(w)^2 - \left(\frac{1}{n} \sum_{i=1}^{n} z_i(w)\right)^{\!2}\right)} \;+\; \lambda_2 \|w\|^2$$
(unbiased risk estimator + variance control + capacity control)
[Swaminathan & Joachims, 2015]
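The objective can be evaluated directly for a linear softmax policy, as in this sketch (a self-contained stand-in, not POEM's implementation; `Phi_all` is assumed to hold the per-example feature matrices $\Phi(x_i, \cdot)$, and the variance term follows the form shown above):

```python
import numpy as np

def softmax_policy(w, Phi):
    """pi_w(y|x) = exp(w . Phi(x,y)) / Z(x); Phi has shape (n_actions, dim)."""
    scores = Phi @ w
    scores -= scores.max()   # numerical stability
    p = np.exp(scores)
    return p / p.sum()

def poem_objective(w, Phi_all, y_idx, delta, prop_log, lambda1=0.5, lambda2=0.01):
    """POEM objective: mean(z) + lambda1*sqrt(Var(z)/n) + lambda2*||w||^2,
    with z_i(w) = pi_w(y_i|x_i) * delta_i / p_i."""
    z = np.array([
        softmax_policy(w, Phi)[y] * d / p
        for Phi, y, d, p in zip(Phi_all, y_idx, delta, prop_log)
    ])
    n = len(z)
    return z.mean() + lambda1 * np.sqrt(z.var() / n) + lambda2 * np.dot(w, w)
```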
POEM Experiment: Multi-Label Text Classification
• Data: $S = ((x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n))$ derived from Reuters LYRL RCV1 (top 4 categories); e.g. a logged prediction $y_i = \{\text{sport}, \text{politics}\}$ made by $\pi_0$ with propensity $p_i = 0.3$ and observed loss $\delta_i = 1$.
• Results: POEM with $H$ isomorphic to a CRF with one weight vector per label.
[Figure: Hamming loss (roughly 0.28 to 0.58) vs. $|S|$ = quantity (in epochs, 1 to 128) of training interactions from $\pi_0$; curves for $\pi_0$ (log data), CoStA, CRF (supervised), and POEM]
[Swaminathan & Joachims, 2015]
Learning from Logged Interventions
Every time a system places an ad, presents a search ranking, or makes a recommendation, we can think of this as an intervention for which we can observe the user's response (e.g. click, dwell time, purchase). Such logged intervention data is one of the most plentiful types of data available, as it can be recorded from a variety of interactive systems.
[Figure: contexts $x_i$ flowing into the logging policy $\pi_0$]
Does Variance Regularization Improve Generalization?
• IPS: $\hat{w} = \operatorname{argmin}_{w \in \Re^N}\; \hat{R}(w) + \lambda_2 \|w\|^2$
• POEM: $\hat{w} = \operatorname{argmin}_{w \in \Re^N}\; \hat{R}(w) + \lambda_1 \sqrt{\widehat{Var}(w)/n} + \lambda_2 \|w\|^2$

Hamming Loss    Scene    Yeast    TMC      LYRL
$\pi_0$         1.543    5.547    3.445    1.463
IPS             1.519    4.614    3.023    1.118
POEM            1.143    4.517    2.522    0.996

# examples      4*1211   4*1500   4*21519  4*23149
# features      294      103      30438    47236
# labels        6        14       22       4
POEM Efficient Training Algorithm
• Training objective:
$$OBJ = \min_{w \in \Re^N}\; \frac{1}{n} \sum_{i=1}^{n} z_i(w) + \lambda_1 \sqrt{\frac{1}{n}\left(\frac{1}{n} \sum_{i=1}^{n} z_i(w)^2 - \left(\frac{1}{n} \sum_{i=1}^{n} z_i(w)\right)^{\!2}\right)}$$
• Idea: first-order Taylor majorization
  – Majorize $\sqrt{\cdot}$ at the current value
  – Majorize $-(\cdot)^2$ at the current value
• Algorithm:
  – Majorize the objective at the current $w_t$
  – Solve the majorizing objective via AdaGrad to get $w_{t+1}$
This yields an upper bound with per-example linear and quadratic coefficients:
$$OBJ \le \min_{w \in \Re^N}\; \frac{1}{n} \sum_{i=1}^{n} \left[ A_i\, z_i(w) + B_i\, z_i(w)^2 \right] + \text{const}$$
[De Leeuw, 1977+] [Groenen et al., 2008] [Swaminathan & Joachims, 2015]
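To make the majorization concrete, here is one way the first-order Taylor bounds can be instantiated (a sketch derived from the objective as written above; the exact coefficients in POEM's released implementation may differ):

```python
import numpy as np

def majorization_coeffs(z0, lambda1=0.5):
    """First-order Taylor majorization of mean(z) + lambda1*sqrt(Var(z)/n) at z0.

    Uses sqrt(u) <= sqrt(u0) + (u - u0)/(2*sqrt(u0))   (sqrt is concave)
    and  -m^2   <= m0^2 - 2*m0*m                       (-(.)^2 is concave),
    giving an upper bound of the form (1/n) * sum_i [A*z_i + B*z_i^2] + const.
    """
    n = len(z0)
    m0 = z0.mean()
    u0 = z0.var() / n + 1e-12              # current value inside the sqrt
    c = lambda1 / (2.0 * np.sqrt(u0) * n)  # slope of the sqrt majorizer
    A = 1.0 - 2.0 * c * m0                 # coefficient of z_i (constant across i here)
    B = c                                  # coefficient of z_i^2
    return A, B
```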
Counterfactual Risk Minimization (recap)
• Theorem [Generalization Error Bound]
$$R(\pi) \le \hat{R}(\pi) + O\!\left(\sqrt{\widehat{Var}(\pi)/n}\right) + O(C)$$
• Constructive principle for designing learning algorithms:
$$\hat{\pi}^{crm} = \operatorname{argmin}_{\pi \in H_\theta} \left\{ \hat{R}(\pi) + \lambda_1 \sqrt{\widehat{Var}(\pi)/n} + \lambda_2\, C(H_\theta) \right\}$$
  with
$$\hat{R}(\pi) = \frac{1}{n} \sum_i \frac{\pi(y_i \mid x_i)}{p_i}\, \delta_i, \qquad \widehat{Var}(\pi) = \frac{1}{n} \sum_i \left(\frac{\pi(y_i \mid x_i)}{p_i}\, \delta_i\right)^{\!2} - \hat{R}(\pi)^2$$
[Swaminathan & Joachims, 2015]
Propensity Overfitting Problem
• Example
  – Instance space $X = \{1, \ldots, K\}$
  – Label space $Y = \{1, \ldots, K\}$
  – Loss: $\delta(x, y) = -2$ if $y = x$, $-1$ otherwise
  – Training data: uniform $(x, y)$ sample ($p_i = 1/K$)
  – Hypothesis space: all deterministic functions
  – $\pi_{opt}(x) = x$ with risk $R(\pi_{opt}) = -2$
• The empirical minimizer memorizes the logged actions, putting $\pi(y_i \mid x_i) = 1$ on every observed pair:
$$\hat{R}(\hat{\pi}) = \min_{\pi \in H}\; \frac{1}{n} \sum_i \frac{\pi(y_i \mid x_i)}{p_i}\, \delta_i = \frac{1}{n} \sum_i \frac{1}{1/K}\, \delta_i \le -K$$
Problem 1: unbounded risk estimate! The estimate can be made arbitrarily low as $K$ grows, even though no policy has true risk below $-2$.
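The effect is easy to reproduce numerically; this toy simulation (not from the lecture) shows the IPS estimate of a policy that memorizes the logged actions:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 1000, 500

# Uniform logging with distinct contexts so a deterministic memorizer exists.
x = rng.permutation(K)[:n]
y = rng.integers(K, size=n)
delta = np.where(y == x, -2.0, -1.0)

# A policy that puts probability 1 on every logged (x_i, y_i) gets the
# IPS estimate (1/n) * sum( 1/(1/K) * delta_i ) <= -K, although no policy
# has true risk below -2.
ips_memorizer = np.mean(delta * K)
print(f"IPS estimate of memorizer: {ips_memorizer:.1f} (true optimal risk: -2)")
```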
Propensity Overfitting Problem (continued)
• Same example, but with the loss translated by $+2$: $\delta(x, y) = 0$ if $y = x$, $+1$ otherwise (the ordering of policies is unchanged, so $\pi_{opt}$ is still optimal)
• Now the empirical minimizer avoids the logged actions, putting $\pi(y_i \mid x_i) = 0$ on every observed pair:
$$\hat{R}(\hat{\pi}) = \min_{\pi \in H}\; \frac{1}{n} \sum_i \frac{\pi(y_i \mid x_i)}{p_i}\, \delta_i = \frac{1}{n} \sum_i \frac{0}{1/K}\, \delta_i = 0$$
Problem 2: lack of equivariance! Adding a constant to the loss changes which $\pi$ minimizes the IPS estimate.
Control Variate
• Idea: improve the estimate using a correlated random variable whose expectation is known.
  – Estimator:
$$\hat{R}(\pi) = \frac{1}{n} \sum_i \frac{\pi(y_i \mid x_i)}{p_i}\, \delta_i$$
  – Correlated RV with known expectation:
$$\hat{S}(\pi) = \frac{1}{n} \sum_i \frac{\pi(y_i \mid x_i)}{p_i}, \qquad E\!\left[\hat{S}(\pi)\right] = \frac{1}{n} \sum_i \int\!\!\int \frac{\pi(y \mid x)}{\pi_0(y \mid x)}\, \pi_0(y \mid x)\, P(x)\, dy\, dx = 1$$
• Alternative risk estimator: the self-normalized estimator
$$\hat{R}^{SN}(\pi) = \frac{\hat{R}(\pi)}{\hat{S}(\pi)}$$
[Hesterberg, 1995] [Swaminathan & Joachims, 2015]
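A minimal sketch of the self-normalized estimator (same made-up arrays as the IPS sketch earlier):

```python
import numpy as np

def snips_risk(delta, prop_new, prop_log):
    """Self-normalized IPS: divide the IPS sum by the control variate.

    Since the control variate has expectation 1, the estimate stays consistent,
    and shifting the loss by a constant c shifts the estimate by exactly c,
    which removes the lack-of-equivariance problem above.
    """
    ratios = prop_new / prop_log
    return np.sum(delta * ratios) / np.sum(ratios)

delta = np.array([1.0, 0.0, 1.0, 0.0])
prop_new = np.array([0.2, 0.7, 0.1, 0.9])
prop_log = np.array([0.5, 0.5, 0.25, 0.25])
print(snips_risk(delta, prop_new, prop_log))
```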
SNIPS Learning Objective
• Method:
  – Data: $S = ((x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n))$
  – Hypothesis space: $\pi_w(y \mid x) = \exp(w \cdot \Phi(x, y)) / Z(x)$
  – Training objective:
$$\hat{w} = \operatorname{argmin}_{w \in \Re^N}\; \hat{R}^{SNIPS}(w) + \lambda_1 \sqrt{\widehat{Var}\!\left(\hat{R}^{SNIPS}(w)\right)} + \lambda_2 \|w\|^2$$
(self-normalized risk estimator + variance control + capacity control)
[Swaminathan & Joachims, 2015]
How well does NormPOEM generalize?

Hamming Loss    Scene    Yeast    TMC      LYRL
$\pi_0$         1.511    5.577    3.442    1.459
POEM (IPS)      1.200    4.520    2.152    0.914
POEM (SNIPS)    1.045    3.876    2.072    0.799

# examples      4*1211   4*1500   4*21519  4*23149
# features      294      103      30438    47236
# labels        6        14       22       4
Outline of Lecture
• Batch Learning from Bandit Feedback (BLBF)
  $S = ((x_1, y_1, \delta_1), \ldots, (x_n, y_n, \delta_n))$; find new policy $\pi$ that selects $y$ with better $\delta$
• Learning Principle for BLBF
  – Hypothesis space, risk, empirical risk, and overfitting
  – Learning principle: Counterfactual Risk Minimization
• Learning Algorithms for BLBF
  – POEM: bandit training of CRF policies for structured outputs
  – BanditNet: bandit training of deep network policies
BanditNet: Hypothesis Space
Hypothesis space: stochastic policies
$$\pi_w(y \mid x) = \frac{1}{Z(x)} \exp\left(\mathrm{DeepNet}(x, y \mid w)\right)$$
with
– $w$: parameter tensors to be learned
– $Z(x)$: partition function
Note: same form as a deep net with softmax output.
[Joachims et al., 2017]
BanditNet: Learning Method
• Method:
  – Data: $S = ((x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n))$
  – Hypotheses: $\pi_w(y \mid x) = \exp(\mathrm{DeepNet}(x, y \mid w)) / Z(x)$
  – Training objective:
$$\hat{w} = \operatorname{argmin}_{w \in \Re^N}\; \hat{R}^{SNIPS}(w) + \lambda_1 \sqrt{\widehat{Var}\!\left(\hat{R}^{SNIPS}(w)\right)} + \lambda_2 \|w\|^2$$
(self-normalized risk estimator + variance control + capacity control)
[Joachims et al., 2017]
BanditNet: Learning Method (continued)
• Method:
  – Data: $S = ((x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n))$
  – Representation: deep network policies
$$\pi_w(y \mid x) = \frac{1}{Z(x, w)} \exp\left(\mathrm{DeepNet}(y \mid x, w)\right)$$
  – SNIPS training objective:
$$\hat{w} = \operatorname{argmin}_{w \in \Re^N}\; \hat{R}^{SNIPS}(w) + \lambda \|w\|^2 = \operatorname{argmin}_{w \in \Re^N}\; \frac{\sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i}\, \delta_i}{\sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i}} + \lambda \|w\|^2$$
Optimization via SGD
• Problem: the SNIPS objective is not suitable for SGD, since the normalizer in the denominator couples all training examples.
• Step 1: discretize over the values $S_j$ of the denominator.
• Step 2: view the problem as a series of constrained optimization problems:
$$\hat{w}_j = \operatorname{argmin}_w\; \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i}\, \delta_i \quad \text{subject to} \quad \frac{1}{n} \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i} = S_j$$
$$\hat{w} = \operatorname{argmin}_{S_j}\; \operatorname{argmin}_w\; \frac{1}{n S_j} \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i}\, \delta_i \quad \text{subject to} \quad \frac{1}{n} \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i} = S_j$$
• Step 3: eliminate the constraint via the Lagrangian:
$$\hat{w}_j = \operatorname{argmin}_w\; \max_\lambda\; \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i}\, (\delta_i - \lambda) + \lambda\, n\, S_j$$
Optimization via SGD (continued)
• Step 4: search a grid over $\lambda_j$ instead of $S_j$.
  – Hard: given $S_j$, find $\lambda_j$.
  – Easy: given $\lambda_j$, find $S_j$:
Solve
$$\hat{w}_j = \operatorname{argmin}_w\; \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i}\, (\delta_i - \lambda_j)$$
then compute
$$S_j = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi_{\hat{w}_j}(y_i \mid x_i)}{p_i}$$
BanditNet: Training Algorithm
• Given:
  – Data: $S = ((x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n))$
  – Lagrange multipliers: $\lambda_j \in \{\lambda_1, \ldots, \lambda_k\}$
• Compute:
  – For each $\lambda_j$, solve
$$\hat{w}_j = \operatorname{argmin}_w\; \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i}\, (\delta_i - \lambda_j)$$
  – For each $\hat{w}_j$, compute
$$S_j = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi_{\hat{w}_j}(y_i \mid x_i)}{p_i}$$
  – Find the overall $\hat{w}$:
$$\hat{w} = \operatorname{argmin}_{(\hat{w}_j, S_j)}\; \frac{1}{n S_j} \sum_{i=1}^{n} \frac{\pi_{\hat{w}_j}(y_i \mid x_i)}{p_i}\, \delta_i$$
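A schematic of the full search in code (a sketch: `train_policy` is a hypothetical callable standing in for SGD training of the network on the $\lambda$-translated loss, and `prop` for evaluating $\pi_{\hat{w}_j}(y_i \mid x_i)$):

```python
import numpy as np

def banditnet_search(data, lambdas, train_policy):
    """Grid search over Lagrange multipliers, following the slide's recipe.

    data:         (x, y, delta, p) arrays of logged bandit feedback
    lambdas:      candidate Lagrange multipliers lambda_1..lambda_k
    train_policy: callable minimizing sum_i pi_w(y_i|x_i)/p_i * (delta_i - lam)
                  via SGD; returns a function prop(x, y) -> pi_w(y|x)
    """
    x, y, delta, p = data
    n = len(delta)
    candidates = []
    for lam in lambdas:
        prop = train_policy(x, y, delta - lam, p)   # solve for w_hat_j
        ratios = prop(x, y) / p
        S_j = ratios.mean()                         # control variate at w_hat_j
        snips = np.sum(ratios * delta) / (n * S_j)  # SNIPS estimate
        candidates.append((snips, prop))
    return min(candidates, key=lambda c: c[0])[1]   # w_hat minimizing SNIPS
```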
Object Recognition: Data and Setup
• Data: CIFAR-10 (fully labeled), $S^* = ((x_1, y_1^*), \ldots, (x_n, y_n^*))$
• Bandit feedback generation:
  – Draw image $x_i$
  – Use logging policy $\pi_0(Y \mid x_i)$ to predict $y_i$ (e.g. $y_i = \text{dog}$)
  – Record propensity $p_i = \pi_0(Y = y_i \mid x_i)$ (e.g. $p_i = 0.3$)
  – Observe loss $\delta_i = [y_i \ne y_i^*]$ (e.g. $\delta_i = 1$)
  This yields $S = ((x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n))$.
• Network architecture: ResNet20 [He et al., 2016]
[Beygelzimer & Langford, 2009] [Joachims et al., 2017]
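A sketch of the bandit-feedback conversion (following the slide's recipe; `pi0_probs` is a hypothetical array of logging-policy action probabilities, e.g. softmax outputs of a weak classifier):

```python
import numpy as np

def make_bandit_data(X, y_star, pi0_probs, seed=0):
    """Turn a fully labeled dataset into logged bandit feedback.

    X:         contexts (n, ...)
    y_star:    true labels (n,)
    pi0_probs: logging-policy distribution over actions, shape (n, n_actions)
    Returns (y, delta, p): sampled actions, 0/1 losses, and propensities.
    """
    rng = np.random.default_rng(seed)
    n, n_actions = pi0_probs.shape
    # Draw y_i ~ pi_0(Y | x_i) and record its propensity.
    y = np.array([rng.choice(n_actions, p=pi0_probs[i]) for i in range(n)])
    p = pi0_probs[np.arange(n), y]
    # Observe the loss only for the chosen action: delta_i = [y_i != y_i*].
    delta = (y != y_star).astype(float)
    return y, delta, p
```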
Bandit Feedback vs. Test Error
[Figure: test error as a function of the amount of bandit feedback. Logging policy $\pi_0$: 49% error rate. Bandit-ResNet with naive IPS: >49% error rate.]
[Joachims et al., 2017]
Lagrange Multiplier vs. Test Error
[Figure: test error as a function of the Lagrange multiplier. There is a large basin of optimality far away from naive IPS.]
Analysis of SNIPS Estimate
The control variate responds monotonically to the Lagrange multiplier, and the SNIPS training error resembles the test error.
Conclusions and Future
• Batch Learning from Bandit Feedback
  – Feedback only for the presented action: $S = ((x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n))$
  – Goal: find new system $\pi$ that selects $y$ with better $\delta$
  – Learning principle for BLBF: Counterfactual Risk Minimization
• Learning from Logged Interventions: BLBF and beyond
  – POEM: [Swaminathan & Joachims, 2015c]
  – NormPOEM: [Swaminathan & Joachims, 2015c]
  – BanditNet: [Joachims et al., 2018]
  – SVM PropRank: [Joachims et al., 2017a]
  – DeepPropDCG: [Agarwal et al., 2018]
  – Unbiased Matrix Factorization: [Schnabel et al., 2016]
• Future research
  – Other learning algorithms? Other partial-information settings?
  – How to handle the new bias-variance trade-off in risk estimators?
  – Applications
• Software, papers, SIGIR tutorial, and data: www.joachims.org