Q-Learning and Dynamic Treatment Regimes
Transcript of Q-Learning and Dynamic Treatment Regimes
Q-Learning and Dynamic Treatment Regimes
S.A. Murphy, Univ. of Michigan
IMS/Bernoulli: July, 2004
Outline
• Dynamic Treatment Regimes
• Optimal Q-functions and Q-learning
• The Problem & Goal
• Finite Sample Bounds
• Outline of Proof
• Shortcomings and Open Problems
Dynamic Treatment Regimes
• Multi-stage decision problems: repeated decisions are made over time on each patient.
• Used in the management of addictions, mental illnesses, HIV infection, and cancer.
k Decisions
Observations made prior to the t-th decision
Action at the t-th decision
Primary Outcome
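The notation on this slide appears only in the missing slide images; a minimal sketch of the standard setup, with symbols that are assumptions of this transcript rather than quotes from the slide:

```latex
% Standard notation for a k-decision trajectory (symbols assumed, not from the slide):
%   S_t : observations made prior to the t-th decision
%   A_t : action taken at the t-th decision
%   Y   : primary outcome, a known function of the full trajectory
\[
\bar{S}_t = (S_1,\ldots,S_t), \qquad
\bar{A}_t = (A_1,\ldots,A_t), \qquad
Y = y\!\left(\bar{S}_{k+1},\bar{A}_k\right).
\]
```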
A dynamic treatment regime is a vector of decision rules, one per decision point.
If the regime is implemented, then each action is selected by the corresponding decision rule.
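A sketch of what implementing the regime means in the assumed notation (the rule arguments are an assumption, not from the slide):

```latex
% A regime d = (d_1, ..., d_k) has one decision rule per decision;
% implementing it means each action is set by the corresponding rule:
\[
A_t = d_t\!\left(\bar{S}_t, \bar{A}_{t-1}\right), \qquad t = 1,\ldots,k .
\]
```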
Goal: Estimate the decision rules that maximize the mean primary outcome.
Data: a data set of n finite-horizon trajectories, each with randomized actions.
The probabilities with which the actions were randomized are the randomization probabilities.
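A hedged sketch of the data and the randomization probabilities in the assumed notation (the symbol p_t is not from the slide):

```latex
% Data: n independent finite-horizon trajectories with randomized actions;
% the randomization probabilities are the action-selection probabilities used in the study:
\[
\left(S_{1i}, A_{1i}, \ldots, S_{ki}, A_{ki}, Y_i\right), \quad i = 1,\ldots,n,
\qquad
A_t \sim p_t\!\left(\,\cdot \mid \bar{S}_t, \bar{A}_{t-1}\right).
\]
```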
Optimal Q-functions and Q-learning:
Definition: an expectation subscripted by a regime denotes expectation when the actions are chosen according to that regime.
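In the assumed notation, writing E_d for this regime-following expectation, the quantity of interest on the preceding slides is the mean outcome under the regime:

```latex
% Value of a regime d: its mean primary outcome when actions follow d (notation assumed):
\[
V(d) = E_{d}\!\left[\,Y\,\right],
\qquad\text{and the goal is to maximize } V(d) \text{ over regimes } d .
\]
```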
Q-functions:
The Q-functions for the optimal regime are given recursively by
For t=k,k-1,….
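The recursion itself is in the missing images; the standard optimal-Q-function recursion, which this slide presumably states, reads:

```latex
% Optimal Q-functions, defined backwards from the last decision (standard form; symbols assumed):
\[
Q_k\!\left(\bar{s}_k,\bar{a}_k\right)
  = E\!\left[\,Y \mid \bar{S}_k = \bar{s}_k,\ \bar{A}_k = \bar{a}_k\,\right],
\]
\[
Q_t\!\left(\bar{s}_t,\bar{a}_t\right)
  = E\!\left[\,\max_{a_{t+1}} Q_{t+1}\!\left(\bar{S}_{t+1},\bar{a}_t,a_{t+1}\right)
    \,\middle|\, \bar{S}_t = \bar{s}_t,\ \bar{A}_t = \bar{a}_t\,\right],
  \qquad t = k-1,\ldots,1 .
\]
```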
Q-functions:
The optimal regime is given by
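Presumably the missing formula is the usual greedy rule; in the assumed notation:

```latex
% The optimal regime acts greedily with respect to the optimal Q-functions:
\[
d_t^{*}\!\left(\bar{s}_t,\bar{a}_{t-1}\right)
  \in \arg\max_{a_t} Q_t\!\left(\bar{s}_t,\bar{a}_{t-1},a_t\right),
  \qquad t = 1,\ldots,k .
\]
```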
Q-learning:
Given a model for the Q-functions, minimize
over
Set
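A sketch of the stage-k step in the assumed notation, with a working model Q_k(·; θ_k) for the Q-function and E_n denoting the empirical average over the n trajectories (both symbols are assumptions):

```latex
% Stage-k Q-learning step: least-squares regression of Y on the modeled Q-function:
\[
\hat{\theta}_k \in \arg\min_{\theta_k}
  E_n\!\left[\left(Y - Q_k\!\left(\bar{S}_k,\bar{A}_k;\theta_k\right)\right)^{2}\right],
\qquad
\hat{Q}_k = Q_k\!\left(\cdot\,;\hat{\theta}_k\right).
\]
```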
Q-learning:
For each t=k-1,…,1 minimize
over
And set
and so on.
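A sketch of the backward step for t = k-1, …, 1, again in the assumed notation; the regression target is the maximized fitted Q-function from the following stage:

```latex
% Backward-induction step of Q-learning (sketch; symbols assumed):
\[
\hat{\theta}_t \in \arg\min_{\theta_t}
  E_n\!\left[\left(\max_{a_{t+1}} \hat{Q}_{t+1}\!\left(\bar{S}_{t+1},\bar{A}_t,a_{t+1}\right)
  - Q_t\!\left(\bar{S}_t,\bar{A}_t;\theta_t\right)\right)^{2}\right],
\qquad
\hat{Q}_t = Q_t\!\left(\cdot\,;\hat{\theta}_t\right).
\]
```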
Q-Learning:
The estimated regime is given by
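Presumably the estimated regime is the greedy rule with respect to the fitted Q-functions (assumed notation):

```latex
% Estimated regime: act greedily with respect to the fitted Q-functions:
\[
\hat{d}_t\!\left(\bar{s}_t,\bar{a}_{t-1}\right)
  \in \arg\max_{a_t} \hat{Q}_t\!\left(\bar{s}_t,\bar{a}_{t-1},a_t\right),
  \qquad t = 1,\ldots,k .
\]
```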
The Problem & Goal:
Most learning (e.g. estimation) methods utilize a model for all or parts of the multivariate distribution of the observations.
The model implicitly constrains the class of possible decision rules in the dynamic treatment regime; call this the constrained class.
The observation vector has many components (it is high dimensional), so the model is likely incorrect; view the model and the constrained class of decision rules as approximation classes.
Goal: Given a learning method and approximation classes, assess the ability of the learning method to produce the best decision rules in the class.
Ideally, construct an upper bound for the loss in mean outcome of the estimated regime relative to the best regime in the class,
where the estimated regime is the output of the learning method, and a subscripted expectation denotes expectation when the actions are chosen according to that rule.
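The quantity to be bounded is not in the transcript; a natural reading, consistent with the surrounding slides but not confirmed by them, is the shortfall of the estimated regime relative to the best regime in the constrained class (written D here, an assumed symbol):

```latex
% Generalization error of the learned regime (a reading of the missing formula, not a quote):
\[
\max_{d \in \mathcal{D}} E_{d}\!\left[\,Y\,\right] \;-\; E_{\hat{d}}\!\left[\,Y\,\right].
\]
```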
Goal: Given a learning method, a model, and an approximation class, construct a finite sample upper bound for this loss.
The upper bound should be composed of quantities that are minimized by the learning method.
Here the learning method is Q-learning.
Finite Sample Bounds:
Primary Assumptions:
(1)
for L>1.
(2) Number of possible actions is finite.
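Assumption (1) is in the missing image; a plausible reconstruction, consistent with the phrase "for L > 1" but purely an assumption of this transcript, is that the randomization probabilities are bounded away from zero:

```latex
% Possible form of Assumption (1) -- a reconstruction, not a quote from the slide:
\[
p_t\!\left(a_t \mid \bar{s}_t,\bar{a}_{t-1}\right) \;\ge\; L^{-1}
\qquad \text{for all } t, \text{ all histories, and all actions } a_t, \text{ with } L > 1 .
\]
```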
Definition:
where E, without a subscript, denotes expectation when the actions are randomized.
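The missing definition most likely relates the regime-following expectation E_d to the randomization expectation E; the standard change-of-measure identity under randomized actions (stated here as an assumption, not a quote) is:

```latex
% Relating E_d to E when the actions in the data are randomized:
\[
E_{d}\!\left[\,Y\,\right]
  = E\!\left[\;\prod_{t=1}^{k}
    \frac{\mathbf{1}\!\left\{A_t = d_t\!\left(\bar{S}_t,\bar{A}_{t-1}\right)\right\}}
         {p_t\!\left(A_t \mid \bar{S}_t,\bar{A}_{t-1}\right)}\; Y \right].
\]
```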
Results: Approximation Error:
The minimum is over with
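The approximation-error term itself is an image in the source; terms of this kind typically measure how far the best Q-functions in the model are from the optimal Q-functions, for example something of the shape (illustrative only):

```latex
% Generic shape of an approximation-error term (illustration, not the slide's formula):
\[
\min_{\tilde{Q} \in \mathcal{Q}} \;\sum_{t=1}^{k}
  E\!\left[\left(\tilde{Q}_t\!\left(\bar{S}_t,\bar{A}_t\right)
  - Q_t\!\left(\bar{S}_t,\bar{A}_t\right)\right)^{2}\right],
\]
% where the minimum is over the approximation class Q and Q_t denotes the optimal Q-functions.
```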
Define
The estimation error involves the complexity of this space.
Estimation Error:
For with probability at least 1- δ
for n satisfying
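The bound and the sample-size requirement are images in the source; bounds of this type typically take a high-probability form driven by the complexity (e.g. a covering number) of the class of modeled Q-functions, roughly (illustrative only):

```latex
% Generic shape of a finite-sample estimation-error bound (illustration, not the slide's result):
\[
\Pr\!\left(\text{estimation error} \le \epsilon\right) \ge 1-\delta
\qquad\text{whenever}\qquad
n \;\ge\; \frac{C}{\epsilon^{2}}
  \left(\log N(\epsilon,\mathcal{Q}) + \log\frac{1}{\delta}\right),
\]
% with C a constant and N(eps, Q) a covering number of the space of Q-functions.
```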
If this space is finite, then n need only satisfy
that is,
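If the class is finite, the covering number can be replaced by the cardinality, so the requirement on n (again only an illustrative reconstruction) has the shape:

```latex
% Finite class: complexity enters only through log |Q| (illustration):
\[
n \;\ge\; \frac{C}{\epsilon^{2}}\left(\log |\mathcal{Q}| + \log\frac{1}{\delta}\right),
\qquad\text{that is,}\qquad
\epsilon \;\approx\; \sqrt{\frac{\log |\mathcal{Q}| + \log\frac{1}{\delta}}{n}} .
\]
```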
Outline of Proof:
The Q-functions for a given regime are given by
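In the assumed notation, the Q-functions of a fixed regime d are defined like the optimal ones, except that future actions follow d instead of being maximized over:

```latex
% Q-functions of a fixed regime d (standard definition; symbols assumed):
\[
Q^{d}_k\!\left(\bar{s}_k,\bar{a}_k\right)
  = E\!\left[\,Y \mid \bar{S}_k = \bar{s}_k,\ \bar{A}_k = \bar{a}_k\,\right],
\]
\[
Q^{d}_t\!\left(\bar{s}_t,\bar{a}_t\right)
  = E\!\left[\,Q^{d}_{t+1}\!\left(\bar{S}_{t+1},\bar{a}_t,
      d_{t+1}\!\left(\bar{S}_{t+1},\bar{a}_t\right)\right)
    \,\middle|\, \bar{S}_t = \bar{s}_t,\ \bar{A}_t = \bar{a}_t\,\right],
  \qquad t = k-1,\ldots,1 .
\]
```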
Proof Outline
(1)
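The formula for step (1) is missing; a telescoping identity of the kind such arguments usually rest on, offered only as a plausible guess at the missing content, expresses the value difference as a sum of stage-wise Q-function gaps:

```latex
% A performance-difference (telescoping) identity (a guess at the missing step, not a quote):
\[
E_{d^{*}}\!\left[\,Y\,\right] - E_{\hat{d}}\!\left[\,Y\,\right]
  = \sum_{t=1}^{k}
    E_{\hat{d}}\!\left[\,\max_{a_t} Q_t\!\left(\bar{S}_t,\bar{A}_{t-1},a_t\right)
    - Q_t\!\left(\bar{S}_t,\bar{A}_{t-1},\hat{d}_t\!\left(\bar{S}_t,\bar{A}_{t-1}\right)\right)\right],
\]
% where Q_t are the optimal Q-functions and d* is the optimal regime.
```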
Proof Outline
(2)
It turns out that also
Proof Outline
(3)
Shortcomings and Open Problems
Recall Estimation Error:
For with probability at least 1- δ
for n satisfying
Open Problems
• Is there a learning method that can learn the best decision rule in an approximation class given a data set of n finite horizon trajectories?
• Sieve Estimators or Regularized Estimators?
• Dealing with high dimensional X: feature extraction, feature selection.
This seminar can be found at:
http://www.stat.lsa.umich.edu/~samurphy/seminars/ims_bernoulli_0704.ppt
The paper can be found at:
http://www.stat.lsa.umich.edu/~samurphy/papers/Qlearning.pdf
Recall Proof Outline
(2)
It turns out that also
Recall Proof Outline
(1)