Q-learning Residual Analysis with Application to A Schizophrenia Clinical Trial
Bibhas Chakraborty
Centre for Quantitative Medicine, Duke-National University of Singapore Graduate Medical School
Based on joint work with Ashkan Ertefaie & Susan Shortreed
ISCB, Utrecht
August 27, 2015
-
Dynamic Treatment Regimes: A Quick Overview
Outline
1 Dynamic Treatment Regimes: A Quick Overview
2 Estimation of Optimal DTRs via Q-learning
3 Model Checking for Q-learning
4 Numerical Study
5 Discussion
-
Dynamic Treatment Regimes
Consider personalized management of chronic health conditions
A dynamic treatment regime (DTR) is a sequence of decision rules, one per stage of clinical intervention
– Each decision rule takes a patient’s treatment and covariate history as inputs, and outputs a recommended treatment
A DTR is called optimal if it optimizes the long-term mean outcome (or some other suitable criterion)
-
“SMART” Data Sources
Sequential Multiple Assignment Randomized Trials (SMARTs) (Lavori and Dawson, 2004; Murphy, 2005)
– Each patient is followed through multiple stages of treatment
– At each stage the patient is randomized to one of the possible treatment options
– Treatment options for a patient can be restricted based on prior treatment and covariate history
Examples of classic SMARTs:
– Schizophrenia: CATIE (Schneider et al., 2001)
– Depression: STAR*D (Rush et al., 2003)
– Prostate Cancer: Thall et al. (2000)
– Leukemia: CALGB Protocol 8923 (Stone et al., 1995; Wahed and Tsiatis, 2004)
– Smoking: Project Quit (Strecher et al., 2008)
Many recently finished or ongoing trials:
http://methodology.psu.edu/ra/adap-inter/projects
-
CATIE: A Study of Schizophrenia
Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) (Schneider et al., 2001; Stroup et al., 2003; Swartz et al., 2003)
One of the earlier SMART studies relevant for DTR research, funded by NIMH
Quite a complex study design, but we will be looking at a simplified version for illustrative purposes
– Only non-responders to initial treatment are re-randomized at the second stage
-
CATIE Study Design (Simplified)
-
Estimation of Optimal DTRs via Q-learning
-
Q-learning: A Regression-based Method
How can the optimal DTR for every individual patient be estimated from SMART data?
Q-learning (Watkins, 1989; Sutton and Barto, 1998; Ernst et al., 2005)
– A popular method from Reinforcement (Machine) Learning
– A generalization of least squares regression to multistage decision problems (Murphy, 2005)
– Implemented in the DTR context in recent times with different variations (Zhao et al., 2009; Chakraborty et al., 2010; Shortreed et al., 2011; Schulte et al., 2012; Laber et al., 2012; Song et al., 2012; Nahum-Shani et al., 2012; Moodie et al., 2012)
The intuition comes from dynamic programming (Bellman, 1957), which applies when the multivariate distribution of the data is known
– Q-learning is an approximate dynamic programming approach
-
Data Structure
Two stages on a single patient:
$(O_1, A_1, O_2, S_2, A_2, Y)$
$O_j$: Observation (pre-treatment) at the $j$-th stage
$A_j$: Treatment (action) at the $j$-th stage, $A_j \in \mathcal{A}_j$, randomized (for simplicity, restrict attention to $\mathcal{A}_j = \{-1, 1\}$)
$S_2$: Indicator of whether a patient is re-randomized at stage 2 (in some SMART designs, $S_2 = 1$ for every patient)
$H_j$: History at the $j$-th stage, $H_1 = O_1$, $H_2 = (O_1, A_1, O_2)$
$Y$: Primary outcome (larger is better)
A DTR is a sequence of decision rules:
$d \equiv (d_1, d_2)$ with $d_j(h_j) \in \mathcal{A}_j$
-
Dynamic Programming: The Background for Q-learning
Move backward in time to take care of the delayed effects
Define the “Quality of treatment”, Q-functions:
$$Q_2(h_2, a_2) = E\big[Y \mid H_2 = h_2, A_2 = a_2\big]$$
$$Q_1(h_1, a_1) = E\Big[\underbrace{\max_{a_2} Q_2(H_2, a_2)}_{\text{delayed effect}} \,\Big|\, H_1 = h_1, A_1 = a_1\Big]$$
Optimal DTR:
$$d_j(h_j) = \arg\max_{a_j} Q_j(h_j, a_j), \quad j = 1, 2$$
When the true Q-functions are not known, one needs to estimate them from data, using regression models ...
-
Q-learning with Linear Regression
Regression models for Q-functions:
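$$Q_j(H_j, A_j; \beta_j) = \beta_{j1}^T H_{j1} + (\beta_{j2}^T H_{j2}) A_j, \quad j = 1, 2,$$
where $H_{j1}$ and $H_{j2}$ are two features of $H_j$
At stage 2, regress $Y$ on $(H_{21}, H_{22} A_2)$ only among patients with $S_2 = 1$, to obtain $\hat\beta_2 = (\hat\beta_{21}, \hat\beta_{22})$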
-
Q-learning with Linear Regression (Cont’d)
Construct stage-1 “pseudo-outcome” for patients with $S_2 = 1$:
$$\tilde Y_{\max} = \max_{a_2} Q_2(H_2, a_2; \hat\beta_2)$$
and hence the stage-1 dependent variable for every patient in the trial:
$$\tilde Y = S_2 \cdot \tilde Y_{\max} + (1 - S_2) \cdot Y$$
At stage 1, regress $\tilde Y$ on $(H_{11}, H_{12} A_1)$ to obtain $\hat\beta_1 = (\hat\beta_{11}, \hat\beta_{12})$
Estimated Optimal DTR:
$$\hat d_j(h_j) = \arg\max_{a_j} Q_j(h_j, a_j; \hat\beta_j) = \mathrm{sign}(\hat\beta_{j2}^T h_{j2})$$
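The two regression steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `fit_ols` and `q_learning`, the toy data-generating process, and the choice of features are all hypothetical. With actions coded as ±1, the maximum over $a_2$ of a linear Q-function is the main-effect term plus the absolute value of the interaction term, which is what the code uses.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients (helper name is hypothetical)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def q_learning(H11, H12, A1, H21, H22, A2, S2, Y):
    """Two-stage Q-learning with linear working models and actions in {-1, +1}."""
    s = S2 == 1
    # Stage 2: regress Y on (H21, H22 * A2) among re-randomized patients only.
    X2 = np.hstack([H21[s], H22[s] * A2[s][:, None]])
    beta2 = fit_ols(X2, Y[s])
    b21, b22 = beta2[:H21.shape[1]], beta2[H21.shape[1]:]
    # Pseudo-outcome: max over a2 equals main effect + |interaction|.
    y_max = H21 @ b21 + np.abs(H22 @ b22)
    y_tilde = np.where(s, y_max, Y)  # observed Y is kept when S2 = 0
    # Stage 1: regress the pseudo-outcome on (H11, H12 * A1) for everyone.
    X1 = np.hstack([H11, H12 * A1[:, None]])
    beta1 = fit_ols(X1, y_tilde)
    b12 = beta1[H11.shape[1]:]
    d1 = np.sign(H12 @ b12)  # estimated stage-1 rule at observed histories
    d2 = np.sign(H22 @ b22)  # estimated stage-2 rule at observed histories
    return beta2, beta1, y_tilde, d1, d2

# Toy data (hypothetical generative model, for illustration only).
rng = np.random.default_rng(0)
n = 500
O1 = rng.normal(size=n)
A1 = rng.choice([-1.0, 1.0], size=n)
O2 = rng.normal(size=n)
S2 = (O2 > 0).astype(int)
A2 = rng.choice([-1.0, 1.0], size=n)
Y = 1 + O1 + A1 * O1 + S2 * A2 * (0.5 + O2) + rng.normal(size=n)

one = np.ones(n)
H11 = np.column_stack([one, O1])
H12 = np.column_stack([one, O1])
H21 = np.column_stack([one, O1, A1, O2])
H22 = np.column_stack([one, O2])
beta2, beta1, y_tilde, d1, d2 = q_learning(H11, H12, A1, H21, H22, A2, S2, Y)
```

Note that `y_tilde` mixes a fitted maximum (for $S_2 = 1$) with a raw outcome (for $S_2 = 0$), so its stage-1 residuals come from two different sources.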
-
Model Checking for Q-learning
-
Model Checking
The quality of DTRs estimated using Q-learning is critically model-dependent
Model checking for stage 2 can be done using standard residual diagnostic tools from linear regression
Model checking for stage 1 is tricky
– The dependent variable is a non-smooth function of the data
– The data consist of two types of individuals, viz., those who are randomized at stage 2 and those who are not; hence there is inherent scope for variance heterogeneity
-
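Proposed Modification: Q-learning with Mixture Residuals (QL-MR)
Stage-2 regression model (nested among those with $S_2 = 0$ and $S_2 = 1$):
$$Q_2(H_2, A_2; \beta_{21}, \beta_{22}, \beta_{23}) = S_2 \cdot (\beta_{21}^T H_{21} + \beta_{22}^T H_{22} A_2) + (1 - S_2) \cdot (\beta_{23}^T H_{23})$$
Obtain $\hat\beta_2$ by fitting the above model to $Y$, and hence define
$$\hat d_2(h_2) = \arg\max_{a_2} Q_2(h_2, a_2; \hat\beta_2)$$
Construct stage-1 “pseudo-outcome” for each patient in the trial:
$$\begin{aligned}
\tilde Y_{\text{QL-MR}} &= \max_{a_2} \big[S_2 \cdot (\hat\beta_{21}^T H_{21} + \hat\beta_{22}^T H_{22} a_2)\big] + (1 - S_2) \cdot (\hat\beta_{23}^T H_{23}) \\
&= S_2 \cdot (\hat\beta_{21}^T H_{21} + |\hat\beta_{22}^T H_{22}|) + (1 - S_2) \cdot (\hat\beta_{23}^T H_{23})
\end{aligned}$$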
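A minimal sketch of the nested stage-2 fit and the QL-MR pseudo-outcome follows. The toy data and the feature choices (`H21`, `H22`, `H23`) are hypothetical and only serve to make the block runnable; the nested model is fitted as one least-squares regression whose columns are switched on and off by $S_2$.

```python
import numpy as np

# Toy data (hypothetical generative model, for illustration only).
rng = np.random.default_rng(1)
n = 400
O1 = rng.normal(size=n)
A1 = rng.choice([-1.0, 1.0], size=n)
O2 = rng.normal(size=n)
S2 = (O2 > 0).astype(float)
A2 = rng.choice([-1.0, 1.0], size=n)
Y = 1 + O1 - A1 + S2 * (0.8 * O2 + A2 * (0.5 - O2)) + rng.normal(size=n)

one = np.ones(n)
H21 = np.column_stack([one, O1, A1, O2])  # main-effect features, S2 = 1
H22 = np.column_stack([one, O2])          # features interacting with A2
H23 = np.column_stack([one, O1, A1, O2])  # features for patients with S2 = 0

# Nested design: S2 * (H21, H22 * A2) and (1 - S2) * H23 in one regression.
X = np.hstack([
    S2[:, None] * H21,
    (S2 * A2)[:, None] * H22,
    (1 - S2)[:, None] * H23,
])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
b21, b22, b23 = beta[:4], beta[4:6], beta[6:]

# Estimated stage-2 rule and the QL-MR pseudo-outcome for every patient.
d2_hat = np.sign(H22 @ b22)
y_qlmr = S2 * (H21 @ b21 + np.abs(H22 @ b22)) + (1 - S2) * (H23 @ b23)
```

Unlike the standard pseudo-outcome, `y_qlmr` uses a fitted value for patients with $S_2 = 0$ instead of their observed outcome.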
-
Q-learning with Mixture Residuals (QL-MR) (Cont’d)
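Define $\pi = E[S_2 \mid H_2] = P[S_2 = 1 \mid H_2]$
Postulate a parametric model for $\pi$, say $\pi(\alpha)$, and compute the maximum likelihood estimate $\hat\alpha$; then define $\hat\pi = E[S_2 \mid H_2; \hat\alpha]$ (e.g., logistic regression)
Stage-1 Q-function:
$$\begin{aligned}
Q_1(H_1, A_1) &= E\big[\tilde Y_{\text{QL-MR}} \mid H_1, A_1\big] \\
&= E\Big[E\{\tilde Y_{\text{QL-MR}} \mid H_2\} \,\Big|\, H_1, A_1\Big] \\
&= E\Big[E\big\{S_2(\hat\beta_{21}^T H_{21} + |\hat\beta_{22}^T H_{22}|) + (1 - S_2)(\hat\beta_{23}^T H_{23}) \mid H_2\big\} \,\Big|\, H_1, A_1\Big] \\
&= E\big[\pi(\hat\beta_{21}^T H_{21} + |\hat\beta_{22}^T H_{22}|) \mid H_1, A_1\big] + E\big[(1 - \pi)(\hat\beta_{23}^T H_{23}) \mid H_1, A_1\big]
\end{aligned}$$
Replace $\pi$ by $\hat\pi$ in the expression of $Q_1$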
-
Q-learning with Mixture Residuals (QL-MR) (Cont’d)
$Q_1$ is a mixture model with two components (e.g., for responders and for non-responders)
Fit two linear models for the two conditional expectations in the expression of $Q_1$, say $\eta_{11}^T H_{11} + \eta_{12}^T H_{12} A_1$ and $\theta_{11}^T H'_{11} + \theta_{12}^T H'_{12} A_1$
Construct mixture residuals $\hat\epsilon_{\text{QL-MR}}$ as
$$\hat\epsilon_{\text{QL-MR}} = \hat\pi(\hat\beta_{21}^T H_{21} + |\hat\beta_{22}^T H_{22}|) + (1 - \hat\pi)(\hat\beta_{23}^T H_{23}) - [\hat\eta_{11}^T H_{11} + \hat\eta_{12}^T H_{12} A_1] - [\hat\theta_{11}^T H'_{11} + \hat\theta_{12}^T H'_{12} A_1]$$
Assess $\hat\epsilon_{\text{QL-MR}}$ using standard residual diagnostic plots
– If lack of fit is detected, adjust the set of predictors and re-assess the model
– Else find the optimal DTR based on the fitted models
The optimal stage-1 decision rule is given by
$$\hat d_1(h_1) = \arg\max_{a_1} Q_1(h_1, a_1; \hat\eta_1, \hat\theta_1)$$
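The QL-MR steps (nested stage-2 fit, model for $\pi$, two stage-1 regressions, mixture residuals, stage-1 rule) can be strung together as below. This is a sketch under stated assumptions: the toy data are hypothetical, $\pi$ is modelled by a logistic regression fitted with a hand-written Newton-Raphson loop (`fit_logistic` is a hypothetical helper, not a library call), and the same stage-1 features are used for both mixture components, i.e. $H'_{1k} = H_{1k}$.

```python
import numpy as np

def ols_fit_predict(X, y):
    """Least-squares coefficients and fitted values."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, X @ coef

def fit_logistic(X, y, n_iter=25):
    """Logistic regression MLE via Newton-Raphson (hypothetical helper)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        hess = X.T @ ((p * (1 - p))[:, None] * X)
        w = w + np.linalg.solve(hess, X.T @ (y - p))
    return w

# Toy data (hypothetical); S2 is random given H2, so pi is not degenerate.
rng = np.random.default_rng(2)
n = 600
O1 = rng.normal(size=n)
A1 = rng.choice([-1.0, 1.0], size=n)
O2 = 0.5 * O1 - 0.3 * A1 + rng.normal(size=n)
S2 = rng.binomial(1, 1.0 / (1.0 + np.exp(-O2))).astype(float)
A2 = rng.choice([-1.0, 1.0], size=n)
Y = 1 + O1 - A1 * O1 + S2 * A2 * (0.5 - O2) + rng.normal(size=n)

one = np.ones(n)
H2 = np.column_stack([one, O1, A1, O2])  # used as H21, H23 and in the pi model
H22 = np.column_stack([one, O2])

# Stage 2: nested model fitted to Y.
X2 = np.hstack([S2[:, None] * H2, (S2 * A2)[:, None] * H22, (1 - S2)[:, None] * H2])
beta2, _ = ols_fit_predict(X2, Y)
b21, b22, b23 = beta2[:4], beta2[4:6], beta2[6:]

# Estimated probability of re-randomization and the two mixture components.
pi_hat = 1.0 / (1.0 + np.exp(-H2 @ fit_logistic(H2, S2)))
comp1 = pi_hat * (H2 @ b21 + np.abs(H22 @ b22))
comp0 = (1 - pi_hat) * (H2 @ b23)

def stage1_design(a1):
    """Columns (1, O1, a1, a1 * O1), i.e. H11 = H12 = (1, O1)."""
    return np.column_stack([one, O1, a1, a1 * O1])

# Stage 1: one linear model per component, then the mixture residuals.
X1 = stage1_design(A1)
eta, fit1 = ols_fit_predict(X1, comp1)
theta, fit0 = ols_fit_predict(X1, comp0)
resid = comp1 + comp0 - fit1 - fit0

# Stage-1 rule: compare the fitted Q1 at a1 = +1 and a1 = -1.
q_plus = stage1_design(one) @ (eta + theta)
q_minus = stage1_design(-one) @ (eta + theta)
d1_hat = np.where(q_plus >= q_minus, 1.0, -1.0)
```

`resid` can then be passed to any standard residual plot (against each covariate, by treatment arm, or as a histogram).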
-
Asymptotic Properties of QL-MR
Standard Q-learning and QL-MR are asymptotically equivalent under the following conditions:
1 The postulated model for $Y$ among individuals with $S_2 = 0$ is correctly specified
2 The postulated model for $\pi$ is correctly specified
Precisely, $\tilde Y_{\text{QL-MR}} + (1 - S_2)\tau = \tilde Y + o_p(1)$, where $\tau = Y - \hat\beta_{23}^T H_{23}$ for individuals with $S_2 = 0$
In case of a correctly specified model, $E(\tau \mid H_2) = 0$, and thus the two pseudo-outcomes have the same mean
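A short worked step behind this identity, as a sketch: it assumes that both procedures use the linear stage-2 model with $a_2 \in \{-1, 1\}$ and that their stage-2 estimates among patients with $S_2 = 1$ coincide; any discrepancy between the two fits is absorbed by the $o_p(1)$ term.

$$\begin{aligned}
\tilde Y - \tilde Y_{\text{QL-MR}} &= S_2 \Big[\max_{a_2} Q_2(H_2, a_2; \hat\beta_2) - (\hat\beta_{21}^T H_{21} + |\hat\beta_{22}^T H_{22}|)\Big] + (1 - S_2)\big[Y - \hat\beta_{23}^T H_{23}\big] \\
&= 0 + (1 - S_2)\tau
\end{aligned}$$

The first bracket vanishes because, with actions coded as $\pm 1$, the maximum of the linear Q-function over $a_2$ equals $\hat\beta_{21}^T H_{21} + |\hat\beta_{22}^T H_{22}|$.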
-
Inference
The problem of non-regularity remains the same as in Q-learning
Either the adaptive confidence interval (ACI) (Laber et al., 2014) or the m-out-of-n bootstrap (Chakraborty et al., 2013) should be employed for constructing confidence intervals
– We extended and implemented ACI in the current work
-
Numerical Study
-
Simulation Design
Assess the diagnostic performance of QL-MR as compared to conventional Q-learning
Simulate a SMART study, analogous to CATIE, with sample size n = 300
-
Generative Model for Simulation Study
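$$O_{1j} \overset{\text{i.i.d.}}{\sim} N(0, 1), \quad j = 1, 2$$
$$A_1 \in \{-1, 1\} \text{ with probability } 0.5$$
$$O_{2j} \overset{\text{i.i.d.}}{\sim} N(5 - 0.3 A_1 - 0.5 O_{1j},\, 1), \quad j = 1, 2$$
$$S_2 = I\{O_{22} > 5\}$$
$$A_2 \in \{-1, 1\} \text{ with probability } 0.5$$
$$g(H_2) = 1 + 2 O_{11} - 1.5 O_{11}^2 - 2 O_{12} + O_{21} - A_1 - 0.5 A_1 O_{11}$$
$$\epsilon \sim N(0, 1)$$
$$Y = g(H_2) + S_2 \cdot (0.8 O_{21} - 0.5 A_2 - 0.4 A_2 O_{21} - 0.7 A_2 O_{11}) + \epsilon$$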
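The generative model translates directly into a short simulation. This is a sketch: the function name `simulate_smart`, the seed handling, and the returned dictionary layout are assumptions; the distributions and coefficients are the ones stated above. $A_2$ is drawn for every patient for convenience, but it only enters $Y$ when $S_2 = 1$.

```python
import numpy as np

def simulate_smart(n=300, seed=0):
    """Simulate one SMART data set from the generative model above."""
    rng = np.random.default_rng(seed)
    O11 = rng.normal(size=n)
    O12 = rng.normal(size=n)
    A1 = rng.choice([-1, 1], size=n)
    # Stage-2 observations depend on stage-1 treatment and covariates.
    O21 = rng.normal(5 - 0.3 * A1 - 0.5 * O11, 1.0)
    O22 = rng.normal(5 - 0.3 * A1 - 0.5 * O12, 1.0)
    S2 = (O22 > 5).astype(int)        # re-randomization indicator
    A2 = rng.choice([-1, 1], size=n)  # only relevant when S2 = 1
    g = 1 + 2 * O11 - 1.5 * O11**2 - 2 * O12 + O21 - A1 - 0.5 * A1 * O11
    eps = rng.normal(size=n)
    Y = g + S2 * (0.8 * O21 - 0.5 * A2 - 0.4 * A2 * O21 - 0.7 * A2 * O11) + eps
    return {"O11": O11, "O12": O12, "A1": A1, "O21": O21, "O22": O22,
            "S2": S2, "A2": A2, "Y": Y}

data = simulate_smart(n=300, seed=0)
```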
-
Analysis Model
Assume that the model for the stage-2 Q-function is correctly specified, and check the model fit at stage 1
At stage 1, fit models for $E[\hat\pi(\hat\beta_{21}^T H_{21} + |\hat\beta_{22}^T H_{22}|) \mid H_1, A_1]$ and $E[(1 - \hat\pi)(\hat\beta_{23}^T H_{23}) \mid H_1, A_1]$
Three types of models are considered for these quantities:
Model   Variables Included
1       $(O_{11}, O_{12}, A_1)$
2       $(O_{11}, O_{12}, A_1, O_{11}^2)$
3       $(O_{11}, O_{12}, A_1, O_{11}^2, A_1 O_{11})$
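The three candidate stage-1 models differ only in their design matrices, which can be sketched as follows. The helper name `stage1_design` and the inclusion of an intercept column are assumptions; the covariate sets follow the table above.

```python
import numpy as np

def stage1_design(model, O11, O12, A1):
    """Design matrix for candidate stage-1 model 1, 2 or 3 (intercept assumed)."""
    cols = [np.ones_like(O11), O11, O12, A1]
    if model >= 2:
        cols.append(O11 ** 2)   # quadratic term in O11
    if model >= 3:
        cols.append(A1 * O11)   # treatment-by-covariate interaction
    return np.column_stack(cols)

# Tiny illustrative inputs (hypothetical values).
rng = np.random.default_rng(0)
O11, O12 = rng.normal(size=5), rng.normal(size=5)
A1 = rng.choice([-1.0, 1.0], size=5)
designs = {m: stage1_design(m, O11, O12, A1) for m in (1, 2, 3)}
```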
-
[Figure panels: Res QL against o11; Res QL against o12; histogram of Res QL (Frequency); Res QL-MR against o11; Res QL-MR against o12; histogram of Res QL-MR (Frequency)]
Figure: Model 1 residual plots against O11 and O12, and the histogram. The orange and green lines are the loess smoother lines for individuals with A1 = +1 and A1 = −1, respectively.
-
[Figure panels: Res QL against o11; Res QL against o12; histogram of Res QL (Frequency); Res QL-MR against o11; Res QL-MR against o12; histogram of Res QL-MR (Frequency)]
Figure: Model 2 residual plots against O11 and O12, and the histogram. The orange and green lines are the loess smoother lines for individuals with A1 = +1 and A1 = −1, respectively.
-
[Figure panels: Res QL against o11; Res QL against o12; histogram of Res QL (Frequency); Res QL-MR against o11; Res QL-MR against o12; histogram of Res QL-MR (Frequency)]
Figure: Model 3 residual plots against O11 and O12, and the histogram. The orange and green lines are the loess smoother lines for individuals with A1 = +1 and A1 = −1, respectively.
-
What do the plots say?
Even after adjusting for quadratic and interaction terms, the residuals from standard Q-learning suggest at least a lack of variance homogeneity and a lack of symmetry / normality
This finding may lead the analyst to suspect a lack of fit and to consider variance-stabilizing and/or normality-inducing transformations
– This, in turn, may jeopardize the simplicity and interpretability of Q-learning
QL-MR, on the other hand, does not mislead an analyst
– And this is achieved using standard diagnostic tools, without having to invent new residual diagnostic techniques
In the end, the parameter estimates are similar to those of standard Q-learning, so the extra diagnostic ability does not come at the cost of estimation performance for the key parameters
-
Parameter Estimates
Table: Simulated data: Estimates of the Stage-2 and Stage-1 decision rule parameters
                 Standard Q-learning            QL-MR
Parameter        Estimate   90% CI              Estimate   90% CI
Stage-2 Model
A2               -2.17      (-2.97, -1.37)      -2.18      (-3.01, -1.35)
A2O11            -1.67      (-1.84, -1.51)      -1.68      (-1.85, -1.51)
A2O21             1.64      ( 1.47,  1.80)       1.64      ( 1.47,  1.81)
Stage-1 Model
A1               -0.84      (-1.44, -0.24)      -0.86      (-1.48, -0.26)
A1O11            -3.69      (-4.43, -2.96)      -3.75      (-4.49, -3.07)
-
CATIE Data Analysis (QoL Outcome)
Table: CATIE: Stage-2 and Stage-1 regression models
                                  Standard Q-learning           QL-MR
Parameter                         Estimate   90% CI             Estimate   90% CI
Stage-2 Model
O11: Baseline PANSS                0.01      (-0.12,  0.14)      0.02      (-0.11,  0.15)
O11^2: Baseline PANSS              0.05      (-0.02,  0.13)      0.02      (-0.05,  0.09)
O12: Baseline Quality of Life      0.48      ( 0.36,  0.61)      0.49      ( 0.37,  0.60)
A1: Stage-1 treatment              0.004     (-0.11,  0.12)      0.008     (-0.09,  0.11)
O21: PANSS during stage-1         -0.19      (-0.30, -0.08)     -0.20      (-0.30, -0.10)
A2: Stage-2 treatment             -0.06      (-0.17,  0.04)     -0.07      (-0.17,  0.03)
A2A1                              -0.09      (-0.19,  0.02)     -0.09      (-0.19,  0.01)
Stage-1 Model
O11: Baseline PANSS               -0.13      (-0.23, -0.04)     -0.12      (-0.22, -0.03)
O11^2: Baseline PANSS              0.06      ( 0.00,  0.12)      0.05      (-0.01,  0.12)
O12: Baseline Quality of Life      0.51      ( 0.42,  0.61)      0.50      ( 0.42,  0.59)
A1: Stage-1 treatment             -0.01      (-0.10,  0.11)     -0.01      (-0.13,  0.09)
-
Discussion
-
Summary
SMART designs are becoming increasingly popular in various domains of health research
– A particular type of SMART study, in which only the non-responders to the initial treatment are re-randomized, is more common
Secondary analysis of SMART studies to find individualized interventions is usually conducted using Q-learning
In the case of SMARTs where only the non-responders are re-randomized, model checking for standard Q-learning is problematic
– This problem has received little, if any, attention in the literature so far
– We have proposed a simple modification of Q-learning so that standard residual diagnostic tools from the classical regression literature can be used
-
Shoot your questions, comments, criticisms, and requests for slides to: [email protected]