1
A Semiparametric Statistics Approach to Model-Free Policy Evaluation
Tsuyoshi UENO(1), Motoaki KAWANABE(2), Takeshi MORI(1), Shin-ichi MAEDA(1), Shin ISHII(1),(3)
(1) Kyoto University
(2) Fraunhofer FIRST
2
Summary of This Talk
• We discuss LSTD-based policy evaluation from the viewpoint of semiparametric statistics and estimating functions.
1. How good is LSTD?
We show that LSTD is a type of estimating function method and evaluate its asymptotic estimation variance.
2. Can we improve LSTD?
We derive an optimal estimating function with the minimum asymptotic estimation variance.
We propose a new policy evaluation algorithm (gLSTD).
Model-Free Reinforcement Learning
3
Goal: Obtain an optimal policy $\pi^*$ that maximizes the sum of future rewards.
[Figure: agent-environment loop with labels Policy, Environment, State, Action, Reward]
4
Policy Iteration [Sutton & Barto, 1998]
Policy Evaluation (estimate the value function)
Policy Improvement (update the policy)
Value function estimation is the key to policy iteration.
If the value function is estimated correctly, policy iteration converges to the optimal policy $\pi^*$.
5
Policy Evaluation Method: LSTD [Bradtke & Barto, 1996]
• Least Squares Temporal Difference (LSTD)
– LSTD-based policy iteration algorithms have shown good practical performance.
• Least Squares Policy Iteration (LSPI) [Lagoudakis & Parr, 2003]
• Natural Actor-Critic (NAC) [Peters et al., 2003, 2005]
• Representation Policy Iteration (RPI) [Mahadevan & Maggioni, 2007]
LSTD is one of the important algorithms in the RL field.
6
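Least Squares Temporal Difference (LSTD)
• Bellman equation [Bellman, 1966]
$$V^{\pi}(s) := \mathrm{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s\right]$$
• Assumption
We assume that a linear function represents the value function exactly (there is no bias):
$$V^{\pi}(s_t) := \phi(s_t)^{\top}\theta = \phi_t^{\top}\theta$$
where $\phi_t$ is the feature vector and $\theta$ is the parameter. The Bellman equation then reads
$$\phi_t^{\top}\theta = \mathrm{E}^{\pi}[r_{t+1} \mid s_t] + \gamma\,\mathrm{E}^{\pi}[\phi_{t+1} \mid s_t]^{\top}\theta$$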
7
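Least Squares Temporal Difference (LSTD)
• Linearly approximated Bellman equation
$$\left\{(\phi_t - \gamma\phi_{t+1}) + \gamma\left(\phi_{t+1} - \mathrm{E}^{\pi}[\phi_{t+1} \mid s_t]\right)\right\}^{\top}\theta + \left(r_{t+1} - \mathrm{E}^{\pi}[r_{t+1} \mid s_t]\right) = r_{t+1}$$
Here $\theta$ is the parameter, and the two terms $\gamma(\phi_{t+1} - \mathrm{E}^{\pi}[\phi_{t+1} \mid s_t])$ and $(r_{t+1} - \mathrm{E}^{\pi}[r_{t+1} \mid s_t])$ are noise.
This is just a linear regression problem (an errors-in-variables problem [Young, 1984]):
$$y_t = x_t^{\top}\theta + e_t$$
Input: $x_t$    Output: $y_t$
The input noise and the observation noise are mutually dependent.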
8
Linear Regression with Error in Variables
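• Ordinary least squares method (OLS):
$$\hat{\theta}_{\mathrm{OLS}} = \left[\sum_{t=1}^{N} x_t x_t^{\top}\right]^{-1}\left[\sum_{t=1}^{N} x_t y_t\right]$$
Because the observation noise depends on the input variable, the OLS estimator $\hat{\theta}_{\mathrm{OLS}}$ is biased.
[Figure: output y plotted against input x, with the fitted line $y = x\hat{\theta}_{\mathrm{OLS}}$]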
9
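Instrumental Variable Method [Soderstrom and Stoica, 2002]
• Introduce the instrumental variable $z_t$. The instrumental variable is correlated with the input but uncorrelated with the noise.
$$\hat{\theta}_{\mathrm{OLS}} = \left[\sum_{t=1}^{N} x_t x_t^{\top}\right]^{-1}\left[\sum_{t=1}^{N} x_t y_t\right]$$
$$\hat{\theta}_{\mathrm{IV}} = \left[\sum_{t=1}^{N} z_t x_t^{\top}\right]^{-1}\left[\sum_{t=1}^{N} z_t y_t\right]$$
$\hat{\theta}_{\mathrm{IV}}$ is a consistent (asymptotically unbiased) estimator.
[Figure: output y plotted against input x, with the fitted line $y = x\hat{\theta}$]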
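The contrast between the two estimators can be checked numerically. The following is a minimal sketch on synthetic data, not the authors' experiment: the data-generating process, the sample size, and the choice of the noise-free input as the instrument (possible only because the data are simulated) are all assumptions made for illustration.

```python
import numpy as np

def ols(x, y):
    # theta = [sum_t x_t x_t^T]^{-1} [sum_t x_t y_t]
    return np.linalg.solve(x.T @ x, x.T @ y)

def iv(z, x, y):
    # theta = [sum_t z_t x_t^T]^{-1} [sum_t z_t y_t]
    return np.linalg.solve(z.T @ x, z.T @ y)

rng = np.random.default_rng(0)
n = 50_000
theta_true = np.array([2.0])

# Hypothetical setup: z is the noise-free input, x is what we observe.
z = rng.normal(size=(n, 1))
e_x = rng.normal(size=(n, 1))                    # input noise
x = z + e_x                                      # observed, noisy input
y = z @ theta_true + 0.1 * rng.normal(size=n)    # output generated from the clean input

# Written as y = x^T theta + e, the noise e contains -e_x * theta,
# so it is correlated with x but uncorrelated with z.
theta_ols = ols(x, y)
theta_iv = iv(z, x, y)
print("OLS:", theta_ols, "IV:", theta_iv, "true:", theta_true)
```

In this setup the OLS estimate is pulled toward half of the true value, because the input noise has the same variance as the clean input, while the IV estimate stays close to the true value.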
10
Least Squares Temporal Difference (LSTD)
• LSTD = instrumental variable method
– Instrumental variable: $z_t = \phi_t$
$$\hat{\theta}_{\mathrm{LSTD}} = \left[\sum_{t=0}^{N-1} \phi_t (\phi_t - \gamma\phi_{t+1})^{\top}\right]^{-1}\left[\sum_{t=0}^{N-1} \phi_t r_{t+1}\right]$$
For example, $z_t = \phi_t + c$, $z_t = \phi_{t-k}$, $z_t = a\phi_t + c$, $\dots$ are also instrumental variables.
It is important to choose an appropriate instrumental variable.
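As an illustration of the LSTD estimator with $z_t = \phi_t$, here is a minimal sketch. The three-state chain, its reward vector, and the one-hot features are hypothetical choices; one-hot features are used only so that the linear model represents the value function exactly, matching the no-bias assumption, and the estimate can be compared with the exact solution of the Bellman equation.

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma):
    """LSTD: solve [sum_t phi_t (phi_t - gamma phi_{t+1})^T] theta = sum_t phi_t r_{t+1}."""
    x = phi - gamma * phi_next
    return np.linalg.solve(phi.T @ x, phi.T @ rewards)

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical 3-state Markov chain under a fixed policy.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
R = np.array([0.0, 0.0, 1.0])        # reward received when leaving each state

# Sample one long trajectory.
n_steps = 20_000
states = np.empty(n_steps + 1, dtype=int)
states[0] = 0
for t in range(n_steps):
    states[t + 1] = rng.choice(3, p=P[states[t]])

features = np.eye(3)                 # one-hot features: no approximation bias
phi, phi_next = features[states[:-1]], features[states[1:]]
rewards = R[states[:-1]]

theta_lstd = lstd(phi, phi_next, rewards, gamma)
v_true = np.linalg.solve(np.eye(3) - gamma * P, R)   # exact value function
print("LSTD:", theta_lstd)
print("true:", v_true)
```

With one-hot features the parameter vector is the value function itself, so the two printed vectors should agree up to sampling error.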
Our Approach
• How good is LSTD ?
• Can we improve LSTD?
11
We analyze the asymptotic estimation variance of the instrumental variable method.
We optimize the instrumental variable so as to minimize the asymptotic estimation variance.
We introduce the viewpoint of semiparametric statistical inference.
12
Semiparametric Statistics Approach
• Semiparametric model: $p(x; \theta, \kappa)$
– $\theta$ is the target parameter
– $\kappa$ is the nuisance parameter (infinite degrees of freedom)
We need to estimate only the target parameter, regardless of the nuisance parameters.
• Linearly approximated Bellman equation
$$y_t = x_t^{\top}\theta + e_t, \qquad x_t = \phi_t - \gamma\phi_{t+1}, \qquad y_t = r_{t+1}$$
We do not know the noise distribution.
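13
Inference of Semiparametric Model
• Estimating function [Godambe, 1985]
[Conditions] For any nuisance parameter,
$$\mathrm{E}^{\pi}[f(x, y; \theta)] = 0, \qquad \mathrm{E}^{\pi}\left[\frac{\partial}{\partial\theta} f(x, y; \theta)\right] \neq 0, \qquad \mathrm{E}^{\pi}\left[\|f(x, y; \theta)\|^{2}\right] < \infty$$
• Estimating equation
$$\sum_{t=0}^{N-1} f(x_t, y_t; \hat{\theta}) = 0$$
The solution $\hat{\theta}$ converges to the true parameter $\theta^*$ regardless of the nuisance parameter.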
14
Estimating Functions
• Estimating function = LSTD
$$f_{\mathrm{LSTD}} = \phi_t\left\{(\phi_t - \gamma\phi_{t+1})^{\top}\theta - r_{t+1}\right\}$$
• Estimating function = Instrumental variable method
$$f_{\mathrm{IV}} = z(s_t, \dots, s_{t-k})\left\{(\phi_t - \gamma\phi_{t+1})^{\top}\theta - r_{t+1}\right\}$$
where $z(\cdot)$ is the instrumental variable.
Are there any other estimating functions?
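As a short check, assuming the linear model is exact so that the Bellman equation $\phi_t^{\top}\theta^* = \mathrm{E}^{\pi}[r_{t+1} \mid s_t] + \gamma\,\mathrm{E}^{\pi}[\phi_{t+1} \mid s_t]^{\top}\theta^*$ holds at the true parameter $\theta^*$, the LSTD estimating function has zero mean at $\theta^*$. Conditioning on $s_t$ first gives

$$\mathrm{E}^{\pi}\left[\phi_t\left\{(\phi_t - \gamma\phi_{t+1})^{\top}\theta^* - r_{t+1}\right\}\right] = \mathrm{E}^{\pi}\left[\phi_t\,\mathrm{E}^{\pi}\left[(\phi_t - \gamma\phi_{t+1})^{\top}\theta^* - r_{t+1} \,\middle|\, s_t\right]\right] = \mathrm{E}^{\pi}[\phi_t \cdot 0] = 0.$$

The same argument goes through when $\phi_t$ is replaced by any function of the current and past states, because the Markov property makes the inner conditional expectation vanish given that history. This is why an instrumental variable of the form $z(s_t, \dots, s_{t-k})$ also yields an estimating function.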
15
Are There Any Other Estimating Functions?
No!
Proposition 1
Every admissible estimating function must have the form
$$f_{\mathrm{IV}} = z(s_t, s_{t-1}, \dots, s_{t-T})\left\{(\phi_t - \gamma\phi_{t+1})^{\top}\theta - r_{t+1}\right\}.$$
An "inadmissible" estimating function is one for which a superior estimating function exists.
16
Asymptotic Variance of LSTD-Based Estimators
Lemma 2. The asymptotic estimation variance of an estimating function for value functions is given by
$$\mathrm{AV}[\hat{\theta}] = \frac{1}{N} A^{-1} M \left(A^{\top}\right)^{-1}$$
where
$$A = \mathrm{E}^{\pi}\left[z_t(\phi_t - \gamma\phi_{t+1})^{\top}\right], \qquad M = \mathrm{E}^{\pi}\left[(\varepsilon_t^*)^{2} z_t z_t^{\top}\right]$$
and
$$\varepsilon_t^* = r_{t+1} - (\phi_t - \gamma\phi_{t+1})^{\top}\theta^*.$$
Which instrumental variable attains the minimum asymptotic variance?
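The variance formula of Lemma 2 can be turned into a plug-in estimate by replacing the expectations with sample means. The sketch below makes that substitution as an assumption of convenience; the function name, the synthetic features, and the noise level are hypothetical, and in practice the true parameter would itself have to be replaced by an estimate.

```python
import numpy as np

def iv_asymptotic_variance(z, phi, phi_next, rewards, gamma, theta):
    """Plug-in estimate of (1/N) A^{-1} M A^{-T} with expectations replaced by sample means."""
    n = len(rewards)
    x = phi - gamma * phi_next
    eps = rewards - x @ theta                    # residual r_{t+1} - (phi_t - gamma phi_{t+1})^T theta
    A = z.T @ x / n                              # estimates E[z_t (phi_t - gamma phi_{t+1})^T]
    M = (z * eps[:, None] ** 2).T @ z / n        # estimates E[eps_t^2 z_t z_t^T]
    A_inv = np.linalg.inv(A)
    return A_inv @ M @ A_inv.T / n

# Hypothetical synthetic data, used only to exercise the function.
rng = np.random.default_rng(0)
n, d, gamma = 2000, 2, 0.9
phi = rng.normal(size=(n, d))
phi_next = rng.normal(size=(n, d))
theta = np.array([1.0, -0.5])
rewards = (phi - gamma * phi_next) @ theta + 0.1 * rng.normal(size=n)

# z_t = phi_t corresponds to LSTD.
av_lstd = iv_asymptotic_variance(phi, phi, phi_next, rewards, gamma, theta)
print(av_lstd)
```

Passing a different array as `z` gives the corresponding estimate for another instrumental variable, which is one way to compare candidates on the same data.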
17
The Optimal Estimating Function
Theorem 1.
The optimal instrumental variable with the minimum asymptotic variance is given by
$$z_t^* = \mathrm{E}^{\pi}\left[(\varepsilon_t^*)^{2} \mid s_t\right]^{-1} \mathrm{E}^{\pi}\left[\phi_t - \gamma\phi_{t+1} \mid s_t\right]$$
where
$$\varepsilon_t^* = r_{t+1} - (\phi_t - \gamma\phi_{t+1})^{\top}\theta^*.$$
The true parameter $\theta^*$ is unknown, and so are the conditional expectations. An approximation is therefore necessary: gLSTD.
18
gLSTD
The optimal instrumental variable
$$z_t^* = \mathrm{E}^{\pi}\left[(\varepsilon_t^*)^{2} \mid s_t\right]^{-1} \mathrm{E}^{\pi}\left[\phi_t - \gamma\phi_{t+1} \mid s_t\right]$$
contains two kinds of unknown quantities:
• The residual of the true parameter, $\varepsilon_t^*$ (unknown). Replace it with the regression residual of the LSTD estimator: $\varepsilon_t^* \leftarrow \hat{\varepsilon}_t^{\mathrm{LSTD}}$.
• The unknown conditional expectations $\mathrm{E}^{\pi}[(\varepsilon_t^*)^{2} \mid s_t]$ and $\mathrm{E}^{\pi}[\phi_{t+1} \mid s_t]$. Approximate them with a sample-based function approximation technique.
19
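Summary of gLSTD
1) Calculate the initial estimator and replace the true residual
$$\hat{\theta}_{\mathrm{LSTD}} \leftarrow \left[\sum_{t=0}^{N-1} \phi_t(\phi_t - \gamma\phi_{t+1})^{\top}\right]^{-1}\left[\sum_{t=0}^{N-1} \phi_t r_{t+1}\right], \qquad \varepsilon_t^* \leftarrow \hat{\varepsilon}_t^{\mathrm{LSTD}}$$
2) Approximate the conditional expectations
$$\mathrm{E}^{\pi}\left[(\varepsilon_t^*)^{2} \mid s_t\right], \qquad \mathrm{E}^{\pi}\left[\phi_{t+1} \mid s_t\right]$$
3) Construct the instrumental variable
$$\hat{z}_t \leftarrow \mathrm{E}^{\pi}\left[(\varepsilon_t^*)^{2} \mid s_t\right]^{-1} \mathrm{E}^{\pi}\left[\phi_t - \gamma\phi_{t+1} \mid s_t\right]$$
4) Calculate the gLSTD estimator
$$\hat{\theta}_{\mathrm{gLSTD}} \leftarrow \left[\sum_{t=0}^{N-1} \hat{z}_t(\phi_t - \gamma\phi_{t+1})^{\top}\right]^{-1}\left[\sum_{t=0}^{N-1} \hat{z}_t r_{t+1}\right]$$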
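The four steps can be sketched for a finite state space as follows. This is an illustrative sketch, not the authors' implementation: the slides only say that the conditional expectations are approximated by a sample-based function approximation technique, and per-state sample averages are assumed here as the simplest such choice. The function names, the `var_floor` guard against division by zero, the three-state chain, and the two-dimensional features are all hypothetical.

```python
import numpy as np

def iv_estimate(z, phi, phi_next, rewards, gamma):
    """Solve sum_t z_t {(phi_t - gamma phi_{t+1})^T theta - r_{t+1}} = 0 for theta."""
    x = phi - gamma * phi_next
    return np.linalg.solve(z.T @ x, z.T @ rewards)

def glstd(states, features, rewards, gamma, var_floor=1e-8):
    """gLSTD sketch for discrete states (assumption: per-state sample means)."""
    s = states[:-1]
    phi, phi_next = features[s], features[states[1:]]
    x = phi - gamma * phi_next
    n_states = features.shape[0]

    # 1) Initial LSTD estimate (z_t = phi_t) and its residuals
    theta_lstd = iv_estimate(phi, phi, phi_next, rewards, gamma)
    eps = rewards - x @ theta_lstd

    # 2) Conditional expectations given s_t, approximated by per-state averages
    counts = np.maximum(np.bincount(s, minlength=n_states), 1)
    eps2_given_s = np.bincount(s, weights=eps ** 2, minlength=n_states) / counts
    x_given_s = np.zeros((n_states, x.shape[1]))
    np.add.at(x_given_s, s, x)
    x_given_s /= counts[:, None]

    # 3) Instrumental variable: E[eps^2 | s_t]^{-1} E[phi_t - gamma phi_{t+1} | s_t]
    z = x_given_s[s] / np.maximum(eps2_given_s[s], var_floor)[:, None]

    # 4) gLSTD estimate
    return iv_estimate(z, phi, phi_next, rewards, gamma), theta_lstd

# Hypothetical 3-state chain with 2-dimensional features (bias, position).
rng = np.random.default_rng(0)
P = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
R = np.array([0.0, 0.0, 1.0])
states = np.zeros(5001, dtype=int)
for t in range(5000):
    states[t + 1] = rng.choice(3, p=P[states[t]])
rewards = R[states[1:]]                          # reward on arrival in the next state
features = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])

theta_glstd, theta_lstd = glstd(states, features, rewards, gamma=0.9)
print("LSTD :", theta_lstd)
print("gLSTD:", theta_glstd)
```

One property of this particular sketch: with one-hot features and every state visited, the two estimators coincide, since both reduce to solving the same per-state equations. A difference appears only when there are fewer features than states, as in the example above.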
20
Simulation (Markov Random Walk)
• Conditions of the simulation experiment
– Policy: random
– Number of steps: 100
– Number of episodes: 100
– Discount factor: 0.9
• Basis functions
– We generated three basis functions by the diffusion model [Mahadevan & Maggioni, 2007].
[Figure: chain of five states, 1 to 5, with reward labels R=0 (three states), R=0.5, and R=1.0]
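A setup of this kind can be sketched as below. Several details are assumptions, because the slide does not specify them: the walk moves left or right with probability 0.5 and stays in place at the ends of the chain, each episode starts in a uniformly random state, the reward is paid on arrival, the assignment of rewards to states in `REWARDS` is hypothetical, and the three smoothest eigenvectors of the chain's graph Laplacian stand in for the basis functions generated by the diffusion model.

```python
import numpy as np

N_STATES, N_STEPS, N_EPISODES, GAMMA = 5, 100, 100, 0.9
REWARDS = np.array([0.0, 0.0, 0.0, 0.5, 1.0])   # hypothetical assignment of rewards to states

def random_walk_matrix(n):
    """Transition matrix of a random walk that stays put when it would leave the chain."""
    P = np.zeros((n, n))
    for s in range(n):
        P[s, max(s - 1, 0)] += 0.5
        P[s, min(s + 1, n - 1)] += 0.5
    return P

def laplacian_basis(n, k):
    """The k smoothest eigenvectors of the chain graph Laplacian L = D - W."""
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = 1.0
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    return vecs[:, :k]

def sample_episode(P, n_steps, rng):
    """Sample states s_0..s_N and rewards r_1..r_N (reward paid on arrival)."""
    states = np.empty(n_steps + 1, dtype=int)
    states[0] = rng.integers(P.shape[0])
    for t in range(n_steps):
        states[t + 1] = rng.choice(P.shape[0], p=P[states[t]])
    return states, REWARDS[states[1:]]

rng = np.random.default_rng(0)
P = random_walk_matrix(N_STATES)
basis = laplacian_basis(N_STATES, 3)             # feature vector of state s is basis[s]
episodes = [sample_episode(P, N_STEPS, rng) for _ in range(N_EPISODES)]
print(basis.shape, len(episodes), episodes[0][0][:10])
```

Each episode supplies the arrays needed by an LSTD-style estimator: `basis[states[:-1]]` and `basis[states[1:]]` as the feature sequences, and the sampled rewards as the regression targets.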
21
Simulation Result
The gLSTD estimator achieved a 20% smaller MSE than the LSTD estimator.
[Figure: MSE of LSTD and gLSTD, showing the median and the upper and lower quartiles; the gap between the two is marked as 20%]
22
Conclusion
• We discussed LSTD-based policy evaluation in the framework of semiparametric statistics.
– We evaluated the asymptotic variance of LSTD-based estimators.
– We derived the optimal estimating function with the minimum asymptotic variance and proposed a practical implementation of it: gLSTD.
– On a simple Markov chain problem, we demonstrated that gLSTD reduces the estimation variance of LSTD.
23
Future Work
A Semiparametric Approach to Model-Free Policy Evaluation
→ A Semiparametric Approach to Model-Free Reinforcement Learning
Application to policy improvement:
- Least Squares Policy Iteration (LSPI)
- Natural Actor-Critic (NAC), etc.
24
End
Thank you for your attention!
Cost Function
25
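$$\hat{\theta}_{\mathrm{LS}} = \arg\min_{\theta} \frac{1}{2}\left\|\hat{V} - V^*\right\|^{2}_{D^{\mathrm{LS}}}, \qquad \hat{\theta}_{\mathrm{gLS}} = \arg\min_{\theta} \frac{1}{2}\left\|\hat{V} - V^*\right\|^{2}_{D^{\mathrm{gLS}}}$$
[Definitions of the weight matrices $D^{\mathrm{LS}}$ and $D^{\mathrm{gLS}}$: both are built from $(I - \gamma P)$ and $D$, with $\Phi$ entering $D^{\mathrm{LS}}$ and $\Sigma$ entering $D^{\mathrm{gLS}}$; the exact expressions did not survive extraction]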
Simulation Result
26
[Figure: chain of states 1 to 5 with reward vector $r = [0 \;\; 0 \;\; 0 \;\; 0 \;\; 1.0]$]
31
Questions
1. How good is LSTD?
We show that LSTD is a type of estimating function method and evaluate its asymptotic estimation variance.
2. Can we improve LSTD?
We derive the optimal estimating function with the minimum asymptotic estimation variance.
32
The Suboptimal Estimating Function (LSTDc)
• gLSTD requires estimating functions that depend on the current state.
• To avoid estimating these functions, we simply replace them with constant values:
$$z_t = \phi_t + c$$
The shift $c$ is optimized to minimize the asymptotic variance.
33
The Suboptimal Estimating Function (LSTDc)
Theorem 2. The optimal shift $c^*$ is given by a ratio of expectations under $\pi$ involving $(\varepsilon_t^*)^{2}$, $\gamma$, $\phi_t$ and $\phi_{t+1}$ [closed-form expression not recoverable from the extracted slide text], where
$$\varepsilon_t^* = r_{t+1} - (\phi_t - \gamma\phi_{t+1})^{\top}\theta^*.$$
34
Summary of This Talk
• We introduce a semiparametric statistical viewpoint for estimating value functions with a linear model.
• Our aim
– Evaluate the estimation variance of value functions
– Develop more efficient estimation methods
36
Summary of Our Main Results
1. Formulate the estimation problem of linearly-represented value functions as a semiparametric inference problem
2. Evaluate the asymptotic variance of value function estimators
3. Derive the optimal estimation method with the minimum asymptotic variance
37
Estimating Functions
• Question: Which function is appropriate when more than one estimating function exists?
• Answer: Choose the estimating function with the minimum asymptotic variance
$$\mathrm{AV}[\hat{\theta}] := \mathrm{E}\left[(\hat{\theta} - \theta^*)(\hat{\theta} - \theta^*)^{\top}\right]$$
38
Instrumental Variable (IV) Method
• Instrumental variable $z_t$:
– Correlated with the input variable $x_t$, but uncorrelated with the noise.
• Instrumental variable method
$$\{x_t + e_{xt}\}^{\top}\theta + e_{yt} = y_t$$
$$\theta = \mathrm{E}\left[z_t x_t^{\top}\right]^{-1}\mathrm{E}\left[z_t y_t\right]$$
39
Statistics approach
40
What is the Semiparametric Approach ?
• Semiparametric model: $p(x; \theta, \kappa)$
– Parameter: $\theta$
– Nuisance parameter: $\kappa$
We need to estimate the parameter $\theta$ regardless of the nuisance parameter $\kappa$.
• Estimating function [Godambe, 1985]
[Conditions]
$$\mathrm{E}[f(x; \theta)] = 0$$
– The solution $\hat{\theta}$ of $\sum_{t=0}^{N-1} f(x_t; \hat{\theta}) = 0$ converges to the true parameter $\theta^*$.
See [Godambe, 1985] for details.