Reinforcement Learning
Donglin Zeng, Department of Biostatistics, University of North Carolina
Introduction
- Unsupervised learning has no outcome (no feedback).
- Supervised learning has an outcome, so we know what to predict.
- Reinforcement learning is in between: it has no explicit supervision, so it uses a reward system to learn the feature-outcome relationship.
- The crucial advantage of reinforcement learning is its non-greedy nature: we do not need to improve performance in the short term, but rather optimize a long-term achievement.
RL terminology
- Reinforcement learning is a dynamic process in which, at each step, a new decision rule or policy is updated based on new data and the reward system.
- Terminology used in reinforcement learning:
  - Agent: whoever carries out the learned decisions during the process (the robot in AI).
  - Action (A): a decision to be taken during the process.
  - State (S): environment variables that may interact with the action.
  - Reward (R): a value system to evaluate an action given a state.
- Note that (A, S, R) are time-step dependent, so we write $(A_t, S_t, R_t)$ to reflect time step t; the interaction loop is sketched below.
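To make the terminology concrete, here is a minimal sketch of the agent-environment loop; the two-state environment, its reward rule, and the random policy are invented purely for illustration.

```python
import random

# Hypothetical two-state environment, for illustration only.
STATES = ["low", "high"]        # state space S
ACTIONS = ["stay", "switch"]    # action space A

def step(state, action):
    """Return (next_state, reward): a toy transition and reward rule."""
    next_state = state if action == "stay" else ("high" if state == "low" else "low")
    reward = 1.0 if next_state == "high" else 0.0  # R evaluates the action given the state
    return next_state, reward

def policy(state):
    """The agent's decision rule pi_t(a|s); here, purely random."""
    return random.choice(ACTIONS)

state = "low"                            # S_1
for t in range(1, 6):
    action = policy(state)               # A_t ~ pi_t(.|S_t)
    state, reward = step(state, action)  # observe S_{t+1} and R_t
    print(f"t={t}: A_t={action}, S_(t+1)={state}, R_t={reward}")
```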
Reinforcement learning diagram
Maze example
Maze example: continue
Mountain car problem
RL Notation
- At time step t, the agent observes a state $S_t$ from a state space $\mathcal{S}_t$ and selects an action $A_t$ from an action space $\mathcal{A}_t$.
- Both action and state result in a transition to a new state $S_{t+1}$.
- Given $(A_t, S_t, S_{t+1})$, the agent receives an immediate reward
$$R_t = r_t(S_t, A_t, S_{t+1}) \in \mathbb{R},$$
where $r_t(\cdot, \cdot, \cdot)$ is called the immediate reward function.
RL mathematical formulation
- At time t, we assume a transition probability function from $(S_t = s, A_t = a)$ to $(S_{t+1} = s')$: $p_t(s'|s, a) \ge 0$, $\int_{s'} p_t(s'|s, a)\, ds' = 1$.
- We also assume that $A_t$ given $S_t$ follows a probability distribution: $\pi_t(a|s) \ge 0$, $\int_a \pi_t(a|s)\, da = 1$.
- A trajectory (training sample) $(s_1, a_1, s_2, \ldots, s_T, a_T, s_{T+1})$ is generated as follows (see the sketch after this list):
  - start from an initial state $s_1$ drawn from a probability distribution $p(s)$;
  - for $t = 1, 2, \ldots, T$ (T is the total number of steps),
    - (a) $a_t$ is chosen from $\pi_t(\cdot|s_t)$;
    - (b) the next state $s_{t+1}$ is drawn from $p_t(\cdot|s_t, a_t)$.
- The problem is called finite horizon if $T < \infty$ and infinite horizon if $T = \infty$.
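A generic trajectory sampler under these assumptions might look as follows; the samplers `draw_initial_state`, `sample_action`, and `sample_transition` are hypothetical stand-ins for $p(s)$, $\pi_t(\cdot|s)$, and $p_t(\cdot|s, a)$.

```python
import random

# Hypothetical samplers standing in for p(s), pi_t(.|s), and p_t(.|s, a).
def draw_initial_state():
    return random.gauss(0.0, 1.0)

def sample_action(t, s):
    return random.choice([-1.0, 0.0, 1.0])

def sample_transition(t, s, a):
    return s + a + random.gauss(0.0, 0.1)

def sample_trajectory(T):
    """Generate (s_1, a_1, s_2, ..., s_T, a_T, s_{T+1}) as described above."""
    s = draw_initial_state()                 # s_1 ~ p(s)
    trajectory = []
    for t in range(1, T + 1):
        a = sample_action(t, s)              # (a) a_t ~ pi_t(.|s_t)
        s_next = sample_transition(t, s, a)  # (b) s_{t+1} ~ p_t(.|s_t, a_t)
        trajectory.append((s, a, s_next))
        s = s_next
    return trajectory

print(sample_trajectory(T=5))
```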
Goal of RL
- Define the return at time t as
$$\sum_{j=t}^{T} \gamma^{j-t}\, r(S_j, A_j, S_{j+1}),$$
where $\gamma \in [0, 1)$ is called the discount factor (it discounts a long trajectory).
- An action policy, $\pi = (\pi_1, \ldots, \pi_T)$, is a sequence of probability distribution functions, where $\pi_t$ is a probability distribution for $A_t$ given $S_t$.
- The goal of RL is to learn the optimal action decision, i.e., the policy $\pi^* = (\pi^*_1, \pi^*_2, \ldots, \pi^*_T)$ that maximizes the expected return
$$E_\pi\Big[\sum_{j=1}^{T} \gamma^{j-1}\, r(S_j, A_j, S_{j+1})\Big],$$
where $E_\pi(\cdot)$ means $A_t | S_t \sim \pi_t(\cdot|S_t)$. A small numeric check of the return is given below.
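As a quick numeric check (with invented rewards), the return is just a geometrically discounted sum:

```python
def discounted_return(rewards, gamma, t=0):
    """Compute sum_{j=t}^{T} gamma^(j-t) * r_j for a recorded reward sequence."""
    return sum(gamma ** (j - t) * rewards[j] for j in range(t, len(rewards)))

# Toy rewards 1, 1, 10 with gamma = 0.9: 1 + 0.9*1 + 0.81*10 = 10.0.
print(discounted_return([1.0, 1.0, 10.0], gamma=0.9))
```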
Optimal policy
- RL aims to find the best action decision rules such that the average long-term reward is maximized when such rules are implemented.
- Note: $\pi^*$ is a function of the states, and for any individual we only know what the action should be at time t after observing the states at time t. This is related to so-called adaptive or dynamic decision making.
How is supervised learning framed in the RL context?
- We can imagine $S_t$ to be all data (both features and outcomes) collected by step t.
- Then $A_t$ is the prediction rule chosen from a class of prediction functions based on $S_t$ (it need not be a perfect prediction function; it can even be a random prediction), so $\pi_t$ is the probabilistic selection of a prediction function at step t.
- Based on $(S_t, A_t)$, the state $S_{t+1}$ can be $S_t$ with additionally collected data, or $S_t$ with individual errors, or just $S_t$.
- $R_t$ is the prediction error evaluated on the data.
- The goal is to learn the best prediction rule, and RL methods can help!
Two important concepts in RL
- State-action value function (SAV): the expected return increment at time t given state $S_t = s$ and action $A_t = a$:
$$Q^\pi_t(s, a) = E_\pi\Big[\sum_{j=t}^{T} \gamma^{j-t}\, r_j(S_j, A_j, S_{j+1}) \,\Big|\, S_t = s, A_t = a\Big].$$
$Q^*_t(s, a) \equiv Q^{\pi^*}_t(s, a)$ is the optimal expected return at time t.
- State value function (SV): the expected return increment at time t given state $S_t = s$:
$$V^\pi_t(s) = E_\pi\Big[\sum_{j=t}^{T} \gamma^{j-t}\, r_j(S_j, A_j, S_{j+1}) \,\Big|\, S_t = s\Big].$$
Similarly, $V^*_t(s) = V^{\pi^*}_t(s)$.
- Clearly, $V^\pi_t(s) = \int_a Q^\pi_t(s, a)\, \pi_t(a|s)\, da$. Both quantities can be approximated by simulation, as sketched below.
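When a simulator of the process is available, both value functions can be approximated by Monte Carlo rollouts. Below is a minimal sketch; the samplers and the reward function are toy stand-ins, not part of the slides.

```python
import random

# Toy stand-ins for pi_t(.|s), p_t(.|s, a), and r(s, a, s').
def sample_action(t, s):
    return random.choice([-1.0, 0.0, 1.0])

def sample_transition(t, s, a):
    return s + a + random.gauss(0.0, 0.1)

def reward_fn(s, a, s_next):
    return -abs(s_next)  # toy reward: stay near zero

def mc_state_action_value(s, a, t, T, gamma, n_rollouts=1000):
    """Monte Carlo estimate of Q_t^pi(s, a): average the discounted return over
    rollouts that start at S_t = s with A_t = a and follow pi afterwards."""
    total = 0.0
    for _ in range(n_rollouts):
        state, action, ret = s, a, 0.0
        for j in range(t, T + 1):
            nxt = sample_transition(j, state, action)
            ret += gamma ** (j - t) * reward_fn(state, action, nxt)
            state = nxt
            if j < T:
                action = sample_action(j + 1, state)  # follow pi after time t
        total += ret
    return total / n_rollouts

print(mc_state_action_value(s=0.0, a=1.0, t=1, T=10, gamma=0.9))
```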
RL methods
- Reinforcement learning methods fall mostly into two groups:
  - (policy iteration) model-based or learning methods to approximate the SAV;
  - (policy search) model-based or learning methods to directly maximize the SV for estimating $\pi^*$.
Policy iteration for value function approximation
- The Bellman equation for the SV:
$$V^\pi_t(s) = E_\pi\Big[r_t(s, A_t, S_{t+1}) + \gamma V^\pi_{t+1}(S_{t+1}) \,\Big|\, S_t = s\Big] = \int_{s'}\!\int_a \big[r_t(s, a, s') + \gamma V^\pi_{t+1}(s')\big]\, \pi_t(a|s)\, p_t(s'|s, a)\, da\, ds'.$$
- The Bellman equation for the SAV (a tabular version of this backup is sketched below):
$$Q^\pi_t(s, a) = E_\pi\Big[r_t(s, a, S_{t+1}) + \gamma Q^\pi_{t+1}(S_{t+1}, A_{t+1}) \,\Big|\, S_t = s, A_t = a\Big] = \int_{s'}\!\int_{a'} \big[r_t(s, a, s') + \gamma Q^\pi_{t+1}(s', a')\big]\, \pi_{t+1}(a'|s')\, p_t(s'|s, a)\, da'\, ds'.$$
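For a finite (tabular) state and action space, one Bellman backup for the SAV translates the double integral into sums; the arrays `P`, `R`, and `pi_next` below are hypothetical model inputs, indexed as described in the docstring.

```python
import numpy as np

def bellman_backup_q(Q_next, P, R, pi_next, gamma):
    """One SAV Bellman backup for a tabular model.

    Q_next[s', a']  : Q^pi_{t+1}(s', a')
    P[s, a, s']     : transition probabilities p_t(s'|s, a)
    R[s, a, s']     : rewards r_t(s, a, s')
    pi_next[s', a'] : policy pi_{t+1}(a'|s')
    Returns Q[s, a] = sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * sum_{a'} pi_next[s',a'] * Q_next[s',a']).
    """
    v_next = (pi_next * Q_next).sum(axis=1)   # sum_{a'} pi_{t+1}(a'|s') Q^pi_{t+1}(s', a')
    return (P * (R + gamma * v_next[None, None, :])).sum(axis=2)
```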
Value function learning for finite horizon
- For finite T, the Bellman equations suggest a backward procedure to evaluate the value function associated with a particular policy:
  - start from time T, where we can learn $Q^\pi_T(s, a) = E[R_T | S_T = s, A_T = a]$;
  - at time T − 1, learn $Q^\pi_{T-1}(s, a)$ as
$$E\Big[R_{T-1} + \gamma Q^\pi_T(S_T, A_T) \,\Big|\, S_{T-1} = s, A_{T-1} = a\Big];$$
  - ...
  - perform the learning backwards until time 1.
Optimal policy learning for finite horizon (Q-learning)
- Start from time T, where we can learn $Q^\pi_T(s, a) = E[R_T | S_T = s, A_T = a]$. We take $\pi^*_T(s)$ to put probability 1 at $a^* = \arg\max_a Q^\pi_T(s, a)$.
- At time T − 1, we learn $Q^{\pi^*}_{T-1}(s, a)$ as
$$E\Big[R_{T-1} + \gamma \max_{a'} Q^{\pi^*}_T(S_T, a') \,\Big|\, S_{T-1} = s, A_{T-1} = a\Big].$$
We obtain $\pi^*_{T-1}$ as the policy that puts probability 1 at $a^* = \arg\max_a Q^{\pi^*}_{T-1}(s, a)$.
- We perform the same learning procedure backwards until time 1 to learn all the optimal policies, as in the sketch below.
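A minimal tabular sketch of this backward induction, assuming the model (transition arrays `P[t]` and reward arrays `R[t]`) is known; in practice these conditional expectations would be learned from data.

```python
import numpy as np

def backward_q_learning(P, R, T, gamma):
    """Finite-horizon backward induction (tabular Q-learning with a known model).

    P[t][s, a, s'] : transition probabilities p_t(s'|s, a), t = 0, ..., T-1
    R[t][s, a, s'] : rewards r_t(s, a, s')
    Returns (Q, policy), where policy[t][s] = argmax_a Q[t][s, a] is pi*_t.
    """
    n_states, n_actions, _ = P[0].shape
    Q = [np.zeros((n_states, n_actions)) for _ in range(T)]
    policy = [np.zeros(n_states, dtype=int) for _ in range(T)]
    v_next = np.zeros(n_states)               # value beyond the horizon is zero
    for t in reversed(range(T)):
        Q[t] = (P[t] * (R[t] + gamma * v_next[None, None, :])).sum(axis=2)
        policy[t] = Q[t].argmax(axis=1)        # deterministic pi*_t
        v_next = Q[t].max(axis=1)              # max_{a'} Q*_t(s, a') feeds step t-1
    return Q, policy
```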
Value function learning for infinite horizon
- When $T = \infty$ or T is large, the backward Q-learning method may not be applicable.
- The remedy is to take advantage of process stability when t is large, so we can assume the following Markov decision process (MDP):
  - the MDP assumes that the state and action spaces are constant over time;
  - the MDP assumes $p_t(s'|s, a)$ is independent of t;
  - the reward function $r_t(s, a, s')$ is independent of t.
- The MDP assumption is plausible for a long horizon and after a certain number of steps.
Bellman equations under MDP (T = ∞)
- Under the MDP, $Q^\pi_t(s, a) = Q^\pi(s, a)$ and $V^\pi_t(s) = V^\pi(s)$.
- The Bellman equations become
$$V^\pi(s) = E_\pi\big[r(s, A_t, S_{t+1}) + \gamma V^\pi(S_{t+1}) \,\big|\, S_t = s\big],$$
$$Q^\pi(s, a) = E_\pi\big[r(s, a, S_{t+1}) + \gamma Q^\pi(S_{t+1}, A_{t+1}) \,\big|\, S_t = s, A_t = a\big].$$
- Derived equations for the optimal policy (solvable by successive approximation, as sketched below):
$$V^{\pi^*}(s) = \max_a Q^{\pi^*}(s, a),$$
$$Q^{\pi^*}(s, a) = E_{\pi^*}\big[r(s, a, S_{t+1}) + \gamma V^{\pi^*}(S_{t+1}) \,\big|\, S_t = s, A_t = a\big],$$
$$\pi^*(s) \sim I\big\{a = \arg\max\nolimits_a Q^{\pi^*}(s, a)\big\}.$$
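In the tabular case these fixed-point equations can be solved by successive approximation (value iteration); a minimal sketch, again assuming hypothetical model arrays `P` and `R`.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10000):
    """Iterate Q(s,a) <- sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * max_{a'} Q(s',a'))
    until convergence; gamma < 1 makes this update a contraction."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        v = Q.max(axis=1)                       # V*(s') = max_{a'} Q*(s', a')
        Q_new = (P * (R + gamma * v[None, None, :])).sum(axis=2)
        if np.abs(Q_new - Q).max() < tol:
            break
        Q = Q_new
    return Q_new
```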
Policy iteration procedure
- Start from a policy π.
- Policy evaluation: evaluate $Q^\pi(s, a)$ and thus $V^\pi(s)$.
- Policy improvement: update $\pi(a|s)$ to be $I(a = a_\pi(s))$, where $a_\pi(s)$ is the action maximizing $Q^\pi(s, a)$.
- Iterate between the policy evaluation and policy improvement steps; see the sketch below.
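A tabular sketch of the full loop, assuming a known model (`P`, `R`) so that policy evaluation can be done exactly by solving a linear system.

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iter=1000):
    """Tabular policy iteration: exact evaluation + greedy improvement.

    P[s, a, s'] : transition probabilities; R[s, a, s'] : rewards r(s, a, s').
    """
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)       # start from some policy pi
    expected_r = (P * R).sum(axis=2)             # E[r(s, a, S')]
    for _ in range(max_iter):
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi.
        idx = np.arange(n_states)
        P_pi, r_pi = P[idx, policy], expected_r[idx, policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: greedy with respect to Q^pi.
        Q = expected_r + gamma * (P * V[None, None, :]).sum(axis=2)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # converged to a fixed point
            break
        policy = new_policy
    return policy, V
```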
Soft policy iteration procedure
- Selecting a deterministic policy update may be too greedy if the initial policy is far from the optimal one.
- Softer policy updates include (see the sketch below):
  - $\pi(a|s) \propto \exp\{Q^\pi(s, a)/\tau\}$;
  - (ε-greedy policy improvement) $\pi(a|s)$ puts probability $(1 - \varepsilon + \varepsilon/m)$ on $a = a_\pi(s)$ and probability $\varepsilon/m$ on each other action, where m is the number of possible actions.
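Both updates are easy to write down given the current Q-values at a state; a minimal sketch with invented Q-values.

```python
import numpy as np

def softmax_policy(q_values, tau):
    """pi(a|s) proportional to exp(Q(s,a)/tau); tau -> 0 recovers the greedy policy."""
    z = (q_values - q_values.max()) / tau   # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def epsilon_greedy_policy(q_values, eps):
    """Probability (1 - eps + eps/m) on the greedy action and eps/m elsewhere."""
    m = len(q_values)
    p = np.full(m, eps / m)
    p[q_values.argmax()] += 1.0 - eps
    return p

q = np.array([1.0, 2.0, 0.5])               # invented Q(s, .) for one state
print(softmax_policy(q, tau=0.5))
print(epsilon_greedy_policy(q, eps=0.1))
```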
Estimation of the state-action value function
- One challenge in policy iteration is how to estimate $Q^\pi(s, a)$.
- This requires statistical modelling or learning algorithms.
- Parametric/semiparametric models for $Q^\pi(s, a)$ are commonly used.
Least-squares policy iteration
- We assume
$$Q^\pi(s, a) = \sum_{b=1}^{B} \theta_b\, \phi_b(s, a),$$
where the $\phi_b(s, a)$ are a sequence of basis functions.
- In other words, the policy is indirectly represented by the $\theta_b$'s.
- From the Bellman equation, we note that
$$R_t - \theta^T \psi(S_t, A_t), \quad R_t = r(S_t, A_t, S_{t+1}),$$
has mean zero given $(S_t, A_t)$ under policy π, since $Q^\pi(S_t, A_t) - \gamma E_\pi[Q^\pi(S_{t+1}, A_{t+1}) | S_t, A_t] = \theta^T \psi(S_t, A_t)$, where $\psi(s, a) = \phi(s, a) - \gamma E_\pi[\phi(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]$.
Numerical implementation of least-squares policy iteration
- Suppose we have data from n subjects, each with a training sample of T steps (or n training T-step samples from the same agent):
$$(S_{i1}, A_{i1}, S_{i2}, \ldots, S_{iT}, A_{iT}, S_{i,T+1}).$$
- We estimate $\psi_b(s, a)$ by
$$\hat\psi_b(s, a) = \phi_b(s, a) - \gamma\, \frac{\sum_{i=1}^n \sum_{t=1}^T I(S_{it} = s, A_{it} = a)\, E_\pi[\phi_b(S_{i,t+1}, A_{i,t+1})]}{\sum_{i=1}^n \sum_{t=1}^T I(S_{it} = s, A_{it} = a)}.$$
- We perform the least-squares estimation (see the sketch below)
$$\min_\theta \frac{1}{nT} \sum_{i=1}^n \sum_{t=1}^T I(A_{it} | S_{it} \sim \pi)\, \big[\theta^T \psi(S_{it}, A_{it}) - R_{it}\big]^2,$$
where $A_{it} | S_{it} \sim \pi$ means that the data for $A_{it}$ were obtained by following the policy π.
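A minimal sketch of the least-squares step, assuming the feature matrices have already been assembled from on-policy transitions; `phi_sa` and `phi_next` are hypothetical inputs holding $\phi(S_{it}, A_{it})$ and $E_\pi[\phi(S_{i,t+1}, A_{i,t+1})]$ row by row.

```python
import numpy as np

def lspi_theta(phi_sa, phi_next, rewards, gamma):
    """Least-squares fit of theta in Q^pi(s, a) = theta^T phi(s, a).

    phi_sa   : (N, B) array of basis features phi(S, A) for N on-policy transitions
    phi_next : (N, B) array of E_pi[phi(S', A')] for the corresponding next steps
    rewards  : (N,) array of observed immediate rewards R
    Solves min_theta sum_k (theta^T psi_k - R_k)^2 with psi = phi_sa - gamma * phi_next.
    """
    psi = phi_sa - gamma * phi_next
    theta, *_ = np.linalg.lstsq(psi, rewards, rcond=None)
    return theta
```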
More on numerical implementation
- Regularization may be introduced to obtain a sparser solution.
- The L2-minimization can be replaced by L1-minimization to gain robustness.
- Choice of basis functions: radial basis functions, where the kernel can be the usual Gaussian kernel (one possible definition of $d(s, s')$ is the shortest path from s to s' in the graph defined by the transition probabilities).
Robot-Arm control example
Robot-Arm control example: continue
Off-policy estimation
- In the previous derivation, we essentially estimate
$$E_\pi\Big[\sum_{t=1}^{T} \big(\theta^T \psi(S_t, A_t) - R_t\big)^2\Big]$$
using history samples $(S_t, A_t)$ collected while following the target policy π.
- This is called on-policy reinforcement learning.
- However, the target policy need not have been followed in the history sample.
- An alternative method is importance sampling: if the history sample was generated under a behavior policy $\tilde\pi$, then
$$E_\pi\Big[\sum_{t=1}^{T} \big(\theta^T \psi(S_t, A_t) - R_t\big)^2\Big] = E_{\tilde\pi}\Big[\sum_{t=1}^{T} \big(\theta^T \psi(S_t, A_t) - R_t\big)^2\, w_t\Big],$$
where
$$w_t = \frac{\prod_{j=1}^{t} \pi(A_j | S_j)}{\prod_{j=1}^{t} \tilde\pi(A_j | S_j)}.$$
Here $\tilde\pi$ denotes the behavior policy; the weights are computed cumulatively, as in the sketch below.
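The cumulative weights are a running product of policy-probability ratios; a minimal sketch with invented probabilities.

```python
import numpy as np

def importance_weights(target_probs, behavior_probs):
    """w_t = prod_{j<=t} pi(A_j|S_j) / pi_tilde(A_j|S_j), for t = 1, ..., T.

    target_probs[j]   : pi(A_j | S_j) under the target policy
    behavior_probs[j] : probability of the same action under the behavior policy
    """
    ratios = np.asarray(target_probs) / np.asarray(behavior_probs)
    return np.cumprod(ratios)

# Invented probabilities for a 4-step trajectory.
print(importance_weights([0.9, 0.8, 0.9, 0.7], [0.5, 0.5, 0.5, 0.5]))
```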
Off-policy iteration: more
- We need one assumption: the behavior policy $\tilde\pi$ in the history sample satisfies
$$\tilde\pi(a|s) > 0, \quad \forall (a, s).$$
- Adaptive importance weighting replaces $w_t$ by $w_t^\nu$ and chooses ν via cross-validation.
- When the history sample contains multiple behavior policies, we can obtain an estimate from importance weighting with respect to each policy and aggregate the estimates (sample-reuse policy iteration).
Mountain car example
- Action space: force applied to the car, $a \in \{0.2, -0.2, 0\}$.
- State space: $(x, \dot x)$, where x is the horizontal position ($\in [-1.2, 0.5]$) and $\dot x$ is the velocity ($\in [-1.5, 1.5]$).
- Transition (simulated in the sketch below):
$$x_{t+1} = x_t + \dot x_{t+1}\, \delta t, \qquad \dot x_{t+1} = \dot x_t + \big(-9.8\, w \cos(3 x_t) + a_t / w - k \dot x_t\big)\, \delta t,$$
where w is the mass (0.2 kg), k is the friction coefficient (0.3), and δt is 0.1 second.
- Reward:
$$r(s, a, s') = \begin{cases} 1 & x_{s'} \ge 0.5, \\ -0.01 & \text{otherwise.} \end{cases}$$
- Policy iteration uses kernels with centers at $\{-1.2, 0.35, 0.5\} \times \{-1.5, -0.5, 0.5, 1.5\}$ and $\sigma = 1$.
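A direct simulation of these dynamics under a random policy; clipping the state to its stated ranges is an assumption added here, not something specified on the slide.

```python
import math
import random

W, K, DT = 0.2, 0.3, 0.1            # mass (kg), friction coefficient, time step (s)
ACTIONS = [0.2, -0.2, 0.0]          # force applied to the car

def step(x, v, a):
    """One transition (x_t, v_t, a_t) -> (x_{t+1}, v_{t+1}); clipping is an added assumption."""
    v_next = v + (-9.8 * W * math.cos(3 * x) + a / W - K * v) * DT
    v_next = max(-1.5, min(1.5, v_next))
    x_next = max(-1.2, min(0.5, x + v_next * DT))
    return x_next, v_next

def reward(x_next):
    return 1.0 if x_next >= 0.5 else -0.01

# Roll out a random policy from the valley.
x, v = -0.5, 0.0
for t in range(1, 1001):
    x, v = step(x, v, random.choice(ACTIONS))
    if x >= 0.5:
        print(f"reached the goal at step {t}")
        break
```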
Experiment results
Direct Policy Search
- The direct policy search approach aims to find the policy maximizing the expected return.
- Suppose we model the policy as $\pi(a|s; \theta)$.
- The expected return under π is given by
$$J(\theta) = \int p(s_1) \prod_{t=1}^{T} p(s_{t+1} | s_t, a_t)\, \pi(a_t | s_t; \theta) \left\{\sum_{t=1}^{T} \gamma^{t-1}\, r(s_t, a_t, s_{t+1})\right\} ds_1 \cdots ds_{T+1}\, da_1 \cdots da_T.$$
- We optimize J(θ) to find the optimal θ.
- A gradient approach can be adopted for the optimization; see the sketch below.
- EM-based policy search can be used for the optimization.
- Importance sampling can be used for evaluating J(θ).
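The standard gradient approach here is the likelihood-ratio (REINFORCE) estimator, which writes $\nabla_\theta J(\theta)$ as an expectation of the score function times the return; a minimal sketch, where `grad_log_pi` is a user-supplied hypothetical function for $\nabla_\theta \log \pi(a|s; \theta)$.

```python
import numpy as np

def reinforce_gradient(trajectories, grad_log_pi, gamma):
    """Score-function (REINFORCE) estimate of grad_theta J(theta):
    average over trajectories of (sum_t grad log pi(a_t|s_t; theta)) * (sum_t gamma^(t-1) r_t).

    trajectories : list of trajectories, each a list of (s_t, a_t, r_t) tuples
    grad_log_pi  : function (s, a) -> numpy array grad_theta log pi(a|s; theta)
    """
    grad = None
    for traj in trajectories:
        score = sum(grad_log_pi(s, a) for s, a, _ in traj)              # score function
        ret = sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))   # discounted return
        g = ret * score
        grad = g if grad is None else grad + g
    return grad / len(trajectories)
```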
Alternative methods
I Modelling transition probability functionsI Active policy iteration (active learning)
– update sampling policy actively