Transcript of "Interactive Imitation Learning" (DAgger Analysis: a reduction to no-regret online learning)
-
Interactive Imitation Learning
Sham Kakade and Wen Sun CS 6789: Foundations of Reinforcement Learning
-
Announcements
Final report: NeurIPS format
Maximum 9 pages for the main text (not including references and appendix)
-
Recap
Offline IL and Hybrid Setting:
Ground truth reward r(s, a) ∈ [0,1] is unknown;
assume the expert π⋆ is a near-optimal policy.
We have a dataset 𝒟 = {(s⋆_i, a⋆_i)}_{i=1}^M ∼ d^{π⋆}
-
Recap
Maximum Entropy IRL formulation:
max_π entropy[ρ_π]
s.t. 𝔼_{s,a∼d^π} ϕ(s, a) = 𝔼_{s,a∼d^{π⋆}} ϕ(s, a)
We can rewrite it in the max-min formulation:
max_θ min_π [𝔼_{s,a∼d^π} θ⊤ϕ(s, a) − 𝔼_{s,a∼d^{π⋆}} θ⊤ϕ(s, a) + 𝔼_{s,a∼d^π} ln π(a|s)] := f(θ, π)
Algorithm: incremental update on the cost function θ, exact update on the policy π (soft value iteration):
θ_{t+1} := θ_t + η ∇_θ f(θ_t, π_t),   π_{t+1} = argmin_π f(θ_{t+1}, π)
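To make the recap concrete, here is a minimal tabular sketch of this alternating scheme on a small synthetic MDP (the random dynamics, feature map, step sizes, and helper names such as soft_value_iteration and feature_expectation are all illustrative assumptions, not the setup from the MaxEnt-IRL papers): gradient ascent on θ using the feature-expectation difference, and an exact entropy-regularized (soft VI) policy update for the inner min.

```python
import numpy as np
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(0)
S, A, d, gamma = 6, 3, 4, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))        # P[s, a, s']: transition kernel
phi = rng.normal(size=(S, A, d))                  # feature map φ(s, a)
mu0 = np.ones(S) / S                              # initial state distribution

def soft_value_iteration(cost, iters=300):
    """Exact inner min: entropy-regularized policy for cost c(s, a) (soft VI)."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = cost + gamma * P @ V                  # soft Bellman backup
        V = -logsumexp(-Q, axis=1)                # smoothed minimum over actions
    return softmax(-Q, axis=1)                    # π(a|s) ∝ exp(-Q(s, a))

def feature_expectation(pi, iters=300):
    """E_{s,a ~ d^π}[φ(s, a)] via the discounted state occupancy."""
    d_s = mu0.copy()
    P_pi = np.einsum('sa,sat->st', pi, P)         # state-to-state kernel under π
    for _ in range(iters):
        d_s = (1 - gamma) * mu0 + gamma * P_pi.T @ d_s
    return np.einsum('s,sa,sad->d', d_s, pi, phi)

# Pretend the expert's feature expectation was estimated from demonstrations.
expert_pi = softmax(rng.normal(size=(S, A)), axis=1)
mu_expert = feature_expectation(expert_pi)

theta, eta = np.zeros(d), 0.5
for t in range(100):
    pi = soft_value_iteration(phi @ theta)        # exact policy update for cost θ⊤φ
    theta += eta * (feature_expectation(pi) - mu_expert)   # ∇_θ f(θ, π) = μ_π − μ_π⋆
    if t % 20 == 0:
        gap = np.linalg.norm(feature_expectation(pi) - mu_expert)
        print(f"iter {t:3d}: feature-matching gap = {gap:.4f}")
```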
-
Today:
Interactive Imitation Learning Setting
Key assumption: we can query the expert π⋆ at any time and at any state during training.
(Recall that previously we only had an offline dataset 𝒟 = {(s⋆_i, a⋆_i)}_{i=1}^M ∼ d^{π⋆})
-
Recall the Main Problem from Behavior Cloning:
(Figure: the expert's trajectory vs. the learned policy's trajectory)
No training data of "recovery" behavior
-
Intuitive solution: Interaction
Use interaction to collect data where the learned policy goes
-
General Idea: Iterative Interactive Approach
(Diagram: Collect Data through Interaction → New Data → Update Policy → Updated Policy → repeat)
All DAgger slides credit: Drew Bagnell, Stephane Ross, Arun Venkatraman
-
DAgger: Dataset Aggregation, 0th iteration [Ross11a]
Expert demonstrates task → Dataset → Supervised Learning → 1st policy π1
-
DAgger: Dataset Aggregation, 1st iteration [Ross11a]
Execute π1 and query the expert → New Data (states from the learned policy, steering labels from the expert)
Aggregate Dataset (new data + all previous data) → Supervised Learning → New policy π2
-
DAgger: Dataset Aggregation, 2nd iteration [Ross11a]
Execute π2 and query the expert → New Data (steering from expert)
Aggregate Dataset (new data + all previous data) → Supervised Learning → New policy π3
-
DAgger: Dataset Aggregation, nth iteration [Ross11a]
Execute πn−1 and query the expert → New Data (steering from expert)
Aggregate Dataset (new data + all previous data) → Supervised Learning → New policy πn
-
Success!
[Ross AISTATS 2011]
-
Better
(Plot: Average Falls/Lap)
[Ross AISTATS 2011]
-
More fun than Video Games…
[Ross ICRA 2013]
-
Forms of the Interactive Experts
An interactive expert is expensive, especially when the expert is a human…
But the expert does not have to be human…
Example: high-speed off-road driving [Pan et al, RSS 18, Best System Paper]
Goal: learn a racing control policy that maps data from cheap on-board sensors (raw-pixel images) to low-level control (steering + throttle)
-
Forms of the Interactive Experts
Example: high-speed off-road driving [Pan et al, RSS 18, Best System Paper]
Their setup: at training time, we have expensive sensors for accurate state estimation, and we have the computational resources to run MPC (i.e., high-frequency replanning).
The MPC is the expert in this case!
-
Analysis of DAgger
First, let's do a quick introduction to online no-regret learning.
-
Online Learning
[Vovk92, Warmuth94, Freund97, Zinkevich03, Kalai05, Hazan06, Kakade08]
Learner vs. adversary, over a convex decision set 𝒳:
Learner picks a decision x_0
Adversary picks a loss ℓ_0 : 𝒳 → ℝ
Learner picks a new decision x_1
Adversary picks a loss ℓ_1 : 𝒳 → ℝ
…
Regret = ∑_{t=0}^{T−1} ℓ_t(x_t) − min_{x∈𝒳} ∑_{t=0}^{T−1} ℓ_t(x)
-
A no-regret algorithm: Follow-the-Leader (FTL)
At time step t, the learner has seen ℓ_0, …, ℓ_{t−1}; which new decision should she pick?
FTL: x_t = argmin_{x∈𝒳} ∑_{i=0}^{t−1} ℓ_i(x)
Theorem (FTL): if 𝒳 is convex and ℓ_t is strongly convex for all t, then the average regret of FTL satisfies:
(1/T) [∑_{t=0}^{T−1} ℓ_t(x_t) − min_{x∈𝒳} ∑_{t=0}^{T−1} ℓ_t(x)] = O(log(T)/T)
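As a sanity check on the theorem, here is a minimal numerical sketch. It assumes squared losses ℓ_t(x) = ½∥x − y_t∥², which are strongly convex, so FTL reduces to tracking the running mean of the targets; the data and loss are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim = 200, 5
targets = rng.normal(size=(T, dim))          # adversary's loss sequence, here fixed up front

def loss(x, y):
    return 0.5 * np.sum((x - y) ** 2)        # strongly convex per-round loss ℓ_t(x) = ½‖x − y_t‖²

decisions, x = [], np.zeros(dim)             # arbitrary first decision x_0
for t in range(T):
    decisions.append(x.copy())
    # FTL: next decision minimizes the sum of all past losses;
    # for squared losses this is just the running mean of the targets seen so far.
    x = targets[: t + 1].mean(axis=0)

best_fixed = targets.mean(axis=0)            # hindsight-optimal fixed decision
regret = sum(loss(decisions[t], targets[t]) - loss(best_fixed, targets[t]) for t in range(T))
print(f"average regret over T={T} rounds: {regret / T:.4f}")   # shrinks like O(log T / T)
```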
-
DAgger Revisit
(Diagram: execute the current policy and query the expert → New Data (steering from expert) → Aggregate Dataset with all previous data → Supervised Learning → New policy πn)
At iteration n:
L_n(π) = ∑_i ∥π(x_i) − y_i∥₂²   (supervised loss on the (state, expert action) pairs collected at iteration n)
Aggregated loss so far: ∑_{t=1}^{n} L_t(π)
Supervised learning on the aggregate dataset: new policy = argmin_π ∑_{t=1}^{n} L_t(π)
Data Aggregation = Follow-the-Leader Online Learner
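A minimal runnable sketch of this loop on a synthetic 1-D lane-keeping problem (the dynamics, the clipped-steering expert, and the linear least-squares policy are illustrative stand-ins, not the setup from [Ross11a]): roll out the current policy, query the expert on the visited states, aggregate, and refit, which is exactly FTL on the aggregated squared loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(s):
    # Hypothetical queryable expert: steer back toward the lane center (s = 0), with saturation.
    return -np.clip(0.8 * s, -0.3, 0.3)

def rollout(policy, horizon=50, noise=0.05, s0=1.0):
    # Execute a policy and record the states it visits (toy scalar dynamics).
    s, states = s0, []
    for _ in range(horizon):
        states.append(s)
        s = s + policy(s) + noise * rng.normal()
    return np.array(states)

# Iteration 0: behavior cloning on expert demonstrations.
D_s = rollout(expert_action)
D_a = expert_action(D_s)
w = np.sum(D_s * D_a) / np.sum(D_s ** 2)          # least-squares fit of a linear policy a = w·s

for n in range(1, 8):
    new_states = rollout(lambda s: w * s)         # execute the current policy π_n ...
    new_actions = expert_action(new_states)       # ... and query the expert on those states
    D_s = np.concatenate([D_s, new_states])       # aggregate with all previous data
    D_a = np.concatenate([D_a, new_actions])
    w = np.sum(D_s * D_a) / np.sum(D_s ** 2)      # FTL: argmin of the aggregated squared loss
    print(f"iteration {n}: fitted steering gain w = {w:.3f}")
```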
-
DAgger Analysis: A reduction to no-regret online learning
Setting: finite-horizon episodic MDP; assume a discrete action space.
Decision set: a restricted policy class Π := {π : S ↦ A} (π⋆ may not be inside Π)
Online learning loss at iteration t: ℓ_t(π) = 𝔼_{s∼d^{π_t}}[ℓ(π(s), π⋆(s))]
(Here ℓ could be any convex surrogate loss for classification, e.g., hinge loss)
DAgger is equivalent to FTL, i.e., π_{t+1} = argmin_{π∈Π} ∑_{i=0}^{t} ℓ_i(π)
If the online learning procedure ensures no regret, then
(1/T) [∑_{t=0}^{T−1} ℓ_t(π_t) − min_{π∈Π} ∑_{t=0}^{T−1} ℓ_t(π)] = o(T)/T → 0
-
DAgger Analysis: A reduction to no-regret online learning
Online learning loss at iteration t: ℓ_t(π) = 𝔼_{s∼d^{π_t}}[ℓ(π(s), π⋆(s))]
No regret: ∑_{t=0}^{T−1} ℓ_t(π_t) − min_{π∈Π} ∑_{t=0}^{T−1} ℓ_t(π) = o(T)
Equivalently, (1/T) ∑_{t=0}^{T−1} ℓ_t(π_t) = ϵ_avg-reg + ϵ_Π, where ϵ_avg-reg := o(T)/T and ϵ_Π := min_{π∈Π} (1/T) ∑_{t=0}^{T−1} ℓ_t(π)
Hence ∃ t̂ ∈ {0, …, T − 1} such that ℓ_t̂(π_t̂) ≤ ϵ_avg-reg + ϵ_Π
Under the assumption that the surrogate loss upper bounds the zero-one loss:
𝔼_{s∼d^{π_t̂}}[1{π_t̂(s) ≠ π⋆(s)}] ≤ 𝔼_{s∼d^{π_t̂}}[ℓ(π_t̂(s), π⋆(s))] ≤ ϵ_avg-reg + ϵ_Π
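The existence of a good iterate t̂ is just an averaging argument (the best round is no worse than the average); a short derivation, consistent with the definitions above:

```latex
\min_{0 \le t \le T-1} \ell_t(\pi_t)
  \;\le\; \frac{1}{T}\sum_{t=0}^{T-1}\ell_t(\pi_t)
  \;=\; \underbrace{\frac{1}{T}\Big[\sum_{t=0}^{T-1}\ell_t(\pi_t)-\min_{\pi\in\Pi}\sum_{t=0}^{T-1}\ell_t(\pi)\Big]}_{\epsilon_{\mathrm{avg\text{-}reg}}}
  \;+\; \underbrace{\min_{\pi\in\Pi}\frac{1}{T}\sum_{t=0}^{T-1}\ell_t(\pi)}_{\epsilon_{\Pi}},
\qquad\text{so taking } \hat{t}\in\arg\min_t \ell_t(\pi_t)
\text{ gives } \ell_{\hat{t}}(\pi_{\hat{t}}) \le \epsilon_{\mathrm{avg\text{-}reg}} + \epsilon_{\Pi}.
```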
-
DAgger Analysis: A reduction to no-regret online learning
𝔼_{s∼d^{π_t̂}}[1{π_t̂(s) ≠ π⋆(s)}] ≤ 𝔼_{s∼d^{π_t̂}}[ℓ(π_t̂(s), π⋆(s))] ≤ ϵ_avg-reg + ϵ_Π
i.e., π_t̂ can predict π⋆ well under its own state distribution.
Let's turn this into true performance under the cost function c(s, a):
V^{π_t̂} − V^{π⋆} = (1/(1 − γ)) 𝔼_{s∼d^{π_t̂}}[A⋆(s, π_t̂(s))]
 = (1/(1 − γ)) 𝔼_{s∼d^{π_t̂}}[A⋆(s, π_t̂(s)) − A⋆(s, π⋆(s))]
 ≤ (1/(1 − γ)) 𝔼_{s∼d^{π_t̂}}[1{π_t̂(s) ≠ π⋆(s)} · max_{s,a} A⋆(s, a)]
 ≤ (max_{s,a} A⋆(s, a) / (1 − γ)) · (ϵ_avg-reg + ϵ_Π)
Case study:
1. Worst case, A⋆(s, a) ≈ 1/(1 − γ) (not recoverable from a mistake): quadratic dependence on the horizon, i.e., no better than BC;
2. Good case, A⋆(s, a) ≈ o(1/(1 − γ)) (easily recoverable from a one-step mistake): better than BC.
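For completeness: the first equality above is the performance difference lemma (here stated with cost c and A⋆ := A^{π⋆}), and the second equality uses the fact that a policy has zero advantage with respect to itself on its own action. A reminder of the statement:

```latex
V^{\pi} - V^{\pi'} \;=\; \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi}}\big[A^{\pi'}(s,\pi(s))\big],
\qquad A^{\pi'}(s,a) := Q^{\pi'}(s,a) - V^{\pi'}(s),
\qquad \text{and in particular } A^{\pi^\star}\!\big(s,\pi^\star(s)\big) = 0 .
```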
-
Summary of Imitation Learning
1. Behavior Cloning (Maximum Likelihood Estimation):
Performance gap ≈ (1/(1 − γ)²) · (classification error)
2. Hybrid Distribution Matching (w/ IPM or MaxEnt-IRL):
Performance gap ≈ (1/(1 − γ)) · (classification error)
3. DAgger w/ Interactive Experts:
Performance gap ≈ (sup_{s,a} |A⋆(s, a)| / (1 − γ)) · (classification error)