Imitation Learning from Imperfect Demonstration · Imitation Learning from Imperfect Demonstration...
Transcript of Imitation Learning from Imperfect Demonstration · Imitation Learning from Imperfect Demonstration...
Imitation Learning from Imperfect Demonstration
Yueh-Hua Wu1,2, Nontawat Charoenphakdee3,2, Han Bao3,2,Voot Tangkaratt2, Masashi Sugiyama2,3
1National Taiwan University
2RIKEN Center for Advanced Intelligence Project
3The University of Tokyo
Poster #47
Yueh-Hua Wu et al. Imitation Learning from Imperfect Demonstration Poster #47 1 / 12
Introduction
Imitation learning
learning from demonstration instead of a reward function
Demonstration
a set of decision makings (state-action pairs x)
Collected demonstration may be imperfectDriving: traffic violationPlaying basketball: technical foul
Yueh-Hua Wu et al. Imitation Learning from Imperfect Demonstration Poster #47 2 / 12
Motivation
Confidence: how optimal is state-action pair x (between 0 and 1)
A semi-supervised setting: demonstration partially equipped with confidence
How?
crowdsourcing: N(1)/(N(1) + N(0)).digitized score: 0.0, 0.1, 0.2, . . . , 1.0
Yueh-Hua Wu et al. Imitation Learning from Imperfect Demonstration Poster #47 3 / 12
Generative Adversarial Imitation Learning [1]
One-to-one correspondence between policy π and distribution of demonstration [2]
Utilize generative adversarial training
minθ
maxw
Ex∼pθ [logDw (x)] + Ex∼popt [log(1− Dw (x))]
Dw : discriminator, popt: demonstration distribution of πopt, and pθ: trajectorydistribution of agent πθ
Yueh-Hua Wu et al. Imitation Learning from Imperfect Demonstration Poster #47 4 / 12
Problem Setting
Human switches to non-optimal policies when they make mistakes or are distracted
p(x) = α p(x |y = +1)︸ ︷︷ ︸popt(x)
+(1− α) p(x |y = −1)︸ ︷︷ ︸pnon(x)
Confidence: r(x) , Pr(y = +1|x)
Unlabeled demonstration: {xi}nui=1 ∼ p
Demonstration with confidence: {(xj , rj)}ncj=1 ∼ q
Yueh-Hua Wu et al. Imitation Learning from Imperfect Demonstration Poster #47 5 / 12
Proposed Method 1: Two-Step Importance Weighting Imitation Learning
Step 1: estimate confidence by learning a confidence scoring function g
Unbiased risk estimator (come to Poster #47 for details):
RSC,`(g) = Ex ,r∼q[r · (`(g(x)))]︸ ︷︷ ︸Risk for optimal
+Ex ,r∼q[(1− r)`(−g(x))]︸ ︷︷ ︸Risk for non-optimal
Theorem
For δ ∈ (0, 1), with probability at least 1− δ over repeated sampling of data for training g ,
RSC,`(g)− RSC,`(g∗) = Op( n
−1/2c︸ ︷︷ ︸
# of confidence
+ n−1/2u︸ ︷︷ ︸
# of unlabeled
)
Step 2: employ importance weighting to reweight GAIL objective
Importance weighting
minθ
maxw
Ex∼pθ [logDw (x)] + Ex∼p[r(x)
αlog(1− Dw (x))]
Yueh-Hua Wu et al. Imitation Learning from Imperfect Demonstration Poster #47 6 / 12
Proposed Method 2: GAIL with Imperfect Demonstration and Confidence
Mix the agent demonstration with the non-optimal one
p′ = αpθ + (1− α)pnon
Matching p′ with p enables pθ = popt and meanwhile benefits from the large amountof unlabeled data.
Objective:
V (θ,Dw ) = Ex∼p[log(1− Dw (x))]︸ ︷︷ ︸Risk for P class
+αEx∼pθ [logDw (x)] + Ex ,r∼q[(1− r) logDw (x)]︸ ︷︷ ︸Risk for N class
Yueh-Hua Wu et al. Imitation Learning from Imperfect Demonstration Poster #47 7 / 12
Setup
Confidence is given by a classifier trained with the demonstration mixture labeled as optimal(y = +1) and non-optimal (y = −1)
Yueh-Hua Wu et al. Imitation Learning from Imperfect Demonstration Poster #47 8 / 12
Results: Higher Average Return of the Proposed Methods
Environment: MujocoProportion of labeled data: 20%
Yueh-Hua Wu et al. Imitation Learning from Imperfect Demonstration Poster #47 9 / 12
Results: Unlabeled Data Helps
More unlabeled data results in lower variance and better performance
proposed methods are robust to noise
(a) Number of unlabeled data. The number in thelegend indicates proportion of orignal unlabeled data.
(b) Noise influence. The number in the legend indicatesstandard deviation of Gaussian noise.
Yueh-Hua Wu et al. Imitation Learning from Imperfect Demonstration Poster #47 10 / 12
Conclusion
Two approaches that utilize both unlabeled and confidence data are proposed
Our methods are robust to labelers with noise
The proposed approaches can be generalized to other IL and IRL methods
Poster #47
Yueh-Hua Wu et al. Imitation Learning from Imperfect Demonstration Poster #47 11 / 12
Reference
[1] Ho, Jonathan, and Stefano Ermon. ”Generative adversarial imitation learning.” Advancesin Neural Information Processing Systems. 2016.
[2] Syed, Umar, Michael Bowling, and Robert E. Schapire. ”Apprenticeship learning usinglinear programming.” Proceedings of the 25th international conference on Machinelearning. ACM, 2008.
Yueh-Hua Wu et al. Imitation Learning from Imperfect Demonstration Poster #47 12 / 12