Transcript of "Verification of Agents learning through Reinforcement Learning: Air hockey as case-study"

Outline: Reinforcement Learning; Air hockey as case-study; Air hockey as RL task; Verification; Repair; Conclusion

Shashank Pathak¹,², Giorgio Metta¹,², Luca Pulina³, Armando Tacchella²

¹ Robotics, Brain and Cognitive Sciences (RBCS), Istituto Italiano di Tecnologia (IIT), Via Morego, 30 – 16163 Genova – Italy
[email protected] - [email protected]
² Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi (DIBRIS), Università degli Studi di Genova, Via Opera Pia, 13 – 16145 Genova – Italy
[email protected]
³ POLCOMING, Università degli Studi di Sassari, Viale Mancini 5 – 07100 Sassari – Italy

December 5, 2012
Figure: Robots: perception and reality
Some relevant features of Reinforcement Learning:

Figure: Reinforcement Learning

Learning is through experiences, i.e. tuples (S_t, A_t, R_t, S_next)
The objective is to attain a policy π(s_i) → A_i
The policy should maximize some measure of "rewards" R_i
Finite-window rewards

Assume a finite time-horizon T, i.e. t ∈ [0, T], and a discount factor γ ∈ [0, 1). The discounted return is

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ··· + γ^{T−t−1} r_T

and we define the Value of a state as the expected value of this discounted return, V^π(s) = E_π(R_t | s_t = s). A Monte-Carlo style update would be

V(s_t) ← V(s_t) + α (R_t − V(s_t))

Bootstrapping on the next state's value instead, we would have the temporal-difference update:

V(s_t) ← V(s_t) + α δ,   δ = r_{t+1} + γ V(s_{t+1}) − V(s_t)   (1)
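As a minimal sketch of update (1) in code, assuming a tabular value function over discrete states (the names are illustrative, not taken from the authors' simulator):

#include <vector>

// One TD(0) backup for a tabular value function, implementing update (1).
// V: value table indexed by discrete state id; alpha: learning rate;
// gamma: discount factor in [0, 1).
void tdUpdate(std::vector<double> &V, int s, int sNext,
              double reward, double alpha, double gamma) {
    double delta = reward + gamma * V[sNext] - V[s];  // the TD error δ
    V[s] += alpha * delta;
}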
Air hockey
Figure: Platform and simulator
Reasons for choosing air hockey

Air hockey is a challenging platform and has been used in the past to demonstrate learning
As a robotic setup, it has been included as one of the benchmarks for robotics & humanoids
Our previous work was performed on the real air hockey setup, with supervised learning
Simulator

For the current study, we chose a simulator instead of the real setup
Our goal was to demonstrate safety in a model-free learning approach, and ways to improve it
Sophisticated semi-supervised approaches would be needed to apply RL on the real setup
Showing the benefits of verification and repair is independent of these approaches
Simulation, or at least some logging, would be required even if the real setup were used
Simulator ...

The simulator was implemented in C++, using libraries such as OpenCV, Boost and Pantheios
For simplicity, no game engine was used; instead, 2D physics was implemented directly
Physical and geometric considerations were also taken into account
Extensive logging and a GUI-based parameter search were implemented
Learning Problem

Given: an air hockey platform and a robotic arm. Objective: to learn to defend the goal as well as possible
The action of the robotic arm was constrained to be a minimum-jerk trajectory (a sketch is given below)
Joint kinematics and safety were taken into account
State was defined in trajectory-space rather than Cartesian coordinates
Discrete states and discrete actions were considered
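For reference, a minimum-jerk point-to-point trajectory follows the standard fifth-order polynomial profile; a minimal sketch (function and variable names are illustrative, not taken from the authors' code):

#include <cmath>

// Minimum-jerk position profile from x0 to xf over duration T, following
// the standard quintic 10τ³ − 15τ⁴ + 6τ⁵ shape, which yields zero velocity
// and zero acceleration at both endpoints.
double minimumJerk(double x0, double xf, double t, double T) {
    double tau = t / T;  // normalized time in [0, 1]
    double s = 10 * std::pow(tau, 3) - 15 * std::pow(tau, 4) + 6 * std::pow(tau, 5);
    return x0 + (xf - x0) * s;
}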
Learn
Algorithm 1 Pseudo-code for learning to play Air hockey using Reinforcement Learning

Initialize Q ← 0; ∆t ← 20 ms
function Learn(Ne, Nb, Nr)
  for all i ∈ {1, . . . , Ne} do
    Send Start signal to Simulator
    j ← 1
    repeat
      Receive sj ← (pj, αj, θj) from Simulator
      ∆θj ← ComputePolicy(Q, sj)
      Send (∆θj, ∆t) to Simulator
      Receive sj+1 ← (pj+1, αj+1, θj+1) and fj+1 ← (m, g, w, r)
      rj+1 ← ComputeReward(sj+1, fj+1)
      Ej ← (sj, ∆θj, rj+1, sj+1, fj+1)
      Q ← Update(Q, Ej)
      j ← j + 1
      if (j = Nb) then
        for all k ∈ {1, . . . , Nr} do
          Choose random m ∈ {1, . . . , Nb}
          Q ← Update(Q, Em)
        end for
        j ← 1
      end if
    until r = TRUE
  end for
  return Q
end function
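A minimal C++ sketch of the experience-replay step of Algorithm 1, i.e. the inner loop that re-applies Update to Nr randomly chosen experiences from the batch of Nb. The Q-learning form of Update is our assumption; the slides leave it unspecified:

#include <cstdlib>
#include <vector>

struct Experience { int s, a, sNext; double r; };  // one tuple Ej

// Assumed tabular Q-learning backup, used both online and during replay.
void update(std::vector<std::vector<double>> &Q, const Experience &e,
            double alpha, double gamma) {
    double best = Q[e.sNext][0];
    for (double q : Q[e.sNext]) if (q > best) best = q;  // max over next actions
    Q[e.s][e.a] += alpha * (e.r + gamma * best - Q[e.s][e.a]);
}

// Replay: sample Nr experiences (with replacement) from the collected batch
// and apply the same update again, as in the inner for-loop of Algorithm 1.
void replay(std::vector<std::vector<double>> &Q,
            const std::vector<Experience> &batch, int Nr,
            double alpha, double gamma) {
    for (int k = 0; k < Nr; ++k)
        update(Q, batch[std::rand() % batch.size()], alpha, gamma);
}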
Verification of DTMC
The discrete state-action space allowed us to model the learned policy as a Discrete Time Markov Chain (DTMC)
The learnt policy π(s) → a was a softmax distribution over Q-values (sketched in code below),

π(s, a_i) = e^{κ Q(s,a_i)} / Σ_{a∈A} e^{κ Q(s,a)}    (2)

Next states were observed via simulation, and transition probabilities were adjusted empirically
We considered two approaches: unsafe states as failures, and unsafe states as faults
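A minimal sketch of the softmax distribution of equation (2), computed from the Q-values of one state (names are illustrative; κ plays the role of an inverse temperature):

#include <cmath>
#include <cstddef>
#include <vector>

// Softmax policy over the Q-values of a single state, as in equation (2):
// π(s, a_i) = exp(κ Q(s, a_i)) / Σ_a exp(κ Q(s, a)).
std::vector<double> softmaxPolicy(const std::vector<double> &Qs, double kappa) {
    std::vector<double> pi(Qs.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < Qs.size(); ++i) {
        pi[i] = std::exp(kappa * Qs[i]);
        sum += pi[i];
    }
    for (double &p : pi) p /= sum;  // normalize into a probability distribution
    return pi;
}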
Verification of DTMC: Unsafe states as failures
unsafe flag ⇒ halt
On practical setups, there is usually a layer of low-level control
Some approaches to address this: Lyapunov candidates, safety-conscious rewarding, etc.
For the sake of generality, yet effectiveness, we used a safety-conscious rewarding schema and avoided Lyapunov candidates
In our case, safety of the agent is the reachability probability of the unsafe states
Using this safety property, we used both PRISM and MRMC to get a quantitative measure of safety (illustrated below)
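For intuition, the quantity PRISM and MRMC compute here (the probability of eventually reaching an unsafe state in a DTMC) can be approximated by a simple fixed-point iteration over the transition matrix; a minimal illustrative sketch, not the tools' own algorithms:

#include <cstddef>
#include <vector>

// Approximate, for every state of a DTMC with transition matrix T, the
// probability of eventually reaching an unsafe state: unsafe states are
// fixed at 1, the others iterate p(s) ← Σ_t T[s][t] · p(t).
std::vector<double> unsafeReachability(const std::vector<std::vector<double>> &T,
                                       const std::vector<bool> &unsafe,
                                       int iterations = 1000) {
    std::size_t n = T.size();
    std::vector<double> p(n, 0.0);
    for (std::size_t s = 0; s < n; ++s) if (unsafe[s]) p[s] = 1.0;
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> next = p;
        for (std::size_t s = 0; s < n; ++s) {
            if (unsafe[s]) continue;  // unsafe states stay at probability 1
            double acc = 0.0;
            for (std::size_t t = 0; t < n; ++t) acc += T[s][t] * p[t];
            next[s] = acc;
        }
        p = next;
    }
    return p;  // converges to the reachability probabilities from below
}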
Repairing DTMC
Intuition: the badness of a state depends on its forward proximity to a bad state.
In general, changing Q-values in ways similar to eligibility traces would make the policy safer (a sketch follows below)
While this is more effective than incorporating safety while learning, it could deteriorate the learnt policy
Our experiments show that this need not be the case
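A minimal sketch of such an eligibility-trace-like repair, under the assumption that repair means penalizing the Q-values along a counterexample path, with the penalty decaying with backward distance from the unsafe state (the exact rule here is our illustration, not quoted from the slides):

#include <utility>
#include <vector>

// Penalize Q-values along a counterexample path that ends in an unsafe
// state. The penalty decays geometrically (factor lambda) the further a
// state-action pair lies from the unsafe end, mirroring an eligibility trace.
void repairPath(std::vector<std::vector<double>> &Q,
                const std::vector<std::pair<int, int>> &path,  // (state, action)
                double penalty, double lambda) {
    double trace = 1.0;
    for (auto it = path.rbegin(); it != path.rend(); ++it) {
        Q[it->first][it->second] -= penalty * trace;  // nearest states hit hardest
        trace *= lambda;
    }
}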
Repairing DTMC: Using COMICS
We used the tool COMICS to generate counterexamples
We then proceeded with repairing the paths
The overall algorithm was:

Algorithm 2 Pseudo-code for Verification and Repair of Learn

1: Given agent A, learning algorithm Learn, safety bound Pbound
2: Using A, perform Learn
3: Obtain policy π(s, a)
4: Construct a DTMC D from policy π(s, a)
5: Use MRMC or PRISM on D to obtain the probability Punsafe of violating P
6: repeat
7:   repeat
8:     Use COMICS to generate the set Sunsafe of paths negating P, with bound Punsafe
9:     Apply Repair on Sunsafe
10:   until Sunsafe = ∅
11:   Punsafe ← Punsafe − ε, ε ∈ (0, Punsafe − Pbound]
12: until Punsafe < Pbound
Results
[Figure: six plots over Nbr of episodes (0–10000), each with curves for Learn and Test; two panels use a 0–0.25 vertical scale, four a 0–1 scale]
Thanks to the audience and my colleagues¹! Questions or comments?

¹ Armando Tacchella, Giorgio Metta, & Luca Pulina