Verification of Agents learning through...


Slides: movep.lif.univ-mrs.fr/documents/pathak-slides.pdf

Outline: Reinforcement Learning, Air hockey as case-study, Air hockey as RL task, Verification, Repair, Conclusion

Verification of Agents learning through Reinforcement

Shashank Pathak (1,2), Giorgio Metta (1,2), Luca Pulina (3), Armando Tacchella (2)

(1) Robotics, Brain and Cognitive Sciences (RBCS), Istituto Italiano di Tecnologia (IIT), Via Morego, 30 – 16163 Genova – Italy

[email protected] - [email protected]

(2) Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi (DIBRIS), Università degli Studi di Genova, Via Opera Pia, 13 – 16145 Genova – Italy

[email protected]

(3) POLCOMING, Università degli Studi di Sassari, Viale Mancini 5 – 07100 Sassari – Italy

[email protected]

December 5, 2012

Shashank Pathak, Giorgio Metta, Luca Pulina, Armando Tacchella Verification of RL


Figure: Robots: perception and reality


Some relevant features:

Figure: Reinforcement Learning

Learning through experiences, i.e. tuples $(S_t, A_t, R_t, S_{next})$

The objective is to attain a policy $\pi(s_i) \to A_i$

In addition, the policy should maximize some measure of the "rewards" $R_i$


Finite-window rewards

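Assume a finite time horizon $T$ and a discount factor $\gamma \in [0, 1)$. The discounted return from time $t$ is

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$

We define the value of a state as the expected value of this discounted return: $V^\pi(s) = E_\pi(R_t \mid s_t = s)$

$V(s_t) \leftarrow V(s_t) + \alpha (R_t - V(s_t))$

Replacing the full return $R_t$ with the one-step estimate $r_{t+1} + \gamma V(s_{t+1})$ gives the update:

$V(s_t) \leftarrow V(s_t) + \alpha \delta, \quad \delta = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ (1)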
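The return and update (1) can be illustrated with a short Python sketch. This is a minimal illustration, not the authors' implementation: the function names, the dictionary-based value table and the numeric values are assumptions made for the example.

```python
def discounted_return(rewards, gamma):
    """Return R_t for rewards = [r_{t+1}, ..., r_T] and discount gamma."""
    total = 0.0
    # Accumulate from the last reward backwards: R = r + gamma * R_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def td0_update(V, s, r_next, s_next, alpha, gamma):
    """Apply update (1) in place to the value table V and return delta."""
    delta = r_next + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta


# Hypothetical two-state example.
V = {"a": 0.0, "b": 1.0}
delta = td0_update(V, "a", r_next=1.0, s_next="b", alpha=0.5, gamma=0.5)
```

The update only needs the next reward and the current estimate of the next state's value, which is why it can be applied online, one experience at a time.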


Air hockey

Figure: Platform and simulator


Reasons for choosing air hockey

Air hockey is a challenging platform and has been used in the past to demonstrate learning

As a robotic setup, it has been included as one of the benchmarks for robotics and humanoids

Our previous work was performed on a real air hockey setup with supervised learning


Simulator

For the current study, we chose a simulator instead of the real setup

Our goal was to demonstrate safety in a model-free learning approach and ways to improve it

Some sophisticated semi-supervised approaches are needed to apply RL on the real setup

Showing the benefits of verification and repair was independent of these approaches

Simulation, or at least some logging, would be required even if the real setup were used


Simulator ...

The simulator was implemented in C++ using libraries such as OpenCV, Boost and Pantheios

For simplicity no game engine was used; instead, 2D physics was implemented directly

Physical and geometric considerations were also taken into account

Extensive logging and a GUI-based parameter search were used


Learning Problem

Given: an air hockey platform and a robotic arm.

Objective: to learn to defend the goal as well as possible

The action of the robotic arm was constrained to be a minimum-jerk trajectory

Joint-kinematics and safety

The state was defined in trajectory space rather than in Cartesian coordinates

Discrete states and discrete actions were considered
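For readers unfamiliar with minimum-jerk motion, the sketch below shows the standard fifth-order polynomial profile for a point-to-point move that starts and ends at rest. It is an illustration only: the slides do not give the trajectory formula used on the arm, and the single joint angle, the function name and the parameter names are assumptions.

```python
def minimum_jerk(theta0, theta1, duration, t):
    """Joint angle at time t for a minimum-jerk move from theta0 to theta1.

    Assumes zero velocity and acceleration at both ends of the move.
    """
    # Normalised time, clamped to [0, 1] so queries outside the move are safe.
    tau = min(max(t / duration, 0.0), 1.0)
    shape = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
    return theta0 + (theta1 - theta0) * shape


# Hypothetical move of 1 rad over 1 s, sampled at the midpoint.
midpoint = minimum_jerk(0.0, 1.0, 1.0, 0.5)
```

Constraining actions to this family means an action can be described by a target displacement and a duration, which fits a discrete action set.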


Learn

Algorithm 1: Pseudo-code for learning to play air hockey using Reinforcement Learning

Initialize Q ← 0; Δt ← 20 ms
function Learn(N_e, N_b, N_r)
    for all i ∈ {1, ..., N_e} do
        Send Start signal to Simulator
        j ← 1
        repeat
            Receive s_j ← (p_j, α_j, θ_j) from Simulator
            Δθ_j ← ComputePolicy(Q, s_j)
            Send (Δθ_j, Δt) to Simulator
            Receive s_{j+1} ← (p_{j+1}, α_{j+1}, θ_{j+1}) and f_{j+1} ← (m, g, w, r)
            r_{j+1} ← ComputeReward(s_{j+1}, f_{j+1})
            E_j ← (s_j, Δθ_j, r_{j+1}, s_{j+1}, f_{j+1})
            Q ← Update(Q, E_j)
            j ← j + 1
            if (j = N_b) then
                for all k ∈ {1, ..., N_r} do
                    Choose random m ∈ {1, ..., N_b}
                    Q ← Update(Q, E_m)
                end for
                j ← 1
            end if
        until r = TRUE
    end for
    return Q
end function
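The structure of Learn (online updates plus periodic replay of a batch of stored experiences) can be sketched in Python. Everything concrete here is an assumption made for illustration: the toy corridor simulator stands in for the air hockey simulator, ComputePolicy is taken to be epsilon-greedy, and Update is taken to be a tabular Q-learning step, none of which the slides specify.

```python
import random
from collections import defaultdict

ACTIONS = (-1, 1)  # hypothetical discrete actions: move left or right


class ToySimulator:
    """Hypothetical stand-in for the simulator: a short 1-D corridor."""

    def __init__(self, size=5):
        self.size = size
        self.pos = 0

    def start(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos = max(0, min(self.size - 1, self.pos + action))
        done = self.pos == self.size - 1  # plays the role of the reset flag r
        return self.pos, (1.0 if done else 0.0), done


def compute_policy(Q, s, rng, eps=0.1):
    # Epsilon-greedy choice with random tie-breaking (an assumption).
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(s, a)] == best])


def update(Q, exp, alpha=0.5, gamma=0.9):
    # Tabular Q-learning step (an assumption; the slides only name Update).
    s, a, r, s_next, done = exp
    target = r if done else r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q


def learn(n_e, n_b, n_r, seed=0):
    rng = random.Random(seed)
    sim = ToySimulator()
    Q = defaultdict(float)
    for _ in range(n_e):
        s, done, batch = sim.start(), False, []
        while not done:
            a = compute_policy(Q, s, rng)
            s_next, r, done = sim.step(a)
            exp = (s, a, r, s_next, done)
            batch.append(exp)
            update(Q, exp)
            s = s_next
            if len(batch) == n_b:
                # Replay n_r randomly chosen experiences from the batch.
                for _ in range(n_r):
                    update(Q, rng.choice(batch))
                batch.clear()
    return Q


Q = learn(n_e=200, n_b=10, n_r=5)
```

Replaying stored experiences reuses each simulator interaction several times, which matters when interactions are expensive.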


Verification of DTMC

The discrete state-action space allowed us to model the learned policy as a Discrete Time Markov Chain (DTMC)

The learnt policy $\pi(s) \to a$ was a softmax distribution over Q-values,

$\pi(s, a_i) = \frac{e^{\kappa Q(s, a_i)}}{\sum_{a \in A} e^{\kappa Q(s, a)}}$ (2)

Next states were observed via simulation and the transition probabilities were estimated empirically

We considered two approaches: treating unsafe states as failures and treating them as faults
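The two ingredients of the DTMC, the softmax policy (2) and empirically estimated transition probabilities, can be sketched as follows. This is a minimal sketch under assumptions: the function names, the dictionary representation and the sample data are invented for illustration, and the estimate shown is a plain relative-frequency count.

```python
import math
from collections import Counter, defaultdict


def softmax_policy(q_values, kappa=1.0):
    """Map {action: Q(s, a)} to {action: pi(s, a)} as in equation (2)."""
    m = max(q_values.values())  # subtract the max for numerical stability
    weights = {a: math.exp(kappa * (q - m)) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def empirical_dtmc(transitions):
    """Estimate P(s, s') from observed (state, next_state) pairs."""
    counts = defaultdict(Counter)
    for s, s_next in transitions:
        counts[s][s_next] += 1
    return {
        s: {t: n / sum(c.values()) for t, n in c.items()}
        for s, c in counts.items()
    }


# Hypothetical data: two equally valued actions and four logged transitions.
pi = softmax_policy({"left": 0.0, "right": 0.0})
P = empirical_dtmc([("a", "b"), ("a", "b"), ("a", "c"), ("b", "a")])
```

A larger $\kappa$ concentrates probability on the highest-valued action, while $\kappa = 0$ gives a uniform policy.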


Verification of DTMC: Unsafe states as failures

unsafe flag ⟹ halt

In practical setups there is usually some low-level control

Some approaches to address this: Lyapunov candidates, safety-conscious rewarding, etc.

For the sake of generality and effectiveness, we used a safety-conscious rewarding scheme and avoided Lyapunov candidates

In our case, the safety of the agent is the probability of reaching an unsafe state

Using this safety property, we used both PRISM and MRMC to get a quantitative measure of safety
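To make the safety measure concrete, the sketch below computes the probability of eventually reaching an unsafe state in a small DTMC by fixed-point iteration. It is not how PRISM or MRMC work internally and not the authors' model: the three-state chain, the state names and the function name are assumptions. In PRISM, a query of this kind is typically written as `P=? [ F "unsafe" ]`, where the label name "unsafe" is likewise assumed.

```python
def reach_probability(P, unsafe, tol=1e-12, max_iter=100_000):
    """Probability of eventually reaching a state in `unsafe` from each state.

    P maps state -> {next_state: probability}. Iterating from zero converges
    to the least fixed point of x_s = sum_t P(s, t) * x_t with x_s = 1 on
    unsafe states, which is the reachability probability.
    """
    states = set(P) | {t for row in P.values() for t in row}
    x = {s: (1.0 if s in unsafe else 0.0) for s in states}
    for _ in range(max_iter):
        delta = 0.0
        for s in states:
            if s in unsafe:
                continue
            new = sum(p * x[t] for t, p in P.get(s, {}).items())
            delta = max(delta, abs(new - x[s]))
            x[s] = new
        if delta < tol:
            break
    return x


# Hypothetical chain: from "start" the agent loops, ends safely, or fails.
P = {
    "start": {"start": 0.5, "goal": 0.4, "unsafe": 0.1},
    "goal": {"goal": 1.0},
    "unsafe": {"unsafe": 1.0},
}
probs = reach_probability(P, unsafe={"unsafe"})
```

For this chain the value at "start" solves x = 0.5x + 0.1, so it is 0.2, which is what the iteration approaches.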


Repairing DTMC

Intuition: the badness of a state depends on its forward proximity to a bad state.

In general, changing Q-values in a way similar to eligibility traces would make the policy safer

While this is more effective than incorporating safety during learning, it could deteriorate the learnt policy

Our experiments show that this need not be the case
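One way to read the intuition above is as a penalty that decays with distance from the unsafe state, much like an eligibility trace decays with time. The sketch below is a hypothetical illustration of that reading, not the authors' Repair procedure: the function name, the `penalty` and `decay` parameters and the example path are all assumptions.

```python
def repair_path(Q, path, penalty=1.0, decay=0.5):
    """Lower Q-values along a path that ends by entering an unsafe state.

    `path` is a list of (state, action) pairs ordered from the start of the
    path to the step that enters the unsafe state. The last step receives
    the full penalty; earlier steps receive geometrically smaller ones.
    """
    for dist, (s, a) in enumerate(reversed(path)):
        Q[(s, a)] = Q.get((s, a), 0.0) - penalty * decay**dist
    return Q


# Hypothetical two-step path leading into an unsafe state.
Q = {("s0", "a"): 1.0, ("s1", "b"): 1.0}
repair_path(Q, [("s0", "a"), ("s1", "b")])
```

Under a softmax policy, lowering a Q-value lowers the probability of choosing that action, so the probability mass on the offending path shrinks while other actions keep their relative ordering.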


Repairing DTMC: Using COMICS

We used the tool COMICS to generate the counterexample

We then proceeded with repairing the paths

The overall algorithm was:

Algorithm 2: Pseudo-code for Verification and Repair of Learn

Given agent A, learning algorithm Learn, safety bound P_bound
Using A perform Learn
Obtain policy π(s, a)
Construct a DTMC D from policy π(s, a)
Use MRMC or PRISM on D to obtain P_unsafe of violating P
repeat
    repeat
        Use COMICS to generate set S_unsafe negating P with bound P_unsafe
        Apply Repair on S_unsafe
    until S_unsafe = ∅
    P_unsafe ← P_unsafe − ε, ε ∈ (0, P_unsafe − P_bound]
until P_unsafe < P_bound


Results

Figure: six plots against the number of episodes (0 to 10000), each with a Learn curve and a Test curve. Two plots have a vertical range of 0 to 0.25 and four have a vertical range of 0 to 1.


Thanks to the audience and my colleagues (1)!

Questions or comments?

(1) Armando Tacchella, Giorgio Metta and Luca Pulina