Transcript of "Verification of Agents learning through Reinforcement Learning: Air hockey as case-study"

Outline: Reinforcement Learning; Air hockey as case-study; Air hockey as RL task; Verification; Repair; Conclusion

Shashank Pathak¹,², Giorgio Metta¹,², Luca Pulina³, Armando Tacchella²

¹ Robotics, Brain and Cognitive Sciences (RBCS), Istituto Italiano di Tecnologia (IIT), Via Morego, 30 – 16163 Genova – Italy
[email protected] - [email protected]
² Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi (DIBRIS), Università degli Studi di Genova, Via Opera Pia, 13 – 16145 Genova – Italy
[email protected]
³ POLCOMING, Università degli Studi di Sassari, Viale Mancini 5 – 07100 Sassari – Italy

December 5, 2012
Figure: Robots: perception and reality
Some relevant features of Reinforcement Learning:

Figure: Reinforcement Learning

Learning is through experiences, i.e. tuples (S_t, A_t, R_t, S_next)
The objective is to attain a policy π(s_i) → A_i
The policy should maximize some measure of "rewards" R_i
Finite-window rewards

Assume a finite time-horizon T, i.e. t ∈ [0, T], and a discount factor γ ∈ [0, 1). The discounted return is

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ··· + γ^{T−t−1} r_T

and we define the Value of a state as the expected value of this discounted return, V^π(s) = E_π(R_t | s_t = s). A Monte-Carlo style update would be

V(s_t) ← V(s_t) + α (R_t − V(s_t))

Bootstrapping on the next state's value instead, we would have the temporal-difference update:

V(s_t) ← V(s_t) + α δ,   δ = r_{t+1} + γ V(s_{t+1}) − V(s_t)   (1)
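As a minimal sketch of update (1) in code, assuming a tabular value function over discrete states (the names are illustrative, not taken from the authors' simulator):

#include <vector>

// One TD(0) backup for a tabular value function, implementing update (1).
// V: value table indexed by discrete state id; alpha: learning rate;
// gamma: discount factor in [0, 1).
void tdUpdate(std::vector<double> &V, int s, int sNext,
              double reward, double alpha, double gamma) {
    double delta = reward + gamma * V[sNext] - V[s];  // the TD error δ
    V[s] += alpha * delta;
}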
Air hockey
Figure: Platform and simulator
Reasons for choosing air hockey

Air hockey is a challenging platform and has been used in the past to demonstrate learning
As a robotic setup, it has been included as one of the benchmarks for robotics & humanoids
Our previous work was performed on the real air hockey setup, with supervised learning
Simulator

For the current study, we chose a simulator instead of the real setup
Our goal was to demonstrate safety in a model-free learning approach, and ways to improve it
Sophisticated semi-supervised approaches would be needed to apply RL on the real setup
Showing the benefits of verification and repair is independent of these approaches
Simulation, or at least some logging, would be required even if the real setup were used
Simulator ...

The simulator was implemented in C++, using libraries such as OpenCV, Boost and Pantheios
For simplicity, no game engine was used; instead, 2D physics was implemented directly
Physical and geometric considerations were also taken into account
Extensive logging and a GUI-based parameter search were implemented
Learning Problem

Given: an air hockey platform and a robotic arm. Objective: to learn to defend the goal as well as possible
The action of the robotic arm was constrained to be a minimum-jerk trajectory (a sketch is given below)
Joint kinematics and safety were taken into account
State was defined in trajectory-space rather than Cartesian coordinates
Discrete states and discrete actions were considered
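For reference, a minimum-jerk point-to-point trajectory follows the standard fifth-order polynomial profile; a minimal sketch (function and variable names are illustrative, not taken from the authors' code):

#include <cmath>

// Minimum-jerk position profile from x0 to xf over duration T, following
// the standard quintic 10τ³ − 15τ⁴ + 6τ⁵ shape, which yields zero velocity
// and zero acceleration at both endpoints.
double minimumJerk(double x0, double xf, double t, double T) {
    double tau = t / T;  // normalized time in [0, 1]
    double s = 10 * std::pow(tau, 3) - 15 * std::pow(tau, 4) + 6 * std::pow(tau, 5);
    return x0 + (xf - x0) * s;
}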
Learn
Algorithm 1 Pseudo-code for learning to play Air hockey using Reinforcement Learning

Initialize Q ← 0; ∆t ← 20 ms
function Learn(Ne, Nb, Nr)
  for all i ∈ {1, . . . , Ne} do
    Send Start signal to Simulator
    j ← 1
    repeat
      Receive sj ← (pj, αj, θj) from Simulator
      ∆θj ← ComputePolicy(Q, sj)
      Send (∆θj, ∆t) to Simulator
      Receive sj+1 ← (pj+1, αj+1, θj+1) and fj+1 ← (m, g, w, r)
      rj+1 ← ComputeReward(sj+1, fj+1)
      Ej ← (sj, ∆θj, rj+1, sj+1, fj+1)
      Q ← Update(Q, Ej)
      j ← j + 1
      if (j = Nb) then
        for all k ∈ {1, . . . , Nr} do
          Choose random m ∈ {1, . . . , Nb}
          Q ← Update(Q, Em)
        end for
        j ← 1
      end if
    until r = TRUE
  end for
  return Q
end function
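A minimal C++ sketch of the experience-replay step of Algorithm 1, i.e. the inner loop that re-applies Update to Nr randomly chosen experiences from the batch of Nb. The Q-learning form of Update is our assumption; the slides leave it unspecified:

#include <cstdlib>
#include <vector>

struct Experience { int s, a, sNext; double r; };  // one tuple Ej

// Assumed tabular Q-learning backup, used both online and during replay.
void update(std::vector<std::vector<double>> &Q, const Experience &e,
            double alpha, double gamma) {
    double best = Q[e.sNext][0];
    for (double q : Q[e.sNext]) if (q > best) best = q;  // max over next actions
    Q[e.s][e.a] += alpha * (e.r + gamma * best - Q[e.s][e.a]);
}

// Replay: sample Nr experiences (with replacement) from the collected batch
// and apply the same update again, as in the inner for-loop of Algorithm 1.
void replay(std::vector<std::vector<double>> &Q,
            const std::vector<Experience> &batch, int Nr,
            double alpha, double gamma) {
    for (int k = 0; k < Nr; ++k)
        update(Q, batch[std::rand() % batch.size()], alpha, gamma);
}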
Verification of DTMC
The discrete state-action space allowed us to model the learned policy as a Discrete Time Markov Chain (DTMC)
The learnt policy π(s) → a was a softmax distribution over Q-values (sketched in code below),

π(s, a_i) = e^{κ Q(s,a_i)} / Σ_{a∈A} e^{κ Q(s,a)}    (2)

Next states were observed via simulation, and transition probabilities were adjusted empirically
We considered two approaches: unsafe states as failures, and unsafe states as faults
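A minimal sketch of the softmax distribution of equation (2), computed from the Q-values of one state (names are illustrative; κ plays the role of an inverse temperature):

#include <cmath>
#include <cstddef>
#include <vector>

// Softmax policy over the Q-values of a single state, as in equation (2):
// π(s, a_i) = exp(κ Q(s, a_i)) / Σ_a exp(κ Q(s, a)).
std::vector<double> softmaxPolicy(const std::vector<double> &Qs, double kappa) {
    std::vector<double> pi(Qs.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < Qs.size(); ++i) {
        pi[i] = std::exp(kappa * Qs[i]);
        sum += pi[i];
    }
    for (double &p : pi) p /= sum;  // normalize into a probability distribution
    return pi;
}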
Verification of DTMC: Unsafe states as failures
unsafe flag ⇒ halt
On practical setups, there is usually a layer of low-level control
Some approaches to address this: Lyapunov candidates, safety-conscious rewarding, etc.
For the sake of generality, yet effectiveness, we used a safety-conscious rewarding schema and avoided Lyapunov candidates
In our case, safety of the agent is the reachability probability of the unsafe states
Using this safety property, we used both PRISM and MRMC to get a quantitative measure of safety (illustrated below)
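For intuition, the quantity PRISM and MRMC compute here (the probability of eventually reaching an unsafe state in a DTMC) can be approximated by a simple fixed-point iteration over the transition matrix; a minimal illustrative sketch, not the tools' own algorithms:

#include <cstddef>
#include <vector>

// Approximate, for every state of a DTMC with transition matrix T, the
// probability of eventually reaching an unsafe state: unsafe states are
// fixed at 1, the others iterate p(s) ← Σ_t T[s][t] · p(t).
std::vector<double> unsafeReachability(const std::vector<std::vector<double>> &T,
                                       const std::vector<bool> &unsafe,
                                       int iterations = 1000) {
    std::size_t n = T.size();
    std::vector<double> p(n, 0.0);
    for (std::size_t s = 0; s < n; ++s) if (unsafe[s]) p[s] = 1.0;
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> next = p;
        for (std::size_t s = 0; s < n; ++s) {
            if (unsafe[s]) continue;  // unsafe states stay at probability 1
            double acc = 0.0;
            for (std::size_t t = 0; t < n; ++t) acc += T[s][t] * p[t];
            next[s] = acc;
        }
        p = next;
    }
    return p;  // converges to the reachability probabilities from below
}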
Repairing DTMC
Intuition: the badness of a state depends on its forward proximity to a bad state.
In general, changing Q-values in ways similar to eligibility traces would make the policy safer (a sketch follows below)
While this is more effective than incorporating safety while learning, it could deteriorate the learnt policy
Our experiments show that this need not be the case
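A minimal sketch of such an eligibility-trace-like repair, under the assumption that repair means penalizing the Q-values along a counterexample path, with the penalty decaying with backward distance from the unsafe state (the exact rule here is our illustration, not quoted from the slides):

#include <utility>
#include <vector>

// Penalize Q-values along a counterexample path that ends in an unsafe
// state. The penalty decays geometrically (factor lambda) the further a
// state-action pair lies from the unsafe end, mirroring an eligibility trace.
void repairPath(std::vector<std::vector<double>> &Q,
                const std::vector<std::pair<int, int>> &path,  // (state, action)
                double penalty, double lambda) {
    double trace = 1.0;
    for (auto it = path.rbegin(); it != path.rend(); ++it) {
        Q[it->first][it->second] -= penalty * trace;  // nearest states hit hardest
        trace *= lambda;
    }
}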
Repairing DTMC: Using COMICS
We used the tool COMICS to generate counterexamples
We then proceeded with repairing the paths
The overall algorithm was:

Algorithm 2 Pseudo-code for Verification and Repair of Learn

1: Given agent A, learning algorithm Learn, safety bound Pbound
2: Using A, perform Learn
3: Obtain policy π(s, a)
4: Construct a DTMC D from policy π(s, a)
5: Use MRMC or PRISM on D to obtain the probability Punsafe of violating P
6: repeat
7:   repeat
8:     Use COMICS to generate the set Sunsafe of paths negating P, with bound Punsafe
9:     Apply Repair on Sunsafe
10:   until Sunsafe = ∅
11:   Punsafe ← Punsafe − ε, ε ∈ (0, Punsafe − Pbound]
12: until Punsafe < Pbound
Results
[Figure: six plots over Nbr of episodes (0–10000), each with curves for Learn and Test; two panels use a 0–0.25 vertical scale, four a 0–1 scale]
Thanks to the audience and my colleagues¹! Questions or comments?

¹ Armando Tacchella, Giorgio Metta, & Luca Pulina