Efficient Policy Gradient Optimization/Learning of Feedback Controllers
Chris Atkeson

Page 1:

Efficient Policy Gradient Optimization/Learning of Feedback Controllers

Chris Atkeson

Page 2:

Punchlines

• Optimize and learn policies. Switch from “value iteration” to “policy iteration”.
• This is a big switch from optimizing and learning value functions.
• Use gradient-based policy optimization.

Page 3:

Motivations

• Efficiently design nonlinear policies.
• Make policy-gradient reinforcement learning practical.

Page 4:

Model-Based Policy Optimization

• Simulate the policy u = π(x,p) from some initial states x0 to find the policy cost.
• Use your favorite local or global optimizer to optimize the simulated policy cost.
• If gradients are used, they are typically numerically estimated.
• Δp = -ε ∑x0 w(x0) Vp   (1st order gradient step; see the sketch after this list)
• Δp = -(∑x0 w(x0) Vpp)^-1 ∑x0 w(x0) Vp   (2nd order step)
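A minimal sketch of these two update rules, assuming that for each initial state x0 we already have a weight w(x0), a gradient vector Vp(x0), and (for the second-order step) a Hessian Vpp(x0) with respect to the policy parameters p. Function and variable names are illustrative, not from the slides:

import numpy as np

def first_order_step(p, weights, Vp_list, eps=1e-2):
    # Delta p = -eps * sum over x0 of w(x0) * Vp(x0)
    g = sum(w * Vp for w, Vp in zip(weights, Vp_list))
    return p - eps * g

def second_order_step(p, weights, Vp_list, Vpp_list, reg=1e-6):
    # Delta p = -(sum w(x0) Vpp(x0))^-1 * (sum w(x0) Vp(x0))
    g = sum(w * Vp for w, Vp in zip(weights, Vp_list))
    H = sum(w * Vpp for w, Vpp in zip(weights, Vpp_list))
    H = H + reg * np.eye(H.shape[0])    # small regularizer, cf. the Regularization slide
    return p - np.linalg.solve(H, g)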

Page 5:

Can we make model-based policy gradient more efficient?

Page 6:

Analytic Gradients

• Deterministic policy: u = π(x,p)
• Policy Iteration (Bellman Equation): V^(k-1)(x,p) = L(x,π(x,p)) + V(f(x,π(x,p)),p)
• Linear models:
  f(x,u) = f0 + fxΔx + fuΔu
  L(x,u) = L0 + LxΔx + LuΔu
  π(x,p) = π0 + πxΔx + πpΔp
  V(x,p) = V0 + VxΔx + VpΔp
• Policy Gradient (backward recursion; a sketch follows below):
  Vx^(k-1) = Lx + Luπx + Vx(fx + fuπx)
  Vp^(k-1) = (Lu + Vxfu)πp + Vp
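A sketch of this backward recursion along one simulated trajectory, assuming the local models Lx, Lu, fx, fu and policy Jacobians πx, πp have already been evaluated at each time step. Names and array layout are illustrative, not from the slides:

import numpy as np

def analytic_policy_gradient(Lx_list, Lu_list, fx_list, fu_list,
                             pix_list, pip_list, n_p):
    # Backward recursion:
    #   Vp(k-1) = (Lu + Vx fu) pip + Vp
    #   Vx(k-1) = Lx + Lu pix + Vx (fx + fu pix)
    # where Vx, Vp on the right-hand side are the step-k quantities.
    T = len(Lx_list)
    Vx = np.zeros_like(Lx_list[-1])   # gradient of the value w.r.t. the state
    Vp = np.zeros(n_p)                # gradient of the value w.r.t. the policy parameters
    for k in reversed(range(T)):
        Lx, Lu = Lx_list[k], Lu_list[k]
        fx, fu = fx_list[k], fu_list[k]
        pix, pip = pix_list[k], pip_list[k]
        Vp = (Lu + Vx @ fu) @ pip + Vp          # update Vp first, using the step-k Vx
        Vx = Lx + Lu @ pix + Vx @ (fx + fu @ pix)
    return Vp   # analytic policy gradient from this initial state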

Page 7:

Handling Constraints

• Lagrange multiplier approach, with constraint violation value function.
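One hedged reading of this bullet (my formulation, not spelled out on the slide): alongside the cost value function V, carry a constraint-violation value function C(x0,p) that accumulates violations along the trajectory, and take the gradient step on the Lagrangian

  minimize over p, maximize over λ ≥ 0:  ∑x0 w(x0) [ V(x0,p) + λ C(x0,p) ]

where the gradient Cp propagates backward by the same recursion used for Vp on the previous slide.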

Page 8:

Vpp: Second Order Models

Page 9:

Regularization

Page 10:

LQBR: Linear (dynamics) Quadratic (cost) Bilinear (policy) Regulator
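To make the acronym concrete, here is a minimal illustration of one natural reading (names A, B, C, Q, R, K are mine, not from the slides): linear dynamics, quadratic cost, and an output-feedback policy u = K(Cx), which is linear in the state for a fixed gain K and linear in the gain entries for a fixed state, hence bilinear:

import numpy as np

def lqbr_policy_cost(A, B, C, Q, R, K, x0, horizon=200):
    # Linear dynamics, quadratic cost, bilinear (output-feedback) policy u = K C x.
    x = x0
    cost = 0.0
    for _ in range(horizon):
        u = K @ (C @ x)                 # linear in x for fixed K, linear in K for fixed x
        cost += x @ Q @ x + u @ R @ u   # quadratic cost
        x = A @ x + B @ u               # linear dynamics
    return cost

The simulated cost of such a policy, summed over weighted initial states, is the quantity whose gradient and second-order model with respect to the entries of K the earlier slides compute.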

Page 11:

Timing Test

Page 12:

Antecedents

• Optimizing control “parameters” in DDP: Dyer and McReynolds 1970.

• Optimal output feedback design (1960s-1970s)

• Multiple model adaptive control (MMAC)
• Policy gradient reinforcement learning
• Adaptive critics, Werbos: HDP, DHP, GDHP, ADHDP, ADDHP

Page 13:

When Will LQBR Work?

• Initial stabilizing policy is known (“output stabilizable”)

• Luu is positive definite.

• Lxx is positive semi-definite and (sqrt(Lxx),Fx) is detectable.

• Measurement matrix C has full row rank.

Page 14:

Locally Linear Policies

Page 15:

Local Policies

[Figure: local policies around the GOAL state]

Page 16:

Cost Of One Gradient Calculation

Page 17:

Continuous Time

Page 18:

Other Issues

• Model Following
• Stochastic Plants
• Receding Horizon Control/MPC
• Adaptive RHC/MPC
• Combine with Dynamic Programming
• Dynamic Policies -> Learn State Estimator

Page 19:

Optimize Policies

• Policy Iteration, with gradient-based policy improvement step.

• Analytic gradients are easy.
• Non-overlapping sub-policies make second order gradient calculations fast.
• Big problem: How to choose the policy structure?