Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov...
Transcript of Emo Todorovtranslectures.videolectures.net/site/...todorov_optimal_control_01.pdf · Emo Todorov...
Linearly-solvable optimal control
Emo Todorov
Applied Mathematics and Computer Science & Engineering
University of Washington
based on:E. Todorov, "Efficient computation of optimal actions", PNAS 2009
Emo Todorov (UW) Linearly-solvable optimal control 1 / 19
Problem formulation
In traditional MDPs the controller chooses actions u which in turn specify thetransition probabilities p (x0jx, u). We can obtain a linearly-solvable MDP(LMDP) by allowing the controller to specify these probabilities directly:
x0 � u (�jx) controlled dynamics
x0 � p (�jx) passive dynamics
p (x0jx) = 0) u (x0jx) = 0 feasible control set
The running cost is in the form
` (x, u (�jx)) = q (x) +DKL (u (�jx) jjp (�jx))
Thus the controller can impose any dynamics it wishes, however it pays aprice (KL divergence control cost) for pushing the system away from itspassive dynamics.
Emo Todorov (UW) Linearly-solvable optimal control 2 / 19
Problem formulation
In traditional MDPs the controller chooses actions u which in turn specify thetransition probabilities p (x0jx, u). We can obtain a linearly-solvable MDP(LMDP) by allowing the controller to specify these probabilities directly:
x0 � u (�jx) controlled dynamics
x0 � p (�jx) passive dynamics
p (x0jx) = 0) u (x0jx) = 0 feasible control set
The running cost is in the form
` (x, u (�jx)) = q (x) +DKL (u (�jx) jjp (�jx))
Thus the controller can impose any dynamics it wishes, however it pays aprice (KL divergence control cost) for pushing the system away from itspassive dynamics.
Emo Todorov (UW) Linearly-solvable optimal control 2 / 19
Reducing optimal control to a linear problemThe Bellman equation for a first-exit LMPD is
v (x) = minu(�jx)
�q (x) + Ex0�u(�jx)
�log
u (x0jx)p (x0jx)
�+ Ex0�u(�jx)
�v�x0���
= q (x) + minu(�jx)
Ex0�u(�jx)
�log
u (x0)p (x0jx) exp (�v (x0))
�The last term is an unnormalized KL divergence,
thus the optimal control is
u��x0jx
�=
p (x0jx) z (x0)P [z] (x)
z (x) , exp (�v (x)) desirability function
P [z] (x) , ∑x0 p (x0jx) z (x0) next-state expectation
The problem is now linear in the desirability function:
z (x) = exp (�q (x))P [z] (x)
Emo Todorov (UW) Linearly-solvable optimal control 3 / 19
Reducing optimal control to a linear problemThe Bellman equation for a first-exit LMPD is
v (x) = minu(�jx)
�q (x) + Ex0�u(�jx)
�log
u (x0jx)p (x0jx)
�+ Ex0�u(�jx)
�v�x0���
= q (x) + minu(�jx)
Ex0�u(�jx)
�log
u (x0)p (x0jx) exp (�v (x0))
�The last term is an unnormalized KL divergence, thus the optimal control is
u��x0jx
�=
p (x0jx) z (x0)P [z] (x)
z (x) , exp (�v (x)) desirability function
P [z] (x) , ∑x0 p (x0jx) z (x0) next-state expectation
The problem is now linear in the desirability function:
z (x) = exp (�q (x))P [z] (x)
Emo Todorov (UW) Linearly-solvable optimal control 3 / 19
Reducing optimal control to a linear problemThe Bellman equation for a first-exit LMPD is
v (x) = minu(�jx)
�q (x) + Ex0�u(�jx)
�log
u (x0jx)p (x0jx)
�+ Ex0�u(�jx)
�v�x0���
= q (x) + minu(�jx)
Ex0�u(�jx)
�log
u (x0)p (x0jx) exp (�v (x0))
�The last term is an unnormalized KL divergence, thus the optimal control is
u��x0jx
�=
p (x0jx) z (x0)P [z] (x)
z (x) , exp (�v (x)) desirability function
P [z] (x) , ∑x0 p (x0jx) z (x0) next-state expectation
The problem is now linear in the desirability function:
z (x) = exp (�q (x))P [z] (x)
Emo Todorov (UW) Linearly-solvable optimal control 3 / 19
Illustration
z(x’)
p(x’|x)u*(x’|x) ~ p(x’|x) z(x’)
x’: sampled from u*(x’|x)x
Emo Todorov (UW) Linearly-solvable optimal control 4 / 19
Linearly-solvable controlled diffusions
dx = a (x) dt+ B (x) (udt+ σdω)
` (x, u) = q (x) +1
2σ2 kuk2
Defining z = exp (�v) as before, the optimal control law is
u� (x) = σ2B (x)Tzx (x)z (x)
and the (first-exit) HJB equation becomes
q (x) z (x) = L [z] (x)
where L is the generator of the passive dynamics:
L [z] (x) = a (x)T zx (x) +σ2
2trace
�B (x)B (x)T zxx (x)
�Emo Todorov (UW) Linearly-solvable optimal control 5 / 19
Relation between the two problems
The above diffusion is the h! 0 limit of discrete-time continuous-spaceLMDPs with passive dynamics
p(h)�x0jx
�= N
�x+ ha (x) , hσ2B (x)B (x)T
�
The LMDP state cost is hq (x). From the KL divergence formula for Gaussians,the LMDP control cost reduces to the diffusion control cost h
2σ2 kuk2:
x
passive
controlled
hBu DKL (N (m1, Σ) jjN (m2, Σ))
=12(m1 �m2)
T Σ�1 (m1 �m2)
Thus LMDPs can approximate linearly-solvable diffusions, but they can alsoapproximate other dynamics, e.g. contact dynamics.
Emo Todorov (UW) Linearly-solvable optimal control 6 / 19
Relation between the two problems
The above diffusion is the h! 0 limit of discrete-time continuous-spaceLMDPs with passive dynamics
p(h)�x0jx
�= N
�x+ ha (x) , hσ2B (x)B (x)T
�The LMDP state cost is hq (x). From the KL divergence formula for Gaussians,the LMDP control cost reduces to the diffusion control cost h
2σ2 kuk2:
x
passive
controlled
hBu DKL (N (m1, Σ) jjN (m2, Σ))
=12(m1 �m2)
T Σ�1 (m1 �m2)
Thus LMDPs can approximate linearly-solvable diffusions, but they can alsoapproximate other dynamics, e.g. contact dynamics.
Emo Todorov (UW) Linearly-solvable optimal control 6 / 19
Relation between the two problems
The above diffusion is the h! 0 limit of discrete-time continuous-spaceLMDPs with passive dynamics
p(h)�x0jx
�= N
�x+ ha (x) , hσ2B (x)B (x)T
�The LMDP state cost is hq (x). From the KL divergence formula for Gaussians,the LMDP control cost reduces to the diffusion control cost h
2σ2 kuk2:
x
passive
controlled
hBu DKL (N (m1, Σ) jjN (m2, Σ))
=12(m1 �m2)
T Σ�1 (m1 �m2)
Thus LMDPs can approximate linearly-solvable diffusions, but they can alsoapproximate other dynamics, e.g. contact dynamics.
Emo Todorov (UW) Linearly-solvable optimal control 6 / 19
Summary of (mostly) linear Bellman equations
LMDP : Diffusion :
first exit exp (q) z = P [z] qz = L [z]finite horizon exp (qk) zk = Pk [zk+1] qz� zt = L [z]average cost exp (q� c) z = P [z] (q� c) z = L [z]discounted cost exp (q) z = P [zα] z log (zα) = L [z]
The operator P (h) in the diffusion-approximating LMDP is related to L as
P (h) [z] (x) = z (x) + hL [z] (x) + o�
h2�
Emo Todorov (UW) Linearly-solvable optimal control 7 / 19
Comparison to generic MDP approximation
0 40 80 120
z iter
policy iter
100502510
CPU time (sec)
aver
age
costP
3 +39
+9
horizontal position
tang
entia
lvel
ocity
optimal control (z iter) optimal costtogo (z iter) optimal costtogo (policy iter)
0 10 102 103 104
number of updates
aver
age
cost
z pol. val.
Emo Todorov (UW) Linearly-solvable optimal control 8 / 19
Z-learning
The solution to the linear Bellman equation
z (x) = exp (�q (x))Ex0�p(�jx)�z�x0��
can be approximated in a model-free way given samples (xn, x0n, qn = q (xn))obtained from the passive dynamics x0n � p (�jxn).
One possibility is a Monte Carlo method based on the path integralrepresentation, although covergence can be slow:
bz (x) = 1# trajectoriesstarting at x
∑ exp (� trajectory cost)
Faster convergence is obtained using temporal difference learning:
bz (xn) (1� β)bz (xn) + β exp (�qn)bz �x0n�The learning rate β should decrease over time.
Emo Todorov (UW) Linearly-solvable optimal control 9 / 19
Z-learning
The solution to the linear Bellman equation
z (x) = exp (�q (x))Ex0�p(�jx)�z�x0��
can be approximated in a model-free way given samples (xn, x0n, qn = q (xn))obtained from the passive dynamics x0n � p (�jxn).
One possibility is a Monte Carlo method based on the path integralrepresentation, although covergence can be slow:
bz (x) = 1# trajectoriesstarting at x
∑ exp (� trajectory cost)
Faster convergence is obtained using temporal difference learning:
bz (xn) (1� β)bz (xn) + β exp (�qn)bz �x0n�The learning rate β should decrease over time.
Emo Todorov (UW) Linearly-solvable optimal control 9 / 19
Z-learning
The solution to the linear Bellman equation
z (x) = exp (�q (x))Ex0�p(�jx)�z�x0��
can be approximated in a model-free way given samples (xn, x0n, qn = q (xn))obtained from the passive dynamics x0n � p (�jxn).
One possibility is a Monte Carlo method based on the path integralrepresentation, although covergence can be slow:
bz (x) = 1# trajectoriesstarting at x
∑ exp (� trajectory cost)
Faster convergence is obtained using temporal difference learning:
bz (xn) (1� β)bz (xn) + β exp (�qn)bz �x0n�The learning rate β should decrease over time.
Emo Todorov (UW) Linearly-solvable optimal control 9 / 19
Importance sampling
Let bu (x0jx) denote the greedy control law, i.e. the control law which would beoptimal if the current approximation bz (x) were the exact desirability function.Then we can sample from bu rather than p and use importance sampling:
bz (xn) (1� β)bz (xn) + βp (x0njxn)bu (x0njxn)
exp (�qn)bz �x0n�We now need access to the model p (x0jx) of the passive dynamics.
X
Qlearning, passiveQlearning, egreedyZlearning, passiveZlearning, greedy
state transitions
cost
tog
oap
prox
imat
ion
erro
r
0 50,000state transitions
0 50,000
log
tota
lcos
t
Emo Todorov (UW) Linearly-solvable optimal control 10 / 19
Importance sampling
Let bu (x0jx) denote the greedy control law, i.e. the control law which would beoptimal if the current approximation bz (x) were the exact desirability function.Then we can sample from bu rather than p and use importance sampling:
bz (xn) (1� β)bz (xn) + βp (x0njxn)bu (x0njxn)
exp (�qn)bz �x0n�We now need access to the model p (x0jx) of the passive dynamics.
X
Qlearning, passiveQlearning, egreedyZlearning, passiveZlearning, greedy
state transitions
cost
tog
oap
prox
imat
ion
erro
r
0 50,000state transitions
0 50,000lo
gto
talc
ost
Emo Todorov (UW) Linearly-solvable optimal control 10 / 19
Estimation-control dualityThe probability of a trajectory under the passive dynamics is
p (x1, x2, � � � xTjx0) = ∏T�1k=0 p (xk+1jxk)
The probability of a trajectory under the optimally-controlled dynamics is
µ (x1, x2, � � � xTjx0) ∝ exp��∑T
k=1 q (xk)�
p (x1, x2, � � � xTjx0)
This is equivalent to Bayesian inference where p is the prior, µ the posterior,and q the negative log-likelihood (of some unspecified measurements).
Let µk (x) be the marginal of the trajectory density at time step k, and definerk (x) = µk (x) /zk (x). Then
zk (x) = exp (�q (x))∑x0 p�x0jx
�zk+1
�x0�
rk+1�x0�= ∑x exp (�q (x)) p
�x0jx
�rk (x)
µk (x) = rk (x) zk (x)
This is equivalent to recursive Bayesian inference where rk is the filteringdensity, zk the backward filtering density, and µk the posterior.
Emo Todorov (UW) Linearly-solvable optimal control 11 / 19
Estimation-control dualityThe probability of a trajectory under the passive dynamics is
p (x1, x2, � � � xTjx0) = ∏T�1k=0 p (xk+1jxk)
The probability of a trajectory under the optimally-controlled dynamics is
µ (x1, x2, � � � xTjx0) ∝ exp��∑T
k=1 q (xk)�
p (x1, x2, � � � xTjx0)
This is equivalent to Bayesian inference where p is the prior, µ the posterior,and q the negative log-likelihood (of some unspecified measurements).
Let µk (x) be the marginal of the trajectory density at time step k, and definerk (x) = µk (x) /zk (x). Then
zk (x) = exp (�q (x))∑x0 p�x0jx
�zk+1
�x0�
rk+1�x0�= ∑x exp (�q (x)) p
�x0jx
�rk (x)
µk (x) = rk (x) zk (x)
This is equivalent to recursive Bayesian inference where rk is the filteringdensity, zk the backward filtering density, and µk the posterior.
Emo Todorov (UW) Linearly-solvable optimal control 11 / 19
General case
Continuous: Discrete:
dx = a (x) dt+ B (x) (udt+ dω) xt+1 � u (�jxt)
` (x, u) = q (x) + 12 kuk
2 ` (x, u) = q (x) +DKL (u (�jx) jj p (�jx))
is dual to is dual to
dx = a (x) dt+ B (x) dω xt+1 � p (�jxt)
dy = h (x) dt+ dν yt � Bernoulli (r (xt))
when when
q (x) = 12 kh (x)k
2 q (x) = � log (r (x))
Emo Todorov (UW) Linearly-solvable optimal control 12 / 19
Linear case
dynamics: dx = Axdt+ Budt+ Cdω
cost rate: ` (x, u) = 12 xTQx+ 1
2 kuk2
measurement: dy = Hxdt+ dν
cost-to-go Hessian: �V = Q+ATV+VA�VBBTV
covariance: S = CCT +AS+ SAT � SHTHS
inverse covariance: P = HTH�ATP� PA� PCCTP
Duality relations:
linear-quadratic regulator: V A B Q t
Kalman-Bucy filter: S AT HT CCT tf � t
information filter: P �A C HTH tf � t
Emo Todorov (UW) Linearly-solvable optimal control 13 / 19
Linear case
dynamics: dx = Axdt+ Budt+ Cdω
cost rate: ` (x, u) = 12 xTQx+ 1
2 kuk2
measurement: dy = Hxdt+ dν
cost-to-go Hessian: �V = Q+ATV+VA�VBBTV
covariance: S = CCT +AS+ SAT � SHTHS
inverse covariance: P = HTH�ATP� PA� PCCTP
Duality relations:
linear-quadratic regulator: V A B Q t
Kalman-Bucy filter: S AT HT CCT tf � t
information filter: P �A C HTH tf � t
Emo Todorov (UW) Linearly-solvable optimal control 13 / 19
Maximum principle for the most likely trajectory
The maximum of µ is the optimal trajectory for a deterministic problem withany dynamics x0 = f (x, u) and cost ` (x, u) = q (x)� log p (f (x, u) , x).
For diffusions with B = const the corresponding deterministic problem hasdynamics x = a (x) and cost ` (x, u) = q (x) + 1
2σ2 kuk2 + 12 div (a (x)).
dx = a (x) dt+ udt+ σdω
q = 0
qT = x2
5 +50
0
2
+2
position (x)
a(x)div(a(x))
0 5
0
+5
5
time (t)
po
siti
on
(x)
r(x,t) z(x,t)mu(x,t)
sigma = 0.6
sigma = 1.2
Emo Todorov (UW) Linearly-solvable optimal control 14 / 19
Maximum principle for the most likely trajectory
The maximum of µ is the optimal trajectory for a deterministic problem withany dynamics x0 = f (x, u) and cost ` (x, u) = q (x)� log p (f (x, u) , x).
For diffusions with B = const the corresponding deterministic problem hasdynamics x = a (x) and cost ` (x, u) = q (x) + 1
2σ2 kuk2 + 12 div (a (x)).
dx = a (x) dt+ udt+ σdω
q = 0
qT = x2
5 +50
0
2
+2
position (x)
a(x)div(a(x))
0 5
0
+5
5
time (t)
po
siti
on
(x)
r(x,t) z(x,t)mu(x,t)
sigma = 0.6
sigma = 1.2
Emo Todorov (UW) Linearly-solvable optimal control 14 / 19
Compositionality
Consider a finite-horizon problem with final cost
qT (x) = � log�∑k wk exp (�gk (x))
�Let zk (x, t), u�k (x, t) be the desirability functions and optimal control laws forcomponent problems with final costs gk (x).Then the desirability function for the composite problem is
z (x, t) = ∑k wkzk (x, t)
and the optimal control law is
u� (x, t) = ∑kwkzk (x, t)
z (x, t)u�k (x, t)
Emo Todorov (UW) Linearly-solvable optimal control 15 / 19
Application to LQG
dx = u dt+ dω, ` (x, u) =12
u2, gk (x) =12(x� ck)
2
Emo Todorov (UW) Linearly-solvable optimal control 16 / 19
Function approximation
Infinite-horizon average-cost discrete-time continuous-space problem:
λ z (x) = exp (�q (x))Z
p�x0jx
�z�x0�
dx0
Select a collocation set fxng. Define a function approximator
bz (x; w, θ) = ∑i wi fi (x; θ)
Construct the (usually sparse) matrices F, G with elements
Fn,i = fi (xn; θ)
Gn,i = exp (�q (xn))Z
p�x0jxn
�fi�x0; θ
�dx0
The above functional equation now becomes an algebraic equation:
λ F (θ)w = G (θ)w
Solving for λ, w is a linear problem; θ can be adapted via gradient descent.
Emo Todorov (UW) Linearly-solvable optimal control 17 / 19
Function approximation
Infinite-horizon average-cost discrete-time continuous-space problem:
λ z (x) = exp (�q (x))Z
p�x0jx
�z�x0�
dx0
Select a collocation set fxng. Define a function approximator
bz (x; w, θ) = ∑i wi fi (x; θ)
Construct the (usually sparse) matrices F, G with elements
Fn,i = fi (xn; θ)
Gn,i = exp (�q (xn))Z
p�x0jxn
�fi�x0; θ
�dx0
The above functional equation now becomes an algebraic equation:
λ F (θ)w = G (θ)w
Solving for λ, w is a linear problem; θ can be adapted via gradient descent.
Emo Todorov (UW) Linearly-solvable optimal control 17 / 19
Example: 40 adaptive Gaussian bases
caronahillcycle
7.48
4.63
positionve
loci
ty
state cost7.51
4.66
metronome cycle
0 100CG iteration
log 1
0re
sidu
al
optimal control approximation
Emo Todorov (UW) Linearly-solvable optimal control 18 / 19
Remarks on function approximation
The collocation states and bases need to cover the (unknown) region ofstate space where the optimally-controlled dynamics tend to live. We areexploring various adaptive methods, not yet clear which is best.
Trajectory optimization may be a good way to initialize the basis functionmethod – which can then construct a feedback controller applicable overa much wider region.The computational bottleneck is the evaluation of
Rp (x0jxn) fi (x0; θ) dx0.
If p is a Gaussian mixture, and all fi’s are Gaussians (possibly multipliedby polynomials), this evaluation can be done analytically. Otherwise onecan use cubature formulas.Overall impression: Numerical methods derived from the LMDPframework are much more efficient than methods for generic stochasticcontrol problems, but not enough to beat the curse of dimensionality.
Emo Todorov (UW) Linearly-solvable optimal control 19 / 19
Remarks on function approximation
The collocation states and bases need to cover the (unknown) region ofstate space where the optimally-controlled dynamics tend to live. We areexploring various adaptive methods, not yet clear which is best.Trajectory optimization may be a good way to initialize the basis functionmethod – which can then construct a feedback controller applicable overa much wider region.
The computational bottleneck is the evaluation ofR
p (x0jxn) fi (x0; θ) dx0.If p is a Gaussian mixture, and all fi’s are Gaussians (possibly multipliedby polynomials), this evaluation can be done analytically. Otherwise onecan use cubature formulas.Overall impression: Numerical methods derived from the LMDPframework are much more efficient than methods for generic stochasticcontrol problems, but not enough to beat the curse of dimensionality.
Emo Todorov (UW) Linearly-solvable optimal control 19 / 19
Remarks on function approximation
The collocation states and bases need to cover the (unknown) region ofstate space where the optimally-controlled dynamics tend to live. We areexploring various adaptive methods, not yet clear which is best.Trajectory optimization may be a good way to initialize the basis functionmethod – which can then construct a feedback controller applicable overa much wider region.The computational bottleneck is the evaluation of
Rp (x0jxn) fi (x0; θ) dx0.
If p is a Gaussian mixture, and all fi’s are Gaussians (possibly multipliedby polynomials), this evaluation can be done analytically. Otherwise onecan use cubature formulas.
Overall impression: Numerical methods derived from the LMDPframework are much more efficient than methods for generic stochasticcontrol problems, but not enough to beat the curse of dimensionality.
Emo Todorov (UW) Linearly-solvable optimal control 19 / 19
Remarks on function approximation
The collocation states and bases need to cover the (unknown) region ofstate space where the optimally-controlled dynamics tend to live. We areexploring various adaptive methods, not yet clear which is best.Trajectory optimization may be a good way to initialize the basis functionmethod – which can then construct a feedback controller applicable overa much wider region.The computational bottleneck is the evaluation of
Rp (x0jxn) fi (x0; θ) dx0.
If p is a Gaussian mixture, and all fi’s are Gaussians (possibly multipliedby polynomials), this evaluation can be done analytically. Otherwise onecan use cubature formulas.Overall impression: Numerical methods derived from the LMDPframework are much more efficient than methods for generic stochasticcontrol problems, but not enough to beat the curse of dimensionality.
Emo Todorov (UW) Linearly-solvable optimal control 19 / 19
Trajectory optimization
Emo Todorov
Applied Mathematics Computer Science and Engineering
University of Washington
MuJoCo: A physics engine for control
recursive algorithms for smooth dynamics new algorithms for contact dynamics modeling of tendons, muscles, pneumatics 100 times faster than real-time (single core, 23 DOF humanoid) parallel evaluation for finite differences
Trajectory optimization methods General procedure: - divide the relevant variables (positions, velocities, torques, contact forces) into independent and dependent sets; compute the latter given the former - if the independent variables are physically coupled, impose constraints - use contact smoothing to ensure that the cost function is differentiable - apply a suitable 2nd-order optimization method, usually with continuation
Independent Dependent (computation) Constraints Smoothing
Torques Positions (integration) Velocities (forward dynamics) Contact forces (contact solver)
n/a Iterative contact approximation
Torques Contact forces
Positions (integration) Velocities (forward dynamics)
Contact & friction Soft constraints
Positions Velocities (differentiation) Torques (inverse dynamics) Contact forces (inverse contact solver)
Under-actuation Finite differences, Convex contact model
Positions Contact forces
Velocities (differentiation) Torques (inverse dynamics)
Under-actuation Contact & friction
Soft constraints
Positions Velocities Torques Contact forces
n/a Differentiation Dynamics Contact & friction
Soft constraints
Model-predictive control (MPC)
At time step t, solve a trajectory optimization problem of the form Larger horizon N results in better performance but requires more computation. The final cost h(x) should approximate all future costs incurred after time t+N.
( ) ( )∑−+
=+ +
1
,minNt
tkkkNt uxxh
There is always a plan, the plan is re-optimized all the time, only the initial portion is ever executed.
The optimizer is based on the iLQG method described in Todorov and Li, ACC 2005, with further improvements described in Tassa, Erez and Todorov, IROS 2012.
The cost has the extra term where e is the vector distance to the nearest surface. The planned contact impulse f is penalized in full, but scaled by c before being applied.
Contact-aware optimization
We introduce auxiliary decision variables indicating if potential contact i should be active in movement phase . These variables affect the dynamics and cost function, and are optimized along with the movement trajectory.
Mordatch, Todorov and Popovic, SIGGRAPH 2o12 Mordatch, Popovic and Todorov, SCA 2012
Results Kulchenko and Todorov, ICRA 2011 MPC: plan with iLQG until next contact (adaptive horizon). Contact dynamics handled implicitly: heuristic value function used as final cost (encodes what is a good way to hit the ball).
Tassa, Erez and Todorov, IROS 2012 MPC: plan with iLQG 0.5 sec into the future (fixed horizon). Smoothing due to soft approximation to contact.
Mordatch, Todorov and Popovic, SIGRAPH 2012 Space-time optimization: spline representation of trajectory. Contact smoothing due to finite differencing. Auxiliary decision variables for all potential contacts. Continuation: contact forces from a distance.
Mordatch, Popovic and Todorov, SCA 2012 Similar to above, but contact forces and their origins are now independent variables, and contact smoothing is due to soft constraints.
Erez, Tassa and Todorov, IROS 2012 Space-time optimization: Fourier representation of trajectory. Contact smoothing due to inverse contact solver. Continuation: root forces, contact forces from a distance.
Towards real-world applications
Drive utility vehicle to site Walk across rubble Remove debris blocking door Open door, enter building Climb ladder and stairs Use tool to break concrete panel Close leaking valve Replace component
DARPA Robotics Challenge
The case for MPC in the brain
yet most neurons in your brain are doing something when you move... perhaps online re-optimization of the movement?
complex “hardwired” behaviors can be generated with few neurons
we usually think of learning/optimization as setting up a similar neural machine, which is then responsible for online generation of behavior
small brain: box of chocolates / behaviors
large brain: chocolate factory / behavior optimizer
If you have expensive machinery, you should use it all the time.
Contact smoothing via soft constraints
If the contact forces are treated as independent variables and physical realism is enforced via penalty functions, then the cost function is differentiable
Coulomb friction model
complementarity:
A contact inverse inertia:
b contact velocity before impulse
f contact impulse
v contact velocity after impulse
TJJMA 1−=0,0,0 =≥≥ xyyx
0, ≤=
−⊥
⊥+=
αα
µ
TT
TNT
NN
fvffv
fvbfAv yx ⊥
( )
( )...
,min2
22
2
yxyx
yx
+−+
Penalty functions instead of hard constraints: replace with one of yx ⊥
The contact impulse f is the solution to a linear equation: subject to complicated but easy-to-enforce constraints: Iterative solvers tend to work well and produce smooth results: - solve the linear equation - enforce the constraints softly - repeat a couple of times
Contact smoothing via iterative approximations
bfAv +=
0, ≤=
−⊥
⊥
αα
µ
TT
TNT
NN
fvffv
fv
Convex, smooth and invertible contact model
Define the contact impulse by minimizing contact-space kinetic energy subject to for each contact. Replace the constraints with penalty functions
vAvT 1
21 −
TNNN ffvf ≥≥≥ µ,0,0
( ) ( )bAfdfdbfAfff VFTT
f ++++=21minarg*
( ) ( )vdfd VF ,
( ) ( )bAfvfbA +=→ ,,
( ) ( )AfvbfvA −=→ ,,
Forward contact dynamics:
Inverse contact dynamics:
( )( ) ( )fdvdAvff FVT
f +∇+= minarg*
0 200 400 6004
2
0
2
4
time (msec)
verticalvelocity
normalimpulse
ball drop test
Todorov, ICRA 2011