Calculating Hessian matrices - Télécom ParisTech · Calculating Hessian matrices Calculating...

Calculating Hessian matrices

StudentRobert Mansel Gower gowerrobert@gmail.com

AdvisorMargarida Pinheiro Mello margarid@ime.unicamp.br

July 25, 2011

Contents

1 Motivation

2 Computational graph

3 GradientForward GradientPartial derivatives on computational graph

4 HessianForward HessianHessian on computational graphNew Reverse Hessian algorithmComparative tests

Motivation

Motivation from nonlinear programming

Second order Taylor approximations very common in nonlinearprogramming.

Hessians desirable in interior-point and augmented Lagrangianmethods.

Sensitivity analysis

Computational graph

Function Representation

−1 0

v−1 = x−1 v0 = x0−1 0

v1 = h(v−1) 1 v2 = g(v−1, v0)2

v3 = f (v2, v1) 3

f (h(x−1), g(x−1, x0))v−1 = x−1v0 = x0v1 = h(v−1)v2 = g(v−1, v0)v3 = f (v2, v1)

Indices of matrices and vectors shifted by −n.y ∈ Rm : y = (y1−n, . . . , ym−n)T

Node numbering is in order of evaluation.

(j is a predecessor of i) ≡ j ∈ P(i).

(i is a sucessor of j) ≡ i ∈ S(j).

Computational graph

−1 0

v−1 = x−1 v0 = x0−1 0

v1 = h(v−1) 1 v2 = g(v−1, v0)2

v3 = f (v2, v1) 3

f (h(x−1), g(x−1, x0))

v−1 = x−1v0 = x0v1 = h(v−1)v2 = g(v−1, v0)v3 = f (v2, v1)

Computational graph

−1 0

v−1 = x−1 v0 = x0−1 0

v1 = h(v−1) 1 v2 = g(v−1, v0)2

v3 = f (v2, v1) 3

Computational graph

−1 0

v−1 = x−1 v0 = x0−1 0

v1 = h(v−1) 1 v2 = g(v−1, v0)2

v3 = f (v2, v1) 3

Computational graph

−1 0

v−1 = x−1 v0 = x0−1 0

v1 = h(v−1) 1 v2 = g(v−1, v0)2

v3 = f (v2, v1) 3

Computational graph

−1 0

v−1 = x−1 v0 = x0

−1 0

v1 = h(v−1) 1

v2 = g(v−1, v0)2

v3 = f (v2, v1) 3

Computational graph

−1 0

v−1 = x−1 v0 = x0

−1 0

v1 = h(v−1)

v2 = g(v−1, v0)2

v3 = f (v2, v1) 3

Computational graph

−1 0

v−1 = x−1 v0 = x0

−1 0

v1 = h(v−1)

v2 = g(v−1, v0)

v3 = f (v2, v1) 3

Computational graph

−1 0

v−1 = x−1 v0 = x0

−1 0

v1 = h(v−1)

v2 = g(v−1, v0)

v3 = f (v2, v1)

Computational graph

−1 0

v−1 = x−1 v0 = x0

−1 0

v1 = h(v−1)

v2 = g(v−1, v0)

v3 = f (v2, v1)

Computational graph

Function Evaluation ≡ Computational Graph

Nodes for Independent variables:

vi−n = xi−n, for i = 1, . . . , nIndependent nodes Z = {1− n, . . . , 0}

Nodes for Intermediate variables:vi = φi (vP(i)), for i = 1, . . . , `.Intermediate nodes V = {1, . . . , `}

Function Evaluation ≡ G = (Z ∪ V ,E ) & φ set of elementalfunctions with derivatives codedTIME(eval(f (x))) = O(`+ n).

Computational graph

Nodes for Independent variables:vi−n = xi−n, for i = 1, . . . , n

Independent nodes Z = {1− n, . . . , 0}

Computational graph

Nodes for Independent variables:vi−n = xi−n, for i = 1, . . . , nIndependent nodes Z = {1− n, . . . , 0}

Nodes for Intermediate variables:

vi = φi (vP(i)), for i = 1, . . . , `.Intermediate nodes V = {1, . . . , `}

Computational graph

Nodes for Intermediate variables:vi = φi (vP(i)), for i = 1, . . . , `.

Intermediate nodes V = {1, . . . , `}

Computational graph

Function Evaluation ≡ G = (Z ∪ V ,E )

& φ set of elementalfunctions with derivatives codedTIME(eval(f (x))) = O(`+ n).

Computational graph

Function Evaluation ≡ G = (Z ∪ V ,E ) & φ set of elementalfunctions with derivatives coded

TIME(eval(f (x))) = O(`+ n).

Computational graph

Gradient

Forward Gradient

Forward Gradient: The first attempt

Set of elemental function = Sums, multiplication and unaryfunctions.

vi = φi (vP(i))

∇vi =∑

j∈P(i)

∂φi

∂vj∇vj .

Each j passes on ∂φi∂vj∇vj to each sucessor i .

Gradient

Forward Gradient

vi = φi (vP(i))

∇vi =∑

j∈P(i)

∂φi

∂vj∇vj .

Gradient

Forward Gradient

vi = φi (vP(i))

∇vi =∑

j∈P(i)

∂φi

∂vj∇vj .

Gradient

Forward Gradient

Resume of Forward gradient

For each node i one stores ∇vi = ( ∂vi∂x1−n

, . . . , ∂vi∂x0).

Memory complexity: O(n`).

For each node visit, perform n-dimension vector arithmetic.

Time complexity: O(n`).

Storing and calculating all ∇vi ’s is expensive andunnecessary.

Gradient

Forward Gradient

, . . . , ∂vi∂x0).

Gradient

Forward Gradient

, . . . , ∂vi∂x0).

Gradient

Forward Gradient

, . . . , ∂vi∂x0).

Gradient

Forward Gradient

, . . . , ∂vi∂x0).

Gradient

Forward Gradient

, . . . , ∂vi∂x0).

Gradient

Partial derivatives on computational graph

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

x−1 x0

∂x−1=∂φ3

∂φ2

∂v−1+∂φ3

∂φ1

∂v−1

∂xi=

∑p|path from i to `

(weight of path p)

Gradient

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

x−1 x0

∂φ1∂x−1

∂φ2∂x0

∂φ2∂x−1

∂φ3∂v2

∂φ3∂v1

∂x−1=∂φ3

∂φ2

∂v−1+∂φ3

∂φ1

∂v−1

∂xi=

(weight of path p)

Gradient

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

x−1 x0

∂φ1∂x−1

∂φ2∂x0

∂φ2∂x−1

∂φ3∂v2

∂φ3∂v1

∂x−1=∂φ3

∂φ2

∂v−1+∂φ3

∂φ1

∂v−1

∂xi=

(weight of path p)

Gradient

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

x−1 x0

∂φ1∂x−1

∂φ2∂x0

∂φ2∂x−1

∂φ3∂v2

∂φ3∂v1

∂x0=∂φ3

∂φ2

∂x0+ 0

∂x−1=∂φ3

∂φ2

∂v−1+∂φ3

∂φ1

∂v−1

∂xi=

(weight of path p)

Gradient

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

x−1 x0

∂φ1∂x−1

∂φ2∂x0

∂φ2∂x−1

∂φ3∂v2

∂φ3∂v1

∂x0=∂φ3

∂φ2

∂x0+ 0

∂x−1=∂φ3

∂φ2

∂v−1+∂φ3

∂φ1

∂v−1

∂xi=

(weight of path p)

Gradient

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

x−1 x0

∂φ1∂x−1

∂φ2∂x0

∂φ2∂x−1

∂φ3∂v2

∂φ3∂v1

∂x0=∂φ3

∂φ2

∂x0+ 0

∂x−1=∂φ3

∂φ2

∂v−1+∂φ3

∂φ1

∂v−1

∂xi=

(weight of path p)

Gradient

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

x−1 x0

∂φ1∂x−1

∂φ2∂x0

∂φ2∂x−1

∂φ3∂v2

∂φ3∂v1

∂x0=∂φ3

∂φ2

∂x0+ 0

∂x−1=∂φ3

∂φ2

∂v−1+∂φ3

∂φ1

∂v−1

∂xi=

(weight of path p)

Gradient

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

x−1 x0

∂φ1∂x−1

∂φ2∂x0

∂φ2∂x−1

∂φ3∂v2

∂φ3∂v1

∂x0=∂φ3

∂φ2

∂x0+ 0

∂x−1=∂φ3

∂φ2

∂v−1+∂φ3

∂φ1

∂v−1

∂xi=

(weight of path p)

Gradient

Reverse Gradient - Accumulating paths

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

∂φ1∂x−1

∂φ2∂x0

∂φ2∂x−1

∂φ3∂v2

∂φ3∂v1

v3 = 1 3

v1 = ∂φ3∂v1

v2 = ∂φ3∂v2

v−1 =∂φ2

∂x−1v2 v0 =

∂φ2

∂x0v2v−1 =

∂φ1

∂x−1v1 +

∂φ2

∂x−1v2

∂x−1= v−1

∂x0= v0

vi =∑path from i to `

(path weight)

vj =∑i∈S(j)

∂φi∂vj

TIME(∇f (x))= TIME(f (x))

Gradient

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

∂φ1∂x−1

∂φ2∂x0

∂φ2∂x−1

∂φ3∂v2

∂φ3∂v1

v3 = 1 3

v1 = ∂φ3∂v1

v2 = ∂φ3∂v2

v−1 =∂φ2

∂x−1v2 v0 =

∂φ2

∂x0v2v−1 =

∂φ1

∂x−1v1 +

∂φ2

∂x−1v2

∂x−1= v−1

∂x0= v0

(path weight)

vj =∑i∈S(j)

∂φi∂vj

Gradient

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

∂φ1∂x−1

∂φ2∂x0

∂φ2∂x−1

∂φ3∂v2

∂φ3∂v1

v3 = 1 3

v1 = ∂φ3∂v1

v2 = ∂φ3∂v2

v−1 =∂φ2

∂x−1v2 v0 =

∂φ2

∂x0v2v−1 =

∂φ1

∂x−1v1 +

∂φ2

∂x−1v2

∂x−1= v−1

∂x0= v0

(path weight)

vj =∑i∈S(j)

∂φi∂vj

Gradient

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

∂φ1∂x−1

∂φ2∂x0

∂φ2∂x−1

∂φ3∂v2

∂φ3∂v1

v3 = 1 3

v1 = ∂φ3∂v1

v2 = ∂φ3∂v2

v−1 =∂φ2

∂x−1v2 v0 =

∂φ2

∂x0v2v−1 =

∂φ1

∂x−1v1 +

∂φ2

∂x−1v2

∂x−1= v−1

∂x0= v0

(path weight)

vj =∑i∈S(j)

∂φi∂vj

Gradient

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

∂φ1∂x−1

∂φ2∂x0

∂φ2∂x−1

∂φ3∂v2

∂φ3∂v1

v3 = 1 3

2v1 = ∂φ3∂v1

v2 = ∂φ3∂v2

v−1 =∂φ2

∂x−1v2 v0 =

∂φ2

∂x0v2v−1 =

∂φ1

∂x−1v1 +

∂φ2

∂x−1v2

∂x−1= v−1

∂x0= v0

(path weight)

vj =∑i∈S(j)

∂φi∂vj

Gradient

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

∂φ1∂x−1

∂φ2∂x0

∂φ2∂x−1

∂φ3∂v2

∂φ3∂v1

v3 = 1 3

1v1 = ∂φ3∂v1

v2 = ∂φ3∂v2

v−1 =∂φ2

∂x−1v2 v0 =

∂φ2

∂x0v2

v−1 =∂φ1

∂x−1v1 +

∂φ2

∂x−1v2

∂x−1= v−1

∂x0= v0

(path weight)

vj =∑i∈S(j)

∂φi∂vj

Gradient

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

∂φ1∂x−1

∂φ2∂x0

∂φ2∂x−1

∂φ3∂v2

∂φ3∂v1

v3 = 1 3

v1 = ∂φ3∂v1

v2 = ∂φ3∂v2

v−1 =∂φ2

∂x−1v2

v0 =∂φ2

∂x0v2v−1 =

∂φ1

∂x−1v1 +

∂φ2

∂x−1v2

∂x−1= v−1

∂x0= v0

(path weight)

vj =∑i∈S(j)

∂φi∂vj

Gradient

−1 0

f (x) = φ3(φ1(x−1), φ2(x−1, x0))

∂φ1∂x−1

∂φ2∂x0

∂φ2∂x−1

∂φ3∂v2

∂φ3∂v1

v3 = 1 3

v1 = ∂φ3∂v1

v2 = ∂φ3∂v2

v−1 =∂φ2

∂x−1v2

v0 =∂φ2

∂x0v2v−1 =

∂φ1

∂x−1v1 +

∂φ2

∂x−1v2

∂x−1= v−1

∂x0= v0

(path weight)

vj =∑i∈S(j)

∂φi∂vj

Hessian

Forward Hessian

Forward Hessian: McCormick and Jackson 1986

vi = φi (vP(i))

∇vi =∑

j∈P(i)

∂φi

∂vj∇vj

v ′′i =∑

j ,k∈P(i)

∇vj ·∂2φi

∂vj∂vk· ∇vT

k +∑

j∈P(i)

∂φi

∂vj· v ′′j

Hessian

Forward Hessian

vi = φi (vP(i))

∇vi =∑

j∈P(i)

∂φi

∂vj∇vj

v ′′i =∑

j ,k∈P(i)

∇vj ·∂2φi

∂vj∂vk· ∇vT

k +∑

j∈P(i)

∂φi

∂vj· v ′′j

Hessian

Forward Hessian

vi = φi (vP(i))

∇vi =∑

j∈P(i)

∂φi

∂vj∇vj

v ′′i =∑

j ,k∈P(i)

∇vj ·∂2φi

∂vj∂vk· ∇vT

k +∑

j∈P(i)

∂φi

∂vj· v ′′j

Hessian

Forward Hessian

vi = φi (vP(i))

∇vi =∑

j∈P(i)

∂φi

∂vj∇vj

v ′′i =∑

j ,k∈P(i)

∇vj ·∂2φi

∂vj∂vk· ∇vT

k +∑

j∈P(i)

∂φi

∂vj· v ′′j

Hessian

Forward Hessian

Forward Hessian resume

For each node, store and calculate a n × n matrix.

Is it necessary to calculate the gradient and Hessian of eachnode?

Gain a deeper understanding on the problem using gradientgraph.

Hessian

Forward Hessian

Hessian

Forward Hessian

Hessian

Forward Hessian

Calculating the Hessian using the computational graph

Function’s computational graph + vi nodes and dependencies= gradient computational graph.

Interpret partial derivative on augmented graph: Second orderderivative.

Eliminate unnecessary symmetries on augmented graph.

Hessian

Forward Hessian

Hessian

Forward Hessian

Hessian

Hessian on computational graph

The adjoint variables of the Reverse Gradient satisfy

vj =∑i∈S(j)

vi∂φi∂vj≡ ϕj .

Gradient’s graph has 2(`+ n) nodes:(v1−n, . . . , v`) and (v1−n, . . . , v`).

node ←→ vj .

i ∈ P () iff j ∈ P(i).

Hessian

vj =∑i∈S(j)

node ←→ vj .

i ∈ P () iff j ∈ P(i).

Hessian

vj =∑i∈S(j)

node ←→ vj .

i ∈ P () iff j ∈ P(i).

Hessian

vj =∑i∈S(j)

node ←→ vj .

i ∈ P () iff j ∈ P(i).

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0)

−1 0

v−1 = x−1 v3 = 1v0 = x0 v2 = v3v1v1 = sin(v−1) v1 = v3v2v2 = (v−1 + v0) v0 = v21v3 = v1v2 v−1 = v21 + v1 cos(v−1)

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0)

−1 0

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0)

−1 0

v−1 = x−1 v3 = 1v0 = x0 v2 = v3v1 ←−v1 = sin(v−1) v1 = v3v2v2 = (v−1 + v0) v0 = v21v3 = v1v2 v−1 = v21 + v1 cos(v−1)

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0)

−1 0

v−1 = x−1 v3 = 1v0 = x0 v2 = v3v1v1 = sin(v−1) v1 = v3v2 ←−v2 = (v−1 + v0) v0 = v21v3 = v1v2 v−1 = v21 + v1 cos(v−1)

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0)

−1 0

v−1 = x−1 v3 = 1v0 = x0 v2 = v3v1v1 = sin(v−1) v1 = v3v2v2 = (v−1 + v0) v0 = v21v3 = v1v2 v−1 = v21 + v1 cos(v−1) ←−

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0)

−1 0

v−1∂2f

∂xi∂xj=∑

p from −1 to −1

Weight(p)

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0)

−1 0

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0)

−1 0

cos(v−1)

∂φ1

∂v−1= cos(v−1)

∂ϕ−1

∂v1= cos(v−1)

v−1 = x−1 v3 = 1v0 = x0 v2 = v3v1v1 = sin(v−1) ←− v1 = v3v2v2 = (v−1 + v0) v0 = v21v3 = v1v2 v−1 = v21 + v1 cos(v−1) ←−

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0)

−1 0

cos(v−1)

cos(v−1)∂ϕj

∂vk= ckj =

∂φk

v−1 = x−1 v3 = 1v0 = x0 v2 = v3v1v1 = sin(v−1) ←− v1 = v3v2v2 = (v−1 + v0) v0 = v21v3 = v1v2 v−1 = v21 + v1 cos(v−1) ←−

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0)

−1 0

v−1 = x−1 v3 = 1v0 = x0 v2 = v3v1 ←−v1 = sin(v−1) v1 = v3v2 ←−v2 = (v−1 + v0) v0 = v21v3 = v1v2 v−1 = v21 + v1 cos(v−1)

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0)

−1 0

∂ϕ1

∂v2= v3

∂ϕ2

∂v1= v3

v−1 = x−1 v3 = 1v0 = x0 v2 = v3v1 ←−v1 = sin(v−1) v1 = v3v2 ←−v2 = (v−1 + v0) v0 = v21v3 = v1v2 v−1 = v21 + v1 cos(v−1)

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0)

−1 0

∂ϕj

∂vk= ckj =

∂ϕk

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0)

−1 0

∂ϕj

∂vk= ckj =

∂ϕk

ckj =∑

i∈S(k)∩S(j)

vi∂2φi

∂vj∂vk

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0) = φ3(φ1(x−1), φ2(x−1, x0))

−1 0

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0) = φ3(φ1(x−1), φ2(x−1, x0))

−1 0

∂2f∂x2−1

= c1−1c21c2−1

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0) = φ3(φ1(x−1), φ2(x−1, x0))

−1 0

∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1

Hessian

−1 0

f (x) = sin(x−1)(x−1 + x0) = φ3(φ1(x−1), φ2(x−1, x0))

−1 0

∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1 + c−1−1

Hessian

−1 0

∂2f∂x2−1

= c1−1c21c2−1∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1 + c−1−1

Fold mirror subgraph.

More symmetry

k 99K j iff k L99 j

ckj = cjk

Hessian

−1 0

∂2f∂x2−1

= c1−1c21c2−1

∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1 + c−1−1

More symmetry

k 99K j iff k L99 j

ckj = cjk

Hessian

−1 0

∂2f∂x2−1

= c1−1c21c2−1

∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1

∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1 + c−1−1

More symmetry

k 99K j iff k L99 j

ckj = cjk

Hessian

−1 0

∂2f∂x2−1

= c1−1c21c2−1∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1

∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1 + c−1−1

More symmetry

k 99K j iff k L99 j

ckj = cjk

Hessian

−1 0

∂2f∂x2−1

= c1−1c21c2−1∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1 + c−1−1

More symmetry

k 99K j iff k L99 j

ckj = cjk

Hessian

−1 0

∂2f∂x2−1

= c1−1c21c2−1∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1 + c−1−1

More symmetry

k 99K j iff k L99 j

ckj = cjk

Hessian

−1 0

∂2f∂x2−1

= c1−1c21c2−1∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1 + c−1−1

More symmetry

k 99K j iff k L99 j

ckj = cjk

Hessian

−1 0

∂2f∂x2−1

= c1−1c21c2−1∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1 + c−1−1

∂2f∂x2−1

= 2c1−1c21c2−1

More symmetry

k 99K j iff k L99 j

ckj = cjk

Hessian

−1 0

∂2f∂x2−1

= c1−1c21c2−1∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1∂2f∂x2−1

= c1−1c21c2−1 + c2−1c21c1−1 + c−1−1

∂2f∂x2−1

= 2c1−1c21c2−1 + c−1−1

More symmetry

k 99K j iff k L99 j

ckj = cjk

Hessian

∂xi∂xj=∑

nonlinear

edge {m, k}

∑{p| from i to m}

( weight of p) cmk

∑{p| from j to k}

( weight of p) .

Hessian

∂xi∂xj=∑

nonlinear

edge {m, k}

∑{p| from i to m}

( weight of p) cmk

∑{p| from j to k}

( weight of p) .

Hessian

New Reverse Hessian algorithm

Building shortcuts

P(m) = {i , j}.

(m, k) ∈ Path

⇒ path 3 (i ,m, k) or path 3 (j ,m, k)

cjmcmk

cimcmk

Figure: Pushing the edge {m, k}

Hessian

Building shortcuts

P(m) = {i , j}.(m, k) ∈ Path

m kcmk

cjmcmk

cimcmk

Hessian

Building shortcuts

P(m) = {i , j}.(m, k) ∈ Path

m kcmk

cjmcmk

cimcmk

Hessian

Building shortcuts

P(m) = {i , j}.(m, k) ∈ Path

cjmcmk

cimcmk

Hessian

Simple example of edge pushing execution

−2 −1 0

f (x) = (x−2 + 1)(x−1 + 1)3(x0 + 1)

v1 = v−2 + 1v2 = v−1 + 1v3 = v0 + 1v4 = v1v2v5 = 3v3v6 = v4v5v6 = 1v5 = v4v4 = v5

Hessian

−2 −1 0

f (x) = (x−2 + 1)(x−1 + 1)3(x0 + 1)

Hessian

−2 −1 0

f (x) = (x−2 + 1)(x−1 + 1)3(x0 + 1)

Hessian

−2 −1 0

f (x) = (x−2 + 1)(x−1 + 1)3(x0 + 1)

Hessian

−2 −1 0

f (x) = (x−2 + 1)(x−1 + 1)3(x0 + 1)

Hessian

−2 −1 0

f (x) = (x−2 + 1)(x−1 + 1)3(x0 + 1)

Hessian

−2 −1 0

f (x) = (x−2 + 1)(x−1 + 1)3(x0 + 1)

Hessian

−2 −1 0

6f (x) = (x−2 + 1)(x−1 + 1)3(x0 + 1)

Hessian

−2 −1 0

6f (x) = (x−2 + 1)(x−1 + 1)3(x0 + 1)

v1 = v−2 + 1v2 = v−1 + 1v3 = v0 + 1v4 = v1v2v5 = 3v3v6 = v4v5

f ′′ =

0 X XX 0 XX X 0

Hessian

pushing of nonlinear edges

sweepingnode i

wiicji wiickiwiicjicki

wikcji2wikcik

wipcji wipcki

Hessian

The pseudo-code of edge pushing

Input: x ∈ Rn,for i = `, . . . , 1 do

Create nonlinear edges if φi is nonlinear ;Push nonlinear edges adjacent to i ;

Hessian

Comparative tests

Competitor for edge pushing: Graph coloring

edge pushing implementation aimed at large sparse Hessians.

state-of-the-art competitor: graph coloring methodsGebremedhin, Manne, Pothen, Walther, Tarafdar

Efficient Computation of Sparse Hessians Using Coloring andAutomatic Differentiation(2009)What Color Is Your Jacobian? Graph Coloring for ComputingDerivatives(2005)

Hessian

Comparative tests

Sparsity Pattern1 Color Compact

Calculates once

f ′′(x) f ′′(x)S

1Uses Walther’s 2008 algorithm

Hessian

Comparative tests

Sparsity Pattern1 Color Compact

Calculates once

f ′′(x) f ′′(x)S

Hessian

Comparative tests

Sparsity Pattern1 Color Compact Derive Compact Recover

Repeat for each x ∈ RnCalculates once

f ′′(x) f ′′(x)S

Hessian

Comparative tests

Sparsity Pattern1 Color Compact Derive Compact Recover

Repeat for each x ∈ RnCalculates once

f ′′(x) f ′′(x)S

Hessian

Comparative tests

Invests a large initial time in 1st run ⇒ fast subsequent runs.

Two different coloring methods with different recoveries: Starand Acyclic.

Hessian

Comparative tests

Invests a large initial time in 1st run ⇒ fast subsequent runs.

Two different coloring methods with different recoveries: Starand Acyclic.

Hessian

Comparative tests

Test set chosen from CUTE

n = 50′000.# colors

Name Pattern Star Acyclic

cosine B 1 3 2chainwoo B 2 3 3bc4 B 1 3 2cragglevy B 1 3 2pspdoc B 2 5 3scon1dls B 2 5 3morebv B 2 5 3augmlagn 5× 5 diagonal blocks 5 5lminsurf B 5 11 6brybnd B 5 13 7arwhead arrow 2 2nondquar arrow + B 1 4 3sinquad frame + diagonal 3 3bdqrtic arrow + B 3 8 5noncvxu2 irregular 12 7ncvxbqp1 irregular 12 7

Hessian

Comparative tests

Numeric Results edge pushing × Colouring methods

Star AcyclicName 1st 2nd 1st 2nd e p

cosine 9.93 0.16 9.68 2.52 0.15chainwoo 35.07 0.33 33.24 5.08 0.30bc4 10.02 0.25 10.00 2.56 0.25cragglevy 28.17 0.79 28.15 2.60 0.48pspdoc 10.31 0.35 10.27 4.39 0.23scon1dls 11.00 0.59 10.97 4.96 0.40morebv 10.36 0.46 10.33 4.49 0.35augmlagn 15.99 0.68 8.36 16.74 0.27lminsurf 9.30 1.01 9.24 3.89 0.35brybnd 11.87 2.44 11.73 12.63 1.68arwhead 176.50 0.16 45.86 0.24 0.20nondquar 166.59 0.18 28.64 2.57 0.12sinquad 606.72 0.26 888.57 1.51 0.32bdqrtic 262.64 1.34 96.87 7.80 0.80noncvxu2 29.69 1.10 29.27 7.76 0.28ncvxbqp1 13.51 2.42 – – 0.37

Averages 87.98 0.78 82.08 5.32 0.41

Variances 25 083.44 0.54 50 313.10 19.32 0.14

Hessian

Comparative tests

Graphical comparison: Star 2nd run versus edge pushing.

cosine

chainwoo

cragglevy

pspdoc

scon1dls

morebv

augmlagn

lminsurf

brybnd

arwhead

nondquar

sinquad

bdqrtic

noncvxu2

ncvxbqp1

edge pushing

Time (in seconds)

Functions

Hessian

Comparative tests

Summing up

Graph representation:

New algorithm.New perspective.Propagate from known contributions (nonlinear edges).

Algebraic representation:

New correctness.New algorithms.

edge pushing

Exploits the symmetry and sparsity.Promising test results.Lives up to Griewank 16th rule.

The calculation of gradients by nonincrementalreverse makes the corresponding computationalgraph symmetric, a property that should beexploited and maintained in accumulating Hes-sians.

Hessian

Comparative tests

Summing up

Graph representation:New algorithm.

New perspective.Propagate from known contributions (nonlinear edges).

edge pushing

Hessian

Comparative tests

Summing up

Graph representation:New algorithm.New perspective.

Propagate from known contributions (nonlinear edges).

edge pushing

Hessian

Comparative tests

Summing up

Graph representation:New algorithm.New perspective.Propagate from known contributions (nonlinear edges).

edge pushing

Hessian

Comparative tests

Summing up

edge pushing

Hessian

Comparative tests

Summing up

Algebraic representation:New correctness.

New algorithms.

edge pushing

Hessian

Comparative tests

Summing up

Algebraic representation:New correctness.New algorithms.

edge pushing

Hessian

Comparative tests

Summing up

edge pushing

Hessian

Comparative tests

Summing up

edge pushingExploits the symmetry and sparsity.

Promising test results.Lives up to Griewank 16th rule.

Hessian

Comparative tests

Summing up

edge pushingExploits the symmetry and sparsity.Promising test results.

Lives up to Griewank 16th rule.

Hessian

Comparative tests

Summing up

edge pushingExploits the symmetry and sparsity.Promising test results.Lives up to Griewank 16th rule.

Hessian

Comparative tests

HANK U

QUESTIONS?

Hessian

Comparative tests

HANK UQUESTIONS?

Calculating Hessian matrices - Télécom ParisTech · Calculating Hessian matrices Calculating...

Documents

Transcript of Calculating Hessian matrices - Télécom ParisTech · Calculating Hessian matrices Calculating...

Hessian-Free Optimization and its applications to Neural ...morrow/336_17/2016papers/joe.… · Hessian-Free Optimization and its applications to Neural Networks Joseph Christianson

Optimization Randomized Hessian estimation and directional ... · randomized Hessian estimation and prove some basic properties. In Section 3, we consider algorithmic applications

Formal Computational Skills Matrices 1. Overview Motivation: many mathematical uses eg Writing networks operations Solving linear equations Calculating.

LECTURE #3: Complex Hessian Matrices - asl.epfl.ch · 2 Lecture #3: Complex Hessian Matrices EE210B: Inference over Networks (A. H. Sayed) Reference Appendix B (Complex Hessian Matrices,

· Hessian Warp Winder, Hessian Weft Windcr, Sacking Warp Winder and Sacking Weft Winder. as . IV VI 11 111 TIC. Tk.620 —24 —980 Tk ... Weaving: Hessian Weaver, Sacking Weaver

ES-MAML: Hessian Free Meta Google Brain, Columbia ...

An Investigation into Neural Net Optimization via Hessian ...proceedings.mlr.press/v97/ghorbani19b/ghorbani19b.pdf · Hessian Eigenvalue Densities for n > 107 To understand the Hessian,

Blind Source Separation Using Hessian Evaluation

Projected Hessian and quasi-Newton FWI

Hessian recovery for finite element methods

Design Optimization Utilizing Gradient/Hessian Enhanced Surrogate Model

Optimal Estimation of Jacobian and Hessian …...MATHEMATICS OF COMPUTATION VOLUME 43. NUMBER 167 JULY 1984. PAGES 69-88 Optimal Estimation of Jacobian and Hessian Matrices That Arise

Numerical Issues Involved in Inverting Hessian Matrices · 2013-02-05 · Numerical Issues Involved in Inverting Hessian Matrices Jeff Gill and Gary King 6.1 INTRODUCTION In the social

Hessian-based Analysis of Large Batch Training and ...papers.nips.cc/paper/7743-hessian-based-analysis... · with large Hessian spectrum show poor robustness to adversarial perturbation.

Biology and Management of Hessian Fly in the Southeast · into Long Island, New York, in straw bedding used by Hessian soldiers during the Revolutionary War. Hessian fly prefers to

Hessian Clothes by Sonsvijay International Kolkata.ppsx

Learning Hessian matrix.pdf

flyer CMM software - UGent · Hessian Vibr ational Analysis (PHVA), Mobile Block Hessian (MBH), Vibrational Subsystem Analysis (VSA),... - Flexible definition of partition functions

Optimization Methods for Non-Linear/Non-Convex Learning … · Yann LeCun Making the Hessian better conditionedMaking the Hessian better conditioned The Hessian is the covariance

Deep Learning via Hessian-free Optimizationjmartens/docs/HF_talk.pdf · 2010-06-29 · Practical Considerations for 2nd-order optimization Hessian size problem for machine learning