
Computing Sparse Jacobians and Hessians Using Algorithmic Differentiation

Bradley M. Bell¹ and Kasper Kristensen²

¹ IHME, University of Washington, Seattle, USA, [email protected]

² DTU Aqua, Technical University of Denmark, DK, [email protected]

November 10, 2021

Abstract

Stochastic scientific models and machine learning optimization estimators have a large number of variables; hence computing large sparse Jacobians and Hessians is important. Algorithmic differentiation (AD) greatly reduces the programming effort required to obtain the sparsity patterns and values for these matrices. We present forward, reverse, and subgraph methods for computing sparse Jacobians and Hessians. Special attention is given to the subgraph method because it is new. The coloring and compression steps are not necessary when computing sparse Jacobians and Hessians using subgraphs. Complexity analysis shows that for some problems the subgraph method is expected to be much faster. We compare C++ operator overloading implementations of the methods in the ADOL-C and CppAD software packages using some of the MINPACK-2 test problems. The experiments are set up in a way that makes them easy to run on different hardware, different systems, different compilers, other test problems, and other AD packages. The setup time is the time to record the graph, compute sparsity, coloring, compression, and optimization of the graph. If the setup is necessary for each evaluation, the subgraph implementation has similar run times for sparse Jacobians and faster run times for sparse Hessians.

Keywords: automatic differentiation, sparse, minpack-2, cppad, adolc, c++

AMS classifications: 05C15, 65F50, 90C30

ACM classifications: F.2.2, G.2.2

arXiv:2111.05207v1 [cs.MS] 9 Nov 2021

1 Introduction

This paper concerns the computation of sparse Jacobians and Hessians for a function f : Rn → Rm that can be evaluated using a computer program. We use x ∈ Rn to denote the inputs and y ∈ Rm the outputs of the program.

1.1 The Computational Graph Representation

We use v ∈ Rℓ to denote all intermediate values that depend on the inputs

    v = (v1, . . . , vℓ)^T ∈ Rℓ .

The independent variable subvector (input vector) is

    x = (v1, . . . , vn)^T ∈ Rn .

The dependent variable subvector (output vector) is

    y = (v_{ℓ−m+1}, . . . , vℓ)^T ∈ Rm .

We decompose the computation into a sequence of elementary functions φk : R² → R. For k = n + 1 to ℓ,

vk = φk(va[k], vb[k])

where the positive integers a[k] and b[k] are the variable indices of the left and right operands for the k-th elementary function. It follows that a[k] < k and b[k] < k; i.e., computation of the k-th variable only depends on the independent variables and variables that have been computed previously. The variables are nodes in the computational graph. Each elementary operation corresponds to two arcs, one from variable va[k] to variable vk and the other from variable vb[k] to variable vk. We also write a[k] ≺ k and b[k] ≺ k, which means variable k directly depends on variables a[k] and b[k]. This is very similar to the notation in [Griewank and Walther, 2008, Section 2.2]. Some references call nodes a[k], b[k] predecessors of node k and node k a successor of a[k], b[k]; for example see [Gower and Mello, 2014], [Wang et al., 2016].

We represent a unary elementary function using a[k] = b[k] together with a binary function φk(va[k], vb[k]) that does not depend on its second argument. One example of a unary representation is φk(va[k], vb[k]) = sin(va[k]). Another example is φk(va[k], vb[k]) = va[k], which can be used to set yi = v_{ℓ−m+i} = v_{a[ℓ−m+i]}. It is not necessary to have the extra node for v_{ℓ−m+i} in the graph for this purpose but it makes this presentation simpler.
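For concreteness, the operation sequence above can be stored as a plain array of nodes. The following C++ sketch is only our illustration of such a representation (it is not the ADOL-C or CppAD data structure, and all the names are ours); it stores an operator code together with a[k] and b[k] for each node, using 0-based indices, and evaluates y with one forward pass.

    #include <cassert>
    #include <cmath>
    #include <vector>

    // one operator code per elementary function phi_k
    enum class Op { Input, Add, Mul, Sin };

    struct Node {
        Op     op;
        size_t a;   // index a[k] of the left operand,  a[k] < k (unused for Input)
        size_t b;   // index b[k] of the right operand, b[k] < k (b == a for unary ops)
    };

    // a computational graph with node.size() = ell variables;
    // the first n nodes are the inputs and the last m nodes are the outputs
    struct Graph {
        size_t            n, m;
        std::vector<Node> node;

        // zero order forward sweep: compute v_1, ..., v_ell from x
        std::vector<double> eval(const std::vector<double>& x) const {
            assert(x.size() == n);
            std::vector<double> v(node.size());
            for (size_t k = 0; k < node.size(); ++k) {
                const Node& nd = node[k];
                switch (nd.op) {
                    case Op::Input: v[k] = x[k];               break;
                    case Op::Add:   v[k] = v[nd.a] + v[nd.b];  break;
                    case Op::Mul:   v[k] = v[nd.a] * v[nd.b];  break;
                    case Op::Sin:   v[k] = std::sin(v[nd.a]);  break;
                }
            }
            return v;   // y_i = v[ell - m + i]
        }
    };

For Example 1.1 below, this graph has n = 3, m = 2 and nodes {Input, Input, Input, Add(0, 1), Mul(2, 3)} (0-based indices).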


Example 1.1 Consider the computer program defined by the following pseudocode:

    function y = f(x)
        (v1, v2, v3) = x
        v4 = v1 + v2
        v5 = v3 ∗ v4
        y = (v4, v5)

The corresponding function f : R3 → R2 is given by

    f1(x) = x1 + x2
    f2(x) = x3 (x1 + x2)

The following is a diagram of the corresponding computational graph:

[Figure: computational graph for Example 1.1. The independent variables v1, v2, v3 are source nodes; arcs labeled '+' go from v1 and v2 to v4, and arcs labeled '∗' go from v3 and v4 to v5.]

If the last line of the program pseudo code were changed to y = v5, then f : R3 → R, f(x) = x3(x1 + x2), and the diagram above would not change.

The dimension of the range of φk might be greater than one, and its domain might be greater than two. For example, in some operator overloaded AD tools it is possible to record the operations for one function and make that function an elementary operation in another function. This is closely related to checkpointing; see [Griewank and Walther, 2008, Tables 12.3, 12.4]. We only include binary elementary functions in order to simplify this presentation.

1.2 Dependency Relation

There may be a difference between sparsity patterns and dependency patterns. For example, suppose that

    φk(va[k], vb[k]) = 0 if va[k] ≤ 0, and 1 otherwise.


It follows that, where φk is differentiable, its derivative is zero. Some AD packages might return zero for the derivative everywhere while others might return (+∞, 0) when va[k] = 0. In any event, φk(va[k], vb[k]) depends on the value of va[k]. Hence, a dependency pattern might include more possibly non-zero elements than a Jacobian sparsity pattern. For the purpose of this presentation, we do not distinguish Jacobian sparsity patterns and dependency patterns. We use the notation a[k] ≺ k and b[k] ≺ k to denote the fact that vk directly depends on va[k] and vb[k]. Furthermore, we use ≺∗ to denote the transitive closure of this relation; e.g., if p ≺ q and q ≺ k then p ≺∗ k; see [Griewank and Walther, 2008, Section 2.2]. Note that the φk(va[k], vb[k]) defined above does not depend on vb[k]; i.e., it represents a unary function. In the text below, unary functions correspond to the case where φk does not depend on its second argument and a[k] = b[k].

2 Sparse Jacobians

In this section we present a forward algorithm, a reverse algorithm, and a reverse subgraph algorithm that compute Jacobian sparsity patterns. We also present an algorithm that obtains a sorted subgraph and discuss computing Jacobians and the computational complexity of the subgraph algorithms.

2.1 Forward Mode Sparsity

We use J to denote the subset of independent variable indices that are of interest, J ⊂ {1, . . . , n}. For the independent variable indices, k ∈ {1, . . . , n}, we define Xk to be the singleton {k} if k ∈ J, otherwise the empty set. For k > n we define Xk to be the indices of independent variables in J that the variable vk depends on; i.e.,

    Xk = {k}                         if k ≤ n and k ∈ J
         ∅                           if k ≤ n and k ∉ J
         {j : j ∈ J and j ≺∗ k}      otherwise.

The forward Jacobian sparsity calculation is given by Algorithm 2.1; see [Griewank and Walther, 2008, Eq. (7.4)]. Note that for unary functions, a[k] = b[k] and the union in the algorithm is not necessary. The sparsity pattern for the Jacobian f^(1)(x) is given by the observation that for i = 1, . . . , m, j ∈ J, and x ∈ Rn,

    [f_i^(1)(x)]_j ≠ 0  ⇒  j ∈ X_{ℓ−m+i} .


Algorithm 2.1 Forward Jacobian Sparsity

    {X1, . . . , Xℓ} = function(J)
    for k = 1, . . . , n
        if k ∈ J set Xk = {k}
        else set Xk = ∅
    for k = n + 1, . . . , ℓ
        set Xk = Xa[k] ∪ Xb[k]

We note that there are ℓ sets Xk and at most |J| elements in each of these sets, where |J| ≤ n is the number of elements in the set J.
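A minimal C++ sketch of Algorithm 2.1, reusing the hypothetical Graph structure from Section 1.1 (our illustration, not the CppAD or ADOL-C code), represents each Xk as a std::set of independent variable indices:

    #include <set>
    #include <vector>
    // assumes the Graph / Node / Op sketch from Section 1.1

    // forward Jacobian sparsity: X[k] = indices j in J such that v_k depends on x_j
    std::vector<std::set<size_t>> forward_jac_sparsity(
        const Graph& g, const std::set<size_t>& J)
    {
        std::vector<std::set<size_t>> X(g.node.size());
        for (size_t k = 0; k < g.node.size(); ++k) {
            if (k < g.n) {                       // independent variable
                if (J.count(k)) X[k] = {k};
            } else {                             // v_k = phi_k(v_a, v_b)
                const Node& nd = g.node[k];
                X[k] = X[nd.a];
                X[k].insert(X[nd.b].begin(), X[nd.b].end());
            }
        }
        return X;   // the i-th Jacobian row pattern is X[ell - m + i]
    }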

2.2 Reverse Mode Sparsity

We use I to denote the subset of dependent variable indices that are of interest, I ⊂ {1, . . . , m}. For the corresponding variable indices k ∈ {ℓ − m + 1, . . . , ℓ} we define Yk to be the singleton {k − ℓ + m} if k − ℓ + m ∈ I, otherwise the empty set. For k ≤ ℓ − m we define Yk to be the indices of dependent variables in I that are affected by vk; i.e.,

    Yk = {k − ℓ + m}                      if k > ℓ − m and k − ℓ + m ∈ I
         ∅                                if k > ℓ − m and k − ℓ + m ∉ I
         {i ∈ I : k ≺∗ ℓ − m + i}         otherwise.

We use ∪= for the operator that sets the left hand side to the union of its previous value and the right hand side. The reverse Jacobian sparsity calculation is given by Algorithm 2.2; see [Griewank and Walther, 2008, Eq. (7.8)]. Note that for unary functions, a[k] = b[k] and the second union in the algorithm is not necessary. The sparsity pattern for the Jacobian f^(1)(x) is given by the observation that for i ∈ I, j = 1, . . . , n, and x ∈ Rn,

    [f_i^(1)(x)]_j ≠ 0  ⇒  i ∈ Yj .

We note that there are ℓ sets Yk and at most |I| elements in each of these sets, where |I| ≤ m is the number of elements in the set I.
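The corresponding sketch for Algorithm 2.2 (same assumptions as the previous sketch) propagates the sets from the dependent variables back toward the independent variables:

    #include <set>
    #include <vector>
    // assumes the Graph / Node / Op sketch from Section 1.1

    // reverse Jacobian sparsity: Y[k] = indices i in I such that y_i depends on v_k
    std::vector<std::set<size_t>> reverse_jac_sparsity(
        const Graph& g, const std::set<size_t>& I)
    {
        size_t ell = g.node.size();
        std::vector<std::set<size_t>> Y(ell);
        for (size_t i = 0; i < g.m; ++i)        // seed the dependent variables
            if (I.count(i)) Y[ell - g.m + i] = {i};
        for (size_t k = ell; k-- > g.n; ) {     // k = ell-1, ..., n (0-based)
            const Node& nd = g.node[k];
            Y[nd.a].insert(Y[k].begin(), Y[k].end());
            Y[nd.b].insert(Y[k].begin(), Y[k].end());
        }
        return Y;   // the j-th Jacobian column pattern is Y[j] for j < n
    }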

2.3 Reverse Subgraph Sparsity

We call this a reverse subgraph method because on each pass through the graph it only visits the nodes that affect a selected dependent variable. (A


Algorithm 2.2 Reverse Jacobian Sparsity

    {Y1, . . . , Yℓ} = function(I)
    for k = ℓ, . . . , ℓ − m + 1
        if k − ℓ + m ∈ I set Yk = {k − ℓ + m}
        else set Yk = ∅
    for k = ℓ − m, . . . , 1
        set Yk = ∅
    for k = ℓ, . . . , n + 1
        set Ya[k] ∪= Yk
        set Yb[k] ∪= Yk

forward subgraph method would only visit nodes that are affected by a selected independent variable.) We present two versions of the algorithm. The first is the ‘pattern only’ variant which finds the sparsity pattern of the Jacobian matrix. The second variant is an extension that finds the actual subgraph required to calculate the non-zero values of the Jacobian.

We use I ⊂ {1, . . . , m} to denote the subset of dependent variable indices that are of interest. We use J ⊂ {1, . . . , n} to denote the subset of independent variable indices that are of interest. The set Si accumulates the independent variables in J that affect the dependent variable yi = v_{ℓ−m+i}. The stack K contains the nodes in the subgraph that have not yet been processed for this dependent variable index i. Let {Xk} be the output corresponding to Algorithm 2.1 with input J. It is not necessary to have the sequence of sets {Xk}, just the integer sequence {ck} initialized by

    if Xk = ∅ set ck = m + 1 else set ck = 0 .

The value ck = i < m + 1 is used to indicate that node k has already been processed for this dependent variable index i. The reverse subgraph Jacobian sparsity calculation is defined by Algorithm 2.3.

The input value for the sequence {ck} can be computed using Algorithm 2.1 with the sets Xk replaced by integer values ck that are m + 1 for empty Xk and 0 for non-empty Xk. The complexity of this calculation is O(ℓ), where ℓ is the number of nodes in the graph. Note that for unary functions, a[k] = b[k] and the condition for the second if block in the algorithm is false. The sparsity pattern for the Jacobian f^(1)(x) is given by the observation that for i ∈ I, j ∈ J, and x ∈ Rn,

    [f_i^(1)(x)]_j ≠ 0  ⇒  j ∈ Si .


Algorithm 2.3 Reverse Subgraph Jacobian: Return Sparsity Pattern

    {Si : i ∈ I} = function(I, c1, . . . , cℓ)
    for i ∈ I
        set done = i, ignore = m + 1
        set K = ∅
        set Si = ∅
        push ℓ − m + i into K
        while K ≠ ∅
            pop k from K
            if ca[k] ∉ {done, ignore}
                set ca[k] = done
                if a[k] ≤ n set Si ∪= {a[k]}
                else push a[k] into K
            if cb[k] ∉ {done, ignore}
                set cb[k] = done
                if b[k] ≤ n set Si ∪= {b[k]}
                else push b[k] into K

We note that there are |I| sets Si, at most |J| elements in each of these sets, and |I| ≤ m, |J| ≤ n. We contrast this with the forward and reverse mode sparsity patterns which have ℓ sets; i.e., a set for each node in the graph.

Remark 2.1 Algorithm 2.3 is a non-recursive depth-first search in the reverse direction from each dependent variable. It uses a mark vector c, of already processed nodes, to avoid multiple placements of the same node in the stack K. The mark vector need not be cleared between consecutive searches because the dependent variable index is used as a unique mark for each search. This is important when the subgraph is much smaller than the full graph. The ‘pop k from K’ in the algorithm could be replaced by extracting any element from K and the algorithm would still work. However, extracting the most recently pushed element (the pop) reduces the worst case space requirement for K from ℓ, the size of the full graph, to the maximum vertex depth of the graph.
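A C++ sketch of Algorithm 2.3 follows (our illustration; the CppAD implementation mentioned in Section 2.5 is organized differently). It uses a 0-based encoding of the marks: ignore is m and 'not yet visited' is −1, so the dependent variable index i itself serves as the done mark.

    #include <set>
    #include <stack>
    #include <vector>
    // assumes the Graph / Node / Op sketch from Section 1.1

    // Reverse subgraph Jacobian sparsity (pattern only).  On input, c[k] must
    // equal 'ignore' (= m) when v_k depends on no x_j with j in J, and -1
    // otherwise; this plays the role of the initialization of {c_k} above.
    std::vector<std::set<size_t>> subgraph_jac_sparsity(
        const Graph& g, const std::set<size_t>& I, std::vector<int> c)
    {
        const int ignore = static_cast<int>(g.m);
        std::vector<std::set<size_t>> S(g.m);
        for (size_t i : I) {
            const int done = static_cast<int>(i);   // unique mark for this search
            std::stack<size_t> K;
            K.push(g.node.size() - g.m + i);
            while (!K.empty()) {
                size_t k = K.top();
                K.pop();
                const size_t args[2] = { g.node[k].a, g.node[k].b };
                for (size_t arg : args) {
                    if (c[arg] == done || c[arg] == ignore)
                        continue;                     // already handled for this i
                    c[arg] = done;
                    if (arg < g.n) S[i].insert(arg);  // independent variable
                    else           K.push(arg);       // interior node, keep searching
                }
            }
        }
        return S;
    }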

2.4 Reverse Subgraph Sorting

Algorithm 2.3 computes the sparsity patterns {Si : i ∈ I}. For each i, the subgraph corresponding to the node order of the pops from K would have to


be sorted in dependency order to be used for computations. For example, to calculate reverse mode derivatives along the subgraph a dependency order is required. In a worst case scenario, the subgraphs are as big as the full graph and the cost of a reverse sweep is O(ℓ). The cost of sorting the subgraph is O(ℓ × log(ℓ)). Hence, the sorting could asymptotically become expensive compared to other parts of the algorithm. It is therefore relevant to avoid the sort. With a minor modification of Algorithm 2.3 we can directly obtain a dependency sorted subgraph. If v_{d[1]}, . . . , v_{d[|G|]} is the ordering for a subgraph G, we say that G is dependency sorted or topologically sorted if

    v_{d[p]} ≺∗ v_{d[q]}  ⇒  p < q .

The main change to the algorithm is as follows: when the top of the stack is a node for which all the nodes it depends on are already in the subgraph, the top node of the stack is moved to the end of the subgraph. A new mark −i is used to signify that a node has been moved to the subgraph. The result is that the nodes in the subgraph are always dependency sorted.

Remark 2.2 Algorithm 2.4 can be used to find all the subgraphs, one by one, as needed to perform a full Jacobian calculation. When a Gi has been used for a sweep it is not needed anymore and can be discarded. Note that Gi is only guaranteed to be dependency sorted. In particular, one cannot assume that the independent variable indices (node indices ≤ n) in the sparsity pattern of the i-th row of the Jacobian are at the beginning of Gi.

2.5 Computing Jacobians

Once a sparsity pattern for a Jacobian is available, a row (column) compression technique can be used by reverse mode (forward mode) to compute multiple rows (columns) of the Jacobian during one pass of the computational graph. This requires an approximate solution of a graph coloring problem; see [Griewank and Walther, 2008, Eq. 8.6 and Section 8.3], [Coleman and Verma, 1998].

The reverse subgraph Jacobian method does not require coloring or a compression step. In this way it is similar to the edge pushing algorithm; see [Petra et al., 2018], [Gower and Mello, 2014]. Reference [Gower and Mello, 2014] defines the apex-induced subgraph corresponding to a fixed node as the nodes that the fixed node depends on (plus the corresponding arcs). The subgraphs in this paper are apex-induced subgraphs corresponding to dependent variables. The subgraph Jacobian calculation method for a selected dependent variable is the same as for a normal reverse pass, except that only


Algorithm 2.4 Reverse Subgraph Jacobian: Return Sorted Subgraph

    {Gi : i ∈ I} = function(I, c1, . . . , cℓ)
    for i ∈ I
        set visited = i, done = −i, ignore = m + 1
        set Gi = ∅, K = ∅
        push ℓ − m + i into K
        while K ≠ ∅
            set k = top(K)
            comment If all inputs to node k are done, move k to Gi
            if ca[k] ∈ {done, ignore} and cb[k] ∈ {done, ignore}
                pop k from K
                push k into Gi
                set ck = done
            else
                comment vmax may depend on vmin
                for ν = max(a[k], b[k]), min(a[k], b[k])
                    if cν ∉ {visited, done, ignore}
                        if ν ≤ n
                            push ν into Gi
                            set cν = done
                        else
                            push ν into K
                            set cν = visited


the subset of nodes that affect a selected dependent variable yi are processed. The subgraph and normal reverse mode calculations are so similar that CppAD uses iterators to implement both calculations with the same code; see [Bell, 2020].
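To connect Section 2.4 and Section 2.5, the sketch below (again our illustration, with the partial derivatives of the few operators of the hypothetical Graph structure hard coded) takes one dependency sorted subgraph Gi, as produced by Algorithm 2.4 and assumed to end with the dependent node ℓ − m + i, and runs a reverse sweep restricted to those nodes to obtain the non-zero entries of row i of the Jacobian.

    #include <cmath>
    #include <map>
    #include <vector>
    // assumes the Graph / Node / Op sketch from Section 1.1

    // reverse sweep over one dependency sorted subgraph G_i
    // v   : zero order values for all nodes (from Graph::eval)
    // G_i : node indices, dependency sorted, ending with the dependent node ell-m+i
    // returns the non-zero entries (j, d y_i / d x_j) of row i of the Jacobian
    std::map<size_t, double> jacobian_row(
        const Graph& g, const std::vector<double>& v, const std::vector<size_t>& G_i)
    {
        std::map<size_t, double> bar;          // adjoints, only for subgraph nodes
        bar[G_i.back()] = 1.0;                 // seed the selected dependent variable
        for (size_t idx = G_i.size(); idx-- > 0; ) {
            size_t k = G_i[idx];
            if (k < g.n || bar.count(k) == 0) continue;
            double bk = bar[k];
            const Node& nd = g.node[k];
            double d1 = 0.0, d2 = 0.0;         // partials of phi_k
            switch (nd.op) {
                case Op::Add:   d1 = 1.0;              d2 = 1.0;     break;
                case Op::Mul:   d1 = v[nd.b];          d2 = v[nd.a]; break;
                case Op::Sin:   d1 = std::cos(v[nd.a]); d2 = 0.0;    break;
                case Op::Input: break;
            }
            bar[nd.a] += d1 * bk;
            if (nd.b != nd.a) bar[nd.b] += d2 * bk;
        }
        std::map<size_t, double> row;
        for (size_t j = 0; j < g.n; ++j)
            if (bar.count(j)) row[j] = bar[j];
        return row;
    }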

2.6 Complexity of the Subgraph Methods

The asymptotic complexity of the subgraph algorithms 2.3 and 2.4 is easy to assess. Both algorithms loop across subgraphs and nodes in each subgraph. For each subgraph, each subgraph node is visited once in Algorithm 2.3 and twice in Algorithm 2.4. The work per node is bounded by a constant independent of the computational graph because there are no more than two arguments (incoming edges) and one result (variable) for each node. It follows that the complexity of both algorithms is

    O( Σ_{i∈I} |Gi| ) + O(ℓ) ,

whether computing patterns only or numerical entries of the Jacobian. The set I contains the dependent variables of interest and |I| is less than or equal to m. The term O(ℓ) is for the complexity of initializing the sequence c1, . . . , cℓ. If few dependent variables are of interest, the term Σ_{i∈I} |Gi| could be less than ℓ, the number of nodes in the entire graph.

The formula above tells us that, for a given problem, the efficiency of the subgraph method is not directly determined by the sparsity of the Jacobian matrix. What really matters is the amount of overlap between the subgraphs. The less overlap (the smaller |Gi ∩ Gj|), the faster the algorithms. This is in contrast with the graph coloring approach, for which the efficiency is determined by the sparsity pattern of the Jacobian. We highlight these differences by two simple examples:

Example 2.1 The Jacobian f^(1)(x) for this example is a dense matrix. Let A be a random n-by-n matrix and consider the matrix vector multiplication function f(x) = Ax. The computational graph essentially consists of n² multiply-add instructions. The size of the full graph is thus O(n²). The size of the i-th subgraph is O(n) because yi is only affected by row i of A. The time to calculate the Jacobian by each of the methods using n reverse sweeps is:


    Method           Bound
    n full sweeps    O(n³)
    Coloring         O(n³)
    Algorithm 2.3    O(n² log(n))
    Algorithm 2.4    O(n²)

Clearly, no method can be faster than O(n²) for this problem. Although this example was chosen to show the benefit of the subgraph methods, it demonstrates that there exist non-sparse problems where Algorithm 2.4 is asymptotically optimal.
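A sketch of where these bounds come from, under the standard assumption that one sweep over a graph or subgraph with N nodes costs O(N):

    \begin{align*}
    \text{$n$ full sweeps:} \quad & n \cdot O(\ell) = n \cdot O(n^2) = O(n^3) \\
    \text{Coloring:}        \quad & \text{the Jacobian is dense, so $n$ colors are needed:}\;
                                    n \cdot O(n^2) = O(n^3) \\
    \text{Algorithm 2.3:}   \quad & \sum_{i=1}^{n} O(|G_i| \log |G_i|) = O(n^2 \log n)
        \quad \text{(each subgraph is sorted before its sweep; see Section 2.4)} \\
    \text{Algorithm 2.4:}   \quad & \sum_{i=1}^{n} O(|G_i|) + O(\ell) = O(n^2)
    \end{align*}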

Example 2.2 The Jacobian f^(1)(x) for this example is the identity matrix plus a matrix that is zero except for column n. Suppose that for k = 1, . . . , n, φk is a unary function (in the sense of this paper) and

    v_{n+k} = φk(v_{n+k−1}, v_{n+k−1}) .

Recall that vn = xn and define f : Rn → Rn by

    yk = v_{2n} + xk ,

for k = 1, . . . , n. The computational graph of f and all its subgraphs are of size O(n). The sparsity pattern of the Jacobian is {n} for the n-th row and {n, k} for rows k = 1, . . . , n − 1. This Jacobian can be recovered using a combination of one forward pass and one reverse pass. The asymptotic complexity of the different methods is:

    Method           Bound
    n full sweeps    O(n²)
    Coloring         O(n)
    Algorithm 2.3    O(n² log(n))
    Algorithm 2.4    O(n²)

The subgraph method is inefficient for this problem because of the high subgraph overlap. The coloring approach is most efficient for this example because of the special sparsity pattern.

Examples 2.1 and 2.2 are also useful for testing that implementations of a subgraph algorithm scale as expected for various values of n.


3 Sparse Hessians

In this section we consider computing the sparsity pattern for the Hessian of

    g(x) = Σ_{i=1}^{m} wi fi(x) ,                                        (1)

where w ∈ Rm. We use I to denote the dependent variable indices for which wi ≠ 0; i.e.,

    i ∈ I  ⇔  i ∈ {1, . . . , m} and wi ≠ 0 .                            (2)

In addition, we use J to denote the subset of independent variables that are of interest, J ⊂ {1, . . . , n}. Let {Xk} be the output corresponding to Algorithm 2.1 with input J. Let {Yk} be the output corresponding to Algorithm 2.2 with input I. It is not necessary to have the sequence of sets {Yk}, just the reverse mode activity analysis Boolean sequence {dk} defined by

    dk = (Yk ≠ ∅) .

Here and below ∂p h(u) is the partial of h with respect to the p-th component of its argument vector evaluated at u. We use ∂p,q to abbreviate ∂p ∂q. We say that the elementary function φk is left nonlinear if ∂1,1 φk(u) is possibly non-zero for some u. It is right nonlinear if ∂2,2 φk(u) is possibly non-zero for some u. It is jointly nonlinear if ∂1,2 φk(u) is possibly non-zero for some u. We assume that ∂1,2 φk(u) = ∂2,1 φk(u).
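As an illustration (this classification is ours and is not tied to any particular AD package), the left, right, and jointly nonlinear flags for a few common elementary functions are:

    // nonlinearity flags for an elementary function phi(u1, u2)
    struct Nonlinear {
        bool left;    // d^2 phi / d u1^2    can be non-zero
        bool right;   // d^2 phi / d u2^2    can be non-zero
        bool joint;   // d^2 phi / d u1 d u2 can be non-zero
    };

    // phi(u1,u2) = u1 + u2 : linear, no second order terms
    constexpr Nonlinear add_flags = {false, false, false};
    // phi(u1,u2) = u1 * u2 : d^2/du1du2 = 1, the pure second partials vanish
    constexpr Nonlinear mul_flags = {false, false, true};
    // phi(u1,u2) = sin(u1) : unary, only the left second partial can be non-zero
    constexpr Nonlinear sin_flags = {true,  false, false};
    // phi(u1,u2) = u1 / u2 : right and jointly nonlinear, linear in u1 alone
    constexpr Nonlinear div_flags = {false, true,  true};
    // phi(u1,u2) = pow(u1, u2) = u1^u2 : nonlinear in all three senses
    constexpr Nonlinear pow_flags = {true,  true,  true};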

3.1 Forward Mode Sparsity

Forward mode for function, derivative, and Hessian values starts with the zero, first, and second order values for the independent variables; i.e., xj, \dot x_j, \ddot x_j for j = 1, . . . , n. It computes the zero, first, and second order values for the other variables using the following equations for k = n + 1, . . . , ℓ:

    vk = φk(va[k], vb[k])
    \dot v_k = ∂1 φk(va[k], vb[k]) \dot v_{a[k]} + ∂2 φk(va[k], vb[k]) \dot v_{b[k]}
    \ddot v_k = ∂1 φk(va[k], vb[k]) \ddot v_{a[k]} + ∂2 φk(va[k], vb[k]) \ddot v_{b[k]}
              + ∂1,1 φk(va[k], vb[k]) \dot v_{a[k]}² + ∂2,2 φk(va[k], vb[k]) \dot v_{b[k]}²
              + 2 ∂1,2 φk(va[k], vb[k]) \dot v_{a[k]} \dot v_{b[k]} .                    (3)

The forward mode Hessian sparsity calculation is defined by Algorithm 3.1. This is similar to the algorithm [Walther, 2008, Algorithm II]. One difference is using {dk} to avoid the ‘dead end’ nodes mentioned in the reference


and nodes that are not included in the Hessian because the corresponding wi is zero; see Eq 1 and Eq 2. This is probably the reason that the adolc implementation has a larger nnz (more possibly non-zero values) than the other implementations in Table 5 and Table 7.

The set Nj, in the algorithm, accumulates the nonlinear interactions between the j-th independent variable and other independent variables. The

Algorithm 3.1 Forward Hessian Sparsity

    {N1, . . . , Nn} = function(X1, d1, . . . , Xℓ, dℓ)
    for j = 1, . . . , n set Nj = ∅
    for k = n + 1, . . . , ℓ if dk
        if φk is left nonlinear
            for j ∈ Xa[k] set Nj ∪= Xa[k]
        if φk is right nonlinear
            for j ∈ Xb[k] set Nj ∪= Xb[k]
        if φk is jointly nonlinear
            for j ∈ Xa[k] set Nj ∪= Xb[k]
            for j ∈ Xb[k] set Nj ∪= Xa[k]

nonlinear interactions are initialized as empty. This corresponds to the second order values for the independent variables being zero; i.e., \ddot x = 0 ∈ Rn. In the case where φk is jointly nonlinear and left nonlinear, the algorithm [Walther, 2008, Algorithm II] uses the fact that Xk = Xa[k] ∪ Xb[k] to combine two of the unions into one. A similar optimization is done for the case where φk is jointly nonlinear and right nonlinear.

The sparsity pattern for the Hessian g^(2)(x) is given by the observation that for j ∈ J, p ∈ J, and x ∈ Rn,

    [g^(2)(x)]_{j,p} ≠ 0  ⇒  p ∈ Nj .

Given the second order forward mode equation for \ddot v_k in Eq 3, a proof for this assertion would be similar to the proof for Algorithm 3.2. The Boolean vector d has length ℓ. There are ℓ sets Xk and at most |J| elements in each of these sets. There are n sets Nj and at most |J| elements in each of these sets.
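A C++ sketch of Algorithm 3.1 (our illustration), reusing the hypothetical Graph and Nonlinear sketches above together with a forward sparsity output X from Algorithm 2.1 and a reverse activity vector d:

    #include <set>
    #include <vector>
    // assumes the Graph / Node / Op and Nonlinear sketches above

    // map each operator of the Graph sketch to its nonlinearity flags
    inline Nonlinear nonlinear(Op op) {
        switch (op) {
            case Op::Add: return {false, false, false};
            case Op::Mul: return {false, false, true };
            case Op::Sin: return {true,  false, false};
            default:      return {false, false, false};   // Op::Input
        }
    }

    // forward Hessian sparsity for g(x) = sum_i w_i f_i(x)
    // X[k] : forward Jacobian sparsity of v_k (Algorithm 2.1)
    // d[k] : true if some y_i with w_i != 0 depends on v_k (reverse activity)
    std::vector<std::set<size_t>> forward_hes_sparsity(
        const Graph&                         g,
        const std::vector<std::set<size_t>>& X,
        const std::vector<bool>&             d)
    {
        std::vector<std::set<size_t>> N(g.n);   // N[j]: interactions with x_j
        for (size_t k = g.n; k < g.node.size(); ++k) {
            if (!d[k]) continue;                // dead end or w_i = 0, skip
            const Node&     nd = g.node[k];
            const Nonlinear nl = nonlinear(nd.op);
            const std::set<size_t>& Xa = X[nd.a];
            const std::set<size_t>& Xb = X[nd.b];
            if (nl.left)
                for (size_t j : Xa) N[j].insert(Xa.begin(), Xa.end());
            if (nl.right)
                for (size_t j : Xb) N[j].insert(Xb.begin(), Xb.end());
            if (nl.joint) {
                for (size_t j : Xa) N[j].insert(Xb.begin(), Xb.end());
                for (size_t j : Xb) N[j].insert(Xa.begin(), Xa.end());
            }
        }
        return N;   // [g''(x)]_{j,p} != 0  implies  p in N[j]
    }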


3.2 Reverse Mode Sparsity

The reverse mode Hessian sparsity calculation is defined by Algorithm 3.2. This is similar to the table [Griewank and Walther, 2008, Table 7.4], but includes more general nonlinear binary functions; e.g., pow(x, y) = x^y. In addition, the algorithm and proof show how to extend the algorithm to functions with more than two arguments.

Algorithm 3.2 Reverse Hessian Sparsity

    {M1, . . . , Mℓ} = function(X1, d1, . . . , Xℓ, dℓ)
    for k = 1, . . . , ℓ set Mk = ∅
    for k = ℓ, . . . , n + 1 if dk
        set Ma[k] ∪= Mk
        set Mb[k] ∪= Mk
        if φk is left nonlinear
            set Ma[k] ∪= Xa[k]
            set Mb[k] ∪= Xa[k]
        if φk is right nonlinear
            set Ma[k] ∪= Xb[k]
            set Mb[k] ∪= Xb[k]
        if φk is jointly nonlinear
            set Ma[k] ∪= Xb[k]
            set Mb[k] ∪= Xa[k]

As with Algorithm 3.1, when φk is both left nonlinear and jointly nonlinear (or right nonlinear and jointly nonlinear), two of the unions in Algorithm 3.2 can be combined into one. We include a theorem and proof for this algorithm below.
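Before stating the theorem, here is a C++ sketch of Algorithm 3.2 itself (our illustration, same assumptions as the sketch of Algorithm 3.1):

    #include <set>
    #include <vector>
    // assumes the Graph / Node / Op, Nonlinear, and nonlinear(Op) sketches above

    // reverse Hessian sparsity for g(x) = sum_i w_i f_i(x)
    // X[k] : forward Jacobian sparsity of v_k (Algorithm 2.1)
    // d[k] : true if some y_i with w_i != 0 depends on v_k (reverse activity)
    std::vector<std::set<size_t>> reverse_hes_sparsity(
        const Graph&                         g,
        const std::vector<std::set<size_t>>& X,
        const std::vector<bool>&             d)
    {
        std::vector<std::set<size_t>> M(g.node.size());
        for (size_t k = g.node.size(); k-- > g.n; ) {
            if (!d[k]) continue;
            const Node&     nd = g.node[k];
            const Nonlinear nl = nonlinear(nd.op);
            // linear part: pass the set of node k down to its operands
            M[nd.a].insert(M[k].begin(), M[k].end());
            M[nd.b].insert(M[k].begin(), M[k].end());
            // nonlinear parts
            if (nl.left) {
                M[nd.a].insert(X[nd.a].begin(), X[nd.a].end());
                M[nd.b].insert(X[nd.a].begin(), X[nd.a].end());
            }
            if (nl.right) {
                M[nd.a].insert(X[nd.b].begin(), X[nd.b].end());
                M[nd.b].insert(X[nd.b].begin(), X[nd.b].end());
            }
            if (nl.joint) {
                M[nd.a].insert(X[nd.b].begin(), X[nd.b].end());
                M[nd.b].insert(X[nd.a].begin(), X[nd.a].end());
            }
        }
        return M;   // [g''(x)]_{j,p} != 0  implies  p in M[j] for j < n
    }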

Theorem 3.1 For j ∈ J , p ∈ J , and x ∈ Rn,

    [g^(2)(x)]_{j,p} ≠ 0  ⇒  p ∈ Mj .

Proof: We define the sequence of scalar valued functions Fℓ, . . . , Fn by

    Fℓ(v1, . . . , vℓ) = Σ_{i=1}^{m} wi v_{ℓ−m+i} ,


and for k = ℓ, . . . , n + 1,

    F_{k−1}(v1, . . . , v_{k−1}) = Fk[v1, . . . , v_{k−1}, φk(va[k], vb[k])] .

The function Fn(x) is the same as g(x). Reverse mode computes the derivatives of Fk with respect to its arguments for k = ℓ − 1, . . . , n. The derivative of Fn with respect to its arguments is equal to g^(1)(x) and is the final value of \bar x = (\bar v_1, . . . , \bar v_n) in the algorithm below. We use += for the operator that sets the left hand side to the sum of its previous value and the right hand side.

    for k = 1, . . . , ℓ − m set \bar v_k = 0
    for i = 1, . . . , m set \bar v_{ℓ−m+i} = wi
    for k = ℓ, . . . , n + 1
        set \bar v_{a[k]} += ∂1 φk(va[k], vb[k]) \bar v_k
        set \bar v_{b[k]} += ∂2 φk(va[k], vb[k]) \bar v_k

Differentiating the algorithm above with respect to x, and using the forward mode equation for \dot v_k in Eq 3, we obtain

    for k = 1, . . . , ℓ set \dot{\bar v}_k = 0
    for k = ℓ, . . . , n + 1
        comment differentiate the setting of \bar v_{a[k]}
        set \dot{\bar v}_{a[k]} += ∂1 φk(va[k], vb[k]) \dot{\bar v}_k
        set \dot{\bar v}_{a[k]} += ∂1,1 φk(va[k], vb[k]) \bar v_k \dot v_{a[k]}
        set \dot{\bar v}_{a[k]} += ∂1,2 φk(va[k], vb[k]) \bar v_k \dot v_{b[k]}
        comment differentiate the setting of \bar v_{b[k]}
        set \dot{\bar v}_{b[k]} += ∂2 φk(va[k], vb[k]) \dot{\bar v}_k
        set \dot{\bar v}_{b[k]} += ∂1,2 φk(va[k], vb[k]) \bar v_k \dot v_{a[k]}
        set \dot{\bar v}_{b[k]} += ∂2,2 φk(va[k], vb[k]) \bar v_k \dot v_{b[k]}

Suppose that in Eq 3, \dot x is the j-th elementary vector. It follows that \dot{\bar x}_p = ∂p,j g(x) for p = 1, . . . , n. We claim that at the beginning of iteration k, in the algorithm above and in Algorithm 3.2, for p = 1, . . . , k,

    \dot{\bar v}_p ≠ 0  ⇒  j ∈ Mp .

Proving this claim will complete the proof of the theorem. For the first iteration, k = ℓ and \dot{\bar v}_p = 0 for all p. Hence the claim is true for k = ℓ.


Suppose the claim is true at the beginning of the k-th iteration; it suffices to show it is true at the beginning of iteration k − 1. If p ≠ a[k] and p ≠ b[k] then \dot{\bar v}_p and Mp are the same at the beginning of iterations k and k − 1, so we are done. The two cases p = a[k] and p = b[k] are symmetric. It suffices to show the case p = a[k]; i.e., at the end of iteration k,

    \dot{\bar v}_{a[k]} ≠ 0  ⇒  j ∈ Ma[k] .

If \dot{\bar v}_{a[k]} ≠ 0 at the beginning of iteration k then by induction j ∈ Ma[k] at the beginning of iteration k and, by Algorithm 3.2, it is also true at the end of iteration k.

Consider the remaining case where \dot{\bar v}_{a[k]} = 0 at the beginning of iteration k and \dot{\bar v}_{a[k]} ≠ 0 at the end of iteration k. This implies that the right hand side was non-zero in one of the three assignments to \dot{\bar v}_{a[k]} above. This in turn implies that dk is true (otherwise \bar v_k and \dot{\bar v}_k would be zero). Suppose the first assignment to \dot{\bar v}_{a[k]} is non-zero,

    0 ≠ ∂1 φk(va[k], vb[k]) \dot{\bar v}_k .

This implies that \dot{\bar v}_k ≠ 0 which, by induction, implies that j ∈ Mk which, by Algorithm 3.2, implies j ∈ Ma[k] at the end of iteration k. This completes the case where the first assignment to \dot{\bar v}_{a[k]} is non-zero.

Suppose the second assignment to \dot{\bar v}_{a[k]} is non-zero,

    0 ≠ ∂1,1 φk(va[k], vb[k]) \bar v_k \dot v_{a[k]} .

This implies that all three terms in the product on the right hand side are non-zero. Hence \dot v_{a[k]} = ∂j va[k](x) is non-zero and j ∈ Xa[k]. Furthermore, φk is left nonlinear. Hence, at the end of iteration k, Xa[k] ⊂ Ma[k] and j ∈ Ma[k]. This completes the case where the second assignment to \dot{\bar v}_{a[k]} is non-zero.

Suppose the third assignment to \dot{\bar v}_{a[k]} is non-zero,

    0 ≠ ∂1,2 φk(va[k], vb[k]) \bar v_k \dot v_{b[k]} .

This implies that all three terms in the product on the right hand side are non-zero. Hence \dot v_{b[k]} = ∂j vb[k](x) is non-zero and j ∈ Xb[k]. Furthermore, φk is jointly nonlinear. Hence, at the end of iteration k, Xb[k] ⊂ Ma[k] and j ∈ Ma[k]. This completes the case where the third assignment to \dot{\bar v}_{a[k]} is non-zero. Q.E.D.

Remark 3.1 The edge pushing algorithm [Gower and Mello, 2014, Algorithm 3.1] is an alternative reverse mode algorithm for computing Hessian


sparsity patterns. It is fundamentally different from the reverse mode Hessian sparsity algorithm in this paper because it does not use first order forward mode sparsity patterns to reduce the computed index set sizes to the number of independent variables that are of interest.

3.3 Subgraph Sparsity

We are given a function f : Rn → Rm and define g : Rn → R by

    g(x) = Σ_{i=1}^{m} wi fi(x) .

Using reverse mode we can compute g^(1)(x) in O(ℓ) operations. In addition we can obtain the corresponding computational graph; for example see [Kristensen et al., 2016, Tape T2 on p. 7], [Wang et al., 2016, Section 3.1]. We use this computational graph to define the function h : Rn → Rn,

    h(x) = g^(1)(x) = Σ_{i=1}^{m} wi f_i^(1)(x) .

The Jacobian of h(x) is the Hessian of g(x). We can apply the reverse subgraph Jacobian algorithm to the computational graph for h(x) to obtain the sparsity pattern for the Hessian g^(2)(x).

3.4 Computing Hessians

Once a sparsity pattern for a Hessian is available, the values in the Hessian are computed in a manner similar to how the Jacobians are calculated; see Section 2.5.

4 Experiments

The speed tests reported below were run using the following hardware and software:


    processor:          i5-3470 3.2GHz 64bit
    operating system:   fedora 34
    Linux kernel:       5.13.6
    compiler:           clang 12.0.0
    memory:             8GB
    disk:               ST500DM002-1BD14
    ADOL-C [Griewank et al., 1999]:      git hash 25a69c4
    CppAD [Bell, 2020]:                  git hash 5c86b18
    ColPack [Gebremedhin et al., 2013]:  version v1.0.10

The ADOL-C git hash corresponds to its master branch on 2020-03-31, and the CppAD git hash corresponds to its master branch on 2020-05-23.

4.1 CSV File

Each run of the speed test program adds a row to a csv file with the following columns:

KB: This is the average memory usage, in kilobytes, where each kilobyte is 1000 bytes (not 1024 bytes). A separate run of the speed test program under valgrind's massif tool is used to determine these values.

implement:

adolc A forward or reverse algorithm implemented by the ADOL-C package

cppad A forward or reverse algorithm implemented by the CppAD package

subgraph A CppAD implementation of Algorithm 2.3. Note that the potentially faster Algorithm 2.4 is not implemented in CppAD and therefore not part of the test.

problem: This is the MINPACK-2 test source code used to compute f(x); see [Averick et al., 1992]. The source code was converted to C using f2c and then to C++ so it could be used with ADOL-C and CppAD. The available problem values are:

dficfj The flow in a channel Jacobian problem

dierfj The incompressible elastic rod Jacobian problem

deptfg The elastic-plastic torsion Hessian problem


dgl1fg The one dimensional Ginzburg-Landau Hessian problem

colpack: If true, the ColPack package was used to solve the coloring sub-problem. Otherwise a greedy distance two coloring algorithm (inspired by the algorithm [Gebremedhin et al., 2005, Algorithm 3.2] and implemented in the CppAD package) is used. In either case, if this is a Hessian problem, a special version of the coloring algorithm that takes advantage of the symmetry is used. The colpack option must be true when implement is adolc. It must be false when implement is subgraph because it does not use coloring.

indirect: If this is true, an indirect method was used to get more compression out of the coloring problem. It can only be true when implement is adolc and problem corresponds to computing a Hessian. This must be false when implement is cppad because it does not support this recovery option. It must also be false when implement is subgraph because it does not use a compression technique.

optimize: Whether the computational graph was optimized. This must be false when implement is adolc because it does not have this option.

setup: Whether the sec result includes the setup time; i.e., recording the graph, computing sparsity, coloring, compression, and optimization of the graph. If the computational graph (see Section 1.1) corresponding to a function does not depend on the argument to the function, the setup operations can be performed once and re-used for any argument value. Note that optimization is not included in the setup time when implement is adolc. Coloring and compression are not included when implement is subgraph.

reverse: If true (false), reverse mode (forward mode) was used for computing both the sparsity pattern and derivative values. This option must be true when implement is subgraph.

onepass: If true, the derivative values were computed using one pass of the graph with multiple directions at the same time. Otherwise, each directional derivative is computed using a separate pass. This option must be false when implement is subgraph. The onepass option must be true when implement is adolc and reverse is false.

n: This is the size of the domain space for f(x).

m: This is the size of the range space for f(x). If m = 1 (m > 1), this is a Hessian (Jacobian) problem.

nnz: This is the number of possibly non-zero values in the sparse matrix calculated by the implementation. For a Jacobian problem, the matrix is


the entire Jacobian. For a Hessian problem, the matrix is the upper triangle of the Hessian. Note that one implementation may compute a more efficient sparsity pattern than another (have fewer possibly non-zero values).

sec: This is the number of seconds for each calculation of the sparse Jacobian or Hessian.

4.2 Result Tables

There is one table for each problem and value of setup. For each implement, only the combination of options that results in the smallest sec (smallest time and fastest execution) is included in the table. Given this selection, options that are the same for all the implementations in a table are reported in the caption at the top of the table. The other options are reported in each row of the table. The rows for each table are sorted so that the sec column is monotone non-decreasing.

4.2.1 Jacobian Without Setup

Table 1: problem=dficfj, setup=false, indirect=false, n=3200, m=3200, nnz=24787

    KB     implement  colpack  optimize  reverse  onepass  sec
    12108  adolc      true     false     false    true     0.00186
    13361  cppad      true     true      false    true     0.00233
    6480   subgraph   false    true      true     false    0.00586

Table 2: problem=dierfj, setup=false, indirect=false, n=3003, m=3003, nnz=31600

    KB     implement  colpack  optimize  reverse  onepass  sec
    11561  adolc      true     false     false    true     0.0015
    5797   cppad      true     true      false    true     0.00182
    4049   subgraph   false    true      true     false    0.00411

Table 1 (Table 2) compares the time, sec, and memory, KB, required to compute the Jacobian where problem is dficfj (dierfj) and setup is false. The adolc and cppad implementations use the same algorithms, hence their similar run times are not surprising. The subgraph implementation takes


about twice as long. (The subgraph implementation does not require a sparsity calculation or solving a coloring sub-problem, and these tasks are part of the setup for the adolc and cppad implementations.) The number of possibly non-zeros in the Jacobian sparsity pattern, nnz, depends on the problem but does not depend on the implementation; i.e., the sparsity pattern efficiency is the same for all the implementations. The fastest option choices for the cppad implementation are optimize true, reverse false, and onepass true. Using colpack true with the cppad implementation is faster for one problem and slower for the other. The indirect true option is not available for Jacobian problems. The onepass true option results in more memory usage but we have not included those details in the result tables. The subgraph implementation uses less memory than the other implementations.

4.2.2 Jacobian With Setup

Table 3: problem=dficfj, setup=true, indirect=false, optimize=false, n=3200, m=3200, nnz=24787

    KB     implement  colpack  reverse  onepass  sec
    9807   adolc      true     false    true     0.0109
    18430  cppad      true     false    true     0.0138
    859    subgraph   false    true     false    0.0141

Table 4: problem=dierfj, setup=true, indirect=false, optimize=false, n=3003, m=3003, nnz=31600

    KB     implement  colpack  reverse  onepass  sec
    9585   adolc      true     false    true     0.00862
    5151   subgraph   false    true     false    0.00887
    13645  cppad      false    false    true     0.0116

Table 3 (Table 4) is similar to Table 1 (Table 2), with the difference being that setup is true instead of false. The times for the adolc, cppad, and subgraph implementations are all close. The fastest option choices for the cppad implementation are optimize false, reverse false, and onepass true. Using colpack true with the cppad implementation is faster for one problem and slower for the other. The subgraph implementation uses less memory


than the other implementations.

4.2.3 Hessian Without Setup

Table 5: problem=deptfg, setup=false, onepass=false, n=3600, m=1

    KB     implement  colpack  indirect  optimize  reverse  nnz    sec
    5207   subgraph   false    false     true      true     10680  0.00499
    4364   cppad      true     false     true      false    10680  0.00523
    15115  adolc      true     true      false     false    14161  0.0226

Table 6: problem=dgl1fg, setup=false, onepass=false, n=5000, m=1, nnz=10000

    KB     implement  colpack  indirect  optimize  reverse  sec
    6086   cppad      true     false     true      true     0.00535
    9806   subgraph   false    false     true      true     0.00982
    16679  adolc      true     true      false     false    0.0243

Table 5 (Table 6) compares the time, sec, and memory, KB, required to compute the Hessian where problem is deptfg (dgl1fg) and setup is false. The cppad and subgraph implementations have similar times and the adolc implementation takes at least twice as long. The number of possibly non-zeros in the adolc implementation for the deptfg problem is significantly larger than in the other implementations; i.e., its sparsity pattern is not as efficient as the other implementations. The fastest option choices for the cppad implementation are colpack true, optimize true, and reverse true. The indirect true option is not available with the cppad implementation. No single implementation uses the least memory for both problems.

4.2.4 Hessian With Setup

Table 7 (Table 8) is similar to Table 5 (Table 6), with the difference being that setup is true instead of false. The subgraph implementation is the fastest for this case and the cppad implementation is significantly slower. The fastest option choices for the cppad implementation are colpack true, optimize false, and reverse true. The adolc implementation uses less memory than the other implementations.


Table 7: problem=deptfg, setup=true, onepass=false, n=3600, m=1

    KB     implement  colpack  indirect  optimize  reverse  nnz    sec
    7809   subgraph   false    false     true      true     10680  0.0258
    15299  adolc      true     true      false     false    14161  0.0453
    4921   cppad      true     false     true      false    10680  0.0843

Table 8: problem=dgl1fg, setup=true, onepass=false, n=5000, m=1, nnz=10000

    KB     implement  colpack  indirect  optimize  reverse  sec
    33709  subgraph   false    false     false     true     0.0603
    7917   cppad      true     false     true      false    0.111
    19144  adolc      true     true      false     false    0.114

4.2.5 Reproducing Results

The source code that produced the results corresponds to the tag 20210803 of the following git repository:

    https://github.com/bradbell/sparse_ad

5 Conclusions

If the computational graph (see Section 1.1) corresponding to a function does not depend on the argument to the function, the setup can be done once and reused many times. In the other case, when the setup operations are computed for each argument to a sparse Jacobian, all the implementations have similar timing results. When computing the setup operations for each argument to a sparse Hessian, the subgraph implementation is significantly faster and cppad is significantly slower.

Further testing is called for. We have provided a GitHub repository with the source code used to obtain the results in this paper. This facilitates reproduction of the results as well as extension of the tests to other cases. Other cases include different computer hardware, operating systems, compilers, and versions of the AD packages. The tests can also be extended to other AD packages as well as other problems and problem sizes. The tests also provide example source code that implements the algorithms presented in this paper.


References

[Averick et al., 1992] Averick, B. M., Carter, R. G., Moré, J. J., and Xue, G.-L. (1992). The MINPACK-2 test problem collection. Technical Report Preprint MCS-P153-0692, Mathematics and Computer Science Division, Argonne National Laboratory.

[Bell, 2020] Bell, B. M. (2020). CppAD: a package for C++ algorithmic differentiation. Computational infrastructure for operations research: coin-or.

[Coleman and Verma, 1998] Coleman, T. F. and Verma, A. (1998). The efficient computation of sparse Jacobian matrices using automatic differentiation. SIAM Journal on Scientific Computing, 19(4):1210–1233.

[Gebremedhin et al., 2013] Gebremedhin, A., Nguyen, D., Patwary, M. M. A., and Pothen, A. (2013). ColPack: Software for graph coloring and related problems in scientific computing. ACM Transactions on Mathematical Software (TOMS), 40:1–31.

[Gebremedhin et al., 2005] Gebremedhin, A. H., Manne, F., and Pothen, A. (2005). What color is your Jacobian? Graph coloring for computing derivatives. SIAM Review, 47(4):629–705.

[Gower and Mello, 2014] Gower, R. M. and Mello, M. P. (2014). Computing the sparsity pattern of Hessians using automatic differentiation. ACM Transactions on Mathematical Software, 40(2):10:1–10:15.

[Griewank et al., 1999] Griewank, A., Juedes, D., Mitev, H., Utke, J., Vogel, O., and Walther, A. (1999). ADOL-C: A package for the automatic differentiation of algorithms written in C/C++. Technical report, Institute of Scientific Computing, Technical University Dresden. Updated version of the paper published in ACM Trans. Math. Software 22, 1996, 131–167.

[Griewank and Walther, 2008] Griewank, A. and Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, Philadelphia, PA, 2nd edition.

[Kristensen et al., 2016] Kristensen, K., Nielsen, A., Berg, C. W., Skaug, H., and Bell, B. M. (2016). TMB: Automatic differentiation and Laplace approximation. Journal of Statistical Software, 70(5):1–21.


[Petra et al., 2018] Petra, C. G., Qiang, F., Lubin, M., and Huchette, J. (2018). On efficient Hessian computation using the edge pushing algorithm in Julia. Optimization Methods and Software, 33(4–6):1010–1029.

[Walther, 2008] Walther, A. (2008). Computing sparse Hessians with automatic differentiation. ACM Transactions on Mathematical Software, 34(1).

[Wang et al., 2016] Wang, M., Pothen, A., and Hovland, P. (2016). Edge pushing is equivalent to vertex elimination for computing Hessians. In Gebremedhin, A., Boman, E., and Ucar, B., editors, 2016 Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing, Albuquerque, New Mexico, USA, pages 102–111, Philadelphia, PA, USA. SIAM.
