Sparse least squares problems with dense rows and preconditioned iterative methods

Jennifer Scott, University of Reading and STFC Rutherford Appleton Laboratory
Miroslav Tůma, Faculty of Mathematics and Physics, Charles University, Prague

Padua, February 2020



Outline

1 Introduction

2 Possible research directions

3 Some historical comments

4 Arbitrary sparse-dense (ASD) approach

5 Sparse stretching

6 Schur complement approach

7 Null sparse approach

8 Conclusions


Introduction

The Linear Least Squares Problem (LS):

min_{x ∈ ℝⁿ} ∥Ax − b∥₂,

where A ∈ ℝ^{m×n} with m ≥ n is large and sparse, and b ∈ ℝᵐ.


Interested in solving the problem by preconditioned iterations


Many obstacles on the way to constructing efficient and general preconditioners:

▸ Enormous variability of LS problems (and we would like to consider the algebraic problem)

▸ The sparsity structure of AᵀA is always behind the scenes in the Cholesky/QR approaches.


QR is always behind the scenes even when the normal equations are not formed.


Undergraduate stuff: A = QR, so

AᵀA = RᵀQᵀQR = RᵀR.


The fill in R is exactly as predicted by the Cholesky factorization of AᵀA if A has the strong Hall property.
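As a quick numerical sketch of the QR/normal-equations connection (a synthetic dense example, not from the talk):

```python
import numpy as np

# Hypothetical random A; any full-column-rank matrix works.
rng = np.random.default_rng(0)
A = rng.standard_normal((9, 4))

Q, R = np.linalg.qr(A)  # thin QR: Q has orthonormal columns

# Q^T Q = I, so A^T A = R^T Q^T Q R = R^T R:
# R is the Cholesky factor of A^T A up to row signs.
print(np.allclose(A.T @ A, R.T @ R))
```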


What if A contains a small but problematic part?


A trivial example of such a problematic part is a chunk of dense rows:

[Figure: sparsity patterns of A and AᵀA with a chunk of dense rows.]


Dense rows may represent a standard global coupling on top of local sparse relations, and we need to treat them in the same way as the other constraints.
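A tiny scipy sketch (synthetic matrix, illustrative only) of why a chunk of dense rows is problematic: a single dense row already fills AᵀA completely:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n = 100

# Sparse block: a tridiagonal matrix (at most 3 nonzeros per row)
As = sp.diags([np.ones(n - 1), 2.0 * np.ones(n), np.ones(n - 1)],
              offsets=[-1, 0, 1], format="csr")
# One fully dense row appended at the bottom
Ad = sp.csr_matrix(rng.standard_normal((1, n)))
A = sp.vstack([As, Ad], format="csr")

C = (A.T @ A).tocsr()   # normal equations matrix
# Ad^T Ad is a rank-1 fully dense n-by-n block, so C is completely full:
print(C.nnz == n * n)
```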


Sometimes, some constraints have to be satisfied with a higher accuracy than the rest: the linearly constrained problem (LEP).

min_{x ∈ ℝⁿ} ∥Ax − b∥₂ subject to Ex = f

These constraint rows (Ex = f) are often less sparse than the other rows.


Dense rows may appear due to transformation/decomposition. For example, after projecting the problem onto a subspace of given constraints and trying to express the transformation explicitly:

Ex = f → find Z such that EZ = 0 → ZᵀAZ

In some cases, Z is known to contain dense rows (Arioli, Maryška, Rozložník, T., 2004).
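A small illustration (with a hypothetical E, not the matrix from the cited paper) that a null-space basis Z of a sparse constraint matrix is typically dense:

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
p, n = 5, 12

# Sparse constraint matrix E: only two nonzeros per row
E = np.zeros((p, n))
for i in range(p):
    cols = rng.choice(n, size=2, replace=False)
    E[i, cols] = rng.standard_normal(2)

Z = null_space(E)        # columns of Z span {x : Ex = 0}
print(np.allclose(E @ Z, 0, atol=1e-10))
# Although E is sparse, the SVD-based Z is generically dense:
print(np.mean(np.abs(Z) > 1e-12))
```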


Schur complement systems look like weighted normal equations . . .


But dense rows are not the only problematic part in incomplete factorizations: the problem is more general:

▸ Rank-1 (rank-k) modifications of (approximate) factorizations from some rows of AᵀA (Tismenetsky (1991); Kaporin (1998); Scott, T. (2014)) may generate dense contributions to the Schur complement.

[Figure: Schur complement sparsity before the update.]


[Figure: Schur complement sparsity after the update.]


How can we treat AᵀA to minimize this effect? Split it.


Research directions inside our framework (1)

This talk mentions the following approaches to solving the problem:


1 Combining the sparse and dense parts
▸ Arbitrary sparse-dense (ASD) approach (Scott, T., 2017)
▸ Iterative approach based on CG (CGLS1)
▸ If rank(A) > rank(sparse_part_of_A): specific modifications needed.
▸ We rely on an implicit combination of the sparse and dense parts of the approximate inverse, coupled together inside CG to get z = M⁻¹r.

2 Transforming dense to sparse at the expense of making the problem larger.
▸ Sparsifying the dense part by matrix stretching (Scott, T., 2019)
▸ Hoping to get overall “uniform problem sparsity”
▸ Traps on the way: size increase / ill-conditioning


Research directions inside our framework (2)

The approaches (continued)


3 Schur complement approach → saddle-point formulation (Scott, T., 2018)
▸ The approach enables the use of a direct factorization or an incomplete factorization
▸ Acceleration is also needed in the case of regularization when running a direct solver.
▸ Can use an indefinite or positive definite factorization depending on the problem.

4 Null-space approach → saddle-point formulation (Scott, T., in preparation)
▸ Transformation of the saddle-point formulation
▸ Computation of null-space bases of wide dense matrices and many other interesting problems.

All the approaches have specific strengths and weaknesses, and potential to be developed further.


History

Some historical comments (a few of many; we apologize that only a sketch is shown)


▸ Many authors have studied direct methods for this or related problems since the 1980s (many more interesting papers than can be mentioned): George, Heath (1980) (sparse Givens rotations); Björck, Duff (1980) (updates, modifications of the Peters-Wilkinson LU-based method, rank deficiency, weights); Heath (1982) (LS extensions including linear constraints); Björck (1984) (general updating scheme with both sparse and dense constraints); Ng (1992) (an interesting scheme to handle rank deficiency); Sun (1995, 1997) (dense rows in sequential or parallel computing environments), and many more.


▸ An authoritative summary of this research is given in the monograph Björck (1996); see also Björck (2015).


Some historical comments (continued)

▸ A lot of work on methods using randomization (Rokhlin, Tygert, 2008; Avron, Maymounkov, Toledo, 2010; Drineas, Mahoney, Muthukrishnan, Sarlós, 2011; Meng, Saunders, Mahoney, 2014)


▸ A lot of work in applications such as numerical optimization; see, e.g., Wright (1992); Lustig et al. (1992); Goldfarb, Shanno (2004); Oliveira, Sorensen (2005); Gondzio (1991); Andersen et al. (1996).


▸ Much less work using preconditioned iterative methods: preconditioners are often built on top of the pioneering work on LSQR (Paige, Saunders, 1982) or LSMR (Fong, Saunders, 2011); Li, Saad (2006) (MIQR, a multilevel IQR); Avron, Ng, Toledo (2009) (low-rank perturbations of A); interesting ideas within the randomized approach (Avron, 2010; Drineas et al., 2011, above).


▸ New research continues to use ideas from classical papers such as Peters, Wilkinson (1970), Sautter (1978), and Woodbury (1949, 1950).


Notation

Notation for the mixed sparse-dense problem: a sparse problem with a few dense rows.

A = (As; Ad) (As stacked on top of Ad),

C = AᵀA = AsᵀAs + AdᵀAd ≡ Cs + Cd

▸ As ∈ ℝ^{ms×n} is sparse, Ad ∈ ℝ^{md×n} is dense (ms ≫ md).
▸ A has full column rank (not necessarily As).
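A minimal sketch of this partitioning by row density (the 50% threshold and the helper name are illustrative choices, not the criterion from the talk):

```python
import numpy as np
import scipy.sparse as sp

def split_sparse_dense(A, density_threshold=0.5):
    """Partition the rows of A into a sparse block As and a dense block Ad.

    A row is classed as dense when it fills more than density_threshold
    of the n columns. Returns (As, Ad) with ms + md = m rows.
    """
    A = sp.csr_matrix(A)
    nnz_per_row = np.diff(A.indptr)          # nonzeros in each row
    is_dense = nnz_per_row > density_threshold * A.shape[1]
    return A[~is_dense], A[is_dense]

# Toy example: 4 nearly-empty rows plus one completely full row
A = sp.vstack([sp.eye(4, 6, format="csr"),
               sp.csr_matrix(np.ones((1, 6)))])
As, Ad = split_sparse_dense(A)
print(As.shape, Ad.shape)   # (4, 6) (1, 6)
```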


ASD: Arbitrary sparse-dense preconditioning

Iterative approach based on CG (CGLS1) (Scott, T., 2017): the approach uses approximations of both

Cs = AsᵀAs and Cd = AdᵀAd

in one preconditioner M.


The parts are merged together and applied to the residual as

z = M⁻¹r


Woodbury formulas (1949, 1950), with an update by the dense part and an update by the sparse part, are behind this merge.


Examples of (hidden) Woodbury-like formulas follow.


Theorem

If Cs = LsLsᵀ and ξ₁ minimizes ∥As Ls⁻ᵀ z − bs∥₂ exactly, the exact least squares solution of our problem can be written as x = Ls⁻ᵀ(ξ₁ + Γ₁), where ρd = bd − Ad Ls⁻ᵀ ξ₁ and

Γ₁ = Ls⁻¹ Adᵀ (I_md + Ad Ls⁻ᵀ Ls⁻¹ Adᵀ)⁻¹ ρd.
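A dense numpy sketch (random synthetic data; a real solver would keep As sparse and use an incomplete factor for Ls) checking the formula of the theorem against a direct least squares solve:

```python
import numpy as np

rng = np.random.default_rng(0)
ms, md, n = 200, 3, 30
As = rng.standard_normal((ms, n))   # stands in for the sparse block
Ad = rng.standard_normal((md, n))   # the few dense rows
bs = rng.standard_normal(ms)
bd = rng.standard_normal(md)

# Cs = As^T As = Ls Ls^T (Ls lower triangular)
Ls = np.linalg.cholesky(As.T @ As)

# xi1 minimizes ||As Ls^{-T} z - bs||_2 exactly; because
# Ls^{-1} (As^T As) Ls^{-T} = I, this reduces to xi1 = Ls^{-1} As^T bs.
xi1 = np.linalg.solve(Ls, As.T @ bs)

B = np.linalg.solve(Ls, Ad.T).T     # B = Ad Ls^{-T} (since B^T = Ls^{-1} Ad^T)
rho_d = bd - B @ xi1
Gamma1 = B.T @ np.linalg.solve(np.eye(md) + B @ B.T, rho_d)

x = np.linalg.solve(Ls.T, xi1 + Gamma1)   # x = Ls^{-T} (xi1 + Gamma1)

x_ref, *_ = np.linalg.lstsq(np.vstack([As, Ad]),
                            np.concatenate([bs, bd]), rcond=None)
print(np.allclose(x, x_ref))   # the Woodbury-style update reproduces x
```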


Another example of a Woodbury-like formula:


Theorem

If Cs = LsLsᵀ and ξ₁ is an approximate solution to the problem min_z ∥As Ls⁻ᵀ z − bs∥₂, the exact least squares solution of the equivalent problem above can be written as z = ξ₁ + Γ₁, where ρs = bs − As Ls⁻ᵀ ξ₁, ρd = bd − Ad Ls⁻ᵀ ξ₁ and

Γ₁ = Ls⁻¹ Asᵀ ρs + Ls⁻¹ Adᵀ (I_md + Ad Ls⁻ᵀ Ls⁻¹ Adᵀ)⁻¹ (ρd − Ad Ls⁻ᵀ Ls⁻¹ Asᵀ ρs).
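The second formula can be checked in the same way; here ξ₁ is deliberately an arbitrary guess, since this theorem only requires an approximate ξ₁ (again random synthetic data):

```python
import numpy as np

rng = np.random.default_rng(1)
ms, md, n = 150, 2, 20
As = rng.standard_normal((ms, n))
Ad = rng.standard_normal((md, n))
bs = rng.standard_normal(ms)
bd = rng.standard_normal(md)

Ls = np.linalg.cholesky(As.T @ As)
Bs = np.linalg.solve(Ls, As.T).T    # As Ls^{-T}
Bd = np.linalg.solve(Ls, Ad.T).T    # Ad Ls^{-T}

xi1 = rng.standard_normal(n)        # an arbitrary (inexact) xi1
rho_s = bs - Bs @ xi1
rho_d = bd - Bd @ xi1

t = Bs.T @ rho_s                    # Ls^{-1} As^T rho_s
Gamma1 = t + Bd.T @ np.linalg.solve(np.eye(md) + Bd @ Bd.T,
                                    rho_d - Bd @ t)
z = xi1 + Gamma1

# z must solve min_z || [As; Ad] Ls^{-T} z - [bs; bd] ||_2 exactly
z_ref, *_ = np.linalg.lstsq(np.vstack([Bs, Bd]),
                            np.concatenate([bs, bd]), rcond=None)
print(np.allclose(z, z_ref))
```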


The following examples use only the first formula, which was always reliable.


SCSD8-2r_a (m=60,550; n=8,650): size of Cs

[Figure: size of the matrix of normal equations vs. number of dense rows.]


This is how md is determined ↑.


ASD: Moving rows one by one from As to Ad

SCSD8-2r_a: iteration counts + preconditioner size / size(AᵀA)


Figure: Problem Meszaros/scsd8− 2r. Iteration counts (left), and ratio of thepreconditioner size to the size of AT A (right) as the number of dense rows thatare removed from A is increased.

17 / 64

ASD: Moving rows one by one from As to Ad

SCSD8-2r_a: timings

Figure: Problem Meszaros/scsd8-2r. Time to compute the preconditioner (left) and time for CGLS (right) as the number of dense rows that are removed from A is increased.

ASD: Moving rows one by one from As to Ad

stormg2_1000 (m=1,377,306; n=528,185): size of Cs

Figure: Problem Mittelmann/stormg2_1000. Size of A_s^T A_s as the number of dense rows is increased.

ASD: Moving rows one by one from As to Ad

stormg2_1000: large problem: iteration counts + size_p/size(A^T A)

Figure: Problem Mittelmann/stormg2_1000. Iteration counts (left) and ratio of the preconditioner size to the size of A^T A (right) as the number of dense rows that are removed from A is increased.

Moving rows one by one from As to Ad

stormg2_1000: timings

Figure: Problem Mittelmann/stormg2_1000. Time to compute the preconditioner (left), time for the preconditioned iterations (right).

Experimental evaluation of ASD

                 Dense rows not exploited          Dense rows exploited
Identifier      size_p     T_p   Its   T_i    md    size_ps     T_p   Its   T_i
lp_fit2p        17,985    0.26    ‡     ‡     25       4,940    0.09    1   0.01
scsd8-2r        51,885    0.25   90   0.11    50      51,855    0.05    7   0.02
scagr7-2r      197,067    3.34  244   0.53     7     152,977    0.06    1   0.01
scfxm1-2r      227,835    0.59  187   0.51    58     227,823    0.14   33   0.23
neos1          789,471      †     †     †     74     789,471    5.27  132   3.71
neos2             †         †     †     †     90     795,323    5.46  157   4.84
stormg2-125    395,595    0.27    ‡     ‡    121   7,978,135    0.22   16   0.29
PDE1              †         †     †     †      1   1,623,531    12.7  696   1.28
neos              †         †     †     †     20   2,874,699    4.93  232   15.0
stormg2_1000  3,157,095   19.1    ‡     ‡    121   3,125,987    19.1   18   2.92
cont1_l           †         †     †     †      1  11,510,370    4.82    1   0.33

ASD: conclusions

More possibilities to merge formulas.

It is not clear why some of the formulas work very well and some of them can be "destroyed" by slight perturbations of the factorizations - a careful analysis is needed.

In the experiments we intentionally treated the dense block as dense, without any enhancements, to see the qualitative behaviour.

What if As has null columns?

A possible way is to solve the problem with more right-hand sides:

$A = (A_1\;A_2) \equiv \begin{pmatrix} A_{s1} & 0 \\ A_{d1} & A_{d2} \end{pmatrix}$,   (1)

▸ The solution can be expressed as a combination of partial solutions of $\min_z \|A_1z - b\|_2$ and $\min_W \|A_1W - A_2\|_F$.

Matrix stretching for the LS problems

Stretching: a specific sparsification by splitting dense rows into sparse pieces.

The problem is then also augmented by columns.

[Figure: the matrix A before and after stretching.]

Such a strategy, called stretching, was discussed (among others) by Grcar (1990), Vanderbei (1991), Gondzio (1991), Alvarado (1997), Adler (2000), Adlers, Björck (2000), Duff, Scott (2005).

Up to now it has not been an approach of choice.

Matrix stretching for the LS problems

Derivation: special case of one dense row
▸ The dense row is partitioned as
  $A_d = (e\;\;f)$   (2)
▸ Add one variable $s$ and a modified dense row, obtaining a problem that is equivalent in the sense of yielding the same $x$ as the original problem:

$$\min_{(x^T\,s)^T}\left\|\begin{pmatrix} A_{se} & A_{sf} & 0\\ e & f & 0\\ e & -f & \sqrt{2}\end{pmatrix}\begin{pmatrix} x_e\\ x_f\\ s\end{pmatrix}-\begin{pmatrix} b_s\\ b_d\\ 0\end{pmatrix}\right\|_2 \qquad (3)$$

▸ Orthogonally transforming the problem by applying the Givens rotation $G$, we get

$$\min_z \|\bar{A}z - \bar{b}\|_2$$

with

$$G=\frac{1}{\sqrt{2}}\begin{pmatrix}1 & 1\\ 1 & -1\end{pmatrix},\quad \bar{A}=\begin{pmatrix} A_{se} & A_{sf} & 0\\ \sqrt{2}\,e & 0 & 1\\ 0 & \sqrt{2}\,f & -1\end{pmatrix},\quad z=\begin{pmatrix}x_e\\ x_f\\ s\end{pmatrix},\quad \bar{b}=\begin{pmatrix} b_s\\ b_d/\sqrt{2}\\ b_d/\sqrt{2}\end{pmatrix}.$$
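The equivalence above can be sketched numerically. A minimal NumPy illustration, under simple assumptions (random dense data, one dense row, an arbitrary split of the columns into the two groups e and f):

```python
import numpy as np

rng = np.random.default_rng(1)
ms, n = 12, 6
As = rng.standard_normal((ms, n))
d = rng.standard_normal(n)            # the single dense row, split as (e | f)
bs = rng.standard_normal(ms)
bd = rng.standard_normal()

e, f = d[:3], d[3:]                   # arbitrary split into two column groups
s2 = np.sqrt(2.0)

# stretched matrix: sparse rows get a zero column; the dense row becomes
# two sparse rows coupled through the extra variable s
row_e = np.concatenate([s2 * e, np.zeros(3), [1.0]])
row_f = np.concatenate([np.zeros(3), s2 * f, [-1.0]])
A_str = np.vstack([np.hstack([As, np.zeros((ms, 1))]), row_e, row_f])
b_str = np.concatenate([bs, [bd / s2, bd / s2]])

z = np.linalg.lstsq(A_str, b_str, rcond=None)[0]
x_stretched = z[:n]                   # drop the extra stretching variable s

x_ref = np.linalg.lstsq(np.vstack([As, d]),
                        np.concatenate([bs, [bd]]), rcond=None)[0]
print(np.allclose(x_stretched, x_ref))   # True
```

Minimizing over s makes the two split rows contribute exactly the original dense-row residual, which is why the x-part of the stretched solution matches the original one.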

Matrix stretching for the LS problems

And this is the sparsified (stretched) matrix:

$$\bar{A}=\begin{pmatrix} A_{se} & A_{sf} & 0\\ \sqrt{2}\,e & 0 & 1\\ 0 & \sqrt{2}\,f & -1\end{pmatrix}$$

The transformation can be used for more rows and more parts.

But there are problems with stretching. The first of them: how many parts? Grcar (1990): "the main challenge ... lies in determining the appropriate choice of the number of rows ... to split into ..."

We will illustrate this problem by examples. They show that sparsity can very soon decrease again.

Matrix stretching for the LS problems: very trivial model

Figure: Structure of A (left, nz = 142) and C (right, nz = 755) for the matrix A with a single dense row and As diagonal, with n = 64 and the dense row stretched into 8 parts.

This seems to be very tempting.

Matrix stretching for the LS problems: very simple model

Figure: Number of entries in A and C when A has a single dense row that is split into an increasing number of parts and As is diagonal (n = 64).

No fill-in in the Cholesky factorization.

Seems to be very promising.

Matrix stretching for the LS problems: less trivial model

Figure: Matrices based on the discrete 2D Laplacian: unstretched (left, nz = 352), stretched (right, nz = 382).

Seems to be promising again.

Matrix stretching for the LS problems: generic example

Figure: A^T A (left, nz = 991) and its Cholesky factor (right, nz = 3077).

Ooops, this is not sparse at all ...

Matrix stretching for the LS problems: fill-in

Figure: Fill-in in A^T A and its Cholesky factor (triangular parts) as the number of stretched parts increases.

Reorderings do not help; straightforward stretching is not possible!

More general cases: problems with the fill-in and also with ill-conditioning (despite the bounds by Adlers and Björck, 2000).

The problem with fill-in is primarily not the need for additional memory, but its contribution to the loss of information in preconditioners!

Matrix stretching for the LS problems: fill-in

The normal matrix:

$$A^TA=\sum_{k=1}^{m} u_k^Tu_k,\qquad A^T=(u_1^T,\dots,u_m^T).$$

That is, $u_k$, $k = 1,\dots,m$ are the rows of A.

What if the pattern of a row $u_j$ is contained in the pattern of a row $u_i$ (dominated by $u_i$)?

⇒ The pattern of $u_j$ is not needed to get the pattern of $A^TA$.

[Schematic: a matrix A with 6 columns in which row $u_j$ has nonzeros only in a subset of the columns where row $u_i$ has nonzeros.]
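Both observations can be checked in a few lines of NumPy (a toy illustration; the matrices are made up):

```python
import numpy as np

# A^T A as a sum of outer products of the rows u_k of A
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
C = sum(np.outer(u, u) for u in A)
print(np.allclose(C, A.T @ A))            # True

# A row whose pattern is contained in another row's pattern
# adds no new structural entries to A^T A.
As = np.array([[1., 2., 0., 0.],
               [0., 3., 4., 0.],
               [0., 0., 5., 6.]])
uj = np.array([0., 7., 8., 0.])           # pattern contained in row 2 of As
C0 = As.T @ As
C1 = np.vstack([As, uj]).T @ np.vstack([As, uj])
print(np.array_equal(C0 != 0, C1 != 0))   # True
```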

Matrix stretching for the LS problems: fill-in

Let us see this: consider

$$\begin{pmatrix} A_s \\ u_i \end{pmatrix}.$$

The normal matrix

$$\begin{pmatrix} A_s \\ u_i \end{pmatrix}^T \begin{pmatrix} A_s \\ u_i \end{pmatrix} = A_s^TA_s + u_i^Tu_i$$

has the same sparsity pattern as $A_s^TA_s$.

$$u_i \to F,\qquad A \to \begin{pmatrix} A_s \\ F \end{pmatrix}$$

The idea: stretch $F^T$ into blocks dominated by rows in $A_s$!

$$\bar{A}^T\bar{A}=\begin{pmatrix} A^T & F^T\\ 0 & S^T\end{pmatrix}\begin{pmatrix} A & 0\\ F & S\end{pmatrix}=\begin{pmatrix} A^TA+F^TF & F^TS\\ S^TF & S^TS\end{pmatrix}$$

$$\mathrm{Struct}(A^TA + F^TF) \subseteq \mathrm{Struct}(A^TA)$$

Matrix stretching for the LS problems: sparse stretching

Finding the stretched and dominated pieces

[Figure: a matrix of order 15 × 12 with one dense row.]

Matrix stretching for the LS problems: sparse stretching

[Figure: graph model in which the entries of the dense row are the nodes and the rows of A are the edges covering them.]

Minimum set cover problem + postprocessing:

First step: minimum set cover problem.

Second step: find disjoint segments.
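The first step is a minimum set cover: cover the column indices of the dense row by the column patterns of a few sparse rows. The standard greedy heuristic for set cover can be sketched as follows; this is a generic illustration under made-up data, not the authors' exact procedure.

```python
def greedy_set_cover(universe, subsets):
    """Classic greedy heuristic for minimum set cover: repeatedly pick
    the subset covering the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & set(s)))
        gain = uncovered & set(best)
        if not gain:
            raise ValueError("universe is not coverable by the given subsets")
        chosen.append(best)
        uncovered -= gain
    return chosen

# Toy instance: the dense row's column indices must be covered by
# the column patterns of a few sparse rows.
dense_cols = {1, 2, 3, 4, 5, 6, 7, 8, 10, 12}
row_patterns = [[1, 2], [2, 3, 4], [4, 5, 6], [6, 7, 8], [8, 10, 12], [3, 5, 7]]
cover = greedy_set_cover(dense_cols, row_patterns)
print(set().union(*(set(s) for s in cover)) >= dense_cols)   # True
```

The greedy choice gives the usual logarithmic approximation guarantee; the second step would then trim the chosen patterns into disjoint segments.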

Matrix stretching for the LS problems

The matrix is transformed by the stretching fully automatically and parameter-free, using
▸ a minimum set cover heuristic
▸ sparsification to minimize the total fill-in

Based purely on combinatorial techniques.

We call this technique sparse stretching.

Much better than the standard (old) stretching in
▸ sparsity of the normal equations
▸ Cholesky factor size
▸ conditioning of the normal equations
▸ black-box construction of the algebraic transformation
▸ iteration counts

Matrix stretching for the LS problems: old and sparse stretching

Number of entries: A^T A and Cholesky factor versus number of parts. The red dot marks the result for sparse stretching.

Figure: Comparison of the entries in the stretched normal matrix (left) and its Cholesky factor (right) for problem WM1 with one dense row appended.

Matrix stretching for the LS problems: old and sparse stretching

Number of entries: A^T A and Cholesky factor versus number of parts. The red dot marks the result for sparse stretching.

Figure: Comparison of the entries in the stretched normal matrix (left) and its Cholesky factor (right) for problem LP_AGG with one dense row appended.

Matrix stretching for the LS problems: old and sparse stretching

Figure: For problem LP_AGG with one dense row appended, the sparsity pattern of the stretched normal matrix for standard stretching (left, nz = 29268) and sparse stretching (right, nz = 24878).

Matrix stretching for the LS problems: old and sparse stretching

Figure: For problem LP_AGG with one dense row appended, the sparsity pattern of L + L^T of the Cholesky factor of the stretched normal matrix for standard stretching (left) and sparse stretching (right).

Matrix stretching for the LS problems: old and sparse stretching

The sizes really translate into the iteration counts.

Figure: Comparison of the iteration counts (left) and preconditioner size (right) for the matrix LP_AGG. The curve corresponds to the number of entries varying with the number of parts into which the dense row is stretched. CGLS preconditioned by HSL_MI35.
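CGLS itself, the iterative solver used in these experiments, is a short algorithm: CG applied implicitly to the normal equations, touching A only through products with A and A^T. A minimal unpreconditioned NumPy sketch on random test data (the HSL_MI35 incomplete-factorization preconditioning is not reproduced here):

```python
import numpy as np

def cgls(A, b, iters=100, tol=1e-10):
    """Minimal (unpreconditioned) CGLS for min ||Ax - b||_2.
    Mathematically equivalent to CG on A^T A x = A^T b, but applies
    A and A^T instead of forming the normal matrix."""
    x = np.zeros(A.shape[1])
    r = b - A @ x
    s = A.T @ r
    p = s.copy()
    gamma = s @ s
    for _ in range(iters):
        q = A @ p
        alpha = gamma / (q @ q)
        x += alpha * p
        r -= alpha * q
        s = A.T @ r
        gamma_new = s @ s
        if np.sqrt(gamma_new) < tol:
            break
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)
x = cgls(A, b, iters=200)
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x, x_ref, atol=1e-6))   # True
```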

Matrix stretching: old and sparse stretching

Identifier   strategy   mds   nnz(C)       nnz(L)       nflops
aircraft     none         0   1.421×10^6   7.048×10^6   1.764×10^10
             standard    17   5.474×10^4   4.911×10^5   1.759×10^7
             sparse      17   5.474×10^4   4.911×10^5   1.759×10^7
sc205-2r     none         0   6.510×10^6   8.002×10^7   1.995×10^11
             standard     8   1.408×10^5   1.841×10^6   8.043×10^7
             sparse       8   1.408×10^5   1.852×10^6   8.150×10^7
scagr7-2r    none         0   2.215×10^7   9.502×10^7   4.369×10^11
             standard     7   1.979×10^6   9.269×10^6   1.423×10^10
             sparse       7   1.841×10^5   1.564×10^6   6.538×10^7
scrs8-2r     none         0   6.217×10^6   1.718×10^7   3.215×10^10
             standard    22   8.013×10^5   2.675×10^6   1.244×10^9
             sparse      22   8.968×10^4   1.258×10^6   7.303×10^7
scsd8-2r     none         0   1.957×10^6   1.196×10^7   1.960×10^10
             standard    50   6.287×10^5   6.924×10^6   5.467×10^9
             sparse      50   1.794×10^5   4.357×10^6   2.414×10^9
south31      none         0   1.536×10^8   1.581×10^8   1.850×10^12
             standard   381   3.224×10^5   4.366×10^6   2.546×10^8
             sparse     381   3.224×10^5   4.260×10^6   2.319×10^8

HSL_MI35 (Scott, T., 2016)

Page 111: Sparse least squares problems with dense rows and ... · Outline 1 Introduction 2 Possible research directions 3 Some historical comments 4 Arbitrary sparse-dense (ASD) approach 5

Matrix stretching for the LS problems: old and sparse stretching

What about conditioning?

0 50 100 150 200 2500

2

4

6

8

10

12x 10

12

number of stretched parts

cond

ition

num

ber

(con

dest

)

0 50 100 150 200 250 300 350 400 450 500

number of stretched parts

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

cond

ition

num

ber

(con

dest

)

1013

Figure: The condition number of the stretched normal matrix for problems WM1

(left) and LP_AGG (right) with one dense row appended.

46 / 64

Page 112: Sparse least squares problems with dense rows and ... · Outline 1 Introduction 2 Possible research directions 3 Some historical comments 4 Arbitrary sparse-dense (ASD) approach 5

Matrix stretching for the LS problems: old and sparse stretching

What about conditioning?

[Two plots: condition number (condest) of the stretched normal matrix against the number of stretched parts.]

Figure: The condition number of the stretched normal matrix for problems WM1 (left) and LP_AGG (right) with one dense row appended.

The condition number goes steadily up in practice.

46 / 64

Page 114

Matrix stretching for the LS problems: old and sparse stretching

Ill-conditioning in practice is in agreement with theoretical bounds.

Adlers-Björck theory (see Adlers, Björck, 2000; Scott, T., 2019)

Theorem

Let Ã denote the matrix obtained from A by stretching p rows, each split into k parts, with γ = ∥Ad∥₂ / (2√(pk)). Then an upper bound for the condition number of Ã is

    κ₂(Ã) ≤ κ₂(A) k ( 1 + 2pk ∥Ad∥₂² / ∥A∥₂² ) ( k + 1 + σn(A)² / ∥Ad∥₂² ).

And this is really not optimistic ...

47 / 64

Page 116

Matrix stretching for the LS problems: old and sparse stretching

Condition number increase when stretching more rows

[Two plots against the number of stretched rows: condition number estimate on a 10^6 to 10^16 scale, and iteration count on a 10^0 to 10^3 scale.]

Figure: Condition number estimate (right) and iteration count (left) for problem ctap1-2b as the number of dense rows increases.

Do we really need to stretch everything?

48 / 64

Page 118

Matrix stretching for the LS problems: old and sparse stretching

Remedy: partial stretching
▸ Stretching only when needed, for example when we have null columns in As.
▸ The rest treated by the ASD approach.

Identifier   m       n       md    mds   nnz(C)       nnz(L)       nflops
aircraft      7517    3754    17    4    1.575×10^4   9.984×10^4   2.552×10^6
sc205-2r     62423   35213     8    1    9.602×10^4   9.084×10^5   2.875×10^7
scagr7-2r    46679   32847     7    1    1.443×10^5   8.531×10^5   2.572×10^7
scrs8-2r     27691   14364    22    7    6.698×10^4   6.716×10^5   2.797×10^7
scsd8-2r     60550    8650    50    5    5.895×10^4   †            †
south31      36321   18425   381    5    1.884×10^4   2.306×10^4   1.565×10^5

Much better in many cases, but not always.

49 / 64

Page 120

Matrix stretching for the LS problems: old and sparse stretching

What about a more sophisticated graph preprocessing: a weighted variant of the set cover?

A = ( As ; Ad ) =

    [ 1     1     1     1     0    ]
    [ 0     0     0     0     1    ]
    [ 1     1     0     0     0    ]
    [ 5     6     0     0     0    ]
    [ 0     0     7     8     9    ]
    [ 3/√2  3/√2  3/√2  3/√2  3/√2 ]    (4)

( As ; F^T ) =

    [ 1  1  1  1  0 ]         [ 1  1  1  1  0 ]
    [ 0  0  0  0  1 ]         [ 0  0  0  0  1 ]
    [ 1  1  0  0  0 ]         [ 1  1  0  0  0 ]
    [ 5  6  0  0  0 ]   or    [ 5  6  0  0  0 ]
    [ 0  0  7  8  9 ]         [ 0  0  7  8  9 ]
    [ 3  3  3  3  0 ]         [ 3  3  0  0  0 ]
    [ 0  0  0  0  3 ]         [ 0  0  3  3  3 ]    (5)

50 / 64

Page 122

Matrix stretching for the LS problems: old and sparse stretching

What about a more sophisticated graph preprocessing: weighted set cover

A = ( As ; Ad ) =

    [ 1     1     1     1     0    ]
    [ 0     0     0     0     1    ]
    [ 1     1     0     0     0    ]
    [ 5     6     0     0     0    ]
    [ 0     0     7     8     9    ]
    [ 3/√2  3/√2  3/√2  3/√2  3/√2 ]    (6)

As^T As + F F^T =

    [ 36  41  10  10   0 ]         [ 36  41   1   1   0 ]
    [ 41  47  10  10   0 ]         [ 41  47   1   1   0 ]
    [ 10  10  59  66  63 ]   or    [  1   1  59  66  72 ]
    [ 10  10  66  74  72 ]         [  1   1  66  74  81 ]
    [  0   0  63  72  91 ]         [  0   0  72  81  91 ]
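The effect of the two splittings on the stretched normal matrix can be checked numerically. A minimal numpy sketch, using the example matrices above (variable names are mine):

```python
import numpy as np

# Sparse part A_s of the example matrix; the dense row (3,3,3,3,3)/sqrt(2)
# is replaced by the stretched rows F below.
As = np.array([[1., 1., 1., 1., 0.],
               [0., 0., 0., 0., 1.],
               [1., 1., 0., 0., 0.],
               [5., 6., 0., 0., 0.],
               [0., 0., 7., 8., 9.]])

# Two ways of splitting the dense row into two stretched rows.
F1 = np.array([[3., 3., 3., 3., 0.],
               [0., 0., 0., 0., 3.]])   # non-weighted cover
F2 = np.array([[3., 3., 0., 0., 0.],
               [0., 0., 3., 3., 3.]])   # weighted cover

# Normal matrices of the two stretched sparse parts.
C1 = As.T @ As + F1.T @ F1
C2 = As.T @ As + F2.T @ F2
print(C1.astype(int))
print(C2.astype(int))
```

The coupling entries between columns {1, 2} and {3, 4} drop from 10 in the first choice to 1 in the second, which is the quantitative reason the weighted cover is preferred.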

Apparently, the second choice is intuitively better.

51 / 64

Page 123

Matrix stretching for the LS problems: old and sparse stretching

[Two plots: iteration count against the number of appended rows, comparing the non-weighted and weighted set cover.]

Figure: Iteration counts for problem lp_agg as the number of appended dense rows increases. Results are for sparse stretching using the weighted and non-weighted vertex set cover with lsize = 50 (left) and lsize = 60 (right).

52 / 64

Page 124

Matrix stretching for the LS problems: old and sparse stretching

[Two plots: iteration count against the parameter lsize, comparing the non-weighted and weighted set cover.]

Figure: Iteration counts for problem pltexpa with one appended dense row as the parameter lsize increases. Results are for sparse stretching using the weighted and non-weighted vertex set cover for density ρ = 0.1 (left) and ρ = 0.5 (right).

53 / 64

Page 125

Matrix stretching for the LS problems: old and sparse stretching

What if the ill-conditioning can be overcome by treating the problem as having the saddle-point structure?

    Ã^T Ã = [ A^T  F^T ; 0  S^T ] [ A  0 ; F  S ] = [ A^T A + F^T F   F^T S ; S^T F   S^T S ]

Note that (in the unscaled case):

    S^T S =

        [  2  −1              ]
        [ −1   2  −1          ]
        [     −1   2  −1      ]
        [        ⋱   ⋱   ⋱    ]
        [         −1   2  −1  ]
        [             −1   2  ]

54 / 64
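The slides do not spell S out, but one plausible concrete choice reproducing the tridiagonal S^T S above is the bidiagonal "chain" matrix linking consecutive parts of a stretched row. A small sketch (the function name and this specific S are my assumptions):

```python
import numpy as np

def chain_matrix(k):
    """One plausible unscaled stretching block S of size k x (k-1):
    column i links part i to part i+1 of a stretched row."""
    S = np.zeros((k, k - 1))
    for i in range(k - 1):
        S[i, i] = 1.0
        S[i + 1, i] = -1.0
    return S

S = chain_matrix(6)
StS = S.T @ S          # tridiagonal: 2 on the diagonal, -1 off it
print(StS.astype(int))
```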

Page 127

Matrix stretching for the LS problems: old and sparse stretching

Very preliminary: preconditioners based on

    P0 = [ Cs + F^T F   F^T S ; S^T F   S^T S ],    P1 = [ diag(Cs)   F^T S ; S^T F   S^T S ].

[Two plots against the number of appended rows: iteration count (left) and preconditioner size (right) for the P0 and P1 preconditioners.]

Figure: Iteration counts (left) and preconditioner size (right) for P0 and P1 applied to problem lp_agg as the number of appended dense rows increases. Here lsize = 5.

Many more possibilities to get modified constraint preconditioners.

55 / 64

Page 129

Schur complement approach

Fully embedded in the Schur complement approach, which combines a direct solver, modifications and regularization to get a preconditioner:

System matrix with varying α:

    K(α) = [ Cs(α)   Ad^T ; Ad   −I_{md} ].

Once the dense rows are clearly detected, the preconditioned iterative method can be extremely successful in solving some hard problems.
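As an illustration only, a toy numpy sketch of solving a system with K(α) through the Schur complement of the Cs(α) block; here Cs(α) = As^T As + αI is an assumed regularization, and all matrices are small and dense for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
ms, md, n, alpha = 20, 2, 6, 1e-2

As = rng.standard_normal((ms, n))     # sparse part (dense here for brevity)
Ad = rng.standard_normal((md, n))     # the few "dense" rows
c = rng.standard_normal(n)

Cs = As.T @ As + alpha * np.eye(n)    # Cs(alpha): assumed regularization

# K(alpha) = [[Cs, Ad^T], [Ad, -I]]; eliminating the (1,1) block gives
# the md x md Schur complement  S = -I - Ad Cs^{-1} Ad^T.
CsInvAdT = np.linalg.solve(Cs, Ad.T)
S = -np.eye(md) - Ad @ CsInvAdT

# Solve K [x; y] = [c; 0] via the Schur complement.
CsInvc = np.linalg.solve(Cs, c)
y = np.linalg.solve(S, -Ad @ CsInvc)
x = CsInvc - CsInvAdT @ y

# Compare with a direct solve of the full augmented system.
K = np.block([[Cs, Ad.T], [Ad, -np.eye(md)]])
xy = np.linalg.solve(K, np.concatenate([c, np.zeros(md)]))
print(np.allclose(np.concatenate([x, y]), xy))  # True
```

Because md is small, the Schur complement is a tiny dense matrix, which is what makes this attractive when only a few dense rows are present.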

56 / 64

Page 132

Standard null space approach

Optimization motivation of the null-space approach:

    minimize f(u)   subject to   B u = g,    (7)

The saddle-point problem for the direction vector u, where H is the local Hessian:

    [ H   B^T ; B   0 ] [ u ; v ] = [ f − Hu ; g ],    (8)

Algorithm

Dual variable method for solving the saddle-point problem

1. Find Z with columns forming a basis for N(B).
2. Find u such that Bu = g.
3. Solve Z^T H Z z = Z^T (f − Hu).
4. Set x = u + Zz.
5. Find y such that B^T y = f − Hu.
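The five steps can be transcribed directly with numpy/scipy for a small equality-constrained quadratic, minimizing ½uᵀHu − fᵀu subject to Bu = g (an assumed concrete instance of (7); in step 5 the residual is taken at the computed x, which makes that system consistent):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
n, k = 8, 3
B = rng.standard_normal((k, n))       # full row rank constraint matrix
g = rng.standard_normal(k)
M = rng.standard_normal((n, n))
H = M @ M.T + np.eye(n)               # SPD Hessian, hence PD on N(B)
f = rng.standard_normal(n)

# 1. Z: columns span the null space of B.
Z = null_space(B)
# 2. Any particular solution of B u = g.
u = np.linalg.lstsq(B, g, rcond=None)[0]
# 3. Reduced (projected) system of order n - k.
z = np.linalg.solve(Z.T @ H @ Z, Z.T @ (f - H @ u))
# 4. Primal solution.
x = u + Z @ z
# 5. Multipliers from B^T y = f - H x (consistent by construction).
y = np.linalg.lstsq(B.T, f - H @ x, rcond=None)[0]

print(np.allclose(B @ x, g))               # constraint satisfied
print(np.allclose(H @ x + B.T @ y, f))     # stationarity
```

Step 3 is the only solve of reduced size; for the LS saddle-point formulation, H plays the role of Cs and B the role of Ad.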

57 / 64

Page 135

Standard null space approach

Dual variable method to solve it heavily relies on the fact that the bottom right block of the saddle-point matrix is zero.

Saddle-point from the LS problem
The LS problem can be written as solving the following system

    [ Cs   Ad^T ; Ad   −I ] [ x ; Ad x ] = [ c ; 0 ],    (9)

and the problem of the nonzero bottom right block must be overcome.

Denote the problem, using more general notation, as

    A [ u ; v ] = [ H   B^T ; B   −C ] [ u ; v ] = [ f ; g ],    (10)

58 / 64

Page 136

The null space approach to solve sparse/dense LS problems

Theorem

Consider the saddle-point problem above with rank(B) = r ≤ k and C SPD. Assume that E = (Z  Y) ∈ R^{n×n} is nonsingular, where Z ∈ R^{n×(n−r)} is such that BE = (0_{k,n−r}   Br), Br ∈ R^{k×r}, Br ≠ 0. Set Ē to be

    Ē = [ E   0_{n,k} ; 0_{k,n}   I_k ] ∈ R^{(n+k)×(n+k)}.

If H is positive definite on the null space of B, then Ā = Ē^T A Ē is symmetric and invertible, with SPD leading principal submatrix Z^T H Z of order n − r. The transformed saddle-point system is

    Ā [ ū ; v̄ ] = Ē^T [ f ; g ],    [ u ; v ] = Ē [ ū ; v̄ ].

59 / 64

Page 140

The null space approach to solve sparse/dense LS problems

Recall the LS problem in the saddle-point formulation:

    [ Cs   Ad^T ; Ad   −I ] [ x ; Ad x ] = [ c ; 0 ].    (11)

Lemma

Consider A = ( As ; Ad ), As ∈ R^{ms×n}, Ad ∈ R^{md×n}. If A is of full rank, then Cs = As^T As is positive definite on the null space of Ad.

It does not matter whether As itself has full column rank (a bunch of specific treatments is needed for this in other solution approaches).

Using a sparse basis for the range space is enough to have a well-defined solver.
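The Lemma is easy to probe numerically even when As itself is column rank deficient; a toy sketch (the rank-deficient As is deliberately constructed, all data random):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
ms, md, n = 10, 2, 6

As = rng.standard_normal((ms, n))
As[:, -1] = As[:, 0]                # make As column rank deficient on purpose
Ad = rng.standard_normal((md, n))

A = np.vstack([As, Ad])
assert np.linalg.matrix_rank(A) == n   # the stacked A still has full rank

Cs = As.T @ As
Z = null_space(Ad)                  # basis of N(Ad)
eigs = np.linalg.eigvalsh(Z.T @ Cs @ Z)
print(eigs.min() > 0)               # Cs is positive definite on N(Ad)
```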

60 / 64

Page 143

The null space approach to solve sparse/dense LS problems

The solution components of the null-space approach:

1 Null-space basis computation for B ≡ Ad: the sparse null-space basis of a wide dense matrix.

2 Preconditioner construction for the transformed system: the Hessian is transformed by a matrix E = (Z  Y) ∈ R^{n×n} having both a null-space basis and a range-space basis:

    [ Z^T H Z     Z^T H Y     0_{n−r,k} ]
    [ Y^T H Z     Y^T H Y     Br^T      ]
    [ 0_{k,n−r}   Br          −C        ]

61 / 64

Page 144

The null space approach to solve sparse/dense LS problems

Threshold thresh = 0.73

[Sparsity pattern plot, nz = 5258.]

LP_AGG (615 × 488, UFL Sparse Matrix Collection) + 10 dense rows.

62 / 64

Conclusions

We will comment mainly on the space for subsequent research

ASD: still a lot of theoretical challenges.

Stretching implies new concepts to find dense rows, based on the ways to split and embed their row structures.

Combination with the structure of the factorization?

Stretching implies a new approach to solve the problem of null columns of As: stretching makes a bridge between As and Ad and keeps As of full column rank.

Another stretching problem: the stretched rows may be processed differently from the rest (?)

Null-space approach: a viable strategy as well as a way to get over the singularity of As (Scott, T., 2019, in preparation).

Many more questions than expected at the beginning.

63 / 64

Page 153

Last but not least

Thank you for your attention!

64 / 64