The Kronecker Product SVD (Cornell University)
The Kronecker Product SVD
Charles Van Loan
October 19, 2009
The Kronecker Product
B ⊗ C is a block matrix whose (i,j)-th block is bij·C.

E.g.,

    [ b11  b12 ]           [ b11·C  b12·C ]
    [ b21  b22 ]  ⊗  C  =  [ b21·C  b22·C ]
Replicated Block Structure
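A minimal numpy sanity check of this block definition (not part of the talk; `np.kron` implements exactly this replicated block structure):

```python
import numpy as np

# The (i,j) block of B ⊗ C is b_ij * C.
B = np.array([[1.0, 2.0],
              [3.0, 4.0]])
C = np.arange(9.0).reshape(3, 3)

K = np.kron(B, C)              # 6-by-6: a 2-by-2 grid of 3-by-3 blocks

assert K.shape == (6, 6)
assert np.allclose(K[0:3, 3:6], B[0, 1] * C)   # block (1,2) equals b12*C
assert np.allclose(K[3:6, 0:3], B[1, 0] * C)   # block (2,1) equals b21*C
```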
The KP-SVD
If

    A = [ A11  · · ·  A1N ]
        [  :    . .    :  ]        Aij ∈ IR^(p×q)
        [ AM1  · · ·  AMN ]

then there exists a positive integer rA with rA ≤ MN so that

    A = Σ_{k=1}^{rA} σk Bk ⊗ Ck        rA = rankKP(A)

The KP-singular values: σ1 ≥ · · · ≥ σ_rA > 0.

The Bk ∈ IR^(M×N) and Ck ∈ IR^(p×q) satisfy ⟨Bi, Bj⟩ = δij and
⟨Ci, Cj⟩ = δij where ⟨F, G⟩ = trace(F^T G).
Nearness Property
Let r be a positive integer that satisfies r ≤ rA. The problem

    min_{rankKP(X) = r} ‖A − X‖_F

is solved by setting

    X(opt) = Σ_{k=1}^{r} σk Bk ⊗ Ck.
Talk Outline
1. Survey of Essential KP Properties. Just enough to get through the talk.
2. Computing the KP-SVD. It's an SVD computation.
3. Nearest KP Preconditioners. Solving KP systems is fast.
4. Some Constrained Nearest KP Problems. Nearest (Markov) ⊗ (Markov).
5. Multilinear Connections. A low-rank approximation of a 4-dimensional tensor.
6. Off-The-Wall / Just-For-Fun. Computing log(det(A)) for large sparse pos def A.
Essential KP Properties
Every bij ckl Shows Up
    [ b11  b12 ]     [ c11  c12  c13 ]
    [ b21  b22 ]  ⊗  [ c21  c22  c23 ]
                     [ c31  c32  c33 ]

      [ b11c11  b11c12  b11c13  b12c11  b12c12  b12c13 ]
      [ b11c21  b11c22  b11c23  b12c21  b12c22  b12c23 ]
    = [ b11c31  b11c32  b11c33  b12c31  b12c32  b12c33 ]
      [ b21c11  b21c12  b21c13  b22c11  b22c12  b22c13 ]
      [ b21c21  b21c22  b21c23  b22c21  b22c22  b22c23 ]
      [ b21c31  b21c32  b21c33  b22c31  b22c32  b22c33 ]
Hierarchical
    A  =  [ b11  b12 ]  ⊗  [ c11  c12  c13  c14 ]  ⊗  [ d11  d12  d13 ]
          [ b21  b22 ]     [ c21  c22  c23  c24 ]     [ d21  d22  d23 ]
                           [ c31  c32  c33  c34 ]     [ d31  d32  d33 ]
                           [ c41  c42  c43  c44 ]

A is a 2-by-2 block matrix whose entries are 4-by-4 block matrices whose entries are 3-by-3 matrices.
Algebra
    (B ⊗ C)^T = B^T ⊗ C^T
    (B ⊗ C)^{-1} = B^{-1} ⊗ C^{-1}
    (B ⊗ C)(D ⊗ F) = BD ⊗ CF
    B ⊗ (C ⊗ D) = (B ⊗ C) ⊗ D

No:  B ⊗ C ≠ C ⊗ B
Yes: B ⊗ C = (Perfect Shuffle)(C ⊗ B)(Perfect Shuffle)^T
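These identities are easy to check numerically. A minimal numpy sketch (the explicit shuffle construction below is our own index bookkeeping, not code from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((2, 2))
C = rng.standard_normal((3, 3))
D = rng.standard_normal((2, 2))
F = rng.standard_normal((3, 3))

# Mixed product rule and factor-wise transpose/inverse:
assert np.allclose(np.kron(B, C) @ np.kron(D, F), np.kron(B @ D, C @ F))
assert np.allclose(np.kron(B, C).T, np.kron(B.T, C.T))
assert np.allclose(np.linalg.inv(np.kron(B, C)),
                   np.kron(np.linalg.inv(B), np.linalg.inv(C)))

# B ⊗ C and C ⊗ B agree up to a perfect shuffle permutation S
# (the commutation matrix that maps vec(X) to vec(X^T)):
p, q = B.shape[0], C.shape[0]
S = np.zeros((p * q, p * q))
for i in range(q):
    for j in range(p):
        S[i * p + j, j * q + i] = 1.0
assert np.allclose(np.kron(C, B), S @ np.kron(B, C) @ S.T)
```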
The vec Operation
Turns matrices into vectors by stacking columns:

    X = [ 1  10 ]                 [  1 ]
        [ 2  20 ]   ⇒   vec(X) =  [  2 ]
        [ 3  30 ]                 [  3 ]
                                  [ 10 ]
                                  [ 20 ]
                                  [ 30 ]

Important special case, vec of a rank-1 matrix:

         ( [ 1 ]            )     [ 1  ]     [ 1 ]
    vec  ( [ 2 ] [ 1  10 ]  )  =  [ 10 ]  ⊗  [ 2 ]
         ( [ 3 ]            )                [ 3 ]
Reshaping
The matrix equation

    Y = C X B^T

can be reshaped into a vector equation

    vec(Y) = (B ⊗ C) vec(X)

Implies fast linear equation solving and fast matrix-vector multiplication. (More later.)
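A quick numerical check of the vec identity (our own sketch; note `order="F"` because vec stacks columns while numpy is row-major by default):

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((5, 4))

def vec(M):
    # Column-stacking vec operation.
    return M.reshape(-1, order="F")

# Y = C X B^T reshapes to vec(Y) = (B ⊗ C) vec(X); computing C @ X @ B.T
# is far cheaper than ever forming the 10-by-12 matrix B ⊗ C.
Y = C @ X @ B.T
assert np.allclose(vec(Y), np.kron(B, C) @ vec(X))
```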
Inheriting Structure
If B and C are ...              then B ⊗ C is ...

    nonsingular                     nonsingular
    lower (upper) triangular        lower (upper) triangular
    banded                          block banded
    symmetric                       symmetric
    positive definite               positive definite
    stochastic                      stochastic
    Toeplitz                        block Toeplitz
    permutations                    a permutation
    orthogonal                      orthogonal
Computing the KP-SVD
Warm-Up: The Nearest KP Problem
Given A ∈ IR^(m×n) with m = m1·m2 and n = n1·n2, find B ∈ IR^(m1×n1) and C ∈ IR^(m2×n2) so that

    φ(B, C) = ‖A − B ⊗ C‖_F = min

A bilinear least squares problem: fix B (or C) and it becomes linear in C (or B).
Reshaping the Nearest KP Problem
φ(B, C) =

    ‖ [ a11 a12 a13 a14 ]   [ b11 b12 ]               ‖
    ‖ [ a21 a22 a23 a24 ]   [ b21 b22 ]   [ c11 c12 ] ‖
    ‖ [ a31 a32 a33 a34 ] − [ b31 b32 ] ⊗ [ c21 c22 ] ‖
    ‖ [ a41 a42 a43 a44 ]                             ‖
    ‖ [ a51 a52 a53 a54 ]                             ‖
    ‖ [ a61 a62 a63 a64 ]                             ‖_F

  =

    ‖ [ a11 a21 a12 a22 ]   [ b11 ]                     ‖
    ‖ [ a31 a41 a32 a42 ]   [ b21 ]                     ‖
    ‖ [ a51 a61 a52 a62 ] − [ b31 ] [ c11 c21 c12 c22 ] ‖
    ‖ [ a13 a23 a14 a24 ]   [ b12 ]                     ‖
    ‖ [ a33 a43 a34 a44 ]   [ b22 ]                     ‖
    ‖ [ a53 a63 a54 a64 ]   [ b32 ]                     ‖_F
!!! Finding the nearest rank-1 matrix is an SVD problem !!!
SVD Primer
    A ∈ IR^(m×n)  ⇒  U^T A V = Σ = diag(σ1, . . . , σn)

If U = [u1 | u2 | · · · | um] and V = [v1 | v2 | · · · | vn] then

• The rank-1 matrix σ1·u1·v1^T solves

    min_{rank(X) = 1} ‖A − X‖_F

• v1 is the dominant eigenvector of A^T A:

    A^T A v1 = σ1² v1        A v1 = σ1 u1        σ1 = u1^T A v1

• u1 is the dominant eigenvector of A A^T:

    A A^T u1 = σ1² u1        A^T u1 = σ1 v1      σ1 = v1^T A^T u1
Sol’n: SVD of Permuted A + Reshaping
φ(B, C) =

    ‖ [ a11 a12 a13 a14 ]   [ b11 b12 ]               ‖
    ‖ [ a21 a22 a23 a24 ]   [ b21 b22 ]   [ c11 c12 ] ‖
    ‖ [ a31 a32 a33 a34 ] − [ b31 b32 ] ⊗ [ c21 c22 ] ‖
    ‖ [ a41 a42 a43 a44 ]                             ‖
    ‖ [ a51 a52 a53 a54 ]                             ‖
    ‖ [ a61 a62 a63 a64 ]                             ‖_F

  =

    ‖ [ a11 a21 a12 a22 ]   [ b11 ]                     ‖
    ‖ [ a31 a41 a32 a42 ]   [ b21 ]                     ‖
    ‖ [ a51 a61 a52 a62 ] − [ b31 ] [ c11 c21 c12 c22 ] ‖
    ‖ [ a13 a23 a14 a24 ]   [ b12 ]                     ‖
    ‖ [ a33 a43 a34 a44 ]   [ b22 ]                     ‖
    ‖ [ a53 a63 a54 a64 ]   [ b32 ]                     ‖_F
General Solution Procedure
Minimize

    φ(B, C) = ‖A − B ⊗ C‖_F = ‖ Ã − vec(B)vec(C)^T ‖_F

where Ã is the rearranged matrix

        [ vec(A11)^T ]
        [ vec(A21)^T ]
    Ã = [ vec(A31)^T ]
        [ vec(A12)^T ]
        [ vec(A22)^T ]
        [ vec(A32)^T ]

Solution: Compute the SVD U^T Ã V = Σ and set

    vec(B(opt)) = √σ1 · U(:, 1)        vec(C(opt)) = √σ1 · V(:, 1).
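The whole procedure fits in a few lines of numpy. A sketch (the function name, argument order, and helper code are ours; the block ordering follows the slide's A11, A21, ..., AM1, A12, ... convention):

```python
import numpy as np

def nearest_kron(A, M, N, p, q):
    """Nearest B ⊗ C (B M-by-N, C p-by-q) to A in the Frobenius norm."""
    # Build the rearranged matrix Ã whose rows are vec(A_ij)^T.
    Atilde = np.empty((M * N, p * q))
    r = 0
    for j in range(N):
        for i in range(M):
            block = A[i * p:(i + 1) * p, j * q:(j + 1) * q]
            Atilde[r, :] = block.reshape(-1, order="F")   # vec(A_ij)
            r += 1
    # Dominant singular triple of Ã gives the minimizer.
    U, s, Vt = np.linalg.svd(Atilde, full_matrices=False)
    B = np.sqrt(s[0]) * U[:, 0].reshape(M, N, order="F")
    C = np.sqrt(s[0]) * Vt[0, :].reshape(p, q, order="F")
    return B, C

# If A is exactly a Kronecker product, the minimizer recovers it.
B0 = np.array([[1.0, 2.0], [3.0, 4.0]])
C0 = np.array([[0.0, 1.0], [2.0, 3.0]])
B, C = nearest_kron(np.kron(B0, C0), 2, 2, 2, 2)
assert np.allclose(np.kron(B, C), np.kron(B0, C0))
```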
Lanczos SVD Algorithm
Need to compute the dominant eigenvector v1 of Ã^T Ã and the dominant eigenvector u1 of Ã Ã^T. The power method approach:

    b = initial guess of v1;  c = initial guess of u1;  s = c^T Ã b;
    while ( ‖ Ã b − s·c ‖2  ≈  ‖ Ã v1 − σ1 u1 ‖2  is too big )
        c = Ã b;    c = c / ‖c‖2;
        b = Ã^T c;  b = b / ‖b‖2;  s = c^T Ã b;
    end

The Lanczos method is better than this because it uses more than just the most recent b and c vectors. It too lives off of matrix-vector products, i.e., is "sparse friendly."
The Nearest KP-rank r Problem
Use Block Lanczos.
E.g., to minimize

    ‖A − B1 ⊗ C1 − B2 ⊗ C2 − B3 ⊗ C3‖_F

use block Lanczos SVD with block width 3 and set

    vec(B(opt)_i) = √σi · U(:, i)        vec(C(opt)_i) = √σi · V(:, i)        i = 1:3
The Complete KP-SVD
Given:

    A = [ A11  · · ·  A1N ]
        [  :    . .    :  ]        Aij ∈ IR^(p×q)
        [ AM1  · · ·  AMN ]

Form the rearranged matrix Ã (MN-by-pq) and apply the LAPACK SVD:

    Ã = Σ_{i=1}^{rA} σi ui vi^T

Then:

    A = Σ_{i=1}^{rA} σi · reshape(ui, M, N) ⊗ reshape(vi, p, q)
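Extending the rank-1 sketch above to all terms (again a sketch with our own names): every singular triple of the rearranged matrix becomes one σi Bi ⊗ Ci term, and summing all rA of them reproduces A exactly.

```python
import numpy as np

def kp_svd(A, M, N, p, q):
    """Return the KP-SVD terms (sigma_i, B_i, C_i) of the block matrix A."""
    Atilde = np.empty((M * N, p * q))
    r = 0
    for j in range(N):
        for i in range(M):
            Atilde[r, :] = A[i*p:(i+1)*p, j*q:(j+1)*q].reshape(-1, order="F")
            r += 1
    U, s, Vt = np.linalg.svd(Atilde, full_matrices=False)
    terms = []
    for i, sigma in enumerate(s):
        if sigma <= 1e-12 * s[0]:
            break                             # rank_KP(A) terms only
        terms.append((sigma,
                      U[:, i].reshape(M, N, order="F"),
                      Vt[i, :].reshape(p, q, order="F")))
    return terms

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))               # 3-by-3 grid of 2-by-2 blocks
terms = kp_svd(A, 3, 3, 2, 2)
A_rebuilt = sum(sigma * np.kron(Bi, Ci) for sigma, Bi, Ci in terms)
assert np.allclose(A, A_rebuilt)              # all rA terms reproduce A
```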
The Theorems Follow From This
    A   ⇐⇒   Ã
    ⇕         ⇕
    A = Σ_{i=1}^{rA} σi Bi ⊗ Ci   ⇐⇒   Ã = Σ_{i=1}^{rA} σi ui vi^T
A Related Problem
Problem. Find X and Y to minimize

    ‖A − (X ⊗ Y − Y ⊗ X)‖_F

Solution. Find vectors x and y so

    ‖ Ã − (x y^T − y x^T) ‖_F

is minimized and reshape x and y to get X(opt) and Y(opt).

The Schur decomposition of Ã − Ã^T is involved.
Another Related Problem
Problem. Find X to minimize

    ‖A − X ⊗ X‖_F

Solution. Find a vector x so

    ‖ Ã − x x^T ‖_F

is minimized and reshape to get X(opt).

The Schur decomposition of Ã + Ã^T is involved.
A Much More Difficult Problem
    min_{B, C, D} ‖A − B ⊗ C ⊗ D‖_F
Computational multilinear algebra is filled with problems like this.
Nearest KP Preconditioners
Main Idea
(i) Suppose A is an N-by-N block matrix with p-by-p blocks.

(ii) Need to solve Ax = b. Ordinarily this is O(N³p³).

(iii) A system of the form

    (B1 ⊗ C1 + B2 ⊗ C2) z = r

can be solved in O(N³ + p³) time. Hint: C1 Z B1^T + C2 Z B2^T = R.

(iv) If

    B1 ⊗ C1 + B2 ⊗ C2 ≈ A
we have a potential preconditioner.
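For the one-term case the fast solve is just the vec identity run backwards; a sketch under our own variable names (the two-term case in the hint is a generalized Sylvester equation, solved by similar reshaping):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 5, 4
B = rng.standard_normal((N, N)) + N * np.eye(N)   # keep both well conditioned
C = rng.standard_normal((p, p)) + p * np.eye(p)
r = rng.standard_normal(N * p)

# Solving (B ⊗ C) z = r directly costs O(N^3 p^3).  Reshaping r into a
# p-by-N matrix R and solving C Z B^T = R costs only O(N^3 + p^3):
R = r.reshape(p, N, order="F")
Z = np.linalg.solve(C, np.linalg.solve(B, R.T).T)  # Z = C^{-1} R B^{-T}
z = Z.reshape(-1, order="F")

assert np.allclose(np.kron(B, C) @ z, r)
```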
A Block Toeplitz De-Blurring Problem
(Nagy, O’Leary, Kamm (1998))
Need to solve a large block Toeplitz system Tx = b.

Preconditioner:

    T ≈ T1 ⊗ T2

Can solve the nearest KP problem with the constraint that the factor matrices T1 and T2 are Toeplitz.
Poisson-Related Problems

Poisson's equation on a rectangle with a regular (M+1)-by-(N+1) grid discretizes to

    A u = (I_M ⊗ T_N + T_M ⊗ I_N) u = f

where the T's are 1-2-1 tridiagonals. Can be solved very fast.

A new method for the Navier-Stokes problem being developed by Diamessis and Escobar-Vargas leads to a linear system where the highly structured A-matrix has KP-rank rA = 16.
Looking for a KP-preconditioner M of the form

    M = B1 ⊗ C1 + B2 ⊗ C2
Some Constrained Nearest KP Problems
Joint with Stefan Ragnarsson
NOT Inheriting Structure
In the

    min_{B,C} ‖A − B ⊗ C‖_F

problem, sometimes B and C fail to inherit A's special attributes.

If A is { Stochastic, Orthogonal } then B and C are not quite { Stochastic, Orthogonal }.
KP Approximation of Stochastic Matrices
If A ∈ IR^(n×n), B ∈ IR^(n1×n1), and C ∈ IR^(n2×n2), and

    A = B ⊗ C = stochastic ⊗ stochastic

then each A-entry has the form bij·cpq. The states are clustered into groups G1, . . . , Gn1, each of size n2, and

    bij = prob(Gj → Gi)
    cpq = prob(state q → state p within any group)

References:

"Aggregation of Stochastic Automata Networks with Replicas" (A. Benoit, L. Brenner, P. Fernandes, B. Plateau)

"Analyzing Markov Chains Using Kronecker Products" (T. Dayar)
A Bilinear Optimization Strategy
Given an initial guess C...

    Repeat until converged:
        min_{B stochastic} ‖A − B ⊗ C‖_F        (C fixed)
        min_{C stochastic} ‖A − B ⊗ C‖_F        (B fixed)
    end

These are linear, constrained least squares problems.
Reshaping
The problem

    min_{C stochastic} ‖A − B ⊗ C‖_F        (B fixed)

is equivalent to

    min_{x ≥ 0, Ex = e} ‖Mx − f‖_2

where M = I ⊗ B, f = vec(A), x = vec(C), e = ones(m, 1), and E = I_m ⊗ e^T.

The linear constraint forces C (a.k.a. x) to have unit column sums.
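A rough sketch of the alternating scheme in numpy (entirely our own simplification: it uses the closed-form unconstrained least squares updates plus a crude clip-and-normalize projection in place of a true constrained NNLS solver; all names are ours):

```python
import numpy as np

def blocks(A, n1, n2):
    # The n1-by-n1 grid of n2-by-n2 blocks of A.
    return [[A[i*n2:(i+1)*n2, j*n2:(j+1)*n2] for j in range(n1)]
            for i in range(n1)]

def project_stochastic(M):
    # Heuristic projection: clip to positive, renormalize column sums to 1.
    M = np.clip(M, 1e-12, None)
    return M / M.sum(axis=0, keepdims=True)

def stochastic_kp(A, n1, n2, iters=20):
    """Alternating sketch of min ||A - B ⊗ C||_F over stochastic B, C."""
    Aij = blocks(A, n1, n2)
    C = project_stochastic(np.ones((n2, n2)))
    for _ in range(iters):
        # B fixed-point: b_ij = <A_ij, C> / <C, C>, then project.
        B = np.array([[np.sum(Aij[i][j] * C) for j in range(n1)]
                      for i in range(n1)]) / np.sum(C * C)
        B = project_stochastic(B)
        # C fixed-point: C = sum b_ij A_ij / sum b_ij^2, then project.
        C = sum(B[i, j] * Aij[i][j] for i in range(n1) for j in range(n1))
        C = project_stochastic(C / np.sum(B * B))
    return B, C

B0 = project_stochastic(np.random.default_rng(4).random((2, 2)))
C0 = project_stochastic(np.random.default_rng(5).random((3, 3)))
B, C = stochastic_kp(np.kron(B0, C0), 2, 3)
assert np.allclose(B.sum(axis=0), 1.0) and np.allclose(C.sum(axis=0), 1.0)
assert np.allclose(np.kron(B, C), np.kron(B0, C0))   # exact-KP input recovered
```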
Example
If

    A = [ 0.2444  0.1950  0.2129  0.1850  0.1202  0.1682 ]
        [ 0.2367  0.2712  0.2526  0.2573  0.1857  0.2249 ]
        [ 0.1811  0.2348  0.2236  0.1415  0.2900  0.1481 ]
        [ 0.1198  0.0949  0.1105  0.1147  0.0822  0.1802 ]
        [ 0.1422  0.1091  0.0938  0.1709  0.1405  0.1570 ]
        [ 0.0757  0.0949  0.1065  0.1306  0.1813  0.1217 ]

then the matrices B and C obtained by the unconstrained SVD minimization of ‖A − B ⊗ C‖_F are approximately stochastic:

    B = [ 0.6842  0.5890 ]        C = [ 0.3301  0.2449  0.3246 ]
        [ 0.3158  0.4320 ]            [ 0.3925  0.3542  0.3657 ]
                                      [ 0.2611  0.3993  0.2963 ]
Example (Cont’d)
Using

    B = [ 0.6842  0.5890 ]        C = [ 0.3301  0.2449  0.3246 ]
        [ 0.3158  0.4320 ]            [ 0.3925  0.3542  0.3657 ]
                                      [ 0.2611  0.3993  0.2963 ]

as the initial guess for the successive nonnegative least squares iteration, we get the "exactly" stochastic matrices

    BLS = [ 0.6823  0.5776 ]      CLS = [ 0.3359  0.2449  0.3289 ]
          [ 0.3177  0.4224 ]            [ 0.3984  0.3552  0.3704 ]
                                        [ 0.2658  0.3998  0.3008 ]
Work per iteration is roughly quadratic in the dimension of A.
MathWorks Optimization Toolbox and PROPACK (R.M. Larsen).
A Note on Ordering
This problem assumes that we know how to group the states:

    min_{B,C stochastic} ‖A − B ⊗ C‖_F

This doesn't:

    min_{B,C stochastic, P permutation} ‖P A P^T − B ⊗ C‖_F
The Inverse Times Table Problem
Suppose we have the stationary vector xA for A, i.e.,

    A xA = xA        xA > 0        sum(xA) = 1

Then

    P A P^T = B ⊗ C,   B xB = xB,   C xC = xC   ⇒   P xA = xB ⊗ xC

If we know xA, can we figure out P so that P xA is the Kronecker product of two smaller vectors?
Inverse TT Cont’d
Suppose

    xA = [ 2 3 4 6 7 9 10 12 16 18 21 24 27 30 32 36 56 63 80 90 ]^T

and we seek a permutation P ∈ IR^(20×20) so that

    P xA = [ c1 ]     [ b1 ]
           [ c2 ]     [ b2 ]
           [ c3 ]  ⊗  [ b3 ]
           [ c4 ]     [ b4 ]
                      [ b5 ]

What are xB and xC?
Inverse TT Cont’d
    xA = [ 2 3 4 6 7 9 10 12 16 18 21 24 27 30 32 36 56 63 80 90 ]^T

                          [ b1 ]                      [ 24   9   3  27 ]     [ 3  ]
                          [ b2 ]                      [ 56  21   7  63 ]     [ 7  ]
    reshape(P xA, 5, 4) = [ b3 ] [ c1 c2 c3 c4 ]  =   [ 80  30  10  90 ]  =  [ 10 ] [ 8  3  1  9 ]
                          [ b4 ]                      [ 16   6   2  18 ]     [ 2  ]
                          [ b5 ]                      [ 32  12   4  36 ]     [ 4  ]
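A numpy check of this inverse times table (our own verification, not from the talk): reshaping the permuted vector column-major exposes the rank-1 structure b c^T, and the entries are exactly the multiset from the slide.

```python
import numpy as np

b = np.array([3.0, 7.0, 10.0, 2.0, 4.0])
c = np.array([8.0, 3.0, 1.0, 9.0])

# P x_A = c ⊗ b, so reshape(P x_A, 5, 4) (column-major) is the rank-1 b c^T:
x = np.kron(c, b)
X = x.reshape(5, 4, order="F")
assert np.allclose(X, np.outer(b, c))
assert np.linalg.matrix_rank(X) == 1

# The entries are exactly the sorted multiset x_A from the slide:
assert sorted(x) == [2, 3, 4, 6, 7, 9, 10, 12, 16, 18, 21,
                     24, 27, 30, 32, 36, 56, 63, 80, 90]
```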
Quick Aside: Nearest Orthogonal KP
    A = [ −.447   .032  −.528   .384   .031   .406   .308   .006  −.330 ]
        [ −.497  −.243   .464   .367   .183  −.308   .320   .187   .274 ]
        [ −.205   .654   .150   .105  −.494  −.134   .138  −.442   .132 ]
        [ −.381  −.021  −.406  −.006  −.004  −.019  −.562   .022   .609 ]
        [ −.404  −.167   .327  −.021  −.003   .001  −.562  −.290  −.548 ]
        [ −.107   .530   .131  −.003  −.024   .011  −.218   .777  −.191 ]
        [ −.299  −.022  −.342  −.559  −.000  −.608   .236   .048  −.225 ]
        [ −.298  −.157   .254  −.581  −.274   .568   .208   .105   .175 ]
        [ −.104   .419   .097  −.233   .802   .165   .059  −.257   .087 ]

      ≈ (3-by-3 Orthogonal B) ⊗ (3-by-3 Orthogonal C)
Nearest Orthogonal KP (Cont’d)
The unconstrained KP-SVD minimization gives

    B0 = [ 0.7042  −0.5335  −0.2713 ]      C0 = [ −0.7433  −0.0069  −0.4633 ]
         [ 0.5563   0.0030   0.4618 ]           [ −0.7671  −0.3743   0.3931 ]
         [ 0.4412   0.8460  −0.1679 ]           [ −0.2822   1.0394   0.1272 ]

but ‖B0^T B0 − I3‖2 ≈ ‖C0^T C0 − I3‖2 ≈ .643.

After 2 iterations of alternating bilinear clean-up:

    BLS = [ 0.7025  −0.5305  −0.4745 ]     CLS = [ −0.6701  −0.0123  −0.7422 ]
          [ 0.5607   0.0019   0.8280 ]           [ −0.6962  −0.3365   0.6341 ]
          [ 0.4383   0.8477  −0.2988 ]           [ −0.2576   0.9416   0.2169 ]

giving ‖BLS^T BLS − I3‖2 ≈ ‖CLS^T CLS − I3‖2 ≈ 10^(−4).
Nearest Orthogonal KP (Cont’d)
The problem

    min_{C orthogonal} ‖A − B ⊗ C‖_F        (B fixed)

is equivalent to an orthogonal procrustes problem with a simple SVD solution:

    U^T ( Σ_{i,j} bij Aij ) V = Σ        Copt = U V^T
Bojanczyk and Lutoborski (2003) solved a related problem.
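One half-step of that alternating scheme is a few lines of numpy; a sketch with our own names (with B fixed, the weighted block sum is fed to a procrustes-style SVD):

```python
import numpy as np

def nearest_orthogonal_C(A_blocks, B):
    """With B fixed, the orthogonal C minimizing ||A - B ⊗ C||_F."""
    # Form the weighted block sum  G = sum_ij b_ij A_ij ...
    G = sum(B[i, j] * A_blocks[i][j]
            for i in range(B.shape[0]) for j in range(B.shape[1]))
    # ... and take C = U V^T from its SVD (the procrustes solution).
    U, _, Vt = np.linalg.svd(G)
    return U @ Vt

rng = np.random.default_rng(6)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))    # a random orthogonal C-factor
B0, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # a random orthogonal B
A = np.kron(B0, Q)
A_blocks = [[A[3*i:3*i+3, 3*j:3*j+3] for j in range(3)] for i in range(3)]

C = nearest_orthogonal_C(A_blocks, B0)
assert np.allclose(C.T @ C, np.eye(3))    # C is orthogonal
assert np.allclose(C, Q)                  # and recovers the exact factor
```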
A Wireless Bandwidth Problem
Given H1, . . . , HN ∈ IR^(p×q), find C ∈ IR^(p×r) and W ∈ IR^(q×r) with orthonormal columns so that

    ψ(C, W) = Σ_{k=1}^{N} σ1(C^T Hk W)²

is maximized.
(Joint with J. Nsenga (CETIC), S. Ragnarsson)
A Wireless Bandwidth Problem Cont’d
    ψ(C, W) = Σ_{k=1}^{N} ‖C^T Hk W‖2²  ≤  Σ_{k=1}^{N} ‖C^T Hk W‖_F²

            = Σ_{k=1}^{N} ‖ (C ⊗ W)^T vec(Hk) ‖2²

            = tr( (C ⊗ W)^T S (C ⊗ W) )

where

    S = Σ_{k=1}^{N} vec(Hk) vec(Hk)^T
A Wireless Bandwidth Problem Cont’d
Solution Approach. If

    S ≈ S1 ⊗ S2

then

    ψ(C, W) ≈ tr( (C ⊗ W)^T (S1 ⊗ S2)(C ⊗ W) ) = tr(C^T S1 C) · tr(W^T S2 W)

The trace of Q^T M Q with Q ∈ IR^(n×r) (orthonormal columns) is maximized if ran(Q) is the r-dimensional dominant invariant subspace of M. An "easy computation."
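That "easy computation" is one symmetric eigensolve; a minimal sketch (our own toy data):

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.standard_normal((6, 6))
M = M @ M.T                      # symmetric positive semidefinite
r = 2

# tr(Q^T M Q) over orthonormal-column Q is maximized by the dominant
# invariant subspace: span of the top-r eigenvectors of M.
w, V = np.linalg.eigh(M)         # eigenvalues in ascending order
Q = V[:, -r:]                    # the two dominant eigenvectors
best = np.trace(Q.T @ M @ Q)
assert np.isclose(best, w[-1] + w[-2])

# Any other orthonormal Q does no better (Ky Fan inequality):
Qrand, _ = np.linalg.qr(rng.standard_normal((6, r)))
assert np.trace(Qrand.T @ M @ Qrand) <= best + 1e-10
```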
Connections to Computational Multilinear Algebra
A 2^d-by-2^d Hx = λx Problem

    H = Σ_{i,j=1}^{d} tij Hi^T Hj  +  Σ_{i,j,k,ℓ=1}^{d} vijkℓ Hi^T Hj^T Hk Hℓ

    Hi = I_{2^(i−1)} ⊗ [ 0  1 ] ⊗ I_{2^(d−i)}
                       [ 0  0 ]

    T = T(1:d, 1:d)        V = V(1:d, 1:d, 1:d, 1:d)

Matrix T is symmetric. Tensor V = (v_{i,j,k,ℓ}) also has symmetries.
The H-Matrix

    [spy plot of the sparse H-matrix for d = 10 (1024-by-1024): nz = 104703]

    nzeros = ( (1/64)d⁴ − (3/32)d³ + (27/64)d² − (11/32)d + 1 ) · 2^d − 1
Some Fourth-Order Tensor Symmetries
The tensor V in our problem frequently has these symmetries:

    V(i, j, k, ℓ) = V(j, i, k, ℓ) = V(i, j, ℓ, k) = V(k, ℓ, i, j)
Let’s Flatten V...
    V = [ V(:, :, 1, 1)  V(:, :, 1, 2)  V(:, :, 1, 3)  V(:, :, 1, 4) ]
        [ V(:, :, 2, 1)  V(:, :, 2, 2)  V(:, :, 2, 3)  V(:, :, 2, 4) ]
        [ V(:, :, 3, 1)  V(:, :, 3, 2)  V(:, :, 3, 3)  V(:, :, 3, 4) ]
        [ V(:, :, 4, 1)  V(:, :, 4, 2)  V(:, :, 4, 3)  V(:, :, 4, 4) ]

and see what happens to

    V(i, j, k, ℓ) = V(j, i, k, ℓ) = V(i, j, ℓ, k) = V(k, ℓ, i, j)
Flattened Symmetries
Block Symmetry:

    V(i, j, k, ℓ) = V(i, j, ℓ, k)   ⇒   V_{k,ℓ} = V_{ℓ,k}

Symmetric Blocks:

    V(i, j, k, ℓ) = V(j, i, k, ℓ)   ⇒   V_{ℓ,k} = V_{ℓ,k}^T

Perfect Shuffle Symmetry:

    V(i, j, k, ℓ) = V(k, ℓ, i, j)   ⇒   Π^T V Π = V

where Π is a perfect shuffle permutation.
A Sample V
280 206 100 206 182 187 100 187 296
206 328 188 182 138 148 187 244 143
100 188 176 187 148 122 296 143 326
206 182 187 328 138 244 188 148 143
182 138 148 138 312 192 148 192 212
187 148 122 244 192 272 143 212 200
100 187 296 188 148 143 176 122 326
187 244 143 148 192 212 122 272 200
296 143 326 143 212 200 326 200 280
The KP-SVD of V is Highly Structured
    V = Σ_{i=1}^{r} σi Bi ⊗ Bi        Bi symmetric

If V ≈ σ1 B1 ⊗ B1, then

    V(i, j, k, ℓ) ≈ σ1 B1(i, j) B1(k, ℓ)

and...

    Σ_{i=1}^{d} Σ_{j=1}^{d} Σ_{k=1}^{d} Σ_{ℓ=1}^{d} V(i, j, k, ℓ) · Hi^T Hj^T Hk Hℓ

      ≈ σ1 Σ_{i=1}^{d} Σ_{j=1}^{d} Σ_{k=1}^{d} Σ_{ℓ=1}^{d} B1(i, j) B1(k, ℓ) Hi^T Hj^T Hk Hℓ

      = σ1 ( Σ_{k=1}^{d} Σ_{ℓ=1}^{d} B1(k, ℓ) Hk Hℓ )^T ( Σ_{k=1}^{d} Σ_{ℓ=1}^{d} B1(k, ℓ) Hk Hℓ )

and H-manipulation reduces from O(d⁴) to O(d²).
Just-For-Fun
log(det(A))
The Logarithm of the Determinant
Suppose A ∈ IR^(n×n) is positive definite with eigenvalues λ1, . . . , λn. The problem of computing

    log(det(A)) = log(λ1 · · · λn) = Σ_{k=1}^{n} log(λk)

can arise in certain maximum likelihood estimation settings.
Solution Approaches
(i) If n is modest, then compute the Cholesky factorization A = G G^T and use

    log(det(A)) = log(det(G G^T)) = log(det(G)²) = 2 log(g11 · · · gnn) = 2 Σ_{k=1}^{n} log(gkk)
(ii) If A is large and sparse, then Monte Carlo. See Barry and Pace (1999) and also M. McCourt (2008).
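Approach (i) is a one-liner in numpy; a minimal sketch on synthetic data (checked against `slogdet`):

```python
import numpy as np

rng = np.random.default_rng(8)
G0 = rng.standard_normal((50, 50))
A = G0 @ G0.T + 50 * np.eye(50)          # symmetric positive definite

# log det from the Cholesky factor's diagonal: A = G G^T, G lower triangular.
G = np.linalg.cholesky(A)
logdetA = 2.0 * np.sum(np.log(np.diag(G)))

sign, ref = np.linalg.slogdet(A)
assert sign == 1.0 and np.isclose(logdetA, ref)
```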
Nearest KP Approach

Suppose n = n1·n2 and B ⊗ C is the nearest KP to A, with B ∈ IR^(n1×n1) and C ∈ IR^(n2×n2). It can be shown that B and C are symmetric positive definite and

    log(det(A)) ≈ log(det(B ⊗ C))
                = log( det(B)^(n2) · det(C)^(n1) )
                = n2·log(det(B)) + n1·log(det(C))

I.e., the log(det(A)) problem breaks down into a pair of (much) smaller log det problems.
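A sanity check of the underlying determinant identity in the exact-Kronecker case (our own synthetic SPD factors):

```python
import numpy as np

# Verify  log det(B ⊗ C) = n2 log det(B) + n1 log det(C)
rng = np.random.default_rng(9)
n1, n2 = 3, 4
Mb = rng.standard_normal((n1, n1)); B = Mb @ Mb.T + n1 * np.eye(n1)
Mc = rng.standard_normal((n2, n2)); C = Mc @ Mc.T + n2 * np.eye(n2)

lhs = np.linalg.slogdet(np.kron(B, C))[1]
rhs = n2 * np.linalg.slogdet(B)[1] + n1 * np.linalg.slogdet(C)[1]
assert np.isclose(lhs, rhs)
```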
What If...
What if A ≈ B ⊗ C isn't good enough?

What if

    A ≈ (B1 ⊗ C1)(B2 ⊗ C2)(B3 ⊗ C3)

is good enough, where Bi ∈ IR^(mi×mi) and Ci ∈ IR^((n/mi)×(n/mi)) for i = 1:3?

Then

    log(det(A)) ≈ Σ_{i=1}^{3} ( (n/mi)·log(det(Bi)) + mi·log(det(Ci)) )
Conclusion
The KP-SVD can serve as a bridge from small n problems to large n problems and, more generally, from numerical linear algebra to numerical multilinear algebra.