The Kronecker Product SVD (Cornell University)
The Kronecker Product SVD
Charles Van Loan
October 19, 2009
The Kronecker Product
B ⊗ C is a block matrix whose (i,j)-th block is bij·C.

E.g.,

    [ b11  b12 ]           [ b11·C  b12·C ]
    [ b21  b22 ]  ⊗  C  =  [ b21·C  b22·C ]
Replicated Block Structure
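A minimal numpy sanity check of this block definition (not part of the talk; `np.kron` implements exactly this replicated block structure):

```python
import numpy as np

# The (i,j) block of B ⊗ C is b_ij * C.
B = np.array([[1.0, 2.0],
              [3.0, 4.0]])
C = np.arange(9.0).reshape(3, 3)

K = np.kron(B, C)              # 6-by-6: a 2-by-2 grid of 3-by-3 blocks

assert K.shape == (6, 6)
assert np.allclose(K[0:3, 3:6], B[0, 1] * C)   # block (1,2) equals b12*C
assert np.allclose(K[3:6, 0:3], B[1, 0] * C)   # block (2,1) equals b21*C
```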
The KP-SVD
If

    A = [ A11  · · ·  A1N ]
        [  :    . .    :  ]        Aij ∈ IR^(p×q)
        [ AM1  · · ·  AMN ]

then there exists a positive integer rA with rA ≤ MN so that

    A = Σ_{k=1}^{rA} σk Bk ⊗ Ck        rA = rankKP(A)

The KP-singular values: σ1 ≥ · · · ≥ σ_rA > 0.

The Bk ∈ IR^(M×N) and Ck ∈ IR^(p×q) satisfy ⟨Bi, Bj⟩ = δij and
⟨Ci, Cj⟩ = δij where ⟨F, G⟩ = trace(F^T G).
Nearness Property
Let r be a positive integer that satisfies r ≤ rA. The problem

    min_{rankKP(X) = r} ‖A − X‖_F

is solved by setting

    X(opt) = Σ_{k=1}^{r} σk Bk ⊗ Ck.
Talk Outline
1. Survey of Essential KP Properties. Just enough to get through the talk.
2. Computing the KP-SVD. It's an SVD computation.
3. Nearest KP Preconditioners. Solving KP systems is fast.
4. Some Constrained Nearest KP Problems. Nearest (Markov) ⊗ (Markov).
5. Multilinear Connections. A low-rank approximation of a 4-dimensional tensor.
6. Off-The-Wall / Just-For-Fun. Computing log(det(A)) for large sparse pos def A.
Essential KP Properties
Every bij ckl Shows Up
    [ b11  b12 ]     [ c11  c12  c13 ]
    [ b21  b22 ]  ⊗  [ c21  c22  c23 ]
                     [ c31  c32  c33 ]

      [ b11c11  b11c12  b11c13  b12c11  b12c12  b12c13 ]
      [ b11c21  b11c22  b11c23  b12c21  b12c22  b12c23 ]
    = [ b11c31  b11c32  b11c33  b12c31  b12c32  b12c33 ]
      [ b21c11  b21c12  b21c13  b22c11  b22c12  b22c13 ]
      [ b21c21  b21c22  b21c23  b22c21  b22c22  b22c23 ]
      [ b21c31  b21c32  b21c33  b22c31  b22c32  b22c33 ]
Hierarchical
    A  =  [ b11  b12 ]  ⊗  [ c11  c12  c13  c14 ]  ⊗  [ d11  d12  d13 ]
          [ b21  b22 ]     [ c21  c22  c23  c24 ]     [ d21  d22  d23 ]
                           [ c31  c32  c33  c34 ]     [ d31  d32  d33 ]
                           [ c41  c42  c43  c44 ]

A is a 2-by-2 block matrix whose entries are 4-by-4 block matrices whose entries are 3-by-3 matrices.
Algebra
    (B ⊗ C)^T = B^T ⊗ C^T
    (B ⊗ C)^{-1} = B^{-1} ⊗ C^{-1}
    (B ⊗ C)(D ⊗ F) = BD ⊗ CF
    B ⊗ (C ⊗ D) = (B ⊗ C) ⊗ D

No:  B ⊗ C ≠ C ⊗ B
Yes: B ⊗ C = (Perfect Shuffle)(C ⊗ B)(Perfect Shuffle)^T
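These identities are easy to check numerically. A minimal numpy sketch (the explicit shuffle construction below is our own index bookkeeping, not code from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((2, 2))
C = rng.standard_normal((3, 3))
D = rng.standard_normal((2, 2))
F = rng.standard_normal((3, 3))

# Mixed product rule and factor-wise transpose/inverse:
assert np.allclose(np.kron(B, C) @ np.kron(D, F), np.kron(B @ D, C @ F))
assert np.allclose(np.kron(B, C).T, np.kron(B.T, C.T))
assert np.allclose(np.linalg.inv(np.kron(B, C)),
                   np.kron(np.linalg.inv(B), np.linalg.inv(C)))

# B ⊗ C and C ⊗ B agree up to a perfect shuffle permutation S
# (the commutation matrix that maps vec(X) to vec(X^T)):
p, q = B.shape[0], C.shape[0]
S = np.zeros((p * q, p * q))
for i in range(q):
    for j in range(p):
        S[i * p + j, j * q + i] = 1.0
assert np.allclose(np.kron(C, B), S @ np.kron(B, C) @ S.T)
```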
The vec Operation
Turns matrices into vectors by stacking columns:

    X = [ 1  10 ]                 [  1 ]
        [ 2  20 ]   ⇒   vec(X) =  [  2 ]
        [ 3  30 ]                 [  3 ]
                                  [ 10 ]
                                  [ 20 ]
                                  [ 30 ]

Important special case, vec of a rank-1 matrix:

         ( [ 1 ]            )     [ 1  ]     [ 1 ]
    vec  ( [ 2 ] [ 1  10 ]  )  =  [ 10 ]  ⊗  [ 2 ]
         ( [ 3 ]            )                [ 3 ]
Reshaping
The matrix equation

    Y = C X B^T

can be reshaped into a vector equation

    vec(Y) = (B ⊗ C) vec(X)

Implies fast linear equation solving and fast matrix-vector multiplication. (More later.)
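A quick numerical check of the vec identity (our own sketch; note `order="F"` because vec stacks columns while numpy is row-major by default):

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((5, 4))

def vec(M):
    # Column-stacking vec operation.
    return M.reshape(-1, order="F")

# Y = C X B^T reshapes to vec(Y) = (B ⊗ C) vec(X); computing C @ X @ B.T
# is far cheaper than ever forming the 10-by-12 matrix B ⊗ C.
Y = C @ X @ B.T
assert np.allclose(vec(Y), np.kron(B, C) @ vec(X))
```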
Inheriting Structure
If B and C are ...              then B ⊗ C is ...

    nonsingular                     nonsingular
    lower (upper) triangular        lower (upper) triangular
    banded                          block banded
    symmetric                       symmetric
    positive definite               positive definite
    stochastic                      stochastic
    Toeplitz                        block Toeplitz
    permutations                    a permutation
    orthogonal                      orthogonal
Computing the KP-SVD
Warm-Up: The Nearest KP Problem
Given A ∈ IR^(m×n) with m = m1·m2 and n = n1·n2, find B ∈ IR^(m1×n1) and C ∈ IR^(m2×n2) so that

    φ(B, C) = ‖A − B ⊗ C‖_F = min

A bilinear least squares problem: fix B (or C) and it becomes linear in C (or B).
Reshaping the Nearest KP Problem
φ(B, C) =

    ‖ [ a11 a12 a13 a14 ]   [ b11 b12 ]               ‖
    ‖ [ a21 a22 a23 a24 ]   [ b21 b22 ]   [ c11 c12 ] ‖
    ‖ [ a31 a32 a33 a34 ] − [ b31 b32 ] ⊗ [ c21 c22 ] ‖
    ‖ [ a41 a42 a43 a44 ]                             ‖
    ‖ [ a51 a52 a53 a54 ]                             ‖
    ‖ [ a61 a62 a63 a64 ]                             ‖_F

  =

    ‖ [ a11 a21 a12 a22 ]   [ b11 ]                     ‖
    ‖ [ a31 a41 a32 a42 ]   [ b21 ]                     ‖
    ‖ [ a51 a61 a52 a62 ] − [ b31 ] [ c11 c21 c12 c22 ] ‖
    ‖ [ a13 a23 a14 a24 ]   [ b12 ]                     ‖
    ‖ [ a33 a43 a34 a44 ]   [ b22 ]                     ‖
    ‖ [ a53 a63 a54 a64 ]   [ b32 ]                     ‖_F
!!! Finding the nearest rank-1 matrix is an SVD problem !!!
SVD Primer
    A ∈ IR^(m×n)  ⇒  U^T A V = Σ = diag(σ1, . . . , σn)

If U = [u1 | u2 | · · · | um] and V = [v1 | v2 | · · · | vn] then

• The rank-1 matrix σ1·u1·v1^T solves

    min_{rank(X) = 1} ‖A − X‖_F

• v1 is the dominant eigenvector of A^T A:

    A^T A v1 = σ1² v1        A v1 = σ1 u1        σ1 = u1^T A v1

• u1 is the dominant eigenvector of A A^T:

    A A^T u1 = σ1² u1        A^T u1 = σ1 v1      σ1 = v1^T A^T u1
Sol’n: SVD of Permuted A + Reshaping
φ(B, C) =

    ‖ [ a11 a12 a13 a14 ]   [ b11 b12 ]               ‖
    ‖ [ a21 a22 a23 a24 ]   [ b21 b22 ]   [ c11 c12 ] ‖
    ‖ [ a31 a32 a33 a34 ] − [ b31 b32 ] ⊗ [ c21 c22 ] ‖
    ‖ [ a41 a42 a43 a44 ]                             ‖
    ‖ [ a51 a52 a53 a54 ]                             ‖
    ‖ [ a61 a62 a63 a64 ]                             ‖_F

  =

    ‖ [ a11 a21 a12 a22 ]   [ b11 ]                     ‖
    ‖ [ a31 a41 a32 a42 ]   [ b21 ]                     ‖
    ‖ [ a51 a61 a52 a62 ] − [ b31 ] [ c11 c21 c12 c22 ] ‖
    ‖ [ a13 a23 a14 a24 ]   [ b12 ]                     ‖
    ‖ [ a33 a43 a34 a44 ]   [ b22 ]                     ‖
    ‖ [ a53 a63 a54 a64 ]   [ b32 ]                     ‖_F
General Solution Procedure
Minimize

    φ(B, C) = ‖A − B ⊗ C‖_F = ‖ Ã − vec(B)vec(C)^T ‖_F

where Ã is the rearranged matrix

        [ vec(A11)^T ]
        [ vec(A21)^T ]
    Ã = [ vec(A31)^T ]
        [ vec(A12)^T ]
        [ vec(A22)^T ]
        [ vec(A32)^T ]

Solution: Compute the SVD U^T Ã V = Σ and set

    vec(B(opt)) = √σ1 · U(:, 1)        vec(C(opt)) = √σ1 · V(:, 1).
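The whole procedure fits in a few lines of numpy. A sketch (the function name, argument order, and helper code are ours; the block ordering follows the slide's A11, A21, ..., AM1, A12, ... convention):

```python
import numpy as np

def nearest_kron(A, M, N, p, q):
    """Nearest B ⊗ C (B M-by-N, C p-by-q) to A in the Frobenius norm."""
    # Build the rearranged matrix Ã whose rows are vec(A_ij)^T.
    Atilde = np.empty((M * N, p * q))
    r = 0
    for j in range(N):
        for i in range(M):
            block = A[i * p:(i + 1) * p, j * q:(j + 1) * q]
            Atilde[r, :] = block.reshape(-1, order="F")   # vec(A_ij)
            r += 1
    # Dominant singular triple of Ã gives the minimizer.
    U, s, Vt = np.linalg.svd(Atilde, full_matrices=False)
    B = np.sqrt(s[0]) * U[:, 0].reshape(M, N, order="F")
    C = np.sqrt(s[0]) * Vt[0, :].reshape(p, q, order="F")
    return B, C

# If A is exactly a Kronecker product, the minimizer recovers it.
B0 = np.array([[1.0, 2.0], [3.0, 4.0]])
C0 = np.array([[0.0, 1.0], [2.0, 3.0]])
B, C = nearest_kron(np.kron(B0, C0), 2, 2, 2, 2)
assert np.allclose(np.kron(B, C), np.kron(B0, C0))
```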
Lanczos SVD Algorithm
Need to compute the dominant eigenvector v1 of Ã^T Ã and the dominant eigenvector u1 of Ã Ã^T. The power method approach:

    b = initial guess of v1;  c = initial guess of u1;  s = c^T Ã b;
    while ( ‖ Ã b − s·c ‖2  ≈  ‖ Ã v1 − σ1 u1 ‖2  is too big )
        c = Ã b;    c = c / ‖c‖2;
        b = Ã^T c;  b = b / ‖b‖2;  s = c^T Ã b;
    end

The Lanczos method is better than this because it uses more than just the most recent b and c vectors. It too lives off of matrix-vector products, i.e., is "sparse friendly."
The Nearest KP-rank r Problem
Use Block Lanczos.
E.g., to minimize

    ‖A − B1 ⊗ C1 − B2 ⊗ C2 − B3 ⊗ C3‖_F

use block Lanczos SVD with block width 3 and set

    vec(B(opt)_i) = √σi · U(:, i)        vec(C(opt)_i) = √σi · V(:, i)        i = 1:3
The Complete KP-SVD
Given:

    A = [ A11  · · ·  A1N ]
        [  :    . .    :  ]        Aij ∈ IR^(p×q)
        [ AM1  · · ·  AMN ]

Form the rearranged matrix Ã (MN-by-pq) and apply the LAPACK SVD:

    Ã = Σ_{i=1}^{rA} σi ui vi^T

Then:

    A = Σ_{i=1}^{rA} σi · reshape(ui, M, N) ⊗ reshape(vi, p, q)
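Extending the rank-1 sketch above to all terms (again a sketch with our own names): every singular triple of the rearranged matrix becomes one σi Bi ⊗ Ci term, and summing all rA of them reproduces A exactly.

```python
import numpy as np

def kp_svd(A, M, N, p, q):
    """Return the KP-SVD terms (sigma_i, B_i, C_i) of the block matrix A."""
    Atilde = np.empty((M * N, p * q))
    r = 0
    for j in range(N):
        for i in range(M):
            Atilde[r, :] = A[i*p:(i+1)*p, j*q:(j+1)*q].reshape(-1, order="F")
            r += 1
    U, s, Vt = np.linalg.svd(Atilde, full_matrices=False)
    terms = []
    for i, sigma in enumerate(s):
        if sigma <= 1e-12 * s[0]:
            break                             # rank_KP(A) terms only
        terms.append((sigma,
                      U[:, i].reshape(M, N, order="F"),
                      Vt[i, :].reshape(p, q, order="F")))
    return terms

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))               # 3-by-3 grid of 2-by-2 blocks
terms = kp_svd(A, 3, 3, 2, 2)
A_rebuilt = sum(sigma * np.kron(Bi, Ci) for sigma, Bi, Ci in terms)
assert np.allclose(A, A_rebuilt)              # all rA terms reproduce A
```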
The Theorems Follow From This
    A   ⇐⇒   Ã
    ⇕         ⇕
    A = Σ_{i=1}^{rA} σi Bi ⊗ Ci   ⇐⇒   Ã = Σ_{i=1}^{rA} σi ui vi^T
A Related Problem
Problem. Find X and Y to minimize

    ‖A − (X ⊗ Y − Y ⊗ X)‖_F

Solution. Find vectors x and y so

    ‖ Ã − (x y^T − y x^T) ‖_F

is minimized and reshape x and y to get X(opt) and Y(opt).

The Schur decomposition of Ã − Ã^T is involved.
Another Related Problem
Problem. Find X to minimize

    ‖A − X ⊗ X‖_F

Solution. Find a vector x so

    ‖ Ã − x x^T ‖_F

is minimized and reshape to get X(opt).

The Schur decomposition of Ã + Ã^T is involved.
A Much More Difficult Problem
    min_{B, C, D} ‖A − B ⊗ C ⊗ D‖_F
Computational multilinear algebra is filled with problems like this.
Nearest KP Preconditioners
Main Idea
(i) Suppose A is an N-by-N block matrix with p-by-p blocks.

(ii) Need to solve Ax = b. Ordinarily this is O(N³p³).

(iii) A system of the form

    (B1 ⊗ C1 + B2 ⊗ C2) z = r

can be solved in O(N³ + p³) time. Hint: C1 Z B1^T + C2 Z B2^T = R.

(iv) If

    B1 ⊗ C1 + B2 ⊗ C2 ≈ A
we have a potential preconditioner.
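For the one-term case the fast solve is just the vec identity run backwards; a sketch under our own variable names (the two-term case in the hint is a generalized Sylvester equation, solved by similar reshaping):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 5, 4
B = rng.standard_normal((N, N)) + N * np.eye(N)   # keep both well conditioned
C = rng.standard_normal((p, p)) + p * np.eye(p)
r = rng.standard_normal(N * p)

# Solving (B ⊗ C) z = r directly costs O(N^3 p^3).  Reshaping r into a
# p-by-N matrix R and solving C Z B^T = R costs only O(N^3 + p^3):
R = r.reshape(p, N, order="F")
Z = np.linalg.solve(C, np.linalg.solve(B, R.T).T)  # Z = C^{-1} R B^{-T}
z = Z.reshape(-1, order="F")

assert np.allclose(np.kron(B, C) @ z, r)
```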
A Block Toeplitz De-Blurring Problem
(Nagy, O’Leary, Kamm (1998))
Need to solve a large block Toeplitz system Tx = b.

Preconditioner:

    T ≈ T1 ⊗ T2

Can solve the nearest KP problem with the constraint that the factor matrices T1 and T2 are Toeplitz.
Poisson-Related Problems

Poisson's equation on a rectangle with a regular (M+1)-by-(N+1) grid discretizes to

    A u = (I_M ⊗ T_N + T_M ⊗ I_N) u = f

where the T's are 1-2-1 tridiagonals. Can be solved very fast.

A new method for the Navier-Stokes problem being developed by Diamessis and Escobar-Vargas leads to a linear system where the highly structured A-matrix has KP-rank rA = 16.
Looking for a KP-preconditioner M of the form

    M = B1 ⊗ C1 + B2 ⊗ C2
Some Constrained Nearest KP Problems
Joint with Stefan Ragnarsson
NOT Inheriting Structure
In the

    min_{B,C} ‖A − B ⊗ C‖_F

problem, sometimes B and C fail to inherit A's special attributes.

If A is { Stochastic, Orthogonal } then B and C are not quite { Stochastic, Orthogonal }.
KP Approximation of Stochastic Matrices
If A ∈ IR^(n×n), B ∈ IR^(n1×n1), and C ∈ IR^(n2×n2), and

    A = B ⊗ C = stochastic ⊗ stochastic

then each A-entry has the form bij·cpq. The states are clustered into groups G1, . . . , Gn1, each of size n2, and

    bij = prob(Gj → Gi)
    cpq = prob(state q → state p within any group)

References:

"Aggregation of Stochastic Automata Networks with Replicas" (A. Benoit, L. Brenner, P. Fernandes, B. Plateau)

"Analyzing Markov Chains Using Kronecker Products" (T. Dayar)
A Bilinear Optimization Strategy
Given an initial guess C...

    Repeat until converged:
        min_{B stochastic} ‖A − B ⊗ C‖_F        (C fixed)
        min_{C stochastic} ‖A − B ⊗ C‖_F        (B fixed)
    end

These are linear, constrained least squares problems.
Reshaping
The problem

    min_{C stochastic} ‖A − B ⊗ C‖_F        (B fixed)

is equivalent to

    min_{x ≥ 0, Ex = e} ‖Mx − f‖_2

where M = I ⊗ B, f = vec(A), x = vec(C), e = ones(m, 1), and E = I_m ⊗ e^T.

The linear constraint forces C (a.k.a. x) to have unit column sums.
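A rough sketch of the alternating scheme in numpy (entirely our own simplification: it uses the closed-form unconstrained least squares updates plus a crude clip-and-normalize projection in place of a true constrained NNLS solver; all names are ours):

```python
import numpy as np

def blocks(A, n1, n2):
    # The n1-by-n1 grid of n2-by-n2 blocks of A.
    return [[A[i*n2:(i+1)*n2, j*n2:(j+1)*n2] for j in range(n1)]
            for i in range(n1)]

def project_stochastic(M):
    # Heuristic projection: clip to positive, renormalize column sums to 1.
    M = np.clip(M, 1e-12, None)
    return M / M.sum(axis=0, keepdims=True)

def stochastic_kp(A, n1, n2, iters=20):
    """Alternating sketch of min ||A - B ⊗ C||_F over stochastic B, C."""
    Aij = blocks(A, n1, n2)
    C = project_stochastic(np.ones((n2, n2)))
    for _ in range(iters):
        # B fixed-point: b_ij = <A_ij, C> / <C, C>, then project.
        B = np.array([[np.sum(Aij[i][j] * C) for j in range(n1)]
                      for i in range(n1)]) / np.sum(C * C)
        B = project_stochastic(B)
        # C fixed-point: C = sum b_ij A_ij / sum b_ij^2, then project.
        C = sum(B[i, j] * Aij[i][j] for i in range(n1) for j in range(n1))
        C = project_stochastic(C / np.sum(B * B))
    return B, C

B0 = project_stochastic(np.random.default_rng(4).random((2, 2)))
C0 = project_stochastic(np.random.default_rng(5).random((3, 3)))
B, C = stochastic_kp(np.kron(B0, C0), 2, 3)
assert np.allclose(B.sum(axis=0), 1.0) and np.allclose(C.sum(axis=0), 1.0)
assert np.allclose(np.kron(B, C), np.kron(B0, C0))   # exact-KP input recovered
```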
Example
If

    A = [ 0.2444  0.1950  0.2129  0.1850  0.1202  0.1682 ]
        [ 0.2367  0.2712  0.2526  0.2573  0.1857  0.2249 ]
        [ 0.1811  0.2348  0.2236  0.1415  0.2900  0.1481 ]
        [ 0.1198  0.0949  0.1105  0.1147  0.0822  0.1802 ]
        [ 0.1422  0.1091  0.0938  0.1709  0.1405  0.1570 ]
        [ 0.0757  0.0949  0.1065  0.1306  0.1813  0.1217 ]

then the matrices B and C obtained by the unconstrained SVD minimization of ‖A − B ⊗ C‖_F are approximately stochastic:

    B = [ 0.6842  0.5890 ]        C = [ 0.3301  0.2449  0.3246 ]
        [ 0.3158  0.4320 ]            [ 0.3925  0.3542  0.3657 ]
                                      [ 0.2611  0.3993  0.2963 ]
Example (Cont’d)
Using

    B = [ 0.6842  0.5890 ]        C = [ 0.3301  0.2449  0.3246 ]
        [ 0.3158  0.4320 ]            [ 0.3925  0.3542  0.3657 ]
                                      [ 0.2611  0.3993  0.2963 ]

as the initial guess for the successive nonnegative least squares iteration, we get the "exactly" stochastic matrices

    BLS = [ 0.6823  0.5776 ]      CLS = [ 0.3359  0.2449  0.3289 ]
          [ 0.3177  0.4224 ]            [ 0.3984  0.3552  0.3704 ]
                                        [ 0.2658  0.3998  0.3008 ]
Work per iteration is roughly quadratic in the dimension of A.
MathWorks Optimization Toolbox and PROPACK (R.M. Larsen).
A Note on Ordering
This problem assumes that we know how to group the states:

    min_{B,C stochastic} ‖A − B ⊗ C‖_F

This doesn't:

    min_{B,C stochastic, P permutation} ‖P A P^T − B ⊗ C‖_F
The Inverse Times Table Problem
Suppose we have the stationary vector xA for A, i.e.,

    A xA = xA        xA > 0        sum(xA) = 1

Then

    P A P^T = B ⊗ C,   B xB = xB,   C xC = xC   ⇒   P xA = xB ⊗ xC

If we know xA, can we figure out P so that P xA is the Kronecker product of two smaller vectors?
Inverse TT Cont’d
Suppose

    xA = [ 2 3 4 6 7 9 10 12 16 18 21 24 27 30 32 36 56 63 80 90 ]^T

and we seek a permutation P ∈ IR^(20×20) so that

    P xA = [ c1 ]     [ b1 ]
           [ c2 ]     [ b2 ]
           [ c3 ]  ⊗  [ b3 ]
           [ c4 ]     [ b4 ]
                      [ b5 ]

What are xB and xC?
Inverse TT Cont’d
    xA = [ 2 3 4 6 7 9 10 12 16 18 21 24 27 30 32 36 56 63 80 90 ]^T

                          [ b1 ]                      [ 24   9   3  27 ]     [ 3  ]
                          [ b2 ]                      [ 56  21   7  63 ]     [ 7  ]
    reshape(P xA, 5, 4) = [ b3 ] [ c1 c2 c3 c4 ]  =   [ 80  30  10  90 ]  =  [ 10 ] [ 8  3  1  9 ]
                          [ b4 ]                      [ 16   6   2  18 ]     [ 2  ]
                          [ b5 ]                      [ 32  12   4  36 ]     [ 4  ]
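A numpy check of this inverse times table (our own verification, not from the talk): reshaping the permuted vector column-major exposes the rank-1 structure b c^T, and the entries are exactly the multiset from the slide.

```python
import numpy as np

b = np.array([3.0, 7.0, 10.0, 2.0, 4.0])
c = np.array([8.0, 3.0, 1.0, 9.0])

# P x_A = c ⊗ b, so reshape(P x_A, 5, 4) (column-major) is the rank-1 b c^T:
x = np.kron(c, b)
X = x.reshape(5, 4, order="F")
assert np.allclose(X, np.outer(b, c))
assert np.linalg.matrix_rank(X) == 1

# The entries are exactly the sorted multiset x_A from the slide:
assert sorted(x) == [2, 3, 4, 6, 7, 9, 10, 12, 16, 18, 21,
                     24, 27, 30, 32, 36, 56, 63, 80, 90]
```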
Quick Aside: Nearest Orthogonal KP
    A = [ −.447   .032  −.528   .384   .031   .406   .308   .006  −.330 ]
        [ −.497  −.243   .464   .367   .183  −.308   .320   .187   .274 ]
        [ −.205   .654   .150   .105  −.494  −.134   .138  −.442   .132 ]
        [ −.381  −.021  −.406  −.006  −.004  −.019  −.562   .022   .609 ]
        [ −.404  −.167   .327  −.021  −.003   .001  −.562  −.290  −.548 ]
        [ −.107   .530   .131  −.003  −.024   .011  −.218   .777  −.191 ]
        [ −.299  −.022  −.342  −.559  −.000  −.608   .236   .048  −.225 ]
        [ −.298  −.157   .254  −.581  −.274   .568   .208   .105   .175 ]
        [ −.104   .419   .097  −.233   .802   .165   .059  −.257   .087 ]

      ≈ (3-by-3 Orthogonal B) ⊗ (3-by-3 Orthogonal C)
Nearest Orthogonal KP (Cont’d)
The unconstrained KP-SVD minimization gives

    B0 = [ 0.7042  −0.5335  −0.2713 ]      C0 = [ −0.7433  −0.0069  −0.4633 ]
         [ 0.5563   0.0030   0.4618 ]           [ −0.7671  −0.3743   0.3931 ]
         [ 0.4412   0.8460  −0.1679 ]           [ −0.2822   1.0394   0.1272 ]

but ‖B0^T B0 − I3‖2 ≈ ‖C0^T C0 − I3‖2 ≈ .643.

After 2 iterations of alternating bilinear clean-up:

    BLS = [ 0.7025  −0.5305  −0.4745 ]     CLS = [ −0.6701  −0.0123  −0.7422 ]
          [ 0.5607   0.0019   0.8280 ]           [ −0.6962  −0.3365   0.6341 ]
          [ 0.4383   0.8477  −0.2988 ]           [ −0.2576   0.9416   0.2169 ]

giving ‖BLS^T BLS − I3‖2 ≈ ‖CLS^T CLS − I3‖2 ≈ 10^(−4).
Nearest Orthogonal KP (Cont’d)
The problem

    min_{C orthogonal} ‖A − B ⊗ C‖_F        (B fixed)

is equivalent to an orthogonal procrustes problem with a simple SVD solution:

    U^T ( Σ_{i,j} bij Aij ) V = Σ        Copt = U V^T
Bojanczyk and Lutoborski (2003) solved a related problem.
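One half-step of that alternating scheme is a few lines of numpy; a sketch with our own names (with B fixed, the weighted block sum is fed to a procrustes-style SVD):

```python
import numpy as np

def nearest_orthogonal_C(A_blocks, B):
    """With B fixed, the orthogonal C minimizing ||A - B ⊗ C||_F."""
    # Form the weighted block sum  G = sum_ij b_ij A_ij ...
    G = sum(B[i, j] * A_blocks[i][j]
            for i in range(B.shape[0]) for j in range(B.shape[1]))
    # ... and take C = U V^T from its SVD (the procrustes solution).
    U, _, Vt = np.linalg.svd(G)
    return U @ Vt

rng = np.random.default_rng(6)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))    # a random orthogonal C-factor
B0, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # a random orthogonal B
A = np.kron(B0, Q)
A_blocks = [[A[3*i:3*i+3, 3*j:3*j+3] for j in range(3)] for i in range(3)]

C = nearest_orthogonal_C(A_blocks, B0)
assert np.allclose(C.T @ C, np.eye(3))    # C is orthogonal
assert np.allclose(C, Q)                  # and recovers the exact factor
```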
A Wireless Bandwidth Problem
Given H1, . . . , HN ∈ IR^(p×q), find C ∈ IR^(p×r) and W ∈ IR^(q×r) with orthonormal columns so that

    ψ(C, W) = Σ_{k=1}^{N} σ1(C^T Hk W)²

is maximized.
(Joint with J. Nsenga (CETIC), S. Ragnarsson)
A Wireless Bandwidth Problem Cont’d
    ψ(C, W) = Σ_{k=1}^{N} ‖C^T Hk W‖2²  ≤  Σ_{k=1}^{N} ‖C^T Hk W‖_F²

            = Σ_{k=1}^{N} ‖ (C ⊗ W)^T vec(Hk) ‖2²

            = tr( (C ⊗ W)^T S (C ⊗ W) )

where

    S = Σ_{k=1}^{N} vec(Hk) vec(Hk)^T
A Wireless Bandwidth Problem Cont’d
Solution Approach. If

    S ≈ S1 ⊗ S2

then

    ψ(C, W) ≈ tr( (C ⊗ W)^T (S1 ⊗ S2)(C ⊗ W) ) = tr(C^T S1 C) · tr(W^T S2 W)

The trace of Q^T M Q with Q ∈ IR^(n×r) (orthonormal columns) is maximized if ran(Q) is the r-dimensional dominant invariant subspace of M. An "easy computation."
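That "easy computation" is one symmetric eigensolve; a minimal sketch (our own toy data):

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.standard_normal((6, 6))
M = M @ M.T                      # symmetric positive semidefinite
r = 2

# tr(Q^T M Q) over orthonormal-column Q is maximized by the dominant
# invariant subspace: span of the top-r eigenvectors of M.
w, V = np.linalg.eigh(M)         # eigenvalues in ascending order
Q = V[:, -r:]                    # the two dominant eigenvectors
best = np.trace(Q.T @ M @ Q)
assert np.isclose(best, w[-1] + w[-2])

# Any other orthonormal Q does no better (Ky Fan inequality):
Qrand, _ = np.linalg.qr(rng.standard_normal((6, r)))
assert np.trace(Qrand.T @ M @ Qrand) <= best + 1e-10
```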
Connections to Computational Multilinear Algebra
A 2^d-by-2^d Hx = λx Problem

    H = Σ_{i,j=1}^{d} tij Hi^T Hj  +  Σ_{i,j,k,ℓ=1}^{d} vijkℓ Hi^T Hj^T Hk Hℓ

    Hi = I_{2^(i−1)} ⊗ [ 0  1 ] ⊗ I_{2^(d−i)}
                       [ 0  0 ]

    T = T(1:d, 1:d)        V = V(1:d, 1:d, 1:d, 1:d)

Matrix T is symmetric. Tensor V = (v_{i,j,k,ℓ}) also has symmetries.
The H-Matrix

    [spy plot of the sparse H-matrix for d = 10 (1024-by-1024): nz = 104703]

    nzeros = ( (1/64)d⁴ − (3/32)d³ + (27/64)d² − (11/32)d + 1 ) · 2^d − 1
Some Fourth-Order Tensor Symmetries
The tensor V in our problem frequently has these symmetries:

    V(i, j, k, ℓ) = V(j, i, k, ℓ) = V(i, j, ℓ, k) = V(k, ℓ, i, j)
Let’s Flatten V...
    V = [ V(:, :, 1, 1)  V(:, :, 1, 2)  V(:, :, 1, 3)  V(:, :, 1, 4) ]
        [ V(:, :, 2, 1)  V(:, :, 2, 2)  V(:, :, 2, 3)  V(:, :, 2, 4) ]
        [ V(:, :, 3, 1)  V(:, :, 3, 2)  V(:, :, 3, 3)  V(:, :, 3, 4) ]
        [ V(:, :, 4, 1)  V(:, :, 4, 2)  V(:, :, 4, 3)  V(:, :, 4, 4) ]

and see what happens to

    V(i, j, k, ℓ) = V(j, i, k, ℓ) = V(i, j, ℓ, k) = V(k, ℓ, i, j)
Flattened Symmetries
Block Symmetry:

    V(i, j, k, ℓ) = V(i, j, ℓ, k)   ⇒   V_{k,ℓ} = V_{ℓ,k}

Symmetric Blocks:

    V(i, j, k, ℓ) = V(j, i, k, ℓ)   ⇒   V_{ℓ,k} = V_{ℓ,k}^T

Perfect Shuffle Symmetry:

    V(i, j, k, ℓ) = V(k, ℓ, i, j)   ⇒   Π^T V Π = V

where Π is a perfect shuffle permutation.
A Sample V
280 206 100 206 182 187 100 187 296
206 328 188 182 138 148 187 244 143
100 188 176 187 148 122 296 143 326
206 182 187 328 138 244 188 148 143
182 138 148 138 312 192 148 192 212
187 148 122 244 192 272 143 212 200
100 187 296 188 148 143 176 122 326
187 244 143 148 192 212 122 272 200
296 143 326 143 212 200 326 200 280
The KP-SVD of V is Highly Structured
    V = Σ_{i=1}^{r} σi Bi ⊗ Bi        Bi symmetric

If V ≈ σ1 B1 ⊗ B1, then

    V(i, j, k, ℓ) ≈ σ1 B1(i, j) B1(k, ℓ)

and...

    Σ_{i=1}^{d} Σ_{j=1}^{d} Σ_{k=1}^{d} Σ_{ℓ=1}^{d} V(i, j, k, ℓ) · Hi^T Hj^T Hk Hℓ

      ≈ σ1 Σ_{i=1}^{d} Σ_{j=1}^{d} Σ_{k=1}^{d} Σ_{ℓ=1}^{d} B1(i, j) B1(k, ℓ) Hi^T Hj^T Hk Hℓ

      = σ1 ( Σ_{k=1}^{d} Σ_{ℓ=1}^{d} B1(k, ℓ) Hk Hℓ )^T ( Σ_{k=1}^{d} Σ_{ℓ=1}^{d} B1(k, ℓ) Hk Hℓ )

and H-manipulation reduces from O(d⁴) to O(d²).
Just-For-Fun
log(det(A))
The Logarithm of the Determinant
Suppose A ∈ IR^(n×n) is positive definite with eigenvalues λ1, . . . , λn. The problem of computing

    log(det(A)) = log(λ1 · · · λn) = Σ_{k=1}^{n} log(λk)

can arise in certain maximum likelihood estimation settings.
Solution Approaches
(i) If n is modest, then compute the Cholesky factorization A = G G^T and use

    log(det(A)) = log(det(G G^T)) = log(det(G)²) = 2 log(g11 · · · gnn) = 2 Σ_{k=1}^{n} log(gkk)
(ii) If A is large and sparse, then Monte Carlo. See Barry and Pace (1999) and also M. McCourt (2008).
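Approach (i) is a one-liner in numpy; a minimal sketch on synthetic data (checked against `slogdet`):

```python
import numpy as np

rng = np.random.default_rng(8)
G0 = rng.standard_normal((50, 50))
A = G0 @ G0.T + 50 * np.eye(50)          # symmetric positive definite

# log det from the Cholesky factor's diagonal: A = G G^T, G lower triangular.
G = np.linalg.cholesky(A)
logdetA = 2.0 * np.sum(np.log(np.diag(G)))

sign, ref = np.linalg.slogdet(A)
assert sign == 1.0 and np.isclose(logdetA, ref)
```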
Nearest KP Approach

Suppose n = n1·n2 and B ⊗ C is the nearest KP to A, with B ∈ IR^(n1×n1) and C ∈ IR^(n2×n2). It can be shown that B and C are symmetric positive definite and

    log(det(A)) ≈ log(det(B ⊗ C))
                = log( det(B)^(n2) · det(C)^(n1) )
                = n2·log(det(B)) + n1·log(det(C))

I.e., the log(det(A)) problem breaks down into a pair of (much) smaller log det problems.
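A sanity check of the underlying determinant identity in the exact-Kronecker case (our own synthetic SPD factors):

```python
import numpy as np

# Verify  log det(B ⊗ C) = n2 log det(B) + n1 log det(C)
rng = np.random.default_rng(9)
n1, n2 = 3, 4
Mb = rng.standard_normal((n1, n1)); B = Mb @ Mb.T + n1 * np.eye(n1)
Mc = rng.standard_normal((n2, n2)); C = Mc @ Mc.T + n2 * np.eye(n2)

lhs = np.linalg.slogdet(np.kron(B, C))[1]
rhs = n2 * np.linalg.slogdet(B)[1] + n1 * np.linalg.slogdet(C)[1]
assert np.isclose(lhs, rhs)
```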
What If...
What if A ≈ B ⊗ C isn't good enough?

What if

    A ≈ (B1 ⊗ C1)(B2 ⊗ C2)(B3 ⊗ C3)

is good enough, where Bi ∈ IR^(mi×mi) and Ci ∈ IR^((n/mi)×(n/mi)) for i = 1:3?

Then

    log(det(A)) ≈ Σ_{i=1}^{3} ( (n/mi)·log(det(Bi)) + mi·log(det(Ci)) )
Conclusion
The KP-SVD can serve as a bridge from small n problems to large n problems and, more generally, from numerical linear algebra to numerical multilinear algebra.