Transcript of "Learning with matrix and tensor based models using low-rank penalties"

Page 1

Learning with matrix and tensor based models using low-rank penalties

Johan Suykens

KU Leuven, ESAT-SCD/SISTA, Kasteelpark Arenberg 10

B-3001 Leuven (Heverlee), Belgium
Email: [email protected]

http://www.esat.kuleuven.be/scd/

Nonsmooth optimization in machine learning, Liege, March 4 2013

(joint work with Marco Signoretto, Quoc Tran Dinh, Lieven De Lathauwer)

Page 2

Learning with matrices and tensors

neuroscience: EEG data

(time samples × frequency × electrodes)

computer vision: image (/video) compression/completion/· · · (pixel × illumination × expression × · · ·)

web mining: analyzing user behavior

(users × queries × webpages)

data vector $x$ $\longrightarrow$ data matrix $X$ $\longrightarrow$ data tensor $\mathcal{X}$

vector model: $y = w^T x$ $\longrightarrow$ matrix model: $y = \langle W, X \rangle$ $\longrightarrow$ tensor model: $y = \langle \mathcal{W}, \mathcal{X} \rangle$

[Signoretto M., Tran Dinh Q., De Lathauwer L., Suykens J.A.K., "Learning with Tensors: a Framework Based on Convex Optimization and Spectral Regularization", 2011]

Page 3

Overview

• Sparsity

• Matrix completion and tensor completion

• Learning with matrices and low rank penalty

• Learning with tensors

• Optimization algorithms

Page 4

Learning with matrices and tensors

data vector $x$ $\longrightarrow$ data matrix $X$ $\longrightarrow$ data tensor $\mathcal{X}$

vector model: $y = w^T x$ $\longrightarrow$ matrix model: $y = \langle W, X \rangle$ $\longrightarrow$ tensor model: $y = \langle \mathcal{W}, \mathcal{X} \rangle$

Page 5

Sparsity in machine learning

• through the loss function: model $y = \sum_i \alpha_i K(x, x_i) + b$

$\min\; w^T w + \gamma \sum_i L(e_i) \quad \Rightarrow \quad \text{sparse } \alpha$

• through regularization: model $y = w^T x + b$

$\min\; \sum_j |w_j| + \gamma \sum_i e_i^2 \quad \Rightarrow \quad \text{sparse } w$

(figure: loss function that is zero on the interval $[-\varepsilon, +\varepsilon]$)

Page 6

Sparsity (1)

• Underdetermined linear system:

$Ax = b, \quad A \in \mathbb{R}^{n \times m},\; n < m$

• Minimum norm solution:

$\min_x \|x\|_2^2 \;\text{ s.t. }\; Ax = b \quad \Rightarrow \quad x = A^T (A A^T)^{-1} b$

• Sparsest solution:

$(P_0) \quad \min_x \|x\|_0 \;\text{ s.t. }\; Ax = b \qquad (\text{with } \|x\|_0 = \#\{i : x_i \neq 0\})$

• Alternatives: $\ell_p$-norms $\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$

$(P_p) \quad \min_x \|x\|_p \;\text{ s.t. }\; Ax = b$

Nonconvex for 0 < p < 1, convex for p = 1.
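
To make the contrast concrete, here is a small numerical sketch (an addition to this transcript, not from the slides): it computes the minimum $\ell_2$-norm solution in closed form and solves $(P_1)$ as a linear program via the standard splitting $x = x^+ - x^-$ using scipy; the problem sizes and the 3-sparse ground truth are made-up test data.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 20, 50                                  # underdetermined: n < m
A = rng.standard_normal((n, m))
x0 = np.zeros(m)
x0[rng.choice(m, 3, replace=False)] = rng.standard_normal(3)
b = A @ x0                                     # b comes from a 3-sparse ground truth

# minimum l2-norm solution: x = A^T (A A^T)^{-1} b  (typically dense)
x_l2 = A.T @ np.linalg.solve(A @ A.T, b)

# (P1): min ||x||_1 s.t. Ax = b, posed as an LP with x = xp - xm, xp, xm >= 0
res = linprog(c=np.ones(2 * m), A_eq=np.hstack([A, -A]), b_eq=b,
              bounds=[(0, None)] * (2 * m))
x_l1 = res.x[:m] - res.x[m:]

print("nonzeros  l2:", np.sum(np.abs(x_l2) > 1e-6),
      " l1:", np.sum(np.abs(x_l1) > 1e-6))
```

Typically the $\ell_1$ solution recovers the sparse ground truth here, while the minimum-norm solution has all entries nonzero.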

Page 7

Sparsity (2)

• Mutual coherence: $\mu(A) = \max_{1 \le k, j \le m,\; k \ne j} \dfrac{|a_k^T a_j|}{\|a_k\|_2 \|a_j\|_2}$

For a full rank $A \in \mathbb{R}^{n \times m}$, $n < m$: if a solution $x$ exists satisfying

$\|x\|_0 < \dfrac{1}{2}\left(1 + \dfrac{1}{\mu(A)}\right)$

it is the unique solution of both $(P_1)$ and $(P_0)$.

• Restricted Isometry Property (RIP): a matrix $A \in \mathbb{R}^{n \times m}$ has, by definition, RIP$(\delta, k)$ if each submatrix $A_I$ (formed by combining at most $k$ columns of $A$) has its nonzero singular values bounded between $1 - \delta$ and $1 + \delta$.

A matrix $A$ with RIP$(0.41, 2k)$ implies that $(P_1)$ and $(P_0)$ have identical solutions on all $k$-sparse vectors.

[Bruckstein et al., SIAM Review, 2009; Candes & Tao, 2005; Donoho & Elad, 2003; ...]
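
The mutual coherence is straightforward to compute; the following sketch (my illustration, with a made-up random $A$) evaluates $\mu(A)$ and the resulting uniqueness bound on $\|x\|_0$.

```python
import numpy as np

def mutual_coherence(A):
    """mu(A) = max over column pairs k != j of |a_k^T a_j| / (||a_k|| ||a_j||)."""
    G = A / np.linalg.norm(A, axis=0)   # columns normalized to unit length
    C = np.abs(G.T @ G)                 # absolute inner products of all column pairs
    np.fill_diagonal(C, 0.0)            # exclude the diagonal (k = j)
    return C.max()

A = np.random.default_rng(0).standard_normal((20, 50))
mu = mutual_coherence(A)
print(f"mu(A) = {mu:.3f};  unique if ||x||_0 < {0.5 * (1 + 1 / mu):.2f}")
```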

Page 8

Learning with matrices and tensors

data vector $x$ $\longrightarrow$ data matrix $X$ $\longrightarrow$ data tensor $\mathcal{X}$

vector model: $y = w^T x$ $\longrightarrow$ matrix model: $y = \langle W, X \rangle$ $\longrightarrow$ tensor model: $y = \langle \mathcal{W}, \mathcal{X} \rangle$

Page 9

Matrix completion: example

Given image (80 % missing entries)

[experiments by M. Signoretto]

Page 10

Matrix completion: example

Given image (80 % missing entries) and completed image

[experiments by M. Signoretto]

Page 11

Matrix completion: example

Given image (40 % missing entries)

[experiments by M. Signoretto]

Page 12

Matrix completion: example

Given image (40 % missing entries) and completed image

[experiments by M. Signoretto]

Page 13

Matrix completion: example

Original image

Page 14

Matrix completion (1)

Given: matrix $X$ with missing entries
Goal: complete the missing entries
Assumption: $X$ has low rank

$\min_X \|X\|_* \quad \text{subject to} \quad X_{ij} = Y_{ij},\; (i,j) \in S$

• given values $Y_{ij}$ with $(i,j) \in S$, a subset of all entries of the matrix

• nuclear norm $\|X\|_* = \sum_i \sigma_i$ with $\sigma_i$ the singular values of $X$ (singular value decomposition: $X = \sum_i \sigma_i u_i v_i^T$)

• $\|X\|_*$ is the convex envelope of $\mathrm{rank}\, X$ on $\{X : \|X\| \le 1\}$ [Fazel, 2002]

• $\|X\| \le \|X\|_F \le \|X\|_* \le \sqrt{r}\, \|X\|_F \le r\, \|X\|$ [Recht et al., 2010]

Page 15

Matrix completion (2)

This can be written as a semidefinite program (SDP):

$\min_{X, W_1, W_2} \; \mathrm{tr}(W_1) + \mathrm{tr}(W_2) \quad \text{subject to} \quad X_{ij} = Y_{ij},\; (i,j) \in S, \quad \begin{bmatrix} W_1 & X \\ X^T & W_2 \end{bmatrix} \succeq 0$

At the matrix level, the nuclear norm plays a role analogous to that of the $\ell_1$ norm for vectors.

[Fazel et al., 2001; Candes & Recht, 2009]
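
For small problems this SDP can be handed to a general-purpose solver; below is a minimal sketch with cvxpy (my formulation of the slide's program, not the speaker's code), with placeholder data Y and index set S. An SDP-capable backend such as SCS is assumed.

```python
import numpy as np
import cvxpy as cp

m, n = 5, 4
Y = np.arange(20.0).reshape(m, n)            # placeholder data
S = [(0, 0), (1, 2), (3, 3), (4, 1)]         # observed entries

X  = cp.Variable((m, n))
W1 = cp.Variable((m, m), symmetric=True)
W2 = cp.Variable((n, n), symmetric=True)
M  = cp.bmat([[W1, X], [X.T, W2]])           # block matrix of the SDP

constraints = [M >> 0] + [X[i, j] == Y[i, j] for (i, j) in S]
prob = cp.Problem(cp.Minimize(cp.trace(W1) + cp.trace(W2)), constraints)
prob.solve()
print("||X||_* =", 0.5 * prob.value)         # tr(W1) + tr(W2) = 2*||X||_* at optimum
```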

Page 16

Matrix completion: RIP property

• Consider:

$(P_0): \; \min \mathrm{rank}\, X \;\text{ s.t. }\; \mathcal{A}(X) = b$

$(P_1): \; \min \|X\|_* \;\text{ s.t. }\; \mathcal{A}(X) = b$

• $r$-restricted isometry constant: the smallest number $\delta_r(\mathcal{A})$ such that

$1 - \delta_r(\mathcal{A}) \le \dfrac{\|\mathcal{A}(X)\|}{\|X\|_F} \le 1 + \delta_r(\mathcal{A})$

holds for all $X$ of rank at most $r$, with $\mathcal{A} : \mathbb{R}^{m \times n} \to \mathbb{R}^p$ a linear map.

• Suppose that $\delta_{2r} < 1$ for an integer $r \ge 1$. Then the solution to $(P_0)$ is the only matrix of rank at most $r$ satisfying $\mathcal{A}(X) = b$.

• Suppose that $r \ge 1$ is such that $\delta_{5r} < 1/10$. Then the solution to $(P_1)$ equals the solution to $(P_0)$.

[Recht, Fazel, Parrilo, Siam Rev, 2010]

Page 17

Tensor completion

Given: $N$-th order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ with missing entries
Goal: complete the missing entries
Assumption: $\mathcal{X}$ has low rank

$\min_{\mathcal{X}} \|\mathcal{X}\|_* \quad \text{subject to} \quad \mathcal{X}_{i_1 i_2 \ldots i_N} = \mathcal{Y}_{i_1 i_2 \ldots i_N},\; (i_1, i_2, \ldots, i_N) \in S$

with

• given entries $\mathcal{Y}_{i_1 i_2 \ldots i_N}$ with $(i_1, i_2, \ldots, i_N) \in S$, a subset of the tensor entries

• nuclear norm $\|\mathcal{X}\|_* = \frac{1}{N} \sum_{n \in \mathbb{N}_N} \|X_{\langle n \rangle}\|_*$ with $X_{\langle n \rangle}$ the $n$-th mode matrix unfolding

[Signoretto M., Van De Plas R., De Moor B., Suykens J.A.K., IEEE-SPL, 2011; Gandy et al., 2011; Tomioka et al., 2011]
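
A sketch (my addition) of this overloaded nuclear norm in numpy: unfold the tensor along each mode and average the matrix nuclear norms of the unfoldings.

```python
import numpy as np

def unfold(A, n):
    """Mode-n unfolding: the n-mode vectors become the columns."""
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def tensor_nuclear_norm(A):
    """(1/N) * sum over the N modes of the nuclear norm of the unfolding."""
    return sum(np.linalg.svd(unfold(A, n), compute_uv=False).sum()
               for n in range(A.ndim)) / A.ndim

A = np.random.default_rng(0).standard_normal((3, 4, 5))
print(tensor_nuclear_norm(A))
```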

Page 18

Mass spectral imaging - digital staining

Data tensor: 51 × 34 pixels × 6490 variables/spectrum
Given a partial labelling (4 classes), SVM prediction on all pixels

(panel labels: cerebellar cortex, Ammon's horn section of hippocampus, caudate-putamen, lateral ventricle area)

[Luts J., Ojeda F., Van de Plas R., De Moor B., Van Huffel S., Suykens J.A.K., ACA 2010]

Page 19

Tensor completion on mass spectral imaging

Mass spectral imaging: sagittal section mouse brain [data: E. Waelkens, R. Van de Plas]

Tensor completion using nuclear norm regularization [Signoretto et al., IEEE-SPL, 2011]

Page 20

Multichannel EEG for patient-specific seizure detection

• The electroencephalogram (EEG) measures the electrical activity of the brain and is a well-established technique in epilepsy diagnosis and monitoring.

• Automatic seizure detection would drastically decrease the workload of clinicians; EEG can provide accurate information about the onset of the seizure.

• As the seizure spreads quickly through the brain, early detection of the seizure is essential.

[Hunyadi B., Signoretto M., Van Paesschen W., Suykens J., Van Huffel S., De Vos A., Clinical Neurophysiology, 2012]

Page 21

Feature-channel matrix

Extracted features:

Time domain features:
1.-3. Number of zero crossings, max & min
4. Skewness (skew)
5. Kurtosis (kurt)
6. Root mean square amplitude (rmsa)

Frequency domain features:
7. Total power (TP)
8. Peak frequency (PF)
9.-16. Mean and normalized power in frequency bands:
delta: 1-3 Hz (D, nD), theta: 4-8 Hz (T, nT), alpha: 9-13 Hz (A, nA), beta: 14-20 Hz (B, nB)

EEG data: CHB-MIT database - scalp EEG recordings, 23 pediatric patients, 18 channels

(figure: 10 s multichannel EEG trace, 18 bipolar channels from FP1−F7 to T8−P8, time axis in seconds, amplitude scale 365 µV)

Page 22

Model with nuclear norm regularization

• Synchronization between EEG channels is a generally occurring characteristic. Representing the data in matrix form makes it possible to exploit the common information among the channels.

• Model (per patient):

$y = \langle W, X \rangle + b$

where $\langle W, X \rangle = \sum_{ij} W_{ij} X_{ij}$ with $X, W \in \mathbb{R}^{d \times p}$, $d$ the number of features and $p$ the number of channels. Classifier with decision rule $\mathrm{sign}[y]$.

• Training from given data $(X_k, y_k)_{k=1}^{N}$:

$\min_{W, b} \; \sum_{k=1}^{N} (y_k - \hat{y}_k)^2 + \mu \|W\|_*$

with $\hat{y}_k = \langle W, X_k \rangle + b$ and nuclear norm $\|W\|_* = \sum_i \sigma_i$ with singular values $\sigma_i$; the labels $\pm 1$ correspond to seizure and non-seizure epochs.
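
This training problem is a nuclear-norm regularized least-squares problem; the sketch below (my illustration, not the paper's solver) solves it by proximal gradient descent, where the prox step is singular value thresholding. Xs and ys are placeholder data, and the step size is a crude Lipschitz-based choice.

```python
import numpy as np

def svt(W, tau):
    """prox of tau*||.||_*: soft-threshold the singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def train(Xs, ys, mu=1.0, iters=500):
    """min_{W,b} sum_k (y_k - <W,X_k> - b)^2 + mu*||W||_* by proximal gradient.
    Xs: (N, d, p) stack of feature-channel matrices; ys: (N,) labels in {-1,+1}."""
    N, d, p = Xs.shape
    W, b = np.zeros((d, p)), 0.0
    step = 1.0 / (2.0 * (np.sum(Xs * Xs) + N))          # crude 1/L step size
    for _ in range(iters):
        r = np.tensordot(Xs, W, axes=([1, 2], [0, 1])) + b - ys   # residuals
        W = svt(W - step * 2.0 * np.tensordot(r, Xs, axes=(0, 0)), step * mu)
        b -= step * 2.0 * r.sum()
    return W, b
```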

Page 23

Multichannel EEG for patient-specific seizure detection

[Hunyadi B., Signoretto M., Van Paesschen W., Suykens J., Van Huffel S., De Vos A., Clinical Neurophysiology, 2012]

Page 24

Learning with matrices and tensors

data vector $x$ $\longrightarrow$ data matrix $X$ $\longrightarrow$ data tensor $\mathcal{X}$

vector model: $y = w^T x$ $\longrightarrow$ matrix model: $y = \langle W, X \rangle$ $\longrightarrow$ tensor model: $y = \langle \mathcal{W}, \mathcal{X} \rangle$

Page 25

Tensors

• $N$-th order tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$

• inner product: $\langle \mathcal{A}, \mathcal{B} \rangle := \sum_{i_1} \sum_{i_2} \cdots \sum_{i_N} A_{i_1 i_2 \cdots i_N} B_{i_1 i_2 \cdots i_N}$

• norm: $\|\mathcal{A}\| := \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}$

• $n$-mode vector: obtained by varying index $i_n$ and keeping the other indices fixed

• $n$-rank $\mathrm{rank}_n(\mathcal{A})$: dimension of the space spanned by the $n$-mode vectors

• rank-$(r_1, r_2, \ldots, r_N)$ tensor: tensor for which $r_n = \mathrm{rank}_n(\mathcal{A})$ for $n \in \mathbb{N}_N$

• multilinear rank: the $N$-tuple $(r_1, r_2, \ldots, r_N)$

• rank: $\mathrm{rank}(\mathcal{A}) := \min \left\{ R \in \mathbb{N} : \mathcal{A} = \sum_{r \in \mathbb{N}_R} u_r^{(1)} \otimes u_r^{(2)} \otimes \cdots \otimes u_r^{(N)},\; u_r^{(n)} \in \mathbb{R}^{I_n}\; \forall\, r \in \mathbb{N}_R,\, n \in \mathbb{N}_N \right\}$

• property: $\mathrm{rank}_n(\mathcal{A}) \le \mathrm{rank}(\mathcal{A})\; \forall n$

• special case of a matrix: $\mathrm{rank}_1(A) = \mathrm{rank}_2(A) = \mathrm{rank}(A)$

Page 28

Mode unfoldings of a tensor

• $n$-mode unfolding $A_{\langle n \rangle} \in \mathbb{R}^{I_n \times J}$ (matricization): matrix whose columns are the $n$-mode vectors, with $J := \prod_{j \in \mathbb{N}_N \setminus \{n\}} I_j$

• unfolding operator $\cdot_{\langle n \rangle} : \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N} \to \mathbb{R}^{I_n \times J}$

• refolding: $\cdot^{\langle n \rangle} : \mathbb{R}^{I_n \times J} \to \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$

• property: $\mathrm{rank}_n(\mathcal{A}) = \mathrm{rank}(A_{\langle n \rangle})$
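
In numpy, unfolding and refolding reduce to a moveaxis plus a reshape; a small sketch (my addition) that also checks the rank property numerically:

```python
import numpy as np

def unfold(A, n):
    """A_<n>: an I_n x J matrix whose columns are the n-mode vectors."""
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def refold(M, n, shape):
    """Inverse map: refold an I_n x J matrix into a tensor of the given shape."""
    rest = tuple(s for j, s in enumerate(shape) if j != n)
    return np.moveaxis(M.reshape((shape[n],) + rest), 0, n)

A = np.random.default_rng(0).standard_normal((3, 4, 5))
assert np.allclose(refold(unfold(A, 1), 1, A.shape), A)   # round trip
print([np.linalg.matrix_rank(unfold(A, n)) for n in range(A.ndim)])  # the n-ranks
```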

Page 29

Multilinear SVD (1)

[De Lathauwer L., De Moor B., Vandewalle J., 2000]

Page 30

Multilinear SVD (2)

• $n$-mode product $\mathcal{A} \times_n U \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_{n-1} \times J_n \times I_{n+1} \times \cdots \times I_N}$: product of the tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ with the matrix $U \in \mathbb{R}^{J_n \times I_n}$

• multilinear SVD:

$\mathcal{A} = \mathcal{S} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 \cdots \times_N U^{(N)}$

with

– core tensor $\mathcal{S} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$

– $U^{(n)} \in \mathbb{R}^{I_n \times I_n}$ a matrix of $n$-mode singular vectors, i.e., left singular vectors of the $n$-mode unfolding $A_{\langle n \rangle}$ with SVD

$A_{\langle n \rangle} = U^{(n)} \mathrm{diag}(\sigma(A_{\langle n \rangle})) V^{(n)\top}$
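
A compact numpy sketch of this construction (my rendering, not reference code): take the left singular vectors of each unfolding, then form the core by $n$-mode products with their transposes.

```python
import numpy as np

def unfold(A, n):
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def nmode_product(A, U, n):
    """A x_n U: contract mode n of A against the second axis of U."""
    return np.moveaxis(np.tensordot(U, A, axes=(1, n)), 0, n)

def hosvd(A):
    """Multilinear SVD: A = S x_1 U(1) x_2 U(2) ... x_N U(N)."""
    Us = [np.linalg.svd(unfold(A, n), full_matrices=False)[0]
          for n in range(A.ndim)]
    S = A
    for n, U in enumerate(Us):
        S = nmode_product(S, U.T, n)    # core: S = A x_1 U(1)^T x_2 ...
    return S, Us

A = np.random.default_rng(0).standard_normal((3, 4, 5))
S, Us = hosvd(A)
B = S
for n, U in enumerate(Us):
    B = nmode_product(B, U, n)
assert np.allclose(A, B)                # exact reconstruction
```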

Page 31

Inductive and transductive learning

transductive learning with tensors:

• soft-completion
  data: partially specified input data tensor and matrix of target labels
  output: latent features and missing labels

• hard-completion
  data: partially specified input data tensor
  output: missing input data

inductive learning with tensors:

• data: pairs of fully specified input features and vectors of target labels
  output: models for out-of-sample evaluations of multiple tasks

[Signoretto M., Tran Dinh Q., De Lathauwer L., Suykens J.A.K., 2011]

Page 32

Inductive learning with tensors: setting

• Training data $\mathcal{D}_N = \left\{ \left( \mathcal{X}^{(n)}, y^{(n)} \right) \in \mathbb{R}^{D_1 \times D_2 \times \cdots \times D_M} \times \mathbb{R}^T : n \in \mathbb{N}_N \right\}$

$n = 1, \ldots, N$ training data; $t = 1, \ldots, T$ outputs (tasks); $M$-th order input data tensor

• Model:

$\hat{y}_t = \langle \mathcal{W}^{(t)}, \mathcal{X} \rangle + b_t, \quad t = 1, \ldots, T$

• Assumptions:

– $\mathcal{X} = \check{\mathcal{X}} + \mathcal{E}$ with $\check{\mathcal{X}}$ a rank-$(r_1, r_2, \ldots, r_M)$ tensor

– for core tensors: $\langle \mathcal{W}^{(t)}, \mathcal{X} \rangle = \langle \mathcal{S}_{\mathcal{W}^{(t)}}, \mathcal{S}_{\mathcal{X}} \rangle$; low multilinear rank in $\mathcal{W}^{(t)} = \mathcal{S}_{\mathcal{W}^{(t)}} \times_1 U_1 \times_2 U_2 \times \cdots \times_M U_M$

– target labels $y_t$ generated according to $p(y_t \mid \hat{y}_t) = 1/(1 + \exp(-y_t \hat{y}_t))$

Page 33

Inductive learning with tensors: training

• Penalized empirical risk minimization:

$\min_{\mathcal{W}, b} \; f_{\mathcal{D}_N}(\mathcal{W}, b) + \sum_{m \in \mathbb{N}_{M+1}} \lambda_m \|W_{\langle m \rangle}\|_*$

with misclassification error, e.g., based on the logistic loss:

$f_{\mathcal{D}_N} : (\mathcal{W}, b) \mapsto \sum_{n \in \mathbb{N}_N} \sum_{t \in \mathbb{N}_T} \log\left( 1 + \exp\left( -y_t^{(n)} \left( \langle \mathcal{X}^{(n)}, \mathcal{W}^{(t)} \rangle + b_t \right) \right) \right)$

• gives a predictive model, applicable to input data $\mathcal{X}$ beyond the training data; a sketch of evaluating this objective follows below
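
A sketch (my addition) of evaluating this penalized objective in numpy, stacking the $T$ task tensors $\mathcal{W}^{(t)}$ into one $(M+1)$-mode tensor so that all $M+1$ unfoldings can be penalized; the stacking convention and variable shapes are assumptions of this sketch.

```python
import numpy as np

def unfold(A, m):
    return np.moveaxis(A, m, 0).reshape(A.shape[m], -1)

def objective(W, b, Xs, Y, lams):
    """Logistic empirical risk plus sum_m lam_m * ||W_<m>||_*.
    W: (T, D1, ..., DM) stacked task tensors; Xs: (N, D1, ..., DM);
    Y: (N, T) labels in {-1, +1}; b: (T,); lams: M+1 penalty weights."""
    axes = list(range(1, Xs.ndim))
    scores = np.tensordot(Xs, W, axes=(axes, axes)) + b   # (N, T): <X(n), W(t)> + b_t
    risk = np.log1p(np.exp(-Y * scores)).sum()
    penalty = sum(lam * np.linalg.svd(unfold(W, m), compute_uv=False).sum()
                  for m, lam in enumerate(lams))
    return risk + penalty
```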

Page 34

Transductive learning with X and Y completion

Page 35

Transductive learning with tensors: setting

• Tensors $\mathcal{X} \in \mathbb{R}^{D_1 \times D_2 \times \cdots \times D_M \times N}$ and $Y = [y^{(1)} y^{(2)} \cdots y^{(N)}] \in \mathbb{R}^{T \times N}$

• Missing entries both in $\mathcal{X}$ and $Y$:

$S_{\mathcal{X}}, S_Y$: index sets of observed entries of $\mathcal{X}$ and $Y$

$\Omega_{S_{\mathcal{X}}}, \Omega_{S_Y}$: sampling operators related to the index sets

• Implicit model:

$\hat{y}_t^{(n)} = \langle \mathcal{W}^{(t)}, \mathcal{X}^{(n)} \rangle + b_t, \quad t = 1, \ldots, T$

• Assumptions:

– $\mathcal{X} = \check{\mathcal{X}} + \mathcal{E}$ with $\check{\mathcal{X}}$ a rank-$(r_1, r_2, \ldots, r_M, r_{M+1})$ tensor

– targets $y_t$ generated according to $p(y_{tn} \mid \hat{y}_{tn}) = 1/(1 + \exp(-y_{tn} \hat{y}_{tn}))$

– $\mathrm{rank}\left( \left[ X_{\langle M+1 \rangle}, Y^\top \right] \right) \le r_{M+1} \ll \min(N, J + T)$ with $J = \prod_{j \in \mathbb{N}_M} D_j$

Page 36

Transductive learning with tensors: estimation

• Estimation of $\mathcal{X}$, $Y$, $b$:

$\min_{(\mathcal{X}, Y, b) \in \mathcal{V}} \; f_{\lambda_0}(\mathcal{X}, Y, b) + \sum_{m \in \mathbb{N}_M} \lambda_m \|X_{\langle m \rangle}\|_* + \lambda_{M+1} \left\| \left[ X_{\langle M+1 \rangle}, Y^\top \right] \right\|_*$

• objective function:

– $\mathcal{V}$ has module spaces $\left( \mathbb{R}^{D_1 \times D_2 \times \cdots \times D_M \times N} \right) \times \left( \mathbb{R}^{T \times N} \right) \times \mathbb{R}^T$ and inner product $\langle (\mathcal{X}_1, Y_1, b_1), (\mathcal{X}_2, Y_2, b_2) \rangle_{\mathcal{V}} = \langle \mathcal{X}_1, \mathcal{X}_2 \rangle + \langle Y_1, Y_2 \rangle + \langle b_1, b_2 \rangle$

– objective

$f_{\lambda_0}(\mathcal{X}, Y, b) = f_x(\mathcal{X}) + \lambda_0 f_y(Y, b)$

with $f_x : \mathcal{X} \mapsto \sum_{p \in \mathbb{N}_P} l_x\left( (\Omega_{S_{\mathcal{X}}} \mathcal{X})_p, z^x_p \right)$

$f_y : (Y, b) \mapsto \sum_{q \in \mathbb{N}_Q} l_y\left( (\Omega_{S_Y}(Y + b \otimes 1_N))_q, z^y_q \right)$

(the bias $b$ is broadcast over the $N$ columns of $Y$)

– losses e.g. $l_x : (u, v) \mapsto \frac{1}{2}(u - v)^2$, $l_y : (u, v) \mapsto \log(1 + \exp(-uv))$

– $z^x, z^y$ are the vectors of observed entries

Page 37

Transductive soft completion: Olivetti faces

(figure: three Olivetti face examples; columns show the original image, the input data with missing entries, the matrix soft-completion (matrix-sc) and the tensor soft-completion (tensor-sc); for one face with true label 5, matrix-sc predicts 3 while tensor-sc predicts 5; for a face with true label 3, both predict 3)

Page 38

Inpainting color images by hard completion

(panels: original, given image, completed)

Tensor: modes 1 and 2: pixel space; mode 3: 8-bit RGB color information

Page 39

Inpainting color images by hard completion

(panels: original, given image, completed)

Page 40

Inpainting color images by hard completion

(panels: original, given image, completed)

Page 41

Inpainting color images by hard completion

(panels: original, given image, completed)

Page 42

Inpainting color images by hard completion

(panels: original, given image, completed)

Page 43

Optimization algorithm (1)

The learning problems are instances of the following convex optimization problem on an abstract vector space:

$\min_{w \in \mathcal{W}} \; f(w) + g(w) \quad \text{subject to} \quad w \in C$

with

- $f$: convex and differentiable functional; $\nabla f$ is $L_f$-Lipschitz:

$\|\nabla f(w) - \nabla f(v)\|_{\mathcal{W}} \le L_f \|w - v\|_{\mathcal{W}} \quad \forall\, w, v \in \mathcal{W}$

- $g$: convex but possibly non-differentiable functional

- $C \subseteq \mathcal{W}$: a non-empty, closed and convex set

Page 44

Optimization algorithm (2)

• Problem restatement

$\min_{w \in \mathcal{W}} h(w) = f(w) + g(w) + \delta_C(w), \qquad \delta_C : w \mapsto \begin{cases} 0, & \text{if } w \in C \\ \infty, & \text{otherwise} \end{cases}$

• Proximity operator

$x^{(t+1)} = \mathrm{prox}_{\tau h}\left( x^{(t)} \right)$

with

$\mathrm{prox}_{\tau h} : x \mapsto \arg\min_{w \in \mathcal{W}} \; h(w) + \frac{1}{2\tau} \|w - x\|^2$
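
Two standard proximity operators, for intuition (added here, not specific to the talk): the prox of the $\ell_1$ norm is elementwise soft-thresholding, and the prox of the indicator $\delta_C$ of a box $C$ is the Euclidean projection onto $C$.

```python
import numpy as np

def prox_l1(x, tau):
    """prox_{tau*||.||_1}(x): elementwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_indicator_box(x, lo, hi):
    """prox of delta_C for C = [lo, hi]^n: the projection onto C."""
    return np.clip(x, lo, hi)
```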

Page 45

Optimization algorithm (3)

• Operator splitting approach: split $h(w) = f(w) + g(w) + \delta_C(w)$ into $f(w) + \delta_C(w)$ and the non-smooth term $g(w)$

• Douglas-Rachford splitting:

$y^{(k)} = \arg\min_{x \in C} \; f(x) + \frac{1}{2\tau} \left\| x - w^{(k)} \right\|_{\mathcal{W}}^2 \quad \to \text{(solved inexactly)}$

$r^{(k)} = \mathrm{prox}_{\tau g}\left( 2y^{(k)} - w^{(k)} \right)$

$w^{(k+1)} = w^{(k)} + \gamma^{(k)}\left( r^{(k)} - y^{(k)} \right)$

• Projection onto $C$

Proof of convergence for the sequence $\{y^{(k)}\}_k$

Stopping criterion based on $h$

A toy instance of this iteration is sketched below.
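
The following is a toy rendering of the iteration (my sketch, not the talk's implementation), with $f$ quadratic and $C$ the nonnegative orthant so that the $y$-update has a closed form, and $g = \lambda \|\cdot\|_1$; the data vector a and weight lam are made up.

```python
import numpy as np

def douglas_rachford(prox_fC, prox_g, w0, tau=1.0, gamma=1.0, iters=200):
    """The iteration above: y-update, reflected prox of g, relaxation step."""
    w = w0.copy()
    for _ in range(iters):
        y = prox_fC(w, tau)          # argmin_{x in C} f(x) + ||x - w||^2 / (2 tau)
        r = prox_g(2 * y - w, tau)
        w = w + gamma * (r - y)
    return y

# toy instance: min_x 0.5*||x - a||^2 + lam*||x||_1  s.t.  x >= 0
a, lam = np.array([1.5, -2.0, 0.3]), 0.5
prox_fC = lambda w, t: np.maximum((w + t * a) / (1 + t), 0.0)
prox_g = lambda w, t: np.sign(w) * np.maximum(np.abs(w) - t * lam, 0.0)
print(douglas_rachford(prox_fC, prox_g, np.zeros(3)))   # approaches [1.0, 0.0, 0.0]
```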

Page 46

Note: matrix case - Singular Value Thresholding

Given a matrix $Y$, the solution to

$\min_X \; \frac{1}{2} \|X - Y\|_F^2 + \lambda \|X\|_*$

with $\lambda > 0$, is given by a shrinkage operation on the singular values of $Y$:

$\mathrm{prox}^{\mathrm{tr}}_{\lambda}(Y) = U \max(S - \lambda I, 0) V^T$

(with $Y = U S V^T$ a singular value decomposition of $Y$)

[Cai et al., 2008; Tomioka et al., 2011]
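
In numpy, a direct sketch of the formula above (my addition); the printed spectra show the singular values shifted down by $\lambda$ and floored at zero.

```python
import numpy as np

def svt(Y, lam):
    """Singular value thresholding: the prox of lam*||.||_* at Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

Y = np.random.default_rng(0).standard_normal((6, 4))
print(np.linalg.svd(Y, compute_uv=False))             # spectrum of Y
print(np.linalg.svd(svt(Y, 1.0), compute_uv=False))   # shrunk spectrum
```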

Page 47

Tensor case

• Learning problems involve the tensor modes:

$\sum_{m \in \mathbb{N}_{M+1}} \lambda_m \|W_{\langle m \rangle}\|_*$

• Consider a space $\mathcal{W}$ given by the Cartesian product $\mathcal{W}_1 \times \mathcal{W}_2 \times \cdots \times \mathcal{W}_I$ with inner product $\langle x, y \rangle = \sum_{i \in \mathbb{N}_I} \langle x_i, y_i \rangle_i$.

• Assume a function $g : \mathcal{W} \to \mathbb{R}$ defined by

$g : (x_1, x_2, \ldots, x_I) \mapsto \sum_{i \in \mathbb{N}_I} g_i(x_i)$

where for any $i \in \mathbb{N}_I$, $g_i : \mathcal{W}_i \to \mathbb{R}$ is convex. Then we have:

$\mathrm{prox}_g(x) = \left( \mathrm{prox}_{g_1}(x_1), \mathrm{prox}_{g_2}(x_2), \cdots, \mathrm{prox}_{g_I}(x_I) \right)$
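
In code, this separability means the prox on a product space is applied blockwise; a one-line sketch (my addition):

```python
def prox_product(proxes, xs):
    """prox of g(x1,...,xI) = sum_i g_i(x_i): apply each prox_{g_i} to its block.
    proxes: sequence of per-block prox functions; xs: matching sequence of blocks."""
    return tuple(p(x) for p, x in zip(proxes, xs))
```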

Page 48

Duplication - transductive learning case

• Duplication of the tensors leads to considering the set:

$C := \left\{ (\mathcal{X}_{[1]}, \mathcal{X}_{[2]}, \ldots, \mathcal{X}_{[M]}, \mathcal{X}_{[M+1]}, Y, b) \in \mathcal{W} : \mathcal{X}_{[1]} = \mathcal{X}_{[2]} = \cdots = \mathcal{X}_{[M+1]} \right\}$

• This gives the problem statement:

$\min_{(\mathcal{X}_{[1]}, \ldots, \mathcal{X}_{[M+1]}, Y, b) \in \mathcal{W}} \; f(\mathcal{X}_{[1]}, \ldots, \mathcal{X}_{[M+1]}, Y, b) + g(\mathcal{X}_{[1]}, \ldots, \mathcal{X}_{[M+1]}, Y)$

$\text{subject to} \quad (\mathcal{X}_{[1]}, \ldots, \mathcal{X}_{[M+1]}, Y, b) \in C$
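
For this consensus set $C$, the Euclidean projection replaces every duplicate by their average (a standard fact; the sketch below is my addition and assumes the duplicates are equally shaped numpy arrays).

```python
def project_consensus(Xs):
    """Projection onto {X[1] = ... = X[M+1]}: replace each copy by the mean."""
    mean = sum(Xs) / len(Xs)
    return [mean.copy() for _ in Xs]
```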

Page 49

Prox and tensor modes

We apply

$\mathrm{prox}_{\tau g}(\mathcal{X}_{[1]}, \ldots, \mathcal{X}_{[M+1]}, Y) = \left( \mathrm{prox}_{\tau \lambda_1 \|\sigma(\cdot_{\langle 1 \rangle})\|_1}(\mathcal{X}_{[1]}), \cdots, \mathrm{prox}_{\tau \lambda_M \|\sigma(\cdot_{\langle M \rangle})\|_1}(\mathcal{X}_{[M]}), Z_1, Z_2 \right)$

where $[Z_1(\mathcal{X}, Y), Z_2(\mathcal{X}, Y)]$ is a partitioning of

$Z(\mathcal{X}, Y) = U \mathrm{diag}\left( \mathrm{prox}_{\tau \lambda_{M+1} \|\sigma(\cdot)\|_1}\left( \left[ X_{\langle M+1 \rangle}, Y^\top \right] \right) \right) V^\top$

(with $U$, $V$ the singular vector matrices of $[X_{\langle M+1 \rangle}, Y^\top]$) and with

$\mathrm{prox}_{\lambda \|\sigma(\cdot_{\langle n \rangle})\|_1}(\mathcal{W}) = \left( U^{(n)} \mathrm{diag}(d_\lambda) V^{(n)\top} \right)^{\langle n \rangle}$

where $(d_\lambda)_i := \max(\sigma_i(W_{\langle n \rangle}) - \lambda, 0)$.
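
A sketch (my addition) of the per-mode shrinkage operator used above: threshold the singular values of the mode-$n$ unfolding and refold to the original tensor shape.

```python
import numpy as np

def unfold(A, n):
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def refold(M, n, shape):
    rest = tuple(s for j, s in enumerate(shape) if j != n)
    return np.moveaxis(M.reshape((shape[n],) + rest), 0, n)

def mode_svt(W, n, lam):
    """prox of lam*||sigma(.<n>)||_1: shrink the spectrum of the mode-n
    unfolding by lam, floor at zero, and refold to the original shape."""
    U, s, Vt = np.linalg.svd(unfold(W, n), full_matrices=False)
    return refold((U * np.maximum(s - lam, 0.0)) @ Vt, n, W.shape)
```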

Page 50

Conclusions

• Sparsity: from vectors to matrices, from matrices to tensors

• Transductive and inductive learning with matrices/tensors: going beyond matrix/tensor completion

• Further details: Signoretto M., Tran Dinh Q., De Lathauwer L., Suykens J.A.K., "Learning with Tensors: a Framework Based on Convex Optimization and Spectral Regularization", 2011

• Software:
https://sites.google.com/site/marcosignoretto/codes
http://www.esat.kuleuven.be/sista/ADB/software.php

Page 51

Acknowledgements

Page 52

Thank you
