A Least Squares Formulation for Canonical Correlation Analysis


Transcript of A Least Squares Formulation for Canonical Correlation Analysis

Page 1: A Least Squares Formulation for Canonical Correlation Analysis


A Least Squares Formulation for Canonical Correlation Analysis

Liang Sun, Shuiwang Ji, and Jieping Ye
Department of Computer Science and Engineering

Arizona State University

Page 2: A Least Squares Formulation for Canonical Correlation Analysis


Motivation

Canonical Correlation Analysis (CCA) is commonly used for finding the correlations between two sets of multi-dimensional variables (Hotelling, 1936; Hardoon et al., 2004; Vert & Kanehisa, 2003).

Two representations $X \in \mathbb{R}^{d \times n}$ and $Y \in \mathbb{R}^{k \times n}$.

One popular use of CCA is for supervised learning, in which one view is derived from the data and the other view is derived from the class labels.

CCA involves an eigenvalue problem, which is computationally expensive to solve. It is also challenging to derive sparse CCA models.

Page 3: A Least Squares Formulation for Canonical Correlation Analysis


Motivation

CCA is closely related to linear regression (the case $Y \in \mathbb{R}^{1 \times n}$).

CCA is equivalent to Fisher Linear Discriminant Analysis (LDA) for binary-class problems. Fisher LDA can be formulated as a least squares problem for binary-class problems. It can be solved efficiently using conjugate gradient, and sparse models can be derived readily using 1-norm regularization (Lasso).

Multivariate linear regression (MLR) is a well-studied technique for regression problems.

To apply MLR to multi-label classification, one key issue is how to define an appropriate class indicator matrix.

Can we extend their equivalence relationship to the general (multi-label) case?

Page 4: A Least Squares Formulation for Canonical Correlation Analysis


Main Contributions

We establish the equivalence relationship between CCA and multivariate linear regression for multi-label problems under a mild condition.

Based on the equivalence relationship, several CCA extensions, including sparse CCA, are derived.

The entire solution path for sparse CCA can be readily computed by the Least Angle Regression (LARS) algorithm.

Our experiments confirm the equivalence relationship, and results also show that the performance of these two models is very close even when the assumption is violated.

Page 5: A Least Squares Formulation for Canonical Correlation Analysis


Outline

1 Background

2 CCA Versus Multivariate Linear Regression

3 CCA Extensions

4 Experiment

Page 6: A Least Squares Formulation for Canonical Correlation Analysis


Background: CCA

In CCA two different representations of the same set of objects are given, and a projection is computed for each representation such that they are maximally correlated in the dimensionality-reduced space.

Two representations $X \in \mathbb{R}^{d \times n}$ and $Y \in \mathbb{R}^{k \times n}$.

CCA attempts to maximize the following correlation coefficient with respect to $w_x$ and $w_y$:

$$\rho = \frac{w_x^T X Y^T w_y}{\sqrt{\left(w_x^T X X^T w_x\right)\left(w_y^T Y Y^T w_y\right)}}. \qquad (1)$$

Page 7: A Least Squares Formulation for Canonical Correlation Analysis


Background: CCA

The optimization problem in CCA can be formulated as

$$\max_{w_x, w_y} \; w_x^T X Y^T w_y \qquad (2)$$
$$\text{subject to} \quad w_x^T X X^T w_x = 1, \quad w_y^T Y Y^T w_y = 1.$$

Assuming $YY^T$ is nonsingular, $w_x$ can be obtained by computing the eigenvector corresponding to the top eigenvalue of the following generalized eigenvalue problem:

$$X Y^T (YY^T)^{-1} Y X^T w_x = \eta \, X X^T w_x. \qquad (3)$$
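As a concrete illustration, here is a minimal numerical sketch of solving Eq. (3), assuming NumPy/SciPy; the function name `cca_directions` and the small `eps` jitter are our own additions, not part of the original formulation.

```python
import numpy as np
from scipy.linalg import eigh

def cca_directions(X, Y, n_components=1, eps=1e-8):
    """Top CCA directions w_x from the generalized eigenproblem in Eq. (3).

    X: (d, n) and Y: (k, n) are centered data matrices, as in the slides.
    eps slightly jitters X X^T and Y Y^T so the required inverses exist
    (a practical safeguard, not part of the original derivation).
    """
    Cxy = X @ Y.T
    Cyy = Y @ Y.T + eps * np.eye(Y.shape[0])
    M = Cxy @ np.linalg.solve(Cyy, Cxy.T)      # X Y^T (Y Y^T)^{-1} Y X^T
    Cxx = X @ X.T + eps * np.eye(X.shape[0])
    eta, W = eigh(M, Cxx)                      # solves M w = eta Cxx w
    order = np.argsort(eta)[::-1]              # eigh returns ascending order
    return W[:, order[:n_components]], eta[order[:n_components]]
```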

Page 8: A Least Squares Formulation for Canonical Correlation Analysis


Background: CCA

Multiple projection vectors under certain orthonormality constraints can be obtained by computing the top $\ell$ eigenvectors of the generalized eigenvalue problem in Eq. (3).

In regularized CCA (rCCA), a regularization term $\lambda I$ with $\lambda > 0$ is added to $XX^T$ to prevent overfitting and avoid the singularity of $XX^T$. Specifically, rCCA solves the following generalized eigenvalue problem:

$$X Y^T (YY^T)^{-1} Y X^T w_x = \eta \left(X X^T + \lambda I\right) w_x. \qquad (4)$$
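In code, Eq. (4) only changes the right-hand-side matrix of the eigenproblem. A sketch under the same NumPy/SciPy assumptions as above (the function name is again hypothetical):

```python
import numpy as np
from scipy.linalg import eigh

def rcca_directions(X, Y, lam, n_components=1, eps=1e-8):
    """Top rCCA directions w_x from Eq. (4): a sketch, not the authors' code.

    Identical to plain CCA except that lam * I is added to X X^T, which
    regularizes the solution and makes the right-hand side positive definite.
    """
    Cyy = Y @ Y.T + eps * np.eye(Y.shape[0])
    M = (X @ Y.T) @ np.linalg.solve(Cyy, Y @ X.T)
    eta, W = eigh(M, X @ X.T + lam * np.eye(X.shape[0]))
    order = np.argsort(eta)[::-1]
    return W[:, order[:n_components]]
```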

Page 9: A Least Squares Formulation for Canonical Correlation Analysis


Background: Multivariate Linear Regression

We are given a training set $\{(x_i, t_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ is the observation and $t_i \in \mathbb{R}^k$ is the corresponding target. We assume both $\{x_i\}_{i=1}^n$ and $\{t_i\}_{i=1}^n$ are centered.

In MLR, we compute a weight matrix W by minimizing thefollowing sum-of-squares cost function:

$$\min_W \sum_{i=1}^{n} \left\| W^T x_i - t_i \right\|_2^2 = \left\| W^T X - T \right\|_F^2. \qquad (5)$$

The optimal weight matrix is given by

$$W_{LS} = (XX^T)^{\dagger} X T^T. \qquad (6)$$

To improve its generalization ability, a penalty term based on 2-norm or 1-norm regularization is commonly applied.
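Eq. (6) is a one-liner numerically; a minimal sketch assuming NumPy (the helper name `mlr_weights` is ours):

```python
import numpy as np

def mlr_weights(X, T):
    """Least squares weight matrix W_LS = (X X^T)^+ X T^T from Eq. (6).

    X: (d, n) centered observations; T: (k, n) centered targets.
    Returns W of shape (d, k).
    """
    return np.linalg.pinv(X @ X.T) @ X @ T.T
```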

Page 10: A Least Squares Formulation for Canonical Correlation Analysis


Background: MLR for Multi-label Classification

In the general multi-label classification setting, we are given a data set consisting of $n$ samples $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i$ denotes the set of class labels of the $i$-th sample.

Assume there are $k$ labels. The 1-of-$k$ binary coding scheme $T \in \mathbb{R}^{k \times n}$ is commonly employed to assign a vector-valued class code to each data point.

Sample class indicator matrix: $T_{ij} = 1$ if $x_j$ contains the $i$-th class label, and $T_{ij} = 0$ otherwise.

The solution to the least squares problem depends on the choice of class indicator matrix.
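A sketch of building this 1-of-$k$ indicator matrix, assuming NumPy; the centering step reflects the slides' assumption that MLR targets are centered, and the function name is hypothetical:

```python
import numpy as np

def indicator_matrix(labels, k):
    """Build the k x n sample class indicator matrix described above.

    labels: one set of class-label indices per sample (multi-label allowed),
    e.g. [{0, 2}, {1}, ...]. T[i, j] = 1 iff sample j carries label i.
    """
    n = len(labels)
    T = np.zeros((k, n))
    for j, label_set in enumerate(labels):
        for i in label_set:
            T[i, j] = 1.0
    return T - T.mean(axis=1, keepdims=True)  # center, as assumed for MLR
```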

Page 11: A Least Squares Formulation for Canonical Correlation Analysis


CCA Versus Multivariate Linear Regression

The optimal projection matrix $W_{CCA}$ in CCA consists of the top eigenvectors of

$$(XX^T)^{\dagger} \left( X Y^T (YY^T)^{-1} Y X^T \right).$$

The optimal weight matrix in MLR is given by

$$W_{LS} = (XX^T)^{\dagger} X T^T.$$

Establish the equivalence relationship between these two.

Page 12: A Least Squares Formulation for Canonical Correlation Analysis


Notations and Definitions

To simplify the discussion, we define the following matrices:

$$H = Y^T (YY^T)^{-1/2} \in \mathbb{R}^{n \times k}, \qquad (7)$$
$$C_{XX} = XX^T \in \mathbb{R}^{d \times d}, \qquad (8)$$
$$C_{HH} = X H H^T X^T \in \mathbb{R}^{d \times d}, \qquad (9)$$
$$C_{DD} = C_{XX} - C_{HH} \in \mathbb{R}^{d \times d}. \qquad (10)$$

Denote the SVD of $X$ by

$$X = U \Sigma V^T = [U_1, U_2] \, \mathrm{diag}(\Sigma_r, 0) \, [V_1, V_2]^T = U_1 \Sigma_r V_1^T,$$

where $r = \mathrm{rank}(X)$, $U_1 \in \mathbb{R}^{d \times r}$, and $V_1 \in \mathbb{R}^{n \times r}$.

The optimal projection matrix $W_{CCA}$ of CCA consists of the top eigenvectors of $C_{XX}^{\dagger} C_{HH}$.

Page 13: A Least Squares Formulation for Canonical Correlation Analysis


Computing CCA via Eigendecomposition

Define $A \in \mathbb{R}^{r \times k}$ as $A = \Sigma_r^{-1} U_1^T X H$.

Let the SVD of $A$ be $A = P \Sigma_A Q^T$.

The eigendecomposition of $C_{XX}^{\dagger} C_{HH}$ can be derived as follows:

$$\begin{aligned}
C_{XX}^{\dagger} C_{HH} &= U_1 \Sigma_r^{-2} U_1^T X H H^T X^T = U_1 \Sigma_r^{-1} A H^T X^T U U^T \\
&= U \begin{bmatrix} I_r \\ 0 \end{bmatrix} \Sigma_r^{-1} A \left[ H^T X^T U_1, \; H^T X^T U_2 \right] U^T \\
&= U \begin{bmatrix} \Sigma_r^{-1} A A^T \Sigma_r & 0 \\ 0 & 0 \end{bmatrix} U^T \\
&= U \begin{bmatrix} \Sigma_r^{-1} P & 0 \\ 0 & I \end{bmatrix} \begin{bmatrix} \Sigma_A \Sigma_A^T & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} P^T \Sigma_r & 0 \\ 0 & I \end{bmatrix} U^T.
\end{aligned}$$

Page 14: A Least Squares Formulation for Canonical Correlation Analysis


Equivalence Relationship between CCA and MLR

The optimal projection matrix in CCA is given by

$$W_{CCA} = U_1 \Sigma_r^{-1} P_\ell, \qquad (11)$$

where $P_\ell$ contains the first $\ell$ columns of $P$.

In MLR, we define the class indicator matrix $T$ as $T = (YY^T)^{-1/2} Y = H^T$, and the optimal solution is given by

$$W_{LS} = (XX^T)^{\dagger} X H = U_1 \Sigma_r^{-2} U_1^T X H = U_1 \Sigma_r^{-1} P \Sigma_A Q^T. \qquad (12)$$

Our main results show that all diagonal elements of $\Sigma_A$ are ones provided that $\mathrm{rank}(X) = n - 1$.
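A sketch for checking Eqs. (11)-(12) numerically, assuming NumPy and that $YY^T$ is nonsingular as in the slides; the function name and tolerance are our own choices:

```python
import numpy as np

def cca_vs_mlr(X, Y, ell, tol=1e-10):
    """Compare W_CCA (Eq. 11) and W_LS (Eq. 12): a sketch, not the authors' code.

    X: (d, n) centered data; Y: (k, n) label view; ell <= min(rank(X), k).
    Returns W_CCA, W_LS, and the singular values of A, which should all be
    ones when rank(X) = n - 1.
    """
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > tol))                         # r = rank(X)
    U1, Sr = U[:, :r], s[:r]
    e, V = np.linalg.eigh(Y @ Y.T)                   # Y Y^T assumed nonsingular
    H = Y.T @ (V @ np.diag(1.0 / np.sqrt(e)) @ V.T)  # H = Y^T (Y Y^T)^{-1/2}
    A = np.diag(1.0 / Sr) @ (U1.T @ (X @ H))         # A = Sigma_r^{-1} U_1^T X H
    P, sig_A, Qt = np.linalg.svd(A, full_matrices=False)
    W_cca = U1 @ np.diag(1.0 / Sr) @ P[:, :ell]              # Eq. (11)
    W_ls = U1 @ np.diag(1.0 / Sr) @ P @ np.diag(sig_A) @ Qt  # Eq. (12)
    return W_cca, W_ls, sig_A
```

When all singular values in `sig_A` equal one, the two solutions span the same subspace, which is exactly the claimed equivalence.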

Page 15: A Least Squares Formulation for Canonical Correlation Analysis


CCA Extensions: Regularized CCA

Regularization is commonly used to control the complexity of the model and improve the generalization performance.

Based on the least squares formulation of CCA, we obtain the 2-norm regularized least squares CCA formulation (called "LS-CCA2"), which minimizes the following objective function:

$$L_2(W, \lambda) = \sum_{j=1}^{k} \left( \sum_{i=1}^{n} \left( x_i^T w_j - T_{ij} \right)^2 + \lambda \left\| w_j \right\|_2^2 \right),$$

where $W = [w_1, \cdots, w_k]$ and $\lambda > 0$ is the regularization parameter.
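Each column of $W$ has the familiar ridge-regression closed form; a minimal sketch assuming NumPy (the function name is ours):

```python
import numpy as np

def ls_cca2(X, T, lam):
    """LS-CCA2 weights via the ridge closed form: a sketch, not the authors' code.

    Minimizing L_2(W, lam) column-by-column gives
    W = (X X^T + lam I)^{-1} X T^T, solved here for all k columns at once.
    """
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ T.T)  # (d, k)
```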

Page 16: A Least Squares Formulation for Canonical Correlation Analysis


CCA Extensions: Sparse CCA

Sparseness can often be achieved by penalizing the 1-norm of the variables.

The sparse 1-norm least squares CCA formulation (called "LS-CCA1") can be derived by minimizing the following objective function:

$$L_1(W, \lambda) = \sum_{j=1}^{k} \left( \sum_{i=1}^{n} \left( x_i^T w_j - T_{ij} \right)^2 + \lambda \left\| w_j \right\|_1 \right).$$

The optimal $w_j^*$, for $1 \le j \le k$, is given by

$$w_j^* = \arg\min_{w_j} \left( \sum_{i=1}^{n} \left( x_i^T w_j - T_{ij} \right)^2 + \lambda \left\| w_j \right\|_1 \right). \qquad (13)$$
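Eq. (13) is a standard Lasso problem per label; a sketch using scikit-learn (an assumption on our part, not the authors' tooling; note scikit-learn scales the squared loss by $1/(2n)$, hence the rescaled alpha):

```python
import numpy as np
from sklearn.linear_model import Lasso

def ls_cca1(X, T, lam):
    """LS-CCA1 weights, one Lasso fit per label: a sketch, not the authors' code.

    X: (d, n) centered data; T: (k, n) class indicator matrix.
    scikit-learn minimizes (1/(2n)) ||y - Zw||^2 + alpha ||w||_1, so
    alpha = lam / (2n) matches the objective in Eq. (13) up to that scaling.
    """
    n = X.shape[1]
    lasso = Lasso(alpha=lam / (2 * n), fit_intercept=False)
    cols = [lasso.fit(X.T, T[j]).coef_.copy() for j in range(T.shape[0])]
    return np.column_stack(cols)  # (d, k)
```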

Page 17: A Least Squares Formulation for Canonical Correlation Analysis


CCA Extensions: Entire CCA Solution Path

The optimal w∗j can be computed equivalently as

w∗j = arg min‖wj‖1≤τ

n∑i=1

(xTi wj − Tij)

2. (14)

The solution can be readily computed by the Least Angle Regression (LARS) algorithm, which computes the entire solution path for all values of $\tau$ with essentially the same computational cost as fitting the model for a single $\tau$ value.
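A sketch of tracing one such path with scikit-learn's LARS implementation (our assumed tooling, and a hypothetical helper name):

```python
import numpy as np
from sklearn.linear_model import lars_path

def solution_path(X, t):
    """Entire LS-CCA1 path for one weight vector w_j: a sketch.

    X: (d, n) centered data; t: length-n row of the indicator matrix T.
    Returns the regularization breakpoints and the (d, n_breakpoints)
    piecewise-linear coefficient path over all values of tau in Eq. (14).
    """
    alphas, _, coefs = lars_path(X.T, t, method="lasso")
    return alphas, coefs
```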

Page 18: A Least Squares Formulation for Canonical Correlation Analysis


Experiment – Experimental Setup

Two types of data are used in the experiment:

The gene expression pattern image data of Drosophila.
The scene data set.

Five methods are investigated:
CCA
Regularized CCA (rCCA)
LS-CCA
LS-CCA2
LS-CCA1

Linear SVM is applied for classification for each label after the CCA projection.

The Receiver Operating Characteristic (ROC) value is computed for each label, and the averaged performance over all labels is reported.

Page 19: A Least Squares Formulation for Canonical Correlation Analysis


Equivalence Relationship

The assumption $\mathrm{rank}(X) = n - 1$ holds in all cases when the data dimensionality $d$ is larger than the sample size $n$.

Our results show that when $d > n$, all diagonal elements of $\Sigma_A$ are ones, and CCA and LS-CCA achieve the same classification performance, which confirms our theoretical analysis.

Page 20: A Least Squares Formulation for Canonical Correlation Analysis


Performance Comparison

Table: Comparison of different CCA methods in terms of mean ROC scores. $n_{tot}$ denotes the total number of images in the data set, and $k$ denotes the number of terms (labels). Ten different splittings of the data into training (of size $n$) and test (of size $n_{tot} - n$) sets are applied for each data set. For the regularized algorithms, the value of the parameter is chosen via cross-validation. The proposed sparse CCA model (LS-CCA1) performs the best for this data set.

ntot   k    CCA     LS-CCA   rCCA    LS-CCA2   LS-CCA1
 863   10   0.542   0.542    0.617   0.619     0.722
1041   15   0.534   0.534    0.602   0.603     0.707
1138   20   0.538   0.538    0.609   0.610     0.714
1222   25   0.540   0.540    0.603   0.605     0.704
1349   30   0.548   0.548    0.606   0.608     0.709

Page 21: A Least Squares Formulation for Canonical Correlation Analysis


Sensitivity Study

We vary the training sample size to investigate LS-CCA and its variants in comparison with CCA.

$d = 384$ for the gene data set, and $d = 294$ for the scene data set.

Figure: Gene data set. Mean ROC score (y-axis, from 0.5 to 0.8) versus training size (x-axis, from 100 to 900) for CCA, LS-CCA, LS-CCA1, LS-CCA2, and rCCA.

Figure: Scene data set. Mean ROC score (y-axis, from 0.5 to 1.0) versus training size (x-axis, from about 100 to 900) for CCA, LS-CCA, LS-CCA1, LS-CCA2, and rCCA.

Page 22: A Least Squares Formulation for Canonical Correlation Analysis


The Entire CCA Solution Path


Figure: The entire collection of solution paths for a subset of the coefficients from the first weight vector $w_1$ on the scene data set. The x-axis denotes the sparseness coefficient, and the y-axis denotes the value of the coefficients (of $w_1$).

Page 23: A Least Squares Formulation for Canonical Correlation Analysis


Conclusion and Future Work

Conclusion

We establish the equivalence relationship between CCA and multivariate linear regression under a mild condition.
Several CCA extensions, including sparse CCA, are proposed based on the equivalence relationship.
Experiments confirm the equivalence relationship. Results also show that the performance of CCA and MLR is very close even when the assumption is violated.

Future Work

Investigate semi-supervised CCA by incorporating unlabeled data into the CCA framework.
Extensions to the nonlinear case using the kernel trick.
More extensive investigation of the sparse CCA model in biological applications.