Scalable Machine Learning: Matrix factorization

Transcript of gobie.csb.pitt.edu/SML/MatrixFactorization.pdf (55 pages)

Page 1

Scalable Machine Learning

Matrix factorization

Page 2

Matrix factorization/decomposition in ML

Matrix factorizations arise as solutions to various practical problems:
- Matrix completion
- Missing value estimation
- Representation learning

Page 3

Recommender systems

There are two basic types of recommender systems:
- Items have side information. Example: Pandora. Many possible solutions, some of which can involve matrix factorization.
- Items have no side information. Also called collaborative filtering. Example: Netflix. We can only use other people's ratings, so we need to do matrix factorization either explicitly or implicitly.

Netflix: 100,480,507 ratings that 480,189 users gave to 17,770 movies.

1 million dollars goes to fancy matrix factorization!

Page 4

Matrix factorization/decomposition in math

Given a matrix Y, write Y as a product of 2 or 3 other matrices.

Page 5

Some useful matrix decompositions

Many decompositions apply only to square matrices and may require additional conditions.

- Eigen (diagonalization): square matrix, sometimes exists
- Cholesky: square matrix, sometimes exists

A strong condition for the existence of various decompositions is that a matrix is positive semi-definite, often written as Y ⪰ 0:

x^T Y x ≥ 0 for all x    (1)

Page 6

Square matrices in data analysis

- In data analysis matrices are actual data, not linear operators.
- If you are dealing with a square matrix it is most likely a sample covariance: Y^T Y or Y Y^T.
- Useful fact: all such matrices are positive semi-definite.
- Interesting fact: the converse is also true. All positive semi-definite matrices are inner products in some space. This is important for kernel learning methods like SVM.
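
As an illustration of the definition in (1) and the sample-covariance fact above, here is a minimal numpy sketch (toy data, not from the lecture) that forms Y^T Y and checks that it is positive semi-definite:

```python
import numpy as np

# Toy data matrix (features x samples); any real Y works here.
rng = np.random.default_rng(0)
Y = rng.standard_normal((20, 8))

C = Y.T @ Y                       # a "sample covariance"-style square matrix

# Positive semi-definite: all eigenvalues are >= 0 (up to floating-point error)...
eigvals = np.linalg.eigvalsh(C)
print(eigvals.min() >= -1e-10)    # True

# ...equivalently, x^T C x >= 0 for any x, as in equation (1).
x = rng.standard_normal(8)
print(x @ C @ x >= 0)             # True
```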

Page 7

SVD

Most data matrices are not square! Any matrix, no matter how weird, has a Singular Value Decomposition.

U and V are orthonormal, so that U^T U = I and V^T V = I (assuming m ≥ n).

Page 8

SVD

Y = U D V^T    (2)

D is diagonal with d_i ≥ 0. By convention we order D so that d_i ≥ d_{i+1}.

Figure adapted from the Wikipedia “SVD” article.
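
A minimal numpy check of the decomposition (toy matrix, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((10, 6))              # m x n with m >= n

U, d, Vt = np.linalg.svd(Y, full_matrices=False)

print(np.allclose(Y, U @ np.diag(d) @ Vt))    # Y = U D V^T
print(np.allclose(U.T @ U, np.eye(6)))        # U^T U = I
print(np.allclose(Vt @ Vt.T, np.eye(6)))      # V^T V = I
print(np.all(np.diff(d) <= 0))                # singular values in decreasing order
```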

Page 9

Why? (Proof sketch)

- Y may be weird, but both Y^T Y and Y Y^T are nice: they are both positive semi-definite.
- So they both have eigen-decompositions P D P^{-1} = P D P^T, where D is non-negative and the eigenvectors in P are orthogonal.
- The eigenvectors of Y Y^T and Y^T Y give the left (U) and right (V) singular vectors of Y.
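
A quick numerical sanity check of this sketch on toy data (the eigenvectors of Y Y^T match the left singular vectors up to sign):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 5))

U, d, Vt = np.linalg.svd(Y, full_matrices=False)

# Eigen-decomposition of Y Y^T: its nonzero eigenvalues are the squared singular
# values, and the corresponding eigenvectors are the left singular vectors (up to sign).
evals, evecs = np.linalg.eigh(Y @ Y.T)
top = np.argsort(evals)[::-1][:5]

print(np.allclose(evals[top], d**2))
print(np.allclose(np.abs(evecs[:, top]), np.abs(U), atol=1e-6))
```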

Page 13

Algebraic intuition

SVD as a sum of rank-1 matrices. Each entry y_{i,j} of Y can be written as

y_{i,j} = ∑_{k=1}^{n} d_k U_{i,k} V^T_{k,j}    (3)

Figure adapted from the Wikipedia “SVD” article.
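
Equivalently, Y is a sum of rank-1 outer products d_k u_k v_k^T; a small numpy sketch on a toy matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((7, 5))

U, d, Vt = np.linalg.svd(Y, full_matrices=False)

# Rebuild Y as a sum of rank-1 matrices d_k * u_k * v_k^T.
Y_rebuilt = sum(d[k] * np.outer(U[:, k], Vt[k, :]) for k in range(len(d)))
print(np.allclose(Y, Y_rebuilt))    # True
```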

Page 14

Why is the SVD useful?

[Diagram: Y approximated by U, D, and V^T truncated to the top k components (dimensions n, p, k).]

Theorem: zeroing out all but the top k singular values in D gives the best rank-k approximation to the original matrix.

By “best” we mean in terms of the squared Frobenius norm, where Ŷ is the approximation:

||Y − Ŷ||_F^2 = ∑_{i,j} (Y_{i,j} − Ŷ_{i,j})^2

Plausibility argument:

||Y||_F^2 = tr(Y^T Y) = ∑_i d_i^2

where the d_i^2 are the eigenvalues of Y^T Y and the d_i are the singular values of Y.
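
A small numpy illustration of the theorem on toy data (the truncated SVD beats an arbitrary rank-k factorization in squared Frobenius error):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((30, 20))
k = 5

U, d, Vt = np.linalg.svd(Y, full_matrices=False)
Y_k = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]        # zero out all but the top k singular values

err_svd = np.linalg.norm(Y - Y_k, 'fro')**2

# Any other rank-k factorization has at least as much squared Frobenius error.
A, B = rng.standard_normal((30, k)), rng.standard_normal((k, 20))
err_random = np.linalg.norm(Y - A @ B, 'fro')**2

print(err_svd, err_random, err_svd <= err_random)  # True
```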

Page 18

An example: single cell RNAseq

- Most clinical data are tissue samples.
- Tissue is composed of different cell-types. Immune: 20 cell-types. Liver: 5 cell-types.
- We often want to know which cell-types are affected by the disease/drug.

Page 19

An example: single cell RNAseq

Page 20

Some single cell data

[Heatmap of single-cell expression for a panel of marker genes (Tnfrsf4, Cd4, Cd8a, Cd8b1, Gzma, C1qa, ...); color scale 0–6.]

Data has lots of 0s: either the gene is not on in that cell-type, or we failed to capture it.

Page 21

Some single cell data

[Heatmaps: raw data (left, color scale 0–6) and top rank-1 approximation (right, color scale 0–4) for the same gene panel.]

Page 22

Some single cell data

[Heatmaps: raw data, top rank-1 approximation, second rank-1 approximation, and rank-2 approximation for the same gene panel.]

Page 23

Some single cell data

[Heatmaps: raw data, rank-5 approximation, and rank-10 approximation for the same gene panel.]

Page 24

Filling in missing values

- The CD8 molecule is made from 2 different genes (Cd8a and Cd8b1).
- They only work together, and each cell either makes neither or both.

[Scatter plots of Cd8b1 vs Cd8a expression per cell: raw data (left) and rank-10 approximation (right).]

Page 25

What rank do we need?

[Heatmap: rank-10 approximation for the same gene panel.]

Why 10? Often the number of components is determined using the elbow plot.

[Elbow plot: variance explained vs. component index (1–20).]

Page 26

Singular values of a randomized matrix

[Elbow plots of variance explained vs. component index: original data (left) and the same matrix with randomized entries (right); the randomized matrix's curve is nearly flat, around 0.05 for every component.]
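
One way to produce such a null reference is to permute the entries within each row and recompute the spectrum; a minimal sketch on toy data (not the lecture's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with one strong shared pattern across the first 30 rows.
Y = rng.standard_normal((100, 500))
v = rng.standard_normal(500)
Y[:30] += 2.0 * v

def variance_explained(M):
    Mc = M - M.mean(axis=1, keepdims=True)          # center each row
    d = np.linalg.svd(Mc, compute_uv=False)
    return d**2 / np.sum(d**2)

# "Randomized" matrix: permute the entries within each row independently.
Y_perm = np.apply_along_axis(rng.permutation, 1, Y)

print(variance_explained(Y)[:10])       # drops sharply after the first component
print(variance_explained(Y_perm)[:10])  # roughly flat: no component stands out
```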

Page 27

SVD vs PCA

In data analysis we frequently hear the term “Principal Component Analysis” or PCA.

- SVD is a matrix decomposition with a mathematical definition and can be applied to any matrix.
- PCA is a data analysis technique that reduces to applying SVD to data pre-processed in a specific way: we subtract the mean for each row (also called centering).
- If the mean is 0, then the rank-k approximations capture the variance!

[Scatter plots of mydata[, 1] vs mydata[, 2] with the first and second singular vectors drawn: centered data (left) and uncentered data (right).]
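
A minimal sketch of the centering-then-SVD recipe on toy data (not the lecture's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 300)) + 3.0            # features x samples, nonzero mean

Yc = Y - Y.mean(axis=1, keepdims=True)              # subtract each row's mean (centering)
U, d, Vt = np.linalg.svd(Yc, full_matrices=False)

# With centered data, squared singular values are proportional to variance explained.
print(d**2 / np.sum(d**2))

# Cross-check against the eigenvalues of the covariance-style matrix Yc Yc^T.
eigvals = np.sort(np.linalg.eigvalsh(Yc @ Yc.T))[::-1]
print(np.allclose(eigvals, d**2))
```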


Page 29

SVD vs PCA

In PCA we also typically analyze just U or just V, and only care about the first few singular vectors. We assume the data is low rank, up to some error.

[Pairwise PCA scatter plots, colored by group (1–13). Cells PCA: PC1 (23.91%), PC2 (17.11%), PC3 (10.33%), PC4 (6.37%). Genes PCA: PC1 (24.72%), PC2 (16.05%), PC3 (11.46%), PC4 (7.37%).]

Page 30

Minimizing error and maximizing variance

SVD approximations can be viewed as:
- minimizing the squared error, or
- maximizing the variance along the singular vector directions.

Both views are equivalent and can be used to derive optimization algorithms.

Page 31

Projection intuition

Page 32

Two views of SVD

For a rank-1 approximation we have:

min_{u,v} ||Y − u d v^T||_F^2,  u ∈ R^{n×1}, v ∈ R^{m×1}    (4)
subject to ||u|| = 1, ||v|| = 1

This is the same as:

max_{u,v} u^T Y v    (5)

A simple algorithm for finding the first singular vectors:

1. Initialize a vector v with L2 norm 1.
2. Iterate:
   - u ← argmax_u u^T Y v subject to ||u||_2^2 ≤ 1
   - v ← argmax_v u^T Y v subject to ||v||_2^2 ≤ 1
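
Under the norm constraints each argmax has a closed form (normalize Y v and Y^T u), so the iteration is essentially power iteration. A minimal numpy sketch on a toy matrix:

```python
import numpy as np

def first_singular_vectors(Y, n_iter=100, seed=0):
    """Alternating maximization of u^T Y v with ||u|| = ||v|| = 1."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(Y.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = Y @ v
        u /= np.linalg.norm(u)        # argmax over u for fixed v
        v = Y.T @ u
        v /= np.linalg.norm(v)        # argmax over v for fixed u
    return u, u @ Y @ v, v            # left vector, top singular value, right vector

# Compare against numpy's SVD.
Y = np.random.default_rng(1).standard_normal((40, 25))
u, d1, v = first_singular_vectors(Y)
print(np.isclose(d1, np.linalg.svd(Y, compute_uv=False)[0]))   # True
```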

Page 33

SVD complexity

- The complexity of computing a full SVD (assuming m > n) is O(mn^2).
- If we want just the first k components we can do it in O(mnk).
- A very popular method is Randomized SVD (RSVD).

Page 34

RSVD

A ≈ Q Q^T A = Q B = Q Ũ Σ V^T = U Σ V^T    (6)

Here Q is a tall orthonormal matrix whose columns approximately span the range of A, B = Q^T A is small, B = Ũ Σ V^T is its (cheap) SVD, and U = Q Ũ.
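
A minimal randomized-SVD sketch along the lines of equation (6), using a Gaussian test matrix and no power iterations (an illustrative toy, not a production implementation):

```python
import numpy as np

def rsvd(A, k, oversample=10, seed=0):
    """Randomized SVD: sketch the range of A with a random matrix, then SVD the small projection."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + oversample))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)          # orthonormal basis Q with A ~ Q Q^T A
    B = Q.T @ A                             # small (k + oversample) x n matrix
    U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_tilde)[:, :k], s[:k], Vt[:k, :]               # U = Q @ U_tilde

# Toy low-rank-plus-noise matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 10)) @ rng.standard_normal((10, 300))
A += 0.01 * rng.standard_normal(A.shape)

U, s, Vt = rsvd(A, k=10)
print(s)
print(np.linalg.svd(A, compute_uv=False)[:10])   # leading singular values agree closely
```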

Page 35

Denoising variant of SVD

We can approximate our matrix Y as a sum of low-rank and sparse matrices. This is often called robust PCA, or rPCA.

A ≈ L + S    (7)

This reduces to
- penalizing S entry-wise with the L1 norm, and
- penalizing the singular values of L with the L1 norm, forcing some of them to 0.
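
One relaxed formulation is min_{L,S} ½||Y − L − S||_F^2 + τ||L||_* + λ||S||_1, which can be attacked by alternating the two closed-form thresholding steps. This is a rough sketch under that assumption, not the exact algorithm from the slides:

```python
import numpy as np

def soft(X, t):
    """Entry-wise soft-thresholding (the prox of the L1 penalty)."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def rpca_sketch(Y, tau=1.0, lam=0.1, n_iter=200):
    """Alternate closed-form updates: singular value thresholding for L, soft-thresholding for S."""
    L = np.zeros_like(Y)
    S = np.zeros_like(Y)
    for _ in range(n_iter):
        U, d, Vt = np.linalg.svd(Y - S, full_matrices=False)
        L = U @ np.diag(soft(d, tau)) @ Vt      # shrink singular values of the low-rank part
        S = soft(Y - L, lam)                    # shrink entries of the sparse part
    return L, S

# Toy example: rank-2 signal plus a few large outliers.
rng = np.random.default_rng(0)
Y = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 40))
outliers = (rng.random(Y.shape) < 0.02) * 10.0
L, S = rpca_sketch(Y + outliers)
print(np.linalg.matrix_rank(L, tol=1e-6), np.count_nonzero(S))
```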

Page 36

rPCA example

Using rPCA for background subtraction.

Figure from "Low-Rank Modeling and Its Applications in Image Analysis".

Page 37

General matrix decomposition / factor analysis

We want to approximate a (features) × (samples) matrix Y_{f×s} as a product of two (sometimes three) low-rank matrices:

Y_{f×s} = L_{f×k} F_{k×s} + E    (8)

We generally refer to F as the factors and L as the loadings; E is the error.

What do we mean by approximate? We want to maximize the likelihood of Y. We may have some priors on L and F.

Page 39

General matrix decomposition

min_{L,F} Loss(Y, LF) + P_L(L) + P_F(F)    (9)

- We want to minimize some loss.
- We may want to penalize and constrain the factors and loadings.
- The choices are derived from a likelihood formulation.

Page 41

Consider the least squares error

min_{L,F} ||Y − LF||_F^2    (10)

What probabilistic assumption does least-squares loss correspond to?

Absent any other assumptions this is solved by the SVD: we set L = U_{1:k} D_{1:k}^{1/2} and F = D_{1:k}^{1/2} V_{1:k}^T.

*Note that the scaling by D is arbitrary: L = U and F = D V^T give the same reconstruction error.

Page 42

Matrix factorization for prediction

If all we care about is the loss there are many equivalent solutions:

Y_{n×m} ≈ L_{n×k} F_{k×m} = (L B_{k×k})(B^{-1}_{k×k} F)    (11)

Multiplying L and F by some invertible matrix (and its inverse) gives the same predicted Y.

But SVD is unique?? How does that make sense?
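
A small numpy sketch of both points: build L and F from the truncated SVD as on the previous slide, then rotate them by an invertible B without changing the reconstruction (B here is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((30, 20))
k = 5

U, d, Vt = np.linalg.svd(Y, full_matrices=False)
L = U[:, :k] @ np.diag(np.sqrt(d[:k]))      # loadings:  U_{1:k} D^{1/2}
F = np.diag(np.sqrt(d[:k])) @ Vt[:k, :]     # factors:   D^{1/2} V^T_{1:k}

B = rng.standard_normal((k, k))             # any invertible k x k matrix
L2, F2 = L @ B, np.linalg.inv(B) @ F

print(np.allclose(L @ F, L2 @ F2))          # same predicted Y, different factors/loadings
```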

Page 43

Matrix factorization for prediction

What about the loss? Least squares loss is pretty standard:

min_{L,F} ||Y − LF||_F^2    (12)

If we care about predicting missing values, is this the best choice?

Hint: there are two possible problems!

Page 44

Beyond prediction: representation learning

We may want something more from our factorization than just predictions. The most famous machine learning dataset: 70,000 handwritten digits, each a 28 × 28 pixel image (784 pixels per image).

Approximating as rank-k

Page 45

Using SVD to understand your data

- The MNIST dataset is a 70,000 × 784 matrix.
- Each row is “really” one of just 10 digits!
- Does that correspond to the SVD representations? Do the singular vectors correspond to digits?

Singular vectors viewed as 28 × 28 pixel images:

Not really! If aliens are looking at MNIST and trying to understand handwritten digits, they shouldn't use SVD!
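
A sketch of how one might look at this, assuming scikit-learn's fetch_openml and matplotlib are available (downloading MNIST takes a while; we subsample for speed):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml

# Load MNIST as a 70,000 x 784 matrix (downloads on first use); subsample for speed.
X = fetch_openml('mnist_784', version=1, as_frame=False).data.astype(float)
X = X[:10000]

Xc = X - X.mean(axis=0)                    # remove the mean image
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# View the leading right singular vectors as 28 x 28 images.
fig, axes = plt.subplots(1, 6, figsize=(12, 2))
for i, ax in enumerate(axes):
    ax.imshow(Vt[i].reshape(28, 28), cmap='gray')
    ax.set_title(f'SV {i + 1}')
    ax.axis('off')
plt.show()
```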

Page 46

The why of representation learning?

Our data is generated by some process – there is a latent representation that captures that process.

- For MNIST the latent factors are digits.
- For Netflix the latent factors are movie attributes.
- For single cell data the latent factors are cell-types.

Can we recover them from matrix factorization?

Why would we want to?

Example: if the latent factors correspond to physical variables we can ask questions about causality. If they are linear combinations of real variables, that doesn't make sense!

Page 47

SVD does not generally lead to a mechanistically correct model

General factor analysis problem:

Y_{f×s} = L_{f×k} F_{k×s}    (13)

We can solve by SVD to minimize error, but:

- We hope that the individual vectors F_i are meaningful. SVD only guarantees minimum error.
- In fact, the F_i we get from SVD are by construction orthogonal. The real mechanistic model has no such restriction.
- Our SVD factors cannot in general capture the true latent structure.
- We can ask for the loadings L to be sparse (have lots of 0s) and positive:

minimize ||Y − LF||_F^2  subject to  L > 0,  ||L||_1 < t

More on this next week.
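
Non-negative matrix factorization is one concrete example of such a constrained factorization; a hedged sketch using scikit-learn (illustrative only, not necessarily the specific method the course covers next week):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Toy non-negative data built from sparse, positive loadings and factors.
L_true = rng.random((100, 4)) * (rng.random((100, 4)) < 0.3)
F_true = rng.random((4, 60))
Y = L_true @ F_true + 0.01 * rng.random((100, 60))

model = NMF(n_components=4, init='nndsvda', max_iter=500)
L = model.fit_transform(Y)        # non-negative loadings (many entries driven to 0)
F = model.components_             # non-negative factors

print(L.min() >= 0, F.min() >= 0)
print(np.mean(L == 0))            # fraction of exact zeros in the recovered loadings
```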

Page 48

More than two datasets?

- Canonical correlation. In the base case, defined for 2 datasets.
  - One of the dimensions is “aligned”.
  - Same set of biological samples but different assays. Aligned dimension: samples.
  - Same assay (gene expression) but different sets of samples. Aligned dimension: genes.
- Tensor factorization. Any number of datasets.
  - Considering individual datasets, two of the dimensions are “aligned”.
  - Example: different genomic assays in multiple cell-types.

Page 49

Canonical Correlation Analysis (CCA)

Basic idea: given datasets X_{p×n} and Y_{p×m}, find a linear combination of the columns of X (weights a) and a linear combination of the columns of Y (weights b) such that the correlation between the two combinations is maximized:

(a′, b′) = argmax_{a,b} corr(a^T X, b^T Y)    (14)

With no additional constraints, CCA has a closed-form solution in terms of eigenvectors of X^T X, X^T Y, and Y^T Y.

Integrating single-cell transcriptomic data across different conditions, technologies, and species
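
A small illustration with scikit-learn's CCA on toy paired data (an illustrative sketch; the single-cell application in the cited paper involves much more than this):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
# Two "views" of the same samples that share one latent signal.
shared = rng.standard_normal(200)
X = np.column_stack([shared + 0.5 * rng.standard_normal(200) for _ in range(10)])
Y = np.column_stack([shared + 0.5 * rng.standard_normal(200) for _ in range(8)])

cca = CCA(n_components=2)
Xc, Yc = cca.fit_transform(X, Y)          # canonical variates for each view

# The first pair of canonical variates should be highly correlated.
print(np.corrcoef(Xc[:, 0], Yc[:, 0])[0, 1])
```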

Page 50

CCA for single cell

Page 51

ENCODE dataset

A large collection of molecular profiles of different cell-types. The data is (cell-type) × (molecular assay) × (genomic position). Not all assays are available for all cell-types.

Page 52

Tensor factorization

This can be represented as a 3-dimensional tensor with many missing values.

Deep Tensor Factorization for the Imputation of Thousands of Missing Epigenetics Experiments
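
For intuition, here is a rough CP (rank-R) tensor factorization by alternating least squares on a small dense toy tensor. This is a naive sketch; the cited papers use far more elaborate, missing-value-aware models:

```python
import numpy as np

def unfold(T, mode):
    """Matricize a 3-way tensor along the given mode (C-order columns)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(U, V):
    """Column-wise Kronecker product: rows indexed by (row of U, row of V)."""
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, U.shape[1])

def cp_als(T, rank, n_iter=200, seed=0):
    """Naive CP decomposition T[i,j,k] ~ sum_r A[i,r] B[j,r] C[k,r]."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((dim, rank)) for dim in T.shape)
    for _ in range(n_iter):
        A = unfold(T, 0) @ np.linalg.pinv(khatri_rao(B, C).T)
        B = unfold(T, 1) @ np.linalg.pinv(khatri_rao(A, C).T)
        C = unfold(T, 2) @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C

# Toy tensor: (cell-type) x (assay) x (position) with true rank 3.
rng = np.random.default_rng(1)
A0, B0, C0 = rng.standard_normal((10, 3)), rng.standard_normal((6, 3)), rng.standard_normal((50, 3))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)

A, B, C = cp_als(T, rank=3)
T_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
print(np.linalg.norm(T - T_hat) / np.linalg.norm(T))   # relative reconstruction error (should be small)
```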

Page 53

Non-linear factorizations

We can think of matrix factorization as a prediction problem. If our model is

Y_{f×s} = L_{f×k} F_{k×s} + E    (15)

then elements of Y can be predicted as linear combinations of F, with the coefficients given by L. Now we can replace the linear function (multiplication by L) by some non-linear function.

Why would we want to do this?

Page 54

Comparing linear and non-linear models

Next week: We will learn about constrained versions of matrix factorization.

PREDICTD: PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition. Nature Communications, 2018.

Page 55: Scalable Machine Learninggobie.csb.pitt.edu/SML/MatrixFactorization.pdf · Scalable Machine Learning Matrix factorization. ... I Representation learning. Recommender systems There