
FINDING RELATIONS BETWEEN VARIABLES

Pearson’s Correlation

Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University of Bologna


Relation between coupled variables

Which pairs of variables are related?

Correlated variables


Uncorrelated variables


Covariance and Pearson’s Correlation index

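          Variable 1   Variable 2
Item 1    x1(1)        x2(1)
Item 2    x1(2)        x2(2)
...
Item i    x1(i)        x2(i)
...
Item m    x1(m)        x2(m)
Mean      M1           M2

\[ M_1 = \frac{1}{m} \sum_{i=1}^{m} x_1(i) \qquad M_2 = \frac{1}{m} \sum_{i=1}^{m} x_2(i) \]

\[ \mathrm{cov}(x_1, x_2) = \frac{1}{m-1} \sum_{i=1}^{m} (x_1(i) - M_1)(x_2(i) - M_2) \]

\[ \mathrm{corr}(x_1, x_2) = \frac{1}{m-1} \sum_{i=1}^{m} \frac{(x_1(i) - M_1)(x_2(i) - M_2)}{\sigma_1 \sigma_2} \]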
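The definitions on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the toy data are assumptions, and the 1/(m-1) normalization is used throughout.

```python
import math

def mean(values):
    """Arithmetic mean M of one variable."""
    return sum(values) / len(values)

def covariance(x1, x2):
    """Sample covariance of two paired variables, normalized by 1/(m-1)."""
    m = len(x1)
    m1, m2 = mean(x1), mean(x2)
    return sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (m - 1)

def pearson(x1, x2):
    """Pearson's correlation: covariance divided by both standard deviations."""
    sigma1 = math.sqrt(covariance(x1, x1))  # variance = covariance with itself
    sigma2 = math.sqrt(covariance(x2, x2))
    return covariance(x1, x2) / (sigma1 * sigma2)

# Toy data (hypothetical): x2 is exactly twice x1, so the correlation is 1
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 8.0]
print(covariance(x1, x2), pearson(x1, x2))
```

Dividing by the two standard deviations is what makes the index independent of the units of the two variables.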


Covariance and Pearson’s Correlation index

The variance of each variable is its covariance with itself:

\[ \sigma_j^2 = \mathrm{cov}(x_j, x_j) = \frac{1}{m-1} \sum_{i=1}^{m} (x_j(i) - M_j)(x_j(i) - M_j) \]


Correlation

\[ \mathrm{corr}(x_1, x_2) \in [-1, 1] \]


When is a correlation significant?

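Given a correlation index:

\[ r(x_1, x_2) = \frac{1}{m-1} \sum_{i=1}^{m} \frac{(x_1(i) - M_1)(x_2(i) - M_2)}{\sigma_1 \sigma_2} \]

a test variable can be computed under the null hypothesis that r = 0:

\[ t = r \sqrt{\frac{n-2}{1-r^2}} \]

where n is the number of items (m in the formula above). t is distributed as a Student's t with n-2 degrees of freedom. The test assumes that x is normally distributed.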
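As a minimal sketch, the test variable can be computed as follows, assuming the usual form t = r·sqrt((n-2)/(1-r²)) for the null hypothesis r = 0. The function name and the example values are illustrative assumptions.

```python
import math

def t_statistic(r, n):
    """Test variable for the null hypothesis r = 0 (n paired observations)."""
    if n < 3:
        raise ValueError("at least 3 paired observations are needed")
    if abs(r) >= 1.0:
        raise ValueError("t is not finite for |r| = 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical example: r = 0.5 observed on 27 items (25 degrees of freedom)
print(t_statistic(0.5, 27))
```

Turning t into a P-value requires the Student's t distribution with n-2 degrees of freedom, which is not in the Python standard library; if SciPy is available, `scipy.stats.t.sf` can be used for the upper tail.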


Graph showing the minimum value of Pearson's correlation coefficient that is significantly different from zero at the 0.05 level, for a given sample size.


Example: discovery of a misconduct

Repeatability test: two different experimentalists were asked to take the same solution and to perform 24 independent ELISA assays on a 6x4 plate.

They submitted to the assessor the following spectrophotometer readings, ordered by well.


Well   S1      S2
P1     0.481   0.496
P2     0.485   0.501
P3     0.479   0.495
P4     0.506   0.522
P5     0.467   0.480
P6     0.474   0.491
P7     0.469   0.480
P8     0.475   0.489
P9     0.514   0.520
P10    0.520   0.524
P11    0.526   0.531
P12    0.494   0.509
P13    0.535   0.540
P14    0.524   0.526
P15    0.481   0.492
P16    0.502   0.509
P17    0.479   0.484
P18    0.491   0.495
P19    0.503   0.515
P20    0.472   0.481
P21    0.481   0.486
P22    0.503   0.512
P23    0.448   0.454
P24    0.519   0.526

The assessor suspected that the experimenters had submitted two reads of the same plate.

How to prove it?


Example: discovery of a misconduct

[Scatter plot of S2 versus S1: R = 0.978]


Example: discovery of a misconduct

R=0.978 n=24 t=22.05

Objection: the test is valid only when the data are normally distributed, and we cannot prove that. Any other idea?


Use the data set to generate random experiments

Well   S1      S2      Random(S2)
P1     0.481   0.496   0.495
P2     0.485   0.501   0.522
P3     0.479   0.495   0.480
P4     0.506   0.522   0.491
P5     0.467   0.480   0.480
P6     0.474   0.491   0.489
P7     0.469   0.480   0.520
P8     0.475   0.489   0.524
P9     0.514   0.520   0.531
P10    0.520   0.524   0.509
P11    0.526   0.531   0.540
P12    0.494   0.509   0.526
P13    0.535   0.540   0.492
P14    0.524   0.526   0.509
P15    0.481   0.492   0.484
P16    0.502   0.509   0.495
P17    0.479   0.484   0.515
P18    0.491   0.495   0.481
P19    0.503   0.515   0.486
P20    0.472   0.481   0.512
P21    0.481   0.486   0.454
P22    0.503   0.512   0.526
P23    0.448   0.454   0.496
P24    0.519   0.526   0.501


[Scatter plot of Random(S2) versus S1: R = 0.25]


Building a distribution by random resampling

Iterate the shuffling and the computation of r many times (say 1000).

Compute a cumulative histogram counting the resamplings that score a correlation ≥ r.

[Cumulative histogram: number of resamplings (0 to 1000) versus correlation (0.00 to 0.90)]
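The resampling procedure can be sketched in Python using the 24 pairs of readings from the misconduct example. This is an illustrative sketch rather than the course's own solution: the function names, the fixed seed and the choice of 1000 shuffles are assumptions.

```python
import math
import random

# Readings from the example, in well order P1..P24
S1 = [0.481, 0.485, 0.479, 0.506, 0.467, 0.474, 0.469, 0.475,
      0.514, 0.520, 0.526, 0.494, 0.535, 0.524, 0.481, 0.502,
      0.479, 0.491, 0.503, 0.472, 0.481, 0.503, 0.448, 0.519]
S2 = [0.496, 0.501, 0.495, 0.522, 0.480, 0.491, 0.480, 0.489,
      0.520, 0.524, 0.531, 0.509, 0.540, 0.526, 0.492, 0.509,
      0.484, 0.495, 0.515, 0.481, 0.486, 0.512, 0.454, 0.526]

def pearson(x, y):
    """Pearson's correlation of two paired lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_test(x, y, n_shuffles=1000, seed=0):
    """Return (observed r, fraction of random pairings with correlation >= r)."""
    rng = random.Random(seed)        # fixed seed only for reproducibility
    observed = pearson(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)        # random re-pairing of the same values
        if pearson(x, shuffled) >= observed:
            hits += 1
    return observed, hits / n_shuffles

r_observed, p_value = permutation_test(S1, S2)
print(r_observed, p_value)
```

The fraction returned is an empirical P-value: it makes no assumption on the distribution of the readings, only that under the null hypothesis every pairing of the S1 and S2 values is equally likely.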


Building a distribution by random resampling

The plot gives the probability (per thousand) of obtaining at least a given correlation with random pairings of the original data. The resulting P-value does not depend on assumptions about the data distribution.

[Cumulative histogram of the correlations obtained by resampling]

This is only an example plot: compute yourself the plot corresponding to the data available in the misconduct.xls file.


Bootstrapping

The term is often attributed to Rudolf Erich Raspe's story The Surprising Adventures of Baron Munchausen, where the main character pulls himself out of a swamp by his hair (specifically, his pigtail), but the Baron does not, in fact, pull himself out by his bootstraps


FINDING RELATIONS BETWEEN VARIABLES

Spearman’s Correlation


Pearson’s correlation index assumes linear dependence

R=0.816


Non parametric correlation: Spearman

Given a set of pairs (xi, yi), sort the two variables separately to obtain the ranks.

Spearman's correlation is Pearson's correlation of the ranked variables: [R(xi), R(yi)]
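A minimal Python sketch of this definition follows. The helper names are assumptions, and tied values are given the average of their ranks, which is a common convention not stated on the slide.

```python
import math

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with this one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def pearson(x, y):
    """Pearson's correlation of two paired lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical monotonic but non-linear relation: y = x^3
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(pearson(x, y), spearman(x, y))
```

Because only the ordering matters, any monotonically increasing relation gives a Spearman correlation of 1, while Pearson's index stays below 1 when the relation is not linear.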


Under the null hypothesis (r = 0), the test variable

\[ t = r \sqrt{\frac{n-2}{1-r^2}} \]

is distributed as a Student's t with n-2 degrees of freedom.


Categorical data: Matthews correlation index

                         Secreted   Non Secreted   Total
With Signal Peptide      a          b              a + b
Without Signal Peptide   c          d              c + d
Total                    a + c      b + d          n

\[ \mathrm{MCC} = \frac{ad - bc}{\sqrt{(a+b)(a+c)(d+b)(d+c)}} \]
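The index can be computed directly from the four cells of the table. The sketch below assumes MCC = (ad - bc) / sqrt((a+b)(a+c)(d+b)(d+c)); the counts in the example are hypothetical, and returning 0 when a row or column is empty is a convention chosen here, not something stated on the slide.

```python
import math

def mcc(a, b, c, d):
    """Matthews correlation index of a 2x2 contingency table.

    a: with signal peptide, secreted      b: with signal peptide, non secreted
    c: without signal peptide, secreted   d: without signal peptide, non secreted
    """
    denominator = math.sqrt((a + b) * (a + c) * (d + b) * (d + c))
    if denominator == 0:
        return 0.0  # assumed convention when a whole row or column is empty
    return (a * d - b * c) / denominator

# Hypothetical counts
print(mcc(40, 10, 5, 45))
```

Like Pearson's index, the result lies between -1 (perfect disagreement) and 1 (perfect agreement), with 0 for unrelated categories.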


EXTRACTING INFORMATION FROM HIGH-DIMENSIONAL DATA

Principal Component Analysis


High-dimensional descriptors

Many different descriptors can be adopted for characterizing a set of objects under investigation:

1) Given a set of proteins, we can measure the residue composition (20 values), the dipeptide composition (400 values), the length, the average hydrophobicity, ... of each sequence

2) Given a set of individuals, we can measure dimensions, weight, blood concentration of metabolites, ...

How can we extract a minimal set of descriptors without losing information on the variability of the set?


Data Reduction

Summarizing the p-dimensional description of n objects by a smaller set of (k) derived (synthetic, composite) variables.


Data Reduction

“Residual” variation is information in A that is not retained in X

A good data reduction must balance clarity of representation and ease of understanding against oversimplification, that is, the loss of important or relevant information.


Most important descriptors

When plotting these two-dimensional data, it is evident that the x direction accounts for the largest part of the variance.

Directions with largest variance better describe the data set


Most important descriptors

When variables are correlated, the variances of the individual variables are not sufficient to determine the principal components.


Example: 2D

Markus Ringnér. What is principal component analysis? Nature Biotechnology 26, 303-304 (2008)

Suppose we measure the expression levels of two genes (XBP1 and GATA3) in breast cancer samples that do or do not express the estrogen receptor (ER+ in red, ER- in black).

Markus Ringnér. What is principal component analysis? Nature Biotechnology 26, 303-304 (2008)

Example: 2D

The total variance is decomposed into two orthogonal components, PCA1 and PCA2.

Markus Ringnér. What is principal component analysis? Nature Biotechnology 26, 303-304 (2008)

Example: 2D

Analysis of the principal components can highlight important features in an unsupervised way.


2D Example of PCA

• Variables X1 and X2 have positive covariance and each has a similar variance.

\[ \bar{X}_1 = 8.35 \qquad \bar{X}_2 = 4.91 \qquad V_1 = 6.67 \qquad V_2 = 6.24 \qquad C_{1,2} = 3.42 \]

[Scatter plot of Variable X2 versus Variable X1; the mean is marked with +]

www.plantbiology.siu.edu/PLB444/PCA.ppt


Configuration is Centered

• Each variable is adjusted to a mean of zero (by subtracting the mean from each value).

[Scatter plot of the centered data: Variable X2 versus Variable X1]

www.plantbiology.siu.edu/PLB444/PCA.ppt


Configuration is rotated

New coordinates: PC 1, PC 2

• PC 1 and PC 2 have zero covariance.
• PC 1 has the highest possible variance (9.88).
• PC 2 has a variance of 3.03.

[Scatter plot of the rotated data: PC 2 versus PC 1]
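The two variances quoted for PC 1 and PC 2 can be checked numerically. The sketch below assumes that the summary statistics of this 2D example are V1 = 6.67, V2 = 6.24 and C1,2 = 3.42 (read from the example slide) and uses the closed-form eigenvalues of a symmetric 2x2 matrix; the function name is an assumption.

```python
import math

def eigenvalues_2x2_symmetric(v1, v2, c12):
    """Eigenvalues of [[v1, c12], [c12, v2]], largest first."""
    half_trace = (v1 + v2) / 2
    radius = math.sqrt(((v1 - v2) / 2) ** 2 + c12 ** 2)
    return half_trace + radius, half_trace - radius

# Assumed variances and covariance of X1 and X2 in the 2D example
lambda1, lambda2 = eigenvalues_2x2_symmetric(6.67, 6.24, 3.42)
total = lambda1 + lambda2

print(round(lambda1, 2), round(lambda2, 2))
print(round(100 * lambda1 / total, 1), "% of the variance lies along PC 1")
```

The rotation does not change the total variance: the two eigenvalues sum to V1 + V2, and only its distribution over the axes changes.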

www.plantbiology.siu.edu/PLB444/PCA.ppt


Covariance Matrix

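          Variable 1   Variable 2   ...   Variable j   ...   Variable n
Item 1    x1(1)        x2(1)              xj(1)              xn(1)
Item 2    x1(2)        x2(2)              xj(2)              xn(2)
...
Item i    x1(i)        x2(i)              xj(i)              xn(i)
...
Item m    x1(m)        x2(m)              xj(m)              xn(m)

\[ \mathrm{COV} = \begin{pmatrix}
\sigma_1^2 & \mathrm{cov}(x_1, x_2) & \cdots & \mathrm{cov}(x_1, x_j) & \cdots & \mathrm{cov}(x_1, x_n) \\
\mathrm{cov}(x_2, x_1) & \sigma_2^2 & \cdots & \mathrm{cov}(x_2, x_j) & \cdots & \mathrm{cov}(x_2, x_n) \\
\vdots & \vdots & & \vdots & & \vdots \\
\mathrm{cov}(x_j, x_1) & \mathrm{cov}(x_j, x_2) & \cdots & \sigma_j^2 & \cdots & \mathrm{cov}(x_j, x_n) \\
\vdots & \vdots & & \vdots & & \vdots \\
\mathrm{cov}(x_n, x_1) & \mathrm{cov}(x_n, x_2) & \cdots & \mathrm{cov}(x_n, x_j) & \cdots & \sigma_n^2
\end{pmatrix} \]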
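A short Python sketch of how the covariance matrix can be built from an m x n table (one row per item, one column per variable). The function name and the toy table are assumptions; the 1/(m-1) normalization matches the covariance definition used earlier.

```python
def covariance_matrix(items):
    """n x n covariance matrix of m items described by n variables."""
    m = len(items)
    n = len(items[0])
    means = [sum(row[j] for row in items) / m for j in range(n)]
    return [
        [
            sum((row[j] - means[j]) * (row[k] - means[k]) for row in items) / (m - 1)
            for k in range(n)
        ]
        for j in range(n)
    ]

# Hypothetical table: 4 items, 3 variables (the second is twice the first)
items = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 0.0],
    [4.0, 8.0, 1.0],
]
cov = covariance_matrix(items)
for row in cov:
    print([round(value, 3) for value in row])
```

The diagonal holds the variances of the single variables, and the matrix is symmetric because cov(xj, xk) = cov(xk, xj).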


Covariance and multidimensional normal distribution

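\[ p(x \mid M, \mathrm{COV}) = \frac{1}{(2\pi)^{n/2} \, |\mathrm{COV}|^{1/2}} \exp\left( -\frac{1}{2} (x - M)^T \, \mathrm{COV}^{-1} (x - M) \right) \]

M is an n-valued vector (means)

COV is an n x n symmetric matrix (covariance matrix), with determinant |COV|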


Uncorrelated variables

The set of variables is uncorrelated if all the covariances (not the variances) are null:

\[ \mathrm{COV}_{ij} = \frac{1}{m-1} \sum_{k=1}^{m} (x_i(k) - M_i)(x_j(k) - M_j) = \sigma_i^2 \, \delta_{ij} \]

that is, if the covariance matrix is in diagonal form.


Eigenvector equation

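COV is a square symmetric matrix and can be reduced to diagonal form

\[ \widetilde{\mathrm{COV}}_{ij} = \lambda_i \, \delta_{ij} \]

by means of the eigenvector equation

\[ \mathrm{COV} \, u = \lambda \, u \]

The condition

\[ \det(\mathrm{COV} - \lambda I) = 0 \]

defines n real eigenvalues λi and n real-valued orthonormal eigenvectors ui.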


Matrix transformation

The normalized eigenvectors ui, taken as columns, are orthogonal and define the n x n unitary matrix U:

\[ U^T U = I \qquad U^{-1} = U^T \]

In two dimensions:

\[ U = \begin{pmatrix} u_1 & u_2 \end{pmatrix} = \begin{pmatrix} U_{11} & U_{12} \\ U_{21} & U_{22} \end{pmatrix} \]


Matrix transformation

U defines an orthogonal rotation and/or reflection of the coordinate axes (preserving norms and angles):

\[ \tilde{x} = U^T x \]

\[ \tilde{x}^T \tilde{x} = x^T U U^T x = x^T x \]

\[ U = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \]

Try, for example, a counterclockwise rotation of 45° and transform the point (1,1).
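The suggested exercise can be checked with a few lines of Python. The sketch assumes the standard counterclockwise rotation matrix U = [[cos θ, -sin θ], [sin θ, cos θ]]; the helper names are illustrative.

```python
import math

def rotation_matrix(theta):
    """Counterclockwise rotation by theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def apply(matrix, vector):
    """Matrix-vector product."""
    return [sum(row[j] * vector[j] for j in range(len(vector))) for row in matrix]

U = rotation_matrix(math.radians(45))
point = [1.0, 1.0]

# Coordinates of the same point in the axes rotated by 45 degrees: x~ = U^T x
new_coordinates = apply(transpose(U), point)
# For comparison: the point itself rotated by 45 degrees in the old axes
rotated_point = apply(U, point)

print(new_coordinates)   # approximately (sqrt(2), 0)
print(rotated_point)     # approximately (0, sqrt(2))
```

In both cases the norm sqrt(2) of the point is preserved, as expected for an orthogonal transformation.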


Matrix transformation

Given the matrix A, the eigenvalues define the diagonal matrix Λ and the eigenvectors define the unitary matrix U so that:

\[ \Lambda = U^T A U \]

Prove with matrix:

\[ A = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix} \]
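A numerical check of this exercise, taking A = [[1, 2], [2, 1]] as the matrix on the slide. From det(A - λI) = (1 - λ)² - 4 = 0 the eigenvalues are 3 and -1, with normalized eigenvectors (1, 1)/√2 and (1, -1)/√2; the helper names below are assumptions.

```python
import math

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

A = [[1.0, 2.0], [2.0, 1.0]]

s = 1.0 / math.sqrt(2.0)
# Columns of U are the eigenvectors: u1 = (1, 1)/sqrt(2), u2 = (1, -1)/sqrt(2)
U = [[s, s],
     [s, -s]]

Lambda = matmul(matmul(transpose(U), A), U)
identity = matmul(transpose(U), U)

print(Lambda)     # approximately [[3, 0], [0, -1]]
print(identity)   # approximately [[1, 0], [0, 1]]
```

The off-diagonal terms vanish and the diagonal carries the eigenvalues in the same order as the eigenvector columns of U.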


Matrix transformation

The diagonal matrix:

\[ \widetilde{\mathrm{COV}} = U^T \, \mathrm{COV} \, U \]

U then defines a coordinate rotation such that, in the new system, the variables are not correlated.


Matrix transformation

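Coordinate transformation

\[ \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = U^T \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \qquad \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = U \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \]

[Figure: the original axes x1, x2 and the rotated axes y1, y2]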


Matrix transformation

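In particular, the old coordinates of the new axes, called LOADINGS, are the eigenvectors:

\[ \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = U^T \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \qquad \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = U \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \]

\[ U \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} U_{11} \\ U_{21} \end{pmatrix} \qquad U \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} U_{12} \\ U_{22} \end{pmatrix} \]

[Figure: the new axes y1 and y2 drawn in the old coordinates; y1 has components (U11, U21) and y2 has components (U12, U22)]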


Principal Component Analysis

Given any item x, represented by an n-valued vector

\[ x^T = (x_1, x_2, \ldots, x_n) \]

it can be expressed with the n ordered principal components, y, by using the basis U:

\[ x = \sum_{i=1}^{n} y_i \, u_i \]

where

\[ y_i = x^T u_i = u_i^T x \]


Matrix transformation

The eigenvalues measure the variances along the new coordinate axes.

Usually the percentage of the total variance accounted for by each coordinate is reported

Coordinates are sorted from the highest to the lowest eigenvalue


Principal Component Analysis

Given a set of m items described with n variables:

1) Compute the mean of each variable
2) Subtract the mean from each measure
3) Compute the covariance matrix
4) Diagonalize the covariance matrix
5) Sort the eigenvalues from the highest to the lowest: the corresponding eigenvectors define the 1st, 2nd, ..., j-th principal components
6) For each item i, the j-th component results from the scalar product of the (centered) variable vector of item i with the j-th eigenvector
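The six steps can be sketched in plain Python for the special case of two variables, where the covariance matrix can be diagonalized in closed form. The restriction to n = 2, the function name `pca_2d` and the toy data are assumptions made to keep the sketch self-contained; for more variables a numerical eigen-solver would be needed.

```python
import math

def pca_2d(items):
    """PCA of m items described by two variables (lists of [x1, x2])."""
    m = len(items)
    # 1) mean of each variable
    means = [sum(row[j] for row in items) / m for j in range(2)]
    # 2) subtract the mean from each measure
    centered = [[row[0] - means[0], row[1] - means[1]] for row in items]
    # 3) covariance matrix [[v1, c12], [c12, v2]]
    v1 = sum(r[0] * r[0] for r in centered) / (m - 1)
    v2 = sum(r[1] * r[1] for r in centered) / (m - 1)
    c12 = sum(r[0] * r[1] for r in centered) / (m - 1)
    # 4) diagonalize (closed form for a symmetric 2x2 matrix)
    half_trace = (v1 + v2) / 2
    radius = math.hypot((v1 - v2) / 2, c12)
    # 5) eigenvalues sorted from the highest to the lowest
    eigenvalues = [half_trace + radius, half_trace - radius]
    if c12 != 0.0:
        vectors = [[c12, lam - v1] for lam in eigenvalues]
    elif v1 >= v2:
        vectors = [[1.0, 0.0], [0.0, 1.0]]   # already uncorrelated
    else:
        vectors = [[0.0, 1.0], [1.0, 0.0]]
    eigenvectors = [[x / math.hypot(x, y), y / math.hypot(x, y)] for x, y in vectors]
    # 6) j-th component of item i = scalar product with the j-th eigenvector
    scores = [[r[0] * u[0] + r[1] * u[1] for u in eigenvectors] for r in centered]
    return eigenvalues, eigenvectors, scores

# Hypothetical data: 5 items, 2 variables
items = [[2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 7.0], [6.0, 4.0]]
eigenvalues, eigenvectors, scores = pca_2d(items)
print(eigenvalues)
print(eigenvectors)
```

In the new coordinates the variance along each axis equals the corresponding eigenvalue and the covariance between the two axes is zero. The sign of each eigenvector is arbitrary.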


Why “principal”?

Given an n-dimensional representation of m observed samples, suppose you want to represent the data in a p-dimensional space, with p<n (e.g., for representing them in 2- or 3-dimensional space). Which is the best choice, when only linear transformations are allowed?

To use the first p principal components.


Why “principal”? Minimum-error formulation

We have to search for a complete orthonormal n-dimensional basis V

\[ x_k = \sum_{i=1}^{n} a_{ki} \, v_i = \sum_{i=1}^{n} (x_k^T v_i) \, v_i \]

and we have to use p dimensions in that coordinate basis to approximate the points:

\[ \tilde{x}_k = \sum_{i=1}^{p} b_{ki} \, v_i + \sum_{i=p+1}^{n} c_i \, v_i \]

where the ci are independent of k


Why “principal”? Minimum-error formulation

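The goal is to minimize the error

\[ E = \frac{1}{m} \sum_{k=1}^{m} \left\| x_k - \tilde{x}_k \right\|^2 \]

For any basis V, the error is minimized when:

\[ b_{ki} = a_{ki} = x_k^T v_i \qquad c_i = \bar{x}^T v_i \]

where \bar{x} is the mean of the x_k. Then:

\[ x_k - \tilde{x}_k = \sum_{i=p+1}^{n} \left[ (x_k - \bar{x})^T v_i \right] v_i \]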


Why “principal”? Minimum-error formulation

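The error is minimized by imposing the normalization of each v (via Lagrange multipliers):

\[ E = \frac{1}{m} \sum_{k=1}^{m} \sum_{i=p+1}^{n} \left( x_k^T v_i - \bar{x}^T v_i \right)^2 = \sum_{i=p+1}^{n} v_i^T \, \mathrm{COV} \, v_i \]

(with COV normalized here by 1/m)

\[ \underset{v_i}{\mathrm{MINIMIZE}} \; \sum_{i=p+1}^{n} \left[ v_i^T \, \mathrm{COV} \, v_i + \lambda_i \left( 1 - v_i^T v_i \right) \right] \]

Then

\[ \mathrm{COV} \, v_i = \lambda_i \, v_i \qquad E = \sum_{i=p+1}^{n} \lambda_i \]

choosing the lowest eigenvalues.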


Covariance vs Correlation

using covariances among variables only makes sense if they are measured in the same units

even then, variables with high variances will dominate the principal components

these problems are generally avoided by standardizing each variable to unit variance and zero mean.


Correlation PCA

The correlation matrix can be chosen instead of the covariance matrix. It gives a zero-mean, unit-variance plot.

\[ \mathrm{corr}(x_1, x_2) = \frac{1}{m-1} \sum_{i=1}^{m} \frac{(x_1(i) - M_1)(x_2(i) - M_2)}{\sigma_1 \sigma_2} \]
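Using the correlation matrix is equivalent to standardizing each variable before computing covariances. A small sketch of this equivalence follows; the function names and the two toy variables (in different units) are assumptions.

```python
import math

def standardize(values):
    """Rescale a variable to zero mean and unit variance (1/(m-1) normalization)."""
    m = len(values)
    mean = sum(values) / m
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (m - 1))
    return [(v - mean) / sd for v in values]

def covariance(x, y):
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (m - 1)

# Hypothetical variables measured in different units
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weight_kg = [52.0, 58.0, 69.0, 77.0, 90.0]

z_height = standardize(height_cm)
z_weight = standardize(weight_kg)

# Covariance of the standardized variables = correlation of the original ones
correlation = covariance(z_height, z_weight)
print(covariance(height_cm, weight_kg), correlation)
```

After standardization every variable contributes a variance of 1, so no variable dominates the principal components merely because of its units.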


An ecological example

• data from research on habitat definition in the endangered Baw Baw frog (Philoria frosti)

• 16 environmental and structural variables measured at each of 124 sites

• correlation matrix used because variables have different units

www.plantbiology.siu.edu/PLB444/PCA.ppt


Eigenvalues

Axis   Eigenvalue   % of Variance   Cumulative % of Variance
1      5.855        36.60           36.60
2      3.420        21.38           57.97
3      1.122         7.01           64.98
4      1.116         6.97           71.95
5      0.982         6.14           78.09
6      0.725         4.53           82.62
7      0.563         3.52           86.14
8      0.529         3.31           89.45
9      0.476         2.98           92.42
10     0.375         2.35           94.77

www.plantbiology.siu.edu/PLB444/PCA.ppt


Baw Baw Frog - PCA of 16 Habitat Variables

[Scree plot: eigenvalue (0.0 to 7.0) versus PC axis number (1 to 10)]

www.plantbiology.siu.edu/PLB444/PCA.ppt


Interpreting Eigenvectors

• correlations between variables and the principal axes are known as loadings

• each element of the eigenvectors represents the contribution of a given variable to a component

Variable   Axis 1    Axis 2    Axis 3
Altitude    0.3842    0.0659   -0.1177
pH         -0.1159    0.1696   -0.5578
Cond       -0.2729   -0.1200    0.3636
TempSurf    0.0538   -0.2800    0.2621
Relief     -0.0765    0.3855   -0.1462
maxERht     0.0248    0.4879    0.2426
avERht      0.0599    0.4568    0.2497
%ER         0.0789    0.4223    0.2278
%VEG        0.3305   -0.2087   -0.0276
%LIT       -0.3053    0.1226    0.1145
%LOG       -0.3144    0.0402   -0.1067
%W         -0.0886   -0.0654   -0.1171
H1Moss      0.1364   -0.1262    0.4761
DistSWH    -0.3787    0.0101    0.0042
DistSW     -0.3494   -0.1283    0.1166
DistMF      0.3899    0.0586   -0.0175

www.plantbiology.siu.edu/PLB444/PCA.ppt


How many axes are needed?

• does the (k+1)th principal axis represent more variance than would be expected by chance?

• several tests and rules have been proposed

• a common “rule of thumb” when PCA is based on correlations is that axes with eigenvalues > 1 are worth interpreting
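Applied to the ten eigenvalues listed for the Baw Baw frog example, the rule of thumb can be sketched as follows. The variable names are assumptions; the total variance is taken as 16 because a correlation-based PCA of 16 standardized variables has one unit of variance per variable.

```python
# First ten eigenvalues reported for the Baw Baw frog PCA
eigenvalues = [5.855, 3.420, 1.122, 1.116, 0.982,
               0.725, 0.563, 0.529, 0.476, 0.375]
n_variables = 16  # correlation PCA: total variance = number of variables

# Rule of thumb: keep the axes whose eigenvalue is greater than 1
retained_axes = [axis for axis, value in enumerate(eigenvalues, start=1) if value > 1.0]

# Percentage and cumulative percentage of the total variance
percent = [100.0 * value / n_variables for value in eigenvalues]
cumulative = []
running = 0.0
for p in percent:
    running += p
    cumulative.append(running)

print(retained_axes)
for axis, (p, c) in enumerate(zip(percent, cumulative), start=1):
    print(axis, round(p, 2), round(c, 2))
```

With these values the first four axes pass the threshold and together account for about 72% of the total variance; the fifth axis (0.982) falls just below it, which shows how sharp such a rule of thumb is.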

www.plantbiology.siu.edu/PLB444/PCA.ppt


Example: Thermostability

The problem is to investigate the differences in residue composition, codon composition and codon usage between thermophilic and mesophilic prokaryotes.

Montanucci, Martelli, Fariselli, Casadio J Proteome Res, 2007


Example: Thermostability

The data set contains 116 fully sequenced genomes from prokaryotes

16 thermophilic species (11 archaea and 5 bacteria), with an OGT higher than 60 °C

100 mesophilic species (95 bacteria and 5 archaea) with an OGT lower than 45 °C.

7 quasi-mesophilic species (7 bacteria) with OGT between 45 and 60 °C


PCA on residue composition

[PCA plot. Red: thermophilic; green: intermediates; blue: mesophilic]


PCA on codon usage

[PCA plot. Red: thermophilic; green: intermediates; blue: mesophilic]


PCA on codon composition

Legend: red = thermophilic, green = intermediates, blue = mesophilic

Page 66: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Components of 1st and 2nd PC (codon composition)

First two components expressed as a function of the codons

Page 67: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Relation between 2nd component and OGT

Page 68: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Pitfalls of PCA

A method such as PCA assumes that there are linear relationships between the derived components and the original variables. This is apparent from the role of the covariance or correlation matrix. If the relationships are non-linear, a Pearson correlation coefficient, which measures the strength of linear relationships, would underestimate the strength of the non-linear relationship.
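
The underestimation can be seen on a toy case. The sketch below is not from the lecture: the data and the helper `pearson` are made up for illustration. It compares an exact linear dependence with an exact quadratic one on points symmetric around zero.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: 21 points symmetric around 0.
xs = [i / 10 for i in range(-10, 11)]
linear = [2 * x + 1 for x in xs]     # exact linear dependence
parabola = [x * x for x in xs]       # exact but non-linear dependence

r_linear = pearson(xs, linear)       # close to 1
r_parabola = pearson(xs, parabola)   # close to 0, although y is a function of x
print(r_linear, r_parabola)
```

Both variables are fully determined by x, yet only the linear case is detected by the correlation coefficient.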

Page 69: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

EXTRACTING INFORMATION FROM HIGH DIMENSIONAL DATA

Correspondence analysis

Page 70: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Pitfalls of PCA

PCA can be applied in a coherent way only when continuous variables are taken into account. Data are, however, often available as counts on categorical variables, i.e. as a contingency table.

Page 71: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Contingency tables

          Variable 1   Variable 2   ...   Variable j   ...   Variable n
Item 1    x11          x12          ...   x1j          ...   x1n
Item 2    x21          x22          ...   x2j          ...   x2n
...
Item i    xi1          xi2          ...   xij          ...   xin
...
Item m    xm1          xm2          ...   xmj          ...   xmn

Ex: Experiment = genomes, Variable = codon count

Ex: Experiment = text, Variable = number of occurrences of each character in the text

Page 72: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Text character usage: is it distinctive of the authors?

Yelland PM, The Mathematica Journal

Page 73: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Margins

               Variable 1   Variable 2   ...   Variable j   ...   Variable n   Row total
Item 1         x11          x12          ...   x1j          ...   x1n          r1
Item 2         x21          x22          ...   x2j          ...   x2n          r2
...
Item i         xi1          xi2          ...   xij          ...   xin          ri
...
Item m         xm1          xm2          ...   xmj          ...   xmn          rm
Column total   c1           c2           ...   cj           ...   cn           N

$$ r_i = \sum_{j=1}^{n} x_{ij}, \qquad c_j = \sum_{i=1}^{m} x_{ij} $$

$$ N = \sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij} = \sum_{i=1}^{m} r_i = \sum_{j=1}^{n} c_j $$

Page 74: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Correspondence matrix

Mass of column j:

$$ m_{c:j} = \frac{c_j}{N} $$

Mass of row i:

$$ m_{r:i} = \frac{r_i}{N} $$

Frequency value:

$$ m_{ij} = \frac{x_{ij}}{N} $$

              Variable 1     Variable 2     ...   Variable j     ...   Variable n     Row mass
Item 1        x11/N = m11    x12/N = m12    ...   x1j/N = m1j    ...   x1n/N = m1n    r1/N = mr:1
Item 2        x21/N = m21    x22/N = m22    ...   x2j/N = m2j    ...   x2n/N = m2n    r2/N = mr:2
...
Item i        xi1/N = mi1    xi2/N = mi2    ...   xij/N = mij    ...   xin/N = min    ri/N = mr:i
...
Item m        xm1/N = mm1    xm2/N = mm2    ...   xmj/N = mmj    ...   xmn/N = mmn    rm/N = mr:m
Column mass   c1/N = mc:1    c2/N = mc:2    ...   cj/N = mc:j    ...   cn/N = mc:n    1
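
As a minimal sketch of these definitions, assuming NumPy and a small made-up 3 x 4 contingency table (the counts are hypothetical, not data from the lecture), the margins, the correspondence matrix, and the masses can be computed as follows.

```python
import numpy as np

# Hypothetical contingency table: 3 items (rows) x 4 variables (columns).
X = np.array([[10, 20, 30, 40],
              [ 5, 15, 25,  5],
              [ 8,  2, 10, 30]], dtype=float)

r = X.sum(axis=1)        # row margins    r_i = sum_j x_ij
c = X.sum(axis=0)        # column margins c_j = sum_i x_ij
N = X.sum()              # grand total

M = X / N                # correspondence matrix m_ij = x_ij / N
m_r = r / N              # row masses    m_{r:i}
m_c = c / N              # column masses m_{c:j}

print(r, c, N)
print(m_r, m_c)
```

The entries of M, the row masses, and the column masses each sum to 1, as in the last cell of the table.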

Page 75: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Are the variables dependent on the experiments?

If the variable j is independent of the experiment i:

$$ m_{ij} = m_{r:i}\, m_{c:j} $$

The products $m_{r:i}\, m_{c:j}$ represent a baseline distribution for $m_{ij}$.

We need a distance between distributions: it would be nice to evaluate the probability that $m_{ij}$ is compatible with $m_{r:i}\, m_{c:j}$.

Page 76: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Testing distributions

Is a frequency distribution of certain events observed in a sample consistent with a particular theoretical distribution?

Null hypothesis: the distributions do not differ

A simple example is the hypothesis that an ordinary six-sided die is "fair", i.e., all six outcomes are equally likely to occur.

Page 77: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Pearson’s chi-square test

The observed values ($O_i$) and the expected values ($E_i$) are compared over all the n classes (bins):

$$ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} $$

This test variable approximately follows a chi-square distribution with n-1-s degrees of freedom, where s is the number of parameters of the theoretical distribution fitted to the data (e.g. 2 for a normal distribution, 1 for a binomial distribution).
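
For the fair-die hypothesis, the test can be sketched in plain Python. The observed counts below are invented for illustration. The threshold 11.07 is the tabulated critical value of the chi-square distribution with 5 degrees of freedom at significance level 0.05.

```python
# Hypothetical counts from 120 rolls of a six-sided die (faces 1..6).
observed = [25, 17, 15, 23, 24, 16]
n_rolls = sum(observed)
expected = [n_rolls / 6] * 6             # fair die: same expected count per face

# Pearson's chi-square statistic: sum over classes of (O - E)^2 / E.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# n - 1 - s degrees of freedom; s = 0 because no parameter is fitted here.
dof = len(observed) - 1 - 0

# Tabulated critical value for 5 degrees of freedom, alpha = 0.05.
CRITICAL_5DOF_005 = 11.07
reject_fair = chi2 > CRITICAL_5DOF_005
print(chi2, dof, reject_fair)
```

With these counts the statistic stays below the critical value, so the hypothesis of a fair die is not rejected at the 5% level.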

Page 78: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Chi-square distribution

Mean = k, Variance = 2k (k = number of degrees of freedom)

Page 79: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Critical values

Page 80: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Correspondence analysis

Correspondence analysis can be viewed as a weighted PCA in which Euclidean distances are replaced by chi-squared distances that are more appropriate for count data.

Chi-square distances capture also NON-linear dependencies and can be applied to counts

As with PCA, derived variables are created from the original variables but this time they maximise the correspondence between row and column counts.

Page 81: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Are the variables dependent on the experiments?

If the variable j is independent of the experiment i:

$$ m_{ij} = m_{r:i}\, m_{c:j} $$

A good measure of distance between the actual matrix and the matrix computed with the independence hypothesis is

$$ \chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\left( m_{ij} - m_{r:i}\, m_{c:j} \right)^2}{m_{r:i}\, m_{c:j}} $$

If the null hypothesis holds (independence), this quantity, multiplied by the total count N so that it is expressed on counts rather than on frequencies, approximately follows a chi-square distribution with (m-1)(n-1) degrees of freedom.
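
A minimal NumPy sketch of this measure, on a made-up 3 x 4 table (hypothetical counts): the sum is computed on the frequencies and then rescaled by N, and the result is cross-checked against the usual statistic computed directly on the counts.

```python
import numpy as np

# Hypothetical contingency table: 3 items (rows) x 4 variables (columns).
X = np.array([[10, 20, 30, 40],
              [ 5, 15, 25,  5],
              [ 8,  2, 10, 30]], dtype=float)
N = X.sum()
M = X / N                              # frequencies m_ij
m_r = M.sum(axis=1)                    # row masses
m_c = M.sum(axis=0)                    # column masses

E = np.outer(m_r, m_c)                 # independence baseline m_{r:i} m_{c:j}
inertia = ((M - E) ** 2 / E).sum()     # the sum above, on frequencies
chi2 = N * inertia                     # Pearson statistic on counts
dof = (X.shape[0] - 1) * (X.shape[1] - 1)

# Cross-check: same statistic from observed and expected counts.
E_counts = np.outer(X.sum(axis=1), X.sum(axis=0)) / N
chi2_counts = ((X - E_counts) ** 2 / E_counts).sum()
print(chi2, chi2_counts, dof)
```

The value of `chi2` would then be compared with the critical value of the chi-square distribution for `dof` degrees of freedom.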

Page 82: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Text character usage: is it distinctive of the authors?

Yelland PM,The Mathematica Journal

0.01 value-P

5.4482

The hypothesis of independence between character usage and authors can be rejected.

How can I recognize the works of different authors?

Page 83: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Normalizing rows: Row profiles

               Variable 1   Variable 2   ...   Variable j   ...   Variable n   Sum
Item 1         m11/mr:1     m12/mr:1     ...   m1j/mr:1     ...   m1n/mr:1     1
Item 2         m21/mr:2     m22/mr:2     ...   m2j/mr:2     ...   m2n/mr:2     1
...
Item i         mi1/mr:i     mi2/mr:i     ...   mij/mr:i     ...   min/mr:i     1
...
Item m         mm1/mr:m     mm2/mr:m     ...   mmj/mr:m     ...   mmn/mr:m     1
Row centroid   mc:1         mc:2         ...   mc:j         ...   mc:n         1

Mass of column j:

$$ m_{c:j} = \frac{c_j}{N} $$

Profile of row i:

$$ R_i = \left( \frac{m_{i1}}{m_{r:i}},\ \frac{m_{i2}}{m_{r:i}},\ \ldots,\ \frac{m_{ij}}{m_{r:i}},\ \ldots,\ \frac{m_{in}}{m_{r:i}} \right) $$

Page 84: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Distance between row profiles

$$ d^{r}_{ik} = \sum_{j=1}^{n} \frac{\left( m_{ij}/m_{r:i} - m_{kj}/m_{r:k} \right)^2}{m_{c:j}} $$

Measures the distance between the rows i and k, upon renormalization of each row

Considering the row centroid z:

$$ d^{r}_{iz} = \sum_{j=1}^{n} \frac{\left( m_{ij}/m_{r:i} - m_{c:j} \right)^2}{m_{c:j}} = \frac{1}{m_{r:i}} \sum_{j=1}^{n} \frac{\left( m_{ij} - m_{r:i}\, m_{c:j} \right)^2}{m_{r:i}\, m_{c:j}} $$
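
The two distances can be sketched with NumPy on a made-up table (hypothetical counts; the function names `row_distance` and `row_distance_to_centroid` are chosen here for illustration). The last lines check the second form of the centroid distance against the first.

```python
import numpy as np

# Hypothetical contingency table: 3 items (rows) x 4 variables (columns).
X = np.array([[10, 20, 30, 40],
              [ 5, 15, 25,  5],
              [ 8,  2, 10, 30]], dtype=float)
M = X / X.sum()
m_r = M.sum(axis=1)               # row masses
m_c = M.sum(axis=0)               # column masses (also the row centroid)

R = M / m_r[:, None]              # row profiles: each row sums to 1

def row_distance(i, k):
    """Weighted sum of squared profile differences between rows i and k."""
    return ((R[i] - R[k]) ** 2 / m_c).sum()

def row_distance_to_centroid(i):
    """Same weighted sum, between row profile i and the row centroid."""
    return ((R[i] - m_c) ** 2 / m_c).sum()

# Second form of the centroid distance, written on the frequencies m_ij.
i = 0
alt = ((M[i] - m_r[i] * m_c) ** 2 / (m_r[i] * m_c)).sum() / m_r[i]
print(row_distance(0, 1), row_distance_to_centroid(i), alt)
```

Dividing by the column masses gives rare columns more weight than a plain Euclidean distance between profiles would.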

Page 85: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Chi-square distance among authors

Yelland PM, The Mathematica Journal

Page 86: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Column profiles

Mass of row i:

$$ m_{r:i} = \frac{r_i}{N} $$

Profile of column j:

$$ C_j = \left( \frac{m_{1j}}{m_{c:j}},\ \frac{m_{2j}}{m_{c:j}},\ \ldots,\ \frac{m_{ij}}{m_{c:j}},\ \ldots,\ \frac{m_{mj}}{m_{c:j}} \right) $$

          Variable 1   Variable 2   ...   Variable j   ...   Variable n   Column centroid
Item 1    m11/mc:1     m12/mc:2     ...   m1j/mc:j     ...   m1n/mc:n     mr:1
Item 2    m21/mc:1     m22/mc:2     ...   m2j/mc:j     ...   m2n/mc:n     mr:2
...
Item i    mi1/mc:1     mi2/mc:2     ...   mij/mc:j     ...   min/mc:n     mr:i
...
Item m    mm1/mc:1     mm2/mc:2     ...   mmj/mc:j     ...   mmn/mc:n     mr:m
Sum       1            1            ...   1            ...   1            1

Page 87: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Distance between column profiles

$$ d^{c}_{jk} = \sum_{i=1}^{m} \frac{\left( m_{ij}/m_{c:j} - m_{ik}/m_{c:k} \right)^2}{m_{r:i}} $$

Measures the distance between the columns j and k, upon renormalization of each column

Considering the column centroid z:

$$ d^{c}_{jz} = \sum_{i=1}^{m} \frac{\left( m_{ij}/m_{c:j} - m_{r:i} \right)^2}{m_{r:i}} = \frac{1}{m_{c:j}} \sum_{i=1}^{m} \frac{\left( m_{ij} - m_{r:i}\, m_{c:j} \right)^2}{m_{r:i}\, m_{c:j}} $$

Page 88: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Nomenclature: reminder

$x_{ij}$: original count

$m_{ij} = \frac{x_{ij}}{N}$: correspondence values

$m_{r:i} = \frac{r_i}{N} = \sum_{j} m_{ij}$: mass of row i

$m_{c:j} = \frac{c_j}{N} = \sum_{i} m_{ij}$: mass of column j

Page 89: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Nomenclature: reminder

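Profile of row i:

$$ R_i = \left( \frac{m_{i1}}{m_{r:i}},\ \frac{m_{i2}}{m_{r:i}},\ \ldots,\ \frac{m_{ij}}{m_{r:i}},\ \ldots,\ \frac{m_{in}}{m_{r:i}} \right) $$

Profile of column j:

$$ C_j = \left( \frac{m_{1j}}{m_{c:j}},\ \frac{m_{2j}}{m_{c:j}},\ \ldots,\ \frac{m_{ij}}{m_{c:j}},\ \ldots,\ \frac{m_{mj}}{m_{c:j}} \right) $$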

Page 90: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Mass centroid: row

We can see the system as a set of m row masses ($m_{r:i}$) distributed in the n-dimensional space at the points $R_i$.

The centroid is given by the column masses:

$$ \sum_{i=1}^{m} m_{r:i}\, R_i = \sum_{i=1}^{m} m_{r:i} \left( \frac{m_{i1}}{m_{r:i}},\ \frac{m_{i2}}{m_{r:i}},\ \ldots,\ \frac{m_{ij}}{m_{r:i}},\ \ldots,\ \frac{m_{in}}{m_{r:i}} \right) = \left( m_{c:1},\ m_{c:2},\ \ldots,\ m_{c:j},\ \ldots,\ m_{c:n} \right) $$

Page 91: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Chi square as inertia with weighted distances

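$$
\begin{aligned}
\chi^2 &= \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\left( m_{ij} - m_{r:i}\, m_{c:j} \right)^2}{m_{r:i}\, m_{c:j}} \\
       &= \sum_{i=1}^{m} m_{r:i} \sum_{j=1}^{n} \frac{\left( m_{ij}/m_{r:i} - m_{c:j} \right)^2}{m_{c:j}} \\
       &= \sum_{i=1}^{m} m_{r:i} \sum_{j=1}^{n} \frac{\left( R_{ij} - m_{c:j} \right)^2}{m_{c:j}}
\end{aligned}
$$

where $R_{ij}$ is the j-th component of the row profile $R_i$.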
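
The identity can be checked numerically. The sketch below assumes NumPy and a made-up table (hypothetical counts); it also verifies that the mass-weighted mean of the row profiles is the vector of column masses.

```python
import numpy as np

# Hypothetical contingency table: 3 items (rows) x 4 variables (columns).
X = np.array([[10, 20, 30, 40],
              [ 5, 15, 25,  5],
              [ 8,  2, 10, 30]], dtype=float)
M = X / X.sum()
m_r = M.sum(axis=1)
m_c = M.sum(axis=0)
R = M / m_r[:, None]                                 # row profiles

# Direct form: sum of squared deviations from independence.
E = np.outer(m_r, m_c)
inertia_direct = ((M - E) ** 2 / E).sum()

# Inertia form: row masses times distances of the profiles from the centroid.
d_to_centroid = ((R - m_c) ** 2 / m_c).sum(axis=1)
inertia_weighted = (m_r * d_to_centroid).sum()

centroid = m_r @ R                                   # mass-weighted mean profile
print(inertia_direct, inertia_weighted)
```

This is the mechanical reading of the formula: heavier rows that sit farther from the centroid contribute more to the total.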

Page 92: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Standardized residuals

The chi-square residuals S measure the level of dependency of the individuals on the variables:

$$ S_{ij} = \frac{m_{ij} - m_{r:i}\, m_{c:j}}{\sqrt{m_{r:i}\, m_{c:j}}} $$

Is it possible to find a linear transform of the individuals and a linear transform of the variables so that

$$ \tilde{S}_{ij} = d_i\, \delta_{ij} \; ? $$

If yes, each new transformed individual would depend only on a single new transformed variable.

Page 93: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Correspondence analysis

Searching for a transformation that preserves the distances between rows or columns and that allows the representation in low-dimensional spaces with low loss of information.

Diagonalization of matrix S.

S is an m×n matrix

Page 94: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Singular value decomposition

The matrix S (m×n) can be decomposed as

$$ S = U \Lambda V^{T} $$

where U (m×m) and V (n×n) are base change matrices in the experiment (row) and variable (column) space, respectively,

$$ U^{T} U = 1, \qquad V^{T} V = 1 $$

and Λ is an m×n diagonal matrix
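
A hedged sketch of the decomposition with `numpy.linalg.svd`, applied to the matrix of standardized residuals of a made-up table (hypothetical counts). The residuals are taken here as the deviations from independence divided by the square root of the expected frequencies, which is an assumption about the definition of S. NumPy returns the singular values as a vector, so the rectangular Λ is rebuilt explicitly.

```python
import numpy as np

# Hypothetical contingency table: 3 items (rows) x 4 variables (columns).
X = np.array([[10, 20, 30, 40],
              [ 5, 15, 25,  5],
              [ 8,  2, 10, 30]], dtype=float)
M = X / X.sum()
m_r = M.sum(axis=1)
m_c = M.sum(axis=0)
E = np.outer(m_r, m_c)
S = (M - E) / np.sqrt(E)          # standardized residuals (assumed definition)

# Full SVD: U is m x m, Vt is V transposed (n x n), sing holds min(m, n) values.
U, sing, Vt = np.linalg.svd(S, full_matrices=True)

Lam = np.zeros(S.shape)           # m x n "diagonal" matrix
k = len(sing)
Lam[:k, :k] = np.diag(sing)       # singular values, sorted in decreasing order

S_rebuilt = U @ Lam @ Vt
print(sing)
```

The singular values come out already sorted, so no extra sorting step is needed with this routine.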

Page 95: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Singular value decomposition: solution 1

The decomposition can be reduced to the determination of eigenvectors and eigenvalues:

$$ S^{T} S = (U \Lambda V^{T})^{T} (U \Lambda V^{T}) = V \Lambda^{T} U^{T} U \Lambda V^{T} = V \Lambda^{2}_{n \times n} V^{T} $$

and $\Lambda^{2}_{n \times n}$ is an n×n diagonal matrix

NB: $S^{T} S$ is a sort of covariance matrix of the chi-square distances

Page 96: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Singular value decomposition: solution 1

The diagonalization of $S^{T} S$ gives n eigenvalues $\lambda^2$, whose square roots are the diagonal entries of Λ (the singular values). Sort them.

The n column eigenvectors, sorted following the eigenvalues, define the matrix V (base change in column space): the right singular vectors.

$$ S^{T} S = V \Lambda^{2}_{n \times n} V^{T} $$

Page 97: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Singular value decomposition: solution 2

The decomposition can be reduced to the determination of eigenvectors and eigenvalues:

$$ S S^{T} = (U \Lambda V^{T}) (U \Lambda V^{T})^{T} = U \Lambda V^{T} V \Lambda^{T} U^{T} = U \Lambda^{2}_{m \times m} U^{T} $$

and $\Lambda^{2}_{m \times m}$ is an m×m diagonal matrix

Page 98: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Singular value decomposition: solution 2

The diagonalization of $S S^{T}$ gives m eigenvalues $\lambda^2$, whose square roots are the diagonal entries of Λ (the singular values). Sort them.

The m column eigenvectors, sorted following the eigenvalues, define the matrix U (base change in row space): the left singular vectors.

$$ S S^{T} = U \Lambda^{2}_{m \times m} U^{T} $$

Page 99: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Are the two sets of eigenvalues equal?

If $\lambda_i^2$ is an eigenvalue of $S S^{T}$ with eigenvector $u_i$, then $\lambda_i^2$ is an eigenvalue of $S^{T} S$ with eigenvector $v_i = S^{T} u_i$:

$$
\begin{aligned}
S S^{T} u_i &= \lambda_i^2\, u_i \\
S^{T} S\, S^{T} u_i &= \lambda_i^2\, S^{T} u_i \\
S^{T} S\, v_i &= \lambda_i^2\, v_i
\end{aligned}
$$
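
The argument holds for any real m×n matrix, so it can be checked on a random one. This is a sketch assuming NumPy; the 3 x 5 shape and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(3, 5))                 # any real m x n matrix (m=3, n=5)

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eig_rows, U = np.linalg.eigh(S @ S.T)       # m eigenvalues of S S^T
eig_cols, V = np.linalg.eigh(S.T @ S)       # n eigenvalues of S^T S

top_rows = np.sort(eig_rows)[::-1]          # all m eigenvalues, descending
top_cols = np.sort(eig_cols)[::-1][:3]      # the m largest of S^T S

# Take one eigenpair of S S^T and map the eigenvector with S^T.
lam2 = eig_rows[-1]
u = U[:, -1]
v = S.T @ u                                 # candidate eigenvector of S^T S

# Squared singular values should coincide with the shared eigenvalues.
sing_sq = np.linalg.svd(S, compute_uv=False) ** 2
print(top_rows, top_cols, sing_sq)
```

The remaining n - m eigenvalues of the larger matrix are zero up to rounding, in line with the bound of min(m,n) non-null eigenvalues.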

Page 100: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Singular value decomposition

The number of non-null eigenvalues of the matrices $\Lambda^2$ is at most equal to min(m,n).

Λ is an m×n diagonal matrix containing the sorted square roots of the eigenvalues, λ (cases shown: n>m and n<m)

and U and V are the orthogonal base change matrices

Page 101: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Correspondence analysis

Searching for a transformation that preserves the distances between rows or columns and that allows the representation in low-dimensional spaces with low loss of information.

Diagonalization of matrix S.

$$ S_{ij} = \frac{m_{ij} - m_{r:i}\, m_{c:j}}{\sqrt{m_{r:i}\, m_{c:j}}} $$

$$ S = U \Lambda V^{T} $$

Page 102: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Row Scores

$$ R_{ik} = \frac{1}{\sqrt{m_{r:i}}} \sum_{l=1}^{m} U_{il}\, \Lambda_{lk} = \frac{1}{\sqrt{m_{r:i}}}\, U_{ik}\, \lambda_k $$

Represents the new component k of row i. It conserves the distance from the centroid.

Page 103: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Row components

The first 2 or 3 components are usually plotted:

$$ R_{i1} = \frac{1}{\sqrt{m_{r:i}}}\, U_{i1}\, \lambda_1 $$

$$ R_{i2} = \frac{1}{\sqrt{m_{r:i}}}\, U_{i2}\, \lambda_2 $$
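
A sketch of the row components with NumPy, on a made-up table (hypothetical counts), assuming the scores are the left singular vectors scaled by the singular values and divided by the square root of the row masses. Under that assumption, the squared norm of each row of scores reproduces the chi-square distance of the row profile from the centroid, and differences between rows reproduce the distances between profiles.

```python
import numpy as np

# Hypothetical contingency table: 3 items (rows) x 4 variables (columns).
X = np.array([[10, 20, 30, 40],
              [ 5, 15, 25,  5],
              [ 8,  2, 10, 30]], dtype=float)
M = X / X.sum()
m_r = M.sum(axis=1)
m_c = M.sum(axis=0)
E = np.outer(m_r, m_c)
S = (M - E) / np.sqrt(E)                          # standardized residuals

# Thin SVD: U is m x k and sing has k = min(m, n) values, in decreasing order.
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

# Row scores: U_ik * lambda_k / sqrt(m_{r:i}).
row_scores = (U * sing) / np.sqrt(m_r)[:, None]
coords_2d = row_scores[:, :2]                     # first two components, for a 2-D plot

# Chi-square distances computed directly from the row profiles.
R = M / m_r[:, None]
d_to_centroid = ((R - m_c) ** 2 / m_c).sum(axis=1)
d_01 = ((R[0] - R[1]) ** 2 / m_c).sum()
print(coords_2d)
```

Keeping only the first two columns discards the smallest singular values, which is where the loss of information in the 2-D plot comes from.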

Page 104: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

2-D plot of the texts

Yelland PM,The Mathematica Journal

Proximity in the plot means low chi-square distance

Page 105: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

This is the best approximation

As in the case of PCA, CA finds the low-dimensional space onto which the data can be projected with the best approximate representation. In the case of PCA, the Euclidean distance among points is best approximated. In the case of CA, it is the chi-square distance.

Page 106: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Column Scores

$$ C_{jk} = \frac{1}{\sqrt{m_{c:j}}} \sum_{l=1}^{n} V_{jl}\, \Lambda_{kl} = \frac{1}{\sqrt{m_{c:j}}}\, V_{jk}\, \lambda_k $$

Represents the new component k of column j. It conserves the distance from the centroid:

$$ d^{c}_{jz} = \frac{1}{m_{c:j}} \sum_{i=1}^{m} \frac{\left( m_{ij} - m_{r:i}\, m_{c:j} \right)^2}{m_{r:i}\, m_{c:j}} $$

Page 107: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Column components

The first 2 or 3 components are usually plotted:

$$ C_{j1} = \frac{1}{\sqrt{m_{c:j}}}\, V_{j1}\, \lambda_1 $$

$$ C_{j2} = \frac{1}{\sqrt{m_{c:j}}}\, V_{j2}\, \lambda_2 $$

Page 108: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Joint plot

Page 109: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Angles determine the type of dependency

Page 110: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

The cosine of the angles determines the strength of the dependence

Page 111: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Example: codon usage in genes

Contingency table

AAA AAC AAT AAG ...........

gene1

gene2

.......

Page 112: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Row (gene) representation

Genes that are close to each other have a small chi-square distance: they are coded with a similar codon distribution

High GC genes and low GC genes separate along the first axis

Page 113: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Column (codon) representation

Rapid divergence of codon usage patterns within the rice genome. Huai-Chun Wang and Donal A Hickey, BMC Evolutionary Biology 2007, 7(Suppl 1):

Page 114: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Column (codon) representation

The column plot shows the separation of C/G-ending codons and A/U-ending codons along the first axis. The separation of genes on the second axis appears to be largely due to frequency differences in C-ending and G-ending codons among the GC-rich genes.

Page 115: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Codon usage in prokaryotes

Use of Correspondence Analysis in Genome Exploration. Fredj Tekaia

Page 116: FINDING RELATIONS BETWEEN VARIABLES - unibo.itFINDING RELATIONS BETWEEN VARIABLES Pearson’sCorrelation Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University

Codon usage in Leishmania

Online Synonymous Codon Usage Analyses with the ade4 and seqinR packages. Charif, D., Thioulouse, J., Lobry, J.R., Perrière, G.