FINDING RELATIONS BETWEEN VARIABLES
Pearson’s Correlation
Pier Luigi Martelli - System and In Silico Biology. AA 2015-2016- University of Bologna
Relation between coupled variables
Which pairs of variables are related?
Correlated variables
Uncorrelated variables
Covariance and Pearson’s Correlation index
            Variable 1   Variable 2
Item 1      x1(1)        x2(1)
Item 2      x1(2)        x2(2)
...
Item i      x1(i)        x2(i)
...
Item m      x1(m)        x2(m)
Mean        M1           M2

Means:
\[ M_1 = \frac{1}{m} \sum_{i=1}^{m} x_1(i), \qquad M_2 = \frac{1}{m} \sum_{i=1}^{m} x_2(i) \]
Variances (j = 1, 2):
\[ \sigma_j^2 = \mathrm{cov}(x_j, x_j) = \frac{1}{m-1} \sum_{i=1}^{m} \bigl(x_j(i) - M_j\bigr)\bigl(x_j(i) - M_j\bigr) \]
Covariance:
\[ \mathrm{cov}(x_1, x_2) = \frac{1}{m-1} \sum_{i=1}^{m} \bigl(x_1(i) - M_1\bigr)\bigl(x_2(i) - M_2\bigr) \]
Pearson's correlation index:
\[ \mathrm{corr}(x_1, x_2) = \frac{1}{m-1} \sum_{i=1}^{m} \frac{\bigl(x_1(i) - M_1\bigr)\bigl(x_2(i) - M_2\bigr)}{\sigma_1 \sigma_2} \]
Correlation
\[ \mathrm{corr}(x_1, x_2) \in [-1, 1] \]
When is a correlation significant?
Given a correlation index:
\[ r(x_1, x_2) = \frac{1}{m-1} \sum_{i=1}^{m} \frac{\bigl(x_1(i) - M_1\bigr)\bigl(x_2(i) - M_2\bigr)}{\sigma_1 \sigma_2} \]
a test variable can be computed under the null hypothesis that r = 0:
\[ t = r \sqrt{\frac{n-2}{1-r^2}} \]
where n is the number of paired items (m in the notation above).
t follows a Student's t distribution with n-2 degrees of freedom. The test assumes normality of x.
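A minimal sketch of the correlation index and of the test variable, in plain Python. The function names and the toy data are illustrative assumptions, not part of the course material; the 1/(m-1) factors cancel between the covariance and the two standard deviations, so they do not appear in the code.

```python
import math

def pearson_r(x, y):
    """Pearson correlation index of two equal-length sequences."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """Test variable for the null hypothesis r = 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Toy data (hypothetical, chosen only to make the arithmetic easy to follow).
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
r = pearson_r(x, y)
t = t_statistic(r, len(x))
print(round(r, 3), round(t, 3))
```

The value of t is then compared with the critical value of Student's t distribution for n-2 degrees of freedom; note that t_statistic is undefined for |r| = 1.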
Graph showing the minimum value of Pearson's correlation coefficient that is significantly different from zero at the 0.05 level, for a given sample size.
Example: discovery of a misconduct
Repeatability test: two different experimentalists were asked to take the same solution and to perform 24 independent ELISA assays on a 6x4 plate.
They submitted to the assessor the following spectrophotometer readings, ordered by well.
Well   S1      S2
P1     0.481   0.496
P2     0.485   0.501
P3     0.479   0.495
P4     0.506   0.522
P5     0.467   0.480
P6     0.474   0.491
P7     0.469   0.480
P8     0.475   0.489
P9     0.514   0.520
P10    0.520   0.524
P11    0.526   0.531
P12    0.494   0.509
P13    0.535   0.540
P14    0.524   0.526
P15    0.481   0.492
P16    0.502   0.509
P17    0.479   0.484
P18    0.491   0.495
P19    0.503   0.515
P20    0.472   0.481
P21    0.481   0.486
P22    0.503   0.512
P23    0.448   0.454
P24    0.519   0.526

The assessor suspected that the experimenters had submitted two reads of the same plate.
How to prove it?
[Scatter plot of S2 against S1: R = 0.978]
R = 0.978, n = 24, t = 22.05
Objection: the test is valid only when the data are normally distributed, and we cannot prove that. Any other idea?
Use the data set to generate random experiments
Well   S1      Random(S2)
P1     0.481   0.495
P2     0.485   0.522
P3     0.479   0.480
P4     0.506   0.491
P5     0.467   0.480
P6     0.474   0.489
P7     0.469   0.520
P8     0.475   0.524
P9     0.514   0.531
P10    0.520   0.509
P11    0.526   0.540
P12    0.494   0.526
P13    0.535   0.492
P14    0.524   0.509
P15    0.481   0.484
P16    0.502   0.495
P17    0.479   0.515
P18    0.491   0.481
P19    0.503   0.486
P20    0.472   0.512
P21    0.481   0.454
P22    0.503   0.526
P23    0.448   0.496
P24    0.519   0.501

[Scatter plot of Random(S2) against S1: R = 0.25]
Building a distribution by random resampling
Iterate the process of shuffling and computation of r many times (say 1000)
Compute a cumulative histogram counting the resamplings scoring with correlation ≥ r
[Cumulative histogram: number of resamplings (out of 1000) with correlation ≥ r, plotted against r from 0 to 0.9]
The plot gives the probability (per thousand) of obtaining at least a given correlation with random pairings of the original data. The resulting P-value does not depend on assumptions about the data distribution.
This is only an example plot: compute the plot for the data available in the misconduct.xls file yourself.
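The resampling procedure can be sketched as follows, using the S1 and S2 readings listed in the example. The seed, the number of iterations and all names are illustrative assumptions; shuffling one column against the other is what the statistics literature usually calls a permutation test.

```python
import math
import random

s1 = [0.481, 0.485, 0.479, 0.506, 0.467, 0.474, 0.469, 0.475,
      0.514, 0.520, 0.526, 0.494, 0.535, 0.524, 0.481, 0.502,
      0.479, 0.491, 0.503, 0.472, 0.481, 0.503, 0.448, 0.519]
s2 = [0.496, 0.501, 0.495, 0.522, 0.480, 0.491, 0.480, 0.489,
      0.520, 0.524, 0.531, 0.509, 0.540, 0.526, 0.492, 0.509,
      0.484, 0.495, 0.515, 0.481, 0.486, 0.512, 0.454, 0.526]

def pearson_r(x, y):
    """Pearson correlation index of two equal-length sequences."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)      # fixed, arbitrary seed so that the run is reproducible
n_iter = 1000               # number of random pairings, as suggested in the text
r_obs = pearson_r(s1, s2)   # correlation of the submitted pairing

shuffled = list(s2)
count = 0                   # random pairings scoring at least the observed r
for _ in range(n_iter):
    rng.shuffle(shuffled)
    if pearson_r(s1, shuffled) >= r_obs:
        count += 1

print(f"observed r = {r_obs:.3f}; {count} of {n_iter} random pairings reach it")
```

When no random pairing reaches the observed correlation, the estimated P-value can only be bounded from above (here by 1/1000); more iterations tighten the bound.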
Bootstrapping
The term is often attributed to Rudolf Erich Raspe's story The Surprising Adventures of Baron Munchausen, where the main character pulls himself out of a swamp by his hair (specifically, his pigtail), but the Baron does not, in fact, pull himself out by his bootstraps
FINDING RELATIONS BETWEEN VARIABLES
Spearman’s Correlation
Pearson’s correlation index assumes linear dependence
R=0.816
Non parametric correlation: Spearman
Given a set of paired values (x_i, y_i), sort the two variables separately to obtain the ranks R(x_i) and R(y_i).
Spearman's correlation is Pearson's correlation of the ranked variables [R(x_i), R(y_i)].
Under the null hypothesis (r = 0), the test variable
\[ t = r \sqrt{\frac{n-2}{1-r^2}} \]
is distributed as a Student's t with n-2 degrees of freedom.
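A small sketch of the rank-based correlation in plain Python. Giving tied values the average of their ranks is a common convention assumed here, and the data are a hypothetical monotonic but non-linear example.

```python
import math

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def pearson_r(x, y):
    """Pearson correlation index of two equal-length sequences."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]          # monotonic, but not linear
print(pearson_r(x, y), spearman_rho(x, y))
```

On these data Pearson's index is below 1 because the dependence is not linear, while Spearman's index reaches 1 because the ordering is perfectly preserved.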
Categorical data: Matthews correlation index
                         Secreted   Non-secreted   Total
With signal peptide      a          b              a + b
Without signal peptide   c          d              c + d
Total                    a + c      b + d          n

\[ \mathrm{MCC} = \frac{ad - bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}} \]
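The index can be computed directly from the four cells of the table. The counts below are hypothetical, and returning 0 when one margin is empty is a convention assumed for this sketch.

```python
import math

def mcc(a, b, c, d):
    """Matthews correlation index for the 2x2 table [[a, b], [c, d]]."""
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        return 0.0   # assumed convention: undefined index reported as 0
    return (a * d - b * c) / denominator

# Hypothetical counts: 40 secreted proteins with signal peptide, 10 non-secreted
# with signal peptide, 5 secreted without, 45 non-secreted without.
print(round(mcc(40, 10, 5, 45), 3))
```

A perfectly diagonal table gives MCC = 1, and a table with equal counts in all four cells gives MCC = 0.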
EXTRACTING INFORMATION FROM HIGH-DIMENSIONAL DATA
Principal Component Analysis
High-dimensional descriptors
Many different descriptors can be adopted for characterizing a set of objects under investigation:
1) Given a set of proteins, we can measure the residue composition (20 values), the dipeptide composition (400 values), the length, the average hydrophobicity, ... of each sequence.
2) Given a set of individuals, we can measure dimensions, weight, blood concentration of metabolites, ...
How can we extract a minimal set of descriptors without losing information on the variability of the set?
Data Reduction
Summarizing the p-dimensional description of n objects by a smaller set of k derived (synthetic, composite) variables.
“Residual” variation is information in A that is not retained in X.
A good data reduction must balance clarity of representation and ease of understanding against oversimplification, that is, the loss of important or relevant information.
Most important descriptors
When plotting these two-dimensional data, it is evident that the x-direction accounts for the largest part of the variance.
Directions with the largest variance describe the data set better.
When the variables are correlated, the variances of the individual variables alone cannot identify the principal components.
Example: 2D
Markus Ringnér. What is principal component analysis? Nature Biotechnology 26, 303-304 (2008)
Suppose we measure the expression level of two genes (XBP1 and GATA3) in breast cancer samples that do or do not express the estrogen receptor (ER+ in red, ER- in black).
The total variance is decomposed into two orthogonal components, PCA1 and PCA2.
Analysis of the principal components can highlight important features in an unsupervised way.
2D Example of PCA
[Scatter plot of Variable X2 against Variable X1]
• Variables X1 and X2 have positive covariance and each has a similar variance.
\[ V_1 = 6.67, \qquad V_2 = 6.24, \qquad C_{1,2} = 3.42 \]
\[ \bar{X}_1 = 8.35, \qquad \bar{X}_2 = 4.91 \]
www.plantbiology.siu.edu/PLB444/PCA.ppt
Configuration is centered
[Scatter plot of the centered data, Variable X2 against Variable X1]
• Each variable is adjusted to a mean of zero (by subtracting the mean from each value).
www.plantbiology.siu.edu/PLB444/PCA.ppt
Configuration is rotated: new coordinates PC 1 and PC 2
[Scatter plot of the rotated data, PC 2 against PC 1]
• PC 1 and PC 2 have zero covariance.
• PC 1 has the highest possible variance (9.88).
• PC 2 has a variance of 3.03.
www.plantbiology.siu.edu/PLB444/PCA.ppt
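The variances quoted for PC 1 and PC 2 can be checked against the variances and covariance of X1 and X2 given in this example: they are the eigenvalues of the 2x2 covariance matrix. A minimal sketch, using the closed-form solution for a symmetric 2x2 matrix (the variable names are illustrative):

```python
import math

v1, v2, c12 = 6.67, 6.24, 3.42   # variances and covariance quoted in the example

# Eigenvalues of [[v1, c12], [c12, v2]] from the characteristic polynomial.
half_trace = (v1 + v2) / 2
radius = math.sqrt(((v1 - v2) / 2) ** 2 + c12 ** 2)
lam1 = half_trace + radius       # variance along PC 1
lam2 = half_trace - radius       # variance along PC 2

# Angle between PC 1 and the X1 axis, in degrees.
angle = math.degrees(0.5 * math.atan2(2 * c12, v1 - v2))

print(round(lam1, 2), round(lam2, 2), round(angle, 1))
```

The two eigenvalues round to 9.88 and 3.03, and their sum equals V1 + V2: the rotation redistributes the total variance without changing it.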
Covariance Matrix
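            Variable 1   Variable 2   ...   Variable j   ...   Variable n
Item 1      x1(1)        x2(1)        ...   xj(1)        ...   xn(1)
Item 2      x1(2)        x2(2)        ...   xj(2)        ...   xn(2)
...
Item i      x1(i)        x2(i)        ...   xj(i)        ...   xn(i)
...
Item m      x1(m)        x2(m)        ...   xj(m)        ...   xn(m)

\[
\mathrm{COV} =
\begin{pmatrix}
\sigma_1^2 & \mathrm{cov}(x_1, x_2) & \cdots & \mathrm{cov}(x_1, x_j) & \cdots & \mathrm{cov}(x_1, x_n) \\
\mathrm{cov}(x_2, x_1) & \sigma_2^2 & \cdots & \mathrm{cov}(x_2, x_j) & \cdots & \mathrm{cov}(x_2, x_n) \\
\vdots & \vdots & & \vdots & & \vdots \\
\mathrm{cov}(x_j, x_1) & \mathrm{cov}(x_j, x_2) & \cdots & \sigma_j^2 & \cdots & \mathrm{cov}(x_j, x_n) \\
\vdots & \vdots & & \vdots & & \vdots \\
\mathrm{cov}(x_n, x_1) & \mathrm{cov}(x_n, x_2) & \cdots & \mathrm{cov}(x_n, x_j) & \cdots & \sigma_n^2
\end{pmatrix}
\]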
Covariance and multidimensional normal distribution
M is an n-valued vector (the means).
COV is an n x n symmetric matrix (the covariance matrix), with determinant |COV|.
\[ p(x \mid M, \mathrm{COV}) = \frac{1}{(2\pi)^{n/2}\, |\mathrm{COV}|^{1/2}} \exp\left[ -\frac{1}{2} (x - M)^T\, \mathrm{COV}^{-1} (x - M) \right] \]
Uncorrelated variables
A set of variables is uncorrelated if all the covariances (but not the variances) are null:
\[ \mathrm{COV}_{ij} = \frac{1}{m-1} \sum_{k=1}^{m} \bigl(x_i(k) - M_i\bigr)\bigl(x_j(k) - M_j\bigr) = \sigma_i^2\, \delta_{ij} \]
that is, if the covariance matrix is in diagonal form.
Eigenvector equation
COV is a square symmetric matrix and can be reduced to diagonal form
\[ \widetilde{\mathrm{COV}}_{ij} = \lambda_i\, \delta_{ij} \]
by means of the eigenvector equation
\[ \mathrm{COV}\, u = \lambda u \]
It defines n real eigenvalues λi, the solutions of
\[ \det(\mathrm{COV} - \lambda I) = 0 \]
and n real, n-valued orthonormal eigenvectors ui.
Matrix transformation
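The normalized eigenvectors ui, arranged as columns, are orthogonal and define the n x n unitary matrix U:
\[ U^T U = I, \qquad U^T = U^{-1} \]
For n = 2, the i-th column of U holds the components of u_i:
\[ U = \begin{pmatrix} u_1^{(1)} & u_2^{(1)} \\ u_1^{(2)} & u_2^{(2)} \end{pmatrix} = \begin{pmatrix} U_{11} & U_{12} \\ U_{21} & U_{22} \end{pmatrix} \]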
Matrix transformation
U defines an orthogonal rotation and/or reflection of the coordinate axes (preserving norms and angles):
\[ \tilde{x} = U^T x \]
\[ x^T x = \tilde{x}^T U^T U\, \tilde{x} = \tilde{x}^T \tilde{x} \]
In two dimensions, a rotation by an angle θ is described by
\[ U = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \]
Try, for example, a counterclockwise rotation of 45° and transform the point (1,1).
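A possible worked solution of this exercise, assuming that the 45° counterclockwise rotation is applied to the coordinate axes, so that the new coordinates are obtained with the transposed matrix as in the transformation rule of this slide:

```latex
\[
U = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix},
\qquad
\tilde{x} = U^T x
= \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}
\begin{pmatrix} 1 \\ 1 \end{pmatrix}
= \begin{pmatrix} \sqrt{2} \\ 0 \end{pmatrix}
\]
```

The point lies on the first rotated axis and its norm, √2, is unchanged. If the rotation is instead applied to the point while the axes stay fixed, the result is U x = (0, √2).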
Matrix transformation
Given the matrix A, the eigenvalues define the diagonal matrix Λ and the eigenvectors define the unitary matrix U so that:
\[ \Lambda = U^T A\, U \]
Prove it with the matrix:
\[ A = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix} \]
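One way to carry out the proof for this matrix is sketched below; the ordering of the eigenvalues and the signs of the eigenvectors are arbitrary choices.

```latex
\[
\det(A - \lambda I) = (1 - \lambda)^2 - 4 = 0
\quad\Longrightarrow\quad
\lambda_1 = 3, \qquad \lambda_2 = -1
\]
\[
u_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix},
\qquad
u_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix},
\qquad
U = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\]
\[
U^T A\, U
= \frac{1}{2}
\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
= \begin{pmatrix} 3 & 0 \\ 0 & -1 \end{pmatrix}
= \Lambda
\]
```

Note that A has a negative eigenvalue, so it cannot be a covariance matrix: the eigenvalues of a covariance matrix are variances and are never negative.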
Matrix transformation
The diagonal matrix:
\[ \widetilde{\mathrm{COV}} = U^T\, \mathrm{COV}\, U \]
U then defines a coordinate rotation such that, in the new system, the variables are not correlated.
Matrix transformation
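Coordinate transformation
\[
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = U^T \begin{pmatrix} x_1 \\ x_2 \end{pmatrix},
\qquad
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = U \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
\]
[Figure: original axes x1, x2 and rotated axes y1, y2]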
Matrix transformation
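In particular, the old coordinates of the new axes, called LOADINGS, are the eigenvectors:
\[
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = U^T \begin{pmatrix} x_1 \\ x_2 \end{pmatrix},
\qquad
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = U \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
\]
\[
U \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} U_{11} \\ U_{21} \end{pmatrix},
\qquad
U \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} U_{12} \\ U_{22} \end{pmatrix}
\]
[Figure: new axes y1 and y2 drawn in the (x1, x2) plane, with components (U11, U21) and (U12, U22)]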
Principal Component Analysis
Given any item x, represented by an n-valued vector
\[ x^T = (x_1, x_2, \ldots, x_n) \]
it can be expressed with the n ordered principal components, y, by using the basis U:
\[ x = \sum_{i=1}^{n} y_i\, u_i \]
where
\[ y_i = u_i^T x = x^T u_i \]
Matrix transformation
The eigenvalues measure the variances along the new coordinate axes.
Usually the percentage of the total variance accounted for by each coordinate is reported.
Coordinates are sorted from the highest to the lowest eigenvalue.
Principal Component Analysis
Given a set of m items described with n variables:
1) Compute the mean of each variable
2) Subtract the mean from every measurement
3) Compute the covariance matrix
4) Diagonalize the covariance matrix
5) Sort the eigenvalues from the highest to the lowest: the corresponding eigenvectors define the 1st, 2nd, ..., j-th principal components
6) For each item i, the j-th component results from the scalar product of the mean-subtracted variable vector of item i with the j-th eigenvector
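The six steps above can be sketched with NumPy as follows. This is one possible implementation under stated assumptions: the function name and the toy data are hypothetical, `numpy.cov` uses the 1/(m-1) normalization, and `numpy.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca(data):
    """PCA of an (m items x n variables) array via the covariance matrix."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)        # steps 1-2: subtract the means
    cov = np.cov(centered, rowvar=False)       # step 3: n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # step 4: diagonalization
    order = np.argsort(eigvals)[::-1]          # step 5: highest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs                # step 6: scalar products
    return eigvals, eigvecs, scores

# Hypothetical toy data: 10 items described by 2 variables.
X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]
eigvals, loadings, scores = pca(X)
explained = 100 * eigvals / eigvals.sum()      # percentage of total variance
print(np.round(explained, 1))
```

The columns of `loadings` are the eigenvectors, and the covariance matrix of `scores` is diagonal with the eigenvalues on the diagonal, which is the decorrelation property described earlier.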
Why “principal”?
Suppose you are given an n-dimensional representation of m observed samples and you want to represent the data in a p-dimensional space, with p<n (e.g., for plotting them in a 2- or 3-dimensional space). Which is the best choice, when only linear transformations are allowed?
To use the first p principal components.
Why “principal”? Minimum-error formulation
We have to search for a complete orthonormal n-dimensional basis V, in which each point is written exactly as
\[ x_k = \sum_{i=1}^{n} a_{ki}\, v_i = \sum_{i=1}^{n} (x_k^T v_i)\, v_i \]
and we have to use p dimensions in that coordinate basis to approximate the points:
\[ \tilde{x}_k = \sum_{i=1}^{p} b_{ki}\, v_i + \sum_{i=p+1}^{n} c_i\, v_i \]
where the c_i are independent of k.
Why “principal”? Minimum-error formulation
The goal is to minimize the error
\[ E = \frac{1}{m} \sum_{k=1}^{m} \left\| x_k - \tilde{x}_k \right\|^2 \]
For any basis V, the error is minimized when:
\[ b_{ki} = a_{ki} = x_k^T v_i, \qquad c_i = \bar{x}^T v_i \]
where \bar{x} is the mean of the m points. Then:
\[ x_k - \tilde{x}_k = \sum_{i=p+1}^{n} \left[ (x_k - \bar{x})^T v_i \right] v_i \]
Why “principal”? Minimum-error formulation
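Substituting into the error gives
\[ E = \frac{1}{m} \sum_{k=1}^{m} \sum_{i=p+1}^{n} \left( x_k^T v_i - \bar{x}^T v_i \right)^2 = \sum_{i=p+1}^{n} v_i^T\, \mathrm{COV}\, v_i \]
(with COV normalized by 1/m; with the 1/(m-1) normalization the two sides differ only by the constant factor (m-1)/m, which does not affect the minimization).
The error is minimized by imposing the normalization of each v_i (via Lagrange multipliers):
\[ \min_{v_i} \sum_{i=p+1}^{n} \left[ v_i^T\, \mathrm{COV}\, v_i + \lambda_i \left( 1 - v_i^T v_i \right) \right] \]
Then
\[ \mathrm{COV}\, v_i = \lambda_i v_i, \qquad E = \sum_{i=p+1}^{n} \lambda_i \]
so the error is minimized by choosing the lowest eigenvalues for the n - p discarded directions.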
Covariance vs Correlation
• Using covariances among variables only makes sense if they are measured in the same units.
• Even then, variables with high variances will dominate the principal components.
• These problems are generally avoided by standardizing each variable to unit variance and zero mean.
Correlation PCA
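The correlation matrix can be chosen instead of the covariance matrix. It corresponds to a zero-mean, unit-variance representation of the data.
\[ \mathrm{corr}(x_1, x_2) = \frac{1}{m-1} \sum_{i=1}^{m} \frac{\bigl(x_1(i) - M_1\bigr)\bigl(x_2(i) - M_2\bigr)}{\sigma_1 \sigma_2} \]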
An ecological example
• data from research on habitat definition in the endangered Baw Baw frog (Philoria frosti)
• 16 environmental and structural variables measured at each of 124 sites
• correlation matrix used because variables have different units
www.plantbiology.siu.edu/PLB444/PCA.ppt
Eigenvalues
Axis   Eigenvalue   % of Variance   Cumulative % of Variance
1      5.855        36.60           36.60
2      3.420        21.38           57.97
3      1.122         7.01           64.98
4      1.116         6.97           71.95
5      0.982         6.14           78.09
6      0.725         4.53           82.62
7      0.563         3.52           86.14
8      0.529         3.31           89.45
9      0.476         2.98           92.42
10     0.375         2.35           94.77
www.plantbiology.siu.edu/PLB444/PCA.ppt
Baw Baw Frog - PCA of 16 Habitat Variables
[Scree plot: eigenvalue against PC axis number, axes 1 to 10]
www.plantbiology.siu.edu/PLB444/PCA.ppt
Interpreting Eigenvectors
• correlations between variables and the principal axes are known as loadings
• each element of the eigenvectors represents the contribution of a given variable to a component
Variable   Axis 1    Axis 2    Axis 3
Altitude    0.3842    0.0659   -0.1177
pH         -0.1159    0.1696   -0.5578
Cond       -0.2729   -0.1200    0.3636
TempSurf    0.0538   -0.2800    0.2621
Relief     -0.0765    0.3855   -0.1462
maxERht     0.0248    0.4879    0.2426
avERht      0.0599    0.4568    0.2497
%ER         0.0789    0.4223    0.2278
%VEG        0.3305   -0.2087   -0.0276
%LIT       -0.3053    0.1226    0.1145
%LOG       -0.3144    0.0402   -0.1067
%W         -0.0886   -0.0654   -0.1171
H1Moss      0.1364   -0.1262    0.4761
DistSWH    -0.3787    0.0101    0.0042
DistSW     -0.3494   -0.1283    0.1166
DistMF      0.3899    0.0586   -0.0175
www.plantbiology.siu.edu/PLB444/PCA.ppt
How many axes are needed?
• does the (k+1)th principal axis represent more variance than would be expected by chance?
• several tests and rules have been proposed
• a common “rule of thumb” when PCA is based on correlations is that axes with eigenvalues > 1 are worth interpreting
www.plantbiology.siu.edu/PLB444/PCA.ppt
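The rule of thumb can be applied to the eigenvalues listed for the Baw Baw frog example. The sketch below relies on the fact that the eigenvalues of a correlation matrix sum to the number of variables (16 here); the variable names are illustrative.

```python
# First 10 eigenvalues of the correlation-based PCA, as listed in the table.
eigenvalues = [5.855, 3.420, 1.122, 1.116, 0.982,
               0.725, 0.563, 0.529, 0.476, 0.375]
n_variables = 16   # total variance of 16 standardized variables

percent = [100 * ev / n_variables for ev in eigenvalues]

# Rule of thumb: keep the axes whose eigenvalue exceeds 1.
kept = [axis for axis, ev in enumerate(eigenvalues, start=1) if ev > 1]
cumulative = sum(percent[:len(kept)])

print(kept, round(cumulative, 2))
```

With these values the rule keeps the first four axes, which together account for about 72% of the total variance. Small differences from the tabulated percentages come from the rounding of the eigenvalues.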
Example: Thermostability
The problem is to investigate the differences in
• residue composition
• codon composition
• codon usage
between thermophilic and mesophilic prokaryotes.
Montanucci, Martelli, Fariselli, Casadio. J Proteome Res, 2007
The data set contains 116 fully sequenced genomes from prokaryotes
16 thermophilic species (11 archaea and 5 bacteria), with an OGT higher than 60 °C
100 mesophilic species (95 bacteria and 5 archaea) with an OGT lower than 45 °C.
7 quasi-mesophilic species (7 bacteria) with OGT between 45 and 60 °C
PCA on residue composition
[Figure legend. Red: thermophilic; green: intermediates; blue: mesophilic]
PCA on codon usage
[Figure legend. Red: thermophilic; green: intermediates; blue: mesophilic]
PCA on codon composition
[Figure legend. Red: thermophilic; green: intermediates; blue: mesophilic]
Components of 1st and 2nd PC (codon composition)
First two components expressed as a function of the codons
Relation between 2nd component and OGT
Pitfalls of PCA
A method such as PCA assumes that there are linear relationships between the derived components and the original variables. This is apparent from the role of the covariance or correlation matrix. If the relationships are non-linear, a Pearson correlation coefficient, which measures the strength of linear relationships, would underestimate the strength of the non-linear relationship.
EXTRACTING INFORMATION FROM HIGH-DIMENSIONAL DATA
Correspondence analysis
Pitfalls of PCA
PCA can be applied in a coherent way only when continuous variables are taken into account. Data are, however, often available as counts on categorical variables, that is, as a contingency table.
Contingency tables
            Variable 1   Variable 2   ...   Variable j   ...   Variable n
Item 1      x11          x12          ...   x1j          ...   x1n
Item 2      x21          x22          ...   x2j          ...   x2n
...
Item i      xi1          xi2          ...   xij          ...   xin
...
Item m      xm1          xm2          ...   xmj          ...   xmn

Examples:
Experiment = genome, Variable = codon count
Experiment = text, Variable = number of occurrences of each character in the text
Text character usage: is it distinctive of the authors?
Yelland PM, The Mathematica Journal
Margins
               Variable 1   Variable 2   ...   Variable j   ...   Variable n   Row total
Item 1         x11          x12          ...   x1j          ...   x1n          r1
Item 2         x21          x22          ...   x2j          ...   x2n          r2
...
Item i         xi1          xi2          ...   xij          ...   xin          ri
...
Item m         xm1          xm2          ...   xmj          ...   xmn          rm
Column total   c1           c2           ...   cj           ...   cn           N

\[ r_i = \sum_{j=1}^{n} x_{ij}, \qquad c_j = \sum_{i=1}^{m} x_{ij} \]
\[ N = \sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij} = \sum_{j=1}^{n} c_j = \sum_{i=1}^{m} r_i \]
Correspondence matrix
Frequency value:
\[ m_{ij} = \frac{x_{ij}}{N} \]
Mass of row i:
\[ m_{r:i} = \frac{r_i}{N} \]
Mass of column j:
\[ m_{c:j} = \frac{c_j}{N} \]

              Variable 1   Variable 2   ...   Variable j   ...   Variable n   Row mass
Item 1        x11/N=m11    x12/N=m12    ...   x1j/N=m1j    ...   x1n/N=m1n    r1/N=mr:1
Item 2        x21/N=m21    x22/N=m22    ...   x2j/N=m2j    ...   x2n/N=m2n    r2/N=mr:2
...
Item i        xi1/N=mi1    xi2/N=mi2    ...   xij/N=mij    ...   xin/N=min    ri/N=mr:i
...
Item m        xm1/N=mm1    xm2/N=mm2    ...   xmj/N=mmj    ...   xmn/N=mmn    rm/N=mr:m
Column mass   c1/N=mc:1    c2/N=mc:2    ...   cj/N=mc:j    ...   cn/N=mc:n    1
Are the variables dependent on the experiments?
If variable j is independent of experiment i:
\[ m_{ij} = m_{r:i}\, m_{c:j} \]
The products m_{r:i} m_{c:j} represent a baseline distribution for m_{ij}.
We need a distance between distributions: it would be useful to evaluate the probability that m_{ij} is compatible with m_{r:i} m_{c:j}.
Testing distributions
Is a frequency distribution of certain events observed in a sample consistent with a particular theoretical distribution?
Null hypothesis: the distributions do not differ
A simple example is the hypothesis that an ordinary six-sided die is "fair", i.e., all six outcomes are equally likely to occur.
Pearson’s chi-square test
The observed values (Oi) and the expected values (Ei) are compared over all the n classes (bins):
\[ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \]
This test variable approximately follows a chi-square distribution. The degrees of freedom are n-1-s, where s is the number of parameters describing the theoretical distribution (e.g. 2 for a normal distribution, 1 for a binomial distribution).
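For the fair-die example, the test variable can be computed as in the sketch below. The observed counts are hypothetical; 11.070 is the tabulated critical value of the chi-square distribution for 5 degrees of freedom at the 0.05 level.

```python
# Hypothetical outcome of 60 rolls of a six-sided die (counts for faces 1 to 6).
observed = [5, 8, 9, 8, 10, 20]
n_rolls = sum(observed)
expected = [n_rolls / 6] * 6        # fair die: all faces equally likely

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 1 - 0         # n - 1 - s, with no fitted parameter (s = 0)
critical_5_percent = 11.070         # tabulated value for 5 degrees of freedom

print(chi2, dof, chi2 > critical_5_percent)
```

Here the test variable is about 13.4, above the critical value, so with these counts the hypothesis of a fair die would be rejected at the 0.05 level.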
Chi-square distribution
For k degrees of freedom: Mean = k, Variance = 2k
Critical values
Correspondence analysis
Correspondence analysis can be viewed as a weighted PCA in which Euclidean distances are replaced by chi-squared distances that are more appropriate for count data.
Chi-square distances also capture non-linear dependencies and can be applied to counts.
As with PCA, derived variables are created from the original variables, but this time they maximise the correspondence between row and column counts.
Are the variables dependent on the experiments?
If variable j is independent of experiment i:
\[ m_{ij} = m_{r:i}\, m_{c:j} \]
A good measure of the distance between the actual matrix and the matrix computed under the independence hypothesis is
\[ \chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\left( m_{ij} - m_{r:i}\, m_{c:j} \right)^2}{m_{r:i}\, m_{c:j}} \]
Because the m_{ij} are relative frequencies, this sum equals the usual chi-square statistic computed on the counts, divided by N. If the null hypothesis (independence) holds, N times this sum follows a chi-square distribution with (m-1)(n-1) degrees of freedom.
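A short sketch of this computation on a table of counts, in plain Python. The function name and the 2x2 table are hypothetical, and every row and column is assumed to contain at least one count.

```python
def independence_chi2(table):
    """Return (sum over frequencies, chi-square statistic on counts, DoF)."""
    n_total = sum(sum(row) for row in table)
    row_mass = [sum(row) / n_total for row in table]
    col_mass = [sum(col) / n_total for col in zip(*table)]
    freq_sum = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_mass[i] * col_mass[j]   # independence baseline
            freq_sum += (count / n_total - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return freq_sum, n_total * freq_sum, dof

counts = [[10, 20],
          [20, 10]]          # hypothetical 2 x 2 table of counts
freq_sum, chi2, dof = independence_chi2(counts)
print(round(freq_sum, 4), round(chi2, 4), dof)
```

The first returned value is the sum written in terms of frequencies; the second is the statistic to compare with the critical values of the chi-square distribution.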
Text character usage: is it distinctive of the authors?
Yelland PM, The Mathematica Journal
0.01 value-P
5.4482
The hypothesis of independence between character usage and authors can be rejected.
How can I recognize the works of different authors?
Normalizing rows: Row profiles
               Variable 1   Variable 2   ...   Variable j   ...   Variable n   Sum
Item 1         m11/mr:1     m12/mr:1     ...   m1j/mr:1     ...   m1n/mr:1     1
Item 2         m21/mr:2     m22/mr:2     ...   m2j/mr:2     ...   m2n/mr:2     1
...
Item i         mi1/mr:i     mi2/mr:i     ...   mij/mr:i     ...   min/mr:i     1
...
Item m         mm1/mr:m     mm2/mr:m     ...   mmj/mr:m     ...   mmn/mr:m     1
Row centroid   mc:1         mc:2         ...   mc:j         ...   mc:n         1

Mass of column j:
\[ m_{c:j} = \frac{c_j}{N} \]
Profile of row i:
\[ R^i = \begin{pmatrix} m_{i1}/m_{r:i} \\ m_{i2}/m_{r:i} \\ \vdots \\ m_{ij}/m_{r:i} \\ \vdots \\ m_{in}/m_{r:i} \end{pmatrix} \]
Distance between row profiles
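\[ d^{r}_{ik} = \sum_{j=1}^{n} \frac{\left( m_{ij}/m_{r:i} - m_{kj}/m_{r:k} \right)^2}{m_{c:j}} \]
measures the distance between the rows i and k, upon renormalization of each row.
Considering the row centroid z:
\[ d^{r}_{iz} = \sum_{j=1}^{n} \frac{\left( m_{ij}/m_{r:i} - m_{c:j} \right)^2}{m_{c:j}} = \frac{1}{m_{r:i}} \sum_{j=1}^{n} \frac{\left( m_{ij} - m_{r:i}\, m_{c:j} \right)^2}{m_{r:i}\, m_{c:j}} \]
Here d denotes the sum of weighted squared differences, that is, the squared chi-square distance.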
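The distance between two row profiles can be computed from the counts as in this sketch. The function name and the table are hypothetical; the returned quantity is the sum of weighted squared differences between the two renormalized rows.

```python
def row_profile_distance(table, i, k):
    """Chi-square (squared) distance between the profiles of rows i and k."""
    n_total = sum(sum(row) for row in table)
    col_mass = [sum(col) / n_total for col in zip(*table)]
    profile_i = [count / sum(table[i]) for count in table[i]]
    profile_k = [count / sum(table[k]) for count in table[k]]
    return sum((a - b) ** 2 / mass
               for a, b, mass in zip(profile_i, profile_k, col_mass))

counts = [[10, 20],
          [20, 10],
          [5, 10]]           # hypothetical counts; row 2 is proportional to row 0
d_01 = row_profile_distance(counts, 0, 1)
d_02 = row_profile_distance(counts, 0, 2)
print(round(d_01, 4), d_02)
```

Rows 0 and 2 have different totals but the same profile, so their distance is zero: after renormalization only the shape of the distribution matters.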
Chi square distance among authors
Yelland PM, The Mathematica Journal
Column profiles
Mass of row i:
\[ m_{r:i} = \frac{r_i}{N} \]
Profile of column j:
\[ C^j = \begin{pmatrix} m_{1j}/m_{c:j} \\ m_{2j}/m_{c:j} \\ \vdots \\ m_{ij}/m_{c:j} \\ \vdots \\ m_{mj}/m_{c:j} \end{pmatrix} \]

          Variable 1   Variable 2   ...   Variable j   ...   Variable n   Column centroid
Item 1    m11/mc:1     m12/mc:2     ...   m1j/mc:j     ...   m1n/mc:n     mr:1
Item 2    m21/mc:1     m22/mc:2     ...   m2j/mc:j     ...   m2n/mc:n     mr:2
...
Item i    mi1/mc:1     mi2/mc:2     ...   mij/mc:j     ...   min/mc:n     mr:i
...
Item m    mm1/mc:1     mm2/mc:2     ...   mmj/mc:j     ...   mmn/mc:n     mr:m
Sum       1            1            ...   1            ...   1            1
Distance between column profiles
\[ d^{c}_{jk} = \sum_{i=1}^{m} \frac{\left( m_{ij}/m_{c:j} - m_{ik}/m_{c:k} \right)^2}{m_{r:i}} \]
measures the distance between the columns j and k, upon renormalization of each column.
Considering the column centroid z:
\[ d^{c}_{jz} = \sum_{i=1}^{m} \frac{\left( m_{ij}/m_{c:j} - m_{r:i} \right)^2}{m_{r:i}} = \frac{1}{m_{c:j}} \sum_{i=1}^{m} \frac{\left( m_{ij} - m_{c:j}\, m_{r:i} \right)^2}{m_{c:j}\, m_{r:i}} \]
Nomenclature: reminder
Original count:
\[ x_{ij} \]
Correspondence values:
\[ m_{ij} = \frac{x_{ij}}{N} \]
Mass of row i:
\[ m_{r:i} = \frac{r_i}{N} = \sum_{j} m_{ij} \]
Mass of column j:
\[ m_{c:j} = \frac{c_j}{N} = \sum_{i} m_{ij} \]
Profile of row i:
\[ R^i = \begin{pmatrix} m_{i1}/m_{r:i} \\ m_{i2}/m_{r:i} \\ \vdots \\ m_{ij}/m_{r:i} \\ \vdots \\ m_{in}/m_{r:i} \end{pmatrix} \]
Profile of column j:
\[ C^j = \begin{pmatrix} m_{1j}/m_{c:j} \\ m_{2j}/m_{c:j} \\ \vdots \\ m_{ij}/m_{c:j} \\ \vdots \\ m_{mj}/m_{c:j} \end{pmatrix} \]
Mass centroid:row
We can see the system as a set of m (row) masses m_{r:i} placed in the n-dimensional space at the points R^i.
The centroid is given by the column masses:
\[
\sum_{i=1}^{m} m_{r:i}\, R^i
= \sum_{i=1}^{m} m_{r:i} \begin{pmatrix} m_{i1}/m_{r:i} \\ m_{i2}/m_{r:i} \\ \vdots \\ m_{ij}/m_{r:i} \\ \vdots \\ m_{in}/m_{r:i} \end{pmatrix}
= \begin{pmatrix} m_{c:1} \\ m_{c:2} \\ \vdots \\ m_{c:j} \\ \vdots \\ m_{c:n} \end{pmatrix}
\]
Chi square as inertia with weighted distances
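\[
\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\left( m_{ij} - m_{r:i}\, m_{c:j} \right)^2}{m_{r:i}\, m_{c:j}}
= \sum_{i=1}^{m} m_{r:i} \sum_{j=1}^{n} \frac{\left( m_{ij}/m_{r:i} - m_{c:j} \right)^2}{m_{c:j}}
= \sum_{i=1}^{m} m_{r:i} \sum_{j=1}^{n} \frac{\left( R^i_j - m_{c:j} \right)^2}{m_{c:j}}
\]
Each inner sum is the chi-square distance of row profile i from the centroid, so the chi-square is the inertia of the row masses around the centroid.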
Standardized residuals
The chi-square residuals S measure the level of dependency of the individuals on the variables:
\[ S_{ij} = \frac{m_{ij} - m_{r:i}\, m_{c:j}}{\sqrt{m_{r:i}\, m_{c:j}}} \]
Is it possible to find a linear transform of the individuals and a linear transform of the variables so that
\[ \tilde{S}_{ij} = d_i\, \delta_{ij} \; ? \]
If so, each new transformed individual would depend only on a single new transformed variable.
Correspondence analysis
Searching for a transformation that preserves the distances between rows or columns and that allows the representation in low-dimensional spaces with a low loss of information.
Diagonalization of the matrix S.
S is an m x n matrix.
Singular value decomposition
The matrix S (m x n) can be decomposed as
\[ S = U \Lambda V^T \]
where U (m x m) and V (n x n) are change-of-basis matrices in the experiment (row) and variable (column) space, respectively,
\[ U^T U = 1, \qquad V^T V = 1 \]
and Λ is an m x n diagonal matrix.
Singular value decomposition: solution 1
The decomposition can be reduced to the determination of eigenvectors and eigenvalues:
\[ S^T S = (U \Lambda V^T)^T (U \Lambda V^T) = V \Lambda^T U^T U \Lambda V^T = V \Lambda^2_{n \times n} V^T \]
where Λ² is an n x n diagonal matrix.
NB: S^T S is a sort of covariance matrix of the chi-square distances.
The diagonalization of S^T S gives n eigenvalues λ², whose square roots are the diagonal elements of Λ; sort them.
The n column eigenvectors, sorted following the eigenvalues, define the matrix V (change of basis in column space): the right singular vectors.
Singular value decomposition: solution 2
The decomposition can also be reduced to the determination of the eigenvectors and eigenvalues of S S^T:
\[ S S^T = (U \Lambda V^T)(U \Lambda V^T)^T = U \Lambda V^T V \Lambda^T U^T = U \Lambda^2_{m \times m} U^T \]
where Λ² is an m x m diagonal matrix.
The diagonalization of S S^T gives m eigenvalues λ², whose square roots are the diagonal elements of Λ; sort them.
The m column eigenvectors, sorted following the eigenvalues, define the matrix U (change of basis in row space): the left singular vectors.
Are the two sets of eigenvalues equal?
If λ²_i is an eigenvalue of S S^T with eigenvector u_i, then λ²_i is an eigenvalue of S^T S with eigenvector v_i = S^T u_i:
\[ S S^T u_i = \lambda_i^2\, u_i \]
\[ S^T S\, S^T u_i = \lambda_i^2\, S^T u_i \]
\[ S^T S\, v_i = \lambda_i^2\, v_i \]
Singular value decomposition
The number of non-null eigenvalues of the matrices Λ² is at most equal to min(m, n).
Λ is an m x n diagonal matrix containing the sorted square roots of the eigenvalues λ².
[Figure: shape of Λ for n > m and for n < m]
U and V are the orthogonal change-of-basis matrices.
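The relation between the decomposition and the two eigenvalue problems can be checked numerically. The sketch below uses an arbitrary random matrix as a stand-in for S (an assumption, not course data) and NumPy's `numpy.linalg.svd` and `numpy.linalg.eigvalsh`.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility
S = rng.normal(size=(4, 3))             # stand-in m x n matrix with m = 4, n = 3

# Full decomposition: U is m x m, Vt is V^T (n x n), sing holds min(m, n) values.
U, sing, Vt = np.linalg.svd(S, full_matrices=True)
Lam = np.zeros(S.shape)                 # m x n diagonal matrix
Lam[:len(sing), :len(sing)] = np.diag(sing)

# Eigenvalues of the two symmetric products, sorted from highest to lowest.
eig_StS = np.sort(np.linalg.eigvalsh(S.T @ S))[::-1]   # n eigenvalues
eig_SSt = np.sort(np.linalg.eigvalsh(S @ S.T))[::-1]   # m eigenvalues

print(np.round(sing ** 2, 4))
print(np.round(eig_StS, 4))
print(np.round(eig_SSt, 4))
```

The squared singular values coincide with the eigenvalues of S^T S and with the non-null eigenvalues of S S^T; the extra eigenvalue of the larger product is zero up to rounding.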
Correspondence analysis
Diagonalization of the matrix S:
\[ S_{ij} = \frac{m_{ij} - m_{r:i}\, m_{c:j}}{\sqrt{m_{r:i}\, m_{c:j}}}, \qquad S = U \Lambda V^T \]
Row Scores
\[ \tilde{R}_{ik} = \frac{1}{\sqrt{m_{r:i}}} \sum_{l=1}^{m} U_{il}\, \Lambda_{lk} = \frac{1}{\sqrt{m_{r:i}}}\, U_{ik}\, \lambda_k \]
(with Λ_{lk} = λ_k δ_{lk}) represents the new component k of row i. It conserves the distance from the centroid.
Row components
The first 2 or 3 components are usually plotted:
\[ \tilde{R}_{i1} = \frac{1}{\sqrt{m_{r:i}}}\, U_{i1}\, \lambda_1, \qquad \tilde{R}_{i2} = \frac{1}{\sqrt{m_{r:i}}}\, U_{i2}\, \lambda_2 \]
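Putting the pieces together, a correspondence analysis of a small table of counts might look like the sketch below. The function name and the counts are hypothetical, and the scores follow the usual convention of dividing the singular vectors, scaled by the singular values, by the square root of the masses.

```python
import numpy as np

def correspondence_analysis(counts):
    """Row and column scores of a table of counts via the SVD of the residuals."""
    x = np.asarray(counts, dtype=float)
    m = x / x.sum()                                   # correspondence matrix
    r = m.sum(axis=1)                                 # row masses
    c = m.sum(axis=0)                                 # column masses
    expected = np.outer(r, c)                         # independence baseline
    s = (m - expected) / np.sqrt(expected)            # standardized residuals
    u, lam, vt = np.linalg.svd(s, full_matrices=False)
    row_scores = (u * lam) / np.sqrt(r)[:, None]
    col_scores = (vt.T * lam) / np.sqrt(c)[:, None]
    return lam, row_scores, col_scores, r, c

# Hypothetical counts: 4 items (rows) described by 3 categories (columns).
counts = [[20, 5, 3], [6, 18, 4], [4, 7, 22], [10, 10, 10]]
lam, rows, cols, r, c = correspondence_analysis(counts)
inertia = (lam ** 2).sum()                            # sum over frequencies
print(np.round(lam, 4))
print(np.round(rows[:, :2], 3))                       # first two row components
```

The sum of the squared singular values reproduces the chi-square sum computed on the frequencies, and the squared norm of each row of scores equals the chi-square distance of that row profile from the centroid. The last singular value is zero up to rounding, because the residuals are centered.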
2-D plot of the texts
Yelland PM, The Mathematica Journal
Proximity in the plot means low chi-square distance
This is the best approximation
As in the case of PCA, CA finds the low-dimensional space for projecting the data that allows the best approximate representation of the data. In the case of PCA, it is the Euclidean distance among points that is best approximated. In the case of CA, it is the chi-square distance.
Column Scores
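\[ \tilde{C}_{jk} = \frac{1}{\sqrt{m_{c:j}}} \sum_{l=1}^{n} V_{jl}\, \Lambda_{lk} = \frac{1}{\sqrt{m_{c:j}}}\, V_{jk}\, \lambda_k \]
(with Λ_{lk} = λ_k δ_{lk}) represents the new component k of column j. It conserves the distance from the centroid:
\[ d^{c}_{jz} = \frac{1}{m_{c:j}} \sum_{i=1}^{m} \frac{\left( m_{ij} - m_{r:i}\, m_{c:j} \right)^2}{m_{r:i}\, m_{c:j}} \]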
Column components
The first 2 or 3 components are usually plotted:
\[ \tilde{C}_{j1} = \frac{1}{\sqrt{m_{c:j}}}\, V_{j1}\, \lambda_1, \qquad \tilde{C}_{j2} = \frac{1}{\sqrt{m_{c:j}}}\, V_{j2}\, \lambda_2 \]
Joint plot
Angles determine the type of dependency
The cosine of the angles determines the strength of the dependence
Example: codon usage in genes
Contingency table
          AAA   AAC   AAT   AAG   ...
gene1
gene2
...
Row (gene) representation
Genes that are close to each other have a small chi-square distance: they are encoded with a similar codon distribution.
High GC genes and low GC genes separate along the first axis.
Column (codon) representation
Huai-Chun Wang and Donal A Hickey. Rapid divergence of codon usage patterns within the rice genome. BMC Evolutionary Biology 2007, 7(Suppl 1):
The column plot shows the separation of C/G-ending codons and A/U-ending codons along the first axis. The separation of genes on the second axis appears to be largely due to frequency differences in C-ending and G-ending codons among the GC-rich genes.
Codon usage in prokaryotes
Fredj Tekaia. Use of Correspondence Analysis in Genome Exploration
Codon usage in Leishmania
Charif, D., Thioulouse, J., Lobry, J.R., Perrière, G. Online Synonymous Codon Usage Analyses with the ade4 and seqinR packages