Correspondence Analysis Ahmed Rebai Center of Biotechnology of Sfax.
Correspondence Analysis
Ahmed Rebai
Center of Biotechnology of Sfax
Correspondence analysis
Introduced by Benzécri (1973) for uncovering and understanding the structure and patterns of data in contingency tables.
Involves finding coordinate values that represent the row and column categories in some optimal way.
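Contingency tables
Table with r rows and c columns

X2 \ X1 |  1     ...   j     ...   c    | Total
1       |  N11   ...   N1j   ...   N1c  | N1.
2       |  N21   ...   ...   ...   ...  | ...
...     |  ...   ...   ...   ...   ...  | ...
i       |  ...   ...   Nij   ...   ...  | ...
...     |  ...   ...   ...   ...   ...  | ...
r       |  Nr1   ...   ...   ...   Nrc  | Nr.
Total   |  N.1   ...   N.j   ...   N.c  | N..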
Main idea
Develop simple indices that show us the relation between rows and columns.
The indices tell us simultaneously which columns have more weight in a row category, and vice versa.
Reduce dimensionality as in PCA: the indices are extracted in decreasing order of importance.
Which criterion?
In a contingency table, global independence between the two variables is generally measured by a chi-square statistic ($\chi^2$), calculated as:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(N_{ij} - E_{ij})^2}{E_{ij}}$$
where $E_{ij}$ are the expected counts under independence:
$$E_{ij} = \frac{N_{i.}\, N_{.j}}{N_{..}}$$
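As a minimal sketch of the chi-square criterion, the following NumPy code computes the expected counts under independence and the resulting statistic. The function name `chi_square` and the 2 x 2 table are hypothetical, chosen only for illustration.

```python
import numpy as np

def chi_square(table):
    """Return (chi2, expected) for a two-way contingency table of counts."""
    N = np.asarray(table, dtype=float)
    row_totals = N.sum(axis=1, keepdims=True)   # N_i.
    col_totals = N.sum(axis=0, keepdims=True)   # N_.j
    grand_total = N.sum()                       # N_..
    # Expected counts under independence: E_ij = N_i. * N_.j / N_..
    expected = row_totals @ col_totals / grand_total
    # Sum of squared departures from independence, scaled by E_ij
    chi2 = ((N - expected) ** 2 / expected).sum()
    return chi2, expected

# Hypothetical 2 x 2 table (illustration only, not data from the slides)
table = [[10, 20],
         [20, 10]]
chi2, expected = chi_square(table)
print(expected)  # every expected count is 15 for this table
print(chi2)      # 4 * (5**2 / 15), about 6.67
```

A large value relative to a chi-square distribution with (r-1)(c-1) degrees of freedom signals a departure from independence, which is what the decomposition then tries to explain.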
Decomposition of $\chi^2$
We have a departure from independence and we want to know why.
To find the factors we use the matrix C of dimension (r x c) with elements
$$c_{ij} = \frac{N_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$
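How to find factors?
Singular value decomposition (SVD) of the matrix C, that is, find matrices U, D and V such that
$$C = U D V^T$$
The columns of U are eigenvectors of $CC^T$ and the columns of V are eigenvectors of $C^TC$.
D is a diagonal matrix of $\sqrt{\lambda_k}$, where $\lambda_k$ are the eigenvalues of $CC^T$; the number of nonzero eigenvalues is $\mathrm{Rank}(C) \le \min(r-1, c-1)$.
$$\mathrm{Tr}(CC^T) = \sum_k \lambda_k = \chi^2 = \sum_{i,j} c_{ij}^2$$
The projections of the rows and the columns are given by the eigenvectors $U_k$ and $V_k$, which are linked by
$$C V_k = \sqrt{\lambda_k}\, U_k, \qquad C^T U_k = \sqrt{\lambda_k}\, V_k$$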
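The decomposition can be checked numerically. The sketch below builds C from a hypothetical 3 x 3 table (not data from the slides), takes its SVD with `numpy.linalg.svd`, and verifies that the eigenvalues sum to the chi-square and that the row and column vectors are linked through the singular values. Variable names are illustrative.

```python
import numpy as np

# Hypothetical 3 x 3 table of counts (illustration only)
N = np.array([[20.0, 10.0, 5.0],
              [10.0, 20.0, 10.0],
              [5.0, 10.0, 20.0]])

# Expected counts under independence and matrix C of scaled residuals
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
C = (N - E) / np.sqrt(E)

# SVD: C = U diag(d) V^T, singular values returned in decreasing order
U, d, Vt = np.linalg.svd(C, full_matrices=False)
V = Vt.T
lam = d ** 2                      # eigenvalues of C C^T

chi2 = ((N - E) ** 2 / E).sum()
trace_matches = bool(np.isclose(np.trace(C @ C.T), chi2)
                     and np.isclose(lam.sum(), chi2))

# Number of nonzero eigenvalues, at most min(r - 1, c - 1)
rank = int((d > 1e-10).sum())

# Link between row and column vectors, column by column
rows_from_cols = bool(np.allclose(C @ V, U * d))     # C V_k = sqrt(lam_k) U_k
cols_from_rows = bool(np.allclose(C.T @ U, V * d))   # C^T U_k = sqrt(lam_k) V_k

print(lam, chi2, rank, trace_matches, rows_from_cols, cols_from_rows)
```

Note that the signs of the columns of U and V returned by an SVD routine are arbitrary (a factor and its mirror image are equivalent), so only relative positions on a factor map are meaningful.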
How many factors?
The adequacy of the representation by the first two coordinates is measured by the percentage of explained inertia:
$$\frac{\lambda_1 + \lambda_2}{\sum_k \lambda_k}$$
In general, rows are displayed on (U1, U2) and columns on (V1, V2).
The proximity between row and column points is what is to be interpreted.
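A short sketch of the explained-inertia computation, assuming the eigenvalues are obtained as squared singular values of the scaled-residual matrix C. The function name `explained_inertia` and the 4 x 4 table are hypothetical.

```python
import numpy as np

def explained_inertia(table):
    """Percentage of total inertia carried by each factor, in decreasing order."""
    N = np.asarray(table, dtype=float)
    E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
    C = (N - E) / np.sqrt(E)
    # Squared singular values of C are the eigenvalues lambda_k
    lam = np.linalg.svd(C, compute_uv=False) ** 2
    return 100.0 * lam / lam.sum()

# Hypothetical 4 x 4 table (illustration only)
table = [[20, 10, 5, 5],
         [10, 20, 10, 5],
         [5, 10, 20, 10],
         [5, 5, 10, 20]]
pct = explained_inertia(table)
first_two = pct[:2].sum()   # share of inertia shown on the first factorial plane
print(pct, first_two)
```

If `first_two` is low, a display on the first two factors alone may hide structure carried by later factors.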
CA in practice
Proximity of two rows (columns) indicates a similar profile, that is, a similar conditional frequency distribution: the two rows (columns) are proportional.
The origin is the average of each factor, so a point (row or column) close to the origin indicates an average profile.
Proximity of a row to a column indicates that this row has a particularly important weight in this column (if far from the origin).
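To make the notion of a profile concrete, the sketch below computes row profiles (conditional frequency distributions) for a hypothetical table in which one row is exactly twice another. The table and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical table: row 1 is twice row 0, so both have the same profile
N = np.array([[10.0, 20.0, 30.0],
              [20.0, 40.0, 60.0],
              [30.0, 10.0, 5.0]])

# Row profile: distribution over columns, conditional on the row
row_profiles = N / N.sum(axis=1, keepdims=True)

# Average profile: marginal distribution of the columns
average_profile = N.sum(axis=0) / N.sum()

# Row masses: share of the grand total carried by each row
row_masses = N.sum(axis=1) / N.sum()

same_profile = bool(np.allclose(row_profiles[0], row_profiles[1]))
print(row_profiles)
print(average_profile, same_profile)
```

The average profile is the mass-weighted mean of the row profiles, which is why a row whose profile resembles it carries little departure from independence.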
A first example: French Bac
[Figures: eigenvalues; factor maps with Corsica and without Corsica (classical bac, technical bac); coefficients for regions; coefficients for bac type]
Properties of CA
Allows consideration of dummy variables (called 'illustrative variables'): additional variables that do not contribute to the construction of the factorial space but can be displayed on it.
With such a representation it is possible to determine the proximity between observations and variables, and between illustrative variables and observations.
Tekaia and Yeramian (2006)
208 predicted proteomes representing the three phylogenetic domains and various lifestyles (hyperthermophiles, thermophiles, psychrophiles and mesophiles, including eukaryotes)
Variables: amino-acid composition of proteomes
Illustrative variables: groups of amino acids (charged, polar, hydrophobic)
Why CA?
To analyze the distribution of species in terms of global properties and discriminated groups
Search for amino-acid signature in groups of species
Try to understand potential evolutionary trends
Results
First axis (63%) corresponds to GC content (from Mycoplasma (23%) to Streptomyces (72%))
Second axis (14%) corresponds to optimal growth temperature