Correspondence Analysis Ahmed Rebai Center of Biotechnology of Sfax.


Correspondence Analysis

Ahmed Rebai

Center of Biotechnology of Sfax


Correspondence analysis

Introduced by Benzécri (1973) for uncovering and understanding the structure and patterns in contingency-table data.

Involves finding coordinate values that represent the row and column categories in some optimal way.


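Contingency tables

Table with r rows (categories of X2) and c columns (categories of X1):

X2 \ X1    1     ...   j     ...   c     Total
1          N11   ...   N1j   ...   N1c   N1.
2          N21   ...   N2j   ...   N2c   N2.
.          .           .           .     .
i          Ni1   ...   Nij   ...   Nic   Ni.
.          .           .           .     .
r          Nr1   ...   Nrj   ...   Nrc   Nr.
Total      N.1   ...   N.j   ...   N.c   N..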
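As an illustration of the notation, here is a minimal sketch that computes the margins N_i., N_.j and N.. for a small table. The counts are hypothetical and chosen only for the example; they are not data from the slides.

```python
# Hypothetical 3 x 4 table of counts N[i][j] (illustrative data only).
N = [
    [20, 15, 10, 5],
    [10, 20, 15, 5],
    [5, 10, 20, 15],
]

row_totals = [sum(row) for row in N]         # N_i. : one total per row
col_totals = [sum(col) for col in zip(*N)]   # N_.j : one total per column
grand_total = sum(row_totals)                # N_.. : total count

print(row_totals, col_totals, grand_total)
```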


Main idea

Develop simple indices that show the relation between rows and columns.

Indices that tell us simultaneously which columns have more weight in a row category, and vice versa.

Reduce dimensionality, as in PCA.

Indices are extracted in decreasing order of importance.


Which criteria?

In a contingency table, global independence between the two variables is generally measured by a chi-square statistic (χ²), calculated as:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(N_{ij} - E_{ij})^2}{E_{ij}}$$

where E_ij are the expected counts under independence:

$$E_{ij} = \frac{N_{i.} \, N_{.j}}{N_{..}}$$
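The two formulas can be checked on a toy table. The sketch below is a plain-Python illustration; the function name `chi_square` and the example counts are assumptions made for this example, and the code assumes every expected count is strictly positive.

```python
def chi_square(N):
    """Return (chi2, E) for a table of counts given as a list of rows."""
    row_totals = [sum(row) for row in N]          # N_i.
    col_totals = [sum(col) for col in zip(*N)]    # N_.j
    n = sum(row_totals)                           # N_..
    # Expected counts under independence: E_ij = N_i. * N_.j / N_..
    E = [[ri * cj / n for cj in col_totals] for ri in row_totals]
    # Sum of squared deviations, each scaled by its expected count.
    chi2 = sum(
        (N[i][j] - E[i][j]) ** 2 / E[i][j]
        for i in range(len(N))
        for j in range(len(N[0]))
    )
    return chi2, E

# Hypothetical 2 x 2 table (illustrative data only).
chi2, E = chi_square([[10, 20], [20, 10]])
print(chi2, E)
```

A table whose rows are exactly proportional, such as `[[10, 20], [20, 40]]`, gives a χ² of zero, which is the independence case.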


Decomposition of χ²

We have a departure from independence and we want to know why.

To find the factors we use the matrix C of dimension (r × c) with elements

$$c_{ij} = \frac{N_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$
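A minimal NumPy sketch of the matrix C follows. The helper name `residual_matrix` and the counts are hypothetical; the sketch assumes all expected counts are positive. Summing the squared elements of C recovers the χ² statistic.

```python
import numpy as np

def residual_matrix(N):
    """Return C with c_ij = (N_ij - E_ij) / sqrt(E_ij)."""
    N = np.asarray(N, dtype=float)
    row = N.sum(axis=1, keepdims=True)   # N_i. as a column vector
    col = N.sum(axis=0, keepdims=True)   # N_.j as a row vector
    E = row @ col / N.sum()              # expected counts under independence
    return (N - E) / np.sqrt(E)

# Hypothetical 3 x 4 table (illustrative data only).
N = [[20, 15, 10, 5],
     [10, 20, 15, 5],
     [5, 10, 20, 15]]

C = residual_matrix(N)
chi2 = (C ** 2).sum()   # sum of squared elements of C
print(C.shape, chi2)
```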


How to find factors?

Singular value decomposition (SVD) of matrix C: find matrices U, D and V such that

$$C = U D V^T$$

The columns of U are eigenvectors of CC^T.

The columns of V are eigenvectors of C^TC.

D is a diagonal matrix of $\sqrt{\lambda_k}$, where the λ_k are the eigenvalues of CC^T.

The number of nonzero λ_k is Rank(C) ≤ min(r − 1, c − 1).
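The decomposition can be sketched with `numpy.linalg.svd`, which returns U, the singular values and V^T. The table below is hypothetical; the squared singular values play the role of the eigenvalues λ_k.

```python
import numpy as np

# Hypothetical 3 x 4 table (illustrative data only).
N = np.array([[20, 15, 10, 5],
              [10, 20, 15, 5],
              [5, 10, 20, 15]], dtype=float)

# Expected counts under independence and the matrix C.
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
C = (N - E) / np.sqrt(E)

# Thin SVD: U is r x m, s has m entries, Vt is m x c, with m = min(r, c).
U, s, Vt = np.linalg.svd(C, full_matrices=False)
lam = s ** 2   # eigenvalues of C C^T, in decreasing order

print(lam)
print(np.linalg.matrix_rank(C))
```

For this 3 × 4 table at most min(r − 1, c − 1) = 2 eigenvalues are nonzero, so the last entry of `lam` is zero up to rounding.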


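$$\mathrm{Tr}(CC^T) = \sum_k \lambda_k = \chi^2 = \sum_{i} \sum_{j} c_{ij}^2$$

The projections of the rows and the columns are given by the eigenvectors U_k and V_k, which are linked by

$$C V_k = \sqrt{\lambda_k} \, U_k$$

$$C^T U_k = \sqrt{\lambda_k} \, V_k$$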
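These relations follow from the decomposition C = U D V^T. The short derivation below is a sketch that assumes the usual SVD convention of orthonormal columns, U^T U = V^T V = I, which the slides do not state explicitly.

```latex
% Transition relations: multiply C = U D V^T by V (resp. C^T by U).
C V   = U D V^T V = U D  \quad\Rightarrow\quad C V_k   = \sqrt{\lambda_k}\, U_k
C^T U = V D U^T U = V D  \quad\Rightarrow\quad C^T U_k = \sqrt{\lambda_k}\, V_k

% Trace identity: the trace is invariant under cyclic permutation.
\mathrm{Tr}(C C^T) = \mathrm{Tr}(U D^2 U^T) = \mathrm{Tr}(D^2) = \sum_k \lambda_k

% The trace of C C^T is also the sum of the squared elements of C.
\mathrm{Tr}(C C^T) = \sum_{i=1}^{r} \sum_{j=1}^{c} c_{ij}^2
                   = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(N_{ij} - E_{ij})^2}{E_{ij}} = \chi^2
```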


How many factors?

The adequacy of the representation by the first two coordinates is measured by the percentage of explained inertia:

$$\frac{\lambda_1 + \lambda_2}{\sum_k \lambda_k}$$

In general, rows are displayed on (U1, U2) and columns on (V1, V2).

The proximity between row points and column points is then interpreted.
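A minimal sketch of the explained-inertia ratio, assuming NumPy and a hypothetical table. The function name `explained_inertia` is an assumption made for this example, and the code assumes the table is not exactly independent (otherwise the total inertia is zero and the ratio is undefined).

```python
import numpy as np

def explained_inertia(N, n_axes=2):
    """Share of total inertia carried by the first n_axes factors."""
    N = np.asarray(N, dtype=float)
    E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
    C = (N - E) / np.sqrt(E)
    s = np.linalg.svd(C, compute_uv=False)   # singular values only
    lam = s ** 2                             # eigenvalues lambda_k
    return lam[:n_axes].sum() / lam.sum()

# Hypothetical 3 x 4 table (illustrative data only).
N = [[20, 15, 10, 5],
     [10, 20, 15, 5],
     [5, 10, 20, 15]]

share1 = explained_inertia(N, n_axes=1)
share2 = explained_inertia(N, n_axes=2)
print(share1, share2)
```

With only three rows there are at most two nonzero eigenvalues, so two axes carry all of the inertia in this toy case; larger tables generally leave some inertia unexplained.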


CA in practice

Proximity of two rows (columns) indicates similar profiles, that is, similar conditional frequency distributions: the two rows (columns) are nearly proportional.

The origin is the average on each factor, so a point (row or column) close to the origin has an average profile.

Proximity of a row to a column indicates that this row has a particularly large weight in this column (if far from the origin).
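The notion of a profile can be made concrete with a few lines of plain Python. The sketch below uses hypothetical counts and a hypothetical helper `row_profiles`; it only computes conditional frequencies, not the factor coordinates.

```python
def row_profiles(N):
    """Conditional frequency distribution of each row (each sums to 1)."""
    return [[x / sum(row) for x in row] for row in N]

# Hypothetical table: rows 0 and 1 are proportional, row 2 differs.
N = [
    [10, 20, 30],
    [20, 40, 60],
    [30, 20, 10],
]

profiles = row_profiles(N)

# Average profile: column totals divided by the grand total.
total = sum(sum(row) for row in N)
average_profile = [sum(col) / total for col in zip(*N)]

for p in profiles:
    print(p)
print(average_profile)
```

Rows 0 and 1 have the same profile even though their totals differ, which is the situation the first point describes; a row whose profile matches `average_profile` would sit at the origin.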


A first example: French Bac


Eigenvalues


With Corsica


Without Corsica

Plot labels: Classical bac, Technical bac


Coefficients for regions


Coefficients for Bac Type


Properties of CA

Allows consideration of dummy variables (called 'illustrative variables') as additional variables that do not contribute to the construction of the factorial space but can be displayed on it.

With such a representation it is possible to assess the proximity between observations and variables, and between illustrative variables and observations.
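One way to place an illustrative point is to standardize it like the active rows and project it on the axes already found, without recomputing the SVD. The sketch below does this for an additional row; an additional column would be handled symmetrically using U. It is an illustration under the C = U D V^T convention of these slides, not necessarily the scaling used by a given CA package, and the counts and the helper `project_row` are hypothetical.

```python
import numpy as np

# Hypothetical active table (illustrative data only).
N = np.array([[20, 15, 10, 5],
              [10, 20, 15, 5],
              [5, 10, 20, 15]], dtype=float)

col_mass = N.sum(axis=0) / N.sum()        # N_.j / N_..
E = np.outer(N.sum(axis=1), col_mass)     # expected counts N_i. N_.j / N_..
C = (N - E) / np.sqrt(E)
U, s, Vt = np.linalg.svd(C, full_matrices=False)

def project_row(counts, col_mass, Vt, n_axes=2):
    """Coordinates of an extra row on the first n_axes existing axes."""
    counts = np.asarray(counts, dtype=float)
    expected = counts.sum() * col_mass               # expected counts for this row
    resid = (counts - expected) / np.sqrt(expected)  # same scaling as the rows of C
    return resid @ Vt[:n_axes].T                     # projection on V_1 .. V_n_axes

# The extra row does not enter the SVD, so the axes are unchanged.
supp = project_row([8, 12, 6, 4], col_mass, Vt)
print(supp)
```

As a consistency check, projecting an active row this way returns its own coordinates scaled by the singular values, in line with C V_k = √λ_k U_k.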


Tekaia and Yeramian (2006)

208 predicted proteomes representing the three phylogenetic domains and various lifestyles (hyperthermophiles, thermophiles, psychrophiles and mesophiles, including eukaryotes)

Variables: amino-acid composition of proteomes

Illustrative variables: groups of amino acids (charged, polar, hydrophobic)


Why CA?

To analyze the distribution of species in terms of global properties and discriminated groups

Search for amino-acid signatures in groups of species

Try to understand potential evolutionary trends


Results

First axis (63%) corresponds to GC content (from Mycoplasma (23%) to Streptomyces (72%))

Second axis (14%) corresponds to optimal growth temperature
