MMG 991 – Special topics in microbiology
A gentle introduction to exploratory data analysis for microbiologists
George M. Garrity
Department of Microbiology and Molecular Genetics
Michigan State University
Fall Semester 2001
“The purpose of models is not to fit the data, but to sharpen the questions”
Samuel Karlin, 11th Memorial R.A. Fisher Lecturer, Royal Society. April 20, 1983
The “new biology”
• Paradigm shift
  – A move towards signal processing
• The driving force
  – Advances in molecular biology
• Point of origin
  – Sequence databases
• Comparative genomics / phylogenomics / DNA microarrays
  – Bottlenecks
• Tools
  – Interface with databases
  – Display data in an intuitive framework
    » Correlations between genes and source organisms
    » Correlations between expression and physiological state
  – Data quality
    » Sources of variability / error
  – Interpretation of results
    » The curse of high dimensionality
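One symptom of the curse of high dimensionality is that distances lose contrast: as the number of variables grows, the nearest and farthest neighbours of a point end up almost equally far away. The following is a minimal sketch of that effect and is not taken from the course material. The function name `relative_contrast`, the uniform random data, and the point counts are all illustrative assumptions.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from one query point
    to n_points uniformly random points in a dim-dimensional unit cube."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    distances = [
        dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


# Hypothetical comparison: 2 variables versus 1000 variables.
contrast_low = relative_contrast(2)
contrast_high = relative_contrast(1000)
print(f"relative contrast,    2 dimensions: {contrast_low:.2f}")
print(f"relative contrast, 1000 dimensions: {contrast_high:.2f}")
```

With many variables the contrast shrinks towards zero, which is one reason clustering and projection results need careful interpretation for data such as expression profiles.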
Exploratory data analysis
• A philosophical approach
  – Successfully used in many disciplines
  – High dimension data
  – Application of graphical techniques
• Methodology
  – Simple
    • Summary data
    • Box and whisker plots
  – Clustering
    • Hierarchical
    • Partitioning
    • Goal
  – Projection methods
    • Supervised vs non-supervised
    • Linear vs non-linear
    • Goal
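The “simple” methods on this slide can be made concrete with the numbers that a box-and-whisker plot draws. This is a minimal sketch, assuming Tukey's common 1.5 × IQR rule for the whiskers; the helper name `box_stats` and the sample values are made up for illustration and are not course data.

```python
from statistics import quantiles  # Python 3.8+


def box_stats(values):
    """Return the quantities a box-and-whisker plot displays."""
    data = sorted(values)
    # Quartiles by linear interpolation over the observed range.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value within the fences
        "whisker_high": inside[-1],  # largest value within the fences
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


# Hypothetical percentages of one fatty acid in nine isolates.
sample = [12.1, 13.4, 12.8, 14.0, 13.1, 12.5, 13.7, 29.5, 12.9]
stats = box_stats(sample)
print(stats)
```

The single value far above the rest is reported as an outlier instead of stretching the whisker, which is what makes the plot useful for a first look at data quality.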
Impact of genomics
• Speed of data acquisition
  – uniformity across taxa
  – outstrips our ability to annotate and comprehend
    • pressing need for better tools
• EDA methods
  – allow one to visualize very large data sets
  – pose/test hypotheses
  – see the “big picture”
• Links to other data
  – critical need to validate
• The importance of ground truth
  – annotation
S-Plus
• A statistical computing environment
  – large data sets
  – Object oriented, vectorized language
    • extremely powerful
    • learning curve
  – visualization tools
  – access to internal calculations, code
  – ability to develop custom functions, scripts
EDA in chemotaxonomic study
• Goal
  – Rapidly characterize wild-type actinomycetes entering an industrial high throughput screening program
  – Method of choice
    • FAME (fatty acid methyl ester profiling)
      – Low cost/sample
      – Speed
      – Good resolution
• Problem
  – Data analysis
    » Clustering
    » Advantages / disadvantages
  – Data visualization
    » alternatives
  – Data interpretation
    » What drives the classification?
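The clustering step named on this slide can be sketched in a few lines. The example below uses single-linkage agglomerative clustering on Euclidean distances purely because it is short; the slides do not say which linkage or distance was used in S-Plus. The isolate names and the three-component fatty acid profiles are invented for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(profiles):
    """Agglomerative clustering of named profiles.

    Returns the merge history as (members_a, members_b, height) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [(name,) for name in profiles]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(
                    dist(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history


# Hypothetical profiles: percent of total for three fatty acids.
profiles = {
    "A1": (40.0, 10.0, 5.0),
    "A2": (41.0, 11.0, 5.0),
    "A3": (10.0, 35.0, 20.0),
    "A4": (12.0, 33.0, 21.0),
}
history = single_linkage(profiles)
for left, right, height in history:
    print(left, "+", right, f"at height {height:.2f}")
```

The merge heights show what drives the grouping: A1 and A2 join early because their profiles are nearly identical, while the two pairs join only at a much greater distance.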
[Figure: six panels, each labelled “Indet”; horizontal axis 0 to 150, vertical axis “% tot” (0 to 50)]
[Figure: dendrogram, “Reference strains from Nevada desert collection”; leaves labelled Geod, Atpn, Stmy, Indet, or NA; height axis 0 to 200]
[Figure: dendrogram, “R-analysis”; leaves labelled with isolate codes (A-series such as A29, plus Col157, Col158, Col159); height axis 0 to 2500]
[Figure: PCA scatter plot, “Zoosporogenous actinomycetes from Nevada desert”; x: principal component 1 (-40 to 40), y: principal component 2 (-40 to 20)]
[Figure: “Density plot of PCA: Zoosporogenous actinomycetes from Nevada desert”; axes: principal component 1 (-40 to 40), principal component 2 (-40 to 20), number of isolates (0 to 40)]
EDA in large-scale phylogenetic study
• Limitations of tree-based models
  – Low capacity
  – Computationally expensive
  – Low quality of graphical output
  – Comparability
    • Alignment and mask(s)
    • Evolutionary models
    • Treeing algorithms
    • Statistical significance of groupings
An alternative view of prokaryotic diversity
[Figure: PCA scatter plot, “Map of the procaryotic 16S sequences (RDP 7.0)”; x: principal component 1 (0 to 4), y: principal component 2 (-0.8 to 0.6)]
[Figures: six further panels on the same axes, titled Proteobacteria, Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Epsilonproteobacteria]
[Figure: PCA scatter plot, “Proteobacteria, all benchmark sequences”; x: principal component 1 (-0.6 to 0.6), y: principal component 2 (-0.4 to 0.8)]
[Figure: variance plot for proteomap.pca, Comp.1 to Comp.10; vertical axis “Variances” (0.0 to 0.06); values printed on the bars, in order: 0.392, 0.634, 0.726, 0.812, 0.842, 0.86, 0.873, 0.885, 0.893, 0.9]
[Figure: PCA scatter plot, “Proteobacteria, proteobacterial benchmark sequences”; x: principal component 1 (-0.4 to 0.2), y: principal component 2 (-0.8 to 0.2)]
[Figure: variance plot for proteomap.pca, Comp.1 to Comp.10; vertical axis “Variances” (0.0 to 0.05); values printed on the bars, in order: 0.442, 0.706, 0.826, 0.877, 0.904, 0.917, 0.927, 0.935, 0.942, 0.948]
[Figures: two PCA scatter plots, titled Proteobacteria and Alphaproteobacteria; x: principal component 1 (-0.4 to 0.2), y: principal component 2 (-0.8 to 0.2)]
Spatial distortions and perspective
• Two problems with “global” view
  – Small discrepancies between distance and PC scores
    • Unrelated sequences found in same location
  – Assumption that map is planar
• An experiment with a known solution
  – Can PCA recover a known shape based on a matrix of distances?
    • Historical precedent
  – If not, what corrections might be required?
  – Mapping strategy analogous to PCA model
    • 150 benchmarks
    • 23,470 points (“species”)
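The question of whether a known shape can be recovered from a matrix of distances can be tried on a toy case. The sketch below uses classical multidimensional scaling (principal coordinates analysis), which for Euclidean distances gives the same configuration as PCA scores; it is a stand-in and not the S-Plus procedure used for the slides. The four-point shape, the helper names, and the power-iteration eigen-solver are all assumptions chosen to keep the example free of external libraries.

```python
from math import dist, sqrt


def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_eigenpairs(matrix, k, iters=200):
    """Leading k eigenpairs of a symmetric positive semi-definite matrix,
    found by power iteration with deflation."""
    n = len(matrix)
    m = [row[:] for row in matrix]  # work on a copy
    pairs = []
    for _ in range(k):
        v = [float(i + 1) for i in range(n)]  # arbitrary start vector
        for _ in range(iters):
            w = matvec(m, v)
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(m, v)))
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                m[i][j] -= lam * v[i] * v[j]
    return pairs


def classical_mds(d, k=2):
    """Coordinates in k dimensions from a matrix of pairwise distances."""
    n = len(d)
    sq = [[x * x for x in row] for row in d]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Double-centre the squared distances to get an inner-product matrix.
    b = [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand) for j in range(n)]
        for i in range(n)
    ]
    pairs = top_eigenpairs(b, k)
    return [[sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs] for i in range(n)]


# Hypothetical planar shape with known coordinates.
shape = [(0.0, 0.0), (6.0, 0.0), (6.0, 2.0), (1.0, 4.0)]
d = [[dist(p, q) for q in shape] for p in shape]
recovered = classical_mds(d)
max_err = max(
    abs(dist(recovered[i], recovered[j]) - d[i][j])
    for i in range(len(shape))
    for j in range(len(shape))
)
print("largest distance error:", max_err)
```

The recovered coordinates match the original shape only up to rotation, reflection, and translation, so the check compares pairwise distances. For a truly planar shape two components reproduce the distances almost exactly; distances measured over a curved surface would not fit in two dimensions, which is the “assumption that map is planar” problem raised above.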
[Figure: scatter plot; x: rand.cities.pca$scores[, 1] (0 to 80), y: rand.cities.pca$scores[, 2] (-10 to 40)]
[Figure: scatter plot; x: rand.cities.pca$scores[, 2] (-10 to 40), y: rand.cities.pca$scores[, 3] (-8 to 2); label “MD”]
[Figure: scatter plot; x: cities.pca$scores[, 1] (-20000 to 5000), y: cities.pca$scores[, 2] (0 to 10000); label “MD”]
[Figure: scatter plot; x: cities.pca$scores[, 2] (0 to 10000), y: cities.pca$scores[, 3] (-10000 to 0); label “MD”]
DNA microarrays
[Figure: “Clustering of genes: Expression profiles of 64 cancers”; axis ticks 0 to 2000, 0 to 60, and 0 to 150]
[Figure: dendrogram, “Clustering of cancers: Expression profiles of 64 cancers”; leaves labelled by tumour type (CNS, RENAL, BREAST, NSCLC, OVARIAN, MELANOMA, PROSTATE, LEUKEMIA, COLON, UNKNOWN) plus K562A.repro, K562B.repro, MCF7A.repro, MCF7D.repro; height axis 40 to 80; additional axis ticks 0 to 2000 and 0 to 60]
[Figure: “PCA of expression profiles of 2199 genes in 64 cancers”; x: principal component 1 (-15 to 10), y: principal component 2 (-10 to 10)]
[Figure: “PCA of 64 cancers based on expression profiles of 2199 genes”; x: principal component 1 (-40 to 40), y: principal component 2 (-40 to 20)]
Course objectives
• To explore current applications of EDA methods
  – Special emphasis on microarrays
    • Expression profiling
    • Phylogenetic / deterministic applications
  – Other possible applications
• Provide an introduction to S-Plus
  – Experimental platform
    • Work with publicly available data sets
    • What works and what doesn’t
• Grading
  – Participation in discussion of the literature
  – Demonstration of proficiency in S-Plus
    • Completion of exercises
    • Team project