
MMG 991 – Special topics in microbiology
A gentle introduction to exploratory data analysis for microbiologists

George M. Garrity
Department of Microbiology and Molecular Genetics
Michigan State University

Fall Semester 2001

“The purpose of models is not to fit the data, but to sharpen the questions”

Samuel Karlin, 11th Memorial R.A. Fisher Lecturer, Royal Society. April 20, 1983

The “new biology”
• Paradigm shift
  – A move towards signal processing
• The driving force
  – Advances in molecular biology
• Point of origin
  – Sequence databases
• Comparative genomics / phylogenomics / DNA microarrays
  – Bottlenecks
• Tools
  – Interface with databases
  – Display data in an intuitive framework
    » Correlations between genes and source organisms
    » Correlations between expression and physiological state
  – Data quality
    » Sources of variability / error
  – Interpretation of results
    » The curse of high dimensionality

Exploratory data analysis
• A philosophical approach
  – Successfully used in many disciplines
  – High dimension data
  – Application of graphical techniques
• Methodology
  – Simple
    • Summary data
    • Box and whisker plots
  – Clustering
    • Hierarchical
    • Partitioning
    • Goal
  – Projection methods
    • Supervised vs non-supervised
    • Linear vs non-linear
    • Goal
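The “simple” methods listed above start from summary statistics. As a minimal sketch (written in Python rather than S-Plus, using made-up measurements and a hypothetical helper named `box_stats`), the numbers that a box-and-whisker plot draws can be computed as follows. The 1.5 × IQR whisker rule is Tukey's common convention and is assumed here; the slides do not specify one.

```python
import statistics

def box_stats(values):
    """Return the numbers a box-and-whisker plot draws for one variable."""
    data = sorted(values)
    # Quartiles by the 'inclusive' method (data treated as the whole population).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Tukey's rule (assumed): whiskers reach the most extreme points
    # that lie within 1.5 * IQR of the box.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

# Hypothetical measurements (illustrative only, not course data).
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]
stats = box_stats(sample)
print(stats)
```

Points beyond the whiskers are reported separately, which is what makes the plot useful for spotting unusual observations before any modelling is attempted.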

Impact of genomics
• Speed of data acquisition
  – uniformity across taxa
  – outstrips our ability to annotate and comprehend
    • pressing need for better tools
• EDA methods
  – allow one to visualize very large data sets
  – pose/test hypotheses
  – see the “big picture”
• Links to other data
  – critical need to validate
• The importance of ground truth
  – annotation

S-Plus
• A statistical computing environment
  – large data sets
  – Object oriented, vectorized language
    • extremely powerful
    • learning curve
  – visualization tools
  – access to internal calculations, code
  – ability to develop custom functions, scripts

EDA in chemotaxonomic study
• Goal
  – Rapidly characterize wild-type actinomycetes entering an industrial high throughput screening program
  – Method of choice
    • FAME (fatty acid methyl ester analysis)
      – Low cost/sample
      – Speed
      – Good resolution
• Problem
  – Data analysis
    » Clustering
    » Advantages / disadvantages
  – Data visualization
    » Alternatives
  – Data interpretation
    » What drives the classification?
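To make the clustering step concrete, here is a small sketch of agglomerative (hierarchical) clustering on FAME-like profiles. It is written in Python rather than S-Plus. The strain names, the profile values, and the choice of Euclidean distance with average linkage are all assumptions made for illustration; the slides do not say which distance or linkage was used.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    Returns the merge history as (cluster_a, cluster_b, height) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [[name] for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Mean pairwise distance between members of the two clusters.
                d = sum(euclidean(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical profiles: % of total for three fatty acids per strain.
profiles = {
    "S1": (40.0, 10.0, 5.0),
    "S2": (42.0, 9.0, 6.0),
    "S3": (10.0, 35.0, 20.0),
    "S4": (12.0, 33.0, 22.0),
}
merges = average_linkage(profiles)
for left, right, height in merges:
    print(left, "+", right, "at height", round(height, 2))
```

The merge heights show only that groups differ, not which fatty acids separate them. That is the gap behind the question “What drives the classification?”.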

[Figure: six panels, each labeled “Indet”, plotting % tot (y-axis, 0 to 50) against an unlabeled x-axis (0 to 150).]

[Figure: Reference strains from Nevada desert collection. Strain labels: Geod, Atpn, Stmy, Indet, and NA; vertical axis 0 to 200.]

[Figure: R-analysis. Labels A8 through A148, Col157, Col158, and Col159; vertical axis 0 to 2500.]

[Figure: Zoosporogenous actinomycetes from Nevada desert. Principal component 2 (y-axis, -40 to 20) against principal component 1 (x-axis, -40 to 40).]

[Figure: Density plot of PCA: Zoosporogenous actinomycetes from Nevada desert. Axes: an unlabeled axis (-40 to 40), principal component 2 (-40 to 20), and number of isolates (0 to 40).]

EDA in large-scale phylogenetic study
• Limitation of tree-based models
  – Low capacity
  – Computationally expensive
  – Low quality of graphical output
  – Comparability
    • Alignment and mask(s)
    • Evolutionary models
    • Treeing algorithms
    • Statistical significance of groupings

An alternative view of prokaryotic diversity


[Figure: Map of the procaryotic 16S sequences (RDP 7.0). Principal component 2 (y-axis, -0.8 to 0.6) against an unlabeled x-axis (0 to 4).]

[Figures: six further panels on the same axes, titled Proteobacteria, Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, and Epsilonproteobacteria.]


[Figure: Proteobacteria, all benchmark sequences. Principal component 2 (y-axis, -0.4 to 0.8) against an unlabeled x-axis (-0.6 to 0.6).]

[Figure: scree plot of proteomap.pca. Variances (y-axis, 0 to 0.06) for Comp.1 to Comp.10. Values printed above the bars, in order: 0.392, 0.634, 0.726, 0.812, 0.842, 0.86, 0.873, 0.885, 0.893, 0.9. They increase monotonically, which is consistent with cumulative proportion of variance.]


[Figure: Proteobacteria, proteobacterial benchmark sequences. Principal component 2 (y-axis, -0.8 to 0.2) against an unlabeled x-axis (-0.4 to 0.2).]

[Figure: scree plot of proteomap.pca. Variances (y-axis, 0 to 0.05) for Comp.1 to Comp.10. Values printed above the bars, in order: 0.442, 0.706, 0.826, 0.877, 0.904, 0.917, 0.927, 0.935, 0.942, 0.948. They increase monotonically, which is consistent with cumulative proportion of variance.]

[Figure: two panels titled Proteobacteria and Alphaproteobacteria. Principal component 2 (y-axis, -0.8 to 0.2) against principal component 1 (x-axis, -0.4 to 0.2).]

Spatial distortions and perspective
• Two problems with “global” view
  – Small discrepancies between distance and PC scores
    • Unrelated sequences found in same location
  – Assumption that map is planar
• An experiment with a known solution
  – Can PCA recover a known shape based on a matrix of distances?
    • Historical precedent
  – If not, what corrections might be required?
  – Mapping strategy analogous to PCA model
    • 150 benchmarks
    • 23,470 points (“species”)
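The experiment above asks whether a known shape can be rebuilt from distances alone. The sketch below is not the S-Plus analysis from the course. It swaps in classical multidimensional scaling (principal coordinates analysis), a closely related technique that, for Euclidean distances, yields the same configuration as PCA on the original coordinates. The five-point shape and the function names are invented for illustration, and power iteration stands in for a library eigensolver so that only the Python standard library is needed.

```python
import math
import random

def euclid(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def classical_mds(dist, dims=2, iters=500, seed=0):
    """Recover coordinates from a symmetric distance matrix."""
    n = len(dist)
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration for the leading eigenvector of the current matrix.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        vv = sum(x * x for x in v)
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n)) / vv  # Rayleigh quotient
        scale = math.sqrt(max(lam, 0.0) / vv)
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j] / vv
    return coords

# A known planar shape (hypothetical coordinates, illustration only).
shape = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0), (2.0, 1.0)]
dist = [[euclid(p, q) for q in shape] for p in shape]
recovered = classical_mds(dist)
# The shape comes back only up to rotation and reflection,
# so compare pairwise distances rather than raw coordinates.
max_error = max(abs(euclid(recovered[i], recovered[j]) - dist[i][j])
                for i in range(len(shape)) for j in range(len(shape)))
print("largest distance error:", max_error)
```

For a truly planar configuration the distances are reproduced to rounding error. Distortions of the kind listed on the slide would therefore point to non-planar structure or to non-Euclidean distances in the real data.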

[Figure: four PCA score plots, each carrying the label “MD”.
rand.cities.pca: scores[, 2] (y-axis, -10 to 40) against scores[, 1] (x-axis, 0 to 80); scores[, 3] (y-axis, -8 to 2) against scores[, 2] (x-axis, -10 to 40).
cities.pca: scores[, 2] (y-axis, 0 to 10000) against scores[, 1] (x-axis, -20000 to 5000); scores[, 3] (y-axis, -10000 to 0) against scores[, 2] (x-axis, 0 to 10000).]

DNA microarrays

[Figure: Clustering of genes: Expression profiles of 64 cancers. Axis ranges 0 to 2000, 0 to 60, and 0 to 150.]

[Figure: Clustering of cancers: Expression profiles of 64 cancers. Sample labels include CNS, RENAL, BREAST, NSCLC, OVARIAN, MELANOMA, PROSTATE, LEUKEMIA, COLON, and UNKNOWN, plus K562A.repro, K562B.repro, MCF7A.repro, and MCF7D.repro; vertical axis 40 to 80.]

[Figure: PCA of expression profiles of 2199 genes in 64 cancers. Principal component 2 (y-axis, -10 to 10) against an unlabeled x-axis (-15 to 10).]


[Figure: PCA of 64 cancers based on expression profiles of 2199 genes. Principal component 2 (y-axis, -40 to 20) against an unlabeled x-axis (-40 to 40).]

Course objectives
• To explore current applications of EDA methods
  – Special emphasis on microarrays
    • Expression profiling
    • Phylogenetic / deterministic applications
  – Other possible applications
• Provide an introduction to S-Plus
  – Experimental platform
    • Work with publicly available data sets
    • What works and what doesn’t
• Grading
  – Participation in discussion of the literature
  – Demonstration of proficiency in S-Plus
    • Completion of exercises
    • Team project
