0 Fast and Accurate Reconstruction of Evolutionary Trees: a Model-based Study Ming-Yang Kao...

Post on 11-Jan-2016

1

Fast and Accurate Reconstruction of Evolutionary Trees: a Model-based Study

Ming-Yang Kao

Department of Computer Science, Northwestern University

Evanston, Illinois

U. S. A.

2

Perspectives

Use biology ideas to solve computer science problems

Use computer science tools to solve biology problems

biology <-> computer science

this talk

3

Use Biology to Solve CS Problems

• DNA Computing

• DNA Self-Assembly

• Genetic Algorithms

• Neural Network

• Others

4

Use CS to Solve Biology Problems

• Bioinformatics or Computational Biology

data mining

(this talk)

• Related fields: computational neuroscience, computational ecology, medical informatics, … many more …

5

Example Research Areas of Bioinformatics

• DNA sequencing

• DNA microarray analysis

• DNA self-assembly for nano-structures

• DNA word design

• RNA secondary structure prediction

• Protein sequencing (my talk #4)

• Proteomics

• Protein database search

• Protein sequence design (my talk #3)

• Protein landscape analysis

• Phylogeny reconstruction (this talk)

• Phylogeny comparison (my talk #1)

6

Evolutionary Trees

definition: a tree with distinct labels at leaves

leaf labels: species, organisms, DNAs, RNAs, proteins, features, etc.

[Figure: an evolutionary tree with the ancestral species at the root and the present-day species bird, plum, peach, rice, and wheat at the leaves. (Just a joke!)]

7

Evolutionary Trees

leaf labels: DNA sequences

[Figure: the same tree with leaves bird, plum, peach, rice, wheat labeled by the DNA sequences AAGT, CCAG, CCAT, CGGG, CGGC. (Just a joke!)]

8

Problem Formulation

[Figure: the tree with leaves bird, plum, peach, rice, wheat and the DNA sequences AAGT, CCAG, CCAT, CGGG, CGGC.]

Input: DNA sequences of present-day species

Output: the true evolutionary tree

Question: What is “true”? Need a model!

(Just a joke!)

9

A Fundamental Problem of Biology

Since the time of Charles Darwin,

Problem: reconstruct the evolutionary history of all known species.

Importance:

• intellectually fascinating

• practical benefits – medicine, food …

• Charles Robert Darwin (1809-1882)

• Origin of Species (1859)

10

Main Difficulties

• Availability of data

Hundreds of millions of species --- unlikely to be all available any time soon or ever.

But DNA sequences of more and more species are becoming available.

• Extracting information from data

focus of this talk

11

Today’s Technical Focus

[Figure: the tree with leaves bird, plum, peach, rice, wheat and the DNA sequences AAGT, CCAG, CCAT, CGGG, CGGC.]

Input: DNA sequences of present-day species

Output: the true evolutionary tree

Question: What is “true”? Need a model!

Collaborators: Csuros & Kim

12

Main Result

An algorithm that constructs an evolutionary tree from biomolecular sequences

• Provable high accuracy

• Short sequence length

• Optimal running time

• Optimal memory space

13

Outline of Technical Discussion

1. Define the model of evolution.

2. Formulate the computational problem.

3. Discuss the theoretical performance of our algorithm.

4. Discuss the empirical performance.

5. Describe and analyze the algorithm.

6. Further research.

14

Outline of Technical Discussion (1)

1. Define the model of evolution.

2. Formulate the computational problem.

3. Discuss the theoretical performance of our algorithm.

4. Discuss the empirical performance.

5. Describe and analyze the algorithm.

6. Further research.

15

Model of Evolution

Intuitions

[Figure: a tree whose nodes carry the sequences ACGTACT, AGGAGAA, CAGGAGTTTTAA, and AGTTCCT.]

Mutation occurs probabilistically.

1. edge length ~ time

2. edge length ~ mutation probability

3. edge length ~ dissimilarity (or distance)

16

Jukes-Cantor Model of Evolution (1)

Edge Mutation Probability

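[Figure: an edge e from a node with character A to a node with character X, with mutation probability $P_e = 0.6$.]

$0 < f \le P_e \le g < 3/4$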

• No insertion or deletion.

• X = A with probability 1 - 0.6 = 0.4

• X = C, G, or T with probability 0.6/3 = 0.2
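The two bullets above can be checked by simulation. The following is a minimal sketch, not the speaker's code: the names `mutate_char` and `empirical_distribution` are hypothetical, and the only model fact used is that a character survives an edge with probability 1 - P_e and otherwise becomes one of the three other bases uniformly.

```python
import random

BASES = "ACGT"

def mutate_char(parent, p_e, rng):
    """Pass one character across an edge with mutation probability p_e."""
    if rng.random() < p_e:
        # Mutation: each of the three other bases has probability p_e / 3.
        return rng.choice([b for b in BASES if b != parent])
    return parent

def empirical_distribution(parent, p_e, trials, seed=0):
    """Estimate the distribution of the child character by repeated sampling."""
    rng = random.Random(seed)
    counts = {b: 0 for b in BASES}
    for _ in range(trials):
        counts[mutate_char(parent, p_e, rng)] += 1
    return {b: counts[b] / trials for b in BASES}

# Parent character A and P_e = 0.6, as on the slide.
dist = empirical_distribution("A", 0.6, 100_000)
```

With enough trials, `dist["A"]` should sit near 0.4 and each of the other three entries near 0.2.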

17

Jukes-Cantor Model of Evolution (2)

Independent Mutations along All Edges

[Figure: a tree with the characters A, A, C, G, G at its nodes and edge mutation probabilities 0.2, 0.7, 0.65, 0.6.]

18

Jukes-Cantor Model of Evolution (3)

i.i.d. mutations at every character

[Figure: the same tree with the sequences AAGT, AGTT, CAGG, GGTG, GTTG at its nodes and edge mutation probabilities 0.2, 0.7, 0.65, 0.6.]

19

Outline of Technical Discussion (2)

1. Define the model of evolution.

2. Formulate the computational problem.

3. Discuss the theoretical performance of our algorithm.

4. Discuss the empirical performance.

5. Describe and analyze the algorithm.

6. Further research.

20

Problem Formulation

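[Figure: the true tree (not known to the algorithm). Its nodes carry the sequences AGTGT, GGTAC, CGTTT, CAGGT, GTACT, TGGAC, CAGGT, CGTGT, ATCGT; its edges carry the mutation probabilities 0.2, 0.6, 0.7, 0.3, 0.2, 0.5, 0.7, 0.1; its leaves are $S_1, \ldots, S_5$.]

• Pick any sequence for the root (also unknown to algorithm).

• Generate the other sequences.

Input: $S_1, S_2, \ldots, S_5$, but not the other sequences, nor the tree.

Output: the unrooted tree with leaves $S_1, \ldots, S_5$.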
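The generation step (pick a root sequence, then mutate it independently along every edge and at every character) can be sketched as follows. This is an illustration under assumptions: the tree shape, the node names `R`, `U`, `V`, `W`, and the choice of root sequence are all hypothetical, because the slide's figure does not survive in the transcript. Only the edge probabilities are borrowed from the slide.

```python
import random

BASES = "ACGT"

def mutate_sequence(seq, p_e, rng):
    """Mutate every character independently with probability p_e (Jukes-Cantor)."""
    out = []
    for ch in seq:
        if rng.random() < p_e:
            out.append(rng.choice([b for b in BASES if b != ch]))
        else:
            out.append(ch)
    return "".join(out)

def generate_leaves(children, root, root_seq, rng):
    """Propagate root_seq down the tree and return only the leaf sequences.

    children maps an internal node to a list of (child, edge_probability).
    """
    seqs = {root: root_seq}
    stack = [root]
    while stack:
        node = stack.pop()
        for child, p_e in children.get(node, []):
            seqs[child] = mutate_sequence(seqs[node], p_e, rng)
            stack.append(child)
    # Leaves are the nodes that have no children; the algorithm sees only these.
    return {n: s for n, s in seqs.items() if n not in children}

# Hypothetical 5-leaf tree; the shape is an assumption, not the slide's tree.
children = {
    "R": [("U", 0.2), ("V", 0.6)],
    "U": [("S1", 0.7), ("S2", 0.3)],
    "V": [("S3", 0.2), ("W", 0.5)],
    "W": [("S4", 0.7), ("S5", 0.1)],
}
leaves = generate_leaves(children, "R", "CAGGT", random.Random(1))
```

The returned dictionary plays the role of the input: five leaf sequences, with the internal sequences and the tree itself discarded.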

21

Computational Objectives

Input: DNA sequences $S_1, S_2, \ldots, S_5$

Output: [Figure: an unrooted tree with leaves $S_1, \ldots, S_5$.]

Minimize:

• running time

• memory space

• probability of incorrect output

• sample size, i.e., length of the input sequences

22

Outline of Technical Discussion (3)

1. Define the model of evolution.

2. Formulate the computational problem.

3. Discuss the theoretical performance of our algorithm.

4. Discuss the empirical performance.

5. Describe and analyze the algorithm.

6. Further research.

23

Triplets

• A triplet is a set of three leaves.

• P is the center of XYZ.

[Figure: leaves X, Y, Z joined at the internal node P.]

24

G-depth of Triplet

$d_{XY}$ = # of edges between X and Y

$d_{XYZ} = \max\{d_{XY}, d_{YZ}, d_{ZX}\}$

[Figure: a triplet with leaves X, Y, Z.]

5, 8, 7
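The g-depth of a triplet, the largest of the three pairwise edge counts, is simple to compute on a known tree. The sketch below assumes the tree is given as an adjacency dictionary; the toy tree and the names `edge_count` and `triplet_g_depth` are hypothetical.

```python
from collections import deque

def edge_count(adj, src, dst):
    """Number of edges on the path from src to dst, found by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("nodes are not connected")

def triplet_g_depth(adj, x, y, z):
    """g-depth of triplet XYZ: the maximum of the three pairwise edge counts."""
    return max(edge_count(adj, x, y), edge_count(adj, y, z), edge_count(adj, z, x))

# Toy unrooted tree with internal nodes P, Q, R (an assumption for illustration).
adj = {
    "X": ["P"], "Y": ["P"], "P": ["X", "Y", "Q"],
    "Q": ["P", "W", "R"], "W": ["Q"],
    "R": ["Q", "Z", "V"], "Z": ["R"], "V": ["R"],
}
depth_xyz = triplet_g_depth(adj, "X", "Y", "Z")
```

In this toy tree X and Y are two edges apart, while Z is four edges from each of them, so the triplet's g-depth is driven by the far leaf.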

25

G-depth of a Tree

the smallest d such that the triplets of g-depth at most d cover the entire tree

g-depth = 4

the best case

26

G-depth of a Tree

the smallest d such that the triplets of g-depth at most d cover the entire tree

g-depth = 2 log n

the worst case

27

G-depth of a Tree

the smallest d such that the triplets of g-depth at most d cover the entire tree

• at most 2 log n

• can be O(1)

28

Our New Result (1)

29

Our New Result (2)

polynomial sample size

30

Our New Result (3)

polynomial sample size

provable high accuracy

31

Our New Result (4)

polynomial sample size

provable high accuracy

optimal time & space

32

Comparison with Previous Results

this talk

33

Outline of Technical Discussion (4)

1. Define the model of evolution.

2. Formulate the computational problem.

3. Discuss the theoretical performance of our algorithm.

4. Discuss the empirical performance.

5. Describe and analyze the algorithm.

6. Further research.

34

Experimental Study Design

• Step 1 -- Pick a model tree T.

• Step 2 -- Use T to generate sequences.

• Step 3 -- Use an algorithm to reconstruct a tree T’ from the sequences (without knowing T).

• Step 4 -- Compare T’ and T.

35

Wrong and Right Edges

[Figure: the true tree and the reconstructed tree, each on the leaves X1, X2, X3, X4, X5; edges of the reconstructed tree are marked good or bad.]
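One way to count wrong edges is to compare leaf bipartitions. The sketch below assumes a reconstructed internal edge is "good" when the split of the leaves it induces also occurs in the true tree, and "bad" otherwise; this is a common convention, not something the slide spells out. The two 5-leaf trees and all function names are hypothetical.

```python
def adjacency(edges):
    """Build an undirected adjacency dictionary from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def splits(adj, leaves):
    """Set of nontrivial leaf bipartitions induced by the edges of an unrooted tree."""
    leaves = frozenset(leaves)
    found = set()
    for u in adj:
        for v in adj[u]:
            # Collect the leaves reachable from v without crossing the edge (u, v).
            side, stack, seen = set(), [v], {u, v}
            while stack:
                node = stack.pop()
                if node in leaves:
                    side.add(node)
                for nxt in adj[node]:
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
            side = frozenset(side)
            # Skip trivial splits that separate a single leaf from the rest.
            if 1 < len(side) < len(leaves) - 1:
                found.add(frozenset([side, leaves - side]))
    return found

leaves = ["X1", "X2", "X3", "X4", "X5"]
true_tree = adjacency([("a", "X1"), ("a", "X2"), ("a", "b"), ("b", "X3"),
                       ("b", "c"), ("c", "X4"), ("c", "X5")])
reconstructed = adjacency([("a", "X1"), ("a", "X3"), ("a", "b"), ("b", "X2"),
                           ("b", "c"), ("c", "X4"), ("c", "X5")])

true_splits = splits(true_tree, leaves)
rec_splits = splits(reconstructed, leaves)
bad_edges = rec_splits - true_splits
good_edges = rec_splits & true_splits
```

Dividing `len(bad_edges)` by `len(rec_splits)` gives a percentage of wrong edges of the kind reported in the experiments.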

36

Experiment #1

• the 135-taxon African-Eve tree (courtesy of Huson and Maddison)

• algorithms compared: HGT and bioNJ (Olivier Gascuel)

• parameters: sequence length and percentage of wrong edges

• edge mutation probabilities: between 0.47 and 0.088

• # of simulations = 20 per sequence length

• more experiments in progress

37

135-taxon African Eve Tree

38

Results of Experiment #1

39

Experiment #2

• a 1892-taxon tree of eukaryotes

• algorithms compared: HGT and bioNJ

• parameters: sequence length and percentage of wrong edges

• edge mutation probabilities: between 0.47 and 0.088

• # of simulations = 20 per sequence length

• more experiments in progress

• several variants of the basic HGT

40

Results of Experiment #2

41

Results of Experiment #2

42

Results of Experiment #2

43

Outline of Technical Discussion (5)

1. Define the model of evolution.

2. Formulate the computational problem.

3. Discuss the theoretical performance of our algorithm.

4. Discuss the empirical performance.

5. Describe and analyze the algorithm.

6. Further research.

44

Our New Result (4)

polynomial sample size

provable high accuracy

optimal time & space

45

Outline of Technical Discussion (5)

1. Describe the HGT algorithm.

2. Prove the sample size bound (and high probability for accuracy).

3. Prove the optimal time & space.

46

Outline of Technical Discussion (5/1)

1. Describe the HGT algorithm.

2. Prove the sample size bound (and high probability for accuracy).

3. Prove the optimal time & space.

47

Closeness and Distance of Two Leaves

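[Figure: a tree with the sequences AAGT, AGTT (leaf X), CAGG, GGTG (leaf Y), GTTG; edge mutation probabilities 0.2, 0.7, 0.65, and $P_e = 0.6$.]

Closeness: $\sigma_{XY} = \Pr\{X = Y\} - \frac{1}{3}\Pr\{X \ne Y\} = \frac{2}{4} - \frac{1}{3} \cdot \frac{2}{4} = \frac{1}{3}$

Distance: $D_{XY} = -\ln \sigma_{XY} = \ln 3$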

The larger the closeness,the more accurately we can estimate the distance.

Closeness is multiplicative.Distance is additive!!!
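A short numeric check of both statements, assuming the closeness of two nodes is Pr{X = Y} - (1/3) Pr{X ≠ Y} and the distance is the negative logarithm of the closeness. These definitions are a reading of the slide's garbled equations and should be treated as assumptions; the function names are hypothetical.

```python
import math

def closeness(seq_x, seq_y):
    """Estimate closeness from two aligned sequences: 1 - (4/3) * mismatch rate."""
    if len(seq_x) != len(seq_y) or not seq_x:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatch = sum(a != b for a, b in zip(seq_x, seq_y)) / len(seq_x)
    return 1.0 - 4.0 * mismatch / 3.0

def distance(sigma):
    """Distance as the negative log of closeness (assumed definition)."""
    return -math.log(sigma)

def edge_closeness(p_e):
    """Exact closeness across one Jukes-Cantor edge."""
    return 1.0 - 4.0 * p_e / 3.0

def two_edge_mismatch(p1, p2):
    """Exact Pr{endpoints differ} across two consecutive Jukes-Cantor edges."""
    same = (1 - p1) * (1 - p2) + p1 * (p2 / 3)  # unchanged twice, or mutated and restored
    return 1.0 - same

sigma_xy = closeness("AGTT", "GGTG")  # two of four positions differ
path_sigma = 1.0 - 4.0 * two_edge_mismatch(0.2, 0.6) / 3.0
product = edge_closeness(0.2) * edge_closeness(0.6)
```

Here `path_sigma` and `product` agree, which is the multiplicative property; taking `distance` of each side turns the product into a sum, which is the additive property.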

48

Closeness = Cubic Root of Determinant

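[Figure: an edge e with $P_e = 0.6$ between the sequences AAGT and CAGG.]

Mutation matrix of edge e, with rows and columns indexed by A, C, G, T:

$$M_e = \begin{pmatrix}
1 - P_e & P_e/3 & P_e/3 & P_e/3 \\
P_e/3 & 1 - P_e & P_e/3 & P_e/3 \\
P_e/3 & P_e/3 & 1 - P_e & P_e/3 \\
P_e/3 & P_e/3 & P_e/3 & 1 - P_e
\end{pmatrix}$$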
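The slide title can be verified numerically. The sketch below assumes the edge matrix has 1 - P_e on the diagonal and P_e/3 elsewhere, computes its determinant with a small hand-written elimination routine (to stay within the standard library), and compares the cubic root with 1 - (4/3) P_e. The function names are hypothetical.

```python
import math

def jc_matrix(p_e):
    """4x4 Jukes-Cantor mutation matrix: 1 - p_e on the diagonal, p_e / 3 elsewhere."""
    return [[1 - p_e if i == j else p_e / 3 for j in range(4)] for i in range(4)]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

p_e = 0.6
# Valid for p_e < 3/4, where the determinant is positive.
sigma_e = determinant(jc_matrix(p_e)) ** (1 / 3)
```

The agreement follows from the eigenvalues of the matrix: 1 once and 1 - (4/3) P_e three times, so the determinant is the cube of the closeness.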

49

Closeness of Triplet

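[Figure: a tree with the sequences AAGT, AGTT (leaf X), CAGG, GGTG (leaf Y), GTTG (leaf Z); edge mutation probabilities 0.2, 0.7, 0.65, and $P_e = 0.6$.]

$\frac{1}{\sigma_{XYZ}} = \frac{1}{\sigma_{XY}} + \frac{1}{\sigma_{YZ}} + \frac{1}{\sigma_{ZX}}$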

The larger the closeness, the more accurately we can estimate the three pairwise distances.
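A small sketch of a harmonic-style triplet closeness, assuming it is the reciprocal of the sum of the three reciprocal pairwise closenesses. That exact normalization is an assumption; multiplying by 3 would give the ordinary harmonic mean and would not change which triplet ranks highest.

```python
import math

def triplet_closeness(s_xy, s_yz, s_zx):
    """Reciprocal of the sum of reciprocal pairwise closenesses (assumed form)."""
    return 1.0 / (1.0 / s_xy + 1.0 / s_yz + 1.0 / s_zx)

balanced = triplet_closeness(0.3, 0.3, 0.3)
one_weak_pair = triplet_closeness(0.9, 0.9, 0.01)
```

The second value is below 0.01 even though two of its pairs are very close: a harmonic combination is dominated by the weakest pair, so a triplet scores well only when all three pairwise distances can be estimated reliably.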

50

Assemble Triplets Into Treevia Distance Additivity (I)

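[Figure: leaves X, A, Y joined at the center P by edges of unknown lengths a (to X), b (to A), c (to Y).]

$D_{XY} = a + c = 31$

$D_{XA} = a + b = 28$

$D_{YA} = b + c = 9$

[Figure: the solved triplet, with a = 25, b = 3, c = 6.]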
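The three additivity equations have a closed-form solution, sketched below with the slide's numbers. The function name `triplet_edges` is hypothetical.

```python
def triplet_edges(d_xy, d_xa, d_ya):
    """Solve a + c = d_xy, a + b = d_xa, b + c = d_ya for the edges to the center P."""
    a = (d_xy + d_xa - d_ya) / 2  # X to P
    b = (d_xa + d_ya - d_xy) / 2  # A to P
    c = (d_xy + d_ya - d_xa) / 2  # Y to P
    return a, b, c

a, b, c = triplet_edges(31, 28, 9)
```

Each edge length is half of "the two distances that use the edge, minus the one that does not".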

51

Assemble Triplets Into Treevia Distance Additivity (II)

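[Figure: leaf B is added to the triplet XAY. Known distances: $D_{XY} = 31$, $D_{XA} = 28$, $D_{YA} = 9$ and $D_{XY} = 31$, $D_{XB} = 17$, $D_{YB} = 18$.]

[Figure: triplet XAY has center P with edge lengths 25 (to X), 3 (to A), 6 (to Y); triplet XBY has center Q with edge lengths 15 (to X), 2 (to B), 16 (to Y).]

[Figure: the combined tree. Along the path from X to Y: X to Q is 15, Q to P is 10, P to Y is 6; B hangs off Q at length 2 and A hangs off P at length 3.]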
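The merge of the two triplets can be sketched by placing each center along the shared path from X to Y. The helper name `center_offset` is hypothetical; the numbers are the slide's.

```python
def center_offset(d_xy, d_xz, d_yz):
    """For triplet X, Y, Z: distance from X to the center, and pendant length to Z."""
    from_x = (d_xy + d_xz - d_yz) / 2
    pendant = (d_xz + d_yz - d_xy) / 2
    return from_x, pendant

d_xy = 31
p_from_x, a_len = center_offset(d_xy, 28, 9)   # center P of triplet X, Y, A
q_from_x, b_len = center_offset(d_xy, 17, 18)  # center Q of triplet X, Y, B

# Both centers lie on the X-Y path, so their offsets from X order them.
first, second = sorted([("P", p_from_x), ("Q", q_from_x)], key=lambda t: t[1])
segments = [first[1], second[1] - first[1], d_xy - second[1]]
```

Because the offsets differ, P and Q are distinct internal nodes; equal offsets (within estimation error) would instead mean that A and B attach at the same point of the path.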

52

How to Choose Triplets to Minimize Errors?

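[Figure: triplet X, Z, Y with center P and edge lengths 25 (to X), 3 (to Z), 6 (to Y).]

$D_{XY} = 31$, $D_{XZ} = 28$, $D_{YZ} = 9$

$\frac{1}{\sigma_{XYZ}} = \frac{1}{\sigma_{XY}} + \frac{1}{\sigma_{YZ}} + \frac{1}{\sigma_{ZX}}$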

The larger the closeness, the more accurately we can estimate the three pairwise distances.

Greedy Strategy!

Harmonic Greedy Triplet (HGT)

53

Over-Simplified Outline of HGT

• Stage 1: T’ ← the triplet ABC with the largest closeness.

• Stage 2: Repeat the following steps until T’ contains all the leaves.

Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not.

Step 2(2): Incorporate XYZ into T’ to add Z into T’.
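The greedy selection in the outline can be sketched as below. This is deliberately even simpler than the outline: it only records which triplets are picked and in what order, and omits the step that inserts Z into the tree topology. It also scans all candidate triplets naively, so it does not reflect the optimal running time claimed in the talk. The triplet closeness formula, the toy distances, and all names are assumptions for illustration.

```python
import math
from itertools import combinations

def hgt_selection_order(sigma, leaves):
    """Return the triplets picked greedily by closeness; tree insertion is omitted."""
    def pair(a, b):
        return sigma[frozenset((a, b))]

    def triplet_closeness(x, y, z):
        # Assumed harmonic-style combination of the three pairwise closenesses.
        return 1.0 / (1.0 / pair(x, y) + 1.0 / pair(y, z) + 1.0 / pair(z, x))

    # Stage 1: start from the triplet with the largest closeness.
    first = max(combinations(leaves, 3), key=lambda t: triplet_closeness(*t))
    in_tree = set(first)
    picked = [first]
    # Stage 2: add one leaf Z at a time via the best triplet with X, Y already placed.
    while len(in_tree) < len(leaves):
        candidates = [(x, y, z)
                      for x, y in combinations(sorted(in_tree), 2)
                      for z in leaves if z not in in_tree]
        best = max(candidates, key=lambda t: triplet_closeness(*t))
        picked.append(best)
        in_tree.add(best[2])
    return picked

# Toy additive distances on five leaves (hypothetical), converted to closeness.
toy_distances = {
    ("A", "B"): 0.30, ("A", "C"): 0.55, ("A", "D"): 1.15, ("A", "E"): 1.30,
    ("B", "C"): 0.65, ("B", "D"): 1.25, ("B", "E"): 1.40,
    ("C", "D"): 0.90, ("C", "E"): 1.05, ("D", "E"): 0.65,
}
sigma = {frozenset(k): math.exp(-d) for k, d in toy_distances.items()}
picked = hgt_selection_order(sigma, ["A", "B", "C", "D", "E"])
```

The list `picked` starts with the closest triplet and then contains one triplet per added leaf, so it has n - 2 entries for n leaves.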

54

Outline of Technical Discussion (5/2)

1. Describe the HGT algorithm.

2. Prove the sample size bound (and high probability for accuracy).

3. Prove the optimal time & space.

55

Our New Result (4/1)

polynomial sample size

provable high accuracy

56

Over-Simplified Outline of HGT

• Stage 1: T’ ← the triplet ABC with the largest closeness.

• Stage 2: Repeat the following steps until T’ contains all the leaves.

Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not.

Step 2(2): Incorporate XYZ into T’ to add Z into T’.

57

Polynomial Sequence Length (1)

$\sigma_{XYZ} \sim \left(1 - \frac{4}{3} g\right)^{d_{XYZ}}$: a larger $\sigma_{XYZ}$ goes with a smaller $d_{XYZ}$.

Lemma 1: The g-depth of the last triplet used in HGT is the g-depth of the true tree T.

Proof:• The largest closeness such that the triplets with same or larger closeness cover the true tree T.

• The smallest g-depth such that the triplets with same or smaller g-depths cover the true tree T.

58

Polynomial Sequence Length (2)

$\sigma_{XYZ} \sim \left(1 - \frac{4}{3} g\right)^{d_{XYZ}}$, where $d_{XYZ}$ = g-depth of tree

Lemma 1: The g-depth of the last triplet used in HGT is the g-depth of the true tree T.

Lemma 2:

sequence length needed $\sim \sigma_{XYZ}^{-2}$, where XYZ is the last triplet used.

59

Outline of Technical Discussion (5/3)

1. Describe the HGT algorithm.

2. Prove the sample size bound (and high probability for accuracy).

3. Prove the optimal time & space.

60

Our New Result (4/2)

optimal time & space

61

Over-Simplified Outline of HGT

• Stage 1: T’ ← the triplet ABC with the largest closeness.

• Stage 2: Repeat the following steps until T’ contains all the leaves.

Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not.

Step 2(2): Incorporate XYZ into T’ to add Z into T’.

62

Optimal Time/Space for the First Triplet

• Stage 1:

Fix an arbitrary leaf A.

T’ ← the triplet ABC with the largest closeness.

• Stage 2:

Repeat the following steps until T’ contains all the leaves.

Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not.

Step 2(2): Incorporate XYZ into T’.

63

Optimal Time/Space for the Other Leaves

[Figure: the partially reconstructed tree and the part not yet recovered, showing the triplets XYZ and ABC and the internal nodes P and Q.]

only need to consider the triplets formed by one of X, Y, one of B, C, and one of

64

Outline of Technical Discussion (6)

1. Define the model of evolution.

2. Formulate the computational problem.

3. Discuss the theoretical performance of our algorithm.

4. Discuss the empirical performance.

5. Describe and analyze the algorithm.

6. Further research.

65

Further Research

• more general models of evolution

• practical implementations

66

Main Difficulties

• Availability of data

Hundreds of millions of species --- unlikely to be all available any time soon or ever.

But DNA sequences of more and more species are becoming available.

• Extracting information from data

focus of this talk

67

Beyond All Computational Considerations

Do the genomes of all green plants contain enough information for the reconstruction of their evolutionary tree?

• genome size of eukaryotes: $10^{6} \sim 10^{11}$ base pairs

• # of green plant species: several $10^{8}$

If so, does this impose any necessary structure on the information or the tree? If so, how do we determine and use that structure?

What do you think?

The End.

Thank You!

68

Data Mining Flowchart

[Flowchart with the boxes: true tree (unknown); collect & process individual sequences; compare & align multiple sequences; tree reconstruction algorithms; tree verification (compare & refine); evolution models. Arrow labels: generate sequences, further process, parameters, distance or characters, trees, information, refine, infer. One part of the chart is marked "today’s focus".]