1
Fast and Accurate Reconstruction of Evolutionary Trees: a Model-based Study
Ming-Yang Kao
Department of Computer Science, Northwestern University
Evanston, Illinois
U. S. A.
2
Perspectives
Use biology ideas to solve computer science problems (biology → computer science)
Use computer science tools to solve biology problems (computer science → biology; this talk)
3
Use Biology to Solve CS Problems
• DNA Computing
• DNA Self-Assembly
• Genetic Algorithms
• Neural Network
• Others
4
Use CS to Solve Biology Problems
• Bioinformatics or Computational Biology
  data mining (this talk)
• Related fields: computational neuroscience, computational ecology, medical informatics, … many more ...
5
Example Research Areas of Bioinformatics
• DNA sequencing
• DNA microarray analysis
• DNA self-assembly for nano-structures
• DNA word design
• RNA secondary structure prediction
• Protein sequencing (my talk #4)
• Proteomics
• Protein database search
• Protein sequence design (my talk #3)
• Protein landscape analysis
• Phylogeny reconstruction (this talk)
• Phylogeny comparison (my talk #1)
6
Evolutionary Trees
definition: a tree with distinct labels at leaves
leaf labels: species, organisms, DNAs, RNAs, proteins, features, etc.
[Figure: a tree with an ancestral species at the root and the present-day species bird, plum, peach, rice, and wheat at the leaves. (Just a joke!)]
7
Evolutionary Trees
leaf labels: DNA sequences
[Figure: the same tree with DNA sequences at the leaves: bird AAGT, plum CCAG, peach CCAT, rice CGGG, wheat CGGC. (Just a joke!)]
8
Problem Formulation
[Figure: leaves bird AAGT, plum CCAG, peach CCAT, rice CGGG, wheat CGGC.]
Input: DNA sequences of present-day species
Output: the true evolutionary tree
Question: What is “true”? Need a model!
(Just a joke!)
9
A Fundamental Problem of Biology
Since the time of Charles Darwin,
Problem: reconstruct the evolutionary history of all known species.
Importance:
• intellectually fascinating
• practical benefits – medicine, food …
• Charles Robert Darwin --- 1809-1882
• Origin of Species --- 1859
10
Main Difficulties
• Availability of data
Hundreds of millions of species --- unlikely to be all available any time soon or ever.
But DNA sequences of more and more species are becoming available.
• Extracting information from data
focus of this talk
11
Today’s Technical Focus
[Figure: leaves bird AAGT, plum CCAG, peach CCAT, rice CGGG, wheat CGGC.]
Input: DNA sequences of present-day species
Output: the true evolutionary tree
Question: What is “true”? Need a model!
Collaborators: Csuros & Kim
12
Main Result
An algorithm that constructs an evolutionary tree from biomolecular sequences
• Provable high accuracy
• Short sequence length
• Optimal running time
• Optimal memory space
13
Outline of Technical Discussion
1. Define the model of evolution.
2. Formulate the computational problem.
3. Discuss the theoretical performance of our algorithm.
4. Discuss the empirical performance.
5. Describe and analyze the algorithm.
6. Further research.
14
Outline of Technical Discussion (1)
1. Define the model of evolution.
2. Formulate the computational problem.
3. Discuss the theoretical performance of our algorithm.
4. Discuss the empirical performance.
5. Describe and analyze the algorithm.
6. Further research.
15
Model of Evolution
Intuitions
ACGTACT
AGGAGAA
CAGGAGTTTTAA
Mutation occurs probabilistically.
1. edge length ~ time
2. edge length ~ mutation probability
3. edge length ~ dissimilarity (or distance)
AGTTCCT
16
Jukes-Cantor Model of Evolution (1)
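Edge Mutation Probability
[Figure: an edge e with mutation probability $p_e = 0.6$, carrying the character A at one end and X at the other.]
$0 < f \le p_e \le g < 3/4$
• No insertion or deletion.
• X = A with probability 1 - 0.6 = 0.4
• X = C, G, or T with probability 0.6/3 = 0.2 each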
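The edge rule above can be illustrated with a short simulation. This is only a sketch: the names `mutate_char` and `mutate_sequence` and the seeded generator are illustrative assumptions, not part of the talk.

```python
import random

ALPHABET = "ACGT"

def mutate_char(ch, p_e, rng):
    """Child character for parent character `ch` on an edge with mutation probability p_e."""
    if rng.random() < p_e:
        # A mutation picks one of the three other characters uniformly (p_e / 3 each).
        return rng.choice([c for c in ALPHABET if c != ch])
    return ch

def mutate_sequence(seq, p_e, rng):
    """Mutate every position independently; the length never changes (no insertion or deletion)."""
    return "".join(mutate_char(ch, p_e, rng) for ch in seq)

rng = random.Random(0)  # fixed seed so the run is repeatable
child = mutate_sequence("A" * 10000, 0.6, rng)
frac_unchanged = child.count("A") / len(child)
print(frac_unchanged)  # should be close to 1 - 0.6 = 0.4
```

With a long all-A parent sequence, the fraction of unchanged positions should approach 0.4 and each of C, G, T should take roughly 0.2 of the positions.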
17
Jukes-Cantor Model of Evolution (2)
Independent Mutations along All Edges
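[Figure: a tree whose nodes carry the characters A, A, C, G, G and whose edges have mutation probabilities 0.2, 0.7, 0.65, and 0.6.]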
18
Jukes-Cantor Model of Evolution (3)
i.i.d. mutations at every character
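[Figure: the same tree with the sequences AAGT, AGTT, CAGG, GGTG, GTTG at its nodes and edge mutation probabilities 0.2, 0.7, 0.65, and 0.6.]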
19
Outline of Technical Discussion (2)
1. Define the model of evolution.
2. Formulate the computational problem.
3. Discuss the theoretical performance of our algorithm.
4. Discuss the empirical performance.
5. Describe and analyze the algorithm.
6. Further research.
20
Problem Formulation
True Tree (not known to algorithm)
[Figure: a tree with the sequences AGTGT, GGTAC, CGTTT, CAGGT, GTACT, TGGAC, CAGGT, CGTGT, ATCGT at its nodes, edge mutation probabilities 0.2, 0.6, 0.7, 0.3, 0.2, 0.5, 0.7, 0.1, and leaves $S_1, \ldots, S_5$.]
• Pick any sequence for the root (also unknown to algorithm).
• Generate the other sequences.
Input: $S_1, S_2, \ldots, S_5$, but not the other sequences, nor the tree.
Output: the unrooted tree with leaves $S_1, \ldots, S_5$.
21
Computational Objectives
Input: DNA sequences $S_1, S_2, \ldots, S_5$
Output: the unrooted tree with leaves $S_1, \ldots, S_5$
Minimize:
• running time
• memory space
• probability of incorrect output
• sample size, i.e., length of the input sequences
22
Outline of Technical Discussion (3)
1. Define the model of evolution.
2. Formulate the computational problem.
3. Discuss the theoretical performance of our algorithm.
4. Discuss the empirical performance.
5. Describe and analyze the algorithm.
6. Further research.
23
Triplets
• A triplet is one formed by three leaves.
• P is the center of XYZ.
[Figure: leaves X, Y, Z whose connecting paths meet at the center P.]
24
G-depth of Triplet
$d_{XY}$ = # of edges between X and Y
$d_{XYZ} = \max\{d_{XY}, d_{YZ}, d_{ZX}\}$
[Figure: a triplet XYZ whose three pairwise edge counts are 5, 8, and 7.]
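A minimal sketch of the definition, assuming the tree is given as an adjacency dictionary; the toy tree and the function names are hypothetical. For the pairwise counts 5, 8, 7 shown on the slide, the same rule gives max(5, 8, 7) = 8.

```python
from collections import deque

def edge_counts_from(tree, source):
    """Breadth-first search: number of edges from `source` to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in tree[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def triplet_g_depth(tree, x, y, z):
    """g-depth of triplet xyz: the largest of the three pairwise edge counts."""
    dx = edge_counts_from(tree, x)
    dy = edge_counts_from(tree, y)
    return max(dx[y], dy[z], dx[z])

# Hypothetical toy tree: leaves X, Y hang off internal node u; leaves Z, W hang off v.
tree = {
    "X": ["u"], "Y": ["u"], "Z": ["v"], "W": ["v"],
    "u": ["X", "Y", "v"], "v": ["Z", "W", "u"],
}
depth_xyz = triplet_g_depth(tree, "X", "Y", "Z")
print(depth_xyz)
```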
25
G-depth of a Tree
the smallest d such that the triplets of g-depth at most d cover the entire tree
g-depth = 4
the best case
26
G-depth of a Tree
the smallest d such that the triplets of g-depth at most d cover the entire tree
g-depth = 2 log n
the worst case
27
G-depth of a Tree
the smallest d such that the triplets of g-depth at most d cover the entire tree
• at most 2 log n
• can be O(1)
28
Our New Result (1)
29
Our New Result (2)
polynomial sample size
30
Our New Result (3)
polynomial sample size
provable high accuracy
31
Our New Result (4)
polynomial sample size
provable high accuracy
optimal time & space
32
Comparison with Previous Results
this talk
33
Outline of Technical Discussion (4)
1. Define the model of evolution.
2. Formulate the computational problem.
3. Discuss the theoretical performance of our algorithm.
4. Discuss the empirical performance.
5. Describe and analyze the algorithm.
6. Further research.
34
Experimental Study Design
• Step 1 -- Pick a model tree T.
• Step 2 -- Use T to generate sequences.
• Step 3 -- Use an algorithm to reconstruct a tree T’ from the sequences (without knowing T).
• Step 4 -- Compare T’ and T.
35
Wrong and Right Edges
[Figure: a true tree and a reconstructed tree on the leaves X1, ..., X5; the edges of the reconstructed tree are marked good or bad.]
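One common way to decide whether a reconstructed edge is right is to compare the leaf bipartitions that internal edges induce. The slide does not state its exact criterion, so the sketch below is an assumption, and both toy trees on X1, ..., X5 are hypothetical rather than the trees in the figure.

```python
def leaf_splits(tree, leaves):
    """Set of leaf bipartitions induced by the internal edges of an unrooted tree."""
    leaves = frozenset(leaves)
    splits = set()

    def side(start, blocked):
        # Leaves reachable from `start` without crossing the edge back to `blocked`.
        seen, stack, found = {blocked, start}, [start], set()
        while stack:
            u = stack.pop()
            if u in leaves:
                found.add(u)
            for v in tree[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return frozenset(found)

    for u in tree:
        for v in tree[u]:
            if u in leaves or v in leaves:
                continue  # leaf edges give trivial splits present in every tree
            part = side(u, v)
            splits.add(frozenset([part, leaves - part]))
    return splits

def wrong_edge_fraction(true_tree, rec_tree, leaves):
    """Fraction of internal edges of the reconstructed tree whose split is absent from the true tree."""
    true_splits = leaf_splits(true_tree, leaves)
    rec_splits = leaf_splits(rec_tree, leaves)
    return len(rec_splits - true_splits) / len(rec_splits)

leaves = ["X1", "X2", "X3", "X4", "X5"]
true_tree = {
    "X1": ["a"], "X2": ["a"], "X3": ["b"], "X4": ["c"], "X5": ["c"],
    "a": ["X1", "X2", "b"], "b": ["a", "X3", "c"], "c": ["b", "X4", "X5"],
}
rec_tree = {
    "X1": ["a"], "X3": ["a"], "X2": ["b"], "X4": ["c"], "X5": ["c"],
    "a": ["X1", "X3", "b"], "b": ["a", "X2", "c"], "c": ["b", "X4", "X5"],
}
fraction = wrong_edge_fraction(true_tree, rec_tree, leaves)
print(fraction)
```

In this toy pair the edge separating {X4, X5} is shared, while the edge grouping X1 with X3 is not, so half of the internal edges count as wrong.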
36
Experiment #1
• the 135-taxon African-Eve tree (courtesy of Huson and Maddison)
• algorithms compared: HGT and bioNJ (Olivier Gascuel)
• parameters: sequence length and percentage of wrong edges
• edge mutation probabilities: between 0.47 and 0.088
• # of simulations = 20 per sequence length
• more experiments in progress
37
135-taxon African Eve Tree
38
Results of Experiment #1
39
Experiment #2
• an 1892-taxon tree of eukaryotes
• algorithms compared: HGT and bioNJ
• parameters: sequence length and percentage of wrong edges
• edge mutation probabilities: between 0.47 and 0.088
• # of simulations = 20 per sequence length
• more experiments in progress
• several variants of the basic HGT
40
Results of Experiment #2
41
Results of Experiment #2
42
Results of Experiment #2
43
Outline of Technical Discussion (5)
1. Define the model of evolution.
2. Formulate the computational problem.
3. Discuss the theoretical performance of our algorithm.
4. Discuss the empirical performance.
5. Describe and analyze the algorithm.
6. Further research.
44
Our New Result (4)
polynomial sample size
provable high accuracy
optimal time & space
45
Outline of Technical Discussion (5)
1. Describe the HGT algorithm.
2. Prove the sample size bound (and high probability for accuracy).
3. Prove the optimal time & space.
46
Outline of Technical Discussion (5/1)
1. Describe the HGT algorithm.
2. Prove the sample size bound (and high probability for accuracy).
3. Prove the optimal time & space.
47
Closeness and Distance of Two Leaves
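[Figure: an edge e with mutation probability $p_e = 0.6$; a tree with the sequences AAGT, AGTT, CAGG, GGTG, GTTG at its nodes, edge mutation probabilities 0.2, 0.7, 0.65, and two leaves marked X and Y.]
Closeness: $\sigma_{XY} = \Pr\{X = Y\} - \frac{1}{3}\Pr\{X \ne Y\}$
Distance: $D_{XY} = -\ln \sigma_{XY}$
The larger the closeness, the more accurately we can estimate the distance.
Closeness is multiplicative. Distance is additive!!!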
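A small sketch of how closeness and distance could be estimated from two aligned sequences, assuming the Jukes-Cantor form in which an edge with mutation probability p has closeness 1 - (4/3)p and distance is minus the logarithm of closeness. The function names are illustrative, and the two short sequences are used only as sample input.

```python
import math

def estimated_closeness(seq_x, seq_y):
    """Estimate closeness as 1 - (4/3) * (fraction of mismatched positions)."""
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must have equal length (no insertion or deletion)")
    mismatches = sum(a != b for a, b in zip(seq_x, seq_y))
    return 1.0 - (4.0 / 3.0) * mismatches / len(seq_x)

def estimated_distance(seq_x, seq_y):
    """Distance is -ln(closeness); infinite when the estimate is not positive."""
    sigma = estimated_closeness(seq_x, seq_y)
    return math.inf if sigma <= 0.0 else -math.log(sigma)

def edge_closeness(p_e):
    """Exact closeness of a single edge with mutation probability p_e (assumed form)."""
    return 1.0 - (4.0 / 3.0) * p_e

# Two consecutive edges with probabilities 0.2 and 0.6:
sigma_path = edge_closeness(0.2) * edge_closeness(0.6)       # closeness multiplies
distance_path = -math.log(edge_closeness(0.2)) - math.log(edge_closeness(0.6))  # distance adds

sigma_hat = estimated_closeness("AAGT", "CAGG")  # 2 mismatches out of 4
print(sigma_hat, estimated_distance("AAGT", "CAGG"), sigma_path, distance_path)
```

Taking logarithms is what turns the product of edge closenesses along a path into a sum of edge distances.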
48
Closeness = Cubic Root of Determinant
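[Figure: an edge e with mutation probability $p_e = 0.6$ between the sequences AAGT and CAGG.]
Mutation matrix of edge e, with rows and columns indexed by A, C, G, T:
$$M_e = \begin{pmatrix} 1 - p_e & p_e/3 & p_e/3 & p_e/3 \\ p_e/3 & 1 - p_e & p_e/3 & p_e/3 \\ p_e/3 & p_e/3 & 1 - p_e & p_e/3 \\ p_e/3 & p_e/3 & p_e/3 & 1 - p_e \end{pmatrix}$$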
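The title's claim can be checked with a short derivation, assuming the edge mutation matrix has $1 - p_e$ on the diagonal and $p_e/3$ everywhere else, as the Jukes-Cantor edge rule implies. The symbols $I_4$ and $J_4$ are notation introduced here for the sketch.

```latex
\begin{align*}
M_e &= \Bigl(1 - \tfrac{4}{3} p_e\Bigr) I_4 + \tfrac{p_e}{3} J_4,
  \qquad J_4 = \text{the } 4 \times 4 \text{ all-ones matrix} \\
\text{eigenvalues of } J_4 &: \ 4, 0, 0, 0
  \;\Longrightarrow\;
  \text{eigenvalues of } M_e : \ 1, \ 1 - \tfrac{4}{3} p_e \ \text{(three times)} \\
\det M_e &= \Bigl(1 - \tfrac{4}{3} p_e\Bigr)^{3},
  \qquad \sqrt[3]{\det M_e} = 1 - \tfrac{4}{3} p_e \\
p_e = 0.6 &: \quad \det M_e = 0.2^{3} = 0.008, \qquad \sqrt[3]{\det M_e} = 0.2
\end{align*}
```

Because determinants multiply under matrix products, this form makes the multiplicativity of closeness along a path immediate.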
49
Closeness of Triplet
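[Figure: an edge e with mutation probability $p_e = 0.6$; the tree with sequences AAGT, AGTT, CAGG, GGTG, GTTG, edge mutation probabilities 0.2, 0.7, 0.65, and three leaves marked X, Y, Z.]
Closeness of triplet XYZ, where $\sigma_{XY}$ is the closeness of leaves X and Y:
$$\frac{1}{\sigma_{XYZ}} = \frac{1}{\sigma_{XY}} + \frac{1}{\sigma_{YZ}} + \frac{1}{\sigma_{ZX}}$$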
The larger the closeness, the more accurately we can estimate the three pairwise distances.
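A minimal sketch of a harmonic-style triplet closeness, assuming the three pairwise closenesses are combined through the sum of their reciprocals. Whether the slide's formula carries an extra constant factor is not recoverable here; a constant would not change which triplet ranks highest. The function name and sample values are illustrative.

```python
def triplet_closeness(s_xy, s_yz, s_zx):
    """Combine three pairwise closenesses via the reciprocal of the sum of reciprocals."""
    if min(s_xy, s_yz, s_zx) <= 0.0:
        return 0.0  # a non-positive pairwise estimate makes the triplet unusable
    return 1.0 / (1.0 / s_xy + 1.0 / s_yz + 1.0 / s_zx)

balanced = triplet_closeness(0.5, 0.5, 0.5)
lopsided = triplet_closeness(0.9, 0.9, 0.05)
print(balanced, lopsided)
```

The combined value is dominated by the weakest pair, so a triplet scores well only when all three pairwise distances can be estimated accurately.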
50
Assemble Triplets Into Treevia Distance Additivity (I)
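[Figure: leaves X, A, Y joined at the center P, with edge lengths a (X to P), b (A to P), and c (Y to P).]
$D_{XY} = 31$, $D_{XA} = 28$, $D_{YA} = 9$
$a + c = D_{XY} = 31$
$a + b = D_{XA} = 28$
$b + c = D_{YA} = 9$
Solution: a = 25, b = 3, c = 6.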
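The three edge lengths follow from the three pairwise distances by additivity. A small sketch, with an illustrative function name, using the distances 31, 28, and 9 from the slide:

```python
def triplet_edge_lengths(d_xy, d_xa, d_ya):
    """Solve a + c = d_xy, a + b = d_xa, b + c = d_ya for the edges to the center P."""
    a = (d_xy + d_xa - d_ya) / 2  # X to P
    b = (d_xa + d_ya - d_xy) / 2  # A to P
    c = (d_xy + d_ya - d_xa) / 2  # Y to P
    return a, b, c

a, b, c = triplet_edge_lengths(31, 28, 9)
print(a, b, c)  # 25.0 3.0 6.0
```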
51
Assemble Triplets Into Treevia Distance Additivity (II)
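[Figure: triplet XYA with center P, from $D_{XY} = 31$, $D_{XA} = 28$, $D_{YA} = 9$: X to P = 25, A to P = 3, Y to P = 6.]
[Figure: triplet XYB with center Q, from $D_{XY} = 31$, $D_{XB} = 17$, $D_{YB} = 18$: X to Q = 15, B to Q = 2, Y to Q = 16.]
[Figure: the merged tree. The path from X to Y passes through Q and then P with edge lengths 15, 10, and 6; B hangs off Q by an edge of length 2 and A hangs off P by an edge of length 3.]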
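The merge can be sketched by locating each triplet center along the shared X to Y path. This handles only the case shown on the slide, where both new leaves attach at different points of that path; the helper names are assumptions.

```python
def center_offset(d_xy, d_xz, d_yz):
    """Distance from X to the center of triplet XYZ, measured along the X-Y path."""
    return (d_xy + d_xz - d_yz) / 2

def pendant_length(d_xy, d_xz, d_yz):
    """Length of the edge from the center of triplet XYZ to the third leaf Z."""
    return (d_xz + d_yz - d_xy) / 2

d_xy = 31
centers = sorted([
    (center_offset(d_xy, 28, 9), "P", pendant_length(d_xy, 28, 9)),    # triplet XYA
    (center_offset(d_xy, 17, 18), "Q", pendant_length(d_xy, 17, 18)),  # triplet XYB
])
# Consecutive gaps along the X-Y path: X to first center, between centers, last center to Y.
segments = [centers[0][0], centers[1][0] - centers[0][0], d_xy - centers[1][0]]
print(centers, segments)
```

Sorting the centers by their offset from X shows that Q comes before P, which yields the path segments 15, 10, and 6.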
52
How to Choose Triplets to Minimize Errors?
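[Figure: triplet XZY with center P, from $D_{XY} = 31$, $D_{XZ} = 28$, $D_{YZ} = 9$: X to P = 25, Z to P = 3, Y to P = 6.]
Closeness of triplet XYZ, where $\sigma_{XY}$ is the closeness of leaves X and Y:
$$\frac{1}{\sigma_{XYZ}} = \frac{1}{\sigma_{XY}} + \frac{1}{\sigma_{YZ}} + \frac{1}{\sigma_{ZX}}$$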
The larger the closeness, the more accurately we can estimate the three pairwise distances.
Greedy Strategy!
Harmonic Greedy Triplet (HGT)
53
Over-Simplified Outline of HGT
• Stage 1: T’ ← the triplet ABC with the largest closeness.
• Stage 2: Repeat the following steps until T’ contains all the leaves.
Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not.
Step 2(2): Incorporate XYZ into T’ to add Z into T’.
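The selection order in this outline can be sketched naively as follows. This is not the talk's optimal-time implementation: it scans all candidate triplets, assumes all pairwise closenesses are positive, and omits Step 2(2), the incorporation of each triplet into T’. The function names and the closeness table `s` are hypothetical.

```python
from itertools import combinations

def triplet_closeness(s, x, y, z):
    """Reciprocal of the sum of reciprocal pairwise closenesses (assumed form)."""
    return 1.0 / (1.0 / s[frozenset((x, y))]
                  + 1.0 / s[frozenset((y, z))]
                  + 1.0 / s[frozenset((z, x))])

def greedy_triplet_order(leaves, s):
    """Triplets in the order a naive greedy selection would use them."""
    # Stage 1: start from the triplet with the largest closeness.
    first = max(combinations(leaves, 3), key=lambda t: triplet_closeness(s, *t))
    in_tree = set(first)
    used = [first]
    # Stage 2: repeatedly add the leaf Z of the best triplet XYZ with X, Y already in the tree.
    while len(in_tree) < len(leaves):
        candidates = (
            (x, y, z)
            for x, y in combinations(sorted(in_tree), 2)
            for z in leaves
            if z not in in_tree
        )
        best = max(candidates, key=lambda t: triplet_closeness(s, *t))
        used.append(best)
        in_tree.add(best[2])
    return used

# Hypothetical pairwise closenesses for four leaves.
s = {
    frozenset("AB"): 0.72, frozenset("AC"): 0.405, frozenset("AD"): 0.27,
    frozenset("BC"): 0.36, frozenset("BD"): 0.24, frozenset("CD"): 0.54,
}
order = greedy_triplet_order(["A", "B", "C", "D"], s)
print(order)
```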
54
Outline of Technical Discussion (5/2)
1. Describe the HGT algorithm.
2. Prove the sample size bound (and high probability for accuracy).
3. Prove the optimal time & space.
55
Our New Result (4/1)
polynomial sample size
provable high accuracy
56
Over-Simplified Outline of HGT
• Stage 1: T’ ← the triplet ABC with the largest closeness.
• Stage 2: Repeat the following steps until T’ contains all the leaves.
Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not.
Step 2(2): Incorporate XYZ into T’ to add Z into T’.
57
Polynomial Sequence Length (1)
$\sigma_{XYZ}$ versus $(1 - \tfrac{4}{3} g)^{d_{XYZ}}$: larger closeness corresponds to smaller g-depth.
Lemma 1: The g-depth of the last triplet used in HGT is the g-depth of the true tree T.
Proof:
• The last triplet has the largest closeness such that the triplets with the same or larger closeness cover the true tree T.
• Equivalently, it has the smallest g-depth such that the triplets with the same or smaller g-depths cover the true tree T.
58
Polynomial Sequence Length (2)
$\sigma_{XYZ}$ versus $(1 - \tfrac{4}{3} g)^{d_{XYZ}}$, where $d_{XYZ}$ is the g-depth of the tree.
Lemma 1: The g-depth of the last triplet used in HGT is the g-depth of the true tree T.
Lemma 2: The sequence length needed grows as $1/\sigma_{XYZ}^{2}$, where XYZ is the last triplet used.
59
Outline of Technical Discussion (5/3)
1. Describe the HGT algorithm.
2. Prove the sample size bound (and high probability for accuracy).
3. Prove the optimal time & space.
60
Our New Result (4/2)
optimal time & space
61
Over-Simplified Outline of HGT
• Stage 1: T’ ← the triplet ABC with the largest closeness.
• Stage 2: Repeat the following steps until T’ contains all the leaves.
Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not.
Step 2(2): Incorporate XYZ into T’ to add Z into T’.
62
Optimal Time/Space for the First Triplet
• Stage 1:
Fix an arbitrary leaf A.
T’ ← the triplet ABC with the largest closeness.
• Stage 2:
Repeat the following steps until T’ contains all the leaves.
Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not.
Step 2(2): Incorporate XYZ into T’.
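Fixing one leaf A means Stage 1 only has to scan pairs (B, C) instead of all triples. A sketch under the same assumed triplet closeness (reciprocal of the sum of reciprocals), with hypothetical names and data:

```python
from itertools import combinations

def first_triplet(leaves, s, anchor=None):
    """Best triplet containing a fixed leaf; scans O(n^2) pairs rather than O(n^3) triples."""
    anchor = leaves[0] if anchor is None else anchor
    others = [v for v in leaves if v != anchor]

    def closeness(b, c):
        return 1.0 / (1.0 / s[frozenset((anchor, b))]
                      + 1.0 / s[frozenset((b, c))]
                      + 1.0 / s[frozenset((c, anchor))])

    b, c = max(combinations(others, 2), key=lambda bc: closeness(*bc))
    return anchor, b, c

# Hypothetical pairwise closenesses for four leaves.
s = {
    frozenset("AB"): 0.72, frozenset("AC"): 0.405, frozenset("AD"): 0.27,
    frozenset("BC"): 0.36, frozenset("BD"): 0.24, frozenset("CD"): 0.54,
}
leaves = ["A", "B", "C", "D"]
start_a = first_triplet(leaves, s)              # anchor defaults to the first leaf
start_d = first_triplet(leaves, s, anchor="D")  # any other leaf can serve as the anchor
print(start_a, start_d)
```

The triplet found this way contains the arbitrary anchor, so it need not be the globally best triplet; the slide's point is that the restricted search is enough to start the reconstruction.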
63
Optimal Time/Space for the Other Leaves
[Figure: a partially reconstructed tree and the leaves not yet recovered, with the triplets XYZ and ABC and the centers P and Q.]
only need to consider the triplets formed by one of X, Y, one of B, C, and one of
64
Outline of Technical Discussion (6)
1. Define the model of evolution.
2. Formulate the computational problem.
3. Discuss the theoretical performance of our algorithm.
4. Discuss the empirical performance.
5. Describe and analyze the algorithm.
6. Further research.
65
Further Research
• more general models of evolution
• practical implementations
66
Main Difficulties
• Availability of data
Hundreds of millions of species --- unlikely to be all available any time soon or ever.
But DNA sequences of more and more species are becoming available.
• Extracting information from data
focus of this talk
67
Do the genomes of all green plants contain enough information for the reconstruction of their evolutionary tree?
• genome size of eukaryotes: base pairs
• # of green plant species: several
If so, does this impose any necessary structure on the information or the tree? If so, how do we determine and use that structure?
Beyond All Computational Considerations
1010116
~
What do you think?
The End.
Thank You!
108
68
Data Mining Flowchart
Boxes:
• true tree (unknown)
• collect & process individual sequences
• compare & align multiple sequences
• tree reconstruction algorithms
• tree verification (compare & refine)
• evolution models
Arrow labels: generate sequences, further process, parameters, distance or characters, trees, information, refine, infer, parameters
today’s focus