CS 394C: Computational Biology Algorithms
Tandy Warnow
Department of Computer Sciences
University of Texas at Austin
DNA Sequence Evolution
[Figure: a tree of DNA sequence evolution with a time axis from 3 million years ago to today. The root sequence AAGACTT evolves through intermediate sequences (TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT) into the present-day sequences TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, and AGGGCAT.]
Molecular Systematics
[Figure: five present-day sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) with leaf labels U, V, W, X, Y, and a tree on the leaves U, V, W, X, Y.]
Phylogeny estimation methods
• Distance-based (Neighbor joining, NQM, and others): mostly statistically consistent and polynomial time
• Maximum parsimony and maximum compatibility: NP-hard and not statistically consistent
• Maximum likelihood: NP-hard and usually statistically consistent (if solved exactly)
• Bayesian Methods: statistically consistent if run long enough
Distance-based methods
• Theorem: Let $(T, \theta)$ be a Cavender-Farris model tree with additive matrix $[\lambda(i,j)]$, and let $\varepsilon > 0$ be given. The sequence length that suffices for NJ (neighbor joining) and the Naïve Quartet Method to be accurate with probability at least $1 - \varepsilon$ is
$O(\log n \cdot e^{O(\max \lambda(i,j))})$.
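Distance-based methods such as NJ work from an estimate of the additive matrix computed from the sequences. A minimal sketch of that estimation step follows, assuming binary (0/1) sequences and the usual Cavender-Farris correction, distance = -1/2 ln(1 - 2p), where p is the fraction of sites at which two sequences differ. The function name `cf_distance` is hypothetical.

```python
import math


def cf_distance(x: str, y: str) -> float:
    """Estimate the Cavender-Farris distance between two binary sequences.

    p is the observed fraction of differing sites; the corrected distance
    is -1/2 * ln(1 - 2p). When p >= 0.5 the correction is undefined
    (the sequences look saturated), so infinity is returned.
    """
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    p = sum(a != b for a, b in zip(x, y)) / len(x)
    if p >= 0.5:
        return math.inf
    return -0.5 * math.log(1.0 - 2.0 * p)


if __name__ == "__main__":
    # One difference in four sites: p = 0.25.
    print(cf_distance("0101", "0100"))
```

The correction grows without bound as p approaches 0.5, which gives some intuition for why long paths in the tree (large entries of the additive matrix) require longer sequences to estimate reliably.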
Neighbor joining (although statistically consistent) has poor performance on large diameter trees
[Nakhleh et al. ISMB 2001]
Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides.
Error rates reflect proportion of incorrect edges in inferred trees.
[Figure: error rate (0 to 0.8) of NJ versus number of taxa (0 to 1600).]
Maximum Parsimony
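• Input: Set S of n aligned sequences of length k
• Output: A phylogenetic tree T
  – leaf-labeled by sequences in S
  – with additional sequences of length k labeling the internal nodes of T
such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized, where $H(i,j)$ is the Hamming distance between the sequences labeling the endpoints of edge $(i,j)$.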
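The objective can be sketched directly: given a tree whose nodes are all labeled, the parsimony length is the sum of Hamming distances over the edges. This is a minimal illustration; the node names (`a`, `b`, `u`, and so on) and the function names are hypothetical, and the labels are the four sequences from the example in these slides with one possible choice of internal labels.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def tree_length(edges, labels) -> int:
    """Sum of Hamming distances over all edges of a fully labeled tree."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


if __name__ == "__main__":
    # Leaves a, b, c, d; internal nodes u and v (names are illustrative).
    labels = {"a": "ACT", "b": "ACA", "c": "GTT", "d": "GTA",
              "u": "ACA", "v": "GTA"}
    edges = [("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")]
    print(tree_length(edges, labels))  # 1 + 0 + 2 + 1 + 0 = 4
```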
Maximum parsimony (example)
• Input: Four sequences
  – ACT
  – ACA
  – GTT
  – GTA
• Question: which of the three trees has the best MP score?
Maximum Parsimony
[Figure: the three possible unrooted trees on the four leaves ACT, ACA, GTT, and GTA.]
Maximum Parsimony
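[Figure: the three trees on ACT, ACA, GTT, and GTA, shown with internal node labels and edge costs.]
• First tree: internal nodes labeled GTT and GTA; nonzero edge costs 1, 2, 2; MP score = 5
• Second tree: internal nodes labeled ACA and ACT; nonzero edge costs 3, 1, 3; MP score = 7
• Third tree: internal nodes labeled ACA and GTA; nonzero edge costs 1, 2, 1; MP score = 4 (optimal MP tree)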
Maximum Parsimony
[Figure: the optimal MP tree from the example, with internal nodes labeled ACA and GTA, nonzero edge costs 1, 2, 1, and MP score = 4.]
Finding the optimal MP tree is NP-hard
Optimal labeling can be computed in polynomial time using Dynamic Programming
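One standard dynamic program for scoring a fixed tree is Fitch's algorithm, which handles unit-cost substitutions one site at a time. The sketch below is an illustration rather than the method used in the course: it assumes the unrooted tree has been rooted on an arbitrary edge (the Fitch score does not depend on where the root is placed) and represents the tree as nested pairs with sequence strings at the leaves. The name `fitch_score` is hypothetical.

```python
def fitch_score(tree) -> int:
    """Minimum parsimony length of a fixed rooted binary tree (Fitch).

    `tree` is either a leaf (a sequence string) or a pair (left, right).
    For each site, a bottom-up pass keeps the set of candidate states at
    every node; an empty intersection between siblings costs one change.
    """
    def first_leaf(node):
        while not isinstance(node, str):
            node = node[0]
        return node

    def visit(node, site):
        if isinstance(node, str):
            return {node[site]}, 0
        left_set, left_cost = visit(node[0], site)
        right_set, right_cost = visit(node[1], site)
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    k = len(first_leaf(tree))
    return sum(visit(tree, site)[1] for site in range(k))


if __name__ == "__main__":
    # The tree grouping ACT with ACA and GTT with GTA.
    print(fitch_score((("ACT", "ACA"), ("GTT", "GTA"))))
```

Each site is processed in time linear in the number of nodes, so the whole computation is polynomial in n and k; the hard part of maximum parsimony is the search over tree topologies, not the labeling of a given tree.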
Solving NP-hard problems exactly is … unlikely
• Number of (unrooted) binary trees on n leaves is (2n-5)!!
• If each tree on 1000 taxa could be analyzed in 0.001 seconds, an exhaustive search for the best tree would take on the order of 10^2847 millennia

#leaves   #trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        2.2 x 10^20
100       1.7 x 10^182
1000      1.9 x 10^2860
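The double factorial (2n-5)!! is the product of the odd numbers 1 x 3 x 5 x ... x (2n-5), which can be evaluated exactly with Python's arbitrary-precision integers. A short sketch, with a hypothetical function name:

```python
def num_unrooted_binary_trees(n: int) -> int:
    """Number of unrooted binary trees on n labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("n must be at least 3")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (the factor 1 is implicit).
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 5, 6, 7, 8, 9, 10):
        print(n, num_unrooted_binary_trees(n))
    # For large n, the number of decimal digits conveys the magnitude.
    for n in (20, 100, 1000):
        print(n, len(str(num_unrooted_binary_trees(n))), "digits")
```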
Approaches for “solving” MP and ML (and other NP-hard problems in phylogeny)
1. Hill-climbing heuristics (which can get stuck in local optima)
2. Randomized algorithms for getting out of local optima
3. Approximation algorithms for MP (based upon Steiner Tree approximation algorithms); however, the approximation ratio that is needed is probably 1.01 or smaller!
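[Figure: a cost landscape over the space of phylogenetic trees (x-axis: phylogenetic trees; y-axis: cost), marking a local optimum and the global optimum.]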
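The difference between a local and a global optimum can be seen on a toy landscape. The sketch below is only an analogy: it replaces tree space with positions on a line, where each position has a cost and its neighbors are the adjacent positions. Real MP and ML heuristics move between trees using rearrangement operations, which are not modeled here. The cost values and function names are made up for illustration.

```python
def hill_climb(cost, start):
    """Greedy descent: move to the cheaper neighbor until none improves."""
    current = start
    while True:
        neighbors = [i for i in (current - 1, current + 1)
                     if 0 <= i < len(cost)]
        best = min(neighbors, key=lambda i: cost[i])
        if cost[best] >= cost[current]:
            return current  # no neighbor is strictly better
        current = best


def multi_start(cost, starts):
    """Run hill climbing from several starts and keep the best result."""
    return min((hill_climb(cost, s) for s in starts), key=lambda i: cost[i])


if __name__ == "__main__":
    cost = [5, 3, 4, 6, 2, 1, 3, 7]  # illustrative landscape
    print(hill_climb(cost, 0))       # stops at index 1, a local optimum
    print(multi_start(cost, range(len(cost))))  # index 5, the global optimum
```

Restarting from many positions is the simplest way to escape a local optimum; in practice the starting points would be chosen at random rather than enumerated.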
Problems with techniques for MP and ML
[Figure: Performance of TNT with time. x-axis: hours (0 to 24); y-axis: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2).]
Shown here is the performance of a TNT heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%.
MP and Cavender-Farris
• Consider a tree (AB,CD) with two very long branches leading to A and C, and all other branches very short.
• MP will be statistically inconsistent (and “positively misleading”) on this tree.
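The effect can be reproduced in a small simulation. The sketch below assumes binary characters evolving on the tree (AB,CD) with a change probability of 0.4 on the edges to A and C and 0.05 on every other edge; these values are illustrative and do not come from the slides. On four taxa, MP prefers the split supported by the most parsimony-informative sites, so counting site patterns is enough to see which tree it returns.

```python
import random


def flip(rng, state, p):
    """Return the state at the far end of an edge with change probability p."""
    return state ^ int(rng.random() < p)


def simulate_site(rng, p_long, p_short):
    """Simulate one binary site on the tree (AB,CD); A and C have long edges."""
    u = rng.randint(0, 1)          # internal node adjacent to A and B
    v = flip(rng, u, p_short)      # internal node adjacent to C and D
    a = flip(rng, u, p_long)
    b = flip(rng, u, p_short)
    c = flip(rng, v, p_long)
    d = flip(rng, v, p_short)
    return a, b, c, d


def split_support(sites):
    """Count the informative sites that support each of the three splits."""
    support = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for a, b, c, d in sites:
        if a == b and c == d and a != c:
            support["AB|CD"] += 1
        elif a == c and b == d and a != b:
            support["AC|BD"] += 1
        elif a == d and b == c and a != b:
            support["AD|BC"] += 1
    return support


if __name__ == "__main__":
    rng = random.Random(0)
    sites = [simulate_site(rng, 0.4, 0.05) for _ in range(5000)]
    support = split_support(sites)
    print(support)
    print("MP prefers:", max(support, key=support.get))
```

Under these assumed probabilities the pattern that groups A with C is expected at roughly 14% of sites and the pattern matching the true split at roughly 4%, so adding more sites only strengthens the preference for the wrong tree.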
Problems with existing phylogeny reconstruction methods
• Polynomial time methods (generally based upon distances) have poor accuracy with large diameter datasets.
• Heuristics for NP-hard optimization problems take too long (months to reach acceptable local optima).
Warnow et al.: Meta-algorithms for phylogenetics
• Basic technique: determine the conditions under which a phylogeny reconstruction method does well (or poorly), and design a divide-and-conquer strategy (specific to the method) to improve its performance
• Warnow et al. developed a class of divide-and-conquer methods, collectively called DCMs (Disk-Covering Methods). These are based upon chordal graph theory to give fast decompositions and provable performance guarantees.
Disk-Covering Method (DCM)
Improving phylogeny reconstruction methods using DCMs
• Improving the theoretical convergence rate and performance of polynomial time distance-based methods using DCM1
• Speeding up heuristics for NP-hard optimization problems (Maximum Parsimony and Maximum Likelihood) using Rec-I-DCM3
DCM1 (Warnow, St. John, and Moret, SODA 2001)
• A two-phase procedure which reduces the sequence length requirement of methods. The DCM phase produces a collection of trees, and the SQS phase picks the “best” tree.
• The “base method” is applied to subsets of the original dataset. When the base method is NJ, you get DCM1-NJ.
[Diagram: exponentially converging method → DCM → SQS → absolute fast converging method]
DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001]
• Theorem: DCM1-NJ converges to the true tree from polynomial length sequences
[Figure: error rate (0 to 0.8) of NJ and DCM1-NJ versus number of taxa (0 to 1600).]
Rec-I-DCM3 significantly improves performance (Roshan et al. CSB 2004)
Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset. Similar improvements obtained for RAxML (maximum likelihood).
[Figure: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2), versus hours (0 to 24), for two curves: current best techniques and the DCM-boosted version of best techniques.]
Summary (so far)
• Optimization problems in biology are almost all NP-hard, and heuristics may run for months before finding local optima.
• The challenge here is to find better heuristics, since exact solutions are very unlikely to ever be achievable on large datasets.
Summary
• NP-hard optimization problems abound in phylogeny reconstruction, and in computational biology in general, and need very accurate solutions
• Many real problems have beautiful and natural combinatorial and graph-theoretic formulations