CIPRES: Enabling Tree of Life Projects

Tandy Warnow

The University of Texas at Austin

Reconstructing the “Tree” of Life

Handling large datasets: millions of species

The “Tree of Life” is not really a tree: reticulate evolution

Cyber Infrastructure for Phylogenetic Research

Purpose: to create a national infrastructure of hardware, open source software, database technology, etc., necessary to infer the Tree of Life.

Group: 40 biologists, computer scientists, and mathematicians from 13 institutions.

Funding: $11.6 M (large ITR grant from NSF).

URL: http://www.phylo.org

CIPRes Members

University of New Mexico: Bernard Moret, David Bader

UCSD/SDSC: Fran Berman, Alex Borchers, Phil Bourne, John Huelsenbeck, Terri Liebowitz, Mark Miller

University of Connecticut: Paul O Lewis

University of Pennsylvania: Junhyong Kim, Susan Davidson, Sampath Kannan, Val Tannen

Texas A&M: Tiffani Williams

UT Austin: Tandy Warnow, David M. Hillis, Warren Hunt, Robert Jansen, Randy Linder, Lauren Meyers, Daniel Miranker

University of Arizona: David R. Maddison

University of British Columbia: Wayne Maddison

North Carolina State University: Spencer Muse

American Museum of Natural History: Ward C. Wheeler

NJIT: Usman Roshan

UC Berkeley: Satish Rao, Steve Evans, Richard M Karp, Brent Mishler, Elchanan Mossel, Eugene W. Myers, Christos M. Papadimitriou, Stuart J. Russell

Rice: Luay Nakhleh

SUNY Buffalo: William Piel

Florida State University: David L. Swofford, Mark Holder

Yale: Michael Donoghue, Paul Turner

CIPRES activity

• Databases - e.g. TreeBase II (Bill Piel and others)

• Simulations of large-scale complex genome-scale evolution (Junhyong Kim)

• Outreach (Michael Donoghue and Brent Mishler)

• Algorithms (Tandy Warnow)

• Open source software (Wayne Maddison, Dave Swofford, Mark Holder, and Bernard Moret)

• Computer cluster at SDSC (Fran Berman and Mark Miller) - available to ATOL projects and other groups with datasets above 1000 taxa

DNA Sequence Evolution

[Figure: a tree of DNA sequences evolving over time. The ancestral sequence AAGACTT (-3 mil yrs) gives rise to TGGACTT and AAGGCCT (-2 mil yrs), then to AGGGCAT, TAGCCCT, and AGCACTT (-1 mil yrs), and finally to the sequences observed today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT.]

Phylogeny Problem

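[Figure: the input is a set of present-day sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) for the taxa U, V, W, X, Y; the output is a tree whose leaves are U, V, W, X, Y.]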

Complex Evolutionary Processes

• Gap events

• “Heterotachy” (violations of the rates-across-sites assumption)

• New types of data (e.g., whole genomes)

• Reticulate evolution (e.g., hybrid speciation and horizontal gene transfer)

Challenges in reconstructing large and/or complex evolutionary histories

• Previous simulation studies don’t necessarily help us understand phylogenetic reconstruction on large or complex datasets

• We need new statistical models, new theory, and probably new methods.

• Reticulate evolution and whole genome evolution in particular present many interesting challenges for reconstruction.

CIPRES research in algorithms

• Multiple sequence alignment

• Genomic alignment

• Heuristics for Maximum Parsimony and Maximum Likelihood

• Bayesian MCMC methods

• Supertree methods

• Whole genome phylogeny reconstruction

• Reticulate evolution detection and reconstruction

• Data mining on sets of trees, and compact representations of these sets

Phylogenetic reconstruction methods

1. Heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood) - hard to solve on large datasets

[Figure: cost plotted over the space of phylogenetic trees, showing a local optimum and the global optimum.]

2. Polynomial time distance-based methods: Neighbor Joining, FastME, Weighbor, etc. - poor accuracy on datasets with large evolutionary distances

DCMs: Divide-and-conquer for improving phylogeny reconstruction

“Boosting” phylogeny reconstruction methods

• DCMs “boost” the performance of phylogeny reconstruction methods.

[Diagram: DCM + base method M → DCM-M]

DCMs (Disk-Covering Methods)

• DCMs for polynomial time methods improve topological accuracy (empirical observation), and have provable theoretical guarantees under Markov models of evolution

• DCMs for hard optimization problems reduce the running time needed to achieve good levels of accuracy (empirical observation)

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001]

• DCM1-boosting makes distance-based methods more accurate

• Theoretical guarantees that DCM1-NJ converges to the true tree from polynomial length sequences

[Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM1-NJ.]

Major challenge: MP and ML

• Maximum Parsimony (MP) and Maximum Likelihood (ML) remain the methods of choice for most systematists

• The main challenge here is to make it possible to obtain good solutions to MP or ML in reasonable time periods on large datasets
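To make the MP objective concrete, here is a minimal sketch of scoring one fixed tree with Fitch's algorithm, the standard way to compute the parsimony length of a given topology. The nested-tuple tree encoding, the function name `fitch_score`, and the toy sequences are assumptions made for illustration; they are not taken from the talk or from any MP package.

```python
def fitch_score(tree, seqs):
    """Parsimony length of a rooted binary tree (nested tuples of leaf names).

    seqs maps each leaf name to an aligned sequence; all sequences must
    have the same length. Returns the minimum number of substitutions.
    """
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for site in range(n_sites):

        def visit(node):
            nonlocal total
            if isinstance(node, str):          # leaf: its observed state
                return {seqs[node][site]}
            left, right = (visit(child) for child in node)
            common = left & right
            if common:                         # children agree: no change
                return common
            total += 1                         # children disagree: one change
            return left | right

        visit(tree)
    return total


# Toy data (hypothetical): four taxa, two sites.
toy = {"a": "AC", "b": "AC", "c": "GC", "d": "GT"}
print(fitch_score((("a", "b"), ("c", "d")), toy))  # 2
print(fitch_score((("a", "c"), ("b", "d")), toy))  # 3
```

Scoring a single tree is cheap; the difficulty discussed in this part of the talk comes from searching over the enormous number of candidate trees.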

Solving NP-hard problems exactly is … unlikely

• Number of (unrooted) binary trees on n leaves is $(2n-5)!!$

• If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in 2890 millennia

#leaves   #trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        $2.2 \times 10^{20}$
100       $4.5 \times 10^{190}$
1000      $2.7 \times 10^{2900}$
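The double-factorial count can be checked directly for small inputs. The following is a small sketch that evaluates (2n-5)!! as a product of odd numbers; the function name `num_unrooted_binary_trees` is an assumption chosen for this example.

```python
def num_unrooted_binary_trees(n_leaves):
    """Return (2n-5)!!, the number of unrooted binary trees on n labeled leaves."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # (2n-5)!! is the product of the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, num_unrooted_binary_trees(n))
```

Python integers have arbitrary precision, so the same function can be called for larger n without overflow.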

How good an MP analysis do we need?

• Our research shows that we need to get within 0.01% of optimal (or even closer on large datasets) to return reasonable estimates of the true tree’s “topology”

Problems with current techniques for MP

[Figure: Performance of TNT with time. x-axis: hours (0 to 24); y-axis: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2).]

Shown here is the performance of a heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%.
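The quantity plotted here, the MP score above optimal as a percentage of the optimal, is simple arithmetic. A minimal sketch follows; the function names and the scores in the example are hypothetical and are not measurements from the study.

```python
def percent_above_optimal(score, best_known):
    """MP score excess over the best known score, as a percentage of it."""
    return 100.0 * (score - best_known) / best_known


def is_acceptable(score, best_known, threshold_percent=0.01):
    """True if the excess is below the 0.01% level quoted in the talk."""
    return percent_above_optimal(score, best_known) < threshold_percent


# Hypothetical scores, for illustration only.
best = 100_000
for score in (100_005, 100_020):
    print(score, percent_above_optimal(score, best), is_acceptable(score, best))
```

With a best known score of 100,000 steps, a tree only 20 steps longer is already outside the acceptable range, which is why apparent convergence can be misleading.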

Observations

• The best MP heuristics cannot get acceptably good solutions within 24 hours on most of these large datasets.

• Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions.

• Apparent convergence can be misleading.

Our objective: speed up the best MP heuristics

[Figure (fake study, for illustration): MP score of best trees versus time, contrasting the performance of a hill-climbing heuristic with the desired performance.]

DCM3 decomposition

Input: Set S of sequences, and guide-tree T

1. Compute short subtree graph G(S,T), based upon T

2. Find clique separator in the graph G(S,T) and form subproblems

DCM3 decompositions

(1) can be obtained in O(n) time

(2) yield small subproblems

(3) can be used iteratively

(4) can be applied recursively

Iterative-DCM3

[Diagram: a tree T is passed through DCM3 with the base method to produce a new tree T’.]

New DCMs

• DCM3

1. Compute subproblems using DCM3 decomposition

2. Apply base method to each subproblem to yield subtrees

3. Merge subtrees using the Strict Consensus Merger technique

4. Randomly refine to make it binary

• Recursive-DCM3

• Iterative DCM3

1. Compute a DCM3 tree

2. Perform local search and go to step 1

• Recursive-Iterative DCM3
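Step 4 of DCM3 (randomly refining the merged tree so that it is binary) can be illustrated on its own. The sketch below is one simple way to resolve multifurcations at random, assuming a rooted tree stored as nested tuples of leaf names; the representation and the function name `randomly_refine` are assumptions for this example, and the talk does not specify how the actual software performs this step.

```python
import random


def randomly_refine(tree, rng=random):
    """Resolve every node with more than two children into random binary splits.

    tree is a leaf name (str) or a tuple of subtrees with at least two
    children. Returns a tree of the same form in which every internal
    node has exactly two children.
    """
    if isinstance(tree, str):
        return tree
    children = [randomly_refine(child, rng) for child in tree]
    while len(children) > 2:
        # Pick two children at random and group them under a new node.
        a, b = rng.sample(range(len(children)), 2)
        merged = (children[a], children[b])
        children = [c for idx, c in enumerate(children) if idx not in (a, b)]
        children.append(merged)
    return tuple(children)


unresolved = ("a", "b", "c", ("d", "e", "f", "g"))
print(randomly_refine(unresolved, random.Random(1)))
```

Passing a seeded `random.Random` instance makes the refinement reproducible, which is convenient when comparing runs.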

Rec-I-DCM3 significantly improves performance

Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset

[Figure: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2), versus hours (0 to 24). Two curves: current best techniques, and the DCM boosted version of best techniques.]

Datasets

• 1322 lsu rRNA of all organisms

• 2000 Eukaryotic rRNA

• 2594 rbcL DNA

• 4583 Actinobacteria 16s rRNA

• 6590 ssu rRNA of all Eukaryotes

• 7180 three-domain rRNA

• 7322 Firmicutes bacteria 16s rRNA

• 8506 three-domain+2org rRNA

• 11361 ssu rRNA of all Bacteria

• 13921 Proteobacteria 16s rRNA

Obtained from various researchers and online databases

Rec-I-DCM3(TNT) vs. TNT (Comparison of scores at 24 hours)

Base method is the default TNT technique, the current best method for MP. Rec-I-DCM3 significantly improves upon the unboosted TNT by returning trees which are at most 0.01% above optimal on most datasets.

[Figure: average MP score above optimal at 24 hours, shown as a percentage of the optimal (0 to 0.1), for datasets 1 through 10. Two series: TNT and Rec-I-DCM3.]

Observations

• Rec-I-DCM3 improves upon the best performing heuristics for MP.

• The improvement increases with the difficulty of the dataset.

DCMs

• DCM for NJ and other distance methods produces absolute fast converging (afc) methods

• DCMs for MP heuristics

• DCMs for use with the GRAPPA software for whole genome phylogenetic analysis; these have been shown to let GRAPPA scale from its maximum of about 15-20 genomes to 1000 genomes.

• Current projects: DCM development for maximum likelihood and multiple sequence alignment.

Part II: Whole-Genome Phylogenetics

[Figure: genomes A through F at the leaves of a tree with internal nodes W, X, Y, Z.]

Genomes Evolve by Rearrangements

Starting genome: 1 2 3 4 5 6 7 8 9 10

• Inversion (Reversal): 1 2 3 –8 –7 –6 –5 –4 9 10

• Transposition: 1 2 3 9 4 5 6 7 8 10

• Inverted Transposition: 1 2 3 9 –8 –7 –6 –5 –4 10
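The three rearrangement events can be written as short list operations on a signed gene order. This is a minimal sketch that treats the genome as a linear Python list with 0-based cut points `i < j <= k`; the function names and the cut-point convention are assumptions for illustration.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Move the segment genome[i:j] so that it follows genome[j:k]."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Like transposition, but the moved segment is also inverted."""
    segment = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + segment + genome[k:]


start = list(range(1, 11))                     # 1 2 3 4 5 6 7 8 9 10
print(inversion(start, 3, 8))                  # genes 4..8 reversed and negated
print(transposition(start, 3, 8, 9))           # genes 4..8 moved after gene 9
print(inverted_transposition(start, 3, 8, 9))  # moved after gene 9 and inverted
```

Each call returns a new list and leaves `start` unchanged, so the three events can be compared side by side against the same starting genome.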

Genome Rearrangement Has A Huge State Space

• DNA sequences: 4 states per site

• Signed circular genomes with n genes: $2^{n-1}(n-1)!$ states, 1 site

• Circular genomes (1 site)

– with 37 genes: $2.56 \times 10^{52}$ states

– with 120 genes: $3.70 \times 10^{232}$ states
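The two figures quoted for 37 and 120 genes follow from the formula for signed circular genomes, 2^(n-1) (n-1)!. A short sketch that evaluates it exactly (the function name is an assumption for this example):

```python
from math import factorial


def signed_circular_genome_count(n_genes):
    """Number of distinct signed circular gene orders on n genes: 2^(n-1) * (n-1)!."""
    if n_genes < 1:
        raise ValueError("need at least one gene")
    return 2 ** (n_genes - 1) * factorial(n_genes - 1)


for n in (37, 120):
    print(n, format(signed_circular_genome_count(n), ".2e"))
# 37 2.56e+52
# 120 3.70e+232
```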

Why use gene orders?

• “Rare genomic changes”: huge state space and relative infrequency of events (compared to site substitutions) could make the inference of deep evolution easier, or more accurate.

• Our research shows this is true, but accurate analysis of gene order data is computationally very intensive!

The Generalized Nadeau-Taylor model

Wang and Warnow, 2001

• Three types of events: inversions, transpositions, and inverted transpositions

• Each event of each type is equiprobable

• The relative probabilities of the three events are parameters that the user can specify
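A simulation step under a model of this kind can be sketched as follows: pick the event type according to user-supplied relative weights, then pick one event of that type uniformly at random. The sketch assumes a linear list representation of the genome with uniformly chosen cut points, and the name `random_event` and the default 2:1:1 weights are choices made for this example rather than details given in the talk.

```python
import random


def random_event(genome, weights=(2, 1, 1), rng=random):
    """Apply one random rearrangement to a signed gene order (needs >= 2 genes).

    weights gives the relative probabilities of inversion, transposition,
    and inverted transposition. Within a type, every choice of cut points
    is equally likely.
    """
    n = len(genome)
    kinds = ["inversion", "transposition", "inverted_transposition"]
    kind = rng.choices(kinds, weights=weights)[0]
    if kind == "inversion":
        i, j = sorted(rng.sample(range(n + 1), 2))
        return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]
    # Both transposition types exchange two adjacent segments [i:j] and [j:k].
    i, j, k = sorted(rng.sample(range(n + 1), 3))
    segment = genome[i:j]
    if kind == "inverted_transposition":
        segment = [-g for g in reversed(segment)]
    return genome[:i] + genome[j:k] + segment + genome[k:]


rng = random.Random(0)
genome = list(range(1, 11))
for _ in range(5):
    genome = random_event(genome, rng=rng)
print(genome)
```

Repeating the step along each edge of a model tree, with the number of events per edge set by the edge length, gives simulated gene orders at the leaves.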

Phylogeny reconstruction in 1998

• Distance-based

– Breakpoint (BP) distances [Blanchette, Kunisawa, Sankoff 1998]

• Minimum length trees (NP-hard, even for three taxa)

– BPAnalysis [Sankoff & Blanchette 1998]: exhaustive search through treespace to find the minimum breakpoint length (the number of breakpoints on the tree)
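The breakpoint distance between two gene orders counts the adjacencies of one genome that are absent from the other. The sketch below assumes signed circular genomes on the same gene set and treats the adjacency (a, b) as equivalent to (-b, -a), since both describe the same pair of neighbours read from opposite strands; the function names are assumptions for this example.

```python
def adjacencies(genome):
    """Set of adjacencies of a signed circular genome, in both reading directions."""
    n = len(genome)
    pairs = set()
    for idx in range(n):
        a, b = genome[idx], genome[(idx + 1) % n]
        pairs.add((a, b))
        pairs.add((-b, -a))
    return pairs


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2."""
    present = adjacencies(g2)
    n = len(g1)
    return sum(
        1 for idx in range(n)
        if (g1[idx], g1[(idx + 1) % n]) not in present
    )


identity = list(range(1, 11))
print(breakpoint_distance(identity, [1, 2, 3, -8, -7, -6, -5, -4, 9, 10]))  # 2
print(breakpoint_distance(identity, [1, 2, 3, 9, 4, 5, 6, 7, 8, 10]))       # 3
```

A single inversion creates at most two breakpoints and a transposition at most three, which is why the distance is cheap to compute but tends to underestimate the true number of events as genomes diverge.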

[Figure: error in inferred tree versus amount of evolution for NJ(BP). Simulation: 40 taxa, 120 genes, Inv.:Transp.:InvTransp = 2:1:1, birth-death trees, expected deviation from ultrametricity = 2.]

NJ(BP): seconds

BPAnalysis: will not finish (will take 200 years for a 13 genome dataset)

Progress

• Statistically-defined distance estimators: EDE and IEBP, highly robust to model violations

• FastME(EDE) yields very accurate trees, except when the datasets are close to “saturated”

BP = breakpoint distance

INV = inversion distance

EDE: statistically-based estimator [Wang et al. ‘01] - highly robust.

All these methods are polynomial time.

[Figure: comparison of these methods against the amount of evolution. Simulation: 40 taxa, 120 genes, Inv.:Transp.:InvTransp = 2:1:1, birth-death trees, expected deviation from ultrametricity = 2.]

Minimum length trees (“parsimony”)

• Breakpoint length and inversion length: both NP-hard to solve even on three-leaf trees. Exact solutions exponential in both number of taxa and number of genes.

• Inversion-phylogeny has better topological accuracy than breakpoint phylogeny, but is harder to solve. Highly robust to model violations.

“Solving” the inversion and breakpoint phylogeny problems

[Figure: MP score plotted over the space of phylogenetic trees, showing a local optimum and the global optimum.]

• Usual issue of getting stuck in local optima, since the optimization problems are NP-hard

• Additional problem: finding the best trees is enormously hard, since even the “point estimation” problem is hard (worse than estimating branch lengths in ML).

Minimum length trees (“parsimony”)

• Breakpoint phylogeny

– BPAnalysis [Sankoff & Blanchette 1998]

– GRAPPA [Moret et al. 2001]

– MPME [Wang et al. PSB 2002]: represents gene orders as multi-state strings, and solves parsimony on this modified dataset. This problem is exponential in the number of taxa, but polynomial in the number of genes. Because it relies on MP software, it cannot handle large datasets.

– DCM4-MPME: uses a divide-and-conquer strategy (similar to DCM3) to decompose a large dataset into smaller datasets, on the basis of a guide tree. It can handle larger datasets than any of the other methods.

• Inversion phylogeny

– GRAPPA: highly accurate, robust to model violations, but cannot analyze trees with large edge lengths in reasonable time periods.

Analyzing Large Datasets

[Figure: topological error versus problem size / divergence, annotated as follows.]

• NJ(EDE): poor accuracy for highly diverged datasets

• GRAPPA: cannot handle datasets of moderate size, or trees with long branches

• MPME: cannot handle datasets of large size

• NJ(EDE)+MPME in a divide-and-conquer approach

[Figure: comparison of NJ(BP), NJ(EDE), MPME, and GRAPPA.]

DCM4-MPME: guide tree = NJ(EDE). GRAPPA and MPME won’t finish (long branch lengths; too many taxa).

[Figure: NJ(EDE) versus DCM4-MPME. Simulation: 120 genes, 200 taxa, Inversion/Transposition/Inverted Transposition = 2:1:1, birth-death trees with deviation from ultrametricity.]

Summary

• True evolutionary distance estimators improve accuracy of NJ

• Sequence-based heuristic (MPME)

• Divide-and-conquer, integrated approach for large-scale data

Limitations and ongoing research

• Current methods are mostly limited to single chromosomes with equal gene content (or very small amounts of deletions and duplications).

• Moret et al. have made some progress on developing a reliable distance-based method for chromosomes with unequal gene content (tests on real and simulated data show high accuracy)

• Handling the multiple chromosome case is harder

GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms)

http://www.cs.unm.edu/~moret/GRAPPA/

• Heuristics for NP-hard optimization problems

• Fast polynomial time distance-based methods

• Contributors: U. New Mexico, U. Texas at Austin, Università di Bologna, Italy

• Freely available in source code at this site.

• Project leader: Bernard Moret (UNM)

CIPRES software distributions

Software group leaders: Wayne Maddison and Dave Swofford

The first distribution (in the coming months) will focus on Rec-I-DCM3(PAUP*): fast heuristic searches for maximum parsimony on large datasets, for PAUP* users

All software will be open source

Community contributions to software will be enabled

Acknowledgements

• NSF

• The David and Lucile Packard Foundation

• The Program in Evolutionary Dynamics at Harvard

• The Institute for Cellular and Molecular Biology at UT-Austin

See http://www.phylo.org and http://www.cs.utexas.edu/~tandy and http://www.cs.unm.edu/~moret/GRAPPA
