Alexandros Stamatakis LRR TU München Contact: stamatak@cs.tum

Post on 21-Jan-2016


Parallel & Distributed Systems and Algorithms for Inference of Large Phylogenetic Trees with Maximum Likelihood

Alexandros Stamatakis, LRR TU München

Contact: stamatak@cs.tum.edu

ICS/IMBB Iraklion Alexandros Stamatakis:

Phylogenetic Inference with RAxML2Slide: 2

Outline

- Motivation
- Introduction to phylogenetic tree inference
- Statistical inference methods
- Maximum Likelihood & associated problems
- Solutions:
  – 2 simple heuristics
  – parallel & distributed implementation
- Results
- Conclusion
- Availability & Future Work


Motivation: Towards a „Tree of Life“

30.000 organisms available, current trees <= 1000

Where we are:


Where we want to get:


Phylogenetic Tree Inference

Input: „good“ multiple alignment of a distinguished, highly conserved part of DNA sequences

Output: unrooted binary tree with the sequences at its leaves (all nodes: either degree 1 or 3)

- Various methods for phylogenetic tree inference
- Differ in computational complexity and quality of trees
- Most accurate methods: Maximum Likelihood Method (ML) and Bayesian Phylogenetic Inference:
  + most sound and flexible methods
  + other methods not suited for large/complex trees
  -- most computationally intensive methods
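The output constraint stated on this slide (every node has degree 1 or 3) can be checked mechanically. The following is a minimal sketch, assuming the tree is given as a plain list of undirected edges; the function names and the inner node names a, b, c are hypothetical and not part of RAxML.

```python
from collections import defaultdict

def node_degrees(edges):
    """Count how many branches touch each node of an undirected tree."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)

def is_unrooted_binary(edges):
    """True if every node has degree 1 (leaf) or degree 3 (inner node).

    This checks node degrees only; it does not verify that the edges
    form a connected, acyclic graph.
    """
    return all(d in (1, 3) for d in node_degrees(edges).values())

# Hypothetical 5-sequence example; inner node names a, b, c are made up.
edges = [("S1", "a"), ("S2", "a"), ("a", "b"), ("S3", "b"),
         ("b", "c"), ("S4", "c"), ("S5", "c")]

degrees = node_degrees(edges)
n_leaves = sum(1 for d in degrees.values() if d == 1)
n_inner = sum(1 for d in degrees.values() if d == 3)

print(is_unrooted_binary(edges))      # True
print(n_leaves, n_inner, len(edges))  # 5 3 7
```

An unrooted binary tree with n leaves has n-2 inner nodes and 2n-3 branches, which the example reproduces for n = 5.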


ML and Bayesian methods

- T. Williams et al. (March 2003): comparative analysis with simulated data shows that MrBayes is the best program
- Guindon et al. (May 2003): PHYML is a very fast & accurate ML program for real & simulated data, faster than MrBayes
- ML (PHYML, RAxML2):
  + Significantly faster than MrBayes
  + Reference/starting trees for Bayesian methods
  -- Less powerful statistical model
- Bayesian Inference (MrBayes):
  + Powerful statistical model
  -- MCMC convergence problem
- Memory requirements for 1000/10000-taxon alignment:
  – RAxML: 200MB/750MB
  – PHYML: 900MB/8.8GB
  – MrBayes: 1150MB/unknown


MCMC Convergence Problem


What does ML compute?

Maximum Likelihood calculates:

1. Topologies

2. Branch lengths v[i]

3. Likelihood of the tree

Goal: Find the tree topology which maximizes the likelihood

Problem I: Number of possible topologies is exponential in n

Problem II: Computation of likelihood value + branch length optimization is expensive

Solution: Algorithmic Optimizations (previous work) + New heuristics + HPC

[Figure: tree with sequences S1-S5 and branch lengths v1-v7]
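To give a feeling for Problem I, the sketch below counts the distinct unrooted binary topologies for n labelled sequences using the standard formula (2n-5)!! = 3 · 5 · ... · (2n-5). The function name is hypothetical; the formula is a general combinatorial result, not something specific to RAxML.

```python
def num_unrooted_topologies(n):
    """Number of unrooted binary tree topologies on n >= 3 labelled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n-5.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

print(num_unrooted_topologies(5))   # 15
print(num_unrooted_topologies(10))  # 2027025
# Number of decimal digits of the count for a 1000-taxon tree:
print(len(str(num_unrooted_topologies(1000))))
```

Even for the 5-sequence tree of the figure there are already 15 candidate topologies, which is why exhaustive search is out of the question for hundreds or thousands of taxa.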


New Heuristics for RAxML

Two common methods to build a tree:

1. Progressive addition of organisms, e.g. stepwise addition algorithm

2. Use a (random, simple) starting tree containing all organisms and optimize likelihood by application of topological changes

- RAxML (Randomized Axelerated Maximum Likelihood) computes parsimony starting tree with dnapars -> fast and relatively „good“ initial likelihood
- dnapars uses stepwise addition -> randomized sequence input order to obtain distinct starting trees
- Optimize starting tree by application of rearrangements
- Accelerate rearrangements by two simple ideas

Subtree Rearrangements

[Figure sequence, slides 10-19: tree built from subtrees ST1-ST6; successive frames are labelled +1 and +2, followed by "Optimize all branches" and the question "Need to optimize all branches?"]

Idea 1: Local Optimization of Branch Length

[Figure, slides 20-21: tree built from subtrees ST1-ST6]

Why is Idea 1 useful?

Local optimization of branch lengths:
– Update fewer likelihood vectors -> significantly faster
– Allows higher rearrangement settings -> better trees

- Likelihood depends strongly on topology
- Fast exploration of large number of topologies
- Straightforward parallelization
- Store best 20 trees from each rearrangement step
- Branch length optimization of best 20 trees only
- Experimental results justify this mechanism
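The "best 20 trees" mechanism listed above can be sketched as a two-stage filter: score every candidate topology cheaply, then run the expensive optimization of all branches only on the 20 best. This is an illustrative sketch, not RAxML code; `rearrangement_step`, `quick_lnl` and `optimize_all_branches` are hypothetical names, and integers stand in for topologies.

```python
import heapq

def rearrangement_step(candidates, quick_lnl, optimize_all_branches, keep=20):
    """Return (lnL, tree) of the best tree after one rearrangement step.

    quick_lnl: cheap score, e.g. likelihood after local branch optimization.
    optimize_all_branches: expensive score, applied to the `keep` best only.
    """
    best = heapq.nlargest(keep, candidates, key=quick_lnl)
    scored = [(optimize_all_branches(tree), tree) for tree in best]
    return max(scored, key=lambda pair: pair[0])

# Toy stand-ins: "trees" are integers, 37 is the (made-up) optimum.
full_calls = []

def quick_lnl(tree):
    return -abs(tree - 37)

def optimize_all_branches(tree):
    full_calls.append(tree)          # record how often the costly step runs
    return quick_lnl(tree) + 0.5     # pretend full optimization adds 0.5

best_lnl, best_tree = rearrangement_step(range(1000), quick_lnl,
                                         optimize_all_branches)
print(best_tree, best_lnl, len(full_calls))  # 37 0.5 20
```

In the toy run 1000 candidates are scored cheaply, but the expensive function is called only 20 times.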


Idea 2: Subsequent Application of Topological Changes

[Figure sequence, slides 23-26: tree built from subtrees ST1-ST6; each frame adds one further copy of the tree with ST3 at a different position]

Why is Idea 2 useful?

During the initial 5-10 rearrangement steps many improved topologies are encountered

Acceleration of likelihood improvement in initial optimization phase

Enables fast optimization of random starting trees
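One way to read Idea 2 is that an improved topology is kept immediately, so later rearrangements in the same step already start from the improved tree. The sketch below contrasts this with a step that evaluates all moves on the unchanged starting tree and applies only the best one. It is a toy illustration under that assumption: states are tuples of integers rather than trees, and all names (`step_batch`, `step_subsequent`, `bump`, `TARGET`) are hypothetical.

```python
def step_batch(state, moves, score):
    """Evaluate all moves on the same starting state; apply only the best."""
    candidates = [move(state) for move in moves]
    best = max(candidates, key=score)
    return best if score(best) > score(state) else state

def step_subsequent(state, moves, score):
    """Apply each improving move as soon as it is found and continue from it."""
    for move in moves:
        candidate = move(state)
        if score(candidate) > score(state):
            state = candidate
    return state

TARGET = (1, 1, 1, 1)  # made-up optimum of the toy problem

def score(state):
    # Higher is better; 0 means the optimum has been reached.
    return -sum(abs(x - t) for x, t in zip(state, TARGET))

def bump(i):
    # A "move" that increments coordinate i of the state.
    return lambda s: s[:i] + (s[i] + 1,) + s[i + 1:]

moves = [bump(i) for i in range(4)]
start = (0, 0, 0, 0)

print(score(step_batch(start, moves, score)))       # -3
print(score(step_subsequent(start, moves, score)))  # 0
```

In the toy problem one pass with immediate application collects all four improvements, while the batch variant collects one, which mirrors the faster likelihood improvement in the initial phase claimed on the slide.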


Remainder of this Talk

- Motivation
- Introduction to phylogenetic tree inference
- Statistical inference methods
- Maximum Likelihood & associated problems
- Solutions:
  – 2 simple heuristics
  – parallel & distributed implementation
- Results
- Conclusion
- Availability & Future Work


Basic Parallel & Distributed Algorithm

- Basic idea: Distribute work by subtrees instead of topologies (e.g. parallel fastDNAml)
- Simple Master-Worker architecture
- Subsequent application of topological changes introduces non-determinism

[Figure sequence, slides 29-31: tree built from subtrees ST1-ST6, annotated with MPI_Send(ST3_ID, tree) and MPI_Send(ST2_ID, tree)]

Differences between Parallel & Distributed Algorithm

- Parallel: best tree list of max(20, #workers) maintained and merged at the master
- Parallel: Master distributes the max(20, #workers) trees as topology strings to workers for branch length optimization
- Distributed: Each worker maintains local best list of 20 trees
- Distributed: Worker performs fast branch length optimizations locally on all 20 trees -> returns only best topology to the master
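The two bookkeeping schemes can be contrasted in a few lines. This is a sketch only, without any MPI calls: trees are represented as hypothetical (lnL, topology string) pairs, and `merge_at_master`, `distributed_worker` and `optimize_branches` are made-up names.

```python
import heapq

def merge_at_master(worker_results, n_workers, base=20):
    """Parallel variant: the master merges all worker lists and keeps
    the max(base, n_workers) trees with the highest likelihood."""
    size = max(base, n_workers)
    all_trees = [tree for result in worker_results for tree in result]
    return heapq.nlargest(size, all_trees, key=lambda tree: tree[0])

def distributed_worker(local_trees, optimize_branches, keep=20):
    """Distributed variant: the worker keeps its own best list, optimizes
    branch lengths locally and returns only the single best tree."""
    local_best = heapq.nlargest(keep, local_trees, key=lambda tree: tree[0])
    optimized = [(optimize_branches(lnl, topo), topo) for lnl, topo in local_best]
    return max(optimized, key=lambda tree: tree[0])

# Made-up results of 3 workers with 10 trees each.
worker_results = [
    [(-1000.0 - 10 * w - i, f"tree_w{w}_{i}") for i in range(10)]
    for w in range(3)
]

merged = merge_at_master(worker_results, n_workers=3)
print(len(merged), merged[0])  # 20 (-1000.0, 'tree_w0_0')

# Pretend that local branch length optimization improves lnL by 1.0.
best = distributed_worker(worker_results[0], lambda lnl, topo: lnl + 1.0)
print(best)                    # (-999.0, 'tree_w0_0')
```

The distributed variant sends one topology per worker back to the master instead of a whole list, which reduces communication at the cost of doing the branch length optimization on the worker.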


Sequential Results

50 distinct simulated 100-taxon alignments

- Measured average execution times & topological distance (RF-rate) from „true“ tree

- PHYML: 35.21 seconds, RF-rate: 0.0796

- MrBayes: 945.32 seconds, RF-rate: 0.0741

- RAxML: 29.27 seconds, RF-rate: 0.0818

9 distinct real alignments containing 101-1000 taxa

- Measured execution times & final likelihood values

- RAxML yields best-known likelihood for all data sets

- RAxML faster than PHYML & MrBayes
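The RF-rate used for the simulated data is not defined on the slide. A common definition, assumed here, is the Robinson-Foulds distance (number of bipartitions found in only one of the two trees) divided by its maximum 2(n-3) for unrooted binary trees with n taxa. The sketch below uses nested tuples of taxon names as a hypothetical tree representation.

```python
def leaf_set(node):
    """All taxon names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))

def splits(tree):
    """Non-trivial bipartitions of a tree, each stored as the side
    that does not contain a fixed reference taxon."""
    taxa = leaf_set(tree)
    ref = min(taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = taxa - clade if ref in clade else clade
        if 1 < len(side) < len(taxa) - 1:  # skip trivial splits
            found.add(side)
        return clade

    visit(tree)
    return found

def rf_rate(tree_a, tree_b):
    """Robinson-Foulds distance normalized by 2(n-3); assumes both
    trees are binary and contain the same n > 3 taxa."""
    n = len(leaf_set(tree_a))
    if n <= 3:
        raise ValueError("need more than 3 taxa")
    return len(splits(tree_a) ^ splits(tree_b)) / (2 * (n - 3))

# Made-up 5-taxon trees that differ in one inner branch.
true_tree = (("A", "B"), "C", ("D", "E"))
inferred = (("A", "C"), "B", ("D", "E"))

print(rf_rate(true_tree, true_tree))  # 0.0
print(rf_rate(true_tree, inferred))   # 0.5
```

Under this definition a value such as 0.08 means that roughly 8% of the inner branches differ between the inferred tree and the „true“ tree.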


Sequential Results: Real Data

data       PHYML      secs   MrBayes    secs    RAxML      secs   R > PHY secs  PAXML      hrs
101_SC     -74097.6   153    -77191.5   40527   -73919.3   617    31            -73975.9   47
150_SC     -44298.1   158    -52028.4   49427   -44142.6   390    33            -44146.9   164
150_ARB    -77219.7   313    -77196.7   29383   -77189.7   178    67            -77189.8   300
200_ARB    -104826.5  477    -104856.4  156419  -104742.6  272    99            -104743.3  775
250_ARB    -131560.3  787    -133238.3  158418  -131468.0  1067   249           -131469.0  1947
500_ARB    -253354.2  2235   -263217.8  366496  -252499.4  26124  493           -252588.1  7372
1000_ARB   -402215.0  16594  -459392.4  509148  -400925.3  50729  1893          -402282.1  9898
218_RDPII  -157923.1  403    -158911.6  138453  -157526.0  6774   244           n/a        n/a
500_ZILLA  -22186.8   2400   -22259.0   96557   -21033.9   29916  67            n/a        n/a


[Figures, slides 36-38: Sequential Results: Real Data]

Parallel Results: Speedup 1000_ARB

[Figure: speedup for 1000_ARB]

Distributed Results: First Tests

Platforms:
– Infiniband-Cluster: 10 Intel Xeon 2.4 GHz
– Sunhalle: 50 Sun-Workstations for CS students

Alignments:
– 1000_ARB
– 2025_ARB
– Larger trees to come

Results:
– Program executed correctly & terminated
– RAxML@home yielded best-known tree for 2025_ARB


Biological Results: 1st ML 10.000-taxon tree

- Calculated 5 parsimony starting trees + 3-4 initial rearrangement steps sequentially on Xeon 2.4GHz
- Further rearrangements of those 5 trees in parallel on 32 or 64 Xeon 2.66GHz at RRZE
- Accumulated CPU hours/tree ~ 3200 hours
- Best ln likelihood: -949539, worst: -950026
- Problems:
  – Quality assessment? bootstrap not feasible
  – Consense crashes for > 5 trees
  – MrBayes/PHYML crash on 32-bit/4GB
  – MrBayes crashed on Itanium
  – Visualization?


Conclusion

- RAxML not able to handle protein data
- RAxML not able to perform model parameter optimization
- BUT:
  – RAxML easy to parallelize/distribute
  – Accurate & fast for large trees
  – Significantly lower memory requirements than MrBayes/PHYML
- Conclusion: Implement model parameter optimization & protein data in RAxML


Availability & Future Work

- Further development & distribution of RAxML@home
- Big production runs with RAxML@home
- Survey: ML supertrees vs. integral trees
- Alignment split-up methods for ML supertrees
- RAxML implementation on GPUs
- RAxML2 download, benchmark, code: wwwbode.in.tum.de/~stamatak
- RAxML@home development: www.sourceforge.com/projects/axml