Alexandros Stamatakis LRR TU München Contact: [email protected]


Parallel & Distributed Systems and Algorithms for Inference of Large Phylogenetic Trees with Maximum Likelihood


ICS/IMBB Iraklion, Alexandros Stamatakis: Phylogenetic Inference with RAxML2

Outline
Motivation
Introduction to phylogenetic tree inference
Statistical inference methods
Maximum Likelihood & associated problems
Solutions:
– 2 simple heuristics
– parallel & distributed implementation
Results
Conclusion
Availability & Future Work


Motivation: Towards a "Tree of Life"
30,000 organisms available, current trees <= 1000
Where we are: (figure)
Where we want to get: (figure)


Phylogenetic Tree Inference
Input: "good" multiple alignment of a distinguished, highly conserved part of DNA sequences
Output: unrooted binary tree with the sequences at its leaves (all nodes: either degree 1 or 3)
Various methods for phylogenetic tree inference
Differ in computational complexity and quality of trees
Most accurate methods: Maximum Likelihood Method (ML) and Bayesian Phylogenetic Inference:
+ most sound and flexible methods
+ other methods not suited for large/complex trees
-- most computationally intensive methods


ML and Bayesian methods
T. Williams et al. (March 2003): comparative analysis with simulated data shows that MrBayes is the best program
Guindon et al. (May 2003): PHYML, a very fast & accurate ML program for real & simulated data, faster than MrBayes
ML (PHYML, RAxML2):
+ Significantly faster than MrBayes
+ Reference/starting trees for Bayesian methods
-- Less powerful statistical model
Bayesian Inference (MrBayes):
+ Powerful statistical model
-- MCMC convergence problem
Memory requirements for 1000/10000-taxon alignment:
– RAxML: 200MB/750MB
– PHYML: 900MB/8.8GB
– MrBayes: 1150MB/unknown

MCMC Convergence Problem (figure)


What does ML compute?
Maximum Likelihood calculates:
1. Topologies
2. Branch lengths v[i]
3. Likelihood of the tree
Goal: Find the tree topology which maximizes the likelihood
Problem I: Number of possible topologies is exponential in n
Problem II: Computation of likelihood value + branch length optimization is expensive
Solution: Algorithmic Optimizations (previous work) + New heuristics + HPC

(Figure: unrooted tree with sequences S1-S5 at the leaves and branch lengths v1-v7)
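To make Problem I concrete, here is a small sketch that counts topologies. It uses the standard count of unrooted binary trees on n labeled leaves, (2n-5)!! = 3 * 5 * ... * (2n-5). The slide does not state this formula, and the function name is chosen only for this illustration.

```python
def num_unrooted_topologies(n: int) -> int:
    """Number of distinct unrooted binary trees with n labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # A tree with k-1 leaves has 2k-5 branches, so the k-th leaf
    # can be attached in 2k-5 different places.
    for k in range(4, n + 1):
        count *= 2 * k - 5
    return count


for n in (5, 10, 50):
    print(n, num_unrooted_topologies(n))
```

For the five-sequence tree of the figure this gives 15 topologies; for 10 sequences it is already 2,027,025, which is why exhaustive search is not an option.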


New Heuristics for RAxML
Two common methods to build a tree:
1. Progressive addition of organisms, e.g. stepwise addition algorithm
2. Use a (random, simple) starting tree containing all organisms and optimize likelihood by application of topological changes
RAxML (Randomized Axelerated Maximum Likelihood) computes parsimony starting tree with dnapars
-> fast and relatively "good" initial likelihood
dnapars uses stepwise addition -> randomized sequence input order to obtain distinct starting trees
Optimize starting tree by application of rearrangements
Accelerate rearrangements by two simple ideas
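The randomization step can be sketched as follows. This only generates distinct input orders; feeding each order to a stepwise-addition program such as dnapars is not shown, and distinct orders make distinct starting trees likely rather than guaranteed. The function name, the seed parameter and the taxon labels are assumptions made for this illustration.

```python
import math
import random


def randomized_input_orders(taxa, num_orders, seed=0):
    """Return num_orders distinct random orderings of the taxa."""
    if num_orders > math.factorial(len(taxa)):
        raise ValueError("not enough distinct orderings")
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    orders, seen = [], set()
    while len(orders) < num_orders:
        order = tuple(rng.sample(taxa, len(taxa)))
        if order not in seen:  # skip an ordering that was already drawn
            seen.add(order)
            orders.append(order)
    return orders


orders = randomized_input_orders(["S1", "S2", "S3", "S4", "S5"], num_orders=3)
for order in orders:
    print(" ".join(order))
```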

Subtree Rearrangements
(Figure sequence: tree of subtrees ST1-ST6, with rearrangement steps labeled +1 and +2)
Optimize all branches
Need to optimize all branches?
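A minimal sketch of how candidate insertion branches for a pruned subtree could be enumerated. It assumes that +1 and +2 in the figures denote the distance between the pruning position and the insertion branch. The tree shape, the internal node names a-d and both function names are invented for this example; the topology of the slide figure is not preserved in the transcript.

```python
from collections import deque


def prune(tree, subtree_root):
    """Detach subtree_root (a leaf standing for a whole subtree).

    Returns the remaining tree and the branch that closes the gap."""
    (attach,) = tree[subtree_root]  # the single neighbor of the subtree root
    left, right = sorted(tree[attach] - {subtree_root})
    pruned = {node: set(nbrs) for node, nbrs in tree.items()
              if node not in (subtree_root, attach)}
    pruned[left].discard(attach)
    pruned[right].discard(attach)
    pruned[left].add(right)
    pruned[right].add(left)
    return pruned, frozenset((left, right))


def branches_by_distance(tree, origin, max_distance):
    """Breadth-first search over branches: distance -> branches that far from origin."""
    result = {}
    seen = {origin}
    frontier = deque([(origin, 0)])
    while frontier:
        branch, dist = frontier.popleft()
        if dist == max_distance:
            continue
        for node in branch:
            for nbr in tree[node]:
                nxt = frozenset((node, nbr))
                if nxt not in seen:
                    seen.add(nxt)
                    result.setdefault(dist + 1, []).append(nxt)
                    frontier.append((nxt, dist + 1))
    return result


# Hypothetical tree: leaves ST1-ST6 stand for subtrees, a-d are internal nodes.
tree = {
    "ST1": {"a"}, "ST2": {"a"}, "ST3": {"b"}, "ST4": {"c"}, "ST5": {"d"}, "ST6": {"d"},
    "a": {"ST1", "ST2", "b"}, "b": {"a", "ST3", "c"},
    "c": {"b", "ST4", "d"}, "d": {"c", "ST5", "ST6"},
}
pruned, origin = prune(tree, "ST3")
candidates = branches_by_distance(pruned, origin, max_distance=2)
for dist in sorted(candidates):
    print(f"+{dist}:", sorted(tuple(sorted(b)) for b in candidates[dist]))
```

Each candidate branch corresponds to one topology whose likelihood has to be evaluated, so the rearrangement setting directly controls the cost of a step.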

Idea 1: Local Optimization of Branch Length
(Figure sequence: tree of subtrees ST1-ST6)

Why is Idea 1 useful?
Local optimization of branch lengths:
– Update fewer likelihood vectors -> significantly faster
– Allows higher rearrangement settings -> better trees
Likelihood depends strongly on topology
Fast exploration of large number of topologies
Straightforward parallelization
Store best 20 trees from each rearrangement step
Branch length optimization of best 20 trees only
Experimental results justify this mechanism
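The "best 20 trees" list can be sketched as a bounded min-heap. This is an illustration under assumptions, not RAxML's data structure: the class name, the topology strings and the toy scores are invented, and the likelihood values would come from the fast, locally optimized evaluation.

```python
import heapq


class BestTreeList:
    """Keep the `capacity` highest-scoring (log likelihood, topology) pairs seen so far."""

    def __init__(self, capacity=20):
        self.capacity = capacity
        self._heap = []  # min-heap: the worst retained tree sits at index 0

    def offer(self, log_likelihood, topology):
        entry = (log_likelihood, topology)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            heapq.heapreplace(self._heap, entry)  # drop the worst, keep the new tree

    def best_first(self):
        return sorted(self._heap, reverse=True)


best = BestTreeList(capacity=20)
for i in range(100):
    # Toy scores: every value from -1000 down to -1099 occurs exactly once.
    best.offer(-1000.0 - (i * 37 % 100), f"topology_{i}")

# Only these 20 trees would go through the expensive branch length optimization.
print(len(best.best_first()), best.best_first()[0])
```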

Idea 2: Subsequent Application of Topological Changes
(Figure sequence: trees of subtrees ST1-ST6)

Why is Idea 2 useful?
During the initial 5-10 rearrangement steps many improved topologies are encountered
Acceleration of likelihood improvement in the initial optimization phase
Enables fast optimization of random starting trees
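The effect of Idea 2 can be shown on a toy search problem. This is a stand-in, not RAxML's tree search: the "topology" is a list, a "topological change" is a swap of two neighbors, and the "likelihood" is minus the number of inversions. All names are invented for this illustration.

```python
def score(order):
    """Toy stand-in for the likelihood: minus the number of inversions (0 is optimal)."""
    return -sum(1 for i in range(len(order)) for j in range(i + 1, len(order))
                if order[i] > order[j])


def swapped(order, i):
    new = list(order)
    new[i], new[i + 1] = new[i + 1], new[i]
    return new


def step_deferred(order):
    """Evaluate every change against the state at the start of the step; apply the best one."""
    best = max((swapped(order, i) for i in range(len(order) - 1)), key=score)
    return best if score(best) > score(order) else order


def step_subsequent(order):
    """Apply each improving change immediately; later changes build on it."""
    current = list(order)
    for i in range(len(current) - 1):
        candidate = swapped(current, i)
        if score(candidate) > score(current):
            current = candidate
    return current


def steps_to_converge(step, order):
    steps = 0
    while True:
        new = step(order)
        if new == order:
            return steps
        order, steps = new, steps + 1


start = [5, 4, 3, 2, 1, 0]
print("deferred:  ", steps_to_converge(step_deferred, start))    # 15 steps
print("subsequent:", steps_to_converge(step_subsequent, start))  # 5 steps
```

In this toy setting a bad starting point is repaired in a third of the steps, which matches the slide's point that the gain is largest in the initial optimization phase.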


Remainder of this Talk
Motivation
Introduction to phylogenetic tree inference
Statistical inference methods
Maximum Likelihood & associated problems
Solutions:
– 2 simple heuristics
– parallel & distributed implementation
Results
Conclusion
Availability & Future Work


Basic Parallel & Distributed Algorithm
Basic idea: Distribute work by subtrees instead of by topologies (as in e.g. parallel fastDNAml)
Simple Master-Worker architecture
Subsequent application of topological changes introduces non-determinism
(Figure sequence: tree of subtrees ST1-ST6; the master sends one subtree ID together with the tree to each worker)
MPI_Send(ST3_ID, tree)
MPI_Send(ST2_ID, tree)
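A rough sketch of the master-worker scheme. It uses a Python thread pool as a stand-in for MPI, so there is no real message passing here; the worker function, its toy score and the string "tree" are assumptions made only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def rearrange_subtree(subtree_id, tree):
    """Hypothetical worker job: rearrange one subtree and report the best result.

    The score is a deterministic toy value, not a likelihood."""
    toy_score = -1000.0 + (subtree_id * 7) % 5
    return toy_score, f"{tree}+moved(ST{subtree_id})"


def master(tree, subtree_ids, num_workers=3):
    best_score, best_tree = float("-inf"), tree
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # One job per subtree, mirroring MPI_Send(ST_ID, tree) on the slide.
        jobs = [pool.submit(rearrange_subtree, st, tree) for st in subtree_ids]
        # Results arrive in completion order. If the master applied every
        # improvement immediately (Idea 2), the outcome could depend on that order.
        for job in as_completed(jobs):
            score, candidate = job.result()
            if score > best_score:
                best_score, best_tree = score, candidate
    return best_score, best_tree


print(master("start", subtree_ids=range(1, 7)))
```

Here the master only keeps the single best result, so the answer does not depend on arrival order; the non-determinism mentioned on the slide appears once improved trees are handed out again while other workers are still busy.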


Differences between Parallel & Distributed Algorithm
Parallel: best tree list of max(20, #workers) maintained and merged at the master
Parallel: Master distributes max(20, #workers) trees as topology-strings to workers for branch length optimization
Distributed: Each worker maintains local best list of 20 trees
Distributed: Worker performs fast branch length optimizations locally on all 20 trees -> returns only best topology to the master
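The two list-handling policies can be contrasted in a few lines. This is a sketch under assumptions: trees are (score, topology-string) pairs, `optimize_branches` is a hypothetical callable that returns the score after branch length optimization, and both function names are invented here.

```python
import heapq


def parallel_master_merge(worker_lists, num_workers):
    """Parallel variant: the master merges all lists and keeps max(20, #workers) trees."""
    keep = max(20, num_workers)
    merged = [entry for worker_list in worker_lists for entry in worker_list]
    return heapq.nlargest(keep, merged)


def distributed_worker_report(local_list, optimize_branches):
    """Distributed variant: optimize the local best 20 and return only the best tree."""
    top = heapq.nlargest(20, local_list)
    optimized = [(optimize_branches(score, topology), topology) for score, topology in top]
    return max(optimized)


# Toy data: 3 workers with 30 trees each, all scores distinct.
worker_lists = [
    [(-1000.0 - (w * 30 + i), f"w{w}_t{i}") for i in range(30)]
    for w in range(3)
]
kept = parallel_master_merge(worker_lists, num_workers=3)
print(len(kept), kept[0])
best = distributed_worker_report(worker_lists[1], lambda score, topology: score + 0.5)
print(best)
```

The distributed variant sends one topology back instead of a whole list, which keeps communication low when workers sit on a slow network.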


Sequential Results
50 distinct simulated 100-taxon alignments
- Measured average execution times & topological distance (RF-rate) from "true" tree
- PHYML: 35.21 seconds, RF-rate: 0.0796
- MrBayes: 945.32 seconds, RF-rate: 0.0741
- RAxML: 29.27 seconds, RF-rate: 0.0818
9 distinct real alignments containing 101-1000 taxa
- Measured execution times & final likelihood values
- RAxML yields best-known likelihood for all data sets
- RAxML faster than PHYML & MrBayes
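The slides do not define the RF-rate. A common convention, assumed here, is the Robinson-Foulds distance (number of bipartitions found in exactly one of the two trees) divided by its maximum 2(n-3) for unrooted binary trees with n taxa. The helper names and the five-taxon example are invented for this sketch.

```python
def canonical_split(side, taxa, reference):
    """Represent a bipartition by the side that does not contain the reference taxon."""
    side = frozenset(side)
    return side if reference not in side else frozenset(taxa) - side


def rf_rate(splits_a, splits_b, num_taxa):
    """Robinson-Foulds distance of two unrooted binary trees, scaled to [0, 1]."""
    differing = len(splits_a ^ splits_b)  # splits present in exactly one tree
    # Each binary tree has num_taxa - 3 nontrivial splits, so at most
    # 2 * (num_taxa - 3) splits can differ.
    return differing / (2 * (num_taxa - 3))


taxa = {"A", "B", "C", "D", "E"}
# Tree 1: ((A,B),C,(D,E))   Tree 2: ((A,C),B,(D,E))
tree1 = {canonical_split(s, taxa, "A") for s in ({"A", "B"}, {"D", "E"})}
tree2 = {canonical_split(s, taxa, "A") for s in ({"A", "C"}, {"D", "E"})}
print(rf_rate(tree1, tree2, len(taxa)))  # 0.5: one of the two splits is shared
```

Under this convention an RF-rate of 0.08 for a 100-taxon tree means that roughly 8 percent of the inner branches differ from the "true" tree.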


Sequential Results: Real Data
(for each program: final likelihood value, followed by the execution time)

data        PHYML      secs   MrBayes    secs    RAxML      secs   R > PHY secs  PAXML      hrs
101_SC      -74097.6   153    -77191.5   40527   -73919.3   617    31            -73975.9   47
150_SC      -44298.1   158    -52028.4   49427   -44142.6   390    33            -44146.9   164
150_ARB     -77219.7   313    -77196.7   29383   -77189.7   178    67            -77189.8   300
200_ARB     -104826.5  477    -104856.4  156419  -104742.6  272    99            -104743.3  775
250_ARB     -131560.3  787    -133238.3  158418  -131468.0  1067   249           -131469.0  1947
500_ARB     -253354.2  2235   -263217.8  366496  -252499.4  26124  493           -252588.1  7372
1000_ARB    -402215.0  16594  -459392.4  509148  -400925.3  50729  1893          -402282.1  9898
218_RDPII   -157923.1  403    -158911.6  138453  -157526.0  6774   244           n/a        n/a
500_ZILLA   -22186.8   2400   -22259.0   96557   -21033.9   29916  67            n/a        n/a


Sequential Results: Real Data (figures on three slides)
Parallel Results: Speedup 1000_ARB (figure)


Distributed Results: First Tests
Platforms:
– Infiniband-Cluster: 10 Intel Xeon 2.4 GHz
– Sunhalle: 50 Sun-Workstations for CS students
Alignments:
– 1000_ARB
– 2025_ARB
– Larger trees to come
Results:
– Program executed correctly & terminated
– RAxML@home yielded best-known tree for 2025_ARB


Biological Results: 1st ML 10,000-taxon tree
Calculated 5 parsimony starting trees + 3-4 initial rearrangement steps sequentially on Xeon 2.4GHz
Further rearrangements of those 5 trees in parallel on 32 or 64 Xeon 2.66GHz at RRZE
Accumulated CPU hours/tree ~ 3200 hours
Best ln likelihood: -949539, worst: -950026
Problems:
– Quality assessment? Bootstrap not feasible
– Consense crashes for > 5 trees
– MrBayes/PHYML crash on 32-bit/4GB
– MrBayes crashed on Itanium
– Visualization?


Conclusion
RAxML not able to handle protein data
RAxML not able to perform model parameter optimization
BUT:
– RAxML easy to parallelize/distribute
– Accurate & fast for large trees
– Significantly lower memory requirements than MrBayes/PHYML
Conclusion: Implement model parameter optimization & protein data in RAxML


Availability & Future Work
Further development & distribution of RAxML@home
Big production runs with RAxML@home
Survey: ML supertrees vs. integral trees
Alignment split-up methods for ML supertrees
RAxML implementation on GPUs
RAxML2 download, benchmark, code: wwwbode.in.tum.de/~stamatak
RAxML@home development: www.sourceforge.com/projects/axml