Crunching Huge Phylogenies. A. Stamatakis

Page 1: Crunching Huge Phylogenies. A. Stamatakis

Crunching Huge Phylogenies: A Rapid Bootstrap Algorithm and Massive Parallelism on the IBM BlueGene

Alexandros Stamatakis

Swiss Federal Institute of Technology Lausanne (EPFL)
School of Computer & Communication Sciences
Laboratory for Computational Biology and Bioinformatics
Lausanne, Switzerland
&
Swiss Institute of Bioinformatics

icwww.epfl.ch/~stamatak

Page 2: Crunching Huge Phylogenies. A. Stamatakis

Alexandros Stamatakis, October 2007

The Missing Part

Data Assembly, Tree Analysis, Inference ?

Page 3: Crunching Huge Phylogenies. A. Stamatakis

The Missing Part

Data Assembly Tree Analysis

Page 4: Crunching Huge Phylogenies. A. Stamatakis

IBM BlueGene/L supercomputer

Page 5: Crunching Huge Phylogenies. A. Stamatakis

Rapid Bootstrapping

Bootstopping Criterion

Page 6: Crunching Huge Phylogenies. A. Stamatakis

The Big Hardware Problem

1980 2007

CPU Speed 40% p.a.

Memory Speed 9% p.a.
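
To put the two growth rates side by side, a small compounding calculation helps. The sketch below assumes that both rates held constantly over the 27 years from 1980 to 2007, which is a simplification; the function name is illustrative and not from the talk.

```python
def compound_growth(rate_per_year, years):
    """Total growth factor after compounding rate_per_year for the given number of years."""
    return (1.0 + rate_per_year) ** years

YEARS = 2007 - 1980  # the span shown on the slide

cpu_factor = compound_growth(0.40, YEARS)     # CPU speed, 40% p.a.
memory_factor = compound_growth(0.09, YEARS)  # memory speed, 9% p.a.
gap = cpu_factor / memory_factor              # how far memory falls behind the CPU

print(f"CPU speed grew by a factor of about {cpu_factor:,.0f}")
print(f"Memory speed grew by a factor of about {memory_factor:,.1f}")
print(f"Resulting CPU/memory gap: about {gap:,.0f}x")
```

Under these assumptions the gap between processor and memory speed grows to several hundred-fold, which is why memory access patterns matter so much for likelihood code.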

Page 7: Crunching Huge Phylogenies. A. Stamatakis

... and why this concerns Bioinformatics

1980 2007

CPU Speed 40% p.a.

Memory Speed 9% p.a.

Sequence Data

Page 8: Crunching Huge Phylogenies. A. Stamatakis

... and why this concerns Bioinformatics

1980 2007

CPU Speed 40% p.a.

Memory Speed 9% p.a.

Sequence Data

Application of HPC techniques will become much more important

Page 9: Crunching Huge Phylogenies. A. Stamatakis

Cache Hierarchy

Page 10: Crunching Huge Phylogenies. A. Stamatakis

Outline

● Introduction
  ● Computation of Phylogenies
  ● Maximum Likelihood
  ● Web & Grid Services
● Three Steps Towards the Tree of Life
  ● Parallelism on IBM BlueGene/L
  ● Rapid Bootstrapping
  ● A Bootstopping criterion
● Related Projects
● Outlook

Page 11: Crunching Huge Phylogenies. A. Stamatakis

Phylogenetics

● Input: “good” multiple alignment
● Output: unrooted binary tree
● Various methods for phylogenetic inference
  ● Neighbour Joining (fast & simple)
  ● Maximum Parsimony (relatively fast & simple)
  ● Maximum Likelihood (complex & slow)
  ● Bayesian Methods (complex & slower)

Page 12: Crunching Huge Phylogenies. A. Stamatakis

Phylogenetics

ML & Bayesian: explicit model choice

Page 13: Crunching Huge Phylogenies. A. Stamatakis

Phylogenetics

Complex methods & models are required to reconstruct large & complicated trees!

Focus of this talk is on Maximum Likelihood!

Page 14: Crunching Huge Phylogenies. A. Stamatakis

Phylogenetics

The real reason for working on ML: ......

Page 15: Crunching Huge Phylogenies. A. Stamatakis

Challenges for Phyloinformatics

● Holy grail: “Tree of Life”
● What is a good alignment in a phylogenetic context?
● Simultaneous alignment and tree building
● Improve/extend models ... but thereby size of computable trees decreases!
● More HPC awareness
● Exploit multi-core architectures
● Amount of available data grows at a higher rate than algorithms are getting faster

Page 16: Crunching Huge Phylogenies. A. Stamatakis

The algorithmic problem

Page 17: Crunching Huge Phylogenies. A. Stamatakis

The number of trees

Page 18: Crunching Huge Phylogenies. A. Stamatakis

The number of trees

Page 19: Crunching Huge Phylogenies. A. Stamatakis

The number of trees

Page 20: Crunching Huge Phylogenies. A. Stamatakis

The number of trees explodes!

BANG !
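
For reference, the count behind this explosion is a standard combinatorial result: n labelled taxa admit (2n-5)!! = 3 · 5 · ... · (2n-5) distinct unrooted binary trees. The sketch below simply evaluates that formula; the function name is illustrative and not taken from the slides.

```python
def num_unrooted_binary_trees(n_taxa):
    """Number of distinct unrooted binary tree topologies for n_taxa labelled leaves.

    Uses the standard formula (2n - 5)!! for n >= 3.
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # The k-th taxon can be attached to any of the 2k - 5 branches
    # of a tree that already holds k - 1 taxa.
    for k in range(4, n_taxa + 1):
        count *= 2 * k - 5
    return count

for n in (4, 5, 10, 20, 50):
    print(n, num_unrooted_binary_trees(n))
```

Because the count grows faster than exponentially, exhaustive evaluation of all topologies is out of the question beyond a handful of taxa, which is why heuristic searches are used.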

Page 21: Crunching Huge Phylogenies. A. Stamatakis

Outline

● Introduction
  ● Computation of Phylogenies
  ● Maximum Likelihood
  ● Web & Grid Services
● Three Steps Towards the Tree of Life
  ● Parallelism on IBM BlueGene/L
  ● Rapid Bootstrapping
  ● A Bootstopping criterion
● Related Projects
● Outlook

Page 22: Crunching Huge Phylogenies. A. Stamatakis

Maximum Likelihood

Alignment: Seq1, Seq2, Seq3, Seq4

Length: m

Page 23: Crunching Huge Phylogenies. A. Stamatakis

Maximum Likelihood

Alignment: Seq1, Seq2, Seq3, Seq4

Length: m

Substitution model: A, C, G, T

Page 24: Crunching Huge Phylogenies. A. Stamatakis

Maximum Likelihood

Alignment: Seq1, Seq2, Seq3, Seq4

Length: m

Substitution model: A, C, G, T

Prior probabilities, empirical base frequencies: πA, πC, πG, πT

Page 25: Crunching Huge Phylogenies. A. Stamatakis

Maximum Likelihood

Alignment: Seq1, Seq2, Seq3, Seq4

Length: m

Substitution model: A, C, G, T

Prior probabilities, empirical base frequencies: πA, πC, πG, πT

Tree: Seq 1, Seq 2, Seq 3, Seq 4; branches b1, b2, b3, b4, b5

Page 26: Crunching Huge Phylogenies. A. Stamatakis

Maximum Likelihood

Alignment: Seq1, Seq2, Seq3, Seq4

Length: m

Substitution model: A, C, G, T

Prior probabilities, empirical base frequencies: πA, πC, πG, πT

Tree: Seq 1, Seq 2, Seq 3, Seq 4; branches b1, b2, b3, b4, b5

virtual root: vr

Page 27: Crunching Huge Phylogenies. A. Stamatakis

Maximum Likelihood

Alignment: Seq1, Seq2, Seq3, Seq4

Length: m

Substitution model: A, C, G, T

Prior probabilities, empirical base frequencies: πA, πC, πG, πT

Tree: Seq 1, Seq 2, Seq 3, Seq 4; branches b1, b2, b3, b4, b5

At the virtual root vr: P(A) P(C) P(G) P(T) for each of the m sites

Page 28: Crunching Huge Phylogenies. A. Stamatakis

Maximum Likelihood

Alignment: Seq1, Seq2, Seq3, Seq4

Length: m

Substitution model: A, C, G, T

Prior probabilities, empirical base frequencies: πA, πC, πG, πT

Tree: Seq 1, Seq 2, Seq 3, Seq 4; branches b1, b2, b3, b4, b5

At the virtual root vr: P(A) P(C) P(G) P(T) for each of the m sites

Lots of floating point operations!
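
The computation these slides build up can be sketched for the four-sequence example as follows. This is a minimal illustration, not RAxML's code: it assumes the simple Jukes-Cantor substitution model, the topology ((Seq1,Seq2),(Seq3,Seq4)), an assignment of b1..b4 to the four tips and b5 to the inner branch, and a virtual root on b5. All function names are hypothetical, and gaps or ambiguous characters are not handled.

```python
import math

STATES = "ACGT"

def jc69_matrix(t):
    """Transition probabilities P(x -> y) for branch length t under Jukes-Cantor."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if x == y else diff for y in range(4)] for x in range(4)]

def tip_vector(base):
    """Conditional likelihood vector of an observed nucleotide."""
    return [1.0 if s == base else 0.0 for s in STATES]

def propagate(p, child):
    """For each parent state x, sum over child states y of P[x][y] * child[y]."""
    return [sum(p[x][y] * child[y] for y in range(4)) for x in range(4)]

def inner_vector(q, bq, r, br):
    """Combine the vectors of two children Q and R into the parent's vector."""
    left = propagate(jc69_matrix(bq), q)
    right = propagate(jc69_matrix(br), r)
    return [left[x] * right[x] for x in range(4)]

def log_likelihood(alignment, branches, freqs=(0.25, 0.25, 0.25, 0.25)):
    """Log-likelihood of ((Seq1,Seq2),(Seq3,Seq4)) with branch lengths b1..b5."""
    b1, b2, b3, b4, b5 = branches
    total = 0.0
    for s1, s2, s3, s4 in zip(*alignment):  # one iteration per alignment site
        left = inner_vector(tip_vector(s1), b1, tip_vector(s2), b2)
        right = inner_vector(tip_vector(s3), b3, tip_vector(s4), b4)
        across = propagate(jc69_matrix(b5), right)  # cross the inner branch b5
        site = sum(freqs[x] * left[x] * across[x] for x in range(4))
        total += math.log(site)
    return total

alignment = ["ACGT", "ACGA", "ACCT", "AGCT"]  # toy data, m = 4
print(log_likelihood(alignment, (0.1, 0.1, 0.1, 0.1, 0.05)))
```

Even in this toy form, every site costs dozens of multiplications and additions per inner node, and a real search repeats this for thousands of sites, nodes and candidate trees.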

Page 29: Crunching Huge Phylogenies. A. Stamatakis

Maximum Likelihood

Alignment: Seq1, Seq2, Seq3, Seq4

Length: m

Substitution model: A, C, G, T

Prior probabilities, empirical base frequencies: πA, πC, πG, πT

Tree: Seq 1, Seq 2, Seq 3, Seq 4

optimize branch lengths

Page 30: Crunching Huge Phylogenies. A. Stamatakis

Maximum Likelihood

Alignment: Seq1, Seq2, Seq3, Seq4

Length: m

Substitution model: A, C, G, T

Prior probabilities, empirical base frequencies: πA, πC, πG, πT

Tree: Seq 1, Seq 2, Seq 3, Seq 4

optimize model parameters

Page 31: Crunching Huge Phylogenies. A. Stamatakis

Maximum Likelihood

Goal: Obtain topology with maximum likelihood value

Problem I: Number of possible topologies is exponential in n

Problem II: Computation of likelihood function is expensive

Problem III: Probably high score accuracy required

Problem IV: High memory consumption

Solution:

• New Algorithms

• New Models

• High Performance Computing

Page 32: Crunching Huge Phylogenies. A. Stamatakis

Maximum Likelihood

Goal: Obtain topology with maximum likelihood value

Problem I: Number of possible topologies is exponential in n

Problem II: Computation of likelihood function is expensive

Problem III: Probably high score accuracy required

Problem IV: High memory consumption

Solution:

• New Algorithms

• New Models

• High Performance Computing

RAxML: Randomized Axelerated Maximum Likelihood

Page 33: Crunching Huge Phylogenies. A. Stamatakis

Web & Grid Services

● RAxML Web-Server at San Diego Supercomputer Center via www.phylo.org (CIPRES project)
● Web-Server at Vital-IT unit of Swiss Institute of Bioinformatics: phylobench.vital-it.ch/raxml-bb/
  ● Includes novel search algorithm with 1 order of magnitude run-time improvement
  ● Since Sept 3, about 700 jobs from 130 IPs
  ● Extension to SwissGrid planned
● Novel algorithm with Bootstopping to be integrated into CIPRES portal soon
● RAxML integration into Distributed European Infrastructure for Supercomputing Applications (www.deisa.org) started 10 days ago
● Integration into Debian medical distribution

Page 34: Crunching Huge Phylogenies. A. Stamatakis

RAxML Black Box

Page 35: Crunching Huge Phylogenies. A. Stamatakis

RAxML Black Box

Why are Black Boxes useful?

Page 36: Crunching Huge Phylogenies. A. Stamatakis

Outline

● Introduction
  ● Computation of Phylogenies
  ● Maximum Likelihood
  ● Web & Grid Services
● Three Steps Towards the Tree of Life
  ● Parallelism on IBM BlueGene/L
  ● Rapid Bootstrapping
  ● A Bootstopping criterion
● Related Projects
● Outlook

Page 37: Crunching Huge Phylogenies. A. Stamatakis

Levels of Parallelism

● Embarrassing Parallelism: MPI, CORBA, Grid Technologies

Page 38: Crunching Huge Phylogenies. A. Stamatakis

Coarse-Grained Parallelism: MPI Version of RAxML

Figure labels: Master Process, Worker Processes, B-0, B-1, B-2, B-3, B-4, PC-CLUSTER, Interconnection Network

Page 39: Crunching Huge Phylogenies. A. Stamatakis

Levels of Parallelism

● Embarrassing Parallelism: MPI, CORBA, Grid Technologies
● Inference Parallelism: MPI, algorithm-dependent

Page 40: Crunching Huge Phylogenies. A. Stamatakis

Levels of Parallelism

● Embarrassing Parallelism: MPI, CORBA, Grid Technologies
● Inference Parallelism: MPI, algorithm-dependent
● Loop-Level Parallelism: OpenMP, GPUs, IBM CELL (Playstation), IBM BlueGene, Clusters with fast Interconnect

Page 41: Crunching Huge Phylogenies. A. Stamatakis

Loop Level Parallelism

Figure labels: nodes P, Q, R; virtual root

P[i] = f(Q[i], R[i])

Page 42: Crunching Huge Phylogenies. A. Stamatakis

Loop Level Parallelism

Figure labels: nodes P, Q, R; virtual root

P[i] = f(Q[i], R[i])

This operation uses ≥ 90% of total execution time!

Page 43: Crunching Huge Phylogenies. A. Stamatakis

Loop Level Parallelism

Figure labels: nodes P, Q, R; virtual root

P[i] = f(Q[i], R[i])

This operation uses ≥ 90% of total execution time!

→ simple fine-grained parallelization
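
Because each P[i] depends only on Q[i] and R[i], the site loop can be cut into independent blocks. The sketch below illustrates only that decomposition: `f` is a stand-in kernel (an element-wise product), the block scheme and all names are assumptions, and Python threads are used purely for illustration since they give no real speedup here. A production code would use OpenMP or MPI, as the slides say.

```python
from concurrent.futures import ThreadPoolExecutor

def f(q_site, r_site):
    """Stand-in for the per-site kernel: element-wise product of the state values."""
    return [q * r for q, r in zip(q_site, r_site)]

def update_range(P, Q, R, start, stop):
    """Compute P[i] = f(Q[i], R[i]) for one contiguous block of sites."""
    for i in range(start, stop):
        P[i] = f(Q[i], R[i])

def parallel_update(Q, R, n_workers=4):
    """Split the site loop into contiguous blocks and hand one block to each worker."""
    m = len(Q)
    P = [None] * m
    block = max(1, (m + n_workers - 1) // n_workers)  # ceiling division, at least 1
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(update_range, P, Q, R, lo, min(lo + block, m))
            for lo in range(0, m, block)
        ]
        for fut in futures:
            fut.result()  # re-raise any exception from a worker
    return P

m = 10
Q = [[0.1 * (i + 1)] * 4 for i in range(m)]
R = [[0.5] * 4 for _ in range(m)]
P = parallel_update(Q, R)
print(P[0], P[-1])
```

No block writes to an entry that another block reads, so the workers need no locking; the only synchronization point is waiting for all blocks to finish.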

Page 44: Crunching Huge Phylogenies. A. Stamatakis

Loop Level Parallelism

Figure labels: nodes P, Q, R; virtual root

Page 45: Crunching Huge Phylogenies. A. Stamatakis

Loop Level Parallelism

Figure labels: nodes P, Q, R; virtual root

Page 46: Crunching Huge Phylogenies. A. Stamatakis

Loop Level Parallelism

Figure labels: nodes P, Q, R; virtual root

Page 47: Crunching Huge Phylogenies. A. Stamatakis

Loop Level Parallelism

Figure labels: nodes P, Q, R; virtual root

The real reason for assuming independent evolution among sites: ......

Page 48: Crunching Huge Phylogenies. A. Stamatakis

Fine-Grained Parallelism: OpenMP version of RAxML

Page 49: Crunching Huge Phylogenies. A. Stamatakis

Fine-Grained Parallelism: OpenMP version of RAxML

Page 50: Crunching Huge Phylogenies. A. Stamatakis

HPC for ML (Bayesian)

● Proof of Concept & Programming Techniques:
  ● RAxML on a Graphics Processing Unit
  ● RAxML on the IBM CELL & Playstation
● Production Level Implementations:
  ● RAxML with OpenMP
  ● RAxML with MPI
  ● RAxML on BlueGene
● Multi-Core Architectures

Page 51: Crunching Huge Phylogenies. A. Stamatakis

HPC for ML (Bayesian)

● Proof of Concept & Programming Techniques:
  ● RAxML on a Graphics Processing Unit
  ● RAxML on the IBM CELL & Playstation
● Production Level Implementations:
  ● RAxML with OpenMP
  ● RAxML with MPI
  ● RAxML on BlueGene
● Multi-Core Architectures

A good excuse to buy one

Page 52: Crunching Huge Phylogenies. A. Stamatakis

RAxML-BlueGene

● Many slow processors: 1024 in one rack
● 512 MB or 1 GB of main memory per node
● But: high performance network
● Challenges:
  ● Distribute tree data structure among CPUs
  ● Exploit fast collective communication network
● For optimal efficiency: loop-level + embarrassing parallelism (hybrid parallelism with MPI)
● Test & Production Run Data
  ● With Olaf Bininda-Emonds, Jena: 2,182 mammalian sequences x 51,000 base pairs
  ● With Dan Janies, Ohio State: 270 Human Haplotype Map sequences x 500,000 base pairs

Page 53: Crunching Huge Phylogenies. A. Stamatakis

RAxML-BlueGene

To be presented at IEEE/ACM 2007 Supercomputing Conference.

Page 54: Crunching Huge Phylogenies. A. Stamatakis

RAxML-BlueGene

Largest ML analysis to date in terms of memory footprint

Page 55: Crunching Huge Phylogenies. A. Stamatakis

Loop-Level Parallelism on BlueGene

Page 56: Crunching Huge Phylogenies. A. Stamatakis

50 Seqs x 23,385 bp

Page 57: Crunching Huge Phylogenies. A. Stamatakis

50 Seqs x 23,385 bp

Superlinear Speedup

Page 58: Crunching Huge Phylogenies. A. Stamatakis

250 Seqs x 403,581 bp

Page 59: Crunching Huge Phylogenies. A. Stamatakis

Embarrassing Parallelism

Figure: four groups, each with one master (M) and three workers (W)

Page 60: Crunching Huge Phylogenies. A. Stamatakis

Outline

● Introduction
  ● Computation of Phylogenies
  ● Maximum Likelihood
  ● Web & Grid Services
● Three Steps Towards the Tree of Life
  ● Parallelism on IBM BlueGene/L
  ● Rapid Bootstrapping
  ● A Bootstopping criterion
● Related Projects
● Outlook

Page 61: Crunching Huge Phylogenies. A. Stamatakis

Confidence Values

● A tree without node confidence values is mostly useless
● Problem:
  ● Confidence value calculation is a major computational obstacle
  ● We can compute large trees but not analyse them: compute ≠ analyse!
● Current slow methods
  ● Sampling with Bayesian methods
  ● Non-parametric Bootstrapping

Page 62: Crunching Huge Phylogenies. A. Stamatakis

A Tree with Confidence Values

Joint work with Marc Gottschling, Charite Hospital, Berlin

Page 63: Crunching Huge Phylogenies. A. Stamatakis

Bootstrapping

Original Alignment

perturbation

compute tree compute tree compute tree

Page 64: Crunching Huge Phylogenies. A. Stamatakis

Bootstrapping

Original Alignment

perturbation

compute tree compute tree compute tree

This needs to be done 100-1000 times

Embarrassingly Parallel!
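
The perturbation step of the non-parametric bootstrap draws alignment columns with replacement. A minimal sketch of that resampling follows; it represents each replicate as a vector of per-column counts, and the function names, toy alignment and seed are illustrative assumptions rather than anything from the talk.

```python
import random

def bootstrap_weights(n_sites, rng):
    """Draw n_sites columns with replacement; return how often each column was drawn."""
    weights = [0] * n_sites
    for _ in range(n_sites):
        weights[rng.randrange(n_sites)] += 1
    return weights

def resample_alignment(alignment, weights):
    """Build the perturbed alignment: column i is repeated weights[i] times."""
    return ["".join(base * w for base, w in zip(seq, weights)) for seq in alignment]

rng = random.Random(12345)  # fixed seed only to make the sketch repeatable
alignment = ["ACGTACGTACGTAC", "ACGTACGAACGTAC", "ACCTACGTACGTTC"]
for replicate in range(3):
    w = bootstrap_weights(len(alignment[0]), rng)
    print(w, resample_alignment(alignment, w)[0])
```

Each replicate is then analysed on its own, which is why the 100-1000 tree computations can be distributed across processors without any communication.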

Page 65: Crunching Huge Phylogenies. A. Stamatakis

Two Questions

● How to compute Bootstraps faster?
● How many Bootstrap replicates do we need?

Page 66: Crunching Huge Phylogenies. A. Stamatakis

Current Work: Rapid Bootstrapping Algorithm

● Tested on 22 diverse (mammals, bacteria, archaea, grasses, fishes, plants, viral) real-world DNA/AA single-/multi-gene datasets containing 125-7,764 sequences
● Pearson correlation on best-scoring ML trees between RBS (Rapid BS) & SBS (Standard BS) support values: 0.95-0.99 (except one dataset at 0.91), average 0.97
● Weighted topological distance < 6%, average 4%
● Program acceleration: factor 8-20, average ≈ 15
  ● Acceleration by one order of magnitude
● Full ML analysis (100 BS + ML search) of datasets of up to 5,000 sequences within less than 5 days on your desktop!
● Allows for a sufficiently large number of Bootstrap replicates

Page 67: Crunching Huge Phylogenies. A. Stamatakis

Quick & Dirty Bootstrap

Modify Algorithm

Computational Experiments

Page 68: Crunching Huge Phylogenies. A. Stamatakis

Quick & Dirty Bootstrap

Modify Algorithm

Computational Experiments

iterate

Page 69: Crunching Huge Phylogenies. A. Stamatakis

Rapid Bootstrap

11111111111111

01102211111111
10111102220111
11111110112021

Page 70: Crunching Huge Phylogenies. A. Stamatakis

Rapid Bootstrap

11111111111111

01102211111111
10111102220111
11111110112021

Compute Starting Tree

Page 71: Crunching Huge Phylogenies. A. Stamatakis

Rapid Bootstrap

11111111111111

01102211111111
10111102220111
11111110112021

Optimize Model Params & Branch Lengths

Page 72: Crunching Huge Phylogenies. A. Stamatakis

Rapid Bootstrap

11111111111111

01102211111111   -110
10111102220111   -105
11111110112021   -100

Use Starting Tree & Model Params to compute RELL scores

Page 73: Crunching Huge Phylogenies. A. Stamatakis

Rapid Bootstrap

11111111111111

01102211111111   -110
10111102220111   -105
11111110112021   -100

Use Starting Tree & Model Params to compute RELL scores

Sort by RELL
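
A RELL (resampling of estimated log-likelihoods) score can be obtained without any new tree search: the per-site log-likelihoods of the starting tree are simply re-weighted by each replicate's column counts. The sketch below reads the digit strings on the slide as such per-column weight vectors, which is an interpretation; the per-site log-likelihood values are made up for illustration and the function names are hypothetical.

```python
def rell_score(site_log_likelihoods, weights):
    """RELL score of one replicate: sum of weights[i] * lnL_i on the fixed starting tree."""
    return sum(w * lnl for w, lnl in zip(weights, site_log_likelihoods))

def sort_by_rell(site_log_likelihoods, replicates):
    """Return (score, weights) pairs, best (highest) score first."""
    scored = [(rell_score(site_log_likelihoods, w), w) for w in replicates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical per-site log-likelihoods for a 14-column alignment.
site_lnl = [-7.0, -8.5, -6.0, -9.0, -7.5, -8.0, -6.5,
            -7.0, -8.0, -9.5, -6.0, -7.5, -8.5, -7.0]

replicates = [
    [int(c) for c in "01102211111111"],
    [int(c) for c in "10111102220111"],
    [int(c) for c in "11111110112021"],
]

for score, weights in sort_by_rell(site_lnl, replicates):
    print("".join(map(str, weights)), round(score, 1))
```

Since the scores only need one pass over the stored per-site values, ranking all replicates this way is cheap compared with a single tree search.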

Page 74: Crunching Huge Phylogenies. A. Stamatakis

Rapid Bootstrap

11111111111111

11111110112021   -100
10111102220111   -105
01102211111111   -110

T0: Thorough Search

Page 75: Crunching Huge Phylogenies. A. Stamatakis

Rapid Bootstrap

11111111111111

11111110112021   -100
10111102220111   -105
01102211111111   -110

T0: Thorough Search

T1: Quick Search on T0

Page 76: Crunching Huge Phylogenies. A. Stamatakis

Rapid Bootstrap

11111111111111

11111110112021   -100
10111102220111   -105
01102211111111   -110

T0: Thorough Search

T1: Quick Search on T0

T2: Quick Search on T1

Page 77: Crunching Huge Phylogenies. A. Stamatakis

Rapid Bootstrap

11111111111111

11111110112021   -100
10111102220111   -105
01102211111111   -110

T0: Thorough Search

T1: Quick Search on T0

T2: Quick Search on T1

Sequential dependency is bad for parallelism

Page 78: Crunching Huge Phylogenies. A. Stamatakis

Scalability of Rapid Bootstrap

Page 79: Crunching Huge Phylogenies. A. Stamatakis

Scalability of Rapid Bootstrap

Some datasets are harder than others

Page 80: Crunching Huge Phylogenies. A. Stamatakis

Scalability of Rapid Bootstrap

Page 81: Crunching Huge Phylogenies. A. Stamatakis

ML-Scores: Garli, RAxML, PHYML 715 Sequences

Page 82: Crunching Huge Phylogenies. A. Stamatakis

Correlation 125 Taxa: 0.91

Page 83: Crunching Huge Phylogenies. A. Stamatakis

Support Value Distribution

Page 84: Crunching Huge Phylogenies. A. Stamatakis

Bootstrap Likelihood Values

125 x 19,436

10,000 replicates, only 195 non-trivial bipartitions

Page 85: Crunching Huge Phylogenies. A. Stamatakis

Bootstrap Likelihood Values

125 x 19,436

Page 86: Crunching Huge Phylogenies. A. Stamatakis

3,491 rbcL sequences

Rapid versus Standard BS

Correlation: 0.98

Page 87: Crunching Huge Phylogenies. A. Stamatakis

7,764 DNA Best Tree

Page 88: Crunching Huge Phylogenies. A. Stamatakis

7,764 DNA All Bipartitions

Page 89: Crunching Huge Phylogenies. A. Stamatakis

775 x 3,838 AA

Page 90: Crunching Huge Phylogenies. A. Stamatakis

New Opportunities

Assess Impact of Alignment Method on tree and support values

Test Bootstrap of the Bootstrap (double Bootstrap) procedures

Devise and empirically verify Bootstopping criteria

Page 91: Crunching Huge Phylogenies. A. Stamatakis

Bootstrap of the Bootstrap

140 AA (Efron et al., PNAS 1996)

Page 92: Crunching Huge Phylogenies. A. Stamatakis

Bootstrap of the Bootstrap

3,491 rbcL

Page 93: Crunching Huge Phylogenies. A. Stamatakis

Bootstopping

Rapid Bootstrapping allows us to assess Bootstopping criteria as follows:

1. Compute a high number of BS replicates (10,000)
2. Devise a topology-based bootstopping criterion and apply it to these 10,000 replicates
3. Compare support values induced by bootstopped trees (say 300 replicates) with those from 10,000 replicates

We have 10,000 replicates for 18 datasets containing 125 to 2,554 sequences

Page 94: Crunching Huge Phylogenies. A. Stamatakis

Bootstopping Criterion

Every 50, 100, 150, ... replicates do a test:

● Say we have N BS trees
● Do the following 100 times:
  ● Randomly split up this set of N trees into 2 equal sets S1, S2 of size N/2
  ● Compute the bipartition support vectors for S1 and S2
  ● Compute Pearson correlation of the support vectors
● Return the average of the 100 Pearson correlations
● If average > 0.99: stop
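
The test described above can be sketched as follows. Trees are represented here simply as sets of bipartition identifiers, which is an assumption made for illustration; the function names, the handling of constant support vectors and the toy input are likewise not from the talk.

```python
import random

def support_vector(trees, bipartitions):
    """Fraction of the given trees that contain each bipartition."""
    return [sum(b in t for t in trees) / len(trees) for b in bipartitions]

def pearson(x, y):
    """Pearson correlation; identical constant vectors count as 1.0 (a convention)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 1.0 if x == y else 0.0
    return cov / (vx * vy) ** 0.5

def bootstop_test(trees, rng, permutations=100, threshold=0.99):
    """Average Pearson correlation over random half splits; the flag True means stop."""
    if len(trees) < 2:
        raise ValueError("need at least two bootstrap trees")
    bipartitions = sorted(set().union(*trees))
    half = len(trees) // 2
    total = 0.0
    for _ in range(permutations):
        shuffled = list(trees)
        rng.shuffle(shuffled)
        s1, s2 = shuffled[:half], shuffled[half:2 * half]
        total += pearson(support_vector(s1, bipartitions),
                         support_vector(s2, bipartitions))
    average = total / permutations
    return average, average > threshold

rng = random.Random(7)  # fixed seed only to make the sketch repeatable
# Toy input: 50 replicates that all contain the same two bipartitions.
agreeing = [frozenset({"AB|CDE", "ABC|DE"})] * 50
print(bootstop_test(agreeing, rng))
```

In use, such a test would be called after every 50 additional replicates, and replicate generation would end the first time the returned flag is true.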

Page 95: Crunching Huge Phylogenies. A. Stamatakis

Result Overview

Bootstopped between 100-400 (avg 213)

Correlation on best tree: Bootstopped versus 10,000 replicates > 0.99 (avg 0.995)

Correlation of all bipartitions > 0.995 (avg 0.997)

Page 96: Crunching Huge Phylogenies. A. Stamatakis

Bootstopping Best 140 AA

Page 97: Crunching Huge Phylogenies. A. Stamatakis

Bootstopping Best 404 DNA (Multi-Gene)

Page 98: Crunching Huge Phylogenies. A. Stamatakis

Bootstopping Best 994 DNA

Page 99: Crunching Huge Phylogenies. A. Stamatakis

Bootstopping All 994 DNA

Page 100: Crunching Huge Phylogenies. A. Stamatakis

Bootstopping Best 1,908 DNA

Page 101: Crunching Huge Phylogenies. A. Stamatakis

Bootstopping Best 2,554 DNA

Page 102: Crunching Huge Phylogenies. A. Stamatakis

Putting the Pieces together

● BlueGene: Can handle huge datasets
● Use CAT approximation on BlueGene
  ● Further speedup of factor 3.5
  ● Memory footprint reduction factor 4

Page 103: Crunching Huge Phylogenies. A. Stamatakis

8,864 Bacteria under GTR+Γ and GTR+CAT

Plot: Log Likelihood Score under Γ versus Execution Time; annotations: 14 days, 7 days

Page 104: Crunching Huge Phylogenies. A. Stamatakis

Putting the Pieces together

● BlueGene: Can handle huge datasets
● Use CAT approximation on BlueGene
  ● Further speedup of factor 3.5
  ● Memory footprint reduction factor 4
● Integrate rapid Bootstrap into BlueGene version
  ● Additional speedup ≈ 15
● Mechanisms available to accelerate BlueGene version by factor 50-60
● Integrate Bootstopping into BlueGene

Conclusion: We will soon be able to compute a small tree of life with 10,000 organisms and data from multiple genes!

Page 105: Crunching Huge Phylogenies. A. Stamatakis

Outline

● Introduction
  ● Computation of Phylogenies
  ● Maximum Likelihood
  ● Web & Grid Services
● Three Steps Towards the Tree of Life
  ● Parallelism on IBM BlueGene/L
  ● Rapid Bootstrapping
  ● A Bootstopping criterion
● Related Projects
● Outlook

Page 106: Crunching Huge Phylogenies. A. Stamatakis

Host-Parasite Co-Evolution

Hosts (e.g. Mammals), Parasites (e.g. Lice)

Page 107: Crunching Huge Phylogenies. A. Stamatakis

Host-Parasite Co-Evolution

Hosts Parasites

Adjacency Matrix 0/1

8 Parasites

6 hosts

Co-Evolution Hypothesis

Page 108: Crunching Huge Phylogenies. A. Stamatakis

Host-Parasite Co-Evolution

Hosts Parasites

Adjacency Matrix 0/1

8 Parasites

6 hosts

Co-Evolution Hypothesis

Statistical Test

Page 109: Crunching Huge Phylogenies. A. Stamatakis

What can HPC do for Bioinformatics? Axelerated Parafit

● “Parafit: statistical test of co-evolution”, Pierre Legendre, Syst. Biol. 2003
● AxParafit (Axelerated Parafit)
  ● Statistical test of hypotheses of host-parasite co-evolution
  ● C porting, optimization, BLAS integration
  ● Speedup up to factor 67
  ● Master-Worker MPI-parallelization
● Largest co-phylogenetic study to date conducted within 8 minutes instead of 4 weeks
● Open-Source Code: http://icwww.epfl.ch/~stamatak/AxParafit.html
● SwissGrid-based Web-Server planned

Page 110: Crunching Huge Phylogenies. A. Stamatakis

AxParafit: Sequential Performance

Page 111: Crunching Huge Phylogenies. A. Stamatakis

AxParafit: Parallel Performance

Page 112: Crunching Huge Phylogenies. A. Stamatakis

The ML Benchmark: A Current Community Project

● Standardized way required to test ML search programs
● Web-Server with real-world alignments and performance data at Swiss Institute of Bioinformatics
● Many developers of popular ML programs involved
  ● Stephane Guindon (PHYML), Montpellier
  ● Simon Whelan (LeaPhy), Manchester
  ● Bui Quang Minh (IQPNNI), Vienna
  ● Derrick Zwickl (GARLI), Virginia
  ● Thomas Keane (dprML), Cambridge
● Byproduct: SPEC-like CPU benchmark for phylogenetics
● Follow-up: (planned) ML competition at major conference with industrial sponsor

Page 113: Crunching Huge Phylogenies. A. Stamatakis

A Current Problem: Handling Multi-Gene Alignments

Gene 1 Gene 2

Sequence 1

Sequence 5

Missing Data ≠ Gap Data

Page 114: Crunching Huge Phylogenies. A. Stamatakis

A Multi-Gene Model

Page 115: Crunching Huge Phylogenies. A. Stamatakis

A Multi-Gene Model

Page 116: Crunching Huge Phylogenies. A. Stamatakis

A Multi-Gene Model

Page 117: Crunching Huge Phylogenies. A. Stamatakis

A Multi-Gene Model

LogLH (T) = LogLH (T|Red)

Page 118: Crunching Huge Phylogenies. A. Stamatakis

A Multi-Gene Model

LogLH (T) = LogLH (T|Red) + LogLH (T|Yellow)

Page 119: Crunching Huge Phylogenies. A. Stamatakis

A Multi-Gene Model

LogLH (T) = LogLH (T|Red) + LogLH (T|Yellow)

Challenge: devise efficient data structures for this
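
Written for an arbitrary number of genes, and assuming (as the two-gene formula suggests) that the genes are treated as independent partitions, the score generalizes as below. The symbols G, D_g, θ_g and b_g are notation introduced here for illustration: D_g stands for the alignment columns of gene g, θ_g for its model parameters, and b_g for its own set of branch lengths.

```latex
\log L(T) \;=\; \sum_{g=1}^{G} \log L\bigl(T \mid D_g,\ \theta_g,\ b_g\bigr)
```

For the two-gene example, G = 2 with D_1 the red and D_2 the yellow partition. A sequence that lacks gene g contributes nothing to the g-th term, which is one way to keep missing data distinct from gap data.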

Page 120: Crunching Huge Phylogenies. A. Stamatakis

Why are Individual Branches per Gene a Challenge?

Page 121: Crunching Huge Phylogenies. A. Stamatakis

Why are Individual Branches per Gene a Challenge?

Page 122: Crunching Huge Phylogenies. A. Stamatakis

Outlook

Page 123: Crunching Huge Phylogenies. A. Stamatakis

Outlook

● Tree of Life
● What is a good alignment in a phylogenetic context?
● Simultaneous alignment and tree building
● More HPC & memory-aware programming
● Multi-core architectures
● Models for “gappy” multi-gene alignments

Page 124: Crunching Huge Phylogenies. A. Stamatakis

Acknowledgements

BlueGene Project

Michael Ott, TUM

Srinivas Aluru, Jaroslaw Zola, Iowa State

Dan Janies, Andrew Johnson, Ohio State

IBM CELL & Playstation

Filip Blagojevic, Dimitris Nikolopoulos, Virginia Tech

Christos Antonopoulos, Univ. of Thessaly

Bootstopping

Bernard Moret, Masoud Alipour, EPFL

Olaf Bininda-Emonds, Univ. Jena

RAxML Web-Server

Jacques Rougemont, SIB

Terri Liebowitz, SDSC

AxParafit/AxPcoords

Markus Goeker, Alexander Auch, Jan Meier-Kolthoff, University of Tuebingen

Datasets for Studies

Jun Inoue (Florida), Nicolas Salamin (Lausanne), Marc Gottschling (Berlin), Guido Grimm (Tuebingen), Nikos Poulakakis (Yale), Usman Roshan (NJIT)

Page 125: Crunching Huge Phylogenies. A. Stamatakis

Thank you for your Attention !

Lake Geneva, Switzerland