Crunching Huge Phylogenies. A. Stamatakis

Post on 10-May-2015

1.193 views 1 download

Tags:

Transcript of Crunching Huge Phylogenies. A. Stamatakis

Crunching Huge Phylogenies:A Rapid Bootstrap Algorithm and Massive Parallelism on the IBM

BlueGene

Alexandros Stamatakis

Swiss Federal Institute of Technology Lausanne (EPFL)School of Computer & Communication Sciences

Laboratory for Computational Biology and BioinformaticsLausanne, Switzerland

&Swiss Institute of Bioinformatics

Alexandros.Stamatakis@epfl.chicwww.epfl.ch/~stamatak

Alexandros Stamatakis, October 2007

The Missing Part

Data Assembly Tree AnalysisInference ?

Alexandros Stamatakis, October 2007

The Missing Part

Data Assembly Tree Analysis

Alexandros Stamatakis, October 2007

IBM BlueGene/Lsupercomputer

Alexandros Stamatakis, October 2007

Rapid BootstrappingBootstopping Criterion

Alexandros Stamatakis, October 2007

The Big Hardware Problem

1980 2007

CPU Speed 40% p.a.

Memory Speed 9% p.a.

Alexandros Stamatakis, October 2007

... and why this concerns Bioinformatics

1980 2007

CPU Speed 40% p.a.

Memory Speed 9% p.a.

Sequence Data

Alexandros Stamatakis, October 2007

... and why this concerns Bioinformatics

1980 2007

CPU Speed 40% p.a.

Memory Speed 9% p.a.

Sequence Data

Application of HPC techniques will become much more important

Alexandros Stamatakis, October 2007

Cache Hierarchy

Alexandros Stamatakis, October 2007

Outline● Introduction

● Computation of Phylogenies ● Maximum Likelihood● Web & Grid Services

● Three Steps Towards the Tree of Life● Parallelism on IBM BlueGene/L● Rapid Bootstrapping● A Bootstopping criterion

● Related Projects● Outlook

Alexandros Stamatakis, October 2007

Phylogenetics Input: “good” multiple Alignment Output: unrooted binary tree Various methods for phylogenetic

inference Neighbour Joining (fast & simple) Maximum Parsimony (relatively fast &

simple) Maximum Likelihood (complex & slow) Bayesian Methods (complex & slower)

Alexandros Stamatakis, October 2007

Phylogenetics Input: “good” multiple Alignment Output: unrooted binary tree Various methods for phylogenetic

inference Neighbour Joining (fast & simple) Maximum Parsimony (relatively fast &

simple) Maximum Likelihood (complex & slow) Bayesian Methods (complex & slower)

ML & Bayesian: explicit model choice

Alexandros Stamatakis, October 2007

Phylogenetics Input: “good” multiple Alignment Output: unrooted binary tree Various methods for phylogenetic

inference Neighbour Joining (fast & simple) Maximum Parsimony (relatively fast &

simple) Maximum Likelihood (complex & slow) Bayesian Methods (complex & slower)

Complex Methods & Models required to reconstruct large & complicated trees !

Focus of this talk is on Maximum Likelihood!

Alexandros Stamatakis, October 2007

Phylogenetics Input: “good” multiple Alignment Output: unrooted binary tree Various methods for phylogenetic

inference Neighbour Joining (fast & simple) Maximum Parsimony (relatively fast &

simple) Maximum Likelihood (complex & slow) Bayesian Methods (complex & slower)

The real reason for working on ML: ......

Alexandros Stamatakis, October 2007

Challenges for Phyloinformatics

Holy grail: “Tree of Life” What is a good alignment in a

phylogenetic context? Simultaneous alignment and tree building Improve/extend models ... but thereby size

of computable trees decreases! More HPC awareness Exploit multi-core architectures Amount of available data grows at a

higher rate than algorithms are getting faster

Alexandros Stamatakis, October 2007

The algorithmic problem

Alexandros Stamatakis, October 2007

The number of trees

Alexandros Stamatakis, October 2007

The number of trees

Alexandros Stamatakis, October 2007

The number of trees

Alexandros Stamatakis, October 2007

The number of trees explodes!

BANG !

Alexandros Stamatakis, October 2007

Outline● Introduction

● Computation of Phylogenies ● Maximum Likelihood● Web & Grid Services

● Three Steps Towards the Tree of Life● Parallelism on IBM BlueGene/L● Rapid Bootstrapping● A Bootstopping criterion

● Related Projects● Outlook

Alexandros Stamatakis, October 2007

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

Alexandros Stamatakis, October 2007

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Alexandros Stamatakis, October 2007

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Alexandros Stamatakis, October 2007

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Seq 1Seq 1

Seq 2Seq 2 Seq 4Seq 4

Seq 3Seq 3b1b1

b2b2

b5b5

b3b3

b4b4

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Alexandros Stamatakis, October 2007

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Seq 1Seq 1

Seq 2Seq 2 Seq 4Seq 4

Seq 3Seq 3b1b1

b2b2

b5b5

b3b3

b4b4

virtual root: vrvirtual root: vr

Alexandros Stamatakis, October 2007

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Seq 1Seq 1

Seq 2Seq 2 Seq 4Seq 4

Seq 3Seq 3b1b1

b2b2

b5b5

b3b3

b4b4

P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T)

mm

vrvr

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Alexandros Stamatakis, October 2007

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Seq 1Seq 1

Seq 2Seq 2 Seq 4Seq 4

Seq 3Seq 3b1b1

b2b2

b5b5

b3b3

b4b4

P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T)

mm

vrvr

Lots of floating point operations!

Alexandros Stamatakis, October 2007

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Seq 1Seq 1

Seq 2Seq 2 Seq 4Seq 4

Seq 3Seq 3

optimize branch lengthsoptimize branch lengths

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Alexandros Stamatakis, October 2007

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Seq 1Seq 1

Seq 2Seq 2 Seq 4Seq 4

Seq 3Seq 3

optimize model parametersoptimize model parameters

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Alexandros Stamatakis, October 2007

Maximum Likelihood

Goal: Obtain topology with maximum likelihood value

Problem I: Number of possible topologies is exponential in n

Problem II: Computation of likelihood function is expensive

Problem III: Probably high score accuracy required

Problem IV: High memory consumption

Solution:

• New Algorithms

• New Models

• High Performance Computing

Alexandros Stamatakis, October 2007

Maximum Likelihood

Goal: Obtain topology with maximum likelihood value

Problem I: Number of possible topologies is exponential in n

Problem II: Computation of likelihood function is expensive

Problem III: Probably high score accuracy required

Problem IV: High memory consumption

Solution:

• New Algorithms

• New Models

• High Performance Computing

RAxML Randomized Axelerated

Maximum Likelihood

Alexandros Stamatakis, October 2007

Web & Grid Services RAxML Web-Server at San Diego Supercomputing

Center via www.phylo.org (CIPRES project) Web-Server at Vital-IT unit of Swiss Institute of

Bioinformatics phylobench.vital-it.ch/raxml-bb/ Includes novel search algorithm with 1 order of

magnitude run-time improvement Since Sept 3, about 700 jobs from 130 Ips Extension to SwissGrid planned Novel algorithm with Bootstopping to be

integrated into CIPRES portal soon RAxML integration into Distributed European

Infrastructure for Supercomputing Applications www.deisa.org started 10 days ago

Integration into Debian medical distribution

Alexandros Stamatakis, October 2007

RAxML Black Box

Alexandros Stamatakis, October 2007

RAxML Black Box

Why are Black Boxesuseful?

Alexandros Stamatakis, October 2007

Outline● Introduction

● Computation of Phylogenies ● Maximum Likelihood● Web & Grid Services

● Three Steps Towards the Tree of Life● Parallelism on IBM BlueGene/L● Rapid Bootstrapping● A Bootstopping criterion

● Related Projects● Outlook

Alexandros Stamatakis, October 2007

Levels of Parallelism

Embarrassing Parallelism

MPI, CORBA, Grid Technologies

Alexandros Stamatakis, October 2007

Coarse-Grained Parallelism: MPI Version of RAxML

Master Process

Worker Processes

B-0B-1 B-3

B-2

B-4

PC-CLUSTER

InterconnectionNetwork

Alexandros Stamatakis, October 2007

Levels of Parallelism

Embarrassing Parallelism

Inference Parallelism

MPI, CORBA, Grid Technologies

MPI, algorithm-dependent

Alexandros Stamatakis, October 2007

Levels of Parallelism

Embarrassing Parallelism

Inference Parallelism

Loop-Level Parallelism

MPI, CORBA, Grid Technologies

MPI, algorithm-dependent

OpenMP, GPUs, IBM CELL (Playstation), IBM BlueGene,Clusters with fast Interconnect

Alexandros Stamatakis, October 2007

Loop Level Parallelism

P

QR

P[i] = f(Q[i], R[i])

virtual root

Alexandros Stamatakis, October 2007

Loop Level Parallelism

P

QR

P[i] = f(Q[i], R[i])

virtual root

This operation uses ≥ 90% of total execution time !

Alexandros Stamatakis, October 2007

Loop Level Parallelism

P

QR

P[i] = f(Q[i], R[i])

virtual root

This operation uses ≥ 90% of total execution time !→ simple fine-grained parallelization

Alexandros Stamatakis, October 2007

Loop Level Parallelism

P

QR

virtual root

Alexandros Stamatakis, October 2007

Loop Level Parallelism

P

QR

virtual root

Alexandros Stamatakis, October 2007

Loop Level Parallelism

P

QR

virtual root

Alexandros Stamatakis, October 2007

Loop Level Parallelism

P

QR

virtual rootThe real reason for assuming independent evolution among sites: ......

Alexandros Stamatakis, October 2007

Fine-Grained Parallelism:OpenMP version of RAxML

Alexandros Stamatakis, October 2007

Fine-Grained Parallelism:OpenMP version of RAxML

Alexandros Stamatakis, October 2007

HPC for ML (Bayesian) Proof of Concept & Programming

Techniques: RAxML on a Graphics Processing Unit RAxML on the IBM CELL & Playstation

Production Level Implementations: RAxML with OpenMP RaxML with MPI RAxML on BlueGene Multi-Core Architectures

Alexandros Stamatakis, October 2007

HPC for ML (Bayesian) Proof of Concept & Programming

Techniques: RAxML on a Graphics Processing Unit RAxML on the IBM CELL & Playstation

Production Level Implementations: RAxML with OpenMP RaxML with MPI RAxML on BlueGene Multi-Core Architectures

A good excuse to buy one

Alexandros Stamatakis, October 2007

RAxML-BlueGene Many slow processors: 1024 in one rack 512 MB or 1GB of main memory per node But: high performance network Challenges:

Distribute tree data structure among CPUs Exploit fast collective communication network

For optimal efficiency: loop-level + embarrassing parallelism hybrid parallelism with MPI

Test & Production Run Data With Olaf Bininda-Emonds, Jena: 2,182

mammalian sequences x 51,000 base pairs With Dan Janies, Ohio State: 270 Human

Haplotype Map sequences x 500,000 base pairs

Alexandros Stamatakis, October 2007

RAxML-BlueGene Many slow processors: 1024 in one rack 512 MB or 1GB of main memory per node But: high performance network Challenges:

Distribute tree data structure among CPUs Exploit fast collective communication network

For optimal efficiency: loop-level + embarrassing parallelism hybrid parallelism with MPI

Test & Production Run Data With Olaf Bininda-Emonds, Jena: 2,182

mammalian sequences x 51,000 base pairs With Dan Janies, Ohio State: 270 Human

Haplotype Map sequences x 500,000 base pairs

To be presented at IEEE/ACM 2007 Supercomputing Conference.

Alexandros Stamatakis, October 2007

RAxML-BlueGene Many slow processors: 1024 in one rack 512 MB or 1GB of main memory per node But: high performance network Challenges:

Distribute tree data structure among CPUs Exploit fast collective communication network

For optimal efficiency: loop-level + embarrassing parallelism hybrid parallelism with MPI

Test & Production Run Data With Olaf Bininda-Emonds, Jena: 2,182

mammalian sequences x 51,000 base pairs With Dan Janies, Ohio State: 270 Human

Haplotype Map sequences x 500,000 base pairs

Largest ML analysis to date in terms of memory footprint

Alexandros Stamatakis, October 2007

Loop-Level Parallelism on BlueGene

Alexandros Stamatakis, October 2007

50 Seqs x 23,385 bp

Alexandros Stamatakis, October 2007

50 Seqs x 23,385 bp

Superlinear Speedup

Alexandros Stamatakis, October 2007

250 Seqs x 403,581 bp

Alexandros Stamatakis, October 2007

Embarrassing Parallelism

M

W

W

W

M

WW

W M W

W W

M

WW

W

Alexandros Stamatakis, October 2007

Outline● Introduction

● Computation of Phylogenies ● Maximum Likelihood● Web & Grid Services

● Three Steps Towards the Tree of Life● Parallelism on IBM BlueGene/L● Rapid Bootstrapping● A Bootstopping criterion

● Related Projects● Outlook

Alexandros Stamatakis, October 2007

Confidence Values Tree without node confidence

values is mostly useless Problem:

Confidence value calculation is major computational obstacle

We can compute large trees but not analyse them: compute ≠analyse !

Current Slow Methods Sampling with Bayesian methods Non-parametric Bootstrapping

Alexandros Stamatakis, October 2007

A Tree with Confidence Values

Joint work with Marc Gottschling, Charite Hospital, Berlin

Alexandros Stamatakis, October 2007

BootstrappingOriginal Alignment

perturbation

compute tree compute tree compute tree

Alexandros Stamatakis, October 2007

BootstrappingOriginal Alignment

perturbation

compute tree compute tree compute tree

This needs to be done 100-1000 timesEmbarrassingly Parallel !

Alexandros Stamatakis, October 2007

Two Questions How to compute Bootstraps faster? How many Bootstrap replicates do we

need?

Alexandros Stamatakis, October 2007

Current Work:Rapid Bootstrapping Algorithm

Tested on 22 diverse (mammals, bacteria, archaea, grasses, fishes, plants, viral) real-world DNA/AA single-/multi-gene datasets containing 125-7,764 sequences

Pearson correlation on best-scoring ML trees between RBS (Rapid BS) & SBS (Standard BS) support values 0.95-0.99 (except one dataset at 0.91), average 0.97

Weighted topological distance < 6%, average 4% Program Acceleration: 8-20, average ≈ 15

Acceleration by one order of magnitude Full ML analysis (100BS + ML search) of datasets of

up to 5,000 sequences within less than 5 days on your desktop!

Allows for a sufficiently large number of Bootstrap replicates

Alexandros Stamatakis, October 2007

Quick & Dirty Bootstrap

Modify Algorithm

Computational Experiments

Alexandros Stamatakis, October 2007

Quick & Dirty Bootstrap

Modify Algorithm

Computational Experiments

iterate

Alexandros Stamatakis, October 2007

Rapid Bootstrap

1111111111111111111111111111

011022111111110110221111111110111102220111101111022201111111111011202111111110112021

Alexandros Stamatakis, October 2007

Rapid Bootstrap

1111111111111111111111111111

011022111111110110221111111110111102220111101111022201111111111011202111111110112021

Compute Starting TreeCompute Starting Tree

Alexandros Stamatakis, October 2007

Rapid Bootstrap

1111111111111111111111111111

011022111111110110221111111110111102220111101111022201111111111011202111111110112021

Optimize Model Params &Optimize Model Params &Branch LengthsBranch Lengths

Alexandros Stamatakis, October 2007

1111111111111111111111111111

0110221111111101102211111111 -110 -1101011110222011110111102220111 -105 -1051111111011202111111110112021 -100 -100

Rapid BootstrapUse Starting Tree &Use Starting Tree &

Model Params to compute Model Params to compute RELL scoresRELL scores

Alexandros Stamatakis, October 2007

1111111111111111111111111111

0110221111111101102211111111 -110 -1101011110222011110111102220111 -105 -1051111111011202111111110112021 -100 -100

Rapid BootstrapUse Starting Tree &Use Starting Tree &

Model Params to compute Model Params to compute RELL scoresRELL scores

Sort by RELLSort by RELL

Alexandros Stamatakis, October 2007

1111111111111111111111111111

1111111011202111111110112021 -100 -1001011110222011110111102220111 -105 -1050110221111111101102211111111 -110 -110

Rapid Bootstrap

TT00: Thorough Search: Thorough Search

Alexandros Stamatakis, October 2007

1111111111111111111111111111

1111111011202111111110112021 -100 -1001011110222011110111102220111 -105 -1050110221111111101102211111111 -110 -110

Rapid Bootstrap

TT00: Thorough Search : Thorough Search

TT11: Quick Search on T: Quick Search on T00

Alexandros Stamatakis, October 2007

1111111111111111111111111111

1111111011202111111110112021 -100 -1001011110222011110111102220111 -105 -1050110221111111101102211111111 -110 -110

Rapid Bootstrap

TT00: Thorough Search : Thorough Search

TT11: Quick Search on T: Quick Search on T00

TT22: Quick Search on T: Quick Search on T11

Alexandros Stamatakis, October 2007

1111111111111111111111111111

1111111011202111111110112021 -100 -1001011110222011110111102220111 -105 -1050110221111111101102211111111 -110 -110

Rapid Bootstrap

TT00: Thorough Search : Thorough Search

TT11: Quick Search on T: Quick Search on T00

TT22: Quick Search on T: Quick Search on T11

sequential dependency is bad for parallelism

Alexandros Stamatakis, October 2007

Scalability of Rapid Bootstrap

Alexandros Stamatakis, October 2007

Scalability of Rapid Bootstrap

Some datasets are harder than others

Alexandros Stamatakis, October 2007

Scalability of Rapid Bootstrap

Alexandros Stamatakis, October 2007

ML-Scores: Garli, RAxML, PHYML 715 Sequences

Alexandros Stamatakis, October 2007

Correlation 125 Taxa: 0.91

Alexandros Stamatakis, October 2007

Support Value Distribution

Alexandros Stamatakis, October 2007

Bootstrap Likelihood Values125 x 19,436

10,000 replicates only 195 non-trivial bipartitions

Alexandros Stamatakis, October 2007

Bootstrap Likelihood Values125 x 19,436

Alexandros Stamatakis, October 2007

3,491 rBCL sequencesRapid versus Standard BS

Correlation: 0.98

Alexandros Stamatakis, October 2007

7,764 DNA Best Tree

Alexandros Stamatakis, October 2007

7,764 DNA All Bipartitions

Alexandros Stamatakis, October 2007

775 x 3,838 AA

Alexandros Stamatakis, October 2007

New Opportunities

Assess Impact of Alignment Method on tree and support values

Test Bootstrap of the Bootstrap (double Bootstrap) procedures

Devise and empirically verify Bootstopping criteria

Alexandros Stamatakis, October 2007

Bootstrap of the Bootstrap140 AA (Efron et al PNAS 1996)

Alexandros Stamatakis, October 2007

Bootstrap of the Bootstrap3,491 rBCL

Alexandros Stamatakis, October 2007

Bootstopping Rapid Bootstrapping allows to assess

Bootstopping criteria as follows1. Compute a high number of BS replicates (10,000)2. Devise topology-based bootstopping criterion and

apply it to these 10,000 replicates3. Compare support values induced by bootstopped

trees (say 300 replicates) with 10,000 replicates

We have 10,000 replicates for 18 datasets containing 125 to 2,554 sequences

Alexandros Stamatakis, October 2007

Bootstopping Criterion

Every 50, 100, 150, ... replicates do a test: Say we have N BS trees Do the following 100 times:

Randomly split up this set of N trees into 2 equal sets S1, S2, of size N/2

Compute the bipartition support vectors for S1 and S2

Compute Pearson correlation of the support vectors

return average of the 100 Pearson correlations if average > 0.99 stop

Alexandros Stamatakis, October 2007

Result Overview

Bootstopped between 100-400 (avg 213)

Correlation on best tree: Bootstopped versus 10,000 replicates > 0.99 (avg 0.995)

Correlation of all bipartitions > 0.995 (avg 0.997)

Alexandros Stamatakis, October 2007

Bootstopping Best 140 AA

Alexandros Stamatakis, October 2007

Bootstopping Best 404 DNA (Multi-Gene)

Alexandros Stamatakis, October 2007

Bootstopping Best 994 DNA

Alexandros Stamatakis, October 2007

Bootstopping All 994 DNA

Alexandros Stamatakis, October 2007

Bootstopping Best 1,908 DNA

Alexandros Stamatakis, October 2007

Bootstopping Best 2,554 DNA

Alexandros Stamatakis, October 2007

Putting the Pieces together Blue-Gene: Can handle huge datasets

Use Cat approximation on BlueGene Further speedup of factor 3.5 Memory footprint reduction factor 4

Alexandros Stamatakis, October 2007

8,864 Bacteria under GTR+Γand GTR+CAT

Log Likelihood Score under Γ

14 days14 days7 days7 days

Execution Time

Alexandros Stamatakis, October 2007

Putting the Pieces together Blue-Gene: Can handle huge datasets

Use Cat approximation on BlueGene Further speedup of factor 3.5 Memory footprint reduction factor 4

Integrate rapid Bootstrap into BlueGene version

Additional speedup ≈ 15 Mechanisms available to accelerate

BlueGene version by factor 50-60 Integrate Bootstopping into BlueGene

Conclusion: We will soon be able to compute a small tree of life with 10,000 organisms and data from multiple genes!

Alexandros Stamatakis, October 2007

Outline● Introduction

● Computation of Phylogenies ● Maximum Likelihood● Web & Grid Services

● Three Steps Towards the Tree of Life● Parallelism on IBM BlueGene/L● Rapid Bootstrapping● A Bootstopping criterion

● Related Projects● Outlook

Alexandros Stamatakis, October 2007

Host-Parasite Co-Evolution

Hosts (eg Mammals) Parasites (eg Lice)

Alexandros Stamatakis, October 2007

Host-Parasite Co-Evolution

Hosts Parasites

Adjacency Matrix 0/1

8 Parasites

6 hosts

Co-Evolution Hypothesis

Alexandros Stamatakis, October 2007

Host-Parasite Co-Evolution

Hosts Parasites

Adjacency Matrix 0/1

8 Parasites

6 hosts

Co-Evolution Hypothesis

Statistical Test

Alexandros Stamatakis, October 2007

What can HPC do forBioinformatics?Axelerated Parafit

“Parafit: statistical test of co-evolution”, Pierre Legendre, Syst. Biol. 2003

AxParafit (Axelerated Parafit) Statistical test of hypotheses of host-parasite co-

evolution C porting, optimization, BLAS integration Speedup up to factor 67 Master-Worker MPI-parallelization

Largest co-phylogenetic study to date conducted within 8 minutes instead of 4 weeks

Open-Source Code: http://icwww.epfl.ch/~stamatak/AxParafit.html

SwissGrid-based Web-Server planned

Alexandros Stamatakis, October 2007

AxParafit: Sequential Performance

Alexandros Stamatakis, October 2007

AxParafit: Parallel Performance

Alexandros Stamatakis, October 2007

The ML Benchmark:A Current Community Project

Standardized way required to test ML search programs Web-Server with real-world alignments and performance data

at Swiss Institute of Bioinformatics Many developers of popular ML programs involved

Stephane Guindon (PHYML) Montpellier Simon Wheelan (LeaPhy) Manchester Bui Quang Minh (IQPNNI) Vienna Derrick Zwickl (GARLI) Virginia Thomas Keane (dprML) Cambridge

Byproduct: SPEC-like CPU benchmark for phylogenetics Follow-up: (planned) ML competition at major conference with

industrial sponsor

Alexandros Stamatakis, October 2007

A Current Problem:Handling Multi-Gene Alignments

Gene 1 Gene 2

Sequence 1

Sequence 5

Missing Data ≠ Gap Data

Alexandros Stamatakis, October 2007

A Multi-Gene Model

Alexandros Stamatakis, October 2007

A Multi-Gene Model

Alexandros Stamatakis, October 2007

A Multi-Gene Model

Alexandros Stamatakis, October 2007

A Multi-Gene Model

LogLH (T) = LogLh (T|Red)

Alexandros Stamatakis, October 2007

LogLH (T) = LogLh (T|Red) +LogLH(T|Yellow)

A Multi-Gene Model

Alexandros Stamatakis, October 2007

LogLH (T) = LogLh (T|Red) +LogLH(T|Yellow)

A Multi-Gene ModelChallenge: devise efficient data structures for this

Alexandros Stamatakis, October 2007

Why are Individual Branches per Gene a Challenge?

Alexandros Stamatakis, October 2007

Why are Individual Branches per Gene a Challenge?

Alexandros Stamatakis, October 2007

Outlook

Alexandros Stamatakis, October 2007

Outlook

Tree of Life What is a good alignment in a

phylogenetic context? Simultaneous alignment and tree building More HPC & memory-aware programming Multi-core architectures Models for “gappy” multi-gene alignments

Alexandros Stamatakis, October 2007

Acknowledgements BlueGene Project

Michael Ott, TUM

Srinivas Aluru, Jaroslaw Zola, Iowa State

Dan Janies, Andrew Johnson, Ohio State

IBM CELL & Playstation

Filip Blagojevic, Dimitris Nikolopoulos, Virginia Tech

Christos Antonopoulos, Univ. of Thessaly

Bootstopping

Bernard Moret, Masoud Alipour, EPFL

Olaf Bininda-Emonds, Univ. Jena

RAxML Web-Server

Jacques Rougemont, SIB

Terri Liebowitz, SDSC

AxParafit/AxPcoords

Markus Goeker, Alexander Auch, Jan Meier-Kolthoff, University of Tuebingen

Datasets for Studies

Jun Inoue (Florida), Nicolas Salamin (Lausanne), Marc Gottschling (Berlin), Guido Grimm (Tuebingen), Nikos Poulakakis (Yale), Usman Roshan (NJIT)

Alexandros Stamatakis, October 2007

Thank you for your Attention !

Lake Geneva, Switzerland