Crunching Huge Phylogenies. A. Stamatakis

Crunching Huge Phylogenies:A Rapid Bootstrap Algorithm and Massive Parallelism on the IBM

BlueGene

Alexandros Stamatakis

Swiss Federal Institute of Technology Lausanne (EPFL)School of Computer & Communication Sciences

Laboratory for Computational Biology and BioinformaticsLausanne, Switzerland

&Swiss Institute of Bioinformatics

Alexandros.Stamatakis@epfl.chicwww.epfl.ch/~stamatak

Alexandros Stamatakis, October 2007

The Missing Part

Data Assembly Tree AnalysisInference ?

The Missing Part

Data Assembly Tree Analysis

IBM BlueGene/Lsupercomputer

Rapid BootstrappingBootstopping Criterion

The Big Hardware Problem

1980 2007

CPU Speed 40% p.a.

Memory Speed 9% p.a.

... and why this concerns Bioinformatics

1980 2007

CPU Speed 40% p.a.

Sequence Data

... and why this concerns Bioinformatics

1980 2007

CPU Speed 40% p.a.

Sequence Data

Application of HPC techniques will become much more important

Cache Hierarchy

Outline● Introduction

● Computation of Phylogenies ● Maximum Likelihood● Web & Grid Services

● Three Steps Towards the Tree of Life● Parallelism on IBM BlueGene/L● Rapid Bootstrapping● A Bootstopping criterion

● Related Projects● Outlook

Phylogenetics Input: “good” multiple Alignment Output: unrooted binary tree Various methods for phylogenetic

inference Neighbour Joining (fast & simple) Maximum Parsimony (relatively fast &

simple) Maximum Likelihood (complex & slow) Bayesian Methods (complex & slower)

ML & Bayesian: explicit model choice

Complex Methods & Models required to reconstruct large & complicated trees !

Focus of this talk is on Maximum Likelihood!

The real reason for working on ML: ......

Challenges for Phyloinformatics

Holy grail: “Tree of Life” What is a good alignment in a

phylogenetic context? Simultaneous alignment and tree building Improve/extend models ... but thereby size

of computable trees decreases! More HPC awareness Exploit multi-core architectures Amount of available data grows at a

higher rate than algorithms are getting faster

The algorithmic problem

The number of trees

The number of trees explodes!

BANG !

Maximum Likelihood

Seq1Seq1Seq2Seq2Seq3Seq3Seq4Seq4

AlignmentAlignment

Length: mLength: m

Maximum Likelihood

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

SubstitutionSubstitutionmodelmodel

Maximum Likelihood

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

Prior probabilities,Prior probabilities,Empirical base frequenciesEmpirical base frequencies

ππA A ππC C ππG G ππT T

Maximum Likelihood

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

Seq 1Seq 1

Seq 2Seq 2 Seq 4Seq 4

Seq 3Seq 3b1b1

Maximum Likelihood

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

Seq 1Seq 1

Seq 3Seq 3b1b1

virtual root: vrvirtual root: vr

Maximum Likelihood

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

Seq 1Seq 1

Seq 3Seq 3b1b1

P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T)

Maximum Likelihood

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

Seq 1Seq 1

Seq 3Seq 3b1b1

P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T) P(A) P(C) P(G) P(T)P(A) P(C) P(G) P(T)

Lots of floating point operations!

Maximum Likelihood

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

Seq 1Seq 1

Seq 3Seq 3

optimize branch lengthsoptimize branch lengths

Maximum Likelihood

AlignmentAlignment

Length: mLength: m

AACCGGTT

A C G TA C G T

Seq 1Seq 1

Seq 3Seq 3

optimize model parametersoptimize model parameters

Maximum Likelihood

Goal: Obtain topology with maximum likelihood value

Problem I: Number of possible topologies is exponential in n

Problem II: Computation of likelihood function is expensive

Problem III: Probably high score accuracy required

Problem IV: High memory consumption

Solution:

• New Algorithms

• New Models

• High Performance Computing

Maximum Likelihood

Goal: Obtain topology with maximum likelihood value

Problem I: Number of possible topologies is exponential in n

Problem II: Computation of likelihood function is expensive

Problem III: Probably high score accuracy required

Problem IV: High memory consumption

Solution:

• New Algorithms

• New Models

• High Performance Computing

RAxML Randomized Axelerated

Maximum Likelihood

Web & Grid Services RAxML Web-Server at San Diego Supercomputing

Center via www.phylo.org (CIPRES project) Web-Server at Vital-IT unit of Swiss Institute of

Bioinformatics phylobench.vital-it.ch/raxml-bb/ Includes novel search algorithm with 1 order of

magnitude run-time improvement Since Sept 3, about 700 jobs from 130 Ips Extension to SwissGrid planned Novel algorithm with Bootstopping to be

integrated into CIPRES portal soon RAxML integration into Distributed European

Infrastructure for Supercomputing Applications www.deisa.org started 10 days ago

Integration into Debian medical distribution

RAxML Black Box

Why are Black Boxesuseful?

Levels of Parallelism

Embarrassing Parallelism

MPI, CORBA, Grid Technologies

Coarse-Grained Parallelism: MPI Version of RAxML

Master Process

Worker Processes

B-0B-1 B-3

PC-CLUSTER

InterconnectionNetwork

Inference Parallelism

MPI, algorithm-dependent

Inference Parallelism

Loop-Level Parallelism

MPI, algorithm-dependent

OpenMP, GPUs, IBM CELL (Playstation), IBM BlueGene,Clusters with fast Interconnect

Loop Level Parallelism

P[i] = f(Q[i], R[i])

virtual root

P[i] = f(Q[i], R[i])

virtual root

This operation uses ≥ 90% of total execution time !

P[i] = f(Q[i], R[i])

virtual root

This operation uses ≥ 90% of total execution time !→ simple fine-grained parallelization

virtual root

virtual rootThe real reason for assuming independent evolution among sites: ......

Fine-Grained Parallelism:OpenMP version of RAxML

HPC for ML (Bayesian) Proof of Concept & Programming

Techniques: RAxML on a Graphics Processing Unit RAxML on the IBM CELL & Playstation

Production Level Implementations: RAxML with OpenMP RaxML with MPI RAxML on BlueGene Multi-Core Architectures

HPC for ML (Bayesian) Proof of Concept & Programming

Techniques: RAxML on a Graphics Processing Unit RAxML on the IBM CELL & Playstation

Production Level Implementations: RAxML with OpenMP RaxML with MPI RAxML on BlueGene Multi-Core Architectures

A good excuse to buy one

RAxML-BlueGene Many slow processors: 1024 in one rack 512 MB or 1GB of main memory per node But: high performance network Challenges:

Distribute tree data structure among CPUs Exploit fast collective communication network

For optimal efficiency: loop-level + embarrassing parallelism hybrid parallelism with MPI

Test & Production Run Data With Olaf Bininda-Emonds, Jena: 2,182

mammalian sequences x 51,000 base pairs With Dan Janies, Ohio State: 270 Human

Haplotype Map sequences x 500,000 base pairs

To be presented at IEEE/ACM 2007 Supercomputing Conference.

Largest ML analysis to date in terms of memory footprint

Loop-Level Parallelism on BlueGene

50 Seqs x 23,385 bp

Superlinear Speedup

250 Seqs x 403,581 bp

Confidence Values Tree without node confidence

values is mostly useless Problem:

Confidence value calculation is major computational obstacle

We can compute large trees but not analyse them: compute ≠analyse !

Current Slow Methods Sampling with Bayesian methods Non-parametric Bootstrapping

A Tree with Confidence Values

Joint work with Marc Gottschling, Charite Hospital, Berlin

BootstrappingOriginal Alignment

perturbation

compute tree compute tree compute tree

BootstrappingOriginal Alignment

perturbation

compute tree compute tree compute tree

This needs to be done 100-1000 timesEmbarrassingly Parallel !

Two Questions How to compute Bootstraps faster? How many Bootstrap replicates do we

Current Work:Rapid Bootstrapping Algorithm

Tested on 22 diverse (mammals, bacteria, archaea, grasses, fishes, plants, viral) real-world DNA/AA single-/multi-gene datasets containing 125-7,764 sequences

Pearson correlation on best-scoring ML trees between RBS (Rapid BS) & SBS (Standard BS) support values 0.95-0.99 (except one dataset at 0.91), average 0.97

Weighted topological distance < 6%, average 4% Program Acceleration: 8-20, average ≈ 15

Acceleration by one order of magnitude Full ML analysis (100BS + ML search) of datasets of

up to 5,000 sequences within less than 5 days on your desktop!

Allows for a sufficiently large number of Bootstrap replicates

Quick & Dirty Bootstrap

Modify Algorithm

Computational Experiments

Quick & Dirty Bootstrap

Modify Algorithm

Computational Experiments

iterate

Rapid Bootstrap

1111111111111111111111111111

011022111111110110221111111110111102220111101111022201111111111011202111111110112021

Rapid Bootstrap

1111111111111111111111111111

011022111111110110221111111110111102220111101111022201111111111011202111111110112021

Compute Starting TreeCompute Starting Tree

Rapid Bootstrap

1111111111111111111111111111

011022111111110110221111111110111102220111101111022201111111111011202111111110112021

Optimize Model Params &Optimize Model Params &Branch LengthsBranch Lengths

1111111111111111111111111111

0110221111111101102211111111 -110 -1101011110222011110111102220111 -105 -1051111111011202111111110112021 -100 -100

Rapid BootstrapUse Starting Tree &Use Starting Tree &

Model Params to compute Model Params to compute RELL scoresRELL scores

1111111111111111111111111111

0110221111111101102211111111 -110 -1101011110222011110111102220111 -105 -1051111111011202111111110112021 -100 -100

Rapid BootstrapUse Starting Tree &Use Starting Tree &

Model Params to compute Model Params to compute RELL scoresRELL scores

Sort by RELLSort by RELL

1111111111111111111111111111

1111111011202111111110112021 -100 -1001011110222011110111102220111 -105 -1050110221111111101102211111111 -110 -110

Rapid Bootstrap

TT00: Thorough Search: Thorough Search

1111111111111111111111111111

1111111011202111111110112021 -100 -1001011110222011110111102220111 -105 -1050110221111111101102211111111 -110 -110

Rapid Bootstrap

TT00: Thorough Search : Thorough Search

TT11: Quick Search on T: Quick Search on T00

1111111111111111111111111111

1111111011202111111110112021 -100 -1001011110222011110111102220111 -105 -1050110221111111101102211111111 -110 -110

Rapid Bootstrap

1111111111111111111111111111

1111111011202111111110112021 -100 -1001011110222011110111102220111 -105 -1050110221111111101102211111111 -110 -110

Rapid Bootstrap

sequential dependency is bad for parallelism

Scalability of Rapid Bootstrap

Some datasets are harder than others

ML-Scores: Garli, RAxML, PHYML 715 Sequences

Correlation 125 Taxa: 0.91

Support Value Distribution

Bootstrap Likelihood Values125 x 19,436

10,000 replicates only 195 non-trivial bipartitions

Bootstrap Likelihood Values125 x 19,436

3,491 rBCL sequencesRapid versus Standard BS

Correlation: 0.98

7,764 DNA Best Tree

7,764 DNA All Bipartitions

775 x 3,838 AA

New Opportunities

Assess Impact of Alignment Method on tree and support values

Test Bootstrap of the Bootstrap (double Bootstrap) procedures

Devise and empirically verify Bootstopping criteria

Bootstrap of the Bootstrap140 AA (Efron et al PNAS 1996)

Bootstrap of the Bootstrap3,491 rBCL

Bootstopping Rapid Bootstrapping allows to assess

Bootstopping criteria as follows1. Compute a high number of BS replicates (10,000)2. Devise topology-based bootstopping criterion and

apply it to these 10,000 replicates3. Compare support values induced by bootstopped

trees (say 300 replicates) with 10,000 replicates

We have 10,000 replicates for 18 datasets containing 125 to 2,554 sequences

Bootstopping Criterion

Every 50, 100, 150, ... replicates do a test: Say we have N BS trees Do the following 100 times:

Randomly split up this set of N trees into 2 equal sets S1, S2, of size N/2

Compute the bipartition support vectors for S1 and S2

Compute Pearson correlation of the support vectors

return average of the 100 Pearson correlations if average > 0.99 stop

Result Overview

Bootstopped between 100-400 (avg 213)

Correlation on best tree: Bootstopped versus 10,000 replicates > 0.99 (avg 0.995)

Correlation of all bipartitions > 0.995 (avg 0.997)

Bootstopping Best 140 AA

Bootstopping Best 404 DNA (Multi-Gene)

Bootstopping Best 994 DNA

Bootstopping All 994 DNA

Bootstopping Best 1,908 DNA

Bootstopping Best 2,554 DNA

Putting the Pieces together Blue-Gene: Can handle huge datasets

Use Cat approximation on BlueGene Further speedup of factor 3.5 Memory footprint reduction factor 4

8,864 Bacteria under GTR+Γand GTR+CAT

Log Likelihood Score under Γ

14 days14 days7 days7 days

Execution Time

Putting the Pieces together Blue-Gene: Can handle huge datasets

Use Cat approximation on BlueGene Further speedup of factor 3.5 Memory footprint reduction factor 4

Integrate rapid Bootstrap into BlueGene version

Additional speedup ≈ 15 Mechanisms available to accelerate

BlueGene version by factor 50-60 Integrate Bootstopping into BlueGene

Conclusion: We will soon be able to compute a small tree of life with 10,000 organisms and data from multiple genes!

Host-Parasite Co-Evolution

Hosts (eg Mammals) Parasites (eg Lice)

Hosts Parasites

Adjacency Matrix 0/1

8 Parasites

6 hosts

Co-Evolution Hypothesis

Hosts Parasites

Adjacency Matrix 0/1

8 Parasites

6 hosts

Co-Evolution Hypothesis

Statistical Test

What can HPC do forBioinformatics?Axelerated Parafit

“Parafit: statistical test of co-evolution”, Pierre Legendre, Syst. Biol. 2003

AxParafit (Axelerated Parafit) Statistical test of hypotheses of host-parasite co-

evolution C porting, optimization, BLAS integration Speedup up to factor 67 Master-Worker MPI-parallelization

Largest co-phylogenetic study to date conducted within 8 minutes instead of 4 weeks

Open-Source Code: http://icwww.epfl.ch/~stamatak/AxParafit.html

SwissGrid-based Web-Server planned

AxParafit: Sequential Performance

AxParafit: Parallel Performance

The ML Benchmark:A Current Community Project

Standardized way required to test ML search programs Web-Server with real-world alignments and performance data

at Swiss Institute of Bioinformatics Many developers of popular ML programs involved

Stephane Guindon (PHYML) Montpellier Simon Wheelan (LeaPhy) Manchester Bui Quang Minh (IQPNNI) Vienna Derrick Zwickl (GARLI) Virginia Thomas Keane (dprML) Cambridge

Byproduct: SPEC-like CPU benchmark for phylogenetics Follow-up: (planned) ML competition at major conference with

industrial sponsor

A Current Problem:Handling Multi-Gene Alignments

Gene 1 Gene 2

Sequence 1

Sequence 5

Missing Data ≠ Gap Data

A Multi-Gene Model

LogLH (T) = LogLh (T|Red)

LogLH (T) = LogLh (T|Red) +LogLH(T|Yellow)

A Multi-Gene Model

LogLH (T) = LogLh (T|Red) +LogLH(T|Yellow)

A Multi-Gene ModelChallenge: devise efficient data structures for this

Why are Individual Branches per Gene a Challenge?

Outlook

Tree of Life What is a good alignment in a

phylogenetic context? Simultaneous alignment and tree building More HPC & memory-aware programming Multi-core architectures Models for “gappy” multi-gene alignments

Acknowledgements BlueGene Project

Michael Ott, TUM

Srinivas Aluru, Jaroslaw Zola, Iowa State

Dan Janies, Andrew Johnson, Ohio State

IBM CELL & Playstation

Filip Blagojevic, Dimitris Nikolopoulos, Virginia Tech

Christos Antonopoulos, Univ. of Thessaly

Bootstopping

Bernard Moret, Masoud Alipour, EPFL

Olaf Bininda-Emonds, Univ. Jena

RAxML Web-Server

Jacques Rougemont, SIB

Terri Liebowitz, SDSC

AxParafit/AxPcoords

Markus Goeker, Alexander Auch, Jan Meier-Kolthoff, University of Tuebingen

Datasets for Studies

Jun Inoue (Florida), Nicolas Salamin (Lausanne), Marc Gottschling (Berlin), Guido Grimm (Tuebingen), Nikos Poulakakis (Yale), Usman Roshan (NJIT)

Thank you for your Attention !

Lake Geneva, Switzerland

Crunching Huge Phylogenies. A. Stamatakis

Technology

Transcript of Crunching Huge Phylogenies. A. Stamatakis

Rooting phylogenies

PHYLOGENIES AND COMMUNITY ECOLOGY

Chapter 25: Reconstructing and Using Phylogenies

Data crunching con Spark

Recombination, Phylogenies and Parsimony 21.11.05

3 Genome phylogenies - Universiteit Utrecht

MICHAIL STAMATAKIS, PhD Associate Professor in Chemical ...ucecmst/Michail_Stamatakis_FullCV.pdf · Michail Stamatakis - Page 2 of 20 . RESEARCH EXPERIENCE . 2009 – 2012 . Post-Doctoral:

Alexey M. Kozlov, Alexandros Stamatakis

Excel 2: Adventures in Data Crunching

Crunching the numbers NR14

Comparing Phylogenies

Reconstructing Mammalian Phylogenies: A Detailed ...

Chapter 3 Quantitative characters, phylogenies and ...evolution.genetics.washington.edu/papers/quantchar/quantchar.pdfQuantitative characters, phylogenies and morphometrics Joseph

Lecture 5 : Phylogenies

25 Reconstructing and Using Phylogenies. 25 Phylogenetic Trees Steps in Reconstructing Phylogenies Reconstructing a Simple Phylogeny Biological Classification.

Number Crunching in Python

Summarising Sets of Phylogenies

Crunching Reviews, 8 Million & Counting! Reviewlattice ...

Integrating Fossils into Phylogenies

Inferring Phylogenies from RAD Sequence Data