RESEARCH STATEMENT: High Performance Computing

Bioinformatics

Alexandros Stamatakis

ICS-FORTH

stamatak@ics.forth.gr

© Alexandros Stamatakis, February 2005 2

Outline

ABOUT
PART I: Phylogenetic Inference
  introduction
  maximum likelihood
  solutions
  current & future directions
PART II: Areas of Common Interest

About

1995, 1997/1998, 1999, 2001, 2001-2004, 2005

About

DFG offered PostDoc Grant for ICS
Rejected 2-year PostDoc contract offered by CBU Bergen
Funded by DAAD PostDoc Grant:
  Independence
  Autism
Today’s goal: “What do I work on & what can we work on together?”
This presentation is an informal one!

Phylogenetic Analysis: Motivation

Tree-of-life
New insights in medical & biological research
CIPRES: NSF-funded 11.6 million $ tree-of-life project (www.phylo.org)
What about a European tree-of-life project?
Applications of phylogenetic trees
  Bader et al. (2001) Industrial applications of high-performance computing for phylogeny reconstruction.
  Baker et al. (1994) Which whales are hunted? A molecular genetic approach to whaling.

Phylogenetic Methods

Input: “good” multiple alignment
Output: unrooted binary tree
Various models for phylogenetic inference
Models differ in computational complexity & accuracy of final trees
Fast & simple models
  Neighbor Joining
  Maximum Parsimony (MP)
Slow & complex models
  Maximum Likelihood (ML)
  Bayesian Methods

Example: Phylogeny of great Apes

Figure labels: Orangutan, Gorilla, Chimp, Human; common ancestor; time

The number of trees explodes!

BANG!

The Algorithmic Problem

Number of potential unrooted binary trees grows super-exponentially with the number of taxa n: (2n - 5)!!

# Taxa    # Trees
5         15
10        2,027,025
15        7,905,853,580,625
50        2.84 * 10^74

For comparison: the number of atoms in the universe is about 10^80
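The counts in the table can be checked with a few lines of Python. This is a minimal sketch that assumes the table counts unrooted binary trees with labelled tips; the function name num_unrooted_trees is made up for illustration. It relies on the standard argument that the k-th taxon can be attached to any of the 2k - 5 branches of the tree built so far.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Count unrooted binary trees on n_taxa labelled tips: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # A tree with k - 1 taxa has 2k - 5 branches, and the k-th taxon
    # can be attached to any one of them.
    for k in range(4, n_taxa + 1):
        count *= 2 * k - 5
    return count


for n in (5, 10, 15, 50):
    print(n, num_unrooted_trees(n))
```

Python integers have arbitrary precision, so the 50-taxon value is computed exactly rather than as a floating-point approximation.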

Maximum Likelihood

Maximum Likelihood calculates:
  1. Topologies
  2. Branch lengths v[i]
  3. Likelihood of the tree

Figure: tree with sequences S1 to S5 and branch lengths v1 to v7

Goal: Obtain topology with maximum likelihood value
Problem I: Number of possible topologies is exponential in n (the number of sequences)
Problem II: Computation of likelihood function is expensive
Problem III: 99.99% accuracy required
Solution: algorithmic optimizations + new heuristics + HPC

Only results are reported!

Outline

ABOUT
PART I: Phylogenetic Inference
  introduction
  maximum likelihood
  solutions
    program development
    algorithmic optimization
    heuristics
    HPC solutions
  current & future directions
PART II: Areas of Common Interest

Directions

Technical Innovation
Algorithmic Innovation
Technical innovation lags behind

Making Ends Meet

Technical Innovation
Algorithmic Innovation
Major Advances

RAxML Phylogeny Program Development

Develop fast sequential algorithm with new heuristics & optimizations
Phylogenetics is an algorithmic discipline
Parallel program
Iterate

Algorithmic Optimization

Acceleration of the likelihood function by detection of equal patterns and re-using previously computed values
Performance improvement of up to 65%
References
  Stamatakis et al. “Accelerating Parallel Maximum Likelihood-based Phylogenetic Tree Calculations using Subtree Equality Vectors”, Supercomputing 2002.
  Stamatakis et al. “AxML: A Fast Program for Sequential and Parallel Phylogenetic Tree Calculations based on the Maximum Likelihood Method”, CSB2002.
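The idea of detecting equal patterns and re-using computed values can be illustrated with a simplified sketch. It is not the Subtree Equality Vector method of the cited papers, which detects equal patterns per subtree. It only shows the simpler global variant: identical alignment columns are scored once and weighted by how often they occur. The names column_patterns, total_score, and toy_site_score are hypothetical, and toy_site_score merely stands in for an expensive per-site log-likelihood.

```python
from collections import Counter
from typing import Callable, Sequence


def column_patterns(alignment: Sequence[str]) -> Counter:
    """Count how often each distinct alignment column (site pattern) occurs."""
    return Counter(zip(*alignment))


def total_score(alignment: Sequence[str],
                site_score: Callable[[tuple], float]) -> float:
    """Sum per-site scores, calling site_score once per distinct pattern."""
    return sum(count * site_score(pattern)
               for pattern, count in column_patterns(alignment).items())


def toy_site_score(pattern: tuple) -> float:
    # Hypothetical stand-in for an expensive per-site log-likelihood.
    return -float(len(set(pattern)))


alignment = ["ACGTAC", "ACGTAC", "ACTTAC"]
print(len(column_patterns(alignment)), "distinct patterns in",
      len(alignment[0]), "columns")
print(total_score(alignment, toy_site_score))
```

In this toy alignment six columns collapse to four distinct patterns, so the per-site function is evaluated four times instead of six.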

New Heuristics

New heuristics to accelerate tree search
Algorithm:
  1. Lazy pre-scoring of many alternative topologies
  2. Store best 20-30 pre-scored trees in a list
  3. Thorough evaluation of those 20-30 best trees
Since October 2003 fastest & best ML-program on real world alignment data
Reference
  Stamatakis et al. “RAxML-III: A Fast Program for Maximum Likelihood-based Inference of Large Phylogenetic Trees”, Bioinformatics, 21(4):456-463.
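The three steps above can be sketched generically as a two-stage filter. This is an illustration under stated assumptions, not RAxML's code: the function two_stage_search and its parameters quick_score and thorough_score are hypothetical, candidates can be any objects (here plain integers), and higher scores are assumed to be better.

```python
import heapq
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def two_stage_search(candidates: Iterable[T],
                     quick_score: Callable[[T], float],
                     thorough_score: Callable[[T], float],
                     keep: int = 20) -> T:
    """Pre-score all candidates cheaply, then fully evaluate only the best few."""
    # Steps 1 and 2: lazy pre-scoring, keeping a short list of the best `keep`.
    shortlist = heapq.nlargest(keep, candidates, key=quick_score)
    # Step 3: the expensive evaluation runs only on the short list.
    return max(shortlist, key=thorough_score)


# Toy usage: the cheap score is slightly off, the thorough score is exact.
best = two_stage_search(range(100),
                        quick_score=lambda x: -abs(x - 50),
                        thorough_score=lambda x: -abs(x - 48))
print(best)
```

The expensive function is called at most `keep` times, regardless of how many candidates were pre-scored. The approach only works if the cheap score is good enough to keep the true best candidate on the short list.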

Simulated Annealing

Combination of hill-climbing & simulated annealing
Not significantly slower than hill climbing
Builds consensus trees on the fly
More likely to avoid local optima
Finds better trees for large alignments (≥ 500 sequences)
Reference
  Stamatakis. “An Efficient Program for phylogenetic Inference Using Simulated Annealing”, IPDPS2005.
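Why simulated annealing is more likely to avoid local optima can be seen in a textbook sketch of the acceptance rule. This is generic simulated annealing, not the scheme of the cited program: the function anneal, the geometric cooling schedule, and all parameter values are assumptions chosen for illustration.

```python
import math
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def anneal(start: T,
           neighbour: Callable[[T, random.Random], T],
           score: Callable[[T], float],
           t_start: float = 1.0,
           cooling: float = 0.95,
           steps: int = 500,
           seed: int = 0) -> T:
    """Maximise score; returns the best state seen during the run."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    temperature = t_start
    for _ in range(steps):
        candidate = neighbour(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Improvements are always accepted (the hill-climbing part). A worse
        # state is accepted with probability exp(delta / T), which shrinks
        # as the temperature cools.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
        temperature *= cooling
    return best


# Toy usage: maximise -(x - 3)^2 over the integers, starting from 0.
toy_score = lambda x: -(x - 3) ** 2
result = anneal(0, lambda x, rng: x + rng.choice((-1, 1)), toy_score)
print(result, toy_score(result))
```

At high temperature the search occasionally steps downhill and can leave a local optimum. As the temperature falls, it behaves more and more like plain hill climbing.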

Parallel MPI Implementation

Non-deterministic parallel implementation with very low communication costs
Partially superlinear speedups due to non-determinism
Also available as distributed http-based program
Largest ML-analysis to date, containing 10,000 organisms, on RRZE PC-Cluster
Reference
  Stamatakis et al. “Parallel Inference of a 10.000-taxon Phylogeny with Maximum Likelihood”, EuroPar 2004.

Shared Memory Parallelism

Figure: tree nodes P, Q, and R; virtual root

P[i] = f( g(Q[i]) , g(R[i]) )

This operation uses ≥ 90% of total execution time!
Simple fine-grained parallelisation
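The update P[i] = f( g(Q[i]) , g(R[i]) ) can be made concrete with a small sketch. The slides do not say what f and g are, so the following is an assumption: g multiplies a child's likelihood vector by a transition matrix (here the Jukes-Cantor model, chosen for simplicity), and f is the element-wise product over the four nucleotide states. The names jc69_matrix and update_parent are hypothetical.

```python
import math


def jc69_matrix(t):
    """4x4 Jukes-Cantor transition probabilities for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if a == b else diff for b in range(4)] for a in range(4)]


def g(vec, pmat):
    """Propagate one child's likelihood vector along its branch."""
    return [sum(pmat[a][b] * vec[b] for b in range(4)) for a in range(4)]


def f(left, right):
    """Combine both children: element-wise product over the four states."""
    return [l * r for l, r in zip(left, right)]


def update_parent(Q, R, t_q, t_r):
    """Compute P[i] = f(g(Q[i]), g(R[i])) for every alignment site i."""
    pq, pr = jc69_matrix(t_q), jc69_matrix(t_r)
    # Each site i is independent of all others, so this loop is what a
    # fine-grained parallelisation would split across threads.
    return [f(g(q, pq), g(r, pr)) for q, r in zip(Q, R)]


# Toy usage: two sites, tips encoded as indicator vectors over (A, C, G, T).
Q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
R = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
print(update_parent(Q, R, 0.1, 0.2))
```

Because no iteration of the site loop reads or writes data belonging to another site, the loop can be divided among threads without synchronisation inside the loop body.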

OpenMP Parallelisation of RAxML

OpenMP: little effort required to parallelise program (1 week!)
Can help to solve memory problems for very long/large alignments
Can easily be applied to other programs such as PHYML
Hybrid parallelisation possible
Performance extremely HW-dependent!
  very good on AMD Opteron
  not so good on Intel Itanium & Xeon

RAxML-OMP Speedup

Outline

ABOUT
PART I: Phylogenetic Inference
  introduction
  maximum likelihood
  solutions
  current & future directions
    Things I can cover
    Things I can partially cover
    Things I cannot cover
PART II: Areas of Common Interest

Things I can cover

Novel Divide-and-Conquer algorithms
  Cooperations: Usman Roshan at NJIT, Olaf Bininda-Emonds at TUM, Le Sy Vinh at HHUD
Hybrid parallelisation of RAxML
  Cooperation: Michael Ott at TUM
OpenMP parallelisation of PHYML
  Cooperation: Michael Ott at TUM
Parallelisation of RAxML on GPUs
  Cooperation: Pedro Trancoso at UC, Michael Ott at TUM
Grid-enabled RAxML
  Cooperation: Angelos Bilas at ICS

Things I can partially cover

RAxML code maintenance
Phylogenetic on-line Benchmark
Phylogeny contest
Implementation of Protein substitution models in RAxML
Push for a European Tree-of-Life project
Candidate SPEC-Benchmark?
RAxML Web-Interface
Simultaneous multiple alignment and tree building

Things I cannot cover

Phylogenetic Networks
Novel Consensus-Tools
Novel Mathematical Models
Visualization tools
Issues of quality assessment
Performance of RAxML-OMP on SMT/HT architectures

Outline

ABOUT
PART I: Phylogenetic Inference
  introduction
  maximum likelihood
  solutions
  current & future directions
PART II: Areas of Common Interest
  What I am looking for
  What you might be looking for

What I am looking for:

Compute-intensive sequential Bioinformatics applications
Everything that runs between 1 and 10,000 days
Compute-intensive NP-complete optimization problems
Design & experimental evaluation of new heuristics
Manpower
Biologists at IMBB that want to use RAxML
With the current infrastructure at ICS we could compute trees for up to 5000 sequences

What you might be looking for:

Support with phylogenetic tree building programs
Parallel and distributed implementation of your applications
A benchmark application to test CPU architectures
  RAxML has some interesting properties
Collaborate on some interesting aspects of phylogenetics I cannot fully cover

Conclusion

Presentation of past, current & future work on RAxML
Issues for collaboration in phylogenetics
Issues for collaboration in other domains
Code & papers available at: www.ics.forth.gr/~stamatak