RESEARCH STATEMENT: High Performance Computing Bioinformatics
Alexandros Stamatakis
ICS-FORTH
© Alexandros Stamatakis, February 2005 2
Outline
  ABOUT
  PART I: Phylogenetic Inference
    introduction
    maximum likelihood
    solutions
    current & future directions
  PART II: Areas of Common Interest
About
[Timeline slides: 1995, 1997/1998, 1999, 2001, 2001-2004, 2005]
About
  DFG offered PostDoc Grant for ICS
  Rejected 2-year PostDoc contract offered by CBU Bergen
  Funded by DAAD PostDoc Grant:
    Independence
    Autism
  Today’s goal: “What do I work on & what can we work on together?”
  This presentation is an informal one!
Phylogenetic Analysis Motivation
  Tree-of-life
  New insights in medical & biological research
  CIPRES: NSF-funded $11.6 million tree-of-life project (www.phylo.org)
  What about a European tree-of-life project?
  Applications of phylogenetic trees
    Bader et al. (2001) Industrial applications of high-performance computing for phylogeny reconstruction.
    Baker et al. (1994) Which whales are hunted? A molecular genetic approach to whaling.
Phylogenetic Methods
  Input: “good” multiple alignment
  Output: unrooted binary tree
  Various models for phylogenetic inference
  Models differ in computational complexity & accuracy of final trees
  Fast & simple models
    Neighbor Joining
    Maximum Parsimony (MP)
  Slow & complex models
    Maximum Likelihood (ML)
    Bayesian Methods
Example: Phylogeny of great Apes
[Figure: tree of Orangutan, Gorilla, Chimp and Human descending from a common ancestor along a time axis]
The number of trees explodes!
BANG !
The Algorithmic Problem
  Number of potential trees grows exponentially

  # Taxa    # Trees
  5         15
  10        2,027,025
  15        7,905,853,580,625
  50        2.84 * 10^74

  For comparison: the number of atoms in the universe is about 10^80
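The tree counts for 5, 10 and 15 taxa match the standard formula for the number of unrooted binary topologies on n leaves, (2n - 5)!! = 3 * 5 * ... * (2n - 5). The slide does not state which formula was used, so the following is a minimal sketch under that assumption; the function name is illustrative only.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted binary tree topologies on n_taxa leaves.

    Assumes the double-factorial formula (2n - 5)!!, valid for n >= 3.
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (5, 10, 15):
        print(n, num_unrooted_trees(n))
```

Python integers have arbitrary precision, so the same function can be called for larger numbers of taxa without overflow.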
Maximum Likelihood
  Maximum Likelihood calculates:
    1. Topologies
    2. Branch lengths v[i]
    3. Likelihood of the tree
  [Figure: unrooted tree with sequences S1-S5 at the leaves and branch lengths v1-v7]
  Goal: Obtain topology with maximum likelihood value
  Problem I: Number of possible topologies is exponential in n
  Problem II: Computation of likelihood function is expensive
  Problem III: 99.99% accuracy required
  Solution: algorithmic optimizations + new heuristics + HPC
  Only results are reported!
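To give a feel for why the likelihood of a tree is expensive to compute, here is a small sketch of the classic recursive evaluation (Felsenstein's pruning scheme) for a fixed topology and fixed branch lengths. It assumes the Jukes-Cantor substitution model and a nested-tuple tree encoding; neither is taken from the slides, and all names are hypothetical.

```python
import math

BASES = "ACGT"


def jc69_prob(same: bool, v: float) -> float:
    """Jukes-Cantor probability of seeing the same / a different base after branch length v."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def conditional(node, site: int):
    """Conditional likelihoods (one per base) of the subtree below `node` for one column.

    A leaf is a sequence string over A, C, G, T (gaps are not handled here).
    An inner node is a tuple of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [1.0 if node[site] == b else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, v in node:
        child_l = conditional(child, site)
        for x in range(4):
            result[x] *= sum(jc69_prob(x == y, v) * child_l[y] for y in range(4))
    return result


def log_likelihood(root, n_sites: int) -> float:
    """Sum of per-column log likelihoods, with equal base frequencies at the root."""
    total = 0.0
    for site in range(n_sites):
        total += math.log(sum(0.25 * c for c in conditional(root, site)))
    return total


if __name__ == "__main__":
    inner = (("ACGT", 0.1), ("ACGA", 0.1))
    tree = ((inner, 0.05), ("ACGT", 0.2), ("AGGT", 0.3))
    print(log_likelihood(tree, 4))
```

Every inner node costs work proportional to the alignment length, and a tree search repeats this for a very large number of candidate topologies and branch lengths.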
Outline
  ABOUT
  PART I: Phylogenetic Inference
    introduction
    maximum likelihood
    solutions
      program development
      algorithmic optimization
      heuristics
      HPC solutions
    current & future directions
  PART II: Areas of Common Interest
Directions
Technical Innovation
Algorithmic Innovation
Technical innovation drags behind
Making Ends Meet
Technical Innovation
Algorithmic Innovation
Major Advances
RAxML Phylogeny Program Development
  Develop fast sequential algorithm with new heuristics & optimizations
  Phylogenetics is an algorithmic discipline
  Parallel program
  Iterate
Algorithmic Optimization
  Acceleration of the likelihood function by detection of equal patterns and re-using previously computed values
  Performance improvement of up to 65%
  References
    Stamatakis et al. “Accelerating Parallel Maximum Likelihood-based Phylogenetic Tree Calculations using Subtree Equality Vectors”, Supercomputing 2002.
    Stamatakis et al. “AxML: A Fast Program for Sequential and Parallel Phylogenetic Tree Calculations based on the Maximum Likelihood Method”, CSB2002.
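The slide does not spell out how equal patterns are detected. A much simpler relative of the idea, shown here only as an illustration and not as the subtree equality vector method of the cited papers, is to collapse identical alignment columns and evaluate each distinct column once. The per-column scoring function is a hypothetical, caller-supplied placeholder.

```python
from collections import Counter


def compress_columns(alignment):
    """Map each distinct alignment column to the number of times it occurs."""
    n_sites = len(alignment[0])
    if any(len(seq) != n_sites for seq in alignment):
        raise ValueError("sequences must have equal length")
    # zip(*alignment) yields one tuple of characters per column.
    return Counter(zip(*alignment))


def weighted_log_likelihood(alignment, site_log_likelihood):
    """Sum per-column log likelihoods, evaluating each distinct column only once.

    `site_log_likelihood` takes one column (a tuple of characters) and
    returns its log likelihood; it stands in for the expensive computation.
    """
    patterns = compress_columns(alignment)
    return sum(count * site_log_likelihood(col) for col, count in patterns.items())
```

The saving depends on how many columns repeat; the cited work goes further by detecting equal patterns within subtrees.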
New Heuristics
  New heuristics to accelerate tree search
  Algorithm:
    1. Lazy pre-scoring of many alternative topologies
    2. Store best 20-30 pre-scored trees in a list
    3. Thorough evaluation of those 20-30 best trees
  Since October 2003 fastest & best ML-program on real world alignment data
  Reference
    Stamatakis et al. “RAxML-III: A Fast Program for Maximum Likelihood-based Inference of Large Phylogenetic Trees”, Bioinformatics, 21(4):456-463.
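The three steps above follow a general two-stage pattern: score everything cheaply, keep a short list, then spend the expensive evaluation only on that list. The sketch below shows the pattern in isolation. The two scoring functions are hypothetical placeholders, not RAxML routines, and the default list size of 20 is simply taken from the lower end of the range on the slide.

```python
import heapq


def two_stage_search(candidates, quick_score, thorough_score, keep=20):
    """Pre-score all candidates cheaply, then evaluate only the best `keep` thoroughly.

    Higher scores are better. Returns the candidate with the best thorough score.
    """
    shortlist = heapq.nlargest(keep, candidates, key=quick_score)
    if not shortlist:
        raise ValueError("no candidates given")
    return max(shortlist, key=thorough_score)
```

The approach pays off when the cheap score ranks candidates roughly like the expensive one, so that the best candidate is unlikely to fall outside the short list.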
Simulated Annealing
  Combination of hill-climbing & simulated annealing
  Not significantly slower than hill climbing
  Builds consensus trees on the fly
  More likely to avoid local optima
  Finds better trees for large alignments (≥ 500 sequences)
  Reference
    Stamatakis. “An Efficient Program for phylogenetic Inference Using Simulated Annealing”, IPDPS2005.
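For readers unfamiliar with simulated annealing, the following generic sketch shows the acceptance rule that lets a search leave local optima: worse moves are accepted with a probability that shrinks as the temperature drops. It does not reproduce the hybrid hill-climbing scheme or the consensus-tree building of the cited paper; the cooling schedule, parameter values and function names are assumptions for illustration.

```python
import math
import random


def anneal(start, neighbor, score, t_start=1.0, t_end=1e-3, cooling=0.95, rng=None):
    """Maximise `score` with a basic simulated-annealing loop.

    `neighbor(state, rng)` proposes a modified state; `score(state)` rates it.
    Returns the best state seen and its score.
    """
    rng = rng or random.Random(0)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    t = t_start
    while t > t_end:
        cand = neighbor(current, rng)
        cand_score = score(cand)
        delta = cand_score - current_score
        # Always accept improvements; accept worse moves with probability exp(delta / t).
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = cand, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
        t *= cooling  # geometric cooling schedule
    return best, best_score
```

With the temperature held at zero this reduces to plain hill climbing, which is why the two strategies combine naturally.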
Parallel MPI Implementation
  Non-deterministic parallel implementation with very low communication costs
  Due to non-determinism, partially superlinear speedups
  Also available as distributed http-based program
  Largest ML-analysis to date, containing 10,000 organisms, on RRZE PC-Cluster
  Reference
    Stamatakis et al. “Parallel Inference of a 10.000-taxon Phylogeny with Maximum Likelihood”, EuroPar 2004.
Shared Memory Parallelism
  [Figure: tree node P with child nodes Q and R; virtual root marked]
  P[i] = f( g(Q[i]) , g(R[i]) )
  This operation uses ≥ 90% of total execution time!
  Simple fine-grained parallelisation
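Because each P[i] depends only on Q[i] and R[i], the index range can be split into independent chunks. The sketch below illustrates that decomposition with a thread pool. The functions f and g are not defined on the slide, so they are passed in as parameters, and all names are hypothetical. In CPython, threads will not speed up pure-Python arithmetic; the point is only to show the loop structure that a parallel-for in OpenMP would exploit.

```python
from concurrent.futures import ThreadPoolExecutor


def combine_chunk(q, r, f, g, start, stop):
    """Compute P[i] = f(g(Q[i]), g(R[i])) for start <= i < stop."""
    return [f(g(q[i]), g(r[i])) for i in range(start, stop)]


def combine_parallel(q, r, f, g, workers=4):
    """Split the index range into contiguous chunks and process one chunk per task."""
    n = len(q)
    if len(r) != n:
        raise ValueError("Q and R must have the same length")
    chunk = max(1, -(-n // workers))  # ceiling division
    bounds = [(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in submission order, so P keeps its ordering.
        parts = pool.map(lambda b: combine_chunk(q, r, f, g, *b), bounds)
        return [value for part in parts for value in part]
```

No synchronisation is needed inside the loop, since no iteration reads a value written by another.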
OpenMP Parallelisation of RAxML
  OpenMP: little effort required to parallelize program (1 week!)
  Can help to solve memory problems for very long/large alignments
  Can easily be applied to other programs such as PHYML
  Hybrid parallelisation possible
  Performance extremely HW-dependent!
    very good on AMD Opteron
    not so good on Intel Itanium & Xeon
RAxML-OMP Speedup
Outline
  ABOUT
  PART I: Phylogenetic Inference
    introduction
    maximum likelihood
    solutions
    current & future directions
      Things I can cover
      Things I can partially cover
      Things I cannot cover
  PART II: Areas of Common Interest
Things I can cover
  Novel Divide-and-Conquer algorithms
    Cooperations: Usman Roshan at NJIT, Olaf Bininda-Emonds at TUM, Le Sy Vinh at HHUD
  Hybrid parallelisation of RAxML
    Cooperation: Michael Ott at TUM
  OpenMP parallelisation of PHYML
    Cooperation: Michael Ott at TUM
  Parallelisation of RAxML on GPUs
    Cooperation: Pedro Trancoso at UC, Michael Ott at TUM
  Grid-enabled RAxML
    Cooperation: Angelos Bilas at ICS
Things I can partially cover
  RAxML code maintenance
  Phylogenetic on-line Benchmark
  Phylogeny contest
  Implementation of Protein substitution models in RAxML
  Force European Tree-of-Life project
  Candidate SPEC-Benchmark?
  RAxML Web-Interface
  Simultaneous multiple alignment and tree building
Things I cannot cover
  Phylogenetic Networks
  Novel Consensus-Tools
  Novel Mathematical Models
  Visualization tools
  Issues of quality assessment
  Performance of RAxML-OMP on SMT/HT architectures
Outline
  ABOUT
  PART I: Phylogenetic Inference
    introduction
    maximum likelihood
    solutions
    current & future directions
  PART II: Areas of Common Interest
    What I am looking for
    What you might be looking for
What I am looking for:
  Compute-intensive sequential Bioinformatics applications
    Everything that runs between 1 and 10,000 days
  Compute-intensive NP-complete optimization problems
    Design & experimental evaluation of new heuristics
  Manpower
  Biologists at IMBB that want to use RAxML
  With the current infrastructure at ICS we could compute trees for up to 5000 sequences
What you might be looking for:
  Support with phylogenetic tree building programs
  Parallel and distributed implementation of your applications
  A benchmark application to test CPU architectures
    RAxML has some interesting properties
  Collaborate on some interesting aspects of phylogenetics I cannot fully cover
Conclusion
  Presentation of past, current & future work on RAxML
  Issues for collaboration in phylogenetics
  Issues for collaboration in other domains
  Code & papers available at: www.ics.forth.gr/~stamatak