From Sequence Analysis to Simulations: Applications of HPC in Modern Biology R. Sankararamakrishnan Department of Biological Sciences & Bioengineering IIT-Kanpur IIT-K REACH Symposium 2010


Page 1: From Sequence Analysis to Simulations: Applications of HPC in Modern Biology R. Sankararamakrishnan Department of Biological Sciences & Bioengineering.

From Sequence Analysis to Simulations: Applications of HPC in Modern Biology

R. Sankararamakrishnan

Department of Biological Sciences & Bioengineering

IIT-Kanpur

IIT-K REACH Symposium 2010

Oct 9th 2010


Computers and Computing in Biology

Bioinformatics

Computational Biology

Mathematical Biology

Biostatistics

Biomathematics

Quantitative Biology

Biophysics


What is Bioinformatics? - Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

What is Computational Biology? - The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

- NIH Definition http://www.bisti.nih.gov/

Definitions


Explosive growth of biological data


HPC Applications: Three examples

Evolutionary relationship among a given set of protein or DNA sequences

Drug Discovery and Design

Structure-function relationship of large biomolecular assemblies


I. HPC in Phylogenetics


Phylogeny and Phylogenetic tree

Study of evolutionary relationships (sequences/species)

Relationships between organisms with common ancestor

Phylogenetic tree is a graph representing evolutionary history of sequences/species


Phylogenetic trees can be represented in two different ways

Rooted Tree: has a unique node (the root); shows the direction of evolution

Unrooted Tree: no assumption about common ancestry

[Figure: rooted and unrooted trees for Human, Chimpanzee, Gorilla and Orangutan]


Molecular phylogeny in a criminal investigation


Maximum Likelihood Method – An Introduction

David Mount (2002)


For each unrooted tree, there will be many possible rooted trees


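Number of rooted trees for n species:

$$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$

Number of unrooted trees for n species:

$$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$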

Number of possible unrooted and rooted trees

Species | Number of Rooted Trees | Number of Unrooted Trees
2 | 1 | 1
3 | 3 | 1
4 | 15 | 3
5 | 105 | 15
10 | 34,459,425 | 2,027,025
15 | 213,458,046,676,875 | 7,905,853,580,625
20 | 8,200,794,532,637,891,559,375 | 221,643,095,476,699,771,875
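The counting formulas for rooted and unrooted binary trees, N_R = (2n-3)!/(2^(n-2) (n-2)!) and N_U = (2n-5)!/(2^(n-3) (n-3)!), can be evaluated directly to see how quickly the search space grows. The following is a minimal sketch; the function names are illustrative and not from the talk, and the case n = 2 for unrooted trees is handled separately because the factorial formula is undefined there.

```python
from math import factorial


def num_rooted_trees(n):
    """Number of rooted binary trees for n >= 2 species: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("need at least 2 species")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def num_unrooted_trees(n):
    """Number of unrooted binary trees for n >= 2 species: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 2:
        raise ValueError("need at least 2 species")
    if n == 2:
        return 1  # a single branch joining the two species
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 10, 15, 20):
        print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

For n >= 3 the ratio N_R / N_U equals 2n - 3, the number of branches of an unrooted tree on which a root can be placed.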


Computing phylogenetic trees using ML method

Maximum likelihood phylogeny problem is NP-hard

Very CPU intensive

For trees containing more than 20 to 25 sequences, an exhaustive search for the optimal tree is no longer feasible

Efficient heuristic tree search algorithms are required to reduce the size of the search space

Recently developed algorithms: IQPNNI, PHYML, GARLI, RAxML

None of these algorithms is guaranteed to find the ML tree; they yield only the best-known ML tree
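To give a feel for why ML is CPU intensive, the sketch below scores one fixed tree with Felsenstein's pruning algorithm under the Jukes-Cantor substitution model. This is an illustration only, not the implementation used by RAxML or the other programs named above: the model choice, the nested-tuple tree encoding, the function names and the toy alignment are all assumptions made for the example. A heuristic search repeats a calculation of this kind for every candidate tree and every alignment column.

```python
import math

BASES = "ACGT"


def jc69_probs(t):
    """Jukes-Cantor probabilities (same base, specific different base) over branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


def partial_likelihoods(node, site):
    """Likelihood of the data below `node`, for each possible base at `node`.

    A leaf is a taxon name (str); an internal node is a tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    `site` maps taxon name -> observed base at one alignment column.
    """
    if isinstance(node, str):
        return [1.0 if base == site[node] else 0.0 for base in BASES]
    left, t_left, right, t_right = node
    result = [1.0] * 4
    for child, t in ((left, t_left), (right, t_right)):
        child_partials = partial_likelihoods(child, site)
        p_same, p_diff = jc69_probs(t)
        for i in range(4):
            result[i] *= sum(
                (p_same if i == j else p_diff) * child_partials[j]
                for j in range(4)
            )
    return result


def log_likelihood(tree, alignment):
    """Sum of per-column log-likelihoods, with uniform base frequencies at the root."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for k in range(n_sites):
        site = {name: seq[k] for name, seq in alignment.items()}
        total += math.log(sum(0.25 * p for p in partial_likelihoods(tree, site)))
    return total


if __name__ == "__main__":
    # Made-up sequences and branch lengths, for illustration only.
    tree = (("Human", 0.1, "Chimpanzee", 0.1), 0.05, "Gorilla", 0.2)
    alignment = {"Human": "ACGT", "Chimpanzee": "ACGA", "Gorilla": "ACTT"}
    print(log_likelihood(tree, alignment))
```

Because each alignment column is scored independently, the per-column work can in principle be divided among processors, which is one reason long alignments are well suited to parallel machines.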


Parallelization strategy

Ott et al. (2008)


RAxML performance in some HPC platforms

Ott et al. (2008)

212 sequences, 566,470 base pairs

One of the largest datasets analyzed under ML

IBM BlueGene/L; 1024 CPUs

7 distinct tree searches in 14 hours


Phylogenetic analysis of plant channel proteins identified new subfamily

Bansal and Sankararamakrishnan, BMC Struct. Biol. (2007)

Gupta and Sankararamakrishnan, BMC Plant Biol. (2009)


II. HPC in Drug Discovery & Drug Design


“Is there really a case where a drug that is on the market was designed by a computer?”

“The reality is that the use of computers and computer methods permeates all aspects of drug discovery today”

Jorgensen (2004)

Roles of Computation in Drug Discovery


“Drug discovery is complex: Successful teams and companies need to be congratulated, whereas search for one individual or computer program is counterproductive. There is not going to be a voila moment at the computer terminal. Instead, there is systematic use of wide-ranging computational tools to facilitate and enhance the drug discovery process”

Computation in Drug Discovery

Jorgensen (2004)


Structure-based Drug Design – An Introduction

http://csb.stanford.edu/levitt/demo_lectures/lec7/Lecture7/Discovering_Drugs/pages/Structure_Based_Drug_Design.html

http://www.biocryst.com/our_science


Wim Hol

www.bmsc.washington.edu/WimHol/sbdd3.JPG


Lead Generation

Lead optimization

De novo design

Virtual screening

Bleicher et al. (2003)

All drugs presently on the market are estimated to target fewer than 500 biomolecules

Docking & Scoring

Drug targets and Drug discovery: Issues

Issues: Scoring function, solvent effect and protein flexibility


Four proteins: trypsin, HIV PR, CDK2 and AChE

Test set for each protein: 10,000 randomly selected compounds

6000 docking poses were selected for the top 1000 compounds

They served as initial conformations for MD simulations

The combination of docking and MD showed higher and more stable enrichment performance than docking alone


A special-purpose computer, MDGRAPE-3, was used for MD simulations

It is a cluster of personal computers, each equipped with 24 MDGRAPE-3 chips and with a peak speed of approximately 2 Tflops

50 such computers were used

Average computational time for a single protein-ligand complex is 2.5 h

For 6,000 protein-ligand conformations, calculations were completed in a week
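The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above (6,000 conformations, 2.5 h each, 50 computers). How many complexes each computer processed at a time is not stated on the slide, so the concurrency estimate is an inference, not a reported figure.

```python
import math

n_complexes = 6000          # protein-ligand conformations (from the slide)
hours_per_complex = 2.5     # average time per complex (from the slide)
n_computers = 50            # MDGRAPE-3 equipped PCs (from the slide)

# Total work if every complex is simulated once.
total_hours = n_complexes * hours_per_complex

# Wall-clock time if each computer handled exactly one complex at a time
# (an assumption made here for comparison).
days_one_at_a_time = total_hours / n_computers / 24

# Number of complexes that must run concurrently to finish within one week.
concurrent_needed = math.ceil(total_hours / (7 * 24))

print(total_hours, days_one_at_a_time, concurrent_needed)
```

Finishing in a week therefore implies roughly two complexes running per computer at any time, assuming the 2.5 h figure applies to a single run.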


Steered Molecular Dynamics to compute the force required to extract the inhibitors from enzymes

A small spring is connected to the ligand in the complex

This spring is pulled at constant velocity into the surrounding water

Force is determined from the extension of the spring and recorded as a function of time

Strongly bound inhibitors: higher peak forces

Weaker inhibitors: flatter profiles

Steered MD in Drug Discovery

Jorgensen, 2010
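The pulling protocol described above can be illustrated with a one-dimensional toy model. This is not a molecular simulation and not the method of the cited work: the Gaussian binding well, the overdamped (friction-dominated) dynamics, the arbitrary units and every parameter name and value are assumptions chosen only to show how the spring force F(t) = k (v t - x) is recorded and why a deeper well gives a higher peak.

```python
import math


def pull_ligand(well_depth, well_width=1.0, spring_k=50.0, velocity=0.5,
                friction=1.0, dt=0.001, total_time=10.0):
    """Pull a particle out of a Gaussian well with a spring moving at constant velocity.

    Returns the spring force recorded at every time step (arbitrary units).
    """
    x = 0.0                      # particle starts at the bottom of the well
    forces = []
    steps = int(total_time / dt)
    for step in range(steps):
        t = step * dt
        # Spring force from the extension between the moving anchor and the particle.
        spring_force = spring_k * (velocity * t - x)
        # Restoring force of the well U(x) = -depth * exp(-x^2 / (2 w^2)).
        well_force = (-well_depth * x / well_width ** 2
                      * math.exp(-x ** 2 / (2 * well_width ** 2)))
        # Overdamped dynamics integrated with an explicit Euler step.
        x += dt * (spring_force + well_force) / friction
        forces.append(spring_force)
    return forces


if __name__ == "__main__":
    strong = pull_ligand(well_depth=10.0)   # stands in for a strongly bound inhibitor
    weak = pull_ligand(well_depth=2.0)      # stands in for a weakly bound inhibitor
    print(max(strong), max(weak))
```

In this toy model the deeper well produces a pronounced force peak before the particle escapes, while the shallow well gives a much flatter profile, mirroring the qualitative behaviour listed on the slide.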


Protein-protein interactions in programmed cell death

Lama and Sankararamakrishnan, Proteins (2008)

Lama and Sankararamakrishnan, Biochemistry (2010)

Bcl-2 family complex structures

Total number of atoms: ~50,000 to ~75,000

Simulation period: 50 ns


III. Large Biomolecular Assemblies


The first biomolecular simulation was performed in 1977


GlpF: 81006 atoms

AQP1: 75057 atoms

PfAQP: 81503 atoms

A 30 ns production run was performed for each of the three systems.

Each simulation takes ~40 days of CPU time (total CPU time ~120 days).

MD simulations of channel proteins in bilayers

Alok Jain, Ravi Verma and R. Sankararamakrishnan, Manuscript in preparation


Simulations reaching the million-atom mark

Complete virus: 1 million atoms (Freddolino et al., 2006)

Arrays of light-harvesting proteins: 1 million atoms (Chandler et al., 2008)

BAR domain proteins: 2.3 million atoms (Yin et al., 2009)

The flagellum: 2.4 million atoms (Kitao et al., 2006)


Minimization and equilibration

Cluster of 48 AMD Athlon 2600+ processors

Simulation

256 Altix nodes at NCSA @UIUC

1.1 ns/day

Complete virus: 1 million atoms

(Freddolino et al., 2006)
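A throughput such as 1.1 ns/day translates directly into how long a run of a given length would take. The helper functions below are a hypothetical convenience, not part of any simulation package. The 30 ns and 50 ns run lengths and the ~40 day figure are borrowed from other slides of this talk purely as illustrative inputs, and treating the ~40 days as elapsed time is an assumption.

```python
def days_needed(run_length_ns, ns_per_day):
    """Days required to simulate run_length_ns at a given throughput."""
    return run_length_ns / ns_per_day


def throughput(run_length_ns, days):
    """Average throughput in ns/day for a run of run_length_ns taking `days`."""
    return run_length_ns / days


if __name__ == "__main__":
    # At 1.1 ns/day, how long would 30 ns and 50 ns runs take?
    for length in (30, 50):
        print(length, "ns:", round(days_needed(length, 1.1), 1), "days")
    # A 30 ns run that takes about 40 days corresponds to this rate:
    print(throughput(30, 40), "ns/day")
```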


Functions of large molecular machines

30S ribosome

Fungal fatty acid synthase


Gumbart et al. (2009)

2.7 million atoms

50 ns simulation

MD of protein-conducting channel bound to ribosome

Largest system simulated to date

Bacterial ribosomes are important targets for antibiotics


Phylogenetic analysis

Large Biomolecular systems

Drug Design & Discovery

HPC


HPC Platforms for Biology Applications

FPGA boards: Field-programmable gate arrays are integrated circuits that can be programmed after manufacture. FPGA boards implementing commonly used bioinformatics algorithms are available

Graphics-Processing Unit (GPU): All bioinformatics applications

Grid Computing: Many applications

Distributed Computing: Protein folding, Drug docking

Cloud Computing:


Acknowledgements

Anjali Bansal

Dilraj Lama

Alok Jain

Tuhin Kumar Pal

Priyanka Srivastava

Vivek Modi

Ravi Kumar Verma

Krishna Deepak

Phani Deep

DST, DBT, CSIR, MHRD
