BioPilot: Data-Intensive Computing for Complex Biological Systems

Presented by

BioPilot: Data-Intensive Computingfor Complex Biological Systems

Nagiza SamatovaComputer Science Research Group

Computer Science and Mathematics Division

Sponsored bythe Department of Energy’s Office of Science

Advanced Scientific Computing ResearchScientific Discovery through Advanced Computing

2 Samatova_BioPilot_SC07

Biological research is becoming a high-throughput, data-intensive science, requiring development and evaluation of new methods and algorithms for large-scale computational analysis, modeling, and simulation

-80 -60 -40 -20 0 20 40 60 800

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

Figure 1

Relative Frequency of Appearance

Fragment Ion Mass Offset

Computational algorithms for proteomics Improve the efficiency and reliability of protein identification and

quantification Peptide identification using MS/MS using pre-computed spectral

databases Combinatorial peptide search with mutations, post- translational

modifications and cross-linked constructsInference, modeling, and simulation of biological networks

Reconstruction of cellular network topologies using high-throughput biological data

Stochastic and differential equation-based simulations of the dynamics of cellular networks

Integration of bioinformatics and biomolecular modeling and simulation

Structure, function, and dynamics of complex biomolecular system in appropriate environments

Comparative analysis of large-scale molecular simulation trajectories Event recognition in biomolecular simulations Multi-level modeling of protein models and their conformational space

Goals for enabling data-intensivecomputing in biology


Comparative molecular trajectoryanalysis with BioSimGrid

Routine protein simulations generate TB trajectories.

Routine analysis tools (ED) require multiple passes.

Correlation analyses have random access patterns.

Comparative analyses require multiple/distributed trajectory access.

Comparative Trajectory Analysis Tools

ENERGYGRADIENTOPTIMIZE

DYNAMICSTHERMODYNAMICS

QMDQM/MM

ETQHOP

INPUTPROPERTYPREPAREANALYZE

ESPVIB

Classical Force FieldDFT

SCF: RHF UHF ROHFMP2: RHF UHFMP3: RHF UHFMP4: RHF UHF

RI-MP2CCSD(T): RHFCASSCF/GVB

MCSCFMR-CI-PT

CI: Columbus Full Selected

Integral APIGeometryBasis Sets

PEigSpFFT

LAPACKBLAS

MAGlobal Arrays

ecceChemIO

NWCHM

The analysis of molecular dynamicstrajectories presents a large,data-intensive analysis problem:

The scientific challenge:Integrated molecular simulationand comparative moleculartrajectory analysis:

Enzymatic reaction mechanisms Molecular machines Molecular basis of signal and

material transport

T. P. Straatsma, Computational Biology and Bioinformatics


d

Building accurate protein structures Homology models can be built from related protein structures, but even

with proper alignment, usually have about 4A RMS error–too large to permit meaningful computational simulation/molecular dynamics.Goal is to reduce the errors in initial models.

We multi-align the group of neighbors using sequence and structure information to find the most stable parts of the domain. Extracted stable core structure is 2030% closer in terms of RMSD to the target than to any of the original templates.

More flexible parts of target are modeled locally by choosing most sequentially similar loops from the library of local segments found in all the homologues.

A genetic algorithm conformation search strategy using Cray XT4 uses backbone and sidechain angles as parameters. Each GA step is evaluated by minimization.

A computed template compared with the target, improved from 3.4A initially to 2.3A

E. Uberbacher, Biological and Environmental Sciences Division


• Ultra-scale docking. The Bayesian potentials allow exploration of 100s and 1000s of protein complexes instead of one or two.

• Docking from independently crystallized subunits. We demonstrated excellent results on complexes reconstructed from ~500 independently crystallized subunits.

• Whole genome predictions. Developed technology has been used to predict 1000s of protein complexes on the genome-wide level in several organisms.

Best structure. The set with the largest common kernel always includes nearly the best native structure present in the ensemble of 10,000 folded structures.

Quality estimator. The size of the largest common kernel (obtained from the connectivity graph) provides an excellent estimate of how close the selected structure is to the native ones.

A. Gorin, Computer Science and Mathematics Division

Analysis of ultra-large structural ensembles


Predictive proteome analysis using advanced computation and physical modelsScientific challenge:Data sizes: 30,000 to 200,000 spectra from a single experiment to becompared with as many as millions of theoretical spectra

Computational tools identify only 1020%of spectra from proteomics analysis.

Fragmentation patterns are highly sequence-specific.

Statistical characterization of noise essential.

Identification rate increased by 23–40%.

NH3-HC-CO- NH-HC-CO- NH-HC-CO- NH-HC-COOHR R R R

20%@5% False Pos. Rate

Increase in identified peptides

40%@1% False Pos. Rate

W. R. Cannon, Computational Biology and Bioinformatics


In-depth analysis of MS proteomicsdata flowsMass-spectrometry-based proteomics is one of the richest sources of the biological information, but the data flows are enormous (100,000s of samples each consisting of 100,000s of spectra).

Computationally, both advanced graph algorithms and memory-based indexes (> 1 TB are required forin-depth analysis of the spectra).

Two principally new capabilities are demonstrated: • Quantity of identified peptides is 50% increased. • Among highly reliable identifications, increase is many-fold.

A. Gorin, Computer Science and Mathematics Division


ScalaBLAST

Standalone BLAST application would take >1 year for IMG (>1.6 million proteins) vs IMG and >3 years for IMG vs nr.

“PERL script” approach breaks even modest clusters becauseof poor memory management.

Scientific challenge:

Good scaling on both shared and distributed memory architectures

C. S. Oehmen, Computational Biology and Bioinformatics

120

0 150

Scal

ing

fact

or

Number of processors10050

0

40

80

MPP2ALTIXHTCBLAST

0 1000

Que

ries

per (

proc

*min

ute)

Number of processors10050

0

3

4

6

10000

5

2

1work factor

ideal

IMG 1.6

IMG 2.2

Run

-tim

e (h

ours

) 4035302520151050

0 1 2 3 4 5 6 7 8Trillions of pairwise comparisons


After: ~1GB

Before: ~2TB

pGraph: Parallel Graph Library for analysis of biological networksMajor features: Memory reduction by 1000 times Scalability to 100s of processors FPT-based search space reduction theory

Find cliques Merge cliques

2,109 vertices 16,169 edges

4123 modulesModule

851 “common-

target” associations

Association

pGraph

Nagiza Samatova, Computer Science and Mathematics Division


Discovering cell networks with structure and genome context

2007, Sep 727(5): 793

The conserved ASD domain links ~30% of ECF sigma and anti-sigma pairs in the signal transduction cascade, e.g., response to iron and 1O2 with FecIR, RpoE-ChrR proteins.

Many combinations of different sigma () and sensor (S) domains exist. Two levels of positional clustering, (1) domain and (2) gene neighbor, generate vast permutations between protein pairs.

The structure of the ChrR anti-sigma factor from Rhodobacter sphaeroides reveals domains which help explain the combinatoric complexity of gene regulation in response to environmental conditions. An advanced toolkit for graph, genome context, and visual analytics was used to study ECF sigma factor regulation.

FecI FecRRpoE ChrR

3D structure of ChrR anti-sigma regulator

ECF regulation

1 ASD

2

3

4

ASD

S4

S3

anti-

ASD

ASD

S2

S1

Sofia Heidi, Computational Biology and Bioinformatics


To model microbial communities, the smallest realistic model would include >109 cells x 100 reactions ~1011 reactions.

Currently can handle ~104–106 reactions on a single-processor machine.

Scientific challenge:

Stochastic simulations of biological networks

J. Phys. Chem. B 105, 11026, 2001

Speedup over exact Gillespie SSA ~25

Multinomial Tau-leaping method:

Speedups ~40–200

Probability-weighted dynamic Monte Carlo method…

H. Resat, Computational Biology and Bioinformatics

1 4Number of phosphorylation binding sites

32 5 876 9 1211100

250

500

750

1000

1250

1500

Ave

rage

runn

ing

time

(sec

onds

)


Contacts

Tjerk Straatsma Principal InvestigatorPacific Northwest National [email protected]

Nagiza Samatova Principle InvestigatorOak Ridge National [email protected]

Christine Chalk Program ManagerU.S. Department of EnergyOffice of ScienceAdvanced Scientific Computing Research, [email protected]


BioPilot: Data-Intensive Computing for Complex Biological Systems

Documents

Transcript of BioPilot: Data-Intensive Computing for Complex Biological Systems