N. Sukumar, Curt Breneman - Cheminformaticsreccr.chem.rpi.edu/Presentations/ACS_Boston_2007.pdf ·...

Bio- and chem-Informatics: Where do the twain meet? N. Sukumar, Curt Breneman, Kristin P. Bennett, Charles Bergeron, Mark J. Embrechts, Changjian Huang, Shekhar Garde, Rahul Godawat, Ishita Manjrekar, Theresa Hepburn, C. Matthew Sundling, Margaret McLellan, Michael Krein ACS, August 2007 http://reccr.chem.rpi.edu


Page 1: N. Sukumar, Curt Breneman - Cheminformaticsreccr.chem.rpi.edu/Presentations/ACS_Boston_2007.pdf · 2007. 11. 24. · testing of various algorithms via a process of blind prediction

Bio- and chem-Informatics: Where do the twain meet?

N. Sukumar, Curt Breneman,

Kristin P. Bennett, Charles Bergeron,

Mark J. Embrechts, Changjian Huang,

Shekhar Garde, Rahul Godawat, Ishita Manjrekar,

Theresa Hepburn, C. Matthew Sundling, Margaret McLellan, Michael Krein

ACS, August 2007 http://reccr.chem.rpi.edu

Page 2

Birth of Informatics

40,000 BC Experiment/Observation

~1700 AD Mathematical Theory

1950+ Computation

1970+ Simulation

1990+ Informatics/Data Mining

Page 3

Cheminformatics/Bioinformatics: Statement of the Problem

Experiment: Assay Screening or Gene Data (the more data the better)

Data

No Prior Hypothesis

Page 4

Cheminformatics & Bioinformatics: One science, two tongues

Data Representation Statistical Model Biological Activity

The Confusion of Tongues by Gustave Doré (1865). Engraving based on the Minaret of Samarra

[Figure: a small-molecule structure (atom labels N, N, Cl, O) shown beside a DNA sequence, AAACCTCATAGGAAGCATACCAGGAATTACATCA…]

Page 5

The vocabulary of Cheminformatics & Bioinformatics

The Tower of Babel by Pieter Brueghel the Elder (1563)

Data Representation Statistical Model Biological Activity

[Figure: a small-molecule structure (atom labels N, N, Cl, O) shown beside a DNA sequence, AAACCTCATAGGAAGCATACCAGGAATTACATCA…]

Page 6

The grammar of Cheminformatics & Bioinformatics

Data Representation Statistical Model Biological Activity

The Building of the Tower of Babel by Abel Grimmer (1570-1619)

[Figure: diagram annotated with Σ symbols]

Page 7

Cheminformatics at RECCR (funded under the Molecular Libraries Roadmap Initiative of NIH)

Model Building, Validation and Applicability Domains

Generic Data Mining Tools

Cheminformatics

Bioinformatics

Alignment-free Molecular Property Descriptors

Protein Chromatography Modeling

Drug Design and QSAR

Protein Kinetic Stability Prediction

Simulation-based Protein Affinity Descriptors

Director: Curt Breneman

Page 8

Descriptors

Structures → Descriptors → Model → Activity

Structural Descriptors

Physiochemical Descriptors

Topological Descriptors

Geometrical Descriptors

[Figure: a small-molecule structure (atom labels N, N, Cl, O) shown beside a DNA sequence, AAACCTCATAGGAAGCATACCAGGAATTACATCA…]

Page 9

Electron Density Derived Molecular Surface Properties

– Electrostatic Potential: $EP(r) = \sum_{\alpha} \frac{Z_{\alpha}}{|r - R_{\alpha}|} - \int \frac{\rho(r')\,dr'}{|r - r'|}$

– Electronic Kinetic Energy Density: $K(r) = -(\psi^{*} \nabla^{2} \psi + \psi \nabla^{2} \psi^{*})$, $G(r) = -\nabla \psi^{*} \cdot \nabla \psi$

– Electron Density Gradients: $\nabla\rho \cdot N$

– Kinetic Energy Gradients: $\nabla G \cdot N$, $\nabla K \cdot N$

– Laplacian of the Electron Density: $L(r) = -\nabla^{2} \rho(r) = K(r) - G(r)$

– Local Average Ionization Potential: $PIP(r) = \sum_{i} \frac{\rho_{i}(r)\,\varepsilon_{i}}{\rho(r)}$

– Bare Nuclear Potential (BNP): $BNP(r) = \sum_{\alpha} \frac{Z_{\alpha}}{|r - R_{\alpha}|}$

– Fukui function: $F^{-}(r) = \rho_{HOMO}(r)$
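Of the properties listed on this slide, the bare nuclear potential is the simplest to evaluate, since it needs only nuclear charges and positions. The following is a minimal sketch, assuming atomic units and NumPy; the function name `bare_nuclear_potential` and the two-nucleus test geometry are hypothetical and not taken from the RECON/TAE code.

```python
import numpy as np

def bare_nuclear_potential(points, nuclei_xyz, nuclear_charges):
    """Evaluate BNP(r) = sum_alpha Z_alpha / |r - R_alpha| at each point."""
    points = np.asarray(points, dtype=float)          # shape (n_points, 3)
    nuclei_xyz = np.asarray(nuclei_xyz, dtype=float)  # shape (n_nuclei, 3)
    z = np.asarray(nuclear_charges, dtype=float)      # shape (n_nuclei,)
    # Pairwise distances |r - R_alpha|, shape (n_points, n_nuclei)
    dist = np.linalg.norm(points[:, None, :] - nuclei_xyz[None, :, :], axis=-1)
    return (z[None, :] / dist).sum(axis=1)

# Hypothetical geometry: two nuclei with Z = 1, placed 2 bohr apart.
nuclei = [[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
charges = [1.0, 1.0]
# Two sample points standing in for points on a molecular surface.
surface_points = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
bnp = bare_nuclear_potential(surface_points, nuclei, charges)
```

The electrostatic potential EP(r) would add the electronic integral over ρ(r'), which requires a wavefunction or density and is outside this sketch.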

Page 10

RECON/TAE Descriptors in MOE

Page 11

Surface Property Distribution RECON/TAE Descriptors

Surface histograms represent electronic property distributions to provide data for descriptors

PIP (Local Ionization Potential) surface property for a member of the Lombardo blood-brain barrier dataset.
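One way to picture how a surface histogram becomes a set of descriptors is to bin the property values sampled on the surface and weight each sample by the surface area it represents. This is a hedged sketch only: the function name, the bin count, and the toy values and areas are assumptions, and the actual RECON/TAE binning scheme is not specified on the slide.

```python
import numpy as np

def surface_histogram_descriptors(values, areas, n_bins=10, value_range=None):
    """Area-weighted histogram of a surface property, normalised to fractions."""
    values = np.asarray(values, dtype=float)  # property value per surface element
    areas = np.asarray(areas, dtype=float)    # area represented by each element
    hist, edges = np.histogram(values, bins=n_bins, range=value_range, weights=areas)
    # Each descriptor is the fraction of total surface area falling in a bin.
    return hist / areas.sum(), edges

# Toy data: four surface elements with made-up property values and areas.
toy_values = [0.1, 0.2, 0.6, 0.9]
toy_areas = [1.0, 1.0, 2.0, 4.0]
descriptors, bin_edges = surface_histogram_descriptors(
    toy_values, toy_areas, n_bins=2, value_range=(0.0, 1.0)
)
```

With two bins over [0, 1], a quarter of the toy surface area falls in the lower bin and three quarters in the upper bin.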

Page 12

RAD: Recon Autocorrelation Descriptors

Topological RECON Autocorrelation Descriptors implemented by Bill Katt

Uses Integrated TAE Surface Properties

Function binned by distance $R_{xy}$ between atoms x and y:

$A(R_{xy}) = \frac{1}{n} \sum_{x,y} P_x \times P_y$

• Breneman, C.M., et al., New developments in PEST shape/property hybrid descriptors. J. Comput.-Aided Mol. Design, 17, 231-240, 2003.

• Wagener, M., J. Sadowski, and J. Gasteiger, J. Am. Chem. Soc., 117, 7769-7775, 1995.
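The autocorrelation function on this slide can be sketched in a few lines. The code below assumes that n is the number of atom pairs falling in a given distance bin and that each unordered pair is counted once; neither convention is stated on the slide. The distance matrix may hold topological (bond-count) or geometric distances, and all names are hypothetical.

```python
import numpy as np

def autocorrelation_descriptors(dist, props, bin_edges):
    """A(bin) = (1/n) * sum of P_x * P_y over atom pairs whose distance is in the bin."""
    dist = np.asarray(dist, dtype=float)   # (n_atoms, n_atoms) distance matrix
    p = np.asarray(props, dtype=float)     # one integrated property per atom
    ix, iy = np.triu_indices(len(p), k=1)  # each unordered pair (x, y) once
    pair_dist = dist[ix, iy]
    pair_prod = p[ix] * p[iy]
    # Bin index for each pair; bins are left-inclusive, right-exclusive.
    which = np.digitize(pair_dist, bin_edges) - 1
    n_bins = len(bin_edges) - 1
    out = np.zeros(n_bins)
    for b in range(n_bins):
        mask = which == b
        if mask.any():
            out[b] = pair_prod[mask].mean()  # (1/n) * sum over pairs in the bin
    return out

# Toy example: three atoms on a line with made-up property values.
coords = np.array([0.0, 1.0, 2.0])
toy_dist = np.abs(coords[:, None] - coords[None, :])
toy_props = [1.0, 2.0, 3.0]
rad = autocorrelation_descriptors(toy_dist, toy_props, bin_edges=[0.5, 1.5, 2.5])
```

In the toy case the two pairs at distance 1 give products 2 and 6 (mean 4), and the single pair at distance 2 gives 3.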

Page 13

CoEPrA competition http://www.coepra.org/

Basic information about the three datasets.

round   calibration samples   prediction samples   number of amino acids   number of COEPRA descriptors
1       89                    88                   9                       5787
2       76                    76                   8                       5144
3       133                   133                  9                       5787

The Comparative Evaluation of Prediction Algorithms (CoEPrA) competition was organized to provide objective testing of various algorithms via a process of blind prediction for chemical and biological data. Each round consisted of a training set and a test set of amino acid sequences (octa- or nonapeptides), with 643 COEPRA descriptors per residue, the nature of which is unknown. The activities are binding affinities to the HLA-A*0201 major histocompatibility complex.

C. Bergeron, T. Hepburn, C. M. Sundling, N. Sukumar, K. P. Bennett and C. M. Breneman, Protein and Peptide Letters (submitted)
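The descriptor counts per peptide follow from 643 descriptors per residue: 9 × 643 = 5787 for nonapeptides and 8 × 643 = 5144 for octapeptides. A small sketch of that concatenation is given below. The residue table is filled with random placeholders because the real COEPRA descriptors are not described, and the peptide strings are arbitrary examples rather than competition data.

```python
import numpy as np

N_DESCRIPTORS_PER_RESIDUE = 643  # stated on the slide
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder table: random numbers stand in for the actual COEPRA descriptors.
rng = np.random.default_rng(0)
residue_table = {
    aa: rng.normal(size=N_DESCRIPTORS_PER_RESIDUE) for aa in AMINO_ACIDS
}

def encode_peptide(sequence, table):
    """Concatenate the per-residue descriptor vectors in sequence order."""
    return np.concatenate([table[aa] for aa in sequence])

# Arbitrary example sequences (not taken from the competition datasets).
nonapeptide_vector = encode_peptide("LLFGYPVYV", residue_table)
octapeptide_vector = encode_peptide("LLFGYPVY", residue_table)
```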

Page 14

Results of CoEPrA competition

descriptor set    LOO CV calibration  prediction  LOO CV calibration  prediction  LOO CV calibration  prediction

COEPRA            0.6253              0.45525     0.72626             0.67786     0.72143             0.69124
MOE/RAD           0.26075             0.34384     0.40711             0.38614     0.42743             0.49457
SIMIL             0.5121              0.35226     0.5751              0.54868     0.583               0.61791
COEPRA+MOE/RAD    0.67972             0.46358     0.7415              0.66128     0.72444             0.69411
COEPRA+SIMIL      0.61979             0.45916     0.73493             0.66398     0.72071             0.69336
all               0.67466             0.46638     0.73861             0.66339     0.72661             0.69404

COEPRA            0.29793             0.40054     0.4984              0.74618     0.47026             0.58993
MOE/RAD           0.095377            0.14367     0.3228              0.54602     0.30109             0.44067
SIMIL             0.14204             0.19956     0.6134              0.42729     0.48197             0.51488
COEPRA+MOE/RAD    0.29279             0.40289     0.50207             0.78416     0.46442             0.59074
COEPRA+SIMIL      0.27913             0.41212     0.50504             0.75443     0.47468             0.59474
all               0.27472             0.41359     0.5091              0.78195     0.46891             0.59566

COEPRA            0.30222             0.15262     0.3544              0.19975     0.37345             0.21932
MOE/RAD           0.16246             -0.13471    0.1037              0.035348    0.17685             0.19992
SIMIL             0.23704             0.0321      0.33496             0.11797     0.32633             0.16893
COEPRA+MOE/RAD    0.30319             0.17773     0.3541              0.2115      0.37538             0.24234
COEPRA+SIMIL      0.30468             0.14853     0.35641             0.19684     0.37588             0.21944
all               0.30459             0.17254     0.35553             0.2078      0.37697             0.24051

Rounds: round 1, round 2, round 3

Methods: exponential KPLS, PLS, Gaussian KPLS
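The "LOO CV calibration" columns refer to leave-one-out cross-validation on the calibration set. The sketch below shows the general procedure only. It substitutes plain ridge regression for the PLS and kernel PLS models used in the competition entries, and it reports a q² statistic as an assumed score, since the slide does not name the metric. The data are synthetic.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, lam):
    """Fit centred ridge regression on the training data and predict X_test."""
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc = X_train - x_mean
    gram = Xc.T @ Xc + lam * np.eye(Xc.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y_train - y_mean))
    return (X_test - x_mean) @ w + y_mean

def loo_q2(X, y, lam=1e-3):
    """Leave-one-out q^2 = 1 - PRESS / total sum of squares."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # hold out sample i, train on the rest
        preds[i] = ridge_fit_predict(X[keep], y[keep], X[i:i + 1], lam)[0]
    press = np.sum((y - preds) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Synthetic, noise-free linear data (not CoEPrA data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.3
q2_demo = loo_q2(X_demo, y_demo, lam=1e-8)
```

On noise-free linear data with negligible regularisation the held-out predictions are nearly exact, so q² is close to 1; real descriptor sets give lower values.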

Page 15

ROMS implemented by Changjian Huang and Mark J. Embrechts

Page 16
Page 17

Protein RECON

Page 18
Page 19
Page 20

1CGP – DNA Complex

Page 21

Representations of DNA Structure: Can we improve on ATCG?

• Promoter regions and transcription factor binding sites require specific identification

• Most successful methods represent DNA by sequence of letters

• DNA bases assumed to act independently

• Several higher order multibase models exist

• Sequence data for training/validation is usually limited

• Representation of DNA has little to do with the energetics of binding of protein to DNA

Page 22

“Dixel” DNA descriptors

• A “basis set” of all possible nucleotide base pairs with all possible neighbors results in a set of base pair “triplets”.

• Ab initio properties of each base pair and its two flanking base pairs (end capped) are computed.

• The central base pair is encoded and stored as a RECON “Dixel” object.
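The size of such a triplet basis set can be checked by enumeration. The sketch below assumes Watson-Crick pairing, so that each base-pair triplet is fixed by the three bases on one strand. Whether the authors treat a triplet and its reverse complement as the same object is not stated on the slide, so both counts are computed; the names are hypothetical.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Sequence of the opposite strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

# Every central base with every possible pair of flanking neighbours.
triplets = ["".join(t) for t in product("ACGT", repeat=3)]

# Assumption: a duplex and its reverse complement are the same physical object,
# so keep one canonical representative of each pair.
unique_duplexes = {min(t, reverse_complement(t)) for t in triplets}

n_strand_triplets = len(triplets)          # 4 ** 3
n_unique_duplexes = len(unique_duplexes)   # half of that; no triplet is self-complementary
```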

Page 23

Base pair properties perturbed by flanking base pairs – raw “Dixel” data

Page 24

Dixels for EP Dixels for PIP

Page 25

CRP Dixel Model and Conserved base-pairs

Descriptor Importance by Property

[Figure: descriptor importance (vertical axis 0 to 4) plotted against horizontal axis values 1 to 22, by property: BNP, DGN, DRN, EP, G, LAPL, PIP]

Page 26
Page 27
Page 28

PEST: Molecular Shape/Property Hybrid Encoding

• Curt M. Breneman, C. Matthew Sundling, N. Sukumar, Lingling Shen, William P. Katt and Mark J. Embrechts, “New developments in PEST shape/property hybrid descriptors” J. Computer-Aided Mol. Design, 17, 231–240, (2003)

• Karthigeyan Nagarajan, Randy Zauhar, and William J. Welsh, “Enrichment of Ligands for the Serotonin Receptor Using the Shape Signatures Approach” J. Chem. Inf. Model., 45, 49-57 (2005)

• A TAE property-encoded surface is subjected to internal ray reflection analysis.

• A ray is initialized with a random location and direction within the molecular surface and reflected repeatedly inside the electron density isosurface until the molecular surface is adequately sampled.

• Molecular shape information is obtained by recording the ray-path information, including segment lengths, reflection angles and property values at each point of incidence.

• Adds shape information that encodes the spatial relationships of surface properties

• Alignment-free
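The ray reflection procedure can be illustrated with a deliberately simplified stand-in surface. The sketch below traces a ray inside a unit sphere rather than a real electron density isosurface, because the sphere makes the ray-surface intersection analytic. The property function is a made-up placeholder for a TAE surface property, and the 2D length/property histogram is an assumed form of the final descriptor.

```python
import numpy as np

def trace_ray_in_unit_sphere(n_reflections, property_fn, seed=0):
    """Reflect one ray inside a unit sphere, recording path data at each hit."""
    rng = np.random.default_rng(seed)
    # Random start point strictly inside the sphere (radius below 0.9).
    pos = rng.normal(size=3)
    pos *= rng.uniform(0.0, 0.9) / np.linalg.norm(pos)
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)

    lengths, angles, props = [], [], []
    for _ in range(n_reflections):
        # Solve |pos + t * direction| = 1 for the positive root t.
        b = pos @ direction
        c = pos @ pos - 1.0
        t = -b + np.sqrt(max(b * b - c, 0.0))
        hit = pos + t * direction
        hit /= np.linalg.norm(hit)  # remove round-off drift from the surface
        normal = hit                # outward normal of a unit sphere
        cos_inc = float(np.clip(direction @ normal, -1.0, 1.0))
        lengths.append(t)                  # segment length
        angles.append(np.arccos(cos_inc))  # angle of incidence
        props.append(property_fn(hit))     # property at the point of incidence
        # Specular reflection: d' = d - 2 (d . n) n
        direction = direction - 2.0 * cos_inc * normal
        direction /= np.linalg.norm(direction)
        pos = hit
    return np.array(lengths), np.array(angles), np.array(props)

# Placeholder property: the z-coordinate of the hit point.
seg_len, inc_angle, surf_prop = trace_ray_in_unit_sphere(50, lambda p: p[2])
# Joint distribution of segment length and property value (assumed descriptor form).
shape_hist, _, _ = np.histogram2d(seg_len, surf_prop, bins=(4, 4))
```

A sphere is a degenerate test case: after the first hit every chord has the same length, 2 cos θ for incidence angle θ. An irregular molecular surface gives a spread of segment lengths, and that spread is what carries the shape information.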

Page 29

Regional Shape/Property Surface Encoding in PEST Analysis

Page 30

Molecular Surface analysis using PEST rays

Property-Encoded Electron Density-derived Surface

EP(2,5) EP(6,1)

Page 31

PPEST Protein Shape/Property Descriptors

PPEST implemented by Qiong Luo and Matt Sundling

Page 32

1POC (Bee-venom Phospholipase A2): pH Sensitive EP Encoding by Matt Sundling

[Figure panels: 1POC EP at pH 3.0, 4.0, 5.0, 6.0, 7.0 and 8.0]

Page 33

1POC LENGTH/EP Protein PEST

[Figure panels: 1POC EP at pH 4.0, 5.0, 6.0, 7.0 and 8.0]

Page 34

Role of water in proteins

• Liquid water constitutes one of the essential components of biological systems and it is difficult to overstate the role of water in biological structure and function.

• Proteins crystallize with several units of water weakly bound to the rest of the protein.

• Water provides the thermodynamic driving force for proteins to fold and self-assemble.

• It mediates not only tertiary and quaternary interactions, but also interactions between different biomolecules and between biomolecules and ligands or surfaces.

• Water is also known to take part in specific enzymatic reactions.

• Protein conformational dynamics appear to be linked (slaved) to the dynamics of vicinal water, thereby affecting protein function.

• Water in the vicinity of proteins and other biomolecules critically influences protein structure, dynamics, function and other thermodynamic and kinetic properties.

Page 35

Simulation-derived Hydration-based descriptors

Statistical analysis of the dynamics of water distributions solvating proteins is used to create a set of regional property descriptors:

[Figure panels: Water O fluctuation; Structure; Electron density]

• average local water density,
• water density fluctuations,
• local water orientations,
• electron density profile due to water packing and orientations (polarization),
• electrostatic potential on protein surface induced by the vicinal water structuring,
• dynamics of local water.
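The first two descriptors in this list, average local water density and its fluctuations, reduce to simple statistics over simulation frames. The sketch below assumes the trajectory has already been reduced to per-frame counts of water oxygens in small cells near the protein surface. The cell volume, the counts, and the use of the count variance as the fluctuation measure are all illustrative assumptions.

```python
import numpy as np

def water_density_stats(counts, cell_volume_nm3):
    """Mean number density and count variance of water O atoms per cell."""
    counts = np.asarray(counts, dtype=float)  # shape (n_frames, n_cells)
    mean_density = counts.mean(axis=0) / cell_volume_nm3  # molecules per nm^3
    fluctuation = counts.var(axis=0)  # variance of the per-frame count
    return mean_density, fluctuation

# Made-up counts for two cells over four frames.
toy_counts = [[1, 0],
              [3, 0],
              [2, 1],
              [2, 1]]
mean_rho, fluct = water_density_stats(toy_counts, cell_volume_nm3=0.1)
```

Orientation and dynamics descriptors would need per-molecule dipole vectors and time correlation functions, which this sketch does not cover.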

Page 36

First hydration shell properties of CspB protein

Protein amino acids (green = hydrophobic, blue = positively charged, red = negatively charged)

[Figure panels: local water-O density; local water-H density; local electron density, projected onto the triangulated protein surface]

Hydration-based descriptors developed and implemented by Shekhar Garde, Rahul Godawat and Ishita Manjrekar

Page 37

PMF expansion based method

• Developing an efficient alternative to full simulations by means of a potentials-of-mean-force expansion

• Employing a library of lower-order correlation functions derived from explicit simulations to predict the average equilibrium density and the orientation profile of water in the space surrounding biomolecules or ligands.

Water density values in space surrounding an alpha-helix (left) and a protein X (right) predicted using the PMF expansion (cyan) and obtained from exact simulation (magenta)
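As a rough illustration of how a library of low-order correlation functions can predict water density, the sketch below keeps only the pair term: the potential of mean force at a point is taken as a sum of site contributions, which is equivalent to multiplying the bulk density by the water-site pair correlation functions. Truncating at the pair level, using a single toy g(r) for all sites, and the function shapes and parameters are assumptions made here; the method on the slide draws on a simulation-derived library and may include higher-order terms.

```python
import numpy as np

def toy_pair_correlation(r, sigma=0.3):
    """Hypothetical water-site g(r): excluded core, one hydration peak, then 1."""
    r = np.asarray(r, dtype=float)
    peak = 1.0 + 1.5 * np.exp(-((r - 1.1 * sigma) / (0.15 * sigma)) ** 2)
    return np.where(r < 0.9 * sigma, 0.0, peak)

def predict_water_density(grid_points, site_xyz, g_of_r, bulk_density=33.4):
    """Pair-level estimate: rho(r) = rho_bulk * product over sites of g(|r - r_i|)."""
    grid_points = np.asarray(grid_points, dtype=float)  # (n_points, 3), in nm
    site_xyz = np.asarray(site_xyz, dtype=float)        # (n_sites, 3), in nm
    dist = np.linalg.norm(grid_points[:, None, :] - site_xyz[None, :, :], axis=-1)
    return bulk_density * np.prod(g_of_r(dist), axis=1)

# One site at the origin; three probe points along x (distances in nm).
sites = [[0.0, 0.0, 0.0]]
probes = [[0.1, 0.0, 0.0], [0.33, 0.0, 0.0], [2.0, 0.0, 0.0]]
rho = predict_water_density(probes, sites, toy_pair_correlation)
```

The bulk value of 33.4 molecules per nm³ corresponds to liquid water at ambient conditions. The three probe points fall inside the excluded core, on the hydration peak, and in the bulk region respectively.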

Page 38

Eleven different canonical sites that represent protein atoms in the PMF-library, obtained from clustering of AMBER force-field data

Atom Type   Charge     Size (nm)   Epsilon (kJ/mol)   % in typical protein
CT          -0.29039   0.34        0.45773            ~ 9%
CT          -0.01545   0.34        0.45773            ~ 25%
CT           0.27946   0.34        0.45773            ~ 4%
C            0.59735   0.34        0.35982            ~ 15%
C1          -0.14331   0.34        0.35982            ~ 7%
OH          -0.66428   0.307       0.88031            ~ 2%
O           -0.57349   0.296       0.87864            ~ 15%
O2          -0.81021   0.296       0.87864            ~ 3%
N1          -0.41573   0.325       0.71128            ~ 15%
N2          -0.89731   0.325       0.71128            ~ 2%
S           -0.16423   0.356       1.04600            ~ 0.6%

Page 39

http://reccr.chem.rpi.edu

RECCR is funded under the Molecular Libraries Roadmap Initiative of NIH (# 1P20HG003899-01 of 09-23-2005)

Thank you!