N. Sukumar, Curt Breneman - Cheminformaticsreccr.chem.rpi.edu/Presentations/ACS_Boston_2007.pdf ·...
Bio- and chem-Informatics: Where do the twain meet?
N. Sukumar, Curt Breneman,
Kristin P. Bennett, Charles Bergeron,
Mark J. Embrechts, Changjian Huang,
Shekhar Garde, Rahul Godawat, Ishita Manjrekar,
Theresa Hepburn, C. Matthew Sundling, Margaret McLellan, Michael Krein
ACS, August 2007 http://reccr.chem.rpi.edu
Birth of Informatics
40,000 BC Experiment/Observation
~1700 AD Mathematical Theory
1950+ Computation
1970+ Simulation
1990+ Informatics/Data Mining
Cheminformatics/Bioinformatics: Statement of the Problem
Experiment: assay screening or gene data (the more data the better)
Data: no prior hypothesis
Data Representation → Statistical Model → Biological Activity
Cheminformatics & Bioinformatics: One science, two tongues
The Confusion of Tongues by Gustave Doré (1865)
Engraving based on the Minaret of Samarra
[Figure: small-molecule structure and DNA sequence AAACCTCATAGGAAGCATACCAGGAATTACATCA…]
The vocabulary of Cheminformatics & Bioinformatics
The Tower of Babel by Pieter Brueghel the Elder (1563)
Data Representation → Statistical Model → Biological Activity
[Figure: small-molecule structure and DNA sequence AAACCTCATAGGAAGCATACCAGGAATTACATCA…]
The grammar of Cheminformatics & Bioinformatics
Data Representation → Statistical Model → Biological Activity
The Building of the Tower of Babel by Abel Grimmer (1570-1619)
Model Building, Validation and Applicability Domains
Generic Data Mining Tools
Cheminformatics / Bioinformatics
Alignment-free Molecular Property Descriptors
Protein Chromatography Modeling
Drug Design and QSAR
Protein Kinetic Stability Prediction
Simulation-based Protein Affinity Descriptors
Cheminformatics at RECCR (funded under the Molecular Libraries Roadmap Initiative of NIH)
Director: Curt Breneman
Structures → Descriptors → Model → Activity
Structural Descriptors
Physicochemical Descriptors
Topological Descriptors
Geometrical Descriptors
Descriptors
[Figure: small-molecule structure and DNA sequence AAACCTCATAGGAAGCATACCAGGAATTACATCA…]
Electron Density Derived Molecular Surface Properties
– Electrostatic Potential
– Electronic Kinetic Energy Density
– Electron Density Gradients ∇ρ•N
– Kinetic Energy Gradients ∇G•N, ∇K•N
– Laplacian of the Electron Density
– Local Average Ionization Potential
– Bare Nuclear Potential (BNP)
– Fukui function $F^{-}(r) = \rho_{\mathrm{HOMO}}(r)$
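$K(r) = -(\psi^{*}\nabla^{2}\psi + \psi\nabla^{2}\psi^{*})$
$G(r) = -\nabla\psi^{*}\cdot\nabla\psi$
$EP(r) = \sum_{\alpha}\frac{Z_{\alpha}}{|r - R_{\alpha}|} - \int\frac{\rho(r')\,dr'}{|r - r'|}$
$L(r) = -\nabla^{2}\rho(r) = K(r) - G(r)$
$PIP(r) = \sum_{i}\frac{\rho_{i}(r)\,|\varepsilon_{i}|}{\rho(r)}$
$BNP(r) = \sum_{\alpha}\frac{Z_{\alpha}}{|r - R_{\alpha}|}$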
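As a rough illustration of two of the surface properties defined above, the sketch below evaluates the bare nuclear potential as a sum of Z/|r − R| terms and the local ionization potential as a density-weighted average of orbital energy magnitudes. The function names, the toy geometry, the use of atomic units, and the assumption that the total density is the sum of the listed orbital densities are all illustrative choices, not part of the RECON/TAE implementation.

```python
import math

def bare_nuclear_potential(point, nuclei):
    """Sum Z_alpha / |r - R_alpha| over all nuclei (atomic units assumed)."""
    total = 0.0
    for charge, position in nuclei:
        distance = math.dist(point, position)
        if distance == 0.0:
            raise ValueError("point coincides with a nucleus")
        total += charge / distance
    return total

def local_ionization_potential(orbital_densities, orbital_energies):
    """Density-weighted mean of |epsilon_i| at one point.

    Assumes the total density at the point is the sum of the
    orbital densities passed in (occupied orbitals only).
    """
    total_density = sum(orbital_densities)
    weighted = sum(rho * abs(eps)
                   for rho, eps in zip(orbital_densities, orbital_energies))
    return weighted / total_density

# Hypothetical diatomic: two unit nuclear charges 1.4 bohr apart.
toy_nuclei = [(1.0, (0.0, 0.0, 0.0)), (1.0, (0.0, 0.0, 1.4))]
midpoint_bnp = bare_nuclear_potential((0.0, 0.0, 0.7), toy_nuclei)
```

In practice these quantities would be evaluated at many points on the molecular surface rather than at a single point.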
RECON/TAE Descriptors in MOE
Surface Property Distribution RECON/TAE Descriptors
Surface histograms represent electronic property distributions to provide data for descriptors
PIP (Local Ionization Potential) surface property for a member of the Lombardo blood-brain barrier dataset.
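One way to picture how a surface histogram becomes a descriptor vector is sketched below: each surface element contributes its area to the bin that contains its property value, and the bin totals are normalized to area fractions. The area weighting, the clamping of out-of-range values, and the name `surface_histogram` are assumptions for illustration; the slides do not specify how RECON/TAE bins its surfaces.

```python
def surface_histogram(values, areas, lo, hi, n_bins):
    """Area-weighted histogram: fraction of surface area in each property bin."""
    if hi <= lo or n_bins < 1:
        raise ValueError("need hi > lo and at least one bin")
    width = (hi - lo) / n_bins
    bins = [0.0] * n_bins
    for value, area in zip(values, areas):
        index = int((value - lo) / width)
        # Clamp values outside [lo, hi) into the first or last bin.
        index = min(max(index, 0), n_bins - 1)
        bins[index] += area
    total_area = sum(areas)
    return [b / total_area for b in bins]

# Hypothetical surface with four elements and two property bins.
descriptor = surface_histogram(
    values=[0.1, 0.4, 0.6, 0.9], areas=[1.0, 1.0, 2.0, 4.0],
    lo=0.0, hi=1.0, n_bins=2)
```

Each bin fraction can then serve as one descriptor for the molecule.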
• Breneman, C.M., et al., New developments in PEST shape/property hybrid descriptors. J. Comput.-Aided Mol. Design, 17, 231-240, 2003.
• Wagener, M., J. Sadowski, and J. Gasteiger, J. Am. Chem. Soc., 117, 7769-7775, 1995.
Topological RECON Autocorrelation Descriptors implemented by Bill Katt
RAD: Recon Autocorrelation Descriptors
Uses Integrated TAE Surface Properties
Function binned by distance Rxy between atoms x and y.
$A(R_{xy}) = \frac{1}{n}\sum_{x,y} P_{x} \times P_{y}$
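The autocorrelation function can be sketched as follows: atom pairs are grouped by distance bin, and the products of their integrated surface properties are averaged within each bin. The slide does not define n; here it is assumed to be the number of atom pairs falling in the bin. The distance matrix is taken as given (bond counts for a topological variant), and the function name is hypothetical.

```python
from collections import defaultdict

def autocorrelation(properties, distances, bin_width=1.0):
    """Mean of P_x * P_y over atom pairs, grouped by distance bin.

    properties: one integrated surface property value per atom.
    distances:  square matrix of pairwise distances R_xy.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    n_atoms = len(properties)
    for x in range(n_atoms):
        for y in range(x + 1, n_atoms):
            bin_index = int(distances[x][y] // bin_width)
            sums[bin_index] += properties[x] * properties[y]
            counts[bin_index] += 1
    # Assumption: n is the number of pairs in the bin.
    return {b: sums[b] / counts[b] for b in sorted(sums)}

# Hypothetical three-atom chain with bond-count distances.
chain_properties = [1.0, 2.0, 3.0]
chain_distances = [[0, 1, 2],
                   [1, 0, 1],
                   [2, 1, 0]]
rad = autocorrelation(chain_properties, chain_distances)
```

For the toy chain, the two bonded pairs share one bin and the single end-to-end pair occupies the next.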
CoEPrA competition
http://www.coepra.org/
Basic information about the three datasets.
round   samples (calibration)   samples (prediction)   amino acids   COEPRA descriptors
1       89                      88                     9             5787
2       76                      76                     8             5144
3       133                     133                    9             5787
CoEPrA (Comparative Evaluation of Prediction Algorithms) is a competition organized to provide objective testing of various algorithms via blind prediction on chemical and biological data. Each round consisted of a training set and a test set of amino acid sequences (octa- or nonapeptides), with 643 COEPRA descriptors per residue, the nature of which is unknown. The activities are binding affinities to the HLA-A*0201 major histocompatibility complex.
C. Bergeron, T. Hepburn, C. M. Sundling, N. Sukumar, K. P. Bennett and C. M. Breneman, Protein and Peptide Letters (submitted)
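The per-residue descriptor count determines the total descriptor count per peptide, which can be checked against the dataset totals of 5787 and 5144 reported for the three rounds:

```latex
\[
643 \times 9 = 5787 \quad \text{(nonapeptides)}, \qquad
643 \times 8 = 5144 \quad \text{(octapeptides)}
\]
```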
Results of CoEPrA competition
Methods: exponential KPLS, PLS, Gaussian KPLS
Each method has two columns: LOO CV calibration, then prediction.

round 1
descriptors       LOO CV cal.   prediction   LOO CV cal.   prediction   LOO CV cal.   prediction
COEPRA            0.6253        0.45525      0.72626       0.67786      0.72143       0.69124
MOE/RAD           0.26075       0.34384      0.40711       0.38614      0.42743       0.49457
SIMIL             0.5121        0.35226      0.5751        0.54868      0.583         0.61791
COEPRA+MOE/RAD    0.67972       0.46358      0.7415        0.66128      0.72444       0.69411
COEPRA+SIMIL      0.61979       0.45916      0.73493       0.66398      0.72071       0.69336
all               0.67466       0.46638      0.73861       0.66339      0.72661       0.69404

round 2
descriptors       LOO CV cal.   prediction   LOO CV cal.   prediction   LOO CV cal.   prediction
COEPRA            0.29793       0.40054      0.4984        0.74618      0.47026       0.58993
MOE/RAD           0.095377      0.14367      0.3228        0.54602      0.30109       0.44067
SIMIL             0.14204       0.19956      0.6134        0.42729      0.48197       0.51488
COEPRA+MOE/RAD    0.29279       0.40289      0.50207       0.78416      0.46442       0.59074
COEPRA+SIMIL      0.27913       0.41212      0.50504       0.75443      0.47468       0.59474
all               0.27472       0.41359      0.5091        0.78195      0.46891       0.59566

round 3
descriptors       LOO CV cal.   prediction   LOO CV cal.   prediction   LOO CV cal.   prediction
COEPRA            0.30222       0.15262      0.3544        0.19975      0.37345       0.21932
MOE/RAD           0.16246       -0.13471     0.1037        0.035348     0.17685       0.19992
SIMIL             0.23704       0.0321       0.33496       0.11797      0.32633       0.16893
COEPRA+MOE/RAD    0.30319       0.17773      0.3541        0.2115       0.37538       0.24234
COEPRA+SIMIL      0.30468       0.14853      0.35641       0.19684      0.37588       0.21944
all               0.30459       0.17254      0.35553       0.2078       0.37697       0.24051
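The slides do not name the statistic reported in the LOO CV columns. As one common choice, a leave-one-out cross-validated q² can be computed as sketched below: each sample is predicted by a model trained on all the others, and the summed squared prediction errors are compared with the variance of the data. The one-variable least-squares line is only a stand-in for the PLS and kernel PLS models, chosen to keep the sketch self-contained; all names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def loo_q2(xs, ys):
    """q^2 = 1 - PRESS / SS_total, with each point predicted after leaving it out."""
    press = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total

# Hypothetical data: exactly linear, then with noise added.
exact_q2 = loo_q2([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
noisy_q2 = loo_q2([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 3.5, 4.5, 7.5, 8.5])
```

A statistic of this form becomes negative when the left-out predictions are worse than simply predicting the mean.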
ROMS implemented by Changjian Huang and Mark J. Embrechts
Protein RECON
1CGP – DNA Complex
Representations of DNA Structure: Can we improve on ATCG?
• Promoter regions and transcription factor binding sites require specific identification
• Most successful methods represent DNA by sequence of letters
• DNA bases assumed to act independently
• Several higher order multibase models exist
• Sequence data for training/validation is usually limited
• Representation of DNA has little to do with the energetics of binding of protein to DNA
“Dixel” DNA descriptors
• A “basis set” of all possible nucleotide base pairs with all possible neighbors results in a set of base pair “triplets”.
• Ab initio properties of each base pair and its two flanking base pairs (end capped) are computed.
• The central base pair is encoded and stored as a RECON “Dixel” object.
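The size of the triplet basis set can be illustrated with a short enumeration. Reading one strand 5'→3' and taking the partner strand as the Watson-Crick complement, there are 4³ = 64 triplets; if a triplet and its reverse complement are treated as the same duplex, 32 remain. Whether the Dixel library collapses strand-symmetric triplets is not stated in the slides, so both counts are shown; the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Sequence of the partner strand, read 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def base_pair_triplets():
    """All trinucleotides on one strand; the partner strand is implied."""
    return ["".join(t) for t in product("ACGT", repeat=3)]

def unique_triplets():
    """Collapse each triplet with its reverse complement (assumed equivalent)."""
    return sorted({min(t, reverse_complement(t)) for t in base_pair_triplets()})

def triplets_in(sequence):
    """Central base with both flanking bases, for every interior position."""
    return [sequence[i - 1:i + 2] for i in range(1, len(sequence) - 1)]

all_triplets = base_pair_triplets()
collapsed = unique_triplets()
windows = triplets_in("AAACC")
```

Sliding the three-base window along a sequence gives the lookup key for each interior base pair.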
Base pair properties perturbed by flanking base pairs – raw “Dixel” data
[Figure panels: Dixels for EP; Dixels for PIP]
CRP Dixel Model and Conserved Base Pairs: Descriptor Importance by Property
[Chart: vertical axis 0 to 4, horizontal axis 1 to 22; series by property: BNP, DGN, DRN, EP, G, LAPL, PIP]
PEST: Molecular Shape/Property Hybrid Encoding
• Curt M. Breneman, C. Matthew Sundling, N. Sukumar, Lingling Shen, William P. Katt and Mark J. Embrechts, “New developments in PEST shape/property hybrid descriptors” J. Computer-Aided Mol. Design, 17, 231–240, (2003)
• Karthigeyan Nagarajan, Randy Zauhar, and William J. Welsh, “Enrichment of Ligands for the Serotonin Receptor Using the Shape Signatures Approach” J. Chem. Inf. Model., 45, 49-57 (2005)
• A TAE property-encoded surface is subjected to internal ray reflection analysis.
• A ray is initialized with a random location and direction within the molecular surface and reflected inside the electron density isosurface until the molecular surface is adequately sampled.
• Molecular shape information is obtained by recording the ray-path information, including segment lengths, reflection angles and property values at each point of incidence.
• Adds shape information that encodes the spatial relationships of surface properties.
• Alignment-free.
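The ray-reflection procedure above can be sketched on a toy surface. Here a sphere centred at the origin stands in for the electron density isosurface, and `property_at` is a hypothetical surface property; the real PEST method traces rays inside a triangulated molecular surface, which this sketch does not attempt. Each record holds the segment length, the incidence angle, the hit point, and the property value there.

```python
import math
import random

def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def trace_ray(n_segments, radius=1.0, seed=0,
              property_at=lambda point: point[2]):
    """Reflect one ray inside a sphere; record each segment and hit."""
    rng = random.Random(seed)
    # Random start point strictly inside the sphere (rejection sampling).
    while True:
        p = tuple(rng.uniform(-radius, radius) for _ in range(3))
        if sum(c * c for c in p) < radius * radius:
            break
    d = normalize(tuple(rng.gauss(0.0, 1.0) for _ in range(3)))
    records = []
    for _ in range(n_segments):
        # Positive root t of |p + t d|^2 = radius^2.
        pd = sum(a * b for a, b in zip(p, d))
        pp = sum(a * a for a in p)
        t = -pd + math.sqrt(max(0.0, pd * pd - pp + radius * radius))
        hit = tuple(a + t * b for a, b in zip(p, d))
        normal = normalize(hit)  # outward normal of an origin-centred sphere
        cos_inc = sum(a * b for a, b in zip(d, normal))
        angle = math.acos(max(-1.0, min(1.0, cos_inc)))
        records.append((t, angle, hit, property_at(hit)))
        # Specular reflection: d' = d - 2 (d . n) n
        d = normalize(tuple(a - 2.0 * cos_inc * b for a, b in zip(d, normal)))
        p = hit
    return records

ray_records = trace_ray(5)
```

Binning the recorded segment lengths against the property values at each hit would give a two-dimensional shape/property histogram in the spirit of the LENGTH/EP encodings.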
Regional Shape/Property Surface Encoding in PEST Analysis
Molecular surface analysis using PEST rays
Property-Encoded Electron Density-derived Surface
EP(2,5) EP(6,1)
PPEST Protein Shape/Property Descriptors
PPEST implemented by Qiong Luo and Matt Sundling
[Figure panels: 1POC EP at pH 3.0, 4.0, 5.0, 6.0, 7.0 and 8.0]
1POC (Bee-venom Phospholipase A2)
pH Sensitive EP Encoding by Matt Sundling
1POC LENGTH/EP Protein PEST
[Figure panels: 1POC EP at pH 4.0, 5.0, 6.0, 7.0 and 8.0]
Role of water in proteins
• Liquid water constitutes one of the essential components of biological systems and it is difficult to overstate the role of water in biological structure and function.
• Proteins crystallize with several units of water weakly bound to the rest of the protein.
• Water provides the thermodynamic driving force for proteins to fold and self-assemble.
• It mediates not only tertiary and quaternary interactions, but also interactions between different biomolecules and between biomolecules and ligands or surfaces.
• Water is also known to take part in specific enzymatic reactions.
• Protein conformational dynamics appear to be linked (slaved) to the dynamics of vicinal water, thereby affecting protein function.
• Water in the vicinity of proteins and other biomolecules critically influences protein structure, dynamics, function and other thermodynamic and kinetic properties.
Simulation-derived Hydration-based descriptors
Statistical analysis of the dynamics of water distributions solvating proteins used to create a set of regional property descriptors:
• average local water density,
• water density fluctuations,
• local water orientations,
• electron density profile due to water packing and orientations (polarization),
• electrostatic potential on protein surface induced by the vicinal water structuring,
• dynamics of local water.
[Figure panels: Water O fluctuation; Structure; Electron density]
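The first two descriptors in the list can be sketched from per-frame water counts in one region near the protein surface. The region definition, the volume, and the use of the population variance as the fluctuation measure are assumptions for illustration; the slides do not give the exact definitions used.

```python
from statistics import mean, pvariance

def water_density_descriptors(counts_per_frame, region_volume_nm3):
    """Average local water density and its fluctuation for one surface region.

    counts_per_frame:   number of water oxygens found in the region
                        in each trajectory frame.
    region_volume_nm3:  volume of the region in cubic nanometres.
    """
    densities = [count / region_volume_nm3 for count in counts_per_frame]
    return {
        "mean_density": mean(densities),
        # Assumption: fluctuation measured as variance over frames.
        "density_fluctuation": pvariance(densities),
    }

# Hypothetical region of 0.5 nm^3 observed over four frames.
region = water_density_descriptors([3, 4, 5, 4], 0.5)
```

Repeating the calculation for every region of the triangulated protein surface would give one pair of descriptors per region.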
First hydration shell properties of CspB protein
Protein amino acids (green = hydrophobic, blue = positively charged, red = negatively charged)
[Figure panels: local water-O density; local water-H density; local electron density projected onto the triangulated protein surface]
Hydration-based descriptors developed and implemented by Shekhar Garde, Rahul Godawat and Ishita Manjrekar
PMF expansion based method
• Developing an efficient alternative to full simulations by means of a potentials-of-mean-force expansion
• employing a library of lower-order correlation functions derived from explicit simulations to predict the average equilibrium density and the orientation profile of water in the space surrounding biomolecules or ligands.
Water density values in space surrounding an alpha-helix (left) and a protein X (right) predicted using the PMF expansion (cyan) and obtained from exact simulation (magenta)
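At the lowest order, a PMF expansion approximates the potential of mean force on a water molecule as a sum of pair terms to each solute site, which amounts to multiplying the bulk density by one pair correlation function per site. The sketch below applies only this pair-level truncation; the library described in the slides may include higher-order terms and orientation information that are not shown. The toy `toy_pair_correlation`, its parameters, and the site coordinates are hypothetical, not values from the PMF library.

```python
import math

# Approximate bulk number density of liquid water, molecules per nm^3.
BULK_DENSITY = 33.4

def toy_pair_correlation(distance_nm, contact_nm=0.3):
    """Hypothetical g(r): excluded core, first-shell peak, decay to 1."""
    if distance_nm < contact_nm:
        return 0.0
    return 1.0 + 0.5 * math.exp(-(distance_nm - contact_nm) / 0.1)

def predicted_density(point, sites, pair_correlation=toy_pair_correlation):
    """Pair-level estimate: rho(r) ~ rho_bulk * product_i g(|r - r_i|)."""
    density = BULK_DENSITY
    for site in sites:
        density *= pair_correlation(math.dist(point, site))
    return density

# Hypothetical single solute site at the origin (coordinates in nm).
one_site = [(0.0, 0.0, 0.0)]
far_density = predicted_density((10.0, 0.0, 0.0), one_site)
core_density = predicted_density((0.1, 0.0, 0.0), one_site)
```

With a library of tabulated correlation functions, one per canonical site type, the same product could be evaluated on a grid around a whole protein without running a new simulation.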
Eleven different canonical sites that represent protein atoms in the PMF-library, obtained from clustering of AMBER force-field data
% in typ. protein   Epsilon (kJ/mol)   Size (nm)   Charge      Atom Type
~0.6%               1.04600            0.356       -0.16423    S
~2%                 0.71128            0.325       -0.89731    N2
~15%                0.71128            0.325       -0.41573    N1
~3%                 0.87864            0.296       -0.81021    O2
~15%                0.87864            0.296       -0.57349    O
~2%                 0.88031            0.307       -0.66428    OH
~7%                 0.35982            0.34        -0.14331    C1
~15%                0.35982            0.34         0.59735    C
~4%                 0.45773            0.34         0.27946    CT
~25%                0.45773            0.34        -0.01545    CT
~9%                 0.45773            0.34        -0.29039    CT
http://reccr.chem.rpi.edu
RECCR is funded under the Molecular Libraries Roadmap Initiative of NIH (# 1P20HG003899-01 of 09-23-2005)
Thank you!