Informed by Informatics?
-
Upload
charles-crawford -
Category
Documents
-
view
17 -
download
0
description
Transcript of Informed by Informatics?
![Page 1: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/1.jpg)
Informed by Informatics?
Dr John MitchellUnilever Centre for Molecular Science InformaticsDepartment of ChemistryUniversity of Cambridge, U.K.
![Page 2: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/2.jpg)
Modelling in Chemistry
Density Functional Theoryab initio
Molecular Dynamics
Monte Carlo
Docking
PHYSICS-BASED
EMPIRICALATOMISTIC
Car-Parrinello
NON-ATOMISTIC
DPD
CoMFA
2-D QSAR/QSPR
Machine Learning
AM1, PM3 etc.Fluid Dynamics
![Page 3: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/3.jpg)
Density Functional Theoryab initio
Molecular Dynamics
Monte Carlo
Docking
Car-Parrinello
DPD
CoMFA
2-D QSAR/QSPRMachine Learning
AM1, PM3 etc.
HIGH THROUGHPUT
LOW THROUGHPUT
Fluid Dynamics
![Page 4: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/4.jpg)
Density Functional Theoryab initio
Molecular Dynamics
Monte Carlo
Docking
Car-Parrinello
DPD
CoMFA
2-D QSAR/QSPRMachine Learning
AM1, PM3 etc.
INFORMATICS
THEORETICAL CHEMISTRY
NO FIRM BOUNDARIES!
Fluid Dynamics
![Page 5: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/5.jpg)
Density Functional Theoryab initio
Molecular Dynamics
Monte Carlo
Docking
Car-Parrinello
DPD
CoMFA
2-D QSAR/QSPRMachine Learning
AM1, PM3 etc.Fluid Dynamics
![Page 6: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/6.jpg)
Informatics and Empirical Models
• In general, Informatics methods represent phenomena mathematically, but not in a physics-based way.
• Inputs and output model are based on an empirically parameterised equation or more elaborate mathematical model.
• Do not attempt to simulate reality. • Usually High Throughput.
![Page 7: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/7.jpg)
QSPR
• Quantitative Structure Property Relationship
• Physical property related to more than one other variable
• Hansch et al developed QSPR in 1960’s, building on Hammett (1930’s).
• Property-property relationships from 1860’s
• General form (for non-linear relationships):y = f (descriptors)
![Page 8: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/8.jpg)
QSPRY X1 X2 X3 X4 X5 X6
Molecule 1 Property 1 – – – – – –Molecule 2 Property 2 – – – – – –Molecule 3 Property 3 – – – – – –Molecule 4 Property 4 – – – – – –Molecule 5 Property 5 – – – – – –Molecule 6 Property 6 – – – – – –Molecule 7 Property 7 – – – – – –
Y = f (X1, X2, ... , XN )
• Optimisation of Y = f(X1, X2, ... , XN) is called regression.
• Model is optimised upon N “training molecules” and then tested upon M “test” molecules.
![Page 9: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/9.jpg)
QSPRY X1 X2 X3 X4 X5 X6
Molecule 1 Property 1 – – – – – –Molecule 2 Property 2 – – – – – –Molecule 3 Property 3 – – – – – –Molecule 4 Property 4 – – – – – –Molecule 5 Property 5 – – – – – –Molecule 6 Property 6 – – – – – –Molecule 7 Property 7 – – – – – –
• Quality of the model is judged by three parameters:
n
i
predi
obsi yy
nBias
1
)(1
n
i
predi
obsi yy
nRMSE
1
2)(1
2
1
2
1
2 )(/)(1 averagen
i
obsi
predi
n
i
obsi yyyyr
![Page 10: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/10.jpg)
QSPRY X1 X2 X3 X4 X5 X6
Molecule 1 Property 1 – – – – – –Molecule 2 Property 2 – – – – – –Molecule 3 Property 3 – – – – – –Molecule 4 Property 4 – – – – – –Molecule 5 Property 5 – – – – – –Molecule 6 Property 6 – – – – – –Molecule 7 Property 7 – – – – – –
• Different methods for carrying out regression:
• LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc.
• NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.
![Page 11: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/11.jpg)
QSPRY X1 X2 X3 X4 X5 X6
Molecule 1 Property 1 – – – – – –Molecule 2 Property 2 – – – – – –Molecule 3 Property 3 – – – – – –Molecule 4 Property 4 – – – – – –Molecule 5 Property 5 – – – – – –Molecule 6 Property 6 – – – – – –Molecule 7 Property 7 – – – – – –
• However, this does not guarantee a good predictive model….
![Page 12: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/12.jpg)
QSPRY X1 X2 X3 X4 X5 X6
Molecule 1 Property 1 – – – – – –Molecule 2 Property 2 – – – – – –Molecule 3 Property 3 – – – – – –Molecule 4 Property 4 – – – – – –Molecule 5 Property 5 – – – – – –Molecule 6 Property 6 – – – – – –Molecule 7 Property 7 – – – – – –
• Problems with experimental error.• QSPR only as accurate as data it is trained upon.• Therefore, we are need accurate experimental data.
![Page 13: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/13.jpg)
QSPRY X1 X2 X3 X4 X5 X6
Molecule 1 Property 1 – – – – – –Molecule 2 Property 2 – – – – – –Molecule 3 Property 3 – – – – – –Molecule 4 Property 4 – – – – – –Molecule 5 Property 5 – – – – – –Molecule 6 Property 6 – – – – – –Molecule 7 Property 7 – – – – – –
• Problems with “chemical space”.• “Sample” molecules must be representative of “Population”.• Prediction results will be most accurate for molecules similar
to training set.• Global or Local models?
![Page 14: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/14.jpg)
1. Solubility
Dave Palmer
• Pfizer Institute for Pharmaceutical Materials Science
• http://www.msm.cam.ac.uk/pfizer
D.S. Palmer, et al., J. Chem. Inf. Model., 47, 150-158 (2007)
D.S. Palmer, et al., Molecular Pharmaceutics, ASAP article (2008)
![Page 15: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/15.jpg)
Solubility is an important issue in drug discovery and a major source of attrition
This is expensive for the industry
A good model for predicting the solubility of druglike molecules would be very valuable.
![Page 16: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/16.jpg)
DatasetsPhase 1 – Literature Data• Compiled from Huuskonen dataset and AquaSol database – pharmaceutically relevant molecules• All molecules solid at room temperature• n = 1000 molecules
Phase 2 – Our Own Experimental Data● Measured by Toni Llinàs using CheqSol Machine● pharmaceutically relevant molecules● n = 135 molecules
● Aqueous solubility – the thermodynamic solubility in unbuffered water (at 25oC)
![Page 17: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/17.jpg)
Diversity-Conserving Partitioning
● MACCS Structural Key fingerprints
● Tanimoto coefficient
● MaxMin Algorithm
Full dataset n = 1000 molecules
Training n = 670 molecules
Testn = 330 molecules
![Page 18: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/18.jpg)
Diversity Analysis
![Page 19: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/19.jpg)
Structures & Descriptors
● 3D structures from Concord
● Minimised with MMFF94
● MOE descriptors 2D/ 3D
● Separate analysis of 2D and 3D descriptors
● QuaSAR Contingency Module (MOE)
● 52 descriptors selected
![Page 20: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/20.jpg)
Machine Learning Method
Random Forest
![Page 21: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/21.jpg)
Random Forest
● Introduced by Briemann and Cutler (2001)
● Development of Decision Trees (Recursive Partitioning):
● Dataset is partitioned into consecutively smaller subsets
● Each partition is based upon the value of one descriptor
● The descriptor used at each split is selected so as to optimise splitting
● Bootstrap sample of N objects chosen from the N available objects with replacement
![Page 22: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/22.jpg)
Random Forest
● Random Forest is a collection of Decision or Regression Trees grown with the CART algorithm.
● Standard Parameters:
● 500 decision trees
● No pruning back: Minimum node size = 5
● “mtry” descriptors (square root of number of total number) tried at each split
Important features:
● Incorporates descriptor selection
● Incorporates “Out-of-bag” validation – using those molecules not in the bootstrap samples
![Page 23: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/23.jpg)
Random Forest for Solubility Prediction
A Forest of Regression Trees
• Dataset is partitioned into consecutively smaller subsets (of similar solubility)
• Each partition is based upon the value of one descriptor
• The descriptor used at each split is selected so as to minimise the MSE
Leo Breiman, "Random Forests“, Machine Learning 45, 5-32 (2001).
![Page 24: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/24.jpg)
Random Forest for Predicting Solubility
• A Forest of Regression Trees• Each tree grown until terminal
nodes contain specified number of molecules
• No need to prune back• High predictive accuracy• Includes method for descriptor
selection• No training problems – largely
immune from overfitting.
![Page 25: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/25.jpg)
Random Forest: Solubility Results
RMSE(te)=0.69r2(te)=0.89Bias(te)=-0.04
RMSE(tr)=0.27r2(tr)=0.98Bias(tr)=0.005
RMSE(oob)=0.68r2(oob)=0.90Bias(oob)=0.01
DS Palmer et al., J. Chem. Inf. Model., 47, 150-158 (2007)
![Page 26: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/26.jpg)
Drug Disc.Today, 10 (4), 289 (2005)
![Page 27: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/27.jpg)
Can we use theoretical chemistry to calculate solubility via a thermodynamic cycle?
![Page 28: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/28.jpg)
Gsub from lattice energy & an entropy term
Ghydr from a semi-empirical solvation model
(i.e., different kinds of theoretical/computational methods)
![Page 29: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/29.jpg)
… or this alternative cycle?
![Page 30: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/30.jpg)
Gsub from lattice energy (DMAREL) plus entropy
Gsolv from SCRF using the B3LYP DFT functional
Gtr from ClogP
(i.e., different kinds of theoretical/computational methods)
![Page 31: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/31.jpg)
● Nice idea, but didn’t quite work – errors larger than QSPR
● “Why not add a correction factor to account for the difference between the theoretical methods?”
…
![Page 32: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/32.jpg)
…
● Within a week this had become a hybrid method, essentially a QSPR with the theoretical energies as descriptors
![Page 33: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/33.jpg)
This regression equation gave r2=0.77 and RMSE=0.71
![Page 34: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/34.jpg)
Solubility by TD Cycle: Conclusions
● We have a hybrid part-theoretical, part-empirical method.
● An interesting idea, but relatively low throughput as a crystal structure is needed.
● Slightly more accurate than pure QSPR for a druglike set.
● Instructive to compare with literature of theoretical solubility studies.
![Page 35: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/35.jpg)
2. Bioactivity
Florian Nigsch
F. Nigsch, et al., J. Chem. Inf. Model., 48, 306-318 (2008)
![Page 36: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/36.jpg)
Feature Space - Chemical Space
m = (f1,f2,…,fn)
f1
f2
f3
Feature spaces of high dimensionality
COX2
f2
f3
f1
DHFR
CDK1CDK2
![Page 37: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/37.jpg)
Properties of Drugs
• High affinity to protein target
• Soluble• Permeable• Absorbable• High bioavailability• Specific rate of
metabolism• Renal/hepatic clearance?
• Volume of distribution?• Low toxicity• Plasma protein binding?• Blood-Brain-Barrier
penetration?• Dosage (once/twice
daily?)• Synthetic accessibility• Formulation (important in
development)
![Page 38: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/38.jpg)
Multiobjective Optimisation
Bioactivity Synthetic accessibility
Permeability
Toxicity
Metabolism
Solubility
Huge number of candidates …
![Page 39: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/39.jpg)
Multiobjective Optimisation
Bioactivity Synthetic accessibility
Permeability
Toxicity
Metabolism
Solubility
U S E L
E S S
Drug
Huge number of candidates most of which
are useless!
![Page 40: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/40.jpg)
Spam
• Unsolicited (commercial) email
• Approx. 90% of all email traffic is spam
• Where are the legitimate messages?
• Filtering
![Page 41: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/41.jpg)
Analogy to Drug Discovery
• Huge number of possible candidates
• Virtual screening to help in selection process
![Page 42: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/42.jpg)
Winnow Algorithm
• Invented in late 1980s by Nick Littlestone to learn Boolean functions
• Name from the verb “to winnow”– High-dimensional input data
• Natural Language Processing (NLP), text classification, bioinformatics
• Different varieties (regularised, Sparse Network Of Winnow - SNOW, …)
• Error-driven, linear threshold, online algorithm
![Page 43: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/43.jpg)
Machine Learning Methods
Winnow (“Molecular Spam Filter”)
![Page 44: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/44.jpg)
Training Example
![Page 45: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/45.jpg)
Features of Molecules
Based on circular fingerprints
![Page 46: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/46.jpg)
Combinations of Features
Combinations of molecular features
to account for synergies.
![Page 47: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/47.jpg)
Pairing Molecular Features Gives Bigrams
![Page 48: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/48.jpg)
Orthogonal Sparse Bigrams
• Technique used in text classification/spam filtering
• Sliding window process
• Sparse - not all combinations
• Orthogonal - non-redundancy
![Page 49: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/49.jpg)
Workflow
![Page 50: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/50.jpg)
Protein Target Prediction
• Which proteins does a given molecule bind to?• Virtual Screening• Multiple endpoint drugs - polypharmacology• New targets for existing drugs• Prediction of adverse drug reactions (ADR)
– Computational toxicology
![Page 51: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/51.jpg)
Predicted Protein Targets
• Selection of ~230 classes from the MDL Drug Data Report
• ~90,000 molecules• 15 independent
50%/50% splits into training/test set
![Page 52: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/52.jpg)
Predicted Protein Targets
Cumulative probability of correct prediction within the three top-ranking predictions: 82.1% (±0.5%)
![Page 53: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/53.jpg)
3. Melting Points
Florian Nigsch
F. Nigsch, et al., J. Chem. Inf. Model., 46, 2412-2422 (2006)
![Page 54: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/54.jpg)
Interest in Modelling
Interest in modelling properties of molecules
- solubility, toxicity, drug-drug interactions, …
Virtual screening has become a necessity in terms of time and cost efficiency
ADMET integration into drug development process
- many fewer late stage failures
- continuous refinement with new experimental dataAbsorption, Distribution, Metabolism, Excretion and Toxicity
![Page 55: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/55.jpg)
Melting Points are Important
Melting point facilitates prediction of solubility
- General Solubility Equation (Yalkowsky)
Notorious challenge (unknown crystal packing, interactions in liquid state)
Goal: Stepwise model for bioavailability, from solid compound to systemic availability
![Page 56: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/56.jpg)
Can Current Models be Improved?
Bergström: 277 total compounds, PLS, RMSE=49.8 ºC, q2=0.54; ntest:ntrain=185:92
Karthikeyan: 4173 total compounds, ANN, RMSE=49.3 ºC, q2=0.66; ntest:ntrain=1042:2089
Jain: 1215 total compounds, MLR, AAE=33.2 ºC, no q2 reported; ntest:ntrain=119:1094
RMSE slightly below 50 ºC!
q2 just above 0.50!
![Page 57: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/57.jpg)
Machine Learning Method
k-Nearest Neighbours
![Page 58: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/58.jpg)
k-Nearest Neighbours
Used for classification and regression
Suitable for large and diverse datasets
• Local (tuneable via k)• Non-linearCalculation of distance matrix
instead of training step
Distance matrix depends on descriptors used
Classification
Regression
![Page 59: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/59.jpg)
Molecules and Numbers
Datasets4119 organic molecules of
diverse complexity, MPs from 14 to 392.5 ºC (dataset 1)
277 drug-like molecules, MPs from 40 to 345 ºC (dataset 2)
Both datasets available in the recent literature
DescriptorsQuasar descriptors from MOE
(Molecular Operating Environment), 146 2D and 57 3D descriptors
Sybyl Atom Type Counts (53 different types, including information about hybridisation state)
![Page 60: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/60.jpg)
Finding the Neighbours
Distances
Euclidean distance between point i and point j in r dimensions
dij Xmi Xm
j 2
m1
r
1/ 2
Set of k nearest neighbours
N = {Ti, di} k = 1,5,10,15
![Page 61: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/61.jpg)
Calculation of Predictions
Predictions: Tpred = f(N)
unweighted weighted
Arithmetic (1) and geometric (2) average
Inverse distance (3)
Exponential decrease (4)
Tpreda
1
kTn
n1
k
Tpredg Tn
1/ k
n1
k
Tpredi
1
1
dnn1
k
Tn
1
dnn1
k
Tprede
1
e dnn1
k
Tne
dn
n1
k
1
3
2
4
![Page 62: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/62.jpg)
Assessment by Cross-validation
Select m random molecules for test set
N-m remaining molecules form training set
Organics Drugs
N: 4119
m: 1000
m/N: 24.3%
N: 277
m: 80
m/N: 28.9%
Data
Random splitTraining Test
Predict test set
(from training set)
RMSEP
Repeat 25 tim
es
RMSECV 1
25RMSEP 2
i1
25
Cross-validated root-mean-square error
![Page 63: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/63.jpg)
Assessment by Cross-validation
Organics Drugs
N: 4119
m: 1000
m/N: 24.3%
N: 277
m: 80
m/N: 28.9%
Data
Random splitTraining Test
Predict test set(from training set)
RMSEP
Re
pea
t 25 tim
es
RMSECV 1
25RMSEP 2
i1
25
Cross-validated root-mean-square error
Not “25-fold CV” where 1/25th is predicted based on the rest
Prediction of 25 random sets of 1000 molecules
Monte Carlo Cross-validation
![Page 64: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/64.jpg)
Results
NNMethod
A G I E
RMSE r2 q2 RMSE r2 q2 RMS
E r2 q2 RMSE r2 q2
1 57.6 0.36 0.21 - - - - - - - - -
5 48.9 0.44 0.43 50.0 0.43 0.40 48.2 0.45 0.45 48.8 0.45 0.43
10 48.8 0.44 0.43 50.2 0.43 0.40 47.7 0.46 0.46 47.6 0.47 0.46
15 49.3 0.43 0.42 50.9 0.42 0.38 48.1 0.46 0.45 47.3 0.48 0.47
Using a combination of 2D and 3D descriptors:
Training set
Test set 1000 molecules (organic)
3119 molecules (organic)
![Page 65: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/65.jpg)
Results: Organic Dataset
NNMethod
A G I E
RMSE r2 q2 RMSE r2 q2 RMS
E r2 q2 RMSE r2 q2
1 57.6 0.36 0.21 - - - - - - - - -
5 48.9 0.44 0.43 50.0 0.43 0.40 48.2 0.45 0.45 48.8 0.45 0.43
10 48.8 0.44 0.43 50.2 0.43 0.40 47.7 0.46 0.46 47.6 0.47 0.46
15 49.3 0.43 0.42 50.9 0.42 0.38 48.1 0.46 0.45 47.3 0.48 0.47
RMSE = 47.2 ºC
r2 = 0.47
q2 = 0.47
2D onlyRMSE = 50.7 ºC
r2 = 0.39
q2 = 0.39
3D onlyUsing 15 nearest neighbours
and exponential weighting
Exponential weighting
performs best!
![Page 66: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/66.jpg)
Results: Drugs Dataset
NNMethod
A G I E
RMSE r2 q2 RMSE r2 q2 RMS
E r2 q2 RMSE r2 q2
1 57.1 0.19 - - - - - - - - - -
5 47.7 0.27 0.25 48.8 0.26 0.22 47.1 0.29 0.27 48.5 0.28 0.22
10 46.9 0.29 0.28 47.8 0.30 0.25 46.3 0.30 0.29 47.4 0.29 0.26
15 48.0 0.25 0.24 49.0 0.26 0.21 47.2 0.28 0.27 47.2 0.29 0.27
RMSE = 46.5 ºC
r2 = 0.30
q2 = 0.29
2D onlyRMSE = 48.4 ºC
r2 = 0.24
q2 = 0.23
3D onlyUsing 10 nearest neighbours
and inverse distance weighting
Inverse distance weighting performs
best!
![Page 67: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/67.jpg)
Summary of Results
Organics: RMSE = 46.2 ºC, r2 = 0.49, ntest:ntrain = 1000:3119
Drugs: RMSE = 42.2 ºC, r2 = 0.42, ntest:ntrain = 80:197
Difficulties at extremities of temperature scale
due to: distances in descriptor space and temperature
![Page 68: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/68.jpg)
Summary of Results
Nonetheless: Comparison with other models in the literature shows that the presented model:
- performs better in terms of RMSE under more stringent conditions (even unoptimised)
- parameterless; optimised model many fewer parameters than, for example, in an ANN
![Page 69: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/69.jpg)
Conclusions
Information content of descriptors
Exponential weighting performs best
Genetic algorithm is able to improve performance, especially for atom type counts as descriptors
Remaining error
- interactions in liquid and/or solid state not captured (by single-molecule descriptors)
- kNN method in this setting (datasets + descriptors)
![Page 70: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/70.jpg)
Thanks
• Pfizer & PIPMS
• Unilever
• Dave Palmer
• Florian Nigsch
![Page 71: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/71.jpg)
All slides after here are for information only
![Page 72: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/72.jpg)
The Two Faces of Computational Chemistry
Theory Empiricism
![Page 73: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/73.jpg)
Informatics and Empirical Models
• In general, Informatics methods represent phenomena mathematically, but not in a physics-based way.
• Inputs and output model are based on an empirically parameterised equation or more elaborate mathematical model.
• Do not attempt to simulate reality. • Usually High Throughput.
![Page 74: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/74.jpg)
Theoretical Chemistry
• Calculations and simulations based on real physics.
• Calculations are either quantum mechanical or use parameters derived from quantum mechanics.
• Attempt to model or simulate reality.
• Usually Low Throughput.
![Page 75: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/75.jpg)
Optimisation with Genetic Algorithm
Transformation of data to yield lower global RMSEP
- global means: for whole data set
Three operations to transform the columns of data matrix for optimal weighting
One bit per descriptor on chromosomes: (op; param) consisting of an operation and a parameter
{multiplication
power
loge
{(a,b) = (0, 30) for mult.
(a,b) = (0, 3) for power
none for loge
![Page 76: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/76.jpg)
Iterative Genetic Optimisation
Initial data Initial chromosomes
Transformation of data
Calculation of distance matrix
Evaluation of models
Survival of the fittest
Genetic operations
Stop if criterion reached Set of optimal transformations
![Page 77: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/77.jpg)
Genetic Algorithm Able to Improve Results
Data set Descriptors RMSE (ºC) r2 q2
1 2D 46.2 0.49 0.49
2 2D 42.2 0.42 0.40
1 SAC 49.1 0.40 0.39
2 SAC 46.7 0.29 0.27
SAC: Sybyl Atom Counts
Application of optimal transformations to initial data
Reassessment of models
Averages over 25 runs!
![Page 78: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/78.jpg)
Errors at Low/High-Melting Ends…
ei x i y iA ei
2 sgn(ei)
SRMSE sgn(A)AN
Signed RMSE analysis Definition of signed RMSE
Details
In 30 ºC steps across whole range
All 4119 molecules of data set 1
![Page 79: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/79.jpg)
… Result From Large Distances
Distance in descriptor space
NNs of high-melting compounds typically further away
Distance in temperature
{< 0: negative neighbour
> 0: positive neighbourTtest-TNN
![Page 80: Informed by Informatics?](https://reader036.fdocuments.us/reader036/viewer/2022081603/568137f1550346895d9faba5/html5/thumbnails/80.jpg)
Over- & Underpredictions at Extremities
Distance in descriptor space
NNs of high-melting compounds typically further away
Distance in temperature
{< 0: negative neighbour
> 0: positive neighbourTtest-TNN
Underprediction!
Overprediction!