Molecular data and analysis · should be submitted to Dr Rafel Cabot Mesquida*, Chief Teaching...
Useful Information
• The web address for these lectures is
http://www-jmg.ch.cam.ac.uk/cil/partii/ (on front of
handout)
• Assessment Exercises are also at this address. They
will be marked out of ten. Your (hard copy) answers
should be submitted to Dr Rafel Cabot Mesquida*,
Chief Teaching Technician
• Glen exercises due: Feb 15th 2020
• Lectures and handout available on Moodle
*a metal tray in the office G12 labelled “Part II Cheminformatics”
2 Finding molecules
In 1924 Dr. Markush was awarded a patent on pyrazolone dyes (USP No. 1,506,316) in
which he claimed generic chemical structures in addition to those actually synthesized.
Structures of this type were permitted after a ruling in 1925 by the US Patent Office and
became known as “Markush structures”. The “Markush Doctrine” of patent law greatly
increases flexibility in the preparation of claims for the definition of an invention.
Expanding our representation of chemical structures – Markush structures.
We can expand our search by introducing less exact labelling of attachments to the
core structure. Markush structures are essentially structures involving R-groups,
where a part of the molecule is defined by a series of alternatives – a more complex
example
Additionally, to introduce a more generic approach to structure matching, we might
define e.g. hydrogen-bond donors as:
R = OH, NH, SH, PH for example – care is of course needed, e.g. a
COOH may be ionised and have no H!
This approach is extensively used in the patent literature to cover claims
of chemical structures with many variations.
Markush or Generic Structures. J. Chem. Inf. Comp. Sci. 1991, 31 (1)
A comparison of different approaches to Markush structure handling
J. Chem. Inf. Comp. Sci. 31(1), 1991, 64-68
An example of a Patent claim using Markush structures – how
many does this cover ?
Markush structure searching over the years, Edlyn S. Simmons World Patent Information, Volume
25, Issue 3, September 2003, Pages 195-202
Searching Markush compound structures is still an unsolved problem (so-called
‘nasties’), and has great implications for patents. MarPat is a Markush searchable
database of patents. https://www.cas.org/support/documentation/markush .
2. Finding molecules using Molecular Similarity
• You may perform a structural search of a database, and find no molecules. You still want to use a molecule like your query in some way, so, how do you find one that is ‘similar’ ?
• We may have e.g. a molecule that shows anti-cancer effects, but is toxic
• We could then look for other molecules that could have a similar anti-cancer effect, but a lower toxic effect
• ‘Similarity’ though, has a context and the right molecular description is needed for each specific case.
Bender A., Glen RC., Org. Biomol. Chem., 2004, 2, 3204 – 3218.
Molecular similarity: a key technique in molecular informatics.
The similarity concept is widely used in medicinal chemistry :
e.g. using the concept of Bio-Isosteres – the fundamental
concept in discovering new drugs
This idea (a bio-isostere) suggests that a chemical group can be
mimicked by a replacement group that, in many documented cases,
has appeared similar in its response to biological receptors (usually
proteins).
e.g. :
Bioorganic & Medicinal Chemistry Letters
Volume 17, Issue 14, 15 July 2007, Pages
4040-4043
Changing substituents while maintaining affinity in an anti-bacterial.
An example at Influenza neuraminidase – a critical enzyme the
virus uses for infection - inhibitors Oseltamivir and Zanamivir
use a bio-isosteric replacement of the natural substrate
<=Similar to=>
Neuraminidase cleaves
the glycosidic linkages
of neuraminic acids
Therefore in a search,
these additional ‘R’
groups can be included as
Markush structures
More examples used as bio-isosteres (pairs)
Sarah R. Langdon, Peter Ertl, and Nathan Brown. Bioisosteric
Replacement and Scaffold Hopping in Lead Generation and
Optimization . Mol. Inf. 2010, 29, 366 – 385
Robert P. Sheridan. The Most Common Chemical Replacements in
Drug-Like Compounds. J. Chem. Inf. Comput. Sci. 2002, 42, 103-108
Similar to…
Similar to…
But supposing you can’t easily ‘think up a similar substructure’?
There are various methods that have been devised to compute
similarity. These are generally:
•Based on the structure
•In one (strings), two (graphs) or three dimensions (coordinates)
•Based on molecular properties
•Experimental (e.g. size and shape) and computed properties (e.g.
Dipole Moment)
Let's look at how a similarity calculation can be defined using some
of these methods.
Similarity. The Maximal Common Subgraph (the biggest common
fragment)
•Important search to determine which part of a structure is constant –
e.g. identifying reaction components – in this one, the atoms and bonds
which are constant comprise the MCS.
•Is a complex case of identifying a fragment, as we don’t know the size of the
MCS beforehand, so involves ‘backtracking’ to compute – and is therefore
time consuming. For example, this is also one of the problems of converting
a list of compounds to a Markush structure.
MCS algorithms can be applied to problems other than atom-atom mapping in
reactions -
•structural similarity between molecules - the size of the MCS (relative to the size of
the molecules) can be used as a measure of the similarity of molecules, e.g. search
for molecules containing at least 80% of the query substructure.
Similarity by Molecular Fingerprints
• Fingerprints are a common approach to describing molecular similarity
• Fingerprints can be considered as a ‘bar code’ for the molecule
• Used because
– uses only the molecular graph
– does not require structural conformation or alignment
– fast searching method
• It is very fast to annotate a database of millions of molecules with fingerprints
• Often you are using fingerprints in searching databases, and don’t realise it !
Molecular Fingerprints
• Hash codes (already mentioned for searching)
• The simplest fingerprint registers the presence or
absence of fragments in a molecule. e.g.
Bit:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ....... X
Value: 0 0 0 0 0 1 0 0 0 0  0  1  0  0  0  0
The two set bits record 'Contains P' and 'Contains NH2'; an unset bit records, for example, 'Doesn't contain F'.
Molecular Fingerprints
• We could use this fingerprint, for example, to find only molecules containing
phosphorus that have an amine in their structure
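The screening idea on this slide can be sketched in a few lines. This is only an illustration: the assignment of P to bit 6, NH2 to bit 12 and F to bit 3 is an assumption (the slide does not say which set bit is which), and the three database molecules are hypothetical.

```python
# Bit positions use the slide's 1-based numbering. Which fragment owns which
# bit is an assumption made for this illustration.
FRAGMENT_BITS = {"P": 6, "NH2": 12, "F": 3}

def make_fingerprint(fragments):
    """Return an integer whose set bits mark the fragments present."""
    fp = 0
    for name in fragments:
        fp |= 1 << (FRAGMENT_BITS[name] - 1)
    return fp

def contains_all(fp, query):
    """True if every bit set in the query is also set in the fingerprint."""
    return (fp & query) == query

# Hypothetical three-molecule database.
database = {
    "mol_a": make_fingerprint(["P", "NH2"]),
    "mol_b": make_fingerprint(["P", "F"]),
    "mol_c": make_fingerprint(["NH2"]),
}

query = make_fingerprint(["P", "NH2"])
hits = [name for name, fp in database.items() if contains_all(fp, query)]
print(hits)
```

A single bitwise AND per molecule is what makes this kind of screen fast enough for databases of millions of structures.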
[Figure: the same 16-bit fingerprint, shown next to an example structure that contains phosphorus and an NH2 group.]
• Fingerprints can be generated algorithmically, we don’t need to manually specify all the fragments
• Fingerprint method most often used is based on the CRC algorithm (cyclic redundancy check) –you could look this up on the web.
• Advantages/disadvantages
– easy to calculate
– very fast
– not specific to one area of chemistry
– difficult to understand
Automated fingerprint generation
Fingerprint Generation – Hashing
CRC (Cyclic Redundancy Check)
CH3CH2CH2CH2OH is broken into 4-atom fragments: H-C-C-C, C-C-C-O, C-C-O-H, etc.
Each fragment is hashed to an integer I, where 0 < I < 10^9, and the bit set is MOD(I / 151), e.g. bit 12.
e.g. 150-bit fingerprint for 4-atom fragments (we generally use 4-7 atoms)
Bits run from 1 ....... 151
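A minimal sketch of this hashing step, under stated assumptions: Python's `zlib.crc32` stands in for the CRC used on the slide, only heavy atoms are considered (the slide's fragments also include hydrogens), and the graph encoding and function names are made up for the example.

```python
import zlib

def linear_paths(atoms, bonds, n_atoms=4):
    """Enumerate linear fragments of n_atoms atoms as canonical strings."""
    adjacency = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    paths = set()

    def extend(path):
        if len(path) == n_atoms:
            labels = [atoms[i] for i in path]
            forward = "-".join(labels)
            backward = "-".join(reversed(labels))
            paths.add(min(forward, backward))  # same fragment read either way
            return
        for nxt in adjacency[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return paths

def hashed_fingerprint(atoms, bonds, n_bits=151, n_atoms=4):
    """Hash each fragment with CRC32 and fold it into n_bits bit positions."""
    return {zlib.crc32(frag.encode()) % n_bits
            for frag in linear_paths(atoms, bonds, n_atoms)}

# Butanol, heavy atoms only: C-C-C-C-O
butanol_atoms = ["C", "C", "C", "C", "O"]
butanol_bonds = [(0, 1), (1, 2), (2, 3), (3, 4)]
fragments = linear_paths(butanol_atoms, butanol_bonds)
fingerprint = hashed_fingerprint(butanol_atoms, butanol_bonds)
print(sorted(fragments), sorted(fingerprint))
```

Two different fragments can hash to the same bit, which is part of why hashed fingerprints are fast and general but, as the slide says, difficult to understand.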
Linear Fingerprint
Fingerprint Generation – circular fingerprints
[Figure: the environment of an atom described level by level (levels 0, 1, 2, 3), using atom types such as C.ar, C.2, C.3, O.3, O.2 and O.co2.]
The fingerprint length is the number of ‘atom types’ times the number of levels: layer zero, 30 atom types; layer 1, 30 atom types; layer 2, 30 atom types.
These are very ‘sparse’ but work well – I’ll show some examples later
J. Chem. Inf. Model. 2007, 47(2), 583-590
Can also use a variant of the Morgan algorithm
Comparing the fingerprints of molecules
- Tanimoto or Jaccard similarity
$$T = \frac{A \,\&\, B}{A + B - A \,\&\, B}$$
where A, B and A&B are the numbers of bits set in fingerprint A, fingerprint B, and A-AND-B.
In a hypothetical example, A, B and A&B are 24, 21 and 19 respectively,
resulting in a Tanimoto coefficient of 19 / (24 + 21 - 19) = 0.73 (1.00 is perfect similarity).
Another way to put it: TC = BC / (B1 + B2 - BC)
Values above 0.85 are usually significant. This method is
commonly used to search for pharmaceutically active
molecules, reagents, reactions...
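The hypothetical example above (24, 21 and 19 bits) can be checked with a short sketch. Fingerprints are represented here as Python sets of set-bit positions, which is just one convenient choice; the particular bit positions are invented so that the counts match the slide.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient of two fingerprints given as sets of set bits."""
    common = len(fp_a & fp_b)
    total = len(fp_a) + len(fp_b) - common
    # Assumption: two empty fingerprints are treated as having zero similarity.
    return common / total if total else 0.0

# Invented bit positions chosen to reproduce the counts on the slide:
fp_a = set(range(0, 24))   # 24 bits set
fp_b = set(range(5, 26))   # 21 bits set, 19 of them shared with fp_a
print(len(fp_a), len(fp_b), len(fp_a & fp_b), round(tanimoto(fp_a, fp_b), 2))
```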
Tanimoto similarity example
- similarity to o-chloro-p-aminobenzoic acid
Structure                       Tanimoto coefficient
Benzoic acid                    0.52
m-chlorobenzoic acid            0.64
o-chlorobenzoic acid            0.80
o-chloro-p-aminobenzoic acid    1.0
p-aminobenzoic acid             0.70
p-chlorobenzoic acid            0.66
Similarity search in SciFinder Scholar
1 - query structure
2 - similarity search
3 - pick > 85% similarity
4 - six structures retrieved (from xxx Million). This probably uses linear fingerprints
‘Tanimoto’ similarity indices are one of a class of
methods for bit-string comparisons.
Some comparison indices additional to the Tanimoto coefficient
(Nab/(Na+Nb-Nab)) are:
Hamming coefficient: $H = \sum_{i=1}^{n} \mathrm{XOR}(a_i, b_i)$
Cosine coefficient = Nab/Sqrt(Na x Nb)
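As a rough illustration of how these indices differ on the same pair of bit strings, here is a sketch with two invented 8-bit fingerprints. Na, Nb and Nab are the numbers of bits set in a, in b and in both.

```python
import math

def hamming(a, b):
    """Number of positions at which the two bit strings differ (sum of XOR)."""
    return sum(x ^ y for x, y in zip(a, b))

def cosine(a, b):
    """Nab / sqrt(Na * Nb)."""
    n_ab = sum(x & y for x, y in zip(a, b))
    return n_ab / math.sqrt(sum(a) * sum(b))

def tanimoto(a, b):
    """Nab / (Na + Nb - Nab)."""
    n_ab = sum(x & y for x, y in zip(a, b))
    return n_ab / (sum(a) + sum(b) - n_ab)

# Invented example fingerprints.
a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 0, 0, 1, 1, 0, 1, 0]
print(hamming(a, b), cosine(a, b), tanimoto(a, b))
```

Note that the Hamming value is a distance (0 means identical), whereas the cosine and Tanimoto values are similarities (1 means identical).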
A good introduction is in :
http://www.orgchm.bas.bg/~vmonev/SimSearch.pdf
J. Chem. Inf. Comput. Sci., 37, 18-22 (1997)
J. Chem. Inf. Comput. Sci., 43, 819-828 (2003)
J. Chem. Inf. Comput. Sci. 38, 983-996 (1998)
J. Chem. Inf. Model. Publication Date (Web): October 19,
2012, DOI: 10.1021/ci300261r. Just accepted.
Not just bits – properties of molecules
There are, of course, an enormous number of “molecular properties”
that can be used to compare molecules – some of the more common ones are
listed below:
1. Quantum mechanical descriptors based on the wavefunction (Carbo index). Quantitative Structure-Activity Relationships, Volume 16, Issue 1 (p 25-32)
2. Topological indices (Wiener, Kier and Hall).
H. Wiener, "Structural determination of paraffin boiling points", J. Am. Chem. Soc., 1947, 69(1), 17-20.
L. B. Kier, L. H. Hall, Molecular Connectivity in Structure-Activity Analysis, J. Wiley & Sons, New York, 1986
3. Computed molecular properties: volume, surface area, logP, pKa, ... (a vast
number) – then cluster molecules according to a similarity measure.
Molecules............Index...........graph of similarity of pairs
Beck et al. Chemical
Physics 356 (2009) 121–
130
J. Chem. Inf. Model., 2007, 47 (2), pp 583–590
Metabolic Site/Product predictor (MetaPrint2D)
1. Symyx Metabolite database (~80000 transformations) of substrates + products: calculate the environment for each substrate atom and identify reaction centres.
2. Query compound: calculate the environment for each atom.
3. For each query atom, find all similar environments in the database. How often is the environment found at a reaction centre?
4. Calculate reaction occurrence ratios: (total number of similar reaction centres) / (total number of similar atoms in rest of database).
5. Calculate relative ratios for each atom in the query compound, and display predictions.
Using a naive Bayes probabilistic model
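The occurrence-ratio step can be sketched as below. This is not the MetaPrint2D implementation: the environment labels, counts and function names are hypothetical, exact label matching stands in for the "similar environment" search, and the naive Bayes model is not reproduced.

```python
def occurrence_ratios(query_envs, centre_counts, atom_counts):
    """Ratio of reaction-centre hits to all atoms for each query atom's environment."""
    ratios = {}
    for atom, env in query_envs.items():
        total = atom_counts.get(env, 0)
        ratios[atom] = centre_counts.get(env, 0) / total if total else 0.0
    return ratios

def relative_ratios(ratios):
    """Scale the ratios so that the most likely site in the molecule scores 1.0."""
    top = max(ratios.values(), default=0.0)
    return {atom: (r / top if top else 0.0) for atom, r in ratios.items()}

# Hypothetical database statistics, keyed by an atom-environment label.
centre_counts = {"aromatic C-H": 30, "O-CH3": 90}    # seen at a reaction centre
atom_counts = {"aromatic C-H": 300, "O-CH3": 120}    # seen anywhere

# Hypothetical query compound: atom name -> environment label.
query_envs = {"C1": "aromatic C-H", "C7": "O-CH3", "N2": "amide N"}

ratios = occurrence_ratios(query_envs, centre_counts, atom_counts)
relative = relative_ratios(ratios)
print(ratios, relative)
```

In this toy example the O-CH3 carbon would be displayed as the most likely site of metabolism.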
Database Version        2005.1   2006.1   2007.1   2008.1
Transformations         72599    78009    82671    87446
Single step             58757    62147    65732    69402
Product not reported    811      831      834      882
Newly added                      5410     4662     4775
Interestingly, the molecule dosed (which has excellent bioavailability) is a partial agonist, while the main metabolite is a full agonist. So, as the drug concentration lowers in blood, the remaining compound becomes more potent – probably a longer lasting effect
Paracetamol toxicity
(Tylenol)
Overdose results in
species NAPQI and
liver damage
Metaprint2D results
glutathione
3 Finding molecules using three dimensional data
•‘Real’ molecules exist in a 3-dimensional world
•Their properties depend on their shape and the spatial
disposition of functional groups.
•Simple example: dipole moment
2.5 Debye 0.5 Debye
An example of the exquisite matching of a substrate to a protein binding site – here 3D shape and the complementary non-bonded interactions are extremely important
Cheminformatics Tools
for drug design
• Three dimensions in drug discovery
• A ‘pharmacophore’ is a 3-D representation of the required features
for binding to a biological receptor
[Figure: pharmacophore model with inter-feature distances of 5.2, 4.2-4.7, 6.7, 4.8 and 5.1-7.1. Distances in Ångströms.]
Here is the pharmacophore model used to design the migraine drug ‘Zomig’, deduced from comparison of molecules that interact with the receptor binding site
Similarity Searching based on pharmacophores - What do we need?
• A database of 3-dimensional structures (Zinc database is 200 million)
– Atom Coordinates
– Atom types
– Ring, fragment, property, H-bonding etc. definitions
– An excellent example is the Cambridge Structural Database of X-ray structures (next door)
• A definition of the query
– Fragments of molecules and their properties
– Constraints
• Distances between functional groups
• Angles between these
– The concept of Dummy atoms is useful
– e.g. ring centres, H-bonding points, planes
Example search (“Virtual Screening”) of our current 4.5 Million 3D database
[Figure: the pharmacophore query, with distances of 5.2, 4.2-4.7, 6.7, 4.8 and 5.1-7.1 between feature points labelled hydrophobic center, positive N, H-bond donor/acceptor and H-bond acceptor.]
The query features are a protonated amine (NH3+), a ring centre (defined by 6 atoms), a hydrogen-bond acceptor and a hydrogen-bond donor-acceptor.
- This brings up the point that ‘properties’ can be specified at atom points
-- Markush atoms
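A distance-constrained 3D query of this kind can be sketched as follows. The coordinates are invented, and which pair of features each distance range belongs to is an assumption (the slide's figure is needed to know the real assignment); a real search would also have to consider many conformations per molecule.

```python
import math

def satisfies(points, constraints):
    """True if every feature-feature distance lies within its allowed range."""
    for first, second, low, high in constraints:
        distance = math.dist(points[first], points[second])
        if not low <= distance <= high:
            return False
    return True

# Assumed pairing of features with two of the distance ranges (in Ångströms).
constraints = [
    ("amine", "ring_centre", 5.1, 7.1),
    ("amine", "acceptor", 4.2, 4.7),
]

# Invented feature coordinates for two candidate conformations.
candidate_1 = {"amine": (0.0, 0.0, 0.0),
               "ring_centre": (6.0, 0.0, 0.0),
               "acceptor": (0.0, 4.5, 0.0)}
candidate_2 = {"amine": (0.0, 0.0, 0.0),
               "ring_centre": (8.0, 0.0, 0.0),
               "acceptor": (0.0, 4.5, 0.0)}

print(satisfies(candidate_1, constraints), satisfies(candidate_2, constraints))
```

The ring centre here is a dummy atom: a point computed from the six ring atoms rather than a real atom position.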
When x-ray structures are available – molecules can
be ‘docked’ into the binding site – pharmacophores
can be generated and used for searching as before
• A docking program will take a
randomised ligand conformation from a
ligand/protein x-ray structure and place
the molecule back in the correct
position.
• Many thousands of molecules can be
‘docked’ / hour.
• Molecules can be selected based on their
‘fit’ to the protein, and subsequently
tested for binding affinity
Docked Gleevec compared with the Gleevec X-ray structure (1T46): the X-ray structure overlaid with
the predicted position of Gleevec is an almost perfect match, which implies we
could use the same docking approach to search for new molecules that
work in the same way
Docking example using Gold: the
anti-cancer drug Gleevec – a
specific cancer target inhibitor of
Bcr-Abl tyrosine kinase, the
constitutive abnormal kinase in
chronic myeloid leukemia.
Docking example using Gold:
Gleevec – specific cancer target
inhibitor of Bcr-Abl tyrosine
kinase, the constitutive abnormal
kinase in chronic myeloid
leukemia. Red lines define a
pharmacophore
The pharmacophore can be extracted and used to search for additional
molecules from our database; these are then tested by ‘docking’ and,
if they fit, can be tested for anti-cancer properties in this case.
*GOLD. Jones G, Willett P, Glen R C, Molecular Recognition of Receptor Sites using a Genetic Algorithm with a
Description of Desolvation, J.Mol. Biol.245, 43-53 (1995).
Jones G, Willett P, Glen R C, Leach A R, Taylor R. Development and Validation of a Genetic Algorithm for Flexible
Docking. J. Mol. Biol. 267, 727-748 (1997).
“Virtual screening” using similarity – an important way to find starting points for
designing new drugs
Suppose we have no information on a biological target. Also, like
many pharmaceutical companies, we have 1 Million real molecules
in our compound store. But, due to cost, we can only afford to
screen 10,000. How can we pick the best representative set to
screen?
There are essentially two ways to do this – similarity and diversity.
[Figure: two plots of Property A against Property B. Left: selection based on similarity to A and B. Right: a diverse set.]
Virtual screening using similarity
On the bottom left, we have used two molecules displaying
biological activity (A and B) to find those most similar in the
database for testing, to maximise our chances of finding new hits.
On the bottom right, we have no molecules to use, so we select the
best diverse set, maximising our chances of a hit whilst only testing
a representative subset of the compounds library.
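Both selection strategies can be sketched on a toy two-property space. The compound coordinates are invented, Euclidean distance in (Property A, Property B) stands in for whatever similarity measure is used in practice, and the greedy "MaxMin" picker is one common way to choose a diverse set, not necessarily the one used for the figure.

```python
import math

def most_similar(library, actives, k):
    """Pick the k compounds closest to any known active."""
    def distance_to_nearest_active(point):
        return min(math.dist(point, active) for active in actives)
    return sorted(library, key=distance_to_nearest_active)[:k]

def maxmin_diverse(library, k):
    """Greedy MaxMin: repeatedly add the compound farthest from those already chosen."""
    chosen = [library[0]]
    while len(chosen) < min(k, len(library)):
        remaining = [p for p in library if p not in chosen]
        chosen.append(max(remaining,
                          key=lambda p: min(math.dist(p, c) for c in chosen)))
    return chosen

# Invented compound store: (Property A, Property B) for each compound.
library = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
actives = [(0.0, 0.1)]

similar_set = most_similar(library, actives, k=2)
diverse_set = maxmin_diverse(library, k=3)
print(similar_set, diverse_set)
```

The similar set crowds around the known active, while the diverse set spreads across the property space.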
An example of a reaction in a modified Smiles, called Smirks.
‘Acetic acid and (.) ethanol > in the presence of HCl and ethanol >
make ethyl acetate’
Chemical Reactions can also be represented in the computer
http://www.daylight.com/dayhtml_tutorials/languages/smirks/index.html
Virtual screening using a virtual library
The molecules we screen in the computer don’t have to be
physically available. We can generate vast libraries of
molecules we could synthesise, and search these. Promising
molecules could be synthesised. A billion examples is not
unreasonable. An example of potential HIV Protease inhibitors
There are so many, we have to be very selective and computer-aided
design can help
Example reaction of two acids with two alcohols to make four products
(the acids have ‘R’ groups). New characters and atom mapping is used.
[*:1][C:2](=[O:3])[O:4][H].[*:7][C:5][O:6][H]>>[*:1][C:2](=[O:3])[O:6][C:5][*:7].[H][O:4][H]
You’ve found some interesting molecules –
but how can we predict their properties
quantitatively ?
• Particularly in drug discovery (but also in
materials science for example) methods
have been developed to relate the structure
and properties of molecules to their function
• These are called Quantitative Structure
Property (or Activity) Relationships -
QSPR, QSAR
• The handout contains details of some
approaches that may be of interest.
Combining molecular structure with calculation of
properties to deduce predictive models is usually
termed:
• Quantitative Structure Property Relationships (QSPR)
• Quantitative Structure Activity Relationships (QSAR)
• We calculate descriptors to combine with statistical and machine-learning methods to create models to predict properties.
Picture the data – often the best approach
J. Med. Chem., 44 (5), 2001, pp681 -693,
[Figure: plot of hydrophobicity against size. ‘Exclusion zone’ – compounds here are not bio-available.]
Here is a real example: two descriptors, CMR (the size of the molecule) and logD
(the distribution coefficient between octanol and water), are calculated, plotted and
annotated with the molecules' ability to be absorbed in the intestine. The white areas
contain molecules that are absorbed; the shaded areas contain molecules that are not. As drugs,
molecules in the shaded areas would not be orally absorbed and would therefore be useless in pills.
This approach to bioavailability has had a fundamental effect on new drug discovery,
see Lipinski’s Rule of Five in the notes.
Simplest approaches
• 1. Read across. If molecule A has a measured property, and molecule B has not had it measured, if molecules A and B are very similar, perhaps they have similar properties and we can predict the property of B. This is a common approach in predicting e.g. the toxicity of molecules.
• We use information for one chemical, called a “source chemical”, to make a prediction of the same property or toxicological endpoint for another chemical, called a “target chemical”, termed “read across”.
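Read-across can be sketched as a nearest-neighbour lookup. Everything specific here is an assumption for illustration: the fingerprints are invented sets of bits, Tanimoto similarity is used as the similarity measure, and the 0.85 cut-off for "very similar" is simply borrowed from the fingerprint slides.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of set bits."""
    common = len(fp_a & fp_b)
    total = len(fp_a) + len(fp_b) - common
    return common / total if total else 0.0

def read_across(target_fp, sources, min_similarity=0.85):
    """Return (value, source name, similarity) from the most similar source
    chemical, or None if no source is similar enough to justify a prediction."""
    best_name, best_similarity = None, 0.0
    for name, (fingerprint, _value) in sources.items():
        similarity = tanimoto(target_fp, fingerprint)
        if similarity > best_similarity:
            best_name, best_similarity = name, similarity
    if best_name is None or best_similarity < min_similarity:
        return None
    return sources[best_name][1], best_name, best_similarity

# Invented source chemicals: name -> (fingerprint, measured endpoint).
sources = {
    "source_A": ({1, 2, 3, 4, 5, 6, 7, 8}, "toxic"),
    "source_B": ({1, 2, 20, 21}, "non-toxic"),
}
target = {1, 2, 3, 4, 5, 6, 7}

prediction = read_across(target, sources)
print(prediction)
```

Returning no prediction when nothing is similar enough matters as much as the prediction itself: read-across is only defensible for close analogues.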
Example of using chemical and biological similarity in read-across prediction of toxicity
Low et al. Chem. Res. Toxicol., 2013, 26 (8), pp 1199–1208
Building models (QSAR/QSPR)
Molecular database → Calculate/measure molecular properties → Analysis → Prediction
This is the most common approach for molecular analysis and prediction
Supervised methods
Supervised methods. The most common method is linear regression. Simple linear
regression fits a straight line through the set of n points in such a way that
the sum of the squared residuals of the model (that is, the vertical distances between
the points of the data set and the fitted line) is as small as possible. The equation
we obtain can be used to predict a new property based on the descriptors calculated
or measured for the new molecule.
Q is the function we want to obtain and minimise.
Alpha is the correction factor (the intercept) that moves all the points so the line goes through the origin.
Beta is the coefficient to multiply our descriptor (x) by.
Epsilon is a residual (which we wish to minimise).
The method is explained in more detail in the notes.
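A minimal sketch of simple linear regression, assuming the usual textbook form y_i = alpha + beta * x_i + epsilon_i with Q(alpha, beta) equal to the sum of the squared epsilon_i. The closed-form least-squares solution is used, and the data points are invented.

```python
def fit_line(xs, ys):
    """Least-squares estimates of alpha (intercept) and beta (slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = s_xy / s_xx
    alpha = mean_y - beta * mean_x
    return alpha, beta

def sum_squared_residuals(xs, ys, alpha, beta):
    """Q: the quantity that the fitted line minimises."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))

# Invented descriptor values (x) and measured property values (y).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

alpha, beta = fit_line(xs, ys)
q_best = sum_squared_residuals(xs, ys, alpha, beta)
predicted = alpha + beta * 5.0   # prediction for a new molecule with x = 5
print(alpha, beta, q_best, predicted)
```

Any other choice of alpha and beta gives a larger Q, which is what "best fit" means here.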
Machine Learning:
Predicting TLC (Thin Layer Chromatography)
[Figure: TLC plate (silica on glass) showing the start point, the solvent front at distance x and the compound moved to distance y; Rf = y/x.]
•Compounds move up the plate depending on the solvent, their properties etc.
•We can predict the Rf values (retention factors) using details of the molecules and the solvent.
•Separate mixtures, identify compounds etc.
[Figure: benzoic acid, with the ring positions numbered 1 (COOH) to 6.]
No.  Substituent(s)
1    4-F
2    3-F
3    2-F
4    CF3
5    3-CF3
6    4-CF3
7    2-F,4-CF3
8    4-F,3-CF3
9    4-F,2-CF3
10   4-CH3
11   2-CH3
12   3-CH3
13   4-NH2
14   H
15   2-OH
16   3-OH,6-OH
17   2-OH,6-OH
18   2-OH,3-OH
19   2-OH,5-OH
20   2-OH,4-OH
21   3-OH,4-OH
22   2-COOH
Data
• 22 substituted benzoic acids
• 2 solvent systems
• 6 mixtures:
  1 Acetonitrile - Water 30 : 70
  2 Acetonitrile - Water 40 : 60
  3 Acetonitrile - Water 50 : 50
  4 MeOH - Water 40 : 60
  5 MeOH - Water 50 : 50
  6 MeOH - Water 60 : 40
• 22 compounds x 6 mixtures = 132 experiments
Data
Measurements
No.      compound number
Cpd      name of compound
Solvent  water and acetonitrile/methanol
Rf       retention factor
Rm       log((1-Rf)/Rf)
S_Area   surface area of molecule in Å2
clogp    calculated partition coefficient octanol/water
volume   molecular volume in Å3
MPolar   polarizability of the molecule cm-25
dipole   dipole moment of the molecule (Debye)
dipsol   dipole moment of the solvent (%solv1+%sol2)*100 Debye
PolSol   polarizability of the solvent (%pol1+%pol2)/100 Debye
Ovality  how removed from spherical:
$$\text{Ovality} = \frac{S}{4\pi \left( \frac{3V}{4\pi} \right)^{2/3}}$$
The water dipole is also given, 2.75 Debye
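Two of these derived quantities are easy to compute directly. The sketch below assumes Rm = log10((1 - Rf)/Rf) and the usual definition of ovality as the surface area S divided by the surface area of a sphere with the same volume V; the base of the logarithm and the exact ovality formula are assumptions, since the slide's equation is not fully legible.

```python
import math

def rm_from_rf(rf):
    """Rm = log10((1 - Rf) / Rf), for 0 < Rf < 1."""
    return math.log10((1.0 - rf) / rf)

def ovality(surface_area, volume):
    """Surface area relative to that of a sphere with the same volume."""
    sphere_area = 4.0 * math.pi * (3.0 * volume / (4.0 * math.pi)) ** (2.0 / 3.0)
    return surface_area / sphere_area

# A compound that runs halfway up the plate has Rm = 0.
print(rm_from_rf(0.5))

# A perfect sphere of radius 2 Å has ovality 1.
radius = 2.0
sphere_surface = 4.0 * math.pi * radius ** 2
sphere_volume = 4.0 / 3.0 * math.pi * radius ** 3
print(ovality(sphere_surface, sphere_volume))
```

Values of ovality above 1 indicate a shape that is increasingly far from spherical.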
• Molecular properties were calculated for each of the molecules
and tabulated in a spreadsheet (tlcdata.xls) e.g.
LogK' = -0.401 QON + 0.396 CLOGP + 0.109 DIP - 0.056 DIPMOM - 3.162 ESDL1 + 0.231 CMR + 0.110 POLSOL - 5.326
r = 0.954, F7,110 = 155.59
Variance Explained = 91.0 %
Multiple linear regression – using the ‘best’ 7 parameters
[Figure: plot against the measured values; • test set, o training set.]
Unsupervised methods (typically classification models)
In the previous examples, data was fitted to a
model, usually predicting a numeric value of the
desired property. However, it is also possible to
cluster the data, and hence make predictions
about a particular class a new molecule will fall
into e.g. is it toxic or non-toxic. This is “guilt by
association”.
The most common approach to do this is cluster
analysis, which includes a diverse set of
approaches.
Hierarchical clustering and k-means clustering are
common approaches. Clustering involves finding
the distance between all pairs of data points, usually
using the Euclidean distance or the Manhattan distance
(or, for fingerprints, e.g. the Tanimoto distance). In
hierarchical clustering the clusters are then determined
by either a bottom-up approach (agglomerative) or a
top-down approach (divisive).
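A small sketch of the bottom-up (agglomerative) approach: every point starts as its own cluster and the two closest clusters are merged until the closest pair is farther apart than a chosen cutoff. Single linkage (distance between the nearest members) is an assumption made to keep the example short, and the points are invented.

```python
import math

def manhattan(a, b):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))

def agglomerative(points, cutoff, distance=math.dist):
    """Single-linkage agglomerative clustering, stopped at a distance cutoff."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > cutoff:
            break                      # nothing close enough left to merge
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

# Invented molecules described by two properties.
points = [(0, 0), (0, 1), (5, 5), (6, 5), (20, 20)]

tight = agglomerative(points, cutoff=2)                      # Euclidean
loose = agglomerative(points, cutoff=10)                     # Euclidean
city = agglomerative(points, cutoff=2, distance=manhattan)   # Manhattan
print(tight, loose, city)
```

A small distance cutoff corresponds to a high similarity cutoff (many small clusters); a large one corresponds to a low similarity cutoff (few large clusters).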
[Figure: clusters obtained with a high similarity cutoff and with a low similarity cutoff.]
Plot of 2 PCs (principal components) of a dataset made up of many molecules and many calculated properties. It is possible to get a view of how diverse
molecules are within the property space and also, for new molecules, where they are located.
Includes: physical properties (such as charge, van der Waals volume, and molecular refractivity)
subdivided surface areas (atomic contributions to logP and molecular refractivity)
counts of elemental atom types and of bond types
Kier/ Hall connectivity and kappa shape indices
topological indices (Wiener index and Balaban index)
pharmacophore feature counts (number of acidic and basic groups and hydrogen bond donors and acceptors)
partial charge descriptors, surface area, volume, and shape descriptors (among them water accessible surface area, mass density,
and principal moments of inertia).
So this is basically describing a series of molecules in many ways, then compressing the plot into two dimensions. Good for
selecting a screening set of compounds for testing.
J. Chem. Inf. Model., 2005, 45 (3), pp 581–590
The concept of ‘Chemical space’ – non-hierarchical clustering similar molecules
•Simulates the way that neurons are interconnected
•‘learns’ by adjusting the connection weights between nodes taking an input set
of parameters and attempting to fit the output measurements
•New data can then be entered and using the ‘learned’ model -> predict
This network has a
2:4:4:1 topology
Like neurons, the connections
are made when a threshold value
is attained.
Use ‘back propagation of errors’ to
adjust the connections
http://en.wikipedia.org/wiki/Backpropagation
http://en.wikipedia.org/wiki/Artificial_neural_network
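To make the 2:4:4:1 topology concrete, here is a sketch of the forward pass only: two inputs, two hidden layers of four nodes, and one output. The weights are random rather than learned, the sigmoid activation is an assumption, and back propagation (the step that would adjust the weights) is not shown.

```python
import math
import random

def make_network(sizes, seed=0):
    """One weight list per node: a weight for each input plus a final bias term."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1.0, 1.0) for _ in range(n_in + 1)]
             for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]

def sigmoid(value):
    """Smooth threshold function squashing any value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))

def forward(network, inputs):
    """Propagate the inputs through every layer and return the output values."""
    activations = list(inputs)
    for layer in network:
        activations = [
            sigmoid(weights[-1] + sum(w * a for w, a in zip(weights, activations)))
            for weights in layer
        ]
    return activations

network = make_network([2, 4, 4, 1])
output = forward(network, [0.5, -1.2])   # e.g. two scaled descriptors
print([len(layer) for layer in network], output)
```

Training would repeatedly compare such outputs with the measured values and nudge every weight to reduce the error.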
A machine learning method – a Neural network
TLC Neural Network and plot of measured vs predicted results
There has been an enormous recent interest in ‘Deep’ Learning/Artificial Intelligence – citations (WoS, 1995-2018) – from about
2002 there was a rapid increase
[Figure: citation plots for six searches: Deep Learning Chemistry, Deep Learning QSAR, Deep Learning Drug, Artificial Intelligence Chemistry, Artificial Intelligence Drug and Artificial Intelligence QSAR. Counts shown, in slide order and not matched to the searches: 568 articles, 10,700 citations; 39 articles, 242 citations; 797 articles, 15,500 citations; 3,904 articles, 101,164 citations; 2,405 articles, 53,295 citations; 380 articles, 9,471 citations.]
Deep Learning/Artificial Intelligence – citations (WoS, 1900-2019)
A rapid increase (in 2002) as appreciation of the new approaches spread –
a huge number of applications across many diverse areas of chemistry
[Figure: citation plots for Artificial Intelligence Chemistry, Artificial Intelligence Drug, Artificial Intelligence Synthesis Chemistry and Artificial Intelligence bioinformatics, with 2002 marked. Counts shown, in slide order and not matched to the searches: 5,114 articles, 133,543 citations; 3,450 articles, 75,157 citations; 255 articles, 8,429 citations; 3,765 articles, 59,406 citations.]
30 new deep learning papers uploaded to arXiv per day over the previous month
What’s changed? – much ‘deeper’ networks can now be optimised
The renaissance in NN started with “ImageNet Classification with Deep
Convolutional Networks”, cited over 30,000 times and is widely regarded as one
of the most influential publications in the field.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton created a “large, deep
convolutional neural network” that was used to win the 2012 ILSVRC (ImageNet
Large-Scale Visual Recognition Challenge).
LeCun et al. Nature, Vol. 521, 28 May 2015
Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton, 2012, Advances in neural information processing systems, 1097-1105. (Conference Proceedings). Cited by 29524
The 9 Deep Learning Papers You Need To Know About (Understanding CNNs Part 3)
Several reviews in drug discovery and many applications of Deep Learning
From machine learning to deep learning: progress in
machine intelligence for rational drug discovery
Zhang et al. Drug Discovery Today Volume 22, Number
11 November 2017
‘The most commonly used networks are convolutional
neural networks (CNN), stacked autoencoders, deep belief
networks (DBN), and restricted Boltzmann machines’
Is Multitask Deep Learning Practical for Pharma?
Ramsundar et al. J. Chem. Inf. Model., 2017, 57 (8), pp 2068–2076
‘Our analysis and open-source implementation in
DeepChem provide an argument that multitask deep
networks are ready for widespread use in commercial drug
discovery.’
Deep Learning in Drug Discovery
Gawehn, Hiss and Schneider. Mol. Inf. 2016, 35, 3 – 14
‘With the development of
new deep learning concepts such as RBMs and CNNs, the
molecular modeler’s tool box has been equipped with potentially
game-changing methods.’
Statistical and machine learning approaches to predicting
protein-ligand interactions.
Colwell, L. J., Curr Opin Struc Biol 2018, 49, 123-128.
‘We explain the major technical challenges including the
problems of sampling noise and the challenge of using
benchmark datasets that are sufficiently unbiased’
Deep Learning for Drug-Induced Liver Injury
Xu et al. J. Chem. Inf. Model. 2015, 55, 2085−2093
Protein−Ligand Scoring with Convolutional Neural
Networks. Ragoza et al. J. Chem. Inf. Model. 2017, 57,
942−957
Deep Neural Nets as a Method for Quantitative
Structure−Activity
Relationships.
Junshui et al. J. Chem. Inf. Model. 2015, 55, 263−274
Low Data Drug Discovery with One-Shot Learning.
Altae-Tran et al. ACS Cent. Sci., 2017, 3 (4), pp 283–293
‘we demonstrate how one-shot learning can be used to significantly
lower the amounts of data required to make meaningful predictions
in drug discovery applications. We introduce a new architecture, the
iterative refinement long short-term memory, that, when combined
with graph convolutional neural networks, significantly improves
learning of meaningful distance metrics’
Quantum Mechanics and Deep Learning –teaching a DNN to do DFT calculations
“As the results clearly show, the ANI method is a potential game-changer for
molecular simulation. Even the current version, ANI-1, is more accurate vs. the
reference DFT level of theory in the provided test cases than DFTB, and PM6,
two of the most widely used semi-empirical QM methods. Besides being
accurate, a single point energy, and eventually forces, can be calculated as many
as six orders of magnitude faster than through DFT.”
Smith J.S. et al. ANI-1: an extensible neural network potential with DFT accuracy at force field
computational cost. Chem. Sci., 2017,8, 3192-3203.
Smith J.S. et al. ANI-1, A data set of 20 million calculated off-equilibrium conformations for organic
molecules. SCIENTIFIC DATA, 4:170193
Application of Deep Learning to 3D DESI mass spectrometry imaging in cancer
The first 3D mass spectral imaging of a tumour
[Figure: workflow panels labelled 3D tumour specimen; DESI-MSI (spectra over m/z 100 to 1000); dimensionality reduction; machine learning; discrimination tumour / non-tumour; only tumour spectra; deep learning neural networks; subtype identification; chemical components and related metabolic pathways; molecular picture of tumour interactions.]
The tumour microenvironment is 3-dimensional. More chances to capture the biological interactions.
Paolo Inglese, James S. McKenzie, Anna Mroz, James Kinross, Kirill Veselkov, Elaine Holmes, Zoltan Takats, Jeremy K. Nicholson and Robert C. Glen. Deep learning and 3D-DESI imaging reveal the hidden metabolic heterogeneity of cancer. Chem. Sci., 2017, 8, 3500