Molecular data and analysis · should be submitted to Dr Rafel Cabot Mesquida*, Chief Teaching...

Useful Information

• The web address for these lectures is

http://www-jmg.ch.cam.ac.uk/cil/partii/ (on front of

handout)

• Assessment Exercises are also at this address. They

will be marked out of ten. Your (hard copy) answers

should be submitted to Dr Rafel Cabot Mesquida*,

Chief Teaching Technician, ( [email protected] )

• Glen exercises due: Feb 15th 2020

• Lectures and handout available on Moodle

*a metal tray in the office G12 labelled “Part II Cheminformatics”


2 Finding molecules

In 1924 Dr. Markush was awarded a patent on pyrazolone dyes (USP No. 1,506,316) in

which he claimed generic chemical structures in addition to those actually synthesized.

Structures of this type were permitted after a ruling in 1925 by the US Patent Office and

became known as “Markush structures”. The “Markush Doctrine” of patent law greatly

increases flexibility in the preparation of claims for the definition of an invention.

Expanding our representation of chemical structures – Markush structures.

We can expand our search by introducing less exact labelling of attachments to the

core structure. Markush structures are essentially structures involving R-groups,

where a part of the molecule is defined by a series of alternatives – a more complex

example

Additionally, to introduce a more generic approach to structure matching, we might

define e.g. hydrogen-bond donors as:

R = OH,NH,SH,PH for example – care is of course needed e.g. a

COOH may be ionised and have no H !

This approach is extensively used in the patent literature to cover claims

of chemical structures with many variations.
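How quickly a generic structure multiplies into specific ones can be sketched with a short enumeration. This is only an illustration: the core template, the R-group lists and the function name `enumerate_markush` are hypothetical and are not taken from any patent or from the lecture.

```python
from itertools import product

# Hypothetical Markush structure: a benzene core with two variable positions.
# The core template and the R-group lists are illustrative assumptions only.
CORE = "c1cc({R1})ccc1{R2}"
R_GROUPS = {
    "R1": ["O", "N", "S"],          # OH, NH2, SH written as SMILES fragments
    "R2": ["C", "CC", "F", "Cl"],   # methyl, ethyl, fluoro, chloro
}

def enumerate_markush(core, r_groups):
    """Yield every specific structure covered by the generic structure."""
    names = sorted(r_groups)
    for choice in product(*(r_groups[n] for n in names)):
        yield core.format(**dict(zip(names, choice)))

structures = list(enumerate_markush(CORE, R_GROUPS))
print(len(structures))   # 3 x 4 = 12 specific structures
```

The count is the product of the number of alternatives at each position, which is why a claim with several R-groups, each with many alternatives, can cover an enormous number of compounds.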

Markush or Generic Structures. J. Chem. Inf. Comp. Sci. 1991, 31 (1)

A comparison of different approaches to Markush structure handling

J. Chem. Inf. Comp. Sci. 31(1), 1991, 64-68

An example of a Patent claim using Markush structures – how

many does this cover ?

Markush structure searching over the years, Edlyn S. Simmons World Patent Information, Volume

25, Issue 3, September 2003, Pages 195-202

Searching Markush compound structures is still an unsolved problem (so-called

‘nasties’), and has great implications for patents. MarPat is a Markush searchable

database of patents. https://www.cas.org/support/documentation/markush .


2. Finding molecules using Molecular Similarity

• You may perform a structural search of a database, and find no molecules. You still want to use a molecule like your query in some way, so, how do you find one that is ‘similar’ ?

• We may have e.g. a molecule that shows anti-cancer effects, but is toxic

• We could then look for other molecules that could have a similar anti-cancer effect, but a lower toxic effect

• ‘Similarity’ though, has a context and the right molecular description is needed for each specific case.

Bender A., Glen RC., Org. Biomol. Chem., 2004, 2, 3204 – 3218.

Molecular similarity: a key technique in molecular informatics.

The similarity concept is widely used in medicinal chemistry :

e.g. using the concept of Bio-Isosteres – the fundamental

concept in discovering new drugs

This idea (a bio-isostere) suggests that a chemical group can be

mimicked by a replacement group that, in many documented cases,

has appeared similar in its response to biological receptors (usually

proteins).

e.g. changing substituents while maintaining affinity in an anti-bacterial:

Bioorganic & Medicinal Chemistry Letters, Volume 17, Issue 14, 15 July 2007, Pages 4040-4043

https://www.sciencedirect.com/science/article/pii/S0960894X07005069

An example at Influenza neuraminidase – a critical enzyme the

virus uses for infection - inhibitors Oseltamivir and Zanamivir

use a bio-isosteric replacement of the natural substrate

<=Similar to=>

Neuraminidase cleaves

the glycosidic linkages

of neuraminic acids

Therefore in a search,

these additional ‘R’

groups can be included as

Markush structures

More examples used as bio-isosteres (pairs)

Sarah R. Langdon,Peter Ertl,and Nathan Brown. Bioisosteric

Replacement and Scaffold Hopping in Lead Generation and

Optimization . Mol. Inf. 2010, 29, 366 – 385

Robert P. Sheridan. The Most Common Chemical Replacements in

Drug-Like Compounds. J. Chem. Inf. Comput. Sci. 2002, 42, 103-108


But supposing you can’t easily ‘think up a similar substructure’?

There are various methods that have been devised to compute

similarity. These are generally:

•Based on the structure

•In one (strings), two (graphs) or three dimensions (coordinates)

•Based on molecular properties

•Experimental (e.g. size and shape) and computed properties (e.g.

Dipole Moment)

Let's look at how a similarity calculation can be defined using some of these methods.

Similarity. The Maximal Common Subgraph (the biggest common

fragment)

•Important search to determine which part of a structure is constant –

e.g. identifying reaction components – in this one, the atoms and bonds

which are constant comprise the MCS.

•Is a complex case of identifying a fragment, as we don’t know the size of the

MCS beforehand, so involves ‘backtracking’ to compute – and is therefore

time consuming. For example, this is also one of the problems of converting

a list of compounds to a Markush structure.

MCS algorithms can be applied to problems other than atom-atom mapping in reactions:

•structural similarity between molecules - the size of the MCS (relative to the size of the molecules) can be used as a measure of the similarity of molecules, e.g. search for molecules containing at least 80% of the query substructure.

Similarity by Molecular Fingerprints

• Fingerprints are a common approach to describing molecular similarity

• Fingerprints can be considered as a ‘bar code’ for the molecule

• Used because

– uses only the molecular graph

– does not require structural conformation or alignment

– fast searching method

• It is very fast to annotate a database of millions of molecules with fingerprints

• Often you are using fingerprints in searching databases, and don’t realise it !

Molecular Fingerprints

• Hash codes (already mentioned for searching)

• The simplest fingerprint registers the presence or

absence of fragments in a molecule. e.g.

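Bit position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... X

Bit value:    0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0

[Figure annotations: the two set bits are labelled ‘Contains P’ and ‘Contains NH2’; an unset bit is labelled ‘Doesn’t contain F’]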

• We could use this fingerprint, for example, to find only molecules containing phosphorus that have an amine in their structure

[Figure: the same 16-bit fingerprint alongside an example molecule containing a phosphate group and an NH2 group]
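A fragment-key screen of this kind can be sketched as follows. The bit assignments (P at position 6, NH2 at position 12, F at position 3), the toy database and its precomputed fragment sets are assumptions made for illustration; a real system would perceive the fragments from the structure.

```python
# Hypothetical 16-bit key: each bit position records one fragment.
# Bit assignments and the precomputed fragment sets are assumptions.
FRAGMENT_BITS = {"P": 6, "NH2": 12, "F": 3}
N_BITS = 16

def fingerprint(fragments):
    """Return a list of 16 bits (positions 1..16) for a set of fragment names."""
    bits = [0] * N_BITS
    for frag in fragments:
        if frag in FRAGMENT_BITS:
            bits[FRAGMENT_BITS[frag] - 1] = 1
    return bits

def contains_all(fp, required):
    """True if every required fragment's bit is set in the fingerprint."""
    return all(fp[FRAGMENT_BITS[f] - 1] == 1 for f in required)

# Toy database: molecule name -> fragments found in it (illustrative only).
database = {
    "nucleotide": {"P", "NH2", "OH"},
    "aniline": {"NH2"},
    "fluorophosphate": {"P", "F"},
}

hits = [name for name, frags in database.items()
        if contains_all(fingerprint(frags), ["P", "NH2"])]
print(hits)  # ['nucleotide']
```

Only bit comparisons are needed at search time, which is why this kind of screen is fast even for very large databases.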

• Fingerprints can be generated algorithmically; we don’t need to manually specify all the fragments

• The fingerprint method most often used is based on the CRC algorithm (cyclic redundancy check) – you could look this up on the web.

• Advantages/disadvantages

– easy to calculate

– very fast

– not specific to one area of chemistry

– difficult to understand

Automated fingerprint generation

Fingerprint Generation – Hashing

CRC (Cyclic Redundancy Check)

CH3CH2CH2CH2OH is broken into linear fragments: H-C-C-C, C-C-C-O, C-C-O-H, etc.

Each fragment is hashed to an integer I, where 0 < I < 10^9, and MOD(I, 151) gives the bit to set.

[Figure: fingerprint bit positions 1 to 16, continuing to 151, with position 12 marked]

e.g. 150-bit fingerprint for 4-atom fragments (we generally use 4-7 atom)
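The hashing step can be sketched with Python's standard `zlib.crc32`. This is a minimal illustration under stated assumptions: hydrogens are omitted from the graph, fragments are written as plain atom-symbol strings, and `zlib.crc32` returns a 32-bit integer rather than one below 10^9 as on the slide. The helper names are hypothetical.

```python
import zlib

# Heavy-atom graph of 1-butanol, CH3CH2CH2CH2OH (hydrogens omitted for brevity).
ATOMS = ["C", "C", "C", "C", "O"]
BONDS = [(0, 1), (1, 2), (2, 3), (3, 4)]
N_BITS = 151

def neighbours(bonds, n_atoms):
    """Build an adjacency list from the bond list."""
    adj = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj

def linear_paths(atoms, bonds, length):
    """Return the set of atom-symbol paths containing `length` atoms."""
    adj = neighbours(bonds, len(atoms))
    found = set()

    def walk(path):
        if len(path) == length:
            labels = [atoms[i] for i in path]
            # A path and its reverse are the same fragment; keep one form.
            found.add("-".join(min(labels, labels[::-1])))
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return found

def hashed_fingerprint(atoms, bonds, length=4, n_bits=N_BITS):
    """Hash every linear fragment and set the bit MOD(I, n_bits)."""
    bits = [0] * n_bits
    for frag in linear_paths(atoms, bonds, length):
        i = zlib.crc32(frag.encode("ascii"))   # integer hash I
        bits[i % n_bits] = 1
    return bits

print(sorted(linear_paths(ATOMS, BONDS, 4)))  # ['C-C-C-C', 'C-C-C-O']
fp = hashed_fingerprint(ATOMS, BONDS)
```

Two different fragments can hash to the same bit, which is one reason hashed fingerprints are described as difficult to understand: a set bit cannot always be traced back to a single fragment.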

Linear Fingerprint

Fingerprint Generation – circular fingerprints

[Figure: atom environments around a central atom at levels 0, 1, 2 and 3, labelled with atom types such as C.ar, C.2, C.3, O.2, O.3 and O.co2]

The fingerprint length is the number of ‘atom types’ times the number of levels: layer zero, 30 atom types; layer 1, 30 atom types; layer 2, 30 atom types.

These are very ‘sparse’ but work well – I’ll show some examples later

J. Chem. Inf. Model. 2007, 47(2), 583-590

Can also use a variant of the Morgan algorithm

Comparing the fingerprints of molecules - Tanimoto or Jaccard similarity

$$T = \frac{A \,\&\, B}{A + B - A \,\&\, B}$$

where A, B and A&B are the number of bits set in fingerprint A, in fingerprint B, and in A-AND-B.

In a hypothetical example, A, B and A&B are 24, 21 and 19, respectively, resulting in a Tanimoto coefficient of 19 / (24 + 21 - 19) = 0.73 (1.00 is perfect similarity).

Another way to put it: TC = BC / (B1 + B2 - BC)

Values above 0.85 are usually significant. This method is commonly used to search for pharmaceutically active molecules, reagents, reactions...
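The coefficient is simple to compute from two bit lists. The sketch below reproduces the hypothetical counts given above (24, 21 and 19); the function names are illustrative, and returning 0.0 for two empty fingerprints is a convention chosen here, not something stated in the lecture.

```python
def tanimoto_from_counts(a, b, a_and_b):
    """Tanimoto coefficient from bit counts: A&B / (A + B - A&B)."""
    denominator = a + b - a_and_b
    return a_and_b / denominator if denominator else 0.0

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length lists of 0/1 bits."""
    a = sum(fp_a)
    b = sum(fp_b)
    a_and_b = sum(x & y for x, y in zip(fp_a, fp_b))
    return tanimoto_from_counts(a, b, a_and_b)

print(round(tanimoto_from_counts(24, 21, 19), 2))  # 0.73
```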

Tanimoto similarity example - similarity to o-chloro-p-aminobenzoic acid

Structure                        Tanimoto coefficient
Benzoic acid                     0.52
m-chlorobenzoic acid             0.64
o-chlorobenzoic acid             0.80
o-chloro-p-aminobenzoic acid     1.0
p-aminobenzoic acid              0.70
p-chlorobenzoic acid             0.66

Similarity search in SciFinder Scholar

1 - query structure

2 - similarity search

3 - pick > 85% similarity

4 - six structures retrieved (from xxx Million). This probably uses linear fingerprints

‘Tanimoto’ similarity indices are one of a class of

methods for bit-string comparisons.

Some comparison indices additional to the Tanimoto coefficient (Nab/(Na+Nb-Nab)) are:

Hamming coefficient: $$H = \sum_{i=1}^{n} \mathrm{XOR}(a_i, b_i)$$

Cosine coefficient = Nab / Sqrt(Na x Nb)
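Both indices follow directly from the bit counts. A minimal sketch, with illustrative function names and example bit strings that are not from the lecture:

```python
import math

def hamming(fp_a, fp_b):
    """Number of positions where the two bit strings differ (sum of XOR)."""
    return sum(x ^ y for x, y in zip(fp_a, fp_b))

def cosine(fp_a, fp_b):
    """Nab / sqrt(Na x Nb) for two lists of 0/1 bits."""
    n_ab = sum(x & y for x, y in zip(fp_a, fp_b))
    n_a, n_b = sum(fp_a), sum(fp_b)
    return n_ab / math.sqrt(n_a * n_b) if n_a and n_b else 0.0

a = [1, 1, 0, 1, 0, 0]
b = [1, 0, 0, 1, 1, 0]
print(hamming(a, b))  # 2
```

Note that Hamming is a distance (0 means identical), whereas Tanimoto and cosine are similarities (1 means identical).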

A good introduction is in :

http://www.orgchm.bas.bg/~vmonev/SimSearch.pdf

J. Chem. Inf. Comput. Sci., 37, 18-22 (1977)

J. Chem. Inf. Comput. Sci., 43, 819-828 (2003)

J. Chem. Inf. Comput. Sci. 38, 983-996 (1998)

J. Chem. Inf. Model. Publication Date (Web): October 19,

2012, DOI: 10.1021/ci300261r. Just accepted.


Not just bits – properties of molecules

There are, of course, an enormous number of “molecular properties” that can be used to compare molecules – some of the more common ones are listed below:

1. Quantum mechanical descriptors based on the wavefunction (Carbo index). Quantitative Structure-Activity Relationships, Volume 16, Issue 1 (p 25-32)

2. Topological indices (Wiener, Kier and Hall). H. Wiener, "Structural determination of paraffin boiling points", J. Am. Chem. Soc., 1947, 69(1), 17-20. L. B. Kier, L. H. Hall, Molecular Connectivity in Structure-Activity Analysis, J. Wiley & Sons, New York, 1986

3. Computed molecular properties: volume, surface area, logP, pKa, ... (a vast number) – then cluster molecules according to a similarity measure.

Molecules → index → graph of similarity of pairs

Beck et al. Chemical

Physics 356 (2009) 121–

130

J. Chem. Inf. Model., 2007, 47 (2), pp 583–590

Metabolic Site/Product predictor (MetaPrint2D)

Symyx Metabolite database (~80000 transformations): Substrate + Products

• Calculate environment for each substrate atom

• Identify reaction centres

Query compound

• Calculate environment for each atom

• For each query atom, find all similar environments in database

• How often is environment found at a reaction centre? Calculate reaction occurrence ratios:

$$\text{occurrence ratio} = \frac{\text{Total number of similar reaction centres}}{\text{Total number of similar atoms in rest of database}}$$

• Calculate relative ratios for each atom in query compound, and display predictions

Using a naive Bayes probabilistic model
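The occurrence-ratio step can be sketched numerically. The atom labels and counts below are invented for illustration, and scaling each ratio by the largest one is only one plausible reading of "relative ratios"; this sketch does not reproduce the naive Bayes model or the actual MetaPrint2D implementation.

```python
def occurrence_ratio(n_reaction_centres, n_similar_atoms):
    """Similar reaction centres divided by similar atoms in the database."""
    return n_reaction_centres / n_similar_atoms if n_similar_atoms else 0.0

def relative_ratios(counts):
    """Scale each atom's occurrence ratio by the largest ratio in the molecule.

    `counts` maps an atom label to (similar reaction centres, similar atoms).
    Normalising by the maximum is an assumption made for this sketch.
    """
    raw = {atom: occurrence_ratio(rc, total)
           for atom, (rc, total) in counts.items()}
    top = max(raw.values())
    return {atom: (r / top if top else 0.0) for atom, r in raw.items()}

# Hypothetical counts for three atoms of a query compound.
counts = {"C1": (40, 200), "C2": (5, 500), "N3": (30, 100)}
rel = relative_ratios(counts)
likely_site = max(rel, key=rel.get)
print(likely_site)  # N3
```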

Database Version        2005.1   2006.1   2007.1   2008.1
Transformations          72599    78009    82671    87446
Single step              58757    62147    65732    69402
Product not reported       811      831      834      882
Newly added                        5410     4662     4775

Interestingly, the molecule dosed (which has excellent bioavailability) is a partial agonist, while the main metabolite is a full agonist. So, as the drug concentration lowers in blood, the remaining compound becomes more potent – probably a longer lasting effect

Paracetamol toxicity

(Tylenol)

Overdose results in

species NAPQI and

liver damage

Metaprint2D results

glutathione

http://upload.wikimedia.org/wikipedia/commons/1/12/Paracetamol_metabolism.svg

3 Finding molecules using three dimensional data

•‘Real’ molecules exist in a 3-dimensional world

•Their properties depend on their shape and the spatial disposition of functional groups.

•Simple example: dipole moment

[Figure: two molecules with dipole moments of 2.5 Debye and 0.5 Debye]

An example of the exquisite matching of a substrate to a protein binding site – here 3D shape and the complementary non-bonded interactions are extremely important

Cheminformatics Tools

for drug design

https://www.click2drug.org/directory_SmallMoleculesDatabase.html

Example of a site which has

various drug design tools

http://www.click2drug.org/

• Three dimensions in drug discovery

• A ‘pharmacophore’ is a 3-D representation of the required features

for binding to a biological receptor

[Figure: pharmacophore model with inter-feature distances of 5.2, 4.2-4.7, 6.7, 4.8 and 5.1-7.1; distances in Ångströms]

Here is the pharmacophore model used to design the migraine drug ‘Zomig’, deduced from comparison of molecules that interact with the receptor binding site

Similarity Searching based on pharmacophores - What do we need?

• A database of 3-dimensional structures (the ZINC database is 200 million)

– Atom Coordinates

– Atom types

– Ring, fragment, property, H-bonding etc. definitions

– An excellent example is the Cambridge Structural Database of X-ray structures (next door)

• A definition of the query

– Fragments of molecules and their properties

– Constraints

• Distances between functional groups

• Angles between these

– The concept of Dummy atoms is useful

– e.g. ring centres, H-bonding points, planes
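A distance-constraint query of this kind can be sketched in a few lines. The feature names, the pairing of distance ranges to feature pairs, and the coordinates are all assumptions for illustration; only the numerical ranges are borrowed from the pharmacophore figure, and a real search would also handle conformational flexibility.

```python
import math

# Hypothetical query: (feature, feature, minimum distance, maximum distance).
# Which distance belongs to which pair of features is an assumption here.
QUERY = [
    ("amine", "ring_centre", 5.1, 7.1),
    ("ring_centre", "acceptor", 4.2, 4.7),
]

def distance(p, q):
    """Euclidean distance between two 3D points, in Ångströms."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def matches(features, query):
    """True if every distance constraint in the query is satisfied.

    `features` maps a feature name (which may be a dummy atom such as a
    ring centre) to its 3D coordinates.
    """
    return all(lo <= distance(features[a], features[b]) <= hi
               for a, b, lo, hi in query)

# Two hypothetical database entries with precomputed feature points.
candidate = {"amine": (0.0, 0.0, 0.0),
             "ring_centre": (6.0, 0.0, 0.0),
             "acceptor": (6.0, 4.5, 0.0)}
non_match = {"amine": (0.0, 0.0, 0.0),
             "ring_centre": (6.0, 0.0, 0.0),
             "acceptor": (6.0, 3.0, 0.0)}

print(matches(candidate, QUERY), matches(non_match, QUERY))  # True False
```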

Example search (“Virtual Screening”) of our current 4.5 Million 3D database

[Figure: pharmacophore query with inter-feature distances of 5.2, 4.2-4.7, 6.7, 4.8 and 5.1-7.1, and features labelled hydrophobic center, positive N, H bond donor/acceptor and H bond acceptor]

A protonated amine (NH3+), a ring centre (defined by 6 atoms), a hydrogen-bond acceptor, a hydrogen bond donor-acceptor

- brings up the point that ‘properties’ can be specified at atom points

-- Markush atoms

When x-ray structures are available – molecules can

be ‘docked’ into the binding site – pharmacophores

can be generated and used for searching as before

• A docking program will take a

randomised ligand conformation from a

ligand/protein x-ray structure and place

the molecule back in the correct

position.

• Many thousands of molecules can be

‘docked’ / hour.

• Molecules can be selected based on their

‘fit’ to the protein, and subsequently

tested for binding affinity

Docking example using Gold: the anti-cancer drug Gleevec, a specific inhibitor of Bcr-Abl tyrosine kinase, the constitutive abnormal kinase in chronic myeloid leukemia. Red lines define a pharmacophore.

Docked Gleevec: the x-ray structure (1T46) overlaid with the predicted position of Gleevec. The agreement is almost perfect, which implies we could use the same docking approach to search for new molecules that work in the same way.

The pharmacophore can be extracted and used to search for additional molecules from our database. These are then tested by ‘docking’ and, if they fit, can be tested for anti-cancer properties in this case.

*GOLD. Jones G, Willett P, Glen R C, Molecular Recognition of Receptor Sites using a Genetic Algorithm with a

Description of Desolvation, J.Mol. Biol.245, 43-53 (1995).

Jones G, Willett P, Glen R C, Leach A R, Taylor R. Development and Validation of a Genetic Algorithm for Flexible

Docking. J. Mol. Biol. 267, 727-748 (1997).

“Virtual screening” using similarity – an important way to find starting points for

designing new drugs

Suppose we have no information on a biological target. Also, like

many pharmaceutical companies, we have 1 Million real molecules

in our compound store. But, due to cost, we can only afford to

screen 10,000. How can we pick the best representative set to

screen?

There are essentially two ways to do this – similarity and diversity.


http://www.ncbi.nlm.nih.gov/pubmed/15141118

Virtual screening using similarity

On the bottom left, we have used two molecules displaying

biological activity (A and B) to find those most similar in the

database for testing, to maximise our chances of finding new hits.

On the bottom right, we have no molecules to use, so we select the

best diverse set, maximising our chances of a hit whilst only testing

a representative subset of the compounds library.

[Figure: two plots of Property A against Property B. Left: selection based on similarity to A and B. Right: a diverse set.]
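The two selection strategies can be sketched on a toy two-property space. The MaxMin picking used for the diverse set is one common heuristic chosen here for illustration, not necessarily the method used in the lecture; the library points, the starting compound and the function names are assumptions.

```python
import math

def dist(p, q):
    """Euclidean distance in the (Property A, Property B) plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def select_similar(library, actives, n):
    """Pick the n compounds closest to any known active."""
    return sorted(library, key=lambda p: min(dist(p, a) for a in actives))[:n]

def select_diverse(library, n):
    """MaxMin picking: repeatedly add the compound farthest from those chosen.

    Starting from the first compound in the library is an arbitrary choice.
    """
    chosen = [library[0]]
    while len(chosen) < n:
        best = max((p for p in library if p not in chosen),
                   key=lambda p: min(dist(p, c) for c in chosen))
        chosen.append(best)
    return chosen

# Hypothetical compound library as (Property A, Property B) points.
library = [(0, 0), (0.1, 0.1), (1, 1), (5, 5), (5.1, 5), (9, 0), (0, 9)]
actives = [(0, 0), (5, 5)]          # known actives A and B

similar_set = select_similar(library, actives, 4)
diverse_set = select_diverse(library, 3)
print(diverse_set)  # [(0, 0), (9, 0), (0, 9)]
```

The similar set clusters tightly around A and B, while the diverse set spreads across the property space; in practice the same idea is applied with fingerprints and Tanimoto distances rather than two properties.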

Chemical Reactions can also be represented in the computer

An example of a reaction in a modified SMILES, called SMIRKS.

‘Acetic acid and (.) ethanol > in the presence of HCl and ethanol > make ethyl acetate’

http://www.daylight.com/dayhtml_tutorials/languages/smirks/index.html

Virtual screening using a virtual library

The molecules we screen in the computer don’t have to be

physically available. We can generate vast libraries of

molecules we could synthesise, and search these. Promising

molecules could be synthesised. A billion examples is not

unreasonable. An example of potential HIV Protease inhibitors

There are so many, we have to be very selective and computer-aided

design can help

Example reaction of two acids with two alcohols to make four products (the acids have ‘R’ groups). New characters and atom mapping are used. Each mapped atom carries its own map number, so the two R-groups are labelled 1 and 7.

[*:1][C:2](=[O:3])[O:4][H].[*:7][C:5][O:6][H]>>[*:1][C:2](=[O:3])[O:6][C:5][*:7].[H][O:4][H]
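Enumerating such a virtual library can be sketched with plain string handling. This is a simplification that assembles product SMILES from R-group fragments rather than applying the SMIRKS with a cheminformatics toolkit; the choice of acids and alcohols and the function name `ester` are illustrative assumptions.

```python
from itertools import product

# R-groups written as SMILES fragments (illustrative choices).
# Acid = R1-C(=O)OH, alcohol = R2-CH2-OH, as in the reaction above.
ACID_R = {"acetic acid": "C", "benzoic acid": "c1ccccc1"}
ALCOHOL_R = {"ethanol": "C", "benzyl alcohol": "c1ccccc1"}

def ester(r1, r2):
    """Ester SMILES R1-C(=O)-O-CH2-R2; the acid OH leaves as water."""
    return f"{r1}C(=O)OC{r2}"

library = {
    (acid, alcohol): ester(ACID_R[acid], ALCOHOL_R[alcohol])
    for acid, alcohol in product(ACID_R, ALCOHOL_R)
}

for reagents, smiles in library.items():
    print(reagents, smiles)
```

Two acids and two alcohols give four esters; with thousands of reagents of each type the same loop produces millions of virtual products, which is how libraries of a billion molecules arise.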

You’ve found some interesting molecules –

but how can we predict their properties

quantitatively ?

• Particularly in drug discovery (but also in

materials science for example) methods

have been developed to relate the structure

and properties of molecules to their function

• These are called Quantitative Structure

Property (or Activity) Relationships -

QSPR, QSAR

• The handout contains details of some

approaches that may be of interest.

Combining molecular structure with calculation of properties to deduce predictive models is usually termed:

• Quantitative Structure Property Relationships (QSPR)

• Quantitative Structure Activity Relationships (QSAR)

• We calculate descriptors to combine with statistical and machine-learning methods to create models to predict properties.

Picture the data – often the best approach

J. Med. Chem., 44 (5), 2001, pp681 -693,

[Figure: plot of hydrophobicity against size; compounds in the ‘exclusion zone’ are not bio-available]

Here is a real example: two descriptors, CMR (the size of the molecule) and logD (the distribution coefficient between octanol and water), are calculated, plotted and annotated with the molecules' ability to be absorbed in the intestine. Molecules in the white areas are absorbed; molecules in the shaded areas are not. As drugs, the shaded molecules would not be orally absorbed and would therefore be useless in pills.

This approach to bioavailability has had a fundamental effect on new drug discovery,

see Lipinski’s Rule of Five in the notes.

http://pubs.acs.org/doi/abs/10.1021/jm000956k

Simplest approaches

• 1. Read across. If molecule A has a measured property, and molecule B has not had it measured, if molecules A and B are very similar, perhaps they have similar properties and we can predict the property of B. This is a common approach in predicting e.g. the toxicity of molecules.

• We use information for one chemical, called a “source chemical”, to make a prediction of the same property or toxicological endpoint for another chemical, called a “target chemical”, termed “read across”.
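Read-across can be sketched as a nearest-neighbour lookup. The fingerprints (sets of on-bit positions), the property labels and the function names are invented for illustration; the 0.85 threshold reuses the Tanimoto value quoted earlier as "usually significant" and is an assumption in this context.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient for fingerprints stored as sets of on-bit positions."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0

def read_across(target_fp, sources, threshold=0.85):
    """Return (property, similarity) from the most similar source chemical.

    Returns None when no source chemical is similar enough to justify
    a prediction for the target chemical.
    """
    best = max(sources, key=lambda s: tanimoto(target_fp, s["fp"]))
    similarity = tanimoto(target_fp, best["fp"])
    return (best["value"], similarity) if similarity >= threshold else None

# Hypothetical source chemicals with a measured toxicity class.
sources = [
    {"name": "source A", "fp": {1, 2, 3, 4, 5, 6, 7, 8}, "value": "toxic"},
    {"name": "source B", "fp": {1, 2, 9}, "value": "non-toxic"},
]

prediction = read_across({1, 2, 3, 4, 5, 6, 7}, sources)
print(prediction)  # ('toxic', 0.875)
```

Refusing to predict when nothing is similar enough matters as much as the prediction itself: read-across is only defensible inside the neighbourhood of measured compounds.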

Example of using chemical and biological similarity in read-across prediction of toxicity

Low et al. Chem. Res. Toxicol., 2013, 26 (8), pp 1199–1208

http://pubs.acs.org/doi/abs/10.1021/tx200148a

Building models (QSAR/QSPR)

Molecular database → Calculate/measure molecular properties → Analysis → Prediction

This is the most common approach for molecular analysis and prediction

Supervised methods

The most common method is linear regression. Simple linear regression fits a straight line through the set of n points in such a way that the sum of the squared residuals of the model (that is, the vertical distances between the points of the data set and the fitted line) is as small as possible. The equation we obtain can be used to predict a new property based on the descriptors calculated or measured for the new molecule.

$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad Q(\alpha, \beta) = \sum_{i=1}^{n} \varepsilon_i^{2} = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^{2}$$

Q is the function we want to minimise.

Alpha is the intercept: the constant correction that would shift all the points so the line goes through the origin.

Beta is the coefficient to multiply our descriptor (x) by.

Epsilon is a residual (which we wish to minimise).

The method is explained in more detail in the notes.

http://en.wikipedia.org/wiki/Simple_linear_regression
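The least-squares line has a closed-form solution, sketched below for one descriptor. The data points and function names are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates of alpha (intercept) and beta (slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta = s_xy / s_xx
    alpha = mean_y - beta * mean_x
    return alpha, beta

def sum_squared_residuals(xs, ys, alpha, beta):
    """Q: the sum of squared vertical distances from the fitted line."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical descriptor values (x) and measured property values (y).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

alpha, beta = fit_line(xs, ys)
predicted = alpha + beta * 5.0      # prediction for a new molecule with x = 5
print(alpha, beta, predicted)       # 1.0 2.0 11.0
```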

Machine Learning:

Predicting TLC (Thin Layer Chromatography)

[Figure: TLC plate (silica on glass) showing the start point, the solvent front at distance X and the position the compound moved to at distance Y; Rf = y/x]

•Compounds move up the plate depending on the solvent, their properties etc.

•We can predict the Rf values (retention factors) using details of the molecules and the solvent.

•Separate mixtures, identify compounds etc.

[Figure: benzoic acid with COOH at ring position 1 and ring positions numbered 1 to 6]

No.   Substituents
1     4-F
2     3-F
3     2-F
4     CF3
5     3-CF3
6     4-CF3
7     2-F,4-CF3
8     4-F,3-CF3
9     4-F,2-CF3
10    4-CH3
11    2-CH3
12    3-CH3
13    4-NH2
14    H
15    2-OH
16    3-OH,6-OH
17    2-OH,6-OH
18    2-OH,3-OH
19    2-OH,5-OH
20    2-OH,4-OH
21    3-OH,4-OH
22    2-COOH

Data

• 22 substituted benzoic acids

• 2 solvent systems

• 6 mixtures:

1 Acetonitrile - Water 30 : 70

2 Acetonitrile - Water 40 : 60

3 Acetonitrile - Water 50 : 50

4 MeOH - Water 40 : 60

• 22 compounds x 6 mixtures = 132 experiments

Data

Measurements

No.       compound number
Cpd       name of compound
Solvent   water and acetonitrile/methanol
Rf        retention factor
Rm        log((1 - Rf)/Rf)
S_Area    surface area of molecule in Å²
clogp     calculated partition coefficient octanol/water
volume    molecular volume in Å³
MPolar    polarizability of the molecule cm-25
dipole    dipole moment of the molecule (Debye)
dipsol    dipole moment of the solvent (%solv1+%sol2)*100 Debye
PolSol    polarizability of the solvent (%pol1+%pol2)/100 Debye
Ovality   how removed from spherical

The water dipole is also given, 2.75 Debye

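$$\text{Ovality} = \frac{S}{4\pi \left( \frac{3V}{4\pi} \right)^{2/3}}$$

where S is the surface area and V is the volume of the molecule.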
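Two of the tabulated quantities can be computed directly. The sketch assumes a base-10 logarithm for Rm and takes ovality as the molecular surface area divided by the surface area of a sphere with the same volume; both are assumptions about the definitions used in the spreadsheet.

```python
import math

def rm_value(rf):
    """Rm = log10((1 - Rf) / Rf), for a retention factor 0 < Rf < 1."""
    return math.log10((1.0 - rf) / rf)

def ovality(surface_area, volume):
    """Surface area relative to a sphere of equal volume (1.0 for a sphere)."""
    sphere_area = 4.0 * math.pi * (3.0 * volume / (4.0 * math.pi)) ** (2.0 / 3.0)
    return surface_area / sphere_area

# A compound that runs halfway up the plate has Rm = 0.
print(rm_value(0.5))

# Check against a perfect sphere of radius 2 Å: ovality should be 1.
radius = 2.0
sphere_s = 4.0 * math.pi * radius ** 2
sphere_v = 4.0 / 3.0 * math.pi * radius ** 3
print(round(ovality(sphere_s, sphere_v), 6))
```

Any non-spherical shape has more surface than a sphere of the same volume, so ovality is always at least 1.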

• Molecular properties were calculated for each of the molecules

and tabulated in a spreadsheet (tlcdata.xls) e.g.

Multiple linear regression – using the ‘best’ 7 parameters

LogK' = -0.401 QON + 0.396 CLOGP + 0.109 DIP - 0.056 DIPMOM - 3.162 ESDL1 + 0.231 CMR + 0.110 POLSOL - 5.326

r = 0.954, F(7,110) = 155.59

Variance Explained = 91.0 %

[Figure: plot against measured values; • test set, o training set]
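Applying the fitted equation to a new compound is a weighted sum of its descriptors. The coefficients below are copied from the equation above; the descriptor values in the example are invented, and the function name is illustrative.

```python
# Coefficients of the seven-parameter regression equation.
COEFFICIENTS = {
    "QON": -0.401,
    "CLOGP": 0.396,
    "DIP": 0.109,
    "DIPMOM": -0.056,
    "ESDL1": -3.162,
    "CMR": 0.231,
    "POLSOL": 0.110,
}
INTERCEPT = -5.326

def predict_log_k(descriptors):
    """Evaluate the regression equation for one compound/solvent pair."""
    return INTERCEPT + sum(COEFFICIENTS[name] * descriptors[name]
                           for name in COEFFICIENTS)

# Hypothetical descriptor values for a single experiment.
example = {"QON": 0.0, "CLOGP": 1.0, "DIP": 0.0, "DIPMOM": 0.0,
           "ESDL1": 0.0, "CMR": 0.0, "POLSOL": 0.0}
print(round(predict_log_k(example), 3))   # -4.93

# The variance explained is the square of the correlation coefficient r.
print(round(0.954 ** 2, 2))               # 0.91
```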

Unsupervised methods (typically classification models)

In the previous examples, data was fitted to a

model, usually predicting a numeric value of the

desired property. However, it is also possible to

cluster the data, and hence make predictions

about a particular class a new molecule will fall

into e.g. is it toxic or non-toxic. This is “guilt by

association”.

The most common approach to do this is cluster

analysis, which includes a diverse set of

approaches.

Hierarchical clustering and k-means clustering are common approaches. Clustering involves finding the distance between all pairs of data points, usually the Euclidean distance or the Manhattan distance (or, for fingerprints, e.g. the Tanimoto distance). The clusters are then determined by either a bottom-up approach (agglomerative) or by a divisive approach (top-down).
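The bottom-up approach can be sketched with single-linkage merging and a distance cutoff. Single linkage is only one of several linkage rules and is chosen here for brevity; the points, cutoffs and function names are illustrative assumptions.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(points, cutoff):
    """Agglomerative clustering: merge the two closest clusters until the
    smallest inter-cluster distance exceeds `cutoff`.

    Clusters are returned as lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def gap(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        d, a, b = min((gap(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        if d > cutoff:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical molecules described by two properties each.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(single_linkage(points, cutoff=2))   # [[0, 1], [2, 3], [4]]
print(single_linkage(points, cutoff=7))   # [[0, 1, 2, 3], [4]]
```

A small distance cutoff (high similarity) gives many small clusters; a large cutoff (low similarity) merges them into fewer, broader ones.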

[Figure: hierarchical clustering diagram showing a high similarity cutoff and a low similarity cutoff]

http://en.wikipedia.org/wiki/File:Hierarchical_clustering_simple_diagram.svg

http://en.wikipedia.org/wiki/Hierarchical_clustering

Plot of two principal components (PCs) of a dataset made up of many molecules and many calculated properties. It is possible to get a view of how diverse molecules are within the property space, and also, for new molecules, where they are located.

Includes: physical properties (such as charge, van der Waals volume, and molecular refractivity)

subdivided surface areas (atomic contributions to logP and molecular refractivity)

counts of elemental atom types and of bond types

Kier/ Hall connectivity and kappa shape indices

topological indices (Wiener index and Balaban index)

pharmacophore feature counts (number of acidic and basic groups and hydrogen bond donors and acceptors)

partial charge descriptors, surface area, volume, and shape descriptors (among them water accessible surface area, mass density,

and principal moments of inertia).

So this is basically describing a series of molecules in many ways, then compressing the plot into two dimensions. Good for

selecting a screening set of compounds for testing.

J. Chem. Inf. Model., 2005, 45 (3), pp 581–590

The concept of ‘Chemical space’ – non-hierarchical clustering of similar molecules

A machine learning method – a Neural network

•Simulates the way that neurons are interconnected

•‘Learns’ by adjusting the connection weights between nodes, taking an input set of parameters and attempting to fit the output measurements

•New data can then be entered and, using the ‘learned’ model, predicted

This network has a 2:4:4:1 topology

Like neurons, the connections are made when a threshold value is attained.

Use ‘back propagation of errors’ to adjust the connections

http://en.wikipedia.org/wiki/Backpropagation

http://en.wikipedia.org/wiki/Artificial_neural_network
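A network with the 2:4:4:1 topology and back propagation of errors can be sketched in plain Python. This is a teaching sketch under stated assumptions: a smooth sigmoid stands in for the hard threshold so that gradients exist, the training data, learning rate and number of epochs are invented, and the function names are illustrative.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_network(sizes, rng):
    """One list of neurons per layer; each neuron holds its weights plus a bias."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
             for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]

def forward(net, x):
    """Return the activations of every layer, the input layer included."""
    acts = [list(x)]
    for layer in net:
        prev = acts[-1] + [1.0]                  # constant input for the bias
        acts.append([sigmoid(sum(w * a for w, a in zip(neuron, prev)))
                     for neuron in layer])
    return acts

def train_step(net, x, target, rate):
    """One back-propagation update for a single example (squared error)."""
    acts = forward(net, x)
    # Error terms at the output layer: (out - target) * out * (1 - out).
    deltas = [(o - t) * o * (1.0 - o) for o, t in zip(acts[-1], target)]
    for li in range(len(net) - 1, -1, -1):
        below = []
        if li > 0:
            # Propagate the error to the layer below before changing weights.
            below = [a * (1.0 - a) *
                     sum(d * neuron[i] for d, neuron in zip(deltas, net[li]))
                     for i, a in enumerate(acts[li])]
        prev = acts[li] + [1.0]
        for neuron, d in zip(net[li], deltas):
            for i, a in enumerate(prev):
                neuron[i] -= rate * d * a
        deltas = below

def loss(net, data):
    """Sum of squared errors over the data set."""
    return sum((forward(net, x)[-1][0] - t[0]) ** 2 for x, t in data)

# Hypothetical training set: two descriptors in, one scaled property out.
DATA = [([0.0, 0.0], [0.1]), ([0.0, 1.0], [0.5]),
        ([1.0, 0.0], [0.5]), ([1.0, 1.0], [0.9])]

rng = random.Random(0)
net = init_network([2, 4, 4, 1], rng)
before = loss(net, DATA)
for _ in range(3000):
    for x, t in DATA:
        train_step(net, x, t, rate=0.5)
after = loss(net, DATA)
print(before, after)   # the error should fall as the weights are adjusted
```

Once trained, `forward(net, x)[-1][0]` gives the prediction for a new descriptor pair `x`, which is the "enter new data and predict" step described above.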

[Figure: TLC neural network and plot of measured vs predicted results]

There has been enormous recent interest in ‘Deep’ Learning/Artificial Intelligence, with a rapid increase in citations from about 2002 as appreciation of the new approaches spread – a huge number of applications across many diverse areas of chemistry.

[Figure: citation counts (WoS, 1995-2018 and 1900-2019) for the search terms Deep Learning Chemistry, Deep Learning QSAR, Deep Learning Drug, Artificial Intelligence Chemistry, Artificial Intelligence QSAR, Artificial Intelligence Drug, Artificial Intelligence Synthesis Chemistry and Artificial Intelligence bioinformatics. Totals shown on the panels: 568 articles, 10,700 citations; 39 articles, 242 citations; 3,904 articles, 101,164 citations; 255 articles, 8,429 citations; 3,765 articles, 59,406 citations.]

30 new deep learning papers uploaded to arXiv per day over the previous month

What’s changed? – much ‘deeper’ networks can now be optimised

The renaissance in NN started with “ImageNet Classification with Deep Convolutional Neural Networks”, which has been cited over 30,000 times and is widely regarded as one of the most influential publications in the field.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton created a “large, deep convolutional neural network” that was used to win the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge).

LeCun et al. Nature, vol. 521, 28 May 2015

Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton, 2012, Advances in neural information processing systems, 1097-1105. (Conference Proceedings). Cited by 29524

The 9 Deep Learning Papers You Need To Know About (Understanding CNNs Part 3)

https://adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html

Several reviews in drug discovery and many applications of Deep Learning

From machine learning to deep learning: progress in

machine intelligence for rational drug discovery

Zhang et al. Drug Discovery Today Volume 22, Number

11 November 2017

‘The most commonly used networks are convolutional

neural networks (CNN), stacked autoencoders, deep belief

networks (DBN), and restricted Boltzmann machines’

Is Multitask Deep Learning Practical for Pharma?

Ramsundar et al. J. Chem. Inf. Model., 2017, 57 (8), pp 2068–2076

‘Our analysis and open-source implementation in DeepChem provide an argument that multitask deep networks are ready for widespread use in commercial drug discovery.’

Deep Learning in Drug Discovery

Gawehn, Hiss and Schneider. Mol. Inf. 2016, 35, 3 – 14

‘With the development of

new deep learning concepts such as RBMs and CNNs, the

molecular modeler’s tool box has been equipped with potentially

game-changing methods.’

Statistical and machine learning approaches to predicting

protein-ligand interactions.

Colwell, L. J., Curr Opin Struc Biol 2018, 49, 123-128.

‘We explain the major technical challenges including the

problems of sampling noise and the challenge of using

benchmark datasets that are sufficiently unbiased

Deep Learning for Drug-Induced Liver Injury

Xu et al. J. Chem. Inf. Model. 2015, 55, 2085−2093

Protein−Ligand Scoring with Convolutional Neural

Networks. Ragoza et al. J. Chem. Inf. Model. 2017, 57,

942−957

Deep Neural Nets as a Method for Quantitative Structure−Activity Relationships.

Ma et al. J. Chem. Inf. Model. 2015, 55, 263−274

Low Data Drug Discovery with One-Shot Learning.

Altae-Tran et al. ACS Cent. Sci., 2017, 3 (4), pp 283–293

‘we demonstrate how one-shot learning can be used to significantly

lower the amounts of data required to make meaningful predictions

in drug discovery applications. We introduce a new architecture, the

iterative refinement long short-term memory, that, when combined

with graph convolutional neural networks, significantly improves

learning of meaningful distance metrics’


Quantum Mechanics and Deep Learning –teaching a DNN to do DFT calculations

“As the results clearly show, the ANI method is a potential game-changer for

molecular simulation. Even the current version, ANI-1, is more accurate vs. the

reference DFT level of theory in the provided test cases than DFTB, and PM6,

two of the most widely used semi-empirical QM methods. Besides being

accurate, a single point energy, and eventually forces, can be calculated as many

as six orders of magnitude faster than through DFT.”

Smith J.S. et al. ANI-1: an extensible neural network potential with DFT accuracy at force field

computational cost. Chem. Sci., 2017,8, 3192-3203.

Smith J.S. et al. ANI-1, A data set of 20 million calculated off-equilibrium conformations for organic

molecules. SCIENTIFIC DATA, 4:170193

[Figure: workflow for DESI-MSI of a 3D tumour specimen (mass spectra over m/z 100 to 1000): discrimination of tumour / non-tumour, dimensionality reduction on only the tumour spectra, deep learning neural networks, subtype identification, chemical components and related metabolic pathways, and a molecular picture of tumour interactions]

The tumour microenvironment is 3-dimensional. More chances to capture the biological interactions.

Application of Deep Learning to 3D DESI mass

spectrometry imaging in cancer


The first 3D mass spectral imaging of a tumour

Paolo Inglese, James S. McKenzie, Anna Mroz, James Kinross, Kirill Veselkov, Elaine Holmes, Zoltan Takats, Jeremy K. Nicholson and Robert C. Glen. Deep learning and 3D-DESI imaging reveal the hidden metabolic heterogeneity of cancer. Chem. Sci., 2017, 8, 3500
