Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and...

Useful Information

• The web address for these lectures is http://www-jmg.ch.cam.ac.uk/cil/partii/ (on front of handout).
• Assessment is by two online exercises (Glen and Goodman) at this address. Each will be marked out of ten. Your (paper) answers should be submitted to Mykola.
• Glen exercises due: Feb 10th 2018.
• Lectures and handout available on Moodle.



Molecular Informatics

1. Molecules and computers

Sources: textbooks/online resources you may wish to consider if you want to take the subject further

• An Introduction to Chemoinformatics, Andrew R. Leach and Valerie J. Gillet, Springer, 2007.
• Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH, 2003.
• Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH, 2003 (new edition out 2018).
• Chemoinformatics: An Approach to Virtual Screening, Alexandre Varnek and Alex Tropsha, RSC Publishing.
• Chemoinformatics: Theory, Practice, and Products, Barry A. Bunin, Springer, Dordrecht, 2007.
• Drug Metabolism Prediction, ed. Johannes Kirchmair (series eds. R. Mannhold, H. Kubinyi, G. Folkers), Methods and Principles in Medicinal Chemistry, Vol. 63, Wiley-VCH.

Journals of Molecular/Cheminformatics you may wish to follow up on

• Journal of Chemical Information and Modeling
• Journal of Chemical Theory and Computation
• Journal of Cheminformatics
• Journal of Computer-Aided Molecular Design
• Journal of Molecular Graphics & Modelling
• Journal of Computational Chemistry
• Journal of Medicinal Chemistry
• Reviews in Computational Chemistry
• Drug Discovery Today
• BMC Bioinformatics
• Nature Reviews Drug Discovery
• Expert Opinion on Drug Discovery
• WIREs Computational Molecular Science

Molecular Informatics

Includes all aspects of the study of molecules on computers. Also includes Chem(o)informatics.

This includes the representation of molecules, databases, display, simulation, prediction of their properties, and the discovery and design of new molecules and materials.

Molecular informatics is closely related to bioinformatics, computational chemistry, molecular modelling, simulation, machine learning and statistics, as well as online publications, but the area has principally been driven by investment in new methods for drug discovery, hence the concentration on small organic molecules.

Cambridge HPC

Places to find Molecular Informatics apps

• http://www.macinchem.org/mobilescience
• Molecules – e.g. RSC ChemSpider
• Publishers (e.g. ACS/RSC mobile)
• Calculations (e.g. Yield101 for reactions)
• Visualisation (e.g. PyMOL for proteins)

J. Chem. Educ. 2013, 90 (3), pp 320–325, DOI: 10.1021/ed300329e

Cheminformatics 101

How do we store molecules on the computer?

There are estimated to be 10^60 possible small molecules that could be made. How do we find the best molecule for the problem we are addressing? Let’s take a look “under the bonnet” at the way molecules are actually manipulated on the computer. You will be familiar with:

1. Trivial name, e.g. morphine
2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol

However, these names do not convey the structure of molecules in a way the computer can readily understand. We need to convert these into “machine-readable formats”, which allow ease of searching based on the complexities of molecular structure. But what is a molecule?

Bear this in mind: molecules are complicated. When we look at this scene we add a huge amount of information from our senses and knowledge – but it nearly all gets lost in computational representation.

Representing chemistry needs to be engineered to represent materials and processes. As you will see, we are moving in that direction, with more complete representations of molecules and materials.

[Figure: a real-life mixture. Not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol.]

What is a molecule?

• Is it a series of connected points?
• A wave function?
• The sum of its properties?

In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.

Storing molecules: different methods for different purposes

Methods for storing molecules can conveniently be broken down into:

• 1-Dimensional (simple, very compact and fast to access)
• 2-Dimensional (contains the chemical diagram)
• 3-Dimensional approaches (the shapes of molecules)

1D: Line notations (a string of characters from a keyboard)
2D: Molecular diagrams (graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the notes).

1-Dimensional: line notations – a string representation of molecules. Here are three examples of different line notations; SMILES is the most useful and widely used. These are ‘strings’, all of the same molecule:

• Line notations
  – WLN
    • L66J BMR& DSWQ IN1&1
  – ROSDAL
    • 1=-5-=10=510-11-11N-12-=17=123-18S-19O18=20O18=21O8-22N-2322-24
  – SMILES
    • c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
  – IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid

Notice – all these notations use just the characters on a standard typewriter keyboard.

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a ‘molecule’ and is what is actually stored in the computer – does it make sense to you?

It’s artemisinin, which has anti-malarial properties (Nobel Prize).

This language is called SMILES.

A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure, and a SMILES will be available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
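To make the idea of a line notation concrete, here is a minimal sketch of how a program might split a SMILES string into its symbols. The `tokenize` and `atom_tokens` helpers are hypothetical names for this illustration, and the pattern covers only bracket atoms, the common organic-subset atoms, bonds, branches and ring-closure digits; it is not a full SMILES parser.

```python
import re

# Hypothetical tokenizer for a small subset of SMILES (illustration only).
# Order matters: bracket atoms and two-letter atoms are tried first.
TOKEN = re.compile(
    r"\[[^\]]+\]"      # bracket atom, e.g. [CH] or [H]
    r"|Br|Cl"          # two-letter organic-subset atoms
    r"|[BCNOPSFI]"     # single-letter aliphatic atoms
    r"|[bcnops]"       # aromatic atoms
    r"|[=#\-/\\:.]"    # bond symbols and the component separator
    r"|[()]"           # branch open / close
    r"|%\d{2}|\d"      # ring-closure labels
)

def tokenize(smiles):
    """Split a SMILES string into tokens; fail loudly on anything unrecognised."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unrecognised characters in SMILES")
    return tokens

def atom_tokens(smiles):
    """Return only the atom tokens (no bonds, branches or ring digits)."""
    return [t for t in tokenize(smiles) if t[0].isalpha() or t[0] == "["]

dye = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
print(tokenize("C(C)O"))
print(len(atom_tokens(dye)))  # number of non-hydrogen atoms written in the string
```

Counting the atom tokens of the naphthalene sulphonic acid SMILES gives the number of heavy atoms written in the string; hydrogens are implicit.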

InChI

• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate InChI, or convert from InChI to structure, from different names and formats.

Again, a string like this is easily matched on a computer.

[Figure: RSC ChemSpider]

Storing chemical diagrams on computers

• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and any crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• To solve this we could, for example, introduce the concept of an aromatic bond.
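The aromatic-bond point can be sketched in a few lines. This is only an illustration, assuming the two canonical structures are the two Kekulé forms of benzene; the function names and the "ar" label are made up for the example.

```python
def kekule_benzene(first_double):
    """Labelled bond list for a six-membered ring with alternating bond orders.

    first_double=0 puts the double bonds on ring bonds 0, 2, 4;
    first_double=1 puts them on ring bonds 1, 3, 5 (the other Kekule form).
    """
    bonds = {}
    for i in range(6):
        pair = frozenset((i, (i + 1) % 6))   # undirected edge between ring atoms
        bonds[pair] = 2 if i % 2 == first_double else 1
    return bonds

def aromatise(bonds):
    """Relabel every bond as aromatic.

    Only valid for this example, where the whole graph is one benzene ring;
    a real toolkit has to perceive which rings are aromatic first.
    """
    return {pair: "ar" for pair in bonds}

form_a = kekule_benzene(0)
form_b = kekule_benzene(1)
print(form_a == form_b)                        # False: the edge labels differ
print(aromatise(form_a) == aromatise(form_b))  # True: one representation
```

With single and double bond labels the two graphs compare as different; once every ring bond carries the same aromatic label they are identical.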

An example of storing the molecular graph: SD file format

MDL SD (structure data) format files store the information about a molecule in the following format:

• Header block: describes the molecule, e.g. its name
• Connection table: defines the molecular structure (atoms and bonds)
• Data block (optional): properties, e.g. volume
• Terminator line: a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format to store small molecules – SD files are widely used in software to store many molecule structures in databases.

A “format” in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).

The Connection Table – describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format:
  Connect 3 2 5 4

e.g. SD file format (reduced):
  Bond 3 5 1
  Bond 3 4 1
  Bond 5 6 2


Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don’t know the conformation (shape) of a molecule – so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
  ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Reading the file from top to bottom:

• Line 1: molecule name.
• Line 2: information on this molecule.
• Line 3: comment (e.g. the date is used here).
• “Counts line”: has the number of atoms and bonds as a minimum.
• “Atom block”: has the xyz coordinates of the atom and the element as a minimum.
• “Bond block”: one line for each bond: from atom, to atom, bond type.
• “M  END” and “$$$$”: these identify the end of this molecule.
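As a rough illustration of why the exact layout matters, the sketch below reads one SD record into a name, an atom list and a bond list. It is a simplification under stated assumptions: `parse_record` is a hypothetical helper, only two atoms and one bond are included to keep the sample short, and fields are split on whitespace, whereas the real V2000 layout is defined by fixed-width columns (whitespace splitting breaks once counts reach three digits).

```python
# A cut-down record in the same layout as the benzene example (two atoms, one bond).
RECORD = """fragment
  example header line
comment line
  2  1  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
M  END
$$$$
"""

def parse_record(text):
    """Return (name, atoms, bonds) from one SD record (whitespace-split sketch)."""
    lines = text.splitlines()
    name = lines[0].strip()                      # header block, line 1
    counts = lines[3].split()                    # counts line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:            # atom block
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:   # bond block
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))
    return name, atoms, bonds

name, atoms, bonds = parse_record(RECORD)
print(name, atoms, bonds)
```

The counts line drives everything else: the parser trusts it to know how many atom and bond lines follow.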

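Alanine SD file

[Figure: annotated SD file for alanine, with the following labels.]

• Bond length is 1.53 Å.
• Atom line fields: x, y, z, symbol, mass difference, charge, stereo, H-count…
• Charge: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3.
• Molecule is chiral.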

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can’t always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

[Figure: electron density map. Which way round should this ring go?]
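One common way to impute bonds from positions alone is a distance rule: two atoms are taken as bonded when they are closer than the sum of their covalent radii plus a tolerance. The sketch below assumes approximate covalent radii and a 0.4 Å tolerance, both chosen for illustration; the coordinates are example values in Å for a sulphur atom and three oxygens, and `impute_bonds` is a hypothetical helper.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (illustrative values).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}

def impute_bonds(atoms, tolerance=0.4):
    """Return index pairs (i, j) judged to be bonded from distance alone."""
    bonds = []
    for (i, (el_i, pos_i)), (j, (el_j, pos_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADIUS[el_i] + COVALENT_RADIUS[el_j] + tolerance
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds

# Example coordinates (angstroms): one sulphur and three of its oxygens.
sulphate = [
    ("S", (25.993, -1.362, 5.893)),
    ("O", (25.504, -1.159, 4.508)),
    ("O", (27.282, -0.770, 6.155)),
    ("O", (24.953, -1.035, 6.850)),
]
print(impute_bonds(sulphate))
```

Each oxygen lies about 1.45 to 1.48 Å from the sulphur, well inside the cutoff, while the oxygens are about 2.4 Å from each other and are left unbonded. The rule says nothing about bond order, which is one reason imputed bonds are not always correct.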

Example protein databank file (pdb)
(uses an adjacency matrix)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION 1.80 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 6.9
SEQRES   1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

What the records contain:

• CRYST1: space group and unit cell dimensions.
• ATOM: protein atoms, including xyz, occupancy and temperature factor.
• HETATM: non-protein atoms, including xyz, occupancy and temperature factor.
• CONECT: bonds.

New format: mmCIF

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.

Internal Coordinates

• Also called a Z-matrix
  – Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction).
  – An early form of specification of a starting geometry for molecules: sometimes graph paper was used to draw the molecule and get a starting set of coordinates before optimisation.
  – A Z-matrix uses the following geometric descriptions to describe molecules:
    Bond length
    Bond angle
    Torsion angle
    Out-of-plane bending, e.g. a carbonyl
    Non-bonded distance

[Figure: example structures showing dummy atom (du) positions.]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
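To show what a program does with these internal coordinates, here is a minimal sketch that converts the three-atom water Z-matrix into Cartesian coordinates. The placement convention (first atom at the origin, second along the z axis, third in the xz plane) is an assumption made for this example, and `water_from_zmatrix` is a hypothetical helper; general Z-matrix conversion also needs the dihedral step for a fourth atom onwards.

```python
import math

def water_from_zmatrix(l1, a1_degrees):
    """Place O, H, H from one bond length (angstroms) and one angle (degrees)."""
    a1 = math.radians(a1_degrees)
    oxygen = (0.0, 0.0, 0.0)                        # atom 1: at the origin
    h_first = (0.0, 0.0, l1)                        # atom 2: distance l1 along z
    h_second = (l1 * math.sin(a1), 0.0, l1 * math.cos(a1))  # atom 3: angle a1 at O
    return [("O", oxygen), ("H", h_first), ("H", h_second)]

water = water_from_zmatrix(0.96, 104.0)
h_h_distance = math.dist(water[1][1], water[2][1])
print(water)
print(round(h_h_distance, 3))   # H...H separation implied by l1 and a1
```

Changing `a1_degrees` alone bends the molecule while both O-H distances stay fixed, which is exactly why internal coordinates are convenient for driving a reaction coordinate.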

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Figure labels: Z-matrix; ‘improper’ torsion angles.]

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future, much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928.

Ethanol

<CML>
• Can be parsed
• Can contain reactions, properties, etc.
• Can contain relationships to other molecules and also concepts

InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES
C(C)O
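A small sketch of the idea, using Python's standard XML library: the structure of ethanol is written as elements, with a property and its units stored alongside it. The element and attribute names here are modelled loosely on CML and should be treated as assumptions for illustration rather than as the CML schema; hydrogens are left implicit.

```python
import xml.etree.ElementTree as ET

def ethanol_markup():
    """Build a CML-like description of ethanol (illustrative element names)."""
    molecule = ET.Element("molecule", id="ethanol")
    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)
    bond_array = ET.SubElement(molecule, "bondArray")
    # Same connectivity as the SMILES C(C)O: the first carbon carries C and O.
    for pair in ["a1 a2", "a1 a3"]:
        ET.SubElement(bond_array, "bond", atomRefs2=pair, order="1")
    # Metadata travels with the value: what was measured and in which units.
    prop = ET.SubElement(molecule, "property", title="boiling point", units="degC")
    prop.text = "78"
    return molecule

xml_text = ET.tostring(ethanol_markup(), encoding="unicode")
parsed = ET.fromstring(xml_text)          # "can be parsed"
print(xml_text)
print([atom.get("elementType") for atom in parsed.find("atomArray")])
```

Because the file is extensible, further elements (reactions, links to other molecules) could be added without breaking programs that only read the atoms and bonds.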

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank – PDB).
• From these can be obtained atom positions, bonds, coordinates, etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot – but often in many more torsional angle dimensions).

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called ‘Apelin’, with water and the membrane. We used these to design new molecules that block this receptor.

Apelin’s role in cancer

‘Apelin’ (a peptide receptor) is involved in cancer – therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats – which could be a real pain, but they can be converted

alc -- Alchemy file                            prep -- Amber PREP file
bs -- Ball & Stick file                        caccrt -- Cacao Cartesian file
ccc -- CCC file                                c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file   crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                       dmol -- DMol3 Coordinates file
feat -- Feature file                           gam -- GAMESS Output file
gamout -- GAMESS Output file                   gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                      qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                      jout -- Jaguar Output file
bin -- OpenEye Binary file                     mmd -- MacroModel file
mmod -- MacroModel file                        out -- MacroModel file
dat -- MacroModel file                         car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                       sd -- MDL Isis SDF file
mdl -- MDL Molfile file                        mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                 mopout -- MOPAC Output file
mmads -- MMADS file                            mpqc -- MPQC file
bgf -- MSI BGF file                            nwo -- NWChem Output file
pdb -- PDB file                                ent -- PDB file
pqs -- PQS file                                qcout -- Q-Chem Output file
res -- ShelX file                              ins -- ShelX file
smi -- SMILES file                             mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                     vmol -- ViewMol file

alc -- Alchemy file                            bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                 cacint -- Cacao Internal file
cache -- CAChe MolStruct file                  c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                ct -- ChemDraw Connection Table file
cht -- Chemtool file                           cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file   crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                          box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                 feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                gamin -- GAMESS Input file
inp -- GAMESS Input file                       gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                     gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                     gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                      jin -- Jaguar Input file
bin -- OpenEye Binary file                     mmd -- MacroModel file
mmod -- MacroModel file                        out -- MacroModel file
dat -- MacroModel file                         sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                        mdl -- MDL Molfile file
mol -- MDL Molfile file                        mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                            bgf -- MSI BGF file
csr -- MSI Quanta CSR file                     nw -- NWChem Input file
pdb -- PDB file                                ent -- PDB file
pov -- POV-Ray Output file                     pqs -- PQS file
report -- Report file                          qcin -- Q-Chem Input file
smi -- SMILES file                             fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                        txyz -- Tinker XYZ file
txt -- Titles file                             unixyz -- UniChem XYZ file
vmol -- ViewMol file                           xed -- XED file
xyz -- XYZ file                                zin -- ZINDO Input file

File formats – the bane of our lives

Interconversion program – Babel

There have been recent IUPAC moves towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers – things to look out for, since what is stored is actually quite crude

For example:

• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect – check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but not observed in the file).

2. Finding molecules

A typical problem involves finding the ‘right’ molecules:

• All the molecules synthesisable: 10^60
• Molecules with relevance to the problem (that we can generate): 10^10
• Molecules we can search efficiently: 10^7
• Molecules we can make and test: 10^3

There is a serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1. Finding molecules using substructure searching

One of the most useful ‘properties’ of a molecule is its molecular graph. Many software systems allow searching for whole molecules, or fragments of molecules, or even ‘similar’ molecules.

For a specific molecule, this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we’ve used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example – find a molecule containing our query as part of a larger molecule.

Results – it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

[Figure labels on the bit string: contains P; contains NH2; doesn’t contain F.]

So a quick match of the bit string of the fragment searched over the (pre-computed) bit strings in the database will eliminate most of the structures.
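The screening step can be sketched with integers used as bit strings. This is a toy illustration: the fragment-to-bit table, the hand-assigned fragment lists and the function names are all hypothetical, whereas a real system derives the fragments from the molecular graph.

```python
# Hypothetical fragment dictionary: each fragment owns one bit position.
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "phenyl": 4}

def fingerprint(fragments):
    """Set one bit for every fragment present in the molecule."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << FRAGMENT_BITS[fragment]
    return bits

def screen(query_bits, database):
    """Keep entries whose bit string contains every bit set in the query."""
    return [name for name, bits in database.items()
            if bits & query_bits == query_bits]

# Pre-computed once, when the database is built (fragment lists assigned by hand here).
database = {
    "tyrosine": fingerprint(["NH2", "OH", "phenyl"]),
    "aniline": fingerprint(["NH2", "phenyl"]),
    "phenol": fingerprint(["OH", "phenyl"]),
    "fluorobenzene": fingerprint(["F", "phenyl"]),
}

query = fingerprint(["OH", "phenyl"])
print(screen(query, database))   # candidates that survive the quick screen
```

The screen can only rule molecules out. Anything that survives still has to go through the slower graph-matching step, because having the right bits does not guarantee the fragments are connected in the right way.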

Next step – substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware: salts etc. may need counterions stripped out.
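A crude sketch of counterion stripping on SMILES strings: disconnected components are separated by a dot, so one simple heuristic is to keep the largest component. The function name and the size measure (number of letters in the fragment) are assumptions for illustration; real toolkits count atoms properly and may also neutralise the remaining charge.

```python
def strip_counterions(smiles):
    """Keep the largest dot-separated component of a SMILES string."""
    fragments = smiles.split(".")
    # Rough size estimate: count element-symbol letters in each fragment.
    return max(fragments, key=lambda fragment: sum(ch.isalpha() for ch in fragment))

print(strip_counterions("CC(=O)[O-].[Na+]"))   # sodium acetate -> acetate
print(strip_counterions("C(C)O"))              # a single component is unchanged
```

After stripping, the remaining string would still need to be put into canonical form before it can be compared by exact string match.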

Substructure Search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• ‘Subgraphs’ can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents, etc.
  – –OH
  – –NH2
  – –COOH
  – phenyl
• This can be done by tracing appropriate paths in the graph.
• Subgraphs may overlap.

[Figure: example structure with the fragments OH, CH2, CHNH2, OH and O labelled.]

Matching the query structure to the database

Two algorithms are commonly used:

• The Morgan algorithm, which numbers the molecule uniquely, and
• The Ullmann algorithm, which matches the fragments.

Note – a number of ‘substructural’ fragments can be matched here (6 in all).

Read these for more details:

• Gund P., Ann. Rep. Med. Chem. 14, 299, 1979.
• Sheridan R.P. et al., J. Chem. Inf. Comput. Sci. 29, 255, 1989.
• Brint A.T., Willett P., J. Mol. Graphics 5, 49, 1987.
• Ullmann J.R., An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp 31-42, 1976. http://portal.acm.org/citation.cfm?id=321925
• Morgan H.L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 107–113, 1965.

The basic concept is that molecules are considered as graphs:

• Atoms are ‘nodes’
• Bonds are ‘edges’

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).

The Morgan algorithm iteratively computes an integer label for each node in a structure (atoms). For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label i to each node, considering its atomic number, degree (number of substituents) and types of bonds.
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107–113, 1965.
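The three steps can be sketched as follows. This is a simplified version under stated assumptions: the initial label is the degree only (atomic number and bond types are ignored), each update replaces a label by the sum of its neighbours' labels, and `morgan_labels` is a hypothetical helper name. The example graph is the carbon chain of n-pentane.

```python
def morgan_labels(neighbours):
    """Simplified Morgan labelling.

    neighbours maps each node to the list of nodes it is bonded to.
    Returns the labels from the last iteration that increased the
    number of distinct label values.
    """
    labels = {node: len(adjacent) for node, adjacent in neighbours.items()}  # step (i)
    n_classes = len(set(labels.values()))
    while True:
        updated = {node: sum(labels[nb] for nb in adjacent)                 # step (ii)
                   for node, adjacent in neighbours.items()}
        n_updated = len(set(updated.values()))
        if n_updated <= n_classes:                                          # step (iii)
            return labels
        labels, n_classes = updated, n_updated

pentane = {
    "C1": ["C2"],
    "C2": ["C1", "C3"],
    "C3": ["C2", "C4"],
    "C4": ["C3", "C5"],
    "C5": ["C4"],
}
labels = morgan_labels(pentane)
print(labels)
print(sorted(labels, key=labels.get, reverse=True))  # numbering starts at the highest label
```

The central carbon receives the highest label, and the symmetry-equivalent pairs (C2 and C4, C1 and C5) share a label, so extra rules are still needed to break ties.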

The first step - numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged.
• Numbering starts from the highest label, then uses the next-highest connection as 2, etc. Like an onion: concentric circles out from the highest-numbered atom.
• Rules are used to break equalities, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered.

The Ullmann algorithm is a way of detecting substructures

[Figure: urea (atoms numbered 1 to 4) and cyclohexanone (atoms numbered 1 to 7), each shown with its adjacency matrix.]

Urea adjacency matrix:

      O1  C2  C3  C4
O1     0   0   0   1
C2     0   0   0   1
C3     0   0   0   1
C4     1   1   1   0

Cyclohexanone adjacency matrix (7 × 7, atoms O1 and C2 to C7): the O1 row has one entry, the C4 row has three, and every other ring carbon row has two.

Numbering is crucial. Renumber for different starting points. Also, the ‘edges’ of the graph are annotated to speed up the matching. This is the ‘slow’ part of matching: ‘subgraph isomorphism’ is of the class of NP-complete problems.
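To show what matching adjacency matrices involves, the sketch below builds both matrices from bond lists and tests for a substructure by brute force: it tries every assignment of query atoms to distinct target atoms. This is deliberately not Ullmann's algorithm, which prunes the candidate assignments with a refinement step instead of enumerating them all. Atom labels are ignored, the zero-based cyclohexanone ring numbering is an assumption, and the function names are hypothetical.

```python
from itertools import permutations

def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 matrix with a 1 wherever two atoms are bonded."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b in bonds:
        matrix[a][b] = matrix[b][a] = 1
    return matrix

def is_substructure(query, target):
    """Brute force: does some mapping send every query bond onto a target bond?"""
    n_query, n_target = len(query), len(target)
    for mapping in permutations(range(n_target), n_query):
        if all(target[mapping[i]][mapping[j]]
               for i in range(n_query)
               for j in range(i + 1, n_query)
               if query[i][j]):
            return True
    return False

# Four-atom query: atom 3 is bonded to atoms 0, 1 and 2 (the urea skeleton).
urea = adjacency_matrix(4, [(0, 3), (1, 3), (2, 3)])
# Cyclohexanone: oxygen 0 on carbon 3; ring 3-1-4-6-5-2-3 (assumed numbering).
cyclohexanone = adjacency_matrix(
    7, [(0, 3), (3, 1), (3, 2), (1, 4), (2, 5), (4, 6), (5, 6)])
triangle = adjacency_matrix(3, [(0, 1), (1, 2), (0, 2)])

print(is_substructure(urea, cyclohexanone))      # the carbonyl carbon has three neighbours
print(is_substructure(triangle, cyclohexanone))  # no three-membered ring present
```

Even for this tiny case the loop may inspect 840 mappings; the number grows factorially with molecule size, which is why pre-screening and cleverer matching are needed in practice.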


Next lecture

• How can we use this type of information to solve our chemistry problems?
  – Patents/Markush structures
  – Molecular similarity
  – Molecular properties
  – 3D searching
  – Virtual screening
  – Structure–Property/Activity Relationships

Page 2: Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH

Molecular Informatics

1 molecules and computers

An Introduction to Chemoinformatics Andrew R

Leach Valerie J Gillet Springer 2007

Chemoinformatics - A Textbook Johann Gasteiger and Thomas Engel

Wiley-VCH 2003

Handbook of Chemoinformatics Johann Gasteiger

Wiley-VCH 2003 (new edition out 2018)

Chemoinformatics An Approach to Virtual Screening

By Alexandre Varnek Alex Tropsha RSC Publishing

Bunin Barry A Chemoinformatics Theory Practice and Products

Dordrecht Springer 2007

Chemoinformatics An Approach to Virtual Screening By Alexandre

Varnek Alex Tropsha RSC Publishing

Drug Metabolism Prediction Ed R Mannhold H Kubinye G Folkers

Ed Johannes Kirchmair Methods and Principles in Medicinal

Chemistry Vol 63 Pub Wiley-VCH

Sources- textbooksonline you may wish to consider if you want

to take the subject further

Journals of MolecularCheminformatics you may wish

to follow up on

Journal of Chemical Information and Modeling

Journal of Chemical Theory and Computation

Journal of Cheminformatics

Journal of Computer-Aided Molecular Design

Journal of Molecular Graphics amp Modeling

Journal of Computational Chemistry

Journal of Medicinal Chemistry

Reviews in Computational Chemistry

Drug Discovery Today

BMC Bioinformatics

Nature Reviews Drug Discovery

Expert Opinion on Drug Discovery

WIRES computational Molecular Science

Molecular Informatics

Includes all aspects of the study of molecules on computers. Also includes Chem(o)informatics.

This includes the representation of molecules, databases, display, simulation, prediction of their properties, and the discovery and design of new molecules and materials.

Molecular informatics is closely related to bioinformatics, computational chemistry, molecular modelling, simulation, machine learning and statistics, as well as online publications, but the area has principally been driven by investment in new methods for drug discovery, hence the concentration on small organic molecules.

Cambridge HPC

Places to find Molecular Informatics apps

• http://www.macinchem.org/mobilescience
• Molecules (e.g. RSC ChemSpider)
• Publishers (e.g. ACS/RSC mobile)
• Calculations (e.g. Yield101 for Rxns)
• Visualisation (e.g. PyMOL for proteins)

J. Chem. Educ. 2013, 90 (3), pp 320–325, DOI: 10.1021/ed300329e

Cheminformatics 101

How do we store molecules on the computer?

There are estimated to be 10^60 possible small molecules that could be made. How do we find the best molecule for the problem we are addressing? Let's take a look "under the bonnet" of the way molecules are actually manipulated on the computer. You will be familiar with:

1. Trivial name, e.g. Morphine
2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol

However, these names do not convey the structure of molecules in a way the computer can readily understand. We need to convert these into "machine readable formats" which allow ease of searching based on the complexities of molecular structure. But what is a molecule?

Bear this in mind: molecules are complicated. When we look at this scene we add a huge amount of information from our senses and knowledge, but it nearly all gets lost in computational representation.

Representing chemistry needs to be engineered to represent materials and processes. As you will see, we are moving in that direction, with more complete representations of molecules and materials.

[Image captions: "Not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol"; "A real life mixture"]

What is a molecule?

• Is it a series of connected points?
• A wave function?
• The sum of its properties?

In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.

Storing molecules: different methods for different purposes

Methods for storing molecules can conveniently be broken down into:

• 1-Dimensional (simple, very compact and fast to access)
• 2-Dimensional (contains the chemical diagram)
• 3-Dimensional approaches (the shapes of molecules)

1D: Line notations (a string of characters from a keyboard)
2D: Molecular diagrams (graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the notes).

1-Dimensional - Line notations: a string representation of molecules. Here are three examples of different line notations. SMILES is the most useful and widely used. These are 'strings', all of the same molecule.

• Line notations
– WLN: L66J BMR& DSWQ IN1&1
– ROSDAL: 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
– SMILES: c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
– IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid

Notice: all these notations use just the characters on a standard typewriter keyboard.
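Because a line notation is only a string, ordinary text tools can already pull information out of it. The sketch below counts the heavy atoms in the SMILES string given above. The token pattern is a deliberate simplification chosen for this illustration (bracket atoms, Cl, Br, the organic subset and aromatic lower-case atoms); it is not the full SMILES grammar, and the function name is hypothetical.

```python
import re
from collections import Counter

# Simplified atom pattern (an assumption of this sketch, not the full SMILES grammar):
# bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def heavy_atom_counts(smiles):
    """Count heavy atoms per element in a SMILES string (implicit hydrogens are ignored)."""
    counts = Counter()
    for token in ATOM_PATTERN.findall(smiles):
        if token.startswith("["):
            # Keep only the element symbol of a bracket atom, e.g. "[O-]" -> "O".
            token = re.match(r"\[\d*([A-Za-z][a-z]?)", token).group(1)
        if token == "H":
            continue  # explicit hydrogens are not heavy atoms
        counts[token.capitalize()] += 1
    return counts

DYE = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
counts = heavy_atom_counts(DYE)
print(dict(counts), sum(counts.values()))
```

The total of 24 heavy atoms agrees with the highest atom number used in the ROSDAL string for the same molecule.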

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?

It's artemisinin, which has anti-malarial properties (Nobel Prize).

This language is called SMILES.

A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure; a SMILES string will then be available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1

InChI

• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate InChI and convert from InChI to structure, from different names and formats.

Again, a string like this is easily matched on a computer.

RSC ChemSpider

Storing chemical diagrams on computers

• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. Oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• E.g. to solve this we could introduce the concept of an aromatic bond.
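A small sketch of the labelled-graph idea and of why an aromatic bond label helps. It assumes, for illustration, that the two canonical structures are the two Kekulé drawings of a six-membered carbon ring; the function names are hypothetical and the "aromatise" step simply relabels bonds in a ring that is already known to be aromatic.

```python
def six_ring(bond_labels):
    """Labelled graph of a six-membered carbon ring.

    Nodes are atom indices 0-5 (all labelled "C"); each edge maps a pair of
    bonded atoms to its bond label.
    """
    atoms = {i: "C" for i in range(6)}
    bonds = {frozenset({i, (i + 1) % 6}): label for i, label in enumerate(bond_labels)}
    return atoms, bonds

# Two Kekulé drawings of the same ring: the double bonds are shifted by one position.
kekule_a = six_ring([1, 2, 1, 2, 1, 2])
kekule_b = six_ring([2, 1, 2, 1, 2, 1])

def with_aromatic_bonds(graph):
    """Relabel every ring bond as aromatic (assumes the ring is known to be aromatic)."""
    atoms, bonds = graph
    return atoms, {pair: "aromatic" for pair in bonds}

print(kekule_a == kekule_b)                                            # simple graphs differ
print(with_aromatic_bonds(kekule_a) == with_aromatic_bonds(kekule_b))  # aromatic labels agree
```

With single and double bond labels the two graphs compare as different; once the bonds carry a common aromatic label they compare as the same molecule.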

An example of storing the molecular graph: SD file format

MDL SD (structure data) format files store the information about a molecule in the following format:

Header block: describes the molecule, e.g. its name
Connection table: defines the molecular structure (atoms and bonds)
Data block (optional): properties, e.g. volume
Terminator line: a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format to store small molecules. SD files are widely used in software to store many molecule structures in databases.

A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).

The Connection Table – describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format:
Connect 3 2 5 4

e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

• Line 1: molecule name
• Line 2: information on this molecule
• Line 3: comment (e.g. the date is used here)
• "Counts line": has the number of atoms and bonds as a minimum
• "Atom block": has the xyz coordinates of the atom and the element as a minimum
• "Bond block": one line for each bond: from atom, to atom, bond type
• "M  END" and "$$$$": these identify the end of this molecule
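The blocks described above can be read back with a few lines of code. This is a minimal sketch rather than a full SD reader: it assumes a single V2000 record whose fields are separated by whitespace (real files use fixed column widths, which matters once atom counts reach three digits), and the function name is hypothetical.

```python
BENZENE_SD = "\n".join([
    "benzene",
    "  ACD/Labs0812062058",
    "August 2013",
    "  6  6  0  0  0  0  0  0  0  0  1 V2000",
    "    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "  2  1  1  0  0  0  0",
    "  3  1  2  0  0  0  0",
    "  4  2  2  0  0  0  0",
    "  5  3  1  0  0  0  0",
    "  6  4  1  0  0  0  0",
    "  6  5  2  0  0  0  0",
    "M  END",
    "$$$$",
])

def parse_sd_record(text):
    """Return (name, atoms, bonds) from one V2000 record."""
    lines = text.splitlines()
    name = lines[0].strip()                      # header block, line 1
    counts = lines[3].split()                    # counts line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:            # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, (float(x), float(y), float(z))))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:  # bond block
        first, second, order = (int(value) for value in line.split()[:3])
        bonds.append((first, second, order))
    return name, atoms, bonds

name, atoms, bonds = parse_sd_record(BENZENE_SD)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

Note that the atom numbers in the bond block are 1-based positions in the atom block, so the graph is defined entirely by the order of the lines.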

Alanine SD file

Bond length is 1.53 Å

Atom line fields: x y z, symbol, mass diff, charge, stereo, h-count, …

Charge codes: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3

Molecule is chiral

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

Which way round should this ring go?
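One common way to impute bonds from positions alone is a distance test: two atoms are taken as bonded if they are closer than the sum of their covalent radii plus a tolerance. The sketch below is an illustration of that idea, not the method of any particular program; the radii are approximate values and the 0.4 Å tolerance is an assumption. The coordinates are those of the sulfate ion in the PDB entry 3S7A used as the example file in this lecture.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms (illustrative values).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}

def impute_bonds(atoms, tolerance=0.4):
    """Return index pairs (i, j) judged to be bonded from distance alone."""
    bonds = []
    for (i, (el_i, xyz_i)), (j, (el_j, xyz_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADII[el_i] + COVALENT_RADII[el_j] + tolerance
        if math.dist(xyz_i, xyz_j) <= cutoff:
            bonds.append((i, j))
    return bonds

# Sulfate atoms S, O1, O2, O3 with coordinates in angstroms.
SULFATE = [
    ("S", (25.993, -1.362, 5.893)),
    ("O", (25.504, -1.159, 4.508)),
    ("O", (27.282, -0.770, 6.155)),
    ("O", (24.953, -1.035, 6.850)),
]

print(impute_bonds(SULFATE))
```

The test finds the three S–O contacts (about 1.45 Å) and rejects the O···O pairs (about 2.4 Å). It says nothing about bond order, which is one reason imputed bonding in large structures is not always correct.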

Example protein databank file (pdb)
(uses an adjacency matrix)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

• CRYST1: space group and unit cell dimensions
• ATOM: protein atoms, including xyz, occupancy and temperature factor
• HETATM: non-protein atoms, including xyz, occupancy and temperature factor
• CONECT: bonds
• New format: mmCIF
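The PDB format is a good example of a format where the reader expects information in exactly the right place: each field of an ATOM or HETATM record lives in fixed character columns. The following sketch writes one record with those column widths and reads it back by slicing. It covers only the fields shown on the slide (no alternate location, insertion code, element or charge columns), and the function names are hypothetical.

```python
def format_atom_record(record, serial, name, res_name, chain, res_seq, x, y, z, occ, b):
    """Write one ATOM/HETATM line using fixed column widths."""
    return (f"{record:<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}")

def parse_atom_record(line):
    """Read the fields back by column position (0-based slices of 1-based columns)."""
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66, the temperature factor
    }

line = format_atom_record("HETATM", 1510, " S", "SO4", "A", 187,
                          25.993, -1.362, 5.893, 1.00, 22.34)
print(line)
print(parse_atom_record(line))
```

Splitting on whitespace would often work too, but it breaks as soon as two wide numbers touch, which is why column slicing is the safer choice for this format.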

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.

• Also called a Z-matrix
– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
– Early form of specification of a starting geometry for molecules: sometimes graph paper was used to draw the molecule and get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
  Bond length
  Bond angle
  Torsion angle
  Out-of-plane bending, e.g. a carbonyl
  Non-bonded distance

[Figure: internal coordinates, showing dummy atom (du) positions]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
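To see how internal coordinates map back to Cartesian coordinates, the water Z-matrix can be converted by hand: atom 1 goes at the origin, atom 2 along the z axis at the bond length, and atom 3 in the xz plane at the given distance and angle. The sketch below does exactly that and nothing more. It handles only the first three atoms (later atoms would also need the dihedral angle), and the function names are hypothetical.

```python
import math

WATER_ZMATRIX = """O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
"""

def parse_zmatrix(text):
    """Split a Gaussian-style Z-matrix into atom rows and a table of variables."""
    atom_part, _, variable_part = text.strip().partition("\n\n")
    rows = [line.split() for line in atom_part.splitlines()]
    variables = {name: float(value)
                 for name, value in (line.split() for line in variable_part.splitlines())}
    return rows, variables

def first_three_atoms_to_xyz(rows, variables):
    """Place atoms 1-3; assumes atom 2 is 'X 1 r' and atom 3 is 'X 1 r 2 a'."""
    symbols = [row[0] for row in rows[:3]]
    r12 = variables[rows[1][2]]
    r13 = variables[rows[2][2]]
    angle = math.radians(variables[rows[2][4]])
    return [
        (symbols[0], (0.0, 0.0, 0.0)),
        (symbols[1], (0.0, 0.0, r12)),
        (symbols[2], (r13 * math.sin(angle), 0.0, r13 * math.cos(angle))),
    ]

rows, variables = parse_zmatrix(WATER_ZMATRIX)
for symbol, xyz in first_three_atoms_to_xyz(rows, variables):
    print(symbol, tuple(round(value, 4) for value in xyz))
```

Changing a single variable (for example a1) moves the atom along a chemically meaningful coordinate, which is the reason the Z-matrix is convenient for driving a reaction coordinate.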

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

'Improper' torsion angles

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928.

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts

InChI: InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES: C(C)O
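To make the idea of metadata concrete, here is a sketch that builds a small CML-style description of ethanol with the Python standard library and parses it back. The element and attribute names (molecule, atomArray, atom, bondArray, bond, property, scalar) are modelled on common CML usage but are not validated against the CML schema here, so treat the exact vocabulary as an assumption.

```python
import xml.etree.ElementTree as ET

def ethanol_cml():
    """Return a CML-style XML string for the heavy-atom graph of ethanol."""
    molecule = ET.Element("molecule", id="ethanol")
    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)
    bond_array = ET.SubElement(molecule, "bondArray")
    for pair in ["a1 a2", "a2 a3"]:
        ET.SubElement(bond_array, "bond", atomRefs2=pair, order="1")
    # Metadata travels with the value: the property carries a title and units.
    prop = ET.SubElement(molecule, "property", title="molar mass")
    scalar = ET.SubElement(prop, "scalar", units="g/mol")
    scalar.text = "46.07"
    return ET.tostring(molecule, encoding="unicode")

document = ethanol_cml()
print(document)

# Because it is XML, any standard parser can read it back.
root = ET.fromstring(document)
print(len(root.findall("./atomArray/atom")), "atoms;",
      root.find("./property/scalar").get("units"))
```

The same file can later be extended with new elements without breaking older readers, which is the practical meaning of 'extensible'.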

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
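The "film strip" picture can be taken literally: since every SD record ends with a $$$$ line, a trajectory stored as concatenated records can be cut back into frames by splitting on that terminator. The sketch below uses a made-up two-atom record whose bond length grows from frame to frame; the record contents and function names are purely illustrative, and real molecular dynamics packages use their own, more compact trajectory formats.

```python
def split_sd_frames(text):
    """Return the records of a multi-record SD file, split on '$$$$' lines."""
    frames, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            frames.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return frames

def make_frame(step, bond_length):
    """Hypothetical two-atom record whose bond length changes between frames."""
    return "\n".join([
        f"frame {step}",
        "  sketch",
        "",
        "  2  1  0  0  0  0  0  0  0  0  1 V2000",
        "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
        f"{bond_length:10.4f}    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
        "  1  2  2  0  0  0  0",
        "M  END",
        "$$$$",
    ])

trajectory = "\n".join(make_frame(step, 1.20 + 0.01 * step) for step in range(3))
frames = split_sd_frames(trajectory)
print(len(frames), "frames;", [frame.splitlines()[0] for frame in frames])
```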

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor.

Apelin's role in cancer

'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Rayalam et al., Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file    prep -- Amber PREP file
bs -- Ball & Stick file    caccrt -- Cacao Cartesian file
ccc -- CCC file    c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file    cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file    dmol -- DMol3 Coordinates file
feat -- Feature file    gam -- GAMESS Output file
gamout -- GAMESS Output file    gpr -- Ghemical Project file
mm1gp -- Ghemical MM file    qm1gp -- Ghemical QM file
hin -- HyperChem HIN file    jout -- Jaguar Output file
bin -- OpenEye Binary file    mmd -- MacroModel file
mmod -- MacroModel file    out -- MacroModel file
dat -- MacroModel file    car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file    sd -- MDL Isis SDF file
mdl -- MDL Molfile file    mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file    mopout -- MOPAC Output file
mmads -- MMADS file    mpqc -- MPQC file
bgf -- MSI BGF file    nwo -- NWChem Output file
pdb -- PDB file    ent -- PDB file
pqs -- PQS file    qcout -- Q-Chem Output file
res -- ShelX file    ins -- ShelX file
smi -- SMILES file    mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file    vmol -- ViewMol file

alc -- Alchemy file    bs -- Ball & Stick file
caccrt -- Cacao Cartesian file    cacint -- Cacao Internal file
cache -- CAChe MolStruct file    c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file    ct -- ChemDraw Connection Table file
cht -- Chemtool file    cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file    box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file    feat -- Feature file
fh -- Fenske-Hall Z-Matrix file    gamin -- GAMESS Input file
inp -- GAMESS Input file    gcart -- Gaussian Cartesian file
gau -- Gaussian Input file    gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file    gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file    jin -- Jaguar Input file
bin -- OpenEye Binary file    mmd -- MacroModel file
mmod -- MacroModel file    out -- MacroModel file
dat -- MacroModel file    sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file    mdl -- MDL Molfile file
mol -- MDL Molfile file    mopcrt -- MOPAC Cartesian file
mmads -- MMADS file    bgf -- MSI BGF file
csr -- MSI Quanta CSR file    nw -- NWChem Input file
pdb -- PDB file    ent -- PDB file
pov -- POV-Ray Output file    pqs -- PQS file
report -- Report file    qcin -- Q-Chem Input file
smi -- SMILES file    fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file    txyz -- Tinker XYZ file
txt -- Titles file    unixyz -- UniChem XYZ file
vmol -- ViewMol file    xed -- XED file
xyz -- XYZ file    zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect: check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).

2. Finding molecules

A typical problem involves finding the 'right' molecules

• All the molecules synthesisable: 10^60
• Molecules with relevance to the problem (that we can generate): 10^10
• Molecules we can search efficiently: 10^7
• Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1. Finding molecules using substructure searching

One of the most useful 'properties' of a molecule is its molecular graph. Many software systems allow searching for whole molecules, or fragments of molecules, or even 'similar' molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example: find a molecule containing our query as part of a larger molecule.

Results: takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Value: 0 0 0 0 0 1 0 0 0 0  0  1  0  0  0  0

(Slide annotations on individual bits: "Contains P", "Contains NH2", "Doesn't contain F".)

So a quick match of the bitstring of the fragment searched over the (pre-computed) bitstrings in the database will eliminate most of the structures.
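The screening step can be sketched with ordinary integers used as bitstrings. A database entry can only contain the query if every bit set in the query is also set in the entry, which is a single AND and a comparison per molecule. The fragment dictionary and the tiny database below are hypothetical, and the fragment sets are written by hand rather than perceived from structures.

```python
# Hypothetical fragment dictionary: bit i is set when FRAGMENT_KEYS[i] is present.
FRAGMENT_KEYS = ["F", "Cl", "P", "NH2", "OH", "C=O", "COOH", "phenyl"]

def fingerprint(fragments):
    """Encode a set of fragment names as an integer bitstring."""
    bits = 0
    for position, key in enumerate(FRAGMENT_KEYS):
        if key in fragments:
            bits |= 1 << position
    return bits

def screen(query_bits, database_bits):
    """Keep only entries whose bitstring contains every bit of the query."""
    return [name for name, bits in database_bits.items()
            if (bits & query_bits) == query_bits]

# Pre-computed once, when the database is built.
DATABASE = {
    "aniline": fingerprint({"NH2", "phenyl"}),
    "phenol": fingerprint({"OH", "phenyl"}),
    "4-aminophenol": fingerprint({"NH2", "OH", "phenyl"}),
    "acetic acid": fingerprint({"OH", "C=O", "COOH"}),
}

candidates = screen(fingerprint({"NH2", "phenyl"}), DATABASE)
print(candidates)
```

The screen never produces false negatives, but the survivors are only candidates: the bits say which fragments are present, not how they are connected, so a full substructure match is still needed on what remains.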

Next step: substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware: salts etc. may need to have counterions stripped out.
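In SMILES, disconnected components such as a counterion are separated by a dot, so a crude way to strip counterions before exact matching is to keep the largest component. The sketch below does this by counting atom tokens with a simplified pattern (an assumption of this illustration, not the full SMILES grammar). It does not neutralise charges or handle ties between components, and the function name is hypothetical.

```python
import re

# Simplified atom pattern: bracket atoms, Cl/Br, organic-subset and aromatic atoms.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def strip_counterions(smiles):
    """Keep the dot-separated component with the most atoms."""
    components = smiles.split(".")
    return max(components, key=lambda part: len(ATOM_PATTERN.findall(part)))

print(strip_counterions("[Na+].CC(=O)[O-]"))   # sodium acetate
print(strip_counterions("CN(C)c1ccccc1.Cl"))   # dimethylaniline hydrochloride
```

After stripping, the remaining string would still need to be converted to a canonical form before it can be compared character by character with the database entries.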

Substructure search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
– –OH
– –NH2
– –COOH
– phenyl
• This can be done by tracing appropriate paths in the graph.
• Subgraphs may overlap.

[Figure: example structure with groups labelled OH, CH2, CHNH2, OH and O]

Matching the query structure to the database

Two algorithms are commonly used: the Morgan algorithm, which numbers the molecule uniquely, and the Ullmann algorithm, which matches the fragments.

Note: a number of 'substructural' fragments can be matched here (6 in all).

Read these for more details:

Gund, P., Ann. Rep. in Med. Chem. 14, 299, 1979.
Sheridan, R. P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989.
Brint, A. T., Willett, P., J. Mol. Graphics 5, 49, 1987.
J. R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976. http://portal.acm.org/citation.cfm?id=321925
Morgan, H. L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-112.

The basic concept is that molecules are considered as graphs:

• Atoms are 'nodes'
• Bonds are 'edges'

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label i to each node, considering its atomic number, degree (number of substituents) and types of bonds.
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation, 5:107–113, 1965.
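Steps (i) to (iii) can be sketched in a few lines. This version makes two simplifying assumptions that should be read as such: the initial label is just the degree of each atom (atomic number and bond types are ignored), and each update replaces a label by the sum of its neighbours' labels. The input is a neighbour list per atom, and the function name is hypothetical.

```python
def morgan_labels(neighbours):
    """Iterate connectivity labels until the number of distinct labels stops increasing."""
    labels = [len(adjacent) for adjacent in neighbours]   # step (i): degree only
    n_classes = len(set(labels))
    while True:
        # Step (ii): new label = sum of the labels of the connected atoms.
        updated = [sum(labels[j] for j in adjacent) for adjacent in neighbours]
        updated_classes = len(set(updated))
        if updated_classes <= n_classes:                  # step (iii): no improvement
            return labels
        labels, n_classes = updated, updated_classes

# Carbon skeleton of pentane: a chain of five atoms, indices 0 to 4.
PENTANE = [[1], [0, 2], [1, 3], [2, 4], [3]]

labels = morgan_labels(PENTANE)
order = sorted(range(len(labels)), key=lambda atom: -labels[atom])
print(labels, order)
```

For the chain, the degrees [1, 2, 2, 2, 1] give two classes, one update gives [2, 3, 4, 3, 2] with three classes, and the next update would fall back to two, so the iteration stops. The middle atom gets the highest label and is numbered first; the symmetric pairs stay tied, which is where tie-breaking rules come in.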

The first step - numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalent classes is unchanged.
• Numbering starts from the highest, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest numbered.
• Rules are used to break equalities, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered.

Urea

Adjacency matrix (atoms numbered O1, C2, C3, C4 as on the slide):

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

Cyclohexanone adjacency matrix (atoms O1 and C2 to C7): row O1 contains one 1; rows C2, C3, C5, C6 and C7 contain two each; row C4 contains three.

Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching: subgraph isomorphism is of the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures.
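The matching step can be illustrated with a plain backtracking search over adjacency matrices. This is a simplified sketch of the problem the Ullmann algorithm solves, not the Ullmann algorithm itself (it has none of its refinement step): it tries to assign each query atom to a target atom with the same label such that every query bond is also a bond in the target. The four-atom query uses the numbering of the matrix above; the numbering of the cyclohexanone target is an assumption made for this example, as are the function names.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of bonded index pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b in bonds:
        matrix[a][b] = matrix[b][a] = 1
    return matrix

def find_subgraph(query_adj, query_labels, target_adj, target_labels):
    """Return a list mapping each query atom to a target atom, or None if no match."""
    mapping = []  # mapping[i] = target atom assigned to query atom i

    def extend():
        i = len(mapping)
        if i == len(query_adj):
            return True
        for t in range(len(target_adj)):
            if t in mapping or target_labels[t] != query_labels[i]:
                continue
            # Every bond from query atom i to an already mapped atom must exist in the target.
            if all(not query_adj[i][k] or target_adj[t][mapping[k]] for k in range(i)):
                mapping.append(t)
                if extend():
                    return True
                mapping.pop()  # dead end: backtrack
        return False

    return list(mapping) if extend() else None

# Query: O1, C2, C3 all bonded to C4 (0-based indices 0, 1, 2 bonded to 3).
QUERY = adjacency_matrix(4, [(0, 3), (1, 3), (2, 3)])
QUERY_LABELS = ["O", "C", "C", "C"]

# Target: cyclohexanone, numbered here as 0 = O, 1 = carbonyl C, 2-6 = remaining ring carbons.
TARGET = adjacency_matrix(7, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)])
TARGET_LABELS = ["O"] + ["C"] * 6

print(find_subgraph(QUERY, QUERY_LABELS, TARGET, TARGET_LABELS))
```

The search has to abandon several partial assignments before it finds that the central query atom must map to the carbonyl carbon. That backtracking is the expensive part, and it is why the cheap bitstring screen is applied first.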


Next lecture

• How can we use this type of information to solve our chemistry problems?
– Patents/Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure Property/Activity Relationships

Page 3: Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH

An Introduction to Chemoinformatics Andrew R

Leach Valerie J Gillet Springer 2007

Chemoinformatics - A Textbook Johann Gasteiger and Thomas Engel

Wiley-VCH 2003

Handbook of Chemoinformatics Johann Gasteiger

Wiley-VCH 2003 (new edition out 2018)

Chemoinformatics An Approach to Virtual Screening

By Alexandre Varnek Alex Tropsha RSC Publishing

Bunin Barry A Chemoinformatics Theory Practice and Products

Dordrecht Springer 2007

Chemoinformatics An Approach to Virtual Screening By Alexandre

Varnek Alex Tropsha RSC Publishing

Drug Metabolism Prediction Ed R Mannhold H Kubinye G Folkers

Ed Johannes Kirchmair Methods and Principles in Medicinal

Chemistry Vol 63 Pub Wiley-VCH

Sources- textbooksonline you may wish to consider if you want

to take the subject further

Journals of MolecularCheminformatics you may wish

to follow up on

Journal of Chemical Information and Modeling

Journal of Chemical Theory and Computation

Journal of Cheminformatics

Journal of Computer-Aided Molecular Design

Journal of Molecular Graphics amp Modeling

Journal of Computational Chemistry

Journal of Medicinal Chemistry

Reviews in Computational Chemistry

Drug Discovery Today

BMC Bioinformatics

Nature Reviews Drug Discovery

Expert Opinion on Drug Discovery

WIRES computational Molecular Science

Molecular

Informatics

Includes all aspects of the study of molecules on computers

Also includes Chem(o)informatics

This includes the representation of molecules databases display

simulation prediction of their properties and the discovery and

design of new molecules and materials

Molecular informatics is closely related to bioinformatics

computational chemistry molecular modelling simulation machine

learning and statistics as well as online publications - but the area

has principally been driven by investment in new methods for drug

discovery hence the concentration on small organic molecules

Cambridge HPC

Places to find Molecular

Informatics apps

bull httpwwwmacinchemorgmobilescience

bull Molecules ndash eg RSC-Chemspider)

bull Publishers (eg ACSRSC mobile)

bull Calculations (eg Yield101 for Rxns)

bull Visualisation (eg Pymol for proteins)

J Chem Educ 2013 90 (3) pp 320ndash325

DOI 101021ed300329e

Cheminformatics 101

How do we store molecules on the computer

There are estimated to be 1060 possible small molecules that

could be made How do we find the best molecule for the

problem we are addressing Letrsquos take a look ldquounder the

bonnetrdquo of the way molecules are actually manipulated on the

computer You will be familiar with

1 Trivial name eg Morphine

2 IUPAC name (5α6α)-78-didehydro-45-epoxy-17-

methylmorphinan-36-diol

However these names do not convey the structure of molecules

in a way the computer can readily understand We need to

convert these into ldquomachine readable formatsrdquo which allows

ease of searching based on the complexities of molecular

structure But what is a molecule

Bear this in mind Molecules are complicated When we look at this scene we add a

huge amount of information from our senses and knowledge ndash but it nearly all gets

lost in computational representation

Representing chemistry needs to be engineered to represent materials and processes

As you will see we are moving in that direction with more complete representations

of molecules and materials

Not (5α6α)-78-

didehydro-45-

epoxy-17-

methylmorphinan-

36-diol

A real life mixture

What is a molecule

is it a series of connected points

a wave function

the sum of its properties

In the computer molecules are therefore abstractions and interpretations of data

So more experimental data and an appropriate description of a molecule may

translate to a wider reality

Storing molecules different methods for different purposes

Methods for storing molecules can conveniently be broken down

into

1-Dimensional (simple and very compact and fast to access)

2-Dimensional (contains the chemical diagram)

3-Dimensional approaches (the shapes of molecules)

1D Line notations (a string of characters from a keyboard)

2D Molecular Diagrams (Graphs)

3D Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the

notes)

1-Dimensional: line notations – a string representation of molecules. Here are three examples of different line notations, together with the IUPAC name; SMILES is the most useful and widely used. These are ‘strings’ all of the same molecule:

• WLN: L66J BMR& DSWQ IN1&1
• ROSDAL: 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
• SMILES: c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
• IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid

Notice: all these notations use just the characters on a standard typewriter keyboard.

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a ‘molecule’ and is what is actually stored in the computer – does it make sense to you?

It’s artemisinin, which has anti-malarial properties (Nobel Prize). This language is called SMILES.

A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure; a SMILES will be available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1

InChI

• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate InChI, and convert from InChI to structure, from different names and formats (e.g. RSC ChemSpider).

Again, a string like this is easily matched on a computer.

Storing chemical diagrams on computers

• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and any crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. Using a simple graph, the computer would deduce that these two canonical structures are different molecules.
• To solve this we could, e.g., introduce the concept of an aromatic bond.
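To make the aromatic-bond point concrete, here is a minimal Python sketch. It assumes the two canonical structures are the two Kekulé forms of benzene (the slide's picture is not reproduced in this transcript), and `bond_set` and `mark_ring_aromatic` are illustrative helper names, not part of any toolkit. Real aromaticity perception needs ring detection and electron counting; the sketch simply relabels every bond of a ring that is already known to be aromatic.

```python
# Six ring bonds of benzene, atoms numbered 1..6 around the ring.
RING = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]

# The two Kekulé forms: double and single bonds swap places.
kekule_a = [(pair, 2 if i % 2 == 0 else 1) for i, pair in enumerate(RING)]
kekule_b = [(pair, 1 if i % 2 == 0 else 2) for i, pair in enumerate(RING)]


def bond_set(bonds):
    """Labelled edges as a set, ignoring the direction of each bond."""
    return {(frozenset(pair), label) for pair, label in bonds}


def mark_ring_aromatic(bonds):
    """Assumption: every bond passed in belongs to an aromatic ring."""
    return [(pair, "aromatic") for pair, _ in bonds]


# Compared as simple labelled graphs, the two drawings look different ...
print(bond_set(kekule_a) == bond_set(kekule_b))
# ... but with an 'aromatic' bond label they become identical.
print(bond_set(mark_ring_aromatic(kekule_a)) == bond_set(mark_ring_aromatic(kekule_b)))
```

The design choice is the usual one: add a bond label to the graph rather than trying to store both drawings.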

An example of storing the molecular graph: the SD file format

MDL SD (structure-data) files store the information about a molecule in the following format:

• Header block: describes the molecule, e.g. its name
• Connection table: defines the molecular structure (atoms and bonds)
• Data block (optional): properties, e.g. volume
• Terminator line: a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format for storing small molecules – SD files are widely used in software to store many molecular structures in databases.

A “format” in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it fails).

The Connection Table – describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format:
Connect 3 2 5 4

e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don’t know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Annotations:
• Line 1: molecule name
• Line 2: information on this molecule
• Line 3: comment (e.g. a date is used here)
• “Counts line”: has the number of atoms and bonds as a minimum
• “Atom block”: has the xyz coordinates of the atom and the element as a minimum
• “Bond block”: one line for each bond (from atom, to atom, bond type)
• “M  END” and “$$$$”: these identify the end of this molecule
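The layout above can be read back with a few lines of Python. This is only a sketch under stated assumptions: it splits on whitespace, whereas the real V2000 format uses fixed-width columns (so whitespace splitting can fail, e.g. when the atom and bond counts run together for large molecules), and `parse_molfile` is a hypothetical helper name rather than a function from any chemistry toolkit.

```python
BENZENE = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_molfile(text):
    """Return (name, atoms, bonds) from one V2000 record (whitespace-split sketch)."""
    lines = text.splitlines()
    name = lines[0].strip()                  # header line 1: molecule name
    counts = lines[3].split()                # counts line: atoms, bonds, ...
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for line in lines[4:4 + n_atoms]:        # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))  # from atom, to atom, bond type
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(BENZENE)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

Atom numbers in the bond block are 1-based positions in the atom block, which is why the order of the atom lines matters.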

Alanine SD file (annotations)

• Bond length is 1.53 Å
• Atom line fields: x, y, z, symbol, mass difference, charge, stereo, H-count, …
• Charge codes: 0 = uncharged (or a value other than these), 1 = +3, 2 = +2, 3 = +1, 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3
• Molecule is chiral

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can’t always determine the atom positions exactly, so the positions are stored and the bonds imputed. The format for storing these is a bit different.

Which way round should this ring go?

Example Protein Data Bank file (PDB)
(uses an adjacency list: the CONECT records)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

Annotations:
• CRYST1: space group and unit cell dimensions
• ATOM: protein atoms, including xyz, occupancy and temperature factor
• HETATM: non-protein atoms, including xyz, occupancy and temperature factor
• CONECT: bonds

New format: mmCIF
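Imputing bonds from atom positions can be sketched as a simple distance test. The Python below is an illustration only: it takes the sulfate coordinates from the file excerpt above, uses approximate covalent radii, and treats two atoms as bonded when they are closer than the sum of their radii plus a tolerance. The tolerance of 0.45 Å and the name `impute_bonds` are assumptions of this sketch; real software uses more careful rules.

```python
import math
from itertools import combinations

# Approximate covalent radii in Å (only the elements needed here).
COVALENT_RADIUS = {"C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.45  # Å; an assumed slack, not a value from the lecture

# (serial number, element, (x, y, z)) for the sulfate HETATM records.
SULFATE = [
    (1510, "S", (25.993, -1.362, 5.893)),
    (1511, "O", (25.504, -1.159, 4.508)),
    (1512, "O", (27.282, -0.770, 6.155)),
    (1513, "O", (24.953, -1.035, 6.850)),
]


def impute_bonds(atoms):
    """Return pairs of serial numbers judged to be bonded by distance alone."""
    bonds = []
    for (id_a, el_a, xyz_a), (id_b, el_b, xyz_b) in combinations(atoms, 2):
        cutoff = COVALENT_RADIUS[el_a] + COVALENT_RADIUS[el_b] + TOLERANCE
        if math.dist(xyz_a, xyz_b) <= cutoff:
            bonds.append((id_a, id_b))
    return bonds


print(impute_bonds(SULFATE))
```

For these four atoms the distance test recovers the S–O bonds that the CONECT 1510 record lists, while the oxygen atoms (about 2.4 Å apart) are left unbonded.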

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.

• Also called a Z-matrix
– Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)
– An early way of specifying a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
  Bond length
  Bond angle
  Torsion angle
  Out-of-plane bending (e.g. a carbonyl)
  Non-bonded distance

[Figure: internal coordinates of a molecule, with dummy atom (du) positions marked]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
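As a check on what the water Z-matrix means, the three atoms can be converted to Cartesian coordinates by hand. The Python sketch below assumes the usual convention (first atom at the origin, second along the z axis, third in the xz plane) and handles only this three-atom case; `water_cartesian` is an illustrative name, and a general converter would also need the dihedral angle for the fourth and later atoms.

```python
import math


def water_cartesian(l1, a1_degrees):
    """Place O, H, H from the Z-matrix: O; H 1 l1; H 1 l1 2 a1."""
    a1 = math.radians(a1_degrees)
    oxygen = (0.0, 0.0, 0.0)                       # atom 1 at the origin
    hydrogen_1 = (0.0, 0.0, l1)                    # atom 2 along +z, distance l1
    hydrogen_2 = (l1 * math.sin(a1), 0.0,          # atom 3 in the xz plane,
                  l1 * math.cos(a1))               # angle a1 away from atom 2
    return [("O", oxygen), ("H", hydrogen_1), ("H", hydrogen_2)]


atoms = water_cartesian(l1=0.96, a1_degrees=104.0)
for symbol, (x, y, z) in atoms:
    print(f"{symbol} {x:8.4f} {y:8.4f} {z:8.4f}")

# The H...H distance follows from the two internal coordinates alone.
print(round(math.dist(atoms[1][1], atoms[2][1]), 3))
```

Changing `a1_degrees` moves only the second hydrogen, which is the attraction of internal coordinates when driving a reaction coordinate.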

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

Z-matrix: ‘improper’ torsion angles

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928.

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts

InChI: InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES: C(C)O
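To show what storing metadata alongside the structure might look like, here is a small Python sketch that writes a CML-style description of ethanol with the standard library. The element and attribute names (`molecule`, `atomArray`, `atom`, `bondArray`, `bond`, `property`, `scalar`, `elementType`, `atomRefs2`, `units`) follow CML conventions as far as I know them, but the document is not validated against the CML schema, and the plain-text units string is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET


def ethanol_cml():
    """Build a CML-style element tree for ethanol, SMILES C(C)O (heavy atoms only)."""
    molecule = ET.Element("molecule", id="ethanol", title="Ethanol")

    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)

    bond_array = ET.SubElement(molecule, "bondArray")
    for refs in ["a1 a2", "a1 a3"]:               # C-C and C-O single bonds
        ET.SubElement(bond_array, "bond", atomRefs2=refs, order="1")

    # Metadata: the value carries its own units and a human-readable title.
    prop = ET.SubElement(molecule, "property", title="molar mass")
    scalar = ET.SubElement(prop, "scalar", units="g/mol")
    scalar.text = "46.07"
    return molecule


xml_text = ET.tostring(ethanol_cml(), encoding="unicode")
print(xml_text)

# Because it is XML, any standard parser can read it back.
parsed = ET.fromstring(xml_text)
print(len(parsed.findall("./atomArray/atom")), "atoms")
```

The point of the markup is that the number 46.07 never travels without its units, unlike a bare column in a fixed-width file.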

Molecules and 3 dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions …

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called ‘Apelin’, with water and the membrane. We used these simulations to design new molecules that block this receptor.

Apelin’s role in cancer

‘Apelin’ (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain, 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery, 2011, 6(3), 367-372

So there are very many file formats – which could be a real pain, but they can be converted

alc -- Alchemy file prep -- Amber PREP file

bs -- Ball & Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI Biosym/Insight II CAR file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball & Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file

cht -- Chemtool file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 3.5 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers – things to look out for, since what is stored is actually quite crude

For example:

• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect – check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).

2. Finding molecules

A typical problem involves finding the ‘right’ molecules:

• All the molecules synthesisable: 10^60
• Molecules with relevance to the problem (that we can generate): 10^10
• Molecules we can search efficiently: 10^7
• Molecules we can make and test: 10^3

There is a serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1. Finding molecules using substructure searching

One of the most useful ‘properties’ of a molecule is its molecular graph. Many software systems allow searching for whole molecules, fragments of molecules or even ‘similar’ molecules.

For a specific molecule this means specifying the required search pattern to get an exact match. An example might be: search for an exact match for dimethylaniline. Here we’ve used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example: find a molecule containing our query as part of a larger molecule.

Results: it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly the whole database is not being searched. The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in. This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Value: 0 0 0 0 0 1 0 0 0  0  0  1  0  0  0  0

Each bit position stands for one fragment: a 1 might mean ‘contains P’ or ‘contains NH2’, and a 0 might mean ‘doesn’t contain F’.

So a quick match of the bitstring of the fragment searched for against the (pre-computed) bitstrings in the database will eliminate most of the structures.
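The bitstring screen described above can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the six-entry fragment dictionary, the three database entries, and the idea that the fragment sets have already been worked out for each molecule (finding the fragments is the job of the graph code, not of this sketch).

```python
# Hypothetical fragment dictionary: one bit position per fragment.
FRAGMENT_KEYS = ["P", "NH2", "F", "OH", "COOH", "phenyl"]


def fingerprint(fragments):
    """Set one bit for every dictionary fragment present in the molecule."""
    bits = 0
    for position, key in enumerate(FRAGMENT_KEYS):
        if key in fragments:
            bits |= 1 << position
    return bits


def may_contain(molecule_bits, query_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (molecule_bits & query_bits) == query_bits


# Pre-computed once, when the database is built.
DATABASE = {
    "aniline": fingerprint({"NH2", "phenyl"}),
    "phenol": fingerprint({"OH", "phenyl"}),
    "glycine": fingerprint({"NH2", "COOH"}),
}

query = fingerprint({"phenyl"})
candidates = [name for name, bits in DATABASE.items() if may_contain(bits, query)]
print(candidates)
```

The screen only eliminates molecules: anything that survives still has to go through the slower atom-by-atom substructure match, because sharing the right fragments does not prove they are connected in the right way.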

Next step – substructure (pattern) matching

Exact match
Match the canonical SMILES, InChI or full structure. Beware: salts etc. may mean you need to strip out counterions.

Substructure search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search. A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• ‘Subgraphs’ can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
– –OH
– –NH2
– –COOH
– phenyl
• This can be done by tracing appropriate paths in the graph.
• Subgraphs may overlap.

[Figure: example structure with its OH, CH2, CH(NH2), OH and O groups marked as subgraphs]

Matching the query structure to the database

Two algorithms are commonly used:
• the Morgan algorithm, which numbers the molecule uniquely, and
• the Ullmann algorithm, which matches the fragments.

Note: a number of ‘substructural’ fragments can be matched here (6 in all).

Read these for more details:

Gund P., Ann. Rep. Med. Chem. 14, 299, 1979
Sheridan R. P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989
Brint A. T., Willett P., J. Mol. Graphics 5, 49, 1987
Ullmann J. R., An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM) 23, 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan H. L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 107-113, 1965

The basic concept is that molecules are considered as graphs:

• Atoms are ‘nodes’
• Bonds are ‘edges’

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107–113, 1965
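Steps (i) to (iii) can be sketched in Python as below. This is a simplified version under stated assumptions: the initial label is just the degree of each atom (the atomic number and bond types mentioned in step (i) are left out), each update replaces a label by the sum of its neighbours' labels as in the classic extended-connectivity scheme, and the example is the carbon skeleton of n-pentane, which is my choice rather than one from the lecture. `morgan_labels` is an illustrative name.

```python
def morgan_labels(adjacency):
    """Extended-connectivity labels for a graph given as {atom: set of neighbours}."""
    # (i) initial label: degree only (a simplification of the full scheme)
    labels = {atom: len(neighbours) for atom, neighbours in adjacency.items()}
    n_classes = len(set(labels.values()))

    while True:
        # (ii) new label = sum of the labels of the adjacent atoms
        updated = {atom: sum(labels[n] for n in neighbours)
                   for atom, neighbours in adjacency.items()}
        updated_classes = len(set(updated.values()))
        # (iii) stop as soon as the number of different labels stops increasing,
        # keeping the last labelling that improved the discrimination
        if updated_classes <= n_classes:
            return labels
        labels, n_classes = updated, updated_classes


# Carbon skeleton of n-pentane: C1-C2-C3-C4-C5
PENTANE = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}

labels = morgan_labels(PENTANE)
print(labels)
# Order the atoms by label, highest first; ties are symmetry-equivalent atoms.
print(sorted(labels, key=lambda atom: -labels[atom]))
```

Degrees alone give two classes (end and chain carbons); one round of summing separates the central carbon as well, and the next round would not add any further classes, so the iteration stops there.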

The first step - numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged.
• Numbering starts from the highest-valued atom, then uses the next highest connected atom as 2 etc. Like an onion: concentric circles out from the highest-valued atom.
• Rules are used to break ties, such as C before N, double bond before single etc.
• Some equivalent atoms are arbitrarily numbered.

The Ullmann algorithm is a way of detecting substructures

[Figure: adjacency matrices. The ‘urea’ query fragment has atoms O1, C2, C3 and C4, with C4 bonded to O1, C2 and C3. Cyclohexanone has atoms O1 and C2–C7: C4 is bonded to three atoms, O1 to one, and every other carbon to two.]

Numbering is crucial: renumber for different starting points. Also, the ‘edges’ of the graph are annotated to speed up the matching. This is the ‘slow’ part of the matching: subgraph isomorphism belongs to the class of NP-complete problems.
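The matching step can be illustrated with a plain backtracking search in Python. Note that this is not Ullmann's algorithm itself, which adds a matrix refinement step to prune the search; it is the simpler brute-force idea that Ullmann's method speeds up. The cyclohexanone ring numbering used here (C2–C5, C3–C6, C5–C7, C6–C7) is an assumption, since the slide's matrix entries do not survive in this transcript, and `find_matches` is an illustrative name.

```python
def find_matches(q_elements, q_bonds, t_elements, t_bonds):
    """All mappings of query atoms onto target atoms that preserve element and bonds."""
    query_atoms = list(q_elements)
    matches = []

    def extend(mapping):
        if len(mapping) == len(query_atoms):
            matches.append(dict(mapping))
            return
        q = query_atoms[len(mapping)]              # next query atom to place
        for t in t_elements:
            if t in mapping.values() or t_elements[t] != q_elements[q]:
                continue                           # already used, or wrong element
            # every already-placed neighbour of q must be bonded to t in the target
            if all(mapping[n] in t_bonds[t] for n in q_bonds[q] if n in mapping):
                mapping[q] = t
                extend(mapping)
                del mapping[q]                     # backtrack

    extend({})
    return matches


# Query fragment: C4 bonded to O1, C2 and C3.
QUERY_ELEMENTS = {1: "O", 2: "C", 3: "C", 4: "C"}
QUERY_BONDS = {1: {4}, 2: {4}, 3: {4}, 4: {1, 2, 3}}

# Cyclohexanone heavy atoms (ring numbering assumed for this sketch).
TARGET_ELEMENTS = {1: "O", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C", 7: "C"}
TARGET_BONDS = {1: {4}, 2: {4, 5}, 3: {4, 6}, 4: {1, 2, 3},
                5: {2, 7}, 6: {3, 7}, 7: {5, 6}}

for match in find_matches(QUERY_ELEMENTS, QUERY_BONDS, TARGET_ELEMENTS, TARGET_BONDS):
    print(match)
```

The fragment is found twice because the two ring carbons next to the carbonyl are interchangeable; a consistent (Morgan-style) numbering is what lets software recognise such symmetric duplicates.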


Next lecture

• How can we use this type of information to solve our chemistry problems?
– Patents/Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure-property/activity relationships



Cheminformatics 101

How do we store molecules on the computer

There are estimated to be 1060 possible small molecules that

could be made How do we find the best molecule for the

problem we are addressing Letrsquos take a look ldquounder the

bonnetrdquo of the way molecules are actually manipulated on the

computer You will be familiar with

1 Trivial name eg Morphine

2 IUPAC name (5α6α)-78-didehydro-45-epoxy-17-

methylmorphinan-36-diol

However these names do not convey the structure of molecules

in a way the computer can readily understand We need to

convert these into ldquomachine readable formatsrdquo which allows

ease of searching based on the complexities of molecular

structure But what is a molecule

Bear this in mind Molecules are complicated When we look at this scene we add a

huge amount of information from our senses and knowledge ndash but it nearly all gets

lost in computational representation

Representing chemistry needs to be engineered to represent materials and processes

As you will see we are moving in that direction with more complete representations

of molecules and materials

Not (5α6α)-78-

didehydro-45-

epoxy-17-

methylmorphinan-

36-diol

A real life mixture

What is a molecule

is it a series of connected points

a wave function

the sum of its properties

In the computer molecules are therefore abstractions and interpretations of data

So more experimental data and an appropriate description of a molecule may

translate to a wider reality

Storing molecules different methods for different purposes

Methods for storing molecules can conveniently be broken down

into

1-Dimensional (simple and very compact and fast to access)

2-Dimensional (contains the chemical diagram)

3-Dimensional approaches (the shapes of molecules)

1D Line notations (a string of characters from a keyboard)

2D Molecular Diagrams (Graphs)

3D Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the

notes)

1-Dimensional - Line notations ndash a string representation of

molecules- here are three examples of different line notations

SMILES is the most useful and widely used These are

lsquostringsrsquo all of the same molecule

bull Line Notationsndash WLN

bull L66J BMRamp DSWQ IN1amp1

ndash ROSDAL

bull 1=-5-=10=510-11-11N-12-

=17=123-18S-

19O18=20O18=21O8-22N-

2322-24

ndash SMILES

bull c1ccccc1Nc2cc(S(=O)(=O)O)cc3c

2cc(N(C)C)cc3

ndash IUPAC 6-dimethylamino-4-

phenylamino-naphthalene-2-sulphonic

acid

Notice ndash all these notations use just the characters on a standard

typewriter keyboard

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3

)([H])[C]41[C]3([H])[CH]2C

This is an example of a lsquomoleculersquo and is what is actually stored in the

computer ndash does it make sense to you

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3

)([H])[C]41[C]3([H])[CH]2C

Itrsquos Artemesinin

Which has

anti-malarial

properties

(Nobel Prize)

This language is called SMILES

This is an example of a lsquomoleculersquo and is what is actually stored in the

computer ndash does it make sense to you

A SMILES tutorial is available at

httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml

You can practice by drawing a structure and a smiles will be available

by picking the smiley face

httpwwwmolinspirationcomcgi-binpropertiestextMode=1

InChibull A more recent line notation is called

InChI

bull This will address some of the problems

of SMILES eg polymers and materials

not covered by SMILES

bull InChI is generated using computer

algorithms and is virtually un-

interpretable by a humans

bull Importantly it is commonly used as a

unique chemical identifier - each

molecule should theoretically have a

unique InChI One molecule one InChi

bull Websites are available that can generate

InChIconvert from InChi to structure

from different names and formats

Again a string like this is easily

matched on a computer

RSC ChemSpider

Storing chemical diagrams on computers

bull The valence model of a molecule can be represented by a

chemical graph A simple graph contains nodes (atoms) and edges

(bonds) joining pairs of nodes

bull The spacial position of the nodes length of the edges and

crossings are irrelevant Generally we ignore hydrogens unless

tautomerism or pKa is an issue Computers handle graphs very

well and molecules represented like this are examples of labelled

graphs (the atoms have names eg Oxygen)

bull Chemical structures are of course more complex than this and

aromaticity stereochemistry tautomerism non-stoichimetric

compounds etc are often problematic The computer would

(using a simple graph) deduce these two canonical structures are

different molecules

bull Eg to solve this we could introduce the concept of an aromatic

bond

An example of storing the molecular

graph SD file format

MDL SD (structure data) format files contain the following information and the

information about a molecule is stored in the following format

Header Block

describes the molecule eg itrsquos name

Connection table

defines the molecular structure (atoms and bonds)

Data block (optional)

Properties eg volume

Terminator line

a line containing four dollar signs ($$$$)

indicating the end of information on this molecule

This is probably the most common format to store small molecules ndash SD files in

software are widely used to store many molecule structures in databases

A ldquoformatrdquo in computer science is a precisely described order of data The program

reading it expects the information in exactly the right place (or it screws up)

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

eg PDB protein file format

eg SD file format

1

1

1

1

1

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

The Connection Table ndash describing bonds

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

(reduced)

eg PDB protein file format

eg SD file format

The Connection Table ndash describing bonds

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.

• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Header block: molecule name, information on this molecule, and a comment (e.g. a date is used here)

"Counts line": has the number of atoms and bonds as a minimum

"Atom block": has the xyz coordinates of the atom and the element as a minimum

"Bond block": one line for each bond (from atom, to atom, bond type)

"M  END" and "$$$$": these identify the end of this molecule
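As a sketch of how a program might read the counts line, atom block and bond block, the following Python parses the benzene record by column position. It assumes the usual V2000 fixed-width layout (three-character counts and bond fields, ten-character coordinates); the function name is hypothetical and the sample has been re-spaced into those columns.

```python
BENZENE = """\
benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
"""


def parse_molblock(text):
    """Return (name, atoms, bonds) from a V2000 connection table."""
    lines = text.splitlines()
    counts = lines[3]                       # counts line follows 3 header lines
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:       # atom block: x, y, z, symbol
        atoms.append((line[31:34].strip(),
                      float(line[0:10]), float(line[10:20]), float(line[20:30])))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # bond block: from atom, to atom, bond type (atoms numbered from 1)
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return lines[0], atoms, bonds


name, atoms, bonds = parse_molblock(BENZENE)
print(name, len(atoms), "atoms;", bonds)
```

The bond list is the labelled graph itself: each tuple is an edge, and the third number is the edge label (1 = single, 2 = double).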

Alanine SD file

Bond length is 1.53 Å

Atom block fields: x y z, symbol, mass difference, charge, stereo, H-count ...

Charge codes: 0 = uncharged or a value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3

Molecule is chiral
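The charge codes listed above are a small lookup table. A minimal sketch of decoding the charge field of one atom line follows, assuming the V2000 column layout (charge code in columns 37-39); the example atom line and the function name are made up for illustration.

```python
# Charge code -> formal charge, as listed above (code 4 is a doublet radical).
CHARGE_CODE_TO_CHARGE = {0: 0, 1: +3, 2: +2, 3: +1, 5: -1, 6: -2, 7: -3}


def atom_charge(atom_line):
    """Return (formal_charge, is_doublet_radical) from one atom-block line."""
    code = int(atom_line[36:39])
    if code == 4:
        return 0, True
    return CHARGE_CODE_TO_CHARGE[code], False


# Hypothetical carboxylate oxygen carrying charge code 5.
oxygen_line = "    0.0000    0.0000    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0"
print(atom_charge(oxygen_line))
```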

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

Which way round should this ring go?
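One common way to impute bonds from positions alone is a distance test: two atoms are taken as bonded if they are closer than the sum of their covalent radii plus some slack. The sketch below illustrates that idea only; the tolerance of 0.4 Å is an assumed value, the radii are approximate literature values, and the water coordinates are made up for the example.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms.
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}
TOLERANCE = 0.4  # assumed slack added to the sum of radii


def impute_bonds(atoms):
    """Return index pairs (i, j) of atoms close enough to be bonded."""
    bonds = []
    for (i, (sym_i, xyz_i)), (j, (sym_j, xyz_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADIUS[sym_i] + COVALENT_RADIUS[sym_j] + TOLERANCE
        if math.dist(xyz_i, xyz_j) <= cutoff:
            bonds.append((i, j))
    return bonds


# Hypothetical water geometry: O at the origin, O-H about 0.96 angstroms.
water = [("O", (0.0, 0.0, 0.0)),
         ("H", (0.96, 0.0, 0.0)),
         ("H", (-0.24, 0.93, 0.0))]
print(impute_bonds(water))
```

A test like this recovers connectivity but not bond order, and it says nothing about which way round a ring should go; both need chemical knowledge beyond the coordinates.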

Example Protein Data Bank (PDB) file

(uses an adjacency list in its CONECT records)

HEADER OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA X-RAY DIFFRACTION
REMARK 2 RESOLUTION 1.80 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 6.9
SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1 54.588 55.106 64.827 90.00 90.00 90.00 P 21 21 21 4
ORIGX1 1.000000 0.000000 0.000000 0.00000
ORIGX2 0.000000 1.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 0.018319 0.000000 0.000000 0.00000
SCALE2 0.000000 0.018147 0.000000 0.00000
SCALE3 0.000000 0.000000 0.015426 0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
END

CRYST1: space group and unit cell dimensions

ATOM: protein atoms, including xyz, occupancy and temperature factor

HETATM: non-protein atoms, including xyz, occupancy and temperature factor

CONECT: bonds

New format: mmCIF
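To make the record types concrete, here is a minimal sketch that reads ATOM/HETATM coordinates and CONECT bonds by column position. It assumes the standard fixed-column PDB layout and handles only these three record types; the function and dictionary key names are hypothetical, and the sample lines are a few records from the example above.

```python
PDB_TEXT = """\
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
"""


def parse_pdb(text):
    """Return (atoms, bonds): atoms keyed by serial number, bonds as adjacency lists."""
    atoms, bonds = {}, {}
    for line in text.splitlines():
        record = line[0:6].strip()
        if record in ("ATOM", "HETATM"):
            atoms[int(line[6:11])] = {
                "name": line[12:16].strip(),
                "residue": line[17:20].strip(),
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
                "occupancy": float(line[54:60]),
                "b_factor": float(line[60:66]),     # temperature factor
            }
        elif record == "CONECT":
            # Serial numbers sit in consecutive 5-character fields.
            serials = [int(line[i:i + 5]) for i in range(6, len(line.rstrip()), 5)]
            bonds.setdefault(serials[0], []).extend(serials[1:])
    return atoms, bonds


atoms, bonds = parse_pdb(PDB_TEXT)
print(atoms[1510]["xyz"], bonds[1510])
```

Note that only the sulfate (a HETATM group) has CONECT records here; bonds within the protein residues are not listed and have to be imputed.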

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.

This uses internal coordinates

• Also called a Z-matrix

– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)

– Early form of specification of a starting geometry for molecules – sometimes people used graph paper to draw the molecule and get a starting set of coordinates before optimisation

– A Z-matrix uses the following geometric descriptions to describe molecules:

Bond length

Bond angle

Torsion angle

Out-of-plane bending, e.g. a carbonyl

Non-bonded distance

[Figure: internal coordinates of a molecule with atoms labelled C, O, N and OH, and dummy atom (du) positions marked]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

Note the 'improper' torsion angles.
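To show how internal coordinates map back to xyz, the sketch below converts a Z-matrix written in the style above into Cartesian coordinates. It is an illustration under stated assumptions: the first atom is put at the origin, the second on the z axis and the third in the xz plane, and later atoms are placed from distance, angle and dihedral by a standard geometric construction. All function names are hypothetical, and fixed constants after a second blank line are simply read as further variables.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    norm = math.sqrt(sum(x * x for x in a))
    return tuple(x / norm for x in a)


def _place(c, b, a, r, theta, phi):
    """Point at distance r from c, angle theta to b and dihedral phi to a."""
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))       # normal to the a-b-c plane
    m = _cross(n, bc)
    local = (-r * math.cos(theta),
             r * math.sin(theta) * math.cos(phi),
             r * math.sin(theta) * math.sin(phi))
    return tuple(c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
                 for i in range(3))


def zmatrix_to_cartesian(text):
    """Return (symbols, xyz) for a Z-matrix with a variable list after a blank line."""
    atom_part, _, variable_part = text.strip().partition("\n\n")
    variables = {}
    for line in variable_part.splitlines():
        if line.strip():
            name, number = line.split()
            variables[name] = float(number)

    def value(token):                       # variable name, -name, or a number
        if token in variables:
            return variables[token]
        if token.startswith("-") and token[1:] in variables:
            return -variables[token[1:]]
        return float(token)

    symbols, xyz = [], []
    for line in atom_part.splitlines():
        f = line.split()
        symbols.append(f[0])
        if len(f) == 1:                     # first atom: origin
            xyz.append((0.0, 0.0, 0.0))
        elif len(f) == 3:                   # second atom: on the z axis
            xyz.append((0.0, 0.0, value(f[2])))
        else:
            c, r = xyz[int(f[1]) - 1], value(f[2])
            b, theta = xyz[int(f[3]) - 1], math.radians(value(f[4]))
            if len(f) == 5:                 # third atom: helper point fixes the plane
                a, phi = (b[0] + 1.0, b[1], b[2]), 0.0
            else:
                a, phi = xyz[int(f[5]) - 1], math.radians(value(f[6]))
            xyz.append(_place(c, b, a, r, theta, phi))
    return symbols, xyz


def angle(p, q, r):
    """Angle p-q-r in degrees."""
    u, v = _unit(_sub(p, q)), _unit(_sub(r, q))
    return math.degrees(math.acos(max(-1.0, min(1.0, sum(x * y for x, y in zip(u, v))))))


WATER = """\
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
"""

symbols, xyz = zmatrix_to_cartesian(WATER)
print(symbols, round(math.dist(xyz[0], xyz[2]), 3), round(angle(xyz[1], xyz[0], xyz[2]), 1))
```

Changing one variable (say a torsion) and re-running the conversion moves every atom defined after it, which is why internal coordinates are convenient for driving a reaction coordinate.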

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.

• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.

• Chemical Markup Language (CML) is being developed specifically for chemistry.

• In the future, much more information will be stored with molecules, allowing greater re-use of data.

• See: Chemical Markup, XML, and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928

Ethanol

<CML>

- Can be parsed

- Can contain reactions, properties etc.

- Can contain relationships to other molecules and also concepts

InChI

InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES

C(C)O
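The SMILES and InChI strings above describe the same molecule, and the formula layer of the InChI (C2H6O) can be recovered from the SMILES by counting atoms and filling in implicit hydrogens. The sketch below does this for a deliberately tiny subset of SMILES (C, N and O atoms, branches and explicit bond symbols only; no rings, aromatic atoms, charges or brackets). It is an illustration with hypothetical function names, not a SMILES parser.

```python
from collections import Counter

VALENCE = {"C": 4, "N": 3, "O": 2}          # normal valences used for implicit H
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def formula_from_smiles(smiles):
    """Molecular formula for a SMILES using only C, N, O, branches and bond symbols."""
    atoms, used = [], []                    # element and bond order used, per atom
    stack, previous, order = [], None, 1
    for ch in smiles:
        if ch in VALENCE:
            atoms.append(ch)
            used.append(0)
            current = len(atoms) - 1
            if previous is not None:        # bond to the preceding atom
                used[previous] += order
                used[current] += order
            previous, order = current, 1
        elif ch in BOND_ORDER:
            order = BOND_ORDER[ch]
        elif ch == "(":                     # branch opens: remember where to return
            stack.append(previous)
        elif ch == ")":
            previous = stack.pop()
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    counts = Counter(atoms)
    counts["H"] = sum(VALENCE[a] - u for a, u in zip(atoms, used))
    ordered = [e for e in ("C", "H") if counts[e]]
    ordered += sorted(e for e in counts if e not in ("C", "H") and counts[e])
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in ordered)


print(formula_from_smiles("C(C)O"), formula_from_smiles("CCO"))
```

Both "C(C)O" and "CCO" are valid SMILES for ethanol and give the same formula, which illustrates that one molecule can have several SMILES strings.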

Molecules and 3 dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.

• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).

• From these can be obtained atom positions, bonds, coordinates etc.

• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.

• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.

• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).

Molecules in 3D - uses

• More accurate calculation of molecular properties

• Comparison of the shapes (conformations) of molecules

• Comparison of the dynamics of molecules

• Calculation of bulk properties

• Simulation of chemical reactions ...

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM

http://ambermd.org

http://www.ks.uiuc.edu/Research/namd

http://www.ks.uiuc.edu/Research/vmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor.

Apelin's role in cancer

'Apelin' (a peptide receptor) is involved in cancer, therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253

• Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Rayalam et al., Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats – which could be a real pain, but they can be converted

alc -- Alchemy file                              prep -- Amber PREP file
bs -- Ball & Stick file                          caccrt -- Cacao Cartesian file
ccc -- CCC file                                  c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                  cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file     crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                         dmol -- DMol3 Coordinates file
feat -- Feature file                             gam -- GAMESS Output file
gamout -- GAMESS Output file                     gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                        qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                        jout -- Jaguar Output file
bin -- OpenEye Binary file                       mmd -- MacroModel file
mmod -- MacroModel file                          out -- MacroModel file
dat -- MacroModel file                           car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                         sd -- MDL Isis SDF file
mdl -- MDL Molfile file                          mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                   mopout -- MOPAC Output file
mmads -- MMADS file                              mpqc -- MPQC file
bgf -- MSI BGF file                              nwo -- NWChem Output file
pdb -- PDB file                                  ent -- PDB file
pqs -- PQS file                                  qcout -- Q-Chem Output file
res -- ShelX file                                ins -- ShelX file
smi -- SMILES file                               mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                       vmol -- ViewMol file

alc -- Alchemy file                              bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                   cacint -- Cacao Internal file
cache -- CAChe MolStruct file                    c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                  ct -- ChemDraw Connection Table file
cht -- Chemtool file                             cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file     crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                            box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                   feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                  gamin -- GAMESS Input file
inp -- GAMESS Input file                         gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                       gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                       gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                        jin -- Jaguar Input file
bin -- OpenEye Binary file                       mmd -- MacroModel file
mmod -- MacroModel file                          out -- MacroModel file
dat -- MacroModel file                           sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                          mdl -- MDL Molfile file
mol -- MDL Molfile file                          mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                              bgf -- MSI BGF file
csr -- MSI Quanta CSR file                       nw -- NWChem Input file
pdb -- PDB file                                  ent -- PDB file
pov -- POV-Ray Output file                       pqs -- PQS file
report -- Report file                            qcin -- Q-Chem Input file
smi -- SMILES file                               fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                          txyz -- Tinker XYZ file
txt -- Titles file                               unixyz -- UniChem XYZ file
vmol -- ViewMol file                             xed -- XED file
xyz -- XYZ file                                  zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program – Babel

There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers – things to look out for, since what is stored is actually quite crude

For example:

• Stereochemistry may be relative and not absolute, or even incorrect.

• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

• Tautomers can be incorrect: check they look reasonable.

• Mesomers can be incorrect (double bonds).

• Polymers (which are mixtures of MWt and topology) are very difficult to store.

• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).

2 Finding molecules

A typical problem involves finding the 'right' molecules

All the molecules synthesisable: 10^60

Molecules with relevance to the problem (that we can generate): 10^10

Molecules we can search efficiently: 10^7

Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1 Finding molecules using substructure searching

One of the most useful 'properties' of a molecule is its molecular graph. Many software systems allow searching for whole molecules, fragments of molecules, or even 'similar' molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example: find a molecule containing our query as part of a larger molecule.

Results: it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

A set bit marks a fragment that is present (slide callouts: "Contains P", "Contains NH2"); an unset bit marks one that is absent ("Doesn't contain F").

So a quick match of the bitstring of the fragment searched against the (pre-computed) bitstrings in the database will eliminate most of the structures.
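The screening step can be sketched with integers used as bitstrings. In this illustration the list of fragment keys and the tiny "database" are made up, and each molecule is assumed to come with a pre-computed set of fragments; a real system would derive these from the structures.

```python
# Hypothetical fragment keys: one bit per fragment.
FRAGMENT_KEYS = ["OH", "NH2", "COOH", "phenyl", "P", "F"]


def fingerprint(fragments):
    """Pack the fragments that are present into an integer bitstring."""
    bits = 0
    for position, key in enumerate(FRAGMENT_KEYS):
        if key in fragments:
            bits |= 1 << position
    return bits


# Assumed pre-computed fragment sets for three molecules.
DATABASE = {
    "tyrosine": {"OH", "NH2", "COOH", "phenyl"},
    "phenol": {"OH", "phenyl"},
    "glycine": {"NH2", "COOH"},
}
FINGERPRINTS = {name: fingerprint(frags) for name, frags in DATABASE.items()}


def screen(query_fragments):
    """Keep only molecules whose bitstring has every bit of the query set."""
    query = fingerprint(query_fragments)
    return [name for name, bits in FINGERPRINTS.items() if (bits & query) == query]


print(screen({"OH", "phenyl"}))
```

The screen can only rule molecules out: a survivor has all the required fragments somewhere, but not necessarily connected as in the query, so the slower graph matching is still needed on the survivors.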

Next step – substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware salts etc.: you may need to strip out counterions.
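A simple way to strip counterions before an exact match is to split the SMILES at the "." that separates disconnected components and keep the largest one. The sketch below uses that heuristic; the function name is hypothetical, the atom count is a rough token count rather than a proper parse, and "keep the largest fragment" is an assumption that will not suit every salt.

```python
import re

# Rough atom tokens: bracket atoms, two-letter halogens, then one-letter atoms.
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFIbcnops]")


def largest_fragment(smiles):
    """Return the dot-separated component with the most atom tokens."""
    fragments = smiles.split(".")
    return max(fragments, key=lambda frag: len(ATOM_TOKEN.findall(frag)))


print(largest_fragment("CC(=O)[O-].[Na+]"))   # sodium acetate
```

After stripping, the remaining string still has to be canonicalised by the same software as the database entries before the two strings can be compared.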

Substructure search

Supposing we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could e.g. look for all the molecules with the phenol fragment.

There are two steps commonly used to do this: first, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical; then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.

– –OH

– –NH2

– –COOH

– phenyl

• This can be done by tracing appropriate paths in the graph

• Subgraphs may overlap

[Figure: example structure with the groups OH, CH2, CHNH2, OH and O labelled]

Matching the query structure to the database

Two algorithms are commonly used:

The Morgan algorithm, which numbers the molecule uniquely, and

the Ullmann algorithm, which matches the fragments.

Note – a number of 'substructural' fragments can be matched here (6 in all)

Read these for more details:

Gund, P. Ann. Rep. in Med. Chem. 14, 299, 1979

Sheridan, R. P. et al. J. Chem. Inf. Comp. Sci. 29, 255, 1989

Brint, A. T.; Willett, P. J. Mol. Graphics 5, 49, 1987

J. R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976. http://portal.acm.org/citation.cfm?id=321925

Morgan, H. L. The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113

The basic concept is that molecules are considered as graphs:

Atoms are 'nodes'

Bonds are 'edges'

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service. J. Chemical Documentation 5, 107-113, 1965

The first step - numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged

• Numbering starts from the highest, then uses the next highest connection as 2 etc. Like an onion: concentric circles from the highest numbered

• Rules are used to break equalities, such as C before N, double bond before single etc.

• Some equivalent atoms are arbitrarily numbered
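The iteration can be sketched as follows. This is a simplified version under stated assumptions: the starting label is just the degree of each atom (atomic number and bond types are ignored), each update replaces a label by the sum of the neighbours' labels, and the labels from the last round that increased the number of classes are returned. Tie-breaking rules and the final renumbering are not implemented, and the function name is hypothetical.

```python
def morgan_labels(neighbours):
    """Return {atom: label} after iterating until the class count stops growing."""
    labels = {atom: len(nbrs) for atom, nbrs in neighbours.items()}   # degree
    n_classes = len(set(labels.values()))
    while True:
        updated = {atom: sum(labels[b] for b in nbrs)
                   for atom, nbrs in neighbours.items()}
        n_updated = len(set(updated.values()))
        if n_updated <= n_classes:          # no new classes: keep previous labels
            return labels
        labels, n_classes = updated, n_updated


# Carbon skeleton of n-pentane as an adjacency list (hydrogens ignored).
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = morgan_labels(pentane)
print(labels, sorted(labels, key=labels.get, reverse=True))
```

For pentane the central carbon ends up with the highest label, so numbering would start there and work outwards; the two ends remain equivalent and would be numbered arbitrarily.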

Urea

Adjacency matrix

      O1  C2  C3  C4
O1     0   0   0   1
C2     0   0   0   1
C3     0   0   0   1
C4     1   1   1   0

Cyclohexanone adjacency matrix (7 × 7, atoms O1 and C2 to C7): row O1 has one entry (its bond to the carbonyl carbon C4), row C4 has three entries, and rows C2, C3, C5, C6 and C7 have two entries each.

Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching ('subgraph isomorphism') and is of the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures.
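Matching a query against a target using adjacency matrices can be sketched with a plain backtracking search. This is a simplified illustration in the spirit of the Ullmann approach, without its refinement step: candidates are filtered by atom label and degree, then extended atom by atom while checking that every query bond is present in the target. The function names are hypothetical, and the ring numbering used for cyclohexanone (C4-C2-C5-C6-C7-C3) is an assumption made for the example.

```python
def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 matrix from a list of (i, j) bonds, indices from 0."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix


def subgraph_matches(q_labels, q_adj, t_labels, t_adj):
    """Yield tuples mapping each query atom index to a target atom index."""
    q_deg = [sum(row) for row in q_adj]
    t_deg = [sum(row) for row in t_adj]
    candidates = [[j for j in range(len(t_labels))
                   if t_labels[j] == q_labels[i] and t_deg[j] >= q_deg[i]]
                  for i in range(len(q_labels))]
    mapping = []

    def extend(i):
        if i == len(q_labels):
            yield tuple(mapping)
            return
        for j in candidates[i]:
            if j in mapping:
                continue
            # Every query bond to an already-mapped atom must exist in the target.
            if all(not q_adj[i][k] or t_adj[j][mapping[k]] for k in range(i)):
                mapping.append(j)
                yield from extend(i + 1)
                mapping.pop()

    yield from extend(0)


# Query: the four-atom fragment O1, C2, C3, C4 (indices 0 to 3), C4 in the centre.
query_labels = ["O", "C", "C", "C"]
query_adj = adjacency_matrix(4, [(0, 3), (1, 3), (2, 3)])

# Target: cyclohexanone, atoms O1, C2 ... C7 as indices 0 to 6 (assumed ring order).
target_labels = ["O", "C", "C", "C", "C", "C", "C"]
target_adj = adjacency_matrix(7, [(0, 3), (3, 1), (3, 2), (1, 4), (4, 5), (5, 6), (6, 2)])

matches = list(subgraph_matches(query_labels, query_adj, target_labels, target_adj))
print(matches)
```

Two mappings are found because the two outer carbons of the query are equivalent and can be swapped. The search is exponential in the worst case, which is why the bitstring screen is applied first.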

Next lecture

• How can we use this type of information to solve our chemistry problems?

– Patents/Markush structures

– Molecular similarity

– Molecular properties

– 3D searching

– Virtual screening

– Structure–property/activity relationships


Molecular Informatics

Includes all aspects of the study of molecules on computers. Also includes chem(o)informatics.

This includes the representation of molecules, databases, display, simulation, prediction of their properties, and the discovery and design of new molecules and materials.

Molecular informatics is closely related to bioinformatics, computational chemistry, molecular modelling, simulation, machine learning and statistics, as well as online publications, but the area has principally been driven by investment in new methods for drug discovery, hence the concentration on small organic molecules.

Cambridge HPC

Places to find Molecular Informatics apps

• http://www.macinchem.org/mobilescience

• Molecules (e.g. RSC ChemSpider)

• Publishers (e.g. ACS/RSC mobile)

• Calculations (e.g. Yield101 for reactions)

• Visualisation (e.g. PyMOL for proteins)

J. Chem. Educ. 2013, 90 (3), pp 320–325

DOI: 10.1021/ed300329e

Cheminformatics 101

How do we store molecules on the computer?

There are estimated to be 10^60 possible small molecules that could be made. How do we find the best molecule for the problem we are addressing? Let's take a look "under the bonnet" at the way molecules are actually manipulated on the computer. You will be familiar with:

1. Trivial name, e.g. morphine

2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol

However, these names do not convey the structure of molecules in a way the computer can readily understand. We need to convert these into "machine-readable formats", which allow ease of searching based on the complexities of molecular structure. But what is a molecule?

Bear this in mind: molecules are complicated. When we look at this scene we add a huge amount of information from our senses and knowledge, but nearly all of it gets lost in computational representation.

Representing chemistry needs to be engineered to represent materials and processes. As you will see, we are moving in that direction, with more complete representations of molecules and materials.

[Figure: a real-life mixture, captioned "Not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol"]

What is a molecule?

Is it a series of connected points, a wave function, the sum of its properties?

In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.

Storing molecules: different methods for different purposes

Methods for storing molecules can conveniently be broken down into:

1-dimensional (simple, very compact and fast to access)

2-dimensional (contains the chemical diagram)

3-dimensional approaches (the shapes of molecules)

1D: line notations (a string of characters from a keyboard)

2D: molecular diagrams (graphs)

3D: graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the notes).

1-dimensional: line notations – a string representation of molecules. Here are three examples of different line notations. SMILES is the most useful and widely used. These are 'strings', all of the same molecule.

• Line notations

– WLN

• L66J BMR& DSWQ IN1&1

– ROSDAL

• 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24

– SMILES

• c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3

– IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid

Notice: all these notations use just the characters on a standard typewriter keyboard.

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?

It's artemisinin, which has anti-malarial properties (Nobel Prize).

This language is called SMILES.

A SMILES tutorial is available at

http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure; a SMILES will be available by picking the smiley face:

http://www.molinspiration.com/cgi-bin/properties?textMode=1

InChI

• A more recent line notation is called InChI.

• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.

• InChI is generated using computer algorithms and is virtually uninterpretable by humans.

• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.

• Websites are available that can generate InChI, or convert from InChI to structure, from different names and formats.

Again, a string like this is easily matched on a computer.

RSC ChemSpider

Storing chemical diagrams on computers

bull The valence model of a molecule can be represented by a

chemical graph A simple graph contains nodes (atoms) and edges

(bonds) joining pairs of nodes

bull The spacial position of the nodes length of the edges and

crossings are irrelevant Generally we ignore hydrogens unless

tautomerism or pKa is an issue Computers handle graphs very

well and molecules represented like this are examples of labelled

graphs (the atoms have names eg Oxygen)

bull Chemical structures are of course more complex than this and

aromaticity stereochemistry tautomerism non-stoichimetric

compounds etc are often problematic The computer would

(using a simple graph) deduce these two canonical structures are

different molecules

bull Eg to solve this we could introduce the concept of an aromatic

bond

An example of storing the molecular

graph SD file format

MDL SD (structure data) format files contain the following information and the

information about a molecule is stored in the following format

Header Block

describes the molecule eg itrsquos name

Connection table

defines the molecular structure (atoms and bonds)

Data block (optional)

Properties eg volume

Terminator line

a line containing four dollar signs ($$$$)

indicating the end of information on this molecule

This is probably the most common format to store small molecules ndash SD files in

software are widely used to store many molecule structures in databases

A ldquoformatrdquo in computer science is a precisely described order of data The program

reading it expects the information in exactly the right place (or it screws up)

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

eg PDB protein file format

eg SD file format

1

1

1

1

1

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

The Connection Table ndash describing bonds

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

(reduced)

eg PDB protein file format

eg SD file format

The Connection Table ndash describing bonds

Why store molecules in 2Dbull Quite often we only need the

chemical diagram eg to find a

molecule that matches a

chemical structure search

bull It is often the case that we donrsquot

know the conformation (shape)

of a molecule ndash so storing it in

3D would be pointless Look at

the changes in conformation in

this molecule at room

temperature

Example SD file - benzene

bull benzene

bull ACDLabs0812062058

bull August 2013

bull 6 6 0 0 0 0 0 0 0 0 1 V2000

bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 2 1 1 0 0 0 0

bull 3 1 2 0 0 0 0

bull 4 2 2 0 0 0 0

bull 5 3 1 0 0 0 0

bull 6 4 1 0 0 0 0

bull 6 5 2 0 0 0 0

bull M END

bull $$$$

Molecule nameInformation on this molecule

Comment (eg date is used here)

ldquocounts linerdquo has the

number of atoms and

bonds as a minimum

ldquoatom blockrdquo has xyz

coordinates of the atom and

element as a minimum

ldquobond blockrdquo 1-line for each bond

from atom - to atom ndash bond type

These identify the end of this molecule

Alanine SD file

Bond length is 153A

xyz symbol mass diff charge stereo h-counthellip

Charge

0 = uncharged or value

other than

these 1 = +3 2 = +2 3 =

+1

4 = doublet radical

5 = -1 6 = -2 7

= -3

Molecule is chiral

What if we only know the atom positions and not the bonds

A key example would be x-ray crystallography Here we determine the

positions of the atoms and impute the bonds from our chemical knowledge

Here is an electron density map from an x-ray experiment We see the

electrons but not the nuclei In a small molecule this is very accurate and we

can almost see the bonds However in eg a protein structure which is very

large we canrsquot always determine the atom positions exactly so these are

stored and the bonds imputed The format for storing these is a bit different

Which way

round should

this ring go

Example protein databank file (pdb)

(uses an adjacency matrix)

HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A

TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684

EXPDTA X-RAY DIFFRACTION

REMARK 2 RESOLUTION 180 ANGSTROMS

REMARK 200 TEMPERATURE (KELVIN) 113

REMARK 200 PH 69

SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN

SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO

CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4

ORIGX1 1000000 0000000 0000000 000000

ORIGX2 0000000 1000000 0000000 000000

ORIGX3 0000000 0000000 1000000 000000

SCALE1 0018319 0000000 0000000 000000

SCALE2 0000000 0018147 0000000 000000

SCALE3 0000000 0000000 0015426 000000

ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128

ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841

ATOM 3 C VAL A 1 4565 1067 -3593 100 2764

ATOM 4 O VAL A 1 4653 1287 -4853 100 2717

ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001

HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234

HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213

HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410

HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534

CONECT 1510 1511 1512 1513 1514

CONECT 1511 1510

CONECT 1512 1510

MASTER 334 0 6 6 18 0 11 6 1583 1 95 15

END

Space group and unit cell

dimensions

Protein atoms ndash including xyz

occupancy and temperature factor

Non-protein atoms ndash including xyz

occupancy and temperature factor

Bonds New format

mmCIF

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.

Internal Coordinates

• Also called a Z-matrix
  - Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
  - An early way of specifying a starting geometry for molecules: sometimes graph paper was used to draw the molecule and get a starting set of coordinates before optimisation
  - A Z-matrix uses the following geometric descriptions to describe molecules:
    Bond length
    Bond angle
    Torsion angle
    Out-of-plane bending (e.g. a carbonyl)
    Non-bonded distance

[Figure: example structures with dummy atom (du) positions marked]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
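To see what the water Z-matrix encodes, it can be turned back into Cartesian coordinates. The sketch below is a minimal illustration for this three-atom case only: the function name `water_from_zmatrix` and the choice of axes are assumptions, and the distance is taken to be in ångströms and the angle in degrees.

```python
import math


def water_from_zmatrix(l1, a1_degrees):
    """Place O, H, H from one bond length and one bond angle."""
    a1 = math.radians(a1_degrees)
    oxygen = (0.0, 0.0, 0.0)   # atom 1: put at the origin
    h1 = (0.0, 0.0, l1)        # atom 2: distance l1 from atom 1, along z
    # atom 3: distance l1 from atom 1, angle a1 to atom 2, placed in the xz plane
    h2 = (l1 * math.sin(a1), 0.0, l1 * math.cos(a1))
    return [("O", oxygen), ("H", h1), ("H", h2)]


atoms = water_from_zmatrix(0.96, 104.0)
hh_distance = math.dist(atoms[1][1], atoms[2][1])

for symbol, (x, y, z) in atoms:
    print(f"{symbol} {x:8.4f} {y:8.4f} {z:8.4f}")
print(f"H-H distance: {hh_distance:.3f}")
```

Two numbers (l1 and a1) fix the whole geometry, which is why changing a single internal coordinate is so much easier than editing nine Cartesian values consistently.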

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Slide labels: z-matrix; 'improper' torsion angles]

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML, and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci., 1999, 39, 928
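The point about metadata can be made concrete with a small XML fragment read by a standard parser. The fragment below is a simplified, CML-style sketch for ethanol, not a schema-valid CML document: the `molecule`, `atomArray`, `atom`, `bondArray` and `bond` element names follow CML conventions, while the flat `property` element and its `title` and `units` attributes are simplifications chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified CML-style description of ethanol (heavy atoms only).
CML_TEXT = """
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point" units="degC">78.4</property>
</molecule>
"""

root = ET.fromstring(CML_TEXT)
elements = [atom.get("elementType") for atom in root.iter("atom")]
bonds = [tuple(bond.get("atomRefs2").split()) for bond in root.iter("bond")]

# The value travels together with its context: what it is and in which units.
prop = root.find("property")
metadata = (prop.get("title"), float(prop.text), prop.get("units"))

print(elements)
print(bonds)
print(metadata)
```

Because the file is extensible, a further element (a spectrum, a reaction, a link to another molecule) can be added without breaking a reader that only looks for atoms and bonds.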

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts

InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES
C(C)O

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their Cartesian coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (x-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
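A systematic torsion scan can be sketched as a regular grid over every rotatable bond. The code below only enumerates the angle combinations, to show how quickly the search space grows with the number of torsions; the names `torsion_grid` and `grid_size` are illustrative, and real conformer generators also build the geometry, score each conformer and prune clashes.

```python
from itertools import product


def torsion_grid(n_torsions, step_degrees):
    """Every combination of torsion angles on a regular grid (0 <= angle < 360)."""
    angles = range(0, 360, step_degrees)
    return product(angles, repeat=n_torsions)


def grid_size(n_torsions, step_degrees):
    """Number of grid points: (angles per torsion) ** (number of torsions)."""
    return (360 // step_degrees) ** n_torsions


two_torsions = list(torsion_grid(2, 120))
print(two_torsions)

for n in (1, 2, 5, 10):
    print(n, "torsions at 30 degree steps:", grid_size(n, 30), "conformers")
```

With two torsions the grid is the familiar two-dimensional map; each extra rotatable bond multiplies the number of points by the number of angle steps.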

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions...

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor.

Apelin's role in Cancer

'Apelin' (a peptide receptor) is involved in cancer, therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain, 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery, 2011, 6(3), 367-372

So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file                              prep -- Amber PREP file
bs -- Ball & Stick file                          caccrt -- Cacao Cartesian file
ccc -- CCC file                                  c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                  cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file     crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                         dmol -- DMol3 Coordinates file
feat -- Feature file                             gam -- GAMESS Output file
gamout -- GAMESS Output file                     gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                        qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                        jout -- Jaguar Output file
bin -- OpenEye Binary file                       mmd -- MacroModel file
mmod -- MacroModel file                          out -- MacroModel file
dat -- MacroModel file                           car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                         sd -- MDL Isis SDF file
mdl -- MDL Molfile file                          mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                   mopout -- MOPAC Output file
mmads -- MMADS file                              mpqc -- MPQC file
bgf -- MSI BGF file                              nwo -- NWChem Output file
pdb -- PDB file                                  ent -- PDB file
pqs -- PQS file                                  qcout -- Q-Chem Output file
res -- ShelX file                                ins -- ShelX file
smi -- SMILES file                               mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                       vmol -- ViewMol file

alc -- Alchemy file                              bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                   cacint -- Cacao Internal file
cache -- CAChe MolStruct file                    c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                  ct -- ChemDraw Connection Table file
cht -- Chemtool file                             cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file     crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                            box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                   feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                  gamin -- GAMESS Input file
inp -- GAMESS Input file                         gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                       gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                       gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                        jin -- Jaguar Input file
bin -- OpenEye Binary file                       mmd -- MacroModel file
mmod -- MacroModel file                          out -- MacroModel file
dat -- MacroModel file                           sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                          mdl -- MDL Molfile file
mol -- MDL Molfile file                          mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                              bgf -- MSI BGF file
csr -- MSI Quanta CSR file                       nw -- NWChem Input file
pdb -- PDB file                                  ent -- PDB file
pov -- POV-Ray Output file                       pqs -- PQS file
report -- Report file                            qcin -- Q-Chem Input file
smi -- SMILES file                               fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                          txyz -- Tinker XYZ file
txt -- Titles file                               unixyz -- UniChem XYZ file
vmol -- ViewMol file                             xed -- XED file
xyz -- XYZ file                                  zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a 'standard' format. However, in the near future there are likely to be many competing requirements for file content.

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect: check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).

2 Finding molecules

A typical problem involves finding the 'right' molecules

• All the molecules synthesisable: 10^60
• Molecules with relevance to the problem (that we can generate): 10^10
• Molecules we can search efficiently: 10^7
• Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1 Finding molecules using substructure searching

One of the most useful 'properties' of a molecule is its molecular graph. Many software systems allow searching for whole molecules, or fragments of molecules, or even 'similar' molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example: find a molecule containing our query as part of a larger molecule.

Results: it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly the whole database is not being searched. The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in. This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

Each bit stands for one fragment: a set bit (1) records e.g. 'contains P' or 'contains NH2', and an unset bit (0) records e.g. 'doesn't contain F'.

So a quick match of the bitstring of the fragment searched against the (pre-computed) bitstrings in the database will eliminate most of the structures.
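The bitstring pre-screen can be sketched with plain integers used as bit sets. Everything named here is hypothetical: the five-fragment key `FRAGMENT_BITS`, the helper names, and the tiny three-compound database whose fragment lists were written by hand rather than computed from structures.

```python
# Hypothetical fragment key: one bit position per fragment.
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "phenyl": 4}


def fingerprint(fragments):
    """Set one bit for every fragment present in the molecule."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << FRAGMENT_BITS[fragment]
    return bits


def may_contain(candidate_bits, query_bits):
    """True if every bit set in the query is also set in the candidate."""
    return (candidate_bits & query_bits) == query_bits


# Pre-computed once, when each compound is registered in the database.
database = {
    "aniline": fingerprint(["NH2", "phenyl"]),
    "phenol": fingerprint(["OH", "phenyl"]),
    "fluorobenzene": fingerprint(["F", "phenyl"]),
}

query = fingerprint(["NH2"])
survivors = [name for name, bits in database.items() if may_contain(bits, query)]
print(survivors)
```

The screen can only rule compounds out. Anything that survives still has to go through the slower atom-by-atom substructure match.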

Next step: substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure. Beware: salts etc. may need their counterions stripped out first.
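In SMILES, disconnected components such as a drug and its counterion are separated by a full stop, so a crude way to strip counterions is to keep only the largest component. The sketch below is a rough heuristic and not how a cheminformatics toolkit does it: the function name `strip_counterions` is illustrative, and counting letters is only an approximate measure of component size.

```python
def strip_counterions(smiles):
    """Keep the largest '.'-separated component of a SMILES string.

    Size is estimated by counting letters, a rough stand-in for atom count.
    """
    components = smiles.split(".")
    return max(components, key=lambda part: sum(ch.isalpha() for ch in part))


print(strip_counterions("CC(=O)[O-].[Na+]"))   # sodium acetate
print(strip_counterions("[Na+].CC(=O)[O-]"))   # same salt, other order
print(strip_counterions("c1ccccc1"))           # no counterion to remove
```

After stripping, the remaining string still has to be put into canonical form before two records can be compared character by character.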

Substructure Search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
  - -OH
  - -NH2
  - -COOH
  - phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap

[Figure: example structure with OH, CH2, CHNH2, OH and O groups]

Matching the query structure to the database

Two algorithms are commonly used:
• The Morgan algorithm, which numbers the molecule uniquely, and
• The Ullmann algorithm, which matches the fragments

Note: a number of 'substructural' fragments can be matched here (6 in all).

Read these for more details:
• Gund P, Ann. Rep. in Med. Chem. 14, 299, 1979
• Sheridan RP et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989
• Brint AT, Willett P, J. Mol. Graphics 5, 49, 1987
• JR Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol 23, pp 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
• Morgan HL, The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-112

The basic concept is that molecules are considered as graphs:
• Atoms are 'nodes'
• Bonds are 'edges'

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107-113, 1965
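Steps (i) to (iii) can be sketched in a few lines. This is a simplified version under stated assumptions: the initial label is the degree only (atomic number and bond types are ignored), the molecule is given as a neighbour list, and the name `morgan_labels` is illustrative. Tie-breaking between atoms that end with equal labels is left out.

```python
def morgan_labels(neighbours):
    """Simplified Morgan labelling; neighbours[i] lists the atoms bonded to atom i."""
    labels = [len(adjacent) for adjacent in neighbours]   # step (i): degree only
    n_classes = len(set(labels))
    while True:
        # step (ii): replace each label by the sum of its neighbours' labels
        updated = [sum(labels[j] for j in adjacent) for adjacent in neighbours]
        n_updated = len(set(updated))
        # step (iii): stop once the number of distinct labels no longer increases
        if n_updated <= n_classes:
            return labels
        labels, n_classes = updated, n_updated


# Carbon skeleton of n-pentane: 0-1-2-3-4
pentane = [[1], [0, 2], [1, 3], [2, 4], [3]]
labels = morgan_labels(pentane)
ranking = sorted(range(len(labels)), key=lambda i: -labels[i])

print(labels)    # one label per atom
print(ranking)   # atoms ordered from highest label downwards
```

For the pentane chain the first update separates the middle carbon from its two neighbours, and the next update would merge classes again, so the labels from the first update are the ones kept.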

The first step - Numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged.
• Numbering starts from the highest label, then uses the next highest connection as 2 etc., like an onion: concentric circles from the highest-numbered atom.
• Rules are used to break equalities, such as C before N, double bond before single etc.
• Some equivalent atoms are arbitrarily numbered.

Urea adjacency matrix (atoms labelled O1, C2, C3, C4 as on the slide)

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

Cyclohexanone adjacency matrix (atoms O1, C2 to C7): each row holds a 1 for every bonded neighbour, so O1 has one entry, the carbonyl carbon C4 has three, and the other ring carbons have two each.

Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching, 'subgraph isomorphism', and is of the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures.
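The matching step can be illustrated with a plain backtracking search over adjacency matrices. This is a naive sketch and not Ullmann's algorithm itself, which adds a refinement step that prunes impossible atom pairings before and during the search. The function names are illustrative, bond orders are ignored, and urea is written with its real element symbols using zero-based indices 0 to 3 for atoms 1 to 4.

```python
def adjacency(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of bonded atom pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix


def find_substructure(query_atoms, query_bonds, target_atoms, target_bonds):
    """Return a query-to-target atom mapping, or None if there is no match."""
    q_adj = adjacency(len(query_atoms), query_bonds)
    t_adj = adjacency(len(target_atoms), target_bonds)

    def extend(mapping):
        q = len(mapping)                       # next query atom to place
        if q == len(query_atoms):
            return dict(enumerate(mapping))
        for t in range(len(target_atoms)):
            if t in mapping or target_atoms[t] != query_atoms[q]:
                continue
            # every bond from q to an already placed query atom must exist in the target
            if all(t_adj[t][mapping[p]] for p in range(q) if q_adj[q][p]):
                result = extend(mapping + [t])
                if result is not None:
                    return result
        return None

    return extend([])


urea_atoms = ["O", "N", "N", "C"]
urea_bonds = [(0, 3), (1, 3), (2, 3)]

match = find_substructure(["N", "C", "N"], [(0, 1), (1, 2)], urea_atoms, urea_bonds)
no_match = find_substructure(["O", "N"], [(0, 1)], urea_atoms, urea_bonds)
print(adjacency(4, urea_bonds))
print(match, no_match)
```

In the worst case the search tries every assignment of query atoms to target atoms, which is why the cheap bitstring screen is applied first and why pruning matters.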


Next lecture

• How can we use this type of information to solve our chemistry problems?
  - Patents/Markush structures
  - Molecular similarity
  - Molecular properties
  - 3D searching
  - Virtual screening
  - Structure-Property/Activity Relationships


Places to find Molecular Informatics apps

• http://www.macinchem.org/mobilescience
• Molecules (e.g. RSC ChemSpider)
• Publishers (e.g. ACS/RSC mobile)
• Calculations (e.g. Yield101 for reactions)
• Visualisation (e.g. PyMOL for proteins)

J. Chem. Educ. 2013, 90 (3), pp 320-325, DOI: 10.1021/ed300329e

Cheminformatics 101

How do we store molecules on the computer?

There are estimated to be 10^60 possible small molecules that could be made. How do we find the best molecule for the problem we are addressing? Let's take a look "under the bonnet" at the way molecules are actually manipulated on the computer. You will be familiar with:

1. Trivial name, e.g. Morphine
2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol

However, these names do not convey the structure of molecules in a way the computer can readily understand. We need to convert these into "machine readable formats", which allow ease of searching based on the complexities of molecular structure. But what is a molecule?

Bear this in mind: molecules are complicated. When we look at this scene we add a huge amount of information from our senses and knowledge, but it nearly all gets lost in computational representation.

Representing chemistry needs to be engineered to represent materials and processes. As you will see, we are moving in that direction, with more complete representations of molecules and materials.

[Figure: a real-life mixture, not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol]

What is a molecule? Is it
• a series of connected points?
• a wave function?
• the sum of its properties?

In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.

Storing molecules: different methods for different purposes

Methods for storing molecules can conveniently be broken down into:
• 1-Dimensional (simple, very compact and fast to access)
• 2-Dimensional (contains the chemical diagram)
• 3-Dimensional approaches (the shapes of molecules)

1D: line notations (a string of characters from a keyboard)
2D: molecular diagrams (graphs)
3D: graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the notes).

1-Dimensional - Line notations: a string representation of molecules. Here are three examples of different line notations. SMILES is the most useful and widely used. These are 'strings' all of the same molecule.

• Line notations
  - WLN: L66J BMR& DSWQ IN1&1
  - ROSDAL: 1=-5-=10=510-11-11N-12-=17=123-18S-19O18=20O18=21O8-22N-2322-24
  - SMILES: c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
  - IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid

Notice: all these notations use just the characters on a standard typewriter keyboard.

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?

It's artemisinin, which has anti-malarial properties (Nobel Prize).

This language is called SMILES.

A SMILES tutorial is available at http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure, and a SMILES will be available by picking the smiley face: http://www.molinspiration.com/cgi-bin/properties?textMode=1

InChI

• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate InChI, or convert from InChI to structure, from different names and formats (e.g. RSC ChemSpider).

Again, a string like this is easily matched on a computer.

Storing chemical diagrams on computers

• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• To solve this we could, e.g., introduce the concept of an aromatic bond.

An example of storing the molecular graph: SD file format

MDL SD (structure data) files store the information about a molecule in the following format:

• Header block: describes the molecule, e.g. its name
• Connection table: defines the molecular structure (atoms and bonds)
• Data block (optional): properties, e.g. volume
• Terminator line: a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format for storing small molecules: SD files are widely used in software to store many molecular structures in databases.

A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).
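Because every molecule in an SD file ends with a `$$$$` line, a file holding many molecules can be cut into records with a simple loop. The sketch below assumes the whole file fits in memory; the function name `split_sd_records` and the two placeholder records (empty connection tables and a made-up `volume` value) are for illustration only.

```python
def split_sd_records(text):
    """Split the text of an SD file into one string per molecule."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":          # terminator: close the current record
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


# Two placeholder records; the connection tables are deliberately empty.
sd_text = "\n".join([
    "mol A", "", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
    "> <volume>",
    "12.3",
    "",
    "$$$$",
    "mol B", "", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
    "$$$$",
])

records = split_sd_records(sd_text)
names = [record.splitlines()[0] for record in records]
print(len(records), names)
```

Each record can then be handed to a connection-table reader on its own, so one badly formed molecule does not stop the rest of the file from loading.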

The Connection Table: describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format:
Connect 3 2 5 4

e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Annotations:
• Line 1: molecule name
• Line 2: information on this molecule
• Line 3: comment (e.g. the date is used here)
• The "counts line" has the number of atoms and bonds as a minimum
• The "atom block" has the xyz coordinates of the atom and the element as a minimum
• The "bond block" has one line for each bond: from atom, to atom, bond type
• "M END" and "$$$$" identify the end of this molecule
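The blocks described above can be read back with a short parser. This is a minimal sketch, not a full reader: it assumes a V2000 connection table with a three-line header, takes the atom and bond counts from the first two three-character fields of the counts line, and splits atom and bond lines on whitespace, which real fixed-width files do not always allow. The function name `parse_molfile` and the second header line are illustrative; the coordinates are the benzene values with four decimal places assumed.

```python
BENZENE_MOLFILE = "\n".join([
    "benzene",
    "  illustrative program line",
    "",
    "  6  6  0  0  0  0  0  0  0  0  1 V2000",
    "    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "  2  1  1  0  0  0  0",
    "  3  1  2  0  0  0  0",
    "  4  2  2  0  0  0  0",
    "  5  3  1  0  0  0  0",
    "  6  4  1  0  0  0  0",
    "  6  5  2  0  0  0  0",
    "M  END",
])


def parse_molfile(text):
    """Return (name, atoms, bonds) from a V2000 molfile (simplified reader)."""
    lines = text.splitlines()
    name = lines[0].strip()
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])   # counts line
    atoms = []
    for line in lines[4:4 + n_atoms]:                           # atom block
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:       # bond block
        first, second, order = (int(value) for value in line.split()[:3])
        bonds.append((first, second, order))
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(BENZENE_MOLFILE)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
print(bonds)
```

The bond list is exactly the labelled graph discussed earlier: atom numbers are the nodes, and each bond line is an edge carrying its bond type.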

Alanine SD file

Annotations:
• Bond length is 1.53 Å
• Atom line fields: x, y, z, symbol, mass difference, charge, stereo, H-count, ...
• Charge codes: 0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1, 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3
• Molecule is chiral

What if we only know the atom positions and not the bonds?

A key example would be x-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an x-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

[Figure: electron density around a ring. Which way round should this ring go?]

Example protein databank file (pdb)

(uses an adjacency matrix)

HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A

TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684

EXPDTA X-RAY DIFFRACTION

REMARK 2 RESOLUTION 180 ANGSTROMS

REMARK 200 TEMPERATURE (KELVIN) 113

REMARK 200 PH 69

SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN

SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO

CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4

ORIGX1 1000000 0000000 0000000 000000

ORIGX2 0000000 1000000 0000000 000000

ORIGX3 0000000 0000000 1000000 000000

SCALE1 0018319 0000000 0000000 000000

SCALE2 0000000 0018147 0000000 000000

SCALE3 0000000 0000000 0015426 000000

ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128

ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841

ATOM 3 C VAL A 1 4565 1067 -3593 100 2764

ATOM 4 O VAL A 1 4653 1287 -4853 100 2717

ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001

HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234

HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213

HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410

HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534

CONECT 1510 1511 1512 1513 1514

CONECT 1511 1510

CONECT 1512 1510

MASTER 334 0 6 6 18 0 11 6 1583 1 95 15

END

Space group and unit cell

dimensions

Protein atoms ndash including xyz

occupancy and temperature factor

Non-protein atoms ndash including xyz

occupancy and temperature factor

Bonds New format

mmCIF

What if we want to vary the atom

positions eg in driving a

reaction coordinate

Using Cartesian (xyz)

coordinates is very

cumbersome so instead

we use the natural angles

and distances

This uses internal coordinates

bull Also called a Z-matrix

ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)

ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation

ndash A z-matrix uses the following geometric descriptions to describe molecules

Bond Length

Bond angle

Torsion angle

Out of plane bending

eg a carbonyl

Non-bonded distance

C

O

du

du

O

N OH

Dummy atom

positions

Internal

Coordinates

How to construct a Z-matrix (in Gaussian format)

1For the first atom to be defined give the atomic symbol

only

2For the second atom give the atomic symbol the number

1 and the name of a variable to describe the distance

between atoms 1 and 2

3For the third atom give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB and

the name of a variable to describe the angle between the

current atom NA and NB

4For all later atoms give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB the

name of a variable to describe the angle between the current

atom NA and NB the atom number of another previously

defined atom NC and finally the name of a variable to

describe the dihedral angle between the current atom NA

NB and NC

5After all the atoms have been listed enter a blank line

6Next list each variable with its corresponding value Use a

separate line for each variable

7In some cases where some of the variables are to be fixed

as constants in a geometry optimisation they are listed here

after a blank line rather than above

with the real variables

8End the Z-matrix with a blank line

Water (C2v)

O

H 1 l1

H l l1 2 a1

l1 096

a1 1040

Methanol Z-matrix

C

O 1 l1

H 1 l2 2 a1

H 1 l3 2 a2 3 da1

H 1 l3 2 a2 3 -da1

H 2 l4 1 a3 3 1800

l1 142

l2 109

l3 109

l4 109

l5 109

l6 10

a1 1090

a2 1100

a3 1080

a4 1100

a5 1100

da1 600

da2 1200

da3 600 z-matrix

lsquoimproperrsquo torsion angles

What about comprehensive

properties of molecules ndash they

are more than xyz

XML and molecules

bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc

bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file

bull Chemical Markup Language (CML) is being developed specifically for chemistry

bull In the future much more information will be stored with molecules allowing greater re-use of data

bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928

Ethanol

ltCMLgt

-Can be parsed

-Can contain reactions

properties etc

-Can contain

relationships to other

molecules and also

concepts

InChI

InChI=1C2H6Oc1-2-3h3H2H21H3

SMILES

C(C)O

Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not

really flat because of thermal fluctuations So we represent 3D

molecules by including their coordinates or their internal coordinates

bull Obtaining the 3-dimensional coordinates can involve experiment (x-

ray electron or neutron diffraction eg the Cambridge

Crystallographic Database or the Protein Databank ndash PDB)

bull From these can be obtained atom positions bonds coordinates etc

bull There are a number of 3-D construction methods available such as

Corina or Concord (put in a SMILES and get a 3-D molecule) which

use rules derived from experiment

bull Molecules can also be constructed in 2D and subjected to molecular

mechanics or Quantum Mechanics calculations to obtain 3D structures

bull Conformation still remains to be deduced There are many methods

that deduce conformations usually involving torsional angle rotation

to scan the conformational space (like a Ramachandran plot ndash but often

in many more torsional angle dimensions)

Molecules in 3D - uses

bull More accurate calculation of molecular

properties

bull Comparison of the shapes (conformations)

of molecules

bull Comparison of the dynamics of molecules

bull Calculation of bulk properties

bull Simulation of chemical reactionshelliphellip

The dynamics of a molecule can be computed and stored as a series of

frames of the coordinates and bonds of the structure (much like a cartoon)

Molecular Dynamics is the most popular method for larger systems Here is

an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a

series of snapshots of SD files concatenated together to make a movie just

like a film strip

httpsenwikipediaorgwikiCHARMM

httpambermdorg

httpwwwksuiuceduResearchnamd

httpwwwksuiuceduResearchvmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called ‘Apelin’, with water and the membrane. We used these to design new molecules that block this receptor

Apelin’s role in Cancer

‘Apelin’ (a peptide receptor) is involved in cancer – therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253

• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats – which could be a real pain, but they can be converted

alc -- Alchemy file prep -- Amber PREP file

bs -- Ball & Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI Biosym/Insight II CAR file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball & Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file

cht -- Chemtool file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 3.5 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program – Babel

Recent IUPAC moves towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content

Molecules on computers – things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect

In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding)

Tautomers can be incorrect – check they look reasonable

Mesomers can be incorrect (double bonds)

Polymers (which are mixtures of MWt and topology) are very difficult to store

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but are not observed in the file)

2 Finding molecules

A typical problem involves finding the ‘right’ molecules

All the molecules synthesisable: 10^60

Molecules with relevance to the problem: 10^10 (that we can generate)

Molecules we can search efficiently: 10^7

Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions

1 Finding molecules using Substructure searching

One of the most useful ‘properties’ of a molecule is its molecular graph. Many software systems allow searching for whole molecules, or fragments of molecules, or even ‘similar’ molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we’ve used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example – find a molecule containing our query as part of a larger molecule

Results – takes a few seconds to search all 92 million molecules

How does this work?

The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
 0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

Contains P

Contains NH2

Doesn’t contain F

So a quick match of the bitstring of the fragment searched against the (pre-computed) bitstrings in the database will eliminate most of the structures
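The bitstring screen can be sketched in a few lines of Python. This is a minimal illustration, not how PubChem implements it: the fragment dictionary, the bit positions and the three example fingerprints are all made up for the example.

```python
# Hypothetical fragment dictionary: fragment name -> bit position (0-based).
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "phenyl": 4, "COOH": 5}


def fingerprint(fragments):
    """Return an integer whose set bits mark the fragments that are present."""
    bits = 0
    for name in fragments:
        bits |= 1 << FRAGMENT_BITS[name]
    return bits


def may_contain(molecule_fp, query_fp):
    """Screen: every bit set in the query must also be set in the molecule."""
    return (molecule_fp & query_fp) == query_fp


def screen(database, query_fragments):
    """Return the names of database entries that survive the bitstring screen."""
    query_fp = fingerprint(query_fragments)
    return [name for name, fp in database.items() if may_contain(fp, query_fp)]


# Pre-computed fingerprints (fragment assignments chosen by hand for illustration).
DATABASE = {
    "aniline": fingerprint(["NH2", "phenyl"]),
    "phenol": fingerprint(["OH", "phenyl"]),
    "tyrosine": fingerprint(["NH2", "OH", "phenyl", "COOH"]),
}

print(screen(DATABASE, ["OH", "phenyl"]))
```

The survivors are only candidates: the screen can never reject a true hit, but a molecule that passes still has to go through the full substructure match.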

Next step – Substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure

Beware: salts etc. – may need to strip out counterions
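One simple way to strip counterions before an exact match is to keep only the largest dot-separated component of a SMILES string. The helper below is a rough heuristic written for this note (the function name is hypothetical, and counting letters only approximates counting atoms); a cheminformatics toolkit would do this on the parsed molecule instead.

```python
def largest_fragment(smiles):
    """Return the dot-separated SMILES component with the most letters.

    Counting letters is a crude stand-in for counting atoms (it counts
    'Cl' as two), so treat this as a heuristic only.
    """
    fragments = smiles.split(".")
    return max(fragments, key=lambda frag: sum(ch.isalpha() for ch in frag))


print(largest_fragment("CN(C)c1ccccc1.Cl"))  # dimethylaniline hydrochloride
print(largest_fragment("[Na+].CC(=O)[O-]"))  # sodium acetate
```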

Substructure Search

Supposing we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this: firstly we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical; then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• ‘Subgraphs’ can be identified in a structure graph corresponding to fragments of the whole structure, e.g. rings, substituents, etc.
  – –OH
  – –NH2
  – –COOH
  – phenyl

• this can be done by tracing appropriate paths in the graph

• subgraphs may overlap


Matching the query structure to the database

Two algorithms are commonly used:

The Morgan algorithm, which numbers the molecule uniquely, and

The Ullmann algorithm, which matches the fragments

Note – a number of ‘substructural’ fragments can be matched here (6 in all)

Read these for more details:

Gund P, Ann Rep in Med Chem 14:299, 1979

Sheridan RP et al., J Chem Inf Comp Sci 29:255, 1989

Brint AT, Willett P, J Mol Graphics 5:49, 1987

JR Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol 23, pp 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925

Morgan HL, The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113

The basic concept is that molecules are considered as graphs

Atoms are ‘nodes’

Bonds are ‘edges’

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond)

The Morgan algorithm iteratively computes an integer label for each node in a structure (atoms). For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node i, considering its atomic number, degree (number of substituents) and types of bonds;

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H L Morgan, The generation of a unique machine description for chemical structures - A technique developed at Chemical Abstracts Service, J Chemical Documentation 5:107–113, 1965
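Steps (i) to (iii) can be sketched as follows. This is a simplified version written for illustration: the initial label is just the degree of each atom (atomic number and bond types are ignored), each update replaces a label by the sum of the neighbours' labels, and the molecule (the carbon skeleton of 2-methylpentane) and its atom indices are chosen for the example.

```python
def morgan_labels(adjacency):
    """Simplified Morgan labels for a graph given as atom -> list of neighbours."""
    # (i) initial label: the degree of each atom
    labels = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # (ii) new label: sum of the labels of the adjacent atoms
        new_labels = {
            atom: sum(labels[n] for n in nbrs) for atom, nbrs in adjacency.items()
        }
        new_n_classes = len(set(new_labels.values()))
        # (iii) stop once the number of different labels no longer increases
        if new_n_classes <= n_classes:
            return labels
        labels, n_classes = new_labels, new_n_classes


# Carbon skeleton of 2-methylpentane: C0-C1(-C5)-C2-C3-C4
SKELETON = {0: [1], 1: [0, 2, 5], 2: [1, 3], 3: [2, 4], 4: [3], 5: [1]}

labels = morgan_labels(SKELETON)
ranking = sorted(labels, key=lambda atom: (-labels[atom], atom))
print(labels)
print(ranking)
```

Atoms that end up with equal labels (here the two methyl groups on C1, and also C3) are the ties that the rule-based tie-breaking described on the slides has to resolve.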

The first step - Numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged

• Numbering starts from the highest, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest numbered

• Rules are used to break equalities, such as C before N, double bond before single, etc.

• Some equivalent atoms are arbitrarily numbered

Urea

Adjacency matrix

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

Cyclohexanone adjacency matrix (atoms O1, C2, C3, C4, C5, C6, C7): each row holds a 1 for every bonded neighbour, so the O1 row has one entry, the C4 row has three and every other carbon row has two.

Numbering is crucial. Renumber for different starting points. Also the ‘edges’ of the graph are annotated to speed up the matching. This is the ‘slow’ part of matching (‘subgraph isomorphism’) and is of the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures
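To make the adjacency-matrix picture concrete, here is a small sketch that builds the matrices and searches for the four-atom fragment inside cyclohexanone. Note that this is a plain backtracking search, not Ullmann's refinement procedure, and that it ignores bond orders. The ring order chosen for cyclohexanone and all function names are assumptions made for the example.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from (atom, atom) index pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b in bonds:
        matrix[a][b] = matrix[b][a] = 1
    return matrix


def find_matches(q_elems, q_adj, t_elems, t_adj):
    """Return all mappings of query atoms onto target atoms (as tuples).

    mapping[i] is the target atom matched to query atom i. Elements must
    agree and every query bond must exist in the target.
    """
    matches = []

    def extend(mapping):
        i = len(mapping)
        if i == len(q_elems):
            matches.append(tuple(mapping))
            return
        for cand in range(len(t_elems)):
            if cand in mapping or t_elems[cand] != q_elems[i]:
                continue
            # Every already-mapped query neighbour of atom i must be bonded to cand.
            if all(t_adj[cand][mapping[j]] for j in range(i) if q_adj[i][j]):
                mapping.append(cand)
                extend(mapping)
                mapping.pop()

    extend([])
    return matches


# Query: the four-atom fragment O1, C2, C3, C4 with C4 bonded to the other three.
query_elems = ["O", "C", "C", "C"]
query_adj = adjacency_matrix(4, [(0, 3), (1, 3), (2, 3)])

# Target: cyclohexanone with O1 on C4; the ring order used here is an assumption.
target_elems = ["O", "C", "C", "C", "C", "C", "C"]
target_adj = adjacency_matrix(
    7, [(0, 3), (3, 1), (1, 4), (4, 5), (5, 6), (6, 2), (2, 3)]
)

matches = find_matches(query_elems, query_adj, target_elems, target_adj)
print(matches)
```

The two mappings found differ only by swapping the two ring carbons next to the carbonyl, which shows why symmetric fragments are reported more than once. The worst-case cost of this kind of search grows very quickly with molecule size, which is why the bitstring screen is applied first.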


Next lecture

• How can we use this type of information to solve our chemistry problems?
  – Patents/Markush Structures
  – Molecular Similarity
  – Molecular Properties
  – 3D searching
  – Virtual Screening
  – Structure Property/Activity Relationships


Cheminformatics 101

How do we store molecules on the computer?

There are estimated to be 10^60 possible small molecules that could be made. How do we find the best molecule for the problem we are addressing? Let’s take a look “under the bonnet” of the way molecules are actually manipulated on the computer. You will be familiar with:

1 Trivial name, e.g. Morphine

2 IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol

However, these names do not convey the structure of molecules in a way the computer can readily understand. We need to convert these into “machine readable formats” which allow ease of searching based on the complexities of molecular structure. But what is a molecule?

Bear this in mind: molecules are complicated. When we look at this scene we add a huge amount of information from our senses and knowledge – but it nearly all gets lost in computational representation.

Representing chemistry needs to be engineered to represent materials and processes. As you will see, we are moving in that direction with more complete representations of molecules and materials.

Not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol

A real life mixture

What is a molecule?

is it a series of connected points?

a wave function?

the sum of its properties?

In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.

Storing molecules: different methods for different purposes

Methods for storing molecules can conveniently be broken down into

1-Dimensional (simple and very compact and fast to access)

2-Dimensional (contains the chemical diagram)

3-Dimensional approaches (the shapes of molecules)

1D: Line notations (a string of characters from a keyboard)

2D: Molecular Diagrams (Graphs)

3D: Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the notes)

1-Dimensional - Line notations – a string representation of molecules. Here are three examples of different line notations. SMILES is the most useful and widely used. These are ‘strings’, all of the same molecule.

• Line Notations
  – WLN
    • L66J BMR& DSWQ IN1&1
  – ROSDAL
    • 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
  – SMILES
    • c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
  – IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid

Notice – all these notations use just the characters on a standard typewriter keyboard

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a ‘molecule’ and is what is actually stored in the computer – does it make sense to you?

It’s Artemisinin, which has anti-malarial properties (Nobel Prize)

This language is called SMILES

A SMILES tutorial is available at http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practice by drawing a structure, and a SMILES will be available by picking the smiley face: http://www.molinspiration.com/cgi-bin/properties?textMode=1

InChI

• A more recent line notation is called InChI

• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES

• InChI is generated using computer algorithms and is virtually uninterpretable by humans

• Importantly, it is commonly used as a unique chemical identifier - each molecule should theoretically have a unique InChI. One molecule, one InChI

• Websites are available that can generate InChI / convert from InChI to structure from different names and formats

Again, a string like this is easily matched on a computer

RSC ChemSpider

Storing chemical diagrams on computers

• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes

• The spatial position of the nodes, length of the edges and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. Oxygen)

• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds, etc. are often problematic. The computer would (using a simple graph) deduce these two canonical structures are different molecules

• E.g. to solve this we could introduce the concept of an aromatic bond

An example of storing the molecular graph: SD file format

MDL SD (structure data) files store the information about a molecule in the following format:

Header Block
describes the molecule, e.g. its name

Connection table
defines the molecular structure (atoms and bonds)

Data block (optional)
Properties, e.g. volume

Terminator line
a line containing four dollar signs ($$$$) indicating the end of information on this molecule

This is probably the most common format to store small molecules – SD files are widely used in software to store many molecule structures in databases.

A “format” in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).

The Connection Table – describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph

e.g. PDB protein file format

Connect 3 2 5 4

e.g. SD file format

Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced)

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search

• It is often the case that we don’t know the conformation (shape) of a molecule – so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature

Example SD file - benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Line 1: Molecule name

Line 2: Information on this molecule

Line 3: Comment (e.g. date is used here)

“counts line” has the number of atoms and bonds as a minimum

“atom block” has xyz coordinates of the atom and element as a minimum

“bond block”: 1 line for each bond: from atom - to atom – bond type

“M END” and “$$$$”: these identify the end of this molecule
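Because the format is fixed-width, a reader only has to slice each line at the right columns. The sketch below reads the counts line, atom block and bond block of a V2000 molfile such as the benzene example; the function name is made up for this note, and it deliberately ignores everything a real reader must also handle (property lines, data blocks, multiple records).

```python
BENZENE = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_molfile(text):
    """Return (name, atoms, bonds) from the first V2000 record in text."""
    lines = text.splitlines()
    name = lines[0].strip()
    counts = lines[3]
    # Counts line: atom and bond counts occupy the first two 3-character fields.
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        # Atom line: three 10-character coordinates, a space, then the symbol.
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        atoms.append((line[31:34].strip(), x, y, z))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # Bond line: first atom, second atom, bond type (3 characters each).
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(BENZENE)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

Slicing by column rather than splitting on whitespace matters because neighbouring fields can run together, for example when a molecule has 100 or more atoms and bonds.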

Alanine SD file

Bond length is 1.53 Å

xyz, symbol, mass diff, charge, stereo, h-count…

Charge: 0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1, 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3

Molecule is chiral

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can’t always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

Which way round should this ring go?

Example protein databank file (pdb)

(uses an adjacency matrix)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

CRYST1: Space group and unit cell dimensions

ATOM: Protein atoms – including xyz, occupancy and temperature factor

HETATM: Non-protein atoms – including xyz, occupancy and temperature factor

CONECT: Bonds

New format: mmCIF
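As with SD files, PDB records are read by column position. The following sketch pulls the atoms and CONECT records out of a few lines taken from the example above. It is a minimal illustration with hypothetical function names; established libraries should be preferred for real work, and it assumes the lines are padded exactly as in the PDB format.

```python
import math

PDB_TEXT = """\
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
CONECT 1510 1511 1512 1513 1514
"""


def parse_atoms(pdb_text):
    """Read ATOM and HETATM records using the fixed PDB columns."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "serial": int(line[6:11]),
                "name": line[12:16].strip(),
                "residue": line[17:20].strip(),
                "chain": line[21],
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
                "b_factor": float(line[60:66]),
            })
    return atoms


def parse_conect(pdb_text):
    """Read CONECT records into {atom serial: [bonded atom serials]}."""
    bonds = {}
    for line in pdb_text.splitlines():
        if line.startswith("CONECT"):
            text = line.rstrip()
            serials = [int(text[i:i + 5]) for i in range(6, len(text), 5)]
            bonds.setdefault(serials[0], []).extend(serials[1:])
    return bonds


atoms = parse_atoms(PDB_TEXT)
bonds = parse_conect(PDB_TEXT)
n_ca_distance = math.dist(atoms[0]["xyz"], atoms[1]["xyz"])
print(round(n_ca_distance, 3), bonds)
```

Note that only the sulphate (a HETATM group) has explicit CONECT records; the bonds within the protein itself are not stored and have to be imputed from the residue names and atom positions.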

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances

This uses internal coordinates

• Also called a Z-matrix
  – Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)
  – Early form of specification of a starting geometry for molecules – sometimes used graph paper: draw the molecule and get a starting set of coordinates before optimisation
  – A Z-matrix uses the following geometric descriptions to describe molecules:

Bond length

Bond angle

Torsion angle

Out of plane bending, e.g. a carbonyl

Non-bonded distance


Dummy atom

positions

Internal

Coordinates

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1 and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
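To see what the water Z-matrix means geometrically, the sketch below converts it to Cartesian coordinates by hand: the oxygen goes at the origin, the first hydrogen along the x axis, and the second hydrogen in the xy-plane at the given angle. The function name and the choice of axes are assumptions made for the example; a general converter would also need to handle dihedral angles for the fourth atom onwards.

```python
import math


def water_from_zmatrix(l1, a1_degrees):
    """Cartesian coordinates for the Z-matrix  O / H 1 l1 / H 1 l1 2 a1."""
    a1 = math.radians(a1_degrees)
    oxygen = (0.0, 0.0, 0.0)
    hydrogen_1 = (l1, 0.0, 0.0)  # placed along the x axis
    hydrogen_2 = (l1 * math.cos(a1), l1 * math.sin(a1), 0.0)  # rotated by a1
    return [("O", oxygen), ("H", hydrogen_1), ("H", hydrogen_2)]


atoms = water_from_zmatrix(l1=0.96, a1_degrees=104.0)
for symbol, (x, y, z) in atoms:
    print(f"{symbol} {x:8.4f} {y:8.4f} {z:8.4f}")

h_h_distance = math.dist(atoms[1][1], atoms[2][1])
print("H-H distance:", round(h_h_distance, 3))
```

Changing a single internal coordinate (say, a1) moves the atoms in a chemically meaningful way, which is the advantage over editing x, y and z values directly.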

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

z-matrix

‘improper’ torsion angles

What about comprehensive properties of molecules – they are more than xyz

XML and molecules

• XML is a computer language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.

• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file

• Chemical Markup Language (CML) is being developed specifically for chemistry

• In the future much more information will be stored with molecules, allowing greater re-use of data

• see: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P Murray-Rust and H S Rzepa, J Chem Inf Comp Sci 1999, 39, 928
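A small, hand-written CML-style fragment for ethanol shows how structure and metadata sit side by side in XML, and how easily a standard parser reads them. The fragment is a simplified illustration rather than a schema-validated CML document: the element and attribute names follow common CML usage, but the units value and the property layout are simplified assumptions.

```python
import xml.etree.ElementTree as ET

ETHANOL_CML = """
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point">
    <scalar units="celsius">78.4</scalar>
  </property>
</molecule>
"""

root = ET.fromstring(ETHANOL_CML.strip())

# Atoms: id -> element symbol
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}

# Bonds: pairs of atom ids
bonds = [tuple(bond.get("atomRefs2").split()) for bond in root.iter("bond")]

# Properties: title -> (value, units); the units travel with the number
properties = {
    prop.get("title"): (float(prop.find("scalar").text), prop.find("scalar").get("units"))
    for prop in root.iter("property")
}

print(atoms, bonds, properties)
```

Unlike a fixed-column format, new elements (reactions, spectra, links to other molecules) can be added without breaking programs that only read the atoms and bonds.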

Ethanol

<CML>

- Can be parsed

- Can contain reactions, properties, etc.

- Can contain relationships to other molecules and also concepts

InChI

InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES

C(C)O

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates

bull Obtaining the 3-dimensional coordinates can involve experiment (x-

ray electron or neutron diffraction eg the Cambridge

Crystallographic Database or the Protein Databank ndash PDB)

bull From these can be obtained atom positions bonds coordinates etc

bull There are a number of 3-D construction methods available such as

Corina or Concord (put in a SMILES and get a 3-D molecule) which

use rules derived from experiment

bull Molecules can also be constructed in 2D and subjected to molecular

mechanics or Quantum Mechanics calculations to obtain 3D structures

bull Conformation still remains to be deduced There are many methods

that deduce conformations usually involving torsional angle rotation

to scan the conformational space (like a Ramachandran plot ndash but often

in many more torsional angle dimensions)

Molecules in 3D - uses

bull More accurate calculation of molecular

properties

bull Comparison of the shapes (conformations)

of molecules

bull Comparison of the dynamics of molecules

bull Calculation of bulk properties

bull Simulation of chemical reactionshelliphellip

The dynamics of a molecule can be computed and stored as a series of

frames of the coordinates and bonds of the structure (much like a cartoon)

Molecular Dynamics is the most popular method for larger systems Here is

an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a

series of snapshots of SD files concatenated together to make a movie just

like a film strip

httpsenwikipediaorgwikiCHARMM

httpambermdorg

httpwwwksuiuceduResearchnamd

httpwwwksuiuceduResearchvmd

We can simulate an entire biological receptor

and test new molecules to see how they might

work

bull Here is a simulation of an important new cancer target

called lsquoApelinrsquo with water and the membrane We used

these to design new molecules that block this receptor

Apelinrsquos role in Cancer

lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an

antagonist (a molecule that blocks the receptor) stop tumour growth

bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma

growth Brain 2017 awx253 httpsdoiorg101093brainawx253

bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et

al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372

So there are very many file formats

ndash which could be a real painbut

they can be convertedalc -- Alchemy file prep -- Amber PREP file

bs -- Ball amp Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical Resource

Kit 2D file

crk3d -- CRK3D Chemical Resource

Kit 3D file

box -- Dock 35 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI BiosymInsight II CAR

file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball amp Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table

file

cht -- Chemtool file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical

Resource Kit 2D file

crk3d -- CRK3D Chemical

Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 35 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats

-the bane of our lives

Interconnection program ndash Babel

Recent IUPAC moves towards a lsquostandardrsquo format however

In the near future there are likely to be many competing

requirements for file content

Molecules on computers ndash things to look out for

since what is stored is actually quite crude

For example-

Stereochemistry may be relative and not absolute or even incorrect

In proteins only the HEAVY atom positions are observed (sometimes at low resolution)

so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen

and oxygen get confused

Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved

correctly (problems with storing bonding)

Tautomers can be incorrect ndash check they look reasonable

Mesomers can be incorrect (double bonds)

Polymers (which are mixtures of MWt and topology) are very difficult to store

Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics

halogen-aromatic bonds etc may be inferred but not observed in the file)

2 Finding molecules

A typical problem involves finding the

lsquorightrsquo molecules

All the molecules

synthesisable

1060 Molecules with

relevance to the problem

1010 (that we can generate)

Molecules we can

search efficiently 107

Molecules we can

make and test 103

Serious difference in numbers we can consider a few hundred

molecules in our heads computers can evaluate millions

1. Finding molecules using substructure searching

One of the most useful 'properties' of a molecule is its molecular graph. Many software systems allow searching for whole molecules, fragments of molecules or even 'similar' molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules (http://pubchem.ncbi.nlm.nih.gov).

A more complex example: find a molecule containing our query as part of a larger molecule.

Results: it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

[Figure labels: set bits mean a fragment is present ("Contains P", "Contains NH2"); an unset bit means it is absent ("Doesn't contain F")]

So a quick match of the bitstring of the query fragment against the (pre-computed) bitstrings in the database will eliminate most of the structures.
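The screening step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fragment dictionary, the bit positions and the four-molecule database are invented for the example, and it is not a description of how PubChem implements its search. Real systems use hundreds or thousands of bits.

```python
# Hypothetical fragment dictionary: one bit position per fragment (illustrative only).
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "phenyl": 4, "COOH": 5}


def fingerprint(fragments):
    """Pack the fragments present in a molecule into an integer bitstring."""
    bits = 0
    for name in fragments:
        bits |= 1 << FRAGMENT_BITS[name]
    return bits


def may_contain(query_bits, molecule_bits):
    """Screen: every bit set in the query must also be set in the molecule."""
    return (query_bits & molecule_bits) == query_bits


# Toy database; the fragment lists are assumed to have been pre-computed.
DATABASE = {
    "tyrosine": fingerprint(["NH2", "OH", "phenyl", "COOH"]),
    "aniline": fingerprint(["NH2", "phenyl"]),
    "fluorobenzene": fingerprint(["F", "phenyl"]),
    "phosphoric acid": fingerprint(["P", "OH"]),
}


def screen(query_fragments):
    """Return the names of the molecules that survive the bitstring screen."""
    query_bits = fingerprint(query_fragments)
    return sorted(name for name, bits in DATABASE.items()
                  if may_contain(query_bits, bits))


print(screen(["NH2", "phenyl"]))  # ['aniline', 'tyrosine']
```

A molecule that passes the screen only might contain the query; the survivors still have to go through the slower atom-by-atom matching.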

Next step - substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware: salts etc. You may need to strip out counterions.
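As a rough sketch of the counterion problem: in SMILES, disconnected components are separated by a dot, so a crude way to strip counterions is to keep the longest component. The function names and the tiny database below are hypothetical, string length is only a stand-in for molecular size, and the comparison assumes every SMILES has already been canonicalised by the same toolkit.

```python
def strip_counterions(smiles):
    """Keep the longest '.'-separated component of a SMILES string (crude heuristic)."""
    return max(smiles.split("."), key=len)


def exact_matches(query, database):
    """Names of entries whose parent structure string equals that of the query.

    Assumes all SMILES strings are canonical, so equal molecules give equal strings.
    """
    key = strip_counterions(query)
    return [name for name, smiles in database.items()
            if strip_counterions(smiles) == key]


# Hypothetical mini-database of already-canonical SMILES.
DATABASE = {
    "dimethylaniline": "CN(C)c1ccccc1",
    "dimethylaniline hydrochloride": "CN(C)c1ccccc1.Cl",
    "sodium acetate": "[Na+].CC(=O)[O-]",
}

print(exact_matches("CN(C)c1ccccc1", DATABASE))
```

Here both the free base and its hydrochloride salt are returned for the same query, which is usually what a chemist searching for the parent compound wants.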

Substructure Search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
  - -OH
  - -NH2
  - -COOH
  - phenyl
• this can be done by tracing appropriate paths in the graph
• subgraphs may overlap

[Figure: example structure containing the OH, CH2, CHNH2 and COOH fragments]

Matching the query structure to the database

Two algorithms are commonly used:

• The Morgan algorithm, which numbers the molecule uniquely, and
• The Ullmann algorithm, which matches the fragments

Note: a number of 'substructural' fragments can be matched here (6 in all)

Read these for more details:

• Gund P, Ann. Rep. Med. Chem. 14, 299 (1979)
• Sheridan RP et al., J. Chem. Inf. Comp. Sci. 29, 255 (1989)
• Brint AT, Willett P, J. Mol. Graphics 5, 49 (1987)
• Ullmann JR, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM) 23, 31-42 (1976), http://portal.acm.org/citation.cfm?id=321925
• Morgan HL, The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 107-113 (1965)

The basic concept is that molecules are considered as graphs:

• Atoms are 'nodes'
• Bonds are 'edges'

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107-113, 1965
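Steps (i) to (iii) can be sketched as follows. This is a simplified reading, not Morgan's full procedure: the initial label is just the degree (atomic number and bond types are ignored), step (ii) is taken to mean "old label plus the labels of the neighbours", and the atom names in the cyclohexanone graph are an arbitrary choice made for this example.

```python
def morgan_labels(adjacency):
    """Return {atom: label} from a simplified Morgan-style relaxation.

    adjacency maps each atom to the list of atoms it is bonded to.
    """
    # (i) initial label: the degree (number of attached heavy atoms)
    labels = {atom: len(neighbours) for atom, neighbours in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # (ii) add the labels of the adjacent atoms to each label
        updated = {atom: labels[atom] + sum(labels[n] for n in neighbours)
                   for atom, neighbours in adjacency.items()}
        n_updated = len(set(updated.values()))
        # (iii) stop once the number of different labels no longer increases
        if n_updated <= n_classes:
            return labels
        labels, n_classes = updated, n_updated


# Heavy-atom graph of cyclohexanone; C2 is the carbonyl carbon in this numbering.
CYCLOHEXANONE = {
    "O1": ["C2"],
    "C2": ["O1", "C3", "C7"],
    "C3": ["C2", "C4"],
    "C4": ["C3", "C5"],
    "C5": ["C4", "C6"],
    "C6": ["C5", "C7"],
    "C7": ["C6", "C2"],
}

labels = morgan_labels(CYCLOHEXANONE)
ranking = sorted(labels, key=labels.get, reverse=True)
print(labels)
print(ranking)
```

Atoms that are equivalent by symmetry (C3 and C7, C4 and C6) end up with the same label, which is why extra tie-breaking rules are needed before a unique numbering can be written down.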

The first step - numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest label, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest-numbered atom
• Rules are used to break ties, such as C before N, double bond before single etc.
• Some equivalent atoms are arbitrarily numbered

Urea adjacency matrix (atom 4 is the central atom)

      O1  C2  C3  C4
O1     0   0   0   1
C2     0   0   0   1
C3     0   0   0   1
C4     1   1   1   0

Cyclohexanone adjacency matrix (atoms O1 and C2 to C7): O1 has one neighbour, C4 (the carbonyl carbon) has three, and each of the other ring carbons has two.

Numbering is crucial: renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching: 'subgraph isomorphism' is in the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures.
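To make the matching step concrete, here is a minimal sketch of a substructure search by plain backtracking. It is not Ullmann's algorithm itself, which adds a refinement step that prunes impossible atom pairings early; the graph encoding, the function name and the atom identifiers are assumptions made for this example, and bond orders are ignored.

```python
def find_substructure(query, target):
    """Return one mapping {query atom: target atom}, or None if there is no match.

    Each graph is a dict: atom id -> (element, set of neighbouring atom ids).
    """
    query_atoms = list(query)

    def extend(mapping):
        if len(mapping) == len(query_atoms):
            return dict(mapping)
        q = query_atoms[len(mapping)]
        q_element, q_neighbours = query[q]
        for t, (t_element, t_neighbours) in target.items():
            if t in mapping.values() or t_element != q_element:
                continue
            # Every already-mapped neighbour of q must be bonded to t in the target.
            if all(mapping[n] in t_neighbours for n in q_neighbours if n in mapping):
                mapping[q] = t
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]  # backtrack and try the next candidate
        return None

    return extend({})


CYCLOHEXANONE = {
    "O1": ("O", {"C2"}),
    "C2": ("C", {"O1", "C3", "C7"}),
    "C3": ("C", {"C2", "C4"}),
    "C4": ("C", {"C3", "C5"}),
    "C5": ("C", {"C4", "C6"}),
    "C6": ("C", {"C5", "C7"}),
    "C7": ("C", {"C6", "C2"}),
}
CARBONYL = {"a": ("C", {"b"}), "b": ("O", {"a"})}

print(find_substructure(CARBONYL, CYCLOHEXANONE))  # {'a': 'C2', 'b': 'O1'}
```

The worst-case cost grows exponentially with the size of the query, which is why the cheap bitstring screen is run first and the atom-by-atom match only on the survivors.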


Next lecture

• How can we use this type of information to solve our chemistry problems?
  - Patents/Markush structures
  - Molecular similarity
  - Molecular properties
  - 3D searching
  - Virtual screening
  - Structure property/activity relationships


Bear this in mind: molecules are complicated. When we look at this scene we add a huge amount of information from our senses and knowledge, but nearly all of it gets lost in computational representation.

Representing chemistry needs to be engineered to represent materials and processes. As you will see, we are moving in that direction with more complete representations of molecules and materials.

[Figure caption: Not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol. A real-life mixture]

What is a molecule: is it a series of connected points, a wave function, the sum of its properties?

In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.

Storing molecules: different methods for different purposes

Methods for storing molecules can conveniently be broken down into:

• 1-Dimensional (simple, very compact and fast to access)
• 2-Dimensional (contains the chemical diagram)
• 3-Dimensional approaches (the shapes of molecules)

1D: Line notations (a string of characters from a keyboard)
2D: Molecular diagrams (graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the notes).

1-Dimensional - Line notations: a string representation of molecules. Here are three examples of different line notations. SMILES is the most useful and widely used. These are 'strings', all of the same molecule.

• Line notations
  - WLN: L66J BMR& DSWQ IN1&1
  - ROSDAL: 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
  - SMILES: c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
  - IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid

Notice: all these notations use just the characters on a standard typewriter keyboard.

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?

It's artemisinin, which has anti-malarial properties (Nobel Prize).

This language is called SMILES.

A SMILES tutorial is available at http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure, and a SMILES will be available by picking the smiley face: http://www.molinspiration.com/cgi-bin/properties?textMode=1

InChI

• A more recent line notation is called InChI
• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES
• InChI is generated using computer algorithms and is virtually uninterpretable by humans
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI
• Websites are available that can generate an InChI, or convert from InChI to structure, from different names and formats (e.g. RSC ChemSpider)

Again, a string like this is easily matched on a computer.

Storing chemical diagrams on computers

• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• E.g. to solve this we could introduce the concept of an aromatic bond.

An example of storing the molecular graph: SD file format

MDL SD (structure data) files store the information about a molecule in the following format:

• Header block: describes the molecule, e.g. its name
• Connection table: defines the molecular structure (atoms and bonds)
• Data block (optional): properties, e.g. volume
• Terminator line: a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format for storing small molecules. SD files are widely used in software to store many molecule structures in databases.

A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).

The Connection Table - describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format:

Connect 3 2 5 4

e.g. SD file format:

Bond 3 5 1
Bond 3 4 1
Bond 5 6 2

(reduced)

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

• Line 1: molecule name
• Line 2: information on this molecule
• Line 3: comment (e.g. a date is used here)
• "counts line": has the number of atoms and bonds as a minimum
• "atom block": has the xyz coordinates of the atom and the element as a minimum
• "bond block": one line for each bond (from atom, to atom, bond type)
• "M  END" and "$$$$": these identify the end of this molecule
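Reading such a file back is mostly bookkeeping. The sketch below parses the benzene example using whitespace splitting, which is an assumption made to keep it short: the real V2000 layout is fixed-width, so production code should slice columns instead. The function name `parse_molfile` is hypothetical.

```python
BENZENE = """benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_molfile(text):
    """Return (name, atoms, bonds) from a single-molecule V2000 record."""
    lines = text.splitlines()
    name = lines[0].strip()                 # header line 1: molecule name
    counts = lines[3].split()               # counts line: atoms, bonds, ...
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:       # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(field) for field in line.split()[:3])
        bonds.append((first, second, order))  # bond block: from, to, type
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(BENZENE)
print(name, len(atoms), "atoms", len(bonds), "bonds")
```

Note that the three double and three single bonds are one Kekulé form of benzene; nothing in the file says the ring is aromatic.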

Alanine SD file

• Bond length is 1.53 Å
• Atom line fields: x y z, symbol, mass difference, charge, stereo, H-count, ...
• Charge: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
• Molecule is chiral
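The charge field is a code, not the charge itself. A small sketch of the decoding, following the table above (the function name and the tuple return value are choices made for this example):

```python
def decode_charge(code):
    """Translate a molfile atom-block charge code into (formal charge, is_radical)."""
    if code == 0:
        return 0, False          # uncharged, or a value outside the coded range
    if code == 4:
        return 0, True           # doublet radical
    if 1 <= code <= 7:
        return 4 - code, False   # 1 -> +3, 2 -> +2, 3 -> +1, 5 -> -1, 6 -> -2, 7 -> -3
    raise ValueError(f"unknown charge code: {code}")


for code in range(8):
    print(code, decode_charge(code))
```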

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

[Figure caption: Which way round should this ring go?]
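One common way to impute bonds from positions alone is a distance test: two atoms are taken to be bonded if they are closer than the sum of their covalent radii plus some slack. The sketch below applies this to the first valine residue of the PDB example (3S7A) used in these notes. The radii are approximate literature values and the 0.4 Å tolerance is an assumption; different programs use different values.

```python
import itertools
import math

# Approximate single-bond covalent radii in angstroms (illustrative values).
COVALENT_RADIUS = {"C": 0.76, "N": 0.71, "O": 0.66}
TOLERANCE = 0.4  # assumed slack in angstroms

# (atom name, element, (x, y, z)) for the heavy atoms of the first valine residue.
ATOMS = [
    ("N", "N", (3.036, -1.035, -3.538)),
    ("CA", "C", (4.283, -0.343, -3.015)),
    ("C", "C", (4.565, 1.067, -3.593)),
    ("O", "O", (4.653, 1.287, -4.853)),
    ("CB", "C", (5.510, -1.245, -3.141)),
]


def impute_bonds(atoms):
    """Return pairs of atom names that are close enough to be covalently bonded."""
    bonds = []
    for (name_a, elem_a, pos_a), (name_b, elem_b, pos_b) in itertools.combinations(atoms, 2):
        limit = COVALENT_RADIUS[elem_a] + COVALENT_RADIUS[elem_b] + TOLERANCE
        if math.dist(pos_a, pos_b) <= limit:
            bonds.append((name_a, name_b))
    return bonds


print(impute_bonds(ATOMS))
```

The test recovers which atoms are connected but not the bond order (the C=O double bond looks like any other bond here), and it says nothing about hydrogens, which is exactly where the errors described above creep in.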

Example protein databank file (pdb) (connectivity is stored as an adjacency list in the CONECT records)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

• CRYST1: space group and unit cell dimensions
• ATOM: protein atoms, including xyz, occupancy and temperature factor
• HETATM: non-protein atoms, including xyz, occupancy and temperature factor
• CONECT: bonds
• New format: mmCIF
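Because the PDB format is column-based, a reader has to slice each record at fixed positions rather than split on spaces. A minimal sketch for ATOM and HETATM records follows; the column ranges are those of the standard PDB coordinate record, the dictionary keys are names chosen for this example, and the two sample lines are the first protein atom and the sulfate sulfur from the file above.

```python
def parse_atom_record(line):
    """Slice one ATOM/HETATM record into its main fields (0-based Python slices)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),   # the temperature factor
    }


LINES = [
    "ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28",
    "HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34",
]

for line in LINES:
    print(parse_atom_record(line))
```

This is why a single misplaced space breaks a PDB file: the program expects every field in exactly the right columns.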

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.

• Also called a Z-matrix
  - Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
  - Early form of specification of a starting geometry for molecules: sometimes graph paper was used to draw the molecule and get a starting set of coordinates before optimisation
  - A Z-matrix uses the following geometric descriptions to describe molecules:
    Bond length
    Bond angle
    Torsion angle
    Out-of-plane bending, e.g. a carbonyl
    Non-bonded distance

[Figure: internal coordinates, showing dummy atom (du) positions]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1 and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
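To see what the water Z-matrix means geometrically, it can be converted to Cartesian coordinates by hand. The sketch below makes the usual arbitrary choices, which are assumptions of this example rather than anything fixed by the Z-matrix: the oxygen sits at the origin, the first hydrogen lies on the x axis and the second lies in the xy plane.

```python
import math


def water_cartesian(l1, a1_degrees):
    """Cartesian coordinates for the three-atom Z-matrix O / H 1 l1 / H 1 l1 2 a1."""
    a1 = math.radians(a1_degrees)
    return [
        ("O", (0.0, 0.0, 0.0)),                                  # atom 1 at the origin
        ("H", (l1, 0.0, 0.0)),                                   # atom 2 at distance l1 along x
        ("H", (l1 * math.cos(a1), l1 * math.sin(a1), 0.0)),      # atom 3 at angle a1 to atom 2
    ]


coordinates = water_cartesian(0.96, 104.0)
for symbol, position in coordinates:
    print(symbol, tuple(round(value, 4) for value in position))

h_h_distance = math.dist(coordinates[1][1], coordinates[2][1])
print("H-H distance:", round(h_h_distance, 3))
```

Changing `l1` or `a1` moves the atoms along a chemically meaningful coordinate, which is the reason internal coordinates are convenient for driving a reaction coordinate.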

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Figure labels: z-matrix; 'improper' torsion angles]

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts

InChI: InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES: C(C)O
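As an illustration of data carrying its own metadata, the ethanol heavy-atom skeleton can be written as a small CML-style document with Python's standard XML library. The element and attribute names (`molecule`, `atomArray`, `atom`, `bondArray`, `bond`, `atomRefs2`) follow CML conventions as far as I am aware, but the output has not been validated against the CML schema, and the way the property and its units are written here is simplified for the example.

```python
import xml.etree.ElementTree as ET


def build_ethanol():
    """Build a CML-style element tree for the heavy atoms of ethanol (C-C-O)."""
    molecule = ET.Element("molecule", {"id": "ethanol", "title": "Ethanol"})
    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, element in (("a1", "C"), ("a2", "C"), ("a3", "O")):
        ET.SubElement(atom_array, "atom", {"id": atom_id, "elementType": element})
    bond_array = ET.SubElement(molecule, "bondArray")
    for atom_refs in ("a1 a2", "a2 a3"):
        ET.SubElement(bond_array, "bond", {"atomRefs2": atom_refs, "order": "1"})
    # Metadata travels with the data: a value stored together with its units.
    prop = ET.SubElement(molecule, "property", {"title": "molar mass"})
    scalar = ET.SubElement(prop, "scalar", {"units": "g/mol"})
    scalar.text = "46.07"
    return molecule


xml_text = ET.tostring(build_ethanol(), encoding="unicode")
print(xml_text)

# Any XML parser can read the file back without knowing about chemistry.
parsed = ET.fromstring(xml_text)
elements = [atom.get("elementType") for atom in parsed.iter("atom")]
print(elements)
```

Unlike a fixed-column format, extra elements can be added later without breaking programs that only read the atoms and bonds.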

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions...

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor.

Apelin's role in cancer

'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file prep -- Amber PREP file

bs -- Ball & Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI Biosym/Insight II CAR file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball & Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file

cht -- Chemtool file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 3.5 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats - the bane of our lives



What is a molecule?

Is it a series of connected points, a wave function, or the sum of its properties?

In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.

Storing molecules: different methods for different purposes

Methods for storing molecules can conveniently be broken down into:

1-Dimensional (simple, very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)

1D: Line notations (a string of characters from a keyboard)
2D: Molecular diagrams (graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the notes).

1-Dimensional: line notations, a string representation of molecules. Here are three examples of different line notations. SMILES is the most useful and widely used. These are ‘strings’, all of the same molecule.

• Line notations
– WLN
  • L66J BMR& DSWQ IN1&1
– ROSDAL
  • 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
– SMILES
  • c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
– IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid

Notice: all these notations use just the characters on a standard typewriter keyboard.
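Because a line notation is just text, even very simple string handling can pull chemistry out of it. The sketch below is a minimal illustration, not a real SMILES parser: the helper `count_atoms` and its regular expression are assumptions made for this example, and they only cope with bracket-free SMILES from the simple organic subset, such as the naphthalene sulphonic acid above.

```python
import re
from collections import Counter

# Hypothetical helper: recognises only bracket-free organic-subset atoms
# (no charges, isotopes or explicit hydrogens).
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def count_atoms(smiles: str) -> Counter:
    """Count heavy atoms per element in a bracket-free SMILES string."""
    counts = Counter()
    for token in ATOM_PATTERN.findall(smiles):
        # Lower-case (aromatic) symbols are folded onto the element symbol.
        counts[token.capitalize()] += 1
    return counts

smiles = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
print(dict(count_atoms(smiles)))
```

Ring-closure digits, branches and bond symbols are simply skipped, which is why the heavy-atom count can be read off without building the molecular graph.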

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a ‘molecule’ and is what is actually stored in the computer. Does it make sense to you?

It’s artemisinin, which has anti-malarial properties (Nobel Prize).

This language is called SMILES.

A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure; a SMILES will be available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1

InChI

• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate InChI, or convert from InChI to structure, from different names and formats.

Again, a string like this is easily matched on a computer.

RSC ChemSpider

Storing chemical diagrams on computers

• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• E.g. to solve this we could introduce the concept of an aromatic bond.
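The aromaticity problem can be made concrete with a few lines of code. This is a minimal sketch under a stated assumption: the atom numbering is held fixed, so the comparison is edge by edge on the connection table rather than a true graph-isomorphism test. The helper name `ring_graph` and the bond label `"ar"` are illustrative choices, not part of any file format.

```python
def ring_graph(bond_labels):
    """Build a six-membered carbon ring as a set of labelled edges.

    Each edge is (atom_i, atom_j, label) with atoms numbered 0..5.
    """
    return {(i, (i + 1) % 6, label) for i, label in enumerate(bond_labels)}

# The two Kekule drawings of benzene: alternating double and single bonds,
# shifted by one position around the ring.
kekule_a = ring_graph([2, 1, 2, 1, 2, 1])
kekule_b = ring_graph([1, 2, 1, 2, 1, 2])

# With a dedicated aromatic bond label, both drawings give the same edges.
aromatic_a = ring_graph(["ar"] * 6)
aromatic_b = ring_graph(["ar"] * 6)

print(kekule_a == kekule_b)      # naive comparison treats them as different
print(aromatic_a == aromatic_b)  # the aromatic label removes the ambiguity
```

The same idea matters most for substituted rings, where the two Kekulé drawings cannot be reconciled by renumbering alone.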

An example of storing the molecular graph: SD file format

MDL SD (structure data) files store the information about each molecule in the following format:

Header block: describes the molecule, e.g. its name
Connection table: defines the molecular structure (atoms and bonds)
Data block (optional): properties, e.g. volume
Terminator line: a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format for storing small molecules: SD files are widely used in software to store many molecule structures in databases.

A “format” in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).

The Connection Table: describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format:
Connect 3 2 5 4

e.g. SD file format:
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced)

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don’t know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file: benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Line 1: molecule name
Line 2: information on this molecule
Line 3: comment (e.g. the date is used here)
“Counts line”: has the number of atoms and bonds as a minimum
“Atom block”: has the xyz coordinates of the atom and the element as a minimum
“Bond block”: one line for each bond (from atom, to atom, bond type)
“M END” and “$$$$”: these identify the end of this molecule
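The block layout described above (header, counts line, atom block, bond block) maps directly onto a small reader. The following is a sketch only: the function name `parse_molfile` is hypothetical, and it splits fields on whitespace for readability, whereas the real V2000 format uses fixed-width columns, so this shortcut is an assumption that holds for small molecules like the benzene record.

```python
MOLFILE = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""

def parse_molfile(text):
    """Return (name, atoms, bonds) from a single small V2000 record."""
    lines = text.splitlines()
    name = lines[0].strip()
    # Counts line: number of atoms, then number of bonds.
    counts = lines[3].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))
    return name, atoms, bonds

name, atoms, bonds = parse_molfile(MOLFILE)
print(name, len(atoms), len(bonds))
```

A production reader would slice by column position and handle the optional data block between “M  END” and “$$$$”.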

Alanine SD file

Bond length is 1.53 Å
Atom block fields: xyz, symbol, mass diff, charge, stereo, h-count…
Charge: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
Molecule is chiral
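The charge column is a code, not the charge itself, so a reader has to translate it. A minimal sketch of that lookup, using only the codes listed above; the function name `formal_charge` is hypothetical, and treating the doublet-radical code as zero formal charge is an assumption made here for simplicity.

```python
# Atom-block charge codes as listed above.
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 5: -1, 6: -2, 7: -3}

def formal_charge(code: int) -> int:
    """Translate an atom-block charge code into a formal charge.

    Code 4 marks a doublet radical; it is reported as uncharged here.
    """
    if code == 4:
        return 0
    return CHARGE_CODES[code]

print([formal_charge(code) for code in range(8)])
```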

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can’t always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

Which way round should this ring go?
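Imputing bonds from positions is usually done with a distance criterion. The sketch below shows one common heuristic, not the method used by any particular crystallography package: two atoms are called bonded when their separation is below the sum of their covalent radii plus a tolerance. The radii, the 0.45 Å tolerance, the function name `infer_bonds` and the water coordinates are all assumed illustrative values.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms (assumed values).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}

def infer_bonds(atoms, tolerance=0.45):
    """Return index pairs closer than the summed radii plus a tolerance."""
    bonds = []
    for (i, (sym_i, pos_i)), (j, (sym_j, pos_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADII[sym_i] + COVALENT_RADII[sym_j] + tolerance
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds

# Illustrative water geometry: O-H about 0.96 A, H-O-H about 104 degrees.
water = [
    ("O", (0.000, 0.000, 0.000)),
    ("H", (0.960, 0.000, 0.000)),
    ("H", (-0.232, 0.932, 0.000)),
]
print(infer_bonds(water))
```

The heuristic finds the two O-H bonds and rejects the H···H contact, but it says nothing about bond order, which is one reason imputed bonds in protein files are not always correct.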

Example protein databank file (pdb)
(uses an adjacency matrix)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including xyz, occupancy and temperature factor
HETATM: non-protein atoms, including xyz, occupancy and temperature factor
CONECT: bonds

New format: mmCIF
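PDB records are fixed-width, so they are read by slicing columns rather than splitting on spaces. The sketch below illustrates this for the ATOM/HETATM and CONECT records shown above; the function names and dictionary keys are hypothetical, and the test record is rebuilt with a format string so that its column alignment is guaranteed.

```python
def parse_atom_record(line):
    """Slice the fixed-width fields of a PDB ATOM or HETATM record."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }

def parse_conect(line):
    """Return (atom serial, list of bonded serials) from a CONECT record."""
    serials = [int(line[i:i + 5]) for i in range(6, len(line.rstrip()), 5)]
    return serials[0], serials[1:]

# First ATOM record of the example, rebuilt with exact column widths.
atom_line = "{:<6}{:>5} {:<4} {:>3} {}{:>4}    {:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}".format(
    "ATOM", 1, " N", "VAL", "A", 1, 3.036, -1.035, -3.538, 1.00, 31.28)
conect_line = "CONECT 1510 1511 1512 1513 1514"

record = parse_atom_record(atom_line)
print(record["name"], record["xyz"], record["b_factor"])
print(parse_conect(conect_line))
```

Note that only the hetero atoms (here the sulphate) carry CONECT records; bonds within standard amino acids are left for the reading program to impute.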

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.

• Also called a Z-matrix
– Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)
– Early form of specification of a starting geometry for molecules: sometimes people used graph paper, drew the molecule and got a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
  Bond length
  Bond angle
  Torsion angle
  Out-of-plane bending (e.g. a carbonyl)
  Non-bonded distance

(Figure: internal coordinates, with dummy atom (du) positions marked.)

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
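To see what the water Z-matrix encodes, it can be turned back into Cartesian coordinates by hand. This is a sketch for the three-atom case only (no dihedral angles are needed); the function name `water_from_zmatrix` and the choice to put the oxygen at the origin with the first hydrogen on the x-axis are assumptions made for illustration.

```python
import math

def water_from_zmatrix(l1, a1_degrees):
    """Place O, H, H in the xy-plane from one bond length and one angle."""
    a1 = math.radians(a1_degrees)
    oxygen = (0.0, 0.0, 0.0)
    # Second atom: distance l1 from atom 1, placed along the x-axis.
    hydrogen_1 = (l1, 0.0, 0.0)
    # Third atom: distance l1 from atom 1, making angle a1 with atom 2.
    hydrogen_2 = (l1 * math.cos(a1), l1 * math.sin(a1), 0.0)
    return [("O", oxygen), ("H", hydrogen_1), ("H", hydrogen_2)]

atoms = water_from_zmatrix(0.96, 104.0)
h_h = math.dist(atoms[1][1], atoms[2][1])
print(round(h_h, 3))
```

Changing `a1_degrees` alone bends the molecule while both O-H distances stay fixed, which is exactly why internal coordinates are convenient for driving a reaction coordinate.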

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

(Figure labels: z-matrix; ‘improper’ torsion angles.)

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928.

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts

InChI: InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES: C(C)O
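A small sketch of what “can be parsed” means in practice, using Python's standard XML library. The `molecule`, `atomArray`, `atom`, `bondArray` and `bond` element names follow CML conventions, but this fragment has not been validated against the CML schema, and the `property` element with its `title` and `units` attributes is purely illustrative metadata.

```python
import xml.etree.ElementTree as ET

# Heavy atoms of ethanol, SMILES C(C)O: atom a1 is bonded to a2 (C) and a3 (O).
molecule = ET.Element("molecule", id="ethanol")
atom_array = ET.SubElement(molecule, "atomArray")
for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
    ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)
bond_array = ET.SubElement(molecule, "bondArray")
for first, second in [("a1", "a2"), ("a1", "a3")]:
    ET.SubElement(bond_array, "bond", atomRefs2=f"{first} {second}", order="1")

# Metadata travels with the structure: a value together with its units.
prop = ET.SubElement(molecule, "property", title="boiling point", units="celsius")
prop.text = "78.4"

# Round trip: serialise to text, then parse it back and query it.
xml_text = ET.tostring(molecule, encoding="unicode")
parsed = ET.fromstring(xml_text)
elements = [atom.get("elementType") for atom in parsed.iter("atom")]
print(xml_text)
print(elements)
```

Because every value is wrapped in a named element, a program that has never seen this file can still find the atoms, the bonds and the units of the property.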

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).

Molecules in 3D: uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called ‘Apelin’, with water and the membrane. We used these to design new molecules that block this receptor.

Apelin’s role in cancer

‘Apelin’ (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file                               prep -- Amber PREP file
bs -- Ball & Stick file                           caccrt -- Cacao Cartesian file
ccc -- CCC file                                   c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                   cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file      crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                          dmol -- DMol3 Coordinates file
feat -- Feature file                              gam -- GAMESS Output file
gamout -- GAMESS Output file                      gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                         qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                         jout -- Jaguar Output file
bin -- OpenEye Binary file                        mmd -- MacroModel file
mmod -- MacroModel file                           out -- MacroModel file
dat -- MacroModel file                            car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                          sd -- MDL Isis SDF file
mdl -- MDL Molfile file                           mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                    mopout -- MOPAC Output file
mmads -- MMADS file                               mpqc -- MPQC file
bgf -- MSI BGF file                               nwo -- NWChem Output file
pdb -- PDB file                                   ent -- PDB file
pqs -- PQS file                                   qcout -- Q-Chem Output file
res -- ShelX file                                 ins -- ShelX file
smi -- SMILES file                                mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                        vmol -- ViewMol file

alc -- Alchemy file                               bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                    cacint -- Cacao Internal file
cache -- CAChe MolStruct file                     c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                   ct -- ChemDraw Connection Table file
cht -- Chemtool file                              cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file      crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                             box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                    feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                   gamin -- GAMESS Input file
inp -- GAMESS Input file                          gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                        gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                        gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                         jin -- Jaguar Input file
bin -- OpenEye Binary file                        mmd -- MacroModel file
mmod -- MacroModel file                           out -- MacroModel file
dat -- MacroModel file                            sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                           mdl -- MDL Molfile file
mol -- MDL Molfile file                           mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                               bgf -- MSI BGF file
csr -- MSI Quanta CSR file                        nw -- NWChem Input file
pdb -- PDB file                                   ent -- PDB file
pov -- POV-Ray Output file                        pqs -- PQS file
report -- Report file                             qcin -- Q-Chem Input file
smi -- SMILES file                                fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                           txyz -- Tinker XYZ file
txt -- Titles file                                unixyz -- UniChem XYZ file
vmol -- ViewMol file                              xed -- XED file
xyz -- XYZ file                                   zin -- ZINDO Input file

File formats: the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.
In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
Tautomers can be incorrect: check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of MWt and topology) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).

2. Finding molecules

A typical problem involves finding the ‘right’ molecules:

All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1. Finding molecules using substructure searching

One of the most useful ‘properties’ of a molecule is its molecular graph. Many software systems allow searching for whole molecules, fragments of molecules, or even ‘similar’ molecules.

For a specific molecule, this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we’ve used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example: find a molecule containing our query as part of a larger molecule.

Results: it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly not all of the database is being searched. The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in. This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

(Figure labels on the bits: “Contains P”, “Contains NH2”, “Doesn’t contain F”.)

So a quick match of the bitstring of the fragment searched over the (pre-computed) bitstrings in the database will eliminate most of the structures.
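The screening step amounts to a bitwise test: a database molecule can only contain the query if every bit set in the query is also set in the molecule. The sketch below illustrates this with a made-up five-fragment dictionary; the bit assignments in `FRAGMENT_BITS`, the function names and the three-molecule “database” are all assumptions for this example, not the keys used by any real system.

```python
# Hypothetical fragment dictionary: fragment -> bit position.
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "C=O": 4}

def fingerprint(fragments):
    """Pack the fragments present in a molecule into an integer bitstring."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << FRAGMENT_BITS[fragment]
    return bits

def may_contain(molecule_bits, query_bits):
    """A molecule passes the screen only if it sets every bit the query sets."""
    return (molecule_bits & query_bits) == query_bits

# Bitstrings are computed once, when the database is built.
database = {
    "aniline": fingerprint(["NH2"]),
    "glycine": fingerprint(["NH2", "OH", "C=O"]),
    "acetone": fingerprint(["C=O"]),
}

query = fingerprint(["NH2", "C=O"])
candidates = [name for name, bits in database.items() if may_contain(bits, query)]
print(candidates)
```

Passing the screen does not prove a match, it only means the molecule cannot be ruled out; the survivors still go through full substructure matching.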

Next step: substructure (pattern) matching

Exact match
Match the canonical SMILES, InChI or full structure.
Beware: salts etc. may need their counterions stripped out first.
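In SMILES, disconnected components such as a drug and its counterion are separated by a dot, so a crude way to strip counterions is to keep the largest component. This is a sketch under a loud assumption: the function `largest_fragment` uses the length of the text as a stand-in for fragment size, whereas a real toolkit would compare atom counts or use a curated list of salts.

```python
def largest_fragment(smiles: str) -> str:
    """Keep the longest dot-separated component of a SMILES string.

    Text length is only a rough proxy for the size of the fragment.
    """
    return max(smiles.split("."), key=len)

# Aniline hydrochloride and sodium acetate, each written as two components.
print(largest_fragment("Nc1ccccc1.Cl"))
print(largest_fragment("CC(=O)[O-].[Na+]"))
```

After stripping, the remaining fragment would still need to be converted to a canonical form before an exact string match is meaningful.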

Substructure search

Supposing we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• ‘Subgraphs’ can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
– –OH
– –NH2
– –COOH
– phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap

(Figure: example structure with the groups OH, CH2, CHNH2, OH and O labelled.)

Matching the query structure to the database

Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments.

Note: a number of ‘substructural’ fragments can be matched here (6 in all).

Read these for more details:

Gund, P. Ann. Rep. in Med. Chem. 14, 299, 1979
Sheridan, R. P. et al. J. Chem. Inf. Comp. Sci. 29, 255, 1989
Brint, A. T.; Willett, P. J. Mol. Graphics 5, 49, 1987
J. R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan, H. L. The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-112

The basic concept is that molecules are considered as graphs:

Atoms are ‘nodes’
Bonds are ‘edges’

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node i, considering its atomic number, degree (number of substituents) and types of bonds.
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures: a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107–113, 1965.

The first step: numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged.
• Numbering starts from the highest, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest numbered.
• Rules are used to break equalities, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered.
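The iteration at the heart of the Morgan algorithm fits in a few lines. The sketch below is a simplified illustration: it starts from the degree of each atom only (ignoring atomic number and bond type), it applies no tie-breaking rules, and the function name `morgan_labels` and the n-pentane carbon chain used as input are choices made for this example.

```python
def morgan_labels(adjacency):
    """Iterate extended-connectivity values until no new classes appear.

    `adjacency` maps each atom to the list of its neighbours.
    """
    # Initial label: the degree (number of attached heavy atoms).
    labels = {atom: len(neighbours) for atom, neighbours in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # New label of an atom: the sum of its neighbours' current labels.
        updated = {atom: sum(labels[n] for n in neighbours)
                   for atom, neighbours in adjacency.items()}
        n_updated = len(set(updated.values()))
        if n_updated <= n_classes:
            return labels  # keep the last labelling that added classes
        labels, n_classes = updated, n_updated

# Carbon chain of n-pentane: C1-C2-C3-C4-C5.
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = morgan_labels(pentane)
ranking = sorted(labels, key=labels.get, reverse=True)
print(labels)
print(ranking)
```

The central carbon receives the highest label, so numbering would start there and work outwards; C2/C4 and C1/C5 remain tied, which is where the tie-breaking rules and arbitrary choices described above come in.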

Urea: adjacency matrix

      O1  C2  C3  C4
O1     0   0   0   1
C2     0   0   0   1
C3     0   0   0   1
C4     1   1   1   0

Cyclohexanone: adjacency matrix (ring drawn in the order C4, C2, C5, C6, C7, C3)

      O1  C2  C3  C4  C5  C6  C7
O1     0   0   0   1   0   0   0
C2     0   0   0   1   1   0   0
C3     0   0   0   1   0   0   1
C4     1   1   1   0   0   0   0
C5     0   1   0   0   0   1   0
C6     0   0   0   0   1   0   1
C7     0   0   1   0   0   1   0

Numbering is crucial: renumber for different starting points. Also, the ‘edges’ of the graph are annotated to speed up the matching. This is the ‘slow’ part of ‘matching subgraph isomorphism’ and is of the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures.
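Matching a query against a target can be sketched as a backtracking search over adjacency matrices. This is a plain depth-first search in the spirit of the Ullmann algorithm, without its refinement step that prunes candidates early. The function names are hypothetical, the atom labels follow the matrix headings on the slide, and the cyclohexanone bond list assumes the ring order C4, C2, C5, C6, C7, C3.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 matrix from a list of bonded index pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix

def find_matches(query_adj, query_labels, target_adj, target_labels):
    """Enumerate mappings of query atoms onto target atoms that keep all bonds."""
    n_query, n_target = len(query_adj), len(target_adj)
    matches = []

    def extend(mapping):
        q = len(mapping)  # index of the next query atom to place
        if q == n_query:
            matches.append(tuple(mapping))
            return
        for t in range(n_target):
            if t in mapping or query_labels[q] != target_labels[t]:
                continue
            # Each bond from q to an already placed atom must exist in the target.
            if all(target_adj[t][mapping[p]] for p in range(q) if query_adj[q][p]):
                extend(mapping + [t])

    extend([])
    return matches

# Query fragment: indices 0..3 stand for O1, C2, C3, C4.
query_labels = ["O", "C", "C", "C"]
query_adj = adjacency_matrix(4, [(0, 3), (1, 3), (2, 3)])

# Cyclohexanone: indices 0..6 stand for O1, C2, ..., C7 (assumed numbering).
target_labels = ["O", "C", "C", "C", "C", "C", "C"]
target_adj = adjacency_matrix(
    7, [(0, 3), (1, 3), (2, 3), (1, 4), (2, 6), (4, 5), (5, 6)])

print(find_matches(query_adj, query_labels, target_adj, target_labels))
```

Two mappings are found because the two ring carbons next to the carbonyl are equivalent. The number of partial mappings tried grows very quickly with molecule size, which is why the bitstring screen is applied first.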


Next lecture

• How can we use this type of information to solve our chemistry problems?
– Patents/Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure-property/activity relationships

Page 10: Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH

Storing molecules different methods for different purposes

Methods for storing molecules can conveniently be broken down

into

1-Dimensional (simple and very compact and fast to access)

2-Dimensional (contains the chemical diagram)

3-Dimensional approaches (the shapes of molecules)

1D Line notations (a string of characters from a keyboard)

2D Molecular Diagrams (Graphs)

3D Graphs plus XYZ coordinates (giving the 3D structure)

We will look at examples of these (and you can follow up in the

notes)

1-Dimensional - Line notations ndash a string representation of

molecules- here are three examples of different line notations

SMILES is the most useful and widely used These are

lsquostringsrsquo all of the same molecule

bull Line Notationsndash WLN

bull L66J BMRamp DSWQ IN1amp1

ndash ROSDAL

bull 1=-5-=10=510-11-11N-12-

=17=123-18S-

19O18=20O18=21O8-22N-

2322-24

ndash SMILES

bull c1ccccc1Nc2cc(S(=O)(=O)O)cc3c

2cc(N(C)C)cc3

ndash IUPAC 6-dimethylamino-4-

phenylamino-naphthalene-2-sulphonic

acid

Notice ndash all these notations use just the characters on a standard

typewriter keyboard

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3

)([H])[C]41[C]3([H])[CH]2C

This is an example of a lsquomoleculersquo and is what is actually stored in the

computer ndash does it make sense to you

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3

)([H])[C]41[C]3([H])[CH]2C

Itrsquos Artemesinin

Which has

anti-malarial

properties

(Nobel Prize)

This language is called SMILES

This is an example of a lsquomoleculersquo and is what is actually stored in the

computer ndash does it make sense to you

A SMILES tutorial is available at

httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml

You can practice by drawing a structure and a smiles will be available

by picking the smiley face

httpwwwmolinspirationcomcgi-binpropertiestextMode=1

InChibull A more recent line notation is called

InChI

bull This will address some of the problems

of SMILES eg polymers and materials

not covered by SMILES

bull InChI is generated using computer

algorithms and is virtually un-

interpretable by a humans

bull Importantly it is commonly used as a

unique chemical identifier - each

molecule should theoretically have a

unique InChI One molecule one InChi

bull Websites are available that can generate

InChIconvert from InChi to structure

from different names and formats

Again a string like this is easily

matched on a computer

RSC ChemSpider

Storing chemical diagrams on computers

bull The valence model of a molecule can be represented by a

chemical graph A simple graph contains nodes (atoms) and edges

(bonds) joining pairs of nodes

bull The spacial position of the nodes length of the edges and

crossings are irrelevant Generally we ignore hydrogens unless

tautomerism or pKa is an issue Computers handle graphs very

well and molecules represented like this are examples of labelled

graphs (the atoms have names eg Oxygen)

bull Chemical structures are of course more complex than this and

aromaticity stereochemistry tautomerism non-stoichimetric

compounds etc are often problematic The computer would

(using a simple graph) deduce these two canonical structures are

different molecules

bull Eg to solve this we could introduce the concept of an aromatic

bond

An example of storing the molecular

graph SD file format

MDL SD (structure data) format files contain the following information and the

information about a molecule is stored in the following format

Header Block

describes the molecule eg itrsquos name

Connection table

defines the molecular structure (atoms and bonds)

Data block (optional)

Properties eg volume

Terminator line

a line containing four dollar signs ($$$$)

indicating the end of information on this molecule

This is probably the most common format to store small molecules ndash SD files in

software are widely used to store many molecule structures in databases

A ldquoformatrdquo in computer science is a precisely described order of data The program

reading it expects the information in exactly the right place (or it screws up)

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

eg PDB protein file format

eg SD file format

1

1

1

1

1

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

The Connection Table ndash describing bonds

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

(reduced)

eg PDB protein file format

eg SD file format

The Connection Table ndash describing bonds

Why store molecules in 2Dbull Quite often we only need the

chemical diagram eg to find a

molecule that matches a

chemical structure search

bull It is often the case that we donrsquot

know the conformation (shape)

of a molecule ndash so storing it in

3D would be pointless Look at

the changes in conformation in

this molecule at room

temperature

Example SD file - benzene

bull benzene

bull ACDLabs0812062058

bull August 2013

bull 6 6 0 0 0 0 0 0 0 0 1 V2000

bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 2 1 1 0 0 0 0

bull 3 1 2 0 0 0 0

bull 4 2 2 0 0 0 0

bull 5 3 1 0 0 0 0

bull 6 4 1 0 0 0 0

bull 6 5 2 0 0 0 0

bull M END

bull $$$$

Molecule nameInformation on this molecule

Comment (eg date is used here)

ldquocounts linerdquo has the

number of atoms and

bonds as a minimum

ldquoatom blockrdquo has xyz

coordinates of the atom and

element as a minimum

ldquobond blockrdquo 1-line for each bond

from atom - to atom ndash bond type

These identify the end of this molecule

Alanine SD file

Bond length is 1.53 Å.

Atom block columns: x, y, z, symbol, mass difference, charge, stereo, H-count, …

Charge codes: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3.

Molecule is chiral.

What if we only know the atom positions and not the bonds?

A key example would be x-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an x-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

[Figure caption: Which way round should this ring go?]

Example protein databank file (pdb)

(bonds are listed atom by atom in CONECT records, i.e. as an adjacency list)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

CRYST1: space group and unit cell dimensions.
ATOM: protein atoms, including xyz, occupancy and temperature factor.
HETATM: non-protein atoms, including xyz, occupancy and temperature factor.
CONECT: bonds.

New format: mmCIF
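To make the idea of "imputed" bonds concrete, here is a hedged sketch that guesses bonds for the sulfate ion in the file above from interatomic distances and compares them with the CONECT records. The covalent radii and the tolerance are illustrative assumptions, not values from the lecture, and the coordinates assume the decimal places shown in the listing above. The function names are hypothetical.

```python
import math
from itertools import combinations

# Assumed covalent radii in angstroms and an assumed tolerance (illustrative only).
COVALENT_RADIUS = {"S": 1.05, "O": 0.66}
TOLERANCE = 0.45

# HETATM records of the sulfate ion: serial number -> (element, (x, y, z)).
hetatm = {
    1510: ("S", (25.993, -1.362, 5.893)),
    1511: ("O", (25.504, -1.159, 4.508)),
    1512: ("O", (27.282, -0.770, 6.155)),
    1513: ("O", (24.953, -1.035, 6.850)),
}

conect_lines = [
    "CONECT 1510 1511 1512 1513 1514",
    "CONECT 1511 1510",
    "CONECT 1512 1510",
]


def parse_conect(lines):
    """Turn CONECT records into an adjacency list: serial -> set of bonded serials."""
    adjacency = {}
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "CONECT":
            serial, *partners = (int(f) for f in fields[1:])
            adjacency.setdefault(serial, set()).update(partners)
    return adjacency


def impute_bonds(atoms):
    """Call two atoms bonded when they are closer than the sum of their radii plus a tolerance."""
    bonds = set()
    for i, j in combinations(sorted(atoms), 2):
        (elem_i, pos_i), (elem_j, pos_j) = atoms[i], atoms[j]
        cutoff = COVALENT_RADIUS[elem_i] + COVALENT_RADIUS[elem_j] + TOLERANCE
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.add((i, j))
    return bonds


adjacency = parse_conect(conect_lines)
imputed = impute_bonds(hetatm)
print(sorted(imputed))
```

For this ion the distance rule recovers the three S-O bonds that are listed, and no O-O bonds. For protein atoms there are usually no CONECT records at all, so software has to rely on rules of this kind or on residue templates.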

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.

This uses internal coordinates

• Also called a Z-matrix

– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)

– Early form of specification of a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation

– A Z-matrix uses the following geometric descriptions to describe molecules:

Bond length
Bond angle
Torsion angle
Out-of-plane bending (e.g. a carbonyl)
Non-bonded distance

[Figure: internal coordinates of small molecules, with dummy atom (du) positions marked]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

(The dihedral entries in the Z-matrix can also be used as 'improper' torsion angles.)

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.

• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.

• Chemical Markup Language (CML) is being developed specifically for chemistry.

• In the future much more information will be stored with molecules, allowing greater re-use of data.

• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928.

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts

InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES
C(C)O
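As an illustration of "can be parsed", here is a small sketch that reads a CML-style description of ethanol with Python's standard XML parser. The fragment is hand-written for this example: the `molecule`, `atomArray`, `atom`, `bondArray` and `bond` elements follow CML conventions as far as I know them, while the `property` element with its `title` and `units` attributes is a hypothetical simplification used only to show metadata travelling with the data.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified CML-style fragment (heavy atoms only).
CML_ETHANOL = """<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point" units="degC">78.4</property>
</molecule>"""

root = ET.fromstring(CML_ETHANOL)

# Atoms: id -> element symbol.
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}

# Bonds: (first atom id, second atom id, bond order).
bonds = []
for bond in root.iter("bond"):
    first, second = bond.get("atomRefs2").split()
    bonds.append((first, second, int(bond.get("order"))))

# Metadata: the value is stored together with its meaning and its units.
prop = root.find("property")
value = (prop.get("title"), float(prop.text), prop.get("units"))

print(atoms, bonds, value)
```

Because every item is tagged, a program does not depend on the data being "in exactly the right place", only on the tag names.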

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.

• Obtaining the 3-dimensional coordinates can involve experiment (x-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Databank, PDB).

• From these can be obtained atom positions, bonds, coordinates etc.

• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.

• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.

• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
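The phrase "many more torsional angle dimensions" is worth a quick calculation. This short sketch enumerates a regular grid of torsion angles; the 30 degree step and the function names are assumptions chosen for illustration, and real conformer generators prune the grid rather than visiting every point.

```python
from itertools import product


def torsion_grid(n_torsions, step_degrees=30):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_degrees)
    return product(angles, repeat=n_torsions)


def grid_size(n_torsions, step_degrees=30):
    """Number of grid points without enumerating them."""
    return (360 // step_degrees) ** n_torsions


# Two torsions (as in a Ramachandran plot) are easy to scan exhaustively ...
print(grid_size(2))
# ... but a flexible molecule with ten rotatable bonds is not.
print(grid_size(10))
```

The number of grid points grows as a power of the number of rotatable bonds, which is why a systematic scan is only practical for small or fairly rigid molecules.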

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions……

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor.

Apelin's role in cancer

'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al. Pharmacological targeting of apelin impairs glioblastoma growth. Brain 2017, awx253. https://doi.org/10.1093/brain/awx253

• Rayalam et al. Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review. Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372.

So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file    prep -- Amber PREP file

bs -- Ball & Stick file    caccrt -- Cacao Cartesian file

ccc -- CCC file    c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file    cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file

box -- Dock 3.5 Box file    dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file    car -- MSI Biosym/Insight II CAR file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file    bs -- Ball & Stick file

caccrt -- Cacao Cartesian file    cacint -- Cacao Internal file

cache -- CAChe MolStruct file    c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file    ct -- ChemDraw Connection Table file

cht -- Chemtool file    cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file

cssr -- CSD CSSR file    box -- Dock 3.5 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

IUPAC has recently moved towards a 'standard' format. However, in the near future there are likely to be many competing requirements for file content.

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

• Stereochemistry may be relative and not absolute, or even incorrect.

• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

• Tautomers can be incorrect: check they look reasonable.

• Mesomers can be incorrect (double bonds).

• Polymers (which are mixtures of MWt and topology) are very difficult to store.

• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).

2 Finding molecules

A typical problem involves finding the 'right' molecules

All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1 Finding molecules using substructure searching

One of the most useful 'properties' of a molecule is its molecular graph. Many software systems allow searching for whole molecules, fragments of molecules or even 'similar' molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example: find a molecule containing our query as part of a larger molecule.

Results: it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly the whole database is not being searched. The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in. This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

Each bit stands for one fragment. A 1 means the molecule contains it (here "contains P" and "contains NH2"); a 0 means it does not (e.g. "doesn't contain F").

So a quick match of the bitstring of the fragment searched against the (pre-computed) bitstrings in the database will eliminate most of the structures.
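The screening step can be written in a couple of lines. This is a minimal sketch with a made-up six-fragment key set (`FRAGMENT_KEYS` and the small `database` are hypothetical, chosen for illustration): a molecule can only contain the query if every bit set in the query is also set in the molecule.

```python
# Hypothetical fragment dictionary: one bit per fragment.
FRAGMENT_KEYS = ["OH", "NH2", "COOH", "phenyl", "F", "P"]


def fingerprint(fragments):
    """Set one bit for every dictionary fragment present in the molecule."""
    bits = 0
    for position, key in enumerate(FRAGMENT_KEYS):
        if key in fragments:
            bits |= 1 << position
    return bits


def may_contain(query_bits, molecule_bits):
    """True if all query bits are set in the molecule (a necessary condition only)."""
    return (query_bits & molecule_bits) == query_bits


# Pre-computed once, when the database is built.
database = {
    "tyrosine": fingerprint({"OH", "NH2", "COOH", "phenyl"}),
    "aniline": fingerprint({"NH2", "phenyl"}),
    "ethanol": fingerprint({"OH"}),
}

# Query: the phenol fragment.
query = fingerprint({"OH", "phenyl"})
candidates = [name for name, bits in database.items() if may_contain(query, bits)]
print(candidates)
```

The screen only removes molecules that cannot match. The survivors still have to go through the slower atom-by-atom matching, because having the right bits does not guarantee the fragments are connected in the right way.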

Next step: substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware: salts etc. may need their counterions stripped out first.
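One simple way of stripping counterions, sketched here as an assumption rather than as the method any particular database uses: in SMILES, disconnected components are separated by a dot, so we can keep the largest component. The function name `largest_fragment` is hypothetical, and counting letters is only a crude stand-in for counting atoms (two-letter element symbols count twice).

```python
def largest_fragment(smiles):
    """Keep the dot-separated SMILES component with the most letters."""
    fragments = smiles.split(".")
    return max(fragments, key=lambda fragment: sum(ch.isalpha() for ch in fragment))


# Sodium acetate: keep the acetate, drop the sodium ion.
print(largest_fragment("CC(=O)[O-].[Na+]"))
# Aniline hydrochloride: keep the aniline.
print(largest_fragment("c1ccccc1N.Cl"))
```

After stripping, the remaining fragment still needs to be converted to a canonical form before strings can be compared for an exact match.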

Substructure search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.

– –OH
– –NH2
– –COOH
– phenyl

• This can be done by tracing appropriate paths in the graph

• Subgraphs may overlap

[Figure: structure diagram with atom labels OH, CH2, CHNH2, OH, O]

Matching the query structure to the database

Two algorithms are commonly used:

The Morgan algorithm, which numbers the molecule uniquely, and

the Ullmann algorithm, which matches the fragments.

Note: a number of 'substructural' fragments can be matched here (6 in all).

Read these for more details:

Gund, P. Ann. Rep. in Med. Chem. 14, 299 (1979)

Sheridan, R. P. et al. J. Chem. Inf. Comp. Sci. 29, 255 (1989)

Brint, A. T.; Willett, P. J. Mol. Graphics 5, 49 (1987)

Ullmann, J. R. An Algorithm for Subgraph Isomorphism. Journal of the Association for Computing Machinery (JACM) 23, 31-42 (1976). http://portal.acm.org/citation.cfm?id=321925

Morgan, H. L. The generation of a unique machine description for chemical structures. Journal of Chemical Documentation 5, 107-113 (1965)

The basic concept is that molecules are considered as graphs

Atoms are 'nodes'

Bonds are 'edges'

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan. The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service. J. Chemical Documentation 5, 107-113 (1965)
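Steps (i) to (iii) can be sketched in a few lines. This is a simplified version under stated assumptions: the initial label is just the degree of each atom (atomic number and bond types are ignored), and ties between equivalent atoms are not broken. The function name `morgan_labels` and the pentane example are chosen for illustration.

```python
def morgan_labels(neighbours):
    """neighbours maps each atom to the list of atoms bonded to it."""
    # (i) initial label: the number of attached atoms (the degree)
    labels = {atom: len(nbrs) for atom, nbrs in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # (ii) new label = sum of the labels of the adjacent atoms
        updated = {atom: sum(labels[n] for n in nbrs)
                   for atom, nbrs in neighbours.items()}
        n_updated = len(set(updated.values()))
        # (iii) stop when the number of different labels no longer increases
        if n_updated <= n_classes:
            return labels
        labels, n_classes = updated, n_updated


# Carbon skeleton of pentane: C1-C2-C3-C4-C5
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = morgan_labels(pentane)

# Order the atoms by label, highest first.
ranking = sorted(pentane, key=lambda atom: -labels[atom])
print(labels, ranking)
```

For pentane the degrees alone give only two classes (end and middle carbons). One round of summation separates the central carbon from its neighbours, and a second round would not add any new classes, so the algorithm stops there. The symmetry-equivalent atoms (C2/C4 and C1/C5) keep equal labels, as they should.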

The first step - numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalent classes is unchanged.

• Numbering starts from the highest label, then uses the next highest connection as 2, etc. Like an onion: concentric circles out from the highest-numbered atom.

• Rules are used to break equalities, such as C before N, double bond before single, etc.

• Some equivalent atoms are arbitrarily numbered.

Urea

Adjacency matrix (atoms numbered O1, C2, C3, C4; empty cells written as 0)

     O1 C2 C3 C4
O1    0  0  0  1
C2    0  0  0  1
C3    0  0  0  1
C4    1  1  1  0

Cyclohexanone adjacency matrix (atoms numbered O1, C2 ... C7, ring order C4, C2, C5, C6, C7, C3 as in the diagram)

     O1 C2 C3 C4 C5 C6 C7
O1    0  0  0  1  0  0  0
C2    0  0  0  1  1  0  0
C3    0  0  0  1  0  0  1
C4    1  1  1  0  0  0  0
C5    0  1  0  0  0  1  0
C6    0  0  0  0  1  0  1
C7    0  0  1  0  0  1  0

Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching ('subgraph isomorphism') and is of the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures.
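To show what "matching" means in practice, here is a sketch that maps the four-atom query onto cyclohexanone by plain backtracking. It is deliberately simpler than Ullmann's algorithm, which adds a refinement step to prune impossible assignments early; the function name `subgraph_matches` is hypothetical, and the ring connectivity used for cyclohexanone is an assumption about how the atoms in the slide are joined.

```python
def subgraph_matches(q_adj, q_elem, t_adj, t_elem):
    """Return every mapping of query atoms onto target atoms that preserves
    element labels and all query bonds (simple backtracking search)."""
    q_atoms = list(q_adj)
    matches = []

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            matches.append(dict(mapping))
            return
        q = q_atoms[len(mapping)]
        for t in t_adj:
            if t in mapping.values() or q_elem[q] != t_elem[t]:
                continue
            # every already-mapped neighbour of q must be bonded to t in the target
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                extend(mapping)
                del mapping[q]

    extend({})
    return matches


# Query: O1, C2 and C3 all bonded to C4 (the 4 x 4 adjacency matrix).
query = {"O1": ["C4"], "C2": ["C4"], "C3": ["C4"], "C4": ["O1", "C2", "C3"]}
# Target: cyclohexanone, carbonyl carbon C4, ring C4-C2-C5-C6-C7-C3.
cyclohexanone = {
    "O1": ["C4"], "C2": ["C4", "C5"], "C3": ["C4", "C7"],
    "C4": ["O1", "C2", "C3"], "C5": ["C2", "C6"],
    "C6": ["C5", "C7"], "C7": ["C3", "C6"],
}


def element(atom):
    return atom[0]          # "O1" -> "O", "C4" -> "C"


matches = subgraph_matches(
    query, {a: element(a) for a in query},
    cyclohexanone, {a: element(a) for a in cyclohexanone},
)
print(len(matches), matches)
```

Two mappings are found, because the two carbons next to the carbonyl are equivalent and can be swapped. The search tries many dead ends on the way, which is the 'slow' part the slide refers to.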


Next lecture

• How can we use this type of information to solve our chemistry problems?

– Patents/Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure-property/activity relationships


1-Dimensional - Line notations: a string representation of molecules. Here are three examples of different line notations. SMILES is the most useful and widely used. These are 'strings' all of the same molecule.

• Line notations

– WLN
• L66J BMR& DSWQ IN1&1

– ROSDAL
• 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24

– SMILES
• c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3

– IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid

Notice: all these notations use just the characters on a standard typewriter keyboard.

O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?

It's artemisinin, which has anti-malarial properties (Nobel Prize).

This language is called SMILES.

A SMILES tutorial is available at http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure; a SMILES will be available by picking the smiley face: http://www.molinspiration.com/cgi-bin/properties?textMode=1
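A SMILES string has a small amount of grammar that can be checked without any chemistry: branches open and close with parentheses, and ring bonds are written as a digit that appears once where the ring opens and once where it closes. The sketch below checks only those two things; the function name `smiles_looks_closed` is hypothetical, two-digit ring numbers (written with %) are not handled, and passing this check does not mean the SMILES is chemically valid.

```python
def smiles_looks_closed(smiles):
    """Check that parentheses balance and every ring-closure digit is paired."""
    depth = 0
    open_rings = set()
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue                    # digits inside [...] are not ring closures
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}          # first occurrence opens, second closes
    return depth == 0 and not open_rings and not in_bracket


ARTEMISININ = "O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C"
print(smiles_looks_closed(ARTEMISININ))
print(smiles_looks_closed("c1ccccc"))
```

Reading the artemisinin string this way shows its four ring closures (digits 1 to 4), which is already more than most people can see by eye.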

InChI

• A more recent line notation is called InChI.

• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.

• InChI is generated using computer algorithms and is virtually uninterpretable by humans.

• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.

• Websites are available that can generate an InChI, or convert from an InChI to a structure, from different names and formats (e.g. RSC ChemSpider).

Again, a string like this is easily matched on a computer.

Storing chemical diagrams on computers

• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.

• The spatial position of the nodes, the length of the edges and any crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).

• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.

• E.g. to solve this we could introduce the concept of an aromatic bond.
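The two-structures problem and the aromatic-bond fix can be shown directly. In this sketch the two Kekulé forms of benzene are stored as sets of labelled edges; the ring is supplied by hand, which is an assumption made for brevity (real software has to detect aromatic rings itself), and the helper names are hypothetical.

```python
def bond_set(pairs, orders):
    """Store bonds as a set of (unordered atom pair, bond label)."""
    return {(frozenset(pair), order) for pair, order in zip(pairs, orders)}


def with_aromatic_bonds(bonds, aromatic_ring):
    """Relabel every bond that lies in the given ring as 'aromatic'."""
    ring = {frozenset(pair) for pair in aromatic_ring}
    return {(pair, "aromatic" if pair in ring else order) for pair, order in bonds}


ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]

# The two canonical (Kekule) structures: double bonds in alternate positions.
kekule_a = bond_set(ring, [2, 1, 2, 1, 2, 1])
kekule_b = bond_set(ring, [1, 2, 1, 2, 1, 2])

print(kekule_a == kekule_b)                        # a simple graph sees two molecules
print(with_aromatic_bonds(kekule_a, ring)
      == with_aromatic_bonds(kekule_b, ring))      # with aromatic bonds they agree
```

This is the same device SMILES uses when benzene is written with lower-case atoms as c1ccccc1.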

An example of storing the molecular graph: SD file format

MDL SD (structure data) files store the information about a molecule in the following format:

Header block: describes the molecule, e.g. its name

Connection table: defines the molecular structure (atoms and bonds)

Data block (optional): properties, e.g. volume

Terminator line: a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format to store small molecules. SD files are widely used in software to store many molecule structures in databases.
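Because every molecule ends with the terminator line, a file holding many molecules is easy to cut into records. A minimal sketch follows; the function name `split_sd_records` is hypothetical, and the two records are abbreviated placeholders with their connection tables left out.

```python
def split_sd_records(text):
    """Split the text of an SD file into one string per molecule."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":      # terminator line: end of this molecule
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


# Two abbreviated records (connection tables omitted for brevity).
SD_TEXT = "\n".join([
    "benzene", "(connection table omitted)", "M  END", "$$$$",
    "ethanol", "(connection table omitted)", "M  END", "$$$$",
])

records = split_sd_records(SD_TEXT)
names = [record.splitlines()[0] for record in records]
print(names)
```

Each record can then be handed to whatever reads the header block and connection table of a single molecule.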

A ldquoformatrdquo in computer science is a precisely described order of data The program

reading it expects the information in exactly the right place (or it screws up)

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

eg PDB protein file format

eg SD file format

1

1

1

1

1

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

The Connection Table ndash describing bonds

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

(reduced)

eg PDB protein file format

eg SD file format

The Connection Table ndash describing bonds

Why store molecules in 2Dbull Quite often we only need the

chemical diagram eg to find a

molecule that matches a

chemical structure search

bull It is often the case that we donrsquot

know the conformation (shape)

of a molecule ndash so storing it in

3D would be pointless Look at

the changes in conformation in

this molecule at room

temperature

Example SD file - benzene

bull benzene

bull ACDLabs0812062058

bull August 2013

bull 6 6 0 0 0 0 0 0 0 0 1 V2000

bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 2 1 1 0 0 0 0

bull 3 1 2 0 0 0 0

bull 4 2 2 0 0 0 0

bull 5 3 1 0 0 0 0

bull 6 4 1 0 0 0 0

bull 6 5 2 0 0 0 0

bull M END

bull $$$$

Molecule nameInformation on this molecule

Comment (eg date is used here)

ldquocounts linerdquo has the

number of atoms and

bonds as a minimum

ldquoatom blockrdquo has xyz

coordinates of the atom and

element as a minimum

ldquobond blockrdquo 1-line for each bond

from atom - to atom ndash bond type

These identify the end of this molecule

Alanine SD file

Bond length is 153A

xyz symbol mass diff charge stereo h-counthellip

Charge

0 = uncharged or value

other than

these 1 = +3 2 = +2 3 =

+1

4 = doublet radical

5 = -1 6 = -2 7

= -3

Molecule is chiral

What if we only know the atom positions and not the bonds

A key example would be x-ray crystallography Here we determine the

positions of the atoms and impute the bonds from our chemical knowledge

Here is an electron density map from an x-ray experiment We see the

electrons but not the nuclei In a small molecule this is very accurate and we

can almost see the bonds However in eg a protein structure which is very

large we canrsquot always determine the atom positions exactly so these are

stored and the bonds imputed The format for storing these is a bit different

Which way

round should

this ring go

Example protein databank file (pdb)

(uses an adjacency matrix)

HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A

TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684

EXPDTA X-RAY DIFFRACTION

REMARK 2 RESOLUTION 180 ANGSTROMS

REMARK 200 TEMPERATURE (KELVIN) 113

REMARK 200 PH 69

SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN

SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO

CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4

ORIGX1 1000000 0000000 0000000 000000

ORIGX2 0000000 1000000 0000000 000000

ORIGX3 0000000 0000000 1000000 000000

SCALE1 0018319 0000000 0000000 000000

SCALE2 0000000 0018147 0000000 000000

SCALE3 0000000 0000000 0015426 000000

ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128

ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841

ATOM 3 C VAL A 1 4565 1067 -3593 100 2764

ATOM 4 O VAL A 1 4653 1287 -4853 100 2717

ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001

HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234

HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213

HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410

HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534

CONECT 1510 1511 1512 1513 1514

CONECT 1511 1510

CONECT 1512 1510

MASTER 334 0 6 6 18 0 11 6 1583 1 95 15

END

Space group and unit cell

dimensions

Protein atoms ndash including xyz

occupancy and temperature factor

Non-protein atoms ndash including xyz

occupancy and temperature factor

Bonds New format

mmCIF

What if we want to vary the atom

positions eg in driving a

reaction coordinate

Using Cartesian (xyz)

coordinates is very

cumbersome so instead

we use the natural angles

and distances

This uses internal coordinates

bull Also called a Z-matrix

ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)

ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation

ndash A z-matrix uses the following geometric descriptions to describe molecules

Bond Length

Bond angle

Torsion angle

Out of plane bending

eg a carbonyl

Non-bonded distance

C

O

du

du

O

N OH

Dummy atom

positions

Internal

Coordinates

How to construct a Z-matrix (in Gaussian format)

1For the first atom to be defined give the atomic symbol

only

2For the second atom give the atomic symbol the number

1 and the name of a variable to describe the distance

between atoms 1 and 2

3For the third atom give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB and

the name of a variable to describe the angle between the

current atom NA and NB

4For all later atoms give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB the

name of a variable to describe the angle between the current

atom NA and NB the atom number of another previously

defined atom NC and finally the name of a variable to

describe the dihedral angle between the current atom NA

NB and NC

5After all the atoms have been listed enter a blank line

6Next list each variable with its corresponding value Use a

separate line for each variable

7In some cases where some of the variables are to be fixed

as constants in a geometry optimisation they are listed here

after a blank line rather than above

with the real variables

8End the Z-matrix with a blank line

Water (C2v)

O

H 1 l1

H l l1 2 a1

l1 096

a1 1040

Methanol Z-matrix

C

O 1 l1

H 1 l2 2 a1

H 1 l3 2 a2 3 da1

H 1 l3 2 a2 3 -da1

H 2 l4 1 a3 3 1800

l1 142

l2 109

l3 109

l4 109

l5 109

l6 10

a1 1090

a2 1100

a3 1080

a4 1100

a5 1100

da1 600

da2 1200

da3 600 z-matrix

lsquoimproperrsquo torsion angles

What about comprehensive

properties of molecules ndash they

are more than xyz

XML and molecules

bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc

bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file

bull Chemical Markup Language (CML) is being developed specifically for chemistry

bull In the future much more information will be stored with molecules allowing greater re-use of data

bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928

Ethanol

ltCMLgt

-Can be parsed

-Can contain reactions

properties etc

-Can contain

relationships to other

molecules and also

concepts

InChI

InChI=1C2H6Oc1-2-3h3H2H21H3

SMILES

C(C)O

Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not

really flat because of thermal fluctuations So we represent 3D

molecules by including their coordinates or their internal coordinates

bull Obtaining the 3-dimensional coordinates can involve experiment (x-

ray electron or neutron diffraction eg the Cambridge

Crystallographic Database or the Protein Databank ndash PDB)

bull From these can be obtained atom positions bonds coordinates etc

bull There are a number of 3-D construction methods available such as

Corina or Concord (put in a SMILES and get a 3-D molecule) which

use rules derived from experiment

bull Molecules can also be constructed in 2D and subjected to molecular

mechanics or Quantum Mechanics calculations to obtain 3D structures

bull Conformation still remains to be deduced There are many methods

that deduce conformations usually involving torsional angle rotation

to scan the conformational space (like a Ramachandran plot ndash but often

in many more torsional angle dimensions)

Molecules in 3D - uses

bull More accurate calculation of molecular

properties

bull Comparison of the shapes (conformations)

of molecules

bull Comparison of the dynamics of molecules

bull Calculation of bulk properties

bull Simulation of chemical reactionshelliphellip

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM

http://ambermd.org

http://www.ks.uiuc.edu/Research/namd

http://www.ks.uiuc.edu/Research/vmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor.

Apelin's role in Cancer

'Apelin' (a peptide receptor) is involved in cancer – therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253

• Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Rayalam et al., Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats – which could be a real pain, but they can be converted

alc -- Alchemy file prep -- Amber PREP file

bs -- Ball & Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI Biosym/Insight II CAR file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball & Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file

cht -- Chemtool file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 3.5 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program – Babel

There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers – things to look out for, since what is stored is actually quite crude

For example:

• Stereochemistry may be relative and not absolute, or even incorrect.

• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

• Tautomers can be incorrect – check they look reasonable.

• Mesomers can be incorrect (double bonds).

• Polymers (which are mixtures of MWt and topology) are very difficult to store.

• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).

2 Finding molecules

A typical problem involves finding the 'right' molecules

All the synthesisable molecules: 10^60

Molecules with relevance to the problem (that we can generate): 10^10

Molecules we can search efficiently: 10^7

Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1 Finding molecules using Substructure searching

One of the most useful 'properties' of a molecule is its molecular graph. Many software systems allow searching for whole molecules, or fragments of molecules, or even 'similar' molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example – find a molecule containing our query as part of a larger molecule

Results – it takes a few seconds to search all 92 million molecules

How does this work?

The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

Contains P, contains NH2 (bits set to 1); doesn't contain F (bit set to 0, marked X).

So a quick match of the bitstring of the query fragment against the (pre-computed) bitstrings in the database will eliminate most of the structures.
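A minimal sketch of such a bitstring screen is given here, assuming a made-up 16-bit fragment dictionary; the bit positions and the three database entries are hypothetical and chosen only for illustration. A molecule survives the screen only if every bit set in the query is also set in its pre-computed bitstring.

```python
# Hypothetical fragment dictionary: fragment name -> bit position (zero-based).
FRAGMENT_BITS = {"NH2": 5, "P": 11, "F": 14}


def fingerprint(fragments):
    """Set one bit for each fragment present in the molecule."""
    bits = 0
    for name in fragments:
        bits |= 1 << FRAGMENT_BITS[name]
    return bits


def may_contain(query_bits, molecule_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (query_bits & molecule_bits) == query_bits


# Pre-computed bitstrings for a toy database (hypothetical molecules).
database = {
    "mol_A": fingerprint(["NH2", "P"]),
    "mol_B": fingerprint(["F"]),
    "mol_C": fingerprint(["NH2"]),
}

query = fingerprint(["NH2"])
candidates = sorted(name for name, bits in database.items()
                    if may_contain(query, bits))
print(candidates)
```

The screen only eliminates molecules; the survivors still need the full substructure match, because sharing fragment bits does not guarantee that the fragments are connected in the required way.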

Next step – Substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware: salts etc. may need to have their counterions stripped out.
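One crude way to strip counterions before an exact match is to keep only the largest dot-separated component of a SMILES string. The sketch below does this by counting letters, which is an assumption made for brevity; a real toolkit would count heavy atoms on the parsed structure, and the function name is hypothetical.

```python
def largest_fragment(smiles):
    """Return the dot-separated SMILES component with the most letters.

    Letter count is only a rough stand-in for atom count (for example,
    'Na' and 'Cl' count as two letters each).
    """
    components = smiles.split(".")
    return max(components, key=lambda part: sum(ch.isalpha() for ch in part))


# Dimethylaniline hydrochloride and sodium acetate as examples.
parent_1 = largest_fragment("CN(C)c1ccccc1.Cl")
parent_2 = largest_fragment("[Na+].CC(=O)[O-]")
print(parent_1, parent_2)
```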

Substructure Search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.

– –OH

– –NH2

– –COOH

– phenyl

• this can be done by tracing appropriate paths in the graph

• subgraphs may overlap

[Figure: example molecule with the groups OH, CH2, CHNH2, OH and O labelled]

Matching the query structure to the database

Two algorithms are commonly used: the Morgan algorithm, which numbers the molecule uniquely, and the Ullmann algorithm, which matches the fragments.

Note – a number of 'substructural' fragments can be matched here (6 in all).

Read these for more details:

Gund P., Ann. Rep. in Med. Chem. 14:299, 1979

Sheridan R.P. et al., J. Chem. Inf. Comp. Sci. 29:255, 1989

Brint A.T., Willett P., J. Mol. Graphics 5:49, 1987

J.R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925

Morgan H.L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113

The basic concept is that molecules are considered as graphs:

Atoms are 'nodes'

Bonds are 'edges'

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - A technique developed at Chemical Abstracts Service, J. Chemical Documentation 5:107–113, 1965
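The three steps can be sketched as follows. This is a simplified version that assumes the initial label is just the degree of each atom (the classic "extended connectivity" form) and ignores atomic number and bond type; the carbon skeleton of n-pentane and its atom names are chosen here purely as an example.

```python
def morgan_labels(neighbours):
    """Simplified Morgan labelling: neighbours maps atom -> list of bonded atoms."""
    # Step (i): start from the degree of each atom (simplifying assumption).
    labels = {atom: len(adjacent) for atom, adjacent in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): new label = sum of the labels of the adjacent atoms.
        updated = {atom: sum(labels[n] for n in adjacent)
                   for atom, adjacent in neighbours.items()}
        updated_classes = len(set(updated.values()))
        # Step (iii): stop as soon as the number of distinct labels
        # no longer increases, keeping the previous labels.
        if updated_classes <= n_classes:
            return labels
        labels, n_classes = updated, updated_classes


# Carbon skeleton of n-pentane: C1-C2-C3-C4-C5.
pentane = {
    "C1": ["C2"],
    "C2": ["C1", "C3"],
    "C3": ["C2", "C4"],
    "C4": ["C3", "C5"],
    "C5": ["C4"],
}

labels = morgan_labels(pentane)
# Order atoms by decreasing label; ties are broken by name in this sketch.
order = sorted(labels, key=lambda atom: (-labels[atom], atom))
print(labels, order)
```

For this chain the labels settle at 2, 3, 4, 3, 2, so numbering would start from the central carbon. The tie between C2 and C4 is one of the cases where extra rules, or an arbitrary choice, are needed.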

The first step - Numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged.

• Numbering starts from the highest label, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest numbered atom.

• Rules are used to break equalities, such as C before N, double bond before single etc.

• Some equivalent atoms are arbitrarily numbered.

Urea

[Figure: the four-atom query fragment (atoms O1, C2, C3, C4) and the cyclohexanone ring (atoms O1, C2 to C7), with atom numbers]

Adjacency matrix

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

Cyclohexanone adjacency matrix

O1 C2 C3 C4 C5 C6 C7
O1 1
C2 1 1
C3 1 1
C4 1 1 1
C5 1 1
C6 1 1
C7 1 1

Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This matching ('subgraph isomorphism') is the 'slow' part and belongs to the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures.
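A plain backtracking search over adjacency matrices gives the flavour of substructure matching. It is written in the spirit of Ullmann's algorithm but omits the refinement step that makes the real algorithm efficient, so it should be read as a sketch. The cyclohexanone ring connectivity used here (C4-C2-C5-C7-C6-C3-C4) is an assumption, and all atoms are indexed from zero.

```python
def adjacency(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of (i, j) bonds."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix


def find_matches(q_elems, q_adj, t_elems, t_adj):
    """Return every mapping of query atoms onto target atoms that keeps
    element labels and all query bonds (extra target bonds are allowed)."""
    matches = []

    def extend(mapping):
        i = len(mapping)                      # next query atom to place
        if i == len(q_elems):
            matches.append(tuple(mapping))
            return
        for cand in range(len(t_elems)):
            if cand in mapping or t_elems[cand] != q_elems[i]:
                continue
            # Every query bond to an already placed atom must exist in the target.
            if all(t_adj[cand][mapping[j]] for j in range(i) if q_adj[i][j]):
                mapping.append(cand)
                extend(mapping)
                mapping.pop()

    extend([])
    return matches


# Query fragment: O1, C2, C3, C4 with C4 bonded to the other three.
query_elems = ["O", "C", "C", "C"]
query_adj = adjacency(4, [(0, 3), (1, 3), (2, 3)])

# Target: cyclohexanone, atoms O1, C2 ... C7 (ring order assumed).
target_elems = ["O", "C", "C", "C", "C", "C", "C"]
target_adj = adjacency(7, [(0, 3), (1, 3), (2, 3), (1, 4), (2, 5), (4, 6), (5, 6)])

matches = find_matches(query_elems, query_adj, target_elems, target_adj)
print(matches)
```

The two matches differ only in which of the two equivalent ring carbons is assigned to C2 and which to C3, which shows why consistent numbering and symmetry handling matter.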


Next lecture

• How can we use this type of information to solve our chemistry problems?

– Patents/Markush Structures

– Molecular Similarity

– Molecular Properties

– 3D searching

– Virtual Screening

– Structure Property/Activity Relationships


O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a 'molecule' and is what is actually stored in the computer – does it make sense to you?

It's artemisinin, which has anti-malarial properties (Nobel Prize).

This language is called SMILES.

A SMILES tutorial is available at http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure, and a SMILES will be available by picking the smiley face: http://www.molinspiration.com/cgi-bin/properties?textMode=1

InChI

• A more recent line notation is called InChI.

• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.

• InChI is generated using computer algorithms and is virtually uninterpretable by humans.

• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.

• Websites are available that can generate InChI and convert from InChI to structure, from different names and formats (e.g. RSC ChemSpider).

Again, a string like this is easily matched on a computer.

Storing chemical diagrams on computers

• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.

• The spatial position of the nodes, the length of the edges and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).

• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.

• E.g. to solve this we could introduce the concept of an aromatic bond.

An example of storing the molecular graph: SD file format

MDL SD (structure data) format files store the information about a molecule in the following format:

Header block: describes the molecule, e.g. its name

Connection table: defines the molecular structure (atoms and bonds)

Data block (optional): properties, e.g. volume

Terminator line: a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format to store small molecules – SD files are widely used in software to store many molecule structures in databases.

A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).

The Connection Table – describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format:

Connect 3 2 5 4

e.g. SD file format (reduced):

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.

• It is often the case that we don't know the conformation (shape) of a molecule – so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

benzene: molecule name

ACD/Labs0812062058: information on this molecule

August 2013: comment (e.g. the date is used here)

"Counts line": has the number of atoms and bonds as a minimum

"Atom block": has the xyz coordinates of the atom and the element as a minimum

"Bond block": one line for each bond (from atom, to atom, bond type)

M END and $$$$: these identify the end of this molecule
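To make the block layout concrete, here is a small reader for a single V2000 record, using the benzene example. It splits each line on whitespace, which is an assumption that only holds for small, well-spaced records like this one; the real format is defined by fixed column widths, and the function name is hypothetical.

```python
MOLBLOCK = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$"""


def parse_molblock(text):
    """Read name, atoms and bonds from one V2000 record (whitespace-split sketch)."""
    lines = text.splitlines()
    name = lines[0].strip()               # header line 1: molecule name
    counts = lines[3].split()             # counts line: atoms, bonds, ...
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for line in lines[4:4 + n_atoms]:     # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))   # from atom, to atom, bond type
    return name, atoms, bonds


name, atoms, bonds = parse_molblock(MOLBLOCK)
print(name, len(atoms), len(bonds))
```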

Alanine SD file

Bond length is 1.53 Å

Atom line fields: x, y, z, symbol, mass difference, charge, stereo, H-count, …

Charge codes: 0 = uncharged or a value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3

Molecule is chiral

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

Which way round should this ring go?

Example protein databank file (pdb) (uses an adjacency matrix)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

CRYST1: space group and unit cell dimensions

ATOM: protein atoms, including xyz, occupancy and temperature factor

HETATM: non-protein atoms, including xyz, occupancy and temperature factor

CONECT: bonds

New format: mmCIF
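PDB records are column-based rather than space-separated, so a reader slices each line at fixed positions. The sketch below writes two records with a simplified formatter and reads them back by column. The column ranges follow the published PDB layout for ATOM and HETATM records, while the helper names and the simplifications (atom names of at most three characters, no alternate location or insertion code) are assumptions made for this example.

```python
def format_atom(record, serial, name, res, chain, resseq, x, y, z, occ, bfac):
    """Write one ATOM/HETATM line in fixed-width columns (simplified)."""
    return (f"{record:<6}{serial:>5}  {name:<3} {res:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{bfac:6.2f}")


def parse_atom(line):
    """Read the main fields of an ATOM/HETATM line by column position."""
    return {
        "record": line[0:6].strip(),       # columns 1-6
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "residue": line[17:20].strip(),    # columns 18-20
        "chain": line[21],                 # column 22
        "resseq": int(line[22:26]),        # columns 23-26
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66
    }


atom_line = format_atom("ATOM", 1, "N", "VAL", "A", 1,
                        3.036, -1.035, -3.538, 1.00, 31.28)
het_line = format_atom("HETATM", 1510, "S", "SO4", "A", 187,
                       25.993, -1.362, 5.893, 1.00, 22.34)

atom = parse_atom(atom_line)
het = parse_atom(het_line)
print(atom_line)
print(atom["name"], atom["xyz"], het["residue"], het["b_factor"])
```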

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.

• Also called a Z-matrix

– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)

– An early form of specification of a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation

– A Z-matrix uses the following geometric descriptions to describe molecules:

Bond length

Bond angle

Torsion angle

Out-of-plane bending, e.g. a carbonyl

Non-bonded distance

[Figure: internal coordinates of an example molecule, showing dummy atom (du) positions]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1 and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Figure labels: z-matrix; 'improper' torsion angles]
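As a worked example of turning internal coordinates back into Cartesian ones, the water Z-matrix needs only a bond length and an angle. The sketch below makes the conventional but arbitrary choice of putting the first atom at the origin, the second on the z axis and the third in the xz plane; the function name is hypothetical, and later atoms with dihedral angles would need a more general routine.

```python
import math


def water_from_zmatrix(l1, a1_degrees):
    """Place O, H, H from the water Z-matrix (bond length l1, angle a1)."""
    a1 = math.radians(a1_degrees)
    oxygen = (0.0, 0.0, 0.0)                  # atom 1: at the origin
    h_first = (0.0, 0.0, l1)                  # atom 2: distance l1 along z
    # atom 3: distance l1 from O, making the angle a1 with the O-H bond above
    h_second = (l1 * math.sin(a1), 0.0, l1 * math.cos(a1))
    return oxygen, h_first, h_second


oxygen, h_first, h_second = water_from_zmatrix(0.96, 104.0)
oh_distance = math.dist(oxygen, h_second)
hh_distance = math.dist(h_first, h_second)
print(round(oh_distance, 3), round(hh_distance, 3))
```

Changing `l1` or `a1` moves the atoms along a chemically meaningful coordinate, which is why internal coordinates are convenient for driving a reaction coordinate.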

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.

• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.

bull Chemical Markup Language (CML) is being developed specifically for chemistry

bull In the future much more information will be stored with molecules allowing greater re-use of data

bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928

Ethanol

ltCMLgt

-Can be parsed

-Can contain reactions

properties etc

-Can contain

relationships to other

molecules and also

concepts

InChI

InChI=1C2H6Oc1-2-3h3H2H21H3

SMILES

C(C)O

Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not

really flat because of thermal fluctuations So we represent 3D

molecules by including their coordinates or their internal coordinates

bull Obtaining the 3-dimensional coordinates can involve experiment (x-

ray electron or neutron diffraction eg the Cambridge

Crystallographic Database or the Protein Databank ndash PDB)

bull From these can be obtained atom positions bonds coordinates etc

bull There are a number of 3-D construction methods available such as

Corina or Concord (put in a SMILES and get a 3-D molecule) which

use rules derived from experiment

bull Molecules can also be constructed in 2D and subjected to molecular

mechanics or Quantum Mechanics calculations to obtain 3D structures

bull Conformation still remains to be deduced There are many methods

that deduce conformations usually involving torsional angle rotation

to scan the conformational space (like a Ramachandran plot ndash but often

in many more torsional angle dimensions)

Molecules in 3D - uses

bull More accurate calculation of molecular

properties

bull Comparison of the shapes (conformations)

of molecules

bull Comparison of the dynamics of molecules

bull Calculation of bulk properties

bull Simulation of chemical reactionshelliphellip

The dynamics of a molecule can be computed and stored as a series of

frames of the coordinates and bonds of the structure (much like a cartoon)

Molecular Dynamics is the most popular method for larger systems Here is

an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a

series of snapshots of SD files concatenated together to make a movie just

like a film strip

httpsenwikipediaorgwikiCHARMM

httpambermdorg

httpwwwksuiuceduResearchnamd

httpwwwksuiuceduResearchvmd

We can simulate an entire biological receptor

and test new molecules to see how they might

work

bull Here is a simulation of an important new cancer target

called lsquoApelinrsquo with water and the membrane We used

these to design new molecules that block this receptor

Apelinrsquos role in Cancer

lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an

antagonist (a molecule that blocks the receptor) stop tumour growth

bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma

growth Brain 2017 awx253 httpsdoiorg101093brainawx253

bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et

al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372

So there are very many file formats

ndash which could be a real painbut

they can be convertedalc -- Alchemy file prep -- Amber PREP file

bs -- Ball amp Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical Resource

Kit 2D file

crk3d -- CRK3D Chemical Resource

Kit 3D file

box -- Dock 35 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI BiosymInsight II CAR

file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball amp Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table

file

cht -- Chemtool file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical

Resource Kit 2D file

crk3d -- CRK3D Chemical

Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 35 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats

-the bane of our lives

Interconnection program ndash Babel

Recent IUPAC moves towards a lsquostandardrsquo format however

In the near future there are likely to be many competing

requirements for file content

Molecules on computers ndash things to look out for

since what is stored is actually quite crude

For example-

Stereochemistry may be relative and not absolute or even incorrect

In proteins only the HEAVY atom positions are observed (sometimes at low resolution)

so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen

and oxygen get confused

Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved

correctly (problems with storing bonding)

Tautomers can be incorrect ndash check they look reasonable

Mesomers can be incorrect (double bonds)

Polymers (which are mixtures of MWt and topology) are very difficult to store

Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics

halogen-aromatic bonds etc may be inferred but not observed in the file)

2 Finding molecules

A typical problem involves finding the

lsquorightrsquo molecules

All the molecules

synthesisable

1060 Molecules with

relevance to the problem

1010 (that we can generate)

Molecules we can

search efficiently 107

Molecules we can

make and test 103

Serious difference in numbers we can consider a few hundred

molecules in our heads computers can evaluate millions

1 Finding molecules using Substructure searching

One of the most useful lsquopropertiesrsquo of a molecule is its molecular graph

Many software systems allow searching for whole molecules or

fragments of molecules or even lsquosimilarrsquo molecules

For a specific molecule this means specifying the required search pattern to get an

exact match

An example might be Search for an exact match for dimethylaniline Here wersquove

used the PubChem database which contains compounds and pharmacological

screening data on 92 million unique molecules httppubchemncbinlmnihgov

A more complex example ndash find a molecule

containing our query as part of a larger molecule

Results ndash takes a few seconds to search all

92 million molecules

How does this workThe first thing to notice is that the searching is so fast that clearly all the database

is not being searched

The first part of the process is to pre-compute pointers to only the sets of

compounds in the database we are interested in

This is called Hashing

Speeding up compound matching using hash codes

bull The simplest Hash code registers the presence or absence

of fragments in a molecule eg

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0

Contains PContains NH2

Doesnrsquot contain F

X

So a quick match of the bitstring of the fragment searched over the

(pre-computed) bitstrings in the database will eliminate most of the

structures

Next step ndash Substructure (pattern) matching

Exact match

Match the canonical SMILES InCHI or full structure

Beware salts etc may need to strip out counterions

Substructure Search

Supposing we wish to find part of a molecule in a database

To do this we have to do a substructure search

A substructure is a sub-graph of the molecular graph This is

an example of substructural fragments of a larger molecule

shown in different colours You could eg look for all the

molecules with the phenol fragment

There are two steps commonly used to do this firstly we

have to number all the molecules consistently (the Morgan

algorithm) for example to compare two structures to see if

they are identical then we have to match the query

substructure to each substructure in the molecule (the

Ullman algorithm)

Substructures in molecules

• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents, etc.
  – –OH
  – –NH2
  – –COOH
  – phenyl
• This can be done by tracing appropriate paths in the graph.
• Subgraphs may overlap.

(Figure: an example structure with the groups OH, CH2, CHNH2, OH and O labelled.)

Matching the query structure to the database

Two algorithms are commonly used:

The Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.

Note: a number of 'substructural' fragments can be matched here (6 in all).

Read these for more details:

Gund P, Ann. Rep. Med. Chem. 14, 299, 1979
Sheridan RP et al., J. Chem. Inf. Comput. Sci. 29, 255, 1989
Brint AT, Willett P, J. Mol. Graphics 5, 49, 1987
Ullmann JR, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM) 23, 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan HL, The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 107-113, 1965

The basic concept is that molecules are considered as graphs:

Atoms are 'nodes'.
Bonds are 'edges'.

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
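
As a rough illustration of a labelled graph, here is one possible Python data structure. The class name, method names and the choice of acetone (heavy atoms only) as the example are assumptions made for this sketch, not something the lecture prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class MolecularGraph:
    """A labelled graph: atoms are nodes, bonds are edges."""
    atoms: dict = field(default_factory=dict)  # atom id -> element symbol (node label)
    bonds: dict = field(default_factory=dict)  # frozenset({a, b}) -> bond order (edge label)

    def add_atom(self, atom_id, element):
        self.atoms[atom_id] = element

    def add_bond(self, a, b, order=1):
        self.bonds[frozenset((a, b))] = order

    def neighbours(self, atom_id):
        """Ids of the atoms joined to atom_id by an edge, sorted."""
        return sorted(other
                      for pair in self.bonds if atom_id in pair
                      for other in pair if other != atom_id)


# Acetone, hydrogens ignored: C1-C2(=O4)-C3
acetone = MolecularGraph()
for atom_id, element in [(1, "C"), (2, "C"), (3, "C"), (4, "O")]:
    acetone.add_atom(atom_id, element)
acetone.add_bond(1, 2)
acetone.add_bond(2, 3)
acetone.add_bond(2, 4, order=2)

print(acetone.neighbours(2))
```

Note that nothing about 2D or 3D position is stored: only which atoms exist, how they are labelled, and which pairs are joined.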

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike a human's) and fast. The Morgan algorithm works as follows:

(i) Assign an initial integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107-113, 1965
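
Steps (i) to (iii) can be sketched in Python as follows. This is a simplified version under explicit assumptions: the initial label is just the degree (atomic number and bond types are ignored), each update replaces a label by the sum of its neighbours' labels, and no tie-breaking rules are applied. The five-atom chain is an invented test case.

```python
def morgan_labels(adjacency):
    """Simplified Morgan labelling.

    adjacency maps each atom id to the list of atom ids bonded to it.
    Returns the labels from the last round that still increased the
    number of distinct label values.
    """
    # (i) initial label: the degree of each node (simplifying assumption)
    labels = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # (ii) new label: sum of the labels of the adjacent nodes
        updated = {atom: sum(labels[n] for n in nbrs)
                   for atom, nbrs in adjacency.items()}
        n_updated = len(set(updated.values()))
        # (iii) stop once the number of different labels no longer increases
        if n_updated <= n_classes:
            return labels
        labels, n_classes = updated, n_updated


# An unbranched five-atom chain: 1-2-3-4-5
chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = morgan_labels(chain)

# Order the atoms by label, highest first; ties keep their input order here.
ranking = sorted(chain, key=lambda atom: -labels[atom])
print(labels, ranking)
```

For the chain, the central atom ends up with the highest label and the two ends with the lowest, so numbering would start from the middle. Atoms 2 and 4 (and 1 and 5) remain tied, which is where the tie-breaking rules come in.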

The first step: numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged.

• Numbering starts from the atom with the highest label, then uses its next-highest connection as 2, etc., like an onion: concentric circles out from the highest-numbered atom.

• Rules are used to break ties, such as C before N, double bond before single, etc.

• Some equivalent atoms are arbitrarily numbered.

Adjacency matrix

(Figure: a four-atom query fragment, labelled "Urea" on the slide, with atoms O1, C2, C3 and C4, and the target cyclohexanone with atoms O1 and C2 to C7.)

Query fragment adjacency matrix:

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

Cyclohexanone adjacency matrix (7 x 7, atoms O1 and C2 to C7): the row for O1 has one entry, the row for C4 has three, and the rows for C2, C3, C5, C6 and C7 have two each.

Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching: subgraph isomorphism is in the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures.
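
To make the matching step concrete, here is a small Python sketch that searches for the four-atom fragment inside cyclohexanone using adjacency matrices. It is a plain backtracking search, not Ullmann's full algorithm (which additionally refines a matrix of candidate assignments to prune the search). The ring connectivity of cyclohexanone used below (C4-C2-C5-C6-C7-C3-C4) is an assumption consistent with the row counts on the slide, and atoms are indexed from 0.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of (a, b) bonds."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b in bonds:
        matrix[a][b] = matrix[b][a] = 1
    return matrix


def subgraph_matches(query_adj, query_elems, target_adj, target_elems):
    """Return every mapping (query atom index -> target atom index) as a tuple."""
    matches = []

    def compatible(q, t, mapping):
        if query_elems[q] != target_elems[t]:
            return False
        # Every bond from q to an already-mapped query atom must exist in the target.
        return all(target_adj[t][mapping[p]]
                   for p in range(q) if query_adj[q][p])

    def extend(mapping):
        q = len(mapping)
        if q == len(query_adj):
            matches.append(tuple(mapping))
            return
        for t in range(len(target_adj)):
            if t not in mapping and compatible(q, t, mapping):
                extend(mapping + [t])

    extend([])
    return matches


# Query: O1, C2, C3 each bonded to C4 (indices 0, 1, 2 and 3).
query_elems = ["O", "C", "C", "C"]
query_adj = adjacency_matrix(4, [(0, 3), (1, 3), (2, 3)])

# Target: cyclohexanone, O1 and C2..C7 as indices 0..6 (ring order assumed).
target_elems = ["O", "C", "C", "C", "C", "C", "C"]
target_adj = adjacency_matrix(
    7, [(0, 3), (1, 3), (2, 3), (1, 4), (4, 5), (5, 6), (6, 2)])

matches = subgraph_matches(query_adj, query_elems, target_adj, target_elems)
print(matches)
```

Two mappings are found because the two carbon neighbours of the carbonyl carbon are interchangeable. In the worst case the search tries a very large number of assignments, which is why the cheap bitstring screen is applied first.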


Next lecture

• How can we use this type of information to solve our chemistry problems?
  – Patents/Markush structures
  – Molecular similarity
  – Molecular properties
  – 3D searching
  – Virtual screening
  – Structure-property/activity relationships


O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C

This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?

It's artemisinin, which has anti-malarial properties (Nobel Prize).

This language is called SMILES.

A SMILES tutorial is available at http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html

You can practise by drawing a structure; a SMILES will be available by picking the smiley face: http://www.molinspiration.com/cgi-bin/properties?textMode=1

InChI

• A more recent line notation is called InChI.

• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.

• InChI is generated by computer algorithms and is virtually uninterpretable by humans.

• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.

• Websites are available that can generate InChI, or convert from InChI to structure, from different names and formats (e.g. RSC ChemSpider).

Again, a string like this is easily matched on a computer.

Storing chemical diagrams on computers

• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.

• The spatial position of the nodes, the length of the edges and any crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).

• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. Using a simple graph, the computer would deduce that these two canonical structures are different molecules.

• To solve this we could, e.g., introduce the concept of an aromatic bond.

An example of storing the molecular graph: SD file format

MDL SD (structure-data) files store the information about a molecule in the following format:

Header block
    describes the molecule, e.g. its name
Connection table
    defines the molecular structure (atoms and bonds)
Data block (optional)
    properties, e.g. volume
Terminator line
    a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format for storing small molecules: SD files are widely used in software to store many molecular structures in databases.

A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it goes wrong).

The Connection Table: describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format:

Connect 3 2 5 4

e.g. SD file format (reduced):

Bond 3 5 1
Bond 3 4 1
Bond 5 6 2

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.

• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file: benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Line 1: molecule name.
Line 2: information on this molecule.
Line 3: comment (e.g. a date is used here).
"Counts line": has the number of atoms and bonds as a minimum.
"Atom block": has the x, y, z coordinates of each atom and its element as a minimum.
"Bond block": one line for each bond (from atom, to atom, bond type).
"M  END" and "$$$$": these identify the end of this molecule.
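
As a rough sketch of how a program might read such a record, the Python below pulls the name, atoms and bonds out of the benzene example. It splits lines on whitespace for brevity, which is an assumption of this sketch: the V2000 layout is really fixed-width, and whitespace splitting breaks once fields run together (for example with 100 or more atoms). The function name is hypothetical, and the coordinates are written with their decimal points.

```python
MOLFILE = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_molfile(text):
    """Return (name, atoms, bonds) from a simple V2000 record."""
    lines = text.splitlines()
    name = lines[0].strip()
    counts = lines[3].split()                     # the "counts line"
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:             # the "atom block"
        x, y, z, element = line.split()[:4]
        atoms.append((element, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:  # the "bond block"
        a, b, order = (int(v) for v in line.split()[:3])
        bonds.append((a, b, order))
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(MOLFILE)
print(name, len(atoms), len(bonds))
```

The parser relies entirely on position: line 4 must be the counts line, and the atom and bond blocks must follow in that order, which is exactly what "a precisely described order of data" means in practice.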

Alanine SD file

Bond length is 1.53 Å.

Atom block columns: x, y, z, symbol, mass difference, charge, stereo, H-count, …

Charge codes: 0 = uncharged or a value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3.

Molecule is chiral.
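
The charge codes listed above are a small lookup table, which could be decoded along these lines in Python. The function name and the choice to return a (charge, is_radical) pair are assumptions of this sketch.

```python
# Atom-block charge codes as listed above: code -> formal charge.
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 5: -1, 6: -2, 7: -3}
DOUBLET_RADICAL = 4


def decode_charge(code):
    """Return (formal_charge, is_doublet_radical) for an atom-block charge code.

    Code 0 means "uncharged or a value other than these", so a result of 0
    does not by itself prove the atom is neutral.
    """
    if code == DOUBLET_RADICAL:
        return 0, True
    return CHARGE_CODES[code], False


print(decode_charge(3), decode_charge(5), decode_charge(4))
```

The mapping is not monotonic (1 is +3 while 3 is +1), which is a good example of why a reader must follow the format definition rather than guess.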

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so the positions are stored and the bonds imputed. The format for storing these is a bit different.

(Figure caption: Which way round should this ring go?)
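
One common way to impute bonds from positions alone is a distance test: two atoms are taken to be bonded if they are closer than the sum of their covalent radii plus a tolerance. The Python sketch below illustrates the idea; the radii are approximate literature values, and the 0.4 Å tolerance and the water coordinates are assumptions chosen for illustration.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (illustrative values).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}
TOLERANCE = 0.4  # angstroms; an assumed, adjustable threshold


def perceive_bonds(atoms):
    """Return index pairs (i, j) of atoms that are close enough to be bonded.

    atoms is a list of (element, (x, y, z)) tuples.
    """
    bonds = []
    for (i, (el_a, pos_a)), (j, (el_b, pos_b)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADII[el_a] + COVALENT_RADII[el_b] + TOLERANCE
        if math.dist(pos_a, pos_b) <= cutoff:
            bonds.append((i, j))
    return bonds


# A water molecule with made-up but chemically reasonable coordinates.
water = [
    ("O", (0.00, 0.00, 0.0)),
    ("H", (0.96, 0.00, 0.0)),
    ("H", (-0.24, 0.93, 0.0)),
]
print(perceive_bonds(water))
```

This recovers connectivity only. It says nothing about bond order, and it cannot tell a nitrogen from an oxygen when the element assignment itself is uncertain, so the imputed bonds still need chemical judgement.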

Example protein databank file (pdb)

(bonds are listed per atom in CONECT records, i.e. as an adjacency list)

HEADER OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA X-RAY DIFFRACTION
REMARK 2 RESOLUTION 1.80 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 6.9
SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1 54.588 55.106 64.827 90.00 90.00 90.00 P 21 21 21 4
ORIGX1 1.000000 0.000000 0.000000 0.00000
ORIGX2 0.000000 1.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 0.018319 0.000000 0.000000 0.00000
SCALE2 0.000000 0.018147 0.000000 0.00000
SCALE3 0.000000 0.000000 0.015426 0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
END

CRYST1: space group and unit cell dimensions.
ATOM: protein atoms, including x, y, z, occupancy and temperature factor.
HETATM: non-protein atoms, including x, y, z, occupancy and temperature factor.
CONECT: bonds.

A newer format is mmCIF.
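
PDB atom records are fixed-column: each value lives at a set character position. The Python sketch below slices one ATOM line by column. The line is assembled with a format string so that the columns are exact; the values are those of the first ATOM record in the example (decimal points written out), and the function name is hypothetical.

```python
def parse_atom_record(line):
    """Slice one fixed-column ATOM/HETATM line into named fields."""
    return {
        "record": line[0:6].strip(),       # columns 1-6
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                 # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
        "occupancy": float(line[54:60]),   # columns 55-60
        "b_factor": float(line[60:66]),    # columns 61-66 (temperature factor)
    }


# Build the record with explicit field widths rather than hand-counted spaces.
line = (f"{'ATOM':<6}{1:>5} {'N':<4} {'VAL':>3} {'A'}{1:>4}    "
        f"{3.036:8.3f}{-1.035:8.3f}{-3.538:8.3f}{1.00:6.2f}{31.28:6.2f}")

atom = parse_atom_record(line)
print(atom["name"], atom["x"], atom["y"], atom["z"], atom["b_factor"])
```

Slicing by column rather than splitting on spaces matters here: in real files adjacent fields can touch (for example a large negative coordinate), and then whitespace splitting silently gives the wrong answer.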

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (x, y, z) coordinates is very cumbersome, so instead we use the natural angles and distances.

This uses internal coordinates

• Also called a Z-matrix
  – Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction).
  – An early form of specifying a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation.
  – A Z-matrix uses the following geometric descriptions to describe molecules:
      Bond length
      Bond angle
      Torsion angle
      Out-of-plane bending (e.g. a carbonyl)
      Non-bonded distance

(Figure: internal coordinates of an example molecule, with dummy atom (du) positions marked.)

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

(Figure labels: z-matrix; 'improper' torsion angles.)
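
To see how internal coordinates turn back into Cartesian ones, here is a Python sketch for the three-atom water Z-matrix only. The placement convention (first atom at the origin, second along z, third in the xz plane) is a common choice but an assumption here, and the function name is hypothetical. Fourth and later atoms, which need a dihedral angle, are not handled.

```python
import math


def water_from_zmatrix(l1, a1_degrees):
    """Convert the water Z-matrix (O; H 1 l1; H 1 l1 2 a1) to Cartesian coordinates."""
    a1 = math.radians(a1_degrees)
    oxygen = (0.0, 0.0, 0.0)                  # atom 1: at the origin
    hydrogen_1 = (0.0, 0.0, l1)               # atom 2: distance l1 from atom 1, along z
    hydrogen_2 = (l1 * math.sin(a1), 0.0,     # atom 3: distance l1 from atom 1,
                  l1 * math.cos(a1))          # making angle a1 with atom 2
    return [("O", oxygen), ("H", hydrogen_1), ("H", hydrogen_2)]


atoms = water_from_zmatrix(0.96, 104.0)
h_h_distance = math.dist(atoms[1][1], atoms[2][1])
print(atoms, round(h_h_distance, 2))
```

Changing the single variable a1 and regenerating the coordinates bends the molecule while keeping both O-H bond lengths fixed, which is the convenience that internal coordinates offer when driving a reaction coordinate.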

What about comprehensive properties of molecules? They are more than x, y, z.

XML and molecules

• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.

• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.

• Chemical Markup Language (CML) is being developed specifically for chemistry.

• In the future much more information will be stored with molecules, allowing greater re-use of data.

• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928.

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties, etc.
- Can contain relationships to other molecules and also concepts

InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES
C(C)O
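
As a loose illustration of storing a molecule together with metadata, the Python below builds and re-parses a small CML-like XML record for ethanol using the standard library. The element and attribute names follow the general style of CML but should be read as assumptions of this sketch rather than a validated CML document; the property shown is the molar mass with its units.

```python
import xml.etree.ElementTree as ET

# Build a CML-like record for ethanol (heavy atoms only).
molecule = ET.Element("molecule", id="ethanol")
ET.SubElement(molecule, "name").text = "ethanol"

atom_array = ET.SubElement(molecule, "atomArray")
for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
    ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)

bond_array = ET.SubElement(molecule, "bondArray")
for pair in ["a1 a2", "a2 a3"]:
    ET.SubElement(bond_array, "bond", atomRefs2=pair, order="1")

# Metadata travels with the value: here the units are stated explicitly.
prop = ET.SubElement(molecule, "property", title="molar mass", units="g/mol")
prop.text = "46.07"

xml_text = ET.tostring(molecule, encoding="unicode")

# Any XML parser can read the record back without format-specific code.
parsed = ET.fromstring(xml_text)
elements = [atom.get("elementType") for atom in parsed.iter("atom")]
units = parsed.find("property").get("units")
print(elements, units)
```

Unlike the fixed-position SD and PDB formats, each value here is named, so new kinds of data can be added without breaking programs that only look for the parts they understand.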

Molecules and 3 dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.

• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Data Bank, PDB).

• From these can be obtained atom positions, bonds, coordinates, etc.

• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.

• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.

• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).

Molecules in 3D: uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions …

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these simulations to design new molecules that block this receptor.

Apelin's role in cancer

'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253

• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats: the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect: check that they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of MWt and topology) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).

2 Finding molecules

A typical problem involves finding the 'right' molecules

All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1 Finding molecules using substructure searching

One of the most useful 'properties' of a molecule is its molecular graph. Many software systems allow searching for whole molecules, or fragments of molecules, or even 'similar' molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov


Page 14: Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH

A SMILES tutorial is available at

httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml

You can practice by drawing a structure and a smiles will be available

by picking the smiley face

httpwwwmolinspirationcomcgi-binpropertiestextMode=1

InChibull A more recent line notation is called

InChI

bull This will address some of the problems

of SMILES eg polymers and materials

not covered by SMILES

bull InChI is generated using computer

algorithms and is virtually un-

interpretable by a humans

bull Importantly it is commonly used as a

unique chemical identifier - each

molecule should theoretically have a

unique InChI One molecule one InChi

bull Websites are available that can generate

InChIconvert from InChi to structure

from different names and formats

Again a string like this is easily

matched on a computer

RSC ChemSpider

Storing chemical diagrams on computers

bull The valence model of a molecule can be represented by a

chemical graph A simple graph contains nodes (atoms) and edges

(bonds) joining pairs of nodes

bull The spacial position of the nodes length of the edges and

crossings are irrelevant Generally we ignore hydrogens unless

tautomerism or pKa is an issue Computers handle graphs very

well and molecules represented like this are examples of labelled

graphs (the atoms have names eg Oxygen)

bull Chemical structures are of course more complex than this and

aromaticity stereochemistry tautomerism non-stoichimetric

compounds etc are often problematic The computer would

(using a simple graph) deduce these two canonical structures are

different molecules

bull Eg to solve this we could introduce the concept of an aromatic

bond

An example of storing the molecular

graph SD file format

MDL SD (structure data) format files contain the following information and the

information about a molecule is stored in the following format

Header Block

describes the molecule eg itrsquos name

Connection table

defines the molecular structure (atoms and bonds)

Data block (optional)

Properties eg volume

Terminator line

a line containing four dollar signs ($$$$)

indicating the end of information on this molecule

This is probably the most common format to store small molecules ndash SD files in

software are widely used to store many molecule structures in databases

A ldquoformatrdquo in computer science is a precisely described order of data The program

reading it expects the information in exactly the right place (or it screws up)

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

eg PDB protein file format

eg SD file format

1

1

1

1

1

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

The Connection Table ndash describing bonds

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

(reduced)

eg PDB protein file format

eg SD file format

The Connection Table ndash describing bonds

Why store molecules in 2Dbull Quite often we only need the

chemical diagram eg to find a

molecule that matches a

chemical structure search

bull It is often the case that we donrsquot

know the conformation (shape)

of a molecule ndash so storing it in

3D would be pointless Look at

the changes in conformation in

this molecule at room

temperature

Example SD file – benzene

benzene
  ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Annotations:

• Line 1: molecule name
• Line 2: information on this molecule
• Line 3: comment (e.g. a date is used here)
• "Counts line": has the number of atoms and bonds as a minimum
• "Atom block": has the xyz coordinates of the atom and the element as a minimum
• "Bond block": one line for each bond (from atom, to atom, bond type)
• "M  END" and "$$$$": these identify the end of this molecule
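To make the block structure concrete, here is a minimal Python sketch that reads the benzene record shown above. It is an illustration only: the function name `parse_molfile` is hypothetical, the decimal points in the coordinates are restored on the assumption of four decimal places, and it splits lines on whitespace, whereas real V2000 files are fixed-width and a production reader should slice by column.

```python
BENZENE_MOLFILE = """benzene
  ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_molfile(text):
    """Read the header, atom block and bond block of one small V2000 record."""
    lines = text.splitlines()
    name = lines[0].strip()                      # header line 1: molecule name
    counts = lines[3].split()                    # counts line: atoms, bonds, ...
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for line in lines[4:4 + n_atoms]:            # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        a, b, order = (int(v) for v in line.split()[:3])
        bonds.append((a, b, order))              # from atom, to atom, bond type
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(BENZENE_MOLFILE)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

The bond list is exactly the edge list of the labelled graph: atom numbers are the nodes and the bond type is the edge label.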

Alanine SD file

• Bond length is 1.53 Å
• Atom line fields: x, y, z, symbol, mass difference, charge, stereo, H-count, …
• Charge codes: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
• Molecule is chiral
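The charge field is a code, not the charge itself, which is easy to get wrong when reading a file by hand. A small Python sketch of the lookup listed above (the names `CHARGE_CODES` and `decode_charge` are hypothetical, chosen for illustration):

```python
# Charge codes as listed above; code 4 marks a doublet radical, not a charge.
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 5: -1, 6: -2, 7: -3}


def decode_charge(code):
    """Return (formal_charge, is_doublet_radical) for an atom-block charge code."""
    if code == 4:
        return 0, True
    return CHARGE_CODES[code], False


print(decode_charge(3))   # code 3 means a charge of +1
print(decode_charge(5))   # code 5 means a charge of -1
```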

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

[Figure callout: Which way round should this ring go?]
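One common way to impute bonds from positions alone is a distance test: two atoms are taken as bonded if they are closer than the sum of their covalent radii plus a tolerance. The sketch below is a simplified illustration of that idea, not the method used by any particular program. The radii are approximate single-bond covalent radii in Å, the 0.4 Å tolerance is an assumed value, and the coordinates are the sulfate atoms from PDB entry 3S7A quoted in this handout, with decimal points restored.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (assumed values for illustration).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}


def impute_bonds(atoms, tolerance=0.4):
    """Return index pairs (i, j) of atoms close enough to be bonded.

    atoms is a list of (element, (x, y, z)) tuples.
    """
    bonds = []
    for (i, (el_i, pos_i)), (j, (el_j, pos_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADIUS[el_i] + COVALENT_RADIUS[el_j] + tolerance
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds


sulfate = [
    ("S", (25.993, -1.362, 5.893)),
    ("O", (25.504, -1.159, 4.508)),
    ("O", (27.282, -0.770, 6.155)),
    ("O", (24.953, -1.035, 6.850)),
]
print(impute_bonds(sulfate))
```

Only the three S–O contacts fall under the cutoff; the O···O distances (about 2.4 Å) do not. Note that this recovers connectivity only: bond orders, and choices such as which way round a ring should go, still need chemical knowledge.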

Example Protein Data Bank file (pdb)

(bonds are given as an adjacency list in the CONECT records)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR  26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH                   : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

Annotations:

• CRYST1: space group and unit cell dimensions
• ATOM: protein atoms, including xyz, occupancy and temperature factor
• HETATM: non-protein atoms, including xyz, occupancy and temperature factor
• CONECT: bonds
• New format: mmCIF
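The PDB format is a good illustration of "information in exactly the right place": each field of an ATOM or HETATM record lives in fixed character columns. The Python sketch below writes and reads such a line using the standard column positions (serial in columns 7-11, atom name in 13-16, x/y/z in 31-54, occupancy in 55-60, temperature factor in 61-66). The helper names are hypothetical and only the fields shown in the example above are handled.

```python
def format_atom_line(record, serial, name, res_name, chain, res_seq, x, y, z, occ, b):
    """Write one ATOM/HETATM line using the fixed PDB column widths."""
    return (f"{record:<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}")


def parse_atom_line(line):
    """Read the fields back by slicing character columns (0-based slices)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


# The sulfur atom of the sulfate ion in the example above.
line = format_atom_line("HETATM", 1510, " S", "SO4", "A", 187,
                        25.993, -1.362, 5.893, 1.00, 22.34)
print(line)
print(parse_atom_line(line))
```

Slicing by column, rather than splitting on spaces, matters because neighbouring fields can run together without a space, for example when a coordinate fills all eight of its columns.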

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.

This uses internal coordinates

• Also called a Z-matrix

– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)

– Early form of specification of a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation

– A Z-matrix uses the following geometric descriptions to describe molecules:

  Bond length
  Bond angle
  Torsion angle
  Out-of-plane bending, e.g. a carbonyl
  Non-bonded distance

[Figure: internal coordinates, with dummy atom (du) positions marked]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Figure callouts: z-matrix; 'improper' torsion angles]
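A program that receives a Z-matrix has to turn it back into Cartesian coordinates before it can do anything else. The following Python sketch shows one way to do this for the Gaussian-style layout described above (atom lines, blank line, then variables). It is a minimal illustration under stated assumptions: the function name is hypothetical, the first atom is put at the origin and the second on the z axis by convention, constants after a second blank line are read like ordinary variables, and collinear reference atoms are not handled.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    norm = math.sqrt(sum(x * x for x in a))
    return tuple(x / norm for x in a)


def zmatrix_to_cartesian(text):
    """Convert a Z-matrix string into a list of (symbol, (x, y, z))."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    split_at = lines.index("") if "" in lines else len(lines)
    variables = {}
    for ln in lines[split_at:]:
        if ln:
            key, val = ln.split()
            variables[key] = float(val)

    def value(token):
        # A token is a number, a variable name, or a negated variable name.
        key = token.lstrip("-")
        if key in variables:
            return -variables[key] if token.startswith("-") else variables[key]
        return float(token)

    symbols, coords = [], []
    for fields in (ln.split() for ln in lines[:split_at]):
        symbols.append(fields[0])
        placed = len(coords)
        if placed == 0:                      # atom 1: origin
            coords.append((0.0, 0.0, 0.0))
            continue
        na, r = int(fields[1]) - 1, value(fields[2])
        if placed == 1:                      # atom 2: on the z axis
            coords.append((0.0, 0.0, r))
            continue
        nb, theta = int(fields[3]) - 1, math.radians(value(fields[4]))
        if placed == 2:                      # atom 3: in the xz plane
            sign = 1.0 if coords[nb][2] > coords[na][2] else -1.0
            cx, cy, cz = coords[na]
            coords.append((cx + r * math.sin(theta), cy,
                           cz + sign * r * math.cos(theta)))
            continue
        nc, phi = int(fields[5]) - 1, math.radians(value(fields[6]))
        a, b, c = coords[nc], coords[nb], coords[na]
        bc = _unit(_sub(c, b))               # axis from NB to NA
        n = _unit(_cross(_sub(b, a), bc))    # normal to the NC-NB-NA plane
        m = _cross(n, bc)
        local = (-r * math.cos(theta),
                 r * math.sin(theta) * math.cos(phi),
                 r * math.sin(theta) * math.sin(phi))
        coords.append(tuple(c[k] + local[0] * bc[k] + local[1] * m[k]
                            + local[2] * n[k] for k in range(3)))
    return list(zip(symbols, coords))


WATER = """O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
"""
water = zmatrix_to_cartesian(WATER)
for symbol, xyz in water:
    print(symbol, xyz)
```

Changing a single variable (say `a1`) and converting again moves the atoms along that internal coordinate, which is why this representation is convenient for driving a reaction coordinate.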

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.

• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.

• Chemical Markup Language (CML) is being developed specifically for chemistry.

• In the future much more information will be stored with molecules, allowing greater re-use of data.

• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928.

Ethanol

<CML>

- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts

InChI

InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES

C(C)O
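"Can be parsed" is the practical point: any XML library can read such a file without a chemistry-specific parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a small CML-style fragment for ethanol. The fragment is hand-written for illustration (heavy atoms only, no XML namespace declaration); it follows the usual CML element and attribute names (`atomArray`, `elementType`, `atomRefs2`, `order`) but should not be taken as a complete or validated CML document.

```python
import xml.etree.ElementTree as ET

# Illustrative CML-style fragment for ethanol (hydrogens omitted).
CML_ETHANOL = """<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>"""

root = ET.fromstring(CML_ETHANOL)

# Nodes of the molecular graph: atom id -> element symbol.
cml_atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}

# Edges of the molecular graph: (atom id, atom id, bond order).
cml_bonds = [tuple(bond.get("atomRefs2").split()) + (bond.get("order"),)
             for bond in root.iter("bond")]

print(cml_atoms)
print(cml_bonds)
```

Because every value is labelled by its element or attribute name, extra metadata (units, dates, provenance) can be added without breaking programs that only read the atoms and bonds.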

Molecules and 3 dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.

• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).

• From these can be obtained atom positions, bonds, coordinates etc.

• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.

• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.

• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).

Molecules in 3D – uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions …

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM

http://ambermd.org

http://www.ks.uiuc.edu/Research/namd

http://www.ks.uiuc.edu/Research/vmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor.

Apelin's role in cancer

'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253

• Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Rayalam et al., Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats – which could be a real pain, but they can be converted

alc -- Alchemy file                               prep -- Amber PREP file
bs -- Ball & Stick file                           caccrt -- Cacao Cartesian file
ccc -- CCC file                                   c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                   cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file      crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                          dmol -- DMol3 Coordinates file
feat -- Feature file                              gam -- GAMESS Output file
gamout -- GAMESS Output file                      gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                         qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                         jout -- Jaguar Output file
bin -- OpenEye Binary file                        mmd -- MacroModel file
mmod -- MacroModel file                           out -- MacroModel file
dat -- MacroModel file                            car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                          sd -- MDL Isis SDF file
mdl -- MDL Molfile file                           mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                    mopout -- MOPAC Output file
mmads -- MMADS file                               mpqc -- MPQC file
bgf -- MSI BGF file                               nwo -- NWChem Output file
pdb -- PDB file                                   ent -- PDB file
pqs -- PQS file                                   qcout -- Q-Chem Output file
res -- ShelX file                                 ins -- ShelX file
smi -- SMILES file                                mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                        vmol -- ViewMol file

alc -- Alchemy file                               bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                    cacint -- Cacao Internal file
cache -- CAChe MolStruct file                     c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                   ct -- ChemDraw Connection Table file
cht -- Chemtool file                              cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file      crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                             box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                    feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                   gamin -- GAMESS Input file
inp -- GAMESS Input file                          gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                        gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                        gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                         jin -- Jaguar Input file
bin -- OpenEye Binary file                        mmd -- MacroModel file
mmod -- MacroModel file                           out -- MacroModel file
dat -- MacroModel file                            sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                           mdl -- MDL Molfile file
mol -- MDL Molfile file                           mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                               bgf -- MSI BGF file
csr -- MSI Quanta CSR file                        nw -- NWChem Input file
pdb -- PDB file                                   ent -- PDB file
pov -- POV-Ray Output file                        pqs -- PQS file
report -- Report file                             qcin -- Q-Chem Input file
smi -- SMILES file                                fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                           txyz -- Tinker XYZ file
txt -- Titles file                                unixyz -- UniChem XYZ file
vmol -- ViewMol file                              xed -- XED file
xyz -- XYZ file                                   zin -- ZINDO Input file

File formats – the bane of our lives

Interconversion program – Babel

There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers – things to look out for, since what is stored is actually quite crude

For example:

• Stereochemistry may be relative and not absolute, or even incorrect.

• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

• Tautomers can be incorrect – check they look reasonable.

• Mesomers can be incorrect (double bonds).

• Polymers (which are mixtures of MWt and topology) are very difficult to store.

• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).

2. Finding molecules

A typical problem involves finding the 'right' molecules:

• All the molecules synthesisable: 10^60
• Molecules with relevance to the problem (that we can generate): 10^10
• Molecules we can search efficiently: 10^7
• Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1. Finding molecules using substructure searching

One of the most useful 'properties' of a molecule is its molecular graph. Many software systems allow searching for whole molecules, or fragments of molecules, or even 'similar' molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example: find a molecule containing our query as part of a larger molecule.

Results: it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly the whole database is not being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

A set bit (1) records e.g. "contains P" or "contains NH2"; an unset bit (0) records e.g. "doesn't contain F".

So a quick match of the bitstring of the fragment searched for against the (pre-computed) bitstrings in the database will eliminate most of the structures.
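The screening step can be sketched in a few lines of Python. Each molecule's bitstring is held as an integer, and a molecule survives the screen only if every bit set in the query is also set in the molecule. The fragment-to-bit assignment and the tiny database below are invented for illustration; real systems use many more bits and derive them from the structure automatically.

```python
# Hypothetical fragment key: which bit records which fragment.
FRAGMENT_BITS = {"OH": 0, "F": 2, "NH2": 5, "phenyl": 7, "P": 11}


def fingerprint(fragments):
    """Build a bitstring (as an int) with one bit set per fragment present."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << FRAGMENT_BITS[fragment]
    return bits


def screen(query_bits, database):
    """Keep only molecules whose bitstring contains every query bit."""
    return [name for name, bits in database.items()
            if bits & query_bits == query_bits]


# Pre-computed once, when the database is built.
database = {
    "aniline": fingerprint(["NH2", "phenyl"]),
    "phenol": fingerprint(["OH", "phenyl"]),
    "fluorobenzene": fingerprint(["F", "phenyl"]),
    "4-aminophenol": fingerprint(["NH2", "OH", "phenyl"]),
}

query = fingerprint(["NH2", "phenyl"])
print(screen(query, database))
```

The screen can only rule molecules out. Anything that passes still has to go through the slower atom-by-atom substructure match, because sharing the same fragments does not guarantee that they are connected in the same way.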

Next step – substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware: salts etc. You may need to strip out counterions.
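In SMILES, disconnected components such as counterions are separated by a dot, so a crude way to strip them is to keep only the largest component before comparing strings. The Python sketch below does exactly that. It is a deliberately naive heuristic (the function name is hypothetical, "largest" is judged by string length, and it assumes both SMILES have already been canonicalised by the same program); real toolkits use curated salt lists and atom counts.

```python
def largest_fragment(smiles):
    """Return the longest dot-separated component of a SMILES string."""
    return max(smiles.split("."), key=len)


def exact_match(query_smiles, stored_smiles):
    """Compare two (already canonical) SMILES after removing counterions."""
    return largest_fragment(query_smiles) == largest_fragment(stored_smiles)


# Dimethylaniline stored as its hydrochloride salt still matches the free base.
print(exact_match("CN(C)c1ccccc1", "CN(C)c1ccccc1.Cl"))
```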

Substructure search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.

– –OH
– –NH2
– –COOH
– phenyl

• This can be done by tracing appropriate paths in the graph

• Subgraphs may overlap

[Figure: example molecule containing the fragments OH, CH2, CHNH2, OH and O]

Matching the query structure to the database

Two algorithms are commonly used:

• The Morgan algorithm, which numbers the molecule uniquely, and
• The Ullmann algorithm, which matches the fragments

Note: a number of 'substructural' fragments can be matched here (6 in all).

Read these for more details:

Gund, P. Ann. Rep. Med. Chem. 14, 299 (1979)

Sheridan, R. P. et al. J. Chem. Inf. Comp. Sci. 29, 255 (1989)

Brint, A. T.; Willett, P. J. Mol. Graphics 5, 49 (1987)

Ullmann, J. R. An Algorithm for Subgraph Isomorphism. Journal of the Association for Computing Machinery (JACM) 23, 31-42 (1976). http://portal.acm.org/citation.cfm?id=321925

Morgan, H. L. The generation of a unique machine description for chemical structures. Journal of Chemical Documentation 5, 107-113 (1965)

The basic concept is that molecules are considered as graphs

Atoms are 'nodes'

Bonds are 'edges'

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).

The Morgan algorithm iteratively computes an integer label for each node in a structure (atoms). For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node i, considering its atomic number, degree (number of substituents) and types of bonds.

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures – a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107–113, 1965
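Steps (i) to (iii) can be sketched in Python as follows. This is a simplified reading, not Morgan's full procedure: the initial label is just the degree of each atom (atomic number and bond types are ignored), step (ii) is taken to mean "own label plus the labels of the neighbours", and the atom numbering of cyclohexanone used here (0 = O, 1 = carbonyl C, 2 to 6 = ring carbons) is chosen for this example only.

```python
def morgan_labels(neighbours):
    """Iterate labels until the number of distinct labels stops increasing.

    neighbours[i] is the list of atoms bonded to atom i.
    Returns the labels from the last iteration that increased the class count.
    """
    labels = [len(nbrs) for nbrs in neighbours]          # step (i): degree
    n_classes = len(set(labels))
    while True:
        updated = [labels[i] + sum(labels[j] for j in nbrs)   # step (ii)
                   for i, nbrs in enumerate(neighbours)]
        n_updated = len(set(updated))
        if n_updated <= n_classes:                       # step (iii): stop
            return labels
        labels, n_classes = updated, n_updated


# Cyclohexanone heavy atoms: O(0)=C(1), ring 1-2-3-4-5-6-1.
cyclohexanone = [[1], [0, 2, 6], [1, 3], [2, 4], [3, 5], [4, 6], [5, 1]]

labels = morgan_labels(cyclohexanone)
order = sorted(range(len(labels)), key=lambda i: -labels[i])
print(labels)   # [12, 26, 21, 19, 18, 19, 21]
print(order)    # atoms ranked from highest label to lowest
```

The carbonyl carbon receives the highest label and so becomes atom 1 of the canonical numbering. Atoms 2 and 6 (and 3 and 5) end up with identical labels because they are equivalent by symmetry, which is where tie-breaking rules are needed.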

The first step – numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged.

• Numbering starts from the highest label, then uses the next-highest connection as 2, etc. Like an onion: concentric circles from the highest-numbered atom.

• Rules are used to break equalities, such as C before N, double bond before single, etc.

• Some equivalent atoms are arbitrarily numbered.

Urea

[Figure: query fragment (atoms O1, C2, C3, C4) and cyclohexanone (atoms O1, C2 to C7), with atom numbering]

Adjacency matrix

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

Cyclohexanone adjacency matrix (non-zero entries per row; columns O1 C2 C3 C4 C5 C6 C7)

O1  1
C2  1 1
C3  1 1
C4  1 1 1
C5  1 1
C6  1 1
C7  1 1

Numbering is crucial. Renumber for different starting points. Also the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching ('subgraph isomorphism') and is of the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures.
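To show what matching on adjacency matrices involves, here is a small Python sketch of a backtracking subgraph search. It is in the spirit of Ullmann's approach (candidate atoms are pre-filtered by element and degree, then assignments are tried one query atom at a time) but it is not the full algorithm, which also refines the candidate matrix at every step. The function names and the 0-based atom numbering are assumptions made for this example; bond orders are ignored.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of (i, j) bonds."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix


def find_substructure(q_labels, q_adj, t_labels, t_adj):
    """Return one mapping query atom -> target atom, or None if no match."""
    q_deg = [sum(row) for row in q_adj]
    t_deg = [sum(row) for row in t_adj]
    # Pre-filter: same element, and at least as many neighbours as the query atom.
    candidates = [[j for j in range(len(t_labels))
                   if t_labels[j] == q_labels[i] and t_deg[j] >= q_deg[i]]
                  for i in range(len(q_labels))]
    mapping = []

    def extend(i):
        if i == len(q_labels):
            return True
        for j in candidates[i]:
            if j in mapping:
                continue
            # Every query bond to an already-placed atom must exist in the target.
            if all(t_adj[j][mapping[k]] for k in range(i) if q_adj[i][k]):
                mapping.append(j)
                if extend(i + 1):
                    return True
                mapping.pop()
        return False

    return list(mapping) if extend(0) else None


# Query: O and two C atoms all bonded to a central C (atom 3).
query_labels = ["O", "C", "C", "C"]
query_adj = adjacency_matrix(4, [(0, 3), (1, 3), (2, 3)])

# Target: cyclohexanone, O(0)=C(1), ring 1-2-3-4-5-6-1.
target_labels = ["O", "C", "C", "C", "C", "C", "C"]
target_adj = adjacency_matrix(7, [(0, 1), (1, 2), (2, 3), (3, 4),
                                  (4, 5), (5, 6), (6, 1)])

match = find_substructure(query_labels, query_adj, target_labels, target_adj)
print(match)
```

The search has to back up several times before it finds that the central query atom can only be the carbonyl carbon. That trial and error is the source of the exponential worst case, and it is why the cheap bitstring screen is run first.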


Next lecture

• How can we use this type of information to solve our chemistry problems?

– Patents / Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure–property/activity relationships

Page 15: Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH

InChibull A more recent line notation is called

InChI

bull This will address some of the problems

of SMILES eg polymers and materials

not covered by SMILES

bull InChI is generated using computer

algorithms and is virtually un-

interpretable by a humans

bull Importantly it is commonly used as a

unique chemical identifier - each

molecule should theoretically have a

unique InChI One molecule one InChi

bull Websites are available that can generate

InChIconvert from InChi to structure

from different names and formats

Again a string like this is easily

matched on a computer

RSC ChemSpider

Storing chemical diagrams on computers

bull The valence model of a molecule can be represented by a

chemical graph A simple graph contains nodes (atoms) and edges

(bonds) joining pairs of nodes

bull The spacial position of the nodes length of the edges and

crossings are irrelevant Generally we ignore hydrogens unless

tautomerism or pKa is an issue Computers handle graphs very

well and molecules represented like this are examples of labelled

graphs (the atoms have names eg Oxygen)

bull Chemical structures are of course more complex than this and

aromaticity stereochemistry tautomerism non-stoichimetric

compounds etc are often problematic The computer would

(using a simple graph) deduce these two canonical structures are

different molecules

bull Eg to solve this we could introduce the concept of an aromatic

bond

An example of storing the molecular

graph SD file format

MDL SD (structure data) format files contain the following information and the

information about a molecule is stored in the following format

Header Block

describes the molecule eg itrsquos name

Connection table

defines the molecular structure (atoms and bonds)

Data block (optional)

Properties eg volume

Terminator line

a line containing four dollar signs ($$$$)

indicating the end of information on this molecule

This is probably the most common format to store small molecules ndash SD files in

software are widely used to store many molecule structures in databases

A ldquoformatrdquo in computer science is a precisely described order of data The program

reading it expects the information in exactly the right place (or it screws up)

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

eg PDB protein file format

eg SD file format

1

1

1

1

1

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

The Connection Table ndash describing bonds

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

(reduced)

eg PDB protein file format

eg SD file format

The Connection Table ndash describing bonds

Why store molecules in 2Dbull Quite often we only need the

chemical diagram eg to find a

molecule that matches a

chemical structure search

bull It is often the case that we donrsquot

know the conformation (shape)

of a molecule ndash so storing it in

3D would be pointless Look at

the changes in conformation in

this molecule at room

temperature

Example SD file - benzene

bull benzene

bull ACDLabs0812062058

bull August 2013

bull 6 6 0 0 0 0 0 0 0 0 1 V2000

bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 2 1 1 0 0 0 0

bull 3 1 2 0 0 0 0

bull 4 2 2 0 0 0 0

bull 5 3 1 0 0 0 0

bull 6 4 1 0 0 0 0

bull 6 5 2 0 0 0 0

bull M END

bull $$$$

Molecule nameInformation on this molecule

Comment (eg date is used here)

ldquocounts linerdquo has the

number of atoms and

bonds as a minimum

ldquoatom blockrdquo has xyz

coordinates of the atom and

element as a minimum

ldquobond blockrdquo 1-line for each bond

from atom - to atom ndash bond type

These identify the end of this molecule

Alanine SD file

Bond length is 153A

xyz symbol mass diff charge stereo h-counthellip

Charge

0 = uncharged or value

other than

these 1 = +3 2 = +2 3 =

+1

4 = doublet radical

5 = -1 6 = -2 7

= -3

Molecule is chiral

What if we only know the atom positions and not the bonds

A key example would be x-ray crystallography Here we determine the

positions of the atoms and impute the bonds from our chemical knowledge

Here is an electron density map from an x-ray experiment We see the

electrons but not the nuclei In a small molecule this is very accurate and we

can almost see the bonds However in eg a protein structure which is very

large we canrsquot always determine the atom positions exactly so these are

stored and the bonds imputed The format for storing these is a bit different

Which way

round should

this ring go

Example protein databank file (pdb)

(uses an adjacency matrix)

HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A

TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684

EXPDTA X-RAY DIFFRACTION

REMARK 2 RESOLUTION 180 ANGSTROMS

REMARK 200 TEMPERATURE (KELVIN) 113

REMARK 200 PH 69

SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN

SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO

CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4

ORIGX1 1000000 0000000 0000000 000000

ORIGX2 0000000 1000000 0000000 000000

ORIGX3 0000000 0000000 1000000 000000

SCALE1 0018319 0000000 0000000 000000

SCALE2 0000000 0018147 0000000 000000

SCALE3 0000000 0000000 0015426 000000

ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128

ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841

ATOM 3 C VAL A 1 4565 1067 -3593 100 2764

ATOM 4 O VAL A 1 4653 1287 -4853 100 2717

ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001

HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234

HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213

HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410

HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534

CONECT 1510 1511 1512 1513 1514

CONECT 1511 1510

CONECT 1512 1510

MASTER 334 0 6 6 18 0 11 6 1583 1 95 15

END

Space group and unit cell

dimensions

Protein atoms ndash including xyz

occupancy and temperature factor

Non-protein atoms ndash including xyz

occupancy and temperature factor

Bonds New format

mmCIF

What if we want to vary the atom

positions eg in driving a

reaction coordinate

Using Cartesian (xyz)

coordinates is very

cumbersome so instead

we use the natural angles

and distances

This uses internal coordinates

bull Also called a Z-matrix

ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)

ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation

ndash A z-matrix uses the following geometric descriptions to describe molecules

Bond Length

Bond angle

Torsion angle

Out of plane bending

eg a carbonyl

Non-bonded distance

C

O

du

du

O

N OH

Dummy atom

positions

Internal

Coordinates

How to construct a Z-matrix (in Gaussian format)

1For the first atom to be defined give the atomic symbol

only

2For the second atom give the atomic symbol the number

1 and the name of a variable to describe the distance

between atoms 1 and 2

3For the third atom give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB and

the name of a variable to describe the angle between the

current atom NA and NB

4For all later atoms give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB the

name of a variable to describe the angle between the current

atom NA and NB the atom number of another previously

defined atom NC and finally the name of a variable to

describe the dihedral angle between the current atom NA

NB and NC

5After all the atoms have been listed enter a blank line

6Next list each variable with its corresponding value Use a

separate line for each variable

7In some cases where some of the variables are to be fixed

as constants in a geometry optimisation they are listed here

after a blank line rather than above

with the real variables

8End the Z-matrix with a blank line

Water (C2v)

O

H 1 l1

H l l1 2 a1

l1 096

a1 1040

Methanol Z-matrix

C

O 1 l1

H 1 l2 2 a1

H 1 l3 2 a2 3 da1

H 1 l3 2 a2 3 -da1

H 2 l4 1 a3 3 1800

l1 142

l2 109

l3 109

l4 109

l5 109

l6 10

a1 1090

a2 1100

a3 1080

a4 1100

a5 1100

da1 600

da2 1200

da3 600 z-matrix

lsquoimproperrsquo torsion angles

What about comprehensive

properties of molecules ndash they

are more than xyz

XML and molecules

bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc

bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file

bull Chemical Markup Language (CML) is being developed specifically for chemistry

bull In the future much more information will be stored with molecules allowing greater re-use of data

bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928

Ethanol

ltCMLgt

-Can be parsed

-Can contain reactions

properties etc

-Can contain

relationships to other

molecules and also

concepts

InChI

InChI=1C2H6Oc1-2-3h3H2H21H3

SMILES

C(C)O

Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not

really flat because of thermal fluctuations So we represent 3D

molecules by including their coordinates or their internal coordinates

bull Obtaining the 3-dimensional coordinates can involve experiment (x-

ray electron or neutron diffraction eg the Cambridge

Crystallographic Database or the Protein Databank ndash PDB)

bull From these can be obtained atom positions bonds coordinates etc

bull There are a number of 3-D construction methods available such as

Corina or Concord (put in a SMILES and get a 3-D molecule) which

use rules derived from experiment

bull Molecules can also be constructed in 2D and subjected to molecular

mechanics or Quantum Mechanics calculations to obtain 3D structures

bull Conformation still remains to be deduced There are many methods

that deduce conformations usually involving torsional angle rotation

to scan the conformational space (like a Ramachandran plot ndash but often

in many more torsional angle dimensions)

Molecules in 3D - uses

bull More accurate calculation of molecular

properties

bull Comparison of the shapes (conformations)

of molecules

bull Comparison of the dynamics of molecules

bull Calculation of bulk properties

bull Simulation of chemical reactionshelliphellip

The dynamics of a molecule can be computed and stored as a series of

frames of the coordinates and bonds of the structure (much like a cartoon)

Molecular Dynamics is the most popular method for larger systems Here is

an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a

series of snapshots of SD files concatenated together to make a movie just

like a film strip

httpsenwikipediaorgwikiCHARMM

httpambermdorg

httpwwwksuiuceduResearchnamd

httpwwwksuiuceduResearchvmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called ‘Apelin’, with water and the membrane. We used these to design new molecules that block this receptor.

Apelin’s role in Cancer

‘Apelin’ (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253

• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file prep -- Amber PREP file

bs -- Ball & Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI Biosym/Insight II CAR file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball & Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file

cht -- Chemtool file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 3.5 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

Recent IUPAC moves towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect: check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).

2 Finding molecules

A typical problem involves finding the ‘right’ molecules

All the synthesisable molecules: 10^60

Molecules with relevance to the problem (that we can generate): 10^10

Molecules we can search efficiently: 10^7

Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1 Finding molecules using substructure searching

One of the most useful ‘properties’ of a molecule is its molecular graph.

Many software systems allow searching for whole molecules, fragments of molecules, or even ‘similar’ molecules.

For a specific molecule, this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we’ve used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example: find a molecule containing our query as part of a larger molecule.

Results: it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Value: 0 0 0 0 0 1 0 0 0  0  0  1  0  0  0  0

[Figure: individual bits are annotated ‘Contains P’, ‘Contains NH2’ and ‘Doesn’t contain F’]

So a quick match of the bitstring of the query fragment against the (pre-computed) bitstrings in the database will eliminate most of the structures.
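The screening step can be sketched in a few lines of Python. This is an illustration only: the fragment dictionary (FRAGMENT_BITS), the function names and the three database entries are hypothetical, and real systems use much longer bitstrings.

```python
# Hypothetical fragment dictionary: fragment name -> bit position.
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3}

def fingerprint(fragments):
    """Pack the fragments present in a molecule into an integer bitstring."""
    bits = 0
    for name in fragments:
        bits |= 1 << FRAGMENT_BITS[name]
    return bits

def may_contain(query_bits, molecule_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (query_bits & molecule_bits) == query_bits

# Pre-computed bitstrings for a toy database.
database = {
    "mol_a": fingerprint(["NH2", "OH"]),
    "mol_b": fingerprint(["P"]),
    "mol_c": fingerprint(["NH2", "F"]),
}
query = fingerprint(["NH2"])
candidates = sorted(name for name, bits in database.items() if may_contain(query, bits))
print(candidates)
```

Only the surviving candidates need to go on to the slower atom-by-atom matching; a molecule that fails the bit test cannot contain the query.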

Next step: substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware: salts etc. may need their counterions stripped out.
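Counterion stripping can be illustrated on SMILES strings, where disconnected components are separated by a dot. A crude Python sketch, assuming the parent molecule is simply the longest component (the function name strip_counterions is hypothetical; toolkits usually compare heavy-atom counts instead):

```python
def strip_counterions(smiles):
    """Keep the largest dot-separated component of a SMILES string.

    The length of the component text is used as a rough size measure.
    """
    return max(smiles.split("."), key=len)

# Dimethylaniline hydrochloride written as two components.
print(strip_counterions("CN(C)c1ccccc1.Cl"))
```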

Substructure Search

Supposing we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• ‘Subgraphs’ can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.

– -OH

– -NH2

– -COOH

– phenyl

• this can be done by tracing appropriate paths in the graph

• subgraphs may overlap

[Figure: example molecule with the groups OH, CH2, CHNH2, OH and O labelled]

Matching the query structure to the database

Two algorithms are commonly used:

The Morgan algorithm, which numbers the molecule uniquely, and

the Ullmann algorithm, which matches the fragments.

Note: a number of ‘substructural’ fragments can be matched here (6 in all)

Read these for more details:

Gund P., Ann. Rep. Med. Chem. 14, 299, 1979

Sheridan R. P. et al., J. Chem. Inf. Comput. Sci. 29, 255, 1989

Brint A. T., Willett P., J. Mol. Graphics 5, 49, 1987

Ullmann J. R., An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925

Morgan H. L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113

The basic concept is that molecules are considered as graphs:

Atoms are ‘nodes’

Bonds are ‘edges’

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
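Steps (i) to (iii) can be sketched as below. This is a simplified Python illustration, not the full published procedure: the initial label is just the atom degree (atomic number and bond types are ignored), ties are left unbroken, and the function name and the pentane test molecule are choices made for this example.

```python
def morgan_labels(neighbours):
    """Simplified Morgan labels.

    neighbours maps each atom index to the list of atoms bonded to it.
    """
    # (i) initial label: the degree of each atom
    labels = {atom: len(adj) for atom, adj in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # (ii) new label: sum of the labels of the adjacent atoms
        updated = {atom: sum(labels[n] for n in adj) for atom, adj in neighbours.items()}
        n_updated = len(set(updated.values()))
        # (iii) stop when the number of distinct labels no longer increases
        if n_updated <= n_classes:
            return labels
        labels, n_classes = updated, n_updated

# Carbon skeleton of n-pentane: 0-1-2-3-4
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = morgan_labels(pentane)
ranking = sorted(labels, key=lambda atom: -labels[atom])  # highest label first
print(labels, ranking)
```

In this example the central carbon receives the highest label, so numbering would start there and work outwards; the two pairs of symmetry-equivalent atoms keep equal labels and need tie-breaking rules.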

H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107-113, 1965

The first step - numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged.

• Numbering starts from the highest-valued atom, then uses the next-highest connected atom as 2, etc. Like an onion: concentric circles out from the highest-numbered atom.

• Rules are used to break ties, such as C before N, double bond before single, etc.

• Some equivalent atoms are arbitrarily numbered.

Urea

Adjacency matrix

[Figure: numbered structure diagrams of the query fragment (O1, C2, C3, C4) and of cyclohexanone (O1, C2 to C7)]

Query fragment adjacency matrix (1 = bonded):

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

Cyclohexanone adjacency matrix (atoms O1, C2, C3, C4, C5, C6, C7): row O1 has one entry, row C4 has three, and rows C2, C3, C5, C6 and C7 have two each.

Numbering is crucial. Renumber for different starting points. Also, the ‘edges’ of the graph are annotated to speed up the matching. Matching is the ‘slow’ part: subgraph isomorphism belongs to the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures.
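The matching step can be illustrated with a plain backtracking search over adjacency matrices. This is a minimal Python sketch and not Ullmann's published procedure, which adds a refinement step that prunes candidate atoms early. The function names and the ring numbering chosen for cyclohexanone are assumptions made for the example.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of bonded pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b in bonds:
        matrix[a][b] = matrix[b][a] = 1
    return matrix

def find_embedding(query_adj, query_labels, target_adj, target_labels):
    """Return one mapping (query atom -> target atom) as a list, or None."""
    n_query, n_target = len(query_adj), len(target_adj)

    def extend(mapping):
        q = len(mapping)                    # next query atom to place
        if q == n_query:
            return mapping
        for t in range(n_target):
            if t in mapping or query_labels[q] != target_labels[t]:
                continue
            # every query bond to an already placed atom must exist in the target
            if all(target_adj[t][mapping[p]] for p in range(q) if query_adj[q][p]):
                found = extend(mapping + [t])
                if found is not None:
                    return found
        return None

    return extend([])

# Query fragment: O1, C2, C3 all bonded to C4 (indices 0..3).
query = adjacency_matrix(4, [(0, 3), (1, 3), (2, 3)])
# Cyclohexanone with an assumed ring order C4-C2-C5-C7-C6-C3 (indices 0..6).
ketone = adjacency_matrix(7, [(0, 3), (3, 1), (3, 2), (1, 4), (2, 5), (4, 6), (5, 6)])
match = find_embedding(query, ["O", "C", "C", "C"], ketone, ["O"] + ["C"] * 6)
print(match)
```

The worst-case cost of such a search grows very quickly with molecule size, which is why the bitstring screen is applied first.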


Next lecture

• How can we use this type of information to solve our chemistry problems?

– Patents/Markush structures

– Molecular similarity

– Molecular properties

– 3D searching

– Virtual screening

– Structure-property/activity relationships


Storing chemical diagrams on computers

• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.

• The spatial position of the nodes, the length of the edges and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).

• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.

• To solve this we could, e.g., introduce the concept of an aromatic bond.
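The aromatic-bond idea can be illustrated with the two Kekulé forms of benzene. This is a minimal Python sketch using a hypothetical bond-list representation ((atom, atom), order); it assumes the ring is already known to be aromatic, whereas real toolkits have to perceive aromaticity themselves.

```python
def bond_set(bonds):
    """Order-independent view of a bond list: {(atom pair, bond order), ...}."""
    return {(frozenset(pair), order) for pair, order in bonds}

def mark_aromatic(bonds):
    """Replace every bond order by the single label 'aromatic'."""
    return [(pair, "aromatic") for pair, _ in bonds]

ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
# Alternating double/single bonds, starting with a double or a single bond.
kekule_a = [(pair, 2 if i % 2 == 0 else 1) for i, pair in enumerate(ring)]
kekule_b = [(pair, 1 if i % 2 == 0 else 2) for i, pair in enumerate(ring)]

same_as_simple_graph = bond_set(kekule_a) == bond_set(kekule_b)
same_with_aromatic_bonds = bond_set(mark_aromatic(kekule_a)) == bond_set(mark_aromatic(kekule_b))
print(same_as_simple_graph, same_with_aromatic_bonds)
```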

An example of storing the molecular graph: SD file format

MDL SD (structure data) files store the information about each molecule in the following format:

Header block: describes the molecule, e.g. its name

Connection table: defines the molecular structure (atoms and bonds)

Data block (optional): properties, e.g. volume

Terminator line: a line containing four dollar signs ($$$$), indicating the end of information on this molecule

This is probably the most common format for storing small molecules; SD files are widely used in software to store many molecular structures in databases.

A “format” in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).

The Connection Table: describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

e.g. PDB protein file format:

Connect 3 2 5 4

e.g. SD file format (reduced):

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.

• It is often the case that we don’t know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Line 1: molecule name. Line 2: information on this molecule. Line 3: comment (e.g. the date is used here).

“Counts line”: has the number of atoms and bonds as a minimum.

“Atom block”: has the xyz coordinates of the atom and the element as a minimum.

“Bond block”: one line for each bond (from atom, to atom, bond type).

“M  END” and “$$$$” identify the end of this molecule.
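Reading such a record back can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name parse_molfile and the two-atom test record are made up for the example, and fields are split on whitespace, whereas the real format is fixed-width (the three-character count fields run together for molecules with 100 or more atoms).

```python
def parse_molfile(text):
    """Parse the name, atom block and bond block of one V2000 record."""
    lines = text.splitlines()
    counts = lines[3].split()                 # the counts line is line 4
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        a, b, order = (int(v) for v in line.split()[:3])
        bonds.append((a, b, order))
    return lines[0].strip(), atoms, bonds

# Hypothetical record: the two carbons of ethene joined by a double bond.
record = "\n".join([
    "ethene (heavy atoms only)",
    "  hypothetical-program",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0",
    "    1.3300    0.0000    0.0000 C   0  0",
    "  1  2  2  0",
    "M  END",
    "$$$$",
])
name, atoms, bonds = parse_molfile(record)
print(name, len(atoms), bonds)
```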

Alanine SD file

Bond length is 1.53 Å

Atom block fields: x, y, z, symbol, mass difference, charge, stereo, H-count, …

Charge codes: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3

Molecule is chiral
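The charge codes listed above follow a simple pattern (charge = 4 minus code, apart from the two special values), which a small Python helper can express. The function name formal_charge is hypothetical.

```python
def formal_charge(code):
    """Translate an atom-block charge code into a formal charge."""
    if code in (0, 4):       # 0 = uncharged (or other), 4 = doublet radical
        return 0
    if 1 <= code <= 7:
        return 4 - code      # 1 -> +3, 2 -> +2, 3 -> +1, 5 -> -1, 6 -> -2, 7 -> -3
    raise ValueError(f"unknown charge code: {code}")

print([formal_charge(code) for code in range(8)])
```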

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can’t always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

Which way round should this ring go?
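One common way to impute bonds from positions alone is a distance test: two atoms are taken as bonded if they are closer than the sum of their covalent radii plus a tolerance. This is a minimal Python sketch; the radii are approximate illustrative values, the 0.4 Å tolerance and the function name are assumptions, and the coordinates are the five valine atoms listed in the PDB example of these notes.

```python
import math

# Approximate single-bond covalent radii in angstroms (illustrative values).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}

def impute_bonds(atoms, tolerance=0.4):
    """Guess bonds from positions: bonded if distance <= r_i + r_j + tolerance."""
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (sym_i, pos_i), (sym_j, pos_j) = atoms[i], atoms[j]
            cutoff = COVALENT_RADIUS[sym_i] + COVALENT_RADIUS[sym_j] + tolerance
            if math.dist(pos_i, pos_j) <= cutoff:
                bonds.append((i, j))
    return bonds

valine = [
    ("N", (3.036, -1.035, -3.538)),   # 0: N
    ("C", (4.283, -0.343, -3.015)),   # 1: CA
    ("C", (4.565, 1.067, -3.593)),    # 2: C
    ("O", (4.653, 1.287, -4.853)),    # 3: O
    ("C", (5.510, -1.245, -3.141)),   # 4: CB
]
bonds = impute_bonds(valine)
print(bonds)
```

A distance test says nothing about bond order or hydrogen positions, which is one reason imputed structures need checking.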

Example Protein Data Bank file (pdb)

(bonding is stored as an adjacency list in CONECT records)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

CRYST1: space group and unit cell dimensions

ATOM: protein atoms, including xyz, occupancy and temperature factor

HETATM: non-protein atoms, including xyz, occupancy and temperature factor

CONECT: bonds

New format: mmCIF
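Because the PDB format is column-based, a reader slices each record at fixed positions rather than splitting on spaces. This is a minimal Python sketch using the standard ATOM/HETATM column layout; the function name and the dictionary keys are choices made for this example.

```python
def parse_atom_record(line):
    """Parse one ATOM/HETATM record using the fixed PDB column positions."""
    return {
        "record": line[0:6].strip(),          # columns 1-6
        "serial": int(line[6:11]),            # columns 7-11
        "name": line[12:16].strip(),          # columns 13-16
        "residue": line[17:20].strip(),       # columns 18-20
        "chain": line[21],                    # column 22
        "residue_number": int(line[22:26]),   # columns 23-26
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),      # columns 55-60
        "b_factor": float(line[60:66]),       # columns 61-66 (temperature factor)
    }

line = "HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34"
atom = parse_atom_record(line)
print(atom["name"], atom["xyz"])
```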

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.

This uses internal coordinates

• Also called a Z-matrix

– Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)

– Early form of specification of a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation

– A Z-matrix uses the following geometric descriptions to describe molecules:

Bond length

Bond angle

Torsion angle

Out-of-plane bending, e.g. a carbonyl

Non-bonded distance

[Figure: internal coordinates of a small molecule, showing dummy atom (du) positions]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
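Converting the water Z-matrix to Cartesian coordinates shows how the two internal coordinates fix the whole geometry. This is a minimal Python sketch for this three-atom case only; the placement convention (oxygen at the origin, first hydrogen on the z axis, second hydrogen in the xz plane) and the function name are assumptions made for the example.

```python
import math

def water_from_zmatrix(l1, a1_deg):
    """Cartesian coordinates (O, H, H) from bond length l1 and angle a1."""
    a1 = math.radians(a1_deg)
    oxygen = (0.0, 0.0, 0.0)
    hydrogen_1 = (0.0, 0.0, l1)                                # along the z axis
    hydrogen_2 = (l1 * math.sin(a1), 0.0, l1 * math.cos(a1))   # rotated by a1 in the xz plane
    return [oxygen, hydrogen_1, hydrogen_2]

o, h1, h2 = water_from_zmatrix(0.96, 104.0)
print(round(math.dist(h1, h2), 3))
```

Changing a1 alone opens or closes the H-O-H angle while both bond lengths stay fixed, which is what makes internal coordinates convenient for driving a reaction coordinate.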

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

Z-matrix: ‘improper’ torsion angles

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a computer language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.

• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.

• Chemical Markup Language (CML) is being developed specifically for chemistry.

• In the future much more information will be stored with molecules, allowing greater re-use of data.

• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928

Ethanol

<CML>

- Can be parsed

- Can contain reactions, properties etc.

- Can contain relationships to other molecules and also concepts
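How metadata travels with the data can be sketched with Python's standard XML library. The element and attribute names below are loosely modelled on CML but are illustrative assumptions, not validated against the CML schema, and the measurement date is a hypothetical value.

```python
import xml.etree.ElementTree as ET

# Build a small molecule description for ethanol (heavy atoms only).
molecule = ET.Element("molecule", id="ethanol")
atoms = ET.SubElement(molecule, "atomArray")
for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
    ET.SubElement(atoms, "atom", id=atom_id, elementType=element)
bonds = ET.SubElement(molecule, "bondArray")
for pair in ["a1 a2", "a2 a3"]:
    ET.SubElement(bonds, "bond", atomRefs2=pair, order="1")

# Metadata: the value carries its units and the (hypothetical) date it was measured.
prop = ET.SubElement(molecule, "property", title="boiling point",
                     units="K", date="2018-01-01")
prop.text = "351.5"

xml_text = ET.tostring(molecule, encoding="unicode")

# Any XML parser can read the record back without format-specific code.
parsed = ET.fromstring(xml_text)
elements = [atom.get("elementType") for atom in parsed.iter("atom")]
units = parsed.find("property").get("units")
print(elements, units)
```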

InChI

InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES

C(C)O

Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not

really flat because of thermal fluctuations So we represent 3D

molecules by including their coordinates or their internal coordinates

bull Obtaining the 3-dimensional coordinates can involve experiment (x-

ray electron or neutron diffraction eg the Cambridge

Crystallographic Database or the Protein Databank ndash PDB)

bull From these can be obtained atom positions bonds coordinates etc

bull There are a number of 3-D construction methods available such as

Corina or Concord (put in a SMILES and get a 3-D molecule) which

use rules derived from experiment

bull Molecules can also be constructed in 2D and subjected to molecular

mechanics or Quantum Mechanics calculations to obtain 3D structures

bull Conformation still remains to be deduced There are many methods

that deduce conformations usually involving torsional angle rotation

to scan the conformational space (like a Ramachandran plot ndash but often

in many more torsional angle dimensions)

Molecules in 3D - uses

bull More accurate calculation of molecular

properties

bull Comparison of the shapes (conformations)

of molecules

bull Comparison of the dynamics of molecules

bull Calculation of bulk properties

bull Simulation of chemical reactionshelliphellip

The dynamics of a molecule can be computed and stored as a series of

frames of the coordinates and bonds of the structure (much like a cartoon)

Molecular Dynamics is the most popular method for larger systems Here is

an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a

series of snapshots of SD files concatenated together to make a movie just

like a film strip

httpsenwikipediaorgwikiCHARMM

httpambermdorg

httpwwwksuiuceduResearchnamd

httpwwwksuiuceduResearchvmd

We can simulate an entire biological receptor

and test new molecules to see how they might

work

bull Here is a simulation of an important new cancer target

called lsquoApelinrsquo with water and the membrane We used

these to design new molecules that block this receptor

Apelinrsquos role in Cancer

lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an

antagonist (a molecule that blocks the receptor) stop tumour growth

bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma

growth Brain 2017 awx253 httpsdoiorg101093brainawx253

bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et

al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372

So there are very many file formats

ndash which could be a real painbut

they can be convertedalc -- Alchemy file prep -- Amber PREP file

bs -- Ball amp Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical Resource

Kit 2D file

crk3d -- CRK3D Chemical Resource

Kit 3D file

box -- Dock 35 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI BiosymInsight II CAR

file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball amp Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table

file

cht -- Chemtool file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical

Resource Kit 2D file

crk3d -- CRK3D Chemical

Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 35 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats

-the bane of our lives

Interconnection program ndash Babel

Recent IUPAC moves towards a lsquostandardrsquo format however

In the near future there are likely to be many competing

requirements for file content

Molecules on computers ndash things to look out for

since what is stored is actually quite crude

For example-

Stereochemistry may be relative and not absolute or even incorrect

In proteins only the HEAVY atom positions are observed (sometimes at low resolution)

so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen

and oxygen get confused

Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved

correctly (problems with storing bonding)

Tautomers can be incorrect ndash check they look reasonable

Mesomers can be incorrect (double bonds)

Polymers (which are mixtures of MWt and topology) are very difficult to store

Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics

halogen-aromatic bonds etc may be inferred but not observed in the file)

2 Finding molecules

A typical problem involves finding the

lsquorightrsquo molecules

All the molecules

synthesisable

1060 Molecules with

relevance to the problem

1010 (that we can generate)

Molecules we can

search efficiently 107

Molecules we can

make and test 103

Serious difference in numbers we can consider a few hundred

molecules in our heads computers can evaluate millions

1 Finding molecules using Substructure searching

One of the most useful lsquopropertiesrsquo of a molecule is its molecular graph

Many software systems allow searching for whole molecules or

fragments of molecules or even lsquosimilarrsquo molecules

For a specific molecule this means specifying the required search pattern to get an

exact match

An example might be Search for an exact match for dimethylaniline Here wersquove

used the PubChem database which contains compounds and pharmacological

screening data on 92 million unique molecules httppubchemncbinlmnihgov

A more complex example ndash find a molecule

containing our query as part of a larger molecule

Results ndash takes a few seconds to search all

92 million molecules

How does this workThe first thing to notice is that the searching is so fast that clearly all the database

is not being searched

The first part of the process is to pre-compute pointers to only the sets of

compounds in the database we are interested in

This is called Hashing

Speeding up compound matching using hash codes

bull The simplest Hash code registers the presence or absence

of fragments in a molecule eg

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0

Contains PContains NH2

Doesnrsquot contain F

X

So a quick match of the bitstring of the fragment searched over the

(pre-computed) bitstrings in the database will eliminate most of the

structures

Next step ndash Substructure (pattern) matching

Exact match

Match the canonical SMILES InCHI or full structure

Beware salts etc may need to strip out counterions

Substructure Search

Supposing we wish to find part of a molecule in a database

To do this we have to do a substructure search

A substructure is a sub-graph of the molecular graph This is

an example of substructural fragments of a larger molecule

shown in different colours You could eg look for all the

molecules with the phenol fragment

There are two steps commonly used to do this firstly we

have to number all the molecules consistently (the Morgan

algorithm) for example to compare two structures to see if

they are identical then we have to match the query

substructure to each substructure in the molecule (the

Ullman algorithm)

Substructures in molecules

bull lsquoSubgraphsrsquo can be identified in a structure graph

corresponding to fragments of the whole structure

eg Rings substituents etc

ndash ndashOH

ndash ndashNH2

ndash ndashCOOH

ndash phenyl

bull this can be done by

tracing appropriate

paths in the graph

bull subgraphs may overlap

OH

CH2

CHNH2

OH

O

Matching the query structure to the database

Two algorithms are commonly used

The Morgan algorithm which numbers the molecule uniquely and

The Ullman algorithm which matches the fragments

Note ndash a number

of lsquosubstructuralrsquo fragments can

be matched here (6 in all)

Gund P Ann Rep in Med Chem 14299 1979

Sheridan RP et al J Chem Inf Comp Sci 29255 1989

Brint AT Willett P J Mol Graphics 549 1987

JR Ullman An Algorithm for Subgraph Isomorphism Journal of the Association of Computing

Machinery (JACM) Vol 23 pp 31-42 1976 httpportalacmorgcitationcfmid=321925

Morgan HL The generation of a unique machine description for chemical structures Journal of

Chemical Documentation 5 1965 107-112

Read these for more details-

The basic concept is that molecules are considered as

Graphs

Atoms are lsquonodesrsquo

Bonds are lsquoedgesrsquo

A molecule is an example of a labelled graph nodes can

have labels such as atomic number The nodes are

connected by the edges (and these can also be labelled

eg double bond)

The Morgan algorithm iteratively computes an integer label

for each node in a structure (atoms) For a computer the key

point is that the numbering is consistent (unlike humans) and

fast The Morgan algorithm works as follows

(i) Assign an integer label i to each node considering its

atomic number degree (number of substituents) and types of

bonds

(ii) Update for each label by adding the labels of the adjacent

nodes (connected atoms) and

(iii) Repeat (ii) until the number of different labels does not

increase Then order the nodes by the value of the labels

H L Morgan The generation of a unique machine description for chemical structures -

A technique developed at chemical abstracts service J Chemical Documentation

5107ndash113 1965

The first step - numbering a molecule uniquely

A simple Morgan algorithm

• The update is continued until the number of equivalence classes is unchanged.

• Numbering starts from the atom with the highest value, then takes its highest-valued neighbour as 2, and so on. Like an onion: concentric circles out from the highest-numbered atom.

• Rules are used to break ties, such as C before N, double bond before single, etc.

• Some equivalent atoms are arbitrarily numbered.
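The three steps above can be sketched in a few lines of Python. This is a simplified illustration, not the full published procedure: it assumes the initial label is just the degree (atomic number and bond types are ignored), it leaves out the tie-breaking rules, and the function name and the 2-methylbutane example are made up for this sketch.

```python
def morgan_values(neighbours):
    """Return connectivity values for a graph given as {atom: [adjacent atoms]}."""
    # Step (i), simplified: start from the degree only.
    values = {atom: len(adjacent) for atom, adjacent in neighbours.items()}
    n_classes = len(set(values.values()))
    while True:
        # Step (ii): each new value is the sum of the neighbours' current values.
        updated = {
            atom: sum(values[n] for n in adjacent)
            for atom, adjacent in neighbours.items()
        }
        n_updated = len(set(updated.values()))
        # Step (iii): stop once the number of distinct values no longer increases.
        if n_updated <= n_classes:
            return values
        values, n_classes = updated, n_updated


# Carbon skeleton of 2-methylbutane: C1-C2(-C5)-C3-C4 (hypothetical numbering).
isopentane = {1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
values = morgan_values(isopentane)
# Highest value first; equal values keep their input order (no tie-break rules here).
ranking = sorted(isopentane, key=lambda atom: values[atom], reverse=True)
```

The branch point C2 gets the highest value and would be numbered first; atoms that end up with equal values are where the tie-breaking rules (or an arbitrary choice) come in.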

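Urea adjacency matrix (atom labels as on the slide; a 1 marks a bond)

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

Cyclohexanone adjacency matrix

[7 x 7 matrix over atoms O1, C2, C3, C4, C5, C6, C7: the O1 row has one entry, the C4 row has three, and each of the other carbon rows has two]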
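As a small illustration of the adjacency matrices above, the sketch below builds the urea matrix from a list of bonds. The function name is hypothetical; the numbering follows the slide, where atom 4 is the central carbon.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 matrix from 1-based (atom, atom) bond pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b in bonds:
        matrix[a - 1][b - 1] = 1
        matrix[b - 1][a - 1] = 1  # bonds have no direction, so mirror the entry
    return matrix


# Urea skeleton: atoms 1, 2 and 3 are each bonded to the central atom 4.
urea = adjacency_matrix(4, [(1, 4), (2, 4), (3, 4)])
degrees = [sum(row) for row in urea]  # number of neighbours of each atom
```

The row sums give the degree of each atom, which is also the starting label used in the simple Morgan scheme.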

Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching: subgraph isomorphism belongs to the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures.
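To make substructure detection concrete, here is a minimal sketch of subgraph matching by plain backtracking. It is not Ullmann's algorithm itself, which adds a refinement step that prunes candidate atoms much earlier; the function name, the data layout and the ring numbering of cyclohexanone are assumptions made for this example.

```python
def subgraph_matches(q_labels, q_adj, t_labels, t_adj):
    """Yield {query atom: target atom} mappings that embed the query in the target."""
    q_atoms = list(q_labels)

    def compatible(q, t, mapping):
        # Same element, and the target atom has at least as many neighbours.
        if q_labels[q] != t_labels[t] or len(q_adj[q]) > len(t_adj[t]):
            return False
        # Every already-mapped query neighbour must map onto a neighbour of t.
        return all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping)

    def extend(i, mapping):
        if i == len(q_atoms):
            yield dict(mapping)
            return
        q = q_atoms[i]
        for t in t_labels:
            if t not in mapping.values() and compatible(q, t, mapping):
                mapping[q] = t
                yield from extend(i + 1, mapping)
                del mapping[q]  # backtrack and try the next candidate

    yield from extend(0, {})


# Query: a C-O pair (bond orders are ignored in this sketch).
query_labels = {1: "C", 2: "O"}
query_adj = {1: {2}, 2: {1}}

# Target: cyclohexanone, O1 on C4; the ring order is an assumed numbering.
target_labels = {1: "O", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C", 7: "C"}
target_adj = {1: {4}, 2: {3, 7}, 3: {2, 4}, 4: {1, 3, 5},
              5: {4, 6}, 6: {5, 7}, 7: {2, 6}}

matches = list(subgraph_matches(query_labels, query_adj, target_labels, target_adj))
```

The search tries every target atom for every query atom in the worst case, which is why pre-screening and good atom numbering matter so much in practice.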


Next lecture

• How can we use this type of information to solve our chemistry problems?

- Patents/Markush structures

- Molecular similarity

- Molecular properties

- 3D searching

- Virtual screening

- Structure-property/activity relationships


An example of storing the molecular graph: the SD file format

MDL SD (structure-data) files store the information about each molecule in the following order:

Header block: describes the molecule, e.g. its name

Connection table: defines the molecular structure (atoms and bonds)

Data block (optional): properties, e.g. volume

Terminator line: a line containing four dollar signs ($$$$), indicating the end of the information on this molecule

This is probably the most common format for storing small molecules. SD files are widely used in software to store many molecular structures in databases.

A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).

The Connection Table - describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

Example entries (reduced):

e.g. PDB protein file format

Connect 3 2 5 4

e.g. SD file format

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2
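A connection table maps directly onto a graph data structure. The sketch below reads the reduced "Bond i j order" lines shown above into a dictionary of neighbours; the function name and the choice of a nested dictionary are assumptions for illustration only.

```python
from collections import defaultdict


def graph_from_bonds(records):
    """Turn 'Bond i j order' lines into {atom: {neighbour: bond order}}."""
    graph = defaultdict(dict)
    for record in records:
        _, first, second, order = record.split()
        a, b = int(first), int(second)
        # Store the bond in both directions so either atom can be looked up.
        graph[a][b] = int(order)
        graph[b][a] = int(order)
    return dict(graph)


graph = graph_from_bonds(["Bond 3 5 1", "Bond 3 4 1", "Bond 5 6 2"])
```

Here the nodes carry only their numbers; element symbols and other labels would be stored alongside in the same way.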

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.

• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

    benzene
      ACD/Labs0812062058
    August 2013
      6  6  0  0  0  0  0  0  0  0  1 V2000
        1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
       -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
      2  1  1  0  0  0  0
      3  1  2  0  0  0  0
      4  2  2  0  0  0  0
      5  3  1  0  0  0  0
      6  4  1  0  0  0  0
      6  5  2  0  0  0  0
    M  END
    $$$$

Line 1: molecule name

Line 2: information on this molecule

Line 3: comment (e.g. a date is used here)

"Counts line": has the number of atoms and bonds as a minimum

"Atom block": has the xyz coordinates of the atom and the element as a minimum

"Bond block": one line for each bond: from atom, to atom, bond type

"M  END" and "$$$$": these identify the end of this molecule
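The benzene record above can be read with a few lines of code. The sketch below is a simplification: it splits each line on whitespace, whereas the real format is defined by fixed column positions, so it would fail on files where fields run together (for example, 100 or more atoms). The function name is hypothetical.

```python
MOLFILE = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def read_record(text):
    """Return (name, atoms, bonds) from one SD record, splitting on whitespace."""
    lines = text.splitlines()
    name = lines[0].strip()
    counts = lines[3].split()              # the counts line follows three header lines
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:      # atom block: x, y, z, element, ...
        x, y, z, element = line.split()[:4]
        atoms.append((element, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:  # bond block
        first, second, bond_type = line.split()[:3]
        bonds.append((int(first), int(second), int(bond_type)))
    return name, atoms, bonds


name, atoms, bonds = read_record(MOLFILE)
```

The counts line tells the reader how many atom and bond lines to expect, which is why the order of the blocks has to be exact.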

Alanine SD file

Bond length is 1.53 Å

Atom line fields: x, y, z, symbol, mass difference, charge, stereo, H-count, ...

Charge codes:

0 = uncharged or value other than these

1 = +3, 2 = +2, 3 = +1

4 = doublet radical

5 = -1, 6 = -2, 7 = -3

Molecule is chiral
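The charge field is a code, not the charge itself, so a reader has to translate it. A minimal sketch of that lookup, using the table above (the function name and the tuple return value are choices made for this example):

```python
# Charge code in the atom line -> formal charge, as listed above.
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 5: -1, 6: -2, 7: -3}


def decode_charge(code):
    """Return (formal charge, is_doublet_radical) for an atom-line charge code."""
    if code == 4:
        return 0, True  # code 4 marks a doublet radical rather than a charge
    return CHARGE_CODES[code], False
```

For example, `decode_charge(3)` gives a charge of +1, and `decode_charge(5)` gives -1.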

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

[Figure annotation: Which way round should this ring go?]
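One common way to impute bonds from positions alone is a distance test: two atoms are taken as bonded if they are closer than the sum of their covalent radii plus a margin. The sketch below assumes approximate covalent radii and a 0.4 Å margin; both are illustrative choices, not values from these lectures. The coordinates are the sulfate ion from the PDB example in this handout.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (assumed values for this sketch).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.4  # assumed margin in angstroms


def impute_bonds(atoms):
    """Return index pairs (i, j) of atoms close enough to be called bonded."""
    bonds = []
    for (i, (el_i, pos_i)), (j, (el_j, pos_j)) in combinations(enumerate(atoms), 2):
        limit = COVALENT_RADIUS[el_i] + COVALENT_RADIUS[el_j] + TOLERANCE
        if math.dist(pos_i, pos_j) <= limit:
            bonds.append((i, j))
    return bonds


sulfate = [
    ("S", (25.993, -1.362, 5.893)),
    ("O", (25.504, -1.159, 4.508)),
    ("O", (27.282, -0.770, 6.155)),
    ("O", (24.953, -1.035, 6.850)),
]
bonds = impute_bonds(sulfate)
```

Each oxygen lies about 1.45 to 1.48 Å from the sulfur and is assigned an S-O bond, while the oxygens are about 2.4 Å apart and are left unbonded. The test says nothing about bond order, which still has to come from chemical knowledge.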

Example Protein Data Bank file (pdb)

(bonds are listed atom by atom in CONECT records, i.e. as an adjacency list)

    HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
    TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
    EXPDTA    X-RAY DIFFRACTION
    REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
    REMARK 200  TEMPERATURE           (KELVIN) : 113
    REMARK 200  PH                             : 6.9
    SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
    SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
    CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
    ORIGX1      1.000000  0.000000  0.000000        0.00000
    ORIGX2      0.000000  1.000000  0.000000        0.00000
    ORIGX3      0.000000  0.000000  1.000000        0.00000
    SCALE1      0.018319  0.000000  0.000000        0.00000
    SCALE2      0.000000  0.018147  0.000000        0.00000
    SCALE3      0.000000  0.000000  0.015426        0.00000
    ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
    ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
    ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
    ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
    ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
    HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
    HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
    HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
    HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
    CONECT 1510 1511 1512 1513 1514
    CONECT 1511 1510
    CONECT 1512 1510
    MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
    END

CRYST1: space group and unit cell dimensions

ATOM: protein atoms, including xyz, occupancy and temperature factor

HETATM: non-protein atoms, including xyz, occupancy and temperature factor

CONECT: bonds

New format: mmCIF
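PDB records are a good example of a format where position on the line matters. The sketch below reads an ATOM or HETATM line by slicing fixed columns; the function name and dictionary keys are hypothetical, and only the fields discussed above are extracted.

```python
def parse_atom_record(line):
    """Slice one ATOM/HETATM line into its fields by fixed column position."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),  # the temperature factor
    }


protein_atom = parse_atom_record(
    "ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28"
)
ligand_atom = parse_atom_record(
    "HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34"
)
```

Because the fields are found by column, a single missing or extra space shifts every later value, which is exactly how a reader "screws up" on a badly written file.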

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.

This uses internal coordinates

• Also called a Z-matrix

- Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)

- Early form of specification of a starting geometry for molecules: people sometimes used graph paper, drew the molecule and read off a starting set of coordinates before optimisation

- A Z-matrix uses the following geometric descriptions to describe molecules:

Bond length

Bond angle

Torsion angle

Out-of-plane bending, e.g. a carbonyl

Non-bonded distance

[Figure: internal coordinates of example molecules, with dummy atom (du) positions marked]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here, after a blank line, rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

    O
    H 1 l1
    H 1 l1 2 a1

    l1 0.96
    a1 104.0
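To see how internal coordinates turn back into Cartesian ones, here is a minimal sketch for the three-atom water case only. The placement convention (first atom at the origin, second along the z axis, third in the xz plane) is a common choice but an assumption here, and a general converter would also need to handle dihedral angles for the fourth atom onwards.

```python
import math


def water_cartesian(l1, a1_degrees):
    """Place O, H, H from one bond length (angstroms) and one bond angle (degrees)."""
    a1 = math.radians(a1_degrees)
    oxygen = (0.0, 0.0, 0.0)                    # atom 1 at the origin
    hydrogen_1 = (0.0, 0.0, l1)                 # atom 2 along the z axis
    hydrogen_2 = (l1 * math.sin(a1), 0.0, l1 * math.cos(a1))  # atom 3 in the xz plane
    return oxygen, hydrogen_1, hydrogen_2


oxygen, hydrogen_1, hydrogen_2 = water_cartesian(0.96, 104.0)
h_h_distance = math.dist(hydrogen_1, hydrogen_2)
```

Changing `a1` moves only the second hydrogen and keeps both O-H lengths fixed, which is the convenience of internal coordinates when driving a single geometric variable.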

Methanol Z-matrix

    C
    O 1 l1
    H 1 l2 2 a1
    H 1 l3 2 a2 3 da1
    H 1 l3 2 a2 3 -da1
    H 2 l4 1 a3 3 180.0

    l1 1.42
    l2 1.09
    l3 1.09
    l4 1.09
    l5 1.09
    l6 1.0
    a1 109.0
    a2 110.0
    a3 108.0
    a4 110.0
    a5 110.0
    da1 60.0
    da2 120.0
    da3 60.0

[Figure annotations: Z-matrix; 'improper' torsion angles]

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.

• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.

• Chemical Markup Language (CML) is being developed specifically for chemistry.

• In the future much more information will be stored with molecules, allowing greater re-use of data.

• See: Chemical Markup, XML, and the World-Wide Web. Part I. Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928.

Ethanol

<CML>

- Can be parsed

- Can contain reactions, properties, etc.

- Can contain relationships to other molecules and also concepts

InChI

InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES

C(C)O
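"Can be parsed" means that a standard XML library can read the file without any chemistry-specific code. The sketch below parses a small CML-style description of ethanol (heavy atoms only) with Python's standard library. The snippet is written for this example and leaves out the XML namespace that real CML files usually declare, so treat it as an illustration rather than a complete CML document.

```python
import xml.etree.ElementTree as ET

CML_TEXT = """<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>"""

root = ET.fromstring(CML_TEXT)
# Atoms are nodes with labels; bonds refer to atoms by their ids.
elements = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}
bonds = [
    tuple(bond.get("atomRefs2").split()) + (bond.get("order"),)
    for bond in root.iter("bond")
]
```

Because every value is named by its tag or attribute, extra metadata (units, dates, links to other data) can be added without breaking programs that only read the atoms and bonds.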

Molecules and 3 dimensions

• Molecules are, of course, not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.

• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).

• From these can be obtained atom positions, bonds, coordinates, etc.

• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.

• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.

• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).

Molecules in 3D - uses

• More accurate calculation of molecular properties

• Comparison of the shapes (conformations) of molecules

• Comparison of the dynamics of molecules

• Calculation of bulk properties

• Simulation of chemical reactions...

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM

http://ambermd.org

http://www.ks.uiuc.edu/Research/namd

http://www.ks.uiuc.edu/Research/vmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these simulations to design new molecules that block this receptor.

Apelin's role in cancer

'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253

• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats, which could be a real pain... but they can be converted

alc -- Alchemy file                             prep -- Amber PREP file
bs -- Ball & Stick file                         caccrt -- Cacao Cartesian file
ccc -- CCC file                                 c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                        dmol -- DMol3 Coordinates file
feat -- Feature file                            gam -- GAMESS Output file
gamout -- GAMESS Output file                    gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                       qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                       jout -- Jaguar Output file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                        sd -- MDL Isis SDF file
mdl -- MDL Molfile file                         mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                  mopout -- MOPAC Output file
mmads -- MMADS file                             mpqc -- MPQC file
bgf -- MSI BGF file                             nwo -- NWChem Output file
pdb -- PDB file                                 ent -- PDB file
pqs -- PQS file                                 qcout -- Q-Chem Output file
res -- ShelX file                               ins -- ShelX file
smi -- SMILES file                              mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                      vmol -- ViewMol file

alc -- Alchemy file                             bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                  cacint -- Cacao Internal file
cache -- CAChe MolStruct file                   c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 ct -- ChemDraw Connection Table file
cht -- Chemtool file                            cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                           box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                  feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                 gamin -- GAMESS Input file
inp -- GAMESS Input file                        gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                      gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                      gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                       jin -- Jaguar Input file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                         mdl -- MDL Molfile file
mol -- MDL Molfile file                         mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                             bgf -- MSI BGF file
csr -- MSI Quanta CSR file                      nw -- NWChem Input file
pdb -- PDB file                                 ent -- PDB file
pov -- POV-Ray Output file                      pqs -- PQS file
report -- Report file                           qcin -- Q-Chem Input file
smi -- SMILES file                              fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                         txyz -- Tinker XYZ file
txt -- Titles file                              unixyz -- UniChem XYZ file
vmol -- ViewMol file                            xed -- XED file
xyz -- XYZ file                                 zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers - things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect - check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of MWt and topology) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but are not observed in the file).

2. Finding molecules

A typical problem involves finding the 'right' molecules

All the molecules synthesisable: 10^60

Molecules with relevance to the problem (that we can generate): 10^10

Molecules we can search efficiently: 10^7

Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1. Finding molecules using substructure searching

One of the most useful 'properties' of a molecule is its molecular graph.

Many software systems allow searching for whole molecules, or fragments of molecules, or even 'similar' molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example: find a molecule containing our query as part of a larger molecule.

Results: it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

    Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
    Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

Each bit stands for one fragment: a 1 means the fragment is present (e.g. "contains P", "contains NH2") and a 0 means it is absent (e.g. "doesn't contain F").

So a quick match of the bitstring of the fragment searched for against the (pre-computed) bitstrings in the database will eliminate most of the structures.
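The bitstring screen can be sketched with ordinary integer bit operations. Everything named here is hypothetical: the six-fragment dictionary, the function names and the three-molecule "database" are invented for illustration, and the fragment sets are supplied by hand rather than perceived from structures.

```python
# Hypothetical fragment dictionary: one bit position per fragment.
FRAGMENTS = ["OH", "NH2", "COOH", "phenyl", "P", "F"]


def fingerprint(fragments_present):
    """Set one bit for each dictionary fragment found in the molecule."""
    bits = 0
    for position, fragment in enumerate(FRAGMENTS):
        if fragment in fragments_present:
            bits |= 1 << position
    return bits


def may_contain(query_bits, molecule_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (query_bits & molecule_bits) == query_bits


# Pre-computed once, when the database is built.
database = {
    "tyrosine": fingerprint({"OH", "NH2", "COOH", "phenyl"}),
    "phenol": fingerprint({"OH", "phenyl"}),
    "aniline": fingerprint({"NH2", "phenyl"}),
}

query = fingerprint({"OH", "phenyl"})
candidates = [name for name, bits in database.items() if may_contain(query, bits)]
```

The screen can only rule molecules out: a candidate that passes has the right fragments somewhere, but they may not be connected as in the query, so the survivors still go on to the full atom-by-atom match.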

Next step - substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware: salts etc. You may need to strip out counterions.

Substructure search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a subgraph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).


Page 18: Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

eg PDB protein file format

eg SD file format

1

1

1

1

1

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

The Connection Table ndash describing bonds

bull Defines the bonding arrangement of a molecule Treats the molecule

as a labelled graph

Connect 3 2 5 4

Bond 3 5 1

Bond 3 4 1

Bond 5 6 2

(reduced)

eg PDB protein file format

eg SD file format

The Connection Table ndash describing bonds

Why store molecules in 2Dbull Quite often we only need the

chemical diagram eg to find a

molecule that matches a

chemical structure search

bull It is often the case that we donrsquot

know the conformation (shape)

of a molecule ndash so storing it in

3D would be pointless Look at

the changes in conformation in

this molecule at room

temperature

Example SD file - benzene

bull benzene

bull ACDLabs0812062058

bull August 2013

bull 6 6 0 0 0 0 0 0 0 0 1 V2000

bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 2 1 1 0 0 0 0

bull 3 1 2 0 0 0 0

bull 4 2 2 0 0 0 0

bull 5 3 1 0 0 0 0

bull 6 4 1 0 0 0 0

bull 6 5 2 0 0 0 0

bull M END

bull $$$$

Molecule nameInformation on this molecule

Comment (eg date is used here)

ldquocounts linerdquo has the

number of atoms and

bonds as a minimum

ldquoatom blockrdquo has xyz

coordinates of the atom and

element as a minimum

ldquobond blockrdquo 1-line for each bond

from atom - to atom ndash bond type

These identify the end of this molecule

Alanine SD file

Bond length is 153A

xyz symbol mass diff charge stereo h-counthellip

Charge

0 = uncharged or value

other than

these 1 = +3 2 = +2 3 =

+1

4 = doublet radical

5 = -1 6 = -2 7

= -3

Molecule is chiral

What if we only know the atom positions and not the bonds

A key example would be x-ray crystallography Here we determine the

positions of the atoms and impute the bonds from our chemical knowledge

Here is an electron density map from an x-ray experiment We see the

electrons but not the nuclei In a small molecule this is very accurate and we

can almost see the bonds However in eg a protein structure which is very

large we canrsquot always determine the atom positions exactly so these are

stored and the bonds imputed The format for storing these is a bit different

Which way

round should

this ring go

Example protein databank file (pdb)

(uses an adjacency matrix)

HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A

TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684

EXPDTA X-RAY DIFFRACTION

REMARK 2 RESOLUTION 180 ANGSTROMS

REMARK 200 TEMPERATURE (KELVIN) 113

REMARK 200 PH 69

SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN

SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO

CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4

ORIGX1 1000000 0000000 0000000 000000

ORIGX2 0000000 1000000 0000000 000000

ORIGX3 0000000 0000000 1000000 000000

SCALE1 0018319 0000000 0000000 000000

SCALE2 0000000 0018147 0000000 000000

SCALE3 0000000 0000000 0015426 000000

ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128

ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841

ATOM 3 C VAL A 1 4565 1067 -3593 100 2764

ATOM 4 O VAL A 1 4653 1287 -4853 100 2717

ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001

HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234

HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213

HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410

HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534

CONECT 1510 1511 1512 1513 1514

CONECT 1511 1510

CONECT 1512 1510

MASTER 334 0 6 6 18 0 11 6 1583 1 95 15

END

Space group and unit cell

dimensions

Protein atoms ndash including xyz

occupancy and temperature factor

Non-protein atoms ndash including xyz

occupancy and temperature factor

Bonds New format

mmCIF

What if we want to vary the atom

positions eg in driving a

reaction coordinate

Using Cartesian (xyz)

coordinates is very

cumbersome so instead

we use the natural angles

and distances

This uses internal coordinates

bull Also called a Z-matrix

ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)

ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation

ndash A z-matrix uses the following geometric descriptions to describe molecules

Bond Length

Bond angle

Torsion angle

Out of plane bending

eg a carbonyl

Non-bonded distance

C

O

du

du

O

N OH

Dummy atom

positions

Internal

Coordinates

How to construct a Z-matrix (in Gaussian format)

1For the first atom to be defined give the atomic symbol

only

2For the second atom give the atomic symbol the number

1 and the name of a variable to describe the distance

between atoms 1 and 2

3For the third atom give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB and

the name of a variable to describe the angle between the

current atom NA and NB

4For all later atoms give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB the

name of a variable to describe the angle between the current

atom NA and NB the atom number of another previously

defined atom NC and finally the name of a variable to

describe the dihedral angle between the current atom NA

NB and NC

5After all the atoms have been listed enter a blank line

6Next list each variable with its corresponding value Use a

separate line for each variable

7In some cases where some of the variables are to be fixed

as constants in a geometry optimisation they are listed here

after a blank line rather than above

with the real variables

8End the Z-matrix with a blank line

Water (C2v)

O

H 1 l1

H l l1 2 a1

l1 096

a1 1040

Methanol Z-matrix

C

O 1 l1

H 1 l2 2 a1

H 1 l3 2 a2 3 da1

H 1 l3 2 a2 3 -da1

H 2 l4 1 a3 3 1800

l1 142

l2 109

l3 109

l4 109

l5 109

l6 10

a1 1090

a2 1100

a3 1080

a4 1100

a5 1100

da1 600

da2 1200

da3 600 z-matrix

lsquoimproperrsquo torsion angles

What about comprehensive

properties of molecules ndash they

are more than xyz

XML and molecules

bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc

bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file

bull Chemical Markup Language (CML) is being developed specifically for chemistry

bull In the future much more information will be stored with molecules allowing greater re-use of data

bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928

Ethanol

ltCMLgt

-Can be parsed

-Can contain reactions

properties etc

-Can contain

relationships to other

molecules and also

concepts

InChI

InChI=1C2H6Oc1-2-3h3H2H21H3

SMILES

C(C)O

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.

• Obtaining the 3-dimensional coordinates can involve experiment (x-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).

• From these, atom positions, bonds, coordinates, etc. can be obtained.

• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.

• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.

• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
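The cost of scanning conformational space by rotating torsions can be seen in a small sketch. It only enumerates the grid of torsion-angle combinations; evaluating the energy of each one is left out. The function name `torsion_grid` and the 30° step are assumptions chosen for illustration.

```python
from itertools import product


def torsion_grid(n_torsions, step_deg):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_deg)
    return product(angles, repeat=n_torsions)


# Three rotatable bonds sampled every 30 degrees: 12 values per torsion.
grid = list(torsion_grid(3, 30))
print(len(grid))   # number of conformations that would need evaluating
print(grid[:3])    # the first few angle combinations
```

The number of grid points grows as (360 / step) to the power of the number of torsions, which is why systematic scans become impractical for flexible molecules and other search strategies are used instead.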

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM

http://ambermd.org

http://www.ks.uiuc.edu/Research/namd

http://www.ks.uiuc.edu/Research/vmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these simulations to design new molecules that block this receptor.

Apelin's role in cancer

'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253

• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect: check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of MWt and topology) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but not observed in the file).

2 Finding molecules

A typical problem involves finding the 'right' molecules

All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1 Finding molecules using substructure searching

One of the most useful 'properties' of a molecule is its molecular graph. Many software systems allow searching for whole molecules, or fragments of molecules, or even 'similar' molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example: find a molecule containing our query as part of a larger molecule.

Results: it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

(Figure labels: "Contains P" and "Contains NH2" point to set bits; "Doesn't contain F" points to an unset bit.)

So a quick match of the bitstring of the fragment searched for against the (pre-computed) bitstrings in the database will eliminate most of the structures.
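The screening step can be sketched with ordinary integers used as bitstrings. The fragment list, the molecule names and their fragment sets below are made up for illustration; real systems use far longer keys and hashed paths.

```python
# Hypothetical fragment dictionary: one bit per fragment.
FRAGMENT_KEYS = ["OH", "NH2", "COOH", "phenyl", "P", "F"]


def fingerprint(fragments):
    """Set bit i when the i-th fragment of FRAGMENT_KEYS is present."""
    bits = 0
    for position, key in enumerate(FRAGMENT_KEYS):
        if key in fragments:
            bits |= 1 << position
    return bits


def screen(database, query_fragments):
    """Keep only molecules whose bitstring contains every bit of the query."""
    query_bits = fingerprint(query_fragments)
    return [name for name, bits in database.items()
            if bits & query_bits == query_bits]


# Bitstrings are computed once, when the database is built.
DATABASE = {
    "aniline": fingerprint({"NH2", "phenyl"}),
    "phenol": fingerprint({"OH", "phenyl"}),
    "glycine": fingerprint({"NH2", "COOH"}),
}

candidates = screen(DATABASE, {"NH2"})
print(candidates)
```

The screen can only rule molecules out. Anything that survives still has to go through the slower atom-by-atom substructure match.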

Next step: substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware of salts etc.: you may need to strip out counterions.
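One common shortcut for stripping counterions is to keep only the largest component of a dot-separated SMILES string. The sketch below uses a deliberately crude size measure (the number of letters in each component), so it is an illustration of the idea rather than a substitute for a cheminformatics toolkit. The function name `largest_fragment` is an assumption.

```python
def largest_fragment(smiles):
    """Return the dot-separated SMILES component with the most letters.

    Counting letters is only a rough proxy for atom count: two-letter
    elements such as Cl count twice and bracketed hydrogens are included.
    """
    fragments = smiles.split(".")
    return max(fragments, key=lambda frag: sum(ch.isalpha() for ch in frag))


print(largest_fragment("CC(=O)[O-].[Na+]"))   # sodium acetate
print(largest_fragment("C(C)O"))              # no counterion: unchanged
```

After stripping, the remaining fragment would normally be canonicalised before comparing it with the database entry.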

Substructure search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents, etc.
– –OH
– –NH2
– –COOH
– phenyl

• This can be done by tracing appropriate paths in the graph.

• Subgraphs may overlap.

(Figure: example molecule with atom labels OH, CH2, CHNH2, OH and O.)

Matching the query structure to the database

Two algorithms are commonly used:

The Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.

Note: a number of 'substructural' fragments can be matched here (6 in all).

Read these for more details:

Gund P., Ann. Rep. in Med. Chem. 14, 299, 1979

Sheridan R.P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989

Brint A.T., Willett P., J. Mol. Graphics 5, 49, 1987

J.R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925

Morgan H.L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113

The basic concept is that molecules are considered as graphs:

Atoms are 'nodes'
Bonds are 'edges'

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label i to each node, considering its atomic number, degree (number of substituents) and types of bonds;

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - A technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107-113, 1965

The first step - numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged.

• Numbering starts from the highest value, then uses the next highest connection as 2, etc. Like an onion: concentric circles out from the highest-numbered atom.

• Rules are used to break equalities, such as C before N, double bond before single, etc.

• Some equivalent atoms are arbitrarily numbered.
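The iteration described above can be sketched in a few lines. This is a simplified version that assumes the initial label is just the number of neighbours (atomic number and bond types are ignored) and that stops as soon as the count of distinct labels fails to increase. The tie-breaking rules and the final onion-style numbering are not implemented. The pentane-like chain used as input is an example chosen here, not one from the slides.

```python
def morgan_labels(neighbours):
    """Iterate Morgan-style labels for a graph given as {atom: [neighbours]}."""
    # Step (i), simplified: start from the degree of each atom.
    labels = {atom: len(nbrs) for atom, nbrs in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): each new label is the sum of the neighbours' labels.
        updated = {atom: sum(labels[n] for n in nbrs)
                   for atom, nbrs in neighbours.items()}
        updated_classes = len(set(updated.values()))
        # Step (iii): stop when the number of distinct labels stops growing.
        if updated_classes <= n_classes:
            return labels
        labels, n_classes = updated, updated_classes


# Five carbons in a chain: C1-C2-C3-C4-C5 (hydrogens ignored).
CHAIN = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

labels = morgan_labels(CHAIN)
print(labels)
```

Atoms that end up with the same label are symmetry-equivalent in this simple scheme (here the two chain ends, and the two atoms next to them), which is where the arbitrary numbering of equivalent atoms comes from.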

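Urea: adjacency matrix

(Figure: skeleton with atom 4 at the centre, bonded to O1 and to atoms 2 and 3, as labelled on the slide.)

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

Cyclohexanone: adjacency matrix

(Figure: six-membered ring C2 to C7, with O1 attached to C4. The 7 × 7 matrix has rows and columns O1, C2, C3, C4, C5, C6, C7, with a 1 for every bonded pair: one entry in the O1 row, three in the C4 row and two in each of the other rows.)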
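An adjacency matrix is straightforward to build from a list of bonds. The sketch below does this for the four-atom skeleton and for cyclohexanone. The ring order assumed for cyclohexanone (C4-C2-C5-C6-C7-C3-C4) is a guess at the slide's numbering and is labelled as such in the code.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 matrix from bonds given as 1-based atom pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i - 1][j - 1] = 1
        matrix[j - 1][i - 1] = 1
    return matrix


# Four-atom skeleton: atom 4 is bonded to atoms 1, 2 and 3.
SKELETON_BONDS = [(1, 4), (2, 4), (3, 4)]

# Cyclohexanone: O1 on C4, ring order assumed to be C4-C2-C5-C6-C7-C3-C4.
CYCLOHEXANONE_BONDS = [(1, 4), (4, 2), (2, 5), (5, 6), (6, 7), (7, 3), (3, 4)]

skeleton = adjacency_matrix(4, SKELETON_BONDS)
cyclohexanone = adjacency_matrix(7, CYCLOHEXANONE_BONDS)

for row in cyclohexanone:
    print(" ".join(str(value) for value in row))
```

The row sums give the degree of each atom, and renumbering the atoms permutes the rows and columns, which is why a consistent numbering matters before two matrices are compared.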

Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. Matching ('subgraph isomorphism') is the 'slow' part and belongs to the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures.
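To show what substructure detection involves, here is a plain backtracking search that tries to map every query atom onto a distinct target atom with the same element, so that every query bond lands on a target bond. It is not Ullmann's algorithm itself, which adds a refinement step that prunes candidates much earlier, and bond orders are ignored. The cyclohexanone numbering (0-based here) uses an assumed ring order.

```python
def neighbour_sets(n_atoms, bonds):
    """Return a list of neighbour sets from 0-based bond pairs."""
    nbrs = [set() for _ in range(n_atoms)]
    for i, j in bonds:
        nbrs[i].add(j)
        nbrs[j].add(i)
    return nbrs


def find_substructure(query_atoms, query_bonds, target_atoms, target_bonds):
    """Return one mapping (query index -> target index) as a list, or None."""
    q_nbrs = neighbour_sets(len(query_atoms), query_bonds)
    t_nbrs = neighbour_sets(len(target_atoms), target_bonds)

    def extend(mapping):
        q = len(mapping)                      # next query atom to place
        if q == len(query_atoms):
            return mapping
        for t, element in enumerate(target_atoms):
            if t in mapping or element != query_atoms[q]:
                continue
            if len(t_nbrs[t]) < len(q_nbrs[q]):
                continue                      # too few neighbours to match
            # Bonds to already-placed query atoms must exist in the target.
            if all(mapping[p] in t_nbrs[t] for p in q_nbrs[q] if p < q):
                found = extend(mapping + [t])
                if found is not None:
                    return found
        return None

    return extend([])


# Cyclohexanone, 0-based: O0 on C3, ring C3-C1-C4-C5-C6-C2-C3 (assumed order).
TARGET_ATOMS = ["O", "C", "C", "C", "C", "C", "C"]
TARGET_BONDS = [(0, 3), (3, 1), (1, 4), (4, 5), (5, 6), (6, 2), (2, 3)]

carbonyl = find_substructure(["C", "O"], [(0, 1)], TARGET_ATOMS, TARGET_BONDS)
amine = find_substructure(["C", "N"], [(0, 1)], TARGET_ATOMS, TARGET_BONDS)
print(carbonyl, amine)
```

In the worst case the search tries every assignment of query atoms to target atoms, which is the exponential behaviour behind the NP-complete classification. Fingerprint screening and candidate pruning keep real searches fast.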


Next lecture

• How can we use this type of information to solve our chemistry problems?
– Patents/Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure-property/activity relationships


The Connection Table: describing bonds

• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.

Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced)

e.g. PDB protein file format
e.g. SD file format

Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.

• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Line 1: molecule name.
Line 2: information on this molecule.
Line 3: comment (e.g. the date is used here).
"Counts line": has the number of atoms and bonds as a minimum.
"Atom block": has the xyz coordinates of the atom and the element as a minimum.
"Bond block": one line for each bond (from atom, to atom, bond type).
"M  END" and "$$$$": these identify the end of this molecule.
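A minimal reader for this layout might look like the sketch below. It assumes the fixed-width V2000 column positions (three characters per count, ten per coordinate, element symbol in columns 32 to 34) and ignores everything beyond the counts, coordinates, element symbols and bond types. The function name `parse_molfile` is an assumption for the example.

```python
MOLFILE = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
"""


def parse_molfile(text):
    """Return (name, atoms, bonds) from a V2000 connection table."""
    lines = text.splitlines()
    counts = lines[3]                       # three header lines come first
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:       # atom block: x, y, z, element
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        atoms.append((line[31:34].strip(), x, y, z))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # bond block: from atom, to atom, bond type
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return lines[0], atoms, bonds


name, atoms, bonds = parse_molfile(MOLFILE)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

Slicing by column rather than splitting on spaces matters for larger molecules, where three-digit atom numbers can run together with no space between them.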

Alanine SD file

Bond length is 1.53 Å.

Atom line fields: x, y, z, symbol, mass difference, charge, stereo, H-count ...

Charge codes: 0 = uncharged or a value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3.

Molecule is chiral.

What if we only know the atom positions and not the bonds?

A key example would be x-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an x-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
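One simple way to impute bonds from positions is a distance test: two atoms are taken as bonded when they are closer than the sum of their covalent radii plus a tolerance. The radii and the 0.4 Å tolerance below are approximate values chosen for illustration, and the water coordinates are made up to match a 0.96 Å bond and a 104° angle.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms (illustrative values).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}
TOLERANCE = 0.4  # assumed slack, in angstroms


def infer_bonds(atoms):
    """Return index pairs of atoms close enough to be considered bonded."""
    bonds = []
    for (i, (sym_i, pos_i)), (j, (sym_j, pos_j)) in combinations(enumerate(atoms), 2):
        limit = COVALENT_RADII[sym_i] + COVALENT_RADII[sym_j] + TOLERANCE
        if math.dist(pos_i, pos_j) <= limit:
            bonds.append((i, j))
    return bonds


WATER = [
    ("O", (0.0, 0.0, 0.0)),
    ("H", (0.96, 0.0, 0.0)),
    ("H", (-0.2322, 0.9315, 0.0)),
]

print(infer_bonds(WATER))
```

A rule like this says nothing about bond order or hydrogen positions, which is one reason imputed bonding in protein structures is not always correct.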

(Figure caption: Which way round should this ring go?)

Example protein databank file (pdb)

(uses an adjacency matrix)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200 TEMPERATURE (KELVIN) : 113
REMARK 200 PH : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

CRYST1: space group and unit cell dimensions.
ATOM: protein atoms, including xyz, occupancy and temperature factor.
HETATM: non-protein atoms, including xyz, occupancy and temperature factor.
CONECT: bonds.
New format: mmCIF.
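PDB records are fixed-width, so fields are read by column position rather than by splitting on spaces. The sketch below writes one ATOM record and reads it back. The column positions used are the standard ones as I understand them (serial 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z 31-54, occupancy 55-60, temperature factor 61-66). The function names are assumptions for this example.

```python
def format_atom_line(serial, name, res_name, chain, res_seq,
                     x, y, z, occupancy, b_factor):
    """Write one ATOM record using fixed PDB column positions."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain:1s}{res_seq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_factor:6.2f}"
    )


def parse_atom_line(line):
    """Read the main fields of an ATOM or HETATM record by column."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


# First atom of the example file; " N" keeps the element in columns 13-14.
line = format_atom_line(1, " N", "VAL", "A", 1, 3.036, -1.035, -3.538, 1.00, 31.28)
record = parse_atom_line(line)
print(line)
print(record)
```

Reading by column is what allows neighbouring fields to touch, for example a coordinate such as -100.000 directly after another number, without confusing the parser.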

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.

This uses internal coordinates

• Also called a Z-matrix
– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
– Early form of specification of a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:

Bond length
Bond angle
Torsion angle
Out-of-plane bending, e.g. a carbonyl
Non-bonded distance

(Figure: internal coordinates illustrated on example structures, including dummy atom (du) positions.)

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol; the atom number NA; the name of a variable to describe the distance between the current atom and NA; the atom number NB; and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol; the atom number NA; the name of a variable to describe the distance between the current atom and NA; the atom number NB; the name of a variable to describe the angle between the current atom, NA and NB; the atom number of another previously defined atom, NC; and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here, after a blank line, rather than above with the real variables.

8. End the Z-matrix with a blank line.
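The layout described in these steps is simple enough to read with a few lines of code. The sketch below splits a Z-matrix into its atom lines and its variable lines at the first blank line and substitutes each variable name with its value. It handles a leading minus sign on a variable name and literal numbers, and it treats any constants listed after a second blank line the same way as variables. The function names and the water input are assumptions for this example.

```python
ZMATRIX = """O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
"""


def resolve(token, variables):
    """Turn a variable name (optionally negated) or a literal into a number."""
    sign = -1.0 if token.startswith("-") else 1.0
    name = token.lstrip("+-")
    if name in variables:
        return sign * variables[name]
    return float(token)


def parse_zmatrix(text):
    """Return a list of (symbol, [(reference atom, value), ...]) entries."""
    atom_part, _, variable_part = text.strip().partition("\n\n")
    variables = {}
    for line in variable_part.splitlines():
        if line.strip():                      # skip the blank separator lines
            name, value = line.split()
            variables[name] = float(value)
    entries = []
    for line in atom_part.splitlines():
        fields = line.split()
        # After the symbol, fields alternate: atom number, then quantity.
        pairs = [(int(number), resolve(quantity, variables))
                 for number, quantity in zip(fields[1::2], fields[2::2])]
        entries.append((fields[0], pairs))
    return entries


parsed = parse_zmatrix(ZMATRIX)
for symbol, pairs in parsed:
    print(symbol, pairs)
```

Each entry carries at most three pairs: a distance, an angle and a dihedral, each measured against a previously defined atom.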


Urea

Adjacency matrix

O

C C

1

2 34

O

C2

C5

C6

C7

C3

1

C4

O1 C2 C3 C4

O1 1

C2 1

C3 1

C4 1 1 1

O1 C2 C3 C4 C5 C6 C7

O1 1

C2 1 1

C3 1 1

C4 1 1 1

C5 1 1

C6 1 1

C7 1 1

Cyclohexanone adjacency

matrix

Numbering is crucial Renumber

for different starting points Also

the lsquoedgesrsquo of the graph are annotated to speed up the matching This is the lsquoslowrsquo part

of lsquomatching subgraph isomorphismrsquo and is of the class of NP-complete problems

The Ullman Algorithm

is a way of detecting

detecting substructures

Next lecture

bull How can we use this type of information to

solve our chemistry problems

ndash PatentsMarkush Structures

ndash Molecular Similarity

ndash Molecular Properties

ndash 3D searching

ndash Virtual Screening

ndash Structure PropertyActivity Relationships


Why store molecules in 2D?

• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.

• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.

Example SD file - benzene

benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$

Header lines: molecule name; information on this molecule; comment (e.g. the date is used here).

The "counts line" has the number of atoms and bonds as a minimum.

The "atom block" has the xyz coordinates of the atom and the element as a minimum.

The "bond block" has one line for each bond: from atom, to atom, bond type.

"M  END" and "$$$$" identify the end of this molecule.

Alanine SD file

Bond length is 1.53 Å

Atom block columns: x, y, z, symbol, mass difference, charge, stereo, H-count, …

Charge codes:
0 = uncharged, or a value other than these
1 = +3, 2 = +2, 3 = +1
4 = doublet radical
5 = -1, 6 = -2, 7 = -3

Molecule is chiral
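As a rough illustration of how the counts line, atom block, bond block and charge codes fit together, the sketch below reads the benzene record shown earlier using the fixed column positions of the V2000 layout. The function name `parse_molblock` and the returned dictionary keys are made up for this example; a real toolkit handles many more fields and error cases.

```python
MOLBLOCK = """\
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
"""

# Charge codes of the atom block as listed above: code -> formal charge.
# Code 4 (doublet radical) carries no charge.
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 4: 0, 5: -1, 6: -2, 7: -3}


def parse_molblock(text):
    """Parse one V2000 record into (name, atoms, bonds) by fixed columns."""
    lines = text.splitlines()
    name = lines[0].strip()
    # Counts line: atoms in columns 1-3, bonds in columns 4-6.
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z = (float(line[k:k + 10]) for k in (0, 10, 20))
        code = int(line[36:39])
        atoms.append({"element": line[31:34].strip(), "xyz": (x, y, z),
                      "charge": CHARGE_CODES[code], "radical": code == 4})
    # Bond block: from atom, to atom, bond type (atom numbers start at 1).
    bonds = [(int(row[0:3]), int(row[3:6]), int(row[6:9]))
             for row in lines[4 + n_atoms:4 + n_atoms + n_bonds]]
    return name, atoms, bonds


name, atoms, bonds = parse_molblock(MOLBLOCK)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

Slicing by column rather than splitting on whitespace matters for larger molecules, where adjacent three-character fields can run together.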

What if we only know the atom positions and not the bonds?

A key example would be x-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an x-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

[Figure: electron density map. Callout: "Which way round should this ring go?"]
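One common way to impute bonds from positions alone is a distance test: two atoms are taken as bonded if they are closer than the sum of their covalent radii plus a tolerance. The sketch below assumes approximate radii and a 0.4 Å tolerance (both are illustrative choices, not values from these notes) and uses the five valine atom positions from the PDB example in these notes.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (assumed values for illustration).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.4  # angstroms; an assumed, adjustable margin


def impute_bonds(atoms):
    """Return index pairs (i, j) of atoms close enough to be bonded.

    atoms is a list of (element, (x, y, z)) tuples.
    """
    bonds = []
    for (i, (elem_i, xyz_i)), (j, (elem_j, xyz_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADII[elem_i] + COVALENT_RADII[elem_j] + TOLERANCE
        if math.dist(xyz_i, xyz_j) <= cutoff:
            bonds.append((i, j))
    return bonds


# N, CA, C, O, CB of the first valine residue.
valine = [
    ("N", (3.036, -1.035, -3.538)),
    ("C", (4.283, -0.343, -3.015)),
    ("C", (4.565, 1.067, -3.593)),
    ("O", (4.653, 1.287, -4.853)),
    ("C", (5.510, -1.245, -3.141)),
]
print(impute_bonds(valine))
```

A distance test recovers connectivity only. Bond orders, hydrogens and ambiguous cases such as ring flips still need chemical knowledge.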

Example protein databank file (pdb)

(bonds are stored as an adjacency list in the CONECT records)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

CRYST1: space group and unit cell dimensions

ATOM: protein atoms, including xyz, occupancy and temperature factor

HETATM: non-protein atoms, including xyz, occupancy and temperature factor

CONECT: bonds

New format: mmCIF
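The records above are fixed-column text, so a reader slices each line at known positions rather than splitting on spaces. The following is a minimal sketch for the ATOM, HETATM and CONECT records only. The function name `parse_pdb` and the dictionary keys are invented for this example, and the sample text is a few lines copied from the excerpt above.

```python
PDB_TEXT = """\
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
"""


def parse_pdb(text):
    """Return (atoms, bonds): atoms keyed by serial number, bonds as a set of pairs."""
    atoms, bonds = {}, set()
    for line in text.splitlines():
        record = line[0:6].strip()
        if record in ("ATOM", "HETATM"):
            atoms[int(line[6:11])] = {
                "name": line[12:16].strip(),
                "residue": line[17:20],
                "chain": line[21],
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
                "occupancy": float(line[54:60]),
                "b_factor": float(line[60:66]),  # temperature factor
                "hetero": record == "HETATM",
            }
        elif record == "CONECT":
            # Serial numbers sit in consecutive 5-character fields from column 7.
            serials = [int(line[k:k + 5]) for k in range(6, len(line.rstrip()), 5)]
            bonds.update(frozenset((serials[0], other)) for other in serials[1:])
    return atoms, bonds


atoms, bonds = parse_pdb(PDB_TEXT)
print(len(atoms), "atoms,", len(bonds), "explicit bonds")
```

Storing each bond as a `frozenset` means the two CONECT lines that mention the 1510-1511 bond collapse into one entry. Bonds inside standard residues are not listed at all and have to be imputed.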

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.

This uses internal coordinates

• Also called a Z-matrix

– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)

– Early form of specification of a starting geometry for molecules: sometimes graph paper was used to draw the molecule and get a starting set of coordinates before optimisation

– A Z-matrix uses the following geometric descriptions to describe molecules:

Bond length
Bond angle
Torsion angle
Out-of-plane bending (e.g. a carbonyl)
Non-bonded distance

[Figure: internal coordinates illustrated on example structures, with dummy atom (du) positions marked]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Slide callouts: "z-matrix"; "'improper' torsion angles"]
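To show how a program turns a Z-matrix back into Cartesian coordinates, here is a small sketch. It places atom 1 at the origin, atom 2 on the z-axis, atom 3 in the xz-plane, and every later atom from its distance, angle and dihedral. The row format (tuples), the function names and the handling of a leading minus sign on a variable name are choices made for this example, not Gaussian's own parser.

```python
import math


def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])


def _unit(u):
    norm = math.sqrt(sum(a * a for a in u))
    return tuple(a / norm for a in u)


def place(a, b, c, r, angle_deg, dihedral_deg):
    """Point at distance r from a, making angle_deg with b and dihedral_deg with c."""
    theta, phi = math.radians(angle_deg), math.radians(dihedral_deg)
    ab = _unit(_sub(a, b))             # unit vector from b towards a
    n = _unit(_cross(_sub(b, c), ab))  # normal to the plane through c, b, a
    m = _cross(n, ab)                  # in-plane direction perpendicular to ab
    d = (-r * math.cos(theta), r * math.sin(theta) * math.cos(phi), r * math.sin(theta) * math.sin(phi))
    return tuple(a[k] + d[0] * ab[k] + d[1] * m[k] + d[2] * n[k] for k in range(3))


def value(token, variables):
    """Resolve a literal number, a variable name, or a negated variable name."""
    if isinstance(token, (int, float)):
        return float(token)
    if token.startswith("-"):
        return -variables[token[1:]]
    return variables[token]


def zmatrix_to_cartesian(rows, variables):
    coords = []
    for row in rows:
        n_done = len(coords)
        if n_done == 0:
            coords.append((0.0, 0.0, 0.0))
        elif n_done == 1:
            coords.append((0.0, 0.0, value(row[2], variables)))
        else:
            a, b = coords[row[1] - 1], coords[row[3] - 1]
            if n_done == 2:
                c, phi = (b[0] + 1.0, b[1], b[2]), 0.0  # arbitrary reference off the z-axis
            else:
                c, phi = coords[row[5] - 1], value(row[6], variables)
            coords.append(place(a, b, c, value(row[2], variables), value(row[4], variables), phi))
    return coords


WATER = [("O",), ("H", 1, "l1"), ("H", 1, "l1", 2, "a1")]
WATER_VARS = {"l1": 0.96, "a1": 104.0}
for (symbol, *_), xyz in zip(WATER, zmatrix_to_cartesian(WATER, WATER_VARS)):
    print(symbol, " ".join(f"{c:8.4f}" for c in xyz))
```

Changing a single variable, such as `a1`, and reconverting moves the atoms consistently. That is what makes internal coordinates convenient for driving a reaction coordinate.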

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.

• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.

• Chemical Markup Language (CML) is being developed specifically for chemistry.

• In the future much more information will be stored with molecules, allowing greater re-use of data.

• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928.

Ethanol

<CML>

- Can be parsed
- Can contain reactions, properties, etc.
- Can contain relationships to other molecules and also concepts

InChI

InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES

C(C)O
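To make "can be parsed" concrete, the sketch below reads a small CML-style description of ethanol with Python's standard XML parser. The fragment is simplified and hand-written for this example: the element and attribute names follow common CML usage (`atomArray`, `elementType`, `atomRefs2`), but it has not been validated against the CML schema.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written CML-style fragment (an assumption for illustration).
CML_TEXT = """
<molecule id="ethanol">
  <name>ethanol</name>
  <atomArray>
    <atom id="a1" elementType="C" hydrogenCount="3"/>
    <atom id="a2" elementType="C" hydrogenCount="2"/>
    <atom id="a3" elementType="O" hydrogenCount="1"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>
"""

root = ET.fromstring(CML_TEXT)

# Atoms: id -> element symbol.
cml_atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}

# Bonds: (first atom id, second atom id, bond order).
cml_bonds = [tuple(bond.get("atomRefs2").split()) + (bond.get("order"),)
             for bond in root.iter("bond")]

print(root.findtext("name"), cml_atoms, cml_bonds)
```

Because every value is tagged, further metadata (units, dates, links to other molecules) can be added as extra elements without breaking readers that ignore them.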

Molecules and 3-Dimensions

• Molecules are, of course, not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.

• Obtaining the 3-dimensional coordinates can involve experiment (x-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).

• From these can be obtained atom positions, bonds, coordinates, etc.

• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.

• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.

• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
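The cost of scanning conformational space by torsion rotation can be seen from a simple grid count. This sketch enumerates every combination of torsion angles on a regular grid; the 30 degree step and the function names are assumptions for illustration, and real conformer generators prune this grid heavily.

```python
from itertools import product


def torsion_grid(n_torsions, step_deg):
    """Lazily yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_deg)
    return product(angles, repeat=n_torsions)


def grid_size(n_torsions, step_deg):
    """Number of grid points: (angles per torsion) ** (number of torsions)."""
    return len(range(0, 360, step_deg)) ** n_torsions


for n in (1, 2, 5, 10):
    print(n, "torsions at 30 degree steps:", grid_size(n, 30), "conformers")
```

Two torsions give the familiar two-dimensional map, as in a Ramachandran plot. Each extra rotatable bond multiplies the number of grid points by the number of angles per torsion.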

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor.

Apelin's role in cancer

'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253

• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats, which could be a real pain, but they can be converted

First list on the slide (it includes program output files, so apparently the formats that can be read):

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

Second list on the slide (it includes program input files, so apparently the formats that can be written):

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

• Stereochemistry may be relative and not absolute, or even incorrect.

• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

• Tautomers can be incorrect: check they look reasonable.

• Mesomers can be incorrect (double bonds).

• Polymers (which are mixtures of MWt and topology) are very difficult to store.

• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).

2 Finding molecules

A typical problem involves finding the 'right' molecules

[Figure: nested sets of molecules]
All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1 Finding molecules using Substructure searching

One of the most useful 'properties' of a molecule is its molecular graph.

Many software systems allow searching for whole molecules, or fragments of molecules, or even 'similar' molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example: find a molecule containing our query as part of a larger molecule.

Results: it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly the whole database is not being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

Slide callouts: a set bit (1) means e.g. "Contains P" or "Contains NH2"; an unset bit (0) means e.g. "Doesn't contain F".

So a quick match of the bitstring of the fragment searched over the (pre-computed) bitstrings in the database will eliminate most of the structures.
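The bitstring screen can be sketched with integers used as bit sets: a database molecule can only contain the query if every bit set in the query is also set in the molecule. The fragment-to-bit table and the tiny database below are hypothetical, and the fragment lists are assumed to have been pre-computed elsewhere.

```python
# Hypothetical assignment of fragments to bit positions.
FRAGMENT_BITS = {"OH": 0, "NH2": 1, "COOH": 2, "phenyl": 3, "P": 4, "F": 5}


def fingerprint(fragments):
    """Set one bit for every fragment present in the molecule."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << FRAGMENT_BITS[fragment]
    return bits


def may_contain(query_fp, molecule_fp):
    """True if all query bits are set in the molecule (necessary, not sufficient)."""
    return (query_fp & molecule_fp) == query_fp


def screen(query_fp, database):
    """Return the names of the molecules that survive the bit screen."""
    return [name for name, fp in database.items() if may_contain(query_fp, fp)]


# Pre-computed fingerprints for a toy database.
DATABASE = {
    "tyrosine": fingerprint(["OH", "NH2", "COOH", "phenyl"]),
    "ethanol": fingerprint(["OH"]),
    "aniline": fingerprint(["NH2", "phenyl"]),
}

phenol_query = fingerprint(["OH", "phenyl"])
print(screen(phenol_query, DATABASE))
```

The screen only eliminates candidates. A molecule that passes may still have the OH and the ring in the wrong arrangement, which is why the slower graph matching step follows.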

Next step: Substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware: salts etc. You may need to strip out counterions.
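A very crude way to strip counterions from a SMILES string is to keep only the largest dot-separated component. The sketch below measures size by counting letters, which is an assumption that over-counts two-letter elements. It does not canonicalise the result, so a real exact-match search would still need a canonical SMILES or InChI from a cheminformatics toolkit.

```python
def strip_counterions(smiles):
    """Keep the largest '.'-separated component of a SMILES string.

    Size is estimated by the number of letters, a rough stand-in for
    the number of atoms (assumption for this sketch).
    """
    fragments = smiles.split(".")
    return max(fragments, key=lambda frag: sum(ch.isalpha() for ch in frag))


print(strip_counterions("CC(=O)[O-].[Na+]"))  # sodium acetate
```

This is only safe when the component of interest is clearly the largest one; mixtures and co-crystals need a more careful rule.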

Substructure Search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents, etc.

– –OH
– –NH2
– –COOH
– phenyl

• This can be done by tracing appropriate paths in the graph.

• Subgraphs may overlap.

[Figure: example structure with fragment labels OH, CH2, CHNH2, OH, O]

Matching the query structure to the database

Two algorithms are commonly used:

The Morgan algorithm, which numbers the molecule uniquely, and

The Ullmann algorithm, which matches the fragments.

Note: a number of 'substructural' fragments can be matched here (6 in all).

Read these for more details:

Gund P, Ann. Rep. in Med. Chem. 14, 299, 1979

Sheridan RP et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989

Brint AT, Willett P, J. Mol. Graphics 5, 49, 1987

J. R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925

Morgan HL, The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113

The basic concept is that molecules are considered as graphs:

Atoms are 'nodes'

Bonds are 'edges'

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
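A labelled graph of this kind needs very little code. In the sketch below the class name `MolGraph`, its methods, and the choice of element symbols and bond orders as labels are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """A molecule as a labelled graph: atoms are nodes, bonds are edges."""
    atoms: list = field(default_factory=list)   # node labels (element symbols)
    bonds: dict = field(default_factory=dict)   # {frozenset({i, j}): bond order}

    def add_atom(self, element):
        self.atoms.append(element)
        return len(self.atoms) - 1               # index of the new node

    def add_bond(self, i, j, order=1):
        self.bonds[frozenset((i, j))] = order    # edge label

    def neighbours(self, i):
        return sorted(j for pair in self.bonds if i in pair for j in pair if j != i)


# Heavy-atom skeleton of acetaldehyde: C-C=O
mol = MolGraph()
c1, c2, o = mol.add_atom("C"), mol.add_atom("C"), mol.add_atom("O")
mol.add_bond(c1, c2, order=1)
mol.add_bond(c2, o, order=2)
print(mol.atoms, mol.neighbours(c2), mol.bonds[frozenset((c2, o))])
```

Using an unordered pair as the key means the bond between atoms 1 and 2 is found regardless of which end is named first.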

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107-113, 1965

The first step - Numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged.

• Numbering starts from the highest, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest numbered.

• Rules are used to break equalities, such as C before N, double bond before single, etc.

• Some equivalent atoms are arbitrarily numbered.
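The iteration can be sketched in a few lines. This version is a simplification: it starts from the degree of each atom only (ignoring atomic number and bond type), replaces each label by the sum of its neighbours' labels, and stops as soon as a round fails to increase the number of distinct labels, keeping the labels from the round before. Tie-breaking rules and the final renumbering walk are left out.

```python
def morgan_labels(adjacency):
    """Simplified Morgan labels for a molecule given as neighbour lists."""
    labels = [len(neighbours) for neighbours in adjacency]  # start from the degree
    n_classes = len(set(labels))
    while True:
        updated = [sum(labels[j] for j in neighbours) for neighbours in adjacency]
        updated_classes = len(set(updated))
        if updated_classes <= n_classes:
            return labels            # no further discrimination: keep previous labels
        labels, n_classes = updated, updated_classes


def rank_atoms(labels):
    """Atom indices ordered from highest to lowest label (ties keep input order)."""
    return sorted(range(len(labels)), key=lambda i: -labels[i])


# Carbon skeleton of n-pentane: C0-C1-C2-C3-C4
pentane = [[1], [0, 2], [1, 3], [2, 4], [3]]
labels = morgan_labels(pentane)
print(labels, rank_atoms(labels))
```

For the pentane chain the central carbon ends up with the highest label, so numbering would start there and work outwards, like the layers of an onion. The two ends remain equivalent and would be numbered arbitrarily.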

Adjacency matrix

Urea (atoms labelled as on the slide: O1, C2, C3, C4):

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

Cyclohexanone adjacency matrix (7 × 7; atoms O1, C2 to C7): row O1 contains one 1, row C4 contains three, and each of rows C2, C3, C5, C6 and C7 contains two.

Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching ('subgraph isomorphism') and is of the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures.
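The matching step can be illustrated with adjacency matrices and a plain backtracking search. This is a simplified sketch in the spirit of Ullmann's method: it builds a table of candidate target atoms for each query atom (same element, at least as many neighbours) and extends a mapping one atom at a time, but it omits Ullmann's refinement procedure. The atom ordering and function names are choices made for this example.

```python
def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 matrix with a 1 wherever two atoms are bonded."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix


def find_substructure(q_elems, q_adj, t_elems, t_adj):
    """Return every mapping (tuple: query atom index -> target atom index)."""
    q_deg = [sum(row) for row in q_adj]
    t_deg = [sum(row) for row in t_adj]
    # Candidate table: compatible element and enough neighbours.
    candidates = [[j for j in range(len(t_elems))
                   if t_elems[j] == q_elems[i] and t_deg[j] >= q_deg[i]]
                  for i in range(len(q_elems))]
    matches = []

    def extend(mapping):
        i = len(mapping)                 # next query atom to place
        if i == len(q_elems):
            matches.append(tuple(mapping))
            return
        for j in candidates[i]:
            if j in mapping:
                continue                 # each target atom is used at most once
            # Every query bond to an already placed atom must exist in the target.
            if all(t_adj[mapping[k]][j] for k in range(i) if q_adj[k][i]):
                extend(mapping + [j])

    extend([])
    return matches


# Query: O and two C atoms all bonded to a central C (the 4-atom matrix above).
query_elems = ["O", "C", "C", "C"]
query_adj = adjacency_matrix(4, [(0, 3), (1, 3), (2, 3)])

# Target: cyclohexanone, atom 0 = O, atom 1 = carbonyl C, atoms 2-6 = ring.
target_elems = ["O"] + ["C"] * 6
target_adj = adjacency_matrix(7, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)])

print(find_substructure(query_elems, query_adj, target_elems, target_adj))
```

The query matches twice because its two outer carbons are interchangeable. The search is exponential in the worst case, consistent with subgraph isomorphism being NP-complete, which is why the cheap bitstring screen is applied first.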


Next lecture

• How can we use this type of information to solve our chemistry problems?

– Patents/Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure-property/activity relationships

Page 21: Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH

Example SD file - benzene

bull benzene

bull ACDLabs0812062058

bull August 2013

bull 6 6 0 0 0 0 0 0 0 0 1 V2000

bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0

bull 2 1 1 0 0 0 0

bull 3 1 2 0 0 0 0

bull 4 2 2 0 0 0 0

bull 5 3 1 0 0 0 0

bull 6 4 1 0 0 0 0

bull 6 5 2 0 0 0 0

bull M END

bull $$$$

Molecule nameInformation on this molecule

Comment (eg date is used here)

ldquocounts linerdquo has the

number of atoms and

bonds as a minimum

ldquoatom blockrdquo has xyz

coordinates of the atom and

element as a minimum

ldquobond blockrdquo 1-line for each bond

from atom - to atom ndash bond type

These identify the end of this molecule

Alanine SD file

Bond length is 153A

xyz symbol mass diff charge stereo h-counthellip

Charge

0 = uncharged or value

other than

these 1 = +3 2 = +2 3 =

+1

4 = doublet radical

5 = -1 6 = -2 7

= -3

Molecule is chiral

What if we only know the atom positions and not the bonds

A key example would be x-ray crystallography Here we determine the

positions of the atoms and impute the bonds from our chemical knowledge

Here is an electron density map from an x-ray experiment We see the

electrons but not the nuclei In a small molecule this is very accurate and we

can almost see the bonds However in eg a protein structure which is very

large we canrsquot always determine the atom positions exactly so these are

stored and the bonds imputed The format for storing these is a bit different

Which way

round should

this ring go

Example protein databank file (pdb)

(uses an adjacency matrix)

HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A

TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684

EXPDTA X-RAY DIFFRACTION

REMARK 2 RESOLUTION 180 ANGSTROMS

REMARK 200 TEMPERATURE (KELVIN) 113

REMARK 200 PH 69

SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN

SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO

CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4

ORIGX1 1000000 0000000 0000000 000000

ORIGX2 0000000 1000000 0000000 000000

ORIGX3 0000000 0000000 1000000 000000

SCALE1 0018319 0000000 0000000 000000

SCALE2 0000000 0018147 0000000 000000

SCALE3 0000000 0000000 0015426 000000

ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128

ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841

ATOM 3 C VAL A 1 4565 1067 -3593 100 2764

ATOM 4 O VAL A 1 4653 1287 -4853 100 2717

ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001

HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234

HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213

HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410

HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534

CONECT 1510 1511 1512 1513 1514

CONECT 1511 1510

CONECT 1512 1510

MASTER 334 0 6 6 18 0 11 6 1583 1 95 15

END

Space group and unit cell

dimensions

Protein atoms ndash including xyz

occupancy and temperature factor

Non-protein atoms ndash including xyz

occupancy and temperature factor

Bonds New format

mmCIF

What if we want to vary the atom

positions eg in driving a

reaction coordinate

Using Cartesian (xyz)

coordinates is very

cumbersome so instead

we use the natural angles

and distances

This uses internal coordinates

bull Also called a Z-matrix

ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)

ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation

ndash A z-matrix uses the following geometric descriptions to describe molecules

Bond Length

Bond angle

Torsion angle

Out of plane bending

eg a carbonyl

Non-bonded distance

C

O

du

du

O

N OH

Dummy atom

positions

Internal

Coordinates

How to construct a Z-matrix (in Gaussian format)

1For the first atom to be defined give the atomic symbol

only

2For the second atom give the atomic symbol the number

1 and the name of a variable to describe the distance

between atoms 1 and 2

3For the third atom give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB and

the name of a variable to describe the angle between the

current atom NA and NB

4For all later atoms give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB the

name of a variable to describe the angle between the current

atom NA and NB the atom number of another previously

defined atom NC and finally the name of a variable to

describe the dihedral angle between the current atom NA

NB and NC

5After all the atoms have been listed enter a blank line

6Next list each variable with its corresponding value Use a

separate line for each variable

7In some cases where some of the variables are to be fixed

as constants in a geometry optimisation they are listed here

after a blank line rather than above

with the real variables

8End the Z-matrix with a blank line

Water (C2v)

O

H 1 l1

H l l1 2 a1

l1 096

a1 1040

Methanol Z-matrix

C

O 1 l1

H 1 l2 2 a1

H 1 l3 2 a2 3 da1

H 1 l3 2 a2 3 -da1

H 2 l4 1 a3 3 1800

l1 142

l2 109

l3 109

l4 109

l5 109

l6 10

a1 1090

a2 1100

a3 1080

a4 1100

a5 1100

da1 600

da2 1200

da3 600 z-matrix

lsquoimproperrsquo torsion angles

What about comprehensive

properties of molecules ndash they

are more than xyz

XML and molecules

bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc

bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file

bull Chemical Markup Language (CML) is being developed specifically for chemistry

bull In the future much more information will be stored with molecules allowing greater re-use of data

bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928

Ethanol

ltCMLgt

-Can be parsed

-Can contain reactions

properties etc

-Can contain

relationships to other

molecules and also

concepts

InChI

InChI=1C2H6Oc1-2-3h3H2H21H3

SMILES

C(C)O

Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not

really flat because of thermal fluctuations So we represent 3D

molecules by including their coordinates or their internal coordinates

bull Obtaining the 3-dimensional coordinates can involve experiment (x-

ray electron or neutron diffraction eg the Cambridge

Crystallographic Database or the Protein Databank ndash PDB)

bull From these can be obtained atom positions bonds coordinates etc

bull There are a number of 3-D construction methods available such as

Corina or Concord (put in a SMILES and get a 3-D molecule) which

use rules derived from experiment

bull Molecules can also be constructed in 2D and subjected to molecular

mechanics or Quantum Mechanics calculations to obtain 3D structures

bull Conformation still remains to be deduced There are many methods

that deduce conformations usually involving torsional angle rotation

to scan the conformational space (like a Ramachandran plot ndash but often

in many more torsional angle dimensions)

Molecules in 3D - uses

bull More accurate calculation of molecular

properties

bull Comparison of the shapes (conformations)

of molecules

bull Comparison of the dynamics of molecules

bull Calculation of bulk properties

bull Simulation of chemical reactionshelliphellip

The dynamics of a molecule can be computed and stored as a series of

frames of the coordinates and bonds of the structure (much like a cartoon)

Molecular Dynamics is the most popular method for larger systems Here is

an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a

series of snapshots of SD files concatenated together to make a movie just

like a film strip

httpsenwikipediaorgwikiCHARMM

httpambermdorg

httpwwwksuiuceduResearchnamd

httpwwwksuiuceduResearchvmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called ‘Apelin’ with water and the membrane. We used these to design new molecules that block this receptor.

Apelin’s role in Cancer

‘Apelin’ (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253

• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats – which could be a real pain… but they can be converted

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program – Babel

There have been recent IUPAC moves towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers – things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect – check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of MWt and topology) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but not observed in the file).

2. Finding molecules

A typical problem involves finding the ‘right’ molecules

All the molecules synthesisable: 10^60

Molecules with relevance to the problem (that we can generate): 10^10

Molecules we can search efficiently: 10^7

Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1. Finding molecules using Substructure searching

One of the most useful ‘properties’ of a molecule is its molecular graph.

Many software systems allow searching for whole molecules, or fragments of molecules, or even ‘similar’ molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we’ve used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example – find a molecule containing our query as part of a larger molecule

Results – takes a few seconds to search all 92 million molecules

How does this work?

The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called Hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Value: 0 0 0 0 0 1 0 0 0  0  0  1  0  0  0  0

(Figure labels: the set bits are annotated ‘Contains P’ and ‘Contains NH2’; an unset bit, marked X, is annotated ‘Doesn’t contain F’.)

So a quick match of the bitstring of the fragment searched over the (pre-computed) bitstrings in the database will eliminate most of the structures.
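The bitstring pre-filter can be sketched in a few lines. The fragment-to-bit assignment, the molecule names and the function names below are hypothetical, chosen only to illustrate the screen; production systems use much longer fingerprints.

```python
# Hypothetical fragment dictionary: fragment name -> bit position.
FRAGMENT_BITS = {"F": 2, "P": 5, "OH": 7, "phenyl": 9, "NH2": 11}

def fingerprint(fragments):
    """Pack a collection of fragment names into an integer bitstring."""
    bits = 0
    for name in fragments:
        bits |= 1 << FRAGMENT_BITS[name]
    return bits

def may_contain(query_bits, molecule_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (query_bits & molecule_bits) == query_bits

def screen(query_bits, database):
    """Names of the molecules that survive the bitstring pre-filter."""
    return [name for name, bits in database.items()
            if may_contain(query_bits, bits)]

if __name__ == "__main__":
    database = {
        "mol_a": fingerprint(["P", "NH2"]),
        "mol_b": fingerprint(["OH", "phenyl"]),
        "mol_c": fingerprint(["NH2", "phenyl", "OH"]),
    }
    print(screen(fingerprint(["NH2"]), database))
```

The screen can only rule molecules out. Survivors still have to go through the slower atom-by-atom substructure match, because a set bit says a fragment is present somewhere, not where or how it is connected.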

Next step – Substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware: salts etc. – you may need to strip out counterions.
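One common way to strip counterions before an exact match is to keep only the largest component of a multi-component SMILES. The sketch below assumes that "largest" can be estimated by counting atom tokens with a simple regular expression; this heuristic and the function names are illustrative, and a real toolkit would parse the SMILES properly.

```python
import re

# Bracket atoms count once; outside brackets only the organic subset appears.
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def atom_count(fragment):
    """Rough heavy-atom count of one SMILES component."""
    return len(ATOM_TOKEN.findall(fragment))

def strip_counterions(smiles):
    """Keep the largest '.'-separated component of a SMILES string."""
    return max(smiles.split("."), key=atom_count)

if __name__ == "__main__":
    print(strip_counterions("CC(=O)[O-].[Na+]"))
```

After stripping, the remaining component would still need to be canonicalised before comparing strings, since many different SMILES can describe the same molecule.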

Substructure Search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• ‘Subgraphs’ can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents, etc.

– –OH

– –NH2

– –COOH

– phenyl

• this can be done by tracing appropriate paths in the graph

• subgraphs may overlap

(Figure: structure diagram with atom labels OH, CH2, CHNH2, OH, O.)

Matching the query structure to the database

Two algorithms are commonly used:

The Morgan algorithm, which numbers the molecule uniquely, and

The Ullmann algorithm, which matches the fragments

Note – a number of ‘substructural’ fragments can be matched here (6 in all)

Read these for more details:

Gund P., Ann. Rep. in Med. Chem. 14, 299, 1979

Sheridan R.P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989

Brint A.T., Willett P., J. Mol. Graphics 5, 49, 1987

J.R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925

Morgan H.L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-112

The basic concept is that molecules are considered as Graphs

Atoms are ‘nodes’

Bonds are ‘edges’

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an initial integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - A technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107–113, 1965

The first step - Numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged

• Numbering starts from the highest, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest numbered

• Rules are used to break equalities, such as C before N, double bond before single, etc.

• Some equivalent atoms are arbitrarily numbered
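The iteration can be sketched as follows. This is a simplified reading, not Morgan's published procedure in full: the initial label is the degree only (atomic number and bond types are ignored), step (ii) is taken to mean "add the neighbours' labels to the atom's own label", and no tie-breaking rules are applied. The atom names and the cyclohexanone connectivity are written out by hand for illustration.

```python
def morgan_labels(adjacency):
    """Extended-connectivity labels for a graph given as {atom: [neighbours]}."""
    # Step (i): initial label = degree (a simplification of the full rule).
    labels = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): add the labels of the adjacent nodes to each label.
        updated = {atom: labels[atom] + sum(labels[n] for n in nbrs)
                   for atom, nbrs in adjacency.items()}
        n_updated = len(set(updated.values()))
        # Step (iii): stop once the number of different labels stops growing.
        if n_updated <= n_classes:
            return labels
        labels, n_classes = updated, n_updated

# Cyclohexanone heavy atoms: O bonded to C1; ring C1-C2-C3-C4-C5-C6-C1.
CYCLOHEXANONE = {
    "O": ["C1"],
    "C1": ["O", "C2", "C6"],
    "C2": ["C1", "C3"],
    "C3": ["C2", "C4"],
    "C4": ["C3", "C5"],
    "C5": ["C4", "C6"],
    "C6": ["C5", "C1"],
}

if __name__ == "__main__":
    labels = morgan_labels(CYCLOHEXANONE)
    for atom in sorted(labels, key=labels.get, reverse=True):
        print(atom, labels[atom])
```

For this graph the loop stops with five distinct labels, which matches the five symmetry classes of the heavy atoms; C2/C6 and C3/C5 share labels and would need the tie-breaking rules listed above to be numbered.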

The Ullmann Algorithm is a way of detecting substructures

Urea adjacency matrix (query fragment, atoms numbered O1, C2, C3, C4)

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

Cyclohexanone adjacency matrix (atoms numbered O1, C2 to C7; ring order assumed to be C4–C2–C5–C6–C7–C3)

     O1  C2  C3  C4  C5  C6  C7
O1    0   0   0   1   0   0   0
C2    0   0   0   1   1   0   0
C3    0   0   0   1   0   0   1
C4    1   1   1   0   0   0   0
C5    0   1   0   0   0   1   0
C6    0   0   0   0   1   0   1
C7    0   0   1   0   0   1   0

Numbering is crucial. Renumber for different starting points. Also, the ‘edges’ of the graph are annotated to speed up the matching. This is the ‘slow’ part of ‘matching subgraph isomorphism’ and is of the class of NP-complete problems.
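The matching step can be sketched as a plain backtracking search over atom-to-atom assignments. This is a simplified illustration in the spirit of Ullmann's method, without its matrix refinement step, and the function names are assumptions. The query (O1, C2, C3 all bonded to C4) and the target (cyclohexanone, with an assumed ring order C4–C2–C5–C6–C7–C3) are written out as neighbour sets.

```python
def find_matches(query, target):
    """Enumerate mappings of query atoms onto target atoms that preserve
    element labels and every query bond."""
    q_atoms = list(query["elements"])
    matches = []

    def compatible(q_atom, t_atom, mapping):
        if query["elements"][q_atom] != target["elements"][t_atom]:
            return False
        # Every already-mapped neighbour of q_atom must land on a neighbour of t_atom.
        return all(mapping[n] in target["bonds"][t_atom]
                   for n in query["bonds"][q_atom] if n in mapping)

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            matches.append(dict(mapping))
            return
        q_atom = q_atoms[len(mapping)]
        for t_atom in target["elements"]:
            if t_atom not in mapping.values() and compatible(q_atom, t_atom, mapping):
                mapping[q_atom] = t_atom
                extend(mapping)          # try to place the next query atom
                del mapping[q_atom]      # backtrack

    extend({})
    return matches

FRAGMENT = {
    "elements": {"O1": "O", "C2": "C", "C3": "C", "C4": "C"},
    "bonds": {"O1": {"C4"}, "C2": {"C4"}, "C3": {"C4"}, "C4": {"O1", "C2", "C3"}},
}
CYCLOHEXANONE = {
    "elements": {"O1": "O", "C2": "C", "C3": "C", "C4": "C",
                 "C5": "C", "C6": "C", "C7": "C"},
    "bonds": {"O1": {"C4"}, "C2": {"C4", "C5"}, "C3": {"C4", "C7"},
              "C4": {"O1", "C2", "C3"}, "C5": {"C2", "C6"},
              "C6": {"C5", "C7"}, "C7": {"C6", "C3"}},
}

if __name__ == "__main__":
    for match in find_matches(FRAGMENT, CYCLOHEXANONE):
        print(match)
```

The search places one query atom at a time and abandons a partial mapping as soon as a bond cannot be satisfied. The worst-case cost still grows very quickly with molecule size, which is why the bitstring screen is applied first.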


Next lecture

• How can we use this type of information to solve our chemistry problems?

– Patents/Markush Structures

– Molecular Similarity

– Molecular Properties

– 3D searching

– Virtual Screening

– Structure Property/Activity Relationships


Alanine SD file

Bond length is 1.53 Å

Atom block columns: x, y, z, symbol, mass difference, charge, stereo, H-count…

Charge codes: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3

Molecule is chiral
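The charge codes above translate into a small lookup table. The sketch below reads one atom line by splitting on whitespace, which is an assumption that holds for typical lines but is not the fixed-column parsing the format formally defines; the example line is made up for illustration and is not taken from the alanine file on the slide.

```python
# Charge code in the atom block -> formal charge (code 4 is a doublet radical).
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 4: 0, 5: -1, 6: -2, 7: -3}

def parse_atom_line(line):
    """Read x, y, z, symbol, mass difference and charge from one atom line."""
    fields = line.split()
    x, y, z = (float(value) for value in fields[:3])
    code = int(fields[5])
    return {
        "symbol": fields[3],
        "xyz": (x, y, z),
        "mass_diff": int(fields[4]),
        "charge": CHARGE_CODES[code],
        "radical": code == 4,
    }

if __name__ == "__main__":
    # Hypothetical atom line: a nitrogen carrying charge code 3 (i.e. +1).
    example = "   -0.6622    0.5342    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0"
    print(parse_atom_line(example))
```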

What if we only know the atom positions and not the bonds?

A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.

Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can’t always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.

(Figure caption: Which way round should this ring go?)
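Imputing bonds from positions is often done with a distance criterion: two atoms are taken as bonded if they are closer than the sum of their covalent radii plus some slack. The sketch below assumes commonly quoted single-bond covalent radii and a 0.4 Å tolerance; both the tolerance and the function name are illustrative choices, and real programs add element-specific rules on top.

```python
from itertools import combinations
from math import dist

COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}  # angstroms
TOLERANCE = 0.4  # angstroms; assumed slack on top of the radii sum

def infer_bonds(atoms):
    """Return index pairs (i, j) of atoms close enough to be called bonded.

    `atoms` is a list of (element_symbol, (x, y, z)) tuples.
    """
    bonds = []
    for (i, (sym_i, pos_i)), (j, (sym_j, pos_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADII[sym_i] + COVALENT_RADII[sym_j] + TOLERANCE
        if dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds

if __name__ == "__main__":
    water = [("O", (0.0, 0.0, 0.0)),
             ("H", (0.96, 0.0, 0.0)),
             ("H", (-0.24, 0.93, 0.0))]
    print(infer_bonds(water))
```

A pure distance test cannot assign bond orders or tell nitrogen from oxygen, which is why imputed bonds in protein structures are not always correct.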

Example protein databank file (pdb)

(bonds are listed explicitly in CONECT records, i.e. as an adjacency list)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

Annotations:

CRYST1 – space group and unit cell dimensions

ATOM – protein atoms, including xyz, occupancy and temperature factor

HETATM – non-protein atoms, including xyz, occupancy and temperature factor

CONECT – bonds

New format: mmCIF
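ATOM and HETATM records are fixed-column text, so they are read by slicing rather than by splitting on spaces. The sketch below assembles one sample record field by field to make the column widths explicit; the function name and dictionary keys are illustrative assumptions, and only the most commonly used fields are read.

```python
def parse_atom_record(line):
    """Slice the main fields out of a PDB ATOM/HETATM record (0-based slices)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }

# One record built column by column (widths 6,5,1,4,1,3,1,1,4,4,8,8,8,6,6).
SAMPLE = ("ATOM  " + "    2" + " " + " CA " + " " + "VAL" + " " + "A" + "   1"
          + "    " + "   4.283" + "  -0.343" + "  -3.015" + "  1.00" + " 28.41")

if __name__ == "__main__":
    print(parse_atom_record(SAMPLE))
```

Splitting on whitespace works for most lines but breaks when adjacent fields fill their columns completely and run together, which is why the column positions matter.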

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.

This uses internal coordinates

• Also called a Z-matrix

– Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)

– Early form of specification of a starting geometry for molecules – sometimes graph paper was used to draw the molecule and get a starting set of coordinates before optimisation

– A Z-matrix uses the following geometric descriptions to describe molecules:

Bond length

Bond angle

Torsion angle

Out-of-plane bending, e.g. a carbonyl

Non-bonded distance

(Figure: internal coordinates illustrated on example structures, with dummy atom (du) positions marked.)

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

(Figure labels: z-matrix; ‘improper’ torsion angles)
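Converting the water Z-matrix back to Cartesian coordinates shows what the two variables mean geometrically. The sketch below handles only this three-atom case; the choice of axes (oxygen at the origin, first hydrogen along +z, second hydrogen in the xz-plane) is a convention assumed here, and the function name is illustrative.

```python
from math import cos, sin, radians, dist

def water_cartesian(l1=0.96, a1=104.0):
    """Place O, H, H from the Z-matrix:  O / H 1 l1 / H 1 l1 2 a1."""
    theta = radians(a1)
    oxygen = (0.0, 0.0, 0.0)
    h1 = (0.0, 0.0, l1)                             # bonded to atom 1 at distance l1
    h2 = (l1 * sin(theta), 0.0, l1 * cos(theta))    # angle a1 measured at the oxygen
    return oxygen, h1, h2

if __name__ == "__main__":
    oxygen, h1, h2 = water_cartesian()
    print("O ", oxygen)
    print("H ", h1)
    print("H ", h2)
    print("H...H distance:", round(dist(h1, h2), 3))
```

Changing `l1` or `a1` moves the atoms along a chemically meaningful coordinate, which is exactly why internal coordinates are convenient for driving a reaction coordinate.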

What about comprehensive properties of molecules? They are more than xyz

XML and molecules

• XML is a markup language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.

• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.

• Chemical Markup Language (CML) is being developed specifically for chemistry.

• In the future much more information will be stored with molecules, allowing greater re-use of data.

• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928

Ethanol

<CML>

- Can be parsed

- Can contain reactions, properties, etc.

- Can contain relationships to other molecules and also concepts

InChI

InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES

C(C)O
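To make "can be parsed" concrete, here is a minimal sketch that writes and re-reads a CML-style description of ethanol with Python's standard XML library. The element and attribute names follow common CML usage (`molecule`, `atomArray`, `atom`, `bondArray`, `bond`, `elementType`, `atomRefs2`), but the document is deliberately stripped down: it has no namespace, hydrogens are left implicit, and it has not been validated against the CML schema.

```python
import xml.etree.ElementTree as ET

def ethanol_cml():
    """Build a small CML-style element tree for ethanol (heavy atoms only)."""
    molecule = ET.Element("molecule", id="ethanol")
    ET.SubElement(molecule, "name").text = "ethanol"
    atoms = ET.SubElement(molecule, "atomArray")
    for atom_id, element in (("a1", "C"), ("a2", "C"), ("a3", "O")):
        ET.SubElement(atoms, "atom", id=atom_id, elementType=element)
    bonds = ET.SubElement(molecule, "bondArray")
    for pair in ("a1 a2", "a2 a3"):
        ET.SubElement(bonds, "bond", atomRefs2=pair, order="1")
    return molecule

if __name__ == "__main__":
    xml_text = ET.tostring(ethanol_cml(), encoding="unicode")
    print(xml_text)
    parsed = ET.fromstring(xml_text)          # any XML parser can read it back
    print([atom.get("elementType") for atom in parsed.iter("atom")])
```

Because the format is extensible, further elements (properties, units, links to other molecules) could be added to the same file without breaking programs that only read atoms and bonds.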


Page 23: Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH

What if we only know the atom positions and not the bonds

A key example would be x-ray crystallography Here we determine the

positions of the atoms and impute the bonds from our chemical knowledge

Here is an electron density map from an x-ray experiment We see the

electrons but not the nuclei In a small molecule this is very accurate and we

can almost see the bonds However in eg a protein structure which is very

large we canrsquot always determine the atom positions exactly so these are

stored and the bonds imputed The format for storing these is a bit different

Which way

round should

this ring go

Example protein databank file (pdb)

(uses an adjacency matrix)

HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END

Annotations on the file:

• CRYST1: space group and unit cell dimensions
• ATOM: protein atoms, including xyz, occupancy and temperature factor
• HETATM: non-protein atoms, including xyz, occupancy and temperature factor
• CONECT: bonds

New format: mmCIF
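Because the legacy PDB format is column based, an ATOM or HETATM line is read by slicing fixed character positions rather than by splitting on spaces. The following is a minimal sketch under that assumption; `parse_atom_record` is a hypothetical helper and it ignores fields such as alternate locations and charges.

```python
def parse_atom_record(line):
    """Parse one ATOM/HETATM line using the fixed PDB column positions."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None  # not a coordinate record
    return {
        "record": record,
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),  # temperature factor
    }

# The sulfur atom of the sulfate ion, laid out with a format string so the
# column widths are explicit.
sample = "{:<6}{:>5} {:<4} {:>3} {}{:>4}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}".format(
    "HETATM", 1510, " S", "SO4", "A", 187, 25.993, -1.362, 5.893, 1.00, 22.34)

atom = parse_atom_record(sample)
print(atom["name"], atom["res_name"], atom["x"], atom["y"], atom["z"])
```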

What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.

This uses internal coordinates

• Also called a Z-matrix
- Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
- Early form of specification of a starting geometry for molecules: sometimes the molecule was drawn on graph paper to get a starting set of coordinates before optimisation
- A Z-matrix uses the following geometric descriptions to describe molecules:

Bond length
Bond angle
Torsion angle
Out-of-plane bending (e.g. a carbonyl)
Non-bonded distance

[Figure: internal coordinates of example molecules, showing dummy atom (du) positions]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here, after a blank line, rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

'improper' torsion angles
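To see how distances, angles and torsions fix the atom positions, here is a minimal sketch that converts a Z-matrix in the layout described above into Cartesian coordinates. It is an illustration rather than Gaussian's own implementation: `zmatrix_to_cartesian` and its helpers are hypothetical names, the first atom is placed at the origin and the second on the z axis by assumption, and linear (180 degree) reference angles are not handled.

```python
import math

def _value(token, variables):
    """Resolve a token: a number, a variable name, or a negated variable."""
    try:
        return float(token)
    except ValueError:
        pass
    if token.startswith("-"):
        return -variables[token[1:]]
    return variables[token]

def _unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def zmatrix_to_cartesian(text):
    atom_block, _, var_block = text.strip().partition("\n\n")
    variables = {}
    for line in var_block.splitlines():
        if line.strip():
            name, value = line.split()
            variables[name] = float(value)
    symbols, coords = [], []
    for line in atom_block.splitlines():
        tok = line.split()
        symbols.append(tok[0])
        n = len(coords)
        if n == 0:                      # first atom: origin
            coords.append((0.0, 0.0, 0.0))
            continue
        c = coords[int(tok[1]) - 1]     # atom NA
        r = _value(tok[2], variables)
        if n == 1:                      # second atom: along z
            coords.append((c[0], c[1], c[2] + r))
            continue
        b = coords[int(tok[3]) - 1]     # atom NB
        theta = math.radians(_value(tok[4], variables))
        bc = _unit(tuple(ci - bi for ci, bi in zip(c, b)))
        if n == 2:                      # third atom: placed in the xz plane
            m, phi = (1.0, 0.0, 0.0), 0.0
        else:                           # later atoms: use NC and the dihedral
            a = coords[int(tok[5]) - 1]
            phi = math.radians(_value(tok[6], variables))
            ab = tuple(ai - bi for ai, bi in zip(a, b))
            proj = sum(x * y for x, y in zip(ab, bc))
            m = _unit(tuple(x - proj * y for x, y in zip(ab, bc)))
        nvec = _cross(bc, m)
        coords.append(tuple(
            ci + r * (-math.cos(theta) * bci
                      + math.sin(theta) * (math.cos(phi) * mi + math.sin(phi) * ni))
            for ci, bci, mi, ni in zip(c, bc, m, nvec)))
    return symbols, coords

WATER = """O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0"""

symbols, coords = zmatrix_to_cartesian(WATER)
for symbol, xyz in zip(symbols, coords):
    print(symbol, " ".join(f"{x:8.4f}" for x in xyz))
```

Changing `a1` or `l1` and rerunning moves the atoms consistently, which is much easier than editing the xyz values by hand.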

What about comprehensive properties of molecules? They are more than xyz.

XML and molecules

• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.

• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.

• Chemical Markup Language (CML) is being developed specifically for chemistry.

• In the future much more information will be stored with molecules, allowing greater re-use of data.

• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928.

Ethanol

<CML>

- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts

InChI

InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES

C(C)O
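As a rough illustration of what "can be parsed" means, the sketch below builds a small CML-style record for ethanol with Python's standard XML library and reads it back. The element and attribute names follow common CML usage (`atomArray`, `elementType`, `atomRefs2`), but the property entry and its units label are simplified assumptions, not a schema-valid CML document.

```python
import xml.etree.ElementTree as ET

def ethanol_cml():
    """Build a small CML-style description of ethanol (heavy atoms only)."""
    molecule = ET.Element("molecule", id="ethanol")
    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, element in (("a1", "C"), ("a2", "C"), ("a3", "O")):
        ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)
    bond_array = ET.SubElement(molecule, "bondArray")
    for refs in ("a1 a2", "a2 a3"):
        ET.SubElement(bond_array, "bond", atomRefs2=refs, order="1")
    # Metadata travels with the value: here the units of a property.
    prop = ET.SubElement(molecule, "property", title="molar mass")
    scalar = ET.SubElement(prop, "scalar", units="g/mol")  # illustrative label
    scalar.text = "46.07"
    return molecule

xml_text = ET.tostring(ethanol_cml(), encoding="unicode")
parsed = ET.fromstring(xml_text)
elements = [atom.get("elementType") for atom in parsed.iter("atom")]
print(xml_text)
print(elements)  # ['C', 'C', 'O']
```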

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.

• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).

• From these can be obtained atom positions, bonds, coordinates etc.

• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.

• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.

• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
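The cost of scanning torsional angles systematically can be seen in a small sketch. The 60 degree step and the function names are assumptions chosen for illustration; real conformer generators prune this grid heavily.

```python
from itertools import product

def torsion_grid(n_torsions, step_degrees=60):
    """Iterate over every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_degrees)
    return product(angles, repeat=n_torsions)

def grid_size(n_torsions, step_degrees=60):
    """Number of grid points: (360 / step) raised to the number of torsions."""
    return (360 // step_degrees) ** n_torsions

for n in (1, 2, 4, 8):
    print(n, "rotatable bonds:", grid_size(n), "conformations to evaluate")

# Two torsions correspond to a Ramachandran-style 2D map.
print(list(torsion_grid(2, 120)))
```

With eight rotatable bonds the 60 degree grid already has over a million points, which is why exhaustive scanning is limited to small numbers of torsions.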

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
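The film-strip idea can be sketched in a few lines. For brevity this uses the simple multi-frame XYZ text layout (atom count, comment line, then one line per atom) instead of SD files, and `write_xyz_trajectory` is a hypothetical helper, not part of any of the packages linked above.

```python
def write_xyz_trajectory(symbols, frames):
    """Concatenate coordinate snapshots into one multi-frame XYZ string."""
    lines = []
    for index, frame in enumerate(frames):
        lines.append(str(len(symbols)))   # number of atoms in this frame
        lines.append(f"frame {index}")    # free-text comment line
        for symbol, (x, y, z) in zip(symbols, frame):
            lines.append(f"{symbol} {x:.3f} {y:.3f} {z:.3f}")
    return "\n".join(lines) + "\n"

# Two snapshots of an O-H bond stretching (coordinates in angstroms).
frames = [
    [(0.000, 0.000, 0.000), (0.000, 0.000, 0.960)],
    [(0.000, 0.000, 0.000), (0.000, 0.000, 1.000)],
]
trajectory = write_xyz_trajectory(["O", "H"], frames)
print(trajectory)
```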

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor.

Apelin's role in Cancer

'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253

• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats - which could be a real pain... but they can be converted

alc -- Alchemy file prep -- Amber PREP file

bs -- Ball & Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI Biosym/Insight II CAR file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball & Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file

cht -- Chemtool file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 3.5 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

Recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect; check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of MWt and topology) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).

2 Finding molecules

A typical problem involves finding the 'right' molecules

All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1 Finding molecules using substructure searching

One of the most useful 'properties' of a molecule is its molecular graph. Many software systems allow searching for whole molecules, or fragments of molecules, or even 'similar' molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example: find a molecule containing our query as part of a larger molecule.

Results: it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly the whole database is not being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
 0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

(bit labels shown in the figure: Contains P; Contains NH2; Doesn't contain F)

So a quick match of the bitstring of the fragment searched over the (pre-computed) bitstrings in the database will eliminate most of the structures.
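The screening step can be sketched with integers used as bitstrings. The fragment dictionary, the hand-assigned fragment lists and the function names below are all hypothetical; a real system would compute the fragments from the structures.

```python
# Hypothetical fragment dictionary: one bit position per fragment.
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "phenyl": 4, "COOH": 5}

def fingerprint(fragments):
    """Set one bit for each fragment present in the molecule."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << FRAGMENT_BITS[fragment]
    return bits

def screen(query_bits, database):
    """Keep only entries whose fingerprint contains every bit set in the query."""
    return [name for name, bits in database.items()
            if (bits & query_bits) == query_bits]

# Pre-computed once, when the database is built (fragments assigned by hand here).
database = {
    "aniline": fingerprint(["NH2", "phenyl"]),
    "phenol": fingerprint(["OH", "phenyl"]),
    "glycine": fingerprint(["NH2", "COOH"]),
    "4-aminophenol": fingerprint(["NH2", "OH", "phenyl"]),
}

query = fingerprint(["NH2", "phenyl"])
print(screen(query, database))  # ['aniline', '4-aminophenol']
```

The screen can only rule structures out. Anything that passes still has to go through the slower atom-by-atom substructure match.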

Next step: substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware salts etc.; you may need to strip out counterions.
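A minimal sketch of exact matching with salt stripping follows. It assumes the query and the database keys were canonicalised by the same toolkit beforehand (plain string comparison does no chemistry), and it uses the longest dot-separated SMILES component as a crude stand-in for "the parent molecule". The function names and the one-entry index are hypothetical.

```python
def strip_counterions(smiles):
    """Keep the longest dot-separated component of a SMILES string."""
    return max(smiles.split("."), key=len)

def exact_match(smiles, index):
    """Look up a structure by its (already canonical) SMILES string."""
    return index.get(strip_counterions(smiles))

# Hypothetical index keyed by canonical SMILES.
index = {"CN(C)c1ccccc1": "N,N-dimethylaniline"}

print(exact_match("CN(C)c1ccccc1.Cl", index))   # hydrochloride salt of the same amine
print(strip_counterions("[Na+].CC(=O)[O-]"))    # CC(=O)[O-]
```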

Substructure Search

Supposing we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

Two steps are commonly used to do this. First, we have to number the atoms of all the molecules consistently (the Morgan algorithm), for example so that two structures can be compared to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
- -OH
- -NH2
- -COOH
- phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap

[Figure: example structure with the groups OH, CH2, CHNH2, OH and O labelled]

Matching the query structure to the database

Two algorithms are commonly used:

The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments.

Note: a number of 'substructural' fragments can be matched here (6 in all).

Read these for more details:

Gund, P., Ann. Rep. in Med. Chem. 14, 299 (1979)

Sheridan, R. P. et al., J. Chem. Inf. Comp. Sci. 29, 255 (1989)

Brint, A. T., Willett, P., J. Mol. Graphics 5, 49 (1987)

Ullmann, J. R., An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM) 23, 31-42 (1976), http://portal.acm.org/citation.cfm?id=321925

Morgan, H. L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 107-113 (1965)

The basic concept is that molecules are considered as graphs:

Atoms are 'nodes'

Bonds are 'edges'

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
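A labelled graph maps directly onto simple data structures. The sketch below stores a four-atom carbonyl fragment (O1, C2, C3, C4, with C4 the central atom) as node and edge dictionaries and derives the unlabelled adjacency matrix from them. The variable and function names are illustrative choices, not a standard representation.

```python
# Node labels: atom number -> element symbol.
atoms = {1: "O", 2: "C", 3: "C", 4: "C"}
# Edge labels: (atom, atom) -> bond order.
bonds = {(1, 4): 2, (2, 4): 1, (3, 4): 1}

def adjacency_matrix(atoms, bonds):
    """Return a symmetric 0/1 matrix with rows and columns in atom-number order."""
    order = sorted(atoms)
    index = {atom: k for k, atom in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for i, j in bonds:
        matrix[index[i]][index[j]] = 1
        matrix[index[j]][index[i]] = 1
    return matrix

for atom, row in zip(sorted(atoms), adjacency_matrix(atoms, bonds)):
    print(f"{atoms[atom]}{atom}", row)
```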

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107-113 (1965)

The first step: numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest numbered
• Rules are used to break equalities, such as C before N, double bond before single etc.
• Some equivalent atoms are arbitrarily numbered
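The label-update loop can be sketched as follows. This is a simplified version that starts from the degree of each atom only (no atomic numbers or bond types) and leaves out the tie-breaking rules, so it illustrates the iteration rather than a full canonical numbering. `morgan_labels` is a hypothetical name.

```python
def morgan_labels(adjacency):
    """Iterate neighbour sums until the number of distinct labels stops growing."""
    labels = [sum(row) for row in adjacency]          # step (i): degree of each atom
    n_classes = len(set(labels))
    while True:
        updated = [sum(labels[j] for j, bonded in enumerate(row) if bonded)
                   for row in adjacency]              # step (ii): sum neighbour labels
        updated_classes = len(set(updated))
        if updated_classes <= n_classes:              # step (iii): no more classes
            return labels
        labels, n_classes = updated, updated_classes

# Carbon skeleton of pentane, C1-C2-C3-C4-C5.
pentane = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
labels = morgan_labels(pentane)
print(labels)                      # [2, 3, 4, 3, 2]
print(labels.index(max(labels)))   # 2: numbering would start at the central carbon
```

The two end carbons keep equal labels, as do the two inner ones. These are the equivalent atoms that end up arbitrarily numbered.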

Urea

Adjacency matrix

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

Cyclohexanone adjacency matrix (atoms O1, C2, C3, C4, C5, C6, C7): in the 7 x 7 matrix the row for O1 contains one entry, the row for C4 contains three, and each of the other rows contains two.

Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of 'matching subgraph isomorphism' and is of the class of NP-complete problems.

The Ullmann Algorithm is a way of detecting substructures.
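Substructure detection on adjacency matrices can be sketched with plain backtracking. This is not the full Ullmann algorithm, which adds a refinement step that prunes candidate atoms before and during the search; it only shows the underlying idea of extending an atom-to-atom mapping and undoing it when a bond is missing. The ring numbering used for cyclohexanone (C4 bonded to O1, C2 and C3; ring closed through C5, C7 and C6) is an assumption, and all names are hypothetical.

```python
def adjacency(n_atoms, bonds):
    """Build a symmetric 0/1 matrix from a list of (i, j) index pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix

def find_substructure(query, target):
    """Return a list mapping query atom index -> target atom index, or None."""
    q_labels, q_adj = query
    t_labels, t_adj = target
    mapping = []  # mapping[i] is the target atom matched to query atom i

    def extend():
        i = len(mapping)
        if i == len(q_labels):
            return True                       # every query atom is placed
        for candidate in range(len(t_labels)):
            if candidate in mapping or t_labels[candidate] != q_labels[i]:
                continue
            # Each query bond to an already placed atom must exist in the target.
            if all(t_adj[candidate][mapping[j]] for j in range(i) if q_adj[i][j]):
                mapping.append(candidate)
                if extend():
                    return True
                mapping.pop()                 # backtrack
        return False

    return list(mapping) if extend() else None

# Query: O1, C2, C3 all bonded to C4 (indices 0 to 3).
query = (["O", "C", "C", "C"], adjacency(4, [(0, 3), (1, 3), (2, 3)]))
# Target: cyclohexanone, atoms O1, C2 ... C7 as indices 0 to 6 (assumed numbering).
cyclohexanone = (["O", "C", "C", "C", "C", "C", "C"],
                 adjacency(7, [(0, 3), (1, 3), (2, 3), (1, 4), (2, 5), (4, 6), (5, 6)]))
# A ring with no oxygen cannot contain the query.
cyclohexane = (["C"] * 6,
               adjacency(6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]))

print(find_substructure(query, cyclohexanone))  # [0, 1, 2, 3]
print(find_substructure(query, cyclohexane))    # None
```

In the worst case the search tries every assignment of query atoms to target atoms, which is where the NP-complete cost shows up and why the bitstring screen is applied first.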


Next lecture

• How can we use this type of information to solve our chemistry problems?
- Patents/Markush structures
- Molecular similarity
- Molecular properties
- 3D searching
- Virtual screening
- Structure-Property/Activity Relationships

Page 24: Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH

Example protein databank file (pdb)

(uses an adjacency matrix)

HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A

TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684

EXPDTA X-RAY DIFFRACTION

REMARK 2 RESOLUTION 180 ANGSTROMS

REMARK 200 TEMPERATURE (KELVIN) 113

REMARK 200 PH 69

SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN

SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO

CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4

ORIGX1 1000000 0000000 0000000 000000

ORIGX2 0000000 1000000 0000000 000000

ORIGX3 0000000 0000000 1000000 000000

SCALE1 0018319 0000000 0000000 000000

SCALE2 0000000 0018147 0000000 000000

SCALE3 0000000 0000000 0015426 000000

ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128

ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841

ATOM 3 C VAL A 1 4565 1067 -3593 100 2764

ATOM 4 O VAL A 1 4653 1287 -4853 100 2717

ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001

HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234

HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213

HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410

HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534

CONECT 1510 1511 1512 1513 1514

CONECT 1511 1510

CONECT 1512 1510

MASTER 334 0 6 6 18 0 11 6 1583 1 95 15

END

Space group and unit cell

dimensions

Protein atoms ndash including xyz

occupancy and temperature factor

Non-protein atoms ndash including xyz

occupancy and temperature factor

Bonds New format

mmCIF

What if we want to vary the atom

positions eg in driving a

reaction coordinate

Using Cartesian (xyz)

coordinates is very

cumbersome so instead

we use the natural angles

and distances

This uses internal coordinates

bull Also called a Z-matrix

ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)

ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation

ndash A z-matrix uses the following geometric descriptions to describe molecules

Bond Length

Bond angle

Torsion angle

Out of plane bending

eg a carbonyl

Non-bonded distance

C

O

du

du

O

N OH

Dummy atom

positions

Internal

Coordinates

How to construct a Z-matrix (in Gaussian format)

1For the first atom to be defined give the atomic symbol

only

2For the second atom give the atomic symbol the number

1 and the name of a variable to describe the distance

between atoms 1 and 2

3For the third atom give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB and

the name of a variable to describe the angle between the

current atom NA and NB

4For all later atoms give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB the

name of a variable to describe the angle between the current

atom NA and NB the atom number of another previously

defined atom NC and finally the name of a variable to

describe the dihedral angle between the current atom NA

NB and NC

5After all the atoms have been listed enter a blank line

6Next list each variable with its corresponding value Use a

separate line for each variable

7In some cases where some of the variables are to be fixed

as constants in a geometry optimisation they are listed here

after a blank line rather than above

with the real variables

8End the Z-matrix with a blank line

Water (C2v)

O

H 1 l1

H l l1 2 a1

l1 096

a1 1040

Methanol Z-matrix

C

O 1 l1

H 1 l2 2 a1

H 1 l3 2 a2 3 da1

H 1 l3 2 a2 3 -da1

H 2 l4 1 a3 3 1800

l1 142

l2 109

l3 109

l4 109

l5 109

l6 10

a1 1090

a2 1100

a3 1080

a4 1100

a5 1100

da1 600

da2 1200

da3 600 z-matrix

lsquoimproperrsquo torsion angles

What about comprehensive

properties of molecules ndash they

are more than xyz

XML and molecules

bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc

bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file

bull Chemical Markup Language (CML) is being developed specifically for chemistry

bull In the future much more information will be stored with molecules allowing greater re-use of data

bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928

Ethanol

ltCMLgt

-Can be parsed

-Can contain reactions

properties etc

-Can contain

relationships to other

molecules and also

concepts

InChI

InChI=1C2H6Oc1-2-3h3H2H21H3

SMILES

C(C)O

Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not

really flat because of thermal fluctuations So we represent 3D

molecules by including their coordinates or their internal coordinates

bull Obtaining the 3-dimensional coordinates can involve experiment (x-

ray electron or neutron diffraction eg the Cambridge

Crystallographic Database or the Protein Databank ndash PDB)

bull From these can be obtained atom positions bonds coordinates etc

bull There are a number of 3-D construction methods available such as

Corina or Concord (put in a SMILES and get a 3-D molecule) which

use rules derived from experiment

bull Molecules can also be constructed in 2D and subjected to molecular

mechanics or Quantum Mechanics calculations to obtain 3D structures

bull Conformation still remains to be deduced There are many methods

that deduce conformations usually involving torsional angle rotation

to scan the conformational space (like a Ramachandran plot ndash but often

in many more torsional angle dimensions)

Molecules in 3D - uses

bull More accurate calculation of molecular

properties

bull Comparison of the shapes (conformations)

of molecules

bull Comparison of the dynamics of molecules

bull Calculation of bulk properties

bull Simulation of chemical reactionshelliphellip

The dynamics of a molecule can be computed and stored as a series of

frames of the coordinates and bonds of the structure (much like a cartoon)

Molecular Dynamics is the most popular method for larger systems Here is

an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a

series of snapshots of SD files concatenated together to make a movie just

like a film strip

httpsenwikipediaorgwikiCHARMM

httpambermdorg

httpwwwksuiuceduResearchnamd

httpwwwksuiuceduResearchvmd

We can simulate an entire biological receptor

and test new molecules to see how they might

work

bull Here is a simulation of an important new cancer target

called lsquoApelinrsquo with water and the membrane We used

these to design new molecules that block this receptor

Apelinrsquos role in Cancer

lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an

antagonist (a molecule that blocks the receptor) stop tumour growth

bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma

growth Brain 2017 awx253 httpsdoiorg101093brainawx253

bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et

al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372

So there are very many file formats

ndash which could be a real painbut

they can be convertedalc -- Alchemy file prep -- Amber PREP file

bs -- Ball amp Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical Resource

Kit 2D file

crk3d -- CRK3D Chemical Resource

Kit 3D file

box -- Dock 35 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI BiosymInsight II CAR

file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball amp Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table

file

cht -- Chemtool file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical

Resource Kit 2D file

crk3d -- CRK3D Chemical

Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 35 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats

-the bane of our lives

Interconnection program ndash Babel

Recent IUPAC moves towards a lsquostandardrsquo format however

In the near future there are likely to be many competing

requirements for file content

Molecules on computers ndash things to look out for

since what is stored is actually quite crude

For example-

Stereochemistry may be relative and not absolute or even incorrect

In proteins only the HEAVY atom positions are observed (sometimes at low resolution)

so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen

and oxygen get confused

Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved

correctly (problems with storing bonding)

Tautomers can be incorrect ndash check they look reasonable

Mesomers can be incorrect (double bonds)

Polymers (which are mixtures of MWt and topology) are very difficult to store

Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics

halogen-aromatic bonds etc may be inferred but not observed in the file)

2 Finding molecules

A typical problem involves finding the

lsquorightrsquo molecules

All the molecules

synthesisable

1060 Molecules with

relevance to the problem

1010 (that we can generate)

Molecules we can

search efficiently 107

Molecules we can

make and test 103

Serious difference in numbers we can consider a few hundred

molecules in our heads computers can evaluate millions

1 Finding molecules using Substructure searching

One of the most useful lsquopropertiesrsquo of a molecule is its molecular graph

Many software systems allow searching for whole molecules or

fragments of molecules or even lsquosimilarrsquo molecules

For a specific molecule this means specifying the required search pattern to get an

exact match

An example might be Search for an exact match for dimethylaniline Here wersquove

used the PubChem database which contains compounds and pharmacological

screening data on 92 million unique molecules httppubchemncbinlmnihgov

A more complex example ndash find a molecule

containing our query as part of a larger molecule

Results ndash takes a few seconds to search all

92 million molecules

How does this workThe first thing to notice is that the searching is so fast that clearly all the database

is not being searched

The first part of the process is to pre-compute pointers to only the sets of

compounds in the database we are interested in

This is called Hashing

Speeding up compound matching using hash codes

bull The simplest Hash code registers the presence or absence

of fragments in a molecule eg

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0

Contains PContains NH2

Doesnrsquot contain F

X

So a quick match of the bitstring of the fragment searched over the

(pre-computed) bitstrings in the database will eliminate most of the

structures

Next step ndash Substructure (pattern) matching

Exact match

Match the canonical SMILES InCHI or full structure

Beware salts etc may need to strip out counterions

Substructure search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a subgraph of the molecular graph. The figure shows an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

Two steps are commonly used to do this. First, we have to number the atoms of each molecule consistently (the Morgan algorithm), for example so that two structures can be compared to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• ‘Subgraphs’ can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
  – –OH
  – –NH2
  – –COOH
  – phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap

[Figure: example structure with labelled groups OH, CH2, CH, NH2, OH, O]

Matching the query structure to the database

Two algorithms are commonly used:
the Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.

Note – a number of ‘substructural’ fragments can be matched here (6 in all).

Read these for more details:

Gund, P. Ann. Rep. in Med. Chem. 14, 299 (1979)
Sheridan, R. P. et al. J. Chem. Inf. Comp. Sci. 29, 255 (1989)
Brint, A. T.; Willett, P. J. Mol. Graphics 5, 49 (1987)
Ullmann, J. R. An Algorithm for Subgraph Isomorphism. Journal of the Association for Computing Machinery (JACM) 23, 31-42 (1976). http://portal.acm.org/citation.cfm?id=321925
Morgan, H. L. The generation of a unique machine description for chemical structures. Journal of Chemical Documentation 5, 107-113 (1965)

The basic concept is that molecules are considered as graphs:

Atoms are ‘nodes’
Bonds are ‘edges’

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
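As a minimal sketch, a labelled graph can be held in two dictionaries. The example molecule (the heavy atoms of acetaldehyde), the atom numbering and the helper names are choices made for this illustration only.

```python
# Node labels: atom number -> element symbol (heavy atoms of acetaldehyde, CH3-CH=O).
ATOMS = {1: "C", 2: "C", 3: "O"}

# Edge labels: unordered atom pair -> bond order.
BONDS = {frozenset((1, 2)): 1, frozenset((2, 3)): 2}


def neighbours(atom):
    """Atoms joined to `atom` by an edge."""
    return sorted(next(iter(pair - {atom})) for pair in BONDS if atom in pair)


def degree(atom):
    """Number of edges (bonds) at a node."""
    return len(neighbours(atom))


def bond_order(a, b):
    """Edge label between two atoms, or 0 when they are not bonded."""
    return BONDS.get(frozenset((a, b)), 0)
```

Using an unordered pair as the edge key means the bond 2-3 and the bond 3-2 are the same entry, which matches the undirected nature of a molecular graph.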

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike numbering by humans) and fast. The Morgan algorithm works as follows:

(i) Assign an initial integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service. J. Chemical Documentation 5, 107–113 (1965)

The first step – numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest-labelled atom, then takes its highest-labelled connection as 2, etc. Like an onion: concentric circles out from the highest-labelled atom
• Rules are used to break equalities, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered
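The procedure above can be sketched as follows. This is a simplified version under stated assumptions: the initial label is just the degree (atomic number and bond types are ignored), each update replaces a label by the sum of the neighbouring labels, ties are broken by the input atom number rather than by chemical rules, and the molecule is assumed to be one connected graph. The function names and the n-pentane example are illustrative.

```python
from collections import deque


def morgan_labels(adjacency):
    """Iterate connectivity labels until the number of distinct labels stops increasing."""
    labels = {atom: len(nbrs) for atom, nbrs in adjacency.items()}  # start from degree
    n_classes = len(set(labels.values()))
    while True:
        updated = {atom: sum(labels[n] for n in nbrs)
                   for atom, nbrs in adjacency.items()}
        n_updated = len(set(updated.values()))
        if n_updated <= n_classes:
            return labels  # keep the last labelling that still improved
        labels, n_classes = updated, n_updated


def morgan_numbering(adjacency):
    """Number atoms outwards from the highest label, shell by shell (the 'onion')."""
    labels = morgan_labels(adjacency)

    def rank(atom):
        return (-labels[atom], atom)  # highest label first, atom number breaks ties

    start = min(adjacency, key=rank)
    order, seen, queue = [], {start}, deque([start])
    while queue:
        atom = queue.popleft()
        order.append(atom)
        for nbr in sorted(adjacency[atom], key=rank):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order


# Carbon skeleton of n-pentane: atom -> list of bonded atoms.
PENTANE = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
```

For n-pentane the labels settle after one update, giving three classes (the middle carbon, the two next to it, and the two ends), and the numbering starts from the middle carbon.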

The Ullmann algorithm is a way of detecting substructures

Urea adjacency matrix (atoms labelled O1, C2, C3, C4 in the figure; C4 is bonded to the other three):

      O1  C2  C3  C4
O1     0   0   0   1
C2     0   0   0   1
C3     0   0   0   1
C4     1   1   1   0

Cyclohexanone adjacency matrix (atoms labelled O1 and C2 to C7 in the figure): the rows for O1, C2, C3, C4, C5, C6 and C7 contain 1, 2, 2, 3, 2, 2 and 2 entries respectively, one for each bonded neighbour.

Numbering is crucial: renumber for different starting points. Also, the ‘edges’ of the graph are annotated to speed up the matching. This is the ‘slow’ part of matching (subgraph isomorphism), which is in the class of NP-complete problems.
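A much simplified sketch of matching on adjacency matrices is given below. It keeps the candidate table (a query atom may only map to a target atom of the same element and at least the same degree) and the backtracking search, but leaves out the neighbourhood refinement step that makes Ullmann's published algorithm efficient, so it should be read as an illustration and not as the full method. The ring connectivity chosen for cyclohexanone is an assumption that is consistent with the row counts above.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 matrix from a list of (i, j) bonds (0-based indices)."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix


def find_substructure(query_adj, query_elems, target_adj, target_elems):
    """Return every mapping (query atom i -> target atom) that embeds the query."""
    nq, nt = len(query_adj), len(target_adj)
    q_deg = [sum(row) for row in query_adj]
    t_deg = [sum(row) for row in target_adj]
    # Candidate table: same element, and enough neighbours in the target.
    candidate = [[query_elems[i] == target_elems[j] and q_deg[i] <= t_deg[j]
                  for j in range(nt)] for i in range(nq)]
    matches = []

    def extend(mapping):
        i = len(mapping)  # next query atom to place
        if i == nq:
            matches.append(tuple(mapping))
            return
        for j in range(nt):
            if not candidate[i][j] or j in mapping:
                continue
            # Every query bond to an already placed atom must exist in the target.
            if all(target_adj[j][mapping[k]] for k in range(i) if query_adj[i][k]):
                mapping.append(j)
                extend(mapping)
                mapping.pop()

    extend([])
    return matches


# Query fragment with the slide's labels O1, C2, C3, C4 (indices 0..3).
QUERY_ELEMS = ["O", "C", "C", "C"]
QUERY_ADJ = adjacency_matrix(4, [(0, 3), (1, 3), (2, 3)])

# Cyclohexanone, O1 and C2..C7 as indices 0..6; ring bonds are assumed.
TARGET_ELEMS = ["O"] + ["C"] * 6
TARGET_ADJ = adjacency_matrix(
    7, [(0, 3), (3, 1), (3, 2), (1, 4), (2, 5), (4, 6), (5, 6)])
```

With element labels the fragment is found twice, because its two outer carbons can be swapped; dropping the element check would raise the count, which is one reason labelled graphs are cheaper to match.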


Next lecture

• How can we use this type of information to solve our chemistry problems?
  – Patents/Markush structures
  – Molecular similarity
  – Molecular properties
  – 3D searching
  – Virtual screening
  – Structure–property/activity relationships


What if we want to vary the atom positions, e.g. in driving a reaction coordinate?

Using Cartesian (x, y, z) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.

• Also called a Z-matrix
  – Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)
  – An early way of specifying a starting geometry for molecules: sometimes graph paper was used to draw the molecule and get a starting set of coordinates before optimisation
  – A Z-matrix uses the following geometric descriptions to describe molecules:
    Bond length
    Bond angle
    Torsion angle
    Out-of-plane bending, e.g. a carbonyl
    Non-bonded distance

[Figure: internal coordinates, showing dummy atom (du) positions on a carbonyl and on an N–OH group]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only.

2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.

5. After all the atoms have been listed, enter a blank line.

6. Next, list each variable with its corresponding value. Use a separate line for each variable.

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.

8. End the Z-matrix with a blank line.

Water (C2v)

O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
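To show how such a Z-matrix turns back into x, y, z positions, here is a short sketch of a converter. It is an illustration under assumptions, not Gaussian's own routine: the first atom is put at the origin, the second on the z axis, the third in the xz plane, the sign convention of the dihedral is the one coded in `place`, and linear arrangements (where the reference atoms are collinear) are not handled. All function names are invented for the example.

```python
import math

WATER = """O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
"""


def parse_zmatrix(text):
    """Split a Z-matrix into rows of (symbol, reference atoms, numeric values)."""
    atom_part, _, var_part = text.strip().partition("\n\n")
    variables = {}
    for line in var_part.splitlines():
        if line.strip():
            name, value = line.split()
            variables[name] = float(value)

    def value_of(token):
        if token.startswith("-") and token[1:] in variables:
            return -variables[token[1:]]
        return variables[token] if token in variables else float(token)

    rows = []
    for line in atom_part.splitlines():
        fields = line.split()
        refs = [int(f) - 1 for f in fields[1::2]]       # NA, NB, NC as 0-based indices
        values = [value_of(f) for f in fields[2::2]]    # distance, angle, dihedral
        rows.append((fields[0], refs, values))
    return rows


def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])


def _unit(u):
    norm = math.sqrt(sum(a * a for a in u))
    return tuple(a / norm for a in u)


def place(c, b, a, r, theta_deg, phi_deg):
    """Position at distance r from c, angle theta to b, dihedral phi to a."""
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))
    m = _cross(n, bc)
    d = (-r * math.cos(theta), r * math.sin(theta) * math.cos(phi),
         r * math.sin(theta) * math.sin(phi))
    return tuple(c[i] + d[0] * bc[i] + d[1] * m[i] + d[2] * n[i] for i in range(3))


def to_cartesian(rows):
    """Convert parsed Z-matrix rows into a list of (x, y, z) tuples."""
    xyz = []
    for _symbol, refs, values in rows:
        if not refs:
            xyz.append((0.0, 0.0, 0.0))
        elif len(refs) == 1:
            xyz.append((0.0, 0.0, values[0]))
        else:
            c, b = xyz[refs[0]], xyz[refs[1]]
            if len(refs) == 2:   # third atom: fictitious reference fixes the xz plane
                a, phi = (b[0] + 1.0, b[1], b[2]), 0.0
            else:
                a, phi = xyz[refs[2]], values[2]
            xyz.append(place(c, b, a, values[0], values[1], phi))
    return xyz
```

Changing a single variable, for example `a1`, and converting again moves the atoms along that internal coordinate, which is why this form is convenient for driving a reaction coordinate.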

Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Figure: methanol Z-matrix diagram, with ‘improper’ torsion angles marked]

What about comprehensive properties of molecules? They are more than x, y, z.

XML and molecules

• XML is a computer language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file
• Chemical Markup Language (CML) is being developed specifically for chemistry
• In the future much more information will be stored with molecules, allowing greater re-use of data
• See: Chemical Markup, XML, and the World-Wide Web. Part I. Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928

Ethanol

<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts

InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES
C(C)O
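As a small sketch of what storing a molecule with its metadata might look like, the snippet below builds an ethanol record with Python's standard XML library. The element and attribute names (`molecule`, `atomArray`, `atom`, `bondArray`, `bond`, `property`, `scalar`) follow commonly seen CML conventions, but the fragment has not been validated against the CML schema, and the property entry (an approximate boiling point) is included only to show a value carrying its units.

```python
import xml.etree.ElementTree as ET


def ethanol_cml():
    """Return a CML-style XML string for ethanol (heavy atoms only)."""
    molecule = ET.Element("molecule", id="ethanol")
    ET.SubElement(molecule, "name").text = "ethanol"

    atoms = ET.SubElement(molecule, "atomArray")
    for atom_id, element, n_hydrogens in [("a1", "C", 3), ("a2", "C", 2), ("a3", "O", 1)]:
        ET.SubElement(atoms, "atom", id=atom_id, elementType=element,
                      hydrogenCount=str(n_hydrogens))

    bonds = ET.SubElement(molecule, "bondArray")
    for first, second in [("a1", "a2"), ("a2", "a3")]:
        ET.SubElement(bonds, "bond", atomRefs2=f"{first} {second}", order="1")

    # Metadata travels with the number: the units are stored next to the value.
    prop = ET.SubElement(molecule, "property", title="boiling point")
    scalar = ET.SubElement(prop, "scalar", units="celsius")
    scalar.text = "78.4"  # approximate, for illustration only

    return ET.tostring(molecule, encoding="unicode")
```

Because the result is ordinary XML, any XML parser can read it back and pick out the atoms, bonds or properties by tag name.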

Molecules and 3 dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank, PDB)
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions)
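The cost of scanning torsion angles on a regular grid can be seen with a few lines of Python. The step sizes and function names are arbitrary choices for this sketch; practical conformer generators prune or sample the grid instead of enumerating it.

```python
import itertools


def torsion_grid(n_torsions, step_deg):
    """Yield every combination of torsion angles on a regular grid (degrees)."""
    angles = range(0, 360, step_deg)
    return itertools.product(angles, repeat=n_torsions)


def grid_size(n_torsions, step_deg):
    """Number of grid points: (360 / step) to the power of the number of torsions."""
    return (360 // step_deg) ** n_torsions
```

Two torsions at 30 degree steps give a 12 by 12 map of 144 points, much like a Ramachandran plot, whereas ten rotatable bonds at the same resolution already give about 6 x 10^10 combinations.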

Molecules in 3D – uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd/
http://www.ks.uiuc.edu/Research/vmd/
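The film-strip idea can be sketched with a concatenated multi-frame file. For brevity this example uses the simple XYZ layout (atom count, comment line, one atom per line) in place of SD files, and stores coordinates only, not bonds; the function names and the two-frame H2 example are made up for the illustration.

```python
def write_xyz_trajectory(symbols, frames):
    """Concatenate frames, each written as an XYZ block, into one string."""
    lines = []
    for index, coords in enumerate(frames):
        lines.append(str(len(symbols)))
        lines.append(f"frame {index}")
        for symbol, (x, y, z) in zip(symbols, coords):
            lines.append(f"{symbol} {x:.3f} {y:.3f} {z:.3f}")
    return "\n".join(lines) + "\n"


def read_xyz_trajectory(text):
    """Split a concatenated XYZ string back into a list of frames."""
    lines = text.splitlines()
    frames, position = [], 0
    while position < len(lines):
        n_atoms = int(lines[position])
        block = lines[position + 2: position + 2 + n_atoms]
        frames.append([(s, float(x), float(y), float(z))
                       for s, x, y, z in (ln.split() for ln in block)])
        position += n_atoms + 2
    return frames
```

Each frame repeats the full atom list, so the file grows linearly with the number of snapshots; production trajectory formats usually store the topology once and the coordinates in binary.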

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called ‘Apelin’, with water and the membrane. We used these simulations to design new molecules that block this receptor

Apelin’s role in cancer

‘Apelin’ (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al. Pharmacological targeting of apelin impairs glioblastoma growth. Brain 2017, awx253. https://doi.org/10.1093/brain/awx253
• Rayalam et al. Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review. Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats – which could be a real pain, but they can be converted

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats – the bane of our lives

Interconversion program – Babel

There are recent IUPAC moves towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers – things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins, only the heavy-atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect – check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen–aromatic bonds etc. may be inferred but are not observed in the file).

2 Finding molecules

A typical problem involves finding the ‘right’ molecules

All the molecules synthesisable: 10^60
Molecules with relevance to the problem: 10^10 (that we can generate)
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.



This uses internal coordinates

bull Also called a Z-matrix

ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)

ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation

ndash A z-matrix uses the following geometric descriptions to describe molecules

Bond Length

Bond angle

Torsion angle

Out of plane bending

eg a carbonyl

Non-bonded distance

C

O

du

du

O

N OH

Dummy atom

positions

Internal

Coordinates

How to construct a Z-matrix (in Gaussian format)

1For the first atom to be defined give the atomic symbol

only

2For the second atom give the atomic symbol the number

1 and the name of a variable to describe the distance

between atoms 1 and 2

3For the third atom give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB and

the name of a variable to describe the angle between the

current atom NA and NB

4For all later atoms give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB the

name of a variable to describe the angle between the current

atom NA and NB the atom number of another previously

defined atom NC and finally the name of a variable to

describe the dihedral angle between the current atom NA

NB and NC

5After all the atoms have been listed enter a blank line

6Next list each variable with its corresponding value Use a

separate line for each variable

7In some cases where some of the variables are to be fixed

as constants in a geometry optimisation they are listed here

after a blank line rather than above

with the real variables

8End the Z-matrix with a blank line

Water (C2v)

O

H 1 l1

H l l1 2 a1

l1 096

a1 1040

Methanol Z-matrix

C

O 1 l1

H 1 l2 2 a1

H 1 l3 2 a2 3 da1

H 1 l3 2 a2 3 -da1

H 2 l4 1 a3 3 1800

l1 142

l2 109

l3 109

l4 109

l5 109

l6 10

a1 1090

a2 1100

a3 1080

a4 1100

a5 1100

da1 600

da2 1200

da3 600 z-matrix

lsquoimproperrsquo torsion angles

What about comprehensive

properties of molecules ndash they

are more than xyz

XML and molecules

bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc

bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file

bull Chemical Markup Language (CML) is being developed specifically for chemistry

bull In the future much more information will be stored with molecules allowing greater re-use of data

bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928

Ethanol

ltCMLgt

-Can be parsed

-Can contain reactions

properties etc

-Can contain

relationships to other

molecules and also

concepts

InChI

InChI=1C2H6Oc1-2-3h3H2H21H3

SMILES

C(C)O

Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not

really flat because of thermal fluctuations So we represent 3D

molecules by including their coordinates or their internal coordinates

bull Obtaining the 3-dimensional coordinates can involve experiment (x-

ray electron or neutron diffraction eg the Cambridge

Crystallographic Database or the Protein Databank ndash PDB)

bull From these can be obtained atom positions bonds coordinates etc

bull There are a number of 3-D construction methods available such as

Corina or Concord (put in a SMILES and get a 3-D molecule) which

use rules derived from experiment

bull Molecules can also be constructed in 2D and subjected to molecular

mechanics or Quantum Mechanics calculations to obtain 3D structures

bull Conformation still remains to be deduced There are many methods

that deduce conformations usually involving torsional angle rotation

to scan the conformational space (like a Ramachandran plot ndash but often

in many more torsional angle dimensions)

Molecules in 3D - uses

bull More accurate calculation of molecular

properties

bull Comparison of the shapes (conformations)

of molecules

bull Comparison of the dynamics of molecules

bull Calculation of bulk properties

bull Simulation of chemical reactionshelliphellip

The dynamics of a molecule can be computed and stored as a series of

frames of the coordinates and bonds of the structure (much like a cartoon)

Molecular Dynamics is the most popular method for larger systems Here is

an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a

series of snapshots of SD files concatenated together to make a movie just

like a film strip

httpsenwikipediaorgwikiCHARMM

httpambermdorg

httpwwwksuiuceduResearchnamd

httpwwwksuiuceduResearchvmd

We can simulate an entire biological receptor

and test new molecules to see how they might

work

bull Here is a simulation of an important new cancer target

called lsquoApelinrsquo with water and the membrane We used

these to design new molecules that block this receptor

Apelinrsquos role in Cancer

lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an

antagonist (a molecule that blocks the receptor) stop tumour growth

bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma

growth Brain 2017 awx253 httpsdoiorg101093brainawx253

bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et

al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372

So there are very many file formats

ndash which could be a real painbut

they can be convertedalc -- Alchemy file prep -- Amber PREP file

bs -- Ball amp Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical Resource

Kit 2D file

crk3d -- CRK3D Chemical Resource

Kit 3D file

box -- Dock 35 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI BiosymInsight II CAR

file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball amp Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table

file

cht -- Chemtool file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical

Resource Kit 2D file

crk3d -- CRK3D Chemical

Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 35 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats

-the bane of our lives

Interconnection program ndash Babel

Recent IUPAC moves towards a lsquostandardrsquo format however

In the near future there are likely to be many competing

requirements for file content

Molecules on computers ndash things to look out for

since what is stored is actually quite crude

For example-

Stereochemistry may be relative and not absolute or even incorrect

In proteins only the HEAVY atom positions are observed (sometimes at low resolution)

so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen

and oxygen get confused

Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved

correctly (problems with storing bonding)

Tautomers can be incorrect ndash check they look reasonable

Mesomers can be incorrect (double bonds)

Polymers (which are mixtures of MWt and topology) are very difficult to store

Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics

halogen-aromatic bonds etc may be inferred but not observed in the file)

2 Finding molecules

A typical problem involves finding the 'right' molecules

All the molecules synthesisable: 10^60

Molecules with relevance to the problem (that we can generate): 10^10

Molecules we can search efficiently: 10^7

Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions

1 Finding molecules using substructure searching

One of the most useful 'properties' of a molecule is its molecular graph

Many software systems allow searching for whole molecules, fragments of molecules or even 'similar' molecules

For a specific molecule this means specifying the required search pattern to get an exact match

An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules (http://pubchem.ncbi.nlm.nih.gov)

A more complex example - find a molecule containing our query as part of a larger molecule

Results - takes a few seconds to search all 92 million molecules

How does this work?

The first thing to notice is that the searching is so fast that clearly the whole database is not being searched

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in

This is called hashing

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

Figure labels: Contains P; Contains NH2; Doesn't contain F (X)

So a quick match of the bitstring of the fragment searched over the (pre-computed) bitstrings in the database will eliminate most of the structures
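The bitstring screen can be sketched in a few lines of Python. This is only an illustration: the fragment-to-bit table, the tiny database and the function names are made up for the example, and the fragments are assigned by hand rather than perceived from a structure.

```python
# Hypothetical fragment dictionary: fragment name -> bit position (0-based).
FRAGMENT_BITS = {"OH": 0, "F": 2, "P": 5, "COOH": 7, "phenyl": 9, "NH2": 11}

def fingerprint(fragments):
    """Return an integer whose set bits record which fragments are present."""
    bits = 0
    for name in fragments:
        bits |= 1 << FRAGMENT_BITS[name]
    return bits

def may_contain(query_fp, molecule_fp):
    """Screen: every bit set in the query must also be set in the molecule."""
    return (query_fp & molecule_fp) == query_fp

# Toy database with hand-assigned fragments (an assumption for illustration).
database = {
    "aniline": ["NH2", "phenyl"],
    "phenol": ["OH", "phenyl"],
    "glycine": ["NH2", "COOH"],
    "tyrosine": ["NH2", "COOH", "OH", "phenyl"],
    "fluorobenzene": ["F", "phenyl"],
}

# Fingerprints are computed once, when the database is built.
precomputed = {name: fingerprint(frags) for name, frags in database.items()}

def screen(query_fragments):
    """Return the names of molecules that survive the bitstring screen."""
    query_fp = fingerprint(query_fragments)
    return sorted(n for n, fp in precomputed.items() if may_contain(query_fp, fp))

print(screen(["NH2", "phenyl"]))
```

The screen only eliminates molecules; anything that survives still has to go through the atom-by-atom matching step, because two molecules can share the same fragments arranged differently.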

Next step - substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure

Beware: salts etc. - you may need to strip out counterions
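As a rough sketch of the counterion problem: in a SMILES string the components of a salt are separated by '.', so one crude approach is to keep only the largest component before comparing. The helper names below are invented for this example, and the atom count is a simple token count, not a proper SMILES parser; a real system would also canonicalise the structure with a toolkit.

```python
import re

# Matches one atom token: two-letter halogens, the organic subset,
# aromatic atoms, or anything written in square brackets.
ATOM_TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|\[[^\]]+\]")

def atom_count(component):
    """Crude count of atoms written explicitly in one SMILES component."""
    return len(ATOM_TOKEN.findall(component))

def strip_counterions(smiles):
    """Keep the largest '.'-separated component (assumed to be the parent)."""
    return max(smiles.split("."), key=atom_count)

print(strip_counterions("[Na+].CC(=O)[O-]"))
print(strip_counterions("CN(C)c1ccccc1.Cl"))
```

Note that the acetate example keeps its charge after stripping; neutralising the parent is a separate step that this sketch does not attempt.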

Substructure Search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment

There are two steps commonly used to do this: firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical; then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm)

Substructures in molecules

• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
  - -OH
  - -NH2
  - -COOH
  - phenyl
• this can be done by tracing appropriate paths in the graph
• subgraphs may overlap

[Figure: structure diagram with group labels OH, CH2, CHNH2, OH, O]

Matching the query structure to the database

Two algorithms are commonly used:

The Morgan algorithm, which numbers the molecule uniquely, and

The Ullmann algorithm, which matches the fragments

Note - a number of 'substructural' fragments can be matched here (6 in all)

Read these for more details:

Gund P, Ann Rep in Med Chem 14, 299, 1979

Sheridan RP et al, J Chem Inf Comp Sci 29, 255, 1989

Brint AT, Willett P, J Mol Graphics 5, 49, 1987

JR Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM) Vol 23, pp 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925

Morgan HL, The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113

The basic concept is that molecules are considered as graphs

Atoms are 'nodes'

Bonds are 'edges'

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond)
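One possible way to hold such a labelled graph in Python is sketched below. The class name and layout are assumptions made for this example, and hydrogens are left implicit. Urea (heavy atoms only) is used as the test molecule, with the carbon numbered 4.

```python
class MolGraph:
    """A labelled graph: atoms are nodes, bonds are edges."""

    def __init__(self, atoms, bonds):
        # atoms: atom number -> element symbol (the node label)
        # bonds: (i, j) -> bond order (the edge label)
        self.atoms = atoms
        self.bonds = {frozenset(pair): order for pair, order in bonds.items()}

    def neighbours(self, i):
        """Atom numbers directly bonded to atom i."""
        return sorted(j for pair in self.bonds if i in pair for j in pair if j != i)

    def adjacency_matrix(self):
        """Square 0/1 matrix, rows and columns in atom-number order."""
        idx = sorted(self.atoms)
        return [[1 if frozenset((i, j)) in self.bonds else 0 for j in idx]
                for i in idx]

# Urea, O=C(N)N, heavy atoms only: O1, N2, N3 all attached to C4.
urea = MolGraph(
    atoms={1: "O", 2: "N", 3: "N", 4: "C"},
    bonds={(1, 4): 2, (2, 4): 1, (3, 4): 1},
)

for row in urea.adjacency_matrix():
    print(row)
```

The adjacency matrix throws away the labels and keeps only the connectivity, which is why matching code normally consults the atom and bond labels as well.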

The Morgan algorithm iteratively computes an integer label for each node in a structure (atoms). For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node i, considering its atomic number, degree (number of substituents) and types of bonds

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms), and

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels

H L Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J Chemical Documentation 5, 107-113, 1965

The first step - numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged

• Numbering starts from the highest, then uses the next highest connection as 2 etc. Like an onion: concentric circles from the highest numbered

• Rules are used to break equalities, such as C before N, double bond before single etc.

• Some equivalent atoms are arbitrarily numbered
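A minimal sketch of the relabelling loop is given below, under simplifying assumptions: the starting label is just the degree (atomic number and bond types are ignored), each update replaces a label by the sum of its neighbours' labels, and the tie-breaking rules are left out. The function name and the pentane test chain are choices made for this example.

```python
def morgan_labels(adjacency):
    """adjacency: atom -> list of neighbouring atoms. Returns atom -> label."""
    # Step (i), simplified: start from the degree of each atom.
    labels = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): new label = sum of the labels of the adjacent atoms.
        new = {atom: sum(labels[n] for n in nbrs)
               for atom, nbrs in adjacency.items()}
        new_classes = len(set(new.values()))
        # Step (iii): stop once the number of distinct labels stops increasing,
        # keeping the labels from the last iteration that improved matters.
        if new_classes <= n_classes:
            return labels
        labels, n_classes = new, new_classes

# n-Pentane carbon chain: 1-2-3-4-5
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

labels = morgan_labels(pentane)
print(labels)

# Numbering starts from the atom with the highest label.
ranking = sorted(labels, key=lambda atom: -labels[atom])
print(ranking)
```

For pentane the central carbon ends up with the highest label, and atoms 2/4 and 1/5 remain tied, which is the situation the tie-breaking rules and arbitrary choices in the bullets above are meant to handle.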

The Ullmann algorithm is a way of detecting substructures

Urea adjacency matrix

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

Cyclohexanone adjacency matrix

     O1  C2  C3  C4  C5  C6  C7
O1    0   0   0   1   0   0   0
C2    0   0   0   1   1   0   0
C3    0   0   0   1   0   0   1
C4    1   1   1   0   0   0   0
C5    0   1   0   0   0   1   0
C6    0   0   0   0   1   0   1
C7    0   0   1   0   0   1   0

Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching ('subgraph isomorphism') and is of the class of NP-complete problems
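The matching step can be illustrated with a plain backtracking search. This is not Ullmann's algorithm itself, which additionally refines a matrix of candidate atom pairings to prune the search; it is a simpler sketch of the same subgraph-matching idea. The four-atom pattern and seven-atom ring follow the labels of the adjacency matrices above, and the ring bond order (C4-C2-C5-C6-C7-C3) is an assumption.

```python
def neighbour_sets(bonds):
    """Build atom -> set of bonded atoms from a list of (i, j) pairs."""
    nbrs = {}
    for i, j in bonds:
        nbrs.setdefault(i, set()).add(j)
        nbrs.setdefault(j, set()).add(i)
    return nbrs

def find_substructure(q_atoms, q_bonds, t_atoms, t_bonds):
    """Return one mapping {query atom: target atom}, or None if no match."""
    q_nbrs, t_nbrs = neighbour_sets(q_bonds), neighbour_sets(t_bonds)
    order = list(q_atoms)

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        q = order[len(mapping)]          # next query atom to place
        for t in t_atoms:
            if t in mapping.values():    # target atom already used
                continue
            if q_atoms[q] != t_atoms[t]: # element labels must agree
                continue
            if len(q_nbrs.get(q, ())) > len(t_nbrs.get(t, ())):
                continue                 # target atom has too few neighbours
            # Every already-placed query neighbour must be bonded to t.
            if all(mapping[n] in t_nbrs.get(t, ())
                   for n in q_nbrs.get(q, ()) if n in mapping):
                mapping[q] = t
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]           # backtrack
        return None

    return extend({})

# Query: atom 4 bonded to O1, C2 and C3 (labels as in the matrices above).
pattern_atoms = {1: "O", 2: "C", 3: "C", 4: "C"}
pattern_bonds = [(1, 4), (2, 4), (3, 4)]

# Target: cyclohexanone, ring order assumed to be C4-C2-C5-C6-C7-C3.
ring_atoms = {1: "O", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C", 7: "C"}
ring_bonds = [(1, 4), (4, 2), (4, 3), (2, 5), (5, 6), (6, 7), (7, 3)]

match = find_substructure(pattern_atoms, pattern_bonds, ring_atoms, ring_bonds)
print(match)
```

In the worst case the search tries a very large number of partial mappings, which is the practical face of the NP-completeness mentioned above, and the reason the cheap bitstring screen is run first.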


Next lecture

• How can we use this type of information to solve our chemistry problems?
  - Patents/Markush Structures
  - Molecular Similarity
  - Molecular Properties
  - 3D searching
  - Virtual Screening
  - Structure Property/Activity Relationships


Internal Coordinates

Bond length

Bond angle

Torsion angle

Out-of-plane bending, e.g. a carbonyl

Non-bonded distance

[Figure: dummy atom (du) positions, shown on structures with atom labels C, O, N, OH]

How to construct a Z-matrix (in Gaussian format)

1. For the first atom to be defined, give the atomic symbol only

2. For the second atom, give the atomic symbol, the number 1 and the name of a variable to describe the distance between atoms 1 and 2

3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB and the name of a variable to describe the angle between the current atom, NA and NB

4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC

5. After all the atoms have been listed, enter a blank line

6. Next, list each variable with its corresponding value. Use a separate line for each variable

7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables

8. End the Z-matrix with a blank line

Water (C2v)

    O
    H 1 l1
    H 1 l1 2 a1

    l1 0.96
    a1 104.0

Methanol Z-matrix

    C
    O 1 l1
    H 1 l2 2 a1
    H 1 l3 2 a2 3 da1
    H 1 l3 2 a2 3 -da1
    H 2 l4 1 a3 3 180.0

    l1 1.42
    l2 1.09
    l3 1.09
    l4 1.09
    l5 1.09
    l6 1.0
    a1 109.0
    a2 110.0
    a3 108.0
    a4 110.0
    a5 110.0
    da1 60.0
    da2 120.0
    da3 60.0

'Improper' torsion angles

What about comprehensive properties of molecules - they are more than xyz

XML and molecules

• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.

• The 'X' stands for extensible. It means we can add almost any type of structured data to the file

• Chemical Markup Language (CML) is being developed specifically for chemistry

• In the future much more information will be stored with molecules, allowing greater re-use of data

• see Chemical Markup, XML and the World-Wide Web, Part I: Basic principles, P Murray-Rust and H S Rzepa, J Chem Inf Comp Sci 1999, 39, 928

Ethanol

<CML>

- Can be parsed

- Can contain reactions, properties etc.

- Can contain relationships to other molecules and also concepts
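'Can be parsed' is easy to demonstrate with the XML parser in the Python standard library. The fragment below is a simplified, CML-style description of ethanol written for this example: it has not been validated against the CML schema, the namespace declaration is left out, and hydrogens are omitted.

```python
import xml.etree.ElementTree as ET

# Simplified CML-style fragment (illustrative; heavy atoms only).
CML_ETHANOL = """
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>
"""

root = ET.fromstring(CML_ETHANOL.strip())

# Atom id -> element symbol
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}

# (first atom id, second atom id, bond order)
bonds = [tuple(bond.get("atomRefs2").split()) + (bond.get("order"),)
         for bond in root.iter("bond")]

print(root.get("id"), atoms, bonds)
```

Because every value is tagged with what it is, extra information (properties, units, links to other molecules) can be added as further elements without breaking programs that only read the atoms and bonds.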

InChI

InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES

C(C)O

Molecules and 3 dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates

• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank - PDB)

• From these can be obtained atom positions, bonds, coordinates etc.

• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment

• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures

• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions)

Molecules in 3D - uses

• More accurate calculation of molecular properties

• Comparison of the shapes (conformations) of molecules

• Comparison of the dynamics of molecules

• Calculation of bulk properties

• Simulation of chemical reactions ...

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip

https://en.wikipedia.org/wiki/CHARMM

http://ambermd.org

http://www.ks.uiuc.edu/Research/namd

http://www.ks.uiuc.edu/Research/vmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor

Apelin's role in cancer

'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth

• Elizabeth Harford-Wright et al, Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253

• Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Rayalam et al, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats - which could be a real pain, but they can be converted

alc -- Alchemy file prep -- Amber PREP file

bs -- Ball & Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file

crk3d -- CRK3D Chemical Resource Kit 3D file

box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI Biosym/Insight II CAR file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball & Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file

cht -- Chemtool file cml -- Chemical Markup Language file

crk2d -- CRK2D Chemical Resource Kit 2D file

crk3d -- CRK3D Chemical Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 3.5 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file


Page 28: Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH

How to construct a Z-matrix (in Gaussian format)

1For the first atom to be defined give the atomic symbol

only

2For the second atom give the atomic symbol the number

1 and the name of a variable to describe the distance

between atoms 1 and 2

3For the third atom give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB and

the name of a variable to describe the angle between the

current atom NA and NB

4For all later atoms give the atomic symbol the atom

number NA the name of a variable to describe the distance

between the current atom and NA the atom number NB the

name of a variable to describe the angle between the current

atom NA and NB the atom number of another previously

defined atom NC and finally the name of a variable to

describe the dihedral angle between the current atom NA

NB and NC

5After all the atoms have been listed enter a blank line

6Next list each variable with its corresponding value Use a

separate line for each variable

7In some cases where some of the variables are to be fixed

as constants in a geometry optimisation they are listed here

after a blank line rather than above

with the real variables

8End the Z-matrix with a blank line

Water (C2v)

O

H 1 l1

H l l1 2 a1

l1 096

a1 1040

Methanol Z-matrix

C

O 1 l1

H 1 l2 2 a1

H 1 l3 2 a2 3 da1

H 1 l3 2 a2 3 -da1

H 2 l4 1 a3 3 1800

l1 142

l2 109

l3 109

l4 109

l5 109

l6 10

a1 1090

a2 1100

a3 1080

a4 1100

a5 1100

da1 600

da2 1200

da3 600 z-matrix

lsquoimproperrsquo torsion angles

What about comprehensive

properties of molecules ndash they

are more than xyz

XML and molecules

bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc

bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file

bull Chemical Markup Language (CML) is being developed specifically for chemistry

bull In the future much more information will be stored with molecules allowing greater re-use of data

bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928

Ethanol

ltCMLgt

-Can be parsed

-Can contain reactions

properties etc

-Can contain

relationships to other

molecules and also

concepts

InChI

InChI=1C2H6Oc1-2-3h3H2H21H3

SMILES

C(C)O

Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not

really flat because of thermal fluctuations So we represent 3D

molecules by including their coordinates or their internal coordinates

bull Obtaining the 3-dimensional coordinates can involve experiment (x-

ray electron or neutron diffraction eg the Cambridge

Crystallographic Database or the Protein Databank ndash PDB)

bull From these can be obtained atom positions bonds coordinates etc

bull There are a number of 3-D construction methods available such as

Corina or Concord (put in a SMILES and get a 3-D molecule) which

use rules derived from experiment

bull Molecules can also be constructed in 2D and subjected to molecular

mechanics or Quantum Mechanics calculations to obtain 3D structures

bull Conformation still remains to be deduced There are many methods

that deduce conformations usually involving torsional angle rotation

to scan the conformational space (like a Ramachandran plot ndash but often

in many more torsional angle dimensions)

Molecules in 3D - uses

bull More accurate calculation of molecular

properties

bull Comparison of the shapes (conformations)

of molecules

bull Comparison of the dynamics of molecules

bull Calculation of bulk properties

bull Simulation of chemical reactionshelliphellip

The dynamics of a molecule can be computed and stored as a series of

frames of the coordinates and bonds of the structure (much like a cartoon)

Molecular Dynamics is the most popular method for larger systems Here is

an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a

series of snapshots of SD files concatenated together to make a movie just

like a film strip

httpsenwikipediaorgwikiCHARMM

httpambermdorg

httpwwwksuiuceduResearchnamd

httpwwwksuiuceduResearchvmd

We can simulate an entire biological receptor

and test new molecules to see how they might

work

bull Here is a simulation of an important new cancer target

called lsquoApelinrsquo with water and the membrane We used

these to design new molecules that block this receptor

Apelinrsquos role in Cancer

lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an

antagonist (a molecule that blocks the receptor) stop tumour growth

bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma

growth Brain 2017 awx253 httpsdoiorg101093brainawx253

bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et

al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372

So there are very many file formats

ndash which could be a real painbut

they can be convertedalc -- Alchemy file prep -- Amber PREP file

bs -- Ball amp Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical Resource

Kit 2D file

crk3d -- CRK3D Chemical Resource

Kit 3D file

box -- Dock 35 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI BiosymInsight II CAR

file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball amp Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table

file

cht -- Chemtool file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical

Resource Kit 2D file

crk3d -- CRK3D Chemical

Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 35 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats

-the bane of our lives

Interconnection program ndash Babel

Recent IUPAC moves towards a lsquostandardrsquo format however

In the near future there are likely to be many competing

requirements for file content

Molecules on computers – things to look out for, since what is stored is actually quite crude

For example:

• Stereochemistry may be relative rather than absolute, or even incorrect.
• In proteins only the heavy-atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added afterwards and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect – check that they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen–aromatic bonds etc. may be inferred but are not observed in the file).

2 Finding molecules

A typical problem involves finding the ‘right’ molecules

All the synthesisable molecules: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions

1 Finding molecules using Substructure searching

One of the most useful ‘properties’ of a molecule is its molecular graph. Many software systems allow searching for whole molecules, fragments of molecules or even ‘similar’ molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we’ve used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules (http://pubchem.ncbi.nlm.nih.gov).

A more complex example – find a molecule containing our query as part of a larger molecule

Results – it takes a few seconds to search all 92 million molecules

How does this work?

The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in. This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value: 0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

[Figure labels: Contains P; Contains NH2; Doesn’t contain F]

So a quick match of the bitstring of the query fragment against the (pre-computed) bitstrings in the database will eliminate most of the structures.
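The bitstring screen can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fragment dictionary `FRAGMENT_BITS`, the toy `DATABASE` and the fragment lists assigned to each compound are hypothetical and hand-written, not computed from real structures or taken from PubChem.

```python
# Hypothetical fragment dictionary: each fragment owns one bit position.
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "COOH": 4, "phenyl": 5}


def fingerprint(fragments):
    """Return an integer whose set bits mark the fragments that are present."""
    bits = 0
    for name in fragments:
        bits |= 1 << FRAGMENT_BITS[name]
    return bits


def screen(query_fp, database_fps):
    """Keep only the entries whose fingerprint contains every query bit."""
    return [name for name, fp in database_fps.items()
            if (fp & query_fp) == query_fp]


# Toy database: the fragment lists are illustrative, typed in by hand.
DATABASE = {
    "aniline": fingerprint(["NH2", "phenyl"]),
    "phenol": fingerprint(["OH", "phenyl"]),
    "glycine": fingerprint(["NH2", "COOH"]),
    "tyrosine": fingerprint(["NH2", "OH", "COOH", "phenyl"]),
}

query = fingerprint(["OH", "phenyl"])
candidates = screen(query, DATABASE)
print(candidates)
```

A screen of this kind can only rule compounds out. Anything that survives still has to go through the atom-by-atom substructure match, because having the right fragments somewhere in the molecule does not guarantee they are connected as in the query.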

Next step – Substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure

Beware of salts etc.: you may need to strip out counterions

Substructure Search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• ‘Subgraphs’ can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
  – –OH
  – –NH2
  – –COOH
  – phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap

[Figure: structure diagram with atom labels OH, CH2, CHNH2, OH, O]

Matching the query structure to the database

Two algorithms are commonly used:

The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments

Note – a number of ‘substructural’ fragments can be matched here (6 in all)

Read these for more details:

Gund, P. Ann. Rep. Med. Chem. 14, 299 (1979)
Sheridan, R. P. et al. J. Chem. Inf. Comp. Sci. 29, 255 (1989)
Brint, A. T.; Willett, P. J. Mol. Graphics 5, 49 (1987)
Ullmann, J. R. An Algorithm for Subgraph Isomorphism. Journal of the Association for Computing Machinery (JACM) 23, 31–42 (1976). http://portal.acm.org/citation.cfm?id=321925
Morgan, H. L. The generation of a unique machine description for chemical structures. Journal of Chemical Documentation 5, 107–113 (1965)

The basic concept is that molecules are considered as graphs

Atoms are ‘nodes’
Bonds are ‘edges’

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
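A labelled graph of this kind can be held in two small dictionaries. The sketch below is an illustration only: the molecule (the heavy atoms of acetaldehyde), the atom numbering and the choice of bond order as the edge label are assumptions made for the example.

```python
# Node labels: atom number -> element symbol (heavy atoms of acetaldehyde).
atoms = {1: "C", 2: "C", 3: "O"}

# Edge labels: pair of atom numbers -> bond order (2 marks the C=O bond).
bonds = {(1, 2): 1, (2, 3): 2}


def adjacency_matrix(atoms, bonds):
    """Build a symmetric matrix whose entries are bond orders (0 = no bond)."""
    order = sorted(atoms)
    index = {atom_id: i for i, atom_id in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for (a, b), bond_order in bonds.items():
        matrix[index[a]][index[b]] = bond_order
        matrix[index[b]][index[a]] = bond_order
    return matrix


def neighbours(atom_id, bonds):
    """List the atoms bonded to atom_id."""
    return sorted(b if a == atom_id else a
                  for (a, b) in bonds if atom_id in (a, b))


matrix = adjacency_matrix(atoms, bonds)
for row in matrix:
    print(row)
print(neighbours(2, bonds))
```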

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an initial integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan. The generation of a unique machine description for chemical structures – a technique developed at Chemical Abstracts Service. J. Chemical Documentation 5, 107–113 (1965)

The first step – Numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest-valued atom as 1, then uses its next-highest connection as 2, etc. Like an onion: concentric circles out from the highest-numbered atom
• Rules are used to break equalities, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered
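The iteration can be sketched as follows. This is a simplified version under stated assumptions: the initial label is the degree alone (atomic number and bond types are ignored), each update sums the neighbours' labels, and the labels from the last round that increased the number of classes are kept. The input format and the pentane example are illustrative choices.

```python
def morgan_labels(neighbours):
    """Iterate extended-connectivity labels until the class count stops rising.

    neighbours maps each atom to the list of atoms bonded to it.
    """
    labels = {atom: len(nbrs) for atom, nbrs in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        new_labels = {atom: sum(labels[n] for n in nbrs)
                      for atom, nbrs in neighbours.items()}
        new_classes = len(set(new_labels.values()))
        if new_classes <= n_classes:
            return labels  # the previous round was the most discriminating
        labels, n_classes = new_labels, new_classes


# Carbon skeleton of n-pentane: C1-C2-C3-C4-C5.
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

labels = morgan_labels(pentane)
ranking = sorted(labels, key=lambda atom: -labels[atom])
print(labels)
print(ranking)
```

For pentane the central carbon ends up with the highest label and the two ends share the lowest, which reflects the symmetry of the chain. Ties such as C2 and C4 are exactly where the tie-breaking rules or an arbitrary choice come in.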

Urea adjacency matrix

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

Cyclohexanone adjacency matrix (atoms O1, C2–C7): the row for O1 holds a single 1, the row for C4 (the carbonyl carbon) holds three, and the rows for C2, C3, C5, C6 and C7 hold two each.

Numbering is crucial: renumber for different starting points. Also, the ‘edges’ of the graph are annotated to speed up the matching. This is the ‘slow’ part of matching: subgraph isomorphism is in the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures
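A compact sketch in the spirit of the Ullmann algorithm is given below: build a table of candidate target atoms for each query atom, prune it with a neighbourhood refinement step, then backtrack. It is not a faithful reproduction of the 1976 paper. The query is the four-atom fragment (O1, C2, C3 around a central C4). The ring connectivity used for cyclohexanone (C4–C2–C5–C6–C7–C3–C4) is an assumption, since only the number of neighbours per atom is recoverable from the slide.

```python
def ullmann_matches(q_elems, q_edges, t_elems, t_edges):
    """Return every mapping (query index -> target index) that preserves bonds."""
    nq, nt = len(q_elems), len(t_elems)
    q_nbrs = [set() for _ in range(nq)]
    t_nbrs = [set() for _ in range(nt)]
    for a, b in q_edges:
        q_nbrs[a].add(b); q_nbrs[b].add(a)
    for a, b in t_edges:
        t_nbrs[a].add(b); t_nbrs[b].add(a)

    # Candidate table: same element and at least as many neighbours.
    cand = [{j for j in range(nt)
             if q_elems[i] == t_elems[j] and len(t_nbrs[j]) >= len(q_nbrs[i])}
            for i in range(nq)]

    def refine(table):
        """Drop (i, j) if some neighbour of i has no candidate next to j."""
        changed = True
        while changed:
            changed = False
            for i in range(nq):
                for j in list(table[i]):
                    if any(not (table[x] & t_nbrs[j]) for x in q_nbrs[i]):
                        table[i].discard(j)
                        changed = True
        return all(table)  # False as soon as any query atom has no candidate

    matches = []

    def search(i, table, used):
        if i == nq:
            matches.append([next(iter(c)) for c in table])
            return
        for j in sorted(table[i]):
            if j in used:
                continue
            trial = [set(c) for c in table]
            trial[i] = {j}
            if refine(trial):
                search(i + 1, trial, used | {j})

    if refine(cand):
        search(0, cand, frozenset())
    return matches


# Query fragment: O1, C2, C3 all bonded to C4 (indices 0..3).
query_elems = ["O", "C", "C", "C"]
query_edges = [(0, 3), (1, 3), (2, 3)]

# Cyclohexanone heavy atoms O1, C2..C7 (indices 0..6); ring order is assumed.
target_elems = ["O", "C", "C", "C", "C", "C", "C"]
target_edges = [(0, 3), (3, 1), (3, 2), (1, 4), (4, 5), (5, 6), (6, 2)]

matches = ullmann_matches(query_elems, query_edges, target_elems, target_edges)
print(matches)
```

Two mappings come back because the two carbon substituents of the fragment can be swapped. The refinement step removes most ring carbons as candidates before any backtracking happens, which is where the speed-up comes from.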


Next lecture

• How can we use this type of information to solve our chemistry problems?
  – Patents/Markush Structures
  – Molecular Similarity
  – Molecular Properties
  – 3D searching
  – Virtual Screening
  – Structure Property/Activity Relationships


Methanol Z-matrix

C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Figure labels: z-matrix; ‘improper’ torsion angles]
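To show how internal coordinates map onto xyz positions, here is a minimal Z-matrix to Cartesian conversion for the methanol example. It reads the slide's numbers as 1.42 Å, 1.09 Å, 109.0° and so on (the decimal points are assumed). The row format, the helper names and the sign convention chosen for the dihedral are choices made for this sketch.

```python
import math


def sub(p, q): return (p[0] - q[0], p[1] - q[1], p[2] - q[2])
def add(p, q): return (p[0] + q[0], p[1] + q[1], p[2] + q[2])
def scale(p, s): return (p[0] * s, p[1] * s, p[2] * s)
def dot(p, q): return p[0] * q[0] + p[1] * q[1] + p[2] * q[2]
def unit(p): return scale(p, 1.0 / math.sqrt(dot(p, p)))


def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def place(a, b, c, r, theta_deg, phi_deg):
    """Place an atom at distance r from a, angle theta to b, dihedral phi to c."""
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    u = unit(sub(b, a))                       # direction from a towards b
    ref = (1.0, 0.0, 0.0) if c is None else sub(c, b)
    v = unit(sub(ref, scale(u, dot(ref, u))))  # part of ref perpendicular to u
    w = cross(u, v)
    direction = add(scale(u, math.cos(theta)),
                    add(scale(v, math.sin(theta) * math.cos(phi)),
                        scale(w, math.sin(theta) * math.sin(phi))))
    return add(a, scale(direction, r))


def zmatrix_to_cartesian(zmat):
    """Rows: (symbol[, bond_ref, r[, angle_ref, theta[, dihedral_ref, phi]]])."""
    coords = []
    for row in zmat:
        if len(row) == 1:
            coords.append((0.0, 0.0, 0.0))        # first atom at the origin
        elif len(row) == 3:
            coords.append((0.0, 0.0, row[2]))     # second atom along z
        else:
            a, b = coords[row[1] - 1], coords[row[3] - 1]
            c = coords[row[5] - 1] if len(row) == 7 else None
            phi = row[6] if len(row) == 7 else 0.0
            coords.append(place(a, b, c, row[2], row[4], phi))
    return coords


def distance(p, q):
    return math.sqrt(dot(sub(p, q), sub(p, q)))


def angle_deg(p, centre, q):
    u, v = unit(sub(p, centre)), unit(sub(q, centre))
    return math.degrees(math.acos(dot(u, v)))


# Methanol rows as on the slide, with the variables substituted in.
METHANOL = [
    ("C",),
    ("O", 1, 1.42),
    ("H", 1, 1.09, 2, 109.0),
    ("H", 1, 1.09, 2, 110.0, 3, 60.0),
    ("H", 1, 1.09, 2, 110.0, 3, -60.0),
    ("H", 2, 1.09, 1, 108.0, 3, 180.0),
]

xyz = zmatrix_to_cartesian(METHANOL)
for (symbol, *_), position in zip(METHANOL, xyz):
    print(symbol, [round(x, 3) for x in position])
```

The bond lengths and angles of the result can be checked independently of the dihedral convention, which is what `distance` and `angle_deg` are for.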

What about comprehensive properties of molecules – they are more than xyz

XML and molecules

• XML is a computer language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file
• Chemical Markup Language (CML) is being developed specifically for chemistry
• In the future much more information will be stored with molecules, allowing greater re-use of data
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928

Ethanol

<CML>
– Can be parsed
– Can contain reactions, properties etc.
– Can contain relationships to other molecules and also concepts

InChI

InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES

C(C)O
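The point that a CML record "can be parsed" and can carry metadata can be illustrated with Python's standard XML parser. The fragment below is a simplified, hand-written CML-style record for ethanol (heavy atoms only, no namespace declaration). It is an assumption made for illustration and not the file shown on the slide.

```python
import xml.etree.ElementTree as ET

# Simplified CML-style record; the property carries its units as metadata.
CML_TEXT = """
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="molar mass">
    <scalar units="g/mol">46.07</scalar>
  </property>
</molecule>
"""

root = ET.fromstring(CML_TEXT)

# Atoms: id -> element symbol.
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}

# Bonds: pairs of atom ids taken from the atomRefs2 attribute.
bonds = [tuple(bond.get("atomRefs2").split()) for bond in root.iter("bond")]

# A property value together with the units stored alongside it.
scalar = root.find("property/scalar")

print(atoms)
print(bonds)
print(scalar.text, scalar.get("units"))
```

Because the units travel with the number, a program reading the record does not have to guess what 46.07 means, which is the re-use argument made for metadata above.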

Molecules and 3-Dimensions

• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates
• Obtaining the 3-dimensional coordinates can involve experiment (x-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank – PDB)
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot – but often in many more torsional angle dimensions)

Molecules in 3D – uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called ‘Apelin’, with water and the membrane. We used these to design new molecules that block this receptor

Apelin’s role in Cancer

‘Apelin’ (a peptide receptor) is involved in cancer – therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth

• Elizabeth Harford-Wright et al. Pharmacological targeting of apelin impairs glioblastoma growth. Brain 2017, awx253. https://doi.org/10.1093/brain/awx253
• Rayalam et al. Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review. Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367–372

So there are very many file formats – which could be a real pain, but they can be converted

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file



The basic concept is that molecules are considered as

Graphs

Atoms are lsquonodesrsquo

Bonds are lsquoedgesrsquo

A molecule is an example of a labelled graph nodes can

have labels such as atomic number The nodes are

connected by the edges (and these can also be labelled

eg double bond)

The Morgan algorithm iteratively computes an integer label

for each node in a structure (atoms) For a computer the key

point is that the numbering is consistent (unlike humans) and

fast The Morgan algorithm works as follows

(i) Assign an integer label i to each node considering its

atomic number degree (number of substituents) and types of

bonds

(ii) Update for each label by adding the labels of the adjacent

nodes (connected atoms) and

(iii) Repeat (ii) until the number of different labels does not

increase Then order the nodes by the value of the labels

H L Morgan The generation of a unique machine description for chemical structures -

A technique developed at chemical abstracts service J Chemical Documentation

5107ndash113 1965

The first step - Numbering a molecule uniquely

A simple Morgan algorithm

bullThis is continued until the number of equivalent classes is unchanged

bullNumbering starts from highest then uses next highest connection as 2 etc Like an onion

concentric circles from the highest numbered

bullRules are used to break equalities such as C before N double bond before single etc

bullSome equivalent atoms are arbitrarily numbered

Urea

Adjacency matrix

O

C C

1

2 34

O

C2

C5

C6

C7

C3

1

C4

O1 C2 C3 C4

O1 1

C2 1

C3 1

C4 1 1 1

O1 C2 C3 C4 C5 C6 C7

O1 1

C2 1 1

C3 1 1

C4 1 1 1

C5 1 1

C6 1 1

C7 1 1

Cyclohexanone adjacency

matrix

Numbering is crucial Renumber

for different starting points Also

the lsquoedgesrsquo of the graph are annotated to speed up the matching This is the lsquoslowrsquo part

of lsquomatching subgraph isomorphismrsquo and is of the class of NP-complete problems

The Ullman Algorithm

is a way of detecting

detecting substructures

Urea

Adjacency matrix

O

C C

1

2 34

O

C2

C5

C6

C7

C3

1

C4

O1 C2 C3 C4

O1 1

C2 1

C3 1

C4 1 1 1

O1 C2 C3 C4 C5 C6 C7

O1 1

C2 1 1

C3 1 1

C4 1 1 1

C5 1 1

C6 1 1

C7 1 1

Cyclohexanone adjacency

matrix

Numbering is crucial Renumber

for different starting points Also

the lsquoedgesrsquo of the graph are annotated to speed up the matching This is the lsquoslowrsquo part

of lsquomatching subgraph isomorphismrsquo and is of the class of NP-complete problems

The Ullman Algorithm

is a way of detecting

detecting substructures

Urea

Adjacency matrix

O

C C

1

2 34

O

C2

C5

C6

C7

C3

1

C4

O1 C2 C3 C4

O1 1

C2 1

C3 1

C4 1 1 1

O1 C2 C3 C4 C5 C6 C7

O1 1

C2 1 1

C3 1 1

C4 1 1 1

C5 1 1

C6 1 1

C7 1 1

Cyclohexanone adjacency

matrix

Numbering is crucial Renumber

for different starting points Also

the lsquoedgesrsquo of the graph are annotated to speed up the matching This is the lsquoslowrsquo part

of lsquomatching subgraph isomorphismrsquo and is of the class of NP-complete problems

The Ullman Algorithm

is a way of detecting

detecting substructures

Next lecture

bull How can we use this type of information to

solve our chemistry problems

ndash PatentsMarkush Structures

ndash Molecular Similarity

ndash Molecular Properties

ndash 3D searching

ndash Virtual Screening

ndash Structure PropertyActivity Relationships

Page 31: Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH

XML and molecules

• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.

• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.

• Chemical Markup Language (CML) is being developed specifically for chemistry.

• In the future much more information will be stored with molecules, allowing greater re-use of data.

• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928

Ethanol

<CML>

- Can be parsed
- Can contain reactions, properties, etc.
- Can contain relationships to other molecules and also concepts

InChI

InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3

SMILES

C(C)O
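To make "can be parsed" concrete, here is a minimal sketch that reads a small CML-style description of ethanol with Python's standard XML parser. The document text is a hand-written illustration, not output from any CML tool. It omits the CML namespace for brevity, and the element and attribute names (molecule, atomArray, atom, bondArray, bond, atomRefs2, hydrogenCount) are used here on the assumption that they follow common CML usage.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written, simplified CML-style record for ethanol (illustrative only).
CML_TEXT = """
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C" hydrogenCount="3"/>
    <atom id="a2" elementType="C" hydrogenCount="2"/>
    <atom id="a3" elementType="O" hydrogenCount="1"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>
"""

def read_cml(text):
    """Return (atoms, bonds) parsed from a CML-style string."""
    root = ET.fromstring(text.strip())
    atoms = {
        atom.get("id"): (atom.get("elementType"), int(atom.get("hydrogenCount", "0")))
        for atom in root.iter("atom")
    }
    bonds = [
        tuple(bond.get("atomRefs2").split()) + (bond.get("order"),)
        for bond in root.iter("bond")
    ]
    return atoms, bonds

atoms, bonds = read_cml(CML_TEXT)

# Derive the molecular formula from heavy atoms plus their hydrogen counts.
formula = Counter()
for element, n_hydrogens in atoms.values():
    formula[element] += 1
    formula["H"] += n_hydrogens
```

Because every value carries a tag, the same file could also hold units, dates or links to other records without breaking this reader, which is the point of the metadata argument above.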

Molecules and 3-Dimensions

• Molecules are, of course, not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.

• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction; e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).

• From these can be obtained atom positions, bonds, coordinates, etc.

• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.

• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.

• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
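Torsional angles are one of the internal coordinates mentioned above. As a minimal sketch, the function below computes the torsion defined by four atom positions from their Cartesian coordinates using the standard atan2 formula. The function name and the test points are illustrative assumptions, and the sign of the result depends on the convention chosen.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion_degrees(p0, p1, p2, p3):
    """Torsion angle (degrees) about the p1-p2 bond for atoms p0-p1-p2-p3."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Idealised test geometries (unit bond lengths, right angles).
cis = torsion_degrees((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0))
trans = torsion_degrees((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0))
perpendicular = torsion_degrees((1, 0, 0), (0, 0, 0), (0, 1, 0), (0, 1, 1))
```

A conformational scan would step this angle through a range of values for each rotatable bond and rebuild the coordinates at each step.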

Molecules in 3D - uses

• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions...

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM

http://ambermd.org

http://www.ks.uiuc.edu/Research/namd

http://www.ks.uiuc.edu/Research/vmd
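The "film strip" picture can be sketched in a few lines. In an SD file each record ends with a line containing only `$$$$`, so a concatenated series of snapshots can be split back into frames on that delimiter. The record contents below are shortened placeholders rather than valid molfile blocks, and the function name is an assumption made for illustration.

```python
def split_sd_frames(text):
    """Split concatenated SD-file text into records, one per '$$$$' delimiter."""
    frames, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            frames.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return frames

# Three placeholder snapshots of the same molecule at successive time steps.
TRAJECTORY = "\n".join([
    "frame 1", "(coordinates at t = 0)", "M  END", "$$$$",
    "frame 2", "(coordinates at t = 1)", "M  END", "$$$$",
    "frame 3", "(coordinates at t = 2)", "M  END", "$$$$",
])

frames = split_sd_frames(TRAJECTORY)
```

Dedicated molecular dynamics packages usually use more compact binary trajectory formats, but the idea of an ordered list of coordinate frames is the same.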

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor.

Apelin's role in Cancer

'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253

• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372

So there are very many file formats – which could be a real pain, but they can be converted

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program – Babel

Recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers – things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect – check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but are not observed in the file).

2 Finding molecules

A typical problem involves finding the 'right' molecules

All the synthesisable molecules: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1 Finding molecules using Substructure searching

One of the most useful 'properties' of a molecule is its molecular graph.

Many software systems allow searching for whole molecules, fragments of molecules, or even 'similar' molecules.

For a specific molecule, this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules (http://pubchem.ncbi.nlm.nih.gov).

A more complex example – find a molecule containing our query as part of a larger molecule.

Results – it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly the whole database is not being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

[Figure labels: the set bits mark 'Contains P' and 'Contains NH2'; an unset bit marks 'Doesn't contain F'.]

So a quick match of the bitstring of the query fragment against the (pre-computed) bitstrings in the database will eliminate most of the structures.
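The screening step described above can be sketched as a bitwise test. This is a minimal illustration and not how PubChem itself is implemented. The fragment dictionary, the tiny database and its fragment assignments are assumptions made for the example.

```python
# Hypothetical fragment dictionary: fragment name -> bit position.
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "COOH": 4, "phenyl": 5}

def fingerprint(fragments):
    """Pack the fragments present in a molecule into an integer bitstring."""
    bits = 0
    for name in fragments:
        bits |= 1 << FRAGMENT_BITS[name]
    return bits

def may_contain(query_bits, molecule_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (query_bits & molecule_bits) == query_bits

# Pre-computed bitstrings for an illustrative five-compound database.
DATABASE = {
    "aniline": fingerprint({"NH2", "phenyl"}),
    "phenol": fingerprint({"OH", "phenyl"}),
    "glycine": fingerprint({"NH2", "COOH"}),
    "tyrosine": fingerprint({"NH2", "COOH", "OH", "phenyl"}),
    "fluorobenzene": fingerprint({"F", "phenyl"}),
}

query = fingerprint({"NH2", "phenyl"})
candidates = [name for name, bits in DATABASE.items() if may_contain(query, bits)]
```

The screen only rules structures out. A molecule that passes has the right fragments somewhere, but not necessarily connected as in the query, so the survivors still go on to the full substructure match.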

Next step – Substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware salts etc.: you may need to strip out counterions.
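As a rough sketch of exact matching, the lookup below treats identifier strings as dictionary keys and strips counterions first. It assumes the stored SMILES have already been canonicalised by a cheminformatics toolkit, which this sketch does not do. Keeping the longest dot-separated component is only a crude stand-in for "largest fragment". The registry contents are hypothetical, and the N,N-isomer of dimethylaniline is assumed.

```python
def strip_counterions(smiles):
    """Keep the longest dot-separated component of a SMILES string.

    String length is used as a crude proxy for fragment size.
    """
    return max(smiles.split("."), key=len)

# Hypothetical registry: canonical SMILES (assumed pre-computed) -> name.
REGISTRY = {
    "CN(C)c1ccccc1": "N,N-dimethylaniline",
    "CC(=O)[O-]": "acetate",
}

def exact_match(smiles):
    """Return the registered name for a structure, or None if absent."""
    return REGISTRY.get(strip_counterions(smiles))

hit_salt = exact_match("CC(=O)[O-].[Na+]")   # sodium acetate, counterion removed
hit_plain = exact_match("CN(C)c1ccccc1")
miss = exact_match("CCO")
```

The same pattern works with InChI strings as keys. A plain string comparison is only safe because canonical identifiers give one string per structure.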

Substructure Search

Supposing we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents, etc.
– –OH
– –NH2
– –COOH
– phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap

[Figure: example structure with atom labels OH, CH2, CHNH2, OH, O]

Matching the query structure to the database

Two algorithms are commonly used:

The Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.

Note – a number of 'substructural' fragments can be matched here (6 in all).

Read these for more details:

Gund P., Ann. Rep. in Med. Chem. 14, 299, 1979
Sheridan R. P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989
Brint A. T., Willett P., J. Mol. Graphics 5, 49, 1987
J. R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM) Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan H. L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-112

The basic concept is that molecules are considered as graphs.

Atoms are 'nodes'

Bonds are 'edges'

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
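A labelled graph of this kind can be held in a very small data structure. The sketch below is one possible layout, with class and method names invented for illustration. Atoms are nodes labelled with an element symbol, and bonds are edges labelled with a bond order. Hydrogens are left implicit.

```python
from dataclasses import dataclass, field

@dataclass
class MolGraph:
    atoms: dict = field(default_factory=dict)   # node id -> element symbol
    bonds: dict = field(default_factory=dict)   # frozenset({i, j}) -> bond order

    def add_atom(self, idx, element):
        self.atoms[idx] = element

    def add_bond(self, i, j, order=1):
        # A frozenset key makes the edge undirected: (i, j) equals (j, i).
        self.bonds[frozenset((i, j))] = order

    def neighbours(self, idx):
        return sorted(j for pair in self.bonds if idx in pair
                      for j in pair if j != idx)

# Acetaldehyde heavy atoms: C1-C2(=O3)
mol = MolGraph()
mol.add_atom(1, "C")
mol.add_atom(2, "C")
mol.add_atom(3, "O")
mol.add_bond(1, 2)
mol.add_bond(2, 3, order=2)
```

Here `mol.neighbours(2)` lists the atoms bonded to the carbonyl carbon, and `mol.bonds[frozenset((2, 3))]` recovers the double-bond label on the C=O edge.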

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107–113, 1965

The first step - Numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest-labelled atom, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest numbered
• Rules are used to break equalities, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered
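The iteration can be sketched as follows. This is a simplified version that starts from the degree of each atom only, and ignores atomic number and bond type. It takes each new label as the sum of the neighbours' labels, and keeps the last labelling that increased the number of distinct values. The function name and the n-pentane example are assumptions made for illustration.

```python
def morgan_labels(adjacency):
    """Simplified Morgan labels for a graph given as node -> list of neighbours."""
    # Step (i): start from the degree of each node.
    labels = {node: len(nbrs) for node, nbrs in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): each new label is the sum of the neighbours' labels.
        updated = {node: sum(labels[m] for m in adjacency[node])
                   for node in adjacency}
        n_updated = len(set(updated.values()))
        # Step (iii): stop once the number of distinct labels stops increasing.
        if n_updated <= n_classes:
            return labels
        labels, n_classes = updated, n_updated

# Carbon skeleton of n-pentane: 0-1-2-3-4
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = morgan_labels(pentane)

# Number atoms from the highest label down; ties are broken by node id here,
# standing in for the tie-breaking rules mentioned above.
order = sorted(pentane, key=lambda node: (-labels[node], node))
```

For the pentane chain the labels come out as 2, 3, 4, 3, 2 along the chain, so numbering starts at the central carbon and works outwards, and the two symmetry-equivalent pairs are ordered arbitrarily.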

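[Figure: two numbered molecular graphs, a four-atom query labelled 'Urea' (atoms O1, C2, C3, C4) and cyclohexanone (atoms O1, C2 to C7)]

Urea adjacency matrix

      O1  C2  C3  C4
O1                 1
C2                 1
C3                 1
C4     1   1   1

Cyclohexanone adjacency matrix

[Table: 7 × 7 adjacency matrix for atoms O1, C2, C3, C4, C5, C6, C7; entries per row: O1 1, C2 2, C3 2, C4 3, C5 2, C6 2, C7 2]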
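An adjacency matrix like the ones above can be built directly from a bond list. The sketch below is illustrative only. The four-atom query follows the atom labels used on the slide. For cyclohexanone, C4 is taken as the carbonyl carbon because it has three neighbours, but the ring order used for the other carbons is an assumption.

```python
def adjacency_matrix(names, bonds):
    """Symmetric 0/1 matrix with rows and columns in the order of `names`."""
    index = {name: k for k, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for a, b in bonds:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1
    return matrix

query_names = ["O1", "C2", "C3", "C4"]
query_bonds = [("O1", "C4"), ("C2", "C4"), ("C3", "C4")]
query_matrix = adjacency_matrix(query_names, query_bonds)

ring_names = ["O1", "C2", "C3", "C4", "C5", "C6", "C7"]
# Assumed ring order: C4-C2-C5-C6-C7-C3-C4, with O1 attached to C4.
ring_bonds = [("O1", "C4"), ("C4", "C2"), ("C2", "C5"), ("C5", "C6"),
              ("C6", "C7"), ("C7", "C3"), ("C3", "C4")]
ring_matrix = adjacency_matrix(ring_names, ring_bonds)

# Row sums give the degree (number of bonded neighbours) of each atom.
query_degrees = [sum(row) for row in query_matrix]
ring_degrees = [sum(row) for row in ring_matrix]
```

The row sums reproduce the number of entries per row in the two matrices above, whichever ring order is chosen.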

Numbering is crucial: renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part, subgraph isomorphism matching, which belongs to the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures.
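The matching step can be sketched as a backtracking search with Ullmann-style refinement. This is a simplified illustration rather than Ullmann's original matrix formulation. Atoms are matched on element symbol and degree only, bond orders are ignored, and the atom numbering of the cyclohexanone target is an arbitrary assumption.

```python
def ullmann_matches(q_adj, q_label, t_adj, t_label):
    """All mappings of query atoms onto target atoms that preserve query bonds.

    q_adj / t_adj map each atom to the set of its neighbours.
    """
    q_nodes = list(q_adj)

    def refine(cand):
        # Target atom t stays a candidate for query atom q only if every
        # neighbour of q still has a candidate among the neighbours of t.
        changed = True
        while changed:
            changed = False
            for q in q_nodes:
                for t in list(cand[q]):
                    if not all(cand[qn] & t_adj[t] for qn in q_adj[q]):
                        cand[q].discard(t)
                        changed = True
        return all(cand[q] for q in q_nodes)

    def search(depth, cand, found):
        if depth == len(q_nodes):
            found.append({q: next(iter(cand[q])) for q in q_nodes})
            return
        q = q_nodes[depth]
        for t in sorted(cand[q]):
            trial = {k: set(v) for k, v in cand.items()}
            trial[q] = {t}
            for other in q_nodes:          # each target atom is used once
                if other != q:
                    trial[other].discard(t)
            if refine(trial):
                search(depth + 1, trial, found)

    # Initial candidates: same element, at least as many neighbours.
    start = {q: {t for t in t_adj
                 if t_label[t] == q_label[q] and len(t_adj[t]) >= len(q_adj[q])}
             for q in q_nodes}
    found = []
    if refine(start):
        search(0, start, found)
    return found

# Four-atom query from the slide: C4 bonded to O1, C2 and C3.
q_adj = {"O1": {"C4"}, "C2": {"C4"}, "C3": {"C4"}, "C4": {"O1", "C2", "C3"}}
q_label = {"O1": "O", "C2": "C", "C3": "C", "C4": "C"}

# Cyclohexanone target: atom 0 is O, atom 1 the carbonyl C, atoms 2-6 the ring.
t_adj = {0: {1}, 1: {0, 2, 6}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5, 1}}
t_label = {0: "O", 1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C"}

matches = ullmann_matches(q_adj, q_label, t_adj, t_label)
```

The query is found twice in the ring, because the two carbons next to the carbonyl can be swapped. The refinement step prunes most target atoms before any backtracking, which matters because the worst-case cost is exponential.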


Next lecture

• How can we use this type of information to solve our chemistry problems?

– Patents/Markush Structures
– Molecular Similarity
– Molecular Properties
– 3D searching
– Virtual Screening
– Structure Property/Activity Relationships


Next lecture

bull How can we use this type of information to

solve our chemistry problems

ndash PatentsMarkush Structures

ndash Molecular Similarity

ndash Molecular Properties

ndash 3D searching

ndash Virtual Screening

ndash Structure PropertyActivity Relationships


Molecules and 3-Dimensions

• Molecules are, of course, not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their Cartesian coordinates or their internal coordinates.

• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction; e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).

• From these can be obtained atom positions, bonds, coordinates etc.

• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.

• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.

• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
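To make internal coordinates and torsional scanning a little more concrete, here is a minimal Python sketch (not from the slides). It computes the torsion angle defined by four atom positions and counts how many grid points a systematic torsion scan would visit. The function names, the test coordinates and the 30-degree step are illustrative assumptions.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion_deg(p1, p2, p3, p4):
    """Signed torsion angle (degrees) about the p2-p3 bond."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def scan_size(n_torsions, step_deg=30):
    """Number of grid points in a systematic scan of n rotatable bonds."""
    return (360 // step_deg) ** n_torsions

# Hypothetical positions: end atoms on the same side (cis) and on opposite sides (trans)
cis = torsion_deg((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0))
trans = torsion_deg((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0))
```

With a 30-degree step, two torsions give 144 grid points, while five already give 248,832, which is why scanning conformational space becomes expensive so quickly.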

Molecules in 3D - uses

• More accurate calculation of molecular properties

• Comparison of the shapes (conformations) of molecules

• Comparison of the dynamics of molecules

• Calculation of bulk properties

• Simulation of chemical reactions…

The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.

https://en.wikipedia.org/wiki/CHARMM

http://ambermd.org

http://www.ks.uiuc.edu/Research/namd

http://www.ks.uiuc.edu/Research/vmd
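The 'film strip' idea can be sketched in a few lines of Python. This is an illustration only: it uses the simple XYZ text format instead of SD files to keep the code short, and the toy H2 'trajectory' and the function names are made up for the example.

```python
def frame_to_xyz(atoms, comment):
    """One XYZ frame: atom count, comment line, then one line per atom."""
    lines = [str(len(atoms)), comment]
    lines += [f"{sym} {x:.3f} {y:.3f} {z:.3f}" for sym, x, y, z in atoms]
    return "\n".join(lines) + "\n"

def write_trajectory(frames):
    """Concatenate the frames, like the pictures on a film strip."""
    return "".join(frame_to_xyz(atoms, f"frame {i}") for i, atoms in enumerate(frames))

def read_trajectory(text):
    """Split concatenated XYZ text back into a list of frames."""
    lines = text.splitlines()
    frames, pos = [], 0
    while pos < len(lines):
        n_atoms = int(lines[pos])
        frame = []
        for line in lines[pos + 2 : pos + 2 + n_atoms]:
            sym, x, y, z = line.split()
            frame.append((sym, float(x), float(y), float(z)))
        frames.append(frame)
        pos += 2 + n_atoms   # skip count line, comment line and atom lines
    return frames

# Toy 'trajectory': an H2 molecule whose bond stretches a little in each frame
frames = [[("H", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 0.74 + 0.05 * i)] for i in range(3)]
movie = write_trajectory(frames)
```

Reading `movie` back with `read_trajectory` returns the three frames in order; real molecular dynamics packages use more compact binary trajectory formats, but the principle is the same.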

We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called ‘Apelin’, with water and the membrane. We used these simulations to design new molecules that block this receptor.

Apelin’s role in Cancer

The ‘Apelin’ receptor (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain, 2017, awx253, https://doi.org/10.1093/brain/awx253

• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery, 2011, 6(3), 367-372

So there are very many file formats, which could be a real pain... but they can be converted

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program: Babel

IUPAC has recently moved towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content.

Molecules on computers: things to look out for, since what is stored is actually quite crude

For example:

• Stereochemistry may be relative and not absolute, or even incorrect.

• In proteins, only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

• Tautomers can be incorrect; check they look reasonable.

• Mesomers can be incorrect (double bonds).

• Polymers (which are mixtures of MWt and topology) are very difficult to store.

• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).

2 Finding molecules

A typical problem involves finding the ‘right’ molecules

All the molecules synthesisable: 10^60

Molecules with relevance to the problem (that we can generate): 10^10

Molecules we can search efficiently: 10^7

Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1 Finding molecules using Substructure searching

One of the most useful ‘properties’ of a molecule is its molecular graph.

Many software systems allow searching for whole molecules, fragments of molecules, or even ‘similar’ molecules.

For a specific molecule, this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we’ve used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example: find a molecule containing our query as part of a larger molecule.

Results: it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called Hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit position:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Bit value:     0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

[Figure annotations on the bitstring: ‘Contains P’, ‘Contains NH2’, ‘Doesn’t contain F’]

So a quick match of the bitstring of the query fragment against the (pre-computed) bitstrings in the database will eliminate most of the structures.
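The screening step can be sketched as follows. This is a minimal Python illustration under stated assumptions: the fragment key set, the four-compound 'database' and the hand-assigned fragment lists are all hypothetical, and no real fragment perception is done. A molecule passes the screen only if every bit set in the query is also set in the molecule.

```python
# Hypothetical key set: one bit per fragment
FRAGMENT_KEYS = ["OH", "NH2", "COOH", "phenyl", "F", "P", "Cl", "C=O"]

def fingerprint(fragments):
    """Pack the fragments present in a molecule into an integer bitstring."""
    bits = 0
    for frag in fragments:
        bits |= 1 << FRAGMENT_KEYS.index(frag)
    return bits

def passes_screen(query_bits, mol_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (query_bits & mol_bits) == query_bits

# Hypothetical database with hand-assigned fragments
DATABASE = {
    "aniline": {"NH2", "phenyl"},
    "phenol": {"OH", "phenyl"},
    "tyrosine": {"OH", "NH2", "COOH", "phenyl", "C=O"},
    "acetone": {"C=O"},
}

# Pre-computed once, when the database is built
FINGERPRINTS = {name: fingerprint(frags) for name, frags in DATABASE.items()}

def screen(query_fragments):
    """Names of the molecules that survive the bitstring screen."""
    query_bits = fingerprint(query_fragments)
    return sorted(name for name, bits in FINGERPRINTS.items()
                  if passes_screen(query_bits, bits))

candidates = screen({"NH2", "phenyl"})
```

Only the survivors of the screen need the slower atom-by-atom match, since a passing bitstring shows that the right fragments are present but not that they are connected in the right way.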

Next step: Substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware: salts etc. may need their counterions stripped out first.
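As a rough sketch of exact matching with counterion stripping, assuming the database is indexed by SMILES strings that were all canonicalised by the same toolkit (the two-entry index below is hypothetical, and the strings simply stand in for canonical SMILES). The 'largest component' rule is a crude heuristic, not a full salt-stripping procedure.

```python
def strip_counterions(smiles):
    """Keep the largest '.'-separated component of a SMILES string.

    Size is estimated by counting letters, a crude stand-in for atom count.
    """
    parts = smiles.split(".")
    return max(parts, key=lambda part: sum(ch.isalpha() for ch in part))

# Hypothetical index: canonical SMILES -> compound name
INDEX = {
    "CN(C)c1ccccc1": "N,N-dimethylaniline",
    "CC(=O)O": "acetic acid",
}

def exact_match(smiles):
    """Look up the parent structure after removing counterions."""
    return INDEX.get(strip_counterions(smiles))

# The hydrochloride salt is found under its parent structure
hit = exact_match("CN(C)c1ccccc1.Cl")
```

A dictionary lookup like this only works if the query and the database entries were canonicalised in the same way; charged parents (such as an acetate anion left over from a sodium salt) would need neutralising as a further step.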

Substructure Search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• ‘Subgraphs’ can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.

– –OH

– –NH2

– –COOH

– phenyl

• This can be done by tracing appropriate paths in the graph

• Subgraphs may overlap

[Figure: example structure with the groups OH, CH2, CHNH2, OH and O labelled]

Matching the query structure to the database

Two algorithms are commonly used:

• The Morgan algorithm, which numbers the molecule uniquely, and

• The Ullmann algorithm, which matches the fragments

Note: a number of ‘substructural’ fragments can be matched here (6 in all).

Read these for more details:

Gund P., Ann. Rep. in Med. Chem. 14, 299, 1979

Sheridan R.P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989

Brint A.T., Willett P., J. Mol. Graphics 5, 49, 1987

J.R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925

Morgan H.L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113

The basic concept is that molecules are considered as Graphs

• Atoms are ‘nodes’

• Bonds are ‘edges’

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
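One possible way to hold such a labelled graph in a program is sketched below in Python. The `Molecule` class and its method names are illustrative assumptions, not a standard API; hydrogens are left implicit, and the example is the heavy-atom skeleton of acetone.

```python
from dataclasses import dataclass

@dataclass
class Molecule:
    atoms: dict   # node id -> element symbol (the node label)
    bonds: dict   # frozenset({i, j}) -> bond order (the edge label)

    def neighbours(self, i):
        """Ids of the atoms bonded to atom i."""
        return sorted(j for pair in self.bonds if i in pair for j in pair if j != i)

    def degree(self, i):
        """Number of heavy-atom neighbours of atom i."""
        return len(self.neighbours(i))

# Acetone skeleton: O1 double-bonded to C4, which carries C2 and C3
acetone = Molecule(
    atoms={1: "O", 2: "C", 3: "C", 4: "C"},
    bonds={frozenset({1, 4}): 2, frozenset({2, 4}): 1, frozenset({3, 4}): 1},
)
```

Using an unordered pair as the bond key means the bond between atoms 1 and 4 is found whichever atom is named first.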

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - A technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107-113, 1965

The first step - Numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged

• Numbering starts from the highest label, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest-numbered atom

• Rules are used to break equalities, such as C before N, double bond before single, etc.

• Some equivalent atoms are arbitrarily numbered
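The steps above can be sketched in Python as follows. This is a simplified illustration, not the full published procedure: the initial label is just the degree, step (ii) is read as replacing each label by the sum of its neighbours' labels, and ties are broken by the input atom id instead of by chemical rules such as 'C before N'. The n-pentane carbon chain is a made-up test case, and the graph is assumed to be connected.

```python
def morgan_labels(adjacency):
    """Extended-connectivity labels; adjacency maps atom -> list of neighbours."""
    labels = {atom: len(nbrs) for atom, nbrs in adjacency.items()}   # step (i)
    n_classes = len(set(labels.values()))
    while True:
        # step (ii): new label = sum of the neighbours' current labels
        updated = {atom: sum(labels[n] for n in nbrs)
                   for atom, nbrs in adjacency.items()}
        n_updated = len(set(updated.values()))
        if n_updated <= n_classes:        # step (iii): no more classes gained
            return labels
        labels, n_classes = updated, n_updated

def morgan_numbering(adjacency):
    """Number atoms outwards from the highest label, 'like an onion'."""
    labels = morgan_labels(adjacency)

    def rank(atom):
        return (-labels[atom], atom)      # highest label first, then atom id

    order = [min(adjacency, key=rank)]
    for atom in order:                    # the list grows while we walk it
        for nbr in sorted(adjacency[atom], key=rank):
            if nbr not in order:
                order.append(nbr)
    return {atom: number for number, atom in enumerate(order, start=1)}

# Carbon chain of n-pentane: C1-C2-C3-C4-C5
PENTANE = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = morgan_labels(PENTANE)
numbering = morgan_numbering(PENTANE)
```

For the pentane chain the labels settle into three classes (ends, next-to-ends, centre), and numbering starts from the central carbon. The two next-to-end carbons are equivalent, so which of them becomes atom 2 is arbitrary, as the last bullet says.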

[Figure: adjacency matrices for the query fragment (titled ‘Urea’ on the slide; atoms O1, C2, C3, C4) and for cyclohexanone (atoms O1, C2 to C7)]

Adjacency matrix of the query fragment:

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

In the cyclohexanone adjacency matrix, O1 has one entry, C4 has three, and each of the other ring carbons (C2, C3, C5, C6, C7) has two.

Numbering is crucial. Renumber for different starting points. Also, the ‘edges’ of the graph are annotated to speed up the matching. This is the ‘slow’ part of ‘matching subgraph isomorphism’, which belongs to the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures
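A small Python sketch of matching the four-atom fragment against cyclohexanone using adjacency matrices. It is a plain backtracking search in the spirit of Ullmann's method and omits his refinement step, which is what makes the real algorithm fast. Atoms are indexed from 0, and the ring connectivity chosen for cyclohexanone is an assumption that agrees with the number of bonds per atom given above.

```python
def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 matrix from a list of (i, j) bonds."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix

def find_matches(q_elems, q_adj, t_elems, t_adj):
    """All mappings of query atoms onto target atoms that preserve
    element labels and every query bond."""
    matches = []

    def extend(mapping):
        k = len(mapping)                  # next query atom to place
        if k == len(q_elems):
            matches.append(tuple(mapping))
            return
        for cand in range(len(t_elems)):
            if cand in mapping or t_elems[cand] != q_elems[k]:
                continue
            # each query bond from atom k to a placed atom must exist in the target
            if all(t_adj[cand][mapping[j]] for j in range(k) if q_adj[k][j]):
                mapping.append(cand)
                extend(mapping)
                mapping.pop()

    extend([])
    return matches

# Query fragment: O1, C2, C3 all bonded to C4 (indices 0..3)
Q_ELEMS = ["O", "C", "C", "C"]
Q_ADJ = adjacency_matrix(4, [(0, 3), (1, 3), (2, 3)])

# Cyclohexanone, assumed ring: C4-C2-C5-C7-C6-C3-C4, with O1 on C4 (indices 0..6)
T_ELEMS = ["O"] + ["C"] * 6
T_ADJ = adjacency_matrix(7, [(0, 3), (1, 3), (2, 3), (1, 4), (2, 5), (4, 6), (5, 6)])

matches = find_matches(Q_ELEMS, Q_ADJ, T_ELEMS, T_ADJ)
```

The search finds two mappings, which differ only by swapping the two equivalent carbons next to the carbonyl. In the worst case the number of partial mappings explored grows exponentially with the size of the query, which is the practical face of the NP-completeness mentioned above.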


Next lecture

• How can we use this type of information to solve our chemistry problems?

– Patents/Markush Structures

– Molecular Similarity

– Molecular Properties

– 3D searching

– Virtual Screening

– Structure Property/Activity Relationships


Page 35: Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH

The dynamics of a molecule can be computed and stored as a series of

frames of the coordinates and bonds of the structure (much like a cartoon)

Molecular Dynamics is the most popular method for larger systems Here is

an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a

series of snapshots of SD files concatenated together to make a movie just

like a film strip

httpsenwikipediaorgwikiCHARMM

httpambermdorg

httpwwwksuiuceduResearchnamd

httpwwwksuiuceduResearchvmd

We can simulate an entire biological receptor

and test new molecules to see how they might

work

bull Here is a simulation of an important new cancer target

called lsquoApelinrsquo with water and the membrane We used

these to design new molecules that block this receptor

Apelinrsquos role in Cancer

lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an

antagonist (a molecule that blocks the receptor) stop tumour growth

bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma

growth Brain 2017 awx253 httpsdoiorg101093brainawx253

bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et

al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372

So there are very many file formats

ndash which could be a real painbut

they can be convertedalc -- Alchemy file prep -- Amber PREP file

bs -- Ball amp Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical Resource

Kit 2D file

crk3d -- CRK3D Chemical Resource

Kit 3D file

box -- Dock 35 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI BiosymInsight II CAR

file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D: Chemical Resource Kit 2D file
crk3d -- CRK3D: Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program – Babel

Recent IUPAC moves towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content

Molecules on computers – things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins, only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect – check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of MWt and topology) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).

2 Finding molecules

A typical problem involves finding the ‘right’ molecules

All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions

1 Finding molecules using Substructure searching

One of the most useful ‘properties’ of a molecule is its molecular graph.

Many software systems allow searching for whole molecules, or fragments of molecules, or even ‘similar’ molecules.

For a specific molecule this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we’ve used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules (http://pubchem.ncbi.nlm.nih.gov).

A more complex example – find a molecule containing our query as part of a larger molecule

Results – takes a few seconds to search all 92 million molecules

How does this work?

The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.

The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.

This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

Set bits mark fragments that are present (slide annotations: Contains P, Contains NH2); unset bits mark fragments that are absent (Doesn’t contain F).

So a quick match of the bitstring of the fragment searched over the (pre-computed) bitstrings in the database will eliminate most of the structures.
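The screening step can be sketched in a few lines. This is a minimal illustration, not the scheme any particular database uses: the fragment dictionary, its bit positions and the toy database entries are all assumptions made for the example.

```python
# Hypothetical fragment dictionary: fragment name -> bit position (1-based).
# The positions are illustrative; the slide does not say which bit means what.
FRAGMENT_BITS = {"F": 3, "P": 6, "NH2": 12}


def fingerprint(fragments):
    """Return an integer whose set bits mark the fragments present."""
    bits = 0
    for name in fragments:
        bits |= 1 << (FRAGMENT_BITS[name] - 1)
    return bits


def passes_screen(query_bits, molecule_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (query_bits & molecule_bits) == query_bits


# Pre-computed bitstrings for a toy database (made-up entries).
database = {
    "mol_a": fingerprint(["P", "NH2"]),
    "mol_b": fingerprint(["F"]),
    "mol_c": fingerprint(["NH2", "F"]),
}

query = fingerprint(["NH2"])
candidates = [name for name, bits in database.items() if passes_screen(query, bits)]
print(candidates)  # ['mol_a', 'mol_c']
```

Molecules that fail the screen cannot contain the query, so they are discarded cheaply. Molecules that pass are only candidates and still need the full pattern match.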

Next step – Substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware salts etc.: you may need to strip out counterions.
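As a rough sketch of the salt problem, the snippet below strips counterions before comparing SMILES strings. It assumes the stored SMILES were all canonicalised by the same toolkit, and it uses a crude heuristic (keep the longest dot-separated component) that a real system would replace with proper fragment handling. The function names and the toy database are illustrative.

```python
def strip_counterions(smiles):
    """Keep the longest dot-separated component of a SMILES string.

    Crude heuristic: 'largest' is judged by text length, not by atom count.
    """
    return max(smiles.split("."), key=len)


def exact_match(query, database):
    """Return names whose stored SMILES equals the query once salts are stripped.

    Assumes every SMILES is already canonical (same toolkit, same settings).
    """
    target = strip_counterions(query)
    return [name for name, smi in database.items()
            if strip_counterions(smi) == target]


# Toy database; the hydrochloride is written as a neutral two-component SMILES.
database = {
    "N,N-dimethylaniline": "CN(C)c1ccccc1",
    "N,N-dimethylaniline hydrochloride": "CN(C)c1ccccc1.Cl",
    "aniline": "Nc1ccccc1",
}

hits = exact_match("CN(C)c1ccccc1", database)
print(hits)  # ['N,N-dimethylaniline', 'N,N-dimethylaniline hydrochloride']
```

Without the stripping step the hydrochloride would be missed, because a plain string comparison treats the salt as a different structure.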

Substructure Search

Supposing we wish to find part of a molecule in a database: to do this we have to do a substructure search.

A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this: firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical; then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• ‘Subgraphs’ can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
– –OH
– –NH2
– –COOH
– phenyl
• this can be done by tracing appropriate paths in the graph
• subgraphs may overlap

[Figure: example structure drawn with the labels OH, CH2, CHNH2, OH and O]

Matching the query structure to the database

Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments

Note – a number of ‘substructural’ fragments can be matched here (6 in all)

Read these for more details:

Gund P., Ann. Rep. in Med. Chem. 14, 299, 1979
Sheridan R.P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989
Brint A.T., Willett P., J. Mol. Graphics 5, 49, 1987
J.R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM) Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan H.L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113

The basic concept is that molecules are considered as graphs:

Atoms are ‘nodes’
Bonds are ‘edges’

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
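One possible way to hold such a labelled graph in code is sketched below. The class name, its fields and the choice of acetone heavy atoms as the example are assumptions for illustration, not a format used by any particular package.

```python
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """A molecule as a labelled graph (hypothetical minimal container)."""
    atoms: dict = field(default_factory=dict)  # node id -> element symbol (node label)
    bonds: dict = field(default_factory=dict)  # frozenset({i, j}) -> bond order (edge label)

    def add_atom(self, idx, element):
        self.atoms[idx] = element

    def add_bond(self, i, j, order=1):
        self.bonds[frozenset((i, j))] = order

    def neighbours(self, idx):
        """Ids of the atoms bonded to atom idx, in ascending order."""
        return sorted(j for pair in self.bonds if idx in pair
                      for j in pair if j != idx)


# Heavy atoms of acetone: O1 double-bonded to C4, C2 and C3 single-bonded to C4.
mol = MolGraph()
for idx, element in [(1, "O"), (2, "C"), (3, "C"), (4, "C")]:
    mol.add_atom(idx, element)
mol.add_bond(1, 4, order=2)
mol.add_bond(2, 4)
mol.add_bond(3, 4)

print(mol.neighbours(4))            # [1, 2, 3]
print(mol.bonds[frozenset((1, 4))])  # 2
```

Using an unordered pair as the key for each bond means the edge 1-4 and the edge 4-1 are stored once, which matches the undirected nature of a chemical bond.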

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - A technique developed at Chemical Abstracts Service, J. Chemical Documentation 5:107–113, 1965

The first step - Numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest, then uses the next highest connection as 2 etc. Like an onion: concentric circles from the highest numbered
• Rules are used to break equalities, such as C before N, double bond before single etc.
• Some equivalent atoms are arbitrarily numbered
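The iterative labelling can be sketched as follows. This is a simplified version under stated assumptions: the initial label is the degree only (atomic number and bond types are ignored), the stopping rule is "the number of distinct labels no longer increases", and ties are broken by atom index rather than by chemical rules. The function name and the n-pentane example are illustrative.

```python
def morgan_labels(neighbours):
    """Simplified Morgan labelling.

    neighbours: dict mapping each atom id to a list of bonded atom ids.
    Returns the labels from the last iteration that increased the number
    of distinct label values.
    """
    # Step (i): start from the degree of each atom.
    labels = {atom: len(nbrs) for atom, nbrs in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): new label = sum of the labels of the adjacent atoms.
        updated = {atom: sum(labels[n] for n in nbrs)
                   for atom, nbrs in neighbours.items()}
        updated_classes = len(set(updated.values()))
        # Step (iii): stop when the number of distinct labels stops growing.
        if updated_classes <= n_classes:
            return labels
        labels, n_classes = updated, updated_classes


# Heavy atoms of n-pentane as a chain 1-2-3-4-5.
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

labels = morgan_labels(pentane)
print(labels)  # {1: 2, 2: 3, 3: 4, 4: 3, 5: 2}

# Order atoms by descending label; equal labels are broken arbitrarily (by id here).
order = sorted(labels, key=lambda atom: (-labels[atom], atom))
print(order)   # [3, 2, 4, 1, 5]
```

Degrees alone give only two classes for pentane (terminal and internal carbons); one round of summation separates the central carbon, so numbering starts there and works outwards.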

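Urea adjacency matrix (atom 4 is bonded to atoms 1, 2 and 3)

      O1  C2  C3  C4
O1     0   0   0   1
C2     0   0   0   1
C3     0   0   0   1
C4     1   1   1   0

Cyclohexanone adjacency matrix (atoms O1, C2, C3, C4, C5, C6, C7): the row for O1 holds one 1, the row for C4 holds three, and the rows for C2, C3, C5, C6 and C7 hold two each.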
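A minimal sketch of how an adjacency matrix might be built from a bond list, using the four-atom example in which O1, C2 and C3 are each bonded to C4. The function name and the 1-based atom numbering in the bond list are choices made for this illustration.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 matrix from a list of (i, j) bonds, atoms numbered from 1."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i - 1][j - 1] = 1
        matrix[j - 1][i - 1] = 1  # bonds are undirected, so mirror the entry
    return matrix


# Four-atom example: atoms 1, 2 and 3 are each bonded to atom 4.
small = adjacency_matrix(4, [(1, 4), (2, 4), (3, 4)])
for row in small:
    print(row)
# [0, 0, 0, 1]
# [0, 0, 0, 1]
# [0, 0, 0, 1]
# [1, 1, 1, 0]

# Row sums give the degree (number of neighbours) of each atom.
degrees = [sum(row) for row in small]
print(degrees)  # [1, 1, 1, 3]
```

Renumbering the atoms permutes the rows and columns of this matrix, which is why a consistent numbering is needed before two structures can be compared entry by entry.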

Numbering is crucial: renumber for different starting points. Also, the ‘edges’ of the graph are annotated to speed up the matching. This is the ‘slow’ part of matching: subgraph isomorphism belongs to the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures
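To make the matching step concrete, here is a plain depth-first backtracking search for a query graph inside a target graph. It is written in the spirit of Ullmann's method but omits the refinement step that his paper uses to prune candidate atoms, so it should be read as an illustration of the problem rather than as his algorithm. The function name, the dictionary layout and the example graphs are assumptions.

```python
def find_substructure(query, target, query_labels, target_labels):
    """Return one mapping {query atom: target atom}, or None if there is no match.

    query, target: dicts mapping each atom id to a list of bonded atom ids.
    query_labels, target_labels: dicts mapping each atom id to an element symbol.
    """
    query_atoms = list(query)

    def extend(mapping):
        if len(mapping) == len(query_atoms):
            return dict(mapping)  # every query atom has been placed
        q = query_atoms[len(mapping)]
        for t in target:
            if t in mapping.values() or query_labels[q] != target_labels[t]:
                continue
            # Each already-placed neighbour of q must land on a neighbour of t.
            if all(mapping[n] in target[t] for n in query[q] if n in mapping):
                mapping[q] = t
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[q]  # backtrack and try the next target atom
        return None

    return extend({})


# Target: the four-atom example (O1, C2 and C3 each bonded to C4).
target = {1: [4], 2: [4], 3: [4], 4: [1, 2, 3]}
target_labels = {1: "O", 2: "C", 3: "C", 4: "C"}

# Query: a carbon bonded to an oxygen.
query = {"a": ["b"], "b": ["a"]}

match = find_substructure(query, target, {"a": "C", "b": "O"}, target_labels)
print(match)  # {'a': 4, 'b': 1}

no_match = find_substructure(query, target, {"a": "C", "b": "N"}, target_labels)
print(no_match)  # None
```

The search tries C2 and C3 for the query carbon first and abandons them because neither is bonded to the oxygen. That wasted work grows quickly with molecule size, which is where the bitstring screen and edge annotations earn their keep.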


Next lecture

• How can we use this type of information to solve our chemistry problems?
– Patents/Markush Structures
– Molecular Similarity
– Molecular Properties
– 3D searching
– Virtual Screening
– Structure Property/Activity Relationships


We can simulate an entire biological receptor and test new molecules to see how they might work

• Here is a simulation of an important new cancer target called ‘Apelin’, with water and the membrane. We used these to design new molecules that block this receptor.

Apelin’s role in Cancer

‘Apelin’ (a peptide receptor) is involved in cancer – therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.

• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253

• Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Rayalam et al., Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372


Page 37: Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH

Apelinrsquos role in Cancer

lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an

antagonist (a molecule that blocks the receptor) stop tumour growth

bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma

growth Brain 2017 awx253 httpsdoiorg101093brainawx253

bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et

al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372

So there are very many file formats

ndash which could be a real painbut

they can be convertedalc -- Alchemy file prep -- Amber PREP file

bs -- Ball amp Stick file caccrt -- Cacao Cartesian file

ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical Resource

Kit 2D file

crk3d -- CRK3D Chemical Resource

Kit 3D file

box -- Dock 35 Box file dmol -- DMol3 Coordinates file

feat -- Feature file gam -- GAMESS Output file

gamout -- GAMESS Output file gpr -- Ghemical Project file

mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file

hin -- HyperChem HIN file jout -- Jaguar Output file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file car -- MSI BiosymInsight II CAR

file

sdf -- MDL Isis SDF file sd -- MDL Isis SDF file

mdl -- MDL Molfile file mol -- MDL Molfile file

mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file

mmads -- MMADS file mpqc -- MPQC file

bgf -- MSI BGF file nwo -- NWChem Output file

pdb -- PDB file ent -- PDB file

pqs -- PQS file qcout -- Q-Chem Output file

res -- ShelX file ins -- ShelX file

smi -- SMILES file mol2 -- Sybyl Mol2 file

unixyz -- UniChem XYZ file vmol -- ViewMol file

alc -- Alchemy file bs -- Ball amp Stick file

caccrt -- Cacao Cartesian file cacint -- Cacao Internal file

cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file

c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table

file

cht -- Chemtool file cml -- Chemical Markup Language

file

crk2d -- CRK2D Chemical

Resource Kit 2D file

crk3d -- CRK3D Chemical

Resource Kit 3D file

cssr -- CSD CSSR file box -- Dock 35 Box file

dmol -- DMol3 Coordinates file feat -- Feature file

fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file

inp -- GAMESS Input file gcart -- Gaussian Cartesian file

gau -- Gaussian Input file gpr -- Ghemical Project file

gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file

hin -- HyperChem HIN file jin -- Jaguar Input file

bin -- OpenEye Binary file mmd -- MacroModel file

mmod -- MacroModel file out -- MacroModel file

dat -- MacroModel file sdf -- MDL Isis SDF file

sd -- MDL Isis SDF file mdl -- MDL Molfile file

mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file

mmads -- MMADS file bgf -- MSI BGF file

csr -- MSI Quanta CSR file nw -- NWChem Input file

pdb -- PDB file ent -- PDB file

pov -- POV-Ray Output file pqs -- PQS file

report -- Report file qcin -- Q-Chem Input file

smi -- SMILES file fix -- SMILES Fix file

mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file

txt -- Titles file unixyz -- UniChem XYZ file

vmol -- ViewMol file xed -- XED file

xyz -- XYZ file zin -- ZINDO Input file

File formats

-the bane of our lives

Interconnection program ndash Babel

Recent IUPAC moves towards a lsquostandardrsquo format however

In the near future there are likely to be many competing

requirements for file content

Molecules on computers ndash things to look out for

since what is stored is actually quite crude

For example-

Stereochemistry may be relative and not absolute or even incorrect

In proteins only the HEAVY atom positions are observed (sometimes at low resolution)

so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen

and oxygen get confused

Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved

correctly (problems with storing bonding)

Tautomers can be incorrect ndash check they look reasonable

Mesomers can be incorrect (double bonds)

Polymers (which are mixtures of MWt and topology) are very difficult to store

Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics

halogen-aromatic bonds etc may be inferred but not observed in the file)

2 Finding molecules

A typical problem involves finding the

lsquorightrsquo molecules

All the molecules

synthesisable

1060 Molecules with

relevance to the problem

1010 (that we can generate)

Molecules we can

search efficiently 107

Molecules we can

make and test 103

Serious difference in numbers we can consider a few hundred

molecules in our heads computers can evaluate millions

1 Finding molecules using Substructure searching

One of the most useful lsquopropertiesrsquo of a molecule is its molecular graph

Many software systems allow searching for whole molecules or

fragments of molecules or even lsquosimilarrsquo molecules

For a specific molecule this means specifying the required search pattern to get an

exact match

An example might be Search for an exact match for dimethylaniline Here wersquove

used the PubChem database which contains compounds and pharmacological

screening data on 92 million unique molecules httppubchemncbinlmnihgov

A more complex example ndash find a molecule

containing our query as part of a larger molecule

Results ndash takes a few seconds to search all

92 million molecules

How does this workThe first thing to notice is that the searching is so fast that clearly all the database

is not being searched

The first part of the process is to pre-compute pointers to only the sets of

compounds in the database we are interested in

This is called Hashing

Speeding up compound matching using hash codes

bull The simplest Hash code registers the presence or absence

of fragments in a molecule eg

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0

Contains PContains NH2

Doesnrsquot contain F

X

So a quick match of the bitstring of the fragment searched over the

(pre-computed) bitstrings in the database will eliminate most of the

structures

Next step ndash Substructure (pattern) matching

Exact match

Match the canonical SMILES InCHI or full structure

Beware salts etc may need to strip out counterions

Substructure Search

Supposing we wish to find part of a molecule in a database

To do this we have to do a substructure search

A substructure is a sub-graph of the molecular graph This is

an example of substructural fragments of a larger molecule

shown in different colours You could eg look for all the

molecules with the phenol fragment

There are two steps commonly used to do this firstly we

have to number all the molecules consistently (the Morgan

algorithm) for example to compare two structures to see if

they are identical then we have to match the query

substructure to each substructure in the molecule (the

Ullman algorithm)

Substructures in molecules

bull lsquoSubgraphsrsquo can be identified in a structure graph

corresponding to fragments of the whole structure

eg Rings substituents etc

ndash ndashOH

ndash ndashNH2

ndash ndashCOOH

ndash phenyl

bull this can be done by

tracing appropriate

paths in the graph

bull subgraphs may overlap

OH

CH2

CHNH2

OH

O

Matching the query structure to the database

Two algorithms are commonly used

The Morgan algorithm which numbers the molecule uniquely and

The Ullman algorithm which matches the fragments

Note ndash a number

of lsquosubstructuralrsquo fragments can

be matched here (6 in all)

Gund P Ann Rep in Med Chem 14299 1979

Sheridan RP et al J Chem Inf Comp Sci 29255 1989

Brint AT Willett P J Mol Graphics 549 1987

JR Ullman An Algorithm for Subgraph Isomorphism Journal of the Association of Computing

Machinery (JACM) Vol 23 pp 31-42 1976 httpportalacmorgcitationcfmid=321925

Morgan HL The generation of a unique machine description for chemical structures Journal of

Chemical Documentation 5 1965 107-112

Read these for more details-

The basic concept is that molecules are considered as

Graphs

Atoms are lsquonodesrsquo

Bonds are lsquoedgesrsquo

A molecule is an example of a labelled graph nodes can

have labels such as atomic number The nodes are

connected by the edges (and these can also be labelled

eg double bond)

The Morgan algorithm iteratively computes an integer label

for each node in a structure (atoms) For a computer the key

point is that the numbering is consistent (unlike humans) and

fast The Morgan algorithm works as follows

(i) Assign an integer label i to each node considering its

atomic number degree (number of substituents) and types of

bonds

(ii) Update for each label by adding the labels of the adjacent

nodes (connected atoms) and

(iii) Repeat (ii) until the number of different labels does not

increase Then order the nodes by the value of the labels

H L Morgan The generation of a unique machine description for chemical structures -

A technique developed at chemical abstracts service J Chemical Documentation

5107ndash113 1965

The first step - Numbering a molecule uniquely

A simple Morgan algorithm

bullThis is continued until the number of equivalent classes is unchanged

bullNumbering starts from highest then uses next highest connection as 2 etc Like an onion

concentric circles from the highest numbered

bullRules are used to break equalities such as C before N double bond before single etc

bullSome equivalent atoms are arbitrarily numbered

Urea

Adjacency matrix

O

C C

1

2 34

O

C2

C5

C6

C7

C3

1

C4

O1 C2 C3 C4

O1 1

C2 1

C3 1

C4 1 1 1

O1 C2 C3 C4 C5 C6 C7

O1 1

C2 1 1

C3 1 1

C4 1 1 1

C5 1 1

C6 1 1

C7 1 1

Cyclohexanone adjacency

matrix

Numbering is crucial Renumber

for different starting points Also

the lsquoedgesrsquo of the graph are annotated to speed up the matching This is the lsquoslowrsquo part

of lsquomatching subgraph isomorphismrsquo and is of the class of NP-complete problems

The Ullman Algorithm

is a way of detecting

detecting substructures


Next lecture

• How can we use this type of information to solve our chemistry problems?

– Patents/Markush structures

– Molecular similarity

– Molecular properties

– 3D searching

– Virtual screening

– Structure–property/activity relationships


So there are very many file formats – which could be a real pain, but they can be converted.

alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file

File formats - the bane of our lives

Interconversion program – Babel

There are recent IUPAC moves towards a ‘standard’ format. However, in the near future there are likely to be many competing requirements for file content.

Molecules on computers – things to look out for, since what is stored is actually quite crude

For example:

Stereochemistry may be relative and not absolute, or even incorrect.

In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.

Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).

Tautomers can be incorrect – check they look reasonable.

Mesomers can be incorrect (double bonds).

Polymers (which are mixtures of molecular weight and topology) are very difficult to store.

Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).

2 Finding molecules

A typical problem involves finding the ‘right’ molecules

All the synthesisable molecules: 10^60

Molecules with relevance to the problem (that we can generate): 10^10

Molecules we can search efficiently: 10^7

Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1 Finding molecules using substructure searching

One of the most useful ‘properties’ of a molecule is its molecular graph.

Many software systems allow searching for whole molecules, fragments of molecules, or even ‘similar’ molecules.

For a specific molecule, this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we’ve used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example – find a molecule containing our query as part of a larger molecule.

Results – it takes a few seconds to search all 92 million molecules.

How does this work?

The first thing to notice is that the searching is so fast that clearly the whole database is not being searched.

The first part of the process is to pre-compute pointers to only those sets of compounds in the database that we are interested in.

This is called hashing.

Speeding up compound matching using hash codes

• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.

Bit     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value   0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0

[Figure annotations on the bitstring: contains P; contains NH2; doesn’t contain F]

So a quick match of the bitstring of the query fragment against the (pre-computed) bitstrings in the database will eliminate most of the structures.
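A minimal sketch of this screening step in Python, assuming a small hypothetical fragment dictionary (`FRAGMENT_BITS`) and made-up database entries; real systems use far longer bitstrings. A molecule can only contain the query if every bit set in the query is also set in the molecule, so one bitwise AND per entry rules out most of the database before any graph matching is attempted.

```python
# Hypothetical fragment-to-bit assignment (illustration only).
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "phenyl": 4}

def fingerprint(fragments):
    """Pack the fragments present in a molecule into an integer bitstring."""
    bits = 0
    for name in fragments:
        bits |= 1 << FRAGMENT_BITS[name]
    return bits

def screen(query_bits, database):
    """Keep only entries whose bitstring contains every bit of the query."""
    return [name for name, bits in database.items()
            if (bits & query_bits) == query_bits]

# Made-up database: the fingerprints would be pre-computed once and stored.
database = {
    "mol_A": fingerprint(["NH2", "phenyl"]),
    "mol_B": fingerprint(["OH", "phenyl"]),
    "mol_C": fingerprint(["P", "OH"]),
}

survivors = screen(fingerprint(["phenyl"]), database)
print(survivors)
```

The survivors are only candidates: a set bit says the fragment is present somewhere, not how it is connected, so each one still has to go through the full substructure match.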

Next step – substructure (pattern) matching

Exact match

Match the canonical SMILES, InChI or full structure.

Beware of salts etc.: you may need to strip out counterions.
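One crude way to handle salts before an exact-match comparison is sketched below in Python. It assumes the structure is written as a SMILES string in which disconnected components are separated by '.', and it keeps the component with the most atom-symbol letters. The helper name `largest_fragment` is hypothetical, and a cheminformatics toolkit's own salt stripping and canonicalisation should be preferred in practice.

```python
def largest_fragment(smiles):
    """Return the '.'-separated SMILES component with the most letters."""
    parts = smiles.split(".")
    return max(parts, key=lambda part: sum(ch.isalpha() for ch in part))

# Sodium acetate written as two components: the acetate part is kept.
parent = largest_fragment("CC(=O)[O-].[Na+]")
print(parent)
```

The stripped string still has to be canonicalised before two records can be compared character by character, since the same molecule can be written as many different SMILES strings.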

Substructure search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a subgraph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.

There are two steps commonly used to do this: firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical; then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• ‘Subgraphs’ can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents, etc.

– –OH

– –NH2

– –COOH

– phenyl

• This can be done by tracing appropriate paths in the graph.

• Subgraphs may overlap.

[Figure: structure diagram with labels OH, CH2, CHNH2, OH, O]

Matching the query structure to the database

Two algorithms are commonly used: the Morgan algorithm, which numbers the molecule uniquely, and the Ullmann algorithm, which matches the fragments.

Note – a number of ‘substructural’ fragments can be matched here (6 in all).

Read these for more details:

Gund, P. Ann. Rep. Med. Chem. 14, 299, 1979.

Sheridan, R. P. et al. J. Chem. Inf. Comput. Sci. 29, 255, 1989.

Brint, A. T.; Willett, P. J. Mol. Graphics 5, 49, 1987.

Ullmann, J. R. An Algorithm for Subgraph Isomorphism. Journal of the Association for Computing Machinery (JACM) 23, 31-42, 1976. http://portal.acm.org/citation.cfm?id=321925

Morgan, H. L. The generation of a unique machine description for chemical structures. Journal of Chemical Documentation 5, 107-113, 1965.

The basic concept is that molecules are considered as graphs

Atoms are ‘nodes’

Bonds are ‘edges’

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
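A minimal Python sketch of a labelled molecular graph, using a representation chosen here purely for illustration (a dict of node labels plus a dict of edge labels keyed by atom pair). The example is acetic acid with hydrogens left out; the numbering is an arbitrary assumption.

```python
# Node labels: atom index -> atomic number (acetic acid, heavy atoms only).
atoms = {1: 6, 2: 6, 3: 8, 4: 8}

# Edge labels: (atom, atom) -> bond order; C2=O3 is the double bond.
bonds = {(1, 2): 1, (2, 3): 2, (2, 4): 1}

def neighbours(atom):
    """Return the atoms bonded to `atom`, in ascending order."""
    found = set()
    for a, b in bonds:
        if a == atom:
            found.add(b)
        elif b == atom:
            found.add(a)
    return sorted(found)

def degree(atom):
    """Number of edges (bonds to other heavy atoms) at this node."""
    return len(neighbours(atom))

print(neighbours(2), degree(2))
```

Keeping node and edge labels separate from the connectivity makes it easy to ignore them (pure topology) or to require them (element and bond-order matching) in later algorithms.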

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;

(ii) Update each node's label by summing the labels of its adjacent nodes (connected atoms); and

(iii) Repeat (ii) until the number of different labels no longer increases. Then order the nodes by the value of their labels.

H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107–113, 1965.
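The steps above can be sketched in Python as follows. This is a simplified illustration, not Morgan's full procedure: the initial label is just the node degree (atomic number and bond types are ignored), ties are not broken, and the function name `morgan_labels` and the adjacency-list input are assumptions made for the example.

```python
def morgan_labels(neighbours):
    """neighbours: {atom: list of bonded atoms}. Returns {atom: integer label}."""
    # (i) initial label: the degree of each node
    labels = {atom: len(nbrs) for atom, nbrs in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # (ii) new label = sum of the labels of the adjacent nodes
        updated = {atom: sum(labels[n] for n in nbrs)
                   for atom, nbrs in neighbours.items()}
        n_updated = len(set(updated.values()))
        # (iii) stop once the number of different labels no longer increases
        if n_updated <= n_classes:
            return labels
        labels, n_classes = updated, n_updated

# A five-atom chain 1-2-3-4-5 (e.g. the carbon skeleton of pentane).
chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = morgan_labels(chain)

# Order the nodes by label, highest first; equal labels keep their input order.
ranking = sorted(labels, key=labels.get, reverse=True)
print(labels, ranking)
```

For the chain, the central atom ends up with the highest label and the two ends with the lowest, while symmetric atoms (2 and 4, 1 and 5) share a label; those remaining ties are what the tie-breaking rules or an arbitrary choice must resolve.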




Page 41: Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH

2 Finding molecules

A typical problem involves finding the

lsquorightrsquo molecules

All the molecules

synthesisable

1060 Molecules with

relevance to the problem

1010 (that we can generate)

Molecules we can

search efficiently 107

Molecules we can

make and test 103

Serious difference in numbers we can consider a few hundred

molecules in our heads computers can evaluate millions

1 Finding molecules using Substructure searching

One of the most useful lsquopropertiesrsquo of a molecule is its molecular graph

Many software systems allow searching for whole molecules or

fragments of molecules or even lsquosimilarrsquo molecules

For a specific molecule this means specifying the required search pattern to get an

exact match

An example might be Search for an exact match for dimethylaniline Here wersquove

used the PubChem database which contains compounds and pharmacological

screening data on 92 million unique molecules httppubchemncbinlmnihgov

A more complex example ndash find a molecule

containing our query as part of a larger molecule

Results ndash takes a few seconds to search all

92 million molecules

How does this workThe first thing to notice is that the searching is so fast that clearly all the database

is not being searched

The first part of the process is to pre-compute pointers to only the sets of

compounds in the database we are interested in

This is called Hashing

Speeding up compound matching using hash codes

bull The simplest Hash code registers the presence or absence

of fragments in a molecule eg

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0

Contains PContains NH2

Doesnrsquot contain F

X

So a quick match of the bitstring of the fragment searched over the

(pre-computed) bitstrings in the database will eliminate most of the

structures

Next step ndash Substructure (pattern) matching

Exact match

Match the canonical SMILES InCHI or full structure

Beware salts etc may need to strip out counterions

Substructure Search

Supposing we wish to find part of a molecule in a database

To do this we have to do a substructure search

A substructure is a sub-graph of the molecular graph This is

an example of substructural fragments of a larger molecule

shown in different colours You could eg look for all the

molecules with the phenol fragment

There are two steps commonly used to do this firstly we

have to number all the molecules consistently (the Morgan

algorithm) for example to compare two structures to see if

they are identical then we have to match the query

substructure to each substructure in the molecule (the

Ullman algorithm)

Substructures in molecules

bull lsquoSubgraphsrsquo can be identified in a structure graph

corresponding to fragments of the whole structure

eg Rings substituents etc

ndash ndashOH

ndash ndashNH2

ndash ndashCOOH

ndash phenyl

bull this can be done by

tracing appropriate

paths in the graph

bull subgraphs may overlap

OH

CH2

CHNH2

OH

O

Matching the query structure to the database

Two algorithms are commonly used:

The Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.

Note: a number of 'substructural' fragments can be matched here (6 in all).

Read these for more details:

Gund, P. Ann. Rep. in Med. Chem. 14, 299, 1979.

Sheridan, R. P. et al. J. Chem. Inf. Comp. Sci. 29, 255, 1989.

Brint, A. T.; Willett, P. J. Mol. Graphics 5, 49, 1987.

Ullmann, J. R. An Algorithm for Subgraph Isomorphism. Journal of the Association for Computing Machinery (JACM) 23, 31-42, 1976. http://portal.acm.org/citation.cfm?id=321925

Morgan, H. L. The generation of a unique machine description for chemical structures. Journal of Chemical Documentation 5, 107-113, 1965.

The basic concept is that molecules are considered as graphs:

Atoms are 'nodes'

Bonds are 'edges'

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
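One possible way to hold such a labelled graph in Python is sketched below. The class name `MolGraph` and its methods are hypothetical and chosen only for illustration. Node labels are element symbols, edge labels are bond orders, and hydrogens are left implicit.

```python
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """A molecule as a labelled graph (hydrogens left implicit)."""
    atoms: list = field(default_factory=list)   # node labels, e.g. "C", "O"
    bonds: dict = field(default_factory=dict)   # {(i, j): bond order}, i < j

    def add_atom(self, element):
        self.atoms.append(element)
        return len(self.atoms) - 1              # index of the new node

    def add_bond(self, i, j, order=1):
        self.bonds[(min(i, j), max(i, j))] = order

    def neighbours(self, i):
        """Indices of the atoms bonded to atom i, in ascending order."""
        return sorted(b if a == i else a
                      for (a, b) in self.bonds if i in (a, b))


# Urea, O=C(N)N, heavy atoms only.
urea = MolGraph()
o, c, n1, n2 = (urea.add_atom(e) for e in ["O", "C", "N", "N"])
urea.add_bond(c, o, order=2)   # edge label: double bond
urea.add_bond(c, n1)
urea.add_bond(c, n2)

print(urea.atoms, urea.neighbours(c))
```

Storing each bond once under a sorted index pair keeps the graph undirected without duplicating edge labels.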

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike numbering by humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
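Steps (i) to (iii) can be sketched as follows. This is a simplified version under stated assumptions, not Morgan's full procedure: the initial label is the degree alone (atomic number and bond types are ignored), each update replaces a label by the sum of its neighbours' labels, and ties are not broken. The function name and the ring connectivity chosen for cyclohexanone are assumptions for the example.

```python
def morgan_labels(neighbours):
    """Simplified Morgan extended-connectivity labels.

    neighbours maps each atom index to a list of adjacent atom indices.
    """
    labels = {a: len(nbrs) for a, nbrs in neighbours.items()}      # step (i)
    n_classes = len(set(labels.values()))
    while True:
        updated = {a: sum(labels[b] for b in nbrs)                 # step (ii)
                   for a, nbrs in neighbours.items()}
        n_updated = len(set(updated.values()))
        if n_updated <= n_classes:                                 # step (iii)
            return labels      # keep the last labelling that added classes
        labels, n_classes = updated, n_updated


# Cyclohexanone heavy atoms, 0-based: atom 0 is O, atom 3 the carbonyl
# carbon. The assignment of ring neighbours is an assumption.
cyclohexanone = {
    0: [3],
    1: [3, 4],
    2: [3, 5],
    3: [0, 1, 2],
    4: [1, 6],
    5: [2, 6],
    6: [4, 5],
}

# A five-atom chain, added as a second (hypothetical) example.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

ring_labels = morgan_labels(cyclohexanone)
chain_labels = morgan_labels(chain)

# Order atoms by decreasing label; ties stay in index order here.
ranking = sorted(ring_labels, key=lambda a: -ring_labels[a])
print(ring_labels, ranking, chain_labels)
```

For cyclohexanone the first update does not add any new classes, so the degrees themselves are returned. For the chain, one update separates the middle atom from its neighbours before the count stops rising.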

H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service. J. Chemical Documentation 5, 107-113, 1965.

The first step - Numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest-valued atom, then takes its highest-valued connection as 2, etc. Like an onion: concentric circles out from the highest-numbered atom
• Rules are used to break ties, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered

Urea adjacency matrix:

     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0

Cyclohexanone adjacency matrix (atoms O1, C2 to C7): the O1 row has a single 1 (its bond to C4), the C4 row has three 1s, and each of the rows C2, C3, C5, C6 and C7 has two.

Numbering is crucial: renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching: subgraph isomorphism is in the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures
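The matching step can be illustrated with an Ullmann-style backtracking search over adjacency matrices. This is a simplified sketch and not the full 1976 algorithm: it builds the initial candidate lists from labels and degrees but leaves out Ullmann's refinement procedure. The function names, the atom labels (taken from the matrices above) and the ring connectivity assumed for cyclohexanone are illustrative.

```python
def adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix from a list of bonds."""
    m = [[0] * n for _ in range(n)]
    for a, b in edges:
        m[a][b] = m[b][a] = 1
    return m


def find_subgraph(query_adj, query_labels, target_adj, target_labels):
    """Return one mapping {query atom: target atom}, or None if none exists."""
    nq, nt = len(query_adj), len(target_adj)
    # A query atom may map to a target atom only if the labels agree and
    # the target atom has at least as many neighbours.
    candidates = [
        [j for j in range(nt)
         if query_labels[i] == target_labels[j]
         and sum(query_adj[i]) <= sum(target_adj[j])]
        for i in range(nq)
    ]
    mapping = []  # mapping[i] = target atom chosen for query atom i

    def extend(i):
        if i == nq:
            return True
        for j in candidates[i]:
            if j in mapping:
                continue  # each target atom is used at most once
            # Every query bond to an already-mapped atom must exist in target.
            if all(target_adj[j][mapping[k]]
                   for k in range(i) if query_adj[i][k]):
                mapping.append(j)
                if extend(i + 1):
                    return True
                mapping.pop()
        return False

    return dict(enumerate(mapping)) if extend(0) else None


# Four-atom query: atoms O1, C2, C3 each bonded to C4 (0-based indices).
query_labels = ["O", "C", "C", "C"]
query_adj = adjacency(4, [(0, 3), (1, 3), (2, 3)])

# Cyclohexanone target; which ring atoms are bonded is an assumption.
target_labels = ["O", "C", "C", "C", "C", "C", "C"]
target_adj = adjacency(7, [(0, 3), (1, 3), (2, 3), (1, 4), (2, 5),
                           (4, 6), (5, 6)])

match = find_subgraph(query_adj, query_labels, target_adj, target_labels)
print(match)
```

The search only requires that every query bond is present in the target, so extra bonds in the target are allowed, which is what a substructure search needs. The worst case is still exponential, consistent with the problem being NP-complete.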


Next lecture

• How can we use this type of information to solve our chemistry problems?
- Patents/Markush structures
- Molecular similarity
- Molecular properties
- 3D searching
- Virtual screening
- Structure-property/activity relationships


A typical problem involves finding the 'right' molecules

All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3

Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.

1. Finding molecules using substructure searching

One of the most useful 'properties' of a molecule is its molecular graph.

Many software systems allow searching for whole molecules, fragments of molecules or even 'similar' molecules.

For a specific molecule, this means specifying the required search pattern to get an exact match.

An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov

A more complex example: find a molecule containing our query as part of a larger molecule.

Results: it takes a few seconds to search all 92 million molecules.

How does this work? The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.





Page 46: Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003. Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH

Speeding up compound matching using hash codes

bull The simplest Hash code registers the presence or absence

of fragments in a molecule eg

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0

Contains PContains NH2

Doesnrsquot contain F

X

So a quick match of the bitstring of the fragment searched over the

(pre-computed) bitstrings in the database will eliminate most of the

structures

Next step ndash Substructure (pattern) matching

Exact match

Match the canonical SMILES InCHI or full structure

Beware salts etc may need to strip out counterions

Substructure Search

Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.

A substructure is a subgraph of the molecular graph. The figure shows an example of substructural fragments of a larger molecule in different colours. You could, e.g., look for all the molecules with the phenol fragment.

Two steps are commonly used to do this. First, we have to number all the molecules consistently (the Morgan algorithm), for example so that two structures can be compared to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).

Substructures in molecules

• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.

  - -OH

  - -NH2

  - -COOH

  - phenyl

• This can be done by tracing appropriate paths in the graph

• Subgraphs may overlap

[Figure: example structure with atom labels OH, CH2, CHNH2, OH, O]

Matching the query structure to the database

Two algorithms are commonly used:

The Morgan algorithm, which numbers the molecule uniquely, and

The Ullmann algorithm, which matches the fragments.

Note: a number of 'substructural' fragments can be matched here (6 in all).

Read these for more details:

Gund, P. Ann. Rep. Med. Chem. 14, 299 (1979).

Sheridan, R. P. et al. J. Chem. Inf. Comp. Sci. 29, 255 (1989).

Brint, A. T.; Willett, P. J. Mol. Graphics 5, 49 (1987).

Ullmann, J. R. An Algorithm for Subgraph Isomorphism. Journal of the Association for Computing Machinery (JACM) 23, 31-42 (1976). http://portal.acm.org/citation.cfm?id=321925

Morgan, H. L. The generation of a unique machine description for chemical structures. Journal of Chemical Documentation 5, 107-113 (1965).

The basic concept is that molecules are considered as graphs:

Atoms are 'nodes'

Bonds are 'edges'

A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
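One possible way to hold such a labelled graph in Python is shown below. The molecule (the heavy atoms of acetaldehyde, CH3-CH=O) and the helper names `neighbours` and `bond_order` are chosen purely for illustration; hydrogens are left implicit.

```python
# Node labels: atom number -> element symbol.
atoms = {1: "C", 2: "C", 3: "O"}

# Edge labels: unordered atom pair -> bond order (1 = single, 2 = double).
bonds = {frozenset((1, 2)): 1, frozenset((2, 3)): 2}


def neighbours(atom):
    """Atoms connected to `atom` by an edge, in ascending order."""
    return sorted(
        other for bond in bonds if atom in bond for other in bond if other != atom
    )


def bond_order(a, b):
    """Label of the edge between a and b, or 0 if they are not bonded."""
    return bonds.get(frozenset((a, b)), 0)
```

Using `frozenset` keys makes the edges undirected, so `bond_order(2, 3)` and `bond_order(3, 2)` give the same answer.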

The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:

(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.

(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).

(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.

H. L. Morgan. The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service. J. Chemical Documentation 5, 107-113 (1965).
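Steps (i) to (iii) can be sketched as follows. This is a simplified version under stated assumptions: the initial label is the degree only (atomic number and bond types are ignored), each new label is the sum of the neighbours' labels, and the labels from the last iteration that increased the number of distinct values are returned. The function name `morgan_labels` and the n-pentane example are illustrative, not taken from the lecture.

```python
def morgan_labels(adjacency):
    """Extended-connectivity labels for a graph given as atom -> neighbours."""
    # (i) initial label: number of attached heavy atoms (degree)
    labels = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # (ii) new label = sum of the current labels of the adjacent atoms
        new_labels = {
            atom: sum(labels[n] for n in nbrs) for atom, nbrs in adjacency.items()
        }
        new_n_classes = len(set(new_labels.values()))
        # (iii) stop as soon as the number of different labels stops increasing
        if new_n_classes <= n_classes:
            return labels
        labels, n_classes = new_labels, new_n_classes


# n-Pentane carbon chain C1-C2-C3-C4-C5 (example chosen for illustration).
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
pentane_labels = morgan_labels(pentane)

# Order the atoms by label, highest first.
ranking = sorted(pentane_labels, key=lambda atom: -pentane_labels[atom])
```

For the chain the labels separate the central carbon, the two next to it, and the two terminal carbons; atoms that end up with equal labels still need tie-breaking rules before a unique numbering is obtained.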

The first step - Numbering a molecule uniquely

A simple Morgan algorithm

• This is continued until the number of equivalence classes is unchanged

• Numbering starts from the highest, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest numbered

• Rules are used to break equalities, such as C before N, double bond before single, etc.

• Some equivalent atoms are arbitrarily numbered

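Urea

Adjacency matrix

[Figure: urea drawn with atoms numbered 1 to 4, and cyclohexanone drawn with atoms numbered O1 and C2 to C7]

Urea adjacency matrix (atom labels as printed on the slide):

      O1  C2  C3  C4
  O1   0   0   0   1
  C2   0   0   0   1
  C3   0   0   0   1
  C4   1   1   1   0

Cyclohexanone adjacency matrix (atoms O1 and C2 to C7): the row for O1 contains one entry, the row for C4 contains three, and the rows for C2, C3, C5, C6 and C7 contain two each.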
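An adjacency matrix of this kind can be built from a bond list, as in the sketch below. The urea bonds follow directly from the slide (atom 4 is bonded to atoms 1, 2 and 3). For cyclohexanone the exact bond list is an assumption: it takes O1 bonded to C4 and the ring numbered sequentially C2-C3-C4-C5-C6-C7, which reproduces the number of entries per row given on the slide. The function name `adjacency_matrix` is illustrative.

```python
def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 matrix; bonds are pairs of 1-based atom numbers."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i - 1][j - 1] = 1
        matrix[j - 1][i - 1] = 1
    return matrix


# Urea as numbered on the slide: atom 4 is the central atom.
urea = adjacency_matrix(4, [(1, 4), (2, 4), (3, 4)])

# Cyclohexanone under the assumed numbering described above.
cyclohexanone = adjacency_matrix(
    7, [(1, 4), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 2)]
)

# Number of neighbours of each atom = number of 1 entries in its row.
row_sums = [sum(row) for row in cyclohexanone]
```

Renumbering the atoms permutes the rows and columns of the matrix, which is why a consistent numbering scheme is needed before two matrices can be compared directly.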

Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. Matching is the 'slow' part: it is a subgraph isomorphism problem, which belongs to the class of NP-complete problems.

The Ullmann algorithm is a way of detecting substructures.
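To give a feel for what substructure detection involves, here is a plain backtracking search in Python. It is not Ullmann's algorithm itself (it omits his matrix refinement step); it only checks element labels, degrees and connectivity, and ignores bond orders. The function name, the query fragment and the cyclohexanone numbering (O1 bonded to C4, ring C2 to C7 numbered sequentially) are assumptions made for the example.

```python
def find_substructure(query_atoms, query_adj, target_atoms, target_adj):
    """Return one mapping query atom -> target atom, or None if there is none."""
    order = list(query_atoms)

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        q = order[len(mapping)]
        for t in target_atoms:
            if t in mapping.values():
                continue  # each target atom may be used only once
            if query_atoms[q] != target_atoms[t]:
                continue  # element labels must agree
            if len(query_adj[q]) > len(target_adj[t]):
                continue  # target atom has too few neighbours
            # every already-mapped query neighbour must be a target neighbour
            if all(mapping[n] in target_adj[t] for n in query_adj[q] if n in mapping):
                mapping[q] = t
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[q]  # backtrack
        return None

    return extend({})


# Target: cyclohexanone under the assumed numbering.
target_atoms = {1: "O", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C", 7: "C"}
target_adj = {1: [4], 2: [3, 7], 3: [2, 4], 4: [1, 3, 5], 5: [4, 6], 6: [5, 7], 7: [2, 6]}

# Query: a carbon bonded to an oxygen and to two other carbons (C-C(O)-C).
query_atoms = {"a": "C", "b": "C", "c": "O", "d": "C"}
query_adj = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"], "d": ["b"]}

match = find_substructure(query_atoms, query_adj, target_atoms, target_adj)
```

In the worst case the search tries many partial assignments before succeeding or giving up, which is the behaviour the NP-completeness remark refers to; pre-screening with bitstrings keeps the number of molecules that reach this step small.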


Next lecture

• How can we use this type of information to solve our chemistry problems?

  - Patents/Markush structures

  - Molecular similarity

  - Molecular properties

  - 3D searching

  - Virtual screening

  - Structure-property/activity relationships
