Useful Information - University of Cambridge · Chemoinformatics - A Textbook, Johann Gasteiger and...
Useful Information
• The web address for these lectures is
http://www-jmg.ch.cam.ac.uk/cil/partii (on
front of handout)
• Assessment is by two online exercises
(Glen and Goodman) at this address. Each
will be marked out of ten. Your (paper)
answers should be submitted to Mykola
• Glen exercises due Feb 10th 2018
• Lectures and handout available on Moodle
Molecular Informatics
1. Molecules and computers
Sources - textbooks/online you may wish to consider if you want
to take the subject further:
An Introduction to Chemoinformatics, Andrew R. Leach, Valerie J. Gillet, Springer 2007
Chemoinformatics - A Textbook, Johann Gasteiger and Thomas Engel, Wiley-VCH 2003
Handbook of Chemoinformatics, Johann Gasteiger, Wiley-VCH 2003 (new edition out 2018)
Chemoinformatics: An Approach to Virtual Screening, Alexandre Varnek, Alex Tropsha, RSC Publishing
Bunin, Barry A., Chemoinformatics: Theory, Practice and Products, Dordrecht: Springer 2007
Drug Metabolism Prediction, Eds. R. Mannhold, H. Kubinyi, G. Folkers; Ed. Johannes Kirchmair. Methods and Principles in Medicinal Chemistry, Vol. 63, Wiley-VCH
Journals of Molecular/Cheminformatics you may wish
to follow up on:
Journal of Chemical Information and Modeling
Journal of Chemical Theory and Computation
Journal of Cheminformatics
Journal of Computer-Aided Molecular Design
Journal of Molecular Graphics & Modelling
Journal of Computational Chemistry
Journal of Medicinal Chemistry
Reviews in Computational Chemistry
Drug Discovery Today
BMC Bioinformatics
Nature Reviews Drug Discovery
Expert Opinion on Drug Discovery
WIREs Computational Molecular Science
Molecular Informatics
Includes all aspects of the study of molecules on computers.
Also includes Chem(o)informatics.
This includes the representation of molecules, databases, display,
simulation, prediction of their properties, and the discovery and
design of new molecules and materials.
Molecular informatics is closely related to bioinformatics,
computational chemistry, molecular modelling, simulation, machine
learning and statistics, as well as online publications - but the area
has principally been driven by investment in new methods for drug
discovery, hence the concentration on small organic molecules.
Cambridge HPC
Places to find Molecular
Informatics apps
• http://www.macinchem.org/mobilescience
• Molecules (e.g. RSC ChemSpider)
• Publishers (e.g. ACS, RSC mobile)
• Calculations (e.g. Yield101 for reactions)
• Visualisation (e.g. PyMOL for proteins)
J. Chem. Educ. 2013, 90 (3), pp 320-325
DOI: 10.1021/ed300329e
Cheminformatics 101
How do we store molecules on the computer?
There are estimated to be 10^60 possible small molecules that
could be made. How do we find the best molecule for the
problem we are addressing? Let’s take a look “under the
bonnet” at the way molecules are actually manipulated on the
computer. You will be familiar with:
1. Trivial name, e.g. morphine
2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol
However, these names do not convey the structure of molecules
in a way the computer can readily understand. We need to
convert these into “machine readable formats”, which allow
ease of searching based on the complexities of molecular
structure. But what is a molecule?
Bear this in mind: molecules are complicated. When we look at this scene we add a
huge amount of information from our senses and knowledge - but it nearly all gets
lost in computational representation.
Representing chemistry needs to be engineered to represent materials and processes.
As you will see, we are moving in that direction, with more complete representations
of molecules and materials.
Not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol:
a real-life mixture
What is a molecule?
Is it a series of connected points?
A wave function?
The sum of its properties?
In the computer, molecules are therefore abstractions and interpretations of data.
So more experimental data and an appropriate description of a molecule may
translate to a wider reality.
Storing molecules: different methods for different purposes
Methods for storing molecules can conveniently be broken down
into:
1-Dimensional (simple, very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D: Line notations (a string of characters from a keyboard)
2D: Molecular diagrams (graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the
notes)
1-Dimensional - line notations: a string representation of
molecules. Here are three examples of different line notations;
SMILES is the most useful and widely used. These are
‘strings’, all of the same molecule:
• Line notations
- WLN
  L66J BMR& DSWQ IN1&1
- ROSDAL
  1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
- SMILES
  c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
- IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid
Notice - all these notations use just the characters on a standard
typewriter keyboard
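Because a line notation is only a string of keyboard characters, even a few lines of code can pull some chemistry out of it. The sketch below counts the heavy atoms in the SMILES string from this slide. It is a minimal illustration, not a SMILES parser: it assumes the string uses only the "organic subset" of atom symbols (no bracket atoms such as [Na+]), and the function name `heavy_atom_counts` is made up for this example.

```python
import re
from collections import Counter

# Organic-subset atom symbols. Two-letter symbols come first so that
# "Cl" is not read as "C" followed by a stray "l".
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def heavy_atom_counts(smiles: str) -> Counter:
    """Count heavy atoms in a SMILES restricted to the organic subset."""
    if "[" in smiles:
        raise ValueError("bracket atoms are not handled in this sketch")
    # Lower-case (aromatic) symbols are folded into their element symbol.
    return Counter(sym.capitalize() for sym in ATOM_PATTERN.findall(smiles))

smiles = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
counts = heavy_atom_counts(smiles)
print(dict(counts))
```

Ring-closure digits, brackets and bond symbols are simply skipped by the pattern, which is why the count works without understanding the connectivity. Hydrogens are implicit in SMILES and are not counted here.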
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C
This is an example of a ‘molecule’ and is what is actually stored in the
computer - does it make sense to you?
It’s artemisinin, which has anti-malarial properties (Nobel Prize).
This language is called SMILES.
A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
You can practise by drawing a structure; a SMILES will be available
by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
InChI
• A more recent line notation is called
InChI
• This will address some of the problems
of SMILES, e.g. polymers and materials
not covered by SMILES
• InChI is generated using computer
algorithms and is virtually
uninterpretable by humans
• Importantly, it is commonly used as a
unique chemical identifier - each
molecule should theoretically have a
unique InChI. One molecule, one InChI
• Websites are available that can generate
InChI and convert from InChI to structure,
from different names and formats
Again, a string like this is easily
matched on a computer
RSC ChemSpider
Storing chemical diagrams on computers
• The valence model of a molecule can be represented by a
chemical graph. A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes
• The spatial position of the nodes, length of the edges and
crossings are irrelevant. Generally we ignore hydrogens unless
tautomerism or pKa is an issue. Computers handle graphs very
well, and molecules represented like this are examples of labelled
graphs (the atoms have names, e.g. oxygen)
• Chemical structures are of course more complex than this, and
aromaticity, stereochemistry, tautomerism, non-stoichiometric
compounds etc. are often problematic. The computer would
(using a simple graph) deduce that these two canonical structures are
different molecules
• To solve this we could, for example, introduce the concept of an aromatic
bond
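The point about the two canonical structures can be made concrete with a toy labelled graph. The sketch below stores benzene's two Kekulé drawings as sets of (atom pair, bond type) and compares them. The atom numbering, the "ar" label and the helper names are assumptions for illustration; real toolkits first have to perceive which rings are aromatic, which this sketch takes for granted.

```python
def bond_set(bonds):
    """Turn (atom_i, atom_j, bond_type) triples into an order-independent set."""
    return {(frozenset((i, j)), bond_type) for i, j, bond_type in bonds}

ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]

# Two Kekule drawings of benzene: double bonds on alternate ring edges.
kekule_a = bond_set((i, j, 2 if n % 2 == 0 else 1) for n, (i, j) in enumerate(ring))
kekule_b = bond_set((i, j, 1 if n % 2 == 0 else 2) for n, (i, j) in enumerate(ring))

def aromatise(bonds):
    """Relabel every ring bond with one hypothetical 'aromatic' bond type."""
    return {(edge, "ar") for edge, _ in bonds}

print(kekule_a == kekule_b)                        # simple graphs: different
print(aromatise(kekule_a) == aromatise(kekule_b))  # aromatic bonds: identical
```

With single and double bond labels the two graphs differ on every edge, so a naive comparison reports two molecules. Once all six edges carry the same aromatic label the two drawings collapse to one graph.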
An example of storing the molecular
graph: SD file format
MDL SD (structure data) files store the information about a molecule
in the following format:
Header block
describes the molecule, e.g. its name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
properties, e.g. volume
Terminator line
a line containing four dollar signs ($$$$)
indicating the end of information on this molecule
This is probably the most common format for storing small molecules - SD files are
widely used in software to store many molecular structures in databases
A “format” in computer science is a precisely described order of data. The program
reading it expects the information in exactly the right place (or it screws up)
The Connection Table - describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule
as a labelled graph
e.g. PDB protein file format:
Connect 3 2 5 4
e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
Why store molecules in 2D?
• Quite often we only need the
chemical diagram, e.g. to find a
molecule that matches a
chemical structure search
• It is often the case that we don’t
know the conformation (shape)
of a molecule - so storing it in
3D would be pointless. Look at
the changes in conformation in
this molecule at room
temperature
Example SD file - benzene
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Annotations:
First lines: molecule name; information on this molecule; comment (e.g. the date is used here)
“Counts line” has the number of atoms and bonds as a minimum
“Atom block” has xyz coordinates of the atom and element as a minimum
“Bond block” has one line for each bond: from atom, to atom, bond type
“M  END” and “$$$$” identify the end of this molecule
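The layout described above (header, counts line, atom block, bond block) can be read back with very little code. The following is a minimal sketch, not a full reader: it assumes a V2000 molfile with a three-line header, and it splits on whitespace for brevity, whereas the real format is fixed-width and a robust reader would slice columns instead. The function name `parse_molfile` is made up for this example.

```python
MOLFILE = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
"""

def parse_molfile(text):
    """Return (name, atoms, bonds) from a V2000 molfile string."""
    lines = text.splitlines()
    name = lines[0].strip()
    counts = lines[3].split()                 # counts line: atoms, bonds, ...
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:         # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        a, b, order = (int(v) for v in line.split()[:3])  # from, to, type
        bonds.append((a, b, order))
    return name, atoms, bonds

name, atoms, bonds = parse_molfile(MOLFILE)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

The counts line tells the reader how many atom and bond lines to expect, which is why the program "expects the information in exactly the right place".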
Alanine SD file
Bond length is 1.53 Å
Atom line fields: x y z, symbol, mass diff, charge, stereo, h-count…
Charge:
0 = uncharged or value other than these; 1 = +3, 2 = +2, 3 = +1,
4 = doublet radical, 5 = -1, 6 = -2, 7 = -3
Molecule is chiral
What if we only know the atom positions and not the bonds?
A key example would be x-ray crystallography. Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an x-ray experiment. We see the
electrons but not the nuclei. In a small molecule this is very accurate and we
can almost see the bonds. However, in e.g. a protein structure, which is very
large, we can’t always determine the atom positions exactly, so these are
stored and the bonds imputed. The format for storing these is a bit different.
Which way round should this ring go?
Example protein databank file (pdb)
(stores connectivity as adjacency lists in CONECT records)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Annotations:
CRYST1: space group and unit cell dimensions
ATOM: protein atoms - including xyz, occupancy and temperature factor
HETATM: non-protein atoms - including xyz, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
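PDB coordinate records are fixed-width: each field lives in a set range of columns rather than being separated by a delimiter. The sketch below writes one HETATM line with the standard column layout (columns 1 to 66 only) and reads it back by slicing. The helper names `format_atom` and `parse_atom` are invented for this example, and fields beyond the temperature factor (element, charge) are left out.

```python
def format_atom(record, serial, name, res_name, chain, res_seq, x, y, z, occ, b):
    """Write one fixed-width ATOM/HETATM line (columns 1-66 only)."""
    return (f"{record:<6s}{serial:>5d} {name:<4s} {res_name:>3s} {chain:1s}"
            f"{res_seq:>4d}    {x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}")

def parse_atom(line):
    """Slice the fields back out by column position (0-based slices)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }

# Values taken from the sulphate sulphur atom in the example file.
line = format_atom("HETATM", 1510, "S", "SO4", "A", 187,
                   25.993, -1.362, 5.893, 1.00, 22.34)
atom = parse_atom(line)
print(line)
print(atom["name"], atom["xyz"], atom["b_factor"])
```

Slicing by column keeps working even when two numbers run together without a space (for example a large negative coordinate), which is where whitespace splitting fails.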
What if we want to vary the atom
positions, e.g. in driving a
reaction coordinate?
Using Cartesian (xyz)
coordinates is very
cumbersome, so instead
we use the natural angles
and distances
This uses internal coordinates
• Also called a Z-matrix
- Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)
- Early form of specification of a starting geometry for molecules - people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
- A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending, e.g. a carbonyl
Non-bonded distance
[Figure: internal coordinates of example molecules, with dummy atom (du) positions marked]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol
only
2. For the second atom, give the atomic symbol, the number
1, and the name of a variable to describe the distance
between atoms 1 and 2
3. For the third atom, give the atomic symbol, the atom
number NA, the name of a variable to describe the distance
between the current atom and NA, the atom number NB, and
the name of a variable to describe the angle between the
current atom, NA and NB
4. For all later atoms, give the atomic symbol, the atom
number NA, the name of a variable to describe the distance
between the current atom and NA, the atom number NB, the
name of a variable to describe the angle between the current
atom, NA and NB, the atom number of another previously
defined atom NC, and finally the name of a variable to
describe the dihedral angle between the current atom, NA,
NB and NC
5. After all the atoms have been listed, enter a blank line
6. Next, list each variable with its corresponding value. Use a
separate line for each variable
7. In some cases, where some of the variables are to be fixed
as constants in a geometry optimisation, they are listed here
after a blank line, rather than above
with the real variables
8. End the Z-matrix with a blank line
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
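To see how internal coordinates map back onto Cartesian ones, the three-atom water case can be worked by hand. The sketch below follows a common placement convention (atom 1 at the origin, atom 2 along z, atom 3 in the xz plane); that convention and the function name `water_from_zmatrix` are assumptions for this example. Programs differ in the orientation they choose, and a fourth atom would additionally need the dihedral angle.

```python
import math

def water_from_zmatrix(l1, a1_degrees):
    """Place O, H, H from the Z-matrix:  O / H 1 l1 / H 1 l1 2 a1."""
    a1 = math.radians(a1_degrees)
    oxygen = (0.0, 0.0, 0.0)          # atom 1 at the origin
    h_first = (0.0, 0.0, l1)          # atom 2 along the z axis, distance l1
    # atom 3: distance l1 from atom 1, angle a1 to the 1-2 bond, in the xz plane
    h_second = (l1 * math.sin(a1), 0.0, l1 * math.cos(a1))
    return oxygen, h_first, h_second

o, h1, h2 = water_from_zmatrix(l1=0.96, a1_degrees=104.0)
for symbol, xyz in zip("OHH", (o, h1, h2)):
    print(symbol, " ".join(f"{c:8.4f}" for c in xyz))
print("H...H distance:", round(math.dist(h1, h2), 3))
```

Changing a single number (l1 or a1) moves the atoms consistently, which is the attraction of internal coordinates when driving a reaction coordinate; in Cartesian form the same change would touch several x, y, z values at once.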
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
‘Improper’ torsion angles
What about comprehensive
properties of molecules? They
are more than xyz
XML and molecules
• XML is a markup language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file
• Chemical Markup Language (CML) is being developed specifically for chemistry
• In the future much more information will be stored with molecules, allowing greater re-use of data
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928
Ethanol
<CML>
- Can be parsed
- Can contain reactions,
properties etc.
- Can contain
relationships to other
molecules and also
concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
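Although an InChI is hard for a human to read, it has a regular layered structure: a version, a formula, then layers separated by "/" that each start with a one-letter prefix. The sketch below splits the ethanol InChI from this slide (written here with its "/" separators) into those layers. It is a simplification: the function name `split_inchi` is made up, and keying layers by their first letter assumes no prefix occurs twice, which does not hold for every InChI.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len(prefix):].split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        layers[layer[0]] = layer[1:]   # e.g. 'c' = connections, 'h' = hydrogens
    return layers

ethanol = split_inchi("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3")
for key, value in ethanol.items():
    print(f"{key:8s} {value}")
```

The connection layer "1-2-3" says atom 1 is bonded to atom 2, which is bonded to atom 3; the hydrogen layer then places one H on atom 3, two on atom 2 and three on atom 1.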
Molecules and 3-Dimensions
• Molecules are of course not flat. Even very flat molecules are not
really flat, because of thermal fluctuations. So we represent 3D
molecules by including their coordinates or their internal coordinates
• Obtaining the 3-dimensional coordinates can involve experiment (x-ray,
electron or neutron diffraction, e.g. the Cambridge
Structural Database or the Protein Data Bank - PDB)
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as
Corina or Concord (put in a SMILES and get a 3-D molecule), which
use rules derived from experiment
• Molecules can also be constructed in 2D and subjected to molecular
mechanics or quantum mechanics calculations to obtain 3D structures
• Conformation still remains to be deduced. There are many methods
that deduce conformations, usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot - but often
in many more torsional angle dimensions)
Molecules in 3D - uses
• More accurate calculation of molecular
properties
• Comparison of the shapes (conformations)
of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon).
Molecular dynamics is the most popular method for larger systems. Here is
an example of a 3D simulation - a nanopore for sequencing DNA. Imagine a
series of snapshots of SD files concatenated together to make a movie, just
like a film strip
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
We can simulate an entire biological receptor
and test new molecules to see how they might
work
• Here is a simulation of an important new cancer target
called ‘Apelin’, with water and the membrane. We used
these to design new molecules that block this receptor
Apelin’s role in cancer
‘Apelin’ (a peptide receptor) is involved in cancer - therefore we designed an
antagonist (a molecule that blocks the receptor) to stop tumour growth
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma
growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Rayalam et
al., Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats
- which could be a real pain - but
they can be converted
alc -- Alchemy file prep -- Amber PREP file
bs -- Ball & Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical Resource
Kit 2D file
crk3d -- CRK3D Chemical Resource
Kit 3D file
box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI Biosym/Insight II CAR
file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball & Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table
file
cht -- Chemtool file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical
Resource Kit 2D file
crk3d -- CRK3D Chemical
Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats
- the bane of our lives
Interconversion program - Babel
Recent IUPAC moves are towards a ‘standard’ format. However,
in the near future there are likely to be many competing
requirements for file content
Molecules on computers - things to look out for,
since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution),
so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen
and oxygen get confused
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved
correctly (problems with storing bonding)
Tautomers can be incorrect - check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of MWt and topology) are very difficult to store
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics,
halogen-aromatic bonds etc. may be inferred but not observed in the file)
2. Finding molecules
A typical problem involves finding the
‘right’ molecules
All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred
molecules in our heads; computers can evaluate millions
1. Finding molecules using substructure searching
One of the most useful ‘properties’ of a molecule is its molecular graph.
Many software systems allow searching for whole molecules or
fragments of molecules, or even ‘similar’ molecules.
For a specific molecule this means specifying the required search pattern to get an
exact match.
An example might be: search for an exact match for dimethylaniline. Here we’ve
used the PubChem database, which contains compounds and pharmacological
screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example - find a molecule
containing our query as part of a larger molecule
Results - takes a few seconds to search all
92 million molecules
How does this work?
The first thing to notice is that the searching is so fast that clearly the whole database
is not being searched.
The first part of the process is to pre-compute pointers to only the sets of
compounds in the database we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence
of fragments in a molecule, e.g.
Bit:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Value: 0 0 0 0 0 1 0 0 0  0  0  1  0  0  0  0
A set bit marks a fragment that is present (contains P; contains NH2);
an unset bit marks one that is absent (doesn’t contain F)
So a quick match of the bitstring of the fragment searched over the
(pre-computed) bitstrings in the database will eliminate most of the
structures
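The screening step can be sketched with ordinary integers used as bitstrings. In the example below the fragment-to-bit assignments, the molecule names and the function names are all hypothetical; real systems use much longer keys and pre-compute them when a compound is registered. The idea is that a molecule can only contain the query if every bit set in the query is also set in the molecule.

```python
# Hypothetical fragment key: which bit position each fragment switches on.
FRAGMENT_BITS = {"OH": 0, "F": 2, "P": 5, "phenyl": 7, "NH2": 11}

def fingerprint(fragments):
    """Build a bitstring (as an int) with one bit per fragment present."""
    bits = 0
    for frag in fragments:
        bits |= 1 << FRAGMENT_BITS[frag]
    return bits

def may_contain(query_bits, molecule_bits):
    """A molecule can only match if it has every bit the query has."""
    return (query_bits & molecule_bits) == query_bits

# Pre-computed once for the whole (toy) database.
database = {
    "mol_A": fingerprint(["P", "NH2"]),
    "mol_B": fingerprint(["OH", "phenyl"]),
    "mol_C": fingerprint(["NH2", "phenyl", "OH"]),
}

query = fingerprint(["NH2", "phenyl"])
candidates = [name for name, bits in database.items() if may_contain(query, bits)]
print(candidates)
```

The screen only rules molecules out. Anything that survives still has to go through the slower atom-by-atom substructure match, because having the right fragments does not guarantee they are connected in the right way.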
Next step - substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure
Beware: salts etc. may need counterions stripped out
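An exact match on a canonical identifier is then just a dictionary lookup, provided the salt problem is handled first. The sketch below is a rough illustration under stated assumptions: the registry contents are invented, the SMILES keys are assumed to have been canonicalised by one and the same toolkit already, and "largest fragment" is approximated crudely by counting letters in each "."-separated component.

```python
def strip_counterions(smiles):
    """Keep the largest '.'-separated component (crudely: most atom letters)."""
    parts = smiles.split(".")
    return max(parts, key=lambda part: sum(ch.isalpha() for ch in part))

# Hypothetical registry keyed by an identifier assumed to be canonical already.
registry = {
    "CC(=O)O": "acetic acid",
    "CN(C)c1ccccc1": "N,N-dimethylaniline",
}

def exact_match(query_smiles):
    """Look the parent structure up by plain string equality."""
    return registry.get(strip_counterions(query_smiles))

print(exact_match("CN(C)c1ccccc1.Cl"))   # hydrochloride salt of the amine
print(exact_match("CCO"))                # not in the registry
```

Plain string comparison only works because both sides are canonical; two different but valid SMILES for the same molecule would not match, which is why the canonical numbering step matters.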
Substructure search
Suppose we wish to find part of a molecule in a database.
To do this we have to do a substructure search.
A substructure is a sub-graph of the molecular graph. This is
an example of substructural fragments of a larger molecule,
shown in different colours. You could, for example, look for all the
molecules with the phenol fragment.
There are two steps commonly used to do this. Firstly, we
have to number all the molecules consistently (the Morgan
algorithm), for example to compare two structures to see if
they are identical. Then we have to match the query
substructure to each substructure in the molecule (the
Ullmann algorithm)
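The matching step can be illustrated with a plain backtracking search over labelled graphs. To be clear about what this is: it is not Ullmann's algorithm itself, which adds a matrix-based refinement to prune candidates early, but it solves the same subgraph matching problem on small examples. The data layout, atom numbering and function name are assumptions for this sketch, and bond types are ignored.

```python
def find_substructure(query, target):
    """Return one mapping {query atom: target atom}, or None if no match.

    Each graph is (labels, bonds): labels maps atom index -> element symbol,
    bonds is a set of frozenset({i, j}) pairs.
    """
    q_labels, q_bonds = query
    t_labels, t_bonds = target
    q_atoms = sorted(q_labels)

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return dict(mapping)
        q = q_atoms[len(mapping)]          # next query atom to place
        for t in t_labels:
            if t in mapping.values() or t_labels[t] != q_labels[q]:
                continue
            # Every query bond to an already-placed atom must exist in the target.
            if all(frozenset((t, mapping[p])) in t_bonds
                   for p in mapping if frozenset((q, p)) in q_bonds):
                mapping[q] = t
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[q]             # dead end: backtrack
        return None

    return extend({})

# Ethanol heavy atoms C1-C2-O3, and a C-O query fragment.
ethanol = ({1: "C", 2: "C", 3: "O"}, {frozenset((1, 2)), frozenset((2, 3))})
c_o = ({1: "C", 2: "O"}, {frozenset((1, 2))})
print(find_substructure(c_o, ethanol))
```

The search first tries target atom 1 for the query carbon, finds that it has no bond to the oxygen, backtracks, and succeeds with atom 2. That trial and error is what makes the matching the slow part, and why the bitstring screen is applied first.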
Substructures in molecules
• ‘Subgraphs’ can be identified in a structure graph,
corresponding to fragments of the whole structure,
e.g. rings, substituents etc.
- -OH
- -NH2
- -COOH
- phenyl
• this can be done by
tracing appropriate
paths in the graph
• subgraphs may overlap
[Figure: example structure drawn with OH, CH2, CHNH2, OH and O labels]
Matching the query structure to the database
Two algorithms are commonly used:
the Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments
Note - a number of ‘substructural’ fragments can
be matched here (6 in all)
Read these for more details:
Gund, P., Ann. Rep. Med. Chem. 14, 299, 1979
Sheridan, R. P. et al., J. Chem. Inf. Comput. Sci. 29, 255, 1989
Brint, A. T., Willett, P., J. Mol. Graphics 5, 49, 1987
J. R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing
Machinery (JACM), Vol. 23, pp 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan, H. L., The generation of a unique machine description for chemical structures, Journal of
Chemical Documentation 5, 1965, 107-113
The basic concept is that molecules are considered as
graphs:
Atoms are ‘nodes’
Bonds are ‘edges’
A molecule is an example of a labelled graph: nodes can
have labels such as atomic number. The nodes are
connected by the edges (and these can also be labelled,
e.g. double bond)
The Morgan algorithm iteratively computes an integer label
for each node (atom) in a structure. For a computer, the key
point is that the numbering is consistent (unlike humans) and
fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its
atomic number, degree (number of substituents) and types of
bonds;
(ii) Update each label by adding the labels of the adjacent
nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not
increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures -
a technique developed at Chemical Abstracts Service, J. Chemical Documentation
5, 107-113, 1965
The first step - numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest, then uses the next highest connection as 2, etc. Like an onion:
concentric circles from the highest numbered
• Rules are used to break equalities, such as C before N, double bond before single etc.
• Some equivalent atoms are arbitrarily numbered
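The labelling loop is short enough to write out. The sketch below is a simplified version under stated assumptions: the starting label is the degree only (atomic number and bond types are ignored), each update replaces a label with the sum of its neighbours' labels, and the tie-breaking rules are left out. The function name and the pentane numbering are made up for this example.

```python
def morgan_labels(neighbours):
    """Simplified Morgan extended-connectivity labels.

    neighbours maps atom -> list of bonded atoms (hydrogens omitted).
    """
    # Step (i): start from the degree of each atom.
    labels = {atom: len(adj) for atom, adj in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): new label = sum of the neighbours' current labels.
        updated = {atom: sum(labels[n] for n in adj)
                   for atom, adj in neighbours.items()}
        n_updated = len(set(updated.values()))
        # Step (iii): stop once the number of distinct labels stops growing.
        if n_updated <= n_classes:
            return labels
        labels, n_classes = updated, n_updated

# n-Pentane carbon chain C1-C2-C3-C4-C5.
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = morgan_labels(pentane)
order = sorted(labels, key=labels.get, reverse=True)
print(labels)
print(order)
```

Degrees alone only separate the end carbons from the rest; one round of summing also distinguishes the central carbon. Atoms related by symmetry (C2 and C4, C1 and C5) keep equal labels, so their relative order is the arbitrary choice the slide mentions.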
Urea
Adjacency matrix
     O1 C2 C3 C4
O1    0  0  0  1
C2    0  0  0  1
C3    0  0  0  1
C4    1  1  1  0
Cyclohexanone adjacency matrix
[Figure: 7 x 7 adjacency matrix for cyclohexanone over atoms O1, C2-C7; the O1 row has one entry, the C4 row three, and each other row two]
Numbering is crucial. Renumber
for different starting points. Also,
the ‘edges’ of the graph are annotated to speed up the matching. This is the ‘slow’ part
of matching (‘subgraph isomorphism’) and is of the class of NP-complete problems.
The Ullmann algorithm
is a way of detecting substructures
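An adjacency matrix is easy to build from a bond list. The sketch below does so for cyclohexanone; the ring order used here (C4 bonded to O1, C2 and C3, then C2-C5-C6-C7-C3 around the ring) is an assumption chosen to agree with the atom labels on the slide, and the function name is made up.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 matrix from 1-based (i, j) bond pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i - 1][j - 1] = 1
        matrix[j - 1][i - 1] = 1
    return matrix

# Assumed numbering: O1, carbonyl carbon C4, ring C4-C2-C5-C6-C7-C3-C4.
labels = ["O1", "C2", "C3", "C4", "C5", "C6", "C7"]
bonds = [(1, 4), (4, 2), (2, 5), (5, 6), (6, 7), (7, 3), (3, 4)]
matrix = adjacency_matrix(len(labels), bonds)
degrees = [sum(row) for row in matrix]

print("   " + " ".join(labels))
for label, row in zip(labels, matrix):
    print(label + "  " + "  ".join(str(v) for v in row))
print(dict(zip(labels, degrees)))
```

Renumbering the atoms permutes the rows and columns, so the same molecule can give many different-looking matrices. That is why a consistent numbering has to come before any comparison.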
Next lecture
• How can we use this type of information to
solve our chemistry problems?
- Patents/Markush structures
- Molecular similarity
- Molecular properties
- 3D searching
- Virtual screening
- Structure-property/activity relationships
Molecular Informatics
1 molecules and computers
An Introduction to Chemoinformatics Andrew R
Leach Valerie J Gillet Springer 2007
Chemoinformatics - A Textbook Johann Gasteiger and Thomas Engel
Wiley-VCH 2003
Handbook of Chemoinformatics Johann Gasteiger
Wiley-VCH 2003 (new edition out 2018)
Chemoinformatics An Approach to Virtual Screening
By Alexandre Varnek Alex Tropsha RSC Publishing
Bunin Barry A Chemoinformatics Theory Practice and Products
Dordrecht Springer 2007
Chemoinformatics An Approach to Virtual Screening By Alexandre
Varnek Alex Tropsha RSC Publishing
Drug Metabolism Prediction Ed R Mannhold H Kubinye G Folkers
Ed Johannes Kirchmair Methods and Principles in Medicinal
Chemistry Vol 63 Pub Wiley-VCH
Sources- textbooksonline you may wish to consider if you want
to take the subject further
Journals of MolecularCheminformatics you may wish
to follow up on
Journal of Chemical Information and Modeling
Journal of Chemical Theory and Computation
Journal of Cheminformatics
Journal of Computer-Aided Molecular Design
Journal of Molecular Graphics amp Modeling
Journal of Computational Chemistry
Journal of Medicinal Chemistry
Reviews in Computational Chemistry
Drug Discovery Today
BMC Bioinformatics
Nature Reviews Drug Discovery
Expert Opinion on Drug Discovery
WIRES computational Molecular Science
Molecular
Informatics
Includes all aspects of the study of molecules on computers
Also includes Chem(o)informatics
This includes the representation of molecules databases display
simulation prediction of their properties and the discovery and
design of new molecules and materials
Molecular informatics is closely related to bioinformatics
computational chemistry molecular modelling simulation machine
learning and statistics as well as online publications - but the area
has principally been driven by investment in new methods for drug
discovery hence the concentration on small organic molecules
Cambridge HPC
Places to find Molecular
Informatics apps
bull httpwwwmacinchemorgmobilescience
bull Molecules ndash eg RSC-Chemspider)
bull Publishers (eg ACSRSC mobile)
bull Calculations (eg Yield101 for Rxns)
bull Visualisation (eg Pymol for proteins)
J Chem Educ 2013 90 (3) pp 320ndash325
DOI 101021ed300329e
Cheminformatics 101
How do we store molecules on the computer
There are estimated to be 1060 possible small molecules that
could be made How do we find the best molecule for the
problem we are addressing Letrsquos take a look ldquounder the
bonnetrdquo of the way molecules are actually manipulated on the
computer You will be familiar with
1 Trivial name eg Morphine
2 IUPAC name (5α6α)-78-didehydro-45-epoxy-17-
methylmorphinan-36-diol
However these names do not convey the structure of molecules
in a way the computer can readily understand We need to
convert these into ldquomachine readable formatsrdquo which allows
ease of searching based on the complexities of molecular
structure But what is a molecule
Bear this in mind: molecules are complicated. When we look at this scene we add a
huge amount of information from our senses and knowledge, but nearly all of it gets
lost in computational representation.
Representations of chemistry need to be engineered to capture materials and processes.
As you will see, we are moving in that direction, with more complete representations
of molecules and materials.
[Figure: a real-life mixture, not just (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol]
What is a molecule?
Is it a series of connected points?
A wave function?
The sum of its properties?
In the computer, molecules are therefore abstractions and interpretations of data.
So more experimental data and an appropriate description of a molecule may
translate to a wider reality.
Storing molecules: different methods for different purposes
Methods for storing molecules can conveniently be broken down
into:
1-Dimensional (simple, very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D: Line notations (a string of characters from a keyboard)
2D: Molecular diagrams (graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the
notes).
1-Dimensional: line notations
A line notation is a string representation of a molecule. Here are three
examples of different line notations; SMILES is the most useful and widely
used. These are 'strings', all of the same molecule:
• WLN: L66J BMR& DSWQ IN1&1
• ROSDAL: 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
• SMILES: c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
• IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid
Notice: all these notations use just the characters on a standard
typewriter keyboard.
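Because a line notation is just a string, ordinary text processing already gets us some chemistry. The sketch below is an illustration only, not a full SMILES parser: it counts the atoms written explicitly in a SMILES string with a regular expression. The pattern and the function name `count_atoms` are assumptions made for this example; implicit hydrogens are not counted, and unusual bracket atoms (e.g. aromatic selenium) are not handled.

```python
import re
from collections import Counter

# Bracket atoms such as [CH] or [Na+] first, then the SMILES "organic subset".
# Lowercase symbols denote aromatic atoms.
ATOM = re.compile(
    r"\[\d*(?P<bracket>[A-Z][a-z]?|[bcnops])[^\]]*\]"
    r"|(?P<plain>Br|Cl|[BCNOPSFI]|[bcnops])"
)

def count_atoms(smiles):
    """Count the atoms written explicitly in a SMILES string, by element."""
    counts = Counter()
    for match in ATOM.finditer(smiles):
        symbol = match.group("bracket") or match.group("plain")
        counts[symbol.capitalize()] += 1  # aromatic 'c' is still carbon
    return counts

# The naphthalene sulphonic acid from the line-notation examples.
dye = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
print(dict(count_atoms(dye)))
```

Ring-closure digits, bond symbols and brackets for branches are simply skipped, which is why such a short pattern is enough for counting but not for rebuilding the molecular graph.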
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C
This is an example of a 'molecule' and is what is actually stored in the
computer. Does it make sense to you?
It's artemisinin, which has anti-malarial properties (Nobel Prize).
This language is called SMILES.
A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
You can practise by drawing a structure; a SMILES will be available
by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
InChI
• A more recent line notation is called InChI.
• It addresses some of the problems of SMILES, e.g. polymers and materials
not covered by SMILES.
• InChI is generated using computer algorithms and is virtually
uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each
molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate an InChI, or convert from InChI
to structure, from different names and formats.
Again, a string like this is easily matched on a computer.
RSC ChemSpider
Storing chemical diagrams on computers
• The valence model of a molecule can be represented by a
chemical graph. A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and any
crossings are irrelevant. Generally we ignore hydrogens unless
tautomerism or pKa is an issue. Computers handle graphs very
well, and molecules represented like this are examples of labelled
graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and
aromaticity, stereochemistry, tautomerism, non-stoichiometric
compounds etc. are often problematic. Using a simple graph, the
computer would deduce that these two canonical structures are
different molecules.
• To solve this we could, for example, introduce the concept of an
aromatic bond.
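The aromaticity problem can be made concrete with a few lines of Python. This is a minimal sketch under stated assumptions: the two Kekule forms of benzene are written as labelled edge sets with a fixed atom numbering 0 to 5, and the set of aromatic ring bonds is supplied by hand (a real system would have to perceive the ring itself). The names `kekule_a`, `kekule_b` and `with_aromatic_bonds` are invented for the illustration.

```python
# Ring bonds of benzene as unordered atom pairs: 0-1, 1-2, ..., 5-0.
RING = [frozenset((i, (i + 1) % 6)) for i in range(6)]

# Bond orders alternate double/single around the ring, starting at a
# different bond in each Kekule form.
kekule_a = {bond: (2 if i % 2 == 0 else 1) for i, bond in enumerate(RING)}
kekule_b = {bond: (1 if i % 2 == 0 else 2) for i, bond in enumerate(RING)}

def with_aromatic_bonds(bonds, aromatic_ring):
    """Relabel every bond in the given ring with a single 'aromatic' type."""
    return {bond: ("aromatic" if bond in aromatic_ring else order)
            for bond, order in bonds.items()}

# As simple labelled graphs the two forms look like different molecules...
print(kekule_a == kekule_b)
# ...but with an aromatic bond type they become identical.
print(with_aromatic_bonds(kekule_a, set(RING)) ==
      with_aromatic_bonds(kekule_b, set(RING)))
```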
An example of storing the molecular graph: the SD file format
MDL SD (structure data) files store the information about a molecule in the
following format:
Header block
  describes the molecule, e.g. its name
Connection table
  defines the molecular structure (atoms and bonds)
Data block (optional)
  properties, e.g. volume
Terminator line
  a line containing four dollar signs ($$$$),
  indicating the end of information on this molecule
This is probably the most common format for storing small molecules; SD files are
widely used in software to store many molecular structures in databases.
A "format" in computer science is a precisely described order of data. The program
reading it expects the information in exactly the right place (or it screws up).
The Connection Table - describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule
as a labelled graph.
e.g. PDB protein file format:
Connect 3 2 5 4
e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a
molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape)
of a molecule, so storing it in 3D would be pointless. Look at
the changes in conformation in this molecule at room
temperature.
Example SD file - benzene
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Annotations:
Header lines: molecule name; information on this molecule; comment (e.g. a date is used here).
"Counts line": has the number of atoms and bonds as a minimum.
"Atom block": has the xyz coordinates of the atom and the element as a minimum.
"Bond block": one line for each bond (from atom, to atom, bond type).
"M  END" and "$$$$": these identify the end of this molecule.
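To see how a program uses the counts line to find the atom and bond blocks, here is a minimal reading sketch. It is an illustration under assumptions, not a complete reader: the function name `parse_molfile` is made up, only the first molecule of a file is read, and fields are split on whitespace for brevity. Strict V2000 readers slice fixed-width columns instead, because the three-character count fields run together once a molecule has 100 or more atoms.

```python
MOLFILE = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""

def parse_molfile(text):
    """Return (name, atoms, bonds) for the first molecule in the text."""
    lines = text.splitlines()
    name = lines[0].strip()                 # header line 1: molecule name
    counts = lines[3].split()               # line 4: the counts line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:       # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))  # atom numbers are 1-based
    return name, atoms, bonds

name, atoms, bonds = parse_molfile(MOLFILE)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

Note that nothing in the file says "this is the atom block": the reader only knows where each block starts because the counts line told it how many lines to expect.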
Alanine SD file
Bond length is 1.53 Å
Atom line fields: x y z symbol mass-diff charge stereo h-count ...
Charge codes: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1;
4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
Molecule is chiral
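The charge column is a code rather than the charge itself, so a reader has to translate it. A small sketch of that translation, using only the code table listed above (the function name `decode_charge_code` is an assumption for this example):

```python
def decode_charge_code(code):
    """Return (formal_charge, is_doublet_radical) for an atom-block charge code."""
    if code == 4:                 # 4 marks a doublet radical, not a charge
        return 0, True
    if 1 <= code <= 7:            # 1..3 -> +3..+1 and 5..7 -> -1..-3
        return 4 - code, False
    return 0, False               # 0: uncharged, or a value outside the table

for code in range(8):
    print(code, decode_charge_code(code))
```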
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the
electrons but not the nuclei. In a small molecule this is very accurate and we
can almost see the bonds. However, in e.g. a protein structure, which is very
large, we can't always determine the atom positions exactly, so these are
stored and the bonds imputed. The format for storing these is a bit different.
[Figure: electron density map. Which way round should this ring go?]
Example Protein Data Bank file (pdb)
(uses an adjacency matrix)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION 1.80 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Annotations:
CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including xyz, occupancy and temperature factor
HETATM: non-protein atoms, including xyz, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
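PDB atom records are a good example of a format where position on the line matters. The sketch below reads an ATOM/HETATM line by column slices and a CONECT line as a list of bonded atom serial numbers. It is an illustration under assumptions: the function names are invented, only the fields shown are extracted, and `format_atom_record` is a hypothetical helper used here just to build a correctly aligned sample line from the sulphate values in the listing above.

```python
def format_atom_record(record, serial, name, res_name, chain, res_seq,
                       x, y, z, occupancy, b_factor):
    """Hypothetical helper: lay the fields out in the PDB fixed columns."""
    return (f"{record:<6}{serial:>5} {name:<4} {res_name:>3} {chain}"
            f"{res_seq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}"
            f"{occupancy:6.2f}{b_factor:6.2f}")

def parse_atom_record(line):
    """Read an ATOM/HETATM line by column position (0-based slices)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }

def parse_conect(line):
    """Return (atom serial, list of serials it is bonded to)."""
    serials = [int(field) for field in line.split()[1:]]
    return serials[0], serials[1:]

line = format_atom_record("HETATM", 1510, " S", "SO4", "A", 187,
                          25.993, -1.362, 5.893, 1.00, 22.34)
atom = parse_atom_record(line)
print(line)
print(atom["name"], atom["xyz"], atom["b_factor"])
print(parse_conect("CONECT 1510 1511 1512 1513 1514"))
```

Splitting on whitespace would work for this line too, but it breaks as soon as two fields touch (for example a large negative coordinate next to its neighbour), which is why column slices are used for the atom record.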
What if we want to vary the atom positions, e.g. in driving a
reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead
we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
  - Used to alter the "internal coordinates" of a molecule (e.g. when modelling a reaction)
  - An early way of specifying a starting geometry for molecules: sometimes graph paper was used to draw the molecule and obtain a starting set of coordinates before optimisation
  - A Z-matrix uses the following geometric descriptions to describe molecules:
    Bond length
    Bond angle
    Torsion angle
    Out-of-plane bending, e.g. a carbonyl
    Non-bonded distance
[Figure: internal coordinates, with dummy atom (du) positions marked]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the
name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the
name of a variable to describe the distance between the current atom
and NA, the atom number NB, and the name of a variable to describe the
angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the
name of a variable to describe the distance between the current atom
and NA, the atom number NB, the name of a variable to describe the
angle between the current atom, NA and NB, the atom number of another
previously defined atom NC, and finally the name of a variable to
describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a
separate line for each variable.
7. In some cases, where some of the variables are to be fixed as
constants in a geometry optimisation, they are listed here after a
blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
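To show how internal coordinates turn back into xyz, here is a minimal sketch for a three-atom Z-matrix such as water. The placement convention is an assumption chosen for the example (first atom at the origin, second atom on the z axis, third atom in the xz plane), and the function name is invented. Atoms defined with a dihedral angle (the fourth atom onwards) need an extra rotation step that is not shown here.

```python
import math

def three_atom_zmatrix_to_cartesian(symbols, r12, r13, angle_deg):
    """Place atoms 1-3 from two distances and the angle 3-1-2 (degrees).

    Assumed convention: atom 1 at the origin, atom 2 along +z,
    atom 3 bonded to atom 1 and lying in the xz plane.
    """
    angle = math.radians(angle_deg)
    atom1 = (0.0, 0.0, 0.0)
    atom2 = (0.0, 0.0, r12)
    atom3 = (r13 * math.sin(angle), 0.0, r13 * math.cos(angle))
    return list(zip(symbols, (atom1, atom2, atom3)))

# Water with l1 = 0.96 and a1 = 104.0 degrees.
water = three_atom_zmatrix_to_cartesian(("O", "H", "H"), 0.96, 0.96, 104.0)
for symbol, (x, y, z) in water:
    print(f"{symbol} {x:8.4f} {y:8.4f} {z:8.4f}")
```

Changing the single number `a1` bends the molecule while both O-H distances stay fixed, which is exactly why internal coordinates are convenient for driving a reaction coordinate.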
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
[Callouts: z-matrix; 'improper' torsion angles]
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928.
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
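"Can be parsed" is the practical point: any XML library can read the file without a chemistry-specific parser. The sketch below reads a small CML-style description of ethanol with Python's standard library. The document is a simplified illustration written for this example, not the file from the slide: the atom and bond elements follow common CML conventions (`elementType`, `atomRefs2`, `order`), while the flat `property` element with a `units` attribute is an assumption standing in for CML's richer property markup.

```python
import xml.etree.ElementTree as ET

# Simplified, CML-style description of ethanol (heavy atoms only).
CML_TEXT = """
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point" units="degC">78.4</property>
</molecule>
"""

root = ET.fromstring(CML_TEXT.strip())

# Atoms: id -> element symbol.
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}

# Bonds: (first atom id, second atom id, bond order).
bonds = []
for bond in root.iter("bond"):
    first, second = bond.get("atomRefs2").split()
    bonds.append((first, second, int(bond.get("order"))))

# Metadata travels with the value: here the units sit next to the number.
prop = root.find("property")
print(atoms)
print(bonds)
print(prop.get("title"), float(prop.text), prop.get("units"))
```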
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not
really flat, because of thermal fluctuations. So we represent 3D
molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray,
electron or neutron diffraction, e.g. the Cambridge Structural
Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as
Corina or Concord (put in a SMILES and get a 3D molecule), which
use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular
mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods
that deduce conformations, usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot, but often
in many more torsional angle dimensions).
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon).
Molecular dynamics is the most popular method for larger systems. Here is
an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a
series of snapshots of SD files concatenated together to make a movie, just
like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
We can simulate an entire biological receptor
and test new molecules to see how they might
work
• Here is a simulation of an important new cancer target
called 'Apelin', with water and the membrane. We used
these simulations to design new molecules that block this receptor.
Apelin's role in cancer
'Apelin' (a peptide receptor) is involved in cancer; we therefore designed an
antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma
growth, Brain 2017, awx253. https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review,
Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats, which could be a real pain, but
they can be converted:
alc -- Alchemy file                            prep -- Amber PREP file
bs -- Ball & Stick file                        caccrt -- Cacao Cartesian file
ccc -- CCC file                                c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file   crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                       dmol -- DMol3 Coordinates file
feat -- Feature file                           gam -- GAMESS Output file
gamout -- GAMESS Output file                   gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                      qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                      jout -- Jaguar Output file
bin -- OpenEye Binary file                     mmd -- MacroModel file
mmod -- MacroModel file                        out -- MacroModel file
dat -- MacroModel file                         car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                       sd -- MDL Isis SDF file
mdl -- MDL Molfile file                        mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                 mopout -- MOPAC Output file
mmads -- MMADS file                            mpqc -- MPQC file
bgf -- MSI BGF file                            nwo -- NWChem Output file
pdb -- PDB file                                ent -- PDB file
pqs -- PQS file                                qcout -- Q-Chem Output file
res -- ShelX file                              ins -- ShelX file
smi -- SMILES file                             mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                     vmol -- ViewMol file

alc -- Alchemy file                            bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                 cacint -- Cacao Internal file
cache -- CAChe MolStruct file                  c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                ct -- ChemDraw Connection Table file
cht -- Chemtool file                           cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file   crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                          box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                 feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                gamin -- GAMESS Input file
inp -- GAMESS Input file                       gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                     gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                     gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                      jin -- Jaguar Input file
bin -- OpenEye Binary file                     mmd -- MacroModel file
mmod -- MacroModel file                        out -- MacroModel file
dat -- MacroModel file                         sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                        mdl -- MDL Molfile file
mol -- MDL Molfile file                        mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                            bgf -- MSI BGF file
csr -- MSI Quanta CSR file                     nw -- NWChem Input file
pdb -- PDB file                                ent -- PDB file
pov -- POV-Ray Output file                     pqs -- PQS file
report -- Report file                          qcin -- Q-Chem Input file
smi -- SMILES file                             fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                        txyz -- Tinker XYZ file
txt -- Titles file                             unixyz -- UniChem XYZ file
vmol -- ViewMol file                           xed -- XED file
xyz -- XYZ file                                zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a 'standard' format. However,
in the near future there are likely to be many competing
requirements for file content.
Molecules on computers - things to look out for,
since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect.
In proteins only the HEAVY atom positions are observed (sometimes at low resolution),
so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen
and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved
correctly (problems with storing bonding).
Tautomers can be incorrect: check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics,
halogen-aromatic bonds etc. may be inferred but are not observed in the file).
2 Finding molecules
A typical problem involves finding the 'right' molecules:
All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
There is a serious difference in numbers: we can consider a few hundred
molecules in our heads; computers can evaluate millions.
1 Finding molecules using substructure searching
One of the most useful 'properties' of a molecule is its molecular graph.
Many software systems allow searching for whole molecules,
fragments of molecules, or even 'similar' molecules.
For a specific molecule, this means specifying the required search pattern to get an
exact match.
An example might be: search for an exact match for dimethylaniline. Here we've
used the PubChem database, which contains compounds and pharmacological
screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example: find a molecule
containing our query as part of a larger molecule.
Results: it takes a few seconds to search all
92 million molecules.
How does this work?
The first thing to notice is that the searching is so fast that clearly the whole
database is not being searched.
The first part of the process is to pre-compute pointers to only the sets of
compounds in the database we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence
of fragments in a molecule, e.g.
Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
[Callouts on the bitstring: contains P; contains NH2; doesn't contain F]
So a quick match of the bitstring of the fragment searched for against the
(pre-computed) bitstrings in the database will eliminate most of the
structures.
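The screening step can be sketched with integer bit operations. This is a minimal illustration under assumptions: the fragment-to-bit assignments in `FRAGMENT_BIT`, the tiny `database` and all function names are invented for the example, and real systems use much longer bitstrings. A molecule can only contain the query if every bit set in the query is also set in the molecule, so one AND and one comparison per molecule is enough to discard most of the database before any graph matching is attempted.

```python
# Hypothetical assignment of fragments to bit positions.
FRAGMENT_BIT = {"P": 5, "NH2": 11, "F": 3}

def fingerprint(fragments):
    """Set one bit for each fragment present in the molecule."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << FRAGMENT_BIT[fragment]
    return bits

def passes_screen(query_bits, molecule_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (query_bits & molecule_bits) == query_bits

# Pre-computed once, when molecules are registered in the database.
database = {
    "mol_a": fingerprint(["P", "NH2"]),
    "mol_b": fingerprint(["F"]),
    "mol_c": fingerprint(["NH2", "F"]),
}

query = fingerprint(["NH2"])
candidates = [name for name, bits in database.items()
              if passes_screen(query, bits)]
print(candidates)  # only these go on to the slow substructure match
```

Passing the screen does not prove a match: it only says the molecule has the right fragments somewhere, so the survivors still need the full substructure comparison.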
Next step - substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware salts etc.: you may need to strip out counterions.
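One common way to deal with salts before an exact match is to keep only the largest fragment of a multi-component SMILES, since disconnected components are separated by a dot. The sketch below is a heuristic illustration, not a validated standardisation procedure: the "largest fragment" rule, the atom-counting pattern and the function name are assumptions, and charges on the retained fragment are left as they are.

```python
import re

# Rough atom matcher: bracket atoms, two-letter halogens, organic subset.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def largest_fragment(smiles):
    """Return the dot-separated component with the most explicit atoms."""
    fragments = smiles.split(".")
    return max(fragments, key=lambda frag: len(ATOM.findall(frag)))

print(largest_fragment("CC(=O)[O-].[Na+]"))   # sodium acetate
print(largest_fragment("CCN(CC)CC.Cl"))       # an amine hydrochloride
```

An exact match by string comparison only works if both the query and the database entries have been converted to the same canonical form after this clean-up.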
Substructure search
Suppose we wish to find part of a molecule in a database.
To do this we have to do a substructure search.
A substructure is a sub-graph of the molecular graph. This is
an example of substructural fragments of a larger molecule,
shown in different colours. You could, e.g., look for all the
molecules with the phenol fragment.
There are two steps commonly used to do this. Firstly, we
have to number all the molecules consistently (the Morgan
algorithm), for example to compare two structures to see if
they are identical. Then we have to match the query
substructure to each substructure in the molecule (the
Ullmann algorithm).
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph,
corresponding to fragments of the whole structure,
e.g. rings, substituents etc.
  - -OH
  - -NH2
  - -COOH
  - phenyl
• This can be done by tracing appropriate paths in the graph.
• Subgraphs may overlap.
[Figure: structure diagram with groups labelled OH, CH2, CHNH2, OH, O]
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments.
Note: a number of 'substructural' fragments can
be matched here (6 in all).
Read these for more details:
Gund P., Ann. Rep. Med. Chem. 14, 299 (1979)
Sheridan R. P. et al., J. Chem. Inf. Comput. Sci. 29, 255 (1989)
Brint A. T., Willett P., J. Mol. Graphics 5, 49 (1987)
Ullmann J. R., An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing
Machinery (JACM) 23, 31-42 (1976). http://portal.acm.org/citation.cfm?id=321925
Morgan H. L., The generation of a unique machine description for chemical structures, Journal of
Chemical Documentation 5, 107-113 (1965)
The basic concept is that molecules are considered as
graphs:
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can
have labels such as atomic number. The nodes are
connected by the edges (and these can also be labelled,
e.g. double bond).
The Morgan algorithm iteratively computes an integer label
for each node (atom) in a structure. For a computer the key
point is that the numbering is consistent (unlike humans) and
fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its
atomic number, degree (number of substituents) and types of
bonds.
(ii) Update each label by adding the labels of the adjacent
nodes (connected atoms).
(iii) Repeat (ii) until the number of different labels does not
increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures -
a technique developed at Chemical Abstracts Service, J. Chem. Doc. 5,
107-113 (1965).
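Steps (i) to (iii) can be sketched in a few lines. This is a simplified illustration, not the full published procedure: the initial label is just the degree of each atom (atomic number and bond types are ignored), step (ii) is read as replacing each label by the sum of its neighbours' labels, as in the classic extended-connectivity scheme, and tie-breaking rules are left out. The function name and the pentane example are assumptions made for this sketch.

```python
def morgan_labels(adjacency):
    """Extended-connectivity labels for a graph given as {node: [neighbours]}."""
    # (i) initial label: the degree (number of attached heavy atoms)
    labels = {node: len(nbrs) for node, nbrs in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # (ii) new label: sum of the labels of the adjacent nodes
        new = {node: sum(labels[n] for n in adjacency[node])
               for node in adjacency}
        new_classes = len(set(new.values()))
        # (iii) stop when the number of different labels no longer increases
        if new_classes <= n_classes:
            return labels
        labels, n_classes = new, new_classes

# Carbon skeleton of pentane: C1-C2-C3-C4-C5.
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = morgan_labels(pentane)
print(labels)

# Number the atoms from the highest label downwards (ties left as they fall).
order = sorted(labels, key=labels.get, reverse=True)
print(order)
```

The labels come out the same whichever end of the chain the atoms were originally numbered from: the central carbon always gets the highest value and the two ends always tie, which is the consistency the numbering step relies on.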
The first step - numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged.
• Numbering starts from the highest label, then uses the next-highest connection as 2, etc. Like an onion:
concentric circles out from the highest-numbered atom.
• Rules are used to break ties, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered.
Urea adjacency matrix
     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0
[Figure: numbered structure diagrams of urea (atoms 1-4) and cyclohexanone (atoms O1, C2-C7)]
Cyclohexanone adjacency matrix
[Figure: 7 x 7 adjacency matrix over O1, C2-C7; row O1 has one bond entry, row C4 three, and rows C2, C3, C5, C6 and C7 two each]
Numbering is crucial. Renumber for different starting points. Also,
the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part
of matching: subgraph isomorphism is in the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures.
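To make the matching step concrete, the sketch below builds adjacency matrices and searches for a query graph inside a target graph by plain backtracking. It is explicitly not Ullmann's algorithm, which adds a refinement step that prunes candidate atoms much earlier; this is only the naive search that such refinements speed up. The cyclohexanone numbering used here (0-based, carbonyl carbon at index 3) is an assumption for the example, and all function names are invented.

```python
def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 matrix from a list of (i, j) bonds, 0-based."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix

def find_substructure(q_labels, q_adj, t_labels, t_adj):
    """Return a list mapping query atom -> target atom, or None."""
    mapping = []

    def extend():
        k = len(mapping)                 # next query atom to place
        if k == len(q_labels):
            return True
        for cand in range(len(t_labels)):
            if cand in mapping or t_labels[cand] != q_labels[k]:
                continue
            # every query bond to an already placed atom must exist in the target
            if all(not q_adj[k][p] or t_adj[cand][mapping[p]]
                   for p in range(k)):
                mapping.append(cand)
                if extend():
                    return True
                mapping.pop()            # backtrack
        return False

    return list(mapping) if extend() else None

# Cyclohexanone heavy atoms: O, then six ring carbons (assumed numbering).
target_labels = ["O", "C", "C", "C", "C", "C", "C"]
target_adj = adjacency_matrix(
    7, [(0, 3), (3, 1), (3, 2), (1, 4), (2, 5), (4, 6), (5, 6)])

# Query: an oxygen bonded to a carbon.
query_labels = ["O", "C"]
query_adj = adjacency_matrix(2, [(0, 1)])

print(find_substructure(query_labels, query_adj, target_labels, target_adj))
```

In the worst case the search tries every assignment of query atoms to target atoms, which is where the NP-complete behaviour shows up and why the bitstring screen is applied first.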
Next lecture
• How can we use this type of information to
solve our chemistry problems?
  - Patents/Markush structures
  - Molecular similarity
  - Molecular properties
  - 3D searching
  - Virtual screening
  - Structure-Property/Activity Relationships
An Introduction to Chemoinformatics Andrew R
Leach Valerie J Gillet Springer 2007
Chemoinformatics - A Textbook Johann Gasteiger and Thomas Engel
Wiley-VCH 2003
Handbook of Chemoinformatics Johann Gasteiger
Wiley-VCH 2003 (new edition out 2018)
Chemoinformatics An Approach to Virtual Screening
By Alexandre Varnek Alex Tropsha RSC Publishing
Bunin Barry A Chemoinformatics Theory Practice and Products
Dordrecht Springer 2007
Chemoinformatics An Approach to Virtual Screening By Alexandre
Varnek Alex Tropsha RSC Publishing
Drug Metabolism Prediction Ed R Mannhold H Kubinye G Folkers
Ed Johannes Kirchmair Methods and Principles in Medicinal
Chemistry Vol 63 Pub Wiley-VCH
Sources- textbooksonline you may wish to consider if you want
to take the subject further
Journals of MolecularCheminformatics you may wish
to follow up on
Journal of Chemical Information and Modeling
Journal of Chemical Theory and Computation
Journal of Cheminformatics
Journal of Computer-Aided Molecular Design
Journal of Molecular Graphics amp Modeling
Journal of Computational Chemistry
Journal of Medicinal Chemistry
Reviews in Computational Chemistry
Drug Discovery Today
BMC Bioinformatics
Nature Reviews Drug Discovery
Expert Opinion on Drug Discovery
WIRES computational Molecular Science
Molecular
Informatics
Includes all aspects of the study of molecules on computers
Also includes Chem(o)informatics
This includes the representation of molecules databases display
simulation prediction of their properties and the discovery and
design of new molecules and materials
Molecular informatics is closely related to bioinformatics
computational chemistry molecular modelling simulation machine
learning and statistics as well as online publications - but the area
has principally been driven by investment in new methods for drug
discovery hence the concentration on small organic molecules
Cambridge HPC
Places to find Molecular
Informatics apps
bull httpwwwmacinchemorgmobilescience
bull Molecules ndash eg RSC-Chemspider)
bull Publishers (eg ACSRSC mobile)
bull Calculations (eg Yield101 for Rxns)
bull Visualisation (eg Pymol for proteins)
J Chem Educ 2013 90 (3) pp 320ndash325
DOI 101021ed300329e
Cheminformatics 101
How do we store molecules on the computer
There are estimated to be 1060 possible small molecules that
could be made How do we find the best molecule for the
problem we are addressing Letrsquos take a look ldquounder the
bonnetrdquo of the way molecules are actually manipulated on the
computer You will be familiar with
1 Trivial name eg Morphine
2 IUPAC name (5α6α)-78-didehydro-45-epoxy-17-
methylmorphinan-36-diol
However these names do not convey the structure of molecules
in a way the computer can readily understand We need to
convert these into ldquomachine readable formatsrdquo which allows
ease of searching based on the complexities of molecular
structure But what is a molecule
Bear this in mind Molecules are complicated When we look at this scene we add a
huge amount of information from our senses and knowledge ndash but it nearly all gets
lost in computational representation
Representing chemistry needs to be engineered to represent materials and processes
As you will see we are moving in that direction with more complete representations
of molecules and materials
Not (5α6α)-78-
didehydro-45-
epoxy-17-
methylmorphinan-
36-diol
A real life mixture
What is a molecule
is it a series of connected points
a wave function
the sum of its properties
In the computer molecules are therefore abstractions and interpretations of data
So more experimental data and an appropriate description of a molecule may
translate to a wider reality
Storing molecules different methods for different purposes
Methods for storing molecules can conveniently be broken down
into
1-Dimensional (simple and very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D Line notations (a string of characters from a keyboard)
2D Molecular Diagrams (Graphs)
3D Graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the
notes)
1-Dimensional - Line notations ndash a string representation of
molecules- here are three examples of different line notations
SMILES is the most useful and widely used These are
lsquostringsrsquo all of the same molecule
bull Line Notationsndash WLN
bull L66J BMRamp DSWQ IN1amp1
ndash ROSDAL
bull 1=-5-=10=510-11-11N-12-
=17=123-18S-
19O18=20O18=21O8-22N-
2322-24
ndash SMILES
bull c1ccccc1Nc2cc(S(=O)(=O)O)cc3c
2cc(N(C)C)cc3
ndash IUPAC 6-dimethylamino-4-
phenylamino-naphthalene-2-sulphonic
acid
Notice ndash all these notations use just the characters on a standard
typewriter keyboard
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3
)([H])[C]41[C]3([H])[CH]2C
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3
)([H])[C]41[C]3([H])[CH]2C
Itrsquos Artemesinin
Which has
anti-malarial
properties
(Nobel Prize)
This language is called SMILES
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
A SMILES tutorial is available at
httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml
You can practice by drawing a structure and a smiles will be available
by picking the smiley face
httpwwwmolinspirationcomcgi-binpropertiestextMode=1
InChibull A more recent line notation is called
InChI
bull This will address some of the problems
of SMILES eg polymers and materials
not covered by SMILES
bull InChI is generated using computer
algorithms and is virtually un-
interpretable by a humans
bull Importantly it is commonly used as a
unique chemical identifier - each
molecule should theoretically have a
unique InChI One molecule one InChi
bull Websites are available that can generate
InChIconvert from InChi to structure
from different names and formats
Again a string like this is easily
matched on a computer
RSC ChemSpider
Storing chemical diagrams on computers
bull The valence model of a molecule can be represented by a
chemical graph A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes
bull The spacial position of the nodes length of the edges and
crossings are irrelevant Generally we ignore hydrogens unless
tautomerism or pKa is an issue Computers handle graphs very
well and molecules represented like this are examples of labelled
graphs (the atoms have names eg Oxygen)
bull Chemical structures are of course more complex than this and
aromaticity stereochemistry tautomerism non-stoichimetric
compounds etc are often problematic The computer would
(using a simple graph) deduce these two canonical structures are
different molecules
bull Eg to solve this we could introduce the concept of an aromatic
bond
An example of storing the molecular graph: the SD file format
MDL SD (structure-data) files store the information about each molecule in the following order:
Header block
describes the molecule, e.g. its name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
properties, e.g. volume
Terminator line
a line containing four dollar signs ($$$$), indicating the end of the information on this molecule
This is probably the most common format for storing small molecules – SD files are widely used in software to store many molecular structures in databases.
A “format” in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it fails).
The Connection Table – describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.
e.g. PDB protein file format:
Connect 3 2 5 4
e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don’t know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.
Example SD file - benzene
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Annotations:
Line 1: molecule name. Line 2: information on this molecule. Line 3: comment (e.g. the date is used here).
“Counts line”: has the number of atoms and bonds as a minimum.
“Atom block”: has the xyz coordinates of the atom and the element as a minimum.
“Bond block”: one line for each bond (from atom, to atom, bond type).
“M  END” and “$$$$” identify the end of this molecule.
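Reading the counts line, atom block and bond block can be sketched as follows. This is a simplified illustration rather than a full SD reader: it splits on whitespace, whereas the real format is column-based (whitespace splitting breaks once three-digit atom counts run together), and the decimal points in the benzene coordinates are assumed. The function name `parse_molblock` is hypothetical.

```python
# Minimal sketch of reading one V2000 record: header, counts line, atoms, bonds.
MOLBLOCK = """\
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""

def parse_molblock(text):
    """Return (name, atoms, bonds) from a single V2000 record."""
    lines = text.splitlines()
    name = lines[0].strip()                       # header line 1: molecule name
    counts = lines[3].split()                     # counts line: atoms, bonds, ...
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:             # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        a, b, order = (int(v) for v in line.split()[:3])   # from, to, bond type
        bonds.append((a, b, order))
    return name, atoms, bonds

name, atoms, bonds = parse_molblock(MOLBLOCK)
print(name, len(atoms), len(bonds))
print(bonds)
```

The bond list that comes out is exactly the labelled graph described earlier: atom numbers are the nodes and the third value is the edge label.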
Alanine SD file
Bond length is 1.53 Å
Atom block columns: x, y, z, symbol, mass difference, charge, stereo, H-count, …
Charge codes: 0 = uncharged (or a value other than these); 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
Molecule is chiral
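The charge codes listed above can be turned into formal charges with a small lookup table. This is a sketch only; `decode_charge` is a made-up helper name, and code 4 is reported as a neutral doublet radical.

```python
# Atom-block charge codes as listed on the slide: code -> formal charge.
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 4: 0, 5: -1, 6: -2, 7: -3}

def decode_charge(code):
    """Return (formal_charge, is_doublet_radical) for a charge code."""
    if code not in CHARGE_CODES:
        raise ValueError(f"unknown charge code: {code}")
    return CHARGE_CODES[code], code == 4

print(decode_charge(3))   # (1, False)
print(decode_charge(5))   # (-1, False)
print(decode_charge(4))   # (0, True)
```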
What if we only know the atom positions and not the bonds?
A key example is X-ray crystallography. Here we determine the positions of the atoms and infer the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can’t always determine the atom positions exactly, so the positions are stored and the bonds are inferred. The format for storing these is a bit different.
Which way round should this ring go?
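One common way to infer bonds from positions alone is a distance test: two atoms are treated as bonded if they are closer than the sum of their covalent radii plus a tolerance. The sketch below assumes approximate single-bond radii and a tolerance of 0.45 Å (both are assumptions, and real programs add further chemical checks). The coordinates are those of the sulfate ion in the 3S7A excerpt in this handout, with the decimal points assumed.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms (assumed values).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.45  # assumed slack in angstroms

def infer_bonds(atoms):
    """atoms: list of (symbol, x, y, z). Return index pairs judged to be bonded."""
    bonds = []
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        distance = math.dist(a[1:], b[1:])
        if distance <= COVALENT_RADIUS[a[0]] + COVALENT_RADIUS[b[0]] + TOLERANCE:
            bonds.append((i, j))
    return bonds

sulfate = [
    ("S", 25.993, -1.362, 5.893),
    ("O", 25.504, -1.159, 4.508),
    ("O", 27.282, -0.770, 6.155),
    ("O", 24.953, -1.035, 6.850),
]
print(infer_bonds(sulfate))   # [(0, 1), (0, 2), (0, 3)]
```

Each S–O distance here is about 1.45 to 1.48 Å and each O···O distance about 2.4 Å, so only the three S–O pairs pass the test. The approach fails in exactly the cases the slide warns about: poorly resolved positions give distances that no longer match chemical expectations.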
Example protein databank file (pdb)
(bonds are listed per atom in CONECT records, i.e. as an adjacency list)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION 1.80 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Annotations:
CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including xyz, occupancy and temperature factor
HETATM: non-protein atoms, including xyz, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
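Because PDB records are column-based, a reader slices each line at fixed positions rather than splitting on spaces. The sketch below uses the usual column positions for ATOM/HETATM and CONECT records; the function names are made up, and the coordinate values are those of the 3S7A excerpt with decimal points assumed.

```python
# Minimal sketch of a column-based reader for ATOM/HETATM and CONECT lines.
PDB_LINES = """\
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
CONECT 1510 1511 1512 1513 1514
"""

def parse_atom_record(line):
    """Slice one ATOM/HETATM line by column position (0-based Python slices)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),   # temperature factor
    }

def parse_conect(line):
    """CONECT: an atom serial followed by the serials bonded to it (5-column fields)."""
    fields = [line[i:i + 5] for i in range(6, len(line), 5)]
    serials = [int(f) for f in fields if f.strip()]
    return serials[0], serials[1:]

atoms = [parse_atom_record(l) for l in PDB_LINES.splitlines()
         if l.startswith(("ATOM", "HETATM"))]
conect = [parse_conect(l) for l in PDB_LINES.splitlines() if l.startswith("CONECT")]
print(atoms[0]["name"], atoms[0]["x"], atoms[0]["y"], atoms[0]["z"])
print(conect)
```

Each CONECT entry is one row of an adjacency list: the first serial is the atom and the rest are its bonded neighbours.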
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
– Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)
– An early way of specifying a starting geometry for molecules: people sometimes drew the molecule on graph paper to get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending (e.g. a carbonyl)
Non-bonded distance
[Figure: internal coordinates of example molecules, showing dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
The dihedrals da1 and -da1 in this Z-matrix are ‘improper’ torsion angles.
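To see how a program turns internal coordinates back into positions, here is a minimal Z-matrix to Cartesian sketch. It is an illustration under stated assumptions, not Gaussian's own routine: each row is written as a tuple with numeric values instead of variable names, the first atom is put at the origin and the second on the z axis, and the sign convention chosen for the dihedral is one common choice. Degenerate (collinear) reference atoms are not handled. The water values are those given above, with decimal points assumed.

```python
import math

def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def _cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def zmatrix_to_cartesian(rows):
    """rows: (symbol, na, r, nb, angle_deg, nc, dihedral_deg), 1-based references.
    Entries that an early atom does not need are None."""
    xyz = []
    for i, (symbol, na, r, nb, angle, nc, dihedral) in enumerate(rows):
        if i == 0:                       # rule 1: first atom at the origin
            xyz.append([0.0, 0.0, 0.0])
            continue
        if i == 1:                       # rule 2: second atom along z
            xyz.append([0.0, 0.0, r])
            continue
        a, b = xyz[na - 1], xyz[nb - 1]
        if i == 2:                       # rule 3: no dihedral yet, pick a plane
            c, dihedral = [b[0] + 1.0, b[1], b[2]], 0.0
        else:                            # rule 4: full distance/angle/dihedral
            c = xyz[nc - 1]
        e1 = _unit([b[k] - a[k] for k in range(3)])          # axis a -> b
        w = [c[k] - b[k] for k in range(3)]
        proj = sum(w[k] * e1[k] for k in range(3))
        e2 = _unit([w[k] - proj * e1[k] for k in range(3)])  # towards c, off-axis
        e3 = _cross(e1, e2)
        theta, phi = math.radians(angle), math.radians(dihedral)
        xyz.append([a[k] + r * (math.cos(theta) * e1[k]
                                + math.sin(theta) * (math.cos(phi) * e2[k]
                                                     - math.sin(phi) * e3[k]))
                    for k in range(3)])
    return xyz

water = [
    ("O", None, None, None, None, None, None),
    ("H", 1, 0.96, None, None, None, None),
    ("H", 1, 0.96, 2, 104.0, None, None),
]
coords = zmatrix_to_cartesian(water)
print(coords)
print(math.dist(coords[1], coords[2]))   # H...H distance, about 1.51 angstroms
```

Changing a single number (say the angle) moves the atom in a chemically meaningful way, which is why internal coordinates are convenient for driving a reaction coordinate.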
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a markup language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
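The point that such a file "can be parsed" and carries metadata can be illustrated with Python's standard XML parser. The fragment below is a simplified, CML-style description of ethanol written for this example: real CML files declare a namespace and use dictionary references for properties, and the boiling-point entry is only there to show a value stored together with its units.

```python
import xml.etree.ElementTree as ET

# Simplified CML-style fragment for ethanol (illustrative, no namespace).
ETHANOL_CML = """\
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point">
    <scalar units="celsius">78.4</scalar>
  </property>
</molecule>
"""

root = ET.fromstring(ETHANOL_CML)
atoms = {a.get("id"): a.get("elementType") for a in root.iter("atom")}
bonds = [tuple(b.get("atomRefs2").split()) + (b.get("order"),)
         for b in root.iter("bond")]
scalar = root.find("property/scalar")

print(atoms)                               # {'a1': 'C', 'a2': 'C', 'a3': 'O'}
print(bonds)                               # [('a1', 'a2', '1'), ('a2', 'a3', '1')]
print(scalar.text, scalar.get("units"))    # 78.4 celsius
```

Unlike the fixed-column formats earlier, every value here is named, so a reader does not depend on the information being in exactly the right place.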
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their Cartesian coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions …
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called ‘Apelin’, with water and the membrane. We used these simulations to design new molecules that block this receptor.
Apelin’s role in cancer
‘Apelin’ (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats – which could be a real pain, but they can be converted
alc -- Alchemy file prep -- Amber PREP file
bs -- Ball & Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball & Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file
cht -- Chemtool file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats – the bane of our lives
Interconversion program – Babel
There have been recent IUPAC moves towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content.
Molecules on computers – things to look out for, since what is stored is actually quite crude
For example:
Stereochemistry may be relative rather than absolute, or even incorrect.
In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
Tautomers can be incorrect – check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not recorded in the file).
2 Finding molecules
A typical problem involves finding the ‘right’ molecules
All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
There is a serious difference in numbers: we can consider a few hundred molecules in our heads, whereas computers can evaluate millions.
1 Finding molecules using substructure searching
One of the most useful ‘properties’ of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, fragments of molecules or even ‘similar’ molecules.
For a specific molecule this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we’ve used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example: find a molecule containing our query as part of a larger molecule.
Results: it takes a few seconds to search all 92 million molecules.
How does this work?
The first thing to notice is that the search is so fast that clearly the whole database is not being searched.
The first part of the process is to pre-compute pointers to only those sets of compounds in the database that we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value: 0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
Callouts on individual bits: contains P; contains NH2; doesn’t contain F
So a quick match of the bitstring of the query fragment against the (pre-computed) bitstrings in the database will eliminate most of the structures.
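The bitstring screen can be sketched with Python integers used as bit sets. The fragment dictionary, bit positions and molecule names below are invented for illustration; the one real rule is that a molecule can only contain the query if every bit set in the query is also set in the molecule.

```python
# Hypothetical fragment dictionary: fragment -> bit position (assumed values).
FRAGMENT_BITS = {"F": 3, "P": 6, "NH2": 12}

def fingerprint(fragments):
    """Set one bit for each fragment present in the molecule."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << FRAGMENT_BITS[fragment]
    return bits

def may_contain(query_bits, molecule_bits):
    """Screen: every bit set in the query must also be set in the molecule."""
    return (query_bits & molecule_bits) == query_bits

# Pre-computed once, when the database is built.
database = {
    "mol_a": fingerprint(["P", "NH2"]),
    "mol_b": fingerprint(["NH2"]),
    "mol_c": fingerprint(["F"]),
}

query = fingerprint(["NH2"])
candidates = [name for name, bits in database.items() if may_contain(query, bits)]
print(candidates)   # ['mol_a', 'mol_b']
```

Only the surviving candidates need the expensive atom-by-atom match. The screen can let through molecules that do not really match, but it never rejects one that does.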
Next step – substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware salts etc.: you may need to strip out counterions.
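A crude way to strip counterions before an exact string match is to keep only the largest dot-separated component of the SMILES. The sketch below assumes both strings are already canonical SMILES produced by the same software (it does no canonicalisation itself), and "largest" is judged simply by counting letters, which is a rough heuristic. The function names are made up.

```python
def largest_fragment(smiles):
    """Keep the dot-separated component with the most letters (rough atom count)."""
    return max(smiles.split("."), key=lambda part: sum(ch.isalpha() for ch in part))

def exact_match(query, database_entry):
    """Compare two canonical SMILES strings after dropping counterions."""
    return largest_fragment(query) == largest_fragment(database_entry)

print(largest_fragment("CC(=O)[O-].[Na+]"))                # CC(=O)[O-]
print(exact_match("CN(C)c1ccccc1", "CN(C)c1ccccc1.Cl"))    # True
print(exact_match("CN(C)c1ccccc1", "CNc1ccccc1"))          # False
```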
Substructure search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a subgraph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, for example, look for all the molecules with the phenol fragment.
Two steps are commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• ‘Subgraphs’ can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
– –OH
– –NH2
– –COOH
– phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap
[Figure: example structure with its fragments labelled OH, CH2, CHNH2, OH, O]
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments.
Note: a number of ‘substructural’ fragments can be matched here (6 in all).
Read these for more details:
Gund P., Ann. Rep. in Med. Chem. 14, 299, 1979
Sheridan R.P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989
Brint A.T., Willett P., J. Mol. Graphics 5, 49, 1987
J.R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan H.L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113
The basic concept is that molecules are considered as graphs
Atoms are ‘nodes’
Bonds are ‘edges’
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107–113, 1965
The first step - numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the atom with the highest value, then takes its highest-valued connection as 2, etc. Like an onion: concentric circles out from the highest-numbered atom
• Rules are used to break ties, such as C before N, double bond before single etc.
• Some equivalent atoms are arbitrarily numbered
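The iterative labelling can be sketched as below. This is a simplified version under stated assumptions: the starting label is just the degree of each atom (atomic number and bond types are ignored), the labels returned are those from the last round that still increased the number of distinct values, and the tie-breaking rules are left out. The example graph, the carbon chain of pentane, is chosen only because it is easy to check by hand.

```python
def morgan_labels(n_atoms, bonds):
    """Connectivity labels for atoms 1..n_atoms, given bonds as (a, b) pairs."""
    neighbours = {i: [] for i in range(1, n_atoms + 1)}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    labels = {i: len(neighbours[i]) for i in neighbours}    # step (i): degree
    n_classes = len(set(labels.values()))
    while True:
        # step (ii): new label = sum of the labels of the connected atoms
        updated = {i: sum(labels[j] for j in neighbours[i]) for i in neighbours}
        n_updated = len(set(updated.values()))
        if n_updated <= n_classes:                           # step (iii): stop
            return labels
        labels, n_classes = updated, n_updated

# Carbon skeleton of pentane: C1-C2-C3-C4-C5
labels = morgan_labels(5, [(1, 2), (2, 3), (3, 4), (4, 5)])
print(labels)                                   # {1: 2, 2: 3, 3: 4, 4: 3, 5: 2}
ranking = sorted(labels, key=lambda atom: -labels[atom])
print(ranking)                                  # [3, 2, 4, 1, 5]
```

The central carbon gets the highest label and would be numbered first, while the symmetry-equivalent atoms (2 and 4, 1 and 5) share a label; this is where the tie-breaking rules, or an arbitrary choice, come in.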
Urea
Adjacency matrix
     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0
Cyclohexanone adjacency matrix
[Table: 7 × 7 adjacency matrix over atoms O1 and C2 to C7. Number of bonded neighbours per row: O1 1; C2 2; C3 2; C4 3; C5 2; C6 2; C7 2]
Numbering is crucial: renumber for different starting points. Also, the ‘edges’ of the graph are annotated to speed up the matching. This matching step (subgraph isomorphism) is the ‘slow’ part and belongs to the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures.
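The matching step can be sketched as a plain backtracking search over labelled graphs. Note that this is not Ullmann's algorithm itself, which adds a refinement procedure that prunes candidate atoms much more aggressively; it only shows the underlying idea of mapping query atoms onto target atoms one at a time and backing up when a bond does not fit. The molecule (the heavy atoms of acetic acid) and the atom numbering are chosen for illustration.

```python
def find_substructure(query, target):
    """query/target: (labels, bonds) with labels {atom: element} and
    bonds {frozenset({a, b}): order}. Return {query atom: target atom} or None."""
    q_labels, q_bonds = query
    t_labels, t_bonds = target
    q_atoms = list(q_labels)

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return dict(mapping)                      # every query atom is placed
        q = q_atoms[len(mapping)]
        for t in t_labels:
            if t in mapping.values() or t_labels[t] != q_labels[q]:
                continue                              # already used, or wrong element
            # every query bond from q to an already placed atom must exist,
            # with the same order, between the corresponding target atoms
            fits = all(
                t_bonds.get(frozenset({t, mapping[p]})) == q_bonds[frozenset({q, p})]
                for p in mapping if frozenset({q, p}) in q_bonds
            )
            if fits:
                mapping[q] = t
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]                        # backtrack
        return None

    return extend({})

# Heavy atoms of acetic acid: C1-C2, C2=O3, C2-O4
acetic_acid = ({1: "C", 2: "C", 3: "O", 4: "O"},
               {frozenset({1, 2}): 1, frozenset({2, 3}): 2, frozenset({2, 4}): 1})
carbonyl = ({1: "C", 2: "O"}, {frozenset({1, 2}): 2})
nitrile = ({1: "C", 2: "N"}, {frozenset({1, 2}): 3})

print(find_substructure(carbonyl, acetic_acid))   # {1: 2, 2: 3}
print(find_substructure(nitrile, acetic_acid))    # None
```

The search first tries the methyl carbon for the carbonyl carbon, finds no C=O bond and backs up; in the worst case this trial and error grows very quickly with molecule size, which is why the bitstring screen is applied first.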
Next lecture
• How can we use this type of information to solve our chemistry problems?
– Patents/Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure–property/activity relationships
A SMILES tutorial is available at
httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml
You can practice by drawing a structure and a smiles will be available
by picking the smiley face
httpwwwmolinspirationcomcgi-binpropertiestextMode=1
InChibull A more recent line notation is called
InChI
bull This will address some of the problems
of SMILES eg polymers and materials
not covered by SMILES
bull InChI is generated using computer
algorithms and is virtually un-
interpretable by a humans
bull Importantly it is commonly used as a
unique chemical identifier - each
molecule should theoretically have a
unique InChI One molecule one InChi
bull Websites are available that can generate
InChIconvert from InChi to structure
from different names and formats
Again a string like this is easily
matched on a computer
RSC ChemSpider
Storing chemical diagrams on computers
bull The valence model of a molecule can be represented by a
chemical graph A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes
bull The spacial position of the nodes length of the edges and
crossings are irrelevant Generally we ignore hydrogens unless
tautomerism or pKa is an issue Computers handle graphs very
well and molecules represented like this are examples of labelled
graphs (the atoms have names eg Oxygen)
bull Chemical structures are of course more complex than this and
aromaticity stereochemistry tautomerism non-stoichimetric
compounds etc are often problematic The computer would
(using a simple graph) deduce these two canonical structures are
different molecules
bull Eg to solve this we could introduce the concept of an aromatic
bond
An example of storing the molecular
graph SD file format
MDL SD (structure data) format files contain the following information and the
information about a molecule is stored in the following format
Header Block
describes the molecule eg itrsquos name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
Properties eg volume
Terminator line
a line containing four dollar signs ($$$$)
indicating the end of information on this molecule
This is probably the most common format to store small molecules ndash SD files in
software are widely used to store many molecule structures in databases
A ldquoformatrdquo in computer science is a precisely described order of data The program
reading it expects the information in exactly the right place (or it screws up)
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
eg PDB protein file format
eg SD file format
1
1
1
1
1
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
The Connection Table ndash describing bonds
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced)
eg PDB protein file format
eg SD file format
The Connection Table ndash describing bonds
Why store molecules in 2Dbull Quite often we only need the
chemical diagram eg to find a
molecule that matches a
chemical structure search
bull It is often the case that we donrsquot
know the conformation (shape)
of a molecule ndash so storing it in
3D would be pointless Look at
the changes in conformation in
this molecule at room
temperature
Example SD file - benzene
bull benzene
bull ACDLabs0812062058
bull August 2013
bull 6 6 0 0 0 0 0 0 0 0 1 V2000
bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 2 1 1 0 0 0 0
bull 3 1 2 0 0 0 0
bull 4 2 2 0 0 0 0
bull 5 3 1 0 0 0 0
bull 6 4 1 0 0 0 0
bull 6 5 2 0 0 0 0
bull M END
bull $$$$
Molecule nameInformation on this molecule
Comment (eg date is used here)
ldquocounts linerdquo has the
number of atoms and
bonds as a minimum
ldquoatom blockrdquo has xyz
coordinates of the atom and
element as a minimum
ldquobond blockrdquo 1-line for each bond
from atom - to atom ndash bond type
These identify the end of this molecule
Alanine SD file
Bond length is 153A
xyz symbol mass diff charge stereo h-counthellip
Charge
0 = uncharged or value
other than
these 1 = +3 2 = +2 3 =
+1
4 = doublet radical
5 = -1 6 = -2 7
= -3
Molecule is chiral
What if we only know the atom positions and not the bonds
A key example would be x-ray crystallography Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge
Here is an electron density map from an x-ray experiment We see the
electrons but not the nuclei In a small molecule this is very accurate and we
can almost see the bonds However in eg a protein structure which is very
large we canrsquot always determine the atom positions exactly so these are
stored and the bonds imputed The format for storing these is a bit different
Which way
round should
this ring go
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA X-RAY DIFFRACTION
REMARK 2 RESOLUTION 180 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 69
SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4
ORIGX1 1000000 0000000 0000000 000000
ORIGX2 0000000 1000000 0000000 000000
ORIGX3 0000000 0000000 1000000 000000
SCALE1 0018319 0000000 0000000 000000
SCALE2 0000000 0018147 0000000 000000
SCALE3 0000000 0000000 0015426 000000
ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128
ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841
ATOM 3 C VAL A 1 4565 1067 -3593 100 2764
ATOM 4 O VAL A 1 4653 1287 -4853 100 2717
ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001
HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234
HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213
HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410
HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
END
Space group and unit cell
dimensions
Protein atoms ndash including xyz
occupancy and temperature factor
Non-protein atoms ndash including xyz
occupancy and temperature factor
Bonds New format
mmCIF
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
– An early way of specifying a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending (e.g. a carbonyl)
Non-bonded distance
[Figure: internal coordinates of a molecule, with dummy atom (du) positions marked]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here, after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
[Figure label: Z-matrix 'improper' torsion angles]
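To see what a Z-matrix actually encodes, it helps to turn one back into xyz coordinates. The sketch below is one common way of doing this (placing each new atom from a distance, an angle and a dihedral relative to three earlier atoms); it is an illustration, not the code any particular program uses. The function names are made up, the sign convention for the dihedral is an assumption, and the variable values are the ones listed above with the decimal points restored.

```python
import math


def _sub(a, b): return tuple(p - q for p, q in zip(a, b))
def _add(a, b): return tuple(p + q for p, q in zip(a, b))
def _scale(a, s): return tuple(p * s for p in a)
def _dot(a, b): return sum(p * q for p, q in zip(a, b))
def _unit(a): return _scale(a, 1.0 / math.sqrt(_dot(a, a)))
def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def distance(p, q):
    return math.sqrt(_dot(_sub(p, q), _sub(p, q)))


def angle_deg(p, vertex, q):
    u, v = _unit(_sub(p, vertex)), _unit(_sub(q, vertex))
    return math.degrees(math.acos(max(-1.0, min(1.0, _dot(u, v)))))


def zmatrix_to_cartesian(rows, variables):
    """rows: Z-matrix lines such as 'H 1 l3 2 a2 3 -da1'; returns a list of xyz tuples."""
    def val(tok):  # a number, a variable name, or a negated variable name
        try:
            return float(tok)
        except ValueError:
            sign = -1.0 if tok.startswith("-") else 1.0
            return sign * variables[tok.lstrip("-")]

    xyz = []
    for row in rows:
        f = row.split()
        if len(xyz) == 0:                      # first atom: at the origin
            pos = (0.0, 0.0, 0.0)
        elif len(xyz) == 1:                    # second atom: along the z axis
            pos = (0.0, 0.0, val(f[2]))
        elif len(xyz) == 2:                    # third atom: in the xz plane
            a, b = xyz[int(f[1]) - 1], xyz[int(f[3]) - 1]
            r, theta = val(f[2]), math.radians(val(f[4]))
            u = _unit(_sub(b, a))
            pos = _add(a, _add(_scale((1.0, 0.0, 0.0), r * math.sin(theta)),
                               _scale(u, r * math.cos(theta))))
        else:                                  # later atoms: distance, angle, dihedral
            c, b, a = (xyz[int(f[i]) - 1] for i in (1, 3, 5))
            r = val(f[2])
            theta, phi = math.radians(val(f[4])), math.radians(val(f[6]))
            bc = _unit(_sub(c, b))
            n = _unit(_cross(_sub(b, a), bc))
            m = _cross(n, bc)
            pos = _add(c, _add(_scale(bc, -r * math.cos(theta)),
                               _add(_scale(m, r * math.sin(theta) * math.cos(phi)),
                                    _scale(n, r * math.sin(theta) * math.sin(phi)))))
        xyz.append(pos)
    return xyz


water = zmatrix_to_cartesian(["O", "H 1 l1", "H 1 l1 2 a1"], {"l1": 0.96, "a1": 104.0})
methanol = zmatrix_to_cartesian(
    ["C", "O 1 l1", "H 1 l2 2 a1", "H 1 l3 2 a2 3 da1", "H 1 l3 2 a2 3 -da1",
     "H 2 l4 1 a3 3 180.0"],
    {"l1": 1.42, "l2": 1.09, "l3": 1.09, "l4": 1.09,
     "a1": 109.0, "a2": 110.0, "a3": 108.0, "da1": 60.0})
```

Changing a single variable (say `a1` for water) and converting again moves all the dependent atoms consistently, which is why internal coordinates are convenient for driving a reaction coordinate.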
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
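As an illustration of "can be parsed", here is a small sketch that reads a CML-style description of ethanol with Python's standard XML parser. The fragment is a simplified, hand-written example in the style of CML (heavy atoms only); in particular the `property` element with `title` and `units` attributes is an assumption made for illustration, not a quotation from the CML schema.

```python
import xml.etree.ElementTree as ET

# Simplified CML-style fragment for ethanol (heavy atoms only, illustrative).
CML_ETHANOL = """\
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="molecular weight" units="g/mol">46.07</property>
</molecule>
"""

root = ET.fromstring(CML_ETHANOL)

# Atoms: id -> element symbol
atoms = {a.get("id"): a.get("elementType") for a in root.iter("atom")}

# Bonds: (from atom id, to atom id, bond order)
bonds = [tuple(b.get("atomRefs2").split()) + (int(b.get("order")),)
         for b in root.iter("bond")]

# Metadata travels with the value: here the units are stored next to the number.
properties = {p.get("title"): (float(p.text), p.get("units"))
              for p in root.iter("property")}
```

The point of the markup is the last line: the number 46.07 is never separated from what it means and what units it is in.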
Molecules and 3-Dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).
• From these we can obtain atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
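The "many more torsional angle dimensions" remark is worth a quick calculation. A systematic scan that rotates each torsion in fixed steps visits (360 / step)^n conformations for n rotatable bonds. The sketch below, with made-up function names and a 30 degree step chosen purely for illustration, enumerates such a grid and counts it.

```python
from itertools import product


def torsion_grid(n_torsions, step_deg):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_deg)
    return product(angles, repeat=n_torsions)


def grid_size(n_torsions, step_deg):
    """Number of grid points: (360 / step) raised to the number of torsions."""
    return (360 // step_deg) ** n_torsions


ramachandran_like = grid_size(2, 30)   # two torsions, as in a Ramachandran plot
flexible_molecule = grid_size(10, 30)  # ten rotatable bonds
```

Two torsions give a grid that is easy to enumerate; ten torsions at the same step give tens of billions of points, which is why practical methods sample the space rather than scanning all of it.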
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
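The film-strip picture can be taken quite literally: if each snapshot is written as one SD record ending in a line of four dollar signs, the movie is just the records one after another. The sketch below splits such a concatenated file back into frames. The two records are toy placeholders, not real coordinates, and real molecular dynamics packages use their own, more compact trajectory formats.

```python
def split_sd_records(text):
    """Split concatenated SD records on the '$$$$' terminator line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


# Two toy 'frames' (placeholders standing in for full SD records).
TRAJECTORY = (
    "frame 1\n(connection table for snapshot 1)\n$$$$\n"
    "frame 2\n(connection table for snapshot 2)\n$$$$\n"
)

frames = split_sd_records(TRAJECTORY)
titles = [frame.splitlines()[0] for frame in frames]
```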
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor.
Apelin's role in cancer
'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file                             prep -- Amber PREP file
bs -- Ball & Stick file                         caccrt -- Cacao Cartesian file
ccc -- CCC file                                 c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                        dmol -- DMol3 Coordinates file
feat -- Feature file                            gam -- GAMESS Output file
gamout -- GAMESS Output file                    gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                       qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                       jout -- Jaguar Output file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                        sd -- MDL Isis SDF file
mdl -- MDL Molfile file                         mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                  mopout -- MOPAC Output file
mmads -- MMADS file                             mpqc -- MPQC file
bgf -- MSI BGF file                             nwo -- NWChem Output file
pdb -- PDB file                                 ent -- PDB file
pqs -- PQS file                                 qcout -- Q-Chem Output file
res -- ShelX file                               ins -- ShelX file
smi -- SMILES file                              mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                      vmol -- ViewMol file

alc -- Alchemy file                             bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                  cacint -- Cacao Internal file
cache -- CAChe MolStruct file                   c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 ct -- ChemDraw Connection Table file
cht -- Chemtool file                            cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                           box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                  feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                 gamin -- GAMESS Input file
inp -- GAMESS Input file                        gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                      gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                      gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                       jin -- Jaguar Input file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                         mdl -- MDL Molfile file
mol -- MDL Molfile file                         mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                             bgf -- MSI BGF file
csr -- MSI Quanta CSR file                      nw -- NWChem Input file
pdb -- PDB file                                 ent -- PDB file
pov -- POV-Ray Output file                      pqs -- PQS file
report -- Report file                           qcin -- Q-Chem Input file
smi -- SMILES file                              fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                         txyz -- Tinker XYZ file
txt -- Titles file                              unixyz -- UniChem XYZ file
vmol -- ViewMol file                            xed -- XED file
xyz -- XYZ file                                 zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.
Molecules on computers: things to look out for, since what is stored is actually quite crude
For example:
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect: check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).
2. Finding molecules
A typical problem involves finding the 'right' molecules:
All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.
1. Finding molecules using substructure searching
One of the most useful 'properties' of a molecule is its molecular graph. Many software systems allow searching for whole molecules, or fragments of molecules, or even 'similar' molecules.
For a specific molecule this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example: find a molecule containing our query as part of a larger molecule.
Results: it takes a few seconds to search all 92 million molecules.
How does this work?
The first thing to notice is that the searching is so fast that clearly the whole database is not being searched.
The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value: 0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
[Figure labels on individual bits: "Contains P", "Contains NH2", "Doesn't contain F"]
So a quick match of the bitstring of the fragment searched over the (pre-computed) bitstrings in the database will eliminate most of the structures.
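The screening step can be sketched in a few lines. Each molecule gets a bitstring with one bit per fragment, computed once in advance; a query can only be a substructure of a molecule if every bit set in the query is also set in the molecule. The fragment dictionary, bit positions and the three-molecule "database" below are all invented for illustration, and the fragment lists are written by hand rather than perceived from structures.

```python
# Hypothetical fragment dictionary: fragment name -> bit position.
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "phenyl": 4}


def fingerprint(fragments):
    """Set one bit for each fragment present in the molecule."""
    bits = 0
    for frag in fragments:
        bits |= 1 << FRAGMENT_BITS[frag]
    return bits


def may_contain(query_bits, molecule_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (query_bits & molecule_bits) == query_bits


def screen(query_fragments, database):
    """Return the names of molecules that survive the bitstring screen."""
    query_bits = fingerprint(query_fragments)
    return [name for name, bits in database.items() if may_contain(query_bits, bits)]


# Pre-computed bitstrings for a toy database.
database = {
    "aniline": fingerprint(["NH2", "phenyl"]),
    "phenol": fingerprint(["OH", "phenyl"]),
    "4-aminophenol": fingerprint(["NH2", "OH", "phenyl"]),
}

hits = screen(["NH2", "phenyl"], database)
```

The screen only eliminates molecules that cannot match. The survivors still have to go through the slower atom-by-atom substructure match, because having the right fragments does not guarantee they are connected in the right way.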
Next step: substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware: salts etc. may need their counterions stripped out.
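A very rough sketch of the salt problem: in SMILES a dot separates disconnected components, so one crude way to strip counterions is to keep the largest component before looking the string up. This is only an illustration under strong assumptions: size is estimated by counting letters (so two-letter elements count twice), the lookup table is invented, and the strings are assumed to be canonical already, which this sketch does not check or enforce.

```python
def strip_counterions(smiles):
    """Keep the largest dot-separated component (size estimated by letter count)."""
    parts = smiles.split(".")
    return max(parts, key=lambda part: sum(ch.isalpha() for ch in part))


def exact_match(smiles, index):
    """Exact-match lookup on the stripped string; index maps SMILES -> name."""
    return index.get(strip_counterions(smiles))


# Toy lookup table keyed on (assumed canonical) SMILES strings.
index = {"CC(=O)[O-]": "acetate", "CCO": "ethanol"}

parent = strip_counterions("CC(=O)[O-].[Na+]")
name = exact_match("CC(=O)[O-].[Na+]", index)
```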
Substructure Search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
– –OH
– –NH2
– –COOH
– phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap
[Figure: example structure; atom labels OH, CH2, CHNH2, OH, O]
Matching the query structure to the database
Two algorithms are commonly used:
the Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.
Note: a number of 'substructural' fragments can be matched here (6 in all).
Read these for more details:
Gund P., Ann. Rep. in Med. Chem. 14, 299, 1979
Sheridan R.P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989
Brint A.T., Willett P., J. Mol. Graphics 5, 49, 1987
J.R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan H.L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113
The basic concept is that molecules are considered as graphs:
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures: a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107–113, 1965
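Steps (i) to (iii) can be sketched as follows. This is a simplified version, not the full published procedure: the starting label is just the degree of each atom (atomic number and bond types are ignored), and the tie-breaking rules needed for a complete unique numbering are left out. The example graph is the carbon skeleton of pentane, chosen here for illustration.

```python
def morgan_labels(neighbours):
    """neighbours: dict atom -> list of bonded atoms. Returns dict atom -> label."""
    # (i) start from the degree of each atom
    labels = {atom: len(nbrs) for atom, nbrs in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # (ii) new label = sum of the labels of the adjacent atoms
        new = {atom: sum(labels[n] for n in nbrs) for atom, nbrs in neighbours.items()}
        new_classes = len(set(new.values()))
        # (iii) stop when the number of different labels no longer increases
        if new_classes <= n_classes:
            return labels
        labels, n_classes = new, new_classes


# Carbon skeleton of pentane: C1-C2-C3-C4-C5
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

labels = morgan_labels(pentane)
ranking = sorted(labels, key=labels.get, reverse=True)  # highest label first
```

For pentane the degrees alone give two classes (chain ends and middle atoms). One update separates the central atom from its neighbours, giving three classes, and a second update would not improve on that, so the procedure stops. Atoms 2 and 4 (and 1 and 5) keep equal labels, as they should: they are symmetry-equivalent, and this is where "some equivalent atoms are arbitrarily numbered".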
The first step - Numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged.
• Numbering starts from the highest, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest numbered.
• Rules are used to break equalities, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered.
Urea
Adjacency matrix
     O1 C2 C3 C4
O1    0  0  0  1
C2    0  0  0  1
C3    0  0  0  1
C4    1  1  1  0
Cyclohexanone adjacency matrix (7 x 7, atoms O1 and C2 to C7): row O1 contains a single 1, row C4 contains three 1s, and rows C2, C3, C5, C6 and C7 contain two 1s each.
[Figures: the two structures with their atom numbering]
Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching: subgraph isomorphism is in the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures.
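A compact sketch of the idea behind Ullmann-style matching is given below. It builds a candidate matrix (which query atom could map to which target atom), prunes it with a refinement step (a candidate survives only if each neighbour of the query atom still has a possible partner among the neighbours of the target atom), and then backtracks. It is a teaching sketch, not Ullmann's published procedure in every detail; bond orders are ignored, and the cyclohexanone numbering used here is my own, not the one on the slide.

```python
def adjacency_matrix(n, bonds):
    m = [[0] * n for _ in range(n)]
    for a, b in bonds:
        m[a][b] = m[b][a] = 1
    return m


def find_substructure(q_labels, q_adj, t_labels, t_adj):
    """Return one mapping {query atom: target atom}, or None if there is no match."""
    nq, nt = len(q_labels), len(t_labels)
    q_deg, t_deg = [sum(r) for r in q_adj], [sum(r) for r in t_adj]
    # Candidates: same element, and the target atom has enough neighbours.
    cand = [[q_labels[i] == t_labels[j] and q_deg[i] <= t_deg[j] for j in range(nt)]
            for i in range(nq)]

    def refine(m):
        changed = True
        while changed:
            changed = False
            for i in range(nq):
                for j in range(nt):
                    if not m[i][j]:
                        continue
                    for x in range(nq):
                        if q_adj[i][x] and not any(t_adj[j][y] and m[x][y]
                                                   for y in range(nt)):
                            m[i][j] = False
                            changed = True
                            break
        return all(any(row) for row in m)

    def search(i, m, used, mapping):
        if i == nq:
            return dict(mapping)
        for j in range(nt):
            if not m[i][j] or j in used:
                continue
            # Bonds to already-mapped query atoms must exist in the target too.
            if any(q_adj[i][k] and not t_adj[j][mapping[k]] for k in mapping):
                continue
            m2 = [row[:] for row in m]
            m2[i] = [y == j for y in range(nt)]
            if not refine(m2):
                continue
            mapping[i] = j
            used.add(j)
            result = search(i + 1, m2, used, mapping)
            if result is not None:
                return result
            del mapping[i]
            used.discard(j)
        return None

    m0 = [row[:] for row in cand]
    return search(0, m0, set(), {}) if refine(m0) else None


# Query: the 4-atom fragment from the matrix above (0-based; atom 3 is the centre).
q_labels = ["O", "C", "C", "C"]
q_adj = adjacency_matrix(4, [(0, 3), (1, 3), (2, 3)])
# Target: cyclohexanone, O0 on C1, ring C1-C2-C3-C4-C5-C6-C1 (illustrative numbering).
t_labels = ["O"] + ["C"] * 6
t_adj = adjacency_matrix(7, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)])

match = find_substructure(q_labels, q_adj, t_labels, t_adj)
no_match = find_substructure(["O", "C", "O"], adjacency_matrix(3, [(0, 1), (1, 2)]),
                             t_labels, t_adj)
```

Even in this tiny case the refinement does most of the work: before any backtracking, the central query atom can only go to the carbonyl carbon, and the two outer carbons only to its ring neighbours.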
Next lecture
• How can we use this type of information to solve our chemistry problems?
– Patents/Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure-property/activity relationships
Molecular Informatics
Includes all aspects of the study of molecules on computers.
Also includes Chem(o)informatics.
This includes the representation of molecules, databases, display, simulation, prediction of their properties, and the discovery and design of new molecules and materials.
Molecular informatics is closely related to bioinformatics, computational chemistry, molecular modelling, simulation, machine learning and statistics, as well as online publications, but the area has principally been driven by investment in new methods for drug discovery, hence the concentration on small organic molecules.
[Figure: Cambridge HPC]
Places to find Molecular Informatics apps
• http://www.macinchem.org/mobilescience
• Molecules (e.g. RSC ChemSpider)
• Publishers (e.g. ACS/RSC mobile)
• Calculations (e.g. Yield101 for Rxns)
• Visualisation (e.g. PyMOL for proteins)
J. Chem. Educ. 2013, 90 (3), pp 320–325
DOI: 10.1021/ed300329e
Cheminformatics 101
How do we store molecules on the computer?
There are estimated to be 10^60 possible small molecules that could be made. How do we find the best molecule for the problem we are addressing? Let's take a look "under the bonnet" of the way molecules are actually manipulated on the computer. You will be familiar with:
1. Trivial name, e.g. Morphine
2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol
However, these names do not convey the structure of molecules in a way the computer can readily understand. We need to convert these into "machine readable formats", which allow ease of searching based on the complexities of molecular structure. But what is a molecule?
Bear this in mind: molecules are complicated. When we look at this scene we add a huge amount of information from our senses and knowledge, but it nearly all gets lost in computational representation.
Representing chemistry needs to be engineered to represent materials and processes. As you will see, we are moving in that direction, with more complete representations of molecules and materials.
[Figure caption: Not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol: a real-life mixture]
What is a molecule?
Is it a series of connected points, a wave function, the sum of its properties?
In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.
Storing molecules: different methods for different purposes
Methods for storing molecules can conveniently be broken down into:
1-Dimensional (simple, very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D: Line notations (a string of characters from a keyboard)
2D: Molecular diagrams (graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the notes).
1-Dimensional: line notations, a string representation of molecules. Here are three examples of different line notations. SMILES is the most useful and widely used. These are 'strings', all of the same molecule.
• Line notations
– WLN
• L66J BMR& DSWQ IN1&1
– ROSDAL
• 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
– SMILES
• c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
– IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid
Notice: all these notations use just the characters on a standard typewriter keyboard.
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C
This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?
It's artemisinin, which has anti-malarial properties (Nobel Prize).
This language is called SMILES.
A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
You can practise by drawing a structure, and a SMILES will be available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
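To get a feel for how a program reads a SMILES string, here is a small sketch that turns one into a list of atoms and bonds. It handles only a subset of the language: atoms, bracket atoms, branches in parentheses, explicit bond symbols and ring-closure digits. Disconnected components (dots), stereochemistry and implicit hydrogens are not handled, and an implicit bond is simply recorded as "-" even when it is aromatic. The function name is made up for illustration.

```python
import re

# Bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms,
# bond symbols, branch parentheses and ring-closure labels.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|[-=#:]|\(|\)|%\d\d|\d")


def smiles_to_graph(smiles):
    """Return (atoms, bonds); bonds are (atom index, atom index, bond symbol)."""
    atoms, bonds = [], []
    prev, bond_symbol = None, "-"
    branch_stack, open_rings = [], {}
    for tok in TOKEN.findall(smiles):
        if tok == "(":                       # start of a branch: remember where we were
            branch_stack.append(prev)
        elif tok == ")":                     # end of a branch: go back
            prev = branch_stack.pop()
        elif tok in ("-", "=", "#", ":"):    # explicit bond for the next connection
            bond_symbol = tok
        elif tok[0].isdigit() or tok[0] == "%":
            if tok in open_rings:            # second occurrence closes the ring
                bonds.append((open_rings.pop(tok), prev, bond_symbol))
            else:                            # first occurrence opens it
                open_rings[tok] = prev
            bond_symbol = "-"
        else:                                # an atom: bond it to the previous atom
            atoms.append(tok)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, bond_symbol))
            prev = len(atoms) - 1
            bond_symbol = "-"
    return atoms, bonds


ethanol = smiles_to_graph("C(C)O")
dye_atoms, dye_bonds = smiles_to_graph("c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3")
```

The result is exactly the labelled graph that the 2D formats store explicitly: a list of nodes with labels, and a list of edges between them.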
InChI
• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate InChI and convert from InChI to structure, from different names and formats.
Again, a string like this is easily matched on a computer.
[Figure: RSC ChemSpider]
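Although an InChI is hard to read by eye, it has a regular layered structure: layers are separated by slashes, and each layer after the formula starts with a one-letter prefix. The sketch below splits the ethanol InChI into its layers. It is a simplification (it assumes each prefix occurs once, which is not true for every InChI), and the function name is made up for illustration.

```python
def inchi_layers(inchi):
    """Split an InChI string into a dict of layers keyed by prefix letter."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len("InChI="):].split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        layers[layer[0]] = layer[1:]   # e.g. 'c' = connections, 'h' = hydrogens
    return layers


ethanol = inchi_layers("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3")
```

For ethanol the connection layer `1-2-3` says atom 1 is bonded to atom 2, which is bonded to atom 3, and the hydrogen layer says atom 3 carries one H, atom 2 two and atom 1 three. Exact-match searching does not need any of this: comparing the whole strings is enough.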
Storing chemical diagrams on computers
• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, length of the edges and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• To solve this we could, e.g., introduce the concept of an aromatic bond.
An example of storing the molecular graph: SD file format
MDL SD (structure data) files store the information about a molecule in the following format:
Header block
describes the molecule, e.g. its name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
properties, e.g. volume
Terminator line
a line containing four dollar signs ($$$$), indicating the end of information on this molecule
This is probably the most common format to store small molecules: SD files are widely used in software to store many molecule structures in databases.
A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).
The Connection Table: describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.
e.g. PDB protein file format (one record lists an atom and all its neighbours):
Connect 3 2 5 4
e.g. SD file format (reduced; one line per bond: from atom, to atom, bond type):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
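The two styles carry the same connectivity, and converting a bond list into per-atom connection lists takes only a few lines. In the sketch below the three bonds are the ones shown above; the bond between atoms 2 and 3 is an assumption added so that atom 3 ends up with the neighbours shown in the "Connect" line. The function name is made up for illustration.

```python
from collections import defaultdict


def bonds_to_connections(bonds):
    """Turn (from atom, to atom, bond type) lines into atom -> list of neighbours."""
    neighbours = defaultdict(list)
    for a, b, _bond_type in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    return dict(neighbours)


bonds = [
    (2, 3, 1),  # assumed bond, not in the reduced listing above
    (3, 5, 1),
    (3, 4, 1),
    (5, 6, 2),
]

connections = bonds_to_connections(bonds)
```

Note what is lost going this way: the connection list says which atoms are joined, but not the bond type, which is one reason bonds in protein files so often have to be imputed.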
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.
Example SD file - benzene
benzene
ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Annotations:
Lines 1 to 3: molecule name; information on this molecule; comment (e.g. a date is used here)
"Counts line": has the number of atoms and bonds as a minimum
"Atom block": has the xyz coordinates of the atom and the element as a minimum
"Bond block": one line for each bond (from atom, to atom, bond type)
"M END" and "$$$$": these identify the end of this molecule
Alanine SD file
[Figure annotations: bond length is 1.53 Å; molecule is chiral]
Atom block columns: x y z, symbol, mass difference, charge, stereo, H-count, …
Charge codes:
0 = uncharged or value other than these
1 = +3, 2 = +2, 3 = +1
4 = doublet radical
5 = -1, 6 = -2, 7 = -3
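Putting the pieces of the SD format together, the sketch below reads the benzene record: the name from line 1, the atom and bond counts from the counts line, then that many atom lines and bond lines. It also decodes the charge column using the table above. One simplification should be stressed: the real format is fixed-column, whereas this sketch splits on whitespace, which works for a small, well-behaved example like this one but not in general. The function name is made up for illustration.

```python
# Charge codes from the atom block (code 4 means doublet radical, not a charge).
MOL_CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 5: -1, 6: -2, 7: -3}

BENZENE_MOL = """\
benzene
ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_molfile(text):
    """Return (name, atoms, bonds) from one V2000 record (whitespace-split sketch)."""
    lines = text.splitlines()
    name = lines[0].strip()
    counts = lines[3].split()                 # counts line: atoms, bonds, ...
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:         # atom block
        f = line.split()                      # x y z symbol mass-diff charge ...
        atoms.append({
            "xyz": (float(f[0]), float(f[1]), float(f[2])),
            "element": f[3],
            "charge": MOL_CHARGE_CODES.get(int(f[5]), 0),
        })
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:   # bond block
        a, b, bond_type = (int(v) for v in line.split()[:3])
        bonds.append((a, b, bond_type))
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(BENZENE_MOL)
n_double = sum(1 for _, _, bond_type in bonds if bond_type == 2)
```

The bond block stores benzene as alternating single and double bonds, which is exactly the situation where a simple graph comparison would treat the two Kekulé forms as different molecules.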
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate, and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
Which way round should this ring go?
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA X-RAY DIFFRACTION
REMARK 2 RESOLUTION 180 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 69
SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4
ORIGX1 1000000 0000000 0000000 000000
ORIGX2 0000000 1000000 0000000 000000
ORIGX3 0000000 0000000 1000000 000000
SCALE1 0018319 0000000 0000000 000000
SCALE2 0000000 0018147 0000000 000000
SCALE3 0000000 0000000 0015426 000000
ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128
ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841
ATOM 3 C VAL A 1 4565 1067 -3593 100 2764
ATOM 4 O VAL A 1 4653 1287 -4853 100 2717
ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001
HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234
HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213
HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410
HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
END
Space group and unit cell
dimensions
Protein atoms ndash including xyz
occupancy and temperature factor
Non-protein atoms ndash including xyz
occupancy and temperature factor
Bonds New format
mmCIF
What if we want to vary the atom
positions eg in driving a
reaction coordinate
Using Cartesian (xyz)
coordinates is very
cumbersome so instead
we use the natural angles
and distances
This uses internal coordinates
bull Also called a Z-matrix
ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)
ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation
ndash A z-matrix uses the following geometric descriptions to describe molecules
Bond Length
Bond angle
Torsion angle
Out of plane bending
eg a carbonyl
Non-bonded distance
C
O
du
du
O
N OH
Dummy atom
positions
Internal
Coordinates
How to construct a Z-matrix (in Gaussian format)
1For the first atom to be defined give the atomic symbol
only
2For the second atom give the atomic symbol the number
1 and the name of a variable to describe the distance
between atoms 1 and 2
3For the third atom give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB and
the name of a variable to describe the angle between the
current atom NA and NB
4For all later atoms give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB the
name of a variable to describe the angle between the current
atom NA and NB the atom number of another previously
defined atom NC and finally the name of a variable to
describe the dihedral angle between the current atom NA
NB and NC
5After all the atoms have been listed enter a blank line
6Next list each variable with its corresponding value Use a
separate line for each variable
7In some cases where some of the variables are to be fixed
as constants in a geometry optimisation they are listed here
after a blank line rather than above
with the real variables
8End the Z-matrix with a blank line
Water (C2v)
O
H 1 l1
H l l1 2 a1
l1 096
a1 1040
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 1800
l1 142
l2 109
l3 109
l4 109
l5 109
l6 10
a1 1090
a2 1100
a3 1080
a4 1100
a5 1100
da1 600
da2 1200
da3 600 z-matrix
lsquoimproperrsquo torsion angles
What about comprehensive
properties of molecules ndash they
are more than xyz
XML and molecules
bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc
bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file
bull Chemical Markup Language (CML) is being developed specifically for chemistry
bull In the future much more information will be stored with molecules allowing greater re-use of data
bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928
Ethanol
ltCMLgt
-Can be parsed
-Can contain reactions
properties etc
-Can contain
relationships to other
molecules and also
concepts
InChI
InChI=1C2H6Oc1-2-3h3H2H21H3
SMILES
C(C)O
Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not
really flat because of thermal fluctuations So we represent 3D
molecules by including their coordinates or their internal coordinates
bull Obtaining the 3-dimensional coordinates can involve experiment (x-
ray electron or neutron diffraction eg the Cambridge
Crystallographic Database or the Protein Databank ndash PDB)
bull From these can be obtained atom positions bonds coordinates etc
bull There are a number of 3-D construction methods available such as
Corina or Concord (put in a SMILES and get a 3-D molecule) which
use rules derived from experiment
bull Molecules can also be constructed in 2D and subjected to molecular
mechanics or Quantum Mechanics calculations to obtain 3D structures
bull Conformation still remains to be deduced There are many methods
that deduce conformations usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot ndash but often
in many more torsional angle dimensions)
Molecules in 3D - uses
bull More accurate calculation of molecular
properties
bull Comparison of the shapes (conformations)
of molecules
bull Comparison of the dynamics of molecules
bull Calculation of bulk properties
bull Simulation of chemical reactionshelliphellip
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon)
Molecular Dynamics is the most popular method for larger systems Here is
an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a
series of snapshots of SD files concatenated together to make a movie just
like a film strip
httpsenwikipediaorgwikiCHARMM
httpambermdorg
httpwwwksuiuceduResearchnamd
httpwwwksuiuceduResearchvmd
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these simulations to design new molecules that block this receptor.
Apelin's role in cancer
'Apelin' (a peptide receptor) is involved in cancer; we therefore designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file    prep -- Amber PREP file
bs -- Ball & Stick file    caccrt -- Cacao Cartesian file
ccc -- CCC file    c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file    cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file    dmol -- DMol3 Coordinates file
feat -- Feature file    gam -- GAMESS Output file
gamout -- GAMESS Output file    gpr -- Ghemical Project file
mm1gp -- Ghemical MM file    qm1gp -- Ghemical QM file
hin -- HyperChem HIN file    jout -- Jaguar Output file
bin -- OpenEye Binary file    mmd -- MacroModel file
mmod -- MacroModel file    out -- MacroModel file
dat -- MacroModel file    car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file    sd -- MDL Isis SDF file
mdl -- MDL Molfile file    mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file    mopout -- MOPAC Output file
mmads -- MMADS file    mpqc -- MPQC file
bgf -- MSI BGF file    nwo -- NWChem Output file
pdb -- PDB file    ent -- PDB file
pqs -- PQS file    qcout -- Q-Chem Output file
res -- ShelX file    ins -- ShelX file
smi -- SMILES file    mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file    vmol -- ViewMol file

alc -- Alchemy file    bs -- Ball & Stick file
caccrt -- Cacao Cartesian file    cacint -- Cacao Internal file
cache -- CAChe MolStruct file    c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file    ct -- ChemDraw Connection Table file
cht -- Chemtool file    cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file    box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file    feat -- Feature file
fh -- Fenske-Hall Z-Matrix file    gamin -- GAMESS Input file
inp -- GAMESS Input file    gcart -- Gaussian Cartesian file
gau -- Gaussian Input file    gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file    gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file    jin -- Jaguar Input file
bin -- OpenEye Binary file    mmd -- MacroModel file
mmod -- MacroModel file    out -- MacroModel file
dat -- MacroModel file    sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file    mdl -- MDL Molfile file
mol -- MDL Molfile file    mopcrt -- MOPAC Cartesian file
mmads -- MMADS file    bgf -- MSI BGF file
csr -- MSI Quanta CSR file    nw -- NWChem Input file
pdb -- PDB file    ent -- PDB file
pov -- POV-Ray Output file    pqs -- PQS file
report -- Report file    qcin -- Q-Chem Input file
smi -- SMILES file    fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file    txyz -- Tinker XYZ file
txt -- Titles file    unixyz -- UniChem XYZ file
vmol -- ViewMol file    xed -- XED file
xyz -- XYZ file    zin -- ZINDO Input file
File formats: the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a 'standard' format. However, in the near future there are likely to be many competing requirements for file content.
Molecules on computers: things to look out for, since what is stored is actually quite crude
For example:
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect: check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not recorded in the file).
2. Finding molecules
A typical problem involves finding the 'right' molecules
All the synthesisable molecules: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
There is a serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.
1. Finding molecules using substructure searching
One of the most useful 'properties' of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, for fragments of molecules or even for 'similar' molecules.
For a specific molecule this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example: find a molecule containing our query as part of a larger molecule.
Results: it takes a few seconds to search all 92 million molecules.
How does this work?
The first thing to notice is that the searching is so fast that clearly the whole database is not being searched.
The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in. This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Value: 0 0 0 0 0 1 0 0 0 0  0  1  0  0  0  0
A set bit (1) records that a fragment is present (e.g. contains P, contains NH2); an unset bit (0) records that it is absent (e.g. doesn't contain F).
So a quick match of the bitstring of the query fragment against the (pre-computed) bitstrings in the database will eliminate most of the structures.
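The bitstring screen can be sketched in a few lines of Python. The fragment dictionary, bit positions and the three example molecules are hypothetical; production fingerprints use hundreds or thousands of bits.

```python
# Hypothetical fragment dictionary: each fragment owns one bit position.
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "phenyl": 4}


def fingerprint(fragments):
    """Pack the fragments present in a molecule into one integer bitstring."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << FRAGMENT_BITS[fragment]
    return bits


def screen(query_fp, database):
    """Keep only molecules whose bitstring contains every bit set in the query."""
    return [name for name, fp in database.items() if (fp & query_fp) == query_fp]


database = {
    "aniline": fingerprint(["NH2", "phenyl"]),
    "phenol": fingerprint(["OH", "phenyl"]),
    "fluorobenzene": fingerprint(["F", "phenyl"]),
}
print(screen(fingerprint(["phenyl", "NH2"]), database))
```

The screen can only rule molecules out. Anything that survives still has to go through the slower atom-by-atom match, because two different molecules can set the same bits.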
Next step: substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware salts etc.: you may need to strip out counterions.
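One crude way to strip counterions before an exact match is to keep only the largest dot-separated component of a SMILES string. This is a sketch under a stated assumption: string length stands in for fragment size, which is only a rough proxy, and the function name is hypothetical.

```python
def largest_fragment(smiles):
    """Return the longest '.'-separated component of a SMILES string.

    In SMILES a '.' separates disconnected components, so salts appear as
    'parent.counterion'. String length is used here as a crude size measure.
    """
    return max(smiles.split("."), key=len)


print(largest_fragment("CC(=O)[O-].[Na+]"))  # sodium acetate
print(largest_fragment("Nc1ccccc1.Cl"))      # aniline hydrochloride
```

A toolkit would count heavy atoms instead of characters and would also neutralise the remaining charge where appropriate.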
Substructure Search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
– –OH
– –NH2
– –COOH
– phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap
[Figure: example molecule with its OH, CH2, CH–NH2 and COOH fragments marked]
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments.
Note: a number of 'substructural' fragments can be matched here (6 in all).
Read these for more details:
Gund P., Ann. Rep. Med. Chem. 14, 299, 1979
Sheridan R.P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989
Brint A.T., Willett P., J. Mol. Graphics 5, 49, 1987
J.R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan H.L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113
The basic concept is that molecules are considered as graphs:
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an initial integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures: a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107-113, 1965
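Steps (i) to (iii) can be sketched in Python as below. This is a simplified illustration, not the full published procedure: the initial label is just the degree of each atom (atomic number and bond types are ignored), each update replaces a label by the sum of its neighbours' labels, and the function name and the n-pentane test graph are assumptions.

```python
def morgan_labels(neighbours):
    """Iterate neighbour sums until the number of distinct labels stops increasing.

    neighbours maps each atom to the list of atoms bonded to it.
    """
    labels = {atom: len(bonded) for atom, bonded in neighbours.items()}  # step (i)
    n_classes = len(set(labels.values()))
    while True:
        # step (ii): new label = sum of the labels of the adjacent atoms
        updated = {atom: sum(labels[b] for b in bonded)
                   for atom, bonded in neighbours.items()}
        updated_classes = len(set(updated.values()))
        if updated_classes <= n_classes:  # step (iii): no further discrimination
            return labels
        labels, n_classes = updated, updated_classes


# n-pentane carbon chain C1-C2-C3-C4-C5
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = morgan_labels(pentane)
print(sorted(labels, key=labels.get, reverse=True))  # central carbon first
```

For the chain the degrees alone give two classes (end and middle carbons); one round of summation separates the central carbon, and a second round adds nothing, so the iteration stops.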
The first step: numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged.
• Numbering starts from the highest label, then uses the next highest connection as 2, etc. Like an onion: concentric circles out from the highest-numbered atom.
• Rules are used to break equalities, such as C before N, double bond before single etc.
• Some equivalent atoms are arbitrarily numbered.
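Urea adjacency matrix (atoms labelled O1, C2, C3, C4 as on the slide; atom 4 is the central atom)
     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0
Cyclohexanone adjacency matrix (7 x 7, atoms O1 and C2 to C7): O1 has a single entry (its bond to C4), C4 has three entries, and each of the other ring carbons has two.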
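An adjacency matrix is easy to build from a bond list. The sketch below uses the four-atom example with the atom labels taken from the slide; the function name is an assumption.

```python
def adjacency_matrix(atoms, bonds):
    """Build a symmetric 0/1 matrix with a 1 wherever two atoms are bonded."""
    index = {atom: i for i, atom in enumerate(atoms)}
    matrix = [[0] * len(atoms) for _ in atoms]
    for a, b in bonds:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1  # bonds have no direction
    return matrix


atoms = ["O1", "C2", "C3", "C4"]
bonds = [("O1", "C4"), ("C2", "C4"), ("C3", "C4")]
matrix = adjacency_matrix(atoms, bonds)
for atom, row in zip(atoms, matrix):
    print(atom, row)
```

Renumbering the atoms permutes the rows and columns, which is why a consistent numbering is needed before two matrices can be compared directly.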
Numbering is crucial: renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching: subgraph isomorphism belongs to the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures
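The idea of substructure detection can be illustrated with a plain backtracking search in Python. This is a sketch of subgraph matching in general, not Ullmann's published algorithm (which adds a matrix refinement step to prune candidates); the function names and the ethanol example are assumptions.

```python
def neighbour_sets(n_atoms, bonds):
    """Map each atom index to the set of atom indices bonded to it."""
    adj = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def find_substructure(q_elems, q_bonds, m_elems, m_bonds):
    """Return one mapping {query atom: molecule atom}, or None if there is no match."""
    q_adj = neighbour_sets(len(q_elems), q_bonds)
    m_adj = neighbour_sets(len(m_elems), m_bonds)

    def extend(mapping):
        q = len(mapping)  # next query atom to place
        if q == len(q_elems):
            return dict(mapping)
        for m in range(len(m_elems)):
            if m in mapping.values():
                continue
            if m_elems[m] != q_elems[q] or len(m_adj[m]) < len(q_adj[q]):
                continue
            # every already-placed query neighbour must also be bonded in the molecule
            if all(mapping[n] in m_adj[m] for n in q_adj[q] if n in mapping):
                mapping[q] = m
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]  # backtrack
        return None

    return extend({})


# Query: a C-O fragment. Molecule: ethanol heavy atoms C0-C1-O2.
print(find_substructure(["C", "O"], [(0, 1)], ["C", "C", "O"], [(0, 1), (1, 2)]))
```

The search first tries the methyl carbon, fails to find a bonded oxygen, backtracks, and then succeeds with the carbon that carries the OH group. The worst-case cost grows very quickly with molecule size, which is why the fast bitstring screen is run first.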
Next lecture
• How can we use this type of information to solve our chemistry problems?
– Patents/Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure-property/activity relationships
Places to find Molecular Informatics apps
• http://www.macinchem.org/mobilescience
• Molecules (e.g. RSC ChemSpider)
• Publishers (e.g. ACS/RSC mobile)
• Calculations (e.g. Yield101 for reactions)
• Visualisation (e.g. PyMOL for proteins)
J. Chem. Educ. 2013, 90 (3), pp 320-325, DOI: 10.1021/ed300329e
Cheminformatics 101
How do we store molecules on the computer?
There are estimated to be 10^60 possible small molecules that could be made. How do we find the best molecule for the problem we are addressing? Let's take a look "under the bonnet" at the way molecules are actually manipulated on the computer. You will be familiar with:
1. Trivial name, e.g. Morphine
2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol
However, these names do not convey the structure of molecules in a way the computer can readily understand. We need to convert these into "machine readable formats", which allow ease of searching based on the complexities of molecular structure. But what is a molecule?
Bear this in mind: molecules are complicated. When we look at this scene we add a huge amount of information from our senses and knowledge, but it nearly all gets lost in computational representation.
Representing chemistry needs to be engineered to represent materials and processes. As you will see, we are moving in that direction, with more complete representations of molecules and materials.
[Figure caption: not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol, but a real-life mixture]
What is a molecule?
Is it a series of connected points?
A wave function?
The sum of its properties?
In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.
Storing molecules: different methods for different purposes
Methods for storing molecules can conveniently be broken down into:
1-Dimensional (simple, very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D: Line notations (a string of characters from a keyboard)
2D: Molecular diagrams (graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the notes).
1-Dimensional: line notations, a string representation of molecules. Here are three examples of different line notations. SMILES is the most useful and widely used. These are 'strings', all of the same molecule.
• Line notations
– WLN: L66J BMR& DSWQ IN1&1
– ROSDAL: 1=-5-=10=510-11-11N-12-=17=123-18S-19O18=20O18=21O8-22N-2322-24
– SMILES: c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
– IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid
Notice: all these notations use just the characters on a standard typewriter keyboard.
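Because a line notation is just a string, even simple text processing can extract chemistry from it. The Python sketch below tokenises a SMILES string and counts the heavy atoms of each element. It is a rough illustration that assumes only the common organic-subset symbols and bracket atoms; the regular expression and function name are assumptions, and a real toolkit should be used for anything serious.

```python
import re
from collections import Counter

# Bracket atoms first, then two-letter halogens, then one-letter symbols
# (upper case = aliphatic, lower case = aromatic). Digits, bonds and
# parentheses are simply skipped.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def element_counts(smiles):
    """Count the atoms written explicitly in a SMILES string, by element."""
    counts = Counter()
    for token in TOKEN.findall(smiles):
        if token.startswith("["):
            # pull the element symbol out of a bracket atom such as [CH] or [Na+]
            symbol = re.match(r"\[\d*([A-Za-z][a-z]?)", token).group(1)
        else:
            symbol = token
        counts[symbol.capitalize()] += 1
    return counts


print(element_counts("c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"))
```

For the sulphonic acid above this gives 18 carbons, 2 nitrogens, 3 oxygens and 1 sulphur. Implicit hydrogens are not written in the string and so are not counted.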
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C
This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?
It's artemisinin, which has anti-malarial properties (Nobel Prize).
This language is called SMILES.
A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
You can practise by drawing a structure; a SMILES will be available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
InChI
• A more recent line notation is called InChI.
• It addresses some of the problems of SMILES, e.g. polymers and materials that SMILES does not cover.
• InChI is generated using computer algorithms and is virtually uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate InChI and convert from InChI to structure, from different names and formats.
Again, a string like this is easily matched on a computer.
RSC ChemSpider
Storing chemical diagrams on computers
• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• To solve this we could, e.g., introduce the concept of an aromatic bond.
An example of storing the molecular graph: SD file format
MDL SD (structure data) files store the information about a molecule in the following format:
Header block
describes the molecule, e.g. its name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
properties, e.g. volume
Terminator line
a line containing four dollar signs ($$$$), indicating the end of information on this molecule
This is probably the most common format for storing small molecules. SD files are widely used in software to store many molecular structures in databases.
A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it fails).
The Connection Table: describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.
e.g. PDB protein file format:
Connect 3 2 5 4
e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.
Example SD file - benzene
benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Line 1: molecule name. Line 2: information on this molecule.
Line 3: comment (e.g. the date is used here).
"Counts line": has the number of atoms and bonds as a minimum.
"Atom block": has the xyz coordinates of the atom and the element as a minimum.
"Bond block": one line for each bond (from atom, to atom, bond type).
"M END" and "$$$$": these identify the end of this molecule.
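Reading the connection table of a small V2000 molfile takes only a few lines of Python. The sketch below is a simplification: it splits each line on whitespace, whereas the format is strictly fixed-column, so a robust reader should slice by column position. The function name is an assumption and the atom lines are shortened to the fields the parser uses.

```python
MOLFILE = """benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0
    1.9050   -2.1232    0.0000 C   0  0
    0.7531   -0.1282    0.0000 C   0  0
    0.7531   -2.7882    0.0000 C   0  0
   -0.3987   -0.7932    0.0000 C   0  0
   -0.3987   -2.1232    0.0000 C   0  0
  2  1  1  0
  3  1  2  0
  4  2  2  0
  5  3  1  0
  6  4  1  0
  6  5  2  0
M  END
"""


def parse_molfile(text):
    """Return (name, atoms, bonds) from a small V2000 molfile."""
    lines = text.splitlines()
    name = lines[0].strip()
    counts = lines[3].split()                 # counts line: atoms, bonds, ...
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:         # atom block
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:  # bond block
        a, b, order = (int(t) for t in line.split()[:3])
        bonds.append((a, b, order))
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(MOLFILE)
print(name, len(atoms), "atoms,", sum(1 for b in bonds if b[2] == 2), "double bonds")
```

The three double bonds alternate with three single bonds around the ring: the file stores one Kekulé form of benzene, not an aromatic ring.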
Alanine SD file
Bond length is 1.53 Å
Atom line fields: x, y, z, symbol, mass difference, charge, stereo, H-count …
Charge codes:
0 = uncharged or value other than these
1 = +3, 2 = +2, 3 = +1
4 = doublet radical
5 = -1, 6 = -2, 7 = -3
Molecule is chiral
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
[Figure caption: which way round should this ring go?]
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION 1.80 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including xyz, occupancy and temperature factor
HETATM: non-protein atoms, including xyz, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
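The atom and bond records of a PDB file can be pulled out with a short Python sketch. This version splits on whitespace for brevity, which works for the records shown here; the real format is fixed-column, and fields can run together in large structures, so a production reader should slice by column. The function name is an assumption.

```python
PDB_TEXT = """ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
"""


def parse_pdb(text):
    """Return (atoms, conect) from ATOM/HETATM and CONECT records."""
    atoms, conect = [], {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("ATOM", "HETATM"):
            serial, atom_name, residue = int(fields[1]), fields[2], fields[3]
            x, y, z = (float(v) for v in fields[6:9])
            atoms.append((serial, atom_name, residue, x, y, z))
        elif fields[0] == "CONECT":
            # first number is the atom, the rest are the atoms bonded to it
            conect[int(fields[1])] = [int(v) for v in fields[2:]]
    return atoms, conect


atoms, conect = parse_pdb(PDB_TEXT)
print(len(atoms), "atoms; sulphur 1510 is bonded to", conect[1510])
```

Note that only the sulphate (a HETATM group) has CONECT records. Bonds within the protein itself are not listed and have to be inferred from the residue names and atom positions.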
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
– Early form of specification of a starting geometry for molecules: sometimes people used graph paper to draw the molecule and get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending (e.g. a carbonyl)
Non-bonded distance
[Figure: internal coordinates of an example molecule, with dummy atom (du) positions marked]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1 and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
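To show how internal coordinates map back to Cartesian ones, here is a minimal Python sketch for the three-atom water Z-matrix. The placement convention (first atom at the origin, second along z, third in the xz plane) is a common choice and an assumption here, as is the function name. Atoms that need a dihedral angle require more geometry than is shown.

```python
import math


def water_from_zmatrix(l1, a1_degrees):
    """Convert the water Z-matrix (O; H 1 l1; H 1 l1 2 a1) to xyz coordinates."""
    a1 = math.radians(a1_degrees)
    oxygen = (0.0, 0.0, 0.0)                  # atom 1 at the origin
    h_first = (0.0, 0.0, l1)                  # atom 2 along the z axis
    # atom 3: distance l1 from atom 1, making angle a1 with the bond to atom 2
    h_second = (l1 * math.sin(a1), 0.0, l1 * math.cos(a1))
    return oxygen, h_first, h_second


oxygen, h_first, h_second = water_from_zmatrix(0.96, 104.0)
print("H...H distance: %.3f" % math.dist(h_first, h_second))
```

Changing only `a1` bends the molecule while both O-H bond lengths stay fixed, which is exactly why internal coordinates are convenient for driving a reaction coordinate.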
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
[Figure caption: Z-matrix 'improper' torsion angles]
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future, much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928
Ethanol <CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
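As an illustration of how metadata travels with the structure, here is a small CML-style fragment for ethanol read with Python's standard XML parser. The fragment is a simplified sketch: the element and attribute names follow common CML usage, but the namespace declaration is omitted, the property block is hypothetical, and the boiling point is an approximate literature value.

```python
import xml.etree.ElementTree as ET

CML = """<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point">
    <scalar units="celsius">78.4</scalar>
  </property>
</molecule>"""

root = ET.fromstring(CML)
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}
bonds = [tuple(bond.get("atomRefs2").split()) for bond in root.iter("bond")]
scalar = root.find("property/scalar")

print(atoms)
print(bonds)
print(float(scalar.text), scalar.get("units"))  # the value carries its units with it
```

Because the units are stored next to the number, a program reading the file does not have to guess them, which is the kind of re-use the bullet points describe.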
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
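An InChI string is built from layers separated by "/", so it can be taken apart with ordinary string handling. A minimal Python sketch (the function name is an assumption, and only the leading version, formula and single-letter-prefixed layers are handled):

```python
def inchi_layers(inchi):
    """Split an InChI string into its version, formula and prefixed layers."""
    body = inchi.split("=", 1)[1]
    version, formula, *rest = body.split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        layers[layer[0]] = layer[1:]   # e.g. 'c' = connectivity, 'h' = hydrogens
    return layers


print(inchi_layers("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

For ethanol the connectivity layer `1-2-3` says atom 1 is bonded to atom 2 and atom 2 to atom 3, and the hydrogen layer says atom 3 carries one H, atom 2 two and atom 1 three.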
Molecules and 3-Dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
bull Obtaining the 3-dimensional coordinates can involve experiment (x-
ray electron or neutron diffraction eg the Cambridge
Crystallographic Database or the Protein Databank ndash PDB)
bull From these can be obtained atom positions bonds coordinates etc
bull There are a number of 3-D construction methods available such as
Corina or Concord (put in a SMILES and get a 3-D molecule) which
use rules derived from experiment
bull Molecules can also be constructed in 2D and subjected to molecular
mechanics or Quantum Mechanics calculations to obtain 3D structures
bull Conformation still remains to be deduced There are many methods
that deduce conformations usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot ndash but often
in many more torsional angle dimensions)
Molecules in 3D - uses
bull More accurate calculation of molecular
properties
bull Comparison of the shapes (conformations)
of molecules
bull Comparison of the dynamics of molecules
bull Calculation of bulk properties
bull Simulation of chemical reactionshelliphellip
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon)
Molecular Dynamics is the most popular method for larger systems Here is
an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a
series of snapshots of SD files concatenated together to make a movie just
like a film strip
httpsenwikipediaorgwikiCHARMM
httpambermdorg
httpwwwksuiuceduResearchnamd
httpwwwksuiuceduResearchvmd
We can simulate an entire biological receptor
and test new molecules to see how they might
work
bull Here is a simulation of an important new cancer target
called lsquoApelinrsquo with water and the membrane We used
these to design new molecules that block this receptor
Apelinrsquos role in Cancer
lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an
antagonist (a molecule that blocks the receptor) stop tumour growth
bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma
growth Brain 2017 awx253 httpsdoiorg101093brainawx253
bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et
al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372
So there are very many file formats
ndash which could be a real painbut
they can be convertedalc -- Alchemy file prep -- Amber PREP file
bs -- Ball amp Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical Resource
Kit 2D file
crk3d -- CRK3D Chemical Resource
Kit 3D file
box -- Dock 35 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI BiosymInsight II CAR
file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball amp Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table
file
cht -- Chemtool file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical
Resource Kit 2D file
crk3d -- CRK3D Chemical
Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 35 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats
-the bane of our lives
Interconnection program ndash Babel
Recent IUPAC moves towards a lsquostandardrsquo format however
In the near future there are likely to be many competing
requirements for file content
Molecules on computers ndash things to look out for
since what is stored is actually quite crude
For example-
Stereochemistry may be relative and not absolute or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution)
so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen
and oxygen get confused
Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved
correctly (problems with storing bonding)
Tautomers can be incorrect ndash check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of MWt and topology) are very difficult to store
Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics
halogen-aromatic bonds etc may be inferred but not observed in the file)
2 Finding molecules
A typical problem involves finding the
lsquorightrsquo molecules
All the molecules
synthesisable
1060 Molecules with
relevance to the problem
1010 (that we can generate)
Molecules we can
search efficiently 107
Molecules we can
make and test 103
Serious difference in numbers we can consider a few hundred
molecules in our heads computers can evaluate millions
1 Finding molecules using Substructure searching
One of the most useful lsquopropertiesrsquo of a molecule is its molecular graph
Many software systems allow searching for whole molecules or
fragments of molecules or even lsquosimilarrsquo molecules
For a specific molecule this means specifying the required search pattern to get an
exact match
An example might be Search for an exact match for dimethylaniline Here wersquove
used the PubChem database which contains compounds and pharmacological
screening data on 92 million unique molecules httppubchemncbinlmnihgov
A more complex example ndash find a molecule
containing our query as part of a larger molecule
Results ndash takes a few seconds to search all
92 million molecules
How does this workThe first thing to notice is that the searching is so fast that clearly all the database
is not being searched
The first part of the process is to pre-compute pointers to only the sets of
compounds in the database we are interested in
This is called Hashing
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
[Figure annotations on the bit string: Contains P; Contains NH2; Doesn't contain F]
So a quick match of the bitstring of the fragment searched against the (pre-computed) bitstrings in the database will eliminate most of the structures.
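The screening step can be sketched in a few lines of Python. This is only an illustration: the fragment names and their bit positions are made up for the example, and real systems use much longer keys.

```python
# Hypothetical fragment key: which bit (0-based) records which fragment.
# The positions are assumptions chosen for this example only.
FRAGMENT_BITS = {"P": 5, "NH2": 11, "F": 2}

def make_fingerprint(fragments):
    """Return an integer whose set bits record which fragments are present."""
    bits = 0
    for name in fragments:
        bits |= 1 << FRAGMENT_BITS[name]
    return bits

def passes_screen(query_bits, molecule_bits):
    """A molecule can only contain the query if it has every bit the query has."""
    return (query_bits & molecule_bits) == query_bits

# Pre-computed bitstrings for a toy database of three molecules.
database = {
    "mol_A": make_fingerprint(["P", "NH2"]),
    "mol_B": make_fingerprint(["F"]),
    "mol_C": make_fingerprint(["NH2"]),
}

query = make_fingerprint(["NH2"])
candidates = sorted(name for name, bits in database.items()
                    if passes_screen(query, bits))
print(candidates)
```

Only the molecules that survive this cheap bitwise test need to go on to the slow atom-by-atom matching.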
Next step - substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware of salts etc.: you may need to strip out counterions.
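As a rough illustration of stripping counterions: in SMILES, disconnected components are separated by a dot, so one crude approach is to keep the biggest component. The size measure used here (counting letters) is an assumption made for brevity; a real toolkit would count atoms properly.

```python
def largest_fragment(smiles):
    """Keep the dot-separated component with the most letters (a crude size proxy)."""
    parts = smiles.split(".")
    return max(parts, key=lambda part: sum(ch.isalpha() for ch in part))

# Dimethylaniline written together with a chloride component.
print(largest_fragment("CN(C)c1ccccc1.Cl"))
```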
Substructure search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
- -OH
- -NH2
- -COOH
- phenyl
• This can be done by tracing appropriate paths in the graph.
• Subgraphs may overlap.
[Figure: example structure with atom labels OH, CH2, CHNH2, OH, O]
Matching the query structure to the database
Two algorithms are commonly used:
the Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.
Note: a number of 'substructural' fragments can be matched here (6 in all).
Read these for more details:
Gund P, Ann. Rep. Med. Chem. 14, 299, 1979
Sheridan RP et al., J. Chem. Inf. Comput. Sci. 29, 255, 1989
Brint AT, Willett P, J. Mol. Graphics 5, 49, 1987
Ullmann JR, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM) 23, 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan HL, The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 107-113, 1965
The basic concept is that molecules are considered as graphs:
Atoms are 'nodes'.
Bonds are 'edges'.
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107-113, 1965
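Steps (i) to (iii) can be sketched as follows. This is a minimal illustration, not the Chemical Abstracts implementation: the starting label (10 times the atomic number plus the degree) is an assumption chosen so that atom type and degree both contribute, bond types are ignored, and step (ii) is read as "replace each label by the sum of its neighbours' labels".

```python
ATOMIC_NUMBER = {"C": 6, "N": 7, "O": 8}

def morgan_labels(elements, bonds):
    """elements: dict atom name -> element symbol; bonds: list of (atom, atom) pairs."""
    neighbours = {atom: [] for atom in elements}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Step (i): starting label from atomic number and degree (assumed combination).
    labels = {a: 10 * ATOMIC_NUMBER[elements[a]] + len(neighbours[a]) for a in elements}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): new label = sum of the labels of the adjacent atoms.
        updated = {a: sum(labels[n] for n in neighbours[a]) for a in elements}
        n_updated = len(set(updated.values()))
        # Step (iii): stop once the number of distinct labels no longer increases.
        if n_updated <= n_classes:
            return labels
        labels, n_classes = updated, n_updated

# Cyclohexanone heavy atoms; the atom names are arbitrary identifiers.
elements = {"O": "O", "C1": "C", "C2": "C", "C3": "C", "C4": "C", "C5": "C", "C6": "C"}
bonds = [("O", "C1"), ("C1", "C2"), ("C2", "C3"), ("C3", "C4"),
         ("C4", "C5"), ("C5", "C6"), ("C6", "C1")]
labels = morgan_labels(elements, bonds)
for atom in sorted(labels, key=labels.get, reverse=True):
    print(atom, labels[atom])
```

Atoms related by symmetry (here C2/C6 and C3/C5) end up with equal labels, which is why extra tie-breaking rules are needed before a unique numbering can be assigned.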
The first step - numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged.
• Numbering starts from the highest label, then uses the next highest connection as 2 etc., like an onion: concentric circles from the highest-numbered atom.
• Rules are used to break ties, such as C before N, double bond before single etc.
• Some equivalent atoms are arbitrarily numbered.
Adjacency matrices
Urea (atoms numbered O1, C2, C3, C4 on the slide):
     O1 C2 C3 C4
O1    0  0  0  1
C2    0  0  0  1
C3    0  0  0  1
C4    1  1  1  0
[Figure: cyclohexanone, atoms numbered O1 and C2-C7, with its 7 x 7 adjacency matrix. C4 has three neighbours, O1 has one, and every other carbon has two.]
Numbering is crucial: renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching ('subgraph isomorphism') and belongs to the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures.
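To make the matching step concrete, here is a small sketch that builds adjacency matrices and looks for a query graph inside a target graph. It is a plain backtracking search, not Ullmann's published procedure (which adds a refinement step to prune candidates early), and it ignores bond orders; the atom numbering in the example is arbitrary.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 matrix from a list of (i, j) bonded index pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix

def find_substructure(q_elems, q_adj, t_elems, t_adj):
    """Return a list mapping query atom index -> target atom index, or None."""
    mapping = []

    def extend():
        q = len(mapping)                 # next query atom to place
        if q == len(q_elems):
            return True                  # every query atom has a partner
        for t in range(len(t_elems)):
            if t in mapping or t_elems[t] != q_elems[q]:
                continue
            # Each query bond to an already-placed atom must exist in the target.
            if all(t_adj[t][mapping[p]] for p in range(q) if q_adj[q][p]):
                mapping.append(t)
                if extend():
                    return True
                mapping.pop()            # dead end: backtrack
        return False

    return list(mapping) if extend() else None

# Target: cyclohexanone heavy atoms (index 0 is O, index 1 the carbonyl carbon).
target_elems = ["O", "C", "C", "C", "C", "C", "C"]
target_adj = adjacency_matrix(7, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)])

# Query: a C-O fragment.
query_elems = ["C", "O"]
query_adj = adjacency_matrix(2, [(0, 1)])

print(find_substructure(query_elems, query_adj, target_elems, target_adj))
```

The worst-case cost of this search grows very quickly with molecule size, which is why the bitstring screen is applied first.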
Next lecture
• How can we use this type of information to solve our chemistry problems?
- Patents/Markush structures
- Molecular similarity
- Molecular properties
- 3D searching
- Virtual screening
- Structure-property/activity relationships
Cheminformatics 101
How do we store molecules on the computer?
There are estimated to be 10^60 possible small molecules that could be made. How do we find the best molecule for the problem we are addressing? Let's take a look "under the bonnet" at the way molecules are actually manipulated on the computer. You will be familiar with:
1. Trivial name, e.g. morphine
2. IUPAC name: (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol
However, these names do not convey the structure of molecules in a way the computer can readily understand. We need to convert these into "machine-readable formats", which allow ease of searching based on the complexities of molecular structure. But what is a molecule?
Bear this in mind: molecules are complicated. When we look at this scene we add a huge amount of information from our senses and knowledge, but nearly all of it gets lost in computational representation.
Representing chemistry needs to be engineered to represent materials and processes. As you will see, we are moving in that direction with more complete representations of molecules and materials.
[Figure caption: Not (5α,6α)-7,8-didehydro-4,5-epoxy-17-methylmorphinan-3,6-diol. A real-life mixture.]
What is a molecule?
Is it a series of connected points, a wave function, or the sum of its properties?
In the computer, molecules are therefore abstractions and interpretations of data. So more experimental data and an appropriate description of a molecule may translate to a wider reality.
Storing molecules: different methods for different purposes
Methods for storing molecules can conveniently be broken down into:
1-Dimensional (simple, very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D: line notations (a string of characters from a keyboard)
2D: molecular diagrams (graphs)
3D: graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the notes).
1-Dimensional - line notations: a string representation of molecules. Here are three examples of different line notations. SMILES is the most useful and widely used. These are 'strings', all of the same molecule.
• Line notations
- WLN
• L66J BMR& DSWQ IN1&1
- ROSDAL
• 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
- SMILES
• c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
- IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid
Notice: all these notations use just the characters on a standard typewriter keyboard.
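Because a line notation is just a string, simple text processing already gets us somewhere. The sketch below splits a SMILES string into tokens with a regular expression and counts the atoms written in it. It is a simplification, not a full SMILES parser: the pattern covers only the common organic-subset symbols, and implicit hydrogens are not counted.

```python
import re
from collections import Counter

# Bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms,
# ring-closure digits, then bond and branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d\d|\d|[-=#:/\\().]")

def tokenize(smiles):
    """Split a SMILES string into tokens, refusing anything the pattern misses."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in {smiles!r}")
    return tokens

def element_counts(smiles):
    """Count the atoms that are written explicitly in the string."""
    counts = Counter()
    for token in tokenize(smiles):
        if token.startswith("["):
            symbol = re.match(r"\[\d*([A-Za-z][a-z]?)", token).group(1)
        elif token[0].isalpha():
            symbol = token
        else:
            continue                      # bonds, branches, ring closures
        counts[symbol.capitalize()] += 1  # aromatic 'c' is counted as 'C'
    return counts

counts = element_counts("c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3")
print(dict(counts))
```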
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C
This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?
It's artemisinin, which has anti-malarial properties (Nobel Prize).
This language is called SMILES.
A SMILES tutorial is available at http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
You can practise by drawing a structure; a SMILES will be available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
InChI
• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by a human.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate InChI or convert from InChI to structure, from different names and formats.
Again, a string like this is easily matched on a computer.
[Figure: RSC ChemSpider]
Storing chemical diagrams on computers
• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges, and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• To solve this we could, e.g., introduce the concept of an aromatic bond.
An example of storing the molecular graph: the SD file format
MDL SD (structure data) files store the information about a molecule in the following format:
Header block
describes the molecule, e.g. its name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
properties, e.g. volume
Terminator line
a line containing four dollar signs ($$$$), indicating the end of information on this molecule
This is probably the most common format for storing small molecules: SD files are widely used in software to store many molecular structures in databases.
A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).
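The block structure described above can be illustrated with a short reader. This is a sketch under stated assumptions: it only handles data items written in the `> <NAME>` form, and the sample records (including the NOTE item and its value) are made up for the example.

```python
def split_sd_records(text):
    """Split SD file text into records (lists of lines) at the $$$$ terminator."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records

def data_items(record):
    """Collect data-block entries written as '> <NAME>' followed by value lines."""
    items, name = {}, None
    for line in record:
        if line.startswith(">") and "<" in line:
            name = line[line.index("<") + 1 : line.rindex(">")]
            items[name] = []
        elif name is not None and line.strip():
            items[name].append(line.strip())
        elif name is not None:
            name = None                  # a blank line closes the current item
    return {key: "\n".join(values) for key, values in items.items()}

# Two made-up records with empty connection tables, just to show the layout.
SAMPLE = """\
molecule one
  hypothetical program line
a comment
  0  0  0  0  0  0  0  0  0  0  1 V2000
M  END
> <NOTE>
made-up value

$$$$
molecule two
  hypothetical program line
a comment
  0  0  0  0  0  0  0  0  0  0  1 V2000
M  END
$$$$
"""

records = split_sd_records(SAMPLE)
print(len(records), data_items(records[0]))
```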
The Connection Table - describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.
e.g. PDB protein file format:
Connect 3 2 5 4
e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.
Example SD file - benzene
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Annotations:
Line 1: molecule name
Line 2: information on this molecule
Line 3: comment (e.g. the date is used here)
"Counts line": has the number of atoms and bonds as a minimum
"Atom block": has the xyz coordinates of the atom and the element as a minimum
"Bond block": one line for each bond (from atom, to atom, bond type)
"M END" and "$$$$": these identify the end of this molecule
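Reading the connection table back into atoms and bonds can be sketched as below. One simplification should be flagged: the V2000 layout is defined by fixed column positions, whereas this sketch splits each line on whitespace, which works for small molecules like this one but not in general.

```python
BENZENE_MOLFILE = """\
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
"""

def parse_molfile(text):
    """Return (name, atoms, bonds) from a small V2000 molfile."""
    lines = text.splitlines()
    name = lines[0].strip()
    counts = lines[3].split()                     # the counts line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for line in lines[4 : 4 + n_atoms]:           # the atom block
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for line in lines[4 + n_atoms : 4 + n_atoms + n_bonds]:   # the bond block
        first, second, order = (int(value) for value in line.split()[:3])
        bonds.append((first, second, order))
    return name, atoms, bonds

name, atoms, bonds = parse_molfile(BENZENE_MOLFILE)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

Note that atom numbers in the bond block start at 1, not 0.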
Alanine SD file
[Figure annotations on an alanine SD file]
Bond length is 1.53 Å.
Atom line fields: x, y, z, symbol, mass difference, charge, stereo, H-count, ...
Charge codes: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
Molecule is chiral.
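The charge codes listed above are easy to get wrong by eye, so a lookup is worth writing down. This small sketch just encodes that table; the function name and the (charge, is_radical) return shape are choices made for the example.

```python
# Atom-block charge code -> formal charge, as listed in the table above.
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 5: -1, 6: -2, 7: -3}

def decode_charge(code):
    """Return (formal_charge, is_doublet_radical) for an atom-block charge code."""
    if code == 4:
        return 0, True                   # code 4 marks a doublet radical
    return CHARGE_CODES[code], False

for code in range(8):
    print(code, decode_charge(code))
```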
What if we only know the atom positions and not the bonds?
A key example would be x-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an x-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
[Figure caption: Which way round should this ring go?]
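One common way to impute bonds from positions alone is a distance test: two atoms are taken to be bonded if they are closer than the sum of their covalent radii plus a margin. The sketch below uses that rule; the 0.45 Å margin is an assumed value, and the coordinates are the sulfate atoms from a crystal structure, used purely as sample input.

```python
import math

# Covalent radii in angstroms; TOLERANCE is an assumed margin.
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.45

def impute_bonds(atoms):
    """atoms: list of (element, (x, y, z)). Returns index pairs judged to be bonded."""
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (elem_i, xyz_i), (elem_j, xyz_j) = atoms[i], atoms[j]
            cutoff = COVALENT_RADIUS[elem_i] + COVALENT_RADIUS[elem_j] + TOLERANCE
            if math.dist(xyz_i, xyz_j) <= cutoff:
                bonds.append((i, j))
    return bonds

sulfate = [
    ("S", (25.993, -1.362, 5.893)),
    ("O", (25.504, -1.159, 4.508)),
    ("O", (27.282, -0.770, 6.155)),
    ("O", (24.953, -1.035, 6.850)),
]
print(impute_bonds(sulfate))
```

A distance test like this says nothing about bond order, which is one reason imputed structures need checking.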
Example Protein Data Bank (PDB) file
(uses an adjacency matrix)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION 1.80 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Annotations:
CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including xyz, occupancy and temperature factor
HETATM: non-protein atoms, including xyz, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
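PDB coordinate records are fixed-width: each field lives in set character columns, so a reader slices the line rather than splitting on spaces. The sketch below slices an ATOM or HETATM line using the standard PDB column positions; the dictionary keys are names chosen for this example.

```python
def parse_atom_line(line):
    """Slice one ATOM/HETATM record by column (0-based slices of the 80-column line)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),   # the temperature factor
    }

ATOM_LINE = "ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28"
HETATM_LINE = "HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34"

for line in (ATOM_LINE, HETATM_LINE):
    print(parse_atom_line(line))
```

This is exactly the "information in exactly the right place" point: shift a field by one column and the numbers come out wrong.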
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.
• Also called a Z-matrix
- Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
- An early way of specifying a starting geometry for molecules: sometimes graph paper was used to draw the molecule and get a starting set of coordinates before optimisation
- A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending, e.g. a carbonyl
Non-bonded distance
[Figure: internal coordinates, showing dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Figure annotations: z-matrix; 'improper' torsion angles]
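To see how internal coordinates turn back into Cartesian ones, here is the three-atom case worked for water, using l1 = 0.96 Å and a1 = 104.0 degrees. The placement convention (O at the origin, first H along z, second H in the xz plane) is a choice made for this sketch; any rigid rotation of it describes the same molecule. Atoms with a dihedral angle need more geometry and are not covered here.

```python
import math

def water_from_zmatrix(l1, a1_degrees):
    """Convert the water Z-matrix (one bond length, one angle) to Cartesian coordinates."""
    a1 = math.radians(a1_degrees)
    oxygen = (0.0, 0.0, 0.0)                                   # atom 1: at the origin
    hydrogen1 = (0.0, 0.0, l1)                                 # atom 2: distance l1 along z
    hydrogen2 = (l1 * math.sin(a1), 0.0, l1 * math.cos(a1))    # atom 3: angle a1 at atom 1
    return [("O", oxygen), ("H", hydrogen1), ("H", hydrogen2)]

atoms = water_from_zmatrix(0.96, 104.0)
for symbol, (x, y, z) in atoms:
    print(f"{symbol} {x:8.4f} {y:8.4f} {z:8.4f}")
```

Changing a1 moves only the second hydrogen, which is the appeal of internal coordinates when driving a single geometric variable.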
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
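As an illustration of the XML idea, the sketch below builds a small CML-style description of ethanol's heavy atoms with Python's standard XML library. The element and attribute names follow common CML usage (molecule, atomArray, atom, bondArray, bond), but the output has not been validated against the CML schema, so treat it as indicative only.

```python
import xml.etree.ElementTree as ET

def ethanol_cml():
    """Build a CML-style element tree for the heavy atoms of ethanol."""
    molecule = ET.Element("molecule", id="ethanol")
    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)
    bond_array = ET.SubElement(molecule, "bondArray")
    for first, second in [("a1", "a2"), ("a2", "a3")]:
        ET.SubElement(bond_array, "bond", atomRefs2=f"{first} {second}", order="1")
    return molecule

root = ethanol_cml()
print(ET.tostring(root, encoding="unicode"))
```

Because every value is wrapped in a named tag, extra information (units, dates, links to other data) can be added later without breaking programs that read only the parts they know.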
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (x-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor.
Apelin's role in cancer
'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats, which could be a real pain, but they can be converted:
alc -- Alchemy file prep -- Amber PREP file
bs -- Ball & Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball & Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file
cht -- Chemtool file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program: Babel
There are recent IUPAC moves towards a 'standard' format. However, in the near future there are likely to be many competing requirements for file content.
Molecules on computers: things to look out for, since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect.
In proteins, only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
Tautomers can be incorrect: check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of molecular weight and topology) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).
Bear this in mind Molecules are complicated When we look at this scene we add a
huge amount of information from our senses and knowledge ndash but it nearly all gets
lost in computational representation
Representing chemistry needs to be engineered to represent materials and processes
As you will see we are moving in that direction with more complete representations
of molecules and materials
Not (5α6α)-78-
didehydro-45-
epoxy-17-
methylmorphinan-
36-diol
A real life mixture
What is a molecule
is it a series of connected points
a wave function
the sum of its properties
In the computer molecules are therefore abstractions and interpretations of data
So more experimental data and an appropriate description of a molecule may
translate to a wider reality
Storing molecules: different methods for different purposes
Methods for storing molecules can conveniently be broken down into:
1-Dimensional (simple, very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D: Line notations (a string of characters from a keyboard)
2D: Molecular diagrams (graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the notes).
1-Dimensional: line notations – a string representation of molecules. Here are three
examples of different line notations; SMILES is the most useful and widely used. These
are ‘strings’, all of the same molecule.
• Line notations
 – WLN
   • L66J BMR& DSWQ IN1&1
 – ROSDAL
   • 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
 – SMILES
   • c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
 – IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid
Notice – all these notations use just the characters on a standard typewriter keyboard.
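Because a line notation is just a string, ordinary string tools can already pull information out of it. The sketch below is only an illustration: `count_atoms` is a hypothetical helper, not part of any SMILES toolkit, and it assumes the string contains only simple unbracketed atom symbols, as the SMILES example above does. A real SMILES parser has to handle far more (bracket atoms, charges, stereo, ring closures).

```python
import re
from collections import Counter

# Atom tokens: two-letter halogens first, then one-letter aliphatic
# (upper case) and aromatic (lower case) symbols. Bracket atoms are
# deliberately not handled in this sketch.
ATOM_TOKEN = re.compile(r"Br|Cl|[BCNOPSFI]|[bcnops]")

def count_atoms(smiles: str) -> Counter:
    """Count heavy atoms by element; aromatic 'c' is counted as 'C'."""
    counts = Counter()
    for token in ATOM_TOKEN.findall(smiles):
        counts[token.capitalize()] += 1
    return counts

smiles = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
atoms = count_atoms(smiles)
print(dict(atoms))           # element counts
print(sum(atoms.values()))   # total number of heavy atoms
```

The total of 24 heavy atoms agrees with the highest atom number used in the ROSDAL string for the same molecule.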
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C
This is an example of a ‘molecule’ and is what is actually stored in the
computer – does it make sense to you?
It’s artemisinin, which has anti-malarial properties (Nobel Prize).
This language is called SMILES.
A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
You can practise by drawing a structure; a SMILES will be available
by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
InChI
• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and materials
not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable
by humans.
• Importantly, it is commonly used as a unique chemical identifier: each
molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate InChI, or convert from InChI to structure,
from different names and formats.
Again, a string like this is easily matched on a computer.
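An InChI is built from layers separated by `/`, which is why it is hard to read but easy to compare. The following is a minimal sketch, assuming the standard InChI of ethanol and a hypothetical helper `split_inchi` (not part of the official InChI software). It shows the layered structure and that an identity check is plain string matching.

```python
def split_inchi(inchi: str) -> dict:
    """Split an InChI string into version, formula and prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *layers = inchi[len("InChI="):].split("/")
    result = {"version": version, "formula": formula}
    for layer in layers:
        # Each remaining layer starts with a one-letter prefix,
        # e.g. 'c' for connectivity and 'h' for hydrogens.
        result[layer[0]] = layer[1:]
    return result

ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
layers = split_inchi(ethanol)
print(layers)

# Exact-match lookup is a dictionary access keyed on the string itself.
database = {ethanol: "ethanol"}
print(database.get("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
```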
RSC ChemSpider
Storing chemical diagrams on computers
• The valence model of a molecule can be represented by a chemical graph. A simple
graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and crossings are
irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue.
Computers handle graphs very well, and molecules represented like this are examples
of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity,
stereochemistry, tautomerism, non-stoichiometric compounds etc. are often
problematic. The computer would (using a simple graph) deduce that these two
canonical structures are different molecules.
• E.g. to solve this we could introduce the concept of an aromatic bond.
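The aromaticity problem can be shown with a very small labelled graph. This sketch assumes the two structures in question are the two Kekulé drawings of benzene; `ring_bonds` and `aromatise` are hypothetical helper names chosen for illustration.

```python
def ring_bonds(orders):
    """Six-membered ring as a labelled graph: {frozenset({i, j}): bond label}."""
    return {frozenset({i, (i + 1) % 6}): order for i, order in enumerate(orders)}

atoms = {i: "C" for i in range(6)}          # node labels (all carbon)

# Two Kekulé drawings: same atoms, double bonds shifted by one position.
kekule_a = ring_bonds([2, 1, 2, 1, 2, 1])
kekule_b = ring_bonds([1, 2, 1, 2, 1, 2])
print(kekule_a == kekule_b)                 # the simple graphs are not equal

def aromatise(bonds):
    """Relabel every ring bond as 'aromatic' (valid here: one alternating ring)."""
    return {pair: "aromatic" for pair in bonds}

print(aromatise(kekule_a) == aromatise(kekule_b))   # now they compare equal
```

With a single 'aromatic' edge label the two drawings collapse to one graph, which is the idea behind the aromatic bond type.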
An example of storing the molecular graph: SD file format
MDL SD (structure data) files store the information about each molecule in the
following format:
Header block
  describes the molecule, e.g. its name
Connection table
  defines the molecular structure (atoms and bonds)
Data block (optional)
  properties, e.g. volume
Terminator line
  a line containing four dollar signs ($$$$),
  indicating the end of information on this molecule
This is probably the most common format for storing small molecules – SD files are
widely used in software to store many molecular structures in databases.
A “format” in computer science is a precisely described order of data. The program
reading it expects the information in exactly the right place (or it screws up).
The Connection Table – describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule
as a labelled graph.
e.g. PDB protein file format:
Connect 3 2 5 4
e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a molecule that
matches a chemical structure search.
• It is often the case that we don’t know the conformation (shape) of a molecule –
so storing it in 3D would be pointless. Look at the changes in conformation in
this molecule at room temperature.
Example SD file – benzene
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Header lines: molecule name / information on this molecule
Comment line (e.g. the date is used here)
“Counts line”: has the number of atoms and bonds as a minimum
“Atom block”: has the xyz coordinates of the atom and the element as a minimum
“Bond block”: one line for each bond: from atom, to atom, bond type
M END and $$$$: these identify the end of this molecule
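To make the layout concrete, here is a minimal reader for the benzene record. It is a sketch under stated assumptions: `parse_molblock` is a hypothetical helper, the decimal points in the coordinates are assumed as shown, and fields are split on whitespace. The real V2000 format uses fixed-width columns, so this shortcut fails once a count reaches three digits.

```python
MOLBLOCK = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$"""

def parse_molblock(text):
    """Return (name, atoms, bonds) from a single V2000 record."""
    lines = text.splitlines()
    name = lines[0].strip()                      # header line 1: molecule name
    counts = lines[3].split()                    # counts line: atoms, bonds, ...
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:            # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:   # bond block
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))
    return name, atoms, bonds

name, atoms, bonds = parse_molblock(MOLBLOCK)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

Note that the file stores benzene as one Kekulé structure: three single and three double bonds.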
Alanine SD file
Bond length is 1.53 Å
Atom line fields: x y z, symbol, mass difference, charge, stereo, H-count …
Charge: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1;
4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
Molecule is chiral
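The charge field is a code, not the charge itself, so a reader has to translate it. A short sketch of that lookup, using only the code table listed above; `decode_charge` is a hypothetical helper name.

```python
# Atom-block charge codes as listed above (code 4 is handled separately).
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 5: -1, 6: -2, 7: -3}

def decode_charge(code: int):
    """Return (formal_charge, is_doublet_radical) for an atom-block charge code."""
    if code == 4:
        return 0, True
    return CHARGE_CODES[code], False

for code in range(8):
    print(code, decode_charge(code))
```

For the non-zero, non-radical codes the charge is simply 4 minus the code.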
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the
electrons but not the nuclei. In a small molecule this is very accurate and we
can almost see the bonds. However, in e.g. a protein structure, which is very
large, we can’t always determine the atom positions exactly, so these are
stored and the bonds imputed. The format for storing these is a bit different.
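One common way to impute bonds from positions alone is a distance test: two atoms are taken as bonded if they are closer than the sum of their covalent radii plus a tolerance. The sketch below assumes approximate radii and a tolerance of 0.45 Å (both illustrative, adjustable values). The coordinates are those of the valine atoms in the PDB example in these notes, with decimal points assumed.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (assumed values, for illustration).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.45  # extra slack in angstroms; an assumed, adjustable value

def infer_bonds(atoms):
    """atoms: list of (name, element, (x, y, z)). Returns bonded name pairs."""
    bonds = []
    for (n1, e1, p1), (n2, e2, p2) in combinations(atoms, 2):
        cutoff = COVALENT_RADIUS[e1] + COVALENT_RADIUS[e2] + TOLERANCE
        if math.dist(p1, p2) <= cutoff:
            bonds.append((n1, n2))
    return bonds

valine = [
    ("N",  "N", (3.036, -1.035, -3.538)),
    ("CA", "C", (4.283, -0.343, -3.015)),
    ("C",  "C", (4.565,  1.067, -3.593)),
    ("O",  "O", (4.653,  1.287, -4.853)),
    ("CB", "C", (5.510, -1.245, -3.141)),
]
bonds = infer_bonds(valine)
print(bonds)
```

A distance test finds which atoms are connected but says nothing about bond order, which is one reason imputed bonding is not always correct.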
Which way round should this ring go?
Example Protein Data Bank (PDB) file
(bonds are listed per atom in CONECT records, i.e. as an adjacency list)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH                   : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
CRYST1: space group and unit cell dimensions
ATOM: protein atoms – including xyz, occupancy and temperature factor
HETATM: non-protein atoms – including xyz, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
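The record types above can be read with a few lines of code. This is a hedged sketch: `parse_pdb` is a hypothetical helper, the sample lines are taken from the example with decimal points assumed, and fields are split on whitespace for brevity. The PDB format is defined by fixed columns, and production parsers slice by column because neighbouring fields can run together.

```python
PDB_TEXT = """\
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
"""

def parse_pdb(text):
    """Return (atoms, bonds): atoms keyed by serial number, bonds as a set of pairs."""
    atoms, bonds = {}, set()
    for line in text.splitlines():
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            (serial, name, residue, chain, resseq,
             x, y, z, occupancy, b_factor) = line[6:].split()[:10]
            atoms[int(serial)] = {
                "name": name, "residue": residue, "chain": chain,
                "resseq": int(resseq),
                "xyz": (float(x), float(y), float(z)),
                "occupancy": float(occupancy),
                "b_factor": float(b_factor),
            }
        elif record == "CONECT":
            # First number is the atom, the rest are its bonded neighbours.
            first, *others = (int(v) for v in line[6:].split())
            bonds.update(frozenset((first, other)) for other in others)
    return atoms, bonds

atoms, bonds = parse_pdb(PDB_TEXT)
print(len(atoms), "atoms;", len(bonds), "distinct bonds")
```

Storing each bond as an unordered pair means the reverse entries (`CONECT 1511 1510`) do not create duplicates.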
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural
angles and distances.
This uses internal coordinates
• Also called a Z-matrix
 – Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)
 – Early form of specification of a starting geometry for molecules – people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
 – A Z-matrix uses the following geometric descriptions to describe molecules:
   Bond length
   Bond angle
   Torsion angle
   Out-of-plane bending, e.g. a carbonyl
   Non-bonded distance
[Figure: internal coordinates of example molecules, showing dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a
variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a
variable to describe the distance between the current atom and NA, the atom number
NB, and the name of a variable to describe the angle between the current atom, NA
and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a
variable to describe the distance between the current atom and NA, the atom number
NB, the name of a variable to describe the angle between the current atom, NA and
NB, the atom number of another previously defined atom NC, and finally the name of
a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for
each variable.
7. In some cases, where some of the variables are to be fixed as constants in a
geometry optimisation, they are listed here after a blank line, rather than above
with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
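Internal coordinates have to be turned back into Cartesian coordinates before a structure can be drawn or written to an SD or PDB file. The sketch below shows one way to do that for the water Z-matrix (bond length 0.96 Å, angle 104.0°). `zmatrix_to_cartesian` is a hypothetical helper; the placement of the first atoms (origin, then along z) and the sign convention for the dihedral are assumptions of this sketch, not a statement about how Gaussian does it internally.

```python
import math

def _sub(a, b): return tuple(x - y for x, y in zip(a, b))
def _add(a, b): return tuple(x + y for x, y in zip(a, b))
def _scale(a, s): return tuple(x * s for x in a)
def _dot(a, b): return sum(x * y for x, y in zip(a, b))
def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])
def _unit(a): return _scale(a, 1.0 / math.sqrt(_dot(a, a)))

def zmatrix_to_cartesian(rows):
    """rows: (symbol,), (symbol, NA, r), (symbol, NA, r, NB, angle) or
    (symbol, NA, r, NB, angle, NC, dihedral); atom numbers are 1-based,
    angles in degrees. Returns a list of (x, y, z) tuples."""
    coords = []
    for row in rows:
        if len(row) == 1:                       # first atom: at the origin
            coords.append((0.0, 0.0, 0.0))
            continue
        if len(row) == 3:                       # second atom: along +z
            coords.append((0.0, 0.0, row[2]))
            continue
        a = coords[row[1] - 1]                  # atom NA (bonded to new atom)
        b = coords[row[3] - 1]                  # atom NB (defines the angle)
        r, theta = row[2], math.radians(row[4])
        v1 = _unit(_sub(b, a))
        if len(row) == 5:                       # third atom: place in the xz plane
            m, n, phi = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), 0.0
        else:                                   # later atoms: use the dihedral
            c = coords[row[5] - 1]
            phi = math.radians(row[6])
            bc = _sub(c, b)
            m = _unit(_sub(bc, _scale(v1, _dot(bc, v1))))
            n = _cross(m, v1)
        direction = _add(_scale(v1, math.cos(theta)),
                         _add(_scale(m, math.sin(theta) * math.cos(phi)),
                              _scale(n, math.sin(theta) * math.sin(phi))))
        coords.append(_add(a, _scale(direction, r)))
    return coords

water = [("O",), ("H", 1, 0.96), ("H", 1, 0.96, 2, 104.0)]
xyz = zmatrix_to_cartesian(water)
for (symbol, *_), (x, y, z) in zip(water, xyz):
    print(f"{symbol} {x:8.4f} {y:8.4f} {z:8.4f}")
```

Changing a single value (for example the angle) and converting again moves the atoms consistently, which is what makes internal coordinates convenient for driving a reaction coordinate.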
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
‘Improper’ torsion angles
What about comprehensive properties of molecules – they are more than xyz
XML and molecules
• XML is a computer language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
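To show what 'can be parsed' means in practice, here is a small CML-style fragment for ethanol read with Python's standard XML library. The fragment is simplified and assumed for illustration: the element and attribute names follow common CML usage (`molecule`, `atomArray`, `bondArray`, `atomRefs2`), but it has not been validated against the CML schema, and the units label is a placeholder.

```python
import xml.etree.ElementTree as ET

# Simplified, CML-style description of ethanol (heavy atoms only).
CML_TEXT = """
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point">
    <scalar units="degC">78.4</scalar>
  </property>
</molecule>
"""

root = ET.fromstring(CML_TEXT)
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}
bonds = [tuple(bond.get("atomRefs2").split()) for bond in root.iter("bond")]
scalar = root.find("property/scalar")

print(atoms)
print(bonds)
# The metadata (title, units) travels with the value itself.
print(root.find("property").get("title"), scalar.text, scalar.get("units"))
```

Unlike an SD file, nothing here depends on column positions: the reader finds each item by its tag, and new kinds of data can be added without breaking older readers.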
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat
because of thermal fluctuations. So we represent 3D molecules by including their
coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or
neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein
Data Bank – PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or
Concord (put in a SMILES and get a 3-D molecule), which use rules derived from
experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or
quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce
conformations, usually involving torsional angle rotation to scan the
conformational space (like a Ramachandran plot – but often in many more torsional
angle dimensions).
Molecules in 3D – uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions …
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon).
Molecular dynamics is the most popular method for larger systems. Here is
an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a
series of snapshots of SD files concatenated together to make a movie, just
like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
We can simulate an entire biological receptor and test new molecules to see how
they might work
• Here is a simulation of an important new cancer target called ‘Apelin’, with
water and the membrane. We used these to design new molecules that block this
receptor.
Apelin’s role in cancer
‘Apelin’ (a peptide receptor) is involved in cancer – therefore we designed an
antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs
glioblastoma growth, Brain, 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A
Patent Review, Recent Patents on Anti-Cancer Drug Discovery, 2011, 6(3), 367-372
So there are very many file formats – which could be a real pain, but they can be
converted
alc -- Alchemy file                          prep -- Amber PREP file
bs -- Ball & Stick file                      caccrt -- Cacao Cartesian file
ccc -- CCC file                              c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file              cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                     dmol -- DMol3 Coordinates file
feat -- Feature file                         gam -- GAMESS Output file
gamout -- GAMESS Output file                 gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                    qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                    jout -- Jaguar Output file
bin -- OpenEye Binary file                   mmd -- MacroModel file
mmod -- MacroModel file                      out -- MacroModel file
dat -- MacroModel file                       car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                     sd -- MDL Isis SDF file
mdl -- MDL Molfile file                      mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file               mopout -- MOPAC Output file
mmads -- MMADS file                          mpqc -- MPQC file
bgf -- MSI BGF file                          nwo -- NWChem Output file
pdb -- PDB file                              ent -- PDB file
pqs -- PQS file                              qcout -- Q-Chem Output file
res -- ShelX file                            ins -- ShelX file
smi -- SMILES file                           mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                   vmol -- ViewMol file

alc -- Alchemy file                          bs -- Ball & Stick file
caccrt -- Cacao Cartesian file               cacint -- Cacao Internal file
cache -- CAChe MolStruct file                c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file              ct -- ChemDraw Connection Table file
cht -- Chemtool file                         cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                        box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file               feat -- Feature file
fh -- Fenske-Hall Z-Matrix file              gamin -- GAMESS Input file
inp -- GAMESS Input file                     gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                   gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                   gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                    jin -- Jaguar Input file
bin -- OpenEye Binary file                   mmd -- MacroModel file
mmod -- MacroModel file                      out -- MacroModel file
dat -- MacroModel file                       sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                      mdl -- MDL Molfile file
mol -- MDL Molfile file                      mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                          bgf -- MSI BGF file
csr -- MSI Quanta CSR file                   nw -- NWChem Input file
pdb -- PDB file                              ent -- PDB file
pov -- POV-Ray Output file                   pqs -- PQS file
report -- Report file                        qcin -- Q-Chem Input file
smi -- SMILES file                           fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                      txyz -- Tinker XYZ file
txt -- Titles file                           unixyz -- UniChem XYZ file
vmol -- ViewMol file                         xed -- XED file
xyz -- XYZ file                              zin -- ZINDO Input file
File formats – the bane of our lives
Interconversion program – Babel
There are recent IUPAC moves towards a ‘standard’ format; however, in the near
future there are likely to be many competing requirements for file content.
Molecules on computers – things to look out for, since what is stored is actually
quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect.
In proteins only the HEAVY atom positions are observed (sometimes at low resolution),
so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen
and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved
correctly (problems with storing bonding).
Tautomers can be incorrect – check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of MWt and topology) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics,
halogen–aromatic bonds etc. may be inferred but are not observed in the file).
2 Finding molecules
A typical problem involves finding the ‘right’ molecules
All the molecules synthesisable: 10^60
Molecules with relevance to the problem: 10^10 (that we can generate)
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred molecules in our
heads; computers can evaluate millions.
1 Finding molecules using substructure searching
One of the most useful ‘properties’ of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, or
fragments of molecules, or even ‘similar’ molecules.
For a specific molecule this means specifying the required search pattern to get an
exact match.
An example might be: search for an exact match for dimethylaniline. Here we’ve
used the PubChem database, which contains compounds and pharmacological
screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example – find a molecule containing our query as part of a larger
molecule.
Results – it takes a few seconds to search all 92 million molecules.
How does this work?
The first thing to notice is that the searching is so fast that clearly not all of
the database is being searched.
The first part of the process is to pre-compute pointers to only the sets of
compounds in the database we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a
molecule, e.g.
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
 0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
Contains P
Contains NH2
Doesn’t contain F
So a quick match of the bitstring of the fragment searched over the
(pre-computed) bitstrings in the database will eliminate most of the
structures.
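The bitstring screen can be sketched in a few lines. Everything named here is hypothetical: the fragment dictionary, the bit positions and the three database entries are made up for illustration (the positions for P and NH2 are chosen so that the first molecule reproduces the 16-bit pattern above).

```python
# Hypothetical fragment key: fragment name -> bit position (0-based).
FRAGMENT_BITS = {"OH": 0, "F": 2, "P": 5, "phenyl": 8, "NH2": 11}

def fingerprint(fragments):
    """Set one bit for every fragment present in the molecule."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << FRAGMENT_BITS[fragment]
    return bits

def may_contain(mol_bits, query_bits):
    """Screen: every bit set in the query must also be set in the molecule."""
    return (mol_bits & query_bits) == query_bits

# Pre-computed once, when molecules are registered in the database.
database = {
    "mol_A": fingerprint(["P", "NH2"]),
    "mol_B": fingerprint(["OH", "phenyl"]),
    "mol_C": fingerprint(["NH2", "phenyl", "F"]),
}

query = fingerprint(["NH2", "phenyl"])
candidates = [name for name, bits in database.items() if may_contain(bits, query)]

print(format(database["mol_A"], "016b")[::-1])   # position 1 printed first
print(candidates)
```

The screen can only rule molecules out. Anything that passes still has to go through the slower atom-by-atom match, because sharing the fragments does not prove that the query is present as one connected substructure.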
Next step – substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware: salts etc. – you may need to strip out counterions.
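A minimal sketch of the exact-match step, including the salt problem. It assumes that the database keys and the query are written with the same canonical SMILES convention (otherwise identical molecules will not match as strings) and uses a crude rule, keeping the longest dot-separated component, to drop counterions. `strip_counterions` and the one-entry registry are hypothetical.

```python
def strip_counterions(smiles: str) -> str:
    """Keep the largest '.'-separated component (string length as a crude size proxy)."""
    return max(smiles.split("."), key=len)

# Registry keyed by canonical SMILES (assumed to be canonicalised already).
registry = {"CN(C)c1ccccc1": "N,N-dimethylaniline"}

query = "CN(C)c1ccccc1.Cl"            # hydrochloride salt written as two components
parent = strip_counterions(query)

print(registry.get(query))            # the salt string itself is not found
print(registry.get(parent))           # the parent molecule is
```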
Substructure search
Suppose we wish to find part of a molecule in a database.
To do this we have to do a substructure search.
A substructure is a subgraph of the molecular graph. This is
an example of substructural fragments of a larger molecule
shown in different colours. You could, e.g., look for all the
molecules with the phenol fragment.
There are two steps commonly used to do this: firstly, we
have to number all the molecules consistently (the Morgan
algorithm), for example to compare two structures to see if
they are identical; then we have to match the query
substructure to each substructure in the molecule (the
Ullmann algorithm).
Substructures in molecules
• ‘Subgraphs’ can be identified in a structure graph, corresponding to fragments
of the whole structure, e.g. rings, substituents etc.
 – –OH
 – –NH2
 – –COOH
 – phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap
[Figure: example molecule with its substructures marked (labels: OH, CH2, CHNH2, OH, O)]
Matching the query structure to the database
Two algorithms are commonly used:
the Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.
Note – a number of ‘substructural’ fragments can be matched here (6 in all).
Read these for more details:
Gund, P., Ann. Rep. in Med. Chem. 14, 299, 1979
Sheridan, R. P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989
Brint, A. T., Willett, P., J. Mol. Graphics 5, 49, 1987
J. R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for
Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan, H. L., The generation of a unique machine description for chemical structures,
Journal of Chemical Documentation 5, 1965, 107-113
The basic concept is that molecules are considered as graphs:
Atoms are ‘nodes’
Bonds are ‘edges’
A molecule is an example of a labelled graph: nodes can
have labels such as atomic number. The nodes are
connected by the edges (and these can also be labelled,
e.g. double bond).
The Morgan algorithm iteratively computes an integer label
for each node (atom) in a structure. For a computer, the key
point is that the numbering is consistent (unlike humans) and
fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its
atomic number, degree (number of substituents) and types of
bonds;
(ii) Update each label by adding the labels of the adjacent
nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not
increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures –
a technique developed at Chemical Abstracts Service, J. Chemical Documentation
5, 107–113, 1965
The first step – numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest, then uses the next highest connection as 2, etc. Like an
onion: concentric circles from the highest numbered
• Rules are used to break equalities, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered
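Steps (i) to (iii) can be sketched as follows. This is a simplified reading, not Morgan's full procedure: the starting label is the degree only (atomic number and bond types are ignored), step (ii) is taken to mean 'own label plus the labels of the neighbours', and no tie-breaking rules are applied. The test molecule is cyclohexanone with an assumed atom numbering (O1 on C4; ring C4-C2-C5-C6-C7-C3).

```python
def morgan_labels(neighbours):
    """neighbours: {atom: [adjacent atoms]}. Returns {atom: integer label}."""
    # Step (i): start from the degree of each atom (simplification).
    labels = {atom: len(adjacent) for atom, adjacent in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): add the labels of the connected atoms to each label.
        updated = {atom: labels[atom] + sum(labels[n] for n in adjacent)
                   for atom, adjacent in neighbours.items()}
        n_updated = len(set(updated.values()))
        # Step (iii): stop when the number of different labels stops increasing.
        if n_updated <= n_classes:
            return labels
        labels, n_classes = updated, n_updated

cyclohexanone = {
    "O1": ["C4"],
    "C2": ["C4", "C5"],
    "C3": ["C4", "C7"],
    "C4": ["O1", "C2", "C3"],
    "C5": ["C2", "C6"],
    "C6": ["C5", "C7"],
    "C7": ["C6", "C3"],
}

labels = morgan_labels(cyclohexanone)
ranking = sorted(labels, key=labels.get, reverse=True)
print(labels)
print(ranking)   # numbering would start from the highest label
```

Atoms that end with the same label (C2/C3 and C5/C7) are the symmetry-equivalent ones; this is where the extra rules, or an arbitrary choice, are needed to finish the numbering.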
Urea
[Figure: query fragment with atoms numbered O1, C2, C3, C4]
Adjacency matrix
     O1 C2 C3 C4
O1    0  0  0  1
C2    0  0  0  1
C3    0  0  0  1
C4    1  1  1  0
[Figure: cyclohexanone with atoms numbered O1, C2, C3, C4, C5, C6, C7]
Cyclohexanone adjacency matrix
     O1 C2 C3 C4 C5 C6 C7
O1    0  0  0  1  0  0  0
C2    0  0  0  1  1  0  0
C3    0  0  0  1  0  0  1
C4    1  1  1  0  0  0  0
C5    0  1  0  0  0  1  0
C6    0  0  0  0  1  0  1
C7    0  0  1  0  0  1  0
Numbering is crucial. Renumber for different starting points. Also,
the ‘edges’ of the graph are annotated to speed up the matching. This is the ‘slow’ part
of ‘matching’ (subgraph isomorphism) and is in the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures.
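The matching step can be illustrated with adjacency matrices and a plain backtracking search. This is a sketch of subgraph matching in general, not Ullmann's algorithm itself (it omits his refinement procedure, which prunes the search much more aggressively). The bond lists assume the numbering used for the two matrices here: the four-atom query has O1, C2 and C3 attached to C4, and the cyclohexanone ring order is assumed to be C4-C2-C5-C6-C7-C3.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 matrix from 1-based (i, j) bond pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i - 1][j - 1] = matrix[j - 1][i - 1] = 1
    return matrix

def find_matches(q_labels, q_adj, t_labels, t_adj):
    """Return all mappings (target index for each query atom, 0-based)."""
    matches = []

    def extend(mapping):
        k = len(mapping)                       # next query atom to place
        if k == len(q_labels):
            matches.append(tuple(mapping))
            return
        for t in range(len(t_labels)):
            if t in mapping or t_labels[t] != q_labels[k]:
                continue
            # Every query bond to an already placed atom must exist in the target.
            if all(t_adj[t][mapping[j]] for j in range(k) if q_adj[k][j]):
                extend(mapping + [t])

    extend([])
    return matches

query_labels = ["O", "C", "C", "C"]
query_adj = adjacency_matrix(4, [(1, 4), (2, 4), (3, 4)])

target_labels = ["O", "C", "C", "C", "C", "C", "C"]
target_adj = adjacency_matrix(
    7, [(1, 4), (4, 2), (4, 3), (2, 5), (5, 6), (6, 7), (7, 3)])

matches = find_matches(query_labels, query_adj, target_labels, target_adj)
for row in query_adj:
    print(row)
print(matches)
```

Two mappings are found because the two carbons next to the carbonyl are interchangeable. The search tries many dead ends before finding them, which is a small-scale view of why this step is the slow part and why the bitstring screen is applied first.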
What is a molecule
is it a series of connected points
a wave function
the sum of its properties
In the computer molecules are therefore abstractions and interpretations of data
So more experimental data and an appropriate description of a molecule may
translate to a wider reality
Storing molecules different methods for different purposes
Methods for storing molecules can conveniently be broken down
into
1-Dimensional (simple and very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D Line notations (a string of characters from a keyboard)
2D Molecular Diagrams (Graphs)
3D Graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the
notes)
1-Dimensional - Line notations ndash a string representation of
molecules- here are three examples of different line notations
SMILES is the most useful and widely used These are
lsquostringsrsquo all of the same molecule
bull Line Notationsndash WLN
bull L66J BMRamp DSWQ IN1amp1
ndash ROSDAL
bull 1=-5-=10=510-11-11N-12-
=17=123-18S-
19O18=20O18=21O8-22N-
2322-24
ndash SMILES
bull c1ccccc1Nc2cc(S(=O)(=O)O)cc3c
2cc(N(C)C)cc3
ndash IUPAC 6-dimethylamino-4-
phenylamino-naphthalene-2-sulphonic
acid
Notice ndash all these notations use just the characters on a standard
typewriter keyboard
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3
)([H])[C]41[C]3([H])[CH]2C
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3
)([H])[C]41[C]3([H])[CH]2C
Itrsquos Artemesinin
Which has
anti-malarial
properties
(Nobel Prize)
This language is called SMILES
This is an example of a lsquomoleculersquo and is what is actually stored in the
computer ndash does it make sense to you
A SMILES tutorial is available at
httpwwwdaylightcomdayhtml_tutorialslanguagessmilesindexhtml
You can practice by drawing a structure and a smiles will be available
by picking the smiley face
httpwwwmolinspirationcomcgi-binpropertiestextMode=1
InChibull A more recent line notation is called
InChI
bull This will address some of the problems
of SMILES eg polymers and materials
not covered by SMILES
bull InChI is generated using computer
algorithms and is virtually un-
interpretable by a humans
bull Importantly it is commonly used as a
unique chemical identifier - each
molecule should theoretically have a
unique InChI One molecule one InChi
bull Websites are available that can generate
InChIconvert from InChi to structure
from different names and formats
Again a string like this is easily
matched on a computer
RSC ChemSpider
Storing chemical diagrams on computers
bull The valence model of a molecule can be represented by a
chemical graph A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes
bull The spacial position of the nodes length of the edges and
crossings are irrelevant Generally we ignore hydrogens unless
tautomerism or pKa is an issue Computers handle graphs very
well and molecules represented like this are examples of labelled
graphs (the atoms have names eg Oxygen)
bull Chemical structures are of course more complex than this and
aromaticity stereochemistry tautomerism non-stoichimetric
compounds etc are often problematic The computer would
(using a simple graph) deduce these two canonical structures are
different molecules
bull Eg to solve this we could introduce the concept of an aromatic
bond
An example of storing the molecular
graph SD file format
MDL SD (structure data) format files contain the following information and the
information about a molecule is stored in the following format
Header Block
describes the molecule eg itrsquos name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
Properties eg volume
Terminator line
a line containing four dollar signs ($$$$)
indicating the end of information on this molecule
This is probably the most common format to store small molecules ndash SD files in
software are widely used to store many molecule structures in databases
A ldquoformatrdquo in computer science is a precisely described order of data The program
reading it expects the information in exactly the right place (or it screws up)
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
eg PDB protein file format
eg SD file format
1
1
1
1
1
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
The Connection Table ndash describing bonds
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced)
eg PDB protein file format
eg SD file format
The Connection Table ndash describing bonds
Why store molecules in 2Dbull Quite often we only need the
chemical diagram eg to find a
molecule that matches a
chemical structure search
bull It is often the case that we donrsquot
know the conformation (shape)
of a molecule ndash so storing it in
3D would be pointless Look at
the changes in conformation in
this molecule at room
temperature
Example SD file - benzene
bull benzene
bull ACDLabs0812062058
bull August 2013
bull 6 6 0 0 0 0 0 0 0 0 1 V2000
bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 2 1 1 0 0 0 0
bull 3 1 2 0 0 0 0
bull 4 2 2 0 0 0 0
bull 5 3 1 0 0 0 0
bull 6 4 1 0 0 0 0
bull 6 5 2 0 0 0 0
bull M END
bull $$$$
Molecule nameInformation on this molecule
Comment (eg date is used here)
ldquocounts linerdquo has the
number of atoms and
bonds as a minimum
ldquoatom blockrdquo has xyz
coordinates of the atom and
element as a minimum
ldquobond blockrdquo 1-line for each bond
from atom - to atom ndash bond type
These identify the end of this molecule
Alanine SD file
Bond length is 153A
xyz symbol mass diff charge stereo h-counthellip
Charge
0 = uncharged or value
other than
these 1 = +3 2 = +2 3 =
+1
4 = doublet radical
5 = -1 6 = -2 7
= -3
Molecule is chiral
What if we only know the atom positions and not the bonds
A key example would be x-ray crystallography Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge
Here is an electron density map from an x-ray experiment We see the
electrons but not the nuclei In a small molecule this is very accurate and we
can almost see the bonds However in eg a protein structure which is very
large we canrsquot always determine the atom positions exactly so these are
stored and the bonds imputed The format for storing these is a bit different
Which way
round should
this ring go
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH                   : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Space group and unit cell
dimensions
Protein atoms, including xyz,
occupancy and temperature factor
Non-protein atoms, including xyz,
occupancy and temperature factor
Bonds
New format: mmCIF
What if we want to vary the atom
positions, e.g. in driving a
reaction coordinate?
Using Cartesian (xyz)
coordinates is very
cumbersome, so instead
we use the natural angles
and distances.
This uses internal coordinates
• Also called a Z-matrix
– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
– An early way of specifying a starting geometry for molecules: sometimes the molecule was drawn on graph paper to get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending, e.g. a carbonyl
Non-bonded distance
[Figure: a molecule described in internal coordinates, showing dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol
only.
2. For the second atom, give the atomic symbol, the number
1, and the name of a variable to describe the distance
between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom
number NA, the name of a variable to describe the distance
between the current atom and NA, the atom number NB, and
the name of a variable to describe the angle between the
current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom
number NA, the name of a variable to describe the distance
between the current atom and NA, the atom number NB, the
name of a variable to describe the angle between the current
atom, NA and NB, the atom number of another previously
defined atom NC, and finally the name of a variable to
describe the dihedral angle between the current atom, NA,
NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a
separate line for each variable.
7. In some cases, where some of the variables are to be fixed
as constants in a geometry optimisation, they are listed here
after a blank line, rather than above
with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
Z-matrix
'improper' torsion angles
What about comprehensive
properties of molecules? They
are more than xyz.
XML and molecules
• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928
Ethanol
<CML>
- Can be parsed
- Can contain reactions,
properties etc.
- Can contain
relationships to other
molecules and also
concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
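As a rough illustration of how metadata can travel with a structure, the sketch below builds a small CML-style record for ethanol and parses it back. It is a simplified sketch, not a schema-validated CML document: the element and attribute names follow common CML usage (`molecule`, `atomArray`, `bondArray`, `atomRefs2`), while the `property` entry, its `units` value and the boiling point of 78.4 °C are included only to show a value stored together with its units.

```python
import xml.etree.ElementTree as ET

def ethanol_cml():
    """Build a small CML-style description of ethanol as an XML string."""
    mol = ET.Element("molecule", id="ethanol")
    ET.SubElement(mol, "name").text = "ethanol"
    atoms = ET.SubElement(mol, "atomArray")
    for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atoms, "atom", id=atom_id, elementType=element)
    bonds = ET.SubElement(mol, "bondArray")
    for pair in ["a1 a2", "a2 a3"]:
        ET.SubElement(bonds, "bond", atomRefs2=pair, order="1")
    # Metadata: the value is stored together with what it is and its units.
    prop = ET.SubElement(mol, "property", title="boiling point")
    ET.SubElement(prop, "scalar", units="celsius").text = "78.4"
    return ET.tostring(mol, encoding="unicode")

xml_text = ethanol_cml()
root = ET.fromstring(xml_text)          # any XML parser can read it back
elements = [atom.get("elementType") for atom in root.iter("atom")]
scalar = root.find("property/scalar")
print(elements, scalar.text, scalar.get("units"))
```

Because the record is ordinary XML, new kinds of information can be added later without breaking programs that only read the atoms and bonds.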
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not
really flat, because of thermal fluctuations. So we represent 3D
molecules by including their Cartesian coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray,
electron or neutron diffraction, e.g. the Cambridge
Structural Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as
Corina or Concord (put in a SMILES and get a 3D molecule), which
use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular
mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods
that deduce conformations, usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot, but often
in many more torsional angle dimensions).
Molecules in 3D: uses
• More accurate calculation of molecular
properties
• Comparison of the shapes (conformations)
of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon).
Molecular dynamics is the most popular method for larger systems. Here is
an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a
series of snapshots of SD files concatenated together to make a movie, just
like a film strip.
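The film-strip picture can be sketched in a few lines: if the frames are stored as SD records one after another, the `$$$$` terminator line is enough to cut the file back into frames. This is only an illustration of the idea; real molecular dynamics packages use their own trajectory formats, and the function name and placeholder frames are assumptions.

```python
def split_sd_records(text):
    """Split concatenated SD records on their '$$$$' terminator lines."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records

# Two placeholder frames; the connection tables are omitted for brevity.
movie = "\n".join([
    "frame_1", "(connection table omitted)", "M  END", "$$$$",
    "frame_2", "(connection table omitted)", "M  END", "$$$$",
])
frames = split_sd_records(movie)
print(len(frames), [frame.splitlines()[0] for frame in frames])
```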
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
We can simulate an entire biological receptor
and test new molecules to see how they might
work
• Here is a simulation of an important new cancer target
called 'Apelin', with water and the membrane. We used
these simulations to design new molecules that block this receptor.
Apelin's role in cancer
'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an
antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma
growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review,
Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats,
which could be a real pain, but
they can be converted
alc -- Alchemy file prep -- Amber PREP file
bs -- Ball & Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball & Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file
cht -- Chemtool file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats
- the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a 'standard' format. However,
in the near future there are likely to be many competing
requirements for file content.
Molecules on computers: things to look out for,
since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect.
In proteins, only the HEAVY atom positions are observed (sometimes at low resolution),
so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen
and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved
correctly (problems with storing bonding).
Tautomers can be incorrect: check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of molecular weight and topology) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics,
halogen-aromatic bonds etc. may be inferred but are not observed in the file).
2. Finding molecules
A typical problem involves finding the
'right' molecules
All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred
molecules in our heads; computers can evaluate millions.
1. Finding molecules using substructure searching
One of the most useful 'properties' of a molecule is its molecular graph.
Many software systems allow searching for whole molecules,
fragments of molecules or even 'similar' molecules.
For a specific molecule this means specifying the required search pattern to get an
exact match.
An example might be: search for an exact match for dimethylaniline. Here we've
used the PubChem database, which contains compounds and pharmacological
screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example: find a molecule
containing our query as part of a larger molecule.
Results: it takes a few seconds to search all
92 million molecules.
How does this work?
The first thing to notice is that the searching is so fast that clearly the whole database
is not being searched.
The first part of the process is to pre-compute pointers to only the sets of
compounds in the database we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence
of fragments in a molecule, e.g.
Bit:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Value: 0 0 0 0 0 1 0 0 0 0  0  1  0  0  0  0
A set bit (1) records that a fragment is present, e.g. "contains P" or "contains NH2";
an unset bit (0) records that it is absent, e.g. "doesn't contain F".
So a quick match of the bitstring of the fragment searched for against the
(pre-computed) bitstrings in the database will eliminate most of the
structures.
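The screening step can be sketched with integers used as bitstrings: a molecule can only contain the query if every bit set in the query is also set in the molecule. The fragment-to-bit assignment, the molecule names and the fragment lists below are all hypothetical, chosen only to mirror the 16-bit example above.

```python
# Hypothetical assignment of fragments to bit positions (1-based).
FRAGMENT_BIT = {"F": 3, "P": 6, "OH": 9, "NH2": 12}

def fingerprint(fragments):
    """Pre-compute a bitstring with one bit set per fragment present."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << (FRAGMENT_BIT[fragment] - 1)
    return bits

def passes_screen(query_bits, molecule_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (query_bits & molecule_bits) == query_bits

# Pre-computed once for the whole (hypothetical) database.
database = {
    "mol_A": fingerprint(["P", "NH2"]),
    "mol_B": fingerprint(["NH2", "OH"]),
    "mol_C": fingerprint(["F"]),
}
query = fingerprint(["NH2"])
candidates = [name for name, bits in database.items() if passes_screen(query, bits)]
print(candidates)
```

Only the surviving candidates need to go on to the slower atom-by-atom matching step; a molecule that passes the screen is not guaranteed to contain the query.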
Next step: substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware: salts etc. may need their counterions stripped out.
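In SMILES, disconnected components such as a drug and its counterion are separated by a dot, so a crude way to strip counterions before an exact match is to keep only the largest component. The sketch below uses string length as a stand-in for size, which is an assumption made for brevity; a cheminformatics toolkit would compare heavy-atom counts or use a curated salt list.

```python
def largest_fragment(smiles):
    """Keep the longest dot-separated component of a SMILES string."""
    return max(smiles.split("."), key=len)

parent = largest_fragment("CC(=O)[O-].[Na+]")   # sodium acetate
print(parent)
```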
Substructure search
Suppose we wish to find part of a molecule in a database.
To do this we have to do a substructure search.
A substructure is a sub-graph of the molecular graph. This is
an example of substructural fragments of a larger molecule,
shown in different colours. You could, e.g., look for all the
molecules with the phenol fragment.
There are two steps commonly used to do this. Firstly, we
have to number all the molecules consistently (the Morgan
algorithm), for example to compare two structures to see if
they are identical. Then we have to match the query
substructure to each substructure in the molecule (the
Ullmann algorithm).
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph,
corresponding to fragments of the whole structure,
e.g. rings, substituents etc.
– –OH
– –NH2
– –COOH
– phenyl
• This can be done by
tracing appropriate
paths in the graph
• Subgraphs may overlap
[Figure: example structure with OH, CH2, CHNH2, OH and O groups labelled]
Matching the query structure to the database
Two algorithms are commonly used:
the Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.
Note: a number
of 'substructural' fragments can
be matched here (6 in all)
Read these for more details:
Gund P., Ann. Rep. in Med. Chem. 14, 299, 1979
Sheridan R.P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989
Brint A.T., Willett P., J. Mol. Graphics 5, 49, 1987
J.R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing
Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan H.L., The generation of a unique machine description for chemical structures, Journal of
Chemical Documentation 5, 1965, 107-113
The basic concept is that molecules are considered as
graphs:
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can
have labels such as atomic number. The nodes are
connected by the edges (and these can also be labelled,
e.g. double bond).
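A labelled graph maps directly onto simple data structures. The sketch below stores cyclohexanone as node labels plus labelled edges and derives its adjacency matrix. The atom numbering (O1 bonded to C4, ring C4-C2-C5-C6-C7-C3) is an assumption made for this illustration, as are the variable and function names.

```python
# Nodes: atom number -> element label.
atoms = {1: "O", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C", 7: "C"}
# Edges: (atom, atom, bond order); the C=O bond carries the label 2.
bonds = [(1, 4, 2), (2, 4, 1), (3, 4, 1), (2, 5, 1), (5, 6, 1), (6, 7, 1), (3, 7, 1)]

def adjacency_matrix(atoms, bonds):
    """Return a symmetric 0/1 matrix with a 1 wherever two atoms are bonded."""
    index = {atom_id: k for k, atom_id in enumerate(sorted(atoms))}
    matrix = [[0] * len(atoms) for _ in atoms]
    for a, b, _order in bonds:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1
    return matrix

matrix = adjacency_matrix(atoms, bonds)
for atom_id, row in zip(sorted(atoms), matrix):
    print(f"{atoms[atom_id]}{atom_id}", row)
```

The row sums give the degree of each atom, and renumbering the atoms permutes the rows and columns without changing the molecule, which is why a consistent numbering is needed.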
The Morgan algorithm iteratively computes an integer label
for each node in a structure (atoms). For a computer the key
point is that the numbering is consistent (unlike humans) and
fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its
atomic number, degree (number of substituents) and types of
bonds.
(ii) Update each label by adding the labels of the adjacent
nodes (connected atoms).
(iii) Repeat (ii) until the number of different labels does not
increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures -
a technique developed at Chemical Abstracts Service, J. Chemical Documentation
5, 107–113, 1965
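Steps (i) to (iii) can be sketched as follows. This is a simplified illustration, not the full published procedure: the initial label 10 x atomic number + degree is an assumption chosen so that both properties contribute, bond types are ignored, and the cyclohexanone numbering (O1 bonded to C4, ring C4-C2-C5-C6-C7-C3) is also assumed.

```python
def morgan_labels(adjacency, atomic_numbers):
    """Iterate neighbour sums until the number of distinct labels stops increasing."""
    # (i) initial label from atomic number and degree (assumed combination)
    labels = {a: 10 * atomic_numbers[a] + len(nbrs) for a, nbrs in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # (ii) new label = sum of the labels of the adjacent atoms
        new = {a: sum(labels[n] for n in nbrs) for a, nbrs in adjacency.items()}
        n_new = len(set(new.values()))
        # (iii) stop when the number of different labels does not increase
        if n_new <= n_classes:
            return labels
        labels, n_classes = new, n_new

cyclohexanone = {1: [4], 2: [4, 5], 3: [4, 7], 4: [1, 2, 3],
                 5: [2, 6], 6: [5, 7], 7: [6, 3]}
atomic_numbers = {1: 8, 2: 6, 3: 6, 4: 6, 5: 6, 6: 6, 7: 6}
labels = morgan_labels(cyclohexanone, atomic_numbers)
print(labels)
```

Atoms that end with the same label (here C2/C3 and C5/C7) are equivalent by symmetry, so any numbering that orders the atoms by label still needs tie-breaking rules for them.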
The first step: numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest label, then uses the next-highest connection as 2, etc., like an onion:
concentric circles from the highest-numbered atom
• Rules are used to break ties, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered
Urea
Adjacency matrix
     O1 C2 C3 C4
O1    0  0  0  1
C2    0  0  0  1
C3    0  0  0  1
C4    1  1  1  0
Cyclohexanone adjacency
matrix
     O1 C2 C3 C4 C5 C6 C7
O1    0  0  0  1  0  0  0
C2    0  0  0  1  1  0  0
C3    0  0  0  1  0  0  1
C4    1  1  1  0  0  0  0
C5    0  1  0  0  0  1  0
C6    0  0  0  0  1  0  1
C7    0  0  1  0  0  1  0
Numbering is crucial. Renumber
for different starting points. Also,
the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part
of matching, 'subgraph isomorphism', which is of the class of NP-complete problems.
The Ullmann algorithm
is a way of
detecting substructures
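The matching step can be illustrated with a plain backtracking search that maps query atoms onto target atoms one at a time, checking element, degree and connectivity as it goes. This is a simplified stand-in for Ullmann's algorithm, which adds a matrix refinement step to prune the search earlier. The function name, the ketone query and the cyclohexanone numbering (O1 bonded to C4, ring C4-C2-C5-C6-C7-C3) are assumptions, and bond orders are ignored.

```python
def find_matches(q_adj, q_elem, t_adj, t_elem):
    """Return every mapping of query atoms onto target atoms that preserves bonds."""
    q_atoms = list(q_adj)
    matches = []

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            matches.append(dict(mapping))
            return
        q = q_atoms[len(mapping)]
        for t in t_adj:
            if t in mapping.values() or q_elem[q] != t_elem[t]:
                continue                      # already used, or wrong element
            if len(t_adj[t]) < len(q_adj[q]):
                continue                      # too few neighbours to fit
            # every already-mapped query neighbour must be bonded to t
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                extend(mapping)
                del mapping[q]

    extend({})
    return matches

# Query: a ketone, C-C(=O)-C.
ketone_adj = {"o": ["c1"], "c1": ["o", "c2", "c3"], "c2": ["c1"], "c3": ["c1"]}
ketone_elem = {"o": "O", "c1": "C", "c2": "C", "c3": "C"}
# Target: cyclohexanone.
target_adj = {1: [4], 2: [4, 5], 3: [4, 7], 4: [1, 2, 3], 5: [2, 6], 6: [5, 7], 7: [6, 3]}
target_elem = {1: "O", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C", 7: "C"}

matches = find_matches(ketone_adj, ketone_elem, target_adj, target_elem)
print(matches)
```

Two mappings are found because the two carbons next to the carbonyl are interchangeable; both describe the same occurrence of the ketone fragment.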
Next lecture
• How can we use this type of information to
solve our chemistry problems?
– Patents/Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure-property/activity relationships
Storing molecules: different methods for different purposes
Methods for storing molecules can conveniently be broken down
into:
1-Dimensional (simple, very compact and fast to access)
2-Dimensional (contains the chemical diagram)
3-Dimensional approaches (the shapes of molecules)
1D: Line notations (a string of characters from a keyboard)
2D: Molecular diagrams (graphs)
3D: Graphs plus XYZ coordinates (giving the 3D structure)
We will look at examples of these (and you can follow up in the
notes).
1-Dimensional: line notations, a string representation of
molecules. Here are three examples of different line notations;
SMILES is the most useful and widely used. These are
'strings', all of the same molecule.
• Line notations
– WLN
• L66J BMR& DSWQ IN1&1
– ROSDAL
• 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
– SMILES
• c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
– IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic
acid
Notice: all these notations use just the characters on a standard
typewriter keyboard.
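Because a line notation is just text, simple string processing already recovers useful information. The sketch below counts the heavy atoms written in a SMILES string with a regular expression. It is a rough illustration only: it handles the common organic-subset symbols and bracket atoms, ignores implicit hydrogens, and is not a substitute for a real SMILES parser. The pattern and function name are assumptions.

```python
import re
from collections import Counter

# Bracket atoms, two-letter halogens, then one-letter aliphatic and aromatic atoms.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")

def count_elements(smiles):
    """Count the atoms written explicitly in a SMILES string, by element."""
    counts = Counter()
    for token in ATOM_PATTERN.findall(smiles):
        if token.startswith("["):
            # first element symbol inside the brackets, e.g. [CH] -> C, [Na+] -> Na
            token = re.search(r"[A-Za-z][a-z]?", token).group(0)
        counts[token.capitalize()] += 1     # aromatic 'c' counts as carbon
    return counts

dye = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
counts = count_elements(dye)
print(dict(counts))
```

For the naphthalene sulphonic acid above this gives 18 carbons, 2 nitrogens, 3 oxygens and 1 sulphur, which agrees with the IUPAC name.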
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C
This is an example of a 'molecule' and is what is actually stored in the
computer. Does it make sense to you?
It's artemisinin,
which has
anti-malarial
properties
(Nobel Prize).
This language is called SMILES.
A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
You can practise by drawing a structure; a SMILES will be available
by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
InChI
• A more recent line notation is called
InChI
• This will address some of the problems
of SMILES, e.g. polymers and materials
not covered by SMILES
• InChI is generated using computer
algorithms and is virtually
uninterpretable by humans
• Importantly, it is commonly used as a
unique chemical identifier: each
molecule should theoretically have a
unique InChI. One molecule, one InChI
• Websites are available that can generate
InChI or convert from InChI to structure,
from different names and formats
Again, a string like this is easily
matched on a computer
RSC ChemSpider
Storing chemical diagrams on computers
• The valence model of a molecule can be represented by a
chemical graph. A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes.
• The spatial position of the nodes, length of the edges and
crossings are irrelevant. Generally we ignore hydrogens unless
tautomerism or pKa is an issue. Computers handle graphs very
well, and molecules represented like this are examples of labelled
graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and
aromaticity, stereochemistry, tautomerism, non-stoichiometric
compounds etc. are often problematic. The computer would
(using a simple graph) deduce that these two canonical structures are
different molecules.
• E.g. to solve this we could introduce the concept of an aromatic
bond.
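The problem and the fix can be shown with benzene. Stored as a simple labelled graph, the two Kekulé forms have different edge labels and so compare as different molecules; relabelling the ring bonds with a single aromatic bond type makes them identical. The sketch assumes a ring numbered 1 to 6 and that every bond passed in is part of one aromatic ring; bond type 4 is the code for an aromatic bond in MDL connection tables.

```python
RING = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 6)]
AROMATIC = 4  # MDL bond type code for an aromatic bond

def ring_bonds(orders):
    """Label the six ring bonds with the given bond orders."""
    return dict(zip(RING, orders))

def aromatise(bonds):
    """Replace every bond label by the aromatic type (assumes an aromatic ring)."""
    return {pair: AROMATIC for pair in bonds}

kekule_a = ring_bonds([2, 1, 2, 1, 2, 1])
kekule_b = ring_bonds([1, 2, 1, 2, 1, 2])

same_as_simple_graph = kekule_a == kekule_b
same_with_aromatic_bonds = aromatise(kekule_a) == aromatise(kekule_b)
print(same_as_simple_graph, same_with_aromatic_bonds)
```

Deciding which rings are aromatic in the first place is the hard part and is where different software packages can disagree.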
An example of storing the molecular
graph: SD file format
MDL SD (structure data) files store the
information about a molecule in the following format:
Header block
describes the molecule, e.g. its name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
properties, e.g. volume
Terminator line
a line containing four dollar signs ($$$$),
indicating the end of information on this molecule
This is probably the most common format for storing small molecules; SD files
are widely used in software to store many molecular structures in databases.
A "format" in computer science is a precisely described order of data. The program
reading it expects the information in exactly the right place (or it fails).
The Connection Table: describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule
as a labelled graph
e.g. PDB protein file format:
Connect 3 2 5 4
e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
Why store molecules in 2D?
• Quite often we only need the
chemical diagram, e.g. to find a
molecule that matches a
chemical structure search
• It is often the case that we don't
know the conformation (shape)
of a molecule, so storing it in
3D would be pointless. Look at
the changes in conformation in
this molecule at room
temperature
Example SD file - benzene
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Header block: molecule name, information on this molecule,
comment (e.g. the date is used here)
"counts line" has the
number of atoms and
bonds as a minimum
"atom block" has the x, y, z
coordinates of each atom and its
element as a minimum
"bond block": one line for each bond,
giving from atom, to atom and bond type
These lines identify the end of this molecule
Alanine SD file
Bond length is 1.53 Å
Atom line fields: x, y, z, symbol, mass difference, charge, stereo, H-count, ...
Charge codes:
0 = uncharged, or a value other than these
1 = +3, 2 = +2, 3 = +1
4 = doublet radical
5 = -1, 6 = -2, 7 = -3
Molecule is chiral
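The blocks described above can be read back with a few lines of code. The sketch below parses the benzene record: three header lines, the counts line, the atom block and the bond block, and it translates the charge code using the table above. It is a simplified illustration: real V2000 files use fixed-width columns, so splitting on whitespace, as done here for brevity, can fail for large molecules. The function name and dictionary layout are assumptions.

```python
# Charge codes from the atom block (code 4 is a doublet radical, formal charge 0).
CHARGE_CODE = {0: 0, 1: 3, 2: 2, 3: 1, 4: 0, 5: -1, 6: -2, 7: -3}

BENZENE_SD = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""

def parse_molblock(text):
    """Return (name, atoms, bonds) from a single V2000 record."""
    lines = text.splitlines()
    name = lines[0].strip()                       # header block, line 1
    counts = lines[3].split()                     # counts line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:             # atom block
        fields = line.split()
        x, y, z = (float(v) for v in fields[:3])
        atoms.append({"xyz": (x, y, z), "element": fields[3],
                      "charge": CHARGE_CODE[int(fields[5])]})
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:   # bond block
        a, b, order = (int(v) for v in line.split()[:3])
        bonds.append((a, b, order))
    return name, atoms, bonds

name, atoms, bonds = parse_molblock(BENZENE_SD)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```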
What if we only know the atom positions and not the bonds
A key example would be x-ray crystallography Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge
Here is an electron density map from an x-ray experiment We see the
electrons but not the nuclei In a small molecule this is very accurate and we
can almost see the bonds However in eg a protein structure which is very
large we canrsquot always determine the atom positions exactly so these are
stored and the bonds imputed The format for storing these is a bit different
Which way
round should
this ring go
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA X-RAY DIFFRACTION
REMARK 2 RESOLUTION 180 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 69
SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4
ORIGX1 1000000 0000000 0000000 000000
ORIGX2 0000000 1000000 0000000 000000
ORIGX3 0000000 0000000 1000000 000000
SCALE1 0018319 0000000 0000000 000000
SCALE2 0000000 0018147 0000000 000000
SCALE3 0000000 0000000 0015426 000000
ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128
ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841
ATOM 3 C VAL A 1 4565 1067 -3593 100 2764
ATOM 4 O VAL A 1 4653 1287 -4853 100 2717
ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001
HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234
HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213
HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410
HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
END
Space group and unit cell
dimensions
Protein atoms ndash including xyz
occupancy and temperature factor
Non-protein atoms ndash including xyz
occupancy and temperature factor
Bonds New format
mmCIF
What if we want to vary the atom
positions eg in driving a
reaction coordinate
Using Cartesian (xyz)
coordinates is very
cumbersome so instead
we use the natural angles
and distances
This uses internal coordinates
bull Also called a Z-matrix
ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)
ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation
ndash A z-matrix uses the following geometric descriptions to describe molecules
Bond Length
Bond angle
Torsion angle
Out of plane bending
eg a carbonyl
Non-bonded distance
C
O
du
du
O
N OH
Dummy atom
positions
Internal
Coordinates
How to construct a Z-matrix (in Gaussian format)
1For the first atom to be defined give the atomic symbol
only
2For the second atom give the atomic symbol the number
1 and the name of a variable to describe the distance
between atoms 1 and 2
3For the third atom give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB and
the name of a variable to describe the angle between the
current atom NA and NB
4For all later atoms give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB the
name of a variable to describe the angle between the current
atom NA and NB the atom number of another previously
defined atom NC and finally the name of a variable to
describe the dihedral angle between the current atom NA
NB and NC
5After all the atoms have been listed enter a blank line
6Next list each variable with its corresponding value Use a
separate line for each variable
7In some cases where some of the variables are to be fixed
as constants in a geometry optimisation they are listed here
after a blank line rather than above
with the real variables
8End the Z-matrix with a blank line
Water (C2v)
O
H 1 l1
H l l1 2 a1
l1 096
a1 1040
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 1800
l1 142
l2 109
l3 109
l4 109
l5 109
l6 10
a1 1090
a2 1100
a3 1080
a4 1100
a5 1100
da1 600
da2 1200
da3 600 z-matrix
lsquoimproperrsquo torsion angles
What about comprehensive
properties of molecules ndash they
are more than xyz
XML and molecules
bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc
bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file
bull Chemical Markup Language (CML) is being developed specifically for chemistry
bull In the future much more information will be stored with molecules allowing greater re-use of data
bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928
Ethanol
ltCMLgt
-Can be parsed
-Can contain reactions
properties etc
-Can contain
relationships to other
molecules and also
concepts
InChI
InChI=1C2H6Oc1-2-3h3H2H21H3
SMILES
C(C)O
Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not
really flat because of thermal fluctuations So we represent 3D
molecules by including their coordinates or their internal coordinates
bull Obtaining the 3-dimensional coordinates can involve experiment (x-
ray electron or neutron diffraction eg the Cambridge
Crystallographic Database or the Protein Databank ndash PDB)
bull From these can be obtained atom positions bonds coordinates etc
bull There are a number of 3-D construction methods available such as
Corina or Concord (put in a SMILES and get a 3-D molecule) which
use rules derived from experiment
bull Molecules can also be constructed in 2D and subjected to molecular
mechanics or Quantum Mechanics calculations to obtain 3D structures
bull Conformation still remains to be deduced There are many methods
that deduce conformations usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot ndash but often
in many more torsional angle dimensions)
Molecules in 3D - uses
bull More accurate calculation of molecular
properties
bull Comparison of the shapes (conformations)
of molecules
bull Comparison of the dynamics of molecules
bull Calculation of bulk properties
bull Simulation of chemical reactionshelliphellip
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon)
Molecular Dynamics is the most popular method for larger systems Here is
an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a
series of snapshots of SD files concatenated together to make a movie just
like a film strip
httpsenwikipediaorgwikiCHARMM
httpambermdorg
httpwwwksuiuceduResearchnamd
httpwwwksuiuceduResearchvmd
We can simulate an entire biological receptor
and test new molecules to see how they might
work
bull Here is a simulation of an important new cancer target
called lsquoApelinrsquo with water and the membrane We used
these to design new molecules that block this receptor
Apelinrsquos role in Cancer
lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an
antagonist (a molecule that blocks the receptor) stop tumour growth
bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma
growth Brain 2017 awx253 httpsdoiorg101093brainawx253
bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et
al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372
So there are very many file formats – which could be a real pain, but they can be converted

alc -- Alchemy file                      prep -- Amber PREP file
bs -- Ball & Stick file                  caccrt -- Cacao Cartesian file
ccc -- CCC file                          c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file          cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                 dmol -- DMol3 Coordinates file
feat -- Feature file                     gam -- GAMESS Output file
gamout -- GAMESS Output file             gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                jout -- Jaguar Output file
bin -- OpenEye Binary file               mmd -- MacroModel file
mmod -- MacroModel file                  out -- MacroModel file
dat -- MacroModel file                   car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                 sd -- MDL Isis SDF file
mdl -- MDL Molfile file                  mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file           mopout -- MOPAC Output file
mmads -- MMADS file                      mpqc -- MPQC file
bgf -- MSI BGF file                      nwo -- NWChem Output file
pdb -- PDB file                          ent -- PDB file
pqs -- PQS file                          qcout -- Q-Chem Output file
res -- ShelX file                        ins -- ShelX file
smi -- SMILES file                       mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file               vmol -- ViewMol file

alc -- Alchemy file                      bs -- Ball & Stick file
caccrt -- Cacao Cartesian file           cacint -- Cacao Internal file
cache -- CAChe MolStruct file            c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file          ct -- ChemDraw Connection Table file
cht -- Chemtool file                     cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                    box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file           feat -- Feature file
fh -- Fenske-Hall Z-Matrix file          gamin -- GAMESS Input file
inp -- GAMESS Input file                 gcart -- Gaussian Cartesian file
gau -- Gaussian Input file               gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file               gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                jin -- Jaguar Input file
bin -- OpenEye Binary file               mmd -- MacroModel file
mmod -- MacroModel file                  out -- MacroModel file
dat -- MacroModel file                   sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                  mdl -- MDL Molfile file
mol -- MDL Molfile file                  mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                      bgf -- MSI BGF file
csr -- MSI Quanta CSR file               nw -- NWChem Input file
pdb -- PDB file                          ent -- PDB file
pov -- POV-Ray Output file               pqs -- PQS file
report -- Report file                    qcin -- Q-Chem Input file
smi -- SMILES file                       fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                  txyz -- Tinker XYZ file
txt -- Titles file                       unixyz -- UniChem XYZ file
vmol -- ViewMol file                     xed -- XED file
xyz -- XYZ file                          zin -- ZINDO Input file
File formats – the bane of our lives
Interconversion program – Babel
There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.
Molecules on computers – things to look out for, since what is stored is actually quite crude
For example:
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect – check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).
2 Finding molecules
A typical problem involves finding the 'right' molecules
All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.
1 Finding molecules using substructure searching
One of the most useful 'properties' of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, or fragments of molecules, or even 'similar' molecules.
For a specific molecule this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example – find a molecule containing our query as part of a larger molecule.
Results – it takes a few seconds to search all 92 million molecules.
How does this work?
The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.
The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
[Figure labels: a set bit means e.g. 'Contains P' or 'Contains NH2'; an unset bit means e.g. 'Doesn't contain F']
So a quick match of the bitstring of the fragment searched for against the (pre-computed) bitstrings in the database will eliminate most of the structures.
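The screening step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fragment dictionary `FRAGMENT_BITS` and the toy database are invented here, and real systems use far longer bitstrings.

```python
# Hypothetical fragment dictionary: one bit position per fragment (illustrative only).
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "COOH": 4, "phenyl": 5}

def fingerprint(fragments):
    """Pack the fragments present in a molecule into one integer bitstring."""
    bits = 0
    for name in fragments:
        bits |= 1 << FRAGMENT_BITS[name]
    return bits

def may_contain(query_bits, molecule_bits):
    """Screen: every bit set in the query must also be set in the molecule."""
    return (query_bits & molecule_bits) == query_bits

# Pre-computed bitstrings for a toy database.
database = {
    "aniline": fingerprint({"NH2", "phenyl"}),
    "phenol": fingerprint({"OH", "phenyl"}),
    "tyrosine": fingerprint({"NH2", "OH", "COOH", "phenyl"}),
    "fluoroacetic acid": fingerprint({"F", "COOH"}),
}

query = fingerprint({"OH", "phenyl"})
candidates = [name for name, bits in database.items() if may_contain(query, bits)]
print(candidates)
```

Only the molecules that pass this cheap bit test need to go on to the slower atom-by-atom matching; passing the screen does not by itself prove a match.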
Next step – substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware: salts etc. may need their counterions stripped out.
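As a rough illustration of stripping counterions, a SMILES string for a salt lists its components separated by dots, so one crude approach is to keep the largest component. The helper name `largest_fragment` and the "count the letters" size measure are assumptions made for this sketch, not a standard procedure.

```python
def largest_fragment(smiles):
    """Return the dot-separated SMILES component with the most letters (a crude size measure)."""
    parts = smiles.split(".")
    return max(parts, key=lambda part: sum(ch.isalpha() for ch in part))

# Dimethylaniline written as its hydrochloride salt.
parent = largest_fragment("CN(C)c1ccccc1.Cl")
print(parent)
```

The parent string would still need to be converted to a canonical form before an exact text comparison is meaningful.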
Substructure search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
There are two steps commonly used to do this: firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical; then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
– –OH
– –NH2
– –COOH
– phenyl
• this can be done by tracing appropriate paths in the graph
• subgraphs may overlap
[Figure: example molecule drawn with the groups OH, CH2, CHNH2, OH and O labelled]
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments.
Note – a number of 'substructural' fragments can be matched here (6 in all).
Read these for more details:
Gund P, Ann Rep in Med Chem 14, 299, 1979
Sheridan RP et al, J Chem Inf Comp Sci 29, 255, 1989
Brint AT, Willett P, J Mol Graphics 5, 49, 1987
JR Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol 23, pp 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan HL, The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113
The basic concept is that molecules are considered as graphs:
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an initial integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures – a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107–113, 1965
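Steps (i) to (iii) can be sketched as below. This is a simplified version under stated assumptions: the initial label is just the degree (atomic number and bond types are ignored), each update replaces a label by the sum of its neighbours' labels, and the atom names follow the cyclohexanone numbering used in these slides.

```python
def morgan_labels(neighbours):
    """Extended-connectivity labels; `neighbours` maps each atom to its bonded atoms."""
    # Step (i): start from the degree of each atom (a simplification).
    labels = {atom: len(nbrs) for atom, nbrs in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): new label = sum of the labels of the adjacent atoms.
        new = {atom: sum(labels[b] for b in nbrs) for atom, nbrs in neighbours.items()}
        new_classes = len(set(new.values()))
        # Step (iii): stop once the number of distinct labels no longer increases.
        if new_classes <= n_classes:
            return labels
        labels, n_classes = new, new_classes

cyclohexanone = {
    "O1": ["C4"],
    "C2": ["C4", "C5"],
    "C3": ["C4", "C7"],
    "C4": ["O1", "C2", "C3"],
    "C5": ["C2", "C6"],
    "C6": ["C5", "C7"],
    "C7": ["C3", "C6"],
}
labels = morgan_labels(cyclohexanone)
start_atom = max(labels, key=labels.get)   # numbering starts from the highest label
print(labels, start_atom)
```

For this small ring the first update does not create any new classes, so the degrees themselves are returned; symmetry-equivalent atoms (C2 and C3, C5 and C7) share a label and would need tie-breaking rules.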
The first step – numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest-valued atom, then takes its highest-valued connection as 2, etc. Like an onion: concentric circles out from the highest-numbered atom
• Rules are used to break ties, such as C before N, double bond before single etc.
• Some equivalent atoms are arbitrarily numbered
Urea
Adjacency matrix
[Figure: query fragment with atoms numbered O1, C2, C3, C4]
      O1  C2  C3  C4
O1     0   0   0   1
C2     0   0   0   1
C3     0   0   0   1
C4     1   1   1   0
Cyclohexanone adjacency matrix
[Figure: cyclohexanone with atoms numbered O1 and C2 to C7; going round the ring: C4, C2, C5, C6, C7, C3]
      O1  C2  C3  C4  C5  C6  C7
O1     0   0   0   1   0   0   0
C2     0   0   0   1   1   0   0
C3     0   0   0   1   0   0   1
C4     1   1   1   0   0   0   0
C5     0   1   0   0   0   1   0
C6     0   0   0   0   1   0   1
C7     0   0   1   0   0   1   0
Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching ('subgraph isomorphism') and is of the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures.
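To make the matching step concrete, here is a minimal sketch that builds the two adjacency matrices and searches for the fragment by plain backtracking. It is not the full Ullmann algorithm (it omits Ullmann's refinement step, which prunes candidates early), and the ring bonds for cyclohexanone are an assumption read off the figure numbering (O1 and C2 to C7).

```python
def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 matrix; `bonds` holds pairs of 1-based atom numbers."""
    m = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        m[i - 1][j - 1] = m[j - 1][i - 1] = 1
    return m

def find_matches(query, q_elems, target, t_elems):
    """Return every mapping (query atom index -> target atom index) that embeds the query."""
    matches = []

    def extend(mapping):
        q = len(mapping)                      # index of the next query atom to place
        if q == len(query):
            matches.append(tuple(mapping))
            return
        for t in range(len(target)):
            if t in mapping or q_elems[q] != t_elems[t]:
                continue
            # Every query bond to an already-placed atom must also exist in the target.
            if all(target[t][mapping[p]] for p in range(q) if query[q][p]):
                extend(mapping + [t])

    extend([])
    return matches

fragment = adjacency_matrix(4, [(1, 4), (2, 4), (3, 4)])
cyclohexanone = adjacency_matrix(
    7, [(1, 4), (2, 4), (3, 4), (2, 5), (5, 6), (6, 7), (7, 3)])
hits = find_matches(fragment, "OCCC", cyclohexanone, "OCCCCCC")
print(hits)
```

Two mappings are found because the two carbons next to the carbonyl are equivalent and can be swapped; the number of candidate mappings grows very quickly with molecule size, which is why the bitstring screen is applied first.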
Next lecture
• How can we use this type of information to solve our chemistry problems?
– Patents/Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure–Property/Activity Relationships
1-Dimensional – line notations: a string representation of molecules. Here are three examples of different line notations; SMILES is the most useful and widely used. These are 'strings', all of the same molecule.
• Line notations
– WLN
• L66J BMR& DSWQ IN1&1
– ROSDAL
• 1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24
– SMILES
• c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3
– IUPAC: 6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid
Notice – all these notations use just the characters on a standard typewriter keyboard
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C
This is an example of a 'molecule' and is what is actually stored in the computer – does it make sense to you?
It's artemisinin, which has anti-malarial properties (Nobel Prize).
This language is called SMILES.
A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
You can practise by drawing a structure, and a SMILES will be available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
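To show that a SMILES string really is machine-readable, here is a small sketch that counts the atoms written in it. It is not a full SMILES parser: it only recognises bracket atoms and the unbracketed organic subset, it ignores bonds, rings and implicit hydrogens, and the names `ATOM` and `count_atoms` are made up for this example.

```python
import re
from collections import Counter

# Bracket atoms such as [CH] or [H], or unbracketed organic-subset atoms (aromatic in lower case).
ATOM = re.compile(
    r"\[\d*(?P<bracket>[A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"
    r"|(?P<plain>Cl|Br|[BCNOPSFI]|[bcnops])"
)

def count_atoms(smiles):
    """Count the atoms explicitly written in a SMILES string, by element symbol."""
    counts = Counter()
    for m in ATOM.finditer(smiles):
        symbol = m.group("bracket") or m.group("plain")
        counts[symbol.capitalize()] += 1
    return counts

artemisinin = "O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C"
dye = "c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3"
print(count_atoms(artemisinin))
print(count_atoms(dye))
```

The two hydrogens counted for artemisinin are only the ones written explicitly as [H]; the remaining hydrogens are implied by valence.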
InChI
• A more recent line notation is called InChI
• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES
• InChI is generated using computer algorithms and is virtually uninterpretable by humans
• Importantly, it is commonly used as a unique chemical identifier – each molecule should theoretically have a unique InChI. One molecule, one InChI
• Websites are available that can generate InChI and convert from InChI to structure, from different names and formats
Again, a string like this is easily matched on a computer
RSC ChemSpider
Storing chemical diagrams on computers
• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• To solve this we could, e.g., introduce the concept of an aromatic bond.
An example of storing the molecular graph: SD file format
MDL SD (structure data) files store the information about each molecule in the following format:
Header block
describes the molecule, e.g. its name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
properties, e.g. volume
Terminator line
a line containing four dollar signs ($$$$), indicating the end of information on this molecule
This is probably the most common format for storing small molecules – SD files are widely used in software to store many molecular structures in databases.
A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).
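The block structure above can be sketched as follows: split the text on the `$$$$` terminator lines, then pick the optional data items out of each record. The two records are hand-written minimal examples, and the function names are invented for this sketch.

```python
def split_sd_records(text):
    """Split SD-file text into one string per molecule, using the $$$$ terminator lines."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records

def data_items(record):
    """Collect the optional data block: a '> <NAME>' line followed by its value line."""
    items, lines = {}, record.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(">") and "<" in line and i + 1 < len(lines):
            name = line[line.index("<") + 1 : line.index(">", 1)]
            items[name] = lines[i + 1].strip()
    return items

sd_text = "\n".join([
    "water", "  hand-written example", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <Formula>", "H2O", "",
    "$$$$",
    "methane", "  hand-written example", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <Formula>", "CH4", "",
    "$$$$",
])
records = split_sd_records(sd_text)
names = [record.splitlines()[0] for record in records]
print(names, data_items(records[0]))
```

Only single-line data values are handled here; real SD files may carry multi-line values.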
The Connection Table – describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph
e.g. PDB protein file format:
Connect 3 2 5 4
e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search
• It is often the case that we don't know the conformation (shape) of a molecule – so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature
Example SD file – benzene
benzene
  ACD/Labs0812062058
  August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Line 1: molecule name
Line 2: information on this molecule
Line 3: comment (e.g. the date is used here)
"Counts line": has the number of atoms and bonds as a minimum
"Atom block": has the xyz coordinates of the atom and the element as a minimum
"Bond block": one line for each bond: from atom – to atom – bond type
"M END" and "$$$$": these identify the end of this molecule
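Because the position of every field matters, a small writer and reader pair is a useful way to see the layout. The sketch below assumes the usual V2000 column widths (three-character counts, ten-character coordinates) and fills every other field with zeros; the function names and the second header line are made up for illustration.

```python
def write_molblock(name, atoms, bonds):
    """atoms: [(symbol, x, y, z)]; bonds: [(from_atom, to_atom, bond_type)], 1-based."""
    lines = [name, "  sketch", ""]                        # three header lines
    lines.append(f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    for sym, x, y, z in atoms:
        lines.append(f"{x:10.4f}{y:10.4f}{z:10.4f} {sym:<3} 0" + "  0" * 11)
    for a, b, order in bonds:
        lines.append(f"{a:3d}{b:3d}{order:3d}  0  0  0  0")
    lines.append("M  END")
    return "\n".join(lines)

def read_molblock(text):
    """Read name, atoms and bonds back by slicing the fixed columns."""
    lines = text.splitlines()
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    atoms = [(ln[31:34].strip(), float(ln[0:10]), float(ln[10:20]), float(ln[20:30]))
             for ln in lines[4:4 + n_atoms]]
    bonds = [(int(ln[0:3]), int(ln[3:6]), int(ln[6:9]))
             for ln in lines[4 + n_atoms:4 + n_atoms + n_bonds]]
    return lines[0], atoms, bonds

BENZENE_ATOMS = [("C", 1.9050, -0.7932, 0.0), ("C", 1.9050, -2.1232, 0.0),
                 ("C", 0.7531, -0.1282, 0.0), ("C", 0.7531, -2.7882, 0.0),
                 ("C", -0.3987, -0.7932, 0.0), ("C", -0.3987, -2.1232, 0.0)]
BENZENE_BONDS = [(2, 1, 1), (3, 1, 2), (4, 2, 2), (5, 3, 1), (6, 4, 1), (6, 5, 2)]

molblock = write_molblock("benzene", BENZENE_ATOMS, BENZENE_BONDS)
name, atoms, bonds = read_molblock(molblock)
print(molblock)
```

Slicing by column rather than splitting on spaces matters once a molecule has 100 or more atoms, because the three-character fields then run together.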
Alanine SD file
[Figure: annotated alanine SD file]
Bond length is 1.53 Å
Atom block fields: x y z, symbol, mass difference, charge, stereo, H-count, …
Charge field: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
Molecule is chiral
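The charge codes are not the charges themselves, so a reader has to translate them. A minimal lookup following the table above (the function name is invented for this sketch, and code 0 is simply treated as uncharged):

```python
# Charge code from the atom block -> formal charge, following the table above.
CHARGE_BY_CODE = {0: 0, 1: 3, 2: 2, 3: 1, 4: 0, 5: -1, 6: -2, 7: -3}

def decode_charge(code):
    """Return (formal charge, is_doublet_radical) for an atom-block charge code."""
    return CHARGE_BY_CODE[code], code == 4

print(decode_charge(3), decode_charge(5), decode_charge(4))
```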
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
[Figure caption: Which way round should this ring go?]
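One common way to impute bonds is a distance test: two atoms are taken as bonded if they are closer than the sum of their covalent radii plus a tolerance. The sketch below assumes approximate radii and a 0.4 Å tolerance (both are illustrative choices), and uses the sulfate atoms from the 3S7A excerpt in these notes as test coordinates.

```python
from itertools import combinations
from math import dist

# Approximate covalent radii in Å (illustrative values).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}

def infer_bonds(atoms, tolerance=0.4):
    """atoms: [(element, (x, y, z))]; returns index pairs judged to be bonded."""
    bonds = []
    for (i, (ei, pi)), (j, (ej, pj)) in combinations(enumerate(atoms), 2):
        if dist(pi, pj) <= COVALENT_RADIUS[ei] + COVALENT_RADIUS[ej] + tolerance:
            bonds.append((i, j))
    return bonds

sulfate = [
    ("S", (25.993, -1.362, 5.893)),
    ("O", (25.504, -1.159, 4.508)),
    ("O", (27.282, -0.770, 6.155)),
    ("O", (24.953, -1.035, 6.850)),
]
bonds = infer_bonds(sulfate)
print(bonds)
```

A test like this gives connectivity only; bond orders, hydrogens and ambiguous cases (such as which way round a ring goes) still need chemical knowledge.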
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH                   : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
CRYST1: space group and unit cell dimensions
ATOM: protein atoms – including xyz, occupancy and temperature factor
HETATM: non-protein atoms – including xyz, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
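As with SD files, a PDB reader works by column position. The sketch below slices ATOM/HETATM and CONECT lines using the standard fixed columns of the legacy PDB format; the function names are invented here and only the fields discussed above are extracted.

```python
def parse_atom_record(line):
    """Slice one ATOM/HETATM line by its fixed columns (1-based columns in the comments)."""
    return {
        "record": line[0:6].strip(),      # cols 1-6
        "serial": int(line[6:11]),        # cols 7-11
        "name": line[12:16].strip(),      # cols 13-16
        "residue": line[17:20].strip(),   # cols 18-20
        "chain": line[21],                # col 22
        "res_seq": int(line[22:26]),      # cols 23-26
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),  # cols 55-60
        "b_factor": float(line[60:66]),   # cols 61-66
    }

def parse_conect(line):
    """Return (atom serial, [bonded serials]) from a CONECT line of 5-character fields."""
    serials = [int(line[i:i + 5]) for i in range(6, len(line.rstrip()), 5)]
    return serials[0], serials[1:]

atom = parse_atom_record(
    "ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28")
hetatm = parse_atom_record(
    "HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34")
conect = parse_conect("CONECT 1510 1511 1512 1513 1514")
print(atom, hetatm, conect)
```

The CONECT records list, for each atom, the atoms it is bonded to, so the bonding of non-protein groups such as the sulfate can be rebuilt without guessing from distances.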
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
– An early way of specifying a starting geometry for molecules – people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending, e.g. a carbonyl
Non-bonded distance
[Figure: internal coordinates and dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1 and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Figure labels: z-matrix; 'improper' torsion angles]
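To see how internal coordinates turn back into xyz, here is a minimal sketch for the first three atoms only, which is enough for the water example. The placement convention (atom 1 at the origin, atom 2 on the z-axis, atom 3 in the xz-plane) is a common choice but an assumption here, and placing later atoms from a dihedral angle is deliberately left out.

```python
from math import cos, sin, radians, dist

def triatomic_zmatrix_to_xyz(r12, r13, angle_deg):
    """Atom 1 at the origin, atom 2 on +z, atom 3 in the xz-plane at angle 2-1-3."""
    a = radians(angle_deg)
    return [
        (0.0, 0.0, 0.0),
        (0.0, 0.0, r12),
        (r13 * sin(a), 0.0, r13 * cos(a)),
    ]

# Water: l1 = 0.96 Å, a1 = 104.0 degrees, as in the Z-matrix above.
o, h1, h2 = triatomic_zmatrix_to_xyz(0.96, 0.96, 104.0)
print(o, h1, h2, dist(h1, h2))
```

Changing `l1` or `a1` moves the atoms along a chemically meaningful coordinate, which is what makes a Z-matrix convenient for driving a reaction coordinate.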
What about comprehensive properties of molecules? They are more than xyz
XML and molecules
• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file
• Chemical Markup Language (CML) is being developed specifically for chemistry
• In the future much more information will be stored with molecules, allowing greater re-use of data
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928
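A short sketch of what "data plus metadata" looks like in practice, using Python's standard XML parser. The fragment is a simplified, hand-written CML-style record for ethanol: the atom and bond elements follow CML naming, but the `property` element with `title` and `units` attributes is a simplification made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Simplified CML-style record (hand-written for illustration).
cml = """<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point" units="K">351.5</property>
</molecule>"""

root = ET.fromstring(cml)
elements = [atom.get("elementType") for atom in root.iter("atom")]
bonds = [tuple(bond.get("atomRefs2").split()) for bond in root.iter("bond")]
prop = root.find("property")
print(elements, bonds, prop.get("title"), float(prop.text), prop.get("units"))
```

The value 351.5 would mean little on its own; the attributes travelling with it say what was measured and in which units.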
Ethanol
<CML>
– Can be parsed
– Can contain reactions, properties etc.
– Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
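Although an InChI is hard for a human to read, its layers are separated by slashes and are easy to pull apart by program. A minimal sketch (the function name and the dictionary keys are invented for this example):

```python
def inchi_layers(inchi):
    """Split an InChI string into its version, formula and prefixed layers."""
    prefix, *layers = inchi.split("/")
    result = {"version": prefix.split("=", 1)[1], "formula": layers[0]}
    for layer in layers[1:]:
        result[layer[0]] = layer[1:]      # e.g. 'c' = connections, 'h' = hydrogens
    return result

layers = inchi_layers("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3")
print(layers)
```

For ethanol the connection layer 1-2-3 says atom 1 is bonded to atom 2, which is bonded to atom 3, and the hydrogen layer assigns one H to atom 3, two to atom 2 and three to atom 1.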
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank – PDB)
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot – but often in many more torsional angle dimensions)
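A quick sketch of why scanning conformational space gets expensive: a systematic scan tries every combination of torsion angles, so the number of trial conformations grows as a power of the number of rotatable bonds. The step sizes and function names below are illustrative assumptions.

```python
from itertools import product

def torsion_grid(n_torsions, step_deg):
    """Yield every combination of torsion angles on a regular grid."""
    angles = range(0, 360, step_deg)
    return product(angles, repeat=n_torsions)

def grid_size(n_torsions, step_deg):
    """Number of grid points: (360 / step) raised to the number of torsions."""
    return (360 // step_deg) ** n_torsions

coarse = list(torsion_grid(2, 120))
print(len(coarse), grid_size(2, 30), grid_size(6, 30))
```

Two torsions at 30 degree steps resemble a Ramachandran-style map; six torsions at the same step already need about three million trial conformations.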
Molecules in 3D – uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions …
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
We can simulate an entire biological receptor and test new molecules to see how they might work
bull Here is a simulation of an important new cancer target
called lsquoApelinrsquo with water and the membrane We used
these to design new molecules that block this receptor
Apelinrsquos role in Cancer
lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an
antagonist (a molecule that blocks the receptor) stop tumour growth
bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma
growth Brain 2017 awx253 httpsdoiorg101093brainawx253
bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et
al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372
So there are very many file formats
ndash which could be a real painbut
they can be convertedalc -- Alchemy file prep -- Amber PREP file
bs -- Ball amp Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical Resource
Kit 2D file
crk3d -- CRK3D Chemical Resource
Kit 3D file
box -- Dock 35 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI BiosymInsight II CAR
file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball amp Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table
file
cht -- Chemtool file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical
Resource Kit 2D file
crk3d -- CRK3D Chemical
Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 35 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats
-the bane of our lives
Interconnection program ndash Babel
Recent IUPAC moves towards a lsquostandardrsquo format however
In the near future there are likely to be many competing
requirements for file content
Molecules on computers ndash things to look out for
since what is stored is actually quite crude
For example-
Stereochemistry may be relative and not absolute or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution)
so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen
and oxygen get confused
Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved
correctly (problems with storing bonding)
Tautomers can be incorrect ndash check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of MWt and topology) are very difficult to store
Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics
halogen-aromatic bonds etc may be inferred but not observed in the file)
2 Finding molecules
A typical problem involves finding the
lsquorightrsquo molecules
All the molecules
synthesisable
1060 Molecules with
relevance to the problem
1010 (that we can generate)
Molecules we can
search efficiently 107
Molecules we can
make and test 103
Serious difference in numbers we can consider a few hundred
molecules in our heads computers can evaluate millions
1 Finding molecules using Substructure searching
One of the most useful lsquopropertiesrsquo of a molecule is its molecular graph
Many software systems allow searching for whole molecules or
fragments of molecules or even lsquosimilarrsquo molecules
For a specific molecule this means specifying the required search pattern to get an
exact match
An example might be Search for an exact match for dimethylaniline Here wersquove
used the PubChem database which contains compounds and pharmacological
screening data on 92 million unique molecules httppubchemncbinlmnihgov
A more complex example ndash find a molecule
containing our query as part of a larger molecule
Results ndash takes a few seconds to search all
92 million molecules
How does this workThe first thing to notice is that the searching is so fast that clearly all the database
is not being searched
The first part of the process is to pre-compute pointers to only the sets of
compounds in the database we are interested in
This is called Hashing
Speeding up compound matching using hash codes
bull The simplest Hash code registers the presence or absence
of fragments in a molecule eg
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
Contains PContains NH2
Doesnrsquot contain F
X
So a quick match of the bitstring of the fragment searched over the
(pre-computed) bitstrings in the database will eliminate most of the
structures
Next step ndash Substructure (pattern) matching
Exact match
Match the canonical SMILES InCHI or full structure
Beware salts etc may need to strip out counterions
Substructure Search
Supposing we wish to find part of a molecule in a database
To do this we have to do a substructure search
A substructure is a sub-graph of the molecular graph This is
an example of substructural fragments of a larger molecule
shown in different colours You could eg look for all the
molecules with the phenol fragment
There are two steps commonly used to do this firstly we
have to number all the molecules consistently (the Morgan
algorithm) for example to compare two structures to see if
they are identical then we have to match the query
substructure to each substructure in the molecule (the
Ullman algorithm)
Substructures in molecules
bull lsquoSubgraphsrsquo can be identified in a structure graph
corresponding to fragments of the whole structure
eg Rings substituents etc
ndash ndashOH
ndash ndashNH2
ndash ndashCOOH
ndash phenyl
bull this can be done by
tracing appropriate
paths in the graph
bull subgraphs may overlap
OH
CH2
CHNH2
OH
O
Matching the query structure to the database
Two algorithms are commonly used
The Morgan algorithm which numbers the molecule uniquely and
The Ullman algorithm which matches the fragments
Note ndash a number
of lsquosubstructuralrsquo fragments can
be matched here (6 in all)
Gund P Ann Rep in Med Chem 14299 1979
Sheridan RP et al J Chem Inf Comp Sci 29255 1989
Brint AT Willett P J Mol Graphics 549 1987
JR Ullman An Algorithm for Subgraph Isomorphism Journal of the Association of Computing
Machinery (JACM) Vol 23 pp 31-42 1976 httpportalacmorgcitationcfmid=321925
Morgan HL The generation of a unique machine description for chemical structures Journal of
Chemical Documentation 5 1965 107-112
Read these for more details-
The basic concept is that molecules are considered as
Graphs
Atoms are lsquonodesrsquo
Bonds are lsquoedgesrsquo
A molecule is an example of a labelled graph nodes can
have labels such as atomic number The nodes are
connected by the edges (and these can also be labelled
eg double bond)
The Morgan algorithm iteratively computes an integer label
for each node in a structure (atoms) For a computer the key
point is that the numbering is consistent (unlike humans) and
fast The Morgan algorithm works as follows
(i) Assign an integer label i to each node considering its
atomic number degree (number of substituents) and types of
bonds
(ii) Update for each label by adding the labels of the adjacent
nodes (connected atoms) and
(iii) Repeat (ii) until the number of different labels does not
increase Then order the nodes by the value of the labels
H L Morgan The generation of a unique machine description for chemical structures -
A technique developed at chemical abstracts service J Chemical Documentation
5107ndash113 1965
The first step - Numbering a molecule uniquely
A simple Morgan algorithm
bullThis is continued until the number of equivalent classes is unchanged
bullNumbering starts from highest then uses next highest connection as 2 etc Like an onion
concentric circles from the highest numbered
bullRules are used to break equalities such as C before N double bond before single etc
bullSome equivalent atoms are arbitrarily numbered
Urea
Adjacency matrix
O
C C
1
2 34
O
C2
C5
C6
C7
C3
1
C4
O1 C2 C3 C4
O1 1
C2 1
C3 1
C4 1 1 1
O1 C2 C3 C4 C5 C6 C7
O1 1
C2 1 1
C3 1 1
C4 1 1 1
C5 1 1
C6 1 1
C7 1 1
Cyclohexanone adjacency
matrix
Numbering is crucial Renumber
for different starting points Also
the lsquoedgesrsquo of the graph are annotated to speed up the matching This is the lsquoslowrsquo part
of lsquomatching subgraph isomorphismrsquo and is of the class of NP-complete problems
The Ullman Algorithm
is a way of detecting
detecting substructures
Next lecture
• How can we use this type of information to solve our chemistry problems?
  - Patents/Markush structures
  - Molecular similarity
  - Molecular properties
  - 3D searching
  - Virtual screening
  - Structure-property/activity relationships
O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C
This is an example of a 'molecule' and is what is actually stored in the computer. Does it make sense to you?
It's artemisinin, which has anti-malarial properties (Nobel Prize).
This language is called SMILES.
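To show that such a string is easy for a program to read, here is a rough sketch that counts the atom symbols written in a SMILES string. It is not a SMILES parser: ring closures, bonds and branches are skipped, and hydrogens implied by valence or given as counts inside brackets (the H in [CH]) are not counted. The pattern and the function name are assumptions made for this example.

```python
import re
from collections import Counter

# Bracket atoms such as [CH] or [H], then two-letter atoms, one-letter atoms, aromatic atoms.
ATOM_PATTERN = re.compile(r"\[([A-Z][a-z]?)[^\]]*\]|(Cl|Br)|([BCNOPSFI])|([bcnops])")

def count_elements(smiles):
    """Count the atom symbols that are written explicitly in a SMILES string."""
    counts = Counter()
    for match in ATOM_PATTERN.finditer(smiles):
        symbol = next(group for group in match.groups() if group)
        counts[symbol.capitalize()] += 1
    return counts

artemisinin = "O=C2O[CH]1O[C](C)(OO4)CC[C](C(C)CC3)([H])[C]41[C]3([H])[CH]2C"
print(count_elements(artemisinin))
```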
A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
You can practise by drawing a structure; a SMILES will be available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
InChI
• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES.
• InChI is generated using computer algorithms and is virtually uninterpretable by a human.
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate InChI, or convert from InChI to structure, from different names and formats.
Again, a string like this is easily matched on a computer.
RSC ChemSpider
Storing chemical diagrams on computers
• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• To solve this we could, e.g., introduce the concept of an aromatic bond.
An example of storing the molecular graph: SD file format
MDL SD (structure data) files store the information about a molecule in the following format:
• Header block: describes the molecule, e.g. its name
• Connection table: defines the molecular structure (atoms and bonds)
• Data block (optional): properties, e.g. volume
• Terminator line: a line containing four dollar signs ($$$$), indicating the end of information on this molecule
This is probably the most common format to store small molecules. SD files are widely used in software to store many molecule structures in databases.
A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it screws up).
The Connection Table - describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.
• e.g. PDB protein file format:
  Connect 3 2 5 4
• e.g. SD file format:
  Bond 3 5 1
  Bond 3 4 1
  Bond 5 6 2
  (reduced)
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.
Example SD file - benzene
benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Annotations:
- Lines 1 and 2: molecule name and information on this molecule
- Line 3: comment (e.g. the date is used here)
- "Counts line": has the number of atoms and bonds as a minimum
- "Atom block": has the x, y, z coordinates of the atom and the element as a minimum
- "Bond block": one line for each bond (from atom, to atom, bond type)
- "M  END" and "$$$$": these identify the end of this molecule
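A program reading this format might look like the sketch below. It is a minimal reader under stated assumptions: the counts line is read from its first two three-character fields, while atom and bond lines are split on whitespace for brevity (a robust reader would slice fixed-width columns throughout, since fields can run together in large molecules). The function name `parse_molfile` is made up for this example.

```python
BENZENE_MOLFILE = """benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""

def parse_molfile(text):
    """Read the counts line, atom block and bond block of a V2000 connection table."""
    lines = text.splitlines()
    counts = lines[3]                                  # three header lines come first
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms, bonds = [], []
    for line in lines[4:4 + n_atoms]:                  # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:   # bond block: from to type ...
        begin, end, order = (int(value) for value in line.split()[:3])
        bonds.append((begin, end, order))
    return atoms, bonds

atoms, bonds = parse_molfile(BENZENE_MOLFILE)
print(len(atoms), "atoms,", len(bonds), "bonds")
```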
Alanine SD file
Annotations:
- Bond length is 1.53 Å
- Atom line fields: x, y, z, symbol, mass difference, charge, stereo, H-count, ...
- Charge field: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
- Molecule is chiral
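The charge codes listed above are not the charges themselves, so a reader has to translate them. A minimal sketch, with a hypothetical function name, could be:

```python
# Charge codes from the atom block, as listed above (code 4 is handled separately).
CHARGE_CODES = {0: 0, 1: +3, 2: +2, 3: +1, 5: -1, 6: -2, 7: -3}

def formal_charge(code):
    """Translate an atom-block charge code into a formal charge.

    Code 4 marks a doublet radical rather than a charge, so it is reported as 0 here.
    """
    if code == 4:
        return 0
    return CHARGE_CODES[code]

print([formal_charge(code) for code in range(8)])
```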
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
Which way round should this ring go?
Example Protein Data Bank file (PDB)
(bonds are listed per atom in CONECT records, i.e. as an adjacency list)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH                   : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Annotations:
- CRYST1: space group and unit cell dimensions
- ATOM: protein atoms, including x, y, z, occupancy and temperature factor
- HETATM: non-protein atoms, including x, y, z, occupancy and temperature factor
- CONECT: bonds
- New format: mmCIF
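Because PDB is a fixed-column format, a reader slices each line by position rather than splitting on spaces. The sketch below writes and re-reads one ATOM line and reads one CONECT line. The column positions follow the published PDB format description as I understand it; the function names are made up, and the trailing element and charge columns are left out.

```python
def format_atom(record, serial, name, res_name, chain, res_seq, x, y, z, occ, b_factor):
    """Write one fixed-width ATOM/HETATM line; name must already be padded to 4 characters."""
    return (f"{record:<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b_factor:6.2f}")

def parse_atom(line):
    """Slice the fixed columns of an ATOM or HETATM record (columns are 1-based in comments)."""
    return {
        "record": line[0:6].strip(),        # columns 1-6
        "serial": int(line[6:11]),          # columns 7-11
        "name": line[12:16].strip(),        # columns 13-16
        "res_name": line[17:20].strip(),    # columns 18-20
        "chain": line[21],                  # column 22
        "res_seq": int(line[22:26]),        # columns 23-26
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),    # columns 55-60
        "b_factor": float(line[60:66]),     # columns 61-66
    }

def parse_conect(line):
    """Return (atom serial, [serials of bonded atoms]) from a CONECT record."""
    serials = [int(line[i:i + 5]) for i in range(6, len(line.rstrip()), 5)]
    return serials[0], serials[1:]

line = format_atom("ATOM", 1, " N  ", "VAL", "A", 1, 3.036, -1.035, -3.538, 1.00, 31.28)
atom = parse_atom(line)
centre, bonded = parse_conect("CONECT 1510 1511 1512 1513 1514")
print(atom["name"], atom["xyz"], centre, bonded)
```

Each CONECT line lists one atom followed by its neighbours, so the bonds of the sulfate ion are recovered as a neighbour list for atom 1510.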
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (x, y, z) coordinates is very cumbersome, so instead we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
  - Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
  - Early form of specification of a starting geometry for molecules: sometimes graph paper was used to draw the molecule and get a starting set of coordinates before optimisation
  - A Z-matrix uses the following geometric descriptions to describe molecules:
    Bond length
    Bond angle
    Torsion angle
    Out-of-plane bending (e.g. a carbonyl)
    Non-bonded distance
[Figure: internal coordinates, with dummy atom (du) positions marked]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Figure: Z-matrix and 'improper' torsion angles]
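To see how internal coordinates determine the geometry, the sketch below converts a Z-matrix into Cartesian coordinates, using the water values above (bond length 0.96, angle 104.0). It is an illustration under stated assumptions: variables are already substituted by their values, each row is a tuple (symbol, NA, distance, NB, angle, NC, dihedral), and the sign convention chosen for the dihedral is an assumption that may differ from a given program's. All function names are hypothetical.

```python
import math

def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def _unit(v):
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def zmatrix_to_cartesian(rows):
    """Convert rows (symbol, NA, r, NB, angle, NC, dihedral) to Cartesian coordinates.

    Atom references are 1-based, angles are in degrees, unused fields are None.
    NA, NB and NC must not be collinear when a dihedral is given.
    """
    coords = []
    for symbol, na, r, nb, angle, nc, dihedral in rows:
        if na is None:                          # atom 1 sits at the origin
            coords.append([0.0, 0.0, 0.0])
            continue
        pa = coords[na - 1]
        if nb is None:                          # atom 2 lies along the z axis
            coords.append([pa[0], pa[1], pa[2] + r])
            continue
        pb = coords[nb - 1]
        axis = _unit(_sub(pa, pb))              # unit vector from NB to NA
        theta = math.radians(angle)
        if nc is None:                          # atom 3 goes in the xz plane
            m, n, phi = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.0
        else:
            pc = coords[nc - 1]
            n = _unit(_cross(_sub(pb, pc), axis))
            m = _cross(n, axis)                 # points to NC's side of the NB-NA axis
            phi = math.radians(dihedral)
        coords.append([pa[i] + r * (-math.cos(theta) * axis[i]
                                    + math.sin(theta) * math.cos(phi) * m[i]
                                    + math.sin(theta) * math.sin(phi) * n[i])
                       for i in range(3)])
    return coords

water = zmatrix_to_cartesian([
    ("O", None, None, None, None, None, None),
    ("H", 1, 0.96, None, None, None, None),
    ("H", 1, 0.96, 2, 104.0, None, None),
])
print(round(math.dist(water[1], water[2]), 3))   # H...H distance
```

Changing a single number, for example the angle, moves the atoms consistently, which is why internal coordinates are convenient for driving a reaction coordinate.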
What about comprehensive properties of molecules? They are more than x, y, z.
XML and molecules
• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties, etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
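As a rough illustration of "can be parsed", the sketch below reads a small CML-style description of ethanol with Python's standard XML library. The fragment is simplified and hand-written for this example: the molecule, atomArray, atom, bondArray and bond element names follow CML as I understand it, while the property element and its attributes are an illustrative assumption rather than exact CML.

```python
import xml.etree.ElementTree as ET

ETHANOL_CML = """<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point" units="degC">78.4</property>
</molecule>"""

root = ET.fromstring(ETHANOL_CML)
elements = [atom.get("elementType") for atom in root.iter("atom")]
bonds = [tuple(bond.get("atomRefs2").split()) for bond in root.iter("bond")]
prop = root.find("property")

# The metadata (what the number means and its units) travels with the value.
print(elements, bonds, prop.get("title"), prop.text, prop.get("units"))
```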
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).
• From these can be obtained atom positions, bonds, coordinates, etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor.
Apelin's role in cancer
'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats, which could be a real pain, but they can be converted:
alc -- Alchemy file prep -- Amber PREP file
bs -- Ball & Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball & Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file
cht -- Chemtool file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program: Babel
There are recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.
Molecules on computers - things to look out for, since what is stored is actually quite crude
For example:
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect: check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).
2. Finding molecules
A typical problem involves finding the 'right' molecules:
• All the molecules synthesisable: 10^60
• Molecules with relevance to the problem (that we can generate): 10^10
• Molecules we can search efficiently: 10^7
• Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.
1. Finding molecules using substructure searching
One of the most useful 'properties' of a molecule is its molecular graph. Many software systems allow searching for whole molecules, fragments of molecules, or even 'similar' molecules.
For a specific molecule this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules (http://pubchem.ncbi.nlm.nih.gov).
A more complex example: find a molecule containing our query as part of a larger molecule.
Results: it takes a few seconds to search all 92 million molecules.
How does this work?
The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.
The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
  Bit:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
  Value: 0 0 0 0 0 1 0 0 0  0  0  1  0  0  0  0
  (slide annotations: a set bit means e.g. 'contains P' or 'contains NH2'; an unset bit means e.g. 'doesn't contain F')
So a quick match of the bitstring of the fragment searched over the (pre-computed) bitstrings in the database will eliminate most of the structures.
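The bitstring screen can be sketched as below. The fragment dictionary, the bit positions and the three database entries are invented for illustration; real systems use much longer fingerprints and decide which fragments a molecule contains by analysing its structure.

```python
# Hypothetical fragment dictionary: each fragment owns one bit position.
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "COOH": 4, "phenyl": 5}

def fingerprint(fragments):
    """Pack the fragments present in a molecule into an integer used as a bitstring."""
    bits = 0
    for name in fragments:
        bits |= 1 << FRAGMENT_BITS[name]
    return bits

def passes_screen(query_bits, molecule_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (query_bits & molecule_bits) == query_bits

# Pre-computed once for the whole database.
database = {
    "tyrosine": fingerprint(["NH2", "OH", "COOH", "phenyl"]),
    "aniline": fingerprint(["NH2", "phenyl"]),
    "benzoic acid": fingerprint(["COOH", "phenyl"]),
}

query = fingerprint(["NH2", "phenyl"])
candidates = [name for name, bits in database.items() if passes_screen(query, bits)]
print(candidates)
```

The screen only rules molecules out. Those that pass still have to go through the slower atom-by-atom substructure match.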
Next step - substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware salts etc.: you may need to strip out counterions.
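A minimal sketch of an exact-match lookup is shown below. It assumes the SMILES strings are already canonical (canonicalisation is not shown), uses a made-up one-entry registry, and strips counterions crudely by keeping the longest dot-separated component, which is only a rough stand-in for a proper salt stripper.

```python
def strip_counterions(smiles):
    """Keep the longest dot-separated component of a SMILES string."""
    return max(smiles.split("."), key=len)

# Hypothetical registry keyed by canonical SMILES.
registry = {"CN(C)c1ccccc1": "N,N-dimethylaniline"}

def exact_match(smiles):
    """Look up a structure by its (assumed canonical) SMILES after removing counterions."""
    return registry.get(strip_counterions(smiles))

print(exact_match("CN(C)c1ccccc1.Cl"))   # hydrochloride salt of the same base
```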
Substructure search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a subgraph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
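The matching step can be illustrated with a plain backtracking search. Note that this is not the Ullmann algorithm itself, which adds a refinement step to prune candidates early; it is a simpler stand-in that finds the same kind of atom-to-atom mapping. Bond orders are ignored and the function name is hypothetical.

```python
def find_substructure(query, target):
    """Return one mapping {query atom: target atom}, or None if there is no match.

    Each graph is a pair (labels, neighbours): labels maps atom id -> element symbol,
    neighbours maps atom id -> set of bonded atom ids.
    """
    q_labels, q_nbrs = query
    t_labels, t_nbrs = target
    order = list(q_labels)

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        q_atom = order[len(mapping)]
        for t_atom in t_labels:
            if t_atom in mapping.values() or t_labels[t_atom] != q_labels[q_atom]:
                continue
            # Every already-mapped neighbour of the query atom must be bonded to t_atom.
            if all(mapping[q] in t_nbrs[t_atom] for q in q_nbrs[q_atom] if q in mapping):
                mapping[q_atom] = t_atom
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[q_atom]          # backtrack and try the next candidate
        return None

    return extend({})

# Query: a carbon bonded to an oxygen. Target: O1, C2 and C3 each bonded to C4.
carbonyl = ({1: "C", 2: "O"}, {1: {2}, 2: {1}})
fragment = ({1: "O", 2: "C", 3: "C", 4: "C"}, {1: {4}, 2: {4}, 3: {4}, 4: {1, 2, 3}})
print(find_substructure(carbonyl, fragment))
```

The search first tries C2 and C3 for the query carbon, fails to attach the oxygen, and backtracks until it reaches C4. This trial and error is what makes the general problem slow.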
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents, etc.
  - -OH
  - -NH2
  - -COOH
  - phenyl
• This can be done by tracing appropriate paths in the graph.
• Subgraphs may overlap.
[Figure: example structure with the groups OH, CH2, CH, NH2, OH and O labelled]
Matching the query structure to the database
Two algorithms are commonly used:
• The Morgan algorithm, which numbers the molecule uniquely, and
• The Ullmann algorithm, which matches the fragments.
Note: a number of 'substructural' fragments can be matched here (6 in all).
Read these for more details:
Gund P., Ann. Rep. in Med. Chem. 14:299, 1979
Sheridan R. P. et al., J. Chem. Inf. Comp. Sci. 29:255, 1989
Brint A. T., Willett P., J. Mol. Graphics 5:49, 1987
J. R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan H. L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-112
What if we want to vary the atom
positions eg in driving a
reaction coordinate
Using Cartesian (xyz)
coordinates is very
cumbersome so instead
we use the natural angles
and distances
This uses internal coordinates
bull Also called a Z-matrix
ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)
ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation
ndash A z-matrix uses the following geometric descriptions to describe molecules
Bond Length
Bond angle
Torsion angle
Out of plane bending
eg a carbonyl
Non-bonded distance
[Figure: Internal Coordinates. Example structures showing dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

Z-matrix: 'improper' torsion angles
What about comprehensive properties of molecules? They are more than xyz
XML and molecules
• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
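A small sketch of what "can be parsed" means in practice. The fragment below is a simplified, CML-style description of ethanol written for this example (heavy atoms only); the element and attribute names follow common CML usage, but the flat `property` element with a `units` attribute is a simplification and the boiling point is an approximate literature value. The function name `read_molecule` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified CML-style fragment (illustrative, not a validated CML document).
ETHANOL_CML = """\
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point" units="celsius">78.4</property>
</molecule>
"""


def read_molecule(xml_text):
    """Return (atoms, bonds, properties) from the fragment above."""
    root = ET.fromstring(xml_text)
    atoms = {a.get("id"): a.get("elementType") for a in root.iter("atom")}
    bonds = [tuple(b.get("atomRefs2").split()) + (int(b.get("order")),)
             for b in root.iter("bond")]
    # The metadata (here the units) travels with the value it describes.
    props = {p.get("title"): (float(p.text), p.get("units"))
             for p in root.iter("property")}
    return atoms, bonds, props


atoms, bonds, props = read_molecule(ETHANOL_CML)
print(atoms)
print(bonds)
print(props)
```

Because every value is wrapped in a named element, a program can pick out the parts it understands and ignore the rest, which is what makes the format extensible.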
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
Molecules and 3-Dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (x-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).
• From these, atom positions, bonds, coordinates etc. can be obtained.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions...
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor.
Apelin's role in cancer
'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats, which could be a real pain, but they can be converted
alc -- Alchemy file    prep -- Amber PREP file
bs -- Ball & Stick file    caccrt -- Cacao Cartesian file
ccc -- CCC file    c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file    cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file    dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file    car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file    bs -- Ball & Stick file
caccrt -- Cacao Cartesian file    cacint -- Cacao Internal file
cache -- CAChe MolStruct file    c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file    ct -- ChemDraw Connection Table file
cht -- Chemtool file    cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file    box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.
Molecules on computers - things to look out for, since what is stored is actually quite crude
For example:
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect: check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of MWt and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).
2. Finding molecules
A typical problem involves finding the 'right' molecules
All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.
1. Finding molecules using substructure searching
One of the most useful 'properties' of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, or fragments of molecules, or even 'similar' molecules.
For a specific molecule this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example: find a molecule containing our query as part of a larger molecule.
Results: it takes a few seconds to search all 92 million molecules.
How does this work?
The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.
The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
A set bit (1) records that a fragment is present (e.g. 'contains P', 'contains NH2'); an unset bit (0) records that it is absent (e.g. 'doesn't contain F').
So a quick match of the bitstring of the fragment searched for against the (pre-computed) bitstrings in the database will eliminate most of the structures.
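The screening step can be sketched in a few lines. The fragment key list and the tiny hand-annotated "database" below are invented for illustration (a real system derives the fragments from the structure); the point is only the bitwise test: a molecule can contain the query only if every bit set in the query is also set in the molecule.

```python
# Hypothetical fragment keys; one bit per key.
FRAGMENT_KEYS = ["P", "NH2", "F", "OH", "COOH", "phenyl"]


def fingerprint(fragments):
    """Pack the set of fragments present into an integer bitstring."""
    bits = 0
    for i, key in enumerate(FRAGMENT_KEYS):
        if key in fragments:
            bits |= 1 << i
    return bits


def may_contain(query_bits, molecule_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (query_bits & molecule_bits) == query_bits


def screen(query_fragments, database):
    """Return the names that survive the screen; only these need full matching."""
    q = fingerprint(query_fragments)
    return [name for name, bits in database.items() if may_contain(q, bits)]


# Pre-computed once, when the database is built (fragments assigned by hand here).
DATABASE = {
    "aniline": fingerprint({"NH2", "phenyl"}),
    "phenol": fingerprint({"OH", "phenyl"}),
    "tyrosine": fingerprint({"NH2", "OH", "COOH", "phenyl"}),
    "glycine": fingerprint({"NH2", "COOH"}),
}

print(screen({"OH", "phenyl"}, DATABASE))
```

The screen can only rule molecules out. Anything that passes still has to go through the slower atom-by-atom substructure match, because sharing the same fragments does not guarantee they are connected in the same way.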
Next step: substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware of salts etc.: you may need to strip out counterions.
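A minimal sketch of exact matching on a line notation, including the salt problem. It assumes the SMILES strings are already canonical (no canonicalisation is done here), and it uses a deliberately crude rule for counterions: keep the longest dot-separated component of the SMILES. The function names and example records are made up for the illustration.

```python
def strip_counterions(smiles):
    """Keep the longest dot-separated component (crude stand-in for 'largest fragment')."""
    return max(smiles.split("."), key=len)


def build_index(records):
    """Map each stripped structure string to the names stored under it."""
    index = {}
    for name, smiles in records:
        index.setdefault(strip_counterions(smiles), []).append(name)
    return index


RECORDS = [
    ("sodium acetate", "CC(=O)[O-].[Na+]"),
    ("acetate", "CC(=O)[O-]"),
    ("ethanol", "CCO"),
]

index = build_index(RECORDS)
# An exact-match query is now a single dictionary lookup.
print(index.get(strip_counterions("CC(=O)[O-].[K+]"), []))
```

String length is not a reliable measure of fragment size (for example when the counterion is the larger species), so a real system would compare atom counts on the parsed structure instead.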
Substructure Search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
- -OH
- -NH2
- -COOH
- phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap
[Figure: example molecule with OH, CH2, CHNH2, OH and O groups labelled]
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.
Note: a number of 'substructural' fragments can be matched here (6 in all).
Read these for more details:
Gund, P., Ann. Rep. in Med. Chem. 14, 299, 1979
Sheridan, R. P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989
Brint, A. T., Willett, P., J. Mol. Graphics 5, 49, 1987
J. R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan, H. L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113
The basic concept is that molecules are considered as graphs
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service. J. Chemical Documentation 5, 107-113, 1965
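The three steps can be sketched as follows. This is the classic "extended connectivity" form, which is an assumption about the exact variant intended: the initial label is simply the degree (atomic number and bond types are ignored), and each new label is the sum of the neighbours' labels. The graph is given as a dictionary of neighbour lists; the function name and the pentane example are chosen for illustration.

```python
def morgan_labels(adjacency):
    """adjacency: dict atom -> list of neighbouring atoms. Returns atom -> integer label."""
    # Step (i): start from the degree of each node.
    labels = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): replace each label by the sum of its neighbours' labels.
        new = {atom: sum(labels[n] for n in nbrs) for atom, nbrs in adjacency.items()}
        new_classes = len(set(new.values()))
        # Step (iii): stop as soon as the number of distinct labels stops growing,
        # keeping the last set of labels that did improve the discrimination.
        if new_classes <= n_classes:
            return labels
        labels, n_classes = new, new_classes


# Carbon skeleton of n-pentane: C1-C2-C3-C4-C5.
PENTANE = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

labels = morgan_labels(PENTANE)
print(labels)
# Order the nodes by label, highest first; ties are symmetry-equivalent atoms.
print(sorted(PENTANE, key=lambda atom: -labels[atom]))
```

Atoms that end with the same label (here the two ends, and the two carbons next to them) are equivalent by symmetry, which is why further tie-breaking rules are needed to obtain a single unique numbering.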
The first step - numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest numbered
• Rules are used to break equalities, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered
Urea
Adjacency matrix
[Figure: numbered graphs of urea (atoms 1-4) and cyclohexanone (atoms O1, C2-C7)]
      O1  C2  C3  C4
O1     0   0   0   1
C2     0   0   0   1
C3     0   0   0   1
C4     1   1   1   0
Cyclohexanone adjacency matrix (atoms O1, C2-C7)
Entries per row: O1 1; C2 2; C3 2; C4 3; C5 2; C6 2; C7 2
Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching: 'subgraph isomorphism' is of the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures.
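To show what the matching step involves, here is a simplified backtracking search for a query graph inside a target graph. It is in the spirit of Ullmann's method (candidate lists per query atom, extended one atom at a time) but leaves out his refinement step, and it ignores bond orders. The function name, the dictionary layout and the atom numbering of cyclohexanone (O1 bonded to C2, ring C2 to C7) are choices made for this example, not the numbering in the figure.

```python
def find_substructure(q_labels, q_adj, t_labels, t_adj):
    """Return one mapping {query atom: target atom}, or None if there is no match.

    *_labels: dict atom -> element symbol; *_adj: dict atom -> set of neighbours.
    """
    q_atoms = list(q_labels)
    # Candidates for each query atom: same element and at least as many neighbours.
    candidates = {
        q: [t for t in t_labels
            if t_labels[t] == q_labels[q] and len(t_adj[t]) >= len(q_adj[q])]
        for q in q_atoms
    }

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return dict(mapping)
        q = q_atoms[len(mapping)]
        for t in candidates[q]:
            if t in mapping.values():
                continue
            # Every already-mapped neighbour of q must map onto a neighbour of t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[q]          # backtrack
        return None

    return extend({})


CYCLOHEXANONE_LABELS = {1: "O", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C", 7: "C"}
CYCLOHEXANONE_ADJ = {1: {2}, 2: {1, 3, 7}, 3: {2, 4}, 4: {3, 5},
                     5: {4, 6}, 6: {5, 7}, 7: {6, 2}}

# Query: a carbon bonded to an oxygen.
print(find_substructure({1: "C", 2: "O"}, {1: {2}, 2: {1}},
                        CYCLOHEXANONE_LABELS, CYCLOHEXANONE_ADJ))
```

In the worst case the search tries exponentially many partial mappings, which is the practical face of the NP-completeness mentioned above; the bitstring screen exists so that this step runs on as few molecules as possible.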
Next lecture
• How can we use this type of information to solve our chemistry problems?
- Patents/Markush structures
- Molecular similarity
- Molecular properties
- 3D searching
- Virtual screening
- Structure-Property/Activity Relationships
A SMILES tutorial is available at
http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html
You can practise by drawing a structure; a SMILES will be available by picking the smiley face:
http://www.molinspiration.com/cgi-bin/properties?textMode=1
InChI
• A more recent line notation is called InChI
• This will address some of the problems of SMILES, e.g. polymers and materials not covered by SMILES
• InChI is generated using computer algorithms and is virtually uninterpretable by humans
• Importantly, it is commonly used as a unique chemical identifier: each molecule should theoretically have a unique InChI. One molecule, one InChI
• Websites are available that can generate an InChI, or convert from an InChI to a structure, from different names and formats
Again, a string like this is easily matched on a computer.
RSC ChemSpider
Storing chemical diagrams on computers
• The valence model of a molecule can be represented by a chemical graph. A simple graph contains nodes (atoms) and edges (bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and crossings are irrelevant. Generally we ignore hydrogens unless tautomerism or pKa is an issue. Computers handle graphs very well, and molecules represented like this are examples of labelled graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and aromaticity, stereochemistry, tautomerism, non-stoichiometric compounds etc. are often problematic. The computer would (using a simple graph) deduce that these two canonical structures are different molecules.
• E.g. to solve this we could introduce the concept of an aromatic bond.
An example of storing the molecular graph: SD file format
MDL SD (structure data) files store the information about a molecule in the following format:
Header block
describes the molecule, e.g. its name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
properties, e.g. volume
Terminator line
a line containing four dollar signs ($$$$), indicating the end of information on this molecule
This is probably the most common format for storing small molecules. SD files are widely used in software to store many molecular structures in databases.
A "format" in computer science is a precisely described order of data. The program reading it expects the information in exactly the right place (or it fails).
The Connection Table - describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.
e.g. PDB protein file format:
Connect 3 2 5 4
e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.
Example SD file - benzene
benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Annotations on the file above:
Line 1: molecule name
Line 2: information on this molecule
Line 3: comment (e.g. the date is used here)
"Counts line": has the number of atoms and bonds as a minimum
"Atom block": has the xyz coordinates of the atom and the element as a minimum
"Bond block": one line for each bond (from atom, to atom, bond type)
"M  END" and "$$$$": these identify the end of this molecule
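The layout described above can be read with a short parser. This is a sketch under simplifying assumptions: it handles a single V2000 record, takes only the minimum fields (name, atom symbols with coordinates, and bonds), and splits lines on whitespace. The real format is fixed-width, so whitespace splitting fails when fields run together (for example 100 or more atoms on the counts line). The names `BENZENE_SD` and `parse_sd_record` are invented for the example.

```python
BENZENE_SD = """\
benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_sd_record(text):
    """Return (name, atoms, bonds) for one record."""
    lines = text.splitlines()
    name = lines[0].strip()                      # header block, line 1
    counts = lines[3].split()                    # counts line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:            # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:   # bond block: from to type ...
        a, b, order = (int(v) for v in line.split()[:3])
        bonds.append((a, b, order))
    return name, atoms, bonds


name, atoms, bonds = parse_sd_record(BENZENE_SD)
print(name, len(atoms), "atoms", len(bonds), "bonds")
print(bonds)
```

Note that the three double and three single bonds alternate around the ring: the file stores one Kekule structure, which is exactly the situation where a simple graph comparison can mistake two drawings of benzene for different molecules.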
Alanine SD file
Bond length is 1.53 Å
Atom line fields: x y z, symbol, mass difference, charge, stereo, H-count...
Charge codes:
0 = uncharged, or a value other than these
1 = +3, 2 = +2, 3 = +1
4 = doublet radical
5 = -1, 6 = -2, 7 = -3
Molecule is chiral
What if we only know the atom positions and not the bonds?
A key example would be x-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an x-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
[Figure caption: Which way round should this ring go?]
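One common way of imputing bonds from positions alone is a distance test: two atoms are taken to be bonded if they are closer than the sum of their covalent radii plus a tolerance. The sketch below applies this to the sulfate ion from the PDB example (coordinates with decimal points restored). The radii are approximate literature values, and the 0.4 Å tolerance and the function name are assumptions for the illustration.

```python
import math

# Approximate covalent radii in angstroms.
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}


def infer_bonds(atoms, tolerance=0.4):
    """atoms: list of (element, x, y, z). Returns index pairs judged to be bonded."""
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (ei, *pi), (ej, *pj) = atoms[i], atoms[j]
            distance = math.dist(pi, pj)
            if distance <= COVALENT_RADII[ei] + COVALENT_RADII[ej] + tolerance:
                bonds.append((i, j))
    return bonds


SULFATE = [
    ("S", 25.993, -1.362, 5.893),
    ("O", 25.504, -1.159, 4.508),
    ("O", 27.282, -0.770, 6.155),
    ("O", 24.953, -1.035, 6.850),
]

print(infer_bonds(SULFATE))
```

The test says nothing about bond order or about which of two similar atoms (such as N and O) sits where, so the ambiguities listed under "things to look out for" remain even when the connectivity comes out right.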
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION 1.80 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Annotations on the file above:
CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including xyz, occupancy and temperature factor
HETATM: non-protein atoms, including xyz, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
What if we want to vary the atom
positions eg in driving a
reaction coordinate
Using Cartesian (xyz)
coordinates is very
cumbersome so instead
we use the natural angles
and distances
This uses internal coordinates
bull Also called a Z-matrix
ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)
ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation
ndash A z-matrix uses the following geometric descriptions to describe molecules
Bond Length
Bond angle
Torsion angle
Out of plane bending
eg a carbonyl
Non-bonded distance
C
O
du
du
O
N OH
Dummy atom
positions
Internal
Coordinates
How to construct a Z-matrix (in Gaussian format)
1For the first atom to be defined give the atomic symbol
only
2For the second atom give the atomic symbol the number
1 and the name of a variable to describe the distance
between atoms 1 and 2
3For the third atom give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB and
the name of a variable to describe the angle between the
current atom NA and NB
4For all later atoms give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB the
name of a variable to describe the angle between the current
atom NA and NB the atom number of another previously
defined atom NC and finally the name of a variable to
describe the dihedral angle between the current atom NA
NB and NC
5After all the atoms have been listed enter a blank line
6Next list each variable with its corresponding value Use a
separate line for each variable
7In some cases where some of the variables are to be fixed
as constants in a geometry optimisation they are listed here
after a blank line rather than above
with the real variables
8End the Z-matrix with a blank line
Water (C2v)
O
H 1 l1
H l l1 2 a1
l1 096
a1 1040
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 1800
l1 142
l2 109
l3 109
l4 109
l5 109
l6 10
a1 1090
a2 1100
a3 1080
a4 1100
a5 1100
da1 600
da2 1200
da3 600 z-matrix
lsquoimproperrsquo torsion angles
What about comprehensive
properties of molecules ndash they
are more than xyz
XML and molecules
bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc
bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file
bull Chemical Markup Language (CML) is being developed specifically for chemistry
bull In the future much more information will be stored with molecules allowing greater re-use of data
bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928
Ethanol
ltCMLgt
-Can be parsed
-Can contain reactions
properties etc
-Can contain
relationships to other
molecules and also
concepts
InChI
InChI=1C2H6Oc1-2-3h3H2H21H3
SMILES
C(C)O
Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not
really flat because of thermal fluctuations So we represent 3D
molecules by including their coordinates or their internal coordinates
bull Obtaining the 3-dimensional coordinates can involve experiment (x-
ray electron or neutron diffraction eg the Cambridge
Crystallographic Database or the Protein Databank ndash PDB)
bull From these can be obtained atom positions bonds coordinates etc
bull There are a number of 3-D construction methods available such as
Corina or Concord (put in a SMILES and get a 3-D molecule) which
use rules derived from experiment
bull Molecules can also be constructed in 2D and subjected to molecular
mechanics or Quantum Mechanics calculations to obtain 3D structures
bull Conformation still remains to be deduced There are many methods
that deduce conformations usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot ndash but often
in many more torsional angle dimensions)
Molecules in 3D - uses
bull More accurate calculation of molecular
properties
bull Comparison of the shapes (conformations)
of molecules
bull Comparison of the dynamics of molecules
bull Calculation of bulk properties
bull Simulation of chemical reactionshelliphellip
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon)
Molecular Dynamics is the most popular method for larger systems Here is
an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a
series of snapshots of SD files concatenated together to make a movie just
like a film strip
httpsenwikipediaorgwikiCHARMM
httpambermdorg
httpwwwksuiuceduResearchnamd
httpwwwksuiuceduResearchvmd
We can simulate an entire biological receptor
and test new molecules to see how they might
work
bull Here is a simulation of an important new cancer target
called lsquoApelinrsquo with water and the membrane We used
these to design new molecules that block this receptor
Apelinrsquos role in Cancer
lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an
antagonist (a molecule that blocks the receptor) stop tumour growth
bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma
growth Brain 2017 awx253 httpsdoiorg101093brainawx253
bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et
al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372
So there are very many file formats
ndash which could be a real painbut
they can be convertedalc -- Alchemy file prep -- Amber PREP file
bs -- Ball amp Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical Resource
Kit 2D file
crk3d -- CRK3D Chemical Resource
Kit 3D file
box -- Dock 35 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI BiosymInsight II CAR
file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball amp Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table
file
cht -- Chemtool file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical
Resource Kit 2D file
crk3d -- CRK3D Chemical
Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 35 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats
-the bane of our lives
Interconnection program ndash Babel
Recent IUPAC moves towards a lsquostandardrsquo format however
In the near future there are likely to be many competing
requirements for file content
Molecules on computers ndash things to look out for
since what is stored is actually quite crude
For example-
Stereochemistry may be relative and not absolute or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution)
so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen
and oxygen get confused
Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved
correctly (problems with storing bonding)
Tautomers can be incorrect ndash check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of MWt and topology) are very difficult to store
Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics
halogen-aromatic bonds etc may be inferred but not observed in the file)
2. Finding molecules
A typical problem involves finding the ‘right’ molecules:
All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred
molecules in our heads; computers can evaluate millions.
1. Finding molecules using substructure searching
One of the most useful ‘properties’ of a molecule is its molecular graph.
Many software systems allow searching for whole molecules,
fragments of molecules or even ‘similar’ molecules.
For a specific molecule this means specifying the required search pattern to get an
exact match.
An example might be: search for an exact match for dimethylaniline. Here we’ve
used the PubChem database, which contains compounds and pharmacological
screening data on 92 million unique molecules (http://pubchem.ncbi.nlm.nih.gov).
A more complex example – find a molecule
containing our query as part of a larger molecule.
Results – it takes a few seconds to search all
92 million molecules.
How does this work?
The first thing to notice is that the searching is so fast that clearly the whole database
is not being searched.
The first part of the process is to pre-compute pointers to only the sets of
compounds in the database we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence
of fragments in a molecule, e.g.
Bit:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Value: 0 0 0 0 0 1 0 0 0  0  0  1  0  0  0  0
A set bit records, e.g., ‘contains P’ or ‘contains NH2’; an unset bit records, e.g., ‘doesn’t contain F’.
So a quick match of the bitstring of the query fragment against the
(pre-computed) bitstrings in the database will eliminate most of the
structures.
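The screening step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fragment dictionary `FRAGMENT_KEYS`, the helper names and the tiny database are hypothetical, and the fragments present in each molecule are supplied by hand rather than perceived from a structure.

```python
# Hypothetical fragment dictionary: one bit per fragment, bit 1 first.
# Positions 6 (P) and 12 (NH2) are chosen to mirror the example bitstring.
FRAGMENT_KEYS = ["Cl", "Br", "F", "S", "O", "P", "C=O", "OH",
                 "COOH", "phenyl", "NO2", "NH2", "I", "Si", "B", "Se"]


def make_fingerprint(fragments_present):
    """Pack the set of fragments found in a molecule into an integer bitstring."""
    bits = 0
    for position, key in enumerate(FRAGMENT_KEYS):
        if key in fragments_present:
            bits |= 1 << position
    return bits


def as_bitstring(bits):
    """Render the fingerprint as 0/1 values, bit 1 on the left."""
    return " ".join("1" if (bits >> p) & 1 else "0"
                    for p in range(len(FRAGMENT_KEYS)))


def passes_screen(query_bits, molecule_bits):
    """A molecule can only contain the query if it has every bit the query has."""
    return (query_bits & molecule_bits) == query_bits


# Pre-computed fingerprints (fragment sets assigned by hand for illustration).
database = {
    "aniline": make_fingerprint({"phenyl", "NH2"}),
    "phenol": make_fingerprint({"phenyl", "OH", "O"}),
    "fluorobenzene": make_fingerprint({"phenyl", "F"}),
    "4-aminophenol": make_fingerprint({"phenyl", "NH2", "OH", "O"}),
}

query = make_fingerprint({"NH2"})
candidates = [name for name, bits in database.items()
              if passes_screen(query, bits)]
```

Only the molecules left in `candidates` need to go on to the slower atom-by-atom matching; the screen can produce false positives but never discards a true match.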
Next step – substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware salts etc.: you may need to strip out counterions.
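As a rough illustration of stripping counterions before an exact match, the sketch below keeps the largest dot-separated SMILES fragment. It assumes the SMILES strings have already been canonicalised by the same toolkit (plain string comparison is not structure comparison), and the function names and example records are hypothetical.

```python
import re

# One token per atom: a bracket atom, a two-letter organic-subset symbol,
# a one-letter organic-subset symbol, or an aromatic lower-case symbol.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def count_atoms(smiles_fragment):
    """Count the atoms written explicitly in one SMILES fragment."""
    return len(ATOM_PATTERN.findall(smiles_fragment))


def strip_counterions(smiles):
    """Keep only the largest '.'-separated fragment of a SMILES string."""
    return max(smiles.split("."), key=count_atoms)


def exact_match(query, records):
    """Names of records whose main fragment equals the query's main fragment."""
    key = strip_counterions(query)
    return [name for name, smiles in records.items()
            if strip_counterions(smiles) == key]


records = {
    "sodium acetate": "[Na+].CC(=O)[O-]",
    "potassium acetate": "[K+].CC(=O)[O-]",
    "aniline hydrochloride": "c1ccccc1N.Cl",
}
hits = exact_match("CC(=O)[O-]", records)
```

Note that this leaves the main fragment in its charged form; a production pipeline would usually also neutralise it, which is not attempted here.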
Substructure search
Suppose we wish to find part of a molecule in a database.
To do this we have to do a substructure search.
A substructure is a subgraph of the molecular graph. This is
an example of substructural fragments of a larger molecule,
shown in different colours. You could, e.g., look for all the
molecules with the phenol fragment.
Two steps are commonly used to do this. Firstly, we
have to number all the molecules consistently (the Morgan
algorithm), for example to compare two structures to see if
they are identical. Then we have to match the query
substructure to each substructure in the molecule (the
Ullmann algorithm).
Substructures in molecules
• ‘Subgraphs’ can be identified in a structure graph,
corresponding to fragments of the whole structure,
e.g. rings, substituents etc.
– –OH
– –NH2
– –COOH
– phenyl
• This can be done by tracing appropriate paths in the graph.
• Subgraphs may overlap.
[Figure: example structure with atom labels OH, CH2, CHNH2, OH, O]
Matching the query structure to the database
Two algorithms are commonly used:
the Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.
Note – a number of ‘substructural’ fragments can be matched here (6 in all).
Read these for more details:
Gund P., Ann. Rep. Med. Chem. 14, 299 (1979)
Sheridan R.P. et al., J. Chem. Inf. Comput. Sci. 29, 255 (1989)
Brint A.T., Willett P., J. Mol. Graphics 5, 49 (1987)
Ullmann J.R., An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing
Machinery (JACM) 23, 31–42 (1976), http://portal.acm.org/citation.cfm?id=321925
Morgan H.L., The generation of a unique machine description for chemical structures, Journal of
Chemical Documentation 5, 107–113 (1965)
The basic concept is that molecules are considered as graphs:
atoms are ‘nodes’,
bonds are ‘edges’.
A molecule is an example of a labelled graph: nodes can
have labels such as atomic number. The nodes are
connected by the edges (and these can also be labelled,
e.g. double bond).
The Morgan algorithm iteratively computes an integer label
for each node (atom) in a structure. For a computer, the key
point is that the numbering is consistent (unlike humans) and
fast. The Morgan algorithm works as follows:
(i) Assign an initial integer label to each node, considering its
atomic number, degree (number of substituents) and types of
bonds;
(ii) Update each label by adding the labels of the adjacent
nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not
increase. Then order the nodes by the value of their labels.
H. L. Morgan, The generation of a unique machine description for chemical structures –
a technique developed at Chemical Abstracts Service, J. Chem. Doc.
5, 107–113 (1965)
The first step – numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged.
• Numbering starts from the highest label, then uses the next highest connection as 2, etc. Like an onion:
concentric circles out from the highest-numbered atom.
• Rules are used to break ties, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered.
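The iteration can be sketched as follows. This is a minimal illustration, not a full canonical numbering: the initial label is just the degree (atomic number and bond types are ignored), step (ii) is read as "old label plus the sum of the neighbours' labels" (the classic formulation is usually described as summing the neighbours only), and no tie-breaking rules are applied. The cyclohexanone atom indices are chosen for this example only.

```python
def morgan_labels(neighbours):
    """Iterate Morgan-style labels until the number of distinct labels stops rising.

    neighbours maps each atom index to the list of atoms bonded to it
    (heavy atoms only).
    """
    # Step (i): initial label = degree of the atom (an assumption of this sketch).
    labels = {atom: len(bonded) for atom, bonded in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): add the labels of the adjacent atoms to each label.
        updated = {atom: labels[atom] + sum(labels[b] for b in bonded)
                   for atom, bonded in neighbours.items()}
        n_updated = len(set(updated.values()))
        # Step (iii): stop when the number of different labels does not increase.
        if n_updated <= n_classes:
            return labels
        labels, n_classes = updated, n_updated


# Cyclohexanone: atom 0 = O, atom 1 = carbonyl C, atoms 2..6 = ring CH2 groups.
cyclohexanone = {
    0: [1],
    1: [0, 2, 6],
    2: [1, 3],
    3: [2, 4],
    4: [3, 5],
    5: [4, 6],
    6: [5, 1],
}

labels = morgan_labels(cyclohexanone)
# Order the atoms from highest to lowest label; ties still need extra rules.
ranking = sorted(labels, key=labels.get, reverse=True)
```

For this molecule the labels settle into five classes: the carbonyl carbon ranks highest, and the two pairs of symmetry-equivalent ring carbons share a label, which is where the tie-breaking rules above would come in.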
Urea adjacency matrix
     O1 C2 C3 C4
O1    0  0  0  1
C2    0  0  0  1
C3    0  0  0  1
C4    1  1  1  0
[Figure: the four-atom fragment (O1, C2 and C3 around C4) and cyclohexanone (O1, C2 to C7), each drawn with its atom numbering]
Cyclohexanone adjacency matrix: a 7 × 7 matrix over O1 and C2 to C7, in which the O1 row has one entry, the C4 row has three, and every other carbon row has two.
Numbering is crucial: renumber for different starting points. Also,
the ‘edges’ of the graph are annotated to speed up the matching. Matching (‘subgraph isomorphism’) is the ‘slow’ part
and belongs to the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures.
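To make the matching step concrete, here is a small sketch that searches for a query adjacency matrix inside a target adjacency matrix. It is plain depth-first backtracking, not Ullmann's algorithm itself: Ullmann's method adds a refinement step that prunes impossible atom pairings early, which is omitted here. Atom indices start at 0 and are chosen for this example.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of bonded atom pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix


def find_substructure(query_elements, query_adj, target_elements, target_adj):
    """Yield every mapping (query atom i -> target atom) that preserves bonds."""
    n_query = len(query_elements)

    def extend(mapping):
        i = len(mapping)          # next query atom to place
        if i == n_query:
            yield tuple(mapping)
            return
        for t in range(len(target_elements)):
            if t in mapping or target_elements[t] != query_elements[i]:
                continue
            # Every bond from query atom i to an already placed query atom
            # must also exist between the corresponding target atoms.
            if all(target_adj[t][mapping[j]]
                   for j in range(i) if query_adj[i][j]):
                yield from extend(mapping + [t])

    yield from extend([])


# Query: O and two C atoms all bonded to a central C (index 3).
query_elements = ["O", "C", "C", "C"]
query_adj = adjacency_matrix(4, [(0, 3), (1, 3), (2, 3)])

# Target: cyclohexanone, atom 0 = O, atom 1 = carbonyl C, atoms 2..6 = ring.
target_elements = ["O", "C", "C", "C", "C", "C", "C"]
target_adj = adjacency_matrix(
    7, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)])

matches = list(find_substructure(query_elements, query_adj,
                                 target_elements, target_adj))
```

The two matches differ only in which ring neighbour plays the role of which query carbon, which shows why symmetry makes naive matching wasteful and why a consistent numbering helps.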
Next lecture
• How can we use this type of information to
solve our chemistry problems?
– Patents/Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure–property/activity relationships
InChI
• A more recent line notation is called InChI.
• This will address some of the problems of SMILES, e.g. polymers and materials
not covered by SMILES.
• InChI is generated using computer algorithms and is virtually
uninterpretable by humans.
• Importantly, it is commonly used as a unique chemical identifier: each
molecule should theoretically have a unique InChI. One molecule, one InChI.
• Websites are available that can generate InChI, or convert from InChI to structure,
from different names and formats (e.g. RSC ChemSpider).
Again, a string like this is easily matched on a computer.
Storing chemical diagrams on computers
• The valence model of a molecule can be represented by a
chemical graph. A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes.
• The spatial position of the nodes, the length of the edges and any
crossings are irrelevant. Generally we ignore hydrogens unless
tautomerism or pKa is an issue. Computers handle graphs very
well, and molecules represented like this are examples of labelled
graphs (the atoms have names, e.g. oxygen).
• Chemical structures are of course more complex than this, and
aromaticity, stereochemistry, tautomerism, non-stoichiometric
compounds etc. are often problematic. Using a simple graph, the computer would
deduce that these two canonical structures are
different molecules.
• To solve this we could, e.g., introduce the concept of an aromatic
bond.
An example of storing the molecular graph: the SD file format
MDL SD (structure data) files store the information about each molecule
in the following format:
Header block
describes the molecule, e.g. its name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
properties, e.g. volume
Terminator line
a line containing four dollar signs ($$$$),
indicating the end of information on this molecule
This is probably the most common format for storing small molecules – SD files
are widely used in software to store many molecular structures in databases.
A “format” in computer science is a precisely described order of data. The program
reading it expects the information in exactly the right place (or it fails).
The Connection Table – describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule
as a labelled graph.
e.g. PDB protein file format:
Connect 3 2 5 4
e.g. SD file format (reduced):
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
Why store molecules in 2D?
• Quite often we only need the
chemical diagram, e.g. to find a
molecule that matches a
chemical structure search.
• It is often the case that we don’t
know the conformation (shape)
of a molecule, so storing it in
3D would be pointless. Look at
the changes in conformation in
this molecule at room
temperature.
Example SD file – benzene
benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Line 1: molecule name. Line 2: information on this molecule.
Line 3: comment (e.g. the date is used here).
The “counts line” has the number of atoms and bonds as a minimum.
The “atom block” has the x, y, z coordinates of each atom and its element as a minimum.
The “bond block” has one line for each bond: from atom, to atom, bond type.
The last two lines (M  END and $$$$) identify the end of this molecule.
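A minimal reader for a record like the benzene example might look like the sketch below. It is an illustration only: the function name is hypothetical, only the header, counts line, atom block and bond block are read, and fields are split on whitespace for brevity. Real V2000 files are fixed-width, so a robust reader should slice columns instead (whitespace splitting breaks once counts or coordinates run together).

```python
SD_RECORD = """benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""


def parse_sd_record(text):
    """Read the name, atoms and bonds from a single SD record."""
    lines = text.splitlines()
    name = lines[0].strip()
    # Counts line: number of atoms, then number of bonds.
    counts = lines[3].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(field) for field in line.split()[:3])
        bonds.append((first, second, order))
    return {"name": name, "atoms": atoms, "bonds": bonds}


benzene = parse_sd_record(SD_RECORD)
# Each ring carbon should appear in exactly two bonds.
bond_counts = {}
for first, second, _ in benzene["bonds"]:
    bond_counts[first] = bond_counts.get(first, 0) + 1
    bond_counts[second] = bond_counts.get(second, 0) + 1
```

The alternating bond orders (1, 2, 2, 1, 1, 2) are one Kekulé structure of benzene, which is exactly the aromaticity problem raised earlier.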
Alanine SD file
Bond length is 1.53 Å
Atom line fields: x, y, z, symbol, mass difference, charge, stereo, H-count…
Charge codes:
0 = uncharged or value other than these
1 = +3, 2 = +2, 3 = +1
4 = doublet radical
5 = -1, 6 = -2, 7 = -3
Molecule is chiral
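The charge field is a code rather than the charge itself, so a reader needs a small lookup. A sketch using the table above (the names `CHARGE_CODES` and `decode_charge` are made up for this example):

```python
# Charge code in the atom line -> formal charge, following the table above.
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 4: 0, 5: -1, 6: -2, 7: -3}


def decode_charge(code):
    """Return (formal_charge, is_doublet_radical) for an atom-block charge code."""
    if code not in CHARGE_CODES:
        raise ValueError(f"unknown charge code: {code}")
    return CHARGE_CODES[code], code == 4


ammonium_nitrogen = decode_charge(3)     # code 3 means +1
carboxylate_oxygen = decode_charge(5)    # code 5 means -1
```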
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the
electrons but not the nuclei. In a small molecule this is very accurate and we
can almost see the bonds. However, in e.g. a protein structure, which is very
large, we can’t always determine the atom positions exactly, so these are
stored and the bonds imputed. The format for storing these is a bit different.
[Figure: electron density around a ring. Which way round should this ring go?]
Example Protein Data Bank file (PDB)
(bonds are listed in CONECT records, a form of adjacency list)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
CRYST1: space group and unit cell dimensions.
ATOM: protein atoms, including x, y, z, occupancy and temperature factor.
HETATM: non-protein atoms, including x, y, z, occupancy and temperature factor.
CONECT: bonds.
New format: mmCIF.
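Because the PDB format is fixed-width, each field is read from a set column range rather than by splitting on spaces. The sketch below reads the coordinate records shown above and computes an interatomic distance, which is the raw material for imputing bonds. The column ranges follow the usual ATOM/HETATM layout as I understand it; the function names are hypothetical.

```python
import math

PDB_LINES = [
    "ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28",
    "ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41",
    "HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34",
    "CONECT 1510 1511 1512 1513 1514",
]


def parse_atom(line):
    """Slice one ATOM or HETATM record into named fields (fixed columns)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


def distance(a, b):
    """Euclidean distance between two parsed atoms, in angstroms."""
    return math.dist((a["x"], a["y"], a["z"]), (b["x"], b["y"], b["z"]))


atoms = [parse_atom(line) for line in PDB_LINES
         if line.startswith(("ATOM", "HETATM"))]
n_to_ca = distance(atoms[0], atoms[1])
```

A distance of roughly 1.5 Å between N and CA is short enough to be treated as a covalent bond, even though no bond between them is written in the file.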
What if we want to vary the atom
positions, e.g. in driving a
reaction coordinate?
Using Cartesian (x, y, z)
coordinates is very
cumbersome, so instead
we use the natural angles
and distances.
This uses internal coordinates
• Also called a Z-matrix
– Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)
– An early way of specifying a starting geometry for molecules: people sometimes drew the molecule on graph paper to get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending, e.g. a carbonyl
Non-bonded distance
[Figure: internal coordinates illustrated on example structures, including dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol
only.
2. For the second atom, give the atomic symbol, the number
1, and the name of a variable to describe the distance
between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom
number NA, the name of a variable to describe the distance
between the current atom and NA, the atom number NB, and
the name of a variable to describe the angle between the
current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom
number NA, the name of a variable to describe the distance
between the current atom and NA, the atom number NB, the
name of a variable to describe the angle between the current
atom, NA and NB, the atom number of another previously
defined atom NC, and finally the name of a variable to
describe the dihedral angle between the current atom, NA,
NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a
separate line for each variable.
7. In some cases, where some of the variables are to be fixed
as constants in a geometry optimisation, they are listed here
after a blank line rather than above
with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
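Turning a Z-matrix back into Cartesian coordinates follows the construction rules directly: each new atom is placed from a distance, an angle and a dihedral relative to atoms already placed. The sketch below is one way to do it, with variables already substituted by their values; the function name, the tuple layout of each row, and the choice of placing the first atom at the origin and the second on the z axis are assumptions of this example.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    length = math.sqrt(sum(x * x for x in a))
    return tuple(x / length for x in a)


def zmatrix_to_cartesian(rows):
    """Convert Z-matrix rows to a list of (x, y, z) tuples.

    Rows are (symbol,), (symbol, NA, r), (symbol, NA, r, NB, angle) or
    (symbol, NA, r, NB, angle, NC, dihedral); atom numbers start at 1
    and angles are in degrees.
    """
    coords = []
    for row in rows:
        placed = len(coords)
        if placed == 0:
            coords.append((0.0, 0.0, 0.0))        # first atom at the origin
            continue
        r = row[2]
        if placed == 1:
            coords.append((0.0, 0.0, r))          # second atom on the z axis
            continue
        c = coords[row[1] - 1]                    # NA: atom we are bonded to
        b = coords[row[3] - 1]                    # NB: defines the angle
        theta = math.radians(row[4])
        bc = _unit(_sub(c, b))
        if placed == 2:
            # No dihedral reference yet: put the third atom in the xz plane.
            in_plane, normal, phi = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 0.0
        else:
            a = coords[row[5] - 1]                # NC: defines the dihedral
            phi = math.radians(row[6])
            normal = _unit(_cross(_sub(b, a), bc))
            in_plane = _cross(normal, bc)
        along = -r * math.cos(theta)
        side = r * math.sin(theta) * math.cos(phi)
        up = r * math.sin(theta) * math.sin(phi)
        coords.append(tuple(c[i] + along * bc[i] + side * in_plane[i]
                            + up * normal[i] for i in range(3)))
    return coords


# Water, using l1 = 0.96 and a1 = 104.0 from the Z-matrix above.
water = zmatrix_to_cartesian([("O",), ("H", 1, 0.96), ("H", 1, 0.96, 2, 104.0)])
h_h_distance = math.dist(water[1], water[2])
```

Changing a single value (say the angle `a1`) moves the atoms consistently without touching any other internal coordinate, which is why this representation is convenient for driving a reaction coordinate.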
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
[Figure annotation: Z-matrix with ‘improper’ torsion angles]
What about comprehensive
properties of molecules? They
are more than x, y, z.
XML and molecules
• XML is a markup language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML, and the World-Wide Web. 1. Basic Principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928
Ethanol in <CML>:
– can be parsed
– can contain reactions, properties etc.
– can contain relationships to other
molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
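To give a feel for what such a file looks like, the sketch below builds a small CML-style description of ethanol with Python's standard XML library and reads it back. The element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `hydrogenCount`, `bondArray`, `bond`, `atomRefs2`, `order`, `property`, `scalar`) follow CML conventions as I understand them and have not been validated against the CML schema; treat them as illustrative.

```python
import xml.etree.ElementTree as ET


def ethanol_cml():
    """Build a CML-style element tree for ethanol (SMILES C(C)O)."""
    molecule = ET.Element("molecule", id="ethanol")
    ET.SubElement(molecule, "name").text = "ethanol"

    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, element, h_count in [("a1", "C", 3), ("a2", "C", 2), ("a3", "O", 1)]:
        ET.SubElement(atom_array, "atom", id=atom_id,
                      elementType=element, hydrogenCount=str(h_count))

    bond_array = ET.SubElement(molecule, "bondArray")
    for atom_refs in ["a1 a2", "a2 a3"]:
        ET.SubElement(bond_array, "bond", atomRefs2=atom_refs, order="1")

    # Metadata travels with the value: here the units of the molar mass.
    prop = ET.SubElement(molecule, "property", title="molar mass")
    ET.SubElement(prop, "scalar", units="g/mol").text = "46.07"
    return molecule


xml_text = ET.tostring(ethanol_cml(), encoding="unicode")

# Any XML parser can read the record back without chemistry-specific code.
parsed = ET.fromstring(xml_text)
elements = [atom.get("elementType") for atom in parsed.findall("atomArray/atom")]
units = parsed.find("property/scalar").get("units")
```

The point of the exercise is the last three lines: because the units are stored alongside the number, a program reading the file does not have to guess what 46.07 means.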
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not
really flat because of thermal fluctuations. So we represent 3D
molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray,
electron or neutron diffraction, e.g. the Cambridge
Crystallographic Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as
Corina or Concord (put in a SMILES and get a 3D molecule), which
use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular
mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods
that deduce conformations, usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot, but often
in many more torsional angle dimensions).
Molecules in 3D – uses
• More accurate calculation of molecular
properties
• Comparison of the shapes (conformations)
of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon).
Molecular dynamics is the most popular method for larger systems. Here is
an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a
series of snapshots of SD files concatenated together to make a movie, just
like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
We can simulate an entire biological receptor
and test new molecules to see how they might
work
• Here is a simulation of an important new cancer target
called ‘Apelin’, with water and the membrane. We used
these to design new molecules that block this receptor.
Apelin’s role in cancer
‘Apelin’ (a peptide receptor) is involved in cancer; therefore we designed an
antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma
growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review,
Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367–372
Storing chemical diagrams on computers
bull The valence model of a molecule can be represented by a
chemical graph A simple graph contains nodes (atoms) and edges
(bonds) joining pairs of nodes
bull The spacial position of the nodes length of the edges and
crossings are irrelevant Generally we ignore hydrogens unless
tautomerism or pKa is an issue Computers handle graphs very
well and molecules represented like this are examples of labelled
graphs (the atoms have names eg Oxygen)
bull Chemical structures are of course more complex than this and
aromaticity stereochemistry tautomerism non-stoichimetric
compounds etc are often problematic The computer would
(using a simple graph) deduce these two canonical structures are
different molecules
bull Eg to solve this we could introduce the concept of an aromatic
bond
An example of storing the molecular
graph SD file format
MDL SD (structure data) format files contain the following information and the
information about a molecule is stored in the following format
Header Block
describes the molecule eg itrsquos name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
Properties eg volume
Terminator line
a line containing four dollar signs ($$$$)
indicating the end of information on this molecule
This is probably the most common format to store small molecules ndash SD files in
software are widely used to store many molecule structures in databases
A ldquoformatrdquo in computer science is a precisely described order of data The program
reading it expects the information in exactly the right place (or it screws up)
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
eg PDB protein file format
eg SD file format
1
1
1
1
1
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
The Connection Table ndash describing bonds
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced)
eg PDB protein file format
eg SD file format
The Connection Table ndash describing bonds
Why store molecules in 2Dbull Quite often we only need the
chemical diagram eg to find a
molecule that matches a
chemical structure search
bull It is often the case that we donrsquot
know the conformation (shape)
of a molecule ndash so storing it in
3D would be pointless Look at
the changes in conformation in
this molecule at room
temperature
Example SD file - benzene
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Annotations:
- Header lines: molecule name; information on this molecule; comment (e.g. the date is used here).
- "Counts line": has the number of atoms and bonds as a minimum.
- "Atom block": has the x, y, z coordinates of the atom and the element as a minimum.
- "Bond block": one line for each bond (from atom, to atom, bond type).
- "M  END" and "$$$$": these identify the end of this molecule.
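The layout of an SD record (three header lines, a counts line, an atom block and a bond block) can be read with a few lines of code. The following is a minimal sketch in Python, assuming whitespace-separated fields; real V2000 files use fixed-width columns, so this simplification breaks when adjacent fields run together. The function name `parse_molblock` is illustrative and not part of any library.

```python
def parse_molblock(text):
    """Parse one V2000 record into (name, atoms, bonds)."""
    lines = text.splitlines()
    name = lines[0].strip()                      # header line 1: molecule name
    counts = lines[3].split()                    # counts line follows 3 header lines
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:            # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        a, b, order = (int(f) for f in line.split()[:3])
        bonds.append((a, b, order))              # from atom, to atom, bond type
    return name, atoms, bonds


BENZENE = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$"""

name, atoms, bonds = parse_molblock(BENZENE)
```

Here `atoms` holds one (symbol, x, y, z) tuple per atom and `bonds` one (from, to, type) tuple per bond, which is exactly the labelled graph described by the connection table.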
Alanine SD file
- Bond length is 1.53 Å.
- Atom block columns: x, y, z, symbol, mass difference, charge, stereo, H-count, ...
- Charge codes: 0 = uncharged or a value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3.
- Molecule is chiral.
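The charge codes listed above are a lookup table rather than the charge itself, so a reader has to translate them. A small sketch in Python, assuming only the codes given on the slide; the names `CHARGE_CODES` and `decode_charge` are hypothetical:

```python
# Atom-block charge code -> formal charge, as listed above.
CHARGE_CODES = {0: 0, 1: 3, 2: 2, 3: 1, 5: -1, 6: -2, 7: -3}


def decode_charge(code):
    """Return (formal_charge, is_doublet_radical) for an atom-block charge code."""
    if code == 4:                      # code 4 marks a doublet radical, not a charge
        return 0, True
    if code not in CHARGE_CODES:
        raise ValueError(f"unknown charge code: {code}")
    return CHARGE_CODES[code], False
```

For example, `decode_charge(3)` gives a formal charge of +1 and `decode_charge(5)` gives -1.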
What if we only know the atom positions and not the bonds?
A key example is X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
[Figure: electron density map. Which way round should this ring go?]
Example Protein Data Bank (PDB) file
(bonds are stored as an adjacency list in the CONECT records)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Annotations:
- CRYST1: space group and unit cell dimensions.
- ATOM: protein atoms, including x, y, z, occupancy and temperature factor.
- HETATM: non-protein atoms, including x, y, z, occupancy and temperature factor.
- CONECT: bonds.
New format: mmCIF
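Because a PDB file is a fixed-column format, a reader slices each record by character position rather than splitting on spaces. The sketch below, in Python, assumes the standard column positions for ATOM/HETATM and CONECT records and keeps only the fields annotated above; the function names are illustrative, not a library API.

```python
def parse_pdb_atom(line):
    """Slice one ATOM/HETATM record using fixed PDB column positions."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),      # temperature factor
    }


def parse_conect(line):
    """Return (atom serial, [bonded atom serials]) from a CONECT record."""
    body = line.rstrip()
    serials = [int(body[i:i + 5]) for i in range(6, len(body), 5)]
    return serials[0], serials[1:]


atom = parse_pdb_atom(
    "ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28"
)
centre, neighbours = parse_conect("CONECT 1510 1511 1512 1513 1514")
```

Each CONECT record lists one atom followed by the atoms bonded to it, so collecting them for all atoms gives the bond list of the non-protein groups.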
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (x, y, z) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.
• Also called a Z-matrix
- Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
- An early way of specifying a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
- A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending, e.g. a carbonyl
Non-bonded distance
[Figure: internal coordinates of example molecules, showing dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
[Figure labels: Z-matrix; 'improper' torsion angles]
What about comprehensive properties of molecules? They are more than x, y, z.
XML and molecules
• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928.
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties, etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
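As a rough illustration of storing a molecule together with its metadata, the Python sketch below writes ethanol as a small XML document and reads it back. The element and attribute names follow CML conventions (`molecule`, `atomArray`, `atom`, `bondArray`, `bond`, `scalar`), but the document is simplified and has not been validated against the CML schema, so treat it as an assumption-laden example rather than a reference CML file.

```python
import xml.etree.ElementTree as ET


def ethanol_xml():
    """Build a simplified, CML-style description of ethanol (heavy atoms only)."""
    mol = ET.Element("molecule", id="ethanol")
    atoms = ET.SubElement(mol, "atomArray")
    for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atoms, "atom", id=atom_id, elementType=element)
    bonds = ET.SubElement(mol, "bondArray")
    for pair in ["a1 a2", "a2 a3"]:
        ET.SubElement(bonds, "bond", atomRefs2=pair, order="1")
    # Metadata travels with the value: here the units of a property.
    mass = ET.SubElement(mol, "scalar", title="molar mass", units="g/mol")
    mass.text = "46.07"
    return ET.tostring(mol, encoding="unicode")


xml_text = ethanol_xml()
root = ET.fromstring(xml_text)                    # "can be parsed"
elements = [a.get("elementType") for a in root.iter("atom")]
```

Because the units are stored next to the number, a program reading the file does not have to guess what "46.07" means.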
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates, etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these simulations to design new molecules that block this receptor.
Apelin's role in cancer
'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats, which could be a real pain, but they can be converted
alc -- Alchemy file prep -- Amber PREP file
bs -- Ball & Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball & Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file
cht -- Chemtool file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a 'standard' format. However, in the near future there are likely to be many competing requirements for file content.
Molecules on computers - things to look out for, since what is stored is actually quite crude
For example:
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect: check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of molecular weight and topology) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).
2 Finding molecules
A typical problem involves finding the 'right' molecules
All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
There is a serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.
1 Finding molecules using substructure searching
One of the most useful 'properties' of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, fragments of molecules or even 'similar' molecules.
For a specific molecule this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example: find a molecule containing our query as part of a larger molecule.
Results: it takes a few seconds to search all 92 million molecules.
How does this work?
The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.
The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Value: 0 0 0 0 0 1 0 0 0 0  0  1  0  0  0  0
[Figure labels on individual bits: "Contains P", "Contains NH2", "Doesn't contain F"]
So a quick match of the bitstring of the fragment searched for against the (pre-computed) bitstrings in the database will eliminate most of the structures.
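The screening step can be sketched in a few lines of Python. Each molecule gets a bitstring with one bit per fragment, and a molecule can only contain the query if every bit set in the query is also set in the molecule. The fragment list `FRAGMENT_KEYS`, the tiny database and the function names are all hypothetical, chosen just to show the bit operations; real systems use much longer, carefully designed key sets.

```python
# Hypothetical fragment dictionary: one bit per fragment.
FRAGMENT_KEYS = ["OH", "NH2", "COOH", "phenyl", "F", "P"]


def fingerprint(fragments):
    """Set bit i if fragment FRAGMENT_KEYS[i] is present in the molecule."""
    bits = 0
    for i, key in enumerate(FRAGMENT_KEYS):
        if key in fragments:
            bits |= 1 << i
    return bits


def may_contain(query_bits, molecule_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (query_bits & molecule_bits) == query_bits


# Bitstrings are pre-computed once for the whole database.
database = {
    "tyrosine": fingerprint({"OH", "NH2", "COOH", "phenyl"}),
    "aniline": fingerprint({"NH2", "phenyl"}),
    "fluorobenzene": fingerprint({"F", "phenyl"}),
}

query = fingerprint({"OH", "phenyl"})            # a phenol-like query
candidates = [name for name, bits in database.items() if may_contain(query, bits)]
```

Only the surviving candidates need the slower atom-by-atom substructure match; a passing bit test is necessary but not sufficient, since the fragments could be present without being connected as in the query.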
Next step - substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware: salts etc. may need their counterions stripped out.
Substructure search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
Two steps are commonly used to do this. First, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents, etc.
- -OH
- -NH2
- -COOH
- phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap
[Figure: example molecule with OH, CH2, CHNH2, OH and O labels]
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.
Note: a number of 'substructural' fragments can be matched here (6 in all).
Read these for more details:
Gund P., Ann. Rep. Med. Chem. 14, 299, 1979
Sheridan R.P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989
Brint A.T., Willett P., J. Mol. Graphics 5, 49, 1987
Ullmann J.R., An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976. http://portal.acm.org/citation.cfm?id=321925
Morgan H.L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113
The basic concept is that molecules are considered as graphs
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107-113, 1965
The first step - numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest label, then uses the next-highest connection as 2, etc., like an onion: concentric circles from the highest-numbered atom
• Rules are used to break equalities, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered
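The iterative labelling can be sketched as follows in Python. This is a simplified version that starts from the degree of each atom only (ignoring atomic number and bond type) and stops as soon as the number of distinct labels no longer increases, returning the labels from the previous round; tie-breaking and the final numbering are left out. The function name `morgan_labels` and the dictionary-based graph are illustrative choices.

```python
def morgan_labels(adjacency):
    """adjacency maps each atom to a list of its neighbours; returns atom -> label."""
    # Step (i), simplified: the initial label is the number of neighbours.
    labels = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): each new label is the sum of the neighbours' labels.
        new = {atom: sum(labels[n] for n in nbrs) for atom, nbrs in adjacency.items()}
        new_classes = len(set(new.values()))
        # Step (iii): stop once the number of different labels stops increasing.
        if new_classes <= n_classes:
            return labels
        labels, n_classes = new, new_classes


# Carbon chain of n-pentane, atoms numbered 1 to 5 along the chain.
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = morgan_labels(pentane)
ranking = sorted(labels, key=labels.get, reverse=True)   # highest label first
```

In the pentane chain the central carbon ends up with the highest label and the two terminal carbons share the lowest, so however the atoms were numbered on input, the numbering can start from the same place.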
The Ullmann algorithm is a way of detecting substructures
Query fragment (labelled "Urea" on the slide; atoms O1, C2, C3, C4) adjacency matrix:
     O1 C2 C3 C4
O1    0  0  0  1
C2    0  0  0  1
C3    0  0  0  1
C4    1  1  1  0
[Figure: cyclohexanone (atoms O1, C2 to C7) with its 7 x 7 adjacency matrix, in which O1 has one neighbour, C4 has three and each remaining carbon has two]
Numbering is crucial: renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. Matching ('subgraph isomorphism') is the 'slow' part and belongs to the class of NP-complete problems.
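The matching step can be illustrated with a plain backtracking search in Python. This is not Ullmann's full algorithm (it omits his matrix refinement step); it only shows the core idea of assigning query atoms to target atoms one at a time and abandoning an assignment as soon as a required bond is missing. The ring numbering used for cyclohexanone (C4 bonded to O1, C2 and C3; ring C4-C2-C5-C6-C7-C3) is an assumption made for this example, and `find_substructure` is a hypothetical name.

```python
def find_substructure(query_adj, query_labels, target_adj, target_labels):
    """Return every mapping {query atom: target atom} that preserves labels and bonds."""
    q_atoms = list(query_adj)
    matches = []

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            matches.append(dict(mapping))
            return
        q = q_atoms[len(mapping)]                     # next query atom to place
        for t in target_adj:
            if t in mapping.values():
                continue                              # target atom already used
            if query_labels[q] != target_labels[t]:
                continue                              # element must agree
            if len(target_adj[t]) < len(query_adj[q]):
                continue                              # too few neighbours
            # Every already-placed neighbour of q must be bonded to t.
            if all(mapping[n] in target_adj[t] for n in query_adj[q] if n in mapping):
                mapping[q] = t
                extend(mapping)
                del mapping[q]                        # backtrack

    extend({})
    return matches


# Query: the four-atom fragment from the adjacency matrix above (C4 is the centre).
query_adj = {"O1": {"C4"}, "C2": {"C4"}, "C3": {"C4"}, "C4": {"O1", "C2", "C3"}}
query_labels = {"O1": "O", "C2": "C", "C3": "C", "C4": "C"}

# Target: cyclohexanone, with the assumed ring numbering described above.
target_adj = {
    "O1": {"C4"}, "C2": {"C4", "C5"}, "C3": {"C4", "C7"}, "C4": {"O1", "C2", "C3"},
    "C5": {"C2", "C6"}, "C6": {"C5", "C7"}, "C7": {"C6", "C3"},
}
target_labels = {atom: atom[0] for atom in target_adj}

matches = find_substructure(query_adj, query_labels, target_adj, target_labels)
```

The search finds the fragment once for each way of assigning the two equivalent carbons, which is why consistent numbering and cheap pre-screening matter: in the worst case the number of assignments to try grows very quickly with molecule size.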
Next lecture
• How can we use this type of information to solve our chemistry problems?
- Patents/Markush structures
- Molecular similarity
- Molecular properties
- 3D searching
- Virtual screening
- Structure-property/activity relationships
An example of storing the molecular
graph SD file format
MDL SD (structure data) format files contain the following information and the
information about a molecule is stored in the following format
Header Block
describes the molecule eg itrsquos name
Connection table
defines the molecular structure (atoms and bonds)
Data block (optional)
Properties eg volume
Terminator line
a line containing four dollar signs ($$$$)
indicating the end of information on this molecule
This is probably the most common format to store small molecules ndash SD files in
software are widely used to store many molecule structures in databases
A ldquoformatrdquo in computer science is a precisely described order of data The program
reading it expects the information in exactly the right place (or it screws up)
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
eg PDB protein file format
eg SD file format
1
1
1
1
1
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
The Connection Table ndash describing bonds
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced)
eg PDB protein file format
eg SD file format
The Connection Table ndash describing bonds
Why store molecules in 2Dbull Quite often we only need the
chemical diagram eg to find a
molecule that matches a
chemical structure search
bull It is often the case that we donrsquot
know the conformation (shape)
of a molecule ndash so storing it in
3D would be pointless Look at
the changes in conformation in
this molecule at room
temperature
Example SD file - benzene
bull benzene
bull ACDLabs0812062058
bull August 2013
bull 6 6 0 0 0 0 0 0 0 0 1 V2000
bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 2 1 1 0 0 0 0
bull 3 1 2 0 0 0 0
bull 4 2 2 0 0 0 0
bull 5 3 1 0 0 0 0
bull 6 4 1 0 0 0 0
bull 6 5 2 0 0 0 0
bull M END
bull $$$$
Molecule nameInformation on this molecule
Comment (eg date is used here)
ldquocounts linerdquo has the
number of atoms and
bonds as a minimum
ldquoatom blockrdquo has xyz
coordinates of the atom and
element as a minimum
ldquobond blockrdquo 1-line for each bond
from atom - to atom ndash bond type
These identify the end of this molecule
Alanine SD file
Bond length is 153A
xyz symbol mass diff charge stereo h-counthellip
Charge
0 = uncharged or value
other than
these 1 = +3 2 = +2 3 =
+1
4 = doublet radical
5 = -1 6 = -2 7
= -3
Molecule is chiral
What if we only know the atom positions and not the bonds
A key example would be x-ray crystallography Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge
Here is an electron density map from an x-ray experiment We see the
electrons but not the nuclei In a small molecule this is very accurate and we
can almost see the bonds However in eg a protein structure which is very
large we canrsquot always determine the atom positions exactly so these are
stored and the bonds imputed The format for storing these is a bit different
Which way
round should
this ring go
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA X-RAY DIFFRACTION
REMARK 2 RESOLUTION 180 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 69
SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4
ORIGX1 1000000 0000000 0000000 000000
ORIGX2 0000000 1000000 0000000 000000
ORIGX3 0000000 0000000 1000000 000000
SCALE1 0018319 0000000 0000000 000000
SCALE2 0000000 0018147 0000000 000000
SCALE3 0000000 0000000 0015426 000000
ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128
ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841
ATOM 3 C VAL A 1 4565 1067 -3593 100 2764
ATOM 4 O VAL A 1 4653 1287 -4853 100 2717
ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001
HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234
HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213
HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410
HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
END
Space group and unit cell
dimensions
Protein atoms ndash including xyz
occupancy and temperature factor
Non-protein atoms ndash including xyz
occupancy and temperature factor
Bonds New format
mmCIF
What if we want to vary the atom
positions eg in driving a
reaction coordinate
Using Cartesian (xyz)
coordinates is very
cumbersome so instead
we use the natural angles
and distances
This uses internal coordinates
bull Also called a Z-matrix
ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)
ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation
ndash A z-matrix uses the following geometric descriptions to describe molecules
Bond Length
Bond angle
Torsion angle
Out of plane bending
eg a carbonyl
Non-bonded distance
C
O
du
du
O
N OH
Dummy atom
positions
Internal
Coordinates
How to construct a Z-matrix (in Gaussian format)
1For the first atom to be defined give the atomic symbol
only
2For the second atom give the atomic symbol the number
1 and the name of a variable to describe the distance
between atoms 1 and 2
3For the third atom give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB and
the name of a variable to describe the angle between the
current atom NA and NB
4For all later atoms give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB the
name of a variable to describe the angle between the current
atom NA and NB the atom number of another previously
defined atom NC and finally the name of a variable to
describe the dihedral angle between the current atom NA
NB and NC
5After all the atoms have been listed enter a blank line
6Next list each variable with its corresponding value Use a
separate line for each variable
7In some cases where some of the variables are to be fixed
as constants in a geometry optimisation they are listed here
after a blank line rather than above
with the real variables
8End the Z-matrix with a blank line
Water (C2v)
O
H 1 l1
H l l1 2 a1
l1 096
a1 1040
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 1800
l1 142
l2 109
l3 109
l4 109
l5 109
l6 10
a1 1090
a2 1100
a3 1080
a4 1100
a5 1100
da1 600
da2 1200
da3 600 z-matrix
lsquoimproperrsquo torsion angles
What about comprehensive
properties of molecules ndash they
are more than xyz
XML and molecules
bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc
bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file
bull Chemical Markup Language (CML) is being developed specifically for chemistry
bull In the future much more information will be stored with molecules allowing greater re-use of data
bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928
Ethanol
ltCMLgt
-Can be parsed
-Can contain reactions
properties etc
-Can contain
relationships to other
molecules and also
concepts
InChI
InChI=1C2H6Oc1-2-3h3H2H21H3
SMILES
C(C)O
Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not
really flat because of thermal fluctuations So we represent 3D
molecules by including their coordinates or their internal coordinates
bull Obtaining the 3-dimensional coordinates can involve experiment (x-
ray electron or neutron diffraction eg the Cambridge
Crystallographic Database or the Protein Databank ndash PDB)
bull From these can be obtained atom positions bonds coordinates etc
bull There are a number of 3-D construction methods available such as
Corina or Concord (put in a SMILES and get a 3-D molecule) which
use rules derived from experiment
bull Molecules can also be constructed in 2D and subjected to molecular
mechanics or Quantum Mechanics calculations to obtain 3D structures
bull Conformation still remains to be deduced There are many methods
that deduce conformations usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot ndash but often
in many more torsional angle dimensions)
Molecules in 3D - uses
bull More accurate calculation of molecular
properties
bull Comparison of the shapes (conformations)
of molecules
bull Comparison of the dynamics of molecules
bull Calculation of bulk properties
bull Simulation of chemical reactionshelliphellip
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon)
Molecular Dynamics is the most popular method for larger systems Here is
an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a
series of snapshots of SD files concatenated together to make a movie just
like a film strip
httpsenwikipediaorgwikiCHARMM
httpambermdorg
httpwwwksuiuceduResearchnamd
httpwwwksuiuceduResearchvmd
We can simulate an entire biological receptor
and test new molecules to see how they might
work
bull Here is a simulation of an important new cancer target
called lsquoApelinrsquo with water and the membrane We used
these to design new molecules that block this receptor
Apelinrsquos role in Cancer
lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an
antagonist (a molecule that blocks the receptor) stop tumour growth
bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma
growth Brain 2017 awx253 httpsdoiorg101093brainawx253
bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et
al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372
So there are very many file formats
ndash which could be a real painbut
they can be convertedalc -- Alchemy file prep -- Amber PREP file
bs -- Ball amp Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical Resource
Kit 2D file
crk3d -- CRK3D Chemical Resource
Kit 3D file
box -- Dock 35 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI BiosymInsight II CAR
file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball amp Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table
file
cht -- Chemtool file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical
Resource Kit 2D file
crk3d -- CRK3D Chemical
Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 35 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program: Babel
IUPAC has recently moved towards a 'standard' format. However, in the near future there are likely to be many competing requirements for file content.
Molecules on computers - things to look out for, since what is stored is actually quite crude
For example:
• Stereochemistry may be relative rather than absolute, or even incorrect.
• In proteins only the heavy-atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added afterwards and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect: check that they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not recorded in the file).
2 Finding molecules
A typical problem involves finding the 'right' molecules:
• All the synthesisable molecules: 10^60
• Molecules with relevance to the problem (that we can generate): 10^10
• Molecules we can search efficiently: 10^7
• Molecules we can make and test: 10^3
There is a serious difference in numbers: we can consider a few hundred molecules in our heads, whereas computers can evaluate millions.
1 Finding molecules using substructure searching
One of the most useful 'properties' of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, fragments of molecules or even 'similar' molecules.
For a specific molecule this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example: find a molecule containing our query as part of a larger molecule.
Results: it takes a few seconds to search all 92 million molecules.
How does this work?
The first thing to notice is that the search is so fast that clearly the whole database is not being searched.
The first part of the process is to pre-compute pointers to only those sets of compounds in the database that we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Value: 0 0 0 0 0 1 0 0 0  0  0  1  0  0  0  0
[Figure labels: a set bit means e.g. 'contains P' or 'contains NH2'; an unset bit means e.g. 'doesn't contain F'.]
So a quick match of the bitstring of the query fragment against the (pre-computed) bitstrings in the database will eliminate most of the structures.
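The bitstring screen described above can be sketched in a few lines of Python. This is only an illustration: the fragment dictionary `FRAGMENT_BITS`, the bit positions and the three database entries are invented for the example, and real systems use much longer bitstrings.

```python
# Hypothetical fragment dictionary: fragment name -> bit position (0-based).
FRAGMENT_BITS = {"F": 2, "P": 5, "OH": 7, "phenyl": 9, "NH2": 11}

def fingerprint(fragments):
    """Return an integer whose set bits mark the fragments that are present."""
    bits = 0
    for name in fragments:
        bits |= 1 << FRAGMENT_BITS[name]
    return bits

def screen(query_bits, database):
    """Keep only the entries whose bitstring contains every bit set in the query."""
    return [name for name, bits in database.items()
            if (bits & query_bits) == query_bits]

# Pre-computed once, when the (made-up) database is built.
DATABASE = {
    "mol_A": fingerprint(["P", "NH2"]),
    "mol_B": fingerprint(["NH2", "OH", "phenyl"]),
    "mol_C": fingerprint(["F", "phenyl"]),
}

# Only molecules that pass the screen go on to full substructure matching.
candidates = screen(fingerprint(["NH2"]), DATABASE)
print(candidates)
```

Passing the screen is necessary but not sufficient: the surviving candidates still have to be checked atom by atom.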
Next step - substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware of salts etc.: you may need to strip out counterions.
Substructure search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a subgraph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
Two steps are commonly used to do this. First, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
- -OH
- -NH2
- -COOH
- phenyl
• This can be done by tracing appropriate paths in the graph.
• Subgraphs may overlap.
[Figure: example structure with labelled groups OH, CH2, CHNH2, OH, O]
Matching the query structure to the database
Two algorithms are commonly used:
the Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.
Note: a number of 'substructural' fragments can be matched here (6 in all).
Read these for more details:
Gund P, Ann. Rep. in Med. Chem. 14:299, 1979
Sheridan RP et al., J. Chem. Inf. Comp. Sci. 29:255, 1989
Brint AT, Willett P, J. Mol. Graphics 5:49, 1987
Ullmann JR, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM) Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan HL, The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113
The basic concept is that molecules are considered as graphs:
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
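As a rough illustration of the labelled-graph idea, the sketch below stores node labels (element symbols) and edge labels (bond orders) in plain dictionaries. The class name `MolGraph` and its methods are invented for this example and do not come from any particular toolkit.

```python
from dataclasses import dataclass, field

@dataclass
class MolGraph:
    """A molecule as a labelled graph (illustrative only)."""
    atoms: dict = field(default_factory=dict)  # node id -> element symbol
    bonds: dict = field(default_factory=dict)  # frozenset({i, j}) -> bond order

    def add_atom(self, idx, element):
        self.atoms[idx] = element

    def add_bond(self, i, j, order=1):
        self.bonds[frozenset((i, j))] = order

    def neighbours(self, idx):
        """Ids of the atoms bonded to atom idx, in ascending order."""
        return sorted(j for pair in self.bonds if idx in pair
                      for j in pair if j != idx)

# Heavy atoms of acetaldehyde, CH3-CH=O (hydrogens left implicit).
mol = MolGraph()
for idx, element in [(1, "C"), (2, "C"), (3, "O")]:
    mol.add_atom(idx, element)
mol.add_bond(1, 2, order=1)
mol.add_bond(2, 3, order=2)  # the edge label records the double bond

print(mol.neighbours(2))
```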
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service. J. Chemical Documentation 5:107-113, 1965
The first step - numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged.
• Numbering starts from the highest label, then uses the next highest connection as 2, etc. Like an onion: concentric circles out from the highest-numbered atom.
• Rules are used to break ties, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered.
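A minimal sketch of the label-refinement loop is given below. It is a simplification under stated assumptions: the initial label is just the degree of each atom (atomic number and bond types are ignored), each update replaces a label by the sum of the neighbours' labels, and the tie-breaking rules are left out. The function name `morgan_labels` and the pentane carbon skeleton are chosen purely for illustration.

```python
def morgan_labels(neighbours):
    """neighbours: dict atom -> list of bonded atoms. Returns dict atom -> label."""
    # Step (i), simplified: the initial label is the degree of the atom.
    labels = {atom: len(nbrs) for atom, nbrs in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): new label = sum of the labels of the adjacent atoms.
        new = {atom: sum(labels[b] for b in nbrs)
               for atom, nbrs in neighbours.items()}
        new_classes = len(set(new.values()))
        # Step (iii): stop once the number of distinct labels no longer increases,
        # keeping the last set of labels that did increase it.
        if new_classes <= n_classes:
            return labels
        labels, n_classes = new, new_classes

# Carbon skeleton of n-pentane: C1-C2-C3-C4-C5.
pentane = {
    "C1": ["C2"],
    "C2": ["C1", "C3"],
    "C3": ["C2", "C4"],
    "C4": ["C3", "C5"],
    "C5": ["C4"],
}
labels = morgan_labels(pentane)
# Order atoms from highest label down; ties are broken here by name only.
ranking = sorted(labels, key=lambda atom: (-labels[atom], atom))
print(labels, ranking)
```

For pentane the degrees alone give two classes (terminal and internal carbons); one round of summation separates the central carbon from its neighbours, and a further round would merge classes again, so the loop stops there.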
The Ullmann algorithm is a way of detecting substructures
Urea adjacency matrix (atoms numbered O1, C2, C3, C4 in the figure):
     O1 C2 C3 C4
O1    0  0  0  1
C2    0  0  0  1
C3    0  0  0  1
C4    1  1  1  0
Cyclohexanone adjacency matrix (atoms O1, C2-C7): the row for O1 has one entry, the row for C4 has three, and the rows for C2, C3, C5, C6 and C7 have two each.
Numbering is crucial: renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching (subgraph isomorphism), which belongs to the class of NP-complete problems.
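To make the matching step concrete, here is a small sketch that builds adjacency matrices and searches for a query fragment inside a target by plain backtracking. It is not Ullmann's algorithm itself (his refinement step, which prunes candidate atoms before and during the search, is omitted); it only shows the subgraph-matching problem that the algorithm solves. The four-atom query follows the small matrix on the slide, and the ring order chosen for cyclohexanone is an assumption.

```python
def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 matrix from a list of (i, j) bonds (0-based indices)."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix

def find_matches(q_elems, q_adj, t_elems, t_adj):
    """All mappings (query atom -> target atom) that preserve elements and bonds."""
    matches = []

    def extend(mapping):
        q = len(mapping)                      # next query atom to place
        if q == len(q_elems):
            matches.append(tuple(mapping))
            return
        for t in range(len(t_elems)):
            if t in mapping or t_elems[t] != q_elems[q]:
                continue
            # Every query bond to an already placed atom must exist in the target.
            if all(t_adj[t][mapping[p]] for p in range(q) if q_adj[q][p]):
                mapping.append(t)
                extend(mapping)
                mapping.pop()

    extend([])
    return matches

# Query: atoms O1, C2, C3, C4 with C4 bonded to the other three (0-based below).
query_elems = ["O", "C", "C", "C"]
query_adj = adjacency_matrix(4, [(0, 3), (1, 3), (2, 3)])

# Target: cyclohexanone heavy atoms O1, C2..C7; assumed ring order C4-C2-C5-C6-C7-C3.
target_elems = ["O", "C", "C", "C", "C", "C", "C"]
target_adj = adjacency_matrix(
    7, [(0, 3), (3, 1), (3, 2), (1, 4), (4, 5), (5, 6), (6, 2)])

matches = find_matches(query_elems, query_adj, target_elems, target_adj)
print(matches)
```

The two mappings found differ only in which of the two equivalent query carbons lands on which ring neighbour of the carbonyl carbon, which is why consistent numbering matters before matching.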
Next lecture
• How can we use this type of information to solve our chemistry problems?
- Patents/Markush structures
- Molecular similarity
- Molecular properties
- 3D searching
- Virtual screening
- Structure-property/activity relationships
The Connection Table - describing bonds
• Defines the bonding arrangement of a molecule. Treats the molecule as a labelled graph.
Connect 3 2 5 4   (e.g. PDB protein file format)
Bond 3 5 1        (e.g. SD file format)
Bond 3 4 1
Bond 5 6 2
(reduced)
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.
Example SD file - benzene
benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Annotations:
• Line 1: molecule name. Line 2: information on this molecule.
• Line 3: comment (e.g. the date is used here).
• "Counts line": has the number of atoms and bonds as a minimum.
• "Atom block": has the xyz coordinates of the atom and the element as a minimum.
• "Bond block": one line for each bond (from atom, to atom, bond type).
• "M END" and "$$$$" identify the end of this molecule.
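The layout just described (three header lines, a counts line, an atom block and a bond block) can be read with a few lines of Python. This is a sketch under a simplifying assumption: fields are split on whitespace, whereas the real format uses fixed-width columns, so this approach would break for molecules whose counts run together (100 atoms or more). The function name `parse_molfile` is invented for the example, and the record is the benzene example with its decimal points restored.

```python
MOLFILE = """benzene
ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$"""

def parse_molfile(text):
    """Return (name, atoms, bonds) from one V2000-style record."""
    lines = text.splitlines()
    name = lines[0].strip()
    counts = lines[3].split()                 # the counts line is the 4th line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:         # atom block: x y z element ...
        x, y, z, element = line.split()[:4]
        atoms.append((element, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))  # from atom, to atom, bond type
    return name, atoms, bonds

name, atoms, bonds = parse_molfile(MOLFILE)
print(name, len(atoms), len(bonds))
```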
Alanine SD file
Bond length is 1.53 Å
Atom block columns: x y z, symbol, mass difference, charge, stereo, H-count...
Charge: 0 = uncharged or value other than these; 1 = +3; 2 = +2; 3 = +1; 4 = doublet radical; 5 = -1; 6 = -2; 7 = -3
Molecule is chiral
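The charge codes listed above follow a simple pattern (charge = 4 - code, apart from the two special codes), which a short helper can make explicit. The function name `decode_charge` and the choice to return a (charge, is_radical) pair are assumptions made for this illustration.

```python
def decode_charge(code):
    """Translate an atom-block charge code into (formal charge, is_doublet_radical)."""
    if code == 0:
        return 0, False          # uncharged, or a value outside the coded range
    if code == 4:
        return 0, True           # doublet radical, no formal charge
    if 1 <= code <= 7:
        return 4 - code, False   # 1 -> +3, 2 -> +2, 3 -> +1, 5 -> -1, 6 -> -2, 7 -> -3
    raise ValueError(f"unknown charge code: {code}")

print([decode_charge(c) for c in range(8)])
```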
What if we only know the atom positions and not the bonds?
A key example would be x-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an x-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
[Figure caption: Which way round should this ring go?]
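One common way to impute bonds from positions alone is a distance test: two atoms are taken as bonded when they are closer than the sum of their covalent radii plus a tolerance. The sketch below assumes approximate single-bond covalent radii and a tolerance of 0.45 Å; both are illustrative choices rather than values from the lecture, and the water coordinates are made up to give roughly 0.96 Å O-H distances.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (illustrative values).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}

def perceive_bonds(atoms, tolerance=0.45):
    """atoms: list of (element, (x, y, z)). Returns a list of bonded index pairs."""
    bonds = []
    for (i, (el_i, pos_i)), (j, (el_j, pos_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADII[el_i] + COVALENT_RADII[el_j] + tolerance
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds

water = [
    ("O", (0.00, 0.00, 0.00)),
    ("H", (0.96, 0.00, 0.00)),
    ("H", (-0.24, 0.93, 0.00)),
]
print(perceive_bonds(water))
```

A test like this says nothing about bond order, and it cannot tell nitrogen from oxygen when the element assignment itself is uncertain, which is why imputed bonds need checking.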
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR   26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION 1.80 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Annotations:
• CRYST1: space group and unit cell dimensions
• ATOM: protein atoms, including xyz, occupancy and temperature factor
• HETATM: non-protein atoms, including xyz, occupancy and temperature factor
• CONECT: bonds
New format: mmCIF
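As a hedged illustration of how ATOM and HETATM records are read, the sketch below slices the fixed columns of the legacy PDB format. The helper names `parse_atom_record` and `format_atom_record` are invented here; the formatter exists only so that the example lines have exactly the right column positions, and it left-justifies atom names, which is a simplification of how real files align them.

```python
def format_atom_record(record, serial, name, res_name, chain, res_seq,
                       x, y, z, occupancy, b_factor):
    """Build one fixed-width ATOM/HETATM line (simplified atom-name alignment)."""
    return (f"{record:<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_factor:6.2f}")

def parse_atom_record(line):
    """Slice the fixed columns of an ATOM or HETATM line into a dictionary."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),   # temperature factor
    }

atom_line = format_atom_record("ATOM", 1, "N", "VAL", "A", 1,
                               3.036, -1.035, -3.538, 1.00, 31.28)
het_line = format_atom_record("HETATM", 1510, "S", "SO4", "A", 187,
                              25.993, -1.362, 5.893, 1.00, 22.34)
atom = parse_atom_record(atom_line)
het = parse_atom_record(het_line)
print(atom["name"], atom["x"], het["res_name"], het["b_factor"])
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real files can touch, for example large negative coordinates.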
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances. This uses internal coordinates.
• Also called a Z-matrix
- Used to alter the "internal coordinates" of a molecule (e.g. when modelling a reaction)
- An early way of specifying a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
- A Z-matrix uses the following geometric descriptions to describe molecules:
  Bond length
  Bond angle
  Torsion angle
  Out-of-plane bending (e.g. a carbonyl)
  Non-bonded distance
[Figure: internal coordinates, showing dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1 and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
[Figure labels: z-matrix; 'improper' torsion angles]
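To show how a Z-matrix determines Cartesian coordinates, here is a sketch that places atoms one at a time from a distance, an angle and a dihedral. Several things are assumptions made for the example: the row format is a Python tuple with numeric values in place of named variables, the first atom sits at the origin with the second along the z axis, and the sign convention of the dihedral is not claimed to match any particular program. The function names are invented.

```python
import math

def _sub(u, v):
    return [a - b for a, b in zip(u, v)]

def _cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def _unit(u):
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]

def place_atom(a, b, c, r, angle_deg, dihedral_deg):
    """New position at distance r from c, with angle (new, c, b) and dihedral (new, c, b, a)."""
    theta, phi = math.radians(angle_deg), math.radians(dihedral_deg)
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))     # normal to the plane of a, b, c
    m = _cross(n, bc)
    local = (-r * math.cos(theta),
             r * math.sin(theta) * math.cos(phi),
             r * math.sin(theta) * math.sin(phi))
    return [c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
            for i in range(3)]

def zmatrix_to_cartesian(rows):
    """rows: (sym,), (sym, NA, r), (sym, NA, r, NB, angle) or
    (sym, NA, r, NB, angle, NC, dihedral); atom numbers are 1-based."""
    coords = []
    for row in rows:
        if len(row) == 1:
            coords.append([0.0, 0.0, 0.0])
        elif len(row) == 3:
            coords.append([0.0, 0.0, row[2]])
        else:
            c, b = coords[row[1] - 1], coords[row[3] - 1]
            if len(row) == 5:
                # Third atom: any reference point off the z axis fixes the plane.
                a, dihedral = [b[0] + 1.0, b[1], b[2]], 0.0
            else:
                a, dihedral = coords[row[5] - 1], row[6]
            coords.append(place_atom(a, b, c, row[2], row[4], dihedral))
    return coords

# Water, with l1 = 0.96 and a1 = 104.0 substituted directly.
water = zmatrix_to_cartesian([("O",), ("H", 1, 0.96), ("H", 1, 0.96, 2, 104.0)])
print(water)
```

Changing a single number, such as the angle, moves every dependent atom consistently, which is what makes internal coordinates convenient for driving a reaction coordinate.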
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a computer language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
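The point that CML "can be parsed" can be illustrated with Python's standard XML parser. The fragment below is a hand-written, simplified CML-style description of the heavy atoms of ethanol; it is an assumption made for the example rather than a file from the lecture, and real CML documents usually declare an XML namespace and carry far more metadata.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written CML-style fragment (heavy atoms of ethanol only).
CML_TEXT = """<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>"""

root = ET.fromstring(CML_TEXT)
# Atom ids map to element symbols; each bond names its two atoms and its order.
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}
bonds = [tuple(bond.get("atomRefs2").split()) + (bond.get("order"),)
         for bond in root.iter("bond")]
print(root.get("id"), atoms, bonds)
```

Because every value is tagged with what it is, extra elements (properties, units, links to other molecules) can be added without breaking a reader that only looks for atoms and bonds.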
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their Cartesian coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (x-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsion-angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsion-angle dimensions).
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor.
Apelin's role in cancer
'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Rayalam et al., Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
bull Defines the bonding arrangement of a molecule Treats the molecule
as a labelled graph
Connect 3 2 5 4
Bond 3 5 1
Bond 3 4 1
Bond 5 6 2
(reduced)
eg PDB protein file format
eg SD file format
The Connection Table ndash describing bonds
Why store molecules in 2Dbull Quite often we only need the
chemical diagram eg to find a
molecule that matches a
chemical structure search
bull It is often the case that we donrsquot
know the conformation (shape)
of a molecule ndash so storing it in
3D would be pointless Look at
the changes in conformation in
this molecule at room
temperature
Example SD file - benzene
bull benzene
bull ACDLabs0812062058
bull August 2013
bull 6 6 0 0 0 0 0 0 0 0 1 V2000
bull 19050 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 19050 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -01282 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 07531 -27882 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -07932 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull -03987 -21232 00000 C 0 0 0 0 0 0 0 0 0 0 0 0
bull 2 1 1 0 0 0 0
bull 3 1 2 0 0 0 0
bull 4 2 2 0 0 0 0
bull 5 3 1 0 0 0 0
bull 6 4 1 0 0 0 0
bull 6 5 2 0 0 0 0
bull M END
bull $$$$
Molecule nameInformation on this molecule
Comment (eg date is used here)
ldquocounts linerdquo has the
number of atoms and
bonds as a minimum
ldquoatom blockrdquo has xyz
coordinates of the atom and
element as a minimum
ldquobond blockrdquo 1-line for each bond
from atom - to atom ndash bond type
These identify the end of this molecule
Alanine SD file
Bond length is 153A
xyz symbol mass diff charge stereo h-counthellip
Charge
0 = uncharged or value
other than
these 1 = +3 2 = +2 3 =
+1
4 = doublet radical
5 = -1 6 = -2 7
= -3
Molecule is chiral
What if we only know the atom positions and not the bonds
A key example would be x-ray crystallography Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge
Here is an electron density map from an x-ray experiment We see the
electrons but not the nuclei In a small molecule this is very accurate and we
can almost see the bonds However in eg a protein structure which is very
large we canrsquot always determine the atom positions exactly so these are
stored and the bonds imputed The format for storing these is a bit different
Which way
round should
this ring go
Example Protein Data Bank (PDB) file
(bonds are stored in CONECT records, i.e. as an adjacency list rather than a full matrix)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE (KELVIN) : 113
REMARK 200  PH : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Space group and unit cell dimensions (CRYST1)
Protein atoms (ATOM), including xyz, occupancy and temperature factor
Non-protein atoms (HETATM), including xyz, occupancy and temperature factor
Bonds (CONECT)
New format: mmCIF
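PDB records are fixed-width, so a reader slices character columns rather than splitting on spaces. The sketch below follows the published column layout for ATOM/HETATM and CONECT records. The function names are made up for this illustration, and the example line is built with a format string so that the columns are guaranteed to line up.

```python
def parse_atom_record(line):
    """Slice one ATOM or HETATM line using the fixed PDB column layout."""
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66 (temperature factor)
    }


def parse_conect(line):
    """Return (atom serial, [serials of bonded atoms]) from a CONECT line."""
    fields = [line[i:i + 5] for i in range(6, len(line), 5)]
    serials = [int(f) for f in fields if f.strip()]
    return serials[0], serials[1:]


# The sulfur atom of the sulfate ion in the example file.
EXAMPLE_LINE = "{:<6}{:>5} {:<4} {:>3} {}{:>4}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}".format(
    "HETATM", 1510, " S", "SO4", "A", 187, 25.993, -1.362, 5.893, 1.00, 22.34)

print(parse_atom_record(EXAMPLE_LINE))
print(parse_conect("CONECT 1510 1511 1512 1513 1514"))
```

Note that only HETATM atoms normally get CONECT records; bonds within standard amino acids are left to be imputed from the residue name.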
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
– Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
– Early form of specification of a starting geometry for molecules: sometimes graph paper was used to draw the molecule and get a starting set of coordinates before optimisation
– A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending (e.g. a carbonyl)
Non-bonded distance
[Figure: internal coordinates of example molecules, showing dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
The dihedral variables in this Z-matrix are 'improper' torsion angles
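To see how a Z-matrix turns back into Cartesian coordinates, here is a minimal sketch. It assumes the conventions of the recipe above: atom 1 at the origin, atom 2 on the z axis, atom 3 in the xz plane, and every later atom placed from a distance, an angle and a dihedral. The row format (tuples with 1-based atom references) and all function names are choices made for this illustration, and the variable values are used as listed, without checking that they are chemically sensible.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def _unit(a):
    length = math.sqrt(_dot(a, a))
    return tuple(x / length for x in a)


def _combine(origin, *terms):
    """origin + sum of scale * vector over the given (scale, vector) terms."""
    return tuple(o + sum(s * v[i] for s, v in terms) for i, o in enumerate(origin))


def zmatrix_to_cartesian(rows, variables):
    def value(token):
        # A token is a variable name, a negated name such as "-da1", or a number.
        if isinstance(token, str):
            sign = -1.0 if token.startswith("-") else 1.0
            return sign * variables[token.lstrip("-")]
        return float(token)

    coords = []
    for row in rows:
        n = len(coords)
        if n == 0:
            coords.append((0.0, 0.0, 0.0))
            continue
        a = coords[row[1] - 1]                      # atom NA
        r = value(row[2])
        if n == 1:
            coords.append(_combine(a, (r, (0.0, 0.0, 1.0))))
            continue
        b = coords[row[3] - 1]                      # atom NB
        theta = math.radians(value(row[4]))
        ab = _unit(_sub(b, a))
        if n == 2:
            # Atoms 1 and 2 lie on z, so the x axis is perpendicular to ab.
            coords.append(_combine(a, (r * math.cos(theta), ab),
                                   (r * math.sin(theta), (1.0, 0.0, 0.0))))
            continue
        c = coords[row[5] - 1]                      # atom NC
        phi = math.radians(value(row[6]))
        ba = tuple(-x for x in ab)
        normal = _unit(_cross(_sub(b, c), ba))      # fails if NA, NB, NC are collinear
        in_plane = _cross(normal, ba)
        coords.append(_combine(a, (-r * math.cos(theta), ba),
                               (r * math.sin(theta) * math.cos(phi), in_plane),
                               (r * math.sin(theta) * math.sin(phi), normal)))
    return coords


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle p0-p1-p2-p3 in degrees."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    return math.degrees(math.atan2(_dot(_unit(b2), _cross(n1, n2)), _dot(n1, n2)))


WATER = [("O",), ("H", 1, "l1"), ("H", 1, "l1", 2, "a1")]
WATER_VARS = {"l1": 0.96, "a1": 104.0}

METHANOL = [("C",), ("O", 1, "l1"), ("H", 1, "l2", 2, "a1"),
            ("H", 1, "l3", 2, "a2", 3, "da1"), ("H", 1, "l3", 2, "a2", 3, "-da1"),
            ("H", 2, "l4", 1, "a3", 3, 180.0)]
METHANOL_VARS = {"l1": 1.42, "l2": 1.09, "l3": 1.09, "l4": 1.09,
                 "a1": 109.0, "a2": 110.0, "a3": 108.0, "da1": 60.0}

for symbol, xyz in zip("OHH", zmatrix_to_cartesian(WATER, WATER_VARS)):
    print(symbol, "%8.4f %8.4f %8.4f" % xyz)
```

Changing a single variable (say a1) and regenerating the coordinates moves every dependent atom consistently, which is what makes internal coordinates convenient for scanning a reaction coordinate.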
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci., 1999, 39, 928
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties, etc.
- Can contain relationships to other molecules and also concepts
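To make "can be parsed" concrete, here is a small sketch that reads a CML-style description of ethanol with Python's standard XML parser. The fragment is hand-written for this illustration: it uses the CML element and attribute names molecule, atomArray, atom, bondArray, bond, elementType and atomRefs2, but omits the namespace declaration and the hydrogen atoms for brevity. The function name read_cml is hypothetical.

```python
import xml.etree.ElementTree as ET

ETHANOL_CML = """<molecule id="ethanol">
  <name>ethanol</name>
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>"""


def read_cml(text):
    """Return (name, {atom id: element}, [(from id, to id, order)])."""
    root = ET.fromstring(text)
    name = root.findtext("name")
    atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}
    bonds = [tuple(bond.get("atomRefs2").split()) + (bond.get("order"),)
             for bond in root.iter("bond")]
    return name, atoms, bonds


print(read_cml(ETHANOL_CML))
```

Because every value is tagged, a program can pick out just the pieces it understands and ignore the rest, which is what makes the format extensible.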
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates, etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these simulations to design new molecules that block this receptor.
Apelin's role in cancer
'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain, 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery, 2011, 6(3), 367-372
So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file                               prep -- Amber PREP file
bs -- Ball & Stick file                           caccrt -- Cacao Cartesian file
ccc -- CCC file                                   c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                   cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file      crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                          dmol -- DMol3 Coordinates file
feat -- Feature file                              gam -- GAMESS Output file
gamout -- GAMESS Output file                      gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                         qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                         jout -- Jaguar Output file
bin -- OpenEye Binary file                        mmd -- MacroModel file
mmod -- MacroModel file                           out -- MacroModel file
dat -- MacroModel file                            car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                          sd -- MDL Isis SDF file
mdl -- MDL Molfile file                           mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                    mopout -- MOPAC Output file
mmads -- MMADS file                               mpqc -- MPQC file
bgf -- MSI BGF file                               nwo -- NWChem Output file
pdb -- PDB file                                   ent -- PDB file
pqs -- PQS file                                   qcout -- Q-Chem Output file
res -- ShelX file                                 ins -- ShelX file
smi -- SMILES file                                mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                        vmol -- ViewMol file

alc -- Alchemy file                               bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                    cacint -- Cacao Internal file
cache -- CAChe MolStruct file                     c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                   ct -- ChemDraw Connection Table file
cht -- Chemtool file                              cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file      crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                             box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                    feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                   gamin -- GAMESS Input file
inp -- GAMESS Input file                          gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                        gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                        gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                         jin -- Jaguar Input file
bin -- OpenEye Binary file                        mmd -- MacroModel file
mmod -- MacroModel file                           out -- MacroModel file
dat -- MacroModel file                            sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                           mdl -- MDL Molfile file
mol -- MDL Molfile file                           mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                               bgf -- MSI BGF file
csr -- MSI Quanta CSR file                        nw -- NWChem Input file
pdb -- PDB file                                   ent -- PDB file
pov -- POV-Ray Output file                        pqs -- PQS file
report -- Report file                             qcin -- Q-Chem Input file
smi -- SMILES file                                fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                           txyz -- Tinker XYZ file
txt -- Titles file                                unixyz -- UniChem XYZ file
vmol -- ViewMol file                              xed -- XED file
xyz -- XYZ file                                   zin -- ZINDO Input file
File formats: the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.
Molecules on computers: things to look out for, since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect.
In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
Tautomers can be incorrect: check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of MWt and topology) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but are not observed in the file).
2 Finding molecules
A typical problem involves finding the 'right' molecules
All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.
1 Finding molecules using substructure searching
One of the most useful 'properties' of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, or fragments of molecules, or even 'similar' molecules.
For a specific molecule this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules (http://pubchem.ncbi.nlm.nih.gov).
A more complex example: find a molecule containing our query as part of a larger molecule.
Results: it takes a few seconds to search all 92 million molecules.
How does this work?
The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.
The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
Callouts on individual bits: "Contains P", "Contains NH2", "Doesn't contain F"
So a quick match of the bitstring of the query fragment against the (pre-computed) bitstrings in the database will eliminate most of the structures.
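The screening step can be sketched in a few lines. A molecule can only contain the query if every bit set in the query bitstring is also set in the molecule's bitstring, which is a single AND-and-compare per molecule. The bit positions, molecule names and function names below are hypothetical, chosen only to mirror the 16-bit example above.

```python
# Hypothetical bit assignments for a 16-bit fragment key (illustration only).
FRAGMENT_BITS = {"P": 6, "F": 9, "NH2": 12}


def fingerprint(fragments):
    """Set one bit per fragment present (bit 1 is the least significant bit)."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << (FRAGMENT_BITS[fragment] - 1)
    return bits


def passes_screen(query_bits, molecule_bits):
    """True if every bit set in the query is also set in the molecule."""
    return (query_bits & molecule_bits) == query_bits


def screen(query_fragments, database):
    query_bits = fingerprint(query_fragments)
    return [name for name, bits in database.items() if passes_screen(query_bits, bits)]


# Bitstrings are computed once, when the database is built.
DATABASE = {
    "mol_A": fingerprint({"P", "NH2"}),
    "mol_B": fingerprint({"NH2"}),
    "mol_C": fingerprint({"F"}),
}

print(screen({"NH2"}, DATABASE))
```

The screen can only rule molecules out. Anything that passes still has to go through the full atom-by-atom match, because sharing the same fragments does not guarantee that they are connected in the same way.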
Next step: substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware salts etc.: you may need to strip out counterions.
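A very rough way to strip counterions from a SMILES string is to split it at the "." that separates disconnected components and keep the largest one. The sketch below does this with a crude heavy-atom count based on letters. It is a heuristic for illustration only, with made-up function names, and a real toolkit should be used for anything serious.

```python
def heavy_atom_estimate(fragment):
    """Crude atom count: uppercase letters except H, plus aromatic lowercase atoms.

    Two-letter symbols such as Cl, Br and Na are counted once via their capital
    letter. Rare symbols ending in an aromatic letter (e.g. Sc) are miscounted.
    """
    count = 0
    for ch in fragment:
        if ch.isupper() and ch != "H":
            count += 1
        elif ch in "bcnops":
            count += 1
    return count


def strip_counterions(smiles):
    """Keep the largest dot-separated component of a SMILES string."""
    return max(smiles.split("."), key=heavy_atom_estimate)


print(strip_counterions("CN(C)c1ccccc1.Cl"))  # dimethylaniline hydrochloride
```

The surviving component may still carry a formal charge (for example an acetate anion after removing sodium), so exact matching may need a neutralisation step as well.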
Substructure search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents, etc.
– -OH
– -NH2
– -COOH
– phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap
[Figure: example structure with groups labelled OH, CH2, CHNH2, OH and O, illustrating these subgraphs]
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.
Note: a number of 'substructural' fragments can be matched here (6 in all).
Read these for more details:
Gund P., Ann. Rep. in Med. Chem., 14, 299, 1979
Sheridan R. P. et al., J. Chem. Inf. Comp. Sci., 29, 255, 1989
Brint A. T., Willett P., J. Mol. Graphics, 5, 49, 1987
Ullmann J. R., An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan H. L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation, 5, 107-113, 1965
The basic concept is that molecules are considered as graphs:
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation, 5, 107-113, 1965
The first step: numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the atom with the highest label, then takes its highest-labelled neighbour as 2, etc. Like an onion: concentric circles out from the highest-numbered atom
• Rules are used to break ties, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered
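Steps (i) to (iii) can be sketched as follows. This is a simplified version that starts from the degree of each atom only (no atomic numbers or bond types) and replaces each label by the sum of its neighbours' labels, stopping as soon as an iteration fails to increase the number of distinct labels. The function names and the pentane carbon skeleton used as input are choices made for this illustration.

```python
def morgan_labels(adjacency):
    """adjacency: dict mapping each atom to a list of its neighbours.

    Returns the labels from the last iteration that increased the number
    of distinct values.
    """
    labels = {atom: len(nbrs) for atom, nbrs in adjacency.items()}  # step (i): degree
    n_classes = len(set(labels.values()))
    while True:
        # step (ii): new label = sum of the neighbours' current labels
        updated = {atom: sum(labels[n] for n in nbrs) for atom, nbrs in adjacency.items()}
        n_updated = len(set(updated.values()))
        if n_updated <= n_classes:  # step (iii): no more classes were separated
            return labels
        labels, n_classes = updated, n_updated


def rank_atoms(adjacency):
    """Order atoms by decreasing label; ties are broken here by atom index."""
    labels = morgan_labels(adjacency)
    return sorted(labels, key=lambda atom: (-labels[atom], atom))


# Carbon skeleton of pentane: 0-1-2-3-4
PENTANE = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

print(morgan_labels(PENTANE))
print(rank_atoms(PENTANE))
```

For pentane the degrees alone give only two classes (end and inner carbons). One round of summation separates the central carbon from its neighbours, and the next round merges classes again, so the labels from the first round are kept. The two end carbons stay tied, which is the "equivalent atoms are arbitrarily numbered" case.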
[Figure: the query fragment (labelled "Urea" on the slide; atoms O1, C2, C3, C4) and cyclohexanone (atoms O1 and C2 to C7), each shown with its adjacency matrix]
Query fragment adjacency matrix
     O1  C2  C3  C4
O1    .   .   .   1
C2    .   .   .   1
C3    .   .   .   1
C4    1   1   1   .
Cyclohexanone adjacency matrix: as in the query, C4 is bonded to O1, C2 and C3; O1 has a single entry, and each of the ring carbons C2, C3, C5, C6 and C7 has two entries, one for each bonded neighbour.
Numbering is crucial: renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching: 'subgraph isomorphism' is in the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures.
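The matching step can be illustrated with a plain backtracking search. Note that this is not Ullmann's algorithm itself, which adds a refinement step that prunes the candidate table before and during the search; it is only the underlying idea of mapping query atoms onto target atoms one at a time and backing out when a bond cannot be matched. The ring order assumed for cyclohexanone (C4-C2-C5-C6-C7-C3) and the function name are assumptions for this sketch.

```python
def find_matches(query, target):
    """query, target: (elements, adjacency) pairs of dicts keyed by atom label.

    Returns every mapping {query atom: target atom} that preserves element
    types and all query bonds.
    """
    q_el, q_adj = query
    t_el, t_adj = target
    q_atoms = list(q_el)
    matches = []

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            matches.append(dict(mapping))
            return
        q = q_atoms[len(mapping)]  # next query atom to place
        for t in t_el:
            if t in mapping.values() or t_el[t] != q_el[q]:
                continue
            # Every already-mapped neighbour of q must land on a neighbour of t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                extend(mapping)
                del mapping[q]  # backtrack

    extend({})
    return matches


QUERY = (
    {"O1": "O", "C2": "C", "C3": "C", "C4": "C"},
    {"O1": ["C4"], "C2": ["C4"], "C3": ["C4"], "C4": ["O1", "C2", "C3"]},
)

CYCLOHEXANONE = (
    {"O1": "O", "C2": "C", "C3": "C", "C4": "C", "C5": "C", "C6": "C", "C7": "C"},
    {"O1": ["C4"], "C2": ["C4", "C5"], "C3": ["C4", "C7"], "C4": ["O1", "C2", "C3"],
     "C5": ["C2", "C6"], "C6": ["C5", "C7"], "C7": ["C6", "C3"]},
)

for match in find_matches(QUERY, CYCLOHEXANONE):
    print(match)
```

The query is found twice because its two outer carbons can be swapped. Consistent numbering of the kind the Morgan algorithm provides is one way to recognise such symmetry-equivalent hits.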
Next lecture
• How can we use this type of information to solve our chemistry problems?
– Patents/Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure-property/activity relationships
Why store molecules in 2D?
• Quite often we only need the chemical diagram, e.g. to find a molecule that matches a chemical structure search.
• It is often the case that we don't know the conformation (shape) of a molecule, so storing it in 3D would be pointless. Look at the changes in conformation in this molecule at room temperature.
Example SD file - benzene
benzene
  ACDLabs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Molecule nameInformation on this molecule
Comment (eg date is used here)
ldquocounts linerdquo has the
number of atoms and
bonds as a minimum
ldquoatom blockrdquo has xyz
coordinates of the atom and
element as a minimum
ldquobond blockrdquo 1-line for each bond
from atom - to atom ndash bond type
These identify the end of this molecule
Alanine SD file
Bond length is 153A
xyz symbol mass diff charge stereo h-counthellip
Charge
0 = uncharged or value
other than
these 1 = +3 2 = +2 3 =
+1
4 = doublet radical
5 = -1 6 = -2 7
= -3
Molecule is chiral
What if we only know the atom positions and not the bonds
A key example would be x-ray crystallography Here we determine the
positions of the atoms and impute the bonds from our chemical knowledge
Here is an electron density map from an x-ray experiment We see the
electrons but not the nuclei In a small molecule this is very accurate and we
can almost see the bonds However in eg a protein structure which is very
large we canrsquot always determine the atom positions exactly so these are
stored and the bonds imputed The format for storing these is a bit different
Which way
round should
this ring go
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA X-RAY DIFFRACTION
REMARK 2 RESOLUTION 180 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 69
SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4
ORIGX1 1000000 0000000 0000000 000000
ORIGX2 0000000 1000000 0000000 000000
ORIGX3 0000000 0000000 1000000 000000
SCALE1 0018319 0000000 0000000 000000
SCALE2 0000000 0018147 0000000 000000
SCALE3 0000000 0000000 0015426 000000
ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128
ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841
ATOM 3 C VAL A 1 4565 1067 -3593 100 2764
ATOM 4 O VAL A 1 4653 1287 -4853 100 2717
ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001
HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234
HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213
HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410
HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
END
Space group and unit cell
dimensions
Protein atoms ndash including xyz
occupancy and temperature factor
Non-protein atoms ndash including xyz
occupancy and temperature factor
Bonds New format
mmCIF
What if we want to vary the atom
positions eg in driving a
reaction coordinate
Using Cartesian (xyz)
coordinates is very
cumbersome so instead
we use the natural angles
and distances
This uses internal coordinates
bull Also called a Z-matrix
ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)
ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation
ndash A z-matrix uses the following geometric descriptions to describe molecules
Bond Length
Bond angle
Torsion angle
Out of plane bending
eg a carbonyl
Non-bonded distance
C
O
du
du
O
N OH
Dummy atom
positions
Internal
Coordinates
How to construct a Z-matrix (in Gaussian format)
1For the first atom to be defined give the atomic symbol
only
2For the second atom give the atomic symbol the number
1 and the name of a variable to describe the distance
between atoms 1 and 2
3For the third atom give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB and
the name of a variable to describe the angle between the
current atom NA and NB
4For all later atoms give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB the
name of a variable to describe the angle between the current
atom NA and NB the atom number of another previously
defined atom NC and finally the name of a variable to
describe the dihedral angle between the current atom NA
NB and NC
5After all the atoms have been listed enter a blank line
6Next list each variable with its corresponding value Use a
separate line for each variable
7In some cases where some of the variables are to be fixed
as constants in a geometry optimisation they are listed here
after a blank line rather than above
with the real variables
8End the Z-matrix with a blank line
Water (C2v)
O
H 1 l1
H l l1 2 a1
l1 096
a1 1040
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 1800
l1 142
l2 109
l3 109
l4 109
l5 109
l6 10
a1 1090
a2 1100
a3 1080
a4 1100
a5 1100
da1 600
da2 1200
da3 600 z-matrix
lsquoimproperrsquo torsion angles
What about comprehensive
properties of molecules ndash they
are more than xyz
XML and molecules
bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc
bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file
bull Chemical Markup Language (CML) is being developed specifically for chemistry
bull In the future much more information will be stored with molecules allowing greater re-use of data
bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928
Ethanol
ltCMLgt
-Can be parsed
-Can contain reactions
properties etc
-Can contain
relationships to other
molecules and also
concepts
InChI
InChI=1C2H6Oc1-2-3h3H2H21H3
SMILES
C(C)O
Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not
really flat because of thermal fluctuations So we represent 3D
molecules by including their coordinates or their internal coordinates
bull Obtaining the 3-dimensional coordinates can involve experiment (x-
ray electron or neutron diffraction eg the Cambridge
Crystallographic Database or the Protein Databank ndash PDB)
bull From these can be obtained atom positions bonds coordinates etc
bull There are a number of 3-D construction methods available such as
Corina or Concord (put in a SMILES and get a 3-D molecule) which
use rules derived from experiment
bull Molecules can also be constructed in 2D and subjected to molecular
mechanics or Quantum Mechanics calculations to obtain 3D structures
bull Conformation still remains to be deduced There are many methods
that deduce conformations usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot ndash but often
in many more torsional angle dimensions)
Molecules in 3D - uses
bull More accurate calculation of molecular
properties
bull Comparison of the shapes (conformations)
of molecules
bull Comparison of the dynamics of molecules
bull Calculation of bulk properties
bull Simulation of chemical reactionshelliphellip
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon)
Molecular Dynamics is the most popular method for larger systems Here is
an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a
series of snapshots of SD files concatenated together to make a movie just
like a film strip
httpsenwikipediaorgwikiCHARMM
httpambermdorg
httpwwwksuiuceduResearchnamd
httpwwwksuiuceduResearchvmd
We can simulate an entire biological receptor
and test new molecules to see how they might
work
bull Here is a simulation of an important new cancer target
called lsquoApelinrsquo with water and the membrane We used
these to design new molecules that block this receptor
Apelinrsquos role in Cancer
lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an
antagonist (a molecule that blocks the receptor) stop tumour growth
bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma
growth Brain 2017 awx253 httpsdoiorg101093brainawx253
bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et
al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372
So there are very many file formats
ndash which could be a real painbut
they can be convertedalc -- Alchemy file prep -- Amber PREP file
bs -- Ball amp Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical Resource
Kit 2D file
crk3d -- CRK3D Chemical Resource
Kit 3D file
box -- Dock 35 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI BiosymInsight II CAR
file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball amp Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table
file
cht -- Chemtool file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical
Resource Kit 2D file
crk3d -- CRK3D Chemical
Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 35 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats
-the bane of our lives
Interconnection program ndash Babel
Recent IUPAC moves towards a lsquostandardrsquo format however
In the near future there are likely to be many competing
requirements for file content
Molecules on computers ndash things to look out for
since what is stored is actually quite crude
For example-
Stereochemistry may be relative and not absolute or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution)
so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen
and oxygen get confused
Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved
correctly (problems with storing bonding)
Tautomers can be incorrect ndash check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of MWt and topology) are very difficult to store
Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics
halogen-aromatic bonds etc may be inferred but not observed in the file)
2 Finding molecules
A typical problem involves finding the
lsquorightrsquo molecules
All the molecules
synthesisable
1060 Molecules with
relevance to the problem
1010 (that we can generate)
Molecules we can
search efficiently 107
Molecules we can
make and test 103
Serious difference in numbers we can consider a few hundred
molecules in our heads computers can evaluate millions
1 Finding molecules using Substructure searching
One of the most useful lsquopropertiesrsquo of a molecule is its molecular graph
Many software systems allow searching for whole molecules or
fragments of molecules or even lsquosimilarrsquo molecules
For a specific molecule this means specifying the required search pattern to get an
exact match
An example might be Search for an exact match for dimethylaniline Here wersquove
used the PubChem database which contains compounds and pharmacological
screening data on 92 million unique molecules httppubchemncbinlmnihgov
A more complex example ndash find a molecule
containing our query as part of a larger molecule
Results ndash takes a few seconds to search all
92 million molecules
How does this workThe first thing to notice is that the searching is so fast that clearly all the database
is not being searched
The first part of the process is to pre-compute pointers to only the sets of
compounds in the database we are interested in
This is called Hashing
Speeding up compound matching using hash codes
bull The simplest Hash code registers the presence or absence
of fragments in a molecule eg
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
Contains PContains NH2
Doesnrsquot contain F
X
So a quick match of the bitstring of the fragment searched over the
(pre-computed) bitstrings in the database will eliminate most of the
structures
Next step ndash Substructure (pattern) matching
Exact match
Match the canonical SMILES InCHI or full structure
Beware salts etc may need to strip out counterions
Substructure search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a subgraph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
Two steps are commonly used to do this. First, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
- -OH
- -NH2
- -COOH
- phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap
[Figure: example structure with fragments labelled OH, CH2, CHNH2, OH, O]
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.
Note - a number of 'substructural' fragments can be matched here (6 in all).
Read these for more details:
Gund P, Ann. Rep. Med. Chem. 14:299, 1979
Sheridan RP et al., J. Chem. Inf. Comp. Sci. 29:255, 1989
Brint AT, Willett P, J. Mol. Graphics 5:49, 1987
Ullmann JR, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM) 23:31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan HL, The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5:107-113, 1965
The basic concept is that molecules are considered as graphs:
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
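A labelled graph of this kind maps directly onto simple data structures. The sketch below, with variable and function names chosen for illustration, stores urea (O=C(N)N, heavy atoms only) as a list of node labels and a dictionary of labelled edges, then derives the connectivity (adjacency) matrix from them. The central carbon is placed last.

```python
# Node labels: element symbols of the heavy atoms of urea, O=C(N)N.
atoms = ["O", "N", "N", "C"]

# Edge labels: bond order for each bonded pair of atom indices (0-based).
bonds = {(0, 3): 2, (1, 3): 1, (2, 3): 1}

def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 matrix with a 1 wherever two atoms share a bond."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix

def degree(matrix, atom):
    """Number of bonded neighbours of an atom (its row sum)."""
    return sum(matrix[atom])

matrix = adjacency_matrix(len(atoms), bonds)
for symbol, row in zip(atoms, matrix):
    print(symbol, row)
```

Storing the bond order in the dictionary rather than in the matrix keeps the connectivity and the edge labels separate; either can be consulted during matching.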
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5:107-113, 1965
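Steps (i) to (iii) can be sketched as follows. This is a simplified version under stated assumptions: the initial label is the degree only (atomic number and bond types are ignored), and the tie-breaking rules needed for a full canonical numbering are left out. The function name and the pentane example are illustrative.

```python
def morgan_labels(neighbours):
    """neighbours: dict mapping each atom to the list of atoms bonded to it.
    Returns the extended-connectivity label of every atom."""
    # Step (i): start from the degree of each atom.
    labels = {atom: len(nbrs) for atom, nbrs in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): new label = sum of the current labels of the neighbours.
        new = {atom: sum(labels[b] for b in nbrs) for atom, nbrs in neighbours.items()}
        new_classes = len(set(new.values()))
        # Step (iii): stop as soon as the number of distinct labels stops growing,
        # keeping the labels from the last iteration that improved it.
        if new_classes <= n_classes:
            return labels
        labels, n_classes = new, new_classes

# Heavy atoms of n-pentane as a chain 0-1-2-3-4.
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = morgan_labels(pentane)
ranking = sorted(labels, key=labels.get, reverse=True)
print(labels, ranking)
```

For pentane the central carbon ends up with the highest label, so numbering would start there and work outwards; the two end carbons remain equivalent and would be numbered arbitrarily.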
The first step - numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest label, then uses the next highest connection as 2 etc. Like an onion: concentric circles from the highest-numbered atom
• Rules are used to break ties, such as C before N, double bond before single etc.
• Some equivalent atoms are arbitrarily numbered
Urea adjacency matrix
     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0
[Figure: cyclohexanone adjacency matrix over atoms O1 and C2-C7; C4 has three neighbours and every other carbon has two]
Numbering is crucial: renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching ('subgraph isomorphism') and is in the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures.
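To make the matching step concrete, here is a plain backtracking subgraph search. It is not the Ullmann algorithm itself: Ullmann's method adds a refinement step that prunes the table of candidate atom pairings before and during the search, which this sketch omits. Atoms are compared by element symbol only and bond orders are ignored; all names are illustrative.

```python
def find_substructure(query_atoms, query_bonds, target_atoms, target_bonds):
    """Return one mapping {query index: target index}, or None if there is no match.
    Atoms are lists of element symbols; bonds are sets of frozenset({i, j})."""
    def extend(mapping):
        q = len(mapping)                      # index of the next query atom to place
        if q == len(query_atoms):
            return dict(enumerate(mapping))   # every query atom has a partner
        for t, symbol in enumerate(target_atoms):
            if t in mapping or symbol != query_atoms[q]:
                continue
            # Each query bond from q to an already-placed atom must exist in the target.
            if all(frozenset({t, mapping[p]}) in target_bonds
                   for p in range(q) if frozenset({q, p}) in query_bonds):
                found = extend(mapping + [t])
                if found is not None:
                    return found
        return None                           # dead end: backtrack
    return extend([])

# Query: a C-O fragment. Target: ethanol, C-C-O (heavy atoms only).
query_atoms, query_bonds = ["C", "O"], {frozenset({0, 1})}
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = {frozenset({0, 1}), frozenset({1, 2})}

match = find_substructure(query_atoms, query_bonds, ethanol_atoms, ethanol_bonds)
print(match)
```

The search first tries the methyl carbon, finds no oxygen attached to it, backtracks, and succeeds with the second carbon. The worst-case cost grows very quickly with molecule size, which is why the bitstring screen is applied first.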
Next lecture
• How can we use this type of information to solve our chemistry problems?
- Patents/Markush structures
- Molecular similarity
- Molecular properties
- 3D searching
- Virtual screening
- Structure-property/activity relationships
Example SD file - benzene
benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Annotations:
• Molecule name / information on this molecule
• Comment (e.g. the date is used here)
• The "counts line" has the number of atoms and bonds as a minimum
• The "atom block" has the xyz coordinates of the atom and the element as a minimum
• The "bond block" has one line for each bond: from atom, to atom, bond type
• "M  END" and "$$$$" identify the end of this molecule
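A short Python sketch shows how the counts line drives the reading of the atom and bond blocks. It assumes the benzene record above and splits fields on whitespace for brevity; the format is really fixed-width, so whitespace splitting is an assumption that breaks down for large molecules whose counts run together. The function name is illustrative.

```python
MOLFILE = """benzene
  ACD/Labs0812062058
August 2013
  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
"""

def parse_molfile(text):
    """Return (name, atoms, bonds) from a single V2000 record."""
    lines = text.splitlines()
    name = lines[0].strip()                       # header line 1: molecule name
    counts = lines[3].split()                     # line 4: the counts line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:             # atom block: x y z symbol ...
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:   # bond block
        first, second, order = (int(value) for value in line.split()[:3])
        bonds.append((first, second, order))
    return name, atoms, bonds

name, atoms, bonds = parse_molfile(MOLFILE)
print(name, len(atoms), len(bonds))
```

Atom numbers in the bond block are 1-based, so subtract one before using them as Python list indices.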
Alanine SD file
Bond length is 1.53 Å
Atom block columns: x y z symbol mass-diff charge stereo h-count ...
Charge codes:
0 = uncharged or value other than these
1 = +3, 2 = +2, 3 = +1
4 = doublet radical
5 = -1, 6 = -2, 7 = -3
Molecule is chiral
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can't always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
Which way round should this ring go?
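One common way to impute bonds from positions alone is a distance test: two atoms are taken to be bonded if they are closer than the sum of their covalent radii plus a tolerance. The sketch below assumes approximate single-bond radii for four elements and a tolerance of 0.4 Å; both are illustrative choices rather than values from the lecture.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms (illustrative values).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}

def infer_bonds(atoms, tolerance=0.4):
    """atoms: list of (symbol, x, y, z). Returns index pairs judged to be bonded."""
    bonds = []
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        distance = math.dist(a[1:], b[1:])
        if distance <= COVALENT_RADIUS[a[0]] + COVALENT_RADIUS[b[0]] + tolerance:
            bonds.append((i, j))
    return bonds

# Water with O-H distances of about 0.96 angstroms.
water = [("O", 0.0, 0.0, 0.0), ("H", 0.757, 0.586, 0.0), ("H", -0.757, 0.586, 0.0)]
print(infer_bonds(water))
```

A distance test gives connectivity only. It says nothing about bond order, and it cannot tell nitrogen from oxygen, which is exactly the ambiguity behind the ring-flip question above.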
Example Protein Data Bank file (pdb)
(uses an adjacency matrix)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION 1.80 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Annotations:
• CRYST1: space group and unit cell dimensions
• ATOM: protein atoms, including xyz, occupancy and temperature factor
• HETATM: non-protein atoms, including xyz, occupancy and temperature factor
• CONECT: bonds
New format: mmCIF
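PDB records are fixed-width, so fields are read by column position rather than by splitting on spaces. The sketch below writes one ATOM record with the standard column layout and reads it back; the function names and the dictionary keys are illustrative, and only the fields mentioned above (xyz, occupancy, temperature factor) plus the identifiers are handled.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z, occupancy, b_factor):
    """Write one ATOM record; name should be pre-padded to 4 characters, e.g. ' CA '."""
    return "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, name, res_name, chain, res_seq, x, y, z, occupancy, b_factor)

def parse_atom(line):
    """Read the fixed columns of an ATOM or HETATM record (0-based Python slices)."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }

record = format_atom(2, " CA ", "VAL", "A", 1, 4.283, -0.343, -3.015, 1.00, 28.41)
atom = parse_atom(record)
print(record)
print(atom["name"], atom["xyz"], atom["b_factor"])
```

Column slicing matters in practice: when a coordinate is large and negative, neighbouring fields touch with no space between them, and whitespace splitting then gives the wrong answer.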
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
- Used to alter the "internal coordinates" of a molecule (e.g. modelling a reaction)
- Early form of specification of a starting geometry for molecules: people sometimes used graph paper to draw the molecule and get a starting set of coordinates before optimisation
- A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending, e.g. a carbonyl
Non-bonded distance
[Figure: internal coordinates, showing dummy atom (du) positions]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

[Slide labels: Z-matrix; 'improper' torsion angles]
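Programs that accept a Z-matrix convert it to Cartesian coordinates internally. The sketch below shows one way to do that conversion, assuming the variables have already been replaced by numbers and each row is given as a tuple in the order described above (symbol, NA, distance, NB, angle, NC, dihedral). The first atom goes at the origin and the second on the z axis; the orientation chosen for the third atom is arbitrary. Function names are illustrative.

```python
import math

def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]

def _unit(a):
    length = math.sqrt(sum(v * v for v in a))
    return [v / length for v in a]

def place(a, b, c, r, angle_deg, dihedral_deg):
    """New atom at distance r from c, making angle (new, c, b) and dihedral (new, c, b, a)."""
    theta, phi = math.radians(angle_deg), math.radians(dihedral_deg)
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))      # normal to the a-b-c plane
    m = _cross(n, bc)                      # completes the local right-handed frame
    d = (-r * math.cos(theta), r * math.sin(theta) * math.cos(phi), r * math.sin(theta) * math.sin(phi))
    return [c[i] + d[0] * bc[i] + d[1] * m[i] + d[2] * n[i] for i in range(3)]

def zmatrix_to_cartesian(rows):
    """rows: tuples of length 1, 3, 5 or 7 with 1-based atom numbers and numeric values."""
    coords = []
    for row in rows:
        if len(row) == 1:
            coords.append([0.0, 0.0, 0.0])
        elif len(row) == 3:
            coords.append([0.0, 0.0, row[2]])
        else:
            c, b = coords[row[1] - 1], coords[row[3] - 1]
            if len(row) == 5:
                a, dihedral = [b[0], b[1] + 1.0, b[2]], 0.0   # arbitrary reference point
            else:
                a, dihedral = coords[row[5] - 1], row[6]
            coords.append(place(a, b, c, row[2], row[4], dihedral))
    return coords

water = zmatrix_to_cartesian([("O",), ("H", 1, 0.96), ("H", 1, 0.96, 2, 104.0)])
for xyz in water:
    print(["%.4f" % v for v in xyz])
```

Changing a single number in the input, for example a dihedral, moves the atom and everything defined relative to it, which is what makes internal coordinates convenient for driving a reaction coordinate.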
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file
• Chemical Markup Language (CML) is being developed specifically for chemistry
• In the future much more information will be stored with molecules, allowing greater re-use of data
• See: Chemical Markup, XML, and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
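To illustrate what "can be parsed" means, the sketch below builds a small CML-style description of ethanol with Python's standard XML library and reads it back. The element and attribute names for atoms and bonds follow common CML usage (molecule, atomArray, atom, bondArray, bond), but the property block and its attributes are illustrative assumptions, not a validated CML document.

```python
import xml.etree.ElementTree as ET

# Heavy-atom skeleton of ethanol: C-C-O.
molecule = ET.Element("molecule", id="ethanol")
atom_array = ET.SubElement(molecule, "atomArray")
for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
    ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)
bond_array = ET.SubElement(molecule, "bondArray")
for pair in ["a1 a2", "a2 a3"]:
    ET.SubElement(bond_array, "bond", atomRefs2=pair, order="1")

# Metadata travels with the value: here the units of a (hypothetically tagged) property.
prop = ET.SubElement(molecule, "property", title="boiling point")
scalar = ET.SubElement(prop, "scalar", units="celsius")
scalar.text = "78.4"

xml_text = ET.tostring(molecule, encoding="unicode")
print(xml_text)

# Any XML parser can now recover the structure and the metadata.
root = ET.fromstring(xml_text)
elements = [atom.get("elementType") for atom in root.iter("atom")]
units = root.find("property/scalar").get("units")
print(elements, units)
```

Because the file is self-describing, a program that knows nothing about the property block can still read the atoms and bonds and simply skip what it does not understand.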
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
Molecules and 3 dimensions
• Molecules are, of course, not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Data Bank, PDB)
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions)
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
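The film-strip idea can be sketched with the simple XYZ format standing in for SD files (an assumption made to keep the example short): each frame is an atom count, a comment line and one line per atom, and a trajectory is just the frames written one after another. The coordinates below are made-up snapshots, not simulation output.

```python
def xyz_frame(symbols, coords, comment=""):
    """One XYZ-format frame: atom count, comment line, then one line per atom."""
    lines = [str(len(symbols)), comment]
    for symbol, (x, y, z) in zip(symbols, coords):
        lines.append("%-2s %10.4f %10.4f %10.4f" % (symbol, x, y, z))
    return "\n".join(lines) + "\n"

def read_frames(text):
    """Split concatenated XYZ frames back into lists of (symbol, x, y, z)."""
    lines = text.splitlines()
    frames, i = [], 0
    while i < len(lines):
        n_atoms = int(lines[i])                  # first line of each frame
        atoms = []
        for line in lines[i + 2:i + 2 + n_atoms]:
            symbol, x, y, z = line.split()
            atoms.append((symbol, float(x), float(y), float(z)))
        frames.append(atoms)
        i += n_atoms + 2                         # jump to the next frame
    return frames

symbols = ["O", "H", "H"]
# Two made-up snapshots of water; one O-H bond is stretched in the second.
snapshots = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.96), (0.93, 0.0, -0.23)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 1.00), (0.93, 0.0, -0.23)],
]
trajectory = "".join(xyz_frame(symbols, c, "frame %d" % k) for k, c in enumerate(snapshots))
frames = read_frames(trajectory)
print(len(frames), frames[1][1])
```

Real molecular dynamics packages use compact binary trajectory formats, since a long simulation of a large system produces far too many frames to store as text.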
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor
Apelin's role in cancer
'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats - which could be a real pain, but they can be converted
alc -- Alchemy file prep -- Amber PREP file
bs -- Ball & Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball & Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file
cht -- Chemtool file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program - Babel
Recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content
Molecules on computers - things to look out for, since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding)
Tautomers can be incorrect - check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of MWt and topology) are very difficult to store
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file)
So there are very many file formats, which could be a real pain, but they can be converted
alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program: Babel
Recent IUPAC moves towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content
Molecules on computers: things to look out for, since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect.
In proteins, only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
Tautomers can be incorrect: check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of MWt and topology) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but not observed in the file).
2 Finding molecules
A typical problem involves finding the ‘right’ molecules
All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.
1 Finding molecules using Substructure searching
One of the most useful ‘properties’ of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, or fragments of molecules, or even ‘similar’ molecules.
For a specific molecule this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we’ve used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules (http://pubchem.ncbi.nlm.nih.gov).
A more complex example: find a molecule containing our query as part of a larger molecule.
Results: it takes a few seconds to search all 92 million molecules.
How does this work?
The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.
The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
Annotations on individual bits: ‘Contains P’, ‘Contains NH2’, ‘Doesn’t contain F’
So a quick match of the bitstring of the fragment searched over the (pre-computed) bitstrings in the database will eliminate most of the structures.
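A minimal sketch of this bitstring screen is below. The fragment keys, bit positions and molecule names are hypothetical; production fingerprints use hundreds or thousands of bits. A molecule survives the screen only if every bit set in the query is also set in the molecule.

```python
# Hypothetical fragment dictionary: one bit position per fragment
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3}

def fingerprint(fragments):
    """Pack the fragments present in a molecule into an integer bitstring."""
    bits = 0
    for frag in fragments:
        bits |= 1 << FRAGMENT_BITS[frag]
    return bits

def passes_screen(query_fp, molecule_fp):
    """True if every bit set in the query is also set in the molecule."""
    return (query_fp & molecule_fp) == query_fp

# Pre-computed bitstrings for a toy database
database = {
    "mol_a": fingerprint(["NH2", "OH"]),
    "mol_b": fingerprint(["P"]),
    "mol_c": fingerprint(["NH2", "P", "F"]),
}
query = fingerprint(["NH2", "P"])
candidates = [name for name, fp in database.items() if passes_screen(query, fp)]
print(candidates)
```

The screen only eliminates molecules. The survivors still have to go through the slower atom-by-atom match, because sharing the same fragments does not guarantee that the query is actually embedded in the molecule.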
Next step: substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware salts etc.: you may need to strip out counterions.
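As a rough illustration of stripping counterions, the sketch below keeps the largest dot-separated component of a SMILES string. The function name and the size measure (counting letters) are assumptions; a cheminformatics toolkit would parse the structure properly and apply curated salt lists.

```python
def strip_counterions(smiles):
    """Keep the largest '.'-separated component of a SMILES string.

    Size is estimated crudely by counting letters, so a two-letter
    element such as Cl counts twice. Good enough for a sketch only.
    """
    components = smiles.split(".")
    return max(components, key=lambda part: sum(ch.isalpha() for ch in part))

print(strip_counterions("CN(C)c1ccccc1.Cl"))     # dimethylaniline hydrochloride
print(strip_counterions("[Na+].CC(=O)[O-]"))     # sodium acetate
```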
Substructure Search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a subgraph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
There are two steps commonly used to do this: firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical; then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• ‘Subgraphs’ can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents, etc.
  - –OH
  - –NH2
  - –COOH
  - phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap
[Figure: example structure with fragments labelled OH, CH2, CHNH2, OH, O]
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments
Note: a number of ‘substructural’ fragments can be matched here (6 in all)
Read these for more details:
Gund P., Ann. Rep. in Med. Chem. 14, 299, 1979
Sheridan R.P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989
Brint A.T., Willett P., J. Mol. Graphics 5, 49, 1987
J.R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan H.L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-112
The basic concept is that molecules are considered as graphs
Atoms are ‘nodes’
Bonds are ‘edges’
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
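One possible way to hold such a labelled graph in a program is sketched below for the heavy atoms of ethanol. The atom indices and the names `atom_labels`, `bond_labels` and `neighbours` are choices made for this illustration, not a standard representation.

```python
# Heavy atoms of ethanol (C-C-O); the indices are arbitrary labels chosen here
atom_labels = {0: "C", 1: "C", 2: "O"}            # node labels: element symbols
bond_labels = {frozenset({0, 1}): 1,              # edge labels: bond orders
               frozenset({1, 2}): 1}

def neighbours(atom):
    """Return the atoms joined to `atom` by an edge, in sorted order."""
    return sorted(other for bond in bond_labels for other in bond
                  if atom in bond and other != atom)

print(neighbours(1))   # the central carbon touches both other heavy atoms
```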
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107–113, 1965
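Steps (i) to (iii) can be sketched as follows. Two simplifying assumptions are made here: the starting label is just the degree of each atom (atomic number and bond types are ignored), and step (ii) is read as "own label plus the labels of the neighbours". The function name and the atom numbering of cyclohexanone are hypothetical.

```python
def morgan_labels(adjacency):
    """adjacency maps each atom index to the list of its neighbours."""
    # (i) initial label: the degree of each atom
    labels = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # (ii) add the neighbours' labels to each atom's own label
        updated = {atom: labels[atom] + sum(labels[n] for n in nbrs)
                   for atom, nbrs in adjacency.items()}
        updated_classes = len(set(updated.values()))
        # (iii) stop once the number of distinct labels no longer increases
        if updated_classes <= n_classes:
            return labels
        labels, n_classes = updated, updated_classes

# Cyclohexanone: atom 0 is O, atom 1 the carbonyl C, atoms 2-6 complete the ring
cyclohexanone = {0: [1], 1: [0, 2, 6], 2: [1, 3], 3: [2, 4],
                 4: [3, 5], 5: [4, 6], 6: [5, 1]}
labels = morgan_labels(cyclohexanone)
order = sorted(labels, key=labels.get, reverse=True)   # highest label first
print(labels, order)
```

Symmetry-equivalent ring atoms (2 and 6, 3 and 5) end up with equal labels, which is where the tie-breaking rules come in.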
The first step - Numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest label, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest numbered atom
• Rules are used to break equalities, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered
Urea
Adjacency matrix
     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0
Cyclohexanone adjacency matrix
     O1  C2  C3  C4  C5  C6  C7
O1    0   0   0   1   0   0   0
C2    0   0   0   1   1   0   0
C3    0   0   0   1   0   0   1
C4    1   1   1   0   0   0   0
C5    0   1   0   0   0   1   0
C6    0   0   0   0   1   0   1
C7    0   0   1   0   0   1   0
Numbering is crucial. Renumber for different starting points. Also, the ‘edges’ of the graph are annotated to speed up the matching. This is the ‘slow’ part of matching (‘subgraph isomorphism’) and is of the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures
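The matching step can be illustrated with a plain backtracking search over adjacency matrices. This is not Ullmann's algorithm itself, which adds a refinement step to prune candidate atoms early, but it solves the same subgraph matching problem on small examples. The function names and the zero-based atom numbering are assumptions of this sketch.

```python
def adjacency(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of (i, j) bonds."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix

def match_substructure(q_labels, q_adj, t_labels, t_adj):
    """Return one query-atom -> target-atom mapping, or None if none exists."""
    mapping = []                      # mapping[k] = target atom for query atom k

    def extend():
        k = len(mapping)
        if k == len(q_labels):
            return True               # every query atom has been placed
        for t in range(len(t_labels)):
            if t in mapping or t_labels[t] != q_labels[k]:
                continue
            # each query bond to an already-placed atom must exist in the target
            if all(t_adj[t][mapping[j]] for j in range(k) if q_adj[k][j]):
                mapping.append(t)
                if extend():
                    return True
                mapping.pop()         # dead end: undo and try the next atom
        return False

    return list(mapping) if extend() else None

# Query: O and two C atoms all bonded to a central C (atom 3)
q_labels = ["O", "C", "C", "C"]
q_adj = adjacency(4, [(0, 3), (1, 3), (2, 3)])
# Target: cyclohexanone, atom 0 is O, atom 3 the carbonyl carbon
t_labels = ["O", "C", "C", "C", "C", "C", "C"]
t_adj = adjacency(7, [(0, 3), (3, 1), (3, 2), (1, 4), (4, 5), (5, 6), (6, 2)])
print(match_substructure(q_labels, q_adj, t_labels, t_adj))
```

The worst case grows very quickly with molecule size, which is why the cheap bitstring screen is run first.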
Next lecture
• How can we use this type of information to solve our chemistry problems?
  - Patents/Markush Structures
  - Molecular Similarity
  - Molecular Properties
  - 3D searching
  - Virtual Screening
  - Structure Property/Activity Relationships
What if we only know the atom positions and not the bonds?
A key example would be X-ray crystallography. Here we determine the positions of the atoms and impute the bonds from our chemical knowledge.
Here is an electron density map from an X-ray experiment. We see the electrons but not the nuclei. In a small molecule this is very accurate and we can almost see the bonds. However, in e.g. a protein structure, which is very large, we can’t always determine the atom positions exactly, so these are stored and the bonds imputed. The format for storing these is a bit different.
Which way round should this ring go?
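One common way to impute bonds from positions alone is a distance test: two atoms are treated as bonded if they sit closer than the sum of their covalent radii plus a tolerance. The sketch below assumes approximate radii for four elements and a tolerance of 0.4 Å. Both are illustrative values, and the method says nothing about bond order.

```python
import math

# Approximate single-bond covalent radii in angstroms (illustrative values)
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}

def impute_bonds(atoms, tolerance=0.4):
    """atoms: list of (symbol, x, y, z). Returns index pairs judged bonded."""
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            symbol_i, *pos_i = atoms[i]
            symbol_j, *pos_j = atoms[j]
            cutoff = COVALENT_RADIUS[symbol_i] + COVALENT_RADIUS[symbol_j] + tolerance
            if math.dist(pos_i, pos_j) <= cutoff:
                bonds.append((i, j))
    return bonds

# Water with made-up but reasonable coordinates
water = [("O", 0.00, 0.00, 0.00), ("H", 0.96, 0.00, 0.00), ("H", -0.24, 0.93, 0.00)]
print(impute_bonds(water))
```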
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER    OXIDOREDUCTASE/OXIDOREDUCTASE INHIBITOR 26-MAY-11   3S7A
TITLE     HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA    X-RAY DIFFRACTION
REMARK   2 RESOLUTION.    1.80 ANGSTROMS.
REMARK 200  TEMPERATURE           (KELVIN) : 113
REMARK 200  PH                             : 6.9
SEQRES   1 A  186  VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES   2 A  186  MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1   54.588   55.106   64.827  90.00  90.00  90.00 P 21 21 21    4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.018319  0.000000  0.000000        0.00000
SCALE2      0.000000  0.018147  0.000000        0.00000
SCALE3      0.000000  0.000000  0.015426        0.00000
ATOM      1  N   VAL A   1       3.036  -1.035  -3.538  1.00 31.28
ATOM      2  CA  VAL A   1       4.283  -0.343  -3.015  1.00 28.41
ATOM      3  C   VAL A   1       4.565   1.067  -3.593  1.00 27.64
ATOM      4  O   VAL A   1       4.653   1.287  -4.853  1.00 27.17
ATOM      5  CB  VAL A   1       5.510  -1.245  -3.141  1.00 30.01
HETATM 1510  S   SO4 A 187      25.993  -1.362   5.893  1.00 22.34
HETATM 1511  O1  SO4 A 187      25.504  -1.159   4.508  1.00 22.13
HETATM 1512  O2  SO4 A 187      27.282  -0.770   6.155  1.00 24.10
HETATM 1513  O3  SO4 A 187      24.953  -1.035   6.850  1.00 15.34
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER      334    0    6    6   18    0   11    6 1583    1   95   15
END
Annotations:
CRYST1: space group and unit cell dimensions
ATOM: protein atoms, including xyz, occupancy and temperature factor
HETATM: non-protein atoms, including xyz, occupancy and temperature factor
CONECT: bonds
New format: mmCIF
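PDB records are fixed-column text, so a reader slices each line by character position rather than splitting on spaces. The sketch below writes one ATOM line with a format string and reads it back; the helper names are hypothetical and only the fields shown above are handled (no alternate locations, element symbols or charges).

```python
# Column layout of an ATOM/HETATM record, written here as a format string
ATOM_FORMAT = "%-6s%5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f"

def parse_atom_line(line):
    """Slice one ATOM/HETATM record using the fixed PDB column positions."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),   # temperature factor
    }

# First atom of the example file: backbone nitrogen of Val 1, chain A
line = ATOM_FORMAT % ("ATOM", 1, " N", "VAL", "A", 1,
                      3.036, -1.035, -3.538, 1.00, 31.28)
atom = parse_atom_line(line)
print(line)
print(atom["name"], atom["x"], atom["b_factor"])
```

Splitting on whitespace would break as soon as two wide numbers touch or a field is blank, which is why the column positions matter.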
What if we want to vary the atom positions, e.g. in driving a reaction coordinate?
Using Cartesian (xyz) coordinates is very cumbersome, so instead we use the natural angles and distances.
This uses internal coordinates
• Also called a Z-matrix
  - Used to alter the “internal coordinates” of a molecule (e.g. modelling a reaction)
  - Early form of specification of a starting geometry for molecules: sometimes we used graph paper to draw the molecule and get a starting set of coordinates before optimisation
  - A Z-matrix uses the following geometric descriptions to describe molecules:
Bond length
Bond angle
Torsion angle
Out-of-plane bending, e.g. a carbonyl
Non-bonded distance
[Figure: internal coordinates, with dummy atom (du) positions marked on the structures]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1 and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0

Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

Annotation on the Z-matrix: ‘improper’ torsion angles
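To show how a Z-matrix turns back into xyz coordinates, here is a minimal converter. Its conventions are assumptions of this sketch: atom 1 sits at the origin, atom 2 lies on the x axis, atom 3 lies in the xy plane, and each row is a tuple of numbers rather than named variables. Quantum chemistry programs do the same job with their own orientation conventions.

```python
import math

def _sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def _cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def _unit(p):
    norm = math.sqrt(sum(a * a for a in p))
    return tuple(a / norm for a in p)

def zmatrix_to_cartesian(rows):
    """rows: (symbol, NA, distance, NB, angle_deg, NC, dihedral_deg), 1-based
    atom references; the first three rows omit the fields they do not need."""
    coords = []
    for i, row in enumerate(rows):
        if i == 0:
            coords.append((0.0, 0.0, 0.0))
            continue
        a, r = coords[row[1] - 1], row[2]
        if i == 1:
            coords.append((a[0] + r, 0.0, 0.0))
            continue
        b, theta = coords[row[3] - 1], math.radians(row[4])
        n1 = _unit(_sub(a, b))                    # unit vector from NB to NA
        if i == 2:
            m, n, phi = (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), 0.0
        else:
            c, phi = coords[row[5] - 1], math.radians(row[6])
            n = _unit(_cross(_sub(b, c), n1))     # normal to the NC-NB-NA plane
            m = _cross(n, n1)                     # in-plane axis on NC's side
        local = (-r * math.cos(theta),
                 r * math.sin(theta) * math.cos(phi),
                 r * math.sin(theta) * math.sin(phi))
        coords.append(tuple(a[k] + local[0] * n1[k] + local[1] * m[k]
                            + local[2] * n[k] for k in range(3)))
    return coords

# The water Z-matrix above, with the variable values substituted in
water = [("O",), ("H", 1, 0.96), ("H", 1, 0.96, 2, 104.0)]
xyz = zmatrix_to_cartesian(water)
print(xyz)
```

Changing one number in the Z-matrix (a bond length or a torsion) moves the atom along that internal coordinate, which is awkward to do directly in Cartesian coordinates.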
What about comprehensive properties of molecules? They are more than xyz
XML and molecules
• XML is a markup language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file
• Chemical Markup Language (CML) is being developed specifically for chemistry
• In the future much more information will be stored with molecules, allowing greater re-use of data
• See: Chemical Markup, XML and the World-Wide Web. Part I. Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties, etc.
- Can contain relationships to other molecules and also concepts
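A small sketch of the idea, using Python's standard XML library: ethanol is stored with its atoms, bonds and one property carrying its units as metadata. The element and attribute names are loosely modelled on CML and have not been validated against the CML schema, so treat them as illustrative.

```python
import xml.etree.ElementTree as ET

# Build a small molecule record: structure plus a property with metadata
molecule = ET.Element("molecule", id="ethanol")
atoms = ET.SubElement(molecule, "atomArray")
for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
    ET.SubElement(atoms, "atom", id=atom_id, elementType=element)
bonds = ET.SubElement(molecule, "bondArray")
for pair in ["a1 a2", "a2 a3"]:
    ET.SubElement(bonds, "bond", atomRefs2=pair, order="1")
prop = ET.SubElement(molecule, "property", title="boiling point", units="degC")
prop.text = "78"

# Serialise to text, then parse it back as any other program could
xml_text = ET.tostring(molecule, encoding="unicode")
parsed = ET.fromstring(xml_text)
print(xml_text)
print(parsed.find("property").get("units"), parsed.find("property").text)
```

Because the units travel with the number, a program reading the file does not have to guess what "78" means.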
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not
really flat because of thermal fluctuations So we represent 3D
molecules by including their coordinates or their internal coordinates
bull Obtaining the 3-dimensional coordinates can involve experiment (x-
ray electron or neutron diffraction eg the Cambridge
Crystallographic Database or the Protein Databank ndash PDB)
bull From these can be obtained atom positions bonds coordinates etc
bull There are a number of 3-D construction methods available such as
Corina or Concord (put in a SMILES and get a 3-D molecule) which
use rules derived from experiment
bull Molecules can also be constructed in 2D and subjected to molecular
mechanics or Quantum Mechanics calculations to obtain 3D structures
bull Conformation still remains to be deduced There are many methods
that deduce conformations usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot ndash but often
in many more torsional angle dimensions)
Molecules in 3D - uses
bull More accurate calculation of molecular
properties
bull Comparison of the shapes (conformations)
of molecules
bull Comparison of the dynamics of molecules
bull Calculation of bulk properties
bull Simulation of chemical reactionshelliphellip
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon)
Molecular Dynamics is the most popular method for larger systems Here is
an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a
series of snapshots of SD files concatenated together to make a movie just
like a film strip
httpsenwikipediaorgwikiCHARMM
httpambermdorg
httpwwwksuiuceduResearchnamd
httpwwwksuiuceduResearchvmd
We can simulate an entire biological receptor
and test new molecules to see how they might
work
bull Here is a simulation of an important new cancer target
called lsquoApelinrsquo with water and the membrane We used
these to design new molecules that block this receptor
Apelinrsquos role in Cancer
lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an
antagonist (a molecule that blocks the receptor) stop tumour growth
bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma
growth Brain 2017 awx253 httpsdoiorg101093brainawx253
bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et
al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372
So there are very many file formats
ndash which could be a real painbut
they can be convertedalc -- Alchemy file prep -- Amber PREP file
bs -- Ball amp Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical Resource
Kit 2D file
crk3d -- CRK3D Chemical Resource
Kit 3D file
box -- Dock 35 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI BiosymInsight II CAR
file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball amp Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table
file
cht -- Chemtool file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical
Resource Kit 2D file
crk3d -- CRK3D Chemical
Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 35 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats
-the bane of our lives
Interconnection program ndash Babel
Recent IUPAC moves towards a lsquostandardrsquo format however
In the near future there are likely to be many competing
requirements for file content
Molecules on computers ndash things to look out for
since what is stored is actually quite crude
For example-
Stereochemistry may be relative and not absolute or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution)
so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen
and oxygen get confused
Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved
correctly (problems with storing bonding)
Tautomers can be incorrect ndash check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of MWt and topology) are very difficult to store
Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics
halogen-aromatic bonds etc may be inferred but not observed in the file)
2 Finding molecules
A typical problem involves finding the
lsquorightrsquo molecules
All the molecules
synthesisable
1060 Molecules with
relevance to the problem
1010 (that we can generate)
Molecules we can
search efficiently 107
Molecules we can
make and test 103
Serious difference in numbers we can consider a few hundred
molecules in our heads computers can evaluate millions
1 Finding molecules using Substructure searching
One of the most useful lsquopropertiesrsquo of a molecule is its molecular graph
Many software systems allow searching for whole molecules or
fragments of molecules or even lsquosimilarrsquo molecules
For a specific molecule this means specifying the required search pattern to get an
exact match
An example might be Search for an exact match for dimethylaniline Here wersquove
used the PubChem database which contains compounds and pharmacological
screening data on 92 million unique molecules httppubchemncbinlmnihgov
A more complex example ndash find a molecule
containing our query as part of a larger molecule
Results ndash takes a few seconds to search all
92 million molecules
How does this workThe first thing to notice is that the searching is so fast that clearly all the database
is not being searched
The first part of the process is to pre-compute pointers to only the sets of
compounds in the database we are interested in
This is called Hashing
Speeding up compound matching using hash codes
bull The simplest Hash code registers the presence or absence
of fragments in a molecule eg
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
Contains PContains NH2
Doesnrsquot contain F
X
So a quick match of the bitstring of the fragment searched over the
(pre-computed) bitstrings in the database will eliminate most of the
structures
Next step ndash Substructure (pattern) matching
Exact match
Match the canonical SMILES InCHI or full structure
Beware salts etc may need to strip out counterions
Substructure Search
Supposing we wish to find part of a molecule in a database
To do this we have to do a substructure search
A substructure is a sub-graph of the molecular graph This is
an example of substructural fragments of a larger molecule
shown in different colours You could eg look for all the
molecules with the phenol fragment
There are two steps commonly used to do this firstly we
have to number all the molecules consistently (the Morgan
algorithm) for example to compare two structures to see if
they are identical then we have to match the query
substructure to each substructure in the molecule (the
Ullman algorithm)
Substructures in molecules
bull lsquoSubgraphsrsquo can be identified in a structure graph
corresponding to fragments of the whole structure
eg Rings substituents etc
ndash ndashOH
ndash ndashNH2
ndash ndashCOOH
ndash phenyl
bull this can be done by
tracing appropriate
paths in the graph
bull subgraphs may overlap
OH
CH2
CHNH2
OH
O
Matching the query structure to the database
Two algorithms are commonly used
The Morgan algorithm which numbers the molecule uniquely and
The Ullman algorithm which matches the fragments
Note ndash a number
of lsquosubstructuralrsquo fragments can
be matched here (6 in all)
Gund P Ann Rep in Med Chem 14299 1979
Sheridan RP et al J Chem Inf Comp Sci 29255 1989
Brint AT Willett P J Mol Graphics 549 1987
JR Ullman An Algorithm for Subgraph Isomorphism Journal of the Association of Computing
Machinery (JACM) Vol 23 pp 31-42 1976 httpportalacmorgcitationcfmid=321925
Morgan HL The generation of a unique machine description for chemical structures Journal of
Chemical Documentation 5 1965 107-112
Read these for more details-
The basic concept is that molecules are considered as
Graphs
Atoms are lsquonodesrsquo
Bonds are lsquoedgesrsquo
A molecule is an example of a labelled graph nodes can
have labels such as atomic number The nodes are
connected by the edges (and these can also be labelled
eg double bond)
The Morgan algorithm iteratively computes an integer label
for each node in a structure (atoms) For a computer the key
point is that the numbering is consistent (unlike humans) and
fast The Morgan algorithm works as follows
(i) Assign an integer label i to each node considering its
atomic number degree (number of substituents) and types of
bonds
(ii) Update for each label by adding the labels of the adjacent
nodes (connected atoms) and
(iii) Repeat (ii) until the number of different labels does not
increase Then order the nodes by the value of the labels
H L Morgan The generation of a unique machine description for chemical structures -
A technique developed at chemical abstracts service J Chemical Documentation
5107ndash113 1965
The first step - Numbering a molecule uniquely
A simple Morgan algorithm
bullThis is continued until the number of equivalent classes is unchanged
bullNumbering starts from highest then uses next highest connection as 2 etc Like an onion
concentric circles from the highest numbered
bullRules are used to break equalities such as C before N double bond before single etc
bullSome equivalent atoms are arbitrarily numbered
Urea
Adjacency matrix
O
C C
1
2 34
O
C2
C5
C6
C7
C3
1
C4
O1 C2 C3 C4
O1 1
C2 1
C3 1
C4 1 1 1
O1 C2 C3 C4 C5 C6 C7
O1 1
C2 1 1
C3 1 1
C4 1 1 1
C5 1 1
C6 1 1
C7 1 1
Cyclohexanone adjacency
matrix
Numbering is crucial Renumber
for different starting points Also
the lsquoedgesrsquo of the graph are annotated to speed up the matching This is the lsquoslowrsquo part
of lsquomatching subgraph isomorphismrsquo and is of the class of NP-complete problems
The Ullman Algorithm
is a way of detecting
detecting substructures
Urea
Adjacency matrix
O
C C
1
2 34
O
C2
C5
C6
C7
C3
1
C4
O1 C2 C3 C4
O1 1
C2 1
C3 1
C4 1 1 1
O1 C2 C3 C4 C5 C6 C7
O1 1
C2 1 1
C3 1 1
C4 1 1 1
C5 1 1
C6 1 1
C7 1 1
Cyclohexanone adjacency
matrix
Numbering is crucial Renumber
for different starting points Also
the lsquoedgesrsquo of the graph are annotated to speed up the matching This is the lsquoslowrsquo part
of lsquomatching subgraph isomorphismrsquo and is of the class of NP-complete problems
The Ullman Algorithm
is a way of detecting
detecting substructures
Urea
Adjacency matrix
O
C C
1
2 34
O
C2
C5
C6
C7
C3
1
C4
O1 C2 C3 C4
O1 1
C2 1
C3 1
C4 1 1 1
O1 C2 C3 C4 C5 C6 C7
O1 1
C2 1 1
C3 1 1
C4 1 1 1
C5 1 1
C6 1 1
C7 1 1
Cyclohexanone adjacency
matrix
Numbering is crucial Renumber
for different starting points Also
the lsquoedgesrsquo of the graph are annotated to speed up the matching This is the lsquoslowrsquo part
of lsquomatching subgraph isomorphismrsquo and is of the class of NP-complete problems
The Ullman Algorithm
is a way of detecting
detecting substructures
Next lecture
bull How can we use this type of information to
solve our chemistry problems
ndash PatentsMarkush Structures
ndash Molecular Similarity
ndash Molecular Properties
ndash 3D searching
ndash Virtual Screening
ndash Structure PropertyActivity Relationships
Example protein databank file (pdb)
(uses an adjacency matrix)
HEADER OXIDOREDUCTASEOXIDOREDUCTASE INHIBITOR 26-MAY-11 3S7A
TITLE HUMAN DIHYDROFOLATE REDUCTASE BINARY COMPLEX WITH PT684
EXPDTA X-RAY DIFFRACTION
REMARK 2 RESOLUTION 180 ANGSTROMS
REMARK 200 TEMPERATURE (KELVIN) 113
REMARK 200 PH 69
SEQRES 1 A 186 VAL GLY SER LEU ASN CYS ILE VAL ALA VAL SER GLN ASN
SEQRES 2 A 186 MET GLY ILE GLY LYS ASN GLY ASP LEU PRO TRP PRO PRO
CRYST1 54588 55106 64827 9000 9000 9000 P 21 21 21 4
ORIGX1 1000000 0000000 0000000 000000
ORIGX2 0000000 1000000 0000000 000000
ORIGX3 0000000 0000000 1000000 000000
SCALE1 0018319 0000000 0000000 000000
SCALE2 0000000 0018147 0000000 000000
SCALE3 0000000 0000000 0015426 000000
ATOM 1 N VAL A 1 3036 -1035 -3538 100 3128
ATOM 2 CA VAL A 1 4283 -0343 -3015 100 2841
ATOM 3 C VAL A 1 4565 1067 -3593 100 2764
ATOM 4 O VAL A 1 4653 1287 -4853 100 2717
ATOM 5 CB VAL A 1 5510 -1245 -3141 100 3001
HETATM 1510 S SO4 A 187 25993 -1362 5893 100 2234
HETATM 1511 O1 SO4 A 187 25504 -1159 4508 100 2213
HETATM 1512 O2 SO4 A 187 27282 -0770 6155 100 2410
HETATM 1513 O3 SO4 A 187 24953 -1035 6850 100 1534
CONECT 1510 1511 1512 1513 1514
CONECT 1511 1510
CONECT 1512 1510
MASTER 334 0 6 6 18 0 11 6 1583 1 95 15
END
Space group and unit cell
dimensions
Protein atoms ndash including xyz
occupancy and temperature factor
Non-protein atoms ndash including xyz
occupancy and temperature factor
Bonds New format
mmCIF
What if we want to vary the atom
positions eg in driving a
reaction coordinate
Using Cartesian (xyz)
coordinates is very
cumbersome so instead
we use the natural angles
and distances
This uses internal coordinates
bull Also called a Z-matrix
ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)
ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation
ndash A z-matrix uses the following geometric descriptions to describe molecules
Bond Length
Bond angle
Torsion angle
Out of plane bending
eg a carbonyl
Non-bonded distance
C
O
du
du
O
N OH
Dummy atom
positions
Internal
Coordinates
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
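To see what a program does with such a Z-matrix, here is a minimal sketch that turns rows of the form described above into Cartesian coordinates. It is an illustration under stated assumptions, not Gaussian's own code: the function name is hypothetical, the first atom is put at the origin and the second on the z axis, a dihedral of 0 degrees is taken to mean cis, and collinear reference atoms (which would need a dummy atom) are not handled.

```python
import math


def _sub(p, q):
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])


def _cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def _unit(p):
    norm = math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2)
    return (p[0] / norm, p[1] / norm, p[2] / norm)


def zmatrix_to_cartesian(rows, variables):
    """Convert Z-matrix rows such as 'H 1 l1 2 a1' into (symbol, (x, y, z)) pairs."""

    def value(token):
        # A token is a variable name, a negated variable name, or a literal number.
        if token in variables:
            return variables[token]
        if token.startswith("-") and token[1:] in variables:
            return -variables[token[1:]]
        return float(token)

    symbols, coords = [], []
    for row in rows:
        fields = row.split()
        n = len(coords)
        if n == 0:
            xyz = (0.0, 0.0, 0.0)               # first atom at the origin
        else:
            a = coords[int(fields[1]) - 1]      # atom NA: fixes the distance
            r = value(fields[2])
            if n == 1:
                xyz = (a[0], a[1], a[2] + r)    # second atom along the z axis
            else:
                b = coords[int(fields[3]) - 1]  # atom NB: fixes the angle
                theta = math.radians(value(fields[4]))
                ab = _unit(_sub(b, a))          # direction from NA towards NB
                if n == 2:
                    # Third atom: place it in the xz plane.
                    perp, out, phi = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), 0.0
                else:
                    c = coords[int(fields[5]) - 1]  # atom NC: fixes the dihedral
                    phi = math.radians(value(fields[6]))
                    out = _unit(_cross(_sub(b, c), _sub(a, b)))
                    perp = _cross(out, _unit(_sub(a, b)))
                xyz = tuple(
                    a[k]
                    + r * math.cos(theta) * ab[k]
                    + r * math.sin(theta)
                    * (math.cos(phi) * perp[k] + math.sin(phi) * out[k])
                    for k in range(3)
                )
        symbols.append(fields[0])
        coords.append(xyz)
    return list(zip(symbols, coords))


# The water Z-matrix, with l1 = 0.96 and a1 = 104.0 degrees.
water = zmatrix_to_cartesian(["O", "H 1 l1", "H 1 l1 2 a1"],
                             {"l1": 0.96, "a1": 104.0})
for symbol, (x, y, z) in water:
    print(f"{symbol} {x:8.4f} {y:8.4f} {z:8.4f}")
```

Changing a single variable (say a1) and converting again moves the atoms along one internal coordinate, which is why this representation suits driving a reaction coordinate.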
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

Z-matrix: 'improper' torsion angles
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a markup language that allows 'metadata' to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The 'X' stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See "Chemical Markup, XML and the World-Wide Web. Part I: Basic principles", P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comput. Sci. 1999, 39, 928.
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties, etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
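As a rough illustration of "can be parsed", the sketch below reads a small CML-style description of ethanol with Python's standard XML parser. The fragment is hand-written and simplified: the element and attribute names follow CML conventions as far as I know them, but it carries no namespace and has not been validated against the CML schema, and the boiling-point entry is only there to show a value stored with its units.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written CML-style fragment (an assumption, not schema-checked).
CML_TEXT = """
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="C"/>
    <atom id="a3" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
  <property title="boiling point">
    <scalar units="celsius">78.4</scalar>
  </property>
</molecule>
"""

root = ET.fromstring(CML_TEXT.strip())

# Atoms: id -> element symbol.
atoms = {atom.get("id"): atom.get("elementType") for atom in root.iter("atom")}

# Bonds: pairs of atom ids.
bonds = [tuple(bond.get("atomRefs2").split()) for bond in root.iter("bond")]

# Properties: title -> (value, units); the units are the metadata.
properties = {
    prop.get("title"): (float(prop.find("scalar").text),
                        prop.find("scalar").get("units"))
    for prop in root.iter("property")
}

print(atoms)
print(bonds)
print(properties)
```

The same three heavy atoms and two bonds are what the SMILES string C(C)O encodes; the XML form simply leaves room for extra, labelled data alongside them.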
Molecules and 3 dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat, because of thermal fluctuations. So we represent 3D molecules by including their Cartesian coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates, etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
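The "film strip" idea can be sketched in a few lines. The example below uses concatenated XYZ blocks rather than SD files, purely because the XYZ layout (atom count, comment line, then one atom per line) is shorter to show. The two-frame H2 trajectory is made-up illustrative data, and the function name is hypothetical.

```python
import math

# Two concatenated XYZ snapshots of H2 (illustrative values, not simulation output).
TRAJECTORY = """
2
frame 0
H 0.0 0.0 0.0
H 0.0 0.0 0.74
2
frame 1
H 0.0 0.0 0.0
H 0.0 0.0 0.80
"""


def read_xyz_frames(text):
    """Split concatenated XYZ blocks into a list of frames."""
    lines = text.strip().splitlines()
    frames, i = [], 0
    while i < len(lines):
        n_atoms = int(lines[i])                      # first line: atom count
        atoms = []
        for line in lines[i + 2 : i + 2 + n_atoms]:  # skip the comment line
            symbol, x, y, z = line.split()
            atoms.append((symbol, float(x), float(y), float(z)))
        frames.append(atoms)
        i += n_atoms + 2                             # jump to the next snapshot
    return frames


frames = read_xyz_frames(TRAJECTORY)
# One number per frame: the H-H distance as the "movie" plays.
bond_lengths = [math.dist(frame[0][1:], frame[1][1:]) for frame in frames]
print(bond_lengths)
```

Real molecular dynamics packages store trajectories in compact binary formats, but the logic of stepping through frames and measuring something in each one is the same.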
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used such simulations to design new molecules that block this receptor.
Apelin's role in cancer
'Apelin' (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., "Pharmacological targeting of apelin impairs glioblastoma growth", Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., "Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review", Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file                            prep -- Amber PREP file
bs -- Ball & Stick file                        caccrt -- Cacao Cartesian file
ccc -- CCC file                                c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file   crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                       dmol -- DMol3 Coordinates file
feat -- Feature file                           gam -- GAMESS Output file
gamout -- GAMESS Output file                   gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                      qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                      jout -- Jaguar Output file
bin -- OpenEye Binary file                     mmd -- MacroModel file
mmod -- MacroModel file                        out -- MacroModel file
dat -- MacroModel file                         car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                       sd -- MDL Isis SDF file
mdl -- MDL Molfile file                        mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                 mopout -- MOPAC Output file
mmads -- MMADS file                            mpqc -- MPQC file
bgf -- MSI BGF file                            nwo -- NWChem Output file
pdb -- PDB file                                ent -- PDB file
pqs -- PQS file                                qcout -- Q-Chem Output file
res -- ShelX file                              ins -- ShelX file
smi -- SMILES file                             mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                     vmol -- ViewMol file

alc -- Alchemy file                            bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                 cacint -- Cacao Internal file
cache -- CAChe MolStruct file                  c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                ct -- ChemDraw Connection Table file
cht -- Chemtool file                           cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file   crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                          box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                 feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                gamin -- GAMESS Input file
inp -- GAMESS Input file                       gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                     gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                     gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                      jin -- Jaguar Input file
bin -- OpenEye Binary file                     mmd -- MacroModel file
mmod -- MacroModel file                        out -- MacroModel file
dat -- MacroModel file                         sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                        mdl -- MDL Molfile file
mol -- MDL Molfile file                        mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                            bgf -- MSI BGF file
csr -- MSI Quanta CSR file                     nw -- NWChem Input file
pdb -- PDB file                                ent -- PDB file
pov -- POV-Ray Output file                     pqs -- PQS file
report -- Report file                          qcin -- Q-Chem Input file
smi -- SMILES file                             fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                        txyz -- Tinker XYZ file
txt -- Titles file                             unixyz -- UniChem XYZ file
vmol -- ViewMol file                           xed -- XED file
xyz -- XYZ file                                zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program: Babel
Recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content.
Molecules on computers - things to look out for, since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect.
In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
Tautomers can be incorrect: check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of MWt and topology) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but not observed in the file).
2 Finding molecules
A typical problem involves finding the 'right' molecules
All the molecules synthesisable: 10^60
Molecules with relevance to the problem: 10^10 (that we can generate)
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.
1 Finding molecules using substructure searching
One of the most useful 'properties' of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, or fragments of molecules, or even 'similar' molecules.
For a specific molecule, this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example: find a molecule containing our query as part of a larger molecule.
Results: it takes a few seconds to search all 92 million molecules.
How does this work?
The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.
The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
 0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
Contains P
Contains NH2
Doesn't contain F
So a quick match of the bitstring of the fragment searched over the (pre-computed) bitstrings in the database will eliminate most of the structures.
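A minimal sketch of this screening step is given below. The fragment keys and the toy database are hand-written assumptions for illustration; a real system derives the fragments from each structure and uses far longer bitstrings. A molecule can only contain the query if every bit set in the query is also set in the molecule, so one bitwise AND per molecule rejects most of the database.

```python
# Hypothetical fragment dictionary: one bit position per fragment.
FRAGMENT_KEYS = ["OH", "NH2", "COOH", "phenyl", "F", "P"]


def fingerprint(fragments):
    """Set bit i when fragment i of FRAGMENT_KEYS is present."""
    bits = 0
    for i, key in enumerate(FRAGMENT_KEYS):
        if key in fragments:
            bits |= 1 << i
    return bits


def passes_screen(query_bits, molecule_bits):
    """True when every bit set in the query is also set in the molecule."""
    return (query_bits & molecule_bits) == query_bits


# Toy database with fragments assigned by hand; bitstrings are pre-computed once.
DATABASE = {
    "ethanol": {"OH"},
    "phenol": {"OH", "phenyl"},
    "aniline": {"NH2", "phenyl"},
    "tyrosine": {"OH", "NH2", "COOH", "phenyl"},
}
DATABASE_BITS = {name: fingerprint(frags) for name, frags in DATABASE.items()}

query_bits = fingerprint({"OH", "phenyl"})
candidates = [name for name, bits in DATABASE_BITS.items()
              if passes_screen(query_bits, bits)]
print(candidates)
```

The survivors are only candidates: a molecule with an OH group somewhere and a phenyl ring somewhere else passes the screen without containing phenol, so an atom-by-atom match is still needed afterwards.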
Next step - substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware: salts etc. may need their counterions stripped out.
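One crude way to strip counterions before an exact-match lookup is sketched below. It assumes the record is a SMILES string in which disconnected components are separated by '.', and it keeps the component with the most atoms. The function name and the atom-counting pattern (bracket atoms plus the SMILES organic subset) are assumptions made for this sketch. It does not canonicalise anything, so the result would still need to go through a toolkit's canonical SMILES or InChI generation before comparison.

```python
import re

# One match per atom: a bracket atom, or an organic-subset atom (aliphatic or aromatic).
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def largest_component(smiles):
    """Return the '.'-separated SMILES component with the most atoms."""
    components = smiles.split(".")
    return max(components, key=lambda part: len(ATOM_TOKEN.findall(part)))


print(largest_component("CN(C)c1ccccc1.Cl"))   # dimethylaniline hydrochloride
print(largest_component("CC(=O)[O-].[Na+]"))   # sodium acetate
```

Keeping the largest component is a heuristic, not a rule: it fails when the counterion is the bigger part, or when the record is a genuine mixture.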
Substructure search
Supposing we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a subgraph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
There are two steps commonly used to do this: firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical; then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents, etc.
  - -OH
  - -NH2
  - -COOH
  - phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap
[Figure: example molecule with its OH, CH2, CHNH2, OH and O groups labelled]
Matching the query structure to the database
Two algorithms are commonly used:
the Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments.
Note: a number of 'substructural' fragments can be matched here (6 in all).
Read these for more details:
Gund P., Ann. Rep. Med. Chem. 14, 299 (1979)
Sheridan R.P. et al., J. Chem. Inf. Comput. Sci. 29, 255 (1989)
Brint A.T., Willett P., J. Mol. Graphics 5, 49 (1987)
Ullmann J.R., "An Algorithm for Subgraph Isomorphism", Journal of the Association for Computing Machinery (JACM) 23, 31-42 (1976), http://portal.acm.org/citation.cfm?id=321925
Morgan H.L., "The generation of a unique machine description for chemical structures", Journal of Chemical Documentation 5, 107-112 (1965)
The basic concept is that molecules are considered as graphs
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, "The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service", J. Chemical Documentation 5, 107-113 (1965)
The first step - numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest label, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest numbered atom
• Rules are used to break equalities, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered
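The labelling loop can be sketched as follows. This is a simplified reading of steps (i) to (iii), not a reference implementation: the initial label is the degree only (atomic number and bond types are ignored), each update adds the neighbours' labels to the atom's own label, and the tie-breaking rules are left out. Some formulations replace the label by the neighbour sum alone. The atom names for cyclohexanone (O, C1 to C6, with C1 the carbonyl carbon) are my own numbering.

```python
def morgan_labels(neighbours):
    """Iterate connectivity labels until the number of distinct labels stops growing."""
    # Step (i), simplified: start from the degree of each atom.
    labels = {atom: len(adjacent) for atom, adjacent in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): add the labels of the adjacent atoms to each label.
        updated = {
            atom: labels[atom] + sum(labels[other] for other in adjacent)
            for atom, adjacent in neighbours.items()
        }
        n_updated = len(set(updated.values()))
        # Step (iii): stop once the number of different labels no longer increases.
        if n_updated <= n_classes:
            return labels
        labels, n_classes = updated, n_updated


# Cyclohexanone as a graph: each atom maps to the atoms it is bonded to.
CYCLOHEXANONE = {
    "O": ["C1"],
    "C1": ["O", "C2", "C6"],
    "C2": ["C1", "C3"],
    "C3": ["C2", "C4"],
    "C4": ["C3", "C5"],
    "C5": ["C4", "C6"],
    "C6": ["C5", "C1"],
}

labels = morgan_labels(CYCLOHEXANONE)
ranking = sorted(labels, key=labels.get, reverse=True)
print(labels)
print(ranking)
```

For this graph the loop stops after two accepted updates with five distinct labels. C2 and C6 share a label, as do C3 and C5, because they are related by symmetry; these are the equivalent atoms that end up arbitrarily numbered.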
Urea
[Figure: urea skeleton with atoms numbered 1-4]
Adjacency matrix
     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0
Cyclohexanone
[Figure: cyclohexanone with atoms numbered O1, C2-C7]
[Table: cyclohexanone adjacency matrix over O1, C2-C7; the O1 row has one entry, the C4 row three, and every other carbon row two]
Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching ('subgraph isomorphism') and is of the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures.
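To make the matching step concrete, the sketch below builds adjacency matrices from bond lists and searches for a query inside a target by plain backtracking. It is a simplified stand-in, not the Ullmann algorithm itself: Ullmann's method adds a refinement step that prunes impossible atom pairings before and during the search. Bond orders are ignored, atoms are matched on element symbol only, and the function names and atom numbering are assumptions of this sketch.

```python
def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 matrix with a 1 wherever two atoms are bonded."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix


def find_substructure(q_elements, q_adj, t_elements, t_adj):
    """Return one mapping (query index -> target index) as a list, or None."""
    mapping = []  # mapping[q] is the target atom assigned to query atom q

    def extend():
        q = len(mapping)
        if q == len(q_elements):
            return True  # every query atom has been placed
        for t in range(len(t_elements)):
            if t in mapping or t_elements[t] != q_elements[q]:
                continue
            # Every query bond to an already placed atom must exist in the target.
            if all(t_adj[t][mapping[p]] for p in range(q) if q_adj[q][p]):
                mapping.append(t)
                if extend():
                    return True
                mapping.pop()  # dead end: undo and try the next target atom
        return False

    return list(mapping) if extend() else None


# Target: cyclohexanone, atom 0 = O, atom 1 = carbonyl C, atoms 2-6 = ring carbons.
target_elements = ["O", "C", "C", "C", "C", "C", "C"]
target_adj = adjacency_matrix(
    7, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)])

# Query: a carbon bonded to an oxygen.
query_elements = ["C", "O"]
query_adj = adjacency_matrix(2, [(0, 1)])

match = find_substructure(query_elements, query_adj, target_elements, target_adj)
print(match)
```

In the worst case the search tries every assignment of query atoms to target atoms, which is why the fingerprint screen is applied first and why the problem as a whole is NP-complete.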
Next lecture
• How can we use this type of information to solve our chemistry problems?
  - Patents/Markush structures
  - Molecular similarity
  - Molecular properties
  - 3D searching
  - Virtual screening
  - Structure-property/activity relationships
What if we want to vary the atom
positions eg in driving a
reaction coordinate
Using Cartesian (xyz)
coordinates is very
cumbersome so instead
we use the natural angles
and distances
This uses internal coordinates
bull Also called a Z-matrix
ndash Used to alter the ldquointernal coordinatesrdquo of a molecule (eg modelling a reaction)
ndash Early form of specification of a starting geometry for molecules ndash sometimes used graph paper draw the molecule and get a starting set of coordinates before optimisation
ndash A z-matrix uses the following geometric descriptions to describe molecules
Bond Length
Bond angle
Torsion angle
Out of plane bending
eg a carbonyl
Non-bonded distance
C
O
du
du
O
N OH
Dummy atom
positions
Internal
Coordinates
How to construct a Z-matrix (in Gaussian format)
1For the first atom to be defined give the atomic symbol
only
2For the second atom give the atomic symbol the number
1 and the name of a variable to describe the distance
between atoms 1 and 2
3For the third atom give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB and
the name of a variable to describe the angle between the
current atom NA and NB
4For all later atoms give the atomic symbol the atom
number NA the name of a variable to describe the distance
between the current atom and NA the atom number NB the
name of a variable to describe the angle between the current
atom NA and NB the atom number of another previously
defined atom NC and finally the name of a variable to
describe the dihedral angle between the current atom NA
NB and NC
5After all the atoms have been listed enter a blank line
6Next list each variable with its corresponding value Use a
separate line for each variable
7In some cases where some of the variables are to be fixed
as constants in a geometry optimisation they are listed here
after a blank line rather than above
with the real variables
8End the Z-matrix with a blank line
Water (C2v)
O
H 1 l1
H l l1 2 a1
l1 096
a1 1040
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 1800
l1 142
l2 109
l3 109
l4 109
l5 109
l6 10
a1 1090
a2 1100
a3 1080
a4 1100
a5 1100
da1 600
da2 1200
da3 600 z-matrix
lsquoimproperrsquo torsion angles
What about comprehensive
properties of molecules ndash they
are more than xyz
XML and molecules
bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc
bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file
bull Chemical Markup Language (CML) is being developed specifically for chemistry
bull In the future much more information will be stored with molecules allowing greater re-use of data
bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928
Ethanol
ltCMLgt
-Can be parsed
-Can contain reactions
properties etc
-Can contain
relationships to other
molecules and also
concepts
InChI
InChI=1C2H6Oc1-2-3h3H2H21H3
SMILES
C(C)O
Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not
really flat because of thermal fluctuations So we represent 3D
molecules by including their coordinates or their internal coordinates
bull Obtaining the 3-dimensional coordinates can involve experiment (x-
ray electron or neutron diffraction eg the Cambridge
Crystallographic Database or the Protein Databank ndash PDB)
bull From these can be obtained atom positions bonds coordinates etc
bull There are a number of 3-D construction methods available such as
Corina or Concord (put in a SMILES and get a 3-D molecule) which
use rules derived from experiment
bull Molecules can also be constructed in 2D and subjected to molecular
mechanics or Quantum Mechanics calculations to obtain 3D structures
bull Conformation still remains to be deduced There are many methods
that deduce conformations usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot ndash but often
in many more torsional angle dimensions)
Molecules in 3D - uses
bull More accurate calculation of molecular
properties
bull Comparison of the shapes (conformations)
of molecules
bull Comparison of the dynamics of molecules
bull Calculation of bulk properties
bull Simulation of chemical reactionshelliphellip
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon)
Molecular Dynamics is the most popular method for larger systems Here is
an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a
series of snapshots of SD files concatenated together to make a movie just
like a film strip
httpsenwikipediaorgwikiCHARMM
httpambermdorg
httpwwwksuiuceduResearchnamd
httpwwwksuiuceduResearchvmd
We can simulate an entire biological receptor
and test new molecules to see how they might
work
bull Here is a simulation of an important new cancer target
called lsquoApelinrsquo with water and the membrane We used
these to design new molecules that block this receptor
Apelinrsquos role in Cancer
lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an
antagonist (a molecule that blocks the receptor) stop tumour growth
bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma
growth Brain 2017 awx253 httpsdoiorg101093brainawx253
bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et
al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372
So there are very many file formats
ndash which could be a real painbut
they can be convertedalc -- Alchemy file prep -- Amber PREP file
bs -- Ball amp Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical Resource
Kit 2D file
crk3d -- CRK3D Chemical Resource
Kit 3D file
box -- Dock 35 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI BiosymInsight II CAR
file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball amp Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table
file
cht -- Chemtool file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical
Resource Kit 2D file
crk3d -- CRK3D Chemical
Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 35 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats
-the bane of our lives
Interconnection program ndash Babel
Recent IUPAC moves towards a lsquostandardrsquo format however
In the near future there are likely to be many competing
requirements for file content
Molecules on computers ndash things to look out for
since what is stored is actually quite crude
For example-
Stereochemistry may be relative and not absolute or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution)
so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen
and oxygen get confused
Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved
correctly (problems with storing bonding)
Tautomers can be incorrect ndash check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of MWt and topology) are very difficult to store
Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics
halogen-aromatic bonds etc may be inferred but not observed in the file)
2 Finding molecules
A typical problem involves finding the
lsquorightrsquo molecules
All the molecules
synthesisable
1060 Molecules with
relevance to the problem
1010 (that we can generate)
Molecules we can
search efficiently 107
Molecules we can
make and test 103
Serious difference in numbers we can consider a few hundred
molecules in our heads computers can evaluate millions
1 Finding molecules using Substructure searching
One of the most useful lsquopropertiesrsquo of a molecule is its molecular graph
Many software systems allow searching for whole molecules or
fragments of molecules or even lsquosimilarrsquo molecules
For a specific molecule this means specifying the required search pattern to get an
exact match
An example might be Search for an exact match for dimethylaniline Here wersquove
used the PubChem database which contains compounds and pharmacological
screening data on 92 million unique molecules httppubchemncbinlmnihgov
A more complex example ndash find a molecule
containing our query as part of a larger molecule
Results ndash takes a few seconds to search all
92 million molecules
How does this workThe first thing to notice is that the searching is so fast that clearly all the database
is not being searched
The first part of the process is to pre-compute pointers to only the sets of
compounds in the database we are interested in
This is called Hashing
Speeding up compound matching using hash codes
bull The simplest Hash code registers the presence or absence
of fragments in a molecule eg
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
Contains PContains NH2
Doesnrsquot contain F
X
So a quick match of the bitstring of the fragment searched over the
(pre-computed) bitstrings in the database will eliminate most of the
structures
Next step ndash Substructure (pattern) matching
Exact match
Match the canonical SMILES InCHI or full structure
Beware salts etc may need to strip out counterions
Substructure Search
Supposing we wish to find part of a molecule in a database
To do this we have to do a substructure search
A substructure is a sub-graph of the molecular graph This is
an example of substructural fragments of a larger molecule
shown in different colours You could eg look for all the
molecules with the phenol fragment
There are two steps commonly used to do this firstly we
have to number all the molecules consistently (the Morgan
algorithm) for example to compare two structures to see if
they are identical then we have to match the query
substructure to each substructure in the molecule (the
Ullman algorithm)
Substructures in molecules
bull lsquoSubgraphsrsquo can be identified in a structure graph
corresponding to fragments of the whole structure
eg Rings substituents etc
ndash ndashOH
ndash ndashNH2
ndash ndashCOOH
ndash phenyl
bull this can be done by
tracing appropriate
paths in the graph
bull subgraphs may overlap
OH
CH2
CHNH2
OH
O
Matching the query structure to the database
Two algorithms are commonly used
The Morgan algorithm which numbers the molecule uniquely and
The Ullman algorithm which matches the fragments
Note ndash a number
of lsquosubstructuralrsquo fragments can
be matched here (6 in all)
Gund P Ann Rep in Med Chem 14299 1979
Sheridan RP et al J Chem Inf Comp Sci 29255 1989
Brint AT Willett P J Mol Graphics 549 1987
JR Ullman An Algorithm for Subgraph Isomorphism Journal of the Association of Computing
Machinery (JACM) Vol 23 pp 31-42 1976 httpportalacmorgcitationcfmid=321925
Morgan HL The generation of a unique machine description for chemical structures Journal of
Chemical Documentation 5 1965 107-112
Read these for more details-
The basic concept is that molecules are considered as
Graphs
Atoms are lsquonodesrsquo
Bonds are lsquoedgesrsquo
A molecule is an example of a labelled graph nodes can
have labels such as atomic number The nodes are
connected by the edges (and these can also be labelled
eg double bond)
The Morgan algorithm iteratively computes an integer label
for each node in a structure (atoms) For a computer the key
point is that the numbering is consistent (unlike humans) and
fast The Morgan algorithm works as follows
(i) Assign an integer label i to each node considering its
atomic number degree (number of substituents) and types of
bonds
(ii) Update for each label by adding the labels of the adjacent
nodes (connected atoms) and
(iii) Repeat (ii) until the number of different labels does not
increase Then order the nodes by the value of the labels
H L Morgan The generation of a unique machine description for chemical structures -
A technique developed at chemical abstracts service J Chemical Documentation
5107ndash113 1965
The first step - Numbering a molecule uniquely
A simple Morgan algorithm
bullThis is continued until the number of equivalent classes is unchanged
bullNumbering starts from highest then uses next highest connection as 2 etc Like an onion
concentric circles from the highest numbered
bullRules are used to break equalities such as C before N double bond before single etc
bullSome equivalent atoms are arbitrarily numbered
Urea
Adjacency matrix
O
C C
1
2 34
O
C2
C5
C6
C7
C3
1
C4
O1 C2 C3 C4
O1 1
C2 1
C3 1
C4 1 1 1
O1 C2 C3 C4 C5 C6 C7
O1 1
C2 1 1
C3 1 1
C4 1 1 1
C5 1 1
C6 1 1
C7 1 1
Cyclohexanone adjacency
matrix
Numbering is crucial Renumber
for different starting points Also
the lsquoedgesrsquo of the graph are annotated to speed up the matching This is the lsquoslowrsquo part
of lsquomatching subgraph isomorphismrsquo and is of the class of NP-complete problems
The Ullman Algorithm
is a way of detecting
detecting substructures
Urea
Adjacency matrix
O
C C
1
2 34
O
C2
C5
C6
C7
C3
1
C4
O1 C2 C3 C4
O1 1
C2 1
C3 1
C4 1 1 1
O1 C2 C3 C4 C5 C6 C7
O1 1
C2 1 1
C3 1 1
C4 1 1 1
C5 1 1
C6 1 1
C7 1 1
Cyclohexanone adjacency
matrix
Numbering is crucial Renumber
for different starting points Also
the lsquoedgesrsquo of the graph are annotated to speed up the matching This is the lsquoslowrsquo part
of lsquomatching subgraph isomorphismrsquo and is of the class of NP-complete problems
The Ullman Algorithm
is a way of detecting
detecting substructures
Urea
Adjacency matrix
O
C C
1
2 34
O
C2
C5
C6
C7
C3
1
C4
O1 C2 C3 C4
O1 1
C2 1
C3 1
C4 1 1 1
O1 C2 C3 C4 C5 C6 C7
O1 1
C2 1 1
C3 1 1
C4 1 1 1
C5 1 1
C6 1 1
C7 1 1
Cyclohexanone adjacency
matrix
Numbering is crucial Renumber
for different starting points Also
the lsquoedgesrsquo of the graph are annotated to speed up the matching This is the lsquoslowrsquo part
of lsquomatching subgraph isomorphismrsquo and is of the class of NP-complete problems
The Ullman Algorithm
is a way of detecting
detecting substructures
Next lecture
• How can we use this type of information to solve our chemistry problems?
- Patents/Markush structures
- Molecular similarity
- Molecular properties
- 3D searching
- Virtual screening
- Structure-property/activity relationships
This uses internal coordinates
• Also called a Z-matrix
- Used to alter the "internal coordinates" of a molecule (e.g. when modelling a reaction)
- An early way of specifying a starting geometry for molecules: sometimes graph paper was used to draw the molecule and get a starting set of coordinates before optimisation
- A Z-matrix uses the following geometric descriptions to describe molecules:
  Bond length
  Bond angle
  Torsion angle
  Out-of-plane bending (e.g. a carbonyl)
  Non-bonded distance
[Figure: internal coordinates, with dummy atom (du) positions marked]
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
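To see what the Z-matrix rows mean geometrically, here is a minimal Python sketch that turns rows of (symbol, NA, distance, NB, angle, NC, dihedral) into Cartesian coordinates. The function names, the choice to put atom 1 at the origin and atom 2 on the z axis, and the sign convention of the dihedral are all assumptions of this sketch; real programs such as Gaussian make their own choices. Values are already substituted for the variable names.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def _scale(a, s):
    return tuple(x * s for x in a)

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(a):
    return _scale(a, 1.0 / math.sqrt(_dot(a, a)))

def place(a, b, c, r, angle, dihedral):
    """New atom at distance r from a, angle new-a-b, dihedral new-a-b-c (degrees)."""
    theta, phi = math.radians(angle), math.radians(dihedral)
    u = _unit(_sub(b, a))                          # direction a -> b
    v = _sub(c, b)
    e1 = _unit(_sub(v, _scale(u, _dot(v, u))))     # part of b -> c perpendicular to u
    e2 = _cross(u, e1)
    w = _add(_scale(e1, math.cos(phi)), _scale(e2, math.sin(phi)))
    step = _add(_scale(u, r * math.cos(theta)), _scale(w, r * math.sin(theta)))
    return _add(a, step)

def zmatrix_to_cartesian(rows):
    """rows follow the construction rules: 1, 3, 5 or 7 fields per atom."""
    xyz = []
    for i, row in enumerate(rows):
        if i == 0:
            xyz.append((0.0, 0.0, 0.0))            # atom 1 at the origin
        elif i == 1:
            xyz.append((0.0, 0.0, row[2]))         # atom 2 on the z axis
        elif i == 2:
            a, b = xyz[row[1] - 1], xyz[row[3] - 1]
            helper = _add(b, (1.0, 0.0, 0.0))      # fixes the plane of atom 3
            xyz.append(place(a, b, helper, row[2], row[4], 0.0))
        else:
            a, b, c = (xyz[row[k] - 1] for k in (1, 3, 5))
            xyz.append(place(a, b, c, row[2], row[4], row[6]))
    return xyz

def distance(p, q):
    return math.sqrt(_dot(_sub(p, q), _sub(p, q)))

def angle_deg(p, q, r):
    """Angle p-q-r in degrees, measured at q."""
    cos_t = _dot(_unit(_sub(p, q)), _unit(_sub(r, q)))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

water = [("O",), ("H", 1, 0.96), ("H", 1, 0.96, 2, 104.0)]
xyz = zmatrix_to_cartesian(water)
```

Measuring `distance(xyz[0], xyz[2])` and `angle_deg(xyz[1], xyz[0], xyz[2])` on the result recovers the bond length and bond angle that went in, which is a quick sanity check on any Z-matrix converter.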
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0
Label: Z-matrix "improper" torsion angles
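Reading such a file is mostly bookkeeping: split at the blank line, build a table of variables, then substitute. The sketch below is a minimal Python reader for the layout described in the construction rules; the function name is hypothetical, the decimal points in the values are restored on the assumption that bond lengths are in angstroms and angles in degrees, and only the variables that the atom rows actually use are listed.

```python
METHANOL = """
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
a1 109.0
a2 110.0
a3 108.0
da1 60.0
"""

def parse_zmatrix(text):
    """Return (rows, variables) with every variable name replaced by its value."""
    atom_lines, var_lines = [], []
    target = atom_lines
    for line in text.strip().splitlines():
        if not line.strip():            # blank line: the variable section starts
            target = var_lines
            continue
        target.append(line.split())
    variables = {name: float(value) for name, value in var_lines}

    def resolve(token):
        """Accept a variable name, a negated name such as -da1, or a number."""
        if token.startswith("-") and token[1:] in variables:
            return -variables[token[1:]]
        if token in variables:
            return variables[token]
        return float(token)

    rows = []
    for fields in atom_lines:
        row = [fields[0]]
        # After the symbol, fields alternate: atom number, value, atom number, ...
        for k, token in enumerate(fields[1:]):
            row.append(int(token) if k % 2 == 0 else resolve(token))
        rows.append(tuple(row))
    return rows, variables

rows, variables = parse_zmatrix(METHANOL)
```

Keeping the variables in a separate table is what makes the format convenient for geometry optimisation: the optimiser changes a handful of named values, and symmetry-related atoms (here the two hydrogens sharing l3, a2 and da1) move together.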
What about the comprehensive properties of molecules? They are more than xyz coordinates.
XML and molecules
• XML is a markup language that allows "metadata" to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data etc.
• The "X" stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles, P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
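As a rough illustration of "data plus metadata", the Python sketch below builds a small CML-style record for ethanol with the standard library and parses it back. The element and attribute names (molecule, atomArray, atom, bondArray, bond, atomRefs2) follow CML conventions as far as I am aware, but the record is not validated against the CML schema, and the property element with its title and units attributes is purely illustrative. Only heavy atoms are listed.

```python
import xml.etree.ElementTree as ET

def ethanol_record():
    """Build an illustrative CML-style element tree for ethanol (heavy atoms only)."""
    mol = ET.Element("molecule", id="ethanol")
    atoms = ET.SubElement(mol, "atomArray")
    for atom_id, element in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atoms, "atom", id=atom_id, elementType=element)
    bonds = ET.SubElement(mol, "bondArray")
    for pair in ["a1 a2", "a2 a3"]:
        ET.SubElement(bonds, "bond", atomRefs2=pair, order="1")
    # Metadata travels with the value: the number is stored together with its units.
    prop = ET.SubElement(mol, "property", title="molar mass", units="g/mol")
    prop.text = "46.07"
    return mol

xml_text = ET.tostring(ethanol_record(), encoding="unicode")

# Any XML parser can read the record back without chemistry-specific code.
parsed = ET.fromstring(xml_text)
elements = [atom.get("elementType") for atom in parsed.iter("atom")]
units = parsed.find("property").get("units")
```

The point of the design is that a program which knows nothing about chemistry can still parse the file, while a chemistry-aware program can pick out the atoms, bonds and annotated properties it understands and ignore the rest.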
Molecules and 3-Dimensions
• Molecules are, of course, not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Structural Database or the Protein Data Bank, PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3D construction methods available, such as Corina or Concord (put in a SMILES and get a 3D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
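The cost of scanning conformational space by torsion rotation grows very quickly with the number of rotatable bonds. A small Python sketch of a regular grid scan makes this concrete; the function names are hypothetical and the step is assumed to divide 360 exactly.

```python
from itertools import product

def torsion_grid(n_torsions, step_deg):
    """Every combination of torsion angles (degrees) on a regular grid."""
    angles = range(0, 360, step_deg)
    return product(angles, repeat=n_torsions)

def grid_size(n_torsions, step_deg):
    """Number of grid points, assuming step_deg divides 360."""
    return (360 // step_deg) ** n_torsions

# Two torsions at 30 degree steps: a Ramachandran-like 12 x 12 grid.
two_torsion_points = list(torsion_grid(2, 30))

# Six torsions at the same resolution already need millions of evaluations.
six_torsion_count = grid_size(6, 30)
```

Two torsions give 144 conformations to evaluate, six give 2,985,984, which is why practical conformer generators use rules, random sampling or other search strategies rather than an exhaustive grid.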
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
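The "film strip" picture maps directly onto a data structure: a trajectory is a list of frames, and each frame is a list of atom coordinates in a fixed atom order. A minimal Python sketch follows, using made-up toy coordinates rather than real simulation output, and comparing frames with a root-mean-square deviation computed without any superposition.

```python
import math

def rmsd(frame_a, frame_b):
    """Root-mean-square deviation between two frames with identical atom order.
    No rotation or translation is removed first."""
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(frame_a, frame_b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(frame_a))

# Toy two-atom trajectory (illustrative numbers only).
trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],   # frame 0
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],   # frame 1: second atom moved 1 unit along y
]

drift = [rmsd(trajectory[0], frame) for frame in trajectory]
```

Packages such as those linked above store the same kind of frame list in compact binary formats and provide analysis tools on top of it.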
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called "Apelin", with water and the membrane. We used these simulations to design new molecules that block this receptor.
Apelin's role in cancer
"Apelin" (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats, which could be a real pain, but they can be converted

alc -- Alchemy file                      prep -- Amber PREP file
bs -- Ball & Stick file                  caccrt -- Cacao Cartesian file
ccc -- CCC file                          c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file          cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                 dmol -- DMol3 Coordinates file
feat -- Feature file                     gam -- GAMESS Output file
gamout -- GAMESS Output file             gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                jout -- Jaguar Output file
bin -- OpenEye Binary file               mmd -- MacroModel file
mmod -- MacroModel file                  out -- MacroModel file
dat -- MacroModel file                   car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                 sd -- MDL Isis SDF file
mdl -- MDL Molfile file                  mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file           mopout -- MOPAC Output file
mmads -- MMADS file                      mpqc -- MPQC file
bgf -- MSI BGF file                      nwo -- NWChem Output file
pdb -- PDB file                          ent -- PDB file
pqs -- PQS file                          qcout -- Q-Chem Output file
res -- ShelX file                        ins -- ShelX file
smi -- SMILES file                       mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file               vmol -- ViewMol file

alc -- Alchemy file                      bs -- Ball & Stick file
caccrt -- Cacao Cartesian file           cacint -- Cacao Internal file
cache -- CAChe MolStruct file            c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file          ct -- ChemDraw Connection Table file
cht -- Chemtool file                     cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                    box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file           feat -- Feature file
fh -- Fenske-Hall Z-Matrix file          gamin -- GAMESS Input file
inp -- GAMESS Input file                 gcart -- Gaussian Cartesian file
gau -- Gaussian Input file               gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file               gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                jin -- Jaguar Input file
bin -- OpenEye Binary file               mmd -- MacroModel file
mmod -- MacroModel file                  out -- MacroModel file
dat -- MacroModel file                   sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                  mdl -- MDL Molfile file
mol -- MDL Molfile file                  mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                      bgf -- MSI BGF file
csr -- MSI Quanta CSR file               nw -- NWChem Input file
pdb -- PDB file                          ent -- PDB file
pov -- POV-Ray Output file               pqs -- PQS file
report -- Report file                    qcin -- Q-Chem Input file
smi -- SMILES file                       fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                  txyz -- Tinker XYZ file
txt -- Titles file                       unixyz -- UniChem XYZ file
vmol -- ViewMol file                     xed -- XED file
xyz -- XYZ file                          zin -- ZINDO Input file
File formats
- the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a "standard" format; however, in the near future there are likely to be many competing requirements for file content.
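Format conversion always has the same shape: read one format into a common in-memory representation, then write that representation out in another format. The Python sketch below does this for the simple XYZ layout (atom count, a comment line, then one "symbol x y z" line per atom). The function names and the in-memory form, a list of (symbol, (x, y, z)) pairs, are assumptions of the sketch; for real work a converter such as Babel should be used.

```python
def write_xyz(atoms, comment=""):
    """Serialise a list of (symbol, (x, y, z)) pairs as XYZ text."""
    lines = [str(len(atoms)), comment]
    lines += [f"{sym} {x:.4f} {y:.4f} {z:.4f}" for sym, (x, y, z) in atoms]
    return "\n".join(lines) + "\n"

def read_xyz(text):
    """Parse XYZ text back into a list of (symbol, (x, y, z)) pairs."""
    lines = text.splitlines()
    n_atoms = int(lines[0])                  # line 1: number of atoms
    atoms = []
    for line in lines[2:2 + n_atoms]:        # line 2 is a free-text comment
        sym, x, y, z = line.split()
        atoms.append((sym, (float(x), float(y), float(z))))
    return atoms

fragment = [("O", (0.0, 0.0, 0.0)), ("H", (0.0, 0.0, 0.96))]
round_trip = read_xyz(write_xyz(fragment, comment="O-H fragment"))
```

Note what the XYZ text does not carry: bonds, charges and stereochemistry are all absent, so converting into it loses information and converting out of it forces the reader to guess.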
Molecules on computers: things to look out for, since what is stored is actually quite crude
For example:
• Stereochemistry may be relative and not absolute, or even incorrect.
• In proteins, only the heavy-atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
• Tautomers can be incorrect: check they look reasonable.
• Mesomers can be incorrect (double bonds).
• Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not recorded in the file).
2 Finding molecules
A typical problem involves finding the "right" molecules
All the synthesisable molecules: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.
1 Finding molecules using substructure searching
One of the most useful "properties" of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, fragments of molecules, or even "similar" molecules.
For a specific molecule, this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example: find a molecule containing our query as part of a larger molecule.
Results: it takes a few seconds to search all 92 million molecules.
How does this work?
The first thing to notice is that the search is so fast that clearly not all of the database is being searched.
The first part of the process is to pre-compute pointers to only those sets of compounds in the database that we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
[Figure callouts on individual bits: "Contains P", "Contains NH2", "Doesn't contain F"]
So a quick match of the bitstring of the query fragment against the (pre-computed) bitstrings in the database will eliminate most of the structures.
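The screening step reduces to one bitwise test per database entry: a molecule can only contain the query if every bit set in the query is also set in the molecule. A minimal Python sketch, using integers as 16-bit strings; the molecule names and bit assignments are hypothetical.

```python
def make_fingerprint(present_bits):
    """Set the given 1-based bit positions, numbered as on the slide."""
    fp = 0
    for bit in present_bits:
        fp |= 1 << (bit - 1)
    return fp

def to_bitstring(fp, n_bits=16):
    """Render a fingerprint with position 1 on the left."""
    return "".join("1" if (fp >> i) & 1 else "0" for i in range(n_bits))

def may_contain(query_fp, molecule_fp):
    """True if every fragment bit in the query is also set in the molecule."""
    return (query_fp & molecule_fp) == query_fp

# Pre-computed once for the whole database (hypothetical entries).
database = {
    "mol_a": make_fingerprint([6, 12]),
    "mol_b": make_fingerprint([6]),
    "mol_c": make_fingerprint([3, 6, 12]),
}

query = make_fingerprint([12])
candidates = [name for name, fp in database.items() if may_contain(query, fp)]
```

Passing the screen does not prove a match, since a set bit only says the fragment is present somewhere. The surviving candidates still go through the full atom-by-atom substructure match, but that expensive step now runs on a small fraction of the database.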
Next step: substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware: salts etc. may need counterions stripped out.
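Counterion stripping is often done at the string level before an exact-match lookup. The sketch below is a deliberately crude Python heuristic: it splits a SMILES string at the "." component separator and keeps the longest piece, using character count as a stand-in for size. The function name is hypothetical, and a cheminformatics toolkit would count atoms instead of characters.

```python
def largest_fragment(smiles):
    """Keep the longest '.'-separated component of a SMILES string."""
    return max(smiles.split("."), key=len)

# Dimethylaniline hydrochloride written as two disconnected components.
parent = largest_fragment("CN(C)c1ccccc1.Cl")

# Sodium acetate: the sodium cation is dropped, the acetate anion is kept.
acetate = largest_fragment("[Na+].CC(=O)[O-]")
```

After stripping, an exact match by string comparison is only valid if the query and the database entries were canonicalised by the same procedure, because the same molecule can be written as many different SMILES strings.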
Substructure Search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a subgraph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
There are two steps commonly used to do this: firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical; then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• "Subgraphs" can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
- -OH
- -NH2
- -COOH
- phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap
[Figure: example structure with groups labelled OH, CH2, CHNH2, OH, O]
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments.
Note: a number of "substructural" fragments can be matched here (6 in all).
Read these for more details:
Gund P, Ann Rep in Med Chem 14, 299, 1979
Sheridan RP et al., J Chem Inf Comp Sci 29, 255, 1989
Brint AT, Willett P, J Mol Graphics 5, 49, 1987
Ullmann JR, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM) 23, 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan HL, The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 107-113, 1965
The basic concept is that molecules are considered as graphs:
Atoms are "nodes"
Bonds are "edges"
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
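A labelled graph needs very little machinery. The Python sketch below stores node labels (element symbols) and edge labels (bond orders) in two dictionaries; the class name, method names and the acetaldehyde example are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class MolGraph:
    atoms: dict = field(default_factory=dict)   # node id -> element symbol
    bonds: dict = field(default_factory=dict)   # frozenset({i, j}) -> bond order

    def add_atom(self, idx, element):
        self.atoms[idx] = element

    def add_bond(self, i, j, order=1):
        # A frozenset key makes the edge undirected: (i, j) and (j, i) coincide.
        self.bonds[frozenset((i, j))] = order

    def neighbours(self, idx):
        return sorted(j for pair in self.bonds if idx in pair
                      for j in pair if j != idx)

# Heavy-atom graph of acetaldehyde, CH3-CH=O.
mol = MolGraph()
for idx, element in [(1, "C"), (2, "C"), (3, "O")]:
    mol.add_atom(idx, element)
mol.add_bond(1, 2)            # single bond
mol.add_bond(2, 3, order=2)   # double bond, stored as an edge label

degree_of_c2 = len(mol.neighbours(2))
```

An adjacency matrix holds the same connectivity in a form that is convenient for matching algorithms, while a neighbour-list form like this one is more compact for large, sparsely connected molecules.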
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107-113, 1965
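Steps (i) to (iii) can be sketched in a few lines of Python. This is a simplified reading, not Morgan's full procedure: the initial label uses the degree only (atomic number and bond types are ignored), step (ii) is taken literally as "own label plus the labels of the neighbours" (other formulations replace the label by the neighbour sum alone), and the cyclohexanone ring order in the example is an assumption.

```python
def morgan_labels(neighbours):
    """neighbours: dict atom -> list of bonded atoms. Returns dict atom -> label."""
    # Step (i): start from the degree of each atom.
    labels = {atom: len(nbrs) for atom, nbrs in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): add the labels of the adjacent atoms to each label.
        updated = {
            atom: labels[atom] + sum(labels[n] for n in nbrs)
            for atom, nbrs in neighbours.items()
        }
        updated_classes = len(set(updated.values()))
        # Step (iii): stop once the number of distinct labels no longer increases,
        # keeping the last set of labels that did improve the discrimination.
        if updated_classes <= n_classes:
            return labels
        labels, n_classes = updated, updated_classes

# Cyclohexanone: O1 on C4, assumed ring order C4-C2-C5-C7-C6-C3-C4.
cyclohexanone = {
    1: [4], 2: [4, 5], 3: [4, 6], 4: [1, 2, 3],
    5: [2, 7], 6: [3, 7], 7: [5, 6],
}
labels = morgan_labels(cyclohexanone)
ranking = sorted(labels, key=labels.get, reverse=True)
```

For this example the labels settle into five classes: the carbonyl carbon alone at the top, the two alpha carbons tied, the two beta carbons tied, then the gamma carbon and the oxygen. The ties are exactly the symmetry-equivalent atoms, which is where arbitrary or rule-based tie-breaking comes in.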
The first step: numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest label, then takes the next-highest connected atom as 2, and so on. Like an onion: concentric circles outward from the highest-numbered atom
• Rules are used to break ties, such as C before N, double bond before single etc.
• Some equivalent atoms are arbitrarily numbered
How to construct a Z-matrix (in Gaussian format)
1. For the first atom to be defined, give the atomic symbol only.
2. For the second atom, give the atomic symbol, the number 1, and the name of a variable to describe the distance between atoms 1 and 2.
3. For the third atom, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, and the name of a variable to describe the angle between the current atom, NA and NB.
4. For all later atoms, give the atomic symbol, the atom number NA, the name of a variable to describe the distance between the current atom and NA, the atom number NB, the name of a variable to describe the angle between the current atom, NA and NB, the atom number of another previously defined atom NC, and finally the name of a variable to describe the dihedral angle between the current atom, NA, NB and NC.
5. After all the atoms have been listed, enter a blank line.
6. Next, list each variable with its corresponding value. Use a separate line for each variable.
7. In some cases, where some of the variables are to be fixed as constants in a geometry optimisation, they are listed here after a blank line, rather than above with the real variables.
8. End the Z-matrix with a blank line.
Water (C2v)
O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
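To make the rules concrete, the following Python sketch reads a Z-matrix laid out as described in steps 1 to 6 and places the first three atoms in Cartesian space. It is a simplified illustration only: it handles no dihedral angles (so it cannot build methanol), it assumes every distance and angle is given as a variable name, and the function names `parse_zmatrix` and `place_first_three` are invented for this example.

```python
import math

# Water Z-matrix as in the example above (distance in angstroms, angle in degrees).
WATER_ZMATRIX = """O
H 1 l1
H 1 l1 2 a1

l1 0.96
a1 104.0
"""


def parse_zmatrix(text):
    """Split a Z-matrix into atom rows and a {variable: value} table."""
    atom_block, _, variable_block = text.strip().partition("\n\n")
    variables = {}
    for line in variable_block.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            variables[fields[0]] = float(fields[1])
    rows = [line.split() for line in atom_block.splitlines()]
    return rows, variables


def place_first_three(text):
    """Cartesian coordinates for the first three atoms only (no dihedrals)."""
    rows, var = parse_zmatrix(text)
    coords = [(0.0, 0.0, 0.0)]                      # atom 1 at the origin
    if len(rows) > 1:
        coords.append((0.0, 0.0, var[rows[1][2]]))  # atom 2 on the +z axis
    if len(rows) > 2:
        _, na, dist, nb, angle = rows[2][:5]
        a, b = coords[int(na) - 1], coords[int(nb) - 1]
        direction = 1.0 if b[2] > a[2] else -1.0    # sign of the NA -> NB axis
        r, theta = var[dist], math.radians(var[angle])
        # Atom 3 lies in the xz plane, at angle theta from the NA -> NB axis.
        coords.append((r * math.sin(theta), 0.0,
                       a[2] + direction * r * math.cos(theta)))
    return [(row[0], xyz) for row, xyz in zip(rows, coords)]


atoms = place_first_three(WATER_ZMATRIX)
```

Later atoms would need the dihedral angle as well, which fixes the rotation about the NA to NB bond; that step is left out here to keep the sketch short.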
Methanol Z-matrix
C
O 1 l1
H 1 l2 2 a1
H 1 l3 2 a2 3 da1
H 1 l3 2 a2 3 -da1
H 2 l4 1 a3 3 180.0

l1 1.42
l2 1.09
l3 1.09
l4 1.09
l5 1.09
l6 1.0
a1 109.0
a2 110.0
a3 108.0
a4 110.0
a5 110.0
da1 60.0
da2 120.0
da3 60.0

(Figure labels: Z-matrix; ‘improper’ torsion angles)
What about comprehensive properties of molecules? They are more than xyz.
XML and molecules
• XML is a markup language that allows ‘metadata’ to be stored. Metadata describes the context of data, e.g. the units of measurement, the date the measurement was made, the relationship to other data, etc.
• The ‘X’ stands for extensible. It means we can add almost any type of structured data to the file.
• Chemical Markup Language (CML) is being developed specifically for chemistry.
• In the future, much more information will be stored with molecules, allowing greater re-use of data.
• See: Chemical Markup, XML and the World-Wide Web. Part I: Basic principles. P. Murray-Rust and H. S. Rzepa, J. Chem. Inf. Comp. Sci. 1999, 39, 928.
Ethanol
<CML>
- Can be parsed
- Can contain reactions, properties, etc.
- Can contain relationships to other molecules and also concepts
InChI
InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3
SMILES
C(C)O
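To show what storing metadata alongside a structure might look like, here is a small Python sketch that builds a CML-style XML record for ethanol with the standard library. The element and attribute names (`molecule`, `atomArray`, `bondArray`, `atomRefs2`, `property`, `scalar`) are modelled on CML conventions but have not been checked against the CML schema, and the boiling point value is approximate; treat the whole record as illustrative.

```python
import xml.etree.ElementTree as ET


def ethanol_record():
    """Build a CML-style record for ethanol (heavy atoms only)."""
    molecule = ET.Element("molecule", id="ethanol", title="Ethanol")

    atoms = ET.SubElement(molecule, "atomArray")
    for atom_id, symbol in [("a1", "C"), ("a2", "C"), ("a3", "O")]:
        ET.SubElement(atoms, "atom", id=atom_id, elementType=symbol)

    bonds = ET.SubElement(molecule, "bondArray")
    for pair in ["a1 a2", "a2 a3"]:
        ET.SubElement(bonds, "bond", atomRefs2=pair, order="1")

    # Metadata: the value carries its own units and a human-readable title.
    prop = ET.SubElement(molecule, "property", title="boiling point")
    scalar = ET.SubElement(prop, "scalar", units="degC")
    scalar.text = "78.4"  # approximate value, for illustration only

    return ET.tostring(molecule, encoding="unicode")


record = ethanol_record()
parsed = ET.fromstring(record)  # the record can be parsed back by any XML parser
```

Because the units travel with the number, a program reading the record does not have to guess what the value means, which is the point of the metadata bullets above.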
Molecules and 3-Dimensions
• Molecules are, of course, not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank, PDB).
• From these can be obtained atom positions, bonds, coordinates, etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions…
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation: a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called ‘Apelin’, with water and the membrane. We used these to design new molecules that block this receptor.
Apelin’s role in cancer
‘Apelin’ (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats, which could be a real pain, but they can be converted
alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a ‘standard’ format; however, in the near future there are likely to be many competing requirements for file content.
Molecules on computers – things to look out for, since what is stored is actually quite crude
For example:
– Stereochemistry may be relative and not absolute, or even incorrect.
– In proteins, only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
– Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
– Tautomers can be incorrect; check they look reasonable.
– Mesomers can be incorrect (double bonds).
– Polymers (which are mixtures of molecular weight and topology) are very difficult to store.
– Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds, etc. may be inferred but are not observed in the file).
2 Finding molecules
A typical problem involves finding the ‘right’ molecules
All the molecules synthesisable: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.
1 Finding molecules using substructure searching
One of the most useful ‘properties’ of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, or fragments of molecules, or even ‘similar’ molecules.
For a specific molecule, this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we’ve used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example: find a molecule containing our query as part of a larger molecule.
Results: it takes a few seconds to search all 92 million molecules.
How does this work?
The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.
The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
(Bit annotations on the slide: contains P; contains NH2; doesn’t contain F)
So a quick match of the bitstring of the query fragment against the (pre-computed) bitstrings in the database will eliminate most of the structures.
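The screening idea can be sketched in a few lines of Python. Each molecule gets one bit per fragment, and a molecule survives the screen only if every bit set in the query is also set in its bitstring. The fragment dictionary, the molecule names and their fragment sets below are all made up for illustration; real systems use much longer bitstrings and derive the fragments from the structure itself.

```python
# Hypothetical fragment dictionary: bit i is set when the molecule
# contains FRAGMENT_KEYS[i].
FRAGMENT_KEYS = ["P", "NH2", "F", "OH", "COOH", "phenyl"]


def fingerprint(fragments):
    """Pack a set of fragment names into an integer bitstring."""
    bits = 0
    for position, key in enumerate(FRAGMENT_KEYS):
        if key in fragments:
            bits |= 1 << position
    return bits


def screen(query_fragments, database):
    """Return names of molecules whose bitstring has every query bit set."""
    query_bits = fingerprint(query_fragments)
    return [name for name, bits in database.items()
            if (bits & query_bits) == query_bits]


# Pre-computed once, when the database is built (invented example data).
DATABASE = {
    "mol_a": fingerprint({"P", "NH2"}),
    "mol_b": fingerprint({"NH2", "OH", "phenyl"}),
    "mol_c": fingerprint({"F", "COOH"}),
}

candidates = screen({"NH2"}, DATABASE)
```

Here `mol_c` is rejected by a single bitwise AND, without looking at its structure. The survivors are only candidates: they still have to go through the slower graph matching step to confirm a real substructure match.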
Next step – substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware salts etc.: you may need to strip out counterions.
Substructure Search
Supposing we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• ‘Subgraphs’ can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents, etc.
– –OH
– –NH2
– –COOH
– phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap
(Figure: example structure with OH, CH2, CHNH2, OH and O groups labelled)
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments.
Note: a number of ‘substructural’ fragments can be matched here (6 in all).
Read these for more details:
Gund P., Ann. Rep. in Med. Chem. 14, 299, 1979
Sheridan R.P. et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989
Brint A.T., Willett P., J. Mol. Graphics 5, 49, 1987
J.R. Ullmann, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM), Vol. 23, pp. 31-42, 1976. http://portal.acm.org/citation.cfm?id=321925
Morgan H.L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 1965, 107-113
The basic concept is that molecules are considered as graphs:
Atoms are ‘nodes’
Bonds are ‘edges’
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds;
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan. The generation of a unique machine description for chemical structures: a technique developed at Chemical Abstracts Service. J. Chemical Documentation 5, 107–113, 1965.
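Steps (i) to (iii) can be sketched in Python as follows. This is a simplified version under stated assumptions: the initial label is the degree only (atomic number and bond types are ignored), step (ii) is read as replacing each label with the sum of its neighbours' labels, and the labels kept are those from the last round that increased the number of classes. The n-pentane skeleton and the function name `morgan_labels` are chosen for this example.

```python
def morgan_labels(bonds):
    """Simplified Morgan labelling on an adjacency list (atom -> neighbours)."""
    # Step (i): start from the degree of each atom.
    labels = {atom: len(neighbours) for atom, neighbours in bonds.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): new label = sum of the neighbours' current labels.
        updated = {atom: sum(labels[n] for n in bonds[atom]) for atom in bonds}
        n_updated = len(set(updated.values()))
        # Step (iii): stop once the number of distinct labels stops increasing.
        if n_updated <= n_classes:
            return labels
        labels, n_classes = updated, n_updated


# Carbon skeleton of n-pentane: C1-C2-C3-C4-C5 (hydrogens omitted).
PENTANE = {
    "C1": ["C2"],
    "C2": ["C1", "C3"],
    "C3": ["C2", "C4"],
    "C4": ["C3", "C5"],
    "C5": ["C4"],
}

labels = morgan_labels(PENTANE)
# Order atoms from highest to lowest label; ties are symmetry-equivalent here.
ranking = sorted(labels, key=lambda atom: -labels[atom])
```

For pentane the degrees give only two classes (terminal and internal carbons). One update round separates the central carbon, giving labels 2, 3, 4, 3, 2 along the chain, and a further round does not add any classes, so the loop stops. Numbering then starts from C3 and works outwards, with C2/C4 and C1/C5 left as equivalent pairs to be numbered arbitrarily.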
What about comprehensive
properties of molecules ndash they
are more than xyz
XML and molecules
bull XML is a computer language that allows lsquometadatarsquo to be stored Metadata describes the context of data eg the units of measurement the date the measurement was made the relationship to other data etc
bull The lsquoXrsquo stands for extensible It means we can add almost any type of structured data to the file
bull Chemical Markup Language (CML) is being developed specifically for chemistry
bull In the future much more information will be stored with molecules allowing greater re-use of data
bull see Chemical Markup XML and the World-Wide Web Part I Basic principles P Murray-Rust and H S Rzepa J Chem Inf Comp Sci 1999 39 928
Ethanol
ltCMLgt
-Can be parsed
-Can contain reactions
properties etc
-Can contain
relationships to other
molecules and also
concepts
InChI
InChI=1C2H6Oc1-2-3h3H2H21H3
SMILES
C(C)O
Molecules and 3-Dimensions
• Molecules are of course not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates.
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank - PDB).
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment.
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures.
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot, but often in many more torsional angle dimensions).
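Conformational searches vary torsional angles, so it helps to see how a torsion is measured from Cartesian coordinates. The sketch below uses the standard atan2 formula for a dihedral angle; the function names and the test coordinates are illustrative assumptions, not part of any particular package.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def torsion_angle(p0, p1, p2, p3):
    """Torsion (dihedral) angle in degrees for four atom positions."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Hypothetical coordinates: the central bond lies along z.
cis = torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(cis, trans)
```

A conformational scan would step each rotatable torsion through a set of such angles and evaluate the energy of every combination, which is why the search space grows so quickly with the number of rotatable bonds.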
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions...
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular Dynamics is the most popular method for larger systems. Here is an example of a 3D simulation - a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
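The "film strip" picture of a trajectory can be sketched in a few lines. The example below writes and reads frames as concatenated blocks in the simple XYZ layout (atom count, comment line, then one "symbol x y z" line per atom). It is a toy illustration under that assumption, not the trajectory format of any of the packages linked above, and the H2 coordinates are invented for the example.

```python
def write_xyz_frames(symbols, frames):
    """Serialise frames (lists of (x, y, z) tuples) as concatenated XYZ blocks."""
    lines = []
    for index, coords in enumerate(frames):
        lines.append(str(len(symbols)))      # atom count
        lines.append(f"frame {index}")       # free-text comment line
        for symbol, (x, y, z) in zip(symbols, coords):
            lines.append(f"{symbol} {x:.3f} {y:.3f} {z:.3f}")
    return "\n".join(lines) + "\n"

def read_xyz_frames(text):
    """Split concatenated XYZ blocks back into a list of coordinate frames."""
    lines = text.splitlines()
    frames = []
    pos = 0
    while pos < len(lines):
        n_atoms = int(lines[pos])
        block = lines[pos + 2 : pos + 2 + n_atoms]
        frames.append([tuple(float(v) for v in line.split()[1:4]) for line in block])
        pos += 2 + n_atoms
    return frames

# Two hypothetical snapshots of H2 with a slightly stretched bond.
symbols = ["H", "H"]
trajectory = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.74)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.80)],
]
text = write_xyz_frames(symbols, trajectory)
frames_read = read_xyz_frames(text)
print(len(frames_read), "frames")
```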
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these simulations to design new molecules that block this receptor.
Apelin's role in Cancer
The 'Apelin' receptor (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth.
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats - which could be a real pain, but they can be converted
alc -- Alchemy file prep -- Amber PREP file
bs -- Ball & Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball & Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table file
cht -- Chemtool file cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats
- the bane of our lives
Interconversion program - Babel
Recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content
Molecules on computers - things to look out for, since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect.
In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
Tautomers can be incorrect - check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of MWt and topology) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file).
2 Finding molecules
A typical problem involves finding the 'right' molecules
All the molecules synthesisable: 10^60
Molecules with relevance to the problem: 10^10 (that we can generate)
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions
1 Finding molecules using Substructure searching
One of the most useful 'properties' of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, fragments of molecules, or even 'similar' molecules.
For a specific molecule this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example - find a molecule containing our query as part of a larger molecule
Results - takes a few seconds to search all 92 million molecules
How does this work?
The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.
The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.
This is called Hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Value: 0 0 0 0 0 1 0 0 0 0  0  1  0  0  0  0
[Figure annotations on individual bits: "Contains P", "Contains NH2", "Doesn't contain F"]
So a quick match of the bitstring of the fragment searched over the (pre-computed) bitstrings in the database will eliminate most of the structures.
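The screening step can be sketched with integers used as bitstrings. In this minimal example the fragment dictionary, the bit positions and the four database entries are hypothetical and hand-assigned; a real system would derive the fragments from the structures automatically.

```python
# Hypothetical fragment dictionary: fragment name -> bit position.
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "phenyl": 4}

def fingerprint(fragments):
    """Pack a set of fragment names into an integer used as a bitstring."""
    bits = 0
    for name in fragments:
        bits |= 1 << FRAGMENT_BITS[name]
    return bits

def passes_screen(query_bits, molecule_bits):
    """A molecule can only contain the query if every query bit is also set."""
    return (query_bits & molecule_bits) == query_bits

# Pre-computed bitstrings for a toy database (fragment lists assigned by hand).
database = {
    "aniline": fingerprint({"NH2", "phenyl"}),
    "phenol": fingerprint({"OH", "phenyl"}),
    "fluorobenzene": fingerprint({"F", "phenyl"}),
    "tyrosine": fingerprint({"NH2", "OH", "phenyl"}),
}

query = fingerprint({"OH", "phenyl"})
candidates = sorted(name for name, bits in database.items()
                    if passes_screen(query, bits))
print(candidates)
```

The screen only eliminates molecules; anything that passes still has to go through the full (slower) substructure match, because having the right fragments does not guarantee they are connected in the right way.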
Next step - Substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure
Beware salts etc. - may need to strip out counterions
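As a rough illustration of stripping counterions before an exact match, the sketch below keeps the largest '.'-separated component of a SMILES string. Counting letters is only a crude stand-in for counting atoms, and the function name is an assumption; the sketch also does not canonicalise the SMILES, which a real exact-match search would need.

```python
def strip_counterions(smiles):
    """Keep the largest '.'-separated component of a SMILES string.

    Size is estimated by counting letters, a crude proxy for atom count.
    """
    fragments = smiles.split(".")
    return max(fragments, key=lambda frag: sum(ch.isalpha() for ch in frag))

# Sodium acetate written as two disconnected components.
parent = strip_counterions("CC(=O)[O-].[Na+]")
print(parent)
```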
Substructure Search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
There are two steps commonly used to do this: firstly we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical; then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
  - -OH
  - -NH2
  - -COOH
  - phenyl
• this can be done by tracing appropriate paths in the graph
• subgraphs may overlap
[Figure: structure diagram with atom labels OH, CH2, CHNH2, OH, O]
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments
Note - a number of 'substructural' fragments can be matched here (6 in all)
Read these for more details:
Gund, P., Ann. Rep. in Med. Chem. 14, 299, 1979
Sheridan, R. P. et al., J. Chem. Inf. Comput. Sci. 29, 255, 1989
Brint, A. T., Willett, P., J. Mol. Graphics 5, 49, 1987
Ullmann, J. R., An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM) 23, 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan, H. L., The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 107-113, 1965
The basic concept is that molecules are considered as Graphs
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
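A labelled graph is easy to hold in a program as a list of node labels plus a list of edges, from which an adjacency matrix follows directly. The sketch below uses the four-atom example from this lecture's adjacency-matrix slide (atoms O1, C2, C3, C4, with every bond involving C4); the function name is an illustrative assumption and bond orders are left out for simplicity.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from (i, j) bond pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1   # bonds are undirected, so the matrix is symmetric
    return matrix

# Node labels for the four atoms, numbered as on the slide.
atom_labels = ["O1", "C2", "C3", "C4"]
# Edges as zero-based index pairs; every bond involves the central atom C4.
bonds = [(0, 3), (1, 3), (2, 3)]

matrix = adjacency_matrix(len(atom_labels), bonds)
for label, row in zip(atom_labels, matrix):
    print(label, row)
```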
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label i to each node, considering its atomic number, degree (number of substituents) and types of bonds;
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107-113, 1965
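Steps (i) to (iii) can be sketched as follows. This is a simplified version under stated assumptions: the initial label is just the degree of each atom (atomic number and bond types are ignored), each update replaces a label with the sum of its neighbours' labels, and the last labelling before the number of distinct labels stops increasing is returned. The five-atom chain is a hypothetical example.

```python
def morgan_labels(adjacency):
    """Simplified Morgan labels for a graph given as {atom: [neighbours]}."""
    # (i) Initial label: the degree of each atom (a simplifying assumption).
    labels = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # (ii) New label = sum of the labels of the adjacent atoms.
        new_labels = {atom: sum(labels[n] for n in nbrs)
                      for atom, nbrs in adjacency.items()}
        new_classes = len(set(new_labels.values()))
        # (iii) Stop once the number of different labels no longer increases.
        if new_classes <= n_classes:
            return labels
        labels, n_classes = new_labels, new_classes

# Heavy-atom skeleton of a five-carbon chain: 1-2-3-4-5.
chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = morgan_labels(chain)
ranking = sorted(labels, key=labels.get, reverse=True)
print(labels, ranking)
```

The central atom ends up with the highest label and is numbered first; atoms 2 and 4 (and 1 and 5) keep equal labels because they are equivalent by symmetry, which is where tie-breaking rules come in.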
The first step - Numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest label, then uses the next highest connection as 2 etc. Like an onion: concentric circles from the highest numbered atom
• Rules are used to break equalities, such as C before N, double bond before single etc.
• Some equivalent atoms are arbitrarily numbered
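Urea
Adjacency matrix
     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0
Cyclohexanone adjacency matrix
[7 x 7 matrix over atoms O1, C2-C7: the O1 row has one entry, the C4 row has three and every other row has two, consistent with C4 being the carbonyl carbon bonded to O1 and two ring carbons; the column positions of the ring entries are not recoverable from the extracted text]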
Numbering is crucial. Renumber for different starting points. Also the 'edges' of the graph are annotated to speed up the matching. This matching step ('subgraph isomorphism') is the 'slow' part and belongs to the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures
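To show what substructure matching involves, here is a plain backtracking search that maps query atoms onto target atoms. It is a simplified stand-in and not Ullmann's algorithm itself: it omits the matrix refinement step that makes the published method efficient, and it ignores bond orders. The function name, the carbonyl query and the target (the four-atom example from the adjacency-matrix slide) are illustrative assumptions.

```python
def find_substructure(q_atoms, q_adj, t_atoms, t_adj):
    """Return one mapping {query atom: target atom}, or None if there is none.

    q_atoms / t_atoms map atom ids to element symbols;
    q_adj / t_adj map atom ids to lists of bonded atom ids.
    """
    order = list(q_atoms)

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        q = order[len(mapping)]              # next unmapped query atom
        for t in t_atoms:
            if t in mapping.values() or q_atoms[q] != t_atoms[t]:
                continue                     # already used, or wrong element
            # Every already-mapped neighbour of q must be bonded to t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[q]               # backtrack
        return None

    return extend({})

# Target: atoms O1, C2, C3, C4 with every bond involving C4.
target_atoms = {1: "O", 2: "C", 3: "C", 4: "C"}
target_adj = {1: [4], 2: [4], 3: [4], 4: [1, 2, 3]}

# Query: a carbon bonded to an oxygen.
query_atoms = {0: "C", 1: "O"}
query_adj = {0: [1], 1: [0]}

match = find_substructure(query_atoms, query_adj, target_atoms, target_adj)
print(match)
```

The search first tries C2 and C3 for the query carbon, fails to find a bonded oxygen, backtracks, and only then succeeds with C4. That trial-and-error is the source of the cost, and it is why the bitstring screen is applied first.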
Next lecture
bull How can we use this type of information to
solve our chemistry problems
ndash PatentsMarkush Structures
ndash Molecular Similarity
ndash Molecular Properties
ndash 3D searching
ndash Virtual Screening
ndash Structure PropertyActivity Relationships
Ethanol
ltCMLgt
-Can be parsed
-Can contain reactions
properties etc
-Can contain
relationships to other
molecules and also
concepts
InChI
InChI=1C2H6Oc1-2-3h3H2H21H3
SMILES
C(C)O
Molecules and 3-Dimensionsbull Molecules are of course not flat Even very flat molecules are not
really flat because of thermal fluctuations So we represent 3D
molecules by including their coordinates or their internal coordinates
bull Obtaining the 3-dimensional coordinates can involve experiment (x-
ray electron or neutron diffraction eg the Cambridge
Crystallographic Database or the Protein Databank ndash PDB)
bull From these can be obtained atom positions bonds coordinates etc
bull There are a number of 3-D construction methods available such as
Corina or Concord (put in a SMILES and get a 3-D molecule) which
use rules derived from experiment
bull Molecules can also be constructed in 2D and subjected to molecular
mechanics or Quantum Mechanics calculations to obtain 3D structures
bull Conformation still remains to be deduced There are many methods
that deduce conformations usually involving torsional angle rotation
to scan the conformational space (like a Ramachandran plot ndash but often
in many more torsional angle dimensions)
Molecules in 3D - uses
bull More accurate calculation of molecular
properties
bull Comparison of the shapes (conformations)
of molecules
bull Comparison of the dynamics of molecules
bull Calculation of bulk properties
bull Simulation of chemical reactionshelliphellip
The dynamics of a molecule can be computed and stored as a series of
frames of the coordinates and bonds of the structure (much like a cartoon)
Molecular Dynamics is the most popular method for larger systems Here is
an example of a 3D simulation ndash a nanopore for sequencing DNA Imagine a
series of snapshots of SD files concatenated together to make a movie just
like a film strip
httpsenwikipediaorgwikiCHARMM
httpambermdorg
httpwwwksuiuceduResearchnamd
httpwwwksuiuceduResearchvmd
We can simulate an entire biological receptor
and test new molecules to see how they might
work
bull Here is a simulation of an important new cancer target
called lsquoApelinrsquo with water and the membrane We used
these to design new molecules that block this receptor
Apelinrsquos role in Cancer
lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an
antagonist (a molecule that blocks the receptor) stop tumour growth
bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma
growth Brain 2017 awx253 httpsdoiorg101093brainawx253
bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et
al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372
So there are very many file formats
ndash which could be a real painbut
they can be convertedalc -- Alchemy file prep -- Amber PREP file
bs -- Ball amp Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical Resource
Kit 2D file
crk3d -- CRK3D Chemical Resource
Kit 3D file
box -- Dock 35 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI BiosymInsight II CAR
file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball amp Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table
file
cht -- Chemtool file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical
Resource Kit 2D file
crk3d -- CRK3D Chemical
Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 35 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program – Babel
Recent IUPAC moves towards a 'standard' format; however, in the near future there are likely to be many competing requirements for file content
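As a small illustration of what an interconversion program does, here is a minimal sketch that reads the atom block of an MDL V2000 molfile and writes plain XYZ text. It is not Babel and covers only this one pair of formats; the function name `molfile_to_xyz` and the hand-written water record are assumptions made for the example. Bonds, charges and every other field are simply ignored.

```python
def molfile_to_xyz(molfile_text: str) -> str:
    """Convert the atom block of an MDL V2000 molfile to plain XYZ text."""
    lines = molfile_text.splitlines()
    title = lines[0].strip()
    # The fourth line is the counts line; its first 3-character field is the atom count.
    n_atoms = int(lines[3][0:3])
    xyz_lines = [str(n_atoms), title]
    for line in lines[4:4 + n_atoms]:
        # V2000 atom lines: x, y, z in 10-character fields, a space, then the element symbol.
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        symbol = line[31:34].strip()
        xyz_lines.append(f"{symbol} {x:.4f} {y:.4f} {z:.4f}")
    return "\n".join(xyz_lines)


# Hand-written example record (water), used only to exercise the function.
WATER_MOL = "\n".join([
    "water",
    "  hand-written example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.9572    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.2400    0.9266    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  1  3  1  0",
    "M  END",
])

print(molfile_to_xyz(WATER_MOL))
```

Note what is lost on the way: the XYZ output has no bonds at all, which is exactly the kind of 'competing requirement for file content' that makes conversion lossy.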
Molecules on computers – things to look out for, since what is stored is actually quite crude
For example:
• Stereochemistry may be relative and not absolute, or even incorrect
• In proteins, only the heavy-atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused
• Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding)
• Tautomers can be incorrect – check they look reasonable
• Mesomers can be incorrect (double bonds)
• Polymers (which are mixtures of MWt and topology) are very difficult to store
• Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but not observed in the file)
2 Finding molecules
A typical problem involves finding the 'right' molecules
• All the synthesisable molecules: 10^60
• Molecules with relevance to the problem (that we can generate): 10^10
• Molecules we can search efficiently: 10^7
• Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions
1 Finding molecules using substructure searching
One of the most useful 'properties' of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, fragments of molecules or even 'similar' molecules.
For a specific molecule, this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example – find a molecule containing our query as part of a larger molecule
Results – takes a few seconds to search all 92 million molecules
How does this work?
The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.
The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.
This is called hashing
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
[Figure annotations on individual bits: 'Contains P', 'Contains NH2', 'Doesn't contain F']
So a quick match of the bitstring of the fragment searched against the (pre-computed) bitstrings in the database will eliminate most of the structures
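The screening step can be sketched in a few lines. This is only a toy, assuming a hypothetical six-entry fragment dictionary (`FRAGMENT_BITS`) and a made-up four-compound database; real systems use far longer bitstrings. Each molecule's bitstring is held as an integer, and a molecule survives the screen only if it has every bit that the query sets.

```python
# Hypothetical fragment dictionary: each fragment owns one bit position.
FRAGMENT_BITS = {"P": 0, "NH2": 1, "F": 2, "OH": 3, "COOH": 4, "phenyl": 5}


def fingerprint(fragments):
    """Pack a set of fragment names into an integer used as a bitstring."""
    bits = 0
    for name in fragments:
        bits |= 1 << FRAGMENT_BITS[name]
    return bits


def passes_screen(query_bits, molecule_bits):
    """A molecule can only contain the query if it has every bit the query sets."""
    return (query_bits & molecule_bits) == query_bits


# Pre-computed once, when the database is built (toy data, not real records).
database = {
    "aniline": fingerprint({"NH2", "phenyl"}),
    "phenol": fingerprint({"OH", "phenyl"}),
    "glycine": fingerprint({"NH2", "COOH"}),
    "fluorobenzene": fingerprint({"F", "phenyl"}),
}

query = fingerprint({"NH2", "phenyl"})
candidates = [name for name, bits in database.items() if passes_screen(query, bits)]
print(candidates)
```

The screen can only rule molecules out. Anything that passes still has to go through the atom-by-atom matching described next, because two molecules can contain the same fragments connected differently.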
Next step – Substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure
Beware: salts etc. – may need to strip out counterions
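A rough sketch of the counterion problem, assuming the structures are held as SMILES strings in which disconnected components are separated by '.'. The helper `largest_component` is a hypothetical name, and counting letters is a deliberately crude measure of size; a real toolkit would strip salts and then canonicalise the result before any exact-match comparison.

```python
def largest_component(smiles: str) -> str:
    """Return the dot-separated SMILES component with the most atom letters."""
    def size(component: str) -> int:
        # Crude size estimate: every alphabetic character counts once.
        return sum(ch.isalpha() for ch in component)
    return max(smiles.split("."), key=size)


print(largest_component("CC(=O)[O-].[Na+]"))   # sodium acetate
print(largest_component("c1ccccc1N.Cl"))       # aniline hydrochloride
```

The string that comes back is still not canonical, so comparing it character by character with a database entry is only meaningful once both have been canonicalised by the same software.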
Substructure Search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a sub-graph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
There are two steps commonly used to do this: firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical; then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
– –OH
– –NH2
– –COOH
– phenyl
• this can be done by tracing appropriate paths in the graph
• subgraphs may overlap
[Figure: example structure with atom labels OH, CH2, CHNH2, OH, O]
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments
Note – a number of 'substructural' fragments can be matched here (6 in all)
Read these for more details:
Gund P, Ann. Rep. Med. Chem. 14, 299, 1979
Sheridan RP et al., J. Chem. Inf. Comput. Sci. 29, 255, 1989
Brint AT, Willett P, J. Mol. Graphics 5, 49, 1987
Ullmann JR, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM) 23, 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan HL, The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 107-113, 1965
The basic concept is that molecules are considered as graphs
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond)
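One possible way to hold such a labelled graph in a program is sketched below. The class `MolGraph` and its method names are assumptions for this illustration, not the data structure of any particular package, and the atom numbering of the acetic acid example is arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """A molecule as a labelled graph: atoms are nodes, bonds are edges."""
    atoms: dict = field(default_factory=dict)   # node index -> element symbol
    bonds: dict = field(default_factory=dict)   # frozenset({i, j}) -> bond order

    def add_atom(self, index, element):
        self.atoms[index] = element

    def add_bond(self, i, j, order=1):
        self.bonds[frozenset((i, j))] = order

    def neighbours(self, index):
        """Indices of the atoms bonded to the given atom."""
        return sorted(j for pair in self.bonds if index in pair for j in pair if j != index)

    def adjacency_matrix(self):
        """Square 0/1 matrix, rows and columns in ascending atom index."""
        order = sorted(self.atoms)
        return [[1 if frozenset((a, b)) in self.bonds else 0 for b in order] for a in order]


# Heavy atoms of acetic acid, CH3-C(=O)-OH, numbered arbitrarily for this example.
acetic_acid = MolGraph()
for index, element in [(1, "C"), (2, "C"), (3, "O"), (4, "O")]:
    acetic_acid.add_atom(index, element)
acetic_acid.add_bond(1, 2)            # C-C single bond
acetic_acid.add_bond(2, 3, order=2)   # C=O double bond (an edge label)
acetic_acid.add_bond(2, 4)            # C-O single bond

print(acetic_acid.neighbours(2))
for row in acetic_acid.adjacency_matrix():
    print(row)
```

The adjacency matrix printed at the end records only which atoms are connected; the element symbols and bond orders are the node and edge labels kept alongside it.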
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label i to each node, considering its atomic number, degree (number of substituents) and types of bonds;
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service, J. Chemical Documentation 5, 107–113, 1965
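Steps (i) to (iii) can be sketched as follows. This is a simplified version under stated assumptions: the initial label is just the degree of each atom (atomic number and bond types are left out), the new label is the sum of the neighbours' labels, and the labels kept are those from the last round that still increased the number of distinct values. The function names and the 2-methylpentane numbering are made up for the example.

```python
def morgan_labels(neighbours):
    """Extended-connectivity labels for a graph given as {atom: [bonded atoms]}."""
    # Step (i): start from the degree of each atom (a simplification; the
    # initial label could also encode atomic number and bond types).
    labels = {atom: len(bonded) for atom, bonded in neighbours.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): the new label of an atom is the sum of its neighbours' labels.
        updated = {atom: sum(labels[n] for n in bonded) for atom, bonded in neighbours.items()}
        updated_classes = len(set(updated.values()))
        # Step (iii): stop once the number of distinct labels no longer increases.
        if updated_classes <= n_classes:
            return labels
        labels, n_classes = updated, updated_classes


def ranked_atoms(neighbours):
    """Atoms ordered from highest to lowest label; ties keep their input order."""
    labels = morgan_labels(neighbours)
    return sorted(neighbours, key=lambda atom: labels[atom], reverse=True)


# Carbon skeleton of 2-methylpentane: chain 0-1-2-3-4 with a methyl (5) on atom 1.
skeleton = {0: [1], 1: [0, 2, 5], 2: [1, 3], 3: [2, 4], 4: [3], 5: [1]}

print(morgan_labels(skeleton))
print(ranked_atoms(skeleton))
```

Atoms 0, 3 and 5 end with the same label although only 0 and 5 are truly equivalent, which is why tie-breaking rules of the kind described on the next slide are still needed.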
The first step - numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest, then uses the next highest connection as 2 etc. Like an onion: concentric circles from the highest numbered
• Rules are used to break equalities, such as C before N, double bond before single etc.
• Some equivalent atoms are arbitrarily numbered
The Ullmann algorithm is a way of detecting substructures
Urea adjacency matrix (atoms O1, C2, C3, C4)
     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0
Cyclohexanone adjacency matrix (atoms O1, C2 to C7)
[Figure: 7 x 7 matrix; the O1 row has one entry, the C4 row three, and each of the C2, C3, C5, C6 and C7 rows two]
Numbering is crucial. Renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of matching ('subgraph isomorphism') and is in the class of NP-complete problems
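The matching step can be illustrated with a simplified backtracking search over adjacency matrices. It is written in the spirit of Ullmann's algorithm (a candidate matrix followed by depth-first assignment) but leaves out his refinement procedure, so it should be read as a sketch rather than the published method. The query has the connectivity of the four-atom matrix (three atoms attached to a central carbon); the zero-based atom order used for cyclohexanone is an assumption chosen for this example, and `adjacency` and `find_matches` are hypothetical helper names.

```python
def adjacency(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of (i, j) bonds."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = matrix[j][i] = 1
    return matrix


def find_matches(q_elements, q_adj, t_elements, t_adj):
    """Return every mapping as a tuple: query atom i -> target atom mapping[i]."""
    n_q, n_t = len(q_elements), len(t_elements)
    q_degree = [sum(row) for row in q_adj]
    t_degree = [sum(row) for row in t_adj]
    # Candidate matrix: query atom i may map to target atom j only if the
    # elements agree and j has at least as many neighbours as i.
    candidates = [
        [j for j in range(n_t)
         if q_elements[i] == t_elements[j] and t_degree[j] >= q_degree[i]]
        for i in range(n_q)
    ]
    matches = []

    def extend(mapping):
        i = len(mapping)                       # next query atom to place
        if i == n_q:
            matches.append(tuple(mapping))
            return
        for j in candidates[i]:
            if j in mapping:
                continue                       # each target atom is used once
            # Every bond from i to an already placed query atom must exist in the target.
            if all(t_adj[j][mapping[k]] for k in range(i) if q_adj[i][k]):
                extend(mapping + [j])

    extend([])
    return matches


# Query: one O and two C attached to a central C (index 3).
query_elements = ["O", "C", "C", "C"]
query_adj = adjacency(4, [(0, 3), (1, 3), (2, 3)])

# Target: cyclohexanone; index 0 is O, 1 the carbonyl C, 2-6 the rest of the
# ring in order (an assumed numbering, chosen only for this example).
target_elements = ["O"] + ["C"] * 6
target_adj = adjacency(7, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)])

matches = find_matches(query_elements, query_adj, target_elements, target_adj)
print(matches)
```

Two mappings come back because the two carbon neighbours of the carbonyl carbon can be swapped. The search tries many hopeless partial assignments before the central atom is placed, which gives a feel for why this is the slow part and why pruning the candidate matrix early matters.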
Next lecture
• How can we use this type of information to solve our chemistry problems?
– Patents/Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure Property/Activity Relationships
Molecules and 3-Dimensions
• Molecules are, of course, not flat. Even very flat molecules are not really flat because of thermal fluctuations. So we represent 3D molecules by including their coordinates or their internal coordinates
• Obtaining the 3-dimensional coordinates can involve experiment (X-ray, electron or neutron diffraction, e.g. the Cambridge Crystallographic Database or the Protein Databank – PDB)
• From these can be obtained atom positions, bonds, coordinates etc.
• There are a number of 3-D construction methods available, such as Corina or Concord (put in a SMILES and get a 3-D molecule), which use rules derived from experiment
• Molecules can also be constructed in 2D and subjected to molecular mechanics or quantum mechanics calculations to obtain 3D structures
• Conformation still remains to be deduced. There are many methods that deduce conformations, usually involving torsional angle rotation to scan the conformational space (like a Ramachandran plot – but often in many more torsional angle dimensions)
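To make the link between Cartesian and internal coordinates concrete, here is a minimal sketch that computes a bond length, a bond angle and a torsion angle from atom positions. The function names and the four test points are assumptions for the illustration, and the sign of the torsion depends on the convention chosen. It relies on `math.dist` and multi-argument `math.hypot`, so it needs Python 3.8 or later.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def bond_length(p1, p2):
    """Distance between two atoms."""
    return math.dist(p1, p2)


def bond_angle(p1, p2, p3):
    """Angle at p2, in degrees."""
    v1, v2 = _sub(p1, p2), _sub(p3, p2)
    cosine = _dot(v1, v2) / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def torsion_angle(p1, p2, p3, p4):
    """Signed dihedral about the p2-p3 bond, in degrees."""
    b0, b1, b2 = _sub(p1, p2), _sub(p3, p2), _sub(p4, p3)
    norm = math.hypot(*b1)
    axis = tuple(x / norm for x in b1)
    # Project the two outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, axis) * x for x in axis))
    w = _sub(b2, tuple(_dot(b2, axis) * x for x in axis))
    return math.degrees(math.atan2(_dot(_cross(axis, v), w), _dot(v, w)))


# Four made-up points forming a chain a-b-c-d with right angles at b and c.
a, b, c = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
d_cis, d_trans, d_perp = (1.0, 0.0, 1.0), (-1.0, 0.0, 1.0), (0.0, 1.0, 1.0)

print(bond_length(a, b), bond_angle(a, b, c))
print(torsion_angle(a, b, c, d_cis), torsion_angle(a, b, c, d_trans), torsion_angle(a, b, c, d_perp))
```

A conformational scan amounts to stepping one or more of these torsion angles through a range of values and rebuilding the coordinates each time.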
Molecules in 3D - uses
• More accurate calculation of molecular properties
• Comparison of the shapes (conformations) of molecules
• Comparison of the dynamics of molecules
• Calculation of bulk properties
• Simulation of chemical reactions ...
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
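The film-strip picture can be sketched directly: records in an SD file are terminated by a line containing `$$$$`, so a concatenated file can be cut back into frames by splitting on that line. The helpers `split_sd_records` and `one_atom_frame` and the one-atom toy trajectory are assumptions for this illustration; simulation packages generally write trajectories in their own, more compact formats.

```python
def split_sd_records(sd_text: str):
    """Split concatenated SD-file text into one string per record (frame)."""
    records, current = [], []
    for line in sd_text.splitlines():
        if line.strip() == "$$$$":         # SD-file record terminator
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


def one_atom_frame(step, x):
    """A toy record: a single argon atom at (x, 0, 0), titled with the step number."""
    return "\n".join([
        f"frame {step}",
        "  toy trajectory",
        "",
        "  1  0  0  0  0  0  0  0  0  0999 V2000",
        f"{x:10.4f}{0.0:10.4f}{0.0:10.4f} Ar  0  0  0  0  0  0  0  0  0  0  0  0",
        "M  END",
        "$$$$",
    ])


# Three snapshots concatenated into one 'movie'.
movie = "\n".join(one_atom_frame(step, 0.1 * step) for step in range(3))
frames = split_sd_records(movie)
print(len(frames), [frame.splitlines()[0] for frame in frames])
```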
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called 'Apelin', with water and the membrane. We used these to design new molecules that block this receptor
Apelin's role in cancer
'Apelin' (a peptide receptor) is involved in cancer – therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth, Brain 2017, awx253, https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review, Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
File formats
-the bane of our lives
Interconnection program ndash Babel
Recent IUPAC moves towards a lsquostandardrsquo format however
In the near future there are likely to be many competing
requirements for file content
Molecules on computers ndash things to look out for
since what is stored is actually quite crude
For example-
Stereochemistry may be relative and not absolute or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution)
so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen
and oxygen get confused
Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved
correctly (problems with storing bonding)
Tautomers can be incorrect ndash check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of MWt and topology) are very difficult to store
Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics
halogen-aromatic bonds etc may be inferred but not observed in the file)
2 Finding molecules
A typical problem involves finding the
lsquorightrsquo molecules
All the molecules
synthesisable
1060 Molecules with
relevance to the problem
1010 (that we can generate)
Molecules we can
search efficiently 107
Molecules we can
make and test 103
Serious difference in numbers we can consider a few hundred
molecules in our heads computers can evaluate millions
1 Finding molecules using Substructure searching
One of the most useful lsquopropertiesrsquo of a molecule is its molecular graph
Many software systems allow searching for whole molecules or
fragments of molecules or even lsquosimilarrsquo molecules
For a specific molecule this means specifying the required search pattern to get an
exact match
An example might be Search for an exact match for dimethylaniline Here wersquove
used the PubChem database which contains compounds and pharmacological
screening data on 92 million unique molecules httppubchemncbinlmnihgov
A more complex example ndash find a molecule
containing our query as part of a larger molecule
Results ndash takes a few seconds to search all
92 million molecules
How does this workThe first thing to notice is that the searching is so fast that clearly all the database
is not being searched
The first part of the process is to pre-compute pointers to only the sets of
compounds in the database we are interested in
This is called Hashing
Speeding up compound matching using hash codes
bull The simplest Hash code registers the presence or absence
of fragments in a molecule eg
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
Contains PContains NH2
Doesnrsquot contain F
X
So a quick match of the bitstring of the fragment searched over the
(pre-computed) bitstrings in the database will eliminate most of the
structures
Next step ndash Substructure (pattern) matching
Exact match
Match the canonical SMILES InCHI or full structure
Beware salts etc may need to strip out counterions
Substructure Search
Supposing we wish to find part of a molecule in a database
To do this we have to do a substructure search
A substructure is a sub-graph of the molecular graph This is
an example of substructural fragments of a larger molecule
shown in different colours You could eg look for all the
molecules with the phenol fragment
There are two steps commonly used to do this firstly we
have to number all the molecules consistently (the Morgan
algorithm) for example to compare two structures to see if
they are identical then we have to match the query
substructure to each substructure in the molecule (the
Ullman algorithm)
Substructures in molecules
bull lsquoSubgraphsrsquo can be identified in a structure graph
corresponding to fragments of the whole structure
eg Rings substituents etc
ndash ndashOH
ndash ndashNH2
ndash ndashCOOH
ndash phenyl
bull this can be done by
tracing appropriate
paths in the graph
bull subgraphs may overlap
OH
CH2
CHNH2
OH
O
Matching the query structure to the database
Two algorithms are commonly used
The Morgan algorithm which numbers the molecule uniquely and
The Ullman algorithm which matches the fragments
Note ndash a number
of lsquosubstructuralrsquo fragments can
be matched here (6 in all)
Gund P Ann Rep in Med Chem 14299 1979
Sheridan RP et al J Chem Inf Comp Sci 29255 1989
Brint AT Willett P J Mol Graphics 549 1987
JR Ullman An Algorithm for Subgraph Isomorphism Journal of the Association of Computing
Machinery (JACM) Vol 23 pp 31-42 1976 httpportalacmorgcitationcfmid=321925
Morgan HL The generation of a unique machine description for chemical structures Journal of
Chemical Documentation 5 1965 107-112
Read these for more details-
The basic concept is that molecules are considered as
Graphs
Atoms are lsquonodesrsquo
Bonds are lsquoedgesrsquo
A molecule is an example of a labelled graph nodes can
have labels such as atomic number The nodes are
connected by the edges (and these can also be labelled
eg double bond)
The Morgan algorithm iteratively computes an integer label
for each node in a structure (atoms) For a computer the key
point is that the numbering is consistent (unlike humans) and
fast The Morgan algorithm works as follows
(i) Assign an integer label i to each node considering its
atomic number degree (number of substituents) and types of
bonds
(ii) Update for each label by adding the labels of the adjacent
nodes (connected atoms) and
(iii) Repeat (ii) until the number of different labels does not
increase Then order the nodes by the value of the labels
H L Morgan The generation of a unique machine description for chemical structures -
A technique developed at chemical abstracts service J Chemical Documentation
5107ndash113 1965
The first step - Numbering a molecule uniquely
A simple Morgan algorithm
bullThis is continued until the number of equivalent classes is unchanged
bullNumbering starts from highest then uses next highest connection as 2 etc Like an onion
concentric circles from the highest numbered
bullRules are used to break equalities such as C before N double bond before single etc
bullSome equivalent atoms are arbitrarily numbered
Urea
Adjacency matrix (atoms numbered O1, C2, C3, C4 as in the figure; C4 is bonded to O1, C2 and C3)

      O1  C2  C3  C4
O1     0   0   0   1
C2     0   0   0   1
C3     0   0   0   1
C4     1   1   1   0

Cyclohexanone adjacency matrix (atoms O1 and C2 to C7; assuming the ring runs C4–C2–C5–C6–C7–C3, as the atom labels in the figure suggest)

      O1  C2  C3  C4  C5  C6  C7
O1     0   0   0   1   0   0   0
C2     0   0   0   1   1   0   0
C3     0   0   0   1   0   0   1
C4     1   1   1   0   0   0   0
C5     0   1   0   0   0   1   0
C6     0   0   0   0   1   0   1
C7     0   0   1   0   0   1   0

Numbering is crucial: renumber for different starting points. Also, the ‘edges’ of the graph are annotated to speed up the matching. The matching itself (subgraph isomorphism) is the ‘slow’ part and belongs to the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures.
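To make the matching step concrete, here is a minimal backtracking sketch that looks for the four-atom query inside cyclohexanone using adjacency matrices. It follows the spirit of Ullmann's approach (a candidate table plus depth-first assignment) but omits his refinement procedure, so it should be read as an illustration, not as the published algorithm. The function name is hypothetical, indices are zero-based, and the ring order in the cyclohexanone matrix is an assumption.

```python
def find_matches(q_adj, q_elem, t_adj, t_elem):
    """Yield every mapping (query index -> target index) that embeds the query."""
    nq, nt = len(q_adj), len(t_adj)
    q_deg = [sum(row) for row in q_adj]
    t_deg = [sum(row) for row in t_adj]
    # Candidate table: same element, and at least as many neighbours.
    cand = [[j for j in range(nt)
             if q_elem[i] == t_elem[j] and q_deg[i] <= t_deg[j]]
            for i in range(nq)]

    def extend(mapping):
        i = len(mapping)
        if i == nq:
            yield tuple(mapping)
            return
        for j in cand[i]:
            if j in mapping:
                continue
            # Every query bond to an already-mapped atom must exist in the target.
            if all(t_adj[j][mapping[k]] for k in range(i) if q_adj[i][k]):
                yield from extend(mapping + [j])

    yield from extend([])


# Query: O1, C2, C3, C4 with C4 bonded to the other three atoms.
q_elem = ["O", "C", "C", "C"]
q_adj = [[0, 0, 0, 1],
         [0, 0, 0, 1],
         [0, 0, 0, 1],
         [1, 1, 1, 0]]

# Cyclohexanone: O1, C2..C7 (ring order C4-C2-C5-C6-C7-C3 is an assumption).
t_elem = ["O", "C", "C", "C", "C", "C", "C"]
t_adj = [[0, 0, 0, 1, 0, 0, 0],
         [0, 0, 0, 1, 1, 0, 0],
         [0, 0, 0, 1, 0, 0, 1],
         [1, 1, 1, 0, 0, 0, 0],
         [0, 1, 0, 0, 0, 1, 0],
         [0, 0, 0, 0, 1, 0, 1],
         [0, 0, 1, 0, 0, 1, 0]]

matches = list(find_matches(q_adj, q_elem, t_adj, t_elem))
print(matches)  # [(0, 1, 2, 3), (0, 2, 1, 3)]
```

The two matches differ only by swapping the two equivalent carbons next to the carbonyl group, which illustrates why symmetry inflates the number of mappings a search has to consider.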
Next lecture
• How can we use this type of information to solve our chemistry problems?
– Patents/Markush structures
– Molecular similarity
– Molecular properties
– 3D searching
– Virtual screening
– Structure–property/activity relationships
The dynamics of a molecule can be computed and stored as a series of frames of the coordinates and bonds of the structure (much like a cartoon). Molecular dynamics is the most popular method for larger systems. Here is an example of a 3D simulation – a nanopore for sequencing DNA. Imagine a series of snapshots of SD files concatenated together to make a movie, just like a film strip.
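The film-strip picture can be sketched as follows. In an SD file each record ends with a line containing `$$$$`, so a concatenated series of snapshots can be split back into individual frames on that delimiter. The function name and the placeholder frame text are assumptions for illustration; real molecular dynamics packages normally use their own trajectory formats.

```python
def split_sd_records(text):
    """Split the text of a multi-record SD file into one string per record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":      # record terminator in SD files
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


# Placeholder text only: real records would hold full molfile blocks.
movie = ("frame 1 coordinates\n$$$$\n"
         "frame 2 coordinates\n$$$$\n"
         "frame 3 coordinates\n$$$$\n")
frames = split_sd_records(movie)
print(len(frames), "frames")
```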
https://en.wikipedia.org/wiki/CHARMM
http://ambermd.org
http://www.ks.uiuc.edu/Research/namd
http://www.ks.uiuc.edu/Research/vmd
We can simulate an entire biological receptor and test new molecules to see how they might work
• Here is a simulation of an important new cancer target called ‘Apelin’, with water and the membrane. We used these simulations to design new molecules that block this receptor
Apelin’s role in cancer
‘Apelin’ (a peptide receptor) is involved in cancer; therefore we designed an antagonist (a molecule that blocks the receptor) to stop tumour growth
• Elizabeth Harford-Wright et al., Pharmacological targeting of apelin impairs glioblastoma growth. Brain 2017, awx253. https://doi.org/10.1093/brain/awx253
• Rayalam et al., Emerging Role of Apelin as a Therapeutic Target in Cancer: A Patent Review. Recent Patents on Anti-Cancer Drug Discovery 2011, 6(3), 367-372
So there are very many file formats – which could be a real pain, but they can be converted

alc -- Alchemy file                             prep -- Amber PREP file
bs -- Ball & Stick file                         caccrt -- Cacao Cartesian file
ccc -- CCC file                                 c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file                        dmol -- DMol3 Coordinates file
feat -- Feature file                            gam -- GAMESS Output file
gamout -- GAMESS Output file                    gpr -- Ghemical Project file
mm1gp -- Ghemical MM file                       qm1gp -- Ghemical QM file
hin -- HyperChem HIN file                       jout -- Jaguar Output file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file                        sd -- MDL Isis SDF file
mdl -- MDL Molfile file                         mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file                  mopout -- MOPAC Output file
mmads -- MMADS file                             mpqc -- MPQC file
bgf -- MSI BGF file                             nwo -- NWChem Output file
pdb -- PDB file                                 ent -- PDB file
pqs -- PQS file                                 qcout -- Q-Chem Output file
res -- ShelX file                               ins -- ShelX file
smi -- SMILES file                              mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file                      vmol -- ViewMol file

alc -- Alchemy file                             bs -- Ball & Stick file
caccrt -- Cacao Cartesian file                  cacint -- Cacao Internal file
cache -- CAChe MolStruct file                   c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file                 ct -- ChemDraw Connection Table file
cht -- Chemtool file                            cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file    crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file                           box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file                  feat -- Feature file
fh -- Fenske-Hall Z-Matrix file                 gamin -- GAMESS Input file
inp -- GAMESS Input file                        gcart -- Gaussian Cartesian file
gau -- Gaussian Input file                      gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file                      gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file                       jin -- Jaguar Input file
bin -- OpenEye Binary file                      mmd -- MacroModel file
mmod -- MacroModel file                         out -- MacroModel file
dat -- MacroModel file                          sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file                         mdl -- MDL Molfile file
mol -- MDL Molfile file                         mopcrt -- MOPAC Cartesian file
mmads -- MMADS file                             bgf -- MSI BGF file
csr -- MSI Quanta CSR file                      nw -- NWChem Input file
pdb -- PDB file                                 ent -- PDB file
pov -- POV-Ray Output file                      pqs -- PQS file
report -- Report file                           qcin -- Q-Chem Input file
smi -- SMILES file                              fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file                         txyz -- Tinker XYZ file
txt -- Titles file                              unixyz -- UniChem XYZ file
vmol -- ViewMol file                            xed -- XED file
xyz -- XYZ file                                 zin -- ZINDO Input file
File formats – the bane of our lives
Interconversion program – Babel
There have been recent IUPAC moves towards a ‘standard’ format. However, in the near future there are likely to be many competing requirements for file content.
Molecules on computers – things to look out for, since what is stored is actually quite crude
For example:
Stereochemistry may be relative and not absolute, or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added and are not always correct. Sometimes nitrogen and oxygen get confused
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding)
Tautomers can be incorrect – check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of MWt and topology) are very difficult to store
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file)
2 Finding molecules
A typical problem involves finding the ‘right’ molecules
All the synthesisable molecules: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions
1 Finding molecules using substructure searching
One of the most useful ‘properties’ of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, fragments of molecules or even ‘similar’ molecules.
For a specific molecule this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we’ve used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules: http://pubchem.ncbi.nlm.nih.gov
A more complex example – find a molecule containing our query as part of a larger molecule
Results – it takes a few seconds to search all 92 million molecules
How does this work?
The first thing to notice is that the searching is so fast that clearly not all of the database is being searched.
The first part of the process is to pre-compute pointers to only the sets of compounds in the database we are interested in.
This is called hashing.
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
(bits set to 1 are annotated ‘Contains P’ and ‘Contains NH2’; a bit set to 0 is annotated ‘Doesn’t contain F’)
So a quick match of the bitstring of the query fragment against the (pre-computed) bitstrings in the database will eliminate most of the structures
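The screening step can be sketched with ordinary integers used as bitstrings. A database molecule survives the screen only if every bit set in the query is also set in the molecule. The fragment-to-bit assignments and the molecule names below are hypothetical; real systems use much longer fingerprints, and a molecule that passes the screen still has to be confirmed by atom-by-atom matching.

```python
# Hypothetical fragment dictionary: which bit (1-16) each fragment switches on.
FRAGMENT_BITS = {"NH2": 6, "F": 9, "P": 12}


def fingerprint(fragments):
    """Build a bitstring (as an int) from the fragments present in a molecule."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << (FRAGMENT_BITS[fragment] - 1)
    return bits


def passes_screen(query_fp, molecule_fp):
    """True if every bit set in the query is also set in the molecule."""
    return (query_fp & molecule_fp) == query_fp


# Pre-computed once for the whole (toy) database.
database = {
    "mol_A": fingerprint({"NH2", "P"}),
    "mol_B": fingerprint({"F"}),
    "mol_C": fingerprint({"NH2"}),
}

query = fingerprint({"NH2"})
survivors = [name for name, fp in database.items() if passes_screen(query, fp)]
print(survivors)  # ['mol_A', 'mol_C']
```

Because the comparison is a single bitwise AND per molecule, most of the database can be rejected before any expensive graph matching is attempted.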
Next step – substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure
Beware of salts etc.: you may need to strip out counterions
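A crude way to strip counterions, assuming the structure is available as a SMILES string, is to split on the `.` that separates disconnected components and keep the largest one. The function name is hypothetical and counting letters is only a rough stand-in for counting atoms; a cheminformatics toolkit would normally do this on the parsed structure, and comparing the resulting strings is only meaningful if both sides are canonical SMILES from the same toolkit.

```python
def largest_fragment(smiles):
    """Return the dot-separated component with the most letters (a rough atom count)."""
    parts = smiles.split(".")
    return max(parts, key=lambda part: sum(ch.isalpha() for ch in part))


# N,N-dimethylaniline written as its hydrochloride salt.
print(largest_fragment("CN(C)c1ccccc1.Cl"))   # CN(C)c1ccccc1
# Sodium acetate: the sodium counterion is dropped.
print(largest_fragment("CC(=O)[O-].[Na+]"))   # CC(=O)[O-]
```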
Substructure search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a subgraph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• ‘Subgraphs’ can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
– –OH
– –NH2
– –COOH
– phenyl
• this can be done by tracing appropriate paths in the graph
• subgraphs may overlap
[Figure: example structure with fragment labels OH, CH2, CHNH2, OH, O]
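Tracing paths in the graph can be sketched as a depth-first walk that records the element symbols along every linear path up to a chosen length. The function name, the four-atom N-C-C-O example and the choice to ignore bond orders are assumptions made to keep the illustration short; ring and branched fragments would need extra handling.

```python
def path_fragments(atoms, adjacency, max_atoms=3):
    """Collect the distinct linear paths of up to `max_atoms` atoms."""
    fragments = set()

    def walk(path):
        symbols = tuple(atoms[a] for a in path)
        # A path and its reverse are the same fragment: keep one spelling.
        fragments.add("-".join(min(symbols, symbols[::-1])))
        if len(path) < max_atoms:
            for nxt in adjacency[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in atoms:
        walk([start])
    return fragments


# Hypothetical example: heavy atoms of an N-C-C-O chain.
atoms = {0: "N", 1: "C", 2: "C", 3: "O"}
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(sorted(path_fragments(atoms, adjacency)))
```

The fragments overlap heavily (the C-C path is part of both three-atom paths), which matches the point that subgraphs of one molecule may share atoms.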
Matching the query structure to the database
Two algorithms are commonly used:
the Morgan algorithm, which numbers the molecule uniquely, and
the Ullmann algorithm, which matches the fragments
Note – a number of ‘substructural’ fragments can be matched here (6 in all)
Gund, P. Ann. Rep. in Med. Chem. 14, 299 (1979)
Sheridan, R. P. et al. J. Chem. Inf. Comp. Sci. 29, 255 (1989)
Brint, A. T.; Willett, P. J. Mol. Graphics 5, 49 (1987)
Ullmann, J. R. An Algorithm for Subgraph Isomorphism. Journal of the Association for Computing Machinery (JACM) 23, 31-42 (1976). http://portal.acm.org/citation.cfm?id=321925
Morgan, H. L. The generation of a unique machine description for chemical structures. Journal of Chemical Documentation 5, 107-113 (1965)
Read these for more details.
Apelinrsquos role in Cancer
lsquoApelinrsquo (a peptide receptor) is involved in cancerndash therefore we designed an
antagonist (a molecule that blocks the receptor) stop tumour growth
bull Elizabeth Harford-Wright et al Pharmacological targeting of apelin impairs glioblastoma
growth Brain 2017 awx253 httpsdoiorg101093brainawx253
bull Emerging Role of Apelin as a Therapeutic Target in Cancer A Patent Review Rayalam et
al Recent Patents on Anti-Cancer Drug Discovery 211 6(3) 367-372
So there are very many file formats
ndash which could be a real painbut
they can be convertedalc -- Alchemy file prep -- Amber PREP file
bs -- Ball amp Stick file caccrt -- Cacao Cartesian file
ccc -- CCC file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical Resource
Kit 2D file
crk3d -- CRK3D Chemical Resource
Kit 3D file
box -- Dock 35 Box file dmol -- DMol3 Coordinates file
feat -- Feature file gam -- GAMESS Output file
gamout -- GAMESS Output file gpr -- Ghemical Project file
mm1gp -- Ghemical MM file qm1gp -- Ghemical QM file
hin -- HyperChem HIN file jout -- Jaguar Output file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file car -- MSI BiosymInsight II CAR
file
sdf -- MDL Isis SDF file sd -- MDL Isis SDF file
mdl -- MDL Molfile file mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file mopout -- MOPAC Output file
mmads -- MMADS file mpqc -- MPQC file
bgf -- MSI BGF file nwo -- NWChem Output file
pdb -- PDB file ent -- PDB file
pqs -- PQS file qcout -- Q-Chem Output file
res -- ShelX file ins -- ShelX file
smi -- SMILES file mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file vmol -- ViewMol file
alc -- Alchemy file bs -- Ball amp Stick file
caccrt -- Cacao Cartesian file cacint -- Cacao Internal file
cache -- CAChe MolStruct file c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file ct -- ChemDraw Connection Table
file
cht -- Chemtool file cml -- Chemical Markup Language
file
crk2d -- CRK2D Chemical
Resource Kit 2D file
crk3d -- CRK3D Chemical
Resource Kit 3D file
cssr -- CSD CSSR file box -- Dock 35 Box file
dmol -- DMol3 Coordinates file feat -- Feature file
fh -- Fenske-Hall Z-Matrix file gamin -- GAMESS Input file
inp -- GAMESS Input file gcart -- Gaussian Cartesian file
gau -- Gaussian Input file gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file jin -- Jaguar Input file
bin -- OpenEye Binary file mmd -- MacroModel file
mmod -- MacroModel file out -- MacroModel file
dat -- MacroModel file sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file mdl -- MDL Molfile file
mol -- MDL Molfile file mopcrt -- MOPAC Cartesian file
mmads -- MMADS file bgf -- MSI BGF file
csr -- MSI Quanta CSR file nw -- NWChem Input file
pdb -- PDB file ent -- PDB file
pov -- POV-Ray Output file pqs -- PQS file
report -- Report file qcin -- Q-Chem Input file
smi -- SMILES file fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file txyz -- Tinker XYZ file
txt -- Titles file unixyz -- UniChem XYZ file
vmol -- ViewMol file xed -- XED file
xyz -- XYZ file zin -- ZINDO Input file
File formats
-the bane of our lives
Interconnection program ndash Babel
Recent IUPAC moves towards a lsquostandardrsquo format however
In the near future there are likely to be many competing
requirements for file content
Molecules on computers ndash things to look out for
since what is stored is actually quite crude
For example-
Stereochemistry may be relative and not absolute or even incorrect
In proteins only the HEAVY atom positions are observed (sometimes at low resolution)
so bonds and hydrogen atoms are added and are not always correct Sometimes Nitrogen
and oxygen get confused
Hypervalent atoms (as in nitro groups for example) are often not stored and retrieved
correctly (problems with storing bonding)
Tautomers can be incorrect ndash check they look reasonable
Mesomers can be incorrect (double bonds)
Polymers (which are mixtures of MWt and topology) are very difficult to store
Only valence bonds are accurately stored (hydrogen-bonds crystal fields in inorganics
halogen-aromatic bonds etc may be inferred but not observed in the file)
2 Finding molecules
A typical problem involves finding the
lsquorightrsquo molecules
All the molecules
synthesisable
1060 Molecules with
relevance to the problem
1010 (that we can generate)
Molecules we can
search efficiently 107
Molecules we can
make and test 103
Serious difference in numbers we can consider a few hundred
molecules in our heads computers can evaluate millions
1 Finding molecules using Substructure searching
One of the most useful lsquopropertiesrsquo of a molecule is its molecular graph
Many software systems allow searching for whole molecules or
fragments of molecules or even lsquosimilarrsquo molecules
For a specific molecule this means specifying the required search pattern to get an
exact match
An example might be Search for an exact match for dimethylaniline Here wersquove
used the PubChem database which contains compounds and pharmacological
screening data on 92 million unique molecules httppubchemncbinlmnihgov
A more complex example ndash find a molecule
containing our query as part of a larger molecule
Results ndash takes a few seconds to search all
92 million molecules
How does this workThe first thing to notice is that the searching is so fast that clearly all the database
is not being searched
The first part of the process is to pre-compute pointers to only the sets of
compounds in the database we are interested in
This is called Hashing
Speeding up compound matching using hash codes
bull The simplest Hash code registers the presence or absence
of fragments in a molecule eg
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
Contains PContains NH2
Doesnrsquot contain F
X
So a quick match of the bitstring of the fragment searched over the
(pre-computed) bitstrings in the database will eliminate most of the
structures
Next step ndash Substructure (pattern) matching
Exact match
Match the canonical SMILES InCHI or full structure
Beware salts etc may need to strip out counterions
Substructure Search
Supposing we wish to find part of a molecule in a database
To do this we have to do a substructure search
A substructure is a sub-graph of the molecular graph This is
an example of substructural fragments of a larger molecule
shown in different colours You could eg look for all the
molecules with the phenol fragment
There are two steps commonly used to do this firstly we
have to number all the molecules consistently (the Morgan
algorithm) for example to compare two structures to see if
they are identical then we have to match the query
substructure to each substructure in the molecule (the
Ullman algorithm)
Substructures in molecules
bull lsquoSubgraphsrsquo can be identified in a structure graph
corresponding to fragments of the whole structure
eg Rings substituents etc
ndash ndashOH
ndash ndashNH2
ndash ndashCOOH
ndash phenyl
bull this can be done by
tracing appropriate
paths in the graph
bull subgraphs may overlap
OH
CH2
CHNH2
OH
O
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments.
Note: a number of 'substructural' fragments can be matched here (6 in all).
Read these for more details:
Gund, P. Ann. Rep. Med. Chem. 14, 299 (1979)
Sheridan, R.P. et al. J. Chem. Inf. Comp. Sci. 29, 255 (1989)
Brint, A.T.; Willett, P. J. Mol. Graphics 5, 49 (1987)
Ullmann, J.R. An Algorithm for Subgraph Isomorphism. Journal of the Association for Computing Machinery (JACM) 23, 31-42 (1976) http://portal.acm.org/citation.cfm?id=321925
Morgan, H.L. The generation of a unique machine description for chemical structures. Journal of Chemical Documentation 5, 107-113 (1965)
The basic concept is that molecules are considered as graphs:
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
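One possible way to hold such a labelled graph in memory is sketched below. The class name, its fields and the acetaldehyde example are assumptions chosen for illustration; real toolkits store much more (charges, stereo, aromaticity).

```python
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """A molecule as a labelled graph: atoms are nodes, bonds are edges."""
    atoms: dict = field(default_factory=dict)  # node id -> element symbol
    bonds: dict = field(default_factory=dict)  # frozenset({a, b}) -> bond order

    def add_atom(self, idx, element):
        self.atoms[idx] = element

    def add_bond(self, a, b, order=1):
        # A frozenset key makes the edge undirected: (a, b) equals (b, a).
        self.bonds[frozenset((a, b))] = order

    def neighbours(self, idx):
        """Ids of the atoms bonded to idx, in sorted order."""
        return sorted(next(iter(pair - {idx}))
                      for pair in self.bonds if idx in pair)

    def degree(self, idx):
        return len(self.neighbours(idx))


# Acetaldehyde heavy atoms, CH3-CH=O (hydrogens left implicit).
acetaldehyde = MolGraph()
for idx, element in [(1, "C"), (2, "C"), (3, "O")]:
    acetaldehyde.add_atom(idx, element)
acetaldehyde.add_bond(1, 2, order=1)
acetaldehyde.add_bond(2, 3, order=2)  # edge label: double bond
```

Here the node labels are element symbols and the edge labels are bond orders, which mirrors the description above.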
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its atomic number, degree (number of substituents) and types of bonds.
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms).
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan. The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service. J. Chemical Documentation 5, 107-113 (1965)
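Steps (i) to (iii) can be sketched as follows. This is a simplified version under stated assumptions: the initial label is the degree only (atomic number and bond types are ignored), each update replaces a label by the sum of its neighbours' labels, and ties are broken by the input atom number, which is arbitrary. The function names are illustrative.

```python
def morgan_labels(adjacency):
    """Return Morgan-style connectivity labels.

    adjacency: dict mapping each atom number to a list of neighbouring atom
    numbers (heavy atoms only).
    """
    # Step (i), simplified: start from the degree of each atom.
    labels = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): new label = sum of the neighbours' current labels.
        updated = {atom: sum(labels[n] for n in nbrs)
                   for atom, nbrs in adjacency.items()}
        n_updated = len(set(updated.values()))
        # Step (iii): stop once the number of distinct labels stops growing,
        # keeping the last labelling that increased it.
        if n_updated <= n_classes:
            return labels
        labels, n_classes = updated, n_updated


def morgan_order(adjacency):
    """Atoms sorted from highest to lowest label (ties by atom number)."""
    labels = morgan_labels(adjacency)
    return sorted(adjacency, key=lambda atom: (-labels[atom], atom))


# n-Pentane carbon skeleton: 1-2-3-4-5
PENTANE = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
```

For the pentane chain the degrees (1, 2, 2, 2, 1) give two classes; one update gives (2, 3, 4, 3, 2) with three classes; a second update would drop back to two classes, so the algorithm stops and the central carbon is ranked first.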
The first step - Numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest value, then takes the next-highest connected atom as 2, etc. Like an onion: concentric circles outward from the highest-numbered atom
• Rules are used to break ties, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered
[Figure: urea and cyclohexanone drawn with numbered atoms]
Urea adjacency matrix
      O1  C2  C3  C4
O1     0   0   0   1
C2     0   0   0   1
C3     0   0   0   1
C4     1   1   1   0
[Figure: cyclohexanone adjacency matrix over atoms O1, C2, C3, C4, C5, C6, C7. Row O1 has one entry, row C4 has three, and rows C2, C3, C5, C6 and C7 have two each.]
Numbering is crucial: renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This matching step (subgraph isomorphism) is the 'slow' part and belongs to the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures.
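The sketch below builds adjacency matrices and searches for a query graph inside a target graph. It is a plain backtracking search with a degree pre-check, in the spirit of Ullmann's method but without its refinement step, and it ignores element and bond labels. The cyclohexanone numbering used here (O1 on C2, ring C2 to C7) is an assumption for the example and is not the numbering of the slide.

```python
def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 matrix; bonds are pairs of 1-based atom numbers."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b in bonds:
        matrix[a - 1][b - 1] = 1
        matrix[b - 1][a - 1] = 1
    return matrix


def find_embedding(query, target):
    """Return a list m with m[i] = target index matched to query atom i,
    or None if the query graph does not occur in the target graph.

    Every query bond must land on a target bond. Atom and bond labels are
    ignored in this simplified version.
    """
    n_q, n_t = len(query), len(target)
    q_deg = [sum(row) for row in query]
    t_deg = [sum(row) for row in target]

    def extend(mapping):
        i = len(mapping)              # next query atom to place
        if i == n_q:
            return list(mapping)
        for j in range(n_t):
            # Skip used atoms and atoms with too few neighbours.
            if j in mapping or t_deg[j] < q_deg[i]:
                continue
            # Bonds from atom i to already placed atoms must exist in target.
            if all(target[j][mapping[k]] for k in range(i) if query[i][k]):
                mapping.append(j)
                result = extend(mapping)
                if result is not None:
                    return result
                mapping.pop()
        return None

    return extend([])


# Urea skeleton as in the matrix above: atom 4 bonded to atoms 1, 2 and 3.
UREA = adjacency_matrix(4, [(1, 4), (2, 4), (3, 4)])
# Cyclohexanone skeleton, assumed numbering: O1-C2, ring C2..C7.
CYCLOHEXANONE = adjacency_matrix(
    7, [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 2)])
# A four-membered ring, which does not occur in cyclohexanone.
CYCLOBUTANE = adjacency_matrix(4, [(1, 2), (2, 3), (3, 4), (4, 1)])
```

With labels ignored, the three-connected centre of the urea skeleton can only map onto the carbonyl carbon of cyclohexanone, while the four-membered ring has no match. The search may backtrack many times before it succeeds or gives up, which is why the cheap bitstring screen is applied first.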
Next lecture
• How can we use this type of information to solve our chemistry problems?
- Patents/Markush structures
- Molecular similarity
- Molecular properties
- 3D searching
- Virtual screening
- Structure-Property/Activity Relationships
So there are very many file formats, which could be a real pain, but they can be converted:
alc -- Alchemy file
prep -- Amber PREP file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
ccc -- CCC file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
gam -- GAMESS Output file
gamout -- GAMESS Output file
gpr -- Ghemical Project file
mm1gp -- Ghemical MM file
qm1gp -- Ghemical QM file
hin -- HyperChem HIN file
jout -- Jaguar Output file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
car -- MSI Biosym/Insight II CAR file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mopout -- MOPAC Output file
mmads -- MMADS file
mpqc -- MPQC file
bgf -- MSI BGF file
nwo -- NWChem Output file
pdb -- PDB file
ent -- PDB file
pqs -- PQS file
qcout -- Q-Chem Output file
res -- ShelX file
ins -- ShelX file
smi -- SMILES file
mol2 -- Sybyl Mol2 file
unixyz -- UniChem XYZ file
vmol -- ViewMol file

alc -- Alchemy file
bs -- Ball & Stick file
caccrt -- Cacao Cartesian file
cacint -- Cacao Internal file
cache -- CAChe MolStruct file
c3d1 -- Chem3D Cartesian 1 file
c3d2 -- Chem3D Cartesian 2 file
ct -- ChemDraw Connection Table file
cht -- Chemtool file
cml -- Chemical Markup Language file
crk2d -- CRK2D Chemical Resource Kit 2D file
crk3d -- CRK3D Chemical Resource Kit 3D file
cssr -- CSD CSSR file
box -- Dock 3.5 Box file
dmol -- DMol3 Coordinates file
feat -- Feature file
fh -- Fenske-Hall Z-Matrix file
gamin -- GAMESS Input file
inp -- GAMESS Input file
gcart -- Gaussian Cartesian file
gau -- Gaussian Input file
gpr -- Ghemical Project file
gr96a -- GROMOS96 (A) file
gr96n -- GROMOS96 (nm) file
hin -- HyperChem HIN file
jin -- Jaguar Input file
bin -- OpenEye Binary file
mmd -- MacroModel file
mmod -- MacroModel file
out -- MacroModel file
dat -- MacroModel file
sdf -- MDL Isis SDF file
sd -- MDL Isis SDF file
mdl -- MDL Molfile file
mol -- MDL Molfile file
mopcrt -- MOPAC Cartesian file
mmads -- MMADS file
bgf -- MSI BGF file
csr -- MSI Quanta CSR file
nw -- NWChem Input file
pdb -- PDB file
ent -- PDB file
pov -- POV-Ray Output file
pqs -- PQS file
report -- Report file
qcin -- Q-Chem Input file
smi -- SMILES file
fix -- SMILES Fix file
mol2 -- Sybyl Mol2 file
txyz -- Tinker XYZ file
txt -- Titles file
unixyz -- UniChem XYZ file
vmol -- ViewMol file
xed -- XED file
xyz -- XYZ file
zin -- ZINDO Input file
File formats - the bane of our lives
Interconversion program: Babel
There have been recent IUPAC moves towards a 'standard' format. However, in the near future there are likely to be many competing requirements for file content.
Molecules on computers: things to look out for, since what is stored is actually quite crude
For example:
Stereochemistry may be relative rather than absolute, or even incorrect.
In proteins only the HEAVY atom positions are observed (sometimes at low resolution), so bonds and hydrogen atoms are added afterwards and are not always correct. Sometimes nitrogen and oxygen get confused.
Hypervalent atoms (as in nitro groups, for example) are often not stored and retrieved correctly (problems with storing bonding).
Tautomers can be incorrect: check they look reasonable.
Mesomers can be incorrect (double bonds).
Polymers (which are mixtures of molecular weights and topologies) are very difficult to store.
Only valence bonds are accurately stored (hydrogen bonds, crystal fields in inorganics, halogen-aromatic bonds etc. may be inferred but are not observed in the file).
2 Finding molecules
A typical problem involves finding the 'right' molecules
All the synthesisable molecules: 10^60
Molecules with relevance to the problem (that we can generate): 10^10
Molecules we can search efficiently: 10^7
Molecules we can make and test: 10^3
Serious difference in numbers: we can consider a few hundred molecules in our heads; computers can evaluate millions.
1 Finding molecules using substructure searching
One of the most useful 'properties' of a molecule is its molecular graph.
Many software systems allow searching for whole molecules, fragments of molecules, or even 'similar' molecules.
For a specific molecule this means specifying the required search pattern to get an exact match.
An example might be: search for an exact match for dimethylaniline. Here we've used the PubChem database, which contains compounds and pharmacological screening data on 92 million unique molecules (http://pubchem.ncbi.nlm.nih.gov).
A more complex example: find a molecule containing our query as part of a larger molecule.
Results ndash takes a few seconds to search all
92 million molecules
How does this workThe first thing to notice is that the searching is so fast that clearly all the database
is not being searched
The first part of the process is to pre-compute pointers to only the sets of
compounds in the database we are interested in
This is called Hashing
Speeding up compound matching using hash codes
• The simplest hash code registers the presence or absence of fragments in a molecule, e.g.
Bit    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Value  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
A set bit (1) records a fragment that is present (e.g. contains P, contains NH2); an unset bit (0) records one that is absent (e.g. doesn't contain F).
So a quick match of the bitstring of the query fragment against the (pre-computed) bitstrings in the database will eliminate most of the structures.
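The bitstring screen can be sketched in a few lines. This is a minimal illustration, not a real fingerprint scheme: the assignment of P to bit 6, NH2 to bit 12 and F to bit 3 is an assumption made so that the output reproduces the 16-bit pattern shown above, and the function names are hypothetical.

```python
# Hypothetical bit assignments (1-based, as in the example bitstring);
# real fragment keys differ from system to system.
FRAGMENT_BITS = {"F": 3, "P": 6, "NH2": 12}
N_BITS = 16


def fingerprint(fragments):
    """Set one bit per fragment present in the molecule."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << (FRAGMENT_BITS[fragment] - 1)
    return bits


def bit_string(bits):
    """Render the fingerprint with bit 1 on the left."""
    return format(bits, f"0{N_BITS}b")[::-1]


def passes_screen(query_bits, molecule_bits):
    """A molecule can only contain the query if it has every bit the query has."""
    return (query_bits & molecule_bits) == query_bits


molecule = fingerprint({"P", "NH2"})
print(bit_string(molecule))  # 0000010000010000
```

A molecule that passes the screen is only a candidate; it still has to go through full substructure matching, while one that fails can be discarded immediately.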
Next step: substructure (pattern) matching
Exact match
Match the canonical SMILES, InChI or full structure.
Beware salts etc.: you may need to strip out counterions.
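As a rough sketch of exact matching with counterion stripping, the example below treats a SMILES string as a lookup key after keeping only its largest dot-separated component. It assumes the SMILES have already been canonicalised by one and the same toolkit (not done here), uses string length as a crude stand-in for "largest fragment", and the compound identifiers are invented for the example.

```python
def strip_counterions(smiles):
    """Keep the longest dot-separated component of a SMILES string (crude heuristic)."""
    return max(smiles.split("."), key=len)


def build_lookup(records):
    """Map each stripped SMILES key to the compound identifiers that share it."""
    lookup = {}
    for compound_id, smiles in records:
        lookup.setdefault(strip_counterions(smiles), []).append(compound_id)
    return lookup


# Hypothetical records: (identifier, SMILES assumed to be canonical already).
RECORDS = [
    ("cmpd-1", "CN(C)c1ccccc1"),     # N,N-dimethylaniline
    ("cmpd-2", "CN(C)c1ccccc1.Cl"),  # the same amine written as a hydrochloride
    ("cmpd-3", "Oc1ccccc1"),         # phenol
]

lookup = build_lookup(RECORDS)
print(lookup.get("CN(C)c1ccccc1"))  # ['cmpd-1', 'cmpd-2']
```

An exact match is then a single dictionary lookup, which is why it is so much cheaper than a substructure search.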
Substructure Search
Suppose we wish to find part of a molecule in a database. To do this we have to do a substructure search.
A substructure is a subgraph of the molecular graph. This is an example of substructural fragments of a larger molecule, shown in different colours. You could, e.g., look for all the molecules with the phenol fragment.
There are two steps commonly used to do this. Firstly, we have to number all the molecules consistently (the Morgan algorithm), for example to compare two structures to see if they are identical. Then we have to match the query substructure to each substructure in the molecule (the Ullmann algorithm).
Substructures in molecules
• 'Subgraphs' can be identified in a structure graph, corresponding to fragments of the whole structure, e.g. rings, substituents etc.
  - -OH
  - -NH2
  - -COOH
  - phenyl
• This can be done by tracing appropriate paths in the graph
• Subgraphs may overlap
[Figure: example structure; labels OH, CH2, CHNH2, OH, O]
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments.
Note: a number of 'substructural' fragments can be matched here (6 in all).
Read these for more details:
Gund, P. Ann. Rep. Med. Chem. 14, 299 (1979)
Sheridan, R. P. et al. J. Chem. Inf. Comput. Sci. 29, 255 (1989)
Brint, A. T.; Willett, P. J. Mol. Graphics 5, 49 (1987)
Ullmann, J. R. An Algorithm for Subgraph Isomorphism. Journal of the Association for Computing Machinery (JACM) 23, 31-42 (1976). http://portal.acm.org/citation.cfm?id=321925
Morgan, H. L. The generation of a unique machine description for chemical structures. Journal of Chemical Documentation 5, 107-113 (1965)
The basic concept is that molecules are considered as graphs:
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can have labels such as atomic number. The nodes are connected by the edges (and these can also be labelled, e.g. double bond).
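A labelled molecular graph can be written down directly as a list of node labels plus a list of labelled edges. The sketch below does this for the heavy atoms of urea (hydrogens left implicit); the variable and function names are illustrative choices, and storing the bond order in the matrix is just one possible way to label the edges.

```python
# Node labels: element symbols of the heavy atoms of urea, O=C(N)N.
ATOMS = ["O", "N", "N", "C"]

# Edge labels: (atom index, atom index, bond order).
BONDS = [(3, 0, 2), (3, 1, 1), (3, 2, 1)]


def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric matrix whose entries are bond orders (0 = not bonded)."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b, order in bonds:
        matrix[a][b] = matrix[b][a] = order
    return matrix


matrix = adjacency_matrix(len(ATOMS), BONDS)
for symbol, row in zip(ATOMS, matrix):
    print(symbol, row)
```

Replacing every non-zero entry by 1 gives the plain connectivity (adjacency) matrix, with the bond types dropped.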
The Morgan algorithm iteratively computes an integer label for each node (atom) in a structure. For a computer, the key point is that the numbering is consistent (unlike humans) and fast. The Morgan algorithm works as follows:
(i) Assign an integer label i to each node, considering its atomic number, degree (number of substituents) and types of bonds;
(ii) Update each label by adding the labels of the adjacent nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service. J. Chemical Documentation 5, 107-113, 1965.
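Steps (i) to (iii) can be sketched as follows. This is a simplified version under stated assumptions: the initial label is just the degree of each atom (atomic number and bond types are ignored), each update replaces a label by the sum of its neighbours' labels, and the labels from the last round that increased the number of classes are returned. The five-atom chain is an illustrative example added here; the four-atom star is the connectivity of the urea skeleton.

```python
def morgan_labels(adjacency):
    """Simplified Morgan labelling on a connectivity matrix (non-zero = bonded)."""
    n = len(adjacency)
    # (i) initial label: the degree of each atom (an assumption of this sketch)
    labels = [sum(1 for entry in row if entry) for row in adjacency]
    n_classes = len(set(labels))
    while True:
        # (ii) new label = sum of the labels of the connected atoms
        updated = [
            sum(labels[j] for j in range(n) if adjacency[i][j]) for i in range(n)
        ]
        # (iii) stop as soon as the number of different labels no longer increases
        if len(set(updated)) <= n_classes:
            return labels
        labels, n_classes = updated, len(set(updated))


# Heavy-atom chain of five atoms (e.g. the carbon skeleton of n-pentane).
CHAIN = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]

# Star of four atoms: three terminal atoms around a central one (urea skeleton).
STAR = [
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]

print(morgan_labels(CHAIN))  # [2, 3, 4, 3, 2]
print(morgan_labels(STAR))   # [1, 1, 1, 3]
```

In the chain the middle atom ends up with the highest label and the two ends with the lowest, while symmetry-equivalent atoms share a label.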
The first step: numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest label, then uses the next highest connection as 2, etc. Like an onion: concentric circles out from the highest-numbered atom
• Rules are used to break equalities, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered
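The "onion" numbering can be sketched as a breadth-first walk outwards from the atom with the highest label. The code below is an assumption-laden illustration: it takes the labels as given, visits neighbours in order of decreasing label, and breaks remaining ties by original atom index, which stands in for the chemical tie-breaking rules (C before N, double before single) that are not implemented here. It also assumes a connected molecule.

```python
from collections import deque


def number_atoms(adjacency, labels):
    """Return {atom index: canonical number}, numbering outwards in layers."""
    n = len(adjacency)
    # Start from the highest label; ties go to the lowest original index.
    start = max(range(n), key=lambda i: (labels[i], -i))
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        neighbours = [j for j in range(n) if adjacency[atom][j] and j not in seen]
        # Highest label first; equal labels are ordered arbitrarily (by index).
        neighbours.sort(key=lambda j: (-labels[j], j))
        for j in neighbours:
            seen.add(j)
            order.append(j)
            queue.append(j)
    return {atom: rank + 1 for rank, atom in enumerate(order)}


# Five-atom chain with illustrative labels (highest in the middle).
CHAIN = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
CHAIN_LABELS = [2, 3, 4, 3, 2]

print(number_atoms(CHAIN, CHAIN_LABELS))  # {2: 1, 1: 2, 3: 3, 0: 4, 4: 5}
```

The middle atom becomes number 1, its two neighbours 2 and 3, and the chain ends 4 and 5; the choice between the two equivalent neighbours is arbitrary, as noted in the last bullet.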
The Ullmann algorithm is a way of detecting substructures
Urea adjacency matrix (heavy atoms, labelled as on the slide; atoms 2 and 3 are the nitrogens)
      O1  C2  C3  C4
O1     0   0   0   1
C2     0   0   0   1
C3     0   0   0   1
C4     1   1   1   0
Cyclohexanone adjacency matrix (ring traced as C4-C2-C5-C6-C7-C3, following the order of the labels in the figure)
      O1  C2  C3  C4  C5  C6  C7
O1     0   0   0   1   0   0   0
C2     0   0   0   1   1   0   0
C3     0   0   0   1   0   0   1
C4     1   1   1   0   0   0   0
C5     0   1   0   0   0   1   0
C6     0   0   0   0   1   0   1
C7     0   0   1   0   0   1   0
Numbering is crucial: renumber for different starting points. Also, the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part of the matching: subgraph isomorphism belongs to the class of NP-complete problems.
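The matching step can be illustrated with a plain backtracking search over the two adjacency matrices. This is a sketch in the spirit of Ullmann's method, not the published algorithm: it uses a degree-based candidate filter and depth-first backtracking but omits Ullmann's refinement procedure, and it compares connectivity only (no atom or bond types). The cyclohexanone ring closure C4-C2-C5-C6-C7-C3 is an assumption about the slide's numbering.

```python
def find_embeddings(pattern, target):
    """List every mapping of pattern atoms onto target atoms that preserves bonds."""
    n_p, n_t = len(pattern), len(target)
    p_deg = [sum(row) for row in pattern]
    t_deg = [sum(row) for row in target]
    # A pattern atom may only map to a target atom with at least as many neighbours.
    allowed = [[j for j in range(n_t) if t_deg[j] >= p_deg[i]] for i in range(n_p)]
    results = []

    def extend(mapping):
        i = len(mapping)  # next pattern atom to place
        if i == n_p:
            results.append(tuple(mapping))
            return
        for j in allowed[i]:
            if j in mapping:
                continue
            # Every bond from atom i to an already placed atom must exist in the target.
            if all(target[mapping[k]][j] for k in range(i) if pattern[k][i]):
                mapping.append(j)
                extend(mapping)
                mapping.pop()

    extend([])
    return results


# Urea skeleton (atoms 1-4 of the slide, zero-based here).
UREA = [
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]

# Cyclohexanone, ring assumed to run C4-C2-C5-C6-C7-C3.
CYCLOHEXANONE = [
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0],
]

matches = find_embeddings(UREA, CYCLOHEXANONE)
print(len(matches))  # 6
print(matches[0])    # (0, 1, 2, 3)
```

With connectivity alone, the three terminal atoms of the query can be permuted over the three neighbours of the carbonyl carbon, giving six mappings; adding atom-type checks (for example, oxygen may only map to oxygen) would reduce that number and prune the search earlier.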
Speeding up compound matching using hash codes
bull The simplest Hash code registers the presence or absence
of fragments in a molecule eg
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
Contains PContains NH2
Doesnrsquot contain F
X
So a quick match of the bitstring of the fragment searched over the
(pre-computed) bitstrings in the database will eliminate most of the
structures
Next step ndash Substructure (pattern) matching
Exact match
Match the canonical SMILES InCHI or full structure
Beware salts etc may need to strip out counterions
Substructure Search
Supposing we wish to find part of a molecule in a database
To do this we have to do a substructure search
A substructure is a sub-graph of the molecular graph This is
an example of substructural fragments of a larger molecule
shown in different colours You could eg look for all the
molecules with the phenol fragment
There are two steps commonly used to do this firstly we
have to number all the molecules consistently (the Morgan
algorithm) for example to compare two structures to see if
they are identical then we have to match the query
substructure to each substructure in the molecule (the
Ullman algorithm)
Substructures in molecules
bull lsquoSubgraphsrsquo can be identified in a structure graph
corresponding to fragments of the whole structure
eg Rings substituents etc
ndash ndashOH
ndash ndashNH2
ndash ndashCOOH
ndash phenyl
bull this can be done by
tracing appropriate
paths in the graph
bull subgraphs may overlap
OH
CH2
CHNH2
OH
O
Matching the query structure to the database
Two algorithms are commonly used:
The Morgan algorithm, which numbers the molecule uniquely, and
The Ullmann algorithm, which matches the fragments
Note – a number of 'substructural' fragments can be matched here (6 in all)
Read these for more details:
Gund P, Ann. Rep. in Med. Chem. 14, 299, 1979
Sheridan RP et al., J. Chem. Inf. Comp. Sci. 29, 255, 1989
Brint AT, Willett P, J. Mol. Graphics 5, 49, 1987
Ullmann JR, An Algorithm for Subgraph Isomorphism, Journal of the Association for Computing Machinery (JACM) 23, 31-42, 1976, http://portal.acm.org/citation.cfm?id=321925
Morgan HL, The generation of a unique machine description for chemical structures, Journal of Chemical Documentation 5, 107-113, 1965
The basic concept is that molecules are considered as graphs:
Atoms are 'nodes'
Bonds are 'edges'
A molecule is an example of a labelled graph: nodes can
have labels such as atomic number. The nodes are
connected by the edges (and these can also be labelled,
e.g. double bond)
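A labelled graph of this kind could be held in a small data structure such as the one below. This is a sketch only: the class name `MolGraph`, its methods and the acetic acid numbering are assumptions made for illustration, with hydrogens left implicit.

```python
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """Labelled graph: nodes carry an atomic number, edges carry a bond order."""
    atoms: dict = field(default_factory=dict)  # node index -> atomic number
    bonds: dict = field(default_factory=dict)  # frozenset({i, j}) -> bond order

    def add_atom(self, index, atomic_number):
        self.atoms[index] = atomic_number

    def add_bond(self, i, j, order=1):
        self.bonds[frozenset((i, j))] = order

    def neighbours(self, index):
        """Indices of the atoms bonded to the given atom."""
        return sorted(j for pair in self.bonds if index in pair for j in pair if j != index)

    def degree(self, index):
        return len(self.neighbours(index))


# Acetic acid heavy atoms: C0-C1(=O2)-O3 (hypothetical numbering).
acetic_acid = MolGraph()
for idx, z in [(0, 6), (1, 6), (2, 8), (3, 8)]:
    acetic_acid.add_atom(idx, z)
acetic_acid.add_bond(0, 1)
acetic_acid.add_bond(1, 2, order=2)  # edge label: double bond
acetic_acid.add_bond(1, 3)

print(acetic_acid.neighbours(1))  # [0, 2, 3]
print(acetic_acid.degree(1))      # 3
```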
The Morgan algorithm iteratively computes an integer label
for each node (atom) in a structure. For a computer the key
point is that the numbering is consistent (unlike humans) and
fast. The Morgan algorithm works as follows:
(i) Assign an integer label to each node, considering its
atomic number, degree (number of substituents) and types of
bonds;
(ii) Update each label by adding the labels of the adjacent
nodes (connected atoms); and
(iii) Repeat (ii) until the number of different labels does not
increase. Then order the nodes by the value of the labels.
H. L. Morgan, The generation of a unique machine description for chemical structures -
a technique developed at Chemical Abstracts Service, J. Chemical Documentation
5, 107–113, 1965
The first step - Numbering a molecule uniquely
A simple Morgan algorithm
• This is continued until the number of equivalence classes is unchanged
• Numbering starts from the highest label, then uses the next highest connection as 2, etc. Like an onion: concentric circles from the highest-numbered atom
• Rules are used to break ties, such as C before N, double bond before single, etc.
• Some equivalent atoms are arbitrarily numbered
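The iteration described above can be sketched as follows. This is a minimal sketch under stated assumptions: the initial label is simply the degree (atomic number and bond types are ignored), each update replaces a label by the sum of its neighbours' labels, and the labels from the last iteration that increased the number of classes are kept. The function name and the n-pentane example are illustrative and not from the lecture.

```python
def morgan_labels(adjacency):
    """Return Morgan-style connectivity labels for a graph given as node -> list of neighbours."""
    labels = {node: len(nbrs) for node, nbrs in adjacency.items()}  # step (i): degree
    n_classes = len(set(labels.values()))
    while True:
        # Step (ii): new label = sum of the labels of the adjacent nodes.
        new_labels = {node: sum(labels[n] for n in nbrs) for node, nbrs in adjacency.items()}
        new_n_classes = len(set(new_labels.values()))
        # Step (iii): stop as soon as the number of different labels no longer increases.
        if new_n_classes <= n_classes:
            return labels
        labels, n_classes = new_labels, new_n_classes


# n-Pentane carbon chain 1-2-3-4-5 (hypothetical numbering).
pentane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = morgan_labels(pentane)
print(labels)  # {1: 2, 2: 3, 3: 4, 4: 3, 5: 2}

# Order the atoms from the highest label outwards; ties are broken here by input index.
ranking = sorted(labels, key=lambda node: (-labels[node], node))
print(ranking)  # [3, 2, 4, 1, 5]
```

The central carbon receives the highest label, and the pairs 2/4 and 1/5 remain tied because they are symmetry-equivalent; in this sketch the tie is broken arbitrarily by the input index rather than by chemical rules such as atom type or bond order.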
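Urea
[Figure: urea skeleton with atoms numbered O1, C2, C3, C4; C4 is the central atom]
Adjacency matrix
     O1  C2  C3  C4
O1    0   0   0   1
C2    0   0   0   1
C3    0   0   0   1
C4    1   1   1   0
Cyclohexanone
[Figure: cyclohexanone with atoms numbered O1, C2 to C7]
Cyclohexanone adjacency matrix (atoms O1, C2 to C7): the O1 row has one entry, the C4 row (the carbonyl carbon) has three, and the rows for C2, C3, C5, C6 and C7 have two each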
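An adjacency matrix like the four-atom one above can be built from a bond list. This is a small sketch; the function name is an assumption, and the bond list simply encodes atoms 1, 2 and 3 each bonded to atom 4.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from 1-based (i, j) bond pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i - 1][j - 1] = 1
        matrix[j - 1][i - 1] = 1
    return matrix


# Atoms O1, C2, C3, C4 with C4 bonded to the other three.
matrix = adjacency_matrix(4, [(1, 4), (2, 4), (3, 4)])
for row in matrix:
    print(row)
# [0, 0, 0, 1]
# [0, 0, 0, 1]
# [0, 0, 0, 1]
# [1, 1, 1, 0]

# Row sums give the degree (number of bonded neighbours) of each atom.
degrees = [sum(row) for row in matrix]
print(degrees)  # [1, 1, 1, 3]
```

Numbering the same atoms in a different order permutes the rows and columns, which is why a consistent numbering is needed before two matrices can be compared directly.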
Numbering is crucial. Renumber for different starting points. Also,
the 'edges' of the graph are annotated to speed up the matching. This is the 'slow' part
of 'matching subgraph isomorphism' and belongs to the class of NP-complete problems.
The Ullmann algorithm is a way of detecting substructures
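To make the matching step concrete, here is a plain backtracking search for a query graph inside a target graph. It is not Ullmann's algorithm itself, which adds a matrix refinement step to prune candidates much more aggressively; it only illustrates the underlying subgraph matching under simple assumptions (element labels must agree, bond orders are ignored). All names and the example numbering are hypothetical.

```python
def find_substructure(query_atoms, query_bonds, target_atoms, target_bonds):
    """Return one mapping {query index: target index}, or None if the query is absent."""

    def neighbour_sets(bonds):
        nbrs = {}
        for i, j in bonds:
            nbrs.setdefault(i, set()).add(j)
            nbrs.setdefault(j, set()).add(i)
        return nbrs

    q_nbrs = neighbour_sets(query_bonds)
    t_nbrs = neighbour_sets(target_bonds)
    q_order = sorted(query_atoms)

    def extend(mapping):
        if len(mapping) == len(q_order):
            return dict(mapping)
        q = q_order[len(mapping)]
        for t in target_atoms:
            if t in mapping.values():
                continue  # each target atom may be used only once
            if query_atoms[q] != target_atoms[t]:
                continue  # node labels must agree
            if len(q_nbrs.get(q, ())) > len(t_nbrs.get(t, ())):
                continue  # target atom has too few neighbours
            # Every already-mapped neighbour of q must be bonded to t in the target.
            if all(mapping[n] in t_nbrs.get(t, ()) for n in q_nbrs.get(q, ()) if n in mapping):
                mapping[q] = t
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[q]  # backtrack
        return None

    return extend({})


# Target: atoms O1, C2, C3, C4 with C4 bonded to the other three. Query: a C-O pair.
target_atoms = {1: "O", 2: "C", 3: "C", 4: "C"}
target_bonds = [(1, 4), (2, 4), (3, 4)]
match = find_substructure({0: "C", 1: "O"}, [(0, 1)], target_atoms, target_bonds)
print(match)  # {0: 4, 1: 1}
```

The search first tries C2 and C3 for the query carbon, fails to find a bonded oxygen, backtracks, and succeeds with C4. In the worst case this kind of search grows exponentially with molecule size, which is why the cheap bitstring screen is applied first.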