1 Chemical Structure Representation and Search Systems
Lecture 5. Nov 13, 2003
John Barnard
Barnard Chemical Information Ltd
Chemical Informatics Software & Consultancy Services
Sheffield, UK
2 Lecture 5: Topics to be Covered
• Reaction searching
o atom-atom mapping
o Maximal Common Substructure search
• 3D substructure search
• Searching Markush structures in patents
o nature and origin of Markush structures
o fragment codes
o topological systems (MARPAT, Markush DARC)
3 Searching Chemical Reactions
each database entry contains several molecules
• reactants
• products
• catalysts
• solvents
• etc.
may want query substructure confined to one of these
• can be done by assigning role indicator to each molecule
but role indicators are not enough on their own for a useful reaction search system
4 Reaction search
Query: C=O (ketone in reactant) -> C-OH (hydroxyl in product)
5 Reaction search
Query: C=O (ketone in reactant) -> C-OH (hydroxyl in product)
“Hit”: [Figure: reaction scheme of the retrieved reaction]
We didn’t get what we wanted because the hydroxyl in the product did not involve the same oxygen as the ketone in the reactant
We need to “map” the atoms between the reactant and product
6 Atom mapping
atoms on each side of the reaction can be numbered to show which corresponds to which
• similar mappings can be used in the query
automatic assignment of atom mapping is very important in reaction indexing systems
• problem is obviously related to finding a graph isomorphism between reactant and product sides
• except that the two sides are NOT isomorphic
[Figure: the reaction scheme with every atom numbered (.1. to .12.) on both sides to show the atom-atom mapping]
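The effect of mapping on a reaction query can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each side of a reaction is reduced to a dictionary from atom-map number to a hand-assigned functional-group label, and the labels, the example data and the function names `unmapped_hit` and `mapped_hit` are all hypothetical rather than taken from any real reaction indexing system.

```python
# Hypothetical toy reaction: atom-map number -> label of the group that atom belongs to.
# Atom 8 is a hydroxyl oxygen on both sides; atom 11 is the ketone oxygen in the reactant.
reactant = {8: "hydroxyl O", 11: "ketone O"}
product = {8: "hydroxyl O", 11: "other O"}

# Query: a ketone oxygen in the reactant becomes a hydroxyl oxygen in the product.
query = ("ketone O", "hydroxyl O")


def unmapped_hit(reactant, product, query):
    """Match each side independently, ignoring which atom is which."""
    want_r, want_p = query
    return want_r in reactant.values() and want_p in product.values()


def mapped_hit(reactant, product, query):
    """Require the SAME mapped atom to satisfy both halves of the query."""
    want_r, want_p = query
    return any(
        role == want_r and product.get(num) == want_p
        for num, role in reactant.items()
    )


# A genuine ketone reduction, for comparison (also hypothetical data).
reduction_reactant = {1: "ketone O"}
reduction_product = {1: "hydroxyl O"}

print(unmapped_hit(reactant, product, query))  # unwanted hit without mapping
print(mapped_hit(reactant, product, query))    # rejected once atoms are mapped
print(mapped_hit(reduction_reactant, reduction_product, query))
```

Without the map numbers the toy reaction is retrieved even though no ketone is converted; with them, only the reduction matches.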
7 Maximal common subgraph
atoms and bonds in red represent the largest subgraph that is common to both sides
• all these atoms have same neighbours on both sides
• none of these bonds are made or broken
remaining atoms and bonds represent reaction site
[Figure: the atom-mapped reaction scheme (atoms .1. to .12.) with the maximal common subgraph highlighted in red]
8 Maximal common subgraph
Finding the MCS between two graphs is an NP-complete problem
• even worse than subgraph isomorphism because you don’t know in advance how big the subgraph will be
• exhaustive backtracking is prohibitively slow
• the best algorithms find an approximate solution (i.e. a large, but not necessarily maximal, subgraph)
• tricks can be used to determine an upper bound for the size of the MCS (so you can stop looking when you’ve found one of this size)
• new algorithm published 2002
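To make the idea of exhaustive backtracking with an upper bound concrete, here is a minimal Python sketch. It is not any of the published algorithms: it assumes molecules are given as a dictionary of atom labels plus a set of unlabelled bonds, it looks for a common induced subgraph that need not be connected, and the upper bound (shared element counts) is just one simple choice. The names `upper_bound`, `mcs_size` and the example molecules are illustrative.

```python
from collections import Counter


def upper_bound(labels_a, labels_b):
    """No common subgraph can use more atoms of an element than both sides have."""
    count_a, count_b = Counter(labels_a.values()), Counter(labels_b.values())
    return sum(min(count_a[e], count_b[e]) for e in count_a)


def mcs_size(labels_a, bonds_a, labels_b, bonds_b):
    """Number of atoms in the largest common induced subgraph (exhaustive search)."""
    atoms_a = list(labels_a)
    bound = upper_bound(labels_a, labels_b)
    best = 0

    def bonded(bonds, u, v):
        return frozenset((u, v)) in bonds

    def extend(i, mapping):
        nonlocal best
        best = max(best, len(mapping))
        if best == bound or i == len(atoms_a):
            return  # bound reached, or no atoms left to place
        if len(mapping) + len(atoms_a) - i <= best:
            return  # even mapping every remaining atom cannot improve on best
        a = atoms_a[i]
        for b in labels_b:
            if b in mapping.values() or labels_a[a] != labels_b[b]:
                continue
            # a and b must agree on bonding to every pair already mapped
            if all(bonded(bonds_a, a, a2) == bonded(bonds_b, b, b2)
                   for a2, b2 in mapping.items()):
                mapping[a] = b
                extend(i + 1, mapping)
                del mapping[a]
        extend(i + 1, mapping)  # alternative: leave atom a unmapped

    extend(0, {})
    return best


# Heavy-atom graphs: ethanol (C-C-O) and dimethyl ether (C-O-C)
ethanol = ({0: "C", 1: "C", 2: "O"}, {frozenset((0, 1)), frozenset((1, 2))})
ether = ({0: "C", 1: "O", 2: "C"}, {frozenset((0, 1)), frozenset((1, 2))})

print(upper_bound(ethanol[0], ether[0]))  # bound from element counts
print(mcs_size(*ethanol, *ether))         # only a C-O fragment is shared
```

For these two isomers the element-count bound is 3 atoms, but the largest common subgraph found is the 2-atom C-O fragment, which shows that the bound only limits the search and is not always attained.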
9 Applications of MCS
MCS algorithms can be applied to things other than atom-atom mapping in reactions
• structural similarity between molecules
o size of MCS (relative to size of molecules) can be used as measure of similarity of molecules
• approximate match searches
o search for molecules containing at least 80% of query substructure
• multiple maximal common substructure
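Both uses come down to simple ratios once the MCS size is known. The sketch below assumes sizes are counted in atoms and uses a Tanimoto-style normalisation for the similarity; that normalisation is one common choice and is an assumption here, not something the slides specify. The function names are hypothetical.

```python
def mcs_similarity(mcs_size, size_a, size_b):
    """Atoms in the MCS divided by atoms present in either molecule (Tanimoto-style)."""
    return mcs_size / (size_a + size_b - mcs_size)


def approximate_match(mcs_size, query_size, threshold=0.8):
    """True if the molecule contains at least `threshold` of the query substructure."""
    return mcs_size / query_size >= threshold


print(mcs_similarity(2, 3, 3))   # two 3-atom molecules sharing a 2-atom MCS
print(approximate_match(8, 10))  # 8 of 10 query atoms matched
print(approximate_match(7, 10))  # 7 of 10 query atoms matched
```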
10 Multiple MCS
largest substructure common to whole set of molecules
• can be used to extract “core” for a Markush structure
• might represent features important for biological activity
• even more difficult than MCS of two molecules
o unfortunately it doesn’t work to find MCS of first two, and then MCS between that and the third, etc.
11 3-D substructure search
Analogous to 2-D substructure search
• need to find atoms in correct spatial orientation relative to each other
o some fuzziness (tolerance) permitted in distance values
• query can be defined as a group of atoms, with specified interatomic distances
o sometimes called a pharmacophore
• both query and database structures can be shown as topological graphs in which the nodes are atoms, but the edges are interatomic distances
12 3-D substructure searching
the interatomic distances are the labels on the edges
graph is fully-connected (an edge between every pair of nodes)
the graph edges do not correspond to bonds in the molecule
matching is then a process of subgraph isomorphism between such graphs
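[Figure: fully-connected graph on four atoms (N, C, C, O) whose six edges are labelled with interatomic distances: 2.3 Å, 5.1 Å, 2.5 Å, 6.4 Å, 7.1 Å, 4.1 Å]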
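Matching on such distance-labelled graphs can be sketched with a brute-force search. This is a minimal illustration, not Ullmann's algorithm or any production method: it assumes the query is a list of element symbols with a target distance for every pair, the database structure is a list of atoms with 3D coordinates, and a single tolerance applies to all distances. The coordinates, distances and the function name `match_3d` are made up for the example.

```python
from itertools import permutations
from math import dist  # Python 3.8+


def match_3d(query_labels, query_dist, atoms, tol=0.2):
    """Return every assignment of query nodes to atoms that fits labels and distances.

    query_labels: element symbol for each query node
    query_dist:   {(i, j): distance in angstroms} for query node pairs
    atoms:        list of (element, (x, y, z))
    """
    hits = []
    for combo in permutations(range(len(atoms)), len(query_labels)):
        # node labels (elements) must agree
        if any(atoms[a][0] != query_labels[q] for q, a in enumerate(combo)):
            continue
        # every edge label (distance) must agree within the tolerance
        if all(abs(dist(atoms[combo[i]][1], atoms[combo[j]][1]) - d) <= tol
               for (i, j), d in query_dist.items()):
            hits.append(combo)
    return hits


# Hypothetical 3-point query: N...O 5.0, N...C 3.0, O...C 4.0 angstroms
query_labels = ["N", "O", "C"]
query_dist = {(0, 1): 5.0, (0, 2): 3.0, (1, 2): 4.0}

# Hypothetical structure with four atoms
atoms = [
    ("N", (0.0, 0.0, 0.0)),
    ("C", (3.0, 0.0, 0.0)),
    ("O", (3.0, 4.0, 0.0)),
    ("C", (10.0, 10.0, 10.0)),
]

print(match_3d(query_labels, query_dist, atoms))
```

Trying every permutation is only feasible for tiny examples; it is used here because it makes the two matching conditions (labels on nodes, distances on edges) easy to see.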
13 3D substructure searching
subgraph isomorphism involving fully-connected graphs is computationally more demanding than for 2D substructure search
• Ullmann’s algorithm performs well
• other approaches (e.g. clique detection) have also been used
fingerprint-like screening stages can also be applied in the search, based on 3D-fragments such as 3-point pharmacophores
• screens based on torsion and valence angles have also been used
Willett, P. Three-Dimensional Chemical Structure Handling. Wiley: New York (1991)
14 Chemical patents
Contract between inventor and State to encourage innovation
• Inventor reveals nature of invention
• State grants protected monopoly over its exploitation for limited period
Invention must be novel, useful and non-obvious
• new ways of making compounds
• new compounds with useful properties (therapeutic uses)
Essential for success of pharmaceutical industry
Knowledge of existing patents (prior art) essential to avoid fruitless development
15 Chemical patents
May claim single product or process
More usually claim class of products or processes to ensure protection for closely-related compounds etc.
Very broad claims can disguise true nature of invention
• But may claim compounds which lack claimed activity
• Nested series of claims (A, preferably B, more preferably C etc.) can provide “fallback” positions
Extremely broad claims have become more common as Patent Offices moved to publication before examination
• Sibley, J. F. “Too broad generic disclosures: a problem for all” J. Chem. Inf. Comput. Sci. 1991, 31 (1) 5-8
16 R1-X-R36
R1 is a substituted or unsubstituted, mono-, di- or polycyclic, aromatic or non-aromatic carbocyclic or heterocyclic ring system, or…
X is a single or double bond, substituted or unsubstituted heteroatom, or substituted carbon atom, or substituted or unsubstituted chain of two or more carbon atoms and/or heteroatoms…
R36 is substituted or unsubstituted asymmetrical heterocyclic ring system having at least 3 nitrogens…
[Structure 32 from Claim 105 of PCT Application 8704321, claimed as novel]
17 The patent explosion
Originally only granted patents published.
Belgium (1950s), Netherlands (1964) and EPO (1978) -> publishing all patent applications.
Rapid publication makes information available very quickly.
Huge number of patents, many low quality, insufficient or incorrect details, no novelty.
Less work for patent examiners but greater problems for retrieval systems.
18 Structural information in chemical patents
Uses mixture of:
• 2D structure diagrams
• linear formulae (e.g. “C2H5”, “EtOH”)
• specific nomenclature (e.g. “phenyl”, “isopropyl”)
• generic nomenclature (e.g. “alkyl”, “heteroaryl”)
• non-structural expressions (e.g. “pharmaceutically acceptable cation”, “group known in the art”)
Many machine readable systems just show structural information as free text and images
19 Specific Structures from Patents
Several databases contain specific molecules claimed in patents
• Chemical Abstracts Registry
• Derwent Registry
• MDL announced major new database Nov 2003
o will include reactions, molecules and Markush display
o http://www.mdl.com/company/news/press_releases/2003/pr_patentdb_07nov03.jsp
20 Markush Structures
also known as “Generic Structures” or “R-group Structures”
chemical structures involving variable parts
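[Figure: example Markush structure bearing OH and two variable groups, R1 and R2; one is defined as a list of halogens (Br, I, Cl) and the other as a list of alkyl chains, each alternative drawn with its attachment point (*)]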
21 Markush Structures
compact representation of a set or class of specific compounds with common structural features
used in
• chemical patents
• query structures in substructure search systems
• Quantitative Structure-Activity Relationship (QSAR) analysis
o class of related compounds with activity data
• combinatorial libraries
o rapid synthesis of large numbers of related compounds
• legislation (controlled drugs, chemical weapons)
22 Variability in Markush Structures
s-variation (substituent variation)
• list of alternative values for an R-group
p-variation (position variation)
• variable point of attachment
f-variation (frequency variation)
• multiple occurrence of groups
h-variation (homology variation)
• generically described group (e.g. “alkyl”)
• potentially infinite set of specific alternatives
23 Types of variation
substituent variation: R1 is methyl or ethyl
homology variation: R2 is alkyl
position variation: R3 is amino
frequency variation: m is 1-3
[Figure: example structure bearing OH, R1, R2, R3, (CH2)m and Cl]
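The finite kinds of variation can be expanded mechanically, which a short Python sketch makes visible. It takes the R1 and m definitions from this slide; the list of ring positions for R3 is a hypothetical stand-in because the figure is not available, and the homology-variant R2 ("alkyl") is deliberately left unexpanded because it has no finite list of values. The names used are illustrative.

```python
from itertools import product

r1_values = ["methyl", "ethyl"]  # substituent variation (from the slide)
m_values = [1, 2, 3]             # frequency variation: (CH2)m with m = 1-3
r3_positions = [2, 3, 4]         # position variation: hypothetical ring positions
r2_generic = "alkyl"             # homology variation: cannot be listed exhaustively


def enumerate_markush():
    """Yield one record per combination of the finite variations."""
    for r1, m, pos in product(r1_values, m_values, r3_positions):
        yield {
            "R1": r1,
            "chain": "CH2" * m,
            "R3": f"amino at position {pos}",
            "R2": r2_generic,  # still generic in every enumerated member
        }


members = list(enumerate_markush())
print(len(members))
print(members[0])
```

Even this small example multiplies out to 2 x 3 x 3 partially specified members, and each of them still covers an open-ended set of compounds through R2.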
24 Types of Markush structure
subst homol posn freq
Patents * * * *
Queries * (*) (*) (*)
QSAR * *
Libraries * (*) (*)
Legislation * * (*) (*)
25 Markush Structures
Compact representation for sets of molecules
• common parts shown once only
Can be considered as formal “grammar” for generating valid molecules (“sentences”)
Enumeration of coverage usually impractical and often impossible (infinite sets)
Appropriate algorithms for handling take advantage of Markush representation:
• Avoid enumeration (especially infinite sets)
• Compare finite grammars rather than infinite sets of valid sentences
26 Dr Eugene A. Markush
born Budapest, Hungary, c. 1888
migrated to USA, 1913 (Citizen, 1920)
Founded Pharma Chemical Corporation (NJ), 1919
Filed US patent 1506316 on pyrazolone dyes, 9 January 1924, using expression “where R is a group selected from ...” to circumvent USPTO “rule against ‘or’ ”
died New York, 21 April 1968
27 Markush storage and retrieval
Early systems (1950s, 1960s) developed in-house by pharmaceutical companies/consortiums
High costs of patent abstracting and technical difficulties with automation shifted development to specialist companies
Fragmentation code systems superseded by topological (structure graphics) systems
28 Fragmentation Codes
Structural features (ring systems, functional groups, etc.) used as indexing terms
Structural relationships usually lost
• all alternatives tend to be “over-coded”
• retrieved structures include many “false drops” (“ballast”)
Codes originally assigned manually
• Now usually generated (semi-)automatically from graphical input
• Queries also generated automatically
Some codes use “closed” set of terms (periodically revised)
Others are “open-ended”
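How over-coding produces false drops can be shown with a toy example. The sketch below is not any real fragmentation code: the index terms, the Markush definition and the function names are hypothetical, and the "true" check is itself simplified to term sets for a single choice of alternatives.

```python
from itertools import product

# Hypothetical Markush: a benzene scaffold with ONE position R1 = chloro / hydroxy
scaffold_terms = {"benzene ring"}
rgroups = {"R1": [{"chloro"}, {"hydroxy"}]}


def fragment_code(scaffold, rgroups):
    """Index the Markush by the union of terms from ALL alternatives (over-coding)."""
    terms = set(scaffold)
    for alternatives in rgroups.values():
        for alt in alternatives:
            terms |= alt
    return terms


def fragment_search(query_terms, code):
    """A record is retrieved if it carries every query term."""
    return query_terms <= code


def covered(query_terms, scaffold, rgroups):
    """Does any single choice of alternatives carry every query term?"""
    for choice in product(*rgroups.values()):
        if query_terms <= set(scaffold).union(*choice):
            return True
    return False


code = fragment_code(scaffold_terms, rgroups)
query = {"benzene ring", "chloro", "hydroxy"}  # a chlorophenol-type query

print(fragment_search(query, code))             # retrieved by the code
print(covered(query, scaffold_terms, rgroups))  # but no single compound has both
```

The query is retrieved because chloro and hydroxy both appear in the code, even though they are alternatives at the same position and never occur together in a covered compound.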
29 Fragmentation Codes
Derwent World Patent Index Chemical Code
• Closed code with about one thousand terms
• Large comprehensive backfile (from early 1960s)
• Available for online searching (Questel)
IFI/Plenum Code
• Open-ended code
• Used for “CLAIMS” database (U.S. patents)
• Available for online searching (STN)
o no graphical interface
30 Fragmentation Codes
GREMAS code
• Very sophisticated open-ended code
• Private collaboration between (mainly) German pharmaceutical companies
• Good retrieval performance
• Input discontinued in early 1990s
• Backfile (from 1950s) still searched at a few companies
31 Graphical (“topological”) systems
Development started in early 1980s
Intended to supplement graphical substructure search systems for specific structures
• MACCS, CAS Online, DARC, etc.
User draws graphical (sub)structure query
System displays graphical Markush structure hits
Two commercial systems implemented
• available for online searching only
• each with its own database
• no “in-house” systems or databases
32 Markush DARC
Joint development of
• Questel SA (software and online host)
• Derwent Information Ltd (WPIM database)
• INPI (French Patent Office) (PHARMSEARCH database)
Integrated database (“Merged Markush File”) now available
• http://www.inpi.fr/inpi/mms/index.htm
• Extension forwards (Derwent) and backwards (INPI)
33 MARPAT
software and database from Chemical Abstracts Service
available online via STN International
• http://www.cas.org/CASFILES/marpat.html
integrated with CA Registry database of specific compounds
Proposal to allow Derwent database to be searched with MARPAT software dropped in mid 1990s for commercial reasons
34 The Markush Problem
Representation
• Mixture of structures and text
• Generic (h-variant) expressions
• Vagueness (“where by X we mean…”)
Search
• The “translation” problem
o Specific groups (e.g. tert. butyl) must be matched against generic expressions (e.g. 1-6C alkyl)
• The “segmentation” problem
o Boundaries between scaffold and R-groups may not coincide in query and database structures
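35 Matching Markush Structures
Translation and Segmentation problems coincide to make it difficult to spot matching structures
[Figure: two Markush structures drawn with different scaffold/R-group boundaries; the R-group definitions shown include R1 = alkyl, R1 = t-butyl / cycloalkyl, R4 = cycloalkyl and isopropyl, with further alternatives such as NH2, O and S]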
36 Sheffield University Research
Extended project (1979-1994) on Markush structure storage and retrieval
• designed external (GENSAL) and internal (ECTR) storage formats
o parameter lists for homology-variant groups
• developed novel matching algorithms based around graph isomorphism
o “reduced graph” concept
• influenced development of commercial systems
o independent work also done at CAS, Derwent and Questel
Downs and Barnard, J. Documentation, 1998, 54 (1), 106-120
37 GENSAL
formalised version of language used in patent specifications
design analogous to programming language
lexical elements include
• structure diagrams
• specific and generic chemical nomenclature
• substitution operators
• position/multiplicity values
GENSAL Interpreter program (compiler) generates internal representation based on “partial” connection tables with links between them
38 GENSAL example
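[Figure: structure diagram with variable groups R1 and R2]
R1 = H / alkyl <1-4>;
R2 = F / Cl ;
R1 + R2 = SD [Figure: structure diagram containing R3, O and two attachment points (*)] ;
R3 = phenyl OSB <1-2> Cl;
IF R2 = Cl THEN R1 = H.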
39 Parameter Lists
Represent generic (“homology-variant”) expressions by set of permitted numerical ranges for structural parameters
e.g. “alkyl”:
• 1-n carbon atoms
• 0 heteroatoms
• 0 double or triple bonds
• 0-n branch points
• 0 rings
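A parameter list of this kind is easy to model as a set of numeric ranges, and matching a specific group against it is then a range check. The sketch below is an assumption-laden illustration rather than the ECTR format: the dictionary keys, the use of `None` for an open upper limit ("n") and the hand-counted parameters of the example groups are all choices made here.

```python
# "alkyl" as on the slide; None stands for the open upper limit "n"
ALKYL = {
    "carbons": (1, None),
    "heteroatoms": (0, 0),
    "multiple_bonds": (0, 0),
    "branch_points": (0, None),
    "rings": (0, 0),
}


def in_range(value, bounds):
    low, high = bounds
    return value >= low and (high is None or value <= high)


def matches(specific, generic):
    """True if every parameter of the specific group lies in the generic range."""
    return all(in_range(specific[name], bounds) for name, bounds in generic.items())


# "1-6C alkyl" narrows only the carbon count
alkyl_1_6 = dict(ALKYL, carbons=(1, 6))

# Specific groups, with parameters counted by hand
tert_butyl = {"carbons": 4, "heteroatoms": 0, "multiple_bonds": 0,
              "branch_points": 1, "rings": 0}
cyclohexyl = {"carbons": 6, "heteroatoms": 0, "multiple_bonds": 0,
              "branch_points": 0, "rings": 1}

print(matches(tert_butyl, alkyl_1_6))  # tert-butyl is a 1-6C alkyl
print(matches(cyclohexyl, alkyl_1_6))  # cyclohexyl is not (it has a ring)
```

This is the "translation" step in miniature: a specific group such as tert-butyl is compared with a generic expression without enumerating every alkyl group the expression covers.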
40 Reduced Graphs
connected groups of atoms “collapsed” to form a single node of the reduced graph
• atoms in the same ring system (R)
• optionally branched carbon chains (C)
• connected acyclic heteroatoms (Z)
[Figure: example molecule and its reduced graph, with nodes labelled Z 3, R 9, C 2, Z 1 and Z 1]
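The collapsing step can be sketched for a heavy-atom graph as follows. This is a simplified reading of the three rules above, not the Sheffield implementation: a bond is treated as a ring bond if its two atoms stay connected when it is removed, ring atoms joined by ring bonds form an R node, connected acyclic carbons form a C node, and connected acyclic non-carbon atoms form a Z node. The function name and the input format are assumptions.

```python
from collections import defaultdict


def reduced_graph(elements, bonds):
    """elements: {atom: symbol}; bonds: list of (u, v). Returns (nodes, edges)."""
    adj = defaultdict(set)
    for u, v in bonds:
        adj[u].add(v)
        adj[v].add(u)

    def still_connected(u, v):
        """Can v be reached from u without using the direct bond u-v?"""
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if (x == u and y == v) or y in seen:
                    continue
                if y == v:
                    return True
                seen.add(y)
                stack.append(y)
        return False

    ring_bonds = {frozenset(b) for b in bonds if still_connected(*b)}
    ring_atoms = {a for b in ring_bonds for a in b}

    def node_type(a):
        if a in ring_atoms:
            return "R"
        return "C" if elements[a] == "C" else "Z"

    def same_node(u, v):
        if node_type(u) != node_type(v):
            return False
        return node_type(u) != "R" or frozenset((u, v)) in ring_bonds

    node_of, nodes = {}, []
    for start in elements:
        if start in node_of:
            continue
        index, size, stack = len(nodes), 0, [start]
        node_of[start] = index
        while stack:  # flood-fill all atoms that collapse into this node
            x = stack.pop()
            size += 1
            for y in adj[x]:
                if y not in node_of and same_node(x, y):
                    node_of[y] = index
                    stack.append(y)
        nodes.append((node_type(start), size))
    edges = {frozenset((node_of[u], node_of[v]))
             for u, v in bonds if node_of[u] != node_of[v]}
    return nodes, edges


# 4-ethylphenol, heavy atoms only: ring 0-5, O on atom 0, ethyl on atom 3
elements = {0: "C", 1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "O", 7: "C", 8: "C"}
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (3, 7), (7, 8)]

nodes, edges = reduced_graph(elements, bonds)
print(nodes)
print(edges)
```

For 4-ethylphenol this gives three nodes (a six-atom R node, a one-atom Z node and a two-atom C node) with the ring node joined to each of the other two, in the same "type plus atom count" style as the labels in the figure.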
41 Reduced Graphs
boundaries between nodes are non-arbitrary
• thus provides solution to segmentation problem
each node can be described by a parameter list
homology-variant groups can also be represented as reduced graph nodes with parameter lists
• thus provides solution to translation problem:
o first identify isomorphism between reduced graphs
o if parameter lists match can do atom-by-atom match on original atoms in specific groups, if necessary
[Figure: example reduced graph whose R and C nodes carry parameter lists (atom counts such as C 8, C 2, R 6, R 5, N 1, O 2)]
42 Design of Commercial Systems
Sheffield system never implemented commercially
Ideas incorporated into both Markush DARC and MARPAT
• also used by BCI Ltd. in various projects
Other ideas developed independently
• both systems have patent protection
Basic concepts parallel those developed at Sheffield
• Barnard, J. M. “A comparison of different approaches to Markush structure handling” JCICS, 1991, 31 (1), 64-67
• Berks, A. “Current state of the art of Markush topological search systems”, World Patent Information, 2001, 23, 5-13
43 Markush DARC
Specific groups shown as structure diagrams
• Rather clunky display (one R-group at a time)
Generic groups shown as “superatoms”
• e.g. CHK = alkyl, HEF = fused heterocycle
• qualitative attributes used in searching
• quantitative parameters (texnotes) available for display
reduced graph concepts used in atom-by-atom search stage
44 Markush DARC Display
45 MARPAT
Part of CASLink substructure search system on STN
Input and display uses text and graphics
• similar to GENSAL
Generic Group Nodes with quantitative attributes (not fully implemented for search)
46 MARPAT Generic Group Nodes
R    any group
Cy   cyclic group
Ak   carbon chain
Q    heteroatom
Cb   carbocycle
Hy   heterocycle
X    halogen
M    metal
GGN definitions imply reduced graph concept
“Spin-off” GGNs generated for specific groups to allow specific-generic matching (“translation”)
47 MARPAT Display
MSTR 1
G1 = N, CH
G2 = H, X, SC,Cl
DER: or acid addition salts
MPL: Claim 1
48 Conclusions from Lecture 5
Chemical reaction search requires atom-atom mapping between reactant and product
• Maximal Common Subgraph algorithms can be used
3D substructure search uses interatomic distances as edge labels in fully-connected graphs
Markush structures pose particular problems to structure search systems
• extremely broad classes
• homology-variant (generic) expressions
• segmentation between R-groups
Two publicly-available Markush search systems for chemical patents
• Markush DARC and MARPAT
49 Further Reading
Chen, L.; Nourse, J. G.; Christie, B. D.; Leland, B. A.; Grier, D. L. “Over 20 years of reaction access from MDL: a novel reaction substructure search system”. J. Chem. Inf. Comput. Sci. 2002, 42, 1296-1310.
“Representation and manipulation of 3D molecular structures”. Chapter 2 (pp. 27-52) in A. R. Leach and V. J. Gillet, An Introduction to Chemoinformatics, Dordrecht: Kluwer, 2003
Berks, A. H. “Current state of the art of Markush topological search systems”. In J. Gasteiger (ed.) Handbook of Chemoinformatics: From Data to Knowledge, Vol 2, pp. 885-903, Wiley-VCH, 2003
50 Lecture 6: Topics to be Covered
Similarity searching
• similarity search vs. substructure search
• similarity and distance metrics
• different types of descriptor for similarity search
• choice of descriptors
The drug discovery process