www.digitalchemistry.co.uk
Representing Markush Structures from Patents and Combinatorial Libraries
Dr John M. Barnard, Scientific Director, Digital Chemistry Ltd., UK
Presented at EMBL-EBI Industry Programme Workshop on Chemical Registry Systems
Hinxton, Cambridge, 10/11 October 2011
Entire presentation Copyright © Digital Chemistry Ltd., 2011
2
Outline
• What are Markush structures?
– where do they occur?
– what types of variability do they include?
• Existing representation formats
– internal and external
• Canonicalization issues
• Proposed InChI Extensions
3
Markush structures
Dr Eugene Markush (1887-1968) was involved in a legal wrangle with the US Patent Office in 1924
• Classes of molecule with common structural features
– represent sets of individual specific molecules (from a handful to many billions, or even infinite numbers)
– best known in chemical patents
– can also be used to represent combinatorial libraries, and in other contexts
4
Markush structure enumeration
Specific structures can be generated by combinatorial assembly of alternatives for each R-group
[Figure: scaffold bearing OH, R1 and R2, with R1 = CH3 / CH2CH3 / CH2CH2CH3 and R2 = F / Cl / Br / I, followed by the enumerated specific structures (OH with each combination of R1 and R2 alternatives), etc.]
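The combinatorial assembly described on this slide can be sketched in a few lines of Python. This is only an illustration: the R-group lists are an interpretation of the labels in the slide's figure, the alternatives are kept as plain text labels rather than real structures, and the name `enumerate_markush` is hypothetical.

```python
from itertools import product

# R-group alternatives as read from the slide's figure (assumed interpretation).
R1 = ["CH3", "CH2CH3", "CH2CH2CH3"]
R2 = ["F", "Cl", "Br", "I"]

def enumerate_markush(r_groups):
    """Yield one dict per specific structure: every combination of alternatives."""
    names = list(r_groups)
    for combo in product(*(r_groups[name] for name in names)):
        yield dict(zip(names, combo))

specifics = list(enumerate_markush({"R1": R1, "R2": R2}))
print(len(specifics))   # 3 alternatives x 4 alternatives = 12
print(specifics[0])     # {'R1': 'CH3', 'R2': 'F'}
```

The number of specific structures is the product of the list lengths, which is why a Markush with several long R-group lists can cover billions of molecules.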
5
Variability in Markush Structures
• substituent (s-) variation: list of specific alternatives, which can be expressed in many different ways
• position (p-) variation: variable point of attachment
• frequency (f-) variation: multiple occurrence of groups
• homology (h-) variation: generically-described group (e.g. “alkyl”), potentially infinite number of alternatives
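[Figure: example Markush structure; the scaffold diagram (atoms shown: CH3, OH, N, N, O, a CH2 group repeated M times, and attachment points for R1 and R3) is not reproduced. Definitions given on the slide:]
R1 = propyl / CH3CH(Br)CH3 / OMe / *CH2-R2
R2 = F
R3 = alkyl C1-6
M = 2-5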
Four main ways in which variability can be shown in Markush structures
6
Markush structure representation
• Generally extensions of methods used for specific structures
– connection tables, line notations etc.
– distinction between internal (processing) and external (storage and exchange) formats
– special provision for homology variation
• Variety of proprietary formats, some published
– many with limited capabilities
• No accepted standard
7
Markush representations – pre 1980
• Mainly based on fragmentation codes
– Derwent WPI code
– IFI/Plenum CLAIMS
– GREMAS code (German-based consortium)
• Some work on extensions to traditional line notations
– Hayward notation
– Wiswesser Line Notation
– ALWIN (ALgorithmically extended WIswesser Notation)
• BASF connection table-based system (E. Meyer et al.)
– designed in late 1950s and fully operational by 1965
Incomplete, ambiguous representations
8
Sheffield University Project (1979-94)
• Long-running project on patent Markush storage and retrieval, directed by Prof Michael Lynch
• Early part of project concentrated on structure representation
– GENSAL (input and display format)
– parameter lists
• representation of homology-variant groups
– Extended Connection Table Representation (ECTR)
• internal (in-memory) format for processing
• complex AND/OR tree of partial connection tables with links showing logical structure of Markush
9
GENSAL
Formal language (analogous to programming language, with fully-defined syntax) formalising Markush descriptions in Derwent abstracts
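[Figure: annotated GENSAL example. A Markush scaffold (diagram showing O, R1, R2 and a variable-position attachment for R3; not reproduced) is followed by “statements” that define the R-group variables:]
R1 = H / alkyl <1-4>;
R2 = F / Cl;
R1 + R2 = SD [structure diagram not reproduced];
R3 = phenyl OSB [2,4,6] <1-2> Cl;
IF R2 = Cl THEN R1 = H.
Annotations on the example:
– definition using nomenclature, with a parameter list (e.g. alkyl <1-4>)
– definition using structure diagram (SD)
– combined definition of R1 and R2 (forming fused ring)
– OSB: “optionally substituted by”, with [2,4,6] the positions of substituents and <1-2> the number of substituents
– conditional definition (IF ... THEN)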
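The effect of the conditional statement in the example can be illustrated outside GENSAL itself. The following Python sketch is not a GENSAL parser: it treats the R1 and R2 alternatives as opaque labels and filters the combinations with the rule IF R2 = Cl THEN R1 = H. The function name `allowed` is hypothetical.

```python
from itertools import product

R1 = ["H", "alkyl<1-4>"]   # alternatives kept as opaque labels
R2 = ["F", "Cl"]

def allowed(r1, r2):
    """Mirror 'IF R2 = Cl THEN R1 = H': the rule only constrains the Cl case."""
    return r2 != "Cl" or r1 == "H"

combos = [(r1, r2) for r1, r2 in product(R1, R2) if allowed(r1, r2)]
print(combos)   # [('H', 'F'), ('H', 'Cl'), ('alkyl<1-4>', 'F')]
```

Without the condition there would be four combinations; the rule removes the one in which R2 is Cl and R1 is an alkyl group.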
10
Parameter lists
Represent homology-variant expressions by a set of permitted numerical ranges for structural parameters
e.g. “alkyl”: 1-n carbon atoms, 0 heteroatoms, 0 double bonds, 0 triple bonds, 0-n branch points, 0 rings
Original GENSAL parameters
C carbon count
Z heteroatom count
E double bond count
Y triple bond count
Q quaternary branch points
T ternary branch points
RC number of rings
RN number of ring atoms
RF number of ring fusions
RA number of aromatic rings
Examples
5-10C unbranched alkyl: C<5-10> Z<0> E<0> Y<0> Q<0> T<0> RC<0>
optionally-aromatic fused heterocycle: Z<1-> RC<2-> RA<0->
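A parameter list of this kind maps naturally onto a table of numeric ranges. The sketch below is one possible in-memory form, not the GENSAL implementation: the class name `ParameterList`, the `covers` method, and the convention that an unlisted parameter is unconstrained are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Range = Tuple[int, Optional[int]]   # (minimum, maximum); maximum None = open-ended

@dataclass
class ParameterList:
    ranges: Dict[str, Range]

    def covers(self, counts: Dict[str, int]) -> bool:
        """True if every constrained parameter of a specific group is in range.

        Parameters missing from `counts` are taken to be 0.
        """
        for key, (low, high) in self.ranges.items():
            value = counts.get(key, 0)
            if value < low or (high is not None and value > high):
                return False
        return True

# 5-10C unbranched alkyl: C<5-10> Z<0> E<0> Y<0> Q<0> T<0> RC<0>
unbranched_alkyl = ParameterList({
    "C": (5, 10), "Z": (0, 0), "E": (0, 0), "Y": (0, 0),
    "Q": (0, 0), "T": (0, 0), "RC": (0, 0),
})
# optionally-aromatic fused heterocycle: Z<1-> RC<2-> RA<0->
fused_heterocycle = ParameterList({"Z": (1, None), "RC": (2, None), "RA": (0, None)})

print(unbranched_alkyl.covers({"C": 6}))             # True: six carbons, no branching
print(unbranched_alkyl.covers({"C": 6, "T": 1}))     # False: one ternary branch point
print(fused_heterocycle.covers({"Z": 2, "RC": 2}))   # True
```

Testing whether a specific substituent falls within a homology-variant group then reduces to a range check on its parameter counts.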
11
Markush DARC (1988-)
• First commercial Markush structure system
• Joint development by Questel (software), Derwent and INPI (French Patent Office) (database)
– MMS (Merged Markush Service) database now owned by Thomson Reuters
• Proprietary format for data storage and exchange
– VMN files (binary connection table)
– AMN file (associated text file with parameter list data)
– XML version of VMN file also exists, though it involves significant data loss on orientation of R-groups with two or more attachments
12
Markush DARC Superatoms
• Used to represent homology-variant groups
• Set of 22 predefined groups with mnemonic names
– some represent enumerated lists of elements e.g. HAL
– some represent classes e.g. CHK (alkyl), HEA (heteroaryl)
– some represent structurally-undefined groups e.g. DYE
• Can be qualified by
– attributes (qualitative)
– parameters (quantitative – comparable to GENSAL)
13
Markush DARC Superatoms
14
MARPAT (1991-)
• Chemical Abstracts Service's competitor to Markush DARC
• Proprietary software and database
• Input and display format similar to GENSAL
• Internal representation is extension of CAS specific structure format
15
MARPAT Generic Group Nodes
• Hierarchical set of special atom types used to represent homology-variant groups
• Can be qualified by
– categories (cf Markush DARC attributes)
– attributes (cf Markush DARC parameters)
16
MDL RGfiles
• A flavour of Molfile
– text-formatted connection table
– various versions
– proprietary to Accelrys (formerly Symyx, MDL)
• Really intended for R-group queries, but widely used for Markush structures
• Significant limitations
– substituent-variation only
– limit of two fixed-position connections for each R-group
17
SMILES and Extensions
• Daylight's original SMILES can only represent complete molecules
– [*] atoms can be used as dummy atoms (for R-groups and attachment points), and given “isotope” labels
Oc1c([1*])c([2*])ccc1
• Digital Chemistry pioneered use of “pseudo” ring closures to assemble complete molecules from Markush building blocks
[Figure: scaffold bearing OH, R1 and R2]
Oc1c%11c%12ccc1 . C%11 . F%12
Oc1c%11c%12ccc1 . CC%11 . F%12
Oc1c%11c%12ccc1 . C%11 . Br%12
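The “pseudo” ring-closure strings on this slide can be produced by simple string assembly. The sketch below is an illustration rather than Digital Chemistry's implementation: the function name `assemble` is hypothetical, it assumes each fragment SMILES ends with its attachment atom written as a plain atom symbol, and it writes the result without the spaces that the slide places around the dots.

```python
def assemble(scaffold, substituents):
    """Join a scaffold and R-group fragments into one dot-separated SMILES.

    `substituents` maps a two-digit ring-closure number (already present
    in the scaffold as %nn) to a fragment SMILES. The closure label is
    appended to the fragment, so it bonds to the fragment's last atom.
    """
    parts = [scaffold]
    for closure, fragment in sorted(substituents.items()):
        if not 10 <= closure <= 99:
            raise ValueError("the %nn form needs a two-digit closure number")
        parts.append(f"{fragment}%{closure}")
    return ".".join(parts)

SCAFFOLD = "Oc1c%11c%12ccc1"   # %11 marks the R1 position, %12 the R2 position

for r1, r2 in [("C", "F"), ("CC", "F"), ("C", "Br")]:
    print(assemble(SCAFFOLD, {11: r1, 12: r2}))
# Oc1c%11c%12ccc1.C%11.F%12
# Oc1c%11c%12ccc1.CC%11.F%12
# Oc1c%11c%12ccc1.C%11.Br%12
```

A SMILES parser treats each matching pair of %nn labels as a bond, so the dot-separated parts are read as a single connected molecule.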
18
SMILES and Extensions
• Many vendors have added their own non-standard extensions to SMILES to show R-groups and attachment points in Markush structures
– limited consistency between vendors/parsers
• Daylight developed their own CHUCKLES and CHORTLES notations
– primarily for peptide libraries
• “Open SMILES” project could provide forum for agreeing “standard” extensions
– has got bogged down in other issues
• There are issues in representing incomplete aromatic rings, and potentially aromatic rings
19
Other Line Notations
• Sybyl Line Notation (SLN)
– similar to extended SMILES
– developed by Tripos
– designed to show combinatorial libraries
• ROSDAL
– developed by Beilstein Institute
– primarily a query language
– has some capabilities for showing homology variation
7G1,6-1=-6—9.8=10O;G1=(1&1-2&2;3&1=4&2;5O&2-=11-6,9-12&1,8-14Cl,10-13Cl).
20
XML, CML etc.
• Various XML-ifications of pre-existing formats have been promoted
– usually just put the original format in an XML wrapper
• CML is pre-eminent among “proper” XML formats for chemistry
– has not yet achieved wide acceptance as a standard
– latest version (2.4) has extensions able to handle polymer repeating units, but no real Markush capabilities
• Digital Chemistry has done some design work on an XML format for Markush structure exchange
– not published or fully implemented
21
Markush Canonicalisation
• Canonicalisation involves putting a chemical representation into a unique “correct” form
– applying business rules
– renumbering atoms into canonical order
• This becomes a bit more complicated when it comes to Markush structures, which represent “sets” of specific molecules
– obvious rule is that Markushes are equivalent when they cover the same set of specific molecules
– but...
22
Equivalent Markushes?
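[Figure: two Markush structures compared. First: scaffold bearing OH, R2 and a CH2 group attached to R1, with R1 = H / CH3 / CH2CH3 and R2 = F / Br / Cl / I. Second: scaffold bearing OH, R1 and R2, with R1 = CH3 / CH2CH3 / CH2CH2CH3 and R2 = F / I / Cl / Br.]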
The “segmentation problem”
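The rule that two Markushes are equivalent when they cover the same set of specific molecules can be checked directly for small cases by enumerating both. The sketch below is a toy under stated assumptions: the two structures follow a reading of the slide's diagrams (one keeps a CH2 in the scaffold, the other puts it in R1), and an unbranched carbon chain length stands in for a canonical identifier of each specific substituent. The names `markush_a`, `markush_b` and `covered_set` are hypothetical.

```python
from itertools import product

HALOGENS = ["F", "Cl", "Br", "I"]

# R1 alternatives are recorded as carbon counts of an unbranched chain.
# Markush A: CH2 belongs to the scaffold; R1 = H / CH3 / CH2CH3
markush_a = {"scaffold_carbons": 1, "R1": [0, 1, 2], "R2": HALOGENS}
# Markush B: no side-chain carbon in the scaffold; R1 = CH3 / CH2CH3 / CH2CH2CH3
markush_b = {"scaffold_carbons": 0, "R1": [1, 2, 3], "R2": HALOGENS}

def covered_set(markush):
    """Enumerate the specific molecules as (side-chain length, halogen) pairs."""
    return {
        (markush["scaffold_carbons"] + r1, r2)
        for r1, r2 in product(markush["R1"], markush["R2"])
    }

print(markush_a["R1"] == markush_b["R1"])                 # False: written differently
print(covered_set(markush_a) == covered_set(markush_b))   # True: same molecules covered
```

The building blocks differ, yet the enumerated sets are identical, which is why a canonical form built from the building blocks alone depends on where the scaffold/R-group boundary is drawn.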
23
Equivalent Markushes?
“Extensional” vs. “Intensional” representation
[Figure: two Markush structures on the same scaffold (bearing OH, R1 and R2). Extensional: R1 given as an enumerated list of drawn alkyl groups, and R2 = F / Cl / Br / I. Intensional: R1 = C1-4 alkyl and R2 = hal.]
24
Business Rules and Tautomers
… may be difficult to apply in a Markush structure
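[Figure: two tautomeric forms of a ring (atoms shown: N-H with O, and N with O-H)] The preferred tautomeric form...
Aromaticity detection may also be an issue
[Figure: ring containing N and O and bearing R1, with R1 = H, CH3]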
25
Markush Canonicalisation
• Canonicalising an individual building block (scaffold or R-group alternative) is relatively simple
– “dummy atoms” for attachment points and R-groups
• Could define a sequence for R-group alternatives
– alphanumeric sequence of canonical representations
• Could define a sequence for R-groups
– non-arbitrary R-group labels
Would give you a “canonical Markush” but dependent on arbitrary segmentation (boundaries between scaffold and R-groups) and with limitations on applicability of business rules.
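The ordering ideas on this slide can be sketched as follows, assuming every building block has already been canonicalised individually. The function name `canonical_form` and the output layout are hypothetical, and the sketch keeps the existing R-group labels instead of deriving non-arbitrary ones.

```python
def canonical_form(scaffold, r_groups):
    """Order R-groups by label and their alternatives alphanumerically.

    Assumes the scaffold and each alternative are already canonical
    strings (here SMILES with [n*] dummy atoms for attachment points).
    Labels are compared as plain strings, so R10 would sort before R2.
    """
    parts = [scaffold]
    for label in sorted(r_groups):
        alternatives = sorted(set(r_groups[label]))   # drop duplicates, fix order
        parts.append(f"{label}=" + "/".join(alternatives))
    return ";".join(parts)

SCAFFOLD = "Oc1c([1*])c([2*])ccc1"
a = canonical_form(SCAFFOLD, {"R2": ["F", "I", "Cl", "Br"], "R1": ["CC", "C", "CCC"]})
b = canonical_form(SCAFFOLD, {"R1": ["C", "CC", "CCC"], "R2": ["F", "Cl", "Br", "I"]})
print(a)        # Oc1c([1*])c([2*])ccc1;R1=C/CC/CCC;R2=Br/Cl/F/I
print(a == b)   # True
```

This removes differences in the order in which alternatives are listed, but two Markushes that are segmented differently still produce different strings.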
26
Canonicalisation of homology variation
• Problem here is defining what is to be represented
– canonicalising a parameter list would be fairly simple
• Different existing systems have subtly different representations
– Markush DARC superatoms/attributes/textnote parameters
– MARPAT generic group nodes/categories/attributes
– GENSAL parameter lists
Any standard canonical representation would effectively have to impose its own choice of basic representation, which ideally would be a superset of everyone else's
27
InChI generic structure extensions
• InChI is now well-established as a canonical representation standard for specific molecules
– unique alphanumeric string identifier
– open-source software for generation
• Working party has been looking at extending standard and software to handle Markush structures
– InChI Trust has approved proposals from Digital Chemistry Ltd for staged implementation
– currently awaiting allocation of funding
28
InChI generic structure extensions
1. InChIs for groups with external attachments
- InChI for “methyl” etc. as distinct from methane or radical
2. Assembly of InChIs into Markush structure
- arbitrary segmentation means little point in canonicalising this assembly
3. Additional types of variability
- several stages
[Figure: example scaffolds bearing OH and R1, with R1 alternatives drawn as CH3 and CH2CH3 fragments]