Dragos Horvath Laboratoire d’InfoChimie, UMR 7177 CNRS – Université de Strasbourg
description
Transcript of Dragos Horvath Laboratoire d’InfoChimie, UMR 7177 CNRS – Université de Strasbourg
Ligand-Based Virtual Screening:
Extraction of Knowledge from Experimentally Confirmed Ligands, and the Quest for other Candidates Matching the
Known.
Dragos HorvathLaboratoire d’InfoChimie, UMR 7177
CNRS – Université de Strasbourg6700 Strasbourg, France
1. « Ceci n’est pas une molécule »
Molecular Models and Descriptors in Chemoinformatics: Numerical Encoding of Structural Information
Molecular Descriptors or Fingerprints
• Need to represent a structure by a characteristic bunch (vector) of numbers (descriptors).– Example: (Molecular Mass, Number of N Atoms, Total Charge,
Number of Aromatic Rings, Radius of Gyration)• Should include property-relevant aspects:
– the “nature” of atoms, including information on their neighbor-hood-induced properties, and their relative arrangement.
– Number of N Atoms (Primary Amino Groups, Secondary Amino Groups, … , … , Amide, … , Pyridine N, …)
– … unless being a H bond acceptor is the key (O or N alike)!– Arrangement in space (3D, conformation-dependent distances in
Å) or in the molecular graph (2D, topological distance = separating bond count)
2 0…
1 0…
Example 1: ISIDA Sequence Counts
O-C*C*C*C-N 1O-C*N*C*C-N 1…
(1,1,…) (2,0,…)
Example 2: Fuzzy mapping of pharmacophore triplets (2D-FPT)
3 3
3
4
6
7
4
3 4
5
5 3
0 0 0 … 0 0 … +6 … … +3 … … … … 0 …
5
5 4
Hp3-Hp3-Hp3Hp3-Hp3-Hp4Hp3-Hp3-Hp5… Ar4-Hp3-Hp4Ar4-Hp3-Hp5
… …… … Hp7-Ar4-PC6…Hp3-HA5-Ar5
Hp4-HA5-Ar5
…… …
Di(m) = total occupancy of basis triplet i in molecule m.
Atoms labeled by their pharmacophore types:
• Hydrophobic, Aromatic• Hydrogen Bond Donor, Cation• Hydrogen Bond Acceptor, Anion
Chemically Relevant Typing: Proteolytic equilibrium dependence of 2D-FPT
Ar5-N
C5-PC8
Ar8-N
C8-PC8
?12%
88%
• A descriptor of the nature of the molecule’s pharmacophoric neigh-borhood “seen” by every reference atom, assuming an optimal overlay of the molecule on the reference...
Pharmacophoric FeaturesAlk. Aro. HBA HDB (+) (-)
1 X11 X12 X13 X14 X15 X16
2 X21 X22 X23 X24 X25 X26
3 X31 X32 X33 X34 X35 X36
4 X41 X42 X43 X44 X45 X46R
efer
ence
Ato
ms
5 X51 X52 X53 X54 X55 X56
3: A 3D Example: Overlay-Based ComPharm Pharmacophore Field Intensities
2. Computer-Aided Ligand-Based Design: the « Medicinal Chemistry » of Ligand
Fingerprints
« Similar molecules have similar properties » → « Molecules with similar fingerprints have similar
properties »
« Structure-Activity Relationships » → « Fingerprint-Activity Relationships » (or Quantitative
Structure-Activity Relationships, QSAR)
Molecular Similarity
Expressed by Fingerprint Similarity
2.1 Molecular Similarity in Chemoinformatics
The Similarity Principle – Neighborhood Behavior
Calculated Structural Dissimilarity S(m,M)
Prop
erty
Dis
sim
ilari
ty
*
***
**
*
* *
*
*
*
*
*
*
** *
**
*
**
*
* *
***
*
**
*
***
**
* *
**
**
*
*
*
False Positives (FP)
True Negatives (TN)
True Positives (TP)
Potentially (!) False Negatives (FN)
Molecule Pairs M,m
*
*
** **
*
*
*
*
*
*
*
Some Random Ranking Criterion for pairs (m,M)
Pairs with different Properties L(m,M)=|P(m)-P(M)| ≥l
Pairs with similar Properties L(m,M)=|P(m)-P(M)| <l
Dissimilarity Cutoff s ?
Best Matching Candidates
AutomatedFingerprintMatching...
ReferenceFingerprint
Nearest Neighbors
Superposition-based Similarity Scoring
Ligand Candidate Fingerprint Library
Active Reference
Similarity-Based Virtual Screening…
Strenght & Limitations of Similarity-based VS (+) Only need ONE active ligand to seek for more like it… (+) With appropriate descriptors, calculated similarity may be
complementary to the scaffold-based similarity perceived by medicinal chemists → « Scaffold Hopping »: bypassing synthetic bottlenecks and/or
pharmacokinetic property problems, patent space evasion, etc.
(--) Within the reference ligand, « all groups are equal, but some are more equal than others » when it comes to controlling activity… so what if we mismatch the latter???
Ac4
Ac..
Acm
Act
A1
A2
A3
Mol
M1
M2
M3
M4
M..
Mm
D1 D2 D3 … Dn
d11 d21 d31 … dn1
d12 d22 d32 … dn2
d13 d23 d33 … dn3
d14 d24 d34 … dn4
d1… d2… d3… … dn…
d1m d2m d3m … dnm
A QSAR model expresses observed correlations between certain descriptors and activity
ModelFitting
neighbor-hood model
decisiontree
neural net
linear
1
2
Dn
1
2
n
1
2
n
A1
2
n
DD
oui non
(M active) (M inactive)Di<?oui nonoui nonoui non
M (D1,D2,D3,…,Dn)
´= ii DA a´= iiå ´= ii
?
2.2: So, we need to LEARN the features that really matter – building QSARs
Correlations: The Cornerstone of QSAR Philosophy (or, perhaps, Religion?)
MolID Activity Class
Phe Ring
Count
HBAcceptor
Count
Count of C-N pairs at 5 bonds
1 1 1 3 2
2 1 2 2 1
3 1 1 4 2
4 1 1 3 2
5 1 2 2 3
6 0 1 1 5
7 0 1 0 4
8 0 2 2 3
9 0 2 1 5
10 0 1 0 4
More is better
Less is better
I always end up in this deplorable state,
no matter whether I drink: Vodka-Soda Martini-Soda Gin-Soda Whisky-Soda…
… therefore, as of tomorrow, I decided to
stop drinkin’ SODA !
Correlation is not Causality - an Obvious, but Inconvenient Truth…
Diverse library of 16x6x10=960 compounds… with NPC=NHD
SAR sets are always limited in diversity and therefore may (and always will) accomodate coincidental relationships between different features:
Why Lucky Correlations may Outperform more Rigorous Modeling…
• Rule-based pharmacophore feature assignment has a hard time with the imine group =N–. In rule-based triplets of benzodiazepine receptor ligands, it was flagged as cation.
• Proper pH-sensitive flagging corrected this error… and dramatically reduced model quality! • Labelling as ‘cation’ was a way to ‘highlight’ that N group – and,
since preferentially seen in actives, highlighting made QSAR learning easier
CAUSAL QSAR… but not in the way you’d expect it! A Psychedelic µ Receptor Model…
• Training set: small combinatorial carbamate library, of 240 compounds obtained by robotized synthesis, LC/MS purity control and µ receptor affinity (pIC50) measurement (proof-of-concept study, CEREP 1997)
• A successful ComPharm QSAR model (R2≈0.8) was built to explain the measured pIC50 values (btw. µ- and mM)− HB-acceptor in para of benzyl alcohol enhances µ receptor
affinity
-OR, -NR2
…based on wrong experimental data!
• The most “active” carbamates of the training set turned out to be contaminated with ‰ traces of decarboxylation product, featuring the opioid ligand specific tertiary amine and having nanomolar potencies!
• Our QSAR actually explained… the decarboxylation mechanism: p-OR or –NR2 stabilizes the intermediate carbocation… thus rendering contamination possible!
-OR, -NR 2
+
Not seeing the Scaffold because of the Molecules: why Water is a Thrombin Inhibitor.• Non-linear Thrombin affinity model, with R2
train=0.92, Q2=0.84 and R2
validation=0.61.– pKi = 2.2e-4Ar4Hp14PC122 –3.8e-5Ar4Ar12HA102 –1.4HD8Hp6PC42 +2.45exp[-(Ar6Ar10HD14-19.8)2/1362]
+4.36/{1+exp[-(Ar10Ar12Hp8-71.3)/119]} +0.77exp[-3(Ar12HD8Hp12-13.3)2/1042] –1.05/{1+exp[-(Ar12Hp4Hp10-185.1)/327]} –2.26{1+exp[-(Ar12Hp6NC8-3.2)/13]} –4.71exp[-(HA4HD4Hp4-43.4)2/1222] +7.3e-4HA4Hp4Hp6 –2.2e-3HA10HD4Hp8 +5.94exp[-(HA12Hp14PC12-1.2)2/152]
• If all population levels are zero, the calculated pKi is of 5.9, mostly due to contributions from the highlighted Gaussians. – Absence of pharmacophore triplets automatically qualifies any
small molecule as micromolar thrombin inhibitor!• The model has learned that, for benzamidines – the
chemical class represented in the training set, the presence of underlined triplets coincides with a loss of activity.
Beware of “Antipharmacophores”!
• Ar12-HD8-Hp12 is an “antipharmacophore” of the Thrombin model: – Absence of this pharmacophore triplet means an enhanced
activity, its presence correlates with an observed affinity loss• The undesirable Presence, in the above context, implicitly
means presence in specific points of the ubiquitous benzamidine scaffold – Fragments or pharmacophore elements that are genuinely “bad”
for activity, no matter where they are located, are rare. An “antipharmacophore” rather reflects poor training set diversity!
Actives Inactives Actives, but not available for training
The Phantom Scaffold… materializes when adding diverse inactives to the training set
• All the training molecules – both actives and inactives – being benzamidines, the QSAR model cannot possibly learn the importance of the benzamidine moiety!
• After enriching the thrombin data set with inactives, one model out of ~2000 was able to predict the activity of unrelated thrombin ligands of known binding geometries.– Scaffold Hopping: Yes, we can!
• Triplets entering the model successfully corroborate some of the ligand features involved in binding.– These include features entering the pockets P1 and P2, but not the
aromatic moiety binding in pocket P3.
Enhanced Training Set Diversity leads to Models with “Scaffold Hopping” abilities
HA
8-H
p6-P
C4
HA
8-H
p6-P
C4
3.24
3.92
1.93
2.55
3.24
3.92
1.93
2.55
Missing P3: deleting a Phe does not lead to ‘the same molecule, but with one phenyl less’.
• The training set included both examples of actives, with required aromatic P3 substituent and inactives, without this key moiety. So, why did the model not learn about the importance of P3?
• In all training examples, however, the removal of the P3 moiety was always done by deacylating the phenylalanine… – thus, compounds missing P3 substitution were systematically
compounds having one excess protonable group– the model actually chose to learn the ‘bogus’ rule that one
more cation causes an activity loss.
? P3
P3
“Let s be a representative sample of the set S…”• It takes a sample of ~104 individuals to extrapolate the
voting intentions of a population of ~107. What’s the representative subset size of 1025 drug-like compounds?– If we ever dared to publish QSARs trained on fewer compounds,
shame on us!• If given N=3 coordinate pairs (Y,X), not even Carl
Friedrich Gauss could come up with a model more sophisticated than Y=aX2+bX+c– Don’t listen when they say that Support Vector Machines have
very few “tunable parameters”!• May your model apply to one million and one molecules –
it may still fail for the one million and second!– One cannot validate QSAR – but just fail to invalidate it!
DISCLAIMER
THE AUTHOR STRONGLY DISAGREES WITH
HIS PERSONAL OPINIONS OUTLINED IN THIS
SLIDE.
We are Medicinal Chemists – tell us about Pharmacophore Models, forget QSAR!!
• Bad news: Pharmacophore models are just a peculiar type of 3D-QSAR:– use overlay models to “bind”
descriptors to specific spots in space
– Pharmacophore hot spots are defined by the consensual presence of groups of similar type, throughout the series of known actives
– Descriptors are occupancy levels of these spots
No knowledge of the active site – need
alternative overlay hypotheses !
Kama Sutra with Ligands: Match As Many Equivalent Pharmacophore Features You May!
methotrexate dihydrofolate
+??
??+
?++
+++Böhm, Klebe, Kubinyi, “Wirkstoffdesign” (1999)
The Cox-2 Scenario: An Ideal Pharmacophore Model
• As exhaustive and diverse as possible a set of
CycloOxygenase-II (Cox2) inhibitors, including:– A set of 1914 marketed drugs of the U.S. Pharmacopeia, tested
on Cox2 by the Cerep laboratories (BioPrintTM).
– A set of 326 inhibitors compiled from literature by N. Baurin
(thanks!), including co-crystallized selective Cox2 ligand SC-
558 and related medicinal chemistry series.
– Potencies, expressed as pIC50 = -log IC50[mol/l] can be directly
compared (cross-check on compounds present in both subsets)
Minimalistic Overlay-based Model
Acceptor Requested
Hyd
roph
obe
Requ
este
d
Aromatic Requested
Donor
Forb
idde
n
Descriptor Coeff.
Ar@Atom#2 0.191
Hp@Atom#11 0.179
HA@Atom#20 0.430
HD@Atom#20 -0.428
zexp(Ar-HA5) 1.414
zsig3(#PC) 1.414
Intercept 0.000
– Training RMS=0.712, R2=0.712– validation RMS=0.698, Q2=0.724
HipHop Acceptor!
HipHop Hydrophobe
HipHop Hydrophobe
Furthermore, it supports « Scaffolfd Hopping » !
• it manages to explain the Cox2 activities of the apparently unrelated nonspecific Cox1/Cox2 inhibitors:
• This is an ideal scenario – scaffold-independent model trained on thousands of compounds: so maybe the overlay models are mechanistically relevant !
… or maybe not!
His 90
Arg 513
Arg 120
Val 116 Val 349
Trp 387
Tyr 355?
SC-558
?
!
??
Predictive? Yes! Enlightening? No!
• Overlay-based models correctly explained the behavior of the two distinct Cox2 binder classes… on hand of an erroneous working hypothesis!!– The ‘correct’ overlay asks an ‘anion’ (flurbiprofene –COO-) to
be aligned atop of a ‘hydrophobe’ (CF3) – a heresy in pharmacophore matching! (Well, is CF3 a hydrophobe??)
– QSAR building is never safe from correlation artifacts, not even in models with 6 variables versus 2200 observables and excellent statistics!
– Such model may be very successful in selecting database subsets enriched in new Actives - but QSAR alone would never have elucidated the binding mechanism to Cox2!
QSAR – a Bookkeeping Tool !?
• “Bookkeeping” QSAR: a quantitative way to wrap up the information contained in your training set compounds– Models are scaffold-bound, heavily populated by scores of bogus
antipharmacophores– It will typically tell you things you’d notice by simply looking at
the molecules– Sampling of all the possible models fitting the observations
allows to enumerate all the alternative working hypotheses that still await to be discarded…
– …or validated. The model might highlight not yet tried combinations of known features with better activities!
– Think positive: this provides a rational plan to challenge these hypotheses, and thus learn more from better planned experiments.
3. Did we forget something? Each Model has its Limited Applicability Domain…
… even General Relativity and Quantum Physics!
Defining the Conditions of Applicability of QSAR Models – and respecting them – might help!
The Applicability Domain (AD) – A Compromise…
• Restrict the applicability of a QSAR model to a well-defined subset of the chemical space – the one populated by the training molecules. Insufficient sampling of chemotypes outside this AD is then irrelevant.– How do we define this subset of chemical space to be as large as
possible, while nevertheless densely enough populated by training molecules?
Feature count 1
Feature count 2
**
** **
**
**
* *
*
**
*
Example: the FeatureControl Approach
… but no miraculous solution! A real-life inspired (Gedanken)Experiment
• Modeling of metal ligation propensities, with a training set composed of three subfamilies, R being alkyl chains:
• pKbind= aAnilin/AcidNAnil + aPyr/AcidNPrd+ gsizeF(R) + CAcid
• AD requirements: the molecule should contain NAnil=0 or 1 aniline fragment NAcid=0 or 1 benzoic acid fragment NPrd=0 or 1 pyridine fragment Alkyl chains of size as seen in training set
• Would you like to know whether propane is a potent metal binder ?– Yes, it is: pKbind= gsizeF(C3) + CAcid (same as for p-propyl
benzoic acid)– But it can’t possibly be within the AD, can it? NAnil=0 or 1 aniline fragment NAcid=0 or 1 benzoic acid fragment NPrd=0 or 1 pyridine fragment Alkyl chains of size as seen in training set
Contributions of a good programmer, but lousy chemist, to the understanding of QSAR!
Feature count 1
Feature count 2
**
*
* **
**
*
*
* *
*
*
**
Example: the FeatureControl Approach
Building Up Trust from Consensus
• pKbind= aAnil/AcidNAnil + aPyr/AcidNPrd+ gsizeF(R) + CAcid
• pKbind= aAnil/PrdNAnil + aAcid/PrdNAcid+ gsizeF(R) + CPrd
• pKbind= aAcid/AnilNAcid + aPrd/AnilNPrd+ gsizeF(R) + CAnil
• These three alternative models are perfectly equivalent – or “redundant” – as far as training set molecules are concerned– Identical prediction for each training molecule, identical
statistical parameters• However, they cease to be redundant when it comes to
propane: CAcid ≠ CAnil ≠ CPrd
– Divergent prediction by allegedly “redundant” models is a clear signal of AD violation!
Conclusions…
• In Ligand-based knowledge extraction, the single most important piece of hardware is a BRAIN
• Correlation is not causality… it’s correlation!– So, if correlations observed within the training set do apply to
other molecules, forget metaphysical afterthoughts and exploit them, in successful virtual screening
– However, an in-depth analysis of the model – if feasible – may reveal intrinsic limitations and pitfalls, and help to better delimit the AD.
• Training set diversity is the key!– Do not hesitate to add bogus inactives, in order to teach your
model proper “border conditions” such as “cosmic vacuum is inactive”. Absence of features cannot provide activity…
• If a big pharma manager asks you “So, is QSAR useful?”, please reply “Compared to what?”– A wrong QSAR model may nevertheless ring a bell in a
medicinal chemist’s brain, and help to make right decisions – There are moments in life when one should rely on the
accumulated knowledge, and use QSAR to discover new combinations of known features: • A properly trained model, within its AD, stands fair chances to discover
new actives and sensibly decrease synthesis/testing effort, maybe find a new scaffold porting the known binding pharmacophore (lead hopping)
– There are moments when one should put known things aside, and venture out for random search of new paradigm-breaking ligands – new scaffold, new binding mode, new action mechanism, but...
Conclusions…
… if you’re afraid to be branded a low-tech amateur for not
using QSAR, count on your Marketing Department to sell
random picking as “THE ultimate successful application of
the Universal QSAR Model”:
pK i = 4.0 + ħ x HosoyaIndex + 5.0 x rand( )n
Planck’s constant – a flavor of fundamental science (and conveniently small)
Laymen will appreciate the exotic resonance of the name, evoking some oriental wisdom from high-tech Japan
Proprietary & patented algorithm and fittable coefficient n