Dragos Horvath Laboratoire d’InfoChimie, UMR 7177 CNRS – Université de Strasbourg

Ligand-Based Virtual Screening:

Extraction of Knowledge from Experimentally Confirmed Ligands, and the Quest for other Candidates Matching the

Known.

Dragos HorvathLaboratoire d’InfoChimie, UMR 7177

CNRS – Université de Strasbourg6700 Strasbourg, France

[email protected]

1. « Ceci n’est pas une molécule »

Molecular Models and Descriptors in Chemoinformatics: Numerical Encoding of Structural Information

Molecular Descriptors or Fingerprints

• Need to represent a structure by a characteristic bunch (vector) of numbers (descriptors).– Example: (Molecular Mass, Number of N Atoms, Total Charge,

Number of Aromatic Rings, Radius of Gyration)• Should include property-relevant aspects:

– the “nature” of atoms, including information on their neighbor-hood-induced properties, and their relative arrangement.

– Number of N Atoms (Primary Amino Groups, Secondary Amino Groups, … , … , Amide, … , Pyridine N, …)

– … unless being a H bond acceptor is the key (O or N alike)!– Arrangement in space (3D, conformation-dependent distances in

Å) or in the molecular graph (2D, topological distance = separating bond count)

2 0…

1 0…

Example 1: ISIDA Sequence Counts

O-C*C*C*C-N 1O-C*N*C*C-N 1…

(1,1,…) (2,0,…)

Example 2: Fuzzy mapping of pharmacophore triplets (2D-FPT)

3 3

3

4

6

7

4

3 4

5

5 3

0 0 0 … 0 0 … +6 … … +3 … … … … 0 …

5

5 4

Hp3-Hp3-Hp3Hp3-Hp3-Hp4Hp3-Hp3-Hp5… Ar4-Hp3-Hp4Ar4-Hp3-Hp5

… …… … Hp7-Ar4-PC6…Hp3-HA5-Ar5

Hp4-HA5-Ar5

…… …

Di(m) = total occupancy of basis triplet i in molecule m.

Atoms labeled by their pharmacophore types:

• Hydrophobic, Aromatic• Hydrogen Bond Donor, Cation• Hydrogen Bond Acceptor, Anion

Chemically Relevant Typing: Proteolytic equilibrium dependence of 2D-FPT

Ar5-N

C5-PC8

Ar8-N

C8-PC8

?12%

88%

• A descriptor of the nature of the molecule’s pharmacophoric neigh-borhood “seen” by every reference atom, assuming an optimal overlay of the molecule on the reference...

Pharmacophoric FeaturesAlk. Aro. HBA HDB (+) (-)

1 X11 X12 X13 X14 X15 X16

2 X21 X22 X23 X24 X25 X26

3 X31 X32 X33 X34 X35 X36

4 X41 X42 X43 X44 X45 X46R

efer

ence

Ato

ms

5 X51 X52 X53 X54 X55 X56

3: A 3D Example: Overlay-Based ComPharm Pharmacophore Field Intensities

2. Computer-Aided Ligand-Based Design: the « Medicinal Chemistry » of Ligand

Fingerprints

« Similar molecules have similar properties » → « Molecules with similar fingerprints have similar

properties »

« Structure-Activity Relationships » → « Fingerprint-Activity Relationships » (or Quantitative

Structure-Activity Relationships, QSAR)

Molecular Similarity

Expressed by Fingerprint Similarity

2.1 Molecular Similarity in Chemoinformatics

The Similarity Principle – Neighborhood Behavior

Calculated Structural Dissimilarity S(m,M)

Prop

erty

Dis

sim

ilari

ty

*

***

**

*

* *

*

*

*

*

*

*

** *

**

*

**

*

* *

***

*

**

*

***

**

* *

**

**

*

*

*

False Positives (FP)

True Negatives (TN)

True Positives (TP)

Potentially (!) False Negatives (FN)

Molecule Pairs M,m

*

*

** **

*

*

*

*

*

*

*

Some Random Ranking Criterion for pairs (m,M)

Pairs with different Properties L(m,M)=|P(m)-P(M)| ≥l

Pairs with similar Properties L(m,M)=|P(m)-P(M)| <l

Dissimilarity Cutoff s ?

Best Matching Candidates

AutomatedFingerprintMatching...

ReferenceFingerprint

Nearest Neighbors

Superposition-based Similarity Scoring

Ligand Candidate Fingerprint Library

Active Reference

Similarity-Based Virtual Screening…

Strenght & Limitations of Similarity-based VS (+) Only need ONE active ligand to seek for more like it… (+) With appropriate descriptors, calculated similarity may be

complementary to the scaffold-based similarity perceived by medicinal chemists → « Scaffold Hopping »: bypassing synthetic bottlenecks and/or

pharmacokinetic property problems, patent space evasion, etc.

(--) Within the reference ligand, « all groups are equal, but some are more equal than others » when it comes to controlling activity… so what if we mismatch the latter???

Ac4

Ac..

Acm

Act

A1

A2

A3

Mol

M1

M2

M3

M4

M..

Mm

D1 D2 D3 … Dn

d11 d21 d31 … dn1

d12 d22 d32 … dn2

d13 d23 d33 … dn3

d14 d24 d34 … dn4

d1… d2… d3… … dn…

d1m d2m d3m … dnm

A QSAR model expresses observed correlations between certain descriptors and activity

ModelFitting

neighbor-hood model

decisiontree

neural net

linear

1

2

Dn

1

2

n

1

2

n

A1

2

n

DD

oui non

(M active) (M inactive)Di<?oui nonoui nonoui non

M (D1,D2,D3,…,Dn)

´= ii DA a´= iiå ´= ii

?

2.2: So, we need to LEARN the features that really matter – building QSARs

Correlations: The Cornerstone of QSAR Philosophy (or, perhaps, Religion?)

MolID Activity Class

Phe Ring

Count

HBAcceptor

Count

Count of C-N pairs at 5 bonds

1 1 1 3 2

2 1 2 2 1

3 1 1 4 2

4 1 1 3 2

5 1 2 2 3

6 0 1 1 5

7 0 1 0 4

8 0 2 2 3

9 0 2 1 5

10 0 1 0 4

More is better

Less is better

I always end up in this deplorable state,

no matter whether I drink: Vodka-Soda Martini-Soda Gin-Soda Whisky-Soda…

… therefore, as of tomorrow, I decided to

stop drinkin’ SODA !

Correlation is not Causality - an Obvious, but Inconvenient Truth…

Diverse library of 16x6x10=960 compounds… with NPC=NHD

SAR sets are always limited in diversity and therefore may (and always will) accomodate coincidental relationships between different features:

Why Lucky Correlations may Outperform more Rigorous Modeling…

• Rule-based pharmacophore feature assignment has a hard time with the imine group =N–. In rule-based triplets of benzodiazepine receptor ligands, it was flagged as cation.

• Proper pH-sensitive flagging corrected this error… and dramatically reduced model quality! • Labelling as ‘cation’ was a way to ‘highlight’ that N group – and,

since preferentially seen in actives, highlighting made QSAR learning easier

CAUSAL QSAR… but not in the way you’d expect it! A Psychedelic µ Receptor Model…

• Training set: small combinatorial carbamate library, of 240 compounds obtained by robotized synthesis, LC/MS purity control and µ receptor affinity (pIC50) measurement (proof-of-concept study, CEREP 1997)

• A successful ComPharm QSAR model (R2≈0.8) was built to explain the measured pIC50 values (btw. µ- and mM)− HB-acceptor in para of benzyl alcohol enhances µ receptor

affinity

-OR, -NR2

…based on wrong experimental data!

• The most “active” carbamates of the training set turned out to be contaminated with ‰ traces of decarboxylation product, featuring the opioid ligand specific tertiary amine and having nanomolar potencies!

• Our QSAR actually explained… the decarboxylation mechanism: p-OR or –NR2 stabilizes the intermediate carbocation… thus rendering contamination possible!

-OR, -NR 2

+

Not seeing the Scaffold because of the Molecules: why Water is a Thrombin Inhibitor.• Non-linear Thrombin affinity model, with R2

train=0.92, Q2=0.84 and R2

validation=0.61.– pKi = 2.2e-4Ar4Hp14PC122 –3.8e-5Ar4Ar12HA102 –1.4HD8Hp6PC42 +2.45exp[-(Ar6Ar10HD14-19.8)2/1362]

+4.36/{1+exp[-(Ar10Ar12Hp8-71.3)/119]} +0.77exp[-3(Ar12HD8Hp12-13.3)2/1042] –1.05/{1+exp[-(Ar12Hp4Hp10-185.1)/327]} –2.26{1+exp[-(Ar12Hp6NC8-3.2)/13]} –4.71exp[-(HA4HD4Hp4-43.4)2/1222] +7.3e-4HA4Hp4Hp6 –2.2e-3HA10HD4Hp8 +5.94exp[-(HA12Hp14PC12-1.2)2/152]

• If all population levels are zero, the calculated pKi is of 5.9, mostly due to contributions from the highlighted Gaussians. – Absence of pharmacophore triplets automatically qualifies any

small molecule as micromolar thrombin inhibitor!• The model has learned that, for benzamidines – the

chemical class represented in the training set, the presence of underlined triplets coincides with a loss of activity.

Beware of “Antipharmacophores”!

• Ar12-HD8-Hp12 is an “antipharmacophore” of the Thrombin model: – Absence of this pharmacophore triplet means an enhanced

activity, its presence correlates with an observed affinity loss• The undesirable Presence, in the above context, implicitly

means presence in specific points of the ubiquitous benzamidine scaffold – Fragments or pharmacophore elements that are genuinely “bad”

for activity, no matter where they are located, are rare. An “antipharmacophore” rather reflects poor training set diversity!

Actives Inactives Actives, but not available for training

The Phantom Scaffold… materializes when adding diverse inactives to the training set

• All the training molecules – both actives and inactives – being benzamidines, the QSAR model cannot possibly learn the importance of the benzamidine moiety!

• After enriching the thrombin data set with inactives, one model out of ~2000 was able to predict the activity of unrelated thrombin ligands of known binding geometries.– Scaffold Hopping: Yes, we can!

• Triplets entering the model successfully corroborate some of the ligand features involved in binding.– These include features entering the pockets P1 and P2, but not the

aromatic moiety binding in pocket P3.

Enhanced Training Set Diversity leads to Models with “Scaffold Hopping” abilities

HA

8-H

p6-P

C4

HA

8-H

p6-P

C4

3.24

3.92

1.93

2.55

3.24

3.92

1.93

2.55

Missing P3: deleting a Phe does not lead to ‘the same molecule, but with one phenyl less’.

• The training set included both examples of actives, with required aromatic P3 substituent and inactives, without this key moiety. So, why did the model not learn about the importance of P3?

• In all training examples, however, the removal of the P3 moiety was always done by deacylating the phenylalanine… – thus, compounds missing P3 substitution were systematically

compounds having one excess protonable group– the model actually chose to learn the ‘bogus’ rule that one

more cation causes an activity loss.

? P3

P3

“Let s be a representative sample of the set S…”• It takes a sample of ~104 individuals to extrapolate the

voting intentions of a population of ~107. What’s the representative subset size of 1025 drug-like compounds?– If we ever dared to publish QSARs trained on fewer compounds,

shame on us!• If given N=3 coordinate pairs (Y,X), not even Carl

Friedrich Gauss could come up with a model more sophisticated than Y=aX2+bX+c– Don’t listen when they say that Support Vector Machines have

very few “tunable parameters”!• May your model apply to one million and one molecules –

it may still fail for the one million and second!– One cannot validate QSAR – but just fail to invalidate it!

DISCLAIMER

THE AUTHOR STRONGLY DISAGREES WITH

HIS PERSONAL OPINIONS OUTLINED IN THIS

SLIDE.

We are Medicinal Chemists – tell us about Pharmacophore Models, forget QSAR!!

• Bad news: Pharmacophore models are just a peculiar type of 3D-QSAR:– use overlay models to “bind”

descriptors to specific spots in space

– Pharmacophore hot spots are defined by the consensual presence of groups of similar type, throughout the series of known actives

– Descriptors are occupancy levels of these spots

No knowledge of the active site – need

alternative overlay hypotheses !

Kama Sutra with Ligands: Match As Many Equivalent Pharmacophore Features You May!

methotrexate dihydrofolate

+??

??+

?++

+++Böhm, Klebe, Kubinyi, “Wirkstoffdesign” (1999)

The Cox-2 Scenario: An Ideal Pharmacophore Model

• As exhaustive and diverse as possible a set of

CycloOxygenase-II (Cox2) inhibitors, including:– A set of 1914 marketed drugs of the U.S. Pharmacopeia, tested

on Cox2 by the Cerep laboratories (BioPrintTM).

– A set of 326 inhibitors compiled from literature by N. Baurin

(thanks!), including co-crystallized selective Cox2 ligand SC-

558 and related medicinal chemistry series.

– Potencies, expressed as pIC50 = -log IC50[mol/l] can be directly

compared (cross-check on compounds present in both subsets)

Minimalistic Overlay-based Model

Acceptor Requested

Hyd

roph

obe

Requ

este

d

Aromatic Requested

Donor

Forb

idde

n

Descriptor Coeff.

Ar@Atom#2 0.191

Hp@Atom#11 0.179

HA@Atom#20 0.430

HD@Atom#20 -0.428

zexp(Ar-HA5) 1.414

zsig3(#PC) 1.414

Intercept 0.000

– Training RMS=0.712, R2=0.712– validation RMS=0.698, Q2=0.724

HipHop Acceptor!

HipHop Hydrophobe

HipHop Hydrophobe

Furthermore, it supports « Scaffolfd Hopping » !

• it manages to explain the Cox2 activities of the apparently unrelated nonspecific Cox1/Cox2 inhibitors:

• This is an ideal scenario – scaffold-independent model trained on thousands of compounds: so maybe the overlay models are mechanistically relevant !

… or maybe not!

His 90

Arg 513

Arg 120

Val 116 Val 349

Trp 387

Tyr 355?

SC-558

?

!

??

Predictive? Yes! Enlightening? No!

• Overlay-based models correctly explained the behavior of the two distinct Cox2 binder classes… on hand of an erroneous working hypothesis!!– The ‘correct’ overlay asks an ‘anion’ (flurbiprofene –COO-) to

be aligned atop of a ‘hydrophobe’ (CF3) – a heresy in pharmacophore matching! (Well, is CF3 a hydrophobe??)

– QSAR building is never safe from correlation artifacts, not even in models with 6 variables versus 2200 observables and excellent statistics!

– Such model may be very successful in selecting database subsets enriched in new Actives - but QSAR alone would never have elucidated the binding mechanism to Cox2!

QSAR – a Bookkeeping Tool !?

• “Bookkeeping” QSAR: a quantitative way to wrap up the information contained in your training set compounds– Models are scaffold-bound, heavily populated by scores of bogus

antipharmacophores– It will typically tell you things you’d notice by simply looking at

the molecules– Sampling of all the possible models fitting the observations

allows to enumerate all the alternative working hypotheses that still await to be discarded…

– …or validated. The model might highlight not yet tried combinations of known features with better activities!

– Think positive: this provides a rational plan to challenge these hypotheses, and thus learn more from better planned experiments.

3. Did we forget something? Each Model has its Limited Applicability Domain…

… even General Relativity and Quantum Physics!

Defining the Conditions of Applicability of QSAR Models – and respecting them – might help!

The Applicability Domain (AD) – A Compromise…

• Restrict the applicability of a QSAR model to a well-defined subset of the chemical space – the one populated by the training molecules. Insufficient sampling of chemotypes outside this AD is then irrelevant.– How do we define this subset of chemical space to be as large as

possible, while nevertheless densely enough populated by training molecules?

Feature count 1

Feature count 2

**

** **

**

**

* *

*

**

*

Example: the FeatureControl Approach

… but no miraculous solution! A real-life inspired (Gedanken)Experiment

• Modeling of metal ligation propensities, with a training set composed of three subfamilies, R being alkyl chains:

• pKbind= aAnilin/AcidNAnil + aPyr/AcidNPrd+ gsizeF(R) + CAcid

• AD requirements: the molecule should contain NAnil=0 or 1 aniline fragment NAcid=0 or 1 benzoic acid fragment NPrd=0 or 1 pyridine fragment Alkyl chains of size as seen in training set

• Would you like to know whether propane is a potent metal binder ?– Yes, it is: pKbind= gsizeF(C3) + CAcid (same as for p-propyl

benzoic acid)– But it can’t possibly be within the AD, can it? NAnil=0 or 1 aniline fragment NAcid=0 or 1 benzoic acid fragment NPrd=0 or 1 pyridine fragment Alkyl chains of size as seen in training set

Contributions of a good programmer, but lousy chemist, to the understanding of QSAR!

Feature count 1

Feature count 2

**

*

* **

**

*

*

* *

*

*

**

Example: the FeatureControl Approach

Building Up Trust from Consensus

• pKbind= aAnil/AcidNAnil + aPyr/AcidNPrd+ gsizeF(R) + CAcid

• pKbind= aAnil/PrdNAnil + aAcid/PrdNAcid+ gsizeF(R) + CPrd

• pKbind= aAcid/AnilNAcid + aPrd/AnilNPrd+ gsizeF(R) + CAnil

• These three alternative models are perfectly equivalent – or “redundant” – as far as training set molecules are concerned– Identical prediction for each training molecule, identical

statistical parameters• However, they cease to be redundant when it comes to

propane: CAcid ≠ CAnil ≠ CPrd

– Divergent prediction by allegedly “redundant” models is a clear signal of AD violation!

Conclusions…

• In Ligand-based knowledge extraction, the single most important piece of hardware is a BRAIN

• Correlation is not causality… it’s correlation!– So, if correlations observed within the training set do apply to

other molecules, forget metaphysical afterthoughts and exploit them, in successful virtual screening

– However, an in-depth analysis of the model – if feasible – may reveal intrinsic limitations and pitfalls, and help to better delimit the AD.

• Training set diversity is the key!– Do not hesitate to add bogus inactives, in order to teach your

model proper “border conditions” such as “cosmic vacuum is inactive”. Absence of features cannot provide activity…

• If a big pharma manager asks you “So, is QSAR useful?”, please reply “Compared to what?”– A wrong QSAR model may nevertheless ring a bell in a

medicinal chemist’s brain, and help to make right decisions – There are moments in life when one should rely on the

accumulated knowledge, and use QSAR to discover new combinations of known features: • A properly trained model, within its AD, stands fair chances to discover

new actives and sensibly decrease synthesis/testing effort, maybe find a new scaffold porting the known binding pharmacophore (lead hopping)

– There are moments when one should put known things aside, and venture out for random search of new paradigm-breaking ligands – new scaffold, new binding mode, new action mechanism, but...

Conclusions…

… if you’re afraid to be branded a low-tech amateur for not

using QSAR, count on your Marketing Department to sell

random picking as “THE ultimate successful application of

the Universal QSAR Model”:

pK i = 4.0 + ħ x HosoyaIndex + 5.0 x rand( )n

Planck’s constant – a flavor of fundamental science (and conveniently small)

Laymen will appreciate the exotic resonance of the name, evoking some oriental wisdom from high-tech Japan

Proprietary & patented algorithm and fittable coefficient n

Dragos Horvath Laboratoire d’InfoChimie, UMR 7177 CNRS – Université de Strasbourg

Documents

Transcript of Dragos Horvath Laboratoire d’InfoChimie, UMR 7177 CNRS – Université de Strasbourg