Predicting Pharmacology

WP van Hoorn, Feb 2006

Willem van Hoorn

Pfizer Global Research & Development, Sandwich, UK

Pipeline Pilot UGM, San Diego, Mar 2006


Standing on the Shoulders of Giants

Gaia Paolini

Richard Shapland

Andrew Hopkins

Jonathan Mason


The Work of Giants

4.8 M structures

275k active compounds

600k activities (IC50, etc)

3k targets

800 human targets

Data sources:
• Inpharmatica StARLITe
• Cerep Bioprint
• Thomson IDDB
• Pfizer in house

Implementation:
• Oracle / DayCart cartridge
• Structures stored as SMILES
• Pipeline Pilot: canonical tautomers, salt stripping, etc.
• Access: ODBC components + web service
• Pfizer compound structure retrieval

Unified DB


Why Giants Are Required


Unified DB

Unified Database as Starting Point

Bayesian Learn Molecular Categories

Predicting activities

Linear Discriminant Analysis (LDA)

Predicting gene families

Polypharmacology interaction network

http://www.r-project.org/index.html


Polypharmacology Network From Binding Data

[Figure: polypharmacology network. Node: target. Edge: compound.]

Legend (gene families):
• Metalloproteases
• Cysteine proteases
• Serine proteases
• Aspartyl proteases
• Phosphodiesterases
• Kinases
• Aminergic GPCRs
• Peptide GPCRs
• GPCRs (others: classes A, B & C)
• Enzymes (hydrolases, transferases, oxidoreductases & others)
• Ion Channels
• Nuclear hormone receptors
• Miscellaneous


Deriving Multi-Category Bayesian Model

Unified DB
→ 238k actives (≤ 10 µM), human target, Mw < 1000, pass reactivity filter, ≥ 10 actives / target
→ FCFP_6
→ Training set: 90% / 214k → 698 models
→ Test set: 10% / 23,792 compounds, 55,781 activities
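The slides do not show how the Pipeline Pilot Bayesian component computes its scores. As a minimal sketch, assuming the Laplacian-corrected naive Bayes estimator that is commonly described for this type of model, one model per target could be derived as follows. The function names, the toy feature IDs (stand-ins for FCFP_6 features) and the toy target names are hypothetical.

```python
import math
from collections import defaultdict

def train_models(training, min_actives=10):
    """training: list of (set_of_feature_ids, set_of_active_targets).

    Returns {target: {feature_id: weight}}, one model per target that has
    at least `min_actives` active compounds in the training data.
    """
    n_total = len(training)
    feature_count = defaultdict(int)               # N_F: compounds containing feature F
    target_count = defaultdict(int)                # A: actives per target
    joint = defaultdict(lambda: defaultdict(int))  # A_F: actives containing feature F
    for features, targets in training:
        for f in features:
            feature_count[f] += 1
        for t in targets:
            target_count[t] += 1
            for f in features:
                joint[t][f] += 1
    models = {}
    for t, n_active in target_count.items():
        if n_active < min_actives:
            continue
        base = n_active / n_total                  # baseline hit rate for this target
        # Assumed Laplacian-corrected weight: log((A_F + 1) / (N_F * base + 1))
        models[t] = {f: math.log((joint[t].get(f, 0) + 1) / (n_f * base + 1))
                     for f, n_f in feature_count.items()}
    return models

def score(model, features):
    """Bayesian score: sum of the weights of the features present."""
    return sum(model.get(f, 0.0) for f in features)

# Toy data: integers stand in for fingerprint feature IDs.
training = [({1, 2}, {"T1"}), ({1, 3}, {"T1"}),
            ({4, 5}, {"T2"}), ({4, 6}, {"T2"}),
            ({7, 8}, set())]
models = train_models(training, min_actives=2)
for target in sorted(models):
    print(target, round(score(models[target], {1, 2}), 3))
```

Every compound gets a score from every model, which is what makes the category model "multi-category": a single pass yields a full predicted activity profile per compound.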


Assessing the Predictions of the Random Test Set

Large number of predictions:
• 23,792 * 698 ~ 16.6M
• 55,781 activities, rest unknown presumed inactive
• Interpretation of Bayesian score?
• Score ≥ cut-off: active, rest inactive
• # predicted actives = F(cut-off)

Comparison with random:
• For each cut-off: calculate number of predicted actives
• Generate exactly same number of random predicted actives
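The comparison with random can be sketched as below. This is an illustration of the procedure as described on the slide, not the original protocol: the data layout (a dictionary of scores keyed by compound/target pair), the function name and the toy values are assumptions.

```python
import random

def compare_with_random(scores, known_actives, cutoff, seed=0):
    """scores: {(compound, target): bayesian_score}
    known_actives: set of (compound, target) pairs measured as active.
    Pairs without a known activity are presumed inactive.
    """
    predicted = {pair for pair, s in scores.items() if s >= cutoff}
    # Draw exactly the same number of predicted actives at random.
    rng = random.Random(seed)
    random_predicted = set(rng.sample(sorted(scores), len(predicted)))
    true_pos = len(predicted & known_actives)
    return {
        "predicted": len(predicted),
        "true_pos": true_pos,
        "false_pos": len(predicted) - true_pos,
        "false_neg": len(known_actives - predicted),
        "random_true_pos": len(random_predicted & known_actives),
    }

# Toy example: 3 compounds x 2 targets.
scores = {("c1", "t1"): 80, ("c1", "t2"): -5,
          ("c2", "t1"): 60, ("c2", "t2"): 10,
          ("c3", "t1"): -20, ("c3", "t2"): 55}
known_actives = {("c1", "t1"), ("c2", "t1"), ("c2", "t2")}
result = compare_with_random(scores, known_actives, cutoff=50)
print(result)
```

Repeating the call over a range of cut-offs gives the number of predicted actives as a function of the cut-off, together with a random baseline of identical size at each point.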


Assessing the Predictions of the Random Test Set

At Bayesian cut-off 50:
• 58,428 predictions / 17,210 compounds
• 16,281 compounds with ≥ 1 correct prediction
• 31,600 true positives (random: 292)
• Enrichment ~ 100 fold
• 26,828 false positives (random: 55,489)
• 24,181 false negatives
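The counts reported for the random test set are internally consistent, as this short check shows. Precision and recall are not named on the slide; they are added here only as a convenient way to read the same numbers.

```python
# Counts reported for the 10% random test set.
true_pos, false_pos, false_neg = 31_600, 26_828, 24_181
random_true_pos = 292

predictions = true_pos + false_pos        # all predicted actives
known_activities = true_pos + false_neg   # all measured actives in the test set

precision = true_pos / predictions
recall = true_pos / known_activities
enrichment = true_pos / random_true_pos

print(f"predictions:      {predictions:,}")
print(f"known activities: {known_activities:,}")
print(f"precision:        {precision:.1%}")
print(f"recall:           {recall:.1%}")
print(f"enrichment:       {enrichment:.0f} fold")
```

True plus false positives reproduce the 58,428 predictions, and true positives plus false negatives reproduce the 55,781 known activities of the test set.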


Predicted Polypharmacology Network At Bayesian Cut-off 50

[Figure: predicted polypharmacology network. Edges are marked as true positive prediction or false positive prediction.]

Labelled gene families:
• Nuclear hormone receptors
• Ion Channels
• Phosphodiesterases
• Aminergic GPCRs
• Peptide GPCRs
• GPCRs (others)
• Enzymes (others)


Predicted Polypharmacology Network At Bayesian Cut-off 50

• At confidence level 50, most predictions are intra gene class
• Quite a few false positive connections coincide with true positives
• Exceptions: Ion Channels, Enzymes-others
• Although the prediction is wrong, the connection is right?
• Or the prediction is right and the connection is a false negative (not measured)?
• Most interesting part of predicted connections to test
• Compare to Peter Willett’s work in similarity searches: (next) nearest neighbours of inactive nearest neighbours are equally likely to be active as the nearest neighbours themselves (J. Med. Chem. 2005, 48, 7049)


A More Challenging Test Set: Cerep Bioprint

Unified DB
→ 238k actives (≤ 10 µM), human target, Mw < 1000, pass reactivity filter, ≥ 10 actives / target
→ FCFP_6
→ 237k → 694 models
→ Test set: Bioprint, 997 compounds, 316 targets



At Bayesian cut-off 50:
• 720 predictions / 291 compounds
• 210 compounds with ≥ 1 correct prediction
• 433 true positives (random: 17)
• Enrichment ~ 25 fold
• 287 false positives (random: 55,489)
• 12,281 false negatives


Another Look At The Same Data

At Bayesian cut-off 0:
• 36,222 predictions
• 6,121 true positives
• 30,101 false positives
• 6,593 false negatives
• 48% of actives in 11% of data
• Plus 378 extra predicted targets



• Bioprint harder to predict than 10% random test set
• Data can be interpreted depending on need
• Few high confidence predictions, appropriate for triaging HTS hits
• Many low confidence predictions, appropriate for risk assessment of lead
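The trade-off between the two cut-offs can be made explicit from the Bioprint counts. In this sketch, "11% of data" is assumed to mean the fraction of all 997 × 316 compound/target pairs that are flagged as active at cut-off 0; that reading is an assumption, although it reproduces the figure.

```python
# Bioprint counts reported at the two Bayesian cut-offs.
cutoffs = {
    50: {"true_pos": 433, "false_pos": 287, "false_neg": 12_281},
    0: {"true_pos": 6_121, "false_pos": 30_101, "false_neg": 6_593},
}
total_pairs = 997 * 316  # compounds x targets (assumed meaning of "data")

summary = {}
for cutoff, c in cutoffs.items():
    predicted = c["true_pos"] + c["false_pos"]
    actives = c["true_pos"] + c["false_neg"]
    summary[cutoff] = {
        "predicted": predicted,
        "actives": actives,
        "precision": c["true_pos"] / predicted,
        "recall": c["true_pos"] / actives,
        "fraction_flagged": predicted / total_pairs,
    }

for cutoff, s in summary.items():
    print(f"cut-off {cutoff:>2}: precision {s['precision']:.1%}, "
          f"recall {s['recall']:.1%}, flagged {s['fraction_flagged']:.1%}")
```

A high cut-off gives few predictions of which a large share is correct, which suits hit triage. A low cut-off recovers many more of the known actives at the cost of many false positives, which suits risk assessment.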


Linear Discriminant Analysis

[Figure: banknote with the measured dimensions labelled: length, height, left rim, bottom rim, diagonal. Source: H. Lohninger, Teach/Me Data Analysis, http://www.vias.org/tmdatanaleng]

NOTE   Length  Left   Right  Bottom  Top    Diagonal  Genuine
BN1    214.8   131.0  131.1  9.000   9.700  141.0     true
BN2    214.6   129.7  129.7  8.100   9.500  141.7     true
BN3    214.8   129.7  129.7  8.700   9.600  142.2     true
BN4    214.8   129.7  129.6  7.500   10.40  142.0     true
BN5    215.0   129.6  129.7  10.40   7.700  141.8     true
BN6    215.7   130.8  130.5  9.000   10.10  141.4     true
BN7    215.5   129.5  129.7  7.900   9.600  141.6     true
BN8    214.5   129.6  129.2  7.200   10.70  141.7     true
BN9    214.9   129.4  129.7  8.200   11.00  141.9     true
BN10   215.2   130.4  130.3  9.200   10.00  140.7     true
….     ….      ….     ….     ….      ….     ….        ….
BN195  214.9   130.3  130.5  11.60   10.60  139.8     false
BN196  215.0   130.4  130.3  9.900   12.10  139.6     false
BN197  215.1   130.3  129.9  10.30   11.50  139.7     false
BN198  214.8   130.3  130.4  10.60   11.10  140.0     false
BN199  214.7   130.7  130.8  11.20   11.20  139.4     false
BN200  214.3   129.9  129.9  10.20   11.50  139.6     false

• Similar to PCA, but PCA tries to represent classes
• LDA tries to discover what distinguishes classes
• Compare letters: O and Q
• PCA focuses on circle, LDA on tail
• Web example: distinguish between genuine and false banknotes
• Training set: 200 banknotes, 100 genuine / 100 forgeries


Predicting Forgeries with LDA and Bayesian

LDA:

NOTE  Length  Left   Right  Bottom  Top    Diagonal  BankNotes  LD1
BN1   215.1   130.0  129.8  9.100   10.20  141.5     true       2.501
BN2   214.7   130.7  130.8  11.20   11.20  139.4     false      -4.561
BN3   214.3   129.9  129.9  10.20   11.50  139.6     false      -3.390
BN4   214.7   130.0  129.4  7.800   10.00  141.2     true       4.060

Bayesian:

NOTE  Length  Left   Right  Bottom  Top    Diagonal  BankNotesBayes
BN1   215.1   130.0  129.8  9.100   10.20  141.5     1.992
BN2   214.7   130.7  130.8  11.20   11.20  139.4     -6.611
BN3   214.3   129.9  129.9  10.20   11.50  139.6     -6.341
BN4   214.7   130.0  129.4  7.800   10.00  141.2     1.771
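To illustrate how LDA separates the two classes, here is a minimal sketch of Fisher's two-class linear discriminant in plain Python. It is not the analysis behind the LD1 values on the slide: it uses only two of the six measurements (bottom rim and diagonal) and only the 16 training rows printed on the slide, so its scores differ in scale from LD1. All names are hypothetical.

```python
# (bottom rim, diagonal) for the training notes printed on the slide.
genuine = [(9.0, 141.0), (8.1, 141.7), (8.7, 142.2), (7.5, 142.0), (10.4, 141.8),
           (9.0, 141.4), (7.9, 141.6), (7.2, 141.7), (8.2, 141.9), (9.2, 140.7)]
forged = [(11.6, 139.8), (9.9, 139.6), (10.3, 139.7),
          (10.6, 140.0), (11.2, 139.4), (10.2, 139.6)]

def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter(points, m):
    """Within-class scatter as (sxx, sxy, syy) of a symmetric 2x2 matrix."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy

m_g, m_f = mean(genuine), mean(forged)
# Pooled within-class scatter matrix Sw.
sxx, sxy, syy = (a + b for a, b in zip(scatter(genuine, m_g), scatter(forged, m_f)))
det = sxx * syy - sxy ** 2
dx, dy = m_g[0] - m_f[0], m_g[1] - m_f[1]
# Fisher direction: w = Sw^-1 (mean_genuine - mean_forged), 2x2 inverse written out.
w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
midpoint = ((m_g[0] + m_f[0]) / 2, (m_g[1] + m_f[1]) / 2)

def discriminant(point):
    """Positive: closer to the genuine class along w. Negative: forged."""
    return w[0] * (point[0] - midpoint[0]) + w[1] * (point[1] - midpoint[1])

# The four test notes from the slide: (bottom rim, diagonal).
test_notes = {"BN1": (9.1, 141.5), "BN2": (11.2, 139.4),
              "BN3": (10.2, 139.6), "BN4": (7.8, 141.2)}
predictions = [discriminant(p) > 0 for p in test_notes.values()]
for name, p in test_notes.items():
    print(name, "genuine" if discriminant(p) > 0 else "forged")
```

The direction w weights the diagonal most heavily, because that measurement differs between the classes while varying little within them. This is the "tail of the Q" idea: LDA looks for what distinguishes the classes rather than for what varies most overall.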


Predicting Gene Class by Physical Properties

Compounds binding to different gene classes possess different physical property distributions.

Can this be used to predict gene class from physical properties alone?

How does LDA compare to Bayesian?

[Figure: property distributions, panels labelled Mw and clogP]



Unified DB
→ 148k actives (≤ 10 µM), human target, Mw < 1000, pass reactivity filter, binding to single target class only
→ 147,534 compounds: 118,118 training / 29,416 test

20 Gene Classes:
• Aminergic GPCRs
• Aspartyl Proteases
• Cysteine Proteases
• Enzymes- others
• GPCRs Class A- others
• GPCRs Class B
• GPCRs Class C
• Hydrolases
• Ion Channels- Ligand_Gated
• Ion Channels- others
• Kinases- others
• Metalloproteases
• Nuclear hormone receptors
• Others
• Oxidoreductases
• PDEs
• Peptide GPCRs
• Protein Kinases
• Serine Proteases
• Transferases

10 Descriptors:
• Molecular_Weight
• Num_H_Acceptors
• Num_H_Donors
• Num_RotatableBonds
• Molecular_PolarSurfaceArea
• No_IonCenters
• Molecular_Solubility
• Molecular_SurfaceArea
• ClogP
• *Andrews*



Test set results. Experiment: number of compounds measured in the class. Other columns: number of compounds predicted in the class (number correct).

Target class                Experiment  Random (correct)  Bayes (correct)  LDA (correct)
Aminergic GPCRs             3228        1463 (153)        2299 (902)       5864 (2042)
Aspartyl Proteases          728         1479 (29)         1670 (614)       1180 (613)
Cysteine Proteases          252         1470 (13)         1464 (73)        75 (28)
Enzymes- others             2574        1451 (135)        1 (0)            1969 (321)
GPCRs Class A- others       2647        1524 (117)        962 (309)        1268 (366)
GPCRs Class B               226         1477 (14)         2340 (115)       1 (0)
GPCRs Class C               339         1438 (29)         3346 (146)       0 (0)
Hydrolases                  286         1461 (15)         350 (42)         0 (0)
Ion Channels- Ligand_Gated  764         1448 (52)         2109 (280)       47 (0)
Ion Channels- others        594         1441 (29)         964 (104)        152 (59)
Kinases- others             198         1430 (11)         1545 (100)       0 (0)
Metalloproteases            849         1515 (47)         1626 (293)       279 (74)
Nuclear hormone receptors   1238        1459 (53)         2083 (345)       482 (163)
Others                      3336        1465 (167)        90 (47)          2638 (499)
Oxidoreductases             1385        1492 (54)         1437 (329)       888 (241)
PDEs                        1178        1468 (56)         2176 (392)       791 (248)
Peptide GPCRs               5027        1461 (236)        2809 (1135)      8123 (2811)
Protein Kinases             2927        1488 (148)        341 (147)        5309 (1423)
Serine Proteases            913         1526 (53)         792 (133)        349 (137)
Transferases                727         1460 (36)         1012 (125)       1 (0)
Total                       29416       29416 (1447)      29416 (5631)     29416 (9025)



• Enrichment over random: LDA ~ 6 fold, Bayes ~ 4 fold
• Bayesian: more equal spread
• LDA: some baskets contain too many eggs?
• Some of the misclassifications might be true: many missing values
• Unbiased and fast method to (pre)screen large compound collection
• Compare with other unbiased methods: docking, pharmacophore search
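The enrichment figures follow directly from the totals of correct class assignments on the 29,416 test compounds. The sketch below also compares the largest predicted class of each method (8,123 Peptide GPCRs for LDA, 3,346 GPCRs Class C for Bayes) as one possible way to quantify the "spread" remark; that measure is an assumption, not something stated on the slides.

```python
total = 29_416                 # test compounds
n_classes = 20
correct = {"LDA": 9_025, "Bayes": 5_631, "Random": 1_447}
largest_bin = {"LDA": 8_123, "Bayes": 3_346}   # most-predicted class per method

accuracy = {method: n / total for method, n in correct.items()}
enrichment = {method: correct[method] / correct["Random"] for method in ("LDA", "Bayes")}
largest_share = {method: n / total for method, n in largest_bin.items()}

print(f"uniform guess over {n_classes} classes: {1 / n_classes:.1%}")
for method, acc in accuracy.items():
    print(f"{method:<6} accuracy {acc:.1%}")
for method, e in enrichment.items():
    print(f"{method:<6} enrichment {e:.1f} fold, "
          f"largest predicted class {largest_share[method]:.1%} of test set")
```

The random baseline lands close to the 5% expected from guessing uniformly over 20 classes. LDA places more than a quarter of all test compounds in a single class, while the Bayesian predictions are spread more evenly.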


Conclusions

• Data from heterogeneous sources can be combined in one knowledge base
• Predictive Bayesian models can be derived from it
• Models are adaptive, regenerate to incorporate latest experimental results
• Models are not replacement for experiment
• Models can lead to substantially lower screening investment
• Drug design compared to supermarket stock inventory: just in time delivery vs. just enough screening
• Don’t discount simple molecular properties
