Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds...

39
Machine Learning and Drug Design Observations and Lessons UCSC Machine Learning Summer School 2012 Nigel Duffy, PhD. Copyright Numerate Inc

Transcript of Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds...

Page 1: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Machine Learning and Drug Design

Observations and Lessons UCSC Machine Learning Summer School

2012 Nigel Duffy, PhD.

Copyright Numerate Inc

Page 2: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Copyright Numerate Inc

Applying Machine Learning

More than picking the best / hottest algorithm … • What is the problem you’re trying to solve?

• How do you measure improvement or success? • What data is available?

• How much? • Where does it come from? • What does it look like?

• What algorithms best match your problem and your data?

If Weka had all the answers no one would need us …

Page 3: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Copyright Numerate Inc

An Aside

Classification and Regression Trees, 1984

0 1

0.15 0.5 0.54 0.85

Has maximum in [0.1,0.34] ?

Ship

Radar Profile

Page 4: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Copyright Numerate Inc

Finding New Drugs

Page 5: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Copyright Numerate Inc

Finding New Drugs

What is the learning problem?

Page 6: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Copyright Numerate Inc

Predicting Compound Activity

Thomas Splettstoesser, Wikimedia Commons

Page 7: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Predicting Compound Activity

Representation Statistics

Encode chemical information – Representation

Capture key aspects of problem

Challenging statistical problems

Overfitting: The “train/test paradox”

Bias: Where does the training data come from?

Noise: What were the experimental conditions?

Machine Learning

Copyright Numerate Inc.

Page 8: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Representation: Conformers

Enumerate potential binding conformations of small molecules

Copyright Numerate Inc.

Page 9: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Hydrophobic

group

Positive

charge

Hydrogen

bond

acceptor

Aromatic

ring

Hydrogen

bond

donor

Representation: 3D Features

• Capture key shape and electrostatic features

Copyright Numerate Inc.

Page 10: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Representation: Pharmacophores

Pharmacophore: Set of 4 feature points and distances

Rotation and translation invariant

Captures local information

Doesn’t require whole molecule alignment

Single pharmacophores are bad

4 Points do not capture enough information

Not noise tolerant

Descriptive but not predictive

Copyright Numerate Inc.

Page 11: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Pharmacophore

57467 Pharmacophore

57483

Each pharmacophore corresponds to a bit (~10 million bits)

fit = bit on no fit = bit off

Representation: Pharmacophore Bit Keys

Copyright Numerate Inc.

Page 12: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Representation: Sets of Pharmacophores

Bit Keys are bad: Discretized Distances

• Sensitive to conformer generation

• Sensitive to bin boundaries

• Lack resolution

• Poor generalization

Bit Keys are sufficient for purchasable screen

Not sufficient for lead identification

Solution: Use the full set of pharmacophores

Copyright Numerate Inc.

Page 13: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Copyright Numerate Inc

Available Data

Literature: • Curation errors • Publication bias • Divergent assay conditions Patents: • Curation errors • Divergent assay conditions • Obfuscated data • Only intervals reported

Page 14: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Addressing Noise Using Ranking

IC50

A B

C D

E

F

Lab 1 Lab 2

A

D

B

C E F

• IC50 values are not consistent

– Experimental noise

– Varying experimental conditions

• Relative orders are consistent

– Treat all data as ordering data

Copyright Numerate Inc

Page 15: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

IC50

A B

C D

E

F

Lab 1 Lab 2

A

D

B

C E F

• IC50 values are not consistent

– Experimental noise

– Varying experimental conditions

• Relative orders are consistent

– Treat all data as ordering data

• Tolerance windows address measurement noise

Outside

tolerance window

Inside

tolerance

window

Addressing Noise Using Ranking

Copyright Numerate Inc

Page 16: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Noise: Some examples

Assay Compounds Pearson (log) Kendall Tau KT (tol=1.0)

HEK v. CHO 425 0.71 0.56 0.92

HEK Internal 40 0.74 0.64 1.0

CHO Internal 60 0.96 0.79 1.0

Varying Assay Conditions:

Copyright Numerate Inc

Page 17: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Noise: Some examples

Assay Compounds Pearson (log) Kendall Tau KT (tol=0.5)

Kinase 1 43 0.96 0.78 1.0

Kinase 2 23 0.91 0.77 1.0

“Same” Conditions – different lab:

Assay Compounds Pearson (log) Kendall Tau KT (tol=0.5)

Kinase 3 Assay 1 44 0.82 0.73 0.97

Kinase 3 Assay 2 45 0.79 0.58 0.99

Copyright Numerate Inc

Page 18: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Copyright Numerate Inc

Bias

Our Goal: Find NEW molecules Data: Only regarding old molecules – owned by someone else Publication bias: Mostly active compounds are published Medicinal chemistry bias: Compounds are made because someone believed they would be active Synthetic chemistry bias: More easy compounds are made than hard compounds

Page 19: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Empirical Risk Minimization

Minimize the empirical risk: – where S is the training data

– f is a predictive model

– x is the representation

– y is the label

– L is a bounded loss function

Can alternatively be expressed as

Expected loss over the empirical distribution

Syx

yxfL),(

),(

)),((E S~y)(x, yxfL

Copyright Numerate Inc

Page 20: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Bias: Where does our data come from?

Essential assumption: Training data is drawn

independently from an identical distribution

to the test distribution D, i.e.,

Due to bias this is never even approximately

true for biochemical data, i.e, S is not equal D

Publication bias, Medicinal chemistry bias, Synthetic

bias

Must address data bias for ALL methods

Including model validation, docking, and free-energy

),(E),(E D~y)(x,S~y)(x, yxLyxL

Copyright Numerate Inc

Page 21: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Bias: Where do we put our errors?

Consider fitting the function sin(x/y) where

X (visible) is drawn independently and uniformly from [-3,3]

Y (hidden) is 3 or 1 with varying probabilities

Copyright Numerate Inc.

Page 22: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

What if?

In training data P(y=1)=0.05 and P(y=3)=0.95

In testing data P(y=1)=0.95 and P(y=3)=0.05

Bias: Where do we put our errors?

y=1 y=3

Makes error estimates worthless Modeling is hard

Copyright Numerate Inc.

Page 23: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Publication Bias

Publication bias Mostly active compounds are published

Baseline predictor: predict active

Published Data Apparent active: inactive ratio > 10:1

Predict active: >90% accuracy on

literature data

Reality True active: inactive ratio < 1:1000

Predict active: <0.1% accuracy on

real compounds

Active Inactive

Where do we put the errors? Not where we should

Copyright Numerate Inc.

Page 24: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Addressing Publication Bias with Ranking

Treat all data as ordering data:

Compound A is active and compound B is inactive

Create ordered pairs of data (A,B)

Are they in the correct order?

Bias disappears, baseline is 50% accurate.

Copyright Numerate Inc

Page 25: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Copyright Numerate Inc

Measuring Success

Our problem: From a large set of molecules select a small number to make and test The challenge: Making and testing is expensive. How do you evaluate methods? It turns out that the probability of zero hits is: a = area above ROC curve of model n = number of molecules evaluated e = proportion of active compounds in original set k = number of compounds made and tested So area above ROC curve is an appropriate metric.

Page 26: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Copyright Numerate Inc

Measuring Success

Page 27: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Metrics aren’t the whole story

Be careful of metrics:

• Molecular weight often predicts well

• Intuitively too good to be true

• Likely a problem with your experiment!

Frequency

Molecular Weight

Copyright Numerate Inc.

Page 28: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Copyright Numerate Inc

Measuring Success

Evaluation experiments must approximate your problem well: • Reflect real world biases

• Cannot just randomly sub-sample your data • Reflect real world noise

• Cannot test noise robustness on pristine data • Pure statistical measures of loss don’t capture everything

• Remember the molecular weight example

So, identify the right metric but never completely trust it. You eventually have to test in the real world.

Page 29: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

VALIDATING ACTIVITY PREDICTIONS

Representative Examples Actives Notes

p38α MAP Kinase Inhibitor

(Ser-Thr Kinase) 31/44 (70%)

6 Cmpds IC50 < 1 μM

10 new series

In vivo activity

Acyl-CoA Acyl Transferase

Inhibitor (ACAT) 17/25 (68%) IC50 110 nM

Transglutaminase 2 Inhibitor

(TG2) 13/25 (52%)

Input: 2 series

Output: 5 series

Adrenergic Receptor Antagonist

α1a + α1d > α1b

5/8 (62%)

2/8 Selective Selectivity ≈ Tamsulosin

Built models and screened in silico library of commercially available compounds

GPCRs, other receptors, kinases, other enzymes, channels, macromolecular

Assayed diverse predicted active compounds in the laboratory >10 times

Copyright Numerate Inc.

Page 30: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

p38α MAP KINASE INHIBITORS

Built predictive model for p38a MAPK inhibitors

Literature IC50 data for 73 compounds

11 distinct structural classes of inhibitors represented

No target structural data used

Screened library of 700,000 commercial compounds in silico

44 diverse, predicted active compounds obtained for laboratory testing

31 of 44 compounds (70%) inhibit p38a MAPK by ≥ 50% at 10 μM

6 compounds with IC50 < 1 μM

Estimate Enrichment ≥ 700

Actives are structurally diverse => robust scaffold-hopping

New classes (scaffolds) identified equals number in the training set

Copyright Numerate Inc.

Page 31: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

p38a MAPK MODEL TRAINING SET

Triazines

UreasAzoles

BenzenediamidesAmidobenzamides

Benzophenones

Pyridinopyrimidinones

Indoles

Quinazolinones

Pyrrolopyrimidines

NN

HN

HN

OO

N

O

N

N

NN

HN

O

F

NH

N O

ClClN

F

Cl

NH

O Cl

H2N Br

HN

N

HN

O

O

N

N

NHN

N

N

NH2

N NNS

O

F

ClCl

F

N

HN

HNN

NH

O

O

NN

O

N

N

HN

N

Quinazolines

HN

N

N

N

HN

OH2NN

O

NH

NO

Cl

Copyright Numerate Inc.

Page 32: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

STRUCTURALLY DIVERSE HITS

• 31 / 44 compounds tested inhibit enzyme activity by ≥ 50% at 10 mm • Actives include 10 new structural classes

NN

HN

HN

O

S

NO 2

H2N NH

HN

O

SH 2N

ON

N

N

N

O

ON N

O

H

HN

OHN

O

F

NN

N

O

HN

O

O

NH

N

N S

O

O

Br

NN

S

O NH

O

O

NH

HN

PHIX-102 PHIX-104 PHIX-105 PHIX-111 PHIX-112 PHIX-113

PHIX-114

N

NN

N

NH

S

O

PHIX-116

HNO 2N

S

F

N N

S

PHIX-118

NS

O 2N O

N

OH

PHIX-119

HNO 2N

O

N N

S

O

O

PHIX-120

NN

NS

O

HN

NH

PHIX-121

N

N

SN

O

PHIX-122

NO

HNO

O

O

NH

NH

S

NH

O

PHIX-123 PHIX-124

HN

HN

HN

HN

OOOO

N

N

OH

HN

PHIX-128

NH

O

PHIX-130

NH

O 2NHN

O

NO 2

PHIX-132

PHIX-141N

O

HN

O

O

PHIX-138

N HN

O

S

Cl

F

C l

PHIX-139

N NH2

O

Cl

O

C l

PHIX-142O

HN

HN

OO

N

OO

N

ClCl

PHIX-143

N

O

HN

O

Cl

O

PHIX-145

Copyright Numerate Inc.

Page 33: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Goal: Protein kinase R inhibitor for TB, diabetes

Collaboration with Carl Nathan (Cornell)

Starting point: 20 PKR inhibitors in 5 series (BBRC 2003 308:50)

Step 1: Augment training data with 28 Ser/Thr Ki’s

Step 2: Identify diverse new PKR inhibitor chemotypes

Build model, screen 5M purchasables in silico

Acquire and assay 162 diverse, predicted actives 35/162 IC50< 20 mM, 8 new series

(BMCL 2011 21:4108)

NMRT-2862 is non-toxic to macrophages and

recapitulates effects of PKR knockout.

33 Copyright Numerate Inc

Page 34: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Goal: Transglutaminase 2 inhibitor for CS,CF

Collaboration with Chaitan Khosla (Stanford)

Starting point: 35 (mM+) actives in 2 series

Different binding sites/MOA

Step 1: Build 1st generation model, screen purchasables 4 new active series identified (A-D)

Step 2: Hit expansion in Series D

50 analogs, NIH-funded (BMCL 2011 21:2692)

Compound Series Ki (mM) MOA

NMRT-1002 A 3 Rev

NMRT-2149 B 6.5 Rev

NMRT-2235 C 5.9 Irrev

NMRT-2081 D 0.4 Rev, Non-comp

NMRT-2081

Copyright Numerate Inc

Page 35: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Compound WT L100I K103N Y181C

1.4 nM 23 nM 12 nM 1.9 nM

#1 3.8 9.1 13 200

#21 4.0 6.6 3.2 27

Goal: Versatile NNRTI for treatment of HIV/AIDS

Profile

Potent inhibitor of WT and drug-resistant HIV-1 replication

Improved PK/ADME profile and pharmaceutical properties

Models built with whole-cell (phenotypic) data

Huge chemical space outlined

Broad-spectrum lead series achieved in 3 months

Copyright Numerate Inc

Page 36: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Goal: Multi-targeted cardioprotective agent

Therapeutic profile

Potent reductions in cholesterol, triglycerides (TGs), C-reactive

protein (CRP), fibrinogen (FN), glucose

Enhanced anti-inflammatory and cardioprotective properties

Low potential for myopathy and drug-drug interactions

Pharmacologic profile

Potent inhibitor of HMG-CoA reductase

Potent inhibitor of p38α mitogen-activated protein kinase

↓ TNFα, IL-1b, CRP, FN, TGs, platelet reactivity, atherogenesis

↓ Gluconeogenesis, leptin resistance

Low affinity for retinoid X receptor α

Enhanced anti-inflammatory activity

Copyright Numerate Inc

Page 37: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

In vitro results

13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Little or no affinity for RXRα

Improved anti-inflammatory activity in whole cell assay

U.S. Patent claims issued for 3 series

Compound HMG CoA R p38α MAPK RXRα TNFα

7 nM > 100000 nM 38000 nM 93000 nM

1.3 > 100000 8100 82000

MTCP-1 33 11000 > 100000 31000

MTCP-2 2.0 36000 > 100000 30000

MTCP-3 3.7 11000 80000 21000

MTCP-4 130 1900 > 100000 18000

MTCP-5 11 6000 > 100000 230000

MTCP-6 1.2 400 > 100000 8000

MTCP-7 4.8 270 > 100000 9700

MTCP-8 98 < 100 > 100000 9700

Copyright Numerate Inc

Page 38: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

In vivo results

ApoE*3 Leiden transgenic mouse model

Significant Cholesterol reduction

Triglyceride reduction ≥ atorvastatin

Reduction in Fibrinogen & E-Selectin ≥ atorvastatin

Copyright Numerate Inc

Page 39: Machine Learning and Drug Designniejiazhong/slides/duffy.pdf · In vitro results 13/19 compounds inhibit HMG-CoA R, 10/19 inhibit p38a MAPK Dual-acting profile verified in 8/19 compounds

Conclusions

Successful applications of machine learning require thought:

• Define the problem

• Its not just about the “best” algorithm

• Match the algorithm to the problem

• Represent the key properties of the data

• Consider the bias

• Think about the noise

• Design your evaluation experiments carefully

• Validate in the real world.

Copyright Numerate Inc.