Protein-Ligand Interaction Prediction: An Improved Chemogenomics Approach

Laurent Jacob, Jean-Philippe Vert

Introduction

Predicting interactions between small molecules and proteins
› Vital to the drug discovery process
› Key to understanding biological processes
Three classes of drug targets
› G-protein-coupled receptors (GPCRs)
› Enzymes
› Ion channels

Classical Methods

Consider each target independently from other proteins
Ligand-based approach
› Compares candidate molecules to known ligands of the target
› Requires knowledge of other ligands of the given target
Structure-based or docking approaches
› Use the 3D structure of the target to determine how well a ligand can bind
› Require the 3D structure of the target
› Very time consuming
Cannot be applied if no ligand or 3D structure is known for a given target

Chemogenomics

Chemical space
› Set of all small molecules
Biological space
› Set of all proteins or protein families
Mine the entire chemical space for interactions with the biological space
Knowledge of some ligands for a target can help to predict ligands for similar targets

Chemogenomic Approaches

Ligand-based chemogenomics
› Look at families or subfamilies of proteins
› Model ligands at the level of a family
Target-based chemogenomics
› Cluster receptors based on ligand binding site similarity
› Use known ligands for each cluster to infer shared ligands
Target-ligand approach
› Use binding information for targets to predict ligands for another target in a single step

Previous Experiments

Bock and Gough (2005)
› Describe ligand-receptor complexes by merging ligand and target descriptors
› Use machine learning methods to predict if a ligand-receptor pair forms a complex
Erhan et al. (2006)
› Merge a set of ligand descriptors with a set of receptor descriptors in a framework of neural networks and support vector machines
› Offers great flexibility in the choice of descriptors

Proposed Method

Investigates different types of descriptors
Builds upon recent developments in kernel methods
› In bio- and cheminformatics
Tests different methods for prediction of ligands
› For 3 major classes of targets
Shows that the choice of representation greatly affects accuracy
New kernel based on hierarchies of receptors outperforms all other descriptors
› Performs especially well for targets with few or no known ligands

Learning Problem

Given n target/molecule pairs $(t_1,c_1), \ldots, (t_n,c_n)$ known to form complexes or not
› Each pair is represented by a vector $\Phi(t,c)$
Estimate a linear function
› $f(t,c) = w^\top \Phi(t,c)$
› Its sign is used to predict whether a chemical c can bind to a target t
The vector w is estimated from the training set

Vector Representation

Represent a molecule c by a vector $\Phi_{lig}(c) \in \mathbb{R}^{d_c}$
› Encodes physicochemical and structural properties
› Used to model interactions between small molecules and a single target
Represent a protein t by a vector $\Phi_{tar}(t) \in \mathbb{R}^{d_t}$
› Captures properties of the protein's sequence or structure
› Used to infer models that predict the structural or functional class of a protein
Need to represent a pair (c,t) in a single vector
› Capture interactions between features of the molecule and protein that can be useful predictors
› Multiply a descriptor of c with a descriptor of t

Tensor Product

$\Phi(c,t) = \Phi_{lig}(c) \otimes \Phi_{tar}(t)$
Represents the set of all possible products of features of c and t
A vector of dimension $d_c \times d_t$
› The (i,j)-th entry is the product of the i-th entry of $\Phi_{lig}(c)$ and the j-th entry of $\Phi_{tar}(t)$
Size may be prohibitively large
Use kernel methods
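The tensor-product representation and the linear score built on it can be illustrated with a minimal sketch. It assumes descriptors are plain Python lists; the function names `tensor_features` and `score` and all numeric values are hypothetical and chosen only for illustration.

```python
def tensor_features(lig, tar):
    """Flattened tensor (outer) product: entry (i, j) is lig[i] * tar[j]."""
    return [a * b for a in lig for b in tar]


def score(w, lig, tar):
    """Linear score w . Phi(c, t); its sign gives the predicted class."""
    phi = tensor_features(lig, tar)
    return sum(wi * xi for wi, xi in zip(w, phi))


lig_c = [1.0, 0.0, 2.0]   # hypothetical ligand descriptor, d_c = 3
tar_t = [0.5, -1.0]       # hypothetical target descriptor, d_t = 2

phi = tensor_features(lig_c, tar_t)
print(len(phi))           # d_c * d_t entries

w = [1, 0, 0, 0, 0, 1]    # hypothetical weight vector of matching size
print("binds" if score(w, lig_c, tar_t) > 0 else "does not bind")
```

The explicit vector grows as d_c times d_t, so descriptors with thousands of entries each already give millions of pair features, which is why the explicit form is avoided in practice.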

Kernel Trick

Can process large- or infinite-dimensional patterns if the inner product between any two patterns can be computed
Can factorize the inner product between two tensor product vectors
› $(\Phi_{lig}(c) \otimes \Phi_{tar}(t))^\top (\Phi_{lig}(c') \otimes \Phi_{tar}(t')) = \Phi_{lig}(c)^\top \Phi_{lig}(c') \times \Phi_{tar}(t)^\top \Phi_{tar}(t')$
Obtain the inner product between two tensor products
› $K((c,t),(c',t')) = K_{ligand}(c,c') \times K_{target}(t,t')$
› $K_{ligand}(c,c') = \Phi_{lig}(c)^\top \Phi_{lig}(c')$
› $K_{target}(t,t') = \Phi_{tar}(t)^\top \Phi_{tar}(t')$
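The factorization can be checked numerically on a small example. This is a sketch with made-up descriptor values; `dot` and `tensor` are hypothetical helper names, not part of any library.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def tensor(u, v):
    """Flattened tensor product of u and v."""
    return [a * b for a in u for b in v]


# Hypothetical descriptors for two ligands and two targets.
lig_c, lig_c2 = [1.0, 0.0, 2.0], [3.0, 1.0, 1.0]
tar_t, tar_t2 = [0.5, -1.0], [2.0, 4.0]

# Inner product computed in the explicit d_c * d_t dimensional space.
explicit = dot(tensor(lig_c, tar_t), tensor(lig_c2, tar_t2))

# Same quantity from the two small kernels, never forming the tensor product.
factorized = dot(lig_c, lig_c2) * dot(tar_t, tar_t2)

print(explicit, factorized)
```

Because only the two small inner products are needed, any valid ligand kernel and target kernel can be plugged in, including ones whose feature space is never written down.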

Kernels For Ligands

There have been impressive advances in the use of SVMs in chemoinformatics
Kernels have been designed using:
› Physicochemical properties of molecules
› 2D or 3D fingerprints
› Comparison of 2D and 3D structures of molecules
  › Detection of common substructures in 2D graphs
  › Encoding various properties of 3D structures
Used in single-target virtual screening and prediction of pharmacokinetics and toxicity

Tanimoto Kernel

Classical choice
State-of-the-art performance
$K_{ligand}(c,c') = \frac{\Phi_{lig}(c)^\top \Phi_{lig}(c')}{\Phi_{lig}(c)^\top \Phi_{lig}(c) + \Phi_{lig}(c')^\top \Phi_{lig}(c') - \Phi_{lig}(c)^\top \Phi_{lig}(c')}$
$\Phi_{lig}(c)$ is a binary vector
Each bit indicates whether a given linear path of length l or less occurs as a subgraph of the 2D structure of c
› Choose l = 8
Used the ChemCPP software to compute the kernel
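For binary fingerprints the Tanimoto formula reduces to shared bits divided by bits set in either vector. The sketch below assumes fingerprints are already available as lists of 0/1 values; it does not reproduce the path enumeration done by ChemCPP, and the zero-denominator convention is an assumption.

```python
def tanimoto(x, y):
    """Tanimoto kernel between two binary fingerprints of equal length."""
    xy = sum(a * b for a, b in zip(x, y))   # bits set in both
    xx = sum(a * a for a in x)              # bits set in x
    yy = sum(b * b for b in y)              # bits set in y
    denom = xx + yy - xy                    # bits set in x or y
    # Assumed convention: two all-zero fingerprints get similarity 0.
    return xy / denom if denom else 0.0


# Hypothetical 4-bit fingerprints.
fp_a = [1, 1, 0, 1]
fp_b = [1, 0, 0, 1]
print(tanimoto(fp_a, fp_b))
print(tanimoto(fp_a, fp_a))
```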

Kernels For Targets

SVM and kernel methods are widely used in bioinformatics
Various kernels have been proposed based on:
› Amino-acid sequence of proteins
› 3D structures of proteins
› Pattern of occurrences of proteins in multiple sequenced genomes
Used for various tasks related to structural or functional classification of proteins

Dirac Kernel

$K_{Dirac}(t,t')$
› = 1 if t = t'
› = 0 otherwise
Represents different targets as orthonormal vectors
Orthogonality between two proteins t and t' implies orthogonality between all pairs (c,t) and (c',t') for any two molecules c and c'
› Learning is performed independently for each target protein
› Does not share any information about known ligands between different targets

Multitask Kernel

$K_{multitask}(t,t') = 1 + K_{Dirac}(t,t')$
Removes the orthogonality
Combines target-specific properties of the ligands and general properties across all targets
Allows sharing of information during learning
Preserves the specificities of the ligands for each target
Does not allow weighting how much the known interactions of one target contribute to predictions for another
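The difference between the Dirac and multitask kernels is easiest to see on their Gram matrices. This is a small sketch in which targets are identified by arbitrary labels; the helper name `gram` and the labels are hypothetical.

```python
def k_dirac(t, t2):
    """1 if the two targets are the same protein, 0 otherwise."""
    return 1.0 if t == t2 else 0.0


def k_multitask(t, t2):
    """Dirac kernel plus a constant shared by all pairs of targets."""
    return 1.0 + k_dirac(t, t2)


def gram(kernel, targets):
    """Gram matrix of a target kernel over a list of targets."""
    return [[kernel(a, b) for b in targets] for a in targets]


targets = ["t1", "t2", "t3"]   # hypothetical target labels
for row in gram(k_dirac, targets):
    print(row)                 # identity matrix: no sharing between targets
for row in gram(k_multitask, targets):
    print(row)                 # off-diagonal 1s: every other target contributes equally
```

The constant off-diagonal entries show why the multitask kernel treats all other targets alike, whatever their similarity to the target of interest.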

Mismatch and Local Alignment Kernels

Empirical observations suggest that molecules that bind to t are likely to bind to t' only if t and t' are similar in terms of structure or evolutionary history
› Such similarity can be detected by comparing protein sequences
Mismatch kernel
› Compares short sequences of amino acids up to some number of mismatches
› Choose 3-mers with a maximum of one mismatch
Local alignment kernel
› Uses the alignment score between the primary sequences of proteins to measure their similarity
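A mismatch kernel over 3-mers with at most one mismatch can be sketched in a few lines. This is an unnormalized, brute-force illustration that assumes the 20 standard amino-acid letters; the function names are hypothetical, and practical implementations use faster data structures.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # assumed alphabet: 20 standard residues


def neighbors(kmer, alphabet=AMINO_ACIDS):
    """All k-mers that differ from kmer in at most one position."""
    out = {kmer}
    for i in range(len(kmer)):
        for a in alphabet:
            out.add(kmer[:i] + a + kmer[i + 1:])
    return out


def mismatch_features(seq, k=3):
    """For each k-mer, count the k-mers of seq matching it with <= 1 mismatch."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        for nb in neighbors(seq[i:i + k]):
            feats[nb] += 1
    return feats


def mismatch_kernel(s1, s2, k=3):
    """Inner product of the two mismatch feature vectors."""
    f1, f2 = mismatch_features(s1, k), mismatch_features(s2, k)
    return sum(v * f2[key] for key, v in f1.items() if key in f2)


# Hypothetical short sequences.
print(mismatch_kernel("MKVLA", "MKILA"))
print(mismatch_kernel("MKVLA", "WWWWW"))
```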

Hierarchy Kernel

$K_{hierarchy}(t,t') = \langle \Phi_h(t), \Phi_h(t') \rangle$
$\Phi_h(t)$ has a feature for each node in the hierarchy
› Set to 1 if the node is part of t's hierarchy
› Set to 0 otherwise
› Plus one feature constantly set to 1
Uses data from the target itself and data from other targets, giving the latter a smaller weight
Performed the best in the experiments

Enzyme Hierarchy

Enzyme Commission (EC) numbers
› International Union of Biochemistry and Molecular Biology (1992)
› Classify enzymes by the chemical reaction they catalyze
› Four-level hierarchy
For example,
› EC 1 includes oxidoreductases
› EC 1.2 includes oxidoreductases that act on the aldehyde or oxo group of donors
› EC 1.2.2 has NAD+ or NADP+ as an acceptor
› EC 1.2.2.1 catalyzes the oxidation of formate to bicarbonate
Enzymes that are close in the hierarchy should have similar ligands
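With binary node features, the hierarchy kernel between two enzymes equals the number of hierarchy nodes they share plus one for the constant feature. The sketch below assumes targets are labeled with dotted EC-style identifiers; the function names are hypothetical.

```python
def hierarchy_nodes(ec):
    """Nodes on the path from the top level down to the target.

    '1.2.2.1' -> {'1', '1.2', '1.2.2', '1.2.2.1'}
    """
    parts = ec.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}


def k_hierarchy(t, t2):
    """Number of shared hierarchy nodes, plus 1 for the constant feature."""
    return len(hierarchy_nodes(t) & hierarchy_nodes(t2)) + 1


print(k_hierarchy("1.2.2.1", "1.2.2.1"))   # same enzyme class
print(k_hierarchy("1.2.2.1", "1.2.1.2"))   # shares EC 1 and EC 1.2
print(k_hierarchy("1.2.2.1", "3.4.1.1"))   # only the constant feature in common
```

The kernel value decreases as two enzymes diverge higher in the hierarchy, which is how distant targets end up contributing with a smaller weight than close ones.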

GPCR Hierarchy

GPCRs are grouped into four classes
› Group A: rhodopsin family
› Group B: secretin family
› Group C: metabotropic family
› Group D: regroups more diverse receptors
The KEGG database subdivides the rhodopsin family into three subgroups
› Amine receptors
› Peptide receptors
› Other receptors
And adds a second level of classification based on the type of ligands or known subdivisions

Ion Channel Hierarchy

The KEGG database divides ion channels into 8 classes
› Cys-loop superfamily
› Glutamate-gated cation channels
› Epithelial and related Na+ channels
› Voltage-gated cation channels
› Related to voltage-gated cation channels
› Related to inward rectifier K+ channels
› Chloride channels
› Related to ATPase-linked transporters
Each class is further subdivided
› By, for example, the type of ligands or type of ion passing through the channel

Data Extraction

Extracted compound interaction data from the KEGG BRITE database
› Known compounds for each target
› Type of interaction
  › Enzymes: inhibitor, cofactor, effector
  › GPCR: antagonist, full/partial agonist
  › Ion channels: pore blocker, positive/negative allosteric modulator, agonist, antagonist
Did not take into account
› Orthologs of targets
› Enzymes with same EC number
› Compounds with no molecular descriptor
  › Primarily peptides
› Targets with no known compounds

Data Points

Generated as many negative ligand-target pairs as known ligand-target pairs
› Randomly chose ligands
› May produce false negatives
› Experimentally confirmed negative pairs would be needed to avoid this
2436 data points for enzymes
› 675 enzymes, 524 compounds
798 data points for GPCRs
› 100 receptors, 219 compounds
2230 data points for ion channels
› 114 channels, 462 compounds
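One way to generate a balanced set of negatives is sketched below. It assumes one random ligand is drawn per known pair, excluding ligands already known to bind the target; the exact sampling procedure used for the datasets is not given in these slides, and the function name and toy data are hypothetical.

```python
import random


def sample_negatives(positives, ligands, seed=0):
    """Draw one presumed-negative (target, ligand) pair per known positive pair."""
    rng = random.Random(seed)
    known = set(positives)
    negatives = []
    for target, _ in positives:
        # Candidate ligands not known (or already drawn) for this target.
        candidates = [c for c in ligands if (target, c) not in known]
        if candidates:
            pair = (target, rng.choice(candidates))
            negatives.append(pair)
            known.add(pair)   # avoid drawing the same negative twice
    return negatives


# Hypothetical toy data: three known interactions, four compounds.
positives = [("t1", "a"), ("t1", "b"), ("t2", "c")]
ligands = ["a", "b", "c", "d"]
negatives = sample_negatives(positives, ligands)
print(negatives)
```

Pairs drawn this way are only presumed negatives: some may be true but unrecorded interactions, which is the false-negative problem noted in the slide.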

Known Ligands

Distribution of the number of known ligands per target for enzymes, GPCR, and ion channel datasets

Each bar indicates the proportion of targets for which a given number of training points are available

Few compounds are known for most targets

Jacob, L. et al. Bioinformatics 2008 24:2149-2156; doi:10.1093/bioinformatics/btn409

Experiments

Experiment 1
› For each target t, trained an SVM classifier on all points involving other targets of the family plus a fraction of the points involving t
› Tested on the remaining data points for t
› Assesses the accuracy for a given target when using ligands for other targets for training
Experiment 2
› For each target t, trained an SVM classifier using only interactions that did not involve t
› Tested on data points that did involve t
› Simulated making predictions for targets with no known ligands
Measured performance using the area under the ROC curve (AUC)
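The leave-one-target-out split of Experiment 2 and the AUC measure can be sketched as follows. This is an illustration only: the SVM training step is omitted, pairs are assumed to be (target, ligand, label) tuples, and the function names are hypothetical.

```python
def leave_target_out(pairs, target):
    """Split (target, ligand, label) tuples: train without `target`, test on it."""
    train = [p for p in pairs if p[0] != target]
    test = [p for p in pairs if p[0] == target]
    return train, test


def auc(scores, labels):
    """Probability that a random positive scores above a random negative.

    Ties count as one half. Labels are 1 for positives, anything else for negatives.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical pairs and classifier scores.
pairs = [("t1", "a", 1), ("t1", "d", 0), ("t2", "c", 1), ("t2", "a", 0)]
train, test = leave_target_out(pairs, "t1")
print(len(train), len(test))
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

A classifier that gives every pair the same score obtains an AUC of 0.5 under this definition, the value expected from random guessing.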

Results: Experiment 1

Mean AUC on each dataset with various target kernels
Hierarchy kernel shows significant improvements
› Sharing information for known ligands of different targets
› Incorporating prior information into the kernels

Ktar \ Target     Enzymes        GPCR           Channels
Dirac             0.646±0.009    0.750±0.023    0.770±0.020
Multitask         0.931±0.006    0.749±0.022    0.873±0.015
Hierarchy         0.955±0.005    0.926±0.015    0.925±0.012
Mismatch          0.725±0.009    0.805±0.023    0.875±0.015
Local alignment   0.676±0.009    0.824±0.021    0.901±0.013

Gram Matrices

Target kernel Gram matrices (Ktar) for ion channels with multitask, hierarchy, and local alignment kernels
Hierarchy kernel adds structure information
Local alignment kernel retains some substructures
For GPCR and enzymes, almost no structure is found by the sequence kernels


Relative Improvement

Relative improvement of the hierarchy kernel against the Dirac kernel as a function of the number of known ligands for enzymes, GPCR, and ion channel datasets
Strong improvement when few ligands are known
Decreases when enough training points become available
After a certain point, performance is impaired


Results: Experiment 2

Mean AUC on each dataset with various target kernels
Dirac kernel showed random behavior
› Learning with no training data
Hierarchy kernel still gives reasonable results
› Loss of 1.7, 5.1, and 7.2 AUC points (0.017, 0.051, and 0.072) for enzymes, GPCR, and ion channels compared to the first experiment

Ktar \ Target     Enzymes        GPCR           Channels
Dirac             0.500±0.000    0.500±0.000    0.500±0.000
Multitask         0.902±0.008    0.576±0.026    0.704±0.026
Hierarchy         0.938±0.006    0.875±0.020    0.853±0.019
Mismatch          0.602±0.008    0.703±0.027    0.729±0.024
Local alignment   0.535±0.005    0.751±0.025    0.772±0.023

References

1. Rognan D: Chemogenomic approaches to rational drug design. Br J Pharmacol 2007, 152:38-52.

2. Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases at GenomeNet. Nucl. Acids Res. 2002, 30:42-46.

3. Jacob L, Vert J: Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 2008, 24:2149-2156.

4. Erhan D, L'Heureux P, Yue SY, Bengio Y: Collaborative Filtering on a Family of Biological Targets. Journal of Chemical Information and Modeling 2006, 46:626-635.

5. Bock JR, Gough DA: Virtual Screen for Ligands of Orphan G Protein-Coupled Receptors. Journal of Chemical Information and Modeling 2005, 45:1402-1414.
