Protein-Ligand Interaction Prediction: An Improved Chemogenomics Approach
Laurent Jacob, Jean-Philippe Vert
Introduction
Predicting interactions between small molecules and proteins
  › Vital to the drug discovery process
  › Key to understanding biological processes
3 classes of drug targets
  › G-protein-coupled receptors (GPCRs)
  › Enzymes
  › Ion channels
Classical Methods
Consider each target independently from other proteins
Ligand-based approach
  › Compares candidate molecules to known ligands of the target
  › Requires knowledge of other ligands of the given target
Structure-based or docking approaches
  › Use the 3D structure of the target to determine how well a ligand can bind
  › Require the 3D structure of the target
  › Very time consuming
Neither can be applied if no ligand or 3D structure is known for a given target
Chemogenomics
Chemical space:
  › set of all small molecules
Biological space:
  › set of all proteins or protein families
Mine the entire chemical space for interactions with the biological space
Knowledge of some ligands for a target can help to predict ligands for similar targets
Chemogenomic Approaches
Ligand-based chemogenomics
  › Look at families or subfamilies of proteins
  › Model ligands at the level of a family
Target-based chemogenomics
  › Cluster receptors based on ligand binding site similarity
  › Use known ligands for each cluster to infer shared ligands
Target-ligand approach
  › Use binding information for targets to predict ligands for another target in a single step
Previous Experiments
Bock and Gough (2005)
  › Describe ligand-receptor complexes by merging ligand and target descriptors
  › Use machine learning methods to predict whether a ligand-receptor pair forms a complex
Erhan et al. (2006)
  › Merge a set of ligand descriptors with a set of receptor descriptors in a framework of neural networks and support vector machines
  › Offers great flexibility in the choice of descriptors
Proposed Method
Investigates different types of descriptors
Builds upon recent developments in kernel methods
  › In bio- and cheminformatics
Tests different methods for prediction of ligands
  › For 3 major classes of targets
Shows that the choice of representation greatly affects accuracy
New kernel based on hierarchies of receptors outperforms all other descriptors
  › Performs especially well for targets with few or no known ligands
Learning Problem
Given n target/molecule pairs $(t_1,c_1), \ldots, (t_n,c_n)$ known to form complexes or not
  › Each pair is represented by a vector $\Phi(t,c)$
Estimate a linear function
  › $f(t,c) = w^\top \Phi(t,c)$
Its sign is used to predict whether a chemical c can bind to a target t
The vector w is estimated from the training set
Vector Representation
Represent a molecule c by a vector $\Phi_{lig}(c) \in \mathbb{R}^{d_c}$
  › Encodes physicochemical and structural properties
  › Used to model interactions between small molecules and a single target
Represent a protein t by a vector $\Phi_{tar}(t) \in \mathbb{R}^{d_t}$
  › Captures properties of the protein's sequence or structure
  › Used to infer models that predict the structural or functional class of a protein
Need to represent a pair (c,t) in a single vector $\Phi(c,t)$
  › Capture interactions between features of the molecule and protein that can be useful predictors
  › Multiply a descriptor of c with a descriptor of t
Tensor Product
$\Phi(c,t) = \Phi_{lig}(c) \otimes \Phi_{tar}(t)$
Represents the set of all possible products of features of c and t
A vector of dimension $d_c \times d_t$
  › The (i,j)-th entry is the product of the i-th entry of $\Phi_{lig}(c)$ and the j-th entry of $\Phi_{tar}(t)$
Size may be prohibitively large
Use kernel methods
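As a rough illustration of the tensor product representation, the Python sketch below builds the flattened product vector for one pair and scores it with a linear function whose sign would give the prediction. The descriptor values, the weight vector, and the function names are hypothetical and chosen only for illustration.

```python
def tensor_product(lig, tar):
    """Flattened outer product: entry (i, j) is lig[i] * tar[j]."""
    return [x * y for x in lig for y in tar]


def decision(w, lig, tar):
    """Linear score w . Phi(c, t); its sign is the predicted class."""
    phi = tensor_product(lig, tar)
    return sum(wi * pi for wi, pi in zip(w, phi))


# Hypothetical descriptors (not taken from the slides).
lig_c = [1.0, 0.0, 2.0]   # ligand descriptor, d_c = 3
tar_t = [0.5, -1.0]       # target descriptor, d_t = 2

phi = tensor_product(lig_c, tar_t)   # length d_c * d_t = 6
w = [1.0] * len(phi)                 # placeholder weights, normally learned
score = decision(w, lig_c, tar_t)
print(len(phi), score)
```

Even in this toy case the pair vector has d_c x d_t entries, which is why the explicit representation becomes impractical for realistic descriptors.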
Kernel Trick
Can process large- or infinite-dimensional patterns if the inner product between any two patterns can be computed
Can factorize the inner product between two tensor product vectors
  › $(\Phi_{lig}(c) \otimes \Phi_{tar}(t))^\top (\Phi_{lig}(c') \otimes \Phi_{tar}(t'))$
  › $= \Phi_{lig}(c)^\top \Phi_{lig}(c') \times \Phi_{tar}(t)^\top \Phi_{tar}(t')$
Obtain the inner product between two tensor products
  › $K((c,t),(c',t')) = K_{ligand}(c,c') \times K_{target}(t,t')$
$K_{ligand}(c,c') = \Phi_{lig}(c)^\top \Phi_{lig}(c')$
$K_{target}(t,t') = \Phi_{tar}(t)^\top \Phi_{tar}(t')$
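The factorization can be checked numerically. The sketch below, with made-up integer descriptors and hypothetical function names, computes the inner product once through the explicit tensor product vectors and once as the product of the ligand and target inner products.

```python
def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def tensor_product(lig, tar):
    """Flattened outer product of a ligand and a target descriptor."""
    return [x * y for x in lig for y in tar]


def pair_kernel(lig_c, tar_t, lig_c2, tar_t2):
    """Product of the ligand kernel and the target kernel (linear kernels here)."""
    return dot(lig_c, lig_c2) * dot(tar_t, tar_t2)


# Made-up descriptors for two (molecule, target) pairs.
lig_c, tar_t = [1, 0, 2], [3, -1]
lig_c2, tar_t2 = [2, 1, 1], [0, 4]

explicit = dot(tensor_product(lig_c, tar_t), tensor_product(lig_c2, tar_t2))
factorized = pair_kernel(lig_c, tar_t, lig_c2, tar_t2)
print(explicit, factorized)
```

The factorized form never materializes the d_c x d_t vector, which is the point of the kernel trick in this setting.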
Kernels For Ligands
There have been impressive advances in the use of SVMs in chemoinformatics
Kernels have been designed using:
  › Physicochemical properties of molecules
  › 2D or 3D fingerprints
  › Comparison of 2D and 3D structures of molecules
    › Detection of common substructures in 2D graphs
    › Encoding various properties of 3D structures
Used in single-target virtual screening and prediction of pharmacokinetics and toxicity
Tanimoto Kernel
Classical choice
State-of-the-art performance
$K_{ligand}(c,c') = \frac{\Phi_{lig}(c)^\top \Phi_{lig}(c')}{\Phi_{lig}(c)^\top \Phi_{lig}(c) + \Phi_{lig}(c')^\top \Phi_{lig}(c') - \Phi_{lig}(c)^\top \Phi_{lig}(c')}$
$\Phi_{lig}(c)$ is a binary vector
Each bit indicates whether a given linear path of length l or less occurs as a subgraph of the 2D structure of c
  › Choose l=8
Computed with the ChemCPP software
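For binary vectors, each inner product in the Tanimoto formula reduces to a count of set bits, so the kernel can be sketched with Python sets. The fingerprints below are hypothetical bit indices, not real linear-path fingerprints, and returning 0.0 for two empty fingerprints is an assumed convention.

```python
def tanimoto_kernel(fp_a, fp_b):
    """Tanimoto kernel for binary fingerprints given as sets of 'on' bit indices.

    The number of shared bits plays the role of the inner product of the two
    binary vectors; the set sizes play the role of the squared norms.
    """
    common = len(fp_a & fp_b)
    denominator = len(fp_a) + len(fp_b) - common
    if denominator == 0:          # both fingerprints empty (assumed convention)
        return 0.0
    return common / denominator


# Hypothetical fingerprints: indices of the paths present in each molecule.
fp_c = {1, 2, 3}
fp_c2 = {2, 3, 4}
print(tanimoto_kernel(fp_c, fp_c2))
```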
Kernels For Targets
SVM and kernel methods are widely used in bioinformatics
Various kernels have been proposed based on:
  › Amino-acid sequence of proteins
  › 3D structures of proteins
  › Pattern of occurrences of proteins in multiple sequenced genomes
Used for various tasks related to structural or functional classification of proteins
Dirac Kernel
$K_{Dirac}(t,t')$
  › = 1 if t = t'
  › = 0 otherwise
Represents different targets as orthonormal vectors
Orthogonality between two proteins t and t' implies orthogonality between all pairs (c,t) and (c',t') for any two molecules c and c'
  › Learning is performed independently for each target protein
  › Does not share any information about known ligands between different targets
Multitask Kernel
$K_{multitask}(t,t') = 1 + K_{Dirac}(t,t')$
Removes the orthogonality
Combines target-specific properties of the ligands and general properties across all targets
Allows sharing of information during learning
Preserves the specificities of the ligands for each target
Cannot weight how much known interactions with one target should contribute to predictions for another target
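A minimal sketch of how the Dirac and multitask target kernels change the pair kernel. The ligand kernel value 0.8 and the target names are hypothetical; only the two target kernel definitions come from the slides.

```python
def dirac_kernel(t, t_prime):
    """1 if the two targets are identical, 0 otherwise."""
    return 1.0 if t == t_prime else 0.0


def multitask_kernel(t, t_prime):
    """1 + Dirac: every pair of targets shares a common component."""
    return 1.0 + dirac_kernel(t, t_prime)


def pair_kernel(ligand_similarity, t, t_prime, target_kernel):
    """Pair kernel as ligand kernel value times target kernel value."""
    return ligand_similarity * target_kernel(t, t_prime)


k_lig = 0.8  # hypothetical ligand kernel value between two molecules

for name, kernel in [("Dirac", dirac_kernel), ("multitask", multitask_kernel)]:
    same = pair_kernel(k_lig, "t1", "t1", kernel)
    cross = pair_kernel(k_lig, "t1", "t2", kernel)
    print(name, "same target:", same, "different targets:", cross)
```

With the Dirac kernel, pairs involving different targets always get a kernel value of zero, so nothing is shared; with the multitask kernel they keep a nonzero similarity that is the same for every pair of distinct targets.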
Mismatch and Local Alignment Kernels
Empirical observations suggest that molecules that bind to t are only likely to bind to t' if t and t' are similar in terms of structure or evolutionary history
  › Such similarity can be detected by comparing protein sequences
Mismatch kernel:
  › compares short sequences of amino acids up to some number of mismatches
  › Choose 3-mers with a maximum of one mismatch
Local alignment kernel:
  › uses the alignment score between the primary sequences of proteins to measure their similarity
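The mismatch kernel can be sketched as follows, assuming the usual (k, m)-mismatch feature map in which every k-mer of a sequence contributes to all k-mers within m substitutions of it. This is an unnormalized toy version with hypothetical function names; the implementation used for the experiments may differ in details such as normalization.

```python
from collections import Counter
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def neighborhood(kmer, max_mismatches):
    """All k-mers that differ from kmer in at most max_mismatches positions."""
    result = {kmer}
    for n in range(1, max_mismatches + 1):
        for positions in combinations(range(len(kmer)), n):
            for letters in product(AMINO_ACIDS, repeat=n):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                result.add("".join(chars))
    return result


def mismatch_features(sequence, k=3, max_mismatches=1):
    """Count, for each k-mer, how many k-mers of the sequence lie within its neighborhood."""
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        for neighbor in neighborhood(sequence[i:i + k], max_mismatches):
            counts[neighbor] += 1
    return counts


def mismatch_kernel(seq_a, seq_b, k=3, max_mismatches=1):
    """Inner product of the two mismatch feature vectors."""
    feats_a = mismatch_features(seq_a, k, max_mismatches)
    feats_b = mismatch_features(seq_b, k, max_mismatches)
    return sum(value * feats_b[key] for key, value in feats_a.items())


# Toy sequences, far shorter than real proteins.
print(mismatch_kernel("MKTAYIAK", "MKTAYLAK"))
```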
Hierarchy Kernel
$K_{hierarchy}(t,t') = \langle \Phi_h(t), \Phi_h(t') \rangle$
$\Phi_h(t)$ has a feature for each node in the hierarchy
  › Is set to 1 if the node is part of t's hierarchy
  › Is set to 0 otherwise
  › Plus one feature is constantly set to 1
Uses data from the target itself and data from other targets, giving less weight to targets that are farther away in the hierarchy
Performed the best in the experiments
Enzyme Hierarchy
Enzyme Commission numbers
  › International Union of Biochemistry and Molecular Biology (1992)
  › Classifies enzymes by the chemical reaction they catalyze
  › Four-level hierarchy
For example,
  › EC 1 includes oxidoreductases
  › EC 1.2 includes oxidoreductases that act on the aldehyde or oxo group of donors
  › EC 1.2.2 has a cytochrome as an acceptor
  › EC 1.2.2.1 catalyzes the oxidation of formate to bicarbonate
Enzymes that are close in the hierarchy should have similar ligands
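For enzymes, the hierarchy kernel can be sketched directly from EC numbers: the nodes on a target's path are the prefixes of its EC number, and the kernel counts the shared nodes plus one for the constant feature. Treating the full four-level EC number as a node is an assumption of this sketch, and the function names are hypothetical.

```python
def hierarchy_nodes(ec_number):
    """Nodes on the path of an EC number, e.g. '1.2.2.1' -> {'1', '1.2', '1.2.2', '1.2.2.1'}."""
    parts = ec_number.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}


def hierarchy_kernel(ec_a, ec_b):
    """Number of shared hierarchy nodes, plus 1 for the constant feature."""
    return 1 + len(hierarchy_nodes(ec_a) & hierarchy_nodes(ec_b))


for other in ["1.2.2.1", "1.2.1.3", "1.1.1.1", "3.4.21.4"]:
    print("1.2.2.1 vs", other, "->", hierarchy_kernel("1.2.2.1", other))
```

Targets that share a longer prefix get a larger kernel value, which is how closeness in the hierarchy translates into more shared information during learning.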
GPCR Hierarchy
GPCRs are grouped into four classes
  › Group A: rhodopsin family
  › Group B: secretin family
  › Group C: metabotropic family
  › Group D: regroups more diverse receptors
KEGG database subdivides the rhodopsin family into three subgroups
  › Amine receptors
  › Peptide receptors
  › Other receptors
And adds a second level of classification based on the type of ligands or known subdivisions
Ion Channel Hierarchy
The KEGG database divides ion channels into 8 classes
  › Cys-loop superfamily
  › Glutamate-gated cation channels
  › Epithelial and related Na+ channels
  › Voltage-gated cation channels
  › Related to voltage-gated cation channels
  › Related to inward rectifier K+ channels
  › Chloride channels
  › Related to ATPase-linked transporters
Each class is further subdivided
  › By, for example, the type of ligands or type of ion passing through the channel
Data Extraction
Extracted compound interaction data from the KEGG BRITE database
  › Known compounds for each target
  › Type of interaction
    › Enzymes: inhibitor, cofactor, effector
    › GPCR: antagonist, full/partial agonist
    › Ion Channels: pore blocker, positive/negative allosteric modulator, agonist, antagonist
Did not take into account
  › Orthologs of targets
  › Enzymes with same EC number
  › Compounds with no molecular descriptor
    › Primarily peptides
  › Targets with no known compounds
Data Points
Generated as many negative ligand-target pairs as known ligand-target pairs
  › Randomly chose ligands
  › This may produce false negatives
  › Experimentally confirmed negative pairs would be needed to avoid them
2436 data points for enzymes
  › 675 enzymes, 524 compounds
798 data points for GPCRs
  › 100 receptors, 219 compounds
2230 data points for ion channels
  › 114 channels, 462 compounds
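The negative-pair generation can be sketched as below. This is only one plausible reading of the procedure: for each target, it draws as many random compounds as the target has known ligands, from compounds not known to bind it. Sampling without replacement, the fixed seed, and the function name are assumptions of this sketch.

```python
import random


def sample_negatives(positive_pairs, seed=0):
    """Draw random (target, ligand) pairs that are not among the known positives."""
    rng = random.Random(seed)
    known = {}
    for target, ligand in positive_pairs:
        known.setdefault(target, set()).add(ligand)
    all_ligands = sorted({ligand for _, ligand in positive_pairs})

    negatives = []
    for target, ligands in sorted(known.items()):
        # Candidates are compounds in the dataset not known to bind this target.
        candidates = [l for l in all_ligands if l not in ligands]
        count = min(len(ligands), len(candidates))
        for ligand in rng.sample(candidates, count):
            negatives.append((target, ligand))
    return negatives


# Hypothetical toy dataset of known interactions.
positives = [("t1", "a"), ("t1", "b"), ("t2", "c"), ("t3", "d")]
print(sample_negatives(positives))
```

A randomly drawn pair may in fact interact, which is the false-negative risk noted on the slide.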
Known Ligands
Distribution of the number of known ligands per target for the enzyme, GPCR, and ion channel datasets
Each bar indicates the proportion of targets for which a given number of training points are available
Few compounds are known for most targets
Jacob, L. et al. Bioinformatics 2008 24:2149-2156; doi:10.1093/bioinformatics/btn409
Experiments
Experiment 1
  › Trained an SVM classifier on all points involving other targets of the family plus a fraction of points involving t
  › Tested on the remaining data points for t
  › Assesses the accuracy for a given target when using ligands for other targets for training
Experiment 2
  › Trained an SVM classifier using only interactions that did not involve t
  › Tested on data points that did involve t
  › Simulated making predictions for targets with no known ligands
Measured performance using the area under the ROC curve (AUC)
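The evaluation protocol can be sketched in a few lines: a split in the style of Experiment 2 that holds out every pair involving one target, and an AUC computed as the fraction of positive/negative pairs ranked correctly (ties count one half). The data layout as (target, ligand, label) triples and the function names are assumptions made for illustration; no SVM training is shown.

```python
def leave_target_out(data, target):
    """Experiment 2 style split: train without the target, test only on it."""
    train = [row for row in data if row[0] != target]
    test = [row for row in data if row[0] == target]
    return train, test


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y != 1]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labelled pairs: (target, ligand, label).
data = [("t1", "a", 1), ("t1", "b", 0), ("t2", "a", 0), ("t2", "c", 1)]
train, test = leave_target_out(data, "t1")
print(len(train), len(test), auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))
```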
Results: Experiment 1
Mean AUC on each dataset with various target kernels
Hierarchy kernel shows significant improvements
  › Sharing information for known ligands of different targets
  › Incorporating prior information into the kernels
Ktar \ Target      Enzymes        GPCR           Channels
Dirac              0.646±0.009    0.750±0.023    0.770±0.020
Multitask          0.931±0.006    0.749±0.022    0.873±0.015
Hierarchy          0.955±0.005    0.926±0.015    0.925±0.012
Mismatch           0.725±0.009    0.805±0.023    0.875±0.015
Local alignment    0.676±0.009    0.824±0.021    0.901±0.013
Gram Matrices
Target kernel Gram matrices ($K_{tar}$) for ion channels with multitask, hierarchy, and local alignment kernels
Hierarchy kernel adds structure information
Local alignment kernel retains some substructures
For GPCR and enzymes, almost no structure is found by the sequence kernels
Relative Improvement
Relative improvement of the hierarchy kernel against the Dirac kernel as a function of the number of known ligands for the enzyme, GPCR, and ion channel datasets
Strong improvement when few ligands are known
The improvement decreases as more training points become available
After a certain point, performance is impaired relative to the Dirac kernel
Results: Experiment 2
Mean AUC on each dataset with various target kernels
Dirac kernel performs at chance level (AUC 0.5)
  › It learns each target separately, so it has no training data for t
Hierarchy kernel still gives reasonable results
  › AUC drops by 0.017, 0.051, and 0.072 (1.7, 5.1, and 7.2 points) for enzymes, GPCR, and ion channels compared to the first experiment
Ktar \ Target      Enzymes        GPCR           Channels
Dirac              0.500±0.000    0.500±0.000    0.500±0.000
Multitask          0.902±0.008    0.576±0.026    0.704±0.026
Hierarchy          0.938±0.006    0.875±0.020    0.853±0.019
Mismatch           0.602±0.008    0.703±0.027    0.729±0.024
Local alignment    0.535±0.005    0.751±0.025    0.772±0.023
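The loss figures quoted for the hierarchy kernel can be recomputed from the two results tables. The short sketch below uses the mean AUC values of the Hierarchy rows; the variable names are hypothetical. It reports both the absolute AUC difference and the difference relative to Experiment 1, since the quoted percentages appear to correspond to the absolute difference.

```python
# Mean AUC of the hierarchy kernel, copied from the two results tables.
experiment_1 = {"Enzymes": 0.955, "GPCR": 0.926, "Channels": 0.925}
experiment_2 = {"Enzymes": 0.938, "GPCR": 0.875, "Channels": 0.853}

# Absolute AUC loss when no ligand of the target is available for training.
absolute_loss = {
    name: experiment_1[name] - experiment_2[name] for name in experiment_1
}

for name, loss in absolute_loss.items():
    relative = loss / experiment_1[name]
    print(f"{name}: absolute loss {loss:.3f}, relative loss {100 * relative:.1f}%")
```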
References
1. Rognan D: Chemogenomic approaches to rational drug design. Br J Pharmacol 2007, 152:38-52.
2. Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases at GenomeNet. Nucl. Acids Res. 2002, 30:42-46.
3. Jacob L, Vert J: Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 2008, 24:2149-2156.
4. Erhan D, L'Heureux P, Yue SY, Bengio Y: Collaborative Filtering on a Family of Biological Targets. Journal of Chemical Information and Modeling 2006, 46:626-635.
5. Bock JR, Gough DA: Virtual Screen for Ligands of Orphan G Protein-Coupled Receptors. Journal of Chemical Information and Modeling 2005, 45:1402-1414.