A Probabilistic Approach for Collective Similarity-based ... · “sridhar-bioinformatics15” —...

8
“sridhar-bioinformatics15” — 2016/5/25 — page 1 — #1 Doc-Start Bioinformatics doi.10.1093/bioinformatics/xxxxxx Advance Access Publication Date: Day Month Year Manuscript Category A Probabilistic Approach for Collective Similarity-based Drug-Drug Interaction Prediction Dhanya Sridhar 1,* , Shobeir Fakhraei 1,2 and Lise Getoor 1 1 Computer Science Department, University of California Santa Cruz, Santa Cruz, CA, 95050, United States 2 Computer Science Department, University of Maryland, College Park, MD, 20740, United States * [email protected] Associate Editor: Jonathan Wren Received on XXXXX; revised on XXXXX; accepted on XXXXX Abstract Motivation: As concurrent use of multiple medications becomes ubiquitous among patients, it is crucial to characterize both adverse and synergistic interactions between drugs. Statistical methods for prediction of putative drug-drug interactions can guide in-vitro testing and cut down significant cost and effort. With the abundance of experimental data characterizing drugs and their associated targets, such methods must effectively fuse multiple sources of information and perform inference over the network of drugs. Results: We propose a probabilistic approach for jointly inferring unknown drug-drug interactions from a network of multiple drug-based similarities and known interactions. We use the highly scalable and easily extensible probabilistic programming framework Probabilistic Soft Logic. We compare against two methods including a state-of-the-art drug-drug interaction prediction system across three experiments and show best performing improvements of more than 50% in AUPR over both baselines. We find five novel interactions validated by external sources among the top-ranked predictions of our model. Availability: Final versions of all datasets and implementations will be made publicly available. Contact: [email protected] 1 Introduction Increasingly, patients use multiple pharamceutical drugs simultaneously to treat their illnesses. Interactions between drugs can result in reduced efficacy of one or more drugs, and in some cases, even deletrious side- effects. The risk of adverse effects is higher in demographics like the elderly that commonly take multiple medications at once. On the other hand, certain drugs interact to produce synergistic effects that are more effective in combatting diseases like cancer (Nahta et al., 2004; Chou, 2010). Crowther et al. (1997) characterizes a drug-drug interaction (DDI) as a drug effect that is greater or less than expected in the presence of another drug. While it remains crucial to verify potential DDIs in vitro, it is prohibitively expensive to exhaustively test all possible interactions. Therefore, computational modeling and predictive methods provide a viable way to identify the most salient potential interactions for downstream experimental validation (Zhang et al., 2009). Interactions between drugs are classified as pharmacokinetic and pharmacodynamic. Computational and mathematical modeling methods rely on current understanding of the mechanisms underlying each of these types of interactions and are specific to each interaction type. A pharmacokinetic interaction with a drug affects the process by which the other drug is absorbed, distributed, metabolized or excreted in the body (Crowther et al., 1997). On the other hand, drugs acting on the same receptor, site of action, or physiological system constitutes a pharmacodynamic interaction. While pharmacokinetic interactions are usually associated with an adverse or exaggerated response, pharmacodynamic interactions are implicated in both synergistic and detrimental effects. Many pharmacokinetic interactions are facilitated by the enzyme family Cytochrome P450 (CYP) and extensive but incomplete knowledge of its mechanisms have been used for computational modeling of pharmacokinetic interactions (Ekins and Wrighton, 2001). Similarly, prior work has applied mathematical modeling of known drug response mechanisms to simulate and predict pharmacodynamic interactions (Jonker et al., 2005; Jin et al., 2011). In contrast to computational modeling, statistical and predictive methods leverage data and evidence from related experiments as domain knowledge and biological priors. Recent advancements in high-throughput experimentation have generated a wealth of biological characterizations of drug compounds and their target genes (Fakhraei et al., 2015). A key challenge for statistical models of drug-drug interactions, or the closely © The Author 2016. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected] 1

Transcript of A Probabilistic Approach for Collective Similarity-based ... · “sridhar-bioinformatics15” —...

Page 1: A Probabilistic Approach for Collective Similarity-based ... · “sridhar-bioinformatics15” — 2016/5/25 — page 3 — #3 A Probabilistic Approach for Collective Similarity-based

“sridhar-bioinformatics15” — 2016/5/25 — page 1 — #1

Doc-Start

Bioinformaticsdoi.10.1093/bioinformatics/xxxxxx

Advance Access Publication Date: Day Month YearManuscript Category

A Probabilistic Approach for CollectiveSimilarity-based Drug-Drug Interaction PredictionDhanya Sridhar 1,∗, Shobeir Fakhraei 1,2 and Lise Getoor1

1Computer Science Department, University of California Santa Cruz, Santa Cruz, CA, 95050, United States2Computer Science Department, University of Maryland, College Park, MD, 20740, United States

[email protected]

Associate Editor: Jonathan Wren

Received on XXXXX; revised on XXXXX; accepted on XXXXX

Abstract

Motivation: As concurrent use of multiple medications becomes ubiquitous among patients, it is crucial tocharacterize both adverse and synergistic interactions between drugs. Statistical methods for predictionof putative drug-drug interactions can guide in-vitro testing and cut down significant cost and effort. Withthe abundance of experimental data characterizing drugs and their associated targets, such methodsmust effectively fuse multiple sources of information and perform inference over the network of drugs.Results: We propose a probabilistic approach for jointly inferring unknown drug-drug interactions froma network of multiple drug-based similarities and known interactions. We use the highly scalable andeasily extensible probabilistic programming framework Probabilistic Soft Logic. We compare against twomethods including a state-of-the-art drug-drug interaction prediction system across three experimentsand show best performing improvements of more than 50% in AUPR over both baselines. We find fivenovel interactions validated by external sources among the top-ranked predictions of our model.Availability: Final versions of all datasets and implementations will be made publicly available.Contact: [email protected]

1 Introduction

Increasingly, patients use multiple pharamceutical drugs simultaneouslyto treat their illnesses. Interactions between drugs can result in reducedefficacy of one or more drugs, and in some cases, even deletrious side-effects. The risk of adverse effects is higher in demographics like theelderly that commonly take multiple medications at once. On the otherhand, certain drugs interact to produce synergistic effects that are moreeffective in combatting diseases like cancer (Nahta et al., 2004; Chou,2010). Crowther et al. (1997) characterizes a drug-drug interaction (DDI)as a drug effect that is greater or less than expected in the presenceof another drug. While it remains crucial to verify potential DDIsin vitro, it is prohibitively expensive to exhaustively test all possibleinteractions. Therefore, computational modeling and predictive methodsprovide a viable way to identify the most salient potential interactions fordownstream experimental validation (Zhang et al., 2009).

Interactions between drugs are classified as pharmacokinetic andpharmacodynamic. Computational and mathematical modeling methodsrely on current understanding of the mechanisms underlying each of

these types of interactions and are specific to each interaction type. Apharmacokinetic interaction with a drug affects the process by whichthe other drug is absorbed, distributed, metabolized or excreted in thebody (Crowther et al., 1997). On the other hand, drugs acting onthe same receptor, site of action, or physiological system constitutesa pharmacodynamic interaction. While pharmacokinetic interactionsare usually associated with an adverse or exaggerated response,pharmacodynamic interactions are implicated in both synergistic anddetrimental effects. Many pharmacokinetic interactions are facilitated bythe enzyme family Cytochrome P450 (CYP) and extensive but incompleteknowledge of its mechanisms have been used for computational modelingof pharmacokinetic interactions (Ekins and Wrighton, 2001). Similarly,prior work has applied mathematical modeling of known drug responsemechanisms to simulate and predict pharmacodynamic interactions(Jonker et al., 2005; Jin et al., 2011).

In contrast to computational modeling, statistical and predictivemethods leverage data and evidence from related experiments as domainknowledge and biological priors. Recent advancements in high-throughputexperimentation have generated a wealth of biological characterizationsof drug compounds and their target genes (Fakhraei et al., 2015). A keychallenge for statistical models of drug-drug interactions, or the closely

© The Author 2016. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected] 1

Page 2: A Probabilistic Approach for Collective Similarity-based ... · “sridhar-bioinformatics15” — 2016/5/25 — page 3 — #3 A Probabilistic Approach for Collective Similarity-based

“sridhar-bioinformatics15” — 2016/5/25 — page 2 — #2

2 Sridhar et al.

related problem of drug-target interactions, lies in fusing or combininginformation from multiple data sources. Much related work has developedways of computing similarity scores between drugs or pairs of drugs tobe used as features for machine learning classifiers (Cheng and Zhao,2014; Gottlieb et al., 2012; Atias and Sharan, 2011; Vilar et al., 2014,2013). Sophisticated algorithms such as restricted Boltzmann machinesand matrix factorization are especially effective in combining two typesof similarities by learning latent representations of the entities (Wangand Zeng, 2013; Cao et al., 2015; Gönen, 2012), however, they do notinherently support multiple similarities in the same model. Statisticalmethods for DDI prediction are more generalizable as they do not rely onextensive expert knowledge of each mode of interaction. Although Parket al. (2015); Huang et al. (2013) apply their predictive methods only topharmacodynamic interactions, statisicals models can be easily extendedto both types of interactions. To the best of our knowledge, Gottlieb et al.(2012) present state-of-the-art results for drug-drug interaction predictionof both pharmacokinetic and pharmacodynamic interactions with theirINDI system for combining similarity measures to use features for a locallogistic regression classifier.

However, these similarity-based methods neglect the structuralinformation encoded in the biological network of drugs and theirinteractions. Two general types of approaches have been studied for addingnetwork information to similarity-based features: methods that computeadditional network-based features and methods that perform inferencedirectly over the structure of the network. We refer to these as network-similarity methods and network-based inference methods, respectively.Both kinds of approaches begin by formulating a graph of drugs and theirinteractions. Network-similarity methods proceed by computing relationalfeatures, based on the local neighborhoods of drugs such as neighborhoodoverlap and other well-studied network attributes (Cheng and Zhao, 2014;Cao et al., 2015; Huang et al., 2013). The relational features supplement thesimilarity information given as input to a classifier. In contrast, network-based inference methods reason over the graph structure when predictinginteractions.

Multiple network-based inference approaches have been introduced forthe closely related problem of drug-target interaction prediction. Bleakleyand Yamanishi (2009) formulate the problem of inferring missing linksin a bipartite graph of drugs and targets, and introduce a model that useslocal bipartite structure for prediction. Cheng et al. (2012); Mei et al.(2013) similarly leverage local bipartite topology for inference and Parket al. (2015) introduce a random walk approach for reasoning over thenetwork of drugs and targets. However, local network-based featurescannot enforce global constraints based on the full graph of entities.Given local relational features, current network-based inference methodsfollow traditional machine learning algorithms in assuming the instancesto be independent and identically distributed. In recent work, Fakhraeiet al. (2014, 2013) improve upon existing bipartite drug-target interactionprediction approaches using the probabilistic programming frameworkProbabilistic Soft Logic (PSL) to jointly classify all interactions, fusingsimilarity relations and global network information.

In this work, we formalize the problem of network-based drug-druginteraction prediction using multiple similarity relations. We collectivelypredict drug-drug interactions, considering statistical dependenciesbetween predictions along with knowledge of observed interactions usingProbabilistic Soft Logic. We apply our collective approach to predictDDIs on three kinds of interactions: (1) CYP-related interactions (2) non-CYP related interactions (3) general interactions documented by Drugbank(Wishart et al., 2006). For all settings, we evaluate our collective approachagainst two non-collective methods including state-of-the-art INDI systemof Gottlieb et al. (2012) and a non-collective PSL model. Our modelachieves statistically significant improvement up to 5% in area under theROC (AUC) results from Gottlieb et al. (2012). We further assess area

under the precision-recall curve (AUPR) for all methods and show that ourcollective DDI prediction approach significantly outperforms the state-of-the-art baseline method by up to 50%. Finally, we present important novelDDIs predicted by our approach that are validated in literature.

2 Materials

We use two datasets for our experimental evaluation. The first dataset,released by Gottlieb et al. (2012), includes pairwise interactions between807 drugs, with the drug IDs anonymized . We constructed the seconddataset by extracting interactions from Drugbank for the 315 drugs usedby Fakhraei et al. (2014); Perlman et al. (2011), where Drugbank IDsare provided for additional validation. The following section describedinteraction types and similarities used in these datasets.

2.1 Drug Interaction Data

For the first dataset, Gottlieb et al. (2012) download 10,702 interactionsfrom DrugBank and 70,099 interactions listed as moderate or high fromDrugs.com website(Wishart et al., 2006). The dataset contains two typesof interactions: (1) CYP-related interactions (CRDs), where both drugs aremetabolized by the same cytochrome P450 (CYP) enzyme (2) non-CYP-related interactions, where no CYP is shared between the drugs (NCRDs).After filtering and processing, the final dataset includes 10,106 CRD and45,737 NCRD DDIs (Gottlieb et al., 2012) across 807 drugs.

For the second dataset, we download interactions from DrugBankversion 4.3 for the 315 drugs used by Perlman et al. (2011); Fakhraei et al.(2014). We cross referenced Drugbank IDs released for the 315 drugs toextract the listed drug interactions, resulting in 4293 known interactions.

2.2 Drug Similarity Data

Both datasets contain seven drug-drug similarities. Four of these similaritymeasures are drug-based: Chemical-based, Ligand-based, Side-effect-based, Annotation-based. Three similarities are between drug targets andcomputed by aggregating over known targets for the drugs: Sequence-based, PPI network-based, and Gene Ontology-based. In the first dataset,Gottlieb et al. (2012) average maximal similarities between the associatedtargets for drugs that have more than one target. In the second dataset,we average over all possible pairwise similarities between target genes fordrugs that have multiple targets.

Following section provides a brief description of the methods inGottlieb et al. (2012); Perlman et al. (2011) for similarity extraction:

Chemical-based is the Jaccard similarity, or closely related Dice similarity,of molecular fingerprints from pairs of drugs. Molecular fingerprints areretrieved from cheminformatics toolkits such as chemical developmentkit (CDK) (Steinbeck et al., 2006) or RDKit using canonical SMILES1.Fingerprinting methods represent molecules as bit strings for fast similaritycomputation and are grouped into hashing-based and structural methods.Hashed fingerprints such as the Daylight method rely on hash functionsto represent linear substructures of molecules as bit strings. Structuralfingerprints such as MACCS, Atom-Pair, Morgan and Feature-BasedMorgan methods use features of molecular substructures to compute bitstrings. The Jaccard and Dice similarity scores between two setsX and Yare defined as

Jaccard(X,Y ) =|X ∩ Y ||X ∪ Y |

,Dice(X,Y ) =2|X ∩ Y ||X|+ |Y |

1 Simplified Molecular Input Line Entry Specification

Page 3: A Probabilistic Approach for Collective Similarity-based ... · “sridhar-bioinformatics15” — 2016/5/25 — page 3 — #3 A Probabilistic Approach for Collective Similarity-based

“sridhar-bioinformatics15” — 2016/5/25 — page 3 — #3

A Probabilistic Approach for Collective Similarity-based Drug-Drug Interaction Prediction 3

We obtain all fingerprints described above from RDKit and additionally,the hashed fingerprint from CDK computed with default values as used byGottlieb et al. (2012). In our experiments, we use the hashed fingerprintfrom CDK after comparing performance of all fingerprinting methods ondevelopment data.

Ligand-based is the Jaccard similarity between the corresponding setsof protein-receptor families for each drug pair. The protein-receptor isobtained from the similarity ensemble approach (SEA) search tool (Keiseret al., 2009) Drugs’ canonical SMILES compared with a collection ofligands2.

Side-effect-based is the Jaccard similarity score between common side-effects for each pair of drugs.

Annotation-based is the Resnik semantic similarity (Resnik et al., 1999)of Drugs’ ATC codes mapped to the World Health Organization ATCclassification system (Skrbo et al., 2003).

Sequence-based is the Smith-Waterman sequence alignment scorebetween the corresponding drug targets (proteins). They are normalized viadividing the pairwise score by the geometric mean of the alignment scoresof each sequence against itself, suggested in Bleakley and Yamanishi(2009).

Protein-protein interaction network-based is the distance between pairsof corresponding drug targets using their corresponding proteins in thehuman protein-protein interactions network via an all-pairs shortest pathalgorithm.

Gene Ontology-based is the Resnik semantic similarity (Resnik et al.,1999) between Gene Ontology annotations of drugs’ correspondingtargets.

For more detailed descriptions of these similarities, refer to Perlmanet al. (2011); Gottlieb et al. (2012).

3 Methods

3.1 The Problem of Drug-Drug Interaction Prediction

We consider the problem of inferring new edges in a partially observedgraph of interactions between drug vertices by leveraging multiple knownsimilarity relations between vertices. We are given a set of drugs D =

{D1 . . . Dn}. We observe a set of interaction edges between the drugsdenoted by n × n interaction matrix I where Iij = 1 indicates aninteraction between di and dj and is 0 indicates an unobserved or missingedge. Additionally we are given a set ofn×ndrug-drug similarity relationsencoded by tables {M1 . . .Mk} where Mlij ∈ [0, 1] and indicatessimilarity between di and dj according to biological similarity l.

We define a drug network as a multigraphG = (V,E) where V = D

is the vertex set of drugs and E = {M1 . . .Mk} ∪ I is the collection ofmultiple edge types given by the similarity relations and the interactionmatrix I . The drug-drug interaction prediction problem is to use all theinformation encoded in G to predict the unobserved interaction edgesbetween drug vertices in G.

3.2 Collective Probabilistic Reasoning for Network-basedInference of Interactions

Given all the information G, we want to infer interaction values formissing edges U = {(di, dj)|Iij = 0}. Many techniques have been

2 A substance that binds with a biomolecule to serve a biological purpose.

studied for inference of missing links but here we focus on the intersectionof two well known approaches: network-based methods and collectiveprobabilistic methods. Generally, network-based inference techniquesmake use of the structure of G by considering the local neighborhoodsfor each di and dj in edges we want to infer. For example, network-based methods might include set similarity of the neighbors of di and djalong with the local edge similarites encoded inG. Collective probabilisticmethods learn joint distributions P (U,G) to infer the most probable jointassignment to all edges in U thereby leveraging statistical dependenciesbetween prediction targets as well as the observations inG. Network-basedcollective methods combine the two techniques by parametrizingP (U,G)

according to structural features of G. Collective prediction methods havebeen shown to work well in the closely related setting of drug-targetinteraction prediction (Fakhraei et al., 2014). Below we describe hinge-loss Markov random fields and probabilistic soft logic, a framework forperforming network based collective inference, and describe our modelfor drug-drug interaction prediction.

3.2.1 Hinge-loss Markov Random Fields and Probabilistic Soft LogicOur model for collective drug-drug interaction prediction uses a specialclass of Markov random field (MRF). In this section, we review thefoundations of our model and describe its use in interaction predictionbetween drugs.

Probabilistic Soft Logic (PSL) uses declarative, first order logic-likesyntax to template for a special class of Markov random field (MRF)model known as hinge-loss MRFs (HL-MRFs). HL-MRFs admit efficient,scalable and exact maximum a posteriori (MAP) inference (Bach et al.,2015). These models are defined over continuous random variables, whichprovide a natural interpretation for real-valued similarities. MAP inferencein HL-MRFs is a convex optimization problem over these variables.Formally, a hinge-loss MRF defines a joint probability density functionof the form

P (Y|X) =1

Zexp

(−

M∑r=1

λrφr(Y,X))

, (1)

where the entries of target variables Y and observed variables X are in[0, 1], λ is a vector of weight parameters, Z is a normalization constant,and

φr(Y,X) = (max{lr(Y,X), 0})ρr (2)

is a hinge-loss potential specified by a linear function lr and optionalexponent ρr ∈ {1, 2}. Given a collection of logical implications basedon domain knowledge described in PSL and a set of observations fromdata, the rules are instantiated, or grounded out, with known entities inthe dataset. Each instantiation of the rules maps to a hinge-loss potentialfunction as in Equation 2, and the potential functions define an HL-MRFmodel.

To illustrate modeling in PSL, we consider a prototypical similaritybased rule that encourages transitive closure for link prediction betweenentities a, b, c:

Similar(a, b) ∧ Link(b, c)→ Link(a, c)

where link represents the continuous target variable for a link predictiontask and similar is a continuous observed variable. The convex relaxationof this logical implication for continuous truth values is

max(similar(a, b) + link(b, c)− link(a, c)− 1, 0)

and can be understood as its distance to satisfaction.

MAP inference minimizes the weighted, convex distances tosatisfaction to find an consistent joint assignment for all the target variables.

Page 4: A Probabilistic Approach for Collective Similarity-based ... · “sridhar-bioinformatics15” — 2016/5/25 — page 3 — #3 A Probabilistic Approach for Collective Similarity-based

“sridhar-bioinformatics15” — 2016/5/25 — page 4 — #4

4 Sridhar et al.

Higher rule weights induce higher penalties for violating the rule increasingits relative importance to other rules. Weights are learned from data throughmaximum likelihood estimation using training data and the structuredperceptron algorithm. Exact MAP inference is performed on the learnedmodel to find the most likely assignments for variables using the consensusbased ADMM algorithm. PSL supports latent variable modeling withadditional EM-based learning algorithms. For a full description of PSL,see Bach et al. (2015). Thus, PSL rules encode the domain knowledgethat leads to a consistent assignment to all target variables. HL-MRFshave achieved state-of-the-art performance in many domains including thecollective drug-target interaction prediction task (Fakhraei et al., 2014).The open source PSL software can be downloaded from the website(http://psl.umiacs.umd.edu/).

3.2.2 Collective Drug-drug Interaction Prediction PSL Model

?

?

Si Sj Sk

I

I

I

Si

Sj

Sk

Fig. 1: Triad-based drug-drug interaction prediction rules.

The rules of a PSL model capture beliefs or knowledge about theproblem domain. For the drug-drug interaction domain encoded by drugnetwork G, we assert that a drug is likely to be involved in an interactionif it is similar to another drug that is a known interactor. To model thenotion of similarity, we are interested in fusing multiple sources of drugsimilarity. We make this concrete in the full set of rules for drug-druginteraction prediction shown in figure 2.

w1 : SimChemical(D1, D2) ∧ Interacts(D2, D3)→Interacts(D1, D3)

w2 : SimLigand(D1, D2) ∧ Interacts(D2, D3)→Interacts(D1, D3)

. . .

w7 : SimGO(D1, D2) ∧ Interacts(D2, D3)→Interacts(D1, D3)

Fig. 2: PSL model for collective drug-drug interaction prediction.

where we have one rule for each drug similarity described in section2, resulting in seven rules. We represent the prediction target with theInteracts(D1, D3) predicate. Given a set of drugs d1, d2, and d3 withknown interaction between d2 and d3, the rule results in groundings of theform:

w1 : SimChemical(d1, d2) ∧ Interacts(d2, d3)→ Interacts(d1, d3)

w2 : SimChemical(d2, d3) ∧ Interacts(d3, d1)→ Interacts(d2, d1)

Fig. 3: Small subset of ground PSL rules.

We exclude multiple symmetric groundings for ease of exposition.The ground rules illustrate the propagation of similarity informationbetween target variables as the prediction of Interacts(d1, d3) informsthe assignment of Interact(d2, d1). Following Fakhraei et al. (2014), werefer to these as ‘triad rules’ as they encourage triangle completion, ortriadic closure. Figure 1 shows a schematic overview of the triad rules. Thepredicted interaction edge provides evidence for other inferences, resultingin a flow of information throughout the network. This form of collectiveprediction leverages the full structure of the drug network graph G while

combining multiple sources of similarity information. To fully evaluatethe impact of joint prediction, we describe below two baseline methodsthat work non-collectively and assume independence between predictedinteractions.

3.3 Comparison Methods

We compare against two non-collective methods including the state-of-the-art INDI framework for inferring interactions between drugs (Gottliebet al., 2012). We describe each of these below.

3.3.1 State-of-the-art INDI methodGottlieb et al. (2012) introduce the INDI framework for novel drug-drug interaction prediction. They introduce a method for computingsimilarity scores between target interaction edges to known interactionedges based on the given drug-drug similarities. For each target drug-pair,each pairwise combination of similarities is considered for computing thesimilarity score to the most similar known drug interaction. The procedureeffectively performs nearest neighbor search using different similaritydistance measures. Each score is then used as a feature to train a logisticregression classifier. We refer to Gottlieb et al. (2012) for full details.

3.3.2 Non-collective PSL ModelTo quantify the effect of collective prediction, we also use a non-collectivePSL model that considers the dependencies between the target interactionsand observed interactions only, as in the INDI method. Formally, wemodify the triad rules above as follows: where we introduce the InteractsObs

w1 : SimChemical(D1, D2) ∧ InteractsObs(D2, D3)→Interacts(D1, D3)

. . .

w7 : SimGO(D1, D2) ∧ InteractsObs(D2, D3)→Interacts(D1, D3)

Fig. 4: Non-collective PSL model for drug-drug interaction prediction.

predicate to limit the triadic closure of predicted interactions to knowninteractions only.

3.4 Experimental Protocol

In order to validate our collective drug-drug interaction prediction methodand compare against state-of-the-art methods, we perform experiments onthe two drug interaction networks described in section 3. For each dataset,we perform ten-fold cross-validation across all pairs of interactions. Weuse eight folds as interaction evidence or observations, one fold as traininglabels to learn weights for the rules, and the final fold as a held-out testset. All similarities between drugs are used as evidence, or features, forthe models. The similarity distributions are highly left-skewed, which isproblematic for the soft truth interpretation used by PSL, as values below0.5 do not highly affect the inference. We transform all similaritity valuesbetween drugs by taking the cube-root to normalize the distributions andallow for proper interpretation by PSL.

We compute area under the precision-recall curve (AUPR) for thepositive class, area under the ROC (AUC) and F1 score on the test set.Link prediction tasks usually suffer from class imbalance as true positivelinks are sparse compared to true negatives. Related work on general linkprediction and DDI prediction report AUC because it is more robust to theskewness than metrics such as accuracy. However, AUC is still sensitiveto the high number of negative examples. For practical downstreambiological validation of predicted DDIs, it is more important to have areliable ranking of candidate positive interactions. The precision-recallcurve better captures the effectiveness of models at discriminating true

Page 5: A Probabilistic Approach for Collective Similarity-based ... · “sridhar-bioinformatics15” — 2016/5/25 — page 3 — #3 A Probabilistic Approach for Collective Similarity-based

“sridhar-bioinformatics15” — 2016/5/25 — page 5 — #5

A Probabilistic Approach for Collective Similarity-based Drug-Drug Interaction Prediction 5

positive examples. F1 score is another measure of classification accuracyand can interpreted as a weighted average of precision and recall. SincePSL outputs real-valued truth scores and logistic regression produces classprobabilities, we threshold the values to {0,1} to compute the F1 score.We perform grid search over a range of threshold values between [0,1] toobtain best-performing thresholds.

We implement the INDI feature computation method in Matlab

by extending a related implementation of the computation for the drug-target interaction prediction setting (Fakhraei et al., 2014; Perlman et al.,2011). We use the logistic regression classifier provided in the glmfitpackage with default settings. For our models, we use the open-sourcePSL framework. We run 700 iterations of the structured voted-perceptronweight learning algorithm in PSL and use default settings for the ADMMinference algorithm. We will make all code and datasets publicly available.

3.4.1 Blocking Methods for PSLIn a drug network with n drugs and n2 interactions where PSL considersdependencies between pairs of interactions, the computational complexityreaches O(n4), which quickly becomes expensive for large networks. Tomake the approach scalable, we employ a common techniques to blockunimportant links from being grounded out by the model. In the PSL triadrule setting, for each similarity i, we limit the possible Similari(D1, D2)

edges that are considered for each drugD1. By blocking on the similaritylinks, we restrict the grounding of all possible triads to only the ones thatare most likely.

To block similarities in the grounded out PSL models, for each drug,we perform nearest neighbor search to pick the top 15 most similar otherdrugs as evidence for Simi(D1, D2). In the first drug dataset, for theCRD interaction experiments, we use a more restrictive blocking method toinduce more sparsity since CRD interactions are rarer. When searching forthe 15 nearest neighbors for each drug, we restrict ourselves to those drugsthat have appeared in at least one observed interaction in the full network.In this sparser setting, some drugs may not appear in any Simi(D1, D2)

groundings. For those drugs, we additionally retrieve five most similarother drugs using standard nearest neighbor search and include the pairsas evidence for Simi(D1, D2). Fakhraei et al. (2014) provide morecomprehensive analysis on techniques for blocking.

4 Results

4.1 Comparison to State-of-the-art Baselines

We compare our proposed collective PSL approach for DDI predictionto two baselines including the state-of-the-art INDI system with 10-foldcross-validation experiments. We apply the three methods to each foldand report average and standard deviations of our chosen metrics foreach model. We refer to the INDI system as INDI, the non-collectivePSL baseline as NC-PSL, and collective PSL model as PSL. Tables 1-3present average and standard deviation for area under the precision-recallcurve (AUPR) and area under the ROC (AUC) from cross-validationexperiments on the three interaction types from two datasets: (1) CYP-related interactions (CRD) from Drugs.com and Drugbank (Gottlieb et al.,2012) (2) Non-CYP-related interactions (NCRD) from Drugs.com andDrugbank (Gottlieb et al., 2012) (3) General interactions from DrugBank.Bolded results highlight statistically significant improvement over bothbaselines with α = 0.05. Figures 5, 6 and 7 show precision-recallcurves of all methods plotted for interaction type settings (1), (2) and(3) respectively. Additionally, to assess the benefit of fusing multiplesimilarities, we compare against our collective PSL model implementedwith single similarities. Table 4 shows AUPR for the collective PSL modelfor single similarities across interaction type settings (1), (2) and (3).

Table 1. Average AUPR, AUC and F1 scores (with best threshold t indicated),and standard deviation for 10 fold CV comparing all DDI prediction modelsfor CRD interactions from dataset 1.

Method AUPR-Pos AUROC F1

INDI 0.15 ± 0.007 0.92 ± 0.003 0.24 ± 0.005 (t = 0.1 )

NC-PSL 0.15 ± 0.01 0.91 ± 0.004 0.23 ± 0.01 (t = 0.8)

PSL 0.34 ± 0.02 0.96 ± 0.003 0.4 ± 0.02 (t = 0.3)

Table 2. Average AUPR, AUC and F1 scores (with best threshold t indicated),and standard deviation for 10 fold CV comparing all DDI prediction modelsfor NCRD interactions from dataset 1.

Method AUPR-Pos AUROC F1

INDI 0.64 ± 0.01 0.95 ± 0.003 0.63 ± 0.01 (t = 0.35)

NC-PSL 0.70 ± 0.006 0.96 ± 0.001 0.62 ± 0.01 (t = 0.9)

PSL 0.78 ± 0.006 0.97 ± 0.001 0.70 ± 0.01 (t = 0.3)

Table 3. Average AUPR, AUC and F1 scores (with best threshold t indicated),and standard deviation for 10 fold CV comparing all DDI prediction modelsfor general interactions from dataset 2.

Method AUPR-Pos AUROC F1

INDI 0.47 ± 0.04 0.91 ± 0.01 0.51 ± 0.03 (t = 0.2)

NC-PSL 0.56 ± 0.04 0.95 ± 0.006 0.6 ± 0.03 (t = 0.5)

PSL 0.69 ± 0.02 0.96 ± 0.006 0.67 ± 0.02 (t = 0.4)

Table 4. Average AUPR and standard deviation for 10 fold CV for singlesimilarity collective DDI prediction models across all interaction types

Similarity CRD NCRD General

ATC 0.18 ± 0.01 0.73 ± 0.01 0.68 ± 0.02

Chemical 0.32 ± 0.02 0.58 ± 0.01 0.46 ± 0.04

Distance 0.31 ± 0.03 0.63 ± 0.004 0.35 ± 0.04

Gene Ontology 0.33 ± 0.02 0.63 ± 0.004 0.39 ± 0.04

Ligand 0.18 ± 0.01 0.67 ± 0.01 0.37 ± 0.03

Sequence 0.29 ± 0.02 0.63 ± 0.004 0.37 ± 0.04

Side Effect 0.30 ± 0.01 0.56 ± 0.01 0.51 ± 0.03

Fig. 5: Precision-recall curves comparing all DDI prediction models on CRDInteractions dataset.

Fig. 6: Precision-recall curves comparing all DDI models on NCRDInteractions dataset.

Page 6: A Probabilistic Approach for Collective Similarity-based ... · “sridhar-bioinformatics15” — 2016/5/25 — page 3 — #3 A Probabilistic Approach for Collective Similarity-based

“sridhar-bioinformatics15” — 2016/5/25 — page 6 — #6

6 Sridhar et al.

Table 5. Top ranked PSL model predictions for interactions unknown inDrugBank

Rank Drug Bank IDs Drug Bank IDs

1 DB00870; DB01418 Suprofen and Acenocoumarol2 DB01067; DB00839 Glipizide and Tolazamide3 DB01297; DB00806 Practolol and Pentoxifylline4 DB00870; DB00806 Suprofen and Pentoxifylline5 DB00272; DB01232 Betazole and Saquinavir6 DB00870; DB01032 Suprofen and Probenecid7 DB00939; DB01418 Meclofenamic acid and Acenocoumarol8 DB00414; DB01032 Acetohexamide and Probenecid9 DB01297; DB01392 Practolol and Yohimbine10 DB01097; DB01262 Leflunomide and Decitabine

Fig. 7: Precision-recall curves comparing all DDI models ongeneral interactions dataset.

Gottlieb et al. (2012) report AUC results consistent with our evaluationof the INDI system as given in tables 1 and 2. Our collective PSL modelstatistically significantly outperforms both baselines in AUC, AUPR andF1-score for all three interaction type prediction experiments. For AUPR,in the best case CRD interactions setting, our collective model improvesup to 50% in AUPR over the state-of-the-art INDI system and non-collective PSL model, from 0.15 to 0.34. For the NCRD and generalinteractions, the collective PSL approach sees gains of up to 20% inAUPR over both baselines. The collective model improves up to 0.05in AUC over the state-of-the-art INDI method, with AUC as high as0.97 for the NCRD interaction setting, signficantly improving over the0.95 achieved by the INDI system. For F1-score, our collective modelimproves close to 50% over the INDI and non-collective baselines for theCRD setting and up to 30% for NCRD and general interaction settings.Interestingly, the non-collective PSL method performs at least as well asthe INDI system in setting (1) and for settings (2) and (3), significantlyoutperforms the INDI system in AUPR and AUC. This improvement bythe non-collective PSL model demonstrates the method’s effectiveness incombining multiple similarities as well as or better than the state-of-the-artsimilarity combination technique used by INDI. The gains achieved by thefully collective PSL model highlights the benefits of joint inference overthe full drug-drug interaction network. Additionally, for all interactionsettings, the multiple similarity collective approach significantly improvesin AUPR over all individual similarity collective models. This resultsupports the findings of Fakhraei et al. (2014); Gottlieb et al. (2012) thatmultiple similarities benefit performance of both drug-target and drug-druginteraction prediction tasks.

4.2 Validation of Unseen Interaction Predictions

In order for statistical methods to be useful for domain experts, predictivemodels should produce highly probable novel interactions for subsequentin vitro testing. Thus, following Gottlieb et al. (2012); Fakhraei et al.(2014); Bleakley and Yamanishi (2009), we compare top-ranked, unseenDDI predictions produced by our collective PSL model with evidencefrom medical and biological data sources. These predictions are novelwith respect to Drugbank interactions used as training data and validate theability of our collective approach to produce salient interaction predictionsgiven observations.

For this experiment, we use predicted drug-drug interactions fromnon-anonymized dataset 2 to cross-reference in literature. From our 10-fold cross validation experiments, we output the predictions and filter outthose that are not present in Drugbank as verified interactions. We rankthese new predictions and consider the top 10 interactions as shown intable 5. Bolded rows indicate drug pairs that are verified by literature oranother database as interactors, or have substantive supporting evidencefor potential interaction. We use the Interactions Checker tool provided bydrugs.com (http://drugs.com) for validation, as these interactionswere not used to train any of our models. Additionally, Drugbank providesthe BioInteractor tool that uses drug-target, -enzyme and -transporterassociations to predict highly probable interactions that are not includedin the main database. Our collective PSL approach highly ranks fiveinteractions that are substantiated by Interactions Checker or BioInteractor.Some interactions involve the following four drugs that are no longer FDAapproved or used outside of the United States: Suprofen, Acenocoumarol(used worldwide but not in U.S.), Practolol and Acetohexamide. Becausethese drugs are presently less well-studied and documented, they arisenaturally as test cases for our validation study.

The top predicted interaction is between Suprofen, a non-steroidal anti-inflammatory drug, and Acenocoumarol, an anticoagulant. BioInteractorcharacterizes the effect of Suprofen on Acenocoumarol as a CYPmediated pharmacokinetic interaction. The sixth most highly rankedprediction involving Suprofen and Probenecid, a uricosuric agent usedto treat gout, is also classified by BioInteractor as a CYP relatedinteraction. Acetohexamide is in the sulfonylurea class of compoundsused to treat type-II diabetes and is predicted by our model to interactwith Probenecid, which is highly protein-bound. Interactions Checkercharacterizes this particular interaction as enhancing the hypoglycemiceffects of sulfonyureas when taken together. The risk of using Probenecidtogether with Acetohexamide is high with the elderly, who are commonlytreated simultaneously for gout and diabetes.

The collective PSL model also ranks Decitabine, used to treatLeukemia, and Leflunomide, used for rheumatoid arthritis treatment,as interactors. Interactions Checker indicates a major interactionbetween Leflunomide and Decitabine in conjunction since both areimmunosuppressants and can have additive effects to increase riskof serious infection. Ranked third, Pentoxifyline, a vasodilator andanti-inflammatory used to improve blood circulation, is predicted tointeract with Practolol, a beta-blocked formerly used to treat cardiacarrhythmias. Although Drugs.com does not list this particular interaction,the Interactions Checker lists moderate interaction between Propranolol,beta-blocker now used in place of Practolol, and Pentoxifyline. Theprediction of an effect on Pentoxifyline by a drug chemically similar toPropranolol also demonstrates the effectiveness of the PSL triad rules.The propagation of likely interaction across drugs that are similar is alsoevident in the second ranked prediction of interaction between Glipizideand Tolazamide. Both are sulonyureas like Acetohexamide and are used totreat type-II diabetes. Though the drugs deliver similar responses, currentlythere is no strong evidence of their interaction.

Page 7: A Probabilistic Approach for Collective Similarity-based ... · “sridhar-bioinformatics15” — 2016/5/25 — page 3 — #3 A Probabilistic Approach for Collective Similarity-based

“sridhar-bioinformatics15” — 2016/5/25 — page 7 — #7

A Probabilistic Approach for Collective Similarity-based Drug-Drug Interaction Prediction 7

We compare the predictions in Table 5 to the top ten novel interactionspredicted by the INDI system. In contrast, only three out of the tenpredictions made by INDI can be verified by BioInteractor or InteractionsChecker: (1) Ciprofloxacin and Lomefloxacin (rank 2) (2) Methotrexateand Lomefloxacin (rank 4) (3) Mifepristone and Lomefloxacin (rank6). There are no overlaps with the predictions ranked highly by PSL.Lomefloxacin and Ciprofloxacin both fight infection, Methotrexate treatscancers and Mifepristone ends pregancy. Interestingly, both the collectivePSL model and the INDI system predict interactions involving majorcancer drugs, Decitabine and Methotrexate, respectively.

5 Discussion

In this work, we formulate the problem of collective drug-drug interactionprediction. We introduce a joint probabilistic approach using the PSLframework to fuse multiple sources of similarity information togetherwith domain-knowledge of the network structure for this domain. Theoriginality of this work lies in proposing and experimentally validating ahighly scalable, collective probabilistic approach for DDI prediction thatis easily extensible with different sources of information and similaritymeasures. We evaluate our approach on two datasets containing three typesof interactions, including one extracted for this work with known DrugbankIDs for additional validations. We perform ten-fold cross-validation on allsettings and see that our collective PSL model significantly outperformstwo other similarity-based methods, including the state-of-the-art INDIsystem, on two important metrics for link prediction, AUPR and AUC.Our best performing PSL model improves more than 50% upon AUPRof both baselines and achieves a best AUC of 0.97. Moreover, the non-collective similarity-based method implemented in PSL also significantlyoutperforms INDI in two settings and performs comparably to INDI inother settings. This result also highlights the effectiveness of PSL asan extensible framework for similarity-based reasoning that enjoys thebenefits of collective inference shown by the first result. Furthermore,the top then predictions of our best performing collective PSL methodscontain five interactions that are unseen in Drugbank but substantiated byDrugs.com and the BioInteractor tool on Drugbank. This result signifiesthe usefulness of our collective approach for producing high-qualitypredictions that can be verified experimentally downstream. Anotherbenefit of our collective PSL method is scalability and speed. The focusof the INDI method is combining similarities by computing interactionedge-based similarity score using a nearest-neighbor search approach. Thisfeature computation is a computationally expensive procedure, requiringO(n4) passes over the drug entities. For a dataset containing 807 drugs,this computation takes approximately 12 hours on average per fold on asingle 32GB machine with 4 cores. The comparable non-collective PSLmodel introduced in this work takes approximately 1 hour for a roundof weight learning and inference per fold on the same machine. Thecollective PSL model completes computation for a fold in approximately7 hours. The PSL framework admits highly efficient, polynomial-timeinference and here, we further reduce computational complexity byblocking unnecessary groundings of the model. Scalability is crucial forlink prediction tasks in increasingly massive biological networks, as newdrugs are frequently introduced.

The task of DDI prediction is closely related to problems of predictingdrug side effects, drug adverse reactions, and synergistic drug pairs. Infact, predicting synergistic drug interactions is just a specific subtask ofthe DDI prediction problem. Our collective approach for similarity-basedreasoning in networks can be applied and generalized to all these relatedsettings.

6 Acknowledgements

We would like to thank Evan Paull, Vladislav Uzunangelov andJoshua Stuart for their valuable feedback and insightful discussions.Funding: This work was supported by the National Science Foundation[IIS1218488].Conflicts of Interest: None declared.

References

Atias, N. and Sharan, R. (2011). An algorithmic framework for predicting side effectsof drugs. J. Comput. Biol., 18, 207–218.

Bach, S. H., Broecheler, M., Huang, B., and Getoor, L. (2015). Hinge-loss markovrandom fields and probabilistic soft logic. arXiv preprint arXiv:1505.04406.

Bleakley, K. and Yamanishi, Y. (2009). Supervised prediction of drug–targetinteractions using bipartite local models. Bioinformatics, 25, 2397–2403.

Cao, D.-S., Xiao, N., Li, Y.-J., Zeng, W.-B., Liang, Y.-Z., Lu, A.-P., Xu, Q.-S., andChen, A. (2015). Integrating multiple evidence sources to predict adverse drugreactions based on a systems pharmacology model. CPT: Pharmacometrics Syst.Pharmacol., 4, 498–506.

Cheng, F. and Zhao, Z. (2014). Machine learning-based prediction of drug–druginteractions by integrating drug phenotypic, therapeutic, chemical, and genomicproperties. Journal of the American Medical Informatics Association, 21, e278–e286.

Cheng, F., Liu, C., Jiang, J., Lu, W., Li, W., Liu, G., Zhou, W., Huang, J., andTang, Y. (2012). Prediction of drug-target interactions and drug repositioning vianetwork-based inference. PLoS Comput Biol, 8, e1002503.

Chou, T.-C. (2010). Drug combination studies and their synergy quantification usingthe chou-talalay method. Cancer research, 70, 440–446.

Crowther, N. R., Holbrook, A. M., Kenwright, R., and Kenwright, M. (1997). Druginteractions among commonly used medications. chart simplifies data from criticalliterature review. Canadian Family Physician, 43, 1972.

Ekins, S. and Wrighton, S. A. (2001). Application of in silico approaches to predictingdrug–drug interactions. J. Pharmacol. Toxicol. Methods, 45, 65–69.

Fakhraei, S., Raschid, L., and Getoor, L. (2013). Drug-target interaction predictionfor drug repurposing with probabilistic similarity logic. In ACM SIGKDD 12thInternational Workshop on Data Mining in Bioinformatics (BIOKDD).

Fakhraei, S., Huang, B., Raschid, L., and Getoor, L. (2014). Network-based drug-target interaction prediction with probabilistic soft logic. Comput. Biol. Bioinf.,11, 775–787.

Fakhraei, S., Onukwugha, E., and Getoor, L. (2015). Data analytics forpharmaceutical discoveries. In Healthcare Data Analytics, pages 599–623. CRCPress.

Gönen, M. (2012). Predicting drug–target interactions from chemical and genomickernels using bayesian matrix factorization. Bioinformatics, 28, 2304–2310.

Gottlieb, A., Stein, G. Y., Oron, Y., Ruppin, E., and Sharan, R. (2012). Indi:a computational framework for inferring drug interactions and their associatedrecommendations. Mol. Syst. Biol., 8, 592.

Huang, J., Niu, C., Green, C. D., Yang, L., Mei, H., and Han, J. (2013). Systematicprediction of pharmacodynamic drug-drug interactions through protein-protein-interaction network. PLoS Comput Biol, 9, e1002998.

Jin, G., Zhao, H., Zhou, X., and Wong, S. T. (2011). An enhanced petri-net modelto predict synergistic effects of pairwise drug combinations from gene microarraydata. Bioinformatics, 27, i310–i316.

Jonker, D. M., Visser, S. A., van der Graaf, P. H., Voskuyl, R. A., and Danhof,M. (2005). Towards a mechanism-based analysis of pharmacodynamic drug–druginteractions in vivo. Pharmacol. Ther., 106, 1–18.

Keiser, M. J., Setola, V., Irwin, J. J., Laggner, C., Abbas, A. I., Hufeisen, S. J.,Jensen, N. H., Kuijer, M. B., Matos, R. C., Tran, T. B., et al. (2009). Predictingnew molecular targets for known drugs. Nature, 462, 175–181.

Mei, J.-P., Kwoh, C.-K., Yang, P., Li, X.-L., and Zheng, J. (2013). Drug–target interaction prediction by learning from local information and neighbors.Bioinformatics, 29, 238–245.

Nahta, R., Hung, M.-C., and Esteva, F. J. (2004). The her-2-targeting antibodiestrastuzumab and pertuzumab synergistically inhibit the survival of breast cancercells. Cancer research, 64, 2343–2346.

Park, K., Kim, D., Ha, S., and Lee, D. (2015). Predicting pharmacodynamic drug-drug interactions through signaling propagation interference on protein-proteininteraction networks. PloS one, 10, e0140816.

Perlman, L., Gottlieb, A., Atias, N., Ruppin, E., and Sharan, R. (2011). Combiningdrug and gene similarity measures for drug-target elucidation. J. Comput. Biol.,18, 133–145.

Page 8: A Probabilistic Approach for Collective Similarity-based ... · “sridhar-bioinformatics15” — 2016/5/25 — page 3 — #3 A Probabilistic Approach for Collective Similarity-based

“sridhar-bioinformatics15” — 2016/5/25 — page 8 — #8

8 Sridhar et al.

Resnik, P. et al. (1999). Semantic similarity in a taxonomy: An information-basedmeasure and its application to problems of ambiguity in natural language. J. Artif.Intell. Res.(JAIR), 11, 95–130.

Skrbo, A., Begovic, B., and Skrbo, S. (2003). [classification of drugs using the atcsystem (anatomic, therapeutic, chemical classification) and the latest changes].Med. Arh., 58, 138–141.

Steinbeck, C., Hoppe, C., Kuhn, S., Floris, M., Guha, R., and Willighagen, E. L.(2006). Recent developments of the chemistry development kit (cdk)-an open-source java library for chemo-and bioinformatics. Curr. Pharm. Des., 12, 2111–2120.

Vilar, S., Uriarte, E., Santana, L., Tatonetti, N. P., and Friedman, C. (2013). Detectionof drug-drug interactions by modeling interaction profile fingerprints. PloS one,8(3), e58321.

Vilar, S., Uriarte, E., Santana, L., Lorberbaum, T., Hripcsak, G., Friedman, C.,and Tatonetti, N. P. (2014). Similarity-based modeling in large-scale prediction ofdrug-drug interactions. Nature protocols, 9(9), 2147–2163.

Wang, Y. and Zeng, J. (2013). Predicting drug-target interactions using restrictedboltzmann machines. Bioinformatics, 29(13), i126–i134.

Wishart, D. S., Knox, C., Guo, A. C., Shrivastava, S., Hassanali, M., Stothard, P.,Chang, Z., and Woolsey, J. (2006). Drugbank: a comprehensive resource for insilico drug discovery and exploration. Nucleic Acids Res., 34(suppl 1), D668–D672.

Zhang, L., Zhang, Y. D., Zhao, P., and Huang, S.-M. (2009). Predicting drug–druginteractions: an fda perspective. The AAPS journal, 11(2), 300–306.