[IEEE Informatics (ICOCI) - Kuala Lumpur, Malaysia (2006.06.6-2006.06.8)] 2006 International...

5
ICOCI2006 Support Vector Machines for Predicting Protein-Protein Interactions using Domains and Hydrophobicity Features Hany Alashwal, Safaai Deris and Razib M. Othman Artificial Intelligence and Bioinformatics Laboratory Faculty of Computer Science and Information Systems Universiti Teknologi Malaysia, 81 310 Skudai, Johor, Malaysia Email: [email protected] ;[email protected] ;[email protected] Abstract-Since proteins work in the context of many other proteins and rarely work in isolation, it is highly important to study protein-protein interactions to understand proteins functions. The interactions data that have been identified by high-throughput technologies like the yeast two-hybrid system are known to yield many false positives. As a result, methods for computational prediction of protein-protein interactions based on sequence information are becoming increasingly important. In this study, computational prediction of protein-protein interactions (PPI) from domain structure and hydrophobicity properties is presented. Protein domain structure and hydrophobicity properties are used separately as the sequence feature for the support vector machines (SYM) as a learning system. Both features achieved accuracy of about 80%. But domains structure had receiver operating characteristic (ROC) score of 0.8480 with running time of 34 seconds, while hydrophobicity had ROC score of 0.8159 with running time of 20,571 seconds (5.7 hours). These results indicate that protein-protein interaction can be predicted from domain structure with reliable accuracy and acceptable running time. 1. INTROD UCTIO N A majorchallenge in bioinformatics is assigning fimction to newly discovered proteins. Although, most methods annotating protein fimction utilize sequence homology to proteins of experimentally known fimction , such a homology-based annotation transfer is problematic and limited in scope [1]. This is due to the fact that proteins work in the context of many other proteins and rarely work in isolation. The more we know about molecular biology the more we realize that protein-protein interactions affect almost all processes in a cell [2, 3]. For example structural proteins needto interact in orderto shape organelles andthe whole cell, molecular machines such as ribosome or RNA polymerases are hold together by protein-protein interactions, and the same is true for multi-subunit channels or receptors in membranes. It is estimated that even simple single-celled organisms such as yeast have about 6000 proteins interact by at least 3 interactions per protein, i.e. a total of 20,000 interactions or more [4]. By extrapolation, theremay be on the order of 100,000 interactions in the human body. As a result, identifying protein-protein interactions (PP!) represents a crucial stepin understanding proteins fimctions . Most of the interactions data was identified by high- throughput technologies like the yeast two-hybrid system, which are known to yield many false po sitives [5]. In addition, in vivo experiments that identify protein-protein interaction are still time-consuming and labor-intensive; besides, they identify a small number of interactions. As a result, methods for computational prediction of protein- protein interactions based on sequence information are becoming increasingly important. 1-4244-0220-4/06/$20.00©2006 IEEE Over the past few years, several computational approaches to p redict prote in-protein interaction have been proposed. One of the earliest techniques was based on the assumption that protein-protein interactions are evolutionary conserved. It involves orthology-based mapping of a known reference interaction network to another , target organism [6]. Other methods relies on exploration of s imilarity of expression profiles to predict interacting proteins [7] , coordination of occurrence of gene products in genomes, description of similarity of phylogenetic profiles [8] or trees [9], and studying the patterns of domain fusion [10]. However, it has been noted that these methods predict protein-protein interactions in a very general sense, meaning joint involvement in a certain biological process, and not necessarily actual physical interaction [11]. Another possibility to computationally predict interacting proteins is to correlate experimental data on interaction partners with computable or manually annotated features of protein sequences using machine learning approaches, such as support vector machines (SVM) [12] and data mining techniques, suchas association rulemining [13]. Themostcommon sequence feature usedforthispurpose is the protein domains structure. The motivation for this choice is that molecular interactions are typically mediated by a great variety of interaction domains [14]. It is thus logical to assume that the patterns of domain occurrence in interacting proteins provide useful information for training PPI prediction methods . Ina recent study, Kim et al. [15] introduced the notion of potentially interacting domain pair (PID) to describe domain pairs that occur in interacting proteins more frequentl y than would be expected by chance Assuming thateach protein in the training set may contain different combinations of multiple domains, the tendency of twoproteinsto interact is then calculated as a stunoverlog odd ratios overall possible domain pairs in the interacting proteins. Using cross- validation, the authors demonstrated 50% sensitivity and 98%specificity in reconstructing the training dataset. Gomez et al. [16] developed a probabilistic model to predict protein interactions in the context of regulatory networks. A biological network is represented as a directed graph with proteins as vertices and interactions as edges. A probability is assigned to every edge and non-edge, where the probability for each edge depends on how domains in two corresponding proteins"attra ct" and"repel" each other. The regulatory network is predicted as the one with the largest probability for its network topology. Using the database of interacting proteins, DIP [17] , as the standard of truth and PFAM domains as sequence features, the authors built a probabilistic network of yeast interactions and reported veryhightrue positive andtrue negative rates of93 and90%,respectively.

Transcript of [IEEE Informatics (ICOCI) - Kuala Lumpur, Malaysia (2006.06.6-2006.06.8)] 2006 International...

Page 1: [IEEE Informatics (ICOCI) - Kuala Lumpur, Malaysia (2006.06.6-2006.06.8)] 2006 International Conference on Computing & Informatics - Support vector machines for predicting protein-protein

ICOCI2006

Support Vector Machines for Predicting Protein-Protein Interactions usingDomains and Hydrophobicity Features

Hany Alashwal, Safaai Deris and Razib M. OthmanArtificial Intelligence and Bioinformatics Laboratory

Faculty of Computer Science and Information SystemsUniversiti Teknologi Malaysia, 81 310 Skudai, Johor, Malaysia

Email: [email protected] ;[email protected] ;[email protected]

Abstract-Since proteins work in the context of manyother proteins and rarely work in isolation, it is highlyimportant to study protein-protein interactions tounderstand proteins functions. The interactions datathat have been identified by high-throughputtechnologies like the yeast two-hybrid system are knownto yield many false positives. As a result, methods forcomputational prediction of protein-protein interactionsbased on sequence information are becomingincreasingly important. In this study, computationalprediction of protein-protein interactions (PPI) fromdomain structure and hydrophobicity properties ispresented. Protein domain structure and hydrophobicityproperties are used separately as the sequence featurefor the support vector machines (SYM) as a learningsystem. Both features achieved accuracy of about 80%.But domains structure had receiver operatingcharacteristic (ROC) score of 0.8480 with running timeof 34 seconds, while hydrophobicity had ROC score of0.8159 with running time of 20,571 seconds (5.7 hours).These results indicate that protein-protein interactioncan be predicted from domain structure with reliableaccuracy and acceptable running time .

1. INTROD UCTIO N

A majorchallenge in bioinformatics is assigning fimctionto newly discovered proteins. Although, most methodsannotating protein fimction utilize sequence homology toproteins of experimentally known fimction, such ahomology-based annotation transfer is problematic andlimited in scope [1]. This is due to the fact that proteinswork inthe context of manyotherproteins and rarely workin isolation. The more we know about molecular biologythe more we realize that protein-protein interactions affectalmost all processes in a cell [2, 3]. For example structuralproteins needto interact inorderto shape organelles andthewhole cell, molecular machines such as ribosome or RNApolymerases are hold together by protein-proteininteractions, andthe same is true formulti-subunit channelsor receptors in membranes. It is estimated that evensimplesingle-celled organisms such as yeast have about 6000proteins interact by at least 3 interactions per protein, i.e. atotal of 20,000 interactions or more [4]. By extrapolation,there may be on the order of ~100,000 interactions in thehuman body.

As a result, identifying protein-protein interactions (PP!)represents a crucial stepinunderstanding proteins fimctions.Most of the interactions data was identified by high­throughput technologies like the yeast two-hybrid system,which are known to yield many false positives [5]. Inaddition, in vivo experiments that identify protein-proteininteraction are still time-consuming and labor-intensive;besides, they identify a small number of interactions. As aresult, methods for computational prediction of protein­protein interactions based on sequence information arebecoming increasingly important.

1-4244-0220-4/06/$20.00©2006 IEEE

Over the past few years, several computationalapproaches to predict protein-protein interaction have beenproposed. One of the earliest techniques was based on theassumption that protein-protein interactions areevolutionary conserved. It involves orthology-basedmapping of a known reference interaction network toanother, target organism [6].

Other methods relies on exploration of similarity ofexpression profiles to predict interacting proteins [7],coordination of occurrence of gene products in genomes,description of similarity of phylogenetic profiles [8] or trees[9], and studying the patterns of domain fusion [10].However, it has been noted that these methods predictprotein-protein interactions in a very general sense,meaning joint involvement in a certain biological process,andnotnecessarily actual physical interaction [11].

Another possibility to computationally predict interactingproteins is to correlate experimental data on interactionpartners withcomputable or manually annotated features ofprotein sequences using machine learning approaches, suchas support vector machines (SVM) [12] and data miningtechniques, suchas association rulemining [13].

Themostcommon sequence feature usedforthispurposeis the protein domains structure. The motivation for thischoice is that molecular interactions are typically mediatedby a great variety of interaction domains [14]. It is thuslogical to assume that the patterns of domain occurrence ininteracting proteins provide useful information for trainingPPIprediction methods.

Ina recent study, Kim et al. [15] introduced thenotion ofpotentially interacting domain pair(PID) to describe domainpairs that occur in interacting proteins morefrequently thanwould beexpected by chance Assuming thateachprotein inthe training set may contain different combinations ofmultiple domains, the tendency of twoproteinsto interact isthencalculated asa stunoverlogoddratios overallpossibledomain pairs in the interacting proteins. Using cross­validation, the authors demonstrated 50% sensitivity and98%specificity inreconstructing thetraining dataset.

Gomez et al. [16] developed a probabilistic model topredict protein interactions in the context of regulatorynetworks. A biological network is represented as a directedgraph withproteins as vertices and interactions as edges. Aprobability is assigned to every edge and non-edge, wherethe probability for each edge depends on how domains intwo corresponding proteins"attract" and"repel" eachother.The regulatory network is predicted as the one with thelargest probability for its network topology. Using thedatabase of interacting proteins, DIP[17], as the standard oftruth and PFAM domains as sequence features, the authorsbuilt a probabilistic network of yeast interactions andreported veryhightruepositive andtruenegative rates of93and90%,respectively.

Page 2: [IEEE Informatics (ICOCI) - Kuala Lumpur, Malaysia (2006.06.6-2006.06.8)] 2006 International Conference on Computing & Informatics - Support vector machines for predicting protein-protein

ICOCI2006

The dual optimization is solved here by introducing theLagrange multipliers o; for the non-separable case. Becauselinear function classes are not sufficient in many cases, wecan substitute <l>(xJ for each example Xi and use the kernelfunction K(XbX) such that <l>(xJ.<l>(x). We thus get thefollowing optimization problem:

yKw.xi+b)~1 Vi,i=I, ... ,n (1)

In the linear non-separable case, the optimal separatinghyperplane can be foundby introducing slackvariables ~i' i= 1,... , n and user-adjustable parameter C and thenminimizing Ilwll2/ 2 + C Li ~i , subject to the followingconstraints:

II. METHODS

A. Support Vector Machines

The Support Vector Machine (SVM) is a binaryclassification algorithm. Hence, it is well suitedfor the taskof discriminating between interacting and non-interactingproteinpairs. The SVM is basedon the ideaof constructingthe maximal margin hyperplane in the feature space [19].Suppose we have a set of labeled training data {Xb Yi}, i =1,... , n, YiE{I,-I}, Xi E Rd

, and have the separatingh~Iane (w. x) +b =0,where feature vector: x E Rd

, WE

R and be R. In the linearseparable case the SVM simplylooks for the separating ~~erplane that maximizes themargin by minimizing Ilwl! /2 subject to the followingconstraint:

(6)

11

subjectto 0 ~ «, ~ C, i = 1, ...,n. & LaiYi == 0 (4)i=l

SVM has the following advantages to processbiologicaldata [12]: (1) SVM is computationally efficient and it ischaracterized by fast training which is essential for high­throughput screening of large proteindatasets. (2) SVM isreadily adaptable to new data, allowing for continuousmodel updates in parallel with the continuing growth ofbiological databases. (3) SVM provides a principled meansto estimate generalization performance via an analytic upperbound on the generalization error. This means that aconfidence level may be assigned to the prediction, andavoidsproblems with overfitting inherent in neuralnetworkfunction approximation.

B. Feature Representation

The initial step of any supervised learning process is theconstruction of an appropriate feature space to describetraining examples. In the context of protein-proteininteractions, it is believedthat the likelihood of two proteinsto interact with each other is associated with their structuraldomaincomposition [13, 14, 15]. It is also assumedthat thehydrophobic effects drive protein-protein interactions [3,18]. For these reasons, this study investigates theapplicability of the domain structure and hydrophobicityproperties as protein features to facilitate the prediction ofprotein-protein interactions using the support vectormachines.

The domaindatawas retrieved fromthe PFAM database.PFAM is a reliable collection of multiple sequencealignments of protein families and profile hidden Markovmodels [19]. The currentversion 10.0 contains 6190 fullyannotated PFAM-A families. PFAM-Bprovides additionalPRODOM-generated alignments of sequence clusters inSWISSPROT and TrEMBL that are not modeled inPFAM-A.

Whenthe domaininformation is used,the dimension sizeof the feature vector becomes the number of domainsappeared in all the yeast proteins. The feature vector foreachproteinwas thus formulated as:

x= [d}, db ..., db ..., dn] (5)

where d, = m when the proteinp has m piecesof domaind; and d, = 0 otherwise, This formula allows the effect ofmultiple domains to be taken into aCCOIDlt. Anotherrepresentation is by using domainscores. In our case, eachtraining example is a pair of interacting proteins (positiveexample) or a pair of proteins known or presumednot tointeract (negative example).

In a similar approach, the amino acid hydrophobicityproperties can be used to construct the feature vectors forSVM. The amino acids hydrophobicity properties areobtained from(Hopp& Woods, 1981). The hydrophobicityfeatures canbe represented in feature vectoras:

(2)

Another sequence feature that has been used to predictPPI in-silico is the hydrophobicity properties of the aminoacid residues. Chung et al. [18] have used SVM learningsystemto recognize andpredictPPI inyeastSaccharomycescerevisiae. Theyselected onlythe hydrophobicity propertiesas sequence feature and combine it to the amino acidsequence of interacting proteins. According to theirexperiments, they reported 94% accuracy, 99% precision,and 90% recall in average. Although they achieved betterresults than the previous work using only hydrophobicityfeature, their method of generating a negative dataset (i.e.non-interacting proteins pairs)is different fromthe previouswork. They constructed the negative interaction set byreplacing each value of the concatenated amino acidsequence with a random feature value. This approach maysimplify the learning task and artificially raiseclassificationaccuracy for training data. There is no guarantee, however,that the generalized classification accuracy will not degradeif the predictor is presented with new, previously unseendata which are hard to classify. Therefore, in this studyweproposeda betterand morerealistic methodto construct thenegative interaction set. Then we compared the using ofdomain structure and hydrophobicity properties as theproteinfeatures for the learning system. The choiceof thesetwo features is motivated by the previous discussedliterature.

This paper is organized as follows. Section 2 gives ageneral description of our method to design feature space,select training data, and conduct learning. Section 3describes protein interaction data sets used in this work asthe standard of truth and the implementation of ourpredictor. In Section 4 we presentand discuss experimentalresults of thiswork.Finally, some ideason future directionsareprovidedin Section5.

yKw.xi+b)~ 1-~i' ~i~O, i = 1,...,n.

21-4244-0220-4/06/$20.00©2006 IEEE

Page 3: [IEEE Informatics (ICOCI) - Kuala Lumpur, Malaysia (2006.06.6-2006.06.8)] 2006 International Conference on Computing & Informatics - Support vector machines for predicting protein-protein

Belresh

lrn\WinJearn\featlrn\WinJearn\prot

DI? ~55111 Mel \MRC(f.( vlP:1 6;t~ ,I;':;12 ' (JR2Ii N 0IP:11314(DI?:::55:11 A~(I '!MRC!'-( vIP: I ~3}1 SMl YJ1I2,:C [IIP:3,61EOIP::S,:1I A~C I '!MA(~: [IIP:m~'1 PUf3 VllO:3: [lIP:67t:EOIP.:S' :1I A.;(I IMlCs:.: [IIP:m,:r·1 A.0 3 Y: AI71'1 01P:1,oi':€DiP .ea~s~ ,,:,(3 ','8R(J)'iw OIP:1al l A~i 2 .ClR21iw OIP:113i:EDIPf2ffiN ;'A~ 3 ',1lIlO:6W DIP::(O:I-I BUD?? YGRZ€e( DIP: 115.1tEDI? :~i81 MC3 \ORQJjW vIP: I ~6lt 'l HA=2 YGL23i1: oIP: m~:(

DIP:EZ~N A~C3 \ORQJjW OIP::t:~.l'I LAS!! Y(,RI$IW 0IP: I::54;[0IP:E2~N A~C3 IBIlO;<iW OIP: 2dZ':N R.Q3 iEH li l 'l OIP:I::OKEOIP.m;r1 ,,'£3 l'8IlO;<iW OIP:XS,1-I RA: ;~ iOLo;~: OIP:1313iE

Fig. 1. A part of DIP database

wherek is the numberof amino acid in the protein x, h,= 1when the amino acid is hydrophobic and h, = 0 when theaminoacid is hydrophilic.

III. M ATERIALS AND IMPLEMENTATION

A. Data sets

Protein interaction data can be obtained from theDatabase of Interacting Proteins (DIP;http://www.dip.doe­mbi.ucla.edu!). At the time of ourexperiments, the databasecomprises 15117 entries representing pairs of proteinsknownto mutually bind,giving rise to a specific biologicalfimction. Here, interacting meanthat two aminoacidchainswere experimentally identified to bind to each other. Eachinteraction paircontains fields linking to other public proteindatabases, protein name identification and references toexperimental literature underlying the interactions. Figure Ishows a part of DIP, where each row represents a pair ofinteracting proteins (the third and the sixth columnsrepresent proteins names).

The proteins sequences files were obtained for theSaccharomyces Genome Database (SGD;http://www.yeastgenome.org! ). The SGD project collectsinformation and maintains a database of the molecularbiology of the yeast Saccharomyces cerevisiae. Thisdatabase includes a variety of genomic and biologicalinformation and is maintained and updated by SGDcurators. The SGD also maintains the S. cerevisiae GeneName Registty, a complete list of all gene names used in S.cerevisiae. This task was transferred to the SGD by Dr.Robert Mortimer in early 1994. We have also compiled asetof general guidelines to genenaming that may be of helpto researcherswho arenaming new S. cerevisiae genes.

The proteins sequence information is needed in thisresearch in order to elucidate the domain structure of theproteins involved in the interaction and to represent theamino acid hydrophobicity inthe feature vectors.

B. Data Preprocessing

Since proteins domains are highly informative for theprotein-protein interaction, we useddomain dataas the mainfeature for protein sequence. We focused on domain dataretrieved from the PFAM database, a reliable collection ofmultiple sequence alignments of protein families and profilehidden Markov models. In order to elucidate the PFAMdomain structure in the yeast proteins, we first obtain allsequences of yeastproteins from SGD. Given that sequencefile, we then run InterProScan [20] to examine whichPFAM domainsappear in each protein. We usedthe stand­alone version oflnterProScan. Apart from the result file isshownin Figure 2.

1-4244-0220-4/06/$20.00©2006 IEEE3

ICOCI2006

<h:ml verslOrI.'l ,O· ."ea<lInljl~ ·I'O·8G'9·1 · 1>- .....hllpro_m.lclwlu

- <prot.. n .1. "ooO..S" Ien9th. ·S:H· eK M . · 50 5 5500 ..C12 7£ OOl ·:>• CWI h" p' 0 "'. · II'NUUUIlIl;l· n.-mt. "Cyt ul.:h rum u e uM l tl o/l "' u~ ...ub," ,lt I" lypu . -'a rnlty · ,.

• <Child.ht><r*.. r~f lpt_ro f"· II'KlI l l-t f> 1 1' h

oC/c."'ld~ ll\t~

• c m.. tc.h id. · PFOOI U · RiImII 'COX1' dbn¥ll • • ·PFA~'· :>

ccc .. tlon ItMt Io'S ' Itfld,,'4 foI' ICOfl!l . 'S .::h . · 3 0 ' · It ..tu ~.'r . w:l-nc" Io· Ii MMPfllnl · /:></!ni tCI'l >

</ rn . rptO:></pl'olu 'l:>

• <pt'o l.. n .1,, ·000 5C· 14I"9th,,'0 :J'" Cf'(:M .."Or n 6 ..06r :J6.. 16 A5 A->• CWI hll p' O id"· II'NUUU"I-tZ' n,armit" · l ntrv n fIlo/llu ro/l e, t y p" II ' tVJl"'. ""'o/I lIIily""

• o:.m., tc h 1(1. -"" 0 13 "11:1 " n..rnu""lntru "_ 1I111Iura 1 "'tlbn..""" "J' f "' ~l · ,,

<iOC:at'M ~ t.vt .. ' ,.117· oM . ' , } !' S(.OIO" · l ,, - l~ ' ~Mlu~ .. ' I' .... odt\nC,.... · HMMl'f llnl· r»o(./rro.'l t ch~

</orl l iMpl"O~

• onl . 1p"O 1d. - I PRoo n .. 7 7" n;tm". "R 'oIA - d lnw:t lltrt DN A rUllyn lll.rll ~1R (R IRVIR", IRI r.. nllO -=rtrl lll ~.)- Iyptoa-oocn llln-:o- <PaunIUrt>

<rIlC n, f 'pl'_r.f ..· IPR0 00 12:J· /><r"C r. r 'p'_r.f. · IPROC:J:20 6· I><r""_fld ,pl_r"'.· II'KU tJ 3 :J ~ :lo · /"

</fovnd_,n"• c m:at( h 'd&'I' ~ OUU 1 11 'I'I:lIr'oO " 'RV '~ ~~"'J' f "' ~l ' >

<iOCMII) 1'I ~ t.vt ..· :lllh· oM ..':. } r 'KOI . .. ·"I. l n· } 1' ~l.lltu~ ..· I · ....~ne" .. ·H~l~lI'f ll nl· I~

</M atch></rlt. rpro:o

</Pl'ot"rt>

Fig. 2. A part from the protein domains file.

From the result file of InterProScan, we list up all PFAMdomains that appear in yeast proteins and index them.Figure 3 showsan example of protein domains thatappearsin yeast genome. The first column represents a proteinwhereas the following columns represent the domains thatappear in the protein. The order of this list is not importantas long we keep it through the whole procedure. Thenumber of all domains listed and indexed in this way isconsideredthe dimension size of the feature vector, and theindex of each PFAM domain within the list now indicatesone of the elements ina feature vector.

The next step is to construct a feature vector for eachprotein. For example, if a protein has domain A and Bwhichhappened to be indexed 12and 56 respectively in theabove step, then we assign "1" to the 12th and 56thelements in the feature vector, and "0" to all the otherelements. Next we focus on the protein pair to be used forSVM training and testing. The assembling offeaturevectorfor each protein pair can be done by concatenating thefeature vectors of proteins constructed in the previous step.Figure 5 showsthe format of the feature vectors to be usedbySYM.

When hydrophobicity is used, each amino acid will bereplacedby 1 if it is hydrophobic and 0 if it is hydrophilic.Two separate training sets for domain and hydrophobicityfeatures have beenconstructed.

proteins_domains.lxl IYBL085M PF00018 PF00169 PF07647 PF07653YBL087C PF00238

~ YBL088C PF00454 PF02259 PF02260..:.J YB L089M PF01490

YB L091C PF00557YB L091C-A PF00635YBL092MPF01655YBL098M PF01360YBL099M PF00006 PF00306 PF02874YB L101M-A PF01021YBL101M-B PF01021 PF00665YBL103C PF00010YB L105C PF00168 PF00069 PF02 185 PF00433 PF00130YBLlllC PF00270YBR001C PF01204 PF07492

Fig. 3. An example of protein domains that appear inyeast genome.

Page 4: [IEEE Informatics (ICOCI) - Kuala Lumpur, Malaysia (2006.06.6-2006.06.8)] 2006 International Conference on Computing & Informatics - Support vector machines for predicting protein-protein

ICOCI2006

TABLEITH E PERFORMANCE OF SVM FOR PREDICTING PPJ USING DOMAIN

AND HYDROPHOBICITY FEATURES.

IV. RESULTS AND DISCUSSION

In this study, we used the L1BSVM software [21] as aclassification tool. The standard radial basis fimction (RBF)as available in L1BSYM was selected as a kemel fimction.Different values ofy forthe kemel K(x, y) = exp(-y 1~_YI12 ),y>O were systematically tested to optimize the balancebetween sensitivity and specificity of the prediction. It isimportant to emphasize that in all our experiments we usedonlysoftmargin SVM. Theyarebettersuited formostreal­world applications thanhardmargin SVMbecause the lattershows poor performance for overlapping classes; in ourcase, no priori knowledge was available on whether classesoverlap ornot

Ten-fold cross-validation was utilized to obtain thetraining accuracy. The entire set of training pairs was splitinto 10 folds so that each fold contained approximatelyequal number of positive and negative pairs. Each trialinvolved selecting one fold as a test set, utilizing theremaining nine folds for training our model, and thenapplying thetrained model to thetestset.

The receiver operating characteristic (ROC) is also usedto evaluate the results of our experiments. It is a graphicalplotof the sensitivity (fraction of true positives - TP) vs. 1­specificity (the fraction of false positives - FP) for a binaryclassifier system as its discrimination threshold is varied.The sensitivity canbe defined as:TP / (TP+ FN) where TPand FN stand for true positive and false negative,respectively. The specificity can be defined as: TN / (TN +FP) where TN and FP stand for true negative and falsepositive, respectively. The area under the ROC curve iscalled ROCscore.

In the table below, a comparison between domainsstructure and hydrophobicity as the protein feature ispresented. The cross-validation accuracy results indicatethat there is no significant difference when using domainstructure or hydrophobic properties as the protein feature.However, ROC score indicates that domain structure isnoticeably better than hydrophobic properties (Figure 4).Another aspect is the running time for both features.Evidently, whendomain structure used, thedataset ismuchsmaller than the data set for the hydrophobic properties.Consequently, the running time required for domainstructure training data is much less than the time runningrequired for the hydrophobic training data as shown inTable!.

Hydrophobicity 78.6214 % 0.8159

REFERENCES

B. Rost, 1. Liu, R. Nair, K.O. Wrzeszczynski, andY. Ofran. Automatic prediction of protein fimction.Cell. Mol. Life Sci. 60,2637-2650. 2003.

H. Lodish, A. Berk, L. Zipursky, P. Matsudaira, D.Baltimore, and 1. Damell, Molecular Cell Biology,4thed.,W.H. Freeman, New York, 2000.

B. Alberts, A. Johnson, 1. Lewis, M. Raff, K.Roberts, and P. Walter, Molecular Biology of theCell, 4thedition, Garland Science, 2002.

P. Uetz, and C.S.Vollert, "Protein-ProteinInteractions", Encyclopedic Reference ofGenomicsand Proteomics in Molecular Medicine(ERGPMM), Springer Verlag, 2005.

E.M. Phizicky, and S. Fields, "Protein-proteininteractions: Method for detection and analysis",Microbiological Reviews, 1995, pp.94-123.

[5]

[4]

[3]

[2]

[1 ]

, , ,I I I I, , , ,...... .~ ~ " _.-- _. -I I I II , I ,. ""

f 0 8 i :. .i ······i·······i······ ·i······ ·i·······i······ ·i······'iii 01 ~ . . . ~ .~ :: : ---+- Hydrophobicity ; AUC= 0 8159

~ 0 6 ..•. . . : ~ . . •. . . . ~ ---e-- Domains; AUC= 0.84802 . : :

~ 0.5 •••- • ' · ·- · · ·· ~ ··_·· ·· ~·· ··· ·· ~·· ·· · ··i· · ·· · · ·i · · ·· · ··i· · ·· ··· i· ··· ··· i· · ····~ ,: : : : : : : :.~ 0.4 ••• ·· f··· ····r·······f······· r·······r· ······r·······r·······1·······1······~ 0.3 · ~ ~ ~ ~ i i i i i .

,:: 0.2 . . •. ~ . . •.. •. ~ . . •. .. . ~ . .• .. . . ~ : . . •. .. . ; ; ; . .. .• .. ; . . . .• ., , , , I I I I I, , , , I I I I I, , , , I I I I I

0 1 ··· ···:· ···· ··:·· ··· ··:······ ·:·· ·· ·· ·f· ····· ·;··· ·· ·· ;·· ····· f··· ···· f···· ··, , , , I I I I I, , , , I I I I I

0.1 0.2 03 0.4 0.5 0.6 0.7 0.8 0.9FalsePOSit ive Fraction (l -Specificity)

V. CONCLUSION

The prediction approach reported in this papergeneratesa binary decision aboutpotential protein-protein interactionsbased on the domain structure of the interacting proteins.One difficult challenge in this research is to find negativeexamples of interacting proteins, i.e., to find non-interactingprotein pairs. For negative examples of SVM training andtesting, we use a randomizing method. However, findingproper non-interacting protein pairs is important to ensurethat prediction system reflects the real world. Discoveringinteracting protein patterns using primary structures ofknown protein interaction pairs may be subsequentlyenhanced by using other features such as secondary andtertiary structure in the leaming machine. In conclusion theresult of this study suggests thatprotein-protein interactionscan be predicted from domain structure with reliableaccuracy and acceptable running time. Consequently, theseresults showthe possibility of proceeding directly from theautomated identification of a cell's gene products toinference of the protein interaction pairs, facilitating proteinfimction andcellular signaling pathway identification.

Fig. 4. The ROC plot for domain and hydrophobicityfeatures.

20,571 seconds(5.7 hours)

79.4372 % 0.8480 34 seconds

Accuracy ROC score Running time

Domain

Feature

41-4244-0220-4/06/$20.00©2006 IEEE

Page 5: [IEEE Informatics (ICOCI) - Kuala Lumpur, Malaysia (2006.06.6-2006.06.8)] 2006 International Conference on Computing & Informatics - Support vector machines for predicting protein-protein

[6]

[7]

[8]

[9]

[10]

[11 ]

[12]

[13]

[14]

J. Wojcik, and V. Schachter, "Protein-Proteininteraction map inference using interacting domainprofile pairs", Bioirformaiics, 2001, 17:S296-S305.

E.M. Marcotte, M. Pellegrini, MJ. Thompson, TO.Yeates, and D. Eisenberg, "A combined algorithmfor genome-wide prediction of protein fimction",Nature 1999,402, pp:83-86.

M. Pellegrini, E.M. Marcotte, MJ. Thompson, D.Eisenberg, and TO. Yeates, "Assigning proteinfimctions by comparative genome analysis: proteinphylogenetic profiles", Proc Nat! Acad Sci USA1999,96, pp:4285-4288.

F. Pazos and A. Valencia, "Similarity ofphylogenetic trees as indicator of protein-proteininteraction", Protein Eng. 2001, 14(9), pp:609-614.

AJ. Enright, 1. N. llipoulos, C. Kyrpides and c.A.Ouzounis, "Protein interaction maps for completegenomes based on gene fusion events", Nature,1999,402, pp:86-90.

D. Eisenberg, E.M. Marcotte, 1. Xenarios, and TO.Yeates, "Protein fimction in the post-genomic era",Nature, 2000,405pp:823-826.

J.R Bock and D.A. Gough. "Predicting protein­protein interactions from primary structure",Bioinformatics.May2001 , 17(5), pp:455-460.

TOyama, K. Kitano, K. Satou, and T Ito,"Extraction of knowledge on protein-proteininteraction by association rule discovery",Bioinformatics,2002, 18no5, pp:705-714.

T. Pawson and P. Nash, "Assembly of cellregulatory systems through protein interactiondomains", Science, 2003, 300,pp:445-452.

[15]

[16]

[17]

[18]

[19]

[20]

[21 ]

[22]

ICOCI2006

WK Kim, J. Park, and J.K. Suh, "Large scalestatistical prediction of protein-protein interaction bypotentially interacting domain (PID) pair". GenomeIrformaiics, 2002, 13, pp:42-50.

S.M. Gomez, W.S. Noble, and A. Rzhetsky,"Leaming to predict protein-protein interactionsfrom protein sequences", Bioirformatics, 2003,Vol.19 no.15, pp:1875-1881.

1. Xenarios, L. Salwinski, XJ. Duan, P. Higney,S.M. Kim, and D. Eisenberg, "DIP, the Database ofInteracting Proteins: a research tool for studyingcellular networks of protein interactions", Nucl.Acids. Res,2002, 30(1), pp:303- 305.

Y. Chung, G. Kim, Y. Hwang, and H. Park.Predicting Protein-Protein Interactions from OneFeature Using SYM. In proceedings of lEAlAlEpp:50-55.2004.

V.N. Vapnik, The Nature of Statistical LearningTheory. Springer, 1995.

A. Bateman, L. Coin, R. Durbin, RD. Finn, V.Hollich, Griffiths-Jones, S. Khanna, A. Marshall,S.E. Moxon, L.L. Sonnhammer, OJ. Studholme, C.Yeats, and S.R. Eddy, "ThePfam: Protein FamiliesDatabase", Nucleic AcidsResearch: Database Issue,2004,32, pp:Dl38-Dl41 .

NJ. Mulder, R Apweiler, TK. Attwood, et aI. ,"The InterPro Database brings increased coverageand new features", Nucleic Acids Research, 2003,31,pp:315-318.

C.C. Chang and CJ . Lin, "LIBSVM : a library forsupport vectormachines", 2001.Software availableat http ://www.csie.ntu.edu.tw/~ci lin/libsvm

1-4244-0220-4/06/$20.00©2006 IEEE5