Learning from heterogeneous data sources: an application ... · 3/4/2016 · Transfer learning for...

Learning from heterogeneous data sources: anapplication in spatial proteomics

Lisa M. Breckels1,2, Sean Holden3, David Wojnar4, Claire M. Mulvey2,Andy Christoforou2, Arnoud Groen2, Matthew W.B. Trotter5, Oliver

Kohlbacher4,6,7, Kathryn S. Lilley2, and Laurent Gatto *1,2

1Computational Proteomics Unit, Department of Biochemistry, Universityof Cambridge, Tennis Court Road, Cambridge, CB2 1QR, UK

2Cambridge Centre for Proteomics, Department of Biochemistry, Universityof Cambridge, Tennis Court Road, Cambridge, CB2 1QR, UK

3Computer Laboratory, University of Cambridge, 15 JJ Thomson Avenue,Cambridge CB3 0FD, UK

4Quantitative Biology Center, Universitat Tubingen, Auf der Morgenstelle10, 72076 Tubingen, Germany

5Celgene Institute for Translational Research Europe (CITRE), Sevilla41092, Spain

6Center for Bioinformatics, Universitat Tubingen, Sand 14, 72076Tubingen, Germany

7Biomolecular Interactions, Max Planck Institute for DevelopmentalBiology, Spemannstr. 35, 72076 Tubingen, Germany

March 4, 2016

*email: [email protected]

1

.CC-BY 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted March 4, 2016. ; https://doi.org/10.1101/022152doi: bioRxiv preprint

[email protected]

https://doi.org/10.1101/022152

http://creativecommons.org/licenses/by/4.0/

Transfer learning for spatial proteomics

Abstract

Sub-cellular localisation of proteins is an essential post-translational regulatory mechanismthat can be assayed using high-throughput mass spectrometry (MS). These MS-based spa-tial proteomics experiments enable us to pinpoint the sub-cellular distribution of thousandsof proteins in a specific system under controlled conditions. Recent advances in high-throughput MS methods have yielded a plethora of experimental spatial proteomics datafor the cell biology community. Yet, there are many third-party data sources, such as im-munofluorescence microscopy or protein annotations and sequences, which represent a richand vast source of complementary information. We present a unique transfer learning clas-sification framework that utilises a nearest-neighbour or support vector machine system, tointegrate heterogeneous data sources to considerably improve on the quantity and qualityof sub-cellular protein assignment. We demonstrate the utility of our algorithms throughevaluation of five experimental datasets, from four different species in conjunction with fourdifferent auxiliary data sources to classify proteins to tens of sub-cellular compartments withhigh generalisation accuracy. We further apply the method to an experiment on pluripotentmouse embryonic stem cells to classify a set of previously unknown proteins, and validateour findings against a recent high resolution map of the mouse stem cell proteome. Themethodology is distributed as part of the open-source Bioconductor pRoloc suite for spatialproteomics data analysis.

Abbreviations LOPIT: Localisation of Organelle Proteins by Isotope Tagging, PCP: Pro-tein Correlation Profiling, ML: Machine learning, TL: Transfer learning, SVM: Support vec-tor machine, PCA: Principal component analysis, GO: Gene Ontology, CC: Cellular com-partment, iTRAQ: Isobaric tags for relative and absolute quantitation, TMT: Tandem masstags, MS: Mass spectrometry

Running title A Transfer Learning Framework for Spatial Proteomics Data

2



https://doi.org/10.1101/022152



Introduction

Cell biology is currently undergoing a data-driven paradigm shift [1]. Molecular biologytools, imaging, biochemical analyses and omics technologies, enable cell biologists to trackthe complexity of many fundamental processes such as signal transduction, gene regulation,protein interactions and sub-cellular localisation [2]. This remarkable success, has resultedin dramatic growth in data over the last decade, both in terms of size and heterogene-ity. Coupled with this influx of experimental data, databases such as UniProt [3] and theGene Ontology [4] have become more information rich, providing valuable resources for thecommunity. The time is ripe to take advantage of complementary data sources in a sys-tematic way to support hypothesis- and data-driven research. However, one of the biggestchallenges in computational biology is how to meaningfully integrate heterogeneous data;transfer learning, a paradigm in machine learning, is ideally suited to this task.

Transfer learning has yet to be fully exploited in computational biology. To date, variousdata mining and machine learning (ML) tools, in particular classification algorithms havebeen widely applied in many areas of biology [5]. A classifier is trained to learn a mappingbetween a set of observed instances and associated external attributes (class labels) which issubsequently used to predict the attributes on data with unknown class labels (unlabelleddata). In transfer learning, there is a primary task to solve, and associated primary datawhich is typically expensive, of high quality and targeted to address a specific questionabout a specific biological system/condition of interest. While standard supervised learningalgorithms seek to learn a classifier on this data alone, the general idea in transfer learning isto complement the primary data by drawing upon an auxiliary data source, from which onecan extract complementary information to help solve the primary task. The secondary datatypically contains information that is related to the primary learning objective, but was notprimarily collected to tackle the specific primary research question at hand. These data canbe heterogeneous to the primary data and are often, but not necessarily, cheaper to obtainand more plentiful but with lower signal-to-noise ratio.

There are several challenges associated with the integration of information from auxiliarysources. Firstly, if the primary and auxiliary sources are combined via straightforwardconcatenation the signal in the primary can be lost through dilution with the auxiliarydue to the latter being more plentiful and often having lower signal-to-noise ratio (see S5File, Figure 7 for an illustration). Feature selection can be used to extract the attributeswith the most distinct signals, however the challenge still remains in how to combine thisdata in a meaningful way. Secondly, combining data that exist in different data spaces isoften not straightforward and different data types can be sensitive to the classifier employed,in terms of classifier accuracy.

In one of the first applications of transfer learning Wu and Dietterich [6] used a k-nearest neighbours (k-NN) and support vector machine (SVM) framework for plant imageclassification. Their primary data consisted of high-resolution images of isolated plant leavesand the primary task was to determine the tree species given an isolated leaf. An auxiliarydata source was available in the form of dried leaf samples from a Herbarium. Using a kernelderived from the shapes of the leaves and applying the transfer learning (TL) framework[6], they showed that when primary data is small, training with auxiliary data improvesclassification accuracy considerably. There were several limitations in their methods: firstly,

3



https://doi.org/10.1101/022152



the data in the k-NN TL classifier were only weighted by data source and not on a class-by-class basis, and, secondly in the SVM framework both data sources were expected tohave the same dimensions and lie in the same space. We present an adaption and significantimprovement of this framework and extend the usability of the method by (i) incorporating amulti-class weighting schema in the k-NN TL classifier, and (ii) by allowing the integrationof primary and auxiliary data with different dimensions in the SVM schema to allow theintegration of heterogeneous data types. We apply this framework to the task of proteinsub-cellular localisation prediction from high resolution mass spectrometry (MS)-based data.

Spatial proteomics, the systematic large-scale analysis of a cell’s proteins and their assign-ment to distinct sub-cellular compartments, is vital for deciphering a protein’s function(s)and possible interaction partners. Knowledge of where a protein spatially resides within thecell is important, as it not only provides the physiological context for their function butalso plays an important role in furthering our understanding of a protein’s complex molec-ular interactions e.g. signalling and transport mechanisms, by matching certain molecularfunctions to specific organelles.

There are a number of sources of information which can be utilised to assign a protein toa sub-cellular niche. These range from high quality data produced from experimental high-throughput quantitative MS-based methods (e.g. LOPIT [7] and PCP [8]) and imaging data(e.g. [9]), to freely available data from repositories and amino acid sequences. The former,in a nutshell, involves cell lysis followed by separation and fractionation of the subcellularstructures as a function of their density, and then selecting a set of distinct fractions toquantify by mass-spectrometry. These quantitative protein profiles are representative oforganelle distribution and hence are indicative of their subcellular localisation [10]. Basedon the distribution of a set of known genuine organelle marker proteins, pattern recognitionand ML methods can be used to match and associate the distributions of unknown residentsto that of one of the markers. There is thus a reliance on reliable organelle markers andstatistical learning methods for robust proteome-wide localisation prediction [11]. Theseapproaches have been utilised to gain information about the sub-cellular location of proteinsin several biological examples, such as Arabidopsis [7, 12, 13, 14, 15, 16], Drosophila [17],yeast [18], human cell lines [19, 20], mouse [8, 21] and chicken [22], using a number ofalgorithms, such as, SVMs [23], k-NN [15], random forest [24], naive Bayes [14], neuralnetworks [25], and partial-least squares discriminant analysis [7], [17], [22].

Although application of computational tools to spatial proteomics is a recent develop-ment, the determination of protein localisation using in silico data is well-established (re-viewed in [26, 27, 28]). Many methods have been developed to predict protein localisationfrom amino acid sequence features e.g. amino-acid composition information (e.g. [29, 30, 31,32, 33, 34, 35]), localisation signals and motifs relevant to protein sorting (e.g. [36, 37, 38, 39,40, 41, 42, 43]). Annotation-based prediction methods have also been widely used that useinformation about functional domains (e.g. [44, 45]), protein-protein interactions (e.g. [46,47, 48]) and Gene Ontology (GO) [4] terms (e.g. [49, 50, 51, 52]). However, not all proteinsin GO are reliably annotated; for example, according to the 2015 03 release of UniProtKB [3]the human, mouse, Drosophila melanogaster and Arabidopsis thaliana proteomes have lessthan 14%, 14%, 6% and 13% experimentally-verified Gene Ontology cellular compartment(GO CC) sub-cellular annotations, in each proteome respectively.

Despite improvements in generalisation accuracy of sequence- and annotation-based clas-

4



https://doi.org/10.1101/022152



sifiers, a fundamental problem concerns the biological relevance and ultimate utility to cellbiology of such predictors. Protein sequences and their annotation do not change accord-ing to cellular condition or cell type, whereas protein localisation can change in responseto cellular perturbation. Furthermore, this type of data does not adequately describe therange of mechanisms via which a particular protein may reside in a particular organelle.Not all protein sequences contain motifs or exhibit compositional properties indicative of or-ganelle residency. Despite the inherent limitations of using in silico data to predict dynamiccell- and condition-specific protein properties, transfer learning [6, 53, 50, 51, 52] may allowthe transfer of complementary information available from these data to classify proteins inexperimental proteomics datasets.

Here, we present a new transfer learning framework for the integration of heterogeneousdata sources, and apply it to the task of sub-cellular localisation prediction from experimentaland condition-specific MS-based quantitative proteomics data. Using the k-NN and SVMalgorithms in a transfer learning framework we find that when given data from a high qualityMS experiment, integrating data from a second less information rich but more plentifulauxiliary data source directly in to classifier training and classifier creation results in theassignment of proteins to organelles with high generalisation accuracy. Five experimental MSLOPIT datasets, from four different species, were employed in testing the classifiers. We showthe flexibility of the pipeline through testing four auxiliary data sources; (1) Gene Ontologyterms, (2) immunocytochemistry data [9], (3) sequence and annotation features , and (4)protein-protein interaction data [54]. The results obtained demonstrate that this transferlearning method outperforms a single classifier trained on each single data source alone andon a class-by-class basis, highlighting that the primary data is not diluted by the auxiliarydata. This methodology forms part of the open-source open-development Bioconductor [55]pRoloc [56] suite of computational methods available for organelle proteomics data analysis.

Results

Here, we have adapted a classic application of inductive transfer learning (TL) [6] usingexperimental quantitative proteomics data as the primary source and Gene Ontology CellularCompartment (GO CC) terms as the auxiliary source. Using this TL approach, we haveexploited auxiliary data to improve upon the protein localisation prediction from quantitativeMS-based spatial proteomics experiments using (1) a class-weighted k-NN classifier, and (2)an SVM classifier in a TL framework. We also show the flexibility of the framework byusing data from the Human Protein Atlas [57] and input sequence and annotation featuresfrom the YLoc [58, 59] web server, and protein-protein interaction data from the STRINGdatabase [54] as auxiliary data sources.

To assess classifier performance we employed the classic machine learning schema ofpartitioning our labelled data into training and testing sets, and used the testing sets toassess the strength of our classifiers. Parameter optimisation was conducted on the labelledtraining data using 100 rounds of stratified 80/20 partitioning, in conjunction with 5-foldcross-validation in order to estimate the free parameters via a grid search, as implementedin the pRoloc package [56] (and described in the methods below). Here, for the k-NN TLalgorithm these parameters are the weights assigned to each class for each data source, and

5



https://doi.org/10.1101/022152



for the SVM TL algorithm these are C, γP and γA for the two kernels, as described in thematerials and methods. The testing set is then used to assess the generalisation accuracyof the classifier. By applying the best parameters found in the training phase on test data,observed and expected classification results can be compared, and then used to assess howwell a given model works by getting an estimate of the classifier’s ability to achieve a goodgeneralisation, that is given an unknown example predict its class label with high accuracy.This schema was repeated for all 5 datasets, and for the SVM and k-NN classifiers, trainedon (i) LOPIT alone, and (ii) GO CC alone, for comparison with the TL algorithms.

For simplicity, throughout this manuscript we refer to the mouse pluripotent embryonicstem cell dataset as the ‘mouse dataset’, the human embryonic kidney fibroblast dataset asthe ‘human dataset’, the Drosophila embryos dataset as the ‘fly dataset’, the Arabidopsisthaliana callus dataset as the ‘callus dataset’ and finally the second Arabidopsis thalianaroots dataset, as the ‘roots dataset’.

The k-NN transfer learning classifier

The median macro-F1 scores for the mouse, human, callus, roots and fly datasets were 0.879,0.853, 0.863, 0.979, 0.965, respectively, for the combined k-NN transfer learning approach.A two sample t-test showed that over 100 test partitions, the mean estimated generalisationperformance for the k-NN transfer learning approach was significantly higher than on profilestrained solely from only primary or auxiliary alone for the mouse (p = 2e−21 for primaryalone and p = 7e−78 for auxiliary alone), human (p = 1e−7 for primary alone and p = 8e−32

for auxiliary alone), plant roots (p = 4e−17 for primary alone and p = 4e−22 for auxiliaryalone), and fly (p = 3e−5 for primary alone, p = 1e−112 for auxiliary alone) data (Fig. 1). Wefound that the plant callus dataset did not significantly benefit (nor detrimentally affected)by the incorporation of auxiliary data. This was unsurprising as this dataset is extremelywell-resolved in LOPIT (S1 File, Fig. 1, top right) and the median macro F1-score over 100rounds of training and testing with a baseline k-NN classifier resulted in a median macroF1-score of 0.985 (the combined approach yielded a macro F1-score of 0.973).

Mouse Human Callus Roots Fly

0.4

0.6

0.8

1.0

Combined Primary Auxiliary Combined Primary Auxiliary Combined Primary Auxiliary Combined Primary Auxiliary Combined Primary Auxiliary

Data source

Ma

cro

F1

Sco

re

Figure 1: Boxplots, displaying the estimated generalisation performance over 100 test partitions.Results for the k-NN transfer learning algorithm applied with (i) optimised class-specific weights (combined),(ii) only primary data and (iii) only auxiliary data, for each dataset.

The k-NN transfer learning classifier uses optimised class weights to control the pro-

6



https://doi.org/10.1101/022152



portion of primary to auxiliary neighbours to use in classification. One advantage of thisapproach is the ability for the user to set class weights manually, allowing complete controlover the amount of auxiliary data to incorporate. As previously described, the class weightscan be set through prior optimisation on the labelled training data. Fig. 2 shows the detailedresults for the mouse dataset and the distribution of the 100 best weights selected over 100rounds of optimisation are shown on the top left. We found the distribution of weights foreach dataset reflected closely the sub-cellular resolution in each experiment. For example,for the experiment on the mouse dataset the distribution of best weights identified for theendoplasmic reticulum (ER), mitochondria and chromatin niches are heavily skewed towards1 indicating that the proportion of neighbours to use in classification should be predomi-nantly primary. Note, as described in the methods if the class weight is assigned to 1, thenstrictly only neighbours in primary data are used in classification and similarly, if the classweight is 0 then all weight is given to the auxiliary data. If the weight falls between these twolimits the neighbours in both the primary and auxiliary data sources is considered. Fromexamining the principal components analysis (PCA) plot (Fig. 2, top right) we indeed foundthat these organelles are well separated in the LOPIT experiment. Conversely, we foundthat the 40S ribosome overlaps somewhat with the nucleus (non-chromatin) cluster (Fig. 2,top right) which is reflected in the best choice of class weights for these two niches; theywere both assigned best weights of 1/3 and their weight distributions are skewed towards 0indicating that more auxiliary data should be used to classify these sub-cellular classes. Ifwe further examine the class-F1 scores for these two sub-cellular niches (Fig. 2, bottom) weindeed find that including the auxiliary data in classification yields a significant improvementin generalisation accuracy (p = 1e−16 for 40S ribosome (red) and p = 1e−10 for the nucleus(non-chromatin) (pink)). We also found this to be the case for the proteasome, which isoverlapping with the cytosol. We found LOPIT alone did not distinguish between thesetwo sub-cellular niches in this particular experiment, however, the addition of auxiliary datafrom the Gene Ontology resulted in a significant increase in classifier prediction (p = 2e−16)as shown by the class-specific box plot in Fig. 2, bottom (black).

Many experiments are specifically targeted towards resolving a particular organelle ofinterest (e.g. the TGN in the roots dataset) which requires careful optimisation of theLOPIT gradient. In such a setup sub-cellular niches other than the one of interest may notbe well-resolved which may simply be due to the fact that the gradient was not optimisedfor maximal separation of all sub-cellular niches, but only one or a few particular organelles.Such experiments in particular may benefit from the incorporation of auxiliary data. Wefound that for the roots dataset all sub-cellular classes, except the TGN sub-compartment,benefitted from including auxiliary data (S1 File Fig. 3, bottom), highlighting the advantageof using more than one source of information for sub-cellular protein classification. The bestweight for the TGN was found to be 1 (S1 File Fig. 3, top left), as expected and indicatinghigh resolution in LOPIT for this class. In this framework we are able to resolve differentniches in the data according to different data sources, further highlighted in the class-specificboxplots in S1 File Figs 1 to 4.

7



https://doi.org/10.1101/022152



−2 0 2 4

−2−1

01

23

4

PC1 (40.28%)

PC

2 (2

5.7%

)

● ●

● ●

● ● ●

● ●

●

● ●

● ●

● ● ●

● ● ●

● ●

Proteasome

Plasma membrane

Nucleus − Non−chromatin

Nucleus − Chromatin

Mitochondrion

Lysosome

Endoplasmic reticulum

Cytosol

60S Ribosome

40S Ribosome

0 1/3 2/3 1Classifier weight

Cla

ss

40S Ribosome 60S Ribosome Cytosol Endoplasmic reticulum

Lysosome Mitochondrion Nucleus − Chromatin Nucleus − Non−chromatin

Plasma membrane Proteasome

0.4

0.6

0.8

1.0

0.6

0.8

1.0

0.00

0.25

0.50

0.75

1.00

0.6

0.7

0.8

0.9

1.0

0.00

0.25

0.50

0.75

1.00

0.8

0.9

1.0

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

0.2

0.4

0.6

0.8

1.0

0.00

0.25

0.50

0.75

1.00

Combined Primary Auxiliary Combined Primary Auxiliary Combined Primary Auxiliary Combined Primary Auxiliary

Combined Primary Auxiliary Combined Primary Auxiliary Combined Primary Auxiliary Combined Primary Auxiliary

Combined Primary Auxiliary Combined Primary Auxiliary

F1 s

core

Figure 2: Visualisation of k-NN TL results. Top left: Bubble plot, displaying the distribution of theoptimised class weights over the 100 test partitions for the transfer learning algorithm applied to the mousedataset. Top right: Principal components analysis plot (first and second components, of the possible eight)of the mouse dataset, showing the clustering of proteins according to their density gradient distributions.Bottom: Sub-cellular class-specific box plots, displaying the estimated generalisation performance over 100test partitions for the transfer learning algorithm applied with (i) optimised class-specific weights (combined),(ii) only primary data and (iii) only auxiliary data, for each sub-cellular class.

8



https://doi.org/10.1101/022152



The SVM transfer learning classifier

Adapting Wu and Dietterich’s classic application of transfer learning [6] we have implementedan SVM transfer learning classifier that allows the incorporation of a second auxiliary datasource to improve upon the classification of experimental and condition-specific sub-cellularlocalisation predictions. The method employs the use of two separate kernels, one for eachdata source. As previously described, to assess the generalisation accuracy of our classifier weemployed the classic machine learning schema of partitioning our labelled data into trainingand testing sets, and used the testing sets to assess the strength of our classifiers. This wasrepeated on 100 independent partitions for (i) the SVM TL method, (ii) a standard SVMtrained on LOPIT alone, and (iii) a standard SVM trained on GO CC alone.

For the SVM TL experiments the resultant median macro-F1 scores for the mouse, hu-man, callus, roots and fly datasets were 0.902, 0.868, 0.956, 0.875, 0.961, respectively, overthe 100 partitions. As per the k-NN TL, we found the macro-F1 scores for the SVM TL(S2 File Fig. 1) were significantly higher than on profiles trained solely from only primaryor auxiliary alone; mouse (p = 5e−56 for primary alone and p = 6e−37 for auxiliary alone),human (p = 7e−3 for primary alone and p = 1e−21 for auxiliary alone), callus (p = 4e−3

for primary alone and p = 1e−92 for auxiliary alone), roots (p = 2e−45 for primary aloneand p = 7e−25 for auxiliary alone), and fly (p = 3e−3 for primary alone and p = 4e−105 forauxiliary alone) data. This was also evident on the organellar level as seen in the supportingfigures in the S2 File.

Other auxiliary data sources

One of the advantages of the transfer learning framework is the flexibility to use differenttypes of information for both the primary and auxiliary data source. We demonstrate theflexibility of this framework by testing other complementary sources of information as anauxiliary data source.

The Human Protein Atlas. The sub-cellular Human Protein Atlas [57] provides proteinexpression patterns on a sub-cellular level using immunofluorescent staining of human U-2OS cells. As described in the materials and methods the hpar Bioconductor package [60] wasused to query the sub-cellular Human Protein Atlas [57] (version 13, released on 11/06/2014).This auxiliary data, to be integrated with our human LOPIT experiment, was encoded asa binary matrix describing the localisation of 670 proteins in 18 sub-cellular localisations.Information for 192 of the 381 labelled marker proteins were available. These 192 proteinscovered 8 of the 10 known localisations in the human LOPIT experiment and were usedto estimate the classifier generalisation accuracy of (i) the transfer learning approach withboth data sources, (ii) the LOPIT data alone and (iii) the HPA data alone, as describedpreviously. As detailed in the supplementary information (S3 File Fig. 1), we observed astatistically significant improvement in our overall classification accuracy as well as severalpositive organelle-specific results.

YLoc sequence and annotation features. Sequence and annotation features, that wereused as input from the computational classifier YLoc [58, 59] (see materials and methods,

9



https://doi.org/10.1101/022152



Table 1) were selected as an auxiliary data source to complement the LOPIT mouse stemcell dataset. 34 sequence and annotation features were selected using a correlation featureselection, as described in the materials and methods. Using the LOPIT mouse dataset asour primary data, and the 34 YLoc features as our auxiliary we employed the standardprotocol for testing classifier performance (i) using the k-NN transfer learning with bothdata sources, (ii) the primary data alone and (iii) the auxiliary data alone. Although we didnot observe a statistically significant improvement using the auxiliary data in the transferlearning framework, we did not see any statistically significant disadvantage in combininginformation (S3 File Fig. 2). Thus we found that incorporating data from auxiliary sources inthis framework does not dilute any strong signals in the original experiment, demonstratingthe flexibility of the classifier.

Protein-protein interaction data. Protein-protein interaction data was retrieved fromthe STRING database [54] (version 10) in the human data set. An interaction contingencymatrix was constructed using the STRING combined scores (see methods). Interaction scoresfor 1109 possible interaction partners were available for 99 of the 381 markers. As describedfor the other sources above, using this protein-protein interaction information as an auxiliarydata source we employed the standard protocol for testing classifier performance (i) usingthe k-NN transfer learning with both data sources, (ii) the primary data alone and (iii) theauxiliary data alone. As per the YLoc data we did not observe a statistically significantincrease in combining auxiliary information with our primary data using transfer learning,however, we did not see any statistically significant disadvantage (S3 File Fig. 3).

Biological application

We applied the two transfer learning classifiers to a real-life scenario, using the E14TG2amouse stem cell dataset as our use-case to (i) demonstrate algorithm application, and (ii)highlight the applicability of the method for predicting protein localisation in MS-basedspatial proteomics data over other single-source classifiers.

Sub-cellular protein localisation prediction in mouse pluripotent embryonic stemcells. The E14TG2a mouse stem cell LOPIT dataset contained 387 labelled and 722 un-labelled protein protein profiles distributed among 10 sub-cellular niches (Table 1 of the S6File). Following extraction of the GO CC auxiliary data matrix for all proteins in the datasetthe following four classifiers were applied (1) a k-NN (with LOPIT data only), (2) the k-NNTL (with LOPIT and GO CC data), (3) an SVM (with LOPIT data only) and (4) the SVMTL (with LOPIT and GO CC data) and the parameters for each optimised (see methods)for the prediction of the sub-cellular localisation of the unlabelled proteins in the dataset.

In supervised machine learning the instances which one wishes to classify can only beassociated to the classes that were used in training. Thus, it is common when applying asupervised classification algorithm, wherein the whole class diversity is not present in thetraining data, to set a specific score cutoff on which to define new assignments, below whichclassifications are set to unknown/unassigned. The pRoloc tutorial, which is found in the

10



https://doi.org/10.1101/022152



set of accompanying vignettes in the pRoloc package [56], describes this procedure and howto implement this in practice. Deciding on a threshold is not trivial as classifier scoresare heavily dependent upon the classifier used and different sub-cellular niches can exhibitdifferent score distributions.

To validate our results and calculate classification thresholds based on a 5% false dis-covery rate (FDR) for each of the four classifiers (i.e. k-NN, k-NN TL, SVM, SVM TL)we compared the predicted localisations with the localisation of the same proteins found inthe highest resolution spatial map of mouse pluripotent embryonic stem cells to date [61].From examining the overlap between our new classifications and the localisations in the highresolution mouse map we found 183 of our 722 unlabelled proteins matched a high confi-dence localisation in the new dataset. Of the remaining, 347 of our proteins were labelledas unknown in the mouse map (i.e. were assigned a low confidence localisation in the ex-periment), and 192 proteins did not appear in the map. We used the localisation of these183 high confidence proteins as our gold standard on which to validate our findings and seta FDR for our predictions.

Increasing classifier discrimination power. S4 File Fig. 1 shows the score distributionsfor correct and incorrect assignments of the unassigned proteins in the dataset (as validatedthrough the hyperLOPIT mouse map [61]) and the distribution of the scores per classifier.Note, the scores are not a reflection of the classification power and the score distributionsbetween the four different methods are not comparable to one another as they are calculatedusing different techniques. For both of the single-source k-NN and SVM classifiers there is alarge overlap in the distribution of scores for correct and incorrect assignments (S4 File Fig.1). It is desirable to have a distribution of scores that enables one to choose a cutoff thatminimises the FDR. What is evident from examining the score distributions of incorrect andcorrect assignments is that by using transfer learning we have increased the discriminationpower of the classifier and thus lowered our FDR.

This is further highlighted by receiver-operator characteristics (ROC) analysis (Fig. 3) inwhich the performance of the 4 different classifiers is displayed for different scoring thresholds.When given a specific score cutoff, the ROC curve plots the true-positive rate (TPR) versusthe false-positive rate (FPR) for each classifier. We calculated the area under the ROC curve(AUC) for each classifier and found the AUC for the k-NN, SVM, k-NN TL and SVM TLwas 0.693, 0.705, 0.746 and 0.822, for each classifier respectively.

Using our knowledge of the correct/incorrect outcomes of these 183 previously unlabelledproteins we calculated an appropriate threshold at which to classify all unlabelled proteins.Using a FDR of 5% we found assignment thresholds for the SVM (0.85), SVM TL (0.785)and k-NN TL (0.805) to classify the remaining unlabelled proteins. A FDR of 5% was notpossible with the k-NN classifier, and the lowest achievable FDR was 15%, which occurredusing the strictest threshold of 1 i.e. only when all 5 nearest neighbours agreed. Comparingthe classifications made from the single-source classifiers to those made with the transferlearning methods, we found in both cases we get many more assignments using the combinedtransfer learning approaches compared to the single-source methods using a fixed FDR of5%, as discussed below.

Fig. 4 shows the SVM and SVM TL scores assigned to each of the 183 validated proteins.

11



https://doi.org/10.1101/022152



0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR (1 − specificity)

TPR

(sen

sitiv

ity)

k−NNSVMk−NN TLSVM TL

Figure 3: Receiver-operator characteristics (ROC) analysis. The performance of the 4 differentalgorithms for varying scoring thresholds. For a specific score cutoff, the ROC curve shows the true-positiverate (TPR) versus the false-positive rate (FPR) for each classifier. We calculated the area under the ROCcurve (AUC) for each curve; the AUC for the k-NN, SVM, k-NN TL and SVM TL classifiers were 0.693,0.705, 0.746 and 0.822, for each classifier respectively.

The sub-cellular class is highlighted by solid colours and an un-filled point on the plotrepresents the case where the two classifiers disagreed on the sub-cellular localisation. Wefound that the SVM TL classifier gave 70% more high-confidence classifications with thesame 5% FDR threshold than the single-source SVM trained on primary data alone. Allproteins that were assigned to a sub-cellular niche with a high confidence score in both theSVM and SVM TL (Fig. 4, top right grid) were assigned to the same class. We also foundthat many proteins outside of the high confidence threshold were assigned the same sub-cellular class using both methods, as indicated by the abundance of solid points on the plot.Of the total 722 previously unlabelled proteins we assigned high confidence localisations for204 proteins using the SVM TL, and 176 proteins using the k-NN TL method, based on aFDR of 5% (Tables 1 and 2 of the S4 File).

New findings. By way of biological validation we investigated the additional protein as-signments that were found using the SVM TL method (Fig. 4, bottom right grid) as novelassignments to one of these classes, the plasma membrane, by searching through the literaturefor supporting empirical evidence. For example, using the SVM TL method we found fournew proteins (GTR3 MOUSE, SNTB2 MOUSE, PAR6B MOUSE and ADA17 MOUSE) as-signed only to the plasma membrane with the SVM TL method (Fig. 5) that were also as-signed to the plasma membrane in the recent high resolution mouse map [61] (S4 File Fig. 2).Dehydroascorbic acid transporter (GTR3 MOUSE) is a multi-pass membrane protein which

12



https://doi.org/10.1101/022152



0.2 0.4 0.6 0.8 1.0

0.2

0.4

0.6

0.8

1.0

SVM transfer learning

SV

M

Figure 4: Scatterplot displaying the scores for the SVM and SVM TL classifiers for the 183 proteins vali-dated by the hyperLOPIT mouse map [61]. Each point represents one protein and its associated classifierscores. Filled circles highlight proteins that were assigned the same sub-cellular class with each classifier,empty circles represent the instance when the two classifiers gave different results. The solid lines showthe classification boundaries for the two classifiers at a 5% FDR, above which proteins are classified to thehighlighted class, below these boundaries proteins are deemed low confidence and thus left unassigned.

has been previously shown to be a plasma membrane protein in studies isolating the cell sur-face glycoprotein in Jurkat cells [62]. Beta-2 syntrophin or syntrophin 3 (SNTB2 MOUSE)is a phosphoprotein with PDZ domain through which it interacts with ion channels andreceptors. There are confounding reports of the sub-cellular location of this peripheral pro-tein. It associates with dystrophins and has no signal sequence. It is found mostly in musclefibres and brain [63], but to date, its role has not been studied in mouse embryonic stemcells. Given its association with ion channels and receptors, it is perfectly feasible that thesteady location of this protein in stem cells is the plasma membrane. Partitioning defective 6homolog beta (PAR6B MOUSE) is a peripheral membrane protein thought to be in complexwith E-cadherin, aPKC, and Par3 at the plasma membrane [64], where it functions to guideGTP-bound Rho small GTPases to atypical protein kinase C proteins [65]. Disintegrin andmetalloproteinase domain-containing protein 17 (ADA17 MOUSE) is a single pass plasmamembrane protein which functions to cleave the intracellular domain of various plasma mem-brane proteins including notch and TNF-alpha [66]. It is therefore involved in the upstreamevents in several signalling pathways. It has a 17 amino acid N-terminal signal sequencesuggestive of its function as a membrane protein. The full list of localisation predictions forall proteins in the mouse dataset can be found in the R data package pRolocdata.

13



https://doi.org/10.1101/022152



−2 0 2 4

−2

−1

01

23

4

PC1 (40.28%)

PC

2 (

25.7

%)

Figure 5: Principal components analysis plot (PCA) of the mouse stem cell dataset. Proteinsare clustered according to their density gradient distributions. Each point on the PCA plot represents oneprotein. Filled circles are the original protein markers used in classification, hollow circles show new locationsas assigned by the SVM TL classifier. The 4 proteins GTR3 MOUSE, SNTB2 MOUSE, PAR6B MOUSEand ADA17 MOUSE that were found in the SVM TL method and not in an SVM classification with LOPITonly are highlighted.

A Comparison of Transfer Leaning Algorithms

We compared the macro- and class-F1 scores from all experiments on all 5 datasets usedto assess the classifier performance of the k-NN TL and SVM TL methods. We found thatno single method systematically outperformed the other, as described in the S5 File of thesupporting supplement.

When applying the SVM TL and k-NN TL classifiers to the unlabelled proteins (seebiological validation) an analysis of the final assignments (as classified based on a FDR of5%) showed that the predicted protein localisations were in high agreement. Although therewere no protein-organelle assignment mismatches between TL methods we did find a fewcases where one TL method would assign a protein to one of the sub-cellular classes butthe other TL method did not result in any organelle assignment, due to low classificationscores (see S4 File Table 3). Overall, we did not find any contradicting sub-cellular classassignments.

We also compared Wu’s original k-NN algorithm against our class-specific implementation

14



https://doi.org/10.1101/022152



(see S5 File Figure 6). Wu’s method was better than using primary data alone for all butthe callus dataset, but was significantly outperformed by our method for the mouse (p =4e−4) and roots dataset (p = 4e−3).

Discussion

In this study we have presented a flexible transfer learning framework for the integration ofheterogeneous data sources for robust supervised machine learning classification. We havedemonstrated the biological usage of the framework by applying these methods to the task ofprotein localisation prediction from MS-based experiments. We further show the flexibilityof the framework by applying these methods to the five different spatial proteomics datasets,from four different species, in conjunction with three different auxiliary data sources toclassify proteins to multiple sub-cellular compartments. We find the two different classifiers—the k-NN TL and SVM TL—perform equally well and importantly both of these methodsoutperform a single classifier trained on each single data source alone. We further appliedthe algorithm to a real-life use case, to classify a set of previously unknown proteins in aspatial proteomics experiment on mouse embryonic stem cells, which was validated usingthe most high resolution map of the mouse E14TG2a stem cell proteome produced to date[61]. We find integrating data from a second data source directly into classifier training andclassifier creation results in the assignment of proteins to organelles with high generalisationaccuracy. Finally, we find that using freely available data from repositories we can improveupon the classification of experimental and condition-specific protein-organelle predictionsin an organelle-specific manner.

To our knowledge, no other method has been developed to date that allows the incorpo-ration of an auxiliary data source for the primary task of predicting sub-cellular localisationin spatial proteomics experiments. In this study we have developed methods that not onlyallow the inclusion of an auxiliary data source in localisation prediction, but we have createda flexible framework allowing the use of many different types of auxiliary information, andfurthermore allowing the user complete control over the weighting between data sources andbetween specific classes. This is extremely important for the analysis of biological data ingeneral, and spatial proteomics data in particular, as many experiments are targeted towardsresolving specific biologically relevant aspects (sub-cellular niches in spatial proteomics) andthus users may wish to control the impact of auxiliary information for aspects that havebeen specially targeted for analysis by the primary experimental method. In this contextthe setting of weights manually in the k-NN transfer learning classifier allows users completepower to explicitly choose whether to call upon an auxiliary data source or simply use datafrom their own experiment, on an organelle-by-organelle basis.

The effectiveness of using databases as an auxiliary data source will depend greatlyon abundance and quality of annotation available for the species under investigation. Forexample, human is a well-studied species and there is a large amount of information availablein the Gene Ontology and Human Protein Atlas. Furthermore, some organelles are easier toenrich for and thus there exists much more information available to utilise as an auxiliarysource on a organelle by organelle basis. The transfer learning methods we present here allowthe inclusion of any type of auxiliary data, provided of course there is information available

15



https://doi.org/10.1101/022152



for the proteins under investigation.The integration of auxiliary data sources is a double-edged sword. On the one hand,

it can shed light on (i) the primary classification task by reinforcing weak patterns or (ii)complement the signal in the primary data. On the other hand however it is easy to dilutevaluable signals in an expensive experiment by shadowing the uniqueness, and hence bio-logical relevance of the experimental primary data when integration is not performed withcare, a phenomenon coined negative transfer (see S5 File Figure 7). Thus one needs to becautious with data integration in general and not overlook the biological relevance of theprimary data. Here, we provide a solution to this issue by using transfer learning: the k-NN transfer learning classifier uses optimised class-specific weights so as not to penalise anystrong signals in the primary, if no signal is found in the auxiliary; similarly, the SVM trans-fer learning method uses optimised data-specific gamma parameters for each data-specifickernel.

The transfer learning framework forms part of the open-source open-development Bio-conductor pRoloc suite of computational methods available for organelle proteomics dataanalysis. Moreover, as the pipeline utilises the formal Bioconductor classes, different datatypes, for example from gene expression technologies among others, can be easily used inthis framework. The integration of different data sources is one of the major challenges inthe data intensive world of computational biology, and here we offer a flexible and powerfulsolution to unify data obtained from different but complimentary techniques.

Materials and Methods

Primary data

Five datasets, from studies on Arabidopsis thaliana [7, 15], Drosophila embryos [17], hu-man embryonic kidney fibroblast cells [20], and mouse pluripotent embryonic stem cells(E14TG2a) (unpublished) were collected using the standard LOPIT approach as describedby Sadowski et al. [12]. In the LOPIT protocol, organelles and large protein complexesare separated by iodixanol density gradient ultracentrifugation. Proteins from a set of en-riched sub-cellular fractions are then digested and labelled separately with iTRAQ or TMTreagents, pooled, and the relative abundance of the peptides in the different fractions ismeasured by tandem MS. The number of measurements obtained per gradient occupancyprofile (which comprises of a set of isotope abundance measurements) is thus dependent onthe reagents and LOPIT methodology used.

The first Arabidopsis thaliana dataset [7] on callus cultures employed dual use of fourisotopes across eight fractions and thus yielded 8 values per protein profile. The aim of thisexperiment was to resolve Golgi membrane proteins from other organelles. Gradient-basedseparation was used to facilitate this, including separating and discarding as much nuclearmaterial as possible during a pre-centrifugation step, and carbonate washing of membranefractions to remove peripherally associated proteins, thereby maximising the likelihood ofassaying less abundant integral membrane proteins from organelles involved in the secretorypathway.

The second Arabidopsis thaliana dataset on whole roots is one of the replicates published

16



https://doi.org/10.1101/022152



by Groen et al. [15], which was set up to identify new markers of the trans-Golgi network(TGN). The TGN is an important protein trafficking hub where proteins from the Golgiare transported to and from the plasma membrane and the vacuole. The dynamics of thisorganelle are therefore complex which makes it a challenge to identify true residents of thisorganelle. For each replicate, sucrose gradient fractions were subjected to a carbonate washto enrich for membrane proteins and four fractions were iTRAQ labelled. Following MSthe resultant iTRAQ reporter ion intensities for the four fractions were normalised to sixratios and then each protein’s abundance was further normalised across its six ratios by sum.In Groen’s original experiment the iTRAQ quantitation information for common proteinsbetween the three different gradients were concatenated to increase the resolution of theTGN [23].

The aim of the Drosophila experiment [17] was to apply LOPIT to an organism withheterogeneous cell types. Tan et al. [17] were particularly interested in capturing the plasmamembrane proteome (personal communication). There was a pre-centrifugation step to de-plete nuclei, but no carbonate washing, thus peripheral and luminal proteins were not re-moved. In this experiment four isotopes across four distinct fractions were implemented andthus yielded four measurements (features) per protein profile.

The human dataset [67, 20] was a proof-of-concept for the use of LOPIT with an ad-herent mammalian cell culture. Human embryonic kidney fibroblast cells (HEK293T) wereused and LOPIT was employed with 8-plex iTRAQ reagents, thus returning eight valuesper protein profile within a single labelling experiment. As in the LOPIT experiments inArabidopsis and Drosophila, the aim was to resolve the multiple sub-cellular niches of post-nuclear membranes, and also the soluble cytosolic protein pool. Nuclei were discarded atan early stage in the fractionation scheme as previously described, and membranes were notcarbonate washed in order to retain peripheral membrane and lumenal proteins for analysis.

The E14TG2a embryonic mouse dataset (unpublished) also employed iTRAQ 8-plex la-belling, with the aim of cataloguing protein localisation in pluripotent stem cells culturedunder conditions favouring self-renewal. In order to achieve maximal coverage of sub-cellularcompartments, fractions enriched in nuclei and cytosol were included in the iTRAQ labellingscheme, along with other organelles and large protein complexes as for the previously de-scribed datasets. No carbonate wash was performed.

For validation of the predicted localisations made using the transfer learning classifiers onthe E14TG2a dataset above, a new high resolution mouse map was used as a gold standard[61]. This high resolution map was generated using hyperplexed LOPIT (hyperLOPIT), anovel technique for robust classification of protein localisation across the whole cell. Themethod uses an elaborate sub-cellular fractionation scheme, enabled by the use of TandemMass Tag (TMT) 10-plex and application of a recently introduced MS data acquisitiontechnique termed synchronous precursor selection MS3 (SPS)-MS3 [68], for high accuracy andprecision of TMT quantification. The study used state-of-the-art data analysis techniques[67, 56] combined with stringent manual curation of the data to provide a robust mapof the mouse pluripotent embryonic stem cell proteome. The authors also provide a webinterface to the data for exploration by the community through a dedicated online R shiny[69] application (https://lgatto.shinyapps.io/christoforou2015).

All datasets are freely distributed as part of the Bioconductor [55] pRolocdata datapackage [56].

17



https://lgatto.shinyapps.io/christoforou2015

https://doi.org/10.1101/022152



Auxiliary data

The Gene Ontology. The Gene Ontology (GO) project provides controlled structuredvocabulary for the description of biological processes, cellular compartments and molecularfunctions of gene and gene products across species [4]. For each protein seen in every LOPITexperiment the protein’s associated Gene Ontology (GO) cellular component (CC) names-pace terms were retrieved using the pRoloc package [56]. Given all possible GO CC termsassociated to the proteins in the experiment we constructed a binary matrix representingthe presence/absence of a given term for each protein, for each experiment.

Human Protein Atlas. The Human Protein Atlas (HPA) [57] (version 13, released on11/06/2014) was used as an auxiliary source of information to complement the human LO-PIT dataset. The sub-cellular HPA provides protein expression patterns on a sub-cellularlevel using immunofluorescent staining of human U-2 OS cells. We used the hpar Biocon-ductor package [60] to query the atlas. The data was encoded as a binary matrix describingthe localisation of 670 proteins in 18 sub-cellular localisations. In the HPA the reliabilityof annotated protein expression data is given a status of supportive or uncertain, depen-dent on similarity to immunostaining patterns and consistency with available experimentalgene/protein characterisation data in the UniProtKB database. Here, we only localisationsthat have been supportively identified.

YLoc Classifier Features. YLoc [58, 59] is an interpretable web server developed byBriesemeister and co-workers for the prediction of protein sub-cellular localisation. TheYLoc classifier uses numerous features derived from sequence and annotation. A summaryof the features included in the YLoc classifier is shown in Table 1. These features providea source of complementary auxiliary data for the high quality MS based datasets describedabove. To use these features as an auxiliary source of information, a large-scale correlation-based feature selection (CFS) approach [70], as described in [58, 59], was used with themarkers from the mouse dataset to find the set of the most important features.

Protein-protein interaction data. The STRING (Search Tool for the Retrieval of Inter-acting Genes/Proteins) database [54] contains known and predicted protein interactions andquantitatively integrates interaction data from direct (physical) and indirect (functional) as-sociations for a large number of organisms, including human. We have queried the STRINGdatabase (version 10) with protein accessions and retrieved the interaction partners of pro-teins in the human LOPIT data. For each of these proteins, an interaction was recordedand scored using the STRING combined interaction score which was then used to constructan interaction contingency matrix to use as an auxiliary data source. For the 1371 proteinsin our human dataset, 520 proteins (99 markers) displayed interactions, which were used inclassifier testing.

The creation of the auxiliary datasets are documented and demonstrated using executablecode in the pRoloc-transfer-learning vignette.

18



https://doi.org/10.1101/022152



Table 1: A summary of the types of features considered in training and building Briesemeister et al’s YLocclassifier.

Sequence derived Annotation basedAmino acid sequence PROSITE patterns [71]

e.g. amino-acid composition (AAC), Gene Ontology Termspseudo- and normalised- AAC [30] e.g. cellular compartment namespace

Physiochemical properties terms from close homologuese.g. hydrophobic, positively/negativelycharged, aromatic, small etc.

Autocorrelation featurese.g. autocorrelation of properties suchas charge, volume etc.

Sorting signalse.g. mono nuclear localisation signal,nuclear export signal, secretorypathways etc.

The definition of primary and auxiliary is not defined algorithmically by the qualityor the size of the data, but rather by the data and question at hand. Here, LOPIT wasconsidered the primary data because it represented the experiment of interest that was tobe complemented by the auxiliary data. In fact, from an algorithmic point of view, primaryand auxiliary are reciprocal.

Markers

Spatial proteomics relies extensively on reliable sub-cellular protein markers to infer proteomewide localisation. Markers are proteins that are defined as reliable residents and can be usedas reference points to identify new members of that sub-cellular niche. Here, marker proteinsare selected by domain experts through careful mining of the literature. Markers for eachLOPIT experiment were specific to the system under study and conditions of interest andare distributed as part of the Bioconductor [55] pRoloc package [56].

Notation

The primary MS-based experimental datasets P consist of multivariate protein profiles. Theauxiliary data A is a presence/absence binary matrix of Gene Ontology Cellular Compart-ment (GO CC) terms. Data are annotated to either (i) a single known organelle (labelleddata), or (ii) have unknown localisation (unlabelled data). Thus we split P and A intolabelled (L) and unlabelled (U) sections such that P = (LP , UP ) and A = (LA, UA).

The labelled examples for P and A are represented by LP = {(xl, yl)|l = 1, ..., |LP |}where xl ∈ RS, and LA = {(vl, yl)|l = 1, ..., |LA|} where vl ∈ RT . Thus each lth proteinis described by vectors of S and T features (generally, S << T ), for P and A respectively.Each dataset shares a common set of proteins that is annotated to one of the same yl ∈ C ={1, ..., |C|} sub-cellular classes, where |C| ∈ N is the total number of sub-cellular classes.

19



https://doi.org/10.1101/022152



Unlabelled data, UP and UA are represented by UP = {xu|u = 1, ..., |UP |} where xu ∈ RS

and UA = {vu|u = 1, ..., |UA|} where vu ∈ RT , respectively.The labelled data for the ith organelle class, with Ni indicating the number of proteins

for the ith organelle class, is given for P by gPi = {(x, y) ∈ LP |y = i} and for A bygAi = {(v, y) ∈ LA|y = i}. The labelled dataset of all available proteins over the |C| different

sub-cellular classes is given for P by LP =⋃|C|

i=1 gPi and for A by LA =

⋃|C|i=1 g

Ai .

Transfer learning using a k-nearest neighbours framework

We adapt Wu and Dietterich’s [6] classic application of inductive transfer using experimentalquantitative proteomics data as the primary source (P ) and GO CC terms as the auxiliarysource (A). We aim to exploit auxiliary data to improve upon the sub-cellular classificationof proteins found in MS-based LOPIT experiments in an organelle-specific way, using thebaseline k-nearest neighbours (k-NN) algorithm in a transfer learning framework.

In k-NN classification, an unknown example is classified by a majority vote of its labelledneighbours, with the example being assigned to the class most common among its k nearestneighbours. Independent of the transfer learning classifier we compute the best k for eachdata source for values k ∈ {3, 5, 7, 9, 11, 13, 15} through an initial 100 rounds of 5-fold cross-validation using each set of labelled training data for P and then independently for A (asimplemented in pRoloc). We denote by kP the best k for P , and by kA the best k for A.

Having obtained the best k for each data source, the transfer learning algorithm worksas follows. For the uth protein (xu,vu) we wish to classify in U , we start by finding the kPand kA labelled nearest neighbours for xu and vu in LP and LA, respectively. Denote thesesets NP

u and NAu . We then define the vectors pT

u = (pu1 , . . . , pu|C|) and qT

u = (qu1 , . . . , qu|C|) to

contain counts for each class in the sets of nearest neighbours; that is,

pui = |{(x, y) ∈ NPu |y = i}|

qui = |{(v, y) ∈ NAu |y = i}|.

For each protein, let pu = pu/kP and qu = qu/kA be normalized vectors with elements sum-ming to 1 and representing the distribution of classes among the sets of nearest neighboursfor each protein. Finally, let NNP = {pu|u = 1, ..., |UP |} and NNA = {qu|u = 1, ..., |UP |}.

To include both the primary and auxiliary data in the set of potential neighbours wetook a weighted combination of the votes in NNP and NNA for each sub-cellular class. Classweights are defined by the parameter vector θT = (θ1, . . . , θ|C|) with values θi ∈ {0, 13 ,

23, 1}

chosen by optimisation through a prior 100 independent rounds of 5-fold cross-validation ona separate training partition of the labelled data. For the uth unknown protein (xu,vu) inU , the voting scores for each class i ∈ C are calculated as

V (i) = θipui + (1− θi)qui (1)

and the protein is assigned to the class c ∈ C maximizing V (i)

c = arg maxi

V (i).

The class weights θi in equation 1 control the relative importance of the two types of neigh-bours for each class i ∈ C. This differs from Wu and Dietterich’s [6] original approach as

20



https://doi.org/10.1101/022152



they only weight the data sources and not the classes and the data sources. In this paperwe select each class weight θi from the set {0, 1

3, 23, 1}; however, the algorithm allows us to

use any real-valued θi ∈ [0, 1]. If θi = 1, then all weight is given to the primary data in classi and only primary nearest neighbours in class i are considered. Similarly, if θi = 0, thenall weight is given to the auxiliary data in class i and only auxiliary nearest neighbours inclass i are considered. If 0 < θi < 1 then a combination of neighbours in the primary andauxiliary data sources is considered.

Transfer learning using an SVM framework

Linear programming SVMs The method is based on the use of the linear programmingformulation of the SVM (lpSVM). This formulation promotes classifiers that are sparse, inthe sense that where possible only a few parameters obtained through training are non-zero;for a detailed introduction see Mangasarian [72].

We begin by describing the standard lpSVM used for classical two-class classificationproblems with a single labelled training set. We use the multiple-class version of this ap-proach with the individual primary and auxiliary sets P and A as a comparison later inthe paper; we present the method here assuming that the primary set P is being usedand can be set up as a binary classification problem; for example, we might wish to pre-dict whether or not a protein should be assigned to a single specified sub-cellular class.For binary classification problems with class labels y ∈ {+1,−1}, and given labelled dataLP = {(xl, yl)|l = 1, . . . ,m} where m = |LP | the classifier takes the form

h(x) =

{+1 if f(x;αP , b) ≥ 0−1 otherwise

(2)

where f is the latent function

f(x;αP , b) =m∑l=1

ylαPl K

P (xl,x) + b.

Here, KP is a kernel (Shawe-Taylor and Cristianini [73]) associated with the primary dataand αT

P = (αP1 , . . . , α

Pm) and b are parameters determined by training.

For any vector xT = (x1, . . . , xn) let |.|1 denote the 1-norm

|x|1 =n∑

i=1

|xi|.

The training algorithm requires that we solve the linear programme

minαP ,ξ,b

|αP |1 + C|ξ|1 (3)

such that for each i = 1, . . . ,m

yif(xi;αP , b) + ξi ≥ 1

21



https://doi.org/10.1101/022152



and αP , ξ ≥ 0. The parameters ξ and C act in the same way as the corresponding pa-rameters in the standard SVM: ξ contains the slack variables allowing some examples to bemisclassified, and C controls the extent to which such misclassifications are penalized duringtraining.

Note that it is possible for the linear programme to have no solution, although we foundthis to be extremely rare. When this was the case the classifier reverted to predicting themost common class in the labelled data.

Transfer learning for binary classification. Once again we adapt the method of Wuand Dietterich [6] to our problem. The original method requires adaptation as it is designedfor data having two important differences compared with ours. First, it does not requireexamples in the labelled data sets LP and LA to be in correspondence and for correspondingtraining examples to share the same label. Second it assumes that P and A share the samenumber of features. While the first of these differences is easily dealt with as our data is aspecial case that is already covered, the second is more problematic. If we now introducethe labelled auxilliary data LA = {(vl, yl)|l = 1, . . . ,m} a direct application of the approachin [6] requires us to evaluate kernels of the form K(x,v). As P and A contain data withdifferent numbers of features this presents a problem for any SVM-type method, as kernelsare usually required to satisfy the Mercer conditions (Mercer [74]), one of which is that theyare symmetric, such that K(x,x′) = K(x′,x). While research on the use of asymmetrickernels has appeared—see for example [75]—even if we relax this requirement a kernel isessentially a measure of the similarity of its arguments, and the question arises of how onemight sensibly measure the similarity of a protein profile with a presence/absence vector ofGO CC terms. This problem does not arise with Wu and Dietterich’s data as the two setsthey use have the same dimension and are derived in a way that makes measuring similaritystraightforward.

We therefore simplify the original method as follows. We maintain the machinery em-ployed above for the primary data, and introduce a separate kernel KA and parameter vectorαA for the auxilliary data. A vector to be classified now contains both a protein profile xand a GO vector v. The latent function becomes

f(x,v;αP ,αA, b) =m∑l=1

yl[αPl K

P (xl,x) + αAl K

A(vl,v)]

+ b

and training requires us to solve the linear program

minαP ,αA,ξ,b

|αP |1 + |αA|1 + C|ξ|1 (4)

such that for each i = 1, . . . ,m

yif(xi,vi;αP ,αA, b) + ξi ≥ 1

and αP ,αA, ξ ≥ 0.Note that this differs from the method of Multiple Kernel Learning (MKL) (Lanckriet

et al. [76], Gonen and Alpaydin [77]) in that in MKL the single kernel K is replaced in the

22



https://doi.org/10.1101/022152



usual SVM formulation by a weighted sum of kernels

K(x1,x2) =D∑i=1

diKi(x1,x2)

where di ≥ 0 and∑D

i=1 di = 1. The di are then included with α and b in a more in-volved constrained optimisation problem. Our approach has the advantages that it remainsa straightforward linear program and in fact introduces fewer constraints on the form of thelatent function f .

Throughout our experiments we used for KP and KA the Gaussian kernel

K(x1,x2) = exp(−γ||x1 − x2||2)

where ||.|| denotes the 2-norm ||x|| = (∑

i x2i )

1/2. We optimized over the value of C, and

also separate values γP and γA for the two kernels as described below, with C in the range{0.125, 0.25, 0.5, 1, 2, 4, 8, 16} and γP , γA in the range {0.01, 0.1, 1, 10, 100, 1000}.

Multiple classes, class imbalance and probabilistic outputs. As a baseline compar-ison in our experiments we used a standard SVM as implemented in the package LIBSVM(Chang and Lin [78]). In extending our transfer learning technique to deal with multipleclasses and probabilistic outputs we therefore maintained as close a similarity as possible tothe methods used by that library.

SVMs and lpSVMs are in their basic form inherently binary classifiers. In order toaddress multiple-class problems using non-probabilistic outputs such as the one presentedhere we use the method of Knerr et al. [79]. We train a binary classifier to separate eachpair of classes. In order to classify a new example we then take a vote among these binaryclassifiers, assigning the example to the class with the most votes.

As we typically have several sub-cellular classes the binary classification problems usedin constructing the multiple-class classifier are inherently unbalanced. We adjust for thisusing the method of Morik et al. [80]. In each binary problem let n+ denote the numberof positive examples and n− the number of negative examples. In the linear programmeobjective functions (equations 3 and 4) we replace the single value for C with the adjustedvalues

C+ = C√n−/n+

C− = C√n+/n−

for the positive and negative examples respectively. Let S+ denote the set of indices of thepositive examples and S− the set of indices for the negative examples. The term C|ξ|1 inequations 3 and 4 becomes

C+∑i∈S+

|ξi|+ C−∑i∈S−

|ξi|.

Finally, we prefer to employ probabilistic outputs rather than simply thresholding as inequation 2. Once again we employ the same techniques as LIBSVM. The method for binaryclassifiers is presented by Platt [81] and Lin et al. [82], and for multiple-class classifiers byWu et al. [6].

23



https://doi.org/10.1101/022152



Assessing classifier generalisation accuracy

In order to evaluate the generalisation accuracy of each transfer learning classifier we em-ployed the following schema in all experiments. A set of LOPIT profiles labelled with knownmarkers, and their counterpart auxiliary GO CC profiles, were separated at random intotraining (80%) and test (20%) partitions. The split was stratified, such that the relativeproportions of each class in each of the two sets matched that of the complete set of data.The test profiles were withheld from classifier training and employed to test the generali-sation accuracy of the trained classifiers. On each 80% training partition 5-fold stratifiedcross-validation was conducted to test all free parameters via a grid search and select thebest set of parameters for each classifier. In each experiment, for each dataset, this processof 80/20% stratified splitting, training with 5-fold stratified cross-validation on the 80% andtesting on the 20% was repeated 100 times in order to produce 100 sets of macro F1 scoresand class-specific F1 scores. The F1 score (He [83]) is a common measure used to assessclassifier performance. It is the harmonic mean of precision and recall, where

precision =tp

tp + fp, recall =

tp

tp + fn

and tp denotes the number of true positives, fp the number of false positives, and fn thenumber of false negatives. Thus

F1 = 2× precision× recall

precision + recall.

A high macro F1 score indicates that the marker proteins in the test data set are consistentlycorrectly assigned by the algorithm.

To assess whether incorporating an auxiliary data source into classifier training andclassifier creation was better than using primary or auxiliary data alone, we conducted threeindependent experiments for each data source and for each transfer learning method. Weused the above schema to assess the generalisation accuracy of using (1) the transfer learningk-Nearest Neighbours (k-NN) classifier, (2) the primary LOPIT data alone, using a baselinek-NN, (3) the auxiliary GO CC data alone, using a baseline k-NN. We repeated this for thelpSVM transfer learning classifier and used a standard SVM with an RBF kernel for singledata source experiments. Using these experiments we were able to compare using a simplek-NN versus the transfer learning k-NN, and also the use of a standard SVM versus thecombined transfer learning lpSVM approach.

A two-sample two-tailed t-test, assuming unequal variance, was used to assess whether,over the 100 test partitions, the estimated generalisation performance using the optimisedclass-specific fusion approach was better than using either primary data alone, or auxiliarydata alone. A threshold of 0.01 was used in all t-tests to determine significance.

Optimised parameters for the mouse pluripotent embryonic stem cell data. Toclassify the 722 unlabelled proteins in the E14TG2a mouse stem cell dataset we performed100 rounds of stratified 5-fold cross validation on the training partition as detailed above.The best parameters were found to be k = 5 for the k-NN classifier and for the k-NN TL clas-sifier kP = 5, kA = 5 and the best class weights were found to be θ = (1

3, 23, 23, 1, 1

3, 1, 1, 1

3, 23, 0)

24



https://doi.org/10.1101/022152



for the 40S ribosome, 60S ribosome, cytosol, endoplasmic reticulum, lysosome, mitochondria,nucleus - chromatin, nucleus - non-chromatin, plasma membrane and proteasome, respec-tively. For the SVM classifier we found the best parameters to be C = 16 and γ = 10. Forthe SVM TL classifier we found C = 16, γP = 1 and γA = 0.1. Using these parameterswith their associated algorithms we classified the 722 unlabelled proteins in the dataset andobtained a classifier score for each protein.

Supporting Information

S1 File

Supporting figures for the k-NN transfer learning experiments. Visualisations ofthe k-NN transfer learning results for human, plant callus, plant roots and fly datasets.Including bubble plots displaying the distribution of the optimised class weights, principalcomponents analysis plots of the LOPIT primary datasets boxplots displaying the estimatedgeneralisation performance of each classifier.

S2 File

Supporting figures for the SVM transfer learning experiments. Boxplots displayingthe estimated generalisation performance of the SVM transfer learning algorithm applied tothe mouse, human, plant callus, plant roots and fly datasets.

S3 File

Supporting figures for other data sources. Macro- and class-specific results for thek-NN transfer learning algorithm used with the auxiliary Human Protein Atlas dataset, aYLoc sequence and annotation features auxiliary dataset and a protein-protein interactionsdataset.

S4 File

Additional figure and tables for biological application. Boxplots displaying the dis-tribution of classification scores assigned to the unknown proteins in the mouse dataset foreach of the 4 classifiers. Principal components analysis plot displaying the protein classifi-cation results from applying the k-NN transfer learning algorithm on the unlabelled data inthe mouse dataset. Accompanying tables showing the number of sub-cellular assignments ofthe unlabelled proteins amongst the 10 known sub-cellular classes in the data from applyingeach transfer learning method. the mouse stem cell dataset highlighting the new localistionsfound by the k-NN transfer learning method.

S5 File

A comparison of transfer learning methods. A short comparison between k-NN trans-fer learning (TL) and SVM TL classifiers, including boxplots displaying the macro- and

25



https://doi.org/10.1101/022152



class-F1 scores for the k-NN TL and SVM TL experiments over the 100 test partitions oneach dataset. This supplementary file also includes a comparison with Wu’s original k-NNtransfer learning classifier and negative transfer effects are also described.

S6 File

Dataset summary statistics. Tables displaying the dimensions for each primary andauxiliary dataset, including the total number of proteins identified in each LOPIT datasetand number of known markers of sub-cellular protein location.

Acknowledgments

LMB was supported by a BBSRC Tools and Resources Development Fund (AwardBB/K00137X/1). LG was supported by the European Union 7th Framework Program(PRIME-XS project, grant agreement number 262067) and a BBSRC Strategic Longer andLarger Award (Award BB/L002817/1). DW and OK acknowledge funding from the Euro-pean Union (PRIME-XS, GA 262067) and Deutsche Forschungsgemeinschaft (KO-2313/6-1).The authors would also like to thank Dr. Dureid El-Moghraby from the High PerformanceComputing Service and Dr. Jenny Barna from Bioinformatics and Computational Biology,University of Cambridge, for their support. This work was performed using the DarwinSupercomputer of the University of Cambridge High Performance Computing Service, pro-vided by Dell Inc. using Strategic Research Infrastructure Funding from the Higher Educa-tion Funding Council for England and funding from the Science and Technology FacilitiesCouncil.

26



https://doi.org/10.1101/022152


Learning from heterogeneous data sources: an application ... · 3/4/2016 · Transfer learning for...

Documents

Transcript of Learning from heterogeneous data sources: an application ... · 3/4/2016 · Transfer learning for...