
Online Semi-Supervised Learning: Algorithm and Application in Metagenomics

Sultan Imangaliyev∗†‡, Bart Keijser∗†, Wim Crielaard∗‡, and Evgeni Tsivtsivadze∗†
∗ Top Institute Food and Nutrition, Wageningen, The Netherlands

† Research Group Microbiology and Systems Biology,

TNO Earth, Environmental and Life Sciences, Zeist, The Netherlands

Email: {firstname.lastname}@tno.nl
‡ Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam,

Amsterdam, The Netherlands

Email: [email protected]

Abstract—As the amount of metagenomic data grows rapidly, online statistical learning algorithms are poised to play a key role in metagenome analysis tasks. Frequently, data are only partially labeled: the dataset contains only partial information about the problem of interest. This work presents an algorithm and a learning framework that are naturally suitable for the analysis of large-scale, partially labeled metagenome datasets. We propose an online multi-output algorithm that learns by sequentially co-regularizing prediction functions on unlabeled data points and provides improved performance in comparison to several supervised methods. We evaluate the predictive performance of the proposed methods on the NIH Human Microbiome Project dataset. In particular, we address the task of predicting the relative abundance of Porphyromonas species in the oral cavity. In our empirical evaluation the proposed method outperforms several supervised regression techniques and leads to notable computational benefits when training the predictive model.

I. INTRODUCTION AND BACKGROUND

The human body is colonized by a myriad of bacterial species, occupying the various sites of our body and having great impact on health and disease. Distinct microbial populations can be found at different body sites, reflecting the specific physical and chemical conditions, and forming an ecological network. Disturbances in the composition of these microbial networks (dysbiosis) have been associated with diseases, including periodontal diseases [1]. The development of novel methods for DNA sequencing has allowed researchers to examine complex microbial ecosystems of the human body in great detail [2]. Linking the composition of these microbial ecosystems to clinically relevant physiological data has posed a challenge from a computational point of view.

Online learning methods (e.g. [3]) provide a natural algorithmic framework for the analysis of such large-scale metagenome datasets. Furthermore, the collected data may frequently be only partially labeled (e.g. while the microbial composition is observed, the effects of the ecosystem on the host are unknown); such data is typically much easier to obtain than fully labeled data. In this work we propose an algorithm that is suitable for learning from large-scale partially labeled data. As an example we study a relevant problem in the oral health domain, namely the prediction of Porphyromonas species, which are implicated in certain forms of periodontal disease. Thus, timely prediction of the occurrence of these species could serve as a valuable preventive approach. Interactions among different microbial species in various niches of the human body are known to exist. For example, many species present in the oral cavity can also be found in the nasopharyngeal niche. Such biological interactions suggest the possibility of using additional information in the form of “unlabeled data” from different niches to better predict the occurrence of Porphyromonas bacteria.

Recently, a number of techniques have been proposed that can take unlabeled data into account to improve model performance. An approach that stands out from the rest and has a solid theoretical foundation is so-called co-regularization [4]. The general idea behind such algorithms is that they split the attributes into independent sets and learn an algorithm based on these different “views”. Briefly stated, algorithms based upon this approach search for hypotheses from different views, such that the training error of each hypothesis on the labeled data is small and, at the same time, the hypotheses give similar predictions for the unlabeled data. Within this framework, the disagreement among the predictors is taken into account via a co-regularization term. Empirical results show that the co-regularization approach works well for domain adaptation [5], classification [4], regression [6], and clustering [7] tasks. Moreover, theoretical investigations demonstrate that the co-regularization approach reduces the Rademacher complexity by an amount that depends on the “distance” between the views [8]. A classical example of multi-view learning is a web-document classification task, where the document can be represented by the keyword features or by the link features it contains, thus creating two distinct views of the same data point [9]. Many of the multi-view algorithms are formulated within a regularization framework [10]. In this framework, the learning algorithm selects a hypothesis f which minimizes a cost function and which is, at the same time, not too “complex”, i.e. which does not overfit during training and is therefore able to generalize to unseen data.

In this work we extend the framework described in [11] to be applicable to the multiple-output learning setting. Consider a training set $D = (X, Y)$, where $X = (\mathbf{x}_1, \ldots, \mathbf{x}_m)^T \in \mathcal{X}^m$ and $Y = (\mathbf{y}_1, \ldots, \mathbf{y}_m)^T \in \mathbb{R}^{m \times p}$. Also, consider different representations of the data points, that is, unique subsets of features corresponding to $M$ different hypothesis spaces $\mathcal{H}_1, \ldots, \mathcal{H}_M$. The disjoint feature subsets are frequently referred to as views.


Let us assume that in addition to the training set $D = (X, Y)$ with labeled examples we have a training set $\bar{D} = (\bar{X})$ with unlabeled data points $\bar{X} = (\mathbf{x}_{m+1}, \ldots, \mathbf{x}_{m+n})^T \in \mathcal{X}^n$. In the co-regularization setting we would like to identify functions $F = (f_1, \ldots, f_M) \in \mathcal{H}_1 \times \ldots \times \mathcal{H}_M$ minimizing the objective function

$$ J(F) = \sum_{v=1}^{M} L(f_v, D) + \lambda \sum_{v=1}^{M} \|f_v\|_{\mathcal{H}_v}^2 + \mu \sum_{v,u=1}^{M} L_C(f_v, f_u, \bar{D}), \qquad (1) $$

where $\lambda, \mu \in \mathbb{R}_+$ are regularization parameters, the first term of the equation is a loss function that penalizes the difference between a prediction and the corresponding true value, the second is a regularization term that penalizes complex models to counter over-fitting, and the third is a loss function measuring the disagreement between the prediction functions of the views on the unlabeled data.
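To make the role of each term concrete, the following is a minimal NumPy sketch of objective (1) for linear predictors with squared losses on both the labeled and the co-regularization terms; the function and variable names are our own illustration, not code from the paper.

```python
import numpy as np

def coregularized_objective(W, X_views, Xbar_views, Y, lam, mu):
    """Objective (1) for linear predictors with squared losses.

    W          -- list of M weight matrices, W[v] has shape (d_v, p)
    X_views    -- list of M labeled feature matrices, X_views[v]: (m, d_v)
    Xbar_views -- list of M unlabeled feature matrices, Xbar_views[v]: (n, d_v)
    Y          -- output matrix of shape (m, p)
    """
    M = len(W)
    # first term: squared loss of each view's predictor on labeled data
    loss = sum(np.sum((X_views[v] @ W[v] - Y) ** 2) for v in range(M))
    # second term: L2 regularization of each view's weights
    reg = lam * sum(np.sum(W[v] ** 2) for v in range(M))
    # third term: disagreement between the views on unlabeled data
    coreg = mu * sum(
        np.sum((Xbar_views[v] @ W[v] - Xbar_views[u] @ W[u]) ** 2)
        for v in range(M) for u in range(M) if u != v
    )
    return loss + reg + coreg
```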

The above formulation is quite general and allows us to construct various learning schemes by specializing the loss function and the optimization procedure. For example, by considering a single view and specializing the loss in the above formulation we can obtain support vector machines [12] by choosing a hinge loss function, or regularized least squares (RLS) [13] by choosing a squared loss function. In turn, the RLS algorithm with slight modifications (possibly including a bias term) leads to a wide class of other learners, such as the least-squares support vector machine [14], proximal support vector machines [15], and kernel ridge regression [16].
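For instance, with a single view and the squared loss, the regularized risk minimizer has a well-known closed form; a minimal sketch (the helper name is ours):

```python
import numpy as np

def rls_fit(X, Y, lam):
    """Regularized least squares: argmin_W ||XW - Y||^2 + lam * ||W||^2.
    X: (m, d) features, Y: (m, p) targets, lam > 0."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
```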

Co-regularized algorithms are usually not straightforwardly applicable to large-scale learning tasks, where large amounts of unlabeled as well as labeled data are available for training. Several recently proposed algorithms have complexity that is linear in the number of unlabeled data points and superlinear in the number of labeled examples (e.g. cubic, as in the case of co-regularized least squares [6]). Such methods become computationally infeasible as the dataset size increases.

Our algorithm has per-iteration complexity that does not depend on the number of data points and is therefore suitable for larger datasets. There are additional benefits associated with an online formulation of the algorithm, such as quick retraining of the model given additional unlabeled or labeled examples, the flexibility of co-regularizing iteratively at predefined time points, and natural extensions to multiple convex loss functions, e.g. hinge loss, cross-entropy, or squared loss.

II. MULTI-OUTPUT CO-REGULARIZED ALGORITHM

Online algorithms are amongst the most popular approaches for large-scale learning. Methods such as PEGASOS [17], LASVM [18] and GURLS [19] have been successfully applied to a wide range of large-scale problems, leading to state-of-the-art generalization performance. Our algorithm is related to the above-mentioned methods but is preferable when unlabeled data points are available for learning.

A popular approach to tackling large-scale learning problems is to use efficient approximation techniques such as stochastic gradient descent (see e.g. [3]). In this work we consider the multiple-output prediction setting, i.e. instead of a single output variable we have to simultaneously predict $p$ independent output variables corresponding to the output matrix $Y \in \mathbb{R}^{m \times p}$. Let us consider the multi-output co-regularized algorithm (MOCA) in the online setting. Slightly overloading our notation, we write the objective function as a function of the weighting coefficients $\mathbf{W}$

$$ J^*(\mathbf{W}) = \sum_{v=1}^{M} \left( \sum_{i=1}^{m} L(\mathbf{x}_i^v, \mathbf{y}_i; W^v) + \lambda L_R(W^v) \right) + \mu \sum_{\substack{v,u=1 \\ v \neq u}}^{M} \sum_{i=m+1}^{m+n} L_C(\mathbf{x}_i^v, \mathbf{x}_i^u; W^v, W^u), \qquad (2) $$

where the first term corresponds to the loss function mentioned previously and the second term to a regularization on the individual prediction functions. The third is again a co-regularization term that measures the disagreement between the different prediction functions on unlabeled data. The superscript $v$ on each term refers to a certain view in the multi-view setting. We can approximate the optimal solution (obtained when minimizing (1)) by means of gradient descent

$$ W_{t+1}^v = W_t^v - \eta_t^v \nabla_{W^v} J^*(\mathbf{W}), \qquad (3) $$

where the first term refers to the weighting coefficient vector obtained in the previous iteration, and the second term is the product of the learning rate $\eta_t^v$ and the gradient of the objective function with respect to the weighting coefficient vector.

Let us consider the setting in which the squared loss function is used for the co-regularization and the L2 norm for the regularization terms. The choice of squared loss for the co-regularization term is quite natural, as it penalizes the differences among the prediction functions constructed for multiple views (similar to the standard regression setting, where the differences between the predicted and true scores are penalized). For every iteration $t$ of the algorithm, we first choose a set $A_t \subseteq D$ of size $k$. Similarly, we choose $\bar{A}_t \subseteq \bar{D}$ of size $l$ for each round $t$ on the unlabeled dataset. Then, we replace the “true” objective (1) with an approximate objective function and write the update rule as follows

$$ W_{t+1}^v = (1 - \eta_t^v \lambda) W_t^v - \eta_t^v \sum_{(\mathbf{x},\mathbf{y}) \in A_t} \nabla L(\mathbf{x}^v, \mathbf{y}; W_t^v) - 4\mu\eta_t^v \sum_{\substack{v,u=1 \\ v \neq u}}^{M} \sum_{\bar{\mathbf{x}} \in \bar{A}_t} \left( W_t^{vT} \bar{\mathbf{x}}^v - W_t^{uT} \bar{\mathbf{x}}^u \right) \bar{\mathbf{x}}^v. \qquad (4) $$

Note that if we choose $A_t = D$ and $\bar{A}_t = \bar{D}$ on each round $t$, we obtain the gradient projection method. At the other extreme, if we choose $A_t$ to contain a single randomly selected example, we recover a variant of the stochastic gradient method. In general, we allow $A_t$ to be a set of $k$ and $\bar{A}_t$ to be a set of $l$ data points sampled i.i.d. from $D$ and $\bar{D}$, respectively.
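A small sketch of this batch choice (with-replacement sampling is one simple way to realize i.i.d. draws; the helper is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_indices(n_points, size):
    """Draw `size` indices i.i.d. (with replacement) from a dataset of
    n_points examples. size == n_points mimics the full-batch (gradient
    projection) regime; size == 1 gives the stochastic gradient variant."""
    return rng.integers(0, n_points, size=size)
```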

The hinge loss function is usually considered more appropriate for classification problems, although several studies have empirically demonstrated that the squared loss often leads to similar performance (see [20], [21]). Let us define $A^+$ to be the set of examples for which $W^v$ obtains a non-zero loss.


Fig. 1. Multi-output co-regularized algorithm (MOCA-k-l)

Require: Datasets $D$ and $\bar{D}$, regularization parameter $\lambda$, batch sizes $k$ and $l$, number of views $M$, number of iterations $N$, co-regularization parameter $\mu$.
Initialize: $W_1^v = 0$ for every view $v$
1: for $t = 1, 2, \ldots, N$ do
2:   Choose $A_t \subseteq D$, where $|A_t| = k$, and $\bar{A}_t \subseteq \bar{D}$, where $|\bar{A}_t| = l$
3:   Set $\eta_t^v = \frac{1}{\lambda t}$
4:   $W_{t+1}^v \leftarrow (1 - \eta_t^v \lambda) W_t^v + \eta_t^v \sum_{(\mathbf{x},\mathbf{y}) \in A_t} (\mathbf{y} - W_t^{vT}\mathbf{x}^v)\,\mathbf{x}^v - 4\mu\eta_t^v \sum_{\substack{v,u=1 \\ v \neq u}}^{M} \sum_{\bar{\mathbf{x}} \in \bar{A}_t} (W_t^{vT}\bar{\mathbf{x}}^v - W_t^{uT}\bar{\mathbf{x}}^u)\,\bar{\mathbf{x}}^v$
5: Output $W_{N+1}^v$ (weight vectors for a single view)

When the squared loss function is used for the labeled and unlabeled data, we obtain the update rule by substituting the second term in (4) with $\eta_t^v \sum_{(\mathbf{x},\mathbf{y}) \in A_t} (\mathbf{y} - W_t^{vT}\mathbf{x}^v)\,\mathbf{x}^v$. Finally, if the number of dimensions in the dataset is not large, we can use all unlabeled data points at every iteration by precomputing the multiplication terms in $4\mu\eta_t^v \sum_{\substack{v,u=1 \\ v \neq u}}^{M} (\bar{X}^{vT} W^v - \bar{X}^{uT} W^u)\,\bar{X}^v$. We provide a description of the proposed online co-regularized algorithm for the classification task in Figure 1.
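To complement Figure 1, here is a minimal NumPy sketch of the MOCA-k-l iteration with the squared loss on both labeled and unlabeled terms; all function and argument names are our own, and this is an illustration of the update rule rather than the authors' implementation.

```python
import numpy as np

def moca_train(X_views, Y, Xbar_views, lam, mu, k, l, n_iter, seed=0):
    """Sketch of MOCA-k-l (Figure 1): squared loss, L2 regularization.

    X_views    -- list of M labeled feature matrices, X_views[v]: (m, d_v)
    Y          -- output matrix, shape (m, p)
    Xbar_views -- list of M unlabeled feature matrices, Xbar_views[v]: (n, d_v)
    lam, mu    -- regularization and co-regularization parameters
    k, l       -- labeled / unlabeled batch sizes per iteration
    """
    rng = np.random.default_rng(seed)
    M, (m, p), n = len(X_views), Y.shape, Xbar_views[0].shape[0]
    W = [np.zeros((X_views[v].shape[1], p)) for v in range(M)]  # W^v = 0
    for t in range(1, n_iter + 1):
        eta = 1.0 / (lam * t)              # step 3: eta_t = 1 / (lam * t)
        A = rng.integers(0, m, size=k)     # step 2: A_t with |A_t| = k
        Abar = rng.integers(0, n, size=l)  # step 2: Abar_t with |Abar_t| = l
        W_new = []
        for v in range(M):
            Xv, Xv_bar = X_views[v][A], Xbar_views[v][Abar]
            # labeled part: shrinkage plus squared-loss gradient step
            Wv = (1.0 - eta * lam) * W[v] + eta * Xv.T @ (Y[A] - Xv @ W[v])
            # co-regularization: penalize disagreement with the other views
            for u in range(M):
                if u != v:
                    Xu_bar = Xbar_views[u][Abar]
                    Wv -= 4.0 * mu * eta * Xv_bar.T @ (Xv_bar @ W[v] - Xu_bar @ W[u])
            W_new.append(Wv)
        W = W_new  # synchronous update across views
    return W
```

Choosing k = m and l = n recovers the full-batch regime discussed above, while k = l = 1 gives a purely stochastic variant.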

III. EXPERIMENTS

A. Previous experimental results

Recently, an online co-regularized algorithm (OCA) [11] for single-output prediction problems has been evaluated on several publicly available datasets from the UCI repository¹,² and on the BioInfer corpus³, a real-world natural language processing dataset. For completeness, we report these results as well as new experiments conducted using the HMP dataset. The standard regression and classification datasets are ABALONE, CADATA, HOUSING, MG, SPACE, SVMGUIDE3, GERMANNUMER, and AUSTRALIAN. Depending on the learning task, the performance measure is either AUC (Area Under Curve) for classification or RMSE (Root Mean Square Error) for regression.
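For reference, both performance measures are standard; a short sketch using scikit-learn's AUC implementation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def rmse(y_true, y_pred):
    """Root Mean Square Error, used for the regression tasks."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def auc(y_true, scores):
    """Area Under the ROC Curve, used for the classification tasks."""
    return roc_auc_score(y_true, scores)
```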

The results of the experiments on the single-output OCA are included in Tables I and II. For simplicity of comparison, we use the OCA-k-l (later MOCA-k-l) notation, where k is the number of labeled and l the number of unlabeled examples used per iteration in the semi-supervised setting. Final evaluation is based on the RMSE estimated on the test set, not on the RMSE estimated during cross-validation (CV RMSE). It can be observed that in all experiments except the HOUSING dataset, the proposed single-output co-regularized algorithm outperforms the supervised learning methods. The HOUSING dataset is also the smallest dataset considered in our empirical evaluation. In all cases (with the exception of the HOUSING dataset) the OCA leads to a statistically significant improvement over the standard PEGASOS algorithm.

B. HMP dataset

The ability to relate complex datasets that provide a quantitative description of a microbial community present on/in the human body to clinically relevant parameters on health or physiology poses a computational challenge.

¹ http://archive.ics.uci.edu/ml/
² http://www.csie.ntu.edu.tw/∼cjlin/libsvm/
³ Available at www.it.utu.fi/BioInfer

TABLE I. Performance of the single-output OCA algorithm and the baseline methods on the regression datasets. Differences in RMSE performance are statistically significant according to the Wilcoxon signed rank test.

DATASET     METHOD           CV RMSE   TEST RMSE
ABALONE     PEGASOS SL       14.40     19.46
            PEGASOS MV SL    11.70     15.52
            OCA-1-5          11.73     13.50
CADATA      PEGASOS SL       24.45     26.00
            PEGASOS MV SL    24.29     27.26
            OCA-1-1          23.97     25.76
HOUSING     PEGASOS SL       19.34     16.34
            PEGASOS MV SL    17.07     17.59
            OCA-1-1          17.75     18.54
            OCA-1-5          16.13     18.69
MG          PEGASOS SL       45.53     46.71
            PEGASOS MV SL    45.88     45.73
            OCA-1-1          44.51     45.57
SPACE       PEGASOS SL       58.32     58.17
            PEGASOS MV SL    50.42     51.90
            OCA-1-1          41.95     36.60
BIOINFER    PEGASOS SL       45.36     63.86
            PEGASOS MV SL    44.94     63.16
            OCA-1-5          39.85     61.29

TABLE II. Performance of the single-output OCA algorithm and the baseline methods on the classification datasets. Differences in AUC performance are statistically significant according to the Wilcoxon signed rank test.

DATASET      METHOD           CV AUC   TEST AUC
GER.NUMER    PEGASOS HL       0.72     0.74
             PEGASOS MV HL    0.76     0.71
             OCA-1-1          0.75     0.75
SVMGUIDE3    PEGASOS HL       0.85     0.74
             PEGASOS MV HL    0.83     0.75
             OCA-1-5          0.82     0.76
AUSTRALIAN   PEGASOS HL       0.95     0.92
             PEGASOS MV HL    0.95     0.92
             OCA-1-1          0.93     0.93

Using the proposed MOCA algorithm we address a relevant problem in the oral health domain, namely the prediction of Porphyromonas species, which are implicated in certain forms of periodontal disease. We used the abundance of a particular group of species as metadata. As bacterial species are part of a more elaborate microbial ecosystem with shared metabolic and physiological functions, we expected that the presence and abundance of the target species would be reflected in the abundance of taxonomically unrelated species.


For this purpose we train our model on the National Institutes of Health Human Microbiome Project (NIH HMP) dataset, which is publicly available and can be downloaded from the Human Microbiome Project website⁴.

The NIH HMP [22] aims at characterizing, using next-generation sequencing technology, the genetic diversity of microbial populations living in and on humans, and at investigating their roles in the functioning of the human body. The HMP dataset contains microbial samples from various body parts, but our primary interest is oral health. Thus we subsampled the dataset so that it includes only those sample IDs referring to the following body sites: saliva, buccal mucosa (cheek), keratinized gingiva (gums), palate, tonsils, throat and tongue soft tissues, and supra- and subgingival dental plaque (tooth biofilm above and below the gum). We also normalized the dataset by forcing the row-wise sum of Operational Taxonomic Unit (OTU) counts to equal one. Preprocessing the raw sequencing data to obtain OTU counts is beyond the scope of this paper and therefore is not described.

Furthermore, for the regression task we choose the output variable to be the OTU counts of Porphyromonas bacteria, because they are associated with chronic periodontitis and dentoalveolar abscess [23], common oral diseases that are hard to cure. From the hmp1.v35.hq.otu.lookup file we identify 116 OTUs having the "Porphyromonas" string in their taxonomy name. Furthermore, we count the number of non-zero entries for each selected bacterium and choose the several most frequent species as our output variables. Once the dataset, which consists of 2670 examples in total, is constructed, we reshuffle it. We use the classical learning setting, where 70% of the data is used for training and the remaining 30% for testing. 20% of the training data is randomly selected to be labeled, and the rest is used as unlabeled data. In our learning task, the performance measure is RMSE. The dataset is preprocessed by applying a linear scaling of each feature to the interval [0, 1]. We also apply a linear scaling of the labels to the interval [0, 100].
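The preprocessing described above can be summarized in a short sketch; the helper name, the target-column indices, and the use of scikit-learn's splitter are our own illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def preprocess(otu_counts, target_cols, seed=0):
    """Row-normalize OTU counts, scale features to [0, 1] and labels to
    [0, 100], then split 70/30 into train/test and keep 20% of the
    training data labeled (the rest is used as unlabeled data)."""
    X = otu_counts / otu_counts.sum(axis=1, keepdims=True)  # rows sum to one
    Y = X[:, target_cols]                                   # Porphyromonas OTUs
    X = np.delete(X, target_cols, axis=1)
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    Y = 100.0 * (Y - Y.min(axis=0)) / (Y.max(axis=0) - Y.min(axis=0) + 1e-12)
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3,
                                              random_state=seed, shuffle=True)
    X_lab, X_unlab, Y_lab, _ = train_test_split(X_tr, Y_tr, train_size=0.2,
                                                random_state=seed)
    return (X_lab, Y_lab), X_unlab, (X_te, Y_te)
```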

Parameter selection for each model is done by 5-fold cross-validation over the training partition of the data. For the supervised models, the parameters to be selected are the learning rate η₀ and the regularization parameter λ. For the supervised and semi-supervised multi-view models we consider two views that are constructed via random partitioning of the data attributes into two unique sets. Such a division of the attributes for constructing multiple views has been previously used in [6]. For the multi-view model we have to estimate the learning rate η₀ as well as the λ₁ and λ₂ parameters. The semi-supervised model has an additional parameter μ controlling the influence of the co-regularization on model selection.
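A sketch of this random two-view construction (our own helper name):

```python
import numpy as np

def random_two_views(n_features, seed=0):
    """Randomly partition the feature indices into two disjoint views,
    as used for the multi-view and co-regularized models (cf. [6])."""
    perm = np.random.default_rng(seed).permutation(n_features)
    return perm[: n_features // 2], perm[n_features // 2 :]

# usage: v1, v2 = random_two_views(X.shape[1]); X1, X2 = X[:, v1], X[:, v2]
```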

We compare the performance of our MOCA algorithm with several other methods, namely k-Nearest Neighbors (KNN), Ordinary Least Squares (OLS), Stochastic Gradient Descent (SGD), Least Angle Regression (LARS), Partial Least Squares Regression (PLSR), and Decision Tree Regression (DTR). In our experiments we use the implementations of these baseline methods from the scikit-learn Python package [24]. The results of the experiments are reported in Table III.
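The baselines can be instantiated directly from scikit-learn [24]; default constructors are shown below, whereas in the experiments their parameters are selected by 5-fold cross-validation.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, SGDRegressor, Lars
from sklearn.cross_decomposition import PLSRegression
from sklearn.tree import DecisionTreeRegressor

# Baseline regressors compared against MOCA on the HMP dataset.
baselines = {
    "KNN": KNeighborsRegressor(),
    "OLS": LinearRegression(),
    "SGD": SGDRegressor(),     # note: fits one output at a time
    "LARS": Lars(),
    "PLSR": PLSRegression(),
    "DTR": DecisionTreeRegressor(),
}
```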

⁴ Available at www.hmpdacc.org/HMMCP/

TABLE III. Performance of the MOCA algorithm and the baseline methods on the HMP dataset. Parameters were chosen via 5-fold cross-validation. Differences in RMSE performance on the test dataset are statistically significant according to the Wilcoxon signed rank test.

METHOD       TEST RMSE   DIFFERENCE
MOCA-1-10    12.31       0
MOCA MVSL    12.43       0.12
KNN          14.46       2.15
OLS          19.95       7.64
SGD          13.30       0.99
LARS         13.07       0.76
PLSR         15.53       3.22
DTR          12.87       0.56

[Figure 2 here: test RMSE (y-axis, roughly 12 to 15) plotted against the number of unlabeled examples per iteration (x-axis: 0, 1, 3, 5, 10) for MOCA, SGD, KNN, LARS, and DT.]

Fig. 2. Performance of the MOCA algorithm with various numbers of unlabeled data examples per iteration. The experiment with 0 unlabeled examples corresponds to the fully supervised learning setting. For the supervised learning algorithms, only the labeled part of the dataset is used for training. The same set is then used for training the co-regularized model, together with the unlabeled data.

It can be observed that our method clearly outperforms the baseline techniques. These results demonstrate that adding unlabeled data helps to improve the predictive performance of the model.

We also compare the performance of the semi-supervised algorithm with that of the multi-view version of the algorithm excluding the co-regularization term (multi-view supervised learning (MVSL) MOCA). Furthermore, we compare several instantiations of the MOCA algorithm, termed MOCA-k-l, using various unlabeled batch sizes (see Figure 2). The results show improved prediction performance as the number of unlabeled examples sampled at every iteration of the MOCA algorithm increases.

IV. CONCLUSIONS AND FURTHER DIRECTIONS

We propose a novel multi-output co-regularized learning algorithm (MOCA) and demonstrate its application on a real-world problem in the oral health domain. More precisely, we address the task of predicting Porphyromonas species, which are implicated in certain forms of periodontal disease.


In our semi-supervised modeling approach we use the fact that interactions among different microbial species in various niches of the human body are known to exist. Such biological interactions suggest the possibility of using additional information in the form of “unlabeled data” from different niches to better predict the occurrence of Porphyromonas. We have made use of a dataset available from the Human Microbiome Project. We derived a training set in which we used the abundance of Porphyromonas as metadata for the semi-supervised regression problem. The algorithm can be extended and used in future experiments for the identification of bacterial species related to Porphyromonas abundance as predictive and/or prognostic biomarkers. Such biomarkers may allow the identification of novel intervention strategies to lower the abundance of Porphyromonas and prevent or treat periodontal disease, which is of important value for practicing dental clinicians.

The proposed MOCA method is capable of handling multi-dimensional, sparse metagenomic data. It is robust to over-fitting, computationally fast, and capable of making use of large amounts of unlabeled data. In comparison to supervised approaches, our method leads to better predictive performance both on standard benchmark datasets from the UCI repository and on a real-world HMP dataset.

In this work we have tried only one strategy for splitting the features into views, namely random partitioning of the data attributes into two unique sets. Given that these "views" are integral parts of the co-regularization strategy, it would be informative to try more than one such strategy to see how they compare in terms of predictive performance (e.g. 5 unique sets, 10 unique sets, etc.).

Our algorithm can be easily modified based on the application domain. For instance, it can be extended to be used with non-linear kernel functions. Given the large amount of sequence-based data in the metagenomics domain, such an extension could be quite useful and would allow taking a sequence data representation into account. Furthermore, kernel functions allow capturing complex non-linear interactions among the variables and can lead to improved prediction performance of the model.

ACKNOWLEDGMENT

The study presented in this publication was funded by TI Food and Nutrition, a public-private partnership in pre-competitive research on food and nutrition.

REFERENCES

[1] G. Hajishengallis and R. Lamont, "Beyond the red complex and into more complexity: the polymicrobial synergy and dysbiosis (PSD) model of periodontal disease etiology," Molecular Oral Microbiology, vol. 27, no. 6, pp. 409–419, Dec 2012.
[2] K. Faust and J. Raes, "Microbial interactions: from networks to models," Nat. Rev. Microbiol., vol. 10, no. 8, pp. 538–550, Aug 2012.
[3] G.-X. Yuan, C.-H. Ho, and C.-J. Lin, "Recent advances of large-scale linear classification," Proceedings of the IEEE, no. 3, pp. 1–15, 2011. [Online]. Available: http://www.csie.ntu.edu.tw/∼cjlin/papers/survey-linear.pdf
[4] V. Sindhwani, P. Niyogi, and M. Belkin, "A co-regularization approach to semi-supervised learning with multiple views," in Proceedings of the ICML Workshop on Learning with Multiple Views, 2005.
[5] H. Daume, A. Kumar, and A. Saha, "Co-regularization based semi-supervised domain adaptation," in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., 2010, pp. 478–486.
[6] U. Brefeld, T. Gartner, T. Scheffer, and S. Wrobel, "Efficient co-regularised least squares regression," in Proceedings of the International Conference on Machine Learning. New York, NY, USA: ACM, 2006, pp. 137–144.
[7] U. Brefeld and T. Scheffer, "Co-EM support vector learning," in Proceedings of the 21st International Conference on Machine Learning. New York, NY, USA: ACM, 2004, p. 16.
[8] D. Rosenberg and P. L. Bartlett, "The Rademacher complexity of co-regularized kernel classes," in Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, M. Meila and X. Shen, Eds., 2007, pp. 396–403.
[9] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the Eleventh Annual Conference on Computational Learning Theory. New York, NY, USA: ACM, 1998, pp. 92–100.
[10] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.
[11] T. de Ruijter, E. Tsivtsivadze, and T. Heskes, "Online co-regularized algorithms," in Discovery Science, ser. Lecture Notes in Computer Science, J.-G. Ganascia, P. Lenca, and J.-M. Petit, Eds., vol. 7569. Springer, 2012, pp. 184–193.
[12] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[13] R. Rifkin, G. Yeo, and T. Poggio, "Regularized least-squares classification," in Advances in Learning Theory: Methods, Model and Applications. Amsterdam: IOS Press, 2003, pp. 131–154.
[14] J. A. K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.
[15] G. Fung and O. L. Mangasarian, "Proximal support vector machine classifiers," in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2001, pp. 77–86.
[16] C. Saunders, A. Gammerman, and V. Vovk, "Ridge regression learning algorithm in dual variables," in Proceedings of the Fifteenth International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp. 515–521.
[17] S. Shalev-Shwartz, Y. Singer, and N. Srebro, "Pegasos: Primal Estimated sub-GrAdient SOlver for SVM," in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 807–814.
[18] L. Bottou, A. Bordes, and S. Ertekin, "LASVM," 2009, http://mloss.org/software/view/23/.
[19] A. Tacchetti, P. Mallapragada, M. Santoro, and L. Rosasco, "GURLS: a toolbox for large scale multiclass learning," in NIPS 2011 Workshop on Parallel and Large-Scale Machine Learning, http://cbcl.mit.edu/gurls/.
[20] R. Rifkin, G. Yeo, and T. Poggio, "Regularized least-squares classification," in Advances in Learning Theory: Methods, Model and Applications, ser. NATO Science Series III: Computer and System Sciences, J. Suykens, G. Horvath, S. Basu, C. Micchelli, and J. Vandewalle, Eds. Amsterdam: IOS Press, 2003, vol. 190, ch. 7, pp. 131–154.
[21] P. Zhang and J. Peng, "SVM vs regularized least squares classification," in Proceedings of the International Conference on Pattern Recognition, ser. ICPR '04, 2004, pp. 176–179.
[22] The NIH HMP Working Group, J. Peterson, S. Garges, M. Giovanni, P. McInnes, L. Wang, J. A. Schloss, V. Bonazzi, J. E. McEwen, K. A. Wetterstrand, C. Deal, C. C. Baker, V. Di Francesco, T. K. Howcroft, R. W. Karp, R. D. Lunsford, C. R. Wellington, T. Belachew, M. Wright, C. Giblin, H. David, M. Mills, R. Salomon, C. Mullins, B. Akolkar, L. Begg, C. Davis, L. Grandison, M. Humble, J. Khalsa, A. R. Little, H. Peavy, C. Pontzer, M. Portnoy, M. H. Sayre, P. Starke-Reed, S. Zakhari, J. Read, B. Watson, and M. Guyer, "The NIH Human Microbiome Project," Genome Research, vol. 19, no. 12, pp. 2317–2323, 2009.
[23] L. P. Samaranayake, Essential Microbiology for Dentistry, 4th ed. Elsevier Health Sciences, 2012.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.