
Integrating Features and Similarities: Flexible Models for Heterogeneous Multiview Data

Wenzhao Lian, Piyush Rai, Esther Salazar, Lawrence Carin
ECE Department, Duke University

Durham, NC 27708
{wenzhao.lian,piyush.rai,esther.salazar,lcarin}@duke.edu

Abstract

We present a probabilistic framework for learning with heterogeneous multiview data where some views are given as ordinal, binary, or real-valued feature matrices, and some views as similarity matrices. Our framework has the following distinguishing aspects: (i) a unified latent factor model for integrating information from diverse feature (ordinal, binary, real) and similarity based views, and predicting the missing data in each view, leveraging view correlations; (ii) seamless adaptation to binary/multiclass classification where data consists of multiple feature and/or similarity-based views; and (iii) an efficient, variational inference algorithm which is especially flexible in modeling the views with ordinal-valued data (by learning the cutpoints for the ordinal data), and extends naturally to streaming data settings. Our framework subsumes methods such as multiview learning and multiple kernel learning as special cases. We demonstrate the effectiveness of our framework on several real-world and benchmark datasets.

Introduction

Many data analysis problems involve heterogeneous data with multiple representations or views. We consider a general problem setting, where data in some views may be given as a feature matrix (which, in turn, may be ordinal, binary, or real-valued), while in other views as a kernel or similarity matrix. Each view may also have a significant amount of missing data. Such a problem setting is frequently encountered in diverse areas, ranging from cognitive neuroscience (Salazar et al. 2013) to recommender systems (Zhang, Cao, and Yeung 2010; Pan et al. 2011; Shi, Larson, and Hanjalic 2014). Consider a problem from cognitive neuroscience (Salazar et al. 2013), where data collected from a set of people may include ordinal-valued response matrices on multiple questionnaires, real-valued feature matrices consisting of fMRI/EEG data, and one or more similarity matrices computed using single-nucleotide polymorphism (SNP) measurements. There could also be missing observations in each view. The eventual goal could be to integrate these diverse views to learn the latent traits (factors) of people, or learn a classifier for predicting certain psychopathological conditions in people, or to predict the


missing data in one or more views (e.g., predicting missing fMRI data, leveraging other views). Likewise, in a multi-domain recommender system (Zhang, Cao, and Yeung 2010; Pan et al. 2011; Shi, Larson, and Hanjalic 2014), for a set of users, we have multiple, partially observed ordinal-valued rating matrices for domains such as movies, books, and electronic appliances, along with a binary-valued matrix of click behavior on movies, and user-user similarities. The goal here could be to predict the missing ratings in each domain's rating matrix, leveraging information in all the sources.

When the eventual goal is only doing classification or clustering on such multiview data, one direct procedure is to represent each view (feature/kernel based) as a kernel matrix and apply multiple kernel learning methods (Gonen and Alpaydın 2011; Kumar, Rai, and Daume III 2011). However, such an approach may be inappropriate when data is missing in one or more views, resulting in kernel matrices with missing entries, for which most of these methods are unsuitable. Moreover, these methods lack a proper generative model for each view, and therefore cannot be used for the task of understanding the data (e.g., learning latent factors), modeling specific feature types (e.g., ordinal), or predicting the missing data in each view (multiview matrix completion).

We present a probabilistic framework for modeling such heterogeneous multiview data with potentially missing data in each view. We then show a seamless adaptation of this framework for binary/multiclass classification for data having multiple feature- and/or similarity-based views. Our framework learns a view-specific latent matrix for each feature/similarity-based view, and combines these latent matrices via a set of structured-sparsity driven factor analysis models (Virtanen et al. 2012) to learn a global low-dimensional representation of the data. The view-specific latent matrices can be used for matrix completion in each view, whereas the global representation can be combined with specific objectives to solve problems such as classification or clustering of the multiview data. Our framework also includes an efficient variational inference algorithm, which is appealing in its own right by providing a principled way to learn the cutpoints for data in the ordinal-valued views; this can be useful for the general problem of modeling ordinal data, such as in recommender systems.

A Generative Framework for Heterogeneous Multiview Data

We first describe our basic framework for Multiview Learning with Features and Similarities (abbreviated henceforth as MLFS), for modeling heterogeneous multiview data, where the views may be in the form of ordinal/binary/real-valued feature matrices and/or real-valued similarity matrices. Our framework enables one to integrate the data from all the views to learn latent factors underlying the data, predict missing data in each view, and infer view-correlations. We then show how MLFS can be adapted for binary/multiclass classification problems.

We assume the data consist of $N$ objects having a total of $M$ feature-based and/or similarity-based views. Of the $M = M_1 + M_2 + M_3$ views, the first $M_1$ are assumed to be ordinal feature matrices $X^{(1)}, \dots, X^{(M_1)}$ (a binary feature matrix is a special case), the next $M_2$ views are assumed to be real-valued feature matrices $X^{(M_1+1)}, \dots, X^{(M_1+M_2)}$, and the remaining $M_3$ views are assumed to be real-valued similarity matrices $X^{(M_1+M_2+1)}, \dots, X^{(M_1+M_2+M_3)}$. One or more of these matrices may have missing data (randomly missing entries or randomly missing entire rows and/or columns). For a feature-based view, $X^{(m)}$ denotes a feature matrix of size $N \times D_m$; for a similarity-based view, $X^{(m)}$ denotes a similarity matrix of size $N \times N$. We assume the data $X^{(m)}$ in each feature/similarity-based view are generated from a latent real-valued matrix $U^{(m)} = [U^{(m)}_1; \dots; U^{(m)}_N] \in \mathbb{R}^{N \times K_m}$, where the $U^{(m)}_i$, $i = 1, \dots, N$, are assumed to be row vectors.

Feature-based Views: The $N \times D_m$ feature matrix $X^{(m)}$ for view $m$ is generated, via a link-function $f_m$, from a real-valued matrix $U^{(m)}$ of the same size (thus $K_m = D_m$). Therefore, $X^{(m)}_{id} = f_m(U^{(m)}_{id})$, where $i$ indexes the $i$-th object and $d$ indexes the $d$-th feature. For real-valued data, the link-function is the identity, so $X^{(m)}_{id} = U^{(m)}_{id}$. For ordinal data in view $m$ having $L_m$ levels $(1, \dots, L_m)$, $X^{(m)}_{id} = l$ if $g^m_{l-1} < U^{(m)}_{id} < g^m_l$, with cutpoints $-G = g^m_0 < g^m_1 < g^m_2 < \dots < g^m_{L_m-1} < g^m_{L_m} = +G$. Because the cutpoints contain information indicating the relative frequencies of ordinal outcomes in each view, we learn them, as described in the next section.
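To make the thresholding link concrete, here is a minimal NumPy sketch (illustrative only, not the authors' code) that maps a block of latent real values to ordinal levels through a fixed set of cutpoints, as in the rule above.

    import numpy as np

    def ordinal_from_latent(U, cutpoints):
        """Map latent real values U to ordinal levels 1..L via thresholding.

        cutpoints has length L+1: g_0 < g_1 < ... < g_L, with g_0 and g_L
        playing the role of -G and +G in the text.
        """
        # np.digitize assigns each value to the bin it falls into, so a value
        # in (g_{l-1}, g_l] is mapped to level l.
        return np.digitize(U, cutpoints[1:-1]) + 1

    rng = np.random.default_rng(0)
    U = rng.normal(size=(5, 3))                     # latent N x D_m block
    g = np.array([-np.inf, -0.5, 0.5, np.inf])      # 3 ordinal levels
    X = ordinal_from_latent(U, g)                   # entries in {1, 2, 3}
    print(X)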

Similarity-based Views: The $N \times N$ similarity matrix $X^{(m)}$ of view $m$ is generated as $X^{(m)}_{ij} \sim \mathcal{N}(U^{(m)}_i U^{(m)\top}_j, \tau_m^{-1})$, where $X^{(m)}_{ij}$ denotes the pairwise similarity between objects $i$ and $j$ in view $m$. In this work, we consider symmetric similarity matrices and thus only model $X^{(m)}_{ij}$, $i < j$, but the model can be naturally extended to asymmetric cases. In this case, $U^{(m)} \in \mathbb{R}^{N \times K_m}$ is akin to a low-rank approximation of the similarity matrix $X^{(m)}$ ($K_m < N$).

Although the view-specific latent matrices $\{U^{(m)}\}_{m=1}^M$ have different meanings (and play different roles) in feature-based and similarity-based views, in both cases there exists a mapping from $U^{(m)}$ to the observed data $X^{(m)}$. We wish to extract and summarize the information from all these view-specific latent matrices $\{U^{(m)}\}_{m=1}^M$ to obtain a global latent representation of the data, and use it for tasks such as classification or clustering. To do so, we assume the view-specific latent matrices $\{U^{(m)}\}_{m=1}^M$ are generated from a shared real-valued latent factor matrix $V = [V_1; \dots; V_N]$ of size $N \times R$ (where $R$ denotes the number of latent factors) with view-specific sparse factor loading matrices $\mathcal{W} = \{W^{(m)}\}_{m=1}^M$: $U^{(m)}_i \sim \mathcal{N}(V_i W^{(m)}, \gamma_m^{-1} I)$, where $W^{(m)} \in \mathbb{R}^{R \times K_m}$.

Since different views may capture different aspects of the entity under test (in addition to capturing aspects that are present in all views), we wish to impose this structure in the learned global latent factor matrix $V$ and the associated factor loading matrices $\mathcal{W} = \{W^{(m)}\}_{m=1}^M$. Note that each column (resp., row) of $V$ (resp., $W$) corresponds to a global latent factor. We impose a structured-sparsity prior on the factor loading matrices $\{W^{(m)}\}_{m=1}^M$ of all the views, such that some of the rows in these matrices share the same support of non-zero entries whereas some rows are non-zero only for a subset of these matrices. Figure 1 summarizes our basic framework.
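A compact NumPy sketch of the generative process just described, under simplifying assumptions (one ordinal, one real-valued, and one similarity-based view; fixed noise precisions; dense loadings instead of the structured-sparsity prior introduced below). Variable names are illustrative; this is not the authors' code.

    import numpy as np

    rng = np.random.default_rng(1)
    N, R = 100, 4                                   # objects, global factors
    K = {"ordinal": 6, "real": 5, "sim": 8}         # per-view latent dimensions
    gamma = 10.0                                    # view noise precision
    tau = 50.0                                      # similarity noise precision

    V = rng.normal(size=(N, R))                     # global latent factors
    W = {m: rng.normal(size=(R, Km)) for m, Km in K.items()}   # loadings
    U = {m: V @ W[m] + rng.normal(scale=gamma**-0.5, size=(N, K[m]))
         for m in K}                                # view-specific latent matrices

    # Feature-based views: identity link for the real view,
    # thresholding via cutpoints for the ordinal view.
    X_real = U["real"]
    cut = np.array([-np.inf, -0.5, 0.5, np.inf])
    X_ordinal = np.digitize(U["ordinal"], cut[1:-1]) + 1

    # Similarity-based view: noisy low-rank similarity U U^T.
    X_sim = U["sim"] @ U["sim"].T + rng.normal(scale=tau**-0.5, size=(N, N))
    X_sim = (X_sim + X_sim.T) / 2                   # keep it symmetric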

Figure 1: (a) Plate notation showing the data in M views. The data matrix $X^{(m)}$ could either be a feature matrix or a similarity matrix (with the link from $U^{(m)}$ to $X^{(m)}$ appropriately defined). (b) For M = 3 views, a structured-sparsity based decomposition of the view-specific latent matrices to learn shared and view-specific latent factors. The first two factors are present in all the views (nonzero first two rows in each $W^{(m)}$) while the others are present only in some views. The matrix $V$ is the global latent representation of the data.

We assume each row of $V \in \mathbb{R}^{N \times R}$ is drawn as $V_i \sim \mathcal{N}(0, I)$. We use group-wise automatic relevance determination (Virtanen et al. 2012) as the sparsity-inducing prior on $\{W^{(m)}\}_{m=1}^M$, which also helps in inferring $R$ by shrinking the unnecessary rows in $W$ to close to zero. Each row of $W^{(m)}$ is assumed to be drawn as $W^{(m)}_r \sim \mathcal{N}(0, \alpha_{mr}^{-1} I)$, $r = 1, \dots, R$, where $\alpha_{mr} \sim \mathrm{Gam}(a_\alpha, b_\alpha)$; choosing $a_\alpha, b_\alpha \to 0$, we have the Jeffreys prior $p(\alpha_{mr}) \propto 1/\alpha_{mr}$, favoring strong sparsity. We can identify the factor activeness in each view from the precision hyperparameter $\alpha_{mr}$: a small $\alpha_{mr}$ (large variance) indicates activeness of factor $r$ in view $m$. Let $B$ be an $(M \times R)$ binary matrix indicating the active view vs. factor associations; then $B_{mr} = 1$ if $\alpha_{mr}^{-1} > \epsilon$, for some small $\epsilon$ (e.g., 0.01). The correlation between views $m$ and $m'$ can also be computed as $\tilde{W}^{(m)\top} \tilde{W}^{(m')} / (S^{(m)} S^{(m')})$, where $\tilde{W}^{(m)}_r = \sum_{j=1}^{K_m} (W^{(m)}_{rj})^2$, $r = 1, \dots, R$, and $S^{(m)} = \sqrt{\tilde{W}^{(m)\top} \tilde{W}^{(m)}}$.

Identifiability via Rotation: Factor analysis models are known to have identifiability issues due to the fact that $V W^{(m)} = V Q Q^{-1} W^{(m)}$ for an arbitrary rotation $Q$ (Virtanen et al. 2012). We explicitly optimize w.r.t. $Q$ to maintain identifiability in the model, and to achieve faster convergence during inference.
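As an illustration of how these learned quantities can be post-processed, the following sketch (assuming the ARD precisions alpha[m, r] and the loading matrices have already been inferred; names are illustrative, not the authors' code) computes the view-factor activeness matrix B and the view correlation just defined.

    import numpy as np

    def factor_activeness(alpha, eps=0.01):
        """B[m, r] = 1 if factor r is active in view m, i.e. 1/alpha[m, r] > eps."""
        return (1.0 / alpha > eps).astype(int)

    def view_correlation(W_m, W_mprime):
        """Correlation between two views from their loading matrices.

        Follows the recipe in the text: summarize each view by the length-R vector
        of squared row norms of its loading matrix, then take a normalized inner
        product of these summaries.
        """
        w_m = np.sum(W_m**2, axis=1)
        w_mp = np.sum(W_mprime**2, axis=1)
        return w_m @ w_mp / (np.linalg.norm(w_m) * np.linalg.norm(w_mp))

    alpha = np.array([[0.5, 200.0, 1.0], [0.8, 0.9, 500.0]])   # 2 views, 3 factors
    print(factor_activeness(alpha))     # rows with large precision are pruned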

Adaptation for Multiview Classification

This general framework for MLFS can be applied to multiview factor analysis and matrix completion problems when the data consist of multiple feature-based (ordinal/binary/real) and/or similarity-based views. Our framework is more general than other prior works for these problems, which assume all views have the same feature-based representation (Virtanen et al. 2012; Zhang, Cao, and Yeung 2010). We now show how MLFS can be adapted for other problems such as multiview classification.

In multiview classification, the training data consist of $N$ objects, each having $M$ feature and/or similarity based views. As earlier, we assume that the data are given as a collection of (potentially incomplete) feature and/or similarity matrices $\{X^{(m)}\}_{m=1}^M$. Each object also has a label $y_i \in \{1, \dots, C\}$, $i = 1, \dots, N$, and the goal is to learn a classifier that predicts the labels for test objects, where each test object has a representation in $M$ views (or a subset of the views). The classification adaptation of MLFS is based on a multinomial probit model (Girolami and Rogers 2006) on the global latent factors $V = [V_1; \dots; V_N]$, where $V_i \in \mathbb{R}^{1 \times R}$, which can be summarized as: $y_i = \arg\max_c \{z_{ic}\}$, where $c = 1, \dots, C$; $z_{ic} \sim \mathcal{N}(V_i \beta_c, 1)$; $\beta_c \sim \mathcal{N}(0, \rho^{-1} I)$, where $\beta_c \in \mathbb{R}^{R \times 1}$. Under this adaptation, we learn both $V$ and $\beta_c$ jointly, instead of in two separate steps.
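A minimal sketch of prediction under such a multinomial probit model, estimating class probabilities by Monte Carlo over the latent utilities $z_{ic}$; this is an illustrative sketch, not the authors' implementation.

    import numpy as np

    def probit_predict(V, beta, rng, n_samples=1000):
        """Predict labels under y_i = argmax_c z_ic, z_ic ~ N(V_i beta_c, 1).

        V: (N, R) global latent factors, beta: (R, C) classifier weights.
        Returns hard labels and Monte Carlo class probabilities.
        """
        mean = V @ beta                                    # (N, C) utility means
        z = mean[None] + rng.normal(size=(n_samples, *mean.shape))
        probs = np.mean(z.argmax(axis=2)[..., None] == np.arange(mean.shape[1]),
                        axis=0)                            # (N, C)
        return probs.argmax(axis=1), probs

    rng = np.random.default_rng(0)
    V = rng.normal(size=(10, 4))
    beta = rng.normal(size=(4, 3))
    labels, probs = probit_predict(V, beta, rng)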

A particular advantage of our framework for classification is that, in addition to handling views having potentially different representations, it allows objects to be missing in one or more views. The existing multiview or multiple kernel learning methods (Yu et al. 2011; Gonen and Alpaydın 2011) require the views to be transformed into the same representation and/or cannot easily handle missing data. A similar adaptation can be used to perform clustering instead of multiclass classification, by replacing the multinomial probit classification model with a Gaussian mixture model.

Inference

Since exact inference is intractable, we use variational Bayesian EM (Beal 2003) to perform approximate inference. We will infer the variational distribution for the latent variables, collectively referred to as $\Theta$ and consisting of $\{\{U^{(m)}, W^{(m)}, \alpha_m, \gamma_m\}_{m=1}^M, V\}$, along with $\{\beta_c, z_c\}_{c=1}^C$ for classification. For the cutpoints $\mathcal{G} = \{g^m\}_{m=1}^{M_1}$ and the rotation matrix $Q$, we will seek point estimates. As will be shown, our inference algorithm also works in the streaming setting (Broderick et al. 2013) where each data point is seen only once. Sometimes, for brevity, we will use $\{U, W, \alpha, \gamma\}$ for $\{U^{(m)}, W^{(m)}, \alpha_m, \gamma_m\}_{m=1}^M$, $\{\beta, z\}$ for $\{\{\beta_c\}_{c=1}^C, z\}$, and $\{\mu, z\}$ for $\{\{\mu_j\}_{j=1}^J, z\}$, respectively. The data from all views will be collectively referred to as $\mathcal{X}$, which, in the classification case, also includes the labels. Due to lack of space, we only provide brief descriptions of the key aspects of our inference algorithm, leaving further details to the Supplementary Material.

We approximate the true posterior $p(\Theta | \mathcal{X}, \mathcal{G}, Q)$ by its mean-field approximation:

$q(\Theta) = \prod_{m=M_1+M_2+1}^{M} \prod_{i=1}^{N} q(U^{(m)}_i) \prod_{i=1}^{N} q(V_i) \prod_{m=1}^{M} \prod_{r=1}^{R} q(W^{(m)}_r) \prod_{m=1}^{M} \prod_{r=1}^{R} q(\alpha_{mr}) \prod_{m=1}^{M} q(\gamma_m)$   (1)

Thus, we minimize the KL-divergence $\mathrm{KL}(q(\Theta) \| p(\Theta | \mathcal{X}, \mathcal{G}, Q))$, equivalent to maximizing the evidence lower bound (ELBO) given by $\mathcal{L}(q(\Theta), \mathcal{G}, Q) = \mathbb{E}_{q(\Theta)}[\log p(\mathcal{X}, \Theta | \mathcal{G}, Q) - \log q(\Theta)]$. With a further approximation for the ordinal-valued views using a Taylor-series expansion, we can efficiently update the variational parameters for $q(\Theta)$. Note that in (1), the terms $q(U^{(m)}_i)$ appear only for the similarity-based views because, for feature-based views, the $U^{(m)}_i$ are integrated out in the variational lower bound (details of the derivation are provided in the Supplementary Material). We maximize the variational lower bound by iterating between the variational E-step, $\max_{q(\Theta)} \mathcal{L}(q(\Theta), \mathcal{G}, Q)$, and M-step, $\max_{\mathcal{G}, Q} \mathcal{L}(q(\Theta), \mathcal{G}, Q)$. In this section, we summarize the variational updates for $U$, $W$, $V$, the cutpoints $\mathcal{G}$, the rotation matrix $Q$, and the extension to the streaming setting.

Update $U^{(m)}_i$: For similarity-based views, the variational posterior is $q(U^{(m)}_i) = \mathcal{N}(\mu_u, \Sigma_u)$, where $\Sigma_u = (\langle\gamma_m\rangle I_{K_m} + \sum_{j \neq i} \langle\tau_m\rangle \langle U^{(m)\top}_j U^{(m)}_j\rangle)^{-1}$ and $\mu_u = (\langle\gamma_m\rangle \langle V_i\rangle \langle W^{(m)}\rangle + \langle\tau_m\rangle (\sum_{j>i} X^{(m)}_{ij} \langle U^{(m)}_j\rangle + \sum_{j<i} X^{(m)}_{ji} \langle U^{(m)}_j\rangle)) \Sigma_u$, where $\langle\cdot\rangle$ denotes expectation w.r.t. $q$.

Update $V_i$: $q(V_i) = \mathcal{N}(\mu_v, \Sigma_v)$. For the supervised multiclass classification case with an $\mathcal{N}(0, I)$ prior on $V_i$, $\Sigma_v = (I + \sum_{m=1}^{M} \langle\gamma_m\rangle \langle W^{(m)} W^{(m)\top}\rangle + \sum_{c=1}^{C} \langle\beta_c \beta_c^\top\rangle)^{-1}$ and $\mu_v = (\sum_{m=1}^{M_1} \langle\gamma_m\rangle \sum_{j=1}^{K_m} \langle W^{(m)\top}_{:j}\rangle (g^m_{X^{(m)}_{i,j}} + g^m_{X^{(m)}_{i,j}-1})/2 + \sum_{m=M_1+1}^{M_1+M_2} \langle\gamma_m\rangle X^{(m)}_i \langle W^{(m)\top}\rangle + \sum_{m=M_1+M_2+1}^{M_1+M_2+M_3} \langle\gamma_m\rangle \langle U^{(m)}_i\rangle \langle W^{(m)\top}\rangle + \sum_{c=1}^{C} \langle z_{ic}\rangle \langle\beta_c^\top\rangle) \Sigma_v$.

Update $W^{(m)}$: The variational posterior for the $j$-th column of $W^{(m)}$ ($j = 1, \dots, K_m$) is given by $q(W^{(m)}_{:j}) = \mathcal{N}(\mu_w, \Sigma_w)$, where $\Sigma_w = (\langle\mathrm{diag}(\alpha_{m1}, \alpha_{m2}, \dots, \alpha_{mR})\rangle + \langle\gamma_m\rangle \sum_{i=1}^{N} \langle V_i^\top V_i\rangle)^{-1}$ and $\mu_w$ depends on the type of view. For an ordinal-valued feature-based view, $\mu_w = \Sigma_w \langle\gamma_m\rangle \sum_{i=1}^{N} \langle V_i^\top\rangle (g^m_{X^{(m)}_{i,j}} + g^m_{X^{(m)}_{i,j}-1})/2$. For a real-valued feature-based view, $\mu_w = \Sigma_w \langle\gamma_m\rangle \sum_{i=1}^{N} \langle V_i^\top\rangle X^{(m)}_{ij}$. For a similarity-based view, $\mu_w = \Sigma_w \langle\gamma_m\rangle \sum_{i=1}^{N} \langle V_i^\top\rangle \langle U^{(m)}_{ij}\rangle$.
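For intuition, the following is a simplified sketch of the $q(V_i)$ update when only real-valued feature views are present and no classifier terms are used, with point estimates plugged in for the required expectations (illustrative only, not the authors' code):

    import numpy as np

    def update_V_row(X_rows, W_views, gammas, R):
        """Posterior q(V_i) = N(mu_v, Sigma_v) for one object, real-valued views only.

        X_rows[m]: observed row X_i^(m) of length K_m, W_views[m]: (R, K_m) loading,
        gammas[m]: noise precision of view m. The expectations <W W^T> are replaced
        by plug-in estimates W W^T for brevity.
        """
        Sigma_inv = np.eye(R)
        rhs = np.zeros(R)
        for x, W, g in zip(X_rows, W_views, gammas):
            Sigma_inv += g * (W @ W.T)
            rhs += g * (x @ W.T)               # gamma_m * X_i^(m) W^(m)T
        Sigma_v = np.linalg.inv(Sigma_inv)
        mu_v = rhs @ Sigma_v
        return mu_v, Sigma_v

    rng = np.random.default_rng(0)
    R = 4
    W_views = [rng.normal(size=(R, 5)), rng.normal(size=(R, 7))]
    X_rows = [rng.normal(size=5), rng.normal(size=7)]
    mu_v, Sigma_v = update_V_row(X_rows, W_views, gammas=[10.0, 10.0], R=R)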

Inferring the cutpoints: We infer the cutpoints $g^m$ for each ordinal view by optimizing the objective function $\mathcal{L}_m(g^m) = \sum_{l=1}^{L_m} \mathcal{L}^m_l$, where $\mathcal{L}^m_l = N^m_l [\log(g^m_l - g^m_{l-1}) - \langle\gamma_m\rangle ((g^m_l)^2 + (g^m_{l-1})^2 + g^m_l g^m_{l-1})/6] + \langle\gamma_m\rangle (g^m_l + g^m_{l-1}) \sum_{i,j: X^{(m)}_{i,j}=l} \langle V_i\rangle \langle W^{(m)}_{:j}\rangle / 2$. Here $N^m_l$ is the number of observations with value $l$ in view $m$. The gradients of $\mathcal{L}^m_l$ are also analytically available. Moreover, the objective function is concave w.r.t. $g^m$; in each variational M-step, the solution $g^m$ given the variational distributions $q(\Theta)$ is globally optimal. It can be solved efficiently using Newton's method. At every VB iteration, optimization over $g^m$ is guaranteed to increase the variational lower bound.

Figure 2: (a) and (b): true and inferred factor loading matrices $W$ for ordinal, real, and similarity-based views on synthetic data.

Inferring the rotation matrix $Q$: Since rotation leaves $p(\mathcal{X}|\Theta)$ unchanged, optimization over $Q$ effectively minimizes $\mathrm{KL}(q(\Theta) \| p(\Theta))$ (encouraging $V$, $W$, $\alpha$ to be similar to the prior). This encourages the structured sparsity (imposed by the prior), helps in escaping local optima, and achieves faster convergence (Virtanen et al. 2012). Figure 2 shows an experiment on a three-view synthetic dataset, consisting of ordinal, real, and similarity based views, each generated from a subset of four latent factors. The rotation matrix also helps in the identifiability of the inferred loading matrices.

Streaming extension: In the streaming setting, given $X_n$ (a new data point, or a batch of new data points, each having some or all of the views), we infer the latent factors $V_n$ using the following update equations: $q(V_n) = \mathcal{N}(\mu_n, \Sigma_n)$, where $\mu_n = (\sum_{m=1}^{M_1} \sum_j \langle\gamma_m\rangle \langle W^{(m)\top}_{:j}\rangle (g^m_{X^{(m)}_{n,j}} + g^m_{X^{(m)}_{n,j}-1})/2 + \sum_{m=M_1+1}^{M_1+M_2} \langle\gamma_m\rangle X^{(m)}_n \langle W^{(m)\top}\rangle + \sum_{m=M_1+M_2+1}^{M_1+M_2+M_3} \langle\gamma_m\rangle \langle U^{(m)}_n\rangle \langle W^{(m)\top}\rangle) \Sigma_n$, and $\Sigma_n = (I + \sum_{m=1}^{M} \langle\gamma_m\rangle \langle W^{(m)} W^{(m)\top}\rangle)^{-1}$. The global variables $\Theta$, such as $W^{(m)}$, can be updated in a manner similar to (Broderick et al. 2013) as $q(\Theta_n) \propto q(\Theta_{n-1}) p(X_n | \Theta_n)$. Due to lack of space, the experiments for the streaming setting are presented separately in the Supplementary Material.
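As an illustration of the recursion $q(\Theta_n) \propto q(\Theta_{n-1}) p(X_n | \Theta_n)$, the sketch below maintains a single Gaussian factor (think of one column of some $W^{(m)}$) in natural-parameter form, so that each incoming batch simply adds to the accumulated precision and precision-weighted mean. This is a generic sketch under simplified assumptions (known noise precision, point estimates of $V$), not the authors' code.

    import numpy as np

    class StreamingGaussian:
        """Gaussian variational factor updated one (mini-)batch at a time."""

        def __init__(self, R, prior_precision=1.0):
            self.Lambda = prior_precision * np.eye(R)   # posterior precision
            self.eta = np.zeros(R)                      # precision-weighted mean

        def update(self, V_batch, u_batch, gamma):
            """Absorb a batch with likelihood u_i ~ N(V_i w, 1/gamma)."""
            self.Lambda += gamma * V_batch.T @ V_batch
            self.eta += gamma * V_batch.T @ u_batch

        @property
        def mean(self):
            return np.linalg.solve(self.Lambda, self.eta)

    rng = np.random.default_rng(0)
    w_true = rng.normal(size=4)
    q_w = StreamingGaussian(R=4)
    for _ in range(50):                                 # stream of small batches
        V = rng.normal(size=(10, 4))
        u = V @ w_true + rng.normal(scale=0.1, size=10)
        q_w.update(V, u, gamma=100.0)
    print(np.round(q_w.mean - w_true, 3))               # should be close to zero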

Related Work

The existing methods for learning from multiview data, such as (Gonen and Alpaydın 2011; Virtanen et al. 2012; Zhang, Cao, and Yeung 2010; Bickel and Scheffer 2004; Yu et al. 2011; Shao, Shi, and Yu 2013; Zhe et al. 2014; Chaudhuri et al. 2009; Kumar, Rai, and Daume III 2011), usually either require all the views to be of the same type (e.g., feature based or similarity based), or are designed to solve specific problems on multiview data (e.g., classification or clustering or matrix completion). Moreover, most of these are non-generative w.r.t. the views (Gonen and Alpaydın 2011; Chaudhuri et al. 2009; Kumar, Rai, and Daume III 2011), lacking a principled mechanism to handle/predict missing data in one or more views. The idea of learning shared and view-specific latent factors for multiview data has been used in some other previous works (Jia, Salzmann, and Darrell 2010; Virtanen et al. 2012). These methods, however, do not generalize to other feature types (e.g., ordinal/binary) or similarity matrices, nor to classification/clustering problems. Another recent method (Klami, Bouchard, and Tripathi 2014), based on the idea of collective matrix factorization (Singh and Gordon 2008), jointly performs factorization of multiple matrices, with each denoting a similarity matrix defined over two (from a collection of several) sets of objects (both sets can be the same). However, due to its specific construction, this method can only model a single similarity matrix over the objects of a given set (unlike our method, which allows modeling multiple similarity matrices over the same set of objects), does not explicitly model ordinal data, does not generalize to classification/clustering, and uses a considerably different inference procedure (batch MAP estimation) than our proposed framework.

Finally, related to our contribution on inferring the cutpoints for the ordinal views, another recent work (Hernandez-Lobato, Houlsby, and Ghahramani 2014) proposed a single-view ordinal matrix completion model with an Expectation Propagation (EP) based algorithm for learning the cutpoints. It, however, assumes a Gaussian prior on the cutpoints. We do not make any such assumption and optimize w.r.t. the cutpoints. Moreover, the optimization problem for inferring the cutpoints is concave, leading to efficient and fast-converging inference.

Experiments

In this section, we first apply our framework for analyzing a real-world dataset from cognitive neuroscience. We then present results on benchmark datasets for recommender system matrix completion and classification, respectively.

Cognitive Neuroscience Data

This is a heterogeneous multiview dataset collected from 637 college students. The data consist of 23 ordinal-valued response matrices from self-report questionnaires, concerning various behavioral/psychological aspects; one real-valued feature matrix from fMRI data having four features: threat-related (left/right) amygdala reactivity and reward-related (left/right) ventral striatum (VS) reactivity (Nikolova and Hariri 2012); and four similarity matrices, obtained from SNP measurements of three biological systems (norepinephrine (NE), dopamine (DA) and serotonin (5-HT)) (Hariri et al. 2002; Nikolova et al. 2011), and a personality ratings dataset provided by informants (e.g., parents, siblings or friends) (Vazire 2006). For the SNP data (A, C, G, T nucleotides), the similarity matrices are based on the genome-wide average proportion of alleles shared identical-by-state (IBS) (Lawson and Falush 2012). For the informant reports (on 94 questions), the similarities are based on computing the averaged informants' ratings for each student and then using a similarity measure proposed in (Daemen and De Moor 2009). There are also binary labels associated with diagnosis of psychopathological disorders. We focus on two broadband behavioral disorders: Internalizing (anxious and depression symptoms) and Externalizing


Figure 3: (a) Active views for each factor. Row labels indicate the type of view: ordinal (O), real (R) and similarity (S); columns index factors (number of active views in parentheses). (b) Inferred view-correlation matrix. (c) Type of questions associated with each one of the 7 factors for the NEO questionnaire (first row in the left panel), based on the factor loading matrix of NEO. (d) Predicting ordinal responses and predicting fMRI.

Table 1: AUC scores on the prediction of internalizing and externalizing disorders.

            MLFS (all)       MLFS (ordinal)   MLFS (real+sim.)  MLFS (concat.)   BEMKL
Intern.     0.754 ± 0.032    0.720 ± 0.026    0.546 ± 0.031     0.713 ± 0.027    0.686 ± 0.037
Extern.     0.862 ± 0.019    0.770 ± 0.027    0.606 ± 0.024     0.747 ± 0.034    0.855 ± 0.015

(aggressive, delinquent and hyperactive symptoms as well as substance use disorders) (Krueger and Markon 2006). We apply our MLFS framework on this data to: (i) interpret common/view-specific latent factors as well as view-correlations, (ii) do multiview classification to predict psychopathological conditions, (iii) predict missing data (e.g., question answers and fMRI response) leveraging information from multiple views. We perform the analysis considering $K_m = 20$ (for similarity-based views), $R = 30$ latent factors, and prior hyperparameters $a_\alpha = b_\alpha = 0.01$.

Common/view-specific factors and view-correlations: For our first task, we are interested in understanding the data by (i) identifying latent personality traits (factors) present in the students, and (ii) inferring the view-correlations. Our model can help distinguish between common and view-specific factors by looking at the view-factor association matrix $B$ (model section). Figure 3(a) shows the inferred view-factor associations for this data. We only show the 17 factors which have at least one active view. Note that the first one represents the common factor (present in all the views), whereas the last 4 factors have only one active view (structured noise). Figure 3(b) shows the view-correlation matrix inferred from $W^{(m)}$, computed as described in the model section. As the figure shows, our model (seemingly) correctly discovers views that have high pairwise correlations, such as the questionnaires on drug-use, self-report delinquency and alcohol-use. Further insights can be obtained by interpreting the factor loadings $W^{(m)}$ (for which the rows correspond to factors and the columns to questions). The NEO questionnaire (240 questions) is of particular interest in psychology to measure the five broad domains of personality (openness, conscientiousness, extraversion, agreeableness, and neuroticism). Figure 3(c) shows, for the 7 factors active in NEO, the percentage of questions associated with every domain of personality. It is insightful to observe that the first factor includes, in an equitable manner, questions related to the five domains, whereas for the other factors, questions related to one or two domains of personality are dominant.

Predicting psychopathological disorders: Our next task predicts each of the two types of psychopathological disorders (Internalizing and Externalizing; each is a binary classification task). To do so, we first split the data at random into training (50%) and testing (50%) sets. The training set is used to fit MLFS in four different settings: (1) MLFS with all the views, (2) MLFS with ordinal views (questionnaires), (3) MLFS with real and similarity based views (fMRI, SNP and informants), (4) MLFS concatenating the ordinal views into a single matrix. We consider Bayesian Efficient Multiple Kernel Learning (BEMKL) (Gonen 2012) as a baseline for this experiment. For this baseline, we transformed the ordinal and real-valued feature based views to kernel matrices. Each experiment is repeated 10 times with different splits of training and test data. Since the labels are highly imbalanced (very few 1s), to assess the prediction performance we compute the average of the area under the ROC curve (AUC). Table 1 shows the mean AUC with standard deviation; bold numbers indicate the best performance. The MLFS model, which considers all the heterogeneous views, yields the overall best performance.

Predicting ordinal responses and fMRI: We first consider the task of ordinal matrix completion (questionnaires). We hide (20%, 30%, ..., 90%) of the data in each ordinal view and predict the missing data using the following methods: (1) MLFS with all the views, (2) MLFS with only the ordinal views, concatenated as a single matrix, and (3) the sparse factor probit model (SFPM) proposed in (Hahn, Carvalho, and Scott 2012). The top plot in Figure 3(d) shows the average mean absolute error (MAE) over 10 runs. The smallest MAE achieved by MLFS with all views demonstrates the benefit of integrating information from both the feature and similarity based views with the group sparse factor loading matrices. Our next experiment is on predicting fMRI responses leveraging other views. For this task, we hide the fMRI data from 30% of the subjects. For this group, we only assume access to the ordinal- and similarity-based views. We compare with two baselines: (1) a linear regression model (LRM) where the covariates are the ordinal responses and the similarity-based views (decomposed using SVD); (2) a sparse factor regression model (SFRM) (Carvalho et al. 2008) with the same covariates as before. The bottom plot in Figure 3(d) shows the mean square error (MSE) averaged over 10 runs. Here again, MLFS outperforms the other baselines, showing the benefits of a principled generative model for the data. The Supplementary Material contains additional comparisons, including a plot of predicted vs. ground-truth values for the missing fMRI responses.

Table 2: Benchmark datasets: Ordinal matrix completion leveraging the similarity based view.

                 Epinion                                  Ciao
                 MAE                Exact Match           MAE                Exact Match
Ordinal only     0.8700 (±0.0079)   0.3871 (±0.0056)      1.0423 (±0.0162)   0.3039 (±0.0068)
KPMF             1.0664 (±0.0204)   0.2715 (±0.0212)      1.1477 (±0.0242)   0.2788 (±0.0140)
MLFS             0.8470 (±0.0050)   0.4060 (±0.0102)      0.9826 (±0.0133)   0.3261 (±0.0070)

Table 3: Benchmark datasets: Accuracies on multiple similarity matrix based classification.

                 UCI Handwritten Digits (10 classes)      Protein Fold (27 classes)
                 No missing         50% missing           No missing         50% missing
Concatenation    93.47% (±1.40%)    92.02% (±2.03%)       50.46% (±2.96%)    45.93% (±3.59%)
BEMKL            94.94% (±0.84%)    88.59% (±2.76%)       53.70% (±2.88%)    47.77% (±3.01%)
MLFS             95.14% (±0.85%)    93.61% (±1.16%)       51.11% (±2.02%)    48.27% (±2.48%)

Matrix Completion for Recommender Systems

For this task, we consider two benchmark datasets¹, Epinion and Ciao, both having two views: an ordinal rating matrix (range 1-5) and a similarity matrix. The goal in this experiment is to complete the partially observed rating matrix, leveraging the similarity based view. Note that multiview matrix completion methods such as (Zhang, Cao, and Yeung 2010) cannot be applied for this task because these require all the views to be of the same type (e.g., ordinal matrices). The Epinion dataset we use consists of a 1000 × 1500 ordinal user-movie rating matrix (∼2% observed entries). The Ciao dataset we use consists of a 1000 × 500 ordinal user-DVD rating matrix (∼1% observed entries). In addition, for each dataset, we are given a network over the users which is converted into a 1000 × 1000 similarity matrix (computed based on the number of common trusted users for each pair of users). We compare our method with two baselines: (i) Ordinal only: uses only the ordinal view; (ii) Kernelized Probabilistic Matrix Factorization (KPMF) (Zhou et al. 2012): allows using a similarity matrix to assist a matrix completion task (it, however, treats the ratings as real-valued; we round its predictions when computing the exact match). We run 10 different splits with 50% of the observed ratings as the training set and the remaining 50% of ratings as the test set, and report the averaged results. As Table 2 shows, our method outperforms both baselines in terms of the completion accuracy (Mean Absolute Error and Exact Match).

Multiview/Multiple Kernel Classification

Our next experiment is on the task of multiple kernel classification on benchmark datasets. The multinomial probit adaptation of MLFS, with all similarity-based views, naturally applies to this problem. For this experiment, we choose two benchmark datasets: UCI Handwritten Digits (Kumar, Rai, and Daume III 2011) and Protein Fold Prediction (Gonen 2012). The Digits data consist of 2000 digits (10 classes), each having 6 types of feature representations. We construct 6 kernel matrices for this data in the same manner as (Kumar, Rai, and Daume III 2011). We split the data into 100 digits for training and 1900 digits for testing. The Protein data consist of 624 protein samples (27 classes), each having 12 views. We construct 12 kernel matrices for this data in the same manner as (Gonen 2012). For the Protein data, we split the data equally into training and test sets. For both the Digits and Protein data experiments, for each training/test split (10 runs), we try two settings: no missing observations and 50% missing observations in each view. We compare with two baselines: (i) Concatenation: performs SVD on each view's similarity matrix, concatenates all of the resulting matrices, and learns a multiclass probit model; and (ii) Bayesian Efficient Multiple Kernel Learning (BEMKL) (Gonen 2012), which is a state-of-the-art multiple kernel learning method. The results are shown in Table 3. For the missing data setting, we use zero-imputation for the baseline methods (our method does not require imputation). As shown in the table, our method yields better test set classification accuracies as compared to the other baselines. For the Protein data, although BEMKL performs better in the fully observed case, our method is better when the data in each view are significantly missing.

¹http://www.public.asu.edu/~jtang20/datasetcode/truststudy.htm/

Conclusion

We presented a probabilistic, Bayesian framework for learning from heterogeneous multiview data consisting of diverse feature-based (ordinal, binary, real) and similarity-based views. In addition to learning the latent factors and view correlations in multiview data, our framework allows solving various other problems involving multiview data, such as matrix completion and classification. Our contribution on learning the cutpoints for ordinal data is useful in its own right (e.g., for applications in recommender systems). The streaming extension shows the feasibility of posing our framework in online learning and active learning settings, left as future work. Our work can also be extended to multiview clustering when the data consist of a mixture of feature- (real, binary, ordinal, etc.) and similarity-based views.

Acknowledgments

The research reported here was funded in part by ARO, DARPA, DOE, NGA and ONR.

References

Beal, M. J. 2003. Variational algorithms for approximate Bayesian inference. Ph.D. Dissertation, University of London.
Bickel, S., and Scheffer, T. 2004. Multi-View Clustering. In ICDM.
Broderick, T.; Boyd, N.; Wibisono, A.; Wilson, A. C.; and Jordan, M. 2013. Streaming Variational Bayes. In NIPS.
Carvalho, C.; Chang, J.; Lucas, J.; Nevins, J.; Wang, Q.; and West, M. 2008. High-dimensional sparse factor modeling: Applications in gene expression genomics. JASA.
Chaudhuri, K.; Kakade, S. M.; Livescu, K.; and Sridharan, K. 2009. Multi-view Clustering via Canonical Correlation Analysis. In ICML.
Daemen, A., and De Moor, B. 2009. Development of a kernel function for clinical data. In Conf Proc IEEE Eng Med Biol Soc., 5913-5917.
Girolami, M., and Rogers, S. 2006. Variational Bayesian Multinomial Probit Regression with Gaussian Process Priors. Neural Computation 18.
Gonen, M., and Alpaydın, E. 2011. Multiple Kernel Learning Algorithms. JMLR.
Gonen, M. 2012. Bayesian Efficient Multiple Kernel Learning. In ICML.
Hahn, P. R.; Carvalho, C. M.; and Scott, J. G. 2012. A sparse factor-analytic probit model for congressional voting patterns. Journal of the Royal Statistical Society: Series C.
Hariri, A.; Mattay, V.; Tessitore, A.; Kolachana, B.; Fera, F.; Goldman, D.; Egan, M.; and Weinberger, D. 2002. Serotonin transporter genetic variation and the response of the human amygdala. Science 297(5580):400-403.
Hernandez-Lobato, J. M.; Houlsby, N.; and Ghahramani, Z. 2014. Probabilistic matrix factorization with non-random missing data. In ICML.
Jia, Y.; Salzmann, M.; and Darrell, T. 2010. Factorized Latent Spaces with Structured Sparsity. In NIPS.
Klami, A.; Bouchard, G.; and Tripathi, A. 2014. Group-sparse Embeddings in Collective Matrix Factorization. In ICLR.
Krueger, R., and Markon, K. 2006. Understanding psychopathology: Melding behavior genetics, personality, and quantitative psychology to develop an empirically based model. Current Directions in Psychological Science 15:113-117.
Kumar, A.; Rai, P.; and Daume III, H. 2011. Co-regularized Multi-view Spectral Clustering. In NIPS.
Lawson, D. J., and Falush, D. 2012. Population identification using genetic data. Annu. Rev. Genomics Hum. Genet. 13:337-361.
Nikolova, Y., and Hariri, A. R. 2012. Neural responses to threat and reward interact to predict stress-related problem drinking: A novel protective role of the amygdala. Biology of Mood & Anxiety Disorders 2.
Nikolova, Y.; Ferrell, R.; Manuck, S.; and Hariri, A. 2011. Multilocus genetic profile for dopamine signaling predicts ventral striatum reactivity. Neuropsychopharmacology 36:1940-1947.
Pan, W.; Liu, N. N.; Xiang, E. W.; and Yang, Q. 2011. Transfer Learning to Predict Missing Ratings via Heterogeneous User Feedbacks. In IJCAI.
Salazar, E.; Bogdan, R.; Gorka, A.; Hariri, A.; and Carin, L. 2013. Exploring the Mind: Integrating Questionnaires and fMRI. In ICML.
Shao, W.; Shi, X.; and Yu, P. 2013. Clustering on Multiple Incomplete Datasets via Collective Kernel Learning. arXiv preprint arXiv:1310.1177.
Shi, Y.; Larson, M.; and Hanjalic, A. 2014. Collaborative Filtering Beyond the User-Item Matrix: A Survey of the State of the Art and Future Challenges. ACM Comput. Surv. 47(1).
Singh, A. P., and Gordon, G. J. 2008. Relational learning via collective matrix factorization. In KDD.
Vazire, S. 2006. Informant reports: A cheap, fast, and easy method for personality assessment. Journal of Research in Personality 40(5):472-481.
Virtanen, S.; Klami, A.; Khan, S. A.; and Kaski, S. 2012. Bayesian Group Factor Analysis. In AISTATS.
Yu, S.; Krishnapuram, B.; Rosales, R.; and Rao, R. B. 2011. Bayesian Co-Training. JMLR.
Zhang, Y.; Cao, B.; and Yeung, D. 2010. Multi-Domain Collaborative Filtering. In UAI.
Zhe, S.; Xu, Z.; Qi, Y.; and Yu, P. 2014. Joint Association Discovery and Diagnosis of Alzheimer's Disease by Supervised Heterogeneous Multiview Learning. In Pacific Symposium on Biocomputing, volume 19.
Zhou, T.; Shan, H.; Banerjee, A.; and Sapiro, G. 2012. Kernelized Probabilistic Matrix Factorization: Exploiting Graphs and Side Information. In SDM.

Supplemental Material for "Integrating Features and Similarities: Flexible Models for Heterogeneous Multiview Data"

Variational Objective

Define $\Theta = \{U^{(m)}, W^{(m)}, \alpha_m, \gamma_m\}_{m=1}^M$. For the multiview classification problem, $\{\beta_c, \{z_{nc}\}_{n=1}^N\}_{c=1}^C$ also need to be estimated. Besides, for the ordinal views, we denote the cutpoints as $\mathcal{G} = \{g^m\}_{m=1}^{M_1}$, and the rotation matrix as $Q$. Throughout the derivation, we discuss the model for the multiclass classification problem (where we are also given the labels, which we denote by $y$).

The goal is to minimize the KL divergence $\mathrm{KL}(q(\Theta) \| p(\Theta | X, y, \mathcal{G}, Q))$, where $q(\Theta)$ is a mean-field approximation of $p(\Theta | X, y, \mathcal{G}, Q)$. This is equivalent to maximizing the evidence lower bound (ELBO) $\mathcal{L}(q(\Theta), \mathcal{G}, Q)$:

$\mathcal{L}(q(\Theta), \mathcal{G}, Q) = \langle \log p(X, y, \Theta) - \log q(\Theta) \rangle_{q(\Theta)}$   (1)
$= \langle \log p(X | W, V, \gamma) \rangle + \langle \log \frac{p(W|\alpha)\, p(\alpha)\, p(V)\, p(\gamma)\, p(\beta)}{q(W)\, q(\alpha)\, q(V)\, q(\gamma)\, q(\beta)} \rangle + \langle \log p(y | V, \beta) \rangle$
$= \sum_{m=1}^{M_1} \langle \log p(X^{(m)} | V, \gamma_m) \rangle + \sum_{m=M_1+1}^{M_1+M_2} \langle \log p(X^{(m)} | V, \gamma_m) \rangle + \sum_{m=M_1+M_2+1}^{M_1+M_2+M_3} \langle \log p(X^{(m)} | U^{(m)}, \tau_m) \rangle$
$\quad + \sum_{m=M_1+M_2+1}^{M_1+M_2+M_3} \langle \log p(U^{(m)} | V, \gamma_m) \rangle + \sum_{m=1}^{M_1+M_2+M_3} [\langle \log p(W^{(m)} | \alpha_m) \rangle + \langle \log p(\alpha_m) \rangle]$
$\quad + \sum_{m=1}^{M_1+M_2+M_3} \langle \log p(\gamma_m) \rangle + \langle \log p(V) \rangle + \sum_{c=1}^{C} \langle \log p(\beta_c) \rangle + \langle \log p(z | V, \beta) \rangle + \langle \log p(y | z) \rangle$
$\quad - \sum_{m=1}^{M_1+M_2+M_3} [\langle \log q(W^{(m)}) \rangle + \langle \log q(\alpha_m) \rangle] - \sum_{m=1}^{M_1+M_2+M_3} \langle \log q(\gamma_m) \rangle$
$\quad - \langle \log q(V) \rangle - \sum_{c=1}^{C} \langle \log q(\beta_c) \rangle - \langle \log q(z) \rangle - \sum_{m=M_1+M_2+1}^{M_1+M_2+M_3} \langle \log q(U^{(m)}) \rangle$

Approximation for ordinal views

Directly maximizing $\mathcal{L}(q(\Theta), \mathcal{G}, Q)$ is intractable, thus further approximation is needed for the first term of (1). Only the ordinal views are considered in this subsection. For Gaussian (real-valued) and similarity-based views, no such approximation is needed. The approximation for the ordinal views proceeds as follows:

$\langle \log p(X | W, V, \gamma) \rangle_{q(\Theta)} = \sum_{i,j,m} \langle \log \int p(X^{(m)}_{ij} | U^{(m)}_{ij}) \, p(U^{(m)}_{ij} | V_i, W^{(m)}_{:j}, \gamma_m) \, dU^{(m)}_{ij} \rangle$   (2)
$= \sum_{i,j,m} \langle \log \int_{g^m_{X^{(m)}_{ij}-1}}^{g^m_{X^{(m)}_{ij}}} \mathcal{N}(U^{(m)}_{ij}; V_i W^{(m)}_{:j}, \gamma_m^{-1}) \, dU^{(m)}_{ij} \rangle$
$= \sum_{i,j,m} \langle \log [\Phi(\beta_{i,j,m}) - \Phi(\alpha_{i,j,m})] \rangle$
$= \mathrm{const} + \sum_{i,j,m} \langle \log \int_{\alpha_{i,j,m}}^{\beta_{i,j,m}} \exp(-\tfrac{u^2}{2}) \, du \rangle$
$\geq \sum_{i,j,m} \langle \frac{1}{\beta_{i,j,m} - \alpha_{i,j,m}} \int_{\alpha_{i,j,m}}^{\beta_{i,j,m}} [\log(\beta_{i,j,m} - \alpha_{i,j,m}) - \tfrac{1}{2} u^2] \, du \rangle + \mathrm{const}$
$= \sum_{i,j,m} \log(g^m_{X^{(m)}_{ij}} - g^m_{X^{(m)}_{ij}-1}) + \tfrac{1}{2}\langle \log \gamma_m \rangle - \tfrac{1}{2}\langle \gamma_m \rangle \langle (V_i W^{(m)}_{:j})^2 \rangle + \tfrac{1}{2}\langle \gamma_m \rangle \langle V_i W^{(m)}_{:j} \rangle (g^m_{X^{(m)}_{ij}} + g^m_{X^{(m)}_{ij}-1}) - \tfrac{1}{6}\langle \gamma_m \rangle ((g^m_{X^{(m)}_{ij}})^2 + (g^m_{X^{(m)}_{ij}-1})^2 + g^m_{X^{(m)}_{ij}} g^m_{X^{(m)}_{ij}-1}) + \mathrm{const}$   (3)

In the above, $\beta_{i,j,m} = (g^m_{X^{(m)}_{ij}} - V_i W^{(m)}_{:j}) \gamma_m^{1/2}$, $\alpha_{i,j,m} = (g^m_{X^{(m)}_{ij}-1} - V_i W^{(m)}_{:j}) \gamma_m^{1/2}$, and $\Phi(\cdot)$ is the c.d.f. of the standard normal distribution. (3) is obtained using Jensen's inequality, but it can also be derived from a Taylor expansion, showing the conditions for the bound's tightness. As below, (3) can be equivalently expressed using the erf function:

$\sum_{i,j,m} \langle \log [\Phi(\beta_{i,j,m}) - \Phi(\alpha_{i,j,m})] \rangle$
$= \sum_{i,j,m} \langle \log [\mathrm{erf}(\tfrac{\beta_{i,j,m}}{\sqrt{2}}) - \mathrm{erf}(\tfrac{\alpha_{i,j,m}}{\sqrt{2}})] \rangle + \mathrm{const}$
$\approx \sum_{i,j,m} \langle \log \tfrac{2}{\sqrt{\pi}} [\tfrac{\beta_{i,j,m}}{\sqrt{2}} - \tfrac{1}{3}(\tfrac{\beta_{i,j,m}}{\sqrt{2}})^3 - \tfrac{\alpha_{i,j,m}}{\sqrt{2}} + \tfrac{1}{3}(\tfrac{\alpha_{i,j,m}}{\sqrt{2}})^3] \rangle + \mathrm{const}$   (4)
$= \sum_{i,j,m} \langle \log(\beta_{i,j,m} - \alpha_{i,j,m}) \rangle + \langle \log [1 - \tfrac{1}{6}(\beta_{i,j,m}^2 + \alpha_{i,j,m}^2 + \beta_{i,j,m}\alpha_{i,j,m})] \rangle + \mathrm{const}$
$\approx \sum_{i,j,m} \langle \log(\beta_{i,j,m} - \alpha_{i,j,m}) \rangle - \tfrac{1}{6} \langle \beta_{i,j,m}^2 + \alpha_{i,j,m}^2 + \beta_{i,j,m}\alpha_{i,j,m} \rangle + \mathrm{const}$   (5)
$= \sum_{i,j,m} \log(g^m_{X^{(m)}_{ij}} - g^m_{X^{(m)}_{ij}-1}) + \tfrac{1}{2}\langle \log \gamma_m \rangle - \tfrac{1}{2}\langle \gamma_m \rangle \langle (V_i W^{(m)}_{:j})^2 \rangle + \tfrac{1}{2}\langle \gamma_m \rangle \langle V_i W^{(m)}_{:j} \rangle (g^m_{X^{(m)}_{ij}} + g^m_{X^{(m)}_{ij}-1}) - \tfrac{1}{6}\langle \gamma_m \rangle ((g^m_{X^{(m)}_{ij}})^2 + (g^m_{X^{(m)}_{ij}-1})^2 + g^m_{X^{(m)}_{ij}} g^m_{X^{(m)}_{ij}-1}) + \mathrm{const}$   (6)

In the above derivation, an approximation using Taylor expansions is used. Ignoring the higher-order terms of $O(x^5)$ for the erf function, we obtain the approximation in (4); while ignoring the higher-order terms of $O(x^2)$ for the logarithm function, we obtain the approximation in (5). The final lower bound (6) provides analytical updates of the variational parameters for $q(\Theta)$. We evaluated this variational approximation on synthetic data where the cutpoints are available, and we can recover the true cutpoints. Markov chain Monte Carlo (MCMC) is also used as a comparison for ordinal matrix completion problems, and identical performance is observed. Figure 2(a) in this supplementary material shows the results for the ordinal matrix completion task (questionnaire responses) on the cognitive neuroscience data. We notice that the results (in terms of mean absolute error) based on the MCMC and VB algorithms are quite similar.
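The accuracy of this expansion is easy to verify numerically. The sketch below (illustrative; it assumes SciPy is available) compares the exact $\log[\Phi(\beta) - \Phi(\alpha)]$ with the expansion $\log(\beta - \alpha) - (\beta^2 + \alpha^2 + \alpha\beta)/6 - \tfrac{1}{2}\log 2\pi$ implied by (5), for a few small cutpoint gaps where the Taylor expansion is expected to be tight.

    import numpy as np
    from scipy.stats import norm

    def exact(alpha, beta):
        return np.log(norm.cdf(beta) - norm.cdf(alpha))

    def approx(alpha, beta):
        # Expansion used for the ordinal views; -0.5*log(2*pi) is the additive
        # constant absorbed into "const" in the derivation above.
        return (np.log(beta - alpha)
                - (beta**2 + alpha**2 + alpha * beta) / 6.0
                - 0.5 * np.log(2 * np.pi))

    for a, b in [(-0.3, 0.2), (0.1, 0.6), (-1.0, -0.4)]:
        print(f"alpha={a:5.2f} beta={b:5.2f}  exact={exact(a, b):.4f}  approx={approx(a, b):.4f}")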

Learning the cutpoints

With the variational objective derived in (1) and (6), we can use variational EM to learn the variational distribution $q(\Theta)$ (variational E-step), and the point estimates of the cutpoints $\mathcal{G}$ and rotation matrix $Q$ (variational M-step). Ignoring the constant terms w.r.t. $\mathcal{G}$, we have the following objective function for the cutpoints:

$\mathcal{L}(\mathcal{G}) = \sum_{m=1}^{M_1} \mathcal{L}_m(g^m) + \mathrm{const}$   (7)
$\mathcal{L}_m(g^m) = \sum_{l=1}^{L_m} \mathcal{L}^m_l$   (8)
$\mathcal{L}^m_l = N^m_l [\log(g^m_l - g^m_{l-1}) - \tfrac{1}{6}\langle \gamma_m \rangle ((g^m_l)^2 + (g^m_{l-1})^2 + g^m_l g^m_{l-1})] + \tfrac{1}{2}\langle \gamma_m \rangle (g^m_l + g^m_{l-1}) \sum_{i,j: X^{(m)}_{ij}=l} \langle V_i \rangle \langle W^{(m)}_{:j} \rangle$   (9)

In the above, $L_m$ is the number of possible ordinal outcomes, and $N^m_l$ is the number of data points having value $l$ in the $m$-th view. The gradients of $\mathcal{L}^m_l$ are also analytically available. Because $g^m_0$ and $g^m_{L_m}$ are fixed to achieve identifiability, only the gradients with respect to $g^m_l$, $l = 1, \dots, L_m - 1$, are required. Note that the objective function in (9) is concave w.r.t. $g^m$; therefore, in each variational M-step, the solution $g^m$ given the variational distributions $q(\Theta)$ is globally optimal. This constrained optimization problem (with ordering constraints $g^m_l \leq g^m_{l'}$ for $l < l'$) can be solved efficiently using Newton's method, with the gradient provided below:

$\nabla_{g^m_l} \mathcal{L}_m(g^m) = N^m_l [\tfrac{1}{g^m_l - g^m_{l-1}} - \tfrac{1}{6}\langle \gamma_m \rangle (2 g^m_l + g^m_{l-1})] + \tfrac{1}{2}\langle \gamma_m \rangle \sum_{i,j: X^{(m)}_{ij}=l} \langle V_i \rangle \langle W^{(m)}_{:j} \rangle$
$\quad + N^m_{l+1} [\tfrac{-1}{g^m_{l+1} - g^m_l} - \tfrac{1}{6}\langle \gamma_m \rangle (2 g^m_l + g^m_{l+1})] + \tfrac{1}{2}\langle \gamma_m \rangle \sum_{i,j: X^{(m)}_{ij}=l+1} \langle V_i \rangle \langle W^{(m)}_{:j} \rangle$   (10)
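A small numerical sketch of the per-view objective (9) and gradient (10), with the data-dependent statistics $N^m_l$ and $s_l = \sum_{i,j: X^{(m)}_{ij}=l} \langle V_i\rangle\langle W^{(m)}_{:j}\rangle$ assumed precomputed; the interior cutpoints can then be optimized with Newton's method or any gradient-based routine under the ordering constraints. This is an illustrative sketch, not the authors' implementation.

    import numpy as np

    def cutpoint_objective_and_grad(g, N_l, s_l, gamma):
        """Objective L_m(g) of eq. (9) and its gradient w.r.t. the interior cutpoints, eq. (10).

        g: cutpoints g_0 < g_1 < ... < g_L (g_0 and g_L are held fixed);
        N_l[l-1], s_l[l-1]: count and weighted sum for ordinal level l (arrays of length L).
        """
        gaps = g[1:] - g[:-1]                              # g_l - g_{l-1}
        quad = g[1:]**2 + g[:-1]**2 + g[1:] * g[:-1]
        obj = np.sum(N_l * (np.log(gaps) - gamma * quad / 6.0)
                     + 0.5 * gamma * (g[1:] + g[:-1]) * s_l)
        grad = np.zeros(len(g))
        for l in range(1, len(g) - 1):                     # interior cutpoints only
            grad[l] = (N_l[l - 1] * (1.0 / (g[l] - g[l - 1])
                                     - gamma * (2 * g[l] + g[l - 1]) / 6.0)
                       + 0.5 * gamma * s_l[l - 1]
                       + N_l[l] * (-1.0 / (g[l + 1] - g[l])
                                   - gamma * (2 * g[l] + g[l + 1]) / 6.0)
                       + 0.5 * gamma * s_l[l])
        return obj, grad[1:-1]

    g = np.array([-5.0, -0.5, 0.4, 5.0])                   # L_m = 3 levels
    obj, grad = cutpoint_objective_and_grad(
        g, N_l=np.array([30, 50, 20]), s_l=np.array([-1.2, 0.3, 0.9]), gamma=2.0)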

Learning the rotation matrix

At each variational M-step, an unconstrained optimization problem to learn $Q$ is solved to achieve faster convergence. After rotation, the variational distributions for $V_i$, $W^{(m)}_{:j}$, $\alpha_{mr}$, $\beta_c$ are updated as follows:

$V_i = V_i Q^{-1} \sim \mathcal{N}(\mu_{v,\mathrm{old}} Q^{-1}, \; Q^{-\top} \Sigma_{v,\mathrm{old}} Q^{-1})$   (11)
$W^{(m)}_{:j} = Q W^{(m)}_{:j} \sim \mathcal{N}(Q \mu_{w,\mathrm{old}}, \; Q \Sigma_{w,\mathrm{old}} Q^{\top})$   (12)
$\alpha_{mr} \sim \mathrm{Ga}(a_\alpha + \tfrac{1}{2} K_m, \; b_\alpha + \tfrac{1}{2} Q^{\top}_{:r} \langle W^{(m)} W^{(m)\top} \rangle Q_{:r})$   (13)
$\beta_c \sim \mathcal{N}(\Sigma_\beta \sum_{i=1}^{N} \langle z_{ic} \rangle Q^{-\top} \langle V_i^\top \rangle, \; \Sigma_\beta), \quad \Sigma_\beta = (\rho I_R + Q^{-\top} \sum_{i=1}^{N} \langle V_i^\top V_i \rangle Q^{-1})^{-1}$   (14)

Ignoring the terms that are constant w.r.t. $Q$ in the variational lower bound, we have the following objective function w.r.t. $Q$:

$\mathcal{L}'(Q) = \langle \log \frac{p(W, \alpha)\, p(V)\, p(\beta)}{q(W)\, q(\alpha)\, q(V)\, q(\beta)} \rangle = \langle \log \frac{p(V)}{q(V)} \rangle + \langle \log \frac{p(W|\alpha)\, p(\alpha)}{q(W)\, q(\alpha)} \rangle + \langle \log \frac{p(\beta)}{q(\beta)} \rangle$   (15)

Inspecting (15) term by term, we have the analytical forms as follows:

$\langle \log \frac{p(V)}{q(V)} \rangle = -N \log |Q| + \sum_{i=1}^{N} \log |\Sigma_{V_i}| - \tfrac{1}{2} \mathrm{tr}(Q^{-1} \langle V^\top V \rangle Q^{-\top})$   (16)
$\langle \log \frac{p(W|\alpha)\, p(\alpha)}{q(W)\, q(\alpha)} \rangle = \sum_{m=1}^{M_1+M_2+M_3} [K_m \log |Q| - \tfrac{K_m}{2} \sum_{r=1}^{R} \log \mathrm{tr}(Q^{\top}_{:r} \langle W^{(m)} W^{(m)\top} \rangle Q_{:r})]$   (17)
$\langle \log \frac{p(\beta)}{q(\beta)} \rangle \overset{\rho \to 0}{\approx} -C \log |Q| + \tfrac{1}{2} C \log |\sum_{i=1}^{N} \langle V_i^\top V_i \rangle|$   (18)

Further, we have the gradients w.r.t. $Q$ available in analytical form. If $Q = I_R$, no rotation is added; with rotation, $Q$ draws $q(\Theta)$ towards the prior $p(\Theta)$ because (15) effectively minimizes $\mathrm{KL}(q(\Theta) \| p(\Theta))$ while not affecting the likelihood term $p(X|\Theta)$. The solution of this unconstrained optimization problem is guaranteed to increase the variational lower bound.

Updating variational distributions

We use a mean-field approximation to learn the variational distributions $q(\Theta)$:

$q(\Theta) = \prod_{m=M_1+M_2+1}^{M} \prod_{i=1}^{N} q(U^{(m)}_i) \prod_{i=1}^{N} q(V_i) \prod_{m=1}^{M} \prod_{r=1}^{R} q(W^{(m)}_r) \prod_{m=1}^{M} \prod_{r=1}^{R} q(\alpha_{mr}) \prod_{m=1}^{M} q(\gamma_m)$   (19)

Update $V_i$, for $i = 1, \dots, N$ (multiview classification):

$q(V_i) = \mathcal{N}(\mu_v, \Sigma_v)$   (20)
$\Sigma_v = (I + \sum_{m=1}^{M} \langle \gamma_m \rangle \langle W^{(m)} W^{(m)\top} \rangle + \sum_{c=1}^{C} \langle \beta_c \beta_c^\top \rangle)^{-1}$
$\mu_v = (\sum_{m=1}^{M_1} \langle \gamma_m \rangle \sum_{j=1}^{K_m} \langle W^{(m)\top}_{:j} \rangle \tfrac{g^m_{X^{(m)}_{i,j}} + g^m_{X^{(m)}_{i,j}-1}}{2} + \sum_{m=M_1+1}^{M_1+M_2} \langle \gamma_m \rangle X^{(m)}_i \langle W^{(m)\top} \rangle + \sum_{m=M_1+M_2+1}^{M_1+M_2+M_3} \langle \gamma_m \rangle \langle U^{(m)}_i \rangle \langle W^{(m)\top} \rangle + \sum_{c=1}^{C} \langle z_{ic} \rangle \langle \beta_c^\top \rangle) \Sigma_v$
$\langle V_i \rangle = \mu_v$   (21)
$\langle V_i^\top V_i \rangle = \mu_v^\top \mu_v + \Sigma_v$   (22)

Update $\alpha_{mr}$:

$q(\alpha_{mr}) = \mathrm{Ga}(\hat{a}_\alpha, \hat{b}_\alpha)$   (23)
$\hat{a}_\alpha = a_\alpha + \tfrac{K_m}{2}, \quad \hat{b}_\alpha = b_\alpha + \tfrac{1}{2} \langle W^{(m)}_r W^{(m)\top}_r \rangle$
$\langle \alpha_{mr} \rangle = \tfrac{\hat{a}_\alpha}{\hat{b}_\alpha}$   (24)
$\langle \log \alpha_{mr} \rangle = \psi(\hat{a}_\alpha) - \log(\hat{b}_\alpha)$   (25)
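A small sketch of this ARD update (with the expected squared norm of row $r$ of $W^{(m)}$ passed in directly; illustrative, not the authors' code):

    import numpy as np
    from scipy.special import digamma

    def update_alpha(row_sq_norm_expect, K_m, a0=0.01, b0=0.01):
        """Gamma posterior for one ARD precision alpha_{mr}, following eqs. (23)-(25)."""
        a = a0 + K_m / 2.0
        b = b0 + 0.5 * row_sq_norm_expect              # 0.5 * <W_r^(m) W_r^(m)T>
        return {"a": a, "b": b,
                "mean": a / b,                         # <alpha_{mr}>
                "mean_log": digamma(a) - np.log(b)}    # <log alpha_{mr}>

    print(update_alpha(row_sq_norm_expect=0.004, K_m=10))   # nearly-pruned row: large <alpha>
    print(update_alpha(row_sq_norm_expect=25.0, K_m=10))    # clearly active row: small <alpha>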

Update $\gamma_m$:

$q(\gamma_m) = \mathrm{Ga}(\hat{a}_\gamma, \hat{b}_\gamma)$   (26)
$\hat{a}_\gamma = a_\gamma + \tfrac{N K_m}{2}$

For $m = 1, \dots, M_1$ (ordinal views),
$\hat{b}_\gamma = b_\gamma + \tfrac{1}{2} \sum_{i,j} [\langle (V_i W^{(m)}_{:j})^2 \rangle - \langle V_i W^{(m)}_{:j} \rangle (g^m_{X^{(m)}_{ij}} + g^m_{X^{(m)}_{ij}-1}) + \tfrac{1}{3} ((g^m_{X^{(m)}_{ij}})^2 + (g^m_{X^{(m)}_{ij}-1})^2 + g^m_{X^{(m)}_{ij}} g^m_{X^{(m)}_{ij}-1})]$

For $m = M_1+1, \dots, M_1+M_2$ (real-valued feature views),
$\hat{b}_\gamma = b_\gamma + \tfrac{1}{2} \mathrm{tr}(\langle W^{(m)} W^{(m)\top} \rangle \sum_{i=1}^{N} \langle V_i^\top V_i \rangle) - \mathrm{tr}(\sum_{i=1}^{N} X^{(m)\top}_i \langle V_i \rangle \langle W^{(m)} \rangle) + \tfrac{1}{2} \mathrm{tr}(\sum_{i=1}^{N} X^{(m)\top}_i X^{(m)}_i)$

For $m = M_1+M_2+1, \dots, M_1+M_2+M_3$ (similarity-based views),
$\hat{b}_\gamma = b_\gamma + \tfrac{1}{2} \mathrm{tr}(\langle W^{(m)} W^{(m)\top} \rangle \sum_{i=1}^{N} \langle V_i^\top V_i \rangle) - \mathrm{tr}(\sum_{i=1}^{N} \langle U^{(m)\top}_i \rangle \langle V_i \rangle \langle W^{(m)} \rangle) + \tfrac{1}{2} \mathrm{tr}(\sum_{i=1}^{N} \langle U^{(m)\top}_i U^{(m)}_i \rangle)$

$\langle \gamma_m \rangle = \tfrac{\hat{a}_\gamma}{\hat{b}_\gamma}$   (27)
$\langle \log \gamma_m \rangle = \psi(\hat{a}_\gamma) - \log(\hat{b}_\gamma)$   (28)

Update $\tau_m$, for $m = M_1+M_2+1, \dots, M_1+M_2+M_3$:

$q(\tau_m) = \mathrm{Ga}(\hat{a}_\tau, \hat{b}_\tau)$   (29)
$\hat{a}_\tau = a_\tau + \tfrac{N(N-1)}{4}$
$\hat{b}_\tau = b_\tau + \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j>i} [(X^{(m)}_{ij})^2 - 2 X^{(m)}_{ij} \langle U^{(m)}_i \rangle \langle U^{(m)\top}_j \rangle + \mathrm{tr}(\langle U^{(m)\top}_i U^{(m)}_i \rangle \langle U^{(m)}_j U^{(m)\top}_j \rangle)]$
$\langle \tau_m \rangle = \tfrac{\hat{a}_\tau}{\hat{b}_\tau}$   (30)
$\langle \log \tau_m \rangle = \psi(\hat{a}_\tau) - \log(\hat{b}_\tau)$   (31)

Update $\beta_c$ and $z_i$ (for the classification task):

$q(\beta_c) = \mathcal{N}(\mu_\beta, \Sigma_\beta)$   (32)
$\Sigma_\beta = (\rho I_R + \sum_{i=1}^{N} \langle V_i^\top V_i \rangle)^{-1}, \quad \mu_\beta = \Sigma_\beta \sum_{i=1}^{N} \langle z_{ic} \rangle \langle V_i^\top \rangle$
$\langle \beta_c \rangle = \mu_\beta$   (33)
$\langle \beta_c \beta_c^\top \rangle = \mu_\beta \mu_\beta^\top + \Sigma_\beta$   (34)
$q(z_i) = \mathcal{TN}_{y_i}(\xi, I_C), \quad \xi_c = \langle V_i \rangle \langle \beta_c \rangle$   (35)

In the above, $\mathcal{TN}_{y_i}(z_i)$ denotes the truncation $z_{i y_i} = \max_c z_{ic}$.

$\langle z_{ic} \rangle_{c \neq y_i} = \xi_c - \frac{\mathbb{E}_{p(u)}[\phi(u + \xi_{y_i} - \xi_c) \, \Phi^{i,c}_u]}{\mathbb{E}_{p(u)}[\Phi(u + \xi_{y_i} - \xi_c) \, \Phi^{i,c}_u]}$   (36)
$\Phi^{i,c}_u = \prod_{j \neq y_i, c} \Phi(u + \xi_{y_i} - \xi_j)$
$\langle z_{i y_i} \rangle = \xi_{y_i} - \sum_{j \neq y_i} (\langle z_{ij} \rangle - \xi_j)$   (37)

In the above, $u \sim \mathcal{N}(0, 1)$; $\phi(\cdot)$ and $\Phi(\cdot)$ denote the p.d.f. and c.d.f. of the standard normal distribution, respectively.
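The one-dimensional expectations over $u \sim \mathcal{N}(0,1)$ in (36)-(37) are easy to estimate by simple Monte Carlo; a sketch (illustrative only, not the authors' implementation):

    import numpy as np
    from scipy.stats import norm

    def truncated_z_means(xi, y, n_samples=20000, seed=0):
        """Monte Carlo estimate of <z_ic> under q(z_i) = TN_{y_i}(xi, I_C), eqs. (36)-(37)."""
        xi = np.asarray(xi, dtype=float)
        rng = np.random.default_rng(seed)
        u = rng.normal(size=n_samples)                 # u ~ N(0, 1)
        C = len(xi)
        z = xi.copy()
        for c in range(C):
            if c == y:
                continue
            others = [j for j in range(C) if j not in (y, c)]
            # Phi^{i,c}_u = prod_{j != y_i, c} Phi(u + xi_y - xi_j)
            phi_u = np.prod([norm.cdf(u + xi[y] - xi[j]) for j in others], axis=0)
            num = np.mean(norm.pdf(u + xi[y] - xi[c]) * phi_u)
            den = np.mean(norm.cdf(u + xi[y] - xi[c]) * phi_u)
            z[c] = xi[c] - num / den                   # eq. (36)
        mask = np.arange(C) != y
        z[y] = xi[y] - np.sum(z[mask] - xi[mask])      # eq. (37)
        return z

    print(truncated_z_means(xi=[0.2, 1.0, -0.5], y=1))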

Out-of-sample prediction

For out-of-sample data point(s) $X_*$, we would like to infer $q(V_*) \approx p(V_* | X^{(1)}_*, \dots, X^{(M_1+M_2+M_3)}_*)$. Based on the chain rule, we have

$q(V_*) \propto p(V_*) \prod_m \int p(X^{(m)}_* | U^{(m)}_*) \, p(U^{(m)}_* | V_*) \, dU^{(m)}_* = p(V_*) \prod_m p(X^{(m)}_* | V_*)$   (38)
$\quad \propto \int p(V_* | U^{(1)}_*, \dots, U^{(M)}_*) \prod_m p(U^{(m)}_* | X^{(m)}_*) \, dU^{(m)}_*$   (39)

For ordinal feature views, (38) is used, where the likelihood term $p(X^{(m)}_* | V_*)$ is approximated by (3); thus $q(V_*) \approx p(V_* | X^{(1)}_*, \dots, X^{(M_1)}_*)$ is accordingly approximated by a Gaussian distribution. For Gaussian feature views, because $p(X^{(m)}_* | V_*)$, $m = M_1+1, \dots, M_1+M_2$, is a Gaussian distribution, we can directly apply (38) and obtain the Gaussian posterior $q(V_*) \approx p(V_* | X^{(M_1+1)}_*, \dots, X^{(M_1+M_2)}_*)$. For similarity-based views, because $U^{(m)}_*$ cannot be integrated out, (39) is used. We need to estimate $q(U^{(m)}_*)$ and $q(V_*)$ in two stages as follows:

$p(U^{(m)}_*) = \int p(U^{(m)}_* | V_*) \, p(V_*) \, dV_* = \mathcal{N}(0, \; \langle W^{(m)\top} W^{(m)} \rangle + \langle \gamma_m \rangle^{-1} I_{K_m})$   (40)
$p(X^{(m)}_* | U^{(m)}_*) = \prod_{j=1}^{N} \mathcal{N}(U^{(m)}_* \langle U^{(m)\top}_j \rangle, \; \langle \tau_m \rangle^{-1})$   (41)

Therefore we have the following posterior $p(U^{(m)}_* | X^{(m)}_*)$:

$p(U^{(m)}_* | X^{(m)}_*) = \mathcal{N}(\langle \tau_m \rangle \sum_{j=1}^{N} X^{(m)}_{*j} \langle U^{(m)}_j \rangle \, \Sigma_u, \; \Sigma_u)$   (42)
$\Sigma_u = ((\langle W^{(m)\top} W^{(m)} \rangle + \langle \gamma_m \rangle^{-1} I_{K_m})^{-1} + \langle \tau_m \rangle \sum_{j=1}^{N} \langle U^{(m)\top}_j U^{(m)}_j \rangle)^{-1}$

We also have

$p(V_* | U_*) \propto p(V_*) \prod_{m=M_1+M_2+1}^{M_1+M_2+M_3} p(U^{(m)}_* | V_*) = \mathcal{N}(\sum_{m=M_1+M_2+1}^{M_1+M_2+M_3} \langle \gamma_m \rangle \langle U^{(m)}_* \rangle \langle W^{(m)\top} \rangle \, \Sigma_{v|u}, \; \Sigma_{v|u})$   (43)
$\Sigma_{v|u} = (I_R + \sum_{m=M_1+M_2+1}^{M_1+M_2+M_3} \langle \gamma_m \rangle \langle W^{(m)} W^{(m)\top} \rangle)^{-1}$

Combining (42) and (43), we have $q(V_*) \approx p(V_* | X^{(M_1+M_2+1)}_*, \dots, X^{(M_1+M_2+M_3)}_*)$, which is a Gaussian distribution. Finally, combining the ordinal, real (Gaussian), and similarity-based views in a sequential manner (dealing with the feature views first and using this posterior as the prior for the similarity-based views), we get the overall out-of-sample prediction for $V_*$:

$q(V_*) = \mathcal{N}(\mu_v, \Sigma_v)$   (44)
$\mu_v = (\sum_{m=1}^{M_1} \sum_j \langle \gamma_m \rangle \tfrac{g^m_{X^{(m)}_{*j}} + g^m_{X^{(m)}_{*j}-1}}{2} \langle W^{(m)\top}_{:j} \rangle + \sum_{m=M_1+1}^{M_1+M_2} \langle \gamma_m \rangle X^{(m)}_* \langle W^{(m)\top} \rangle + \sum_{m=M_1+M_2+1}^{M_1+M_2+M_3} \langle \gamma_m \rangle \langle U^{(m)}_* \rangle \langle W^{(m)\top} \rangle) \Sigma_v$
$\Sigma_v = (I_R + \sum_{m=1}^{M_1+M_2+M_3} \langle \gamma_m \rangle \langle W^{(m)} W^{(m)\top} \rangle)^{-1}$

Streaming extension

In the setting where data points are observed in a streaming fashion, we need to update the local variables for the newly observed data, $\{V_*, U^{(m)}_*, z_*\}$, and the global variables $\{W^{(m)}, \alpha_{mr}, \beta_c\}$. The hyperparameter $\gamma_m$ is fixed at a reasonable estimate for simplicity, though it can also be updated similarly to the batch setting.

Local variables

As derived in the section on out-of-sample prediction, treating each newly observed example $X_*$ (having some or all of the views) as an out-of-sample point, we have the variational estimates $q(U^{(m)}_*)$ for $m = M_1+M_2+1, \dots, M_1+M_2+M_3$, and $q(V_*)$, provided in (42) and (44). Further, we can naturally update $z_*$ following (35).

Global variables

Once the local variable distributions are learned, we can update the global variables $\Theta^g = \{W^{(m)}, \alpha_{mr}, \beta_c\}$ based on $q(\Theta^g_{n+1}) \propto q(\Theta^g_n) \, p(X_* | \Theta^g)$. Specifically, we can update $W^{(m)}_{:j}$ for similarity-based views as follows (updates for feature-based views have a similar form, with $\langle U^{(m)}_{*j} \rangle$ replaced as in the previous section):

$q(W^{(m)}_{:j}) = \mathcal{N}(\mu_{w,n}, \Sigma_{w,n}) \;\to\; \mathcal{N}(\mu_{w,n+1}, \Sigma_{w,n+1})$   (45)
$\Sigma_{w,n+1} = (\Sigma^{-1}_{w,n} + \langle \gamma_m \rangle \langle V_*^\top V_* \rangle)^{-1}$
$\mu_{w,n+1} = \Sigma_{w,n+1} (\Sigma^{-1}_{w,n} \mu_{w,n} + \langle \gamma_m \rangle \langle V_*^\top \rangle \langle U^{(m)}_{*j} \rangle)$

Updating $\alpha_{mr}$ is the same as in the batch setting, (23), because $q(\alpha_{mr})$ does not directly depend on the local variables.

Figure 1: Streaming extension. Batch vs. streaming inference: classification accuracy as a function of the number of examples seen, for streaming runs with initial pool sizes I = 10, 30, 50 and the batch baseline.

We can also update the classifier $\beta_c$; computational speed-up details, such as avoiding matrix inversions for each new observation, are not discussed here.

$q(\beta_c) = \mathcal{N}(\mu_{\beta,n}, \Sigma_{\beta,n}) \;\to\; \mathcal{N}(\mu_{\beta,n+1}, \Sigma_{\beta,n+1})$   (46)
$\Sigma_{\beta,n+1} = (\Sigma^{-1}_{\beta,n} + \langle V_*^\top V_* \rangle)^{-1}$
$\mu_{\beta,n+1} = \Sigma_{\beta,n+1} (\Sigma^{-1}_{\beta,n} \mu_{\beta,n} + \langle z_{*c} \rangle \langle V_*^\top \rangle)$

Experiments

We demonstrate MLFS in a streaming setting, revisiting the Digits data classification experiment for: (i) MLFS with batch inference on a training set of 500 examples, and (ii) MLFS with streaming inference, processing one example at a time (for various choices of the initial pool size I) and doing only a single pass over the data. Figure 1 shows how the average accuracy changes as the number of visited examples increases, over 10 data splits. While it is unreasonable to expect that a truly streaming algorithm (seeing each example just once) will outperform its batch counterpart, it attains reasonably competitive accuracies even when running with very small initial pool sizes.

Additional results for cognitive neuroscience data

Here, we include some additional results on the cognitive neuroscience data for two tasks: matrix completion of the ordinal responses (questionnaire data), and prediction of the fMRI responses. For the first task, Figure 2(a) shows the average mean absolute error (MAE) for different percentages of missingness over 10 runs, considering three scenarios: (i) MLFS fitted using the proposed VB algorithm, (ii) MLFS fitted using MCMC, and (iii) MLFS fitted using the proposed VB algorithm but considering all the ordinal views concatenated as a single ordinal matrix. We used an unoptimized MATLAB implementation, and our VB based inference method converged in about 10 iterations (in terms of the variational lower bound). As we can see, for MLFS with all the views, our VB result is competitive with MCMC; moreover, both outperform the baseline of concatenating all the ordinal views. For the second task, predicting fMRI responses leveraging information from other views, Figure 2(b) shows a plot of observed vs. predicted values. The points roughly follow a straight line, indicating good predictions.

Figure 2: (a) Average mean absolute error (MAE) for the ordinal responses over 10 runs as a function of the fraction of missing data, comparing MLFS (VB), MLFS (MCMC), and MLFS with concatenated questionnaires (VB). Error bars indicate the standard deviation around the mean. (b) Observed vs. predicted fMRI values (amygdala and VS) from 30% of the subjects.