
Multi-Task Classification for Incomplete Data

Chunping Wang [email protected]
Xuejun Liao [email protected]
Lawrence Carin [email protected]
Department of Electrical and Computer Engineering
Duke University
Durham, NC 27708-0291, USA

David B. Dunson [email protected]

Department of Statistical Science
Duke University
Durham, NC 27708-0291, USA

Abstract

A non-parametric hierarchical Bayesian framework is developed for designing a sophisticated classifier based on a mixture of simple (linear) classifiers. Each simple classifier is termed a local “expert”, and the number of experts and their construction are manifested via a Dirichlet process formulation. The simple form of the “experts” allows direct handling of incomplete data. The model is further extended to allow simultaneous design of classifiers on multiple data sets, termed multi-task learning, with this also performed non-parametrically via the Dirichlet process. Fast inference is performed using variational Bayesian analysis, and example results are presented for several data sets.

Keywords: Classification, Incomplete Data, Expert, Multi-task Learning, Dirichlet process

1. Introduction

In many applications one must deal with data that have been collected incompletely. For example, in censuses and surveys, some participants may not respond to certain questions (Rubin, 1987); in email spam filtering, server information may be unavailable for emails from external sources (Dick et al., 2008); in medical studies, measurements on some subjects may be partially lost at certain stages of the treatment (Ibrahim, 1990); in DNA analysis, gene-expression microarrays may be incomplete due to insufficient resolution, image corruption, or simply dust or scratches on the slide (Wang et al., 2006); in sensing applications, a subset of sensors may be absent or fail to operate at certain regions (Williams and Carin, 2005). Since most data analysis procedures (for example, regression and classification) are designed for complete data and cannot be directly applied to incomplete data, appropriately handling missing values is challenging.

Traditionally, due to the lack of an explicit framework for addressing missing data, data are often “completed” by ad hoc editing, such as case deletion and single imputation, where feature vectors with missing values are simply discarded or completed with specific values in the initial stage before the main inference (for example, mean imputation and regression imputation; see Schafer and Graham, 2002). Although analysis procedures designed for complete data become applicable after these edits, the shortcomings are clear. For case deletion, discarding information is generally inefficient, especially when data are scarce, and the remaining complete data may be statistically unrepresentative. More importantly, even if the incomplete-data problem is eliminated by ignoring data with missing features in the training phase, it is still inevitable in the testing stage, since testing data cannot be ignored simply because a portion of the features are missing. For single imputation, the main concern is that the uncertainty of the missing features is ignored by imputing fixed values.

Rubin (1976) developed a theoretical framework for incomplete-data problems, in which the widely cited terminology for missing patterns was first defined. It was proved that ignoring the missing mechanism is appropriate (Rubin, 1976) under the missing at random (MAR) assumption, meaning that the missing mechanism is conditionally independent of the missing features given the observed data. As elaborated later, under the MAR assumption (Dick et al., 2008; Ibrahim, 1990; Williams and Carin, 2005), incomplete data can generally be handled by full maximum likelihood and Bayesian approaches; however, when the missing mechanism does depend on the missing values (missing not at random, or MNAR), a problem-specific model is necessary to describe the missing mechanism, and no general approach exists. In this paper, we address missing features under the MAR assumption. Previous work in this framework may be placed into two groups, depending on whether the missing data are handled before algorithm learning or within the algorithm.

For the former, an extra step is required to estimate p(x^m|x^o), the conditional distributions of missing values given observed ones, with this step distinct from the main inference algorithm. After p(x^m|x^o) is learned, various imputation methods may be performed. As a Monte Carlo approach, Bayesian multiple imputation (MI) (Rubin, 1987) is widely used, where multiple (M > 1) samples from p(x^m|x^o) are imputed to form M “complete” data sets, with the complete-data algorithm applied to each, and the results for those imputed data sets combined to yield a final result. The MI method “completes” data sets so that algorithms designed for complete data become applicable. Furthermore, Rubin (1987) showed that MI does not require as many samples as Monte Carlo methods usually do. With a mild Gaussian mixture model (GMM) assumption for the joint distribution of observed and missing data, Williams et al. (2007) managed to analytically integrate out missing values over p(x^m|x^o), effectively performing infinite imputations. Since explicit imputations are avoided, this method is more efficient than the MI method, as suggested by empirical results (Williams et al., 2007). Other examples of these two-step methods include Williams and Carin (2005); Smola et al. (2005); Shivaswamy et al. (2006).

Since the imputation models are estimated separately from the main inference algorithm, much flexibility is left for selection of the subsequent algorithm, with few modifications needed for the complete-data approaches. However, as discussed further below, learning two subsets of unknowns separately (density functions on missing values as well as algorithm parameters) may not lead to an optimal solution. Moreover, errors made in learning the distributions of missing features may undermine the complete-data algorithm.

The other class of methods explicitly addresses missing values during the model-learning procedure. The work proposed by Chechik et al. (2008) represents a special case, in which no model is assumed for structurally absent values; the margin for the support vector machine (SVM) is re-scaled according to the observed features for each instance. Empirical results (Chechik et al., 2008) show that this procedure is comparable to several single-imputation methods when values are missing at random. Another recent work (Dick et al., 2008) handles the missing features inside the procedure of learning an SVM, without constraining the distribution of missing features to any specific class. The main concern is that this method can only handle missing features in the training data; however, in many applications one cannot control whether missing values occur in the training or testing data.

A widely employed approach for handling missing values within the algorithm involves maximum likelihood (ML) estimation via expectation-maximization (EM) (Dempster et al., 1977). The missing data are treated as random variables and integrated out in the E-step so that the likelihood is maximized with respect to model parameters in the M-step. The main difficulty is that the integral in the E-step is analytically tractable only when an assumption is made on the distribution of the missing features. For example, the intractable integral is avoided by requiring the features to be discrete (Ibrahim, 1990), or by assuming a Gaussian mixture model (GMM) for the features (Ghahramani and Jordan, 1994; Liao et al., 2007). The discreteness requirement is often too restrictive, while the GMM assumption is mild since it is well known that a GMM can approximate arbitrary continuous distributions.

The quadratically gated mixture of experts (QGME) (Liao et al., 2007) uses the GMM to form the gating network, which statistically partitions the feature space into quadratic subregions. In each subregion, one linear classifier works as a local expert. The GMM and the mixture of experts (ME) (Jacobs et al., 1991) are naturally integrated, and benefit from each other. As a mixture of experts (Jacobs et al., 1991), the QGME is capable of addressing a classification problem with a nonlinear decision boundary in terms of multiple local “experts”; the simple form of this model makes it straightforward to handle incomplete data without completing kernel functions (Graepel, 2002; Williams and Carin, 2005). The GMM assumption facilitates handling incomplete data during classifier learning and also provides a feasible gating network for simple experts.

It is of interest to note that the QGME may also be related to boosting (Freund and Schapire, 1999; Sun et al., 2006), in the sense that weak/simple classifiers are combined to yield a performance improvement. However, there are also clear differences. In boosting methods, global weak classifiers and associated input-independent weights are learned sequentially, and the final result is given by weighted “hard” votes; the QGME yields a final result by combining the statistical opinions of local weak (linear) classifiers weighted by input-dependent probabilities. Although both are capable of yielding a complex class boundary with simple components, the QGME is amenable to incomplete data directly.

However, as in many mixture-of-experts models (Jacobs et al., 1991; Waterhouse and Robinson, 1994), the number of local experts in the QGME must be specified initially, and thus a model-selection stage is in general necessary. Moreover, since the expectation-maximization method renders a point (single) solution that maximizes the likelihood, over-fitting may occur when data are scarce. In this paper, we first extend the finite QGME (Liao et al., 2007) to an infinite QGME (iQGME), with a theoretically infinite number of experts realized via a Dirichlet process (DP) (Ferguson, 1973) prior; this yields a fully Bayesian solution rather than a point estimate. In this manner model selection is avoided, and the uncertainty on the number of experts is captured in the posterior density function.


In addition to challenges with incomplete data, one must often address an insufficient quantity of labeled data. Williams et al. (2007) employed semi-supervised learning to address this challenge, using the contextual information in the unlabeled data to augment the limited labeled data, all in the presence of missing/incomplete data. Another form of context one may employ to address limited labeled data is multi-task learning (MTL) (Caruana, 1997), which allows the learning of multiple tasks simultaneously to improve generalization performance. Caruana (1997) provided an overview of MTL and demonstrated it on multiple problems. In recent research, a hierarchical statistical structure has been favored, where information is transferred via a common prior within a hierarchical Bayesian model (Yu et al., 2003; Zhang et al., 2006). Specifically, information is transferred mainly between related tasks (Xue et al., 2007), where the Dirichlet process (DP) (Ferguson, 1973) was introduced as a common prior. To the best of our knowledge, there is no previous example of addressing incomplete data in a multi-task setting, and this problem constitutes an important aspect of this paper.

The iQGME is a hierarchical Bayesian model, and is therefore amenable to extension to MTL, again through use of the DP. If we consider J learning tasks, and for each we wish to learn an iQGME classifier, the J classifiers are coupled by a joint prior distribution over the parameters of all classifiers. Each classifier provides the solution for a classification task with incomplete data. The solutions for the J tasks are obtained simultaneously under the unified framework.

The main contributions of this paper may be summarized as follows. The problem of missing data in classifier design is addressed by extending the QGME (Liao et al., 2007) to a fully Bayesian setting, with the number of local experts inferred automatically via a DP prior. The algorithm is further extended to a multi-task setting, again using a non-parametric Bayesian model, simultaneously learning J missing-data classification problems with appropriate sharing. Throughout, efficient inference is implemented via the variational Bayesian (VB) method (Beal, 2003).

The remainder of the paper is organized as follows. In Section 2 we extend the finite QGME (Liao et al., 2007) to an infinite QGME via a Dirichlet process prior, and a further extension to the multi-task learning case is made in Section 3. The incomplete-data problem is defined and discussed in Section 4. Variational Bayesian inference for hidden parameters and missing features is developed in Section 5. Experimental results for synthetic data and multiple real data sets are presented in Section 6, followed by conclusions and discussion in Section 7.

2. Infinite Quadratically Gated Mixture of Experts

In this section, we first give a brief review of the quadratically gated mixture of experts (QGME) (Liao et al., 2007) and the Dirichlet process (DP) (Ferguson, 1973), and then extend the number of experts to be infinite via the DP.

2.1 Quadratically Gated Mixture of Experts for a Single Task

Consider a binary classification problem with real-valued P-dimensional column feature vectors x_i and corresponding class labels y_i ∈ {1, −1}. We assume binary labels for simplicity, while the proposed method may be directly extended to cases with multiple classes. Latent variables t_i are introduced as “soft labels” associated with y_i, as in probit models (Albert and Chib, 1993), where y_i = 1 if t_i > 0 and y_i = −1 if t_i < 0. The finite quadratically gated mixture of experts (QGME) (Liao et al., 2007) is defined as

$$(t_i \mid z_i = h, x_i, W) \sim \mathcal{N}(t_i \mid w_h^T x_i^b, 1), \qquad (1)$$
$$(x_i \mid z_i = h, \mu, \Lambda) \sim \mathcal{N}_P(x_i \mid \mu_h, \Lambda_h^{-1}), \qquad (2)$$
$$(z_i \mid \pi) \sim \sum_{h=1}^{K} \pi_h \delta_h, \qquad (3)$$

with Σ_{h=1}^K π_h = 1, and where δ_h is a point measure concentrated at h (with probability one a draw from δ_h will be h). The (P+1) × K matrix W has columns w_h, where each w_h contains the weights of a local linear classifier, and x_i^b is the feature vector augmented with an intercept, i.e., x_i^b = [x_i^T, 1]^T. Specifically, a total of K vectors w_h are introduced to parameterize the K experts. With probability π_h the indicator for the ith data point satisfies z_i = h, which means the hth local expert is selected, and x_i is distributed according to a P-variate Gaussian distribution with mean μ_h and precision Λ_h.

It can be seen that the QGME is highly related to the mixture of experts (ME) (Jacobs et al., 1991) and the hierarchical mixture of experts (HME) (Jordan and Jacobs, 1994) if we express it as

$$p(y_i \mid x_i, W, \mu, \Lambda) = \sum_{h=1}^{K} p(z_i = h \mid x_i, \mu, \Lambda)\, p(y_i \mid z_i = h, x_i, W), \qquad (4)$$
$$p(y_i \mid z_i = h, x_i, W) = \int_{t_i y_i > 0} \mathcal{N}(t_i \mid w_h^T x_i^b, 1)\, dt_i, \qquad (5)$$
$$p(z_i = h \mid x_i, \mu, \Lambda) = \frac{\pi_h \mathcal{N}_P(x_i \mid \mu_h, \Lambda_h^{-1})}{\sum_{k=1}^{K} \pi_k \mathcal{N}_P(x_i \mid \mu_k, \Lambda_k^{-1})}. \qquad (6)$$

From (4), as a special case of the ME, the QGME is capable of handling nonlinear problems with the linear experts characterized in (5). However, unlike other ME models, the QGME probabilistically partitions the feature space through a mixture of K Gaussian distributions for x_i, as in (6). This assumption on the distribution of x_i is mild since it is well known that a Gaussian mixture model (GMM) is general enough to approximate any continuous distribution. In the QGME, x_i as well as y_i are treated as random variables (a generative model) and we consider the joint probability p(y_i, x_i) instead of the conditional probability p(y_i|x_i) for fixed x_i, as in most ME models (which are typically discriminative). In the QGME the GMM on the inputs x_i plays two important roles: (i) it acts as a gating network, while (ii) enabling analytic incorporation of incomplete data during classifier inference (as discussed further below). However, the QGME also differs from typical generative models, such as the Gaussian mixture model, naive Bayes and the hidden Markov model, since the local experts are discriminative. Previous work comparing discriminative and generative models (Ng and Jordan, 2002; Liang and Jordan, 2008) may help us analyze results qualitatively.
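To make the computation in (4)-(6) concrete, the sketch below (our illustration, not code from the paper; the function name qgme_predict and the use of point estimates of the parameters are our assumptions) evaluates the QGME predictive probability for a fully observed feature vector. The probit integral in (5) reduces to the standard normal CDF, Φ(w_h^T x^b).

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def qgme_predict(x, pi, mu, Lambda, W):
    """QGME predictive probability P(y=1|x) as in Eqs. (4)-(6), with point estimates.

    x:      (P,) complete feature vector
    pi:     (K,) mixture weights
    mu:     (K, P) Gaussian means
    Lambda: (K, P, P) Gaussian precision matrices
    W:      (K, P+1) expert weights; the last entry multiplies the intercept
    """
    xb = np.append(x, 1.0)                              # augmented feature x^b = [x^T, 1]^T
    # Gating probabilities p(z = h | x), Eq. (6)
    gate = np.array([pi[h] * multivariate_normal.pdf(x, mean=mu[h],
                                                     cov=np.linalg.inv(Lambda[h]))
                     for h in range(len(pi))])
    gate /= gate.sum()
    # Expert probabilities p(y = 1 | z = h, x), Eq. (5): a probit of the linear score
    expert = norm.cdf(W @ xb)
    # Mixture prediction, Eq. (4)
    return float(gate @ expert)
```

With K = 1 this reduces to ordinary probit regression; the GMM gating is what allows several linear experts to compose a nonlinear decision boundary.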

The QGME (Liao et al., 2007) is inferred via the expectation-maximization (EM) method, which renders a point-estimated solution for an initially specified model (1)-(3), with a fixed number K of local experts. Since learning the correct model requires a phase of model selection, and moreover in many applications there may exist no such fixed “correct” model, in the work reported here we infer the full posterior for a QGME model with the number of experts inferred from the data. This objective is achieved by imposing a nonparametric Dirichlet process (DP) prior.

2.2 Dirichlet Process

The Dirichlet process (DP) (Ferguson, 1973) is a random measure defined on measures of random variables, denoted DP(αG_0), with a real scaling parameter α ≥ 0 and a base measure G_0. Assuming that a measure is drawn G ∼ DP(αG_0), the base measure G_0 reflects the prior expectation of G, and the scaling parameter α controls how much G is allowed to deviate from G_0. In the limit α → ∞, G goes to G_0; in the limit α → 0, G reduces to a delta function at a random point in the support of G_0.

The stick-breaking construction (Sethuraman, 1994) provides an explicit form of a draw from a DP prior. Specifically, it has been proven that a draw G may be constructed as

$$G = \sum_{h=1}^{\infty} \pi_h \delta_{\theta_h^*}, \qquad (7)$$

with 0 ≤ π_h ≤ 1 and Σ_{h=1}^∞ π_h = 1, and

$$\pi_h = V_h \prod_{l=1}^{h-1} (1 - V_l), \qquad V_h \overset{iid}{\sim} \mathrm{Be}(1, \alpha), \qquad \theta_h^* \overset{iid}{\sim} G_0.$$

From (7), it is clear that G is discrete (with probability one), with an infinite set of weights π_h at atoms θ_h^*. The construction mechanism may be described as follows. Assume that a stick of unit length is to be broken into pieces. At a given step h, a fraction V_h ∼ Be(1, α) of the remaining stick is taken as the hth weight π_h and assigned to the hth atom θ_h^* ∼ G_0. Since the weights π_h decrease stochastically with h, the summation in (7) may be truncated at N terms, yielding an N-level truncated approximation to a draw from the Dirichlet process (Ishwaran and James, 2001).

Assuming that underlying variables θ_i are drawn i.i.d. from G, the associated data χ_i ∼ F(θ_i) will naturally cluster, with the θ_i taking distinct values θ_h^*, where the function F(θ) represents an arbitrary parametric model for the observed data, with hidden parameters θ. Therefore, the number of clusters is automatically determined by the data and could be “infinite” in principle. Since the θ_i take distinct values θ_h^* with probabilities π_h, this clustering is a statistical procedure instead of a hard partition, and thus we only have a belief on the number of clusters, which is affected by the scaling parameter α. For draws V_h from the distribution Be(1, α), E[V_h] = 1/(1+α), and straightforwardly E[π_h] = α^{h−1}/(1+α)^h. When α is small, E[π_h] ≈ α^{h−1}, and hence only the first few sticks may have large weights (encouraging a small number of clusters); when α is large, E[π_h] ≈ 1/(1+α), which means that all sticks have similar weights (encouraging many small clusters). Since the value of α influences the prior belief on the clustering, a Gamma hyper-prior on α is usually employed.
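A draw of the weights in (7) is easy to approximate by truncating the stick-breaking construction; the short sketch below (our illustration, with the truncation level N chosen arbitrarily) makes the role of α visible: small α concentrates mass on a few sticks, while large α spreads it over many.

```python
import numpy as np

def stick_breaking_weights(alpha, N, rng):
    """Truncated stick-breaking weights pi_1, ..., pi_N for a draw from DP(alpha * G0)."""
    V = rng.beta(1.0, alpha, size=N)
    V[-1] = 1.0                              # truncation: assign all remaining mass to stick N
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * remaining                     # pi_h = V_h * prod_{l<h} (1 - V_l)

rng = np.random.default_rng(0)
print(stick_breaking_weights(alpha=0.5, N=10, rng=rng).round(3))   # a few dominant weights
print(stick_breaking_weights(alpha=20.0, N=10, rng=rng).round(3))  # many similar small weights
```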


2.3 Infinite QGME via DP

Consider a classification task with a training data set D = {(x_i, y_i) : i = 1, . . . , n}, where x_i ∈ R^P and y_i ∈ {−1, 1}. With soft labels t_i introduced as in Section 2.1, the infinite QGME (iQGME) model is achieved via a DP prior imposed on the measure G of (μ_i, Λ_i, w_i), the hidden variables characterizing the density function of each data point (x_i, t_i). For simplicity, we use the same symbols to denote the parameters associated with each data point and the distinct values, with subscripts i and h indexing data points and unique values, respectively:

$$(x_i, t_i) \sim \mathcal{N}_P(x_i \mid \mu_i, \Lambda_i^{-1})\, \mathcal{N}(t_i \mid w_i^T x_i^b, 1),$$
$$(\mu_i, \Lambda_i, w_i) \overset{iid}{\sim} G,$$
$$G \sim \mathrm{DP}(\alpha G_0), \qquad (8)$$

where the base measure G_0 is factorized as the product of a normal-Wishart prior for (μ_h, Λ_h) and a normal prior for w_h, for the sake of conjugacy. As discussed in Section 2.2, data samples cluster automatically, and the same mean μ_h, precision matrix Λ_h and regression coefficients (expert) w_h are shared within a given cluster h. Specifically, the hierarchical iQGME model is as follows, for i = 1, . . . , n and h = 1, . . . , ∞:

$$(t_i \mid z_i = h, x_i, W) \sim \mathcal{N}(t_i \mid w_h^T x_i^b, 1),$$
$$(x_i \mid z_i = h, \mu, \Lambda) \sim \mathcal{N}_P(x_i \mid \mu_h, \Lambda_h^{-1}),$$
$$(z_i \mid V) \sim \sum_{h=1}^{\infty} \pi_h \delta_h, \quad \text{where } \pi_h = V_h \prod_{l<h} (1 - V_l),$$
$$(V_h \mid \alpha) \sim \mathrm{Be}(V_h \mid 1, \alpha),$$
$$(\mu_h, \Lambda_h \mid m_0, u_0, B_0, \nu_0) \sim \mathcal{N}_P(\mu_h \mid m_0, u_0^{-1} \Lambda_h^{-1})\, \mathcal{W}(\Lambda_h \mid B_0, \nu_0),$$
$$(w_h \mid \zeta, \lambda) \sim \mathcal{N}_{P+1}(w_h \mid \zeta, [\mathrm{diag}(\lambda)]^{-1}), \quad \text{where } \lambda = [\lambda_1, \ldots, \lambda_{P+1}].$$

Furthermore, to achieve a more robust algorithm, we assign diffuse (non-informative) hyper-priors on several crucial parameters. As discussed in Section 2.2, the scaling parameter α reflects our prior belief on the number of clusters. For the sake of conjugacy, a diffuse Gamma prior is usually assumed for α, as suggested by West et al. (1994). In addition, the parameters ζ and λ characterizing the prior of the distinct local classifiers w_h are another set of important parameters, since we focus on classification tasks. Normal-Gamma priors are the conjugate priors for the mean and precision of a normal density. Therefore,

$$(\alpha \mid \tau_{10}, \tau_{20}) \sim \mathrm{Ga}(\alpha \mid \tau_{10}, \tau_{20}),$$
$$(\zeta \mid \gamma_0, \lambda) \sim \mathcal{N}_{P+1}(\zeta \mid 0, \gamma_0^{-1} [\mathrm{diag}(\lambda)]^{-1}),$$
$$(\lambda_p \mid a_0, b_0) \sim \mathrm{Ga}(\lambda_p \mid a_0, b_0), \qquad p = 1, \ldots, P+1,$$

where τ_{10}, τ_{20}, a_0, b_0 are usually set to be much less than one and of about the same magnitude, so that the resulting Gamma distributions, with means about one and large variances, are diffuse; γ_0 is usually set to be around one.
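As a sanity check on the hierarchical specification above, one may sample from a truncated version of the iQGME prior. The sketch below is our own illustration, not the authors' code; the truncation level N, the identity choice for B_0, and the remaining defaults (which simply mirror the diffuse settings just discussed) are assumptions, and P ≥ 2 is assumed.

```python
import numpy as np
from scipy.stats import wishart

def sample_iqgme_prior(n, P, N=20, alpha=1.0, u0=0.1, a0=0.01, b0=0.01, gamma0=0.1,
                       rng=np.random.default_rng(1)):
    """Draw a synthetic data set (x, y) from a truncated iQGME prior (illustrative only)."""
    m0, nu0, B0 = np.zeros(P), P + 2, np.eye(P)          # assumed base-measure parameters
    # Truncated stick-breaking weights over N experts
    V = rng.beta(1.0, alpha, size=N); V[-1] = 1.0
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    # Shared prior on the classifiers: normal-Gamma draw for (zeta, lambda)
    lam = rng.gamma(a0, 1.0 / b0, size=P + 1)
    zeta = rng.normal(0.0, 1.0 / np.sqrt(gamma0 * lam))
    # Per-expert normal-Wishart draws and classifier weights
    Lam = [wishart.rvs(df=nu0, scale=B0) for _ in range(N)]
    mu = [rng.multivariate_normal(m0, np.linalg.inv(u0 * Lam[h])) for h in range(N)]
    W = [rng.normal(zeta, 1.0 / np.sqrt(lam)) for _ in range(N)]
    # Generate data: pick an expert, draw x from its Gaussian, draw the soft label t
    z = rng.choice(N, size=n, p=pi)
    x = np.array([rng.multivariate_normal(mu[h], np.linalg.inv(Lam[h])) for h in z])
    t = np.array([rng.normal(W[h] @ np.append(xi, 1.0), 1.0) for h, xi in zip(z, x)])
    return x, np.sign(t)
```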

The graphical representation of the iQGME for single-task learning is shown in Figure 1. We notice that a possible variant with sparse local classifiers could be obtained if we impose a zero mean for the local classifiers w_h, i.e., ζ = 0, and retain the Gamma hyper-prior for the precision λ, as in the relevance vector machine (RVM) (Tipping, 2000), which employs a corresponding Student-t sparseness prior on the weights. Although this variant is useful for seeking relevant features in some applications, its general performance is slightly worse than that of the model shown in Figure 1 for the data considered below. To focus on the main topic of this paper, multi-task learning for incomplete data, we do not discuss this variant further.

[Figure 1: Graphical representation of the iQGME for single-task learning. All circles denote random variables, with shaded ones indicating observable data and bright ones representing hidden variables. Diamonds denote fixed hyper-parameters, boxes represent independent replicates with numbers at the lower right corner indicating the numbers of i.i.d. copies, and arrows indicate the dependence between variables (pointing from parents to children).]

3. Multi-Task Learning with iQGME

Assume we have J data sets, with the jth represented as D_j = {(x_{ji}, y_{ji}) : i = 1, . . . , n_j}; our goal is to design a classifier for each data set, and the design of each classifier is termed a “task”. One may learn separate classifiers for each of the J data sets (single-task learning) by ignoring connections between the data sets, or a single classifier may be learned based on the union of all data (pooling) by ignoring differences between the data sets. More appropriately, in a hierarchical Bayesian framework, J task-dependent classifiers may be learned jointly, with information borrowed via a higher-level prior (multi-task learning). In some previous research all tasks are assumed to be equally related to each other (Yu et al., 2003; Zhang et al., 2006), or related tasks share exactly the same task-dependent classifier (Xue et al., 2007). However, with multiple local experts, the proposed iQGME model for a single task is fairly flexible, which enables us to borrow information across the J tasks in several ways (two data sets may share parts of their respective classifiers, without requiring sharing of all classifier components). In this section we develop two multi-task learning models, which encourage different sharing structures.


3.1 Encouraging Local Sharing Across Tasks via the Hierarchical Dirichlet Process

As discussed in Section 2.2, a DP prior encourages clustering within a task (each cluster corresponds to a local expert). Now considering multiple tasks, a hierarchical Dirichlet process (HDP) (Teh et al., 2006) may be employed to share clusters (local experts) across the multiple tasks. Assume a random measure G_j is associated with each task j, where each G_j is an independent draw from a Dirichlet process DP(αG_0), with the base measure G_0 itself drawn from a Dirichlet process DP(βH), i.e.,

$$G_j \sim \mathrm{DP}(\alpha G_0), \qquad \text{for } j = 1, \ldots, J,$$
$$G_0 \sim \mathrm{DP}(\beta H).$$

As a draw from a Dirichlet process, G_0 is discrete with probability one and has a stick-breaking representation as in (7). With such a base measure, the task-dependent DPs reuse the atoms θ_h^* defined in G_0, yielding the desired sharing of atoms among tasks.

With the task-dependent iQGME defined in (8), we consider all J tasks jointly:

$$(x_{ji}, t_{ji}) \sim \mathcal{N}_P(x_{ji} \mid \mu_{ji}, \Lambda_{ji}^{-1})\, \mathcal{N}(t_{ji} \mid w_{ji}^T x_{ji}^b, 1),$$
$$(\mu_{ji}, \Lambda_{ji}, w_{ji}) \overset{iid}{\sim} G_j,$$
$$G_j \sim \mathrm{DP}(\alpha G_0),$$
$$G_0 \sim \mathrm{DP}(\beta H).$$

[Figure 2: Graphical representation of the iQGME for multi-task learning: (a) via the hierarchical Dirichlet process (HDP); (b) via the nested Dirichlet process (nDP). Refer to Figure 1 for additional information.]

In this form of borrowing information, experts with associated means and precision matrices are shared across tasks as distinct atoms. Since the means and precision matrices statistically define local regions in feature space, sharing is encouraged locally. We explicitly write the stick-breaking representations for G_j and G_0, with z_{ji} and c_{jh} introduced as the indicators for each data point and for each distinct atom of G_j, respectively. By factorizing the base measure H as a product of a normal-Wishart prior for (μ_s, Λ_s) and a normal prior for w_s, the hierarchical model of the multi-task iQGME via the HDP is represented as

$$(t_{ji} \mid c_{jh} = s, z_{ji} = h, x_{ji}, W) \sim \mathcal{N}(t_{ji} \mid w_s^T x_{ji}^b, 1),$$
$$(x_{ji} \mid c_{jh} = s, z_{ji} = h, \mu, \Lambda) \sim \mathcal{N}_P(x_{ji} \mid \mu_s, \Lambda_s^{-1}),$$
$$(z_{ji} \mid V) \sim \sum_{h=1}^{\infty} \pi_{jh} \delta_h, \quad \text{where } \pi_{jh} = V_{jh} \prod_{l<h} (1 - V_{jl}),$$
$$(V_{jh} \mid \alpha) \sim \mathrm{Be}(V_{jh} \mid 1, \alpha), \qquad j = 1, \ldots, J,$$
$$(\alpha \mid \tau_{10}, \tau_{20}) \sim \mathrm{Ga}(\alpha \mid \tau_{10}, \tau_{20}),$$
$$(c_{jh} \mid U) \sim \sum_{s=1}^{\infty} \eta_s \delta_s, \quad \text{where } \eta_s = U_s \prod_{l<s} (1 - U_l),$$
$$(U_s \mid \beta) \sim \mathrm{Be}(U_s \mid 1, \beta),$$
$$(\beta \mid \tau_{30}, \tau_{40}) \sim \mathrm{Ga}(\beta \mid \tau_{30}, \tau_{40}),$$
$$(\mu_s, \Lambda_s \mid m_0, u_0, B_0, \nu_0) \sim \mathcal{N}_P(\mu_s \mid m_0, u_0^{-1} \Lambda_s^{-1})\, \mathcal{W}(\Lambda_s \mid B_0, \nu_0),$$
$$(w_s \mid \lambda, \zeta) \sim \mathcal{N}_{P+1}(w_s \mid \zeta, [\mathrm{diag}(\lambda)]^{-1}),$$
$$(\zeta \mid \gamma_0, \lambda) \sim \mathcal{N}_{P+1}(0, \gamma_0^{-1} [\mathrm{diag}(\lambda)]^{-1}),$$
$$(\lambda_p \mid a_0, b_0) \sim \mathrm{Ga}(\lambda_p \mid a_0, b_0), \qquad p = 1, \ldots, P+1,$$

where j = 1, . . . , J and i = 1, . . . , n_j index tasks and the data points within each task, respectively; h = 1, . . . , ∞ and s = 1, . . . , ∞ index atoms of the task-dependent G_j and of the globally shared base G_0, respectively. The setting of the hyper-priors is similar to the single-task case. The graphical representation of the iQGME for multi-task learning via the HDP is shown in Figure 2(a).
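To see how this construction lets different tasks reuse the same pool of experts, the sketch below (our illustration; the truncation levels N and S and the concentration values are assumptions) draws only the indicator structure: global atom weights η, task-level stick weights π_j, and the atom assignments c_{jh}.

```python
import numpy as np

def truncated_sticks(concentration, length, rng):
    V = rng.beta(1.0, concentration, size=length)
    V[-1] = 1.0
    return V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))

def hdp_assignments(J, N=20, S=20, alpha=1.0, beta=1.0, seed=2):
    """HDP-iQGME indicator structure: c[j, h] is the global atom used by component h of task j."""
    rng = np.random.default_rng(seed)
    eta = truncated_sticks(beta, S, rng)                               # weights over shared atoms
    pi = np.array([truncated_sticks(alpha, N, rng) for _ in range(J)])  # task-level stick weights
    c = rng.choice(S, size=(J, N), p=eta)        # each task-level component points to a global atom
    return pi, c

pi, c = hdp_assignments(J=3)
print(np.intersect1d(c[0], c[1]))    # atoms (experts) shared by tasks 0 and 1
```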

3.2 Encouraging Global Sharing Across Tasks via the Nested Dirichlet Process

Consider J tasks with underlying parameters θ_{ji} (corresponding to local experts) drawn from a task-dependent measure G_j for each task. The collection {G_j}_{j=1}^J is said to follow a nested Dirichlet process (nDP) (Rodríguez et al., 2008) if

$$G_j \sim \sum_{s=1}^{\infty} \eta_s \delta_{G_s^*}, \qquad \text{for } j = 1, \ldots, J, \qquad (9)$$
$$G_s^* \sim \mathrm{DP}(\alpha H), \qquad (10)$$

with η_s = U_s Π_{l<s} (1 − U_l), where U_s ∼ Be(1, β) i.i.d. Expression (9) implies that the measures G_j are all drawn from a stick-breaking process, with the atoms G_s^* themselves drawn from DP(αH) as in (10). Therefore, the nDP model simultaneously enables clustering at the task level as well as within each task.
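In contrast with the HDP, the nDP in (9)-(10) clusters whole tasks: each task first selects one candidate mixture G_s^*, and only then clusters its own data within that mixture. A minimal sketch of this two-level assignment (ours; truncation levels and concentrations are again assumptions):

```python
import numpy as np

def ndp_assignments(J, n_per_task, N=20, S=20, alpha=1.0, beta=1.0, seed=3):
    """nDP indicator structure: c[j] selects a candidate mixture G*_s for task j,
    and z[j][i] then selects an expert within that mixture."""
    rng = np.random.default_rng(seed)

    def sticks(conc, length):
        V = rng.beta(1.0, conc, size=length); V[-1] = 1.0
        return V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))

    eta = sticks(beta, S)                                  # weights over the candidate mixtures
    pi = np.array([sticks(alpha, N) for _ in range(S)])    # expert weights within each mixture
    c = rng.choice(S, size=J, p=eta)                       # one whole mixture per task (global sharing)
    z = [rng.choice(N, size=n_per_task, p=pi[c[j]]) for j in range(J)]
    return c, z

c, z = ndp_assignments(J=4, n_per_task=10)
print(c)    # tasks with the same entry share an entire mixture of experts
```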


With the task-dependent iQGME defined in (8), we consider all J tasks jointly:

$$(x_{ji}, t_{ji}) \sim \mathcal{N}_P(x_{ji} \mid \mu_{ji}, \Lambda_{ji}^{-1})\, \mathcal{N}(t_{ji} \mid w_{ji}^T x_{ji}^b, 1),$$
$$(\mu_{ji}, \Lambda_{ji}, w_{ji}) \overset{iid}{\sim} G_j,$$
$$G_j \sim \sum_{s=1}^{\infty} \eta_s \delta_{G_s^*},$$
$$G_s^* \sim \mathrm{DP}(\alpha H).$$

In this form of borrowing information, a theoretically infinite number of mixtures of experts, with associated means and precision matrices, are shared among tasks, and each of the mixtures contains infinitely many experts. Since any mixture of experts is shared as a whole, this kind of sharing is encouraged globally in feature space. We explicitly write the stick-breaking representations for the atoms G_s^* of the stick-breaking process, with z_{ji} and c_j introduced as indicators for each data point and for each task, respectively. The graphical representation of the iQGME for multi-task learning via the nDP is shown in Figure 2(b). By factorizing the base measure H as a product of a normal-Wishart prior for (μ_{sh}, Λ_{sh}) and a normal prior for w_{sh}, the hierarchical model of the multi-task iQGME via the nDP is constructed as

$$(t_{ji} \mid c_j = s, z_{ji} = h, x_{ji}, W) \sim \mathcal{N}(t_{ji} \mid w_{sh}^T x_{ji}^b, 1),$$
$$(x_{ji} \mid c_j = s, z_{ji} = h, \mu, \Lambda) \sim \mathcal{N}_P(x_{ji} \mid \mu_{sh}, \Lambda_{sh}^{-1}),$$
$$(z_{ji} \mid V, c_j = s) \sim \sum_{h=1}^{\infty} \pi_{sh} \delta_h, \quad \text{where } \pi_{sh} = V_{sh} \prod_{l<h} (1 - V_{sl}),$$
$$(V_{sh} \mid \alpha) \sim \mathrm{Be}(V_{sh} \mid 1, \alpha),$$
$$(\alpha \mid \tau_{10}, \tau_{20}) \sim \mathrm{Ga}(\alpha \mid \tau_{10}, \tau_{20}),$$
$$(c_j \mid U) \sim \sum_{s=1}^{\infty} \eta_s \delta_s, \quad \text{where } \eta_s = U_s \prod_{l<s} (1 - U_l),$$
$$(U_s \mid \beta) \sim \mathrm{Be}(U_s \mid 1, \beta),$$
$$(\beta \mid \tau_{30}, \tau_{40}) \sim \mathrm{Ga}(\beta \mid \tau_{30}, \tau_{40}),$$
$$(\mu_{sh}, \Lambda_{sh} \mid m_0, u_0, B_0, \nu_0) \sim \mathcal{N}_P(\mu_{sh} \mid m_0, u_0^{-1} \Lambda_{sh}^{-1})\, \mathcal{W}(\Lambda_{sh} \mid B_0, \nu_0),$$
$$(w_{sh} \mid \lambda, \zeta) \sim \mathcal{N}_{P+1}(w_{sh} \mid \zeta, [\mathrm{diag}(\lambda)]^{-1}),$$
$$(\zeta \mid \gamma_0, \lambda) \sim \mathcal{N}_{P+1}(0, \gamma_0^{-1} [\mathrm{diag}(\lambda)]^{-1}),$$
$$(\lambda_p \mid a_0, b_0) \sim \mathrm{Ga}(\lambda_p \mid a_0, b_0), \qquad p = 1, \ldots, P+1.$$

By comparing Figure 2(a) and Figure 2(b), we observe that in the HDP structure there is only one set of components, shared by all tasks, and two indicators jointly determine the component index for each data point. In the nDP structure, by contrast, there are infinitely many sets of components; the indicator c_j gives the set index for task j, and the indicator z_{ji} further defines the component index for the ith data point in task j. More sharing tends to occur in the HDP structure; however, more sharing is not always preferable. Since usually neither tasks nor local experts are precisely the same, sharing always introduces a compromise. In the iQGME models, as the parameters characterizing the distributions of the features x_{ji} are shared together with the experts, sharing may occur when both the class boundary and the data distribution are similar, or when only one of them is similar. If sharing occurs when only the data distribution is similar, the learning of experts may be misled. Given this analysis, we suspect that the nDP structure, which imposes more constraints on sharing, may perform better than the HDP structure (this is examined when presenting results).

4. Incomplete Data Problem

In the above discussion it was assumed that all components of the feature vectors were available (no missing data). In this section, we consider the situation for which the feature vectors x_i are partially observed. We partition each feature vector x_i into observed and missing parts, x_i = [x_i^{o_i}; x_i^{m_i}], where x_i^{o_i} = {x_{ip} : p ∈ o_i} denotes the subvector of observed features and x_i^{m_i} = {x_{ip} : p ∈ m_i} represents the subvector of missing features, with o_i and m_i denoting the sets of indices of observed and missing features, respectively. Each x_i has its own observed set o_i and missing set m_i, which may be different for each i. Following a generic notation (Schafer and Graham, 2002), we refer to R as the missingness. For an arbitrary missing pattern, R may be defined as a missing-data indicator matrix, that is,

$$R_{ip} = \begin{cases} 1, & x_{ip} \text{ observed}, \\ 0, & x_{ip} \text{ missing}. \end{cases}$$
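As a concrete instance of this notation, the snippet below (ours) builds the indicator matrix R and the index sets o_i and m_i from a data matrix whose missing entries are stored as NaN.

```python
import numpy as np

X = np.array([[0.3, np.nan, 1.2],
              [np.nan, 0.7, np.nan],
              [0.1, 0.4, 0.9]])                          # n x P data matrix, missing entries as NaN

R = (~np.isnan(X)).astype(int)                           # R[i, p] = 1 if x_ip is observed, 0 if missing
o = [np.flatnonzero(R[i]) for i in range(len(X))]        # observed index sets o_i
m = [np.flatnonzero(1 - R[i]) for i in range(len(X))]    # missing index sets m_i
print(R)
print(o[1], m[1])    # for the second instance: observed {1}, missing {0, 2}
```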

We use ξ to denote the parameters characterizing the distribution of R, which is usually called the missing mechanism. In the classification context, the joint distribution of class labels, observed features and the missingness R may be given by integrating out the missing features x^m,

$$p(y, x^o, R \mid \Theta, \xi) = \int p(y, x \mid \Theta)\, p(R \mid x, \xi)\, dx^m. \qquad (11)$$

To handle such a problem, assumptions must be made on the distribution of R. If the missing mechanism is conditionally independent of the missing values x^m given the observed data, i.e., p(R|x, ξ) = p(R|x^o, ξ), the missing data are defined to be missing at random (MAR) (Rubin, 1976). Consequently, (11) reduces to

$$p(y, x^o, R \mid \Theta, \xi) = p(R \mid x^o, \xi) \int p(y, x \mid \Theta)\, dx^m = p(R \mid x^o, \xi)\, p(y, x^o \mid \Theta). \qquad (12)$$

According to (12), the likelihood is factorizable under the MAR assumption. As long as the prior factorizes as p(Θ, ξ) = p(Θ)p(ξ), the posterior

$$p(\Theta, \xi \mid y, x^o, R) \propto p(y, x^o, R \mid \Theta, \xi)\, p(\Theta, \xi) = p(R \mid x^o, \xi)\, p(\xi)\, p(y, x^o \mid \Theta)\, p(\Theta)$$

is also factorizable. For the purpose of inferring the model parameters Θ, no explicit specification of the distribution of the missingness is necessary. As an important special case of MAR, missing completely at random (MCAR) occurs if we can further assume that p(R|x, ξ) = p(R|ξ), which means the distribution of the missingness is independent of the observed values x^o as well. When the missing mechanism depends on the missing values x^m, the data are termed missing not at random (MNAR). From (11), an explicit form then has to be assumed for the distribution of the missingness, and both the accuracy and the computational efficiency of that model should be considered.

When missingness is not totally controlled, as in most realistic applications, we cannot tell from the data alone whether the MCAR or MAR assumption is valid. Since the MCAR or MAR assumption is unlikely to be precisely satisfied in practice, inference based on these assumptions may lead to a bias. However, as demonstrated in many cases, it is believed that for realistic problems departures from MAR are usually not large enough to significantly impact the analysis (Collins et al., 2001). On the other hand, without the MAR assumption, one must explicitly specify a model for the missingness R, which is a difficult task in most cases. As a result, the data are typically assumed to be either MCAR or MAR in the literature, unless significant correlations between the missing values and the distribution of the missingness are suspected.

In this work we follow the MAR assumption, and thus expression (12) applies. In the iQGME framework, (12) may be rewritten as

$$p(y, x^o \mid \Theta) = \int p(y, x \mid \Theta)\, dx^m = \int_{ty>0} \int p(t \mid x, \Theta_2)\, p(x \mid \Theta_1)\, dx^m\, dt. \qquad (13)$$

The solution to such a problem with incomplete data x^m is analytical, since the distributions of t and x are assumed to be a Gaussian and a Gaussian mixture model, respectively. Naturally, the missing features may be regarded as hidden variables to be inferred, and the graphical representation of the iQGME with incomplete data remains the same as in Figures 1 and 2, except that the nodes representing the features are now only partially observed. As elaborated below, the important but mild assumption that the features are distributed as a GMM enables us to analytically infer the variational distributions associated with the missing values within the variational Bayesian inference procedure.

As in many models (Williams et al., 2007), estimating the distribution of the missing values first and learning the classifier in a second step gives the flexibility of selecting the classifier for the second step. However, (13) suggests that the classifier and the data distribution are coupled, provided that partial data are missing and thus have to be integrated out. Therefore, a joint estimation of missing features and classifiers (searching in the space of (Θ_1, Θ_2)) is more desirable than a two-step process (searching in the space of Θ_1 for the distribution of the data, and then in the space of Θ_2 for the classifier).
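The computation that makes the GMM assumption so convenient is the standard conditioning of a Gaussian: within a component, the missing coordinates given the observed ones are again Gaussian. A sketch of that step is below (our helper, not the paper's code; in the VB procedure of Section 5 the analogous quantities are computed per component and per instance).

```python
import numpy as np

def conditional_gaussian(mu, Sigma, x_obs, obs_idx, mis_idx):
    """Mean and covariance of x^m | x^o under a single Gaussian N(mu, Sigma)."""
    mu_o, mu_m = mu[obs_idx], mu[mis_idx]
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]
    S_mm = Sigma[np.ix_(mis_idx, mis_idx)]
    gain = S_mo @ np.linalg.inv(S_oo)                 # regression of missing on observed
    cond_mean = mu_m + gain @ (x_obs - mu_o)
    cond_cov = S_mm - gain @ S_mo.T
    return cond_mean, cond_cov

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.5, 0.2], [0.5, 2.0, 0.3], [0.2, 0.3, 1.5]])
print(conditional_gaussian(mu, Sigma, x_obs=np.array([0.7]), obs_idx=[0], mis_idx=[1, 2]))
```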

5. Variational Bayesian Inference

For simplicity we denote the collection of hidden variables and model parameters as Θ and the specified hyper-parameters as Ψ. In a Bayesian framework we are interested in p(Θ|D, Ψ), the joint posterior distribution of the unknowns given observed data and hyper-parameters. From Bayes’ rule,

$$p(\Theta \mid D, \Psi) = \frac{p(D \mid \Theta)\, p(\Theta \mid \Psi)}{p(D \mid \Psi)}, \qquad (14)$$

where p(D|Ψ) = ∫ p(D|Θ) p(Θ|Ψ) dΘ is the marginal likelihood, which often involves multi-dimensional integrals. Since these integrals are non-analytical in most cases, the computation of the marginal likelihood is the principal challenge in Bayesian inference. These integrals are circumvented if only a point estimate of Θ is pursued, as in the expectation-maximization algorithm (Dempster et al., 1977). Markov chain Monte Carlo (MCMC) sampling methods (Neal, 1993) provide one class of approximations for the full posterior, based on samples from a Markov chain whose stationary distribution is the posterior of interest. Most previous research on applications with a Dirichlet process prior has been implemented with Gibbs sampling, the simplest MCMC method (Ishwaran and James, 2001; West et al., 1994). The main concerns with MCMC methods are that they are typically slow and that convergence is often difficult to diagnose.

As an efficient alternative, the variational Bayesian (VB) method (Beal, 2003) approximates the true posterior p(Θ|D, Ψ) with a variational distribution q(Θ) with free variational parameters. The problem of computing the posterior is reformulated as an optimization problem of minimizing the Kullback-Leibler (KL) divergence between q(Θ) and p(Θ|D, Ψ), which is equivalent to maximizing a lower bound of log p(D|Ψ), the log marginal likelihood. This optimization problem can be solved analytically under two assumptions on q(Θ): (i) q(Θ) is factorized; (ii) the factorized components of q(Θ) are of the same form as the corresponding priors. Since it is desirable to maintain in the variational distribution q(Θ) the dependencies among random variables shown in the graphical models (Figures 1 and 2), one typically breaks only those dependencies that make the computation difficult. In the subsequent inference for the iQGME-based models, we retain some dependencies unbroken. In recent publications (Blei and Jordan, 2006; Kurihara et al., 2007), discussions of the implementation of variational Bayesian inference are given for Dirichlet process mixtures.

In this work we implement variational Bayesian inference for all of the iQGME-based classifiers discussed above. Following Blei and Jordan (2006), we employ stick-breaking representations with a finite number of components as variational distributions to approximate the infinite-dimensional random measures G. The truncation level of the intra-task DP is set at N, and the truncation level of the inter-task DP is set at S.

We detail the variational Bayesian inference for the case of incomplete data. The inference for the complete-data case is similar, except that all feature vectors are fully observed and thus the step of learning missing values is skipped. To avoid repetition, a thorough procedure for the complete-data case is not included; differences from the incomplete-data case are indicated.

5.1 Single-Task Learning

For the single-task iQGME the unknowns are Θ = {t, x^m, z, V, α, μ, Λ, W, ζ, λ}, with hyper-parameters Ψ = {m_0, u_0, B_0, ν_0, τ_{10}, τ_{20}, γ_0, a_0, b_0}. We specify the factorized variational distributions as

$$q(t, x^m, z, V, \alpha, \mu, \Lambda, W, \zeta, \lambda) = \prod_{i=1}^{n} \left[ q_{t_i}(t_i)\, q_{x_i^{m_i}, z_i}(x_i^{m_i}, z_i) \right] \prod_{h=1}^{N-1} q_{V_h}(V_h) \prod_{h=1}^{N} \left[ q_{\mu_h, \Lambda_h}(\mu_h, \Lambda_h)\, q_{w_h}(w_h) \right] \prod_{p=1}^{P+1} q_{\zeta_p, \lambda_p}(\zeta_p, \lambda_p)\, q_{\alpha}(\alpha),$$

where


• q_{t_i}(t_i) is a truncated normal distribution,

$$t_i \sim \mathcal{TN}(\mu_i^t, 1, y_i t_i > 0), \qquad i = 1, \ldots, n,$$

which means the density function of t_i is assumed to be normal with mean μ_i^t and unit variance, restricted to those t_i satisfying y_i t_i > 0.

• q_{x_i^{m_i}, z_i}(x_i^{m_i}, z_i) = q_{x_i^{m_i}}(x_i^{m_i} | z_i) q_{z_i}(z_i), where q_{z_i}(z_i) is a multinomial distribution with probabilities ρ_i over N possible outcomes,

$$z_i \sim \mathcal{MN}(1, \rho_{i1}, \ldots, \rho_{iN}), \qquad i = 1, \ldots, n.$$

Given the associated indicator z_i, since the features are assumed to be distributed as a multivariate Gaussian, the distributions of the missing values x_i^{m_i} are still Gaussian, according to the conditioning properties of multivariate Gaussian distributions:

$$(x_i^{m_i} \mid z_i = h) \sim \mathcal{N}_{|m_i|}(m_h^{m_i|o_i}, \Sigma_h^{m_i|o_i}), \qquad i = 1, \ldots, n, \quad h = 1, \ldots, N.$$

We retain the dependency between x_i^{m_i} and z_i in the variational distribution since the inference is still tractable; for complete data, the variational distribution for (x_i^{m_i} | z_i = h) is not necessary.

• q_{V_h}(V_h) is a Beta distribution,

$$V_h \sim \mathrm{Be}(v_{h1}, v_{h2}), \qquad h = 1, \ldots, N-1.$$

Recall that we have a truncation level of N, which implies that the mixture proportions π_h(V) are equal to zero for h > N. Therefore, q_{V_h}(V_h) = δ_1 for h = N, and q_{V_h}(V_h) = δ_0 for h > N. For h < N, V_h has a variational Beta posterior.

• q_{μ_h, Λ_h}(μ_h, Λ_h) is a normal-Wishart distribution,

$$(\mu_h, \Lambda_h) \sim \mathcal{N}_P(m_h, u_h^{-1} \Lambda_h^{-1})\, \mathcal{W}(B_h, \nu_h), \qquad h = 1, \ldots, N.$$

• q_{w_h}(w_h) is a normal distribution,

$$w_h \sim \mathcal{N}_{P+1}(\mu_h^w, \Sigma_h^w), \qquad h = 1, \ldots, N.$$

• q_{ζ_p, λ_p}(ζ_p, λ_p) is a normal-Gamma distribution,

$$(\zeta_p, \lambda_p) \sim \mathcal{N}(\phi_p, \gamma^{-1} \lambda_p^{-1})\, \mathrm{Ga}(a_p, b_p), \qquad p = 1, \ldots, P+1.$$

• q_α(α) is a Gamma distribution,

$$\alpha \sim \mathrm{Ga}(\tau_1, \tau_2).$$


Given the specifications of the variational distributions, a mean-field variational algorithm (Beal, 2003) is developed for the iQGME model. All update equations and the derivations for q(x_i^{m_i}, z_i) are included in the Appendix; similar derivations for the other random variables are found elsewhere (Xue et al., 2007; Williams et al., 2007). Each variational parameter is re-estimated iteratively, conditioned on the current estimates of the others, until the lower bound of the log marginal likelihood converges. Although the algorithm yields a bound for any initialization of the variational parameters, different initializations may lead to different bounds. To alleviate this local-maxima problem, we perform multiple independent runs with random initializations and choose the run that produces the best bound on the marginal likelihood.
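The overall procedure is a standard mean-field coordinate ascent with random restarts. The skeleton below is our paraphrase of that loop; init_fn, the entries of update_fns, and elbo_fn are placeholders for the random initialization, the update equations of the Appendix, and the lower-bound computation, none of which are reproduced here.

```python
import numpy as np

def fit_with_restarts(init_fn, update_fns, elbo_fn, data,
                      n_restarts=5, max_iter=200, tol=1e-4, seed=0):
    """Generic mean-field coordinate ascent with random restarts.

    init_fn(data, rng)      -> initial variational parameters q
    update_fns: iterable of  (q, data) -> q, one per variational factor
    elbo_fn(q, data)        -> lower bound on the log marginal likelihood
    """
    best_bound, best_q = -np.inf, None
    for r in range(n_restarts):
        rng = np.random.default_rng(seed + r)
        q = init_fn(data, rng)
        bound = -np.inf
        for _ in range(max_iter):
            for update in update_fns:            # re-estimate each factor given the others
                q = update(q, data)
            new_bound = elbo_fn(q, data)
            if new_bound - bound < tol:          # stop once the lower bound has converged
                bound = new_bound
                break
            bound = new_bound
        if bound > best_bound:                   # keep the restart with the best bound
            best_bound, best_q = bound, q
    return best_q
```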

For simplicity, we omit the subscripts on the variational distributions and henceforth use q to denote any variational distribution. In the following derivations and update equations, we use the generic notation ⟨f⟩_{q(·)} to denote E_{q(·)}[f], the expectation of a function f with respect to the variational distribution q(·). The subscript q(·) is dropped when it shares the same arguments as f.

For a new observed feature vector x_*^{o_*}, the prediction of the associated class label y_* is given by integrating out the missing values:

$$P(y_* = 1 \mid x_*^{o_*}, D) = \frac{p(y_* = 1, x_*^{o_*} \mid D)}{p(x_*^{o_*} \mid D)} = \frac{\int p(y_* = 1, x_* \mid D)\, dx_*^{m_*}}{\int p(x_* \mid D)\, dx_*^{m_*}} = \frac{\int \sum_{h=1}^{N} P(z_* = h \mid D)\, p(x_* \mid z_* = h, D)\, P(y_* = 1 \mid x_*, z_* = h, D)\, dx_*^{m_*}}{\int \sum_{k=1}^{N} P(z_* = k \mid D)\, p(x_* \mid z_* = k, D)\, dx_*^{m_*}}.$$

We marginalize the hidden variables over their variational distributions to compute the predictive probability of the class label,

$$P(y_* = 1 \mid x_*^{o_*}, D) = \frac{\sum_{h=1}^{N} E_V[\pi_h] \int_0^{\infty} \int E_{\mu_h, \Lambda_h}[\mathcal{N}_P(x_* \mid \mu_h, \Lambda_h^{-1})]\, E_{w_h}[\mathcal{N}(t_* \mid w_h^T x_*^b, 1)]\, dx_*^{m_*}\, dt_*}{\sum_{k=1}^{N} E_V[\pi_k] \int E_{\mu_k, \Lambda_k}[\mathcal{N}_P(x_* \mid \mu_k, \Lambda_k^{-1})]\, dx_*^{m_*}},$$

where

$$E_V[\pi_h] = E_V\Big[V_h \prod_{l<h} (1 - V_l)\Big] = \langle V_h \rangle \prod_{l<h} \langle 1 - V_l \rangle = \left[\frac{v_{h1}}{v_{h1} + v_{h2}}\right]^{\mathbf{1}(h<N)} \prod_{l<h} \left[\frac{v_{l2}}{v_{l1} + v_{l2}}\right]^{\mathbf{1}(h>1)}.$$

The expectation E_{μ_h, Λ_h}[N_P(x_* | μ_h, Λ_h^{-1})] is a multivariate Student-t distribution (Attias, 2000). However, for the incomplete-data situation, the integral over the missing values is tractable only when the two terms in the integral are both normal. To retain the normal form, we use the posterior means of μ_h, Λ_h and w_h to approximate these variables in this case:

$$P(y_* = 1 \mid x_*^{o_*}, D) \approx \frac{\sum_{h=1}^{N} E_V[\pi_h] \int_0^{\infty} \int \mathcal{N}_P(x_* \mid m_h, (\nu_h B_h)^{-1})\, \mathcal{N}(t_* \mid (\mu_h^w)^T x_*^b, 1)\, dx_*^{m_*}\, dt_*}{\sum_{k=1}^{N} E_V[\pi_k] \int \mathcal{N}_P(x_* \mid m_k, (\nu_k B_k)^{-1})\, dx_*^{m_*}}$$
$$= \frac{\sum_{h=1}^{N} E_V[\pi_h]\, \mathcal{N}_{|o_*|}(x_*^{o_*} \mid m_h^{o_*}, (\nu_h B_h)^{-1, o_* o_*}) \int_0^{\infty} \mathcal{N}(t_* \mid \varphi_{*h}, g_{*h})\, dt_*}{\sum_{k=1}^{N} E_V[\pi_k]\, \mathcal{N}_{|o_*|}(x_*^{o_*} \mid m_k^{o_*}, (\nu_k B_k)^{-1, o_* o_*})},$$

where

$$\varphi_{*h} = [m_h^T, 1]\, \mu_h^w + \Gamma_{*h}^T (\Delta_h^{o_* o_*})^{-1} (x_*^{o_*} - m_h^{o_*}),$$
$$g_{*h} = 1 + (\bar{\mu}_h^w)^T \Delta_h \bar{\mu}_h^w - \Gamma_{*h}^T (\Delta_h^{o_* o_*})^{-1} \Gamma_{*h},$$
$$\Gamma_{*h} = \Delta_h^{o_* o_*} (\bar{\mu}_h^w)^{o_*} + \Delta_h^{o_* m_*} (\bar{\mu}_h^w)^{m_*},$$
$$\bar{\mu}_h^w = (\mu_h^w)_{1:P},$$
$$\Delta_h = (\nu_h B_h)^{-1}.$$

For complete data the integral over missing features is absent, so we take advantage of the full variational posteriors for prediction.
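For implementation, the remaining one-dimensional integral is a probit: ∫_0^∞ N(t | φ, g) dt = Φ(φ/√g). The sketch below is ours, written directly from the approximation above; its inputs are the posterior-mean quantities E_V[π_h], m_h, Δ_h and μ_h^w, and a fully observed x_* corresponds to an empty missing-index set.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def iqgme_predict_posterior_mean(x_obs, obs_idx, E_pi, m, Delta, mu_w):
    """Approximate P(y=1 | x^o) for the single-task iQGME using posterior means.

    E_pi:  (N,) expected stick weights E_V[pi_h]
    m:     (N, P) posterior means m_h of the component means
    Delta: (N, P, P) plug-in covariances Delta_h = (nu_h B_h)^{-1}
    mu_w:  (N, P+1) posterior means of the experts w_h (last entry: intercept)
    """
    P = m.shape[1]
    mis_idx = np.setdiff1d(np.arange(P), obs_idx)
    num = den = 0.0
    for h in range(len(E_pi)):
        D_oo = Delta[h][np.ix_(obs_idx, obs_idx)]
        lik_o = multivariate_normal.pdf(x_obs, mean=m[h][obs_idx], cov=D_oo)
        w_bar = mu_w[h][:P]                                    # drop the intercept entry
        Gamma = (D_oo @ w_bar[obs_idx]
                 + Delta[h][np.ix_(obs_idx, mis_idx)] @ w_bar[mis_idx])
        resid = np.linalg.solve(D_oo, x_obs - m[h][obs_idx])
        phi = np.append(m[h], 1.0) @ mu_w[h] + Gamma @ resid
        g = 1.0 + w_bar @ Delta[h] @ w_bar - Gamma @ np.linalg.solve(D_oo, Gamma)
        num += E_pi[h] * lik_o * norm.cdf(phi / np.sqrt(g))    # Phi(phi / sqrt(g))
        den += E_pi[h] * lik_o
    return num / den
```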

5.2 Multi-Task Learning

For multi-task learning much of the inference is highly related to that of single-task learning; in the following we focus only on the differences. In the two multi-task learning models, the latent variables are Θ = {t, x^m, z, V, α, c, U, β, μ, Λ, W, ζ, λ}, and the hyper-parameters are Ψ = {m_0, u_0, B_0, ν_0, τ_{10}, τ_{20}, τ_{30}, τ_{40}, γ_0, a_0, b_0}. For the HDP structure, we specify the factorized variational distributions as

$$q(t, x^m, z, V, \alpha, c, U, \beta, \mu, \Lambda, W, \zeta, \lambda) = \prod_{j=1}^{J} \left\{ \prod_{i=1}^{n_j} \left[ q(t_{ji})\, q(x_{ji}^{m_{ji}})\, q(z_{ji}) \right] \prod_{h=1}^{N-1} q(V_{jh}) \prod_{h=1}^{N} q(c_{jh}) \right\} \prod_{s=1}^{S-1} q(U_s) \prod_{s=1}^{S} \left[ q(\mu_s, \Lambda_s)\, q(w_s) \right] \prod_{p=1}^{P+1} q(\zeta_p, \lambda_p)\, q(\alpha)\, q(\beta),$$

where the variational distributions of (t_{ji}, V_{jh}, α, μ_s, Λ_s, w_s, ζ_p, λ_p) are assumed to take the same forms as in single-task learning, while the variational distributions of the hidden variables newly introduced for the upper-level Dirichlet process are specified as

• q(c_{jh}) for each indicator c_{jh} is a multinomial distribution with probabilities σ_{jh},

$$c_{jh} \sim \mathcal{M}_S(1, \sigma_{jh1}, \ldots, \sigma_{jhS}), \qquad j = 1, \ldots, J, \quad h = 1, \ldots, N.$$

• q(U_s) for each weight U_s is a Beta distribution,

$$U_s \sim \mathrm{Be}(\kappa_{s1}, \kappa_{s2}), \qquad s = 1, \ldots, S-1.$$

Recall that we have a truncation level of S for the upper-level DP, which implies that the mixture proportions η_s(U) are equal to zero for s > S. Therefore, q(U_s) = δ_1 for s = S, and q(U_s) = δ_0 for s > S. For s < S, U_s has a variational Beta posterior.

• q(β) for the scaling parameter β is a Gamma distribution,

$$\beta \sim \mathrm{Ga}(\tau_3, \tau_4).$$


We also note that with a higher level of hierarchy, the dependency between the missing values x_{ji}^{m_{ji}} and the associated indicator z_{ji} has to be broken so that the inference remains tractable. The variational distribution of z_{ji} is still assumed to be a multinomial distribution, while x_{ji}^{m_{ji}} is assumed to be normally distributed but no longer dependent on z_{ji}. All update equations are included in the Appendix.

For the nDP structure, the factorized variational distributions may be specified as

$$q(t, x^m, z, V, \alpha, c, U, \beta, \mu, \Lambda, W, \zeta, \lambda) = \prod_{j=1}^{J} \left\{ \prod_{i=1}^{n_j} \left[ q(t_{ji})\, q(x_{ji}^{m_{ji}})\, q(z_{ji} \mid c_j) \right] q(c_j) \right\} \prod_{s=1}^{S-1} q(U_s) \prod_{s=1}^{S} \left\{ \prod_{h=1}^{N-1} q(V_{sh}) \prod_{h=1}^{N} \left[ q(\mu_{sh}, \Lambda_{sh})\, q(w_{sh}) \right] \right\} \prod_{p=1}^{P+1} q(\zeta_p, \lambda_p)\, q(\alpha)\, q(\beta),$$

where all of the variational distributions are assumed to be of the same form as those in the HDP structure. Unlike the HDP, we have S variational candidate models, each characterized by {(μ_{sh}, Λ_{sh}, w_{sh}) : h = 1, . . . , N}, and c_j indicates which candidate model is assigned to task j. Since the data indicator z_{ji} further gives the component/cluster with which data point i in task j is associated, z_{ji} is meaningful only when c_j is given. Therefore, we prefer to retain the dependency of z_{ji} on c_j when inferring z_{ji}. All update equations that differ from those of the HDP structure are included in the Appendix.

6. Experimental Results

In all of the following experiments the prior hyper-parameters are set as follows: a_0 = 0.01, b_0 = 0.01, γ_0 = 0.1, τ_{10} = 0.05, τ_{20} = 0.05, τ_{30} = 0.05, τ_{40} = 0.05, u_0 = 0.1, ν_0 = P + 2, and m_0 and B_0 are set according to the sample mean and sample precision, respectively. These parameters have not been optimized for any particular data set, and the results are relatively insensitive to “reasonable” settings. The truncation levels for the variational distributions are set to N = 20 and S = 20 for HDP-based multi-task learning, and S = J for nDP-based multi-task learning. We have found the results insensitive to the truncation level, for values larger than those considered here. Initial hyper-parameters for the variational distributions are set randomly, and to alleviate the local-maxima issue of variational Bayesian methods, we randomly initialize the variational distributions five times for each trial and select the solution that produces the best lower bound on the marginal likelihood. All results cited below from the literature are taken directly from the source publications.

6.1 Synthetic Data

We first demonstrate the proposed iQGME single-task learning model on a synthetic data set, for illustrative purposes. The data are generated according to a GMM model p(x) = Σ_{k=1}^3 π_k N_2(x | μ_k, Σ_k) with the following parameters:

$$\pi = \begin{bmatrix} 1/3 & 1/3 & 1/3 \end{bmatrix}, \quad \mu_1 = \begin{bmatrix} -3 \\ 0 \end{bmatrix}, \quad \mu_2 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \mu_3 = \begin{bmatrix} 5 \\ 0 \end{bmatrix},$$
$$\Sigma_1 = \begin{bmatrix} 0.52 & -0.36 \\ -0.36 & 0.73 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 0.47 & 0.19 \\ 0.19 & 0.7 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 0.52 & -0.36 \\ -0.36 & 0.73 \end{bmatrix}.$$


The class boundary for each Gaussian component is given by one of three lines x_2 = w_k x_1 + b_k for k = 1, 2, 3, where w_1 = 0.75, b_1 = 2.25, w_2 = −0.58, b_2 = 0.58, and w_3 = 0.75, b_3 = −3.75. The simulated data are shown in Figure 3(a), where large black dots and dashed ellipses represent the true means and covariance matrices of the Gaussian components, respectively.
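For reproducibility, data of this form can be generated directly from the stated parameters; the snippet below (ours) draws the three-component mixture and labels each point by the line attached to its component (which side of each line is labeled +1 is our arbitrary choice for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([[-3.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
Sigma = np.array([[[0.52, -0.36], [-0.36, 0.73]],
                  [[0.47,  0.19], [ 0.19, 0.70]],
                  [[0.52, -0.36], [-0.36, 0.73]]])
w = np.array([0.75, -0.58, 0.75])            # slopes of the per-component class boundaries
b = np.array([2.25,  0.58, -3.75])           # intercepts of the per-component class boundaries

n = 600
k = rng.choice(3, size=n)                                    # equal mixture weights 1/3
x = np.array([rng.multivariate_normal(mu[j], Sigma[j]) for j in k])
y = np.where(x[:, 1] > w[k] * x[:, 0] + b[k], 1, -1)         # label by the component's line
```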

[Figure 3: Synthetic three-Gaussian single-task data with inferred components. (a) Data in feature space with true labels and true Gaussian components indicated; (b) inferred posterior expectation of weights on components, with standard deviations depicted as error bars; (c) ground truth with posterior means of dominant components indicated (the linear classifiers and Gaussian ellipses are inferred from the data).]

The inferred mean mixture weights with standard deviations are depicted in Figure 3(b), and it is observed that three dominant mixture components (local “experts”) are inferred. The dominant components (those with mean weight larger than 0.005) are characterized by Gaussian means, covariance matrices and local experts, as depicted in Figure 3(c). From Figure 3(c), the nonlinear classification is achieved by using three dominant local linear classifiers, with a GMM defining their effective regions stochastically.

[Figure 4: Synthetic three-Gaussian single-task data: (a) prior and posterior beliefs on the number of dominant components; (b) prior and posterior beliefs on α.]


An important point is that we are not selecting a “correct” number of mixture components, as in most mixture-of-experts models, including the finite QGME model (Liao et al., 2007). Instead, there is uncertainty on the number of components in our posterior belief. Since this uncertainty is not inferred directly, we obtain samples of the number of dominant components by calculating π_h based on V_h sampled from their probability density functions (prior or variational posterior); the resulting probability mass functions, given by histograms, are shown in Figure 4(a). As discussed, the scaling parameter α is highly related to the number of clusters, so we depict the prior and the variational posterior on α in Figure 4(b).

Figure 5: Synthetic three-Gaussian single-task data: (a) prediction in feature space using the full posteriors; (b) prediction in feature space using the posterior means; (c) a common broad prior on local experts; (d) variational posteriors on local experts.

The predictions in feature space are presented in Figure 5, where the prediction in subfigure (a) is given by integrating over the full posteriors of the local experts and of the parameters (means and covariance matrices) of the Gaussian components, while the prediction in subfigure (b) is given by plugging in the posterior means. We examine both cases because the integrals over the full posteriors may not be analytically available in practice (for example, for cases with incomplete data, as discussed in Section 5). From Figures 5(a) and 5(b), we observe that the two predictions are fairly similar, except that (a) allows more uncertainty in regions where data are scarce. The reason is that the posteriors are often peaked, so the posterior means are usually representative. As an example, we plot the broad common prior imposed on the local experts in Figure 5(c), and the peaked variational posteriors for the three dominant experts in Figure 5(d). Based on Figure 5, we suggest using the full posteriors for prediction whenever the integrals are analytical, that is, for experiments with complete data; the comparison also empirically justifies the use of posterior means as an approximation.
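To make the posterior-mean (plug-in) prediction of Figure 5(b) concrete, the sketch below assumes the general gate-and-expert form of the QGME, with Gaussian gates and probit experts; the arguments `pi`, `mu`, `Sigma`, and `w` are placeholders for the posterior means of the corresponding variational factors, so this is an illustration of the plug-in rule rather than the exact routine used here:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def predict_plugin(x, pi, mu, Sigma, w):
    """Plug-in P(y=1|x) using posterior means: Gaussian gates times probit experts.

    pi    : (N,)    posterior-mean mixture weights
    mu    : (N,P)   posterior-mean Gaussian means
    Sigma : (N,P,P) posterior-mean covariance matrices
    w     : (N,P+1) posterior-mean expert weights (last entry is the bias)
    """
    xb = np.append(x, 1.0)                              # augment with the bias term
    gate = np.array([pi[h] * multivariate_normal.pdf(x, mu[h], Sigma[h])
                     for h in range(len(pi))])
    gate /= gate.sum()                                   # responsibilities p(z = h | x)
    return float(np.sum(gate * norm.cdf(w @ xb)))        # mix the probit experts
```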

6.2 Benchmark Data

Since the finite QGME (Liao et al., 2007), inferred with the expectation-maximization (EM) algorithm, has previously been compared with a recently developed algorithm (Williams et al., 2007), we demonstrate the performance of the proposed iQGME by comparing it with these two algorithms and with two single-imputation missing-data methods. The notation is as follows. QGME-EM refers to the finite QGME (Liao et al., 2007); LR-Integration denotes the two-stage algorithm proposed by Williams et al. (2007), where the parameters of a GMM are first estimated from the observed features, and a marginalized logistic regression classifier is then learned; LR-Imputation-CondMean and LR-Imputation-UncondMean refer to logistic regression classifiers with missing features imputed by the conditional mean and the unconditional mean (Schafer and Graham, 2002), respectively.

For convenient comparison, we use benchmark data sets available from the UCI machine learning repository (Newman et al., 1998): Wisconsin Diagnostic Breast Cancer (WDBC) and the Johns Hopkins University Ionosphere database (Ionosphere). These two data sets are summarized in Table 1. We randomly remove a fraction of the features according to a uniform distribution, and assume the rest are observed. Any instance with all feature values missing is deleted. After that, we randomly split each data set into training and testing subsets, imposing that each subset contains at least one instance from each of the classes. Note that the random pattern of missing features and the random partition into training and testing subsets are independent of each other. By performing multiple trials we consider the average performance over various data settings. The performance of the algorithms is evaluated in terms of the area under the receiver operating characteristic (ROC) curve (AUC) (Hanley and McNeil, 1982).
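A minimal sketch of this masking-and-splitting procedure (NumPy; the function name and the re-drawing loop are illustrative choices consistent with the description above):

```python
import numpy as np

def make_incomplete(X, y, missing_frac, train_frac, rng):
    """Randomly hide a fraction of feature values, then split into train/test."""
    mask = rng.random(X.shape) < missing_frac           # True = missing
    X = np.where(mask, np.nan, X)
    keep = ~np.all(np.isnan(X), axis=1)                 # drop rows with all features missing
    X, y = X[keep], y[keep]
    n_train = max(2, int(train_frac * len(y)))
    perm = rng.permutation(len(y))
    tr, te = perm[:n_train], perm[n_train:]
    # Re-draw if either subset lacks one of the two classes, as imposed in the text
    while len(np.unique(y[tr])) < 2 or len(np.unique(y[te])) < 2:
        perm = rng.permutation(len(y))
        tr, te = perm[:n_train], perm[n_train:]
    return X[tr], y[tr], X[te], y[te]
```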

Table 1: Details of the Ionosphere and WDBC data sets

    Data set     Dimension   Number of positive instances   Number of negative instances
    Ionosphere      34                  126                            225
    WDBC            30                  212                            357

The results on the Ionosphere and WDBC data sets are summarized in Figures 6 and 7, respectively, where we consider 25% and 50% of the features missing. Given a portion of missing values, each curve is a function of the fraction of data used in training. For a given size of training data, we perform ten independent trials for the proposed iQGME, with the error bars representing the standard deviation.

Figure 6: Single-task learning results on the Ionosphere data set for (a) 25% and (b) 50% of the features missing. The proposed iQGME algorithm is inferred using a variational Bayesian method. The results of the finite QGME solved with an expectation-maximization method are cited from Liao et al. (2007), and those of the other algorithms are cited from Williams et al. (2007). Since the performance of QGME-EM is affected by the choice of the number of experts K, the overall best results among K = 1, 3, 5, 10, 20 are cited for comparison in each case (no such selection of K is required for the VB solution).

Figure 7: Single-task learning results on the WDBC data set for cases when (a) 25% and (b) 50% of the features are missing. Refer to Figure 6 for additional information.

From both Figures 6 and 7, all three algorithms that analytically integrate over the missing features are superior to the single-imputation schemes. This observation is unsurprising, since the uncertainty of the missing features is ignored in single-imputation methods. We also note that the one-step iQGME and the finite QGME outperform the two-step LR-Integration. The proposed iQGME consistently performs better than the finite QGME, even with the best choice of the number of experts K for the latter, in all situations, which reveals the advantage of retaining uncertainty on the model structure (the number of experts) and the model parameters. This advantage is particularly pronounced for the WDBC data set, especially when training examples are relatively scarce and the point-estimate EM method therefore suffers from over-fitting. Comparing (a) and (b) in Figures 6 and 7, we observe that the margin of improvement is generally larger when a smaller fraction of the features is missing, except in situations with scarce training samples.

6.3 Multi-Task Learning with Landmine Detection Data

In an application of landmine detection, data collected from 19 landmine fields (Xue et al., 2007) are treated as 19 subtasks. In all subtasks, each target is characterized by a 9-dimensional feature vector x with a corresponding binary label y (1 for landmines and -1 for clutter). The number of landmines and clutter in each task is summarized in Figure 8. The feature vectors are extracted from images measured with airborne radar systems. A detailed description of this landmine data set has been presented elsewhere (Xue et al., 2007).

Figure 8: Number of landmines and clutter in each task for the landmine-detection data set (Xue et al., 2007).

Although our main objective is to learn multiple tasks with incomplete data simultaneously, we first demonstrate the proposed QGME-based multi-task learning models by comparing them to a recently developed multi-task learning algorithm based on logistic regression (LR) (Xue et al., 2007); that algorithm is designed for the situation in which all features are observed, and cannot be applied directly to the case of missing data unless imputation is employed. Each task is divided randomly into training and test subsets. Since the numbers of elements in the two classes are highly unbalanced, as shown in Figure 8, we impose that there is at least one instance from each class in each subset. Following Xue et al. (2007), the size of the training subset in each task varies from 20 to 300 in increments of 20, and 100 independent trials are performed for each training-set size. The average AUC (Hanley and McNeil, 1982) over all 19 tasks is used as the performance measure for one trial of a given training size.
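The per-trial summary statistic can be computed as sketched below (using scikit-learn's `roc_auc_score`; `predict_task` is a hypothetical stand-in for whichever classifier is being evaluated):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def average_auc(tasks, predict_task):
    """Average AUC over tasks for one trial.

    tasks        : list of (X_test, y_test) pairs, one per task
    predict_task : callable returning scores for P(y=1|x) on a task's test set
    """
    aucs = [roc_auc_score(y_test, predict_task(j, X_test))
            for j, (X_test, y_test) in enumerate(tasks)]
    return float(np.mean(aucs))
```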

Figure 9: Average AUC on the 19 landmine-detection tasks with complete data. Error bars reflect the standard deviation across 100 random partitions into training and testing subsets. Results of the logistic-regression-based algorithms are cited from Xue et al. (2007), where LR-MTL and LR-STL correspond to SMTL-2 and STL, respectively, in Figure 3 of Xue et al. (2007).

The average AUC values given by the QGME-based models are compared with those of the logistic regression (LR) models (Xue et al., 2007) in Figure 9. The first observation is that we obtain a significant performance improvement for single-task learning by using the iQGME instead of the linear logistic regression model (Xue et al., 2007). We also notice that the three multi-task learning algorithms outperform the single-task learning methods when the number of training samples is small (below 180 per task), which demonstrates the advantage of multi-task learning. Comparing the two iQGME-based MTL algorithms, we observe that iQGME-MTL-nDP produces slightly better results than iQGME-MTL-HDP, especially when training data are relatively abundant. As discussed in Section 3, a probable reason is that sharing in the nDP structure is more restrictive, and thus sharing that is "improper" from the viewpoint of classification is less encouraged.

After yielding competitive results on the landmine-detection data set with complete data, the iQGME-based algorithms are evaluated on incomplete data, which are simulated by randomly removing a portion of the feature values for each task, as in Section 6.2. We consider three different portions of missing values, 25%, 50% and 75%, and concentrate on situations in which only a small fraction (varying from 0.02 to 0.5 for each task) of the instances is used as training data, since multi-task learning is aimed at problems with insufficient training data. As in the benchmark experiments above, we perform ten independent random trials for each setting of missing fraction and training portion, and for each trial five VB initializations are executed.

Figure 10: Average AUC on the 19 landmine-detection tasks for the cases when (a) 25%, (b) 50%, and (c) 75% of the features are missing. Error bars reflect the standard deviation across 10 random partitions into training and testing subsets.

To the best of our knowledge, there is no previous work in the literature on multi-task learning with missing data. As presented in Figure 10, we therefore use LR-MTL (Xue et al., 2007), with missing values completed by unconditional means, as the baseline algorithm. Results of the two-step LR with integration (Williams et al., 2007) and of LR-STL with single imputation are also included for comparison.

From Figure 10, iQGME-STL consistently performs best among the single-task learning methods, and even outperforms LR-MTL-Imputation when the size of the training set is relatively large (more than 15%). The two iQGME-based multi-task learning algorithms significantly outperform the baseline overall, and outperform all single-task learning methods when the size of the training set is relatively small (less than around 25%). Furthermore, the improvement of the iQGME-based multi-task learning algorithms is more pronounced when more features are missing. These observations underscore the advantage of handling missing data in a principled manner while at the same time learning multiple tasks jointly. It is felt that the advantages of jointly processing data from multiple tasks may be even more pronounced in the case of missing data, since then the differences in missing data across tasks may make the sharing of data particularly valuable.

7. Conclusion and Future Work

In this paper we have introduced three new concepts, summarized as follows. First, we have employed non-parametric Bayesian techniques to develop a mixture-of-experts algorithm for classifier design, which uses a set of localized (in feature space) linear classifiers as experts. The Dirichlet process allows the model to automatically infer the proper number of experts and their characteristics; in fact, since a Bayesian formulation is employed, a full posterior distribution is obtained over the properties of the local experts, including their number. Secondly, the classifier is endowed with the ability to address missing data naturally, without the need for an imputation step. Finally, the whole framework has been placed within the context of multi-task learning, allowing one to jointly infer classifiers for multiple data sets with missing data. The multi-task-learning component has also been implemented with the general tools associated with the Dirichlet process, with specific implementations based on the hierarchical Dirichlet process and the nested Dirichlet process; for the examples considered here, we found that the latter provided slightly better performance. Because of the hierarchical form of the model, expressed as a sequence of distributions in the conjugate-exponential family, all inference has been carried out efficiently via variational Bayesian (VB) analysis. Results have been presented for single-task and multi-task learning using data sets available from the literature, and encouraging algorithm performance has been demonstrated.

Concerning future research, we note that the use of multi-task learning provides an important class of contextual information, and therefore is particularly useful when one has limited labeled data and when the data are incomplete (missing features). Another form of context that has received significant recent attention is semi-supervised learning (Zhu, 2005), and there has been recent work on integrating multi-task learning with semi-supervised learning (Liu et al., 2007). An important new research direction is to extend semi-supervised multi-task learning to realistic problems for which the data are incomplete.

Appendix

The update equations of single-task learning with incomplete data are summarized as follows:

1. $q(t_i|\mu_i^t)$

\[
\mu_i^t = \sum_{h=1}^{N} \rho_{ih} \langle w_h \rangle^T x^b_{i,h}, \qquad \text{where } x^b_{i,h} = [x_i^{o_i};\, m_h^{m_i|o_i};\, 1].
\]

The expectations of $t_i$ and $t_i^2$ may be derived according to the properties of truncated normal distributions:

\[
\langle t_i \rangle = \mu_i^t + \frac{\phi(-\mu_i^t)}{1(y_i = 1) - \Phi(-\mu_i^t)}, \qquad
\langle t_i^2 \rangle = 1 + (\mu_i^t)^2 + \frac{\mu_i^t\, \phi(-\mu_i^t)}{1(y_i = 1) - \Phi(-\mu_i^t)},
\]

where $\phi(\cdot)$ and $\Phi(\cdot)$ denote the probability density function and the cumulative distribution function of the standard normal distribution, respectively.
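These two moments are simple to evaluate numerically; the following sketch (SciPy) mirrors the expressions above and may be useful for checking an implementation:

```python
import numpy as np
from scipy.stats import norm

def truncnorm_moments(mu_t, y):
    """First and second moments of t ~ N(mu_t, 1) truncated to t > 0 (y = +1) or t < 0 (y = -1)."""
    denom = (1.0 if y == 1 else 0.0) - norm.cdf(-mu_t)
    ratio = norm.pdf(-mu_t) / denom
    Et = mu_t + ratio
    Et2 = 1.0 + mu_t**2 + mu_t * ratio
    return Et, Et2
```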

2. $q(x_i^{m_i}, z_i \,|\, m_h^{m_i|o_i}, \Sigma_h^{m_i|o_i}, \rho_{ih})$

First, we explicitly write out the intercept $w_h^b$, that is, $w_h = [(w_h^x)^T, w_h^b]^T$:

\[
q(x_i^{m_i}, z_i = h) \propto \exp\langle \ln[p(t_i|z_i = h, x_i, W)\, p(z_i = h|V)\, p(x_i|z_i = h, \mu, \Lambda)] \rangle_{q(t_i)q(w_h)q(V)q(\mu_h,\Lambda_h)}
\propto A_{ih}\, \mathcal{N}_P(x_i|\mu_{ih}, \Sigma_{ih}),
\]

where

\[
\Sigma_{ih} = [\langle w_h^x (w_h^x)^T \rangle + \nu_h B_h]^{-1},
\qquad
\mu_{ih} = \Sigma_{ih}[\langle t_i \rangle \langle w_h^x \rangle + \nu_h B_h m_h - \langle w_h^x w_h^b \rangle],
\]
\[
A_{ih} = \exp\Big\{ \langle \ln V_h \rangle + \sum_{l<h} \langle \ln(1 - V_l) \rangle + \langle t_i \rangle \langle w_h^b \rangle
+ \frac{1}{2}\big[ \langle \ln|\Lambda_h| \rangle + \mu_{ih}^T \Sigma_{ih}^{-1} \mu_{ih} + \ln|\Sigma_{ih}| - P/u_h - m_h^T \nu_h B_h m_h - \langle (w_h^b)^2 \rangle \big] \Big\}.
\]

Since

\[
\begin{bmatrix} x_i^{o_i} \\ x_i^{m_i} \end{bmatrix} \sim \mathcal{N}_P\left( \begin{bmatrix} \mu_{ih}^{o_i} \\ \mu_{ih}^{m_i} \end{bmatrix},
\begin{bmatrix} \Sigma_{ih}^{o_i o_i} & \Sigma_{ih}^{o_i m_i} \\ \Sigma_{ih}^{m_i o_i} & \Sigma_{ih}^{m_i m_i} \end{bmatrix} \right),
\]

the conditional distribution of the missing features $x_i^{m_i}$ given the observed features $x_i^{o_i}$ is also normal, that is, $x_i^{m_i} | x_i^{o_i} \sim \mathcal{N}_{|m_i|}(m_h^{m_i|o_i}, \Sigma_h^{m_i|o_i})$ with

\[
m_h^{m_i|o_i} = \mu_{ih}^{m_i} + \Sigma_{ih}^{m_i o_i} (\Sigma_{ih}^{o_i o_i})^{-1} (x_i^{o_i} - \mu_{ih}^{o_i}),
\qquad
\Sigma_h^{m_i|o_i} = \Sigma_{ih}^{m_i m_i} - \Sigma_{ih}^{m_i o_i} (\Sigma_{ih}^{o_i o_i})^{-1} \Sigma_{ih}^{o_i m_i}.
\]

Therefore, $q(x_i^{m_i}, z_i = h)$ can be factorized as the product of a factor independent of $x_i^{m_i}$ and the variational posterior of $x_i^{m_i}$, that is,

\[
q(x_i^{m_i}, z_i = h) \propto A_{ih}\, \mathcal{N}_{|o_i|}(x_i^{o_i}|\mu_{ih}^{o_i}, \Sigma_{ih}^{o_i o_i})\, \mathcal{N}_{|m_i|}(x_i^{m_i}|m_h^{m_i|o_i}, \Sigma_h^{m_i|o_i}),
\qquad
\rho_{ih} \propto A_{ih}\, \mathcal{N}_{|o_i|}(x_i^{o_i}|\mu_{ih}^{o_i}, \Sigma_{ih}^{o_i o_i}).
\]

For complete data, no factorization of the distribution over $x_i^{m_i}$ is necessary:

\[
\rho_{ih} \propto \exp\Big\{ \langle t_i \rangle \langle w_h \rangle^T x_i^b - \frac{1}{2}(x_i^b)^T \langle w_h w_h^T \rangle x_i^b + \langle \ln V_h \rangle + \sum_{l<h} \langle \ln(1 - V_l) \rangle + \frac{1}{2}\langle \ln|\Lambda_h| \rangle - \frac{1}{2}\langle (x_i - \mu_h)^T \Lambda_h (x_i - \mu_h) \rangle \Big\},
\]

where $x_i^b = [x_i; 1]$ is the bias-augmented feature vector.
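The conditional mean and covariance above are the standard Gaussian-conditioning formulas; a small sketch (NumPy), in which `obs` and `mis` are hypothetical index arrays playing the roles of $o_i$ and $m_i$:

```python
import numpy as np

def condition_gaussian(mu, Sigma, x_obs, obs, mis):
    """Mean and covariance of x[mis] | x[obs] for x ~ N(mu, Sigma)."""
    Soo = Sigma[np.ix_(obs, obs)]
    Smo = Sigma[np.ix_(mis, obs)]
    Smm = Sigma[np.ix_(mis, mis)]
    m_cond = mu[mis] + Smo @ np.linalg.solve(Soo, x_obs - mu[obs])
    S_cond = Smm - Smo @ np.linalg.solve(Soo, Smo.T)
    return m_cond, S_cond
```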


3. $q(V_h|v_{h1}, v_{h2})$

\[
v_{h1} = 1 + \sum_{i=1}^{n} \rho_{ih}, \qquad v_{h2} = \langle \alpha \rangle + \sum_{i=1}^{n} \sum_{l>h} \rho_{il};
\]
\[
\langle \ln V_h \rangle = \psi(v_{h1}) - \psi(v_{h1} + v_{h2}), \qquad \langle \ln(1 - V_l) \rangle = \psi(v_{l2}) - \psi(v_{l1} + v_{l2}).
\]
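In code, this update reduces to a few lines; a sketch (SciPy's digamma), with `rho` an $n \times N$ responsibility matrix and `alpha_mean` the current $\langle\alpha\rangle$ (variable names are illustrative):

```python
import numpy as np
from scipy.special import digamma

def update_V(rho, alpha_mean):
    """Variational update for the stick-breaking variables V_h (single-task case)."""
    Nh = rho.sum(axis=0)                        # sum_i rho_{ih}, length N
    tail = Nh.sum() - np.cumsum(Nh)             # sum_i sum_{l>h} rho_{il}
    v1 = 1.0 + Nh
    v2 = alpha_mean + tail
    E_lnV = digamma(v1) - digamma(v1 + v2)      # <ln V_h>
    E_ln1mV = digamma(v2) - digamma(v1 + v2)    # <ln(1 - V_h)>
    return v1, v2, E_lnV, E_ln1mV
```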

4. $q(\mu_h, \Lambda_h|m_h, u_h, B_h, \nu_h)$

\[
\nu_h = \nu_0 + N_h, \qquad u_h = u_0 + N_h, \qquad m_h = \frac{u_0 m_0 + N_h \bar{x}_h}{u_h},
\]
\[
B_h^{-1} = B_0^{-1} + \sum_{i=1}^{n} \rho_{ih} \Omega_{i,h} + N_h S_h + \frac{u_0 N_h}{u_h}(\bar{x}_h - m_0)(\bar{x}_h - m_0)^T,
\]

where

\[
N_h = \sum_{i=1}^{n} \rho_{ih}, \qquad \bar{x}_h = \sum_{i=1}^{n} \rho_{ih} \bar{x}_{i,h}/N_h, \qquad S_h = \sum_{i=1}^{n} \rho_{ih}(\bar{x}_{i,h} - \bar{x}_h)(\bar{x}_{i,h} - \bar{x}_h)^T/N_h,
\]
\[
\bar{x}_{i,h} = \begin{bmatrix} x_i^{o_i} \\ m_h^{m_i|o_i} \end{bmatrix}, \qquad \Omega_{i,h} = \begin{bmatrix} 0 & 0 \\ 0 & \Sigma_h^{m_i|o_i} \end{bmatrix}.
\]
\[
\langle \ln|\Lambda_h| \rangle = \sum_{p=1}^{P} \psi((\nu_h - p + 1)/2) + P \ln 2 + \ln|B_h|,
\]
\[
\langle (x_i - \mu_h)^T \Lambda_h (x_i - \mu_h) \rangle = (\bar{x}_{i,h} - m_h)^T \nu_h B_h (\bar{x}_{i,h} - m_h) + P/u_h + \mathrm{tr}(\nu_h B_h \Omega_{i,h}).
\]

5. $q(w_h|\mu_h^w, \Sigma_h^w)$, with $\langle w_h \rangle = \mu_h^w$ and $\langle w_h w_h^T \rangle = \Sigma_h^w + \mu_h^w (\mu_h^w)^T$.

\[
\Sigma_h^w = \left( \sum_{i=1}^{n} \rho_{ih}\big( \Omega_{i,h} + x_{i,h}^b (x_{i,h}^b)^T \big) + \mathrm{diag}(\langle \lambda \rangle) \right)^{-1},
\]
\[
\mu_h^w = \Sigma_h^w \left( \sum_{i=1}^{n} \rho_{ih}\, x_{i,h}^b \langle t_i \rangle + \mathrm{diag}(\langle \lambda \rangle)\, \phi \right).
\]

6. $q(\zeta_p, \lambda_p|\phi_p, \gamma, a_p, b_p)$, with $\langle \lambda_p \rangle = a_p/b_p$.

\[
\phi_p = \sum_{h=1}^{N} \langle w_{hp} \rangle/\gamma, \qquad \gamma = \gamma_0 + N,
\]
\[
a_p = a_0 + \frac{N}{2}, \qquad b_p = b_0 + \frac{1}{2}\sum_{h=1}^{N} \langle w_{hp}^2 \rangle - \frac{1}{2}\gamma\phi_p^2.
\]

7. $q(\alpha|\tau_1, \tau_2)$, with $\langle \alpha \rangle = \tau_1/\tau_2$.

\[
\tau_1 = N - 1 + \tau_{10}, \qquad \tau_2 = \tau_{20} - \sum_{h=1}^{N-1} \langle \ln(1 - V_h) \rangle.
\]


The update equations of multi-task learning with incomplete data are summarized as follows:

1. $q(t_{ji}|\mu_{ji}^t)$

Define $x_{ji}^b = [x_{ji}^{o_{ji}};\, m_{ji}^{m_{ji}|o_{ji}};\, 1]$.

For the HDP, $\displaystyle \mu_{ji}^t = \sum_{s=1}^{S} E\sigma_{jis} \langle w_s \rangle^T x_{ji}^b$, where $\displaystyle E\sigma_{jis} = \sum_{h=1}^{N} \rho_{jih}\sigma_{jhs}$.

For the nDP, $\displaystyle \mu_{ji}^t = \sum_{s=1}^{S} \sigma_{js} \sum_{h=1}^{N} \rho_{jih}^s \langle w_{sh} \rangle^T x_{ji}^b$.
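As a small illustration of the HDP form, the expectation $E\sigma_{jis}$ and the resulting $\mu_{ji}^t$ for one task can be computed with a single matrix product (NumPy; array names and shapes are hypothetical):

```python
import numpy as np

def hdp_probit_mean(rho_j, sig_j, Xb_j, W_mean):
    """mu^t_{ji} for task j under the HDP form.

    rho_j  : (n_j, N)   responsibilities q(z_ji = h)
    sig_j  : (N, S)     assignment probabilities q(c_jh = s)
    Xb_j   : (n_j, P+1) completed, bias-augmented feature vectors x^b_{ji}
    W_mean : (S, P+1)   posterior means <w_s> of the global experts
    """
    Esig = rho_j @ sig_j                          # E[sigma_jis], shape (n_j, S)
    return (Esig * (Xb_j @ W_mean.T)).sum(axis=1)
```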

2. $q(x_{ji}^{m_{ji}}|m_{ji}^{m_{ji}|o_{ji}}, \Sigma_{ji}^{m_{ji}|o_{ji}})$

\[
m_{ji}^{m_{ji}|o_{ji}} = \mu_{ji}^{m_{ji}} + \Sigma_{ji}^{m_{ji} o_{ji}} (\Sigma_{ji}^{o_{ji} o_{ji}})^{-1} (x_{ji}^{o_{ji}} - \mu_{ji}^{o_{ji}}),
\qquad
\Sigma_{ji}^{m_{ji}|o_{ji}} = \Sigma_{ji}^{m_{ji} m_{ji}} - \Sigma_{ji}^{m_{ji} o_{ji}} (\Sigma_{ji}^{o_{ji} o_{ji}})^{-1} \Sigma_{ji}^{o_{ji} m_{ji}},
\]

where for the HDP

\[
\Sigma_{ji} = \left( \sum_{s=1}^{S} E\sigma_{jis}\big( \langle w_s^x (w_s^x)^T \rangle + \nu_s B_s \big) \right)^{-1},
\qquad
\mu_{ji} = \Sigma_{ji} \sum_{s=1}^{S} E\sigma_{jis}\big( \langle t_{ji} \rangle \langle w_s^x \rangle + \nu_s B_s m_s - \langle w_s^x w_s^b \rangle \big),
\]

and for the nDP

\[
\Sigma_{ji} = \left( \sum_{h=1}^{N} \sum_{s=1}^{S} \rho_{jih}^s \sigma_{js}\big( \langle w_{sh}^x (w_{sh}^x)^T \rangle + \nu_{sh} B_{sh} \big) \right)^{-1},
\qquad
\mu_{ji} = \Sigma_{ji} \sum_{h=1}^{N} \sum_{s=1}^{S} \rho_{jih}^s \sigma_{js}\big( \langle t_{ji} \rangle \langle w_{sh}^x \rangle + \nu_{sh} B_{sh} m_{sh} - \langle w_{sh}^x w_{sh}^b \rangle \big).
\]

In addition,

\[
\bar{x}_{ji} = \begin{bmatrix} x_{ji}^{o_{ji}} \\ m_{ji}^{m_{ji}|o_{ji}} \end{bmatrix}, \qquad
\Omega_{ji} = \begin{bmatrix} 0 & 0 \\ 0 & \Sigma_{ji}^{m_{ji}|o_{ji}} \end{bmatrix}, \qquad
\langle x_{ji} x_{ji}^T \rangle = \bar{x}_{ji}\bar{x}_{ji}^T + \Omega_{ji}.
\]

3. $q(z_{ji}|\rho_{ji})$

For the HDP,

\[
\rho_{jih} = q(z_{ji} = h) \propto \exp\Big\{ \sum_{s=1}^{S} \sigma_{jhs}\big[ \langle t_{ji} \rangle \langle w_s \rangle^T x_{ji}^b - \tfrac{1}{2}\mathrm{tr}\big( \langle w_s w_s^T \rangle \langle x_{ji}^b (x_{ji}^b)^T \rangle \big) \big]
+ \langle \ln V_{jh} \rangle + \sum_{l<h} \langle \ln(1 - V_{jl}) \rangle
+ \frac{1}{2}\sum_{s=1}^{S} \sigma_{jhs}\big[ \langle \ln|\Lambda_s| \rangle - (\bar{x}_{ji} - m_s)^T \nu_s B_s (\bar{x}_{ji} - m_s) - P/u_s - \mathrm{tr}(\nu_s B_s \Omega_{ji}) \big] \Big\}.
\]

For the nDP,

\[
\rho_{jih}^s = q(z_{ji} = h|c_j = s) \propto \exp\Big\{ -\frac{1}{2}\big[ \langle t_{ji}^2 \rangle - 2\langle t_{ji} \rangle \langle w_{sh} \rangle^T x_{ji}^b + \mathrm{tr}\big( \langle w_{sh} w_{sh}^T \rangle \langle x_{ji}^b (x_{ji}^b)^T \rangle \big) \big]
+ \langle \ln V_{sh} \rangle + \sum_{l<h} \langle \ln(1 - V_{sl}) \rangle
+ \frac{1}{2}\big[ \langle \ln|\Lambda_{sh}| \rangle - (\bar{x}_{ji} - m_{sh})^T \nu_{sh} B_{sh} (\bar{x}_{ji} - m_{sh}) - P/u_{sh} - \mathrm{tr}(\nu_{sh} B_{sh} \Omega_{ji}) \big] \Big\}.
\]

4. $q(V|v)$

For the HDP,

\[
v_{jh1} = 1 + \sum_{i=1}^{n_j} \rho_{jih}, \qquad v_{jh2} = \langle \alpha \rangle + \sum_{l>h} \sum_{i=1}^{n_j} \rho_{jil};
\]
\[
\langle \ln V_{jh} \rangle = \psi(v_{jh1}) - \psi(v_{jh1} + v_{jh2}), \qquad \langle \ln(1 - V_{jl}) \rangle = \psi(v_{jl2}) - \psi(v_{jl1} + v_{jl2}).
\]

For the nDP,

\[
v_{sh1} = 1 + \sum_{j=1}^{J} \sum_{i=1}^{n_j} \sigma_{js} \rho_{jih}^s, \qquad v_{sh2} = \langle \alpha \rangle + \sum_{j=1}^{J} \sum_{i=1}^{n_j} \sigma_{js} \sum_{l>h} \rho_{jil}^s;
\]
\[
\langle \ln V_{sh} \rangle = \psi(v_{sh1}) - \psi(v_{sh1} + v_{sh2}), \qquad \langle \ln(1 - V_{sl}) \rangle = \psi(v_{sl2}) - \psi(v_{sl1} + v_{sl2}).
\]

5. $q(\alpha|\tau_1, \tau_2)$, with $\langle \alpha \rangle = \tau_1/\tau_2$.

For the HDP, $\displaystyle \tau_1 = J(N - 1) + \tau_{10}$, $\displaystyle \tau_2 = \tau_{20} - \sum_{j=1}^{J}\sum_{h=1}^{N-1} \langle \ln(1 - V_{jh}) \rangle$.

For the nDP, $\displaystyle \tau_1 = S(N - 1) + \tau_{10}$, $\displaystyle \tau_2 = \tau_{20} - \sum_{s=1}^{S}\sum_{h=1}^{N-1} \langle \ln(1 - V_{sh}) \rangle$.

6. $q(c|\sigma)$

For the HDP,

\[
\sigma_{jhs} \propto \exp\Big\{ \sum_{i=1}^{n_j} \rho_{jih}\big[ \langle t_{ji} \rangle \langle w_s \rangle^T x_{ji}^b - \tfrac{1}{2}\mathrm{tr}\big( \langle w_s w_s^T \rangle \langle x_{ji}^b (x_{ji}^b)^T \rangle \big) \big]
+ \langle \ln U_s \rangle + \sum_{l<s} \langle \ln(1 - U_l) \rangle
+ \frac{1}{2}\sum_{i=1}^{n_j} \rho_{jih}\big[ \langle \ln|\Lambda_s| \rangle - (\bar{x}_{ji} - m_s)^T \nu_s B_s (\bar{x}_{ji} - m_s) - P/u_s - \mathrm{tr}(\nu_s B_s \Omega_{ji}) \big] \Big\}.
\]

For the nDP,

\[
\sigma_{js} \propto \exp\Big\{ -\frac{1}{2}\sum_{i=1}^{n_j}\sum_{h=1}^{N} \rho_{jih}^s \big[ \langle t_{ji}^2 \rangle - 2\langle t_{ji} \rangle \langle w_{sh} \rangle^T x_{ji}^b + \mathrm{tr}\big( \langle x_{ji}^b (x_{ji}^b)^T \rangle \langle w_{sh} w_{sh}^T \rangle \big) \big]
+ \frac{1}{2}\Big[ \sum_{i=1}^{n_j}\sum_{h=1}^{N} \rho_{jih}^s \langle \ln|\Lambda_{sh}| \rangle - \sum_{i=1}^{n_j}\sum_{h=1}^{N} \rho_{jih}^s \langle (x_{ji} - \mu_{sh})^T \Lambda_{sh} (x_{ji} - \mu_{sh}) \rangle \Big]
+ \sum_{i=1}^{n_j}\sum_{h=1}^{N} \rho_{jih}^s \Big[ \langle \ln V_{sh} \rangle + \sum_{l<h} \langle \ln(1 - V_{sl}) \rangle \Big]
+ \langle \ln U_s \rangle + \sum_{l<s} \langle \ln(1 - U_l) \rangle \Big\}.
\]

7. $q(U_s|\kappa_{s1}, \kappa_{s2})$

For the HDP, $\displaystyle \kappa_{s1} = 1 + \sum_{j=1}^{J}\sum_{h=1}^{N} \sigma_{jhs}$, $\displaystyle \kappa_{s2} = \langle \beta \rangle + \sum_{j=1}^{J}\sum_{h=1}^{N}\sum_{l>s} \sigma_{jhl}$.

For the nDP, $\displaystyle \kappa_{s1} = 1 + \sum_{j=1}^{J} \sigma_{js}$, $\displaystyle \kappa_{s2} = \langle \beta \rangle + \sum_{j=1}^{J}\sum_{l>s} \sigma_{jl}$.

\[
\langle \ln U_s \rangle = \psi(\kappa_{s1}) - \psi(\kappa_{s1} + \kappa_{s2}), \qquad \langle \ln(1 - U_s) \rangle = \psi(\kappa_{s2}) - \psi(\kappa_{s1} + \kappa_{s2}).
\]

8. $q(\beta|\tau_3, \tau_4)$, with $\langle \beta \rangle = \tau_3/\tau_4$.

\[
\tau_3 = S - 1 + \tau_{30}, \qquad \tau_4 = \tau_{40} - \sum_{s=1}^{S-1} \langle \ln(1 - U_s) \rangle.
\]

9. $q(\mu_s, \Lambda_s|m_s, u_s, B_s, \nu_s)$

For the HDP,

\[
\nu_s = \nu_0 + N_s, \qquad u_s = u_0 + N_s, \qquad m_s = \frac{u_0 m_0 + N_s \bar{x}_s}{u_s},
\]
\[
B_s^{-1} = B_0^{-1} + \sum_{j=1}^{J}\sum_{i=1}^{n_j} E\sigma_{jis}\Omega_{ji} + N_s S_s + \frac{u_0 N_s}{u_s}(\bar{x}_s - m_0)(\bar{x}_s - m_0)^T,
\]

where $E\sigma_{jis} = \sum_{h=1}^{N} \rho_{jih}\sigma_{jhs}$, and

\[
N_s = \sum_{j=1}^{J}\sum_{i=1}^{n_j} E\sigma_{jis}, \qquad
\bar{x}_s = \sum_{j=1}^{J}\sum_{i=1}^{n_j} E\sigma_{jis}\bar{x}_{ji}/N_s, \qquad
S_s = \sum_{j=1}^{J}\sum_{i=1}^{n_j} E\sigma_{jis}(\bar{x}_{ji} - \bar{x}_s)(\bar{x}_{ji} - \bar{x}_s)^T/N_s.
\]
\[
\langle \ln|\Lambda_s| \rangle = \sum_{p=1}^{P} \psi((\nu_s - p + 1)/2) + P\ln 2 + \ln|B_s|,
\]
\[
\langle (x_{ji} - \mu_s)^T \Lambda_s (x_{ji} - \mu_s) \rangle = (\bar{x}_{ji} - m_s)^T \nu_s B_s (\bar{x}_{ji} - m_s) + P/u_s + \mathrm{tr}(\nu_s B_s \Omega_{ji}).
\]


10. $q(w_s|\mu_s^w, \Sigma_s^w)$

For the HDP,

\[
\Sigma_s^w = \left( \sum_{j=1}^{J}\sum_{i=1}^{n_j} E\sigma_{jis}\big( x_{ji}^b (x_{ji}^b)^T + \Omega_{ji} \big) + \mathrm{diag}(\langle \lambda \rangle) \right)^{-1},
\]
\[
\mu_s^w = \Sigma_s^w \left( \sum_{j=1}^{J}\sum_{i=1}^{n_j} E\sigma_{jis}\, x_{ji}^b \langle t_{ji} \rangle + \mathrm{diag}(\langle \lambda \rangle)\,\phi \right),
\]
\[
\langle w_s \rangle = \mu_s^w, \qquad \langle w_s w_s^T \rangle = \Sigma_s^w + \mu_s^w(\mu_s^w)^T.
\]

The update equations for $\mu$, $\Lambda$ and $W$ for the nDP remain the same, except that the subscript $s$ changes to $sh$, and $E\sigma_{jis}$ is replaced with $\sigma_{js}\rho_{jih}^s$.

11. $q(\lambda_p|a_p, b_p)$, with $\langle \lambda_p \rangle = a_p/b_p$.

For the HDP,

\[
\phi_p = \sum_{s=1}^{S} \langle w_{sp} \rangle/\gamma, \qquad \gamma = \gamma_0 + S,
\]
\[
a_p = a_0 + \frac{S}{2}, \qquad b_p = b_0 + \frac{1}{2}\sum_{s=1}^{S} \langle w_{sp}^2 \rangle - \frac{1}{2}\gamma\phi_p^2.
\]

For the nDP,

\[
\phi_p = \sum_{s=1}^{S}\sum_{h=1}^{N} \langle w_{shp} \rangle/\gamma, \qquad \gamma = \gamma_0 + SN,
\]
\[
a_p = a_0 + \frac{SN}{2}, \qquad b_p = b_0 + \frac{1}{2}\sum_{s=1}^{S}\sum_{h=1}^{N} \langle w_{shp}^2 \rangle - \frac{1}{2}\gamma\phi_p^2.
\]

References

J. H. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88:669-679, 1993.

H. Attias. A variational Bayesian framework for graphical models. In Advances in Neural Information Processing Systems (NIPS), 2000.

M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD dissertation, University College London, Gatsby Computational Neuroscience Unit, 2003.

D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121-144, 2006.

R. Caruana. Multitask learning. Machine Learning, 28:41-75, 1997.

G. Chechik, G. Heitz, G. Elidan, P. Abbeel, and D. Koller. Max-margin classification of data with absent features. Journal of Machine Learning Research, 9:1-21, 2008.

L. M. Collins, J. L. Schafer, and C. M. Kam. A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6(4):330-351, 2001.

A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.

U. Dick, P. Haider, and T. Scheffer. Learning from incomplete data with infinite imputations. In International Conference on Machine Learning (ICML), 2008.

T. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209-230, 1973.

Y. Freund and R. E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.

Z. Ghahramani and M. I. Jordan. Learning from incomplete data. Technical report, Massachusetts Institute of Technology, 1994.

T. Graepel. Kernel matrix completion by semidefinite programming. In Proceedings of the International Conference on Artificial Neural Networks, pages 694-699, 2002.

J. Hanley and B. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29-36, 1982.

J. Ibrahim. Incomplete data in generalized linear models. Journal of the American Statistical Association, 85:765-769, 1990.

H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96:161-173, 2001.

R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79-87, 1991.

M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994.

K. Kurihara, M. Welling, and Y. W. Teh. Collapsed variational Dirichlet process mixture models. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 2796-2801, 2007.

P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In Proceedings of the International Conference on Machine Learning (ICML), pages 584-591, 2008.

X. Liao, H. Li, and L. Carin. Quadratically gated mixture of experts for incomplete data classification. In Proceedings of the International Conference on Machine Learning (ICML), pages 553-560, 2007.

Q. Liu, X. Liao, and L. Carin. Semi-supervised multitask learning. In Advances in Neural Information Processing Systems (NIPS), 2007.

R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical report, Department of Computer Science, University of Toronto, 1993.

D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.

A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems (NIPS), 2002.

A. Rodríguez, D. B. Dunson, and A. E. Gelfand. The nested Dirichlet process. Journal of the American Statistical Association, 2008. Accepted.

D. B. Rubin. Inference and missing data. Biometrika, 63:581-592, 1976.

D. B. Rubin. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, Inc., 1987.

J. L. Schafer and J. W. Graham. Missing data: Our view of the state of the art. Psychological Methods, 7:147-177, 2002.

J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639-650, 1994.

P. K. Shivaswamy, C. Bhattacharyya, and A. J. Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283-1314, 2006.

A. Smola, S. Vishwanathan, and T. Hofmann. Kernel methods for missing variables. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005.

Y. Sun, M. S. Kamel, and Y. Wang. Boosting for learning multiple classes with imbalanced class distribution. In Proceedings of the International Conference on Data Mining (ICDM), 2006.

Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566-1581, 2006.

M. E. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems (NIPS), volume 12, pages 652-658. MIT Press, 2000.

X. Wang, A. Li, Z. Jiang, and H. Feng. Missing value estimation for DNA microarray gene expression data by support vector regression imputation and orthogonal coding scheme. BMC Bioinformatics, 7:32, 2006.

S. R. Waterhouse and A. J. Robinson. Classification using hierarchical mixtures of experts. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing IV, pages 177-186, 1994.

M. West, P. Müller, and M. D. Escobar. Hierarchical priors and mixture models, with application in regression and density estimation. In P. R. Freeman and A. F. Smith, editors, Aspects of Uncertainty, pages 363-386. John Wiley, 1994.

D. Williams and L. Carin. Analytical kernel matrix completion with incomplete multi-view data. In Proceedings of the International Conference on Machine Learning (ICML) Workshop on Learning with Multiple Views, pages 80-86, 2005.

D. Williams, X. Liao, Y. Xue, L. Carin, and B. Krishnapuram. On classification with incomplete data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3):427-436, 2007.

Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8:35-63, 2007.

K. Yu, A. Schwaighofer, V. Tresp, W.-Y. Ma, and H. Zhang. Collaborative ensemble learning: Combining collaborative and content-based information filtering via hierarchical Bayes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 616-623, 2003.

J. Zhang, Z. Ghahramani, and Y. Yang. Learning multiple related tasks using latent independent component analysis. In Advances in Neural Information Processing Systems (NIPS), 2006.

X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.