
Kernelized Bayesian Transfer Learning*

Mehmet Gönen†

[email protected]
Sage Bionetworks

Seattle, WA 98109, USA

Adam A. Margolin†

[email protected]
Sage Bionetworks

Seattle, WA 98109, USA

Abstract

Transfer learning considers related but distinct tasks defined on heterogeneous domains and tries to transfer knowledge between these tasks to improve generalization performance. It is particularly useful when we do not have a sufficient amount of labeled training data in some tasks, where such data may be very costly, laborious, or even infeasible to obtain. Instead, learning the tasks jointly enables us to effectively increase the amount of labeled training data. In this paper, we formulate a kernelized Bayesian transfer learning framework that is a principled combination of kernel-based dimensionality reduction models with task-specific projection matrices to find a shared subspace and a coupled classification model for all of the tasks in this subspace. Our two main contributions are: (i) two novel probabilistic models for binary and multiclass classification, and (ii) very efficient variational approximation procedures for these models. We illustrate the generalization performance of our algorithms on two different applications. In computer vision experiments, our method outperforms the state-of-the-art algorithms on nine out of 12 benchmark supervised domain adaptation experiments defined on two object recognition data sets. In cancer biology experiments, we use our algorithm to predict the mutation status of important cancer genes from gene expression profiles using two distinct cancer populations, namely, patient-derived primary tumor data and in-vitro-derived cancer cell line data. We show that we can increase our generalization performance on primary tumors using cell lines as an auxiliary data source.

1 Introduction

In many real-life applications, obtaining a sufficient amount of labeled training data to build a reliable predictor may be very costly, laborious, or even infeasible. Instead, we can make use of labeled training data available from related tasks to increase our generalization performance. Transfer learning (also known as domain adaptation or cross-domain learning) aims to transfer knowledge between related tasks defined on heterogeneous domains (Pan and Yang 2010). Heterogeneity may be due to different feature representations or different data distributions over the same set of features. This setup is significantly different from multitask learning, where we are given tasks whose data points share the same feature representation (Caruana 1997; Argyriou, Evgeniou, and Pontil 2008).

* This study was financially supported by the Integrative Cancer Biology Program of the National Cancer Institute (grant no. 1U54CA149237).
† Present address: Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR 97239, USA.
Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Transfer learning algorithms are well suited to natural language processing, computer vision, and computational biology applications, where opportunities for knowledge transfer arise naturally. For example, text collections from different languages, image collections from different types of recording devices, or biospecimen collections from different tissue types are natural candidates for transfer learning.

1.1 Related Work

Blitzer, McDonald, and Pereira (2006) find correspondences among features from different tasks and learn a shared feature space using these correspondences. Daumé III (2007) replicates input features to produce shared and domain-specific features, which are jointly fed into a supervised method to perform domain adaptation implicitly. Jiang et al. (2008) formulate a support vector machine (SVM; Vapnik 1998) model that uses support vectors of the source domain to improve the generalization performance on the target domain. Duan et al. (2009; 2010) learn a cross-domain kernel function and an SVM model by jointly minimizing the structural risk of the classifier and the mismatch between the data distributions of the two domains. Bergamo and Torresani (2010) exploit strongly labeled target domain data to improve the labeling of weakly labeled source domain data, and hence the knowledge transfer between the two domains, using a transductive SVM model.

Dai et al. (2009) use a risk minimization framework that couples two Markov chains defined on the labels and features of source and target domains with different feature representations. Hoffman et al. (2013) jointly learn a linear transformation that maps target domain data points into the source domain and a multiclass classifier in the source domain using a coupled optimization problem.

Gopalan, Li, and Chellappa (2011) propose a manifold learning approach that maps labeled data from the source domain and unlabeled data from the target domain onto the Grassmann manifold to learn a classifier there. Gong et al. (2012)


formulate a geodesic flow kernel to directly exploit the intrinsic structure of computer vision data sets when transferring knowledge between domains.

Ben-David et al. (2007) learn a shared subspace by maximizing the margin on the labeled source domain data and minimizing the distance between the data distributions of the two domains. Argyriou, Maurer, and Pontil (2008) divide image classification tasks into groups and learn a shared subspace for each group to transfer knowledge. Saenko et al. (2010) learn a nonlinear transformation to find a shared subspace by mapping two data points from different domains as close as possible if they are from the same class and as distant as possible otherwise. Kulis, Saenko, and Darrell (2011) generalize the same idea to the asymmetric setting (i.e., different feature representations). Pan et al. (2011) find low-dimensional latent representations for data points from different domains in a shared subspace by minimizing the maximum mean discrepancy between domains and maximizing the dependence between labels and latent features.

Bahadori, Liu, and Zhang (2011) give a transductive large-margin optimization algorithm that projects data points from the source and target domains into a shared subspace by jointly minimizing reconstruction and prediction losses. Duan, Xu, and Tsang (2012) map data points from different domains into a shared subspace using separate projection matrices and augment the projected data points with the original features before feeding them into a supervised learning algorithm such as an SVM. Han, Liao, and Carin (2012) formulate a probabilistic model that generates the original features of heterogeneous domains from their latent representations in a shared subspace and learns a joint probit classifier in this subspace.

1.2 Our Contribution

Many methods have been proposed for transfer learning, but none of them offers a fully Bayesian solution to domain adaptation on heterogeneous domains in a discriminative setting. In this paper, we choose to find a shared subspace between the tasks using task-specific kernel-based dimensionality reduction models (Schölkopf and Smola 2002; Shawe-Taylor and Cristianini 2004) and to learn a coupled linear classifier in this subspace, combining these two steps within a fully Bayesian framework. Our formulation shares some similarities:

(i) with Kulis, Saenko, and Darrell (2011) and Pan et al. (2011) due to the shared subspace between domains,

(ii) with Hoffman et al. (2013) due to the coupled classifier,

(iii) with Bahadori, Liu, and Zhang (2011), Duan, Xu, and Tsang (2012), and Han, Liao, and Carin (2012) due to both parts.

We discuss the differences between our method and these methods in Section 2.5, after giving a detailed description of our method, which can also be interpreted as a generalization of the relevance vector machine (RVM; Tipping 2001) to the transfer learning setup.


Figure 1: Flowchart of kernelized transfer learning for binary classification.

1.3 Preliminaries and Notation

We assume that there are $T$ related binary classification tasks whose data points come from heterogeneous domains, namely, $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_T$. For each task, we are given an independent and identically distributed sample $X_t = \{x_{t,i} \in \mathcal{X}_t\}_{i \in \mathcal{I}_t}$ and a label vector $y_t = \{y_{t,i} \in \{-1, +1\}\}_{i \in \mathcal{I}_t}$, where $\mathcal{I}_t$ gives the indices of the data points in task $t$. Each task has a task-specific kernel function $k_t \colon \mathcal{X}_t \times \mathcal{X}_t \to \mathbb{R}$ that defines similarities between its data points and is used to calculate the corresponding kernel matrix $K_t = \{k_t(x_{t,i}, x_{t,j})\}_{i \in \mathcal{I}_t, j \in \mathcal{I}_t}$.

Figure 1 illustrates the method we propose to learn a conjoint model across the tasks; it is composed of two main parts: (i) projecting data points from different tasks into a shared subspace using a separate kernel-based dimensionality reduction model for each task and (ii) performing coupled binary classification in this subspace using a common set of classification parameters. We first briefly explain these two parts and introduce the notation used.

We first perform feature extraction using the input kernel matrices $\{K_t \in \mathbb{R}^{N_t \times N_t}\}_{t=1}^T$ and the task-specific projection matrices $\{A_t \in \mathbb{R}^{N_t \times R}\}_{t=1}^T$, where $N_t$ is the number of data points in task $t$ and $R$ is the subspace dimensionality. After the projection, we obtain the hidden representations of the data points in the shared subspace, i.e., $\{H_t = A_t^\top K_t\}_{t=1}^T$. Using a kernel-based formulation has two main implications: (i) we can apply our method to tasks with very high-dimensional representations, and (ii) we can learn better subspaces using nonlinear or domain-specific kernel functions.

The coupled classification part calculates the predicted outputs $\{f_t = H_t^\top w + \mathbf{1} b\}_{t=1}^T$ in the shared subspace using the same set of classification parameters $\{b \in \mathbb{R}, w \in \mathbb{R}^R\}$. These outputs are mapped to labels by looking at their signs.
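To make the two parts concrete, the following Python sketch (illustrative only; the variable names are ours, and the authors' released implementation is in Matlab) computes the hidden representations $H_t = A_t^\top K_t$ and the coupled outputs $f_t = H_t^\top w + \mathbf{1} b$ for a collection of tasks:

```python
import numpy as np

def kbtl_forward(kernels, projections, w, b):
    """Project each task into the shared subspace and apply the coupled classifier.

    kernels:     list of (N_t x N_t) kernel matrices K_t
    projections: list of (N_t x R) task-specific projection matrices A_t
    w, b:        shared classification parameters (R-vector and scalar)
    """
    predictions = []
    for K_t, A_t in zip(kernels, projections):
        H_t = A_t.T @ K_t                 # (R x N_t) hidden representations
        f_t = H_t.T @ w + b               # (N_t,) predicted outputs
        predictions.append(np.sign(f_t))  # labels are the signs of the outputs
    return predictions

# toy usage with two tasks of different sizes (random numbers, illustration only)
rng = np.random.default_rng(0)
X = [rng.standard_normal((6, 4)), rng.standard_normal((9, 4))]
K = [x @ x.T for x in X]                               # linear kernels
A = [rng.standard_normal((k.shape[0], 3)) for k in K]  # R = 3
print(kbtl_forward(K, A, rng.standard_normal(3), 0.0))
```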

2 Kernelized Bayesian Transfer Learning

We formulate a probabilistic model, called kernelized Bayesian transfer learning (KBTL), for the method described earlier. Because our method combines the kernel-based dimensionality reduction and coupled binary classification parts in a fully conjugate probabilistic model, we can derive a very efficient inference algorithm using variational approximation.

Figure 2 gives the graphical model of KBTL with hyper-parameters, priors, latent variables, and model parameters. As described earlier, the main idea can be summarized as:



Figure 2: Graphical model of kernelized Bayesian transfer learning for binary classification.

(i) to find hidden representations for the data points of all tasks by mapping them into a shared subspace with the help of kernel matrices and task-specific projection matrices and (ii) to perform coupled binary classification in this subspace using a common set of classification parameters.

There are some additions to the notation described earlier: The $N_t \times R$ matrix of priors for the entries of the task-specific projection matrix $A_t$ is denoted by $\Lambda_t$. The $R \times 1$ vector of priors for the classification parameters $w$ is denoted by $\eta$. The prior for the bias parameter $b$ is denoted by $\gamma$. For these three priors, there are three sets of hyper-parameters, namely, $\{\alpha_\lambda, \beta_\lambda\}$, $\{\alpha_\eta, \beta_\eta\}$, and $\{\alpha_\gamma, \beta_\gamma\}$. The standard deviation for the hidden representations is given as $\sigma_h$. As short-hand notations, the hyper-parameters are denoted by $\zeta = \{\alpha_\eta, \beta_\eta, \alpha_\gamma, \beta_\gamma, \alpha_\lambda, \beta_\lambda, \sigma_h, \nu\}$, the priors by $\Xi = \{\gamma, \eta, \{\Lambda_t\}_{t=1}^T\}$, and the latent variables and model parameters by $\Theta = \{b, w, \{f_t, A_t, H_t\}_{t=1}^T\}$. Dependence on $\zeta$ is omitted for clarity throughout the manuscript.

The distributional assumptions of the kernel-based dimensionality reduction part are defined as

$\lambda_{t,s}^{i} \sim \mathcal{G}(\lambda_{t,s}^{i}; \alpha_\lambda, \beta_\lambda) \quad \forall (t, i, s)$
$a_{t,s}^{i} \mid \lambda_{t,s}^{i} \sim \mathcal{N}(a_{t,s}^{i}; 0, (\lambda_{t,s}^{i})^{-1}) \quad \forall (t, i, s)$
$h_{t,i}^{s} \mid a_{t,s}, k_{t,i} \sim \mathcal{N}(h_{t,i}^{s}; a_{t,s}^\top k_{t,i}, \sigma_h^2) \quad \forall (t, s, i),$

where the superscript indexes the rows and the subscript indexes the columns. $\mathcal{N}(\cdot; \mu, \Sigma)$ represents the normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. $\mathcal{G}(\cdot; \alpha, \beta)$ denotes the gamma distribution with shape parameter $\alpha$ and scale parameter $\beta$.

The coupled binary classification part has the following distributional assumptions:

$\gamma \sim \mathcal{G}(\gamma; \alpha_\gamma, \beta_\gamma)$
$b \mid \gamma \sim \mathcal{N}(b; 0, \gamma^{-1})$
$\eta_s \sim \mathcal{G}(\eta_s; \alpha_\eta, \beta_\eta) \quad \forall s$
$w_s \mid \eta_s \sim \mathcal{N}(w_s; 0, \eta_s^{-1}) \quad \forall s$
$f_{t,i} \mid b, w, h_{t,i} \sim \mathcal{N}(f_{t,i}; w^\top h_{t,i} + b, 1) \quad \forall (t, i)$
$y_{t,i} \mid f_{t,i} \sim \delta(f_{t,i} y_{t,i} > \nu) \quad \forall (t, i),$

where the predicted outputs $\{f_t\}_{t=1}^T$, similar to the discriminant outputs in SVMs, are introduced to make the inference procedures efficient (Albert and Chib 1993). The non-negative margin parameter $\nu$ is introduced to resolve the scaling ambiguity and to place a low-density region between the two classes, similar to the margin idea in SVMs, which is generally used for semi-supervised learning (Lawrence and Jordan 2005). $\delta(\cdot)$ represents the Kronecker delta function that returns 1 if its argument is true and 0 otherwise.
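For intuition, the following toy sketch (our own illustration, not taken from the paper or its implementation) draws a single sample from this generative hierarchy for one task, using the shape/scale convention for the gamma distribution stated above:

```python
import numpy as np

def sample_task(K_t, R=2, alpha=1.0, beta=1.0, sigma_h=0.1, nu=1.0, seed=0):
    """Draw one sample from the KBTL generative assumptions for a single task."""
    rng = np.random.default_rng(seed)
    N_t = K_t.shape[0]

    # precisions and entries of the projection matrix A_t (N_t x R)
    Lambda_t = rng.gamma(shape=alpha, scale=beta, size=(N_t, R))
    A_t = rng.normal(0.0, 1.0 / np.sqrt(Lambda_t))

    # hidden representations H_t (R x N_t) with noise standard deviation sigma_h
    H_t = A_t.T @ K_t + sigma_h * rng.standard_normal((R, N_t))

    # shared classification parameters b and w
    gamma = rng.gamma(alpha, beta)
    b = rng.normal(0.0, 1.0 / np.sqrt(gamma))
    eta = rng.gamma(alpha, beta, size=R)
    w = rng.normal(0.0, 1.0 / np.sqrt(eta))

    # predicted outputs and labels; points with |f| <= nu fall in the margin region
    f_t = H_t.T @ w + b + rng.standard_normal(N_t)
    y_t = np.where(f_t > nu, +1, np.where(f_t < -nu, -1, 0))
    return y_t, f_t, H_t
```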

2.1 Inference Using Variational Bayes

To obtain an efficient inference mechanism, we formulate a deterministic variational approximation instead of a Gibbs sampling approach, which is computationally expensive (Gelfand and Smith 1990). Variational methods optimize a lower bound on the marginal likelihood, using an ensemble of factored posteriors to approximate the joint parameter distribution (Beal 2003). We can write the factorable ensemble approximation of the required posterior as

$p(\Theta, \Xi \mid \{K_t, y_t\}_{t=1}^T) \approx q(\Theta, \Xi) = \prod_{t=1}^T \big[ q(\Lambda_t)\, q(A_t)\, q(H_t) \big]\, q(\gamma)\, q(\eta)\, q(b, w) \prod_{t=1}^T q(f_t)$

and define each factor in the ensemble just like its full conditional distribution:

$q(\Lambda_t) = \prod_{i \in \mathcal{I}_t} \prod_{s=1}^R \mathcal{G}(\lambda_{t,s}^{i}; \alpha(\lambda_{t,s}^{i}), \beta(\lambda_{t,s}^{i}))$
$q(A_t) = \prod_{s=1}^R \mathcal{N}(a_{t,s}; \mu(a_{t,s}), \Sigma(a_{t,s}))$
$q(H_t) = \prod_{i \in \mathcal{I}_t} \mathcal{N}(h_{t,i}; \mu(h_{t,i}), \Sigma(h_{t,i}))$
$q(\gamma) = \mathcal{G}(\gamma; \alpha(\gamma), \beta(\gamma))$
$q(\eta) = \prod_{s=1}^R \mathcal{G}(\eta_s; \alpha(\eta_s), \beta(\eta_s))$
$q(b, w) = \mathcal{N}\!\left( \begin{bmatrix} b \\ w \end{bmatrix}; \mu(b, w), \Sigma(b, w) \right)$
$q(f_t) = \prod_{i \in \mathcal{I}_t} \mathcal{TN}(f_{t,i}; \mu(f_{t,i}), \Sigma(f_{t,i}), \rho(f_{t,i})),$

where $\alpha(\cdot)$, $\beta(\cdot)$, $\mu(\cdot)$, and $\Sigma(\cdot)$ denote the shape parameter, the scale parameter, the mean vector, and the covariance matrix for their arguments, respectively. $\mathcal{TN}(\cdot; \mu, \Sigma, \rho(\cdot))$ denotes the truncated normal distribution with mean vector $\mu$, covariance matrix $\Sigma$, and truncation rule $\rho(\cdot)$ such that $\mathcal{TN}(\cdot; \mu, \Sigma, \rho(\cdot)) \propto \mathcal{N}(\cdot; \mu, \Sigma)$ if $\rho(\cdot)$ is true and $\mathcal{TN}(\cdot; \mu, \Sigma, \rho(\cdot)) = 0$ otherwise.

We can bound the marginal likelihood using Jensen's inequality:

$\log p(\{y_t\}_{t=1}^T \mid \{K_t\}_{t=1}^T) \geq \mathbb{E}_{q(\Theta, \Xi)}\big[\log p(\{y_t\}_{t=1}^T, \Theta, \Xi \mid \{K_t\}_{t=1}^T)\big] - \mathbb{E}_{q(\Theta, \Xi)}\big[\log q(\Theta, \Xi)\big]$

and optimize this bound by maximizing with respect to each factor separately until convergence. The approximate posterior distribution of a specific factor $\tau$ can be found as

$q(\tau) \propto \exp\!\Big( \mathbb{E}_{q(\{\Theta, \Xi\} \setminus \tau)}\big[\log p(\{y_t\}_{t=1}^T, \Theta, \Xi \mid \{K_t\}_{t=1}^T)\big] \Big).$


For our proposed model, thanks to conjugacy, the resulting approximate posterior distribution of each factor follows the same distribution as the corresponding factor.
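In code, the ensemble of factors can be kept as a set of sufficient statistics that the updates in the next section overwrite in turn; a minimal sketch of such a container, under assumed shapes and with our own naming, is:

```python
import numpy as np

def initialize_factors(kernels, R, seed=0):
    """Containers for the factors of q(Theta, Xi); shapes follow Section 1.3.
    Gamma factors start at shape/scale (1, 1), covariances at the identity."""
    rng = np.random.default_rng(seed)
    q = {"gamma": np.array([1.0, 1.0]),        # alpha(gamma), beta(gamma)
         "eta": np.ones((R, 2)),               # rows: alpha(eta_s), beta(eta_s)
         "bw_mean": np.zeros(R + 1),           # mu(b, w)
         "bw_cov": np.eye(R + 1)}              # Sigma(b, w)
    for t, K_t in enumerate(kernels):
        N_t = K_t.shape[0]
        q[f"Lambda_{t}"] = np.ones((N_t, R, 2))           # alpha, beta per entry
        q[f"A_mean_{t}"] = rng.standard_normal((N_t, R))  # mu(a_{t,s}) as columns
        q[f"A_cov_{t}"] = np.stack([np.eye(N_t)] * R)     # Sigma(a_{t,s})
        q[f"H_mean_{t}"] = rng.standard_normal((R, N_t))  # mu(h_{t,i}) as columns
        q[f"H_cov_{t}"] = np.stack([np.eye(R)] * N_t)     # Sigma(h_{t,i})
        q[f"f_mean_{t}"] = np.zeros(N_t)                  # mu(f_{t,i})
    return q

# each variational iteration then cycles through the factors, recomputing one
# factor from the current expectations of the others until convergence
```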

2.2 Inference Details for Binary Classification

The approximate posterior distributions of the precision priors can be updated as

$\alpha(\lambda_{t,s}^{i}) = \alpha_\lambda + 1/2$
$\beta(\lambda_{t,s}^{i}) = \big( 1/\beta_\lambda + \langle (a_{t,s}^{i})^2 \rangle / 2 \big)^{-1}$
$\alpha(\gamma) = \alpha_\gamma + 1/2$
$\beta(\gamma) = \big( 1/\beta_\gamma + \langle b^2 \rangle / 2 \big)^{-1}$
$\alpha(\eta_s) = \alpha_\eta + 1/2$
$\beta(\eta_s) = \big( 1/\beta_\eta + \langle w_s^2 \rangle / 2 \big)^{-1},$

where $\langle g(\cdot) \rangle$ denotes the posterior expectation as usual, i.e., $\mathbb{E}_{q(\cdot)}[g(\cdot)]$.

The approximate posterior distributions of the task-specific projection matrices can be updated as

$\Sigma(a_{t,s}) = \big( \mathrm{diag}(\langle \lambda_{t,s} \rangle) + K_t K_t^\top / \sigma_h^2 \big)^{-1}$
$\mu(a_{t,s}) = \Sigma(a_{t,s})\, K_t \langle (h_t^{s})^\top \rangle / \sigma_h^2$

and the approximate posterior distribution of the hidden representation for each data point can be updated as

$\Sigma(h_{t,i}) = \big( I / \sigma_h^2 + \langle w w^\top \rangle \big)^{-1}$
$\mu(h_{t,i}) = \Sigma(h_{t,i}) \big( \langle A_t^\top \rangle k_{t,i} / \sigma_h^2 + \langle f_{t,i} \rangle \langle w \rangle - \langle b w \rangle \big).$

Note that the bias parameter $b$ and the vector of weight parameters $w$ are shared across the tasks.

The joint approximate posterior distribution of $b$ and $w$ can be updated as

$\Sigma(b, w) = \begin{bmatrix} \langle \gamma \rangle + \sum_{t=1}^T N_t & \sum_{t=1}^T \mathbf{1}^\top \langle H_t^\top \rangle \\ \sum_{t=1}^T \langle H_t \rangle \mathbf{1} & \mathrm{diag}(\langle \eta \rangle) + \sum_{t=1}^T \langle H_t H_t^\top \rangle \end{bmatrix}^{-1}$
$\mu(b, w) = \Sigma(b, w) \begin{bmatrix} \sum_{t=1}^T \mathbf{1}^\top \langle f_t \rangle \\ \sum_{t=1}^T \langle H_t \rangle \langle f_t \rangle \end{bmatrix},$

where it can be seen that the inference mechanism transfers information between the tasks because all of them update $q(b, w)$ together.

The approximate posterior distribution of the predicted outputs can be updated as

$\Sigma(f_{t,i}) = 1$
$\mu(f_{t,i}) = \langle w \rangle^\top \langle h_{t,i} \rangle + \langle b \rangle \qquad \rho(f_{t,i}) \colon f_{t,i} y_{t,i} > \nu,$

where we can fortunately calculate the expectation of the truncated normal distribution in closed form.
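The $q(b, w)$ update is where transfer across tasks is most visible: every task contributes its sufficient statistics to the same posterior. A sketch of just this update (our own Python, written directly from the equations above; expectations are passed in as arguments, and $\langle H_t H_t^\top \rangle = \langle H_t \rangle \langle H_t \rangle^\top + \sum_i \Sigma(h_{t,i})$ under the factorized posterior):

```python
import numpy as np

def update_b_w(H_means, H_covs, f_means, gamma_mean, eta_mean):
    """Variational update for q(b, w), pooling statistics over all tasks.

    H_means[t]: (R x N_t) posterior means <H_t>
    H_covs[t]:  (N_t x R x R) posterior covariances Sigma(h_{t,i})
    f_means[t]: (N_t,) posterior means <f_t>
    gamma_mean: scalar <gamma>;  eta_mean: (R,) vector <eta>
    """
    R = H_means[0].shape[0]
    precision = np.zeros((R + 1, R + 1))
    precision[0, 0] = gamma_mean
    precision[1:, 1:] = np.diag(eta_mean)
    rhs = np.zeros(R + 1)
    for H_t, S_t, f_t in zip(H_means, H_covs, f_means):
        precision[0, 0] += H_t.shape[1]                      # + N_t
        precision[0, 1:] += H_t.sum(axis=1)                  # 1^T <H_t^T>
        precision[1:, 0] += H_t.sum(axis=1)                  # <H_t> 1
        precision[1:, 1:] += H_t @ H_t.T + S_t.sum(axis=0)   # <H_t H_t^T>
        rhs[0] += f_t.sum()                                  # 1^T <f_t>
        rhs[1:] += H_t @ f_t                                 # <H_t> <f_t>
    cov = np.linalg.inv(precision)                           # Sigma(b, w)
    mean = cov @ rhs                                         # mu(b, w)
    return mean, cov
```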


Figure 3: Graphical model of kernelized Bayesian transfer learning for multiclass classification.

2.3 Prediction

We can replace $p(A_t \mid \{K_u, y_u\}_{u=1}^T)$ with its approximate posterior distribution $q(A_t)$ and obtain the predictive distribution of the hidden representation $h_{t,\star}$ for a new data point $x_{t,\star}$ as

$p(h_{t,\star} \mid k_{t,\star}, \{K_u, y_u\}_{u=1}^T) = \prod_{s=1}^R \mathcal{N}\big(h_{t,\star}^{s}; \mu(a_{t,s})^\top k_{t,\star},\ \sigma_h^2 + k_{t,\star}^\top \Sigma(a_{t,s}) k_{t,\star}\big).$

The predictive distribution of the predicted output $f_{t,\star}$ can also be found by replacing $p(b, w \mid \{K_t, y_t\}_{t=1}^T)$ with its approximate posterior distribution $q(b, w)$:

$p(f_{t,\star} \mid h_{t,\star}, \{K_u, y_u\}_{u=1}^T) = \mathcal{N}\!\left( f_{t,\star};\ \mu(b, w)^\top \begin{bmatrix} 1 \\ h_{t,\star} \end{bmatrix},\ 1 + \begin{bmatrix} 1 & h_{t,\star}^\top \end{bmatrix} \Sigma(b, w) \begin{bmatrix} 1 \\ h_{t,\star} \end{bmatrix} \right)$

and the predictive distribution of the class label $y_{t,\star}$ can be found using the predicted output distribution:

$p(y_{t,\star} = +1 \mid f_{t,\star}, \{K_u, y_u\}_{u=1}^T) = Z_{t,\star}^{-1}\, \Phi\!\left( \frac{\mu(f_{t,\star}) - \nu}{\Sigma(f_{t,\star})} \right),$

where $Z_{t,\star}$ is the normalization coefficient calculated for the test data point and $\Phi(\cdot)$ is the standardized normal cumulative distribution function.
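A sketch of these prediction steps for a single test point (our own Python, with scipy's normal CDF standing in for $\Phi$; as a simplification, the predictive mean of $h_{t,\star}$ is plugged into the classifier, the per-component predictive variances are returned alongside, and the class probability is normalized over the two admissible labels):

```python
import numpy as np
from scipy.stats import norm

def predict_binary(k_star, A_mean, A_cov, bw_mean, bw_cov, sigma_h=0.1, nu=1.0):
    """Predictive class probability for one test point of task t.

    k_star:  (N_t,) kernel values between the test point and task-t training points
    A_mean:  (N_t x R) columns are mu(a_{t,s})
    A_cov:   (R x N_t x N_t) stack of Sigma(a_{t,s})
    bw_mean, bw_cov: posterior mean and covariance of (b, w)
    """
    R = A_mean.shape[1]
    h_mean = A_mean.T @ k_star                                 # E[h_star]
    h_var = sigma_h ** 2 + np.array([k_star @ A_cov[s] @ k_star
                                     for s in range(R)])       # Var[h_star_s]
    z = np.concatenate(([1.0], h_mean))                        # plug in the mean of h_star
    f_mean = bw_mean @ z
    f_std = np.sqrt(1.0 + z @ bw_cov @ z)
    # unnormalized probabilities of the two admissible labels, then normalize
    p_pos = norm.cdf((f_mean - nu) / f_std)
    p_neg = norm.cdf((-f_mean - nu) / f_std)
    return p_pos / (p_pos + p_neg), h_mean, h_var
```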

2.4 Multiclass Classification

In multiclass classification, we consider classification problems with more than two classes (i.e., $K > 2$). The most straightforward strategy is to train a distinct classifier for each class that separates this particular class from the remaining ones (i.e., one-versus-all classification). However, if we used our proposed method in such a setup, we would obtain different hidden representations for each class. Instead, we choose to learn a shared subspace for all classes and to apply the one-versus-all classification strategy in this subspace.

Figure 3 gives the graphical model of KBTL for multiclass classification. There is a shared hidden representation space across the classes, but each class has its own set of classification parameters $\{b_c \in \mathbb{R}, w_c \in \mathbb{R}^R\}$ with their corresponding priors.


The distributional assumptions of the coupled classification part are modified as

$\gamma_c \sim \mathcal{G}(\gamma_c; \alpha_\gamma, \beta_\gamma) \quad \forall c$
$b_c \mid \gamma_c \sim \mathcal{N}(b_c; 0, \gamma_c^{-1}) \quad \forall c$
$\eta_{c,s} \sim \mathcal{G}(\eta_{c,s}; \alpha_\eta, \beta_\eta) \quad \forall (c, s)$
$w_{c,s} \mid \eta_{c,s} \sim \mathcal{N}(w_{c,s}; 0, \eta_{c,s}^{-1}) \quad \forall (c, s)$
$f_{t,c,i} \mid b_c, w_c, h_{t,i} \sim \mathcal{N}(f_{t,c,i}; w_c^\top h_{t,i} + b_c, 1) \quad \forall (t, c, i)$
$y_{t,c,i} \mid f_{t,c,i} \sim \delta(f_{t,c,i} y_{t,c,i} > \nu) \quad \forall (t, c, i).$

By modifying the update equations of KBTL for binary classification, we can derive the update equations for the approximate posterior distributions of the classification parameters and predicted outputs, namely, $\{q(\gamma_c), q(\eta_c), q(b_c, w_c)\}_{c=1}^K$ and $\{q(f_{t,c})\}_{t=1,c=1}^{T,K}$, as

$\alpha(\gamma_c) = \alpha_\gamma + 1/2$
$\beta(\gamma_c) = \big( 1/\beta_\gamma + \langle b_c^2 \rangle / 2 \big)^{-1}$
$\alpha(\eta_{c,s}) = \alpha_\eta + 1/2$
$\beta(\eta_{c,s}) = \big( 1/\beta_\eta + \langle w_{c,s}^2 \rangle / 2 \big)^{-1}$
$\Sigma(b_c, w_c) = \begin{bmatrix} \langle \gamma_c \rangle + \sum_{t=1}^T N_t & \sum_{t=1}^T \mathbf{1}^\top \langle H_t^\top \rangle \\ \sum_{t=1}^T \langle H_t \rangle \mathbf{1} & \mathrm{diag}(\langle \eta_c \rangle) + \sum_{t=1}^T \langle H_t H_t^\top \rangle \end{bmatrix}^{-1}$
$\mu(b_c, w_c) = \Sigma(b_c, w_c) \begin{bmatrix} \sum_{t=1}^T \mathbf{1}^\top \langle f_{t,c} \rangle \\ \sum_{t=1}^T \langle H_t \rangle \langle f_{t,c} \rangle \end{bmatrix}$
$\Sigma(f_{t,c,i}) = 1$
$\mu(f_{t,c,i}) = \langle w_c \rangle^\top \langle h_{t,i} \rangle + \langle b_c \rangle \qquad \rho(f_{t,c,i}) \colon f_{t,c,i} y_{t,c,i} > \nu.$

The covariance update equation for the approximate posterior distribution of the hidden representation for each data point can be written as

$\Sigma(h_{t,i}) = \left( I / \sigma_h^2 + \sum_{c=1}^K \langle w_c w_c^\top \rangle \right)^{-1}$

and the mean update equation can be given as

$\mu(h_{t,i}) = \Sigma(h_{t,i}) \left( \langle A_t^\top \rangle k_{t,i} / \sigma_h^2 + \sum_{c=1}^K \big( \langle f_{t,c,i} \rangle \langle w_c \rangle - \langle b_c w_c \rangle \big) \right),$

where we use the classification parameters of all classes together. The update equations of the task-specific projection matrices and their priors remain intact.
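At test time the multiclass model scores a point against every class's coupled classifier in the shared subspace; since the paper applies the one-versus-all strategy, a natural decision rule (our choice here, not spelled out in the text) is to pick the class with the largest predicted output:

```python
import numpy as np

def predict_multiclass(h_star, bw_means):
    """One-versus-all decision in the shared subspace.

    h_star:   (R,) hidden representation of the test point
    bw_means: (K x (R + 1)) rows are mu(b_c, w_c), one per class
    """
    z = np.concatenate(([1.0], h_star))   # [1; h_star]
    scores = bw_means @ z                 # predicted outputs f_{t,c,*} for all classes
    return int(np.argmax(scores))

# toy usage: three classes in a two-dimensional shared subspace
print(predict_multiclass(np.array([0.3, -1.2]),
                         np.array([[ 0.0,  1.0,  0.5],
                                   [ 0.1, -0.5,  1.0],
                                   [-0.2,  0.3, -0.8]])))
```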

2.5 Comparison to Related Work

Kulis, Saenko, and Darrell (2011) and Pan et al. (2011) first perform dimensionality reduction to find a shared subspace and then learn a classifier in this subspace. The dimensionality reduction step has its own target function, different from the one used by the classifier in the shared subspace. Hence, coupled training of these two steps, as we do in our method, may improve the overall system performance.

Hoffman et al. (2013) map data points from the source domain into the target domain, which requires learning a very large projection matrix for high-dimensional feature representations. This can be avoided by first projecting both domains into low-dimensional spaces using, for example, principal component analysis (PCA), at the cost of information loss. In contrast, our method can work directly with high-dimensional feature representations using the kernel trick.

Bahadori, Liu, and Zhang (2011) and Duan, Xu, and Tsang (2012) combine the dimensionality reduction and classification steps by formulating joint optimization problems, which are non-convex, require time-consuming alternating optimization strategies, and are limited to two domains. Our method can use more than two domains and is able to produce probabilistic outputs.

Han, Liao, and Carin (2012) formulate a generative model, which is limited to low-dimensional problems and can only produce linear mappings due to its generative nature, and find a maximum a posteriori estimate of the model parameters using an expectation-maximization algorithm. In contrast, our discriminative model is able to scale to high-dimensional problems and to find nonlinear mappings using the kernel trick, and it finds a fully Bayesian solution for its parameters using a variational approximation algorithm.

3 Experiments

We first test our new algorithm, KBTL, on 12 benchmark domain adaptation experiments derived from two computer vision data sets to illustrate its generalization performance and compare its results to previously reported results on these experiments. We then perform transfer learning experiments with heterogeneous populations for two different cancer types to show the suitability of our algorithm in a challenging and nonstandard application scenario. Our Matlab implementations for binary and multiclass classification are available at https://github.com/mehmetgonen/kbtl.

3.1 Computer Vision Experiments

In this first set of experiments, we use the Office (Saenko et al. 2010) and Caltech-256 (Griffin, Holub, and Perona 2007) data sets. The Office data set contains images of office objects from 31 different categories (e.g., calculator, keyboard, monitor, mug, etc.) under three distinct domains: amazon, webcam, and dslr. The images from different domains have varying characteristics such as illumination and background. For example, the amazon domain has "controlled" product images containing a single, centered object, usually on a white background, whereas the webcam and dslr domains have "uncontrolled" images with lighting variations and background changes. The heterogeneity of images across the amazon, webcam, and dslr domains can easily be seen in Figure 4. The Caltech-256 data set contains images from 256 categories under a diverse set of lighting conditions, poses, and backgrounds.

In our experiments, we use the ten common categories (i.e., backpack, bicycle, calculator, headphones, keyboard,



Figure 4: Sample images from the amazon, webcam, and dslr domains in the Office data set of Saenko et al. (2010).

Table 1: Multiclass classification accuracies for supervised domain adaptation experiments on the object recognition data sets. The results for the six baseline algorithms are directly taken from Hoffman et al. (2013). The results of KBTL(L) and KBTL(G) are marked in bold face if they are better than all of the baseline algorithms.

Source    Target    SVMS       SVMT       ARCT       HFA        GFK        MMDT       KBTL(L)    KBTL(G)
webcam    amazon    35.7±0.4   45.6±0.7   43.4±0.5   45.9±0.7   44.1±0.4   47.7±0.9   52.2±0.8   53.4±0.8
dslr      amazon    34.0±0.3   45.7±0.9   42.5±0.5   45.8±0.9   45.7±0.6   46.9±1.0   50.9±0.8   51.9±0.9
caltech   amazon    35.9±0.4   45.3±0.9   44.1±0.6   45.5±0.9   44.7±0.8   49.4±0.8   52.4±1.1   52.9±1.0
amazon    webcam    33.9±0.7   62.4±0.9   55.7±0.9   61.8±1.1   58.6±1.0   64.6±1.2   69.0±1.0   69.8±1.1
dslr      webcam    74.3±0.5   62.1±0.8   78.3±0.5   62.1±0.7   76.5±0.5   74.1±0.8   69.1±0.9   70.0±1.0
caltech   webcam    30.8±1.1   60.3±1.0   55.9±1.0   60.5±0.9   63.7±0.8   63.8±1.1   67.0±1.1   68.5±1.2
amazon    dslr      35.0±0.8   55.9±0.8   50.2±0.7   52.7±0.9   50.7±0.8   56.7±1.3   57.8±1.1   57.6±1.1
webcam    dslr      66.6±0.7   55.1±0.8   71.3±0.8   51.7±1.0   70.5±0.7   67.0±1.1   60.8±1.0   61.8±1.3
caltech   dslr      35.6±0.7   55.8±0.9   50.6±0.8   51.9±1.1   57.7±1.1   56.5±0.9   57.4±1.3   58.8±1.1
amazon    caltech   35.1±0.3   32.0±0.8   37.0±0.4   31.1±0.6   36.0±0.5   36.4±0.8   35.8±0.8   35.9±0.7
webcam    caltech   31.3±0.4   30.4±0.7   31.9±0.5   29.4±0.6   31.1±0.6   32.2±0.8   33.6±0.9   34.0±0.9
dslr      caltech   31.4±0.3   31.7±0.6   33.5±0.4   31.0±0.5   32.9±0.5   34.1±0.8   35.0±0.7   35.9±0.6
Mean Performance    40.0±0.6   48.5±0.8   49.5±0.6   47.4±0.8   51.0±0.7   52.5±1.0   53.4±0.9   54.2±1.0

laptop, monitor, mouse, mug, and projector) shared by Office and Caltech-256. We use the 800-dimensional SURF-BoW features provided by Gong et al. (2012) for all four domains. To obtain results comparable to previous studies, we follow the experimental setup used by Saenko et al. (2010), Gong et al. (2012), and Hoffman et al. (2013), and perform experiments for the 20 random train/test splits provided by Hoffman et al. (2013). In the 12 domain adaptation experiments defined between all possible pairs of the four domains, there are few labeled data points (three images) available for each category in the target domain, whereas we have many more labeled data points (20 images for amazon and eight images for the others) in the source domain. The remaining data points from the target domain are used in the test phase. For each experiment, we report the mean and standard deviation of the multiclass classification accuracies over the 20 replications.

We compare eight algorithms (i.e., six baseline algorithms and our algorithm with two different kernels):

(i) SVMS: an SVM trained on the source domain,

(ii) SVMT: an SVM trained on the target domain,

(iii) ARCT: the asymmetric regularized cross-domain transformation algorithm proposed by Kulis, Saenko, and Darrell (2011),

(iv) HFA: the heterogeneous feature adaptation algorithm proposed by Duan, Xu, and Tsang (2012),

(v) GFK: the geodesic flow kernel algorithm proposed by Gong et al. (2012),

(vi) MMDT: the max-margin domain transforms algorithm proposed by Hoffman et al. (2013),

(vii) KBTL(L): our algorithm for multiclass classification with the linear kernel,

(viii) KBTL(G): our algorithm for multiclass classification with the Gaussian kernel.

The results for the baseline algorithms are directly taken from Hoffman et al. (2013).

For our algorithm, the hyper-parameter values are selected as $(\alpha_\eta, \beta_\eta) = (\alpha_\gamma, \beta_\gamma) = (\alpha_\lambda, \beta_\lambda) = (1, 1)$, $\sigma_h = 0.1$, and $\nu = 1$. The number of components in the hidden representation space is selected as $R = 20$. We run the variational inference scheme for 200 iterations. For KBTL(L), we calculate a linear kernel on each domain and normalize the kernel matrices to unit maximum value (i.e., dividing each kernel matrix by its maximum value) in order to eliminate scaling issues. For KBTL(G), we calculate a Gaussian kernel on each domain, where we set the kernel width to the mean of the pairwise distances between the training data points.


Table 2: Number of data points in the source and target domains for the transfer learning experiments.

                    CCLE (source)       TCGA (target)
Cancer   Gene       Mutant    WT        Mutant    WT
GBM      EGFR       5         36        47        102
GBM      TP53       27        14        45        104
LUAD     TP53       34        9         79        90
LUAD     KRAS       17        26        46        123

Table 1 reports the multiclass classification accuracies for the 12 supervised domain adaptation experiments. KBTL(L) outperforms all baseline algorithms on eight out of 12 experiments, whereas KBTL(G) outperforms them on nine experiments. They also improve the mean performance by 0.9 and 1.7 percentage points, respectively, compared to the best-performing baseline algorithm. These results validate the better generalization performance of our algorithm irrespective of the kernel function used. Our algorithm does not perform well on the two experiments defined between webcam and dslr. This can be explained by the similarity between these two domains, as SVMS performs significantly better than SVMT on these experiments. That is why nearest-neighbor-based algorithms, i.e., ARCT and GFK, work better for such scenarios, but margin-based algorithms, i.e., HFA, MMDT, KBTL(L), and KBTL(G), do not.

3.2 Cancer Biology Experiments

In this second set of experiments, we test whether we can transfer information from in-vitro cancer cell line data to improve the accuracy of gene-expression-based predictors of the mutation status of important cancer genes in primary patient tumor samples. We use primary tumor data from The Cancer Genome Atlas (TCGA) (TCGA Research Network 2008) and cancer cell line data from the Cancer Cell Line Encyclopedia (CCLE) (Barretina et al. 2012). We define the learning task as predicting the mutation status (Mutant versus Wild Type) of a particular gene using gene expression profiles. Specifically, we use expression profiles of 20,530 genes in TCGA from Illumina HiSeq experiments and expression profiles of 18,897 genes in CCLE from Affymetrix U133+2 microarrays. From TCGA, we use primary tumors from two cancer cohorts: glioblastoma multiforme (GBM) and lung adenocarcinoma (LUAD). From CCLE, we use cell lines with the corresponding tissue types: central nervous system glioma and lung adenocarcinoma. We perform four transfer learning experiments in which we try to transfer information from CCLE to TCGA (i.e., from cell lines to primary tumors). For both cancer types, we identify the two genes that are most frequently mutated in TCGA among those listed in The Cancer Gene Census (Futreal et al. 2004). Table 2 summarizes the details of these four experiments.

We compare three algorithms:

(i) BRVM: the Bayesian relevance vector machine algorithm proposed by Bishop and Tipping (2000), trained on the target domain (i.e., no transfer),

(ii) MMDT: the max-margin domain transforms algorithm proposed by Hoffman et al. (2013) (i.e., the best baseline algorithm in the computer vision experiments),

(iii) KBTL: our algorithm for binary classification, trained on both domains together.

For BRVM, we use our own Matlab implementation and set the hyper-parameter values to the same defaults as in KBTL, i.e., $(\alpha_\gamma, \beta_\gamma) = (\alpha_\lambda, \beta_\lambda) = (1, 1)$. For MMDT, we use the Matlab implementation provided by Hoffman et al. (2013) and, as they suggest, reduce the dimensionality of each domain to 20 using PCA to obtain reasonable running times on the high-dimensional cancer data sets. For BRVM and KBTL, we calculate a linear kernel for each domain and normalize the kernel matrix to unit maximum value. For our algorithm, the number of components in the hidden representation space is selected as $R = 2$.

We perform experiments for 50 random train/test splits. For each replication, we randomly select 25 per cent of TCGA, with stratification, as the test set and use 25, 50, or 75 per cent as the training set, whereas we use all data points in CCLE for training. The training sets are normalized separately to have zero mean and unit standard deviation, and the test set in TCGA is then normalized using the mean and standard deviation of the original training set in TCGA.
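A sketch of this split-and-normalize protocol (our own code; stratification by mutation status and the varying training fractions are omitted for brevity):

```python
import numpy as np

def split_and_normalize(X_tcga, X_ccle, test_fraction=0.25, seed=0):
    """Z-score each training set with its own statistics; normalize the TCGA
    test set with the TCGA training statistics, as described above."""
    rng = np.random.default_rng(seed)
    n = X_tcga.shape[0]
    test_idx = rng.choice(n, size=int(round(test_fraction * n)), replace=False)
    train_idx = np.setdiff1d(np.arange(n), test_idx)

    mu = X_tcga[train_idx].mean(axis=0)
    sd = X_tcga[train_idx].std(axis=0) + 1e-8          # guard against constant genes
    X_tcga_train = (X_tcga[train_idx] - mu) / sd
    X_tcga_test = (X_tcga[test_idx] - mu) / sd         # training statistics reused
    X_ccle_train = (X_ccle - X_ccle.mean(axis=0)) / (X_ccle.std(axis=0) + 1e-8)
    return X_tcga_train, X_tcga_test, X_ccle_train, train_idx, test_idx
```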

Figure 5 compares the performance of BRVM, MMDT, and KBTL in terms of area under the ROC curve (AUC) values, with varying training set sizes for the target domain, on the four transfer learning experiments using box-and-whisker plots. It also compares KBTL and the best baseline algorithm for each experiment using scatter plots. We clearly see that KBTL is statistically significantly superior to BRVM and MMDT in all of the experiments according to the paired t-test with p < 0.05. Note that knowledge transfer happens in all of the experiments for our algorithm (i.e., KBTL is consistently better than BRVM), whereas MMDT obtains worse AUC values than BRVM in some of the experiments (e.g., the KRAS gene on the LUAD cohort). These results show that KBTL is able to improve the predictive performance on primary tumors using cell lines as an auxiliary data source. Applications of this approach may have important clinical implications due to the difficulty and expense of obtaining data on primary tumors compared to cell lines.

4 Conclusions

We introduce a kernelized Bayesian transfer learning framework that can transfer information between tasks with heterogeneous feature representations by mapping their data points into a shared subspace with task-specific projection matrices and learning a coupled classification model in this subspace. Our two main contributions are: (i) formulating novel probabilistic models that couple these two parts for binary and multiclass classification in a principled way and (ii) developing very efficient variational approximation procedures for these probabilistic models. We illustrate the practical importance of the method in two scenarios: (i) supervised domain adaptation experiments on object recognition data sets and (ii) transfer learning experiments with


[Figure 5 panels, by cohort and gene: Cancer: GBM, Gene: EGFR; Cancer: GBM, Gene: TP53; Cancer: LUAD, Gene: TP53; Cancer: LUAD, Gene: KRAS.]

Figure 5: Comparison between (A) BRVM, (B) MMDT, and (C) KBTL in terms of AUC values on the GBM and LUAD cohorts with varying training set sizes for the target domain.

heterogeneous populations for two cancer types. We show that our method outperforms state-of-the-art domain adaptation algorithms in the computer vision experiments. We also show that our method is able to transfer information between patient-derived primary tumor data and in-vitro-derived cancer cell line data.

Three interesting topics for future research are: (i) performing task grouping to cluster tasks with similar characteristics together, which would eliminate negative transfer between tasks with different characteristics, (ii) extending the coupled classification part of our algorithms using the low-density assumption of Lawrence and Jordan (2005) in order to make use of unlabeled data, which leads to semi-supervised variants of our algorithms, and (iii) combining multiple kernels to integrate different feature representations or similarities for each task to learn a better task-specific kernel function, known as multiple kernel learning (Gönen and Alpaydın 2011), using the formulation of Gönen (2012).

References

Albert, J. H., and Chib, S. 1993. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88:669–679.
Argyriou, A.; Evgeniou, T.; and Pontil, M. 2008. Convex multi-task feature learning. Machine Learning 73(3):243–272.
Argyriou, A.; Maurer, A.; and Pontil, M. 2008. An algorithm for transfer learning in a heterogeneous environment. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Part I, 71–85.
Bahadori, M. T.; Liu, Y.; and Zhang, D. 2011. Learning with minimum supervision: A general framework for transductive transfer learning. In Proceedings of the 11th IEEE International Conference on Data Mining, 61–70.
Barretina, J.; Caponigro, G.; Stransky, N.; Venkatesan, K.; Margolin, A. A.; et al. 2012. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483:603–607.
Beal, M. J. 2003. Variational Algorithms for Approximate Bayesian Inference. Ph.D. Dissertation, The Gatsby Computational Neuroscience Unit, University College London.
Ben-David, S.; Blitzer, J.; Crammer, K.; and Pereira, F. 2007. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems 19, 137–144.
Bergamo, A., and Torresani, L. 2010. Exploiting weakly-labeled web images to improve object classification: A domain adaptation approach. In Advances in Neural Information Processing Systems 23, 181–189.
Bishop, C. M., and Tipping, M. E. 2000. Variational relevance vector machines. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, 46–53.
Blitzer, J.; McDonald, R.; and Pereira, F. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 120–128.
Caruana, R. 1997. Multitask learning. Machine Learning 28(1):41–75.
Dai, W.; Chen, Y.; Xue, G.-R.; Yang, Q.; and Yu, Y. 2009. Translated learning: Transfer learning across different feature spaces. In Advances in Neural Information Processing Systems 21, 353–360.
Daumé III, H. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, 256–263.
Duan, L.; Tsang, I. W.; Xu, D.; and Maybank, S. J. 2009. Domain transfer SVM for video concept detection. In Proceedings of the 22nd IEEE Conference on Computer Vision and Pattern Recognition, 1375–1381.
Duan, L.; Xu, D.; Tsang, I. W.; and Luo, J. 2010. Visual event recognition in videos by learning from Web data. In Proceedings of the 23rd IEEE Conference on Computer Vision and Pattern Recognition, 1959–1966.
Duan, L.; Xu, D.; and Tsang, I. W. 2012. Learning with augmented features for heterogeneous domain adaptation. In Proceedings of the 29th International Conference on Machine Learning, 711–718.
Futreal, P. A.; Coin, L.; Marshall, M.; Down, T.; Hubbard, T.; et al. 2004. A census of human cancer genes. Nature Reviews Cancer 4:177–183.
Gelfand, A. E., and Smith, A. F. M. 1990. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85:398–409.
Gönen, M., and Alpaydın, E. 2011. Multiple kernel learning algorithms. Journal of Machine Learning Research 12(Jul):2211–2268.
Gönen, M. 2012. Bayesian efficient multiple kernel learning. In Proceedings of the 29th International Conference on Machine Learning, 1–8.
Gong, B.; Shi, Y.; Sha, F.; and Grauman, K. 2012. Geodesic flow kernel for unsupervised domain adaptation. In Proceedings of the 25th IEEE Conference on Computer Vision and Pattern Recognition, 2066–2073.
Gopalan, R.; Li, R.; and Chellappa, R. 2011. Domain adaptation for object recognition: An unsupervised approach. In Proceedings of the 13th IEEE International Conference on Computer Vision, 999–1006.
Griffin, G.; Holub, A.; and Perona, P. 2007. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology.
Han, S.; Liao, X.; and Carin, L. 2012. Cross-domain multitask learning with latent probit models. In Proceedings of the 29th International Conference on Machine Learning, 73–80.
Hoffman, J.; Rodner, E.; Donahue, J.; Darrell, T.; and Saenko, K. 2013. Efficient learning of domain-invariant image representations. In Proceedings of the 1st International Conference on Learning Representations.
Jiang, W.; Zavesky, E.; Chang, S.-F.; and Loui, A. 2008. Cross-domain learning methods for high-level visual concept classification. In Proceedings of the 15th IEEE International Conference on Image Processing, 161–164.
Kulis, B.; Saenko, K.; and Darrell, T. 2011. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition, 1785–1792.
Lawrence, N. D., and Jordan, M. I. 2005. Semi-supervised learning via Gaussian processes. In Advances in Neural Information Processing Systems 17, 753–760.
Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10):1345–1359.
Pan, S. J.; Tsang, I. W.; Kwok, J. T.; and Yang, Q. 2011. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22(2):199–210.
Saenko, K.; Kulis, B.; Fritz, M.; and Darrell, T. 2010. Adapting visual category models to new domains. In Proceedings of the 11th European Conference on Computer Vision, Part IV, 213–226.
Schölkopf, B., and Smola, A. J. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press.
Shawe-Taylor, J., and Cristianini, N. 2004. Kernel Methods for Pattern Analysis. New York, NY, USA: Cambridge University Press.
TCGA Research Network. 2008. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455:1061–1068.
Tipping, M. E. 2001. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1:211–244.
Vapnik, V. N. 1998. Statistical Learning Theory. New York, NY, USA: John Wiley & Sons.
