
Transfer Learning From Multiple Source Domains via Consensus Regularization

Ping Luo, Fuzhen Zhuang, Hui Xiong, Yuhong Xiong, Qing He

CIKM 2008

Overview

• Introduction

• Preliminaries

• Consensus Regularization

• Experimental Evaluation

• Related Works

• Conclusions


Research Motivation (1)

[Figure: knowledge transfer from a single source domain to a target domain, versus knowledge transfer from multiple source domains (Source Domain 1, 2, 3) to a target domain]

• How to exploit the distribution differences among multiple source domains to boost the learning performance in a target domain

• How to deal with the situation where the source domains are geographically separated and subject to privacy concerns


Research Motivation (2)

• Motivating Examples
  – Web page classification
    • Label web pages from multiple different universities to find course main pages by text classification
    • Different universities use different terms to describe the course metadata
  – Video concept detection
    • Generalize models that detect semantic concepts from multiple sources of video data

• Common Features
  1. Multiple source domains with different data distributions
  2. Separated source domains


Challenges and Contributions

• New Challenges

- How to make good use of the distribution mismatch among multiple source domains to improve prediction performance on the target domain

- How to extend consensus regularization to a distributed implementation that modestly preserves privacy

• Contributions

- Propose a consensus regularization-based algorithm for transfer learning from multiple source domains

- The algorithm can be performed in a distributed, modestly privacy-preserving manner


Overview

• Introduction

• Preliminaries

• Consensus Regularization

• Experimental Evaluation

• Related Works

• Conclusions


Consensus Measuring (1)

• Example: a three-class classification problem, where three classifiers predict an instance x. The entropy-based consensus measure C_e is computed from the entropy E of the averaged prediction vector.

• Case 1: predictions (1, 0, 0), (1, 0, 0), (1, 0, 0)
  Average: (1, 0, 0)
  Entropy: E(1, 0, 0) = 0
  → minimal entropy, maximal consensus

• Case 2: predictions (1, 0, 0), (0, 1, 0), (0, 0, 1)
  Average: (1/3, 1/3, 1/3)
  Entropy: E(1/3, 1/3, 1/3)
  → maximal entropy, minimal consensus
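As a concrete illustration of the two cases above, here is a minimal Python sketch (not the authors' code) that averages the classifiers' prediction vectors and computes the entropy of the average; the function names are illustrative only.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector (natural log)."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def consensus_entropy(predictions):
    """Entropy of the element-wise average of m prediction vectors;
    lower entropy means higher consensus."""
    avg = np.mean(np.asarray(predictions, dtype=float), axis=0)
    return entropy(avg)

# Maximal consensus: all three classifiers agree -> entropy ~0.
print(consensus_entropy([(1, 0, 0), (1, 0, 0), (1, 0, 0)]))  # ~0.0
# Minimal consensus: the classifiers disagree completely -> entropy log(3).
print(consensus_entropy([(1, 0, 0), (0, 1, 0), (0, 0, 1)]))  # ~1.0986
```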


Consensus Measuring (2)

• Example: a two-class classification problem, where three classifiers predict an instance x

• Due to the computational cost of the entropy, for 2-entry probability distribution vectors we can simplify the consensus measure as

  C_s(p^(1), p^(2)) = (p^(1) - p^(2))^2 = (p^(1) - (1 - p^(1)))^2 = (2p^(1) - 1)^2

• Predictions: (0.75, 0.25), (0.45, 0.55), (0.9, 0.1)
  Average: (0.7, 0.3)
  C_s = (0.7 - 0.3)^2 = 0.16
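A minimal Python sketch (not the authors' code) of the simplified binary consensus measure C_s applied to the example above; the function name is illustrative.

```python
import numpy as np

def consensus_binary(probs_class1):
    """C_s = (2*p_bar - 1)^2, where p_bar is the average predicted
    probability of the first class over the m classifiers."""
    p_bar = float(np.mean(probs_class1))
    return (2.0 * p_bar - 1.0) ** 2

# Example from the slide: the classifiers output 0.75, 0.45 and 0.9
# for the first class, so the average is 0.7 and C_s = (0.7 - 0.3)^2.
print(consensus_binary([0.75, 0.45, 0.9]))  # ~0.16
```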


Logistic Regression [Davie et al., 2000]

Logistic regression is an approach to learning a classification model with discrete outputs.

• Given:
  – A training data set X, where each x is a vector of discrete or continuous random variables
  – Discrete outputs Y (the class labels)

• Maximize the following objective to obtain the model w:

  sum_{i=1}^{N} log [ 1 / (1 + exp(-y_i w^T x_i)) ] - (λ/2) w^T w

• Classification:

  P(y = 1 | x; w) = σ(w^T x) = 1 / (1 + exp(-w^T x))
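The two formulas above can be written as a short Python sketch (not the authors' code); the labels are assumed to be in {+1, -1} and lam plays the role of the regularization weight λ.

```python
import numpy as np

def regularized_log_likelihood(w, X, y, lam=1.0):
    """Objective to maximize: sum_i log sigmoid(y_i * w^T x_i) - (lam/2) * w^T w."""
    margins = y * (X @ w)
    return float(np.sum(-np.log1p(np.exp(-margins))) - 0.5 * lam * (w @ w))

def predict_proba(w, X):
    """P(y = 1 | x; w) = 1 / (1 + exp(-w^T x))."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))
```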


Overview

• Introduction

• Preliminaries

• Consensus Regularization

• Experimental Evaluation

• Related Works

• Conclusions


Problem Formulation (1)

• Given: Let D_s^1, ..., D_s^m be m source domains of labeled data, where the l-th source domain is represented by D_s^l = {(x_i^l, y_i^l)}, i = 1, ..., n_l.
  The unlabeled target domain is denoted by D_t = {x_i}, i = 1, ..., n.
  Assume that D_s^1, ..., D_s^m and D_t are of different but closely related distributions.

• Find: Train m classifiers h^1, ..., h^m such that
  – each h^i captures the knowledge from the i-th source domain, and
  – the classifiers achieve a high degree of consensus in their prediction results on the target domain.


Problem Formulation (2)

• Formulation: adapt the supervised learning framework with consensus regularization.

• Output m models h^1, ..., h^m, which maximize

  [ ∏_{l=1}^{m} P(h^l | D_s^l) ] · Consensus(h^1, ..., h^m | D_t)

  (Maximum A Posteriori) × (Maximum Consensus)

  where P(h^l | D_s^l) is the probability of the hypothesis h^l given the observed data set D_s^l, and Consensus(h^1, ..., h^m | D_t) is the consensus degree of the prediction results of these classifiers on the target domain.


Why Consensus Regularization (1)

In this study we focus on binary classification problems with the labels 1 and -1, and the number of classifiers m = 3.

The non-trivial classifier assumption can be restated as: P(h = u | Y = u) > 1/2, or equivalently P(h ≠ u | Y = u) < 1/2, for u ∈ {1, -1}.


Why Consensus Regularization (2)

Thus, minimizing the disagreement among the classifiers also decreases the classification error.


Consensus Regularization by Logistic Regression (1)

• The proposed consensus regularization framework outputs m logistic models w_1, ..., w_m by minimizing an objective that combines each model's regularized log-likelihood on its own source domain with the consensus of all models on the target domain.

• For the binary classification problem, the entropy-based consensus measure C_e is equivalent to C_s. Thus, the objective function can be rewritten in terms of C_s.
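A minimal sketch (an illustration under assumptions, not the paper's exact formula) of how such an objective can be assembled: each model is fit to its own source domain, and the simplified consensus measure C_s is accumulated over the unlabeled target data with a weight theta.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def consensus_regularized_objective(ws, sources, Xt, lam=1.0, theta=0.1):
    """ws: list of m weight vectors; sources: list of (X_l, y_l) with y in {+1, -1};
    Xt: unlabeled target-domain instances. Returns a value to be maximized
    (equivalently, minimize its negative)."""
    value = 0.0
    # MAP term: regularized log-likelihood of each model on its own source domain.
    for w, (X, y) in zip(ws, sources):
        margins = y * (X @ w)
        value += np.sum(-np.log1p(np.exp(-margins))) - 0.5 * lam * (w @ w)
    # Consensus term: C_s = (2*p_bar - 1)^2 on every target instance.
    p_bar = np.mean([sigmoid(Xt @ w) for w in ws], axis=0)
    value += theta * np.sum((2.0 * p_bar - 1.0) ** 2)
    return float(value)
```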


Consensus Regularization by Logistic Regression (2)

• The partial derivative of the objective with respect to each w_l is the sum of two terms:

  – A function of a local classifier and the data from the corresponding source domain. Thus, this term can be computed locally on each source domain.

  – A function of all the local classifiers and the data from the target domain. Thus, this term can be computed on the target domain with all the classifiers.
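A hedged sketch of this decomposition (the exact constants depend on the paper's full objective, which is not reproduced in this transcript): the first function uses only a single model and its own source-domain data, while the second uses all models and the target data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_gradient(w, X, y, lam=1.0):
    """Gradient of the regularized log-likelihood on one source domain;
    computable entirely on the node that owns (X, y)."""
    p = sigmoid(y * (X @ w))                  # probability of the correct label
    return X.T @ (y * (1.0 - p)) - lam * w

def consensus_gradient(ws, l, Xt, theta=0.1):
    """Gradient of the C_s consensus term w.r.t. w_l; it needs all models
    and the target data, so it is computed where D_t resides."""
    p_bar = np.mean([sigmoid(Xt @ w) for w in ws], axis=0)
    p_l = sigmoid(Xt @ ws[l])
    coeff = theta * (4.0 / len(ws)) * (2.0 * p_bar - 1.0) * p_l * (1.0 - p_l)
    return Xt.T @ coeff
```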


Distributed Implementation of Consensus Regularization (1)

In the distributed setting, the nodes containing source-domain data are used as slave nodes, denoted by sn_1, ..., sn_m, and the node containing the target-domain data is used as the master node, denoted by mn.

[Figure: the master node mn exchanges the current models w_1, ..., w_m and the corresponding target-domain terms g_s(w_1), ..., g_s(w_m) with the slave nodes sn_1, ..., sn_m]
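A minimal sketch (not the authors' protocol) of one synchronous training round under these assumptions: each slave exposes a local_grad method over its own source domain, the master exposes a consensus_grad method over the target domain, and only parameter vectors and gradient statistics cross node boundaries.

```python
def training_round(ws, slaves, master, lr=0.01):
    """ws: current list of m weight vectors.
    slaves[l].local_grad(w): gradient on D_s^l, computed at slave node sn_l.
    master.consensus_grad(ws, l): target-domain term for w_l, computed at mn.
    (slaves and master are hypothetical node interfaces.)"""
    new_ws = []
    for l, w in enumerate(ws):
        g_local = slaves[l].local_grad(w)            # stays on slave node sn_l
        g_target = master.consensus_grad(ws, l)      # computed on master node mn
        new_ws.append(w + lr * (g_local + g_target)) # gradient-ascent update
    return new_ws
```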


Overview

• Introduction

• Preliminaries

• Consensus Regularization

• Experimental Evaluation

• Related Works

• Conclusions


Experimental Preparation (1)

• Data Preparation
  – Three source domains (A1, B1), (A2, B2), (A3, B3), and one target domain (A4, B4)
  – 96 (= 4 × P_4^4) problem instances can be constructed for the experimental evaluation

• Baseline Algorithms
  – Distributed approaches: Distributed Ensemble (DE), Distributed Consensus Regularization (DCR)
  – Centralized approaches: Centralized Training (CT), Centralized Consensus Regularization (CCR) (e.g., CCR1 means m = 1), CoCC [Dai et al., KDD'07], TSVM [Joachims, ICML'99], SGT [Joachims, ICML'03]

• Data sets (20 Newsgroups categories):
  A1: sci.crypt        B1: talk.guns
  A2: sci.electronics  B2: talk.mideast
  A3: sci.med          B3: talk.misc
  A4: sci.space        B4: talk.religion


Experimental Parameters and Metrics

• Note that when the parameter θ = 0, DE is equivalent to DCR, and CT is equivalent to CCR1.

• Parameter settings
  – The range of θ is [0, 0.25]
  – The parameters of CoCC, TSVM, and SGT are the same as in [Dai et al., KDD'07]

• Experimental metrics
  – Accuracy
  – Convergence


Experimental Results (1)

• Comparison of CCR3, CCR1, DE, and CT

• CCR3^max and CCR1^max are the best performances obtained when θ is sampled in [0, 0.25]


Experimental Results (2)

• The average performance comparison of CCR3, CCR1, DE, and CT on the 96 problem instances

• Comparison of TSVM, SGT, CoCC, and CCR3


Experimental Results on Algorithm Convergence

• The algorithm essentially converges after 20 iterations, which indicates that it has good convergence properties.


More experiments (1)

• Note that the original source domains have a much larger distribution mismatch; after merging, the distribution mismatch is greatly alleviated.

[Figure: Domain 1, Domain 2, and Domain 3 are merged into a single source domain, which is then split into SDomain 1, SDomain 2, and SDomain 3 by random sampling without replacement]


More experiments (2)

• The experiments on image classification are also very promising


Overview

• Introduction

• Preliminaries

• Consensus Regularization

• Experimental Evaluation

• Related Works

• Conclusions


Related Work (1)

• Transfer Learning
  Solves the fundamental problem of different distributions between the training and testing data.

  – Assuming there are some labeled data from the target domain:
    • Estimation of the mismatch degree by Liao et al. [ICML'05]
    • Boosting-based learning by Dai et al. [ICML'07]
    • Building generative classifiers by Smith et al. [KDD'07]
    • Constructing informative priors from the source domain and then encoding them into the model by Raina et al. [ICML'06]

  – Assuming the data in the target domain are totally unlabeled:
    • Co-clustering based Classification by Dai et al. [KDD'07]
    • Transductive Bridged-Refinement by Xing et al. [PKDD'07]


Related Work (2)

• Self-Taught Learning
  Uses a large amount of unlabeled data to improve the performance of a given classification task.
  – Apply sparse coding to construct higher-level features using the unlabeled data, by Raina et al. [ICML'07]

• Semi-supervised Classification
  – Entropy minimization by Grandvalet et al. [NIPS'05], which is a special case of our regularization framework when m = 1

• Multi-View Learning
  – Co-training by Blum et al. [COLT'98]
  – Boosting mixture models by Grandvalet et al. [ICANN'01]
  – Co-regularization by Sindhwani et al. [ICML'05], which focuses on two views only and does not have the effect of entropy minimization


Overview

• Introduction

• Preliminaries

• Consensus Regularization

• Experimental Evaluation

• Related Works

• Conclusions


Conclusions

• Propose a consensus regularization framework for transfer learning from multiple source domains
  – Maximize the likelihood of each model on its corresponding source domain
  – Maximize the consensus degree of all the trained models on the target domain

• Extend the algorithm to a distributed implementation
  – Only some statistical values are shared between the source domains and the target domain, so privacy concerns are modestly alleviated

• Experiments on real-world text data sets show the effectiveness of our consensus regularization approach


Q. & A.

Acknowledgement