Post on 29-Mar-2015


Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text


Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong Xiong, Zhongzhi Shi


Outline

• Introduction

• Problem Formulation

• Solution for Optimization Problem and Analysis of Algorithm Convergence

• Experimental Validation

• Related Works

• Conclusions

Fuzhen Zhuang et al., SDM 2010


• Many traditional learning techniques work well only under the assumption that training and test data follow the same distribution

• Enterprise News Classification, with classes including “Product Announcement”, “Business scandal”, “Acquisition”, …

Training (labeled), HP news:

Product announcement: HP's just-released LaserJet Pro P1100 printer and the LaserJet Pro M1130 and M1210 multifunction printers, price … performance ...

Test (unlabeled), Lenovo news:

Announcement for Lenovo ThinkPad ThinkCentre – price $150 off Lenovo K300 desktop using coupon code ... Lenovo ThinkPad ThinkCentre – price $200 off Lenovo IdeaPad U450p laptop using. ...their performance

The two domains follow different distributions, so a classifier trained on the HP news fails on the Lenovo news!


Motivation (1)

• Example Analysis:

HP news: Product announcement: HP's just-released LaserJet Pro P1100 printer and the LaserJet Pro M1130 and M1210 multifunction printers, price … performance ...

Lenovo news: Announcement for Lenovo ThinkPad ThinkCentre – price $150 off Lenovo K300 desktop using coupon code ... Lenovo ThinkPad ThinkCentre – price $200 off Lenovo IdeaPad U450p laptop using. ...their performance

word concept:

HP news: LaserJet, printer, announcement, price

Lenovo news: ThinkPad, ThinkCentre, announcement, price




document class:


Motivation (2)

• Example Analysis:

HP: LaserJet, printer, price, performance, etc.

Lenovo: ThinkPad, ThinkCentre, price, performance, etc.

The words expressing the same word concept are domain-dependent

A word concept indicates a document class

The association between word concepts and document classes is domain-independent

• Can we model this observation for classification?

• We study how to model it for cross-domain classification

– Domain-dependent word concepts

– Domain-independent association between word concepts and document classes

Motivation (3)

• Example Analysis:

HP news: Product announcement: HP's just-released LaserJet Pro P1100 printer and the LaserJet Pro M1130 and M1210 multifunction printers, price … performance ...

Lenovo news: Announcement for Lenovo ThinkPad ThinkCentre – price $150 off Lenovo K300 desktop using coupon code ... Lenovo ThinkPad ThinkCentre – price $200 off Lenovo IdeaPad U450p laptop using. ...their performance

word concept:

HP news: LaserJet, printer, announcement, price…

Lenovo news: ThinkPad, ThinkCentre, announcement, price…

document class:

The two domains share some common words: announcement, price, performance …

Outline

• Introduction

• Problem Formulation

• Solution for Optimization Problem and Analysis of Algorithm Convergence

• Experimental Validation

• Related Works

• Conclusions

Preliminary Knowledge

• Basic formula of matrix tri-factorization:

where the input X is the word-document co-occurrence matrix
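The formula itself was an image on the slide and did not survive in this transcript. A common way to write nonnegative matrix tri-factorization is sketched below; the dimension names m, n, k_1, k_2 and the reading of F, S and G are assumptions, chosen to match how the later slides use Fs, Gs and S.

```latex
\min_{F \ge 0,\; S \ge 0,\; G \ge 0} \; \left\| X - F S G^{\top} \right\|_F^2,
\qquad
X \in \mathbb{R}_{+}^{m \times n},\;
F \in \mathbb{R}_{+}^{m \times k_1},\;
S \in \mathbb{R}_{+}^{k_1 \times k_2},\;
G \in \mathbb{R}_{+}^{n \times k_2}
```

Under this reading, F groups the m words into k_1 word clusters, G assigns the n documents to k_2 document classes, and S holds the association between word clusters and document classes.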






Problem Formulation (1)

• Input: source domain Xs, target domain Xt

• Matrix tri-factorization based classification framework

– Two-step Optimization Framework (MTrick0)

– Joint Optimization Framework (MTrick)
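As a minimal sketch of what the inputs Xs and Xt could look like, the snippet below builds two word-document count matrices over one shared vocabulary. The helper name `word_document_matrix`, the toy documents and the choice of a shared vocabulary are assumptions made for illustration; the slides do not say how the matrices were built.

```python
from collections import Counter


def word_document_matrix(docs, vocab):
    """Return a words x documents matrix of raw term counts (list of rows)."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    return [[c[word] for c in counts] for word in vocab]


# Toy documents echoing the HP (source) and Lenovo (target) example.
source_docs = [
    "laserjet printer price performance",
    "laserjet printer announcement price",
]
target_docs = [
    "thinkpad thinkcentre price performance",
    "thinkpad announcement price",
]

# Assumption: both domains are indexed by the same vocabulary,
# so row i means the same word in Xs and in Xt.
vocab = sorted({w for d in source_docs + target_docs for w in d.split()})
Xs = word_document_matrix(source_docs, vocab)
Xt = word_document_matrix(target_docs, vocab)
```

Domain-specific words such as "laserjet" have all-zero rows in Xt, while shared words such as "price" occur in both matrices.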



Problem Formulation (2)

• Sketch map of two-step optimization

[Figure: sketch map of two-step optimization. First step: source domain Xs, with factors Fs and Gs. Second step: target domain Xt, with factors Ft and Gt.]

Problem Formulation (3)

• The optimization problem in source domain (First step)

G0 is used as the supervision information for this optimization

Our goal: to obtain Fs, Gs and Ss

• The optimization problem in target domain (Second step)

Ss is the solution obtained from the source domain

Our goal: to obtain Ft and Gt
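The two objective functions were images on the slide and are missing from this transcript. One plausible way to write them, consistent with the annotations above, is sketched below. The trade-off weight \alpha is assumed notation, and any normalization constraints the authors may impose on the factors are left out.

```latex
% First step (source domain): fit X_s while keeping G_s close to the true labels G_0
\min_{F_s,\, S_s,\, G_s \ge 0} \;
\left\| X_s - F_s S_s G_s^{\top} \right\|_F^2
+ \alpha \left\| G_s - G_0 \right\|_F^2

% Second step (target domain): S_s is fixed to the source-domain solution
\min_{F_t,\, G_t \ge 0} \;
\left\| X_t - F_t S_s G_t^{\top} \right\|_F^2
```

In this form the only quantity carried from the first step to the second is the association matrix S_s.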

Problem Formulation (4)

• Sketch map of joint optimization

[Figure: sketch map of joint optimization. Source domain Xs with factors Fs and Gs, target domain Xt with factors Ft and Gt; the shared association S is used for knowledge transfer.]


Problem Formulation (5)

• The joint optimization problem over source and target domain:

G0 is the supervision information

The association S is shared as a bridge to transfer knowledge
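The joint objective was also an image on the slide. A hedged reconstruction that matches the two annotations (G0 as supervision, a single shared S) is given below; the weights \alpha and \beta are assumed notation, and possible normalization constraints are omitted.

```latex
\min_{F_s,\, G_s,\, F_t,\, G_t,\, S \ge 0} \;
\left\| X_s - F_s S G_s^{\top} \right\|_F^2
+ \alpha \left\| G_s - G_0 \right\|_F^2
+ \beta \left\| X_t - F_t S G_t^{\top} \right\|_F^2
```

Because the same S appears in both reconstruction terms, the source labels influence the target factors through S during every update rather than only once.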


Outline

• Introduction

• Problem Formulation

• Solution for Optimization Problem and Analysis of Algorithm Convergence

• Experimental Validation

• Related Works

• Conclusions

Solution for Optimization

• An alternating iterative algorithm is developed, and the update formulas are as follows:

This is the solution for the joint optimization problem
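The update formulas were shown as an image and are not in this transcript. The sketch below is therefore not the authors' exact rules: it applies standard multiplicative updates in the style of Lee and Seung to a simplified joint objective, ||Xs - Fs S Gs^T||^2 + alpha ||Gs - G0||^2 + beta ||Xt - Ft S Gt^T||^2, without any normalization constraints. The function name `joint_trifactorize`, the weights `alpha` and `beta`, and the toy data are all hypothetical.

```python
import numpy as np


def joint_trifactorize(Xs, Xt, G0, k1=4, alpha=1.0, beta=1.0, iters=50, seed=0):
    """Alternating multiplicative updates for a simplified joint objective.

    Xs, Xt: nonnegative words x documents matrices with the same row count.
    G0: nonnegative documents x classes label indicator for the source domain.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-12                      # guards against division by zero
    m, nt, k2 = Xs.shape[0], Xt.shape[1], G0.shape[1]
    Fs = rng.random((m, k1)) + 0.1
    Ft = rng.random((m, k1)) + 0.1
    S = rng.random((k1, k2)) + 0.1
    Gs = G0 + 0.1                    # start near the supervision information
    Gt = rng.random((nt, k2)) + 0.1

    def objective():
        return (np.linalg.norm(Xs - Fs @ S @ Gs.T) ** 2
                + alpha * np.linalg.norm(Gs - G0) ** 2
                + beta * np.linalg.norm(Xt - Ft @ S @ Gt.T) ** 2)

    history = [objective()]
    for _ in range(iters):
        # Each factor is scaled by (negative gradient part) / (positive part).
        Fs *= (Xs @ Gs @ S.T) / (Fs @ S @ Gs.T @ Gs @ S.T + eps)
        Gs *= (Xs.T @ Fs @ S + alpha * G0) / (
            Gs @ S.T @ Fs.T @ Fs @ S + alpha * Gs + eps)
        Ft *= (Xt @ Gt @ S.T) / (Ft @ S @ Gt.T @ Gt @ S.T + eps)
        Gt *= (Xt.T @ Ft @ S) / (Gt @ S.T @ Ft.T @ Ft @ S + eps)
        # S is shared, so both domains contribute to its update.
        S *= (Fs.T @ Xs @ Gs + beta * Ft.T @ Xt @ Gt) / (
            Fs.T @ Fs @ S @ Gs.T @ Gs + beta * Ft.T @ Ft @ S @ Gt.T @ Gt + eps)
        history.append(objective())
    return Fs, Gs, Ft, Gt, S, history


# Toy usage: 12 words, 8 labeled source documents, 6 unlabeled target documents.
rng = np.random.default_rng(1)
Xs = rng.random((12, 8))
Xt = rng.random((12, 6))
G0 = np.zeros((8, 2))
G0[:4, 0] = 1.0
G0[4:, 1] = 1.0
Fs, Gs, Ft, Gt, S, history = joint_trifactorize(Xs, Xt, G0)
predicted = Gt.argmax(axis=1)        # class with the largest weight per target document
```

Reading the predicted class of a target document off the largest entry in its row of Gt is one natural choice under this sketch; `history` records the objective value after every round so its decrease can be inspected.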


Analysis of Algorithm Convergence

• Following the methodology of convergence analysis in [Lee et al., NIPS’01] and [Ding et al., KDD’06], the following theorem holds.

Theorem (Convergence): After each round of calculating the iterative formulas, the objective function in the joint optimization will converge monotonically.
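Read as a statement about the sequence of objective values, the theorem can be written as follows. The symbols \mathcal{L} for the joint objective and \Theta^{(t)} for the set of factors after round t are notation assumed here, and the lower bound assumes the objective is a nonnegatively weighted sum of squared norms.

```latex
\mathcal{L}\big(\Theta^{(t+1)}\big) \;\le\; \mathcal{L}\big(\Theta^{(t)}\big)
\quad \text{for every round } t,
\qquad
\mathcal{L}(\Theta) \;\ge\; 0 \ \text{ for all } \Theta
```

A non-increasing sequence that is bounded below has a limit, which is the sense in which the objective converges. This does not by itself imply convergence to a global optimum.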


Outline

• Introduction

• Problem Formulation

• Solution for Optimization Problem and Analysis of Algorithm Convergence

• Experimental Validation

• Related Works

• Conclusions

Experimental Preparation (1)

• Construct Classification Tasks

rec and sci denote the positive and negative classes, respectively

rec: rec.autos, rec.motorcycles

sci: sci.crypt, sci.electronics, sci.med, sci.space

For source domain: rec.autos + sci.med (4 x 4 cases)

For target domain: rec.motorcycles + sci.space (3 x 3 cases)

144 ($P_4^2 \cdot P_4^2$) tasks can be constructed from the data set rec vs. sci
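The count of 144 can be checked by enumeration: pick one rec and one sci subcategory for the source domain (4 x 4 cases), then one of the remaining rec and sci subcategories for the target domain (3 x 3 cases). In the sketch below, the helper `build_tasks` is hypothetical, and the two names rec.sport.baseball and rec.sport.hockey are taken from the standard 20 Newsgroups category list because they did not survive in this transcript.

```python
from itertools import product

# Assumption: the four rec subcategories of 20 Newsgroups; only the first
# two appear in the transcript.
REC = ["rec.autos", "rec.motorcycles", "rec.sport.baseball", "rec.sport.hockey"]
SCI = ["sci.crypt", "sci.electronics", "sci.med", "sci.space"]


def build_tasks(positive, negative):
    """List every ((source_pos, source_neg), (target_pos, target_neg)) task."""
    tasks = []
    for src_pos, src_neg in product(positive, negative):        # 4 x 4 cases
        rest_pos = [c for c in positive if c != src_pos]
        rest_neg = [c for c in negative if c != src_neg]
        for tgt_pos, tgt_neg in product(rest_pos, rest_neg):    # 3 x 3 cases
            tasks.append(((src_pos, src_neg), (tgt_pos, tgt_neg)))
    return tasks


tasks = build_tasks(REC, SCI)
print(len(tasks))  # 144
```

The source and target domains of a task never share a subcategory, which is what makes the two distributions differ while the top-level classes stay the same.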


Experimental Preparation (2)

• Data Sets

20 Newsgroups (three top categories are selected)

rec: rec.autos, rec.motorcycles

sci: sci.crypt, sci.electronics, sci.med, sci.space

talk: talk.guns, talk.mideast, talk.misc, talk.religion

– Two data sets for binary classification: rec vs. sci and sci vs. talk

rec vs. sci: 144 tasks

sci vs. talk: 144 tasks

Reuters-21578 (the problems constructed in [Gao et al., KDD’08])




Experimental Preparation (3)

• Compared Algorithms

– Supervised Learning: Logistic Regression (LG) [David et al., 00], Support Vector Machine (SVM) [Joachims, ICML’99]

– Semi-supervised Learning: TSVM [Joachims, ICML’99]

– Cross-domain Learning: CoCC [Dai et al., KDD’07], LWE [Gao et al., KDD’08]

• Our Methods: MTrick0 (Two-step optimization framework), MTrick (Joint optimization framework)

• Measure: classification accuracy

Experimental Results (1)

• Comparisons among MTrick, MTrick0, CoCC, TSVM, SVM and LG on data set rec vs. sci

MTrick can perform well even when the accuracy of LG is lower than 65%


Experimental Results (2)

• Comparisons among MTrick, MTrick0, CoCC, TSVM, SVM and LG on data set sci vs. talk

Similar to rec vs. sci, MTrick also achieves the best results on this data set


Experimental Results (3)

• The performance comparison of MTrick, LWE, CoCC, TSVM, SVM and LG on Reuters-21578

MTrick also performs very well on this data set


Experimental Results Summary

• The systematic experiments show that MTrick outperforms all the compared algorithms

• In particular, MTrick performs very well when the accuracy of LG is low (< 65%), which indicates that MTrick still works when the transfer learning task is difficult

• The joint optimization is also better than the two-step optimization

Overview

• Introduction

• Problem Formulation

• Solution for Optimization Problem and Analysis of Algorithm Convergence

• Experimental Validation

• Related Works

• Conclusions

Related Work (1)

• Cross-domain Learning

Solve the distribution mismatch problems between the training and testing data.

– Instance weighting based approaches

Boosting based learning by Dai et al. [ICML’07]

Instance weighting framework for NLP tasks by Jiang et al.

– Feature selection based approaches

Two-phase feature selection framework by Jiang et al. [CIKM’07]

Dimensionality reduction approach by Pan et al. [AAAI’08], which focuses on finding the latent feature space regarded as the bridge knowledge between the source and target domains

Co-Clustering based Classification method by Dai et al. [KDD’07]

Related Work (2)

• Nonnegative Matrix Factorization (NMF)

Weighted nonnegative matrix factorization (WNMF) by Guillamet et al. [PRL’03]

Incorporating word space knowledge for document clustering by Li et al. [SIGIR’08]

Orthogonal constrained NMF by Ding et al. [KDD’06]

Cross-domain collaborative filtering by Li et al. [IJCAI’09]

Transfer the label information by sharing the information of word clusters, proposed by Li et al. [SIGIR’09]. However, the word clusters are not exactly the same due to the distribution difference across domains

Outline

• Introduction

• Problem Formulation

• Solution for Optimization Problem and Analysis of Algorithm Convergence

• Experimental Validation

• Related Works

• Conclusions


• Propose a nonnegative matrix factorization based classification framework (MTrick), which explicitly considers

– the domain-dependent concepts

– the domain-independent association between concepts and document classes

• Develop an alternating iterative algorithm to solve the optimization problem, and theoretically analyze its convergence

• Experiments on real-world text data sets show the effectiveness of the proposed approach

Thank you!

Q. & A.

