Learning to Classify Documents with Only a Small Positive Training Set
Learning to Classify Documents with Only a Small Positive Training Set
Xiao-Li Li Institute for Infocomm Research, Singapore
Nanyang Technological University, Singapore
Joint work with Bing Liu (University of Illinois at Chicago) See-Kiong Ng (Institute for Infocomm Research)
Outline
• 1. Introduction to the problem
• 2. The Proposed Technique LPLP
• 3. Evaluation Experiments
• 4. Conclusions
1. Introduction
• Traditional Supervised Learning
– Given a set of labeled training documents of n classes, the system uses this set to build a classifier.
– The classifier is then used to classify new documents into the n classes.
• Such methods typically require a large number of labeled examples, and labeling can be an expensive and tedious process.
Positive-Unlabeled (PU) Learning
• One way to reduce the amount of labeled training data is to develop classification algorithms that can learn from a set P of labeled positive examples augmented with a set U of unlabeled examples.
• Then, build a classifier using P and U to classify the data in U as well as future test data. We call this the PU learning problem.
PU Learning
• Positive set: a set P of documents, all belonging to one class, and
• Unlabeled (or mixed) set: a set U of unlabeled documents containing both documents from the positive class and documents not from it (negative documents).
• Goal: build a classifier to classify the documents in U and future (test) data.
An illustration of typical PU learning
(Figure: the positive set P contains ECML papers; the unlabeled set U contains AAAI papers. The classifier aims to automatically find the hidden positives in U, i.e., the machine learning papers in AAAI.)
Applications of the problem
• Given the ECML proceedings, find all machine learning papers in the AAAI, IJCAI and KDD proceedings.
• Given a person's bookmarks, identify the documents from Web sources that are of interest to him/her.
• A company has a database with details of its customers; it tries to find potential customers in a database containing details of the general population.
Related works
• Theoretical study: Denis (1998), Muggleton (2001) and Liu et al. (2002) show that this problem is learnable.
• One-class SVM: Schölkopf et al. (1999) and others proposed one-class SVMs.
• S-EM: Liu, Lee, Yu and Li (ICML, 2002) proposed a method (called S-EM) to solve the problem based on a spy technique, naïve Bayesian (NB) classification and the EM algorithm.
Related works
• PEBL: Yu et al. (KDD, 2002) proposed an SVM-based technique to classify Web pages given positive and unlabeled pages.
• NBP: Denis's group also built an NB-based system, NBP.
• Roc-SVM: Li and Liu (IJCAI, 2003) give a Rocchio- and SVM-based method.
Can we use the current techniques in some real applications?
A real-life business intelligence application: searching for information on related products
(Figure: printer pages from Amazon and CNET.)
A company that sells computer printers may want to do a product comparison among the various printers currently available in the market.
Current techniques cannot work well! Why?
The Assumption (1) of current techniques
• There is a sufficiently large set of positive training examples.
• However, in practice, obtaining a large number of positive examples can be rather difficult in many real applications.
Current Assumption (1)
The small positive set may not even adequately represent the whole positive class.
PU learning with a small positive training set
(Figure: two candidate classifier boundaries, H1 and H2.)
The Assumption (2) of current techniques
• The positive set (P) and the hidden positive examples in the unlabeled set (U) are generated from the same distribution.
• Why this fails in practice: different Web sites present similar products in different styles and have different focuses.
(Figure: printer pages from Amazon and CNET, highlighted in different colors.)
2. The proposed techniques: Ideas
(Figure: printer pages from Amazon and CNET.)
Although pages from different sites differ in style, both should still be similar in some underlying feature dimensions (or subspaces), as they belong to the same class; e.g., they share representative word features such as "printer", "inkjet", "laser", "ppm", etc.
The proposed techniques: Ideas (Cont.)
• If we can find such a set of representative word features (RW) shared by the positive set P and the hidden positives in U, then we can use them to extract the other hidden positive documents from U.
• Method: LPLP (Learning from Probabilistically Labeled Positive examples).
The proposed techniques: LPLP
• 1. Select the set of representative word features RW from the given positive set P.
• 2. Extract the likely positive documents (LP) from U and probabilistically label them based on the set RW.
• 3. Employ the EM algorithm to build an accurate classifier that identifies the hidden positive examples in U.
Step 1: Selecting a set of representative word features from P
• The scoring function s() is based on the TF-IDF method.
• It gives high scores to words that occur frequently in the positive set P but not in the whole corpus, since U contains many other unrelated documents.
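The slides do not give the exact form of s(); the sketch below assumes a standard TF-IDF-style score (term frequency within P times inverse document frequency over the whole corpus P plus U). The function name `select_representative_words` is hypothetical.

```python
import math
from collections import Counter

def select_representative_words(P, U, k=10):
    """Score each word by a TF-IDF-style function: frequent in the
    positive set P, rare across the whole corpus P + U (an assumed
    form of s(); the slides only name the general method).
    P and U are lists of tokenized documents (lists of words).
    Returns the top-k scoring words."""
    corpus = P + U
    n_docs = len(corpus)
    # document frequency over the whole corpus
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    # term frequency within the positive set only
    tf = Counter()
    for doc in P:
        tf.update(doc)
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
```

Words that appear only in U score zero here by construction, which matches the intent of the slide: U contains many unrelated documents, so corpus-wide frequency penalizes non-representative words.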
Select representative word features from P
(Figure: printer pages from Amazon and CNET yield representative words such as "printer", "inkjet", "laser", "ppm".)
Step 2: Identifying LP from U and probabilistically labeling the documents in LP
• rd: a representative document that consists of all the representative features.
• Compare each document di in U with rd using the cosine similarity, which produces a set LP of probabilistically labeled documents with Pr(di|+) > 0.
• The hidden positive examples in LP are assigned high probabilities, while the negative examples in LP are assigned very low probabilities.
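A minimal sketch of this labeling step, assuming rd is a bag containing each representative word once and that the cosine similarity itself serves as the probabilistic label (the slides do not specify the exact vector weighting); `probabilistically_label` is a hypothetical name:

```python
import math
from collections import Counter

def probabilistically_label(U, rw):
    """Build the representative document rd from the representative
    words rw, then compare every document in U with rd using cosine
    similarity. Documents with similarity > 0 form the likely-positive
    set LP, with the similarity used as the probabilistic label."""
    rd = Counter(rw)  # rd contains each representative word once
    rd_norm = math.sqrt(sum(v * v for v in rd.values()))
    LP = []
    for doc in U:
        vec = Counter(doc)
        dot = sum(vec[w] * rd[w] for w in rd)
        norm = math.sqrt(sum(v * v for v in vec.values()))
        sim = dot / (norm * rd_norm) if norm else 0.0
        if sim > 0:
            LP.append((doc, sim))  # hidden positives get high sim, negatives low
    return LP
```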
Identifying likely positives
(Figure: the representative document rd, built from the words "printer", "inkjet", "laser", "ppm", is compared against U; matching documents form LP and receive high probability, while the documents in the remaining unlabeled set RU receive low probability.)
The Naïve Bayesian method
Classifier parameters:

$$\Pr(c_j) = \frac{\sum_{i=1}^{|D|} \Pr(c_j \mid d_i)}{|D|} \qquad (1)$$

$$\Pr(w_t \mid c_j) = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\,\Pr(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\,\Pr(c_j \mid d_i)} \qquad (2)$$

Classifier:

$$\Pr(c_j \mid d_i) = \frac{\Pr(c_j) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k} \mid c_j)}{\sum_{r=1}^{|C|} \Pr(c_r) \prod_{k=1}^{|d_i|} \Pr(w_{d_i,k} \mid c_r)} \qquad (3)$$
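The naïve Bayesian estimates with probabilistic (soft) labels can be written out directly from Eqs. (1)-(3); the sketch below stores Pr(cj|di) per document and uses Laplace smoothing exactly as in Eq. (2). The function names are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, post, vocab):
    """Multinomial naive Bayes with soft labels.
    docs:  list of token lists
    post:  post[i][j] = Pr(cj | di), soft class membership per document
    vocab: list of words
    Returns (priors, cond) with cond[j][w] = Pr(w | cj)."""
    n_classes = len(post[0])
    n_docs = len(docs)
    # Eq. (1): class priors from the soft labels
    priors = [sum(post[i][j] for i in range(n_docs)) / n_docs
              for j in range(n_classes)]
    # Eq. (2): Laplace-smoothed word probabilities
    cond = []
    for j in range(n_classes):
        counts = Counter()
        for i, doc in enumerate(docs):
            for w in doc:
                counts[w] += post[i][j]
        total = sum(counts[w] for w in vocab)
        cond.append({w: (1 + counts[w]) / (len(vocab) + total) for w in vocab})
    return priors, cond

def classify(doc, priors, cond):
    """Eq. (3): posterior Pr(cj | di), computed in log space for stability."""
    logs = [math.log(priors[j]) +
            sum(math.log(cond[j].get(w, 1e-12)) for w in doc)
            for j in range(len(priors))]
    m = max(logs)
    exp = [math.exp(v - m) for v in logs]
    z = sum(exp)
    return [v / z for v in exp]
```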
Step 3: EM algorithm
• Re-initialize the EM algorithm by treating the probabilistically labeled set LP (with or without P) as the positive documents.
• LP has a distribution similar to that of the other hidden positive documents in U.
• The remaining unlabeled set RU is also much purer than U when used as a negative set.
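The re-initialized EM loop then alternates the M-step (Eqs. 1-2) and the E-step (Eq. 3). A compact self-contained sketch, assuming binary classes (positive/negative) and the hypothetical function name `em_classifier`:

```python
import math
from collections import Counter

def em_classifier(LP, RU, iters=5):
    """EM re-initialized with the probabilistically labeled set LP as
    positives and the remaining unlabeled set RU as (noisy) negatives.
    LP: list of (token_list, prob) pairs; RU: list of token lists.
    Returns the final Pr(+|d) for every document in RU."""
    docs = [d for d, _ in LP] + RU
    vocab = sorted({w for d in docs for w in d})
    # initial soft labels: LP probabilities as positives, RU as negatives
    post = [[p, 1 - p] for _, p in LP] + [[0.0, 1.0] for _ in RU]
    for _ in range(iters):
        # M-step: re-estimate NB parameters from the soft labels
        priors = [sum(row[j] for row in post) / len(docs) for j in (0, 1)]
        cond = []
        for j in (0, 1):
            c = Counter()
            for d, row in zip(docs, post):
                for w in d:
                    c[w] += row[j]
            tot = sum(c.values())
            cond.append({w: (1 + c[w]) / (len(vocab) + tot) for w in vocab})
        # E-step: recompute Pr(cj|d) for every document (log space)
        new_post = []
        for d in docs:
            logs = [math.log(priors[j]) +
                    sum(math.log(cond[j][w]) for w in d) for j in (0, 1)]
            m = max(logs)
            e = [math.exp(v - m) for v in logs]
            z = sum(e)
            new_post.append([v / z for v in e])
        post = new_post
    return [row[0] for row in post[len(LP):]]
```

Because LP enters with high probabilities and RU with low ones, the loop converges toward a classifier that pulls the hidden positives in RU upward while leaving the true negatives low, which is the intuition stated on the slide.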
Build a final classifier
(Figure: the final classifier is built from a positive set and a negative set.)
• Negative set: RU.
• Positive set (two options): Option 1: LP only; Option 2: combine P and LP.
3. EMPIRICAL EVALUATION
Datasets: number of Web pages and their classes

| Web sites | Amazon | CNet | J&R | PCMag | ZDnet |
|-----------|--------|------|-----|-------|-------|
| Notebook  | 434    | 480  | 51  | 144   | 143   |
| Camera    | 402    | 219  | 80  | 137   | 151   |
| Mobile    | 45     | 109  | 9   | 43    | 97    |
| Printer   | 767    | 500  | 104 | 107   | 80    |
| TV        | 719    | 449  | 199 | 0     | 0     |
Experiment setting
• We experimented with different numbers of (randomly selected) positive documents in P, i.e., |P| = 5, 15 or 25, as well as all positive documents (allpos).
• We conducted a comprehensive set of experiments using all possible P and U combinations: each entry in Table 1 was selected as the positive set P, and each of the other 4 Web sites was used as the unlabeled set U.
Performance of LPLP with different numbers of positive documents
LP + P or LP only?
• When only a small number of positive documents (|P| = 5, 15 or 25) is available, combining LP and P to construct the positive set for the classifier is better than using LP only.
• When a large number of positive documents is available, using LP only is better.
The number of the representative features
• In general, 5-25 representative words suffice.
• Including less representative word features beyond the top 25 would introduce unnecessary noise into the identification of the likely positive documents in U.
Performance of LPLP, Roc-SVM and PEBL (using either P or LP) when using all positive documents

(Chart: F values, on a scale of 0.1 to 1, for LPLP, Roc-SVM and PEBL with either P or LP as the positive set.)

PEBL and Roc-SVM use the likely positive documents LP, which requires each document d from U to contain at least 5 (out of 10) selected representative words.
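The LP selection rule used for the SVM-based baselines is a simple hard filter; a one-function sketch with the hypothetical name `likely_positive_filter`:

```python
def likely_positive_filter(U, rw, min_hits=5):
    """Hard filter used to build LP for the SVM-based baselines: a
    document from U enters LP only if it contains at least min_hits of
    the selected representative words rw (5 out of 10 in the slides'
    experiments). U is a list of token lists; rw is a word list."""
    rws = set(rw)
    return [doc for doc in U if len(rws & set(doc)) >= min_hits]
```

Unlike LPLP's probabilistic labels, this filter produces an unweighted LP set, which is one reason the SVM-based methods benefit less from LP, as noted in the conclusions.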
Comparative results when the number of positive documents is small

(Chart: F values, on a scale of 0 to 1, for LPLP, Roc-SVM and PEBL with |P| = 5, 15 and 25.)
Conclusions
• In many real-world classification applications, the number of positive examples available for learning is often fairly limited.
• We proposed LPLP, an effective technique for document classification that learns from positive and unlabeled examples with only a small positive set.
Conclusions (cont.)
• The likely positive documents LP can be used to help boost the performance of classification techniques for PU learning problems.
• The LPLP algorithm benefits the most because it can handle probabilistic labels, and is thus better equipped than the SVM-based approaches to take advantage of the probabilistic LP set.
Thank you for your attention!