Feature selection for text categorization on imbalanced data
-
Upload
maisie-barnett -
Category
Documents
-
view
40 -
download
1
description
Transcript of Feature selection for text categorization on imbalanced data
![Page 1: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/1.jpg)
1Intelligent Database Systems Lab
國立雲林科技大學National Yunlin University of Science and Technology
Feature selection for text categorization on imbalanced data
Advisor : Dr. Hsu
Presenter : Zih-Hui Lin
Author :Zhaohui Zheng , Xiaoyun Wu, Rohini Srihari
![Page 2: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/2.jpg)
2
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Motivation Objective Introduction Feature selection framework Experimental setup Primary results and analysis Conclusions Future work
Outline
![Page 3: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/3.jpg)
3
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Motivation
Feature selection using one-sided metrics selects the features most indicative of membership only
Feature selection using two-sided metrics implicitly combines the features most indicative of membership (e.g. positive features) and non-membership (e.g. negative features) by ignoring the signs of features.
![Page 4: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/4.jpg)
4
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Objective
We investigate the usefulness of explicit control of that combination within a proposed feature selection framework.
Our experiments show both great potential and actual merits of explicitly combining positive and negative features in a nearly optimal fashion according to the imbalanced data.
![Page 5: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/5.jpg)
5
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Introduction
One choice in the feature selection policy is whether to rule out all negative features.
Others believe that negative features are numerous, given the imbalanced data set, and quite valuable in practical experience.
![Page 6: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/6.jpg)
6
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Introduction (cont.) negative features are useful because their
presence in a document highly indicates its non-relevance.
They help to confidently reject non-relevant documents.
When deprived of negative features, the performance of all feature selection metrics degrades, which indicates negative features are essential to high quality classification.
![Page 7: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/7.jpg)
7
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Introduction (cont.) neither one-sided nor two-sided metrics
themselves allow control of the combination. The focus in this paper is to answer the
following three questions with empirical evidence:─ How sub-optimal are two sided metrics?─ To what extent can the performance be improved by
better combination of positive and negative features?─ How can the optimal combination be learned in
practice?
![Page 8: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/8.jpg)
8
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Feature selection metrics
Correlation Coefficient (CC), and Odds Ratio (OR) are one-sided metrics which select the features most indicative of membership for a category only,
IG and CHI are two-sided metrics, which consider the features most indicative of either membership (e.g. positive features) or non-membership (e.g. negative features).
![Page 9: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/9.jpg)
9
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Feature selection metrics (cont.)
positive
positive
negative
negative
t
陳水扁t
無「陳水扁」Ci: 政治類 Positive Negative
Ci: 非政治類 Negative Positive
![Page 10: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/10.jpg)
10
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Feature selection metrics (cont.)
Information gain (IG)─ Measures the number of bits of information obtained fo
r category prediction by knowing the presence or absence of a term in a document.
Chi-square (CHI)─ measures the lack of independence between a term t an
d a category ci and can be compared to the chi-square distribution with one degree of freedom to judge extremeness
![Page 11: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/11.jpg)
11
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Feature selection metrics (cont.)
Correlation coefficient (CC)─ Correlation coefficient of a word t with a category ci
Odds ratio (OR)─ the odds of the word occurring in the positive class nor
malized by that of the negative class.
![Page 12: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/12.jpg)
12
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Feature selection metrics (cont.)
IG and CHI are two-sided, whose values are non-negative.
CC and OR are one-sided metrics, whose positive and negative values correspond to the positive and negative features respectively
We can easily obtain that the sign for a one-sided metric, e.g. CC or OR, is sign (AD-BC).A positive C negativeB negative D positive
![Page 13: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/13.jpg)
13
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Feature selection metrics (cont.)
A one-sided metric could be converted to its two-sided counterpart by ignoring the sign, while a two-sided metric could be converted to its one-sided counterpart by recovering the sign.
We propose the two-sided counterpart of OR, namely OR square, and the one-sided counterpart of IG, namely signed IG as follows.
![Page 14: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/14.jpg)
14
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.The imbalanced data problem
The imbalanced data problem occurs when the training examples are unevenly distributed among different classes.
there is an overwhelming number of non-relevant training documents especially when there is a large collection of categories with each assigned to a small number of documents,
This problem presents a particular challenge to classification algorithms, which can achieve high accuracy by simply classifying every example as negative.
![Page 15: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/15.jpg)
15
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.The imbalanced data problem ( cont.)
The impact of imbalanced data problem on the standard feature selection can be illustrated as follows, which primarily answers the first question of “how sub-optimal are two side metric”:─ How to confidently reject the non-relevant documents
is important in that case.─ given a two-sided metric, the values of positive
features are not necessarily comparable with those of negative features.
![Page 16: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/16.jpg)
16
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.The imbalanced data problem ( cont.)
─ TP,FP, FN and TN denote true positives, false positives, false negatives and true negatives respectively. Positive features have more effect on TP and FN, negative features have more effect on TN and FP. feature selection using a two-sided metric combines the
positive and negative features so as to optimize the accuracy, which is defined to be
F1 has been widely used in information retrieval, which is:
Finally, some performance measures themselves,
![Page 17: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/17.jpg)
17
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Feature selection framework – General formulation For each category ci :
─ l is the size of feature set─ 0 < l1<= 1 is the key parameter of the framework to be set. ─ The function =گ ( . ; ci) should be defined so that the larger = .is, the more likely the term t belongs to the category ci (t; ci)گ
![Page 18: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/18.jpg)
18
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Feature selection framework – General formulation (cont.) Obviously, one-sided metrics like SIG, CC, an
d OR can serve as such functions.─ we can easily obtain:
SIG(t; ci) = - SIG(t; ci);
CC(t; ci) = - CC(t; ci);
OR(t; ci) = - OR(t; ci);
![Page 19: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/19.jpg)
19
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Feature selection framework – General formulation (cont.) the second step can be rewritten as:
the framework combines ─ the l1 terms with largest گ =( . ; ci)
─ the l2 = l - l1 terms with smallest = گ( . ; ci).
![Page 20: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/20.jpg)
20
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Feature selection framework – two special cases The standard feature selection methods
generally fall into one of the following two groups:─ select the positive features only using one-sided
metrics, e.g. SIG, CC, and OR. For convenience, we will use CC as the representative of this group.
─ implicitly combine the positive and negative features using two-sided metrics, e.g. IG, CHI, and ORS. CHI will be chosen to represent this group.
![Page 21: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/21.jpg)
21
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Feature selection framework – two special cases (cont.) The positive subset of Fi is
─
![Page 22: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/22.jpg)
22
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Feature selection framework – Optimization The feature selection framework facilitates the control on
explicit combination of the positive and negative features through the parameter l1=l (size ratio).
How to optimize the size ratio?─ Ideal scenario
use the training data to learn different models per category according to different size ratios ranging from 0 to 1 and select the ratio having best performance,
─ Practical scenario we tried in this paper is to empirically select the size ratio per
category having best performance on the training set.
![Page 23: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/23.jpg)
23
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Feature selection framework – Optimization (cont.) The efficient implementation of optimization
within the framework is as follows:─ select l positive features with greatest گ (t; ci) in a de
creasing order.─ select l negative features with smallest گ (t; ci) in an i
ncreasing order.─ empirically choose the size ratio l1 / l such that the featu
re set constructed by combining the first l1; 0 < l1 < l; positive features and the first l - l1 negative features has the optimal performance.
![Page 24: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/24.jpg)
24
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Experimental setup – data collection
We conduct experiments on standard text categorization data using two popular classifiers: ─ Naiive Bayes (NB) and logistic regression (LR).
Data collection─ Reuters-21578 (ModApte split) is used as our data collection,─ contains 90 categories, with 7769 training documents and 3019 te
st documents.─ After filtering out all numbers, stop-words and words occurring l
ess than 3 times, we have 9484 indexing words in the vocabulary.─ Words in document titles are treated same as in document─ body.
![Page 25: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/25.jpg)
25
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Experimental setup – Classifiers
A NB score between a document d and the category ci can be calculated as:─
fj is the feature appearing in the document
P(ci) and P(ci) represent prior probabilities of relevant and non-relevant respectively
and P(fj | ci) and P(fj | ci) are conditional probabilities estimated with Laplacian smoothing.
![Page 26: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/26.jpg)
26
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Logistic regression tries to model the conditional probability as:─
The optimization problem for LR is to minimize:─
di is the ith training example, w is the weight vector, yi Є {-1, 1} is the label associated with di, λ is an appropriately chosen regularization parameter
Experimental setup – Classifiers (cont.)
![Page 27: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/27.jpg)
27
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Experimental setup – performance measure To measure the performance, we use both precision
(p) and recall (r) in their combined from F1 : . To remain compatible with other results, the F1 value at Break Even Point (BEP).
BEP is defined to be the point where precision equals recall. ─ It corresponds to the minimum of |FP – FN| in practice.
We report both micro and macro-averaged BEP F1.─ The micro-averaged F1 is largely dependent on the most common
categories while the rare categories influence macro-averaged F1.
![Page 28: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/28.jpg)
28
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Experimental setup – performance measure
![Page 29: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/29.jpg)
29
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Primary results and analysis –
ideal scenario Micro F1(BEP) values for NB with the feature selection methods at different sizes
of features over the 58 categories: 3rd-60th. The micro-averaged F1(BEP) for NB without feature selection (using all 9484 features ) is .641
![Page 30: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/30.jpg)
30
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Primary results and analysis –
ideal scenario (cont.) As table1 but for macro-averaged F1(BEP). The macro-averaged F1(BEP) for NB
without feature selection is .483
![Page 31: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/31.jpg)
31
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Primary results and analysis – ideal scenario
(cont.) Micro-averaged F1(BEP) values for LR with the feature selection methods at
different sizes of features over the 58 categories. The micro-averaged F1(BEP) for LR without feature selection is .766
![Page 32: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/32.jpg)
32
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Primary results and analysis – ideal scenario
(cont.) As table 3, but for macro-averaged F1(BEP). The macro-averaged F1(BEP) for LR
without feature selection is .676
![Page 33: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/33.jpg)
33
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Primary results and analysis – ideal scenario (cont.) Figure 2: Size ratios implicitly decided by using two-sided metrics: IG,
CHI and ORS respectively (58 categories:3rd- 60th, feature size = 50)
confirms that feature selection using a two-sided metric is similar to its one-sided counterpart (size ratio = 1) when the feature size is small.
![Page 34: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/34.jpg)
34
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Primary results and analysis –
ideal scenario (cont.) Figure 3: BEP F1 for test over the first two and last two categories (out of
58 categories) at different l1=l values (NB, گ= CC, feature size = 50). The optimal size ratios for money-fx, grain, dmk and lumber are 0.95, 0.95, 0.1 and 0.2 respectively
![Page 35: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/35.jpg)
35
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Primary results and analysis –
ideal scenario (cont.) Figure 4: Optimal size ratios of iSIG, iCC and iOR (NB, 58 categories:3rd-
60th, feature size = 50)
![Page 36: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/36.jpg)
36
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Primary results and analysis – ideal scenario
(cont.) Figure 5: As gure 4, but for LR
![Page 37: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/37.jpg)
37
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Primary results and analysis –
practical scenario (cont.)
![Page 38: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/38.jpg)
38
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Primary results and analysis –
Additional results
![Page 39: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/39.jpg)
39
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Primary results and analysis – Additional results (cont.)
![Page 40: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/40.jpg)
40
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Primary results and analysis – Additional results (cont.)
![Page 41: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/41.jpg)
41
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Conclusions
a novel feature selection framework was presented, in which the positive and negative features are separately selected and explicitly combined.
We explored three special cases of the framework:─ 1. consider the positive features only by using one-sided metrics;
─ 2. implicitly combine the positive and negative features by using two-sided metrics;
─ 3. combine the two kinds of features explicitly and choose the size ratio empirically such that optimal performance is obtained.
![Page 42: Feature selection for text categorization on imbalanced data](https://reader035.fdocuments.us/reader035/viewer/2022062422/568131c1550346895d982b2c/html5/thumbnails/42.jpg)
42
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Conclusions (cont.)
Implicitly combining positive and negative features using two-sided metrics is not necessarily optimal, especially on imbalanced data.
A judicious combination shows great potential and practical merits.
A good feature selection method should take into consideration the data set, performance measure, and classification methods.
Feature selection can significantly improve the performance of both naiive Bayes and regularized logistic regression on imbalanced data.