The Class Imbalance Problem in Learning Classifier Systems:
A Preliminary Study
Albert Orriols Puig, Ester Bernadó Mansilla
Enginyeria i Arquitectura La Salle, Ramon Llull University
IWLCS, June 25th, 2005
OUTLINE
1. Introduction
2. Description of UCS
3. Dataset Design
4. UCS on Unbalanced Datasets
5. Dealing with Imbalances
6. UCS in the Chk Problem
7. Contrasting Results with Pos Problem
8. Conclusions
INTRODUCTION
Real-world domains: class imbalances in the samples taken
Does this affect the learning performance of some well-known systems?
If it does, how can we deal with imbalances?
Do class imbalances affect the performance of UCS?
Supervised Learning Scheme
Input: example and its class
Population → match set → correct set (classifiers that predict the correct action)
Genetic algorithm: discovery component
Parameter update:
  Classifier's acc = #correct / experience
  Fitness = acc^ν
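The parameter update above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Classifier` container, the `update` function and the default exponent `nu=10.0` are hypothetical names and values, not taken from the slides, and the genetic algorithm is left out.

```python
from dataclasses import dataclass


@dataclass
class Classifier:
    """Hypothetical container for the per-rule statistics named on the slide."""
    predicted_class: int
    experience: int = 0   # times the rule appeared in a match set
    correct: int = 0      # times it also appeared in the correct set
    accuracy: float = 0.0
    fitness: float = 0.0


def update(match_set, true_class, nu=10.0):
    """Update every rule in the match set after one labelled example.

    nu is the fitness exponent; its default value here is an assumption.
    Returns the correct set (rules that predict the true class).
    """
    correct_set = []
    for cl in match_set:
        cl.experience += 1
        if cl.predicted_class == true_class:
            cl.correct += 1
            correct_set.append(cl)
        cl.accuracy = cl.correct / cl.experience   # acc = #correct / experience
        cl.fitness = cl.accuracy ** nu             # fitness = acc^nu
    return correct_set
```

A rule that keeps matching examples of the other class sees its accuracy, and hence its fitness, drop quickly because of the exponent.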
Chk Problem
- Two real attributes x, y ∈ [0,1]
- Two classes
- Permits varying complexity along:
  a. Concept complexity (c)
  b. Dataset size (s)
  c. Imbalance level (i)
Example: s=4096, c=4, i=2
  #inst. maj. class = s / c^2 = 4096 / 16 = 256
  #inst. min. class = s / (c^2 · 2^i) = 4096 / (16 · 4) = 64
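The two counts follow directly from s, c and i, so they can be tabulated for each imbalance level. A minimal sketch, where the function name `chk_instances` is a hypothetical helper and integer division is assumed:

```python
def chk_instances(s: int, c: int, i: int) -> tuple[int, int]:
    """Return (#inst. maj. class, #inst. min. class) from the slide's formulas."""
    majority = s // c**2              # s / c^2
    minority = s // (c**2 * 2**i)     # s / (c^2 * 2^i)
    return majority, minority


# Counts for the imbalance levels i = 0..7 with s = 4096 and c = 4.
for level in range(8):
    print(level, chk_instances(4096, 4, level))
```

Each increment of i halves the minority count while the majority count stays fixed.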
We ran UCS on the chk problem with s=4096, c=4 and i=[0..7]
Figure: Training datasets for the chk problem
Obtaining the following results
Figure: Boundaries evolved by UCS in the chk problem with imbalance levels from 0 to 7
Analyzing the population evolved in higher imbalance levels

Id   Condition                        Class  Acc   F     Num
1    [0.509, 0.750] [0.259, 0.492]    1      1.00  1.00  39
2    [0.000, 0.231] [0.252, 0.492]    1      1.00  1.00  38
3    [0.000, 0.248] [0.755, 1.000]    1      1.00  1.00  35
4    [0.761, 1.000] [0.000, 0.249]    1      1.00  1.00  34
5    [0.255, 0.498] [0.520, 0.730]    1      1.00  1.00  33
6    [0.751, 1.000] [0.514, 0.737]    1      1.00  1.00  31
7    [0.259, 0.498] [0.000, 0.244]    1      1.00  1.00  27
8    [0.501, 0.743] [0.751, 1.000]    1      1.00  1.00  18
9    [0.500, 0.743] [0.751, 1.000]    1      1.00  1.00  9
10   [0.751, 1.000] [0.531, 0.737]    1      1.00  1.00  8
…
18   [0.509, 0.750] [0.246, 0.492]    1      0.64  0.01  1
19   [0.000, 1.000] [0.000, 1.000]    0      0.94  0.54  20
20   [0.000, 1.000] [0.000, 0.990]    0      0.94  0.54  13
21   [0.012, 1.000] [0.000, 0.990]    0      0.94  0.54  10
…
64   [0.012, 1.000] [0.038, 0.973]    0      0.94  0.54  1
Rules for imbalance level i=4

18 rules predicting the under-sized class
47 rules predicting the over-sized class
As the imbalance level increases, the accuracy of the over-general classifiers increases too. They then become stronger in the population.
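The growth in accuracy of an over-general rule can be worked out from the class ratio alone. The sketch below assumes a rule that covers the whole input space and predicts the over-sized class, a class ratio of 2^i : 1, and a fitness exponent of 10; the exponent is an assumption chosen because 0.94^10 ≈ 0.54 and 0.64^10 ≈ 0.01 are in line with the Acc and F columns of the rule table.

```python
def overgeneral_accuracy(i: int) -> float:
    """Accuracy of a rule matching everything and predicting the majority class,
    assuming a 2**i : 1 ratio between majority and minority instances."""
    return 2**i / (2**i + 1)


def fitness(acc: float, nu: float = 10) -> float:
    """Fitness as a power of accuracy; nu = 10 is an assumed value."""
    return acc ** nu


for i in range(8):
    acc = overgeneral_accuracy(i)
    print(f"i={i}  acc={acc:.3f}  fitness={fitness(acc):.3f}")
```

At i=4 the ratio is 16:1, giving 16/17 ≈ 0.94, the accuracy reported for rules 19 to 64.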
Methods to deal with imbalances
• In the literature there are several methods to deal with imbalances
• We have considered 3 methods:
  – Random over-sampling [Jap02]
  – Adaptive sampling
  – Class-sensitive accuracy
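As an illustration of the first method, random over-sampling can be sketched as re-drawing minority examples until every class is as large as the largest one. The slides do not spell out the exact scheme of [Jap02], so the balancing target and the function name `random_oversample` are assumptions.

```python
import random
from collections import defaultdict


def random_oversample(examples, rng=None):
    """examples: list of (features, label) pairs.

    Returns a new list in which every class has as many examples as the
    largest class; the extra ones are drawn at random with replacement.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for example in examples:
        by_class[example[1]].append(example)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)                                      # originals
        balanced.extend(rng.choices(group, k=target - len(group)))  # replicas
    return balanced
```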
Adaptive Sampling
• Inspired by over-sampling and boosting
• It maintains a weight for each training instance. The weight is the probability of sampling this instance
• Each time an instance is selected for exploit, its weight is updated in the following way:

$$w_i \leftarrow \begin{cases} w_i \, (1 - \alpha) & \text{if correct} \\ w_i \, (1 + \alpha) & \text{otherwise} \end{cases}$$
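A minimal sketch of this weight update follows. The class name `AdaptiveSampler`, the initial weight of 1.0 per instance and the default `alpha=0.1` are assumptions; weights are kept unnormalized and instances are drawn with probability proportional to their weight, which is one possible reading of "the weight is the probability of sampling this instance".

```python
import random


class AdaptiveSampler:
    """Keeps one weight per training instance and samples proportionally to it."""

    def __init__(self, n_instances, alpha=0.1, rng=None):
        self.weights = [1.0] * n_instances   # assumed uniform start
        self.alpha = alpha
        self.rng = rng or random.Random()

    def sample(self):
        """Draw an instance index with probability proportional to its weight."""
        indices = range(len(self.weights))
        return self.rng.choices(indices, weights=self.weights, k=1)[0]

    def update(self, index, correct):
        """Shrink the weight after a correct classification, grow it otherwise."""
        factor = (1 - self.alpha) if correct else (1 + self.alpha)
        self.weights[index] *= factor
```

Instances that the system keeps misclassifying are therefore presented more and more often, much as in boosting.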
Class-sensitive accuracy
• We compute accuracy for each class:

$$\mathit{acc}_i = \frac{c_i}{\mathit{exp}_i}$$

  c_i = number of examples of class i correctly classified
  exp_i = number of examples of class i covered by the rule
• The compound accuracy:

$$\mathit{acc} = \begin{cases} \dfrac{1}{C_e} \sum_{i \mid \mathit{exp}_i > 0} \mathit{acc}_i & \text{if } \forall i: \mathit{exp}_i \geq \theta_{acc} \\[2ex] \dfrac{1}{C_e} \sum_{i \mid \mathit{exp}_i > 0} w_i \, \mathit{acc}_i & \text{otherwise} \end{cases}$$

  C_e = number of different classes that a rule covers
• Where the weight w_i takes one value if 0 < exp_i < θ_acc (class i is inexperienced) and another if exp_i ≥ θ_acc (class i is experienced)
  C_ee = number of experienced classes
  θ_acc = threshold below which a class is inexperienced
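The first branch of the compound accuracy (every covered class is experienced) can be sketched as below. The function names and the dictionary layout are hypothetical. The sketch checks the threshold only for classes the rule actually covers, which is an assumption, and it deliberately does not implement the weighted branch, since the weights w_i are not reproduced here.

```python
def per_class_accuracy(correct, covered):
    """acc_i = c_i / exp_i for every class i with exp_i > 0.

    correct[i] = c_i, covered[i] = exp_i.
    """
    return {i: correct.get(i, 0) / n for i, n in covered.items() if n > 0}


def compound_accuracy(correct, covered, theta_acc):
    """Mean of the per-class accuracies over the classes the rule covers."""
    acc = per_class_accuracy(correct, covered)
    if not acc:
        raise ValueError("rule covers no examples")
    if any(covered[i] < theta_acc for i in acc):
        raise NotImplementedError("weighted case (w_i) is not sketched here")
    return sum(acc.values()) / len(acc)   # (1 / C_e) * sum of acc_i


# A rule covering 160 majority and 10 minority examples, predicting the majority:
# raw accuracy would be 160/170, the class-sensitive accuracy is (1.0 + 0.0) / 2.
print(compound_accuracy({0: 160, 1: 0}, {0: 160, 1: 10}, theta_acc=5))
```

Averaging per class removes the advantage that an over-general rule gains from the sheer number of majority examples.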
Figure: Oversampling
Figure: Adaptive sampling
Figure: Class-sensitive accuracy
Pos Problem
- Multiple classes and different imbalance levels
- Condition = binary string of length L
- Class = position of the leftmost one-valued bit

Condition  Action
00000      0
00001      1
0001#      2
001##      3
01###      4
1####      5
Optimal ruleset for the pos5 problem
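The class definition and the resulting imbalance can be checked by enumeration. In this sketch the helper names are hypothetical, positions are counted from the right starting at 1 (as in the pos5 ruleset), and the dataset is assumed to contain every binary string of length L exactly once.

```python
from collections import Counter
from itertools import product


def pos_class(bits: str) -> int:
    """Position of the leftmost 1, counted from the right starting at 1.

    Returns 0 when no bit is set.
    """
    index = bits.find("1")
    return 0 if index == -1 else len(bits) - index


def class_counts(length: int) -> Counter:
    """Number of instances per class over all binary strings of this length."""
    strings = ("".join(b) for b in product("01", repeat=length))
    return Counter(pos_class(s) for s in strings)


print(sorted(class_counts(5).items()))
```

Under these assumptions class k (for k ≥ 1) has 2^(k-1) instances while class 0 has a single one, so each additional bit doubles the size of the largest class.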
Running pos8 to pos15 with raw UCS
Figure: Percentage of optimal population achieved
As the imbalance level increases, the system has difficulty discovering the most specific rules
Contrasting results with Pos problem
Oversampling

Condition  Class
00000000   0
00000001   1
0000001#   2
000001##   3
00001###   4
0001####   5
001#####   6
01######   7
1#######   8

Wrong rule: #0000000:0
Example = 00000000:0 (128)
Counter-example = 10000000:8 (1)
The counter-examples for rules that over-generalize the most specific optimal rules are sampled at a very low rate.
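The figures 128 and 1 above can be turned into the accuracy the wrong rule would reach. This sketch assumes that oversampling replicates the single class-0 instance until class 0 is as large as the largest class, and that the all-zero string and the string with only the leftmost bit set are the only two instances the wrong rule matches; the function name is hypothetical.

```python
def wrong_rule_accuracy(length: int) -> float:
    """Accuracy of the rule '#00...0 : 0' after oversampling, under the
    assumptions stated above."""
    largest_class = 2 ** (length - 1)    # strings whose leftmost bit is 1
    copies_of_example = largest_class    # 00...0 replicated to match that size
    copies_of_counter_example = 1        # 10...0 already belongs to the largest class
    return copies_of_example / (copies_of_example + copies_of_counter_example)


for length in (5, 8, 15):
    print(length, round(wrong_rule_accuracy(length), 5))
```

For pos8 this gives 128/129, so the wrong rule looks almost perfectly accurate and is hard to displace.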
Contrasting results with Pos problem
Figure: Adaptive sampling
Figure: Class-sensitive accuracy
Conclusions
• The class imbalance problem has proved to be a real problem for UCS.
• All tested strategies for dealing with class imbalances improve the results of raw UCS on the Chk problem.
• The analysis on Pos revealed many drawbacks of the oversampling method. This leads us to discard this method for real-world problems.
Further Work
• Enhance the study with other LCS (preliminary experiments made with GAssist [Bac04] and Hider [Agu04])
• Extend this analysis to other classifier schemes: C4.5 and SVM
• Extend the analysis with other artificial and real problems