
Mining When Classes are Imbalanced, Rare Events Matter More, and Errors Have Costs Attached

Nitesh V. Chawla, University of Notre Dame

http://www.nd.edu/~nchawla
[email protected]


Overview

◦ Introduction
◦ Sampling Methods
◦ Moving Decision Threshold
◦ Classifiers' Objective Functions
◦ Evaluation Measures

IEEE ICDM noted "Dealing with Non-static, Unbalanced and Cost-sensitive Data" among the 10 Challenging Problems in Data Mining Research.


Small Class Matters, and Matters More

A data set is imbalanced if the classes are unequally distributed.

Class of interest (minority class) is often much smaller or rarer

But the cost of an error on the minority class can have a bigger bite.


Typical Prediction Model

                      Actual Non-Default   Actual Default
Predict Non-Default   TN                   FN
Predict Default       FP                   TP


The one in a 100, one in a 1,000, one in 100,000, and one in a million event

The real world has an abundance of scenarios with such imbalance in class distributions:

◦ Fraud detection
◦ Disease prediction
◦ Intrusion detection
◦ Text categorization
◦ Bioinformatics
◦ Direct marketing
◦ Terrorist attacks
◦ Physics simulations
◦ Climate


Paradox of the False Positive

Imagine a disease that has a prevalence of 1 in a million people. I invent a test that is 99% accurate. I am obviously excited. But when applied to a million people, it returns positive for roughly 10,000 of them (remember, it is 99% accurate, so about 1% of a million are misclassified). The priors tell us otherwise: there is one infected person in a million, so the 99%-accurate test is inaccurate 9,999 times out of 10,000 positives.


Yes, measuring performance presents challenges

A "fruit-bowl" of measures. No more comparing apples and oranges. Take your favorite. But how do we really compare?
◦ Accuracy (CAREFUL)
◦ Balanced accuracy (better)
◦ AUROC (different ways of computing, potentially)
◦ F-measure (requires a threshold)
◦ Precision @ Top 20 (where are the positive cases in the ranking?)
◦ G-mean
◦ Probability loss measures such as negative cross entropy and the Brier score (how well calibrated are the models?)

Fruit for thought. We will return to this.


Countering Class Imbalance: Some Popular Solutions in Data Mining

Sampling
◦ Oversampling
◦ Undersampling
◦ ...variations and combinations of the two
Adapting learning algorithms by modifying objective functions or changing decision thresholds
◦ Decision trees
◦ Neural networks
◦ SVMs
Ensemble-based methods

(All of the above can also be combined together!)


Overview

◦ Introduction
◦ Sampling Methods
◦ Moving Decision Threshold
◦ Classifiers' Objective Functions
◦ Evaluation Measures


Sampling: Add or remove until satisfactory performance

Undersampling dictates removal of majority class examples; oversampling dictates replication of minority class examples.


Undersampling

Randomly remove majority class examples.

Risk of losing potentially important majority class examples that help establish the discriminating power.


What about focusing on the borderline and noisy examples?

Introducing Tomek Links and Condensed Nearest Neighbor Rule


Tomek links

To remove both noise and borderline examples.

Tomek link:
◦ Let E_i, E_j be examples belonging to different classes.
◦ Let d(E_i, E_j) be the distance between them.
◦ The pair (E_i, E_j) is called a Tomek link if there is no example E_k such that d(E_i, E_k) < d(E_i, E_j) or d(E_j, E_k) < d(E_i, E_j).
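A compact sketch of Tomek-link undersampling (numpy assumed, 0/1 labels; the mutual-nearest-neighbor test below is equivalent to the definition above):

```python
import numpy as np

def tomek_majority_indices(X, y):
    """Indices of majority-class points that participate in Tomek links.

    A pair (E_i, E_j) with different labels is a Tomek link exactly when
    the two points are each other's nearest neighbors. y holds 0/1 labels.
    """
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                 # nearest neighbor of each point
    majority = np.bincount(y).argmax()
    return {i for i, j in enumerate(nn)
            if nn[j] == i and y[i] != y[j] and y[i] == majority}

# Usage: keep everything except the majority-class ends of Tomek links.
# keep = np.setdiff1d(np.arange(len(y)), list(tomek_majority_indices(X, y)))
# X_clean, y_clean = X[keep], y[keep]
```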


Undersample by Tomek Links


Condensed Nearest Neighbor Rule (CNN rule)

Find a consistent subset of examples.
◦ A subset E' ⊆ E is consistent with E if, using a 1-nearest-neighbor classifier, E' correctly classifies the examples in E.

The goal is to eliminate examples from the majority class that are much farther away from the border.
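A sketch of CNN editing for imbalance (numpy assumed): keep every minority example, and absorb a majority example only when the current subset's 1-NN rule misclassifies it.

```python
import numpy as np

def cnn_subset(X, y, minority=1, seed=0):
    """Return indices of a 1-NN-consistent subset: all minority points
    plus the majority points needed to classify the rest correctly."""
    rng = np.random.default_rng(seed)
    keep = list(np.where(y == minority)[0])
    majority_idx = np.where(y != minority)[0]
    keep.append(int(rng.choice(majority_idx)))   # seed with one majority point
    changed = True
    while changed:                               # sweep until consistent
        changed = False
        for i in majority_idx:
            if i in keep:
                continue
            nearest = keep[int(np.linalg.norm(X[keep] - X[i], axis=1).argmin())]
            if y[nearest] != y[i]:               # misclassified: absorb it
                keep.append(i)
                changed = True
    return np.array(keep)
```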


CNN Editing


Oversampling

Replicate the minority class examples to increase their relevance.

But no new information is being added. Hurts the generalization capacity.


Instead of replicating, let us invent some new instances

SMOTE: Synthetic Minority Over-sampling Technique


SMOTE

for i = 1 to |x|, with x_i ∈ Minority:
    Compute the k nearest minority-class neighbors of x_i
    Randomly select one neighbor x_n from the k nearest neighbors
    for each feature j:
        if feature j is continuous:
            x'_j = x_ij + rand(0, 1) × (x_nj − x_ij)
        else (feature j is nominal):
            x'_j = the majority value of feature j among the k nearest neighbors
    endfor
endfor

Distances over nominal features use the Value Difference Metric (VDM), $\delta(v_1, v_2) = \sum_c \left| \frac{N_{1c}}{N_1} - \frac{N_{2c}}{N_2} \right|$, and the combined distance is $\Delta = \delta_{continuous} + \delta_{nominal}$.
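A minimal runnable sketch of the continuous-feature case of the pseudocode above (numpy assumed; the amount parameter and brute-force neighbor search are illustrative choices, not the reference implementation):

```python
import numpy as np

def smote(X_min, amount=300, k=5, seed=0):
    """SMOTE sketch for continuous features: for each minority example,
    interpolate toward randomly chosen minority-class nearest neighbors.
    amount is a percentage: 300 -> 3 synthetic examples per original."""
    rng = np.random.default_rng(seed)
    per_example = amount // 100
    # Brute-force k nearest minority neighbors (O(n^2), fine for a sketch).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for i, x in enumerate(X_min):
        for _ in range(per_example):
            neighbor = X_min[rng.choice(nn[i])]         # one of the k NNs
            gap = rng.random()                          # rand(0, 1)
            synthetic.append(x + gap * (neighbor - x))  # point on the segment
    return np.array(synthetic)
```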


(Figure: for each minority example k, compute its nearest minority-class examples {i, j, l, n, m}.)


(Figure: randomly choose one example out of the 5 closest minority-class points.)

(Figure: synthetically generate event k1, such that k1 lies between k and i.)


(Figure: after applying SMOTE 3 times (SMOTE parameter = 300%), the data set may look like the picture above, with synthetic points k1, k2, k3.)

SMOTE

Generate "new" minority class instances conditioned on the distribution of known instances.


SMOTE

Generate "new" minority class instances conditioned on the provided instances.


Beware of those lurking majority class examples

Borderline-SMOTE


SMOTE + Tomek Link Undersample


Sampling Methods: Discussion

Two fundamental issues:
◦ What is the right sampling method for a given dataset?
◦ How do we choose the amount to sample?

Use a wrapper to empirically discover the relevant amounts of sampling.


A Scalable Wrapper

Testing every pair of under-sampling and SMOTE percentages is too time consuming, so a heuristic is used: search for the under-sampling percentage first, then search for the SMOTE percentage.
◦ Hypothesis: under-sampling will first remove the "excess" majority class examples without much hampering the accuracy on the majority classes; SMOTE will then add synthetic minority class examples, which will increase the generalization performance of the classifier over the minority classes.

The algorithm is divided into two parts:
◦ Wrapper Under-sample Algorithm
◦ Wrapper SMOTE Algorithm

Our algorithm can handle multiple minority and majority classes, and uses five-fold cross-validation over the training data as the evaluation function.
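A sketch of the two-stage greedy search described above (the grids, the evaluate callback, and its signature are illustrative assumptions; evaluate is meant to return the minority-class F-value from five-fold cross-validation at the given percentages):

```python
def wrapper_search(train, evaluate,
                   us_grid=range(10, 100, 10),
                   smote_grid=range(50, 600, 50)):
    """Greedy wrapper: fix the under-sampling percentage first,
    then search SMOTE percentages with a look-ahead stopping rule."""
    best_us, best_f = 0, evaluate(train, us=0, sm=0)
    for us in us_grid:
        f = evaluate(train, us=us, sm=0)
        if f > best_f:
            best_us, best_f = us, f
        else:
            break                       # minority F-value dropped: stop
    best_sm, lookahead = 0, 0
    for sm in smote_grid:
        f = evaluate(train, us=best_us, sm=sm)
        if f > best_f:
            best_sm, best_f, lookahead = sm, f, 0
        else:
            lookahead += 1
            if lookahead >= 2:          # two look-aheads exhausted
                break
    return best_us, best_sm
```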


Under-sampling percentage search

(Figure: a loop in which the training set feeds an induction algorithm, performance is measured by five-fold cross-validation, and the wrapper selects under-sampling percentages with their corresponding best minority-class F-values.)

A greedy search process guided by performance over five-fold cross-validation of the training data. The baseline results are the results with no under-sampling performed. Generally, under-sampling the majority classes increases the average F-value over the minority classes, but in some cases it can be reduced, due to a large reduction in precision versus a small increase in recall.

Criteria to stop under-sampling for a majority class:
◦ The average F-value over the minority classes reduces, or
◦ The average F-value over the majority classes reduces by more than 5%.

SMOTE percentage search

(Figure: the same wrapper loop, now with the wrapper-selected under-sampling percentages held constant while SMOTE percentages are searched; the final wrapper-selected under-sampling and SMOTE percentages are evaluated once on the testing set.)

The under-sampling percentages for the majority classes are fixed to those found by the Wrapper Under-sample Algorithm, and the corresponding best results are the baseline.

Criteria to stop SMOTE for a minority class:
◦ The average F-value over the minority classes reduces, and
◦ The two look-aheads for the minority class are exhausted.

Rationale for the SMOTE look-ahead: more SMOTE adds more information and can give better accuracy over the minority classes, as opposed to under-sampling, which removes information and thereby reduces the coverage of the built classifier.


Exploiting Locality in Sampling


The class ratio can be important in determining the best sampling levels to use, but other properties may exert greater influence:
◦ Overlap
◦ Density
Consider the following examples.


(Figure: example datasets that all have the same class ratio, 50:1000.)


AUROC by (density, separation), for C4.5 and C4.5+S (with wrapper-selected sampling), along with the selected undersample and SMOTE percentages:

(density, separation)   C4.5    C4.5+S   Undersample %   SMOTE %
(High, High)            0.926   0.968    40              450
(High, Medium)          0.909   0.942    60              250
(High, Low)             0.898   0.904    60              100
(Medium, High)          0.915   0.961    50              300
(Medium, Medium)        0.878   0.927    40              250
(Medium, Low)           0.814   0.831    70              250
(Low, High)             0.892   0.940    0               150
(Low, Medium)           0.825   0.847    70              350
(Low, Low)              0.705   0.736    20              500


(Figure: two datasets with wrapper-selected (undersample, SMOTE) percentages of (50, 350) and (80, 250), and AUROCs of 0.906 and 0.812; for their combination, (U, S) = (??, ???) and AUROC = ????? are unknown.)


Localized Sampling Framework

Step 1: Split the data
• Could use most supervised and unsupervised methods
• We form a 2-level Hellinger distance tree (upcoming)
• Allows localization of diverging class distributions

Step 2: Sample
• Sample (SMOTE and undersample) each localization
• Optimize global performance, iterating each sample based on minority class size (use the wrapper approach)


Step 3: Train/predict globally
• (Figure: the locally resampled partitions are recombined and a final C4.5 classifier is trained.)

[Cieslak & Chawla, ICDM 2008]

Example: Pima

(Figures: the Pima dataset is split into localized regions, each receiving its own wrapper-selected amounts, e.g. 40% undersampling, 40% undersampling with 250% SMOTE, 50% SMOTE, 90% undersampling, 50% undersampling with 500% SMOTE, and 100% SMOTE.)

Page 24: Mining When Classes are Imbalanced, Rare Events Matter More ...

2/24/2009

24

Experimental Results

AUROC rank averaged across 20 datasets (lower is better). COGOS is an alternative localized sampling method using clustering; UnSM is the standard wrapper (undersample + SMOTE); LS is our localized sampling method:

Method      C4.5    SVM     C4.5+COGOS   SVM+COGOS   C4.5+UnSM   C4.5+LS
Avg. Rank   4.850   4.900   3.650        3.530       2.750       1.330

Overview

◦ Introduction
◦ Sampling Methods
◦ Moving Decision Threshold
◦ Classifiers' Objective Functions
◦ Evaluation Measures


Recommendations

◦ Use localized sampling: start globally, optimize locally, and predict globally
◦ A wrapper can be integrated to guide the sampling
◦ Generally, AUC is recommended as the objective function

Changing Decision Thresholds

Decisions of (scoring) classifiers are typically set at 0.5.
◦ P(x) > 0.5 is class 1 and P(x) <= 0.5 is class 0
The decision threshold can be moved to compensate for the rate of imbalance.
◦ Equivalent to different operating points on the ROC curve
A wrapper can again be used to optimize the threshold.
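A minimal sketch of threshold moving via a validation-set search (the metric callback is an assumption; any of the threshold-dependent measures from the evaluation section could be plugged in):

```python
import numpy as np

def pick_threshold(scores, y_true, metric):
    """Sweep candidate thresholds over validation scores and keep the
    one that maximizes metric(y_true, y_pred)."""
    best_t, best_m = 0.5, -np.inf
    for t in np.unique(scores):
        m = metric(y_true, (scores > t).astype(int))
        if m > best_m:
            best_t, best_m = t, m
    return best_t

# Example with scikit-learn's F-measure (assumed available):
# from sklearn.metrics import f1_score
# t = pick_threshold(clf.predict_proba(X_val)[:, 1], y_val, f1_score)
```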


Decision Thresholds

The quality of probability estimates also becomes important.
◦ Estimate the quality of the estimates using appropriate measures such as negative cross entropy or the Brier score.
Sampling methods can also be combined with threshold moving.

Overview

◦ Introduction
◦ Sampling Methods
◦ Moving Decision Threshold
◦ Classifiers' Objective Functions
◦ Evaluation Measures


Beyond Sampling: Adapting Classifiers

Consider:
◦ Decision trees
◦ SVMs


Decision Trees

A popular choice when combined with sampling or threshold moving to counter the problem of class imbalance.
The leaf frequencies are converted to probability estimates (Laplace or m-estimate smoothing is typically applied).
◦ Suggested use is as a PET, a Probability Estimation Tree (unpruned, no-collapse, and Laplace).


Converting decision tree leaf predictions into probability estimates

$P_{freq} = \frac{TP}{TP + FP}$

$P_{laplace} = \frac{TP + 1}{TP + FP + C}$

$P_{mest} = \frac{TP + b \cdot m}{TP + FP + m}$

where TP and FP are the counts of positive and negative training examples at the leaf, C is the number of classes, b is the base rate (prior) of the positive class, and m is the smoothing parameter.


Some of the skew-insensitive metrics proposed:
◦ Dietterich, Kearns and Mansour (DKM)
◦ Hellinger distance
◦ Area under the ROC curve (AUC)
◦ Minimum squared error of probability estimates (MSE split)


Decision tree (im)purity metrics

Partition the feature space to maximize purity at the leaves. Recurse.


Entropy (Information Gain) as an impurity

$E = -\sum_{i \in (W,Q)} \frac{N_i^L}{N^L} \log_2 \frac{N_i^L}{N^L} \; - \sum_{i \in (W,Q)} \frac{N_i^R}{N^R} \log_2 \frac{N_i^R}{N^R}$

where (Q, W) are the classes of interest, N is the number of samples, N_i is the number of samples in class i, and N_i^{L/R} is the number of samples in class i on the left/right side of the split.


Consider a skew-insensitive criterion
◦ Hellinger distance: a distance between probability measures that is independent of the dominating measure.


Properties of Hellinger Distance

$d_H(P, Q) = \sqrt{\int_\Omega \left(\sqrt{P} - \sqrt{Q}\right)^2 d\lambda}$

For a countable space $\Phi$:

$d_H(P, Q) = \sqrt{\sum_{\phi \in \Phi} \left(\sqrt{P(\phi)} - \sqrt{Q(\phi)}\right)^2}$

◦ Ranges from 0 to $\sqrt{2}$
◦ Symmetric: $d_H(P, Q) = d_H(Q, P)$
◦ Lower bounds the KL divergence


Formulating for the decision tree

Consider a countable space: the p values (partitions) of a feature. Consider a two-class problem, with W and Q the two classes:

$H = \sqrt{\sum_{j=1}^{p} \left( \sqrt{\frac{N_{jQ}}{N_Q}} - \sqrt{\frac{N_{jW}}{N_W}} \right)^2}$

A "distance" in the normalized frequencies space.
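A small sketch of this two-class formulation (numpy assumed; counts_q[j] and counts_w[j] are the per-partition counts of each class):

```python
import numpy as np

def hellinger(counts_q, counts_w):
    """Hellinger distance between two classes' distributions over the
    partitions of a feature: sqrt(sum_j (sqrt(N_jQ/N_Q) - sqrt(N_jW/N_W))^2)."""
    pq = counts_q / counts_q.sum()   # N_jQ / N_Q
    pw = counts_w / counts_w.sum()   # N_jW / N_W
    return np.sqrt(np.sum((np.sqrt(pq) - np.sqrt(pw)) ** 2))

# A split that separates the classes well scores near sqrt(2):
# hellinger(np.array([90, 10]), np.array([20, 80]))  # ~0.77
```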

Inf. Gain vs. Hellinger distance

Entropy of a split, with (Q, W) the classes of interest, N_i the number of samples in class i, and N_i^{L/R} the per-class counts in the left/right branches:

$E = -\sum_{i \in (W,Q)} \frac{N_i^L}{N^L} \log_2 \frac{N_i^L}{N^L} \; - \sum_{i \in (W,Q)} \frac{N_i^R}{N^R} \log_2 \frac{N_i^R}{N^R}$

Hellinger distance of the same split:

$H = \sqrt{ \left( \sqrt{\frac{N_{LQ}}{N_Q}} - \sqrt{\frac{N_{LW}}{N_W}} \right)^2 + \left( \sqrt{\frac{N_{RQ}}{N_Q}} - \sqrt{\frac{N_{RW}}{N_W}} \right)^2 }$

Note the contrast: entropy normalizes by branch size ($N^L$, $N^R$), so it depends on the class priors, while Hellinger normalizes within each class ($N_Q$, $N_W$), which makes it insensitive to skew.


Comparing Value Surfaces

(Figure: value surfaces of Hellinger distance and information gain as functions of P(x|+) and P(x|−), at class ratio +:− = 1:1.)


Comparing Value Surfaces

(Figure: the same value surfaces of information gain and Hellinger distance, at class ratio +:− = 1:100.)


Hellinger vs. DKM as decision tree splitting criteria

$d_{DKM} = 2\sqrt{P(+)P(-)} - 2P(L)\sqrt{P(+|L)P(-|L)} - 2P(R)\sqrt{P(+|R)P(-|R)}$

$d_H = \sqrt{\left(\sqrt{P(L|+)} - \sqrt{P(L|-)}\right)^2 + \left(\sqrt{P(R|+)} - \sqrt{P(R|-)}\right)^2}$

$d_H^2 = 2 - 2\sqrt{P(L|+)P(L|-)} - 2\sqrt{P(R|+)P(R|-)}$

DKM has improved concavity compared to information gain, especially for very small (relative) class proportions [10].


Algorithm HDDT
Input: training set T, cutoff size C

  if |T| < C then return
  for each feature f of T do
      H_f = Calc_Hellinger(T, f)
  end for
  b = argmax_f H_f   (the best feature)
  for each branch v of b do
      HDDT(T_{x_b = v}, C)
  end for

Function Calc_Hellinger
Input: training set T, feature f

  Hellinger = 0
  for each value v of f do
      Hellinger += $\left( \sqrt{\frac{|T_{x_f = v,\, y = +}|}{|T_{y = +}|}} - \sqrt{\frac{|T_{x_f = v,\, y = -}|}{|T_{y = -}|}} \right)^2$
  end for
  return √Hellinger
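A toy runnable version of HDDT for categorical features (numpy assumed; the nested-dict tree representation is an illustrative choice, not the authors' implementation):

```python
import numpy as np

def calc_hellinger(X, y, f):
    """Hellinger distance of feature f between the two classes (y in {0, 1})."""
    pos, neg = X[y == 1, f], X[y == 0, f]
    h = 0.0
    for v in np.unique(X[:, f]):
        h += (np.sqrt((pos == v).mean()) - np.sqrt((neg == v).mean())) ** 2
    return np.sqrt(h)

def hddt(X, y, cutoff=10):
    """Pick the feature with the largest Hellinger distance, split on
    its values, and recurse; leaves store P(class 1)."""
    if len(y) < cutoff or len(np.unique(y)) == 1:
        return {"leaf": float(y.mean())}
    b = int(np.argmax([calc_hellinger(X, y, f) for f in range(X.shape[1])]))
    if len(np.unique(X[:, b])) == 1:          # nothing left to split on
        return {"leaf": float(y.mean())}
    return {"feature": b,
            "branches": {v: hddt(X[X[:, b] == v], y[X[:, b] == v], cutoff)
                         for v in np.unique(X[:, b])}}
```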

Support Vector Machines

SVMs are also sensitive to high class imbalance.
A penalty can be specified as a trade-off between the two classes.
◦ Limitations arise from the Karush-Kuhn-Tucker conditions
Solutions:
◦ Integrating sampling strategies
◦ Kernel alignment algorithms
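A one-line illustration of the per-class penalty trade-off, using scikit-learn's SVC (an assumed library choice; the 10x weight is arbitrary):

```python
from sklearn.svm import SVC

# Errors on the minority class (label 1) are penalized 10x more heavily,
# i.e. a larger effective C for minority-class margin violations.
clf = SVC(kernel="rbf", class_weight={0: 1, 1: 10})
# clf.fit(X_train, y_train)
```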



Recommendations

◦ Hellinger distance is strongly skew-insensitive
◦ It is more robust to class imbalance than Gini and information gain
◦ Recommended as a decision tree splitting criterion


Kernel Boundary Alignment

Adaptively modify K (the kernel matrix) based on the training set distribution.
Addresses:
◦ Improving class separation
◦ Safeguarding against overfitting
◦ Improving the imbalanced ratio


Overview

◦ Introduction
◦ Sampling Methods
◦ Moving Decision Threshold
◦ Classifiers' Objective Functions
◦ Evaluation Measures


Back to the Performance Fruit Bowl

What evaluation measure to use?

Is there one validation strategy that we can embrace?


Truth Table

                    Truth Value:
                    Positive               Negative
Predict Positive    True Positive (TP)     False Positive (FP)
Predict Negative    False Negative (FN)    True Negative (TN)


f-measure
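The slide's formula did not survive extraction; for reference, the standard definitions (an assumption that this is what the slide showed):

$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$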



More on Precision and Recall

◦ Top-20 precision
◦ Top-20 recall
◦ Mean average precision
◦ Precision-recall curves (sweeping across thresholds)

(Source of this figure: Rich Caruana)


Balanced Accuracy and G-mean

$\text{Balanced Accuracy} = \frac{\text{Accuracy}^{+} + \text{Accuracy}^{-}}{2}$

$\text{G-mean} = \sqrt{\text{Accuracy}^{+} \times \text{Accuracy}^{-}}$


ROC Curves

(Figure: ROC curves A and B, True Positive Rate vs. False Positive Rate; the top-left corner is "Heaven", the bottom-right corner is "Hell".)


ROC Curves

(Figure: two crossing ROC curves; Blue "wins" in one region, Red "wins" in another.)


AUC

$A = \frac{I_1 - n_1(n_1 + 1)/2}{n_0 \, n_1}$

where $I_1$ is the sum of the ranks of all class-1 examples, $n_1$ is the number of class-1 examples, and $n_0$ is the number of class-0 examples.
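A direct transcription of the rank formula above (numpy and scipy assumed):

```python
import numpy as np
from scipy.stats import rankdata

def auc(scores, y):
    """Rank-based AUC: A = (I1 - n1(n1+1)/2) / (n0 * n1)."""
    ranks = rankdata(scores)        # average ranks handle ties
    n1 = int((y == 1).sum())
    n0 = len(y) - n1
    i1 = ranks[y == 1].sum()        # sum of ranks of class-1 examples
    return (i1 - n1 * (n1 + 1) / 2) / (n0 * n1)

# auc(np.array([0.1, 0.4, 0.35, 0.8]), np.array([0, 0, 1, 1]))  # 0.75
```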

AUC: Area Under the ROC Curve

(Figure: the two crossing ROC curves again; AUC(Blue) = 0.7626, AUC(Red) = 0.7569.)


Cost Curves

(Figure: a cost curve; the x-axis is the ratio of each class's misclassification cost times its class prior, and the y-axis is the expected cost per example.)

Relating AUC to Cost Curves

(Figure: "This ROC...")


Relating AUC to Cost Curves

(Figure: "...converts to this cost curve.")

Costs and Benefits

                    Actual Negative   Actual Positive
Predict Negative    TN                FN
Predict Positive    FP                TP

                    Actual Negative   Actual Positive
Predict Negative    b00               b01
Predict Positive    b10               b11

With $P_k$ the probability that example k is positive and $b_{ij}$ the benefit of each outcome, the expected benefits of the two predictions are:

$B_N = (1 - P_k)\,b_{00} + P_k\,b_{01}$

$B_P = (1 - P_k)\,b_{10} + P_k\,b_{11}$


Benefit of Non-Default

Predict non-default for example x when its expected benefit exceeds that of predicting default:

$(1 - P_k)\,b_{00} + P_k\,b_{01} > (1 - P_k)\,b_{10} + P_k\,b_{11}$

$\therefore NPV = (1 - P_k)\,b_{00} + P_k\,b_{01} - (1 - P_k)\,b_{10} - P_k\,b_{11}$

$\equiv (1 - P_k)\,b(TN) + (1 - P_k)\,C(FP) - P_k\,b(TP) - P_k\,C(FN)$


Quality of Posterior Probability Estimates

Negative cross entropy:

$NCE = -\frac{1}{n} \left\{ \sum_{i \mid y_i = 1} \log p(y = 1 \mid x_i) + \sum_{i \mid y_i = 0} \log\left(1 - p(y = 1 \mid x_i)\right) \right\}$


Brier Score

◦ Average quadratic loss on each test instance
◦ Indicative of the best estimates at the true probabilities

$QL = \frac{1}{n} \sum_{i=1}^{n} (y_i - p_i)^2$

◦ Accounts for probability assignments to all classes
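Both calibration measures in a few lines (numpy assumed; p holds predicted P(y=1|x), strictly inside (0, 1)):

```python
import numpy as np

def nce(p, y):
    """Negative cross entropy, as defined above."""
    return -np.mean(np.where(y == 1, np.log(p), np.log(1 - p)))

def brier(p, y):
    """Brier score: mean quadratic loss of the probability estimates."""
    return np.mean((y - p) ** 2)

# p = clf.predict_proba(X_test)[:, 1]
# nce(p, y_test), brier(p, y_test)
```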


What are we really evaluating, then?

◦ Rank order?
◦ Quality of probability estimates?
◦ Precision, recall (and F-measure) at a threshold?
◦ Balanced accuracy or G-mean (again, at a threshold)?
◦ An operating point on the ROC curve?
◦ Costs?

Different measures have different sensitivities. A call to the community: let us standardize.


Step one, choosing the validation strategy


Step two, comparing and contrasting evaluation measures


Step three, computing statistical significance


Step four, some recommendations and a call to the community


Discussion

◦ Need for larger datasets: a benchmark repository
◦ Need for many positives, and a march towards parts-per-million
◦ Need for standardization in evaluation
◦ Need for full parameter disclosure in papers


Datasets and Software

Available via:
◦ http://www.nd.edu/~dial
◦ Email me: [email protected]


Acknowledgements

Kevin Bowyer, David Cieslak, George Forman, Larry Hall, Nathalie Japkowicz, Philip Kegelmeyer, Alek Kolcz


Let neither measurement without theory
Nor theory without measurement dominate
Your mind but rather contemplate
A two-way interaction between the two
Which will your thought processes stimulate
To attain syntheses beyond a rational expectation!

— Contributed by A. Zellner
