Transcript of Andrew Thesis

A COMBINATION SCHEME FOR INDUCTIVE
LEARNING FROM IMBALANCED DATA SETS

by

Andrew Estabrooks

A Thesis Submitted to the Faculty of Computer Science
in Partial Fulfillment of the Requirements for the degree of

MASTER OF COMPUTER SCIENCE

Major Subject: Computer Science

APPROVED:

_________________________________
Nathalie Japkowicz, Supervisor

_________________________________
Qigang Gao

_________________________________
Louise Spiteri

DALHOUSIE UNIVERSITY - DALTECH


Halifax, Nova Scotia, 2000


DALTECH LIBRARY

"AUTHORITY TO DISTRIBUTE MANUSCRIPT THESIS"

TITLE: A Combination Scheme for Learning From Imbalanced Data Sets

The above library may make available or authorize another library to make available individual photo/microfilm copies of this thesis without restrictions.

Full Name of Author: Andrew Estabrooks

Signature of Author: _________________________________

Date: 7/21/2000


TABLE OF CONTENTS

1. Introduction
1 Inductive Learning
2 Class Imbalance
3 Motivation
4 Chapter Overview
5 Learners
5.1 Bayesian Learning
5.2 Neural Networks
5.3 Nearest Neighbor
5.4 Decision Trees
6 Decision Tree Learning Algorithms and C5.0
6.1 Decision Trees and the ID3 algorithm
6.2 Information Gain and the Entropy Measure
6.3 Overfitting and Decision Trees
6.4 Options
7 Performance Measures
7.1 Confusion Matrix
7.2 g-Mean
7.3 ROC curves
8 A Review of Current Literature
8.1 Misclassification Costs
8.2 Sampling Techniques
8.2.1 Heterogeneous Uncertainty Sampling
8.2.2 One-sided Intelligent Selection
8.2.3 Naive Sampling Techniques
8.3 Classifiers Which Cover One Class
10.2 Architecture
10.2.1 Classifier Level
10.2.2 Expert Level
10.2.3 Weighting Scheme
10.2.4 Output Level
11 Testing the Combination Scheme on the Artificial Domain
12 Text Classification
12.1 Text Classification as an Inductive Process
13 Reuters-21578
15.1 Precision and Recall
15.2 F-measure
15.3 Breakeven Point
15.4 Averaging Techniques
16 Statistics used in this study

3 Motivation

Currently, the majority of research in the machine learning community has based the performance of learning algorithms on how well they function on data sets that are reasonably balanced. This has led to the design of many algorithms that do not adapt well to imbalanced data sets. When faced with an imbalanced data set, researchers have generally devised methods to deal with the data imbalance that are specific to the application at hand. Recently, however, there has been a thrust towards generalizing techniques that deal with data imbalances.

The focus of this thesis is inductive learning on imbalanced data sets. The goal of the work presented is to introduce a combination scheme that uses two of the previously mentioned balancing techniques, downsizing and over-sampling, in an attempt to improve learning on imbalanced data sets. More specifically, I will present a system that combines classifiers in a hierarchical structure according to their sampling technique. This combination scheme will be designed using an artificial domain and tested on the real-world application of text classification. It will be shown that the combination scheme is an effective method of increasing a standard classifier's performance on imbalanced data sets.

4 Chapter Overview

The remainder of this thesis is broken down into four chapters. Chapter 2 gives background information and a review of the current literature pertaining to data set imbalance. Chapter 3 is divided into several sections. The first section describes an artificial domain and a set of experiments, which lead to the motivation behind a general scheme to handle imbalanced data sets. The second section describes the architecture behind a system


designed to lend itself to domains that have imbalanced data. The third section tests the developed system on the artificial domain and presents the results. Chapter 4 presents the real-world application of text classification and is divided into two parts. The first part gives needed background information and introduces the data set that the system will be tested on. The second part presents the results of testing the system on the text classification task and discusses its effectiveness. The thesis concludes with Chapter 5, which contains a summary and suggested directions for further research.


Chapter Two

2 BACKGROUND

I will begin this chapter by giving a brief overview of some of the more common learning algorithms and explaining the underlying concepts behind the decision tree learning algorithm C5.0, which will be used for the purposes of this study. There will then be a discussion of various performance measures that are commonly used in machine learning. Following that, I will give an overview of the current literature pertaining to data imbalance.

5 Learners

There are a large number of learning algorithms, which can be divided into a broad range of categories. This section gives a brief overview of the more common ones.

5.1 Bayesian Learning

Inductive learning centers on finding the best hypothesis h, in a hypothesis space H, given a set of training data D. The best hypothesis is the most probable hypothesis given the data set D and any initial knowledge about the prior probabilities of the various hypotheses in H. Machine learning problems can therefore be viewed as attempting to determine the probabilities of various hypotheses and choosing the hypothesis which has the highest probability given D.

More formally, we define the posterior probability P(h|D) to be the probability of a hypothesis h after seeing a data set D. Bayes' theorem (Eq. 1) provides a means to calculate posterior probabilities and is the basis of Bayesian learning.


P(h|D) = P(D|h) P(h) / P(D)    (Eq. 1)

A simple method of learning based on Bayes' theorem is called the naive Bayes classifier. Naive Bayes classifiers operate on data sets where each example x consists of attribute values a1, a2, ..., ai, and the target function f(x) can take on any value from a pre-defined finite set V = {v1, v2, ..., vj}. Classifying unseen examples involves calculating the most probable target value v_max, defined as:

v_max = argmax_{vj in V} P(vj | a1, a2, ..., ai)

Using Bayes' theorem (Eq. 1), v_max can be rewritten as:

v_max = argmax_{vj in V} P(a1, a2, ..., ai | vj) P(vj)

Under the assumption that attribute values are conditionally independent given the target value, the formula used by the naive Bayes classifier is:

v_NB = argmax_{vj in V} P(vj) * prod_i P(ai | vj)

where v_NB is the target output of the classifier, and P(ai | vj) and P(vj) can be calculated based on their frequency in the training data.
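A minimal sketch of this frequency-based calculation (the tiny data set below is invented for illustration, and smoothing of zero counts is omitted for simplicity):

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, target_value).
    Returns the class counts and per-attribute conditional counts."""
    priors = Counter(v for _, v in examples)
    cond = defaultdict(Counter)  # (attribute_position, class) -> value counts
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            cond[(i, v)][a] += 1
    return priors, cond, len(examples)

def classify(attrs, priors, cond, total):
    """v_NB = argmax_v P(v) * prod_i P(a_i | v), estimated from frequencies."""
    best_v, best_p = None, -1.0
    for v, count_v in priors.items():
        p = count_v / total
        for i, a in enumerate(attrs):
            p *= cond[(i, v)][a] / count_v
        if p > best_p:
            best_v, best_p = v, p
    return best_v

data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rain", "mild"), "yes"), (("rain", "hot"), "yes"),
        (("overcast", "mild"), "yes")]
model = train_naive_bayes(data)
print(classify(("rain", "hot"), *model))
```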

5.2 Neural Networks

Neural networks are considered very robust learners that perform well on a wide range of applications, such as optical character recognition [Le Cun et al., 1989] and autonomous navigation [Pomerleau, 19..].


Their design is loosely inspired by biological neurons, which transmit signals along axons. The basic unit of an artificial neural network is the perceptron, which takes as input a number of values and calculates the linear combination of these values. The combined value of the input is then transformed by a threshold unit such as the sigmoid function (see footnote 2). Each input to a perceptron is associated with a weight that determines the contribution of the input. Learning for a neural network essentially involves determining values for the weights. A pictorial representation of a perceptron is given in Figure 2.1.2.

[Figure 2.1.2: A perceptron, with inputs x1, x2, ..., xn, weights w0, w1, ..., wn, and a threshold unit.]

5.3 Nearest Neighbor

Nearest neighbor learning algorithms are instance-based learning methods: they store examples and classify newly encountered examples by looking at the stored instances considered similar. In its simplest form, all instances correspond to points in an n-dimensional space, and an unseen example is classified by choosing the majority class of the closest examples. An advantage of nearest neighbor algorithms is that they can approximate very complex target functions by making simple local approximations based on the data close to the example to be classified. An excellent example of an application that uses a nearest neighbor algorithm is text retrieval, in which documents are represented as vectors and a cosine similarity metric is used to measure the distance of queries to documents.
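A sketch of that idea, with cosine similarity as the closeness measure (the document vectors and labels below are invented for illustration):

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| |v|); higher means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def knn_classify(query, examples, k=3):
    """Majority class among the k stored examples most similar to the query."""
    ranked = sorted(examples, key=lambda ex: cosine_similarity(query, ex[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

docs = [([1.0, 0.0, 0.1], "sports"), ([0.9, 0.1, 0.0], "sports"),
        ([0.0, 1.0, 0.8], "politics"), ([0.1, 0.9, 1.0], "politics")]
print(knn_classify([1.0, 0.1, 0.0], docs, k=3))
```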

2. The sigmoid function is defined as o(y) = 1 / (1 + e^(-y)) and is referred to as a squashing function because it maps a very wide range of values onto the interval (0, 1).


5.4 Decision Trees

Decision trees classify examples according to the values of their attributes. They are constructed by recursively partitioning the training examples, each time on the remaining attribute that has the highest information gain. Attributes become nodes in the constructed tree, and their possible values determine the paths of the tree. The process of partitioning the data continues until the data is divided into subsets that contain a single class, or until some stopping condition is met (this corresponds to a leaf in the tree). Typically, decision trees are pruned after construction by merging children of nodes and giving the parent node the majority class. Section 6 describes in detail how decision trees, in particular C5.0, operate and are constructed.

6 Decision Tree Learning Algorithms and C5.0

C5.0 is a decision tree learning algorithm that is a later version of the widely used C4.5 algorithm [Quinlan, 1993]. The following section consists of two parts. The first part is a brief summary of Mitchell's description of the ID3 algorithm and the extensions leading to typical decision tree learners. A brief operational overview of C5.0 is then given as it relates to this work.

Before I begin the discussion of decision tree algorithms, it should be noted that a decision tree is not the only learning algorithm that could have been used in this study. As described above, there are many different learning algorithms. For the purposes of this study, a decision tree algorithm was chosen for three reasons. The first is the understandability of the classifier created by the learner. By looking at the complexity of a decision tree in terms of the number and size of extracted rules, we can describe the behavior of the learner. Choosing a learner such as naive Bayes, which classifies examples based on probabilities, would make an analysis of this type nearly impossible. The second reason a decision tree learner was chosen was its computational speed. Although not as cheap to operate as naive Bayes, decision tree learners have significantly shorter training times than do neural networks. Finally, a decision tree was chosen because it operates well on tasks


that classify examples into a discrete number of classes. This lends itself well to the real-world application of text classification, the domain on which the combination scheme designed in Chapter 3 will be tested.

6.1 Decision Trees and the ID3 algorithm

Decision trees classify examples by sorting them based on attribute values. Each node in a decision tree represents an attribute in an example to be classified, and each branch represents a value that the node can take. Examples are classified by starting at the root node and sorting them according to their attribute values. Figure 2.2.1 is an example of a decision tree that could be used to classify whether or not it is a good day for a drive.

[Figure 2.2.1: A decision tree for deciding whether it is a good day for a drive. The root node Road Conditions branches on Clear, Snow Covered, and Icy; internal nodes test Forecast (Clear, Rain, Snow), Temperature (Warm, Freezing), and Accumulation (Light, Heavy); the leaves are YES/NO decisions.]


An instance with the attribute Road Conditions assigned Clear, for example, would sort to the nodes Road Conditions, Forecast, and finally Temperature, which would classify the instance as being positive (YES); that is, it is a good day to drive. Conversely, an instance containing the attribute Road Conditions assigned Snow Covered would be classified as not a good day to drive no matter what the Forecast, Temperature, or Accumulation are.

Decision trees are constructed using a top-down greedy search algorithm which recursively subdivides the training data based on the attribute that best classifies the training examples. The basic algorithm, ID3, begins by dividing the data according to the value of the attribute that is most useful in classifying the data; the attribute that best divides the training data becomes the root node of the tree. The algorithm is then repeated on each partition of the divided data, creating subtrees, until the training data is divided into subsets of the same class. At each level in the partitioning process a statistical property known as information gain is used to determine which attribute best divides the training examples.

6.2 Information Gain and the Entropy Measure

Information gain is used to determine how well an attribute separates the training data according to the target concept. It is based on a measure commonly used in information theory known as entropy. Defined over a collection of training data S with a Boolean target concept, the entropy of S is:

Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-)

where p(+) is the proportion of positive examples in S and p(-) the proportion of negative examples. The function of the entropy measure is easily described with an example. Assume that there is a set of data S containing ten examples, seven with a positive class and three with a negative class [7+, 3-]. The entropy of S is then -(7/10) log2(7/10) - (3/10) log2(3/10), or approximately 0.88.


Note that if the numbers of positive and negative examples in the set were even (p(+) = p(-) = 0.5), the entropy function would equal 1. If all the examples in the set were of the same class, the entropy of the set would be 0. If the set being measured contains an unequal number of positive and negative examples, the entropy measure will fall between 0 and 1.
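These three cases, and the [7+, 3-] example above, can be checked directly:

```python
import math

def entropy(p_pos):
    """Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-); 0 * log2(0) is taken as 0."""
    total = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:
            total -= p * math.log2(p)
    return total

print(round(entropy(0.7), 3))  # the [7+, 3-] set: about 0.881
print(entropy(0.5))            # balanced set: 1.0
print(entropy(1.0))            # single-class set: 0.0
```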

Entropy can be interpreted as the minimum number of bits needed to encode the classification of an arbitrary member of S. Consider two people passing messages back and forth that are either positive or negative. If the receiver knows that the message being sent is always going to be positive, then no message needs to be sent; there is nothing to encode, and no bits are transmitted. If, on the other hand, half the messages are negative, then one bit is needed to indicate whether the message being sent is positive or negative. For cases where there are more examples of one class than the other, on average less than one bit needs to be sent, by assigning shorter codes to more likely collections of examples and longer codes to less likely collections. In a case where p(+) = 0.9, shorter codes could be assigned to collections of positive messages being sent, with longer codes assigned to collections of negative messages.

Information gain is the expected reduction in entropy when partitioning the examples of a set S according to an attribute A. It is defined as:

Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|Sv| / |S|) * Entropy(Sv)

where Values(A) is the set of all possible values for an attribute A, and Sv is the subset of examples in S which have the value v for attribute A. On a Boolean data set having only positive and negative examples, Values(A) would be defined over {+, -}. The first term in the equation is the entropy of the original data set. The second term describes the entropy of the data set after it is partitioned using the attribute A. It is nothing more than a sum of


the entropies of each subset Sv, weighted by the number of examples that belong to the subset. The following is an example of how Gain(S, A) would be calculated on a fictitious data set. Given a data set S with ten examples (7 positive and 3 negative), each containing an attribute Temperature with Values(Temperature) = {Warm, Freezing}, Gain(S, Temperature) would be calculated by computing Entropy(S) for S = [7+, 3-] and subtracting the weighted entropies of the Warm and Freezing subsets.
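The calculation is easiest to follow in code. Note that the Warm/Freezing split used here (Warm = [6+, 1-], Freezing = [1+, 2-]) is an assumed illustration, not the thesis's original figures:

```python
import math

def entropy(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def gain(parent, subsets):
    """Gain(S, A) = Entropy(S) - sum over v of (|Sv| / |S|) * Entropy(Sv)."""
    total = sum(pos + neg for pos, neg in subsets)
    weighted = sum((pos + neg) / total * entropy(pos, neg)
                   for pos, neg in subsets)
    return entropy(*parent) - weighted

# S = [7+, 3-]; assumed split: Warm = [6+, 1-], Freezing = [1+, 2-]
print(round(gain((7, 3), [(6, 1), (1, 2)]), 3))
```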


6.3 Overfitting and Decision Trees

There are two common approaches that decision tree induction algorithms can use to avoid overfitting the training data:

1. Stop the training algorithm before it reaches a point at which it perfectly fits the training data, or
2. Prune the induced decision tree.

The most commonly used is the latter approach [Mitchell, 1997]. Decision tree learners normally employ post-pruning techniques that evaluate the performance of decision trees as they are pruned, using a validation set of examples that are not used during training. The goal of pruning is to improve the learner's accuracy on the validation set of data.

In its simplest form, post-pruning operates by considering each node in the decision tree as a candidate for pruning. Any node can be removed and assigned the most common class of the training examples that are sorted to the node in question. A node is pruned if removing it does not make the decision tree perform any worse on the validation set than before the node was removed. By using a validation set of examples, it is hoped that regularities present only in the training data do not occur in the validation set. In this way, pruning nodes created on regularities occurring in the training data will not hurt the performance of the decision tree over the validation set.

Pruning techniques do not always use additional data; consider the following pruning technique used by C4.5.

C4.5 begins pruning by taking the decision tree to be pruned and converting it into a set of rules: one for each path from the root node to a leaf. Each rule is then generalized by removing any of its conditions that will improve the estimated accuracy of the rule. The rules are then sorted by this estimated accuracy and are considered in the sorted sequence when classifying newly encountered examples. The estimated accuracy of each rule is calculated on the training data used to create the classifier (i.e., it is a measure of how well the rule classifies the training examples). The estimate is a pessimistic one and is calculated by taking the


accuracy of the rule over the training examples it covers and then calculating the standard deviation assuming a binomial distribution. For a given confidence level, the lower-bound estimate is taken as a measure of the rule's performance. A more detailed discussion of C4.5's pruning technique can be found in [Quinlan, 1993].

Adaptive Boosting

C5.0 offers adaptive boosting [Schapire and Freund, 1997]. The general idea behind adaptive boosting is to generate several classifiers on the training data. When an unseen example is encountered to be classified, the predicted class of the example is a weighted count of votes from the individually trained classifiers. C5.0 creates a number of classifiers by first constructing a single classifier. A second classifier is then constructed by re-training on the examples used to create the first classifier, but paying more attention to the cases in the training set which the first classifier classified incorrectly. As a result, the second classifier is generally different from the first. The basic algorithm behind Quinlan's implementation of adaptive boosting is as follows:

1. Choose examples from the training set of N examples, each initially being assigned a probability of 1/N of being chosen, and train a classifier on them.
2. Classify the chosen examples with the trained classifier.
3. Re-weight the examples by multiplying the probability of the misclassified examples by a weight B.
4. Repeat the previous three steps K times with the generated probabilities.
5. Combine the K classifiers, giving a weight log(B_k) to each trained classifier.
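The steps above can be sketched as follows. This is only a rough illustration: the constant re-weighting factor `b`, the toy weak learner, and the renormalization step are simplifying assumptions, and C5.0's actual implementation differs in detail:

```python
import math
import random

def boost(examples, train_weak, rounds=5, b=2.0):
    """examples: list of (features, label); train_weak: sample -> classifier.
    Returns a list of (classifier, vote_weight) pairs."""
    n = len(examples)
    probs = [1.0 / n] * n                      # step 1: uniform probabilities
    ensemble = []
    for _ in range(rounds):                    # step 4: repeat K times
        sample = random.choices(examples, weights=probs, k=n)
        clf = train_weak(sample)               # steps 1-2: train and classify
        for i, (x, y) in enumerate(examples):  # step 3: up-weight mistakes
            if clf(x) != y:
                probs[i] *= b
        total = sum(probs)
        probs = [p / total for p in probs]
        ensemble.append((clf, math.log(b)))    # step 5: vote weight log(B)
    return ensemble

def predict(ensemble, x):
    """Weighted vote over the individually trained classifiers."""
    votes = {}
    for clf, w in ensemble:
        votes[clf(x)] = votes.get(clf(x), 0.0) + w
    return max(votes, key=votes.get)
```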


Adaptive boosting can be invoked as a C5.0 option, with the number of classifiers to generate specified by the user.

Pruning Options

C5.0 constructs decision trees in two phases. First it constructs a classifier that fits the training data, and then it prunes the classifier to avoid over-fitting the data. Two options can be used to affect the way in which the tree is pruned.

The first option specifies the degree to which the tree can initially fit the training data. It specifies the minimum number of training examples that must follow at least two of the branches at any node in the decision tree. This is a method of avoiding over-fitting by stopping the training algorithm before it over-fits the data.

A second pruning option affects the severity with which the algorithm will post-prune constructed decision trees and rule sets. Pruning is performed by removing parts of the constructed decision trees or rule sets that have a high predicted error rate on new examples.

Rule Sets

C5.0 can also convert decision trees into rule sets. For the purposes of this study, rule sets were generated using C5.0, because rule sets are easier to understand than decision trees and can easily be described in terms of complexity; that is, rule sets can be looked at in terms of the average size of the rules and the number of rules in the set.

The previous description of C5.0's operation is by no means complete. It is merely an attempt to provide the reader with enough information to understand the options that were primarily used in this study. C5.0 has many other options that can be used to affect its operation. They include options to invoke k-fold cross validation, enable differential misclassification costs, and speed up training times by randomly sampling from large data sets.


7 Performance Measures

Evaluating a classifier's performance is a very important aspect of machine learning. Without an evaluation method it is impossible to compare learners, or even to know whether or not a hypothesis should be used. For example, when learning to classify mushrooms as poisonous or not, one would want to be able to measure very precisely the accuracy of a learned hypothesis. The following section introduces the confusion matrix, which identifies the types of errors a classifier makes, as well as two more sophisticated evaluation methods: the g-mean, which combines the performance of a classifier over two classes, and ROC curves, which provide a visual representation of a classifier's performance.

7.1 Confusion Matrix

A classifier's performance is commonly broken down into what is known as a confusion matrix. A confusion matrix basically shows the type of classification errors a classifier makes. Figure 2.3.1 shows a confusion matrix for a two-class problem; following the usual convention, a counts the correctly classified positive examples, b the positive examples classified as negative, c the negative examples classified as positive, and d the correctly classified negative examples.


A classifier's performance can also be calculated separately over the positive examples (denoted a+) and over the negative examples (denoted a-). Each is calculated as:

a+ = a / (a + b)        a- = d / (c + d)

7.2 g-Mean

Kubat, Holte, and Matwin [1998] use the geometric mean of the accuracies measured separately on each class:

g = sqrt(a+ * a-)

The basic idea behind this measure is to maximize the accuracy on both classes. In this study the geometric mean will be used as a check on how balanced the combination scheme is. For example, if we consider an imbalanced data set that has 240 positive examples and 6000 negative examples and stubbornly classify each example as negative, we would see, as in many imbalanced domains, a very high accuracy (acc = 96%). Using the geometric mean, however, would quickly show that this line of thinking is flawed: it would be calculated as sqrt(0 * 1) = 0.
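The 240/6000 example works out as follows (the a, b, c, d labels follow the confusion-matrix convention used in this chapter):

```python
import math

def gmean(a, b, c, d):
    """Geometric mean of the per-class accuracies from a confusion matrix:
    a = correct positives, b = missed positives,
    c = false positives, d = correct negatives."""
    acc_pos = a / (a + b)   # a+: accuracy on the positive class
    acc_neg = d / (c + d)   # a-: accuracy on the negative class
    return math.sqrt(acc_pos * acc_neg)

# Classify everything as negative on a 240-positive / 6000-negative set:
a, b, c, d = 0, 240, 0, 6000
accuracy = (a + d) / (a + b + c + d)
print(round(accuracy, 3))   # about 0.962: deceptively high
print(gmean(a, b, c, d))    # 0.0: the imbalance is exposed
```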

7.3 ROC curves

ROC curves (Receiver Operating Characteristic) provide a visual representation of the trade-off between true positives and false positives. They are plots of the percentage of correctly classified positive examples (a+) against the percentage of incorrectly classified negative examples (a-).



[Figure: A fictitious example of two ROC curves, plotting True Positive (%) on the vertical axis against False Positive (%) on the horizontal axis.]

Point (0, 0) along a curve represents a classifier that by default classifies all examples as negative, whereas point (0, 100) represents a classifier that correctly classifies all examples.

Many learning algorithms allow induced classifiers to move along the curve by varying their learning parameters. For example, decision tree learning algorithms provide options allowing induced classifiers to move along the curve by way of pruning parameters (pruning options for C5.0 are discussed in Section 6). Swets [1988] proposes that classifiers' performances can be compared by calculating the area under the curves generated by the algorithms on identical data sets.


8 A Review of Current Literature

The approaches taken in the current literature to the problem of imbalanced data sets can be grouped into four categories. The first category, misclassification costs, covers methods that assign a higher cost to errors on the underrepresented class. The second category, sampling techniques, discusses data set balancing techniques that sample training examples, in both naive and intelligent fashions. The third category, classifiers that cover one class, describes learning algorithms that create rules to cover only one class. The last category, recognition-based learning, discusses a learning method that ignores or makes little use of one class altogether.

8.1 Misclassification Costs

Typically a classifier's performance is evaluated using the proportion of examples that are incorrectly classified. Pazzani, Merz, Murphy, Ali, Hume, and Brunk [1994] look at the errors made by a classifier in terms of their cost. For example, take an application such as the detection of poisonous mushrooms. The cost of misclassifying a poisonous mushroom as being safe to eat may have serious consequences and therefore should be assigned a high cost; conversely, misclassifying a mushroom that is safe to eat may have no serious consequences and should be assigned a low cost. Pazzani et al. [1994] use algorithms that attempt to solve the problem of imbalanced data sets by way of introducing a cost matrix. The algorithm that is of interest here is called Reduce Cost Ordering (RCO), which attempts to order a decision list (set of rules) so as to minimize the cost of making incorrect classifications.

RCO is a post-processing algorithm that can complement any rule learner, such as C4.5. It essentially orders a set of rules to minimize misclassification costs. The algorithm works as follows.

The algorithm takes as input a set of rules (rule list), a cost matrix, and a set of examples (example list), and returns an ordered set of rules (decision list). An example of a cost matrix (for the mushroom example) is depicted in Figure 2.4.1.

                  Hypothesis
          Safe      Poisonous      Actual Class
            0           1          Safe
           10           0          Poisonous

Figure 2.4.1: A cost matrix for a poisonous mushroom application.


Note that the costs in the matrix are the costs associated with the prediction in light of the actual class.

The algorithm begins by initializing a decision list to a default class, the one which yields the least expected cost if all examples were tagged as being that class. It then attempts to iteratively replace the default class with a new rule / default class pair, by choosing a rule from the rule list that covers as many examples as possible and a default class which minimizes the cost of the examples not covered by the chosen rule. Note that when an example in the example list is covered by a chosen rule it is removed. The process continues until no new rule / default class pair can be found to replace the default class in the decision list (i.e., the default class minimizes cost over the remaining examples).
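As a rough illustration, the greedy loop can be sketched as follows. This is a minimal sketch with a hypothetical rule representation (a predicate paired with a predicted class); Pazzani et al.'s actual algorithm is more involved:

```python
def expected_cost(cls, examples, cost):
    # cost[actual][predicted]: total cost of predicting `cls` for every example
    return sum(cost[actual][cls] for _, actual in examples)

def rco_order(rules, examples, cost, classes):
    """Greedily order rules to minimize misclassification cost.
    Each rule is a (predicate, predicted_class) pair; each example is a
    (features, actual_class) pair; the last list entry is the default class."""
    decision_list, remaining = [], list(examples)
    while True:
        # default class: least expected cost over the remaining examples
        default = min(classes, key=lambda c: expected_cost(c, remaining, cost))
        # candidate rule: covers as many remaining examples as possible
        best, covered = None, []
        for rule in rules:
            cov = [e for e in remaining if rule[0](e[0])]
            if len(cov) > len(covered):
                best, covered = rule, cov
        if best is None:                 # no rule covers anything: stop
            return decision_list + [default]
        decision_list.append(best)       # replace default with rule/default pair
        rules = [r for r in rules if r is not best]
        remaining = [e for e in remaining if e not in covered]
```

Covered examples are removed at each step, so each chosen rule is judged only against what the earlier rules failed to capture.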

An algorithm such as the one described above can be used to tackle imbalanced data sets by way of assigning high misclassification costs to the underrepresented class. Decision lists can then be biased, or ordered, to classify examples as the underrepresented class, as these would have the least expected cost if classified incorrectly.

Incorporating costs into decision tree algorithms can be done by replacing the information gain metric with a new measure that bases partitions not on information gain, but on the cost of misclassification. This was studied by Pazzani et al. [1994] by modifying ID3 to use a metric that chooses partitions that minimize misclassification cost. The results of their experimentation indicate that their greedy test selection method, attempting to minimize cost, did not perform as well as using an information gain heuristic. They attribute this to the fact that their selection technique attempts solely to fit the training data and does not minimize the complexity of the learned concept.

A more viable alternative to incorporating misclassification costs into the creation of a decision tree is to modify pruning techniques. Typically, decision trees are pruned by merging leaves of the tree to classify examples as the majority class. In effect, this is calculating the probability that an example belongs to a given class by looking at training examples that have filtered down to the leaves being merged. By assigning the majority


class to the node of the merged leaves, decision trees are assigning the class with the lowest expected error. Given a cost matrix, pruning can be modified to assign the class that has the lowest expected cost instead of the lowest expected error. Pazzani et al. [1994] state that cost pruning techniques have an advantage over replacing the information gain heuristic with a minimal cost heuristic, in that a change in the cost matrix does not affect the learned concept description. This allows different cost matrices to be used for different examples.
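The pruning decision itself is easy to illustrate. Below is a minimal sketch (the class counts, class names, and cost matrix are invented for illustration) contrasting the error-based leaf label with the cost-based one:

```python
def leaf_label_by_error(class_counts):
    # standard pruning: label the merged leaf with the majority class,
    # i.e. the class with the lowest expected error
    return max(class_counts, key=class_counts.get)

def leaf_label_by_cost(class_counts, cost):
    # cost-sensitive pruning: label the merged leaf with the class of
    # lowest expected cost, where cost[actual][predicted] is a cost matrix
    def expected_cost(predicted):
        return sum(n * cost[actual][predicted]
                   for actual, n in class_counts.items())
    return min(cost, key=expected_cost)
```

With 8 "safe" and 2 "poisonous" training examples at a merged leaf and the mushroom-style costs (10 for calling a poisonous mushroom safe, 1 for the reverse), error-based pruning labels the leaf "safe" while cost-based pruning labels it "poisonous".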

2.4.2 Sampling Techniques

2.4.2.1 Heterogeneous Uncertainty Sampling

Lewis and Catlett [1994] describe a heterogeneous³ approach to selecting training examples from a large data set by using uncertainty sampling. The algorithm they use operates under an information filtering paradigm; uncertainty sampling is used to select training examples to be presented to an expert. It can be simply described as a process where a cheap classifier chooses, from a large pool, a subset of training examples whose class it is unsure of and presents them to an expert to be classified. The classified examples are then used to help the cheap classifier choose more examples for which it is uncertain. The examples that the classifier is unsure of are used to create a more expensive classifier.

The uncertainty sampling algorithm used is an iterative process by which an inexpensive probabilistic classifier is initially trained on three randomly chosen positive examples from the training data. The classifier is based on an estimate of the probability that an instance belongs to a class C:

³ Their method is considered heterogeneous because a classifier of one type chooses examples to present to a classifier of another type.

    P(C|w) = exp( a + b Σ_{i=1..d} log( P(wᵢ|C) / P(wᵢ|C̄) ) )
             ─────────────────────────────────────────────────────
             1 + exp( a + b Σ_{i=1..d} log( P(wᵢ|C) / P(wᵢ|C̄) ) )

(C̄ denotes the counter class.)


where C indicates class membership and wᵢ is the ith of d attributes in example w; a and b are calculated using logistic regression. This model is described in detail in Lewis and Hayes [1994]. All we are concerned with here is that the classifier returns a number P between 0 and 1 indicating its confidence in whether or not an unseen example belongs to a class. The threshold chosen to indicate a positive instance is 0.5. If the classifier returns a P higher than 0.5 for an unknown example, it is considered to belong to the class C. The classifier's confidence in its prediction is proportional to the distance its prediction is away from the threshold. For example, the classifier is less confident in a P of 0.6 belonging to C than it is in a P of 0.9 belonging to C.

At each iteration of the sampling loop, the probabilistic classifier chooses four examples from the training set: the two which are closest to and below the threshold and the two which are closest to and above the threshold. The examples that are closest to the threshold are those whose class it is least sure of. The classifier is then retrained at each iteration of the uncertainty sampling and reapplied to the training data to select four more instances that it is unsure of. Note that after the four examples are chosen at each loop, their class is made known for retraining purposes (this is analogous to having an expert label examples).
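The selection step at each iteration can be sketched as follows (the `prob` scoring function is a hypothetical stand-in for the probabilistic classifier):

```python
def select_uncertain(pool, prob, threshold=0.5, per_side=2):
    """Pick the examples the classifier is least sure of: the `per_side`
    whose scores are closest to the threshold from below, and the
    `per_side` closest from above."""
    below = sorted((x for x in pool if prob(x) < threshold),
                   key=lambda x: threshold - prob(x))
    above = sorted((x for x in pool if prob(x) >= threshold),
                   key=lambda x: prob(x) - threshold)
    return below[:per_side] + above[:per_side]
```

In the real loop the four selected examples would be labelled, added to the training set, and the classifier retrained before the next selection.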

The training set presented to the expensive classifier can essentially be described as a pool of examples that the probabilistic classifier is unsure of. The pool of examples, chosen using a threshold, will be biased towards having too many positive examples if the training data set is imbalanced. This is because the examples are chosen from a window that is centered over the borderline where the positive and negative examples meet. To correct for this, the classifier chosen to train on the pool of examples, C4.5, was modified to include a loss ratio parameter, which allows pruning to be based on expected loss instead of expected error (this is analogous to cost pruning, Section 2.4.1). The default rule for the classifier was also modified to be chosen based on expected loss instead of expected error.

Lewis and Catlett [1994] show, by testing their sampling technique on a text classification task, that uncertainty sampling reduces the number of training examples required by an expensive learner such as C4.5 by a factor of 10. They did this by comparing results of


induced decision trees on uncertainty samples from a large pool of training examples with pools of examples that were randomly selected, but ten times larger.

2.4.2.2 One Sided Intelligent Selection

Kubat and Matwin [1997] propose an intelligent one sided sampling technique that reduces the number of negative examples in an imbalanced data set. The underlying concept in their algorithm is that positive examples are considered rare and must all be kept. This is in contrast to Lewis and Catlett's technique, in that uncertainty sampling does not guarantee that a large number of positive examples will be kept. Kubat and Matwin [1997] balance data sets by removing negative examples. They categorize negative examples as belonging to one of four groups. They are:

- those that suffer from class label noise;
- borderline examples (examples which are close to the boundaries of positive examples);
- redundant examples (their part can be taken over by other examples); and
- safe examples that are considered suitable for learning.

In their selection technique all negative examples, except those which are safe, are considered to be harmful to learning and thus have the potential of being removed from the training set. Redundant examples do not directly harm correct classification, but increase classification costs. Borderline negative examples can cause learning algorithms to overfit positive examples.

Kubat and Matwin's [1997] selection technique begins by first removing redundant examples from the training set. To do this a subset C of the training examples S is created by taking every positive example from S and randomly choosing one negative example. The remaining examples in S are then classified using the 1-Nearest Neighbor (1-NN) rule with C. Any misclassified example is added to C. Note that this technique does not produce the smallest possible C; it just shrinks S. After redundant examples are removed, examples considered borderline or class noisy are removed.
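For one-dimensional numeric examples this first step can be sketched as below. This is a toy stand-in for their procedure (real data would use a multi-attribute distance, and Kubat and Matwin pick the seed negative at random rather than taking the first):

```python
def condense(S):
    """Seed C with every positive example plus one negative, then add any
    example of S that the 1-NN rule over C misclassifies."""
    def nn_label(x, ref):
        # label of the nearest neighbour of x among the reference set
        return min(ref, key=lambda e: abs(e[0] - x))[1]
    C = [e for e in S if e[1] == "+"] + [e for e in S if e[1] == "-"][:1]
    for x, y in S:
        if (x, y) not in C and nn_label(x, C) != y:
            C.append((x, y))
    return C
```

Negatives that the current subset already classifies correctly are the "redundant" ones: they never get added to C.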


Borderline, or class noisy, examples are detected using the concept of Tomek links [Tomek, 1976], which are defined by the distance between examples with different class labels. Take for instance two examples x and y with different classes. The pair (x, y) is considered to be a Tomek link if there exists no example z such that δ(x, z) < δ(x, y) or δ(y, z) < δ(y, x), where δ(a, b) is defined as the distance between example a and example b. Examples are considered borderline or class noisy if they participate in a Tomek link.
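A direct transcription of this definition (quadratic in the number of examples; δ is a caller-supplied distance function):

```python
def tomek_links(examples, dist):
    """Return pairs (x, y) with different labels forming a Tomek link:
    no third example z satisfies dist(x, z) < dist(x, y) or
    dist(y, z) < dist(y, x)."""
    links = []
    for i, (x, cx) in enumerate(examples):
        for j in range(i + 1, len(examples)):
            y, cy = examples[j]
            if cx == cy:
                continue
            d = dist(x, y)
            if not any(dist(x, z) < d or dist(y, z) < d
                       for k, (z, _) in enumerate(examples)
                       if k != i and k != j):
                links.append((x, y))
    return links
```

Only opposite-class pairs that are mutual nearest neighbours survive the inner check, which is exactly the borderline/noise condition used for removal.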

Kubat and Matwin's selection technique was shown to be successful in improving performance, as measured by the g-mean, on two of three benchmark domains: vehicles (veh1), glass (g7), and vowels (vwo). The domain in which no improvement was seen, g7, was examined and it was found that in that particular domain the original data set did not produce disproportionate values for g+ and g−.

2.4.2.3 Naive Sampling Techniques

The previously described selection algorithms balance data sets by significantly reducing the number of training examples. Both are intelligent methods that filter out examples using uncertainty sampling, or by removing examples that are considered harmful to learning. Ling and Li [1998] approach the problem of data imbalance using methods that naively downsize or over-sample data sets, classifying examples with a confidence measurement. The domain of interest is data mining for direct marketing. Data sets in this field are typically two class problems and are severely imbalanced, containing only a few examples of people who have bought the product and many examples of people who have not. The three data sets studied by Ling and Li [1998] are a bank data set from a loan product promotion (Bank), an RRSP campaign from a life insurance company (Life Insurance), and a bonus point program where customers accumulate points to redeem for merchandise (Bonus). As will be explained later, all three of the data sets are imbalanced.

Direct marketing is used by the consumer industry to target customers who are likely to buy products. Typically, if mass marketing is used to promote products (e.g., including flyers in a newspaper with a large distribution) the response rate (the percent of people who buy a product after being exposed to the promotion) is very low and the cost of mass


marketing very high. For the three data sets studied by Ling and Li the response rates were 1.2% of 90,900 responding in the Bank data set, 7% of 80,000 responding in the Life Insurance data set, and 1.2% of 104,000 for the Bonus Program.

Data mining can be viewed as a two class domain: given a set of customers and their characteristics, determine a set of rules that can accurately predict a customer as being a buyer or a non-buyer, advertising only to buyers. Ling and Li [1998], however, state that a binary classification is not very useful for direct marketing. For example, a company may have a database of customers to which it wants to advertise the sale of a new product to the …


The evaluation method used by Ling and Li [1998] is known as the lift index. This index has been widely used in database marketing. The motivation behind using the lift index is that it reflects the re-distribution of testing examples after a learner has ranked them. For example, in this domain the learning algorithms rank examples in order from the most likely to respond to the least likely to respond. Ling and Li [1998] divide the ranked list into 10 deciles. When evaluating the ranked list, regularities should be found in the distribution of the responders (i.e., there should be a high percentage of the responders in the first few deciles). Table 2.4.1 is a reproduction of the example that Ling and Li [1998] present to demonstrate this.

Lift Table
 10%   10%   10%   10%   10%   10%   10%   10%   10%   10%
 410    …     …     …     …     …     …     …     …     …
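The tabulation behind such a lift table is mechanical: given a test set ranked from most to least likely to respond (1 = responder, 0 = non-responder), the decile counts are simply per-slice sums:

```python
def decile_counts(ranked_responses, bins=10):
    """Split a ranked response list into `bins` equal slices and count the
    responders in each; a good ranker concentrates them in early slices."""
    size = len(ranked_responses) // bins
    return [sum(ranked_responses[i * size:(i + 1) * size])
            for i in range(bins)]
```

A perfectly random ranking would spread the responders roughly evenly over the ten slices, while a useful one front-loads them.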


    "sing their lift inde4 as the sole measure of performance, 0ing and 0i ;@ report results

    for o-er3sampling and downsi+ing on the three data sets of interest G7ank, 0ife #nsurance,

    and 7onusH9

    0ing and 0i ;@ report results that show the best lift inde4 is obtained when the ratio of

    positi-e and negati-e e4amples in the training data is equal9 "sing 7oosted3$aW-e 7ayes

    with a downsi+ed data set resulted in a lift inde4 of :69>U for 7ank, :>95U for 0ife

    #nsurance, and @;99=U for 0ife #nsurance, and @69=U for the 7onus program when the data sets were

    imbalanced at a ratio of ; positi-e e4ample to e-ery @ negati-e e4amples9 1owe-er, using

    7oosted37ayes with o-er3sampling did not show any significant impro-ement o-er the

    imbalanced data set9 0ing and 0i ;@ state that one method to o-ercome this limitation

    may be to retain all the negati-e e4amples in the data set and re3sample the positi-e

    e4amples=9

When tested using their boosted version of C4.5, over-sampling saw a performance gain as the positive examples were re-sampled at higher rates. With a positive sampling rate of 20x, Bank saw an increase of 2.9% (from 65.6% to 68.5%), Life Insurance an increase of 2.9% (from 74.…


…which techniques are appropriate in dealing with class imbalances? To investigate these questions Japkowicz [2000] created a number of artificial domains which were made to vary in concept complexity, size of the training data, and ratio of the under-represented class to the over-represented class.

The target concept to be learned in her study was a one dimensional set of continuous alternating equal sized intervals in the range [0, 1], each associated with a class value of 0 or 1. For example, a linear domain generated using her model would be the intervals [0, 0.5) and (0.5, 1]. If the first interval was given the class 1, the second interval would have class 0. Examples for the domain would be generated by randomly sampling points from each interval (e.g., a point x sampled in [0, 0.5] would be an (x, +) example, and likewise a point y sampled in (0.5, 1] would be a (y, −) example).

Japkowicz [2000] varied the complexity of the domains by varying the number of intervals in the target concept. Data set sizes and balances were easily varied by uniformly sampling different numbers of points from each interval.
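A sketch of this generation scheme (the even/odd labelling convention and the seed are illustrative choices, not taken from her paper):

```python
import random

def sample_domain(n_intervals, points_per_interval, seed=0):
    """Sample (x, label) points from alternating equal-sized intervals of
    [0, 1); even-numbered intervals get class 1, odd-numbered class 0."""
    rng = random.Random(seed)
    width = 1.0 / n_intervals
    data = []
    for k in range(n_intervals):
        label = 1 if k % 2 == 0 else 0
        for _ in range(points_per_interval):
            data.append((k * width + rng.random() * width, label))
    return data
```

Concept complexity grows with `n_intervals`, and class imbalance can be produced by passing different point counts for the two kinds of interval.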

The two balancing techniques that Japkowicz [2000] used in her study that are of interest here are over-sampling and downsizing. The over-sampling technique used was one in which the small class was randomly re-sampled and added to the training set until the number of examples of each class was equal. The downsizing technique used was one in which random examples were removed from the larger class until the sizes of the classes were equal. The domains and balancing techniques described above were implemented using various discrimination based neural networks (DLP).
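Both balancing schemes amount to a few lines each; a minimal sketch, with the two classes represented as plain lists of examples:

```python
import random

def oversample(small, large, seed=0):
    """Randomly re-sample the small class (with replacement) until the
    two classes are the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(small) for _ in range(len(large) - len(small))]
    return small + extra, large

def downsize(small, large, seed=0):
    """Randomly remove examples from the large class until the sizes match."""
    rng = random.Random(seed)
    return small, rng.sample(large, len(small))
```

Over-sampling duplicates existing minority examples rather than inventing new ones; downsizing discards majority information, which is why the two behave differently as training sets grow.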

Japkowicz found that both re-sampling and downsizing helped improve DLP, especially as the target concept became very complex. Downsizing, however, outperformed over-sampling as the size of the training set increased.


2.4.3 Classifiers Which Cover One Class

2.4.3.1 BRUTE

Riddle, Segal, and Etzioni [1994] propose an induction technique called BRUTE. The goal of BRUTE is not classification, but the detection of rules that predict a class. The domain of interest which led to the creation of BRUTE is the detection of manufactured airplane parts that are likely to fail. Any rule that detects anomalies, even if they are rare, is considered important. Rules which predict that a part will not fail, on the other hand, are not considered valuable, no matter how large their coverage may be.

BRUTE operates on the premise that standard decision tree test functions such as ID3…


…It can be seen that T2 will be chosen over T1 using ID3…


…CART and 2.9% for C4.5. One drawback is that the computational complexity of BRUTE's depth-bounded search is much higher than that of typical decision tree algorithms. They do report, however, that it took only minutes of CPU time on a SPARC-10.

2.4.3.2 FOIL

FOIL [Quinlan, 1990] is an algorithm designed to learn a set of first order rules to predict a target predicate to be true. It differs from learners such as C5.0 in that it learns relations among attributes that are described with variables. For example, using a set of training examples where each example is a description of people and their relations:

    Name1 = Jack, Girlfriend1 = Jill,
    Name2 = Jill, Boyfriend2 = Jack, Couple12 = True …

C5.0 may learn the rule:

    IF (Name1 = Jack) ∧ (Boyfriend2 = Jack) THEN Couple12 = True.

This rule of course is correct, but will have very limited use. FOIL on the other hand can learn the rule:

    IF Boyfriend(x, y) THEN Couple(x, y) = True

where x and y are variables which can be bound to any person described in the data set. A positive binding is one in which a predicate binds to a positive assertion in the training data. A negative binding is one in which there is no assertion found in the training data. For example, the predicate Boyfriend(x, y) has four possible bindings in the example above. The only positive assertion found in the data is for the binding Boyfriend(Jill, Jack) (read: the boyfriend of Jill is Jack). The other three possible bindings (e.g., Boyfriend(Jack, Jill)) are negative bindings, because there are no positive assertions for them in the training data.

⁵ The accuracy being referred to here is not how well a rule set performs over the testing data. What is being referred to is the percentage of testing examples which are covered by a rule and correctly classified. The example Riddle et al. [1994] give is that if a rule matches 10 examples in the testing data, and 4 of them are positive, then the predictive accuracy of the rule is 40%. The figures given are averages over the entire rule set created by each algorithm. Riddle et al. [1994] use this measure of performance in their domain because their primary interest is in finding a few accurate rules that can be interpreted by factory workers in order to improve the production process. In fact, they state that they would be happy with a poor tree with one really good branch from which an accurate rule could be extracted.

The following is a brief description of the FOIL algorithm, adapted from Mitchell [1997].

FOIL takes as input a target predicate (e.g., Couple(x, y)), a list of predicates that will be used to describe the target predicate, and a set of examples. At a high level, the algorithm operates by learning a set of rules that covers the positive examples in the training set. The rules are learned using an iterative process that removes positive training examples from the training set when they are covered by a rule. The process of learning rules continues until there are enough rules to cover all the positive training examples. In this way, FOIL can be viewed as a specific to general search through a hypothesis space, which begins with an empty set of rules that covers no positive examples and ends with a set of rules general enough to cover all the positive examples in the training data (the default rule in a learned set is negative).

Creating a rule to cover positive examples is a process by which a general to specific search is performed, starting with an empty condition that covers all examples. The rule is then made specific enough to cover only positive examples by adding literals to the rule (a literal is defined as a predicate or its negation). For example, a rule predicting the predicate Female(x) may be made more specific by adding the literals long_hair(x) and ¬beard(x). The function used to evaluate which literal, L, to add to a rule, R, at each step is:

    Foil_Gain(L, R) = t ( log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) )

where p0 and n0 are the number of positive (p) and negative (n) bindings of the rule R, p1 and n1 are the number of positive and negative bindings of the rule which will be created by adding L to R, and t is the number of positive bindings of the rule R which are still covered when L is added (so t ≤ p0).


The function Foil_Gain determines the utility of adding L to R. It prefers adding literals with more positive bindings than negative bindings. As can be seen in the equation, the measure is based on the proportion of positive bindings before and after the literal in question is added.
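Numerically, the gain rewards literals that raise the fraction of positive bindings. A direct transcription of the formula, with t supplied by the caller as in the definition above:

```python
from math import log2

def foil_gain(t, p0, n0, p1, n1):
    """Foil_Gain(L, R): t positive bindings of R remain covered after
    adding L; (p0, n0) and (p1, n1) count the positive/negative bindings
    of R before and after L is added."""
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))
```

For instance, a literal that removes every negative binding while keeping all four positive ones, starting from a 4-positive / 4-negative rule, yields a gain of 4 bits.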

2.4.3.3 SHRINK

Kubat, Holte, and Matwin [1998] discuss the design of the SHRINK algorithm, which follows the same principles as BRUTE. SHRINK operates by finding rules that cover positive examples. In doing this, it learns from both positive and negative examples, using the g-mean to take into account rule accuracy over negative examples. There are three principles behind the design of SHRINK. They are:

- do not subdivide the positive examples when learning;
- create a classifier that is low in complexity; and
- focus on regions in space where positive examples occur.

A SHRINK classifier is made up of a network of tests. Each test is of the form xᵢ ∈ [min aᵢ, max aᵢ], where i indexes the attributes. Let hᵢ represent the output of the ith test. If the test suggests a positive example, the output is 1, else it is −1. Examples are classified as being positive if Σᵢ hᵢwᵢ > Θ, where wᵢ is a weight assigned to the test hᵢ.

SHRINK creates the tests and weights in the following way. It begins by taking, for each attribute, the interval that covers all the positive examples. The interval is then reduced in size by removing either the left or right end point, whichever produces the better g-mean. This process is repeated iteratively and the interval found to have the best g-mean is considered the test for the attribute. Any test that has a g-mean less than 0.50 is discarded. The weight assigned to each test is wᵢ = log(gᵢ / (1 − gᵢ)), where gᵢ is the g-mean associated with the ith attribute test.


The results reported by Kubat et al. [1998] demonstrate that the SHRINK algorithm performs better than 1-Nearest Neighbor with one sided selection.⁶ Pitting SHRINK against C4.5 with one sided selection, the results became less clear. Using one sided selection resulted in a performance gain over the positive examples but a significant loss over the negative examples. This loss of performance over the negative examples results in the g-mean being lowered by about 10%.

Accuracies Achieved by C4.5, 1-NN and SHRINK

Classifier   a+     a−     g-mean
C4.5         81.1   86.6   81.7
1-NN         67.2    …      …
SHRINK        …      …      …

Table 2.4.2: This table is adapted from Kubat et al. [1998]. It gives the accuracies achieved by C4.5, 1-NN and SHRINK.

2.4.4 Recognition Based Learning

Discrimination based learning techniques, such as C5.0, create rules which describe both the positive (conceptual) class and the negative (counter conceptual) class. Algorithms such as BRUTE and FOIL differ from algorithms such as C5.0 in that they create rules that cover only positive examples. However, they are still discrimination based techniques because they create positive rules using negative examples in their search through the hypothesis space. For example, FOIL creates rules to cover the positive class by adding literals until they do not cover any of the negative class examples. Other learning methods, such as back propagation applied to a feed forward neural network and k-nearest neighbor, do not explicitly create rules, but they are discrimination based techniques that learn from both positive and negative examples.

Japkowicz, Myers, and Gluck [1995] describe HIPPO, a system that learns to recognize a target concept in the absence of counter examples. More specifically, it is a neural network (called an autoencoder) that is trained to take positive examples as input, map them to a small hidden layer, and then attempt to reconstruct the examples at the output layer.

⁶ One sided selection is discussed in Section 2.4.2.2. It is essentially a method by which negative examples considered harmful to learning are removed from the data set.


Because the network has a narrow hidden layer, it is forced to compress redundancies found in the input examples.

An advantage of recognition based learners is that they can operate in environments in which negative examples are very hard or expensive to obtain. An example Japkowicz et al. [1995] give is the application of machine fault diagnosis, where a system is designed to detect the likely failure of hardware (e.g., helicopter gear boxes). In domains such as this, statistics on functioning hardware are plentiful, while statistics on failed hardware may be nearly impossible to acquire. Obtaining positive examples involves monitoring functioning hardware, while obtaining negative examples involves monitoring hardware that fails. Acquiring enough examples of failed hardware for training a discrimination based learner can be very costly if the device has to be broken a number of different ways to reflect all the conditions in which it may fail.

In learning a target concept, recognition based classifiers such as that described by Japkowicz et al. [1995] do not try to partition a hypothesis space with boundaries that separate positive and negative examples; rather, they attempt to make boundaries which surround the target concept. The following is an overview of how HIPPO, a one hidden layer autoencoder, is used for recognition based learning.

A one hidden layer autoencoder consists of three layers: the input layer, the hidden layer and the output layer. Training an autoencoder takes place in two stages. In the first stage the system is trained on positive instances using back-propagation⁷ to be able to compress the training examples at the hidden layer and reconstruct them at the output layer. The second stage of training involves determining a threshold that can be used to discriminate between the reconstruction errors of positive and negative examples.

The second stage of training is a semi-automated process that can take one of two forms. The first, noiseless, case is one in which a lower bound is calculated on the reconstruction error of either the negative or positive instances. The second, noisy, case is one that uses both positive and negative training examples to calculate the threshold, ignoring the examples considered to be noisy or exceptional.

⁷ Note that back propagation is not the only training function that can be used. Evans and Japkowicz [2000] report results using an auto-encoder trained with the One Step Secant function.

After training and threshold determination, unseen examples can be given to the autoencoder, which compresses and then reconstructs them at the output layer, measuring the accuracy with which each example was reconstructed. For a two class domain this is very powerful. Training an autoencoder to be able to sufficiently reconstruct the positive class means that unseen examples that can be reconstructed at the output layer contain features that were present in the examples used to train the system. Unseen examples that can be generalized with a low reconstruction error can therefore be deemed to be of the same conceptual class as the examples used for training. Any example which cannot be reconstructed with a low reconstruction error is deemed to be unrecognized by the system and can be classified as the counter conceptual class.
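The classification step reduces to a threshold test on reconstruction error. In the sketch below a toy stand-in for the trained autoencoder (it "reconstructs" every input as the mean of the positive training data) illustrates the recognition decision; HIPPO's actual reconstructor is, of course, a learned network:

```python
def make_reconstructor(positives):
    # toy stand-in for a trained autoencoder: reconstruct any input as the
    # component-wise mean of the positive training examples
    dim = len(positives[0])
    mean = [sum(p[i] for p in positives) / len(positives) for i in range(dim)]
    return lambda example: mean

def recognize(example, reconstruct, threshold):
    """Low reconstruction error -> target concept ('+'); high error ->
    unrecognized, i.e. the counter conceptual class ('-')."""
    error = sum((x - y) ** 2 for x, y in zip(example, reconstruct(example)))
    return "+" if error <= threshold else "-"
```

Only the thresholding logic carries over to the real system; the quality of the decision rests entirely on how well the trained network reconstructs genuine members of the target concept.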

Japkowicz et al. [1995] compared HIPPO to two other standard classifiers that are designed to operate with both positive and negative examples: C4.5 and back propagation applied to a feed forward neural network (FF Classification). The data sets studied were:

- The CH46 Helicopter Gearbox data set [Kolesar and NRaD, 1994]. This domain consists of discriminating between faulty and non-faulty helicopter gearboxes during operation. The faulty gearboxes are the positive class.
- The Sonar Target Recognition data set. This data was obtained from the U.C. Irvine Repository of Machine Learning. This domain consists of taking sonar signals as input and determining which signals constitute rocks and which are mines (mine signals were considered the positive class in the study).
- The Promoter data set. This data consists of input segments of DNA strings. The problem consists of recognizing which strings represent promoters, which are the positive class.


Testing HIPPO showed that it performed much better than C4.5 and the FF Classifier on the Helicopters and Sonar Targets domains. It performed equally with the FF Classifier on the Promoters domain, but much better than C4.5 on the same data.

Data Set Results

Data Set         HIPPO   C4.5   FF Classifier
Helicopters        …      …        …
Promoters          …      …        …
Sonar Targets      …      …        …


C h a p t e r   T h r e e

3 ARTIFICIAL DOMAIN

Chapter 3 is divided into three sections. Section 3.… The purpose of the experiments is to investigate the nature of imbalanced data sets and provide a motivation behind the design of a system intended to improve a standard classifier's performance on imbalanced data sets. Section 3.…


where k is the number of disjuncts, n is the number of conjunctions in each disjunct, and each x is defined over the alphabet x1, x2, …, xj, ¬x1, ¬x2, …, ¬xj. An example of a k-DNF expression, k being 2, is given as (Exp. 1).

x1 ∨ x…¬x… (Exp. 1)

Note that if xk is a member of a disjunct, ¬xk cannot be. Also note that (Exp. 1) would be referred to as an expression of …; the following four examples would have the classes indicated by +/−.

x1 x2 x3 x4 x5 Class
1)  1  0  1  1  0  +
2)  0  1  0  1  1  +
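The definition above can be made concrete with a short sketch showing how an example is tested against a k-DNF expression, one disjunct at a time. The particular expression below is hypothetical (the literal subscripts of (Exp. 1) did not survive in this copy); it merely follows the stated rules, including that a disjunct never contains both xk and ¬xk.

```python
# A k-DNF expression is a list of disjuncts; each disjunct is a list of
# literals (index, polarity): polarity True means x_i, False means not-x_i.
# This two-disjunct expression is hypothetical, chosen only so that it is
# consistent with the two surviving positive examples in the table above.
expression = [
    [(0, True), (2, True)],    # x1 AND x3
    [(1, True), (2, False)],   # x2 AND NOT x3
]

def satisfies(example, expression):
    """An example is positive if it satisfies at least one disjunct."""
    return any(
        all(example[i] == polarity for i, polarity in disjunct)
        for disjunct in expression
    )

# The two surviving positive examples from the table (x1..x5):
assert satisfies([1, 0, 1, 1, 0], expression)       # example 1) -> +
assert satisfies([0, 1, 0, 1, 1], expression)       # example 2) -> +
assert not satisfies([0, 0, 0, 0, 0], expression)   # satisfies no disjunct
```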


The other similarity between text classification and k-DNF expressions is the ability to affect the complexity of the target expression in a k-DNF expression. By varying the number of disjuncts in an expression we can vary the difficulty of the target concept to be learned.8 This ability to control concept complexity can map itself onto text classification tasks, where not all classification tasks are equal in difficulty. This may not be obvious at first. Consider a text classification task where one needs to classify documents as being about a particular consumer product. The complexity of the rule set needed to distinguish documents of this type may be as simple as a single rule indicating the name of the product and the name of the company that produces it. This task would probably map itself to a very simple k-DNF expression with perhaps only one disjunct. Now consider training another classifier intended to be used to classify documents as being computer software related or not. The number of rules needed to describe this category is probably much greater. For example, the terms "computer" and "software" in a document may be good indicators that a document is computer software related, but so might be the term "windows", if it appears in a document not containing the term "cleaner". In fact, the terms "operating" and "system" or "word" and "processor" appearing together in a document are also good indicators that it is software related. The complexity of a rule set needed to be constructed by a learner to recognize computer software related documents is, therefore, greater and would probably map onto a k-DNF expression with more disjuncts than that of the first consumer product example.

The biggest difference between the two domains is that the artificial domain was created without introducing any noise. No negative examples were created and labeled as being positive. Likewise, there were no positive examples labeled as negative. For text domains in general there is often label noise, in which documents are given labels that do not accurately indicate their content.

8 As the number of disjuncts (k) in an expression increases, more partitions in the hypothesis space need to be realized by a learner to separate the positive examples from the negative examples.


3.2 Example Creation

For the described tests, training examples were always created independently of the testing examples. The training and testing examples were created in the following manner:

A random k-DNF expression is created on a given alphabet size (in this study the alphabet size is 50).

An arbitrary set of examples was generated as a random sequence of attributes equal to the size of the alphabet the k-DNF expression was created over. All the attributes were given an equal probability of being either 0 or 1.

Each example was then classified as being either a member of the expression or not and tagged appropriately (Figure …).

A data set of 6000 negative examples and 1200 positive examples was used. This represented a class imbalance of 5:1 in favor of the negative class. As the tests, however, led to the creation of a combination scheme, the data sets tested were further imbalanced to a 25:1 (6000 negative : 240 positive) ratio in favor of the negative class. This greater imbalance more closely resembled the real world domain of text classification on which the system was ultimately tested. In each case the exact ratio of positive and negative examples in both the training and testing set will be indicated.

    in both the training and testing set will be indicated9

    ==

  • 8/12/2019 Andrew Thesis

    56/125

3.3 Description of Tests and Results

The description of each test will consist of several sections. The first section will state the motivation behind performing the test and give the particulars of its design. The results of the experiment will then be given, followed by a discussion.

3.3.1 Test #1: Varying the Target Concept's Complexity

Varying the number of disjuncts in an expression varies the complexity of the target concept. As the number of disjuncts increases, the following two things occur in a data set where the positive examples are evenly distributed over the target expression and their number is held constant:

The target concept becomes more complex, and
The number of positive examples becomes sparser relative to the target concept.

A visual representation of the preceding statements is given in Figure …


The motivation behind this experiment comes from Schaffer …; the experiment examines how C5.0 learns target concepts of increasing complexity on balanced and imbalanced data sets.

Setup

In order to investigate the performance of induced decision trees on balanced and imbalanced data sets, eight sets of training and testing data of increasing target concept complexities were created. The target concepts in the data sets were made to vary in concept complexity by increasing the number of disjuncts in the expression to be learned, while keeping the number of conjunctions in each disjunct constant. The following algorithm was used to produce the results given below.

Repeat x times
  o Create a training set T(c, 6000+, 6000−)
  o Create a test set E(c, 1200+, 1200−)
  o Train C on T
  o Test C on E and record its performance P1:1
  o Randomly remove 4800 positive examples from T
  o Train C on T
  o Test C on E and record its performance P1:5
  o Randomly remove 960 positive examples from T
  o Train C on T
  o Test C on E and record its performance P1:25

Note that throughout Chapter 3 the testing sets used to measure the performance of the induced classifiers are balanced. That is, there is an equal number of both positive and negative examples used for testing. The test sets are artificially balanced in order to increase the cost of misclassifying positive examples. Using a balanced testing set to measure a classifier's performance gives each class equal weight.


Average each P over the x runs.

For this test, expressions of complexity c = 4x2, …, 4x10 were tested. The results for each expression were averaged over x = 10 runs.
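The removal schedule above can be checked by the arithmetic it encodes: 6000 − 4800 = 1200 positives against 6000 negatives (1:5), and 1200 − 960 = 240 positives (1:25). A minimal sketch of the bookkeeping follows, with placeholder callables standing in for the data generator and the C5.0 learner (neither is reproduced here):

```python
import random

def run_protocol(make_data_set, train, evaluate, complexity):
    """One pass of Test #1: train at 1:1, 1:5, and 1:25 imbalance,
    always testing against the same balanced 1200+/1200- test set.
    make_data_set, train, and evaluate are stand-ins for the thesis's
    example generator and the C5.0 learner."""
    T_pos, T_neg = make_data_set(complexity, 6000, 6000)
    E = make_data_set(complexity, 1200, 1200)
    performance = {}
    # 6000 -> 1200 -> 240 positives; the negatives are never touched.
    for ratio, num_pos in (("1:1", 6000), ("1:5", 1200), ("1:25", 240)):
        T_pos = random.sample(T_pos, num_pos)  # randomly remove positives
        classifier = train(T_pos + T_neg)
        performance[ratio] = evaluate(classifier, E)
    return performance
```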

Results

The results of the experiment are shown in Figures …


[Figure: Error over all examples versus degree of complexity (4x2 through 4x10), for training ratios 1:1, 1:5, and 1:25.]


Discussion

As previously stated, the purpose of this experiment was to test the classifier's performance on both balanced and imbalanced data sets while varying the complexity of the target expression. It can be seen in Figure …


Table …


In terms of the overall size of the data set, downsizing significantly reduces the number of overall examples made available for training. By leaving negative examples out of the data set, information about the negative (or counter conceptual) class is being removed.

Over-sampling has the opposite effect in terms of the size of the data set. Adding examples by re-sampling the positive (or conceptual) class, however, does not add any additional information to the data set. It just balances the data set by increasing the number of positive examples in the data set.

Setup

This test was designed to determine if randomly removing examples of the over-represented negative class, or uniformly over-sampling examples of the under-represented class to balance the data set, would improve the performance of the induced classifier over the test data. To do this, data sets imbalanced at a ratio of 1+:25− were created, varying the complexity of the target expression in terms of the number of disjuncts. The idea behind the testing procedure was to start with an imbalanced data set and measure the performance of an induced classifier as either negative examples are removed, or positive examples are re-sampled and added to the training data. The procedure given below was followed to produce the presented results.

Repeat x times
  o Create a training set T(c, 240+, 6000−)
  o Create a test set E(c, 1200+, 1200−)
  o Train C on T
  o Test C on E and record its performance Poriginal
  o Repeat for n = 1 to 10
      Create Td(240+, (6000 − n×576)−) by randomly removing 576×n examples from T
      Train C on Td
      Test C on E and record its performance Pdownsize
  o Repeat for n = 1 to 10
      Create To((240 + n×576)+, 6000−) by uniformly over-sampling the positive examples from T
      Train C on To
      Test C on E and record its performance Poversample
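The two resampling operations in the procedure can be sketched directly. Sampling with replacement in oversample is an assumption made here; "uniformly over-sampling" could equally mean cycling through the positives evenly.

```python
import random

def downsize(negatives, n, step=576):
    """Td: randomly remove n*step negative examples."""
    return random.sample(negatives, len(negatives) - n * step)

def oversample(positives, n, step=576):
    """To: add n*step re-sampled positive examples (with replacement,
    which is an assumption; see the note above)."""
    return positives + [random.choice(positives) for _ in range(n * step)]

negatives = list(range(6000))
positives = list(range(240))
# Ten 576-example steps balance the data either way:
assert len(downsize(negatives, 10)) == 240       # 240+ : 240-
assert len(oversample(positives, 10)) == 6000    # 6000+ : 6000-
```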


For downsizing, the numbers represent the rate at which negative examples were removed from the training data. The point 0 represents no negative examples being removed, while 100 represents the point at which the training data is balanced (240+, 240−). Essentially, the negative examples were removed at 576-example increments.

For over-sampling, the labels on the x-axis are simply the rate at which the positive examples were re-sampled, 100 being the point at which the training data set is balanced (6000+, 6000−). The positive examples were therefore re-sampled at 576-example increments.

It can be seen from Figure … positive examples. That is, the lowest error rate achieved for over-sampling is around the 60 or 70 mark in Figure …


[Figure: Error over all examples versus sampling rate (0 to 100) for downsizing and over-sampling. This graph demonstrates that the optimal level at which a data set should be balanced does not always occur at the same point; to see this, compare this graph with Figure ….]


[Figure: Error over negative examples versus sampling rate (0 to 100) for downsizing and over-sampling.]


The results in Figure …


Figure 7 …


There are competing factors when each balancing technique is used. Achieving a higher a+ comes at the expense of a− (this is a common point in the literature for domains such as text classification).

3.3.3 Test #3: Rule Counts for Balanced Data Sets

Ultimately, the goal of the experiments described in this section is to provide motivation behind the design of a system that combines multiple classifiers that use different sampling techniques. The advantage of combining classifiers that use different sampling techniques only comes if there is a variance in their predictions. Combining classifiers that always make the same predictions is of no value if one hopes that their combination will increase predictive accuracy. Ideally, one would like to combine classifiers that agree on correct predictions, but disagree on incorrect predictions.

Methods that combine classifiers, such as Adaptive-Boosting, attempt to vary learners' predictions by varying the training examples which successive classifiers are presented to learn on. As we saw in Section 2.2.4, Adaptive-Boosting increases the sampling probability of examples that are incorrectly classified by already constructed classifiers. By placing this higher weight on incorrectly classified examples, the induction process at each iteration is biased towards creating a classifier that performs well on previously misclassified examples. This is done in an attempt to create a number of classifiers that can be combined to increase predictive accuracy. In doing this, Adaptive-Boosting ideally diversifies the rule sets of the classifiers.
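The reweighting idea can be illustrated with a standard AdaBoost-style update. This is a sketch of the general scheme, not necessarily the exact formulation referenced in Section 2.2.4:

```python
def adaboost_reweight(weights, correct, error):
    """One Adaptive-Boosting style reweighting step.  Correctly
    classified examples are scaled down by beta = error / (1 - error)
    (beta < 1 whenever error < 0.5), so after renormalising, the
    misclassified examples carry a larger share of the weight."""
    beta = error / (1.0 - error)
    scaled = [w * (beta if ok else 1.0) for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled]

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]   # the last example was misclassified
weights = adaboost_reweight(weights, correct, error=0.25)
# The misclassified example now holds half the total weight (0.5),
# so the next classifier is biased toward getting it right.
```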

Setup

Rules can be described in terms of their complexity. Larger rule sets are considered more complex than smaller rule sets. This experiment was designed to get a feel for the complexities of the rule sets produced by C5.0 when induced on imbalanced data sets that have been balanced by either over-sampling or downsizing. By looking at the complexity of the rule sets created, we can get a feel for the differences between the rule sets created


using each sampling technique. The following algorithm was used to produce the results given below.

Repeat x times
  o Create a training set T(c, 240+, 6000−)
  o Create To(6000+, 6000−) by uniformly re-sampling the positive examples from T and adding the negative examples from T
  o Train C on To
  o Record rule counts Ro+ and Ro− for the positive and negative rule sets
  o Create Td(240+, 240−) by randomly removing 5760 negative examples from T
  o Train C on Td
  o Record rule counts Rd+ and Rd− for the positive and negative rule sets

Average the rule counts over x.

For this test, expressions of sizes c = 4x2, …, 4x10 were tested and averaged over x …


[Table: Positive rule counts for downsizing and over-sampling at complexities 4x6, 4x7, and 4x8; most numeric entries are garbled in this copy (e.g., the 4x7 row reads 4.8, 15.3, 4.3, 7.2).]


Before I begin the discussion of these results, it should be noted that these numbers must only be used to indicate general trends in rule set complexity. When averaged for expressions of complexities 4x6 and greater, the numbers varied considerably. The discussion will be in four parts. It will begin by attempting to explain the factors involved in creating rule sets over imbalanced data sets, and then lead into an attempt to explain the characteristics of rule sets created from downsized data sets, followed by over-sampled rule sets. I will then conclude with a general discussion about some of the characteristics of the artificial domain and how they create the results that have been presented. Throughout this section one should remember that the positive rule set contains the target concept, that is, the underrepresented class.

How does a lack of positive training examples hurt learning?

Kubat et al. [18] give an intuitive explanation of why a lack of positive examples hurts learning. Looking at the decision surface of a two dimensional plane, they explain the behavior of the 1-Nearest Neighbor (1-NN) rule. It is a simple explanation that is generalized as: "…as the number of negative examples in a noisy domain grows (the number of positives being constant), so does the likelihood that the nearest neighbor of any example will be negative." Therefore, as more negative examples are introduced to the data set, the more likely a positive example is to be classified as negative using the 1-NN rule. Of course, as the number of negative examples approaches infinity, the accuracy of a learner that classifies all examples as negative approaches 100% over the negative data and 0% over the positive data. This is unacceptable if one expects to be able to recognize positive examples.
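Kubat et al.'s argument is easy to reproduce numerically. The toy one-dimensional setup below is entirely illustrative (not from the thesis): positives cluster near 0, negatives are scattered uniformly, and as the number of negatives grows with the positives held constant, the 1-NN rule labels fewer positive queries correctly.

```python
import random

def nearest_neighbor_label(x, data):
    """Classify x with the 1-NN rule over one-dimensional labeled points."""
    nearest = min(data, key=lambda point: abs(point[0] - x))
    return nearest[1]

def positive_hit_rate(num_neg, seed=1):
    """Fraction of positive-class queries that 1-NN still labels positive,
    given 10 positives clustered near 0 and num_neg uniform negatives."""
    rng = random.Random(seed)
    positives = [(rng.gauss(0.0, 1.0), 1) for _ in range(10)]
    negatives = [(rng.uniform(-10.0, 10.0), 0) for _ in range(num_neg)]
    queries = [rng.gauss(0.0, 1.0) for _ in range(200)]
    data = positives + negatives
    return sum(nearest_neighbor_label(q, data) for q in queries) / 200

for num_neg in (10, 100, 1000):
    # the hit rate on positive queries falls as the negatives multiply
    print(num_neg, positive_hit_rate(num_neg))
```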

They then extend the argument to decision trees, drawing a connection to the common problem of overfitting. Each leaf of a decision tree represents a decision as being positive or negative. In a noisy training set that is imbalanced in terms of the number of negative examples, it is stated that an induced decision tree will be large enough to create regions arbitrarily small enough to partition the positive regions. That is, the decision tree will have rules complex enough to cover very small regions of the decision surface. This is a result of

    rules comple4 enough to co-er -ery small regions of the decision surface9 This is a result of

    ?;

  • 8/12/2019 Andrew Thesis

    73/125

a classifier being induced to partition positive regions of the decision surface small enough to contain only positive examples. If there are many negative examples nearby, the partitions will be made very small to exclude them from the positive regions. In this way, the tree overfits the data with a similar effect as the 1-NN rule.

Many approaches have been developed to avoid overfitting data, the most successful being post-pruning. Kubat et al. [18], however, state that this does not address the main problem. If a region in an imbalanced data set by definition contains many more negative examples than positive examples, post-pruning is very likely to result in all of the pruned branches being classified as negative.

C5.0 and Rule Sets

C5.0 attempts to partition data sets into regions that contain only positive examples and regions that contain only negative examples. It does this by attempting to find features in the data that are good to partition the training data around (i.e., that have a high information gain). One can look at the partitions it creates by analyzing the rules that are generated, which create the boundaries. Each rule generated creates a partition in the data. Rules can appear to overlap, but when viewed as partitions in an entire set of rules, the partitions created in the data by the rule sets do not overlap. Viewed as an entire set of rules, the partitions in the data can be viewed as having highly irregular shapes. This is due to the fact that C5.0 assigns a confidence level to each rule. If a region of space is overlapped by multiple rules, the confidence level for each rule class that covers the space is summed. The class with the highest summed confidence level is determined to be the correct class. The confidence level given to each rule can be viewed as being the number of examples the rule covers correctly over the training data. Therefore, rule sets that contain higher numbers of rules are generally less confident in their estimated accuracy, because each rule covers fewer examples. Figure … illustrates this.


[Figure: two overlapping rules, Rule 1 and Rule 2. This figure demonstrates how C5.0 adds rules to create complex decision surfaces. It is done by summing the confidence level of rules that cover overlapping regions. A region covered by more than one rule is assigned the class with the highest summed confidence level of all the rules that cover it. Here we assume Rule 1 has a higher confidence level than Rule 2.]

Downsizing …


Over-sampling has different effects than downsizing. One obvious difference is the complexity of the rule sets indicating negative partitions. Rule sets that classify negative examples when over-sampling is used are much larger than those created using downsizing. This is because there is still a large number of negative examples in the data set, resulting in a large number of rules created to classify them.

The rule sets created for the negative examples are given much less confidence than those created when downsizing is used. This effect occurs due to the fact that the learning algorithm attempts to partition the data using features contained in the negative examples. Because there is no target concept contained in the negative examples12 (i.e., no features to indicate an example to be negative), the learning algorithm is faced with the dubious task, in this domain, of attempting to find features that do not exist except by mere chance.

Over-sampling the positive class can be viewed as adding weight to the examples that are re-sampled. Using an information gain heuristic when searching through the hypothesis space, features which partition more examples correctly are favored over those that do not. Multiplying the number of examples a feature will classify correctly when found gives the feature weight. Over-sampling the positive examples in the training data therefore has the effect of giving weight to features contained in the target concept, but it also adds weight to random features which occur in the data that is being over-sampled. The effect of over-sampling therefore has two competing factors:

One that adds weight to features containing the target concept.
One that adds weight to features not containing the target concept.

The effect of features not relevant to the target concept being given a disproportionate weight can be seen for expressions of complexity 4x8 and 4x10. This can be seen in the lower right hand corner of Table …


… sparse compared to the number of positive examples. When the positive data is over-sampled, irrelevant features are given enough weight relative to the features containing the target concept; as a result, the learning algorithm severely overfits the training data by creating garbage rules that partition the data on features not containing the target concept, but that appear in the positive examples.

3.4 Characteristics of the Domain and How They Affect the Results

The characteristics of the artificial domain greatly affect the way in which rule sets are created. The major determining factor in the creation of the rule sets is the fact that the target concept is hidden in the underrepresented class and that the negative examples in the domain have no relevant features. That is, the underrepresented class contains the target concept and the overrepresented class contains everything else. In fact, if over-sampling is used to balance the data sets, expressions of complexity 4x2 to 4x6 could still, on average, attain 100% accuracy on the testing set if only the positive rule sets were used to classify examples, with a default negative rule. In this respect, the artificial domain can be viewed as lending itself to being