Transcript of Andrew Thesis

A COMBINATION SCHEME FOR INDUCTIVE
LEARNING FROM IMBALANCED DATA SETS

by

Andrew Estabrooks

A Thesis Submitted to the Faculty of Computer Science
in Partial Fulfillment of the Requirements for the degree of

MASTER OF COMPUTER SCIENCE

Major Subject: Computer Science

APPROVED:

_________________________________
Nathalie Japkowicz, Supervisor

_________________________________
Qigang Gao

_________________________________
Louise Spiteri

DALHOUSIE UNIVERSITY - DALTECH


Halifax, Nova Scotia, 2000


DALTECH LIBRARY

"AUTHORITY TO DISTRIBUTE MANUSCRIPT THESIS"

TITLE: A Combination Scheme for Learning From Imbalanced Data Sets

The above library may make available or authorize another library to make available individual photo/microfilm copies of this thesis without restrictions.

Full Name of Author: Andrew Estabrooks

Signature of Author: _________________________________

Date: 7/21/2000


TABLE OF CONTENTS

1. Introduction
1 Inductive Learning
2 Class Imbalance
3 Motivation
4 Chapter Overview
5 Learners
5.1 Bayesian Learning
5.2 Neural Networks
5.3 Nearest Neighbor
5.4 Decision Trees
6 Decision Tree Learning Algorithms and C5.0
6.1 Decision Trees and the ID3 algorithm
6.2 Information Gain and the Entropy Measure
6.3 Overfitting and Decision Trees
6.4 Options
7 Performance Measures
7.1 Confusion Matrix
7.2 g-Mean
7.3 ROC curves
8 A Review of Current Literature
8.1 Misclassification Costs
8.2 Sampling Techniques
8.2.1 Heterogeneous Uncertainty Sampling
8.2.2 One-sided Intelligent Selection
8.2.3 Naive Sampling Techniques
8.3 Classifiers Which Cover One Class
10.2 Architecture
10.2.1 Classifier Level
10.2.2 Expert Level
10.2.3 Weighting Scheme
10.2.4 Output Level
11 Testing the Combination Scheme on the Artificial Domain
12 Text Classification
12.1 Text Classification as an Inductive Process
13 Reuters-21578
15.1 Precision and Recall
15.2 F-measure
15.3 Breakeven Point
15.4 Averaging Techniques
16 Statistics used in this study

3 Motivation

Currently, the majority of research in the machine learning community has based the performance of learning algorithms on how well they function on data sets that are reasonably balanced. This has led to the design of many algorithms that do not adapt well to imbalanced data sets. When faced with an imbalanced data set, researchers have generally devised methods to deal with the data imbalance that are specific to the application at hand. Recently, however, there has been a thrust towards generalizing techniques that deal with data imbalances.

The focus of this thesis is inductive learning on imbalanced data sets. The goal of the work presented is to introduce a combination scheme that uses two of the previously mentioned balancing techniques, downsizing and over-sampling, in an attempt to improve learning on imbalanced data sets. More specifically, I will present a system that combines classifiers in a hierarchical structure according to their sampling technique. This combination scheme will be designed using an artificial domain and tested on the real-world application of text classification. It will be shown that the combination scheme is an effective method of increasing a standard classifier's performance on imbalanced data sets.

4 Chapter Overview

The remainder of this thesis is broken down into four chapters. Chapter 2 gives background information and a review of the current literature pertaining to data set imbalance. Chapter 3 is divided into several sections. The first section describes an artificial domain and a set of experiments, which lead to the motivation behind a general scheme to handle imbalanced data sets. The second section describes the architecture behind a system


designed to lend itself to domains that have imbalanced data. The third section tests the developed system on the artificial domain and presents the results. Chapter 4 presents the real-world application of text classification and is divided into two parts. The first part gives needed background information and introduces the data set that the system will be tested on. The second part presents the results of testing the system on the text classification task and discusses its effectiveness. The thesis concludes with Chapter 5, which contains a summary and suggested directions for further research.


Chapter Two

2 BACKGROUND

I will begin this chapter by giving a brief overview of some of the more common learning algorithms and explaining the underlying concepts behind the decision tree learning algorithm C5.0, which will be used for the purposes of this study. There will then be a discussion of various performance measures that are commonly used in machine learning. Following that, I will give an overview of the current literature pertaining to data imbalance.

5 Learners

There are a large number of learning algorithms, which can be divided into a broad range of categories. This section gives a brief overview of the more common ones.

5.1 Bayesian Learning

Inductive learning centers on finding the best hypothesis h, in a hypothesis space H, given a set of training data D. The best hypothesis is the most probable hypothesis given the data set D and any initial knowledge about the prior probabilities of the various hypotheses in H. Machine learning problems can therefore be viewed as attempting to determine the probabilities of various hypotheses and choosing the hypothesis which has the highest probability given D.

More formally, we define the posterior probability P(h|D) to be the probability of a hypothesis h after seeing a data set D. Bayes' theorem (Eq. 1) provides a means to calculate posterior probabilities and is the basis of Bayesian learning.


P(h|D) = P(D|h) P(h) / P(D)    (Eq. 1)

A simple method of learning based on Bayes' theorem is called the naive Bayes classifier. Naive Bayes classifiers operate on data sets where each example x consists of attribute values a1, a2, ..., ai, and the target function f(x) can take on any value from a pre-defined finite set V = {v1, v2, ..., vj}. Classifying unseen examples involves calculating the most probable target value v_max, defined as:

v_max = argmax_{vj in V} P(vj | a1, a2, ..., ai)

Using Bayes' theorem (Eq. 1), v_max can be rewritten as:

v_max = argmax_{vj in V} P(a1, a2, ..., ai | vj) P(vj)

Under the assumption that attribute values are conditionally independent given the target value, the formula used by the naive Bayes classifier is:

v_NB = argmax_{vj in V} P(vj) * prod_i P(ai | vj)

where v_NB is the target output of the classifier, and P(ai | vj) and P(vj) can be calculated based on their frequency in the training data.
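A minimal sketch of this frequency-based calculation (the tiny data set below is invented for illustration, and smoothing of zero counts is omitted for simplicity):

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, target_value).
    Returns the class counts and per-attribute conditional counts."""
    priors = Counter(v for _, v in examples)
    cond = defaultdict(Counter)  # (attribute_position, class) -> value counts
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            cond[(i, v)][a] += 1
    return priors, cond, len(examples)

def classify(attrs, priors, cond, total):
    """v_NB = argmax_v P(v) * prod_i P(a_i | v), estimated from frequencies."""
    best_v, best_p = None, -1.0
    for v, count_v in priors.items():
        p = count_v / total
        for i, a in enumerate(attrs):
            p *= cond[(i, v)][a] / count_v
        if p > best_p:
            best_v, best_p = v, p
    return best_v

data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rain", "mild"), "yes"), (("rain", "hot"), "yes"),
        (("overcast", "mild"), "yes")]
model = train_naive_bayes(data)
print(classify(("rain", "hot"), *model))
```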

5.2 Neural Networks

Neural networks are considered very robust learners that perform well on a wide range of applications, such as optical character recognition [Le Cun et al., 1989] and autonomous navigation [Pomerleau, 19..].


Their design is loosely inspired by biological neurons, which transmit signals along axons. The basic unit of an artificial neural network is the perceptron, which takes as input a number of values and calculates the linear combination of these values. The combined value of the input is then transformed by a threshold unit such as the sigmoid function (see footnote 2). Each input to a perceptron is associated with a weight that determines the contribution of the input. Learning for a neural network essentially involves determining values for the weights. A pictorial representation of a perceptron is given in Figure 2.1.2.

[Figure 2.1.2: A perceptron, with inputs x1, x2, ..., xn, weights w0, w1, ..., wn, and a threshold unit.]

5.3 Nearest Neighbor

Nearest neighbor learning algorithms are instance-based learning methods: they store examples and classify newly encountered examples by looking at the stored instances considered similar. In its simplest form, all instances correspond to points in an n-dimensional space, and an unseen example is classified by choosing the majority class of the closest examples. An advantage of nearest neighbor algorithms is that they can approximate very complex target functions by making simple local approximations based on the data close to the example to be classified. An excellent example of an application that uses a nearest neighbor algorithm is text retrieval, in which documents are represented as vectors and a cosine similarity metric is used to measure the distance of queries to documents.
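A sketch of that idea, with cosine similarity as the closeness measure (the document vectors and labels below are invented for illustration):

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| |v|); higher means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def knn_classify(query, examples, k=3):
    """Majority class among the k stored examples most similar to the query."""
    ranked = sorted(examples, key=lambda ex: cosine_similarity(query, ex[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

docs = [([1.0, 0.0, 0.1], "sports"), ([0.9, 0.1, 0.0], "sports"),
        ([0.0, 1.0, 0.8], "politics"), ([0.1, 0.9, 1.0], "politics")]
print(knn_classify([1.0, 0.1, 0.0], docs, k=3))
```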

2. The sigmoid function is defined as o(y) = 1 / (1 + e^(-y)) and is referred to as a squashing function because it maps a very wide range of values onto the interval (0, 1).


5.4 Decision Trees

Decision trees classify examples according to the values of their attributes. They are constructed by recursively partitioning the training examples, each time on the remaining attribute that has the highest information gain. Attributes become nodes in the constructed tree, and their possible values determine the paths of the tree. The process of partitioning the data continues until the data is divided into subsets that contain a single class, or until some stopping condition is met (this corresponds to a leaf in the tree). Typically, decision trees are pruned after construction by merging children of nodes and giving the parent node the majority class. Section 6 describes in detail how decision trees, in particular C5.0, operate and are constructed.

6 Decision Tree Learning Algorithms and C5.0

C5.0 is a decision tree learning algorithm that is a later version of the widely used C4.5 algorithm [Quinlan, 1993]. The following section consists of two parts. The first part is a brief summary of Mitchell's description of the ID3 algorithm and the extensions leading to typical decision tree learners. A brief operational overview of C5.0 is then given as it relates to this work.

Before I begin the discussion of decision tree algorithms, it should be noted that a decision tree is not the only learning algorithm that could have been used in this study. As described above, there are many different learning algorithms. For the purposes of this study, a decision tree algorithm was chosen for three reasons. The first is the understandability of the classifier created by the learner. By looking at the complexity of a decision tree in terms of the number and size of extracted rules, we can describe the behavior of the learner. Choosing a learner such as naive Bayes, which classifies examples based on probabilities, would make an analysis of this type nearly impossible. The second reason a decision tree learner was chosen was its computational speed. Although not as cheap to operate as naive Bayes, decision tree learners have significantly shorter training times than do neural networks. Finally, a decision tree was chosen because it operates well on tasks


that classify examples into a discrete number of classes. This lends itself well to the real-world application of text classification, the domain on which the combination scheme designed in Chapter 3 will be tested.

6.1 Decision Trees and the ID3 algorithm

Decision trees classify examples by sorting them based on attribute values. Each node in a decision tree represents an attribute in an example to be classified, and each branch represents a value that the node can take. Examples are classified by starting at the root node and sorting them according to their attribute values. Figure 2.2.1 is an example of a decision tree that could be used to classify whether or not it is a good day for a drive.

[Figure 2.2.1: A decision tree for deciding whether it is a good day for a drive. The root node Road Conditions branches on Clear, Snow Covered, and Icy; internal nodes test Forecast (Clear, Rain, Snow), Temperature (Warm, Freezing), and Accumulation (Light, Heavy); the leaves are YES/NO decisions.]


An instance with the attribute Road Conditions assigned Clear, for example, would sort to the nodes Road Conditions, Forecast, and finally Temperature, which would classify the instance as being positive (YES); that is, it is a good day to drive. Conversely, an instance containing the attribute Road Conditions assigned Snow Covered would be classified as not a good day to drive no matter what the Forecast, Temperature, or Accumulation are.

Decision trees are constructed using a top-down greedy search algorithm which recursively subdivides the training data based on the attribute that best classifies the training examples. The basic algorithm, ID3, begins by dividing the data according to the value of the attribute that is most useful in classifying the data; the attribute that best divides the training data becomes the root node of the tree. The algorithm is then repeated on each partition of the divided data, creating subtrees, until the training data is divided into subsets of the same class. At each level in the partitioning process a statistical property known as information gain is used to determine which attribute best divides the training examples.

6.2 Information Gain and the Entropy Measure

Information gain is used to determine how well an attribute separates the training data according to the target concept. It is based on a measure commonly used in information theory known as entropy. Defined over a collection of training data S with a Boolean target concept, the entropy of S is:

Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-)

where p(+) is the proportion of positive examples in S and p(-) the proportion of negative examples. The function of the entropy measure is easily described with an example. Assume that there is a set of data S containing ten examples, seven with a positive class and three with a negative class [7+, 3-]. The entropy of S is then -(7/10) log2(7/10) - (3/10) log2(3/10), or approximately 0.88.


Note that if the numbers of positive and negative examples in the set were even (p(+) = p(-) = 0.5), the entropy function would equal 1. If all the examples in the set were of the same class, the entropy of the set would be 0. If the set being measured contains an unequal number of positive and negative examples, the entropy measure will fall between 0 and 1.
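These three cases, and the [7+, 3-] example above, can be checked directly:

```python
import math

def entropy(p_pos):
    """Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-); 0 * log2(0) is taken as 0."""
    total = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:
            total -= p * math.log2(p)
    return total

print(round(entropy(0.7), 3))  # the [7+, 3-] set: about 0.881
print(entropy(0.5))            # balanced set: 1.0
print(entropy(1.0))            # single-class set: 0.0
```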

Entropy can be interpreted as the minimum number of bits needed to encode the classification of an arbitrary member of S. Consider two people passing messages back and forth that are either positive or negative. If the receiver knows that the message being sent is always going to be positive, then no message needs to be sent; there is nothing to encode, and no bits are transmitted. If, on the other hand, half the messages are negative, then one bit is needed to indicate whether the message being sent is positive or negative. For cases where there are more examples of one class than the other, on average less than one bit needs to be sent, by assigning shorter codes to more likely collections of examples and longer codes to less likely collections. In a case where p(+) = 0.9, shorter codes could be assigned to collections of positive messages being sent, with longer codes assigned to collections of negative messages.

Information gain is the expected reduction in entropy when partitioning the examples of a set S according to an attribute A. It is defined as:

Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|Sv| / |S|) * Entropy(Sv)

where Values(A) is the set of all possible values for an attribute A, and Sv is the subset of examples in S which have the value v for attribute A. On a Boolean data set having only positive and negative examples, Values(A) would be defined over {+, -}. The first term in the equation is the entropy of the original data set. The second term describes the entropy of the data set after it is partitioned using the attribute A. It is nothing more than a sum of


the entropies of each subset Sv, weighted by the number of examples that belong to the subset. The following is an example of how Gain(S, A) would be calculated on a fictitious data set. Given a data set S with ten examples (7 positive and 3 negative), each containing an attribute Temperature with Values(Temperature) = {Warm, Freezing}, Gain(S, Temperature) would be calculated by computing Entropy(S) for S = [7+, 3-] and subtracting the weighted entropies of the Warm and Freezing subsets.
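The calculation is easiest to follow in code. Note that the Warm/Freezing split used here (Warm = [6+, 1-], Freezing = [1+, 2-]) is an assumed illustration, not the thesis's original figures:

```python
import math

def entropy(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def gain(parent, subsets):
    """Gain(S, A) = Entropy(S) - sum over v of (|Sv| / |S|) * Entropy(Sv)."""
    total = sum(pos + neg for pos, neg in subsets)
    weighted = sum((pos + neg) / total * entropy(pos, neg)
                   for pos, neg in subsets)
    return entropy(*parent) - weighted

# S = [7+, 3-]; assumed split: Warm = [6+, 1-], Freezing = [1+, 2-]
print(round(gain((7, 3), [(6, 1), (1, 2)]), 3))
```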


6.3 Overfitting and Decision Trees

There are two common approaches that decision tree induction algorithms can use to avoid overfitting the training data:

1. Stop the training algorithm before it reaches a point at which it perfectly fits the training data, or
2. Prune the induced decision tree.

The most commonly used is the latter approach [Mitchell, 1997]. Decision tree learners normally employ post-pruning techniques that evaluate the performance of decision trees as they are pruned, using a validation set of examples that are not used during training. The goal of pruning is to improve the learner's accuracy on the validation set of data.

In its simplest form, post-pruning operates by considering each node in the decision tree as a candidate for pruning. Any node can be removed and assigned the most common class of the training examples that are sorted to the node in question. A node is pruned if removing it does not make the decision tree perform any worse on the validation set than before the node was removed. By using a validation set of examples, it is hoped that regularities present only in the training data do not occur in the validation set. In this way, pruning nodes created on regularities occurring in the training data will not hurt the performance of the decision tree over the validation set.

Pruning techniques do not always use additional data; consider the following pruning technique used by C4.5.

C4.5 begins pruning by taking the decision tree to be pruned and converting it into a set of rules: one for each path from the root node to a leaf. Each rule is then generalized by removing any of its conditions that will improve the estimated accuracy of the rule. The rules are then sorted by this estimated accuracy and are considered in the sorted sequence when classifying newly encountered examples. The estimated accuracy of each rule is calculated on the training data used to create the classifier (i.e., it is a measure of how well the rule classifies the training examples). The estimate is a pessimistic one and is calculated by taking the


accuracy of the rule over the training examples it covers and then calculating the standard deviation assuming a binomial distribution. For a given confidence level, the lower-bound estimate is taken as a measure of the rule's performance. A more detailed discussion of C4.5's pruning technique can be found in [Quinlan, 1993].

Adaptive Boosting

C5.0 offers adaptive boosting [Schapire and Freund, 1997]. The general idea behind adaptive boosting is to generate several classifiers on the training data. When an unseen example is encountered to be classified, the predicted class of the example is a weighted count of votes from the individually trained classifiers. C5.0 creates a number of classifiers by first constructing a single classifier. A second classifier is then constructed by re-training on the examples used to create the first classifier, but paying more attention to the cases in the training set which the first classifier classified incorrectly. As a result, the second classifier is generally different from the first. The basic algorithm behind Quinlan's implementation of adaptive boosting is as follows:

1. Choose examples from the training set of N examples, each initially being assigned a probability of 1/N of being chosen, and train a classifier on them.
2. Classify the chosen examples with the trained classifier.
3. Re-weight the examples by multiplying the probability of the misclassified examples by a weight B.
4. Repeat the previous three steps K times with the generated probabilities.
5. Combine the K classifiers, giving a weight log(B_k) to each trained classifier.
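The steps above can be sketched as follows. This is only a rough illustration: the constant re-weighting factor `b`, the toy weak learner, and the renormalization step are simplifying assumptions, and C5.0's actual implementation differs in detail:

```python
import math
import random

def boost(examples, train_weak, rounds=5, b=2.0):
    """examples: list of (features, label); train_weak: sample -> classifier.
    Returns a list of (classifier, vote_weight) pairs."""
    n = len(examples)
    probs = [1.0 / n] * n                      # step 1: uniform probabilities
    ensemble = []
    for _ in range(rounds):                    # step 4: repeat K times
        sample = random.choices(examples, weights=probs, k=n)
        clf = train_weak(sample)               # steps 1-2: train and classify
        for i, (x, y) in enumerate(examples):  # step 3: up-weight mistakes
            if clf(x) != y:
                probs[i] *= b
        total = sum(probs)
        probs = [p / total for p in probs]
        ensemble.append((clf, math.log(b)))    # step 5: vote weight log(B)
    return ensemble

def predict(ensemble, x):
    """Weighted vote over the individually trained classifiers."""
    votes = {}
    for clf, w in ensemble:
        votes[clf(x)] = votes.get(clf(x), 0.0) + w
    return max(votes, key=votes.get)
```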


Adaptive boosting can be invoked as a C5.0 option, with the number of classifiers to generate specified by the user.

Pruning Options

C5.0 constructs decision trees in two phases. First it constructs a classifier that fits the training data, and then it prunes the classifier to avoid over-fitting the data. Two options can be used to affect the way in which the tree is pruned.

The first option specifies the degree to which the tree can initially fit the training data. It specifies the minimum number of training examples that must follow at least two of the branches at any node in the decision tree. This is a method of avoiding over-fitting by stopping the training algorithm before it over-fits the data.

A second pruning option affects the severity with which the algorithm will post-prune constructed decision trees and rule sets. Pruning is performed by removing parts of the constructed decision trees or rule sets that have a high predicted error rate on new examples.

Rule Sets

C5.0 can also convert decision trees into rule sets. For the purposes of this study, rule sets were generated using C5.0, because rule sets are easier to understand than decision trees and can easily be described in terms of complexity; that is, rule sets can be looked at in terms of the average size of the rules and the number of rules in the set.

The previous description of C5.0's operation is by no means complete. It is merely an attempt to provide the reader with enough information to understand the options that were primarily used in this study. C5.0 has many other options that can be used to affect its operation. They include options to invoke k-fold cross validation, enable differential misclassification costs, and speed up training times by randomly sampling from large data sets.


7 Performance Measures

Evaluating a classifier's performance is a very important aspect of machine learning. Without an evaluation method it is impossible to compare learners, or even to know whether or not a hypothesis should be used. For example, when learning to classify mushrooms as poisonous or not, one would want to be able to measure very precisely the accuracy of a learned hypothesis. The following section introduces the confusion matrix, which identifies the types of errors a classifier makes, as well as two more sophisticated evaluation methods: the g-mean, which combines the performance of a classifier over two classes, and ROC curves, which provide a visual representation of a classifier's performance.

7.1 Confusion Matrix

A classifier's performance is commonly broken down into what is known as a confusion matrix. A confusion matrix basically shows the type of classification errors a classifier makes. Figure 2.3.1 shows a confusion matrix for a two-class problem; following the usual convention, a counts the correctly classified positive examples, b the positive examples classified as negative, c the negative examples classified as positive, and d the correctly classified negative examples.


A classifier's performance can also be calculated separately over the positive examples (denoted a+) and over the negative examples (denoted a-). Each is calculated as:

a+ = a / (a + b)        a- = d / (c + d)

7.2 g-Mean

Kubat, Holte, and Matwin [1998] use the geometric mean of the accuracies measured separately on each class:

g = sqrt(a+ * a-)

The basic idea behind this measure is to maximize the accuracy on both classes. In this study the geometric mean will be used as a check on how balanced the combination scheme is. For example, if we consider an imbalanced data set that has 240 positive examples and 6000 negative examples and stubbornly classify each example as negative, we would see, as in many imbalanced domains, a very high accuracy (acc = 96%). Using the geometric mean, however, would quickly show that this line of thinking is flawed: it would be calculated as sqrt(0 * 1) = 0.
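The 240/6000 example works out as follows (the a, b, c, d labels follow the confusion-matrix convention used in this chapter):

```python
import math

def gmean(a, b, c, d):
    """Geometric mean of the per-class accuracies from a confusion matrix:
    a = correct positives, b = missed positives,
    c = false positives, d = correct negatives."""
    acc_pos = a / (a + b)   # a+: accuracy on the positive class
    acc_neg = d / (c + d)   # a-: accuracy on the negative class
    return math.sqrt(acc_pos * acc_neg)

# Classify everything as negative on a 240-positive / 6000-negative set:
a, b, c, d = 0, 240, 0, 6000
accuracy = (a + d) / (a + b + c + d)
print(round(accuracy, 3))   # about 0.962: deceptively high
print(gmean(a, b, c, d))    # 0.0: the imbalance is exposed
```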

7.3 ROC curves

ROC curves (Receiver Operating Characteristic) provide a visual representation of the trade-off between true positives and false positives. They are plots of the percentage of correctly classified positive examples (a+) against the percentage of incorrectly classified negative examples (a-).



[Figure: A fictitious example of two ROC curves, plotting True Positive (%) on the vertical axis against False Positive (%) on the horizontal axis.]

Point (0, 0) along a curve represents a classifier that by default classifies all examples as negative, whereas point (0, 100) represents a classifier that correctly classifies all examples.

Many learning algorithms allow induced classifiers to move along the curve by varying their learning parameters. For example, decision tree learning algorithms provide options allowing induced classifiers to move along the curve by way of pruning parameters (pruning options for C5.0 are discussed in Section 6). Swets [1988] proposes that classifiers' performances can be compared by calculating the area under the curves generated by the algorithms on identical data sets.


8 A Review of Current Literature

The approaches taken in the current literature to the problem of imbalanced data sets can be grouped into four categories. The first category, misclassification costs, covers methods that assign a higher cost to errors on the underrepresented class. The second category, sampling techniques, discusses data set balancing techniques that sample training examples, in both naive and intelligent fashions. The third category, classifiers that cover one class, describes learning algorithms that create rules to cover only one class. The last category, recognition-based learning, discusses a learning method that ignores or makes little use of one class altogether.

8.1 Misclassification Costs

Typically a classifier's performance is evaluated using the proportion of examples that are incorrectly classified. Pazzani, Merz, Murphy, Ali, Hume, and Brunk [1994] look at the errors made by a classifier in terms of their cost. For example, take an application such as the detection of poisonous mushrooms. The cost of misclassifying a poisonous mushroom as being safe to eat may have serious consequences and therefore should be assigned a high cost; conversely, misclassifying a mushroom that is safe to eat may have no serious consequences and should be assigned a low cost. Pazzani et al. [1994] use algorithms that attempt to solve the problem of imbalanced data sets by way of introducing a cost matrix. The algorithm that is of interest here is called Reduce Cost Ordering (RCO), which attempts to order a decision list (set of rules) so as to minimize the cost of making incorrect classifications.

RCO is a post-processing algorithm that can complement any rule learner, such as C4.5. It essentially orders a set of rules to minimize misclassification costs. The algorithm works as follows.

The algorithm takes as input a set of rules (rule list), a cost matrix, and a set of examples (example list), and returns an ordered set of rules (decision list). An example of a cost matrix (for the mushroom example) is depicted in Figure 2.4.1.

                  Hypothesis
          Safe      Poisonous      Actual Class
            0           1          Safe
           10           0          Poisonous

Figure 2.4.1: A cost matrix for a poisonous mushroom application.


Note that the costs in the matrix are the costs associated with the prediction in light of the actual class.

The algorithm begins by initializing a decision list to a default class, the one which yields the least expected cost if all examples were tagged as being that class. It then attempts to iteratively replace the default class with a new rule / default class pair, by choosing a rule from the rule list that covers as many examples as possible and a default class which minimizes the cost of the examples not covered by the chosen rule. Note that when an example in the example list is covered by a chosen rule it is removed. The process continues until no new rule / default class pair can be found to replace the default class in the decision list (i.e., the default class minimizes cost over the remaining examples).
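As a rough illustration, the greedy loop can be sketched as follows. This is a minimal sketch with a hypothetical rule representation (a predicate paired with a predicted class); Pazzani et al.'s actual algorithm is more involved:

```python
def expected_cost(cls, examples, cost):
    # cost[actual][predicted]: total cost of predicting `cls` for every example
    return sum(cost[actual][cls] for _, actual in examples)

def rco_order(rules, examples, cost, classes):
    """Greedily order rules to minimize misclassification cost.
    Each rule is a (predicate, predicted_class) pair; each example is a
    (features, actual_class) pair; the last list entry is the default class."""
    decision_list, remaining = [], list(examples)
    while True:
        # default class: least expected cost over the remaining examples
        default = min(classes, key=lambda c: expected_cost(c, remaining, cost))
        # candidate rule: covers as many remaining examples as possible
        best, covered = None, []
        for rule in rules:
            cov = [e for e in remaining if rule[0](e[0])]
            if len(cov) > len(covered):
                best, covered = rule, cov
        if best is None:                 # no rule covers anything: stop
            return decision_list + [default]
        decision_list.append(best)       # replace default with rule/default pair
        rules = [r for r in rules if r is not best]
        remaining = [e for e in remaining if e not in covered]
```

Covered examples are removed at each step, so each chosen rule is judged only against what the earlier rules failed to capture.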

An algorithm such as the one described above can be used to tackle imbalanced data sets by way of assigning high misclassification costs to the underrepresented class. Decision lists can then be biased, or ordered, to classify examples as the underrepresented class, as these would have the least expected cost if classified incorrectly.

Incorporating costs into decision tree algorithms can be done by replacing the information gain metric with a new measure that bases partitions not on information gain, but on the cost of misclassification. This was studied by Pazzani et al. [1994] by modifying ID3 to use a metric that chooses partitions that minimize misclassification cost. The results of their experimentation indicate that their greedy test selection method, attempting to minimize cost, did not perform as well as using an information gain heuristic. They attribute this to the fact that their selection technique attempts solely to fit the training data and does not minimize the complexity of the learned concept.

A more viable alternative to incorporating misclassification costs into the creation of a decision tree is to modify pruning techniques. Typically, decision trees are pruned by merging leaves of the tree to classify examples as the majority class. In effect, this is calculating the probability that an example belongs to a given class by looking at training examples that have filtered down to the leaves being merged. By assigning the majority


class to the node of the merged leaves, decision trees are assigning the class with the lowest expected error. Given a cost matrix, pruning can be modified to assign the class that has the lowest expected cost instead of the lowest expected error. Pazzani et al. [1994] state that cost pruning techniques have an advantage over replacing the information gain heuristic with a minimal cost heuristic, in that a change in the cost matrix does not affect the learned concept description. This allows different cost matrices to be used for different examples.
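The pruning decision itself is easy to illustrate. Below is a minimal sketch (the class counts, class names, and cost matrix are invented for illustration) contrasting the error-based leaf label with the cost-based one:

```python
def leaf_label_by_error(class_counts):
    # standard pruning: label the merged leaf with the majority class,
    # i.e. the class with the lowest expected error
    return max(class_counts, key=class_counts.get)

def leaf_label_by_cost(class_counts, cost):
    # cost-sensitive pruning: label the merged leaf with the class of
    # lowest expected cost, where cost[actual][predicted] is a cost matrix
    def expected_cost(predicted):
        return sum(n * cost[actual][predicted]
                   for actual, n in class_counts.items())
    return min(cost, key=expected_cost)
```

With 8 "safe" and 2 "poisonous" training examples at a merged leaf and the mushroom-style costs (10 for calling a poisonous mushroom safe, 1 for the reverse), error-based pruning labels the leaf "safe" while cost-based pruning labels it "poisonous".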

2.4.2 Sampling Techniques

2.4.2.1 Heterogeneous Uncertainty Sampling

Lewis and Catlett [1994] describe a heterogeneous³ approach to selecting training examples from a large data set by using uncertainty sampling. The algorithm they use operates under an information filtering paradigm; uncertainty sampling is used to select training examples to be presented to an expert. It can be simply described as a process where a cheap classifier chooses, from a large pool, a subset of training examples whose class it is unsure of and presents them to an expert to be classified. The classified examples are then used to help the cheap classifier choose more examples for which it is uncertain. The examples that the classifier is unsure of are used to create a more expensive classifier.

The uncertainty sampling algorithm used is an iterative process by which an inexpensive probabilistic classifier is initially trained on three randomly chosen positive examples from the training data. The classifier is based on an estimate of the probability that an instance belongs to a class C:

³ Their method is considered heterogeneous because a classifier of one type chooses examples to present to a classifier of another type.

    P(C|w) = exp( a + b Σ_{i=1..d} log( P(wᵢ|C) / P(wᵢ|C̄) ) )
             ─────────────────────────────────────────────────────
             1 + exp( a + b Σ_{i=1..d} log( P(wᵢ|C) / P(wᵢ|C̄) ) )

(C̄ denotes the counter class.)


where C indicates class membership and wᵢ is the ith of d attributes in example w; a and b are calculated using logistic regression. This model is described in detail in Lewis and Hayes [1994]. All we are concerned with here is that the classifier returns a number P between 0 and 1 indicating its confidence in whether or not an unseen example belongs to a class. The threshold chosen to indicate a positive instance is 0.5. If the classifier returns a P higher than 0.5 for an unknown example, it is considered to belong to the class C. The classifier's confidence in its prediction is proportional to the distance its prediction is away from the threshold. For example, the classifier is less confident in a P of 0.6 belonging to C than it is in a P of 0.9 belonging to C.

At each iteration of the sampling loop, the probabilistic classifier chooses four examples from the training set: the two which are closest to and below the threshold and the two which are closest to and above the threshold. The examples that are closest to the threshold are those whose class it is least sure of. The classifier is then retrained at each iteration of the uncertainty sampling and reapplied to the training data to select four more instances that it is unsure of. Note that after the four examples are chosen at each loop, their class is made known for retraining purposes (this is analogous to having an expert label examples).
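The selection step at each iteration can be sketched as follows (the `prob` scoring function is a hypothetical stand-in for the probabilistic classifier):

```python
def select_uncertain(pool, prob, threshold=0.5, per_side=2):
    """Pick the examples the classifier is least sure of: the `per_side`
    whose scores are closest to the threshold from below, and the
    `per_side` closest from above."""
    below = sorted((x for x in pool if prob(x) < threshold),
                   key=lambda x: threshold - prob(x))
    above = sorted((x for x in pool if prob(x) >= threshold),
                   key=lambda x: prob(x) - threshold)
    return below[:per_side] + above[:per_side]
```

In the real loop the four selected examples would be labelled, added to the training set, and the classifier retrained before the next selection.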

The training set presented to the expensive classifier can essentially be described as a pool of examples that the probabilistic classifier is unsure of. The pool of examples, chosen using a threshold, will be biased towards having too many positive examples if the training data set is imbalanced. This is because the examples are chosen from a window that is centered over the borderline where the positive and negative examples meet. To correct for this, the classifier chosen to train on the pool of examples, C4.5, was modified to include a loss ratio parameter, which allows pruning to be based on expected loss instead of expected error (this is analogous to cost pruning, Section 2.4.1). The default rule for the classifier was also modified to be chosen based on expected loss instead of expected error.

Lewis and Catlett [1994] show, by testing their sampling technique on a text classification task, that uncertainty sampling reduces the number of training examples required by an expensive learner such as C4.5 by a factor of 10. They did this by comparing results of


induced decision trees on uncertainty samples from a large pool of training examples with pools of examples that were randomly selected, but ten times larger.

2.4.2.2 One Sided Intelligent Selection

Kubat and Matwin [1997] propose an intelligent one sided sampling technique that reduces the number of negative examples in an imbalanced data set. The underlying concept in their algorithm is that positive examples are considered rare and must all be kept. This is in contrast to Lewis and Catlett's technique, in that uncertainty sampling does not guarantee that a large number of positive examples will be kept. Kubat and Matwin [1997] balance data sets by removing negative examples. They categorize negative examples as belonging to one of four groups. They are:

- those that suffer from class label noise;
- borderline examples (examples which are close to the boundaries of positive examples);
- redundant examples (their part can be taken over by other examples); and
- safe examples that are considered suitable for learning.

In their selection technique all negative examples, except those which are safe, are considered to be harmful to learning and thus have the potential of being removed from the training set. Redundant examples do not directly harm correct classification, but increase classification costs. Borderline negative examples can cause learning algorithms to overfit positive examples.

Kubat and Matwin's [1997] selection technique begins by first removing redundant examples from the training set. To do this a subset C of the training examples S is created by taking every positive example from S and randomly choosing one negative example. The remaining examples in S are then classified using the 1-Nearest Neighbor (1-NN) rule with C. Any misclassified example is added to C. Note that this technique does not produce the smallest possible C; it just shrinks S. After redundant examples are removed, examples considered borderline or class noisy are removed.
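For one-dimensional numeric examples this first step can be sketched as below. This is a toy stand-in for their procedure (real data would use a multi-attribute distance, and Kubat and Matwin pick the seed negative at random rather than taking the first):

```python
def condense(S):
    """Seed C with every positive example plus one negative, then add any
    example of S that the 1-NN rule over C misclassifies."""
    def nn_label(x, ref):
        # label of the nearest neighbour of x among the reference set
        return min(ref, key=lambda e: abs(e[0] - x))[1]
    C = [e for e in S if e[1] == "+"] + [e for e in S if e[1] == "-"][:1]
    for x, y in S:
        if (x, y) not in C and nn_label(x, C) != y:
            C.append((x, y))
    return C
```

Negatives that the current subset already classifies correctly are the "redundant" ones: they never get added to C.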


Borderline, or class noisy, examples are detected using the concept of Tomek links [Tomek, 1976], which are defined by the distance between examples with different class labels. Take for instance two examples x and y with different classes. The pair (x, y) is considered to be a Tomek link if there exists no example z such that δ(x, z) < δ(x, y) or δ(y, z) < δ(y, x), where δ(a, b) is defined as the distance between example a and example b. Examples are considered borderline or class noisy if they participate in a Tomek link.
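A direct transcription of this definition (quadratic in the number of examples; δ is a caller-supplied distance function):

```python
def tomek_links(examples, dist):
    """Return pairs (x, y) with different labels forming a Tomek link:
    no third example z satisfies dist(x, z) < dist(x, y) or
    dist(y, z) < dist(y, x)."""
    links = []
    for i, (x, cx) in enumerate(examples):
        for j in range(i + 1, len(examples)):
            y, cy = examples[j]
            if cx == cy:
                continue
            d = dist(x, y)
            if not any(dist(x, z) < d or dist(y, z) < d
                       for k, (z, _) in enumerate(examples)
                       if k != i and k != j):
                links.append((x, y))
    return links
```

Only opposite-class pairs that are mutual nearest neighbours survive the inner check, which is exactly the borderline/noise condition used for removal.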

Kubat and Matwin's selection technique was shown to be successful in improving performance, as measured by the g-mean, on two of three benchmark domains: vehicles (veh1), glass (g7), and vowels (vwo). The domain in which no improvement was seen, g7, was examined and it was found that in that particular domain the original data set did not produce disproportionate values for g+ and g−.

2.4.2.3 Naive Sampling Techniques

The previously described selection algorithms balance data sets by significantly reducing the number of training examples. Both are intelligent methods that filter out examples using uncertainty sampling, or by removing examples that are considered harmful to learning. Ling and Li [1998] approach the problem of data imbalance using methods that naively downsize or over-sample data sets, classifying examples with a confidence measurement. The domain of interest is data mining for direct marketing. Data sets in this field are typically two class problems and are severely imbalanced, containing only a few examples of people who have bought the product and many examples of people who have not. The three data sets studied by Ling and Li [1998] are a bank data set from a loan product promotion (Bank), an RRSP campaign from a life insurance company (Life Insurance), and a bonus point program where customers accumulate points to redeem for merchandise (Bonus). As will be explained later, all three of the data sets are imbalanced.

Direct marketing is used by the consumer industry to target customers who are likely to buy products. Typically, if mass marketing is used to promote products (e.g., including flyers in a newspaper with a large distribution) the response rate (the percent of people who buy a product after being exposed to the promotion) is very low and the cost of mass


marketing very high. For the three data sets studied by Ling and Li the response rates were 1.2% of 90,900 responding in the Bank data set, 7% of 80,000 responding in the Life Insurance data set, and 1.2% of 104,000 for the Bonus Program.

Data mining can be viewed as a two class domain: given a set of customers and their characteristics, determine a set of rules that can accurately predict a customer as being a buyer or a non-buyer, advertising only to buyers. Ling and Li [1998], however, state that a binary classification is not very useful for direct marketing. For example, a company may have a database of customers to which it wants to advertise the sale of a new product to the …


The evaluation method used by Ling and Li [1998] is known as the lift index. This index has been widely used in database marketing. The motivation behind using the lift index is that it reflects the re-distribution of testing examples after a learner has ranked them. For example, in this domain the learning algorithms rank examples in order from the most likely to respond to the least likely to respond. Ling and Li [1998] divide the ranked list into 10 deciles. When evaluating the ranked list, regularities should be found in the distribution of the responders (i.e., there should be a high percentage of the responders in the first few deciles). Table 2.4.1 is a reproduction of the example that Ling and Li [1998] present to demonstrate this.

Lift Table
 10%   10%   10%   10%   10%   10%   10%   10%   10%   10%
 410    …     …     …     …     …     …     …     …     …
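The tabulation behind such a lift table is mechanical: given a test set ranked from most to least likely to respond (1 = responder, 0 = non-responder), the decile counts are simply per-slice sums:

```python
def decile_counts(ranked_responses, bins=10):
    """Split a ranked response list into `bins` equal slices and count the
    responders in each; a good ranker concentrates them in early slices."""
    size = len(ranked_responses) // bins
    return [sum(ranked_responses[i * size:(i + 1) * size])
            for i in range(bins)]
```

A perfectly random ranking would spread the responders roughly evenly over the ten slices, while a useful one front-loads them.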


    "sing their lift inde4 as the sole measure of performance, 0ing and 0i ;@ report results

    for o-er3sampling and downsi+ing on the three data sets of interest G7ank, 0ife #nsurance,

    and 7onusH9

    0ing and 0i ;@ report results that show the best lift inde4 is obtained when the ratio of

    positi-e and negati-e e4amples in the training data is equal9 "sing 7oosted3$aW-e 7ayes

    with a downsi+ed data set resulted in a lift inde4 of :69>U for 7ank, :>95U for 0ife

    #nsurance, and @;99=U for 0ife #nsurance, and @69=U for the 7onus program when the data sets were

    imbalanced at a ratio of ; positi-e e4ample to e-ery @ negati-e e4amples9 1owe-er, using

    7oosted37ayes with o-er3sampling did not show any significant impro-ement o-er the

    imbalanced data set9 0ing and 0i ;@ state that one method to o-ercome this limitation

    may be to retain all the negati-e e4amples in the data set and re3sample the positi-e

    e4amples=9

When tested using their boosted version of C4.5, over-sampling saw a performance gain as the positive examples were re-sampled at higher rates. With a positive sampling rate of 20x, Bank saw an increase of 2.9% (from 65.6% to 68.5%), Life Insurance an increase of 2.9% (from 74.…


…which techniques are appropriate in dealing with class imbalances? To investigate these questions Japkowicz [2000] created a number of artificial domains which were made to vary in concept complexity, size of the training data, and ratio of the under-represented class to the over-represented class.

The target concept to be learned in her study was a one dimensional set of continuous alternating equal sized intervals in the range [0, 1], each associated with a class value of 0 or 1. For example, a linear domain generated using her model would be the intervals [0, 0.5) and (0.5, 1]. If the first interval was given the class 1, the second interval would have class 0. Examples for the domain would be generated by randomly sampling points from each interval (e.g., a point x sampled in [0, 0.5] would be an (x, +) example, and likewise a point y sampled in (0.5, 1] would be a (y, −) example).

Japkowicz [2000] varied the complexity of the domains by varying the number of intervals in the target concept. Data set sizes and balances were easily varied by uniformly sampling different numbers of points from each interval.
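A sketch of this generation scheme (the even/odd labelling convention and the seed are illustrative choices, not taken from her paper):

```python
import random

def sample_domain(n_intervals, points_per_interval, seed=0):
    """Sample (x, label) points from alternating equal-sized intervals of
    [0, 1); even-numbered intervals get class 1, odd-numbered class 0."""
    rng = random.Random(seed)
    width = 1.0 / n_intervals
    data = []
    for k in range(n_intervals):
        label = 1 if k % 2 == 0 else 0
        for _ in range(points_per_interval):
            data.append((k * width + rng.random() * width, label))
    return data
```

Concept complexity grows with `n_intervals`, and class imbalance can be produced by passing different point counts for the two kinds of interval.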

The two balancing techniques that Japkowicz [2000] used in her study that are of interest here are over-sampling and downsizing. The over-sampling technique used was one in which the small class was randomly re-sampled and added to the training set until the number of examples of each class was equal. The downsizing technique used was one in which random examples were removed from the larger class until the sizes of the classes were equal. The domains and balancing techniques described above were implemented using various discrimination based neural networks (DLP).
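Both balancing schemes amount to a few lines each; a minimal sketch, with the two classes represented as plain lists of examples:

```python
import random

def oversample(small, large, seed=0):
    """Randomly re-sample the small class (with replacement) until the
    two classes are the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(small) for _ in range(len(large) - len(small))]
    return small + extra, large

def downsize(small, large, seed=0):
    """Randomly remove examples from the large class until the sizes match."""
    rng = random.Random(seed)
    return small, rng.sample(large, len(small))
```

Over-sampling duplicates existing minority examples rather than inventing new ones; downsizing discards majority information, which is why the two behave differently as training sets grow.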

Japkowicz found that both re-sampling and downsizing helped improve DLP, especially as the target concept became very complex. Downsizing, however, outperformed over-sampling as the size of the training set increased.


2.4.3 Classifiers Which Cover One Class

2.4.3.1 BRUTE

Riddle, Segal, and Etzioni [1994] propose an induction technique called BRUTE. The goal of BRUTE is not classification, but the detection of rules that predict a class. The domain of interest which led to the creation of BRUTE is the detection of manufactured airplane parts that are likely to fail. Any rule that detects anomalies, even if they are rare, is considered important. Rules which predict that a part will not fail, on the other hand, are not considered valuable, no matter how large their coverage may be.

BRUTE operates on the premise that standard decision tree test functions such as ID3…


…It can be seen that T2 will be chosen over T1 using ID3…


…CART and 2.9% for C4.5. One drawback is that the computational complexity of BRUTE's depth-bounded search is much higher than that of typical decision tree algorithms. They do report, however, that it took only minutes of CPU time on a SPARC-10.

2.4.3.2 FOIL

FOIL [Quinlan, 1990] is an algorithm designed to learn a set of first order rules to predict a target predicate to be true. It differs from learners such as C5.0 in that it learns relations among attributes that are described with variables. For example, using a set of training examples where each example is a description of people and their relations:

    Name1 = Jack, Girlfriend1 = Jill,
    Name2 = Jill, Boyfriend2 = Jack, Couple12 = True …

C5.0 may learn the rule:

    IF (Name1 = Jack) ∧ (Boyfriend2 = Jack) THEN Couple12 = True.

This rule of course is correct, but will have very limited use. FOIL on the other hand can learn the rule:

    IF Boyfriend(x, y) THEN Couple(x, y) = True

where x and y are variables which can be bound to any person described in the data set. A positive binding is one in which a predicate binds to a positive assertion in the training data. A negative binding is one in which there is no assertion found in the training data. For example, the predicate Boyfriend(x, y) has four possible bindings in the example above. The only positive assertion found in the data is for the binding Boyfriend(Jill, Jack) (read: the boyfriend of Jill is Jack). The other three possible bindings (e.g., Boyfriend(Jack, Jill)) are negative bindings, because there are no positive assertions for them in the training data.

⁵ The accuracy being referred to here is not how well a rule set performs over the testing data. What is being referred to is the percentage of testing examples which are covered by a rule and correctly classified. The example Riddle et al. [1994] give is that if a rule matches 10 examples in the testing data, and 4 of them are positive, then the predictive accuracy of the rule is 40%. The figures given are averages over the entire rule set created by each algorithm. Riddle et al. [1994] use this measure of performance in their domain because their primary interest is in finding a few accurate rules that can be interpreted by factory workers in order to improve the production process. In fact, they state that they would be happy with a poor tree with one really good branch from which an accurate rule could be extracted.

The following is a brief description of the FOIL algorithm, adapted from Mitchell [1997].

FOIL takes as input a target predicate (e.g., Couple(x, y)), a list of predicates that will be used to describe the target predicate, and a set of examples. At a high level, the algorithm operates by learning a set of rules that covers the positive examples in the training set. The rules are learned using an iterative process that removes positive training examples from the training set when they are covered by a rule. The process of learning rules continues until there are enough rules to cover all the positive training examples. In this way, FOIL can be viewed as a specific to general search through a hypothesis space, which begins with an empty set of rules that covers no positive examples and ends with a set of rules general enough to cover all the positive examples in the training data (the default rule in a learned set is negative).

Creating a rule to cover positive examples is a process by which a general to specific search is performed, starting with an empty condition that covers all examples. The rule is then made specific enough to cover only positive examples by adding literals to the rule (a literal is defined as a predicate or its negation). For example, a rule predicting the predicate Female(x) may be made more specific by adding the literals long_hair(x) and ¬beard(x). The function used to evaluate which literal, L, to add to a rule, R, at each step is:

    Foil_Gain(L, R) = t ( log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) )

where p0 and n0 are the number of positive (p) and negative (n) bindings of the rule R, p1 and n1 are the number of positive and negative bindings of the rule which will be created by adding L to R, and t is the number of positive bindings of the rule R which are still covered when L is added (so t ≤ p0).


The function Foil_Gain determines the utility of adding L to R. It prefers adding literals with more positive bindings than negative bindings. As can be seen in the equation, the measure is based on the proportion of positive bindings before and after the literal in question is added.
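Numerically, the gain rewards literals that raise the fraction of positive bindings. A direct transcription of the formula, with t supplied by the caller as in the definition above:

```python
from math import log2

def foil_gain(t, p0, n0, p1, n1):
    """Foil_Gain(L, R): t positive bindings of R remain covered after
    adding L; (p0, n0) and (p1, n1) count the positive/negative bindings
    of R before and after L is added."""
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))
```

For instance, a literal that removes every negative binding while keeping all four positive ones, starting from a 4-positive / 4-negative rule, yields a gain of 4 bits.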

2.4.3.3 SHRINK

Kubat, Holte, and Matwin [1998] discuss the design of the SHRINK algorithm, which follows the same principles as BRUTE. SHRINK operates by finding rules that cover positive examples. In doing this, it learns from both positive and negative examples, using the g-mean to take into account rule accuracy over negative examples. There are three principles behind the design of SHRINK. They are:

- do not subdivide the positive examples when learning;
- create a classifier that is low in complexity; and
- focus on regions in space where positive examples occur.

A SHRINK classifier is made up of a network of tests. Each test is of the form xᵢ ∈ [min aᵢ, max aᵢ], where i indexes the attributes. Let hᵢ represent the output of the ith test. If the test suggests a positive example, the output is 1, else it is −1. Examples are classified as being positive if Σᵢ hᵢwᵢ > Θ, where wᵢ is a weight assigned to the test hᵢ.

SHRINK creates the tests and weights in the following way. It begins by taking, for each attribute, the interval that covers all the positive examples. The interval is then reduced in size by removing either the left or right end point, whichever produces the better g-mean. This process is repeated iteratively and the interval found to have the best g-mean is considered the test for the attribute. Any test that has a g-mean less than 0.50 is discarded. The weight assigned to each test is wᵢ = log(gᵢ / (1 − gᵢ)), where gᵢ is the g-mean associated with the ith attribute test.


The results reported by Kubat et al. [1998] demonstrate that the SHRINK algorithm performs better than 1-Nearest Neighbor with one sided selection.⁶ Pitting SHRINK against C4.5 with one sided selection, the results became less clear. Using one sided selection resulted in a performance gain over the positive examples but a significant loss over the negative examples. This loss of performance over the negative examples results in the g-mean being lowered by about 10%.

Accuracies Achieved by C4.5, 1-NN and SHRINK

Classifier   a+     a−     g-mean
C4.5         81.1   86.6   81.7
1-NN         67.2    …      …
SHRINK        …      …      …

Table 2.4.2: This table is adapted from Kubat et al. [1998]. It gives the accuracies achieved by C4.5, 1-NN and SHRINK.

2.4.4 Recognition Based Learning

Discrimination based learning techniques, such as C5.0, create rules which describe both the positive (conceptual) class and the negative (counter conceptual) class. Algorithms such as BRUTE and FOIL differ from algorithms such as C5.0 in that they create rules that cover only positive examples. However, they are still discrimination based techniques because they create positive rules using negative examples in their search through the hypothesis space. For example, FOIL creates rules to cover the positive class by adding literals until they do not cover any of the negative class examples. Other learning methods, such as back propagation applied to a feed forward neural network and k-nearest neighbor, do not explicitly create rules, but they are discrimination based techniques that learn from both positive and negative examples.

Japkowicz, Myers, and Gluck [1995] describe HIPPO, a system that learns to recognize a target concept in the absence of counter examples. More specifically, it is a neural network (called an autoencoder) that is trained to take positive examples as input, map them to a small hidden layer, and then attempt to reconstruct the examples at the output layer.

⁶ One sided selection is discussed in Section 2.4.2.2. It is essentially a method by which negative examples considered harmful to learning are removed from the data set.


Because the network has a narrow hidden layer, it is forced to compress redundancies found in the input examples.

An advantage of recognition based learners is that they can operate in environments in which negative examples are very hard or expensive to obtain. An example Japkowicz et al. [1995] give is the application of machine fault diagnosis, where a system is designed to detect the likely failure of hardware (e.g., helicopter gear boxes). In domains such as this, statistics on functioning hardware are plentiful, while statistics on failed hardware may be nearly impossible to acquire. Obtaining positive examples involves monitoring functioning hardware, while obtaining negative examples involves monitoring hardware that fails. Acquiring enough examples of failed hardware for training a discrimination based learner can be very costly if the device has to be broken a number of different ways to reflect all the conditions in which it may fail.

In learning a target concept, recognition based classifiers such as that described by Japkowicz et al. [1995] do not try to partition a hypothesis space with boundaries that separate positive and negative examples; rather, they attempt to make boundaries which surround the target concept. The following is an overview of how HIPPO, a one hidden layer autoencoder, is used for recognition based learning.

A one hidden layer autoencoder consists of three layers: the input layer, the hidden layer and the output layer. Training an autoencoder takes place in two stages. In the first stage the system is trained on positive instances using back-propagation⁷ to be able to compress the training examples at the hidden layer and reconstruct them at the output layer. The second stage of training involves determining a threshold that can be used to discriminate between the reconstruction errors of positive and negative examples.

The second stage of training is a semi-automated process that can take one of two forms. The first, noiseless, case is one in which a lower bound is calculated on the reconstruction error of either the negative or positive instances. The second, noisy, case is one that uses both positive and negative training examples to calculate the threshold, ignoring the examples considered to be noisy or exceptional.

⁷ Note that back propagation is not the only training function that can be used. Evans and Japkowicz [2000] report results using an auto-encoder trained with the One Step Secant function.

After training and threshold determination, unseen examples can be given to the autoencoder, which compresses and then reconstructs them at the output layer, measuring the accuracy with which each example was reconstructed. For a two class domain this is very powerful. Training an autoencoder to be able to sufficiently reconstruct the positive class means that unseen examples that can be reconstructed at the output layer contain features that were present in the examples used to train the system. Unseen examples that can be generalized with a low reconstruction error can therefore be deemed to be of the same conceptual class as the examples used for training. Any example which cannot be reconstructed with a low reconstruction error is deemed to be unrecognized by the system and can be classified as the counter conceptual class.
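The classification step reduces to a threshold test on reconstruction error. In the sketch below a toy stand-in for the trained autoencoder (it "reconstructs" every input as the mean of the positive training data) illustrates the recognition decision; HIPPO's actual reconstructor is, of course, a learned network:

```python
def make_reconstructor(positives):
    # toy stand-in for a trained autoencoder: reconstruct any input as the
    # component-wise mean of the positive training examples
    dim = len(positives[0])
    mean = [sum(p[i] for p in positives) / len(positives) for i in range(dim)]
    return lambda example: mean

def recognize(example, reconstruct, threshold):
    """Low reconstruction error -> target concept ('+'); high error ->
    unrecognized, i.e. the counter conceptual class ('-')."""
    error = sum((x - y) ** 2 for x, y in zip(example, reconstruct(example)))
    return "+" if error <= threshold else "-"
```

Only the thresholding logic carries over to the real system; the quality of the decision rests entirely on how well the trained network reconstructs genuine members of the target concept.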

Japkowicz et al. [1995] compared HIPPO to two other standard classifiers that are designed to operate with both positive and negative examples: C4.5 and back propagation applied to a feed forward neural network (FF Classification). The data sets studied were:

- The CH46 Helicopter Gearbox data set [Kolesar and NRaD, 1994]. This domain consists of discriminating between faulty and non-faulty helicopter gearboxes during operation. The faulty gearboxes are the positive class.
- The Sonar Target Recognition data set. This data was obtained from the U.C. Irvine Repository of Machine Learning. This domain consists of taking sonar signals as input and determining which signals constitute rocks and which are mines (mine signals were considered the positive class in the study).
- The Promoter data set. This data consists of input segments of DNA strings. The problem consists of recognizing which strings represent promoters, which are the positive class.


Testing HIPPO showed that it performed much better than C4.5 and the FF Classifier on the Helicopters and Sonar Targets domains. It performed equally with the FF Classifier on the Promoters domain, but much better than C4.5 on the same data.

Data Set Results

Data Set         HIPPO   C4.5   FF Classifier
Helicopters        …      …        …
Promoters          …      …        …
Sonar Targets      …      …        …


C h a p t e r   T h r e e

3 ARTIFICIAL DOMAIN

Chapter 3 is divided into three sections. Section 3.… The purpose of the experiments is to investigate the nature of imbalanced data sets and provide a motivation behind the design of a system intended to improve a standard classifier's performance on imbalanced data sets. Section 3.…


where k is the number of disjuncts, n is the number of conjunctions in each disjunct, and each x is defined over the alphabet x1, x2, …, xj, ¬x1, ¬x2, …, ¬xj. An example of a k-DNF expression, k being 2, is given as (Exp. 1).

x1 ∨ x…¬x… (Exp. 1)

Note that if xk is a member of a disjunct, ¬xk cannot be. Also note that (Exp. 1) would be referred to as an expression of …; the following four examples would have the classes indicated by +/−.

x1 x2 x3 x4 x5 Class
1)  1  0  1  1  0  +
2)  0  1  0  1  1  +
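The definition above can be made concrete with a short sketch showing how an example is tested against a k-DNF expression, one disjunct at a time. The particular expression below is hypothetical (the literal subscripts of (Exp. 1) did not survive in this copy); it merely follows the stated rules, including that a disjunct never contains both xk and ¬xk.

```python
# A k-DNF expression is a list of disjuncts; each disjunct is a list of
# literals (index, polarity): polarity True means x_i, False means not-x_i.
# This two-disjunct expression is hypothetical, chosen only so that it is
# consistent with the two surviving positive examples in the table above.
expression = [
    [(0, True), (2, True)],    # x1 AND x3
    [(1, True), (2, False)],   # x2 AND NOT x3
]

def satisfies(example, expression):
    """An example is positive if it satisfies at least one disjunct."""
    return any(
        all(example[i] == polarity for i, polarity in disjunct)
        for disjunct in expression
    )

# The two surviving positive examples from the table (x1..x5):
assert satisfies([1, 0, 1, 1, 0], expression)       # example 1) -> +
assert satisfies([0, 1, 0, 1, 1], expression)       # example 2) -> +
assert not satisfies([0, 0, 0, 0, 0], expression)   # satisfies no disjunct
```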


The other similarity between text classification and k-DNF expressions is the ability to affect the complexity of the target expression in a k-DNF expression. By varying the number of disjuncts in an expression we can vary the difficulty of the target concept to be learned.8 This ability to control concept complexity can map itself onto text classification tasks, where not all classification tasks are equal in difficulty. This may not be obvious at first. Consider a text classification task where one needs to classify documents as being about a particular consumer product. The complexity of the rule set needed to distinguish documents of this type may be as simple as a single rule indicating the name of the product and the name of the company that produces it. This task would probably map itself to a very simple k-DNF expression with perhaps only one disjunct. Now consider training another classifier intended to be used to classify documents as being computer software related or not. The number of rules needed to describe this category is probably much greater. For example, the terms "computer" and "software" in a document may be good indicators that a document is computer software related, but so might be the term "windows", if it appears in a document not containing the term "cleaner". In fact, the terms "operating" and "system" or "word" and "processor" appearing together in a document are also good indicators that it is software related. The complexity of a rule set needed to be constructed by a learner to recognize computer software related documents is, therefore, greater and would probably map onto a k-DNF expression with more disjuncts than that of the first consumer product example.

The biggest difference between the two domains is that the artificial domain was created without introducing any noise. No negative examples were created and labeled as being positive. Likewise, there were no positive examples labeled as negative. For text domains in general there is often label noise, in which documents are given labels that do not accurately indicate their content.

8 As the number of disjuncts (k) in an expression increases, more partitions in the hypothesis space need to be realized by a learner to separate the positive examples from the negative examples.


3.2 Example Creation

For the described tests, training examples were always created independently of the testing examples. The training and testing examples were created in the following manner:

A random k-DNF expression is created on a given alphabet size (in this study the alphabet size is 50).

An arbitrary set of examples was generated as a random sequence of attributes equal to the size of the alphabet the k-DNF expression was created over. All the attributes were given an equal probability of being either 0 or 1.

Each example was then classified as being either a member of the expression or not and tagged appropriately (Figure …).

A data set of 6000 negative examples and 1200 positive examples was used. This represented a class imbalance of 5:1 in favor of the negative class. As the tests, however, led to the creation of a combination scheme, the data sets tested were further imbalanced to a 25:1 (6000 negative : 240 positive) ratio in favor of the negative class. This greater imbalance more closely resembled the real world domain of text classification on which the system was ultimately tested. In each case the exact ratio of positive and negative examples in both the training and testing set will be indicated.

    in both the training and testing set will be indicated9

    ==

  • 8/12/2019 Andrew Thesis

    56/125

3.3 Description of Tests and Results

The description of each test will consist of several sections. The first section will state the motivation behind performing the test and give the particulars of its design. The results of the experiment will then be given, followed by a discussion.

3.3.1 Test #1: Varying the Target Concept's Complexity

Varying the number of disjuncts in an expression varies the complexity of the target concept. As the number of disjuncts increases, the following two things occur in a data set where the positive examples are evenly distributed over the target expression and their number is held constant:

The target concept becomes more complex, and
The number of positive examples becomes sparser relative to the target concept.

A visual representation of the preceding statements is given in Figure …


The motivation behind this experiment comes from Schaffer …; the experiment examines how C5.0 learns target concepts of increasing complexity on balanced and imbalanced data sets.

Setup

In order to investigate the performance of induced decision trees on balanced and imbalanced data sets, eight sets of training and testing data of increasing target concept complexities were created. The target concepts in the data sets were made to vary in concept complexity by increasing the number of disjuncts in the expression to be learned, while keeping the number of conjunctions in each disjunct constant. The following algorithm was used to produce the results given below.

Repeat x times
  o Create a training set T(c, 6000+, 6000−)
  o Create a test set E(c, 1200+, 1200−)
  o Train C on T
  o Test C on E and record its performance P1:1
  o Randomly remove 4800 positive examples from T
  o Train C on T
  o Test C on E and record its performance P1:5
  o Randomly remove 960 positive examples from T
  o Train C on T
  o Test C on E and record its performance P1:25

Note that throughout Chapter 3 the testing sets used to measure the performance of the induced classifiers are balanced. That is, there is an equal number of both positive and negative examples used for testing. The test sets are artificially balanced in order to increase the cost of misclassifying positive examples. Using a balanced testing set to measure a classifier's performance gives each class equal weight.


Average each P over the x runs.

For this test, expressions of complexity c = 4x2, …, 4x10 were tested. The results for each expression were averaged over x = 10 runs.
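The removal schedule above can be checked by the arithmetic it encodes: 6000 − 4800 = 1200 positives against 6000 negatives (1:5), and 1200 − 960 = 240 positives (1:25). A minimal sketch of the bookkeeping follows, with placeholder callables standing in for the data generator and the C5.0 learner (neither is reproduced here):

```python
import random

def run_protocol(make_data_set, train, evaluate, complexity):
    """One pass of Test #1: train at 1:1, 1:5, and 1:25 imbalance,
    always testing against the same balanced 1200+/1200- test set.
    make_data_set, train, and evaluate are stand-ins for the thesis's
    example generator and the C5.0 learner."""
    T_pos, T_neg = make_data_set(complexity, 6000, 6000)
    E = make_data_set(complexity, 1200, 1200)
    performance = {}
    # 6000 -> 1200 -> 240 positives; the negatives are never touched.
    for ratio, num_pos in (("1:1", 6000), ("1:5", 1200), ("1:25", 240)):
        T_pos = random.sample(T_pos, num_pos)  # randomly remove positives
        classifier = train(T_pos + T_neg)
        performance[ratio] = evaluate(classifier, E)
    return performance
```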

Results

The results of the experiment are shown in Figures …


[Figure: Error over all examples versus degree of complexity (4x2 through 4x10), for training ratios 1:1, 1:5, and 1:25.]


Discussion

As previously stated, the purpose of this experiment was to test the classifier's performance on both balanced and imbalanced data sets while varying the complexity of the target expression. It can be seen in Figure …


Table …


In terms of the overall size of the data set, downsizing significantly reduces the number of overall examples made available for training. By leaving negative examples out of the data set, information about the negative (or counter conceptual) class is being removed.

Over-sampling has the opposite effect in terms of the size of the data set. Adding examples by re-sampling the positive (or conceptual) class, however, does not add any additional information to the data set. It just balances the data set by increasing the number of positive examples in the data set.

Setup

This test was designed to determine if randomly removing examples of the over-represented negative class, or uniformly over-sampling examples of the under-represented class to balance the data set, would improve the performance of the induced classifier over the test data. To do this, data sets imbalanced at a ratio of 1+:25− were created, varying the complexity of the target expression in terms of the number of disjuncts. The idea behind the testing procedure was to start with an imbalanced data set and measure the performance of an induced classifier as either negative examples are removed, or positive examples are re-sampled and added to the training data. The procedure given below was followed to produce the presented results.

Repeat x times
  o Create a training set T(c, 240+, 6000−)
  o Create a test set E(c, 1200+, 1200−)
  o Train C on T
  o Test C on E and record its performance Poriginal
  o Repeat for n = 1 to 10
      Create Td(240+, (6000 − n×576)−) by randomly removing 576×n examples from T
      Train C on Td
      Test C on E and record its performance Pdownsize
  o Repeat for n = 1 to 10
      Create To((240 + n×576)+, 6000−) by uniformly over-sampling the positive examples from T
      Train C on To
      Test C on E and record its performance Poversample
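The two resampling operations in the procedure can be sketched directly. Sampling with replacement in oversample is an assumption made here; "uniformly over-sampling" could equally mean cycling through the positives evenly.

```python
import random

def downsize(negatives, n, step=576):
    """Td: randomly remove n*step negative examples."""
    return random.sample(negatives, len(negatives) - n * step)

def oversample(positives, n, step=576):
    """To: add n*step re-sampled positive examples (with replacement,
    which is an assumption; see the note above)."""
    return positives + [random.choice(positives) for _ in range(n * step)]

negatives = list(range(6000))
positives = list(range(240))
# Ten 576-example steps balance the data either way:
assert len(downsize(negatives, 10)) == 240       # 240+ : 240-
assert len(oversample(positives, 10)) == 6000    # 6000+ : 6000-
```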


For downsizing, the numbers represent the rate at which negative examples were removed from the training data. The point 0 represents no negative examples being removed, while 100 represents the point at which the training data is balanced (240+, 240−). Essentially, the negative examples were removed at 576-example increments.

For over-sampling, the labels on the x-axis are simply the rate at which the positive examples were re-sampled, 100 being the point at which the training data set is balanced (6000+, 6000−). The positive examples were therefore re-sampled at 576-example increments.

It can be seen from Figure … positive examples. That is, the lowest error rate achieved for over-sampling is around the 60 or 70 mark in Figure …


[Figure: Error over all examples versus sampling rate (0 to 100) for downsizing and over-sampling. This graph demonstrates that the optimal level at which a data set should be balanced does not always occur at the same point; to see this, compare this graph with Figure ….]


[Figure: Error over negative examples versus sampling rate (0 to 100) for downsizing and over-sampling.]


The results in Figure …


Figure 7 …


There are competing factors when each balancing technique is used. Achieving a higher a+ comes at the expense of a− (this is a common point in the literature for domains such as text classification).

3.3.3 Test #3: Rule Counts for Balanced Data Sets

Ultimately, the goal of the experiments described in this section is to provide motivation behind the design of a system that combines multiple classifiers that use different sampling techniques. The advantage of combining classifiers that use different sampling techniques only comes if there is a variance in their predictions. Combining classifiers that always make the same predictions is of no value if one hopes that their combination will increase predictive accuracy. Ideally, one would like to combine classifiers that agree on correct predictions, but disagree on incorrect predictions.

Methods that combine classifiers, such as Adaptive-Boosting, attempt to vary learners' predictions by varying the training examples which successive classifiers are presented to learn on. As we saw in Section 2.2.4, Adaptive-Boosting increases the sampling probability of examples that are incorrectly classified by already constructed classifiers. By placing this higher weight on incorrectly classified examples, the induction process at each iteration is biased towards creating a classifier that performs well on previously misclassified examples. This is done in an attempt to create a number of classifiers that can be combined to increase predictive accuracy. In doing this, Adaptive-Boosting ideally diversifies the rule sets of the classifiers.
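The reweighting idea can be illustrated with a standard AdaBoost-style update. This is a sketch of the general scheme, not necessarily the exact formulation referenced in Section 2.2.4:

```python
def adaboost_reweight(weights, correct, error):
    """One Adaptive-Boosting style reweighting step.  Correctly
    classified examples are scaled down by beta = error / (1 - error)
    (beta < 1 whenever error < 0.5), so after renormalising, the
    misclassified examples carry a larger share of the weight."""
    beta = error / (1.0 - error)
    scaled = [w * (beta if ok else 1.0) for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled]

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]   # the last example was misclassified
weights = adaboost_reweight(weights, correct, error=0.25)
# The misclassified example now holds half the total weight (0.5),
# so the next classifier is biased toward getting it right.
```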

Setup

Rules can be described in terms of their complexity. Larger rule sets are considered more complex than smaller rule sets. This experiment was designed to get a feel for the complexities of the rule sets produced by C5.0 when induced on imbalanced data sets that have been balanced by either over-sampling or downsizing. By looking at the complexity of the rule sets created, we can get a feel for the differences between the rule sets created


using each sampling technique. The following algorithm was used to produce the results given below.

Repeat x times
  o Create a training set T(c, 240+, 6000−)
  o Create To(6000+, 6000−) by uniformly re-sampling the positive examples from T and adding the negative examples from T
  o Train C on To
  o Record rule counts Ro+ and Ro− for the positive and negative rule sets
  o Create Td(240+, 240−) by randomly removing 5760 negative examples from T
  o Train C on Td
  o Record rule counts Rd+ and Rd− for the positive and negative rule sets

Average the rule counts over x.

For this test, expressions of sizes c = 4x2, …, 4x10 were tested and averaged over x …


[Table: Positive rule counts for downsizing and over-sampling at complexities 4x6, 4x7, and 4x8; most numeric entries are garbled in this copy (e.g., the 4x7 row reads 4.8, 15.3, 4.3, 7.2).]


Before I begin the discussion of these results, it should be noted that these numbers must only be used to indicate general trends in rule set complexity. When averaged for expressions of complexities 4x6 and greater, the numbers varied considerably. The discussion will be in four parts. It will begin by attempting to explain the factors involved in creating rule sets over imbalanced data sets, and then lead into an attempt to explain the characteristics of rule sets created from downsized data sets, followed by over-sampled rule sets. I will then conclude with a general discussion about some of the characteristics of the artificial domain and how they create the results that have been presented. Throughout this section one should remember that the positive rule set contains the target concept, that is, the underrepresented class.

How does a lack of positive training examples hurt learning?

Kubat et al. [18] give an intuitive explanation of why a lack of positive examples hurts learning. Looking at the decision surface of a two dimensional plane, they explain the behavior of the 1-Nearest Neighbor (1-NN) rule. It is a simple explanation that is generalized as: "…as the number of negative examples in a noisy domain grows (the number of positives being constant), so does the likelihood that the nearest neighbor of any example will be negative." Therefore, as more negative examples are introduced to the data set, the more likely a positive example is to be classified as negative using the 1-NN rule. Of course, as the number of negative examples approaches infinity, the accuracy of a learner that classifies all examples as negative approaches 100% over the negative data and 0% over the positive data. This is unacceptable if one expects to be able to recognize positive examples.
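Kubat et al.'s argument is easy to reproduce numerically. The toy one-dimensional setup below is entirely illustrative (not from the thesis): positives cluster near 0, negatives are scattered uniformly, and as the number of negatives grows with the positives held constant, the 1-NN rule labels fewer positive queries correctly.

```python
import random

def nearest_neighbor_label(x, data):
    """Classify x with the 1-NN rule over one-dimensional labeled points."""
    nearest = min(data, key=lambda point: abs(point[0] - x))
    return nearest[1]

def positive_hit_rate(num_neg, seed=1):
    """Fraction of positive-class queries that 1-NN still labels positive,
    given 10 positives clustered near 0 and num_neg uniform negatives."""
    rng = random.Random(seed)
    positives = [(rng.gauss(0.0, 1.0), 1) for _ in range(10)]
    negatives = [(rng.uniform(-10.0, 10.0), 0) for _ in range(num_neg)]
    queries = [rng.gauss(0.0, 1.0) for _ in range(200)]
    data = positives + negatives
    return sum(nearest_neighbor_label(q, data) for q in queries) / 200

for num_neg in (10, 100, 1000):
    # the hit rate on positive queries falls as the negatives multiply
    print(num_neg, positive_hit_rate(num_neg))
```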

They then extend the argument to decision trees, drawing a connection to the common problem of overfitting. Each leaf of a decision tree represents a decision as being positive or negative. In a noisy training set that is imbalanced in terms of the number of negative examples, it is stated that an induced decision tree will be large enough to create regions arbitrarily small enough to partition the positive regions. That is, the decision tree will have rules complex enough to cover very small regions of the decision surface. This is a result of

    rules comple4 enough to co-er -ery small regions of the decision surface9 This is a result of

    ?;

  • 8/12/2019 Andrew Thesis

    73/125

a classifier being induced to partition positive regions of the decision surface small enough to contain only positive examples. If there are many negative examples nearby, the partitions will be made very small to exclude them from the positive regions. In this way, the tree overfits the data with a similar effect as the 1-NN rule.

Many approaches have been developed to avoid overfitting data, the most successful being post-pruning. Kubat et al. [18], however, state that this does not address the main problem. If a region in an imbalanced data set by definition contains many more negative examples than positive examples, post-pruning is very likely to result in all of the pruned branches being classified as negative.

C5.0 and Rule Sets

C5.0 attempts to partition data sets into regions that contain only positive examples and regions that contain only negative examples. It does this by attempting to find features in the data that are good to partition the training data around (i.e., that have a high information gain). One can look at the partitions it creates by analyzing the rules that are generated, which create the boundaries. Each rule generated creates a partition in the data. Rules can appear to overlap, but when viewed as partitions in an entire set of rules, the partitions created in the data by the rule sets do not overlap. Viewed as an entire set of rules, the partitions in the data can be viewed as having highly irregular shapes. This is due to the fact that C5.0 assigns a confidence level to each rule. If a region of space is overlapped by multiple rules, the confidence level for each rule class that covers the space is summed. The class with the highest summed confidence level is determined to be the correct class. The confidence level given to each rule can be viewed as being the number of examples the rule covers correctly over the training data. Therefore, rule sets that contain higher numbers of rules are generally less confident in their estimated accuracy, because each rule covers fewer examples. Figure … illustrates this.


[Figure: two overlapping rules, Rule 1 and Rule 2. This figure demonstrates how C5.0 adds rules to create complex decision surfaces. It is done by summing the confidence level of rules that cover overlapping regions. A region covered by more than one rule is assigned the class with the highest summed confidence level of all the rules that cover it. Here we assume Rule 1 has a higher confidence level than Rule 2.]

Downsizing …


Over-sampling has different effects than downsizing. One obvious difference is the complexity of the rule sets indicating negative partitions. Rule sets that classify negative examples when over-sampling is used are much larger than those created using downsizing. This is because there is still a large number of negative examples in the data set, resulting in a large number of rules created to classify them.

The rule sets created for the negative examples are given much less confidence than those created when downsizing is used. This effect occurs due to the fact that the learning algorithm attempts to partition the data using features contained in the negative examples. Because there is no target concept contained in the negative examples12 (i.e., no features to indicate an example to be negative), the learning algorithm is faced with the dubious task, in this domain, of attempting to find features that do not exist except by mere chance.

Over-sampling the positive class can be viewed as adding weight to the examples that are re-sampled. Using an information gain heuristic when searching through the hypothesis space, features which partition more examples correctly are favored over those that do not. Multiplying the number of examples a feature will classify correctly when found gives the feature weight. Over-sampling the positive examples in the training data therefore has the effect of giving weight to features contained in the target concept, but it also adds weight to random features which occur in the data that is being over-sampled. The effect of over-sampling therefore has two competing factors:

One that adds weight to features containing the target concept.
One that adds weight to features not containing the target concept.

The effect of features not relevant to the target concept being given a disproportionate weight can be seen for expressions of complexity 4x8 and 4x10. This can be seen in the lower right hand corner of Table …


… sparse compared to the number of positive examples. When the positive data is over-sampled, irrelevant features are given enough weight relative to the features containing the target concept; as a result, the learning algorithm severely overfits the training data by creating garbage rules that partition the data on features not containing the target concept, but that appear in the positive examples.

3.4 Characteristics of the Domain and How They Affect the Results

The characteristics of the artificial domain greatly affect the way in which rule sets are created. The major determining factor in the creation of the rule sets is the fact that the target concept is hidden in the underrepresented class and that the negative examples in the domain have no relevant features. That is, the underrepresented class contains the target concept and the overrepresented class contains everything else. In fact, if over-sampling is used to balance the data sets, expressions of complexity 4x2 to 4x6 could still, on average, attain 100% accuracy on the testing set if only the positive rule sets were used to classify examples, with a default negative rule. In this respect, the artificial domain can be viewed as lending itself to being