
Transcript

  • Feature Selection and Weighting using Genetic Algorithm for Off-line Character Recognition Systems

    Presented by Faten Hussein, The University of British Columbia, Department of Electrical & Computer Engineering

  • Outline: Introduction & Problem Definition; Motivation & Objectives; System Overview; Results; Conclusions

  • Introduction: applications of character recognition include address readers, bank cheque readers, reading data entered in forms (e.g., tax forms), and detecting forged signatures.

    Off-line character recognition system pipeline: text document → Scanning → Pre-Processing → Feature Extraction → Classification → Post-Processing → classified text.

  • Introduction: for a typical handwritten recognition task, characters (symbols) vary widely in shape and size. Different writers have different writing styles, and even the same person's style varies; thus an effectively unlimited number of variations exists for a single character.

  • Introduction: variations in handwritten digits extracted from zip codes illustrate this diversity. To overcome it, a large number of features must be extracted. Examples of the features we used are moment invariants, number of loops (L), number of end points (E), centroid, area, circularity and so on. (The slide shows sample digits annotated L=2, E=0; L=1, E=1; L=0, E=3.)
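    As an illustration of the kinds of features listed above, here is a minimal sketch of how Hu moment invariants, centroid, area and circularity could be computed for a binarized character image. It uses OpenCV purely for illustration; the slides do not show the actual feature-extraction code, so the function name and the library choice are assumptions.

```python
import cv2
import numpy as np

def shape_features(binary_img: np.ndarray) -> np.ndarray:
    """Compute a few of the shape features mentioned on the slide
    (Hu moment invariants, centroid, area, circularity) for a
    binarized uint8 character image. Illustrative sketch only."""
    m = cv2.moments(binary_img, binaryImage=True)
    hu = cv2.HuMoments(m).flatten()              # the 7 Hu moment invariants
    area = m["m00"]
    cx, cy = (m["m10"] / area, m["m01"] / area) if area > 0 else (0.0, 0.0)

    contours, _ = cv2.findContours(binary_img, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    perimeter = max((cv2.arcLength(c, True) for c in contours), default=0.0)
    # Circularity: 4*pi*area / perimeter^2 (1.0 for a perfect circle).
    circularity = 4 * np.pi * area / perimeter ** 2 if perimeter > 0 else 0.0

    return np.concatenate([hu, [cx, cy, area, circularity]])
```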

  • Problem: a dilemma. To accommodate variations in symbols, we add more features in the hope of increasing classification accuracy, but adding features increases the problem size and the run time and memory needed for classification. Choosing features for a character recognition system is an ad-hoc process that depends on experience and trial and error, and it might add redundant or irrelevant features that actually decrease accuracy.

  • Feature Selection. Solution: feature selection. Definition: select a relevant subset of features from a larger set of features while maintaining or enhancing accuracy. Advantages: it removes irrelevant and redundant features (a total of 40 features was reduced to 16; of the 7 Hu moments only the first three were kept; area was removed as redundant given circularity); it maintains or enhances classification accuracy (a 70% recognition rate using 40 features rose to 75% after FS using only 16 features); and it gives faster classification with lower memory requirements.

  • Feature Selection/Weighting: assigning weights (binary or real-valued) to features is an optimization problem; it needs a search algorithm to find the set of weights that yields the best classification accuracy. The genetic algorithm is a good search method for such optimization problems.

    Feature Selection (FS) vs. Feature Weighting (FW):
    FS (special case): binary weights (0 for irrelevant/redundant features, 1 for relevant ones); with n features there are 2^n feature subset combinations.
    FW (general case): real-valued weights (variable weights depending on feature relevance); with l candidate weight values there are l^n weight combinations.
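    To make the FS/FW distinction concrete, the sketch below shows the weighted Euclidean distance a feature-weighted nearest-neighbour classifier would use, together with the two search-space sizes for the 40-feature case mentioned earlier. These are the standard combinatorial counts, not figures taken from the slides.

```python
import numpy as np

def weighted_distance(x, y, w):
    """Weighted Euclidean distance: a feature with w_i = 0 is dropped,
    a relevant one gets w_i > 0.  FS is the special case w_i in {0, 1}."""
    x, y, w = map(np.asarray, (x, y, w))
    return np.sqrt(np.sum(w * (x - y) ** 2))

n, l = 40, 33                 # 40 features, 33 candidate weight values
print(2 ** n)                 # FS search space: ~1.1e12 feature subsets
print(l ** n)                 # FW search space: ~5.5e60 weight vectors
```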

  • Genetic Feature Selection/Weighting: why use a GA for FS/FW? It has proven to be a powerful search method for the FS problem; it does not require derivative information or any extra knowledge, only the objective function (the classifier's error rate) to evaluate the quality of a feature subset; it searches a population of solutions in parallel, so it can provide a number of potential solutions rather than only one; and it is resistant to becoming trapped in local minima.
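    A minimal sketch of the kind of GA loop the slide alludes to, evolving one binary gene per feature. The population size, operators and rates below are illustrative assumptions, not the settings used in the thesis.

```python
import random

def genetic_feature_selection(n_features, fitness, pop_size=30,
                              generations=50, p_cross=0.8, p_mut=0.01):
    """Evolve binary feature masks; `fitness` scores a mask (higher is better),
    e.g. the classification accuracy of a wrapper classifier."""
    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        next_pop = scored[:2]                              # elitism: keep the best two
        while len(next_pop) < pop_size:
            p1, p2 = random.sample(scored[:pop_size // 2], 2)  # pick parents from the better half
            if random.random() < p_cross:                  # one-point crossover
                cut = random.randrange(1, n_features)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            child = [1 - g if random.random() < p_mut else g for g in child]  # bit-flip mutation
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)
```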

  • Objectives & Motivations: build a genetic feature selection/weighting system, apply it to a character recognition problem, and investigate the following issues:
    Study the effect of varying weight values on the number of selected features (FS often eliminates more features than FW, but by how much?).
    Compare the performance of genetic feature selection and weighting in the presence of irrelevant and redundant features (not studied before).
    Compare the performance of genetic feature selection and weighting for regular cases (test the hypothesis that FW should perform at least as well as FS).
    Evaluate the better method (GFS or GFW) in terms of optimality and time complexity (study the feasibility of genetic search with respect to optimality and time).

  • Methodology: the recognition problem is to classify isolated handwritten digits. We used a k-nearest-neighbour classifier (k = 1) and a genetic algorithm as the search method, and applied genetic feature selection and weighting in the wrapper approach (i.e., the fitness function is the classifier's error rate). The program run has two phases: a training/testing phase and a validation phase.
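    In the wrapper approach described above, the fitness of a candidate weight vector is simply the 1-NN accuracy obtained after scaling the features by those weights. A rough sketch using scikit-learn follows; the helper name and the split into training and held-out folds are assumptions, not the thesis code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def wrapper_fitness(weights, X_train, y_train, X_test, y_test):
    """Wrapper fitness of a weight vector: scale each feature by its weight
    (0 drops it entirely) and return the 1-NN accuracy on the held-out fold."""
    w = np.asarray(weights, dtype=float)
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train * w, y_train)
    return knn.score(X_test * w, y_test)
```

    A binary mask produced by the GA sketch above could be scored, for example, with `fitness=lambda mask: wrapper_fitness(mask, X_tr, y_tr, X_te, y_te)`.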

  • System Overview: Pre-Processing Module → all N extracted features → Feature Selection/Weighting Module (GA), which passes each candidate feature subset to the Evaluation Module (KNN classifier) and receives back an assessment of that subset; the loop ends with the best feature subset of M features (M ≤ N).
  • Results (Comparison 1): as the number of weight values increases, the probability of a feature having weight value 0 (POZ) decreases, so the number of eliminated features decreases. GFS eliminates more features (and thus selects fewer) than GFW because of its smaller number of weight values (0/1), and it does so without compromising classification accuracy.
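    If the GA initializes each weight uniformly over the l candidate values, the probability that a feature starts with weight zero is POZ = 1/l, which is why elimination drops as l grows. A quick check of the values follows; the uniform-initialization assumption is mine, not stated on the slide.

```python
# POZ = 1 / (number of candidate weight values) under uniform initialization.
for levels in (2, 3, 6, 11, 21, 41, 81, 161):
    print(f"{levels:4d} weight values -> POZ = {1 / levels:.3f}")
# 2 -> 0.500 (binary FS), 3 -> 0.333, ..., 161 -> 0.006
```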

    Effect of varying weight values on the number of selected features

    [Chart: number of zero-weight (eliminated) features vs. number of weight values. The eliminated-feature count falls from 27 with 2 weight values (binary FS) towards 0 as the number of weight values grows to 161.]

  • Results (Comparison 2): performance of genetic feature selection/weighting in the presence of irrelevant features. The performance of the 1-NN classifier degrades rapidly as the number of irrelevant features increases. As the number of irrelevant features grows, FS outperforms all FW settings in both classification accuracy and elimination of features.

    [Chart: classification rate (%) vs. number of irrelevant features added, for the 1-NN classifier with all features and for FS, 3FW, 5FW, 17FW and 33FW.]

    [Chart: number of eliminated features vs. number of irrelevant features added, for FS, 3FW, 5FW and 33FW.]

  • Results (Comparison 3): performance of genetic feature selection/weighting in the presence of redundant features. The classification accuracy of 1-NN does not suffer much from added redundant features, but they do increase the problem size. As the number of redundant features increases, FS achieves slightly better classification accuracy than all FW settings and significantly outperforms FW in elimination of features.

    [Chart: classification rate (%) vs. number of redundant features added, for the 1-NN classifier with all features and for FS, 3FW, 5FW, 17FW and 33FW.]

    [Chart: number of eliminated features vs. number of redundant features added, for FS, 3FW, 5FW and 33FW.]

  • Results (Comparison 4): performance of genetic feature selection/weighting for regular cases (not necessarily containing irrelevant or redundant features). FW achieves better training accuracies than FS, but FS generalizes better (higher accuracy on unseen validation samples); FW over-fits the training samples.

    Training and validation classification rates (%):

    Method       1-NN    FS      3FW     5FW     17FW    33FW
    Training     67.42   67.52   67.96   68.20   68.80   68.82
    Validation   67.76   67.76   67.36   66.28   66.24   66.24

  • Results (Evaluation 1): convergence of GFS to an optimal or near-optimal set of features. GFS was able to return the optimal or near-optimal values reached by exhaustive search; the worst average value obtained by GFS was less than 1% away from the optimal value.

    Number of features | Best exhaustive (class. rate %) | Best GA (class. rate %) | Average GA (5 runs)
    8  | 74   | 74   | 74
    10 | 75.2 | 75.2 | 75.2
    12 | 77.2 | 77.2 | 77.04
    14 | 79   | 79   | 78.56
    16 | 79.2 | 79   | 78.28
    18 | 79.4 | 79.4 | 78.92
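    The exhaustive baseline in this table amounts to scoring every one of the 2^n feature masks, which is only feasible for the small feature counts shown (8 to 18). A sketch of that baseline follows; `fitness` could be the hypothetical wrapper_fitness from the methodology sketch above.

```python
from itertools import product

def exhaustive_best(n_features, fitness):
    """Score every binary feature mask (2**n of them) and return the best one.
    Only feasible for small n, such as the 8-18 features in the table above."""
    best_mask, best_score = None, float("-inf")
    for mask in product((0, 1), repeat=n_features):
        score = fitness(list(mask))
        if score > best_score:
            best_mask, best_score = list(mask), score
    return best_mask, best_score
```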

  • Results (Evaluation 2): convergence of GFS to an optimal or near-optimal set of features within an acceptable number of generations.

    The time needed by GFS is bounded below by a linear-fit curve and above by an exponential-fit curve; using GFS for high-dimensional problems requires parallel processing.
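    The two bounding curves can be obtained by fitting the measured (number of features, generations) points linearly and exponentially and then extrapolating. A sketch with numpy follows; the data points are placeholders loosely based on the small-feature measurements, not the exact thesis figures.

```python
import numpy as np

# (number of features, generations to reach the exhaustive-search optimum);
# placeholder values for illustration only.
features = np.array([8, 10, 12, 14, 16, 18])
generations = np.array([5, 5, 10, 10, 15, 20])

lin = np.polyfit(features, generations, 1)           # linear fit: g ~ a*n + b
a, b = np.polyfit(features, np.log(generations), 1)  # exponential fit: log g ~ a*n + b

n = 60
print(np.polyval(lin, n))          # optimistic (linear) extrapolation
print(np.exp(b) * np.exp(a * n))   # pessimistic (exponential) extrapolation
```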

    [Chart: number of generations needed by GFS vs. number of features, showing the measured values together with the fitted linear (lower-bound) and exponential (upper-bound) extrapolation curves.]

  • Conclusions: GFS is superior to GFW in feature reduction, without compromising classification accuracy. In the presence of irrelevant features, GFS is better than GFW in both feature reduction and classification accuracy. In the presence of redundant features, GFS is also preferred over GFW because of its greater ability to eliminate features. For regular databases, it is advisable to use at most 2 or 3 weight values to avoid over-fitting. GFS is a reliable method for finding optimal or near-optimal solutions, but it needs parallel processing for large problem sizes.

  • Questions?

    Character recognition is the process of converting scanned images of machine-printed or handwritten text into a computer-processed format such as ASCII. Feature extraction (FE) extracts from the raw data the information that is most relevant to classification. Classification (C) maps these features into classes. Post-processing (PostP) enhances the classification, using a dictionary or user input.

    In our work we were interested in classifying handwritten digits, using several features such as the ones listed above. However, because of variations in shape, size and style, many features must be added. Irrelevant features have no effect on the target concept at all, while redundant features add nothing new to it. A redundant feature is one whose value can be derived from the values of other features, for example when it is the average, the square, or a multiple of other feature values. A solution to this dilemma is to add an FS module to the character recognition system. Searching the space of 2^n subsets (or l^n weight assignments) is impossible even for a moderate n, so we need a search algorithm to find the set of weights that gives the best classification rate; some search algorithms need derivative information to find a maximum or minimum. We also ask what the relation is between the number of eliminated features and the number of weight values, and how the methods compare for regular cases (not necessarily having irrelevant or redundant features).

    Training/testing phase: guides the GA search. Validation phase: assesses the quality of the generated solution on unseen data. Pre-processing: noise removal, image resizing, thinning.

    As mentioned before, irrelevant features have no effect on the target concept at all; they lower the classification accuracy and increase the problem dimensionality. Redundant features add nothing new to the target concept; they increase the problem size while contributing nothing to classification. The trend line shows that the greater the number of weight values, the higher the training accuracy and the lower the validation accuracy: increasing the number of weights beyond 2 or 3 increases the chance of over-fitting. In five of six cases GFS returned the optimal value reached by exhaustive search, and in the sixth it returned a near-optimal solution. The run time of the GA depends on the number of generations, the population size, the number of features and the number of training samples. We investigated the relationship between the number of features and the number of generations needed to reach the optimal values (keeping the other factors unchanged); as the number of features increases, the number of generations needed increases as well. We used extrapolation because it is computationally impossible to run an exhaustive search for a large number of features. The number of generations required for 60 features lies somewhere between 89 and 10585 (midpoint about 5337), which will not be computationally feasible.