
18-752 Project

Letter Recognition

Andrew Fox, Fridtjof Melle

May 5, 2014

1 Introduction

For our 18-752 Estimation, Detection, Identification course project, we conducted a numerical study testing various statistical analysis and machine learning algorithms to classify typed letters based on predetermined features. Our goal, in the broadest sense, was to classify and discriminate images by letter, comparing the performance of various modeling algorithms while assessing their strengths and weaknesses. We then used these results to achieve the highest possible performance with a hybrid combination of these algorithms, and further assessed their influence and individual performance. Our motivation for analyzing this data set is that letter and word recognition remains an active field of study, and is also a task in which computers significantly lag behind humans in performance.

2 Data Set

The data set used for this study was created by David J. Slate [1] with the objective of identifying each image in a processed array of distorted letter images as one of the 26 capital letters in the English alphabet.

For the purpose of generating the data set, an algorithm was created whose output was an English capital letter drawn uniformly from the alphabet and rendered in one of 20 different fonts selected at random. The fonts included five stroke styles (Simplex, Duplex, Triplex, Complex, and Gothic) and six letter styles (Block, Script, Italic, English, Italian, and German). To further complicate the generated images, each letter underwent a random distortion process including vertical and horizontal warping, linear magnification, and changes to the aspect ratio. Examples of the resulting images from the data set are shown in Figure 1.

The algorithm produced 20,000 unique letter images. Each image was converted into 16 primitive numerical attributes, each valued as a normalized 4-bit number with an integer value ranging from 0 through 15. The attributes used to construct the 16 features are detailed in Table 1. An example of the final data points extracted from the data set is displayed in Figure 2.

For the purpose of this study we divided the data set of 20,000 letter images into two sets. To develop the different models we allocated the first 16,000 letters as training data. The training data was further divided into 13,600 training images and 2,400 validation images for algorithms that required a validation process. The remaining 4,000 letter images were used for testing. We further computed a z-score of each feature and replaced the letter labels with corresponding integers such that,

A = 1, B = 2, ..., Z = 26.    (1)

This label replacement was done for programming convenience. It should be noted that numerical closeness between two labels does not imply that the corresponding letters are similar to each other. This should be kept in mind when using models that rely on regression analysis for classification: such algorithms may favor numerically close letters, even though there is no real basis for that comparison. Efforts were made in this project to use the algorithms in a way that avoids this issue.
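As a concrete illustration of this preprocessing, the sketch below shows one way the split, label mapping, and z-scoring could be implemented in Python. The file name letter-recognition.data, the comma-separated layout (label followed by the 16 attributes, as in Figure 2), and the choice to compute the z-score statistics on the training portion only are assumptions, not details taken from the report.

```python
import numpy as np

# Load the raw data: each row is "LETTER,f1,...,f16" (layout assumed from Figure 2).
raw = np.genfromtxt("letter-recognition.data", delimiter=",", dtype=str)
labels = np.array([ord(c) - ord("A") + 1 for c in raw[:, 0]])   # A = 1, ..., Z = 26
features = raw[:, 1:].astype(float)

# First 16,000 images for training (13,600 train + 2,400 validation), last 4,000 for test.
X_train, y_train = features[:13600], labels[:13600]
X_val,   y_val   = features[13600:16000], labels[13600:16000]
X_test,  y_test  = features[16000:], labels[16000:]

# z-score each feature; computing the statistics on the training portion only is an
# assumption here, since the report does not state which split they were computed on.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train, X_val, X_test = (X_train - mu) / sigma, (X_val - mu) / sigma, (X_test - mu) / sigma
```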


Figure 1: Example letter images generated for the data set

Table 1: Attribute information for each feature representation.

Feature 1    Horizontal position of minimum-size letter-enclosing box
Feature 2    Vertical position of minimum-size letter-enclosing box
Feature 3    Width of minimum-size letter-enclosing box
Feature 4    Height of minimum-size letter-enclosing box
Feature 5    Total number of pixels making up letter
Feature 6    Mean horizontal position of pixels relative to center of box
Feature 7    Mean vertical position of pixels relative to center of box
Feature 8    Mean-squared horizontal position of pixels relative to center of box
Feature 9    Mean-squared vertical position of pixels relative to center of box
Feature 10   Mean correlation of horizontal and vertical means
Feature 11   Mean vertical position correlation with horizontal variance
Feature 12   Mean horizontal position correlation with vertical variance
Feature 13   Mean number of horizontal letter edge pixels measuring left to right
Feature 14   Sum of respective vertical positions of edge pixels measured
Feature 15   Mean number of vertical letter edge pixels measuring bottom to top
Feature 16   Sum of respective horizontal positions of edge pixels measured


X,4,9,5,6,5,7,6,3,5,6,6,9,2,8,8,8

H,3,3,4,1,2,8,7,5,6,7,6,8,5,8,3,7

L,2,3,2,4,1,0,1,5,6,0,0,6,0,8,0,8

H,3,5,5,4,3,7,8,3,6,10,6,8,3,8,3,8

E,2,3,3,2,2,7,7,5,7,7,6,8,2,8,5,10

Y,5,10,6,7,6,9,6,6,4,7,8,7,6,9,8,3

H,8,12,8,6,4,9,8,4,5,8,4,5,6,9,5,9

Q,5,10,5,5,4,9,6,5,6,10,6,7,4,8,9,9

M,6,7,9,5,7,4,7,3,5,10,10,11,8,6,3,7

E,4,8,5,6,4,7,7,4,8,11,8,9,2,9,5,7

N,6,11,8,8,9,5,8,3,4,8,8,9,7,9,5,4

Y,8,10,8,7,4,3,10,3,7,11,12,6,1,11,3,5

W,4,8,5,6,3,6,8,4,1,7,8,8,8,9,0,8

O,6,7,8,6,6,6,6,5,6,8,5,8,3,6,5,6

N,4,4,4,6,2,7,7,14,2,5,6,8,6,8,0,8

H,4,8,5,6,5,7,10,8,5,8,5,6,3,6,7,11

O,4,7,5,5,3,8,7,8,5,10,6,8,3,8,3,8

N,4,8,5,6,4,7,7,9,4,6,4,6,3,7,3,8

H,4,9,5,6,2,7,6,15,1,7,7,8,3,8,0,8

Figure 2: Example Data extracted from data set

3 Methodology

The following subsections summarize the different algorithms we used for letter classification and how they were applied to our problem. Generally, each model had 16 input dimensions (the features) and an output representing either the chosen letter or a decision in a comparison between letters. Our understanding of these algorithms was assisted by [2].

3.1 k-Nearest Neighbors

The k-Nearest Neighbors algorithm is a discriminative classifier that uses a deterministic association to distinguish the data points. The output label for a test letter is a class membership determined by a majority vote of its k closest neighbors in the feature space by Euclidean distance. Performance is optimized by determining the value of k that performs best on the validation data. A k that is too large includes too much distant and irrelevant data, while a k that is too small risks misclassifications due to noise.

The graph of validation performance with respect to different k values is shown in Figure 3. The best result was achieved by letting only one nearest neighbor vote, k = 1. This result is in line with the fact that the features are normalized to 4-bit values, which limits the number of possible points in the feature space and causes much of the data set to overlap. Classifying a given test image based only on the letters sharing the same feature values may therefore be sufficient.
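A minimal sketch of this k-determination loop, using scikit-learn's KNeighborsClassifier (which defaults to Euclidean distance) and the hypothetical X_train, y_train, X_val, y_val, X_test, y_test variables from the data-preparation sketch in Section 2; the authors' own implementation may differ.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

best_k, best_acc = 1, -1.0
for k in range(1, 11):                              # candidate k values, as in Figure 3
    knn = KNeighborsClassifier(n_neighbors=k)       # Euclidean distance by default
    knn.fit(X_train, y_train)
    acc = knn.score(X_val, y_val)                   # validation accuracy for this k
    if acc > best_acc:
        best_k, best_acc = k, acc

# Refit on all 16,000 training letters with the chosen k and evaluate on the 4,000 test letters.
final_knn = KNeighborsClassifier(n_neighbors=best_k)
final_knn.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
test_accuracy = final_knn.score(X_test, y_test)
```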

To visualize the classification process we extracted two example letters, T and U, from the training data set along with two of their features (6 and 7), shown in Figure 4. To this data we added two test letters of each class and classified them using a higher number of neighbors to illustrate the voting process. As our k-determination process concluded that k = 1 is optimal, the actual classifier only takes into account the single nearest training point. It should be noted that this visualized classification appears to perform relatively poorly; that is because we used only 2 features for simple visualization. The full 16-dimensional feature space with all 26 letters would be impossible to illustrate.

The overall performance of the k-Nearest Neighbors algorithm on the test data was 95.65%. This strong result can be attributed to the coarsely quantized attribute values, which enable our setup of the algorithm to remove much of the noise in the data set.


Figure 3: Determination of the optimal value of k (number of voting nearest neighbors) from validation data performance.


3.2 Naive Bayes

The Naive Bayes Classifier is a simple generative classification algorithm. Using a conditional probability model, it develops a prior probability density function based on labeled training observations following an assumed distribution and classifies using Bayes' Theorem. A key simplifying assumption is that the features are conditionally independent, meaning the covariance matrix is diagonal. Mathematically, this formulation is accomplished by calculating the posterior probability of each letter given the provided attributes, based on the computed prior and the respective likelihood,

p(C | A_1, ..., A_n) = p(C) p(A_1, ..., A_n | C) / p(A_1, ..., A_n).    (2)

Classification is performed by a MAP estimator that selects the class with the maximum posterior probability among all possible classes to determine the predicted letter, C, for a given test point,

C_Predicted = argmax_c  p(C = c) ∏_{i=1}^{n} p(A_i = a_i | C = c).    (3)

For our purposes we assumed that the attributes associated with each class follow a Gaussian distribution, and developed a Naive Gaussian Classifier. The training data was first segmented by class; then the mean and variance of the prior distribution for each of the 16 attributes was computed from the corresponding observations under the Gaussian assumption. To visualize the classification model we developed an example classifier based on only two example letters, T and U, and only two attributes, 6 and 7, similar to the k-NN visualization example. The resulting two prior distributions were then plotted over the remaining test data for the same letters, as shown in Figure 5.
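A compact sketch of this training procedure and of the MAP rule in Equation (3), fitting one diagonal Gaussian per letter; the variable names follow the earlier hypothetical sketches, and the small variance floor is an addition for numerical stability not mentioned in the report.

```python
import numpy as np

classes = np.unique(y_train)
priors, means, variances = {}, {}, {}
for c in classes:
    Xc = X_train[y_train == c]
    priors[c] = len(Xc) / len(X_train)        # p(C = c) from class frequencies
    means[c] = Xc.mean(axis=0)                # per-attribute mean for this letter
    variances[c] = Xc.var(axis=0) + 1e-9      # per-attribute variance (diagonal covariance)

def naive_bayes_predict(x):
    # MAP rule of Equation (3), computed in log space for numerical stability.
    def log_posterior(c):
        return (np.log(priors[c])
                - 0.5 * np.sum(np.log(2 * np.pi * variances[c])
                               + (x - means[c]) ** 2 / variances[c]))
    return max(classes, key=log_posterior)
```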

The Naive Gaussian classifier is characterized by its independence assumption, which implies no correlation between the attributes. Each class therefore only carries an n × n diagonal matrix containing the attribute variances, where n = 16 in our study. Figure 5 displays how this prevents the priors from fully adapting to the letters' distributions, which is further reflected in the overall performance.


Figure 4: Visualized k-Nearest Neighbors for letters T and U considering features 6 and 7 with a higher k.

The overall performance of the Naive Gaussian generative model was 62.45%. The result reflects its naive assumptions, and we did not expect the model to perform very well. Being computationally fast and simple, the Naive Bayes classifier serves as a good indicator for more complex algorithms that follow similar principles. The results were good enough to indicate relative success with this type of approach and to predict decent performance from the Full Gaussian Mixture Model.

3.3 Gaussian Mixture Model

The Full Gaussian Mixture Model represents a more complex generative classification algorithm. Based on the Bayesian Gaussian Mixture Model, the method fits mean vectors and covariance matrices of a multivariate normal distribution whose dimension equals the number of components. The goal of the process is to develop a distribution that represents the behavior of the data and can predict the class of new observations.

The resulting prior probability thereby takes all the components and their covariances into account. Letting K be the number of letters, and denoting by µ_i and Σ_i, i = 1..K, the respective final mean vectors and covariance matrices for letter i, we can express the prior probability as,

p(θ) = ∑_{i=1}^{K} N(µ_i, Σ_i).    (4)

This result is reached through a Bayesian estimation process, iterating through the training data and updating the model parameters using the Expectation Maximization algorithm. An initial or current a priori distribution p(θ) is multiplied with the known conditional distribution p(x|θ) of the data, providing a posterior distribution p(θ|x) which is again Gaussian and takes the form,

p(θ|x) = ∑_{i=1}^{K} N(µ̃_i, Σ̃_i).    (5)


Figure 5: Example Naive Gaussian Output for letters T and U considering only features 6 and 7.

This distribution is then optimized through the EM algorithm with a specified tolerance, by iteratively updating the parameters µ_i and Σ_i, i = 1..K.

In our study we used the fully labeled training data set to train 26 separate Gaussian distributions, one for each letter. The individual letter mean vectors and covariance matrices were subsequently fitted under optimal conditions. To compare the performance of the resulting class prior distributions against the Naive Bayes classifier, we created an equivalent example classifier based on only the same two letters, T and U, with only two attributes, 6 and 7. The two resulting prior distributions are plotted over the relevant test data in Figure 6.

The final model classifies test data by the same principles as the Naive Bayes Classifier, through a MAP estimator. Given the provided attributes for each test point, the posterior probability is computed for every class prior with the corresponding conditional likelihood. The predicted letter is given by the class with the highest posterior probability.

The overall performance of the Full Gaussian Mixture Model with 26 classes was 96.43%. Contrary to the Naive Gaussian Classifier, the Full Gaussian Mixture Model takes correlation between the attributes into account, providing a full n × n covariance matrix for each class, where n = 16 for our study. As observed in the example in Figure 6, this enables the fitted prior distributions to adapt to the test data significantly better, as there evidently is correlation between the features.
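The full-covariance counterpart differs from the naive version only in the per-class fit and likelihood. Below is a sketch, again using the hypothetical split variables from the earlier sketches; the small regularization term added to each covariance matrix is an assumption for numerical stability, not something stated in the report.

```python
import numpy as np
from scipy.stats import multivariate_normal

classes = np.unique(y_train)
class_models = {}
for c in classes:
    Xc = X_train[y_train == c]
    cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(Xc.shape[1])  # full 16 x 16 covariance
    class_models[c] = (len(Xc) / len(X_train),                   # prior p(C = c)
                       multivariate_normal(mean=Xc.mean(axis=0), cov=cov))

def full_gaussian_predict(x):
    # MAP rule: maximize log prior + log multivariate Gaussian likelihood over the 26 letters.
    return max(classes,
               key=lambda c: np.log(class_models[c][0]) + class_models[c][1].logpdf(x))
```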

3.4 Logistic Regression

Logistic regression is a discriminative classification algorithm that attempts to learn the boundary between two classes of data. For binary classifiers there are only two possible outputs (here represented by 0 and 1), and logistic regression attempts to fit a sigmoidal function to this binary output data as a function of the input features. The sigmoidal function, g(t), is given by

g(t) = 1 / (1 + e^{−t}).    (6)


Figure 6: Example Full Gaussian Mixture Output for letters T and U considering only features 6 and 7.

An example of the sigmoidal function is shown in Figure 7. Since the range of the sigmoidal function is [0, 1], we can use it to represent the a posteriori probability p(θ|x) for the two classes as,

p(θ = 0 | x) = g(w^T x) = 1 / (1 + e^{−w^T x}),    (7)

p(θ = 1 | x) = 1 − g(w^T x) = e^{−w^T x} / (1 + e^{−w^T x}).    (8)

The mathematical objective of using logistic regression as a binary classifier is to determine the vector w. This is done using the log-likelihood function. Given n labeled data points, {(x_1, θ_1), ..., (x_n, θ_n)}, the log-likelihood function, l(w), is given by,

l(w) = ∑_{i=1}^{n} [ θ_i ln(1 − g(w^T x_i)) + (1 − θ_i) ln(g(w^T x_i)) ]    (9)

     = ∑_{i=1}^{n} [ θ_i w^T x_i − ln(1 + e^{w^T x_i}) ].    (10)

We can differentiate this log-likelihood with respect to w, equate it to zero to find the maximum, and solve for the maximizing value of w using gradient descent.

However, logistic regression is a binary classifier and the Letter Recognition data set has 26 classes. To account for this we create 26 different binary classifiers, each comparing an individual letter (denoted class 1) to all the remaining letters (denoted class 0). The determined letter corresponds to the maximum value among the outputs of the 26 logistic regression functions. An example of the output for two test letters, I and O, is shown in Table 2. The letter I is correctly classified, with the logistic regression outputting a value close to 1 for I and close to 0 for every other letter. For the test letter O, however, the features result in an output close to 0 from every logistic regression function and the letter is misclassified as a D.
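A sketch of this one-vs-rest scheme using scikit-learn's binary LogisticRegression in place of the hand-derived gradient-descent fit described above; each of the 26 models reports its probability for the "this letter" class and the maximum is taken, mirroring Table 2. Variable names are the hypothetical ones from the earlier sketches.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

classes = np.unique(y_train)
models = {}
for c in classes:
    # Class 1 = this letter, class 0 = all remaining letters.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, (y_train == c).astype(int))
    models[c] = clf

def logreg_predict(x):
    # Probability that x belongs to each letter's "1" class; pick the largest.
    scores = {c: models[c].predict_proba(x.reshape(1, -1))[0, 1] for c in classes}
    return max(scores, key=scores.get)
```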

The overall classification performance of the logistic regression algorithm on the test data was 71.53%. One weakness of the algorithm is that the 26 models are derived independently, so their output values are not directly comparable. Some letters may be farther from the rest of the letters than others, and therefore the boundaries of the different models, and the width of the region over which the resulting sigmoidal function transitions from 0 to 1, will vary. Output values from some of the logistic regression functions may therefore be artificially high or low, which can cause problems when they are compared across the 26 models. Additionally, the logistic regression model is not good at finding multiple boundaries. The sigmoidal function increases from 0 to 1 only once. Therefore, if the class-1 data happens to fall in the middle of the feature range with class-0 data on either side, the sigmoidal function will have difficulty capturing the characteristics of the boundary and classifying the data.


Figure 7: Graph of the sigmoidal function, g(t) = 1 / (1 + e^{−t}).


3.5 Decision Tree

Decision trees are an extremely intuitive way to make classification decisions, as they are essentially a series of if-then-else questions posed on the feature set until a classification determination is made. An example of a three-class decision tree with two features is shown in Figure 8. Classification decisions are made by beginning at the root node of the tree. One proceeds along the branches of the tree by making a binary decision on a single feature using a threshold at each node. For example, at the root node, if x1 > a then the tree proceeds to the right to make the classification determination, and if x1 ≤ a then the tree proceeds to the left.

These binary decisions effectively split the feature space into n-dimensional hypercubes, where n is the dimension of the feature space (16 in this study). This means that decision trees are effective at making classification decisions when the data can be split along linear boundaries in a single variable, and also when the output falls into multiple isolated clusters in the feature space. Intuitively, this seems well suited to the type of data given in this project: the features are 4-bit integers, not continuous values, which allows for very natural boundary lines between different feature values.

In order to construct the decision tree one must determine which feature to split the data on at each node. This is done by choosing the feature that maximizes the information gain given the remaining data set. We first introduce the concept of entropy, H(X), of a random variable, which is effectively a measure of the amount of randomness of that random variable,

H(X) = −∑_i P(x_i) log_2 P(x_i),    (11)

where the x_i are the possible outputs of the distribution. The conditional entropy is given by,

H(θ|X) = −∑_i ∑_j P(x_i, θ_j) log_2 P(θ_j | x_i).    (12)

We define the information gain I(θ,X) as,

I(θ,X) = H(θ)−H(θ|X). (13)
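A small sketch of Equations (11)-(13) for a single threshold split on one feature, which is the quantity the greedy node construction described below maximizes; the function and variable names are illustrative only.

```python
import numpy as np

def entropy(labels):
    # H(theta) = -sum_j p_j log2(p_j) over the letters present at this node, Equation (11).
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels, threshold):
    # I(theta, X) = H(theta) - H(theta | X) for the binary split "feature <= threshold".
    left, right = labels[feature_values <= threshold], labels[feature_values > threshold]
    h_conditional = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - h_conditional
```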


Table 2: Example Logistic Regression Output for two test letters, I and O.

Letter    I        O
A         0.002    0.000
B         0.000    0.003
C         0.000    0.000
D         0.000    0.078
E         0.012    0.000
F         0.004    0.000
G         0.001    0.027
H         0.000    0.011
I         0.970    0.000
J         0.005    0.000
K         0.004    0.000
L         0.115    0.001
M         0.000    0.000
N         0.001    0.023
O         0.002    0.067
P         0.000    0.000
Q         0.002    0.010
R         0.002    0.002
S         0.003    0.012
T         0.003    0.000
U         0.000    0.002
V         0.000    0.000
W         0.000    0.000
X         0.141    0.002
Y         0.003    0.000
Z         0.002    0.000


Figure 8: Example three-class decision tree using two features.


Figure 9: Example decision tree for letters T and U using features 6 (labeled x1) and 7 (labeled x2).

The decision tree is built in a greedy manner, constructing at each stage the node that maximizes the information gain. Intuitively, the information gain is the amount of randomness of the entire set of outputs minus the amount of randomness of the output conditioned on a specific feature. If knowing the value of the feature deterministically determines the output, then the conditional entropy is zero and the information gain is maximized. This would then be a good feature with which to construct the node, since the output would be known and the construction of the tree along that branch would be complete.

Decision trees can be prone to overfitting, though. As long as every combination of features maps to only a single output, a tree can be constructed that correctly classifies every training point. This situation needs to be avoided, so validation data is used to prune the decision tree to the point where the information gain remains above a threshold.

The overall performance of the decision tree on the test data was 85.35%. The entire tree would be difficult to show due to its size; as an example, however, the decision tree for the letters T and U using features 6 and 7 (the same as in previous sections) is shown in Figure 9.


3.6 Multiclass Support Vector Machine

A Support Vector Machine (SVM) is a discriminative binary classifier used to find the boundary between two classes of data. This is useful since it is often only the boundary that is of any consideration, not the distribution of the data within the entire classes. An example SVM is shown in Figure 10. Mathematically, the goal of the SVM is to maximize the margin between the data in class +1 and the data in class −1. This is done by determining a vector w such that the classifier, ϕ(x), works as

ϕ(x) = { +1, if w^T x ≥ 0;  −1, if w^T x < 0 }.    (14)

The margin between the two classes is equal to 2/||w||. Therefore we can solve for w by minimizing w^T w.

However, we need to consider the case where the two classes are non-separable, as shown in Figure 10. In this case no linear boundary exists that can perfectly separate the two classes. So we add to our cost function a penalty C times the distance ϵ that a data point lies from its respective boundary plane, equal to zero if it is on the correct side. Therefore the value w is determined as

argmin_w  (1/2) w^T w + ∑_{i=1}^{n} C ϵ_i.    (15)

This is a quadratic programming problem that is more easily solved in the dual formulation, with Lagrange multipliers on the constraints that each point be classified to its respective side of the boundary and that ϵ_i ≥ 0 for all i. The dual formulation is the following:

argmax_α  ∑_i α_i − (1/2) ∑_{i,j} α_i α_j θ_i θ_j x_i^T x_j    (16)

subject to  ∑_i α_i θ_i = 0,    (17)

α_i ≥ 0, ∀i,    (18)

with  w = ∑_i α_i θ_i x_i.    (19)

However, this only yields a linear boundary for the SVM. We can extend the SVM to non-linear boundaries using the kernel trick. Note that the dual objective is a function of the dot product between x_i and x_j. By replacing this dot product with another inner product, K(x_i, x_j), we implicitly map the data to a higher-dimensional feature space and create a non-linear boundary. For this analysis we use a radial basis function,

K(x_i, x_j) = exp( −(x_i − x_j)^T (x_i − x_j) / (2σ^2) ).    (20)

The SVM is a two-class classifier, so we need to extend it to handle the 26 classes required for this project. To do this we created (26 choose 2) = 325 SVMs, one for each pair of letters, each performing a one-on-one comparison. The classification results for each letter are summed, and the letter chosen most often across the 325 SVMs is selected as the final classification.
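A sketch of this pairwise voting, training one RBF-kernel SVM per pair of letters with scikit-learn and tallying the 325 votes; the C and gamma values are placeholders (in practice they would be tuned on the validation data), and the variable names follow the earlier hypothetical sketches.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

classes = np.unique(y_train)
pair_svms = {}
for a, b in combinations(classes, 2):                   # 325 one-on-one classifiers
    mask = (y_train == a) | (y_train == b)
    svm = SVC(kernel="rbf", C=1.0, gamma="scale")       # placeholder hyperparameters
    svm.fit(X_train[mask], y_train[mask])
    pair_svms[(a, b)] = svm

def svm_predict(x):
    # Each pairwise SVM casts one vote; the letter with the most votes wins (cf. Table 3).
    votes = {c: 0 for c in classes}
    for (a, b), svm in pair_svms.items():
        votes[svm.predict(x.reshape(1, -1))[0]] += 1
    return max(votes, key=votes.get)
```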

The test performance of the multiclass SVM was 95.75%. The results for the same two letters used in the logistic regression example are shown for the multiclass SVM in Table 3. Each row gives the number of times that letter was chosen across the 325 SVMs. Note that in this case both letters were correctly classified. The results in this table may be slightly deceiving, as several letters have totals very close to the one ultimately chosen. This does not imply that the test letter closely resembles those letters (although it may). There are 325 SVMs, and one letter must be chosen in each of those classifiers, even when comparing letters that the test letter is not remotely close to. This may inflate some of the totals beyond what one would intuitively expect, but it is merely a consequence of the fact that each column must sum to 325 and that each letter's count must lie between 0 and 25.


Table 3: Example Multiclass Support Vector Machine Output (pairwise votes) for two test letters, I and O.

Letter    I     O
A         3     3
B         1     11
C         6     20
D         11    23
E         3     6
F         15    9
G         10    21
H         19    20
I         25    0
J         12    5
K         21    13
L         7     4
M         23    17
N         17    19
O         3     25
P         21    18
Q         18    24
R         9     14
S         21    18
T         10    8
U         12    12
V         3     2
W         16    9
X         10    8
Y         23    12
Z         6     4


Figure 10: Example Support Vector Machine with boundaries [3].


3.7 Neural Network

The Neural Network is the final algorithm used to classify the letter data. Neural networks are generally able to obtain extremely high accuracy, however at the expense of being unintuitive, difficult to track, and requiring a long training time. Neural networks consist of multiple layers of nodes, where the output of each node in a layer serves as an input to every node in the succeeding layer. Each node in the output layer computes a weighted linear summation of its inputs, while each node in the inner (hidden) layers applies a sigmoidal function to its inputs. These sigmoidal functions in the hidden-layer nodes are known as the activation functions.

Unlike logistic regression, which determines only one vector w to perform the classification, neural networks require a determination of a w vector for every node. This is done by error backpropagation. The high-level mathematical objective is to minimize the mean-square error (MSE) between the predicted and true output values. This error function is differentiated with respect to each w to determine the values that minimize the MSE, using gradient descent. We first randomize all w values and determine the resulting error at each output j,

δ_j = θ_j − g(w^T v_j),    (21)

where v_j is the vector of inputs into output node j. We can then propagate the effect of the error backwards from node j to node i in the previous layer as

∆_{j,i} = w_i δ_j (1 − g_j) g_j.    (22)

The (1 − g)g term is a result of differentiating the sigmoidal function. This allows us to update all w values from node i to node k as

w_{k,i} = w_{k,i} + λ ∑_{j=1}^{n} ∆_{j,i} g_{j,i} (1 − g_{j,i}) x_{j,k}.    (23)

For our purposes we created 26 different neural networks, one for each letter, each comparing that letter to the rest of the training data set; the desired output from each network was 1 if the input letter corresponded to that particular network and 0 otherwise. Each network had 16 input nodes (the features) and one output node. We trained networks with varying numbers of nodes and hidden layers and tested them on validation data to determine the appropriate sizes. The validation performance for one hidden layer is shown in Figure 11 and for two hidden layers in Figure 12. For one hidden layer, the validation performance flattened out at approximately 0.974; for the final network, however, we chose two hidden layers with 35 and 50 nodes respectively, since the validation performance was nearly the same and the training time was hours shorter. All 26 of the neural networks were of identical size. This could have been another parameter to vary, but it would have required testing many more networks. As is, the neural network still performed the best of any of the algorithms, with a test data performance of 97.18%.
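A sketch of this bank of 26 one-vs-rest networks, using scikit-learn's MLPRegressor with two hidden layers of 35 and 50 logistic-activation nodes as a stand-in for the backpropagation equations above; this is not the authors' implementation, and max_iter is an arbitrary placeholder.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

classes = np.unique(y_train)
nets = {}
for c in classes:
    # Target is 1.0 for this letter and 0.0 for every other letter.
    net = MLPRegressor(hidden_layer_sizes=(35, 50), activation="logistic", max_iter=500)
    net.fit(X_train, (y_train == c).astype(float))
    nets[c] = net

def nn_predict(x):
    # The letter whose network produces the largest output wins; outputs may fall slightly
    # outside [0, 1] because the networks minimize mean-square error, as noted in Section 3.9.
    scores = {c: nets[c].predict(x.reshape(1, -1))[0] for c in classes}
    return max(scores, key=scores.get)
```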

3.8 Individual Algorithm Results

Table 4 summarizes the performance of each model individually, as well as the resulting training time for each algorithm. The Neural Network achieves the best performance, but takes orders of magnitude more training time than the other algorithms. Due to the size of the test data set, and the fact that for most of the algorithms the test phase executes a fixed series of calculations, the test time is negligible and comparable for each of the algorithms.

3.9 Hybrid Models

We created hybrid models of the various individual algorithms to attempt to increase the test performance. Each algorithm has the weaknesses and misclassifications discussed in its respective section, so the hope is that by combining algorithms, the strengths of some will be able to overcome the weaknesses of others and therefore increase the overall performance.


Figure 11: Validation performance for the number of nodes in a neural network with one hidden layer.


Figure 12: Validation performance for the number of nodes in a neural network with two hidden layers.

Table 4: Test Data Performance and Training Time of each algorithm individually.

Algorithm              Performance    Training Time (s)
k-NN                   95.65%         0
Naive Bayes            62.45%         0.06
Full Gaussian          96.43%         87.4
Logistic Regression    71.53%         2.0
Decision Tree          85.35%         1.13
Multiclass SVM         95.90%         71.4
Neural Network         97.18%         2212



The hybrid models are created using a voting scheme between the algorithms. Each individual algorithm submits a 26-element vote vector corresponding to the amount by which it votes for each letter. There are two weighting schemes. The first weights each algorithm equally, so that the vote vector of each algorithm sums to 1. The second weights each algorithm by its performance, so that the vote vector of each algorithm sums to the respective performance value from Table 4. This gives the better-performing algorithms a larger vote in the final model, which is logical since they perform better and should be trusted more.
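A sketch of the voting combination itself; the vote vectors are assumed to be produced by the individual algorithms as described above (each summing to 1 before weighting), and the dictionary keys and weights shown in the usage comment are illustrative values taken from Table 4.

```python
import numpy as np

def hybrid_predict(vote_vectors, weights=None):
    """vote_vectors: dict mapping algorithm name -> length-26 vote vector (summing to 1).
    weights: dict mapping algorithm name -> weight (e.g. the Table 4 performance values);
    if None, every algorithm gets an equal weight of 1."""
    total = np.zeros(26)
    for name, votes in vote_vectors.items():
        w = 1.0 if weights is None else weights[name]
        total += w * np.asarray(votes)
    return int(np.argmax(total)) + 1          # +1 because labels run A = 1, ..., Z = 26

# Hypothetical usage with per-algorithm vote vectors like the columns of Table 6:
# letter = hybrid_predict({"knn": knn_votes, "svm": svm_votes, "nn": nn_votes},
#                         weights={"knn": 0.9565, "svm": 0.9590, "nn": 0.9718})
```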

We tested both the equal-weight and the performance-weighted voting schemes for a hybrid model combining all algorithms. We also created three more hybrid models, all using the performance-weighted vote. These three hybrid models and the algorithms they contain are shown in Table 5.

Table 5: Hybrid Models tested and the individual algorithms used in their respective voting schemes.

Hybrid Model       Algorithms
Weaker Models      Naive Bayes, Logistic Regression, Decision Tree
Stronger Models    k-NN, Full Gaussian, SVM, Neural Network
Fastest Models     Naive Bayes, k-NN, Decision Tree

An example of the votes from each algorithm and the resulting total vote under the equal-weight scheme, for a test letter G, is shown in Table 6. In this vote, six of the seven algorithms correctly predict the letter, and the maximum of the resulting final vote also predicts the correct letter. One will notice that each algorithm's vote vector has a slightly different structure. The Naive Bayes, GMM, Logistic Regression, and Neural Network algorithms are able to vote for multiple letters, while the k-NN, Decision Tree, and SVM vote for only a single letter. This is simply due to the nature of each algorithm. The algorithms that vote for multiple letters have a posteriori predictions or comparable scores for each letter, while the algorithms that vote for only a single letter do so because they are discriminative classifiers with only a single output. The Multiclass SVM outputs a score for every letter, but those scores are not representative of the probability that the test letter belongs to each class, so the SVM submits its full vote to only the single letter with the maximum value from the algorithm.

One should also note that the Neural Network's vote vector contains negative values, and that its vote for the correct letter G is greater than one. This is because the neural network is constructed to minimize the mean-square error, and we are using a regression-based algorithm as a classifier. If one were training a neural network toward an output of 0, a result of −0.01 would be equivalent to a result of 0.01 from a mean-square-error perspective. This is a possible source of corruption in the voting scheme.

The final performance results for the hybrid algorithms are shown in Table 7. There is only a very small improvement over the neural network for the equal-weight, performance-weighted, and strong-model voting schemes. Similarly, there is only a small improvement over the decision tree for the weak models. The results for the fast models are in fact worse than the k-NN results on their own.

This lack of improvement is due to a number of factors. When the combined algorithms each vote for only a single letter, there is little room for improvement when combining only three or four algorithms. For some of the stronger models there is also little room left to improve on the performance. However, the lack of improvement is also due to the nature of the data. Table 8 shows the amount of overlap in misclassifications among the individual algorithms in both the weak and strong hybrid models.


Table 6: Resulting votes for example test letter G.

Letter    LR       DT    NN        SVM    NB       GMM      KNN    SUM
A         0.000    0     0.001     0      0.000    0.000    0      0.001
B         0.000    0     -0.024    0      0.000    0.000    0      -0.024
C         0.484    0     -0.222    0      0.011    0.009    0      0.282
D         0.000    0     -0.005    0      0.000    0.000    0      -0.005
E         0.028    0     -0.054    0      0.000    0.000    0      -0.026
F         0.000    0     -0.022    0      0.000    0.000    0      -0.022
G         0.287    1     1.273     1      0.981    0.991    1      6.532
H         0.033    0     -0.055    0      0.000    0.000    0      -0.022
I         0.000    0     0.005     0      0.000    0.000    0      0.005
J         0.000    0     0.003     0      0.000    0.000    0      0.003
K         0.011    0     0.018     0      0.000    0.000    0      0.029
L         0.048    0     -0.006    0      0.000    0.000    0      0.042
M         0.000    0     0.000     0      0.000    0.000    0      0.000
N         0.012    0     0.004     0      0.000    0.000    0      0.016
O         0.069    0     -0.001    0      0.007    0.000    0      0.075
P         0.000    0     -0.001    0      0.000    0.000    0      -0.001
Q         0.010    0     0.007     0      0.000    0.000    0      0.017
R         0.000    0     -0.017    0      0.000    0.000    0      -0.017
S         0.009    0     -0.005    0      0.000    0.000    0      0.004
T         0.000    0     0.010     0      0.000    0.000    0      0.010
U         0.007    0     0.032     0      0.000    0.000    0      0.039
V         0.000    0     0.008     0      0.000    0.000    0      0.008
W         0.000    0     0.003     0      0.000    0.000    0      0.003
X         0.001    0     0.005     0      0.000    0.000    0      0.006
Y         0.000    0     0.000     0      0.000    0.000    0      0.000
Z         0.000    0     0.043     0      0.000    0.000    0      0.043

Table 7: Test performance results for the hybrid models.

Algorithm             Performance    Time to Run (s)
Equal Weight          97.22%         2374
Performance Weight    97.50%         2374
Weak Models           85.38%         3.2
Strong Models         97.48%         2371
Fast Models           94.93%         1.3


Table 8: Amount of shared overlap of misclassifications among the algorithms in the Weak and Strong hybrid models.

Weak Models
Algorithm              # Misclassifications
Naive Bayes            1502
Logistic Regression    1139
Decision Tree          586
Shared overlap         306

Strong Models
Algorithm              # Misclassifications
k-NN                   174
Full Gaussian          149
SVM                    164
Neural Network         119
Shared overlap (>3)    77

Compared to the best-performing model in each situation, there is a large amount of overlap in the misclassifications between the models. This leaves very few test cases where the hybrid vote could be expected to correct a classification. Generally, hybrid models such as these tend to work best with a larger number of weaker models, since there are more possible votes and more room for improvement.

3.10 Final Confusion Matrix

We present the confusion matrix for the best-performing, performance-weighted vote in Figure 13. Highlighted in green are the correct predictions for each letter, and highlighted in red are some of the most interesting and most frequent letter misclassifications. We perform worst on the letter H, correctly predicting it only 93% of the time. Some misclassifications include mistaking the letter I for J and vice versa. It should be noted that the confusion matrix is not symmetric: even though one letter is often misclassified as a second, it does not follow that the second letter is misclassified as the first, although that is not an unusual scenario. This is because some letters can be discriminated better than others, and some are closer to the general distribution of the remaining letters than others. For example, the letter K is misclassified as an R 3% of the time, while the letter R is misclassified as a K only 1% of the time.

In general this confusion matrix matches our intuition of how the classifications would perform. Letters that are most frequently misclassified as others tend to bear a significant resemblance to them to the human eye.

4 Conclusions

We used seven different machine learning algorithms to classify letter images based on 16 predetermined features. Our best individual performance was accomplished with a neural network at 97.18%. We were able to slightly increase this result to 97.50% using a voting scheme among multiple models. However, the improvement from the hybrid models was less than originally hoped for, due to the already good performance of the original models as well as the nature of the data: the letters that were misclassified were often commonly misclassified by most of the individual algorithms.

References

[1] P. W. Frey and D. J. Slate, "Letter recognition using Holland-style adaptive classifiers," Mach. Learn., vol. 6, pp. 161-182, Mar. 1991.

[2] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.

[3] Z.-B. Joseph, “10701 machine learning, neural network lecture slides,” 2012.


Figure 13: Confusion matrix for test data. Rows represent the actual test letter, with each column in the row giving the fraction of times that column's letter class was predicted for the row's test letter.
