CHAPTER-5 DEEP LEARNING BASED APPROACHESshodhganga.inflibnet.ac.in/bitstream/10603/114183/5/... ·...

[Chapter- 5: Deep Learning Based Approaches]

Investigations in Devanagari Handwriting Recognition 105

CHAPTER-5

DEEP LEARNING BASED APPROACHES

Chapter Outline

This chapter describes the implementation of deep-learning neural network approaches to Devanagari character recognition. One advantage of deep learning techniques is that they are inspired by human vision. Four deep-learning NN models are built: a Feedforward NN, a Convolutional NN, a Deep Belief Network and a Stacked Denoising Autoencoder. The performance of shallow and deep neural network architectures is compared.



5. Deep learning Tools for Devanagari Character Recognition

The three-layer networks presented in the previous chapter belong to the category of shallow networks. The input given to these networks was a set of specially crafted features. This chapter introduces deep networks, which have multiple hidden layers and can therefore compute much more complex features of the input. Because each hidden layer computes a non-linear transformation of the output of the previous layer, a deep network can have significantly greater representational power (i.e., can learn significantly more complex functions) than a shallow one. In the case of images, a deep network can also start to learn part-whole decompositions. For example, the first layer might learn to group together pixels in an image in order to detect edges. The second layer might then group together edges to detect longer contours, or perhaps simple "parts of objects." An even deeper layer might then group together these contours or detect even more complex features. There is, however, a difficulty with training such a network in the usual supervised way: it relies only on labeled data. Labeled data is often scarce, so for many problems it is difficult to get enough examples to fit the parameters of a complex model. Given the high expressive power of deep networks, training on insufficient data also results in over-fitting. Plain gradient descent is also poorly suited to training deep networks because of the problems of 'local optima' and 'diffusion of gradients' [106].

The method that has seen some success in training deep networks is greedy layer-wise training. The main idea is to train the layers of the network one at a time: first a network with 1 hidden layer is trained, and only after that is done, a network with 2 hidden layers, and so on. At each step, we take the old network with k − 1 hidden layers and add a k-th hidden layer that takes as input the output of the previously trained hidden layer k − 1. Training can be supervised (say, with classification error as the objective function at each step), but more frequently it is unsupervised (as in the case of an autoencoder). The weights obtained by training the layers individually are then used to initialize the weights of the final deep network, and only then is the entire architecture "fine-tuned" (i.e., trained together to optimize the labeled training set error). The success of greedy layer-wise training has been attributed to a number of factors. The following sections discuss the deep architectures used in this work. Four deep models are applied to the patterns: a deep neural network (NN), the Convolutional Neural Network (CNN), the Deep Belief Network (DBN) and the Stacked Denoising Autoencoder (SAE).
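The greedy layer-wise procedure described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in this work: the function names (`train_autoencoder`, `greedy_pretrain`), the sigmoid units, the squared reconstruction error, the layer sizes and the toy data are all assumptions chosen for brevity, and the final supervised fine-tuning step is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, epochs=200, lr=0.5, seed=0):
    """Train a one-hidden-layer autoencoder on X and return the encoder (W, b)."""
    rng = np.random.default_rng(seed)
    n, n_in = X.shape
    W1 = rng.normal(0, 0.1, (n_in, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, n_in))
    b2 = np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)            # encode
        R = sigmoid(H @ W2 + b2)            # decode (reconstruction of X)
        dR = (R - X) * R * (1 - R)          # squared-error gradient at decoder pre-activation
        dH = (dR @ W2.T) * H * (1 - H)      # back-propagated to encoder pre-activation
        W2 -= lr * (H.T @ dR) / n
        b2 -= lr * dR.mean(axis=0)
        W1 -= lr * (X.T @ dH) / n
        b1 -= lr * dH.mean(axis=0)
    return W1, b1

def greedy_pretrain(X, layer_sizes):
    """Train hidden layers one at a time; layer k is trained on the output of layer k-1."""
    params, inputs = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(inputs, n_hidden)
        params.append((W, b))
        inputs = sigmoid(inputs @ W + b)    # features passed on to the next layer
    return params

# Toy unlabeled data in [0, 1] (hypothetical, for illustration only).
X_demo = np.random.default_rng(1).random((64, 20))
pretrained = greedy_pretrain(X_demo, layer_sizes=[12, 6])
```

The returned list of `(W, b)` pairs would serve as the initial weights of the deep network before it is fine-tuned on labeled data.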

Multilayer perceptrons (MLPs) are quite effective at learning complex, high-dimensional problems. An MLP consists of several fully connected hidden layers between the input and output layers. However, for high-dimensional patterns, training an MLP becomes costly and time-consuming, especially in image processing applications with a large number of inputs. To alleviate the difficulty of an over-parameterized NN in such applications, a neural network model inspired by the architecture of the Convolutional Neural Network [107] is implemented. A Convolutional Neural Network incorporates both feature extraction and classification. The arrangement of convolution and sub-sampling layers forms the feature detector, which is trainable. Some researchers [108] have applied the Convolutional Neural Network to character recognition directly on image pixels. The method exploits the shared weights, the small number of learnable parameters and the deep learning concept of the Convolutional Neural Network. The main advantage of the Convolutional Neural Network is a degree of invariance to small transformations such as translation and slight rotation, which results from the sub-sampling and pooling operations performed at various layers of the network.

Convolutional Neural Networks (CNNs) are biologically inspired variants of the Multilayer Perceptron (MLP). The visual cortex contains a complex arrangement of cells [109]. Several such neurally inspired models exist, including the NeoCognitron [110], HMAX [111] and LeNet-5 [107]. The standard MLP ignores the topology present in the input, whereas the CNN exploits this topological information. In the CNN model, the input image is assumed to be composed of small sub-regions called "receptive fields", and the filters are sensitive to these sub-regions of the input space, which are tiled so as to cover the entire visual field. These filters are local in the input space and are thus well suited to exploiting the strong spatially local correlation present in natural images. There are two types of cells in a CNN: simple cells (S) and complex cells (C). Simple cells respond maximally to specific edge-like stimulus patterns within their receptive field. Complex cells have larger receptive fields and are locally invariant to the exact position of the stimulus. A major advantage of the CNN is that it requires only a few learnable parameters, because it is made up of translated versions of the same basis function. Hence a CNN can achieve considerable recognition performance with relatively few samples. Convolutional Neural Networks have been widely used for recognition problems such as MNIST handwritten digit recognition [112], object recognition [113] and natural language processing [114].
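The saving in parameters that comes from weight sharing can be illustrated with simple arithmetic. The sketch below compares a fully connected layer with a convolutional layer; the layer sizes (a 28 x 28 input, 100 hidden units, six 5 x 5 filters) are illustrative assumptions, and the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv_params(n_filters, kernel_h, kernel_w, in_channels=1):
    """Weights plus biases of a convolutional layer with shared filters."""
    return n_filters * (kernel_h * kernel_w * in_channels) + n_filters

fc_count = dense_params(28 * 28, 100)   # 784 * 100 + 100 = 78500
conv_count = conv_params(6, 5, 5)       # 6 * 25 + 6 = 156
```

The convolutional count does not depend on the image size, which is why the number of free parameters does not grow with the input dimension.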

5.1 The Role of Data Preprocessing

Data preprocessing plays a very important role in many deep learning algorithms. In practice, many methods work best after the data has been normalized and whitened. However, the exact parameters for data preprocessing are usually not immediately apparent unless one has much experience working with the algorithms. The standard first step in data preprocessing is data normalization. While there are a few possible approaches, the appropriate one is usually clear from the data. The common methods for feature normalization are:

• Simple Rescaling
• Per-example mean subtraction (a.k.a. remove DC)
• Feature Standardization (zero-mean and unit variance for each feature across the dataset)

5.1.1 Simple Rescaling

In simple rescaling, the goal is to rescale the data along each data dimension (possibly independently) so that the final data vectors lie in the range [0, 1] or [−1, 1] (depending on the dataset). This is useful for later processing because many default parameters treat the data as if it has been scaled to a reasonable range.
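A minimal sketch of simple rescaling is given here, assuming a single global minimum and maximum for the whole array (rescaling each dimension independently would use per-column values instead). The function name `rescale` and the sample pixel values are hypothetical.

```python
import numpy as np

def rescale(X, low=0.0, high=1.0):
    """Linearly map the values of X onto the interval [low, high]."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(), X.max()
    if x_max == x_min:                      # constant input: avoid division by zero
        return np.full_like(X, low)
    return low + (X - x_min) * (high - low) / (x_max - x_min)

pixels = np.array([[0, 128, 255], [64, 192, 32]])
scaled_01 = rescale(pixels)                 # values in [0, 1]
scaled_pm1 = rescale(pixels, -1.0, 1.0)     # values in [-1, 1]
```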

5.1.2 Per-Example Mean Subtraction

In character recognition, we are interested not in the illumination conditions of the image but in its content, so removing the average pixel value per data point makes sense here. For such images, this normalization removes the average brightness (intensity) of the data point. In other words, if the data is stationary (i.e., the statistics for each data dimension follow the same distribution), it can be further pre-processed by subtracting from each example its own mean value (computed per example).
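Per-example mean subtraction can be sketched as follows, assuming one example per row. The function name `remove_dc` and the sample values are hypothetical; the two rows differ only in overall brightness, so they become identical after the operation.

```python
import numpy as np

def remove_dc(X):
    """Subtract from every example (row) its own mean value."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=1, keepdims=True)

batch = np.array([[10.0, 20.0, 30.0],
                  [110.0, 120.0, 130.0]])   # same pattern, brighter by 100
centered = remove_dc(batch)
```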

5.1.3 Feature Standardization

Feature standardization refers to (independently) setting each dimension of the data to have zero mean and unit variance. This is the most common normalization method and is widely used. In practice, one achieves this by first computing the mean of each dimension (across the dataset) and subtracting it from each dimension. Next, each dimension is divided by its standard deviation.
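The two steps of feature standardization can be sketched as follows. The statistics are computed on the training data and would be reused unchanged for the test data; the small constant `eps` is an added assumption that guards against division by zero for constant features, and the function names and toy data are hypothetical.

```python
import numpy as np

def fit_standardizer(X_train, eps=1e-8):
    """Compute the per-dimension mean and standard deviation across the dataset."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps         # eps avoids division by zero
    return mean, std

def standardize(X, mean, std):
    """Subtract the mean of each dimension, then divide by its standard deviation."""
    return (X - mean) / std

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 3.0, size=(200, 4))   # toy data, 4 features
mean, std = fit_standardizer(X_train)
Z = standardize(X_train, mean, std)
```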

5.2 The Proposed Approach

The overall workflow of the proposed approach is shown in Fig. 5.1. The pipeline of the proposed approach mainly consists of the following steps:

1. Acquire/collect the images of handwritten characters and divide them into training and test datasets.
2. Apply the preprocessing technique to both the training and test sets.
3. Normalize the data so that it lies in the range 0-1.
4. Divide the training data into mini-batches of a suitable size decided using a heuristic.
5. Use unlabelled data for feature dimension reduction.
6. Train the deep network model using labeled data.
7. Use the trained model for classification.

Handwritten images are divided into a training set and a testing set. All digit images have been size-normalized and centered in a fixed-size image of 28 x 28 pixels. In the original dataset each pixel of the image is represented by a value between 0 and 255, where 0 is black, 255 is white and anything in between is a different shade of grey. An image is represented as a 1-dimensional array of 784 (28 x 28) float values between 0 and 1 (0 stands for black, 1 for white). When using the dataset, we divide the training and test sets into mini-batches.

Figure 5.1 Overall workflow of proposed approach
[Flowchart boxes: Input Dataset Images; Training set; Test set; Preprocessing; Preprocessed Training data; Preprocessed Test data; Feature Extraction using unsupervised training data; Training using supervised Data and minibatches; Deep Neural Network; Classification of test data using Trained model]
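Steps 3 and 4 of the pipeline (scaling the 0-255 pixel values to the range 0-1 and splitting the data into mini-batches) can be sketched as follows. The random images, the batch size of 100 and the helper names `to_unit_vectors` and `iterate_minibatches` are illustrative assumptions, not the data or code used in this work.

```python
import numpy as np

def to_unit_vectors(images_uint8):
    """Flatten 28 x 28 images to 784-vectors and scale pixel values to [0, 1]."""
    images = np.asarray(images_uint8, dtype=np.float32)
    return images.reshape(len(images), -1) / 255.0

def iterate_minibatches(X, y, batch_size, seed=0):
    """Yield shuffled (inputs, labels) mini-batches; the last one may be smaller."""
    order = np.random.default_rng(seed).permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

# Hypothetical stand-in for a dataset of 250 grey-scale images with 10 classes.
raw = np.random.default_rng(0).integers(0, 256, size=(250, 28, 28), dtype=np.uint8)
labels = np.arange(250) % 10
X = to_unit_vectors(raw)
batches = list(iterate_minibatches(X, labels, batch_size=100))
```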



5.2.1 The Deep NN- Classifier Model

A Multilayer Perceptron (MLP) is used as the classifier. The MLP architecture used in this work consists of an input layer, one or more hidden layers and an output layer. An MLP with a single hidden layer is a universal approximator, which makes it suitable for many pattern recognition applications. The output vector of a single-hidden-layer perceptron is given by

$$f(x) = G\left(b^{(2)} + W^{(2)}\, s\left(b^{(1)} + W^{(1)} x\right)\right)$$ (5.1)

where $b^{(1)}$, $b^{(2)}$ are the bias vectors of the hidden and output layers, $W^{(1)}$, $W^{(2)}$ are the weight matrices of the respective layers, and $s$, $G$ are the activation functions. For a classification problem, let $(x^{(i)}, y^{(i)})$ be the $i$-th training pair, where $x^{(i)} \in \mathbb{R}^{D}$ is a D-dimensional training vector and $y^{(i)} \in \{1, \ldots, L\}$ is its class label. For the prediction function $f$ given in Eqn. 5.1, the zero-one loss over a dataset $\mathcal{D}$ is given by

$$\ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}}$$ (5.2)

where $I$ is the indicator function given by

$$I_{x} = \begin{cases} 1 & \text{if } x \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$ (5.3)

and the predicted class is

$$f(x) = \arg\max_{k} P(Y = k \mid x, \theta)$$ (5.4)

where $\theta$ is the set of all parameters of the given model. The objective of training is to minimize the loss function. Because the zero-one loss is not differentiable, the negative log-likelihood (NLL) is minimized instead as the training objective:

$$NLL(\theta, \mathcal{D}) = -\sum_{i=0}^{|\mathcal{D}|} \log P(Y = y^{(i)} \mid x^{(i)}, \theta)$$ (5.5)
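Equations 5.1 to 5.5 can be illustrated with a short NumPy sketch. It assumes a tanh hidden layer for s and a softmax output for G so that the outputs can be read as class probabilities; these choices, the function names and the toy sizes are assumptions made for illustration. Class labels are numbered from 0 in the code, whereas the text numbers them from 1.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mlp_probs(X, W1, b1, W2, b2):
    """Eq. 5.1 with s = tanh and G = softmax; one example per row of X."""
    hidden = np.tanh(X @ W1 + b1)
    return softmax(hidden @ W2 + b2)

def zero_one_loss(probs, y):
    """Eq. 5.2: number of examples whose arg-max class (Eq. 5.4) is wrong."""
    return int(np.sum(probs.argmax(axis=1) != y))

def nll(probs, y):
    """Eq. 5.5: negative sum of the log-probabilities of the correct classes."""
    return float(-np.sum(np.log(probs[np.arange(len(y)), y])))

# Forward pass through a toy 4-3-2 network with random weights.
rng = np.random.default_rng(0)
probs = mlp_probs(rng.normal(size=(5, 4)),
                  rng.normal(size=(4, 3)), np.zeros(3),
                  rng.normal(size=(3, 2)), np.zeros(2))

# Hand-checkable example: the second prediction is wrong.
demo_probs = np.array([[0.9, 0.1], [0.2, 0.8]])
demo_y = np.array([0, 0])
demo_errors = zero_one_loss(demo_probs, demo_y)
demo_nll = nll(demo_probs, demo_y)          # -(log 0.9 + log 0.2)
```

Unlike the zero-one loss, the NLL changes smoothly with the predicted probabilities, which is what makes gradient-based training possible.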

Weights are updated using the gradient of the error surface defined by the loss function. The gradient is estimated from the training data. In this study, Stochastic Gradient Descent based learning (Table 5.1) is applied to the MLP. In the ordinary gradient descent algorithm, small steps are repeatedly taken in the downhill direction on an error surface defined by a loss function of the weights, such as the mean squared error. Stochastic gradient descent (SGD) works according to the same principles as ordinary gradient descent, but proceeds more quickly by estimating the gradient from just a few examples at a time instead of the entire training set. In its purest form, the gradient is estimated from a single example at a time. In both gradient descent (GD) and SGD, a set of weights is updated iteratively to minimize an error function. In normal GD (Table 5.2), all the samples of the training set have to be processed before the weights are updated in a given iteration, whereas in SGD only a single training sample is used for each update. Thus, if the number of training samples is very large, gradient descent may take too long, because every parameter update requires a pass through the complete training set. SGD, on the other hand, is faster because it uses only one training sample per update and starts improving the model from the first sample. SGD often converges much faster than GD, but the error function is not minimized as well as with GD. In most cases the close approximation of the parameter values obtained with SGD is sufficient, because the parameters reach the neighbourhood of the optimal values and keep oscillating there. The convergence rate of stochastic gradient descent is independent of the size of the dataset, so it is well suited to huge or even infinite datasets. It has two downsides, however:

• Its slow convergence rate. This is less of a drawback than it appears, because a faster convergence rate on the training set does not translate into a better convergence rate on the test set [115].
• Its sensitivity to two parameters, the learning rate and the decrease constant.

Table 5.1 Stochastic Gradient Descent

for (xi, yi) in training_set:
    # Assume an infinite generator; it may repeat examples
    # (if there is only a finite training set)
    loss = f(parameters, xi, yi)
    gradient = derivative of loss with respect to parameters   # compute gradient
    parameters = parameters - learning_rate * gradient
    if <stopping condition is met>:
        return parameters

For deep learning, a variant of stochastic gradient descent that uses "minibatches" is recommended. Minibatch SGD (MSGD), shown in Table 5.3, works identically to SGD, except that more than one training example is used to make each estimate of the gradient. This technique reduces the variance of the gradient estimate and often makes better use of the hierarchical memory organization in modern computers.

Table 5.2 Gradient Descent

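while True:
    loss = f(parameters)
    gradient = derivative of loss with respect to parameters   # compute gradient
    parameters = parameters - learning_rate * gradient
    if <stopping condition is met>:
        return parameters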

Table 5.3 Mini Batch- SGD

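for (xbatch, ybatch) in training_batches:
    # Assume an infinite generator; it may repeat examples
    loss = f(parameters, xbatch, ybatch)
    gradient = derivative of loss with respect to parameters   # compute gradient
    parameters = parameters - learning_rate * gradient
    if <stopping condition is met>:
        return parameters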
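The three schemes in Tables 5.1 to 5.3 differ only in how many examples are used for each gradient estimate, which a single runnable sketch can make concrete. The example below fits a linear model by least squares rather than an MLP to keep it short; the function name `minibatch_sgd`, the learning rate, the number of epochs and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size, lr=0.1, epochs=100, seed=0):
    """Least-squares fit; batch_size=1 gives SGD, batch_size=len(X) gives GD."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))     # reshuffle the examples every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)   # gradient of half the mean squared error
            w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w                              # noise-free targets

w_sgd = minibatch_sgd(X, y, batch_size=1)          # Table 5.1
w_gd = minibatch_sgd(X, y, batch_size=len(X))      # Table 5.2
w_msgd = minibatch_sgd(X, y, batch_size=50)        # Table 5.3
```

For the same number of passes over the data, the full-batch run performs only one update per epoch, while the mini-batch and single-example runs perform 10 and 500 respectively.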

There is a trade-off in the choice of the minibatch size B. With a large B, time is spent reducing the variance of the gradient estimator that would be better spent on additional gradient steps. The optimal B is model-, dataset- and hardware-dependent, and can be anywhere from 1 to several hundred. In machine learning, a model is trained so that it does well on new examples, not on the ones it has already seen. The MSGD training loop does not by itself take this generalization into account and tends to overfit. Overfitting is combated through regularization and early stopping using a validation set. Of the several regularization techniques, the most commonly used is L1/L2 regularization, which is explained in the following sections.

5.2.1.1 Weight Decay

Weight decay belongs to the family of regularization methods. The penalty term in weight decay, by definition, penalizes large weights. Other regularization methods may involve not only the weights but also various derivatives of the output function. The weight decay penalty term causes the weights to converge to smaller absolute values than they otherwise would.

Large weights can hurt generalization in two different ways. Excessively large weights

leading to hidden units can cause the output function to be too rough, possibly with near

discontinuities. Excessively large weights leading to output units can cause wild outputs

far beyond the range of the data if the output activation function is not bounded to the

same range as the data. To put it another way, large weights can cause excessive variance

of the output.

5.2.1.2 L1 and L2 Regularization

L1 and L2 regularization add an extra term to the loss function that penalizes certain parameter configurations. Formally, if the negative log-likelihood loss function is $NLL(\theta, \mathcal{D})$, then the regularized loss is given by

$$E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda R(\theta)$$ (5.6)

For the present study this is written as

$$E(\theta, \mathcal{D}) = NLL(\theta, \mathcal{D}) + \lambda \|\theta\|_{p}^{p}$$ (5.7)

where $\|\theta\|_{p}$ is the $L_p$ norm of $\theta$:

$$\|\theta\|_{p} = \left(\sum_{j=0}^{|\theta|} |\theta_j|^{p}\right)^{1/p}$$ (5.8)

$\lambda$ is a hyper-parameter that controls the relative importance of the regularization term. Commonly used values for p are 1 and 2, hence the L1/L2 nomenclature. If p = 2, the regularizer is also called "weight decay". In principle, adding a regularization term to the loss encourages smooth network mappings in a neural network (by penalizing large values of the parameters, which decreases the amount of nonlinearity that the network models). More intuitively, the two terms (NLL and R(θ)) correspond to modeling the data well (NLL) and having "simple" or "smooth" solutions (R(θ)). Thus, minimizing the sum of both will, in theory, correspond to finding the right trade-off between the fit to the training data and the "generality" of the solution that is found.
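The regularized loss of Eqs. 5.6 to 5.8 can be written out as a small worked example. The function names and the toy weight arrays are hypothetical, and the NLL value is taken as a given number rather than computed from a model.

```python
import numpy as np

def lp_penalty(params, p):
    """Return ||theta||_p^p (Eq. 5.7), summed over every parameter array."""
    return float(sum(np.sum(np.abs(w) ** p) for w in params))

def regularized_loss(nll_value, params, lam, p=2):
    """Eq. 5.7: E = NLL + lambda * ||theta||_p^p."""
    return nll_value + lam * lp_penalty(params, p)

weights = [np.array([[3.0, -4.0]]), np.array([1.0])]
l1 = lp_penalty(weights, p=1)               # 3 + 4 + 1 = 8
l2 = lp_penalty(weights, p=2)               # 9 + 16 + 1 = 26
total = regularized_loss(10.0, weights, lam=1e-4, p=2)   # 10 + 0.0026
```

With a small λ such as 1e-4, the penalty shifts the loss only slightly, but its gradient still pulls every weight towards zero at each update.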

5.2.1.3 Methodology Used for Deep Neural Network

For the other deep neural network models in this work, no feature extraction is applied and the normalized pixel intensities are used as features. For the first, deep MLP based architecture, results are obtained both for the direct pixel values and for the features proposed in our earlier work. The design cycle for the recognition of characters follows all the steps of the standard pattern recognition technique in supervised learning mode. For training and testing, datasets developed by various research groups are used. To the best of our knowledge, only isolated character and numeral datasets of Devanagari script are available as test beds, so these are used in this study. The design cycle used in this part of the study (for the Deep NN model) is shown in Fig. 5.2.

Figure 5.2 Design Cycle for the Feature Extraction Based Deep Network Architecture
[Flowchart boxes: Stored Data (Training + Test Data); Preprocessing and Feature Extraction; Training Algorithm; Model; Preprocessing and Feature Extraction; Classifier; Decision]

The images obtained from the benchmark datasets are in two groups: training and testing. The images are subjected to the preprocessing steps described in Table 2.2. The output of the preprocessing module is not suitable for classifier training because of its high dimensionality. Feature extraction/selection is an important step for dimensionality reduction. In this study, gradient-based directional features are used.



The features are obtained from 9 different parts (zones) of each image sample. The zone boundaries are determined using global and local histograms before the features are extracted. Results are first obtained for the size-normalized image patterns with direct pixel intensities as features; here the network configuration varies in the number of hidden layers. The reported recognition time includes the training time as well as the classification time. For the feature-based experiments, all sample images are partitioned into 3 x 3 sub-images, with the zoning criterion differing between methods (equal, global or local zones). Edge-based directional features are extracted from each sub-image in 8 discrete directions, giving a 72-dimensional (9 x 8) feature vector in each case. A three-layer MLP with 200 nodes in the hidden layer is used. The rest of the NN settings are kept the same as in the experiment on direct pixel values. The feed-forward neural network with the mini-batch approach is implemented using the code developed by Palm [116].
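To make the 72-dimensional layout concrete, the sketch below computes a directional feature vector with equal 3 x 3 zoning. It is not the feature extraction algorithm of this work: the use of `np.gradient` for edge estimation, magnitude-weighted direction histograms, equal-sized zones and the function name `zone_direction_features` are all assumptions, and the global and local histogram based zoning is not reproduced.

```python
import numpy as np

def zone_direction_features(image, grid=3, n_dirs=8):
    """Histogram gradient directions (n_dirs bins) in each of grid x grid equal zones."""
    image = np.asarray(image, dtype=float)
    gy, gx = np.gradient(image)                     # derivatives along rows and columns
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % (2 * np.pi)        # direction in [0, 2*pi)
    bins = np.floor(angle / (2 * np.pi / n_dirs)).astype(int) % n_dirs
    row_zones = np.array_split(np.arange(image.shape[0]), grid)
    col_zones = np.array_split(np.arange(image.shape[1]), grid)
    features = []
    for rows in row_zones:
        for cols in col_zones:
            zone_bins = bins[np.ix_(rows, cols)].ravel()
            zone_mag = magnitude[np.ix_(rows, cols)].ravel()
            # Sum of gradient magnitudes falling into each direction bin.
            features.append(np.bincount(zone_bins, weights=zone_mag, minlength=n_dirs))
    return np.concatenate(features)                 # grid * grid * n_dirs values

demo = np.zeros((28, 28))
demo[:, 14:] = 1.0                                  # a single vertical edge
feat = zone_direction_features(demo)
```

For the vertical edge in `demo`, only the horizontal-direction bin of the three middle-column zones is non-zero, which shows how the vector encodes both where an edge lies and how it is oriented.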

Table 5.4 Error Rates for Various Datasets using Direct Pixel Values

Dataset                 | Training Samples | Test Samples | Neural Net Configuration | Time for Recognition (training + test) (sec) | Percentage Error
------------------------|------------------|--------------|--------------------------|----------------------------------------------|-----------------
MNIST Handwritten digit | 50000            | 10000        | 784-100-10               | 4846                                         | 2.25
MNIST Handwritten digit | 50000            | 10000        | 784-200-100-10           | 9471                                         | 2.15
MNIST Handwritten digit | 50000            | 10000        | 784-200-10               | 7923                                         | 2.13
ISI Digit               | 18000            | 3500         | 784-100-10               | 1524                                         | 3.31
ISI Digit               | 18000            | 3500         | 784-200-100-10           | 3239                                         | 3.17
ISI Digit               | 18000            | 3500         | 784-200-10               | 2767                                         | 2.74
CPAR-2012 character     | 49000            | 29400        | 784-100-49               | 3903                                         | 21.54
CPAR-2012 character     | 49000            | 29400        | 784-200-49               | 8334                                         | 18.6
CPAR-2012 character     | 49000            | 29400        | 784-200-100-49           | 8794                                         | 17.21
CPAR-2012 Numeral       | 26250            | 8750         | 784-200-11               | 5422                                         | 2.53
CPAR-2012 Numeral       | 26250            | 8750         | 72-100-11                | 2153                                         | 2.8
CPAR-2012 Numeral       | 26250            | 8750         | 784-200-100-11           | 5560                                         | 2.77



Table 5.5 Error Rates for Various Datasets using proposed Feature Extraction method

Dataset                                                | Feature Extraction Method | Number of Features | NN Configuration | Time for Training + Testing (sec) | Error Rate
-------------------------------------------------------|---------------------------|--------------------|------------------|-----------------------------------|-----------
CPAR-2012 character                                    | Global Zone based Edge    | 72                 | 72-200-49        | 1941                              | 14.89
CPAR-2012 character                                    | Local Zone based Edge     | 72                 | 72-200-49        | 1864                              | 16.01
CPAR-2012 character                                    | Equal Zone                | 72                 | 72-200-49        | 2082                              | 21.35
ISI Digit                                              | Equal Zone Edge           | 72                 | 72-200-10        | 1224                              | 2.03
ISI Digit                                              | Global Zone based Edge    | 72                 | 72-200-10        | 1178                              | 1.83
ISI Digit                                              | Local Zone based Edge     | 72                 | 72-200-10        | 909                               | 2.14
CPAR-2012 Numeral (26250 training + 8750 test samples) | Global Zone based Edge    | 72                 | 72-200-10        | 756                               | 2.38
CPAR-2012 Numeral (26250 training + 8750 test samples) | Local Zone based Edge     | 72                 | 72-200-10        | 785                               | 1.93
CPAR-2012 Numeral (26250 training + 8750 test samples) | Equal Zone Edge           | 72                 | 72-200-10        | 956                               | 2.07

In this study, experiments are performed with a single-hidden-layer NN (with 200 or 100 hidden units) and a two-hidden-layer NN (with 200 and 100 hidden units respectively). The input layer has as many units as the size of the feature vector. The network is trained for 500 epochs with a mini-batch size of 100. The learning rate is kept constant at 1. The L2 weight-decay regularization coefficient is set to 1e-04. The activation function is the hyperbolic tangent in the hidden layers and the logistic function in the output layer. The momentum is set to 0.7. The error rates obtained for the various datasets using direct pixel intensities as features with MSGD are tabulated in Table 5.4. Results are then obtained for edge-based directional features, also with MSGD. Table 5.5 shows the performance of the feature extraction algorithms described in Table 2.2 with SGD-based learning of the MLP.
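The way the reported hyper-parameters (learning rate 1, momentum 0.7, L2 weight decay 1e-04) enter a weight update can be sketched as follows. This is one common formulation of momentum with weight decay and is not claimed to be the exact update used by the toolbox of [116]; the function name, the toy quadratic loss and the number of steps are assumptions.

```python
import numpy as np

LEARNING_RATE = 1.0
MOMENTUM = 0.7
WEIGHT_DECAY_L2 = 1e-4

def momentum_step(w, velocity, grad):
    """One update: add the L2 penalty gradient, scale by the learning rate, apply momentum."""
    step = LEARNING_RATE * (grad + WEIGHT_DECAY_L2 * w)
    velocity = MOMENTUM * velocity + step
    return w - velocity, velocity

# Toy problem: minimize 0.05 * ||w - target||^2, whose gradient is 0.1 * (w - target).
target = np.array([1.0, -2.0])
w = np.zeros(2)
v = np.zeros(2)
for _ in range(200):
    grad = 0.1 * (w - target)
    w, v = momentum_step(w, v, grad)
```

Because of the weight-decay term, `w` settles slightly short of `target` (at `target * 0.1 / 0.1001`), which is the shrinking effect on the weights described in Section 5.2.1.1.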

Mini-batch stochastic gradient descent is used to speed up recognition on large datasets. The results compared in Table 5.6 show that when preprocessing and feature extraction are combined with the mini-batch algorithm, the error rate falls by 1-3% relative to direct pixel intensity features. The proposed method gives recognition accuracy better than or equal to most of the standard benchmarks available for Devanagari characters. The recognition time cannot be compared with previously reported results because time was not considered as a criterion in previous research. The proposed method is faster than learning based on normal gradient descent and gives good accuracy even on direct pixel intensities. In the experiments performed by Kumar et al. [7], the average recognition rate is 95.18% with rejection for a single classifier and 97.87% with a rejection criterion for an ensemble of classifiers. For the same dataset, the proposed learning strategy gives a recognition rate of 98.07% without any rejection. The accuracy of the proposed methods is tabulated in the two "proposed method" columns of Table 5.6, and the "previously reported" column gives the results reported by other researchers. For the CPAR-2012 Numeral dataset, the accuracy obtained with the proposed features and SGD learning of the MLP is improved by 0.2%. These results are also stronger in that the results of other researchers are obtained with multiple classifiers together with rejection of samples. Kumar et al. [7] applied a classifier ensemble to the character dataset and obtained a recognition rate of 84.03% with a rejection of 5.3%. For the same dataset, the proposed learning method yields a recognition rate of 82.79% with direct pixel values and 85.11% with the feature extraction method, without any rejection of patterns. For the CPAR-2012 character dataset, an accuracy improvement of 1.08% is observed with the proposed feature set and SGD learning of the MLP. The proposed method also compares favourably with the method developed by Bhattacharya et al. [25] in terms of speed of recognition.

For ISI Devanagari numerals, the performance is compared with three previously reported results. The result reported by Das et al. [33] is better, but it was obtained on a small subset of the ISI data, whereas our method is tested on the complete ISI dataset. The result reported by Bhattacharya et al. [25] was obtained on the complete dataset, but with 0.24% rejection of patterns. The features used in that work come from complex feature extraction algorithms, the classifier is a multistage MLP, and patterns are rejected at the last stage of the multistage MLP. Considering the complexity of the algorithm, our results on the ISI data are therefore better.

Table 5.6 Performance comparison of the obtained results over previously reported results

Dataset                   | Approach                    | Feature                                               | Classifier              | Previously reported Recognition Percentage | Recognition Percentage of Proposed method (Direct Pixel as Features) | Recognition Percentage of Proposed method (Gradient Feature)
--------------------------|-----------------------------|-------------------------------------------------------|-------------------------|--------------------------------------------|------------------------------------------------------------------------|-------------------------------------------------------------
CPAR-2012 Digit [7]       | Classifier Combination (MV) | Direct Pixel + Profile + Gradient + Wavelet Transform | CCN, KNN, PRN, KNN, FFT | 97.87% (35000)                             | 97.47 (35000)                                                          | 98.07 (35000)
CPAR-2012 Character [32]  | Classifier Combination (MV) | Direct Pixel + Profile + Gradient + Wavelet Transform | CCN, KNN, PRN, KNN, FFT | 84.03% (79400)                             | 82.79% (79400)                                                         | 85.11 (79400)
ISI Devanagari Digit [33] | Feature Combination         | PCA/MPCA + QTLR                                       | SVM                     | 98.7% (3000) ISI data                      | 97.26 (22546) Full ISI Data                                            | 98.17 (22546) Full ISI Data
ISI Devanagari Digit [25] | Multi stage classifier      | Wavelet                                               | Multi stage             | 99.04 with 0.24% rejection                 | (as above)                                                             | (as above)
ISI Devanagari Digit [35] | Ensemble using AdaBoost     | Zernike Moments                                       | MLPs                    | 96.80 (single) (22546)                     | (as above)                                                             | (as above)

Abbreviations used in Table 5.6 are as follows: PCA: principal component analysis, KNN: K-nearest neighbor, FNN: feed-forward neural network, SVM: support vector machine, MLP: multilayer perceptron, CCN: cascade neural network, PRN: pattern recognition network, and FFT: function fitting neural network.
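The recognition percentages of the proposed method in Table 5.6 follow from the lowest error rates in Tables 5.4 and 5.5 as accuracy = 100 − error. The short check below reproduces this arithmetic; the dictionary layout and key names are hypothetical, and the numbers are copied from the tables.

```python
# Lowest error rate (%) per dataset, copied from Table 5.4 (direct pixels)
# and Table 5.5 (zone-based edge features).
errors = {
    "CPAR-2012 Digit": {"direct_pixel": 2.53, "gradient_feature": 1.93},
    "CPAR-2012 Character": {"direct_pixel": 17.21, "gradient_feature": 14.89},
    "ISI Devanagari Digit": {"direct_pixel": 2.74, "gradient_feature": 1.83},
}

# Recognition percentage = 100 - error percentage, rounded to two decimals.
accuracy = {
    dataset: {kind: round(100.0 - err, 2) for kind, err in row.items()}
    for dataset, row in errors.items()
}
```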

5.2.2 The Performance of Convolutional Neural Network

A Convolutional Neural Network connects each unit to a local region of the previous layer. CNNs exploit spatially local correlation by enforcing a local connectivity pattern between neurons of adjacent layers: each unit of a layer is connected to a subset of the units of the previous layer, and the number of such units defines the width of its receptive field. The second architecture is inspired by LeNet-5 [108]. This local connectivity is illustrated in Fig. 5.3.

Figure 5.3 Illustration of local connectivity pattern(Receptive Field)

Because CNNs share weights, the number of free parameters does not grow in proportion to the input dimensions as it does in standard multi-layer networks.

The layers of the CNN function as follows:

Convolutional Layer: performs a 2D filtering between the input images x and a bank of filters w, producing another set of images h. A connection table CT indicates the input-output correspondences; filter responses from inputs connected to the same output image are linearly combined. Each row in CT is a connection with the following semantics: (inputImage, filterId, outputImage). This layer performs the following mapping:

$$h_j = \sum_{(i,k)\,:\,(i,k,j) \in CT} x_i * w_k \qquad (5.9)$$

where * indicates the 2D valid convolution. Each filter $w_k$ of a particular layer has the same size and defines, together with the size of the input, the size of the output images $h_j$. Then a non-linear activation function (e.g. tanh, logistic) is applied to h, just as for standard multi-layer networks.
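The mapping of the convolutional layer can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `valid_conv2d` and `conv_layer`, the random filters, and the one-filter-per-output connection table are hypothetical, and tanh is chosen as the activation from the options named above.

```python
import numpy as np


def valid_conv2d(image, kernel):
    """2D 'valid' convolution: the flipped kernel only visits positions
    where it lies fully inside the image."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * flipped)
    return out


def conv_layer(inputs, filters, connection_table, n_outputs):
    """Sum x_i * w_k into h_j for every row (i, k, j) of the table, then apply tanh."""
    kh, kw = filters[0].shape
    out_shape = (inputs[0].shape[0] - kh + 1, inputs[0].shape[1] - kw + 1)
    h = [np.zeros(out_shape) for _ in range(n_outputs)]
    for input_id, filter_id, output_id in connection_table:
        h[output_id] += valid_conv2d(inputs[input_id], filters[filter_id])
    return [np.tanh(h_j) for h_j in h]


rng = np.random.default_rng(0)
x = [rng.random((28, 28))]                            # one input image
w = [rng.standard_normal((5, 5)) for _ in range(6)]   # six 5 x 5 filters
ct = [(0, k, k) for k in range(6)]                    # input 0 -> output k via filter k
maps = conv_layer(x, w, ct, n_outputs=6)
print(len(maps), maps[0].shape)  # 6 (24, 24)
```

With a 28 x 28 input and 5 x 5 filters, each output image is 24 x 24, as expected for a valid convolution.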

Pooling Layer: reduces the dimensionality of the input by a constant factor. The purpose of this layer is not only to reduce the computational burden, but also to perform feature selection. The input images are tiled into non-overlapping sub-regions, from each of which only one output value is extracted. Common choices are the maximum or the average, usually shortened to Max-Pooling and Avg-Pooling. Max-Pooling is generally preferable, as it introduces a small invariance to translation and distortion and leads to faster convergence and better generalization [117].
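The tiling described above can be sketched as follows. The function name `pool2d` and the small example feature map are assumptions made for illustration; rows or columns that do not fill a complete block are simply dropped in this sketch.

```python
import numpy as np


def pool2d(image, size=2, mode="max"):
    """Tile the image into non-overlapping size x size blocks and keep
    one value (maximum or average) per block.

    Rows/columns that do not fill a complete block are dropped, which is
    an assumption of this sketch.
    """
    h, w = image.shape
    h_out, w_out = h // size, w // size
    blocks = image[:h_out * size, :w_out * size].reshape(h_out, size, w_out, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'avg'")


feature_map = np.array([[1, 2, 0, 0],
                        [3, 4, 0, 8],
                        [5, 0, 1, 1],
                        [0, 0, 1, 1]], dtype=float)
max_pooled = pool2d(feature_map, 2, "max")   # block maxima: 4, 8, 5, 1
avg_pooled = pool2d(feature_map, 2, "avg")   # block means: 2.5, 2, 1.25, 1
print(max_pooled)
print(avg_pooled)
```

Each 2 x 2 block contributes one value, so a 4 x 4 map is reduced to 2 x 2.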



Fully Connected Layer: this is the standard layer of a multi-layer network. It performs a linear combination of the input vector with a weight matrix. Either the network alternates convolutional and max-pooling layers until a 1D feature vector is obtained at some stage, or the resulting images are rearranged into a 1D shape. The output layer is always a fully connected layer with as many neurons as there are classes in the classification task. The outputs are normalized with a softmax activation function and therefore approximate posterior class probabilities.

A convolution layer has several parameters: the size and number of feature maps, the kernel size, the skipping factors and the connection table. In this work we used a kernel size of 5x5, with 6 feature maps in the first convolution layer and 12 in the second. The skipping factors specify the number of pixels, horizontally and vertically, by which the kernel skips between subsequent convolutions. If layer n has $M^n$ feature maps of size $(M_x^n, M_y^n)$, kernels of size $(K_x^n, K_y^n)$ and skipping factors $(S_x^n, S_y^n)$, the size of the output maps is given by

$$M_x^n = \frac{M_x^{n-1} - K_x^n}{S_x^n + 1} + 1, \qquad M_y^n = \frac{M_y^{n-1} - K_y^n}{S_y^n + 1} + 1 \qquad (5.10)$$

where the index n indicates the layer. Each map in layer $L^n$ is connected to at most $M^{n-1}$ maps in layer $L^{n-1}$. Neurons of a given map share their weights but have different receptive fields.
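As a worked example of the output-size relation, the sketch below traces the map size through two convolution stages with 5x5 kernels, each followed by 2x2 pooling. It assumes that the skipping factor counts the pixels skipped between kernel positions (so a skipping factor of 0 means adjacent positions) and that pooling halves each dimension; the function names are illustrative.

```python
def conv_output_size(m_in, k, s=0):
    """Output map size along one axis for a valid convolution.

    `s` is the skipping factor, taken here to be the number of pixels
    skipped between kernel positions (s = 0 means adjacent positions).
    This convention is an assumption of the sketch.
    """
    span = m_in - k
    if span < 0 or span % (s + 1) != 0:
        raise ValueError("kernel and skipping factor do not tile the input")
    return span // (s + 1) + 1


def trace_sizes(input_size, kernel=5, pool=2, n_stages=2):
    """Map sizes after each convolution and pooling stage."""
    sizes = [input_size]
    for _ in range(n_stages):
        sizes.append(conv_output_size(sizes[-1], kernel))  # convolution
        sizes.append(sizes[-1] // pool)                    # pooling
    return sizes


print(trace_sizes(28))  # [28, 24, 12, 8, 4]
print(trace_sizes(32))  # [32, 28, 14, 10, 5]
```

Under these assumptions a 28 x 28 input ends as 4 x 4 maps, so with 12 maps in the second convolution layer the fully connected layer would receive 12 x 4 x 4 = 192 values.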

Figure 5.4 The Convolutional Neural Network architecture

The CNN experiments are performed on mini-batches of input patterns, using the architecture shown in Fig. 5.4. The input layer has a feature map equal in size to the normalized image of the character. In this study the weights are sampled randomly from a uniform distribution in the range [-1/fan-in, 1/fan-in], where fan-in is the number of inputs to a hidden unit. For CNNs this depends on the number of input feature maps and the size of the receptive fields. Max-pooling is applied as a non-linear down-sampling: the input image is partitioned into a set of non-overlapping rectangles and, for each such sub-region, the maximum value is output. Max-pooling is useful in vision for two reasons: (1) it reduces the computational complexity for upper layers and (2) it provides a form of translation invariance.

The first layer is a convolution layer consisting of 6 feature maps obtained by convolving a kernel of size 5x5 with the input feature map. The second layer is a sub-sampling layer that down-samples the image using max-pooling over 2x2 areas of its input. The third layer is a convolution layer which convolves the image from the previous layer with a kernel of size 5x5; the number of feature maps obtained at this stage is 12. In the fourth layer, sub-sampling by max-pooling is again performed over areas of size 2x2. The fifth layer is a fully connected layer which drives a feed-forward classifier that gives the output. Layers 1-4 form the trainable feature extraction part and the succeeding layer forms the classifier.

The handwritten images are divided into a training set and a testing set. All digit images have been size-normalized and centered in a fixed-size image of 28 x 28 pixels. In the original dataset each pixel of the image is represented by a value between 0 and 255, where 0 is black, 255 is white and anything in between is a shade of grey. An image is represented as a 1-dimensional array of 784 (28 x 28) float values between 0 and 1 (0 stands for black, 1 for white). When using the dataset, we divide the training and test sets into mini-batches. The results of the Convolutional Neural Network are shown in Table 5.7. Column 2 gives the recognition rate when the first and second convolution layers of the architecture have 6 and 12 feature maps respectively.
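The input preparation and weight initialisation described above can be sketched as follows. The data here are random stand-ins rather than any of the datasets used in this chapter, the helper names are hypothetical, and the mini-batch size of 100 is an illustrative choice (the text does not state the value used for the CNN).

```python
import numpy as np


def preprocess(images_uint8):
    """Scale 0..255 pixel values to floats in [0, 1] and flatten each
    28 x 28 image to a 784-element vector."""
    x = np.asarray(images_uint8, dtype=np.float64) / 255.0
    return x.reshape(len(x), -1)


def minibatches(x, batch_size):
    """Yield consecutive mini-batches; a shorter final batch is kept."""
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size]


def init_conv_weights(rng, n_out_maps, n_in_maps, kh, kw):
    """Sample kernels uniformly from [-1/fan_in, 1/fan_in], where fan_in
    is the number of inputs feeding one hidden unit."""
    fan_in = n_in_maps * kh * kw
    bound = 1.0 / fan_in
    return rng.uniform(-bound, bound, size=(n_out_maps, n_in_maps, kh, kw))


rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(250, 28, 28), dtype=np.uint8)  # stand-in data
x = preprocess(fake_images)
batches = list(minibatches(x, batch_size=100))
w1 = init_conv_weights(rng, n_out_maps=6, n_in_maps=1, kh=5, kw=5)    # fan-in 25
w2 = init_conv_weights(rng, n_out_maps=12, n_in_maps=6, kh=5, kw=5)   # fan-in 150
print(x.shape, [len(b) for b in batches])  # (250, 784) [100, 100, 50]
```

The fan-in of the second convolution layer is larger because each of its units sees a 5x5 window in all 6 input maps, which narrows the sampling range accordingly.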

Table 5.7 Experimental results of Convolutional Neural Network with 784 features

| Dataset | Feature maps 6-12: Recognition (%) | Time | Feature maps 12-24: Recognition (%) | Time |
|---|---|---|---|---|
| CPAR-2012 Char | 81.11 | 20892 | 84.21 | 68182 |
| CPAR-2012 Number | 96.95 | 14761 | 97.66 | 53251 |
| MNIST Digit | 98.96 | 32193 | 99.21 | 104469 |
| ISI Numeral | 97.8 | 10951 | 98.11 | 42078 |



The experiments are also repeated with the number of feature maps in the two convolution layers increased to 12 and 24; column 4 of Tables 5.7 and 5.8 reports the corresponding results. Table 5.8 gives the results for images with 1024 input features. The recognition times reported are the total time for training and testing.

Table 5.8 Experimental results of Convolutional Neural Network with 1024 features

| Dataset | Feature maps 6-12: Recognition (%) | Time (s) | Feature maps 12-24: Recognition (%) | Time (s) |
|---|---|---|---|---|
| CPAR-2012 Char | 86.48 | 68292 | 90.58/9.04 | 243102 |
| CPAR-2012 Number | 97.5 | 48945 | 97.47 | 126306 |
| MNIST | 99.11 | 115558 | 99.18 | 290491 |
| ISI Numeral | 97.97 | 34476 | 98.17/1.6 | 94799 |

The experimental results for the available datasets compare favourably with previously reported results. On the CPAR digit dataset, the highest previously reported recognition rate is 97.87% by Rajiv Kumar et al., while on the character dataset the highest reported recognition rate was 84.03%, obtained with rejection of patterns. In our experiments, the proposed Convolutional Neural Network architecture achieves a recognition rate of 90.58% on characters without any rejection of patterns. For digits our recognition rate is 97.66%, slightly lower than the result reported by Kumar et al.; however, our result is obtained without any rejection, whereas Kumar et al. rejected degraded patterns in their experiments. Furthermore, our method does not involve complex feature extraction or a classifier ensemble. Better results may be obtained if a combination of classifiers is used in place of a single classifier. Our results on the ISI-CVPR dataset are also better than those of the approach proposed by Jindal et al. [35]: they reported a recognition rate of 96.8%, whereas the CNN achieves 98.11%.

A Convolutional Neural Network based classifier is implemented for the Devanagari character and numeral datasets. The CNN is a deep architecture inspired by the human visual system. It embeds a trainable feature detector and a classifier in a single model, and takes the character image directly as input in pixel form. The proposed deep learning architecture is quite effective on the Devanagari character/numeral datasets. However, the results could be improved further by classifier fusion, which is a future direction for the proposed architecture.



5.2.3 Deep Belief Networks

Deep belief networks (DBNs) are probabilistic generative models composed of multiple layers of stochastic latent variables; the latent variables typically have binary values and are often called hidden units or feature detectors. DBNs are constructed as hierarchies of recurrently connected simpler probabilistic graphical models, called Restricted Boltzmann Machines (RBMs). Each RBM receives its input from the previous layer and provides the input to the next layer. Every RBM has two layers of neurons, a hidden and a visible layer, which are fully and symmetrically connected between layers but not connected within a layer. The RBMs are trained in a greedy layer-wise fashion using unsupervised learning. Each RBM encodes in its weight matrix a probability distribution that predicts the activity of the visible layer from the activity of the hidden layer. By stacking such models, each layer predicts the activity of the layer below, and higher RBMs learn increasingly abstract representations of the sensory inputs, which matches well with the representations learned by neurons in higher brain regions, e.g., of the visual cortical hierarchy [118] [119]. The success of deep learning rests on unsupervised layer-by-layer pre-training with the Contrastive Divergence (CD) algorithm [120].

5.2.3.1 Restricted Boltzmann Machines

A Restricted Boltzmann Machine is a two-layer architecture made up of binary visible units and binary hidden units connected to each other. An RBM is a generative undirected graphical model.

Figure 5.5 The connection diagram of Restricted Boltzmann Machine

The lower layer, x, is the visible layer and the top layer, h, is the hidden layer. The visible and hidden layer units x and h are stochastic binary variables. The weights between the visible layer and the hidden layer are undirected and are denoted W. In addition, each neuron has a bias. The model defines the probability distribution



$$P(x, h) = \frac{e^{-E(x,h)}}{Z} \qquad (5.11)$$

where the energy function E(x, h) and the partition function Z are defined by

$$E(x, h) = -b'x - c'h - h'Wx \qquad (5.12)$$

$$Z = \sum_{x,h} e^{-E(x,h)} \qquad (5.13)$$

Here b and c are the biases of the visible layer and the hidden layer respectively, and the sum over x and h runs over all possible states of the model.

The conditional probability of one layer given the other is

$$P(h \mid x) = \frac{\exp(b'x + c'h + h'Wx)}{\sum_{h} \exp(b'x + c'h + h'Wx)} \qquad (5.14)$$
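For a very small RBM the conditional distribution can be evaluated by direct enumeration, which gives a simple numerical check of the energy-based definition. The sketch below uses 4 visible and 3 hidden units with random parameters; these sizes and all names are assumptions chosen only to keep the enumeration cheap. It also checks the standard consequence of the missing within-layer connections: the conditional factorises over hidden units, with P(h_j = 1 | x) = sigmoid(c_j + W_j x).

```python
import itertools
import numpy as np


def energy(x, h, W, b, c):
    """E(x, h) = -b'x - c'h - h'Wx for binary vectors x and h."""
    return -(b @ x) - (c @ h) - (h @ W @ x)


def conditional_h_given_x(x, W, b, c):
    """P(h | x) for every binary h, by direct normalisation over h."""
    n_hidden = len(c)
    states = [np.array(s, dtype=float)
              for s in itertools.product([0, 1], repeat=n_hidden)]
    unnorm = np.array([np.exp(-energy(x, h, W, b, c)) for h in states])
    return states, unnorm / unnorm.sum()


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 3
W = rng.standard_normal((n_hidden, n_visible))  # W[j, i] couples h_j and x_i
b = rng.standard_normal(n_visible)              # visible biases
c = rng.standard_normal(n_hidden)               # hidden biases
x = np.array([1.0, 0.0, 1.0, 1.0])

states, p = conditional_h_given_x(x, W, b, c)
# Marginal P(h_j = 1 | x) obtained from the enumerated distribution ...
marginal = sum(p_h * h for h, p_h in zip(states, p))
# ... agrees with the factorised form sigmoid(c_j + W_j x).
print(np.allclose(marginal, sigmoid(c + W @ x)))  # True
```

The term b'x cancels between numerator and denominator, which is why only the hidden biases and the weights appear in the factorised form.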

For the recognition of handwritten digits, the layers of the deep model are first trained in an unsupervised manner rather than by supervised learning alone. The lowest level is trained on the images of the characters, whereas the higher levels learn representations of the level below, so that the combined model can then act as a supervised model. The DBN is made up of two hidden layers with 400 and 300 units respectively. Learning is done without momentum. The mini-batch size is 100. Training is run for 200 epochs, and the learning rates (supervised and unsupervised) are set to 1. Table 5.9 shows the performance of the DBN in terms of error rate, using the direct pixel intensities as features. The feature sizes given in column 2 of the table correspond to the two image sizes.
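The unsupervised pre-training of a single RBM layer can be sketched with one-step Contrastive Divergence (CD-1). This is a minimal illustration, not the implementation used for the reported experiments: it assumes CD with a single Gibbs step, uses synthetic binary patterns in place of character images, runs only a few epochs, and uses a smaller learning rate than the value of 1 stated above because the data are synthetic. The layer sizes (784 visible, 400 hidden) and the mini-batch size of 100 follow the text.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def cd1_update(x0, W, b, c, lr, rng):
    """One CD-1 parameter update on a mini-batch x0 (rows are patterns).

    W has shape (n_hidden, n_visible); b and c are the visible and
    hidden biases. Returns the mean squared reconstruction error.
    """
    # Positive phase: hidden probabilities and a binary sample given the data.
    ph0 = sigmoid(x0 @ W.T + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    px1 = sigmoid(h0 @ W + b)
    ph1 = sigmoid(px1 @ W.T + c)
    # Gradient estimate: data statistics minus reconstruction statistics.
    n = x0.shape[0]
    W += lr * (ph0.T @ x0 - ph1.T @ px1) / n
    b += lr * (x0 - px1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return float(np.mean((x0 - px1) ** 2))


rng = np.random.default_rng(0)
n_visible, n_hidden, batch_size = 784, 400, 100
data = (rng.random((500, n_visible)) < 0.2).astype(float)  # synthetic binary patterns
W = 0.01 * rng.standard_normal((n_hidden, n_visible))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

errors = []
for epoch in range(5):  # a handful of epochs, only to exercise the update
    for start in range(0, len(data), batch_size):
        err = cd1_update(data[start:start + batch_size], W, b, c, lr=0.1, rng=rng)
    errors.append(err)
print(errors)
```

In a DBN, the hidden probabilities of a trained layer would then serve as the input data for the next RBM, which is the greedy layer-wise scheme described in Section 5.2.3.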

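Table 5.9 Performance of DBN for Devanagari Characters/Numerals

| Dataset | Number of features | Number of hidden layers | Error rate (%) |
|---|---|---|---|
| ISI Numeral | 784 | 2 | 5.03 |
| ISI Numeral | 1024 | 2 | 3.94 |
| CPAR-2012 Numeral | 784 | 2 | 4.78 |
| CPAR-2012 Numeral | 1024 | 2 | 5.05 |
| CPAR-2012 Character | 784 | 2 | 19.03 |
| CPAR-2012 Character | 1024 | 2 | 18.47 |
| MNIST-Digit | 784 | 2 | 1.68 |
| MNIST-Digit | 1024 | 2 | 1.97 |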

5.2.4 Stacked Denoising Autoencoder

The stacked denoising autoencoder (SDA) is obtained by stacking denoising autoencoders, whose hidden layers are designed to detect more robust features. Encoding and decoding operations are the main building blocks of the SDA. Each layer is trained separately to minimize the reconstruction error. The training process of this network has two stages: layer-wise pre-training followed by fine-tuning. Training is done using a corrupted version of the patterns: the encoder first encodes the corrupted input, and the decoder then tries to reconstruct the original input from this code by undoing the corruption. The corruption is applied stochastically. The stochastic corruption process [121] consists of randomly setting some of the inputs (as many as half of them) to zero. Hence the denoising autoencoder tries to predict the corrupted (i.e., missing) values from the uncorrupted (i.e., non-missing) values, for randomly selected subsets of missing patterns.

Let the input be $x \in [0,1]^d$. The encoder takes the input x and encodes it into the hidden representation $y \in [0,1]^{d_1}$, where $d$ and $d_1$ are the respective dimensions:

$$y = s(Wx + b) \qquad (5.15)$$

where s is a non-linearity such as the sigmoid. The latent representation y is mapped back (with a decoder) into a reconstruction z of the same shape as x through a similar transformation:

$$z = s(W'y + b') \qquad (5.16)$$

W and W' are the weights. The parameters of this model, W, W', b and b', are optimized during training using the mean reconstruction error. We took $W' = W^T$ (tied weights) in our experiments. Note that y captures the main factors of variation in the data. The single hidden layer of the SDA is trained to minimize the mean squared error of that layer. Fine-tuning of the parameters is done using supervised learning. The SDA is made up of a single hidden layer with 400 units. Learning is done without momentum. The mini-batch size is 100. Training is run for 100 epochs and the learning rate is set to 1. The MLP consists of two hidden layers with 400 and 300 units respectively. As before, the input is the pixel intensities for the two image sizes.
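The corruption, encoding and decoding steps of a single denoising autoencoder layer can be sketched as below. The sketch covers only the forward pass and the reconstruction error, with tied weights as stated above; the corruption level of 0.3, the random stand-in inputs and the function names are assumptions made for illustration.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def corrupt(x, corruption_level, rng):
    """Stochastically set a fraction of the inputs to zero (masking noise)."""
    keep = rng.random(x.shape) >= corruption_level
    return x * keep


def dae_forward(x, W, b, b_prime, corruption_level, rng):
    """Encode a corrupted input and decode with tied weights W' = W^T."""
    x_tilde = corrupt(x, corruption_level, rng)
    y = sigmoid(x_tilde @ W.T + b)     # hidden representation, eq. (5.15)
    z = sigmoid(y @ W + b_prime)       # reconstruction, eq. (5.16)
    # Squared error is measured against the clean input, not the corrupted one.
    loss = float(np.mean(np.sum((x - z) ** 2, axis=1)))
    return y, z, loss


rng = np.random.default_rng(0)
d, d_hidden = 784, 400
x = rng.random((100, d))                           # stand-in mini-batch in [0, 1]
W = rng.uniform(-0.05, 0.05, size=(d_hidden, d))   # shape (d_1, d)
b, b_prime = np.zeros(d_hidden), np.zeros(d)
y, z, loss = dae_forward(x, W, b, b_prime, corruption_level=0.3, rng=rng)
print(y.shape, z.shape)  # (100, 400) (100, 784)
```

Pre-training would adjust W, b and b' to reduce this loss over the mini-batches; after that, the hidden representation y feeds the supervised fine-tuning stage.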

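Table 5.10 Performance of Stacked Denoising Autoencoder

| Dataset | Number of features | Number of hidden layers | Error rate (%) |
|---|---|---|---|
| ISI Numeral | 784 | 1 | 3.57 |
| ISI Numeral | 1024 | 1 | 3.77 |
| CPAR-2012 Numeral | 784 | 1 | 4.38 |
| CPAR-2012 Numeral | 1024 | 1 | 4.35 |
| CPAR-2012 Character | 784 | 1 | 16.12 |
| CPAR-2012 Character | 1024 | 1 | 16.78 |
| MNIST-Digit | 784 | 1 | 1.68 |
| MNIST-Digit | 1024 | 1 | 1.71 |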



The performance of all the deep architectures on the various datasets is compared in Figure 5.6. Four classifiers, namely the Deep Neural Network (NN), Deep Belief Network (DBN), Convolutional Neural Network (CNN) and Stacked Denoising Autoencoder (SDA), are shown for the different datasets.

Figure 5.6 Performance comparison of deep architectures (test error rate of NN, DBN, CNN and SDA on the CPAR-Digit, ISI-Digit, CPAR-Char and MNIST-Digit datasets)

5.3 Conclusion

This chapter presented an experimental evaluation of deep learning tools for Devanagari handwritten characters and numerals. The main advantage of the approach used in this part of the work is that it avoids the costly feature extraction step of conventional pattern recognition approaches and performs as well as, or in some cases better than, previously reported studies. The previously reported results were obtained using computationally expensive feature extraction or classifier combination strategies, whereas the present study is free from complex preprocessing and feature extraction algorithms. For Devanagari characters, the recognition rate improves by about 6.5 percentage points over the previously reported result (90.58% against 84.03%). When the various deep learning methods are compared, the CNN-based classifier gives the best recognition performance on all the datasets. However, training the CNN is costly and time consuming, so the other configurations may be preferred when time is the major consideration.

The work presented in this chapter led to the following publications:

1. Pratibha Singh, Ajay Verma, and Narendra S. Chaudhari, "On the Performance Improvement of Devanagari Handwritten Character Recognition," Applied Computational Intelligence and Soft Computing (Hindawi), vol. 2015, Article ID 193868, 12 pages, 2015. doi:10.1155/2015/193868.

2. Pratibha Singh, Ajay Verma, and Narendra S. Chaudhari, "Performance Comparison of Neural Network architecture for Handwritten Devanagari Character Recognition". Under review.
