CHAPTER-5 DEEP LEARNING BASED APPROACHESshodhganga.inflibnet.ac.in/bitstream/10603/114183/5/... ·...
[Chapter- 5: Deep Learning Based Approaches]
Investigations in Devanagari Handwriting Recognition 105
CHAPTER-5
DEEP LEARNING BASED APPROACHES
Chapter Outline
This chapter describes the implementation details of deep learning neural network based
approaches to Devanagari character recognition. An advantage of deep learning
techniques is that they are inspired by human vision. Four deep learning based NN models
are built, namely a Feedforward NN, a Convolutional NN, a Deep Belief Network and a Stacked
Denoising Autoencoder. The performance of shallow and deep neural network architectures
is compared.
5. Deep learning Tools for Devanagari Character Recognition
The three-layer networks presented in the previous chapter belong to the category of
shallow networks. The input given to these networks was a set of specially crafted
features. In this chapter we introduce deep networks, which have multiple hidden layers
that allow them to compute much more complex features of the input. Because each hidden
layer computes a non-linear transformation of the previous layer, a deep network can have
significantly greater representational power (i.e., can learn significantly more complex
functions) than a shallow one. By using a deep network, in the case of images, one can
also start to learn part-whole decompositions. For example, the first layer might learn to
group together pixels in an image in order to detect edges. The second layer might then
group together edges to detect longer contours, or perhaps detect simple "parts of
objects." An even deeper layer might then group together these contours or detect even
more complex features. There is, however, a difficulty with purely supervised training of
such networks: it relies only on labeled data. Labeled data is often scarce, and thus for
many problems it is difficult to get enough examples to fit the parameters of a complex
model. Given the high degree of expressive power of deep networks, training on
insufficient data would also result in over-fitting. Moreover, plain gradient descent is
not well suited to the training of deep networks because of the problems of 'local optima'
and 'diffusion of gradients' [106].
A method that has seen some success for the training of deep networks is greedy
layer-wise training. The main idea is to train the layers of the network one at a
time, so that we first train a network with 1 hidden layer, and only after that is done,
train a network with 2 hidden layers, and so on. At each step, we take the old network
with k − 1 hidden layers, and add an additional k-th hidden layer (that takes as input the
previous hidden layer k − 1 that we had just trained). Training can be supervised (say,
with classification error as the objective function on each step), but more frequently it
is unsupervised (as in the case of an autoencoder). The weights from training the layers
individually are then used to initialize the weights in the final/overall deep network, and
only then is the entire architecture "fine-tuned" (i.e., trained together to optimize the
labeled training set error). The success of greedy layer-wise training has been attributed
to a number of factors. In the following we discuss the deep architectures used in this
work. Four deep models are experimented with on the patterns, namely a deep architecture
of neural network (NN), the Convolutional Neural Network (CNN), the Deep Belief Network
(DBN) and the Stacked Denoising Autoencoder (SAE).
Multi-layer perceptrons (MLPs) are quite efficient at learning complex, high-dimensional
problems. An MLP consists of several fully connected hidden layers between the input and
output layers. However, for patterns of higher dimension the training of an MLP becomes
costly and time consuming, especially in image processing applications with a large
number of inputs. To alleviate this difficulty of the over-parameterized NN for
high-dimensional image processing applications, a neural network model is implemented
that is inspired by the architecture of the Convolutional Neural Network [107]. The
Convolutional Neural Network is a system which incorporates both feature extraction and
classification. The arrangement of convolution and sub-sampling layers builds the
feature detector part, which is trainable. The Convolutional Neural Network has been
applied to character recognition directly on image pixels by some researchers [108].
The method exploits the shared weights, the small number of learnable parameters and the
deep learning concept of the Convolutional Neural Network. The main advantage of the
Convolutional Neural Network is that it is tolerant of small basic transformations such as
slight rotation and translation, due to the sub-sampling and pooling operations performed
at various layers of the network.
Convolutional Neural Networks (CNNs) are inspired by the biological visual system and are
basically a variant of the Multilayer Perceptron (MLP). There exists a complex arrangement
of cells within the visual cortex [109]. Many such neurally inspired models exist, like the
NeoCognitron [110], HMAX [111] and LeNet-5 [107]. The standard MLP ignores the
topology present in the input, whereas the CNN exploits this topological information.
According to the model of the Convolutional Neural Network, the input image is assumed to
be composed of small sub-regions called "receptive fields", and the filters are sensitive
to these sub-regions of the input space in such a way as to cover the entire visual field.
These filters are local in input space and are thus better suited to exploit the strong
spatially local correlation present in natural images. There are two types of cells in a
CNN, namely simple cells (S) and complex cells (C). Simple cells respond maximally to
specific edge-like stimulus patterns within their receptive field. Complex cells have
larger receptive fields and are locally invariant to the exact position of the stimulus.
The major advantage of the CNN is that it requires only a few parameters to be learned,
because it is made up of translated versions of the same basis function. Hence the CNN is
capable of producing considerable recognition performance with only a few samples.
Convolutional Neural Networks have been widely used for many recognition problems, such as
MNIST handwritten digit recognition [112], object recognition [113] and natural language
processing [114].
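To make the weight-sharing and pooling argument concrete, the following minimal sketch slides one shared 2 x 2 filter over a toy 6 x 6 image and then applies 2 x 2 max pooling. It is not the implementation used in this work; the function names, the toy images and the edge filter are illustrative assumptions.

```python
def correlate2d_valid(image, kernel):
    """Slide one shared kernel over the image ('valid' positions only)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows) for j in range(k_cols))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling (sub-sampling) with a size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# Toy data (assumed): a vertical edge, and the same edge shifted right by one pixel.
edge = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
shifted_edge = [[0, 0, 0, 0, 1, 1] for _ in range(6)]
vertical_edge_filter = [[-1, 1], [-1, 1]]  # only 4 shared weights

pooled = max_pool(correlate2d_valid(edge, vertical_edge_filter))
pooled_shifted = max_pool(correlate2d_valid(shifted_edge, vertical_edge_filter))

# For comparison: a fully connected layer mapping 36 pixels to 25 outputs.
fully_connected_weights = 36 * 25
```

In this toy case both pooled maps come out identical even though the edge moved by one pixel, which illustrates the local position tolerance of the complex (pooling) cells, and the filter needs 4 weights where the fully connected mapping would need 900.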
5.1 The Role of Data Preprocessing
Data preprocessing plays a very important role in many deep learning algorithms. In
practice, many methods work best after the data has been normalized and whitened. However,
the exact parameters for data preprocessing are usually not immediately apparent unless
one has much experience working with the algorithms. The standard first step of data
preprocessing is data normalization. While there are a few possible approaches, the
appropriate one is usually clear from the data. The common methods for feature
normalization are:
• Simple Rescaling
• Per-example mean subtraction (a.k.a. remove DC)
• Feature Standardization (zero-mean and unit variance for each feature across the dataset)
5.1.1 Simple Rescaling
In simple rescaling, our goal is to rescale the data along each data dimension (possibly
independently) so that the final data vectors lie in the range [0, 1] or [−1, 1]
(depending on the dataset). This is useful for later processing, as many default
parameters treat the data as if it has been scaled to a reasonable range.
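As a minimal sketch of simple rescaling (the function name and the sample values are assumptions made for illustration), a list of raw values can be mapped linearly onto a target range as follows:

```python
def rescale(values, low=0.0, high=1.0):
    """Linearly map values so that min(values) -> low and max(values) -> high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant input: every value maps to the lower bound.
        return [low for _ in values]
    return [low + (v - v_min) / (v_max - v_min) * (high - low) for v in values]


grey_levels = [0.0, 127.5, 255.0]          # assumed 8-bit style intensities
unit_range = rescale(grey_levels)          # target range [0, 1]
symmetric_range = rescale(grey_levels, -1.0, 1.0)  # target range [-1, 1]
```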
5.1.2 Per-Example Mean Subtraction
In character recognition applications we are not interested in the illumination conditions
of the image, but rather in its content, so removing the average pixel value per data
point makes sense. For such images, this normalization has the property of removing the
average brightness (intensity) of the data point. More generally, if the data is
stationary (i.e., the statistics for each data dimension follow the same distribution), it
can be further pre-processed by subtracting the mean value of each example (computed per
example).
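The effect of per-example mean subtraction can be seen in a small sketch (the function name and the toy pixel values are assumptions): two copies of the same pattern that differ only in overall brightness become identical after the mean of each example is removed.

```python
def remove_dc(example):
    """Subtract the example's own mean value from every component."""
    mean = sum(example) / len(example)
    return [value - mean for value in example]


dark_patch = [1.0, 2.0, 3.0]
bright_patch = [11.0, 12.0, 13.0]   # same pattern, uniformly brighter

dark_centered = remove_dc(dark_patch)
bright_centered = remove_dc(bright_patch)
```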
5.1.3 Feature Standardization
Feature standardization refers to (independently) setting each dimension of the data to
have zero mean and unit variance. This is the most common method for normalization and
is widely used. In practice, one achieves this by first computing the mean of
each dimension (across the dataset) and subtracting it from each dimension. Next, each
dimension is divided by its standard deviation.
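The two steps described above can be sketched as follows. This is an illustrative version only: the function name, the use of the population standard deviation and the guard for constant dimensions are assumptions, not details taken from this work.

```python
import math


def standardize(dataset):
    """Give every dimension (column) zero mean and unit variance across the dataset."""
    n = len(dataset)
    dims = len(dataset[0])
    means = [sum(row[d] for row in dataset) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((row[d] - means[d]) ** 2 for row in dataset) / n)
        for d in range(dims)
    ]
    # Assumed guard: leave constant dimensions unscaled instead of dividing by zero.
    stds = [s if s > 0 else 1.0 for s in stds]
    scaled = [[(row[d] - means[d]) / stds[d] for d in range(dims)] for row in dataset]
    return scaled, means, stds


toy_data = [[1.0, 10.0], [3.0, 30.0]]
scaled_data, feature_means, feature_stds = standardize(toy_data)
```

The returned means and standard deviations are kept so that the same transformation, computed on the training set, can later be applied to the test set.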
5.2 The Proposed Approach
The overall workflow of the proposed approach is shown in Fig. 5.1. The
pipeline of the proposed approach mainly consists of the following steps:
1. Acquire/collect the images of handwritten characters and divide them into
training and test datasets.
2. Apply the preprocessing technique to both the training and test sets.
3. Normalize the data so that it lies in the range 0-1.
4. Divide the training data into mini-batches of suitable size decided using heuristic.
5. Use unlabelled data for feature dimension reduction.
6. Train the deep network model using labeled data.
7. Use the trained model for classification.
Handwritten images are divided into the training set and the testing set. All digit images
have been size-normalized and centered in a fixed size image of 28 x 28 pixels. In the
original dataset each pixel of the image is represented by a value between 0 and 255,
where 0 is black, 255 is white and anything in between is a different shade of grey. An
image is represented as a 1-dimensional array of 784 (28 x 28) float values between 0 and 1
(0 stands for black, 1 for white). When using the dataset, we divide the training and test
sets into mini-batches.
Figure 5.1 Overall workflow of proposed approach
(Figure blocks: Input Dataset Images; Training set; Test set; Preprocessing; Preprocessed
Training data; Preprocessed Test data; Feature Extraction using unsupervised training data;
Training using supervised Data and minibatches; Deep Neural Network; Classification of test
data using Trained model)
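The data representation described in Section 5.2 (a 28 x 28 image with values 0-255 turned into 784 floats in [0, 1], then grouped into mini-batches) can be sketched as below. The helper names and the batch size of 100 are assumptions for illustration; the text only states that the batch size is chosen heuristically.

```python
def flatten_and_scale(image):
    """Turn a 2-D list of 0-255 pixel values into a flat list of floats in [0, 1]."""
    return [pixel / 255.0 for row in image for pixel in row]


def minibatches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


# Toy stand-in for a dataset: 250 all-white 28 x 28 images (assumed values).
white_image = [[255] * 28 for _ in range(28)]
dataset = [flatten_and_scale(white_image) for _ in range(250)]

batch_sizes = [len(batch) for batch in minibatches(dataset, 100)]
```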
5.2.1 The Deep NN- Classifier Model
A Multi Layer Perceptron (MLP) is used as the classifier. The architecture of the MLP
used in this work consists of an input layer, an output layer and hidden layers. A single
hidden layer perceptron is a universal approximator in many pattern recognition
applications. The output vector of a single hidden layer perceptron is given by
$$f(x) = G\left(b^{(2)} + W^{(2)}\, s\left(b^{(1)} + W^{(1)} x\right)\right) \qquad (5.1)$$
where $b^{(1)}$, $b^{(2)}$ are the bias vectors at the hidden and output layers, $W^{(1)}$, $W^{(2)}$ are the
weight matrices at the respective layers and $s$, $G$ are the activation functions. For a
classification problem, let $(x^{(i)}, y^{(i)})$ be the i-th training pair, where $x^{(i)} \in \mathbb{R}^{D}$ is a
D-dimensional training vector and $y^{(i)} \in \{1, \ldots, L\}$ is its class label. For the prediction
function $f(x)$ given in Eqn. 5.1, the zero-one loss function is given by
$$\ell_{0,1} = \sum_{i=0}^{|\mathcal{D}|} I_{f(x^{(i)}) \neq y^{(i)}} \qquad (5.2)$$
where $\mathcal{D}$ is the training set and $I$ is the indicator function given by
$$I_{x} = \begin{cases} 1 & \text{if } x \text{ is true} \\ 0 & \text{otherwise} \end{cases} \qquad (5.3)$$
and the predicted class is
$$f(x) = \arg\max_{k} P(Y = k \mid x, \theta) \qquad (5.4)$$
where $\theta$ is the set of all parameters of the given model. The objective of training is to
minimize the loss function. However, the zero-one loss function is not differentiable, so
the negative log-likelihood (NLL) is minimized instead as the objective of the
training:
$$\mathrm{NLL}(\theta, \mathcal{D}) = -\sum_{i=0}^{|\mathcal{D}|} \log P(Y = y^{(i)} \mid x^{(i)}, \theta) \qquad (5.5)$$
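Equations 5.1 to 5.5 can be illustrated with a small numerical sketch. Several choices here are assumptions made only for the example: tanh is used for the hidden activation s, softmax for the output activation G so that the outputs can be read as class probabilities, class labels are numbered from 0, and the toy weights are arbitrary.

```python
import math


def affine(weights, bias, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_probabilities(x, w1, b1, w2, b2):
    """Eqn. 5.1 with s = tanh and G = softmax (both assumed for this sketch)."""
    hidden = [math.tanh(a) for a in affine(w1, b1, x)]
    return softmax(affine(w2, b2, hidden))


def predict(probabilities):
    """Eqn. 5.4: index of the most probable class."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def zero_one_loss(all_probabilities, labels):
    """Eqn. 5.2: number of misclassified examples."""
    return sum(1 for p, y in zip(all_probabilities, labels) if predict(p) != y)


def negative_log_likelihood(all_probabilities, labels):
    """Eqn. 5.5: minus the summed log-probability of the correct classes."""
    return -sum(math.log(p[y]) for p, y in zip(all_probabilities, labels))


# Toy 2-2-2 network with identity weights and zero biases (assumed values).
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
inputs = [[2.0, -2.0], [-2.0, 2.0]]
labels = [0, 0]                     # the second example will be misclassified

outputs = [mlp_probabilities(x, identity, zeros, identity, zeros) for x in inputs]
errors = zero_one_loss(outputs, labels)
nll = negative_log_likelihood(outputs, labels)
```

Unlike the zero-one loss, which only counts the single error, the NLL changes smoothly with the weights, which is why it can be minimized with gradient-based methods.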
Weights are updated using the gradient of the error surface defined by the loss function.
The gradient is estimated from the training data. In this study a Stochastic Gradient
Descent based learning approach (Table 5.1) is applied to the MLP. In the ordinary gradient
descent algorithm, small steps are repeatedly made in the downward direction on an error
surface (for example, the mean square error), which is a function of the weights.
Stochastic gradient descent (SGD) works according to the same principles as ordinary
gradient descent, but proceeds more quickly by estimating the gradient from just a few
examples at a time instead of the entire training set. In its purest form, the gradient is
estimated from just a single example at a time. In both gradient descent (GD) and SGD, we
update a set of weights in an iterative manner to minimize an error function. In normal GD
(Table 5.2) all the samples of the training set have to be processed before the weights are
updated in a particular iteration, while in SGD only a single training sample from the
whole training set is used for the update in a particular iteration. Thus, if the number of
training samples is very large, gradient descent may take too long, because every update of
the parameters requires a pass through the complete training set. SGD, on the other hand,
is faster because it uses only one training sample per update and starts improving right
away from the first sample. SGD often converges much faster than GD, but the error function
is not as well minimized as in the case of GD. In most cases the close approximation of the
parameter values obtained with SGD is sufficient, because the parameters reach the
neighbourhood of the optimal values and keep oscillating there. Stochastic gradient descent
has a convergence rate which is independent of the size of the dataset and is thus well
suited to huge or even infinite datasets. But there are two downsides to it:
• Its convergence rate is slow. This is less harmful than it seems, because a faster
convergence rate on the training set does not translate into a better convergence rate on
the test set [115].
• It is sensitive to two parameters, the learning rate and the decrease constant.
Table 5.1 Stochastic Gradient Descent
for (x_i, y_i) in training_set:
    # Assume an infinite generator that may repeat examples
    # (if there is only a finite training set)
    loss = f(parameters, x_i, y_i)
    d_loss_wrt_parameters = <derivative of loss with respect to parameters>  # compute gradient
    parameters -= learning_rate * d_loss_wrt_parameters
    if <stopping condition is met>:
        return parameters
For deep learning a variant is recommended, which is a further modification of
stochastic gradient descent using “minibatches”. Minibatch SGD (MSGD),
explained in Table 5.3, works identically to SGD, except that more than
one training example is used to make each estimate of the gradient. This technique
reduces the variance in the estimate of the gradient, and often makes better use of the
hierarchical memory organization in modern computers.
Table 5.2 Gradient Descent
while True:
    loss = f(parameters)
    d_loss_wrt_parameters = <derivative of loss with respect to parameters>  # compute gradient
    parameters -= learning_rate * d_loss_wrt_parameters
    if <stopping condition is met>:
        return parameters
Table 5.3 Mini Batch- SGD
for (x_batch, y_batch) in training_batches:
    # Assume an infinite generator that may repeat examples
    loss = f(parameters, x_batch, y_batch)
    d_loss_wrt_parameters = <derivative of loss with respect to parameters>  # compute gradient
    parameters -= learning_rate * d_loss_wrt_parameters
    if <stopping condition is met>:
        return parameters
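The three loops of Tables 5.1 to 5.3 differ only in how many examples are used per gradient estimate, which the following runnable sketch makes explicit. Everything specific in it is an assumption for illustration: a one-parameter model y = w x, a squared-error loss, noise-free toy data, a fixed number of epochs as the stopping condition, and the chosen learning rate.

```python
import random


def minibatch_sgd(data, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x by squared-error minimization.

    batch_size = 1         -> stochastic gradient descent (Table 5.1)
    batch_size = len(data) -> ordinary gradient descent (Table 5.2)
    anything in between    -> minibatch SGD (Table 5.3)
    """
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):                 # assumed stopping condition
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean of (w * x - y) ** 2 over the batch.
            gradient = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * gradient
    return w


toy_data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]   # true weight is 2

w_sgd = minibatch_sgd(toy_data, batch_size=1)
w_minibatch = minibatch_sgd(toy_data, batch_size=2)
w_gd = minibatch_sgd(toy_data, batch_size=len(toy_data))
```

On this tiny noise-free problem all three variants recover a weight close to 2. The practical differences discussed in the text (cost per update and variance of the gradient estimate) only become visible on large, noisy datasets.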
There is a tradeoff in the choice of the minibatch size B. With large B, time is spent
reducing the variance of the gradient estimator that would be better spent on
additional gradient steps. The optimal B is model, dataset and hardware dependent, and
can be anywhere from 1 to several hundred. In machine learning, when we train a
model from data, we are trying to prepare it to do well on new examples, not the ones it
has already seen. The MSGD training loop does not take this generalization into account
and has a tendency to overfit the training examples. The way to combat overfitting is
through regularization and early stopping using a validation set. There are several
techniques for regularization; the most commonly used is L1/L2 regularization, which is
explained in the following sections.
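Early stopping using validation can be sketched as follows. The patience rule, the function name and the error values are assumptions chosen for illustration; the text does not specify which stopping criterion was used in this work.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return (best_epoch, best_error) under a simple patience rule.

    Training is considered stopped once the validation error has not improved
    for `patience` consecutive epochs; the parameters saved at best_epoch
    would then be used for classification.
    """
    best_error = float("inf")
    best_epoch = -1
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_error


# Assumed validation error curve: improves, then rises as the model overfits.
curve = [0.30, 0.20, 0.15, 0.16, 0.17, 0.18, 0.10]
stop_epoch, stop_error = early_stopping_epoch(curve, patience=3)
```

With a patience of 3 the loop stops at epoch 5 and keeps the model from epoch 2. The later value 0.10 is never reached, which shows the price of a small patience.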
5.2.1.1 Weight Decay
Weight decay is a subset of regularization methods. The penalty term in weight decay, by
definition, penalizes large weights. Other regularization methods may involve not only the
weights but various derivatives of the output function. The weight decay penalty term
causes the weights to converge to smaller absolute values than they otherwise would.
Large weights can hurt generalization in two different ways. Excessively large weights
leading to hidden units can cause the output function to be too rough, possibly with near
discontinuities. Excessively large weights leading to output units can cause wild outputs
far beyond the range of the data if the output activation function is not bounded to the
same range as the data. To put it another way, large weights can cause excessive variance
of the output.
5.2.1.2 L1 and L2 Regularization
L1 and L2 regularization involve adding an extra term to the loss function, which
penalizes certain parameter configurations. Formally, if our negative log-likelihood loss
function is $\mathrm{NLL}(\theta, \mathcal{D})$, then the regularized loss is given by
$$E(\theta, \mathcal{D}) = \mathrm{NLL}(\theta, \mathcal{D}) + \lambda R(\theta) \qquad (5.6)$$
For the present study this is written as
$$E(\theta, \mathcal{D}) = \mathrm{NLL}(\theta, \mathcal{D}) + \lambda \|\theta\|_p^p \qquad (5.7)$$
where $\|\theta\|_p$ is the $L_p$ norm of $\theta$,
$$\|\theta\|_p = \left( \sum_{j=0}^{|\theta|} |\theta_j|^p \right)^{1/p} \qquad (5.8)$$
$\lambda$ is a hyper-parameter which controls the relative importance of the regularization
term. Commonly used values for p are 1 and 2, hence the L1/L2 nomenclature. If
p=2, then the regularizer is also called “weight decay”. In principle, adding a
regularization term to the loss will encourage smooth network mappings in a neural
network (by penalizing large values of the parameters, which decreases the amount of
nonlinearity that the network models). More intuitively, the two terms (NLL and
$R(\theta)$) correspond to modeling the data well (NLL) and having “simple” or “smooth”
solutions. Thus, minimizing the sum of both will, in theory, correspond to finding the right
trade-off between the fit to the training data and the “generality” of the solution that is
found.
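As a worked instance of Eqns. 5.6 to 5.8, the sketch below evaluates the penalty term for p = 1 and p = 2 on a toy parameter vector. The function names and all numeric values are assumptions for illustration.

```python
def lp_penalty(parameters, p):
    """Return the p-th power of the L_p norm, i.e. the sum of |theta_j| ** p."""
    return sum(abs(theta) ** p for theta in parameters)


def regularized_loss(nll, parameters, lam, p=2):
    """Eqn. 5.7: E = NLL + lambda * ||theta||_p^p."""
    return nll + lam * lp_penalty(parameters, p)


theta = [3.0, -4.0]                 # toy parameter vector (assumed)
l1_term = lp_penalty(theta, 1)      # |3| + |-4|
l2_term = lp_penalty(theta, 2)      # 3**2 + (-4)**2
total = regularized_loss(nll=1.0, parameters=theta, lam=0.5, p=2)
```

For p = 2 the gradient of the penalty with respect to each parameter is 2 λ θ_j, so every gradient step shrinks each weight in proportion to its size, which is where the name "weight decay" comes from.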
5.2.1.3 Methodology Used for Deep Neural Network
For the other deep neural network models of this work, we have not applied feature
extraction and have used pixel intensities after normalization as features, but for the
first deep MLP based architecture we have obtained the results for both the direct pixel
values and the features proposed in our earlier work. The design cycle for the recognition
of characters follows all the steps of the standard pattern recognition technique in
supervised learning mode. For the training and testing, datasets developed by various
research groups are used. To the best of our knowledge only isolated characters and numeral
dataset of Devanagari script is available as test bed. So these are experimented in this
study. The design cycle used in this part of the study (for the Deep NN model) is shown in
Fig. 5.2.
Figure 5.2 Design Cycle for the Feature Extraction Based Deep Network Architecture
(Figure blocks: Stored Data (Training + Test Data); Preprocessing and Feature Extraction;
Training Algorithm; Model; Classifier; Decision)
The images obtained from the benchmarking datasets exist in two groups, namely training and
testing. The images are subjected to various preprocessing steps described in Table 2.2.
The output of the preprocessing module is not suitable for classifier training because of
its high dimensionality. Feature extraction/selection is an important step for
dimensionality reduction. In this study we have used gradient based directional features.
The features are obtained from 9 different parts of the image sample. The global and
local histogram based zone boundary concept is used before feature extraction.
Results are also obtained for the size-normalized image patterns with direct pixel
intensities as features. The network configuration varies with respect to the number of
hidden layers. The time for recognition includes the training time along with the
classification time. All the sample images are partitioned into 3x3 sub-images, but the
criterion for zoning differs. Edge based directional features are extracted from each of
the sub-images in 8 discrete directions, defining a 72-D feature vector (9 zones x 8
directions) in each case. A three-layer MLP architecture is used with 200 nodes in the
hidden layer. The rest of the NN settings are kept the same as in the experiment on direct
pixel values. The feed-forward neural network with the mini-batch approach is implemented
using the code developed by Palm [116].
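The way 9 zones and 8 directions combine into a 72-D vector can be sketched as follows. This is only a rough illustration under stated assumptions, not the feature extractor of this work: zones are equal-sized (the global and local histogram based zone boundaries are not reproduced), gradients are central differences, directions are quantized into 8 bins of 45 degrees starting at 0, and gradient magnitudes are accumulated per bin.

```python
import math


def directional_features(image, zones=3, directions=8):
    """Accumulate gradient magnitude per (zone, direction) bin."""
    height, width = len(image), len(image[0])
    features = [0.0] * (zones * zones * directions)
    bin_width = 2 * math.pi / directions
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = image[r][c + 1] - image[r][c - 1]   # horizontal central difference
            gy = image[r + 1][c] - image[r - 1][c]   # vertical central difference
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.atan2(gy, gx) % (2 * math.pi)
            direction = int(angle / bin_width) % directions
            zone = (r * zones // height) * zones + (c * zones // width)
            features[zone * directions + direction] += magnitude
    return features


# Toy 28 x 28 image (assumed): left half black, right half white.
toy_image = [[0.0] * 14 + [1.0] * 14 for _ in range(28)]
feature_vector = directional_features(toy_image)
```

For this toy image all of the gradient energy falls into direction bin 0 of the three middle-column zones, and the vector has 3 x 3 x 8 = 72 entries, matching the input size of the 72-200-49 and 72-200-10 networks.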
Table 5.4 Error Rates for Various Datasets using Direct Pixel Values
| Dataset | Training Samples | Test Samples | Neural Net Configuration | Time for Recognition (training + test) (secs) | Percentage Error |
|---|---|---|---|---|---|
| MNIST Handwritten digit | 50000 | 10000 | 784-100-10 | 4846 | 2.25 |
| MNIST Handwritten digit | 50000 | 10000 | 784-200-100-10 | 9471 | 2.15 |
| MNIST Handwritten digit | 50000 | 10000 | 784-200-10 | 7923 | 2.13 |
| ISI Digit | 18000 | 3500 | 784-100-10 | 1524 | 3.31 |
| ISI Digit | 18000 | 3500 | 784-200-100-10 | 3239 | 3.17 |
| ISI Digit | 18000 | 3500 | 784-200-10 | 2767 | 2.74 |
| CPAR-2012 character | 49000 | 29400 | 784-100-49 | 3903 | 21.54 |
| CPAR-2012 character | 49000 | 29400 | 784-200-49 | 8334 | 18.6 |
| CPAR-2012 character | 49000 | 29400 | 784-200-100-49 | 8794 | 17.21 |
| CPAR-2012 Numeral | 26250 | 8750 | 784-200-11 | 5422 | 2.53 |
| CPAR-2012 Numeral | 26250 | 8750 | 72-100-11 | 2153 | 2.8 |
| CPAR-2012 Numeral | 26250 | 8750 | 784-200-100-11 | 5560 | 2.77 |
Table 5.5 Error Rates for Various Datasets using proposed Feature Extraction method
| Dataset | Feature Extraction Method | Number of Features | NN Configuration | Time for Training + testing (secs) | Error Rate (%) |
|---|---|---|---|---|---|
| CPAR-2012 character | Global Zone based Edge | 72 | 72-200-49 | 1941 | 14.89 |
| CPAR-2012 character | local Zone based Edge | 72 | 72-200-49 | 1864 | 16.01 |
| CPAR-2012 character | Equal Zone | 72 | 72-200-49 | 2082 | 21.35 |
| ISI Digit | Equal Zone Edge | 72 | 72-200-10 | 1224 | 2.03 |
| ISI Digit | Global Zone based Edge | 72 | 72-200-10 | 1178 | 1.83 |
| ISI Digit | local Zone based Edge | 72 | 72-200-10 | 909 | 2.14 |
| CPAR-2012 Numeral, 26250 (training) + 8750 (test sample) | Global Zone based Edge | 72 | 72-200-10 | 756 | 2.38 |
| CPAR-2012 Numeral, 26250 (training) + 8750 (test sample) | local Zone based Edge | 72 | 72-200-10 | 785 | 1.93 |
| CPAR-2012 Numeral, 26250 (training) + 8750 (test sample) | Equal Zone Edge | 72 | 72-200-10 | 956 | 2.07 |
In this study experiments are performed with a single hidden layer NN (having 200 or 100
hidden units) and a two hidden layer NN (having 200 and 100 hidden units
respectively). The input layer has as many units as the size of the feature
vector. The network is trained for 500 epochs with a mini-batch size of 100. The
learning rate is kept constant at 1. The L2 weight decay regularization coefficient is set
to 1e-04. The activation function used in the hidden layers is the hyperbolic tangent and
in the output layer the logistic function. The momentum is set to 0.7. The error rates
obtained for the various datasets using direct pixel values as features are tabulated in
Table 5.4. First, the recognition results are obtained using direct pixel intensities as
features with MSGD. Secondly, the results are obtained for edge based directional features
with MSGD. Table 5.5 illustrates the performance of the various feature extraction
algorithms described in Table 2.2 with SGD based learning of the MLP.
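The effect of the input size on the model can be checked with simple arithmetic. The sketch below counts weights and biases of fully connected networks written in the "inputs-hidden-outputs" notation of Tables 5.4 and 5.5 and the number of mini-batch updates per epoch; the helper names are assumptions.

```python
import math


def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network, e.g. [784, 200, 10]."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def updates_per_epoch(num_samples, batch_size):
    """Number of mini-batch weight updates needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


pixel_net = count_parameters([784, 200, 10])     # direct pixel input
feature_net = count_parameters([72, 200, 10])    # 72-D directional features
mnist_updates = updates_per_epoch(50000, 100)    # mini-batch size of 100
```

The 72-input network has roughly a tenth of the parameters of the 784-input network, which is consistent with (though not by itself a full explanation of) the shorter times reported in Table 5.5 compared with Table 5.4.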
Mini-batch stochastic gradient descent is used to speed up recognition for large
datasets. From the results compared in Table 5.6, it is clear that if preprocessing and
feature extraction are used along with the mini-batch algorithm, the error
rate reduces by 1-3% over direct pixel intensity features. The proposed method gives
better or the same recognition accuracy compared with most of the standard benchmarks
available for Devanagari characters. The recognition time cannot be compared with
previously reported results, as time was not considered as a criterion in previous
research. The proposed method is faster than normal gradient descent based learning and
gives good accuracy even on direct pixel intensities. In the experiments performed by Kumar
et al. [7] the average recognition rate is 95.18% with rejection on a single classifier,
and 97.87% with rejection when an ensemble of classifiers is applied to the patterns. For
the same dataset the proposed learning strategy gives a recognition rate of 98.07% without
any rejection. The performance improvements of the proposed methods are given in terms
of accuracy in Table 5.6 (columns 4 and 5). Column 3 of Table 5.6 provides
the previously reported results of other researchers. For the CPAR-2012 Numeral dataset,
the accuracy of the proposed features with SGD learning of the MLP is
improved by 0.2%. Our results are better than the previous results in that the results
of other researchers are obtained with multiple classifiers along with the rejection of
samples. Kumar et al. [7] applied the method of classifier ensemble to the character
dataset and obtained a recognition rate of 84.03% for a rejection of 5.3%. For the same
dataset the proposed learning method yields a recognition rate of 82.79% with direct pixel
values and 85.11% with the feature extraction method, without any rejection of patterns.
For the CPAR-2012 character dataset an accuracy improvement of 1.08% is observed with the
proposed feature set along with SGD learning of the MLP. The proposed method compares
favourably with the method developed by Bhattacharya et al. [25] in terms of speed of
recognition. For ISI Devanagari Numerals a performance comparison is made with three
previously reported results. The result reported by Das et al. [33] is better, but it was
tested on a small subset of the ISI data, whereas our method is tested on the complete ISI
dataset. The result reported by Bhattacharya et al. [25] was tested on the complete
dataset, but the accuracy was obtained with 0.24% rejection of patterns. The features used
in that work are obtained with complex feature extraction algorithms. Also, the
classification algorithm used by them is a multistage MLP based classifier, and patterns
are rejected by the last stage of the multistage MLP. So for the ISI data, considering the
complexity of the algorithm, our results are better.
Table 5.6 Performance comparison of the obtained results over previously reported results

| Dataset [reference] | Approach | Feature | Classifier | Previously reported recognition percentage | Recognition percentage of proposed method (direct pixels as features) | Recognition percentage of proposed method (gradient feature) |
|---|---|---|---|---|---|---|
| CPAR-2012 Digit [7] | Classifier Combination (MV) | Direct Pixel + Profile + Gradient + Wavelet Transform | CCN, KNN, PRN, KNN, FFT | 97.87 (35000) | 97.47 (35000) | 98.07 (35000) |
| CPAR-2012 Character [32] | Classifier Combination (MV) | Direct Pixel + Profile + Gradient + Wavelet Transform | CCN, KNN, PRN, KNN, FFT | 84.03 (79400) | 82.79 (79400) | 85.11 (79400) |
| ISI Devanagari Digit [33] | Feature Combination | PCA/MPCA + QTLR | SVM | 98.7 (3000), ISI data | 97.26 (22546), full ISI data | 98.17 (22546), full ISI data |
| ISI Devanagari Digit [25] | Multi-stage classifier | Wavelet | Multi-stage | 99.04 with 0.24% rejection | | |
| ISI Devanagari Digit [35] | Ensemble using AdaBoost | Zernike Moments | MLPs | 96.80 (single) (22546) | | |
Abbreviations used in Table 5.6 are as follows: PCA: principal component analysis, KNN: K-nearest neighbor, FNN: feed-forward neural network, SVM: support vector machine, MLP: multilayer perceptron, CCN: cascade neural network, PRN: pattern recognition network, and FFT: function fitting neural network.
5.2.2 The Performance of Convolutional Neural Network
A Convolutional Neural Network (CNN) is based on connecting a local region of the previous
layer to the next layer. CNNs exploit spatially local correlation by enforcing a local
connectivity pattern between neurons of adjacent layers. Each unit of a layer is
connected to a subset of the units of the previous layer; the number of previous-layer units
it connects to forms the width of its receptive field. The second architecture is inspired by
LeNet-5 [108]. This idea of local connectivity is illustrated in Fig. 5.3.
Figure 5.3 Illustration of local connectivity pattern (receptive field)
As CNNs share weights, the number of free parameters does not grow proportionally with
the input dimensions, as it does in standard multi-layer networks.
The functions of the layers of the CNN are described as follows:
Convolutional Layer: performs a 2D filtering between input images x and a bank of filters
w, producing another set of images h. A connection table CT indicates the input-output
correspondences; filter responses from inputs connected to the same output image are
linearly combined. Each row in CT is a connection and has the following semantic:
(inputImage, filterId, outputImage). This layer performs the following mapping:
$h_j = \sum_{i,k \in CT_{i,k,j}} x_i * w_k$ (5.9)
where * indicates the 2D valid convolution. Each filter $w_k$ of a particular layer has the
same size and defines, together with the size of the input, the size of the output images
$h_j$. Then, a non-linear activation function (e.g. tanh, logistic, etc.) is applied to h just as for
standard multi-layer networks.
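The mapping in Eq. (5.9) can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work: the function names `conv2d_valid` and `conv_layer`, the toy input and the toy filters are all assumptions made for illustration, and tanh is picked as one of the activation functions mentioned above.

```python
import math

def conv2d_valid(image, kernel):
    """2D valid convolution: the kernel is flipped and only positions where
    it fully overlaps the image are kept, so the output is smaller."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    # Flipped kernel index: true convolution, not correlation.
                    acc += image[r + u][c + v] * kernel[kh - 1 - u][kw - 1 - v]
            out[r][c] = acc
    return out

def conv_layer(inputs, filters, connection_table, n_outputs):
    """Eq. (5.9): for every row (i, k, j) of the connection table, add the
    response x_i * w_k to output image h_j, then apply tanh element-wise.
    Assumes every output image appears in at least one row of the table."""
    outputs = [None] * n_outputs
    for i, k, j in connection_table:
        response = conv2d_valid(inputs[i], filters[k])
        if outputs[j] is None:
            outputs[j] = response
        else:
            outputs[j] = [[a + b for a, b in zip(ra, rb)]
                          for ra, rb in zip(outputs[j], response)]
    return [[[math.tanh(v) for v in row] for row in h] for h in outputs]

# Toy data (hypothetical): one 4x4 input image and two 2x2 filters.
x = [[[float(r * 4 + c) for c in range(4)] for r in range(4)]]
w = [[[1.0, 0.0], [0.0, 0.0]],
     [[0.0, 0.0], [0.0, 1.0]]]
ct = [(0, 0, 0), (0, 1, 1)]   # rows are (inputImage, filterId, outputImage)
h = conv_layer(x, w, ct, n_outputs=2)
```

Each 4x4 input convolved with a 2x2 filter gives a 3x3 output image, which shows how the filter size and the input size together fix the output size.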
Pooling Layer: reduces the dimensionality of the input by a constant factor. The purpose of
this layer is not only to reduce the computational burden, but also to perform feature
selection. The input images are tiled into non-overlapping sub-regions, from each of which
only one output value is extracted. Common choices are the maximum or the average,
usually shortened to Max-Pooling and Avg-Pooling. Max-Pooling is generally preferable, as it
introduces a small invariance to translation and distortion, and leads to faster convergence
and better generalization [117].
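As a small illustration of the pooling step, the sketch below applies non-overlapping max-pooling to a toy 4x4 image. The function name `max_pool` and the example values are assumptions made for illustration only.

```python
def max_pool(image, size=2):
    """Non-overlapping max-pooling: tile the image into size x size
    sub-regions and keep the maximum of each tile. Rows or columns that
    do not fill a complete tile are dropped."""
    rows, cols = len(image) // size, len(image[0]) // size
    return [[max(image[r * size + u][c * size + v]
                 for u in range(size) for v in range(size))
             for c in range(cols)]
            for r in range(rows)]

img = [[1, 2, 0, 0],
       [3, 4, 0, 5],
       [0, 0, 9, 8],
       [6, 0, 7, 0]]
pooled = max_pool(img)   # 4x4 -> 2x2

# The same strong responses moved by one pixel inside their tiles.
shifted = [[4, 3, 5, 0],
           [2, 1, 0, 0],
           [0, 6, 8, 9],
           [0, 0, 0, 7]]
pooled_shifted = max_pool(shifted)
```

Both images pool to the same 2x2 map, which is the small translation invariance referred to above.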
Fully Connected Layer: this is the standard layer of a multi-layer network. It performs a
linear combination of the input vector with a weight matrix. Either the network alternates
convolutional and max-pooling layers such that at some stage a 1D feature vector is
obtained, or the resulting images are rearranged to have 1D shape. The output layer is
always a fully connected layer with as many neurons as classes in the classification task.
The outputs are normalized with a softmax activation function and therefore approximate
posterior class probabilities.
A convolution layer has many parameters, such as the size and the number of feature maps,
the kernel size, the skipping factors and the connection table. In this work we used a kernel
size of 5x5, with 6 feature maps in the first convolution layer and 12 in the second
convolution layer. The skipping factors specify the number of pixels, horizontally and
vertically, by which the kernel skips between subsequent convolutions. If a layer has M
feature maps of size $(M_x, M_y)$, kernels of size $(K_x, K_y)$ and skipping factors
$(S_x, S_y)$, then the size of the output maps is given by
$M_x^n = \frac{M_x^{n-1} - K_x^n}{S_x^n} + 1, \qquad M_y^n = \frac{M_y^{n-1} - K_y^n}{S_y^n} + 1$ (5.10)
where the index n indicates the layer. Each map in layer $L^n$ is connected to at most
$M^{n-1}$ maps in layer $L^{n-1}$. Neurons of a given map share their weights but have
different receptive fields.
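A short sketch of the size relation in Eq. (5.10) follows. It rests on several assumptions made for this example: the skipping factor is read as the step between successive kernel positions (S = 1 for a dense convolution), each convolution is followed by 2x2 non-overlapping pooling, and a 1024-feature input is taken to be a 32x32 image. The helper names `output_map_size` and `trace_sizes` are hypothetical.

```python
def output_map_size(m_prev, k, s=1):
    """Output size along one axis: (M^{n-1} - K^n) / S^n + 1."""
    if (m_prev - k) % s != 0:
        raise ValueError("kernel and skipping factor do not tile the input")
    return (m_prev - k) // s + 1

def trace_sizes(input_size, kernel=5, pool=2):
    """Map width after conv, pool, conv, pool for a square input image."""
    sizes = [input_size]
    for _ in range(2):
        sizes.append(output_map_size(sizes[-1], kernel))  # 5x5 convolution
        sizes.append(sizes[-1] // pool)                   # 2x2 pooling
    return sizes

sizes_28 = trace_sizes(28)   # 784-pixel input
sizes_32 = trace_sizes(32)   # 1024-pixel input, assumed to be 32x32
```

Under these assumptions a 28x28 input is reduced to 24, 12, 8 and finally 4 pixels per side, so 12 final feature maps would hold 12 x 4 x 4 = 192 values for the classifier.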
Figure 5.4 The Convolutional Neural Network architecture
The CNN experiments are performed on mini-batches of input patterns, using the
architecture shown in Fig. 5.4. The input layer has a feature map equal in size to the
normalized image of the character. In this study the weights are sampled randomly from a
uniform distribution in the range [-1/fan-in, 1/fan-in], where fan-in is the number of inputs
to a hidden unit. For
CNNs this depends on the number of input feature maps and the size of the receptive
fields. As is usual in CNNs, max-pooling is applied as a form of non-linear down-sampling.
In the max-pooling operation the input image is partitioned into a set of non-overlapping
rectangles and, for each such sub-region, the output is the maximum value.
Max-pooling is useful in vision for two reasons: (1) it reduces the computational
complexity for upper layers and (2) it provides a form of translation invariance.
The first convolution layer consists of 6 feature maps obtained by convolving a kernel of
size 5x5 with the input feature map. It is followed by a sub-sampling layer, which
down-samples the image using max-pooling across 2x2 areas of its input. The second
convolution layer convolves the image from the previous layer with a kernel of size 5x5;
the number of feature maps obtained at this stage is 12. It is again followed by a
sub-sampling layer that applies max-pooling over areas of size 2x2. The last layer is a fully
connected layer which drives a feed-forward classifier that gives the output. The two
convolution layers and the two sub-sampling layers form the trainable feature extraction
part, and the succeeding layer forms the classifier.
The handwritten images are divided into a training set and a testing set. All digit
images have been size-normalized and centered in a fixed-size image of 28 x 28 pixels. In
the original dataset each pixel of the image is represented by a value between 0 and 255,
where 0 is black, 255 is white and anything in between is a different shade of grey. An
image is represented as a 1-dimensional array of 784 (28 x 28) float values between 0 and 1
(0 stands for black, 1 for white). When using the dataset, we divide the training and test
sets into mini-batches. The results of the Convolutional Neural Network are shown in
Table 5.7. Column 2 gives the recognition rate for 6 and 12 feature maps in the first and
second convolution layers respectively.
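The fan-in based weight initialization described earlier in this section can be sketched as follows. This is an illustration under stated assumptions, not the code used in the experiments: the fan-in of a convolutional unit is taken as (number of input feature maps) x (kernel height) x (kernel width), the second convolution layer is assumed to see all 6 maps of the first, and the names `fan_in_conv` and `init_kernel` are hypothetical.

```python
import random

def fan_in_conv(n_input_maps, kernel_h, kernel_w):
    """Number of inputs feeding one unit of a convolutional feature map."""
    return n_input_maps * kernel_h * kernel_w

def init_kernel(n_input_maps, kernel_h, kernel_w, rng):
    """Sample one stack of kernels uniformly from [-1/fan_in, 1/fan_in]."""
    bound = 1.0 / fan_in_conv(n_input_maps, kernel_h, kernel_w)
    return [[[rng.uniform(-bound, bound) for _ in range(kernel_w)]
             for _ in range(kernel_h)]
            for _ in range(n_input_maps)]

rng = random.Random(0)               # fixed seed, only for repeatability
first = init_kernel(1, 5, 5, rng)    # first convolution layer: fan-in 25
second = init_kernel(6, 5, 5, rng)   # second convolution layer: fan-in 150
```

With these assumptions the first-layer weights lie in [-0.04, 0.04] and the second-layer weights in a range six times narrower, since each second-layer unit sums over six input maps.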
Table 5.7 Experimental results of Convolutional Neural Network with 784 features

| Dataset | Recognition rate (%), feature maps 6-12 | Recognition time | Recognition rate (%), feature maps 12-24 | Recognition time |
|---|---|---|---|---|
| CPAR-2012 Char | 81.11 | 20892 | 84.21 | 68182 |
| CPAR-2012 Number | 96.95 | 14761 | 97.66 | 53251 |
| MNIST Digit | 98.96 | 32193 | 99.21 | 104469 |
| ISI Numeral | 97.8 | 10951 | 98.11 | 42078 |
Experiments are also carried out with the number of feature maps stated above increased to
12 and 24. Column 4 of Table 5.8 reports the corresponding results. The
recognition times reported in the results are the total time for training and testing.
Table 5.8 Experimental results of Convolutional Neural Network with 1024 features

| Dataset | Recognition rate, feature maps 6-12 | Recognition time (sec) | Recognition rate, feature maps 12-24 | Recognition time (sec) |
|---|---|---|---|---|
| CPAR-2012 Char | 86.48% | 68292 | 90.58%/9.04 | 243102 |
| CPAR-2012 Number | 97.5% | 48945 | 97.47% | 126306 |
| MNIST | 99.11% | 115558 | 99.18% | 290491 |
| ISI Numeral | 97.97% | 34476 | 98.17%/1.6 | 94799 |
The experimental results on the available datasets are, in most cases, better than the
previously reported results. For the CPAR digit dataset, the highest previously reported
recognition rate is 97.87%, by Rajiv Kumar et al., while for the character dataset
the highest reported recognition rate was 84.03%, with rejection of patterns. In our
experiments, the recognition rate with the proposed Convolutional Neural Network
architecture is 90.58% without any rejection of patterns. The recognition rate for the
digits in our results is 97.66%, which is slightly lower than the result reported by
Kumar et al. However, our results are obtained without any rejection, whereas Kumar et al.
performed their experiments with rejection of the degraded patterns. Furthermore,
our method does not involve complex feature extraction or a classifier ensemble.
Better results can be obtained if a combination of classifiers is used in place of a
single classifier. Our experiments also give a better result on the ISI-CVPR dataset than
the approach proposed by Jindal et al. [35]: they reported a recognition percentage of
96.8%, while our CNN achieves 98.11%.
The Convolutional Neural Network based classifier is implemented for the Devanagari
character and numeral datasets. The Convolutional Neural Network is a deep architecture
inspired by the human vision system. It consists of a trainable feature detector and a
classifier embedded in a single network. It takes the character image directly as input, in
pixel form. The proposed deep learning architecture is quite effective on the Devanagari
character/numeral datasets. However, the results can be improved further if classifier
fusion is implemented, which will be the future direction for the proposed architecture.
5.2.3 Deep Belief Networks
Deep belief networks (DBNs) are probabilistic generative models composed of multiple
layers of stochastic latent variables. The latent variables typically have binary values
and are often called hidden units or feature detectors. DBNs are constructed as hierarchies
of recurrently connected simpler probabilistic graphical models, called Restricted
Boltzmann Machines (RBMs). Each RBM receives its signal from the previous layer and
provides the input to the next layer. Every RBM has two layers of neurons, a hidden
and a visible layer, which are fully and symmetrically connected between layers but not
connected within layers. The RBMs are trained in a greedy layer-wise fashion using
unsupervised learning. Each RBM encodes in its weight matrix a
probability distribution that predicts the activity of the visible layer from the activity of the
hidden layer. By stacking such models, each layer predicts the activity of the layer below,
and higher RBMs learn increasingly abstract representations of sensory inputs,
which matches well with the representations learned by neurons in higher brain regions, e.g.,
of the visual cortical hierarchy [118] [119]. The success of deep learning rests on the
unsupervised layer-by-layer pre-training with the Contrastive Divergence (CD) algorithm
[120].
5.2.3.1 Restricted Boltzmann Machines
A Restricted Boltzmann Machine is a two-layer architecture made up of binary visible units
and binary hidden units interconnected with each other. An RBM is a generative undirected
graphical model.
Figure 5.5 The connection diagram of Restricted Boltzmann Machine
The lower layer x is defined as the visible layer, and the top layer h is the hidden layer. The visible and hidden layer units x and h are stochastic binary variables. The weights between the visible layer and the hidden layer are undirected and are denoted W. In addition, each neuron has a bias. The model defines the probability distribution
$P(x, h) = \frac{e^{-E(x,h)}}{Z}$ (5.11)
with the energy function E(x,h) and the partition function Z defined by
$E(x, h) = -b'x - c'h - h'Wx$ (5.12)
$Z = \sum_{x,h} e^{-E(x,h)}$ (5.13)
where b and c are the biases of the visible layer and the hidden layer respectively. The sum
over x and h runs over all possible states of the model.
The conditional probability of one layer given the other is given by
$P(h \mid x) = \frac{\exp(b'x + c'h + h'Wx)}{\sum_{h} \exp(b'x + c'h + h'Wx)}$ (5.14)
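The following sketch evaluates the energy of Eq. (5.12) and the conditional probability of Eq. (5.14) by brute force on a tiny RBM. All weights, biases and function names are made-up illustrative values, not parameters from the experiments. The check against a per-unit sigmoid uses the standard factorization of P(h|x) for RBMs, which follows from the absence of hidden-to-hidden connections but is not stated in the text above.

```python
import itertools
import math

def energy(x, h, W, b, c):
    """E(x, h) = -b'x - c'h - h'Wx, with W of shape (hidden, visible)."""
    bx = sum(bi * xi for bi, xi in zip(b, x))
    ch = sum(cj * hj for cj, hj in zip(c, h))
    hWx = sum(h[j] * W[j][i] * x[i]
              for j in range(len(h)) for i in range(len(x)))
    return -bx - ch - hWx

def p_h_given_x(h, x, W, b, c):
    """Eq. (5.14): normalize exp(-E) over every binary hidden state."""
    num = math.exp(-energy(x, h, W, b, c))
    den = sum(math.exp(-energy(x, list(hh), W, b, c))
              for hh in itertools.product([0, 1], repeat=len(h)))
    return num / den

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical toy model: 3 visible units, 2 hidden units.
W = [[0.5, -1.0, 0.25],
     [1.5, 0.0, -0.5]]
b = [0.1, -0.2, 0.3]
c = [-0.4, 0.2]
x = [1, 0, 1]

# Per-unit activation probabilities P(h_j = 1 | x) = sigmoid(c_j + W_j . x).
p_units = [sigmoid(c[j] + sum(W[j][i] * x[i] for i in range(3)))
           for j in range(2)]
p_joint = p_h_given_x([1, 0], x, W, b, c)
p_factored = p_units[0] * (1.0 - p_units[1])
total = sum(p_h_given_x(list(hh), x, W, b, c)
            for hh in itertools.product([0, 1], repeat=2))
```

The brute-force value `p_joint` agrees with the product of per-unit probabilities, which is why the hidden units of an RBM can be sampled independently given the visible layer.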
For the recognition of handwritten digits, instead of applying supervised learning directly,
the layers of the deep model are first trained in an unsupervised manner. For this purpose
the lowest level is trained on the images of the characters, whereas the higher levels try to
learn their representations, so that the combined model acts as a supervised model. The DBN
is made up of two hidden layers having 400 and 300 units respectively. The learning is
done without momentum. The mini-batch size is 100. The training is done
for 200 epochs and the learning rates (supervised and unsupervised) are kept at 1. Table 5.9
shows the performance of the DBN in terms of error rate with the direct pixel intensities as
the features. The feature sizes given in column 2 of the table correspond to the two sizes of
the images.
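Table 5.9 Performance of DBN for Devanagari Characters/Numerals

| Dataset | Number of features | Number of hidden layers | Error rate |
|---|---|---|---|
| ISI Numeral | 784 | 2 | 5.03 |
| ISI Numeral | 1024 | 2 | 3.94 |
| CPAR-2012 Numeral | 784 | 2 | 4.78 |
| CPAR-2012 Numeral | 1024 | 2 | 5.05 |
| CPAR-2012 Character | 784 | 2 | 19.03 |
| CPAR-2012 Character | 1024 | 2 | 18.47 |
| MNIST-Digit | 784 | 2 | 1.68 |
| MNIST-Digit | 1024 | 2 | 1.97 |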
5.2.4 Stacked Denoising Autoencoder
The stacked denoising autoencoder (SDA) is obtained by stacking denoising
autoencoders, whose hidden layers are designed to detect more robust features.
Encoding and decoding operations are the main building blocks of the SDA.
Each layer is trained separately to minimize the reconstruction error. There are two stages in the
training process of this network: layer-wise pre-training followed by fine-tuning. Here
training is done using a corrupted version of the patterns. The encoder first encodes the
corrupted input, and the decoder then tries to reconstruct the original input from this
encoding, undoing the corruption. This corruption is done stochastically. The stochastic
corruption process [121] consists in randomly setting some of the inputs (as many as half of
them) to zero. Hence the denoising auto-encoder is trying to predict the corrupted (i.e.,
missing) values from the uncorrupted (i.e., non-missing) values, for randomly selected
subsets of missing patterns.
Let the input be $x \in [0,1]^d$. The encoder takes the input x and encodes it into the hidden
representation $y \in [0,1]^{d_1}$, where d and $d_1$ are the respective dimensions:
$y = s(Wx + b)$ (5.15)
where s is a non-linearity such as the sigmoid. The latent representation y is mapped
back (with a decoder) into a reconstruction z of the same shape as x through a similar
transformation:
$z = s(W'y + b')$ (5.16)
W and W' are the weights. The parameters of this model, which are optimized by training
using the mean reconstruction error, are W, W', b and b'. We took $W' = W^T$ in our
experiments. Note that y captures the main factors of variation in the data. The single
hidden layer of the SDA is trained to minimize the mean squared error of that layer. Fine-
tuning of the parameters is done using supervised learning. The SDA is made up of a single
hidden layer with 400 units. The learning is done without
momentum. The mini-batch size is 100. The training is done for 100
epochs and the learning rate is kept at 1. The MLP consists of two hidden layers with 400
and 300 units respectively. The input in this case is again the pixel intensities for the two
image sizes.
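A forward pass of a single denoising autoencoder, following Eqs. (5.15) and (5.16) with tied weights, can be sketched as below. This is only an illustration: it performs no training, the layer sizes and weights are small made-up values rather than the 400-unit configuration used here, and the function names are hypothetical. Masking noise with a corruption level of 0.5 is assumed, in line with the "as many as half" remark above.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def corrupt(x, level, rng):
    """Masking noise: set each input to zero with probability `level`."""
    return [0.0 if rng.random() < level else xi for xi in x]

def encode(x, W, b):
    """Eq. (5.15): y = s(Wx + b), with W of shape (d1, d)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def decode(y, W, b_prime):
    """Eq. (5.16) with tied weights W' = W^T: z = s(W^T y + b')."""
    d = len(W[0])
    return [sigmoid(sum(W[j][i] * y[j] for j in range(len(y))) + b_prime[i])
            for i in range(d)]

def reconstruction_error(x, z):
    """Mean squared error between the clean input and the reconstruction."""
    return sum((xi - zi) ** 2 for xi, zi in zip(x, z)) / len(x)

rng = random.Random(0)                 # fixed seed, only for repeatability
d, d1 = 6, 3                           # toy dimensions (assumed)
W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d1)]
b, b_prime = [0.0] * d1, [0.0] * d
x = [0.0, 1.0, 1.0, 0.0, 1.0, 0.5]

x_tilde = corrupt(x, 0.5, rng)         # corrupted input fed to the encoder
y = encode(x_tilde, W, b)
z = decode(y, W, b_prime)
loss = reconstruction_error(x, z)      # measured against the clean input x
```

The key design point is that the loss compares z with the clean x, not with the corrupted input, so minimizing it forces the hidden representation to recover the zeroed values.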
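Table 5.10 Performance of Stacked Denoising autoencoder

| Dataset | Number of features | Number of hidden layers | Error rate |
|---|---|---|---|
| ISI Numeral | 784 | 1 | 3.57 |
| ISI Numeral | 1024 | 1 | 3.77 |
| CPAR-2012 Numeral | 784 | 1 | 4.38 |
| CPAR-2012 Numeral | 1024 | 1 | 4.35 |
| CPAR-2012 Character | 784 | 1 | 16.12 |
| CPAR-2012 Character | 1024 | 1 | 16.78 |
| MNIST-Digit | 784 | 1 | 1.68 |
| MNIST-Digit | 1024 | 1 | 1.71 |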
The performance comparison of all the deep architectures on the various datasets is shown in
Figure 5.6. Four classifiers, namely the Deep Neural Network, Deep Belief Network,
Convolutional Neural Network and Stacked Denoising Autoencoder, are compared on the
different datasets.
Figure 5.6 Performance Comparison of Networks (test error rate of the NN, DBN, CNN and SDA classifiers on the CPAR-Digit, ISI-Digit, CPAR-Char and MNIST-Digit datasets)
5.3 Conclusion
This chapter presented the experimental evaluation of deep learning tools for Devanagari
handwritten characters/numerals. The main advantage of the proposed approach is that it
avoids the costly feature extraction step of conventional pattern recognition
approaches and performs equally well or (in some cases) better than
previously reported studies. The previously reported results were obtained with
computationally expensive feature extraction or a classifier combination strategy, while
the present study is free from complex preprocessing and feature extraction algorithms. In
the case of Devanagari characters, there is an improvement in the recognition result of 6.5%
over the previously reported result. When the performance of the various deep learning
methods is compared, the CNN-based classifier performs best on all the datasets.
However, training a CNN is quite costly and time consuming, so other configurations are
used when time is the major consideration.
The work presented in this chapter led to the following publications:
1. Pratibha Singh, Ajay Verma, and Narendra S. Chaudhari, "On the Performance
Improvement of Devanagari Handwritten Character Recognition," Applied
Computational Intelligence and Soft Computing (Hindawi), vol. 2015, Article ID
193868, 12 pages, 2015. doi:10.1155/2015/193868.
2. Pratibha Singh, Ajay Verma, and Narendra S. Chaudhari, "Performance
Comparison of Neural Network architecture for Handwritten Devanagari Character
Recognition". Under review.