
Generalized Stacking of Layerwise-trained Deep Convolutional Neural Networks for Document Image Classification

Saikat Roy
Dept. of Comp. Sci. & Engg., Jadavpur University
Kolkata, India
Email: saikat [email protected]

Arindam Das
Imaging Tech Lab, HCL Technologies
Chennai, India
Email: [email protected]

Ujjwal Bhattacharya
Comp. Vision and Patt. Recog. Unit, Indian Statistical Institute
Kolkata, India
Email: [email protected]

Abstract—This article presents our recent study of a lightweight Deep Convolutional Neural Network (DCNN) architecture for document image classification. Here, we concentrate on training a committee of generalized, compact and powerful base DCNNs. A support vector machine (SVM) is used to combine the outputs of the individual DCNNs. The main novelty of the present study is the introduction of supervised layerwise training of the DCNN architecture in document classification tasks for better initialization of the weights of the individual DCNNs. Each DCNN of the committee is trained on a specific part or on the whole document. We also use the principle of generalized stacking to combine the normalized outputs of all the members of the DCNN committee. The proposed document classification strategy has been tested on the well-known Tobacco3482 document image dataset. Our experimental results show that the proposed strategy, involving a considerably smaller network architecture, produces document classification accuracies comparable with those of state-of-the-art architectures, making it more suitable for use in comparatively low-configuration mobile devices.

Index Terms—Deep learning, CNN, document classification, supervised layerwise training.

I. INTRODUCTION

Organizations regularly receive or process many documents, copies of which are often stored as scanned images for future reference. Similarly, an individual periodically needs to handle several documents which must be maintained for a certain period; such documents are often preferred to be stored as images. An automatic system capable of understanding the types or logical structures of such document images should help in better management of these documents, such as efficient archiving, retrieval and information mining.

Although understanding the type of a document from its image is not difficult for humans, automating the same is a non-trivial problem. On the other hand, classification of document images into a number of pre-defined categories is a useful step towards their automatic understanding. In fact, document image classification is considered an effective initial step of various Document Image Processing (DIP) tasks such as document retrieval, information extraction and text recognition. It increases the indexing efficiency in the construction of a digital library. Automatic archiving of documents essentially requires their classification. Initial classification of available documents into different known classes not only simplifies various document processing tasks but also increases the overall performance of DIP systems.

Documents of various classes can be characterized by their text contents and/or their structural similarities. In the present work, we consider their structural aspect only. Designing a high-performance document classifier is challenging for a number of reasons, including the large variety of documents within any individual document class [1]. Often the structures of two documents belonging to two different classes are so similar that their correct classification is very difficult. Experimentation with the proposed classification approach has been done on the Tobacco litigation dataset [2], and a few samples from it are shown in Fig. 1.

Several studies of automatic document classification have been carried out in the recent past; a related survey can be found in [1]. In [3], an approach for the automatic generation of a decision tree for logical labeling of business letters was proposed. In another study [4], a supervised classifier trained using given examples from each underlying class exploited the visual similarity of document layout structure for classification. Another group of authors proposed document classification based on layout similarity in [5]. The authors of [6] used a recursive representation of document structure to preserve the relationships among its different parts. Gordo et al. [7] recently proposed certain multi-scale runlength histograms for the representation of document images and a generative classifier model for their efficient classification.

Since deciding on a discriminating set of computable features for document classification tasks is not at all an easy job, machine learning based approaches have been studied for understanding document logical structures since the early days of DIP research. Dengel and Dubiel [8] presented an approach for learning document logical structures, particularly for business letters. In this approach, the available training samples were clustered into structural concepts, producing a concept hierarchy which was used for the later classification of new documents.


Fig. 1: Three resized samples illustrating document structure from each of the ten classes of the Tobacco dataset: (a) Advertisement, (b) News, (c) Letter, (d) Email, (e) Form, (f) Report, (g) Memo, (h) Scientific, (i) Resume, (j) Note

Heroux et al. [9] studied three classifiers, namely k-Nearest Neighbours (kNN), a multilayer perceptron (MLP) and a structural classifier, for form document classification tasks. Pyramidal-decomposition-based low-level features were used by the kNN and MLP classifiers, while high-level features representing structural information were used by the structural classifier. Cesarini et al. [10] used a Modified X-Y tree to describe a document page and encoded its hierarchical structure into a fixed-length feature vector, which was fed to an MLP classifier for assigning a label to the input document image. Hidden Markov models (HMMs) [11] are considered robust and suitable for handling uncertainties and noise in input samples. Hu et al. [5] considered interval encoding to compute the spatial layout of an input document image and used the resulting fixed-length feature vector for its classification based on an HMM.

A Deep Convolutional Neural Network (DCNN), or simply a Convolutional Neural Network (CNN) [12], has a deep architecture which may be considered an improved version of the MLP. It has two main parts, one of which is sparsely connected while the other is fully connected. The sparsely connected part extracts features from the input data, and the fully connected part takes care of classification using the extracted features. Thus, a CNN has the advantage of avoiding effort on manual feature selection, and hence it has been tried in a variety of image classification tasks, including document image classification. Kang et al. [13] trained a CNN architecture with rectified linear units using the dropout strategy for a 10-class document image classification task. Later, Afzal et al. [14] used a transfer learning strategy to improve the recognition accuracy on the same standard dataset; these authors used a deeper CNN pre-trained on a different, very large dataset. Harley et al. [15] also used transfer learning and trained an ensemble of CNNs for the same document classification task.

In the present study, we used a committee of six DCNN models for the classification of document images into a certain number of classes. The goal of the present study is to increase the generalization performance of the trained model without much increase in the size of the DCNN architecture. Such a model should be useful for comparatively low-configuration devices or in an environment supporting parallel computation. The new aspects of the proposed strategy include (i) layerwise training of DCNN models, (ii) generalized stacking and (iii) the use of a DCNN model trained with both the original training samples and their rotated images.

II. PROPOSED METHOD

DCNNs have become very popular due to their notable success in image classification tasks. However, since the deep architecture of such a network involves a significantly large number of free parameters, a very large number of samples is required for its training, in the absence of which the network suffers from overfitting. Since the creation of such a large training set is not always feasible, DCNNs cannot be used in many application areas. As a feasible solution to this problem, an approach which is gaining popularity is to start from a DCNN model pre-trained on a large image dataset (such as ImageNet [19]), possibly from a completely different domain, and retrain it for a while using the available small training dataset of the particular problem in hand, fine-tuning the network. However, this approach may not provide high classification accuracies (as in the studies [14], [15]), possibly due to the large difference between the nature of the training image samples of the pre-trained network and the image samples of the target problem. On the other hand, a DCNN model trained from scratch using a small number of training samples cannot generalize well enough on unknown test data, as observed in the study [13]. Thus, it is worthwhile to study different avenues for improving the generalization of a DCNN model trained with a limited volume of relevant image samples. Further details of our study on the problem are described below.

A. Architecture of the Proposed Classification Model

A block diagram of the proposed system, consisting of six DCNN models and an SVM combining the outputs of the DCNNs, is shown in Fig. 2. The architecture of the DCNN used in the present system can be represented by 150×150-20C7-4P4-50C5-4P4-500FC-10SM, where:

• 150×150 represents the input layer, which takes a 150×150 image as input to the proposed system,


Fig. 2: Architecture of the proposed system for document classification consisting of an ensemble of DCNNs and an SVM

• mCn represents the first or second convolution layer, which applies m filters, each of size n × n, to the 2-dimensional input of that layer,

• qPr represents the first or second max-pooling layer, which pools q × q regions at strides of r pixels along both dimensions of its input,

• hFC represents a fully connected layer consisting of h hidden units,

• mSM represents a softmax output layer consisting of m output units for the m-class classification problem.

This network is trained in both the regular and the layerwise fashion. To achieve better generalization of the trained network, dropout, where neurons are randomly switched off at each iteration, was applied to the fully connected layer of each DCNN; a dropout probability of 0.5 was used. We used standard mini-batch gradient descent as the training algorithm, along with the RMSProp algorithm for adaptation of the learning rates during training. A minimal sketch of this base network and its optimizer setup is given below.
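For concreteness, the following is a minimal PyTorch sketch of the base network just described together with its optimizer setup. The ReLU activations, grayscale (single-channel) input and learning rate are our assumptions; the architecture string fixes only the layer sizes.

import torch
import torch.nn as nn

class BaseDCNN(nn.Module):
    """Sketch of the 150x150-20C7-4P4-50C5-4P4-500FC-10SM base network."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=7),        # 20C7: 150x150 -> 144x144x20
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=4, stride=4),  # 4P4:  -> 36x36x20
            nn.Conv2d(20, 50, kernel_size=5),       # 50C5: -> 32x32x50
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=4, stride=4),  # 4P4:  -> 8x8x50
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(8 * 8 * 50, 500),   # 500FC
            nn.ReLU(),
            nn.Dropout(p=0.5),            # dropout on the fully connected layer
            nn.Linear(500, num_classes),  # 10SM; softmax is folded into the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = BaseDCNN()
# mini-batch gradient descent with RMSProp-adapted learning rates
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)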

B. Region-based Predictions

A strategy of generating predictions on individual image regions was introduced by Harley et al. [15]. Following it, regions corresponding to the left body, right body, header and footer of each document image are extracted and resized to the input size of the CNNs. Each region is used to train an individual CNN, with the aim of learning region-specific classifiers and combining their predictions. A sketch of this cropping step follows.
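Below is a minimal sketch of the cropping step using Pillow; the crop fractions (quarter-height header and footer, half-width body halves) are our assumptions, since the paper does not specify the exact region boundaries.

from PIL import Image

def extract_regions(img, size=(150, 150)):
    """Crop the four document regions used by the region-specific CNNs
    and resize each crop to the common network input size."""
    w, h = img.size
    regions = {
        "header":     img.crop((0, 0, w, h // 4)),
        "footer":     img.crop((0, 3 * h // 4, w, h)),
        "left_body":  img.crop((0, 0, w // 2, h)),
        "right_body": img.crop((w // 2, 0, w, h)),
    }
    return {name: r.resize(size, Image.BILINEAR) for name, r in regions.items()}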

1) Training for Rotated Images: As mentioned earlier, the dataset contains quite a few images scanned at 90 and -90 degrees, and they pose a challenge to the recognition process in the following ways:

1) There are only a handful of them compared to the upright images in the entire dataset.

2) Owing to the uneven class-wise distribution of images as well as the random selection of the training set, there may not be enough such samples in the training set.

3) Filters trained on images tilted by 90° will not perform efficiently on images tilted by -90°, and vice versa.

The need, therefore, is to increase the number of tilted images in the dataset while not disturbing the upright images enough to significantly degrade the model's performance.

This is achieved rather simply by appending to the training set a copy of each image reflected about its Y-axis, i.e., flipped horizontally (a sketch of this step is given below). The justification is as follows:

• As mentioned earlier, images resized to 150 × 150 are significantly degraded, preserving merely the overall structure of the document. The structure of an image rotated by 90 degrees, when flipped horizontally, closely replicates the degraded structure of the same (or same class of) image rotated by -90 degrees.

• Again, owing to this degradation, images scanned without any significant rotation mainly retain their structure when flipped, and hence do not degrade the performance of the model significantly.


Fig. 3: (a) Original image samples from the Tobacco3482 dataset resized to 150×150; (b) the flipped versions of the samples shown in (a).

Two original samples from the Tobacco3482 dataset resized to 150×150 and their flipped versions are shown in Fig. 3. The addition of a base DCNN model trained on a combined training set, consisting of the original training samples and their flipped versions, helps the ensemble to correctly classify documents even when their images have been captured in a rotated orientation.
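The augmentation itself is a one-line operation per image. A minimal sketch, assuming the training images are held as Pillow objects:

from PIL import ImageOps

def augment_with_flips(images):
    """Return the original training images together with a horizontally
    flipped (Y-axis reflected) copy of each, doubling the training set."""
    return images + [ImageOps.mirror(im) for im in images]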

C. Supervised Layerwise Training of Deep Convolutional Neural Networks

The DCNN is currently a state-of-the-art classifier. However, it is still seriously handicapped by its significantly high number of free parameters, which requires a large number of training samples to avoid overfitting. Here, we propose to use a comparatively recent strategy [16], [17] of training a DCNN layerwise in a supervised fashion in order to achieve a better generalization capacity of the trained network. The basic idea of such training is as follows. Instead of using the entire architecture of the DCNN from the beginning of the training session, successive convolution and pooling layers are added to the network at different training stages. After the addition of a convolution or pooling layer, the model is trained for a relatively small number of epochs. Further layers are inserted at regular intervals of the training until the target depth of the architecture is reached. Training is completed by fine-tuning the entire architecture with a small learning rate for a certain number of epochs. It may be noted that the lower layers are not removed when a new layer is added at the top. The inherent resistance of CNNs to gradient diffusion makes this an effective method of continuing to train the lower layers, as a significant amount of the gradient is still backpropagated. When used alongside the RMSProp algorithm, it enables fast convergence of the model, yields higher generalization capability, and arguably makes the most effective use of the limited amount of available labeled data. A minimal sketch of this procedure is given below.
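The following is a minimal PyTorch sketch of the procedure. The stage granularity, the temporary linear head, the epoch count and the learning rates are illustrative assumptions rather than the paper's exact settings.

import torch
import torch.nn as nn

def layerwise_train(stages, train_loader, epochs_per_stage=5, num_classes=10):
    """Grow the network stage by stage; after each addition, attach a
    temporary classifier head and train the whole current stack."""
    trunk = nn.Sequential()
    loss_fn = nn.CrossEntropyLoss()  # applies log-softmax internally
    for stage in stages:
        trunk.append(stage)
        head = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_classes))
        model = nn.Sequential(trunk, head)
        model(torch.zeros(1, 1, 150, 150))  # dummy pass to materialize the lazy head
        opt = torch.optim.RMSprop(model.parameters(), lr=1e-3)
        for _ in range(epochs_per_stage):
            for x, y in train_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()  # lower layers keep receiving gradient
                opt.step()
    return trunk  # fine-tune the full network afterwards at a small learning rate

# Example stages matching the base architecture:
# stages = [nn.Sequential(nn.Conv2d(1, 20, 7), nn.ReLU()), nn.MaxPool2d(4, 4),
#           nn.Sequential(nn.Conv2d(20, 50, 5), nn.ReLU()), nn.MaxPool2d(4, 4)]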

D. Generalized Stacking of Base Models

Stacking, or stacked generalization, introduced by Wolpert [18], is an elegant way of combining an ensemble of machine learning models that helps the ensemble produce a lower generalization error than any of its individual base models. This strategy has not been widely used by the machine learning community in the past; in particular, we could not find this idea applied in the literature to an ensemble of CNN or DCNN base models. In the present study, we implemented stacked generalization for a trainable system consisting of several base DCNN models. The steps of our implementation of generalized stacking are as follows:

• Divide the available set of training samples into two disjoint subsets, called the training set and the validation set.

• Train each base DCNN model using the samples of the training subset.

• Test the trained DCNN models on the validation samples.
• Concatenate the outputs of all the trained base DCNN models corresponding to a validation sample and arrange them in a row.

• Assign the true class label to each such row (the number of rows equals the number of validation samples), forming the training set for the SVM.

• Train the SVM using the training samples formed as above.

• Obtain the prediction results of the SVM on novel test samples which were not used in the above steps.

The SVM trained using the above stacking procedure helps to avoid a winner-takes-all approach and combines the base learners in a nonlinear way, which eventually provides better generalization performance on novel test samples than training the SVM directly on the same training samples as the base DCNN models. A minimal sketch of the fitting step follows; our implementation of this generalized stacking is further detailed in the next subsection.
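A minimal scikit-learn sketch of the fitting step; the predict_proba method on the trained base DCNNs is an assumed interface, standing in for whatever routine yields their normalized class-probability outputs.

import numpy as np
from sklearn.svm import LinearSVC

def fit_stacker(base_models, X_val, y_val):
    """Concatenate each base model's class-probability vector on the
    validation samples into one row per sample (n_val x 60 for six
    10-class models) and fit a linear SVM on those rows."""
    meta_X = np.hstack([m.predict_proba(X_val) for m in base_models])
    svm = LinearSVC()
    svm.fit(meta_X, y_val)
    return svm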

E. Combination of the Ensemble

As discussed earlier, we train distinct CNN models for different sections of the image (the header, footer, left body and right body) as well as on the whole image (holistic) and its horizontal flip. This produces six CNN models, each classifying document images into 10 classes. The four CNN models trained on specific regions of the documents exploit only region-specific features, while the model trained with flipped images aims to improve the classification performance on rotated document images. Finally, the individual results of the six CNN models of the ensemble must be combined. The mean rule, a basic strategy for combining an ensemble, although simple and efficient, was found to be inadequate for the present problem. So, a linear Support Vector Machine (SVM) was trained as a meta-classifier to combine the predictions of the individual models. The input vector to this SVM consisted of the concatenated class probability predictions of the models, $[p^1_1, p^1_2, \ldots, p^1_{10}, p^2_1, p^2_2, \ldots, p^2_{10}, \ldots, p^6_1, p^6_2, \ldots, p^6_{10}]$, where $p^i_j$ is the prediction for class $j$ by the $i$-th model. The SVM is trained with the predictions of the models corresponding to the validation samples, and the trained SVM is used to obtain the final results on the test samples. The above experiment was conducted 10 times, corresponding to 10 different random subdivisions of the database into training, validation and test sets, and the median of the corresponding accuracies is reported here.
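Continuing the sketch above (with the same assumed predict_proba interface), test-time prediction assembles the identical concatenated vector for each test sample and lets the trained SVM assign the final label:

import numpy as np

def predict_with_ensemble(base_models, svm, X_test):
    """Build the concatenated probability vector [p^1_1, ..., p^6_10]
    for each test sample and classify it with the stacked linear SVM."""
    meta_X = np.hstack([m.predict_proba(X_test) for m in base_models])
    return svm.predict(meta_X)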


III. EXPERIMENTATION

A. Brief description of the dataset

The Tobacco3482 dataset was used for the experiments on the different neural network models. It is a collection of 3482 non-uniformly distributed, hand-labelled document images forming a 10-class dataset, with a minimum of 121 sample images per class. The Tobacco3482 dataset is a subset of the much larger IIT-CDIP test collection. The images themselves are a mixture of documents scanned at 0 degrees (upright), 90 degrees (rotated right) and -90 degrees (rotated left), an issue that we chose to address separately in this work.

To maintain consistency with previous works for comparison, certain details of their experimental setup were retained. Images were resized to 150 × 150 using bilinear interpolation. Resizing from the original high resolution to such a significantly small scale leaves little but the basic structure of the document intact for the convolutional nets to work on, which is indeed the intention.

Training sets of 80 images per class and validation sets of 20 images per class were selected from the dataset at random; the remaining 2482 images were used as the test set. The final classification accuracy reported for any particular CNN is the median accuracy over 10 such experiments. A sketch of this per-class split is given below.
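A minimal sketch of the per-class random split; representing the dataset as a list of (image, label) pairs is our assumption.

import random
from collections import defaultdict

def split_tobacco(samples, n_train=80, n_val=20, seed=0):
    """Per-class random split: 80 training and 20 validation images per
    class, with all remaining images forming the test set."""
    random.seed(seed)
    by_class = defaultdict(list)
    for image, label in samples:
        by_class[label].append((image, label))
    train, val, test = [], [], []
    for items in by_class.values():
        random.shuffle(items)
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test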

It may be mentioned that the architecture of the DCNN used in a previous study [15] was 150×150-20C7-4P4-50C5-4P4-1000FC-1000FC-10SM which, although a bit deeper and larger, was found to hinder the generalization capability of the trained network, considering the small volume of training samples. We observed that a simpler and more compact model, such as the one used in this work, could provide higher classification accuracy on the same dataset thanks to the proposed training strategy, which includes (i) layerwise initial training, (ii) generalized stacking and (iii) the use of an additional DCNN trained using both original and rotated image samples.

B. Performance

To obtain the classification performance of the proposed system, we made 10 random divisions of the available image samples into training, validation and test subsets; in each division, these three subsets have the same per-class proportions. Simulation of the system using one such random division of the samples is termed a trial; thus, we conducted 10 trials of our simulation. Here, we report the simulation results in terms of the median over the 10 trials, if not otherwise stated. The proposed layerwise-trained DCNN (L-DCNN) models were seen to perform better at the holistic level than regular DCNNs. Similar hyperparameters were used for both types of DCNN models as in Harley et al. [15] and Kang et al. [13], apart from the few architectural changes described in Section II-A. The proposed L-DCNN provided 67.1% classification accuracy, whereas the accuracies obtained by [13] and [15] using similar system architectures were 65.35% and 64.3% respectively. However, [14] and [15] obtained better accuracies using significantly larger system architectures; we had to restrict ourselves to comparatively lighter models owing to the non-availability of higher system configurations at our end.

Fig. 4: Confusion matrix of the proposed system corresponding to trial no. 10 of Table I

TABLE I: Classification accuracies corresponding to mean rule based combination and SVM based combination

Trial No.   Mean rule based combination   SVM based combination
1           0.708                         0.716
2           0.716                         0.695
3           0.711                         0.725
4           0.693                         0.710
5           0.725                         0.725
6           0.718                         0.726
7           0.702                         0.713
8           0.704                         0.728
9           0.682                         0.698
10          0.725                         0.730
Min         0.682                         0.695
Max         0.725                         0.730
Median      0.709                         0.721

Table I shows the accuracies over our run of 10 trials of the proposed system, consisting of six L-DCNN models, on the Tobacco3482 dataset when the results were combined using the standard mean rule vs. an SVM classifier. It can be seen that the SVM meta-classifier based combination performs better than the mean rule based combination in terms of median accuracy. The confusion matrix on the test set corresponding to trial number 10 of Table I is shown in Fig. 4.

The benefit gained from training an additional DCNN on a larger training set consisting of the original image samples and their rotated images has been verified empirically, and the results are shown in Fig. 5. The combination of the ensemble consisting of six DCNN models provides a median accuracy of 72.1%, compared to the median accuracy of 71.2% provided by the ensemble of five DCNN models that excludes the DCNN trained using rotated samples.


Fig. 5: Classification accuracies of two ensembles consisting of five and six DCNN models, corresponding to ten different divisions of the sample set into training, validation and test sets

TABLE II: Comparison of classification accuracies for the two deep learning models

Training Set            Regular DCNN   Layerwise DCNN
Holistic                0.632          0.671
Holistic+Reflected      0.624          0.673
Header                  0.527          0.578
Footer                  0.461          0.525
Left Body               0.486          0.539
Right Body              0.483          0.527
Combination of Models   0.661          0.721


Finally, Table II displays both holistic and region-based classification accuracies on the Tobacco3482 dataset. An SVM meta-classifier is used to combine the ensemble for these tests. With L-DCNNs as the base models, the ensemble achieves a significantly higher accuracy of 72.1%, compared to 66.1% for regular DCNNs.

It should be noted that, compared to the models for 150 × 150 images used by Harley et al. [15] and Kang et al. [13], the complete ensemble used in this work has about 48% of the number of trainable weights. Also, each model in the ensemble has been trained with merely 80 images per class; hence there is enormous potential for improvement of the technique when scaled up in terms of dataset size as well as model depth.

IV. CONCLUSION

In this article, we have introduced an efficient training scheme for an ensemble of deep convolutional neural network models for document structure learning. The proposed scheme includes layerwise training of the individual base DCNNs and generalized stacking as the combination strategy, involving a support vector machine. The goal of the proposed strategy is to improve the generalization capacity of the system even when it is trained using a very small number of training samples. In addition to base DCNN models trained on the whole documents or on various sub-regions of them, the ensemble includes an additional base DCNN model trained on the combined set of original documents and their rotated versions, keeping in mind that in real-life situations a document often gets rotated at the time of its placement on the scan bed. The proposed ensemble obtained a median (over 10 trials) accuracy of 72.1% on the standard Tobacco3482 document dataset, which is significant considering that each individual base DCNN model is trained using only 80 training image samples per class. Another important aspect of the proposed scheme is that the number of connection weights of the individual base DCNN models is approximately 48% of that of the similar base DCNN models used in a previous study on the same dataset.

REFERENCES

[1] N. Chen and D. Blostein, A survey of document image classification: problem statement, classifier architecture and performance evaluation, Int. J. of Doc. Anal. and Recog., vol. 10(1), pp. 1-16, 2007.

[2] D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard, Building a test collection for complex document information processing, Proc. of 29th Annual Int. ACM SIGIR Conference, pp. 665-666, 2006.

[3] A. Dengel, Initial learning of document structure, Proc. of Int. Conf. on Document Analysis and Recognition, pp. 86-90, 1993.

[4] C. Shin, D. Doermann, and A. Rosenfeld, Classification of document pages using structure-based features, Int. J. Doc. Anal. and Recog., vol. 3(4), pp. 232-247, 2001.

[5] J. Hu, R. Kashi, and G. Wilfong, Comparison and classification of documents based on layout similarity, Information Retrieval, vol. 2(2-3), pp. 227-243, 2000.

[6] M. Diligenti, P. Frasconi, and M. Gori, Hidden tree Markov models for document image classification, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25(4), pp. 519-523, 2003.

[7] A. Gordo, F. Perronnin, and E. Valveny, Large-scale document image retrieval and classification with runlength histograms and binary embeddings, Pattern Recognition, vol. 46(7), pp. 1898-1905, 2013.

[8] A. Dengel and F. Dubiel, Clustering and classification of document structure: a machine learning approach, Proc. of the 3rd Int. Conf. on Doc. Anal. and Recog., pp. 587-591, 1995.

[9] P. Heroux, S. Diana, A. Ribert, and E. Trupin, Classification method study for automatic form class identification, Proc. of the 14th Int. Conf. on Patt. Recog., pp. 926-929, 1998.

[10] F. Cesarini, M. Lastri, S. Marinai, and G. Soda, Encoding of modified X-Y trees for document classification, Proc. of the 6th Int. Conf. on Doc. Anal. and Recog., pp. 1131-1136, 2001.

[11] L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Readings in Speech Recognition, A. Waibel and K.-F. Lee (eds.), Morgan Kaufmann Publishers Inc., pp. 267-296, 1990.

[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol. 86(11), pp. 2278-2324, 1998.

[13] L. Kang, J. Kumar, P. Ye, Y. Li, and D. Doermann, Convolutional neural networks for document image classification, Proc. of the 22nd Int. Conf. on Patt. Recog. (ICPR), pp. 3168-3172, 2014.

[14] M. Z. Afzal, S. Capobianco, M. I. Malik, S. Marinai, T. M. Breuel, A. Dengel, and M. Liwicki, DeepDocClassifier: document classification with deep convolutional neural network, Proc. of the 13th Int. Conf. on Document Analysis and Recognition (ICDAR), pp. 1111-1115, 2015.

[15] A. W. Harley, A. Ufkes, and K. G. Derpanis, Evaluation of deep convolutional nets for document image classification and retrieval, Proc. of Int. Conf. on Doc. Anal. and Recog., pp. 991-995, 2015.

[16] D. Plata, R. Ramos, and A. Gonzalez, Supervised greedy layer-wise training for deep convolutional networks with small datasets, 7th Int. Conf. on Computational Collective Intelligence, pp. 275-284, 2015.

[17] S. Roy, Supervised Layerwise Training of Deep Convolutional Neural Networks for Bangla Compound Character Recognition, Master's Thesis, Jadavpur University, India, 2015.

[18] D. H. Wolpert, Stacked generalization, Neural Networks, vol. 5, pp. 241-259, 1992.

[19] J. Deng et al., ImageNet: a large-scale hierarchical image database, Proc. of IEEE Conf. on CVPR, pp. 248-255, 2009.
