DEEP IMAGE RETRIEVAL AND CLASSIFICATION ON SPARKNET

Peng Su, Hongyang Li

The Chinese University of Hong Kong, Department of Electronic Engineering, {psu, yangli}@ee.cuhk.edu.hk

Michael R. Lyu

The Chinese University of Hong Kong, Department of Computer Science and Engineering

[email protected]

ABSTRACT

Image retrieval and classification are active topics in computer vision and have attracted great attention with the emergence of large-scale data. We propose a new scheme that combines deep learning models with a large-scale computing platform to jointly learn powerful feature representations for image classification and retrieval. We achieve superior performance on the ImageNet dataset, and the framework is easy to embed in everyday user applications. First, we conduct the classification task using deep convolutional neural networks with several novel techniques, including batch normalization and multi-crop testing, to obtain better performance. Then we transfer the network’s knowledge to the image retrieval task by comparing the feature codebook of the query image with the feature database extracted from the deep model. This search pipeline is implemented in a MapReduce framework on the Spark platform, which is suitable for large-scale and real-time data processing. Finally, the system returns to the user textual information about the predicted object, searched from the Internet, together with similar images from the retrieval stage, making our work a real application.

Index Terms: Deep Convolutional Neural Networks, Image Retrieval and Classification, Large-scale Data.

1. INTRODUCTION

Good feature representations are of vital importance to computer vision research [1], especially in image classification and retrieval, where object categories vary widely in shape, orientation, appearance, lighting, etc. Traditional methods resort to hand-crafted features, such as Histogram of Gradients (HOG) [2], SIFT [3], LBP [4] and GIST [5]. Combined with a discriminative classifier such as a kernel or linear SVM, these models achieved great success in the early stage of related topics. Recently, deep learning models based on the convolutional neural network (CNN) and its variants have developed rapidly and have become the dominant method in image classification since 2012, when Krizhevsky et al. [6] used a regularization method called ‘dropout’, which randomly shuts down neuron outputs, and achieved state-of-the-art accuracy in the yearly ImageNet challenge. CNNs have proved to be a much better way of feature representation in many respects. However, choosing a good initialization point remains a crucial and challenging problem, especially as CNN models go deeper, from 4-5 layers to more than 20 layers. Although there is little theory behind the rapid growth of deep learning, several attempts, including parameterized ReLU [7] and LVUnit [8], have been proposed to investigate how the initial points affect the final training loss. In addition, some useful tricks, such as data augmentation and multi-scale training, are incorporated into the general deep learning framework to overcome overfitting as model capacity grows larger.

Distributed computing platforms seem to be the next generation of data processing, given the fast growth and availability of large-scale data. For example, Hadoop¹ is developed based on BigTable [9], the Google File System [10] and MapReduce [11]. It has proved a great success in the area of large-scale computing. However, it falls short in I/O communication and iterative algorithms. As its derivative, Spark [12] is able to perform in-memory computations on large clusters in a fault-tolerant manner and can overcome these drawbacks of Hadoop. Applying and deploying deep learning algorithms on distributed platforms is necessary to handle large-scale data as well as to obtain cutting-edge performance.

To address these problems, we propose a joint framework that classifies and retrieves images at the same time, with a fast implementation on Spark. Figure 1 describes the pipeline, with brief explanations of each step in the caption. The main contributions of this work are threefold:

• Modify existing deep neural networks with batch normalization and a multi-crop scheme to improve the classification performance.

• Embed the object retrieval and classification tasks jointly on the fly using the parallelized data platform, Spark.

• Design a complete user-content based application with quick item searching bundled with retrieved images as well as a news feed.

1 http://hadoop.apache.org/

[Figure 1 image. Left panel: Object Retrieval on Spark (query image, query's codebook, comparison with a database of 1.5 million images). Middle panel: Deep Convolutional Network (7x7 conv, max pooling, inception blocks with batch normalization, depth concatenation, average pooling, softmax, class scores). Right panel: Application (predicted label, online search, similar-looking images).]

Fig. 1. The pipeline of our joint classification and retrieval system, which is user-content based. First we train the classification deep neural network on ImageNet with novel batch normalization and multi-crop scheme (middle); then, given the input query image from the user, we feed the input into the trained network and predict the class label of the image. The retrieval system is built on the Spark platform, where a codebook of the query image is generated from the network (left); at last, we send back to the user the retrieved similar images found in the database as well as some useful information searched online using the keywords of the predicted class (right).

2. RELATED WORK

Deep learning has changed the whole computer vision community significantly since the success of AlexNet [6] in the ImageNet 2012 classification task. Traditional classification algorithms, including bag of words [13] and spatial pyramid matching [14], extract hand-crafted features and combine them with a discriminative classifier, such as an SVM and its variants. Deep learning overturns this approach by constructing a neural network of multiple layers and carefully tuning its parameters. The key problems of back-propagation and effective training are addressed in the literature [15, 16], and performance can be further improved by adding more layers to the network, as in the VGG model [7], which contains 19 layers and stacks consecutive 3 × 3 kernels to obtain a larger receptive field, and the GoogLeNet inception model [17], which has no fully connected layers and keeps the number of parameters small by adding more 1 × 1 filters. Adding layers can be viewed as a way of increasing the non-linear expressive power of the feature space.

Image retrieval has many applications in various scenarios, such as online advertisement, tourism promotion and recommendation systems. While many efforts have been made to improve its speed and accuracy, it remains a challenging problem. Much work has been done to address the problem, such as deep semantic feature transferring [18]. In particular, there is a trend of using automatically learned features from deep models together with kernel-based locality-sensitive hashing (LSH) to achieve better performance in object retrieval [19]. In this work, we build the retrieval system on a large-scale dataset that contains 1.5 million images; moreover, we directly compute the Euclidean distance between the query image and all accessible images in the database to find similar-looking images, with the help of the MapReduce framework on Spark [12], a state-of-the-art platform for parallelized computing jobs. Note that we do not apply LSH as classical methods would do, because we find that such a simple matching scheme suffices to achieve satisfying retrieval results.

3. ALGORITHM

We present our algorithm by first introducing the convolutional neural network (CNN), which provides strong expressive power of feature representation for image classification (Sec. 3.1); then we explicitly state how the network is transferred to the task of object retrieval on the Spark platform (Sec. 3.2); at last, we show that such a jointly learned framework can be made into a real-world application by providing users with information about the predicted object, searched from the Internet using its keywords (Sec. 3.3).

3.1. Deep Convolutional Model

In this paper, we employ a popular variant called the inception model [8], which descends from GoogLeNet [17] and inherits the merits of its ancestor (inception-based, time-efficient, superior performance, etc.), to build the network, extract features and represent each image in the classification task. Meanwhile, we merge several useful techniques, such as batch normalization and multi-crop testing, to further boost the performance.

[Figure 2 image: an Inception Module. Input feature maps pass through 1x1 conv, 3x3 conv, 5x5 conv and 3x3 pool blocks, whose outputs are joined by feature concatenation.]

Fig. 2. The basic inception module with dimensional reductions. Note that the 1 × 1 convolution reduces the channel dimension while keeping the spatial resolution unchanged.

[Figure 3 image: Input, 7x7 conv, MaxPool, a stack of inception blocks (1x1, 3x3 and 5x5 conv, 3x3 pool, DepthConcat) interleaved with MaxPool, more inception blocks, AvgPool, Softmax, class scores.]

Fig. 3. Prototype of our network architecture with all the bells and whistles, which contains 22 parametric layers and several non-parametric (normalization, pooling, etc.) layers.

3.1.1. Network Architecture

The network is designed with computational efficiency and practicality in mind, so that inference can be run on individual devices, including those with limited computational resources and, in particular, a low memory footprint. Each basic block is called an ‘inception’ module (Figure 2); it keeps the necessary spatial information while discarding redundancy in the parameters through pooling and normalization operations. The inception network has 22 parametric convolutional layers (Figure 3) plus several operational, non-parametric counterparts. Note that we do not use any fully connected layers, in order to save parameters. The use of average pooling before the classifier is based on [15], where averaging each feature map into a concentrated summary could still represent the image well. We find that changing from fully connected layers to average pooling improves the top-1 accuracy by about 0.6%.
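The parameter saving from the 1 × 1 reductions can be checked with simple weight counting. The sketch below is illustrative only: the channel sizes (192 input channels, a 5 × 5 branch with 32 outputs, a 1 × 1 reduction to 16 channels) are assumed values chosen for the example, not figures reported in this paper, and `conv_weights` is a hypothetical helper.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


# Assumed, illustrative channel sizes (not taken from the paper).
C_IN, C_REDUCED, C_OUT = 192, 16, 32

# 5x5 convolution applied directly to the input feature maps.
direct = conv_weights(5, C_IN, C_OUT)

# 1x1 reduction first, then the 5x5 convolution on fewer channels.
reduced = conv_weights(1, C_IN, C_REDUCED) + conv_weights(5, C_REDUCED, C_OUT)

print("direct:", direct, "with 1x1 reduction:", reduced)
```

Under these assumed sizes the reduced branch needs roughly a tenth of the weights of the direct 5 × 5 convolution, while the spatial resolution of the feature maps is untouched.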

3.1.2. Network Training and Batch Normalization

Stochastic gradient descent (SGD) has proved to be an effective way of training deep networks, and SGD variants such as momentum and Adagrad have been used to achieve state-of-the-art performance. We adopt SGD for training the whole network, which optimizes the model parameters $\Theta$ so as to minimize the loss

$$\Theta = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} l(x_i, \Theta) \tag{1}$$

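where $N$ is the number of training samples and $x_i$ is the input of the $i$-th sample. We adopt a log-likelihood form of the loss:

$$l(x_i, \Theta) = -\log p_{i, l_i}, \tag{2}$$

where $p_{i, l_i}$ is the probability that the network assigns to the label $l_i$ of the $i$-th sample.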
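As a minimal sketch of the log-likelihood loss in Eq. (2), assuming the class probabilities come from a softmax over the network's raw class scores, the per-sample loss can be computed as follows. The function names are hypothetical and the scores are plain Python lists rather than Caffe blobs.

```python
import math


def softmax(scores):
    """Convert raw class scores to probabilities (shifted for numerical stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nll_loss(scores, label):
    """Negative log-probability assigned to the ground-truth class index."""
    return -math.log(softmax(scores)[label])


# A confident, correct prediction gives a small loss; a wrong one a large loss.
print(nll_loss([10.0, 0.0, 0.0], 0), nll_loss([10.0, 0.0, 0.0], 1))
```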

As for the parameter update, gradient descent is applied according to the following rule:

$$\Theta_{t+1} = \Theta_t - \frac{\alpha}{m} \sum_{i=1}^{m} \frac{\partial l(x_i, \Theta)}{\partial \Theta} \tag{3}$$

where $\alpha$ is the learning rate and $m$ is the mini-batch size. In this way, the gradient of the loss flows back through the network by back-propagation (the chain rule), and the parameters are updated once per mini-batch.
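The update rule in Eq. (3) can be sketched for a single scalar parameter as below. This is a toy illustration rather than our Caffe training loop: `grad_fn`, the quadratic loss used in the example, and the default hyper-parameters are assumptions made only to keep the example self-contained.

```python
import random


def sgd_step(theta, batch, grad_fn, lr):
    """One update of Eq. (3): move against the mini-batch average gradient."""
    m = len(batch)
    avg_grad = sum(grad_fn(x, theta) for x in batch) / m
    return theta - lr * avg_grad


def train(data, grad_fn, theta0=0.0, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Run mini-batch SGD, reshuffling the data once per epoch."""
    rng = random.Random(seed)
    data = list(data)
    theta = theta0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            theta = sgd_step(theta, data[start:start + batch_size], grad_fn, lr)
    return theta


# Toy loss l(x, theta) = (theta - x)^2 / 2, whose gradient is (theta - x);
# the minimiser of the average loss is the mean of the data.
theta = train([1.0, 2.0, 3.0, 4.0], lambda x, t: t - x)
print(theta)
```

With the batch size equal to the dataset size, as in this toy call, each epoch is one full-batch step; smaller batches trade gradient noise for more frequent updates.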

However, this traditional method suffers from slow convergence, because the distribution of each layer's inputs shifts during training as the network goes deeper; that is, the data take on different distributions if we do not normalize them. Thus we adopt the batch normalization (BN) technique [8] to force the data in each layer to have zero mean and unit variance. Specifically, for a typical BN model, each time the network is fed with input $x_i$, we transform the input as follows:

$$\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \tag{4}$$

$$y_i \leftarrow \gamma \hat{x}_i + \beta \tag{5}$$

where $\mu_B$ and $\sigma_B^2$ denote the mean and variance of the mini-batch, respectively; $\gamma$ and $\beta$ are a pair of parameters which scale and shift the normalized value $\hat{x}_i$; and $\epsilon$ is a small constant that prevents division by zero. The parameters $\gamma$ and $\beta$ are learned along with the original model parameters and restore the representation power of the network: with suitable values, $y_i$ can represent the identity transform, so the layer is not forced to keep its inputs in the linear regime of the nonlinearity. During training, the BN module back-propagates the gradient of the loss $l$ with respect to its input and parameters:

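$$\frac{\partial l}{\partial x_i} = \frac{\partial l}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial l}{\partial \sigma_B^2} \cdot \frac{2(x_i - \mu_B)}{m} + \frac{\partial l}{\partial \mu_B} \cdot \frac{1}{m} \tag{6}$$

$$\frac{\partial l}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial l}{\partial y_i} \cdot \hat{x}_i \tag{7}$$

$$\frac{\partial l}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial l}{\partial y_i} \tag{8}$$

where $\hat{x}_i$ is the normalized value of $x_i$ from Eq. (4).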
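The forward transform of Eqs. (4)-(5) and the parameter gradients of Eqs. (7)-(8) can be sketched for one scalar activation observed over a mini-batch. This is a simplified illustration under stated assumptions: the function names are hypothetical, the value `eps=1e-5` is an assumed default, and the input gradient of Eq. (6) is omitted for brevity.

```python
import math


def batch_norm_forward(xs, gamma, beta, eps=1e-5):
    """Normalize a mini-batch of scalars (Eq. 4), then scale and shift (Eq. 5)."""
    m = len(xs)
    mu = sum(xs) / m                              # mini-batch mean
    var = sum((x - mu) ** 2 for x in xs) / m      # mini-batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    ys = [gamma * v + beta for v in x_hat]
    return ys, x_hat


def batch_norm_param_grads(dys, x_hat):
    """Gradients of the loss w.r.t. gamma (Eq. 7) and beta (Eq. 8)."""
    dgamma = sum(dy * v for dy, v in zip(dys, x_hat))
    dbeta = sum(dys)
    return dgamma, dbeta


ys, x_hat = batch_norm_forward([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=1.0)
print(x_hat, ys)
```

Here `dys` stands for the upstream gradients $\partial l / \partial y_i$ supplied by the following layer.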

3.1.3. Multi-view Testing

During testing, we adopt a multi-view voting strategy: we take the top, bottom, left and right crops of the input image as well as their horizontal flips, forward the network 10 times, and average the classification scores, which further improves the performance. The intuition behind this simple yet effective trick is that the image is examined at multiple locations and scales, which makes the prediction more robust and generalizes better to the test data.
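The score averaging behind multi-view testing can be sketched as follows. The sketch assumes an image is a list of pixel rows and that `model` is a hypothetical callable returning one score per class; the exact crop positions are left to the caller, since only the averaging step is illustrated here.

```python
def hflip(image):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]


def with_flips(crops):
    """Return the crops followed by their horizontal flips."""
    return list(crops) + [hflip(c) for c in crops]


def multi_view_predict(model, views):
    """Average the class scores over all views and return (label, scores)."""
    per_view = [model(v) for v in views]
    n = len(per_view)
    scores = [sum(col) / n for col in zip(*per_view)]
    label = max(range(len(scores)), key=lambda c: scores[c])
    return label, scores


# Toy usage: the "model" is the identity, so each view is already a score list.
label, scores = multi_view_predict(lambda v: v, [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
print(label, scores)
```

A single view that votes for the wrong class, such as the second one in the toy call, is outweighed once the scores are averaged.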

3.2. Image Retrieval on SparkNet

After we train the deep model with batch normalization and apply the multi-crop scheme, the classification error decreases by a significant margin, which in turn demonstrates the expressive power of the network's feature representation. We exploit the model’s capacity and transfer the knowledge to object retrieval, where the Caffe system is built on top of the large-scale data-parallel platform Spark. Algorithm 1 shows the pipeline in a Python style.

Specifically, for a given input image (query) from the user, we forward the network on the fly and extract its features. We define a function called extract_feat to extract all the validation image features in one package. Once we feed the image package path to this function, Caffe is triggered automatically to extract all the features for that particular package. Moreover, we initialize a Spark RDD to contain the paths of all the image packages. After the program and clusters are set up, all the image packages are processed in parallel. The features² of the query image are called the codebook, or representation, and we compute the distance between the query and those available in the database. The retrieved images are the k images with the smallest Euclidean distance to the query image. We set k = 100 throughout all experiments.
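The matching step described above, taking the k database entries closest to the query in Euclidean distance, can be sketched on a single machine as follows. The dictionary layout of `database` and the function names are assumptions made for illustration; the distributed version runs the same comparison on Spark.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query, database, k=100):
    """Return the ids of the k database images closest to the query.

    database: dict mapping image id -> feature vector (assumed layout).
    """
    return heapq.nsmallest(
        k, database, key=lambda img_id: euclidean(query, database[img_id])
    )


db = {"a": [0.0, 0.0], "b": [3.0, 4.0], "c": [1.0, 0.0]}
print(retrieve([0.0, 0.0], db, k=2))
```

Using `heapq.nsmallest` avoids sorting the whole database when k is much smaller than the number of images.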

Algorithm 1: Object Retrieval on SparkNet

 1  def extract_feat(x):
        # extract validation feature on the fly
 2  def main():
        # create Spark context
 3      sc = SparkContext()
        # select the image package
 4      cls_range = range(1, 1000)
 5      x = imread('val.jpeg')
        # extract the feature in parallel
 6      feat = sc.parallelize(cls_range, 100) \
                 .map(lambda x: extract_feat(x)) \
                 .collect()
        # find the predicted class
 7      cls = sc.parallelize(filelist, len(filelist)) \
                .map(lambda feat: softmax(feat)) \
                .reduce(add)
        # compute the distance
 8      dist = sc.parallelize(data[0:], 100) \
                 .map(lambda x: compute(x, val_vec)) \
                 .sortByKey(True).collect()
        # extract the keyword online
 9      keywd = sc.parallelize(filelist, len(filelist)) \
                  .map(lambda x: keyword[x]) \
                  .reduce(add)
10      sc.stop()
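The distance step of Algorithm 1 (line 8) could be written in PySpark roughly as below. This is a hedged sketch, not the code used in our experiments: the function names, the `(image_id, feature_vector)` record layout and the use of a broadcast variable are assumptions, and `takeOrdered` is swapped in for the `sortByKey(...).collect()` of the listing so that only the k best matches are returned to the driver.

```python
def squared_distance(a, b):
    """Squared Euclidean distance; it preserves the nearest-neighbour ordering."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve_on_spark(query_feat, database, k=100, num_slices=100):
    """Return the k closest (distance, image_id) pairs.

    database: list of (image_id, feature_vector) pairs held by the driver
    (an assumed layout for this sketch).
    """
    # Imported here so the helper above stays usable without PySpark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    try:
        query = sc.broadcast(query_feat)  # ship the query once to every worker
        return (
            sc.parallelize(database, num_slices)
            .map(lambda rec: (squared_distance(rec[1], query.value), rec[0]))
            .takeOrdered(k, key=lambda pair: pair[0])
        )
    finally:
        sc.stop()
```

Comparing squared distances skips the square root without changing which images are returned.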

3.3. Real-time User Application

One of our goals in this work is to deliver a product for real-world applications. Whenever a user inputs an image, they can find out exactly which category it falls into through the classification model. Note that our result is fine-grained, meaning that we can identify, for example, the specific breed of a dog. Also, our retrieval system sends back similar-looking images by quickly and efficiently matching the query against the database samples.

To this end, we add an additional feature to the framework: we first associate the predicted class with its keyword description provided by ImageNet, then search online based on the category's keywords, and finally send back to the user some information about the query image, for instance a piece of news or a Wikipedia page. Line #9 in Algorithm 1 gives a rough description of such a searching process.

2 It is the vectorized form of the feature maps in the last convolutional layer before average pooling in the network.

URL: spark://192.168.72.139:7077
REST URL: spark://192.168.72.139:6066 (cluster mode)
Alive Workers: 8
Cores in use: 64 Total, 64 Used
Memory in use: 242.8GB Total, 16.0GB Used
Applications: 1 Running, 32 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE

Fig. 4. Detailed deployment of the Spark configuration.

4. EXPERIMENT

We conduct several experiments to verify our algorithm in this section. The dataset for image classification and retrieval comes from a large-scale database called ImageNet³, which contains 22k classes and 8 million images. We use the standard subset of 1000 classes and 1.5 million images, which is also used and evaluated in each year's ImageNet classification challenge. The validation set consists of 50,000 images. All the source images are collected from Flickr, Google, Bing and other search engines, and hand-labeled with the presence or absence of 1000 object categories.

The evaluation metric for classification is the top-5 error rate: the network outputs the five most likely classes for an image, and the image is marked as correct if its label is among these five predictions. The ratio of misclassified images to the whole validation set is the error rate used in this paper. As for the retrieval task, we define a similar error rate, where a retrieved image is marked as false if it does not belong to the label class of the query image.
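The top-5 error rate defined above can be computed as in the following sketch, which assumes that the network output for each image is a list with one score per class and that labels are class indices. The function name and the toy scores are hypothetical.

```python
def top_k_error(score_lists, labels, k=5):
    """Fraction of samples whose true label is not among the k highest scores."""
    wrong = 0
    for scores, label in zip(score_lists, labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            wrong += 1
    return wrong / len(labels)


# Toy example with 3 classes and k = 1 (i.e. the usual top-1 error).
scores = [[0.1, 0.5, 0.4], [0.7, 0.2, 0.1]]
print(top_k_error(scores, [1, 1], k=1))
```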

4.1. Network Setup and Spark Deployment

We use the Caffe [20] package to implement our network and incorporate cudnn v3⁴ to reduce the computation time (especially of the spatialConvolutionLayer) of training and feature extraction. During training, the batch size is set to 256 and the momentum to 0.9. The model is regularized by a weight decay of 5 × 10^-4. We also find that adding dropout regularization at the end of the network, after the convolutions, enhances the classification results (dropout ratio set to 0.5). The learning rate is initially set to 10^-2 and then decreased by a factor of 10 when the validation set accuracy stops improving. In total, the learning rate is decreased 3 times, and the training is terminated after 370K iterations (74 epochs). For the classification task, the training lasts around 3.6 days on a server with 4 Titan-X GPUs. During inference, we use a batch size of 200 and extract all image features of the 1000 classes on the validation set within 30 minutes (0.08 second per image).

3 http://image-net.org/
4 https://developer.nvidia.com/cudnn

Fig. 5. Test error of different training and test schemes on the CIFAR-10 classification dataset.

Table 1. Timing of feature extraction on the ImageNet classification validation set under different hardware allocations. Each entry is in seconds per image.

CPU    1GPU   2GPU   2GPU cudnn   4GPU cudnn
4.2    2.5    1.1    0.8          0.5
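The reported training schedule (370K iterations, batch size 256, 74 epochs) can be cross-checked with a few lines of arithmetic. The snippet only uses the numbers stated in the text; reading the result as the number of training images per epoch is our interpretation.

```python
ITERATIONS = 370_000   # reported number of training iterations
BATCH_SIZE = 256       # reported mini-batch size
EPOCHS = 74            # reported number of epochs

# Total number of samples processed during training.
samples_seen = ITERATIONS * BATCH_SIZE

# Samples processed per epoch, i.e. the implied training-set size.
images_per_epoch = samples_seen / EPOCHS

print(samples_seen, images_per_epoch)
```

The schedule therefore implies about 1.28 million images per epoch, somewhat fewer than the 1.5 million images quoted for the whole subset, which presumably also counts images outside the training split (an assumption on our part).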

Then we transfer our offline-trained model to the Spark⁵ platform and perform the object retrieval task on the fly. We set up two server machines, each with 8 cores, and deploy the system in standalone mode. One server is set to be the master node and the other serves as the slave node. On each server machine, we create 8 workers, and each worker gets 30GB of memory. When a job is submitted, the master node manages the resources and distributes the job to all workers. Figure 4 summarizes the configuration.

4.2. Quantitative Comparison

Table 1 shows the breakdown of computational efficiency under different hardware conditions, using feature extraction as an illustration. The time per image drops from 4.2 seconds on the CPU to 2.5 seconds on a single GPU, and is further reduced to 0.5 seconds when we incorporate four GPUs and cudnn.

To verify that our modification schemes, i.e., batch normalization and multi-crop voting, are effective, we conduct an ablation study (Figure 5) on a smaller dataset called CIFAR-10 [6], which has only 10 classes, each containing 5000 images. Without batch normalization, the test error decreases slowly as the learning rate changes, and the network's data distribution is neither stable nor well regularized. The multi-crop and dropout strategies boost the performance further.

5 http://spark.apache.org/

Table 2. Object retrieval accuracy under different feature representations. We compare k = 100 retrieved images from the whole ImageNet database.

Feature representation    Retrieval accuracy
low1 (HOG+SIFT)           84.45%
low2 (LBP+GIST)           83.37%
low1 + low2               87.25%
deep net w/o BN           93.77%
deep net                  95.21%

Table 3. Comparison of top-5 classification accuracy in different settings as well as with other state-of-the-art methods.

Method or setting         Classification accuracy
low1 + kNN                32.45%
low1 + ker SVM            52.37%
low1 + li SVM             63.25%
low2 + li SVM             64.77%
low1 + low2 + li SVM      68.29%
AlexNet [6]               84.70%
Overfeat [16]             86.40%
Ours                      89.45%

Table 2 reports the effect of different feature representations used as the codebook in the retrieval task. We use HOG, SIFT, LBP, GIST and their combinations as comparison features. Table 3 also verifies the superior feature representation ability of deep models compared with their hand-crafted counterparts. In addition, our modified classification model outperforms other popular methods, by 3.05 percentage points over Overfeat and 4.75 over AlexNet. Note that we use k-nearest neighbor (kNN), kernel SVM (ker SVM) and linear SVM (li SVM) as alternate classifiers in some settings.

5. CONCLUSION

In this paper, we propose an efficient joint classification and retrieval model using Caffe in the Spark framework. We modify the deep neural networks with batch normalization and multi-view voting to further improve the classification performance, which serves as a better starting point for feature codebook generation in object retrieval. With the help of deep learning algorithms and large-scale computing platforms, we obtain a significant improvement in terms of speed and accuracy. We also design a user interface that embeds these two tasks and makes the system ready for daily use. Experimental results show that we achieve better performance in terms of top-5 test error. We intend to embed the Caffe training process into the Spark platform as future work.

6. REFERENCES

[1] Jianbo Shi and Carlo Tomasi. Good features to track. In Computer Vision and Pattern Recognition, 1994. Proceedings CVPR'94., 1994 IEEE Computer Society Conference on, pages 593–600. IEEE, 1994.

[2] David G. Lowe. Object recognition from local scale-invariant features. Computer Vision, 1999.

[3] Zhiyi Zhang, Lianwen Jin, Kai Ding, and Xue Gao. Character-SIFT: a novel feature for offline handwritten Chinese character recognition. Analysis and Recognition, 2009.

[4] Matti Pietikainen, Abdenour Hadid, Guoying Zhao, and Timo Ahonen. Computer Vision Using Local Binary Patterns, chapter Local Binary Patterns for Still Images. 2011.

[5] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3), 2001.

[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.

[7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.

[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.

[9] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. BigTable: A distributed storage system for structured data. ACM Transactions on Computer Systems, 2008.

[10] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. SOSP, 2003.

[11] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. ACM Communications, 51(10), 2008.

[12] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. Cloud Computing, 2010.

[13] Eric Nowak, Frederic Jurie, and Bill Triggs. Sampling strategies for bag-of-features image classification. ECCV, 2006.

[14] Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyramid matching using sparse coding for image classification. CVPR, 2009.

[15] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. NIPS, 2014.

[16] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.

[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CVPR, 2015.

[18] Fang Zhao, Yongzhen Huang, Liang Wang, and Tieniu Tan. Deep semantic ranking based hashing for multi-label image retrieval. CVPR, 2015.

[19] Ke Jiang, Qichao Que, and Brian Kulis. Revisiting kernelized locality-sensitive hashing for improved large-scale image retrieval. CVPR, 2015.

[20] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. ACM Multimedia, 2014.
