
State-of-the-art Object Detection Algorithms

Jong-Chyi Su
University of California, San Diego

9500 Gilman Dr., La Jolla, [email protected]

Abstract

Object detection is one of the fundamental tasks in computer vision. In this report, I present three state-of-the-art algorithms: Integral Channel Features (ICF) [1], Discriminatively Trained Part Based Models (DPM) [4], and Rich Feature Hierarchies for Convolutional Neural Networks (RCNN) [5]. DPM, which uses a deformable part model and a latent SVM for training, is the most widely used algorithm for object detection. ICF, which is fast and performs well on pedestrian detection, combines several simple image transformations and is trained with AdaBoost. RCNN, which combines selective search with a convolutional neural network, is currently the best-performing algorithm on the PASCAL VOC dataset.

1. Introduction

Object detection is still an important and unresolved problem in computer vision. From the PASCAL VOC 2007 object detection task until now, the accuracy (mean average precision) of state-of-the-art algorithms has increased from about 20% to about 50%. However, there is still much room for improvement.

One of the first widely successful algorithms is Discriminatively Trained Part Based Models (DPM) [4], which lies between generative and discriminative models. First, a feature pyramid is built over multiple scales of the input image. Then, a root filter and several part filters are applied to obtain filter responses. These responses are combined in a star-structured model with a deformation cost for each part, and the classifiers are trained with a latent SVM. DPM is still widely used, and other ideas can be built on top of it, such as combining segmentation with DPM [6].
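The pyramid step can be illustrated by listing the scale factors of an image pyramid in which each level is 2^(-1/λ) times the size of the previous one, so the image halves every λ levels. This is a minimal sketch: the function name `pyramid_scales`, the default of 10 levels per octave, and the 40-pixel minimum size are assumptions for illustration, not the settings of the released DPM code.

```python
def pyramid_scales(width, height, levels_per_octave=10, min_size=40):
    """Return the scale factors of a multi-scale image pyramid (hypothetical helper).

    Each level is 2 ** (-1 / levels_per_octave) times the previous one,
    so the image halves in size every `levels_per_octave` levels.
    Levels whose shorter side falls below `min_size` pixels are dropped.
    """
    step = 2 ** (-1.0 / levels_per_octave)
    scales = []
    scale = 1.0
    while min(width, height) * scale >= min_size:
        scales.append(scale)
        scale *= step
    return scales


scales = pyramid_scales(640, 480)
print(len(scales), [round(s, 3) for s in scales[:3]])
```

A filter of fixed size applied at every level of such a pyramid effectively searches for objects at every one of these scales.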

On the other hand, in Integral Channel Features (ICF) [1], the authors combine multiple registered image channels, which are computed by simple linear and nonlinear transformations of the input image. In later work [2], they argue that once a feature such as HOG has been computed at one scale, the corresponding features at nearby higher and lower scales can be approximated instead of recomputed. Integrating these features and training a cascade of AdaBoost classifiers gives good accuracy for pedestrian detection. Although this simple algorithm works well on relatively rigid, upright objects such as pedestrians, it may be hard to apply to more general object categories.
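Roughly, the scale approximation in [2] can be summarized as a power law. In the notation below, which is mine rather than the paper's, f_Ω(I) is a statistic of channel type Ω (for example, the mean gradient magnitude) computed on image I, I_s is I resampled by a factor s, and λ_Ω is an exponent estimated empirically for each channel type. The approximation is intended for nearby scales (within about an octave), so features only need to be computed exactly at a sparse set of scales.

$$
f_\Omega(I_s) \approx f_\Omega(I)\, s^{-\lambda_\Omega}
$$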

In addition, the concept behind detection algorithms has also changed recently. The sliding-window method, or similarly the pyramid method, is time-consuming because the classifier must be evaluated at every position and scale. Recently, generating a small set of high-recall region proposals [7] has become widely used. Besides, deep learning, in particular convolutional neural networks (CNNs), has become successful at learning features for image classification [8]. The RCNN paper [5], Rich Feature Hierarchies for Convolutional Neural Networks, combines CNN features with selective search and has made huge progress on the PASCAL VOC object detection task. In the next section, I briefly introduce these three algorithms. I also tried the authors' detection code and show the results. Finally, I discuss these algorithms and future work.
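To make the cost of exhaustive search concrete, the sketch below counts the windows visited by a multi-scale sliding-window scan. The 64 × 128 window, 8-pixel stride, and 8 scales per octave are assumed values chosen for illustration (a typical pedestrian setting), and `count_sliding_windows` is a hypothetical helper. The result can be compared with the roughly 2,000 selective-search proposals per image used by RCNN [5].

```python
def count_sliding_windows(width, height, win_w=64, win_h=128, stride=8,
                          scale_step=2 ** (-1 / 8)):
    """Count windows visited by an exhaustive multi-scale sliding-window scan.

    At each level the image is shrunk by `scale_step`, and a fixed-size
    window is slid with the given stride, until the image becomes smaller
    than the window.
    """
    total = 0
    scale = 1.0
    while width * scale >= win_w and height * scale >= win_h:
        w = int(width * scale)
        h = int(height * scale)
        # Number of horizontal positions times number of vertical positions.
        total += ((w - win_w) // stride + 1) * ((h - win_h) // stride + 1)
        scale *= scale_step
    return total


print(count_sliding_windows(640, 480))
```

Every one of these windows must be scored by the classifier, whereas a proposal method scores only its candidate regions.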

2. Object Detection Algorithms

2.1. Integral Channel Features (ICF)

In Integral Channel Features [1], the authors combine multiple registered image channels computed by linear and nonlinear transformations. These eight transformations, including color, gradient, edge, gradient histogram, difference of Gaussian, thresholding, and the absolute value of Gaussian, are all translationally invariant, so each channel only needs to be computed once for the whole image and can then be shared by every detection window. With a cascade of AdaBoost classifiers trained on these features, the whole method is fast and effective for pedestrian detection. The experiments also show empirically that most of the parameters of these transformations are not crucial. The method achieves 79% accuracy on the INRIA dataset under the PASCAL criterion.
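The benefit of computing each channel once can be sketched with first-order channel features, i.e. sums of one channel over a rectangle, evaluated in constant time with an integral image. This is a simplified illustration, not the authors' implementation: the single central-difference gradient-magnitude channel and all function names are assumptions, while ICF itself uses several channel types (e.g. color and gradient histograms).

```python
def gradient_magnitude_channel(img):
    """Approximate gradient magnitude with central differences (borders clamped)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


def integral_image(ch):
    """ii[y][x] = sum of ch over rows < y and columns < x (zero-padded border)."""
    h, w = len(ch), len(ch[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += ch[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x0, y0, x1, y1):
    """Sum of the channel over [x0, x1) x [y0, y1) using four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


# Usage: build the channel and its integral image once, then any
# rectangle in any detection window costs only four lookups.
img = [[float((x * y) % 7) for x in range(16)] for y in range(16)]
ii = integral_image(gradient_magnitude_channel(img))
feature = box_sum(ii, 2, 4, 10, 12)  # one first-order feature over an 8x8 box
```

In a boosted classifier, each weak learner would threshold one such rectangle sum.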

2.2. Discriminatively Trained Part Based Models (DPM)

The hardest part of object detection is the large amount of variation, which arises from illumination, viewpoint, non-rigid deformation, occlusion, and intra-class variability. The deformable part model tries to capture this variation by assuming that an object is composed of parts. Thus, the detector first finds a match with a coarse root filter (at half the resolution of the parts), and then uses its part models to refine the result. It computes HOG features at each pyramid level before filtering, and it is trained with a latent SVM, a linear SVM in which the part locations of an object are treated as latent variables.
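The scoring of one root placement in a star-structured model can be sketched as follows: the root response plus, for each part, the best part response near its anchor minus a quadratic deformation cost on the displacement (dx, dy, dx², dy²), with parts evaluated at twice the root resolution. This is a simplified sketch under stated assumptions: the filter response maps are taken as precomputed, the maximization is brute force over a small radius (the DPM paper uses a faster distance transform), and `dpm_score` and its parameters are hypothetical names.

```python
def dpm_score(root_resp, part_resps, anchors, deforms, x0, y0, bias=0.0, radius=4):
    """Score one root placement (x0, y0) of a star-structured part model.

    root_resp  : 2-D list of root-filter responses at the root pyramid level
    part_resps : list of 2-D lists of part-filter responses, at the level
                 with twice the root resolution
    anchors    : list of (vx, vy) anchor offsets of each part, in part-level cells
    deforms    : list of (d1, d2, d3, d4) weights on (dx, dy, dx^2, dy^2)
    radius     : largest displacement searched around each anchor
    """
    score = root_resp[y0][x0] + bias
    for resp, (vx, vy), (d1, d2, d3, d4) in zip(part_resps, anchors, deforms):
        # The part anchor lives at twice the root resolution.
        ax, ay = 2 * x0 + vx, 2 * y0 + vy
        best = float("-inf")
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                px, py = ax + dx, ay + dy
                if 0 <= py < len(resp) and 0 <= px < len(resp[0]):
                    cost = d1 * dx + d2 * dy + d3 * dx * dx + d4 * dy * dy
                    best = max(best, resp[py][px] - cost)
        score += best
    return score
```

Here a part that moves away from its anchor is still rewarded for a strong response, but pays a cost that grows quadratically with the displacement.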

2.3. Rich Feature Hierarchies for Convolutional Neural Networks (RCNN)

Recently, many detection algorithms use selective search [7] to generate region proposals, avoiding the exhaustive sliding-window method. In addition, Convolutional Neural Networks (CNNs) [8] have been successful in the ImageNet classification challenge. Since progress with the DPM model has become limited, the RCNN paper [5] approaches detection in a different way. A CNN can learn a diverse set of features from data. However, it is computationally expensive, has many parameters, and its overall mechanism is hard to interpret. In the RCNN paper, region proposals reduce the number of image windows by two orders of magnitude. All proposals are then warped to 227 × 227 pixels as inputs to the CNN. The experiments also show that using the pool5 features instead of the last layer requires only 6% of the network's parameters, with only a few percentage points drop in accuracy, though the reason is still not well understood. The key point for the CNN is fine-tuning its parameters: according to the authors, fine-tuning is crucial to the final accuracy. By combining region proposals and CNN features, with the CNN pre-trained on ImageNet and fine-tuned on PASCAL VOC, the RCNN method achieves 54% accuracy on the PASCAL VOC 2007 challenge.
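The warping step can be sketched as cropping each proposal box and resizing it to a fixed square, ignoring its aspect ratio. This is an assumption-laden simplification: `warp_region` is a hypothetical helper, it uses nearest-neighbor sampling on a single-channel image, and it omits the extra context padding around the box described in [5]; the output size is left as a parameter.

```python
def warp_region(img, box, out_size):
    """Crop box = (x0, y0, x1, y1), with x1 and y1 exclusive, from a 2-D image
    and resize it anisotropically to out_size x out_size using
    nearest-neighbor sampling (the box's aspect ratio is not preserved)."""
    x0, y0, x1, y1 = box
    bw, bh = x1 - x0, y1 - y0
    out = []
    for r in range(out_size):
        # Map the center of output row r back into the box.
        sy = y0 + min(int((r + 0.5) * bh / out_size), bh - 1)
        row = []
        for c in range(out_size):
            sx = x0 + min(int((c + 0.5) * bw / out_size), bw - 1)
            row.append(img[sy][sx])
        out.append(row)
    return out
```

Because every proposal, whatever its shape, becomes an input of the same size, a single CNN with a fixed input layer can process all of them.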

3. Experiment Result

The leaderboard of the PASCAL VOC challenge¹ is the best way to compare the performance of object detectors. On the 2012 object detection challenge, the DPM model achieves 33.6% accuracy, while RCNN (with the CNN pre-trained on ImageNet) achieves 53.3%. Integral channel features was only trained and tested on the INRIA and Caltech pedestrian datasets, achieving 79% accuracy on INRIA. Some examples are shown in the following subsections. However, I have not yet succeeded in running the RCNN code⁴, which is built on Caffe⁵, on my laptop⁶.
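The PASCAL VOC numbers above rely on the PASCAL overlap criterion: a detection counts as correct when the intersection over union (IoU) of its box with a ground-truth box exceeds 50%. A minimal sketch of that test, assuming boxes given as (x0, y0, x1, y1); the function names are hypothetical:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(det, gt, threshold=0.5):
    """PASCAL criterion: the detection matches if its IoU with gt exceeds 50%."""
    return iou(det, gt) > threshold
```

For example, two 2 × 2 boxes offset by one pixel in each direction overlap in a 1 × 1 square, giving IoU = 1 / 7, which would not count as a match.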

3.1. Integral Channel Features²

An example detection result on the Caltech dataset is shown in Figure 1. The precision and recall curve is shown in Figure 2. Some false-negative and false-positive patches are shown in Figure 3. The false negatives show that it is hard for ICF to handle pose variations and occlusions. The false positives include some vertical objects such as traffic lights, even though their shapes are not similar to a person's. This suggests that shape information, such as object boundaries, could be exploited more.
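A precision and recall curve like the one discussed above can be sketched by sorting detections by score and sweeping a threshold: unmatched detections are false positives, and ground-truth objects never matched are false negatives. This sketch assumes each detection has already been matched to ground truth (for instance with the PASCAL overlap criterion); `precision_recall` and its arguments are hypothetical names.

```python
def precision_recall(scores, is_tp, num_gt):
    """(recall, precision) points from sweeping a threshold over detection scores.

    scores : detection confidences
    is_tp  : True where the detection was matched to an unclaimed ground-truth box
    num_gt : total number of ground-truth objects
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        # At this threshold, num_gt - tp objects are still false negatives.
        points.append((tp / num_gt, tp / (tp + fp)))
    return points
```

Lowering the threshold can only increase recall, while each false positive, such as a traffic light detected as a pedestrian, pulls precision down.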

(a) Input image (b) Detection result
Figure 1. Detection result of integral channel features.

Figure 2. The ROC curve for the integral channel features algorithm on the Caltech pedestrian dataset.

(a) False negative (b) False positive
Figure 3. False negative and false positive patches.

3.2. DPM³

Some examples, including root filters, part filters, and detection results, are shown in Figure 4. The root filter operates on HOG features at half the resolution of the part filters.

¹ http://host.robots.ox.ac.uk:8080/leaderboard
² Piotr's Image & Video Matlab Toolbox: http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html
³ Available at http://github.com/rbgirshick/voc-dpm
⁴ Available at http://github.com/rbgirshick/rcnn
⁵ Available at http://caffe.berkeleyvision.org/
⁶ Although Caffe, a convolutional neural network framework built by Yangqing Jia, can run in CPU-only mode, the R-CNN code seems to require a GPU. However, my laptop's GPU, an NVIDIA GT 540M, has NVIDIA driver problems under Linux. I tried for a few days without success; I will try running it on another computer.


4. Conclusion

From the DPM model to the RCNN algorithm, the accuracy of object detection has improved greatly, and the overall direction of research has also changed. As people added new features to the generative-like DPM model, the model became more and more complex. Meanwhile, the sliding-window approach is exhaustive. By exploiting region proposals such as selective search [7] and objectness [9], the number of candidate windows can be reduced by two orders of magnitude. In addition, using segmentation cues, as in [6], or the DPM model with sketch tokens (used as a baseline in the RCNN paper), is also sensible. Another key point is the choice of features. The ICF algorithm combines simple transformations as features. Recently, deep learned features, as in RCNN, give better performance than traditional SIFT or HOG features.

References

[1] P. Dollár, Z. Tu, P. Perona, S. Belongie, Integral Channel Features, BMVC, 2009.

[2] P. Dollár, S. Belongie, P. Perona, The Fastest Pedestrian Detector in the West, BMVC, 2010.

[3] P. Dollár, R. Appel, S. Belongie, P. Perona, Fast Feature Pyramids for Object Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.

[4] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object Detection with Discriminatively Trained Part Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, No. 9, Sep. 2010.

[5] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR, 2014.

[6] S. Fidler, R. Mottaghi, A. Yuille, R. Urtasun, Bottom-up Segmentation for Top-down Detection, CVPR, 2013.

[7] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, A. W. M. Smeulders, Segmentation as Selective Search for Object Recognition, ICCV, 2011.

[8] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012.

[9] B. Alexe, T. Deselaers, V. Ferrari, Measuring the Objectness of Image Windows, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

Figure 4. Top row: Input images. Second row: Root filters. Middle row: Part filters. Fourth row: Result of root and part filters. Bottom row: Resulting bounding boxes.
