ResearchArticle Small Object Detection with Multiscale...

Research ArticleSmall Object Detection with Multiscale Features

Guo X Hu12 Zhong Yang 1 Lei Hu3 Li Huang4 and Jia M Han1

1College of Automation Engineering Nanjing University of Aeronautics and Astronautics Nanjing 211106 China2School of Software Jiangxi Normal University Nanchang 330022 China3School of Computer Information Engineering Jiangxi Normal University Nanchang 330022 China4Elementary Education College Jiangxi Normal University Nanchang 330022 China

Correspondence should be addressed to Zhong Yang yznuaa163com

Received 15 July 2018 Accepted 13 September 2018 Published 30 September 2018

Guest Editor Wei Quan

Copyright copy 2018 Guo X Hu et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

The existing object detection algorithmbased on the deep convolution neural network needs to carry outmultilevel convolution andpooling operations to the entire image in order to extract a deep semantic features of the imageThe detectionmodels can get betterresults for big object However those models fail to detect small objects that have low resolution and are greatly influenced by noisebecause the features after repeated convolution operations of existing models do not fully represent the essential characteristics ofthe small objects In this paper we can achieve good detection accuracy by extracting the features at different convolution levels ofthe object andusing themultiscale features to detect small objects For our detectionmodelwe extract the features of the image fromtheir third fourth and 5th convolutions respectively and then these three scales features are concatenated into a one-dimensionalvectorThe vector is used to classify objects by classifiers and locate position information of objects by regression of bounding boxThrough testing the detection accuracy of our model for small objects is 11 higher than the state-of-the-art models In additionwe also used the model to detect aircraft in remote sensing images and achieved good results

1 Introduction

Object detection which not only requires accurate classifi-cation of objects in images but also needs accurate locationof objects is an automatic image detection process basedon statistical and geometric features The accuracy of objectclassification and object location is important indicatorsto measure the effectiveness of model detection Objectdetection is widely used in intelligent monitoring militaryobject detection UAV navigation unmanned vehicle andintelligent transportation However because of the diversityof the detected objects the current model fails to detectobjects The changeable light and the complex backgroundincrease the difficulty of the object detection especially for theobjects that are in the complex environment

The traditional method of image classification and loca-tion by multiscale pyramid method needs to extract thestatistical features of the image in multiscale and then classifythe image by a classifier [1ndash3] Because different types ofimages are characterized by different features it is difficultto use one or more features to represent objects which do

not achieve a robust classification modelThosemodels failedto detect the objects especially that there are more detectedobjects in an image

Since deep learning has been a great success in the fieldof object detection it has become themainstream method forobject detectionThesemethods (eg RCNN [4] Fast-RCNN[5] Faster-RCNN [6] SPP-Net [7] and R-FCN [8]) haveachieved good results in multiobject detection in imagesBut most of these object detection algorithms are based onPASCAL VOC dataset [9] for training and testing PASCALVOC dataset which provides a standard evaluation systemfor detection algorithms and learning performance is themost widely used standard dataset in the field of object clas-sification and detection The dataset consists of 20 cataloguesclosely related to human life including human and animal(bird cat cattle dog horse and sheep) vehicle (aircraftbicycle ship bus car motorcycle and train) and indoor item(bottle chair table potted plants sofa and television) Fromthe above object category we can find that the actual sizeof most objects in the dataset is large object Even if thereare some small objects such as bottles these small objects

HindawiInternational Journal of Digital Multimedia BroadcastingVolume 2018 Article ID 4546896 10 pageshttpsdoiorg10115520184546896

2 International Journal of Digital Multimedia Broadcasting

Figure 1 Small object dataset

display very large objects in the image because of the focallength Therefore the detection model based on the datasetcomposed of large objects will not be effectively detected forthe small objects in reality [10]

Based on this problem we mainly study automatic detec-tion of small object For small object we define it as twotypes one is a small object in the real world such as mouseand telephone And the other is small objects those are largeobjects in the real world but they are shown in the image assmall objects because of the camera angle and focal lengthsuch as objects detection in aerial images or in remote sensingimages The small object dataset is shown in Figure 1

Usually since small objects have low resolution and arenear large objects small objects are often disturbed by thelarge objects and it leads to failure in being detected in theautomatic detection process As the mouse in Figure 1 is oftenplaced next to the monitor the common saliency detectionmodel [11 12] usually focuses on more significant monitorand ignores the mouse In addition we not only find the

detected objects in the image but also need to accuratelymark object location for object detection Because the bigdetected objects have many pixels in the image they canaccurately locate their location But it is just the oppositefor the small objects that have low resolution and few pixelsEven more because the small objects have fewer pixels andthe finite pixels contain few object features it is difficult todetect the small objects by the conventional detection modelIn addition there are few studies references and also nostandard dataset on automatic detection of small objects

In order to solve these problems we propose a multiscaledeep convolution detection network to detect small objectsThe network is based on the Faster-RCNN detection modelWe firstly combine the features of the 3th 4th and 5thconvolution layers for the small objects to amultiscale featurevectorThen we use the vector to detect the small objects andlocate the bounding box of objects In order to train smallobjects the paper also uses the method [13] to build a datasetfocusing on small objects Finally by comparing the proposed

International Journal of Digital Multimedia Broadcasting 3

detectionmodelwith the state-of-the-art detectionmodel wefind that the accuracy of our method is much better than thatof Faster-RCNN

The paper is organized as follows In Section 2 weintroduce related works Thereafter in the Section 3 wedemonstrate the detection model Experiments are presentedin Section 4 We conclude with a discussion in Section 5

2 Related Works

Object detection is always a hot topic in the field of machinevision The conventional detection method based on slidingwindow needs to decompose images in multiscale imagesUsually one image is decomposed into lots of subwindowsof several million different locations and different scalesThe model then uses a classifier to determine whether thedetected object is contained in each window The method isvery inefficient because it needs exhaustive search In addi-tion different classifiers also affect the detection accuracy ofobjects In order to obtain robust classifier the classifiers aredesigned according to the different kinds of detected objectsFor example the Harr feature combined with Adaboostingclassifier [14] is availability for face detection For pedestriandetection we use the HOG feature (Histogram of Gradients)combined with support vector machine [15] and the HOGfeature combined with DPM (Deformable Part Model) [1617] is often used in the field of the general object detectionHowever if there are many different kinds of detectedobjects in an image those classifiers will fail to detect theobjects

Since 2014 Hinton used deep learning to achieve the bestclassification accuracy in the yearrsquos ImageNet competitionand then the deep learning has become a hot direction todetect the objects Themodel of object detection based on thedeep learning is divided into two categories the first that iswidely used is based on the region proposals [18ndash20] suchas RCNN [4] SPP-Net [7] Fast-RCNN [5] Faster-RCNN[6] and R-FCN [8] The other method does not use regionproposals but directly detects the objects such as YOLO [21]and SSD [22]

For the first method the model firstly performs RoIsselection during the detection ie multiple RoIs are gener-ated by selective search [23] edge box [24] or RPN [25]Then the model extracts features for each RoIs by CNNclassifies objects by classifiers and finally obtains the locationof detected objects RCNN [4] uses selective search [23] toproduce about 2000 RoIs for each picture and then extractsand classifies the convolution features of the 2000 RoIsrespectively Because these RoIs have a large number ofoverlapped parts the large number of repeated calculationsresults in the inefficient detection SSP-net [7] and Fast-RCNN [5] propose a sharedRoIs feature for this problemThemethods extract only a CNN feature from the whole originalimage and then the feature of each RoI is extracted fromthe CNN feature by RoI pooling operation independently Sothe amount of computing of extracting feature of each RoI isshared This method reduces the CNN operation that needs2000 times in RCNN to one CNN operation which greatlyimproves the computation speed

However whether it is SSP-net or Fast-RCNN althoughthey reduce the number of CNN operations its time con-sumption is far greater than the time of the CNN featureextraction onGPU because the selection of the bounding boxof each object requires about 2 secondsimage onCPUThere-fore the bottleneck of the object detection lies in region pro-posal operation Faster-RCNN inputs the features extractedby CNN to the RPN (Region Proposal Network) networkand obtains region proposal by the RPN network so it canshare the image features extracted by CNN and thereby itreduces the time of selective search operation After RPNFaster-RCNN classifies the obtained region proposal throughtwo fully connected layers and the regression operation ofthe bounding box Experiments prove that not only is thisspeed faster but also the quality of proposal is better R-FCN thinks that the full connection classification for eachRoI by Faster-RCNN is also a very time-consuming processso R-FCN also integrates the classification process into theforward computing process of the network Since this processis shared for different RoI it is much faster than a separateclassifier

The other type is without using region proposal for theobject detection YOLO divides the entire original image intothe SlowastS cell If the center of an object falls within a cell thecorresponding cell is responsible for detecting the object andsetting the confidence score for each cell The score reflectsthe probability of the existence of the object in the boundingbox and the accuracy of IoU YOLO does not use regionproposal but directly convolution operations on the wholeimage so it is faster than Faster-RCNN in speed but theaccuracy is less than Faster-RCNN SSD also uses a singleconvolution neural network to convolution the image andpredicts a series of boundary box with different sizes andratio of length and width at each object In the test phasethe network predicts the possibility of each class of objects ineach bounding box and adjusts the boundary box to adapt tothe shape of the object G-CNN [25] regards object detectionas a problem of changing the detection box from a fixedgrid to a real box The model firstly divides the entire imagewith different scale to obtain the initial bounding box andextracts the features from the whole image by the convolutionoperation Then the feature image encircled by an initialbounding box is adjusted to a fixed size feature image bythe method Fast-RCNN mentioned Finally we can obtaina more accurate bounding box by regression operation Thebounding box will be the final output after several iterations

In short for the currentmainstream there are two types ofobject detection methods the first will have better accuracybut the speed is slower The accuracy of the second one isslightly worse but faster No matter which way to carry outthe object detection the feature extraction uses multilayerconvolution method which can obtain the rich abstractobject feature for the target object But this method leads toa decrease in detection accuracy for small target objectsbecause the features extracted by the method are few and cannot fully represent the characteristics of the object

In addition the PASCAL VOC dataset is the main datasetfor object detection and it is composed of 20 categories ofobject eg cattle buses and pedestrians But all of these


objects in the image are large objects Even in the PASCALVOC there are also some small objects eg cup but thesesmall objects display very large objects in the image becauseof the focal length So the PASCAL VOC is not suitable forthe detection of small objects

Microsoft COCO dataset [26] is a standard dataset builtby Microsoft team for object detection image segmentationand other fields The dataset includes various types of smallobjectswith the complexity of the background so it is suitablefor small objects detection The SUN dataset [27] consists of908 scene categories and 4479 object categories and a totalof 131067 images that also contain a large number of smallobjects

In order to get the rich small object dataset the paper [13]adopted two standards to build the dataset The first is thatthe actual size of the objects is not more than 30 centimetersAnother criterion is that the area occupied of the objects isnot more than 058 in the image The author also gives themAP of RCNN based on the dataset and it has only 235detection rate

3 Model Introduction

31 Faster-RCNN The RCNN model proposed by Girshickin 2014 is divided into four processes during the object detec-tion First 2000 proposal regions in the image are obtainedby region proposal algorithm Second it extracts the CNNfeatures of the two thousand proposal regions separately andoutputs the fixed dimension features Third the objects areclassified according to the features Finally in order to getthe precise object bounding box RCNN accurately locateand merge the foreground objects by regression operationThe algorithm has achieved the best accuracy of the yearBut it requires an additional expense on storage spaceand time because RCNN needs to extract the features of2000 proposal regions in each image Later Fast-RCNN isproposed by Girshick based on RCNN the model whichmaps all proposal regions into one image and has only onefeature extraction So Fast-RCNN greatly improves the speedof detection and training However Fast-RCNN still needsto extract the proposal regions which is the same as RCNNThe proposal regions extracted lead to inefficiency Faster-RCNN integrates the generation of proposal region extract-ing feature of proposal region detection of bounding box andclassification of object into a CNN framework by the RPNnetwork (regionproposal network) So it greatly improves thedetection efficiency The RPN network structure diagram isshown in Figure 2 The core idea of Faster-RCNN is to usethe RPN network to generate the proposal regions directlyand to use the anchor mechanism and the regression methodto output an objectness score and regressed bounds for eachproposal region ie the classification score and the boundaryof the 3 different scales and 3 length-width ratio for eachproposal region are outputted Experiments show that theVGG-16 model takes only 02 seconds to detect each imageIn addition it has been proved that the detection precisionwill be reduced if the negative sample is very high in thedataset The RPN network generates 300 proposal regions foreach image by multiscale anchors which are less than 2000

2k scores 4k coordinatescls layer reg layer

256-dintermediate layer

sliding window

conv feature map

k anchor boxes

Figure 2 RPN network structure

RPN

1 2 3 4 5

Feature Extraction

Figure 3 Faster-RCNN network structure

proposal regions of Fast-RCNN or RCNN So the accuracy isalso higher than them

Faster-RCNN only provides a RPN layer improvementcompared to the Fast-RCNN network and does not improvethe feature mapping layer compared to the Fast-RCNN net-work Faster-RCNN network structure is shown in Figure 3Faster-RCNN performs multiple downsampling operationsin the process of feature extraction Each sampling causes theimage to be reduced by halfTheoutput image in the fifth layeris the 116 of the original object for Faster-RCNN ie only 1byte feature is outputted on the last layer if the detected objectis smaller than 16 pixels in the original image The objectsfailed to be detected because little feature information can notsufficiently represent the characteristics of the object

AlthoughFaster-RCNNhas achieved very good detectionresults on the PASCAL VOC the PASCAL VOC is mainlycomposed of large objects The detection precision will fall ifthe dataset is mainly composed of small objects

32 Multiscale Faster-RCNN In reality the detected objectsare low in resolution and small in size The current model(eg Faster-RCNN) which has good detection accuracy forlarge objects can not effectively detect small objects in theimage [28] The main reason is that those models based ondeep neural network make the image calculated with con-volution and downsampled in order to obtain more abstractand high-level features Each downsampling causes the imageto be reduced by half If the objects are similar to thesize of the objects in the PASCAL VOC the objectrsquos detailfeatures can be obtained through these convolutions and


RPN

Concat 1lowast1conv

fc fcsoftmax

Bbox

1 2 3 4 5

The Layer of feature extracted

The layer of featureconcaenated

Figure 4 Our model structure

downsampling However if the detected objects are the verysmall scale the final features may only be left 1-2 pixels aftermultiple downsampling So few features can not fully describethe characteristics of the objects and the existing detectionmethod can not effectively detect the small target object

The deeper the convolution operation the more abstractthe object features which can represent the high-level featuresof objects are The shallow convolution layer can only extractthe low-level features of objects But for small objects thelow-level features can ensure rich object characteristics Inorder to get high-level and abstract object features and ensurethat there are enough pixels to describe small objects wecombine the features of different scales to ensure the localdetails of the object At the same time we also pay attentionto the global characteristics of the object based on the Faster-RCNNThismodel will have more robust characteristics Themodel structure is shown in Figure 4

The model is divided into four parts the first part is thefeature extraction layer which consists of 5 convolution layers(red part) 5 ReLU layers (yellow parts) 2 pooling layers(green parts) and 3 RoI pooling layers (purple part) Wenormalize the output of the 3th 4th and 5th convolutionrespectively Then the normalized output is sent to the RPNlayer and the feature combination layer for the generation ofproposal region and the extracted multiscale feature respec-tively The second part is the feature combination layer thatcombines the different scales features of third fourth andfifth layer into one-dimension feature vector by connectionoperation The third part is the RPN layer which mainly real-izes the generation of proposal regionsThe last layer which isused to realize classification and bounding box regression ofobjects that are in proposal regions is composed of softmaxand BBox

33 L2 Normalization In order to obtain the combinatorialfeature vectors we need to normalization the feature vectors

of different scales Usually the deeper convolution layeroutputs the smaller scale features On the contrary the lowerconvolution layer outputs the larger scale featuresThe featurescales of different layers are very differentTheweight of large-scale features will be much larger than that of small scalefeatures during the network weight which is tuned if thefeatures of these different scales are combined which leadsto the lower detection accuracy

To prevent such large-scale features from covering smallscale features the feature tensor that is outputted from differ-ent RoI pooling should be normalized before those tensorsare concatenated In this paper we use L2 normalizationThe normalization operation which is used to process everyfeature vector that is pooled is located after RoI pooling Afternormalization the scale of the feature vectors of the 3th4thand 5th layer will be normalized into a unified scale

119883 = 1198831198832 (1)

1198832 = ( 119889sum119894=1

10038161003816100381610038161199091198941003816100381610038161003816)12

(2)

where X is the original vector from the 3th 4th and 5th layerX is normalized feature vector and D is the channel numberof each RoI pooling

The vector X will be uniformly scaled by scale facto ie

119884 = 120574 (3)

where 119884 = [1199101 1199102 119910119889]119879In the process of error back propagation we need to fur-

ther adjust the scale factor 120574 and input vector X The specificdefinition is as follows

120597119897120597119883 = 120574

120597119897120597119910 (4)


Table 1 The process of model training

Training processInput VGG CNN M 1024 and imageOutput detection modelStep 1 Initialize the ImageNet pre training model VGG CNN M 1024 and train the RPN network

(1) Initialize network parameters using pre training model parameters(2) Initialization of caffe(3) Prepare for roidb and imdb(4) Set output path to save the caffe module of intermediate generated(5) Training RPN and save the weight of the network

Step 2 Using the trained RPN network in step 1 we generate the ROIs information and the probability distribution of the foregroundobjects in the proposal regionsStep 3 First training Fast RCNN network

(1) The proposal regions got from step 2 are sent to the ROIs(2)The probability distribution of foreground objects is sent to the network as the weight of the objects in the proposal regions(3) By comparing the size of Caffe blob we get the weight of objects outside the proposal regions(4) The loss-cls and loss-box loss functions are calculated classify and locate objects obtain the detection models

Step 4 Replace the detection model obtained in step 3 with the ImageNet network model in step 1 repeat steps 1 to 3 and the finalmodel is the training model

120597119897120597119883 = 120597119897120597119883 (1198681198832 minus11988311988311987911988332) (5)

120597119897120597120574 = sum119910120597119897120597119910119883 (6)

34 Concat Layer After the features of the third fourthand fifth layer are L2 normalized and RoI pooled outputvectors need to be concatenatedThe concatenation operationconsists of four tuples (ie number channel height andweight) where number and channel represent the concate-nation dimension and height and weight represent the sizeof concatenation vectors All output of each layer will beconcatenated into a single dimension vector by concatenationoperations In the initial stage of model training we set auniform initial scale factor of 10 for eachRoI pooling layer [11]in order to ensure that the output values of the downstreamlayers are reasonable

Then in order to ensure that the input vector of the fullconnection has the same scale as the input vector of theFaster-RCNN an additional 1lowast1 convolution layer is added tothe network to compress the channel size of the concatenatedtensor to the original one ie the same number as thechannel size of the last convolution feature map (conv5)

35 Algorithmic Description Faster RNN provides two train-ing methods with end-to-end training and alternate trainingand also provides three pretraining networks of differentsizes with VGG-16 VGG CNN M 1024 and ZF respectivelyThe large network VGG-16 has 13 convolutional layers and3 fully connected layers ZF net that has 5 convolutionallayers and 3 fully connected layers is small network and theVGG CNN M 1024 is medium-sized network Experimentshows that the detection accuracy of VGG-16 is better than

the other two models but it needs more than 11G GPU Inorder to improve the training speed of the model we use theVGG CNN M 1024model as a pretrainingmodel and use thealternation training as a training method The main processof training is shown in Table 1

4 Experimental Analysis

41 Dataset Acquisition At present the dataset commonlyused in target detection is PASCAL VOC which is made upof larger objects or the objects whose size is very small butthe area of the objects in the image is very large because ofthe focal length Therefore PASCAL VOC is not suitable forsmall object detection There is no dataset for small targetobjects In order to test the detection effect of the model onsmall objects the paper will establish a small object datasetfor object detection based on Microsoft COCO datasets andSUN datasets

In the process of building small object dataset we referto the two criteria mentioned in [18] The first criterion isthat the actual size of the detected object is not more than 30centimeters The second criterion is that all the small objectsin the image occupy 008 to 058 of the area in the imageie the pixels of the object are between 16lowast16 and 42lowast42pixels The small objects in the PASCAL VOC occupy 138and 4640 of the area in the image so it is not suitable forsmall object detection The statistics table is shown in Table 2[18]

Based on the above standards we select 8 types of objectsto make up a dataset including mouse telephone outletfaucet clock toilet paper bottle and plate After filteringCOCO and SUN dataset we finally select 2003 images thatinclude a total of 3339 objects The 358 mouse are distributedin 282 images and the other objects eg toilet paper faucetsocket panel and clock are shown in Table 3


Table 2 PASCAL VOC object area account table Unit

category cat sofa train dog table motorbike horsearea ratio 46 40 3387 3233 3096 2373 2369 2315category bus plane bicycle person bird cow chairarea ratio 2304 2283 1438 814 803 668 609category TV boat sheep plant car bottlearea ratio 596 382 334 292 279 138

Table 3 The small object dataset

category Mouse Telephone Outlet Faucet Clock Toilet paper bottle plateNumber of images 282 265 305 423 387 245 209 353Number of objects 358 332 477 515 422 289 371 575

Table 4 The comparison of accuracy between our model and Faster-RCNN (4000020000)

model mAP Mouse Telephone Outlet Faucet Clock Toilet paper bottle plateFaster RCNN 0479 0360 0409 0519 0392 0643 0350 0485 0676Our model 0589 0402 0482 0600 0506 0687 0641 0585 0806



The small object dataset established in this paper is basedon COCO and SUN Because the data in COCO and SUN aremainly based on the scenes of everyday life the complexity ofimage background in our dataset is much larger than that inPASCAL VOC In addition there are more objects in singleimage compared with the PASCAL VOC and most of theseobjects are not in the image center These make the objectdetection based on the small object datasetmore difficult thanthat based on the PASCAL VOC

During the experiment we randomly select 300 imagesas a test set and 600 as a validation set from the small datasetand all the remaining images are trained as a training set

42 Experimental Comparison The paper compares ourmodel with the state-of-the-art detection model Faster-RCNN for small object detection In the process of modeltraining our model and Faster-RCNN model use the alter-nate training method Firstly we train the RPN network anduse the RPN network as a pretraining network to train thedetection network Then we repeat the above steps to getthe final detection model In the training process we have40000 iterations for the RPN network and 20000 iterationsfor the detection networkThe final accuracy of the detectionis shown in Table 4

With the increase in the number of iterations of the train-ing network different models will show different detectionresults In the experiment we also try to increase the numberof iterations that is the RPN network iterates 60000 times

and the detection network iterates 30000 times The resultsobtained are shown in Table 5

We can find that the detection accuracy is stable whenthe number of iterations of RPN network is 40000 and thenumber of iterations of detection network is 20000 fromthe above experiments The accuracy of our model is betterthan that of Faster-RCNN for all types of objects The partrenderings of the objects detection are shown in Figure 5

In order to further detect the robustness of the model wealso detect the remote sensing images in real environmentThe remote sensing image dataset comes from the Googlemap and the insulators of the field transmission line are pho-tographed by the UAV (unmanned aerial vehicle) Becausethe images in real environment have the characteristicsof changeable light complex background and incompleteobjects we try to take all the special cases into considerationduring the building the dataset Experiments show thatour proposed detection model has better detection resultsin small objects detection in real environment The partrenderings of the objects detection are shown in Figure 6

5 Conclusions

Small objects are very difficult to detect because of theirlower resolution and larger influence of the surroundingenvironment The existing detection models based on deepneural network are not able to detect the small objectsbecause the features of objects that are extracted by manyconvolution and pooling operations are lost Our model not


Figure 5 The detection renderings

Figure 6 Effect of remote sensing image detection


only ensures the integrity of the feature of the large objectbut also preserves the full detail feature of the small objectsby extracting the multiscale feature of the image So it canimprove the accuracy of the detection of the small objects

The GANs (Generative Adversarial Nets) have beenwidely applied to the game area and achieved good results[29] For future work we believe that investigating moresophisticated techniques for improving the accuracy of smallobject detection including the Generative Adversarial Netswill be beneficial Existing object detection usually detectssmall objects through learning representations of all theobjects at multiple scales However the performance isusually limited to pay off the computational cost and therepresentation of the image In the future we address thesmall object detection problem that internally lifts represen-tations of small objects to ldquosuper-resolvedrdquo ones achievingsimilar characteristics as large objects and thus being morediscriminative for detection And finally we use the adver-sarial network to train the detection model

Data Availability

The data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

This work was supported by the National Natural ScienceFoundation of China under Grants nos 61662033 and61473144 Aeronautical Science Foundation of China (KeyLaboratory) under Grant no 20162852031 and the SpecialScientific Instrument Development of Ministry of Scienceand Technology of China under Grant no 2016YFF0103702

References

[1] P Viola and M Jones ldquoRapid object detection using a boostedcascade of simple featuresrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern Recognitionpp I511ndashI518 Kauai Hawaii USA December 2001

[2] R Lienhart and J Maydt ldquoAn extended set of Haar-like featuresfor rapid object detectionrdquo in Proceedings of the InternationalConference on Image Processing (ICIP rsquo02) pp I900ndashI903Rochester NY USA September 2002

[3] P Viola J C Platt and C Zhang ldquoMultiple Instance boostingfor object detectionrdquo in Proceedings of the Annual Conferenceon Neural Information Processing Systems (NIPS rsquo05) vol 18pp 1417ndash1424 Vancouver BritishColumbia CanadaDecember2005

[4] R Girshick J Donahue T Darrell and J Malik ldquoRich fea-ture hierarchies for accurate object detection and semanticsegmentationrdquo in Proceedings of the 27th IEEE Conference onComputer Vision and Pattern Recognition (CVPR rsquo14) pp 580ndash587 Columbus Ohio USA June 2014

[5] R Girshick ldquoFast R-CNNrdquo in Proceedings of the 15th IEEEInternational Conference on Computer Vision (ICCV rsquo15) pp1440ndash1448 Santiago Chile December 2015

[6] S Ren K He R Girshick and J Sun ldquoFaster R-CNN towardsreal-time object detectionwith region proposal networksrdquo IEEETransactions on Pattern Analysis and Machine Intelligence vol39 no 6 pp 1137ndash1149 2017

[7] K He X Zhang S Ren and J Sun ldquoSpatial pyramid poolingin deep convolutional networks for visual recognitionrdquo IEEETransactions on Pattern Analysis and Machine Intelligence vol37 no 9 pp 1904ndash1916 2015

[8] J F Dai Y Li K M He et al ldquoR-FCN Object Detection viaRegion-based Fullyrdquo in Proceedings of the 30th Conference onNeural Information Processing Systems (NIPS 2016) BarcelonaSpain 2016

[9] M Everingham L van Gool C K I Williams J Winn and AZisserman ldquoThe pascal visual object classes (VOC) challengerdquoInternational Journal of Computer Vision vol 88 no 2 pp 303ndash338 2010

[10] Y Ren C Zhu and S Xiao ldquoSmall object detection in opticalremote sensing images via modified faster R-CNNrdquo AppliedSciences vol 8 no 5 article 813 2018

[11] L Itti C Koch and E Niebur ldquoAmodel of saliency-based visualattention for rapid scene analysisrdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 20 no 11 pp 1254ndash12591998

[12] M-M Cheng N J Mitra X Huang P H S Torr and S-M Hu ldquoGlobal contrast based salient region detectionrdquo IEEETransactions on Pattern Analysis and Machine Intelligence vol37 no 3 pp 569ndash582 2015

[13] C Chen M Y Liu O Tuzel et al ldquoR-CNN for small objectdetectionrdquo in Asian Conference on Computer Vision vol 10115of Lecture Notes in Computer Science pp 214ndash230 2016

[14] J S Lim and W H Kim ldquoDetection of multiple humans usingmotion information and adaboost algorithmbased onHarr-likefeaturesrdquo International Journal of Hybrid Information Technol-ogy vol 5 no 2 pp 243ndash248 2012

[15] R PYadav V Senthamilarasu K Kutty and S P UgaleldquoImplementation of Robust HOG-SVM based Pedestrian Clas-sificationrdquo International Journal of Computer Applications vol114 no 19 pp 10ndash16 2015

[16] L Hou W Wan K-H Lee J-N Hwang G Okopal and JPitton ldquoRobust Human Tracking Based on DPM ConstrainedMultiple-Kernel from a Moving Camerardquo Journal of SignalProcessing Systems vol 86 no 1 pp 27ndash39 2017

[17] A Ali and M A Bayoumi ldquoTowards real-time DPM objectdetector for driver assistancerdquo in Proceedings of the 23rd IEEEInternational Conference on Image Processing ICIP 2016 pp3842ndash3846 Phoenix Ariz USA September 2016

[18] S Bell C L Zitnick K Bala and R Girshick ldquoInside-outside net Detecting objects in context with skip pooling andrecurrent neural networksrdquo in Proceedings of the 2016 IEEEConference on Computer Vision and Pattern Recognition CVPR2016 pp 2874ndash2883 Las Vegas Nev USA July 2016

[19] T Kong A Yao Y Chen and F Sun ldquoHyperNet towards accu-rate region proposal generation and joint object detectionrdquo inProceedings of the 2016 IEEEConference onComputerVision andPattern Recognition (CVPR) pp 845ndash853 Las Vegas Nev USAJune 2016

[20] F Yang W Choi and Y Lin ldquoExploit all the layers fast andaccurate CNNobject detector with scale dependent pooling andcascaded rejection classifiersrdquo in Proceedings of the 2016 IEEEConference on Computer Vision and Pattern Recognition CVPR2016 pp 2129ndash2137 Las Vegas Nev USA July 2016


[21] J Redmon S Divvala R Girshick and A Farhadi ldquoYou onlylook once Unified real-time object detectionrdquo in Proceedingsof the 2016 IEEE Conference on Computer Vision and PatternRecognition CVPR 2016 pp 779ndash788 Las Vegas NevUSA July2016

[22] W Liu D Anguelov D Erhan et al ldquoSSD single shot multiboxdetectorrdquo in EuropeanConference on Computer Vision vol 9905of Lecture Notes in Computer Science pp 21ndash37 Springer ChamSwitzerland 2016

[23] J R R Uijlings K E A Van De Sande T Gevers and A WM Smeulders ldquoSelective search for object recognitionrdquo Inter-national Journal of Computer Vision vol 104 no 2 pp 154ndash1712013

[24] C L Zitnick and P Dollar ldquoEdge boxes locating objectproposals from edgesrdquo in European Conference on ComputerVision pp 391ndash405 Springer Cham Switzerland 2014

[25] M Najibi M Rastegari and L S Davis ldquoG-CNN An iterativegrid based object detectorrdquo in Proceedings of the 2016 IEEEConference on Computer Vision and Pattern Recognition CVPR2016 pp 2369ndash2377 Las Vegas Nev USA July 2016

[26] T-Y Lin M Maire S Belongie et al ldquoMicrosoft COCO Com-mon objects in contextrdquo in European Conference on ComputerVision vol 8693 of Lecture Notes in Computer Science pp 740ndash755 Springer Cham Switzerland 2014

[27] httpgroupscsailmiteduvisionSUN[28] THN Le Y ZhengC ZhuK Luu andM Savvides ldquoMultiple

scale faster-RCNN approach to driverrsquos cell-phone usage andhands on steering wheel detectionrdquo in Proceedings of the 29thIEEE Conference on Computer Vision and Pattern RecognitionWorkshops CVPRW 2016 pp 46ndash53 Las Vegas Nev USA July2016

[29] X Wu K Xu and P Hall ldquoA survey of image synthesis andediting with generative adversarial networksrdquo Tsinghua Scienceand Technology vol 22 no 6 pp 660ndash674 2017

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018


Active and Passive Electronic Components

VLSI Design



Shock and Vibration


Civil EngineeringAdvances in

Acoustics and VibrationAdvances in



Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of



Journal ofEngineeringVolume 2018

SensorsJournal of



RotatingMachinery


Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018


Chemical EngineeringInternational Journal of Antennas and

Propagation




Navigation and Observation


Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom


Figure 1 Small object dataset

display very large objects in the image because of the focallength Therefore the detection model based on the datasetcomposed of large objects will not be effectively detected forthe small objects in reality [10]

Based on this problem we mainly study automatic detec-tion of small object For small object we define it as twotypes one is a small object in the real world such as mouseand telephone And the other is small objects those are largeobjects in the real world but they are shown in the image assmall objects because of the camera angle and focal lengthsuch as objects detection in aerial images or in remote sensingimages The small object dataset is shown in Figure 1

Usually since small objects have low resolution and arenear large objects small objects are often disturbed by thelarge objects and it leads to failure in being detected in theautomatic detection process As the mouse in Figure 1 is oftenplaced next to the monitor the common saliency detectionmodel [11 12] usually focuses on more significant monitorand ignores the mouse In addition we not only find the

detected objects in the image but also need to accuratelymark object location for object detection Because the bigdetected objects have many pixels in the image they canaccurately locate their location But it is just the oppositefor the small objects that have low resolution and few pixelsEven more because the small objects have fewer pixels andthe finite pixels contain few object features it is difficult todetect the small objects by the conventional detection modelIn addition there are few studies references and also nostandard dataset on automatic detection of small objects

In order to solve these problems we propose a multiscaledeep convolution detection network to detect small objectsThe network is based on the Faster-RCNN detection modelWe firstly combine the features of the 3th 4th and 5thconvolution layers for the small objects to amultiscale featurevectorThen we use the vector to detect the small objects andlocate the bounding box of objects In order to train smallobjects the paper also uses the method [13] to build a datasetfocusing on small objects Finally by comparing the proposed




2 Related Works
















sliding window

conv feature map

k anchor boxes


RPN

1 2 3 4 5

Feature Extraction







RPN

Concat 1lowast1conv

fc fcsoftmax

Bbox

1 2 3 4 5










119883 = 1198831198832 (1)

1198832 = ( 119889sum119894=1

10038161003816100381610038161199091198941003816100381610038161003816)12

(2)



119884 = 120574 (3)



120597119897120597119883 = 120574

120597119897120597119910 (4)








120597119897120597119883 = 120597119897120597119883 (1198681198832 minus11988311988311987911988332) (5)

120597119897120597120574 = sum119910120597119897120597119910119883 (6)

























5 Conclusions








Data Availability




Acknowledgments


References

































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia





2 Related Works
















sliding window

conv feature map

k anchor boxes


RPN

1 2 3 4 5

Feature Extraction







RPN

Concat 1lowast1conv

fc fcsoftmax

Bbox

1 2 3 4 5










119883 = 1198831198832 (1)

1198832 = ( 119889sum119894=1

10038161003816100381610038161199091198941003816100381610038161003816)12

(2)



119884 = 120574 (3)



120597119897120597119883 = 120574

120597119897120597119910 (4)








120597119897120597119883 = 120597119897120597119883 (1198681198832 minus11988311988311987911988332) (5)

120597119897120597120574 = sum119910120597119897120597119910119883 (6)

























5 Conclusions








Data Availability




Acknowledgments


References

































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia










sliding window

conv feature map

k anchor boxes


RPN

1 2 3 4 5

Feature Extraction







RPN

Concat 1lowast1conv

fc fcsoftmax

Bbox

1 2 3 4 5










119883 = 1198831198832 (1)

1198832 = ( 119889sum119894=1

10038161003816100381610038161199091198941003816100381610038161003816)12

(2)



119884 = 120574 (3)



120597119897120597119883 = 120574

120597119897120597119910 (4)








120597119897120597119883 = 120597119897120597119883 (1198681198832 minus11988311988311987911988332) (5)

120597119897120597120574 = sum119910120597119897120597119910119883 (6)

























5 Conclusions








Data Availability




Acknowledgments


References

































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia



RPN

Concat 1lowast1conv

fc fcsoftmax

Bbox

1 2 3 4 5










119883 = 1198831198832 (1)

1198832 = ( 119889sum119894=1

10038161003816100381610038161199091198941003816100381610038161003816)12

(2)



119884 = 120574 (3)



120597119897120597119883 = 120574

120597119897120597119910 (4)








120597119897120597119883 = 120597119897120597119883 (1198681198832 minus11988311988311987911988332) (5)

120597119897120597120574 = sum119910120597119897120597119910119883 (6)

























5 Conclusions








Data Availability




Acknowledgments


References

































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia









120597119897120597119883 = 120597119897120597119883 (1198681198832 minus11988311988311987911988332) (5)

120597119897120597120574 = sum119910120597119897120597119910119883 (6)

























5 Conclusions








Data Availability




Acknowledgments


References

































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia


















5 Conclusions








Data Availability




Acknowledgments


References

































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia








Data Availability




Acknowledgments


References

































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia














RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia


ResearchArticle Small Object Detection with Multiscale...

Documents

Transcript of ResearchArticle Small Object Detection with Multiscale...