Post on 26-Dec-2019
CSCE496/896
Lecture 9:Object
Detection
Stephen Scott
Introduction
PerformanceMeasures
R-CNN
SPP-net
Fast R-CNN
YOLO
CSCE 496/896 Lecture 9: Object Detection
Stephen Scott
sscott@cse.unl.edu
Introduction
We know that CNNs are useful in image classification
Now consider object detection
  Given an input image, identify what objects (plural) are in it and where they are
  Output the bounding box of each object
Outline
Performance measures
R-CNN
SPP-net
Fast R-CNN
YOLO
Performance Measures: Mean Average Precision
Mean average precision (mAP) measures how well objects are identified
Recall from Lecture 3:
  Precision is the fraction of those labeled positive that are positive
  Recall is the fraction of the true positives that are labeled positive
  Precision-recall curve plots precision vs. recall
Performance Measures: Mean Average Precision (2)
Given a ranking (by confidence values) of n items, average precision at n (AP@n) is the average of the precision values at each position in the ranking, weighted by the change in recall at that position:
$$\mathrm{AP} = \sum_{k=1}^{n} P(k)\,\Delta r(k),$$
where P(k) is precision at position k and Δr(k) is the change in recall, r(k) − r(k − 1) (= 0 if instance k is negative, = 1/N_p if k is one of the N_p positives)
E.g., if ranking = ⟨+, +, −, +, −⟩, AP@5 = (1)(1/3) + (1)(1/3) + (2/3)(0) + (3/4)(1/3) + (3/5)(0) = 11/12
  Larger as more positives are ranked above negatives
mAP is the mean of average precision across all classes
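
As a minimal sketch of the AP definition above, the following computes AP from a ranking given as a list of booleans (True = positive, ordered by decreasing confidence) and averages it over classes for mAP. The function names and the dictionary-of-rankings input are assumptions made for illustration, not part of the lecture.

```python
def average_precision(ranking):
    """AP = sum_k P(k) * delta_r(k) for a ranking of booleans (True = positive)."""
    num_pos = sum(1 for is_pos in ranking if is_pos)
    if num_pos == 0:
        return 0.0  # convention chosen here: no positives means AP of 0
    ap = 0.0
    hits = 0
    for k, is_pos in enumerate(ranking, start=1):
        if is_pos:
            hits += 1
            precision_at_k = hits / k
            # Recall changes by 1/N_p only at positive positions;
            # negative positions contribute P(k) * 0.
            ap += precision_at_k / num_pos
    return ap


def mean_average_precision(rankings_by_class):
    """mAP: mean of per-class AP. Input maps class name -> ranking (hypothetical format)."""
    aps = [average_precision(r) for r in rankings_by_class.values()]
    return sum(aps) / len(aps)


# Ranking <+, +, -, +, -> gives 1/3 + 1/3 + (3/4)(1/3) = 11/12.
ap_example = average_precision([True, True, False, True, False])
map_example = mean_average_precision({
    "cat": [True, True, False, True, False],
    "dog": [True, True, True, False, False],
})
```

Moving the last positive ahead of the first negative raises AP to 1, which matches the remark that AP grows as more positives are ranked above negatives.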
Performance Measures: Intersection Over Union
Intersection over union (IoU) measures the quality of bounding boxes
Divide the size of the two boxes’ (predicted and ground-truth) intersection by the size of their union
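
A minimal sketch of IoU for axis-aligned boxes, assuming each box is given as (x_min, y_min, x_max, y_max); that corner format and the function name are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at 0 when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1) = 1/7.
iou_example = iou((0, 0, 2, 2), (1, 1, 3, 3))
```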
Basic Idea of Object Detection
Split the input image into regions and classify each region with a CNN and other machinery
  Region boundary is the bounding box (BB)
  Object detected in a region is the object in the BB
Issues:
  Limited to bounding boxes of fixed sizes and locations
  An object could span regions
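
To make the second issue concrete, here is a small sketch that splits an image into a fixed grid of regions and lists which regions an object overlaps. The grid size, image size, and object box are made-up values, and the helper names are hypothetical.

```python
def grid_regions(width, height, rows, cols):
    """Fixed-size, fixed-location regions as (x_min, y_min, x_max, y_max) boxes."""
    cell_w = width / cols
    cell_h = height / rows
    return [
        (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
        for r in range(rows)
        for c in range(cols)
    ]


def regions_overlapping(box, regions):
    """Indices of the regions whose interior intersects the given box."""
    x1, y1, x2, y2 = box
    return [
        i
        for i, (rx1, ry1, rx2, ry2) in enumerate(regions)
        if min(x2, rx2) > max(x1, rx1) and min(y2, ry2) > max(y1, ry1)
    ]


regions = grid_regions(300, 300, rows=3, cols=3)
# A centered object touches every cell of the 3x3 grid, so no single
# region's boundary is a good bounding box for it.
spanned = regions_overlapping((80, 80, 220, 220), regions)
contained = regions_overlapping((10, 10, 50, 50), regions)
```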
Region CNN (Girshick et al. 2014)
R-CNN proposes a collection of 2000 regions in the image
Warps each region to match the input dimensions (227 × 227 × 3) of the CNN to get a 4096-dimensional embedded representation
Classifies each embedded vector with class-specific binary SVMs
Applies class-specific regressors to fine-tune bounding boxes
Region CNN (Girshick et al. 2014): Example from Girshick (2015)
Region CNN (Girshick et al. 2014): Selective Search
Popular method to propose regions of interest (RoIs): selective search
1. Segment the image
2. Compute bounding boxes of segments
3. Iteratively merge adjacent segments based on similarity
     Linear combination of similarities of: color, texture, size, shape
4. Go to 2
Region CNN (Girshick et al. 2014): Issues
Training and detection are slow
  Detection: 13 s/image on GPU, 53 s/image on CPU
  Due to the large number of regions proposed, each run through the CNN and classified
Spatial Pyramid Pooling (He et al. 2015)
Part of R-CNN’s slowdown at test time is running each RoI through the ConvNet separately
To speed up test time, instead put the entire image through a single ConvNet
Choose RoIs from the ConvNet output and run them through a spatial pyramid pooling (SPP) layer
  Max/avg pooling with a fixed number of bins
  Produces a fixed-length vector regardless of input size
Fixed-length vectors feed to fully connected layers, then SVMs
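
A rough sketch of the SPP idea on a single-channel feature map stored as a list of lists: for each pyramid level n, the map is divided into n × n bins and each bin is max-pooled, so the output length depends only on the levels. The levels (1, 2, 4) and the bin-boundary rule are assumptions for illustration, not necessarily the configuration used by He et al.

```python
import math


def spp_max_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2-D map into n x n bins per level; output length is sum(n * n)."""
    h = len(feature_map)
    w = len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Floor for the start and ceil for the end keeps every bin non-empty.
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(
                    max(feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1))
                )
    return pooled


# Two RoIs of different sizes yield vectors of the same length (1 + 4 + 16 = 21).
small_roi = [[r * 7 + c for c in range(7)] for r in range(5)]
large_roi = [[r * 9 + c for c in range(9)] for r in range(13)]
vec_small = spp_max_pool(small_roi)
vec_large = spp_max_pool(large_roi)
```

With multiple channels, the same pooling would be applied per channel and the results concatenated, which is what lets the fully connected layers accept RoIs of any size.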
Spatial Pyramid Pooling (He et al. 2015): Example from Girshick (2015)
Spatial Pyramid Pooling (He et al. 2015): Drawbacks
While training is faster than R-CNN, it is still slow and disk-intensive
Cannot efficiently update ConvNet parameters, so they are kept frozen
  Each RoI’s receptive field covers most of the entire image, so the forward pass is expensive across all images of a mini-batch
Fast R-CNN (Girshick 2015): Hierarchical Sampling
Similar architecture to SPP-net
Mini-batches constructed via hierarchical sampling: sample a similar number of RoIs over a smaller number of images, so that RoIs from the same image share computation
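
A minimal sketch of hierarchical sampling: first sample a few images, then sample RoIs from each of them. The defaults of 2 images and 128 RoIs per mini-batch follow the setting reported in Girshick (2015) as best I recall it; treat them, the function name, and the dictionary input format as assumptions.

```python
import random


def hierarchical_minibatch(rois_by_image, num_images=2, rois_per_batch=128, rng=None):
    """Sample num_images images, then rois_per_batch / num_images RoIs from each."""
    rng = rng or random.Random(0)
    image_ids = rng.sample(sorted(rois_by_image), num_images)
    per_image = rois_per_batch // num_images
    batch = []
    for image_id in image_ids:
        rois = rois_by_image[image_id]
        # Take fewer RoIs only if the image does not have enough proposals.
        chosen = rng.sample(rois, min(per_image, len(rois)))
        batch.extend((image_id, roi) for roi in chosen)
    return batch


# Toy data: 10 images with 2000 placeholder RoIs each (hypothetical identifiers).
rois_by_image = {f"img{i}": [(i, j) for j in range(2000)] for i in range(10)}
batch = hierarchical_minibatch(rois_by_image)
images_in_batch = {image_id for image_id, _ in batch}
```

Sampling one RoI from each of 128 different images would need 128 ConvNet forward passes per mini-batch; here the same number of RoIs needs only two.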
Fast R-CNN (Girshick 2015): Example from Girshick (2015)
You Only Look Once (Redmon et al. 2016)
A single, unified network
Can process 45 frames per second on a GPU (155 fps for Fast YOLO)
Lower mAP than some R-CNN variants, but much faster
Highest mAP of real-time detectors (≥ 30 fps)
You Only Look Once (Redmon et al. 2016): Idea
Divides the image into an S × S grid
Each grid cell predicts B bounding boxes, each as (x, y, w, h) (coordinates, width, height), and a confidence (five total predictions per box)
  x, y, w, h ∈ [0, 1] (relative to image dimensions and grid cell location)
Each cell also predicts C class probabilities
Output is an S × S × (5B + C) tensor
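
A short sketch of the output layout, under two stated assumptions: (x, y) is the box center as an offset within its grid cell, and (w, h) are fractions of the full image. The values S = 7, B = 2, C = 20 are the PASCAL VOC setting from Redmon et al. (2016); the helper names are hypothetical.

```python
def yolo_output_shape(S, B, C):
    """Each of the S x S cells predicts B boxes (5 numbers each) plus C class probabilities."""
    return (S, S, 5 * B + C)


def decode_box(cell_row, cell_col, x, y, w, h, S, image_w, image_h):
    """Convert one cell-relative prediction to pixel (x_min, y_min, x_max, y_max).

    Assumes (x, y) is the center offset inside the cell and (w, h) are
    fractions of the whole image.
    """
    center_x = (cell_col + x) / S * image_w
    center_y = (cell_row + y) / S * image_h
    box_w = w * image_w
    box_h = h * image_h
    return (
        center_x - box_w / 2,
        center_y - box_h / 2,
        center_x + box_w / 2,
        center_y + box_h / 2,
    )


shape = yolo_output_shape(S=7, B=2, C=20)  # 7 x 7 x 30
# A box centered in the middle cell that spans the whole 448 x 448 image.
full_image_box = decode_box(3, 3, 0.5, 0.5, 1.0, 1.0, S=7, image_w=448, image_h=448)
```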
You Only Look Once (Redmon et al. 2016): Architecture
Leaky ReLU for all layers except output, which is linear
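
For reference, a scalar sketch of the two activations named above. The negative slope of 0.1 is the value given in Redmon et al. (2016); it is exposed as a parameter here since the slide does not state it.

```python
def leaky_relu(x, negative_slope=0.1):
    """Identity for positive inputs, a small linear slope for the rest."""
    return x if x > 0 else negative_slope * x


def linear(x):
    """Output-layer activation: the pre-activation is passed through unchanged."""
    return x


hidden_outputs = [leaky_relu(v) for v in (-2.0, 0.0, 3.0)]
final_outputs = [linear(v) for v in (-2.0, 0.0, 3.0)]
```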
You Only Look Once (Redmon et al. 2016): Training
Pretrained 20 convolutional layers on ImageNet (1000 classes)
Added 4 convolutional layers and 2 fully connected layers
Trained to optimize a weighted squared-loss function
  λ_coord = 5: five times more weight on (x, y, w, h) predictions
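
A deliberately simplified sketch of the weighting idea for a single predicted box: squared errors on the coordinates are scaled by λ_coord, while confidence and class errors keep weight 1. The full loss in Redmon et al. (2016) has further terms that are omitted here, and the dictionary layout is an assumption for illustration.

```python
def weighted_squared_loss(pred, target, lambda_coord=5.0):
    """Squared loss for one box, with coordinate errors weighted by lambda_coord.

    pred and target are dicts with keys 'box' (x, y, w, h), 'confidence',
    and 'classes' (hypothetical layout).
    """
    coord_err = sum((p - t) ** 2 for p, t in zip(pred["box"], target["box"]))
    conf_err = (pred["confidence"] - target["confidence"]) ** 2
    class_err = sum((p - t) ** 2 for p, t in zip(pred["classes"], target["classes"]))
    return lambda_coord * coord_err + conf_err + class_err


target = {"box": (0.0, 0.5, 0.25, 0.25), "confidence": 1.0, "classes": (1.0, 0.0)}
# Same-sized error (0.5) placed on a coordinate vs. on the confidence.
bad_coord = {"box": (0.5, 0.5, 0.25, 0.25), "confidence": 1.0, "classes": (1.0, 0.0)}
bad_conf = {"box": (0.0, 0.5, 0.25, 0.25), "confidence": 0.5, "classes": (1.0, 0.0)}
loss_coord = weighted_squared_loss(bad_coord, target)  # 5 * 0.25
loss_conf = weighted_squared_loss(bad_conf, target)    # 1 * 0.25
```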