R-CNN minus R

Karel Lenc, Andrea Vedaldi
Visual Geometry Group, Department of Engineering Science, University of Oxford

Transcript of presentation slides (vgg/publications/2015/Lenc15a/presentation.pdf)

Object detection

Goal: tightly enclose objects of a certain type in a bounding box

[Example images: "bikes", "planes", "birds", "horses"]

Top performer: region proposals + CNN
[Girshick et al. 2013, He et al. 2014, 2015]

[Pipeline diagram: a WHERE stage (proposal generation) feeds candidate regions to a WHAT stage (a CNN that labels each region, e.g. "chair", "background", "potted plant")]

WHAT: convolutional neural networks, e.g. region classification with AlexNet or VGG VD [Krizhevsky et al. 2012, Simonyan & Zisserman 2014]

WHERE: a segmentation algorithm, e.g. region proposals from selective search [Uijlings et al. 2013]

Can a CNN understand where as well as what?

[Diagram: can the proposal-generation ("where") stage be replaced by the convolutional neural network itself?]

Approaches to object detection

Scanning windows
▶ Sliding windows: HOG detector [Dalal & Triggs 2005], DPM [Felzenszwalb et al. 2008]
▶ Cascaded windows: AdaBoost [Viola & Jones 2004], MKL [Vedaldi et al. 2009], Branch and Bound [Lampert et al. 2009]
▶ Jumping windows [Sivic et al. 2008]
▶ Selective windows [Endres & Hoiem 2010, Uijlings et al. 2011, Alexe et al. 2012, Gu et al. 2012]

Hough voting
▶ Implicit shape models [Amit & Geman 1997, Leibe et al. 2003]
▶ Max margin [Maji & Berg 2009], Random Forests [Gall & Lempitsky 2009]

Classifiers & features
▶ Linear SVMs, kernel SVMs, Fisher vectors, … [Cinbis et al. 2013, …]
▶ Convolutional neural networks [Sermanet et al. 2014, Girshick et al. 2014, …]
▶ HOG, SIFT, C-SIFT, … [van de Sande et al. 2010, …]
▶ Segmentation cues, … [Shotton et al. 2008, Cinbis et al. 2013, …]

Evolution of object detection

[Chart: mAP (%) on PASCAL VOC 2007 data by year, 2008–2015, axis 0–70: DPM [Felzenszwalb et al.], MKL [Vedaldi et al.], DPMv5 [Girshick et al.], Regionlet [Wang et al.], RCNN-Alex [Girshick et al.], RCNN-VGG [Girshick et al.]]

R-CNN
[Girshick et al. 2013]

[Diagram: each proposed region is cropped and passed through the full CNN (c1–c5, f6–f8, SVM label), producing outputs such as "chair", "background", "potted plant"]

Pros: simple and effective
Cons: slow, as the CNN is re-evaluated for each tested region

SPP R-CNN
[He et al. 2014]

Convolutional features = local features
Region descriptor = pooled local features
▶ Spatial pyramid + max pooling [He et al. 2014]
▶ Bag of words, Fisher vector, VLAD, … [Cimpoi et al. 2015]

An order-of-magnitude speedup

[Diagram: the convolutional layers c1–c5 run once per image as a local-feature extractor; each region is then pooled and encoded before its own pass through f6–f8 (SVM), e.g. "chair", "bowl", "potted plant"]
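The region descriptor here is a fixed-length pooling of the shared convolutional features inside each proposal. A minimal numpy sketch of the spatial pyramid + max pooling idea, assuming feature-map coordinates; the function name and the grid levels are illustrative choices, not the authors' implementation:

```python
import numpy as np

def spp_max_pool(feat, box, levels=(1, 2, 4)):
    """Spatial-pyramid max pooling of conv features inside a region.

    feat: (C, H, W) convolutional feature map for the whole image.
    box:  (x1, y1, x2, y2) region in feature-map coordinates.
    Returns a descriptor of length C * sum(n^2) for any region size.
    """
    x1, y1, x2, y2 = box
    region = feat[:, y1:y2, x1:x2]
    h, w = region.shape[1:]
    pooled = []
    for n in levels:  # an n x n grid of cells at each pyramid level
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                # guard against empty cells when the region is tiny
                cell = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                 xs[j]:max(xs[j + 1], xs[j] + 1)]
                pooled.append(cell.max(axis=(1, 2)))  # max per channel
    return np.concatenate(pooled)
```

Because every region yields the same descriptor length, arbitrarily shaped proposals can share one convolutional pass and still feed fixed-size fully connected layers.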

Computational cost

SPP-CNN results in a significant test-time speedup.
However, region proposal extraction is the new bottleneck.
R-CNN minus R: can we get rid of region proposal extraction?

[Chart: average detection time per image (ms), axis 0–12,000, for SPP-CNN and R-CNN, split into selective search vs CNN evaluation]

Streamlining R-CNN and SPP-CNN

Dropping proposal generation

A complex learning pipeline

(SPP) R-CNN training comprises many steps:

1. Pre-train a large CNN (on ImageNet)
2. Extract region proposals (on PASCAL VOC)
3. Use the pre-processed regions to:
   1. Fine-tune the CNN
   2. Learn an SVM to rank regions
   3. Learn a bounding-box regressor to refine localization

[Diagram: c1–c5, f6–f8 with fine-tuning label; SVM label (ranking); linear bounding-box regressor]

A complex learning pipeline

With the SPP R-CNN of [He et al. 2014], fine-tuning is limited to the fully connected layers.

[Diagram: convolutional layers c1–c5 frozen; only f6–f8, the SVM label, and the linear bounding-box regressor are trained]

Streamlining R-CNN

Removing the SVM: up to a simple transformation, softmax is just as good as hinge loss for box ranking.

phase                            score(s)                                 learning loss                              mAP
fine tuning                      S_c = exp(⟨w_c, φ(x)⟩ + b_c)             −log [ S_{c0} / (S_0 + S_1 + … + S_C) ]    38.1
region ranking (SVM)             Q_c = ⟨w_c, φ(x)⟩ + b_c, c = 1, …, C     max{0, 1 − y Q_c}, c = 1, …, C             59.8
region ranking from fine-tuning  Q_c = log(S_c / S_0)                     (none; reuses the fine-tuned scores)       58.4
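The "simple transformation" in the last row can be made concrete: since S_c = exp(⟨w_c, φ(x)⟩ + b_c), the ranking score Q_c = log(S_c / S_0) is just the class logit minus the background logit. A short numpy sketch (the function name is ours):

```python
import numpy as np

def ranking_scores(logits):
    """Turn fine-tuned softmax scores into region-ranking scores.

    logits: (C+1,) raw scores <w_c, phi(x)> + b_c, index 0 = background.
    Since S_c = exp(logit_c), we have
        Q_c = log(S_c / S_0) = logit_c - logit_0.
    """
    logits = np.asarray(logits, dtype=float)
    return logits[1:] - logits[0]  # one ranking score per foreground class
```

Subtracting the background logit leaves the relative ordering of regions for each class intact, which is all that ranking needs; this is why no separate SVM training phase is required.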

Streamlining R-CNN and SPP-CNN

SPP and bounding-box regression can easily be implemented in a CNN (with a DAG topology) and trained jointly in one step.

See also [Fast R-CNN and Faster R-CNN]

[Diagram: before, frozen conv layers with a separate linear bounding-box regressor; after, a single network c1–c5 → SPP → f6–f8 emitting both the label and the regressed bounding box]

Dropping proposal generation

A constant-time region proposal generator

Proposals are now very fast but very inaccurate; we let the CNN compensate with the bounding-box regressor.

Algorithm

Preprocessing
▶ Collect all the training bounding boxes (x1, y1, x2, y2)
▶ Use K-means to extract K clusters in (x1, y1, x2, y2) space

Proposal generation
▶ Regardless of the image, return the same K cluster centers
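The two steps above can be sketched in a few lines of numpy. The slides only specify K-means over (x1, y1, x2, y2); the plain Lloyd iteration, the function names, and the random initialization below are our illustrative choices:

```python
import numpy as np

def cluster_boxes(boxes, K, iters=20, seed=0):
    """K-means (Lloyd's algorithm) in (x1, y1, x2, y2) space.

    boxes: (N, 4) array of training bounding boxes.
    Returns (K, 4) cluster centers: a fixed, image-independent proposal pool.
    """
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), K, replace=False)].astype(float)
    for _ in range(iters):
        # assign each box to its nearest center (squared Euclidean distance)
        d = ((boxes[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        # move each center to the mean of its assigned boxes
        for k in range(K):
            members = boxes[assign == k]
            if len(members):
                centers[k] = members.mean(0)
    return centers

def propose(image, centers):
    """Constant-time proposal generation: the image is ignored entirely."""
    return centers
```

Since `propose` never looks at the image, proposal generation costs nothing at test time; all localization accuracy must then come from the CNN's bounding-box regressor.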

Proposal statistics on PASCAL VOC

[Figure: spatial statistics of ground-truth boxes vs proposals from selective search (2K boxes), sliding windows (7K boxes), and clustering (3K boxes)]

Information pathways
[See also Lenc & Vedaldi CVPR 2015]

[Diagram: shared local features (c1–c5) split into a "what" path (f6–f8 → label; an invariant representation) and a "where" path (linear regression → bounding box; an equivariant representation)]

CNN-based bounding box regression

[Figure: dashed lines show proposals, solid lines the boxes corrected by the CNN]
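The correction shown in the figure can be sketched with the box-offset parameterization commonly used in R-CNN-style detectors. This exact form is an assumption on our part (the slides do not spell it out): the predicted offsets shift the proposal's center by a fraction of its size and rescale width and height in log space.

```python
import numpy as np

def apply_bbox_regression(proposal, t):
    """Correct a proposal (cx, cy, w, h) with predicted offsets (tx, ty, tw, th).

    Common R-CNN-style parameterization, used here as an illustrative
    assumption rather than the authors' exact formulation.
    """
    cx, cy, w, h = proposal
    tx, ty, tw, th = t
    return (cx + tx * w,      # shift the center by a fraction of the box size
            cy + ty * h,
            w * np.exp(tw),   # rescale width and height in log space
            h * np.exp(th))
```

Normalizing the shift by the proposal's size and the scale in log space keeps the regression targets comparable across small and large proposals, which is what lets one regressor fix a coarse, image-independent proposal pool.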

Performance

Observations
▶ Selective search is much better than fixed generators
▶ However, bounding-box regression almost eliminates the difference
▶ Clustering allows using significantly fewer boxes than sliding windows

[Chart: mAP (VOC07), axis 0.42–0.60, baseline vs with bounding-box regression (BBR), for Sel. Search (2K boxes), Slid. Win. (7K boxes), Clusters (2K boxes), Clusters (7K boxes)]
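The mAP numbers above count a detection as correct when its intersection-over-union (IoU) with a ground-truth box exceeds 0.5, the standard PASCAL VOC criterion. A minimal IoU sketch (the helper name is ours):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

This is why bounding-box regression helps so much here: even a fixed proposal that overlaps an object only loosely can be pushed past the 0.5 IoU threshold by the regressor.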

Timings

Finding (1): streamlining accelerates SPP.

[Chart: avg. time per image (ms), axis 0–450, for SPP vs Streamlined SPP, split into image prep., GPU↔CPU transfers, conv layers, spatial pooling, FC layers, and bbox regression]

Timings

Finding (2): dropping selective search is a huge benefit.

[Chart: avg. time per image (ms), axis 0–3,000, for SPP, Streamlined SPP, and Minus R, split into selective search, image prep., GPU↔CPU transfers, conv layers, spatial pooling, FC layers, and bbox regression]

Timings

Finding (2), continued: with R-CNN included for scale.

[Chart: avg. time per image (ms), axis 0–12,000, for RCNN, SPP, Streamlined SPP, and Minus R, split into selective search, image prep., GPU↔CPU transfers, conv layers, spatial pooling, FC layers, and bbox regression]

Timings

Test-time speedups relative to R-CNN:

RCNN              1.0×
SPP               4.5×
Streamlined SPP   5.0×
Minus R          67.5×

Conclusions

Current CNNs can localize objects well
▶ External segmentation cues bring only a minor benefit at a great expense

Benefits of CNN-only solutions
▶ Much faster, particularly at test time
▶ Much simpler and more streamlined implementations

Future steps
▶ Eliminate the remaining accuracy gap (essentially achieved in [Faster R-CNN, Ren et al. 2015])
▶ Beyond bounding boxes
▶ Beyond detection