Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate...

64
Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

Transcript of Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate...

Page 1: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik

Rich Feature Hierarchies for Accurate Object Detection and

Semantic Segmentation

Page 2: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik

Rich Feature Hierarchies for Accurate Object Detection and

Semantic Segmentation

Page 3: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

PASCAL VOC detection history

DPM

VOC’07 VOC’08 VOC’09 VOC’10 VOC’11 VOC’12

DPMDPM,

HOG+BOW

DPM,MKL

DPM++DPM++,

MKL,Selective

Search

Selective Search,DPM++,

MKL

41%

PASCAL VOC challenge dataset

41%37%

28%23%

17%

[Source: http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc20{07,08,09,10,11,12}/results/index.html]

Page 4: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

A rapid rise in performance

DPM

VOC’07 VOC’08 VOC’09 VOC’10 VOC’11 VOC’12

DPMDPM,

HOG+BOW

DPM,MKL

DPM++DPM++,

MKL,Selective

Search

Selective Search,DPM++,

MKL

41%

PASCAL VOC challenge dataset

41%37%

28%23%

17%

[Source: http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc20{07,08,09,10,11,12}/results/index.html]

Page 5: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Complexity and the plateau

DPM

VOC’07 VOC’08 VOC’09 VOC’10 VOC’11 VOC’12

DPMDPM,

HOG+BOW

DPM,MKL

DPM++DPM++,

MKL,Selective

Search

Selective Search,DPM++,

MKL

41%

PASCAL VOC challenge dataset

41%37%

28%23%

17%

plateau & increasing complexity

[Source: http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc20{07,08,09,10,11,12}/results/index.html]

Page 6: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

SIFT, HOG, LBP, ...

VOC’10 VOC’12VOC’07 VOC’09VOC’08 VOC’11

SegDPM (2013)Regionlets (2013)Regionlets

(2013)

[Regionlets. Wang et al. ICCV’13] [SegDPM. Fidler et al. CVPR’13]

PASCAL VOC challenge dataset

DPMDPM,

HOG+BOW

DPM,MKL

DPM++DPM++,

MKL,Selective

Search

Selective Search,DPM++,

MKL

Page 7: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

R-CNN: Regions with CNN features

VOC’10 VOC’12VOC’07 VOC’09VOC’08 VOC’11

R-CNN58.5% R-CNN

53.7%R-CNN53.3%

PASCAL VOC challenge dataset

Page 8: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Feature learning with CNNs

Fukushima 1980 Neocognitron

Page 9: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Feature learning with CNNs

Fukushima 1980 Neocognitron

Rumelhart, Hinton, Williams 1986“T” versus “C” problem

Page 10: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Feature learning with CNNs

Fukushima 1980 Neocognitron

LeCun et al. 1989-1998Handwritten digit reading / OCR

Rumelhart, Hinton, Williams 1986“T” versus “C” problem

Page 11: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Feature learning with CNNs

Fukushima 1980 Neocognitron

Rumelhart, Hinton, Williams 1986“T” versus “C” problem

...Krizhevksy, Sutskever, Hinton 2012ImageNet classification breakthrough“SuperVision” CNN

LeCun et al. 1989-1998Handwritten digit reading / OCR

Page 12: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

CNNs for object detection

LeCun, Huang, Bottou 2004NORB dataset

Cireşan et al. 2013Mitosis detection

Sermanet et al. 2013Pedestrian detection

Vaillant, Monrocq, LeCun 1994Multi-scale face detection

Szegedy, Toshev, Erhan 2013PASCAL detection (VOC’07 mAP 30.5%)

Page 13: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Can we break through the PASCAL plateauwith feature learning?

Page 14: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

R-CNN: Regions with CNN features

Inputimage

Extract regionproposals (~2k / image)

Compute CNNfeatures

Classify regions(linear SVM)

Page 15: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

R-CNN at test time: Step 1

Inputimage

Extract regionproposals (~2k / image)

Proposal-method agnostic, many choices - Selective Search [van de Sande, Uijlings et al.] (Used in this work) - Objectness [Alexe et al.] - Category independent object proposals [Endres & Hoiem] - CPMC [Carreira & Sminchisescu] Active area, at this CVPR - BING [Ming et al.] – fast - MCG [Arbelaez et al.] – high-quality segmentation

Page 16: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

R-CNN at test time: Step 2

Inputimage

Extract regionproposals (~2k / image)

Compute CNNfeatures

Page 17: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

R-CNN at test time: Step 2

Inputimage

Extract regionproposals (~2k / image)

Compute CNNfeatures

Dilate proposal

Page 18: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

R-CNN at test time: Step 2

Inputimage

Extract regionproposals (~2k / image)

Compute CNNfeatures

a. Crop

Page 19: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

R-CNN at test time: Step 2

Inputimage

Extract regionproposals (~2k / image)

Compute CNNfeatures

a. Crop b. Scale (anisotropic)

227 x 227

Page 20: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

1. Crop b. Scale (anisotropic)

R-CNN at test time: Step 2

Inputimage

Extract regionproposals (~2k / image)

Compute CNNfeatures

c. Forward propagateOutput: “fc7” features

Page 21: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

R-CNN at test time: Step 3

Inputimage

Extract regionproposals (~2k / image)

Compute CNNfeatures

Warped proposal 4096-dimensionalfc7 feature vector

linear classifiers(SVM or softmax)

person? 1.6

horse? -0.3

...

...

Classifyregions

Page 22: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Linear regression

on CNN features

Step 4: Object proposal refinement

Originalproposal

Predictedobject bounding box

Bounding-box regression

Page 23: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Bounding-box regression

original

predicted

Δh × h + h

Δw × w + w

(Δx × w + x, Δy × h + h)

w

h(x, y)

Page 24: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

metric: mean average precision (higher is better)

VOC 2007 VOC 2010

DPM v5 (Girshick et al. 2011) 33.7% 29.6%

UVA sel. search (Uijlings et al. 2013) 35.1%

Regionlets (Wang et al. 2013) 41.7% 39.7%

SegDPM (Fidler et al. 2013) 40.4%

R-CNN 54.2% 50.2%

R-CNN + bbox regression 58.5% 53.7%

R-CNN results on PASCAL

Reference systems

Page 25: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

metric: mean average precision (higher is better)

VOC 2007 VOC 2010

DPM v5 (Girshick et al. 2011) 33.7% 29.6%

UVA sel. search (Uijlings et al. 2013) 35.1%

Regionlets (Wang et al. 2013) 41.7% 39.7%

SegDPM (Fidler et al. 2013) 40.4%

R-CNN 54.2% 50.2%

R-CNN + bbox regression 58.5% 53.7%

R-CNN results on PASCAL

Page 26: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Top bicycle FPs (AP = 72.8%)

Page 27: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

False positive #15

Page 28: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

False positive #15

Unannotated bicycle

(zoom)

Page 29: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

False positive #15

1949 French comedy by Jacques Tati

Page 30: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

False positive type distribution

BG = background

Oth = other / dissimilar classes

Sim = similar classes

Loc = localization

Analysis software: D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing Error in Object Detectors. ECCV, 2012.

Page 31: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Training R-CNN

Bounding-box labeled detection data is scarce

Key insight:Use supervised pre-training on a data-

rich auxiliary task and transfer to detection

Page 32: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

ImageNet LSVR Challenge

– Image classification (not detection)

– 1000 classes (vs. 20)

– 1.2 million training labels (vs. 25k) bus anywhere?

[Deng et al. CVPR’09]

Page 33: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

ILSVRC 2012 winner“SuperVision” Convolutional Neural Network (CNN)

ImageNet Classification with Deep Convolutional Neural Networks.Krizhevsky, Sutskever, Hinton. NIPS 2012.

input 5 convolutional layers fully connected

Page 34: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Impressive ImageNet results!

But... does it generalize to other datasets and tasks?[See also: DeCAF. Donahue et al., ICML 2014.]

Spirited debate at ECCV 2012

1000-way image classification

metric: classification error rate (lower is better)

Top-5 error

Fisher Vectors (ISI) 26.2%

5 SuperVision CNNs 16.4%

7 SuperVision CNNs 15.3%

now: ~12%

Page 35: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

R-CNN training: Step 1

Supervised pre-trainingTrain a SuperVision CNN* for the 1000-way

ILSVRC image classification task

*Network from Krizhevsky, Sutskever & Hinton. NIPS 2012Also called “AlexNet”

Page 36: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

R-CNN training: Step 2

Fine-tune the CNN for detectionTransfer the representation learned for ILSVRC

classification to PASCAL (or ImageNet detection)

Page 37: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

R-CNN training: Step 2

Fine-tune the CNN for detectionTransfer the representation learned for ILSVRC

classification to PASCAL (or ImageNet detection)

Try Caffe! http://caffe.berkeleyvision.org - Clean & fast CNN library in C++ with Python and MATLAB interfaces - Used by R-CNN for training, fine-tuning, and feature computation

Page 38: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

R-CNN training: Step 3

Train detection SVMs(With the softmax classifier from fine-tuning

mAP decreases from 54% to 51%)

Page 39: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Compare with fine-tuned R-CNN

VOC 2007 VOC 2010

Regionlets (Wang et al. 2013) 41.7% 39.7%

SegDPM (Fidler et al. 2013) 40.4%

R-CNN pool5 44.2%

R-CNN fc6 46.2%

R-CNN fc7 44.7%

R-CNN FT pool5 47.3%

R-CNN FT fc6 53.1%

R-CNN FT fc7 54.2% 50.2%

metric: mean average precision (higher is better)

fine-

tune

d

Page 40: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

What did the network learn?

Page 41: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

What did the network learn?

pool5 feature mapx

yx

y

← feature channels →

“stimulus”

Page 42: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

What did the network learn?

feature position receptive fieldpool5 feature mapx

yx

y

← feature channels →

Visualize images that activate pool5 a feature

227 pixels

2276

6 cells

“stimulus”

Page 43: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
Page 44: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
Page 45: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
Page 46: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
Page 47: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
Page 48: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Take away- Dramatically better PASCAL mAP

- R-CNN outperforms other CNN-based methods on ImageNet detection

- Detection speed is manageable (~11s / image)

- Scales very well with number of categories (30ms for 20 → 200 classes!)

- R-CNN is simple and completely open source

Page 49: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Semantic segmentation

full fgCPMC segments(Carreira & Sminchisescu)

Page 50: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Semantic segmentation

full fgCPMC segments(Carreira & Sminchisescu)

VOC 2011 test

Bonn second-order pooling (Carreira et al.)

47.6%

R-CNN fc6 full+fg (no fine-tuning) 47.9%

Improved to 50.5% in our upcoming ECCV’14 work: Simultaneous Localization and Detection. Hariharan et al.

Page 51: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Get the code and models!bit.ly/rcnn-cvpr14

Page 52: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Supplementary slides follow

Page 53: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Pre-trained CNN + SVMs (no FT)

VOC 2007 VOC 2010

Regionlets (Wang et al. 2013) 41.7% 39.7%

SegDPM (Fidler et al. 2013) 40.4%

R-CNN pool5 44.2%

R-CNN fc6 46.2%

R-CNN fc7 44.7%

R-CNN FT pool5 47.5%

R-CNN FT fc6 53.2%

R-CNN FT fc7 54.1% 50.2%

metric: mean average precision (higher is better)

Page 54: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

False positive analysis

Analysis software: D. Hoiem, Y. Chodpathumwan, and Q. Dai. “Diagnosing Error in Object Detectors.” ECCV, 2012.

No fine-tuning

Page 55: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

False positive analysis

After fine-tuning

Page 56: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

False positive analysis

After bounding-box regression

Page 57: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

What did the network learn?

unit position receptive fieldpool5 “data cube”x

yx

y

z

Visualize pool5 units

Page 58: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
Page 59: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
Page 60: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
Page 61: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
Page 62: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
Page 63: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Comparison with DPM

R-CNN DPM v5

animals animals

Page 64: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Localization errors dominate

animals vehicles furniture

backgrounddissimilar

classessimilar classes localization

Analysis software: D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing Error in Object Detectors. ECCV, 2012.