DeepID-Net: Deformable Deep Convolutional Neural Networks for ...
Transcript of DeepID-Net: Deformable Deep Convolutional Neural Networks for ...
![Page 1: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/1.jpg)
DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection
Xiaogang Wang
Department of Electronic Engineering, Chinese University of Hong Kong
![Page 2: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/2.jpg)
ImageNet Image Classification Challenge 2012

| Rank | Name | Error rate | Description |
|---|---|---|---|
| 1 | U. Toronto | 0.15315 | Deep learning |
| 2 | U. Tokyo | 0.26172 | Hand-crafted features and learning models. Bottleneck. |
| 3 | U. Oxford | 0.26979 | |
| 4 | Xerox/INRIA | 0.27058 | |

Krizhevsky, Sutskever, Hinton, NIPS’12
![Page 3: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/3.jpg)
Top-5 Image Classification Error on ImageNet
• ILSVRC 2012: AlexNet, 15.375%
• ILSVRC 2013: Clarifai, 11.197%
• ILSVRC 2014: GoogLeNet, 6.656%
![Page 4: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/4.jpg)
ImageNet Object Detection Task (2013)
• 200 object classes
• 40,000 test images
![Page 5: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/5.jpg)
Mean Average Precision (mAP), ILSVRC 2013 and ILSVRC 2014
• UvA-Euvision: 22.581%
• RCNN: 31.4%
• GoogLeNet: 43.9%
• DeepID-Net: 50.3%
W. Ouyang and X. Wang, et al. “DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection,” CVPR 2015
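As a reminder of what the mAP figures above measure, here is a minimal sketch of mean average precision over classes. It assumes a simplified AP (mean precision at the rank of each correct detection); the official ILSVRC evaluation has its own box-matching rules that are not shown on the slide and are not reproduced here. The function names and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def average_precision(is_true_positive: List[bool], num_ground_truth: int) -> float:
    """Simplified AP: mean precision at the rank of each correct detection.

    `is_true_positive` lists one class's detections sorted by descending score.
    """
    if num_ground_truth == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, correct in enumerate(is_true_positive, start=1):
        if correct:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_ground_truth


def mean_average_precision(per_class: Dict[str, Tuple[List[bool], int]]) -> float:
    """Mean of the per-class AP values."""
    aps = [average_precision(flags, n) for flags, n in per_class.values()]
    return sum(aps) / len(aps)


# Toy data (made up): ranked detections per class and number of ground-truth objects.
toy_results = {
    "person": ([True, False, True], 2),
    "horse": ([True, True], 2),
}
toy_map = mean_average_precision(toy_results)
```

On ImageNet detection the mean would run over the 200 object classes rather than the two toy classes used here.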
![Page 6: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/6.jpg)
PASCAL VOC (SIFT, HOG, DPM…)
![Page 7: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/7.jpg)
PASCAL VOC (CNN features)
![Page 8: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/8.jpg)
PASCAL VOC (CNN features)
• Our current result: 73.9%
• DeepID-Net: 64.1%
![Page 9: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/9.jpg)
Pedestrian Detection
Improve state-of-the-art average miss detection rate on the largest Caltech dataset from 63% to 17%
ICCV’13 CVPR’14
CVPR’12 CVPR’13 ICCV’13 CVPR’15
![Page 10: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/10.jpg)
Pedestrian Detection on Caltech (average miss detection rates)
• HOG+SVM: 68%
• HOG+DPM: 63%
• Joint DL: 39%
• DL aided by semantic tasks: 17%
W. Ouyang and X. Wang, “Joint Deep Learning for Pedestrian Detection,” ICCV 2013.
Y. Tian, P. Luo, X. Wang, and X. Tang, “Pedestrian Detection aided by Deep Learning Semantic Tasks,” CVPR 2015.
![Page 11: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/11.jpg)
Outline
• Joint deep learning: pedestrian detection
• DeepID-Net: general object detection on ImageNet
• Conclusions
![Page 12: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/12.jpg)
Is the deep model a black box?
![Page 13: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/13.jpg)
Joint Learning vs Separate Learning
• Separate learning (each step obtained by training or manual design): Data collection → Preprocessing step 1 → Preprocessing step 2 → … → Feature extraction → Classification
• End-to-end learning: Data collection → Feature transform → Feature transform → … → Feature transform → Classification
Deep learning is a framework/language but not a black-box model
Its power comes from joint optimization and increasing the capacity of the learner
![Page 14: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/14.jpg)
ConvNet-U-MS
• P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, “Pedestrian Detection with Unsupervised Multi-Stage Feature Learning,” CVPR 2013.
![Page 15: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/15.jpg)
Results on Caltech Test
Results on ETHZ
![Page 16: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/16.jpg)
• N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR, 2005. (6000 citations)
• P. Felzenszwalb, D. McAllester, and D. Ramanan. A Discriminatively Trained, Multiscale, Deformable Part Model. CVPR, 2008. (2000 citations)
• W. Ouyang and X. Wang. A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling. CVPR, 2012.
![Page 17: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/17.jpg)
Our Joint Deep Learning Model
W. Ouyang and X. Wang, “Joint Deep Learning for Pedestrian Detection,” Proc. ICCV, 2013.
![Page 18: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/18.jpg)
Modeling Part Detectors
Design the filters in the second convolutional layer with variable sizes
[Figure: part models learned from HOG; filters learned at the second convolutional layer]
![Page 19: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/19.jpg)
Deformation Layer
![Page 20: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/20.jpg)
Visibility Reasoning with Deep Belief Net
Correlates with part detection score
![Page 21: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/21.jpg)
Pedestrian Detection aided by Deep Learning Semantic Tasks
[Figure: example patches labeled with pedestrian attributes (female, male, bag, backpack, back, right) and scene attributes (vehicle, tree, horizontal, vertical)]
Y. Tian, P. Luo, X. Wang, and X. Tang, “Pedestrian Detection aided by Deep Learning Semantic Tasks,” CVPR 2015
![Page 22: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/22.jpg)
[Figure: (a) Data generation. Pedestrian patches come from Caltech (P); background patches with hard negatives come from CamVid (Ba), Stanford Bkg. (Bb), and LM+SUN (Bc), with scene labels such as sky, bldg., tree, road, traffic light, vehicle, vertical, and horizontal. (b) TA-CNN. Layers shown: conv1, conv2, conv3, conv4, fc5, fc6. Outputs shown: pedestrian classifier, pedestrian attributes, shared bkg. attributes, unshared bkg. attributes. An element labeled SPV also appears.]
![Page 23: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/23.jpg)
Pedestrian Detection on Caltech (average miss detection rates)
• HOG+SVM: 68%
• DPM: 63%
• Joint DL: 39%
• DL aided by semantic tasks: 17%
W. Ouyang and X. Wang, “Joint Deep Learning for Pedestrian Detection,” ICCV 2013.
Y. Tian, P. Luo, X. Wang, and X. Tang, “Pedestrian Detection aided by Deep Learning Semantic Tasks,” CVPR 2015.
![Page 24: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/24.jpg)
Outline
• Joint deep learning: pedestrian detection
• DeepID-Net: general object detection on ImageNet
• Conclusions
![Page 25: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/25.jpg)
Challenges of Object Detection
• Huge number of classes
• Appearance variation in different classes
![Page 26: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/26.jpg)
Challenges -- person
• Intra-class variation
• Part existence
![Page 27: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/27.jpg)
Challenges -- person
• Intra-class variation
• Part existence
• Color
![Page 28: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/28.jpg)
Challenges -- person
• Intra-class variation
• Part existence
• Color
• Occlusion
![Page 29: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/29.jpg)
Challenges -- person
• Intra-class variation
• Part existence
• Color
• Occlusion
• Deformation
![Page 30: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/30.jpg)
Object Detection on ImageNet
• RCNN (mean average precision: 31.4%): Image → Selective search → Proposed bounding boxes → CNN+SVM → Detection results → Bounding box regression → Refined bounding boxes
• DeepID-Net (mean average precision: 50.3%): Image → Selective search → Proposed bounding boxes → Box rejection → Remaining bounding boxes → DeepID-Net (pretrain, def-pooling layer, hinge-loss) → Context modeling → Model averaging → Bounding box regression
[Figure: example detections labeled person and horse at each stage]
![Page 31: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/31.jpg)
Consideration for deep learning based general object detection
• Time
  • Test
  • Training
• Accuracy
  • Learning discriminative and invariant features
  • Capture complex deformation and occlusion of parts
  • Rich contextual information
![Page 32: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/32.jpg)
Our pipeline (mAP 31 to 50.3)
Image → Selective search → Bounding boxes → Box rejection → Remaining bounding boxes → DeepID-Net (pretrain, def-pooling layer, hinge-loss) → Context modeling → Model averaging → Bounding box regression
![Page 33: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/33.jpg)
Object detection – old framework
• Sliding window
• Feature extraction
• Classification

For each window size
  For each window
    1. Feature extraction
    2. Classification
  End;
End;

2015/10/3
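The nested loop on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions, not the speaker's code: `extract_features` and `classify` are hypothetical placeholders (mean intensity and a fixed threshold), and the window size and stride are made-up values.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]        # (x, y, width, height)
Image = Sequence[Sequence[float]]      # grayscale image as rows of pixel values


def extract_features(image: Image, box: Box) -> List[float]:
    """Hypothetical feature extractor: mean intensity inside the window."""
    x, y, w, h = box
    pixels = [image[r][c] for r in range(y, y + h) for c in range(x, x + w)]
    return [sum(pixels) / len(pixels)]


def classify(features: List[float]) -> bool:
    """Hypothetical classifier: 'object' if the mean intensity exceeds 0.5."""
    return features[0] > 0.5


def sliding_window_detect(image: Image,
                          window_sizes: Sequence[Tuple[int, int]],
                          stride: int = 1) -> List[Box]:
    height, width = len(image), len(image[0])
    detections = []
    for w, h in window_sizes:                          # for each window size
        for y in range(0, height - h + 1, stride):     # for each window
            for x in range(0, width - w + 1, stride):
                box = (x, y, w, h)
                features = extract_features(image, box)    # 1. feature extraction
                if classify(features):                     # 2. classification
                    detections.append(box)
    return detections


# Toy 4x4 image with a bright 2x2 patch in the lower-right corner.
toy_image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
boxes = sliding_window_detect(toy_image, window_sizes=[(2, 2)], stride=1)
```

Every window is featurized and classified independently, which is why the number of windows dominates the cost of this framework.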
![Page 34: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/34.jpg)
Object detection – the framework
• Sliding window → Feature extraction → Classification
• Feature extraction produces a feature vector x := [x1 x2 x3 x4 …]
• Loop: for each window size, for each window: 1. feature extraction, 2. classification
![Page 35: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/35.jpg)
Object detection – the framework
• Sliding window → Feature extraction → Classification
• Feature extraction produces a feature vector x := [x1 x2 x3 x4 …]
• Classification: object or not?
• Loop: for each window size, for each window: 1. feature extraction, 2. classification
![Page 36: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/36.jpg)
Problem of sliding windows
• Single-scale detection: 10k to 100k windows per image
• Multi-scale detection: 100k to 1m windows per image
• Multiple aspect ratios: 10m to 100m windows per image
• Selective search: 2k windows per image of multiple scales and aspect ratios
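To see where window counts of this order come from, the following sketch counts the positions of one window size on a regular grid. The image size (500x375), window size (64x128), stride, and scale factors are assumed values chosen for illustration; they do not come from the slides.

```python
def count_windows(img_w: int, img_h: int, win_w: int, win_h: int, stride: int) -> int:
    """Number of positions of a win_w x win_h window on a regular grid."""
    if win_w > img_w or win_h > img_h:
        return 0
    nx = (img_w - win_w) // stride + 1
    ny = (img_h - win_h) // stride + 1
    return nx * ny


# Single scale: one window size, stride of 2 pixels (assumed values).
single_scale = count_windows(500, 375, 64, 128, stride=2)

# Multi-scale: the same window enlarged by a few assumed scale factors.
scales = (1.0, 1.5, 2.0)
multi_scale = sum(
    count_windows(500, 375, int(64 * s), int(128 * s), stride=2) for s in scales
)
```

With these assumed values, a single scale already yields 27,156 windows, and every extra scale or aspect ratio adds another grid of windows on top.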
![Page 37: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/37.jpg)
Selective search
• Image → Selective search → Bounding boxes
• Initial segments from over-segmentation [Felzenszwalb2004]
![Page 38: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/38.jpg)
Selective search
• Image → Selective search → Bounding boxes
• Initial segments from over-segmentation [Felzenszwalb2004]
• Based on hierarchical grouping
  • Group adjacent regions on region-level similarity
  • Consider all scales of the hierarchy
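The hierarchical grouping idea can be sketched as a greedy merge loop: repeatedly merge the most similar pair of adjacent regions and keep the bounding box of every region created along the way. This is a simplified illustration, not the selective search implementation. The single mean-color similarity, the region dictionaries, and the toy segments are assumptions.

```python
from itertools import combinations


def merge_boxes(a, b):
    """Smallest box (x1, y1, x2, y2) enclosing both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def adjacent(a, b):
    """True if two boxes touch or overlap (edge coordinates)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def similarity(r, s):
    """Assumed similarity: regions with closer mean color are more similar."""
    return -abs(r["color"] - s["color"])


def hierarchical_grouping(regions):
    regions = [dict(r) for r in regions]
    proposals = [r["box"] for r in regions]          # finest scale
    while len(regions) > 1:
        pairs = [(i, j) for i, j in combinations(range(len(regions)), 2)
                 if adjacent(regions[i]["box"], regions[j]["box"])]
        if not pairs:
            break
        i, j = max(pairs, key=lambda p: similarity(regions[p[0]], regions[p[1]]))
        a, b = regions[i], regions[j]
        size = a["size"] + b["size"]
        merged = {
            "box": merge_boxes(a["box"], b["box"]),
            "size": size,
            "color": (a["color"] * a["size"] + b["color"] * b["size"]) / size,
        }
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged["box"])              # coarser scale
    return proposals


# Three toy segments side by side; the first two have similar color.
initial_segments = [
    {"box": (0, 0, 4, 10), "size": 40, "color": 0.10},
    {"box": (4, 0, 8, 10), "size": 40, "color": 0.15},
    {"box": (8, 0, 12, 10), "size": 40, "color": 0.90},
]
proposals = hierarchical_grouping(initial_segments)
```

Because boxes from every merge level are kept, the proposals cover multiple scales without enumerating a dense grid of windows.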
![Page 39: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/39.jpg)
Our investigation
• Speed up the pipeline
• Effectively learn the deep model
• Make use of domain knowledge from computer vision
  • Deformation pooling
  • Context modelling
![Page 40: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/40.jpg)
Our approach (mAP 31 to 50.57 on val2)
Image → Selective search → Proposed bounding boxes → Box rejection → Remaining bounding boxes → DeepID-Net (pretrain, def-pooling layer, hinge-loss) → Context modeling → Model averaging → Bounding box regression
![Page 41: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/41.jpg)
Bounding box rejection
• Motivation
  • Selective search: ~2400 bounding boxes per image
  • Feature extraction using AlexNet
    • ILSVRC val: ~20,000 images, ~2.4 days
    • ILSVRC test: ~40,000 images, ~4.7 days
• Bounding box rejection by RCNN:
  • For each box, RCNN has 200 scores S1…200 for 200 classes
  • If max(S1…200) < -1.1, reject. 6% remaining bounding boxes

| Remaining window | 100% | 20% | 6% |
|---|---|---|---|
| Recall (val1) | 92.2% | 89.0% | 84.4% |
| Feature extraction time (seconds per image) | 10.24 | 2.88 | 1.18 |

Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR, 2014
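The rejection rule and the timing figures above can be checked with a few lines of Python. The threshold of -1.1 and the per-image times are taken from the slide; the function names, the three-class toy scores (instead of 200), and the box labels are hypothetical.

```python
REJECT_THRESHOLD = -1.1  # threshold stated on the slide


def keep_box(class_scores, threshold=REJECT_THRESHOLD):
    """Keep a box unless its best class score falls below the threshold."""
    return max(class_scores) >= threshold


def reject_boxes(boxes, scores, threshold=REJECT_THRESHOLD):
    """Return only the boxes that survive the rejection rule."""
    return [box for box, s in zip(boxes, scores) if keep_box(s, threshold)]


def extraction_days(num_images, seconds_per_image):
    """Total feature extraction time in days."""
    return num_images * seconds_per_image / 86400.0


# Toy example with 3 class scores per box instead of 200.
toy_boxes = ["box_a", "box_b", "box_c"]
toy_scores = [
    [-2.0, -1.5, -1.2],   # best score -1.2 < -1.1: rejected
    [-3.0, 0.4, -1.8],    # best score 0.4: kept
    [-1.1, -2.0, -2.5],   # best score -1.1, not below the threshold: kept
]
kept = reject_boxes(toy_boxes, toy_scores)

val_days = extraction_days(20_000, 10.24)            # all boxes, ILSVRC val
test_days = extraction_days(40_000, 10.24)           # all boxes, ILSVRC test
val_days_after_rejection = extraction_days(20_000, 1.18)  # 6% of boxes kept
```

At 10.24 seconds per image, 20,000 and 40,000 images give roughly 2.4 and 4.7 days, which matches the figures on the slide.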
![Page 42: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/42.jpg)
Bounding box rejection
• Speed up the pipeline
  • Save the feature extraction time by about 10 times.
• Improve mean AP by 1%
[Figure: bar chart of running time in hours (axis 0 to 100), with bbox rejection vs. without bbox rejection, for finetuning, feature extraction (val1), feature extraction (val2), SVM learning, testing SVM score, and all]

| Remaining window | 100% | 20% | 6% |
|---|---|---|---|
| Recall (val1) | 92.2% | 89.0% | 84.4% |
| Feature extraction time (seconds per image) | 10.24 | 2.88 | 1.18 |

Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR, 2014
![Page 43: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/43.jpg)
Our pipeline (mAP 31 to 50.57)
Image → Selective search → Proposed bounding boxes → Box rejection → Remaining bounding boxes → DeepID-Net (pretrain, hinge-loss, def-pooling layer) → Context modeling → Model averaging → Bounding box regression
![Page 44: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/44.jpg)
Deep learning is feature learning
[Figure: features learned on ImageNet, shown with image classification, object detection, tracking, and segmentation]
![Page 45: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/45.jpg)
Learning features and classifiers separately
• How to effectively learn features?
  • With challenging tasks
  • Predict high-dimensional vectors
[Figure: Training stage A on Dataset A uses deep learning to learn the feature transform together with Classifier 1, Classifier 2, ..., giving predictions on task 1, task 2, ...; Training stage B on Dataset B uses the feature transform with Classifier B, giving the prediction on task B (our target task)]
![Page 46: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/46.jpg)
Directly training 200 binary classifiers with CNNs is not good
Pre-train on classifying 1,000 categories → Fine-tune on classifying 201 categories → Feature representation → SVM binary classifier for each category → Detect 200 object classes on ImageNet
Girshick, Ross, et al. CVPR, 2014
![Page 47: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/47.jpg)
Why need pre-training with many classes?
• Each sample carries much more information
• One big negative class with many types of objects confuses CNN on feature learning
• Make the training task challenging, not easy to overfit
![Page 48: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/48.jpg)
Feature learning
• Pretrain for image-classification with 1000 classes
• Finetune for object-detection with 200+1 classes
  • Transfer the representation learned from ILSVRC Classification to PASCAL (or ImageNet) detection
• Use the fine-tuned features for learning SVM
[Figure: Classification (pre-train) → Detection (fine-tune) → SVM → Detection, with the example label "Apple"]
Girshick, Ross, et al. CVPR, 2014
![Page 49: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/49.jpg)
Feature learning
• Pretrain for image-classification with 1000 classes
• Finetune for object-detection with 200+1 classes
• Use the fine-tuned features for learning SVM
• Existing approaches mainly investigate network structure
  • Number of layers/channels, filter size, dropout
[Figure: Classification (pre-train) → Detection (fine-tune) → SVM → Detection, with the example label "Apple"]
Girshick, Ross, et al. CVPR, 2014
![Page 50: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/50.jpg)
Deep model design
• Network structure

| Net structure | AlexNet | AlexNet |
|---|---|---|
| Annotation level | Image | Image |
| Bbox rejection | n | y |
| mAP (%) | 29.9 | 30.9 |
![Page 51: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/51.jpg)
Deep model design
• Network structure

| Net structure | AlexNet | AlexNet | Clarifai |
|---|---|---|---|
| Annotation level | Image | Image | Image |
| Bbox rejection | n | y | y |
| mAP (%) | 29.9 | 30.9 | 31.8 |
![Page 52: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/52.jpg)
Deep model design
• Network structure

| Net structure | AlexNet | AlexNet | Clarifai | Overfeat |
|---|---|---|---|---|
| Annotation level | Image | Image | Image | Image |
| Bbox rejection | n | y | y | y |
| mAP (%) | 29.9 | 30.9 | 31.8 | 36.6 |
![Page 53: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/53.jpg)
Deep model design
• Network structure

| Net structure | AlexNet | AlexNet | Clarifai | Overfeat | GoogLeNet |
|---|---|---|---|---|---|
| Annotation level | Image | Image | Image | Image | Image |
| Bbox rejection | n | y | y | y | y |
| mAP (%) | 29.9 | 30.9 | 31.8 | 36.6 | 37.8 |
![Page 54: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/54.jpg)
Feature learning – pretrain
• Classification
  • Pretrain for image-classification with 1000 classes
  • Finetune for object detection with 200 classes
  • Gap: classification vs. detection, 1000 vs. 200
[Figure: image classification vs. object detection]
![Page 56: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/56.jpg)
Feature learning – pretrain
• Classification
[Figure: pretrained on object-level annotation vs. pretrained on image-level annotation]
![Page 57: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/57.jpg)
Feature learning – pretrain
• Classification (Cls)
  • Pretrain for image-classification with 1000 classes
  • Gap: classification vs. detection, 1000 vs. 200
• Detection (Loc)
  • Pretrain for object-detection with 1000 classes

| Pretraining scheme | Cls | Cls | Loc |
|---|---|---|---|
| Net structure | AlexNet | Clarifai | Clarifai |
| mAP (%) on val2 | 29.9 | 31.8 | 36.0 |
![Page 58: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/58.jpg)
Result and discussion
• RCNN (Cls+Det), our investigation
  • Better pretraining on 1000 classes
  • Object-level annotation is more suitable for pretraining

| AlexNet | Image annotation | Object annotation |
|---|---|---|
| 200 classes (Det) | 20.7 | 32 |
| 1000 classes (Cls-Loc) | 31.8 | 36 |
![Page 59: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/59.jpg)
Our approach (mAP 31 to 50.57 on val2)
Image → Selective search → Proposed bounding boxes → Box rejection → Remaining bounding boxes → DeepID-Net (pretrain, def-pooling layer, hinge-loss) → Context modeling → Model averaging → Bounding box regression
![Page 60: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/60.jpg)
Feature learning – SVM-net
• Existing approach
  • Learn features using soft-max loss (Softmax-Net)
  • Train SVM with the learned features
![Page 61: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/61.jpg)
Feature learning – SVM-net
• Existing approach
  • Learn features using soft-max loss (Softmax-Net)
  • Train SVM with the learned features
• Replace soft-max loss by hinge loss when fine-tuning (SVM-Net)
  • Merge the two steps of RCNN into one
  • Require no feature extraction from training data (~60 hours)
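To make the difference between the two losses concrete, here is a small sketch of both for a single sample. The slide does not give the exact hinge formulation, so the one-vs-rest form below (one binary hinge term per class, mirroring the per-category binary SVMs) is an assumption, and the function names are hypothetical.

```python
import math


def softmax_cross_entropy(scores, label):
    """Multi-class soft-max loss for one sample; `label` is the true class index."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[label]


def one_vs_rest_hinge(scores, label):
    """Sum of binary hinge losses: target +1 for the true class, -1 for the others."""
    loss = 0.0
    for k, s in enumerate(scores):
        target = 1.0 if k == label else -1.0
        loss += max(0.0, 1.0 - target * s)
    return loss


example_scores = [2.0, -1.5, -0.5]   # made-up network outputs for 3 classes
hinge_value = one_vs_rest_hinge(example_scores, label=0)
softmax_value = softmax_cross_entropy(example_scores, label=0)
```

The hinge loss becomes exactly zero once every class score clears its margin, whereas the soft-max loss stays positive for any finite scores. Training the network with a hinge loss lets its outputs play the role of SVM scores, which is what allows the separate SVM step to be dropped.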
![Page 62: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/62.jpg)
Our pipeline (mAP 31 to 50.3)
Image → Selective search → Proposed bounding boxes → Box rejection → Remaining bounding boxes → DeepID-Net (pretrain, hinge-loss, def-pooling layer) → Context modeling → Model averaging → Bounding box regression
![Page 63: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/63.jpg)
Deep model training – def-pooling layer
• RCNN (ImageNet Cls+Det)
  • Pretrain on image-level annotation with 1000 classes
  • Finetune on object-level annotation with 200 classes
  • Gap: classification vs. detection, 1000 vs. 200
• Our approach (ImageNet Loc+Det)
  • Pretrain on object-level annotation with 1000 classes
  • Finetune on object-level annotation with 200 classes with def-pooling layers

| Net structure | Without Def layer | With Def layer |
|---|---|---|
| mAP (%) on val2 | 36.0 | 38.5 |
![Page 64: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/64.jpg)
Deformation
• Learning deformation [a] is effective in the computer vision community.
• Missing in deep models.
• We propose a new deformation constrained pooling layer.
[a] P. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Trans. PAMI, 32:1627–1645, 2010.
![Page 65: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/65.jpg)
Modeling Part Detectors
• Different parts have different sizes
• Design the filters with variable sizes
[Figure: part models learned from HOG; filters learned at the second convolutional layer]
![Page 66: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/66.jpg)
Deformation Layer [b]
[b] Wanli Ouyang, Xiaogang Wang, "Joint Deep Learning for Pedestrian Detection", ICCV 2013.
![Page 67: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/67.jpg)
Deformation layer for repeated patterns

| Pedestrian detection | General object detection |
|---|---|
| Assume no repeated pattern | Repeated patterns |
![Page 68: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/68.jpg)
Deformation layer for repeated patterns

| Pedestrian detection | General object detection |
|---|---|
| Assume no repeated pattern | Repeated patterns |
| Only consider one object class | Patterns shared across different object classes |
![Page 69: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/69.jpg)
Deformation constrained pooling layer
Can capture multiple patterns simultaneously
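The slide text does not spell out the pooling formula, so the following is only a sketch of the general idea: take the maximum over positions of the part detection score minus a penalty for moving away from an anchor position. The quadratic penalty, its weight `c_quad`, and the toy score map are assumptions, not the layer as defined by the authors.

```python
def def_pooling(part_scores, anchor, c_quad=0.1):
    """Return the best penalized score and the position where it occurs.

    part_scores: 2-D list of part detection scores.
    anchor: (row, col) of the expected part location.
    c_quad: assumed weight of a quadratic displacement penalty.
    """
    best_value, best_pos = float("-inf"), None
    for r, row in enumerate(part_scores):
        for c, score in enumerate(row):
            penalty = c_quad * ((r - anchor[0]) ** 2 + (c - anchor[1]) ** 2)
            value = score - penalty
            if value > best_value:
                best_value, best_pos = value, (r, c)
    return best_value, best_pos


toy_scores = [
    [0.2, 0.1, 0.0],
    [0.3, 0.5, 0.9],
    [0.0, 0.1, 2.0],
]
# Small penalty: the strong response at the corner wins despite the displacement.
loose_value, loose_pos = def_pooling(toy_scores, anchor=(1, 1), c_quad=0.1)
# Large penalty: the response at the anchor position wins.
strict_value, strict_pos = def_pooling(toy_scores, anchor=(1, 1), c_quad=1.0)
```

With `c_quad=0` the function reduces to ordinary max pooling over the map, so in this sketch the penalty is the only thing that distinguishes deformation-constrained pooling from max pooling.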
![Page 70: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/70.jpg)
Our deep model with deformation layer
Patterns shared across different classes

| Training scheme | Cls+Det | Loc+Det | Loc+Det |
|---|---|---|---|
| Net structure | AlexNet | Clarifai | Clarifai+Def layer |
| Mean AP on val2 | 0.299 | 0.360 | 0.385 |
![Page 71: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/71.jpg)
Our approach (mAP 31 to 50.57 on val2)
Image → Selective search → Proposed bounding boxes → Box rejection → Remaining bounding boxes → DeepID-Net (pretrain, def-pooling layer, hinge-loss) → Context modeling → Model averaging → Bounding box regression
![Page 72: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/72.jpg)
Context modeling
• Use the 1000-class image classification score.
• ~1% mAP improvement.
![Page 73: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/73.jpg)
Context modeling
Use the 1000-class image classification score.
~1% mAP improvement.
Volleyball: improves AP by 8.4% on val2.
Figure: example images labeled volleyball, golf ball, and bathing cap.
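One way the whole-image classification scores could be used as context is sketched below. The slides only state that the 1000-class classification score is used; the concatenation with per-box detection scores, the linear rescoring, and the names `add_context`, `weights`, and `bias` are assumptions made for illustration. The toy sizes (3 detection classes, 4 classification classes) stand in for the real class counts.

```python
import numpy as np

def add_context(det_scores, cls_scores, weights, bias):
    """Rescore detections using whole-image classification scores (illustrative).

    det_scores: (n_boxes, n_det) detection scores per box.
    cls_scores: (n_cls,) whole-image classification scores.
    weights:    (n_det + n_cls, n_det) linear weights; hypothetical, would be learned.
    bias:       (n_det,) bias; hypothetical, would be learned.
    """
    n_boxes = det_scores.shape[0]
    context = np.tile(cls_scores, (n_boxes, 1))   # same image context for every box
    features = np.concatenate([det_scores, context], axis=1)
    return features @ weights + bias

det_scores = np.array([[0.2, 0.7, 0.1],
                       [0.6, 0.1, 0.3]])
cls_scores = np.array([0.9, 0.0, 0.1, 0.0])

# Weights that ignore the context reproduce the original detection scores.
identity_weights = np.vstack([np.eye(3), np.zeros((4, 3))])
unchanged = add_context(det_scores, cls_scores, identity_weights, np.zeros(3))

# Weights that let classification class 0 boost detection class 1.
context_weights = identity_weights.copy()
context_weights[3, 1] = 0.5
boosted = add_context(det_scores, cls_scores, context_weights, np.zeros(3))
```

In the second call every box's score for detection class 1 rises by 0.5 times the image-level score of classification class 0, which is the kind of effect the volleyball example suggests: scene-level evidence lifts a related object class.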
![Page 74: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/74.jpg)
![Page 75: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/75.jpg)
Model averaging
Models of different structures are complementary on different classes.

| Net structure | AlexNet | AlexNet | Clarifai |
| --- | --- | --- | --- |
| Annotation level | Image | Object | Object |
| Bbox rejection | n | n | n |
| mAP (%) | 29.9 | 34.3 | 35.6 |

Figure: per-class AP difference between models (x-axis: class; y-axis: AP diff, -20 to 20; labeled classes: scorpion, hamster).
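A minimal sketch of score-level model averaging, assuming each model produces a score for every box and class and that the scores are combined by a (possibly weighted) mean. The function name `average_models` and the toy numbers are illustrative; the slides do not specify how the models are combined.

```python
import numpy as np

def average_models(score_list, weights=None):
    """Average per-box, per-class scores from several models (illustrative).

    score_list: list of (n_boxes, n_classes) arrays, one per model.
    weights:    optional per-model weights; uniform average when None.
    """
    scores = np.stack(score_list, axis=0)   # (n_models, n_boxes, n_classes)
    return np.average(scores, axis=0, weights=weights)

# Two toy models that disagree on a single box with two classes.
model_a = np.array([[0.2, 0.8]])
model_b = np.array([[0.6, 0.4]])

uniform = average_models([model_a, model_b])
weighted = average_models([model_a, model_b], weights=[3, 1])
```

Averaging helps most when the models make different errors, which is what the per-class AP differences on this slide indicate.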
![Page 76: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/76.jpg)
![Page 77: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/77.jpg)
Comparison with state-of-the-art

| Detection pipeline | Flair | RCNN | Berkeley Vision | UvA-Euvision | DeepInsight | GoogLeNet | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- |
| mAP on val2 (avg) | n/a | n/a | n/a | n/a | 42 | 44.5 | 50.7 |
| mAP on val2 (sgl) | n/a | 31.0 | 33.4 | n/a | 40.1 | 38.8 | 48.2 |
| mAP on test (avg) | 22.6 | n/a | n/a | n/a | 40.5 | 43.9 | 50.3 |
| mAP on test (sgl) | n/a | 31.4 | 34.5 | 35.4 | 40.2 | 38.0 | 47.9 |
![Page 78: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/78.jpg)
Component analysis

| Detection pipeline | mAP on val2 | Gain | mAP on test |
| --- | --- | --- | --- |
| RCNN | 29.9 | | |
| +Box rejection | 30.9 | 1.0 | |
| O-net | 36.6 | 5.7 | |
| G-net | 37.8 | 1.2 | |
| +bbox pretrain | 40.4 | 2.6 | |
| +Edgebox | 42.7 | 2.3 | |
| +Def layer | 44.9 | 2.2 | |
| +Scale jittering | 47.3 | 2.4 | |
| +ctx | 47.8 | 0.5 | |
| +bbox regr. | 48.2 | 0.4 | 47.9 |
| Model avg. | 50.7 | 2.5 | 50.3 |

Figure: bar chart of the mAP gain contributed by each component (values listed in the Gain column).
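The per-component gains shown in the bar chart can be recomputed from the mAP-on-val2 row. The short script below does that arithmetic; the variable names are illustrative, and the numbers are copied from the slide.

```python
steps = ["RCNN", "+Box rejection", "O-net", "G-net", "+bbox pretrain", "+Edgebox",
         "+Def layer", "+Scale jittering", "+ctx", "+bbox regr.", "Model avg."]
map_val2 = [29.9, 30.9, 36.6, 37.8, 40.4, 42.7, 44.9, 47.3, 47.8, 48.2, 50.7]

# Gain of each step over the previous one, rounded to one decimal like the slide.
gains = [round(after - before, 1) for before, after in zip(map_val2, map_val2[1:])]
total_gain = round(map_val2[-1] - map_val2[0], 1)

for name, gain in zip(steps[1:], gains):
    print(f"{name:<17}+{gain:.1f}")
print(f"{'total':<17}+{total_gain:.1f}")
```

The largest single step is the switch to O-net (5.7 points), and the ten gains together account for the full 20.8-point improvement from 29.9 to 50.7.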
![Page 79: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/79.jpg)
Summary
Speed up the pipeline:
  Bounding box rejection. Saves feature extraction by about 10 times, slightly improves mAP (~1%).
  Hinge loss. Saves feature computation time (~60 h).
Improve the accuracy:
  Pre-training with object-level annotation, more classes. 4.2% mAP
  Def-pooling layer. 2.5% mAP
  Context. 0.5-1% mAP
  Model averaging. Different model designs and training schemes lead to high diversity
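A minimal sketch of how bounding box rejection could cut feature extraction cost, assuming a cheap first-pass detector scores every proposal and a box is kept only if its best class score reaches a threshold. The function name `reject_boxes`, the threshold value, and the toy scores are hypothetical; the slides do not give the rejection rule.

```python
import numpy as np

def reject_boxes(boxes, first_pass_scores, threshold):
    """Keep boxes whose best first-pass class score reaches the threshold (illustrative).

    boxes:             (n_boxes, 4) array of box coordinates.
    first_pass_scores: (n_boxes, n_classes) scores from a cheap first-pass detector.
    threshold:         hypothetical cut-off; boxes below it on every class are dropped.
    """
    keep = first_pass_scores.max(axis=1) >= threshold
    return boxes[keep], keep

boxes = np.array([[0, 0, 10, 10],
                  [5, 5, 20, 20],
                  [1, 1, 4, 4],
                  [8, 2, 30, 12]])
first_pass_scores = np.array([[-2.0, -1.5],
                              [0.3, -0.9],
                              [-1.8, -2.2],
                              [-1.0, -3.0]])

kept_boxes, keep_mask = reject_boxes(boxes, first_pass_scores, threshold=-1.2)
```

Only the surviving boxes are passed to the expensive network, so the saving in feature extraction is roughly the fraction of boxes rejected.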
![Page 80: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/80.jpg)
Conclusions
Jointly optimize vision components (joint deep learning)
Propose new layers based on domain knowledge (def-pooling layer)
Carefully design the strategies of learning feature representations
  Feature learning aided by semantic tasks
  Pre-training with challenging tasks and rich predictions
  The chosen training tasks help to achieve the desired feature invariance and discriminative power
  Adapted to specific tasks in test
![Page 81: DeepID-Net: Deformable Deep Convolutional Neural Networks for ...](https://reader031.fdocuments.us/reader031/viewer/2022022415/58496c871a28aba93a8ed1bd/html5/thumbnails/81.jpg)
Wanli Ouyang Xiaogang Wang Xiaoou Tang Chen Change Loy Ping Luo
Hongsheng Li Xingyu Zeng Shi Qiu Yonglong Tian Zhenyao Zhu
Zhe Wang Shuo Yang Chen Qian Yuanjun Xiong Ruohui Wang