Lecture 6: CNNs for Detection, Tracking, and Segmentation
Bohyung Han, Computer Vision Lab. ([email protected])
CSED703R: Deep Learning for Visual Recognition (2016S)
Object Detection
Region‐based CNN (RCNN)
• Object detection
 ‐ Independent evaluation of each proposal
 ‐ Bounding box regression improves detection accuracy.
 ‐ Mean average precision (mAP): 53.7% with bounding box regression on the VOC 2010 test set
[Girshick14] R. Girshick, J. Donahue, S. Guadarrama, T. Darrell, J. Malik: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014
Pipeline: input image → extract region proposals (any proposal method, e.g., selective search, EdgeBoxes) → compute CNN features (any architecture) → classification (softmax, SVM)
Selective Search
• Motivation
 ‐ The sliding window approach is not feasible for object detection with convolutional neural networks.
 ‐ We need a faster method to identify object candidates.
• Finding object proposals
 ‐ Greedy hierarchical superpixel segmentation
 ‐ Diversification of superpixel construction and merging:
  • Using a variety of color spaces
  • Using different similarity measures
  • Varying starting regions
[Uijlings13] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, A. W. M. Smeulders: Selective Search for Object Recognition. IJCV 2013
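The greedy hierarchical merging above can be sketched as follows. This is a toy illustration, not the real selective search: regions are plain sets of pixel indices and a single made-up similarity function stands in for the paper's color/texture/size/fill measures. Every intermediate merged region becomes an object proposal.

```python
def greedy_merge(regions, similarity):
    """Repeatedly merge the most similar pair of regions; every
    intermediate region is kept as an object proposal."""
    proposals = list(regions)
    regions = list(regions)
    while len(regions) > 1:
        # find the most similar pair of current regions
        i, j = max(
            ((a, b) for a in range(len(regions))
                    for b in range(a + 1, len(regions))),
            key=lambda ab: similarity(regions[ab[0]], regions[ab[1]]),
        )
        merged = regions[i] | regions[j]          # union of pixel sets
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged)
    return proposals

# toy regions as sets of pixel indices; the toy similarity prefers
# small combined regions, loosely mimicking the size measure
regions = [{0}, {1}, {2, 3}, {4, 5, 6}]
props = greedy_merge(regions, lambda a, b: -len(a | b))
```

With 4 starting regions, 3 merges produce 7 proposals in total, the last covering the whole image.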
Bounding Box Regression
• Learning a transformation of the bounding box
 ‐ Region proposal: P = (P_x, P_y, P_w, P_h)
 ‐ Ground‐truth: G = (G_x, G_y, G_w, G_h)
 ‐ Transformation targets:
  t_x = (G_x − P_x)/P_w, t_y = (G_y − P_y)/P_h,
  t_w = log(G_w/P_w), t_h = log(G_h/P_h)
 ‐ Each regressor is learned on CNN pool5 features by regularized least squares:
  w_* = argmin_w Σ_i (t_*^i − w^T φ_5(P^i))² + λ‖w‖²
 ‐ Prediction: Ĝ_x = P_w d_x(P) + P_x, Ĝ_y = P_h d_y(P) + P_y, Ĝ_w = P_w exp(d_w(P)), Ĝ_h = P_h exp(d_h(P))
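The target transformation and its inverse can be written out directly; a minimal sketch (center/width/height box parametrization as in R‐CNN):

```python
import numpy as np

def bbox_targets(P, G):
    """Regression targets t = (tx, ty, tw, th) for proposal P and
    ground truth G, each given as (x, y, w, h) with center x/y."""
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    return np.array([(Gx - Px) / Pw, (Gy - Py) / Ph,
                     np.log(Gw / Pw), np.log(Gh / Ph)])

def apply_deltas(P, d):
    """Invert the mapping: apply predicted offsets d to proposal P."""
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    return np.array([Pw * dx + Px, Ph * dy + Py,
                     Pw * np.exp(dw), Ph * np.exp(dh)])

P = (50.0, 50.0, 20.0, 10.0)   # proposal box
G = (54.0, 52.0, 40.0, 20.0)   # ground-truth box
t = bbox_targets(P, G)
G_hat = apply_deltas(P, t)     # round trip recovers the ground truth
```

The log parametrization of width and height keeps the predicted box size positive regardless of the regressor's raw output.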
Detection Results
• VOC 2010 test set
• Feature analysis on VOC 2007 test set
Fast RCNN
• Fast version of RCNN
 ‐ 9x faster in training and 213x faster in testing than RCNN
 ‐ A single feature computation and ROI pooling using object proposals
 ‐ Bounding box regression integrated into the network
 ‐ Single‐stage training using a multi‐task loss
[Girshick15] R. Girshick: Fast R‐CNN, ICCV 2015
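The ROI pooling step mentioned above can be sketched with numpy. This is a simplified single-channel version: the region of the shared feature map is divided into a fixed output grid and each cell is max-pooled, so every proposal yields a fixed-size feature regardless of its shape.

```python
import numpy as np

def roi_max_pool(feat, roi, out_size=(2, 2)):
    """feat: (H, W) feature map (one channel for simplicity);
    roi: (y0, x0, y1, x1) in feature-map coordinates."""
    y0, x0, y1, x1 = roi
    H, W = out_size
    region = feat[y0:y1, x0:x1]
    h, w = region.shape
    ys = np.linspace(0, h, H + 1).astype(int)   # grid cell boundaries
    xs = np.linspace(0, w, W + 1).astype(int)
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = region[ys[i]:ys[i+1], xs[j]:xs[j+1]].max()
    return out

feat = np.arange(36, dtype=float).reshape(6, 6)
pooled = roi_max_pool(feat, (0, 0, 4, 4))   # 4x4 region -> fixed 2x2
```

In the real network this is applied per channel and the pooled features feed the fully connected layers, so classification and box regression share one feature computation.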
Faster RCNN
• Fast RCNN + RPN (Region Proposal Network)
 ‐ Proposal computation moved into the network
 ‐ Marginal cost of proposals: 10 ms
[Ren15] S. Ren, K. He, R. Girshick, J. Sun: Faster R‐CNN: Towards Real‐Time Object Detection with Region Proposal Networks. NIPS 2015
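A minimal sketch of the RPN's anchor generation, with the scales, ratios, and stride from the paper: at every feature-map position (stride 16 on the input), k = 9 reference boxes are placed, and the network then scores and refines each one.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchors as (x0, y0, x1, y1)."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # anchor center on the input image
            cx, cy = x * stride + stride // 2, y * stride + stride // 2
            for s in scales:
                for r in ratios:
                    # width/height chosen so the area stays s*s
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return np.array(anchors)

A = make_anchors(2, 3)   # tiny 2x3 feature map -> 2*3*9 anchors
```

Since every anchor is tied to a feature-map position, the proposal computation is just two small convolutional heads on the shared features, which is why proposals cost only ~10 ms.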
Object Detection Performance
Pascal VOC 2007 Object Detection mAP (%)
RCNN family achieves the state‐of‐the‐art performance in object detection!
Faster RCNN with ResNet
Visual Tracking with Convolutional Neural Networks
Main Idea
• Training shared features and domain‐specific classifiers jointly.
[Figure: a shared feature representation feeds domain‐specific classifiers (Domain 1–4); the shared features transfer to a new domain]
Visual Tracking
• MDNet (Multi‐Domain Network)
 ‐ Multi‐domain learning
 ‐ Separating shared and domain‐specific layers
[Nam15] Hyeonseob Nam, Bohyung Han: Learning Multi‐Domain Convolutional Neural Networks for Visual Tracking, CVPR 2016
The Winner of Visual Object Tracking Challenge 2015
Multi‐Domain Learning
• Each SGD iteration draws a minibatch from a single domain, visiting the K domains in turn (iteration #nK+1, #nK+2, …, #(n+1)K).
Online Tracking using MDNet Features
[Figure: the shared features are transferred to a new sequence, and a new domain‐specific layer is fine‐tuned online]
Online Tracking: Overview
Repeated for each frame:
1. Draw target candidates
2. Find the optimal state: x* = argmax_x f+(x), where f+(·) is the positive score
3. Collect training samples
4. Update the CNN if needed
Online Network Update
• Long‐Term Update
 ‐ Performed at regular intervals
 ‐ Using long‐term training samples
 ‐ For robustness
• Short‐Term Update
 ‐ Performed at abrupt appearance changes (f+(x*) < 0.5)
 ‐ Using short‐term training samples
 ‐ For adaptiveness
[Figure: positive scores f+(x*) per frame; long‐term updates occur at regular intervals, short‐term updates when the score drops]
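The two update rules above can be sketched as a small scheduler. The interval value is illustrative (only the 0.5 threshold appears on the slide); the function merely decides which update to run at each frame.

```python
def update_kind(frame_idx, positive_score,
                long_interval=10, threshold=0.5):
    """Decide which network update to run for this frame, if any.
    `long_interval` is an assumed value for illustration."""
    if positive_score < threshold:
        return "short-term"   # adaptiveness: abrupt appearance change
    if frame_idx % long_interval == 0:
        return "long-term"    # robustness: regular interval
    return None

# toy run: the score dips below 0.5 at frame 3
scores = [0.9, 0.8, 0.3, 0.9]
kinds = [update_kind(i, s) for i, s in enumerate(scores, start=1)]
```

Note the short-term rule takes priority: a score drop triggers an update immediately, even between the regular long-term intervals.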
Hard Negative Mining
• Provide a "hard" minibatch in each training iteration.
 ‐ Randomly draw samples from the pool of positive samples.
 ‐ Randomly draw candidate samples from the pool of negative samples, then select the few (≪ the number drawn) with the highest positive scores.
 ‐ Train the CNN with the resulting minibatch.
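The minibatch construction above can be sketched as follows. Pools are stand-in integer ids and `score` is a toy stand-in for the CNN's positive-class score; the batch sizes are assumed values for illustration.

```python
import numpy as np

def hard_minibatch(pos_pool, neg_pool, score,
                   n_pos=8, n_neg_draw=64, n_neg=16, seed=0):
    """Draw positives at random; draw many negative candidates,
    keep only the highest-scoring (hardest) ones."""
    rng = np.random.default_rng(seed)
    pos = rng.choice(pos_pool, size=n_pos, replace=False)
    cand = rng.choice(neg_pool, size=n_neg_draw, replace=False)
    hard = cand[np.argsort(score(cand))[-n_neg:]]  # top-scoring negatives
    return pos, hard

pos_pool = np.arange(100)          # stand-ins for positive samples
neg_pool = np.arange(100, 1000)    # stand-ins for negative samples
# toy score: larger ids look "more positive", i.e., harder negatives
pos, hard = hard_minibatch(pos_pool, neg_pool, score=lambda x: x)
```

Because only the already-confusing negatives reach the loss, the classifier spends its capacity on distractors rather than easy background.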
Hard Negative Mining
[Figure: positive and negative samples selected at the 1st, 5th, and 30th minibatch; the selected negatives become harder as training iterations proceed]
Bounding Box Regression
• Improves localization quality, as in DPM [Felzenszwalb et al. PAMI'10] and R‐CNN [Girshick et al. CVPR'14]
 ‐ Frame 1: train a bounding box regression model from positive samples and the ground‐truth.
 ‐ Frame t: adjust the tracking result by bounding box regression.
[Figure legend: tracking result, ground‐truth, positive samples]
Results on OTB100 [Wu15]
• Protocol
 ‐ MDNet is trained with 58 sequences from {VOT'13,'14,'15}, excluding {OTB100}.
 ‐ Distance precision and overlap success rate by One‐Pass Evaluation (OPE)
[Wu15] Y. Wu, J. Lim, M.‐H. Yang: Object Tracking Benchmark. TPAMI 2015
Results on VOT2015
[Video: ground‐truth vs. our result over 15 repetitions]
Semantic Segmentation by Fully Convolutional Network
Semantic Segmentation
• Segmenting an image according to its semantic content, i.e., assigning a class label to every pixel
Semantic Segmentation using CNN
• Image classification
• Semantic segmentation
 ‐ Given an input image, obtain a pixel‐wise segmentation mask using a deep Convolutional Neural Network (CNN)
Fully Convolutional Network (FCN)
• Interpreting fully connected layers as convolution layers
 ‐ Each fully connected layer is identical to a convolution layer with a large spatial filter that covers the entire input field.
[Figure: fully connected view, pool5 (7×7×512) → fc6 (1×1×4096) → fc7 (1×1×4096), versus the equivalent convolutional view with the same weights; for a larger input field, pool5 becomes 22×22×512 and fc6/fc7 produce 16×16×4096 spatial outputs]
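The equivalence claimed above can be checked numerically on a toy scale: an fc layer over a fixed-size feature map equals a "valid" convolution whose kernel covers that whole map, and sliding the same weights over a larger input yields a spatial grid of fc outputs (a coarse score map). The sizes here (2×2×3 kernel, 4 outputs) are toy stand-ins for pool5's 7×7×512 and fc6's 4096.

```python
import numpy as np

C, H, W, D = 3, 2, 2, 4                     # channels, kernel h/w, outputs
rng = np.random.default_rng(0)
W_fc = rng.standard_normal((D, C * H * W))  # fc weight matrix

def fc(x):
    """Fully connected layer over a (C, H, W) input."""
    return W_fc @ x.reshape(-1)

def conv(x):
    """The same weights applied as a valid convolution, stride 1."""
    c, h, w = x.shape
    out = np.empty((D, h - H + 1, w - W + 1))
    for i in range(h - H + 1):
        for j in range(w - W + 1):
            out[:, i, j] = fc(x[:, i:i+H, j:j+W])
    return out

x_small = rng.standard_normal((C, H, W))    # exact input field size
x_large = rng.standard_normal((C, 3, 3))    # larger input field
y_fc = fc(x_small)
y_conv = conv(x_large)   # 2x2 spatial grid of fc outputs
```

On the exact-size input the convolution produces a 1×1 output identical to the fc layer; on the larger input it produces the spatial score map the FCN relies on, with no new parameters.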
FCN for Semantic Segmentation
• Network architecture [Long15]
 ‐ End‐to‐end CNN architecture for semantic segmentation
 ‐ Interprets fully connected layers as convolution layers
[Long15] J. Long, E. Shelhamer, and T. Darrell, Fully Convolutional Network for Semantic Segmentation. CVPR 2015
Deconvolution
[Figure: a 16×16×21 coarse score map is upsampled by deconvolution to the 500×500×3 input resolution]
Deconvolution Filter
• Bilinear interpolation filter
 ‐ Same filter for every class
 ‐ No filter learning!
• How does this deconvolution work?
 ‐ The deconvolution layer is fixed to 64×64 bilinear interpolation.
 ‐ The convolutional layers of the network (pretrained on ImageNet) are fine‐tuned with segmentation ground‐truth.
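The fixed bilinear filter can be built explicitly; this is the standard construction used to initialize (and here, freeze) FCN-style deconvolution layers:

```python
import numpy as np

def bilinear_kernel(size):
    """2-D bilinear interpolation kernel of the given side length,
    used as a fixed deconvolution (upsampling) filter."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    # separable triangular weights in each axis
    return ((1 - abs(og[0] - center) / factor) *
            (1 - abs(og[1] - center) / factor))

k = bilinear_kernel(4)   # e.g., a 4x4 kernel for 2x upsampling
```

The kernel is separable and symmetric, and its weights sum to the squared upsampling factor, so the upsampled map preserves the overall magnitude of the coarse scores.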
Skip Architecture
• Ensemble of three different scales
 ‐ Combining complementary features: deeper layers are more semantic, shallower layers are more detailed.
Limitations of FCN‐based Semantic Segmentation
• Coarse output score map
 ‐ A single bilinear filter must handle the variations of all object classes.
 ‐ Difficult to capture the detailed structure of objects in the image
• Fixed‐size receptive field
 ‐ Unable to handle multiple scales
 ‐ Difficult to delineate objects that are too small or too large compared to the size of the receptive field
• Noisy predictions due to the skip architecture
 ‐ Trade‐off between details and noise
 ‐ Minor quantitative performance improvement
Results and Limitations
[Figure: input image, GT, FCN‐32s, FCN‐16s, FCN‐8s]