AttentionNet for Accurate Localization and Detection of Objects. (To appear in ICCV 2015)
Donggeun Yoo, Sunggyun Park, Joon-Young Lee, Anthony Paek, In So Kweon.
-
State-of-the-art frameworks for object detection.
1. Region-CNN framework. [Gkioxari et al., CVPR’14]
Pipeline (diagram labels): Object proposal → CNN → SVM → NMS → BB Reg.
(−) The maximally scored region is prone to focus on a discriminative part (e.g. face) rather than the entire object (e.g. human body).
-
State-of-the-art frameworks for object detection.
2. Detection by CNN-regression. [Szegedy et al., NIPS’13]
Diagram: image → CNN → (x1, y1, x2, y2), the two corners (x1, y1) and (x2, y2) of the bounding box.
(−) Direct mapping from an image to an exact bounding box is relatively difficult for a CNN.
-
Idea: Ensemble of weak predictions.
(Figure labels: "Stop signal", "Stop signal".)
-
Model: Rather than a CNN regression model, use a CNN classification model.
Layers: Convolution, Normalization, Pooling → Convolution, Normalization, Pooling → Convolution → Convolution → Convolution → Fully connected → Fully connected.
Two outputs, one per corner, each [ 3 directions, stop signal, no object ] ∈ ℝ^5.
Top-left direction prediction: → ↘ ↓ • F
Bottom-right direction prediction: ← ↖ ↑ • F
(• = stop signal, F = no object.)
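A minimal sketch of how the two 5-way outputs could be read: pick the highest-scoring class for each corner. The class order follows the slide (three directions, stop signal, no object). The helper name `decode_corner` and the example scores are made up for illustration and are not taken from the talk.

```python
# Class order taken from the slide: three directions, stop signal (•), no object (F).
TL_CLASSES = ["→", "↘", "↓", "•", "F"]   # top-left corner
BR_CLASSES = ["←", "↖", "↑", "•", "F"]   # bottom-right corner


def decode_corner(scores, classes):
    """Return the label of the highest-scoring class (hypothetical helper)."""
    if len(scores) != len(classes):
        raise ValueError("expected one score per class")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return classes[best]


# Made-up scores for one image, one list per corner.
tl_label = decode_corner([0.1, 0.7, 0.1, 0.05, 0.05], TL_CLASSES)
br_label = decode_corner([0.05, 0.1, 0.05, 0.75, 0.05], BR_CLASSES)
print(tl_label, br_label)
```

Treating each corner as a 5-way classification keeps every single prediction weak (a direction, not a coordinate), which is the point of the slide.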
-
Iterative test: Ensemble of weak directions.
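The iterative test can be sketched as a loop that moves both corners by small steps until both report the stop signal. Everything concrete here is an assumption made for illustration: the step size `STEP`, the rule that an "F" prediction rejects the window, and `toy_predictor`, which stands in for the trained network by peeking at a known target box.

```python
STEP = 10  # pixels moved per weak prediction (assumed value)

# Unit offsets (dx, dy) per label; image y grows downward.
MOVES = {"→": (1, 0), "↘": (1, 1), "↓": (0, 1),
         "←": (-1, 0), "↖": (-1, -1), "↑": (0, -1), "•": (0, 0)}


def toy_predictor(box, target):
    """Stand-in for the network: derives labels from a known target box."""
    x1, y1, x2, y2 = box
    tx1, ty1, tx2, ty2 = target

    def label(move_x, move_y, arrows):
        if move_x and move_y:
            return arrows[1]
        if move_x:
            return arrows[0]
        if move_y:
            return arrows[2]
        return "•"

    tl = label(tx1 - x1 >= STEP, ty1 - y1 >= STEP, ("→", "↘", "↓"))
    br = label(x2 - tx2 >= STEP, y2 - ty2 >= STEP, ("←", "↖", "↑"))
    return tl, br


def iterative_test(box, predict, max_iters=100):
    """Move both corners until both say stop; return None if 'F' is predicted."""
    x1, y1, x2, y2 = box
    for _ in range(max_iters):
        tl, br = predict((x1, y1, x2, y2))
        if tl == "F" or br == "F":
            return None  # assumed handling of the no-object class
        if tl == "•" and br == "•":
            return (x1, y1, x2, y2)
        dx, dy = MOVES[tl]
        x1, y1 = x1 + STEP * dx, y1 + STEP * dy
        dx, dy = MOVES[br]
        x2, y2 = x2 + STEP * dx, y2 + STEP * dy
    return (x1, y1, x2, y2)


target = (40, 30, 160, 180)
result = iterative_test((0, 0, 200, 200), lambda b: toy_predictor(b, target))
print(result)
```

With a real network, `predict` would crop the current box from the image and run a forward pass; the loop itself would stay the same.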
-
Training AttentionNet.
1. Generating training samples.
-
Training AttentionNet.
2. Minimizing the loss function by back-propagation and stochastic gradient descent.
$$L = \frac{1}{2} L_{\mathrm{softmax}}(y_{TL}, t_{TL}) + \frac{1}{2} L_{\mathrm{softmax}}(y_{BR}, t_{BR}).$$
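A small sketch of this loss for a single training sample, assuming that y_TL and y_BR are the raw 5-class scores of the two output layers and that t_TL and t_BR are target class indices. Those input conventions, the function names, and the example values are assumptions; the slide only gives the formula.

```python
import math


def softmax_cross_entropy(scores, target):
    """Softmax loss for one sample: -log softmax(scores)[target]."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[target]


def attentionnet_loss(y_tl, t_tl, y_br, t_br):
    """Equal-weight average of the top-left and bottom-right softmax losses."""
    return (0.5 * softmax_cross_entropy(y_tl, t_tl)
            + 0.5 * softmax_cross_entropy(y_br, t_br))


# Made-up scores for the 5 classes of each corner, with their target indices.
loss = attentionnet_loss([2.0, 0.5, 0.1, -1.0, -2.0], 0,
                         [0.0, 0.0, 0.0, 0.0, 0.0], 3)
print(round(loss, 4))
```

In training, this value would be averaged over a mini-batch and minimized with back-propagation and stochastic gradient descent, as the slide states.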
-
Result. (Good examples.)
-
Result. (Bad examples.)
-
How to detect multiple instances?
-
Extension to multiple-instance: 1. Fast multi-scale sliding window search using a fully convolutional network.
-
*Fast extraction of multi-scale dense activations.
Idea: A fully connected layer can be implemented equivalently as a convolutional layer.
Original network (227×227×3 input): Conv. 1 → Conv. 2 → Conv. 3 → Conv. 4 → Conv. 5 → FC 6 → FC 7 → FC 8.
Converted network (322×322×3 input): Conv. 1 → Conv. 2 → Conv. 3 → Conv. 4 → Conv. 5 → Conv. 6 → Conv. 7 → FC 8.
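The equivalence between a fully connected layer and a convolution can be checked on a toy example. The sketch below uses a single-channel input and a 2×2 receptive field; the sizes, weights, and function names are made up for illustration and are far smaller than the layers in the diagram. On an input the size of the receptive field the convolution reproduces the fully connected output, and on a larger input it yields one such output per patch.

```python
def fully_connected(patch, weights, bias):
    """FC layer: flatten a k x k patch, one dot product per output unit."""
    flat = [v for row in patch for v in row]
    return [sum(w * v for w, v in zip(w_row, flat)) + b
            for w_row, b in zip(weights, bias)]


def fc_to_kernels(weights, k):
    """Reshape each FC weight row (length k*k) into a k x k kernel."""
    return [[w_row[i * k:(i + 1) * k] for i in range(k)] for w_row in weights]


def conv2d(image, kernels, bias, stride=1):
    """Valid convolution (no padding); returns a grid of output vectors."""
    k = len(kernels[0])
    out = []
    for r in range(0, len(image) - k + 1, stride):
        out_row = []
        for c in range(0, len(image[0]) - k + 1, stride):
            out_row.append([
                sum(kern[i][j] * image[r + i][c + j]
                    for i in range(k) for j in range(k)) + b
                for kern, b in zip(kernels, bias)])
        out.append(out_row)
    return out


weights = [[1, 2, 3, 4], [0, -1, 0, 1]]   # 2 output units over a 2x2 patch
bias = [0.5, -0.5]
kernels = fc_to_kernels(weights, k=2)

small = [[1, 2], [3, 4]]                    # same size as the receptive field
large = [[1, 2, 0], [3, 4, 1], [5, 6, 2]]   # larger input gives a dense grid

single = conv2d(small, kernels, bias)
dense = conv2d(large, kernels, bias)
print(single, len(dense), len(dense[0]))
```

Each entry of `dense` equals the fully connected output for the patch at that position, so one forward pass over the larger image replaces many per-patch passes.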
-
*Fast extraction of multi-scale dense activations.
Multi-scale dense activations: each activation vector (4,096 dimensions) comes from one patch.
-
Extension to multiple-instance:
2. Early rejection with {↘𝑇𝐿, ↖𝐵𝑅} constraint.
Satisfying {↘𝑇𝐿, ↖𝐵𝑅}: Start iterative test.
Not satisfying {↘𝑇𝐿, ↖𝐵𝑅}: Reject.
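The constraint amounts to a simple filter on the first prediction for each window: keep a window only when the top-left corner predicts ↘ and the bottom-right corner predicts ↖. In the sketch below the window names and their predictions are hypothetical, and `passes_early_rejection` is an illustrative helper, not a function from the talk.

```python
def passes_early_rejection(tl_label, br_label):
    """Keep a window only if TL predicts ↘ and BR predicts ↖."""
    return tl_label == "↘" and br_label == "↖"


# Hypothetical first-pass predictions (TL, BR) for four sliding windows.
windows = {
    "w0": ("↘", "↖"),   # satisfies the constraint
    "w1": ("→", "↖"),
    "w2": ("F", "F"),
    "w3": ("↘", "•"),
}

# Only the windows that pass go on to the iterative test.
kept = [name for name, (tl, br) in windows.items()
        if passes_early_rejection(tl, br)]
print(kept)
```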
-
Extension to multiple-instance: Overall architecture for sliding window search.
-
Extension to multiple-instance: Merging multiple bounding boxes.
-
Evaluation on PASCAL VOC Series.
PASCAL VOC 2007 “Person”.
PASCAL VOC 2012 “Person”.
(Figure: charts comparing RCNN (58.7), RCNN-based, AttentionNet, and AttentionNet+RCNN.)
-
Evaluation on PASCAL VOC Series.
(Figure: precision-recall curve on PASCAL VOC 2007 “Person”.)