DeepTrack: Learning Discriminative Feature Representations by Convolutional Neural Networks for Visual Tracking
Hanxi Li 1,2
hanxi.li@nicta.com.au
Yi Li 1
http://users.cecs.anu.edu.au/~yili/
Fatih Porikli 1
http://www.porikli.com/
1 NICTA and ANU, Canberra, Australia
2 Jiangxi Normal University, Jiangxi, China
Defining hand-crafted feature representations needs expert knowledge, requires time-consuming manual adjustments, and is arguably one of the limiting factors of object tracking.
In this paper, we propose a novel solution that automatically relearns the most useful feature representations during the tracking process, so the tracker accurately adapts to appearance changes and pose and scale variations while preventing drift and tracking failures. We employ a candidate pool of multiple Convolutional Neural Networks (CNNs) as a data-driven model of different instances of the target object. Each CNN maintains a specific set of kernels that favourably discriminate object patches from their surrounding background using all available low-level cues (Fig. 1). These kernels are updated in an online manner at each frame, after being trained with just one instance at the initialization of the corresponding CNN.
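As a toy illustration of feeding several low-level cues into one network, each cue channel can be normalized and stacked into a multi-channel input. This is our own simplification, not the paper's code; the helper names `normalize` and `stack_cues` are ours:

```python
def normalize(cue):
    """Zero-mean, unit-variance normalization of one cue channel
    (a stand-in for the patch normalization step of Fig. 1)."""
    n = len(cue)
    mean = sum(cue) / n
    std = (sum((v - mean) ** 2 for v in cue) / n) ** 0.5
    return [(v - mean) / std if std else 0.0 for v in cue]

def stack_cues(cues):
    """Stack normalized cue channels (e.g. intensity, gradients)
    into one multi-channel CNN input."""
    return [normalize(c) for c in cues]

# Toy usage: two 3-pixel "cues"; a constant cue normalizes to zeros.
stacked = stack_cues([[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]])
```

The important property is that every CNN in the pool sees the same normalized cue stack, so a model trained on one frame remains comparable on later frames.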
[Figure 1 diagram: training patches from four low-level cues (Cue-1 to Cue-4) are normalized before entering the network; the Layer-1 filters are learned, and the class-specific variant additionally pre-learns a class-specific cue layer from training faces. Per cue, a 32×32 input patch passes through a 13×13 convolution (Layer-1: 9 feature maps of 20×20), 2×2 subsampling (Layer-2: 9 maps of 10×10), a 9×9 convolution (Layer-3: 18 maps of 2×2) and 2×2 subsampling, yielding an 18×1 feature vector that a fully connected layer maps to a 2×1 output label.]
Figure 1: Overall architecture with (red box) and without (rest) the class-specific version.
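The layer sizes reported in Fig. 1 are consistent with "valid" convolutions (stride 1, no padding) followed by 2×2 subsampling; a quick sketch of the shape arithmetic (the helper names are ours):

```python
def conv_out(size, kernel):
    """Output side length of a 'valid' convolution: stride 1, no padding."""
    return size - kernel + 1

def subsample_out(size, factor):
    """Output side length after non-overlapping subsampling."""
    return size // factor

s = 32                    # 32x32 input patch per cue
s = conv_out(s, 13)       # 13x13 convolution -> 9 maps of 20x20
s = subsample_out(s, 2)   # 2x2 subsampling  -> 9 maps of 10x10
s = conv_out(s, 9)        # 9x9 convolution  -> 18 maps of 2x2
s = subsample_out(s, 2)   # 2x2 subsampling  -> 18 maps of 1x1
feature_dim = 18 * s * s  # 18x1 feature vector before the fully connected layer
```

Each step reproduces the map sizes in the figure, ending in the 18-dimensional feature vector that feeds the 2×1 output label.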
Instead of learning one complicated and powerful CNN model for all past appearance observations, we choose a relatively small number of filters per CNN within a framework equipped with a temporal adaptation mechanism (Fig. 2). Given a frame, the most promising CNNs in the pool are selected to evaluate the hypotheses for the target object. The hypothesis with the highest score is assigned as the current detection window, and the selected models are retrained using a warm-start backpropagation that optimizes a structural loss function.
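The pool-based adaptation loop can be sketched as follows. This is our own mock, not the authors' implementation: each "CNN" is collapsed to a single prototype vector whose score for a patch is the negative distance, and "warm-start retraining" is reduced to nudging the winning prototype toward the detection:

```python
def l2(a, b):
    """Euclidean distance between two flattened patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

class CNNPool:
    """Mock pool of up to `capacity` CNNs; each model is reduced to a
    prototype vector standing in for the real network."""

    def __init__(self, first_patch, capacity=20):
        self.prototypes = [list(first_patch)]
        self.capacity = capacity

    def step(self, hypotheses, new_model_threshold=2.0):
        # Steps 1-3: match hypotheses against the prototypes and keep the
        # best-scoring hypothesis as the current detection window.
        best_patch, best_proto, best_dist = None, None, float("inf")
        for h in hypotheses:
            for p in self.prototypes:
                d = l2(p, h)
                if d < best_dist:
                    best_patch, best_proto, best_dist = h, p, d
        # Steps 4-6: either "retrain" the winning model (nudge its
        # prototype) or add a new CNN when no model explains the detection.
        if best_dist > new_model_threshold and len(self.prototypes) < self.capacity:
            self.prototypes.append(list(best_patch))
        else:
            for i, v in enumerate(best_patch):
                best_proto[i] += 0.5 * (v - best_proto[i])
        return best_patch

# Toy usage: the hypothesis close to the initial target wins.
pool = CNNPool(first_patch=[0.0, 0.0])
detection = pool.step([[0.1, 0.0], [5.0, 5.0]])
```

A later frame that no existing model explains well would trigger the "add a new CNN" branch, which is how the pool accumulates distinct appearance instances over time.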
[Figure 2 diagram: along the time axis (frames #34, #162, #208), each frame goes through a test/estimate/train cycle against a pool of up to 20 CNNs (CNN-1 ... CNN-20), each with its own prototype: ① the nearest CNNs are found via prototype matching, ② the nearest CNNs are evaluated, ③ their responses are selected and ranked, ④ the winning CNN (e.g. CNN-3) is retrained, ⑤ a decision is made on whether a new CNN is needed, and ⑥ accordingly either CNN-3 is updated or a new CNN is added.]
Figure 2: Illustration of the temporal adaptation mechanism.
Our experiments on a large selection of videos from the recent benchmarks demonstrate that our method outperforms the existing state-of-the-art algorithms and rarely loses track of the target object. We evaluate our method on 16 benchmark video sequences that cover the most challenging tracking scenarios, such as scale changes, illumination changes, occlusions, cluttered backgrounds and fast motion (Fig. 3).
[Figure 3 plots: precision vs. location error threshold (0-50, left) and success rate vs. overlap threshold (0-1, right) for CNN (ours), SCM, Struck, CXT, ASLA, TLD, IVT, MIL, Frag and CPF.]
Figure 3: The Precision Plot (left) and the Success Plot (right) of the compared visual tracking methods.
In certain applications, the target object belongs to a known class of objects, such as human faces. Our method can use this prior information to improve tracking performance by training a class-specific detector. In the tracking stage, given the particular instance information, one would normally need to combine the class-level detector and the instance-level tracker in some way, which usually leads to higher model complexity (Fig. 4).
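The extra complexity of running a separate class-level detector alongside the instance-level tracker can be sketched as a naive late fusion of the two scores. The fusion rule, the weight `alpha` and both scorer functions below are our illustration, not the paper's method (which instead bakes the class prior into the network via the pre-learned class-specific cue layer of Fig. 1):

```python
def fuse(class_score, instance_score, alpha=0.5):
    """Naive late fusion of a class-level detector score and an
    instance-level tracker score; `alpha` must be hand-tuned."""
    return alpha * class_score + (1.0 - alpha) * instance_score

def best_hypothesis(hyps, class_scorer, instance_scorer, alpha=0.5):
    """Pick the hypothesis with the highest fused score. Two separate
    models must be evaluated per hypothesis, which is the added model
    complexity a class-specific network avoids."""
    return max(hyps, key=lambda h: fuse(class_scorer(h), instance_scorer(h), alpha))

# Toy usage: hypotheses are scalars; the detector prefers values near 1.0,
# the tracker prefers values near 0.8, so 0.85 wins under equal weighting.
pick = best_hypothesis([0.2, 0.85, 1.4],
                       class_scorer=lambda h: -abs(h - 1.0),
                       instance_scorer=lambda h: -abs(h - 0.8))
```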
[Figure 4 plots: precision vs. location error threshold (0-50, left) and success rate vs. overlap threshold (0-1, right) for Class-CNN, CNN and Normal-SGD.]
Figure 4: The Precision Plot (left) and the Success Plot (right) of the compared visual tracking methods.
To conclude, we introduced DeepTrack, a CNN-based online object tracker. We employed a CNN architecture and a structural loss function that handle multiple input cues and class-specific tracking. We also proposed an iterative training procedure that speeds up the training process significantly. Together with the CNN pool, our experiments demonstrate that DeepTrack performs very well on the 16 sequences.