Supplementary for ROAM: Recurrently Optimizing Tracking Model · 2020. 6. 11. · Supplementary for...
Transcript of Supplementary for ROAM: Recurrently Optimizing Tracking Model · 2020. 6. 11. · Supplementary for...
Supplementary for ROAM: Recurrently Optimizing Tracking Model
Tianyu Yang1,2 Pengfei Xu3 Runbo Hu3 Hua Chai3 Antoni B. Chan2
1 Tencent AI Lab 2 City University of Hong Kong 3 Didi Chuxing
1. Gradient FlowNote that our recurrent model optimization process in-
volves a gradient update step (see Eq. 9), which may needto compute the second derivative when minimizing the metaloss. Specifically, we show the computation graph of theoffline training process of our framework in Fig. 1, wherethe gradients flow backwards along the solid lines and ig-nores the dashed lines. For the initial frame (Fig. 1 left), weaim to learn a conducive tracking network θ(0) and learn-ing rates λ(0) that can fast adapt to different videos withone gradient step, which is correlated closely with the track-ing model θ(1). Therefore, we back-propagate through boththe gradient update step and the gradient of the update loss(i.e. computing the second derivative). For the subsequentframes, our goal is to learn the online neural optimizer O,which is independent of the tracking model θ(t) that is beingoptimized. Hence, we choose to ignore the gradient pathsalong the update loss gradient Oθ(t−1)`(t−1), and the neuraloptimizer inputs I(t−1), which are composed of historicalcontext vectors computed from the tracking model. The ef-fect of this simplification is to focus the training more onimproving the neural optimizer to reduce the loss, ratherthan on secondary items, such as the LSTM input or lossgradient.
Figure 1. Computation graph of the recurrent neural opti-mizer.Dashed lines are ignored during backpropagation of the gra-dient.
2. Speed analysisWe further investigate the speed gain obtained by our
recurrent neural optimizer and bounding box regression.
To make a fair speed comparison, we use the same fea-ture backbone (VGG16) and tracking model (two stackedconvolutional layers) to build the tracker. We compare ourROAM and ROAM++ with trackers using traditional SGDfor model updating under different number of gradient stepsin Table 1. ROAM++ runs faster than ROAM because itdoes not need to compute multi-scale features. With similarrun time, our ROAM achieves much better results comparedwith SGD-2, showing the effectiveness of our recurrent neu-ral optimizer. Note that ROAM and SGD-2, which bothuses two gradient steps, obtain similar speed demonstratingthat our recurrent neural optimizer has negligible computa-tional cost. Furthermore, using vanilla SGD requires morenumber of iterations to converge the model, thus leading toslower running speed.
Speed (fps) AUC
ROAM++ 20.6 0.680ROAM 13.4 0.681
SGD-2 13.7 0.593SGD-4 12.5 0.597SGD-8 10.6 0.621
SGD-16 8.0 0.645SGD-32 5.5 0.644
Table 1. Speed comparison results on OTB2015 datasets. SGD-nmeans n gradient steps. SGD uses a learning rate of 7e-7. Noteboth ROAM and ROAM++ uses two gradient steps.
3. Update Loss and Meta Loss ComparisonWe show the update loss `(t) and meta loss L(t) compar-
ison over time between ROAM and SGD after two gradientsteps for all videos of OTB-2015 in Figure 2 and Figure 3respectively. Our ROAM demonstrates better optimizationability on minimizing tracking loss.
4. Impact of Training Data.We also train the variants of our ROAM++ and ROAM
with ImageNet VID training data, which are denoted aasVID-ROAM++ and VID-ROAM, respectively. Table 2
shows the comparison results on VOT datasets. Withonly VID dataset for training, our ROAM++ still outper-forms our preliminary tracker ROAM. Furthermore, bothour VID-ROAM++ and VID-ROAM outperform the base-line MetaTracker [1] by a large margin.
VOT-2016 VOT-2017EAO A R EAO A R
ROAM++ 0.4414 0.5987 0.1743 0.3803 0.5433 0.1945VID-ROAM++ 0.4147 0.5886 0.1749 0.3330 0.5513 0.2366
ROAM 0.3842 0.5558 0.1833 0.3313 0.5048 0.2263VID-ROAM 0.3568 0.5468 0.2091 0.2971 0.5098 0.2491
MetaTracker 0.317 0.519 - - - -
Table 2. Comparison results on VOT-2016 and VOT-2017 datasetswith different training datasets
5. More AblationsWe investigate the impact of removing scale smoothen
and multi-scale search for ROAM tracker. We also presenta variant of ROAM++ which does not resize tracking modelto demonstrate the effectiveness of our resizable model. Ta-ble 3 shows the comparison results on OTB dataset. We cansee that removing multi-scale search or filter resizing de-crease the performance a lot while removing scale smoothhas relative small affect.
Trackers AUC
ROAM 0.681ROAM++ 0.680
ROAM w/o scalesmooth 0.667ROAM w/o multiscale 0.590ROAM++ w/o filter resizing 0.553
Table 3. Performance comparison of different variants on OTBdataset
6. Bounding Box ParametrizationWe follow the way of bounding box parameterization in
[2] when computing smooth L1 loss in (7). Let (x, y, w, h)be the center coordinate, width and height of the groundtruth box, and (x∗, y∗, w∗, h∗) be the predicted box. Theparametrization of the bounding boxes are:
B = (tx, ty, tw, th) = (x−xa
aw, y−ya
ah, log w
ah, log h
ah),
B∗ = (t∗x, t∗y, t
∗w, t
∗h) = (x
∗−xa
aw, y∗−ya
ah, log w∗
aw, log h∗
ah)
where the anchor size (aw, ah) is defined in (6) and thisanchor is used on each spatial location on the feature map.
(xa, ya) is the center of each anchor. R(F ;θreg) (in Eq.7)corresponds to B∗, which is the parametrized predictedbbox, and B is the parameterized ground truth bbox. Aftergetting R(F ;θreg), we compute the predicted object box(x∗, y∗, w∗, h∗).
References[1] Eunbyung Park and Alexander C. Berg. Meta-Tracker: Fast
and Robust Online Adaptation for Visual Object Trackers. InECCV, 2018. 2
[2] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.Faster R-CNN: Towards Real-Time Object Detection with Re-gion Proposal Networks. In NeurIPS, 2015. 2
0 5000
0.050.1
up
da
te lo
ss Basketball
0 50 1000
0.050.1
Biker
0 200 4000
0.5Bird1
0 500
0.050.1
Bird2
0 2000
0.20.4
BlurBody
0 5000
0.050.1
BlurCar1
0 5000
0.020.04
BlurCar2
0 2000
0.05BlurCar3
0 2000
0.050.1
up
da
te lo
ss BlurCar4
0 200 4000
0.050.1
BlurFace
0 5000
0.20.4
BlurOwl
0 5000
0.050.1
Board
0 2000
0.10.2
Bolt
0 100 2000
0.10.2
Bolt2
0 500 10000
0.050.1
Box
0 5000
0.10.2
Boy
0 500 10000
0.20.4
up
da
te lo
ss Car1
0 5000
0.10.2
Car2
0 20000
0.20.4
Car24
0 5000
0.050.1
Car4
0 2000
0.10.2
CarDark
0 100 2000
0.20.4
CarScale
0 200 4000
0.20.4
ClifBar
0 100 2000
0.10.2
Coke
0 50 1000
0.10.2
up
da
te lo
ss Couple
0 2000
0.050.1
Coupon
0 50 1000
0.20.4
Crossing
0 2000
0.20.4
Crowds
0 100 2000
0.10.2
Dancer
0 50 1000
0.20.4
Dancer2
0 200 4000
0.050.1
David
0 5000
0.10.2
David2
0 100 2000
0.050.1
up
da
te lo
ss David3
0 500
0.10.2
Deer
0 100 2000
0.10.2
Diving
0 50 1000
0.10.2
Dog
0 10000
0.10.2
Dog1
0 20000
0.20.4
Doll
0 50 1000
0.050.1
DragonBaby
0 500 10000
0.050.1
Dudek
0 5000
0.20.4
up
da
te lo
ss FaceOcc1
0 5000
0.050.1
FaceOcc2
0 200 4000
0.10.2
Fish
0 5000
0.10.2
FleetFace
0 2000
0.10.2
Football
0 500
0.050.1
Football1
0 2000
0.20.4
Freeman1
0 200 4000
0.51
Freeman3
0 100 2000
0.5
up
da
te lo
ss Freeman4
0 200 4000
0.20.4
Girl
0 10000
0.20.4
Girl2
0 5000
0.10.2
Gym
0 500 10000
0.20.4
Human2
0 10000
0.20.4
Human3
0 5000
0.20.4
Human4
0 5000
0.51
Human5
0 5000
0.5
up
da
te lo
ss Human6
0 100 2000
0.020.04
Human7
0 50 1000
0.050.1
Human8
0 100 200 3000
0.050.1
Human9
0 1000
0.050.1
Ironman
0 2000
0.10.2
Jogging-1
0 2000
0.10.2
Jogging-2
0 50 1000
0.20.4
Jump
0 2000
0.20.4
up
da
te lo
ss Jumping
0 500
0.10.2
KiteSurf
0 10000
0.050.1
Lemming
0 10000
0.10.2
Liquor
0 50 1000
0.10.2
Man
0 500
0.20.4
Matrix
0 10000
0.10.2
Mhyang
0 1000
0.050.1
MotorRolling
0 100 2000
0.10.2
up
da
te lo
ss MountainBike
0 5000
0.20.4
Panda
0 10000
0.20.4
RedTeam
0 10000
0.050.1
Rubik
0 2000
0.050.1
Shaking
0 2000
0.20.4
Singer1
0 2000
0.20.4
Singer2
0 1000
0.20.4
Skater
0 200 4000
0.20.4
up
da
te lo
ss Skater2
0 2000
0.10.2
Skating1
0 200 4000
0.20.4
Skating2-1
0 200 4000
0.10.2
Skating2-2
0 500
0.20.4
Skiing
0 2000
0.10.2
Soccer
0 1000
0.20.4
Subway
0 2000
0.20.4
Surfer
0 5000
0.20.4
up
da
te lo
ss Suv
0 10000
0.10.2
Sylvester
0 2000
0.050.1
Tiger1
0 2000
0.050.1
Tiger2
0 100 200frame number
00.10.2
Toy
0 50 100frame number
00.10.2
Trans
0 500frame number
00.05
0.1Trellis
0 200 400frame number
00.05
0.1Twinnings
0 100 200frame number
00.20.4
up
da
te lo
ss Vase
0 200 400frame number
00.05
0.1Walking
0 200 400frame number
00.05
0.1Walking2
0 500frame number
00.20.4
Woman
ROAMSGD
Figure 2. Update loss comparison between ROAM and SGD.
0 5000
0.2
0.4
me
ta lo
ss Basketball
0 50 1000
0.5
1Biker
0 200 4000
0.5
1Bird1
0 500
0.2
0.4Bird2
0 2000
0.5BlurBody
0 5000
0.2
0.4BlurCar1
0 5000
0.1
0.2BlurCar2
0 2000
0.1
0.2BlurCar3
0 2000
0.1
0.2
me
ta lo
ss BlurCar4
0 200 4000
0.2
0.4BlurFace
0 5000
0.5
1BlurOwl
0 5000
0.2
0.4Board
0 2000
0.2
0.4Bolt
0 100 2000
0.2
0.4Bolt2
0 500 10000
0.2
0.4Box
0 5000
0.2
0.4Boy
0 500 10000
0.2
0.4
me
ta lo
ss Car1
0 5000
0.2
0.4Car2
0 20000
0.5Car24
0 5000
0.1
0.2Car4
0 2000
0.2
0.4CarDark
0 100 2000
0.2
0.4CarScale
0 200 4000
0.5ClifBar
0 100 2000
0.5Coke
0 50 1000
0.2
0.4
me
ta lo
ss Couple
0 2000
0.2
0.4Coupon
0 50 1000
0.2
0.4Crossing
0 2000
0.5Crowds
0 100 2000
0.2
0.4Dancer
0 50 1000
0.2
0.4Dancer2
0 200 4000
0.1
0.2David
0 5000
0.2
0.4David2
0 100 2000
0.2
0.4
me
ta lo
ss David3
0 500
0.2
0.4Deer
0 100 2000
0.5Diving
0 50 1000
0.2
0.4Dog
0 10000
0.2
0.4Dog1
0 20000
0.5Doll
0 50 1000
0.2
0.4DragonBaby
0 500 10000
0.2
0.4Dudek
0 5000
0.2
0.4
me
ta lo
ss FaceOcc1
0 5000
0.1
0.2FaceOcc2
0 200 4000
0.2
0.4Fish
0 5000
0.2
0.4FleetFace
0 2000
0.5Football
0 500
0.2
0.4Football1
0 2000
0.5
1Freeman1
0 200 4000
0.5
1Freeman3
0 100 2000
0.5
1
me
ta lo
ss Freeman4
0 200 4000
0.2
0.4Girl
0 10000
0.5Girl2
0 5000
0.2
0.4Gym
0 500 10000
0.5Human2
0 10000
0.2
0.4Human3
0 5000
0.2
0.4Human4
0 5000
0.5
1Human5
0 5000
0.5
1
me
ta lo
ss Human6
0 100 2000
0.2
0.4Human7
0 50 1000
0.2
0.4Human8
0 2000
0.2
0.4Human9
0 1000
0.2
0.4Ironman
0 2000
0.2
0.4Jogging-1
0 2000
0.2
0.4Jogging-2
0 50 1000
0.5Jump
0 2000
0.2
0.4
me
ta lo
ss Jumping
0 500
0.2
0.4KiteSurf
0 10000
0.2
0.4Lemming
0 10000
0.2
0.4Liquor
0 50 1000
0.2
0.4Man
0 500
0.5Matrix
0 10000
0.2
0.4Mhyang
0 1000
0.2
0.4MotorRolling
0 100 2000
0.2
0.4
me
ta lo
ss MountainBike
0 5000
0.5Panda
0 10000
0.5RedTeam
0 10000
0.2
0.4Rubik
0 2000
0.2
0.4Shaking
0 2000
0.2
0.4Singer1
0 2000
0.2
0.4Singer2
0 1000
0.2
0.4Skater
0 200 4000
0.2
0.4
me
ta lo
ss Skater2
0 2000
0.2
0.4Skating1
0 200 4000
0.2
0.4Skating2-1
0 200 4000
0.2
0.4Skating2-2
0 500
0.5Skiing
0 2000
0.5Soccer
0 1000
0.2
0.4Subway
0 2000
0.5Surfer
0 5000
0.5
me
ta lo
ss Suv
0 10000
0.2
0.4Sylvester
0 2000
0.2
0.4Tiger1
0 2000
0.2
0.4Tiger2
0 100 200
frame number
0
0.2
0.4Toy
0 50 100
frame number
0
0.2
0.4Trans
0 500
frame number
0
0.2
0.4Trellis
0 200 400
frame number
0
0.2
0.4Twinnings
0 100 200
frame number
0
0.2
0.4
me
ta lo
ss Vase
0 200 400
frame number
0
0.2
0.4Walking
0 200 400
frame number
0
0.2
0.4Walking2
0 500
frame number
0
0.5
1Woman
ROAM
SGD
Figure 3. Meta loss comparison between ROAM and SGD.