
Neurocomputing 214 (2016) 796–804


Beyond appearance model: Learning appearance variations for object tracking

Guorong Li a,b,c, Bingpeng Ma a,b,c,*, Jun Huang a, Qingming Huang a,b,c, Weigang Zhang d

a School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 100190, China
b Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing 100080, China
c Key Laboratory of Big Data Mining and Knowledge Management, CAS, China
d School of Computer Science and Technology, Harbin Institute of Technology, China

Article info

Article history:
Received 26 January 2016
Received in revised form 25 May 2016
Accepted 24 June 2016
Communicated by Xiang Bai
Available online 20 July 2016

Keywords:
Object tracking
Appearance model
Appearance prediction


Abstract

In this paper, we present a novel appearance variation prediction model which can be embedded into existing generative appearance model based tracking frameworks. Different from existing works, which learn the appearance model online from the obtained tracking results, we propose to predict the appearance reconstruction error. We notice that although the learned appearance model can precisely describe the target in previous frames, the tracking result is still not accurate if, in the following frame, the patch most similar to the appearance model is assumed to be the target. We first investigate this phenomenon by conducting experiments on two public sequences and discover that in most cases the best target is not the one with minimal reconstruction error. We then design three kinds of features which encode the motion, appearance, and appearance reconstruction error of the target's surrounding image patches, and which capture potential factors that may cause variations of the target's appearance as well as of its reconstruction error. Finally, with these features, we learn an effective random forest for predicting the reconstruction error of the target during tracking. Experiments on various datasets demonstrate that the proposed method can be combined with many existing trackers and improve their performance significantly.

© 2016 Elsevier B.V. All rights reserved.

1. Introduction

Object tracking has been a hot research topic for decades and plays a critical role in many applications such as visual surveillance, activity analysis, and intelligent human-computer interaction. It aims to locate the target in every frame. Appearance and motion are the two key cues commonly used to infer the target's state. Over the past years, many works [20,7,17,12,26,10,14,9,28,8,23] have been proposed to build effective appearance models or accurate motion models for visual target tracking, and they achieve good performance on various test videos. Meanwhile, some works [29,28] try to integrate multiple features to improve tracking performance. However, it is still challenging to build an effective appearance model for the target because of its motion, varying scenes, illumination changes, occlusion, etc.

Tracking methods focusing on the target's appearance can be divided into two categories: generative appearance models [18,17,22]


and discriminative appearance models [6,11,3,9,4]. In generative methods, the target's appearance space is learned and object tracking is formulated as finding the image patch that is most similar to the target's model. In [18], the principal space of the tracked target is learned during tracking with samples generated according to the tracking results. Then the unlabeled image patch with the minimum reconstruction error is assumed to be the target. In [17], sparse representation is introduced into tracking. Static and dynamic templates are learned, and candidates are sparsely represented by them through solving L1-regularized least squares problems. The target candidate with the lowest representation error is taken as the final tracking result. Because of the impressive experimental results of sparse representation, many extensions such as [13,25,26,5,30] have been proposed to further improve tracking performance. In discriminative methods, with the target considered as the positive class and the background as the negative class, target tracking is transformed into a classification problem. Many classical classification methods, such as AdaBoost [2], multi-instance learning [3], and structured SVM [9], are adopted to distinguish the tracked target from the background. During tracking, to adapt to variations, the classifiers are updated online, and the target candidate with the


highest confidence score is usually taken as the target. However, updating with inaccurate results can cause "drift". To deal with this problem, many semi-supervised methods [3,16,27] have been proposed to mine information from the unlabeled samples collected in the next frame. In [16], assuming that the target's and background's data distributions differ in every frame, an effective transfer method is developed to learn the classifier for the tracker. However, how the reconstruction error changes and affects tracking performance has not been discussed. Intuitively, the chance that the true target is the optimal candidate according to the learned generative or discriminative model is small. The main reason is that the target and background usually keep changing. Although the learned generative model represents the obtained samples well, this does not mean that the target in the following frame can also be well represented by it. We design an interesting test: in Incremental Visual Tracking (IVT), when searching for the target in the next frame, the true position of the target is added to the state space of candidates, but IVT still cannot find the true position. The minimum reconstruction error and that of the ground-truth target are shown in Fig. 1. Apparently, the ground-truth target is not the one with the minimum reconstruction error. Meanwhile, recent tracking methods [21,15] try to exploit information outside the test video, learned offline before tracking. Inspired by these observations, we investigate the influence of the reconstruction error of the target's appearance on the tracker's performance, and we predict the reconstruction error for generative methods through offline training and online updating. Then the candidate whose reconstruction error is closest to our predicted error is considered to be the target. Because of the competitive performance of IVT, we combine our appearance change predictor with it to show how much our reconstruction error predictor can help generative trackers improve their performance.

To the best of our knowledge, the idea of predicting the reconstruction error to aid visual object tracking, which is the main contribution of this paper, has not been studied before. Meanwhile, experiments are designed to prove its reasonableness. Furthermore, we propose an effective online random forest based reconstruction error prediction method, which can be integrated into the IVT framework and greatly improve its performance. Compared to the existing literature, the main contributions of this paper are summarized as follows.

• We first investigate the correlation between the reconstruction error and the best target candidate.

• Three kinds of features are elaborately designed for predicting the variations of the reconstruction error effectively.

• A random forest is used for regression of the reconstruction error and is further embedded into existing trackers to improve their performance.

The rest of this paper is organized as follows. In Section 2, we briefly review the classical generative tracking method IVT and provide some interesting test results. In Section 3, we describe in detail how to predict the target's appearance changes with an online random forest for IVT. Experimental results and analysis on various datasets are reported in Section 4. Finally, a short conclusion is given in Section 5.

Fig. 1. An example displaying the reconstruction error of the ground truth and that of the tracking result obtained by IVT [18] (error in pixels vs. frame number).

2. The IVT tracker

Let $x_t$ and $y_t$ denote the observation and the state of the target in frame $f_t$. The IVT tracker [18] learns the principal-space representation of the target (also referred to as the target template) by minimizing the objective function in Eq. (1). Then, as shown by Eq. (2), the target candidate, denoted by $\tilde{x}_{t+1}$, with the minimum reconstruction error is regarded as the target. Considering the target's variations and computation speed, an incremental algorithm is proposed for fast updating of the low-dimensional principal space. Experiments show that online updating has a certain effect under illumination and pose variation.

$$\min_{V_t} \|X_t - V_t V_t^{T} X_t\|_F^2 \quad (1)$$

where $X_t = \{x_1, x_2, \ldots, x_t\}$ and $V_t$ is the orthogonal basis matrix representing the low-dimensional principal space.

$$\min_{\tilde{x}_{t+1}} \|\tilde{x}_{t+1} - V_t V_t^{T} \tilde{x}_{t+1}\|_F^2 \quad (2)$$

In IVT, the reconstruction error is assumed to follow a Gaussian distribution with small variance. In [20], it is pointed out that this assumption does not hold when outliers exist, and the model is improved by considering outliers explicitly. However, when the target and background change, we found that the representation space pre-learned from $\{x_1, \ldots, x_t\}$ is not optimal for describing the target $x_{t+1}$, which is demonstrated through the experiments designed in the following subsection.
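For concreteness, the minimal sketch below (Python/NumPy, not the authors' MATLAB implementation) illustrates the decision rule of Eqs. (1) and (2): a principal subspace is fit to the past observations and the candidate with the minimal reconstruction error is selected. IVT's incremental subspace update and mean handling are simplified here, and the function names are ours.

```python
import numpy as np

# Simplified IVT-style decision rule (Eqs. (1)-(2)): fit a principal subspace
# to past target observations, then score candidates by reconstruction error.
# IVT itself updates the subspace incrementally; a batch SVD is used instead.

def learn_subspace(X, k):
    """X: d x t matrix whose columns are vectorized target observations x_1..x_t."""
    Xc = X - X.mean(axis=1, keepdims=True)            # center the observations
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)  # left singular vectors
    return U[:, :k]                                    # orthogonal basis V_t (d x k)

def reconstruction_error(V, x):
    """Squared norm of x - V V^T x, i.e. the residual outside the subspace."""
    r = x - V @ (V.T @ x)
    return float(np.sum(r ** 2))

def ivt_select(V, candidates):
    """Return the index of the candidate with minimal reconstruction error."""
    errors = [reconstruction_error(V, c) for c in candidates]
    return int(np.argmin(errors)), errors
```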

2.1. Some interesting investigations

It is known that when performing IVT, target candidates are generated according to a pre-defined motion model, and then the candidate with the smallest reconstruction error is selected as the target. So two factors impact the tracking result: the motion model and the decision rule. If the target candidates do not cover the true position of the target, the tracking result apparently will not be the ground truth. As we want to analyze how the appearance change affects tracking performance, we add the true position to the state space for sampling target candidates, and in every frame we use the true target to update the generative model of the target. In other words, the target candidates (which include the true target) and the target model are not polluted by inaccurate tracking results.

Fig. 2. Experimental results showing whether the ground truth of the target is the candidate with minimal reconstruction error. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


Fig. 3. The proposed features of two frames in "Deer". (a) and (f): original frame; (b) and (g): histogram of the appearance template, reshaped into a 4×4 matrix for display; (c) and (h): reconstruction error; (d) and (i): mean optical flow; (e) and (j): relative motion.

Fig. 4. Average prediction error of the random forest with different parameter values: number of blocks $N_b \times N_b$, number of random trees ($T$), and maximum depth of every random tree ($maxDepth$). Further panels show the predicted vs. ground-truth reconstruction error and histograms of the prediction error on the training data and on "Deer".


Therefore, the tracking result in each frame is only influenced by IVT's decision rule: the candidate with the minimal reconstruction error is the true target. Results on two public sequences, "Jumping" and "Deer", are shown in Fig. 2. The green stars denote frames in which the candidate with the smallest reconstruction error is not the ground truth. We can see that for "Jumping" the proportion of frames where the true target is not the best according to IVT is nearly 50%. For "Deer", in 73% of the frames the reconstruction error of the true target is not the lowest.

These unsatisfactory results demonstrate that the target and background keep changing, which seriously degrades tracking performance. Therefore, taking the candidate that fits the generative model most accurately as the target may not be the best choice. Intuitively, predicting the reconstruction error of the target and then finding the target candidate with the predicted reconstruction error could improve tracking performance.

3. Online random forest for appearance change prediction

To predict the reconstruction error, we first define several kinds of features based on optical flow, the appearance template, and the reconstruction error. Then we collect training examples from videos and train a random forest predictor, which is further updated online during tracking.

3.1. Features used for appearance variation prediction

Intuitively, the target's appearance is mainly affected by its surroundings and motion. Therefore, similar to [15], we extract features from image patches surrounding the target. When we obtain the tracking result $y_t$ in the current frame $F_t$, features are extracted from the image patch centered at $y_t$ with size $2\cdot[w_t, h_t]$. Then the patch is divided into $N_b \times N_b$ blocks. Noticing that the target's appearance varies mainly because of illumination, motion, and


occlusion, we choose features that can reflect this information.

(1) Optical flow: Generally, the angle of the optical flow reflects the moving direction and provides the view from which the camera recorded the target. Surroundings moving in the same direction at high speed, or moving towards the target, may occlude the target. To capture this kind of information, we calculate the average optical flow of each block over $N_f$ frames. Let $MOF_t(b_{i,j})$ denote the average optical flow of the block at the $i$th row and $j$th column. Meanwhile, to describe potential occlusion, we introduce the feature "Relative Motion" (RM), defined by Eq. (3), which reflects the relative motion between the surrounding blocks and the central block.

$$RM_t(b_{i,j}) = \begin{cases} 1 & \text{if } \angle\big(ROF_t(b_{i,j}), \Delta p_t(b_{i,j})\big) \le \pi/6 \\ 0 & \text{if } \cos\big(ROF_t(b_{i,j}), \Delta p_t(b_{i,j})\big) > 0 \\ -1 & \text{if } \cos\big(ROF_t(b_{i,j}), \Delta p_t(b_{i,j})\big) \le 0 \end{cases} \quad (3)$$

where $c = N_b/2$, $p_t(b_{i,j})$ represents the position of $b_{i,j}$ in frame $f_t$, $ROF_t(b_{i,j}) = OF_t(b_{i,j}) - OF_t(b_{c,c})$ denotes the relative optical flow, and

$\Delta p_t(b_{i,j}) = p_t(b_{c,c}) - p_t(b_{i,j})$ is the relative position.

(2) Appearance feature: Intuitively, the appearance feature used to describe each block depends on the tracker. For example, IVT and L1-tracking use the gray values of image pixels to represent object regions. In order to combine with IVT, in every frame the appearance template $AT_t(b_{i,j})$ of each block is composed of the gray values of its pixels. The histogram of the appearance template of each block, $HAF_t(b_{i,j})$, is computed. We then average these over time to obtain the appearance descriptor of each block in $f_t$, denoted by $MAF_t(b_{i,j})$. Since we want to predict the reconstruction error of the target in the $(t+1)$th frame, the reconstruction errors of each block in the previous $N_f$ frames are collected and averaged over time to obtain $RE_t(b_{i,j})$.

(3) Difference of appearance feature: Usually an object's appearance changes smoothly, so we calculate the differences of the appearance template and of the reconstruction error of each block between two consecutive frames. They are represented by $\{AT_t(b_{i,j}) - AT_{t-1}(b_{i,j}), \ldots, AT_{t-N_f+1}(b_{i,j}) - AT_{t-N_f}(b_{i,j})\}$ and $\{RE_t(b_{i,j}) - RE_{t-1}(b_{i,j}), \ldots, RE_{t-N_f+1}(b_{i,j}) - RE_{t-N_f}(b_{i,j})\}$, respectively. We take the mean of these differences as new feature descriptors.

Fig. 3 shows the features for two example frames of "Deer". Fig. 3(d) and (i) show the MOF of the targets, and we can see that MOF reflects the motion of the targets; if the target moves faster, the appearance usually changes more. From RM, we can tell which blocks will get closer to the target and infer potential occlusions. In Fig. 3(f), the deer is jogging down and spray rises from the surface of the water, occluding the chin of the deer. Fig. 3(e) and (j) display RM; from (e) and (f) we can see that the blocks at the lower left of the center are getting closer to the central block and may bring occlusion on the tracked target.
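As a rough illustration of the RM feature, the sketch below computes Eq. (3) for a single block from its mean optical flow and its position relative to the central block; the π/6 and cosine-sign thresholds follow the reconstruction of Eq. (3) above, and the function and argument names are ours.

```python
import numpy as np

def relative_motion(block_flow, block_pos, center_flow, center_pos):
    """RM of Eq. (3): 1 if the block moves towards the central block (potential
    occlusion), 0 if it moves roughly sideways, -1 if it moves away.
    block_flow/center_flow: mean optical-flow vectors (MOF) of the block and
    the central block; block_pos/center_pos: their positions in the frame."""
    rof = np.asarray(block_flow) - np.asarray(center_flow)   # relative optical flow
    dp = np.asarray(center_pos) - np.asarray(block_pos)      # relative position
    denom = np.linalg.norm(rof) * np.linalg.norm(dp)
    if denom == 0:
        return 0                                             # degenerate: no relative motion
    cos_angle = float(np.dot(rof, dp) / denom)
    angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    if angle <= np.pi / 6:
        return 1
    return 0 if cos_angle > 0 else -1
```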

Algorithm 1. The procedure for online updating of the random forest.


Fig. 5. Tracking results (center error vs. frame number) on "Tiger2" with training video "Liquor", "Walking2", and both videos, respectively.


3.2. Online random forest

Given training samples $\{\langle f_i, r_i\rangle\}_{i=1}^{N_l}$ represented by the proposed features, we need to train a predictor $F: \mathcal{F} \to \mathbb{R}$ that can predict the reconstruction error (denoted by $r$) of the target. Here $\mathcal{F}$ denotes the feature space and $\mathbb{R}$ the set of real values. Meanwhile, during tracking, the pre-learned regressor should be updated to adapt to each video. Because of its good performance and fast online updating algorithm, we select the online random forest for regression.
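The sketch below shows the offline part of this regression step; scikit-learn's RandomForestRegressor is used as a stand-in for the online random forest of [19], and the Poisson-based online update of Algorithm 1 is not reproduced. Feature vectors are assumed to concatenate the block-wise descriptors of Section 3.1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_error_predictor(features, errors, n_trees=40, max_depth=60):
    """features: N x D array of the proposed descriptors (one row per frame);
    errors: N reconstruction errors of the target observed in the next frame."""
    forest = RandomForestRegressor(n_estimators=n_trees, max_depth=max_depth)
    forest.fit(features, errors)
    return forest

def predict_error(forest, feature_vec):
    """Predict the target's reconstruction error from one feature vector."""
    return float(forest.predict(np.asarray(feature_vec).reshape(1, -1))[0])
```

During tracking, the online variant of [19] would instead grow and split trees incrementally as new (feature, error) pairs arrive, rather than refitting in batch.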

A classical random forest is composed of a set of random binary trees, each of which is a classifier or regressor trained with samples randomly selected from the original training set. Every node of a tree is either a leaf or an internal node, and every node has exactly one parent except for the root. The root and internal nodes are each associated with a randomly generated threshold on a randomly pre-selected feature, according to which the samples arriving at that node are divided into two disjoint child nodes, left and right. Each leaf $L_{T,i}$ stores the probability distribution of the reconstruction error of targets, $p(r \mid L_{T,i})$. Usually, a random forest is trained offline, which requires the training set to be provided in advance and greatly limits its applications. Modeling the sequential arrival of data with a Poisson distribution, an online training method for random forests was proposed in [19]. Due to its fast computation speed, low memory requirements, and comparable performance, we adopt the online random forest [19] for regression of the reconstruction error.

For each node $node_{T,i}$, we estimate the distribution of the reconstruction error $p(r \mid node_{T,i})$ from the training samples it has seen, and store it. When the node needs to be split, we evaluate the information gain of the randomly generated thresholds and select the best one. Meanwhile, the distribution can be inherited by the child nodes without the need to store training samples. Algorithm 1 depicts the online random forest updating algorithm. The detailed procedure of our object tracking method, with the reconstruction error predicted by the online random forest, is provided in Algorithm 2.

Algorithm 2. Object tracking with reconstruction error predicted by online random forest.
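Algorithm 2 is given as a figure in the original paper; the schematic per-frame step below (a sketch under our naming, not the algorithm verbatim) shows the essential change with respect to IVT: the candidate whose reconstruction error is closest to the predicted value, rather than the minimal one, is selected.

```python
import numpy as np

def orep_select(V, candidates, predicted_error):
    """V: current principal basis; candidates: vectorized candidate patches;
    predicted_error: output of the (online-updated) random forest regressor."""
    errors = np.array([float(np.sum((c - V @ (V.T @ c)) ** 2)) for c in candidates])
    return int(np.argmin(np.abs(errors - predicted_error))), errors
```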


Fig. 6. Examples of tracking results of OREP (red rectangle), IVT (green rectangle), and L1APG (magenta rectangle) on several test videos. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


4. Experiments

The proposed tracker based on online random forest reconstruction error prediction is implemented in MATLAB 2009b on a PC with a 3.10 GHz CPU and 4 GB RAM. It is embedded into the IVT framework to form our tracker, referred to as Online Reconstruction Error Prediction ("OREP"). To determine the parameter values, in this section we first conduct experiments to test the performance of the offline-learned random forest with different parameters.


Fig. 7. Quantitative comparisons of tracking results on 51 test videos obtained by OREP and IVT.

Fig. 8. Some failure examples of tracking results of OREP (red rectangle). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


Then, to evaluate the tracking performance, we evaluate our tracker on the public sequences described in [24] against two classical generative trackers: IVT [18] and L1APG [1].

4.1. Appearance reconstruction error prediction

We extract training samples from the "Jumping", "Tiger2", and "Walking2" sequences. Then 5-fold cross-validation is used to evaluate each parameter. If the number of blocks is small, the information of many pixels is integrated into one descriptor, which loses some detail. However, if we use more blocks, each block usually carries little semantics and the generated feature descriptor is low-level. As shown in Fig. 4(a), the regression result of the random forest reaches an optimum at $N_b = 5$, $N_f = 5$.

We adopt $N_b = 5$, and Fig. 4(b) displays the results with different settings of the other parameters. Apparently, the average prediction error decreases first and then increases as $maxDepth$ becomes larger. With $maxDepth$ fixed, the influence of $T$ shows no clear regularity. However, we notice that when $maxDepth$ is around 50, the prediction results are relatively good. Fig. 4(c) and (d) show the prediction results for $T = 40$, $maxDepth = 60$ and the histogram of the reconstruction error estimation error, respectively. As a trade-off between computational complexity and tracking performance, we set $T = 40$ and $maxDepth = 60$ for tracking.

Meanwhile, to test the generality of the offline-learned predictor on other videos, we test the learned random forest on samples collected from "Deer". The results are shown in Fig. 4(e). We can see that the offline-learned knowledge can be transferred to


Fig. 9. Comparison results with the trackers in [24] on videos with the attributes "occlusion", "motion blur", and "fast motion" (success plots of OPE; OREP obtains scores of 0.436, 0.355, and 0.363, respectively).


other videos. However, comparing Fig. 4(d) with Fig. 4(e), it is apparent that if the samples used for training and testing come from the same video, the random forest yields better predictions. However, it is impossible to extract training examples from all videos. Besides, usually only the initialization in the first frame is provided, so the number of labeled samples we can collect from the test video is very small. As a result, we propose to combine offline training and online updating: the random forest is trained with labeled samples collected from several videos, and during tracking the offline-learned random forest is further fine-tuned with samples obtained from the tracking results.

To test the effect of the training video, we perform tracking on "Tiger2" with different training videos; the results are shown in Fig. 5. In the video "Liquor", a bottle simply moves and there are two similar bottles around the tracked bottle. The background is simple, with no occlusions or motion blur. Compared to "Liquor", the video "Walking2" is recorded in a real scenario, where a woman is walking in a mall aisle and the background is more complex. In the video "Tiger2", a tiger toy is moving against a complex background. So the tracking performance on "Tiger2" with the random forest trained on "Walking2" is better than that obtained with "Liquor" as the training video. We can also see that if trained with both videos, the tracking result is further improved. Generally, if the training videos contain challenges such as lighting variations and complex backgrounds that are similar to those in the test videos, the tracking results will improve greatly; more training videos with different challenges lead to better tracking results. In our experiments, we randomly select 80 percent of the samples collected from the videos "Tiger2", "Jumping", and "Walking2" and use them as the training set.

4.2. Visual target tracking

For all test videos, we set $T = maxDepth = 100$, $N_{train} = 20$, $\alpha = 0.1$, and $N_b = 5$. Fig. 6 shows the qualitative results on

several sequences, including "Bolt", "Boy" [31], "Jumping", "Liquor", "Matrix", "Singer1", and "Tiger2". In the first three sequences, fast motion and motion blur bring great challenges to trackers. Using the optical-flow information, our online random forest can effectively predict the reconstruction error of the target, and OREP tracks the targets successfully. In the sequences "Liquor" and "Tiger1", in which partial occlusion occurs occasionally, OREP succeeds in tracking the targets by combining the motion and appearance features of the surrounding image, thanks to the reconstruction error prediction procedure.

To compare the results quantitatively, the overlap score, defined as the overlap between the ground truth and the tracking result, is calculated. The frames in which the overlap score of the tracking result is larger than a given threshold are regarded as success frames.

The ratio of success frames is defined as the Success Rate ("SR"). In [24], the SR under different overlap-score thresholds is plotted for fair comparison. We run our tracker on the test sequences, initialized from the ground truth in the first frame, and compute the average SR for comparison; this is referred to as One Pass Evaluation (OPE). In Fig. 7, the average SR of OPE over all test videos, and over videos with the different attributes described in [24], are plotted. From Fig. 7(a), we can see that on the 51 test videos the average SR of OREP under any threshold value is higher than that of IVT and L1APG. Fig. 7(b)-(l) provide results on video subsets with 11 different attributes. Apparently, OREP performs much better than IVT and L1APG when illumination variation, fast motion, partial occlusion, or out-of-view occurs in the test video. This is mainly because the proposed features for reconstruction error prediction reflect motion, illumination, and surrounding background appearance. When dealing with low-resolution videos, the target rectangle contains too many background pixels, which leads to an inaccurate target appearance model and wrong updating; in this case, OREP cannot make effective predictions and yields unsatisfactory results, as shown in Fig. 8(a). When the tracked target rotates in the image plane, the view of the target keeps changing; since we do not exploit target templates from different views, OREP brings only a small improvement, as shown in Fig. 8(a)-(c). Moreover, from Fig. 8(d)-(f), we can see that when the target is completely occluded by other objects for a long time, it is hard for OREP to recover the target when it reappears. This is because when occlusion exists, both the template and the reconstruction error predictor are wrongly updated with the tracking results, which accelerates the accumulation of error.
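For reference, the sketch below shows how the overlap score and success rate used in the OPE curves are typically computed; bounding boxes are assumed to be (x, y, w, h) tuples, and the helper names are ours.

```python
def overlap_score(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) bounding boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # intersection width
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # intersection height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rate(results, ground_truth, threshold):
    """Fraction of frames whose overlap score exceeds the given threshold."""
    scores = [overlap_score(r, g) for r, g in zip(results, ground_truth)]
    return sum(s > threshold for s in scores) / float(len(scores))
```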

We also compare our results with those of the other state-of-the-art trackers provided in [24]. As shown in Fig. 9, when dealing with videos with "occlusion", "motion blur", and "fast motion", OREP ranks 2nd, 5th, and 4th, respectively. However, OREP ranks only 6th when evaluated with the average SR of OPE on all 51 test videos. This is because OREP is based on a simple global appearance model. In the future, we will combine the reconstruction error predictor with trackers based on local appearance models to further improve tracking performance.

5. Conclusion

In this paper, we propose to predict the reconstruction error of the target's appearance template during tracking. With well-designed surrounding features and training samples collected offline, an effective random forest is learned for predicting the reconstruction error of the target's appearance template. By updating the random forest predictor online with the tracking results, it becomes specific to each video. With the help of the proposed reconstruction error predictor, OREP improves


tracking performance on the test videos significantly compared with IVT. The method is general and can be applied to other trackers based on generative appearance models by adopting suitable appearance features.

Acknowledgments

This work was supported in part by the National Basic Research Program of China (973 Program): 2012CB316400 and 2015CB351802, and by the National Natural Science Foundation of China: 61303153, 61332016 and 61572465.

References

[1] S. Bao, H. Ling, H. Ji, Real time robust L1 tracker using accelerated proximal gradient approach, in: CVPR, 2012.
[2] S. Avidan, Ensemble tracking, in: CVPR, 2005.
[3] B. Babenko, M.H. Yang, S. Belongie, Visual tracking with online multiple instance learning, in: CVPR, 2009.
[4] Q. Bai, Z. Wu, S. Sclaroff, M. Betke, C. Monnier, Randomized ensemble tracking, in: ICCV, 2013.
[5] S. Zhang, H. Yao, H. Zhou, X. Sun, S. Liu, Robust visual tracking based on online learning sparse representation, Neurocomputing 100 (2013) 31-40.
[6] R. Collins, Y. Liu, M. Leordeanu, On-line feature selection of discriminative tracking features, IEEE Trans. Pattern Anal. Mach. Intell. 27 (10) (2005) 1631-1643.
[7] M. Danelljan, F.S. Khan, M. Felsberg, J.V. de Weijer, Adaptive color attributes for real-time visual tracking, in: CVPR, 2014.
[8] G. Wu, W. Lu, G. Gao, C. Zhao, J. Liu, Regional deep learning model for visual tracking, Neurocomputing 175 (2016) 310-323.
[9] S. Hare, A. Saffari, P.H. Torr, Struck: structured output tracking with kernels, in: ICCV, 2011.
[10] S. He, Q. Yang, R. Lau, J. Wang, M.-H. Yang, Visual tracking via locality sensitive histograms, in: CVPR, 2013.
[11] Z. Kalal, J. Matas, K. Mikolajczyk, P-N learning: bootstrapping binary classifiers by structural constraints, in: CVPR, 2010.
[12] J. Kwon, K.M. Lee, Visual tracking decomposition, in: CVPR, 2010.
[13] X. Lan, P.C. Yuen, A.J. Ma, Multi-cue visual tracking using robust feature-level fusion based on joint sparse representation, in: CVPR, 2014.
[14] B. Zhong, Y. Chen, Y. Chen, Z. Cui, R. Ji, X. Yuan, D. Chen, W. Chen, Robust tracking via patch-based appearance model and local background estimation, Neurocomputing 123 (2014) 344-353.
[15] L. Leal-Taixé, M. Fenzi, A. Kuznetsova, B. Rosenhahn, S. Savarese, Learning an image-based motion context for multiple people tracking, in: CVPR, 2014.
[16] G. Li, Q. Huang, L. Qin, S. Jiang, SSOCBT: a robust semisupervised online covboost tracker that uses samples differently, IEEE Trans. Circuits Syst. Video Technol. 23 (4) (2013) 695-709.
[17] X. Mei, H. Ling, Robust visual tracking and vehicle classification via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 33 (11) (2011) 2259-2272.
[18] D. Ross, J. Lim, R.-S. Lin, M.-H. Yang, Incremental learning for robust visual tracking, Int. J. Comput. Vis. (2007).
[19] A. Saffari, C. Leistner, J. Santner, M. Godec, H. Bischof, On-line random forests, in: ICCV Workshops, 2009.
[20] D. Wang, H. Lu, Visual tracking via probability continuous outlier model, in: CVPR, 2014.
[21] N. Wang, D.-Y. Yeung, Learning a deep compact image representation for visual tracking, in: NIPS, 2013, pp. 809-817.
[22] Q. Wang, F. Chen, W. Xu, M. Yang, Object tracking via partial least squares analysis, IEEE Trans. Image Process. 21 (10) (2012) 4454-4465.
[23] X. Bai, C. Rao, X. Wang, Shape vocabulary: a robust and efficient shape representation for shape matching, IEEE Trans. Image Process. 23 (9) (2014) 3935-3949.
[24] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: a benchmark, in: CVPR, 2013.
[25] K. Zhang, L. Zhang, M.-H. Yang, Real-time compressive tracking, in: ECCV, 2012, pp. 864-877.
[26] T. Zhang, B. Ghanem, S. Liu, N. Ahuja, Robust visual tracking via structured multi-task sparse learning, Int. J. Comput. Vis. 101 (2) (2013) 367-383.
[27] Z. Teng, T. Wang, F. Liu, D. Kang, C. Lang, S. Feng, From sample selection to model update: a robust online visual tracking algorithm against drifting, Neurocomputing 173 (2016) 1221-1234.
[28] J. Yoon, M. Yang, K. Yoon, Interacting multiview tracker, IEEE Trans. Pattern Anal. Mach. Intell. 38 (5) (2016) 903-917.
[29] Y. Zhou, X. Bai, L.J. Latecki, Similarity fusion for visual tracking, Int. J. Comput. Vis. (2016).
[30] G. Han, X. Wang, J. Liu, N. Sun, C. Wang, Robust object tracking based on local region sparse appearance model, Neurocomputing 183 (2016) 145-167.
[31] Y. Wu, B. Shen, H. Ling, Visual tracking via online nonnegative matrix factorization, IEEE Trans. Circuits Syst. Video Technol. 24 (3) (2014) 374-383.

Guorong Li received her B.S. degree in technology of computer application from Renmin University of China in 2006 and her Ph.D. degree in technology of computer application from the Graduate University of the Chinese Academy of Sciences in 2012.

She is now an associate professor at the Graduate University of the Chinese Academy of Sciences. Her research interests include object tracking, video analysis, pattern recognition, and cross-media analysis.

Bingpeng Ma received the B.S. degree in mechanics in 1998 and the M.S. degree in mathematics in 2003 from Huazhong University of Science and Technology. He received the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, P.R. China, in 2009. He was a post-doctoral researcher at the University of Caen, France, from 2011 to 2012. He joined the School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, in March 2013, where he is now an assistant professor. His research interests cover image analysis, pattern recognition, and computer vision.

Jun Huang received the M.S. degree in computer science from the Anhui University of Technology, Maanshan, China, in 2011. He is now a Ph.D. student in the School of Computer and Control Engineering, University of the Chinese Academy of Sciences (UCAS). Before joining UCAS, he was a lecturer at the Anhui University of Technology. His research interests include machine learning and data mining.

Qingming Huang received the B.S. degree in computer science and the Ph.D. degree in computer engineering from the Harbin Institute of Technology, Harbin, China, in 1988 and 1994, respectively.

He is currently a Professor with the University of the Chinese Academy of Sciences (CAS), Beijing, China, and an Adjunct Research Professor with the Institute of Computing Technology, CAS. He was supported by the China National Funds for Distinguished Young Scientists in 2010. He has authored or coauthored more than 200 academic papers in prestigious international journals and conferences. His research areas include multimedia video analysis, video adaptation, image processing, computer vision, and pattern recognition.

Dr. Huang is a reviewer for the IEEE Transactions on Multimedia and the IEEE Transactions on Circuits and Systems for Video Technology. He has served as a TPC member for well-known conferences, including ACM Multimedia, CVPR, ICCV, and ICME.

Weigang Zhang received the M.S. degree and Ph.D. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 2005 and 2016, respectively. He is now an assistant professor with the School of Computer, Harbin Institute of Technology at Weihai, Weihai, China. His research interests include image processing, video analysis, and pattern recognition.