
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 1, JANUARY 2011

Kernel Similarity Modeling of Texture Pattern Flow for Motion Detection in Complex Background

Baochang Zhang, Yongsheng Gao, Senior Member, IEEE, Sanqiang Zhao, and Bineng Zhong

Abstract—This paper proposes a novel kernel similarity modeling of texture pattern flow (KSM-TPF) for background modeling and motion detection in complex and dynamic environments. The texture pattern flow encodes the binary pattern changes in both spatial and temporal neighborhoods. The integral histogram of texture pattern flow is employed to extract the discriminative features from the input videos. Different from existing uniform-threshold-based motion detection approaches, which are only effective for simple backgrounds, the kernel similarity modeling is proposed to produce an adaptive threshold for complex backgrounds. The adaptive threshold is computed from the mean and variance of an extended Gaussian mixture model. The proposed KSM-TPF approach incorporates a machine learning method with a feature extraction method in a homogeneous way. Experimental results on publicly available video sequences demonstrate that the proposed approach provides an effective and efficient way for background modeling and motion detection.

Index Terms—Background modeling, background subtraction, kernel similarity modeling (KSM), motion detection, texture pattern flow (TPF).

I. Introduction

MOVING OBJECT detection from a video stream or image sequence is one of the main tasks in many computer vision applications such as people tracking, traffic monitoring, scene analysis, and video surveillance. To achieve high sensitivity in the detection of moving objects while maintaining low false alarm rates, the background image has to be either a stationary scene without moving objects or regularly updated in order to adapt to dynamic environments such as swaying trees, rippling water, flickering monitors, and shadows of moving objects.

Manuscript received December 1, 2009; revised May 18, 2010 and August 6, 2010; accepted August 19, 2010. Date of publication January 13, 2011; date of current version February 24, 2011. This work was supported in part by the Australian Research Council, under Discovery Grant DP0877929, in part by the Natural Science Foundation of China, under Contracts 60903065 and 61039003, in part by the Ph.D. Programs Foundation of Ministry of Education of China, under Grant 20091102120001, and in part by the Fundamental Research Funds for the Central Universities. This paper was recommended by Associate Editor L. Correia.

B. Zhang is with the Science and Technology on Aircraft Control Laboratory, School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China (e-mail: [email protected]).

Y. Gao and S. Zhao are with the Griffith School of Engineering, Griffith University, Brisbane, QLD 4111, Australia (e-mail: [email protected]; [email protected]), and also with the Queensland Research Laboratory, National ICT Australia, Brisbane, QLD 4072, Australia (e-mail: [email protected]; [email protected]).

B. Zhong is with the Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin 150001, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2011.2105591

The most widely adopted approach for moving object detection is based on background subtraction [3], [17], [18]. An underlying assumption of background subtraction is that the images of the scene without the foreground objects exhibit a certain behavior that can be well described by a statistical model (i.e., background modeling). After the statistical model is created, moving objects can be detected by spotting the image parts that do not fit the model (i.e., foreground detection).

In the literature, many motion detection methods with adaptive thresholds have been proposed for both uncompressed [2] and compressed image sequences [16]. Tian and Hampapur [15] proposed a real-time algorithm to detect salient motion in complex backgrounds by combining temporal difference imaging and a temporally filtered motion field. A large number of background subtraction methods have also been proposed to model dynamic backgrounds. These methods can be classified into two categories: parametric methods and non-parametric methods. A popular parametric technique is to model each pixel in a video frame with a Gaussian distribution and update the Gaussian parameters recursively with an adaptive filter [18], which has become a de facto baseline method in background modeling. However, the Gaussian model does not work well in dynamic natural environments, where the scene background is not completely stationary. A Gaussian mixture model (GMM) method was proposed in [14] to model dynamic backgrounds and applied to complex problems such as traffic monitoring. The parameters of the GMM were updated using an online K-means approximation. Instead of using a fixed number of Gaussians to model each pixel, Zivkovic and van der Heijden [20] proposed a recursive method to automatically decide the number of Gaussians. Different from the above parametric methods, Elgammal et al. [4] proposed a non-parametric approach for background modeling, which utilizes a non-parametric kernel density estimation technique to build a statistical representation of the scene background. Mittal and Paragios [7] proposed an improved kernel density estimation, where the kernel has a variable bandwidth. Han et al. [5] proposed a sequential approach using an iterative algorithm to predict the modes of kernel density estimation to overcome the space complexity problem. Sheikh and Shah [13] proposed a Bayesian framework and kernel density estimation method for efficient motion detection without considering the relationship between two neighboring pixels. Heikkilä and Pietikäinen [6] proposed to model the dynamic background using local binary pattern (LBP) histograms.


Fig. 1. Flowchart of the proposed KSM-TPF approach.

The LBP operator, encoding the relationship between the central point and its neighbors, was originally designed for texture description [8]. It was later extended to other applications including texture classification and face analysis [9], [11], [19]. In background modeling, the LBP method achieved a much better performance than the GMM in terms of false negative and false positive rates [6]. However, the LBP method does not consider the temporal variations of its pattern, which contain inherent motion information derived from moving objects in the video.

In this paper, we propose a novel texture pattern flow (TPF) to encode the local pattern information in both the spatial and temporal domains. In the spatial domain, the TPF is defined as a directional texture descriptor to reflect the fact that a foreground object always moves toward a certain direction. In the temporal domain, the TPF captures motion information over time in the image sequence. To speed up the proposed approach, the computationally efficient integral histogram [12] of TPF is employed to statistically model the image sequence. The complexity of the integral histogram is linear in the length of the input data, making it computationally superior to the conventional, exhaustive histogram method. Inspired by the property of histogram intersection as a kernel function, we propose to incorporate the histogram similarity measure into our background modeling framework. Instead of using a uniform threshold, which treats different image regions and video frames equally and is thus not effective for complex backgrounds, we propose to use kernel similarity modeling (KSM) to provide an adaptive threshold computed from the mean and variance of an extended Gaussian mixture model (E-GMM). Background classification is based on a decision rule in which a bigger similarity value contributes more to the final decision. The flowchart of the proposed kernel similarity modeling of texture pattern flow (KSM-TPF) approach is displayed in Fig. 1, in which both spatial and temporal information is exploited to construct the TPF features. We compared the proposed approaches with the LBP and GMM background modeling methods on multiple video sequences from three publicly available databases. It is a very encouraging finding that the proposed background subtraction approach performed consistently superior to the benchmark methods in all the comparative experiments.

The remainder of this paper is organized as follows. Section II describes the concept of texture pattern flow for feature extraction. Section III presents the proposed kernel similarity modeling method for motion detection. Section IV gives comparative experimental results. Conclusions are drawn in Section V.

II. Texture Pattern Flow

The goal of background modeling is to construct and maintain a statistical representation of the scene that the camera sees. In this section, we utilize the texture information for dynamic background modeling and motion detection. We propose a new operator, TPF, for the measure of texture flow. TPF encodes texture and motion information in a local region from both spatial and temporal aspects. The integral histogram of TPF computed within a local region around the pixel is employed to extract the statistical discriminative features from the image sequence.

A. TPF Operator

Given a gray-scale image sequence $I_{x,y,t}$, where $x$, $y$, and $t$ represent the space and time indexes, respectively, the gradients in the $x$ and $y$ directions can be calculated by

$$\frac{\partial I_{x,y,t}}{\partial x} = I_{x,y,t} - I_{x-1,y,t} \tag{1}$$

$$\frac{\partial I_{x,y,t}}{\partial y} = I_{x,y,t} - I_{x,y-1,t}. \tag{2}$$

To capture the spatial variations in the $x$ and $y$ directions, two thresholding functions, $f_x(\partial I_{x,y,t}/\partial x)$ and $f_y(\partial I_{x,y,t}/\partial y)$, are employed to encode the gradient information into binary representations

$$f_x\!\left(\frac{\partial I_{x,y,t}}{\partial x}\right) = \begin{cases} 1, & \text{if } \dfrac{\partial I_{x,y,t}}{\partial x} \ge 0 \\[4pt] 0, & \text{if } \dfrac{\partial I_{x,y,t}}{\partial x} < 0 \end{cases} \tag{3}$$

$$f_y\!\left(\frac{\partial I_{x,y,t}}{\partial y}\right) = \begin{cases} 1, & \text{if } \dfrac{\partial I_{x,y,t}}{\partial y} \ge 0 \\[4pt] 0, & \text{if } \dfrac{\partial I_{x,y,t}}{\partial y} < 0. \end{cases} \tag{4}$$

The temporal derivative is defined as

$$\frac{\partial I_{x,y,t,i}}{\partial t} = I_{x,y,t,i} - \mu_{x,y,t-1,i} \tag{5}$$

where $\mu_{x,y,t-1,i}$ is the mean of the $i$th model in the GMM at time $t-1$. The GMM is exploited to model the distribution of the recent history of each pixel value over time as

$$P(I_{x,y,t}) = \sum_{i=1}^{K} \omega_{x,y,t,i}\,\varphi_{\mu_{x,y,t,i},\,\sigma^2_{x,y,t,i}}(I_{x,y,t}) \tag{6}$$

where $K$ is the number of Gaussian models, $\omega_{x,y,t,i}$ is an estimate of the weight of the $i$th model at time $t$, $\sigma^2_{x,y,t,i}$ is the variance of the $i$th Gaussian probability density function $\varphi(\cdot)$ at time $t$, and $\varphi(\cdot)$ is defined as

$$\varphi_{\mu_{x,y,t},\,\sigma^2_{x,y,t}}(I_{x,y,t}) = \frac{1}{\sqrt{2\pi}\,\sigma_{x,y,t}} \exp\!\left(-\frac{(I_{x,y,t} - \mu_{x,y,t})^2}{2\sigma^2_{x,y,t}}\right). \tag{7}$$
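As a quick numerical illustration of (6) and (7), the short Python sketch below evaluates the mixture density for a single pixel value. The component weights, means, and variances are hypothetical placeholders, not values used in the paper.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    # Single Gaussian density, as in (7).
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def mixture_density(x, weights, means, variances):
    # Weighted sum of K Gaussian components, as in (6).
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical 3-component model (K = 3) for one pixel.
weights = [0.6, 0.3, 0.1]
means = [120.0, 90.0, 200.0]
variances = [25.0, 100.0, 400.0]
print(mixture_density(118.0, weights, means, variances))
```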

In this way, the distribution of recently observed values of each pixel in the image sequence is characterized by a mixture of $K$ Gaussian distributions. A new pixel value will be represented by one of the major components of the mixture model and used to update the model. Each new pixel in the image sequence is checked against the existing $K$ Gaussian distributions until a match is found. If a pixel matches two mixtures, either Gaussian distribution can be used. A pixel value lying within 2.5 standard deviations ($2.5\sigma_{x,y,t}$) of a distribution is defined as a match.


Fig. 2. Neighborhood pixels used for encoding the texture pattern flow in (a) spatial domain and (b) temporal domain.

The thresholding function $f_t(\partial I_{x,y,t}/\partial t)$ is used to encode the temporal derivative information $(I_{x,y,t} - \mu_{x,y,t-1,i})$ into binary representations, based on whether a match can be found

$$f_t\!\left(\frac{\partial I_{x,y,t}}{\partial t}\right) = \begin{cases} 1, & \text{if } \left|\dfrac{\partial I_{x,y,t}}{\partial t}\right| \ge 2.5\sigma_{x,y,t-1,i} \\[6pt] 0, & \text{if } \left|\dfrac{\partial I_{x,y,t}}{\partial t}\right| < 2.5\sigma_{x,y,t-1,i}. \end{cases} \tag{8}$$

If a pixel is checked and a match is found, the thresholding function returns 0; otherwise, it returns 1.

After the temporal information is encoded, the current pixel is used to update the parameters of the GMM through an online K-means approximation [14]. If a match has been found for the pixel, the parameters of the unmatched Gaussian distributions remain the same, while the mean and variance of the matched Gaussian distribution at time $t$ are updated as follows:

$$\mu_{x,y,t} = (1-\rho)\mu_{x,y,t-1} + \rho I_{x,y,t} \tag{9}$$

$$\sigma^2_{x,y,t} = (1-\rho)\sigma^2_{x,y,t-1} + \rho(I_{x,y,t} - \mu_{x,y,t})^2 \tag{10}$$

where $\rho$ is the learning rate. The bigger the learning rate is, the faster the propagation speed. The weights of the GMM at time $t$ are updated as follows:

$$\omega_{x,y,t,i} = (1-\alpha)\omega_{x,y,t-1,i} + \alpha\left(1 - f_t\!\left(\frac{\partial I_{x,y,t}}{\partial t}\right)\right) \tag{11}$$

where $\alpha$ is the second learning rate, in direct proportion to $\rho$. All the weights are then re-normalized. If none of the $K$ Gaussian distributions matches the current pixel value, the least probable distribution is replaced with a new distribution whose mean is set to the pixel value, with an initially high variance and a low prior weight.
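The match test, the updates (9)–(11), and the replacement rule just described can be summarized in the following Python sketch for one pixel. It is one plausible reading of the procedure rather than the authors' implementation; the learning rates and the initial variance and weight of a replacement component are hypothetical.

```python
import numpy as np

def update_pixel_gmm(x, mu, var, w, rho=0.01, alpha=0.01,
                     init_var=900.0, init_w=0.05):
    # One online update step for a single pixel's K-component GMM.
    # mu, var, w are length-K NumPy arrays.
    matched = np.abs(x - mu) <= 2.5 * np.sqrt(var)  # 2.5-sigma match test, cf. (8)
    m = np.zeros_like(w)
    if matched.any():
        i = int(np.argmax(matched))      # pick one matching component
        mu[i] = (1 - rho) * mu[i] + rho * x                    # (9)
        var[i] = (1 - rho) * var[i] + rho * (x - mu[i]) ** 2   # (10)
        m[i] = 1.0                       # f_t = 0, so 1 - f_t = 1 for the match
    else:
        j = int(np.argmin(w))            # replace the least probable component
        mu[j], var[j], w[j] = x, init_var, init_w
    w[:] = (1 - alpha) * w + alpha * m   # (11)
    w /= w.sum()                         # re-normalize
    return mu, var, w
```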

By integrating both spatial and temporal information, the TPF is defined as

$$\mathrm{TPF}(I_{x,y,t}) = \left\{ f_x\!\left(\frac{\partial I_{x,y,t}}{\partial x}\right),\ f_x\!\left(\frac{\partial I_{x+1,y,t}}{\partial x}\right),\ f_y\!\left(\frac{\partial I_{x,y,t}}{\partial y}\right),\ f_y\!\left(\frac{\partial I_{x,y+1,t}}{\partial y}\right),\ f_t\!\left(\frac{\partial I_{x,y,t}}{\partial t}\right) \right\}. \tag{12}$$

TPF reveals the relationship between derivative directions in both the spatial and temporal domains. Fig. 2 illustrates the neighborhood pixels used for the TPF encoding in both domains. To make TPF more robust against negligible changes in pixel values, we add a perturbation factor ($\beta$) to the first-order derivative in the spatial domain, e.g., $\partial I_{x,y,t}/\partial x = I_{x,y,t} - I_{x-1,y,t} + \beta$. In our experiments, the perturbation factor is set to $\beta = 3$, the same as in [6].
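For concreteness, the following Python sketch computes the five TPF bits of (12) for one interior pixel, with the perturbation $\beta = 3$ added to the spatial derivatives as described above. The temporal bit is assumed to come from a precomputed map of the GMM match test in (8); the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def tpf_bits(I, ft_map, x, y, beta=3):
    # Five-bit TPF code for interior pixel (x, y) of gray-level frame I, cf. (12).
    # ft_map holds the temporal bit f_t from the GMM match test in (8).
    fx  = 1 if I[y, x]     - I[y, x - 1] + beta >= 0 else 0   # f_x at (x, y)
    fx1 = 1 if I[y, x + 1] - I[y, x]     + beta >= 0 else 0   # f_x at (x+1, y)
    fy  = 1 if I[y, x]     - I[y - 1, x] + beta >= 0 else 0   # f_y at (x, y)
    fy1 = 1 if I[y + 1, x] - I[y, x]     + beta >= 0 else 0   # f_y at (x, y+1)
    return (fx, fx1, fy, fy1, int(ft_map[y, x]))

# Usage on a hypothetical frame and temporal-match map.
I = np.random.randint(0, 256, (48, 64)).astype(np.int32)
ft_map = np.zeros_like(I)
print(tpf_bits(I, ft_map, x=10, y=10))
```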

Fig. 3. Example of the local region around a pixel I(x, y) for integral histogram extraction.

B. Integral Histogram of TPF

After the TPF is calculated from each pixel of the input video, the discriminative statistics of the TPF features are extracted as the video representation. To efficiently calculate these statistics, we extract the integral histogram [12] of TPF from a neighborhood region around each pixel $I(x, y)$. The integral histogram is more computationally efficient than the conventional histogram method in that its complexity is linear in the length of the data. Fig. 3 illustrates an example of the neighborhood region around a pixel for the integral histogram extraction used in this paper. The typical values of the width ($w$) and height ($h$) of a neighborhood region are $w = 8$ and $h = 8$. Using a neighborhood region instead of a single pixel provides a certain robustness against noise, as the neighborhood region has an effect similar to a smoothing window. However, when the local region is too large, more local details will be lost. Therefore, the selection of these two parameters is a compromise between noise reduction and detail preservation. We calculate the TPF features based on the gray-level image in order to increase the robustness against complex and dynamic backgrounds. For simplicity, we model each pixel of the video frame in the same way, which allows for a high-speed parallel implementation if needed.
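As a sketch of the integral histogram idea of [12], the following Python fragment builds a cumulative histogram over a per-pixel code map so that the histogram of any w × h window is recovered from four lookups. The 8 × 8 window follows the text; the code-map contents and the helper names are hypothetical.

```python
import numpy as np

def integral_histogram(codes, n_bins):
    # codes: 2-D integer map of per-pixel pattern codes in [0, n_bins).
    # Returns an (H+1, W+1, n_bins) cumulative histogram, as in [12].
    H, W = codes.shape
    one_hot = np.eye(n_bins, dtype=np.int32)[codes]      # (H, W, n_bins)
    ih = np.zeros((H + 1, W + 1, n_bins), dtype=np.int32)
    ih[1:, 1:] = one_hot.cumsum(0).cumsum(1)
    return ih

def region_histogram(ih, x, y, w=8, h=8):
    # Histogram of the w x h window with top-left corner (x, y),
    # recovered from four lookups in the integral histogram.
    return ih[y + h, x + w] - ih[y, x + w] - ih[y + h, x] + ih[y, x]

codes = np.random.randint(0, 32, (48, 64))   # hypothetical TPF code map
ih = integral_histogram(codes, 32)
print(region_histogram(ih, x=4, y=4))
```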

The similarity between two integral histograms $h_1$ and $h_2$ is calculated through the following similarity function:

$$HI(h_1, h_2) = \sum_{b=1}^{B} \min(h_{1b}, h_{2b}) \tag{13}$$

where $HI(h_1, h_2)$ is the histogram intersection operation and $B$ is the number of histogram bins. This measure has an intuitive motivation in that it calculates the common parts of two histograms. Its advantage is that the computation is very fast, as it only requires very simple arithmetic operations. Histogram intersection has been proved to be a Mercer kernel [1], and thus there are highly effective tricks for computing scalar products in histogram spaces using powerful kernel functions.
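A direct rendering of (13) in Python; normalized histograms are assumed here so that the similarity lies in [0, 1].

```python
import numpy as np

def histogram_intersection(h1, h2):
    # Histogram intersection kernel of (13): sum of bin-wise minima.
    return np.minimum(h1, h2).sum()

h1 = np.array([0.2, 0.5, 0.3])
h2 = np.array([0.1, 0.6, 0.3])
print(histogram_intersection(h1, h2))   # 0.1 + 0.5 + 0.3 = 0.9
```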

III. Kernel Similarity Modeling

After the TPF integral histogram features are extracted to represent each pixel of the input video frame, a background model is established and dynamically adapted to complex situations. As new frames are processed, the background model is updated over time. The similarity measurement between the input video and the background model is conducted by a new kernel similarity modeling (KSM) approach proposed in this section.


Fig. 4. Flowchart of the kernel similarity modeling for one pixel.

The KSM approach integrates the TPF integral histogram features with a kernel similarity estimation method for dynamic similarity measurement. In the following subsections, we first introduce how to use the TPF integral histogram features to model the background; the updating of the background model is also presented in this process. We then describe how to use an adaptive threshold to distinguish a foreground pixel from the background. The flowchart of the proposed KSM approach is shown in Fig. 4. In this flowchart, a GMM is used to model the background into the model integral histograms, while an extended Gaussian mixture model (E-GMM) is used to model the histogram intersection kernel. The similarity value between the input integral histogram and the model integral histograms is computed by the histogram intersection kernel and implemented by the online K-means approximation [14]. Whether a pixel is background is decided by whether the computed similarity value is above an adaptive threshold. The procedures illustrated in Fig. 4 are repeated for each pixel of the input video frame. This pixel-wise approach usually offers more accurate results for shape extraction.

A. Background Modeling and Updating

In the KSM-TPF framework, each pixel of the video frame is modeled by a GMM of the TPF integral histograms, in the form of a set of model integral histograms with associated weights. The model integral histograms can be established from a training process. Once the TPF integral histogram is extracted from a pixel in an input video frame (called the input integral histogram), it is compared with the model integral histograms through a similarity measurement (the histogram intersection kernel). If the similarity value is above a threshold, the pixel will be considered part of the background; otherwise, it will be considered part of a foreground object. The threshold, which is computed adaptively from kernel similarity modeling, will be described in the next subsection.

After the input pixel is classified as either background or foreground, the background model is updated accordingly. If the pixel is labeled as foreground (i.e., the similarity value is less than the threshold for all the model integral histograms), a new model integral histogram is created with a low initial weight to replace the model integral histogram with the lowest weight. If the pixel is labeled as background (i.e., the similarity value is above the threshold for at least one of the model integral histograms), the model integral histogram with the highest similarity value is selected as the best-matched model histogram. The best-matched model histogram is updated with the new data through

$$\bar{H}_{i,t} = (1-\rho)\bar{H}_{i,t-1} + \rho H_t \tag{14}$$

where $\bar{H}_{i,t}$ and $H_t$ are the mean integral histogram of the $i$th Gaussian distribution and the input integral histogram at time $t$, respectively [6]. The other model histograms remain the same. The weight $\omega_{i,t}$ of each Gaussian distribution at time $t$ is updated through

$$\omega_{i,t} = (1-\alpha)\omega_{i,t-1} + \alpha M_{i,t} \tag{15}$$

where $M_{i,t}$ is set to 1 for the best-matched distribution and to 0 for the other distributions. All the weights are then re-normalized. Note that (15) gives a bigger weight to the best-matched distribution and smaller weights to the other distributions. In practice, we sort the model integral histograms in descending order according to their weights. The first $N$ model integral histograms are chosen to model the background

$$\omega_{0,t} + \omega_{1,t} + \cdots + \omega_{N-1,t} > T_N \tag{16}$$

where $T_N$ is a user-defined threshold that confines the number of Gaussian distributions $N$.
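The updating logic of (14)–(16) for one pixel's set of model histograms can be sketched as follows. This is an illustrative reading under stated assumptions: the similarity threshold is passed in (it is adaptive in the full KSM approach), and the initial weight of a new model histogram is a hypothetical placeholder.

```python
import numpy as np

def update_models(h_in, models, weights, threshold, rho=0.01, alpha=0.01):
    # h_in: input integral histogram (length B); models: (K, B) model
    # integral histograms; weights: length-K array.
    sims = np.minimum(models, h_in).sum(axis=1)        # kernel similarity (13)
    is_background = sims.max() >= threshold
    m = np.zeros_like(weights)
    if is_background:
        best = int(np.argmax(sims))                    # best-matched histogram
        models[best] = (1 - rho) * models[best] + rho * h_in   # (14)
        m[best] = 1.0                                  # M = 1 for the match
    else:
        worst = int(np.argmin(weights))                # replace lowest weight
        models[worst] = h_in
        weights[worst] = 0.05                          # low initial weight (hypothetical)
    weights[:] = (1 - alpha) * weights + alpha * m     # (15)
    weights /= weights.sum()                           # re-normalize
    return is_background

def background_models(weights, T_N=0.7):
    # Choose the first N histograms, sorted by descending weight, whose
    # cumulative weight exceeds T_N, as in (16).
    order = np.argsort(weights)[::-1]
    N = int(np.searchsorted(np.cumsum(weights[order]), T_N)) + 1
    return order[:N]
```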

B. Adaptive Threshold Based Kernel Similarity Measurement

We use the variable $\kappa$ to represent the kernel similarity value, which is measured by the histogram intersection operation $HI$ in (13) between the input integral histogram $H$ and a model integral histogram $\bar{H}$. The E-GMM model consists of three elements, $\mu^{\kappa}_{i,t}$, $\sigma^{2\kappa}_{i,t}$, and $\bar{H}$, where $\mu^{\kappa}_{i,t}$ and $\sigma^{2\kappa}_{i,t}$ represent the mean and the variance of $\kappa$ corresponding to the $i$th Gaussian distribution at time $t$, respectively. $\bar{H}$ preserves both the spatial and temporal information contained in the TPF integral histograms. Both the mean $\mu^{\kappa}_{i,t}$ and the variance $\sigma^{2\kappa}_{i,t}$ in the E-GMM are updated over time in a propagation manner through the online K-means approximation [14]. The mean kernel similarity value is updated through

$$\mu^{\kappa}_{i,t} = (1-\rho)\mu^{\kappa}_{i,t-1} + \rho\kappa. \tag{17}$$

The variance of the similarity value in the E-GMM is updated through

$$\sigma^{2\kappa}_{i,t} = (1-\rho)\sigma^{2\kappa}_{i,t-1} + \rho(\kappa - \mu^{\kappa}_{i,t})^2. \tag{18}$$

The propagation speed is controlled by the learning rate $\rho$.

Traditionally, a uniform threshold is used in the update procedure (e.g., $\kappa \ge \text{Threshold}$). If the kernel similarity value $\kappa$ is higher than the threshold for at least one model integral histogram, the input pixel is classified as background; otherwise, the input pixel is classified as foreground. However, a uniform threshold is not effective for complex and dynamic backgrounds, because different image regions have different ranges of background variation.


In this paper, an adaptive threshold is used for the kernel similarity measurement to segment the background from foreground objects. Specifically, the threshold is computed from the mean and the standard deviation of the histogram intersection kernel

$$\kappa \ge \mu^{\kappa}_{i,t} - \eta\sigma^{\kappa}_{i,t} \tag{19}$$

where $\sigma^{\kappa}_{i,t}$ is the standard deviation of the $i$th Gaussian distribution and $\eta$ is a parameter that balances the mean value and the standard deviation. The adaptive threshold $\mu^{\kappa}_{i,t} - \eta\sigma^{\kappa}_{i,t}$ reflects the statistics of the dynamic similarity value through kernel similarity modeling.
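Putting (17)–(19) together, a minimal Python sketch of the adaptive decision for one E-GMM component is given below; η = 0.5 follows the experiments in Section IV, and the ordering of update and test is one plausible reading.

```python
def ksm_decision(kappa, mu_k, var_k, eta=0.5, rho=0.01):
    # kappa: current kernel similarity; mu_k, var_k: E-GMM mean and
    # variance of kappa for one Gaussian component.
    mu_k = (1 - rho) * mu_k + rho * kappa                  # (17)
    var_k = (1 - rho) * var_k + rho * (kappa - mu_k) ** 2  # (18)
    is_background = kappa >= mu_k - eta * var_k ** 0.5     # (19)
    return is_background, mu_k, var_k
```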

IV. Experiments

Multiple video sequences from three publicly available databases are used to assess the proposed approach with both visual and numerical evaluation. The first video dataset, a synthesized dataset, is taken under a dynamic background with a computer-graphics-synthesized object. It is downloaded from the web site at http://mmc36.informatik.uni-augsburg.de/VSSN06 OSAC/. The other two video datasets are the Wallflower video dataset [17] and the DynTex video dataset [10]. The reason for using a computer-graphics-synthesized video object for testing is that it can provide very accurate pixel-level ground truth for all the video frames (which manual labeling cannot achieve) for quantitative accuracy comparison with benchmark algorithms. For the realistic videos, the ground truth is provided by manually labeling one or several frames from each video sequence. The numerical evaluation is conducted in terms of false negatives (the number of foreground pixels that were misclassified as background) and false positives (the number of background pixels that were misclassified as foreground).
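These two counts can be computed directly from binary masks; the following Python sketch is a straightforward rendering of the definitions above.

```python
import numpy as np

def fp_fn(pred, gt):
    # pred, gt: boolean foreground masks (True = foreground).
    fp = int(np.logical_and(pred, ~gt).sum())  # background labeled foreground
    fn = int(np.logical_and(~pred, gt).sum())  # foreground labeled background
    return fp, fn

pred = np.array([[True, False], [True, True]])
gt   = np.array([[True, True],  [False, True]])
print(fp_fn(pred, gt))   # (1, 1)
```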

In all the following experiments, the test phase begins at the 50th frame of each video sequence. The following two parameters are the same as in [6]: the threshold $T_N = 0.7$ and the propagation rate $\rho = 0.01$. The remaining parameters in our approach can be heuristically determined by experiments. For example, the parameter $\eta$, used to balance the mean value and the standard deviation, is demonstrated to be relatively stable in the experiment in the next subsection. For simplicity, three Gaussian distributions and three model integral histograms are used for all the Gaussian mixture models in both the feature extraction and classification stages.

We compare the proposed approach with three benchmark methods: the GMM method [14], the Carnegie Mellon University (CMU) method [2], and the LBP method [6]. We also implement a simplified version of the proposed approach, named TPF. The TPF method differs from the full version of the proposed approach only in that it uses a uniform threshold (i.e., $\kappa \ge \text{Threshold}$) to decide whether a pixel is background or foreground. It should be noted that all our experiments in this paper are conducted on gray-level values, although other implementation techniques (e.g., concatenating RGB color information as used in [17]) may obtain a better performance.

Fig. 5. Comparative performances on Video 3 from a synthesized dataset. (a) Original frame. (b) GMM result. (c) LBP result. (d) TPF result. (e) KSM-TPF result.

A. Experiment 1

The first experiment is conducted on Video 3 (898 frames in total) and Video 4 (818 frames in total) from the synthesized dataset. These two video sequences focus on outdoor scenes and contain sudden background changes and swaying trees. Fig. 5(a) shows an example frame from Video 3. All frames from each sequence are labeled as ground truth.

In this experiment, we first test the sensitivity of LBP and TPF with different uniform thresholds on Video 3. The experimental results in Fig. 6 show that both methods suffer from threshold value changes, but TPF performs consistently much better than LBP, particularly in terms of false negatives. We then test the sensitivity of KSM-TPF with different η. The performances illustrated in Fig. 7 demonstrate that the proposed KSM-TPF approach performs relatively stably when η is varied from 0.4 to 0.7. This can be explained by the fact that the standard deviation is much smaller than the mean value, so η does not significantly affect the final performance. Fig. 5(b)–(e) displays some visual segmentation results of all the compared approaches, i.e., KSM-TPF, TPF, LBP, and GMM. Tables I and II list their corresponding numerical results on Videos 3 and 4, respectively. The performance of the proposed KSM-TPF approach is clearly the best in terms of both false positives and false negatives, as well as the sum of the two. The sum of false positives and false negatives is calculated by adding all the classification errors in the input frames compared with the ground truth frames. These experimental results show that the TPF-based methods are better than LBP and GMM, indicating that TPF is effective in extracting discriminative features from the input video sequences. Moreover, the proposed KSM-TPF approach achieves the best performance in terms of all three numerical evaluation factors (i.e., false positives, false negatives, and the sum).


Fig. 6. Comparative performances of TPF and LBP with different threshold values on Video 3. (a) Performances of false positives. (b) Performances of false negatives.

TABLE I
Comparative Results of False Positives and False Negatives (×10^5) on Video 3

Method     False Positive   False Negative   Sum
KSM-TPF    18.4             1.2              19.6
TPF        19.3             1.64             20.94
LBP        21.76            2.26             24.02
GMM        66.84            2.06             68.91

TABLE II
Comparative Results of False Positives and False Negatives (×10^5) on Video 4

Method     False Positive   False Negative   Sum
KSM-TPF    26.4             1.42             27.82
TPF        29.7             1.95             31.65
LBP        40.6             2.98             43.58
GMM        68.80            3.01             71.81

This proves that the adaptive threshold of the KSM-TPF approach can increase the performance of TPF to some extent. It should be noted that, for the proposed approach, most of the false positives occur on the contour areas of the moving objects (see Fig. 5), because the TPF features are extracted from the neighborhoods of pixels.

In Tables I and II, the threshold value ($\kappa \ge \text{Threshold}$) for both TPF and LBP is 0.7. For KSM-TPF, the parameter $\eta = 0.5$.

Fig. 7. Performances of KSM-TPF with different η on Video 3. (a) Performances of false positives. (b) Performances of false negatives.

Fig. 8. Comparative performances on the video from the Wallflower video dataset. (a) Performances of false positives (×10^5). (b) Performances of false negatives (×10^4). (c) Performances of the sum (×10^5).


Fig. 9. Comparative performances on the Wallflower video dataset. (a) GMM results. (b) CMU results. (c) LBP results. (d) TPF results. (e) KSM-TPF results.

Fig. 10. Comparative performances on the DynTex video dataset. (a) GMM results. (b) CMU results. (c) LBP results. (d) TPF results. (e) KSM-TPF results.


For the integral histogram and the mean gray-level image, the sizes of the local subregions are 10 × 10 and 8 × 8, respectively. The LBP histogram is equally quantized into 32 bins. Therefore, the feature lengths of the LBP and TPF integral histograms are the same.

B. Experiment 2

The second experiment is conducted on the Wallflower video dataset [17]. Five frames from the sequence are manually labeled to generate the ground truth data. The numerical comparison among KSM-TPF, TPF, LBP, and GMM on the video sequence that has a dynamic background with heavily swaying trees (the first and second rows of Fig. 9) is shown in Fig. 8, in terms of false positives, false negatives, and the sum. Some visual comparison results are also displayed in Fig. 9. The proposed KSM-TPF approach again achieves the best performance in terms of the sum of false positives and false negatives.

From the first two rows of Fig. 9, we can see that the proposed approach robustly handles the heavily swaying trees and correctly detects the walking person, because it combines both the textural and the motion information of background variations. Although we use the same set of parameters in this experiment as in the previous subsection, we would like to stress that it is straightforward to fine-tune the parameters through experiments for a better performance.

C. Experiment 3

The third experiment is conducted on the DynTex video dataset [10], which includes ten videos available for public download. The videos consist of ten different dynamic textures, such as swaying trees and moving escalators. Dynamic textures refer to spatially repetitive, time-varying visual patterns with a certain temporal stationarity. In dynamic textures, the notion of self-similarity central to conventional image textures is extended to the spatio-temporal domain.

A complete set of experimental results on all ten videos is displayed in Fig. 10. Note that only one of the ten videos (the first row in Fig. 10) contains a foreground object, i.e., a duck moving into the scene. For the other nine videos, all the video frames are considered dynamic backgrounds. The proposed KSM-TPF approach clearly outperforms the other four benchmark methods. For the video with vehicles running on the highway (the ninth row in Fig. 10), the traffic flow is considered by the proposed method as part of the dynamic background. This is because the traffic flow appears consistently throughout the whole video clip, so there are no clear highway frames at the beginning of the video with which our system could be trained to model an empty highway as the static background.

V. Conclusion and Future Work

We presented a new approach to background subtraction through KSM-TPF. Our approach provides advantages over other methods in that it identifies more discriminative information in the spatio-temporal domain by using an adaptive threshold. The proposed KSM-TPF approach was tested on multiple video sequences from three publicly available databases and achieved a better performance than LBP and GMM. The experimental results showed that KSM-TPF is much more robust to significant background variations. The processing speed of the proposed approach is 5 frames/s on a Windows-platform PC with a 2.8 GHz Intel Core Duo CPU and 2 GB RAM, which reaches quasi-real-time performance. The program is implemented in C/C++. This shows that the proposed method is less computationally efficient than the GMM method [14] or the LBP method [6]. In the future, we will focus on improving the computational efficiency of the proposed approach.

References

[1] A. Barla, F. Odone, and A. Verri, "Histogram intersection kernel for image classification," in Proc. IEEE Int. Conf. Image Process., vol. 3, Sep. 2003, pp. 513–516.

[2] R. T. Collins, A. J. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto, O. Hasegawa, P. Burt, and L. Wixson, "A system for video surveillance and monitoring," Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-RI-TR-00-12, 2000.

[3] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, "Detecting moving objects, ghosts, and shadows in video streams," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp. 1337–1342, Oct. 2003.

[4] A. Elgammal, R. Duraiswami, D. Harwood, and L. S. Davis, "Background and foreground modeling using nonparametric kernel density estimation for visual surveillance," Proc. IEEE, vol. 90, no. 7, pp. 1151–1163, Jul. 2002.

[5] B. Han, D. Comaniciu, Y. Zhu, and L. Davis, "Incremental density approximation and kernel-based Bayesian filtering for object tracking," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., vol. 1, Jun. 2004, pp. 638–644.

[6] M. Heikkilä and M. Pietikäinen, "A texture-based method for modeling the background and detecting moving objects," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, pp. 657–662, Apr. 2006.

[7] A. Mittal and N. Paragios, "Motion-based background subtraction using adaptive kernel density estimation," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., vol. 2, Jun.–Jul. 2004, pp. 302–309.

[8] T. Ojala, M. Pietikäinen, and D. Harwood, "A comparative study of texture measures with classification based on feature distributions," Pattern Recognit., vol. 29, no. 1, pp. 51–59, 1996.

[9] T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971–987, Jul. 2002.

[10] R. Péteri, S. Fazekas, and M. J. Huiskes, "DynTex: A comprehensive database of dynamic textures," Pattern Recognit. Lett., vol. 31, no. 12, pp. 1627–1632, Sep. 2010.

[11] M. Pietikäinen, T. Ojala, and Z. Xu, "Rotation-invariant texture classification using feature distributions," Pattern Recognit., vol. 33, no. 1, pp. 43–52, 2000.

[12] F. Porikli, "Integral histogram: A fast way to extract histograms in Cartesian spaces," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., vol. 1, Jun. 2005, pp. 829–836.

[13] Y. Sheikh and M. Shah, "Bayesian modeling of dynamic scenes for object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 11, pp. 1778–1792, Nov. 2005.

[14] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., vol. 2, Jun. 1999, pp. 246–252.

[15] Y.-L. Tian and A. Hampapur, "Robust salient motion detection with complex background for real-time video surveillance," in Proc. IEEE Comput. Soc. Workshop Motion Video Comput., vol. 2, 2005, pp. 30–35.

[16] B. U. Toreyin, A. E. Cetin, A. Aksay, and M. B. Akhan, "Moving object detection in wavelet compressed video," Signal Process. Image Commun., vol. 20, no. 3, pp. 255–264, 2005.

[17] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, "Wallflower: Principles and practice of background maintenance," in Proc. IEEE Int. Conf. Comput. Vision, vol. 1, Sep. 1999, pp. 255–261.

[18] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 780–785, Jul. 1997.


[19] G. Zhao and M. Pietikäinen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 915–928, Jun. 2007.

[20] Z. Zivkovic and F. van der Heijden, "Efficient adaptive density estimation per image pixel for the task of background subtraction," Pattern Recognit. Lett., vol. 27, no. 7, pp. 773–780, 2006.

Baochang Zhang received the B.S., M.S., and Ph.D. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 1999, 2001, and 2006, respectively.

From 2006 to 2008, he was a Research Fellow with the Chinese University of Hong Kong, Shatin, Hong Kong, and with Griffith University, Brisbane, Australia. Currently, he is a Lecturer with the Science and Technology on Aircraft Control Laboratory, School of Automation Science and Electrical Engineering, Beihang University, Beijing, China.

His current research interests include pattern recognition, machine learning, face recognition, and wavelets.

Yongsheng Gao (SM'01) received the B.S. and M.S. degrees in electronic engineering from Zhejiang University, Hangzhou, China, in 1985 and 1988, respectively, and the Ph.D. degree in computer engineering from Nanyang Technological University, Singapore.

Currently, he is an Associate Professor with the Griffith School of Engineering, Griffith University, Brisbane, Australia. He is also with the Queensland Research Laboratory, National ICT Australia, Brisbane, leading the Biosecurity Group. His current research interests include face recognition, biometrics, biosecurity, image retrieval, computer vision, and pattern recognition.

Sanqiang Zhao received the B.E. degree in electronics from Northwestern Polytechnical University, Xi'an, China, in 2000, the M.E. degree in computer science from the Beijing University of Technology, Beijing, China, in 2004, and the Ph.D. degree from Griffith University, Brisbane, Australia, in 2009.

In July 2009, he joined the Queensland Research Laboratory, National ICT Australia, Brisbane, as a Researcher. Currently, he is also an Adjunct Research Fellow with the Institute for Integrated and Intelligent Systems, Griffith School of Engineering, Griffith University. His current research interests include pattern recognition, computer vision, face recognition, and image processing.

Bineng Zhong received the B.S. and M.S. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 2004 and 2006, respectively, where he is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering.

His current research interests include pattern recognition, machine learning, computer vision, and intelligent video surveillance.