
Segmentation and Tracking of Multiple Humans in Crowded Environments

Tao Zhao, Member, IEEE, Ram Nevatia, Fellow, IEEE, and Bo Wu, Student Member, IEEE

Abstract— Segmentation and tracking of multiple humans in crowded situations is made difficult by inter-object occlusion. We propose a model-based approach that interprets the image observations with multiple, partially occluded human hypotheses in a Bayesian framework. We define a joint image likelihood for multiple humans based on the appearance of the humans, the visibility of body parts obtained by occlusion reasoning, and foreground/background separation. The optimal solution is obtained by an efficient sampling method, data-driven Markov chain Monte Carlo (DDMCMC), which uses image observations for proposal probabilities. Knowledge of various aspects, including human shape, camera model, and image cues, is integrated in one theoretically sound framework. We present experimental results and quantitative evaluation, demonstrating that the resulting approach is effective for very challenging data.

Index Terms— Multiple Human Segmentation, Multiple Human Tracking, Markov chain Monte Carlo

I. INTRODUCTION AND MOTIVATION

Segmentation and tracking of humans in video sequences is important for a number of applications, such as visual surveillance and human-computer interaction. This has been a topic of considerable research in the recent past, and robust methods exist for tracking isolated humans, or small numbers of humans with only transient occlusion. However, tracking in a more crowded situation, where several people are present and exhibit persistent occlusion, remains challenging. The goal of this work is to develop a method to detect and track humans in the presence of persistent and temporarily heavy occlusion. We do not require that humans be isolated, i.e. un-occluded, when they first enter the scene. However, in order to “see” a person, we require that at least the head-shoulder region be visible. We assume a stationary camera so that motion can be detected by comparison with a background model. We do not require the foreground detection to be perfect, e.g. the foreground blobs may be fragmented, but we assume that there are no significant false alarms due to shadows, reflections, or other causes. We also assume that the camera model is known and that people walk on a known ground plane.

Fig.1(a) shows a sample frame of a crowded environment and Fig.1(b) shows the motion blobs detected by comparison with the learned background. It is apparent that segmenting humans from such blobs is not straightforward. One blob may include multiple objects, while one object may split into multiple blobs. Blob tracking over extended periods, e.g. [20], may resolve some of these ambiguities, but such approaches are likely to fail when occlusion is persistent. Some approaches have been developed to handle occlusion, e.g. [9], but require

1 T. Zhao is with Intuitive Surgical Inc, 950 Kifer Road, Sunnyvale, CA 94086. Email: [email protected]. R. Nevatia and B. Wu are with the Institute for Robotics and Intelligent Systems, University of Southern California, Los Angeles, CA 90089. Email: {nevatia|bowu}@usc.edu.

the objects to be initialized before occlusion happens. This is usually infeasible for crowded scenes. We believe that use of a shape model is necessary to achieve individual human segmentation and tracking in crowded scenes.

(a) Sample frame (b) Motion blobs (c) Our result

Fig. 1. A sample frame, the corresponding motion blobs, and our segmentation and tracking result for a crowded situation.

In earlier related work [54], Zhao and Nevatia model the human body as a 3D ellipsoid, and human hypotheses are proposed based on head-top detection from foreground boundary peaks. This method works reasonably well in the presence of partial occlusion if the number of people in the field of view is small. As the complexity of the scene grows, head tops cannot be obtained by simple foreground boundary analysis, and more complex shape models are needed to fit the observed shapes more accurately. Also, joint reasoning about the collection of objects is needed, rather than the simpler one-by-one verification method in [54]. The consequence of this joint consideration is that the optimal solution has to be computed in the joint parameter space of all the objects. To track the objects in multiple frames, temporal coherence is another desired property, besides accuracy of the spatial segmentation. We adapt a data-driven Markov chain Monte Carlo approach to explore this complex solution space. To improve computational efficiency, we use direct image features from bottom-up image analysis as importance proposal probabilities to guide the moves of the Markov chain. The main features of this work include

1) a 3-dimensional part-based human body model, which enables segmentation and tracking of humans in 3D and natural inference of inter-object occlusion;

2) a Bayesian framework which integrates segmentation and tracking based on a joint likelihood for the appearance of multiple objects;


3) the design of efficient Markov chain dynamics, directed by proposal probabilities based on image cues; and

4) the incorporation of a color-based background model in a mean shift tracking step.

Our method is able to successfully detect and track humans in scenes of the complexity shown in Fig.1, with high detection and low false alarm rates; the tracking result for the frame in Fig.1(a) is shown in Fig.1(c) (the result includes integration of multiple frames during tracking). In the results section, we give graphical and quantitative results on a number of sequences. Parts of our system have been described in [53] and [55]; this paper provides a unified presentation of the methodology, additional results, and discussion. This approach has been built on by other researchers, e.g. [41]. The same framework has also been successfully applied to vehicle segmentation and tracking in challenging cases [43].

The rest of the paper is organized as follows: Section II gives a brief review of related work; Section III presents an overview of our method; Section IV describes the probabilistic modeling of the problem; Section V describes our MCMC-based solution; Section VI shows experimental results and evaluation; conclusions and discussion are given in the last section.

II. RELATED WORK

We summarize related work in this section; some of it is referred to in more detail in the following sections. Due to the size of the literature in this field, it is not possible for us to provide a comprehensive survey, but we attempt to include the major trends.

The observations for human hypotheses may come from multiple cues. Many previous approaches [20], [9], [54], [37], [44], [15], [18], [40], [24], [3], [45] use motion blobs detected by comparing pixel colors in a frame to learned models of the stationary background. When the scene is not highly crowded, most parts of the humans in the scene are detected in the foreground motion blobs; multiple humans may be merged into a single blob, but they can be separated by rather simple processing. For example, Haritaoglu et al. [15] use vertical projection of the blob to help segment a big blob into multiple humans. Siebel and Maybank [40] and Zhao and Nevatia [54] detect head candidates by analyzing the foreground boundaries. Since different humans have only small overlapping foreground regions, they can be segmented in a greedy way. However, the utility of these methods in crowded environments such as in Fig.1 is likely to be limited.

Some methods, e.g. [50], [31], [7], [13], detect appearance- or shape-based patterns of humans directly. [50] and [31] learn human detectors from local shape features; [7] and [13] build contour templates for pedestrians. These learning-based methods need a large number of training samples and may be sensitive to imaging viewpoint variations, as they learn 2D patterns. Besides motion and shape, face and skin color are also useful cues for human detection, but the environments where these cues can be utilized are limited, usually indoor scenes where illumination is controlled and the objects are imaged at high resolution, e.g. [42], [12].

Without a specific model of objects, tracking methods are limited to blob tracking, e.g. [3]. The main advantage of model-based tracking is that it can solve the blob merge and split problems by enforcing a global shape constraint. The shape models can be either parametric, e.g. an ellipsoid as in [54], or non-parametric, e.g. the edge template as in [13]; and either in 2D, e.g. [46], or in 3D, e.g. [54]. Parametric models are usually generative and of high dimensionality, while non-parametric models are usually learned from real samples. 2D models make the matching of hypotheses and image observations straightforward, while 3D models are more natural for occlusion reasoning. The choice of model complexity depends on both the application and the video resolution. For human tracking from a mid-distance camera, we do not need to capture the detailed body articulation; a rough body model, such as the generic cylinder in [19], the ellipsoid in [54], or the multiple rectangles in [46], suffices. When the body pose of humans is desired and the video resolution is high enough, more complex models can be used, such as the articulated models in [54] and [34].

Tracking of multiple objects requires matching of hypotheses with the observations both spatially and temporally. When objects are highly inter-occluded, their image observations are far from independent, hence a joint likelihood for multiple objects is necessary [46], [27], [19], [35], [30], [51]. Smith et al. [41] use a pair-wise Markov Random Field (MRF) to model the interaction between humans and define the joint likelihood. Rittscher et al. [36] include in the state vector a hidden variable which indicates a global mapping from the observed features to human hypotheses.

As the solution space is of high dimension, searching for the best interpretation by brute force is not feasible. Particle filter based methods, e.g. [19], [46], [30], [51], [27], become unsuitable when the dimensionality of the search space is high, as the number of samples needed usually grows exponentially with the dimension. [41], [21] use variations of the MCMC algorithm to sample the solution space, while [45], [36] use an EM-style method. For efficiency, the candidate solutions can be generated from image cues rather than purely at random, e.g. [36] propose hypotheses from local silhouette features.

Information from multiple cameras with overlapping views can reduce the ambiguity of a single camera. Such methods usually assume that the object can be detected successfully from at least one viewpoint (e.g. [11]) or that many cameras are available for 3-dimensional reconstruction (e.g. [28]). The difficulty of segmenting multiple humans which overlap in images from a stereo camera is alleviated by analyzing the 3-dimensional space, where they are separable [52]. In a multi-camera context, an object can be tracked even when it is fully occluded from some of the views; however, many real environments do not permit the use of multiple cameras with overlapping views. In this paper, we consider situations where video from only one camera is available. However, our approach can utilize multiple cameras with little modification.

MCMC-based methods are gaining increasing popularity for computer vision problems due to their flexibility in optimizing an arbitrary energy function, as opposed to energy functions of a specific type as in graph cut [2] or belief propagation [49].


They have been used for various applications, including segmenting multiple cells [38], image parsing [48], multi-object tracking [21], and estimating articulated structures [23]. Data-driven MCMC was proposed by [48] to utilize bottom-up image cues to speed up the sampling process.

We want to point out the differences between our approach and an independently developed work [21] which also used MCMC for multi-object tracking. [21] assumes that the objects do not overlap, applying a penalty term for overlap, while our approach explicitly uses a likelihood of appearance under occlusion. Our approach focuses on the domain of tracking humans, the most important subject for visual surveillance. We consider the 3-dimensional perspective effect of a typical camera setting, while the ant tracking problem described in [21] is almost a 2-dimensional problem. We utilize acquired appearance models, in which each object has a distinct appearance, whereas the ants in [21] are assumed to have the same appearance. We have developed a full set of effective bottom-up cues for human segmentation and hypothesis generation.

III. OVERVIEW

Our approach to segmenting and tracking of multiple humans emphasizes the use of shape models. An overview diagram is given in Fig.2. Based on a background model, the foreground blobs are extracted as the basic observation. Using the camera model and the assumption that objects move on a known ground plane, multiple 3D human hypotheses are projected onto the image plane and matched with the foreground blobs. Since the hypotheses are in 3D, occlusion reasoning is straightforward. In one frame, we segment the foreground blobs into multiple humans and associate the segmented humans with the existing trajectories. The tracks are then used to propose human hypotheses in the next frame. Segmentation and tracking are integrated in a unified framework and inter-operate over time.

Fig. 2. Overview diagram of our approach.

We formulate the problem of segmentation and tracking as one of Bayesian inference: find the best interpretation given the image observations, the prior models, and the estimates from previous frame analysis (i.e. the maximum a posteriori, MAP, estimate). The state to be estimated at each frame includes the number of objects, their correspondences to the objects in the previous frame (if any), their parameters (e.g. positions), and the uncertainty of the parameters. We define a color-based joint likelihood model which considers all the objects and the background together, and encodes both the constraint that an object should be different from the background and the constraint that an object should be similar to its correspondence. Using this likelihood model gracefully integrates segmentation and tracking, and avoids a separate, sometimes ad hoc, initialization step. Given multiple human hypotheses, inter-object occlusion reasoning is done before calculating the joint image likelihood. The occluded parts of a human should not have corresponding image observations.

The solution space contains subspaces of varying dimensions, each corresponding to a different number of objects. The state vector consists of both discrete and continuous variables. This disqualifies many optimization techniques. Therefore we use a highly general reversible jump/diffusion MCMC-based method to compute the MAP estimate. We design dynamics for the multi-object tracking problem. We also use various direct image features to make the Markov chain more efficient. Direct image features alone do not guarantee optimality because they are usually computed locally or using partial cues. Using them as proposal probabilities of the Markov chain results in an integrated top-down/bottom-up approach which has both the computational efficiency of image features and the optimality of a Bayesian formulation. A mean shift technique [5] is used as an efficient diffusion for the Markov chain. The data-driven dynamics and the in-depth exploration of the solution space make the approach less sensitive to dimensionality than particle filters. Our experiments show that the described approach works robustly in very challenging situations with affordable computation; some results are shown in Section VI.

IV. PROBABILISTIC MODELING

Let θ represent the state of the objects in the scene at time t; it consists of the number of objects in the scene, their 3D positions, and other parameters describing their size, shape, and pose. Our goal is to estimate the state at time t, θ(t), given the image observations I(1), . . . , I(t), abbreviated as I(1,...,t). We formulate the tracking problem as computing the maximum a posteriori (MAP) estimate, θ(t)*:

θ(t)* = arg max_{θ(t) ∈ Θ} P(θ(t) | I(1,...,t)) = arg max_{θ(t) ∈ Θ} { P(I(t) | θ(t)) P(θ(t) | I(1,...,t−1)) }   (1)

where Θ is the solution space. Denote by m the state vector of one individual object. A state containing n objects can be written as θ = {(k_1, m_1), . . . , (k_n, m_n)} ∈ Θ_n, where k_i is the unique identity of the i-th object whose parameters are m_i, and Θ_n is the solution space of exactly n objects. The entire solution space is Θ = ∪_{n=0}^{N_max} Θ_n, where N_max is an upper bound on the number of objects. In practice, we compute an approximation of P(θ(t) | I(1,...,t−1)) (details are given later in Section IV-D).

A. 3D Human Shape Model

The parameters of an individual human, m, are defined based on a 3D human shape model. The human body is highly articulated; however, in our case, the human motion is mostly limited to standing or walking, and we do not attempt to capture the detailed shape and articulation parameters of the human body. Thus we use a number of low-dimensional models to capture the gross shape of human bodies.

Fig. 3. A number of 3D human models to capture the gross shape of human bodies.

Ellipsoids fit human body parts well and have the property that their projection is an ellipse with a convenient form [16]. Therefore we model human shape by a composition of multiple ellipsoids corresponding to the head, the torso, and the legs, with fixed spatial relationships. A few such models at characteristic poses are sufficient to capture the gross shape variations of most humans in the scene for mid-resolution images. We use the multi-ellipsoid model to control the model complexity while maintaining a reasonable level of fidelity. We used three such models (one for legs close to each other and two for legs well-split) in our previous work on multi-human segmentation [53]. However, in this work we use only a single model with three ellipsoids, which we found sufficient for tracking.

The model is controlled by two parameters called size and thickness. The size parameter is the 3D height of the model; it also controls the overall scaling of the object in the three directions. The thickness parameter captures extra scaling in the horizontal directions. Besides size and thickness, the parameters also include the image position of the head¹, the 3D orientation of the body, and the 2D inclination of the body. The orientations of the models are quantized into a few levels for computational efficiency. The origin of the rotation is chosen so that 0° corresponds to a human facing the camera. We use 0° and 90° to represent the frontal/back and side views in this work. The 3D models assume that humans are perfectly upright, but in practice people may incline their bodies slightly. We use one parameter to capture the inclination in 2D (as opposed to two parameters in 3D). Therefore, the parameters of the i-th human are m_i = {o_i, x_i, y_i, h_i, f_i, i_i}, which are the orientation, position, size, thickness, and inclination respectively. We also write (x_i, y_i) as u_i.

With a given camera model and a known ground plane, the 3D shape models automatically incorporate the perspective effect of camera projection (change in object image size and shape due to the change in object position and/or camera viewpoint). Compared to 2D shape models (e.g. [13]) or pre-learnt 2D appearance models (e.g. [50]), the 3D models are more easily applicable to a novel viewpoint.

¹The image head location is an equivalent parameterization of the world location on the ground plane (x_w, y_w) given the human height. The two are related by [x, y, 1]^T ∼ [p_1, p_2, p_3 h + p_4][x_w, y_w, 1]^T, where p_i is the i-th column of the camera projection matrix and h is the height of the human. For clarity of presentation, we chose the ground plane to be z = 0.

B. Object Appearance Model

Besides the shape model, we also use a color histogram of the object, p = {p_1, . . . , p_m} (m is the number of bins of the color histogram), defined within the object shape, as a representation of its appearance; this helps establish correspondence in tracking. We use a color histogram because it is insensitive to the non-rigidity of human motion. Furthermore, there exist efficient algorithms, e.g. the mean shift technique [5], to optimize a histogram-based objective function. When calculating the color histogram, a kernel function K_E() with an Epanechnikov profile [5] is applied to weight pixel locations so that the center has a higher weight than the boundary. Such a representation has been used in [6]. Our implementation uses a single RGB histogram with 512 bins (8 for each dimension), over all the samples within the three elliptic regions of our object model.
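As a concrete illustration, the sketch below computes such a kernel-weighted 512-bin RGB histogram. For simplicity it operates on a rectangular patch rather than the three projected elliptic regions, and all function names are our own:

```python
import numpy as np

def epanechnikov_weights(h, w):
    """Epanechnikov kernel profile k(r) = max(1 - r^2, 0) over an h x w
    window, so pixels near the region center weigh more than boundary ones."""
    ys, xs = np.mgrid[0:h, 0:w]
    r2 = ((ys - (h - 1) / 2.0) / (h / 2.0)) ** 2 \
       + ((xs - (w - 1) / 2.0) / (w / 2.0)) ** 2
    return np.maximum(1.0 - r2, 0.0)

def color_histogram(patch, bins_per_channel=8):
    """Kernel-weighted RGB histogram with bins_per_channel^3 bins
    (8 x 8 x 8 = 512 as in the text), normalized to sum to one."""
    h, w, _ = patch.shape
    weights = epanechnikov_weights(h, w).ravel()
    # quantize each 8-bit channel into bins_per_channel levels
    q = (patch.reshape(-1, 3).astype(np.int64) * bins_per_channel) // 256
    idx = q[:, 0] * bins_per_channel ** 2 + q[:, 1] * bins_per_channel + q[:, 2]
    hist = np.bincount(idx, weights=weights, minlength=bins_per_channel ** 3)
    return hist / hist.sum()

# Example: histogram of a random 40 x 20 RGB patch.
patch = np.random.randint(0, 256, size=(40, 20, 3))
p = color_histogram(patch)
assert p.shape == (512,) and abs(p.sum() - 1.0) < 1e-9
```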

C. Background Appearance Model

The background appearance model is a modified version of a Gaussian distribution. Denote by (r̄_j, ḡ_j, b̄_j) and Σ_j = diag{σ_{r_j}², σ_{g_j}², σ_{b_j}²} the mean and the covariance of the color at pixel j. The probability of pixel j being from the background is

P_b(I_j) = P_b(r_j, g_j, b_j) ∝ max{ exp[ −((r_j − r̄_j)/σ_{r_j})² − ((g_j − ḡ_j)/σ_{g_j})² − ((b_j − b̄_j)/σ_{b_j})² ], ε }   (2)

where ε is a small constant. This is a composition of a Gaussian distribution and a uniform distribution. The uniform distribution captures the outliers not modeled by the Gaussian distribution, making the model more robust. The Gaussian parameters (mean and covariance) are updated continuously from the video stream, using only the non-moving regions. More sophisticated background models (e.g. mixture of Gaussians [44] or non-parametric [10]) could be used to account for more variation, but this is not the focus of this work; we assume that comparison with the background model yields adequate foreground blobs.
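A minimal per-pixel sketch of Equation 2 follows, assuming given per-pixel means and standard deviations; the ε value, array shapes, and threshold below are illustrative choices of ours:

```python
import numpy as np

def background_prob(frame, mean, std, eps=1e-3):
    """Per-pixel background probability of Equation 2 (up to a constant):
    a Gaussian term floored by the small uniform constant eps for robustness.
    frame, mean, std are H x W x 3 float arrays (RGB)."""
    z2 = (((frame - mean) / std) ** 2).sum(axis=2)  # sum of squared z-scores
    return np.maximum(np.exp(-z2), eps)

# Synthetic example: a flat gray background model with sigma = 8.
H, W = 240, 360
mean = np.full((H, W, 3), 128.0)
std = np.full((H, W, 3), 8.0)
frame = mean + np.random.randn(H, W, 3) * 8.0
Pb = background_prob(frame, mean, std)
foreground = Pb <= 1e-3  # pixels explained only by the uniform outlier term
```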

D. The Prior Distribution

The prior distribution P(θ(t) | I(1,...,t−1)) is decomposed into two parts:

P(θ(t) | I(1,...,t−1)) ∝ P(θ(t)) P(θ(t) | I(1,...,t−1))   (3)

P(θ(t)) is independent of time and is defined by ∏_{i=1}^{n} P(|S_i|) P(m_i), where S_i is the projected image of the i-th object and |S_i| is its area. The prior on the image area, P(|S_i|), is modeled as proportional to exp(−λ_1 |S_i|) [1 − exp(−λ_2 |S_i|)].² The first term penalizes large total object size, to avoid situations where two hypotheses overlap a large portion of an image blob, while the second term penalizes objects with small image sizes, as they are more likely to be due to image noise. Although the prior on 2D image size could be converted to 3D space, defining this prior in 2D is more natural, because these properties model the reliability of image evidence independent of the camera model. The priors on the human body parameters are considered independent. Thus we have P(m_i) = P(o_i) P(x_i, y_i) P(h_i) P(f_i) P(i_i). We set P(o_frontal) = P(o_profile) = 1/2. P(x_i, y_i) is a uniform distribution over the image region where a human head is plausible. P(h_i) is a Gaussian distribution N(μ_h, σ_h²) truncated to the range [h_min, h_max], and P(f_i) is a Gaussian distribution N(μ_f, σ_f²) truncated to the range [f_min, f_max]. P(i_i) is a Gaussian distribution N(μ_i, σ_i²). In our experiments, we use μ_h = 1.7m, σ_h = 0.2m, h_min = 1.5m, h_max = 1.9m; μ_f = 1, σ_f = 0.2, f_min = 0.8, f_max = 1.2; μ_i = 0, σ_i = 3°. These parameters correspond to common adult body sizes.

²We used a prior on the number of objects in [53] to constrain over-segmentation. However, we found that the prior on the area is more effective, due to the large variation of the image sizes of the objects (caused by the camera perspective effect) and therefore their different contributions to the likelihood.
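The following sketch draws one human hypothesis from these priors; the rejection sampler and the placeholder image bounds for (x, y) are our own choices:

```python
import numpy as np

def sample_truncated_gaussian(mu, sigma, lo, hi, rng):
    """Rejection-sample a Gaussian truncated to [lo, hi]; adequate here
    because the bounds lie close to the mean."""
    while True:
        x = rng.normal(mu, sigma)
        if lo <= x <= hi:
            return x

def sample_human_prior(rng):
    """Draw one hypothesis m = (o, x, y, h, f, i) from the priors of
    Section IV-D. The image region for (x, y) is a placeholder."""
    o = rng.choice([0.0, 90.0])                              # frontal/back or profile, P = 1/2 each
    x, y = rng.uniform(0, 360), rng.uniform(0, 240)          # uniform over plausible head positions
    h = sample_truncated_gaussian(1.7, 0.2, 1.5, 1.9, rng)   # height (m)
    f = sample_truncated_gaussian(1.0, 0.2, 0.8, 1.2, rng)   # thickness factor
    i = rng.normal(0.0, 3.0)                                 # 2D inclination (degrees)
    return o, x, y, h, f, i

rng = np.random.default_rng(0)
print(sample_human_prior(rng))
```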

We approximate the second term of the right side of Eq. 3, P(θ(t) | I(1,...,t−1)), by P(θ(t) | θ(t−1)), assuming θ(t−1) encodes the necessary information from the past observations. For convenience of expression, we rearrange θ(t) and θ(t−1) as θ(t) = {(k_i^(t), m_i^(t))}_{i=1}^{N} and θ(t−1) = {(k_i^(t−1), m_i^(t−1))}_{i=1}^{N}, where N is the overall number of objects present in the two frames, so that exactly one of {k_i^(t) = k_i^(t−1), m_i^(t) = φ, m_i^(t−1) = φ} holds for each i. k_i^(t) = k_i^(t−1) means object k_i^(t) is a tracked object; m_i^(t) = φ means object k_i^(t−1) is a dead object (i.e. its trajectory is terminated); and m_i^(t−1) = φ means object k_i^(t) is a new object. With the rearranged state vector, we have P(θ(t) | θ(t−1)) = ∏_{i=1}^{N} P(m_i^(t) | m_i^(t−1)). The temporal prior of each object follows the definition

P(m_i^(t) | m_i^(t−1)) ∝ { P_assoc(m_i^(t) | m_i^(t−1)),  if k_i^(t) = k_i^(t−1);
                           P_new(m_i^(t)),                 if m_i^(t−1) = φ;
                           P_dead(m_i^(t−1)),              if m_i^(t) = φ }   (4)

We assume that the position and the inclination of an object follow constant-velocity models with Gaussian noise, and that the height and thickness follow a Gaussian distribution (for simplicity of presentation, we omit the velocity terms in the state). We use Kalman filters for temporal estimation; P_assoc is therefore a Gaussian distribution. P_new(m_i^(t)) = P_new(u_i^(t)) and P_dead(m_i^(t−1)) = P_dead(u_i^(t−1)) are the likelihoods of the initialization of a new track at position u_i^(t) and the termination of an existing track at position u_i^(t−1), respectively. They are set empirically according to the distance of the object to the entrances/exits (the boundaries of the image and other areas that people move in/out of). P_new(u) ∼ N(μ(u), Σ_e), where μ(u) is the location of the closest entrance point to u and Σ_e is its associated covariance matrix, which is set manually or through a learning phase. P_dead() follows a similar definition.
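As an illustration of how such a track birth term can be evaluated, the sketch below computes log P_new(u) as a Gaussian centered at the entrance point closest to u; the entrance coordinates, covariance, and function names are ours, and P_dead would be evaluated the same way:

```python
import numpy as np

def log_p_new(u, entrances, cov):
    """log P_new(u) ~ N(mu(u), Sigma_e): evaluate the 2-D Gaussian centered
    at the entrance point closest to u. Entrances and covariance are
    scene-specific inputs."""
    u = np.asarray(u, dtype=float)
    mu = min(entrances, key=lambda e: np.sum((u - np.asarray(e)) ** 2))
    d = u - np.asarray(mu)
    inv = np.linalg.inv(cov)
    return -0.5 * d @ inv @ d \
           - 0.5 * np.log((2 * np.pi) ** 2 * np.linalg.det(cov))

# Example: two image-border entrances and an isotropic covariance (ours).
entrances = [(0.0, 120.0), (360.0, 120.0)]
cov = np.diag([100.0, 100.0])
print(log_p_new((10.0, 115.0), entrances, cov))
```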

E. Joint Image Likelihood for Multiple Objects and Background

The image likelihood P(I|θ) reflects the probability that we observe image I (or some features extracted from I) given state θ. Here we develop a likelihood model based on the color information of the background and the objects. Given a state vector θ, we partition the image into different regions corresponding to the different objects and the background. Denote by S_i the visible part of the i-th object defined by m_i. The visible part of an object is determined by the depth order of all the objects, which can be inferred from their 3D positions and the camera model. The entire object region is S = ∪_{i=1}^{n} S_i = Σ_{i=1}^{n} S_i, since the S_i are disjoint regions. We use S̄ to denote the complementary region of S, i.e. the non-object region. The relationship of the regions is illustrated in Fig.4.

Fig. 4. First pane: the relationship of the visible object regions and the non-object region. Remaining panes: the color likelihood model. In S_i, the likelihood favors both the difference of an object hypothesis from the background and its similarity to its corresponding object in a previous frame. In S̄, the likelihood penalizes difference from the background model. Note that the elliptic models are used for illustration.

In the case of multiple objects, which can overlap in the image, the likelihood of the image given the state cannot simply be decomposed into the likelihoods of the individual objects. Instead, a joint likelihood of the whole image given all objects and the background model must be considered. The joint likelihood P(I|θ) consists of two terms corresponding to the object region and the non-object region:

P(I|θ) = P(I_S | θ) P(I_S̄ | θ)   (5)

After obtaining the S_i by occlusion reasoning, the object region likelihood can be calculated as

P(I_S | θ) = ∏_{i=1}^{n} P(I_{S_i} | m_i) ∝ exp{ λ_S Σ_{i=1}^{n} |S_i| [ −λ_b B(p_i, d_i) + λ_f B(p_i, p̄_i) ] }   (6)

where d_i is the color histogram of the background image within the visibility mask of object i, p_i is the color histogram of the object, and p̄_i is the stored histogram of its corresponding object in the previous frame, all weighted by the kernel function K_E(). B(p, d) = Σ_{j=1}^{m} √(p_j d_j) is the Bhattacharyya coefficient, which reflects the similarity of two histograms.

This likelihood favors both the difference of an object hypothesis from the background and its similarity to its corresponding object in a previous frame (Fig.4). This enables simultaneous segmentation and tracking with the same objective function. We call the two terms background exclusion and object attraction, respectively. The background exclusion concept was also proposed by [33]. λ_b and λ_f weight the relative contributions of the two terms (we constrain λ_b + λ_f = 1). The object attraction term is the same as the likelihood function used in [6]. For an object without a correspondence, i.e. a new object, only the background exclusion part is used.
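For illustration, the sketch below evaluates the Bhattacharyya coefficient and the logarithm of Equation 6 for a list of hypotheses. The dictionary interface is a simplification of ours; occlusion reasoning, which produces the visible areas and masks, is assumed to have been done upstream:

```python
import numpy as np

def bhattacharyya(p, d):
    """Bhattacharyya coefficient B(p, d) = sum_j sqrt(p_j d_j) between two
    normalized histograms: 1 for identical, 0 for disjoint support."""
    return float(np.sum(np.sqrt(p * d)))

def object_region_loglik(objects, lam_S=25.0, lam_b=0.5, lam_f=0.5):
    """Log of Equation 6 up to a constant. Each object carries its visible
    area |S_i|, its kernel-weighted histogram p, the background histogram d
    under its visibility mask, and the stored model histogram p_prev
    (None for a new object, which uses background exclusion only)."""
    total = 0.0
    for obj in objects:
        term = -lam_b * bhattacharyya(obj["p"], obj["d"])           # background exclusion
        if obj.get("p_prev") is not None:
            term += lam_f * bhattacharyya(obj["p"], obj["p_prev"])  # object attraction
        total += lam_S * obj["area"] * term
    return total

# Toy usage with uniform 4-bin histograms.
u = np.full(4, 0.25)
obj = {"area": 300.0, "p": u, "d": u, "p_prev": u}
print(object_region_loglik([obj]))
```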

The non-object likelihood is calculated as

P(I_S̄ | θ) = ∏_{j ∈ S̄} (P_b(I_j))^{λ_S̄} ∝ exp( −λ_S̄ Σ_{j ∈ S̄} e_j )   (7)

where e_j = −log(P_b(I_j)) is the negative log-probability of pixel j belonging to the background model, as defined in Equation 2. λ_S in Equation 6 and λ_S̄ in Equation 7 balance the foreground and the background, given the different probabilistic models being used. The posterior probability is obtained by combining the prior, Equation 3, and the likelihood, Equation 5.

V. COMPUTING MAP BY EFFICIENT MCMC

Computing the MAP is an optimization problem. Due to the joint consideration of an unknown number of objects, the solution space contains subspaces of varying dimensions. It also includes both discrete and continuous variables. This makes the optimization challenging. We use a Markov chain Monte Carlo method with jump/diffusion dynamics to sample the posterior probability. Jumps cause the Markov chain to move between subspaces of different dimensions and to traverse the discrete variables; diffusions make the Markov chain sample the continuous variables. In the process of sampling, the best solution is recorded, and the uncertainty associated with the solution is also obtained.

Fig.5 gives a block diagram of the computation process. The MCMC-based algorithm is an iterative process, starting from an initial state. In each iteration, a candidate is proposed from the state of the previous iteration, assisted by image features. The candidate is accepted probabilistically according to the Metropolis-Hastings rule [17]. The state corresponding to the maximum posterior value is recorded and becomes the solution.

Fig. 5. The block diagram of the MCMC tracking algorithm.

Suppose we want to design a Markov chain with stationary distribution P(θ) = P(θ(t) | I(t), θ(t−1)). At the g-th iteration, we sample a candidate state θ′ from θ_{g−1} according to a proposal distribution q(θ′ | θ_{g−1}). The candidate state θ′ is accepted with probability

p = min{ 1, [P(θ′) q(θ_{g−1} | θ′)] / [P(θ_{g−1}) q(θ′ | θ_{g−1})] }³

If the candidate state θ′ is accepted, θ_g = θ′; otherwise, θ_g = θ_{g−1}. It can be proven that the Markov chain constructed in this way has a stationary distribution equal to P(), independent of the choice of the proposal probability q() and the initial state θ_0 [47]. However, the choice of the proposal probability q() can affect the efficiency of the MCMC significantly. Random proposal probabilities lead to a very slow mixing rate. Using more informed proposal probabilities, e.g. as in data-driven MCMC [48], makes the Markov chain traverse the solution space more efficiently. Therefore the proposal distribution is written as q(θ_g | θ_{g−1}, I). If the proposal probability is informative enough that each sample can be thought of as a hypothesis, then the MCMC approach becomes a stochastic version of the hypothesize-and-test approach. In general, the original version of MCMC has a dimension-matching problem for solution spaces of varying dimensionality. A variation of MCMC, called trans-dimensional MCMC [14], was proposed to solve this problem. However, with appropriate assumptions and simplifications, trans-dimensional MCMC reduces to standard MCMC. We address this issue later in this section.

³Based on our experiments, we find that approximating the ratio in the second term by just the posterior probability ratio, P(θ′)/P(θ_{g−1}), gives almost the same results as the complete computation, hence we use this approximation in our implementation.
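A generic sketch of this acceptance step, with the MAP bookkeeping of Fig.5, is given below. The toy 1-D target and random-walk proposal are ours; setting the log proposal ratio to zero mimics the approximation of footnote 3:

```python
import numpy as np

def metropolis_hastings_step(theta, log_post, propose, rng):
    """One Metropolis-Hastings iteration. `propose` returns a candidate plus
    the log proposal ratio log q(theta|theta') - log q(theta'|theta)."""
    cand, log_q_ratio = propose(theta, rng)
    log_alpha = log_post(cand) - log_post(theta) + log_q_ratio
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        return cand, True
    return theta, False

# Toy usage on a 1-D state.
log_post = lambda x: -0.5 * x ** 2                          # standard-normal target
propose = lambda x, rng: (x + rng.normal(0.0, 0.5), 0.0)    # symmetric walk
rng = np.random.default_rng(0)
theta, best, best_lp = 3.0, 3.0, log_post(3.0)
for _ in range(500):
    theta, _ = metropolis_hastings_step(theta, log_post, propose, rng)
    if log_post(theta) > best_lp:       # record the best (MAP) sample
        best, best_lp = theta, log_post(theta)
```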

A. Markov Chain Dynamics

We design the following reversible dynamics for the Markov chain to traverse the solution space. The dynamics correspond to a proposal distribution with a mixture density q(θ′ | θ_{g−1}, I) = Σ_{a ∈ A} p_a q_a(θ′ | θ_{g−1}, I), where A = {add, remove, establish, break, exchange, diff} is the set of all dynamics. The mixing probabilities p_a are the chances of selecting the different dynamics, and Σ_{a ∈ A} p_a = 1.

We assume that we have the sample of the (g−1)-th iteration, θ_{g−1}^(t) = {(k_1, m_1), . . . , (k_n, m_n)}, and now propose a candidate θ′ for the g-th iteration (t is omitted where there is no ambiguity).

Object hypothesis addition Sample the parameters of a new human hypothesis (k_{n+1}, m_{n+1}) and add it to θ_{g−1}. q_add(θ_{g−1} ∪ {(k_{n+1}, m_{n+1})} | θ_{g−1}, I) is defined in a data-driven way whose details are given later.

Object hypothesis removal Randomly select an existing human hypothesis r ∈ [1, n] with a uniform distribution and remove it. q_remove(θ_{g−1} \ {(k_r, m_r)} | θ_{g−1}) = 1/n. If k_r has a correspondence in θ(t−1), then that object becomes dead.

Establish correspondence Randomly select a new object r in θ_{g−1}^(t) and a dead object r′ in θ(t−1), and establish their temporal correspondence. q_establish(θ′ | θ_{g−1}) ∝ ‖u_r − u_{r′}‖^{−2} over all qualified pairs.

Break correspondence Randomly select an object r with k_r ∈ θ(t−1) with a uniform distribution and change k_r to a new object (the corresponding object in θ(t−1) becomes dead). q_break(θ′ | θ_{g−1}) = 1/n′, where n′ is the number of objects in θ_{g−1}^(t) that have correspondences in the previous frame.

Exchange identity Exchange the IDs of two close-by objects. Randomly select two objects r_1, r_2 ∈ [1, n] and exchange their IDs; q_exchange(r_1, r_2) ∝ ‖u_{r_1} − u_{r_2}‖^{−2}. An identity exchange could also be achieved by composing a break and an establish move; the dedicated move eases traversal, since breaking and establishing correspondences separately may each lead to a big decrease in the probability and are less likely to be accepted.

Parameter update Update the continuous parameters of an object. Randomly select an existing human hypothesis r ∈ [1, n] with a uniform distribution, and update its continuous parameters: q_diff(θ′ | θ_{g−1}) = (1/n) q_d(m′_r | m_r).

Among the above, addition and removal are a pair of reverse moves, as are establishing and breaking correspondences; identity exchange and parameter update are their own reverse moves.
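A minimal dispatcher for these dynamics might look as follows; the kernel interface is ours, and the mixing probabilities are the values reported in Section VI:

```python
import numpy as np

# Mixing probabilities p_a of the six dynamics (values from Section VI).
MOVES = ["add", "remove", "establish", "break", "exchange", "diff"]
P_MOVES = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

def propose_move(theta, rng, kernels):
    """Draw one dynamic a ~ p_a and apply its proposal kernel q_a;
    `kernels` maps move names to functions theta -> theta'."""
    a = rng.choice(MOVES, p=P_MOVES)
    return a, kernels[a](theta, rng)

# Placeholder kernels that leave the state unchanged, for illustration only.
rng = np.random.default_rng(0)
kernels = {a: (lambda th, r: th) for a in MOVES}
move, theta_new = propose_move({"objects": []}, rng, kernels)
```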

B. Informed Proposal Probability

In theory, the proposal probability q() does not affect the stationary distribution. However, different choices of q() lead to different performance. The number of samples needed to get a good solution depends strongly on the proposal probabilities. In this application, the proposal probability for adding a new object and that for updating the object parameters are the two most important ones. We use the following informed proposal probabilities to make the Markov chain more intelligent and thus achieve a higher acceptance rate.

Object addition We add human hypotheses from three cues: foreground boundaries, intensity edges, and foreground residue (the foreground with the existing objects carved out). In [54], a method to detect heads that lie on the boundary of the foreground is described. The basic idea is to find the local vertical peaks of the boundary. The peaks are further verified by checking whether there are enough foreground pixels below them, according to a human height range and the camera model. This detector has a high detection rate and is also effective when the human is small and image edges are unreliable; however, it cannot detect heads in the interior of the foreground blobs. Fig.6(a) shows an example of head detection from foreground boundaries.

The second head detection method is based on an “Ω”-shape head-shoulder model (this term was first introduced in [53]). This detector matches the Ω-shape edge template with the image intensity edges to find head candidates. First, a Canny edge detector is applied to the foreground region of the input image. A distance transform [1] is computed on the edge map. Fig.6(b) shows the exponential edge map, where E(x, y) = exp(−λD(x, y)) (D(x, y) is the distance to the closest edge point, and λ is a factor to control the response field depending on the object scale in the image; we use λ = 0.25). In addition, the coordinates of the closest edge pixel are recorded as C(x, y). The unit image gradient vector O(x, y) is computed only at edge pixels.

Fig. 6. Head detection. (a) Head detection from foreground blob boundaries; (b) distance transform of the Canny edge detection result; (c) the Ω-shape head-shoulder model (black: head-shoulder shape, white: normals); and (d) head detection from intensity edges.

The “Ω”-shape model (see Fig.6(c)) is derived by projecting a generic 3D human model into the image and taking the contour of the whole head and the upper quarter of the torso as the shoulder. The normals of the contour points are also computed. The size of the human model is determined by the camera calibration, assuming an average human height.

Denote by {u_1, ..., u_k} and {v_1, ..., v_k} the positions and the unit normals of the model points, respectively, when the head top is at (x, y). The model is matched with the image as

S(x, y) = (1/k) Σ_{i=1}^{k} e^{−λD(u_i)} ( v_i · O(C(u_i)) )

A head candidate map is constructed by evaluating S(x, y) at every pixel in the dilated foreground region. After smoothing it, we find all peaks above a threshold chosen to give a very high detection rate, although this may also result in a high false alarm rate. An example is shown in Fig.6(d). The false alarms tend to occur in areas of rich texture, where there are abundant edges of various orientations.
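The sketch below implements this chamfer-style score with SciPy's Euclidean distance transform. The template points and normals are assumed to be precomputed offsets relative to the head-top position (template construction is not shown), and the clipping at image borders is our own simplification:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def head_candidate_map(edges, grad, model_pts, model_normals, lam=0.25):
    """Chamfer-style matching score S(x, y) for the Omega-shape template.
    edges: H x W boolean edge map (e.g. Canny output);
    grad:  H x W x 2 unit image gradients, meaningful at edge pixels;
    model_pts / model_normals: k template offsets (dy, dx) from the
    head-top pixel, and their unit normals."""
    # D: distance to the closest edge pixel; C: coordinates of that pixel
    D, C = distance_transform_edt(~edges, return_indices=True)
    H, W = edges.shape
    score = np.zeros((H, W))
    for (dy, dx), n in zip(model_pts, model_normals):
        yy = np.clip(np.arange(H)[:, None] + dy, 0, H - 1)
        xx = np.clip(np.arange(W)[None, :] + dx, 0, W - 1)
        yy, xx = np.broadcast_arrays(yy, xx)
        cy, cx = C[0, yy, xx], C[1, yy, xx]           # nearest edge pixel C(u_i)
        dot = grad[cy, cx, 0] * n[0] + grad[cy, cx, 1] * n[1]
        score += np.exp(-lam * D[yy, xx]) * dot       # e^{-lam D(u_i)} (v_i . O(C(u_i)))
    return score / len(model_pts)

# Tiny synthetic usage.
edges = np.zeros((60, 80), bool); edges[20, 30:50] = True
grad = np.zeros((60, 80, 2)); grad[..., 0] = 1.0      # gradients pointing "down"
S = head_candidate_map(edges, grad, [(0, 0), (-2, 0)], [(1.0, 0.0), (1.0, 0.0)])
```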

Finally, after the human objects obtained from the first two methods are hypothesized and removed from the foreground, the foreground residue map R = F ∩ S̄ (the foreground F with the object region S carved out) is computed. A morphological “open” operation with a vertically elongated structuring element is applied to remove thin bridges and small/thin residues. From each connected component c, human candidates can be generated by assuming that 1) the centroid of c is aligned with the center of the human body; 2) the top center point of c is aligned with the human head; or 3) the bottom center point of c is aligned with the human feet.

The proposal probability for addition combines these three head detection methods: q_a(k, m) = Σ_{i=1}^{3} λ_{a_i} q_{a_i}(k, m), where the λ_{a_i}, i = 1, 2, 3, are the mixing probabilities of the three methods; we use λ_{a_i} = 1/3. q_{a_i}() samples m first and then k: q_{a_i}(k, m) = q_{a_i}(m) q_{a_i}(k|m), and q_{a_i}(m) = q_o(o) q_{a_i}(u) q_h(h) q_f(f) q_i(i). q_{a_i}(u) answers the question “where to add a new human hypothesis”. In practice, q_o(o), q_h(h), q_f(f), and q_i(i) use their respective prior distributions, and q_{a_i}(u) is a mixture of Gaussians based on the bottom-up detection results. For example, denote by HC_1 = {(x_i, y_i)}_{i=1}^{N_1} the head candidates obtained by the first method; then q_{a_1}(u) = q_{a_1}(x, y) ∼ Σ_{i=1}^{N_1} N((x_i, y_i), diag{σ_x², σ_y²}). The definitions of q_{a_2}(u) and q_{a_3}(u) are similar. After u′ is sampled, q(k|m) ∝ q(k|u′) samples k from {k_{d_1}^(t−1), . . . , k_{d_{n_d}}^(t−1), new} according to P(u | u_{d_i}^(t−1)) (see Equation 4), i = 1, . . . , n_d, and P_new(u), where n_d is the number of dead objects.

The addition and removal actions change the dimension of the state vector. When calculating the acceptance probability, we need to compute the ratio of probabilities from spaces of different dimensions. Smith et al. [41] use an explicit strategy from trans-dimensional MCMC [14] to deal with the dimension-matching problem. We do not need an explicit strategy to match the dimensions. Since the trans-dimensional actions only add or remove one object in an iteration, leaving the other objects unchanged, the Jacobian in [14] is unity, as in [41]. So our formulation is just a special case of the more general theory.

Parameter update We use two ways to update the model parameters: q_diff(m′_r | m_r) = λ_{d_1} q_{d_1}(m′_r | m_r) + λ_{d_2} q_{d_2}(m′_r | m_r), with λ_{d_i} = 1/2. q_{d_1}() uses stochastic gradient descent to update the object parameters: q_{d_1}(m′_r | m_r) ∝ N(m_r − k dE/dm, w), where E = −log P(θ(t) | I(t), θ(t−1)) is the energy function, k is a scalar that controls the step size, and w is random noise to avoid local maxima.

A mean shift vector computed in the visible region provides an approximation of the gradient of the object likelihood with respect to the position: q_{d_2}(m′_r | m_r) ∝ N(m_r^ms, w), where m_r^ms is the new location computed by the mean shift procedure (details are given in a separate Appendix). We assume that the change of the posterior probability due to the other components and due to occlusion can be absorbed in the noise term. Mean shift has an adaptive step size and better convergence behavior than numerically computed gradients. The rest of the parameters follow their numerically computed gradients. Compared to the original color-based mean shift tracking, the background exclusion term in Equation 6 can utilize a known background model, which is available for a stationary camera. As we observe in our experiments, tracking using the above likelihood is more robust to changes in the appearance of the object, e.g. when it moves into shadow, than tracking using the object attraction term alone.
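A sketch of the stochastic-gradient proposal q_d1 with a numerically computed gradient is given below; the step size, noise scale, finite-difference width, and toy energy are illustrative values of ours:

```python
import numpy as np

def qd1_propose(m, energy, k=0.1, noise=0.05, h=1e-3, rng=None):
    """Stochastic-gradient-descent proposal q_d1: step down a numerical
    gradient of the energy E = -log posterior, plus Gaussian noise to
    escape local maxima."""
    rng = rng or np.random.default_rng()
    grad = np.array([(energy(m + h * e) - energy(m - h * e)) / (2 * h)
                     for e in np.eye(len(m))])      # central differences
    return m - k * grad + rng.normal(0.0, noise, size=m.shape)

# Toy usage: a quadratic energy with minimum at (1, 2).
energy = lambda m: (m[0] - 1.0) ** 2 + (m[1] - 2.0) ** 2
m = np.array([0.0, 0.0])
for _ in range(100):
    m = qd1_propose(m, energy)
```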

Theoretically, the Markov chain designed here should be irreducible and reversible; however, the use of the above data-driven proposal probabilities makes the approach not conform to the theory exactly. First, irreducibility requires that the Markov chain be able to reach any possible point in the solution space. In practice, however, the proposal probabilities of some points are very small, close to zero. For example, the proposal probability of adding a hypothesis at a position where no head candidate is detected nearby is extremely low. With a finite number of iterations, a state including such a hypothesis will never be sampled. Although this breaks the completeness of the Markov chain, we argue that skipping the parts of the solution space where no sign of an object is observed brings no harm to the quality of the final solution and makes the search process more efficient. Second, the use of mean shift, which is a non-parametric method, makes the chain irreversible. Mean shift can be seen as an approximation of the gradient, while stochastic gradient descent is essentially a Gibbs sampler [39], which is a special case of the Metropolis-Hastings sampler with acceptance ratio always equal to one [25]. However, mean shift is much faster than a random walk for estimating the parameters of the object. We choose to use these techniques at the loss of some theoretical purity, because experimentally they make our method much more efficient and the results are good.

C. Incremental Computation

As the MCMC process may need hundreds or more samples to approximate the distribution, we need an efficient method to compute the likelihood of each proposed state. In one iteration of the algorithm, at most two objects change. This affects the likelihood only locally; therefore the new likelihood can be computed more efficiently by incrementally updating it only within the neighborhood of the changed objects (the area associated with the changed objects and those overlapping with them).

Take the addition action as an example. When a new human hypothesis is added to the state vector, for the likelihood of the non-object region P(I_S̄|θ) we only need to remove those background pixels taken by the new hypothesis. For the likelihood of the object region P(I_S|θ), as the new hypothesis may overlap with some existing hypotheses, we need to re-compute the visibility of the object regions connected to the new hypothesis and then update the likelihoods of these neighboring objects. The incremental computations of the likelihood for the other actions are similar. Although a joint state and a joint likelihood are used, the computation in each iteration is greatly reduced through this incremental computation. This is in contrast to the particle filter, where the evaluation of each particle (a joint state) requires computation of the full joint likelihood.

The appearance models of the tracked objects are updated after processing each frame, to adapt to changes in object appearance. We update the object color histogram using an IIR filter, p^(t) = λ_p p̂^(t) + (1 − λ_p) p^(t−1), where p̂^(t) is the histogram observed in the current frame. We choose to update the appearance conservatively: we use a small λ_p = 0.01, and stop updating if the object is occluded by more than 25% or if its position covariance is too large.
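This conservative update rule can be sketched as follows; the covariance threshold value is our own placeholder:

```python
import numpy as np

def update_appearance(p_prev, p_obs, occlusion, pos_var,
                      lam_p=0.01, max_occ=0.25, max_var=25.0):
    """Conservative IIR update of the object color histogram:
    p(t) = lam_p * p_obs + (1 - lam_p) * p(t-1), skipped when the object is
    occluded by more than 25% or its position covariance is too large."""
    if occlusion > max_occ or pos_var > max_var:
        return p_prev                    # keep the old model unchanged
    return lam_p * p_obs + (1.0 - lam_p) * p_prev
```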

VI. EXPERIMENTAL RESULTS

We have experimented with the system on many types of data and show only some representative results here. We first show results on an outdoor scene video, and then on a standard evaluation dataset of indoor scene videos. Video results are submitted as supplementary material.

Among all the parameters of our approach, many are “natural”, meaning that they correspond to measurable physical quantities (e.g. 3D human height), so setting their values is straightforward. We use the same set of parameters for all the sequences, which indicates that our approach is not sensitive to the choice of parameter values. We list here the values of the parameters not mentioned in the previous sections. For the size prior (Sec. IV-D), λ_1 = 0.04 and λ_2 = 0.002. For the likelihood, λ_f = 0.5 and λ_b = 0.5 in Equation 6, λ_S = 25 in Eqn. 6, and λ_S̄ = 0.005 in Eqn. 7. For the mixing probabilities of the different types of dynamics, we use P_add = 0.1, P_remove = 0.1, P_establish = 0.1, P_break = 0.1, P_exchange = 0.1, and P_diff = 0.5. We also apply a hard constraint of 25 pixels on the minimum image height of a human.

We also want to comment here on the choice of parameters related to the peakedness of a distribution in sampling algorithms. The image likelihood is usually a combination of a number of components (sites, e.g. pixels). Inevitable simplifications in probabilistic modeling (e.g. independence assumptions) may result in excessive peakedness of the distribution, which hurts the performance of sampling algorithms such as MCMC and particle filters: the samples concentrate in one location (the highest peak) of the state space, and the algorithms degenerate into greedy ones. Eliminating the dependencies between components can be extremely difficult or infeasible. From an engineering point of view, one should set the parameter values (e.g. λ_S and λ_S̄, while keeping their ratio constant) so that the likelihood ratios of different hypotheses are reasonable; then the Markov chain can traverse efficiently and particle filters can maintain multiple hypotheses. In a similar spirit, simulated annealing has been used in the sampling process to reduce the effect of peakedness and force convergence [48], [8]; however, the varying temperature means the samples are not drawn from a single posterior distribution.

A. Evaluation on an Outdoor Scene

We show results on an outdoor video sequence, which we call the “Campus Plaza” sequence, containing 900 frames. This sequence is captured from a camera above a building gate with a 40° camera tilt angle. The frame size is 360×240 pixels, and the sampling rate is 30 FPS. In this sequence, 33 humans pass through the scene, with 23 going out of the field of view and 10 going inside a building. The inter-human occlusions in this sequence are large. There are 20 occlusion events overall, 9 of which are heavy occlusions (over 50% of the object is occluded). For MCMC sampling, we use 500 iterations per frame. We show in Fig.7 some sample frames from the result on this sequence. The identities of the objects are shown by their ID numbers displayed on the head.

We evaluate the results by trajectory-based errors. Trajectories whose lengths are less than 10 frames are discarded. Among the 33 human objects, the trajectories of 3 objects are broken once (ID 28→ID 35, ID 31→ID 32, ID 30→ID 41, all between frame 387 and frame 447, as marked with arrows in Fig.7); the rest of the trajectories are correct. Usually the trajectories are initialized once the humans are fully in the scene; some start when the objects are only partially inside. Only the initializations of three objects (objects 31, 50, 52) are noticeably delayed (by 50, 55, and 60 frames, respectively, after they are fully in the scene). Partial occlusion and/or lack of contrast with the background are the causes of the delays. To justify our approach of integrated segmentation and tracking, we compare the tracking result with the result of frame-by-frame segmentation as in [53], using frame-based evaluation metrics. The detection rate and the false alarm rate are 98.13% and 0.27%, respectively. The detection rate and the false alarm rate on the same sequence using segmentation alone are 92.82% and 0.18%. With tracking, not only are the temporal correspondences obtained, but the detection rate is also increased by a large margin while the false alarm rate is kept low.

B. Evaluation on Indoor Scene Sequences

Fig. 8. Tracking evaluation criteria.

Next, we describe the results of our method on an indoor video set, the CAVIAR video corpus⁴ [56]. We test our system on the 26 "shopping center corridor view" sequences, 36,292 frames overall, captured by a camera looking down a corridor. The frame size is 384×288 pixels, and the sampling rate is 25 FPS. Some 2D-3D point correspondences are given from which the camera could be calibrated; however, we compute the camera parameters by an interactive method [26].

The inter-object occlusion in this set is also intensive. There are 96 occlusion events overall, 68 of which are heavy occlusions and 19 of which are near-full occlusions (more than 90% of the object is occluded). Many interactions between humans, such as talking and hand shaking, make this set very difficult for tracking. For MCMC sampling, we again use 500 iterations per frame. For such a big data set, it is infeasible to enumerate the errors as for the "Campus Plaza" sequence. Instead we define five statistical criteria: 1) the number of mostly tracked trajectories; 2) the number of mostly lost trajectories; 3) the number of trajectory fragments; 4) the number of false trajectories (a result trajectory corresponding to no object); and 5) the frequency of identity switches (identity exchange between a pair of result trajectories). Fig.8 illustrates their definitions. These five categories are by no means a complete classification, but they cover most of the typical errors observed on this set. Other performance measures have been proposed in recent evaluations, such as the Multiple Object Tracking Precision and Accuracy of the CLEAR 2006 evaluation [57]. We do not use these measures because they are less intuitive, as they try to integrate multiple factors into one scalar-valued measure.

Table I gives the performance of our method. We developed evaluation software to count the numbers of mostly tracked trajectories, mostly lost trajectories, false alarms, and fragments automatically. Denote a ground-truth trajectory by {G(i), ..., G(i+n)}, where G(t) is the object state at the t-th frame, and denote a hypothesized trajectory by {H(j), ..., H(j+m)}.

⁴In the provided ground truth, there are 232 trajectories overall. However, 5 of these are mostly out of sight (e.g., only one arm or the head top is visible); we set these as "don't care".


Fig. 7. Selected frames (42, 59, 250, 311, 387, 447, 560, and 661) of the tracking results from "Campus Plaza". The numbers on the heads show identities. (Note that the two people sitting at the two sides are part of the background model and are therefore not detected.)


The overlap ratio of the ground-truth object and the hypothesized object at the t-th frame is defined by

$$
\mathrm{Overlap}(G(t), H(t)) = \frac{\big|\mathrm{Reg}(G(t)) \cap \mathrm{Reg}(H(t))\big|}{\big|\mathrm{Reg}(G(t)) \cup \mathrm{Reg}(H(t))\big|}
\tag{8}
$$

where Reg(·) is the image region of the object. If Overlap(G(t), H(t)) > 0.5, we say {G(t), H(t)} is a potential match. The overlap ratio of a ground-truth trajectory and a hypothesized trajectory is defined by

$$
\mathrm{Overlap}(G(i{:}i{+}n), H(j{:}j{+}m)) =
\frac{\sum_{t=\max(i,j)}^{\min(i+n,\,j+m)} \delta\big(\mathrm{Overlap}(G(t),H(t)) > 0.5\big)}
{\max(i+n,\,j+m) - \min(i,\,j) + 1}
\tag{9}
$$

where $\delta(\cdot)$ is an indicator function. Given that one sequence has $N_G$ ground-truth trajectories $\{G_k\}_{k=1}^{N_G}$ and $N_H$ hypothesized trajectories $\{H_l\}_{l=1}^{N_H}$, we compute the overlap ratios for all ground-truth/hypothesis pairs $\{G_k, H_l\}$; the pairs whose overlap ratios are larger than 0.8 are considered potential matches. Then the Hungarian matching algorithm [22] is used to find the best matches, which are considered mostly tracked. To count the mostly lost trajectories, we define a recall ratio by replacing the denominator of Equ. 9 with n+1. If for $G_k$ there is no $H_l$ such that the recall ratio between them is larger than 0.2, we consider $G_k$ mostly lost. To count the false alarms and fragments, we define a precision ratio by replacing the denominator of Equ. 9 with m+1. If for $H_l$ there is no $G_k$ such that the precision ratio between them is larger than 0.2, we consider $H_l$ a false alarm; if there is a $G_k$ such that the precision ratio between them is larger than 0.8 but the overlap ratio is smaller than 0.8, we consider $H_l$ a fragment of $G_k$. We first count the mostly tracked trajectories and remove the matched parts of the ground-truth tracks. Second, we count the trajectory fragments with a greedy, iterative algorithm: at each round, the fragment with the highest overlap ratio is found, and the matched part of the ground-truth track is removed; this procedure repeats until no valid fragments remain. Lastly, we count the mostly lost trajectories and the false alarms. This algorithm cannot classify all ground-truth and hypothesized tracks; the unlabeled ones are mainly due to identity switches. We count the frequency of identity switches visually.
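To make the protocol concrete, the following Python sketch (our illustration; helper names such as box_overlap and trajectory_overlap are hypothetical, and axis-aligned boxes stand in for the Reg(·) regions) computes the per-frame overlap of Equ. 8, the trajectory overlap of Equ. 9, and the Hungarian matching step over potential matches:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_overlap(g, h):
    """Equ. 8: intersection-over-union of two boxes (x1, y1, x2, y2);
    axis-aligned boxes stand in for the paper's Reg() regions."""
    ix = max(0.0, min(g[2], h[2]) - max(g[0], h[0]))
    iy = max(0.0, min(g[3], h[3]) - max(g[1], h[1]))
    inter = ix * iy
    union = ((g[2] - g[0]) * (g[3] - g[1])
             + (h[2] - h[0]) * (h[3] - h[1]) - inter)
    return inter / (union + 1e-9)

def trajectory_overlap(G, H):
    """Equ. 9: fraction of the union of the two time spans on which
    the per-frame overlap exceeds 0.5. G and H map frame -> box."""
    fg, fh = sorted(G), sorted(H)
    matched = sum(1 for t in range(max(fg[0], fh[0]), min(fg[-1], fh[-1]) + 1)
                  if t in G and t in H and box_overlap(G[t], H[t]) > 0.5)
    span = max(fg[-1], fh[-1]) - min(fg[0], fh[0]) + 1
    return matched / span

def mostly_tracked(gts, hyps, thresh=0.8):
    """Hungarian matching [22] restricted to pairs whose trajectory
    overlap exceeds the threshold; returns matched (gt, hyp) pairs."""
    BIG = 1e6
    cost = np.full((len(gts), len(hyps)), BIG)
    for k, G in enumerate(gts):
        for l, H in enumerate(hyps):
            ov = trajectory_overlap(G, H)
            if ov > thresh:
                cost[k, l] = -ov  # maximize total overlap
    rows, cols = linear_sum_assignment(cost)
    return [(k, l) for k, l in zip(rows, cols) if cost[k, l] < BIG]
```

The mostly-lost, false-alarm, and fragment counts follow by swapping the denominator of trajectory_overlap for n+1 or m+1, as described above.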

Some sample frames and results are shown in Fig.9. Most of the missed detections are due to humans wearing clothing with colors very similar to the background, so that part of the object is misclassified as background; see frame 1413 of Fig.9(b) for an example. The trajectory fragmentations and ID switches are mainly due to full occlusions; see frame 496 of Fig.9(a) and frame 316 of Fig.9(b) for examples. Our method deals with partial occlusion well. For full occlusion, classifying an object as entering an "occluded" state and re-associating it when it reappears could potentially improve the performance. The false alarms are mainly due to shadows, reflections, and sudden brightness changes that are misclassified as foreground; see frame 563 of Fig.9(a). A more sophisticated background model and shadow model (e.g., [32]) could improve the result. In general, our method performs reasonably well on the CAVIAR set, though not as well as on the "Campus Plaza" sequence, mainly due to the difficulties mentioned above.

TABLE I
RESULTS OF PERFORMANCE EVALUATION ON THE CAVIAR SET (227 TRAJECTORIES). MT: MOSTLY TRACKED, ML: MOSTLY LOST, FGMT: FRAGMENT, FA: FALSE ALARM, IDS: IDENTITY SWITCH.

             MT      ML     Fgmt   FA    IDS
Number       141     12     89     27    22
Percentage   62.1%   5.3%   -      -     -

The running speed of the system is about 2 FPS on a 2.8 GHz Pentium IV CPU. The implementation is in C++ without any special optimization.

VII. CONCLUSION AND FUTURE WORK

We have presented a principled approach to simultaneously detect and track humans in a crowded scene acquired from a single stationary camera. We take a model-based approach and formulate the problem as a Bayesian MAP estimation problem: compute the best interpretation of the image observations collectively from the 3D human shape model, the acquired human appearance models, the background appearance model, the camera model, the assumption that humans move on a known ground plane, and the object priors. The image is modeled as a composition of an unknown number of possibly overlapping objects and a background. Inference is performed by an MCMC-based approach that explores the joint solution space, with data-driven proposal probabilities used to direct the Markov chain dynamics. Experiments and evaluations on challenging real-life data show promising results.

The success of our approach mainly lies in the integration of the top-down Bayesian formulation, which follows the image formation process, with bottom-up features extracted directly from the images. This integration combines the computational efficiency of image features with the optimality of a Bayesian formulation.

This work could be improved or extended in several ways. 1) Extension to track multiple classes of objects (e.g., humans and cars) by adding model switching to the MCMC dynamics. 2) Tracking, which operates on a two-frame interval, has a very local view, so ambiguities inevitably remain, especially when tracking fully occluded objects. Analysis at the level of trajectories may resolve these local ambiguities (e.g., [29]); such analysis may take into account prior knowledge of valid object trajectories, including their starting and ending points.

APPENDIX I
SINGLE OBJECT TRACKING WITH BACKGROUND KNOWLEDGE USING MEAN-SHIFT

Denote by $\mathbf{p}$, $\mathbf{p}(\mathbf{u})$, and $\mathbf{b}(\mathbf{u})$ the color histogram of the object learnt online, the color histogram of the object at location $\mathbf{u}$, and the color histogram of the background at the corresponding region, respectively. Let $\{\mathbf{x}_i\}_{i=1,\dots,n}$ be the pixel locations in the region with the object center at $\mathbf{u}$. A kernel with profile $k(\cdot)$ and bandwidth $h$ is used to assign smaller weights to pixels farther from the center. An $m$-bin color histogram $\mathbf{p}(\mathbf{u}) = \{p_j(\mathbf{u})\}_{j=1,\dots,m}$ is constructed as

$$
p_j(\mathbf{u}) = \sum_{i=1}^{n} k\!\left(\left\|\frac{\mathbf{u}-\mathbf{x}_i}{h}\right\|^{2}\right)\delta\big[b_f(\mathbf{x}_i)-j\big],
$$

where the function $b_f(\cdot)$ maps a pixel location in the current frame to its histogram bin ($b_b(\cdot)$ does the same for the background image) and $\delta$ is the Kronecker delta; $\mathbf{p}$ and $\mathbf{b}(\mathbf{u})$ are constructed similarly.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCEThis article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Page 12: Segmentation and Tracking of Multiple Humans in …slipguru.disi.unige.it/readinggroup/papers_vis/zhao07humans.pdf · Segmentation and Tracking of Multiple Humans in Crowded Environments

12

Fig. 9. Selected frames of the tracking results from the CAVIAR set: (a) sequence "ThreePastShop2cor"; (b) sequence "TwoEnterShop2cor".

We would like to optimize

$$
L(\mathbf{u}) = -\lambda_b \underbrace{B\big(\mathbf{p}(\mathbf{u}),\mathbf{b}(\mathbf{u})\big)}_{L_1(\mathbf{u})}
+ \lambda_f \underbrace{B\big(\mathbf{p}(\mathbf{u}),\mathbf{p}\big)}_{L_2(\mathbf{u})}
\tag{10}
$$

where $B(\cdot,\cdot)$ is the Bhattacharyya coefficient. Applying a Taylor expansion at $\mathbf{p}(\mathbf{u}_0)$ and $\mathbf{b}(\mathbf{u}_0)$, where $\mathbf{u}_0$ is the predicted position of the object, we have

$$
\begin{aligned}
L_1(\mathbf{u}) &= B\big(\mathbf{p}(\mathbf{u}),\mathbf{b}(\mathbf{u})\big)\\
&\approx B\big(\mathbf{p}(\mathbf{u}_0),\mathbf{b}(\mathbf{u}_0)\big)
 + B'_{\mathbf{p}}(\mathbf{u}_0)\big(\mathbf{p}(\mathbf{u})-\mathbf{p}(\mathbf{u}_0)\big)
 + B'_{\mathbf{b}}(\mathbf{u}_0)\big(\mathbf{b}(\mathbf{u})-\mathbf{b}(\mathbf{u}_0)\big)\\
&= c_1 + \frac{1}{2}\sum_{j=1}^{m}\sqrt{\frac{b_j(\mathbf{u}_0)}{p_j(\mathbf{u}_0)}}\,p_j(\mathbf{u})
 + \frac{1}{2}\sum_{j=1}^{m}\sqrt{\frac{p_j(\mathbf{u}_0)}{b_j(\mathbf{u}_0)}}\,b_j(\mathbf{u})\\
&= c_1 + \sum_{i=1}^{n} k\!\left(\left\|\frac{\mathbf{u}-\mathbf{x}_i}{h}\right\|^{2}\right) w_i^{b}
\end{aligned}
\tag{11}
$$

where

$$
w_i^{b} = \frac{1}{2}\sum_{j=1}^{m}\left(
\sqrt{\frac{b_j(\mathbf{u}_0)}{p_j(\mathbf{u}_0)}}\,\delta\big[b_f(\mathbf{x}_i)-j\big]
+ \sqrt{\frac{p_j(\mathbf{u}_0)}{b_j(\mathbf{u}_0)}}\,\delta\big[b_b(\mathbf{x}_i)-j\big]\right).
$$

Similarly, following [6],

$$
\begin{aligned}
L_2(\mathbf{u}) = B\big(\mathbf{p}(\mathbf{u}),\mathbf{p}\big)
&\approx \frac{1}{2}\sum_{j=1}^{m}\sqrt{p_j(\mathbf{u}_0)\,p_j}
 + \frac{1}{2}\sum_{j=1}^{m} p_j(\mathbf{u})\sqrt{\frac{p_j}{p_j(\mathbf{u}_0)}}\\
&= c_2 + \sum_{i=1}^{n} w_i^{f}\, k\!\left(\left\|\frac{\mathbf{u}-\mathbf{x}_i}{h}\right\|^{2}\right)
\end{aligned}
\tag{12}
$$

where $w_i^{f} = \frac{1}{2}\sum_{j=1}^{m}\sqrt{\frac{p_j}{p_j(\mathbf{u}_0)}}\,\delta\big[b_f(\mathbf{x}_i)-j\big]$. Therefore,

$$
L(\mathbf{u}) = c_1 + c_2 + \sum_{i=1}^{n}
\underbrace{\big(\lambda_f w_i^{f} - \lambda_b w_i^{b}\big)}_{w_i}\,
k\!\left(\left\|\frac{\mathbf{u}-\mathbf{x}_i}{h}\right\|^{2}\right)
\tag{13}
$$

The last term of $L(\mathbf{u})$ is a kernel density estimate computed with profile $k(\cdot)$ at $\mathbf{u}$, so the mean-shift algorithm with negative weights [4] applies. Using the Epanechnikov profile [6], $L(\mathbf{u})$ is increased by moving to the new location

$$
\mathbf{u}' \leftarrow \frac{\sum_{i=1}^{n}\mathbf{x}_i w_i}{\sum_{i=1}^{n}|w_i|}
\tag{14}
$$
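As a minimal sketch of this background-aware mean-shift update (our illustration under simplifying assumptions: single-channel patches, an Epanechnikov kernel, and hypothetical function names), one iteration of Equ. 14 can be written as:

```python
import numpy as np

def color_histogram(patch, bins=16):
    """Kernel-weighted m-bin histogram over a single-channel patch,
    using an Epanechnikov profile centered on the patch; a simplified
    stand-in for the p(u) of Equ. 10."""
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r2 = ((yy - h / 2) / (h / 2)) ** 2 + ((xx - w / 2) / (w / 2)) ** 2
    k = np.maximum(0.0, 1.0 - r2)                 # Epanechnikov profile
    idx = (patch.astype(np.int64) * bins) // 256  # bin index per pixel
    hist = np.bincount(idx.ravel(), weights=k.ravel(), minlength=bins)
    return hist / (hist.sum() + 1e-9), idx, k

def meanshift_step(patch, bg_patch, p_model, lam_f=1.0, lam_b=0.5, bins=16):
    """One iteration of the background-aware mean-shift update:
    per-pixel weights w_i = lam_f*w_f - lam_b*w_b (Equ. 11-13),
    then the shift of Equ. 14 normalized by sum |w_i|."""
    p0, idx_f, k = color_histogram(patch, bins)
    b0, idx_b, _ = color_histogram(bg_patch, bins)
    eps = 1e-9
    wf = 0.5 * np.sqrt(p_model / (p0 + eps))[idx_f]        # Equ. 12 weights
    wb = 0.5 * (np.sqrt(b0 / (p0 + eps))[idx_f]
                + np.sqrt(p0 / (b0 + eps))[idx_b])         # Equ. 11 weights
    w = np.where(k > 0, lam_f * wf - lam_b * wb, 0.0)      # Equ. 13 weights
    h, ww = patch.shape
    yy, xx = np.mgrid[0:h, 0:ww]
    denom = np.abs(w).sum() + eps
    dy = (w * (yy - h / 2)).sum() / denom                  # Equ. 14 shift
    dx = (w * (xx - ww / 2)).sum() / denom
    return dy, dx  # shift to add to the current window center
```

Here p_model is assumed to be the bins-length normalized histogram of the object model learnt online; negative weights are allowed, which is what pushes the window away from background-colored pixels.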

ACKNOWLEDGMENT

This research was funded, in part, by the U.S. Government VACE program.

REFERENCES

[1] G. Borgefors. Distance transformations in digital images. Computer Vision, Graphics, and Image Processing, 34(3):344-371, 1986.

[2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(11):1222-1239, 2001.

[3] I. Cohen and G. Medioni. Detecting and tracking moving objects for video surveillance. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, II:2319-2326, 1999.

[4] R. T. Collins. Mean-shift blob tracking through scale space. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, II:234-240, 2003.

[5] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(5):603-619, 2002.

[6] D. Comaniciu and P. Meer. Kernel-based object tracking. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(5):564-577, 2003.

[7] L. Davis, V. Philomin, and R. Duraiswami. Tracking humans from a moving platform. In Proc. Int'l Conf. Pattern Recognition, IV:171-178, 2000.

[8] J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture by annealed particle filtering. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, II:126-133, 2000.

[9] A. Elgammal and L. Davis. Probabilistic framework for segmenting people under occlusion. In Proc. Int'l Conf. Computer Vision, II:145-152, 2001.

[10] A. Elgammal, R. Duraiswami, D. Harwood, and L. Davis. Background and foreground modeling using non-parametric kernel density estimation for visual surveillance. Proc. IEEE, 90(7):1151-1163, 2002.

[11] F. Fleuret, R. Lengagne, and P. Fua. Fixed point probability field for complex occlusion handling. In Proc. Int'l Conf. Computer Vision, I:694-700, 2005.

[12] D. Gatica-Perez, J.-M. Odobez, S. Ba, K. Smith, and G. Lathoud. Tracking people in meetings with particles. In Proc. Int'l Workshop on Image Analysis for Multimedia Interactive Services, 2005.

[13] D. Gavrila and V. Philomin. Real-time object detection for "smart" vehicles. In Proc. Int'l Conf. Computer Vision, I:87-93, 1999.

[14] P. Green. Trans-dimensional Markov chain Monte Carlo. Oxford University Press, 2003.

[15] I. Haritaoglu, D. Harwood, and L. Davis. W4: Real-time surveillance of people and their activities. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):809-830, 2000.

[16] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.

[17] W. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97-109, 1970.

[18] S. Hongeng and R. Nevatia. Multi-agent event recognition. In Proc. Int'l Conf. Computer Vision, II:84-91, 2001.

[19] M. Isard and J. MacCormick. BraMBLe: A Bayesian multiple-blob tracker. In Proc. Int'l Conf. Computer Vision, II:34-41, 2001.

[20] J. Kang, I. Cohen, and G. Medioni. Continuous tracking within and across camera streams. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, I:267-272, 2003.

[21] Z. Khan, T. Balch, and F. Dellaert. MCMC-based particle filtering for tracking a variable number of interacting targets. IEEE Trans. Pattern Analysis and Machine Intelligence, 27(11):1805-1819, 2005.

[22] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, II:83-87, 1955.

[23] M.-W. Lee and I. Cohen. A model-based approach for estimating human 3D poses in static images. IEEE Trans. Pattern Analysis and Machine Intelligence, 28(6):905-916, 2006.

[24] A. Lipton, H. Fujiyoshi, and R. Patil. Moving target classification and tracking from real-time video. In Proc. DARPA Image Understanding Workshop, pp. 129-136, 1998.

[25] J. Liu. Metropolized Gibbs sampler. In Monte Carlo Strategies in Scientific Computing, Springer-Verlag, New York, 2001.

[26] F. Lv, T. Zhao, and R. Nevatia. Self-calibration of a camera from video of a walking human. IEEE Trans. Pattern Analysis and Machine Intelligence, 28(9):1513-1518, 2006.

[27] J. MacCormick and A. Blake. A probabilistic exclusion principle for tracking multiple objects. In Proc. Int'l Conf. Computer Vision, I:572-578, 1999.

[28] A. Mittal and L. Davis. M2Tracker: A multi-view approach to segmenting and tracking people in a cluttered scene using region-based stereo. In Proc. European Conf. Computer Vision, II:18-33, 2002.

[29] P. Nillius, J. Sullivan, and S. Carlsson. Multi-target tracking - linking identities using Bayesian network inference. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, II:2187-2194, 2006.

[30] K. Okuma, A. Taleghani, N. de Freitas, J. Little, and D. Lowe. A boosted particle filter: Multitarget detection and tracking. In Proc. European Conf. Computer Vision, I:28-39, 2004.

[31] C. Papageorgiou, T. Evgeniou, and T. Poggio. A trainable pedestrian detection system. In Proc. of Intelligent Vehicles, pp. 241-246, 1998.

[32] A. Prati, I. Mikic, M. Trivedi, and R. Cucchiara. Detecting moving shadows: Algorithms and evaluation. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(7):918-923, 2003.

[33] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet. Color-based probabilistic tracking. In Proc. European Conf. Computer Vision, I:661-675, 2002.

[34] D. Ramanan, D. Forsyth, and A. Zisserman. Strike a pose: Tracking people by finding stylized poses. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, I:271-278, 2005.

[35] C. Rasmussen and G. D. Hager. Probabilistic data association methods for tracking complex visual objects. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(6):560-576, 2001.

[36] J. Rittscher, P. Tu, and N. Krahnstoever. Simultaneous estimation of segmentation and shape. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, II:487-493, 2005.

[37] R. Rosales and S. Sclaroff. 3D trajectory recovery for tracking multiple objects and trajectory guided recognition of actions. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, II:2117-2123, 1999.

[38] H. Rue and M. A. Hurn. Bayesian object identification. Biometrika, 86(3):649-660, 1999.

[39] S. Geman and C.-R. Hwang. Diffusions for global optimization. SIAM J. on Control and Optimization, 24(5):1031-1043, 1986.

[40] N. Siebel and S. Maybank. Fusion of multiple tracking algorithms for robust people tracking. In Proc. European Conf. Computer Vision, IV:373-387, 2002.

[41] K. Smith, D. Gatica-Perez, and J.-M. Odobez. Using particles to track varying numbers of interacting people. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, I:962-969, 2005.

[42] X. Song and R. Nevatia. Combined face-body tracking in indoor environment. In Proc. Int'l Conf. Pattern Recognition, IV:159-162, 2004.

[43] X. Song and R. Nevatia. A model-based vehicle segmentation method for tracking. In Proc. Int'l Conf. Computer Vision, II:1124-1131, 2005.

[44] C. Stauffer and E. Grimson. Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):747-757, 2000.

[45] H. Tao, H. Sawhney, and R. Kumar. Object tracking with Bayesian estimation of dynamic layer representations. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(1):75-89, 2002.

[46] H. Tao, H. Sawhney, and R. Kumar. A sampling algorithm for tracking multiple objects. In Proc. Workshop on Vision Algorithms, 1999.

[47] L. Tierney. Markov chain concepts related to sampling algorithms. In Markov Chain Monte Carlo in Practice, pp. 59-74, 1996.

[48] Z. W. Tu and S. C. Zhu. Image segmentation by data-driven Markov chain Monte Carlo. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(5):651-673, 2002.

[49] Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1):1-41, 2000.

[50] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In Proc. Int'l Conf. Computer Vision, I:90-97, 2005.

[51] T. Yu and Y. Wu. Collaborative tracking of multiple targets. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, I:834-841, 2004.

[52] T. Zhao, M. Aggarwal, R. Kumar, and H. Sawhney. Real-time wide area multi-camera stereo tracking. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, I:976-983, 2005.

[53] T. Zhao and R. Nevatia. Bayesian human segmentation in crowded situations. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, II:459-466, 2003.

[54] T. Zhao and R. Nevatia. Tracking multiple humans in complex situations. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(9):1208-1221, 2004.

[55] T. Zhao and R. Nevatia. Tracking multiple humans in crowded environment. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, II:406-413, 2004.

[56] The CAVIAR data set. http://homepages.inf.ed.ac.uk/rbf/CAVIAR/

[57] CLEAR 2006 Evaluation Campaign and Workshop. http://isl.ira.uka.de/clear06/

Tao Zhao received the BEng degree from the Department of Computer Science and Technology, Tsinghua University, China, in 1998. He received the MSc and PhD degrees from the Department of Computer Science at the University of Southern California in 2001 and 2003, respectively. He was with Sarnoff Corporation, Princeton, New Jersey, from 2003 to 2006. He is currently with Intuitive Surgical Incorporated, Sunnyvale, California, working on computer vision applications for medicine and surgery. His research interests include computer vision, machine learning, and pattern recognition. His experience has been in visual surveillance, human motion analysis, aerial image analysis, and medical image analysis. He is a member of the IEEE and the IEEE Computer Society.

Ram Nevatia received his PhD degree from Stanford University with a specialty in computer vision. He has been with the University of Southern California since 1975, where he is currently a Professor of Computer Science and Electrical Engineering. He is also Director of the Institute for Robotics and Intelligent Systems. He has been principal investigator of major government-funded computer vision research programs for over 25 years. Dr. Nevatia has made important contributions to several areas of computer vision, including shape description, object recognition, stereo analysis, aerial image analysis, tracking of humans, and event recognition. Dr. Nevatia is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and of the American Association for Artificial Intelligence (AAAI). He is an associate editor of the Pattern Recognition and Computer Vision and Image Understanding journals. Dr. Nevatia is the author of two books, several book chapters, and over 100 refereed technical papers.

Bo Wu received the BEng and MEng degrees from the Department of Computer Science and Technology, Tsinghua University, Beijing, China, in 2002 and 2004, respectively. He is currently a PhD candidate at the Computer Science Department, University of Southern California, Los Angeles. His research interests include computer vision, machine learning, and pattern recognition. He is a student member of the IEEE Computer Society.
