
Multi-Views Tracking Within and Across Uncalibrated Camera Streams

Jinman Kang IRIS, USC Computer Vision Group University of Southern California Los Angeles, CA, 90098-0273

1-213-740-6435

[email protected]

Isaac Cohen IRIS, USC Computer Vision Group University of Southern California Los Angeles, CA, 90098-0273

1-213-740-6434

[email protected]

Gérard Medioni IRIS, USC Computer Vision Group University of Southern California Los Angeles, CA, 90098-0273

1-213-740-6440

[email protected]

ABSTRACT
This paper presents novel approaches for the continuous detection and tracking of moving objects observed by multiple stationary or moving cameras. Stationary video streams are registered using a ground plane homography, and the trajectories derived by the Tensor Voting formalism are integrated across cameras by a spatio-temporal homography. The Tensor Voting based tracking approach provides smooth and continuous trajectories and bounding boxes, ensuring a minimum registration error. In the more general case of moving cameras, we present an approach for integrating objects' trajectories across cameras by simultaneously processing the video streams. The detection of moving objects from a moving camera is performed by defining an adaptive background model that uses an affine-based camera motion approximation. The relative motion between cameras is approximated by a combination of affine and perspective transforms, while the objects' dynamics are modeled by a Kalman Filter. The shape and appearance of moving objects are also taken into account using a probabilistic framework. The maximization of the joint probability model allows tracking moving objects across the cameras. We demonstrate the performance of the proposed approaches on several video surveillance sequences.

Categories and Subject Descriptors I.4.8 [Image Processing and Computer Vision]: Scene Analysis – motion and tracking.

General Terms Algorithms, Experimentation, Theory

Keywords Video Surveillance, Video Analysis, Multiple Cameras, Heterogeneous Cameras, Detection, Tracking, Tensor Voting, Camera Registration, Kalman Filter, Joint Probability Data Association Filter.

1. INTRODUCTION
Video surveillance of a large facility such as a warehouse, a campus or an airport usually requires a set of cameras to ensure complete coverage of the scene. Understanding human activity in such scenes requires the integration of video cues acquired by a set of cameras, as the activity usually unfolds in a large area that cannot be monitored by a single camera. Multiple stationary cameras are commonly used for monitoring large areas, as they provide wide coverage, good image resolution, and allow the inference of additional characteristics necessary for activity description and recognition. Omnicameras provide an alternative, but can only be used for close-range objects in order to guarantee sufficient image resolution for moving objects. Moving cameras, such as pan-tilt-zoom and hand-held cameras, allow monitoring of dynamically changing regions of interest (ROI). They dynamically change the focus of the monitored region and provide a larger coverage than stationary cameras. In most deployed video surveillance systems, a combination of the two is used for providing good coverage of the scene. While detection of moving objects is performed independently in each video stream, understanding the activity monitored by the network of cameras requires integration of the trajectories of moving objects across multiple cameras.

Tracking objects across multiple stationary and moving cameras is a challenging task, as it requires space and time registration of the trajectories recovered from each camera [9]. In addition, moving cameras make the problem even more challenging, as we have to keep track of the camera's motion in order to integrate trajectories. Several algorithms for tracking moving objects across multiple stationary cameras have been proposed recently, and most of them use color distribution as the main cue for tracking objects across views [12][19][10][2]. Color information can easily be biased by several factors such as illumination, shadows, blob segmentation, appearance change, or different camera controls. Thus, color cues are not very reliable for tracking moving objects across large scenes where such variations are frequent. Another important component is the temporal alignment of the video streams. The above-mentioned methods are limited to the case of synchronized cameras for ensuring correspondence across views. In [6], the author proposed an approach for space and time self-calibration of cameras, but the proposed approach is limited to small moving objects and top-down views where the observed shapes are similar across views and the depth of the object is not significant. Automatic camera hand-off has been proposed in [20][22]. Tracking moving objects from a non-stationary camera relies on the accurate estimation of the camera motion [16][17][23]. An affine transformation is commonly used for the stabilization of the camera motion.


The simultaneous integration of information from both moving and stationary cameras has not been extensively studied, in spite of its large implications for video surveillance applications. In [24], the authors proposed a multiple human tracking system using multiple stationary and pan-tilt-zoom (PTZ) cameras, where the control of the moving camera module depends on the stationary camera module: the FOV of the moving camera is limited to the FOV of the stationary camera. In [3], the authors proposed integrating information across multiple cameras for detecting and tracking multiple persons, but the integrated information is used for choosing an optimal viewpoint within a limited FOV.

Here we focus on detecting and tracking multiple objects with a heterogeneous set of cameras with little or no spatial overlap. We propose two novel approaches for integrating information from multiple cameras. In the case of multiple stationary cameras, a ground plane homography, as proposed in [9], is used for registering the cameras, and a spatio-temporal homography is used for registering the trajectories of each moving object. This approach allows us to continuously track moving objects across two uncalibrated and unsynchronized stationary cameras. In [8], the global inference of trajectories and bounding boxes using a Tensor Voting based tracking approach was proposed in order to reduce the registration error when each trajectory is registered using the spatio-temporal homography. The novelty of the first approach consists in registering unsynchronized video streams by aligning trajectories across views using a spatio-temporal homography.

In the case of combining moving and stationary cameras, as proposed in [10], an affine transform between pairs of frames is used for stabilizing the moving camera sequence, and a perspective transformation is used for registering the moving and stationary cameras. The detection of moving objects in a video stream acquired by a moving camera is performed by a modified background learning method relying on a robust affine registration and a sliding window approach. The tracking of detected moving objects in each view is reformulated as the maximization of a joint probability model. The joint probability model consists of an appearance and a motion probability model. The appearance probability model is a piecewise color similarity measurement based on a polar representation of the detected blobs. The motion probability model is computed by a Kalman filter (KF) tracker. The novelty of the second approach consists in handling multiple trajectories observed by the moving and stationary cameras in the same KF framework. It allows deriving a more accurate motion measurement for objects simultaneously viewed by the two cameras and an automatic handling of occlusions and errors in the motion detection.

The paper is organized as follows: Section 2 briefly presents the Tensor Voting based tracking approach for stationary cameras. Section 3 presents the proposed approach for space-time registration of trajectories across a network of stationary cameras. In Section 4 we describe the method for detecting moving objects in video sequences acquired by a non-stationary camera. Section 5 presents the geometric registration of heterogeneous cameras. Section 6 introduces the joint probability model used for tracking moving objects across a network of heterogeneous cameras. A set of experimental results on tracking across multiple stationary cameras and multiple heterogeneous cameras is presented and discussed in Section 7. Finally, Section 8 concludes our paper with a discussion on future work.

2. TENSOR VOTING BASED TRACKING
In [18][8], a Tensor Voting based tracking approach was proposed. It provides an efficient way of handling discontinuous trajectories and correcting fragmented bounding boxes of detected moving objects. The tracking problem is reformulated as a problem of extracting salient curves and correcting bounding boxes in 2D+t space. We propose to use the Tensor Voting based tracking proposed in [8] for simultaneously smoothing moving objects' trajectories and correcting their bounding boxes. In Figure 1, we show an example of object tracking and trajectory completion using the presented approach.

Figure 1. Tensor Voting based tracking example (red circles: occlusions)

3. REGISTRATION OF MULTIPLE STATIONARY CAMERAS
Multiple stationary cameras provide a good coverage of the scene but require the integration of information across cameras. While geometric registration is performed offline and allows mapping the relative positions of the cameras with regard to the scene, object trajectories cannot usually be mapped across the views. Indeed, an object's trajectory is characterized by the space-time evolution of the bounding boxes of the detected moving object. The locations of the centroid across multiple cameras do not correspond to the same physical 3D point, and therefore cannot be registered using the same perspective transform used for registering the cameras' locations with respect to the ground plane. Furthermore, the cameras are not necessarily synchronized and may have different shutter speeds, requiring a synchronization of the computed trajectories prior to their integration. We propose an approach based on the perspective registration of the multiple views and of the trajectories in 2D+t space.

3.1 Spatial Registration of Multiple Cameras

$$
A\,\mathbf{h} =
\begin{bmatrix}
x_1 & y_1 & 1 & 0 & 0 & 0 & -x'_1 x_1 & -x'_1 y_1 & -x'_1\\
0 & 0 & 0 & x_1 & y_1 & 1 & -y'_1 x_1 & -y'_1 y_1 & -y'_1\\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\
x_4 & y_4 & 1 & 0 & 0 & 0 & -x'_4 x_4 & -x'_4 y_4 & -x'_4\\
0 & 0 & 0 & x_4 & y_4 & 1 & -y'_4 x_4 & -y'_4 y_4 & -y'_4
\end{bmatrix}
\begin{bmatrix} h_{11}\\ h_{12}\\ h_{13}\\ h_{21}\\ h_{22}\\ h_{23}\\ h_{31}\\ h_{32}\\ h_{33} \end{bmatrix}
= \mathbf{0}
\qquad (1)
$$

where $(x_i, y_i) \leftrightarrow (x'_i, y'_i)$, $i = 1,\dots,4$, are the matching points in the two views and $h_{11},\dots,h_{33}$ are the entries of the homography $H$.


View1

View2

Figure 2. Camera registration using two views. The red circles in the views depict the scene occlusions, and the blue circles in the registered views depict the corresponding occlusions.


(a) (b) (c)

Figure 3. Example of trajectory registration. (a) Tracking result by Tensor Voting. (b) Trajectory registration using the ground plane homography. (c) Refinement by the homography obtained from the trajectory correspondences. (red: registered trajectory, blue: reference trajectory, green: overlapping trajectory)


The geometric registration of the camera viewpoints is performed using a homography estimated from a set of 4 matching points. The perspective parameters correspond to the null space of the matrix A (given in equation (1)) and are estimated using the SVD of A. In Figure 2, we show the registration of two views obtained from a set of point correspondences. If we denote by H_12 the homography from view 2 to view 1, we can register multiple cameras by a series of concatenated homographies, as given in equation (2).

$$
H_{nm} = H_{n\,n-1}\; H_{n-1\,n-2} \cdots H_{m+2\,m+1}\; H_{m+1\,m}
\qquad (2)
$$
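As an illustration of equations (1) and (2), the following sketch (a minimal NumPy example; the function names are ours and not part of the paper) estimates the homography from four or more point correspondences by taking the right null space of A with an SVD, and chains pairwise homographies to map points from camera n to camera m.

```python
import numpy as np

def homography_from_points(src, dst):
    """Estimate H such that dst ~ H @ src (homogeneous), from >= 4 correspondences.

    Builds the 2N x 9 matrix A of equation (1); h is the right singular vector
    associated with the smallest singular value, i.e. the null space of A.
    """
    A = []
    for (x, y), (xp, yp) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y, -xp])
        A.append([0, 0, 0, x, y, 1, -yp * x, -yp * y, -yp])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]                      # fix the scale ambiguity

def apply_homography(H, pt):
    """Map a 2D point through H using homogeneous coordinates."""
    v = H @ np.array([pt[0], pt[1], 1.0])
    return v[:2] / v[2]

def chain_homographies(pairwise):
    """Equation (2): pairwise = [H_{n,n-1}, H_{n-1,n-2}, ..., H_{m+1,m}] -> H_{nm}."""
    H = np.eye(3)
    for Hk in pairwise:
        H = H @ Hk
    return H
```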

3.2 Stream Synchronization
The integration of trajectories obtained from various sensors requires a synchronization of the acquired video streams. The estimation of a time offset is not sufficient for synchronizing the video streams, since the cameras can have different shutter speeds, therefore requiring a registration of the trajectories in space and time. There are several ways of registering the observed trajectories. If we assume that each trajectory point corresponds to the same 3D physical point through all the views and the moving object moves above the ground plane at the same height, the homography computed from the ground plane is sufficient for registering the trajectories. In this case, the synchronization can be done by minimizing the distance between trajectories for each frame while shifting one trajectory along the time axis. If the centroids correspond to the same 3D physical point in each view, but this 3D physical point is not the same across the views, the homography obtained from the ground plane is not sufficient for aligning the trajectories. In this case, we need the calibration information for each camera and the epipolar geometry between the considered views. Using the fundamental matrix, the synchronization can be done by finding the intersection of the trajectory of one view with the epipolar line computed by the fundamental matrix from the other view, since these points correspond to the same 3D points. Once the synchronization is done, we can correct for the height by calculating the 3D position of matched trajectory points using the calibration information. In this case the homography from the ground plane is sufficient for trajectory alignment, since the trajectories are already synchronized.

In most cases, the centroids describing the object trajectories do not correspond to the same 3D point. In this paper, we propose a simple approach to solve the space and time registration by using the homographies obtained from the ground plane and from the trajectories. First, we register the views and trajectories using the ground plane homography of equation (2). A residual misalignment remains, since the centroids do not correspond to the same 3D point and the video streams are not synchronized. Such a misalignment is illustrated in Figure 3.b. Here we propose a MLE approach for correcting these misalignments. A homography is computed by iteratively selecting 4 arbitrary corresponding points on each trajectory and then deriving the alignment error between the base trajectory and the registered trajectory. We select the homography that minimizes the registration error, as sketched below. The proposed two-step approach allows us to register trajectories acquired by heterogeneous cameras using this spatio-temporal registration. An illustration is shown in Figure 3.c.
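The sampling strategy described above can be sketched as follows (our own illustrative implementation, reusing the homography_from_points and apply_homography helpers from Section 3.1; the number of iterations is an assumption): 4-point subsets of corresponding trajectory samples are drawn repeatedly, and the homography yielding the smallest alignment error against the reference trajectory is kept.

```python
import numpy as np

def refine_trajectory_homography(traj_ref, traj_reg, n_iter=500, seed=0):
    """traj_ref, traj_reg: (N, 2) arrays of corresponding trajectory points
    after the initial ground-plane registration of equation (2).
    Returns the homography minimizing the residual alignment error."""
    rng = np.random.default_rng(seed)
    best_H, best_err = np.eye(3), np.inf
    for _ in range(n_iter):
        idx = rng.choice(len(traj_ref), size=4, replace=False)
        H = homography_from_points(traj_reg[idx], traj_ref[idx])
        mapped = np.array([apply_homography(H, p) for p in traj_reg])
        err = np.mean(np.linalg.norm(mapped - traj_ref, axis=1))
        if err < best_err:                 # keep the best spatio-temporal alignment
            best_H, best_err = H, err
    return best_H, best_err
```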

4. DETECTING MOVING REGIONS
The main difference between the detection of moving objects from a stationary and from a moving camera is the characterization of the background model. In a stationary camera, variations in the image sequence are modeled at the pixel level and allow the definition of a background model for each pixel using statistical techniques [14]. This concept can be extended to non-stationary cameras by compensating for the camera motion prior to the estimation of the background model. Registering the current frame to the selected reference is performed by concatenating the estimated pair-wise transforms, as in equation (3).

$$
\begin{bmatrix} x'\\ y' \end{bmatrix} =
\begin{bmatrix} a & b\\ c & d \end{bmatrix}
\begin{bmatrix} x\\ y \end{bmatrix} +
\begin{bmatrix} t_x\\ t_y \end{bmatrix}
\;\Rightarrow\;
\begin{bmatrix} x'\\ y'\\ 1 \end{bmatrix} =
\begin{bmatrix} a & b & t_x\\ c & d & t_y\\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x\\ y\\ 1 \end{bmatrix},
\qquad
X_{cur} = A_{(cur-1,\,cur)}\, A_{(cur-2,\,cur-1)} \cdots A_{(ref,\,ref+1)}\, X_{ref}
\qquad (3)
$$

where $A_{(t,\,t+1)}$ is the affine transform from frame t to frame t+1, and $X_{cur}$, $X_{ref}$ are the homogeneous affine coordinates of the current and reference frames.

Inaccuracies in parameter estimation, and the lack of a proper approximation of the scene geometry by the affine model, inherently create registration errors and therefore erroneous background models. We propose using a sliding window, where the number of frames considered is such that sub-pixel accuracy is guaranteed by the affine registration. Moreover, for each current frame, we re-warp the images in the buffer using the pair-wise registration, and consequently minimize the impact of an erroneous registration. Indeed, an erroneous registration will not influence the quality of the detection for the whole sequence, but only within the number of frames considered in the sliding window. The corresponding position of each pixel in the reference frame is re-estimated by equation (4):

$$
\begin{aligned}
cur > ref:\quad & X_{cur} = A_{(cur-1,\,cur)}\, A_{(cur-2,\,cur-1)} \cdots A_{(ref,\,ref+1)}\, X_{ref}\\
cur < ref:\quad & X_{cur} = A^{-1}_{(cur,\,cur+1)}\, A^{-1}_{(cur+1,\,cur+2)} \cdots A^{-1}_{(ref-1,\,ref)}\, X_{ref}
\end{aligned}
\qquad (4)
$$
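A minimal sketch of the pixel re-mapping of equations (3) and (4) under the sliding window (our own illustration; the data layout and helper names are assumptions): only pair-wise affine transforms are stored, and a pixel of the current frame is traced into every frame of the window before the per-pixel background statistics are accumulated, so a single bad registration only affects frames inside the window.

```python
import numpy as np

def compose_pairwise(cur, ref, affines):
    """3x3 homogeneous transform mapping frame `ref` coordinates into frame
    `cur` coordinates, composed from pair-wise affines per equations (3)/(4).
    affines[k] maps frame k to frame k+1."""
    M = np.eye(3)
    if cur > ref:                               # A(cur-1,cur) ... A(ref,ref+1)
        for k in range(ref, cur):
            M = affines[k] @ M
    else:                                       # A(cur,cur+1)^-1 ... A(ref-1,ref)^-1
        for k in range(ref, cur, -1):
            M = np.linalg.inv(affines[k - 1]) @ M
    return M

def pixel_background_stats(window_frames, cur_index, affines, px, py):
    """Mean/std of pixel (px, py) of the current frame, gathered over the
    frames of the sliding window after re-warping each one."""
    samples = []
    for ref_index, frame in window_frames.items():
        M = compose_pairwise(ref_index, cur_index, affines)   # current -> window frame
        x, y, _ = M @ np.array([px, py, 1.0])
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < frame.shape[0] and 0 <= xi < frame.shape[1]:
            samples.append(frame[yi, xi])
    return np.mean(samples), np.std(samples)
```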

Figure 4 illustrates the tracking of a pixel location prior to the characterization of the background model. In Figure 5, we compare the detection results obtained with the proposed approach and with the traditional background learning approach that concatenates all prior affine transforms from the reference frame to the current frame. As one can see, the residual errors caused by accumulated registration errors are visibly reduced when we use adaptive affine transforms limited to the sliding window.


Figure 4. Calculation of the statistics of each pixel by tracking the pixel's location using affine transformations

(a) (b)

Figure 5. Comparison between two detection algorithms (a) Using single concatenated affine transformations from the reference frame. (b) Using the proposed sliding window method

Figure 6. Geometric registration approach between a moving camera and stationary camera

5. REGISTRATION OF MULTIPLE HETEROGENEOUS CAMERAS
By combining the affine and perspective transforms, moving and static cameras can be spatially registered by a concatenation of homographies, as depicted in Figure 6.

In Figure 7, we show an example of ground plane registration between a moving and a stationary camera using the proposed approach. Here, we set the t-th frame as our reference frame for computing the perspective transform for ground plane registration, and propagate the obtained geometric information using the affine transforms between frames of the moving camera, as sketched below.
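One possible realization of this propagation (a sketch under our reconstruction, not code from the paper): the ground-plane homography H_ref is estimated once between the reference frame t of the moving camera and the stationary view; a point seen at any other frame of the moving camera is first stabilized back to the reference frame through the chained pair-wise affines, and then mapped through H_ref.

```python
import numpy as np

def moving_to_static(H_ref, affines, t_ref, t_cur, pt):
    """Map a point observed at frame t_cur of the moving camera into the
    stationary view. affines[k] is the 3x3 homogeneous affine registering
    frame k to frame k+1; H_ref registers the reference frame t_ref of the
    moving camera to the stationary camera (ground-plane homography)."""
    M = np.eye(3)
    if t_cur > t_ref:
        for k in range(t_ref, t_cur):           # undo the motion back to t_ref
            M = M @ np.linalg.inv(affines[k])
    else:
        for k in range(t_cur, t_ref):           # bring the point forward to t_ref
            M = affines[k] @ M
    v = H_ref @ M @ np.array([pt[0], pt[1], 1.0])
    return v[:2] / v[2]
```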


(t−w)-th frame

t-th frame

(t+w)-th frame

Figure 7. Camera registration example (t: reference frame, w: frame variation (40 frames))

6. TRACKING USING JOINT PROBABILITY MODELS
We decouple the tracking problem into two parts: modeling the moving objects' velocity, and modeling the objects' appearance by defining a color-based object representation. Each model is formulated as a probabilistic model, and the tracking problem is defined as the maximization of the joint probability.

Color similarity of the detected blobs is a common approach for tracking moving objects. However, color-based tracking is not very robust, as ambiguities are frequent and object appearances change as the objects move in the scene. Several dynamic models combining color and object velocity have been proposed for addressing the tracking problem. Methods such as the Extended Kalman Filter (EKF) [7][13][21], the Unscented Kalman Filter (UKF) [4], CONDENSATION [15], and Particle Filters (PF) [5] are commonly used. These approaches are well adapted to handling non-linear dynamic systems. However, if the working space is limited to an affine space, the dynamic system remains linear, which reduces the algorithmic complexity. The joint probability data association filter (JPDAF) allows combining multiple hypotheses/cues for tracking moving objects [25][26]. JPDAF techniques are fault tolerant and rely on the available hypotheses for tracking moving objects. However, the selection of the multiple hypotheses, as well as their weighting, is very subjective and relies on ad-hoc methods. Indeed, properly weighting the various hypotheses is not straightforward in JPDAF, as motion, color, and appearance (shape, edge, etc.) models are very distinct and difficult to combine. Therefore, when several distinctive hypotheses are used, they should be equally weighted.

Our approach uses the KF for modeling the velocity of moving objects, together with an appearance-based representation; these are the most distinctive models available from the scene.

6.1 The Motion Model
If we assume that the detected regions move in the image plane at a constant velocity with Gaussian noise, the dynamics of the motion can be approximated by first order Newtonian dynamics. When the velocity of the moving object does not vary significantly, such a model is acceptable. The motion model of the moving object is obtained by a first order KF. The use of a KF for predicting and estimating object trajectories is well known; however, in our approach, the tracking method does not depend only on the estimated position. The KF provides only one of the probability models used for estimating the joint probability model used for tracking. Furthermore, the evolution matrix of the state vector associated with the moving camera is dynamic, as it takes into account the camera motion in order to accurately estimate the object velocity. Our proposed approach combines both moving and stationary cameras, and processes them simultaneously by compensating for the camera motion using the estimated affine transformation and registering the moving and stationary cameras using a homography. The state vector considered for tracking a moving object across heterogeneous cameras is defined by the following vector:

$$
x_i = \left( x_m^i,\; y_m^i,\; v_x^i,\; v_y^i,\; x_s^i,\; y_s^i,\; u_x^i,\; u_y^i \right)
\qquad (5)
$$

where $(x_m^i, y_m^i)$ is the lowest corner of the detected bounding box in the moving camera frame, $(x_s^i, y_s^i)$ is the corresponding corner of the detected bounding box in the static camera frame, $(v_x^i, v_y^i)$ is the 2D velocity of $(x_m^i, y_m^i)$, and $(u_x^i, u_y^i)$ is the 2D velocity of $(x_s^i, y_s^i)$. Assuming that the moving object moves in the plane (affine plane) at constant velocity, the new position is obtained by the following equation:

$$
x_{t+1} = F_t\, x_t + G_t + w_t
\;\Rightarrow\;
\begin{bmatrix} x_m\\ y_m\\ v_x\\ v_y\\ x_s\\ y_s\\ u_x\\ u_y \end{bmatrix}_{t+1}
=
\begin{bmatrix}
a_t & b_t & a_t & b_t & 0 & 0 & 0 & 0\\
c_t & d_t & c_t & d_t & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 1\\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} x_m\\ y_m\\ v_x\\ v_y\\ x_s\\ y_s\\ u_x\\ u_y \end{bmatrix}_{t}
+
\begin{bmatrix} g_{x_t}\\ g_{y_t}\\ 0\\ 0\\ 0\\ 0\\ 0\\ 0 \end{bmatrix}
+
\begin{bmatrix} w_1\\ w_2\\ w_3\\ w_4\\ w_5\\ w_6\\ w_7\\ w_8 \end{bmatrix}
\qquad (6)
$$

where $F_t$ is the system evolution matrix at time t, $G_t = (g_{x_t}, g_{y_t}, 0, \dots, 0)^T$ is the translation vector of the affine transform at time t, $a_t, b_t, c_t, d_t$ are the affine parameters of the affine transform at time t, and $w_t$ is the process noise vector.

In equation (6), the upper left part of the evolution matrix represents the contribution of the moving camera to the evolution matrix of the system.


Figure 8. Example of predicting the position of the moving object in the non-stationary camera using the proposed Kalman Filter (red circle: occlusion)

This evolution matrix is time dependent, as we update the parameters of the affine transformation registering the considered frames according to the camera motion. The derivation of this relationship is as follows:

$$
x_{t+1} = A_t\, x_t + G_t,
\qquad
\frac{\partial x}{\partial t} = \frac{x_{t+1} - x_t}{\delta t} = A_t \cdot v_t
\;\Rightarrow\;
x_{t+1} = x_t + \delta t \cdot A_t \cdot v_t
\qquad (7)
$$

In equation (7), only the camera motion is considered, and the velocity can be estimated by $A_t \cdot v_t$. However, due to the camera motion, the position of the moving object is estimated using the affine transform registering two consecutive frames. Equation (7) can therefore be rewritten as follows:

$$
x_{t+1} = x_t + \delta t \cdot A_t \cdot v_t = A_t\, x_t + G_t + \delta t \cdot A_t \cdot v_t
\;\Rightarrow\;
x_{t+1} = A_t \left( x_t + \delta t \cdot v_t \right) + G_t
\qquad (8)
$$

Combining equation (8) with the static camera properties, we obtain the evolution matrix $F_t$ of the whole system, as stated in equation (6).

If we assume that we can only observe the position of the moving object in the image sequence, our measurement vector is $z_t = (i_m^t, j_m^t, i_s^t, j_s^t)$, where $(i_m^t, j_m^t)$ is the position of the detected bounding box of the moving object observed from the moving camera at time t, and $(i_s^t, j_s^t)$ is the position of the detected bounding box from the stationary camera. The measurement equation is formulated as follows:

$$
z_t = H\, x_t,
\qquad
H =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0
\end{bmatrix}
\qquad (9)
$$

where H is the measurement matrix. The time update and measurement update equations are given by:

$$
\begin{aligned}
\hat{x}^-_t &= F_{t-1}\, \hat{x}_{t-1} + G_{t-1}\\
P^-_t &= F_{t-1}\, P_{t-1}\, F_{t-1}^T + Q\\
K_t &= P^-_t H^T \left( H P^-_t H^T + R \right)^{-1}\\
\hat{x}_t &= \hat{x}^-_t + K_t \left( z_t - H \hat{x}^-_t \right)\\
P_t &= \left( I - K_t H \right) P^-_t
\end{aligned}
\qquad (10)
$$

where $\hat{x}^-_t$ is the a priori state estimate at time t, $\hat{x}_t$ is the a posteriori state estimate, $P^-_t$ is the a priori estimate error covariance, $P_t$ is the a posteriori estimate error covariance, $Q$ is the process noise covariance, $R$ is the measurement noise covariance, and $K_t$ is the Kalman gain.

The motion probability model $P_{motion}$ is computed from a Gaussian (normal) distribution of the motion estimates.
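The motion model of this section can be sketched as a standard Kalman filter whose evolution matrix is rebuilt at every frame from the current affine parameters, following the reconstruction of equations (6), (9) and (10) above (a minimal NumPy sketch; the noise magnitudes and function names are our own assumptions, not values from the paper):

```python
import numpy as np

def evolution_matrix(a, b, c, d):
    """F_t of equation (6): the upper-left block couples the moving-camera
    corner and its velocity through the frame-to-frame affine; the lower-right
    block is a constant-velocity model for the stationary-camera corner."""
    F = np.eye(8)
    F[0, :4] = [a, b, a, b]
    F[1, :4] = [c, d, c, d]
    F[4, 6] = 1.0          # x_s evolves with u_x
    F[5, 7] = 1.0          # y_s evolves with u_y
    return F

# Measurement matrix H of equation (9): only the two corner positions are observed.
H = np.zeros((4, 8))
H[0, 0] = H[1, 1] = H[2, 4] = H[3, 5] = 1.0

def kf_step(x, P, z, affine, Q=np.eye(8) * 1e-2, R=np.eye(4)):
    """One time + measurement update of equation (10).
    x: state (8,), P: covariance (8, 8), z: measurement (4,),
    affine: (a, b, c, d, gx, gy) registering frame t-1 to frame t."""
    a, b, c, d, gx, gy = affine
    F = evolution_matrix(a, b, c, d)
    G = np.array([gx, gy, 0, 0, 0, 0, 0, 0], dtype=float)
    x_pred = F @ x + G                              # time update
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R                        # measurement update
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(8) - K @ H) @ P_pred
    return x_new, P_new, x_pred
```

In such a sketch, the motion probability of a candidate position can be taken as a Gaussian centered on the predicted corner positions H x_pred.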

In Figure 8, the predicted position of the detected bounding box of a moving object is shown. As one can see, when the object is occluded by the kiosk (circled in red in the figure), the object is still continuously tracked using the predicted position of the bounding box, which matches the correct position of the observed bounding box without delay or oscillation when the object reappears.

6.2 The Appearance Model
Various methods have been proposed to solve the tracking problem using color information. Many of them use only one color histogram model per object, preventing them from differentiating a person wearing blue jeans and a red shirt from a person wearing red pants and a blue shirt. Multiple color models and their relative localization should be considered for an efficient use of color in object tracking. In [1], a multiple color model approach was proposed for human detection. The blob of the detected person is subdivided into three regions corresponding to the head, torso, and legs. In real-life situations, blob changes are observed due to self-occlusions and incomplete detection, and segmenting the blobs into body parts is not an easy task. Therefore, a single color model is inappropriate for handling such situations. We propose an appearance-based model using a multiple distribution color model.

The color distribution model is obtained by mapping the blob into a polar representation. The angle and radius are subdivided uniformly, defining a set of regions partitioning the blob.


In each region, a Gaussian color model describes the object's color variations. A similar approach is proposed in [8], but it focuses more on shape description (edges) than on appearance description (color).

Figure 9. Example of the appearance model on each connected component (red block: no data)

The probability of the appearance model is formulated as the average of the correlations of all color components (RGB), and it may be thought of as a color similarity between two objects. The calculation of the color distribution probability is formulated as follows:

$$
P_{color} = \frac{P_{red} + P_{green} + P_{blue}}{3},
\qquad
P_{red} = \frac{\displaystyle\sum_{r=1}^{N} \mu_{(r,t)}\, \mu_{(r,t-1)}}
{\sqrt{\displaystyle\sum_{r=1}^{N} \mu_{(r,t)}^{2} \; \sum_{r=1}^{N} \mu_{(r,t-1)}^{2}}}
\qquad (11)
$$

with $P_{green}$ and $P_{blue}$ defined analogously from the green and blue components,

where $N$ is the total number of bins (angular bins $\times$ radial bins), $\mu_{(r,t)}$ is the mean of the red component in bin $r$ at time $t$, $P_{red}$, $P_{green}$, $P_{blue}$ are the probabilities (likelihood estimates) of the color components, and $P_{color}$ is the probability of the appearance model.

The advantage of the proposed approach over a single color model is the localization of color properties. Localized variations due to occlusions or changes in viewpoint generate only local modifications of the color model. Also, similarity between objects is measured by comparing the color distribution models rather than single scalar values.
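A sketch of the polar color model and of the similarity of equation (11) (our own illustration; the bin counts, helper names, and the handling of empty bins are assumptions; the blob is assumed to be given as an RGB patch with a binary mask):

```python
import numpy as np

def polar_color_model(patch, mask, n_ang=8, n_rad=4):
    """Per-bin mean RGB color of a blob over a polar grid centered at the blob
    centroid; bins with no pixels are NaN (the 'no data' blocks of Figure 9)."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    ang = np.arctan2(ys - cy, xs - cx)                         # [-pi, pi]
    rad = np.hypot(ys - cy, xs - cx)
    a_bin = np.minimum(((ang + np.pi) / (2 * np.pi) * n_ang).astype(int), n_ang - 1)
    r_bin = np.minimum((rad / (rad.max() + 1e-9) * n_rad).astype(int), n_rad - 1)
    model = np.full((n_ang * n_rad, 3), np.nan)
    for b in range(n_ang * n_rad):
        sel = (a_bin * n_rad + r_bin) == b
        if sel.any():
            model[b] = patch[ys[sel], xs[sel]].mean(axis=0)
    return model

def appearance_probability(model_t, model_prev):
    """P_color of equation (11): average normalized correlation of the per-bin
    means of each color channel, over bins observed in both models."""
    valid = ~np.isnan(model_t).any(axis=1) & ~np.isnan(model_prev).any(axis=1)
    probs = []
    for ch in range(3):                                        # R, G, B
        u, v = model_t[valid, ch], model_prev[valid, ch]
        probs.append((u * v).sum() / (np.sqrt((u ** 2).sum() * (v ** 2).sum()) + 1e-9))
    return float(np.mean(probs))
```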

6.3 The Joint Probability Model
We formulate the tracking problem as finding the optimized position $x^*$ of the moving object by maximizing both probability models:

$$
x^*_t = \arg\max \left( P_{color},\, P_{motion} \right)
\qquad (12)
$$

The joint probability $P_{total}$ is defined by the product of the appearance and motion probabilities. This probability maximization is inferred using the Viterbi algorithm:

$$
X^*_0 = \arg\max_{X(0)} P\!\left( X_0 \mid I_0 \right) = \arg\max_{X(0)} P\!\left( I_0, X_0 \right)
\qquad (13)
$$

where $I_0$ denotes the color observation and $X_0$ is the position observation at time 0. The calculation of the optimized position at each time step is formulated as follows:

$$
\begin{aligned}
X^*_0 &= \arg\max_{X(0)} P\!\left( I_0, X_0 \right)\\
X^*_1 &= \arg\max_{X(1)} P\!\left( I_1, X_1, I_0, X^*_0 \right)\\
&\;\;\vdots\\
X^*_t &= \arg\max_{X(t)} P\!\left( I_t, X_t, \dots, I_0, X^*_0 \right)
\end{aligned}
\qquad (14)
$$

The optimized position at each time step depends on the current observation as well as on the motion estimation obtained from the previous optimized positions. Therefore, the optimized position is calculated by maximizing the probabilities of the current observations and of the previous optimized positions corresponding to past observations. Consequently, the joint probability of a given position at time t is formulated as follows:

$$
P_{total}(X_t) = P\!\left( I_t, X_t, I_{t-1}, X^*_{t-1}, \dots, I_0, X^*_0 \right)
\qquad (15)
$$

Using Bayes' rule, it can be rewritten as:

$$
P\!\left( I_t, X_t, I_{t-1}, X^*_{t-1}, \dots, I_0, X^*_0 \right)
= P\!\left( I_t \mid X_t, I_{t-1}, X^*_{t-1}, \dots, I_0, X^*_0 \right)
P\!\left( X_t, I_{t-1}, X^*_{t-1}, \dots, I_0, X^*_0 \right)
\qquad (16)
$$

The color model depends only on the current observation. Therefore the first decoupled component, corresponding to the probability of the current color observation given the current position, the past color observations, and the previous optimized positions, can be rewritten as follows:

$$
P\!\left( I_t \mid X_t, I_{t-1}, X^*_{t-1}, \dots, I_0, X^*_0 \right) = P\!\left( I_t \mid X_t \right)
\qquad (17)
$$

The second component in equation (16) is also decoupled by Bayes’ theorem as follows:

$$
P\!\left( X_t, I_{t-1}, X^*_{t-1}, \dots, I_0, X^*_0 \right)
= P\!\left( X_t \mid I_{t-1}, X^*_{t-1}, \dots, I_0, X^*_0 \right)
P\!\left( I_{t-1}, X^*_{t-1}, \dots, I_0, X^*_0 \right)
\qquad (18)
$$

The first term in equation (18) is the probability of the current position given all past color observations and corresponding optimized positions. The motion model provided by a first order KF relies only on the previously estimated position. Therefore the color observation models can be discarded. The remaining term in equation (18) is therefore defined by:

$$
P\!\left( I_{t-1}, X^*_{t-1}, \dots, I_0, X^*_0 \right) = P_{total}\!\left( X^*_{t-1} \right)
\qquad (19)
$$

From equations (17)–(19), the joint probability of the current position is the product of the probability of the color observation, the probability of the position estimated from the previous optimized position, and the joint probability of the past optimized position:

$$
\begin{aligned}
P\!\left( I_t, X_t, \dots, I_0, X^*_0 \right)
&= P\!\left( I_t \mid X_t \right)
P\!\left( X_t \mid X^*_{t-1}, \dots, X^*_0 \right)
P_{total}\!\left( X^*_{t-1} \right)\\
&= P_{color}\!\left( X_t \right) P_{motion}\!\left( X_t \right) P_{total}\!\left( X^*_{t-1} \right)
= P_{total}\!\left( X_t \right)
\end{aligned}
\qquad (20)
$$

In order to avoid the accumulation of products of probabilities, we consider the log of the probabilities as the joint probability. Furthermore, to ensure a stable calculation, we discard old measurements from the estimation process.


This shortens the memory of the KF and allows for variations in speed and color similarity.
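In log space the recursion of equation (20) becomes a running sum, so the tracker can score every candidate position at each frame and keep the best one (a schematic sketch; the candidate generation, the two probability functions, and the decay factor used to forget old measurements are placeholders for the models of Sections 6.1 and 6.2, not specified by the paper):

```python
import numpy as np

def best_position(candidates, log_p_color, log_p_motion, log_p_total_prev,
                  memory=0.9):
    """Pick X*_t = argmax of equation (20), accumulated in log space.

    candidates       : iterable of candidate positions X_t (e.g. detected blobs)
    log_p_color(X)   : log P_color(X_t), appearance model of equation (11)
    log_p_motion(X)  : log P_motion(X_t), Gaussian around the KF prediction
    log_p_total_prev : accumulated log P_total(X*_{t-1})
    memory           : decay factor that progressively forgets old measurements,
                       keeping the accumulated term bounded (shortened KF memory)
    """
    best_x, best_log = None, -np.inf
    for x in candidates:
        log_total = log_p_color(x) + log_p_motion(x) + memory * log_p_total_prev
        if log_total > best_log:
            best_x, best_log = x, log_total
    return best_x, best_log
```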

7. EXPERIMENTAL RESULTS
We now present some experimental results obtained on real sequences. These results illustrate the registration of multiple homogeneous or heterogeneous cameras, and object tracking across these cameras.

(a) (b)

Figure 10. Bounding box correction using the registered views. (a) Initial state without bounding box correction, (b) After applying bounding box correction.

Figure 10 illustrates the advantage of using multiple cameras for reducing the impact of incomplete detection. Due to the similarity of foreground and background colors, one of the bounding boxes is incorrectly detected. Since the moving objects move on the ground plane, the lower corners of the bounding boxes usually lie on the ground plane, and the ground plane registration homography allows registering the bounding boxes across views. Based on this constraint, the lower corners of the falsely detected bounding box are recovered from the registered corners of the second view.

Figure 11.a shows the trajectories computed on each video stream using the Tensor Voting technique presented in Section 2. In this example, one can observe that the trajectory of each moving object is very smooth and continuous. The registration of the trajectories using the ground plane is shown in Figure 11.b. As expected, the trajectories are not correctly aligned in space and time. In Figure 11.c we illustrate the proposed spatio-temporal stream registration technique. A blue line represents an original trajectory belonging to the reference view. A red line represents a registered trajectory transferred from the second view, and a green line represents the overlapping trajectory between the two views. In Figure 12, we show some results of multiple view and trajectory registration in the case of multiple moving objects. One can observe that multiple trajectories are correctly registered, owing to the separate registration error minimization step for each trajectory, and that occluded or invisible trajectories are successfully recovered using the trajectory information integrated across views.

In Figure 13, we show some results obtained on the PETS'01 test sequence. Figure 13.a shows the tracking results obtained from each view using the Tensor Voting based tracking approach. Figure 13.b illustrates the advantage of the proposed approach when dynamic occlusions occur. In this result, a moving person is occluded by a moving car in both views, but both the trajectories and the bounding boxes are correctly and smoothly recovered by the Tensor Voting based tracking process. As one can see in Figure 13.c, the trajectories of the moving objects are not correctly aligned using a ground plane based geometric registration. Compared to the previous two results (Figures 11 and 12), the alignment errors are large due to the large pose variation between the two cameras. In Figure 13.d, we show the trajectory registration result using the spatio-temporal homography. In this example, we re-register all the trajectories using the obtained trajectory registration homography: we measure the distance error of both trajectories, the registered trajectory and the reference trajectory, from the previous position, and keep the closest one as the correct trajectory. With this re-registration approach, we can minimize the registration errors when the trajectories extracted from both views have large errors. These registration errors are due to the large viewpoint variation. In Figure 14.a, the bounding box information of the moving object outlined by the yellow bounding box is integrated across the two views. In Figure 14.b, the moving object is occluded by a large obstacle (a kiosk), but it is continuously tracked using the motion estimation model and the information integrated from the static view, provided by the re-projected positions of the two lower corners of the bounding box. The ability to continuously track while the moving camera is panning is presented in Figures 14.c and 14.d.

Figure 14.e shows the mosaic computed from the moving camera. Figure 14.f illustrates the capability of the proposed KF motion model to track the two overlapping moving objects separately. It allows us to split large moving blobs into sub-components or independent moving objects using their appearance and velocity properties. Corresponding videos and additional examples can be found at:

http://iris.usc.edu/~jinmanka/multiview_tracking


View1

View2

(a) (b) (c)

Figure 11. Multiple view registration (a single moving object) (a) Tracking result by Tensor Voting, (b) Ground plane registration, (c) Trajectory registration

View1

View2

(a) (b) (c)

Figure 12. Multiple views registration (multiple moving objects) (a) Tracking result by Tensor Voting, (b) Ground plane registration, (c) Trajectory registration


(b)

(c)

View 1

View 2

(a) (d)

Figure 13. Multiple views registration (PETS’01 sequence) (a) Tracking result by Tensor Voting, (b) Tracking resolves a dynamic occlusion, (c) Ground plane registration, (d) Trajectory registration


(a) (b)

(c) (d)

(e) (f)

Figure 14. Multiple view registration example (stationary + moving camera) (a) Integrating bounding box information across view, (b) Registering occluded object using the second view, (c) Tracking beyond the FOV of the moving camera, (d) Tracking beyond the FOV of the static camera, (e) Mosaic built from moving camera, (f) Continuous tracking by motion estimation using KF

8. CONCLUSION
We have introduced two novel approaches for continuous tracking of moving regions across multiple homogeneous or heterogeneous views. In the case of multiple stationary video streams, the Tensor Voting based tracking approach continuously provides refined trajectories and bounding boxes of moving objects for each view, and a spatio-temporal homography registers multiple unsynchronized trajectories, allowing moving objects to be tracked across multiple cameras. The smoothed trajectories obtained from the Tensor Voting based tracking produce a small registration error. For the combination of video streams acquired by stationary and moving cameras, we have proposed a different approach, handling multiple trajectories observed by the moving and stationary cameras in the same KF framework. As proposed in Section 4, the detection of objects from a moving camera sequence is simplified by using adaptive affine transformations with a sliding window approach. This allows us to use existing background-modeling-based detection methods. For each view, the combination of the appearance model and the motion model obtained by the stochastic approach presented in Section 6 allows us to track the moving objects continuously while handling occlusions, incomplete detections and camera hand-off. The registration of heterogeneous cameras is performed by the homography between the stationary and the moving camera, and the affine transforms derived from the stabilization.

Several issues remain to be addressed: the integration of the inter-relationships between moving and stationary cameras into the evolution matrix of the KF, for efficiently propagating the information extracted in each video stream. Also, the extension of the method to tracking across multiple non-overlapping cameras is of interest. The expansion of the joint probability model with other criteria, such as spatio-temporal invariant shape descriptions, and the adoption of higher order estimation methods for introducing non-linearity into the measurement step, will improve the accuracy of the tracking and allow tracking objects across non-overlapping cameras.


9. ACKNOWLEDGEMENTS
This research was supported by the Advanced Research and Development Activity of the U.S. Government under contract MDA-908-00-C-0036.

10. REFERENCES
[1] A. Elgammal and L. S. Davis, "Probabilistic Framework for Segmenting People Under Occlusion", IEEE Proc. International Conference on Computer Vision, 2001.

[2] A. Mittal and L. S. Davis, “M2Tracker: A Multi-View Approach to Segmenting and Tracking People in a Cluttered Scene Using Region-Based Stereo”, In Proc. European Conference Computer Vision, 2002.

[3] A. Utsumi, H. Mori, J. Ohya and M. Yachida, “Multiple-Human Tracking using Multiple Cameras”, Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, pp. 498-503, Nara, Japan, 14-16 April, 1998.

[4] B. Stenger, P. R. S. Mendonca, and R. Cipolla, “Model-Based Hand Tracking Using an Unscented Kalman Filter”, In Proc. British Machine Vision Conference, Vol. 1, pp. 63-72, Manchester, UK, September 2001.

[5] B. Stenger, P. R. S. Mendonca, and R. Cipolla, “Model-Based Hand Tracking of an Articulated Hand”, IEEE Proc. CVPR, Vol. II, pp. 310-315, Kauai, USA, December 2001.

[6] G. Stein, "Tracking from Multiple View Points: Self-calibration of Space and Time", IEEE Proc. CVPR, pp. 521-527, 1999.

[7] G. Welch and G. Bishop, “SCAAT: Incremental Tracking with Incomplete Information”, In. Proc. of SIGGRAPH 97, pp. 333-344, 1997.

[8] H. Zhang and J. Malik, “Learning a discriminative classifier using shape context distance”, Proc. of the IEEE CVPR, Vol. 1, pp. 242-247, Madison, Wisconsin, 18-20 June, 2003.

[9] J. Kang, I. Cohen and G. Medioni, “Continuous Multi-Views Tracking using Tensor Voting”, proc. of the IEEE Workshop on Motion and Video Computing, pp. 181-186, Orlando, Florida, 5-6 December, 2002.

[10] J. Kang, I. Cohen and G. Medioni, “Continuous Tracking Within and Across Camera Streams”, Proc. of the IEEE CVPR, Vol. 1, pp. 267-272, Madison, Wisconsin, 18-20 June, 2003.

[11] J. Krumm, S. Harris, B. Meyers, B. Brumitt, M. Hale and S. Shafer, “Multi-camera Multi-person Tracking for EasyLiving”, proc. of the 3rd IEEE Workshop on Visual Surveillance, 2000.

[12] J. Orwell, P. Remagnino, and G.A. Jones, “Multi-Camera Color Tracking”, proc. of the 2nd IEEE Workshop on Visual Surveillance, 1999.

[13] K. Bradshaw, I. Reid, and D. Murray, “The active recovery of 3d motion trajectories and their use in prediction”, IEEE Trans. PAMI, Vol. 19, no. 3, 1997.

[14] K. Toyama, J. Krumm, B. Brumitt and B. Meyers, “Wallflower: Principles and Practice of Background Maintenance”, IEEE Proc. Int’l Conf. Computer Vision, pp. 255-261, 1999.

[15] M. J. Black and A. D. Jepson, “A Probabilistic framework for matching temporal trajectories: Condensation-based recognition of gestures and expressions”, In Proc. European Conference Computer Vision, Vol. 1, pp. 909-924, 1998.

[16] M. Black and P. Anandan, “The robust estimation of multiple motions: Affine and piecewise-smooth flow fields”, Tech. Report TR, Xerox PARC, Dec. 1993.

[17] M. Irani et al., “Efficient Representations of Video Sequences and Their Applications”, Signal Processing, Vol. 8, No. 4, May, 1996.

[18] P. Kornprobst and G. Medioni, “Tracking Segmented Objects using Tensor Voting”, IEEE CVPR, Vol. 2, pp. 118-125, 2000.

[19] Q. Cai and J.K. Aggarwal, “Automatic Tracking of Human Motion in Indoor Scenes Across Multiple Synchronized video Streams”, IEEE Proc. International Conference on Computer Vision, 1998.

[20] Q. Cai and J. K. Aggarwal, “Tracking Human Motion in Structured Environments Using a Distributed-Camera System”, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 21, No. 11, pp. 1241-1247, November, 1999.

[21] R. Rosales and S. Sclaroff, “Improved Tracking of Multiple Humans with Trajectory Prediction and Occlusion Modeling”, IEEE Proc. CVPR, Santa Barbara, CA, 1998.

[22] R. T. Collins, A. J. Lipton, H. Fujiyoshi and T. Kanade, "Algorithms for Cooperative Multisensor Surveillance", Proc. of the IEEE, Vol. 89(10), pp. 1456-1477, Oct. 2001.

[23] S. Ayer, P. Schroeter, and J. Bigun, “Segmentation of moving objects by robust motion parameter estimation over multiple frames”, In Proc. European Conference Computer Vision, May, 1994.

[24] S. Stillman, R. Tanawongsuwan and I. Essa, “A system for tracking and recognizing multiple people with multiple cameras”, Proc. of the Int. Conf. on Audio and Video-Based Biometric Person Authentication, pp. 96-101, Washington, DC, 22-23 March 1999.

[25] T.-J. Cham and J. M. Rehg, “A Multiple Hypothesis Approach to Figure Tracking”, IEEE Proc. Computer Vision and Pattern Recognition, Vol. 2, pp. 239-245, Ft. Collins, CO, June 1999.

[26] Y. Chen, Y. Rui, and T. S. Huang, “JPDAF Based HMM for Real-Time Contour Tracking”, IEEE Proc. CVPR, Vol 1. pp. 543-550, Kauai, Hawaii, December 11-13, 2001
