
Real-Time Tracking of Multiple People through Stereo Vision

S. Bahadori, G. Grisetti, L. Iocchi, G.R. Leone, D. Nardi
Dipartimento di Informatica e Sistemistica

University of Rome “La Sapienza”, Rome, Italy
E-mail: {lastname}@dis.uniroma1.it

Abstract

People tracking is an important functionality for many applications that require analyzing the interaction between people and the environment. These applications range from security and environmental surveillance to ambient intelligence applications.

In this paper we describe a real-time People Localization and Tracking (PLT) System based on a calibrated, fixed stereo vision sensor. The system locates and tracks multiple people in the environment by using three different representations of the information coming from the stereo sensor: the left intensity image, the disparity image, and the 3-D world locations of measured points. It also uses people-modeling techniques to re-acquire people that are temporarily occluded during tracking.

Experimental results show that the system is able to track multiple people moving in an area of approximately 3 × 8 meters in front of the sensor with high reliability and good precision.

1 Introduction

Localization and tracking of people in a dynamic environment is a key building block for many applications, including surveillance, monitoring, and ambient intelligence (e.g., domestic elderly assistance). The fundamental capabilities for a multiple-people tracking system are: 1) to determine the trajectory of each person within the environment; 2) to maintain a correct association between people and tracks over time.

These problems could easily be solved by placing special markers on each person that transmit their real-world position to a receiver in the environment; however, it is not always possible to implement such a solution within an application.

Video cameras, on the other hand, can be used for such a task, but several difficulties must be faced in developing a vision-based people-tracking system: first, people tracking is difficult even in moderately crowded environments, because of occlusions and people walking close to each other or to the sensor; second, people recognition is difficult and cannot easily be integrated in the tracking system; third, people may leave the field of view of the sensor and re-enter it after some time (or they may enter the field of view of another sensor), and applications may require the ability to recognize (or re-acquire) a person previously tracked (or tracked by another sensor in the network of sensors).

Several approaches have been developed for tracking people in different applications. At the top level, these approaches can be grouped into classes on the basis of the sensors used: a single camera (e.g. [20, 21]); stereo cameras (e.g. [4, 7, 2, 3]); or multiple calibrated cameras (e.g. [5, 15]).

Although it is possible to determine the 3-D world positions of tracked objects with a single camera (e.g. [21]), a stereo sensor provides two critical advantages: 1) it makes it easier to segment an image into objects (e.g., distinguishing people from their shadows); 2) it produces more accurate location information for the tracked people.

Proc. of IEE International Workshop on Intelligent Environments, 2005

On the other hand, approaches using several cameras viewing a scene from significantly different viewpoints can deal with occlusions better than a single stereo sensor, because they view the scene from many directions. However, such systems are difficult to set up (for example, establishing their geometric relationships or solving synchronization problems), and their scalability to large environments is limited, since they may require a large number of cameras.

This paper describes the implementation of a People Localization and Tracking (PLT) System using a calibrated, fixed stereo vision sensor. The system locates and tracks multiple people in the environment by using three different representations of the information coming from the stereo sensor: the left intensity image, the disparity image, and the 3-D world locations of measured points, as well as people-modeling techniques to re-acquire people that are temporarily occluded during tracking.

In this paper we specifically focus on tracking multiple people, describing in detail two phases of the process: 1) image segmentation based on a background model and plan-view analysis; 2) tracking of multiple people based on Kalman filtering and color-based people modeling.

2 System Architecture

The general architecture of the PLT System is depicted in Figure 1. The input is a stream of stereo images captured by a calibrated stereo-vision device and processed by the Stereo Computation module, which extracts disparity and 3-D information about the scene (a description of the stereo vision system can be found in [12]).

Subsequently, the Background Modeling module maintains an updated model of the background, composed of three components: intensities, disparities, and edges. The Foreground Segmentation module then extracts foreground pixels and image blobs from the current image, by a type of background subtraction that combines intensity and disparity information. Points in the foreground are then projected onto the plan view by the Plan View Projection module, using the 3-D information computed by the stereo algorithm. This module computes world blobs that identify moving objects in the environment. World blobs are also used to refine image segmentation by detecting situations of partial occlusion between people.

Once we have a set of segmented figures, the People Modeling module creates and maintains an appearance model for each person being tracked; this information is then passed to the Tracker module, which maintains and tracks a set of tracked objects associated with the world blobs extracted by the previous modules. Tracking is performed by a Kalman Filter on a state formed by the location and velocity of the person in the world and by his/her appearance model. The state for each person is updated using a constant-velocity model for the person's location and a similarity measure between the appearance models and the current segmented figure of each tracked person. This tracking mechanism is able to reliably track multiple people in situations of partial occlusion among them.
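The data flow just described can be sketched as a chain of modules, each consuming the previous module's output. The stage names follow Figure 1, but the class and its interface are hypothetical, not taken from the original implementation:

```python
# Hypothetical sketch of the PLT data flow described above. The stage
# names follow Figure 1; the class itself is illustrative.
class PLTPipeline:
    def __init__(self, stereo, background, foreground, plan_view, modeler, tracker):
        # One callable per module, in processing order.
        self.stages = [stereo, background, foreground, plan_view, modeler, tracker]

    def process(self, stereo_frame):
        data = stereo_frame
        for stage in self.stages:
            data = stage(data)   # each module consumes the previous module's output
        return data              # final result: the updated set of tracked people
```

Each stage here is any callable, which keeps the sketch independent of the concrete data structures exchanged between modules.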

For best results in the localization and tracking process, we have chosen to place the camera high on the ceiling, pointing down at an angle of approximately 30 degrees with respect to the horizon. Notice that other placements of the camera may improve a particular aspect of the system. For example, a camera looking straight down simplifies tracking; however, this configuration makes it difficult to construct appearance models of the people (except for their hair). On the other hand, a horizontal camera is good for person modeling and recognition, but it increases occlusions and generally decreases tracking performance. Our choice of camera position and orientation provides a good combination of tracking and person modeling.

In the following sections, we first address the image segmentation process, showing how multiple people are segmented from the background, and then the tracking process, including people modeling and the Kalman Filter based tracker.


Figure 1: PLT System Architecture

3 Image Segmentation

When using a static camera for object detection and tracking, maintaining a model of the background and performing background subtraction is a common technique for image segmentation and for identifying foreground objects. In order to account for variations in illumination, shadows, reflections, and background changes, it is useful to integrate intensity and range information and to dynamically update the background model. Moreover, this update must be different in the parts of the image where there are moving objects [6, 20, 18, 9].

The implementation of the image segmentation module thus requires the computation of a background model and a subsequent foreground extraction procedure.

3.1 Dynamic Background Modeling

In our work we maintain a background model including intensity, disparity, and edge information, as a unimodal probability distribution represented by a single Gaussian. Although more sophisticated representations can be used (e.g. Gaussian mixtures [6, 18, 9]), we decided to use a simple model for efficiency reasons. We also decided not to use color information in the model, since intensity and range usually provide a good segmentation while reducing computational time.

The model of the background is represented at every time t and for every pixel i by a vector X_{t,i}, including information about intensity, disparity, and edges computed with a Sobel operator. In order to take into account the uncertainty in these measures, we use a Gaussian distribution over X_{t,i}, with mean µ_{X_{t,i}} and variance σ²_{X_{t,i}}. Moreover, we assume the values for intensity, disparity, and edges to be independent of each other.

This model is dynamically updated at every cycle (i.e., for each new stereo image, every 100 ms) and is controlled by a learning factor α_{t,i} that changes over time t and is different for every pixel i.

µ_{X_{t,i}} = (1 − α_{t,i}) µ_{X_{t−1,i}} + α_{t,i} X_{t,i}

σ²_{X_{t,i}} = (1 − α_{t,i}) σ²_{X_{t−1,i}} + α_{t,i} (X_{t,i} − µ_{X_{t−1,i}})²

The learning factor α_{t,i} is set to a higher value (e.g. 0.25) in the first few frames (e.g. 5 seconds) after the application is started, in order to quickly acquire a model of the background. In this phase we assume the scene contains only background objects. After this training phase, α_{t,i} is set to a lower nominal value (e.g. 0.1) and modified depending on the status of pixel i, as explained below. Notice that the training phase can be completely removed: the system is able to build a background model even when foreground objects are moving from the beginning of the application run; of course, it will then require a longer time to compute the model.


After the training phase, the learning factor α_{t,i} is increased (e.g. to 0.15) when there are no moving objects in the scene (speeding up model updating), and is decreased (or set to zero) in the regions of the image where there are moving objects. In this way we are able to quickly update the background model in those parts of the image that contain stationary objects, while avoiding the inclusion of people (and, in general, foreground objects) in the background.

In order to determine the regions of the image in which the background should not be updated, the work in [9] proposes computing per-pixel activities based on the intensity difference with respect to the previous frame. In our work, instead, we compute the activity of each pixel as the difference between the edges in the current image and the background edge model. The motivation behind this choice is that people produce variations in their edges over time even when standing still (due to breathing, small variations of pose, etc.), while static objects, such as chairs and tables, do not.

The activity of pixel i at time t, A_t(i), is then used to determine the learning factor of the background model: the higher the activity A_t(i) at a pixel, the lower the learning factor α_{t,i}.
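The update equations and the activity-driven learning factor above can be sketched per pixel as follows. The mapping from activity to α and the parameter values are illustrative (the paper only gives example values); in the real system this update runs over the whole intensity, disparity, and edge images:

```python
# Per-pixel sketch of the background update above. In practice this runs
# (vectorized) over the intensity, disparity, and edge images; the
# activity-to-alpha mapping below is an illustrative choice.
class PixelBackground:
    def __init__(self, x0, alpha_nominal=0.1):
        self.mean = float(x0)       # running mean of the observed value
        self.var = 1.0              # running variance
        self.alpha_nominal = alpha_nominal

    def update(self, x, activity):
        # Higher activity (edge change vs. the background edge model) gives
        # a lower learning factor, so moving people are not absorbed.
        alpha = self.alpha_nominal * max(0.0, 1.0 - activity)
        diff = x - self.mean        # difference from the *previous* mean
        self.var = (1 - alpha) * self.var + alpha * diff ** 2
        self.mean = (1 - alpha) * self.mean + alpha * x
```

With activity near 1 the learning factor drops to zero and the model is frozen at that pixel, which is exactly the behavior described for image regions containing moving objects.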

3.2 Foreground segmentation

Foreground segmentation is then performed by background subtraction on the current intensity and disparity images. By taking into account both intensity and disparity information, we are able to correctly deal with shadows, detected as intensity changes but not disparity changes, and with foreground objects that have the same color as the background but different disparities. Therefore, by combining intensity and disparity information in this way, we avoid the false positives due to shadows and the false negatives due to similar colors that typically affect systems based only on intensity background modeling.

The final steps of the foreground segmentation module are to compute connected components (i.e. image blobs) and to characterize the foreground objects in the image space. These objects are then passed to the Plan View Segmentation module.
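A per-pixel sketch of this intensity/disparity combination follows. The function name, the threshold k, and the fallback for pixels without a valid stereo match are assumptions for illustration, not details from the paper:

```python
def foreground_mask(intensity, disparity, bg_int, bg_disp, sigma_int, sigma_disp, k=2.5):
    """Combined intensity/disparity background subtraction (sketch).
    A shadow changes intensity but not disparity, so it stays background;
    an object with background-like color but different disparity is still
    detected. Pixels with no valid stereo match (disparity None) fall back
    to the intensity test."""
    mask = []
    for i, d, mi, md, si, sd in zip(intensity, disparity, bg_int, bg_disp,
                                    sigma_int, sigma_disp):
        intensity_change = abs(i - mi) > k * si
        if d is None:
            mask.append(intensity_change)          # intensity-only fallback
        else:
            mask.append(abs(d - md) > k * sd)      # disparity decides: rejects shadows
    return mask
```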

Figure 2 shows three snapshots of this process: the first represents the difference in the intensity space, where foreground objects are denoted by dark (red) areas; the second represents the difference in the disparity space, again with foreground objects denoted by dark (red) areas; finally, the combination of this information is shown in the third image as bounding boxes around the two people in the scene.

3.3 Plan View Segmentation

In many applications it is important to know the 3-D world locations of the tracked objects. We do this by employing a plan view [3]. This representation also makes it easier to detect partial occlusions between people.

Our approach projects all foreground points into the plan-view reference system, by using the stereo calibration information to map disparities into the sensor's 3-D coordinate system, and then the external calibration information to map these points from the sensor's 3-D coordinate system to the world's 3-D coordinate system.
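The first of these two mappings is standard stereo triangulation for rectified cameras; a minimal sketch (the symbols f for focal length in pixels, principal point (cx, cy), and baseline in meters are assumed calibration values):

```python
def disparity_to_camera_3d(u, v, d, f, cx, cy, baseline):
    """Map pixel (u, v) with disparity d (pixels) into the sensor's 3-D
    coordinate frame, assuming rectified cameras (standard triangulation).
    The subsequent sensor-to-world mapping is a rigid transform given by
    the external calibration."""
    z = f * baseline / d           # depth is inversely proportional to disparity
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return x, y, z
```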

For plan view segmentation, we compute a height map: a discrete map relative to the ground plane of the scene, where each cell is filled with the maximum height of all the 3-D points whose projection lies in that cell, so that taller objects (e.g., people) receive a high score. The height map is smoothed with a Gaussian filter to remove noise, and then searched for connected components that we call world blobs (see Fig. 3b, where darker points correspond to higher values).
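The height-map computation can be sketched as follows (the cell size and data layout are illustrative; the Gaussian smoothing and connected-component search are omitted):

```python
from collections import defaultdict

def height_map(points, cell=0.5):
    """Fill each ground-plane cell with the maximum height of the 3-D
    foreground points falling into it. Taking the maximum makes a person's
    score depend on the (usually visible) upper body, so partial occlusion
    of the legs changes the cell values very little."""
    hmap = defaultdict(float)
    for x, y, z in points:               # z = height above the ground plane
        key = (int(x // cell), int(y // cell))
        hmap[key] = max(hmap[key], z)
    return dict(hmap)
```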

Figure 2: a) Intensity segmentation; b) Disparity segmentation; c) Tracked People.

Figure 3: a) Foreground segmentation (1 image blob); b) Plan View Projection (2 world blobs); c) Plan View Segmentation (2 person blobs).

Since we are interested in person detection, world blobs are filtered on the basis of their size in the plan view and their height, thus removing blobs with sizes and heights inconsistent with people. Notice that even in situations of partial occlusion (like the one shown in the second row of Fig. 3), the world blob of the occluded person generally has a size and height similar to the case in which this person is not occluded. This is due to the position of the camera and to the use of plan-view information: the upper part of a person is typically visible even when s/he is occluded, and our method for computing the height map is not sensitive to the lack of information in the lower parts of a person. Therefore, the tuning of the filter is not very critical, and we prefer a filter that is not very selective on such world blobs, since possible false detections are then resolved by the tracking module (for example, noise producing an incorrect world blob isolated in time will be discarded by the tracker, since the observation is not confirmed over time). It must be observed, however, that the method proposed in this paper does not use any technique for people recognition; therefore, if a moving object with dimensions similar to a person's (e.g. a tall mobile robot) moves in the environment, this object will also be tracked by the system.

It is also important to notice that Plan View Segmentation is able to correctly deal with partial occlusions that are not detected by foreground analysis. For example, the second row of Figure 3 shows a situation in which a single image blob covers two people, one of whom is partially occluded, while the Plan View Segmentation process detects two world blobs. By considering the association between pixels in the image blobs and world blobs, we are able to determine image masks corresponding to each person, which we call person blobs. This process allows us to refine foreground segmentation in situations of partial occlusion and to correctly build person appearance models.

The Plan View Segmentation thus returns a set of world blobs (in the plan-view space) and person blobs (in the image space) associated with the people moving in the scene.

4 Tracking

We have chosen a probabilistic approach for tracking multiple people, based on a mono-modal probability distribution, as in [2, 13, 15]. Tracking is performed by maintaining a set of tracked objects x_t, updated with the measurements of world blobs and appearance models z_t extracted in the previous segmentation phase. As a difference with respect to previous approaches, the state of each tracked object has two components: the spatial parameters and the appearance.

Figure 4: Person model.

5 People Modeling

In order to track people over time in the presence of occlusions, or when they leave and re-enter the scene, it is necessary to have a model of the tracked people. Several models for people tracking have been developed (see for example [8, 17, 14, 11]), but color histograms and color templates (as presented in [17]) are not sufficient for capturing complete appearance models, because they do not take into account the actual position of the colors on the person.

Following [8, 14], we have defined temporal color-based appearance models of a fixed resolution, represented as a set of unimodal probability distributions in RGB space (i.e. 3-D Gaussians), one for each pixel of the model. Such models are computed by first scaling the portion of the image characterized by a person blob to a fixed resolution, and then updating the probability distribution for each pixel in the model (Figure 4).
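A sketch of the per-pixel model update: the resolution, learning rate, and running-average form are illustrative assumptions; only the RGB means are shown, with the per-pixel variances updating analogously (as in the background model):

```python
class AppearanceModel:
    """Fixed-resolution temporal color model: each cell holds the running
    mean of the RGB values seen at the corresponding position of the
    rescaled person blob (variances, omitted here, update analogously)."""
    def __init__(self, width=16, height=32, rate=0.1):
        self.rate = rate
        self.mean = [[(0.0, 0.0, 0.0)] * width for _ in range(height)]

    def update(self, scaled_blob):
        # scaled_blob: height x width grid of (r, g, b) tuples, obtained by
        # rescaling the person blob to the model's fixed resolution.
        a = self.rate
        for y, row in enumerate(scaled_blob):
            for x, (r, g, b) in enumerate(row):
                mr, mg, mb = self.mean[y][x]
                self.mean[y][x] = ((1 - a) * mr + a * r,
                                   (1 - a) * mg + a * g,
                                   (1 - a) * mb + a * b)
```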

Appearance models computed at this stage are used during the tracking step to improve the reliability of data association, as described below.

5.1 Probabilistic tracking and data association

Tracking is performed with a probabilistic formulation. The uncertainty about the i-th tracked object at time t is represented by a multi-variate normal distribution N(µ_{i,t}, σ_{i,t}) and a weighting factor w_{i,t}.

The probability distribution p(x_t) for the tracked people is represented as a collection of Gaussians and weights. Each Gaussian models the information about a single person, and the weight models the confidence in the person estimate. Therefore p(x_t) is represented as a set P_t = {⟨N(µ_{i,t}, σ_{i,t}), w_{i,t}⟩ | i = 1..n}, where N(µ_{i,t}, σ_{i,t}) is a Gaussian in a multi-dimensional space representing the i-th person tracked at time t and w_{i,t} is its weighting factor. Similarly, the observations z_t are represented as a set of Gaussian distributions Z_t = {N(µ′_{j,t}, σ′_{j,t}) | j = 1..m}, denoting the position and appearance information of the people detected in the current frame.

The probability distribution p(x_t|z_t) is computed by using a set of Kalman Filters. The system model used for predicting people's positions is the constant-velocity model, while their appearance is updated with a constant model. This model is adequate for many normal situations in which people walk in an environment. It provides a clean way to smooth trajectories and to hold onto a person that is partially occluded for a few frames. However, multi-modal representations (e.g. [10]) should be used in more complex situations.
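A one-dimensional sketch of the constant-velocity filter follows (one such filter per world coordinate; the noise parameters q and r are illustrative, and the appearance part of the state is omitted):

```python
class ConstantVelocityTrack:
    """1-D sketch of the constant-velocity Kalman filter used per tracked
    person. State: position and velocity. The appearance component of the
    state, updated with a constant model, is not shown here."""
    def __init__(self, pos, q=0.01, r=0.1):
        self.x = [pos, 0.0]                  # position, velocity
        self.P = [[1.0, 0.0], [0.0, 1.0]]    # state covariance
        self.q, self.r = q, r                # process / measurement noise

    def predict(self, dt=0.1):               # 10 Hz frame rate, as in the text
        self.x = [self.x[0] + dt * self.x[1], self.x[1]]
        p00, p01 = self.P[0]; p10, p11 = self.P[1]
        self.P = [[p00 + dt * (p01 + p10) + dt * dt * p11 + self.q, p01 + dt * p11],
                  [p10 + dt * p11, p11 + self.q]]

    def correct(self, z):                    # measurement: a world-blob position
        s = self.P[0][0] + self.r            # innovation variance
        k0, k1 = self.P[0][0] / s, self.P[1][0] / s   # Kalman gain
        y = z - self.x[0]                    # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        p00, p01 = self.P[0]; p10, p11 = self.P[1]
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
```

Fed noiseless measurements from a person walking at constant speed, the velocity estimate converges and the predicted position tracks with no steady-state lag, which is what makes the model adequate for smoothing trajectories and bridging short occlusions.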

For the representation presented in this paper, data association is an important issue. In general, at every step, the tracker must make an association between m observations (world blobs) and n people (tracked objects).

Association is solved by computing a distance d_{i,j} between the i-th tracked object N(µ_{i,t|t−1}, σ_{i,t|t−1}) and the j-th observation N(µ′_{j,t}, σ′_{j,t}). Here the Gaussian N(µ_{i,t|t−1}, σ_{i,t|t−1}) is the predicted estimate of the i-th person.

An association between the predicted state of the system P_{t|t−1} and the current observations Z_t is denoted by a function f that associates each tracked person i with an observation j, with i = 1..n, j = 1..m, and f(i1) ≠ f(i2) for all i1 ≠ i2. The special value ⊥ denotes that a person is not associated with any observation (i.e., f(i) = ⊥).

Let F be the set of all possible associations of the currently tracked people with the current observations. Data association is then computed by solving the following minimization problem:

    argmin_{f ∈ F} Σ_i d_{i,f(i)}

where a fixed maximum value is used for d_{i,f(i)} when f(i) = ⊥.

Although this is a combinatorial problem, the sets P_t and Z_t to which it is applied are very small (no larger than 4 elements), so |F| is small and the problem can be solved effectively.
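Because n and m never exceed 4 here, the argmin can be solved by brute-force enumeration. A sketch (the function name, the encoding of ⊥ as None, and the d_max value are illustrative):

```python
from itertools import permutations

def best_association(dist, n_tracks, n_obs, d_max=1.0):
    """Exhaustively solve the data-association argmin above. dist[i][j] is
    the distance between track i and observation j; an unmatched track
    (f(i) = bottom, encoded as None) pays the fixed cost d_max. Injectivity
    f(i1) != f(i2) is enforced by drawing permutations."""
    # Pad the observations with one "unmatched" slot per track, so any
    # subset of tracks may go unassociated.
    choices = list(range(n_obs)) + [None] * n_tracks
    best_f, best_cost = None, float("inf")
    for cand in permutations(choices, n_tracks):
        cost = sum(d_max if j is None else dist[i][j] for i, j in enumerate(cand))
        if cost < best_cost:
            best_f, best_cost = list(cand), cost
    return best_f, best_cost
```

For larger n and m a polynomial-time assignment solver (e.g. the Hungarian algorithm) would be the standard replacement, but enumeration is sufficient at this problem size.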

The association f*, i.e. the solution of this problem, is chosen and used to update p(x_t), that is, to compute the new state of the system P_t. During the update step the weights w_{i,t} are computed from w_{i,t−1} and d_{i,f*(i)}; if a weight drops below a given threshold, the person is considered lost. Moreover, for each observation in Z_t that is not associated with any person by f*, a new Gaussian is entered in P_t.

The main difference with respect to previous approaches [2, 13, 15] is that we integrate both plan-view and appearance information in the state of the system; by solving the above optimization problem we find the best matching between the observations and the tracker state, considering in an integrated way the positions of the people in the environment and their appearance.

Finally, in order to increase the robustness of the system to tracking errors, a finite state machine is associated with each person being tracked. Each track can be in one of the following states: new, candidate, tracked, lost, terminated. The transition from one state to another depends on the current association f* as well as on the track history. For example, after a new observation of a person, the corresponding track becomes a candidate; if it is confirmed for a certain number of frames, it is moved to the tracked state; then, if the track is not observed, it gets the lost state, which is maintained for some frames. If the track is observed again, it goes back to the tracked state, thus correctly dealing with partial occlusions (e.g., one person in front of another) and bridging short-term breaks in a person's path (i.e. short-term re-acquisition). Otherwise, after some time, the track is declared terminated. This latter state is used for a re-acquisition process (i.e. recognizing a person that exits and re-enters the field of view of the camera) that is not described in this paper.
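The life cycle of a track can be sketched as a small state machine; the frame-count thresholds below are illustrative, since the paper does not give exact values:

```python
# Sketch of the per-track finite state machine described above.
# The thresholds are illustrative, not taken from the paper.
CONFIRM_FRAMES, LOST_FRAMES = 5, 30

class TrackFSM:
    def __init__(self):
        self.state, self.frames = "new", 0

    def step(self, observed):
        self.frames += 1
        if self.state == "new":
            self.state, self.frames = "candidate", 0
        elif self.state == "candidate":
            if not observed:
                self.state = "terminated"
            elif self.frames >= CONFIRM_FRAMES:
                self.state, self.frames = "tracked", 0
        elif self.state == "tracked":
            if not observed:
                self.state, self.frames = "lost", 0
        elif self.state == "lost":
            if observed:                       # short-term re-acquisition
                self.state, self.frames = "tracked", 0
            elif self.frames >= LOST_FRAMES:
                self.state = "terminated"      # candidate for long-term re-acquisition
        return self.state
```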

6 Applications and experiments

The system presented in this paper is in use within the RoboCare project [1, 16], whose goal is to build a multi-agent system that generates user services for human assistance and to develop support technology that can play a role in allowing elderly people to lead an independent lifestyle in their own homes. The RoboCare Domestic Environment (RDE), located at the Institute for Cognitive Science and Technology (CNR, Rome, Italy), is intended to be a testbed environment in which to evaluate the domotic technology built within the project.

In this application scenario, the ability to track people in a domestic environment or within a health-care institution is a fundamental building block for a number of services requiring information about the pose and trajectories of the persons (elders, human assistants) or robots acting in the environment.

In order to evaluate the system in this context we have performed a first set of experiments aimed at evaluating the efficiency and precision of the system. The computational time of the entire process described in this paper is below 100 ms per frame on a 2.4 GHz CPU for high-resolution (640×480) images (although some processing is performed at a lower resolution of 320×240), thus making it possible to process a video stream at a frame rate of 10 Hz. This frame rate is sufficient to effectively track walking people in a domestic environment, where velocities are typically limited.

For measuring the precision of the system we have marked 9 positions in the environment at different distances and angles from the camera and measured the distance returned by the system for a person standing on these positions.

Marker  Distance  Angle  Avg. Err.  Std. Dev.
M1      3.00 m     0°     7.3 cm     8.5 cm
M2      2.70 m     0°     7.1 cm     6.9 cm
M3      2.40 m     0°     6.5 cm     3.9 cm
M4      3.00 m    32°     9.6 cm    10.7 cm
M5      2.70 m    35°     9.2 cm     9.3 cm
M6      2.40 m    37°     8.8 cm     7.1 cm
M7      3.00 m   -32°     9.2 cm    11.0 cm
M8      2.70 m   -35°     9.2 cm    10.1 cm
M9      2.40 m   -37°     8.8 cm     7.3 cm

Table 1: Accuracy results.

Although this error analysis is affected by imprecise positioning of the person on the markers, the results of our experiments (Table 1, averaging 40 measurements per position) show a precision in localization (i.e., average error) of less than 10 cm, but with a high standard deviation, which denotes the difficulty of taking this measurement precisely. However, this precision is usually sufficient for many applications, such as the one considered in our domestic scenario.

Finally, we have performed specific experiments to evaluate the integration of plan-view and appearance matching during tracking. We have compared two cases: a first one in which only the positions of the people are considered during tracking (i.e., the Gaussians representing x_t contain only information about people's locations), and a second one in which the appearance models of the people are combined with their locations (i.e., the Gaussians in x_t integrate plan-view position and appearance model).

We have performed several experiments on different movies taken in different situations and counted the number of association errors in these two cases. As data association errors we counted all the situations in which either a track was associated with more than one person or a person was associated with more than one track.

Figure 5: An example of track strip.

Error analysis has been performed by analyzing the track strips produced as output by the PLT system. An example of a track strip is shown in Figure 5. It contains four snapshots: 1) the first frame in which the track is noticed (white bounding box); 2) the frame in which an ID is assigned to the track (colored bounding box); 3) the last tracked frame and the projection of the trajectory followed (the track has the same color as the bounding box in the second snapshot); 4) ten frames after the last track (to check for track splitting or false negatives).

By visual inspection of such track strips it is possible to identify the association errors cited above. This error analysis has been performed on several video clips containing a total of about 200,000 frames (of which about 3,500 contain two people close to each other). The integrated approach reduces the association errors by about 50% (namely, from 39 in the tracker with plan-view position only to 17 in the one with integrated information).

7 Conclusions and Future Work

In this paper we have presented a People Localization and Tracking System that integrates several capabilities into an effective and efficient implementation: dynamic background modeling, intensity- and range-based foreground segmentation, plan-view projection and segmentation for tracking and determining object masks, Kalman Filter tracking, and person appearance models for improving data association in tracking.

With respect to existing techniques, we have described in this paper three novel aspects: 1) a background modeling technique that is adaptively updated with a learning factor that varies over time and is different for each pixel; 2) a plan-view segmentation that is used to refine foreground segmentation in case of partial occlusions; 3) an integrated tracking method that considers both plan-view positions and color-based appearance models and solves an optimization problem to find the best matching between observations and the current state of the tracker.
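The matching step of the third aspect can be sketched as a one-to-one assignment problem. The cost combination below is a simplified illustration: the mixing weight `w`, the scalar appearance feature, and the helper names are assumptions, not the system's actual implementation.

```python
from itertools import permutations
from math import hypot

def match_cost(track, obs, w=0.5):
    """Cost of pairing one track with one observation: a weighted sum of
    plan-view distance and appearance distance. The weight w and the
    scalar appearance feature are illustrative assumptions."""
    (tx, ty), t_app = track
    (ox, oy), o_app = obs
    pos_d = hypot(tx - ox, ty - oy)   # plan-view (floor-plane) distance
    app_d = abs(t_app - o_app)        # stand-in for a color-model distance
    return w * pos_d + (1 - w) * app_d

def best_matching(tracks, observations):
    """Exhaustive search for the minimum-cost one-to-one matching,
    feasible because a single sensor observes only a few people."""
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(observations)), len(tracks)):
        cost = sum(match_cost(t, observations[j])
                   for t, j in zip(tracks, perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm))  # (track_index, observation_index)
```

For larger numbers of targets, the brute-force search over permutations would be replaced by the Hungarian algorithm, which solves the same assignment problem in polynomial time.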

Experimental results on efficiency and precision show good performance of the system. Furthermore, the evaluation of the integrated tracking shows its effectiveness in reducing data association errors. However, we intend to address other aspects of the system: first, using a multi-modal representation (as in [10]) for tracking, in order to better deal with uncertainty and association errors; second, evaluating the reliability of tracking and short- to medium-term re-acquisition of people leaving and re-entering the scene.

Finally, in order to expand the size of the monitored area, we are planning to use multiple tracking systems. This is a challenging problem because it emphasizes the need to re-acquire people moving from one sensor's field of view to another. One way of simplifying this task is to arrange overlapping fields of view for nearby cameras; however, this arrangement increases the number of sensors needed to cover an environment and limits the scalability of the system. In the near future we intend to extend the system to track people with multiple sensors that do not overlap.

Acknowledgments

This research is partially supported by MIUR (Italian Ministry of Education, University and Research) under project RoboCare (A Multi-Agent System with Intelligent Fixed and Mobile Robotic Components). Luca Iocchi also acknowledges SRI International, where part of this work was carried out, and, in particular, Dr. Robert C. Bolles for his interesting discussions and useful suggestions.

References

[1] S. Bahadori, A. Cesta, L. Iocchi, G. R. Leone, D. Nardi, F. Pecora, R. Rasconi, and L. Scozzafava. Towards ambient intelligence for the domestic care of the elderly. In P. Remagnino, G. L. Foresti, and T. Ellis, editors, Ambient Intelligence: A Novel Paradigm. Springer, 2004.

[2] D. Beymer and K. Konolige. Real-time tracking of multiple people using stereo. In Proc. of IEEE Frame Rate Workshop, 1999.

[3] T. Darrell, D. Demirdjian, N. Checka, and P. F. Felzenszwalb. Plan-view trajectory estimation with dense stereo background models. In Proc. of 8th Int. Conf. on Computer Vision (ICCV'01), pages 628–635, 2001.

[4] T. Darrell, G. Gordon, M. Harville, and J. Woodfill. Integrated person tracking using stereo, color, and pattern detection. International Journal of Computer Vision, 37(2):175–185, 2000.

[5] D. Focken and R. Stiefelhagen. Towards vision-based 3-D people tracking in a smart room. In Proc. of 4th IEEE Int. Conf. on Multimodal Interfaces (ICMI'02), 2002.

[6] N. Friedman and S. Russell. Image segmentation in video sequences: a probabilistic approach. In Proc. of 13th Conf. on Uncertainty in Artificial Intelligence, 1997.

[7] I. Haritaoglu, D. Harwood, and L. S. Davis. W4S: A real-time system for detecting and tracking people in 2 1/2D. In Proc. of the 5th European Conf. on Computer Vision, pages 877–892. Springer-Verlag, 1998.

[8] I. Haritaoglu, D. Harwood, and L. S. Davis. An appearance-based body model for multiple people tracking. In Proc. of 15th Int. Conf. on Pattern Recognition (ICPR'00), 2000.

[9] M. Harville, G. Gordon, and J. Woodfill. Foreground segmentation using adaptive mixture models in color and depth. In Proc. of IEEE Workshop on Detection and Recognition of Events in Video, pages 3–11, 2001.

[10] M. Isard and A. Blake. Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.

[11] J. Kang, I. Cohen, and G. Medioni. Object reacquisition using invariant appearance model. In Proc. of 17th Int. Conf. on Pattern Recognition (ICPR'04), 2004.

[12] K. Konolige. Small vision systems: Hardware and implementation. In Proc. of 8th International Symposium on Robotics Research, 1997.

[13] J. Krumm, S. Harris, B. Meyers, B. Brumitt, M. Hale, and S. Shafer. Multi-camera multi-person tracking for EasyLiving. In Proc. of Int. Workshop on Visual Surveillance, 2000.

[14] J. Li, C. S. Chua, and Y. K. Ho. Color-based multiple people tracking. In Proc. of 7th Int. Conf. on Control, Automation, Robotics and Vision, 2002.

[15] A. Mittal and L. S. Davis. M2Tracker: A multi-view approach to segmenting and tracking people in a cluttered scene using region-based stereo. In Proc. of the 7th European Conf. on Computer Vision (ECCV'02), pages 18–36. Springer-Verlag, 2002.

[16] RoboCare project. http://robocare.istc.cnr.it.

[17] K. Roh, S. Kang, and S. W. Lee. Multiple people tracking using an appearance model based on temporal color. In Proc. of 15th Int. Conf. on Pattern Recognition (ICPR'00), 2000.

[18] C. Stauffer and W. E. L. Grimson. Adaptive background mixture models for real-time tracking. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'99), pages 246–252, 1999.

[19] R. Tsai. An efficient and accurate camera calibration technique for 3D machine vision. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'86), 1986.

[20] C. R. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(7):780–785, 1997.

[21] T. Zhao and R. Nevatia. Tracking multiple humans in crowded environment. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'04), 2004.
