
Real-Time Cooperative Multi-Target Tracking

by Communicating Active Vision Agents

Takashi Matsuyama
Department of Intelligent Science and Technology
Graduate School of Informatics, Kyoto University

Sakyo, Kyoto, JAPAN 606-8501
e-mail: [email protected]

Abstract: Target detection and tracking is one of the most important and fundamental technologies to develop real world computer vision systems such as security and traffic monitoring systems. This paper presents a real-time cooperative multi-target tracking system. The system consists of a group of Active Vision Agents (AVAs), where an AVA is a logical model of a network-connected computer with an active camera. All AVAs cooperatively track their target objects by dynamically exchanging object information with each other. With this cooperative tracking capability, the system as a whole can track multiple moving objects persistently even under complicated dynamic environments in the real world.

1 Introduction

Target detection and tracking is one of the most important and fundamental technologies to develop real world computer vision systems: e.g. visual surveillance systems, ITS (Intelligent Transport Systems), and so on.

To realize real-time flexible tracking in a wide-spread area, we proposed the idea of Cooperative Distributed Vision (CDV, in short)[1]. The goal of CDV is summarized as follows (Fig. 1): embed in the real world a group of Active Vision Agents (AVA, in short: a network-connected computer with an active camera), and realize
1. wide-area dynamic scene understanding and
2. versatile scene visualization.

Applications of CDV include real-time wide-area surveillance and traffic monitoring, remote conference and lecturing, 3D video[2] and the intelligent TV studio, and navigation of mobile robots and disabled people.

Figure 1: Cooperative distributed vision.

While the idea of CDV shares much with those of the DVMT (Distributed Vehicle Monitoring Testbed)[3] and the VSAM (Video Surveillance And Monitoring) project by DARPA[4], our primary interest rests in how we can realize intelligent systems which work adaptively in the real world, and we put our focus upon dynamic interactions among perception, action, and communication. That is, we believe that intelligence does not dwell solely in the brain but emerges from active interactions with the environment through perception, action, and communication.

With this scientific motivation in mind, we designed a real-time cooperative multi-target tracking system, for which we developed:
Visual Sensor: a Fixed-Viewpoint Pan-Tilt-Zoom Camera[5] for wide-area active imaging,
Visual Perception: Active Background Subtraction for target detection and tracking[1],
Dynamic Integration of Visual Perception and Camera Action: the Dynamic Memory Architecture[6] for real-time reactive tracking, and
Network Communication for Cooperation: a three-layered dynamic interaction architecture for real-time communication among AVAs.

In this paper (the original version of which will appear in IEEE Proceedings), we address the key ideas of the above-mentioned technologies and demonstrate their effectiveness in real-time multi-target tracking.

2 Fixed-Viewpoint Pan-Tilt-Zoom Camera for Wide-Area Active Imaging

To develop wide-area video surveillance systems, we first of all should study methods of expanding the visual field of a video camera:
1. Omnidirectional cameras using fish-eye lenses or curved mirrors[8][9][10], or
2. Active cameras mounted on computer-controlled camera heads[5][7][11].

In the former optical methods, while omnidirectional images can be acquired at video rate, their resolution is limited. In the latter mechanical methods, on the other hand, high-resolution image acquisition is attained at the cost of a limited instantaneous visual field.

In our tracking system, we took the active camera method for the following reasons:
(a) High-resolution images are of the first importance for object identification and scene visualization.
(b) Dynamic visual field and image resolution control can be realized by active zooming.
(c) The limited instantaneous visual field problem can be solved by incorporating a group of distributed cameras.

The next problem is how to design an active camera. Suppose we design a pan-tilt camera. This active camera system includes a pair of geometric singularities: 1) the projection center of the imaging system and 2) the pan and tilt rotation axes. In ordinary pan-tilt camera systems, no deliberate design concerning these singularities is incorporated, which introduces difficult problems in image analysis. That is, the discordance of the singularities causes photometric and geometric appearance variations during the camera rotation: varying highlights and motion parallax. Consequently, to cope with these appearance variations, sophisticated image processing has to be employed[7].

Figure 2: Fixed viewpoint pan-tilt camera.

Figure 3: Panoramic image taken by the developed FV-PTZ camera: (a) observed images taken by changing (pan, tilt) angles; (b) generated panoramic image.

Our idea to solve this appearance variation problem is very simple but effective[5][11]:
1. Make the pan and tilt axes intersect with each other.
2. Place the projection center at the intersecting point.
We call an active camera designed this way the Fixed Viewpoint Pan-Tilt Camera. With this camera, all images taken with different pan-tilt angles can be mapped seamlessly onto a common virtual screen (the Appearance Sphere in Fig. 2) to generate a wide panoramic image. Note that once the panoramic image is obtained, images taken with arbitrary combinations of pan-tilt parameters can be generated by back-projecting the panoramic image onto the corresponding image planes.
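As an illustration of this back-projection, the following sketch regenerates the view for a given pan-tilt-zoom setting from a panoramic image. It is not the system's actual implementation: it assumes the panorama is stored as an equirectangular (pan, tilt) map and uses nearest-neighbour sampling.

```python
import numpy as np

def generate_view(panorama, pan_deg, tilt_deg, h_fov_deg, out_w=320, out_h=240):
    """Back-project an equirectangular panorama (a flattened appearance
    sphere) to synthesize the image a fixed-viewpoint camera would observe
    at the given pan/tilt with the given horizontal field of view."""
    H, W = panorama.shape[:2]
    f = (out_w / 2.0) / np.tan(np.radians(h_fov_deg) / 2.0)  # focal length in pixels
    # Pixel grid of the virtual image plane, origin at the image center.
    xs, ys = np.meshgrid(np.arange(out_w) - out_w / 2.0,
                         np.arange(out_h) - out_h / 2.0)
    rays = np.stack([xs, ys, np.full_like(xs, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Rotate the rays by the pan (about the vertical axis) and tilt angles.
    p, t = np.radians(pan_deg), np.radians(tilt_deg)
    Rp = np.array([[np.cos(p), 0, np.sin(p)], [0, 1, 0], [-np.sin(p), 0, np.cos(p)]])
    Rt = np.array([[1, 0, 0], [0, np.cos(t), -np.sin(t)], [0, np.sin(t), np.cos(t)]])
    d = rays @ (Rp @ Rt).T
    # Convert ray directions to (pan, tilt) angles and sample the panorama.
    lon = np.arctan2(d[..., 0], d[..., 2])           # pan angle of each ray
    lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))   # tilt angle of each ray
    u = ((lon + np.pi) / (2 * np.pi) * (W - 1)).astype(int)
    v = ((lat + np.pi / 2) / np.pi * (H - 1)).astype(int)
    return panorama[v, u]
```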

Usually, zooming can be modeled as a shift of the projection center along the optical axis[12]. Thus, to realize the Fixed-Viewpoint Pan-Tilt-Zoom Camera (FV-PTZ camera, in short), either of the following additional mechanisms should be employed:
(a) Design a zoom lens system whose projection center stays fixed irrespective of zooming.
(b) Introduce a slide stage that aligns the projection center depending on the zooming factor.

We found that the SONY EVI-G20, an off-the-shelf active video camera, is a good approximation of an FV-PTZ camera (−30° ≤ pan ≤ 30°, −15° ≤ tilt ≤ 15°, and zoom: 15° ≤ horizontal view angle ≤ 44°); its projection center stays almost fixed irrespective of zooming. We then developed a sophisticated internal-camera-parameter calibration method for this camera, with which we can use it as an FV-PTZ camera[1]. Fig. 3(a) illustrates a set of observed images taken by changing the pan-tilt angles at the smallest zooming factor. Fig. 3(b) shows the panoramic image generated from the observed images.

3 Active Background Subtraction for Target Detection and Tracking

With an FV-PTZ camera, we can easily realize an active target tracking system. Fig. 4 illustrates the basic scheme of the active background subtraction for target detection and tracking we developed[1]:

STEP 1: Generate the panoramic image of the scene without any objects (the Appearance Plane in the figure).
STEP 2: Extract a window image from the appearance plane according to the current pan-tilt-zoom parameters and regard it as the current background image.
STEP 3: Compute the difference between the generated background image and the observed image.
STEP 4: If anomalous regions are detected in the difference image, select one and control the camera parameters to track the selected target.
STEP 5: Go back to STEP 2.

Figure 4: Active background subtraction with an FV-PTZ camera.
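The STEP 1-5 loop can be summarized with the following sketch. Here `camera` is a hypothetical FV-PTZ interface (get_ptz(), capture(), point_at()), the thresholding and "select one" step are deliberately simplistic, and generate_view is the back-projection sketch from Section 2 with the zoom parameter treated as a horizontal view angle.

```python
import numpy as np

def active_background_subtraction_loop(camera, appearance_plane, threshold=30):
    """Sketch of STEPs 2-5 (STEP 1, generating the appearance plane of the
    empty scene, is assumed to have been done beforehand)."""
    while True:
        pan, tilt, zoom = camera.get_ptz()
        # STEP 2: window the appearance plane with the current PTZ parameters.
        background = generate_view(appearance_plane, pan, tilt, h_fov_deg=zoom)
        observed = camera.capture()
        # STEP 3: difference between the generated background and the observed image.
        diff = np.abs(observed.astype(int) - background.astype(int)).max(axis=-1)
        anomalous = diff > threshold
        # STEP 4: if anomalous regions exist, select one and drive the camera toward it.
        if anomalous.any():
            ys, xs = np.nonzero(anomalous)
            camera.point_at((xs.mean(), ys.mean()))  # crude selection: centroid of all anomalous pixels
        # STEP 5: go back to STEP 2 (next loop iteration).
```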

To cope with dynamically changing situations in the real world, we have to augment the above scheme in the following three points:
(a) Robust background subtraction which can work stably under non-stationary environments.
(b) Flexible system dynamics to control the camera reactively to unpredictable object behaviors.
(c) Multi-target tracking in cluttered environments.

We do not address the first problem here, since various robust background subtraction methods have been developed[13][14]. As for the system dynamics, we will present a novel real-time system architecture in the next section and then propose a cooperative multi-target tracking system in Section 5.

4 Dynamic Integration of Visual Perception and Camera Action for Real-Time Reactive Target Tracking

The active tracking system described in Fig. 4 can be decomposed into visual perception and camera action modules. The former includes image capturing, background image generation, image subtraction, and object region detection. The latter performs camera control and camera state (i.e. pan-tilt angles and zooming factor) monitoring.

Figure 5: Dynamic interaction between the visual perception and camera action modules: (a) information flow between the modules; (b) dynamics in a sequential target tracking system.

Here we discuss the dynamics of this system. Fig. 5(a) illustrates the information flow between the perception and action modules: the former obtains the current camera parameters from the latter to generate the background image, and the latter the current target location from the former to control the camera. Fig. 5(b) shows the dynamics of the system, where the two modules are activated sequentially.

While this system worked stably[1], the camera motion was not smooth, nor could it follow abrupt changes of target motion:
(a) The frequency of image observations is limited due to the sequential system dynamics. That is, the perception module has to wait for the termination of the slow mechanical camera motion.
(b) Due to the delays involved in image processing, camera state monitoring, and mechanical camera motion, the perception and action modules cannot obtain the accurate current camera state or target location, respectively.

To solve these problems and realize real-time reactive target tracking, we proposed a novel dynamic system architecture named the Dynamic Memory Architecture[6], where the visual perception and camera action modules run in parallel and dynamically exchange information via a specialized shared memory named the Dynamic Memory (Fig. 6).

Figure 6: Real-time reactive target tracking system with the dynamic memory.

4.1 Access Methods for the Dynamic Memory

While the system architecture consisting of multiple parallel processes with a common shared memory looks similar to the "whiteboard architecture"[15] and the "smart buffer"[16], the critical difference rests in that each variable in the dynamic memory stores a discrete temporal sequence of values and is associated with the following temporal interpolation and prediction functions (Fig. 7).

Figure 7: Representation of a time-varying variable in the dynamic memory.

The write and read operations to/from the dynamic memory are defined as follows:

(a) Write Operation

When a process computes a value val of a variable v at a certain moment t, it writes (val, t) into the dynamic memory. Since such computation is done repeatedly according to the dynamics of the process, a discrete temporal sequence of values is recorded for each variable in the dynamic memory (a sequence of black dots in Fig. 7).

(b) Read Operation

Temporal Interpolation: A reader process runs in parallel to the writer process and tries to read from the dynamic memory the value of a variable v at a certain moment, e.g. the value at T1 in Fig. 7. When no value is recorded at the specified moment, the dynamic memory interpolates it from the recorded data. With this function, the reader process can read a value at any moment along the continuous temporal axis without any explicit synchronization with the writer process.

Figure 8: Observed image sequence taken by the system. Upper: input images, Lower: detected object regions.

Future Prediction: A reader process may run fast and require data which have not yet been written by the writer process (for example, the value at T3 in Fig. 7). In such a case, the dynamic memory predicts the expected future value based on the data recorded so far and returns it to the reader process.

With the above-described functions, each process can get any data along the temporal axis freely without waiting (i.e. wasting time) for synchronization with others. This no-wait asynchronous module interaction capability greatly facilitates the implementation of real-time reactive systems. As will be shown later in Section 5.2.3, moreover, the dynamic memory supports the virtual synchronization between multiple network-connected systems (i.e. AVAs), which facilitates the real-time dynamic cooperation among the systems.
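A minimal sketch of one dynamic-memory variable is given below. It uses linear interpolation and extrapolation as stand-ins for whatever interpolation and prediction functions the actual system employs; the class name and API are inventions for illustration.

```python
import bisect
import threading

class DynamicMemory:
    """One dynamic-memory variable: writers append time-stamped values;
    readers may ask for any time and get an interpolated (past) or
    linearly extrapolated (future) value."""

    def __init__(self):
        self._times, self._values = [], []
        self._lock = threading.Lock()

    def write(self, value, t):
        with self._lock:                       # writer records (value, t)
            self._times.append(t)
            self._values.append(value)

    def read(self, t):
        with self._lock:                       # snapshot for lock-free math below
            times, values = list(self._times), list(self._values)
        if not times:
            raise ValueError("no value recorded yet")
        if len(times) == 1:
            return values[0]
        if t >= times[-1]:                     # future: extrapolate from the last two samples
            t0, t1, v0, v1 = times[-2], times[-1], values[-2], values[-1]
        else:                                  # past: interpolate between neighbouring samples
            i = max(1, bisect.bisect_left(times, t))
            t0, t1, v0, v1 = times[i - 1], times[i], values[i - 1], values[i]
        return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
```

In this sketch the perception process would call write(value, t) at its own pace, while the action process calls read(now) whenever the camera is ready to accept a command, with no explicit synchronization between the two.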

4.2 Effectiveness of the Dynamic Memory

To verify the effectiveness of the dynamic memory, we developed a real-time single-target tracking system and conducted experiments of tracking a radio-controlled car in a computer room. The system employed the parallel active background subtraction method with the FV-PTZ camera, where the perception and action modules were implemented as UNIX processes sharing the dynamic memory. Fig. 8 illustrates a partial sequence of observed images and detected object regions. Note that the accurate calibration of the FV-PTZ camera enabled stable background subtraction even while changing pan, tilt, and zooming.

            rate of image     deviation of target location    size of target
            observations      from the image center           region
System A    1.83 [fps]        44.0 [pixel]                    5083 [pixel]
System B    11.04 [fps]       16.7 [pixel]                    5825 [pixel]

Table 1: Performance evaluation.

Table 1 compares the performance of System A (sequential dynamics) and System B (parallel dynamics with the dynamic memory). Both systems tracked a computer-controlled toy car under the same experimental settings, and the performance factors were averaged over about 30 sec. The left column of the table shows that the dynamic memory greatly improved the rate of image observations owing to the no-wait asynchronous execution of the perception module. The other two columns verify the improvements in the camera control. That is, with the dynamic memory, the camera was directed toward the target more accurately (middle column) and hence could observe the target at higher resolution (right column). Note that our system controls the pan-tilt angles to keep the target at the image center and adjusts the zooming factor depending on the deviation of the target from the image center: smaller deviations lead to zooming in to capture higher-resolution target images, while larger deviations lead to zooming out so as not to miss the target[1].
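The camera control rule in the last sentence could look roughly like the following; the gains, thresholds, and zoom step are invented for illustration and are not the values used in the reported experiments.

```python
def control_camera(target_px, image_size, zoom_deg, zoom_range=(15.0, 44.0),
                   zoom_in_thresh=20.0, zoom_out_thresh=60.0, step=2.0):
    """Return (pan_offset, tilt_offset, new_zoom): center the target in the
    image and adapt the zoom (horizontal view angle) to the deviation of the
    target from the image center. Small deviation -> zoom in (smaller view
    angle), large deviation -> zoom out (larger view angle)."""
    w, h = image_size
    dx, dy = target_px[0] - w / 2.0, target_px[1] - h / 2.0
    deviation = (dx ** 2 + dy ** 2) ** 0.5
    # Pan/tilt offsets proportional to the pixel deviation (gain is illustrative).
    pan_offset, tilt_offset = 0.05 * dx, 0.05 * dy
    if deviation < zoom_in_thresh:        # target well centered: capture higher resolution
        zoom_deg = max(zoom_range[0], zoom_deg - step)
    elif deviation > zoom_out_thresh:     # target drifting away: widen the view, don't lose it
        zoom_deg = min(zoom_range[1], zoom_deg + step)
    return pan_offset, tilt_offset, zoom_deg
```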

5 Cooperative Multi-Target Tracking

Now we address cooperative multi-target tracking by communicating active vision agents (AVAs), where an AVA denotes an augmented target tracking system as described in the previous section. The augmentation means that an AVA consists of visual perception, camera action, and network communication modules, which run in parallel exchanging information via the dynamic memory.

Figure 9: Basic scheme for cooperative tracking: (a) Gaze navigation, (b) Cooperative gazing, (c) Adaptive target switching.

5.1 Basic Scheme for Cooperative Tracking

Our multi-target tracking system consists of a group of AVAs embedded in the real world (Fig. 1). The system assumes that the cameras are calibrated and densely distributed over the scene so that their visual fields overlap well with each other.

The following are the basic tasks of the system:
1. Initially, each AVA independently searches for a target that comes into its observable area. An AVA that is searching for a target is called a freelancer.
2. If an AVA detects a target, it navigates the gazes of the other AVAs towards that target (Fig. 9(a)).
3. A group of AVAs which gaze at the same target form what we call an Agency and keep measuring the 3D information of the target from multi-view images (Fig. 9(b)).
4. Depending on the target locations in the scene, each AVA dynamically changes its target (Fig. 9(c)).

To realize the above cooperative tracking, we have to solve the following problems:
Multi-target identification: To gaze at each target, the system has to distinguish multiple targets.
Real-time and reactive processing: To adapt itself to dynamic changes in the scene, the system has to execute processing in real time and quickly react to the changes.
Adaptive resource allocation: We have to implement two types of dynamic resource allocation (i.e. grouping AVAs into agencies): (1) to perform both target search and tracking simultaneously, the system has to preserve AVAs that search for new targets even while tracking targets, and (2) to track each moving target persistently, the system has to adaptively determine which AVAs should track which targets.

Figure 10: Three-layered dynamic interaction architecture.

In what follows, we address how these problems can be solved by real-time cooperative communications among AVAs.

5.2 Three-Layered Dynamic Interactions for Cooperative Tracking

We designed and implemented the three-layered dynamic interaction architecture illustrated in Fig. 10 to realize real-time cooperative multi-target tracking.

5.2.1. Intra-AVA layer

In the lowest layer in Fig. 10, the perception, action, and communication modules that compose an AVA interact with each other via the dynamic memory.

An AVA is an augmented version of the target tracking system described in Section 4, where the augmentation is threefold:

(1) Multi-target detection while single-target tracking

When the perception module detects N objects at time t+1, it computes the 3D view lines toward the objects, L1(t+1), ..., LN(t+1), and records them into the dynamic memory. (A 3D view line is the 3D line determined by the projection center of the camera and the centroid of an object region.) Then, the module compares them with the 3D view line toward its currently tracked target at t+1, L̂(t+1). Note that L̂(t+1) can be read from the dynamic memory whatever temporal moment t+1 specifies. Suppose Lx(t+1) is closest to L̂(t+1), where x ∈ {1, ..., N}. Then the module regards Lx(t+1) as denoting the newest target view line and records it into the dynamic memory.
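Because all view lines from one AVA pass through the camera's fixed projection center, "closest" can be measured simply by the angle between line directions. The following sketch (function and variable names invented) illustrates the selection of Lx(t+1).

```python
import numpy as np

def select_target_view_line(detected_dirs, target_dir):
    """Pick, among the detected view-line directions (unit 3D vectors),
    the one making the smallest angle with the current target view line."""
    target_dir = np.asarray(target_dir, dtype=float)
    angles = [np.arccos(np.clip(np.dot(d, target_dir), -1.0, 1.0))
              for d in np.asarray(detected_dirs, dtype=float)]
    x = int(np.argmin(angles))           # index of L_x(t+1)
    return x, detected_dirs[x]           # record this as the newest target view line
```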

(2) Gaze control based on the 3D target position

When the FV-PTZ camera is ready to accept a control command, the action module reads the 3D view line toward the target (i.e. L̂(now)) from the dynamic memory and controls the camera to gaze at the target. As will be described later, when an agency with multiple AVAs tracks the target, it measures the 3D position of the target (denoted by P̂(t)) and sends it to all member AVAs, where the communication module writes it into the dynamic memory. If such information is available, the action module controls the camera based on P̂(now) instead of L̂(now).
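Computing the gaze command from P̂(now) amounts to converting the target position, expressed in the camera's reference frame, into pan and tilt angles. A minimal sketch follows; the frame convention (z forward, x right, y down) is an assumption of this sketch, not a statement of the paper's convention.

```python
import numpy as np

def gaze_angles(target_pos, camera_center, R_world_to_cam=np.eye(3)):
    """Pan/tilt angles (degrees) that make a fixed-viewpoint camera look at a
    3D point: express the point in the camera frame and convert the viewing
    direction to angles."""
    d = R_world_to_cam @ (np.asarray(target_pos, float) - np.asarray(camera_center, float))
    pan = np.degrees(np.arctan2(d[0], d[2]))
    tilt = np.degrees(np.arctan2(d[1], np.hypot(d[0], d[2])))
    return pan, tilt
```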

(3) Incorporation of the communication module

Data exchanged by the communication module over the network can be classified into two types: detected object data and messages for cooperation among AVAs. The former includes the 3D view lines toward detected objects (AVA → other AVAs and agencies) and the 3D target position (agency → member AVAs). The latter realizes the various communication protocols described later.

5.2.2. Intra-Agency layer

As defined before, a group of AVAs which track the same target form an Agency. Forming an agency means generating an agency manager, an independent parallel process that coordinates interactions among its member AVAs. The middle layer in Fig. 10 specifies the dynamic interactions between an agency manager and its member AVAs.

In our system, an agency should correspond one-to-one to a target. To establish this correspondence dynamically and maintain it persistently, the following two kinds of object identification are required in the intra-agency layer.

(a) Spatial object identification

The agency manager has to establish the object identification between the groups of 3D view lines detected and transmitted by its member AVAs. The agency manager checks the distances between the 3D view lines detected by different member AVAs and computes the 3D target position from a set of nearly intersecting 3D view lines. The manager employs what we call Virtual Synchronization to virtually adjust the observation timings of the 3D view lines (see 5.2.3 for details). Note that the manager may find none or multiple sets of such nearly intersecting 3D view lines. To cope with these situations, the manager conducts the following temporal object identification.

(b) Temporal object identification

The manager records the 3D trajectory of its target, against which the 3D object position(s) computed by the spatial object identification are compared. That is, when multiple 3D locations are obtained by the spatial object identification, the manager selects the one closest to the target trajectory. When the spatial object identification fails and no 3D object location is obtained, on the other hand, the manager selects the 3D view line that is closest to the latest recorded 3D target position. The manager then projects that target position onto the selected view line to estimate the new 3D target position.
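The two identification steps can be sketched as follows: nearly intersecting view-line pairs vote for candidate 3D positions, the candidate nearest the predicted trajectory point is kept, and the closest view line serves as a fallback when no pair intersects. The distance threshold and the (origin, unit direction) representation of a view line are assumptions of this sketch.

```python
import numpy as np
from itertools import combinations

def closest_points(o1, d1, o2, d2):
    """Closest points between two 3D lines given as numpy arrays: origin o, unit direction d."""
    w0 = o1 - o2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    p, q = d1 @ w0, d2 @ w0
    denom = a * c - b * b
    s = (b * q - c * p) / denom if denom > 1e-9 else 0.0
    t = (a * q - b * p) / denom if denom > 1e-9 else q / c
    p1, p2 = o1 + s * d1, o2 + t * d2
    return p1, p2, np.linalg.norm(p1 - p2)

def identify_target(view_lines, predicted_pos, dist_thresh=0.3):
    """Spatial + temporal identification sketch: view_lines is a list of
    (origin, unit_direction) pairs from different member AVAs; predicted_pos
    is the target position read from the dynamic memory (numpy array)."""
    candidates = []
    for (o1, d1), (o2, d2) in combinations(view_lines, 2):
        p1, p2, gap = closest_points(o1, d1, o2, d2)
        if gap < dist_thresh:                     # nearly intersecting pair
            candidates.append((p1 + p2) / 2.0)    # candidate 3D position
    if candidates:                                # temporal identification:
        dists = [np.linalg.norm(c - predicted_pos) for c in candidates]
        return candidates[int(np.argmin(dists))]  # keep the one nearest the trajectory
    # No intersecting pair: project the predicted position onto the closest view line.
    o, d = min(view_lines, key=lambda l: np.linalg.norm(np.cross(predicted_pos - l[0], l[1])))
    return o + d * ((predicted_pos - o) @ d)
```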

5.2.3. Virtual Synchronization

Here we discuss the dynamic aspects of the above identification processes.

(a) Spatial object identification

Since AVAs capture images autonomously, member AVAs in an agency observe the target at different moments. Furthermore, the message transmission over the network introduces unpredictable delay between the observation timing by a member AVA and the object identification timing by the agency manager. These asynchronous activities can significantly damage the reliability of the spatial object identification.

To solve this problem, we introduce the dynamic memory into the agency manager, which enables the manager to virtually synchronize any asynchronously observed/transmitted data. We call this function Virtual Synchronization by the dynamic memory.

Figure 11: Virtual synchronization for spatial object identification.

Fig. 11 shows the mechanism of the virtual synchronization. All 3D view lines computed by each member AVA are transmitted to the agency manager, which then records them into its internal dynamic memory. Fig. 11, for example, shows a pair of temporal sequences of 3D view line data transmitted from member AVA1 and member AVA2, respectively. When the manager wants to establish the spatial object identification at T, it can read the pair of synchronized 3D view line data at T from the dynamic memory (i.e. L̄1(T) and L̄2(T) in Fig. 11). That is, the values of the 3D view lines used for the identification are completely synchronized with that identification timing even if their measurements are conducted asynchronously.

(b) Temporal object identification

The virtual synchronization is also effective in the temporal object identification. Let P̂(t) denote the 3D target trajectory recorded in the dynamic memory and {Pi(T) | i = 1, ..., M} the 3D positions of the objects identified at T. Then the manager 1) reads P̂(T) (i.e. the estimated target position at T) from the dynamic memory, 2) selects the one among {Pi(T) | i = 1, ..., M} closest to P̂(T), and 3) records it into the dynamic memory as the new target position.
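Under the DynamicMemory sketch from Section 4.1, virtual synchronization reduces to reading every member's asynchronously written sequence at the same timestamp T. A minimal illustration (the variable layout is invented):

```python
# One dynamic-memory variable per member AVA, each written asynchronously
# by the communication module as view-line data arrive over the network.
view_line_of = {"AVA1": DynamicMemory(), "AVA2": DynamicMemory()}

def read_synchronized_view_lines(T):
    # Interpolated/extrapolated values at the common moment T: virtually
    # synchronized even though each AVA observed and transmitted at its own pace.
    return {ava: mem.read(T) for ava, mem in view_line_of.items()}
```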

Figure 12: Agency Formation.

Figure 13: Agency Maintenance.

Figure 14: Agency Spawning.

5.2.4. Communications at Intra-Agency Layer

The above-mentioned temporal object identification fails if the closest distance between the estimated and observed 3D target locations exceeds a threshold. The following three communication protocols are activated depending on the success or failure of the object identification. They materialize the dynamic interactions at the intra-agency layer.

(a) Agency formation protocol

This protocol defines (1) the new agency generation procedure by a freelancer AVA and (2) the participation procedure of a freelancer AVA into an existing agency.

When a freelancer AVA detects an object, it requests the existing agency managers to examine the identification between the detected object and the target object of each agency (Fig. 12(1)). Depending on the result of this object identification, the freelancer AVA works as follows:
No agency established the object identification: The freelancer AVA generates a new agency manager to track the newly detected object and joins that agency as its member AVA (Fig. 12(2-a)).
An agency established the object identification: The freelancer AVA joins the agency that made the successful object identification, if requested (Fig. 12(2-b)).

(b) Agency maintenance protocol

This protocol defines procedures for the continuous maintenance of an agency and the elimination of an agency.

After an agency is generated, the agency manager repeats the spatial and temporal object identifications for cooperative tracking (Fig. 13(1)). Following the spatial object identification, the manager transmits the newest 3D target location to each member AVA (Fig. 13(2)), which is then recorded into the dynamic memory of the member AVA.

Suppose a member AVAm cannot detect the target object due to an obstacle or processing errors (Fig. 13(3)). Even in this case, the manager informs AVAm of the 3D position of the target observed by the other member AVAs. This information navigates the gaze of AVAm towards the (invisible) target. However, if such mis-detection continues for a long time, the agency manager forces AVAm out of the agency to become a freelancer.

If all member AVAs cannot observe the target being tracked so far, the agency manager destroys the agency and makes all its member AVAs become freelancers.

(c) Agency spawning protocol

This protocol defines a new agency generation procedure from an existing agency.

After the spatial and temporal object identifications, the agency manager may find a 3D view line (or lines) that does not correspond to the target. This means that a member AVA has detected a new object. Let Ln denote such a 3D view line detected by AVAn (Fig. 14(1)). Then, the manager broadcasts Ln to the other agency managers so that they can examine the identification between Ln and the targets they are tracking.

If none of the identifications is successful, the agency manager makes AVAn quit the current agency and generate a new agency (Fig. 14(2)). AVAn then joins the new agency (Fig. 14(3)).
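As a rough summary of how the three protocols described above hang off the identification outcome, the control flow might be organized as below. The manager object, its methods, and the message names are all invented for illustration and are not the system's actual interfaces.

```python
def on_identification_result(manager, result, detecting_ava):
    """Invented dispatch sketch: `result` summarizes the spatial/temporal
    identification for one member AVA's report."""
    if result.matched_target:
        # Agency maintenance: broadcast the newest 3D target position to the members.
        manager.broadcast_target_position(result.position)
    elif result.unmatched_view_line is not None:
        # Agency spawning: a member detected an object that is not this agency's target.
        manager.ask_other_agencies_to_identify(
            result.unmatched_view_line,
            on_all_failed=lambda: manager.spawn_new_agency(detecting_ava))
    else:
        # Agency maintenance: repeated failures expel the member; failure of all
        # members eliminates the agency and returns everyone to freelancer status.
        manager.note_detection_failure(detecting_ava)
```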

5.2.5. Inter-Agency layer

In multi-target tracking, the system should adaptively allocate resources: the system has to adaptively determine which AVAs should track which targets. To realize this adaptive resource allocation, the information about targets and member AVAs is exchanged between agency managers (the top layer in Fig. 10).

Figure 15: Agency Unification.

Figure 16: Agency Restructuring.

The dynamic interactions between agency managers are triggered based on the object identification across agencies. That is, when a new target 3D location is obtained, agency manager AMi broadcasts it to the others. Agency manager AMj, which receives this information, compares it with the 3D position of its own target to check the object identification. Note that here also the virtual synchronization between a pair of 3D target locations is employed to increase the reliability of the object identification.

Depending on the result of this inter-agency object identification, one of the following two protocols is activated.

(a) Agency unification protocol

This protocol is activated when the inter-agency object identification is successful and defines a merging procedure for the agencies which happen to track the same object.

In principle, the system should keep a one-to-one correspondence between agencies and target objects. However, this correspondence is sometimes violated due to failures of object identification and discrimination caused by:
(a) asynchronous observations and/or errors in object detection by individual AVAs, or
(b) multiple targets which come too close to be separated.

Fig. 15 shows an example. When agency manager AMA of agencyA establishes the identification between its own target and the one tracked by AMB, AMA asks AMB to be merged into agencyA (Fig. 15(1)). Then, AMB asks its member AVAs to join agencyA (Fig. 15(2)). After copying the target information recorded in the dynamic memory into the object trajectory database, AMB eliminates itself (Fig. 15(3)).

As noted above, agencies corresponding to multiple different targets may be unified if the targets are very close. However, such a heterogeneously unified agency can be separated back by the agency spawning protocol when the distance between the targets gets larger. In that case, the characteristics of the newly detected target are compared with those recorded in the object trajectory database to check whether the new target corresponds to a target that had been tracked before. If so, the corresponding target trajectory data is moved from the database into the dynamic memory of the newly generated agency.

(b) Agency restructuring protocol

When the inter-agency object identification fails, agency manager AMj checks if it can activate the agency restructuring protocol, taking into account the numbers of member AVAs in agencyj and agencyi and their target locations.

Fig. 16 illustrates an example. Agency manager AMC of agencyC sends its target information to AMD, which fails in the object identification. Then, AMD asks AMC to hand over a member AVA (Fig. 16(1)). When requested, AMC selects one of its member AVAs and asks it to move to agencyD (Fig. 16(2), (3)).

5.2.6. Communication with Freelancer AVAs

An agency manager communicates with freelancer AVAs as well as with other managers (the top row of Fig. 10). As described in the agency formation protocol in Section 5.2.4, a freelancer activates the communication with agency managers when it detects an object. An agency manager, on the other hand, sends its target position to the freelancers whenever new data are obtained. Each freelancer then decides whether it continues to be a freelancer or joins the agency, depending on the target position and the current number of freelancers in the system. Note that in our system a user can specify the number of freelancers to be preserved while tracking targets.

6 Experiments

To verify the effectiveness of the proposed system, we conducted experiments of multiple-human tracking in a room (about 5m × 5m). The system consists of ten AVAs. Each AVA is implemented on a network-connected PC (PentiumIII 600MHz × 2) with an FV-PTZ camera (SONY EVI-G20), where the perception, action, and communication modules as well as the agency managers are realized as UNIX processes. Fig. 18(a) illustrates the camera layout: camera9 and camera10 are on the walls, while the others are on the ceiling. The external camera parameters are calibrated. Note that the internal clocks of all the PCs are synchronized by the Network Time Protocol to realize the virtual synchronization. With this architecture, the perception module of each AVA can capture images and detect objects at about 10 frames per second on average.

In the experiment, the system tracked two people. Target1 first came into the scene and, after a while, target2 came into the scene. Both targets then moved freely. The upper part of Fig. 17 shows the partial image sequences observed by AVA2, AVA5, and AVA9. The images on the same row were taken by the same AVA. The images in the same column were taken at almost the same time. The regions enclosed by black and gray lines in the images show the detected regions corresponding to target1 and target2, respectively.

Each figure in the bottom of Fig. 17 shows the role of each AVA and the agency organization at the moment when the corresponding column of images in the upper part was observed. White circles denote freelancer AVAs, while black and gray circles indicate member AVAs belonging to agency1 and agency2, respectively. Black and gray squares indicate the computed locations of target1 and target2, respectively.

The system worked as follows.

Figure 17: Experimental results.

a: Initially, each AVA searched for an object independently.
b: AVA5 first detected target1, and agency1 was formed.
c: All AVAs except for AVA5 were tracking target1, while AVA5 was searching for a new object as a freelancer.
d: Then, AVA5 detected target2 and generated agency2.
e: The agency restructuring protocol balanced the numbers of member AVAs in agency1 and agency2. Note that AVA9 and AVA10 were working as freelancers.
f: Since the two targets came very close to each other and no AVA could distinguish them, the agency unification protocol merged agency2 into agency1.
g: When the targets got apart, agency1 detected a 'new' target. Then, it activated the agency spawning protocol to generate agency2 again for target2.
h: Target1 was going out of the scene.
i: After agency1 was eliminated, all the AVAs except AVA4 tracked target2.

Fig. 18(a) shows the trajectories of the targets computed by the agency managers. Fig. 18(b) shows the dynamic population changes of the freelancer AVAs, the AVAs tracking target1, and those tracking target2.

As we can see, the dynamic cooperation among AVAs and agency managers worked very well and enabled the system to track multiple targets persistently.

Figure 18: Experimental results: (a) Trajectories of the targets, (b) The number of AVAs that performed each role.

7 Concluding Remarks

This paper presented a real-time active multi-target tracking system, which is the most powerful and flexible, but also the most difficult to realize, among the various types of target tracking systems. To implement the system, we developed 1) a Fixed-Viewpoint Pan-Tilt-Zoom Camera for wide-area active imaging, 2) Active Background Subtraction for target detection and tracking, 3) the Dynamic Memory Architecture for real-time reactive tracking, and 4) a three-layered dynamic interaction architecture for real-time communication among AVAs.

In our system, parallel processes (i.e. AVAs and their constituent perception, action, and communication modules) work cooperatively, interacting with each other. As a result, the system as a whole works as a very flexible real-time reactive multi-target tracking system. We believe that this cooperative distributed processing greatly increases the flexibility and adaptability of the system, which has been verified by the experiments of multiple-human tracking.

This work was supported by the Research for the Future Program of the Japan Society for the Promotion of Science (JSPS-RFTF96P00501). Research efforts by all former and current members of our laboratory are gratefully acknowledged.

References

[1] T. Matsuyama: Cooperative Distributed Vision - Dynamic Integration of Visual Perception, Action and Communication -, Proc. of Image Understanding Workshop, pp. 365-384, 1998.

[2] S. Moezzi, L. Tai, and P. Gerard: Virtual View Generation for 3D Digital Video, IEEE Multimedia, pp. 18-26, 1997.

[3] V. R. Lesser and D. D. Corkill: The Distributed Vehicle Monitoring Testbed: a Tool for Investigating Distributed Problem Solving Networks, AI Magazine, Vol. 4, No. 3, pp. 15-33, 1983.

[4] Video Surveillance and Monitoring, Proc. of Image Understanding Workshop, Vol. 1, pp. 3-400, 1998.

[5] T. Wada and T. Matsuyama: Appearance Sphere: Background Model for Pan-Tilt-Zoom Camera, Proc. of ICPR, Vol. A, pp. 718-722, 1996.

[6] T. Matsuyama, et al.: Dynamic Memory: Architecture for Real Time Integration of Visual Perception, Camera Action, and Network Communication, Proc. of CVPR, pp. 728-735, 2000.

[7] D. Murray and A. Basu: Motion Tracking with an Active Camera, IEEE Trans. on PAMI, Vol. 16, No. 5, pp. 449-459, 1994.

[8] Y. Yagi and M. Yachida: Real-Time Generation of Environmental Map and Obstacle Avoidance Using Omnidirectional Image Sensor with Conic Mirror, Proc. of CVPR, pp. 160-165, 1991.

[9] K. Yamazawa, Y. Yagi, and M. Yachida: Obstacle Detection with Omnidirectional Image Sensor HyperOmni Vision, Proc. of ICRA, pp. 1062-1067, 1995.

[10] V. N. Peri and S. K. Nayar: Generation of Perspective and Panoramic Video from Omnidirectional Video, Proc. of IUW, pp. 243-245, 1997.

[11] S. Coorg and S. Teller: Spherical Mosaics with Quaternions and Dense Correlation, Int'l J. of Computer Vision, Vol. 37, No. 3, pp. 259-273, 2000.

[12] J. M. Lavest, C. Delherm, B. Peuchot, and N. Daucher: Implicit Reconstruction by Zooming, Computer Vision and Image Understanding, Vol. 66, No. 3, pp. 301-315, 1997.

[13] K. Toyama, et al.: Wallflower: Principles and Practice of Background Maintenance, Proc. of ICCV, pp. 255-261, 1999.

[14] T. Matsuyama, T. Ohya, and H. Habe: Background Subtraction for Non-Stationary Scenes, Proc. of 4th Asian Conference on Computer Vision, pp. 662-667, 2000.

[15] C. Thorpe, M. H. Hebert, T. Kanade, and S. A. Shafer: Vision and Navigation for the Carnegie-Mellon Navlab, IEEE Trans. on PAMI, Vol. 10, No. 3, pp. 362-373, 1988.

[16] J. J. Little and J. Kam: A Smart Buffer for Tracking Using Motion Data, Proc. of Computer Architecture for Machine Perception, pp. 257-266, 1993.