
Synthetic Movies for Computer Vision Applications

A. Santuari, O. Lanz, and R. Brunelli 1

1 ITC-irst, POVO (TN), ITALY

ABSTRACT

This paper presents a real-time graphical simulator based on a client-server architecture. The rendering engine, supported by a specialized client application for the automatic generation of goal-oriented motion of synthetic characters, is used to produce realistic image sequences for extensive performance assessment of computer vision algorithms for people tracking.

KEY WORDS

Animation, Virtual Reality, Tracking, Intelligent Ambience

1 Introduction

Vision is the most important human sensory channel, and a detailed understanding of its inner workings is one of the great challenges of modern science. Artificial vision research tries to match human visual competence by developing algorithms able to infer from two-dimensional images, or image sequences, the contents of the corresponding 3D static or dynamic scenes. Extensive evaluation of algorithm performance is necessary but in many cases extremely costly to perform. Consider a simple algorithm that detects all persons within a camera's field of view, providing a binary image distinguishing moving people from the static background. Precise evaluation of the accuracy of the algorithm would require manual annotation of every single pixel in every image of the captured sequences. Reliable estimation of the performance under different operating conditions requires the use of many sequences, further increasing the cost of performance evaluation. A similar situation arises in the development of systems based on the learning-by-example paradigm. This approach has demonstrated its validity in several applications, but its feasibility rests on the availability of a set of examples covering the range of operating conditions of the system. Again, gathering a representative set of examples may be too costly, ruling out the use of otherwise powerful techniques.

In this paper we focus on the possibilities offered by current computer graphics techniques to generate realistic synthetic imagery, supported by extensive ground-truth information, at frame rate. Among the many topics currently investigated by the computer vision community, the development of indoor surveillance systems is of increasing scientific and practical interest. The evaluation of people-tracking algorithms, especially in the case of crowded environments, is particularly difficult and has not been addressed so far in a principled way. This paper presents a graphical simulator whose architecture has been designed to improve the work-flow of algorithm development and evaluation for third-generation surveillance systems based on distributed multi-camera systems for museum monitoring. The architecture of the system and the rendering strategies are presented in Section 2. Animation techniques are described in Section 3, while first results and foreseen extensions are reported in Section 4.

2 Rendering Architecture

One of the trends in the development of novel surveillance systems is the use of multiple sensors cooperating in monitoring multi-room environments. The design of computer vision algorithms within this framework is particularly challenging both on the theoretical side (effective integration of multiple information streams) and on the practical side (setting up large sensor networks and evaluating algorithm performance). The proposed simulator derives its architectural features from the necessity of mimicking the structure of such surveillance systems:

• network-wide operation, supporting multiple connections with several image processing systems (emulating a set of intelligent active cameras);

• management of virtual environments comprising static structures and moving objects such as people;

• generation of believable motion patterns for groups of people;

• synthesis of image sequences providing realistic input to image processing algorithms;

• efficient image generation to effectively support tight develop-test-refine work-flows;

• automatic generation of ground-truth data on which to base automatic evaluation of algorithms.

The adopted solution is based on the client-server architecture schematically reproduced in Figure 1. The server manages the environment and provides clients with access to dynamic objects, currently characters and cameras. Each client can then take control of an available camera to get a personalized view of the environment, or join the simulation from a point of view already controlled by another client.
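The camera ownership model described above can be sketched in a few lines. This is an illustrative in-process mock, not the simulator's actual API: the class and method names (`SimulationServer`, `acquire_camera`, `join_viewpoint`) are assumptions made for the example.

```python
# Minimal in-process sketch of the client-server camera interface:
# one client controls each camera, while further clients may share an
# already-controlled viewpoint. All names here are illustrative.

class SimulationServer:
    """Owns the dynamic objects (cameras, characters) and hands them out."""

    def __init__(self, camera_ids):
        self.cameras = {cid: None for cid in camera_ids}  # id -> controlling client

    def acquire_camera(self, client_id):
        # Give the client the first camera nobody controls yet.
        for cid, owner in self.cameras.items():
            if owner is None:
                self.cameras[cid] = client_id
                return cid
        return None  # no free camera: caller may join an existing viewpoint

    def join_viewpoint(self, camera_id):
        # Several clients may observe the same camera; only one controls it.
        return camera_id if camera_id in self.cameras else None


server = SimulationServer(camera_ids=["cam0", "cam1"])
cam_a = server.acquire_camera("client_1")   # client 1 controls cam0
cam_b = server.acquire_camera("client_2")   # client 2 controls cam1
cam_c = server.acquire_camera("client_3")   # no camera left
shared = server.join_viewpoint(cam_a)       # client 3 joins client 1's view
```

In the real system this exchange happens over the network, but the ownership logic is the relevant part here.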

Figure 1. System architecture

Synthetic characters can also be controlled by a client, and the next section provides a detailed description of the currently available module for the control of groups of characters visiting a museum. Each client can then request from the server the images of the scene, their depth map (which can be used to simulate other kinds of sensors), and a segmented image where each person is represented with a unique color (see Figure 2). This provides the necessary ground truth for evaluating the performance of people-tracking algorithms.

Figure 2. The information generated by the simulator: the visual stimuli, a depth map, and the 'oracle'.

The feature set of the rendering engine has been driven by the requirement of providing a useful substitute for real camera signals to image processing algorithms. In particular, the chosen scenario requires:

• effective management of complex geometry to simulate multi-room environments with many characters;

• easy control of character animation to create believable human motion;

• realistic lighting of static and dynamic geometry.

The rendering engine relies on the OpenGL API, for which very efficient hardware implementations exist, and manages complex geometry using Binary Space Partition (BSP) trees [1] and precomputed Potentially Visible Sets (PVS).

While fast, plain OpenGL-based rendering does not produce realistic images. The available lighting model is simple and, among other limitations, does not take into account environment diffusion. In order to generate high-quality images, the rendering engine exploits a precomputed global illumination solution for the static environment, made available through a set of lightmaps that OpenGL maps onto surface textures, modulating their lighting. Moving objects must instead be shaded dynamically using OpenGL lights. The maximum number of supported lights depends on the implementation, so the rendering engine keeps a sorted list of lights for each volume cell generated by the BSP tree; only the most relevant lights are activated for lighting (and for shadowing).
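The per-cell light selection just described can be sketched as follows. This is a simplified stand-in, not the engine's code: the attenuation model and function names are assumptions, and only the ranking idea matches the text.

```python
# Illustrative sketch of per-cell light selection: lights are ranked by
# their (attenuated) contribution at the cell center and only the
# strongest few are activated, respecting the implementation's limit.

MAX_GL_LIGHTS = 8  # minimum number of lights the OpenGL spec guarantees

def relevant_lights(cell_center, lights, max_lights=MAX_GL_LIGHTS):
    """lights: list of (position, intensity); returns the top contributors."""
    def contribution(light):
        pos, intensity = light
        d2 = sum((a - b) ** 2 for a, b in zip(cell_center, pos))
        return intensity / (1.0 + d2)  # simple distance attenuation
    ranked = sorted(lights, key=contribution, reverse=True)
    return ranked[:max_lights]

lights = [((0, 0, 0), 1.0), ((10, 0, 0), 5.0), ((1, 1, 0), 0.5)]
active = relevant_lights((0, 0, 0), lights, max_lights=2)
```

The strong but distant light loses to the two nearby ones, which is the behavior the sorted per-cell lists are meant to capture.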

As the basic OpenGL primitives are polygons, all scene geometries are defined by their polygonal approximations. Character animation, an important feature of the proposed simulator, would then require the position of all moving vertices as a function of time. As this would be impractical, animations are broken down into cyclic actions that are in turn specified through a limited set of snapshots, the key frames, with all intervening positions interpolated by the system. The flexibility of the system in specifying composed actions can be further increased using the technique of skeletal animation. Each model is represented by a hierarchical skeleton: each bone has specific rotation constraints and is further characterized by an influence sphere which identifies the vertices that will follow the movements of the bone. This system requires the interpolation of rotations, which can be accomplished neatly using quaternions and spherical linear interpolation [8].

Figure 3. The character is moving upwards: the animation is chosen based on speed direction and magnitude.

Figure 4. Pipeline for the computation of joint rotation. The additional degree of freedom granted by personalized rotations permits the introduction of new character poses.

In the described system, characters are animated through a hybrid technique used in current video games. The approach tries to gain the speed of vertex animation approaches and the flexibility of skeletal animation techniques. Each character is represented by a very simple skeleton with three bones connected to the head, the torso, and the legs. The flexibility of skeletal animation is exposed to client applications, allowing them to specify personalized rotations for the joints (as described in Figure 4). This is particularly useful in the simulation of a museum environment, as the character's attention may be directed to a given point, such as an exhibit, through a combined rotation of head and torso (see Figure 5).

Figure 5. A character's pose can be controlled through the rotation of its joints.

In the current system, character position is not controlled by the rendering engine itself but is provided by a specialized client application. This implies that the rendering engine must deduce, from character orientation and speed, the appropriate animation class among the available ones: run, walk, stride left, stride right (see Figure 3). As the kinematics of the character may suddenly result in the choice of a new animation, the rendering engine smooths the transition between animations to avoid annoying artifacts. The transition between animation classes is obtained by creating temporary additional key frames and interpolating towards the character configuration specified by the new motion class (see Figure 6).

Figure 6. Interpolated transition between two animation classes

A major source of difficulties in the development of computer vision algorithms for surveillance is given by dynamic lighting conditions, including shadows due to moving objects such as people and vehicles [2, 6]. Fast but geometrically accurate rendering of shadows, coherent with the environment's global illumination, is one of the key features of the developed rendering engine. Geometrical accuracy of character shadows has been obtained by generating shadow volumes using a non-standard OpenGL perspective projection matrix, obtained by moving the far clipping plane of the standard matrix to infinity [9].
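The infinite-far-plane trick can be made concrete by taking the limit f → ∞ of the standard OpenGL frustum matrix; a sketch (following the glFrustum convention, with the usual l, r, b, t, n parameters) is:

```python
import numpy as np

# Letting the far distance f go to infinity in the standard glFrustum
# matrix turns its third row into [0, 0, -1, -2n]: clip-space depth then
# approaches (but never exceeds) 1, so the far plane no longer clips
# the extruded shadow volumes.

def infinite_frustum(l, r, b, t, n):
    return np.array([
        [2*n/(r-l), 0.0,       (r+l)/(r-l),  0.0],
        [0.0,       2*n/(t-b), (t+b)/(t-b),  0.0],
        [0.0,       0.0,       -1.0,        -2.0*n],
        [0.0,       0.0,       -1.0,         0.0],
    ])

P = infinite_frustum(-1.0, 1.0, -1.0, 1.0, 1.0)
```

A point on the near plane (z = -n) still maps to normalized depth -1, while points pushed to infinity map to depth 1 exactly, which is what makes the volume extrusion robust.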

The actual drawing of shadow boundaries relies on a modification of the original stencil algorithm proposed by Heidmann [3], known as Carmack's Reverse, which solves the problems of the original formulation whenever the camera is located within the shadow volume [4]. However, determining shadow boundaries does not completely solve the problem. Shadowed regions should be filled in accordance with a global illumination solution for the given environment. While a correct solution is computationally expensive, a good approximation can be obtained by relying on the correct contribution of each light to the illumination of static geometry and using OpenGL lights for the shadowing (and self-shadowing) of the dynamic characters (see Figure 2). For each light, shadows are generated by subtractive blending of the light's contribution, providing full support for colored lights and shadows. The major drawback is the necessity of generating a global illumination solution of the given environment for each of the lights, as the resulting lightmaps are needed to paint the shadows on the static environment.

Figure 7. The different steps in the generation of accurateshadows

The rendering engine implemented for the server side of the architecture is thus able to generate realistic images for multiple clients, from different viewpoints, managing both static and dynamic geometry. While the engine takes care of the low-level actuation of character motion, the generation of believable, purposive motion for the simulated characters has been assigned to a specialized client of the simulator, described in the next section. This architectural choice has the advantage that different algorithms for generating complex behaviour can be tested and compared with minimal intervention on the structure of the simulator.

3 Choreographed Animation

Living beings perceive their environment and react according to subjective criteria, making realistic simulation of moving people a difficult task. Although rendering the characters is by itself a complex task, the creation of a suitable choreography of groups moving purposively in a believable way poses additional challenges. The variety of collective human motion patterns in a complex environment such as a museum is due to many factors, among which we can cite:

visual perception, the interpretation of visual stimuli from the environment;

prior knowledge, already-acquired information on the environment;

decision capability, goal-driven reaction to perceived impressions;

group behaviour, aggregation based on similar interests;

experience accumulation, dynamically reshaping goals.

In order to create believable motion patterns, a set of synthetic characters must be endowed with the same functionalities, albeit in a simplified version. In our implementation, visual perception is simply responsible for filtering out information on the environment that is not available to the characters due to occlusions and limited field of view. The interpretation of perceived stimuli can be bypassed by giving the characters direct access to the scene geometry, museum exhibit locations and descriptions, and the exact positions and velocities of their mates. The strategy used for the simulation of museum visiting groups relies on a hybrid approach integrating second-order dynamics for global motion coordination and indirect control of groups through invisible leaders. In fact, while plausible flock behavior naturally emerges from the simple rules proposed in [7], consistent goal-oriented behavior cannot be easily created in the same way.

Our approach is related to the concept of Co-Fields introduced in [5]. In the original formulation, each moving character (agent) perceives the position of the other agents through a force field, and coordinated behavior emerges from the agents following the force field, dynamically reshaping it by changing their position. In our implementation the potential fields depend not only on time and character position but also on the character's velocity. Modeling walls and other static and dynamic obstacles with purely positional fields may result in strange motion patterns that prevent the agents from reaching the expected objectives. The shortcoming of a purely positional approach is also evident in the modeling of group dynamics, as people clearly use information on velocity to adapt their trajectories and avoid collisions. Character motion (x(t), \dot{x}(t)) over a potential field E(x, \dot{x}, t) is governed by a second order differential equation:

F(x, \dot{x}, t) = \frac{d^2 x}{dt^2}(t) = -\nabla_x E(x, \dot{x}, t)

x(t + \Delta t) = x(t) + \dot{x}(t)\,\Delta t + \frac{1}{2} F(x, \dot{x}, t)\,\Delta t^2

\dot{x}(t + \Delta t) = \dot{x}(t) + F(x, \dot{x}, t)\,\Delta t
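The second-order update above amounts to a few lines of code per agent. The quadratic toy potential used here is only a stand-in for the paper's composite field E; the function names are illustrative.

```python
import numpy as np

# One integration step of the agent dynamics: F = -∇E drives the
# position and velocity updates given in the text.

def grad_E(x, v, t):
    # Toy field: attraction toward the origin (gradient of 0.5*||x||^2).
    return x

def step(x, v, t, dt):
    F = -grad_E(x, v, t)                     # F(x, xdot, t) = -grad_x E
    x_new = x + v * dt + 0.5 * F * dt ** 2   # position update
    v_new = v + F * dt                       # velocity update
    return x_new, v_new

x = np.array([1.0, 0.0])
v = np.array([0.0, 0.0])
x, v = step(x, v, 0.0, 0.1)
```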

Although the pioneering work of Reynolds [7] addressed the simulation of group behaviours of animals, similar rules can be applied to human characters. In our implementation, the potential field E is defined by the superposition of the following components:

E_C, to avoid collisions with static and dynamic obstacles;

E_L, driving the members of a group towards their (invisible) leader;

E_F, responsible for flock behaviour;

E_V, urging agents to match the velocity of nearby members of the same group.

In scenarios with complex geometry and frequent crowding it is important to find plausible collision avoidance policies. If an agent walks away from a wall there is no reason for it to be repelled by it, independently of how close it is. On the other hand, if two persons move toward each other they have to change their paths in advance. But again, if they walk in the same direction they can stay very close to each other. Positional information augmented with velocity data can then be used to implement more realistic collision avoidance strategies.

Figure 8. Collision avoidance field α(t_o): the larger the relative velocity, the greater the visibility of the obstacle.

If a polygonal scene model is available, collision detection can be carried out by ray-polygon intersection. Given the impact location x_o of the nearest object and the surface normal versor n_o, these behaviors can be simulated by the following field component:

\nabla_x E_C(x, \dot{x}, t) = -\alpha(t_o)\, n_o, \quad t_o = \frac{\|x - x_o\|^2}{(\dot{x} - \dot{x}_o) \cdot (x - x_o)}

where α(t_o) is a positive decreasing function of the expected impact time (e.g. a Gaussian, see Figure 8). Given the positions x_i of the members i of the set A_v^x of the n_v^x (close) visible agents, flock behaviour can be simulated by introducing the following field:

\nabla_x E_F(x, \dot{x}, t) \propto x - \frac{1}{n_v^x} \sum_{i \in A_v^x} x_i

so that an agent is attracted by the center of mass of nearby group mates with a force proportional to its distance. In a similar way, velocity matching is governed by

\nabla_x E_V(x, \dot{x}, t) \propto \dot{x} - \frac{1}{n_v^x} \sum_{i \in A_v^x} \dot{x}_i
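The two flocking gradients above translate directly into code. This is an illustrative sketch with assumed names, operating on the already-filtered set of visible neighbours:

```python
import numpy as np

# Gradients of the flocking terms: E_F pulls an agent toward the
# centroid of its visible neighbours, E_V toward their mean velocity.

def grad_E_flock(x, neighbours_x):
    # grad_x E_F is proportional to x minus the neighbour centroid
    return x - np.mean(neighbours_x, axis=0)

def grad_E_velocity(v, neighbours_v):
    # grad_x E_V is proportional to xdot minus the mean neighbour velocity
    return v - np.mean(neighbours_v, axis=0)

x = np.array([2.0, 0.0])
neigh = np.array([[0.0, 0.0], [2.0, 2.0]])   # centroid at (1, 1)
g = grad_E_flock(x, neigh)                    # points away from the centroid
```

Since the force is minus the gradient, the agent accelerates toward the centroid, with magnitude proportional to its distance from it, as stated in the text.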


Although consistent goal-oriented behaviour can be incorporated in agent dynamics through an additional, dynamic interest field, group behaviour can be more appropriately maintained through an (invisible) leader-following policy. The corresponding potential component is expressed by a radial field centered at the leader position x_l(t):

\nabla_x E_L(x, \dot{x}, t) \propto (x - x_l(t))

The leaders can then be moved on a graph whose nodes represent physical locations in the museum. However, a naive implementation of follow-the-leader behaviour may result in unacceptable motion patterns whenever the character loses sight of the leader. If geometry constraints are ignored, the character may be subjected to accelerations driving her towards the obstacles. On the other hand, if the constraints are considered, the character may have no leader in sight to follow. The solution used in the presented system is to endow each character with a memory of the latest positions of the group guide, so that motion can be directed by the most recent leader position \hat{x}_l(x, t) that is visible from the standing point of the moving character:

\nabla_x E_L(x, \dot{x}, t) = \delta(x - \hat{x}_l(x, t))

Figure 9. Endowing characters with memory of the leader's trajectory eases the implementation of realistic follow-the-leader strategies.

The nodes corresponding to the exhibits are given a set of descriptive labels, from which an interest score is derived based on the group profile, while the arcs are labeled with the corresponding spatial distances. In this case, people's motion can be planned in advance, optimizing the fruition of the exhibits (the sum of the scores) subject to predefined temporal constraints on the length of the visit. However, such an approach is not completely satisfactory, as people do not usually optimize their visit this way. Furthermore, the impact of local events (such as overcrowding) on group trajectories cannot be modeled with the necessary spatial detail unless a large number of nodes is inserted into the graph. The proposed solution is based on the idea of a dominant interest field. Each exhibit in the museum generates an attractive field driving the motion of each group leader. In order to obtain feasible leader paths, scene geometry information must be incorporated in the interest fields. In the proposed approach, fields are generated by propagating interest rays from each exhibit, subject to occlusions by scene obstacles. The rays move linearly in a discretized

space and mark each floor cell with the minimum distance to the exhibit, giving rise to the distance field D_i(x) of the i-th exhibit. In order to cover the whole environment (in a way not dissimilar to diffuse illumination), each pixel acts as a new source. The following algorithm computes the spatially sampled distance field for a polygonal scene map:

set distances of accessible pixels to ∞;
activate source on exhibit location and set its distance to 0;
while there is an active source do
    emit light from current source;
    d_s = current source distance;
    foreach illuminated pixel do
        d = d_s + distance from current source;
        if d < current pixel distance then
            set current pixel distance to d;
            label pixel with displacement from current source;
            if current pixel is on a corner then
                activate new source on current pixel;
            end
        end
    end
    turn off current source;
end

Algorithm 1: Computation of the distance map
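A simplified stand-in for Algorithm 1 can be sketched as follows. Instead of casting interest rays and spawning secondary sources at corners, this sketch runs a Dijkstra-style wavefront over the floor grid, which produces a comparable obstacle-aware distance field; the 8-neighbour metric and the function name are assumptions of the example.

```python
import heapq

# Obstacle-aware distance field on a floor grid: 0 = free cell, 1 = wall.
# The wavefront flows around walls, so distances follow feasible paths,
# as the corner-source propagation in Algorithm 1 is designed to ensure.

def distance_field(grid, exhibit):
    rows, cols = len(grid), len(grid[0])
    INF = float("inf")
    dist = [[INF] * cols for _ in range(rows)]
    dist[exhibit[0]][exhibit[1]] = 0.0
    heap = [(0.0, exhibit)]
    steps = [(-1, 0, 1.0), (1, 0, 1.0), (0, -1, 1.0), (0, 1, 1.0),
             (-1, -1, 2 ** 0.5), (-1, 1, 2 ** 0.5),
             (1, -1, 2 ** 0.5), (1, 1, 2 ** 0.5)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale entry
        for dr, dc, w in steps:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                if d + w < dist[nr][nc]:
                    dist[nr][nc] = d + w
                    heapq.heappush(heap, (d + w, (nr, nc)))
    return dist

grid = [[0, 0, 0],
        [0, 1, 0],   # a wall cell forces the wavefront around it
        [0, 0, 0]]
D = distance_field(grid, (0, 0))
```

Inaccessible cells keep the initial infinite distance, mirroring the "accessible pixels" initialization of Algorithm 1.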

The resulting fields can be effectively modulated by the complexity of the path leading to the exhibit as well as by the mere distance to be traveled (see Figure 10):

E_I^i(x, \dot{x}, t) = \frac{\eta_i(t)}{1 + D_i(x, t)^{\lambda_i(x, \dot{x})}}

Here λ_i ∈ (0, 1] accounts for exhibit visibility and for acquired prior knowledge about the museum, e.g. from consulting an exhibit plan beforehand. The modulating factor η_i(t) defines the group's interest in the i-th exhibit, which decreases while looking at the exhibit (experience accumulation):

\eta_i(t + \Delta t) = \eta_i(t) - \nu\, e^{-\frac{D_i^2(x_l)}{2\sigma_i^2}}\, \Delta t

where ν characterizes the fruition rate and σ_i represents the spatial scale from which the exhibit becomes enjoyable. Each group can then be characterized by a specific profile η_i(0) that will affect its museum tour. Leader motion is then governed by

x_l(t + \Delta t) = x_l(t) + v_l\, \Delta t\, \frac{\nabla_x E_I^{\hat{\imath}}(x, \dot{x}, t)}{\|\nabla_x E_I^{\hat{\imath}}(x, \dot{x}, t)\|}

where v_l denotes the constant leader velocity and

\hat{\imath} = \arg\max_i \{ E_I^i(x, \dot{x}, t) \}

represents the current dominant interest of the group leader. Due to the consistent formulation, the interest field can directly be used as an additional component of the agent dynamics. The implemented system permits the simulation of groups with different interests in museum exhibits as well as different visiting strategies (detailed planning, casual, time-limited, etc.) by appropriate modulation of the interest fields.
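The interest decay and the constant-speed leader step above can be sketched together. All names and the chosen parameter values are illustrative assumptions:

```python
import numpy as np

# Interest in each exhibit decays while the leader stands near it
# (experience accumulation), and the leader advances at constant speed
# along the normalized gradient of the dominant interest field.

def decay_interest(eta, D_leader, nu=0.5, sigma=1.0, dt=0.1):
    # eta_i(t + dt) = eta_i(t) - nu * exp(-D_i^2 / (2 sigma^2)) * dt
    return eta - nu * np.exp(-D_leader ** 2 / (2 * sigma ** 2)) * dt

def leader_step(x_l, grad_dominant, v_l=1.0, dt=0.1):
    g = np.asarray(grad_dominant, dtype=float)
    return x_l + v_l * dt * g / np.linalg.norm(g)   # unit-speed direction

eta = np.array([1.0, 0.8])
D = np.array([0.0, 5.0])       # the leader is standing at exhibit 0
eta = decay_interest(eta, D)   # exhibit 0 loses interest fastest
x_l = leader_step(np.zeros(2), [3.0, 4.0])
```

Once η of the current exhibit drops below that of another, the arg max switches and the leader turns toward the new dominant exhibit, which is exactly the behavior shown in Figure 10.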


Figure 10. The two columns report the evolution in time of the co-fields originated by two different exhibits: the darker the field, the stronger the attraction of the exhibit. The visitor is at first attracted by the exhibit corresponding to the left column. As time passes, she gets less and less interested in it, until her attention is captured by the nearby exhibit, to which she then turns. Field discontinuities are due to the depression of secondary source emission, capturing the out-of-sight, out-of-interest effect.

4 Conclusions

This paper has presented a graphical simulator supporting the development of third-generation surveillance systems based on visual input. Fast image rendering, a flexible client-server architecture, perceptual oracle functionalities providing extensive ground-truth information for algorithm evaluation, and the generation of complex movements of people groups are among its most important features. The system is currently employed in the development of algorithms for people segmentation, where it is able to provide accurate performance reports (see Figure 11 for an example on the evaluation of a people/background segmentation algorithm). The system will be extended to support characters with an increased number of joints for the development of gesture recognition algorithms.

Acknowledgements

The authors would like to thank F. Bertamini at ITC-irst for kindly providing the information reported in Figure 11.

[Plot: wrong pixels vs. threshold; curves: 'sum', 'shadow-fg', 'fg-shadow'.]

Figure 11. The plot reports the error, in pixel units, made by a simple algorithm for people/background segmentation in an indoor environment. After a first step segmenting the image into background and foreground plus shadows, the algorithm tries to identify shadows using a set of classifiers combined through scoring. Manual computation of algorithm performance at this level of detail would be extremely costly and the accuracy would not reach the level possible with the proposed approach.
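With the oracle mask rendered by the simulator, the per-pixel error counts plotted in Figure 11 reduce to a single array comparison. A minimal sketch (the function name and masks are illustrative):

```python
import numpy as np

# Pixel-level scoring against the simulator's ground-truth mask: the
# number of wrongly classified pixels is one boolean comparison away.

def wrong_pixels(predicted_mask, oracle_mask):
    """Both masks are boolean arrays: True = person, False = background."""
    return int(np.count_nonzero(predicted_mask != oracle_mask))

oracle = np.array([[0, 1, 1],
                   [0, 1, 0]], dtype=bool)
predicted = np.array([[0, 1, 0],
                      [1, 1, 0]], dtype=bool)
errors = wrong_pixels(predicted, oracle)   # two misclassified pixels
```

Doing the same against manually annotated sequences would require labeling every pixel of every frame by hand, which is the cost argument made in the Introduction.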

References

[1] H. Fuchs, Z. M. Kedem, and B. Naylor. On visible surface generation by a priori tree structures. Computer Graphics, 14(3):124–133, July 1980.

[2] I. Haritaoglu, D. Harwood, and L. S. Davis. W4: Real-Time Surveillance of People and Their Activities. IEEE Trans. on PAMI, 22(8):809–830, 2000.

[3] T. Heidmann. Real shadows, real time. IRIS Universe, 18:28–31, 1991.

[4] M. J. Kilgard. Robust Stencil Shadow Volumes. InCEDEC, Tokyo, Japan, 2001. NVIDIA.

[5] M. Mamei, F. Zambonelli, and L. Leonardi. Co-Fields:A Unifying Approach to Swarm Intelligence. In Proc.of 3rd International Workshop on Engineering Soci-eties in the Agents’ World, Madrid, Sept. 2002.

[6] A. Prati, I. Mikic, C. Grana, and M. M. Trivedi.Shadow Detection Algorithms for Traffic Flow Anal-ysis: a Comparative Study. In Proceedings of IEEEIntl. Conference on Intelligent Transportation Systems,pages 340–345, 2001.

[7] C. W. Reynolds. Flocks, Herds, and Schools: ADistributed Behavioral Model. Computer Graphics,21(4):25–43, 1987.

[8] Jan Svarovsky. Quaternions for game programming.In Mark DeLoura, editor, Game Programming Gems,pages 195–199. Charles River Media, 2000.

[9] M. Woo, J. Neider, T. Davis, and D. Shreiner. OpenGLProgramming Guide. Addison-Wesley, 2001.