
Universidad Politécnica de Madrid
Escuela Técnica Superior de Ingenieros de Telecomunicación

Visual Object Tracking in Challenging Situations using a Bayesian Perspective

Seguimiento visual de objetos en situaciones complejas mediante un enfoque bayesiano

Ph.D. Thesis
Tesis Doctoral

Carlos Roberto del Blanco Adán
Ingeniero de Telecomunicación

2010


Departamento de Señales, Sistemas y Radiocomunicaciones
Escuela Técnica Superior de Ingenieros de Telecomunicación

Visual Object Tracking in Challenging Situations using a Bayesian Perspective

Seguimiento visual de objetos en situaciones complejas mediante un enfoque bayesiano

Ph.D. Thesis
Tesis Doctoral

Author:
Carlos Roberto del Blanco Adán
Ingeniero de Telecomunicación
Universidad Politécnica de Madrid

Advisor:
Fernando Jaureguizar Núñez
Doctor Ingeniero de Telecomunicación
Associate Professor, Dpto. de Señales, Sistemas y Radiocomunicaciones
Universidad Politécnica de Madrid

2010


DOCTORAL THESIS (TESIS DOCTORAL)

Visual Object Tracking in Challenging Situations using a Bayesian Perspective

Seguimiento visual de objetos en situaciones complejas mediante un enfoque bayesiano

Author: Carlos Roberto del Blanco Adán
Advisor: Fernando Jaureguizar Núñez

Thesis committee appointed by the Rector of the Universidad Politécnica de Madrid on the . . . . day of . . . . . . . . . . . . 2010.

Chair:     . . . . . . . . . . . . . . . . . . . .
Member:    . . . . . . . . . . . . . . . . . . . .
Member:    . . . . . . . . . . . . . . . . . . . .
Member:    . . . . . . . . . . . . . . . . . . . .
Secretary: . . . . . . . . . . . . . . . . . . . .

The defense and reading of the thesis took place on the . . . . day of . . . . . . . . . . . . 2010 at . . . . . . . . . . . . . . . . . . . .

Grade: . . . . . . . . . . . . . . . . . . . .

THE CHAIR        THE MEMBERS        THE SECRETARY


To Vanessa, to my parents, to my siblings.


Acknowledgements

I would like to thank the great number of people who have shared my journey along the path of this thesis. I will begin with my wife Vanessa, who has given me so much support and who, on more than one occasion, has had to put up with the most irascible version of myself. I will continue with my parents and siblings, who have asked me so many times whether I had much left before finishing the thesis. Next come all the members and visitors of the GTI, with a special mention for Fernando, Narciso and Luis, who have suffered the ravages of my articles written in that very Spanish English. Not to mention the effort and migraines that reading this thesis has cost Fernando, my advisor, thanks to which he must now hate Mr. Bayes. I would certainly like to write a few lines about each of my GTI colleagues, with whom I have shared so much caffeine and who have made my working life so pleasant. However, that would mean writing not just a thesis volume but the entire Salvat encyclopedia. I am therefore obliged to do something more alternative, as well as of doubtful usefulness: a table with the ages of the GTI, in which all my colleagues can be seen (or at least that has been my intention) along the temporal journey of my thesis.

This work has been partially supported by the Ministerio de Ciencia e Innovación of the Spanish government by means of a Formación del Personal Investigador fellowship and the projects TIN2004-07860 (Medusa) and TEC2007-67764 (SmartVision).


Era           Period          Specimens                            Origin
Proterozoic   -               Narciso, Fernando, Luis,             Spain
                              Francisco, Julián, Nacho
Archean       -               Marcos N., Juan Carlos, Carlos R.,   Spain, USA
                              Marcos A., Usoa, Shagniq
Paleozoic     Cambrian        Carlos C., Daniel A., Sharko         Spain, Macedonia
              Ordovician      Raúl, Jon, Ángel                     Spain
              Silurian        Irena, Kristina, Binu                Macedonia, India
              Devonian        Nerea, Pieter                        Spain, Belgium
              Carboniferous   Pablo, Víctor, Gian Luca             Spain, Italy
              Permian         Hui, Xioadan, Yi, Yang               China
Mesozoic      Triassic        Shankar, Ravi, Gogo, Antonio         India, Macedonia, Brazil
              Jurassic        Filippo, Maykel, Esther              Italy, Cuba, Spain
              Cretaceous      Sasho, César                         Macedonia, Spain
Cenozoic      Paleocene       Daniel B., Claire                    Spain, France
              Eocene          Lihui, Yu, Ivana                     China, Macedonia
              Oligocene       Toni, Richard, Carlos G.             Spain, Peru
              Miocene         Manuel, Massimo, Jesús               Spain, Italy
              Pliocene        Rafa, Sergio                         Spain
              Pleistocene     Su, Wenjia, Xiang, Iviza             China, Macedonia
              Holocene        Samira, Abel, Pratik, Srimanta       Iran, Spain, India


Abstract

The increasing availability of powerful computers and high-quality video cameras has allowed the proliferation of video-based systems, which perform tasks such as vehicle navigation, traffic monitoring, surveillance, etc. A fundamental component of these systems is the visual tracking of objects of interest, whose main goal is to estimate the object trajectories in a video sequence. For this purpose, two different kinds of information are used: detections obtained by the analysis of the video streams and prior knowledge about the object dynamics. However, this information is usually corrupted by sensor noise, varying object appearance, illumination changes, cluttered backgrounds, object interactions, and camera ego-motion.

While reliable algorithms exist for tracking a single object in constrained scenarios, object tracking remains a challenge in uncontrolled situations involving multiple interacting objects, heavily cluttered scenarios, moving cameras, and complex object dynamics. The aim of this dissertation has been to develop efficient tracking solutions for two such complex tracking situations. The first consists of tracking a single object in heavily cluttered scenarios with a moving camera. To address this situation, an advanced Bayesian framework has been designed that jointly models the object and camera dynamics. As a result, it can satisfactorily predict the evolution of a tracked object in situations with high uncertainty about the object location. In addition, the algorithm is robust to background clutter, avoiding tracking failures due to the presence of similar objects.

The other tracking situation focuses on the interactions of multiple objects observed with a static camera. To tackle this problem, a novel Bayesian model has been developed that manages complex object interactions by means of an advanced object dynamic model sensitive to object interactions. This is achieved by inferring the occlusion events, which in turn trigger different choices of object motion. The tracking algorithm can also handle false and missing detections through a probabilistic data association stage.

Excellent results have been obtained using publicly available databases, proving the efficiency of the developed Bayesian tracking models.


Resumen

The growing availability of powerful computers and high-quality cameras has allowed the proliferation of video-based systems for vehicle navigation, traffic monitoring, video surveillance, etc. An essential part of these systems is object tracking, whose main objective is the estimation of trajectories in video sequences. To this end, two types of information are used: the detections obtained from the analysis of the video and the prior knowledge of the object dynamics. However, this information is usually distorted by sensor noise, variation in the appearance of the objects, illumination changes, highly cluttered scenes, and camera motion.

While reliable algorithms exist for tracking a single object in controlled scenarios, tracking is still a challenge in unconstrained situations characterized by multiple interacting objects, highly cluttered scenarios, and moving cameras. In this thesis, the objective has been the development of efficient tracking algorithms for two especially complicated situations. The first consists of tracking a single object in highly cluttered scenes with a moving camera. To deal with this situation, a sophisticated Bayesian framework has been designed that jointly models the dynamics of the camera and the object. This makes it possible to satisfactorily predict the evolution of the object position in situations of great uncertainty. Moreover, the algorithm is robust to cluttered backgrounds, avoiding errors due to the presence of similar objects.

The other situation considered has focused on the interactions of objects observed with a static camera. To this end, a novel Bayesian model has been developed that manages the interactions by means of an advanced dynamic model. This is based on the inference of occlusions between objects, which in turn give rise to different types of object motion. The algorithm is also able to handle missing and false detections through a probabilistic data association stage.

Excellent results have been obtained on several databases, which proves the efficiency of the developed Bayesian tracking models.


Contents

List of Figures
List of Tables

1 Introduction

2 Bayesian models for object tracking
  2.1 Tracking with moving cameras
  2.2 Tracking of multiple interacting objects

3 Bayesian Tracking with Moving Cameras
  3.1 Optimal Bayesian estimation for object tracking
      3.1.1 Particle filter approximation
  3.2 Bayesian tracking framework for moving cameras
  3.3 Object tracking in aerial infrared imagery
      3.3.1 Particle filter approximation
      3.3.2 Results
            3.3.2.1 Strong ego-motion situation
            3.3.2.2 High uncertainty ego-motion situation
            3.3.2.3 Global tracking results
  3.4 Object tracking in aerial and terrestrial visible imagery
      3.4.1 Particle filter approximation
      3.4.2 Results
  3.5 Conclusions

4 Bayesian tracking of multiple interacting objects
  4.1 Description of the multiple object tracking problem
  4.2 Bayesian tracking model for multiple interacting objects
      4.2.1 Transition pdfs
      4.2.2 Likelihood
  4.3 Approximate inference based on Rao-Blackwellized particle filtering
      4.3.1 Kalman filtering of the object state
      4.3.2 Particle filtering of the data association and object occlusion
  4.4 Object detections
  4.5 Results
      4.5.1 Qualitative results
      4.5.2 Quantitative results
  4.6 Conclusions

5 Conclusions and future work
  5.1 Conclusions
  5.2 Future work

6 Appendix
  6.1 Conditional independence and d-separation

References


List of Figures

3.1 Graphical model for the Bayesian object tracking
3.2 Consecutive frames of an aerial infrared sequence
3.3 Multimodal LoG filter response
3.4 Likelihood distribution
3.5 Initial translational transformations
3.6 Probability values for the ego-motion hypothesis
3.7 Metropolis-Hastings sampling of the likelihood distribution
3.8 Particle approximation of the posterior pdf
3.9 SIR resampling of the posterior pdf
3.10 Kernel density estimation and state estimation
3.11 Object tracking result
3.12 Intermediate results for a situation of strong ego-motion
3.13 Tracking results for the BEH algorithm under strong ego-motion
3.14 Tracking results for the DEH algorithm under strong ego-motion
3.15 Tracking results for the NEH algorithm under strong ego-motion
3.16 Intermediate results for a situation greatly affected by the aperture problem
3.17 Tracking results for the BEH algorithm in a situation greatly affected by the aperture problem
3.18 Tracking results for the DEH algorithm in a situation greatly affected by the aperture problem
3.19 Tracking results for the NEH algorithm in a situation greatly affected by the aperture problem
3.20 Example of the similarity measurement between image regions
3.21 Example of feature correspondence
3.22 Representation of the affine transformation hypothesis
3.23 Samples of the object position
3.24 Samples of ellipses enclosing the object
3.25 Weighted sample representation of the posterior pdf
3.26 Tracking results with a camera mounted on a car
3.27 Tracking results with a camera mounted on a helicopter

4.1 Set of detections yielded by multiple detectors
4.2 Data association between detections and objects
4.3 Object dynamic model
4.4 Graphical model for multiple object tracking
4.5 Graphical model for the initial time step
4.6 Restrictions imposed on the associations between detections and objects
4.7 Restrictions imposed on the occlusions among objects
4.8 Color histograms of two object categories
4.9 Similarity maps of the color histograms
4.10 Computed detections from the red-dressed team
4.11 Computed detections from the black-and-white-dressed team
4.12 Tracking results for a simple object cross
4.13 Marginalization of the posterior pdf over one specific object
4.14 Marginalization of the posterior pdf over one specific object
4.15 Tracking results for a complex object cross
4.16 Marginalization of the posterior pdf over one specific object
4.17 Marginalization of the posterior pdf over one specific object
4.18 Marginalization of the posterior pdf over one specific object
4.19 Tracking results for an overtaking action
4.20 Marginalization of the posterior pdf over one specific object
4.21 Marginalization of the posterior pdf over one specific object
4.22 Marginalization of the posterior pdf over one specific object

6.1 Concepts of d-separation and descendants


List of Tables

2.1 Tracking problems related to the data association
3.1 Quantitative results for object tracking with a moving camera in infrared imagery
3.2 Quantitative results for object tracking with a moving camera in visible imagery
4.1 Quantitative results for interacting objects 1/2
4.2 Quantitative results for interacting objects 2/2


    Chapter 1

    Introduction

The evolution and spread of technology have allowed the proliferation of video-based systems, which make use of powerful computers and high-quality video cameras to automatically perform increasingly demanding tasks such as vehicle navigation, traffic monitoring, human-computer interaction, motion-based recognition, security and surveillance, etc. Visual object tracking is a fundamental part of all of the previous tasks, and also of the field of computer vision in general. This fact has motivated a great deal of interest in object tracking algorithms. The ultimate goal of tracking algorithms is to estimate the object trajectories in a video sequence. For this purpose, two different kinds of information are used: the video streams acquired by the camera sensor and the prior knowledge about the tracked objects and the environment. The video-stream-based information is used to compute object detections in each frame, also known as observations or measurements. The detection process uses the most distinctive appearance features, such as color, gradient, texture, and shape, to minimize the probability of false detections and at the same time to maximize the detection probability. However, the object appearance can undergo significant variations that cause noisy detections and even missing detections, i.e. tracked objects that have not been detected. The appearance variations can be produced by articulated or deformable objects, illumination changes due to weather conditions (typical in outdoor applications), and variations in the camera point of view. Object interactions, such as partial and total occlusions, are another source of noisy and missing detections. On the other hand, scene structures similar to the objects of interest can cause false detections, thus confusing the tracking process. To alleviate these detection shortcomings,


the tracking also relies on the available prior information in order to constrain the trajectory estimation problem. This kind of information is mainly the object dynamics, which is used to predict the evolution of the object trajectories. The modeling of the object dynamics can be a very difficult task, especially in situations in which objects undergo complex interactions. On the other hand, object dynamic information is only meaningful for static or quasi-static cameras, since, in the case of moving cameras, a global motion called ego-motion is induced in the image that corrupts the trajectory predictions. As a result, the camera dynamics must also be modeled, which makes the tracking more complex and increases the uncertainty in the trajectory estimation.

While there exist reliable algorithms for the tracking of a single object in constrained scenarios, object tracking is still a challenge in uncontrolled situations involving multiple interacting objects, heavily cluttered scenarios, moving cameras, objects with varying appearance, and complex object dynamics. In this dissertation, the main aim has been the development of efficient tracking solutions for two of these complex tracking situations. The first one consists of tracking a single object in heavily cluttered scenarios with a moving camera. For this purpose, an advanced Bayesian framework has been designed that jointly models the object and camera dynamics. This makes it possible to satisfactorily predict the evolution of the tracked object in situations of high uncertainty, in which several object locations are possible because of the combined dynamics of the object and the camera. In addition, the algorithm is robust to background clutter, avoiding tracking failures due to the presence of objects in the background that are similar to the tracked one. The inference of the tracking information in the proposed Bayesian model cannot be performed analytically, i.e. there is no closed-form expression to directly compute the required tracking information. This situation arises from the fact that the dynamic and observation processes involved in the Bayesian tracking framework are non-linear and non-Gaussian. In order to deal with this problem, a suboptimal inference method has been derived that makes use of the particle filtering technique to compute an accurate approximation of the object trajectory.
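The particle filtering idea can be made concrete with a minimal sketch of a generic bootstrap (SIR) filter. This is not the specific filter developed in this dissertation; the one-dimensional random-walk dynamics and Gaussian detection likelihood below are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sir_step(particles, weights, z, dynamics, likelihood):
    """One prediction/update/resampling cycle of a bootstrap particle filter."""
    # Prediction: propagate every particle through the (possibly non-linear) dynamics.
    particles = dynamics(particles)
    # Update: reweight each particle by the likelihood of the current detection z.
    weights = weights * likelihood(z, particles)
    weights = weights / weights.sum()
    # Resampling (SIR): duplicate likely particles, discard unlikely ones.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

# Illustrative 1-D run: random-walk dynamics, Gaussian likelihood, repeated detections at 3.0.
dynamics = lambda p: p + rng.normal(0.0, 0.3, p.shape)
likelihood = lambda z, p: np.exp(-0.5 * ((z - p) / 1.0) ** 2) + 1e-12
particles = rng.uniform(-10.0, 10.0, 2000)
weights = np.full(2000, 1.0 / 2000)
for _ in range(10):
    particles, weights = sir_step(particles, weights, 3.0, dynamics, likelihood)
# The particle cloud concentrates around the repeatedly observed location.
```

The state estimate at each step can then be taken as the weighted particle mean, or from a kernel density estimate of the particle set.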

The other unrestricted tracking situation focuses on the interactions of multiple objects observed with a static camera. To successfully tackle this problem, a novel recursive Bayesian model has been developed to explicitly manage complex object interactions. This is accomplished by an advanced object dynamic model that is sensitive to the object interactions involving long-term occlusions of two or more objects. For this purpose, the proposed Bayesian tracking model uses a random variable to predict the occlusion events, which in turn trigger different choices of object motion. The tracking algorithm is also able to handle false and missing detections through a probabilistic data association stage, which efficiently computes the correspondence between the unlabeled detections and the tracked objects. Regarding the inference of the tracking information in the proposed Bayesian model for interacting objects, two major issues have been carefully addressed. The first one is the mathematical derivation of the posterior distribution of the object tracking information, which has been a challenging task due to the complexity of the tracking model. The second issue, closely related to the first one, arises from the fact that the derived mathematical expression for the posterior distribution does not have an analytical form due to the complex integrals involved. This situation is caused by the non-linear and non-Gaussian character of the stochastic processes involved in the Bayesian tracking model, i.e. the dynamic, observation and occlusion processes. Consequently, the inference has to be accomplished by means of suboptimal methods, such as particle filtering. However, the high dimensionality of the tracking problem, proportional to the number of tracked objects and object detections, causes the accuracy of the approximate posterior distribution to be very poor. To overcome this drawback, a novel suboptimal inference method has been developed which combines the particle filtering technique with a variance reduction technique called Rao-Blackwellization. This makes it possible to obtain an accurate approximation of the object trajectories in high-dimensional state spaces involving multiple tracked objects.
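The Rao-Blackwellization idea can be sketched on a toy switching linear-Gaussian model: particles sample only a discrete mode variable (standing in for the occlusion and association variables of the actual model), while the continuous object state conditioned on each particle is filtered exactly with a Kalman filter. All matrices and the two-mode structure below are illustrative assumptions, not the model derived in Chap. 4:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D switching model: mode 0 = free motion, mode 1 = interaction/occlusion,
# modeled here simply as higher process noise. All values are illustrative.
F = np.array([[1.0]]); H = np.array([[1.0]]); R = np.array([[0.2]])
Q = {0: np.array([[0.1]]), 1: np.array([[2.0]])}
MODE_TRANS = np.array([[0.9, 0.1], [0.3, 0.7]])  # P(mode_t | mode_{t-1})

def rbpf_step(modes, means, covs, z):
    """One Rao-Blackwellized particle filter step: sample the discrete mode,
    then run an exact Kalman predict/update for the continuous state."""
    N = len(modes)
    logw = np.zeros(N)
    for i in range(N):
        # Particle part: sample the discrete mode from its transition prior.
        modes[i] = rng.choice(2, p=MODE_TRANS[modes[i]])
        # Rao-Blackwellized part: exact Kalman recursion given the sampled mode.
        m = F @ means[i]
        P = F @ covs[i] @ F.T + Q[int(modes[i])]
        S = H @ P @ H.T + R
        v = z - H @ m
        K = P @ H.T / S
        means[i] = m + K @ v
        covs[i] = (np.eye(1) - K @ H) @ P
        # Weight by the marginal (innovation) likelihood of the detection.
        logw[i] = -0.5 * (v @ v / S + np.log(2.0 * np.pi * S)).item()
    w = np.exp(logw - logw.max()); w /= w.sum()
    idx = rng.choice(N, size=N, p=w)  # resample
    return modes[idx], means[idx], covs[idx]

# Illustrative run: a stationary object repeatedly detected at position 1.0.
modes = np.zeros(500, dtype=int)
means = np.zeros((500, 1))
covs = np.ones((500, 1, 1))
for _ in range(10):
    modes, means, covs = rbpf_step(modes, means, covs, 1.0)
```

Because the continuous state is integrated out analytically, the particles only have to explore the low-dimensional discrete space, which is the source of the variance reduction.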

The organization of the dissertation is as follows. In Chap. 2, a survey of the most remarkable object tracking techniques for multiple objects is presented, placing special emphasis on Bayesian models, strategies for handling moving cameras, and the management of multiple objects. The developed recursive Bayesian model for tracking a single object in heavily cluttered scenarios with a moving camera is described in Chap. 3. At the end of the chapter, tracking results of the proposed Bayesian framework are presented for two kinds of applications, one involving aerial infrared imagery, and another dealing with both terrestrial and aerial visible imagery. In Chap. 4, the developed Bayesian tracking solution for multiple interacting objects is presented, along with a test bench to evaluate the efficiency of the tracking under object interactions. Lastly, conclusions and future lines of research are set out in Chap. 5.


    Chapter 2

Bayesian models for object tracking

Visual object tracking is a fundamental task in a wide range of military and civilian applications, such as surveillance, security and defense, autonomous vehicle navigation, robotics, behavior analysis, traffic monitoring and management, human-computer interfaces, video retrieval, and many more. Visual tracking can be defined as the problem of estimating the trajectories of a set of objects of interest in a video sequence as they move around the scene. In a typical tracking application there are one or more object detectors that generate a set of noisy measurements or detections at discrete time instants. The uncertainty of the detection process arises from the noise of the camera sensor, changes in the scene illumination, variations in the appearance of the objects, non-rigid and/or articulated objects, and the loss of information caused by the projection of the 3D world onto the 2D image plane. The tracking algorithm must be able to handle the uncertainty in the detection process in order to assign consistent labels to the tracked objects in each frame of a video sequence. This process can be simplified by imposing certain constraints on the motion of the objects. For this purpose, a dynamic model can be used to predict the motion of the objects, restricting in this way the spatio-temporal evolution of the trajectories. Nonetheless, the dynamic model is only an approximation of the underlying object dynamics, which can indeed be very complex. As a result, the tracking algorithm has to manage different sources of information (detections and object dynamics), taking into account their respective uncertainties, to efficiently estimate the object trajectories.


Bayesian estimation is the most commonly used framework in visual tracking, and also in other contexts such as radar and sonar. This framework models the tracking problem, and all its sources of uncertainty, in a probabilistic way: sensor noise, inaccurate dynamic models, environmental clutter, etc. From a Bayesian perspective, the aim is to compute the posterior distribution over the object state, which is a vector containing all the desired tracking information, such as position, velocity, etc. This posterior distribution encodes all the information necessary to efficiently compute an estimate of the object state. The computation of the posterior distribution is usually performed recursively via two cyclic stages: prediction and update. Thus, the computation is efficient, since only the previous estimate of the posterior distribution and the set of detections at the current time step are required. The prediction stage evolves the posterior distribution at the previous time step according to the object dynamics, obtaining as a result the predicted posterior distribution at the current time step. The update stage makes use of the available detections at the current time step to correct the predicted posterior distribution by means of the likelihood model of the object detector.

    In single object tracking with static cameras, the main difficulty arises from the factthat realistic models for the object dynamics and detection processes are often non-

    linear and non-Gaussian, which leads to a posterior distribution without a closed-form

    analytic expression. In fact, only in a limited number of cases there exist close-form ex-

    pressions. The most well-known closed-form expression is the Kalman filter (1), which

    is obtained when both the dynamic and likelihood models are linear and Gaussian.

    Grid based approaches (2) overcome the limitations imposed on Kalman filter by re-

    stricting the state space to be discrete and finite. If any of the previous assumptions

    does not hold, the exact computation of posterior distribution is not possible, and it

    becomes necessary to resort to approximate inference methods that computes an ap-

    proximation of the posterior distribution. The extended Kalman filter (1) linearizes

    models with weak non-linearities using the first term in a Taylor expansion, so that

    the Kalman filter expression can be still applied. Nonetheless, the performance of the

    extended Kalman filter rapidly decreases as the non-linearities becomes more severe.

    The unscented Kalman filter (3; 4) has proved to be more efficient in models that

    are moderately non-linear. It recursively propagates a set of selected sigma points to

    maintain the second order statistics of the posterior distribution. Both approximate

    6

  • 8/3/2019 Th_Visual Object Tracking in Challenging Situations Using a Bayesian Perspective

    27/162

    solutions, extended and unscented Kalman filters, assume that the underlying posterior

    distribution is Gaussian. But if this assumption does not hold (e.g. the distribution is

    heavily skewed or multimodal), the accuracy of the estimation can be randomly poor.

    The Gaussian sum filter (5) was one of the first attempts to deal with non-Gaussian

    models, approximating the posterior distribution by a mixture of Gaussians. The main

    limitation of the Gaussian sum filter is that linear approximations are required, as in

    the extended Kalman filter. Another limitation is the combinatorial growth of the

    number of Gaussian components in the mixture over time. An alternative solution

    for non-linear non-Gaussian models that does not need linearization is obtained by

    approximate-Grid based methods (2; 6). These methods approximate the continuous

    state space by a finite and fixed grid, and then they apply numerical integration for

    computing the posterior distribution. The grid must be sufficiently dense to compute

    an accurate approximation of the posterior distribution. However, the computational

    cost increases dramatically with the dimensionality of the state space and becomes

    impractical for dimensions larger than four. An additional disadvantage of grid-based

    methods is that the state space cannot be partitioned unevenly in order to improve

    the resolution in regions of high density probability. All these shortcomings are over-

    come by the particle filtering technique (2; 7; 8), also known as Sequential Monte Carlo

    method (9; 10; 11), condensation algorithm (12; 13), or bootstrap filtering (14). It is

    a numerical integration technique that simulates the posterior distribution by a set of

    weighted samples, known as particles, that are propagating recursively along the time.

    The samples are drawn from a proposal distribution, that is the key component of the

    algorithm, and evaluated by means of the dynamic and likelihood models. The particle

    filter has become very successful in a wide range of tracking applications due to its

    efficiency, flexibility, and easy of implementation. Moreover, its computational cost is

    theoretically independent of the dimension of the state space.

    To sum up, the previous tracking approaches have proved to be efficient and reliable

    solutions for single object tracking provided that:

    they fulfill the assumptions of linearity/non-linearity Gaussianity/non-Gaussinity

    for which they were conceived,

    the cameras are static, i.e. with no motion, and

    7

  • 8/3/2019 Th_Visual Object Tracking in Challenging Situations Using a Bayesian Perspective

    28/162

    2. BAYESIAN MODELS FOR OBJECT TRACKING

    there is always a unique detection for the tracked object, which only occurs in

    constrained scenarios where there is total control about the number and types of

    objects that compose the scene.

    In the rest of situations, the tracking task is still a challenge, which is receiving a great

    deal of research attention because of the wide range of potential applications that can

    be developed. The main contribution of this dissertation is the development of efficient

    and reliable algorithms for the tracking of objects in challenging situations. Specifically,

    the research has been focused on two situations: the single object tracking with moving

    cameras, and the multiple interacting object tracking with static cameras.

    In the first situation, the moving camera induces a global motion in the scene,

    called ego-motion, that corrupts the spatio-tempotal continuity of the video sequence.

    As a consequence, the object dynamic information is not useful anymore, since the

    camera motion is not considered, and the tracking performance is seriously reduced. In

    Sec. 2.1, a thorough review of the main techniques that address the ego-motion problem

    for single object tracking with moving cameras is presented.

    In the other considered situation, the tracking algorithm has to manage several

    interacting objects in an environment with static cameras. The difficulty arises from

    the fact that in each time step there is a set of unlabeled detections generated from

    the detectors. This means that the correspondence between objects and detections

    is not known, and therefore a data association stage is required. This fact violates

    the assumption that there is always a unique detection per object, since potentially

    whatever detection can be associated with an object. In fact, the data association

    can be very complex since the number of possible associations is combinatorial with

    the number of objects and detections. Furthermore, there can be false detections and

    missing detections that increase even more the complexity of the data association.

    The false detections arise from the noise of the camera sensor and the scene clutter

    (similar structures to the tracked object in the background). On the other hand, object

    occlusions and strong variations in the object appearance can cause that one or more

    of the objects are not detected, the so-called missing detections. These phenomena

    can also occur for single object tracking in unconstrained scenarios, in which there is

    a unique object, but there can be none, one or multiple detections. For example, the

    nearest neighbor Kalman filter (15) handles the data association problem by selecting

    8

  • 8/3/2019 Th_Visual Object Tracking in Challenging Situations Using a Bayesian Perspective

    29/162

    2.1 Tracking with moving cameras

    the closest detection to the predicted object trajectory, which is used to update the

    posterior distribution of the object state. Unlike the previous method that only uses a

    unique detection, the probabilistic data association filter (16; 17) updates the posterior

    distribution utilizing all the detections that are close to the predicted object trajectory.

    This is accomplished by averaging the innovation terms of the Kalman filter resulting

    from the set of detections. This approach maintains the Gaussian character of the

    posterior distribution. On the other hand, the data association in single object tracking

    can be considered as a specific case of the data association of multiple object tracking,

    where the number of tracked objects in the scene is just one. There exist a lot ofscientific literature in the field of multiple object tracking, and recently there has been

    a revival of interest due to the recent developments in particle filtering and recursive

    Bayesian models in general. In Sec. 2.2, the main multi-object tracking techniques are

    presented, focusing on the problem of the data association.

    2.1 Tracking with moving cameras

    In video based applications in which the video acquisition system is mounted on a

    moving aerial platform (such as a plane, a helicopter, or an Unmanned Aerial Vehicle),

    a mobile robot, a vehicle, etc., the acquired video sequences undergo a random global

    motion, called ego-motion, that prevents the use of the object dynamic information to

    restrict the object position in the scene. As a consequence, the tracking performance

    can be dramatically reduced. The ego-motion problem has been addressed in different

    manners in the scientific literature. They can be split into two categories: approaches

    based on the assumption of low ego-motion, and those based on the ego-motion esti-

    mation.

    Approaches assuming low ego-motion consider that the motion component due to

    the camera is not very significant in comparison with the object motion. In this context,

    some works assume that the spatio-temporal connectivity of the object is preserved

    along the sequence (18; 19; 20), i.e. the image regions associated with the tracked object

    are spatially overlapped in consecutive frames. Then, the tracking is performed using

    morphological connected operators. In cases where the previous assumption does not

    hold, the most common approach is to search for the object in a bounded area centered

    in the location where it is expected to find the object, according to its dynamics.

    9

  • 8/3/2019 Th_Visual Object Tracking in Challenging Situations Using a Bayesian Perspective

    30/162

    2. BAYESIAN MODELS FOR OBJECT TRACKING

    In (21; 22), an exhaustive search is performed in a fixed-size image region centered in

    the previous object location. In (23), the initial search location is estimated using a

    Kalman filter, and then the search is performed deterministically using the Mean Shift

    algorithm (24). Other authors (25; 26) propose a stochastic search based on particle

    filtering, which is able to manage multiple initial locations for the search. However,

    all these methods lose effectiveness as the displacement induced by the ego-motion

    increases. The reason is the size of the search area must be enlarged to accommodate

    the expected camera ego-motion, which produces that the probability that the tracking

    can be distracted by false candidates increases dramatically.The other category of approaches based on the ego-motion estimation are able

    to deal with strong ego-motion situations, in which the camera motion is at least as

    significant as the object motion, and even more. They aim to compute the camera ego-

    motion between consecutive frames in order to compensate it, and thus recovering the

    spatio-temporal correlation of the video sequence. The camera ego-motion is modeled

    by a geometric transformation, typically an affine or projective one, whose parameters

    are estimated by means of an image registration technique. The existing works differ

    in the specific image registration technique used to compute the parameters of thegeometric transformation. Extensive reviews of image registration techniques can be

    found in (27; 28), where the first one tackles all kind of vision based applications, while

    the second one is focused on aerial imagery. According to them, a possible classification

    of the image registration techniques is: those based on features and those based on

    area (i.e. image regions). Feature based image registration techniques detect and

    match distinctive image features between consecutive frames to estimate a geometric

    transformation, which represents the camera ego-motion model. In (29), an object

    detection and tracking system with a moving airborne platform is described, which

    uses a feature based approach to estimate an affine camera model. In (30), the KLT

    method (31) is used to infer a bilinear camera model in an application that detects

    moving objects from a mobile robot. In the field of FLIR (Forward Looking InfraRed)

    imagery, the works (32; 33; 34) describe a detection and tracking system of aerial

    targets mounted on an airborne platform that uses a robust statistic framework to

    match edge features in order to estimate an affine camera model. This system is able

    to successfully handle situations in which the camera motion estimation is disturbed by

    the presence of independent moving objects, provided that there is a minimum number

    10

  • 8/3/2019 Th_Visual Object Tracking in Challenging Situations Using a Bayesian Perspective

    31/162

    2.1 Tracking with moving cameras

    of detected features belonging to the background. In situations in which the detection

    of distinctive features is particularly complicated, because the acquired images are low

    textured and structured, an area-based image registration technique is used to estimate

    the parameters of the camera model. In (35), a perspective camera model is computed

    by means of an optical flow algorithm to detect moving objects in an application of

    aerial visual surveillance. An optical flow algorithm is also used in (36) to estimate the

    parameters of a pseudo perspective camera model, which is utilized to create panoramic

    image mosaics. The same approach is followed in (37; 38) for a tracking application ofterrestrial targets in airborne FLIR imagery. In (39; 40), a target detection framework is

    presented for FLIR imagery that minimizes a SSD (Sum of Squares Differences) based

    error function to estimate an affine camera model. A similar framework of camera

    motion compensation is used in (41) for tracking vehicles in aerial infrared imagery,

    but utilizing a different minimization algorithm. In (42), the Inverse Compositional

    Algorithm is used to obtain the parameters of an affine camera model for a tracking

    application of vehicles in aerial imagery. Unlike the feature based image registration

    techniques, the area based techniques are not robust to the presence of independent

    moving objects, which can drift the ego-motion estimation. In addition, they require

    that the involved images are closely aligned to achieve satisfactory results.

    All the previous approaches, independently of the used camera ego-motion compen-

    sation technique, have in common that they compute at most one parametric model to

    represent the ego-motion between consecutive frames. However, in real applications,

    the ego-motion computation can be quite challenging, because there can be several

    feasible solutions, i.e. several camera geometric transformations, and not necessarily

    the solution with less error is the correct one. This situation arises as a consequence

    of several phenomena, such us the aperture problem (43) (related to low structured

    or textured scenes), the presence of independent moving objects, changes in the scene,

    and limitations of the own camera ego-motion technique. In Chap. 3, an efficient and

    reliable Bayesian framework is proposed to deal with the uncertainty in the estimation

    of the camera ego-motion for tracking applications.

    11

  • 8/3/2019 Th_Visual Object Tracking in Challenging Situations Using a Bayesian Perspective

    32/162

    2. BAYESIAN MODELS FOR OBJECT TRACKING

    2.2 Tracking of multiple interacting objects

    Multiple object tracking can be sought as the generalization of single object tracking,

    in the sense that the main goal is to recover the trajectories of multiple objects from

    a video sequence, rather than only one trajectory from an unique object. However,

    techniques of multiple object tracking are fundamentally different from those of sin-

    gle object tracking, due to the particular problems that arise in the presence of two

    or more objects. In multiple object tracking, the object detections are unlabeled and

    unordered, i.e. the true correspondence between objects and detections is unknown.

    The estimation of the true correspondence, called data association, suffers from the

    combinatorial explosion of the possible associations, in which the computational cost

    inevitably grows exponentially with the number of objects. On the other hand, data

    association is a stochastic process in which the estimation of the true detection as-

    sociation can be extremely difficult due to the involved uncertainty. Furthermore, in

    real situations there can be none, one, or several detections per object. As a result,

    there can be false detections and missing detections, in spite of the fact that the goal

    of the detector is both to minimize the probability of false alarms and to maximize the

    detection probability. This fact increases the complexity of the data association prob-

    lem. The false detections arise from scene structures similar to the objects of interest,

    which can obfuscate the tracking process. The missing detections can be originated

    from changes in the object appearance, which in turn are caused by articulated or de-

    formable objects, illumination changes due to weather conditions (typical in outdoor

    applications), and variations in the camera point of view. Another source of missing

    detections are the partial and total occlusions involved in the object interactions. All

    of these phenomena are also responsible of the noisy character of the detection process.

    Tab. 2.1 summarizes the mentioned sources of disturbances along with their effects,

    and the derived data association problems.

    A great deal of strategies have been proposed in the scientific literature to solve

    the data association problem. These can be divided into single-scan and multiple-scan

    approaches. Single-scan approaches perform the data association considering only the

    set of available detections in a specific time step, while the multiple-scan approaches

    make use of the detections acquired in a temporal interval, comprising several time

    steps. Multiple-scan approaches consider that tracks are basically a sequence of noisy

    12

  • 8/3/2019 Th_Visual Object Tracking in Challenging Situations Using a Bayesian Perspective

    33/162

    2.2 Tracking of multiple interacting objects

    Disturbance Effect Data association Problem

    Changes in the camera Variations in the Missing detections,

    point of view object appearance noisy detections

    Articulated or Variations in the Missing detections,

    deformable objects object appearance noisy detections

    Illumination changes Variations in the Missing detections,

    object appearance noisy detections

    Ob ject interactions Partial or Missing detections,

    total occlusions noisy detections

    Scene structures similar Presence of clutter False detectionsto the objects of interest

    Table 2.1: Disturbances in the detection process, their effects, and the resulting problems

    in data association.

    detections. Thus, the multiple object tracking consists in seeking the optimal paths

    in a trellis formed by the temporal sequence of detections. In this way, the data as-

    sociation problem is cast to one of association of sequence of detections. Techniques

    that accomplish this task are the Viterbi algorithm (44; 45; 46), multiple scan assign-

    ment (47; 48), network theoretic algorithms (49), and the expectation-maximization

    algorithm (EM) (50). The precedent approaches compute a single solution that is

    considered the best one, discarding a lot of feasible hypotheses that could be the true

    solution. To alleviate this situation, some approaches (51; 52) compute the best N solu-

    tions in order to minimize the risk of an incorrect trajectory estimation. An additional

    problem is the computational cost. It is known that the multiple-scan approaches are

    NP-hard problems in combinatorial optimization, i.e. their complexity is exponential

    with the number of objects and detections. The most popular solution to tackle this

    problem is the Lagrangian relaxation (53; 54), wherein the N dimensional assignment

    problem is divided into a set of assignment problems of lower dimensionality. Another

    approach (55) transforms the integer programming problem, posed by the multiple-

    scan assignment, into a linear programming problem by relaxing the constraints for

    an integer solution. This allows to efficiently solve the problem in polynomial time

    through well-known algorithms, such as the interior point method (56).

    Inside the group of single-scan approaches, the simplest one is the global nearest

    13

  • 8/3/2019 Th_Visual Object Tracking in Challenging Situations Using a Bayesian Perspective

    34/162

    2. BAYESIAN MODELS FOR OBJECT TRACKING

    neighbor algorithm (57), also known as the 2D assignment algorithm, which computes

    a single association between detections and objects by minimizing a distance based

    cost function. The main problem of this approach is that many feasible associations

    are discarded. On the other hand, the multiple hypotheses tracker (MHT) (58; 59)

    attempts to keep track of all the possible associations along the time. As it occurs

    with the multiple-scan approaches, the complexity of the problem is NP-hard because

    the number of association grows exponentially over time, and also with the number

    of objects and detections. Therefore, additional methods are required to establish a

    trade-off between the computational complexity and the handling of multiple associa-tion hypotheses. In this respect, one of the most popular methods is the joint prob-

    abilistic data association filter (JPDAF) (60; 61), which performs a soft association

    between detections and objects. This is carried out by combining all the detections

    with all the objects, in such a way that the contribution of each detection to each

    object depends on the statistical distance between them. This method prunes away

    many unfeasible hypotheses, but also restricts the data association distribution to be

    Gaussian, which limits the applicability of the technique. Subsequent works (62; 63)

    try to overcome this limitation by modeling the data association distribution by a mix-ture of Gaussians. However, heuristics techniques are necessary to reduce the number

    of components to make the algorithm computationally manageable. The probabilistic

    multiple hypotheses tracker (PMHT) (64; 65) is another alternative to estimate the best

    data associations hypotheses at a moderate computational cost. It assumes that the

    data association is an independent process to work around the problems with pruning.

    Nevertheless, the performance is similar to that of the JPDAF, although the compu-

    tational cost is higher. The data association problem has been also addressed with

    particle filtering, which allows to deal with arbitrary data association distributions in a

    natural way. Theoretically, the algorithms based in particle filtering have the ability to

    manage the best data association hypotheses with a computational cost independently

    of the number of objects and detections. The computed association hypotheses consti-

    tute an approximation of the true data association distribution, and the approximation

    is more accurate as the number of hypotheses increases. In practice, the performance

    of the particle filtering techniques depends on the ability to correctly sample associa-

    tion hypotheses from a proposal distribution called importance density. In (66; 67), a

    Gibbs sampler is used to sample the data association hypotheses. In a similar way, a

    14

  • 8/3/2019 Th_Visual Object Tracking in Challenging Situations Using a Bayesian Perspective

    35/162

    2.2 Tracking of multiple interacting objects

    Markov Chain Monte Carlo (MCMC) (68; 69; 70) scheme has been used for drawing

    samples that simulate the underlying data association distribution. The main problem

    with these samplers is that they are iterative methods that need an unknown number

    of iterations to converge. This fact makes them inappropriate for online applications.

    Some works (71; 72) overcome this limitation by means of the design of an efficient

    and non-iterative proposal distribution that depends on the specific characteristic of

    the underlying dynamic and likelihood processes of the tracking system. The accuracy

    of the estimation achieved by techniques based on particle filtering depends on the size

    of the dimension of the state space. For high dimensional spaces, the accuracy canbe quite low. In order to deal with this drawback, a technique of variance reduction,

    called Rao-Blackwellization, has been used in (73), which improves the accuracy of

    the estimated object trajectories for a given number of samples or hypotheses. An

    alternative to the particle filtering is the probability hypothesis density (PHD) filter

    that can also address missing and false detections like the particle filtering. However,

    the computational cost is exponential with the number of objects. In order to reduce

    the complexity from exponential to linear, the full posterior distribution is simplified

    by its first-order moment in (74). Nonetheless, this approach is only satisfactory formultivariate distributions that can be reasonable approximated by its first moment,

    which can be an excessive limitation for some tracking applications.

    The previous works have been designed to track multiple objects with restricted

    kinds of interactions among them. For instance, these works are able to handle object

    interactions involving trajectory changes but without occlusions, such as a situation

    with two people who stop one in front the other. In this case the object detections are

    used to efficiently correct the object trajectories. Another kind of interaction that is

    successfully addressed involves object occlusions but without trajectory changes, such

    as a situation with two people who cross each other maintaining their paths. In this

    case, the data association stage can manage the missing detections during the occlusion,

    relying on their trajectories are unchanged in order to predict their tracks. However, in

    complex object interactions involving trajectory changes and occlusions, the previous

    approaches are prone to fail because the occluded objects have not available detections

    to correct their trajectories. This limitation arises from the fact the main tracking

    techniques for multiple objects have been developed for radar and sonar applications,

    in which the dynamics of the tracked objects have physical restrictions that make

    15

  • 8/3/2019 Th_Visual Object Tracking in Challenging Situations Using a Bayesian Perspective

    36/162

    2. BAYESIAN MODELS FOR OBJECT TRACKING

    impossible the complex interactions that arise in visual tracking. Moreover, in the field

    of radar and sonar, the objects are handled as point targets that cannot be occluded.

    Some works have proposed strategies to deal with the specific problems that arise in

    the field of visual tracking. In (75; 76), the data association hypotheses are drawn

    using a sampling technique that is able to handle split object detections, i.e. group

    of detections that have been generated from the same object. The split detections

    are typical from background subtraction techniques (77), which are used to detect

    moving objects in video sequences. In (78), a specific approach for handling object

    interactions that involve occlusions and changes in trajectories is presented. It createsvirtual detections of possible occluded objects to cope with the changes in trajectories

    during the occlusions. However, since the occlusion events are not explicitly modeled,

    tracking errors can appear when a virtual detection is associated to an object that is

    actually not occluded. In order to improve the performance of the tracking of multiple

    objects in the field of computer vision, a novel Bayesian approach that explicitly models

    the occlusion phenomenon has been developed. This approach is able to track complex

    interacting objects whose trajectories change during the occlusions. Chap. 4 describes

    in detail the proposed visual tracking for multiple interacting models.

    16

  • 8/3/2019 Th_Visual Object Tracking in Challenging Situations Using a Bayesian Perspective

    37/162

    Chapter 3

    Bayesian Tracking with Moving

    Cameras

    This chapter starts with a brief overview of the optimal Bayesian framework for gen-

    eral object tracking (Sec. 3.1), explaining also the basics of the particle filtering, an

    approximate inference technique. Next, the developed Bayesian tracking framework for

    moving cameras is presented in Sec. 3.2, which models the camera motion in a prob-

    abilistic way. Lastly, Secs. 3.3 and. 3.4 show respectively how to apply the proposed

    Bayesian model to two visual tracking applications for moving cameras: the first one

    focused on aerial infrared imagery, and the second one for aerial and terrestrial visible

    imagery.

    3.1 Optimal Bayesian estimation for object tracking

    The Bayesian approach for object tracking aims to estimate a state vector xt that

    evolves over time using a sequence of noisy observations z1:t = {zi|i = 1,...,t} up

    to time t. The state vector contains all the relevant information for the tracking at

    time step k, such as the object position, velocity, size, appearance, etc. The noisy

    observations z1:t (also called measurements or detections) are obtained by one or more

    detectors, which analyze the video sequence information acquired by the camera to

    either directly compute the object position, or indirectly obtain relevant features that

    can related to the object position, such as motion, color, texture, edges, corners, etc.

From a Bayesian perspective, some degree of belief in the state x_t at time t is calculated using the available prior information (about the object, the camera and the scene) and the set of observations z_{1:t}. Therefore, the tracking problem can be formulated as the estimation of the posterior probability density function (pdf) of the state of the object, p(x_t | z_{1:t}), conditioned on the set of observations, where the initial pdf p(x_0 | z_0) ≡ p(x_0) is assumed to be known. This probabilistic model for object tracking can be represented by a graph (see Fig. 3.1), called a graphical model, in which the random variables are represented by nodes, and the probabilistic relationships among the variables by arrows.

    Figure 3.1: Graphical model for the Bayesian object tracking.

For efficiency purposes, the estimation of the posterior pdf p(x_t | z_{1:t}) is recursively performed through two stages: the prediction of the most probable state vectors using the prior information, and the update (or correction) of the prediction based on the observations. The prediction stage involves computing the prior pdf of the state, p(x_t | z_{1:t-1}), at time t via the Chapman-Kolmogorov equation

p(x_t | z_{1:t-1}) = ∫ p(x_t, x_{t-1} | z_{1:t-1}) dx_{t-1} = ∫ p(x_t | x_{t-1}) p(x_{t-1} | z_{1:t-1}) dx_{t-1},   (3.1)

where p(x_{t-1} | z_{1:t-1}) is the posterior pdf at the previous time step, and p(x_t | x_{t-1}) is the state transition probability, which encodes the prior information, for example the object dynamics along with its uncertainty. The state transition probability is defined by a possibly nonlinear function of the state x_{t-1} and an independent identically distributed noise process v_{t-1}

x_t = f_t(x_{t-1}, v_{t-1}).   (3.2)


The update stage aims to reduce the uncertainty of the prediction, p(x_t | z_{1:t-1}), using the new available observation z_t (observations are available at discrete times) through the Bayes rule

p(x_t | z_{1:t}) = p(z_t | x_t) p(x_t | z_{1:t-1}) / p(z_t | z_{1:t-1}),   (3.3)

where p(z_t | x_t) is the likelihood distribution that models the observation process, i.e. it assesses the degree of support that the observation z_t lends to the prediction x_t. The likelihood is given by a possibly nonlinear function of the state x_t and an independent identically distributed noise process n_t

z_t = h_t(x_t, n_t).   (3.4)

The denominator of Eq. (3.3) is simply a normalization constant given by

p(z_t | z_{1:t-1}) = ∫ p(z_t, x_t | z_{1:t-1}) dx_t = ∫ p(z_t | x_t) p(x_t | z_{1:t-1}) dx_t.   (3.5)

The posterior p(x_t | z_{1:t}) embodies all the available statistical information, allowing the computation of an optimal estimate of the state vector x_t, which contains the desired tracking information. Commonly used estimators are the Maximum A Posteriori (MAP) and the Minimum Mean Square Error (MMSE), given respectively by

MAP:  x̂_t = argmax_{x_t} p(x_t | z_{1:t})   (3.6)

MMSE: x̂_t = E[x_t | z_{1:t}]   (3.7)

Nevertheless, the optimal solution for the posterior probability, given by Eq. 3.3, cannot be determined analytically in practice, due to the nonlinearities and non-Gaussianities of the prior information and observation models. Therefore, suboptimal methods must be used to obtain an approximate solution. In Sec. 3.1.1, a powerful and popular suboptimal method, called Particle Filtering, is described.
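The difference between the two estimators is easiest to see on a multi-modal posterior. The following sketch (not part of the thesis; the grid and mode parameters are invented for illustration) discretizes a bimodal posterior and computes both estimates:

```python
import numpy as np

# Hypothetical discretized posterior over a 1-D position grid: a narrow
# dominant mode at x = 2 and a broad secondary mode around x = 8.
x = np.linspace(0.0, 10.0, 1001)
posterior = (1.2 * np.exp(-0.5 * ((x - 2.0) / 0.2) ** 2)
             + 1.0 * np.exp(-0.5 * ((x - 8.0) / 1.0) ** 2))
posterior /= posterior.sum()                # normalize on the grid

x_map = x[np.argmax(posterior)]             # MAP: location of the highest mode
x_mmse = np.sum(x * posterior)              # MMSE: posterior mean
```

Note that on a multi-modal posterior the MMSE estimate can fall between modes, in a region of low probability, whereas the MAP estimate always sits on a mode.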


    3.1.1 Particle filter approximation

The Particle Filter is an approximate inference method based on Monte Carlo simulation for solving Bayesian filters. In contrast to other approximate inference methods, such as Extended Kalman Filters, Unscented Kalman Filters and Hidden Markov Models, Particle Filtering is able to deal with continuous state spaces and nonlinear/non-Gaussian processes (9), which arise in a natural way in real tracking situations. The Particle Filtering technique approximates the posterior probability p(x_t | z_{1:t}) by a set of N_S weighted random samples (or particles) {x_t^i, i = 1,...,N_S} (2)

p(x_t | z_{1:t}) ≈ (1/c) Σ_{i=1}^{N_S} w_t^i δ(x_t − x_t^i),   (3.8)

where δ(x) is the Dirac delta function, {w_t^i, i = 1,...,N_S} is the set of weights associated with the samples, and c = Σ_{i=1}^{N_S} w_t^i is a normalization factor. As the number of samples becomes very large, this approximation approaches the true posterior pdf.

The samples x_t^i and the weights w_t^i are obtained using the concept of importance sampling (2; 79), which aims to reduce the variance of the approximation given by Eq. (3.8) through Monte Carlo simulation. The set of samples {x_t^i, i = 1,...,N_S} is drawn from a proposal distribution function q(x_t | x_{t-1}, z_t), called the importance density. The optimal q(x_t | x_{t-1}, z_t) should be proportional to p(x_t | z_{1:t}) and should have the same support (the support of a function is the set of points where the function is not zero), in which case the variance would be zero. But this is only a theoretical solution, since it would imply that p(x_t | z_{1:t}) is known. In practice, a proposal distribution as similar as possible to the posterior pdf is chosen, but there is no standard solution, since it depends on the specific characteristics of the tracking application. The choice of the proposal distribution is a key component in the design of Particle Filters, since the quality of the estimation of the posterior pdf depends on the ability to find an appropriate proposal distribution.

The weights w_t^i associated with each sample x_t^i are recursively computed by (2)

w_t^i = w_{t-1}^i · p(z_t | x_t^i) p(x_t^i | x_{t-1}^i) / q(x_t^i | x_{t-1}^i, z_t).   (3.9)


The importance sampling principle has a serious drawback, called the degeneracy problem (2): after a few iterations, all the weights except one have an insignificant value. In order to overcome this problem, several resampling techniques have been proposed in the scientific literature, which introduce an additional sampling step that replicates the more probable samples. A popular resampling strategy is the Sampling Importance Resampling (SIR) algorithm, which makes a random selection of the samples at each time step according to their weights. Thus, the samples with higher weights are selected several times, while those with an insignificant weight are discarded. After SIR resampling, all the samples have the same weight.
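As an illustration only (a generic 1-D sketch, not the tracker developed in this thesis), a SIR-style particle filter that resamples at every step can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

def sir_step(particles, weights, transition, likelihood, z):
    """One SIR iteration: resample, propagate, reweight (generic names)."""
    n = len(particles)
    # --- resampling: draw particle indices proportionally to their weights
    idx = rng.choice(n, size=n, p=weights)
    particles = particles[idx]
    # --- prediction: propagate each particle through the dynamic model
    particles = transition(particles)
    # --- update: reweight with the likelihood of the new observation
    weights = likelihood(z, particles)
    weights /= weights.sum()
    return particles, weights

# Toy 1-D constant-position model with Gaussian process/observation noise.
transition = lambda p: p + rng.normal(0.0, 0.5, size=p.shape)
likelihood = lambda z, p: np.exp(-0.5 * ((z - p) / 1.0) ** 2) + 1e-300

particles = rng.normal(0.0, 5.0, size=2000)
weights = np.full(2000, 1.0 / 2000)
for z in [4.0, 4.2, 3.9, 4.1]:          # observations of a target near x = 4
    particles, weights = sir_step(particles, weights, transition, likelihood, z)

estimate = np.sum(weights * particles)  # MMSE estimate from the particle set
```

The small floor added to the likelihood avoids an all-zero weight vector when every particle falls far from the observation.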

    3.2 Bayesian tracking framework for moving cameras

In video sequences acquired by a moving camera, the perceived motion of the objects is composed of the objects' own motion and the camera motion. Consequently, the camera motion must be estimated in order to obtain the object position. Accordingly, the state vector x_t = {d_t, g_t} must contain not only the object dynamics d_t (position and velocity over the image plane), but also the camera dynamics g_t, i.e. the camera ego-motion. The posterior pdf of the state vector is recursively expressed by the equations

p(x_t | z_{1:t}) = p(z_t | x_t) p(x_t | z_{1:t-1}) / p(z_t | z_{1:t-1})   (3.10)

p(x_t | z_{1:t-1}) = ∫ p(x_t | x_{t-1}) p(x_{t-1} | z_{1:t-1}) dx_{t-1}.   (3.11)

The transition probability p(x_t | x_{t-1}) = p(d_t, g_t | d_{t-1}, g_{t-1}) encodes the information about the object and camera dynamics, along with their uncertainty. If the camera motion is not considered, the object dynamics can be modeled by the linear function

d_t = M d_{t-1},   (3.12)

where M is a matrix that represents a first-order linear system of constant velocity. This object dynamic model is a reasonable approximation for a wide range of object tracking applications, provided that the camera frame rate is high enough. The camera dynamics is modeled by a geometric transformation g_t that ideally is a projective camera model, although, depending on the camera and scene disposition, it can be simplified to an affine or Euclidean transformation. For example, in aerial tracking systems, an affine geometric transformation is a satisfactory approximation of the projective camera model, since the depth relief of the objects in the scene is small enough compared to the average depth, and the field of view is also small (80). The joint dynamic model for the camera and the object is expressed as the composition of both individual models

d_t = g_t ∘ (M d_{t-1}).   (3.13)

Based on this joint dynamic model, the transition probability p(x_t | x_{t-1}) can be expressed as

p(x_t | x_{t-1}) = p(d_t, g_t | d_{t-1}, g_{t-1}) = p(d_t | d_{t-1}, g_{t-1:t}) p(g_t | d_{t-1}, g_{t-1}) = p(d_t | d_{t-1}, g_t) p(g_t),   (3.14)

where it has been assumed that, on the one hand, the current object position is conditionally independent of the camera motion in the previous time step (as the proposed joint dynamic model states), and, on the other hand, the current camera motion is conditionally independent of both the camera motion and the object position in previous time steps. This last assumption results from the fact that the camera ego-motion is completely random, not following any specific pattern. The probability term p(d_t | d_{t-1}, g_t) models the uncertainty of the proposed joint dynamic model as

p(d_t | d_{t-1}, g_t) = N(d_t; g_t ∘ (M d_{t-1}), σ_tr²),   (3.15)

where N(x; μ, σ²) is a Gaussian or Normal distribution of mean μ and variance σ². Thus, the term σ_tr² represents the unknown disturbances of the joint dynamic model.
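To make Eq. 3.15 concrete, the following sketch (illustrative names and values, assuming a 2×3 affine camera transform and a state d = [x, y, vx, vy]; applying the linear part of the transform to the velocity components is a modeling choice made here for illustration) draws one sample from the transition model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Constant-velocity model M for a state d = [x, y, vx, vy] (dt = 1 frame).
M = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

def apply_camera(g, d):
    """Apply a 2x3 affine camera transform g to the state d."""
    pos = g[:, :2] @ d[:2] + g[:, 2]   # affine warp of the position
    vel = g[:, :2] @ d[2:]             # linear part rotates/scales the velocity
    return np.concatenate([pos, vel])

def sample_transition(d_prev, g, sigma_tr=1.0):
    """Draw d_t ~ N(g ∘ (M d_{t-1}), sigma_tr^2), as in Eq. 3.15."""
    mean = apply_camera(g, M @ d_prev)
    return mean + rng.normal(0.0, sigma_tr, size=mean.shape)

g = np.array([[1.0, 0.0, 3.0],         # pure camera translation of (+3, -2) px
              [0.0, 1.0, -2.0]])
d_prev = np.array([10.0, 20.0, 1.0, 0.5])
d_t = sample_transition(d_prev, g)
```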

The other probability term in Eq. 3.14, p(g_t), expresses the probability that a specific geometric transformation represents the true camera motion between consecutive time steps. It is typically computed by a deterministic approach using an image registration algorithm (27), which amounts to expressing p(g_t) as

p(g_t) = δ(g_t − g_t^j),   (3.16)


where g_t^j is the geometric transformation obtained by the image registration technique. However, this approximation can fail in situations where the aperture problem (43; 81) is significant and/or the assumption of a single global motion does not hold, for instance, in the presence of independently moving objects. Under these circumstances there are several putative geometric transformations that can explain the camera ego-motion. Moreover, the best geometric transformation according to some error or cost function is not necessarily the actual camera ego-motion, due to the noise and non-linearities involved in the estimation process. In order to deal satisfactorily with this situation, g_t is treated as a random variable, rather than as a parameter computed in a deterministic way. The specific computation of p(g_t) depends on the tracking application and the type of imagery. Two different methods are proposed in Secs. 3.3 and 3.4 for infrared and visible imagery, respectively. In any case, both compute an approximation of p(g_t) as

p(g_t) ≈ Σ_{j=1}^{N_g} w_t^j δ(g_t − g_t^j),   (3.17)

where N_g is the number of geometric transformations used to represent p(g_t), {g_t^j | j = 1,...,N_g} are the best candidate transformations to model the camera ego-motion, and w_t^j is the weight of g_t^j, which evaluates how well the transformation represents the camera ego-motion.
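Sampling from a p(g_t) of this form reduces to drawing an index from a discrete distribution over the candidate transformations. A minimal sketch (the candidate matrices and their scores are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical candidate camera motions (2x3 affine matrices) and their
# scores, as produced by some registration front-end (illustrative values).
candidates = [np.array([[1.0, 0.0, 3.1], [0.0, 1.0, -2.0]]),
              np.array([[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]),
              np.array([[1.0, 0.0, -5.0], [0.0, 1.0, 4.0]])]
scores = np.array([0.70, 0.25, 0.05])     # unnormalized weights w_t^j
weights = scores / scores.sum()

def sample_g():
    """Draw one g_t from the discrete mixture of Eq. 3.17."""
    j = rng.choice(len(candidates), p=weights)
    return candidates[j]

draws = [sample_g() for _ in range(1000)]
# The most likely candidate should dominate the drawn set.
frac_best = np.mean([np.array_equal(g, candidates[0]) for g in draws])
```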

The likelihood function p(z_t | x_t) in Eq. 3.10 depends on the kind of imagery and on the type of object being tracked. Two different models have been developed: one based on the detection of blob regions for infrared imagery, and another based on color histograms for visible video sequences, which are described respectively in Secs. 3.3 and 3.4. In general terms, the resulting likelihood will be non-Gaussian, nonlinear and multi-modal, due to the presence of clutter and of objects similar to the tracked one.

The initial pdf p(x_0 | z_0) ≡ p(x_0), called the prior, can be initialized as a Gaussian distribution using the information given by an object detection algorithm, as in (18; 19; 32; 33; 34; 39; 40). Another alternative is to use the ground-truth information (if available) to initialize a delta function δ(x_0).


    3.3 Object tracking in aerial infrared imagery

This section presents the developed object tracking approach for aerial infrared imagery. In contrast to visual-range images, infrared images have low signal-to-noise ratios, objects poorly contrasted against the background, and non-repeatable object signatures. These drawbacks, along with the competing background clutter and the illumination changes due to weather conditions, make the tracking task extremely difficult. On the other hand, the unpredictable camera ego-motion, resulting from the fact that the camera is on board an aerial platform, distorts the spatio-temporal correlation of the video sequence, negatively affecting the tracking performance.

All the aforementioned problems are addressed by a tracking strategy based on the Bayesian tracking framework for moving cameras proposed in Sec. 3.2. Accordingly, the posterior pdf of the state vector, p(x_t | z_{1:t}), is recursively computed by Eqs. 3.10 and 3.11.

The transition probability p(x_t | x_{t-1}), which encodes the joint camera and object dynamic model, is given by Eq. 3.14, where the prior probability p(g_t) of the geometric transformation depends on the specific type of imagery. For the present tracking application dealing with infrared imagery, the probability p(g_t^j) of a specific geometric transformation g_t^j is based on the quality of the image alignment between consecutive frames achieved by g_t^j. The quality of the image alignment (or of the ego-motion compensation) is computed by means of the Mean Square Error function, mse(x, y), between the current frame I_t and the previous frame I_{t-1} warped by the transformation g_t^j. Thus, the probability p(g_t^j) is mathematically expressed as

p(g_t^j) = N(mse(I_t, g_t^j ∘ I_{t-1}); 0, σ_g²),   (3.18)

where N(x; μ, σ²) is a Gaussian distribution of mean μ and variance σ², and σ_g² is the expected variance of the image alignment process. Notice that I_t is an infrared intensity image.
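The weighting of Eq. 3.18 can be sketched as follows. For simplicity, the affine warp is replaced here by an integer translation (np.roll), the frames are synthetic noise images, and sigma_g is an invented value; the Gaussian is left unnormalized since only relative weights matter:

```python
import numpy as np

def mse(a, b):
    return np.mean((a.astype(float) - b.astype(float)) ** 2)

def transform_prior(curr, prev_warped, sigma_g=0.5):
    """Unnormalized p(g_t^j) per Eq. 3.18: Gaussian on the alignment MSE."""
    e = mse(curr, prev_warped)
    return np.exp(-0.5 * (e / sigma_g) ** 2)

# Synthetic example: the "camera" shifts the frame by (2, 3) pixels.
rng = np.random.default_rng(3)
prev = rng.normal(size=(64, 64))
curr = np.roll(prev, shift=(2, 3), axis=(0, 1))

# Two candidate transformations, modeled here as integer translations only.
good = np.roll(prev, shift=(2, 3), axis=(0, 1))   # correct compensation
bad = np.roll(prev, shift=(0, 0), axis=(0, 1))    # no compensation

p_good = transform_prior(curr, good)
p_bad = transform_prior(curr, bad)
```

As expected, the candidate that correctly compensates the ego-motion receives a weight close to 1, while the uncompensated candidate is heavily penalized.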

Figure 3.2: Two consecutive frames of an infrared sequence acquired by an airborne camera.

Finding an observation model for the likelihood p(z_t | x_t) in airborne infrared imagery that appropriately describes the object appearance and its variations over time is quite challenging, due to the aforementioned characteristics of infrared imagery. The most robust and reliable object property is the presence of bright regions, or at least regions that are brighter than their surrounding neighborhood, which typically correspond to the engine and exhaust-pipe area of the object. Based on this fact,

the likelihood function uses an observation model that aims to detect the main bright regions of the target. This is accomplished by a rotationally symmetric Laplacian of Gaussian (LoG) filter, characterized by a sigma parameter that is tuned to the smallest dimension of the object, so that the filter response is maximal in bright regions with a size similar to that of the tracked object. The main handicap of this observation model is its lack of distinctiveness, since any bright region of adequate size can be the target object. As a consequence, the resulting LoG filter response is strongly multi-modal. This fact, coupled with the camera ego-motion, dramatically complicates a reliable estimation of the state vector. This situation is illustrated in Figs. 3.2 and 3.3. The first one, Fig. 3.2, shows two consecutive frames, (a) and (b), of an infrared sequence acquired by an airborne camera, in which the target object has been enclosed by a rectangle. Fig. 3.3 shows the LoG filter response related to Fig. 3.2(b), where the image itself has been projected over the filter response for a better interpretation. The multi-modality is clearly observed, and in theory any of the modes could be the right object position. Moreover, if only the object dynamics is considered, the closest mode to the predicted object location (marked by a vertical black line) is not the true object location, because of the effects of the camera ego-motion.


    Figure 3.3: Multimodal LoG filter response related to Fig. 3.2(b).

    Figure 3.4: Likelihood distribution related to Fig. 3.3.

The likelihood probability can be simplified as

p(z_t | x_t) = p(z_t | d_t, g_t) = p(z_t | d_t),   (3.19)

assuming that z_t is conditionally independent of g_t given d_t. Then, p(z_t | d_t) is expressed by the Gaussian distribution

p(z_t | d_t) = N(z_t; H d_t, σ_L²),   (3.20)

where z_t is the LoG filter response of the frame I_t, H is a matrix that selects the positional information of the object, and the variance σ_L² is set so as to highlight the main modes of z_t while discarding the less significant ones. This is illustrated in Fig. 3.4, where only the most significant modes of Fig. 3.3 are highlighted.
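A minimal sketch of the LoG observation model (synthetic frame and illustrative sizes; scipy's gaussian_laplace plays the role of the rotationally symmetric LoG filter):

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

rng = np.random.default_rng(4)

# Synthetic infrared-like frame: background noise plus a bright blob of
# roughly the object size (sigma = 3 pixels) centered at row 40, col 25.
frame = 0.05 * rng.standard_normal((96, 96))
yy, xx = np.mgrid[0:96, 0:96]
frame += np.exp(-((yy - 40) ** 2 + (xx - 25) ** 2) / (2 * 3.0 ** 2))

# Rotationally symmetric LoG filtering; the sign is flipped so that bright
# regions produce positive peaks. sigma is matched to the object size.
z = -gaussian_laplace(frame, sigma=3.0)
peak = np.unravel_index(np.argmax(z), z.shape)   # strongest mode of z_t
```

In a real sequence z would contain several comparable modes (clutter, similar objects), which is precisely why the response is fed into the probabilistic model of Eq. 3.20 instead of being thresholded.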


As both the dynamic and observation models are nonlinear and non-Gaussian, the posterior pdf cannot be analytically determined, and therefore the use of approximate inference methods is necessary. In the next section, a Particle Filtering strategy is presented to obtain an approximate solution of the posterior pdf.

    3.3.1 Particle filter approximation

The posterior pdf p(x_t | z_{1:t}) is approximated by means of a Particle Filter as

p(x_t | z_{1:t}) ≈ (1/c) Σ_{i=1}^{N_S} w_t^i δ(x_t − x_t^i),   (3.21)

where the samples x_t^i are drawn from a proposal distribution based on the likelihood and the prior probability of the camera motion

q(x_t | x_{t-1}, z_t) = p(z_t | d_t) p(g_t),   (3.22)

which is an efficient simplification of the optimal, but intractable, importance density function (9)

q(x_t | x_{t-1}, z_t) = p(x_t | x_{t-1}, z_t).   (3.23)

The samples x_t^i = {d_t^i, g_t^i} are drawn from the proposal distribution by a hierarchical sampling strategy, which first draws samples g_t^i from p(g_t), and then draws samples d_t^i from p(z_t | d_t). The sampling procedure for obtaining samples g_t^i from p(g_t) is based on the image registration algorithm presented in (82). This method assumes an initial geometric transformation t_t^i, and then uses the whole image intensity information to compute a global affine transformation g_t^i, which is a candidate for representing the true camera motion. The method explicitly accounts for global variations in image intensities in order to be robust to illumination changes. However, the computed candidate g_t^i will only be a reasonable approximation of the camera motion if the initial geometric transformation t_t^i is close to the geometric transformation that represents the actual camera motion. This means that the image in the previous time step, warped by the initial transformation, must be closely aligned with the current image to achieve a satisfactory result. This limitation derives from the optimization strategy used in the image registration algorithm, which converges to the closest mode given an initial transformation.


As a consequence, if the two images are not closely aligned, the computed solution will probably correspond to a local mode that does not represent the true camera motion. By default, t_t^i is a 3×3 identity matrix that represents the previous image without warping. This approach is inefficient in airborne visual tracking, since the camera can undergo strong displacements that cannot be satisfactorily compensated. To overcome this problem, the previous image registration technique has been improved by using several initial geometric transformations {t_t^i | i = 1,...,N_S}, obtaining in turn a set of camera ego-motion candidates {g_t^i | i = 1,...,N_S}. The set of initial transformations is computed so that at least one of them is relatively close to the actual camera motion, allowing the image registration algorithm to effectively compute the correct geometric transformation. In this context, the concept of closeness between geometric transformations depends, on the one hand, on the magnitude of the camera motion, and, on the other hand, on the capability of the image registration algorithm itself to rectify misaligned images. For example, the ideal situation would be that the magnitude of the camera motion were lower than the maximum displacement that the image registration algorithm is able to rectify. For the purpose of measuring the magnitude of the camera motion, a subset of video sequences belonging to the AMCOM dataset (see Sec. 3.3.2) has been used as a training set to compute the actual camera motion. These sequences have been acquired by different infrared cameras on board a plane. The computation of the camera motion has been supervised by a user, who not only guides the image alignment, but also evaluates whether the achieved result is accurate enough to be considered the real camera motion. As a result, a set of affine transformations is obtained, which describe the typical camera movements. Regarding the image registration algorithm,