Robust Shape Estimation and Tracking in

the Presence of Clutter

Jacinto Carlos Marques Peixoto do Nascimento

(MSc)

Dissertation submitted for the degree of Doctor in

Electrical and Computer Engineering

Supervisor:

Dr. Jorge dos Santos Salvador Marques

Jury

President: Rector of the Universidade Tecnica de Lisboa

Members: Dr. Joao Manuel Lage de Miranda Lemos

Dr. Jorge dos Santos Salvador Marques

Dr. Jose Alberto Rosado dos Santos Victor

Dr. Arnaldo Joaquim de Castro Abrantes

Dr. Jose Manuel Bioucas Dias

Dr. Gilles Celeux

April 2003

Abstract

This thesis proposes robust methods for the estimation and tracking of object boundaries in images.

There are several well-known methods for contour estimation and tracking using deformable models.

However, these methods have strong limitations. Their performance is severely hampered in the presence

of outliers, i.e., detected image features which do not belong to the object boundary, and outliers occur

in most practical applications.

The goal of this thesis is to improve the performance of existing methods in the presence of outliers.

This thesis proposes robust versions of three contour estimation algorithms: Snakes, the Kalman tracker,

and the Multi-model Kalman tracker. The first one is a pioneering contour estimation algorithm for static

objects. The last two are tracking methods for the estimation of moving objects in image sequences.

The Kalman tracker is the most popular method for tracking deformable objects using active contours.

The Multi-model tracker is a recent method based on multiple dynamic models

switched according to a Markov process, which is useful for representing complex motions.

In this work we propose robust versions of the three methods using statistical models to represent

valid and invalid observations. The proposed algorithms share the following properties. First, they are

based on the use of middle level features (contour strokes) instead of low level ones (edge points) which

are used in the original methods. Second, the detected features are not all considered as valid, since we

know that some of them are outliers.

In this thesis, confidence degrees are assigned to each feature or a set of features, and the contour

estimates are based on these confidence degrees. Features with high confidence degrees have a large

influence on the shape estimates, while features with low confidence degrees have a negligible influence.

A set of tests is presented to evaluate the performance of the proposed algorithms in shape estimation

and tracking. It is concluded that the performance of the proposed methods is much better than

that of the original algorithms.

Keywords: Shape analysis, Tracking, Robust estimation, Active contours, Adaptive Snakes, Multi-Model

tracking.

Summary

This thesis proposes robust methods for estimating and tracking object boundaries in images. There are several known methods for contour estimation and tracking using deformable models. These methods have limitations, however. In particular, they are sensitive to the detection of visual features in the image that do not belong to the object contour, a situation that occurs in most practical problems of interest.

The goal of this thesis is to improve the performance of existing methods in the presence of incorrect observations of the boundary of the object of interest. Robust versions of three contour estimation algorithms are proposed: Snakes, the Kalman tracker, and the Kalman tracker with switched models. The first is a pioneering algorithm for estimating the contour of static objects. The last two are methods for tracking moving objects in image sequences. The Kalman tracker is the most popular method for tracking deformable objects with active contours, and the switched-model tracker is a recent method, based on the use of multiple switched dynamic models, that is useful for representing complex motions.

This work proposes robust versions of the three chosen methods, using statistical models to represent valid and invalid observations. The proposed algorithms share common aspects. First, they are based on middle level visual features (contour segments) instead of the low level features (contour points) used in the original algorithms. Second, the detected features are not all considered valid, contrary to what is done in classical active contour methods. In this thesis, confidence degrees are assigned to each feature or group of features, and the object contour is estimated taking the assigned confidence degrees into account. Thus, observations with high confidence degrees have a large influence on the estimate of the object contour, whereas observations with a low confidence degree have a reduced influence.

The proposed methods are evaluated through experimental tests on the estimation and tracking of deformable objects in video sequences. It is shown that the performance of the proposed methods is much better than that of the original methods.

Keywords: Shape analysis, Tracking, Robust estimation, Active contours, Adaptive snakes, Multiple-model tracking.

Acknowledgements

Tenacity, persistence, and isolation were fundamental to carrying out this work. From the very beginning, I always wished to see "a light at the end of the tunnel". I find that I did not see the light, because I am not in a tunnel. I now realize that I have an opportunity to take the first steps, hoping that at some point in the future I will be lucky enough to find a tunnel...

My first words of gratitude go to Prof. Jorge Marques. I met him in person in October 1995, and he has always been rigorous, demanding and, above all, available for the many discussions we had over these years, teaching me that the capacity for work has no horizons. I confess that I am drawn to his capacity for research, inquiry, and scientific discovery.

I would like to thank Prof. Joao Sentieiro, director of ISR, Prof. Victor Barroso, head of the Signal Processing Lab, and all my colleagues at ISR for the excellent working environment they provided during the thesis.

I thank Prof. Bioucas for his meticulous reading of the thesis and for the suggestions that helped improve the quality of the final document.

I also want to thank Prof. Gilles Celeux for the opportunity to visit INRIA twice, to carry out scientific work there, and to enjoy pleasant runs in the greenery of Grenoble. These were without doubt enriching and unforgettable experiences for me.

To my colleague Joao Sanches, for the support he always gave me during these years and for the ultrasound images that he kindly provided, which were useful for evaluating the performance of the algorithms in this thesis.

To Arnaldo Abrantes, for the fruitful conversations about deformable models and information fusion, as well as for the friendship we have always kept.

I thank the Fundacao para a Ciencia e Tecnologia, which funded this PhD through grant PRAXIS XXI/BD/15827/98, as well as the travel required to take part in international conferences.

In a non-academic world, but without a shadow of a doubt the most important one, I would like to thank my sister Lia for the magnificent video sequences, and my brother Ze for the many moments we shared, together with Guida, Jonas, and Ian, without forgetting, of course, little Iara, who burst into our lives only a short while ago, colouring them with her endless and constant moments of childhood and restlessness. My last words go to my parents, for having been patient with me and for finally accepting that they have a son who "just never gets around to graduating".

Contents

1 Introduction
1.1 Scope
1.2 Active Contours
1.3 Thesis Objectives
1.4 Thesis Organization
1.5 Contributions

2 Shape Tracking with Deformable Models
2.1 Introduction
2.2 Algorithms based on a Potential
2.2.1 The snake model
2.3 Shape Models
2.4 Motion Model
2.5 Feature Based Methods
2.6 Conclusions

3 Adaptive Snakes
3.1 Introduction
3.2 Related work
3.3 Classical Approach
3.4 Adaptive Potential
3.5 Contour Estimation
3.6 Experimental Results
3.7 Conclusions

4 A Robust Feature based Tracker
4.1 Introduction
4.2 Related Work
4.3 Problem Statement
4.4 S-PDAF
4.5 Application to Tracking
4.5.1 Association Probabilities
4.6.1 Hand Tracking
4.6.2 Lip Tracking
4.6.3 Vehicle Tracking
4.7 Computational Complexity Discussion
4.8 Conclusions

5 A Robust Multi Model Tracker
5.1 Introduction
5.2 Related Work
5.3 Switched Dynamical Models
5.4 Density Propagation
5.5 Contour prediction and update
5.6.1 Hand Tracking
5.6.2 Lip Tracking
5.6.3 Heart Tracking
5.7 Conclusions

6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work

A Shape Models
B Feature Detection
C Covariance Update for the S-PDAF Model
D Robust Multi-Model Tracker
D.1 Mixture Coefficients of RMM tracker
D.2 Mixture Coefficients for Multi-Model Tracker
D.3 State estimation for a given model

Bibliography

Chapter 1

Introduction

1.1 Scope

This thesis addresses the estimation of object boundaries in image sequences. The problem can be

formulated as follows: given a rough estimate of the object of interest, we wish to accurately estimate

the object boundary in the whole image sequence.

This problem occurs in many image processing operations, e.g., in medical imaging, in human-

machine interfaces and in surveillance. Let us consider a few examples. Given a set of cross sections

of the human body, obtained by CT or by ultrasound imaging, we wish to estimate the boundary of

human organs and evaluate their properties. Similar operations are performed in video surveillance

when we wish to track moving objects with deformable shapes (e.g., persons) or in human-machine

communication when we wish to detect human gestures or to track the lip motion.

Object estimation is difficult in many problems, since it is not possible to discriminate the object

from the background using a single type of feature. In this case, tracking relies on the ability to

integrate different cues (e.g., edge points, color, texture) over time. This requires the ability to segment

the data, i.e., to discriminate valid features (belonging to the object) from invalid ones and to estimate

the object boundary using a priori information about the object position, shape, and motion. The

optimal solution is problem dependent since it exploits the information available about the object

shape, motion, and visual appearance as well as the information about the other interacting objects

and background image.

Several methods have been proposed for object estimation and tracking in image sequences. Active

contours are among the most popular choices since they make it possible to integrate geometric, dynamic, and

image information into a common framework and to estimate the object contour in a principled way

[10]. Active contours approximate the object boundary by an elastic curve which is deformed until it

matches the image features, according to a specified criterion. Two types of forces are usually considered

to modify the contour shape and position: internal forces which avoid unusual shapes and external forces

which attract the curve points towards the image features.

Most active contour models work well in simple problems, e.g., when the background image is quasi-

homogeneous and the object shape and motion change slowly or in a predictable way. The tracking

results are very poor, however, when there are multiple interacting objects in a cluttered background

or when the object shape and motion undergo sudden changes. The main difficulty concerns data

segmentation, i.e., the available techniques are unable to discriminate valid observations which belong

to the object boundary from invalid ones generated by inner contours, by other objects or by the

background texture. New methods are required to overcome these difficulties and to provide robust

estimates of the object boundary in these cases. This is the main objective of this thesis.

1.2 Active Contours

The estimation of objects in images is an old image processing problem. The ultimate goal is the

segmentation of the whole image, i.e., the detection of all meaningful objects in the image. This goal

has been pursued for a long time, seeking general techniques valid for all types of scenes and images.

This work has been fostered by a strong argument: the human eye easily performs such segmentation

using similar data. However, it was soon recognized in the early 80s that the image segmentation

problem is too difficult to be solved using bottom-up techniques, i.e., without making assumptions

about the types of objects present in the scene as well as the scene properties. A different perspective

was adopted then, based on two main ideas: i) keep the problem as simple as possible (e.g., detect a

single object instead of many) and ii) develop specific algorithms, i.e., use all the available information

about the type of objects and background to be tracked (e.g., shape, color, and texture properties).

Active contours were proposed in the late 80s as an attempt to solve the segmentation problem.

The first model was proposed by Kass, Witkin, and Terzopoulos and it is denoted as snakes [42]. The

snake model approximates the boundary of the object to be estimated by an elastic curve. The curve is

initialized by the user, e.g., using a graphical editor, and it is automatically modified using an adaptation

method which minimizes an energy function. The snake energy has two terms: an image dependent

term, which attracts the curve points towards the object boundary, and a regularization term which

tries to keep the curve as small and smooth as possible.

Both terms play an important role. The regularization term defines the set of typical shapes and

assigns high energy to shapes which are not typical and should, therefore, be avoided. The image

dependent term defines which image properties should attract the contour. This term is often defined

as the integral of a potential function along the curve. The valleys of the potential function are associated

with edge points or with points in which the image intensity changes. Unfortunately the snake algorithm

defined in [42] has many drawbacks. The final configuration of the elastic curve is often stuck in valleys

of the potential function which are not associated with the object boundary, therefore producing wrong

shape estimates. Furthermore, the final shape estimate depends on the initial contour and snake

parameters.
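The two energy terms described above can be illustrated with a small discrete sketch. It is an illustration under stated assumptions, not the formulation used later in the thesis: the contour is a closed polygon, the regularization term is built from first and second differences weighted by the hypothetical parameters `alpha` and `beta`, and `circle_potential` is a made-up potential whose valley lies on the unit circle.

```python
import math

def snake_energy(points, potential, alpha=1.0, beta=1.0):
    """Discrete energy of a closed contour given as a list of (x, y) points.

    alpha weights the first-difference (length) term and beta the
    second-difference (bending) term; both are illustrative parameters.
    potential(x, y) is a user-supplied function with valleys at image edges.
    """
    n = len(points)
    internal = 0.0
    external = 0.0
    for i in range(n):
        xp, yp = points[i - 1]          # previous point (wraps around)
        x, y = points[i]
        xn, yn = points[(i + 1) % n]    # next point (wraps around)
        d1 = (xn - x) ** 2 + (yn - y) ** 2                    # squared first difference
        d2 = (xn - 2 * x + xp) ** 2 + (yn - 2 * y + yp) ** 2  # squared second difference
        internal += 0.5 * (alpha * d1 + beta * d2)
        external += potential(x, y)
    return internal + external

# Hypothetical potential: zero on a circle of radius 1, growing away from it.
def circle_potential(x, y):
    return (math.hypot(x, y) - 1.0) ** 2

def circle(radius, n=40):
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

# With the regularization switched off, only the image term remains.
e_on_edge = snake_energy(circle(1.0), circle_potential, alpha=0.0, beta=0.0)
e_off_edge = snake_energy(circle(1.5), circle_potential, alpha=0.0, beta=0.0)
```

In this toy setting the contour lying in the valley of the potential has the lower image energy, while the regularization term alone always favours the smaller contour. A potential with several valleys would give the minimization several local minima, which is the drawback noted above.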

A lot of work has been done to overcome these difficulties. Most of it can be grouped into three

different directions: i) improvement of the shape model; ii) improvement of the dynamical model and

iii) improvement of the estimation techniques. Each of these aspects will be briefly addressed in the

sequel.

Let us consider the shape model first. The snake model approximates the object boundary by a

smooth curve with continuous second order derivatives. The set of admissible contours has, therefore,

an infinite dimension. It was soon recognized that this choice was too general and did not incorporate

enough information about the shape of the object to be estimated. Several authors proposed the use

of finite dimension vector spaces to describe the object shape, e.g., B-spline curves [53]. This is

important progress. However, when we use B-spline curves we still have to define a large number

of parameters to accurately approximate the object boundary (typically, several tens) and no a priori

information about the object shape is used. A more focused approach is proposed in [11]. It is assumed that

the object boundary is obtained by applying a geometric transformation (e.g., affine transformation) to a

reference shape. Since the reference shape is known, only the parameters of the geometric transformation

(e.g., 6 coefficients in the case of an affine transform) have to be estimated. The set of allowed shapes,

obtained by applying the geometric transform to the reference shape, defines a subspace of the B-spline

vector space denoted as shape space. The shape space can be defined by the user or trained from the

data. This approach is robust but it is too rigid to be applied in all tracking problems unless complex

transformations are considered in order to accommodate shape deformation.
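As a minimal sketch of the shape space idea, the snippet below applies an affine transform with 6 coefficients to a reference shape. The reference shape is a plain list of points rather than a B-spline, and the function name and the ordering of the coefficients are assumptions made for this example only.

```python
def affine_shape(reference, params):
    """Apply an affine transform with 6 coefficients to a reference shape.

    reference: list of (x, y) points of the reference contour.
    params: (a11, a12, a21, a22, tx, ty); this ordering is an assumption
            made for the example.
    """
    a11, a12, a21, a22, tx, ty = params
    return [(a11 * x + a12 * y + tx, a21 * x + a22 * y + ty)
            for x, y in reference]

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
identity = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
scaled_and_shifted = (2.0, 0.0, 0.0, 2.0, 3.0, -1.0)

same = affine_shape(square, identity)             # reproduces the reference shape
moved = affine_shape(square, scaled_and_shifted)  # doubled in size and translated
```

Whatever the number of contour points, only the 6 coefficients have to be estimated. This is also why the representation is rigid: no choice of coefficients can bend one side of the square.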

Another important milestone is the active shape model [19, 70]. This model explicitly represents

shape uncertainty and deformation. It assumes that the coefficients of the B-spline model are obtained

by applying a geometric transform to a random B-spline curve whose coefficients are random variables

with normal distribution. The object shape is, therefore, characterized by a mean shape and by the

covariance matrix of the B-spline coefficients. This model makes it possible to define typical deformation modes

and to assign a probability distribution to the curve model.

Another way to improve the performance of shape trackers concerns the use of temporal constraints.

The first tracking algorithms did not make an explicit use of such constraints [42, 44]. In such algorithms,

the object boundary was independently estimated in every new frame, using the contour estimate

obtained in the last frame as a starting point for the adaptation. This approach is poor since it

does not consider the evolution of the shape parameters as time goes by. To overcome this difficulty,

dynamic models have been used to describe the evolution of the shape variables. Typical examples are

the Kalman tracker [72] and the shape space tracker [11] which use a stochastic linear state model to

describe the evolution of the unknown parameters and their relationship with the image data. The

dynamic model is defined by the user or trained from the data [10, 12]. Such models make it possible

to predict the object contour in the next frame and to fuse current

and past information.

The use of a linear stochastic model has been adopted in many trackers but it presents several

difficulties. Other alternatives have been explored, e.g., probabilistic models using Markov random

fields have been used for shape estimation and tracking allowing a better representation of the image

data and a significant improvement of the tracking results [24, 32]. These models assume that the

object has a homogeneous texture or color. This hypothesis is acceptable, e.g., in many medical

imaging problems but it is not true in general. Recently, it has been found that a single dynamical

model does not allow an accurate representation of the evolution of the shape parameters, especially

in the presence of abrupt shape or motion changes. The use of multiple switched models has been

advocated [38, 49]. The advantages of this approach are not fully understood yet since it also increases

the computational effort and the complexity of the tracker.

Another key aspect concerns the choice of the estimation technique used to evaluate the shape parameters.

This is a difficult problem since we still do not have good image models, valid for a wide

range of applications. The minimization of an energy function proposed by Kass, Witkin, and Terzopoulos

[42] has many drawbacks. As mentioned before, the deformable model often gets stuck

in local minima of the potential function which do not belong to the object boundary. To overcome

this difficulty several methods were proposed. A significant amount of work was done to improve the

potential functions, e.g., using edge points [16], color information [33], anisotropic diffusion [77], Markov

random fields [24], competitive learning [1] or multiple cues [50]. Although some improvements were

achieved, the main difficulties remained unsolved, except in the case of the Markov random fields which

are mainly used in quasi-homogeneous problems. Furthermore, most of the above techniques are time

consuming and inappropriate for real time applications.

In real time applications, feature based methods have been used instead. Rather than using dense

information extracted from the image (e.g., a potential function), shape estimation is based on a small

set of feature points, detected in the image. These algorithms are usually based on three steps [10]:

prediction, feature detection, and filtering. The first step computes a predicted shape estimate for the

next frame. The second step extracts a set of feature points (e.g., edges) from the observed image,

in the vicinity of the predicted shape. The third step updates the shape estimate using the detected

features. Shape prediction and filtering are often performed by Kalman filtering.
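The three steps can be sketched for a single scalar shape parameter. The random-walk dynamics, the noise variances `q` and `r`, and the nearest-measurement rule inside a validation gate are assumptions chosen to keep the example short; they are not the models used in the thesis.

```python
def kalman_step(x_est, p_est, measurements, q=0.01, r=0.1, gate=3.0):
    """One prediction / feature detection / filtering cycle for a scalar state.

    Assumed model (illustration only): random-walk dynamics x_t = x_{t-1} + w_t
    with Var(w_t) = q, and observation y_t = x_t + v_t with Var(v_t) = r.
    """
    # 1) Prediction: propagate the estimate and its variance.
    x_pred = x_est
    p_pred = p_est + q

    # 2) Feature detection: keep the measurement closest to the prediction,
    #    provided it lies inside a gate of `gate` standard deviations.
    s = p_pred + r                      # innovation variance
    candidates = [y for y in measurements
                  if (y - x_pred) ** 2 <= gate ** 2 * s]
    if not candidates:
        return x_pred, p_pred           # no feature found: keep the prediction
    y = min(candidates, key=lambda m: abs(m - x_pred))

    # 3) Filtering: standard Kalman update with the selected feature.
    k = p_pred / s                      # Kalman gain
    x_new = x_pred + k * (y - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new

x, p = 0.0, 1.0
x, p = kalman_step(x, p, measurements=[0.5, 40.0])  # 40.0 falls outside the gate
```

The gate rejects only gross outliers. Any measurement that falls inside it and happens to be closest to the prediction is used with full confidence, whether or not it belongs to the object.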

This approach is well suited for real time applications. However, the tracker performance is very

sensitive to the presence of invalid features (outliers) which are produced by inner edges,

by other objects present in the scene or by the background. The predictor also plays a key role in this

structure since it can significantly reduce the number of outliers by providing good estimates of the

object boundary in the next frame. Some authors have suggested the use of non-Gaussian models to

reduce the influence of the outliers [37]. Unfortunately the Kalman filter cannot be used in such cases

and non-linear filtering techniques (e.g., particle filters) have to be adopted to propagate the conditional

distribution of the state vector [37]. Although such techniques have been advocated, the computational

effort of non-linear filtering methods becomes prohibitive if a large number of parameters have

to be estimated.

1.3 Thesis Objectives

This thesis aims to improve the performance of known methods for object estimation and tracking

in complex scenes. Specifically, we wish to explicitly address the data segmentation problem, i.e., the

discrimination of valid/invalid data, in order to reduce the influence of false observations on the contour

estimates. To accomplish this, robust shape estimation methods based on three existing algorithms

are proposed. The algorithms considered in this thesis are the snake algorithm, the Kalman tracker,

and the Multi-Model tracker. The first method is a contour estimation algorithm for single images [42].

The last two are methods for object tracking in video sequences. The Kalman tracker is one of the most

popular trackers since it is simple and fast [11]. The Multi Model tracker is a recent method based on

a set of switched dynamical models, which is useful when a single dynamical model is not enough to

accurately describe the motion and deformation of the object in a video sequence [49].

To improve the robustness of shape estimation algorithms, segmentation mechanisms must be considered

as well as reliable image features. All the algorithms proposed in this thesis use middle level

features (herein denoted as strokes) instead of low level features (edge points) used in the original

methods. This is accomplished by organizing edge points into strokes using an edge linking operation.

Furthermore, strokes detected in the image are not all considered as valid. On the contrary, a confidence

degree (weight) is assigned to each stroke or to a set of strokes. Since the weights depend on

the distance from the strokes to the contour estimates, they are adaptively modified during the shape

estimation procedure. In the final stage, the observations with high weights have larger influence on the

shape estimates while the observations with small weights have a small influence. Weight assignment

is based on probabilistic models of the data.
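The weighting idea can be sketched as follows. The probabilistic model here is an assumption introduced for illustration and is not taken from the thesis: distances of valid strokes to the contour estimate are Gaussian with standard deviation `sigma`, outliers have a constant density `outlier_density`, and `p_valid` is the prior probability that a stroke is valid.

```python
import math

def confidence_weights(distances, sigma=2.0, p_valid=0.5, outlier_density=0.01):
    """Posterior probability that each stroke is valid, given its distance
    to the current contour estimate.

    Assumed model (not taken from the thesis): valid strokes have Gaussian
    distances with standard deviation sigma, outliers a constant density.
    """
    weights = []
    for d in distances:
        valid = (p_valid * math.exp(-0.5 * (d / sigma) ** 2)
                 / (sigma * math.sqrt(2 * math.pi)))
        invalid = (1.0 - p_valid) * outlier_density
        weights.append(valid / (valid + invalid))
    return weights

def weighted_estimate(values, weights):
    """Weighted mean: strokes with small weights barely move the estimate."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

distances = [0.5, 1.0, 25.0]            # the last stroke is far from the contour
w = confidence_weights(distances)
estimate = weighted_estimate([10.0, 10.0, 100.0], w)
```

Because the weights are recomputed from the current distances, they change as the contour estimate moves, which is what makes the scheme adaptive. In this toy example the distant stroke receives a weight close to zero, so the estimate stays near the value supported by the two nearby strokes.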

Experimental results are described to evaluate the proposed methods, showing that significant improvements

are achieved.

1.4 Thesis Organization

The thesis is organized as follows. Chapter 1 presents the motivation for this work, the thesis objectives,

organization and contributions. Chapter 2 addresses shape estimation with deformable models. It

describes two classical approaches to shape estimation: potential based methods and feature based

methods. The snake algorithm is described as an example of the first class of methods. The Kalman

tracker is an example of the second class. A description of these methods is provided, stressing the

difficulties caused by invalid features.

Chapter 3 describes a new shape estimation algorithm based on the snake method. This algorithm

uses a new type of potential denoted in this thesis as adaptive potential. The algorithm is based on

the minimization of an energy function with two terms: a regularization term and an image dependent

term. The first term is similar to the one adopted in [44]. The second term is new and it depends

on the strokes extracted from the data. The proposed algorithm relies on a set of weights assigned

to the strokes detected in the image. Weight assignment is obtained using a probabilistic model for

the observed data. We use the EM (Expectation Maximization) algorithm to perform this task. This

chapter ends with a set of experimental tests comparing the proposed algorithm with snakes. It is shown

that significant improvements are achieved especially in the presence of cluttered (non-homogeneous)

backgrounds.

Chapter 4 presents a robust version of the Kalman tracker denoted as S-PDAF (Shape Probabilistic

Data Association Filter). The S-PDAF tracker is based on two key concepts. First it uses middle level

features (strokes) instead of edge points. Middle level features are more robust and much fewer than

edge points. Second, it explicitly assumes that some features are outliers. Since we do not know which

features are valid and which are invalid, a confidence degree is assigned to each data interpretation,

i.e., to each tentative set of valid features. Update equations for the mean and covariance of the state

estimate are provided. Experimental tests are conducted to illustrate the performance of the S-PDAF

algorithm in the presence of outliers. Good results are achieved in problems in which the Kalman filter

fails.

Chapter 5 addresses the use of multiple models for object tracking, inspired by [49]. The motivation

for using multiple models is simple. Sometimes the object undergoes complex motion or shape changes

which cannot be well represented by a single model. One example concerns the recognition of sign

language. When we try to track the boundary of the hands, we easily conclude that this is a difficult task

which cannot be accomplished by a single shape model. In this case it would be better if we could use

several models. This problem can be addressed by using switched dynamic systems, studied in control

theory. This chapter considers the use of multiple models with a switching mechanism incorporated in

a tracking framework. The multiple model tracker is based on a bank of S-PDAF filters (proposed in

Chapter 4 of this thesis) organized in a tree structure. This algorithm, denoted as RMMT (Robust

Multiple Model Tracker), chooses which model is active at each instant of time and updates the state

estimate. For this purpose, we define a hybrid state (state vector and model label) characterized by a

probability distribution. The RMMT propagates the probability distribution using S-PDAF filters for

each model. To illustrate the performance of the proposed method, we have performed a comparison

between the RMMT and the MMT proposed in [49]. These experiments show that the MMT has a

poor performance in the presence of outliers. This problem is solved by the RMMT, which copes with

multiple dynamics and outliers.
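The notion of a hybrid state (state vector and model label) can be illustrated by simulating a switched system. The sketch below is only a generative toy, not the RMMT: the scalar noise-free models, the transition probabilities, and the names `simulate_switched_model`, `models`, and `transition` are all assumptions made for the example.

```python
import random

def simulate_switched_model(steps, models, transition, x0=0.0, seed=0):
    """Simulate a scalar state driven by several linear models, with the
    active model chosen by a Markov chain.

    models: list of (a, b) pairs defining x_t = a * x_{t-1} + b (noise-free
            here to keep the sketch short).
    transition[i][j]: probability of switching from model i to model j.
    """
    rng = random.Random(seed)
    label, x = 0, x0
    labels, states = [], []
    for _ in range(steps):
        # Draw the next model label from row `label` of the transition matrix.
        label = rng.choices(range(len(models)), weights=transition[label])[0]
        a, b = models[label]
        x = a * x + b
        labels.append(label)
        states.append(x)
    return labels, states

models = [(1.0, 0.0), (1.0, 1.0)]       # hypothetical "rest" and "drift" models
transition = [[0.9, 0.1], [0.2, 0.8]]
labels, states = simulate_switched_model(50, models, transition)
```

A tracker for such a system has to estimate both sequences, the labels and the states, from the image data. Doing this probabilistically means carrying one hypothesis per model at each step.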

Chapter 6 concludes the thesis and presents future directions of work.

1.5 Contributions

The thesis proposes three new algorithms for the estimation of object boundaries (static and dynamic)

in complex scenes. Since image analysis techniques produce invalid features, robust estimation methods

are required. The proposed algorithms use middle level features (strokes) and assign a confidence degree

to each feature or set of features. Confidence degrees are explicitly considered in the estimation of the

object boundary.

The main contributions of this thesis are the following:

• a shape estimation algorithm using an adaptive potential, denoted as adaptive snakes, which

allows a robust estimation of the object contour in the presence of invalid data.

• a robust tracking algorithm denoted as S-PDAF which improves the Kalman tracker by allowing

the estimation of moving objects in cluttered scenes, using a single dynamical model.

• a robust version of the MM tracker recently proposed in [49], which is denoted as RMMT. In the

proposed method, S-PDAF trackers are used to propagate the different mixture components.

• the emphasis on middle level features as well as the use of confidence degrees as a way to improve

the performance of the classical methods.

Parts of the work presented were published in [56, 57, 58, 59, 60, 61, 62].

Chapter 2

Shape Tracking with Deformable

Models

2.1 Introduction

This chapter briefly describes a class of tracking algorithms denoted as deformable models which was

developed in the late 80s and 90s. Two types of methods are considered: potential based methods and

feature based methods. We will briefly summarize the snake algorithm which belongs to the first class

of methods. The performance of this algorithm in object boundary estimation is characterized and

its limitations are identified. The snake algorithm uses a general purpose shape model (differentiable

curve). Some of these difficulties can be alleviated by using a shape model adapted to the type of

objects to be estimated. A brief overview of feature based deformable models is also provided. These

methods have been extensively used in shape tracking using Kalman filtering. The drawbacks of the

Kalman based trackers are pointed out, especially in the presence of complex background images and

multiple objects.

This chapter is organized as follows: Section 2.2 describes the snake method and illustrates its performance in synthetic examples. Section 2.3 describes several shape models which are used throughout

this work. Section 2.4 describes a motion model. Section 2.5 describes feature based algorithms, using

Kalman filtering and illustrates their difficulties. Section 2.6 concludes the chapter.


2.2 Algorithms based on a Potential

2.2.1 The snake model

Snakes were originally proposed by Kass, Witkin, and Terzopoulos in 1987 [42]. They are among the

first deformable models proposed in the literature and are widely known. A snake is a deformable curve

defined in ℜ2, described by a function v : [0, 1] → ℜ2. The curve is initialized close to the object and it

is allowed to deform until it converges to the object boundary. An energy functional with two terms is

associated to each curve (snake) configuration: an internal energy and an external energy. The internal

energy is given by

\[ E_{int}(v) = \int_s \alpha \, \|v_s\|^2 + \beta \, \|v_{ss}\|^2 \, ds, \tag{2.1} \]

where vs and vss are the first and the second order derivatives of v with respect to the independent

variable s and α, β are weights defining the elastic properties of the snake.

The external energy is given by

\[ E_{ext}(v, I) = \int_s P(v(s) \mid I) \, ds, \tag{2.2} \]

where¹ P : ℜ² → ℜ is an image dependent function, denoted as potential function. The curve estimation

is performed by the minimization of the energy

E(v) = Eint(v) + Eext(v | I). (2.3)

Therefore, the contour estimate v⋆ is given by

\[ v^\star = \arg\min_v \, \{ E_{int}(v) + E_{ext}(v \mid I) \}. \tag{2.4} \]

This optimization problem is solved by using numeric methods. A numeric solution for this problem is obtained by sampling the continuous curve v at a finite number of points vi = v(si) ∈ ℜ², with si = i∆s.

A discrete model is then obtained

v = [v1, . . . , vk, . . . , vN ]T . (2.5)

¹ From this point on we denote P (v(s) | I) as P (v(s)) for the sake of simplicity.

Each point vi is denoted as a snake point, snake unit or model unit. The snake energy (2.1-2.3)

is approximated by a discrete energy function obtained by replacing the integral by a sum and the

derivatives by first and second order differences [42]

\[ E(v) = \sum_{i=1}^{M} E_{int}(v_i) + P(v_i), \tag{2.6} \]

where

\[ E_{int}(v_i) = \alpha \, \|v_{i+1} - v_{i-1}\|^2 + \beta \, \|v_{i+1} - 2 v_i + v_{i-1}\|^2. \tag{2.7} \]

The minimization of E is achieved by the gradient algorithm which leads to the recursive equation

\[ v_i^{k+1} = v_i^{k} + \gamma \left( F_{int}(v_i) + F_{ext}(v_i) \right), \tag{2.8} \]

where Fint(vi), Fext(vi) are denoted as the internal and external forces. These forces are defined as [42]

\[ F_{int}(v_i) = -\frac{dE_{int}(v_i)}{dv_i}, \qquad F_{ext}(v_i) = -\frac{dE_{ext}(v_i)}{dv_i}, \]

leading to

Fint(vi) = (2α + 6β) vi−2 − 4β vi−1 + (−α + β) vi + (−α + β) vi+1 − 4β vi+2,

Fext(vi) = −∇P (vi). (2.9)

Equation (2.8) is recursively applied until convergence is achieved. The algorithm converges when

an equilibrium between Fint and Fext is reached. Equation (2.4) leads to a trade off between smoothness

(measured by Eint), and the ability to approximate the valleys of the potential function by the model.
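The iteration (2.8) can be illustrated with a minimal sketch. It assumes a closed contour (indices wrap around) and estimates the gradient of the discrete energy (2.6)-(2.7) by central differences, so that no closed-form expression for the forces has to be assumed. The function names, the step size h, and the test potential `ring_potential` (a valley on the unit circle) are hypothetical choices made for illustration only.

```python
import math

def internal_energy(v, alpha, beta):
    """Discrete internal energy of a closed contour, summing (2.7) over all points."""
    n = len(v)
    total = 0.0
    for i in range(n):
        xp, yp = v[i - 1]            # v_{i-1} (index wraps: closed contour)
        xc, yc = v[i]                # v_i
        xn, yn = v[(i + 1) % n]      # v_{i+1}
        first = (xn - xp) ** 2 + (yn - yp) ** 2
        second = (xn - 2 * xc + xp) ** 2 + (yn - 2 * yc + yp) ** 2
        total += alpha * first + beta * second
    return total

def total_energy(v, potential, alpha, beta):
    """Discrete energy (2.6): internal energy plus the potential at each snake point."""
    return internal_energy(v, alpha, beta) + sum(potential(p) for p in v)

def snake_step(v, potential, alpha, beta, gamma, h=1e-5):
    """One gradient-descent update of all points; gradient by central differences."""
    updated = []
    for i, (x, y) in enumerate(v):
        grad = []
        for dx, dy in ((h, 0.0), (0.0, h)):
            plus = v[:i] + [(x + dx, y + dy)] + v[i + 1:]
            minus = v[:i] + [(x - dx, y - dy)] + v[i + 1:]
            diff = (total_energy(plus, potential, alpha, beta)
                    - total_energy(minus, potential, alpha, beta))
            grad.append(diff / (2 * h))
        # Moving against the gradient corresponds to adding the total force in (2.8).
        updated.append((x - gamma * grad[0], y - gamma * grad[1]))
    return updated

def ring_potential(p):
    """Hypothetical potential whose valley is the unit circle."""
    return (math.hypot(p[0], p[1]) - 1.0) ** 2

# Initialize the snake on a circle of radius 2 and let it deform.
contour = [(2 * math.cos(2 * math.pi * k / 12), 2 * math.sin(2 * math.pi * k / 12))
           for k in range(12)]
e0 = total_energy(contour, ring_potential, 0.01, 0.01)
for _ in range(50):
    contour = snake_step(contour, ring_potential, 0.01, 0.01, 0.05)
e1 = total_energy(contour, ring_potential, 0.01, 0.01)
mean_radius = sum(math.hypot(x, y) for x, y in contour) / len(contour)
```

In this toy setting the contour settles slightly inside the valley, because the internal energy pulls the points together; this is the trade off between smoothness and data fit mentioned above.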

The image potential P used in (2.2) can be defined in several ways. The goal is to associate the

valleys of the image potential to the desired features (e.g., edge points), in order to attract the model

towards such features. A typical choice (Cohen potential [16]) is given by the sum of Gaussian functions

centered at the edge points:

\[ P(x_i) = -\sum_{y_k \in y} N(x_i; y_k, \sigma^2 I), \tag{2.10} \]

where xi ∈ ℜ2 is a 2D vector containing the coordinates of an image point, y is the set of detected

features, typically edge points, and N (x; µ, R) denotes the normal density function with mean µ and

covariance matrix R, computed at x.

Cohen potential is computed in two steps:


• detect the set of feature points in the image y = {y1, y2, . . . , yN}.

• convolve an image formed with Dirac impulses centered at the feature points with a Gaussian-shaped filter with impulse response −N (xi; 0, R), where R = σ²I to match (2.10).
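The Cohen potential (2.10) and the corresponding external force −∇P in (2.9) can be sketched directly from their definitions. The sketch below evaluates the sum of Gaussians at a single point instead of convolving a whole image; the function names and the toy feature set are assumptions made for illustration.

```python
import math

def gaussian_2d(x, mean, sigma):
    """Isotropic 2D normal density N(x; mean, sigma^2 I)."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)

def cohen_potential(x, features, sigma):
    """P(x) = - sum_k N(x; y_k, sigma^2 I), as in (2.10)."""
    return -sum(gaussian_2d(x, yk, sigma) for yk in features)

def external_force(x, features, sigma):
    """F_ext(x) = -grad P(x); each feature pulls x towards itself."""
    fx = fy = 0.0
    for yk in features:
        g = gaussian_2d(x, yk, sigma)
        fx += g * (yk[0] - x[0]) / sigma ** 2
        fy += g * (yk[1] - x[1]) / sigma ** 2
    return fx, fy

# Two edge points on the x axis (hypothetical data).
features = [(0.0, 0.0), (1.0, 0.0)]
p_on_feature = cohen_potential((0.0, 0.0), features, 0.5)
p_far_away = cohen_potential((5.0, 5.0), features, 0.5)
force = external_force((-1.0, 0.0), features, 0.5)
```

The potential is lowest near the features, and the force at a point to the left of both features points to the right, towards the valley.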

Despite the popularity of the snake algorithm it has many drawbacks. To illustrate the performance

of the snake algorithm, let us consider two synthetic examples shown in Fig. 2.1. The snake curve is

approximated by a discrete set of points and Cohen potential is used in these examples.

The first example shows the convergence of the snake algorithm assuming that all edge points are

correctly detected and no other features are present in the image. Although the model is attracted by

the object boundary it is not able to represent corners or high-curvature regions of the object boundary.

The second example displays the convergence of the snake when there is invalid data (outliers) present

in the image which does not belong to the object boundary. The outliers attract the elastic model

towards wrong valleys of the potential function leading to meaningless shape estimates. The snake

algorithm is, therefore, strongly dependent on the initialization and on the presence of invalid data.

Many works (e.g., [17, 18, 44, 51, 77, 80]) have been published in an attempt to overcome these

difficulties.

2.3 Shape Models

The representation of the object boundary using point models is simple. However, this approach usually

requires a large number of units to accurately represent the object boundary. This implies the estimation

of a large number of coordinates and, therefore, the estimation procedure becomes very sensitive to the

presence of noise, since the shape model does not convey any information about the shape of the object

to be estimated.

To improve the performance of the snake method, alternative models have been considered to represent the object boundary. Several methods were proposed to represent the object boundary with fewer parameters, e.g., using deformable templates [20, 78], Fourier descriptors [31, 70], wavelets [15] or using a Sinc-type basis [25]. Spline curve models are often used to represent the object boundary, since they allow complex shapes to be approximated using a small set of basis functions [5, 11, 53]. Spline curves are parameterized by a number of control points, which is usually much smaller than the number of points of a point model. For example, in B-spline models we can drastically reduce the number of points in the curve, getting the same accuracy as we would obtain using point models. The B-spline model is adopted throughout this work, although other alternatives could be considered instead.

Figure 2.1: Results obtained with Snake algorithm: initialization (left column) and shape estimates obtained at iteration 3, 30 (first row) and iteration 31, 40 (second row).

To represent the object boundary, it is assumed that the object shape is a transformed version of a

(known) reference shape plus an additional deformation, each of them being described by a B-spline.

Let v(s) : I → ℜ2 be a parametric representation of the object boundary, with I ⊂ ℜ and vr(s) : I → ℜ2

a reference shape. It is assumed that²

\[ v(s) = G(v^r(s)) + v^d(s), \tag{2.11} \]

where v(s), G(vr(s)), and vd(s) represent the observed shape, the transformed version of the reference shape and the shape deformation, respectively.

² Other works assume that v(s) = Gt(vr(s) + vd(s)) [20, 8]. Both approaches have advantages and disadvantages.

The shape deformation vd(s) is modeled by

\[ v^d(s) = \sum_{n=1}^{N_c} \theta_n \phi_n(s), \tag{2.12} \]

where φn(s) : [0, 1] → ℜ, n = 1, . . . , Nc is a set of B-spline basis functions and θn ∈ ℜ2 is the nth

control point of the respective basis.
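The expansion (2.12) can be illustrated with a short sketch. The particular basis is not specified at this point of the text, so the code assumes a closed curve with a uniform periodic cubic B-spline basis and at least four control points; indices are 0-based in the code, and the function names and control points are hypothetical.

```python
def cubic_bspline(t):
    """Uniform cubic B-spline kernel centered at 0 with support (-2, 2)."""
    a = abs(t)
    if a < 1.0:
        return (4.0 - 6.0 * a * a + 3.0 * a ** 3) / 6.0
    if a < 2.0:
        return (2.0 - a) ** 3 / 6.0
    return 0.0

def basis(n, s, num_ctrl):
    """Periodic basis function phi_n(s) for s in [0, 1), n = 0..num_ctrl-1."""
    d = (num_ctrl * s - n) % num_ctrl     # offset wrapped into [0, num_ctrl)
    if d >= num_ctrl / 2.0:
        d -= num_ctrl                     # signed offset in [-num_ctrl/2, num_ctrl/2)
    return cubic_bspline(d)

def curve_point(s, ctrl):
    """Evaluate the sum over n of theta_n * phi_n(s), as in (2.12)."""
    nc = len(ctrl)
    x = sum(ctrl[n][0] * basis(n, s, nc) for n in range(nc))
    y = sum(ctrl[n][1] * basis(n, s, nc) for n in range(nc))
    return x, y

# A closed curve defined by six control points, sampled at N = 20 points.
control_points = [(2.0, 0.0), (1.0, 1.7), (-1.0, 1.7), (-2.0, 0.0), (-1.0, -1.7), (1.0, -1.7)]
samples = [curve_point(k / 20.0, control_points) for k in range(20)]
```

Only six control points (twelve coefficients) are needed here, whereas a point model with the same sampling would require forty coordinates.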

Sampling the curves defined in (2.11) at N points and storing their coordinates in three column

vectors

\[ \begin{aligned} v &= [v_1(s_1), \ldots, v_1(s_N), v_2(s_1), \ldots, v_2(s_N)]^T, \\ v^r &= [v^r_1(s_1), \ldots, v^r_1(s_N), v^r_2(s_1), \ldots, v^r_2(s_N)]^T, \\ v^d &= [v^d_1(s_1), \ldots, v^d_1(s_N), v^d_2(s_1), \ldots, v^d_2(s_N)]^T, \end{aligned} \tag{2.13} \]

equation (2.11) can be written as follows

v = G(vr) + vd, (2.14)

where v, G(vr), and vd, are 2N × 1 vectors. In this work several transforms G are considered, e.g.,

translations, Euclidean similarities and affine maps. The number of unknown coefficients, D, is D =

2, D = 4, D = 6, respectively [10] and the unknown coefficients will be denoted by θg.

We can define a vector containing the unknown parameters

\[ x = \begin{bmatrix} \theta_g \\ \theta_d \end{bmatrix}, \tag{2.15} \]

where x is a (D + 2Nc) × 1 vector and θ_d = [θ^d_{11}, . . . , θ^d_{1Nc}, θ^d_{21}, . . . , θ^d_{2Nc}]^T. Vector x contains two

sets of parameters: the first D parameters define the global transformation and they are called global

parameters; the remaining 2Nc coefficients define the shape deformation and they are denoted local or

deformation parameters, i.e.,

\[ x = [x_1, \ldots, x_D, \theta^d_{11}, \ldots, \theta^d_{1N_c}, \theta^d_{21}, \ldots, \theta^d_{2N_c}]^T, \tag{2.16} \]

where x1, . . . , xD are the global parameters and θ^d_{in} are the local parameters.

If G is one of the previously mentioned transforms, equation (2.14) can be rewritten as


v = C x + e, (2.17)

where C is a 2N × (D + 2Nc) matrix and e is a 2N × 1 vector which depends on the reference shape.

Details about C and e can be found in Appendix A, for different geometric transformations.

2.4 Motion Model

In tracking applications, we wish to track moving objects with deformable shape. In this case, dynamic

equations must be used to represent the evolution of the model parameters. It is often assumed that

the vector of the model parameters, or an extended version of it, is the output of a stochastic difference

equation

x(t + 1) = Ax(t) + w(t), (2.18)

where w(t) ∼ N (0, Q) is white Gaussian noise and A is a square matrix.

Measurements made from images are noisy. Therefore, a sensor model

y(t) = Cx(t) + e + η(t), (2.19)

is used where

y(t) = [y1t(s1), . . . , y1t(sN ), y2t(s1), . . . , y2t(sN )]T , (2.20)

is the observation vector containing noisy samples of the curve v(s), η(t) is the measurement noise

and C, e depend on the geometric transformation adopted in the shape model, as before. Equations (2.18, 2.19) define a stochastic state model.

2.5 Feature Based Methods

The first attempts to use active contours in object tracking were based on snakes [42]. The idea was to

estimate the object boundary in each frame using the best contour estimate obtained in the previous

frame. This idea fails in practice. First, the snake suffers from the aperture problem: only the normal

displacement of the model points can be estimated. Since no motion model is used in [42], forcing a

coherent evolution of the model units, it is not possible to solve this ambiguity. Second, no a priori


knowledge of the object shape is used. Third, the model is slow and computationally heavy, preventing

its application in real time.

To overcome these difficulties several researchers proposed the use of visual features detected in the

image, e.g., boundary points for tracking purposes [2, 11, 71] (see Appendix B). Typically, the object

shape and position are estimated to fit the observed features. The feature based algorithms are usually

based on three steps [10]: contour prediction, image measurement and contour update. The first step

predicts the object position and shape in the next frame, using current and past information. The

second step computes image features in the vicinity of the predicted contour (see Fig. 2.2). In this

case the contour is sampled at equally spaced points and a set of features is detected at each sample

by directional search methods, i.e., by applying an edge detection algorithm to the image profiles along

directions orthogonal to the object boundary (see Appendix B for the details). More than one feature

can be detected in the vicinity of each contour sample.

Figure 2.2: Feature detection: • - edge points; ◦ - samples of the predicted contour.

Tracking algorithms using this approach can be found in [11, 22, 72].

These features have also been successfully used in other computer vision problems, e.g., in contour

estimation problems [25, 30, 32, 80] where the contour position depends on all the image data and in

image segmentation algorithms based on front propagation [47].

Kalman Tracker

As discussed above, the use of feature based algorithms is a popular approach for object tracking. Now

we shall see how the techniques of linear estimation can be applied in this context.

Let x(t) be a random vector characterizing the contour at the time instant t. Assuming that we

have observed a set of image features Y t = {y(1), . . . , y(n)} obtained at different time instants, the

maximum a posteriori (MAP) estimate of x(t) is given by

\[ \hat{x}(t) = \arg\max_{x(t)} \, p\left( x(t) \mid Y^t \right). \tag{2.21} \]

Given the motion dynamic model (2.18) and an observation model (2.19), the estimation of the state

vector in (2.21) can be recursively propagated in two steps. The first step (prediction) computes the predicted density

\[ p\left( x(t) \mid Y^{t-1} \right) = \int p\left( x(t) \mid x(t-1) \right) \, p\left( x(t-1) \mid Y^{t-1} \right) \, dx(t-1). \tag{2.22} \]

Since p(x(t) | x(t − 1)) = N(x(t); Ax(t − 1), Q) and p(x(t − 1) | Y^{t−1}) = N(x(t − 1); x̂(t − 1), P(t − 1)), the convolution integral (2.22) leads to a normal density N(x(t); x̂(t | t − 1), P(t | t − 1)) with the mean and covariance defined by

\[ \hat{x}(t \mid t-1) = A \hat{x}(t-1), \tag{2.23} \]

and

\[ P(t \mid t-1) = A P(t-1) A^T + Q. \tag{2.24} \]

Equations (2.23,2.24) define the prediction step of the Kalman filter.

The second step (filtering) computes the posterior density

\[ p\left( x(t) \mid Y^t \right) = \alpha \, p\left( y(t) \mid x(t) \right) \, p\left( x(t) \mid Y^{t-1} \right), \tag{2.25} \]

where α is a normalization constant. Since p(y(t) | x(t)) = N(y(t); Cx(t) + e, R) and p(x(t) | Y^{t−1}) = N(x(t); x̂(t | t − 1), P(t | t − 1)), the probability density function (2.25) is N(x(t); x̂(t), P(t)), with [41]

\[ \hat{x}(t) = \hat{x}(t \mid t-1) + K(t) \left( y(t) - C \hat{x}(t \mid t-1) - e \right), \tag{2.26} \]

\[ P(t) = \left( I - K(t) C \right) P(t \mid t-1), \tag{2.27} \]

where the matrix K(t) is the Kalman gain computed by

\[ K(t) = P(t \mid t-1) C^T \left( C P(t \mid t-1) C^T + R \right)^{-1}. \tag{2.28} \]

The equations (2.26-2.28) define the filtering step of the Kalman filter [41].


To illustrate the performance of the Kalman filter in a tracking problem, let us consider the estimation of a point target in the plane. Suppose that the state vector is x(t) = [x1 x2 ẋ1 ẋ2]^T, where xi are the coordinates of the moving target and ẋi are the velocity components. The point moves according to the dynamic matrix

\[ A = \begin{bmatrix} 1 & 0 & \delta t & 0 \\ 0 & 1 & 0 & \delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \tag{2.29} \]

It is assumed that w(t) ∼ N (0, Q), Q = kQ IQ, where δt, kQ are constants and IQ is the 4 × 4 identity matrix. The target measurements are obtained with the observation matrix

\[ C = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}, \tag{2.30} \]

and η(t) ∼ N (0, R), where R = kR IR, kR being a constant and IR the 2 × 2 identity matrix.
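The prediction and filtering steps (2.23)-(2.28) for this point target can be sketched as follows. The sketch uses plain Python lists as matrices, takes e = 0, and feeds the filter noise-free observations of a target moving with velocity (1, 2) so that the behaviour is deterministic; the constants (δt = 1, kQ = 0.01, kR = 1, initial covariance 10 I) and all helper names are assumptions chosen for illustration, not the values used in the experiments.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scaled_identity(n, k):
    return [[k if i == j else 0.0 for j in range(n)] for i in range(n)]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def predict(x, P, A, Q):
    """Prediction step, equations (2.23) and (2.24)."""
    return matmul(A, x), add(matmul(matmul(A, P), transpose(A)), Q)

def update(x_pred, P_pred, y, C, R, e):
    """Filtering step, equations (2.26)-(2.28); the innovation covariance is 2 x 2 here."""
    S = add(matmul(matmul(C, P_pred), transpose(C)), R)
    K = matmul(matmul(P_pred, transpose(C)), inverse_2x2(S))     # Kalman gain (2.28)
    innovation = sub(sub(y, matmul(C, x_pred)), e)
    x = add(x_pred, matmul(K, innovation))                        # (2.26)
    P = matmul(sub(scaled_identity(len(x), 1.0), matmul(K, C)), P_pred)  # (2.27)
    return x, P

dt = 1.0
A = [[1.0, 0.0, dt, 0.0], [0.0, 1.0, 0.0, dt], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
C = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
Q = scaled_identity(4, 0.01)
R = scaled_identity(2, 1.0)
e = [[0.0], [0.0]]

x = [[0.0], [0.0], [0.0], [0.0]]      # initial estimate (column vector)
P = scaled_identity(4, 10.0)          # large initial uncertainty
for t in range(1, 21):
    y = [[1.0 * t], [2.0 * t]]        # noise-free position of the target at time t
    x, P = predict(x, P, A, Q)
    x, P = update(x, P, y, C, R, e)
```

With valid observations only, the estimate converges to the true position and velocity. Every observation enters the update through the same gain, which is why a single outlier in y can pull the estimate away from the true trajectory.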

In Fig. 2.3, the dotted line represents the true position of the state vector, the solid line is the

output of the Kalman filter. Two situations are depicted: in Fig. 2.3 (a) all the observations are

generated by (2.19); in Fig. 2.3 (b) some observations are outliers, which correspond to sensor failures or

misdetections. The outliers were randomly generated with uniform distribution. This figure illustrates

that the Kalman tracker is strongly influenced by the presence of outliers, pointing out the degradation

of the filter output in the presence of the false alarms.

This problem also exists in shape tracking. Image analysis methods often produce invalid features

which do not correspond to boundary points. These features are associated to intensity transitions of

the background image or to inner edges of the object to be tracked. Fig. 2.4 shows one such example.

The first image illustrates an ideal situation where all the features are located on the boundary of the

object to be tracked (lips). The second image shows a more difficult case in which wrong features are

detected inside and outside of the lips boundary.

In practice, it is not easy to distinguish wrong features from the good ones since this information is

not given to the tracker. Therefore, the Kalman tracker considers all observations as valid, providing

erroneous shape estimates. If we apply the Kalman filter to visual features such as the ones illustrated in

Fig. 2.4 (b), we may say in advance that the object contour will be lost after a few frames.

The performance of the Kalman tracker degrades if the number of outliers increases. This is a key

difficulty which has prevented a widespread use of Kalman filtering in complex tracking problems. To circumvent this difficulty, robust tracking methods are needed. This is the main problem addressed in Chapter 4.

Figure 2.3: Tracking of moving target with Kalman filter: using (a) correct measurements, and (b) in the presence of false alarms.

Figure 2.4: Feature detection: (a) Correct and (b) with outliers.


2.6 Conclusions

This chapter presented two classes of deformable contours: potential based and feature based models.

These are two of the most popular techniques for the estimation of the object boundary. However,

there are other available methods, for instance, based on front propagation [47], or based on statistical

models of the contour and image using Markov random fields [24, 32].

The difficulties of potential based and feature based deformable models were identified. Both methods perform poorly in the presence of outliers, i.e., in the presence of image features or valleys

which do not belong to the object boundary. Unfortunately, all feature detection algorithms produce

such artifacts since they are not able to segment the data, i.e., to know in advance which features

belong to the object boundary and which do not. It has been experimentally illustrated that the classic

methods (snake and Kalman tracker) cannot cope with this situation, since both methods

consider all the observations as valid, i.e., all the observations are considered as belonging to the object

boundary. As a direct consequence, they both try to represent all data detected in the image, leading to meaningless results. In noisy and complex environments, we may foresee that similar or worse

performance is expected. The main question that we must address at this stage is the following: given

a set of observations corrupted by outliers, how can a good shape estimation and tracking performance

be achieved? To answer this question, it is crucial to use robust estimation techniques. They should

be able to discriminate invalid observations from valid ones, providing a different treatment for each of

them.

The next two chapters answer the question formulated above, by proposing two new algorithms which overcome these weaknesses. Chapter 3 proposes a new robust algorithm which extends

the snake method by making it robust in the presence of outliers. Chapter 4 presents a new robust

tracking algorithm which overcomes the limitations of the Kalman tracker. These ideas can be further

extended for a class of algorithms which uses a bank of stochastic dynamic models to track the object

boundary. This is addressed in Chapter 5.

Chapter 3

Adaptive Snakes

3.1 Introduction

The performance of active contours in shape estimation is highly dependent on the image data. In

cluttered environments the performance is strongly hampered, leading to inaccurate estimation of the

object boundary, as seen in Chapter 2. The main reason is that feature detection often produces

outliers, i.e., features which do not belong to the object boundary. These outliers can be caused

by a cluttered background, nearby objects, or inner edges. This chapter addresses the problem of

shape estimation in the presence of outliers and proposes a new method based on the concepts of

strokes and confidence degrees. Strokes are middle level features used instead of low level ones (edge

points). Middle level features are more informative and describe the outer boundary of the object better.

Confidence degrees are probabilistically assigned to strokes. All strokes detected in the image contribute

to the potential but with different weights. Weight assignment is obtained using a probabilistic model

for the observed data and the Expectation Maximization (EM) algorithm. This chapter proposes a

new potential function denoted as adaptive potential. It is shown that improved performance is

achieved by using the adaptive potential.

This chapter is organized in the following way. Section 3.2 describes research work related to this

topic. Section 3.3 formulates the shape estimation problem in a probabilistic framework. Section 3.4

proposes an adaptive potential function, based on the EM algorithm. Section 3.5 addresses the contour

estimation by the minimization of an adaptive energy. Section 3.6 evaluates the proposed algorithm

with synthetic and real images. Section 3.7 concludes the chapter. The work presented in this chapter

was previously published in [58].


3.2 Related work

Active contours estimate the object boundary using a deformable curve. During the estimation process

the model points move under the influence of image forces and internal forces. Image forces should

be defined such that the curve (usually far from the object boundary) is attracted towards the desired

features. The design of image forces has been thoroughly investigated (e.g., see [16, 18, 20, 77]).

However, the main difficulty concerns the presence of invalid features (outliers) which are not located

at the object boundary and attract the elastic model towards wrong shape configurations.

Several strategies have been proposed to improve the performance of active contours, e.g., the

gradient vector flow has been used to define a new external force applied to the snake [77], competitive

learning, directional attraction regions [1, 4], which allow the progress of the model towards concavities,

multi-resolution methods [43], data fusion [50], inflation forces [16], distance potentials [18] which reduce

the influence of initial conditions, the validation gate to reduce the search region [10], non-linear filtering

techniques with non-Gaussian distributions [38], the use of geometric and dynamic constraints to reduce

shape and motion variability [11, 20], and robust estimation techniques which are able to reduce the

influence of outliers on the final shape estimates [56].

A different approach is proposed in this chapter. The boundary estimate is obtained by the mini-

mization of a potential function as in snakes. A potential with hidden variables is proposed, where the

missing data represents the confidence degree of the image data. The object boundary is estimated by

using the EM algorithm. The proposed method is, therefore, denoted as adaptive snakes.

Adaptive snakes are based on the use of strokes which are more informative and reliable than edge

points. The use of strokes has been recently proposed by several authors [34, 40, 56, 79]. In this

chapter, each stroke is considered as valid if it belongs to the object boundary or invalid otherwise.

This information is not known in advance. Thus, a confidence degree (weight) is assigned to each

stroke. The weights depend on the distance from the strokes to the object and the stroke lengths.

The EM algorithm is a method proposed in the 70s to compute the maximum likelihood estimate of

parameters with missing data [26]. It has been widely used in several contexts, e.g., in object tracking

[61], in neural networks [9], in the estimation of Hidden Markov Models [35], in system identification

[64].

It is shown below that the EM technique is well suited for robust shape estimation, assuming that the missing data are the stroke labels (valid/invalid), which are unknown.

3.3 Classical Approach

This section addresses shape estimation assuming that the feature labels are known, i.e., we know

which features are valid and which are outliers. For the sake of simplicity, no regularization forces are

considered in this section. Let y be the set of all features detected in an image organized in strokes

y = {y1, . . . , yM}, and yj being the set of observations (edge points) belonging to the jth stroke. Let

v be a contour model defined by a sequence of 2D points vi, i = 1, . . . , N . The goal is to approximate

the data contained in y by the contour model v. To accomplish this, we shall consider the potential

function

\[ P(v_i; y, k) = -\sum_{j, n} \Phi(v_i; y^j_n, k_j), \tag{3.1} \]

where vi is the ith model unit, y^j_n is the nth observation of the jth stroke, k = {k1, . . . , kM} is a set of stroke labels (kj = 1 if the jth stroke is valid, kj = 0 otherwise), and Φ measures the influence of y^j_n on the model unit vi. The contribution of each feature y^j_n to the potential is defined by

\[ \Phi(v_i; y^j_n, k_j) = \begin{cases} N(y^j_n; v_i, \sigma^2 I), & k_j = 1 \\ L, & k_j = 0, \end{cases} \tag{3.2} \]

where N(y^j_n; vi, σ²I) is a Gaussian kernel and L is a constant.

If y, k were known, the contour model would be obtained by minimizing the contour energy

\[ \hat{v} = \arg\min_v \sum_i P(v_i; y, k). \tag{3.3} \]

Equation (3.3) is equivalent to the snake algorithm with the Cohen potential [18] provided that we

assume that all the data is valid, i.e., kj = 1, ∀j.

The problem may be addressed in a probabilistic framework, by assuming that y and k are random

variables with known probability density function¹

\[ p(y, k \mid v) = \alpha \, e^{-\sum_i P(v_i; y, k)}. \tag{3.4} \]

¹ The distribution can be normalized if the strokes belong to a finite interval of ℜ².

The log likelihood function is

\[ l(v; y, k) = \log p(y, k \mid v) = c - \sum_i P(v_i; y, k), \tag{3.5} \]

and the maximization of the log likelihood function leads to the same optimization problem defined in

(3.3).

In practice we do not know which features are valid and which are outliers. The labels kj are,

therefore, unknown. This problem is addressed in the next section.

3.4 Adaptive Potential

Since the stroke labels are unknown in practice, the object contour should be estimated by maximizing

the likelihood of the observed data

\[ \log p(y \mid v) = \log \sum_k p(y, k \mid v). \tag{3.6} \]

This is, however, a difficult problem since it is not possible to obtain a closed form expression for

log p(y | v) nor to optimize it analytically. One way to circumvent this difficulty is by using the EM

algorithm [26] which optimizes the ML criterion by using an auxiliary function

\[ U(v, \hat{v}) \triangleq E_k \{ \log p(y, k \mid v) \mid y, \hat{v} \}, \tag{3.7} \]

where v̂ is an estimate of the unknown parameter. Using (3.4), U can be rewritten as follows

\[ U = \sum_j E_{k_j} \{ \log p(y^j, k_j \mid v) \mid y, \hat{v} \} = \sum_j w_j \log p(y^j, k_j = 1 \mid v) + (1 - w_j) \log p(y^j, k_j = 0 \mid v), \tag{3.8} \]

where wj = p(kj = 1 | y^j, v̂). The second term in (3.8) (outlier potential) can be discarded since it does not depend on the model v.

Using (3.5, 3.1), equation (3.8) can be written as function of the stroke samples (the terms associated

to kj = 0 were discarded)

\[ U = \sum_j w_j \Big\{ c_j - \sum_i P(v_i; y^j, k_j = 1) \Big\} = \sum_j w_j \Big\{ c_j + \sum_{i, n} \Phi(v_i; y^j_n, k_j = 1) \Big\}. \tag{3.9} \]


Since all strokes are considered as valid (kj = 1), a Gaussian potential is used (see 3.1)

\[ U = \sum_{j, i, n} w_j \, N(y^j_n; v_i, \sigma^2 I) + \sum_j c_j w_j. \tag{3.10} \]

Therefore, the function U becomes

\[ U = C - \sum_i P_a(v_i, y), \tag{3.11} \]

where

\[ P_a(v_i, y) = \sum_j \Big( -\sum_n N(y^j_n; v_i, \sigma^2 I) \Big) w_j, \tag{3.12} \]

is denoted as an adaptive potential since it depends on the confidence degrees of the image strokes

wj which vary during the estimation process. The weights wj are computed in the E step of the EM

algorithm as follows

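\[ \begin{aligned} w_j &= p(k_j = 1 \mid y, \hat{v}) \\ &= p(k_j = 1 \mid y^j, \hat{v}) \\ &= \beta_j \, p(y^j, k_j = 1 \mid \hat{v}) \\ &= \beta_j \prod_i e^{\sum_n N(y^j_n; \hat{v}_i, \sigma^2 I)}, \end{aligned} \tag{3.13} \]

where βj is a constant which is obtained from the equation

\[ p(k_j = 0 \mid y^j, \hat{v}) + p(k_j = 1 \mid y^j, \hat{v}) = 1, \tag{3.14} \]

since

\[ p(k_j = 0 \mid y^j, \hat{v}) = \beta_j \prod_i e^{nL}, \tag{3.15} \]

we obtain

\[ \beta_j = \Big[ \prod_i e^{\sum_n N(y^j_n; \hat{v}_i, \sigma^2 I)} + \prod_i e^{nL} \Big]^{-1}. \tag{3.16} \]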
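The E step (3.13)-(3.16) can be sketched as follows. Writing a for the double sum of the Gaussian kernels over the model units and the stroke samples, and b for the sum of nL over the model units, the weight is e^a / (e^a + e^b) = 1 / (1 + e^(b − a)), which the code evaluates in this form for numerical safety. The sketch assumes that n in (3.15) is the number of samples of the jth stroke; the function names, the overflow guard and the toy data are hypothetical.

```python
import math

def gaussian_2d(y, v, sigma):
    """Isotropic 2D normal density N(y; v, sigma^2 I)."""
    dx, dy = y[0] - v[0], y[1] - v[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)

def stroke_weight(stroke, contour, sigma, L):
    """Confidence degree w_j of one stroke given the current contour estimate."""
    # Exponent of the "valid" hypothesis: sum over model units and stroke samples.
    log_valid = sum(gaussian_2d(y, v, sigma) for v in contour for y in stroke)
    # Exponent of the "invalid" hypothesis: n * L for each model unit (assumed reading).
    log_invalid = len(contour) * len(stroke) * L
    diff = log_invalid - log_valid
    if diff > 700.0:                      # guard against overflow in exp
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

# Current contour estimate: eight points on the unit circle (hypothetical data).
contour = [(math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8)) for k in range(8)]
near_stroke = contour[:3]                               # lies on the contour
far_stroke = [(5.0, 5.0), (5.2, 5.0), (5.4, 5.0)]       # clutter far from the contour
w_near = stroke_weight(near_stroke, contour, 0.3, 0.05)
w_far = stroke_weight(far_stroke, contour, 0.3, 0.05)
```

The stroke lying on the current contour receives a weight close to one, while the distant stroke receives a small weight; the constant L controls how quickly the confidence drops with distance.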


3.5 Contour Estimation

This section addresses contour estimation by the minimization of a two-term energy: a regularization

term similar to the one proposed in [44] which tries to keep the distance between consecutive model

points close to an average distance l0 and an image dependent term given by (3.12). Therefore,

\[ E(v) = \sum_i (l_i - l_0)^2 + P_a(v_i, y), \tag{3.17} \]

where li = ‖vi+1 − vi‖ is the distance between consecutive model points and l0 is the average distance

specified by the user.

The minimization of (3.17) is performed in the M-step as follows

\[ v^{t+1} = \arg\min_v E(v, \hat{v}). \tag{3.18} \]

Using the gradient algorithm

\[ v^{t+1} = v^t - \gamma \nabla_v E(v, \hat{v}), \tag{3.19} \]

where ∇v is the gradient operator defined by ∇vE(v, v̂) = [∇_{v1}E, . . . , ∇_{vN}E]^T. Therefore, the algorithm proposed here is a generalized EM (GEM) [52], since it does not attempt to find the value of v that globally maximizes the function U(v, v̂).

Equation (3.19) can be rewritten as follows

vt+1 = vt − γiFint + γeFext, (3.20)

where γi, γe are gains, Fint = {Fint(v1), . . . , Fint(vN )}, Fext = {Fext(v1), . . . , Fext(vN )} are interpreted as the internal and external forces applied to the model units vi, respectively. The gain γi = 1/‖Fint‖ normalizes the internal force Fint. The gain γe is chosen in order to avoid an excessive movement of each model unit in a single iteration (details can be found in [1]). The internal and external forces are given by

\[ F_{int}(v_i) = -2 \, \frac{l_i - l_0}{l_i} \, (v_{i+1} - v_i), \tag{3.21} \]

\[ F_{ext}(v_i) = \frac{1}{\sigma^2} \sum_j w_j \sum_n (y^j_n - v_i) \, N(y^j_n; v_i, \sigma^2 I). \tag{3.22} \]


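The external forces (3.22) can be rewritten in a different way as follows. If we define an auxiliary function ϑi(y^j_n) = N(y^j_n; vi, σ²I) and α = 1/σ², the external forces are given by

\[ \begin{aligned} F_{ext}(v_i) &= \alpha \sum_j w_j \sum_n (y^j_n - v_i) \, \vartheta_i(y^j_n) \\ &= \alpha \Big\{ \sum_j w_j \sum_n y^j_n \, \vartheta_i(y^j_n) - v_i \sum_j w_j \sum_n \vartheta_i(y^j_n) \Big\} \\ &= \alpha \sum_j w_j \sum_n \vartheta_i(y^j_n) \left[ \frac{\sum_j w_j \sum_n y^j_n \, \vartheta_i(y^j_n)}{\sum_j w_j \sum_n \vartheta_i(y^j_n)} - v_i \right]. \end{aligned} \tag{3.23} \]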

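Defining

$$\mu_i = \sum_j w_j \sum_n \vartheta_i(y^j_n), \qquad \xi_i = \frac{\sum_j w_j \sum_n y^j_n\, \vartheta_i(y^j_n)}{\sum_j w_j \sum_n \vartheta_i(y^j_n)}, \qquad (3.24)$$

as the mass and centroid of the ith unit, the external force applied to vi becomes

$$F_{ext}(v_i) = \mu_i (\xi_i - v_i). \qquad (3.25)$$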

The external force attracts each model unit vi towards the data centroid ξi, and the force magnitude is proportional to the distance from the model point to the centroid. This result is closely related to the work presented in [1], which describes several methods sharing the same structure, denoted as a unified framework for active contours. The methods belonging to this framework share a set of common properties, namely that in each iteration the model units are attracted towards data centroids, computed using different choices for ϑi(yjn).
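The mass and centroid form of the external force, (3.24) and (3.25), can be sketched as below. This is an illustrative implementation only, assuming 2-D points, Gaussian ϑi as in (3.23), and stroke weights wj that have already been computed in the E-step; the function name and the small-mass guard are assumptions, not part of the thesis.

```python
import numpy as np


def external_forces(model, strokes, weights, sigma):
    """Return (mass, centroid, force) for every model unit, as in (3.24)-(3.25).

    model: (M, 2) array of model points v_i.
    strokes: list of (n_j, 2) arrays with the edge points of each stroke.
    weights: list with one weight w_j per stroke.
    """
    points = np.vstack(strokes)
    # Repeat each stroke weight for every point of that stroke.
    w = np.concatenate([np.full(len(s), wj) for s, wj in zip(strokes, weights)])
    diff = points[None, :, :] - model[:, None, :]
    d2 = np.sum(diff ** 2, axis=-1)
    theta = np.exp(-0.5 * d2 / sigma ** 2) / (2.0 * np.pi * sigma ** 2)
    wt = theta * w[None, :]                       # w_j * theta_i(y^j_n)
    mass = wt.sum(axis=1)                         # mu_i
    # Guard against division by zero when no data is near a model unit.
    safe_mass = np.maximum(mass, np.finfo(float).tiny)
    centroid = (wt @ points) / safe_mass[:, None]  # xi_i
    force = mass[:, None] * (centroid - model)     # mu_i (xi_i - v_i)
    return mass, centroid, force
```

The returned force equals the direct double sum of wj ϑi(yjn)(yjn − vi), so the centroid form is only a regrouping of the same terms.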

Table 3.1 summarizes the adaptive snake algorithm. The next section performs a comparison between the snake and adaptive snake algorithms.


Adaptive Snakes

Strokes detection: Detect edge points in the image and organize them in strokes by performing an edge-linking operation.

iterate

1. E-step: For each stroke compute the weight

$$w_j = p(k_j = 1 \mid y^j, v) = \beta_j \prod_i e^{\sum_n N(y^j_n;\, v_i,\, \sigma^2 I)}.$$

2. M-step: Update the contour model by

$$v^{t+1} = v^t - \gamma_i F_{int}(v) + \gamma_e F_{ext}(v),$$

where γi, γe are gains defined as in [1, 16]. The internal and external forces applied to each model unit are given by

$$F_{int}(v_i) = -2\, \frac{l_i - l_0}{l_i}\, (v_{i+1} - v_i), \qquad F_{ext}(v_i) = \mu_i (\xi_i - v_i),$$

with

$$\mu_i = \sum_j w_j \sum_n \vartheta_i(y^j_n), \qquad \xi_i = \frac{\sum_j w_j \sum_n y^j_n\, \vartheta_i(y^j_n)}{\sum_j w_j \sum_n \vartheta_i(y^j_n)},$$

where $\vartheta_i(y^j_n) = N(y^j_n;\, v_i,\, \sigma^2 I)$.

Table 3.1: Robust shape estimation with the adaptive snakes.


3.6 Experimental Results

In this section we describe experiments illustrating the performance of the adaptive snakes method

proposed in this chapter. We compare the proposed algorithm with classical snakes that are a special

case of the proposed method in which all the data features are valid. The experimental tests are

performed under the following conditions. In each iteration, the boundary model is resampled at

equally spaced points. The gain γi used in (3.20) is chosen as proposed in [16], i.e., after normalizing

the internal forces. For the external forces we use independent gain factors acting on each model unit

as in [1]. These procedures increase the convergence rate of the algorithm.

Example 1

The first example was presented in Chapter 2 (see Fig. 2.1). It illustrates the estimation of a square

object in a cluttered environment. Fig. 3.1 shows the results obtained with the method proposed in this

section. The data and the initial contour are shown in the left image. The next two images show the contour configuration during the convergence process and the final estimate, respectively. It is concluded from

this example that significant improvements are obtained by using the adaptive snakes. The estimates

obtained with the snake potential do not converge towards the object boundary since they get stuck in

the valleys (local minima) associated with the outliers (see Fig. 2.1). In contrast, the adaptive snake assigns more importance to the true strokes and discards the others, since they are shorter. The adaptive snake reduces the influence of outliers and allows an accurate estimation of the object boundary.

Figure 3.1: Results obtained with adaptive snakes: iterations 0, 31, 40.

We have also done experiments using outlier strokes inside the object boundary (see Fig. 3.2). As expected, the snake does not make any distinction between the boundary data and the outliers (see first row), being attracted by inner strokes. The proposed method is much more robust and converges to the object boundary (see second row).

Figure 3.2: Results obtained with snake (first row) and adaptive snake (second row) (iterations 0, 9, 30).

Monte Carlo tests were performed to evaluate the performance of both methods in the estimation of the square object. The test images were generated automatically by adding random strokes to the original image. The number of outlier strokes is randomly generated. The length of each stroke is a gamma distributed random variable with parameters α = µ/β, β = σ²/µ, where µ and σ² are the mean and variance of the stroke length. The initial point of each stroke and the stroke direction are randomly generated with uniform distribution.
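The gamma parameterization above can be checked with a short sketch. It assumes that α is the shape and β the scale of the gamma distribution (so that the mean is αβ and the variance is αβ²), which is consistent with α = µ/β and β = σ²/µ; the function names and the fixed seed are illustrative choices.

```python
import numpy as np


def gamma_params(mean, variance):
    """Shape (alpha) and scale (beta) giving the requested mean and variance."""
    scale = variance / mean   # beta = sigma^2 / mu
    shape = mean / scale      # alpha = mu / beta
    return shape, scale


def sample_stroke_lengths(mean, variance, count, seed=0):
    """Draw `count` stroke lengths from the gamma distribution."""
    shape, scale = gamma_params(mean, variance)
    rng = np.random.default_rng(seed)
    return rng.gamma(shape, scale, size=count)
```

For example, a mean length of 10 with variance 4 corresponds to a shape of 25 and a scale of 0.4.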

In these experiments two parameters were changed: the average length µ and the stroke percentage

p. The stroke percentage p is defined as the sum of stroke lengths divided by the object perimeter. For

each value of µ the stroke percentage was changed from 0% to 100%. Three values were considered for the average stroke length: µ = 3, µ = 10, µ = 20. Fig. 3.3 shows sample test images.

Two distances were used to evaluate the boundary estimates obtained by both methods:

$$D_{av} = \frac{1}{2} \big( d_{av}(v, y) + d_{av}(y, v) \big), \qquad (3.26)$$

$$D_{max} = \frac{1}{2} \big( d_{max}(v, y) + d_{max}(y, v) \big), \qquad (3.27)$$

where

$$d_{av}(v, y) = \frac{1}{M} \sum_{n=1}^{M} \min_y \| v_n - y \|, \qquad (3.28)$$

is the average distance from the contour model v to the true boundary (ideal contour), and

$$d_{max}(v, y) = \max_n \min_y \| v_n - y \|, \qquad (3.29)$$

is the largest deviation from the contour model to the true boundary. The other measures dav(y, v), dmax(y, v) are obtained by exchanging the roles of y and v in (3.28, 3.29).

Figure 3.3: Randomly generated data with p = 30%, µ = 3 (left); p = 60%, µ = 10 (center); p = 100%, µ = 20 (right).
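The two error measures can be computed as in the following sketch, which assumes that both the estimated contour v and the true boundary y are available as finite sets of 2-D sample points (the true boundary would have to be sampled densely for the minimum over y to be meaningful). The function names are illustrative.

```python
import numpy as np


def directed_distances(a, b):
    """Average and maximum, over the points of a, of the distance to the nearest point of b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    nearest = d.min(axis=1)
    return nearest.mean(), nearest.max()


def contour_errors(v, y):
    """Symmetric measures D_av (3.26) and D_max (3.27)."""
    dav_vy, dmax_vy = directed_distances(v, y)
    dav_yv, dmax_yv = directed_distances(y, v)
    return 0.5 * (dav_vy + dav_yv), 0.5 * (dmax_vy + dmax_yv)
```

Averaging the two directed distances penalizes both a contour that strays from the boundary and a contour that covers only part of it.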

Fig. 3.4 shows the performance of both methods. The solid line corresponds to the estimates

obtained with the adaptive snake and the dashed line corresponds to the snake algorithm with Cohen

potential (2.10). These results were computed by performing 10 experiments for each test condition. The first row shows the values of Dav while the second shows the values of Dmax. It is concluded from Fig. 3.4 that the proposed algorithm is robust in the presence of outliers while the snake algorithm shows a significant degradation as the noise level increases, especially in the case of small strokes. When the average length is small, a larger number of outlier strokes is generated, filling the whole image plane. In this case the deformable model easily gets stuck during the convergence process.

Example 2

Another example is shown in Fig. 3.5. The object to be estimated is a hand. Two strokes are detected

in the vicinity of the hand (left images): a valid stroke (hand boundary) and an outlier (white bar

boundary). Two methods were used to estimate the hand: snakes and the adaptive snakes algorithms.

Figure 3.4: Results obtained with snake (dashed line) and adaptive snake (solid line) as a function of the outlier percentage; Dav (first row), Dmax (second row). From left to right µ = 3, µ = 10, µ = 20.

It is observed that the adaptive potential finds the correct boundary, while the estimates obtained with

the classic potential converge to an incorrect shape configuration. Fig. 3.6 shows the evolution of both

potentials (Cohen potential and adaptive potential) during the convergence process. The dark regions

are the potential valleys which attract the model units. Cohen potential remains unchanged during the

convergence process and is not able to discriminate valid data from the outliers. On the contrary, the

adaptive potential manages to discriminate the true stroke from the incorrect one. At iteration 10 only

the valley associated with the hand boundary is visible in Fig. 3.6.

Significant changes of the adaptive potential occur during the estimation process. The valley associated with the outlier stroke varies during the convergence process until it disappears at the end. This

is due to the variation of the weights illustrated in Fig. 3.7. In the first iterations both strokes have

similar weights. However, the weight of the hand stroke increases during the convergence process while

the weight of the outlier tends to zero.

Figure 3.5: Results obtained with snake at iterations 0, 7, 20 (top row) and with adaptive snake at iterations 0, 7, 10 (bottom row).

Example 3

The next example illustrates the performance of the proposed algorithm in the estimation of car boundaries in cluttered scenes. Figs. 3.8, 3.9, and 3.10 show the results obtained with the snake and with

the adaptive snake. These examples illustrate two typical situations. In the first example only a poor

estimate of the object shape is available. The initial contour is, therefore, far from the car boundary.

In the second example, there is a good initial shape estimate but there is a significant shift of the shape

estimate. Shape estimation with the snake potential fails in both cases while the proposed algorithm

solves both problems well. Fig. 3.13 shows the detected strokes as well as the initializations used in

these examples.

Example 4

Figs. 3.11, 3.12 show the estimation of the lips boundary. In this example the strokes are detected in

the whole image (see the white lines on the top of Figs. 3.11, 3.12). The contour is initialized in such a way that the model units can be attracted by strokes outside and inside the lips boundary. The output of the snakes algorithm (Fig. 3.11) is poor. The snakes are attracted by both kinds of outlier strokes.

The results obtained with adaptive snakes are much better (Fig. 3.12). The final estimates accurately

represent the lips boundary.

Fig. 3.13 shows the strokes detected in these experiments.

Figure 3.6: Shape estimates and potentials. Top row: shape estimates and snake potential at iterations 4, 10, 20. Bottom row: shape estimates and adaptive potential at iterations 2, 5, 10.

Figure 3.7: Evolution of the weights (hand stroke weight and outlier stroke weight versus iteration number).

Example 5

This example shows some limitations of the adaptive snakes algorithm. Fig. 3.14 shows an example with a large number of outlier strokes, which are longer than the boundary of the object to estimate (the upper car). Two initializations are considered (see left column). In the first case (first row) the model units are initialized far from the true position, encompassing a large number of invalid strokes. In this case the internal forces are not sufficient to attract the model to the car boundary, and only a partial estimate is obtained. If the initialization is somewhat closer, although still far from the object boundary (second row of Fig. 3.14), the algorithm is able to neglect the influence of the outlier strokes, exhibiting remarkable robustness.

This example shows that the adaptive snakes algorithm is robust with respect to contour initialization, although some dependence remains if the contour is initialized very far from the object boundary.

Figure 3.8: Shape estimates and snake potential, iterations 1, 7, 40.

Figure 3.9: Shape estimates and adaptive potential, iterations 1, 12, 40.

Figure 3.10: Results obtained with snake potential (top row; iterations 1, 4, 30) and adaptive snake (bottom row; iterations 1, 4, 8).

Figure 3.11: Shape estimates and snake potential, iterations 1, 10, 40.

Figure 3.12: Shape estimates and adaptive potential, iterations 1, 7, 30.

Figure 3.13: Strokes detected in Examples 3 and 4 (circles represent the initial shape estimate).

Figure 3.14: A difficult example in which the results depend on initialization: this figure shows the initial contours (left), intermediate shape estimates (center) and final shape estimates (right).


3.7 Conclusions

This chapter describes a new algorithm for the estimation of object boundaries in the presence of

outliers [58]. The object boundary is approximated by a deformable contour as in snakes. Model points

are deformed by internal forces and by external forces computed using an image potential. However,

instead of using the classic potential function which remains invariant during the convergence process,

an adaptive potential is proposed which is able to discard the influence of outliers. This is achieved as

follows. Image features (edge points) are organized in strokes and each stroke is classified as either

valid or invalid (outlier). Since this information is not available, a confidence degree is assigned to each

stroke, being updated during the estimation process. Therefore, all strokes contribute to the image

potential with different weights. The algorithm assigns higher weights to longer strokes than to short

ones, since longer strokes may be expected to give better information about the image. The image

potential and the contour model are recursively estimated in a maximum likelihood framework by the

EM algorithm.

Experimental tests have shown that the proposed algorithm improves on snakes in the presence of clutter: it is considerably more robust to outliers and less dependent on the initialization.

Estimation errors may still occur when the contour is initialized far from the object boundary and the outlier strokes are longer than the object strokes.


Chapter 4

A Robust Feature based Tracker

4.1 Introduction

It was shown in Chapter 2 that Kalman based trackers are not robust. This chapter presents a new

feature based tracker which is robust in the presence of outliers. The algorithm shares some of the

principles of the robust method described in the previous chapter. First, middle level features (strokes)

are detected in the image. Second, a label is assigned to each stroke: a label 1 (valid) if the stroke

belongs to the object boundary, and a label 0 (invalid) otherwise. The stroke labels are unknown.

Some strokes belong to the object boundary and the remaining strokes are outliers but we do not know

which. Thus, every possible combination of valid/invalid labels must be taken into account. This results in a collection of sequences of valid/invalid stroke labels. Each sequence will be denoted as a data interpretation.

The next step is to assign a probability to each data interpretation. In this way each interpretation

has a confidence degree which contributes to update the contour estimates, i.e., all the strokes contribute

to track the moving object but with different weights. This allows a robust performance of the tracker

in the presence of outliers.

The chapter is organized as follows. Section 4.2 describes related work. Section 4.3 presents the

problem statement. Section 4.4 describes the proposed tracking algorithm denoted as S-PDAF. Section

4.5 shows how this method can be applied in object tracking. Section 4.6 illustrates the performance of

the algorithm in the context of lip and gesture tracking as well as in surveillance applications. Section

4.7 concludes the chapter.

This work was published before in [56, 62].


4.2 Related Work

The estimation of the model parameters is often performed by Kalman filtering [10, 72]. As discussed

in Chapter 2, boundary detection is an ill-posed problem, i.e., image detection algorithms used to

extract information from the image produce many invalid features (outliers) which do not belong to

the boundary of the object to be tracked (see Fig. 2.4 (b)). Furthermore, the outliers have a strong

influence on the performance of active contour algorithms.

Several methods have been proposed to alleviate this difficulty. The use of centroids averages the detected features [2, 3, 4, 55]. Restrictions on the object shape, e.g., rigid templates or eigen shapes learned from the data [10, 20], reduce the admissible deformations. The latter approach prevents the model boundary from having unpredictable shapes

caused by the outliers. Temporal restrictions have also been considered by representing the evolution of

the motion and shape parameters using dynamic models, e.g., stochastic difference equations. In general,

dynamic models may be specified by the user or learned from the video sequences using standard system

identification methods [10].

Despite these improvements, none of these methods is able to solve the segmentation problem,

i.e., none of them is able to discriminate valid data from the outliers which hamper the performance of

Kalman trackers. An ad hoc procedure used to improve the Kalman tracker consists of using a validation

gate, computed from the predicted object boundary [20]. This method assumes that valid features can

only be located in the vicinity of the predicted contour. This approach works well if the object motion

is slow and highly predictable and the outliers are far from the object boundary but it fails in more

complex situations. Data fusion methods have also been considered [50]; the use of gradient direction and color context helps to reduce the influence of outliers [77].

The estimation of random signals from multiple noisy observations with outliers was extensively

studied in the context of target tracking using radar measurements [6, 28]. Several techniques were

proposed in this context. Robust filtering methods have been considered since multiple returns can be

observed, most of them being false alarms. Several techniques were proposed for target tracking ranging

from the naive Nearest Neighbor Filter to the optimal Track Splitting Filter. The first approach takes the observed feature closest to the predicted measurement as the correct one, discarding the rest. The problem with choosing the closest measurement is that it may be incorrect, leading to a significant degradation of the tracker performance. The second approach handles the problem by

simultaneously processing several measurements that could have been produced by the target of interest.

This method splits the track into multiple hypotheses every time new measurements are detected in the

validation gate. A pruning technique is required to avoid a combinatorial explosion, since the number of individual tracks grows exponentially. Nevertheless, the number of significant tracks is often large in

a noisy environment. The PDAF (Probabilistic Data Association Filter) proposed by Bar-Shalom and

Fortmann [6] is a good compromise between the previous approaches since it takes into account multiple

data association hypotheses without the combinatorial explosion. Non-linear filtering techniques have

also been considered in the context of object tracking by using long tail distributions for the observed

data. Inference is done by the Condensation algorithm, a particle filter [38].

This chapter describes a robust estimation algorithm for shape tracking which is denoted as S-PDAF

(Shape PDAF) [56, 62]. This algorithm is an extension of the PDAF filter proposed in the context of

point tracking. The algorithm considers middle level image features (strokes) detected in the vicinity

of the object to be tracked. Each feature can be either a valid observation of the object boundary or an

outlier. Since we do not know a priori which features are valid, all possible sequences of valid/invalid

stroke labels are considered and a probability (confidence degree) is assigned to each labeling sequence.

For convenience, each label sequence is denoted as a data interpretation.¹

The stroke probabilities are computed using a data model, an outlier model, and the predicted shape

estimate and uncertainty. This leads to a Kalman type recursion for the update of the state estimate

and uncertainty. All the strokes contribute to update shape and motion estimates. However, each

stroke interpretation has a different confidence degree and, therefore, a different influence on the final

estimates. For example, a stroke far away from the object boundary will have a negligible influence

on the estimate (the opposite behavior is observed in Kalman filtering: strokes far from the object

dominate the estimation process). Furthermore, since the algorithm considers stroke sequences, it is able to detect false strokes near the object boundary if the stroke is not compatible with the other strokes.

The differences between the S-PDAF and the PDAF filter of Bar-Shalom and Fortmann concern different sensor models and new update equations for the shape and motion estimates, which will be presented in this chapter.

¹ This approach is different from the one used in Chapter 3. Here, confidence degrees are associated with data interpretations and not with individual features.

4.3 Problem Statement

This section addresses the estimation of moving objects in image sequences in the presence of outliers.

The main difficulty concerns the presence of false alarms and detection failures in the feature extraction

process. Both produce undesirable effects, which significantly hamper the performance of Kalman based

trackers. This difficulty would disappear if we knew whether each feature was true (belonging to the object

boundary) or false (produced by the background), but this information is not available in advance.

It is not feasible to consider all the combinations of true and false low level features. This would lead to $2^N$ interpretations of the data, where N is the number of detected features (typically higher than 100). A different approach is adopted in this work. The observations are grouped into strokes.

Strokes are obtained by matching feature points detected at consecutive measurement lines. This is

accomplished by using the mutual favorite pairing method [36]. In this method, a pair of consecutive

lines is considered in each step of the algorithm. Each feature from the first line selects the best match

from the set of features detected in the second line. The same procedure is performed backwards for

each feature of the second line. Two features are linked when each chooses the other as its best match. The matching criterion is the distance between feature points. The number of data interpretations is drastically reduced to $2^M$, where M is the number of strokes (typically M < 10). An example is shown in Fig. 4.1, where the strokes are obtained from the image features of Fig. 2.2.
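One linking step between two consecutive measurement lines can be sketched as follows. This is a minimal illustration of the mutual-choice rule described above, assuming features are 2-D points and the matching criterion is the Euclidean distance; the method of [36] may include further details (such as distance thresholds) that are not reproduced here, and the function name is hypothetical.

```python
import numpy as np


def mutual_favorite_pairs(first, second):
    """Link features detected on two consecutive measurement lines.

    first, second: (n1, 2) and (n2, 2) arrays of feature coordinates.
    Returns the list of index pairs (i, j) such that first[i] and second[j]
    each choose the other as nearest neighbour.
    """
    if len(first) == 0 or len(second) == 0:
        return []
    d = np.linalg.norm(first[:, None, :] - second[None, :, :], axis=-1)
    best_for_first = d.argmin(axis=1)    # forward choice
    best_for_second = d.argmin(axis=0)   # backward choice
    return [(i, int(j)) for i, j in enumerate(best_for_first)
            if best_for_second[j] == i]
```

Chaining the linked pairs over successive lines yields the strokes; a feature that is nobody's mutual favorite starts or ends a stroke.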

Figure 4.1: Stroke detection (four strokes, labeled Stroke 1 to Stroke 4). The dashed line represents the predicted contour and the circles represent the contour samples.

After computing the strokes, each of them may be classified either as true or false. A stroke interpretation $I_i$ is defined as a binary sequence, $I_i = \{I_i^1, \ldots, I_i^M\}$, where $I_i^j \in \{0, 1\}$ is the label of the jth stroke in the interpretation $I_i$ ($I_i^j = 1$ means valid stroke; $I_i^j = 0$ means invalid stroke).

Let y(t) be the vector of all image features detected at time instant t and let yi(t) be a vector with

the coordinates of all valid features according to the interpretation Ii.

Let x be a set of parameters defining the object boundary. It will be assumed that yi and the curve

parameters x(t) are related by

yi(t) = Cix(t) + ei + ηi(t), (4.1)

where Ci and ei are the observation matrix and vector associated with the ith interpretation, and ηi(t) ∼ N(0, Ri) is white Gaussian noise (measurement noise). In general, the observation matrices Ci, Cj and vectors ei, ej associated with two interpretations Ii, Ij are different, since the observation vectors yi, yj contain different data features and often have different dimensions. The object motion and deformation, i.e., the evolution of x(t), is described by the stochastic difference equation (2.18).

The problem to be solved can be formulated as follows: how to estimate the shape parameters x(t), knowing current and past observations, assuming that we do not know which interpretation is correct?

4.4 S-PDAF

A non-linear filtering approach is adopted in this section. The estimation of the state vector requires the propagation of the a posteriori density $p(x(t) \mid Y^t)$, where $Y^t$ is a set with the current and past observations (visual features). Since there are multiple data interpretations (hypotheses), the a posteriori density is not Gaussian: it is a mixture of Gaussians. However, the number of modes increases exponentially with t [74]. Therefore, the propagation of the exact a posteriori density is infeasible. A suboptimal approach is adopted instead, inspired by the probabilistic data association filter (PDAF) [6] developed in the context of point target tracking.

The main assumption is the following: it will be assumed that the state distribution given past observations is Gaussian, i.e.,

$$p\big(x(t) \mid Y^{t-1}\big) = N\big(x(t);\, \hat{x}(t \mid t-1),\, P(t \mid t-1)\big),$$

where $\hat{x}(t \mid t-1)$, $P(t \mid t-1)$ are the mean and covariance of x(t) given past observations $Y^{t-1}$.


Let us now consider the computation of the state estimate and uncertainty (state mean and covariance matrix) given current and past observations, i.e.,

$$\hat{x}(t \mid t) \triangleq E[x(t) \mid Y^t], \qquad (4.2)$$

$$P(t \mid t) \triangleq E\big\{ [x(t) - \hat{x}(t \mid t)][x(t) - \hat{x}(t \mid t)]^T \mid Y^t \big\}. \qquad (4.3)$$

Since we do not know which interpretation is valid, they all have to be considered as follows

$$
\begin{aligned}
\hat{x}(t \mid t) = E[x(t) \mid Y^t] &= \int x(t)\, p(x(t) \mid Y^t)\, dx(t) \\
&= \int x(t) \sum_i p(x(t), I_i(t) \mid Y^t)\, dx(t) \\
&= \sum_i \int x(t)\, p(x(t) \mid I_i(t), Y^t)\, p(I_i(t) \mid Y^t)\, dx(t).
\end{aligned}
\qquad (4.4)
$$

Equation (4.4) can be rewritten as

$$\hat{x}(t \mid t) = \sum_{i=0}^{m_i} \alpha_i(t)\, \hat{x}_i(t \mid t), \qquad (4.5)$$

where $\alpha_i(t) \triangleq p(I_i(t) \mid Y^t)$ is the a posteriori probability of the ith interpretation, $m_i$ is the number of data interpretations at time t, and

$$\hat{x}_i(t \mid t) = E\{x(t) \mid I_i(t), Y^t\}. \qquad (4.6)$$

The state estimate $\hat{x}(t \mid t)$ is a weighted sum of the state estimates $\hat{x}_i(t \mid t)$ obtained for each interpretation $I_i(t)$ and updated by Kalman filtering

$$\hat{x}_i(t \mid t) = \hat{x}(t \mid t-1) + K_i(t)\, \nu_i(t), \qquad (4.7)$$

where $K_i(t)$, $\nu_i(t)$ are the Kalman gain and innovation associated with the interpretation $I_i(t)$

$$K_i(t) = P(t \mid t-1)\, C_i^T S_i(t)^{-1}, \qquad (4.8)$$

$$S_i(t) = C_i P(t \mid t-1)\, C_i^T + R_i, \qquad (4.9)$$

$$\nu_i(t) = y_i(t) - C_i \hat{x}(t \mid t-1) - e_i. \qquad (4.10)$$

Replacing (4.7) in (4.5) leads to

$$\hat{x}(t \mid t) = \hat{x}(t \mid t-1) + \sum_{i=1}^{m_i} \alpha_i(t)\, K_i(t)\, \nu_i(t). \qquad (4.11)$$

A recursive equation can also be derived for the covariance matrix (see Appendix C for the details).

$$
\begin{aligned}
P(t \mid t) = {} & \Big[ I - \sum_{i=1}^{m_i} \alpha_i(t)\, K_i(t)\, C_i \Big] P(t \mid t-1) \\
& + \sum_{i=0}^{m_i} \alpha_i(t)\, \hat{x}_i(t \mid t)\, \hat{x}_i(t \mid t)^T - \hat{x}(t \mid t)\, \hat{x}(t \mid t)^T.
\end{aligned}
\qquad (4.12)
$$

Equations (4.11, 4.12) define a recursive algorithm for the update of the state estimate and uncertainty which will be denoted as the Shape Probabilistic Data Association Filter (S-PDAF).
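The filtering step defined by (4.5)-(4.12) can be sketched as follows. This is an illustrative implementation under stated assumptions: the association probabilities are taken as given, each interpretation is passed as a dictionary with hypothetical keys `y`, `C`, `e`, `R`, and the interpretation in which no stroke is valid (index 0 in the sums) is represented by `None`, so that its estimate equals the prediction and its gain is zero.

```python
import numpy as np


def spdaf_update(x_pred, P_pred, interpretations, alphas):
    """One S-PDAF filtering step following (4.5)-(4.12).

    x_pred, P_pred: predicted state mean (n,) and covariance (n, n).
    interpretations: list of dicts with keys 'y', 'C', 'e', 'R', or None
        for the interpretation without valid strokes.
    alphas: association probabilities, assumed to sum to one.
    """
    n = len(x_pred)
    x_upd = np.zeros(n)
    gain_term = np.zeros((n, n))
    spread = np.zeros((n, n))
    for alpha, interp in zip(alphas, interpretations):
        if interp is None:
            x_i = x_pred.copy()                          # no measurement used
        else:
            C, R = interp["C"], interp["R"]
            S = C @ P_pred @ C.T + R                     # (4.9)
            K = P_pred @ C.T @ np.linalg.inv(S)          # (4.8)
            nu = interp["y"] - C @ x_pred - interp["e"]  # (4.10)
            x_i = x_pred + K @ nu                        # (4.7)
            gain_term += alpha * (K @ C)
        x_upd += alpha * x_i                             # (4.5)
        spread += alpha * np.outer(x_i, x_i)
    # (4.12): weighted Kalman covariance plus the spread of the estimates.
    P_upd = (np.eye(n) - gain_term) @ P_pred + spread - np.outer(x_upd, x_upd)
    return x_upd, P_upd
```

When a single interpretation has probability one, the spread term vanishes and the step reduces to the standard Kalman update.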

4.5 Application to Tracking

To apply the S-PDAF in shape tracking, several steps must be carried out. Since this is a feature based

method, it follows the structure of the Kalman tracker: contour prediction, image measurement, and

filtering step (see Section 2.5). However, the filtering step is more complex and requires the computation of the state estimates for all the interpretations xi(t | t) and the association probabilities αi(t). To

compute the association probabilities a probabilistic model of the image strokes is required (see Section

4.5.1).

Another issue must be considered at this stage. In this chapter, the visual features detected in

the image are obtained by directional search in the vicinity of the predicted contour (see Appendix B).

Therefore, they depend on the contour estimate. The visual features can be obtained using the predicted

contour computed from x(t | t − 1). However, the position of these features can be reestimated using

updated estimates of the state vector x(t | t) computed from (4.11, 4.12), and in this case the filtering step should be repeated.

4.5.1 Association Probabilities

In this section we will define the stroke model and the association probabilities. It is assumed that high

probabilities are assigned to the interpretations with the following characteristics:


• long valid strokes: long strokes are more informative and reliable; they contribute in a significant

way to describe the data in the image.

• valid strokes close to the predicted contour: strokes detected in the vicinity of the object boundary are more reliable than strokes located far from the object contour.

• stroke overlap: interpretations with overlapping strokes, which assign multiple observations to a

single contour sample (see Fig. 4.1) should have zero probability.

Let us consider the example shown in Fig. 4.2. It will be assumed that the image strokes are characterized by the following variables: M, the number of strokes; bj, ej, the first and last indices of the jth stroke; Ii, the label sequence; and y(t), a vector containing the coordinates of the visual features detected in the tth image.

It is assumed that these variables are randomly generated as shown in Fig. 4.3. The first block generates the number of strokes M. The second block defines the beginning and the end of each stroke, bj, ej, j = 1, . . . , M. The output of the third block is a labeling sequence (interpretation), defining which strokes are valid and invalid. Finally, the fourth block generates the image features. Recall that all these variables are known (see Fig. 4.2) except the interpretation I.

Figure 4.2: Stroke parameters for three strokes with first and last indices (b1, e1), (b2, e2), (b3, e3) (◦ - samples of the predicted contour; • - edge points).

Figure 4.3: Data generation model (blocks, in order: number of strokes M; object boundary b = (b1, . . . , bM), e = (e1, . . . , eM); interpretations I; image features y(t)).


The data model is characterized by the joint distribution $p(y(t), I_i(t), b, e, M \mid Y^{t-1})$. This probability can be factorized as follows

$$
\begin{aligned}
p\big(y(t), I_i(t), b, e, M \mid Y^{t-1}\big) = {} & p\big(y(t) \mid I_i(t), b, e, M, Y^{t-1}\big) \\
& \times p\big(I_i(t) \mid b, e, M, Y^{t-1}\big)\, p\big(b, e \mid M, Y^{t-1}\big)\, p\big(M \mid Y^{t-1}\big).
\end{aligned}
\qquad (4.13)
$$

The association probabilities are given by

$$
\begin{aligned}
\alpha_i(t) &= p\big(I_i(t) \mid y(t), b, e, M, Y^{t-1}\big) \\
&= \frac{p\big(y(t), I_i(t), b, e, M \mid Y^{t-1}\big)}{p\big(y(t), b, e, M \mid Y^{t-1}\big)} \\
&= c\, p\big(y(t), I_i(t), b, e, M \mid Y^{t-1}\big),
\end{aligned}
\qquad (4.14)
$$

where c is a normalization constant.

Using (4.13), equation (4.14) can be written as

$$\alpha_i(t) = \beta\, p\big(y(t) \mid I_i(t), b, e, M, Y^{t-1}\big)\, p\big(I_i(t) \mid b, e, M, Y^{t-1}\big), \qquad (4.15)$$

where β is a constant.

The distributions $p(y(t) \mid I_i(t), b, e, M, Y^{t-1})$ and $p(I_i(t) \mid b, e, M, Y^{t-1})$ are problem dependent. They characterize the data features generated by valid and invalid strokes, as well as the a priori probabilities of the data interpretations. These distributions can be learned from the data. However, a more pragmatic approach is adopted in this chapter. Both distributions are defined based on a set of hypotheses. First, we assume that all image features are independently generated, i.e.,

$$p\big(y(t) \mid I_i(t), b, e, M, Y^{t-1}\big) = \prod_{j=1}^{M} \prod_{n=b_j}^{e_j} p\big(y^j(s_n, t) \mid I_i^j(t)\big), \qquad (4.16)$$

where $y^j(s_n, t)$ is the feature point belonging to the jth stroke and detected in the vicinity of $s_n$. Second, it is assumed that the visual features have a uniform distribution in the search area (validation gate) if $I_i^j = 0$ (classified as unreliable) and a Gaussian distribution if $I_i^j = 1$ (classified as reliable). Therefore,

\[
p(y^j(s_n, t) \mid I_i^j(t)) =
\begin{cases}
\rho^{-1} \mathcal{N}(\nu^j(s_n, t); 0, S(s_n, t)) & \text{if } I_i^j(t) = 1, \\
V(s_n, t)^{-1} & \text{otherwise,}
\end{cases}
\tag{4.17}
\]

where $V(s_n, t)$ is the length of the search area, $\rho$ is the normalization constant, $\nu^j(s_n, t)$ is the innovation associated with the jth stroke, and $S(s_n, t) = C(s_n) P(t \mid t-1) C(s_n)^T + R(s_n)$ is the covariance of the innovation vector, where $C(s_n)$ and $R(s_n)$ are the output matrix and noise covariance associated with the nth sample.

Let us now consider the second term $p(I_i(t) \mid b, e, M, Y^{t-1})$. Assuming independence of the stroke labels,

\[
p(I_i(t) \mid b, e, M, Y^{t-1}) = p(I_i^1 \mid b, e, M) \cdots p(I_i^M \mid b, e, M). \tag{4.18}
\]

Since it is assumed that long strokes are more likely to be valid than short ones, a linear model is adopted to represent the dependence of the stroke probability on the stroke length

\[
p(I_i^j = 1) = m l_j + c, \qquad p(I_i^j = 0) = 1 - m l_j - c, \tag{4.19}
\]

where

\[
c = P_A, \qquad m = \frac{P_A - P_B}{L}, \tag{4.20}
\]

$l_j$ is the length of the jth stroke, and $L$ is the number of sampling points. $P_A$ and $P_B$ are constants.

Using (4.19),

\[
p(I_i(t) \mid b, e, M, Y^{t-1}) = \prod_{j : I_i^j = 1} (m l_j + c) \prod_{j : I_i^j = 0} (1 - m l_j - c), \tag{4.21}
\]

if there is no overlap among the strokes. Otherwise $p(I_i(t) \mid b, e, M) = 0$.

It is now possible to compute the data association probability αi using (4.14 - 4.21). The interpre-

tation with highest probability is called the dominant interpretation.
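
The computation in (4.15)-(4.21) can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the implementation used in the thesis: features are scalar, a single innovation variance S and a single gate length V are shared by all samples, ρ defaults to 1, overlapping interpretations are assumed to have been discarded by the caller, and all function names, argument names and numerical values are hypothetical.

```python
import math

def gaussian(nu, var):
    """Zero-mean scalar Gaussian density N(nu; 0, var)."""
    return math.exp(-0.5 * nu * nu / var) / math.sqrt(2.0 * math.pi * var)

def association_probabilities(interps, innovations, lengths, S, V, PA, PB, L, rho=1.0):
    """Normalised alpha_i for each interpretation (tuple of 0/1 stroke labels).

    innovations[j] lists the scalar innovations of the points of stroke j and
    lengths[j] is the stroke length l_j. Assumed simplifications: scalar
    features, one variance S and one gate length V for all samples.
    """
    m = (PA - PB) / L          # slope of the linear prior, eq. (4.20)
    c = PA                     # offset of the linear prior, eq. (4.20)
    weights = []
    for labels in interps:
        w = 1.0
        for j, valid in enumerate(labels):
            p_valid = m * lengths[j] + c                  # eq. (4.19)
            w *= p_valid if valid else 1.0 - p_valid      # eq. (4.21)
            for nu in innovations[j]:                     # eq. (4.16)
                # eq. (4.17): Gaussian if valid, uniform in the gate otherwise
                w *= gaussian(nu, S) / rho if valid else 1.0 / V
        weights.append(w)
    total = sum(weights)       # beta in (4.15) equals 1 / total
    return [w / total for w in weights]

# Illustrative data (not from the thesis): a long stroke close to the
# predicted contour and a short stroke far from it.
innovations = [[0.1, -0.2, 0.0, 0.1, 0.2, -0.1], [3.0, 3.5]]
lengths = [len(v) for v in innovations]
interps = [(0, 0), (0, 1), (1, 0), (1, 1)]
alphas = association_probabilities(interps, innovations, lengths,
                                   S=1.0, V=10.0, PA=0.3, PB=0.1, L=20)
dominant = interps[alphas.index(max(alphas))]
print(dominant, [round(a, 4) for a in alphas])
```

With these made-up values the dominant interpretation labels the first stroke as valid and the second as an outlier, which is the behaviour the model is designed to produce.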

Example

This example illustrates the performance of the S-PDAF with synthetic data. We wish to estimate the boundary of an object given four strokes and a first estimate of the object boundary (see Fig. 4.4). The goal is to compute the contour which best matches the observed strokes. In this example it is assumed that the object motion is described by the translation vector x(t) = [tx ty]^T, where tx, ty are the displacements along the x and y directions, and that no deformation is observed.


Fig. 4.4 (a) shows the initial contour estimate (dashed line) and four strokes detected in the image

(continuous lines). Since we do not know which strokes are valid/invalid, all data interpretations must

be considered. Four interpretations have overlapping strokes and will be discarded. Therefore, only

12 interpretations are considered. Fig. 4.5 shows three possible interpretations of the image data: the

thick strokes are classified as reliable and the thin ones are classified as outliers.
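
The count of 12 follows from enumerating the 2^4 = 16 label sequences and discarding those in which two overlapping strokes are both labelled valid. The sketch below assumes that S2 and S3 are the overlapping pair; this is an inference from the rows listed in Table 4.1, not something stated explicitly in the text, and the function name is illustrative.

```python
from itertools import product

def valid_interpretations(n_strokes, overlapping_pairs):
    """All 0/1 label tuples in which no overlapping pair is jointly valid."""
    result = []
    for labels in product((0, 1), repeat=n_strokes):
        if any(labels[a] and labels[b] for a, b in overlapping_pairs):
            continue  # overlapping strokes cannot both be valid
        result.append(labels)
    return result

# Strokes are indexed from 0, so (1, 2) stands for the pair (S2, S3).
interps = valid_interpretations(4, [(1, 2)])
print(len(interps))
```

Under this assumption the enumeration order coincides with the ordering I1, ..., I12 used in Table 4.1.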

Table 4.1 shows the values of the association probabilities computed by the S-PDAF filter in two consecutive iterations. The initial contour in the second iteration is the output of the first iteration. Fig. 4.6 displays this information. The most likely interpretation is I12 (dominant interpretation), which considers all the strokes as valid except S3. Therefore, S-PDAF solves the data conflict by assigning high confidence degrees to the strokes S1, S2, S4 and a low confidence degree to S3. Although S3 is classified as valid in I10, the association probability of I10 decreases significantly in the second iteration (from 0.2166 to 0.0170). On the contrary, the probabilities of the interpretations I6, I11, and I12 (which classify S3 as an outlier) increase. This can be seen in Fig. 4.5.

The S-PDAF estimate of the object boundary computed by (4.11) is shown in Fig. 4.4 (b) (solid

line).

Figure 4.4: (a) Detected strokes (STROKE 1 to STROKE 4); (b) S-PDAF estimate (solid line).

Figure 4.5: Three different interpretations with highest probabilities.


Interpretation  S1  S2  S3  S4  α(1)    α(2)
I1              0   0   0   0   0.0003  0.0015
I2              0   0   0   1   0.0043  0.0094
I3              0   0   1   0   0.0041  0.0007
I4              0   0   1   1   0.0525  0.0043
I5              0   1   0   0   0.0096  0.0263
I6              0   1   0   1   0.1243  0.1594
I7              1   0   0   0   0.0014  0.0061
I8              1   0   0   1   0.0177  0.0371
I9              1   0   1   0   0.0167  0.0028
I10             1   0   1   1   0.2166  0.0170
I11             1   1   0   0   0.0396  0.1040
I12             1   1   0   1   0.5130  0.6314

Table 4.1: Data interpretations and association probabilities in the 1st and 2nd iterations.
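
One way to read the confidence degrees discussed above is as the total probability of the interpretations that label a stroke as valid. The sketch below computes this quantity from the values of Table 4.1; the helper names are hypothetical, and treating the confidence degree as this marginal sum is an assumption made for illustration only.

```python
# Rows of Table 4.1: (S1, S2, S3, S4), alpha in iteration 1, alpha in iteration 2.
TABLE_4_1 = [
    ((0, 0, 0, 0), 0.0003, 0.0015),
    ((0, 0, 0, 1), 0.0043, 0.0094),
    ((0, 0, 1, 0), 0.0041, 0.0007),
    ((0, 0, 1, 1), 0.0525, 0.0043),
    ((0, 1, 0, 0), 0.0096, 0.0263),
    ((0, 1, 0, 1), 0.1243, 0.1594),
    ((1, 0, 0, 0), 0.0014, 0.0061),
    ((1, 0, 0, 1), 0.0177, 0.0371),
    ((1, 0, 1, 0), 0.0167, 0.0028),
    ((1, 0, 1, 1), 0.2166, 0.0170),
    ((1, 1, 0, 0), 0.0396, 0.1040),
    ((1, 1, 0, 1), 0.5130, 0.6314),
]

def stroke_confidence(rows, column):
    """Sum of alpha over the interpretations that label each stroke as valid."""
    n = len(rows[0][0])
    return [sum(r[column] for r in rows if r[0][j] == 1) for j in range(n)]

def dominant(rows, column):
    """1-based index of the interpretation with the largest alpha."""
    return max(range(len(rows)), key=lambda i: rows[i][column]) + 1

for iteration in (1, 2):
    conf = stroke_confidence(TABLE_4_1, iteration)
    print(iteration, dominant(TABLE_4_1, iteration), [round(c, 4) for c in conf])
```

Read this way, the confidence in S3 drops from about 0.29 in the first iteration to about 0.02 in the second, while I12 remains the dominant interpretation in both.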

Figure 4.6: Data interpretations and association probabilities in the 1st and 2nd iterations (two bar plots; horizontal axis: interpretations 1 to 12; vertical axis: association probabilities, 0 to 0.7).

Table 4.2 summarizes the S-PDAF algorithm.


S-PDAF Algorithm

1. Prediction: Predict the state vector $\hat{x}(t \mid t-1)$ and covariance matrix $P(t \mid t-1)$ using the prediction step of the Kalman filter (2.23), (2.24).

2. Stroke Detection: Sample the predicted contour at equally spaced points and perform feature detection along the lines orthogonal to the curve (see Appendix B). Feature detection consists of edge detection followed by stroke detection.

3. Association Probabilities: For each data interpretation $I_i$, compute the association probability $\alpha_i(t)$

\[
\alpha_i(t) = \beta \, p(y(t) \mid I_i(t), b, e, M, Y^{t-1}) \, p(I_i(t) \mid b, e, M, Y^{t-1}),
\]

according to (4.16)-(4.17) and (4.18)-(4.21).

4. Filtering: Update the state estimate and uncertainty according to

\[
\hat{x}(t \mid t) = \hat{x}(t \mid t-1) + \sum_{i=1}^{m_i} \alpha_i(t) K_i(t) \nu_i(t),
\]

\[
P(t \mid t) = \Big[ I - \sum_{i=1}^{m_i} \alpha_i(t) K_i(t) C_i \Big] P(t \mid t-1) + \sum_{i=0}^{m_i} \alpha_i(t) \hat{x}_i(t \mid t) \hat{x}_i(t \mid t)^T - \hat{x}(t \mid t) \hat{x}(t \mid t)^T.
\]

5. Repeat steps 2-4 a number of times (e.g., twice) to improve the confidence degrees.

Table 4.2: Steps of the S-PDAF algorithm.
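
Step 4 of the table can be illustrated with a scalar state. The sketch below is a simplified reading of the filtering equations, not the thesis code: it assumes a scalar state, one scalar measurement (y_i, R_i) per interpretation, a common output gain C, and that interpretation 0 contains no valid data, so its estimate is the prediction. The function name and arguments are hypothetical.

```python
def spdaf_update(x_pred, P_pred, alphas, measurements, C=1.0):
    """Scalar version of the filtering step.

    alphas[0] is the weight of the interpretation with no valid data
    (its estimate is the prediction); alphas[i], i >= 1, pairs with
    measurements[i - 1] = (y_i, R_i).
    """
    x_i = [x_pred]          # per-interpretation estimates x_i(t|t)
    gain_terms = 0.0        # sum_i alpha_i K_i C_i
    x_new = x_pred
    for alpha, (y, R) in zip(alphas[1:], measurements):
        K = P_pred * C / (C * P_pred * C + R)   # Kalman gain K_i
        nu = y - C * x_pred                     # innovation nu_i
        x_i.append(x_pred + K * nu)
        x_new += alpha * K * nu
        gain_terms += alpha * K * C
    # Spread-of-the-means term: sum_i alpha_i x_i^2 - x^2
    spread = sum(a * xi * xi for a, xi in zip(alphas, x_i)) - x_new * x_new
    P_new = (1.0 - gain_terms) * P_pred + spread
    return x_new, P_new

# With all the weight on one interpretation the update reduces to a Kalman update.
print(spdaf_update(0.0, 1.0, [0.0, 1.0], [(2.0, 1.0)]))
# With the weight split evenly, the estimate moves less and the variance stays larger.
print(spdaf_update(0.0, 1.0, [0.5, 0.5], [(2.0, 1.0)]))
```

The second call shows why the last two terms of the covariance update matter: disagreement between interpretations inflates the uncertainty instead of being ignored.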



The S-PDAF tracker was tested with a large number of sequences. Three examples are presented in

this section, corresponding to the tracking of a hand, lips, and vehicles in video sequences. All these

problems are useful in practice and have been extensively studied (e.g., see [45, 67, 73]).

4.6.1 Hand Tracking

The first example considers a hand moving in front of a shirt (see Fig. 4.7) in a sequence of 20 gray scale images obtained at 12 frames per second. The Kalman tracker fails in this example due to the presence of a white bar, although only two degrees of freedom are considered: a translation model is used to describe the hand motion and no deformation is allowed. The state vector x(t) is 2-dimensional and contains the displacement coordinates of the translation vector.

In this example the following assumptions were made

\[
A = I_A, \quad P = k_P I_P, \quad Q = k_Q I_Q, \quad R = I_R, \tag{4.22}
\]

where A is the transition matrix, P is the state covariance matrix, Q, R are the covariance matrices which account for motion and measurement uncertainty respectively, $I_A$, $I_P$, $I_Q$ are D × D identity matrices, $I_R$ is a 2N × 2N identity matrix, and $k_P$, $k_Q$ are constants. The reference shape is specified by the user in the first image and is represented using a B-spline with Nc = 12 control points; N = 21 samples were considered to compute the image features. The state vector is x(t) = [tx ty]^T, and it is also initialized by the user in the first frame.

Fig. 4.7 shows the performance of two algorithms: Kalman filter and S-PDAF. The visual features are represented by dots. The lines from the contour to the features can be interpreted as springs which attract the contour towards the image features. The first row shows the performance of the Kalman filter. The tracker fails when the hand is close to the white bar. This means that the model is strongly influenced by outliers generated by the bar boundary. The second row shows the results obtained with S-PDAF. This method manages to distinguish the outliers from the valid data associated with the hand boundary, and the hand motion is correctly estimated.


Figure 4.7: Tracking results obtained with Kalman filter (first row) and S-PDAF filter (second row) (frames 1, 7, 18).

4.6.2 Lip Tracking

The next experiment considers a lip tracking problem. Lip tracking is useful in automatic speech recognition, a problem which has been extensively studied in the last two decades [63, 68, 69]. Visual features are useful to improve the performance of speech recognizers, especially in noisy environments. Much work has recently been done on audio-visual speech recognition [7, 29, 66, 76], which aims at recognizing speech using the lip movements and shape as well as audio features. The use of visual information requires accurate lip tracking. This is a difficult task since image analysis techniques often produce outliers which deform the lip estimates. It is shown below that the S-PDAF is a useful tool to perform this task.

Two types of images were used in these experiments: gray scale and RGB images. The gray scale

images were obtained at 12 frames sec−1 and the RGB images at 15 frames sec−1.

In the gray scale examples, three tracking sequences are used with 60 (Figs. 4.8, 4.9), 100 (Figs. 4.10, 4.11), and 75 (Fig. 4.12) images. These sequences correspond to two situations: normal speech (Figs. 4.8, 4.9) and singing (Figs. 4.10 - 4.12). In the speech experiments, the lip boundaries move smoothly, while in the singing case abrupt changes between consecutive frames are observed. In the latter case, the prediction step performs poorly at many time instants.

In these experiments an affine motion model is used. The state vector contains the affine parameters (D = 6) as well as the deformation parameters of the Nc control points,

\[
x(t) = [x_1, \ldots, x_D, \theta^{d_1}_1, \ldots, \theta^{d_1}_{N_c}, \theta^{d_2}_1, \ldots, \theta^{d_2}_{N_c}]^T. \tag{4.23}
\]

In these experiments, the shape boundary is represented by a B-spline with Nc = 12 control points and N = 21 samples are used to compute visual features from the image. The state model is defined by the following matrices

\[
A = I_A, \quad P = k_P I_P, \quad Q = k_Q I_Q, \quad R = I_R, \tag{4.24}
\]

where $I_A$, $I_P$, and $I_Q$ are $(D + 2N_c) \times (D + 2N_c)$ identity matrices, $I_R$ is a $2N \times 2N$ identity matrix, and $k_P$, $k_Q$ are constants. The shape matrix C is given by

\[
C = \begin{bmatrix}
M & O_{N \times 3} & B_{N \times N_c} & O_{N \times N_c} \\
O_{N \times 3} & M & O_{N \times N_c} & B_{N \times N_c}
\end{bmatrix}, \tag{4.25}
\]

where M is the matrix which contains the coordinates of the object boundary (see Appendix A).

Figs. 4.8 and 4.9 display the results obtained with the Kalman filter (first row) and S-PDAF (second row) during the speech utterance. In Fig. 4.8 the Kalman filter produces distorted estimates of the lips due to the presence of outliers close to the predicted contour. When the number of outliers increases (Fig. 4.9), the Kalman tracker loses the object boundary. The S-PDAF performs well in both cases, providing accurate boundary estimates even in the presence of a large number of outliers.

Fig. 4.10 displays the tracking results obtained with the Kalman filter and with S-PDAF in the Sing sequence. Again, the Kalman estimates are attracted by spurious data and the tracker loses the lips. The S-PDAF performs well, as before. Figs. 4.11 (Sing sequence) and 4.12 (Sing1 sequence) show eight pairs of consecutive frames corresponding to rapid changes of the lip contour, together with the lip estimates obtained by the S-PDAF. Despite the large number of outliers, the filter remains robust in both sequences when sudden changes are present in the lip boundary.


Figure 4.8: Lip tracking in the Speech sequence with Kalman filter (first row) and S-PDAF (second row) (frames 1, 7, 10).

Figure 4.9: Lip tracking in the Speech sequence with Kalman filter (first row) and S-PDAF (second row) (frames 18, 28, 30).

Figure 4.10: Lip tracking in the Sing sequence with Kalman filter (first row) and S-PDAF (second row) (frames 1, 6, 8, 12).

Figure 4.11: Lip tracking in the Sing sequence with S-PDAF. Frames 45, 46 (first column), 60, 61 (second column), 66, 67 (third column), 87, 88 (fourth column).

Figure 4.12: Lip tracking in the Sing1 sequence with S-PDAF. Frames 7, 8 (first column), 13, 14 (second column), 31, 32 (third column), 39, 40 (fourth column).


The S-PDAF was also used to track lips from color image sequences. The color images were converted

into scalar images by applying the Fisher linear discriminant to the pixel color components [27]. We

have used HSV color coordinates since they provide a better separation of the lip and skin regions than

the standard RGB. Fig. 4.13 represents the color distributions in the lips and skin regions. It can be

seen that it is easier to separate the two regions in the HSV space than in the RGB space.

Figure 4.13: RGB color distributions (first row; panels RG, RB, GB) and HSV color distributions (second row; panels HS, HV, SV). Dots represent points in the lip region, circles represent points in the skin region.

Let $u \in \mathbb{R}^3$ be the color components of an image pixel. The Fisher discriminant is used to convert u into a scalar intensity as follows

\[
y = f^T u, \tag{4.26}
\]

where [27]

\[
f = (P_L + P_S)^{-1} (I_L - I_S). \tag{4.27}
\]

In (4.27), $I_L$ is the 3 × 1 mean vector and $P_L$ is the 3 × 3 covariance matrix of the color vector in the lip region; $I_S$, $P_S$ have the same meaning but are computed for the skin region.

The vector f computed in (4.27) defines the plane which best separates the two regions.
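
Equations (4.26) and (4.27) can be sketched with plain Python lists. This is only an illustration: the pixel values below are made up, the helper names are hypothetical, sample covariances are used for $P_L$ and $P_S$, and the linear system is solved by a small Gaussian elimination instead of an explicit matrix inverse.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def covariance(vectors, mu):
    """Sample covariance matrix of a list of vectors."""
    n, d = len(vectors), len(mu)
    return [[sum((v[a] - mu[a]) * (v[b] - mu[b]) for v in vectors) / (n - 1)
             for b in range(d)] for a in range(d)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def fisher_vector(lip_pixels, skin_pixels):
    """f = (P_L + P_S)^{-1} (I_L - I_S), eq. (4.27)."""
    mL, mS = mean(lip_pixels), mean(skin_pixels)
    PL, PS = covariance(lip_pixels, mL), covariance(skin_pixels, mS)
    Sw = [[PL[a][b] + PS[a][b] for b in range(3)] for a in range(3)]
    return solve(Sw, [mL[k] - mS[k] for k in range(3)])

def project(f, u):
    """Scalar intensity y = f^T u, eq. (4.26)."""
    return sum(fk * uk for fk, uk in zip(f, u))

# Made-up color samples, for illustration only.
LIP = [(150, 80, 90), (160, 85, 100), (155, 70, 85), (145, 75, 95)]
SKIN = [(180, 140, 120), (190, 150, 125), (185, 145, 135), (175, 135, 130)]
f = fisher_vector(LIP, SKIN)
print([round(project(f, u), 3) for u in LIP + SKIN])
```

After the projection, each pixel is a single scalar, so the edge detector used for gray scale images can be applied unchanged.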


Fig. 4.14 shows the results obtained with Fisher discriminant using RGB and HSV coordinates.

Figure 4.14: Image results after applying the Fisher discriminant to the RGB image (a) and to the HSV image (b).

After applying the transformation (4.26) the color sequences become gray level sequences of images

and the previous algorithm can be directly applied.

Fig. 4.15 shows results obtained with S-PDAF for different speakers. Each sequence has about 70 images. The same model was used in all these experiments, i.e., no manual adjustments were performed in these tests.

The S-PDAF can also be used to track other face features such as the eyebrows (see Fig. 4.16). It is, therefore, useful to estimate facial expressions and to animate face models. The S-PDAF tracker developed in this thesis was used by J. Costeira and J. Maciel to animate the face model described in [46] and integrated in a prototype presented to the public in Parque das Nacoes, 2000.


Figure 4.15: Lip tracking in RGB images. From top to bottom row: Hongfei, Bernhard, Katharina, Maciel, and Luyang sequences.

Figure 4.16: Tracking of multiple objects. Frames 10, 18, 32 (first row), 38, 44, 50 (second row).


4.6.3 Vehicle Tracking

This section presents tracking results obtained in traffic sequences. Again, it is assumed that x(t) is defined by a stochastic difference equation. Since the average velocity is non-zero, the state vector must contain the pose parameters as well as their derivatives. In the following examples no deformation is allowed. Thus, equation (2.16) is rewritten as

\[
x(t) = [x_{1t}, \ldots, x_{Dt}, \dot{x}_{1t}, \ldots, \dot{x}_{Dt}]^T, \tag{4.28}
\]

where $\dot{x}_{it}$ are the derivatives. The shape matrix C is similar to the shape matrix described in Appendix A; the only difference is that the matrix is extended with zero columns to cope with the velocity parameters in x(t).

Two traffic sequences digitized at 12 frames per second are considered. The first sequence, in Fig. 4.17, has 40 images. In this sequence the cars move without rotations or deformations and, therefore, only translations have to be considered (2 degrees of freedom). The state vector is

\[
x(t) = [x_1 \; x_2 \; \dot{x}_1 \; \dot{x}_2]^T. \tag{4.29}
\]

A B-spline with Nc = 12 control points and N = 15 samples is used. The shape matrices are

\[
C = \begin{bmatrix}
\mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0}
\end{bmatrix}, \quad
e = [v^r_1(s_1), \ldots, v^r_1(s_N), v^r_2(s_1), \ldots, v^r_2(s_N)]^T, \tag{4.30}
\]

where $\mathbf{1} = [1, 1, \ldots, 1]^T$ and $\mathbf{0} = [0, 0, \ldots, 0]^T$ are two 2N vectors. The dynamic matrix is

\[
A = \begin{bmatrix}
I_A & I_A \\
O_A & I_A
\end{bmatrix}, \tag{4.31}
\]

where $I_A$ is a D × D identity matrix and $O_A$ is a D × D null matrix. The remaining matrices are

\[
P = k_P I_P, \quad Q = k_Q I_Q, \quad R = k_R I_R, \tag{4.32}
\]

where $I_P$ and $I_Q$ are 2D × 2D identity matrices, $I_R$ is a 2N × 2N identity matrix, and $k_P$, $k_Q$, $k_R$ are constants.
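
The block structure of the dynamic matrix in (4.31) corresponds to a constant velocity model: at each frame the pose is incremented by the velocity, and the velocity is kept. A minimal sketch of the noise-free prediction follows; the function name and the numerical values are illustrative only.

```python
def predict_constant_velocity(x, D):
    """Apply A = [[I, I], [O, I]] of (4.31) to x = [pose (D values), velocity (D values)]."""
    pose, vel = x[:D], x[D:]
    # New pose = pose + velocity; the velocity block is copied unchanged.
    return [p + v for p, v in zip(pose, vel)] + list(vel)

# Translation-only example (D = 2): position (10, 5), velocity (2, -1) per frame.
x = [10, 5, 2, -1]
for _ in range(3):
    x = predict_constant_velocity(x, 2)
print(x)
```

In the tracker the process noise with covariance Q is added on top of this deterministic part, which is what allows the velocity estimate to change over time.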

Fig. 4.18 shows another sequence of 65 images, in which the car performs a left turn. In this case it was assumed that the car motion is described by a Euclidean similarity (4 degrees of freedom). Therefore,

\[
x(t) = [x_1 \; x_2 \; x_3 \; x_4 \; \dot{x}_1 \; \dot{x}_2 \; \dot{x}_3 \; \dot{x}_4]^T. \tag{4.33}
\]

A B-spline with Nc = 12 control points and N = 23 samples is used. The shape matrix is

\[
C = \begin{bmatrix}
M_x & -M_y & O_{N \times N_s} \\
M_y & M_x & O_{N \times N_s}
\end{bmatrix}, \tag{4.34}
\]

where $M_x$, $M_y$ are defined in Appendix B and $O_{N \times N_s}$ is an N × Ns null matrix with Ns = 4. This model allows rotation, translation, and scaling. However, no shearing is allowed.

The dynamic equation is similar to the previous example (see (4.31)) with D = 4. The remaining matrices are

\[
P = \begin{bmatrix} O & O \\ O & k_P I_P \end{bmatrix}, \quad
Q = \begin{bmatrix} O & O \\ O & k_Q I_Q \end{bmatrix}, \quad
R = k_R I_R, \tag{4.35}
\]

where $I_P$, $I_Q$ are D × D identity matrices, $I_R$ is a 2N × 2N identity matrix, O is a D × D null matrix, and $k_P$, $k_Q$, $k_R$ are constants.

The S-PDAF tracker was applied to two sequences. Figs. 4.17, 4.18 show the results. It is shown

that the S-PDAF tracker correctly estimates the vehicle boundary in both examples. Although it is not

shown in these figures, the Kalman filter fails to track the car in these examples due to the presence of

clutter.


Figure 4.17: Tracking results with S-PDAF. Frames 14, 20 (first row), frames 34, 46 (second row).

Figure 4.18: Tracking results with S-PDAF. Frames 5, 19, 28 (first row), frames 37, 47, 61 (second row).

4.7 Computational Complexity Discussion

In this section we compare the complexity of the S-PDAF tracker and the Kalman tracker. Table 4.3 shows the number of operations used in prediction, feature detection, and filtering for most of the examples presented in this chapter. The test conditions are the same for both trackers and the algorithms were programmed in MATLAB. No attempt was made to optimize the code.

From Table 4.3 we observe that the prediction step has equal complexity in both algorithms. This was expected since both methods use the same equations. Furthermore, the complexity of the prediction step is image independent. It only depends on the dimension of the state vector. In the hand tracking example the complexity is lower than in the other examples since a simple translation model is used and, therefore, the state vector is 2-dimensional. In the lips sequences an affine transform was used. The state vector contains the affine parameters plus the coefficients of the shape deformation model. This increases the computational effort. There is a variation in the prediction cost within the lips sequences since a different number of B-spline samples was used in each experiment.

The feature detection has a similar complexity in both trackers. Although the S-PDAF requires the matching of the feature points detected in consecutive measurement lines, this does not introduce a significant increase in complexity. The computational cost of this step depends on the number of samples used in the B-spline and also on the measurement window, which is adaptively chosen by the algorithm based on the state covariance matrix. We notice that the sequences Sing and Sing1 have a large value in the feature detection step, compared to the other lips sequences, since the measurement lines are longer to cope with large variations in the lip boundary.

The filtering step of the S-PDAF method has a higher computational cost than the Kalman tracker.

The filtering step of the Kalman tracker updates the mean and the covariance of the state estimate.

In the S-PDAF tracker the filtering step is more complex and involves three operations: first, we need

to generate all valid interpretations of the stroke data; second we compute the association probabilities

for each valid interpretation; finally, we update the state vector and the covariance matrix using the

association probabilities. This increases the computational burden of the S-PDAF method. The com-

plexity of the S-PDAF tracker depends on the number of samples as well as on the dimension of the

state vector. Furthermore, it also depends on the number of interpretations generated for each image.

Interpretations are, therefore, the crucial aspect which makes the S-PDAF method heavier than the


Kalman tracker.

                          Kalman                                       S-PDAF
Image Sequence  Length    Prediction   Feature Detection  Filtering    Prediction   Feature Detection  Filtering
Hand            20        254 × 10^-6  0.8852             0.1008       254 × 10^-6  0.9731             0.2310
Speech          60        0.1131       1.4248             0.5411       0.1131       1.4780             4.3547
Sing            100       0.1132       3.5753             0.5828       0.1132       3.6002             3.0962
Sing1           80        0.1132       6.6161             0.5827       0.1132       6.5166             5.1191
Hongfei         70        0.0459       1.0978             0.3873       0.0459       1.1364             1.6136
Bernhard        50        0.0456       0.8173             0.2980       0.0456       0.8592             2.0891
Katharina       70        0.0459       1.1166             0.3871       0.0459       1.1578             1.7777
Maciel          50        0.0457       0.8615             0.3261       0.0457       0.9121             4.0872
Luyang          80        0.0458       1.0935             0.3558       0.0458       1.1613             4.7671

Table 4.3: Computational cost of the two methods in floating point operations (results in Mflops per image).
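
The filtering columns of Table 4.3 can be compared directly by taking the ratio between the S-PDAF cost and the Kalman cost for each sequence. The short script below only performs this arithmetic on the tabulated values; the variable names are illustrative.

```python
# Filtering cost per image from Table 4.3: (Kalman, S-PDAF), in Mflops.
FILTERING_MFLOPS = {
    "Hand": (0.1008, 0.2310),
    "Speech": (0.5411, 4.3547),
    "Sing": (0.5828, 3.0962),
    "Sing1": (0.5827, 5.1191),
    "Hongfei": (0.3873, 1.6136),
    "Bernhard": (0.2980, 2.0891),
    "Katharina": (0.3871, 1.7777),
    "Maciel": (0.3261, 4.0872),
    "Luyang": (0.3558, 4.7671),
}

ratios = {name: spdaf / kalman for name, (kalman, spdaf) in FILTERING_MFLOPS.items()}
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:10s} {ratio:5.2f}")
```

The ratio ranges from roughly 2 (Hand) to roughly 13 (Luyang), which is consistent with the observation that the filtering cost depends on the number of interpretations generated for each image.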

4.8 Conclusions

Kalman based trackers are not robust in the presence of invalid features. The reason is simple. The model shape and position are estimated in order to fit the visual features detected in the image. Unfortunately, image analysis methods often produce invalid features (outliers) which have a strong influence on the shape estimates, leading to meaningless tracking results. This raises a major issue in the design of feature based tracking algorithms: how to discriminate valid data from outliers? This chapter provides methods to answer this question. A robust algorithm is proposed which avoids a hard decision by using middle level features (strokes) and considering all possible hypotheses of valid/invalid label sequences, which are denoted as data interpretations. All interpretations are used to update the shape and motion estimates in each new image, but with different weights.

The a posteriori distribution of the unknown variables is a mixture of Gaussians with a growing

number of modes. To avoid the computation burden associated with an exponential increase of modes

the mixture is approximated by a single Gaussian whose mean and covariance depend on the data

interpretations and their weights. The most probable interpretations have larger influence on the shape

and motion estimates than the least probable ones. The proposed algorithm, denoted as S-PDAF, was

tested in several tracking problems. Special attention was paid to the lip tracking problem. In this

case the algorithm was tested with gray scale and color images. In all the tests, the S-PDAF tracker

provided a robust performance in the presence of outliers and sudden shape changes.

Chapter 5

A Robust Multi Model Tracker

5.1 Introduction

Object tracking in video sequences is a hard problem. Several issues contribute to this fact: time varying illumination; changes in the object pose with respect to the camera; the diversity of object shapes and motion regimes, which leads to rapid changes of the object boundary; and the presence of invalid image features produced by other objects or by the background. This suggests that a single model is not enough to accurately represent the evolution of the object boundary and, furthermore, that robust methods should be used to deal with invalid features.

Most tracking algorithms are based on a single dynamic model. This assumption often leads to inaccurate estimates of the object to be tracked since it cannot cope with different motion trajectories and accelerations. To represent the observed data, the tracker should accommodate different motion regimes as well as different shape configurations. This task can be accomplished by using multiple models with switching capabilities, incorporated in a tracking framework. This allows an appropriate dynamic model to be chosen among several candidates at each instant of time.

This chapter presents a robust tracking algorithm based on multiple models which extends the

Multiple Model tracker (MMT) recently proposed in [48, 49]. The Multiple Model tracker deals with

sudden changes in shape or motion by describing the data as the output of a bank of linear filters

equipped with a switching mechanism. This approach allows an efficient choice of the best model

to describe the evolution of the object to be tracked based on the shape and motion parameters.

Inference is performed by adopting a parametric representation of the unknown variables, using mixtures

of Gaussians whose parameters are updated by a tree of Kalman filters. This method has several

weaknesses. First, it is very time consuming. Second, and more important, the performance of the

algorithm is hampered by the presence of outliers. This chapter presents a robust tracking method

which overcomes this difficulty. The proposed algorithm is based on the same switching framework, but

it uses a different inference method based on a tree of S-PDAF filters. The robust multi model tracker

developed in this thesis was published in [57, 59].

5.2 Related Work

Object tracking with time varying shapes and motion regimes is an important problem which appears in several image processing applications, e.g., medical diagnosis, surveillance and human-machine interfaces. Attempts to solve this problem have been made by several researchers. Some works represent the object of interest by several static models (e.g., appearance models). Examples of this approach can be found in [14, 21, 23, 54, 75]. Other works address the estimation of complex parameter trajectories such as those observed in lip tracking or in human motion estimation [8]. Non-linear dynamic models should be used in these problems in order to exploit all the geometric and dynamic restrictions. To avoid the use of such models, multiple dynamic models can be used instead, each model being tailored to a specific motion regime [10, 49, 54].

In this chapter the object motion is represented by a set of dynamic models. This model is inspired by the work described in [48, 49]. The data is approximated by a number of models, each of them describing a motion regime or shape evolution which is considered typical for the specific application. Switched dynamic models were studied in control theory and aeronautics to deal with abrupt changes in dynamic systems (e.g., see [13], [65], [74]), and in target tracking for surveillance applications [6, 28]. These ideas have also been applied to object tracking in [38], which uses multiple hybrid models. In the latter work, the distribution of the unknown parameters is updated using particle filtering (Condensation algorithm), which has difficulties in high dimensional estimation problems.

Two problems have to be addressed if we want to use switched dynamical models in shape tracking.

First, given a video sequence we have to determine which model is active at each instant of time. This

is the labeling problem. Second, we have to estimate the state of the active model using the available

data. This amounts to estimating the shape and motion parameters of the object to be tracked. These

problems have been addressed either by non-parametric techniques [38] or by parametric ones, based

on the propagation of Gaussian mixtures [48, 49].

Although switched models are able to describe complex motion and shape evolution, they fail in the

presence of outliers, i.e., if the image measurements contain invalid data. Typically, the tracker loses

the object boundary when wrong edge points are detected in the image, e.g., edge points belonging to

the background or to inner regions of the object to be tracked. This is a major drawback which prevents

the application of such models in many tracking problems. This difficulty is addressed here. A robust

tracking algorithm is presented which extends the method described in [48, 49]. The proposed tracker

[57, 59] is based on two key concepts already used in Chapter 4. First, middle level features (strokes)

are used instead of low level ones (edge points). Second, a label (valid/invalid) is associated to each

stroke.

The chapter is organized as follows. Section 5.3 provides an overview of multiple dynamic systems.

The robust multi-model tracker (RMMT) is presented in Section 5.4. Section 5.5 presents experimental

results obtained with the RMMT. Section 5.6 concludes the chapter.

5.3 Switched Dynamical Models

In order to estimate the object position and deformation at time step t, we assume that the state vector x(t) is generated by the stochastic difference equation [74]

\[
x(t) = A_{k(t-1),k(t)} \, x(t-1) + w(t), \tag{5.1}
\]

where $w(t) \sim \mathcal{N}(0, Q_{k(t-1),k(t)})$ is white Gaussian noise, $A_{k(t-1),k(t)}$ is the state transition matrix, $k(t) \in \{1, \ldots, m\}$ is the label of the active model at instant t, and m is the number of steady state models (see Fig. 5.1).¹ Each value of k(t) corresponds to a different dynamic model.

It is assumed that the label sequence k(t) is a random sequence modeled by a first order Markov process with transition probability

\[
T_{rq} = p(k(t) = q \mid k(t-1) = r), \tag{5.2}
\]

where r, q ∈ {1, ..., m}. As before, it will be assumed in this chapter that the available observations are the strokes detected in the image, which can be considered either as valid or false. A confidence degree is assigned to each stroke interpretation. It is assumed that y(t) is the vector of all image features detected at instant t, and $y_i(t)$ is a vector with all the features classified as valid according to the ith interpretation. The sensor model associated with the ith interpretation is given by

\[
y_i(t) = {}^{k}C_i \, x(t) + {}^{k}e_i + \eta_i(t), \tag{5.3}
\]

where the matrices C, e depend on the interpretation i as well as on the model label k(t).

Figure 5.1: Switched dynamical systems (blocks: label generation, Model 1, Model 2, ..., Model m; signals: k(t), y(t)).

¹ Transition models are not shown in this figure.

A hybrid state $z(t) = (x(t), k(t))$ is defined which includes the state vector x(t) and the label of the active model. The hybrid state is characterized by the transition density $p(x(t), k(t) \mid x(t-1), k(t-1))$, which can be split as follows

\[
p(x(t), k(t) \mid x(t-1), k(t-1)) = p(x(t) \mid k(t), x(t-1), k(t-1)) \, p(k(t) \mid x(t-1), k(t-1)). \tag{5.4}
\]

The first factor is defined by the dynamic equation (5.1), while the second is an element of the transition matrix of the Markov chain, $T_{k(t-1),k(t)}$.
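
The generative model defined by (5.1), (5.2) and (5.4) can be sketched by sampling from it. The code below is an illustration under stated assumptions: the state is scalar, labels are numbered from 0 instead of 1, and the function name, arguments and numerical values are hypothetical.

```python
import random

def simulate_switched(A, Q, T, x0, k0, steps, seed=0):
    """Sample (x(t), k(t)) from a scalar switched dynamical model.

    A[r][q], Q[r][q]: transition gain and noise variance used when the
    label moves from r to q; T[r][q] = p(k(t) = q | k(t-1) = r).
    """
    rng = random.Random(seed)
    x, k = x0, k0
    path = []
    for _ in range(steps):
        # Second factor of (5.4): draw the new label from row k of T, eq. (5.2).
        q = rng.choices(range(len(T)), weights=T[k])[0]
        # First factor of (5.4): propagate the state with the selected model, eq. (5.1).
        x = A[k][q] * x + rng.gauss(0.0, Q[k][q] ** 0.5)
        k = q
        path.append((x, k))
    return path

# Two regimes: model 0 keeps the state, model 1 shrinks it; switches are rare.
A = [[1.0, 0.5], [1.0, 0.5]]
Q = [[0.01, 0.01], [0.01, 0.01]]
T = [[0.9, 0.1], [0.2, 0.8]]
for x, k in simulate_switched(A, Q, T, x0=1.0, k0=0, steps=5):
    print(k, round(x, 3))
```

The tracker has to solve the inverse problem: given noisy observations of such a sequence, recover both the state and the label of the active model.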

5.4 Density Propagation

The problem to be solved can be formulated as follows: given a set of observations $Y^t = \{y(1), \ldots, y(t)\}$ which may contain outliers, what are the best estimates for the state vector x(t) and the model label k(t)? This is a non-linear filtering problem.

Assuming that the joint probability density function $p(x(t), k(t) \mid Y^t)$ is known, the maximum a posteriori (MAP) estimate of $(x(t), k(t))$ is given by

\[
(\hat{x}(t), \hat{k}(t)) = \arg\max_{x(t), k(t)} p(x(t), k(t) \mid Y^t). \tag{5.5}
\]

Using the law of total probability, the a posteriori density becomes

\[
\begin{aligned}
p(x(t), k(t) \mid Y^t) &= \sum_{K^{t-1}} p(x(t), k(t), K^{t-1} \mid Y^t) \\
&= \sum_{K^{t-1}} p(x(t) \mid K^t, Y^t) \, p(K^t \mid Y^t) \\
&= \sum_{K^{t-1}} c_{K^t} \, p(x(t) \mid K^t, Y^t),
\end{aligned}
\tag{5.6}
\]

where $c_{K^t} = p(K^t \mid Y^t)$ and $K^t = \{k(1), \ldots, k(t)\}$ is the sequence of model labels up to instant t. If the density function $p(x(t) \mid K^t, Y^t)$ is normal, $\mathcal{N}(\hat{x}_{K^t}, P_{K^t})$, the joint density $p(x(t), k(t) \mid Y^t)$ defined in (5.6) is a mixture of Gaussians, each of them being associated with a different label sequence $K^t$.

The propagation of the a posteriori density can be split into two steps: multi model prediction and

multi model filtering. Let us first consider the first step. The prediction step aims to compute

p(

x(t), k(t) | Y t−1)

=∑

Kt−1

cKt|t−1 p(

x(t) | Kt, Y t−1)

, (5.7)

where $c_{K^t|t-1} = P(K^t \mid Y^{t-1})$. These coefficients can be computed recursively as follows

$$c_{K^t|t-1} = P(K^t \mid Y^{t-1}) = P(k(t), K^{t-1} \mid Y^{t-1}) = P(k(t) \mid K^{t-1}, Y^{t-1})\, P(K^{t-1} \mid Y^{t-1}). \tag{5.8}$$

Since $P(k(t) \mid K^{t-1}, Y^{t-1}) = T_{k(t-1),k(t)}$ (see (5.2)) and $c_{K^{t-1}} = P(K^{t-1} \mid Y^{t-1})$, equation (5.8) can be rewritten as follows

$$c_{K^t|t-1} = T_{k(t-1),k(t)}\, c_{K^{t-1}}. \tag{5.9}$$
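The recursion (5.9) can be illustrated with a minimal Python sketch. It assumes that label paths are stored as tuples with 0-based labels (the thesis numbers the models from 1) and that the transition matrix is a nested list; the function name and the numerical values are hypothetical.

```python
def predict_coefficients(coeffs, T):
    """Expand every label path K^{t-1} with each label k(t), as in (5.9).

    coeffs: dict mapping a label path (tuple of ints) to c_{K^{t-1}}.
    T:      row-stochastic transition matrix, T[k_prev][k_next].
    """
    m = len(T)
    predicted = {}
    for path, c in coeffs.items():
        k_prev = path[-1]
        for k_next in range(m):
            # c_{K^t|t-1} = T_{k(t-1),k(t)} * c_{K^{t-1}}
            predicted[path + (k_next,)] = T[k_prev][k_next] * c
    return predicted


# Two models with illustrative (assumed) transition probabilities.
T = [[0.8, 0.2],
     [0.2, 0.8]]
coeffs = {(0,): 0.7, (1,): 0.3}
predicted = predict_coefficients(coeffs, T)
```

Each leaf produces m children, so the predicted coefficients still sum to one when the rows of T do.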

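Assuming that the model sequence is known, $p(x(t) \mid K^t, Y^{t-1}) = \mathcal{N}(\hat{x}_{K^t|t-1}, P_{K^t|t-1})$, and the mean and the covariance are updated by

$$\begin{aligned} \hat{x}_{K^t|t-1} &= A_{k(t-1),k(t)}\, \hat{x}_{K^{t-1}}, \\ P_{K^t|t-1} &= A_{k(t-1),k(t)}\, P_{K^{t-1}}\, A^T_{k(t-1),k(t)} + Q_{k(t-1),k(t)}. \end{aligned} \tag{5.10}$$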
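A minimal sketch of the per-leaf prediction (5.10) is given next, assuming small dense matrices stored as nested lists. The helper names and the values of A, Q, x and P are illustrative and are not taken from the experiments.

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matadd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def predict_leaf(x, P, A, Q):
    """Mean and covariance prediction of one tree leaf, as in (5.10).

    x is a column vector stored as a list of one-element rows.
    """
    x_pred = matmul(A, x)
    P_pred = matadd(matmul(matmul(A, P), transpose(A)), Q)
    return x_pred, P_pred


# Hypothetical position/velocity model for one coordinate.
A = [[1.0, 1.0], [0.0, 1.0]]
Q = [[0.01, 0.0], [0.0, 0.01]]
x = [[2.0], [0.5]]
P = [[1.0, 0.0], [0.0, 1.0]]
x_pred, P_pred = predict_leaf(x, P, A, Q)
```

In the tracker this step would be repeated for every leaf, with A and Q selected by the pair of labels (k(t-1), k(t)).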

In the filtering step, the computation of the mixture modes $c_{K^t}$ depends on the method being used. If all the observations are valid, $p(x(t) \mid K^t, Y^t)$ (Gaussian component) can be updated by Kalman filtering as mentioned in [49]. However, when $y(t)$ is contaminated with outliers, robust filtering methods must be adopted. In fact, assuming that the model sequence $K^t$ is known, the mean and covariance matrix can be computed using the S-PDAF method, described in Chapter 4.

The mean and covariance matrix of the state estimates, updated by S-PDAF, are given by (see (4.11), (4.12) and Appendix C for details)²

$$\hat{x}_{K^t} = \hat{x}_{K^t|t-1} + \sum_{i=1}^{m_i} \alpha_i(t)\, K_i(t)\, \nu_i(t), \tag{5.11}$$

$$P_{K^t} = \Big[ I - \sum_{i=1}^{m_i} \alpha_i(t)\, K_i(t)\, C_i \Big] P_{K^t|t-1} + \sum_{i=0}^{m_i} \alpha_i(t)\, \hat{x}_i(t \mid t)\, \hat{x}_i(t \mid t)^T - \hat{x}(t \mid t)\, \hat{x}(t \mid t)^T, \tag{5.12}$$

where $\hat{x}_i(t \mid t) = E\{x(t) \mid I_i(t), K^t, Y^t\}$, $\alpha_i(t) \triangleq p(I_i(t) \mid K^t, Y^t)$, $K_i(t)$ and $\nu_i(t)$ are the Kalman gain and innovation associated with the interpretation $I_i(t)$, and $K^t$ denotes the path in the tree structure (see Fig. 5.2).

To update the coefficients $c_{K^t}$ in a robust framework, a new update law is required. After some manipulations (see Appendix D.1)

$$c_{K^t} = \gamma\, c_{K^t|t-1} \sum_i \alpha_i(t) \prod_{j=1}^{M} \prod_{n=b_j}^{e_j} E_i^j(s_n, t), \tag{5.13}$$

where $\gamma$ is a normalization constant, $c_{K^t|t-1}$ is the predicted mixture coefficient, $\alpha_i(t)$ is the association probability assigned to the data interpretation $I_i(t)$, $M$ is the number of strokes, $b_j$, $e_j$ are the indices of the first and last samples of the $j$th stroke, and $E$ is a normal or uniform distribution, depending on whether the stroke $j$ is considered valid or invalid in the interpretation $I_i(t)$ (see (4.17)).
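The update law (5.13) can be sketched as follows. This is only an illustration: it assumes that, for each leaf, the association probabilities and the per-sample values of E for each interpretation have already been computed elsewhere, and it realizes γ by normalizing the coefficients so that they sum to one. All names and numbers are hypothetical.

```python
from math import prod


def update_coefficients(predicted, alphas, likelihoods):
    """Robust update of the mixture coefficients, following (5.13).

    predicted[path]      : c_{K^t|t-1}
    alphas[path][i]      : association probability alpha_i(t)
    likelihoods[path][i] : list with the values E_i^j(s_n, t) over all
                           strokes j and samples n of interpretation i
    """
    unnormalized = {}
    for path, c_pred in predicted.items():
        mix = sum(a * prod(E) for a, E in zip(alphas[path], likelihoods[path]))
        unnormalized[path] = c_pred * mix
    total = sum(unnormalized.values())  # plays the role of 1 / gamma
    return {path: value / total for path, value in unnormalized.items()}


# Two leaves, two interpretations each, two feature samples (assumed values).
predicted = {(0,): 0.5, (1,): 0.5}
alphas = {(0,): [0.5, 0.5], (1,): [0.5, 0.5]}
likelihoods = {
    (0,): [[0.1, 0.1], [0.4, 0.4]],
    (1,): [[0.1, 0.1], [0.2, 0.2]],
}
updated = update_coefficients(predicted, alphas, likelihoods)
```

The leaf whose model explains the valid strokes better receives the larger coefficient, while the uniform terms of invalid strokes contribute equally to both leaves.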

² Matrices $C$, $K$ as well as $\hat{x}(t \mid t)$, $\hat{x}_i(t \mid t)$, $\alpha_i$, $\nu_i$ and $E$ depend on the label $k$. This dependence was omitted for the sake of simplicity.


The multi-model tracker is a special case of this algorithm when all the data is considered valid (see Appendix A). In this case, equation (5.13) is given by

$$c_{K^t} = \gamma\, c_{K^t|t-1} \prod_{n=1}^{L} E(s_n, t). \tag{5.14}$$

The filter defined in (5.6-5.13) is denoted as Robust Multi Model tracker (RMMT).

The computation of (5.11), (5.12), and (5.13) is organized in a tree structure, each branch being

characterized by cKt , xKt , and PKt (Kt defines a tree path from the root to one of the leaves, see Fig.

5.2). The structure illustrated in Fig. 5.2 shows that the number of leaves (Gaussian components)

exponentially increases as time increases.

Assuming that we have $m$ label values, the mixture will have $m^t$ modes at time $t$. In practice the number of modes must be limited. Several strategies can be used to achieve this goal, e.g., mode merging and mode elimination [48]. In this chapter, the second method is adopted: the mixture components with small coefficients are discarded and the remaining coefficients are normalized.
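Mode elimination can be sketched in a few lines of Python. The threshold value and the fallback that keeps the strongest mode when every coefficient is below the threshold are assumptions made for this illustration.

```python
def prune_modes(coeffs, threshold):
    """Discard leaves with small coefficients and renormalize the rest."""
    kept = {path: c for path, c in coeffs.items() if c >= threshold}
    if not kept:
        # Assumed safeguard: never delete the whole mixture.
        best = max(coeffs, key=coeffs.get)
        kept = {best: coeffs[best]}
    total = sum(kept.values())
    return {path: c / total for path, c in kept.items()}


coeffs = {(0, 0): 0.56, (0, 1): 0.14, (1, 0): 0.06, (1, 1): 0.24}
pruned = prune_modes(coeffs, threshold=0.1)
```

With m = 2 labels this keeps the number of leaves bounded instead of doubling it at every frame.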

[Figure: tree levels labelled t = 1, t = 2, ..., t = n; each leaf carries $c_{K^t}$, $\hat{x}_{K^t}$, $P_{K^t}$.]

Figure 5.2: Tree structure of RMM tracker (m = 3).

5.5 Contour prediction and update

Let us now consider contour prediction. Given $p(x(t), k(t) \mid Y^{t-1})$, we wish to predict the object contour at time $t$. To accomplish this, we first estimate the model label by using the MAP method as follows

\begin{align}
\hat{q}(t) &= \arg\max_q P\{k(t) = q \mid Y^t\} \tag{5.15} \\
&= \arg\max_q \sum_{K^{t-1}} P(k(t) = q, K^{t-1} \mid Y^t) \notag \\
&= \arg\max_q \sum_{K^t:\,k(t)=q} c_{K^t}. \tag{5.16}
\end{align}

To compute q(t) we must add the coefficients of all the mixture components (tree leaves) for which

k(t) = q and choose the maximum.
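The label selection in (5.16) amounts to summing the coefficients of the leaves by their last label. A minimal sketch, with 0-based labels and made-up coefficients:

```python
def best_label(coeffs):
    """Return the MAP label and the summed coefficients per label, as in (5.16)."""
    totals = {}
    for path, c in coeffs.items():
        label = path[-1]  # k(t) is the last element of the path K^t
        totals[label] = totals.get(label, 0.0) + c
    q = max(totals, key=totals.get)
    return q, totals


coeffs = {(0, 0): 0.30, (0, 1): 0.25, (1, 0): 0.05, (1, 1): 0.40}
q, totals = best_label(coeffs)
```

Here the paths ending in label 1 accumulate 0.65 of the probability mass, so label 1 is selected even though the single strongest leaf is not much larger than the others.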

Then, we compute the state estimate for the best model $q$

$${}^{q}\hat{x}_{t|t-1} = \gamma \sum_{K^t:\,k(t)=q} c_{K^t|t-1}\, \hat{x}_{K^t|t-1}, \tag{5.17}$$

i.e., we compute a weighted average of all mean vectors associated with the tree paths such that $k(t) = \hat{q}(t)$. The uncertainty of the state estimate is measured by the covariance matrix

$$P_{K^t|t-1} = \sum_{K^t} c_{K^t|t-1}\, P_{K^t|t-1}. \tag{5.18}$$

The predicted contour is then computed as follows

$$\hat{y}(t) = {}^{q}C\; {}^{q}\hat{x}_{t|t-1} + {}^{q}e. \tag{5.19}$$
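The weighted average in (5.17) can be sketched as below. The sketch assumes that γ normalizes the coefficients of the selected paths so that they sum to one; the paths, coefficients and mean vectors are invented for the example.

```python
def state_for_label(coeffs, means, q):
    """Weighted average of the leaf means whose path ends with label q (5.17)."""
    selected = [(c, means[path]) for path, c in coeffs.items() if path[-1] == q]
    gamma = 1.0 / sum(c for c, _ in selected)  # assumed normalization
    dim = len(selected[0][1])
    return [gamma * sum(c * mean[d] for c, mean in selected) for d in range(dim)]


coeffs = {(0, 1): 0.2, (1, 1): 0.6, (1, 0): 0.2}
means = {(0, 1): [1.0, 0.0], (1, 1): [3.0, 2.0], (1, 0): [9.0, 9.0]}
x_q = state_for_label(coeffs, means, q=1)
```

The leaf ending in label 0 is ignored, so its mean does not influence the estimate; the predicted contour (5.19) would then be obtained by applying the shape matrices of model q to this vector.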

Once the observations $y(t)$ are known, the object contour can be computed from an updated estimate of the state vector. The mean square method was used to update the state vector as before (see Appendix D.3), leading to

$${}^{q}\hat{x}_{t|t} = \gamma \sum_{K^t:\,k(t)=q} c_{K^t} \sum_{i=0}^{m_i} {}^{k}\alpha_i(t)\; {}^{k}\hat{x}_i(t \mid t). \tag{5.20}$$

The state estimate is obtained by using the most probable model, as a weighted sum of the estimates associated with the tree paths ending with the label $q$.

The contour update becomes

$$\hat{y}(t) = {}^{q}C\; {}^{q}\hat{x}_{t|t} + {}^{q}e. \tag{5.21}$$

Table 5.1 describes the Robust Multi Model tracker.


Table 5.1: Robust Multi Model Tracker.

Robust Multi Model Tracker

1. Prediction: (Expand each leaf with m new leaves, each one associated with a different label $k(t)$.) For each leaf of the tree, predict the state mean $\hat{x}_{K^t|t-1}$, covariance $P_{K^t|t-1}$, and mixture coefficient $c_{K^t|t-1}$

$$\hat{x}_{K^t|t-1} = A_{k(t-1),k(t)}\, \hat{x}_{K^{t-1}},$$

$$P_{K^t|t-1} = A_{k(t-1),k(t)}\, P_{K^{t-1}}\, A^T_{k(t-1),k(t)} + Q_{k(t-1),k(t)},$$

$$c_{K^t|t-1} = T_{k(t-1),k(t)}\, c_{K^{t-1}}.$$

2. Contour prediction:

$${}^{q}\hat{x}_{t|t-1} = \gamma \sum_{K^t:\,k(t)=q} c_{K^t|t-1}\, \hat{x}_{K^t|t-1},$$

$$P_{K^t|t-1} = \sum_{K^t} c_{K^t|t-1}\, P_{K^t|t-1},$$

$$\hat{y}(t) = {}^{q}C\; {}^{q}\hat{x}_{t|t-1} + {}^{q}e,$$

where $q$ is the label of the best model computed in step 6.

3. Image Analysis: Detect features in the vicinity of $\hat{y}(t)$ and organize them in strokes.



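4. Filtering: For each leaf (component of the state vector) compute

a) Association probabilities: $\alpha_i(t) \triangleq p(I_i(t) \mid K^t, Y^t)$

b) Mean and covariance matrix of the state estimate and mixture coefficients for each tree path

$$\hat{x}_{K^t} = \hat{x}_{K^t|t-1} + \sum_{i=1}^{m_i} {}^{k}\alpha_i(t)\; {}^{k}K_i(t)\; {}^{k}\nu_i(t),$$

$$P_{K^t} = \Big[ I - \sum_{i=1}^{m_i} {}^{k}\alpha_i(t)\; {}^{k}K_i(t)\; {}^{k}C_i \Big] P_{K^t|t-1} + \sum_{i=0}^{m_i} {}^{k}\alpha_i(t)\; {}^{k}\hat{x}_i(t \mid t)\; {}^{k}\hat{x}_i(t \mid t)^T - {}^{k}\hat{x}(t \mid t)\; {}^{k}\hat{x}(t \mid t)^T,$$

$$c_{K^t} = \gamma\, c_{K^t|t-1} \sum_i {}^{k}\alpha_i(t) \prod_{j=1}^{M} \prod_{n=b_j}^{e_j} {}^{k}E_i^j(s_n, t).$$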

5. Mode elimination: discard the leaves with mixture coefficients below a given threshold, and normalize the mixture coefficients $c_{K^t}$.

6. Choice of the best model and the best state vector:

$$\hat{q}(t) = \arg\max_q \sum_{K^t:\,k(t)=q} c_{K^t},$$

$${}^{q}\hat{x}_{t|t} = \gamma \sum_{K^t:\,k(t)=q} c_{K^t}\, \hat{x}_{K^t}.$$

7. Contour update:

$$\hat{y}(t) = {}^{q}C\; {}^{q}\hat{x}_{t|t} + {}^{q}e.$$

8. Repeat steps 3-7 a number of times (e.g., 2 times).

9. Return to step 1.



This section presents the results obtained with the Robust Multi Model tracker. Examples of gesture,

lip, and heart tracking are shown. A comparison between the proposed method and the Multi Model

tracker (MMT) described in [48] is given.

5.6.1 Hand Tracking

This example was considered before in Section 4.6. The tracker aims to follow a hand moving in front of a shirt. The frame rate is 12 frames/s and the image size is 288 × 384 pixels. The hand motion is characterized by a translation vector and by its velocity. Therefore, the state vector is defined by

$$x(t) = [x_1\ \ x_2\ \ \dot{x}_1\ \ \dot{x}_2]^T, \tag{5.22}$$

where $x_1$, $x_2$ are the displacements in two orthogonal directions and $\dot{x}_1$, $\dot{x}_2$ are the velocity components. In this experiment a B-spline with $N_c = 12$ control points is used and the B-spline is sampled at $N = 21$ points to obtain image features.

To represent the upward and downward motion of the hand, two models are used. Each model is adapted to a specific motion regime. This can be accomplished by adopting different dynamics for each model. The dynamic matrices used in this experiment are

$$A_m = \begin{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & \Delta_m \end{bmatrix} & I \\ O & I \end{bmatrix}, \quad m = 1, 2, \tag{5.23}$$

where $I$, $O$ are $D \times D$ identity and null matrices respectively, $\Delta_1 = 0.91$, and $\Delta_2 = 1.10$. The first model describes the upward motion of the hand, while the second describes a downward motion.
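The effect of the two matrices in (5.23) can be checked with a short sketch for D = 2. The state ordering follows (5.22); the interpretation of the second displacement as the vertical image coordinate (growing downwards) and the initial state are assumptions of this example.

```python
def dynamic_matrix(delta):
    """A_m of (5.23) for D = 2, written out as a 4 x 4 matrix."""
    return [
        [1.0, 0.0,   1.0, 0.0],
        [0.0, delta, 0.0, 1.0],
        [0.0, 0.0,   1.0, 0.0],
        [0.0, 0.0,   0.0, 1.0],
    ]


def apply(A, x):
    """Matrix-vector product A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


x = [5.0, 10.0, 0.0, 0.0]              # displacements and zero velocities
up = apply(dynamic_matrix(0.91), x)    # model 1
down = apply(dynamic_matrix(1.10), x)  # model 2
```

With zero velocity, model 1 scales the second displacement from 10 to 9.1 and model 2 scales it to 11, so the two predictions bracket the current position along that direction.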

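The shape matrices are

$$C = \begin{bmatrix} \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} \end{bmatrix}, \quad e = [v^r_1(s_1), \dots, v^r_1(s_N), v^r_2(s_1), \dots, v^r_2(s_N)]^T, \tag{5.24}$$

where $\mathbf{1}$ and $\mathbf{0}$ are $N \times 1$ vectors (cf. (A.9)) and $e$ is a $2N \times 1$ vector containing the coordinates of the reference shape.

The covariance matrices $P$, $Q$, $R$ are given by

$$P = k_P I_P, \quad Q = k_Q I_Q, \quad R = I_R, \tag{5.25}$$

where $I_P$, $I_Q$, and $I_R$ are $2D \times 2D$ and $2N \times 2N$ identity matrices respectively, and $k_P$, $k_Q$ are constants.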

The goal is to estimate the correct switching between these two models and to track the hand

position along a video sequence.

Fig. 5.3 shows the initial position of the hand, and the prediction result given by the two models.

Figure 5.3: Contour prediction using models 1 (dots) and 2 (dashed line).

Fig. 5.4 shows the predicted contours as well as the estimated model obtained by the multi-model tracker. This algorithm fails to choose the correct dynamical model. We notice that when the motion direction changes (frame 8), several outliers appear in the image, making the tracking problem more difficult. The presence of false observations produces errors in the choice of the best model. The hand motion remains described by the upward motion model.

Fig. 5.5 shows the results when the Kalman filters are replaced by S-PDAF filters in the tree structure. Now, the RMMT manages to solve this situation well, correctly switching between the upward motion and the downward motion. This example illustrates the usefulness of the MM approach in the case of abrupt motion changes.

5.6.2 Lip Tracking

We have applied the same methods to the lip tracking problem presented in Chapter 4. These images were acquired with the same frame rate as in the previous example, and the image size is also the same. The test conditions are similar to the ones described in Chapter 4. The state vector and the shape matrices are defined in (4.23) and (4.25), assuming that the observed data is an affine transform of the reference shape with deformation. B-splines with $N_c = 12$ control points sampled at $N = 21$ points are used. The structure of the matrices $P$, $Q$, $R$ is the one adopted in (4.24).

Figure 5.4: Hand tracking with Multi Model tracker. Predicted contours (first row), estimated contour (second row). Active model: 1,2,2,2. (Frames 4, 7, 9, 15).

As in the previous example two models are used. The first model performs a vertical contraction of

the object boundary, using the boundary estimate computed in the previous frame. The second model

expands the object contour. The first model describes the evolution of the lip contour while the mouth

is closing whereas the second model is tailored to the mouth opening. The predicted contours provided

by these models are depicted in Fig. 5.6, where the contraction and expansion are obtained from the

predicted contour of the initial position of the lips. The A matrices are defined by

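$$A_m = \mathrm{diag}(l_m, \mathbf{1}), \quad m = 1, 2, \tag{5.26}$$

with size $(D + 2N_c) \times (D + 2N_c)$, where $\mathbf{1}$ is a $2N_c \times 1$ vector and $l_m = [1\ 1\ 1\ 1\ \Delta_m\ 1]$ is a $D \times 1$ vector with $\Delta_1 = 0.6$, $\Delta_2 = 1.2$.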

Figs. 5.7 - 5.9 illustrate the performance of MMT and RMMT in the lip tracking problem. Fig. 5.7

shows three frames in which the MM tracker produces wrong results due to the presence of outliers.

The model sequence is wrong (the contraction model is selected instead of the expansion model in the

third frame) and the contour estimate becomes wrapped. Both difficulties are overcome by the RMMT

which selects the correct model sequence and provides accurate lip estimates.


Figure 5.5: Hand tracking with Robust Multi Model tracker. Predicted contours (first row), estimated contour (second row). Active model: 1,1,2,2. (Frames 4, 7, 9, 15).

Figure 5.6: Contour prediction using models 1 (dots) and 2 (dashed line).

A more difficult situation is presented in Fig. 5.8. In this case, the MMT loses the boundary of

the lips and does not manage to recover from this error afterwards. The RMMT estimates the correct

dynamic model, exhibiting remarkable robustness in the presence of a large number of outliers.

Fig. 5.9 shows the performance of the RMMT, when the mouth progressively opens. The first row

shows the predicted contours. It is clear that the expansion model is the one which better describes the

motion of the mouth in these frames. The RMMT always chooses the correct model in this example.

Another set of consecutive frames is shown in Fig. 5.10. This example illustrates the performance of the RMMT in the case of abrupt shape changes. In this case the mouth remains closed for a couple of frames and suddenly opens. Although the algorithm selects the correct model, the expansion factor adopted in model 2 is not enough to accurately predict the contour in frame 61. Even so, the tracker manages to follow the lip contour well. To improve the prediction, the number of dynamic models and their parameters should be estimated from the data using learning algorithms.

Figure 5.7: Lip tracking with Multi Model tracker (first row, active model: 2 1 1) and Robust Multi Model tracker (second row, active model: 2 1 2), (frames 8, 9, 13).

5.6.3 Heart Tracking

The RMMT was tested on echocardiographic images.³ The goal is to track the walls of the left ventricle in ultrasound images. An example of an ultrasound image is displayed in Fig. 5.11, showing the 4 heart cavities. The ECG signal is shown at the bottom. The sequence has 80 images obtained at a frame rate of 15 frames/s. The size of the images is 360 × 480.

Each ultrasound image displays a cross section of the human body (details about the acquisition system can be found in [39]). This is done by moving or rotating a transducer crystal (phased-array). A digital scan converter performs the mapping of the data for display.

Since ultrasound images are noisy, a filtering operation with a median filter was performed before applying the feature detection method used in this chapter.⁴

We have considered the group of Euclidean transformations to model the heart motion. Therefore, four motion parameters and deformation are considered

$$x(t) = [x_1\ \ x_2\ \ x_3\ \ x_4,\ \theta^d_{11}, \dots, \theta^d_{1N_c},\ \theta^d_{21}, \dots, \theta^d_{2N_c}]^T. \tag{5.27}$$

³ The images were kindly provided by Joao Sanches and were obtained at Instituto Cardiovascular de Lisboa with the support of Prof. Fausto Pinto of the Faculty of Medicine of the University of Lisbon.

⁴ Different approaches based on statistical models of the ultrasound image have been used with success in this problem [24].

Figure 5.8: Lip tracking with Multi Model tracker (first row, active model: 2 1 1) and Robust Multi Model tracker (second row, active model: 2 2 2), (frames 16, 27, 46).

In this experiment, a B-spline with $N_c = 25$ control points is used, sampled at $N = 48$ points. The observation model is defined by

$$C = \begin{bmatrix} M_x & -M_y & B_{N \times N_c} & O_{N \times N_c} \\ M_y & M_x & O_{N \times N_c} & B_{N \times N_c} \end{bmatrix}, \quad e = O_{2N \times 1}, \tag{5.28}$$

where $C$ is a $2N \times (D + 2N_c)$ matrix, $B$ is an $N \times N_c$ interpolation B-spline matrix, $O$ is a null matrix with the appropriate dimensions, and $M_x$, $M_y$ are $N \times 2$ matrices with the coordinates of the reference shape (see Appendix A).

The following matrices were also used

$$P = k_P I_P, \quad Q = k_Q I_Q, \quad R = I_R, \tag{5.29}$$

where $I_P$, $I_Q$, and $I_R$ are $(D + 2N_c) \times (D + 2N_c)$ and $2N \times 2N$ identity matrices respectively, and $k_P$, $k_Q$ are constants.

Figure 5.9: Lip tracking with Robust Multi Model: predicted contours (first row) and estimated contours (second row), (frames 45, 46, 47), active model: 2 2 2.

We have used two models to track the boundary of the left ventricle. The $A$ matrices used in this experiment are

$$A_m = \mathrm{diag}(l_m, \mathbf{1}), \quad m = 1, 2, \tag{5.30}$$

with size $(D + 2N_c) \times (D + 2N_c)$, where $\mathbf{1}$ is a $2N_c \times 1$ vector and $l_m = [\Delta_m\ 1\ 1\ 1]$ is a $D \times 1$ vector with $\Delta_1 = 0.9$, $\Delta_2 = 1.1$.

These two models perform a contraction (model 1) and an expansion (model 2) of the reference shape. The predicted contours provided by these models are depicted in Fig. 5.12.

Figs. 5.13, 5.14, and 5.15 show 3 cardiac cycles displaying the motion of the endocardium and the motion of the mitral valve, which exhibits smaller time constants.

The cardiac cycle comprises two different phases: diastole (relaxation) and systole (contraction). In the diastole the ventricle relaxes and the mitral valve is open. In the systole the ventricle contracts and the mitral valve closes. The RMMT must choose the correct model (contraction or expansion), each of which is associated with one of the two phases of the cardiac cycle.

Figure 5.10: Lip tracking with Robust Multi Model: predicted contours (first row) and estimated contours (second row), (frames 59, 60, 61, 62), active model: 1 1 2 2.

In Figs. 5.13, 5.14, and 5.15, the upper row corresponds to the systole (contraction) phase and

the lower row corresponds to the relaxation phase (diastole). In the systole the tracker chooses the

contraction model (model 1). This phase of the cardiac cycle is initiated by the peak of the ECG

signal, known as QRS complex, which represents ventricular depolarization. Its duration ends after the

occurrence of the T-wave (the perturbation after the QRS complex) which represents the ventricular

repolarization. In the diastole, the RMMT chooses the expansion model (model 2). This phase starts

after the occurrence of the T-wave (when the ECG signal is stable) until a peak of the ECG is observed.

The model is unable to accurately track rapid changes of the mitral valve (see the lower-left images of Figs. 5.14 and 5.15). This happens because the valve has a sudden motion, producing speckle noise near its boundary, and the contour is more complex in this region. However, this problem does not affect the choice of the model, and the RMMT correctly estimates the mitral valve when it closes (see the second rows of Figs. 5.14 and 5.15).

Fig. 5.16 shows the model label during 3 cardiac cycles. During the systole phase, the contraction

model (model 1) should be active. In the diastole phase, the second model should be chosen.

In this way it is possible to check the model classification provided by the trackers. Fig. 5.16 displays the correct model classification on the top, and the estimates provided by the two multiple model trackers (middle and bottom rows). It can be seen in Fig. 5.16 (a) that the label estimated by the RMMT provides a good estimate of the cardiac phase. The MMT estimate fails most of the time. These results were obtained without providing any information concerning the model switching probabilities, i.e., all the elements of the Markov transition matrix $T_{k(t-1),k(t)}$ are equal to 0.5, so that all transitions have the same probability. This is not a realistic assumption. The same experiment was repeated assuming that $p_{ii} = 0.8$, $i = 1, 2$ (see Fig. 5.16 (b)). In this case better label estimates are achieved. This example shows that the RMMT can be used to detect the beginning of each phase of the cardiac cycle.

Figure 5.11: An ultrasound image showing the left cardiac ventricle.

Figure 5.12: Contour prediction using models 1 (dots) and 2 (dashed line).
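The influence of the transition probabilities on the label estimate can be illustrated with a deliberately simplified sketch. It filters only the model label (not the state vector), so it is not the RMMT itself; the per-frame likelihood values are invented, and labels are 0-based.

```python
def filter_labels(likelihoods, T, prior):
    """Recursive label filtering: predict with T, weight by the likelihood, normalize."""
    m = len(prior)
    belief = list(prior)
    labels = []
    for lik in likelihoods:
        predicted = [sum(belief[i] * T[i][j] for i in range(m)) for j in range(m)]
        unnormalized = [predicted[j] * lik[j] for j in range(m)]
        total = sum(unnormalized)
        belief = [u / total for u in unnormalized]
        labels.append(max(range(m), key=lambda j: belief[j]))
    return labels


# Frame 3 carries weak, misleading evidence in favour of label 1.
likelihoods = [(0.9, 0.1), (0.9, 0.1), (0.45, 0.55), (0.9, 0.1)]
flat = filter_labels(likelihoods, [[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5])
sticky = filter_labels(likelihoods, [[0.8, 0.2], [0.2, 0.8]], [0.5, 0.5])
```

With equal transition probabilities the estimated label follows the likelihood of each frame alone and switches at frame 3, whereas with $p_{ii} = 0.8$ the weak evidence is not enough to trigger a switch.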

The robust multiple model tracker thus provides two levels of information: estimates of the left ventricle boundary and the phase of the cardiac cycle. This type of information cannot be obtained if a single model is used.


Figure 5.13: Tracking in the first cardiac cycle with RMMT: active contraction model, frames 14, 16, 18, 19 (upper row), active expansion model, frames 20, 28, 31, 32 (bottom row).

Figure 5.14: Tracking in the second cardiac cycle with RMMT: active contraction model, frames 34, 36, 38, 40 (upper row), active expansion model, frames 42, 45, 48, 52 (bottom row).

Figure 5.15: Tracking in the third cardiac cycle with RMMT: active contraction model, frames 54, 57, 59, 61 (upper row), active expansion model, frames 62, 64, 66, 69 (bottom row).

[Figure: panels (a) and (b), model label versus frame number (frames 10 to 70).]

Figure 5.16: Ideal label sequence (top) and estimated label sequence using the RMMT (middle) and MMT (bottom) during 3 cardiac cycles.


5.7 Conclusions

This chapter presents a new algorithm for tracking moving objects in video sequences, based on multiple

switched dynamic models. The evolution of the state vector is described by a bank of stochastic

difference equations. Furthermore, it is assumed that the visual features detected in the image contain

outliers, i.e., invalid features which do not belong to the object boundary. A robust filtering algorithm is

proposed which is able to deal with multiple dynamics and invalid observations. This is accomplished by

propagating the a posteriori density of the unknown parameters using Gaussian mixtures. Experimental

results presented in this chapter show that significant improvements are achieved compared with the results obtained by the multi model tracker recently proposed in [48, 49]. The algorithm was tested in hand and lip tracking. It was experimentally observed that the proposed method efficiently estimates the

best model in these problems even in the presence of noisy measurements and outliers. The RMMT was

also applied to the tracking of the left ventricle in ultrasound images. This is a difficult problem which

remains an active research topic. The tracker performed reasonably well in this example providing

two levels of information: shape estimates and binary labels classifying the phase of the cardiac cycle.

This is an important advantage obtained by using the multiple models. Training methods should be

developed in the future to estimate the dynamic models from the data instead of using models defined

by the user.

Chapter 6

Conclusions and Future Work

6.1 Conclusions

This thesis provides robust algorithms for the estimation of objects (static and dynamic) in video

sequences. The main problem which was addressed concerns the presence of invalid features (outliers)

which jeopardizes the performance of classical active contour methods. Three methods were considered

in this thesis: snakes, the Kalman tracker and the Multi Model tracker. They all provide wrong shape

estimates in the presence of outliers.

This thesis proposes modified versions of these methods which are able to cope with outliers and

provide accurate estimates of the object boundaries in complex scenes. The new algorithms are denoted

as adaptive snakes, Shape Probabilistic Data Association Filter (S-PDAF) and Robust Multi-Model

Tracker (RMMT).

Although each method has its own details, they all exploit two key ideas which can eventually be applied to improve other tracking algorithms as well:

1. The use of middle level features. We choose middle level features (strokes) instead of low level

ones (edge points), which are used in most shape analysis methods.

2. Outlier model. It is assumed that the features detected in the image can be either valid or invalid.

A statistical model is used to represent both types of features. Confidence degrees are used to

measure the confidence on each detected feature.

Experimental tests were performed to evaluate the proposed algorithms in several video tracking

problems. The proposed algorithms exhibit a clear improvement of performance, compared with the


original methods used as a starting point. Robust shape estimation is achieved by all of them.

• Strokes are more reliable and informative than edge points, since they allow a better description

of the object shape. A comparison of methods based on both types of features is performed in

[61], middle level features being considered as more reliable than low level ones.

• The use of the adaptive potential in the snake method (adaptive snakes) reduces the influence of the outlier valleys, since they usually have negligible weights after the first few iterations.

• The use of statistical data models is crucial to discriminate valid data from invalid data, avoiding hard decisions. In the algorithms described in this thesis, this leads to the use of weights or confidence degrees associated with the detected features.

6.2 Future Work

This thesis presents robust methods for object estimation and tracking in video sequences which solve

most of the problems caused by the invalid features detected in the image.

Several directions can be followed in the future in order to improve the proposed methods and to

extend the concepts presented in this thesis to other algorithms. One direction concerns the extension

of the adaptive potential to tracking applications. A motion model should be included. This is possible,

since adaptive snakes were performed in a probabilistic framework which allows the use of inference

techniques to propagate the distribution of the unknown parameters.

A second issue concerns the robust multi-model tracker proposed in this thesis. It is not clear if this

method performs better than single model techniques. Therefore, a comparison with other methods

(e.g., S-PDAF) should be performed in the future to clarify this issue. Furthermore, there are several

open issues to be investigated namely the estimation of the number of models and the estimation of the

model parameters from the available data.

Some of these issues are already being addressed in collaboration with Prof. Gilles Celeux of INRIA.

Appendix A

Shape Models

This appendix describes the shape models used in this thesis. These models account for a global motion

of a shape template plus a deformation.

It is assumed that the object boundary $v_t(s)$ is given by

$$v_t(s) = G\,v^r(s) + v^d_t(s), \tag{A.1}$$

where $G$ is a geometric transformation, $s \in I$ is a parameter defining the location of a point in the curve, and $v^d_t(s) = (v^d_{1t}(s), v^d_{2t}(s))$ is the curve deformation. Furthermore, it is assumed that

$$v(s) = \sum_{n=1}^{N_c} \theta_n\, \phi_n(s), \tag{A.2}$$

where $N_c$ is the number of control points, $\phi_n(s)$ are B-spline basis functions, and $\theta_n$ are the control points. An introduction to B-splines in shape modeling can be found in [10].

Several transforms can be used. In this work, translation, Euclidean or affine transforms were considered, depending on the example being studied. Therefore, the object boundary $v_t(s) = (v_{1t}(s), v_{2t}(s))$ is given by one of the following expressions

Translation:

$$\begin{aligned} v_{1t}(s) &= v^r_1(s) + x_{1t} + v^d_{1t}(s) \\ v_{2t}(s) &= v^r_2(s) + x_{2t} + v^d_{2t}(s), \end{aligned} \tag{A.3}$$

Euclidean Similarity:

$$\begin{aligned} v_{1t}(s) &= x_{1t}\, v^r_1(s) - x_{3t}\, v^r_2(s) + x_{2t} + v^d_{1t}(s) \\ v_{2t}(s) &= x_{3t}\, v^r_1(s) + x_{1t}\, v^r_2(s) + x_{4t} + v^d_{2t}(s), \end{aligned} \tag{A.4}$$

Affine Transform:

$$\begin{aligned} v_{1t}(s) &= x_{1t}\, v^r_1(s) + x_{2t}\, v^r_2(s) + x_{3t} + v^d_{1t}(s) \\ v_{2t}(s) &= x_{4t}\, v^r_1(s) + x_{5t}\, v^r_2(s) + x_{6t} + v^d_{2t}(s), \end{aligned} \tag{A.5}$$

where $x_t = [x_{1t}, x_{2t}]^T$ (translation), $x_t = [x_{1t}, \dots, x_{4t}]^T$ (Euclidean), and $x_t = [x_{1t}, \dots, x_{6t}]^T$ (affine) are the motion parameters at instant $t$.
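As an illustration of (A.4), the sketch below applies the Euclidean similarity to a few reference points, with the deformation term omitted. Writing the parameters as $x_1 = s\cos\theta$ and $x_3 = s\sin\theta$ for a scale $s$ and rotation angle $\theta$ is an interpretation added for the example, and the reference points are arbitrary.

```python
from math import cos, sin, pi


def euclidean_similarity(ref, x):
    """Apply (A.4) without deformation.

    ref: list of reference points (v1, v2); x = (x1, x2, x3, x4).
    """
    x1, x2, x3, x4 = x
    return [(x1 * v1 - x3 * v2 + x2, x3 * v1 + x1 * v2 + x4) for v1, v2 in ref]


ref = [(1.0, 0.0), (0.0, 1.0)]
scale, angle = 2.0, pi / 2                # assumed scale and rotation
x = (scale * cos(angle), 3.0, scale * sin(angle), -1.0)
moved = euclidean_similarity(ref, x)
```

The point (1, 0) is rotated by 90 degrees, scaled by 2 and shifted by (3, -1), which gives approximately (3, 1).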

Let $y(t)$ be a vector with noisy samples of the object boundary

$$y(t) = [y_{1t}(s_1), \dots, y_{1t}(s_N), y_{2t}(s_1), \dots, y_{2t}(s_N)]^T. \tag{A.6}$$

Using equations (A.3-A.5), $y(t)$ can be written as

$$y(t) = C\,x(t) + e + \eta(t), \tag{A.7}$$

where $C$ is a $2N \times (D + 2N_c)$ matrix, $x(t)$ is a $(D + 2N_c) \times 1$ state vector, $e$ is a $2N \times 1$ null vector or a vector containing the object coordinates, and $D$ is the number of coefficients of the motion model.

The state vector is given by

$$x(t) = [x_1, \dots, x_D,\ \theta^d_{11}, \dots, \theta^d_{1N_c},\ \theta^d_{21}, \dots, \theta^d_{2N_c}]^T, \tag{A.8}$$

where $D = 2, 4, 6$ depending on the motion model.

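The shape matrices $C$ and $e$ also depend on the motion model as follows

Translation (D = 2):

$$C = \begin{bmatrix} \mathbf{1}_{N \times 1} & O_{N \times 1} & B_{N \times N_c} & O_{N \times N_c} \\ O_{N \times 1} & \mathbf{1}_{N \times 1} & O_{N \times N_c} & B_{N \times N_c} \end{bmatrix}, \tag{A.9}$$

$$e = [v^r_1(s_1), \dots, v^r_1(s_N), v^r_2(s_1), \dots, v^r_2(s_N)]^T, \tag{A.10}$$

Euclidean Similarities (D = 4):

$$C = \begin{bmatrix} M_x & -M_y & B_{N \times N_c} & O_{N \times N_c} \\ M_y & M_x & O_{N \times N_c} & B_{N \times N_c} \end{bmatrix}, \quad \text{with } M_x = \begin{bmatrix} v^r_1(s_1) & 1 \\ \vdots & \vdots \\ v^r_1(s_N) & 1 \end{bmatrix}, \quad M_y = \begin{bmatrix} v^r_2(s_1) & 0 \\ \vdots & \vdots \\ v^r_2(s_N) & 0 \end{bmatrix}, \tag{A.11}$$

$$e = 0, \tag{A.12}$$

Affine transform (D = 6):

$$C = \begin{bmatrix} M & O_{N \times 3} & B_{N \times N_c} & O_{N \times N_c} \\ O_{N \times 3} & M & O_{N \times N_c} & B_{N \times N_c} \end{bmatrix}, \quad \text{with } M = \begin{bmatrix} v^r_1(s_1) & v^r_2(s_1) & 1 \\ v^r_1(s_2) & v^r_2(s_2) & 1 \\ \vdots & \vdots & \vdots \\ v^r_1(s_N) & v^r_2(s_N) & 1 \end{bmatrix}, \tag{A.13}$$

$$B = \begin{bmatrix} \phi_1(s_1) & \phi_2(s_1) & \cdots & \phi_{N_c}(s_1) \\ \phi_1(s_2) & \phi_2(s_2) & \cdots & \phi_{N_c}(s_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(s_N) & \phi_2(s_N) & \cdots & \phi_{N_c}(s_N) \end{bmatrix}, \tag{A.14}$$

$$e = 0, \tag{A.15}$$

where $B$ is the interpolation B-spline matrix, $M$ is the matrix containing the coordinates of the object boundary, and $O$ is the null matrix with appropriate dimensions.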

Sometimes the state vector $x(t)$ contains derivatives of the shape parameters. In these cases $x(t)$ is augmented and has the following form:

$$x(t) = [x_1, \dots, x_D,\ \dot{x}_1, \dots, \dot{x}_D,\ \theta^d_{11}, \dots, \theta^d_{1N_c},\ \theta^d_{21}, \dots, \theta^d_{2N_c}]^T. \tag{A.16}$$

In this case equations (A.9-A.13) must be modified accordingly, i.e., matrix $C$ must be extended with zero columns.


Appendix B

Feature Detection

Several methods can be used to detect visual features in the image. In this thesis, the predicted

contour is sampled at equally spaced points v(si), and feature detection is performed along search lines

orthogonal to the predicted contour. This technique has been used by several authors [8, 10, 20]. The

length of the inspection interval depends on the uncertainty of the predicted contour at each sample

point $y_i$ in the direction of the search line. Assuming that $y_i \sim \mathcal{N}(\hat{y}_i, S_i)$, then

$$\rho(s_i, t) = \delta \sqrt{n(s_i)^T\, S(s_i, t)\, n(s_i)}, \tag{B.1}$$

where $n(s_i)$ is the unit normal at $s_i$.

Since the shape samples $y(t) = [y_{1t}(s_1), \dots, y_{1t}(s_N), y_{2t}(s_1), \dots, y_{2t}(s_N)]^T$ are related to the state vector $x(t)$ by

$$y(t) = C\,x(t) + e + \eta(t), \tag{B.2}$$

where $x(t) \sim \mathcal{N}(\hat{x}(t), P(t))$, the covariance of the $i$th sample is given by

$$S(s_i, t) = C(s_i)\, P(t \mid t-1)\, C(s_i)^T + R(s_i), \tag{B.3}$$

where $C(s_i)$ is a matrix formed by lines $i$ and $i+N$ of matrix $C$, and $R(s_i)$ is a $2 \times 2$ covariance matrix associated with the observation noise at $s_i$.
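Equations (B.1) and (B.3) can be combined in a short sketch. It assumes a two-parameter (translation only) state so that $C(s_i)$ is a $2 \times 2$ matrix, and it treats δ as a user-chosen scale factor; all numerical values are illustrative.

```python
from math import sqrt


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def sample_covariance(C_i, P_pred, R_i):
    """S(s_i, t) = C(s_i) P(t|t-1) C(s_i)^T + R(s_i), as in (B.3)."""
    CPCt = matmul(matmul(C_i, P_pred), transpose(C_i))
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(CPCt, R_i)]


def search_length(normal, S, delta):
    """rho(s_i, t) = delta * sqrt(n^T S n), as in (B.1)."""
    Sn = [sum(s * n for s, n in zip(row, normal)) for row in S]
    return delta * sqrt(sum(n * v for n, v in zip(normal, Sn)))


C_i = [[1.0, 0.0], [0.0, 1.0]]
P_pred = [[4.0, 0.0], [0.0, 1.0]]   # larger uncertainty along the first axis
R_i = [[0.5, 0.0], [0.0, 0.5]]
S = sample_covariance(C_i, P_pred, R_i)
rho_1 = search_length([1.0, 0.0], S, delta=2.0)
rho_2 = search_length([0.0, 1.0], S, delta=2.0)
```

The search interval is longer when the contour normal points along the direction in which the predicted state is more uncertain.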

Feature detection along the $i$th direction is performed by comparing the image profile with a profile template $T_i$. This procedure is based on the minimization of a cost function given by

$$J(\Delta_0) = \int_t |p_i(t) - T_i(\Delta, \Delta_0)|^2\, dt, \tag{B.4}$$

where $p_i(t)$ is the image profile along the $i$th direction, $\Delta$ is the distance to the object boundary and $T_i(\Delta, \Delta_0)$ is a known template. The template $T_i(\Delta)$ is defined as follows: $T_i(\Delta)$ is equal to the average intensity of the object for $\Delta \le \Delta_0$, and $T_i(\Delta)$ is equal to the background image profile or to the average background color for $\Delta > \Delta_0$ (see Fig. B.1). The first hypothesis is used in the surveillance examples and the second is used in the lip and gesture experiments.
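A discrete version of the minimization of (B.4) can be sketched as an exhaustive search over the template shift. The sketch assumes a sampled intensity profile, a constant background level (the second hypothesis above) and a sum of squared differences in place of the integral; the profile values are invented.

```python
def detect_edge(profile, object_level, background_level):
    """Return the shift (in samples) that minimizes the squared error between
    the profile and a step template, together with the minimum cost."""
    best_shift, best_cost = None, float("inf")
    for shift in range(len(profile) + 1):
        cost = 0.0
        for pos, value in enumerate(profile):
            # Object intensity before the shift, background after it.
            template = object_level if pos < shift else background_level
            cost += (value - template) ** 2
        if cost < best_cost:
            best_shift, best_cost = shift, cost
    return best_shift, best_cost


profile = [0.9, 1.0, 1.1, 0.95, 0.2, 0.1, 0.15]
shift, cost = detect_edge(profile, object_level=1.0, background_level=0.1)
```

For this profile the first four samples match the object level, so the detected boundary lies between the fourth and fifth samples.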

Figure B.1: Feature detection: (a) directional search; (b) image profile p, and shifted template T at ∆0.

Appendix C

Covariance Update for the S-PDAF Model

This appendix derives the expression for the update of the covariance matrix in the case of the S-PDAF

algorithm. The variables used in this appendix are those defined in Section 4.4.

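The covariance of the state estimate is

$$\begin{aligned} P(t \mid t) &= E\big\{ [x(t) - \hat{x}(t \mid t)][x(t) - \hat{x}(t \mid t)]^T \mid Y^t \big\} \\ &= E\big\{ \underbrace{x(t)x(t)^T}_{P^1} \underbrace{-\, x(t)\hat{x}(t \mid t)^T}_{P^2} \underbrace{-\, \hat{x}(t \mid t)x(t)^T}_{P^{2T}} + \underbrace{\hat{x}(t \mid t)\hat{x}(t \mid t)^T}_{P^3} \mid Y^t \big\}, \end{aligned} \tag{C.1}$$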

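where

$$P^1 \triangleq E\{x(t)x(t)^T \mid Y^t\} = \sum_{i=0}^{m_i} E\{x(t)x(t)^T \mid I_i(t), Y^t\}\, \alpha_i(t). \tag{C.2}$$

Noting that

$$\mathrm{cov}\{x(t)\} = E\{x(t)x(t)^T\} - \hat{x}(t \mid t)\hat{x}(t \mid t)^T, \tag{C.3}$$

the first term is

$$P^1 = \sum_{i=0}^{m_i} \alpha_i(t) \big[ P_i(t \mid t) + \hat{x}_i(t \mid t)\hat{x}_i(t \mid t)^T \big]. \tag{C.4}$$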


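The second term in (C.1) is

$$\begin{aligned} P^2 &\triangleq -\sum_{i=0}^{m_i} E\{x(t)\hat{x}(t \mid t)^T \mid I_i(t), Y^t\}\, \alpha_i(t) \\ &= -\Big( \sum_{i=0}^{m_i} E\{x(t) \mid I_i(t), Y^t\}\, \alpha_i(t) \Big) \hat{x}(t \mid t)^T \\ &= -\Big( \sum_{i=0}^{m_i} \hat{x}_i(t \mid t)\, \alpha_i(t) \Big) \hat{x}(t \mid t)^T \\ &= -\hat{x}(t \mid t)\hat{x}(t \mid t)^T = P^{2T}. \end{aligned} \tag{C.5}$$

The third term is

$$P^3 \triangleq \hat{x}(t \mid t)\hat{x}(t \mid t)^T \sum_{i=0}^{m_i} \alpha_i(t) = \hat{x}(t \mid t)\hat{x}(t \mid t)^T = -P^2. \tag{C.6}$$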

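Combining (C.4), (C.5), and (C.6) with (C.1) yields

$$P(t \mid t) = \sum_{i=0}^{m_i} \alpha_i(t) \big[ P_i(t \mid t) + \hat{x}_i(t \mid t)\hat{x}_i(t \mid t)^T \big] - \hat{x}(t \mid t)\hat{x}(t \mid t)^T. \tag{C.7}$$

Since the conditional covariance is given by

$$P_i(t \mid t) = (I - K_i(t)C_i)\, P(t \mid t-1), \tag{C.8}$$

equation (C.7) can be written as follows

$$P(t \mid t) = \Big[ I - \sum_{i=1}^{m_i} \alpha_i(t)K_i(t)C_i \Big] P(t \mid t-1) + \sum_{i=0}^{m_i} \alpha_i(t)\hat{x}_i(t \mid t)\hat{x}_i(t \mid t)^T - \hat{x}(t \mid t)\hat{x}(t \mid t)^T. \tag{C.9}$$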
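The structure of (C.7) is easy to check numerically in the scalar case, where it reduces to the usual formula for the mean and variance of a mixture. The weights, means and variances below are arbitrary.

```python
def mixture_moments(alphas, means, variances):
    """Scalar version of (C.7): mean and variance of a mixture whose
    components have weights alpha_i, means x_i and variances P_i."""
    mean = sum(a * m for a, m in zip(alphas, means))
    second_moment = sum(a * (v + m * m) for a, m, v in zip(alphas, means, variances))
    return mean, second_moment - mean * mean


alphas = [0.5, 0.5]
means = [0.0, 2.0]
variances = [1.0, 1.0]
mean, var = mixture_moments(alphas, means, variances)
```

The combined variance (2.0) is larger than the variance of each component (1.0) because the spread of the conditional means is added to the conditional covariances, which is the role of the last two terms of (C.9).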

Appendix D

Robust Multi-Model Tracker

This appendix derives expressions for the mixture coefficients and state estimate of the RMMT as well

as the mixture coefficients of the MMT.

D.1 Mixture Coefficients of RMM tracker

This section addresses the update of $c_{K^t}$ (filtering step), assuming that $c_{K^t|t-1}$ is known:

$$
c_{K^t} \triangleq \frac{p(K^t, Y^t)}{p(Y^t)} = \frac{p(y(t) \mid K^t, Y^{t-1})\, p(K^t, Y^{t-1})}{p(Y^t)}. \tag{D.1}
$$

Since the term $p(K^t, Y^{t-1}) = c_{K^t|t-1}\, p(Y^{t-1})$ is obtained in the prediction step (see (5.8)), equation (D.1) can be written as

$$
\begin{aligned}
c_{K^t} &= \gamma\, c_{K^t|t-1} \int p(y(t) \mid K^t, Y^{t-1}, x(t))\, p(x(t) \mid K^t, Y^{t-1})\, dx(t) \\
&= \gamma\, c_{K^t|t-1} \int \sum_i p(y(t) \mid I_i(t), K^t, Y^{t-1}, x(t))\, p(I_i(t) \mid K^t, Y^{t-1}, x(t))\, p(x(t) \mid K^t, Y^{t-1})\, dx(t) \\
&= \gamma\, c_{K^t|t-1} \sum_i \alpha_i(t) \int p(y(t) \mid I_i(t), K^t, Y^{t-1}, x(t))\, p(x(t) \mid K^t, Y^{t-1})\, dx(t),
\end{aligned}
\tag{D.2}
$$

with $\gamma = p(Y^{t-1}) / p(Y^t)$.

Since y(t) may contain some gaps along the contour, it depends on the location of the strokes detected in the image. Thus, $p(y(t) \mid I_i(t), K^t, Y^{t-1}) = p(y(t) \mid I_i(t), K^t, b, e, M, Y^{t-1})$, where $b = \{b_1, \ldots, b_M\}$ and $e = \{e_1, \ldots, e_M\}$ define the beginning and the end of the M strokes. Using the observation model described in Chapter 4,

$$
p(y(t) \mid I_i(t), K^t, b, e, M, Y^{t-1}) = \prod_{j=1}^{M} \prod_{n=b_j}^{e_j} p(y^j(s_n, t) \mid I_i^j(t), K^t), \tag{D.3}
$$

where $y^j(s_n, t)$ is the feature point belonging to the jth stroke detected in the vicinity of $s_n$. It is assumed that the visual features have a uniform distribution in the search area if $I_i^j = 0$ (classified as unreliable) and a Gaussian distribution if $I_i^j = 1$ (classified as reliable). Therefore, defining ${}^{k}E_i^{j}(s_n, t) = p(y^j(s_n, t) \mid I_i^j(t), K^t)$,

$$
{}^{k}E_i^{j}(s_n, t) =
\begin{cases}
V(s_n, t)^{-1} & \text{if } I_i^j(t) = 0, \\
\rho^{-1}\, \mathcal{N}\big({}^{k}\nu^{j}(s_n, t);\, 0,\, {}^{k}S(s_n, t)\big) & \text{otherwise,}
\end{cases}
\tag{D.4}
$$

where $V(s_n, t)$ is the length of the search area, $\rho$ is the normalization constant, ${}^{k}\nu^{j}(s_n, t) = y^j(s_n, t) - {}^{k}C(s_n)\, {}^{k}\hat{x}_{t|t-1}$ is the innovation associated with the jth stroke, and ${}^{k}S(s_n, t) = {}^{k}C(s_n)\, P_{K^t|t-1}\, {}^{k}C(s_n)^T + {}^{k}R(s_n)$ is the covariance of the innovation vector, where ${}^{k}C(s_n)$ and ${}^{k}R(s_n)$ are the output matrix and noise covariance associated with the nth sample of the object contour. Substituting (D.4) into (D.3) and the result into (D.2) leads to

$$
c_{K^t} = \gamma\, c_{K^t|t-1} \sum_i {}^{k}\alpha_i(t) \prod_{j=1}^{M} \prod_{n=b_j}^{e_j} {}^{k}E_i^{j}(s_n, t). \tag{D.5}
$$
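The update (D.5) can be illustrated with a short Python sketch. It is only an illustration under simplifying assumptions and not the implementation used in this thesis: innovations are taken to be scalar, the search length V is the same for all samples, and the function names and data layout are hypothetical. Because the coefficients are posterior probabilities of the model sequences, γ need not be computed explicitly and is recovered by normalizing the coefficients so that they sum to one.

```python
import math


def gaussian_pdf(nu, S):
    """Zero-mean scalar Gaussian density N(nu; 0, S)."""
    return math.exp(-0.5 * nu * nu / S) / math.sqrt(2.0 * math.pi * S)


def feature_likelihood(nu, S, V, rho, reliable):
    """Likelihood (D.4): uniform over the search length V for an
    unreliable stroke, Gaussian scaled by 1/rho for a reliable one."""
    return gaussian_pdf(nu, S) / rho if reliable else 1.0 / V


def unnormalized_coefficient(c_pred, alphas, interpretations, strokes, V, rho):
    """Right-hand side of (D.5) without gamma, for one model sequence.

    strokes[j]: list of (innovation, innovation variance) pairs, one
                per contour sample of stroke j.
    interpretations[i][j]: 1 if stroke j is reliable under
                interpretation i, 0 otherwise.
    alphas[i]: probability of interpretation i.
    """
    total = 0.0
    for alpha, labels in zip(alphas, interpretations):
        likelihood = 1.0
        for reliable, stroke in zip(labels, strokes):
            for nu, S in stroke:
                likelihood *= feature_likelihood(nu, S, V, rho, reliable)
        total += alpha * likelihood
    return c_pred * total


def normalize(values):
    """Divide by the sum; this plays the role of gamma in (D.5)."""
    total = sum(values)
    return [v / total for v in values]
```

Calling unnormalized_coefficient once per model sequence and passing the results to normalize gives the updated mixture coefficients. A sequence whose predicted contour lies close to the reliable strokes receives the larger coefficient.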

D.2 Mixture Coefficients for Multi-Model Tracker

The Kalman model is a particular case of S-PDAF, which corresponds to assuming that all the observed data is valid. Equation (D.2) can then be written as

$$
c_{K^t} = \gamma\, c_{K^t|t-1} \int p(y(t) \mid K^t, Y^{t-1}, x(t))\, p(x(t) \mid K^t, Y^{t-1})\, dx(t). \tag{D.6}
$$

Assuming independence of the L features along the contour, we can write

$$
c_{K^t} = \gamma\, c_{K^t|t-1} \prod_{n=1}^{L} {}^{k}E(s_n, t), \tag{D.7}
$$

where ${}^{k}E(s_n, t)$ is similar to (D.4) and is defined as

$$
{}^{k}E(s_n, t) =
\begin{cases}
V(s_n, t)^{-1} & \text{if no features are detected,} \\
\rho^{-1}\, \mathcal{N}\big({}^{k}\nu(s_n, t);\, 0,\, {}^{k}S(s_n, t)\big) & \text{otherwise.}
\end{cases}
\tag{D.8}
$$

Here ${}^{k}\nu(s_n, t)$ and ${}^{k}S(s_n, t)$ have the same meaning as before. The superscript j and subscript i are suppressed since there are no stroke interpretations (all strokes are valid).
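For contours with many samples, the product in (D.7) may underflow in floating point, so it is convenient to accumulate logarithms. The sketch below shows one possible way of doing this in Python. It assumes scalar innovations, a common innovation variance S and search length V for all samples, and strictly positive predicted coefficients. The function names are hypothetical, and an undetected feature is represented by None.

```python
import math


def log_feature_likelihood(nu, S, V, rho):
    """Logarithm of (D.8); nu is None when no feature was detected."""
    if nu is None:
        return -math.log(V)
    return -0.5 * (nu * nu / S + math.log(2.0 * math.pi * S)) - math.log(rho)


def update_coefficients(c_pred, innovations, S, V, rho=1.0):
    """Apply (D.7) to every model sequence and normalize the result.

    c_pred[m]: predicted coefficient of sequence m (must be > 0).
    innovations[m]: the L innovations (or None) along the contour
                    predicted by sequence m.
    gamma is absorbed by the final normalization.
    """
    log_c = [math.log(c) + sum(log_feature_likelihood(nu, S, V, rho)
                               for nu in nus)
             for c, nus in zip(c_pred, innovations)]
    top = max(log_c)  # subtract the maximum before exponentiating
    weights = [math.exp(v - top) for v in log_c]
    total = sum(weights)
    return [w / total for w in weights]
```

Subtracting the largest log-coefficient before exponentiating keeps at least one weight equal to one, so the normalization remains well defined even when every individual product would round to zero.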

D.3 State estimation for a given model

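Assuming that the active model is k(t) = q, the minimum mean squared error estimate of the state is

$$
\begin{aligned}
{}^{q}\hat{x}_{K^t} &\triangleq E\{x(t) \mid Y^t, k(t) = q\} \\
&= \int x(t)\, p(x(t) \mid Y^t, k(t) = q)\, dx(t) \\
&= \int x(t)\, \frac{p(x(t), k(t) = q \mid Y^t)}{p(k(t) = q \mid Y^t)}\, dx(t) \\
&= \frac{1}{p(k(t) = q \mid Y^t)} \int x(t) \sum_{K^{t-1}} p(x(t), k(t) = q, K^{t-1} \mid Y^t)\, dx(t) \\
&= \frac{1}{p(k(t) = q \mid Y^t)} \int x(t) \sum_{K^t : k(t) = q} c_{K^t} \sum_i p(x(t), I_i(t) \mid K^t, Y^t)\, dx(t) \\
&= \gamma \sum_{K^t : k(t) = q} c_{K^t} \sum_i \int x(t)\, p(x(t) \mid I_i(t), K^t, Y^t)\, p(I_i(t) \mid K^t, Y^t)\, dx(t) \\
&= \gamma \sum_{K^t : k(t) = q} c_{K^t} \sum_i \alpha_i(t) \int x(t)\, p(x(t) \mid I_i(t), K^t, Y^t)\, dx(t),
\end{aligned}
\tag{D.9}
$$

where $\gamma$ now denotes the normalization constant $1 / p(k(t) = q \mid Y^t)$ and $\alpha_i(t) \triangleq p(I_i(t) \mid K^t, Y^t)$ is the a posteriori association probability of the ith interpretation assigned to the model k. Therefore,

$$
{}^{q}\hat{x}_{K^t} = \gamma \sum_{K^t : k(t) = q} c_{K^t} \sum_i \alpha_i(t)\, \hat{x}_i(t \mid t), \tag{D.10}
$$

where

$$
\hat{x}_i(t \mid t) = E\{x(t) \mid I_i(t), K^t, Y^t\}. \tag{D.11}
$$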
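Equation (D.10) is a doubly weighted average of the conditional estimates, restricted to the model sequences that end in model q. A minimal Python sketch follows. The function name and the list-based data layout are illustrative assumptions, and the normalization constant is taken to be the reciprocal of the total coefficient mass of the selected sequences, i.e. the posterior probability that k(t) = q.

```python
def model_state_estimate(q, sequences, coeffs, alphas, x_hats):
    """State estimate (D.10) given that the active model is q.

    sequences[m]: tuple of model labels; the last entry is k(t).
    coeffs[m]:    mixture coefficient of sequence m.
    alphas[m][i]: association probability of interpretation i
                  under sequence m.
    x_hats[m][i]: conditional state estimate (list of floats) for
                  interpretation i under sequence m, as in (D.11).
    """
    selected = [m for m, seq in enumerate(sequences) if seq[-1] == q]
    mass = sum(coeffs[m] for m in selected)  # p(k(t) = q | Y^t)
    if mass == 0.0:
        raise ValueError("no model sequence with nonzero weight ends in q")
    dim = len(x_hats[selected[0]][0])
    estimate = [0.0] * dim
    for m in selected:
        for alpha, x in zip(alphas[m], x_hats[m]):
            for d in range(dim):
                estimate[d] += coeffs[m] * alpha * x[d]
    return [value / mass for value in estimate]
```

Sequences ending in a model other than q do not contribute to the estimate, however large their coefficients are.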


