
Université Catholique de Louvain

Ecole Polytechnique de Louvain

Département d'Ingénierie Informatique

Human motion prediction using kernel density estimators

Master's thesis presented in attainment of a master degree in computer science, option in artificial intelligence

by Richard Tillieux

under the guidance of

Supervisor: Prof. Pierre Dupont
Co-supervisor: Prof. Rainer Stiefelhagen
1st Tutor: Tobias Feldmann
2nd Tutor: Sebastian Schulz

Academic year 2010-2011

Acknowledgements

First of all I would like to thank my tutors Tobias Feldmann and Sebastian Schulz

for their long-term supervision. They have invested quite a lot of time to follow the

evolution of this work, and I am grateful for their interest. I will especially miss the

weekly morning group meetings, where Tobias generously feeds his starving students

with the most decadent pastries to be ever found on this side of the Atlantic!

I would also like to thank my supervisor Prof. Pierre Dupont for his

open-mindedness. He kindly agreed to take charge of a master thesis at a very unusual

time for the Belgian academic year.

I am greatly thankful to Chantal Poncin for her truly amazing personal support

throughout this whole Erasmus year. She dedicated herself with unending

patience to easing administrative obstacles and to giving me excellent advice on

a wide variety of issues. Without her help, staying one more semester in Karlsruhe

would certainly not have been possible at all.

I would like to thank the whole Go Hu.MAn. group for the friendly atmosphere they

brought to the office. The monthly social events, despite the occasional injury, were all

pretty nice evenings.

I could not fail to mention my original supervisor Prof. Marco Saerens, who was

really kind-hearted about my desire to stay one year in Karlsruhe, despite the fact

that this also meant canceling his master thesis.

And of course, last but certainly not least, I would finally like to thank my parents,

whose continued support I felt despite the distance.

Abstract

Markerless human motion tracking could be achieved through two distinct, but com-

plementary, sets of techniques: those based on the analysis of the image, and those

based on prior knowledge. Among the latter, direct prediction of human motion on

a global perspective has promising potential, yet it is also a challenging Machine

Learning problem, due to the high-dimensionality of the data. This master thesis

seeks to explore the strengths and limitations of an approach specifically based on

kernel density estimators. The methodology proposed here compares on several lev-

els the classic isotropic kernels with their anisotropic counterparts, in the context of

human motion prediction.

This master thesis was realized within the Go Hu.MAn. group at the Karlsruhe

Institute of Technology (KIT).

Table of contents

1 Introduction
2 State of the art
    2.1 Prior knowledge in markerless motion tracking
        2.1.1 Explicit joint limits
        2.1.2 Static pose priors
        2.1.3 Dynamic models
    2.2 Kernel estimators in motion tracking
3 Theoretical foundations
    3.1 Geometrical concepts
        3.1.1 Curse of dimensionality
        3.1.2 Intrinsic dimensionality
    3.2 Machine Learning concepts
        3.2.1 Inductive bias
        3.2.2 Training and test sets
        3.2.3 Supervised learning
        3.2.4 Gradient ascent
        3.2.5 Cross-validation
    3.3 Kernel density estimation
        3.3.1 Non-parametric density estimation
        3.3.2 Kernel definition
        3.3.3 Bandwidth selection
4 Chosen methodology
    4.1 Preliminary analysis
    4.2 Kernel reconstruction of artificial examples
        4.2.1 Visualization
        4.2.2 Integrated Square Error
        4.2.3 Average Negative Log Likelihood
    4.3 Processing real data
        4.3.1 Building of training set
        4.3.2 Frame subsampling
        4.3.3 Step-by-step prediction
        4.3.4 Prediction by gradient ascent
    4.4 Hyper-parameters selection
        4.4.1 Bandwidth
        4.4.2 Alpha
        4.4.3 Training frame step
        4.4.4 Time sequence
    4.5 Combination with image fidelity
        4.5.1 Confidence weight of estimation
        4.5.2 Ambiguity measure of training set
    4.6 Complexity analysis
        4.6.1 Space complexity
        4.6.2 Time complexity
5 Experiments
    5.1 Performance criteria
    5.2 Test set selection
    5.3 Limitations of the methodology
        5.3.1 Curse of dimensionality
        5.3.2 Ill-conditioned matrices
    5.4 Estimation of intrinsic dimensionality
    5.5 Evaluation
        5.5.1 Default case
        5.5.2 Some graphical results
        5.5.3 Maximum nearest neighbor distance
        5.5.4 Anisotropic kernels
        5.5.5 Longer time sequence
6 Conclusion
A Gradients derivation

1. Introduction

`Back! back! Away from me, or you

must go with me - whither you

know not - into the Land of Three

Dimensions!' `Fool! Madman!

Irregular!' I exclaimed; `never will I

release thee; thou shalt pay the

penalty of thine impostures.' `Ha!

Is it come to this?' thundered the

Stranger: `then meet your fate: out

of your Plane you go. Once, twice,

thrice! 'Tis done!'

Edwin A. Abbott, Flatland

Human motion tracking is the task of analyzing the movement of a person

over time, and of converting the raw data obtained into a digital anatomical model.

There are mainly two ways to track human motion: through the use of optical

markers, or by analyzing the image taken from a camera system.

The first solution has been used extensively in the movie industry for the animation of digital characters. The markers can be located quite precisely in space, so the main task is to find a model fitting the marker positions (with enough markers, there is barely any ambiguity). The work of the animation team is indeed much easier if a human actor can directly make the virtual character move in a natural manner. This remains of course only a model, and the real complexity of a human pose cannot be fully rendered. The position and concentration of markers will therefore vary depending on the desired level of detail. For example, reproducing facial expressions will certainly be more relevant in the case of a dialogue than in the middle of a stunt.

Marker-based techniques have proved to provide quite reliable results. However, they have the disadvantage of requiring a large and static infrastructure, which may be suitable for movies, but obviously lacks flexibility for more general usage. Having to stick markers all over oneself is also far from ergonomic.

The second approach seeks to overcome the need to rely on markers, and directly processes the images recorded by the cameras. The potential applications are numerous (home automation, user interfaces, video games, etc.). The underlying application in our context is to help aspiring athletes correct their technique by comparison with a reference movement performed by an accomplished expert of the sport in question. This could also be used in physiotherapy, by helping patients to do rehabilitation exercises by themselves at home.

Nonetheless, this is much more arduous to achieve than with marker-based techniques, and is only beginning to be used in an industrial context (recently, Microsoft released the controller-free Kinect sensor device for the Xbox, for instance). If we only have one camera at our disposal, then the challenge is even bigger: reconstruction of a 3D model based on a single 2D image is highly ambiguous. The risks of occlusion (some object passes in front of the observed person) and self-occlusion (in a profile view of a walking person, one leg hides the other) do not help either. There are several other difficulties upon which we will not elaborate, as image analysis in itself is not the object of this work.

At this point, there is an interesting observation to make: as human beings, we can have a pretty precise idea of how visible persons are moving around, and closing one eye does not seem to degrade our perception of them substantially. But we have accumulated experience from years of real-world interaction with persons at various distances, so it is no big deal for us to infer the poses of passers-by with almost no visual effort.

The idea is thus, in complement to image analysis (the so-called image fidelity method), to make use of anatomical knowledge to favor poses that are similar to previously learned movements. Indeed, we could to some limited extent predict what the current pose could be, with the help of both an estimation of the previous movements and a knowledge base. There are more or less elaborate ways to do this; we would like to propose here an approach totally independent of the image fidelity method. This clear distinction provides a clean framework to "plug" the two together at will when the image is ambiguous.

Let us give a quick overview of the following chapters. Chapter 2 briefly presents some theoretical works about knowledge-assisted motion tracking. Then follows a closer look at the approach proposed in [5], the paper upon which our own work is primarily based. Chapter 3 introduces the main concepts that we use and that are not specific to the task of human motion prediction. Further chapters assume it has been read, so we recommend that the reader take a look at the conventions used there in the formulas to avoid ambiguity, even if he or she is already familiar with the subject. Chapter 4 details how these concepts are used specifically in the context of human motion prediction. Chapter 5 discusses how to test the system in a relevant way, but also the concrete consequences of theoretical problems, and some performance results with the available data.


2. State of the art

The rhythmical and, if I may say,

well-modulated undulation of the

back of our ladies of Circular rank

is envied and imitated by the

common Equilateral, who can

achieve nothing but a mere

monotonous swing, like the ticking

of a pendulum.


2.1 Prior knowledge in markerless motion tracking

In order to restrict this summary to a well-defined problem, we would like to clarify which task of human motion analysis we consider to be relevant here. Given a temporal sequence of images of a person not wearing markers, by "motion tracking" we mean modeling the person's poses through time. The term "tracking" alone does not necessarily imply explicit pose estimation: in all generality, it considers states of the person through time (the represented features may be entirely image-based, such as colors or position in the image). Tasks such as people-tracking (locating the position of a moving person in a complex environment, see [7] for an example making use of predictions) serve a distinct purpose as they do not involve pose estimation, and thus will not be considered here.


Figure 2.1: From left to right: stick, 2D contours and volumetric human models (images taken from [2])

Additionally, we will stick to the problem as formulated for a single individual. Thus, we will not discuss the task of multiple-person tracking (with interacting and mutually occluding persons, see [9] for instance). Even with a knowledge-based approach that is independent of the image such as ours, it would require taking interactions into account, as individual motions will clearly be influenced by the whole group's overall activity.

Another remark about terminology: by "prior knowledge" we refer here exclusively to information that is external to the currently tracked motion, and gained through automated learning or from domain experts. The term "prediction" is a bit ambiguous, because prediction is achieved on the basis of previous poses of the same motion, the use of prior knowledge remaining optional. Indeed, tracking algorithms may very well make predictions without the assistance of previously learned motions or anatomical constraints. As we are mainly interested in the knowledge-based techniques deployed to assist motion tracking (and not in the tracking techniques themselves), we will primarily focus on this aspect here.

A presentation of human motion analysis can be found in [2], and more recently in [27]. The three most common ways to explicitly model human poses are sketched in figure 2.1. These models represent a human body pose in the form of a kinematic skeletal structure (in this context, "2D contours" is not the overall silhouette but means that the body is decomposed into articulated parts). The values of the joint angles are therefore fundamental to describe the pose. Indeed, for a given person, they are the only parameters likely to vary in stick and volumetric models, as we do not expect the limbs of the person to change their shape in the process of motion. Stick models are today mostly used for illustration, as they can be superimposed directly on the person's image.

2.1.1 Explicit joint limits

A quite intuitive way to take prior knowledge into account to ensure a meaningful estimated pose is to define and enforce constraints on the model. Typically, we will impose specific bounds on each joint angle, or prohibit body parts from intersecting each other in the case of the volumetric model. This makes it possible to discard anatomically inconsistent hypothetical poses, thus reducing the ambiguity and improving the robustness of the result.

In [36], the authors employ a volumetric body model parametrized by a vector p = (χ, d, i), where χ designates the joint angle parameters (about 30 in their model), while d and i are respectively the deformable shape parameters (limb volumes) and the internal joint proportions (relative positions in space of some joints that are likely to vary between individuals, such as the hip joint) (see footnote 1). As these last two vectors are optimized once and for all at initialization to better fit the modeled subject's body, the temporal modeling to achieve throughout the tracking process depends only on χ (as in our case). Interestingly, all three vectors are subject to hard constraints (e.g. the values of i are allowed to deviate by 10% from the human standard). Those constraints are formulated by the inequality $C_{bl} \cdot p < 0$, with $C_{bl}$ the box-limit constraint matrix.

The body representation used in [26] is a subject-specific volumetric model, consisting of 15 body segments and 14 joints. Additionally, the authors make use of a convex visual hull before estimating the actual model. They propose a flexible modeling of constraints. Firstly, joint centers are not held fixed at the juncture of strictly adjacent body segments. In other words, joints also get 3 supplementary translational Degrees of Freedom (DoF). Secondly, constraints are formulated in a soft way, by weighting deviations on each joint, for both the rotational DoF and the additional translational DoF.

Footnote 1: We renamed the variables so that the nomenclature corresponds to ours for χ.

This work is extended in [8]. The subject-specific model is generated automatically, with a very realistic visual result. Constraints are formulated in a hard way this time, and are defined locally on the 6 DoF of each joint. The 3 translational DoF constraints are grouped in a bounding box. The 2 DoF corresponding to the "swing" rotation of the articulation are constrained together. Finally, the "twist" DoF is simply constrained by one inequality.

The approach of [17] is quite sophisticated on the subject of joint constraints, as it is precisely the focus of the paper. The authors point out that medical textual resources that might be consulted to devise fixed joint constraints do not provide explicit dependency constraints between linked joints (e.g. the elbow twist depends on the shoulder orientation) or between the different DoF of a single joint. Their idea is to measure those relations experimentally. After extensive recording of a subject performing twist and swing on the studied joint, they obtain a quaternion field. Quaternions are used instead of Euler angles (roll, pitch, yaw) to avoid, among other things, gimbal lock problems (for joints having 3 DoF).

Each quaternion $q = (q_x, q_y, q_z, q_w)^T = (\sin\frac{\theta}{2}\, v, \cos\frac{\theta}{2})$ corresponds to a given orientation of the joint, represented by the unit axis v and the angle of rotation θ around that axis. Those quaternion fields are then converted into a voxelized form. Internal dependencies between the DoF are thus encapsulated for the whole joint. The voxel shapes can then be hierarchically combined to represent dependencies between two linked joints (see figure 2.2). Each voxel of the parent joint (e.g. the shoulder) will further constrain the range of valid values for the child joint (the elbow in that case).
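As an illustration of the axis-angle form of a quaternion given above, the following minimal sketch builds (qx, qy, qz, qw) from an axis and an angle. The function name and the example rotation are assumptions made for this illustration only; they are not taken from [17].

```python
import math

def axis_angle_to_quaternion(axis, theta):
    """Return (qx, qy, qz, qw) for a rotation of theta radians about axis.

    The axis is normalized here, so any non-zero 3-vector is accepted.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0.0:
        raise ValueError("rotation axis must be non-zero")
    # Vector part: sin(theta / 2) times the unit axis v.
    s = math.sin(theta / 2.0) / norm
    # Scalar part: cos(theta / 2).
    return (axis[0] * s, axis[1] * s, axis[2] * s, math.cos(theta / 2.0))

# Hypothetical example: a quarter turn about the z axis.
q = axis_angle_to_quaternion((0.0, 0.0, 1.0), math.pi / 2.0)
```

Because v is a unit vector, the resulting quaternion always has unit norm, which is what allows a recorded set of joint orientations to be treated as points on a sphere before voxelization.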


Figure 2.2: Two local ranges for the elbow, depending on the shoulder configuration (image taken from [17])

2.1.2 Static pose priors

While anatomical constraints can prove well adapted to multi-camera systems, they may prove insufficient to resolve ambiguity in a monocular setting. Multiple anatomically consistent poses may match a 2D image, hence requiring us to arbitrate between them on a likelihood basis, rather than on a constraint basis. This is not to say that explicitly defining joint limits is primitive and should be regarded as inferior! It simply depends on the setting. Furthermore, it might in some cases be worthwhile to keep constraints as a safeguard, if the predictions based on priors are likely to produce inconsistent results.

We borrow the term "static pose priors" from [5]. It is not to be understood as a way of determining priors on single isolated frames, without taking past frames into consideration. Rather, it is meant as a distinction from the dynamic models, which will be sketched in the next point. Roughly put, a pose is said to be static if its model does not contain temporal information when considered at a fixed moment. This information appears indirectly by combining several poses into a chronological sequence. Our own approach can be categorized as such.


One of the first integrations of priors on motion can be found in [19]. The model is a simple stick figure, in which a pose is encoded as a group of 20 body point locations (instead of joint angles). Each learning element is what the authors call a snippet, and consists of a brief sequence of 11 successive poses juxtaposed in a vector, lasting only about a third of a second at the frame rate used. Those snippets are clustered into m groups, and a matrix $M_i$ is defined for each cluster i, by composition of its $n_i$ snippets centered on their mean $\mu_i$. Singular Value Decomposition is then performed on each matrix to yield $M_i = U_i S_i^2 V_i^T$. Only the 50 largest singular values are retained. A covariance matrix can be defined for each cluster i as $\Sigma_i = \frac{1}{n_i} U_i S_i^2 U_i^T$ (see footnote 2). Finally, the prior probability for a snippet x is defined as:

$$P(\mathbf{x}) = \sum_{i=1}^{m} k \, \pi_i \, e^{-\frac{1}{2} \mathbf{x}^T \Sigma_i^{-1} \mathbf{x}} \qquad (2.1)$$

where k is a normalization factor and $\pi_i$ weights the importance of cluster i as its fraction of the whole training set.

The methodology proposed in [30] takes another point of view on the problem. Instead of confining learning to the model space, the authors seek to learn the mapping between image space $\mathbb{R}^c$ and model space $\mathbb{R}^d$. Reconstruction of the image I corresponding to a pose configuration χ (see footnote 3) is normally unambiguous, and can be managed through the use of a function $\zeta : \mathbb{R}^d \to \mathbb{R}^c$ that computes the forward kinematics and the resulting image. As already mentioned, the inverse mapping is much more difficult to achieve, because of the inherent ambiguity of the problem. The idea of the authors is to generate a set of m functions $\phi_k : \mathbb{R}^c \to \mathbb{R}^d$, each one specialized on a sub-domain of $\mathbb{R}^c$. This portion of the model space is not forced to be a connected region. Interestingly, the function ζ can be integrated in the inference to provide a weighted confidence in the candidate poses with respect to the actual image.

Similarly to this explicit binding between image and model in learning, the authors of [35] propose a learning algorithm that combines both generative modeling (top-down) and model recognition (bottom-up) in a kind of bidirectional search. The authors describe the learning technique as being self-supervised, in the sense that inferences in one of the two modeling processes optimize the other, and vice versa. This bootstrap cannot be fully automated however, because both models absolutely need a relevant initialization by independent supervised learning; hence we might give it a kind of semi-supervised status as a whole.

Footnote 2: We renamed two variables used in the paper, namely j into i and Λ into Σ, to harmonize the nomenclature with our own, as we also make use of local covariance matrices (although we have no clusters, and we define Σ in a quite different way).

Footnote 3: Once again, variables are renamed when appropriate.

2.1.3 Dynamic models

These kinds of models view human body poses as physical systems. As such, the current velocity (along with acceleration if required) of each pose parameter can be injected into the model, and treated as an inherent property of the pose. This makes it possible to use related techniques (like Kalman filters for instance). However, as human motion cannot be represented by linear dynamic systems, more specific techniques have to be employed.

Why not always integrate motion dynamics? Firstly, one could argue that velocity

and acceleration are at least partially recovered implicitly in temporal static poses,

even though they do not appear measured as such. Furthermore, according to [30]:

Although the approach presented here can be used to model dynamics,

we argue that when general human motion dynamics are intended to be

learned, the amount of training data, model complexity, and computa-

tional resources required are impractical.

Of course, a more complex model may still yield better performance when exploited meaningfully, that is, with an approach itself based on dynamics. The quoted statement was also made in 2001, and since then, advances in computational power have allowed for more demanding processes. So it seems that the modeling choice depends directly on the intended usage.

An early use of dynamic models for human motion analysis can be found in [41]. The authors begin by advocating the use of dynamics, comparing the predictions made by two systems based on Hidden Markov Models (HMM) for mouse motion gestures. The first class of HMM is trained on the difference between two successive recorded mouse positions, whereas the second is trained on the innovation sequence (difference between observation and prediction) generated by a Kalman filter. Both systems classify samples of simple geometrical shapes (circles, triangles, scribbles) without error, but the dynamic model is noticeably smoother in its predictions due to its auto-calibration.

When applying dynamics to human motion models, the authors define hard constraints in the form of a set of physical forces. Based on $\ddot{q} = W \cdot Q$, the definition of the acceleration of an object q as its inverse mass matrix times the external forces applied to it, they add forces c(q, t) to Q to account for kinematic constraints, defined so as to avoid artificially adding energy to the model. Additionally, soft constraints are represented by a potential field to account for the probabilistic aspects of the problem.

In [29], the tactic chosen is to use a Switching Linear Dynamic System (SLDS), which permits transitions between several linear dynamic systems (LDS). This kind of fusion between HMM and LDS makes it possible to overcome both the discrete nature of the HMM and the nonlinearity of human motion dynamics. But because of the exponential complexity over time due to all the potential switchings from one LDS to another, the authors use approximate algorithms in practice, namely Viterbi, variational and generalized Pseudo Gaussian.

The approach taken in [33] is quite original. Their dynamical model is volumetric and consists of 50 parameters covering both positions in space (for the torso) and angles (for the joints), along with the velocities in both cases. But they do not devise a specifically dynamical method to learn from these variables. Focusing less on the inference power of a learning algorithm, the authors rather emphasize efficient ways to access prior knowledge. Instead of making costly combinations between all training samples, they simply seek to match the one closest to the requested motion, using a binary tree representation to store the knowledge base. The basic idea is to use the variables as nodes to classify the samples.

Samples are, as usual, motions over a time window, which are then transformed to a lower-dimensional form using Principal Component Analysis (PCA). Although intrinsically linear (and thus not ideal for reducing the dimensionality of human motion data), PCA has several advantages for this use: it is widely known and time-efficient, its results can easily be used to decide the number of dimensions retained (instead of a predefined choice) and, most importantly here, it ranks the obtained variables according to their influence. This makes it possible to structure the tree in that order, the dimension with the highest variance being the root. Figure 2.3 illustrates this. With this representation, making a prediction has a very low time cost that is logarithmic in the size of the tree plus linear in the number of samples landing in the same leaf (the authors argue that they generally obtain reasonably balanced trees). This is of course especially interesting when it is possible to generate large knowledge bases that would be computationally expensive to cover exhaustively at each prediction.

Figure 2.3: Sketch of the methodology proposed in [33] (image taken from it)

2.2 Kernel estimators in motion tracking

In the approach proposed in [5], the authors make use of kernel density estimators

to model prior knowledge. As we based our methodology primarily on this paper,


we will dedicate section 4.1 to explain the intuition behind the formulas used here.

Also, as the term of `kernel' has quite various meanings, we recommend reading point

3.3.2 to ensure against possible ambiguities.

Two classes of kernel density estimators are given: isotropic and anisotropic. Being constructed on the basis of a training set composed of D-dimensional vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$, each estimator can return a relative likelihood value for a requested vector $\mathbf{x}$. Each of these vectors contains the joint angle values of several successive poses $\chi_t, \ldots, \chi_{t-k}$. Knowing $\chi_{t-1}, \ldots, \chi_{t-k}$, the most likely (on a local level) unknown pose $\chi_t$ is determined by gradient ascent, according to the kernel estimator used.

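Firstly, given some stochastic kernel $K_h(\mathbf{x}, \mathbf{x}')$, consider the isotropic kernel estimator:

$$\hat{f}_{iso}(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} K_h^{iso}(\mathbf{x}, \mathbf{x}_i) \qquad (2.2)$$

The authors propose to use the (Euclidean distance-based) Gaussian kernel defined as:

$$K_h^{iso}(\mathbf{x}, \mathbf{x}') = \frac{1}{(2\pi h^2)^{D/2}} \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2h^2}\right) \qquad (2.3)$$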
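To make the isotropic estimator and its use in prediction more concrete, here is a minimal NumPy sketch of equations 2.2 and 2.3, together with a gradient ascent restricted to the unknown coordinates. Everything specific in it is an assumption made for illustration and is not taken from [5]: the function names, the toy training set, the ascent on the log-density rather than on the density itself, and the step size.

```python
import numpy as np

def kde_iso(x, samples, h):
    """Isotropic Gaussian kernel density estimate at x (equations 2.2 and 2.3)."""
    n, d = samples.shape
    sq_dist = np.sum((samples - x) ** 2, axis=1)
    norm = (2.0 * np.pi * h ** 2) ** (d / 2.0)
    return np.mean(np.exp(-sq_dist / (2.0 * h ** 2))) / norm

def kde_iso_gradient(x, samples, h):
    """Gradient of kde_iso with respect to x."""
    n, d = samples.shape
    diff = samples - x                                   # rows are (x_i - x)
    sq_dist = np.sum(diff ** 2, axis=1)
    weights = np.exp(-sq_dist / (2.0 * h ** 2)) / (2.0 * np.pi * h ** 2) ** (d / 2.0)
    return (weights[:, None] * diff).sum(axis=0) / (n * h ** 2)

def predict_by_gradient_ascent(x0, free, samples, h, step, n_iter=100):
    """Climb the log-density, moving only the coordinates flagged in `free`.

    Ascending log f instead of f is an assumption of this sketch. With
    step = h**2, each update moves the free coordinates to the
    kernel-weighted mean of the training samples (a mean-shift step).
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        grad_log = kde_iso_gradient(x, samples, h) / kde_iso(x, samples, h)
        x[free] += step * grad_log[free]
    return x

# Toy training set (hypothetical): 2-D vectors (u, k) lying on the line u = 2k.
k = np.linspace(-1.0, 1.0, 201)
samples = np.column_stack([2.0 * k, k])
h = 0.3
free = np.array([True, False])        # first coordinate unknown, second known
x_hat = predict_by_gradient_ascent([0.5, 0.5], free, samples, h, step=h ** 2)
```

In this toy setting the known coordinate plays the role of the past poses and the free coordinate that of the unknown pose: starting from an initial guess of 0.5, the ascent moves the free coordinate towards 1.0, the value suggested by the training samples for a known coordinate of 0.5.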

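The anisotropic kernel estimator is defined as:

$$\hat{f}_{ani}(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} K_i^{ani}(\mathbf{x}) \qquad (2.4)$$

Notice that unlike in the isotropic estimator, each kernel $K_i^{ani}(\mathbf{x})$ is not only centered locally on $\mathbf{x}_i$, but also defined depending on i:

$$K_i^{ani}(\mathbf{x}) = \frac{1}{|2\pi\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \mathbf{x}_i)^T \Sigma_i^{-1} (\mathbf{x} - \mathbf{x}_i)\right) \qquad (2.5)$$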


Each $\Sigma_i$ is a local covariance matrix defined as:

$$\Sigma_i = \alpha I + \sum_{j=1}^{N} K_h^{iso}(\mathbf{x}_i, \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T \qquad (2.6)$$

where α is an external parameter, used to regularize the obtained covariance matrices (to prevent them from being ill-conditioned). Notice that the anisotropic kernel estimator makes use of the isotropic kernel to compute the local covariance matrices.
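The anisotropic estimator (equations 2.4 to 2.6) can be sketched in the same spirit. The code below is only an illustration under stated assumptions: the function names, the toy data and the values chosen for h and α are hypothetical, and the dense pairwise computation is chosen for readability rather than efficiency.

```python
import numpy as np

def gaussian_iso(sq_dist, h, d):
    """Isotropic Gaussian kernel (equation 2.3) as a function of squared distance."""
    return np.exp(-sq_dist / (2.0 * h ** 2)) / (2.0 * np.pi * h ** 2) ** (d / 2.0)

def local_covariances(samples, h, alpha):
    """One covariance matrix per training sample, following equation 2.6."""
    n, d = samples.shape
    diff = samples[:, None, :] - samples[None, :, :]       # diff[i, j] = x_i - x_j
    w = gaussian_iso(np.sum(diff ** 2, axis=2), h, d)      # w[i, j] = K_iso(x_i, x_j)
    outer = np.einsum('ij,ija,ijb->iab', w, diff, diff)    # weighted sum of outer products
    return alpha * np.eye(d) + outer                       # alpha * I regularizes each matrix

def kde_ani(x, samples, sigmas):
    """Anisotropic kernel density estimate at x (equations 2.4 and 2.5)."""
    diff = x - samples
    inv = np.linalg.inv(sigmas)                            # batched inverse, shape (n, d, d)
    maha = np.einsum('ia,iab,ib->i', diff, inv, diff)      # squared Mahalanobis distances
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * sigmas))    # |2 pi Sigma_i|^(1/2)
    return np.mean(np.exp(-0.5 * maha) / norm)

# Toy training set (hypothetical): points on the diagonal of the plane.
k = np.linspace(-1.0, 1.0, 41)
samples = np.column_stack([k, k])
h, alpha = 0.3, 0.01
sigmas = local_covariances(samples, h, alpha)
on_line = kde_ani(np.array([0.5, 0.5]), samples, sigmas)
off_line = kde_ani(np.array([0.7, 0.3]), samples, sigmas)
```

Since all training points lie on one line, each local covariance is stretched along that line, and only the α term contributes in the perpendicular direction. The estimate therefore drops much faster when leaving the line (off_line) than when moving along it (on_line), which is the behavior that distinguishes the anisotropic kernels from the isotropic ones.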

3. Theoretical foundations

Although popularly every one

called a Circle is deemed a Circle,

yet among the better educated

Classes it is known that no Circle is

really a Circle, but only a Polygon

with a very large number of very

small sides.


3.1 Geometrical concepts

This section is directly inspired by the passages of [24] corresponding to the curse

of dimensionality (point 1.2.2 in the reference) and to the estimation of the intrinsic

dimension (chapter 3 in the reference). That said, the examples and formulas we

mention here are also presented in the same way in other sources.

3.1.1 Curse of dimensionality

The so-called "curse of dimensionality" is used in applied mathematics to describe the (undesired) effects appearing when the number of dimensions of a given problem increases, all other things being equal. In particular, the problem can be viewed as our inability to cope efficiently with the exponential increase in (hyper-)volume (see footnote 1) of a considered object.

The classical example used to illustrate this effect is the volume of a sphere relative to its circumscribed cube, when both are defined within a space of dimensionality D. We can define their volumes depending on the radius r of the sphere:

$$V_{sphere}(r) = \frac{\pi^{D/2} r^D}{\Gamma(1 + D/2)} \qquad (3.1)$$

$$V_{cube}(r) = (2r)^D \qquad (3.2)$$

where Γ(n) denotes the gamma function (for the sake of the argument, let us consider it simply as a generalized factorial). The ratio $V_{sphere}/V_{cube}$ is thus:

$$\frac{V_{sphere}(r)}{V_{cube}(r)} = \frac{\pi^{D/2} r^D}{\Gamma(1 + D/2)\,(2r)^D} = \frac{\pi^{D/2}}{\Gamma(1 + D/2)\,2^D} \qquad (3.3)$$

What happens when D becomes larger and larger? The limit of 3.3 when D → ∞ equals 0. This means that the volume occupied by the sphere tends to become negligible compared to the cube containing it, as the dimensionality increases! In other words, the core of the cube actually becomes less and less representative of the whole object, the main volume being concentrated in the exponentially many corners.
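To make this limit tangible, the ratio of equation 3.3 can be evaluated numerically for a few dimensions. The short Python sketch below is only illustrative; the function name `sphere_to_cube_ratio` is ours.

```python
import math

def sphere_to_cube_ratio(D):
    """Ratio of equation 3.3; it does not depend on the radius r."""
    return math.pi ** (D / 2) / (math.gamma(1 + D / 2) * 2 ** D)

for D in (1, 2, 3, 5, 10, 20):
    print(D, sphere_to_cube_ratio(D))
```

For D = 2 the ratio is π/4 (disc inside a square) and for D = 3 it is π/6, after which it drops quickly toward 0.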

If we transpose this to the case of highly multivariate isotropic (symmetrical in all dimensions) normal distributions, we can observe that the density tends to spread out in all dimensions. Although a given point A is always exponentially more likely than a given point B farther from the mode, the region near the mode (say, within one standard deviation σ) accounts for a smaller and smaller part of the whole distribution as D goes up. Suddenly, being somewhere in the tails is actually more likely than landing near the center (which is clearly not so in the univariate case).
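This effect can be quantified. For an isotropic normal distribution, the squared distance to the mode (in units of σ²) follows a chi-square distribution with D degrees of freedom, so the mass lying within one σ of the mode is P(χ²_D ≤ 1). The Python sketch below evaluates it with the standard recurrence of the regularized incomplete gamma function; the function name is hypothetical and the code is given for illustration only.

```python
import math

def mass_within_one_sigma(D):
    """P(||x - mode|| <= sigma) for an isotropic normal in D dimensions."""
    x = 0.5  # half of the squared radius, in units of sigma^2
    if D % 2 == 1:
        p, a = math.erf(math.sqrt(x)), 0.5   # closed form for D = 1
    else:
        p, a = 1.0 - math.exp(-x), 1.0       # closed form for D = 2
    # Recurrence P(a + 1, x) = P(a, x) - x^a e^(-x) / Gamma(a + 1),
    # applied until a reaches D / 2.
    while 2 * a < D:
        p -= x ** a * math.exp(-x) / math.gamma(a + 1)
        a += 1
    return p

for D in (1, 2, 3, 5, 10):
    print(D, mass_within_one_sigma(D))
```

The familiar 68% of the univariate case falls below 40% for D = 2 and becomes negligible for D = 10.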

What are the practical consequences of this phenomenon? Firstly, desirable properties that we pragmatically regard as only marginally violated in low-dimensional spaces may degrade severely in higher dimensions. Secondly, if we don't possess a strong prior knowledge of the object, studying it locally will prove to be exponentially more difficult. In particular, when having only limited data at our disposal, our knowledge of the whole problem risks losing global relevance. These issues are quite critical in the sense that there is no easy workaround to avoid them.

¹ From now on, we will take the liberty to leave out the prefix "hyper" in order to improve readability.

Figure 3.1: 3rd and 5th order approximations of the Hilbert curve

3.1.2 Intrinsic dimensionality

A nuance must be added to the conclusions of the previous point. The curse of dimensionality is indeed structural to the real dimension of the object considered, and nothing can be done against it. Nevertheless, we should bear in mind that the important word here is real. Let us suppose we analyze a data set X, each element $\mathbf{x}_i$ of which is formed of several variables $x_{i1}, x_{i2}, ..., x_{iD}$. If we wish to extrapolate a continuous geometrical set from it, then the dimensionality which intuitively appears adequate is simply D, the number of variables. But there may be redundancies, even partial ones, among them (in a statistical context, the variables are not independent), and thus a dimension d < D would actually already be sufficient.

The intrinsic dimension can be naively thought of as the minimal number of latent variables which are necessary to completely describe the considered set. However, fractal objects put this definition into difficulty: for instance the Hilbert curve (see figure 3.1) covers the space of a square, although the points on a curve can be described by a single parameter. The Lebesgue covering dimension (also called topological dimension) is in that case more theoretically sound. But when dealing with limited data, it might be easier to use one of the fractal dimensions instead. Another argument in favor of the fractal dimensions is that they do not restrict themselves to integer numbers, which may perhaps seem a bit absurd at first glance, but actually characterizes the nature of finite data sets much better.

There are several fractal dimensions, such as the capacity, information and correlation dimensions, which correspond to different values of the free parameter q of the generalized q-dimension (q does not itself represent a dimension, but only determines how to estimate one). Let us just consider here the capacity dimension to illustrate the data-driven quality of these approaches. We can circumscribe the given data set within a hypercube, and then split it into a grid of smaller hypercubes of edge length ε. Let N(ε) be the number of those hypercubes that contain at least one data point.

The capacity dimension is then defined as:

$$d_{cap} = -\lim_{\varepsilon \to 0} \frac{\log N(\varepsilon)}{\log \varepsilon} \qquad (3.4)$$

The underlying intuition behind this formula is to cover the space that is really occupied with increased precision, as ε approaches 0. This is very similar to digital images made of pixels (or voxels in 3D): once the resolution is set high enough, there is no more visible difference with the original.
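With finite data, the limit of equation 3.4 cannot be taken literally. A common workaround, assumed here for illustration, is to measure the slope of log N(ε) against log(1/ε) between two grid resolutions. The Python sketch below does this for two artificial data sets in the unit square; the function names and the chosen resolutions are ours.

```python
import math

def box_count(points, eps):
    """N(eps): number of grid cells of edge eps containing at least one point."""
    return len({tuple(math.floor(c / eps) for c in p) for p in points})

def capacity_dimension(points, eps_small, eps_large):
    """Finite-data stand-in for the limit: slope between two resolutions."""
    n_small = box_count(points, eps_small)
    n_large = box_count(points, eps_large)
    return (math.log(n_small) - math.log(n_large)) / (
        math.log(1 / eps_small) - math.log(1 / eps_large))

# Points on the diagonal of the unit square: two variables, one degree of freedom.
diagonal = [(t / 1000, t / 1000) for t in range(1000)]
# Points filling the whole unit square.
square = [(i / 100, j / 100) for i in range(100) for j in range(100)]

d_diagonal = capacity_dimension(diagonal, 1 / 32, 1 / 8)
d_square = capacity_dimension(square, 1 / 32, 1 / 8)
print(d_diagonal, d_square)
```

Both data sets have D = 2 variables, yet the estimate is close to 1 for the diagonal and close to 2 for the filled square.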

3.2 Machine Learning concepts

The field of Machine Learning is dedicated to the task of designing programs able to adjust their behavior automatically, provided they receive adequate training. The commonly accepted definition [25] states:

A computer program is said to learn from experience E with respect to

some class of tasks T and performance measure P, if its performance at

tasks in T, as measured by P, improves with experience E.


3.2.1 Inductive bias

In the above-quoted definition, one may wonder how far the tasks T could deviate from available experience E. In the extreme case where the tasks consist of a predefined set of requests exactly matching the complete experience, a database system could actually be considered as a learning program (as filling it will indeed make it more precise)! The human equivalent of this would be a passive pupil strictly reciting his lesson, without making any effort to think about the meaning of the words he takes from his memory. Obviously, those so-called rote-learners are not what we expect from Machine Learning. A program attempting to recognize people by face cannot realistically assume that the input image could directly be found in the knowledge base, far from it! Briefly put, solely accumulating experience cannot be a purpose in itself, as the benefits of learning are then limited to what was already at our disposal in the first place. Instead, we expect our program to generalize what it gets from its experience to a much broader set of examples.

At this point, it should be noted that, for all its merits, generalization by induction is not at all in accordance with formal logic. Indeed, consider for instance, having X ⊂ Y:

$$\frac{\forall x \in X,\; P(x)}{\forall y \in Y,\; P(y)} \qquad (3.5)$$

This claim clearly does not hold. This should not be a surprise, however, as induction is actually the counterpart of deduction: reversing the inference in 3.5 makes it logically correct. Nevertheless, pragmatically speaking, we could treat the conclusion on Y as at least plausible, provided we strongly believe that X is representative of Y with regard to property P. Philosophers have ardently discussed the validity of inductive arguments, and their status inside science itself. Popper even argued that statements must necessarily be falsifiable to be called scientific! And indeed, in a strictly deductive world, we would never learn any rule from observation, leading to a very poor understanding of our universe. In this sense, induction has the strength of its weaknesses.

Inductive bias is maybe the most fundamental characterization of any Machine Learning technique. It is, so to speak, the criterion the algorithm uses to arbitrate the choice between several contradictory hypotheses. Two levels of biases must be distinguished: restriction and preference [16]. In the first case, we assume some predefined properties of the general rule, and therefore discard a whole set of hypotheses that will never be considered, even if the data strongly suggests one of them. In the second case, more data-driven, we follow some guiding criteria through the complete hypothesis space.

These two kinds of biases will be exemplified in our context in point 3.3.1, by the distinction between parametric and non-parametric density estimators.

3.2.2 Training and test sets

The experience mentioned in the above definition does not imply a direct interaction of the learning program with its environment. In most cases, we rather provide the data ourselves, after having measured it independently. This is called the training set, or in other words the knowledge base of the learning algorithm. The model built from that data can be explicit (e.g. a decision tree), or implicit (the so-called lazy learners). But even in the explicit case, the quality of the model cannot be evaluated objectively without comparing the predictions made to a group of already well-established results. This is called the test set.

There are at least two critical conditions which need to be fulfilled, if we want to guarantee the significance of the results. First of all, the two mentioned sets need to be determined independently of each other. Should we fail to respect this, then learners overly confident in the training set (this is called overfitting) would be judged favorably, but their predictive power will actually be biased by the training set, and still not capture the whole domain we want to cover! To understand why, one has to remember the nature of inductive reasoning (see point 3.2.1). The more we believe the examples we have at our disposal have the value of a fixed rule for all possible situations, the more "dogmatic" and categorical the results will be, while being in accordance with our (limited) experience. As previously said, this is necessary to generalize, but we need the test set to be totally independent and unbiased toward the incomplete clues suggested by the training set.

This leads us to the second fundamental condition, namely a reasonable proportion of training and test sets among the available data. It should be fairly clear that the bigger the training set, the less overfitting there will be, and similarly, the bigger the test set, the more reliable and statistically significant the evaluation will be. Unfortunately, most of the time, we have rather limited data already determined (relative to the hypothesis space), and that is precisely the reason why we would like the learning program to do it for us automatically! When dealing with a critically low amount of data, priority should be given to the test set, even if this consequently degrades the performance of the learner confronted with an even smaller training set. A useful trick in this situation is to perform a K-fold cross-validation, see point 3.2.5.

On top of that, it is also plausible that even the data we have validated ourselves is not perfectly reliable: in other words, that it contains approximations or even strict logical contradictions. For the training set, depending on the learner used, noise will affect performance in different proportions. In general, statistical approaches show more robustness to noise than logic-based ones. For the test set, noise also impacts the variability of the evaluation, in addition to the above-mentioned potential non-representativity of the test set.

In our case, we get reference data about the pose directly from marker-based motion capture. This ensures independence from the results obtained with image-based reconstruction techniques which, even if the optical conditions are optimal, are still more prone to noise than poses determined by solving the inverse kinematics problem on the quite precise positions of the markers. Nevertheless, noise remains an issue, especially when pose variation is being considered over short temporal sequences (consecutive frames). In the field of human motion tracking, marker-based reference data is designated by the term ground truth (see e.g. [26]).


3.2.3 Supervised learning

Another important aspect of Machine Learning is whether the training is supervised or not. Roughly put, it is about the labeling of the data we already possess. If we give the algorithm a corresponding result on each training sample, and then expect it to predict the same type of result on the subsequent unlabeled instances of the test set, it is called supervised learning. In this case, we determine ourselves which variable, or group of variables, should be considered as the result, and thus we implicitly establish a causality chain. Many supervised algorithms consider only discrete output values, typically class names and not numbers (the algorithm is then called a classifier), but this is not a sine qua non condition, and the output value could also be a vector of continuous values for instance.

In contrast to this, unsupervised learning abstracts the data as a whole, and is more interested in discovering the underlying structure, without making assumptions about the causality links between the variables. The paradigmatic examples of this class are the clustering algorithms, where we only decide on the number of anonymous data classes (the so-called clusters) we wish to obtain in the end.

The supervised nature of our learner is perhaps not obvious at first glance. Indeed, kernel density estimation is considered to be linked to unsupervised learning, as guessing the general probability distribution of the data is clearly an unsupervised approach. But we are not interested in the distribution in itself, rather we use it as a tool to guide prediction. See point 4.3.1 for more details about the structure of the training samples.

3.2.4 Gradient ascent

The technique of gradient ascent is not a concept of Machine Learning strictly speaking, but several ML algorithms rely more or less heavily on it, most notably the famous Artificial Neural Networks. In general, gradient ascent (resp. descent) makes it possible to find local maxima (resp. minima) of a function whose general shape is unknown. This methodology does not suffer from the pitfalls of more naive approaches, like building a whole grid of the function domain, which could be both expensive to compute and imprecise (as this grid is necessarily discrete, the real maximum will very probably lie between the meshes of the net). Starting with a more or less arbitrary point, gradient ascent will converge toward a peak on the function surface.

Figure 3.2: Missing global maximum (red is high, blue low)

Mathematically speaking, the gradient ∇f of a function f is the vector of all partial derivatives of f. If we evaluate it on a given point, the gradient determines the direction of steepest ascent, or in other words, an arrow toward the locally most likely candidate for a maximum. The length of the step to be made in that direction is significant, as too big steps might "jump over the hill" so to speak, whereas minimalist steps will slow down convergence. For sufficiently smooth functions, applying gradient ascent iteratively converges toward a local maximum, provided the step size decreases in a suitable manner over time. It is important to note the word local: ideally, we would like to find a global maximum, which is unfortunately a much more difficult task. Depending on the chosen starting point, we might thus land on local maxima of varying quality. Figure 3.2 shows an example of this. Starting with point x0, gradient ascent will attain local maximum lm after 5 iterations. Global maximum gm is unfortunately not discovered. If x0 had been situated somewhere else (not necessarily closer to gm), it might have been discovered though.
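The dependence on the starting point can be reproduced on a toy example. The Python sketch below applies gradient ascent with a slowly decreasing step size to a univariate function with two peaks of different heights. The function f, the step schedule and all names are assumptions chosen for illustration; they do not correspond to figure 3.2 nor to our implementation.

```python
import math

def f(x):
    """Toy function: a small peak near x = -2 and a higher one near x = 2."""
    return math.exp(-(x + 2) ** 2) + 2 * math.exp(-(x - 2) ** 2)

def grad_f(x):
    """Analytical derivative of f."""
    return (-2 * (x + 2) * math.exp(-(x + 2) ** 2)
            - 4 * (x - 2) * math.exp(-(x - 2) ** 2))

def gradient_ascent(grad, x0, step=0.2, decay=0.01, tol=1e-10, max_iter=10000):
    """Follow the gradient uphill with a step size shrinking over time."""
    x = x0
    for t in range(max_iter):
        g = grad(x)
        if abs(g) < tol:          # flat enough: we are on a peak
            break
        x += step / (1 + decay * t) * g
    return x

lm = gradient_ascent(grad_f, x0=-3.0)   # ends on the smaller peak
gm = gradient_ascent(grad_f, x0=1.0)    # ends on the higher peak
print(lm, f(lm), gm, f(gm))
```

Both runs converge, but only the second one reaches the global maximum.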

In our implementation, we use gradient ascent to find a local maximum on the surface of the density estimation. See point 4.3.4 for more details.

3.2.5 Cross-validation

Cross-validation is a methodology which is often employed in Machine Learning. Its core principle is to extract a subset from the data available as training set to form a distinct entity, called the validation set. This is done several times, each time with a new selection to form the validation set, in order to guarantee its representativity and neutrality relative to the remaining training set. Especially in the case where data is limited, and where building a significant test set would also directly impact the quality of training itself, it may be interesting to treat each validation set exactly as a small test set (speaking of "cross-testing" might actually be more accurate then). Otherwise, in the classical usage, the validation set is directly used to improve performance, and must therefore remain strictly separated from the test set as well.

The general idea behind this third set is to act as an interactive feedback: instead of just judging the final model to assess its quality (test set), it evaluates the current model proposed, while remaining independent of the current training data itself. Of course, learning algorithms already select the model that they consider to be the "fittest". Nevertheless, many of them require some tuning parameters to be fixed from outside, the so-called hyper-parameters. Choosing their values manually at first guess might be largely suboptimal, so a good practice is to seek the best hyper-parameters with cross-validation. When confronted with multiple hyper-parameters, there is of course no guarantee that taking the best ones on an individual scale (that is, by letting only one vary at a time) will bring the best combination overall, as they are unlikely to impact performance independently, so in theory all combinations should be considered. However, in some cases, the time overhead might become truly prohibitive.

A quite generic way of extracting several validation sets from the base training set is to partition it into K folds of the same size. This makes it possible to cover the whole base training set with no overlap between the successive validation sets. The most complete form of K-fold is leave-one-out cross-validation, where K is simply the size of the base training set. Each element of the training set is thus successively removed to act as a singleton validation set. The disadvantage of this method is of course its very high computational cost, especially with big training sets.
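The partitioning itself is straightforward. The following Python sketch, in which the function name `k_fold_splits` is ours, generates the index sets of each fold; when the size of the training set is not a multiple of K, the first folds simply receive one extra element.

```python
def k_fold_splits(n_samples, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

# 5 folds over 10 samples.
splits = list(k_fold_splits(10, 5))
# Leave-one-out: K equals the number of samples.
loo = list(k_fold_splits(4, 4))
print(splits[0], loo[0])
```

In practice the data would usually be shuffled before partitioning, which this sketch leaves out for readability.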

3.3 Kernel density estimation

3.3.1 Non-parametric density estimation

Basically speaking, density estimation is the inverse operation of drawing samples from a given probability distribution. There is a major difference, though. Drawing from a probability distribution yields unpredictable results, yet can be done in all neutrality. This is much more difficult to achieve when we deal with limited data and seek to reconstitute the original distribution underlying it. There are an infinite number of potentially matching distributions, but of course they are not all equiprobable, that is, many could have generated the data but are very unlikely to have done so. To help us discriminate between the candidate distributions, we need to define some sort of inductive bias.

A first way to resolve the ambiguity is to make a strong assumption about the sought distribution, and to fix all of it but its parameters. For instance, in a univariate setting, we might assume our data follows an exponential distribution $\lambda e^{-\lambda x}$. We then seek the parameter λ for which the obtained data is the best fit. This is called parametric density estimation. Although choosing the appropriate function seems to be rather arbitrary if we do not possess prior knowledge external to the available data itself, the estimation will be already precise with limited data if our "bet" is right. On the other hand, if this condition is not fulfilled, the estimation will always be wrong, even with infinite data at our disposal. In this sense, the employed inductive bias is restrictive (see end of point 3.2.1).
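For the exponential example, one possible reading of "best fit" is the maximum-likelihood criterion, which we assume here for illustration. Under that criterion the estimate has a closed form, namely the inverse of the sample mean. The Python sketch below checks this on artificial data; all names are hypothetical.

```python
import math
import random

def fit_exponential(samples):
    """Maximum-likelihood estimate of lambda for lambda * exp(-lambda * x)."""
    return len(samples) / sum(samples)

def log_likelihood(lam, samples):
    """Log-likelihood of the samples under an exponential of parameter lam."""
    return len(samples) * math.log(lam) - lam * sum(samples)

random.seed(0)
true_lambda = 2.0
samples = [random.expovariate(true_lambda) for _ in range(10000)]
lambda_hat = fit_exponential(samples)
print(lambda_hat)
```

The whole estimation reduces to a single number, which explains why it is precise with limited data when the assumed family is right, and unable to recover when it is wrong.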

A second approach, namely non-parametric density estimation, is to rely primarily upon the data, without constraining it into a pre-established framework. This methodology has the big advantage of automatically converging toward the true distribution as the number of data samples grows. The classical example is the histogram estimator. It consists in dividing the domain covered by the samples into a grid of bins with length ε, and then counting the number of samples falling into each of those bins. Finally, the overall sum is normalized to 1 (to respect the definition of density), and we obtain quite simply a discrete estimate of the underlying distribution. This intuitive technique converges by itself to the true distribution when ε → 0 and the number of samples N → ∞.

Figure 3.3: Estimators receiving samples drawn from the standard normal: the first three plots are histograms, the fourth plot is a parametric estimator for normal distributions
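The histogram estimator takes only a few lines. In the Python sketch below, given for illustration with hypothetical names, each count is divided by N·ε so that the area under the histogram (and not merely the sum of the counts) equals 1.

```python
import math
import random

def histogram_density(samples, eps):
    """Map each bin index k to the estimated density on [k*eps, (k+1)*eps)."""
    counts = {}
    for x in samples:
        k = math.floor(x / eps)
        counts[k] = counts.get(k, 0) + 1
    n = len(samples)
    # Dividing by n * eps makes the total area equal to 1.
    return {k: c / (n * eps) for k, c in counts.items()}

random.seed(0)
eps = 0.1
samples = [random.gauss(0.0, 1.0) for _ in range(100000)]
density = histogram_density(samples, eps)
print(density[0])   # to be compared with the standard normal density near 0
```

With many samples and a small ε, the value of the bin next to 0 approaches the true density there, about 0.4 for the standard normal.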

Of course, in practice, we cannot just fix ε to a low number when data is not sufficient to remain representative at a very local bin level. On the other hand, the bigger ε is, the rougher the resulting distribution will be. In other words, ε remains a hyper-parameter (another equivalent hyper-parameter would be k, the number of bins). This does not contradict the adjective non-parametric, as ε is not the parameter of any fixed density, but rather inherent to the methodology itself. Figure 3.3 summarizes all this. The histogram with 1000 samples is either coarse or chaotic. But with sufficient samples and a low ε, we get a much better estimation. In contrast, the parametric estimator needs only 1000 samples to generate a very precise approximation, but bear in mind that this works only if the true distribution is indeed normal.

Within a non-parametric setting, a first disadvantage of employing a histogram is that its convergence to a smooth estimation is particularly slow. Its inductive bias lets it prefer discrete distributions, whereas we might very well expect natural distributions to be continuous functions. A second point which could be improved is that the bins only approximate the local influence of each data sample: whatever the exact position of the sample inside the bin, the result will be the same. Kernel density estimation is a non-parametric technique which avoids these limitations.

3.3.2 Kernel definition

What is a kernel? The word has several meanings, depending on the context. We will restrict ourselves here to stochastic kernels of order 2, as defined in [38]. In this definition, a kernel K(u) is a function over R such that:

1. $K(u) = K(-u), \forall u$ (symmetry)

2. $\int_{-\infty}^{+\infty} K(u)\, du = 1$ (valid density)

3. $\int_{-\infty}^{+\infty} u^2 K(u)\, du \neq 0$

The third condition is not as straightforward to interpret as the two others, and comes from the general kernels of order k. Fortunately, for the case of k = 2, it only eliminates specific cases among those that allow negative values for the function K(u). If we stick to kernels that can be interpreted as probability distributions, we can thus simply ignore it (the Wikipedia entry on stochastic kernels does not even mention it).


Figure 3.4: Epanechnikov kernel using various univariate conversions, from darker to lighter: Euclidean norm, Manhattan distance, ∞-norm, multiplicative kernels

How can such a function be used for density estimation? The idea is to center a kernel on each data sample, and then to combine them like in a mixture distribution. Kernels being themselves probability densities, the only but important difference with a mixture distribution is that the number of base distributions is not held fixed, but varies with the sample size, hence the non-parametric nature of the estimator. For the moment, let us still limit ourselves to the univariate case. This basic kernel estimator of a true density f(x), given a sample $x_1, ..., x_N$ of data, is computed as:

$$\hat{f}(x) = \frac{1}{N} \sum_{i=1}^{N} K(x - x_i) \qquad (3.6)$$

Additionally, we can generalize the chosen kernel density with an important external parameter h, called the bandwidth. Given a kernel K(u), we can define $K_h(u) = K(u/h)/h$. At this point, the bandwidth h may seem to only have a minor scaling effect, but as we will see later on, its impact on the estimator obtained is actually quite crucial.


Name            K(u)
Uniform         $\frac{1}{2}\, I(|u| \le 1)$
Triangle        $(1 - |u|)\, I(|u| \le 1)$
Epanechnikov    $\frac{3}{4}(1 - u^2)\, I(|u| \le 1)$
Gaussian        $\frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}$

Table 3.1: Some common kernel types
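The kernels of table 3.1 can be checked numerically against the three conditions of the definition. The Python sketch below uses a simple midpoint rule; the integration range and the number of steps are arbitrary choices of ours, wide enough for the Gaussian tails to be negligible.

```python
import math

KERNELS = {
    "uniform": lambda u: 0.5 if abs(u) <= 1 else 0.0,
    "triangle": lambda u: 1 - abs(u) if abs(u) <= 1 else 0.0,
    "epanechnikov": lambda u: 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0,
    "gaussian": lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi),
}

def integrate(g, lo=-10.0, hi=10.0, n=20000):
    """Midpoint rule on [lo, hi] with n cells."""
    step = (hi - lo) / n
    return sum(g(lo + (k + 0.5) * step) for k in range(n)) * step

def check(kernel):
    """Return (integral, second moment, symmetry flag) for a kernel."""
    total = integrate(kernel)
    second_moment = integrate(lambda u: u * u * kernel(u))
    symmetric = all(abs(kernel(u) - kernel(-u)) < 1e-12
                    for u in (0.1, 0.5, 0.9, 2.0))
    return total, second_moment, symmetric

results = {name: check(kernel) for name, kernel in KERNELS.items()}
for name, result in results.items():
    print(name, result)
```

All four kernels integrate to 1, are symmetric, and have a strictly positive second moment.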

In the general case of multivariate data, we deal with vectors $\mathbf{x}$ and $\mathbf{x}_i$ instead of scalars $x$ and $x_i$. In order to use (univariate) kernels in this setting, an approach is to use a norm on the difference between the vectors $\mathbf{x}$ and $\mathbf{x}_i$ to determine the single variable u. It is also possible to take the product of the kernels evaluated on each single variable difference. Notice this is not the same as a single kernel on the product of the variables' absolute differences!

Generally, as in the isotropic kernel of equation 2.3, the Euclidean norm is implicitly chosen. But other types of norms can of course be used. For instance, the anisotropic kernel of equation 2.5 uses the Mahalanobis distance $d(\mathbf{x},\mathbf{x}') = \sqrt{(\mathbf{x}-\mathbf{x}')^T S^{-1} (\mathbf{x}-\mathbf{x}')}$ as defined between two random vectors (the square root disappears as u is itself squared). There are also the Minkowski distances or p-norms, that are defined as:

$$d_p(\mathbf{x},\mathbf{x}') = \left(\sum_{i=1}^{D} |x_i - x'_i|^p\right)^{1/p} \qquad (3.7)$$

Figure 3.4 illustrates the Epanechnikov kernel (see table 3.1) in the bivariate case. In the first three kernels, data is converted to univariate using the most classic Minkowski distances: 1-norm (Manhattan), 2-norm (Euclidean) and ∞-norm (the limit of the p-norm when p → ∞). The fourth kernel is obtained with the product of the two univariate kernels. Overall, the Euclidean distance generates the most circular kernel.
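The four univariate conversions can be compared on a single pair of points. The Python sketch below implements the Minkowski distances of equation 3.7 and both ways of extending the Epanechnikov kernel. The function names are hypothetical, and the norm-based kernel is left unnormalized: a scaling factor depending on the norm would be needed to make it a valid multivariate density.

```python
import math

def minkowski(x, y, p):
    """p-norm distance of equation 3.7; p = math.inf gives the infinity-norm."""
    if math.isinf(p):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def epanechnikov(u):
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def norm_kernel(x, y, p):
    """Univariate kernel applied to a p-norm distance (unnormalized)."""
    return epanechnikov(minkowski(x, y, p))

def product_kernel(x, y):
    """Product of the univariate kernels evaluated on each coordinate."""
    return math.prod(epanechnikov(a - b) for a, b in zip(x, y))

origin, point = (0.0, 0.0), (0.8, 0.8)
for p in (1, 2, math.inf):
    print(p, minkowski(origin, point, p), norm_kernel(origin, point, p))
print("product", product_kernel(origin, point))
```

The point (0.8, 0.8) lies outside the support of the kernel for the 1-norm and the 2-norm, but inside it for the ∞-norm and for the product kernel, which reflects the different shapes of figure 3.4.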

Among the functions that respect the properties defined above, it is interesting to cite the kernels of table 3.1, where I(·) is the indicator function (1 if true, 0 if false). These kernel types are illustrated in figure 3.5. There are other common types of kernels, but they don't differ much from the ones presented here. The mode, if any, will always be centered on 0 (remember it is the position of the sole currently considered data point), and from there the kernel will monotonically decrease on both sides. The Gaussian kernel is particular though. Unlike the others, it extends infinitely instead of being cut just after the bandwidth limit. Although the quantity soon becomes negligible (exponential decrease), it is not the same as 0, and this information is valuable when comparing two points falling outside the range of the kernels. It should nonetheless be noted that, like the histogram, all kernels will eventually converge toward the true distribution when given enough samples (and setting the bandwidth h close to 0, as for ε).

Figure 3.5: Common kernel types

3.3.3 Bandwidth selection

The choice of h is a tricky one. This is because we are typically confronted with rather limited data, from which we wish to generalize without losing too much focus on what we know. Too low values of h will generate erratic variations on the estimated density, while too high ones will flatten it to a uniform distribution.

From a purely visual point of view, a human experimenter might very well be tempted to increase h until obtaining a smooth surface, pretty to the eye and looking more natural. This is alas not a very convincing criterion, without even having to consider its subjective character. The problem is not to assume that the true distribution should be relatively smooth (it is a fairly natural rule, provided no hidden variable is present), but rather that we could obtain an estimate from limited data that is at the same time smooth and precise.

Figure 3.6: Pitfalls of good-looking curves (100 observed points, h = 0.1 and 1.5 respectively; figure idea taken from [38])

Figure 3.6 illustrates this psychological effect. If we forget about the (normally hidden) blue reference function, the right red curve looks much more like a respectable distribution than the left one, but it actually fails to detect the bimodal nature of the blue curve. Conversely, we should not expect numerical wonders from low values of h either: as for the histogram, the increased "precision" is gained relative to the available data, not to the true distribution.

This dilemma can be interpreted in terms of notions related to Machine Learning. Too high bandwidth values correspond to underfitting (too simple model), and too low to overfitting (relying exaggeratedly on the known samples). In [31], this is presented as the trade-off between bias (high h) and variance (low h) relative to the sought density.
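The two failure modes can be observed by counting the local maxima of the estimate for different bandwidths. The Python sketch below does so on a small artificial sample made of two clusters, using a Gaussian kernel and a fine evaluation grid. The sample, the three bandwidth values and the grid are arbitrary choices made for illustration.

```python
import math

def kde(x, samples, h):
    """Univariate Gaussian kernel density estimate with bandwidth h."""
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples) / (
        len(samples) * h * math.sqrt(2 * math.pi))

def count_modes(samples, h, lo=-4.0, hi=4.0, step=0.005):
    """Count the local maxima of the estimate on a regular grid."""
    n = int(round((hi - lo) / step))
    values = [kde(lo + k * step, samples, h) for k in range(n + 1)]
    return sum(
        1 for k in range(1, n)
        if values[k] > values[k - 1] and values[k] >= values[k + 1]
    )

# Two clusters of five points each.
samples = [-2.3, -2.1, -2.0, -1.9, -1.6, 1.7, 1.9, 2.0, 2.2, 2.4]
modes = {h: count_modes(samples, h) for h in (0.02, 0.5, 3.0)}
print(modes)
```

A very low bandwidth yields one peak per sample (overfitting), a very high one merges both clusters into a single peak (underfitting), and an intermediate value recovers the two clusters.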

In fact, there is no universally accepted way to define the best h overall for a given data set. There are, however, ways to determine an optimal value relative to an error criterion through mathematical analysis, but the best performance criterion itself is not universally agreed upon! For instance, the commonly accepted Integrated Square Error (ISE) and Mean Integrated Square Error (MISE) will choose different bandwidth values as the best one. See [38] for a complete discussion of the subject, and other criteria of optimality.

Depending on what we want exactly, we will thus choose different ways to find a good h. In our case, the criterion is not so much to approximate the true distribution as a whole as closely as possible, but rather to ensure that the estimated local maxima look similar to the true ones, which is a very different thing.

4. Chosen methodology

Taking nine Squares, each an inch

every way, I had put them together

so as to make one large Square,

with a side of three inches, and I

had hence proved to my little

Grandson that - though it was

impossible for us to see the inside of

the Square - yet we might ascertain

the number of square inches in a

Square by simply squaring the

number of inches in the side.


4.1 Preliminary analysis

We will explain here the overall intuition of [5] sketched in 2.2. The idea is to build a kernel density estimator not for itself, but as a tool to guide prediction. Concretely, body poses are modeled through angles at each joint, the length of the limbs being held fixed. For a given frame in a sequence, this gives us a vector, each variable having the value of the corresponding joint angle. If we want to get a better idea of a certain type of movement, we could build a kernel density estimator based on a group of frames where the joints are already known through some other means (the training set). The resulting density would model the relative likelihood of an external pose, as belonging to the same distribution as the elements of the sample used as data set.

This would however only possibly help to estimate the chances for a pose to belong to a type of movement or another (in other words, classifying it), whereas we actually want to go a step further, and predict the next pose when knowing the past ones. Obviously, we then need to take time into account.

For this task, the training set must also contain information about the past. The vectors corresponding to each pose are then no longer composed solely of the current values of the joints, but also contain the values of the joints for some fixed past frames. The values are simply juxtaposed, the actual order having no importance. For instance, to convert a frame t into a vector of the training set, one could retain the values of frames t, t − 1 and t − 2. The kernel density then represents the relative likelihood of the sequence of frames as a whole. Poses that are maybe realistic taken individually, but that do not make a coherent succession will thus be seen as unlikely. Notice that taken as such, there is no explicit causality link following a chronological order (we stay at an unsupervised level up to this point).
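The construction of such training vectors can be sketched as follows in Python. The function name `build_training_vectors`, the choice of lags and the toy joint values are assumptions made for illustration; the actual structure of the training samples is described in section 4.3.

```python
def build_training_vectors(frames, lags=(0, 1, 2)):
    """Juxtapose the joint values of frames t, t-1 and t-2 into one vector."""
    max_lag = max(lags)
    vectors = []
    for t in range(max_lag, len(frames)):
        vector = []
        for lag in lags:
            vector.extend(frames[t - lag])
        vectors.append(vector)
    return vectors

# Toy sequence: 4 frames, 2 joint angles per frame.
frames = [[0.0, 10.0], [0.1, 10.5], [0.2, 11.0], [0.3, 11.5]]
vectors = build_training_vectors(frames)
print(vectors)
```

The first frames of a sequence cannot be converted, as they lack a sufficient past.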

Now, the point is, we have one of those successions of frames at hand, for which we know the past poses, but not the current one (as it is precisely the one we would like to predict). A trick would be to propose various predictions for the current pose, evaluate the kernel density estimator on each vector taken as a whole (with the fixed past poses), and retain the most likely combination of current and past poses. Of course, the tried current pose configurations must not be determined arbitrarily. Gradient ascent can be used to achieve this in a systematic way. This is basically the core principle of the methodology. Section 4.3 explains it in much more detail.

In the context of point 3.3.2, this corresponds to a Gaussian kernel, with the

Euclidean distance as criterion to obtain parameter u of the univariate kernel (and

the according scaling to make it a valid density). Isotropic means being equivalent

in all directions. In other words, the bandwidth is applied to all dimensions with

the same strength. However, we do not expect at all our data to have an equivalent

variance in all dimensions. The isotropic kernel would thus require a large training

set and a low bandwidth to follow the true distribution on a local level. Could we

capture the variances with fewer training elements?

Well, the already mentioned Mahalanobis distance is precisely based on a covariance matrix. We could theoretically just use the sample covariance matrix, but we would like the bandwidth to be adaptive not only across the different dimensions, but also on a local level. By this, we mean that dense regions should have a low bandwidth h to make good use of the precision offered there by the training set, whereas sparsely populated regions should have a larger h to smoothly cover the scarcity of known information, instead of a rigid global compromise. For this, we need a local covariance matrix for each element of the training set. We can use the isotropic kernel itself to weight the relative influence of the other training set points on the covariance matrix of the current training set point (this also allows us to keep h as a scaling parameter).

We implemented the whole system in Matlab. Additionally, in order to reduce computation time, we also implemented a C version of the functions that have an influence on complexity (see section 4.6). This brought us a speed-up of about ×4, depending on the context.

4.2 Kernel reconstruction of artificial examples

4.2.1 Visualization

No realistic human model can simply be based on a low number of Degrees of Freedom (DoF). This poses an issue of visualization. It remains quite easy to observe the behavior of the predictions made by the system when applied to a 3D virtual jointed model. But analyzing the shape and behavior of the underlying kernel density estimation with more than 3 variables is pretty difficult for human experimenters. In order to get a direct and concrete feeling of how kernel estimation itself behaves, we first restricted ourselves to artificial true distributions in dimensions 2 to 4.

We implemented a GUI in Matlab to enable this straightforward interaction with the estimations themselves and with the influence of their hyper-parameters. As such, this toolbox is conceptually independent of the potential applications of kernel estimators (in our case, human motion prediction). The toolbox contains 23 predefined artificial densities. They are all easy to parametrize and can be combined in a modular way to facilitate the creation of new ones. The user can choose among the kernel types listed in the table of point 3.1. Both the isotropic and anisotropic kernels, presented in section 2.2 for the Gaussian type, are supported for all those kernel types. The hyper-parameters can be determined automatically via different methods, or be set manually. The number of generated training samples can also be chosen, and the sample can be held fixed for several estimations of the same true distribution. With the corresponding option set and a reasonable training data size, it is also possible to see in real time the impact of tuning the different parameters.

Figure 4.1: Isotropic kernel estimation of a uniform spiral (1000 observed points)

In the case of 3D data, the density lies in 4D space: each point in space has an associated density value. We patch together sets of points for which the density value is greater than or equal to a predefined tolerance threshold, and thus obtain a group of 3D objects.

Figure 4.2: Two views of an isotropic kernel estimation of the open-box manifold (each point is equiprobable), depending on the value of the tolerance threshold (500 observed points)

The value of the tolerance threshold to accept a point can be modified in real time with a slide bar, which lets the objects shrink or grow along this dimension. In figure 4.2, the left (resp. right) view is obtained with a high (resp. low) tolerance threshold.

4.2.2 Integrated Square Error

Visualization alone remains a bit too subjective. As noted in point 3.3.3, the human

mind tends to favor smooth estimations even when they are actually blurring the

real shape of the sought distribution. Therefore, numerical criteria are desirable to

complement the viewer's impression. A common measure of error when comparing

a function to its estimate is the Integrated Square Error. It is defined, as can be expected, as:

$$\mathrm{ISE} = \int \left(\hat{f}(x) - f(x)\right)^2 \, dx \tag{4.1}$$

It is important to note that we normally don't have the reference function f(x)


at our disposal, as it is precisely what we try to estimate. The ISE itself then needs to be estimated (while remaining distinct from the Mean Integrated Square Error, but this is another story).
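For the artificial densities, where the true f(x) is known, equation 4.1 can be approximated numerically. The following is only a minimal one-dimensional sketch and not the Matlab toolbox code: the function names, the Gaussian kernel, the bandwidth of 0.4, the integration bounds and the trapezoidal rule are all assumptions made for illustration.

```python
import math
import random

def gaussian_pdf(x, mean=0.0, std=1.0):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def gaussian_kde(sample, h):
    """Return a 1D Gaussian kernel density estimate built on `sample`."""
    n = len(sample)
    def f_hat(x):
        return sum(gaussian_pdf(x, xi, h) for xi in sample) / n
    return f_hat

def integrated_square_error(f_hat, f, lo, hi, steps=2000):
    """Trapezoidal approximation of equation 4.1 on the interval [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * dx
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * (f_hat(x) - f(x)) ** 2
    return total * dx

# Toy experiment: 200 points drawn from the standard normal (assumed setup).
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(200)]
f_hat = gaussian_kde(sample, h=0.4)
ise = integrated_square_error(f_hat, gaussian_pdf, -8.0, 8.0)
print(f"ISE = {ise:.5f}")
```

Grid integration of this kind is only practical in the low dimensions (2 to 4) considered for the artificial examples, since the number of grid points grows exponentially with the dimension.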

4.2.3 Average Negative Log Likelihood

The Average Negative Log Likelihood is another performance criterion (used for

instance in [40]), much closer to a Machine Learning approach. It consists of evaluating the estimated density on a set of m points that were generated from the true distribution. It is defined as:

$$\mathrm{ANLL} = -\frac{1}{m} \sum_{i=1}^{m} \log \hat{f}(x_i) \tag{4.2}$$

Contrary to the ISE, this does not require any external knowledge of the unknown function f(x), as we can build a test set by setting aside a subset of the available data. Of course, two different estimations must be compared on the same test set. Note that summing the logarithms is equivalent to taking the logarithm of the product; hence, very low density values directly penalize the final result.
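Equation 4.2 translates almost directly into code. The sketch below is illustrative only: the function names and the toy uniform "estimate" are assumptions, chosen so that the effect of a single very unlikely test point is easy to see.

```python
import math

def average_negative_log_likelihood(f_hat, test_points):
    """Equation 4.2: mean of -log f_hat(x_i) over the m test points."""
    m = len(test_points)
    return -sum(math.log(f_hat(x)) for x in test_points) / m

def uniform_density(x):
    """Toy estimate: uniform on [0, 2], with a tiny floor elsewhere."""
    return 0.5 if 0.0 <= x <= 2.0 else 1e-12

good_fit = average_negative_log_likelihood(uniform_density, [0.2, 1.0, 1.7])
one_outlier = average_negative_log_likelihood(uniform_density, [0.2, 1.0, 5.0])
print(good_fit, one_outlier)
```

A single test point falling in a region of near-zero estimated density is enough to raise the ANLL considerably, which is the penalization effect mentioned above. An estimate that returns exactly 0 on a test point would make the logarithm undefined, so a strictly positive kernel (or a floor value, as assumed here) is needed.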

4.3 Processing real data

4.3.1 Building of training set

In the context of human poses, samples are not drawn from any known probability

density. Actually, it sounds rather weird to ask ourselves what the global maximum

is (in other words, the most probable sequence of human poses in general). We also

have to bear in mind that any model based on joint angles remains only a model, and does not precisely reflect the complete human body configuration. Anyway, the task of establishing a data set representative of all arbitrary movements would be totally intractable. In this respect, trying to reveal the nature of human poses with a kernel density estimator is, theoretically speaking, inherently flawed and should remain out of the question. The real purpose here is to use the estimation locally, in a purely pragmatic way, to help predictions on similar movements.

Figure 4.3: The MMM joint model during a boxing move

There are two different joint models used in the Go Hu.MAn group: the Master Motor Map (MMM) model depicted in figure 4.3, and the human model. The first one was originally conceived to execute man-like movements on a humanoid robot, without being constrained by anatomical limits (such as the impossibility of extending one's elbow beyond 180°, unless subjected to an especially painful situation). Each joint is thus simply made of three unrestricted angle-based Degrees of Freedom (DoF). The second model is directly intended to represent a human person. Both models can be used for prediction.

In addition to the joints of the model itself, we have to consider what is called in [5] the global twist. It represents the position of the body in a fixed reference environment, and is thus composed of 3 variables for translation and 3 others for rotation. The MMM model has a total of 18 joints, which accounts for 60 variables (3 × 18 + 6). The human model has a varying number of DoF per joint, and comprises 42 variables overall, global twist included.


The input data for the two model types were originally generated from the same

marker-based data. Each category of movement is composed of several sequences,

where a marker-covered sportsman was asked to perform the desired movement. All

in all, there are a bit more than 100 sequences available, covering 10 categories of

movements (acrobatic jumps and/or martial arts).

Each sequence is a matrix representing the model configuration over time. As mentioned in section 2.2, each data point of the kernel density estimator is in fact the juxtaposition of the model's variables over several frames, following a predefined rule which we call the time sequence. For instance, if we follow the time sequence t, t − 1 and t − 2, the data point corresponding to frame number 1200 will be a vector of 180 variables (with the MMM model) containing the poses of frames number 1200, 1199 and 1198.
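The construction of a data point from a sequence can be sketched as follows. This is an illustration under assumptions, not the actual implementation: the sequence is taken to be a list of frames (one list of variables per frame), the time sequence is encoded as a tuple of offsets into the past (0 for t, 1 for t − 1, and so on), and the function names are hypothetical.

```python
def build_data_point(sequence, t, time_sequence=(0, 1, 2)):
    """Juxtapose the poses of frames t - k for each offset k of the time sequence."""
    point = []
    for offset in time_sequence:
        point.extend(sequence[t - offset])
    return point

def build_training_set(sequence, time_sequence=(0, 1, 2)):
    """One data point per frame that has all its past frames available."""
    deepest = max(time_sequence)
    return [build_data_point(sequence, t, time_sequence)
            for t in range(deepest, len(sequence))]

# Toy sequence: 1300 frames of 60 variables (as in the MMM model),
# where every variable of frame f simply holds the value f.
sequence = [[float(f)] * 60 for f in range(1300)]
point = build_data_point(sequence, 1200)
print(len(point), point[0], point[60], point[120])
```

With 60 variables per frame and three frames in the time sequence, the resulting vector has 180 variables, as in the example of frame 1200 above.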

At this point, there is a difference with the approach presented in [5]. In order to slightly reduce the total number of dimensions, the authors recommend including the global twist variables only in the present frame t (and taking relative values instead of absolute ones). Doing so in our case led to extremely imprecise predictions of the global twist in preliminary experiments, the system having only the other (very indirectly correlated) variables at its disposal to guide prediction. We suppose that this is much less of a problem in their case, as the combination with the image fidelity method through global optimization implicitly overwrites the predictions of the global twist. Additionally, tracking the body as a whole is in general much easier than detecting the local joints, and this is even more true in the presence of partial occlusions (thus, prediction is rarely needed at all for the global twist). Anyway, in the MMM model, the global twist represents only 10 % of the complete model, so the dimensional overhead is minor compared to the brutal loss of precision.

4.3.2 Frame subsampling

So far we have discussed how to generate data for the kernel density estimator, but

not which frames to give it. We could for instance, starting from frame t, retain only

t + 10, t + 20, etc., instead of t + 1, t + 2, etc. We call this gap between two retained frames the frame step. It can be defined for both the training and test sets.

But is this cut justified at all? Indeed, there are time considerations that make it worth selecting only a subset of the available data, especially for the training set. Without going into the details of the complexity analysis (see section 4.6), computation time increases linearly with the sizes of the test set and training set for the isotropic kernel, and quadratically with the size of the training set for the anisotropic kernel.

In the data at our disposal, the frame rate is quite high, so the poses vary little at the scale of a unit frame step. Worse, the proportion of those variations that is attributable to noise is then no longer negligible, which is a highly undesirable effect. Furthermore, the human body is not able to change trajectories instantly, so we can quite reasonably expect interpolating behavior at a small time scale. And of course, if sufficient data is available, it might be more interesting, for equal computation time, to "invest" in additional sequences of the movement category (variety) instead of superfluous frames, or to tune the other hyper-parameters more finely.

Finally, this is not inconsistent with the methodology itself, as from an external

point of view, this is simply equivalent to a camera with a slower frame rate. These

arguments must however be nuanced for the test set frame step. If the end application

wants results at the level of the original frame rate, we should also assess performance

(but not necessarily train!) at this scale.

4.3.3 Step-by-step prediction

Another point to clarify is that the whole system (prediction and image fidelity combined) can only work step-by-step, meaning that all pose estimations must be made in chronological order. Indeed, the prediction method needs the pose configuration of the previous frames mentioned in the time sequence in order to make a prediction for t. This also means that at the beginning of the process (until the ith frame is reached, where t − i + 1 is the "deepest" frame of the time sequence), the image fidelity method alone is responsible for the pose estimations.

Figure 4.4 illustrates this step-by-step handling of data in the simplest case, where the time sequence is composed of t and t − 1 (thus, prediction can be activated as soon as frame 2), and the test frame step is equal to 1 (thus, each frame after frame 1 is estimated with both methods). The ⊕ symbol has no formal meaning here; it simply indicates that the two outputs are somehow merged into a single estimate. The training frame step is not relevant in this scheme, as it solely concerns the 'Training data' component.

Figure 4.4: Step-by-step estimate, test frame step = 1, time sequence of t and t − 1
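The chronological loop described above could look roughly like the sketch below. Everything in it is hypothetical: the three callables stand in for the image fidelity method, the predictor and the ⊕ merge, whose actual interfaces are not specified here, and it is assumed that the merged estimates of past frames are the ones fed back to the predictor.

```python
def run_step_by_step(num_frames, time_sequence, image_fidelity, predict, merge):
    """Estimate frames 1..num_frames in chronological order."""
    first_predicted = max(time_sequence) + 1  # frame i in the text
    estimates = {}
    predicted_frames = []
    for t in range(1, num_frames + 1):
        observed = image_fidelity(t)
        if t < first_predicted:
            estimates[t] = observed  # not enough past poses yet
            continue
        # Past poses required by the time sequence (offset 0 is the unknown frame t).
        past = [estimates[t - k] for k in time_sequence if k > 0]
        estimates[t] = merge(predict(past), observed)
        predicted_frames.append(t)
    return estimates, predicted_frames

# Toy stand-ins: scalar "poses" that grow by 1 per frame.
estimates, predicted_frames = run_step_by_step(
    num_frames=5,
    time_sequence=(0, 1),
    image_fidelity=lambda t: float(t),
    predict=lambda past: past[0] + 1.0,
    merge=lambda predicted, observed: 0.5 * (predicted + observed),
)
print(predicted_frames)
```

With the time sequence t, t − 1, the first frame is handled by the image fidelity stand-in alone and prediction starts at frame 2, which matches the case of figure 4.4.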

4.3.4 Prediction by gradient ascent

The next question is: how do we actually predict a pose, given a training set and

some past poses? The kernel density estimator is based upon the training set only.

If we evaluate it on a point x, we have a relative likelihood estimate for this point

to appear. Remember that a request to our learner takes the form of a vector composed of the juxtaposition of human body poses over a time sequence, of which we only know the poses of the past frames and want to infer the current frame's pose. Our goal is thus to discover which vector x is the most likely, under the constraint that the variables belonging to the past frames are of course held fixed. Hence, we search for a maximum on a subspace of the density hyper-surface.

However, we evidently can't afford to compute the estimate over a whole domain, and only afterward search for the best candidate matching the variable values of the past frames. Conceptually speaking, we have a density estimation at our disposal, but each point of its domain is costly to compute, and thanks to the curse of dimensionality, there are lots of points. A global maximum is thus definitely out of reach. What we can do is search for a local maximum, using gradient ascent.

The initial point can be determined in different ways (the variables of past frames being already fixed). We chose to take the same pose as the nearest past frame of the time sequence. In preliminary experiments, we found that this seems to slightly outperform arbitrary choices, though the influence seems rather small. We must nonetheless point out that this result holds for small training set sizes in high-dimensional space. In the opposite situation, one may expect less sparse distributions and thus a more important role for the choice of the initial point.

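We actually compute the gradient of the logarithm of the estimation (as the logarithm function is strictly monotone, this does not influence the direction of steepest ascent):

$$\frac{\partial \log \hat{f}_{iso}(\mathbf{x})}{\partial \mathbf{x}} = -\frac{\sum_{i=1}^{N} K^{iso}_{h}(\mathbf{x}, \mathbf{x}_i)\,(\mathbf{x} - \mathbf{x}_i)}{h^2 \sum_{i=1}^{N} K^{iso}_{h}(\mathbf{x}, \mathbf{x}_i)} \tag{4.3}$$

$$\frac{\partial \log \hat{f}_{ani}(\mathbf{x})}{\partial \mathbf{x}} = -\frac{\sum_{i=1}^{N} K^{ani}_{i}(\mathbf{x})\,\Sigma_i^{-1}(\mathbf{x} - \mathbf{x}_i)}{\sum_{i=1}^{N} K^{ani}_{i}(\mathbf{x})} \tag{4.4}$$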

where the definitions of f̂iso(x) and f̂ani(x) are given in equations 2.2 and 2.4 respectively. Notice that this is the complete gradient, whereas in practice we are only allowed to use the partial derivatives with respect to the variables of the current frame. The derivations are detailed in appendix A.
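To make the constrained ascent concrete, here is a minimal sketch for the isotropic Gaussian case of equation 4.3. It is an illustration only, not the Matlab/C implementation: the fixed step size, the fixed number of iterations, the layout of the vector (past variables first, current variables last) and all names are assumptions.

```python
import math

def iso_log_gradient(x, train, h):
    """Gradient of log f_iso at x (equation 4.3) for a Gaussian kernel."""
    sq_dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in train]
    # The normalisation constant cancels in the ratio; subtracting the
    # smallest squared distance only protects against underflow.
    smallest = min(sq_dists)
    weights = [math.exp(-(s - smallest) / (2.0 * h * h)) for s in sq_dists]
    total = sum(weights)
    return [
        -sum(w * (x[d] - xi[d]) for w, xi in zip(weights, train)) / (h * h * total)
        for d in range(len(x))
    ]

def predict_current_pose(past, initial_current, train, h, step=0.1, iterations=200):
    """Gradient ascent on the current-frame variables only; the past stays fixed."""
    x = list(past) + list(initial_current)
    free = range(len(past), len(x))
    for _ in range(iterations):
        gradient = iso_log_gradient(x, train, h)
        for d in free:
            x[d] += step * gradient[d]
    return x[len(past):]

# Toy training set of (past, current) pairs following "current = past + 1".
train = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
# Known past value 2.0; the ascent starts from the nearest past pose.
prediction = predict_current_pose([2.0], [2.0], train, h=0.5)
print(prediction)
```

In this toy case the ascent moves the current variable from its initial value 2 toward approximately 3, the local maximum of the density restricted to the subspace where the past variable equals 2. A real implementation would rather use a convergence test and an adapted step size than a fixed iteration count.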

We can also remark at this point that the isotropic kernel estimator is a lazy

learner (i.e. the whole computation is done on request), while the anisotropic kernel


estimator is not (inverse covariance matrices and determinants are kept in memory after the first request).

4.4 Hyper-parameters selection

4.4.1 Bandwidth

As already discussed in general in point 3.3.3, selecting the adequate value of h

is almost the fundamental question of kernel density estimation. Along with the

regularization parameter α in the anisotropic case, it is theoretically possible to

optimize the best choice of parameters via cross-validation. If we optimize in terms

of accuracy of the estimated density, the performance criterion to use is the ANLL defined in equation 4.2, applied on the union of all validation sets instead of the

test set. However, the real performance criterion is eventually the accuracy of the

prediction made after the gradient ascent, which is not quite the same thing (although

we expect a good ANLL to be at least partially helping) and implies the associated

computations.

But even when simplifying to the ANLL and applying a low-cost cross-validation (say, only 2 folds), this multiplies the number of computations of an already expensive density estimator. Besides, if an important part of the training set is "sacrificed" to the validation set, the value obtained for h will actually be over-estimated. Indeed, remember that the larger the training set, the lower we should allow h to go. The good values for h in fact depend highly on the training set size.

And unlike integer hyper-parameters, there is no clear numerical step to use when exploring the influence of h (and α in the anisotropic case) on performance. Depending on the training set, fixed initial values and step sizes are quite prone to be arbitrary. To counter this problem, one may iteratively refine the search between the two best candidates (see e.g. the approach proposed in [37]), a bit like in a dichotomic search. Nonetheless, very much like in gradient ascent, one has to remain conscious that this is still not sufficient to guarantee finding a global optimum, unless the performance function over the parameter space is unimodal.

Figure 4.5: Evolution of average (solid blue) and maximum (dashed green) nearest neighbor distance with sample size varying from 10 to 1000, D = 1 and D = 100 resp.

For all these reasons, optimizing h in a systematic manner seems very costly to apply in practice, especially in the anisotropic case. If an easily computable and reasonable approximation exists, the time spent on optimizing may instead be used to evaluate a larger training set (assuming sufficient data is available). In [6], the authors propose to use the maximal nearest neighbor distance of the training set. However, a year later, in [5], they switch to the average nearest neighbor distance.

The idea that may justify the maximum nearest neighbor distance is that it allows the bandwidth to expand until it finds a neighbor, meaning there is no gap of low (Gaussian) or null (strict window kernels) value between two neighbors, a region we would intuitively not expect to see suddenly falling to very low probability levels. That said, this argument is not based on an objective numerical criterion, and might be somewhat over-smoothing (remember figure 3.6). The average nearest neighbor distance also tries to achieve this effect, but in a less radical manner.

An important point is the behavior of these approximations when the number of training samples changes. A highly desirable property would be that the proposed value for h decreases as samples are added. It turns out that, in that respect, the average nearest neighbor distance behaves much better than the alternative maximum nearest neighbor distance.


Figure 4.5 illustrates this. We generated data drawn from the standard normal distribution with D variables, and compared the evolution of the two measures when adding new elements to the sample (of course, our real data is certainly not generated from a unimodal distribution such as this one, but we think the following observations will nonetheless hold). The maximum nearest neighbor distance is, as expected, substantially larger than the average nearest neighbor distance. But more importantly, it does not offer any clear guarantee to decrease as the training set grows.

This is especially problematic for low-dimensional data (left plot). The variation is chaotic because it is based on the degree of isolation of the most isolated outlier in the sample, a property liable to vary abruptly. This unpredictability is less severe in high dimensionality (right plot), thanks to a positive (in this case!) aspect of the curse of dimensionality, which states that the relative difference between the maximal and minimal distance in a data set converges to 0 as dimensionality increases (see [4]). In other words, distance progressively loses its discriminative power as a defining characteristic between two points, which helps to stabilize the maximum nearest neighbor distance.
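Both bandwidth heuristics are cheap to compute from the training set alone. The following brute-force sketch is illustrative; the function names are assumptions and the quadratic pairwise search is the simplest possible choice, not necessarily the one used in the actual system.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_neighbor_distances(points):
    """Distance from each point to its closest other point (brute force)."""
    return [
        min(euclidean(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]

def average_nn_bandwidth(points):
    distances = nearest_neighbor_distances(points)
    return sum(distances) / len(distances)

def maximum_nn_bandwidth(points):
    return max(nearest_neighbor_distances(points))

# Toy 1D sample with one isolated point at 7.
points = [(0.0,), (1.0,), (3.0,), (7.0,)]
print(average_nn_bandwidth(points), maximum_nn_bandwidth(points))
```

In this toy sample the maximum is entirely determined by the isolated point at 7, whereas the average is only partly affected by it, which illustrates why the maximum varies so abruptly from one sample to another.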

4.4.2 Alpha

This is a hyper-parameter specific to the anisotropic kernel. Unfortunately, the

anisotropic kernel is already much more time-costly than the isotropic alternative,

so trying to optimize this supplementary hyper-parameter only further widens the

gap between the two.

In point 2.2, we presented α as a way to avoid ill-conditioned covariance matrices. What about simply setting α just high enough to ensure a reasonable condition number? Well, the problem is that a high condition number is a relatively blurry and gradual notion. We could choose a threshold of what is deemed an inoffensive condition number, but picking α to match this threshold on the worst Σi is certainly not optimal in terms of performance. As for h, there is a trade-off at play.

So, basically, what is the effect of letting α grow too much? The covariance matrix will concentrate more and more on its diagonal, and converge toward an identity matrix scaled by α. And what does such a covariance matrix represent? An isotropic kernel, which is certainly not as terrible as an ill-conditioned matrix. Pragmatically speaking, there is a major difference with the normal isotropic kernel though, in that we waste much time building a covariance matrix which we will mostly ignore in the end! And afterward, the evaluation itself will also involve costly vector multiplications, instead of the straightforward h factor.

As for h, there is no definite answer on what would be the best value to give to α. In [5], α is fixed to h/5 for rather obscure reasons. That said, it is fairly clear that α should, at least partially, depend upon h in some way. The scaling factor must however be adjusted heuristically. So far, we see no automated alternative able to better grasp the complex relations between h and the condition number of the covariance matrix that the isotropic kernel generates.
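The trade-off on α can be illustrated on a tiny example. The sketch below assumes that the regularization simply adds α times the identity to the local covariance matrix; this form is an assumption made for the illustration (the actual definition is the one of point 2.2), and the 2×2 closed-form eigenvalues are used only to keep the example self-contained.

```python
import math

def eigenvalues_sym_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], in increasing order."""
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return mean - radius, mean + radius

def condition_number(a, b, d, alpha):
    """Condition number of [[a, b], [b, d]] + alpha * I (assumed regularized form)."""
    smallest, largest = eigenvalues_sym_2x2(a + alpha, b, d + alpha)
    return largest / smallest

# Singular covariance matrix [[1, 1], [1, 1]] (eigenvalues 0 and 2).
for alpha in (0.01, 0.1, 1.0, 10.0):
    print(alpha, condition_number(1.0, 1.0, 1.0, alpha))
```

Under this assumption the condition number decreases steadily toward 1 as α grows, with no obvious threshold, while the regularized matrix becomes closer and closer to a scaled identity, that is, to an isotropic kernel.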

4.4.3 Training frame step

The question of the training frame step seems less delicate. We would expect that low frame steps can never lower the precision of the predictions, and that the trade-off is thus between precision and computational cost. Depending on how much time we are ready to give to the algorithm, setting the value of the hyper-parameter could be done quite naturally. Still, if we want to increase efficiency, there might be some ways to specify it less arbitrarily. One could use an objective precision versus time criterion, for instance.

There is also something else to be noted: the training frame step and the time sequence are not independent. Changing the former directly determines which past frames are considered, thus changing the meaning of the time sequence itself. For example, t, t − 1 and t − 2 do not mean the same thing for a frame step of 1 (past frames are also selected frames) as for one of 3 (past frames are the missing link between two selected frames). This means one should be aware that lowering the training frame step after having fixed the time sequence might in some cases actually decrease performance (bigger training set, but formatted in another manner).
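The interaction between the two hyper-parameters is easier to see by listing which original frames end up in each training data point. In the sketch below, the function name and the 0-based frame indices are assumptions; the time sequence is again encoded as offsets into the past, counted in original frames.

```python
def training_frame_groups(num_frames, frame_step, time_sequence=(0, 1, 2)):
    """Frames juxtaposed in each training data point, for a given frame step."""
    deepest = max(time_sequence)
    return [
        tuple(t - k for k in time_sequence)
        for t in range(deepest, num_frames, frame_step)
    ]

print(training_frame_groups(9, frame_step=1))
print(training_frame_groups(9, frame_step=3))
```

With a frame step of 1, consecutive data points overlap (frame 2 is the current frame of one point and a past frame of the next ones). With a frame step of 3, the data points are (2, 1, 0), (5, 4, 3) and (8, 7, 6): the past frames are exactly those skipped by the subsampling.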


4.4.4 Time sequence

The time sequence is, once again, not an easy hyper-parameter to determine. There

are actually two different things to decide: the number of frames included, and how

those frames are arranged.

The first aspect will have a dramatic impact on dimensionality, as the time sequence's length multiplies the number of dimensions of each data point. This directly affects computational cost, especially for the anisotropic kernel (see section 4.6). But this is not the main issue. Indeed, all effects related to the curse of dimensionality will only be exacerbated (see point 5.3.1). If the human body model used already contains too many dimensions, the length of the time sequence must be set accordingly. Nonetheless, in order to make prediction possible, the time sequence must of course still contain at least one past frame.

The second aspect covers potentially as many hyper-parameters as there are past

frames in the time sequence. However, we may expect good sequences to follow

some kind of simple rule, like an arithmetic progression or an exponential growth.

The former would consider all time intervals to have the same relevance, while the latter would provide both deeper insight into the past and information about the near past. On top of that, there remains of course a scaling factor similar in nature to the frame step: does it really bring something to cover redundant close frames instead of directly looking deeper into the past?
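The two families of time sequences can be generated from two numbers only: the number of past frames and a scaling factor. The sketch below is one possible encoding; the function names and the choice of base 2 for the exponential growth are assumptions.

```python
def arithmetic_time_sequence(num_past, scale=1):
    """Offsets 0, s, 2s, 3s, ... (all time intervals equally spaced)."""
    return [0] + [scale * k for k in range(1, num_past + 1)]

def exponential_time_sequence(num_past, scale=1):
    """Offsets 0, s, 2s, 4s, ... (dense near the present, sparse in the past)."""
    return [0] + [scale * 2 ** k for k in range(num_past)]

print(arithmetic_time_sequence(3))           # t, t-1, t-2, t-3
print(exponential_time_sequence(3))          # t, t-1, t-2, t-4
print(exponential_time_sequence(4, scale=2))
```

For the same number of past frames, and therefore the same dimensionality, the exponential variant reaches further into the past, at the price of a coarser description of the older frames.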

4.5 Combination with image fidelity

4.5.1 Confidence weight of estimation

As previously said, the prediction method and the image fidelity method are two distinct algorithms, and our work is solely focused on the former. However, in order to guarantee a harmonized mix between the two results obtained, it might be a good idea to weight their respective influence relative to their estimated quality on the proposed instance. Formulated another way, it would be nice if the predictor could itself give a scalar confidence value for the pose it proposes.

At first glance, knowing that the kernel density estimator is precisely a probability density, we could simply return the value of the local maximum found and use it as an estimator of the probability of this pose sequence. But relative likelihood is not probability! A well-defined probability must actually be an integral over a region included in the domain of the density, the value of a point's relative likelihood being only meaningful when compared to that of other points. Due to the curse of dimensionality, the relative likelihood of a point in a high-dimensional space will in general be much lower than one in a space of lower dimensionality.

A first practical consequence of this is that when two models do not have the same dimensionality (e.g., different lengths for the time sequence), it is not relevant to compare the confidence values obtained by taking the estimate on the pose sequence. A second practical consequence is that we really have to scale the relative likelihood should we want to use it as an ersatz for probability. But varying conditions must be taken into account, so using this kind of static approach might still lack flexibility.

We suggest instead to use the value returned on the pose sequence divided by what it would have been without the prediction. Comparing two relative likelihoods ensures that we stay on the same scale. A reasonable candidate for the pose of the current frame without the prediction (remember that the variables relative to the pose of the current frame are undefined) could be the nearest past frame. In this sense, the learner would compare its self-confidence with what it judges to be the value of staying purely conservative. Notice that it is also the initial point we use in the gradient ascent (see point 4.3.4), so it could also provide a measure of the estimated "success" of the complete gradient ascent operation.
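The suggested confidence value is a simple ratio of two density evaluations. The sketch below assumes an isotropic Gaussian kernel and the same vector layout as before (past variables first, current variables last); the names are hypothetical. Since both evaluations share the same normalization constant, it is left out.

```python
import math

def kernel_sum(x, train, h):
    """Unnormalised isotropic Gaussian density at x (the constant cancels in the ratio)."""
    return sum(
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / (2.0 * h * h))
        for xi in train
    )

def confidence(past, predicted, nearest_past, train, h):
    """Density with the predicted pose divided by density with the conservative pose."""
    with_prediction = kernel_sum(list(past) + list(predicted), train, h)
    conservative = kernel_sum(list(past) + list(nearest_past), train, h)
    return with_prediction / conservative

# Toy training set of (past, current) pairs following "current = past + 1".
train = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
ratio = confidence(past=[2.0], predicted=[3.0], nearest_past=[2.0], train=train, h=0.5)
print(ratio)
```

A ratio above 1 means that the predicted pose is judged more likely than simply repeating the nearest past pose, and a ratio of exactly 1 is obtained when the ascent did not move at all. In high dimensions both values may underflow to zero, so an actual implementation would more safely compare log-densities.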

4.5.2 Ambiguity measure of training set

The kernel density estimator returns relative likelihood values on the points requested, but this is highly dependent on the parameters used, and on the methodology in general. And obviously, this value is defined locally for the test point in question, not on a global scale for the training set. As a complement to that measure, it would be nice to get information on the predictive power of the training set itself. That is, we don't talk about kernels anymore and think of the data points as composed of two sets of continuous variables: those we want to predict, and the others. So, if we suddenly remove the values to be predicted, to what extent are the others sufficient to retrieve the "lost" values? To answer this question, we devised a way to measure the ambiguity inherent in a data set, relative to a set of removed variables.

Consider the following naive example: Paul goes to school on each day from Monday to Friday, but not on Saturday and Sunday. If it is Wednesday, does Paul go to school? The answer is straightforward, because there is no ambiguity. Now, knowing that Paul goes to school, which day is it? There are 5 different possible answers. Having Paul's agenda for this week, we could compute a measure of how difficult it is, in this agenda, to predict the day knowing Paul's occupation (or vice versa). For each pair made from 2 of the 7 tuples (day, occupation), we could check whether removing the day creates an ambiguity or not. For example, (Tuesday, school) versus (Thursday, school) is ambiguous, whereas (Monday, school) versus (Sunday, ¬ school) is not. That way we could get the ratio of pairs that become ambiguous if the day is omitted, which already gives a good idea of how ambiguous it is to predict the day from the occupation.
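The ratio of ambiguous pairs in Paul's agenda can be counted explicitly. In this small sketch the encoding of the agenda and the function name are assumptions; a pair is counted as ambiguous when both tuples share the same occupation, so that the day can no longer be told apart once it is removed.

```python
from itertools import combinations

agenda = [
    ("Monday", True), ("Tuesday", True), ("Wednesday", True),
    ("Thursday", True), ("Friday", True),
    ("Saturday", False), ("Sunday", False),
]  # (day, goes to school)

def ambiguous_pair_ratio(tuples):
    """Share of pairs that become indistinguishable once the day is removed."""
    pairs = list(combinations(tuples, 2))
    ambiguous = sum(1 for (_, occ1), (_, occ2) in pairs if occ1 == occ2)
    return ambiguous / len(pairs)

ratio = ambiguous_pair_ratio(agenda)
print(ratio)
```

Out of the 21 possible pairs, 10 pairs of school days and 1 pair of weekend days are ambiguous, which gives a ratio of 11/21.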

However, it should be clear that such an approach could only be relevant with restricted discrete values (where the data set covers the range of possible values well). In the continuous case, this is by definition impossible with a finite data set. Instead of deciding in a binary way whether the current pair is ambiguous or not, we may judge how much the lost variables account for the numerical discrimination between the two data points. A simple criterion to evaluate how much two sets of continuous variables differ from each other is the Euclidean distance between the two corresponding vectors. This is coherent with all the classical quadratic error criteria.

If we interpret this geometrically, variables are dimensions, so removing one of them consists of "compressing" the data set to a single value (i.e. 0 or any other constant) along this dimension. A sphere would thus be virtually reduced to a disk along the z axis, for instance. Rebuilding the sphere would be ambiguous because each point on the disk could have originated from either the lower or the upper surface. A hemisphere resting on a horizontal plane would be easier to rebuild but, unlike what the binary ambiguity criterion would have said, not absolutely unambiguous, because variations/imprecisions on x, y would induce variations on z. In fact, only a perfectly flat horizontal surface would be perfectly unambiguous, because no matter what the x, y coordinates are, the result to predict will always be the same. On the contrary, if we compress this surface along the x, y dimensions, there is no way to reconstitute anything of the original data set, so there is complete ambiguity. Notice that this will always be the case if all dimensions are compressed to a single point of no dimension.

In brief, we want to penalize high variance in the compressed dimensions (because

it's more difficult to be accurate when having to predict them). The idea is, instead of

hunting strict ambiguities, to estimate for each pair how much the distance between

the vectors composed only of the compressed dimensions determines the distance

between the complete vectors.

Given a data set X composed of D-dimensional vectors $x_1, \ldots, x_N$ and a set C
of the dimensions to be compressed such that
$\forall c \in C\,(c \in \mathbb{N}_1 \wedge c \le D)$, let us define $x'_i$ as the
|C|-dimensional vector composed of all the variables of $x_i$ indexed in C. This
gives us the following ambiguity ratio:

\[
\mathrm{Ambig}_C(X) = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j>i}^{N}
\frac{\lVert x'_i - x'_j \rVert}{\lVert x_i - x_j \rVert} \tag{4.5}
\]

This ratio automatically normalizes the ambiguity to the interval [0, 1]. There
is nonetheless an indeterminate form 0/0 in the case of duplicates. To avoid this
situation, we could require that duplicates be eliminated from X, but this would not
reflect the nature of the original data set, because 10 instances of a data point
should be much more influential than a single one. Instead, one may simply ignore the
duplicates when they are paired together, but still let them count in the other pairs.
This gives us a redefined and generalized ambiguity ratio:


Figure 4.6: Curves of ambiguity distribution obtained by histograms (see text for description)

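\[
\mathrm{Ambig}_C(X) = \frac{2}{\lvert \{(i, j) : x_i \neq x_j\} \rvert}
\sum_{i=1}^{N} \sum_{\substack{j>i \\ x_j \neq x_i}}^{N}
\frac{\lVert x'_i - x'_j \rVert}{\lVert x_i - x_j \rVert} \tag{4.6}
\]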
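
The generalized ratio of formula 4.6 can be sketched in a few lines of Python. This is only an illustration, not the implementation used in this work: the function name `ambiguity_ratio` and the 0-based index list `compressed` are assumptions, and the normalizing set of formula 4.6 is read as containing ordered pairs, so that the result is simply the mean ratio over the unordered pairs of distinct points.

```python
import numpy as np

def ambiguity_ratio(X, compressed):
    """Mean ratio of compressed-subspace distance to full distance (formula 4.6).

    X          : (N, D) array-like of data points.
    compressed : 0-based indices of the dimensions in C (hypothetical convention).
    Pairs of identical points are skipped, as in the generalized ratio.
    """
    X = np.asarray(X, dtype=float)
    Xc = X[:, list(compressed)]
    total, pairs = 0.0, 0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            full = np.linalg.norm(X[i] - X[j])
            if full == 0.0:          # duplicates paired together are ignored
                continue
            total += np.linalg.norm(Xc[i] - Xc[j]) / full
            pairs += 1
    if pairs == 0:
        raise ValueError("all points are identical; the ratio is undefined")
    return total / pairs

# Flat horizontal surface: compressing z loses nothing,
# compressing x and y loses everything.
flat = [[0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
ratio_z = ambiguity_ratio(flat, [2])
ratio_xy = ambiguity_ratio(flat, [0, 1])
```

On this toy data the two calls reproduce the extreme cases of the geometric discussion: no ambiguity when only the constant dimension is compressed, complete ambiguity when the two varying dimensions are.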

One more remark about what the ambiguity ratio actually measures: in order
for the Euclidean distance to make sense, the variables must be numerically
comparable. A priori, this excludes putting together variables of different nature,
such as temperatures and weights. But even with kilograms and tons, one has to scale
them to a common unit. Normalization is not a solution either, because it eliminates
relative variabilities. In fact, if there is a scaling to do, it must depend on our
own weighting of the error on the variable concerned. One simply has to make sure that
an error of ε on a variable has the same cost for us, no matter which one is
perturbed. This holds not only for the variables to predict, but also for the
variables which will in practice be given.

To get an approximate idea of how formula 4.6 behaves in general (are low ambiguity
ratios common or rare, how smooth is the distribution?), we evaluated it on
a relatively large number of small low-dimensional data sets, generated by combining
the variables in various ways on a limited domain. We present in figure 4.6 the result
of one of those preliminary experiments. The ambiguity ratio obtained on each data set
is put in a bin of a histogram (with bin width ε = 0.05). Each data set is composed of
only 4 two-dimensional points, and each variable can take only one of the values
{-4, -3, -2, -1, 0, 1, 2, 3, 4}. We generated all 62370 non-redundant combinations of
those points which contain neither duplicates nor multiple values of y for a given
value of x (as in a function). This last condition implies different distributions
along the two compressible dimensions, as it is less ambiguous to compress vertically.
Despite the rather coarse conditions, the displayed curves are almost smooth (notice
the gap at the left of the vertical compression though).

4.6 Complexity analysis

Let us define several variables relevant to the complexity. Let N be the number of
elements in the training set, and M the number of elements in the test set. Let d be
the dimensionality of the model used, T the length of the time sequence, and D = dT
the dimensionality of an element in the training and test sets. Let F be the total
number of available frames. Finally, let X be the maximal number of evaluations
tolerated for one call of the gradient ascent.

4.6.1 Space complexity

There are several data structures and functions using them, but the space complexity
essentially only plays a role for the (inverse) covariance matrices of the
anisotropic kernel. In the isotropic case, both training and test sets are stored in
memory, but this is inherent to the problem rather than to the methodology used to
solve it. That said, theoretically at least, one has to distinguish the space taken by
the available raw data from that of the constituted training and test sets. The former
is in Θ(dF), while the latter is in Θ(DN + DM), which is potentially bigger. In
practice however, we would not expect T to be more than a constant factor, and
neither N nor M can exceed F.

Back to the inverse covariance matrices: it would be possible to store only one
matrix at a time, but this would force us to recompute each one for each new kernel
evaluation, including during the gradient ascent. This space versus time trade-off is
out of the question in practice. All the inverse covariance matrices thus have to be
stored at the same time. There is one D × D matrix for each element of the training
set, so the space complexity lies in $\Theta(N D^2)$. Concretely, knowing that each
variable is stored in a 64-bit double, this represents 80 megabytes for N = 1000 and
D = 100.
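
The quoted figure can be checked by direct arithmetic, assuming 8 bytes per double and decimal megabytes ($10^6$ bytes), and ignoring any storage overhead:

```latex
\[
N \cdot D^2 \cdot 8\ \text{bytes}
= 1000 \cdot 100^2 \cdot 8\ \text{bytes}
= 8 \times 10^{7}\ \text{bytes}
= 80\ \text{MB}
\]
```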

4.6.2 Time complexity

The time complexity is alas not as gentle as the space complexity. In the isotropic
case, we deal with a time complexity in O(DNMX). The D factor is explained by
the distance calculation, the N by all the individual kernels averaged, and the M by
the loop over the test set. Contrary to the other complexity factors, which are in Θ,
X is only a worst-case estimate, as the gradient ascent may in most cases converge
quickly enough that we do not need to interrupt it. In fact, the X factor exists
solely to offer the possibility of a strict upper bound, as the gradient ascent is
guaranteed to converge in a finite, yet variable, number of steps. For the strict
window kernel types, it is actually possible to also get the D factor in O instead of
in Θ. This is done by cutting short the computation of the distance as soon as it
exceeds the bandwidth, as the result will be equal to 0 anyway.
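
The early termination described for window kernels can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the unnormalized uniform window stands in for whichever compactly supported kernel is actually used.

```python
def squared_distance_capped(x, y, h):
    """Squared Euclidean distance between x and y, or None once it exceeds h**2.

    The loop stops as soon as the partial sum passes the squared bandwidth,
    so fewer than D terms may be evaluated (D becomes an upper bound only).
    """
    limit = h * h
    acc = 0.0
    for a, b in zip(x, y):
        acc += (a - b) ** 2
        if acc > limit:      # outside the window: the kernel value is 0 anyway
            return None
    return acc

def uniform_window_kernel(x, y, h):
    """Unnormalized window kernel: 1 inside the bandwidth, 0 outside (illustrative)."""
    return 0.0 if squared_distance_capped(x, y, h) is None else 1.0

inside = uniform_window_kernel([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], 2.0)
outside = uniform_window_kernel([0.0, 0.0, 0.0], [3.0, 0.0, 0.0], 2.0)
```

For kernels with unbounded support, such as the Gaussian, this shortcut does not apply, since every dimension contributes to a nonzero value.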

The anisotropic case is a bit more complicated. First, there is the computation of
the N covariance matrices. For each matrix, this is done by calling, for each element
of the training set, the isotropic kernel, which scales the matrix obtained by
computing $(x_i - x_j)(x_i - x_j)^T$ (see equation 2.6). So, in total this operation
is in $\Theta(N^2 (D + D^2)) = \Theta(N^2 D^2)$. Interestingly, the inner call to the
isotropic kernel is negligible compared to the vector multiplications. Then, we also
have to invert these matrices and compute their determinants. This is handled by
Matlab, and can be done in $O(D^3)$ for each of the N matrices, supposing it uses
standard algorithms (according to Wikipedia, the fastest known algorithm can
accomplish both tasks for big matrices in $O(D^{2.376})$). So, to summarize, the
matrix computation phase is in $O(N^2 D^2 + N D^3)$.

Then comes the evaluation itself. The anisotropic kernel must perform vector
multiplications with the inverse covariance matrix in order to obtain a scalar, so
this costs $D^2$ for each evaluation. As for the isotropic kernel, we have to multiply
this by NMX. If we consider the matrix computation phase and the subsequent evaluation
together, the total time complexity of the anisotropic kernel lies in the not very
nice but still polynomial
$O(N^2 D^2 + N D^3 + D^2 NMX) = O(N D^2 (N + D + MX))$. As previously said, the
majority of these factors are in Θ, so this theoretical complexity also has very
concrete consequences on the actual computation time.

In general, optimizing hyper-parameters adds yet other multiplicative factors to
the whole complexity, with the nuance that the validation set may well be smaller
than M, and that X could possibly be dropped by approximating the performance
estimation (see point 4.4.1).

What about the proposed shortcut for selecting the bandwidth h? The average
(or maximum) nearest neighbor distance is computed only once, in $\Theta(N^2 D)$,
which might be a noticeable overhead for the isotropic kernel (depending on whether MX
is much bigger than N or not). For the anisotropic kernel, already much more
expensive, this remains a negligible effect.
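
As an illustration of where the $\Theta(N^2 D)$ cost comes from, here is a minimal NumPy sketch of the nearest neighbor distance statistics. The function name is an assumption, and how the average or maximum is then turned into a bandwidth h is defined elsewhere in this work and not reproduced here.

```python
import numpy as np

def nearest_neighbor_distances(X):
    """Distance from each training point to its nearest other training point.

    Builds all N*N pairwise differences over D dimensions, hence Theta(N^2 D)
    time (and, in this vectorized form, Theta(N^2 D) temporary memory as well).
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]        # (N, N, D) pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))    # (N, N) Euclidean distances
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbor
    return dist.min(axis=1)

train = [[0.0, 0.0], [3.0, 4.0], [3.0, 5.0]]
nn = nearest_neighbor_distances(train)
nn_mean, nn_max = nn.mean(), nn.max()
```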

Finally, concerning the self-evaluation of the algorithm, the confidence weight of
estimation adds no time complexity cost at all (as it is based on what is already
computed). The ambiguity measure of the training set runs in $\Theta(N^2 D)$, like
the nearest neighbor distance measurements.

5. Experiments

The Reader will probably understand from these two instances how - after a very long
training supplemented by constant experience - it is possible for the well-educated
classes among us to discriminate with fair accuracy between the middle and lowest
orders, by the sense of sight.


5.1 Performance criteria

How can we measure the precision of the predictions? A subjective appreciation of
global success or failure could be obtained by watching a run of the human body
models configured by the predictions over the frames of the test set, but this lacks
automation, fairness and precision. Yet a potential issue with numerical criteria is
that their neutrality does not imply relevance to the problem if they are not defined
so as to reflect exactly what we actually want to measure in the end. Is a big error
in one variable acceptable if it helps to keep the other variables precise? Are all
variables comparable in terms of error? We would like to clarify here how we decided
to answer those two questions in our context, and thus how we evaluated our results.


The first question, namely by which kind of rule an error should impact the
global result, is not particularly controversial. Errors are basically defined as the
absolute deviation relative to a reference. When evaluating a whole system, we
could for instance propose to multiply all individual errors. This assumes that each
individual error considerably worsens the existing problems. This does not seem very
relevant for us, because local errors do not suddenly make the rest of the body
configuration totally aberrant. At the other extreme, we could simply sum the
individual errors. This assumes that a local error has absolutely no consequences for
the global solution other than itself. Again, this is a bit too extreme in our case,
because when one joint starts to really deviate from the true one, this affects our
anatomical interpretation of the whole body more than if this error was distributed
among all joints.

A good compromise in our problem between those two points of view is to take the
sum of squared errors. This criterion also has the advantage of being in accordance
with more abstract statistical measures, like the ISE (see equation 4.1). This is of
course defined for a single frame. When evaluating the results for a whole test set,
we average all individual results.

The second question is a bit more delicate to answer. Firstly, we have to recall
that among the variables of both of our body models, there is the so-called global
twist, which is composed of three angles, but also of a translation in space. This
is quite problematic because the translation does not capture the same reality as
all the other variables. Even if in practice this has not led to a noticeably
different contribution to the total error, one has to bear in mind that this is
basically due to luck, as rescaling the spatial unit would directly impact this
fragile "equilibrium". But even between two variables both measured in radians, are we
sure that they are really comparable? For a human being, is 180° comparable when
applied to the horizontal rotation of the neck or to the bending of the knee? Even
without violating those so-called anatomical constraints (not included in the MMM
model anyway), one could argue that we should add weighting factors based on expert
judgment.

As we do not know those anatomical priors, especially for the MMM model, a
possible solution to this dilemma is to measure the error in a relative way, as a
complement to the absolute error. We propose to scale each individual error on a
variable by the value of the reference's sample standard deviation for this variable.
Consider a set X of M reference D-dimensional vectors and a set $\hat{X}$ of M
estimated D-dimensional vectors, with the predicted variables being the first d ones,
and let $\bar{x}_j = \frac{1}{M} \sum_{i=1}^{M} x_{ij}$ and
$s_j = \sqrt{\frac{1}{M-1} \sum_{i=1}^{M} (x_{ij} - \bar{x}_j)^2}$ be respectively
the reference's average and sample standard deviation for variable j. This gives us
the following formulas:

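\[
\mathrm{AbsErr}(\hat{X}, X) = \frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{d}
(\hat{x}_{ij} - x_{ij})^2 \tag{5.1}
\]

\[
\mathrm{RelErr}(\hat{X}, X) = \frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{d}
\left( \frac{\hat{x}_{ij} - x_{ij}}{s_j} \right)^2 \tag{5.2}
\]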
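
Formulas 5.1 and 5.2 translate almost literally into NumPy. The sketch below is only an illustration: the function names and the convention that rows are poses and the first d columns are the predicted variables are assumptions made for the example.

```python
import numpy as np

def abs_err(X_hat, X, d):
    """Formula 5.1: mean over poses of the summed squared errors on the first d variables."""
    X_hat, X = np.asarray(X_hat, dtype=float), np.asarray(X, dtype=float)
    return ((X_hat[:, :d] - X[:, :d]) ** 2).sum(axis=1).mean()

def rel_err(X_hat, X, d):
    """Formula 5.2: same as abs_err, each error being divided by s_j beforehand."""
    X_hat, X = np.asarray(X_hat, dtype=float), np.asarray(X, dtype=float)
    if X.shape[0] < 2:
        raise ValueError("M must be greater than 1 for s_j to be defined")
    s = X[:, :d].std(axis=0, ddof=1)      # sample standard deviation, factor 1/(M-1)
    if np.any(s == 0.0):
        raise ValueError("a reference variable is constant, so s_j = 0")
    return (((X_hat[:, :d] - X[:, :d]) / s) ** 2).sum(axis=1).mean()

reference = [[0.0, 0.0], [2.0, 4.0]]
estimate = [[1.0, 0.0], [2.0, 2.0]]
abs_value = abs_err(estimate, reference, d=2)
rel_value = rel_err(estimate, reference, d=2)
```

In this toy case the second variable has the larger absolute error but the same relative error as the first, because the reference varies more along it.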

It should be clear that the absolute and relative errors presented here measure
performance in two different ways, and cannot be directly compared. The relative
error penalizes imprecision more severely for variables that do not vary much. In the
extreme case, $s_j$ might be equal to 0, but with real data this should hardly ever
happen. Of course, M must be greater than 1, otherwise all $s_j$ would be
indeterminate (measuring performance on a single pose is of doubtful relevance
anyway).

Notice finally that the factor $\frac{1}{M-1}$ in the sample standard deviation
$s_j$ makes $s_j^2$ an unbiased estimate of the variance $\sigma_j^2$, contrary to
the alternative $\frac{1}{M}$, which underestimates it. This way, $s_j$ will not
systematically grow with M. This desirable property must nonetheless be viewed with
caution, as a condition for the total absence of bias is that the sample values be
drawn independently, which clearly will not be the case if the test set consists of
poses belonging to the same sequence.

This section would not be complete without mentioning the relevance of also
measuring the time performance, alongside the precision. This is highly helpful to
see concretely the difference between the isotropic and anisotropic kernels,
considering their contrasting time complexities. Fortunately, time measurement is a
fairly trivial and objective task.


5.2 Test set selection

In our tests, we wanted to respect the step-by-step approach presented in point
4.3.3, so we did not run the estimations independently from frame to frame on a fixed
test set. We simulated the missing image fidelity method by the reference data itself
(a perfect estimator in that sense). Formulated that way, this might sound like a
violation of the important principle that the model should not influence the test
process, but conceptually this is not really the case.

For all frames of the test set, the variables relative to the current frame t stay
untouched, so the reference for evaluation is never influenced. Each individual
performance measure (i.e., on a single estimated pose frame) is of course done on the
prediction alone (since we want to evaluate the predictive part only), so the help of
the perfect image fidelity method will not influence the result on the current frame.

For subsequent frames, combining past frames with perfect image fidelity is of
course advantageous compared to using prediction alone, but let us remember that
the test set as taken by default contains all the true past frames! Hence, if we let
the perfect estimate completely overwrite the prediction, it would be the same as
using the predefined test set. In other words, any other combination with the perfect
estimate can only be less precise, so that we could never artificially improve our
predictor's performance that way.

One may still argue that the principle of an inflexible test set is also meant to
avoid under-estimating the actual performance. But the whole question is then:
what is more relevant as a measure of actual performance? Allowing past frames to
be perfect is unrealistic because it eliminates the need for a predictor at all. For
instance, we could return the nearest past frame as the prediction for the current
one. As there is only minimal variation between successive frames, this would give a
very precise predictor running in no time (but totally useless in practice, needless
to say)!

Another argument, which will conclude this discussion, is that the principle of test
set neutrality is not violated if we consider the whole sequence as a single complex
component to test, instead of a test set made of falsely independent instances (the
mode of combination being held fixed, conditions always remain the same, and
comparing results is fair).

This is thus how we evaluate the test data, whatever the mode of combination
chosen to merge the pose of the image fidelity method with that of the prediction
method. One could think of many more or less sophisticated ways to merge the two
outputs. Most importantly, both methods should deliver an honest self-confidence
estimate to ensure adaptive weighting, depending on the situation. If large occlusions
suddenly appear, the image fidelity method should notice it and rely more heavily on
the results of the prediction method, and conversely if the predictor notices that the
current movement does not seem to fit its prior knowledge. For the predictor, we
propose to take into account the two measures defined in section 4.5, for assessing
its self-confidence in both a dynamic (confidence weight of estimation) and a static
(ambiguity measure of training set) way.

Nevertheless, for our experiments themselves, we limited ourselves to three very
basic combinations that do not make use of those estimates. The reason is simply
that we have only a perfect image fidelity method at our disposal, so it would
not make sense to give it an artificially low self-confidence estimate. There was of
course the possibility of making the data noisy, but this would not have reflected
the imprecision of the method in a realistic way, as noise should be on the image, not
directly on the resulting poses.

Due to this limited setting, we simply combined both poses $x_{image}$ and
$x_{pred}$ in a linear way for each vector, with a scalar λ ∈ [0, 1] for the
proportion:

\[
x_{out} = \lambda x_{image} + (1 - \lambda) x_{pred} \tag{5.3}
\]

Once again, we insist that in our testing framework, this is only applied to the
past frames given as input, so that the performance for the current frame is not
directly affected, whatever the choice of λ. We chose λ = 1 to always get the real
past poses, λ = 0.5 to get a simple mix and λ = 0 to rely solely on prediction. The
third combination is an especially harsh testing mode, because it conceptually means
that the camera is blinded for the whole movement once the deepest frame in the
time sequence is reached.
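
The linear combination of equation 5.3 and the three settings of λ used in the experiments can be sketched as follows. The function name `combine_poses` and the toy pose vectors are assumptions made for the illustration.

```python
import numpy as np

def combine_poses(x_image, x_pred, lam):
    """Equation 5.3: x_out = lam * x_image + (1 - lam) * x_pred, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    x_image = np.asarray(x_image, dtype=float)
    x_pred = np.asarray(x_pred, dtype=float)
    return lam * x_image + (1.0 - lam) * x_pred

x_image = [2.0, 4.0]     # pose from the (here perfect) image fidelity method
x_pred = [0.0, 0.0]      # pose previously returned by the predictor
real_past = combine_poses(x_image, x_pred, 1.0)    # always the real past poses
simple_mix = combine_poses(x_image, x_pred, 0.5)   # equal mix of both
blind = combine_poses(x_image, x_pred, 0.0)        # prediction only
```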


This brings us to another remark: in order for $x_{pred}$ to be defined, the
predictor must of course have been called on the corresponding frame before.
Otherwise, one must rely solely on the image fidelity method for this past frame,
which is not so terrible in the real system, but would make the testing experiments
meaningless for λ < 1, as some privileged past frames would de facto enjoy a λ = 1.
This is inevitably the case for the starting frames (as the predictor can only start
when its deepest past frame exists), but we would like to at least avoid gaps
afterward. To ensure this, the elements in the time sequence should all be multiples
of the test frame step, so that both remain synchronized.

5.3 Limitations of the methodology

Before presenting the results themselves, we would like to explain the nature of the
two main issues encountered when running the experiments. These problems were
certainly partially predictable, but they became truly apparent only in the concrete
tests, which is why this section is placed in the present chapter.

5.3.1 Curse of dimensionality

This issue covers all indirect consequences of the so-called curse of dimensionality
sketched in point 3.1.1. At first glance, the core problem seems to be the
non-representativity of any training set of reasonable size, due to the high number of
dimensions (a minimum of 120 for the MMM model).

This pessimistic view has to be corrected, however, by the fact that the variables
are very likely to show little independence. Firstly, the dependency between
successive frames of the time sequence is precisely what justifies the use of a
learner: if they were independent, there would be nothing to predict. Secondly, the
body configuration as a whole for a given frame and a more or less defined type of
movement is suspected to lie in the neighborhood of a much lower dimensional manifold
(embedded in the space), even if these manifolds are likely to vary noticeably with
the complexity of the type of movement concerned. See point 5.4 for more details.


So, it seems that we deal with a structural curse of dimensionality only on a
limited scale, which is reassuring, because a full-scale one would have meant the
practical impossibility of any prediction. Nevertheless, we still have to face the
perverse effects of a curse of dimensionality on the representation used, in the sense
that the data treated by the kernel density estimators remains high-dimensional.

In fact, lowering the number of variables could be achieved using techniques of
nonlinear dimensionality reduction, a large number of which are very well presented
in [24]. Alas, they do not provide a two-way mapping, so we could not practically
make much use of a reduced prediction. Indeed, representing a reduced human model
and, more importantly, combining it with the unreduced pose of the image fidelity
method seems to be almost as difficult as finding a relevant parametric estimator for
the movement type concerned.

Hence, even if the curse of dimensionality does not affect the learning process
as a whole, it still has very concrete consequences on the kernel density estimation
computed. What are they? Remember that densities, having an integral of 1, tend to
spread out in all the dimensions available to them. In particular, isotropic Gaussian
kernels are directly affected by particularly brutal numerical problems, which appear
with bandwidths h that are too high or too low. In both cases, the estimated relative
likelihood value for a given point risks being so dramatically close to zero that it
falls beneath the numerical limit for double precision, which is in the order of
$10^{-308}$ according to [10].
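
The underflow at both ends of the bandwidth range is easy to reproduce. The sketch below assumes the textbook isotropic Gaussian kernel $(2\pi h^2)^{-D/2} \exp(-u^2 / (2h^2))$, with $u^2$ the squared Euclidean distance; the exact normalization of the kernel used in this work (equation 2.3) is not reproduced here and may differ, so the numbers are illustrative only.

```python
import math

def gaussian_kernel(u2, h, D):
    """Isotropic Gaussian kernel in linear space (assumed textbook form)."""
    return (2.0 * math.pi * h * h) ** (-D / 2.0) * math.exp(-u2 / (2.0 * h * h))

def log_gaussian_kernel(u2, h, D):
    """Logarithm of the same kernel, which stays representable."""
    return -0.5 * D * math.log(2.0 * math.pi * h * h) - u2 / (2.0 * h * h)

D = 120
too_small_h = gaussian_kernel(1.0e4, 1.0, D)    # exp(-5000) underflows to 0.0
too_large_h = gaussian_kernel(1.0, 1.0e3, D)    # (2*pi*1e6)**(-60) underflows to 0.0
moderate = gaussian_kernel(120.0, 1.0, D)       # tiny, but still a positive double
log_too_small_h = log_gaussian_kernel(1.0e4, 1.0, D)
```

With a small h the exponential term vanishes, while with a large h it is the normalization factor that vanishes; in both cases the logarithm of the kernel remains a perfectly ordinary number.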

This numerical effect "kills" the gradient if all individual kernels return 0. It is
rather hard to estimate the amount of imprecision caused by approximating very
small positive doubles in a limited format, but in any case the abrupt cut-off of any
prediction and the indeterminate gradient in the region of $10^{-308}$ were very
clearly observable in our experiments. Without getting into the technical details, we
tried to scale the result before the approximation to zero happens, but this is
delicate to do while respecting the meaning of h. Once over the fatal barrier, a bit
like at the event horizon of a black hole, there is not much to do apart from relying
on a longer floating point representation (but as with scaling, this does not solve
the problem in general, and costs more time and space).


Figure 5.1: Upper (dashed) and lower (solid) bounds of h for varying $u^2$, D = 90 and D = 120 resp.

We also tried to use the log-likelihood instead of the likelihood in the definition
of the kernel, which allows discarding the exponential without affecting the general
shape of the function (as the logarithm is a monotonic function). Unfortunately,
this can only be applied to the individual kernels, not to the estimator that averages
them. Indeed, we have to apply the logarithm inside $K^{iso}_h$ (resp. $K^{ani}_i$),
because if we applied it to the whole estimator $\hat{f}_{iso}$ (resp.
$\hat{f}_{ani}$) only once it has been computed, the numerical issue would already
have appeared. The problem is that log(a + b) is a monotonic transformation of a + b
(it preserves the gradient direction), whereas log(a) + log(b) is not, which
invalidates the relevance of the gradient. We did not find a way to express the
log-likelihood of the kernel estimator in terms of the log-likelihoods of its
individual kernels.
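
For reference, and not as part of the experiments reported here, a standard identity usually called log-sum-exp expresses the logarithm of a sum through the logarithms of its terms: $\log \sum_i e^{a_i} = m + \log \sum_i e^{a_i - m}$ with $m = \max_i a_i$. The sketch below applies it to an average of kernels, again assuming the textbook Gaussian form rather than the exact kernel of this work; all function names are hypothetical. Whether this would have removed the difficulties observed in our setting is not something this sketch can establish.

```python
import math

def log_gaussian_kernel(u2, h, D):
    """Log of an isotropic Gaussian kernel (assumed textbook normalization)."""
    return -0.5 * D * math.log(2.0 * math.pi * h * h) - u2 / (2.0 * h * h)

def log_kde(sq_dists, h, D):
    """Log of the average of the kernels, computed from their logarithms only."""
    logs = [log_gaussian_kernel(u2, h, D) for u2 in sq_dists]
    m = max(logs)                                   # largest term is factored out
    return m + math.log(sum(math.exp(a - m) for a in logs)) - math.log(len(logs))

def kernel_weights(sq_dists, h, D):
    """Relative contribution of each kernel to the estimate; the weights sum to 1."""
    logs = [log_gaussian_kernel(u2, h, D) for u2 in sq_dists]
    m = max(logs)
    w = [math.exp(a - m) for a in logs]
    total = sum(w)
    return [x / total for x in w]

# Squared distances for which every individual kernel underflows in linear space.
far = [1.0e4, 1.2e4]
log_density = log_kde(far, 1.0, 120)
weights = kernel_weights(far, 1.0, 120)
```

Since the gradient of the logarithm of a positive function is the gradient of that function divided by its value, both point in the same direction, and the weights above are the factors by which the gradients of the individual log-kernels would be combined.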

In fact, this issue is very closely related to the dimensionality. Let us
call $u^2$ the squared Euclidean distance between x and x′, in accordance with the
terminology used in point 3.3.2. For a given value $u^2$, let us define an upper and
a lower bound for h, outside which the numerical approximation to zero appears. This
means that values of the bandwidth lying outside this range for a given value $u^2$
stop the gradient ascent if the requested point is located at a distance v ≥ u from
its nearest neighbor among the training points.

Figure 5.1 shows how the range of possible values gets narrower with D (and with
$u^2$ of course). We compare the cases of D = 90 and D = 120 (the minimal number
of dimensions in the MMM model). Notice that there is not only a broader range
of usable values for h in the left plot, but also a big contrast between the maximal
$u^2$ allowed before both bounds abruptly merge (approx. $4 \times 10^4$ for D = 90
but only approx. 6900 for D = 120). We do not show comparisons with the univariate
case, as the upper bound is so radically higher, and stays so until $u^2$ reaches such
a large number, that one could consider it not to exist at all for non-pathological
data.

Figure 5.2: Value of an isotropic kernel with fixed squared distance $u^2$ and varying h, D = 1 and D = 60 resp.

The problem only worsens considering that if we have two vectors of a fixed
number of variables, and then add other variables, the Euclidean norm of the
difference can only grow if the added variables are not identical. This is precisely
what happens when adding several past frames to the time sequence!

To have another view of the influence of high dimensionality, consider what happens
in equation 2.3 if, instead of fixing the bandwidth and varying $u^2$, we choose a
fixed $u^2$ and observe what happens with various h. Figure 5.2 plots the value of the
kernel over varying bandwidths with $u^2$ equal to 1 and 0.99. To make the contrast
apparent, we chose to show here the plots for D = 1 and D = 60.

There are nevertheless precautions to take with this sort of plot, because the
choice of $u^2$ also has consequences, especially in the high-dimensional case. For
example, the right plot has a mode of a very large value, but when progressively
increasing $u^2$, it will shrink dramatically. Regarding the shapes obtained, let us
only observe that setting D to 1 results in a plot that bears resemblance to an
inverse $\chi^2$ distribution, whereas increasing D results in what seems to be a
convergence toward a normal distribution. Those graphical similarities do not make
them valid densities however, as those plots do not integrate to 1 in general.

To summarize the issues discussed in this point, we strongly advocate a body model
that does not comprise too many dimensions. The relevance gained through a more
realistic model risks being jeopardized by the numerical effects of the curse
of dimensionality (not to mention the impact of high dimensionality on complexity).

5.3.2 Ill-conditioned matrices

Another issue arises, this time from the anisotropic kernel. This is not to say that
the anisotropic kernel is not affected by the problems of the isotropic kernel, as it
uses it to build its covariance matrices! Those are at least always positive
semi-definite. But their condition number could also be dangerously high, rendering
the inverted matrices unusable in practice, unless we raise α. Nonetheless, as
discussed in 4.4.2, this can partially waste the information gained by the computation
of the covariance matrices.

The problem we have to face in practice is that, at least for our data and our
number of dimensions, even an α smaller than what is needed to avoid ill-conditioned
matrices will comparatively dwarf the values obtained by the isotropic
kernels outside the diagonal. In other words, in order to make the anisotropic kernel
usable, one has to blur it so much that it does not show a dramatic precision
improvement over the simpler isotropic equivalent.

Without much alternative to regularization through α in order to reduce the
condition number, it does not seem easy to circumvent the problem. We also tried the
Moore-Penrose pseudo-inverse, an alternative inversion technique that can also be
applied to m × n matrices. Of course, it changes nothing about the condition number,
but we had thought it might possibly give a better inverted matrix. It turns out that
the matrix obtained is indeed different, but multiplying it on the left by
$(x - x_i)^T$ and on the right by $(x - x_i)$, as is done in the anisotropic kernel
(see equation 2.5), returns the same number.
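
The interplay between α, the condition number and the two inversion routes can be illustrated on synthetic data. The sketch below rests on a loudly stated assumption: regularization is taken to mean adding α to the diagonal of the covariance matrix, which is one common choice and may not match the exact role of α defined in point 4.4.2. The matrix, the vector and all names are made up for the example.

```python
import numpy as np

def regularized(cov, alpha):
    """Assumed regularization: add alpha to the diagonal of the covariance matrix."""
    return cov + alpha * np.eye(cov.shape[0])

def quadratic_forms(cov, alpha, v):
    """Condition number, and v^T M^-1 v computed with inv and with pinv."""
    reg = regularized(cov, alpha)
    q_inv = float(v @ np.linalg.inv(reg) @ v)
    q_pinv = float(v @ np.linalg.pinv(reg) @ v)
    return np.linalg.cond(reg), q_inv, q_pinv

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 2))
cov = A @ A.T                      # rank 2 in 6 dimensions: singular without alpha
v = rng.normal(size=6)             # plays the role of (x - x_i)

cond_small_alpha, q_inv, q_pinv = quadratic_forms(cov, 1e-3, v)
cond_large_alpha, _, _ = quadratic_forms(cov, 1e-1, v)
```

Raising α lowers the condition number, at the price of drowning the off-diagonal structure. For a matrix that is invertible after regularization, the inverse and the pseudo-inverse coincide mathematically, which is at least consistent with the identical quadratic forms reported above.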

The original causes of this ill-conditioning problem are not perfectly clear to us.
It might once again have something to do with dimensionality, but a strict distinction
from the problems encountered by the isotropic kernel is quite delicate to achieve,
as the values returned by the affected isotropic kernels will also have their own
influence. We can say that a minimal threshold for α seems much less critical for
low-dimensional data. However, values of α are not really comparable on this
level, and furthermore, the data comes from a distinct source, so one should be
cautious before attributing everything solely to the curse of dimensionality.

Thus, we want to emphasize that it is quite plausible that the anisotropic kernels
do not reveal their true potential in our experiments, and that they might
perform better in other settings.

5.4 Estimation of intrinsic dimensionality

As a complement to the experiments on the data itself, we were interested in knowing
a bit more about the dependencies between variables. One could naively use Pearson's
correlation, computed from the sample covariance matrix, to try to find out the
dependencies between variables. Given two variables x and y found in a sample of
size N, this coefficient basically scales the sample covariance by the product of both
sample standard deviations, and is defined as:

\[
r_{xy} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x s_y} \tag{5.4}
\]

Unfortunately, linear measures such as $r_{xy}$ detect solely linear dependencies, as
illustrated in figure 5.3, whereas we do not expect this property in human poses,
so this dependence itself is not easy to appreciate directly. We thus relied instead
on methods closely related to the domain of dimensionality reduction, namely
estimators of intrinsic dimensionality. This was done using the drtoolbox created and
distributed by Laurens van der Maaten, available at
http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html.
Resulting dimensions are expected neither to be integer numbers (think about fractal
dimensions), nor to stay the same for all types of movements.

Figure 5.3: Various bivariate data sets and the corresponding correlation coefficient (image taken from the Wikipedia entry on correlation and dependence)
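
The blindness of the correlation coefficient to nonlinear dependence is easy to verify numerically. The following sketch implements equation 5.4 directly; the function name and the two toy data sets are assumptions chosen for the illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Equation 5.4: sample covariance scaled by both sample standard deviations."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    numerator = ((x - x.mean()) * (y - y.mean())).sum()
    return numerator / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

x = np.linspace(-1.0, 1.0, 101)
r_linear = pearson_r(x, 2.0 * x + 1.0)   # exact linear dependence
r_parabola = pearson_r(x, x ** 2)        # exact, but nonlinear, dependence
```

Here y is a deterministic function of x in both cases, yet only the linear relation yields a coefficient of magnitude 1; on the symmetric parabola the coefficient vanishes.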

Table 5.1 shows the results for 4 of the 6 intrinsic dimensionality estimators
included in the drtoolbox. GMST stands for Geodesic Minimum Spanning Tree,
and MLE stands for Maximum Likelihood Estimator. It is not surprising that the
different estimators give contrasting numbers, considering that they capture the idea
of intrinsic dimensionality from different points of view. For instance, one of the 2
remaining estimators, namely the packing numbers, returned 0, which is of course
absurd in our case.

Another foreseeable result is that each estimate varies when picking another
type of movement. What is more surprising, however, is that it also varies within
a single type of movement. We would have expected much less intra-class variance.
And types of movements that intuitively appear less complex (a simple boxing move
compared to acrobatic jumps) do not necessarily have a lower estimated
dimensionality. Another surprise is that, with the exception of the eigenvalues (the
criterion used is, roughly put, to cut off dimensions whose influence is deemed
"negligible" relative to the preceding one, after having ordered dimensions by
decreasing importance), the estimate of intrinsic dimensionality is actually very low
compared with the original number of variables.

Sequence                        CorrDim   GMST   EigValue   MLE
Boxschlag_hinten_2_1_Q_Human    1.48      1.86   5.00       2.03
Salto_rückw_3_1_Q_Human         1.71      2.64   5.00       2.34
Butterfly_Kick_1_1_Q_Human      2.20      3.40   6.00       3.39
Boxschlag_hinten_2_1_Q_MMM      1.54      1.91   5.00       2.20
Salto_rückw_3_1_Q_MMM           1.54      2.73   6.00       2.75
Salto_rückw_4_1_Q_MMM           1.73      4.25   9.00       3.61
Salto_rückw_6_1_Q_MMM           1.47      3.13   7.00       2.94
Butterfly_Kick_1_1_Q_MMM        1.84      2.26   8.00       2.17

Table 5.1: Number of estimated dimensions for sequences used in experiments, frame step of 10, time sequence of t, t−10

Finally, to get a more graphical idea of how complex the dependencies actually are, we tested several dimensionality reduction techniques on the sequences used in the experiments. We present here three examples reduced via MDS in figure 5.4. We chose this technique simply because we do not really want to use (or give explicit meaning to) the reduced variables obtained. Rather, we are interested in mere visualization, for which MDS is especially well suited. A nice property of MDS is that if we increase the number of desired dimensions, the initial mapping stays relevant and simply expands into the supplementary dimension. For instance, if we represented the fourth dimension by color, figure 5.4 would be the same, just colorful.

While, in all generality, there is absolutely no guarantee that mappings of similar movements will themselves turn out to be graphically similar (MDS is not based on topology in that sense), the plots for sequences Salto_rückw_3_1_Q_MMM and Salto_rückw_4_1_Q_MMM do bear some structural resemblance to each other. This similarity is also observable in the color when using 4 dimensions. It is also very apparent that the points are not arbitrary clouds but follow the time line of the movement. Compared to the backflip move, the plots of boxing look substantially simpler, as the movements themselves indeed seem to be in reality.

Figure 5.4: 3-dimensional MDS applied on various MMM sequences: boxing 6 (top, left), backflip 6 (top, right), 3 (below, left) and 4 (below, right)

5.5 Evaluation

5.5.1 Default case

One could think of many settings in which to evaluate the methodology (selection of hyper-parameters, model used, size and selection of the training set, and so on). We present here only a small subset of those variations, which overall produce exploitable results. By this, we mean results that are of "acceptable quality" if combined with the image fidelity method, that is, results that more or less respect the motion being performed. This is, of course, a rather subjective criterion, but one can fairly discard a setting if all tests run on it produce results that substantially mislead the tracking over the whole sequence. As a consequence, the results presented here do not seek to show the impact of bad hyper-parameters, and do not contrast strongly with one another. Importantly, we point out that the whole process remains deterministic, so repeated runs with one setting should differ only in the time they take.

We will refer to the setting of table 5.2 as our default case. In other words, subsequent variations are made relative to this setting. We chose the MMM model because it does not have anatomical constraints that might be violated, and because the visual results tend to be more convincing. We chose a frame step of 10 for both training and test sets because of the significant speed-up (×100) this brings, which is especially convenient for the evaluation of the anisotropic kernels. The time sequence only covers one past frame to avoid issues arising from high dimensionality. The bandwidth is set to the average nearest neighbor distance.
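The bandwidth heuristic of the default case can be sketched as follows. This is a brute-force illustration, assuming Euclidean distances between training samples; the function names and the toy samples are ours. The maximum nearest neighbor distance, used in the variation of section 5.5.3, is computed alongside for comparison.

```python
import math


def nearest_neighbor_distances(samples):
    """Distance from each sample to its closest other sample (brute force)."""
    distances = []
    for i, p in enumerate(samples):
        nearest = min(math.dist(p, q) for j, q in enumerate(samples) if j != i)
        distances.append(nearest)
    return distances


def average_nn_bandwidth(samples):
    """h = average nearest neighbor distance (default case)."""
    d = nearest_neighbor_distances(samples)
    return sum(d) / len(d)


def maximum_nn_bandwidth(samples):
    """h = maximum nearest neighbor distance (variation of section 5.5.3)."""
    return max(nearest_neighbor_distances(samples))


# Hypothetical 2-dimensional training samples.
toy_samples = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
h_avg = average_nn_bandwidth(toy_samples)  # (1 + 1 + 2 + 4) / 4
h_max = maximum_nn_bandwidth(toy_samples)  # isolated sample dominates
print(h_avg, h_max)
```

The maximum is driven by the most isolated sample, which is why it yields a larger bandwidth and hence a smoother estimate than the average.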

In table 5.2, as well as in all subsequent tables, results are expressed in terms of absolute ('Abs') and relative ('Rel') error, as defined in section 5.1. Each test is always run with the three ways of determining past frames: 'Pred' (100% predicted values), 'Mix' (50% predicted and 50% real values) or 'Real' (100% real values), according to the λ of equation 5.3. Among the categories of motion at our disposal, we chose boxing moves, backflips and butterfly kicks, because intuitively they seem to be of varying complexity. The selection of the sequences within the corresponding categories that we took for the evaluation is arbitrary, as is the split between training and test sets. Time is expressed in seconds (as measured on a dual-core 1.6 GHz laptop), but does not cover the animation of the sequence. Sequences last between 201 (boxing 6 and 11) and 901 frames (backflip 3), with more or less the same number of frames within each category.
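The three ways of determining past frames can be illustrated by the sketch below. It assumes that equation 5.3 is a per-variable convex combination in which λ weights the predicted values; this form, the function name and the toy poses are assumptions of the illustration, and the exact convention remains the one of equation 5.3.

```python
def blend_past_frame(predicted, real, lam):
    """Assumed form of equation 5.3: lam * predicted + (1 - lam) * real."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie in [0, 1]")
    if len(predicted) != len(real):
        raise ValueError("poses must have the same number of variables")
    return [lam * p + (1.0 - lam) * r for p, r in zip(predicted, real)]


# Assumed values of lambda for the three modes of the tables.
MODES = {"Pred": 1.0, "Mix": 0.5, "Real": 0.0}

# Hypothetical 3-variable poses for a single past frame.
predicted_pose = [10.0, 20.0, 30.0]
real_pose = [12.0, 18.0, 30.0]

past = {
    name: blend_past_frame(predicted_pose, real_pose, lam)
    for name, lam in MODES.items()
}
print(past["Mix"])
```

With these toy values, 'Mix' lies halfway between the predicted and the real pose, while 'Pred' and 'Real' reduce to one of the two inputs.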

When looking at the results of table 5.2, we notice, as expected, that the manner of determining past frames has a clear impact on performance. What is perhaps less intuitive is the noticeable variation of error between the test sets. Notice however that relative errors are defined with respect to the reference data, which makes direct comparison between test sets inadequate (fortunately, two different settings on the same test set stay comparable). Absolute errors are more neutral in that sense, especially when the motion category is the same, and they also vary noticeably. This is also observable when examining the predictions visually.


Train             Test              Pred Abs  Pred Rel  Mix Abs  Mix Rel  Real Abs  Real Rel  Time
Boxing 2          Boxing 6          94.86     2.05e+3   81.64    1.95e+3  79.42     1.88e+3   23.89
Boxing 2          Boxing 11         30.10     8.18e+4   25.86    7.71e+4  18.97     7.20e+4   14.77
Backflip 3        Backflip 4        32.88     1.20e+2   8.25     7.61e+1  3.83      4.76e+1   48.86
Backflip 3        Backflip 6        45.97     1.31e+2   25.22    8.05e+1  14.76     5.97e+1   48.31
Butterfly kick 1  Butterfly kick 3  95.49     1.20e+2   47.77    1.45e+2  33.52     1.21e+2   61.97
Butterfly kick 1  Butterfly kick 5  103.69    1.78e+2   94.61    1.66e+2  57.75     1.42e+2   67.38

Table 5.2: Results for isotropic kernels on the MMM model, frame step of 10 for both training and test, time sequence of t, t−10, h = average nearest neighbor distance

In the variations we are going to consider, we deliberately omit the results for the human model. This is because the meaning of the two error measures is intrinsic to the model used, and direct comparison can only be made on video, so this would require presenting all results twice, once for each model. For space reasons, we simply decided to leave these results in electronic form only. Visually speaking, predictions on the human model tend to be of lower quality because it frequently degenerates to a likelihood of 0, although it would intuitively be less prone to suffer from the effects of high dimensionality with multiple past frames (42 parameters instead of 60 in the MMM model).

5.5.2 Some graphical results

Before presenting the numerical tables for the variations we retained here, we would like to illustrate part of the preceding results in a more concrete form. Figures 5.5 and 5.6 show 6 frames of the sequences Boxing 11 and Backflip 4 respectively.

The noticeable errors of the prediction in figure 5.5 are the wrong twist of the left hand and the bad time synchronization of the motion (the blow is delivered too late). When watching the video sequence, it also appears that the return of the fist remains somewhat shaky.

Although the sequence may appear to be followed less faithfully in figure 5.6, this is actually much more a question of time synchronization, as the predicted motion is quite smooth and matches the reference sequence on the whole (we do not show here the predictions based on the real past frames, which are more precise still).


Figure 5.5: Sequence for Boxing 11 with 40 frames between each image, reference (top) and prediction of table 5.2 with real past frames (bottom)


Figure 5.6: Sequence for Backflip 4 with 160 frames between each image, reference (top) and prediction of table 5.2 with mixed past frames (bottom)


Figure 5.7: Values of absolute (left) and relative (right) error over time for test of table 5.2 on Boxing 6

Figure 5.7 shows the distribution of the error over the tested frames of Boxing 6 (with a test frame step of 10, this means 21 frames out of a total of 201). Obviously, the quality of the prediction fluctuates with time. Here, there is a brief passage where the prediction with real past frames suddenly matches the actual pose very well, corresponding to the blow itself in the middle of the motion. A bit surprisingly, better past frames do not systematically dominate less precise ones on each frame, although this is almost always the case for the average error.

In the preceding case, the shapes of the two error functions were quite similar. But this is not always so, as absolute and relative errors do not measure the same notion. In figure 5.8, the similarity between absolute and relative error is much less clear.

The preceding plots were not very smooth, but this is not surprising given the reduced number of frames. However, this also seems to be the case for longer sequences, such as Backflip 4, as can be seen in figure 5.9. The two error functions are partially similar in shape, in the sense that their local maxima and minima more or less correspond, even if the levels themselves differ.

Another interesting visualization, although not easy to interpret, is to plot all variables of the reference and predicted poses over time. Notice that we are only interested here in the prediction itself, so we do not represent the values of the variables for the past frames. As previously said, there are 60 variables describing the current pose in the MMM model. Figures 5.10 and 5.11 illustrate this for sequence Backflip 4, with all variables grouped together and considered individually, respectively.

Figure 5.8: Values of absolute (left) and relative (right) error over time for test of table 5.2 on Boxing 11

Figure 5.9: Values of absolute (left) and relative (right) error over time for test of table 5.2 on Backflip 4

Figure 5.10: Global plot of the variables over time for reference (top) and prediction of table 5.2 with mixed past frames (bottom), on Backflip 4

As can be seen in figure 5.10, the predicted plot looks less "messy" and chaotic than the actual plot. This does not mean that it is smoother on an individual scale, as can be seen more clearly in figure 5.11. This suggests that the proposed model underestimates the complexity of the actual model, in other words, that the learning model suffers from underfitting. Figure 5.11 also helps to detect the joints for which the prediction performs the worst. Here, one can see for example that the third variable of the first row and the second variable of the fourth row do not fit the reference at all. These variables correspond to the z coordinate of the global twist and to the y coordinate of the right elbow respectively.

Figure 5.11: All individual plots of the variables for reference (blue) and prediction of table 5.2 with mixed past frames (red), on Backflip 4

5.5.3 Maximum nearest neighbor distance

The setting retained here is the same as for the default case, except that h is now equal to the maximum nearest neighbor distance. Table 5.3 summarizes the results. We indicated in bold the errors that are lower than their equivalent in the default case (we did not apply this to time, because variations are much more likely to be due to external causes, such as a lag in Matlab for the time taken by Boxing 6). This setting systematically dominates the default one for 'Pred' past frames, but is less advantageous when past frames are more trustworthy. We consider however that 'Mix' and 'Real' past frames should be given more weight, as in practice the image fidelity would help to keep the tracking consistent until ambiguities appear. Notice that, comparatively speaking, this setting tends to perform better when evaluated on relative error than on absolute error.

Test              Pred Abs  Pred Rel  Mix Abs  Mix Rel  Real Abs  Real Rel  Time
Boxing 11         24.61     8.02e+4   22.43    8.06e+4  21.69     8.08e+4   8.43
Backflip 6        41.53     1.22e+2   21.94    7.85e+1  16.42     5.97e+1   42.74
Butterfly kick 5  98.07     1.46e+2   95.17    1.47e+2  81.71     1.37e+2   46.41

Table 5.3: Results for isotropic kernels on the MMM model, frame step of 10, time sequence of t, t−10, h = maximum nearest neighbor distance

In point 4.4.1, we already suggested that using the maximum nearest neighbor distance as the bandwidth may tend to blur the actual complexity of the motion (in other words, to underfit it). Figure 5.12 illustrates this more concretely when compared with the plots of figure 5.10. Not only is the inner structure less intricate, but the curves are also smoother than with the average nearest neighbor distance.

5.5.4 Anisotropic kernels

We present here results for two values of α, namely h/3 and h/5 (more extreme values quickly degrade performance). As for the preceding variation, bold numbers indicate a lower error than in the default case. Once again, we consider 'Pred' past frames to be less relevant. The quality of the results is far from exceptional compared to the isotropic kernel, considering the noticeable difference in time cost. Of course, with this data the computational cost remains quite reasonable, but we have to bear in mind that the training and test samples are very reduced here. So if a trade-off comes into play, the precision that might be gained with anisotropic kernels is largely outweighed by the much larger time cost. As an illustration of the predicted poses, we can also compare figure 5.13 with the previous corresponding plots. It appears slightly more intricate than the prediction of figure 5.10, but this is a bit subjective.

Figure 5.12: Global plot of the variables over time for prediction of table 5.3 with mixed past frames, on Backflip 4

5.5.5 Longer time sequence

We will now briefly present one of the results obtained with other time sequences. It turned out in preliminary experiments that long time sequences quickly suffer from the effects of high dimensionality (at least with relatively large body models such as MMM), and from the related numerical problems discussed previously. Thus, we selected a basic case where the problem does not appear. We simply add t − 20 to the samples, which corresponds to the nearest available past frame before t − 10 (remember from point 5.2 that the test frame step should be synchronized with the time sequence to ensure that past frames respect the rules defined for 'Mix' and 'Pred').

Test              Pred Abs  Pred Rel  Mix Abs  Mix Rel  Real Abs  Real Rel  Time
Boxing 11         30.10     8.16e+4   25.69    7.65e+4  18.32     6.70e+4   187.82
Backflip 6        45.94     1.30e+2   19.10    7.47e+1  13.22     5.58e+1   1337.57
Butterfly kick 5  104.33    1.82e+2   90.97    1.64e+2  69.65     1.89e+2   1239.76

Table 5.4: Results for anisotropic kernels on the MMM model, frame step of 10 for both training and test, time sequence of t, t−10, h = average nearest neighbor distance, α = h/3

Test              Pred Abs  Pred Rel  Mix Abs  Mix Rel  Real Abs  Real Rel  Time
Boxing 11         30.18     8.15e+4   24.86    8.12e+4  18.45     6.90e+4   174.03
Backflip 6        45.89     1.30e+2   17.39    7.36e+1  13.45     5.64e+1   1286.90
Butterfly kick 5  104.47    1.83e+2   90.54    1.63e+2  73.25     2.06e+2   2531.14

Table 5.5: Results for anisotropic kernels on the MMM model, frame step of 10 for both training and test, time sequence of t, t−10, h = average nearest neighbor distance, α = h/5

Test              Pred Abs  Pred Rel  Mix Abs  Mix Rel  Real Abs  Real Rel  Time
Boxing 11         30.16     8.16e+4   27.52    7.85e+4  20.60     7.28e+4   15.17
Backflip 6        45.91     1.31e+2   28.03    8.46e+1  16.08     6.41e+1   49.88
Butterfly kick 5  103.94    1.78e+2   93.38    1.65e+2  70.36     1.48e+2   59.62

Table 5.6: Results for isotropic kernels on the MMM model, frame step of 10 for both training and test, time sequence of t, t−10, t−20, h = average nearest neighbor distance
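The effect of adding t − 20 to the samples can be made concrete with the following sketch. It assumes that a sample is simply the concatenation of the pose at t with the poses of its past frames; the function name and the toy sequence are hypothetical. Under this assumption, with the 60 variables of the MMM model, the sample dimension would grow from 120 to 180 when moving from (t, t−10) to (t, t−10, t−20).

```python
def build_samples(poses, frame_step, n_past):
    """Concatenate the pose at t with the poses at t - step, ..., t - n_past * step."""
    samples = []
    first_t = n_past * frame_step  # earliest frame having all its past frames
    for t in range(first_t, len(poses), frame_step):
        sample = []
        for k in range(n_past + 1):
            sample.extend(poses[t - k * frame_step])
        samples.append(sample)
    return samples


# Toy sequence: 41 frames of a hypothetical 3-variable pose.
poses = [[float(t), float(t) + 0.1, float(t) + 0.2] for t in range(41)]

one_past = build_samples(poses, frame_step=10, n_past=1)  # time sequence t, t-10
two_past = build_samples(poses, frame_step=10, n_past=2)  # time sequence t, t-10, t-20
print(len(one_past), len(one_past[0]), len(two_past), len(two_past[0]))
```

Note that a longer time sequence both increases the dimension of each sample and slightly reduces the number of samples that can be built from a given sequence.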

Results are summarized in table 5.6. As always, errors that are lower than in the default case are marked in bold. The supplementary knowledge gained by looking deeper into the past does not seem to have had a positive impact overall. Of course, one could argue that this single case is not representative of the principle, but in general we have been a bit disappointed by the performance gained through multiple past frames. An explanation for this phenomenon might be that the Markov assumption (i.e., a state depends only on the previous one) works reasonably well in the context of human motion.
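Written out, with a notation that is an assumption of this illustration ($x_t$ for the pose at frame t and $\Delta$ for the frame step, here 10), the first-order Markov assumption reads:

$$p(x_t \mid x_{t-\Delta}, x_{t-2\Delta}, \dots) = p(x_t \mid x_{t-\Delta})$$

If this equality held exactly, conditioning on $x_{t-2\Delta}$ in addition to $x_{t-\Delta}$ would bring no further information about $x_t$, while still increasing the dimension of the samples. This would be consistent with the absence of improvement observed in table 5.6.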

Once again, we can compare the case of Backflip 4 on mixed past frames in figure 5.14. The level of complexity seems to be more or less the same as for the prediction of the default case. An interesting detail appears when the predicted plots presented here are aligned vertically: local maxima appear roughly at the same moment in all the other plots, but for the extended time sequence they appear a few frames earlier. Where there is a correspondence with the reference plot, its maxima also come earlier. It would appear as if this setting could anticipate the general tendency of the upcoming motion slightly better. Despite this property, the default setting performs better both numerically and visually on video in this case.



6. Conclusion

He has no cognizance even of the

number Two; nor has he a thought

of Plurality; for he is himself his

One and All, being really Nothing.

Yet mark his perfect

self-contentment, and hence learn

this lesson, that to be self-contented

is to be vile and ignorant, and that

to aspire is better than to be

blindly and impotently happy.


One of the main weaknesses of this application of kernel density estimation is certainly the volatility of the quality depending on the setting and the data used. We think it is essential not to follow the results blindly, without cross-checking their relevance with the complementary image fidelity method. Another major issue is that too high a dimensionality has a clearly negative impact on both time and performance, limiting the potential complexity of both the model and the past frames.

Quite surprisingly, the more elaborate anisotropic kernels do not seem to systematically outperform their isotropic counterparts on our data. Not only do they add a hyper-parameter that is delicate to optimize, they also consume vastly more resources. They may perform better on larger training sets, which are less prone to overfitting, but the computational cost quickly becomes prohibitive compared with the isotropic alternative. If the anisotropic kernels are forced to use only a subset of the available data because of time constraints, it might be a better deal to rely on the isotropic kernels instead, as kernel estimators are nonparametric and as such converge to the true density as more samples are added (provided the bandwidth shrinks appropriately).

While all those limitations must be kept in mind, there are nonetheless many potential applications of prior knowledge when it is carefully combined with a more classic approach to tracking. In this context, we propose to put into perspective the reasonable levels of error on each joint of the predicted poses (low to average lack of precision globally). The image-based method is likely to be much more precise when there is no visual ambiguity, but can otherwise fail to configure the pose correctly on the few ambiguous joints (large lack of precision locally). This complementarity is the reason why we insist on the importance of delivering self-confidence estimates to balance both methods harmoniously.

We would like to mention a few perspectives on what could be done to improve the technique. Firstly, hyper-parameters might be tuned more finely with more sophisticated heuristics, or optimized automatically in an efficient way. Secondly, when dealing with larger training sets, the Gaussian kernels may also be approximated by specific clustering techniques (such as the one presented in [12]). This would be especially helpful for making anisotropic kernels tractable without sacrificing generality by truncating the available data. An interesting idea to improve evaluation would be to weight errors on joints depending on anatomical or model-based criteria, such as the position in the kinematic chain (an error close to the root influences the overall pose more than an error on an end effector). Finally, we thought it would be interesting to let the frame step vary with the current velocity of the motion, which would allow quick variations to be analyzed with more scrutiny than slower ones (synchronizing might be delicate though).
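As an illustration of the weighting idea mentioned above, here is a purely hypothetical sketch in which the weight of a variable decays geometrically with the depth of its joint in the kinematic chain. The decay factor, the depths and the use of a per-variable absolute difference are all assumptions of this example, and do not correspond to the error measures of section 5.1.

```python
def weighted_pose_error(reference, predicted, depths, decay=0.5):
    """Weighted mean absolute error with weights decay ** depth (root has depth 0)."""
    if not (len(reference) == len(predicted) == len(depths)):
        raise ValueError("all inputs must have the same length")
    weights = [decay ** d for d in depths]
    total = sum(w * abs(r - p) for w, r, p in zip(weights, reference, predicted))
    return total / sum(weights)


# Hypothetical chain of three variables: root, intermediate joint, end effector.
depths = [0, 1, 2]
reference = [0.0, 0.0, 0.0]

err_root = weighted_pose_error(reference, [4.0, 0.0, 0.0], depths)
err_tip = weighted_pose_error(reference, [0.0, 0.0, 4.0], depths)
print(err_root, err_tip)
```

With such a weighting, the same deviation is penalized more heavily at the root than at the end effector.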

To conclude, the use of kernel density estimators in human motion tracking does

certainly make sense if employed wisely. Throughout this work, we have tried to put

the emphasis on the study of the feasibility of the proposed methodology, and we

hope to have conducted the analysis of its strengths and limitations with as much

fairness as possible.

List of Figures

2.1 From left to right: stick, 2D contours and volumetric human models (images taken from [2])

2.2 Two local ranges for the elbow, depending on the shoulder configuration (image taken from [17])

2.3 Sketch of the methodology proposed in [33] (image taken from it)

3.1 3rd and 5th order approximations of the Hilbert curve

3.2 Missing global maximum (red is high, blue low)

3.3 Estimators receiving samples drawn from the standard normal: three first plots are histograms, fourth plot is a parametric estimator for normal distributions

3.4 Epanechnikov kernel using various univariate conversions, from darker to lighter: Euclidean norm, Manhattan distance, ∞-norm, multiplicative kernels

3.5 Common kernel types

3.6 Pitfalls of good-looking curves (100 observed points, h = 0.1 and 1.5 respectively; figure idea taken from [38])

4.1 Isotropic kernel estimation of a uniform spiral (1000 observed points)

4.2 Two views of an isotropic kernel estimation of the open-box manifold (each point is equiprobable), depending on the value of the tolerance threshold (500 observed points)

4.3 The MMM joint model during a boxing move

4.4 Step-by-step estimate, test frame step = 1, time sequence of t and t − 1

4.5 Evolution of average (solid blue) and maximum (dashed green) nearest neighbor distance with sample size varying from 10 to 1000, D = 1 and D = 100 resp.

4.6 Curves of ambiguity repartition obtained by histograms (see text for description)

5.1 Upper (dashed) and lower (solid) bounds of h for varying u^2, D = 90 and D = 120 resp.

5.2 Value of an isotropic kernel with fixed squared distance u^2 and varying h, D = 1 and D = 60 resp.

5.3 Various bivariate data sets and the corresponding correlation coefficients (image taken from the Wikipedia entry on correlation and dependence)

5.4 3-dimensional MDS applied on various MMM sequences: boxing 6 (top, left), backflip 6 (top, right), 3 (below, left) and 4 (below, right)

5.5 Sequence for Boxing 11 with 40 frames between each image, reference (top) and prediction of table 5.2 with real past frames (bottom)

5.6 Sequence for Backflip 4 with 160 frames between each image, reference (top) and prediction of table 5.2 with mixed past frames (bottom)

5.7 Values of absolute (left) and relative (right) error over time for test of table 5.2 on Boxing 6

5.8 Values of absolute (left) and relative (right) error over time for test of table 5.2 on Boxing 11

5.9 Values of absolute (left) and relative (right) error over time for test of table 5.2 on Backflip 4

5.10 Global plot of the variables over time for reference (top) and prediction of table 5.2 with mixed past frames (bottom), on Backflip 4

5.11 All individual plots of the variables for reference (blue) and prediction of table 5.2 with mixed past frames (red), on Backflip 4

5.12 Global plot of the variables over time for prediction of table 5.3 with mixed past frames, on Backflip 4





Bibliography

[1] Edwin A. Abbott. Flatland: A Romance of Many Dimensions. Oxford World's

Classics, 1884.

[2] J. K. Aggarwal and Q. Cai. Human motion analysis: A review. Computer Vision and Image Understanding, 73(3):428–440, March 1999.

[3] Yoshua Bengio. Gradient-based optimization of hyper-parameters. Neural Computation, 12(8):1889–1900, August 2000.

[4] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is "nearest neighbor" meaningful? Lecture Notes in Computer Science, 1540:217–235, 1999.

[5] Thomas Brox, Bodo Rosenhahn, Daniel Cremers, and Hans-Peter Seidel. Nonparametric density estimation with adaptive, anisotropic kernels for human motion tracking. In Human Motion - Understanding, Modeling, Capture and Animation, Second Workshop, Human Motion 2007, Rio de Janeiro, Brazil, October 20, 2007, Proceedings, volume 4814 of Lecture Notes in Computer Science, pages 152–165. Springer, 2007.

[6] Thomas Brox, Bodo Rosenhahn, Uwe G. Kersting, and Daniel Cremers. Nonparametric density estimation for human pose tracking. In Pattern Recognition, 28th DAGM Symposium, Berlin, Germany, September 12-14, 2006, Proceedings, volume 4174 of Lecture Notes in Computer Science, pages 546–555. Springer, 2006.


[7] Allison Bruce and Geoffrey Gordon. Better motion prediction for people-tracking. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), February 04 2004.

[8] Stefano Corazza, Lars Mündermann, Emiliano Gambaretto, Giancarlo Ferrigno, and Thomas P. Andriacchi. Markerless motion capture through visual hull, articulated ICP and subject specific model generation. International Journal of Computer Vision, 87(1-2):156–169, 2010.

[9] S. L. Dockstader and A. M. Tekalp. Multiple camera tracking of interacting and occluded human motion. Proceedings of IEEE, 89(10):1441–1455, October 2001.

[10] Sun Documentation. Chapter 2: IEEE arithmetic. http://docs.sun.com/source/806-3568/ncg_math.html.

[11] A. M. Elgammal and C. S. Lee. The role of manifold learning in human motion analysis. In Human Motion Understanding, Modelling, Capture, and Animation, page 2, 2008.

[12] Ahmed M. Elgammal, Ramani Duraiswami, and Larry S. Davis. Efficient kernel density estimation using the fast Gauss transform with applications to color modeling and tracking. IEEE Trans. Pattern Anal. Mach. Intell, 25(11):1499–1504, 2003.

[13] Ahmed M. Elgammal and Chan-Su Lee. Tracking people on a torus. IEEE Trans. Pattern Anal. Mach. Intell, 31(3):520–538, 2009.

[14] Keith Grochow, Steven L. Martin, Aaron Hertzmann, and Zoran Popović. Style-based inverse kinematics. ACM Transactions on Graphics, 23(3):522–531, August 2004.

[15] Jihun Ham, Daniel D. Lee, Sebastian Mika, and Bernhard Schölkopf. A kernel view of the dimensionality reduction of manifolds. In Carla E. Brodley, editor, Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, volume 69 of ACM International Conference Proceeding Series. ACM, 2004.

[16] David Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36(2):177–221, September 1988.

[17] L. Herda, R. Urtasun, and P. Fua. Hierarchical implicit surface joint limits for human body tracking. Computer Vision and Image Understanding: CVIU, 99(2):189–209, August 2005.

[18] Julian Hoch. Model-based 3D object classification for humanoid robots using semi-global features. Master's thesis, Karlsruhe Institute of Technology, April 2010.

[19] Nicholas R. Howe, Michael E. Leventon, and William T. Freeman. Bayesian reconstruction of 3D human motion from single-camera video. In Sara A. Solla, Todd K. Leen, and Klaus-Robert Müller, editors, NIPS, pages 820–826. The MIT Press, 1999.

[20] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79–87, 1991.

[21] C. N. Kuruwita. A Bayesian approach for bandwidth selection in kernel density estimation. userwww.service.emory.edu/~cmagnan/ACEStalks/Kuruwita.ppt.

[22] Chan-Su Lee and Ahmed M. Elgammal. Homeomorphic manifold analysis: Learning decomposable generative models for human motion analysis. In René Vidal, Anders Heyden, and Yi Ma, editors, WDV, volume 4358 of Lecture Notes in Computer Science, pages 100–114. Springer, 2006.

[23] Chan-Su Lee and Ahmed M. Elgammal. Human motion synthesis by motion manifold learning and motion primitive segmentation. In Francisco J. Perales López and Robert B. Fisher, editors, Articulated Motion and Deformable Objects, 4th International Conference, AMDO 2006, Port d'Andratx, Mallorca, Spain, July 11-14, 2006, Proceedings, volume 4069 of Lecture Notes in Computer Science, pages 464–473. Springer, 2006.

[24] John A. Lee and Michel Verleysen. Nonlinear Dimensionality Reduction.

Springer, 2007.

[25] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.

[26] Lars Mündermann, Stefano Corazza, and Thomas P. Andriacchi. Markerless human motion capture through visual hull and articulated ICP. In NIPS Workshop on Evaluation of Articulated Human Motion and Pose Estimation, 2006.

[27] Thomas B. Moeslund, Adrian Hilton, and Volker Krüger. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104(2-3):90–126, 2006.

[28] E. Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1065–1076, September 1962.

[29] Vladimir Pavlovic, James M. Rehg, and John MacCormick. Learning switching linear models of human motion. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, NIPS, pages 981–987. MIT Press, 2000.

[30] Rómer Rosales and Stan Sclaroff. Learning body pose via specialized maps. In Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, editors, Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada], pages 1263–1270. MIT Press, 2001.

[31] Stephan R. Sain. Multivariate locally adaptive density estimation, 2002.

[32] H. S. Seung and D. D. Lee. The manifold ways of perception. Science, 290(5500):2268–2269, December 2000.


[33] H. Sidenbladh, M. J. Black, and L. Sigal. Implicit probabilistic models of human motion for synthesis and tracking. In ECCV, page I: 784 ff., 2002.

[34] Cristian Sminchisescu and Allan D. Jepson. Generative modeling for continuous non-linearly embedded visual inference. In Carla E. Brodley, editor, Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, volume 69 of ACM International Conference Proceeding Series. ACM, 2004.

[35] Cristian Sminchisescu, Atul Kanaujia, and Dimitris N. Metaxas. Learning joint top-down and bottom-up processes for 3D visual inference. In CVPR, pages 1743–1752. IEEE Computer Society, 2006.

[36] Cristian Sminchisescu and Bill Triggs. Estimating articulated human motion with covariance scaled sampling. International Journal of Robotic Research, 22(6):371–392, 2003.

[37] Carl Staelin. Parameter selection for support vector machines. Technical report, Hewlett-Packard, November 21 2002.

[38] Berwin A. Turlach. Bandwidth selection in kernel density estimation: A review. Discussion paper 9317, Institut de Statistique, UCL, Louvain-la-Neuve, Belgium, January 15 1993.

[39] Raquel Urtasun, David J. Fleet, and Pascal Fua. 3D people tracking with Gaussian process dynamical models. In CVPR, pages 238–245. IEEE Computer Society, 2006.

[40] Pascal Vincent and Yoshua Bengio. Manifold Parzen windows. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, NIPS, pages 825–832. MIT Press, 2002.

[41] Christopher Richard Wren and Alex Pentland. Dynamic models of human motion. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, pages 22–29. IEEE Computer Society, 1998.

A. Gradients derivation

Let us define the elements of $\mathbf{x}$ as $(x_1, \ldots, x_D)^T$ and those of $\mathbf{x}_i$ as $(x_{i1}, \ldots, x_{iD})^T$. We begin with a single partial derivative of the logarithm of the isotropic kernel density estimator, where $j$ is the dimension along which we differentiate. For notational convenience, let $[f(\bullet)]'$ designate $\partial f(\bullet) / \partial x_j$. Note also that `log' stands here for the natural logarithm, not for $\log_{10}$.

The partial derivative of the logarithm of the isotropic estimator is:

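\[
\begin{aligned}
[\log \hat{f}_{\mathrm{iso}}(\mathbf{x})]'
&= \frac{[\hat{f}_{\mathrm{iso}}(\mathbf{x})]'}{\hat{f}_{\mathrm{iso}}(\mathbf{x})}
&& \text{(derivative of $\log(\bullet)$)} \\
&= \frac{\left[\frac{1}{N}\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)\right]'}{\frac{1}{N}\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)}
&& \text{(definition of $\hat{f}_{\mathrm{iso}}$)} \\
&= \frac{\left[\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)\right]'}{\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)}
&& \text{(extraction \& simplification of $\tfrac{1}{N}$)} \\
&= \frac{\sum_{i=1}^{N} \left[K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)\right]'}{\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)}
&& \text{(derivative of a sum)} \\
&= \frac{\sum_{i=1}^{N} \left[\frac{1}{(2\pi h^2)^{D/2}} \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{x}_i\|^2}{2h^2}\right)\right]'}{\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)}
&& \text{(definition of $K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)$)} \\
&= \frac{\sum_{i=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\!\left(-\frac{\|\mathbf{x}-\mathbf{x}_i\|^2}{2h^2}\right) \left[-\frac{\|\mathbf{x}-\mathbf{x}_i\|^2}{2h^2}\right]'}{\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)}
&& \text{(derivative of $\exp(\bullet)$)} \\
&= -\frac{\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i) \left[\|\mathbf{x}-\mathbf{x}_i\|^2\right]'}{2h^2 \sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)}
&& \text{(extraction of scalar factor)} \\
&= -\frac{\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i) \left[\sum_{k=1}^{D} (x_k - x_{ik})^2\right]'}{2h^2 \sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)}
&& \text{(definition of squared norm)} \\
&= -\frac{\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i) \sum_{k=1}^{D} \left[(x_k - x_{ik})^2\right]'}{2h^2 \sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)}
&& \text{(derivative of a sum)} \\
&= -\frac{\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i) \left[(x_j - x_{ij})^2\right]'}{2h^2 \sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)}
&& \text{(nullification of constants)} \\
&= -\frac{\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i) \, 2 (x_j - x_{ij}) \left[x_j - x_{ij}\right]'}{2h^2 \sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)}
&& \text{(derivative of $f(\bullet)^2$)} \\
&= -\frac{\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i) (x_j - x_{ij})}{h^2 \sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)}
&& \text{(derivative of $x_j$ and of constant $x_{ij}$)}
\end{aligned}
\]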


Now, we can compute the partial derivatives for all j:

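\[
\begin{aligned}
\frac{\partial \log \hat{f}_{\mathrm{iso}}(\mathbf{x})}{\partial \mathbf{x}}
&= \left(\frac{\partial \log \hat{f}_{\mathrm{iso}}(\mathbf{x})}{\partial x_1}, \ldots, \frac{\partial \log \hat{f}_{\mathrm{iso}}(\mathbf{x})}{\partial x_D}\right)^T \\
&= \left(-\frac{\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)(x_1 - x_{i1})}{h^2 \sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)}, \ldots, -\frac{\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)(x_D - x_{iD})}{h^2 \sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)}\right)^T \\
&= -\frac{\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)(x_1 - x_{i1}, \ldots, x_D - x_{iD})^T}{h^2 \sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)} \\
&= -\frac{\sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)(\mathbf{x} - \mathbf{x}_i)}{h^2 \sum_{i=1}^{N} K_h^{\mathrm{iso}}(\mathbf{x},\mathbf{x}_i)}
\end{aligned}
\]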
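The closed-form isotropic gradient can be checked numerically. The following is a minimal sketch, not the implementation used in this work: the function names, the random test data and the bandwidth value are all assumptions chosen for illustration. It compares the formula above with central finite differences of the log-density.

```python
import numpy as np

def log_kde_iso(x, samples, h):
    """Log of the isotropic Gaussian kernel density estimate at x."""
    n, d = samples.shape
    sq_dist = np.sum((x - samples) ** 2, axis=1)        # ||x - x_i||^2
    log_k = -0.5 * d * np.log(2 * np.pi * h ** 2) - sq_dist / (2 * h ** 2)
    m = log_k.max()                                     # log-sum-exp for stability
    return m + np.log(np.mean(np.exp(log_k - m)))

def grad_log_kde_iso(x, samples, h):
    """Closed form: -sum_i K_i (x - x_i) / (h^2 sum_i K_i)."""
    diff = x - samples                                  # shape (N, D)
    log_k = -np.sum(diff ** 2, axis=1) / (2 * h ** 2)   # normalising constant cancels
    w = np.exp(log_k - log_k.max())
    w /= w.sum()                                        # K_i / sum_i K_i
    return -(w[:, None] * diff).sum(axis=0) / h ** 2

def numerical_grad(f, x, eps=1e-6):
    """Central finite differences along each dimension j."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# Hypothetical data: 50 samples in 3 dimensions, arbitrary bandwidth.
rng = np.random.default_rng(0)
samples = rng.normal(size=(50, 3))
x = rng.normal(size=3)
h = 0.7

analytic = grad_log_kde_iso(x, samples, h)
numeric = numerical_grad(lambda p: log_kde_iso(p, samples, h), x)
max_abs_error = float(np.max(np.abs(analytic - numeric)))
print("max abs difference:", max_abs_error)
```

The kernel weights are normalised in log space because the factor $1/(2\pi h^2)^{D/2}$ cancels between numerator and denominator, which also avoids underflow when $h$ is small.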

The derivation for the anisotropic estimator is similar. Let us define $O$ and $I$ respectively as the null and identity matrices of size $D \times D$. Let $\mathbf{e}_j$ be the $j$th column of $I$. The single partial derivative is:

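\[
\begin{aligned}
[\log \hat{f}_{\mathrm{ani}}(\mathbf{x})]'
&= \frac{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x}) \left[-\frac{1}{2}(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i)\right]'}{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x})}
&& \text{(similar to $\log \hat{f}_{\mathrm{iso}}$)} \\
&= -\frac{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x}) \left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i)\right]'}{2 \sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x})}
&& \text{(extraction of scalar factor)}
\end{aligned}
\]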


To lighten the notation, we differentiate the inner quadratic form separately and substitute the result into the full fraction afterwards:

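\[
\begin{aligned}
\left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i)\right]'
&= \left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1}\right]' (\mathbf{x}-\mathbf{x}_i)
 + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \left[\mathbf{x}-\mathbf{x}_i\right]'
&& \text{(derivative of a product)} \\
&= \left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1}\right]' (\mathbf{x}-\mathbf{x}_i)
 + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \left[\sum_{k=1}^{D} (x_k - x_{ik}) \mathbf{e}_k\right]'
&& \text{(expansion of vector $(\mathbf{x}-\mathbf{x}_i)$)} \\
&= \left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1}\right]' (\mathbf{x}-\mathbf{x}_i)
 + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \sum_{k=1}^{D} \left[x_k - x_{ik}\right]' \mathbf{e}_k
&& \text{(derivative of a sum)} \\
&= \left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1}\right]' (\mathbf{x}-\mathbf{x}_i)
 + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \left[x_j - x_{ij}\right]' \mathbf{e}_j
&& \text{(nullification of constants)} \\
&= \left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1}\right]' (\mathbf{x}-\mathbf{x}_i)
 + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j
&& \text{(derivative of $x_j$ and of constant $x_{ij}$)} \\
&= \left[(\mathbf{x}-\mathbf{x}_i)^T\right]' \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i)
 + (\mathbf{x}-\mathbf{x}_i)^T \left[\Sigma_i^{-1}\right]' (\mathbf{x}-\mathbf{x}_i)
 + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j
&& \text{(derivative of a product)} \\
&= \left[(\mathbf{x}-\mathbf{x}_i)^T\right]' \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i)
 + (\mathbf{x}-\mathbf{x}_i)^T O \, (\mathbf{x}-\mathbf{x}_i)
 + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j
&& \text{(nullification of constants)} \\
&= \left[(\mathbf{x}-\mathbf{x}_i)^T\right]' \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i)
 + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j
&& \text{(simplification)} \\
&= \mathbf{e}_j^T \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i)
 + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j
&& \text{(similar to above)} \\
&= 2 (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j
&& \text{(because $\Sigma_i^{-1}$ is symmetric)}
\end{aligned}
\]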
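The result for the quadratic form, $2(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j$, can be sanity-checked on random data. The sketch below is only illustrative: the covariance matrix is a randomly generated symmetric positive definite stand-in for $\Sigma_i$, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4

# Hypothetical symmetric positive definite matrix standing in for Sigma_i.
A = rng.normal(size=(D, D))
cov = A @ A.T + D * np.eye(D)
cov_inv = np.linalg.inv(cov)

x = rng.normal(size=D)
x_i = rng.normal(size=D)

def quad(p):
    """Quadratic form (p - x_i)^T Sigma_i^{-1} (p - x_i)."""
    d = p - x_i
    return d @ cov_inv @ d

eps = 1e-6
errors = []
for j in range(D):
    e_j = np.eye(D)[:, j]                            # j-th column of I
    closed_form = 2 * (x - x_i) @ cov_inv @ e_j      # result of the derivation
    finite_diff = (quad(x + eps * e_j) - quad(x - eps * e_j)) / (2 * eps)
    errors.append(abs(closed_form - finite_diff))

max_quad_error = max(errors)
print("max abs difference:", max_quad_error)
```

Because the function is exactly quadratic in $\mathbf{x}$, central differences carry no truncation error here, so any remaining discrepancy is floating-point round-off.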

If we replace this in the previous equality, we get:

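\[
\begin{aligned}
[\log \hat{f}_{\mathrm{ani}}(\mathbf{x})]'
&= -\frac{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x}) \, 2 (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j}{2 \sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x})}
&& \text{(see above)} \\
&= -\frac{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x}) (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j}{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x})}
&& \text{(extraction \& simplification of 2)}
\end{aligned}
\]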

Finally, we can compute the partial derivatives for all j:

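\[
\begin{aligned}
\frac{\partial \log \hat{f}_{\mathrm{ani}}(\mathbf{x})}{\partial \mathbf{x}}
&= \left(\frac{\partial \log \hat{f}_{\mathrm{ani}}(\mathbf{x})}{\partial x_1}, \ldots, \frac{\partial \log \hat{f}_{\mathrm{ani}}(\mathbf{x})}{\partial x_D}\right)^T \\
&= \left(-\frac{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x}) (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_1}{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x})}, \ldots, -\frac{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x}) (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_D}{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x})}\right)^T \\
&= -\frac{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x}) \left((\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_1, \ldots, (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_D\right)^T}{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x})} \\
&= -\frac{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x}) (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} (\mathbf{e}_1, \ldots, \mathbf{e}_D)^T}{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x})} \\
&= -\frac{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x}) (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} I}{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x})} \\
&= -\frac{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x}) (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1}}{\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x})}
\end{aligned}
\]

The last expression is written as a row vector. As a column vector, consistent with the isotropic case, and using the symmetry of $\Sigma_i^{-1}$, it reads $-\sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x}) \, \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i) \,/\, \sum_{i=1}^{N} K_i^{\mathrm{ani}}(\mathbf{x})$.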
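As with the isotropic case, the anisotropic gradient can be verified against finite differences. This is a sketch under stated assumptions: it takes $K_i^{\mathrm{ani}}$ to be a Gaussian kernel with its usual normalisation $1 / ((2\pi)^{D/2} |\Sigma_i|^{1/2})$ and $\hat{f}_{\mathrm{ani}}$ to be the mean of the kernels, it uses randomly generated per-sample covariances, and it returns the gradient in column form. All function and variable names are hypothetical.

```python
import numpy as np

def log_kernels_ani(x, samples, cov_invs, log_dets):
    """Log of each Gaussian kernel K_i(x), plus the differences x - x_i."""
    d = samples.shape[1]
    diff = x - samples                                          # (N, D)
    # Mahalanobis terms (x - x_i)^T Sigma_i^{-1} (x - x_i), one per sample
    maha = np.einsum("nd,nde,ne->n", diff, cov_invs, diff)
    return -0.5 * (d * np.log(2 * np.pi) + log_dets + maha), diff

def log_kde_ani(x, samples, cov_invs, log_dets):
    log_k, _ = log_kernels_ani(x, samples, cov_invs, log_dets)
    m = log_k.max()
    return m + np.log(np.mean(np.exp(log_k - m)))

def grad_log_kde_ani(x, samples, cov_invs, log_dets):
    """Column form: -sum_i K_i Sigma_i^{-1} (x - x_i) / sum_i K_i."""
    log_k, diff = log_kernels_ani(x, samples, cov_invs, log_dets)
    w = np.exp(log_k - log_k.max())
    w /= w.sum()                                                # K_i / sum_i K_i
    return -np.einsum("n,nde,ne->d", w, cov_invs, diff)

# Hypothetical data: 30 samples in 3 dimensions, one SPD covariance each.
rng = np.random.default_rng(2)
N, D = 30, 3
samples = rng.normal(size=(N, D))
A = rng.normal(size=(N, D, D))
covs = A @ A.transpose(0, 2, 1) + np.eye(D)
cov_invs = np.linalg.inv(covs)
log_dets = np.linalg.slogdet(covs)[1]
x = rng.normal(size=D)

analytic_ani = grad_log_kde_ani(x, samples, cov_invs, log_dets)

eps = 1e-6
numeric_ani = np.zeros(D)
for j in range(D):
    e_j = np.eye(D)[:, j]
    numeric_ani[j] = (log_kde_ani(x + eps * e_j, samples, cov_invs, log_dets)
                      - log_kde_ani(x - eps * e_j, samples, cov_invs, log_dets)) / (2 * eps)

max_ani_error = float(np.max(np.abs(analytic_ani - numeric_ani)))
print("max abs difference:", max_ani_error)
```

Unlike the isotropic case, the normalising constants differ per kernel and do not cancel, so the log-determinants must be kept when forming the weights.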
