Université Catholique de Louvain Ecole Polytechnique de Louvain … · 2011. 4. 4. · 1st utor:T...
Université Catholique de Louvain
Ecole Polytechnique de Louvain
Département d'Ingénierie Informatique
Human motion prediction using kernel density
estimators
Master's thesis presented in attainment of a master degree in computer science,
option in artificial intelligence
by Richard Tillieux
Academic year 2010-2011
Under the guidance of:
Supervisor: Prof. Pierre Dupont
Co-supervisor: Prof. Rainer Stiefelhagen
1st Tutor: Tobias Feldmann
2nd Tutor: Sebastian Schulz
Acknowledgements
First of all I would like to thank my tutors Tobias Feldmann and Sebastian Schulz
for their long-term supervision. They have invested quite a lot of time to follow the
evolution of this work, and I am grateful for their interest. I will especially miss the
weekly morning group meetings, where Tobias generously feeds his starving students
with the most decadent pastries to be ever found on this side of the Atlantic!
I would also like to thank my supervisor Prof. Pierre Dupont for his open-mindedness.
He kindly agreed to take charge of a master thesis at a very unusual
time for the Belgian academic year.
I am greatly thankful to Chantal Poncin for her truly amazing personal support
throughout this whole Erasmus year. She dedicated herself with unending
patience to easing administrative obstacles, and gave me excellent advice on
a wide variety of issues. Without her help, staying one more semester in Karlsruhe
would certainly not have been possible at all.
I would like to thank the whole Go Hu.MAn. group for the kind atmosphere they
bring to the office. The monthly social events, despite the occasional injury, were all
very pleasant evenings.
I could not fail to mention my original supervisor Prof. Marco Saerens, who was
really kind-hearted about my desire to stay one year in Karlsruhe, despite the fact
that this also meant canceling his master thesis.
And of course, last but certainly not least, I would finally like to thank my parents,
whose continued support I felt despite the distance.
Abstract
Markerless human motion tracking can be achieved through two distinct, but
complementary, sets of techniques: those based on the analysis of the image, and those
based on prior knowledge. Among the latter, direct prediction of human motion from
a global perspective has promising potential, yet it is also a challenging Machine
Learning problem, due to the high dimensionality of the data. This master thesis
seeks to explore the strengths and limitations of an approach specifically based on
kernel density estimators. The methodology proposed here compares, on several
levels, the classic isotropic kernels with their anisotropic counterparts, in the context of
human motion prediction.
This master thesis was realized within the Go Hu.MAn. group at the Karlsruhe
Institute of Technology (KIT).
Table of contents
1 Introduction
2 State of the art
2.1 Prior knowledge in markerless motion tracking
2.1.1 Explicit joint limits
2.1.2 Static pose priors
2.1.3 Dynamic models
2.2 Kernel estimators in motion tracking
3 Theoretical foundations
3.1 Geometrical concepts
3.1.1 Curse of dimensionality
3.1.2 Intrinsic dimensionality
3.2 Machine Learning concepts
3.2.1 Inductive bias
3.2.2 Training and test sets
3.2.3 Supervised learning
3.2.4 Gradient ascent
3.2.5 Cross-validation
3.3 Kernel density estimation
3.3.1 Non-parametric density estimation
3.3.2 Kernel definition
3.3.3 Bandwidth selection
4 Chosen methodology
4.1 Preliminary analysis
4.2 Kernel reconstruction of artificial examples
4.2.1 Visualization
4.2.2 Integrated Square Error
4.2.3 Average Negative Log Likelihood
4.3 Processing real data
4.3.1 Building of training set
4.3.2 Frame subsampling
4.3.3 Step-by-step prediction
4.3.4 Prediction by gradient ascent
4.4 Hyper-parameters selection
4.4.1 Bandwidth
4.4.2 Alpha
4.4.3 Training frame step
4.4.4 Time sequence
4.5 Combination with image fidelity
4.5.1 Confidence weight of estimation
4.5.2 Ambiguity measure of training set
4.6 Complexity analysis
4.6.1 Space complexity
4.6.2 Time complexity
5 Experiments
5.1 Performance criteria
5.2 Test set selection
5.3 Limitations of the methodology
5.3.1 Curse of dimensionality
5.3.2 Ill-conditioned matrices
5.4 Estimation of intrinsic dimensionality
5.5 Evaluation
5.5.1 Default case
5.5.2 Some graphical results
5.5.3 Maximum nearest neighbor distance
5.5.4 Anisotropic kernels
5.5.5 Longer time sequence
6 Conclusion
A Gradients derivation
1. Introduction
`Back! back! Away from me, or you
must go with me - whither you
know not - into the Land of Three
Dimensions!' `Fool! Madman!
Irregular!' I exclaimed; `never will I
release thee; thou shalt pay the
penalty of thine impostures.' `Ha!
Is it come to this?' thundered the
Stranger: `then meet your fate: out
of your Plane you go. Once, twice,
thrice! 'Tis done!'
Edwin A. Abbott, Flatland
Human motion tracking is the task of analyzing the movement of a person
over time and converting the raw data obtained into a digital anatomical model.
There are mainly two ways to track human motion: through the use of optical
markers, or by analyzing the images taken by a camera system.
The first solution has been extensively used in the movie industry for the
animation of digital characters. Basically, the markers can be located quite precisely
in space, the main task being to find a model fitting the marker set positions (with
enough markers, there is barely any ambiguity). The work of the animation team is
indeed much facilitated if a human actor can directly make the virtual character move
in a natural manner. This remains of course only a model, and the real complexity of
a human pose cannot be fully rendered. The position and concentration of markers
will therefore vary depending on the desired level of scrutiny. For example, reproducing
facial expressions will certainly be more relevant in the case of a dialogue than in the
middle of a stunt.
Marker-based techniques have proved to provide quite reliable results. However,
they have the disadvantage of requiring a large and static infrastructure, which may
be suitable for movies, but obviously lacks flexibility for more general usage. The
practical constraint of sticking all the markers on oneself is certainly not ideal in terms of
ergonomics.
The second approach seeks to overcome the need to rely on markers, and
directly processes the images recorded by the cameras. The potential applications are
numerous (home automation, user interfaces, video games, etc.). The underlying application in
our context is to help aspiring athletes correct their technical moves by comparison
with a reference movement performed by an accomplished expert of the sport in
question. This could also possibly be used for physiotherapy, by helping patients to do
rehabilitation exercises by themselves at home.
Nonetheless, this is much more arduous to achieve than with marker-based
techniques, and is only beginning to be used in an industrial context (recently, Microsoft
released the controller-free Kinect sensor device for the Xbox, for instance). If we only
have one camera at our disposal, then the challenge is even bigger: reconstruction of
a 3D model based on a single 2D image is highly ambiguous. Further complications are the
risks of occlusion (some object passes in front of the observed person) and self-occlusion
(in a profile view of a walking person, one leg hides the other). There are several
other difficulties upon which we will not elaborate, as image analysis in itself
is not the object of this work.
At this point, there is an interesting observation to make: as human beings, we
can have a pretty precise idea of how visible persons are moving around, and closing
one eye does not seem to degrade our perception of them substantially. But we have
accumulated experience from years of real-world interaction with persons at various
distances, so we have no difficulty inferring the poses of passers-by with almost
no visual effort.
The idea is thus, as a complement to image analysis (the so-called image fidelity
method), to make use of anatomical knowledge to favor poses that are similar to
previously learned movements. Indeed, we could to some limited extent predict
what the current pose could be, with the help of both an estimation of the previous
movements and a knowledge base. There are more or less elaborate ways to do this;
we would like to propose here an approach totally independent of the image fidelity
method. This clear distinction provides a clean framework to “plug” the two
together at will when the image is ambiguous.
Let us give a quick overview of the following chapters. Chapter 2 briefly presents
some theoretical works about knowledge-assisted motion tracking. Then follows a
closer look at the approach proposed in [5], the paper upon which our own work is
primarily based. Chapter 3 introduces the main concepts that we use and that are
not specific to the task of human motion prediction. Further chapters assume it has
been read, so we recommend that the reader take a look at the conventions used there
in the formulas to avoid ambiguity, even if he or she is already familiar with the
subject. Chapter 4 details how these concepts are used specifically in the context
of human motion prediction. Chapter 5 discusses how to test the system in a relevant way,
as well as the concrete consequences of theoretical problems, and some performance
results with the available data.
2. State of the art
The rhythmical and, if I may say,
well-modulated undulation of the
back of our ladies of Circular rank
is envied and imitated by the
common Equilateral, who can
achieve nothing but a mere
monotonous swing, like the ticking
of a pendulum.
Edwin A. Abbott, Flatland
2.1 Prior knowledge in markerless motion tracking
In order to restrict this summary to a well-defined problem, we would like to clarify
which task of human motion analysis we consider relevant here. Given a
temporal sequence of images of a person wearing no markers, by “motion
tracking” we mean modeling the person's poses through time. The term “tracking”
alone does not necessarily imply explicit pose estimation: in all generality,
it considers states of the person through time (the represented features may be entirely
image-based, such as colors or position in the image). Tasks such as people-tracking
(locating the position of a moving person in a complex environment, see [7] for an
example making use of predictions) serve a distinct purpose, as they do not involve
pose estimation, and thus will not be considered here.
Figure 2.1: From left to right: stick, 2D contours and volumetric human models (images taken from [2])
Additionally, we will stick to the problem as formulated for a single individual.
Thus, we will not discuss the task of tracking multiple persons (with interacting
and mutually occluding persons, see [9] for instance). Even with a knowledge-based
approach that is independent of the image such as ours, it would require taking
interactions into account, as individual motions will clearly be influenced by the
whole group's overall activity.
Another remark about terminology: by “prior knowledge” we refer here exclusively
to information that is external to the currently tracked motion, and gained through
automated learning or from domain experts. The term “prediction” is a bit ambiguous,
because prediction is achieved on the basis of previous poses of the same motion,
the use of prior knowledge remaining optional. Indeed, tracking algorithms may very
well make predictions without the assistance of previously learned motions
or anatomical constraints. As we are mainly interested in the knowledge-based
techniques deployed to assist motion tracking (and not in the tracking techniques
themselves), we will primarily focus on this aspect here.
A presentation of human motion analysis can be found in [2], and more recently
in [27]. The three most common ways to explicitly model human poses are sketched
in figure 2.1. These models represent a human body pose in the form of a kinematic
skeletal structure (in this context, “2D contours” is not the overall silhouette but
means that the body is decomposed into articulated parts). The values of the joint
angles are therefore fundamental to describe the pose. Indeed, for a given person,
they are the only parameters liable to vary in stick and volumetric models, as we
do not expect the limbs of the person to change their shape in the process of motion.
Stick models are today mostly used for illustration, as they can be superimposed directly
on the person's image.
2.1.1 Explicit joint limits
A quite intuitive way to take prior knowledge into account and ensure a meaningful
estimated pose is to define and enforce constraints on the model. Typically, we
will impose specific bounds on each joint angle, or prohibit mutual intersection of
body parts in the case of the volumetric model. This makes it possible to discard anatomically
inconsistent hypothetical poses, thus reducing the ambiguity and improving the
robustness of the result.
In [36], the authors employ a volumetric body model parametrized by a vector
p = (χ, d, i), where χ designates the joint angle parameters (about 30 in their
model), while d and i are respectively the deformable shape parameters (limb volumes)
and the internal joint proportions (relative position in space of some joints that are
liable to vary between individuals, such as the hip joint)¹. As these last two vectors are
optimized once and for all at initialization to better fit the
modeled subject's body, the temporal modeling to achieve throughout the tracking
process depends only on χ (as in our case). Interestingly, all three vectors
are subject to hard constraints (e.g. the values for i are allowed to deviate
10% from the human standard). Those constraints are formulated by the inequality
$C_{bl} \cdot p < 0$, with $C_{bl}$ as the box-limit constraint matrix.
The body representation used in [26] is a subject-specific volumetric model
consisting of 15 body segments and 14 joints. Additionally, the authors make use of
a convex visual hull before estimating the actual model. They propose a flexible
modeling of constraints. Firstly, joint centers are not held fixed at the juncture of
strictly adjacent body segments. In other words, joints also get 3 supplementary
translational Degrees of Freedom (DoF). Secondly, constraints are formulated in a
soft way, by weighting deviations on each joint, for both the rotational DoF and the
additional translational DoF.
¹ We renamed the variables so that the nomenclature corresponds to ours for χ.
This work is extended in [8]. The subject-specific model is generated
automatically, with a very realistic visual result. Constraints are formulated in a hard way
this time, and are defined locally on the 6 DoF of each joint. The 3 translational
DoF constraints are grouped in a bounding box. The 2 DoF corresponding to the
“swing” rotation of the articulation are constrained together. Finally, the “twist”
DoF is simply constrained by one inequality.
The approach of [17] is quite sophisticated on the subject of joint constraints, as
it is precisely the focus of the paper. The authors point out that the medical textual
resources that might be consulted to devise fixed joint constraints do not provide
explicit dependency constraints between linked joints (e.g., the elbow twist
depends on the shoulder orientation) or between the different DoF of a single joint.
Their idea is to measure those relations experimentally. After extensive recording
of a subject performing twist and swing on the studied joint, they obtain a
quaternion field. Quaternions are used instead of Euler angles (roll, pitch, yaw) to avoid,
among other things, gimbal lock problems (for joints having 3 DoF).
Each quaternion $q = (q_x, q_y, q_z, q_w)^T = (\sin(\theta/2)\, v, \cos(\theta/2))$ corresponds to a given
orientation of the joint, represented by the unit axis $v$ and the angle of rotation $\theta$
around that axis. Those quaternion fields are then converted into a voxelized form.
Internal dependencies between the DoF are thus encapsulated for the whole joint.
The voxel shapes can then be hierarchically combined to represent dependencies
between two linked joints (see figure 2.2). Each voxel of the parent joint (e.g.
the shoulder) will further constrain the range of valid values for the child joint (the
elbow in that case).
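As a small illustration of the axis-angle formula above, the following Python sketch builds the quaternion for a given axis and angle. The function name and the (x, y, z, w) tuple layout are our own assumptions for illustration; they are not taken from [17].

```python
import math


def quaternion_from_axis_angle(axis, theta):
    """Return (qx, qy, qz, qw) for a rotation of theta radians around axis.

    The axis is normalized first, so any non-zero 3-vector is accepted.
    """
    norm = math.sqrt(sum(c * c for c in axis))
    if norm == 0.0:
        raise ValueError("the rotation axis must be non-zero")
    scale = math.sin(theta / 2.0) / norm
    return (axis[0] * scale, axis[1] * scale, axis[2] * scale,
            math.cos(theta / 2.0))


# Half-turn twist around the z axis (hypothetical example values).
q = quaternion_from_axis_angle((0.0, 0.0, 2.0), math.pi)
print(q)
```

Since the axis is normalized, the result is always a unit quaternion, which is the property the voxelized representation relies on to describe orientations.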
Figure 2.2: Two local ranges for the elbow, depending on the shoulder configuration (image taken from [17])
2.1.2 Static pose priors
While anatomical constraints can prove well adapted to multi-camera systems,
they may prove insufficient for resolving ambiguity in a monocular setting.
Multiple anatomically consistent poses may match a 2D image, hence requiring us to
arbitrate between them on a likelihood basis, rather than on a constraint basis. This
is not to say that explicitly defining joint limits is primitive and should be regarded
as inferior! It simply depends on the setting. Furthermore, it might in some cases
be worthwhile to keep constraints as a safeguard, if the predictions based on priors are
liable to produce inconsistent results.
We borrow the term “static pose priors” from [5]. It is not to be understood
as a way of determining priors on single isolated frames, without taking past frames
into consideration. Rather, it is meant as a distinction from the dynamic models, which
will be sketched in the next point. Roughly put, a pose is said to be static if its
model does not contain temporal information when considered at a fixed moment.
This information appears indirectly by combining several poses into a chronological
sequence. Our own approach can be categorized as such.
One of the first integrations of priors on motion can be found in [19]. The model is
a simple stick figure, in which a pose is encoded as a group of 20 body point locations
(instead of joint angles). Each learning element is what the authors call a snippet, and
consists of a brief sequence of 11 successive poses juxtaposed in a vector, lasting only
about a third of a second at the frame rate used. Those snippets are clustered into $m$
groups, and a matrix $M_i$ is defined for each cluster $i$, by composition of its $n_i$ snippets
centered on their mean $\mu_i$. Singular Value Decomposition is then performed on each
matrix to yield $M_i = U_i S_i^2 V_i^T$. Only the largest 50 singular values are retained. A
covariance matrix can be defined for each cluster $i$ as² $\Sigma_i = \frac{1}{n_i} U_i S_i^2 U_i^T$. Finally, the
prior probability for a snippet $x$ is defined as:

$$P(x) = \sum_{i=1}^{m} k \, \pi_i \, e^{-\frac{1}{2} x^T \Sigma_i^{-1} x} \tag{2.1}$$

where $k$ is a normalization factor and $\pi_i$ weights the importance of cluster $i$ as
its fraction of the whole training set.
The methodology proposed in [30] takes another point of view on the problem.
Instead of confining learning to the model space, the authors seek to learn the mapping
between image space $\mathbb{R}^c$ and model space $\mathbb{R}^d$. Reconstruction of the image $I$
corresponding to a pose configuration³ $\chi$ is normally unambiguous, and can be managed
through the use of a function $\zeta : \mathbb{R}^d \to \mathbb{R}^c$ that computes the forward kinematics
and the resulting image. As already mentioned, the inverse mapping is much more
difficult to achieve, because of the inherent ambiguity of the problem. The idea of
the authors is to generate a set of $m$ functions $\phi_k : \mathbb{R}^c \to \mathbb{R}^d$, each one specialized on
a sub-domain of $\mathbb{R}^c$. This portion of the model space is not forced to be a connected
region. Interestingly, the function $\zeta$ can be integrated in the inference to provide a
weighted confidence in the candidate poses with respect to the actual image.
Similarly to this explicit binding between image and model in learning, the
authors of [35] propose a learning algorithm that combines both generative modeling
(top-down) and model recognition (bottom-up) in a kind of bidirectional search.
The authors describe the learning technique as being self-supervised in the sense
that inferences in one of the two modeling processes optimize the other, and vice
versa. This bootstrap cannot be fully realized automatically however, because of the
absolute need for relevant initialization by supervised learning independently for both
models; hence we might give it a kind of semi-supervised status as a whole.
² We renamed two variables used in the paper, namely j into i and Λ into Σ, to harmonize
nomenclature with our own, as we also make use of local covariance matrices (although we have no
clusters, and we define Σ in a quite different way).
³ Once again, variables are renamed when appropriate.
2.1.3 Dynamic models
These kinds of models view human body poses as physical systems. As such, the
current velocity (along with acceleration if required) of each pose parameter can be
injected into the model, and treated as an inherent property of the pose. This
makes it possible to use related techniques (like Kalman filters for instance).
However, as human motion cannot be represented by linear dynamic systems, more
specific techniques have to be employed.
Why not always integrate motion dynamics? Firstly, one could argue that velocity
and acceleration are at least partially recovered implicitly in temporal static poses,
even though they do not appear measured as such. Furthermore, according to [30]:
Although the approach presented here can be used to model dynamics,
we argue that when general human motion dynamics are intended to be
learned, the amount of training data, model complexity, and
computational resources required are impractical.
Of course, a more complex model may still yield better performance when
exploited meaningfully, that is, with an approach itself based on dynamics. The quoted
statement was also made in 2001, and since then, advances in computational power
allow for more demanding processes. So it seems that the modeling choice depends directly
on the intended usage.
An early use of dynamic models for human motion analysis can be found in [41].
The authors begin by advocating the use of dynamics, comparing the predictions
made by two systems based on Hidden Markov Models (HMM) for mouse motion
gestures. The first class of HMM is trained on the difference between two successive
recorded mouse positions, whereas the second is trained on the innovation sequence
(difference between observation and prediction) generated by a Kalman filter. Both
systems classify samples of simple geometrical shapes (circles, triangles, scribbles)
without error, but the dynamic model is noticeably smoother in its predictions due to
its auto-calibration.
When applying dynamics to human motion models, the authors define hard
constraints in the form of a set of physical forces. Based on $\ddot{q} = W \cdot Q$, the
definition of the acceleration of an object $q$ as its inverse mass matrix times the external
forces applied to it, they add forces $c(q, t)$ to $Q$ to account for kinematic constraints,
defined so as to avoid artificially adding energy to the model. Additionally,
soft constraints are represented by a potential field to account for the probabilistic
aspects of the problem.
In [29], the tactic chosen is to use a Switching Linear Dynamic System (SLDS),
which permits transitions between several linear dynamic systems (LDS). This kind of
fusion between HMM and LDS makes it possible to overcome both the discrete nature of the
HMM and the nonlinearity of human motion dynamics. But because of the exponential
complexity over time due to all the potential switchings from one LDS to another,
the authors use approximate algorithms in practice, namely Viterbi, variational and
generalized Pseudo Gaussian.
The approach taken in [33] is quite original. Their dynamical model is volumetric
and consists of 50 parameters covering both positions in space (for the torso) and
angles (for the joints), along with the velocities in both cases. But they do not devise
a specifically dynamical method to learn from these variables. Focusing less on the
inference power of a learning algorithm, the authors rather emphasize
efficient ways to access prior knowledge. Instead of making costly combinations
between all training samples, they simply seek to match the one closest to the
requested motion, using a binary tree representation to store the knowledge base. The
basic idea is to use the variables as nodes to classify the samples.
Samples are, as usual, motions over a time window, which are then transformed
to a lower-dimensional form using Principal Component Analysis (PCA). Although
intrinsically linear (and thus not ideal for reducing the dimensionality of
human motion data), PCA has several advantages for this use: it is widely known
and time-efficient, its results can easily be used to decide the number of dimensions
retained (instead of a predefined choice) and, most importantly here, it ranks the
obtained variables according to their influence. This makes it possible to structure the tree in
that order, the dimension with the highest variance being the root. Figure 2.3
illustrates this. With this representation, making a prediction has a very reduced time
cost that is logarithmic in the size of the tree plus linear in the number of samples
landing in the same leaf (the authors argue they generally obtain reasonably balanced
trees). This is especially interesting of course when it is possible to generate large
knowledge bases that would require a large computational cost to cover exhaustively
at each prediction.
Figure 2.3: Sketch of the methodology proposed in [33] (image taken from it)
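To make the lookup cost discussed above more concrete, here is a minimal Python sketch of a PCA-ordered tree. It rests on assumptions of ours that the text does not confirm: each tree level is assumed to branch on the sign of one PCA coefficient, the tree is flattened into a dictionary whose keys stand for root-to-leaf paths, and all function names and the synthetic data are hypothetical.

```python
import numpy as np


def fit_pca(samples):
    """Return the mean, principal directions (rows) and variances, largest first."""
    mean = samples.mean(axis=0)
    _, s, vt = np.linalg.svd(samples - mean, full_matrices=False)
    variances = s ** 2 / (len(samples) - 1)
    return mean, vt, variances


def sign_path(coeffs, depth):
    """Path from the root: one left/right decision per leading PCA coefficient."""
    return tuple(bool(c >= 0.0) for c in coeffs[:depth])


def build_tree(coeffs, depth):
    """Group sample indices by leaf; the dict keys stand for root-to-leaf paths."""
    leaves = {}
    for index, row in enumerate(coeffs):
        leaves.setdefault(sign_path(row, depth), []).append(index)
    return leaves


def closest_sample(query, coeffs, leaves, depth):
    """Descend to the query's leaf, then search linearly among its samples."""
    candidates = leaves.get(sign_path(query, depth), [])
    if not candidates:
        return None
    distances = np.linalg.norm(coeffs[candidates] - query, axis=1)
    return candidates[int(np.argmin(distances))]


# Synthetic 6-dimensional "motions" with decreasing spread per dimension.
rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])
mean, directions, variances = fit_pca(samples)
coeffs = (samples - mean) @ directions.T
leaves = build_tree(coeffs, depth=3)
print(closest_sample(coeffs[17], coeffs, leaves, depth=3))
```

The descent costs one comparison per level, and only the samples sharing the query's leaf are compared exhaustively, which mirrors the "logarithmic plus linear in the leaf size" cost mentioned above.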
2.2 Kernel estimators in motion tracking
In the approach proposed in [5], the authors make use of kernel density estimators
to model prior knowledge. As we based our methodology primarily on this paper,
we will dedicate section 4.1 to explaining the intuition behind the formulas used here.
Also, as the term `kernel' has quite various meanings, we recommend reading point
3.3.2 to guard against possible ambiguities.
Two classes of kernel density estimators are given: isotropic and anisotropic.
Constructed on the basis of a training set composed of D-dimensional vectors
$x_1, \ldots, x_N$, each estimator can return a relative likelihood value for a requested vector
$x$. Each of these vectors contains the joint angle values of several successive poses
$\chi_t, \ldots, \chi_{t-k}$. Knowing $\chi_{t-1}, \ldots, \chi_{t-k}$, the most likely (on a local level) unknown pose
$\chi_t$ is determined by gradient ascent, according to the kernel estimator used.
Firstly, given some stochastic kernel $K_h(x, x')$, consider the isotropic kernel
estimator:

$$\hat{f}_{iso}(x) = \frac{1}{N} \sum_{i=1}^{N} K_h^{iso}(x, x_i) \tag{2.2}$$

The authors propose to use the (Euclidean distance-based) Gaussian kernel
defined as:

$$K_h^{iso}(x, x') = \frac{1}{(2\pi h^2)^{D/2}} \exp\left(-\frac{\|x - x'\|^2}{2h^2}\right) \tag{2.3}$$
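As an illustration of equations (2.2) and (2.3) and of the gradient ascent mentioned above, here is a minimal NumPy sketch. It is not the implementation of [5]: the function names, the choice of ascending on the logarithm of the density (which has the same maxima), the fixed step size and the toy training set are all assumptions made for this example.

```python
import numpy as np


def iso_kernel(x, train, h):
    """Gaussian kernel (2.3) between x and every row of train."""
    dim = train.shape[1]
    sq_dist = np.sum((train - x) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * h ** 2)) / (2.0 * np.pi * h ** 2) ** (dim / 2.0)


def f_iso(x, train, h):
    """Isotropic estimator (2.2): average of the kernels centered on the x_i."""
    return float(np.mean(iso_kernel(x, train, h)))


def grad_log_f_iso(x, train, h):
    """Gradient of log f_iso at x; undefined if x is so far that all weights underflow."""
    weights = iso_kernel(x, train, h)
    return weights @ (train - x) / (h ** 2 * weights.sum())


def predict_free_part(x_start, free, train, h, step=0.1, iterations=200):
    """Gradient ascent on the coordinates listed in `free`; the others stay fixed."""
    x = np.array(x_start, dtype=float)
    for _ in range(iterations):
        x[free] += step * grad_log_f_iso(x, train, h)[free]
    return x


# Toy 2-D training set: coordinate 0 plays the unknown pose, coordinate 1 the known past.
train = np.array([[0.9, 2.0], [1.0, 2.0], [1.1, 2.0], [5.0, 7.0]])
x_pred = predict_free_part([0.0, 2.0], free=[0], train=train, h=0.5)
print(x_pred, f_iso(x_pred, train, h=0.5))
```

Only the coordinates in `free` are updated, which mimics searching for the unknown pose while the known past poses stay fixed. The ascent finds a local maximum only, in line with the "on a local level" remark above.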
The anisotropic kernel estimator is defined as:

$$\hat{f}_{ani}(x) = \frac{1}{N} \sum_{i=1}^{N} K_i^{ani}(x) \tag{2.4}$$

Notice that unlike in the isotropic estimator, each kernel $K_i^{ani}(x)$ is not only centered
locally on $x_i$, but also defined depending on $i$:

$$K_i^{ani}(x) = \frac{1}{|2\pi\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(x - x_i)^T \Sigma_i^{-1} (x - x_i)\right) \tag{2.5}$$
Each $\Sigma_i$ is a local covariance matrix defined as:

$$\Sigma_i = \alpha I + \sum_{j=1}^{N} K_h^{iso}(x_i, x_j)(x_i - x_j)(x_i - x_j)^T \tag{2.6}$$

where $\alpha$ is an external parameter, used to regularize the obtained covariance
matrices (to prevent them from being ill-conditioned). Notice that the anisotropic
kernel estimator makes use of the isotropic kernel to compute the local covariance
matrices.
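Equations (2.4) to (2.6) can be sketched in a few lines of NumPy. This is an illustrative sketch under our own assumptions (function names, toy data, and a dense computation that stores all pairwise differences, hence only suitable for small N), not the implementation used in [5].

```python
import numpy as np


def local_covariances(train, h, alpha):
    """Local covariance matrices (2.6), one D x D matrix per training vector."""
    dim = train.shape[1]
    diffs = train[:, None, :] - train[None, :, :]  # diffs[i, j] = x_i - x_j
    sq_dist = np.sum(diffs ** 2, axis=2)
    weights = np.exp(-sq_dist / (2.0 * h ** 2)) / (2.0 * np.pi * h ** 2) ** (dim / 2.0)
    # Weighted sum over j of the outer products (x_i - x_j)(x_i - x_j)^T.
    scatter = np.einsum("ij,ijk,ijl->ikl", weights, diffs, diffs)
    return alpha * np.eye(dim) + scatter


def f_ani(x, train, covariances):
    """Anisotropic estimator (2.4) built from the kernels (2.5)."""
    diffs = x - train
    # Solve Sigma_i y = (x - x_i) for every i instead of inverting Sigma_i.
    solved = np.linalg.solve(covariances, diffs[:, :, None])[:, :, 0]
    mahalanobis = np.einsum("ij,ij->i", diffs, solved)
    _, log_det = np.linalg.slogdet(2.0 * np.pi * covariances)
    return float(np.mean(np.exp(-0.5 * mahalanobis - 0.5 * log_det)))


# Toy training set lying on a line: the local kernels stretch along that line.
train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
covariances = local_covariances(train, h=1.0, alpha=0.1)
print(f_ani(np.array([1.5, 1.5]), train, covariances))  # on the line
print(f_ani(np.array([1.5, 0.5]), train, covariances))  # off the line
```

With α = 0 the matrices of this toy set would be singular, since all differences are collinear; the αI term is what keeps the linear solves well defined here.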
3. Theoretical foundations
Although popularly every one
called a Circle is deemed a Circle,
yet among the better educated
Classes it is known that no Circle is
really a Circle, but only a Polygon
with a very large number of very
small sides.
Edwin A. Abbott, Flatland
3.1 Geometrical concepts
This section is directly inspired by the passages of [24] corresponding to the curse
of dimensionality (point 1.2.2 in the reference) and to the estimation of the intrinsic
dimension (chapter 3 in the reference). That said, the examples and formulas we
mention here are also presented in the same way in other sources.
3.1.1 Curse of dimensionality
The so-called “curse of dimensionality” appears in applied mathematics to describe
the (undesired) effects appearing when the number of dimensions of a given problem increases,
all other things being equal. In particular, the problem can be viewed as our inability
to efficiently cope with the exponential increase of the (hyper-)volume¹ of a considered
object.
The classical example used to illustrate this e�ect is the volume of a sphere relative
to its circumscripted cube, when both are de�ned within a space of dimensionality
D. We can de�ne their volumes depending on the radius r of the sphere:
Vsphere(r) =πD/2rD
Γ(1 +D/2)(3.1)
Vcube(r) = (2r)D (3.2)
where Γ(n) denotes the gamma function (for the sake of the argument, let us
consider it simply as a generalized factorial). The ratio Vsphere/Vcube is thus:
$$\frac{V_{sphere}(r)}{V_{cube}(r)} = \frac{\pi^{D/2}\, r^D}{\Gamma(1 + D/2)\,(2r)^D} = \frac{\pi^{D/2}}{\Gamma(1 + D/2)\, 2^D} \tag{3.3}$$
What happens when D becomes larger and larger? The limit of equation 3.3 when D → ∞
equals 0. This means that the volume occupied by the sphere tends to become negligible
compared to the cube containing it, as the dimensionality increases! In other words,
the core of the cube becomes less and less representative of the whole object,
the main volume being concentrated in the exponentially many corners.
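To get a feeling for how fast this ratio vanishes, equation 3.3 can be evaluated
numerically. The short Python sketch below is only an illustration; the function name is
an assumption.

```python
import math

def sphere_to_cube_ratio(D):
    """Ratio of equation 3.3; the radius r cancels out."""
    return math.pi ** (D / 2) / (math.gamma(1 + D / 2) * 2 ** D)

for D in (1, 2, 3, 5, 10, 20):
    print(D, sphere_to_cube_ratio(D))
```

For D = 2 the ratio is π/4 and for D = 3 it is π/6, the familiar values for a disc in a
square and a ball in a cube.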
If we transpose this to the case of highly multivariate isotropic (symmetrical
in all dimensions) normal distributions, we can observe that the density tends to
spread out in all dimensions. Although a given point A is always exponentially more
likely than a given point B farther from the mode, the region near the mode (say,
within one standard deviation σ) accounts for a smaller and smaller part of the whole
distribution as D goes up. Being somewhere in the tails then becomes more
likely than landing near the center (which is clearly not so in the univariate case).
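This effect can be checked empirically. The following Python sketch estimates, by simple
Monte Carlo sampling, the share of a standard isotropic normal distribution that lies
within distance σ = 1 of the mode. The function name, sample size and seed are
assumptions made for the illustration.

```python
import random

def mass_within_one_sigma(D, n_samples=20000, seed=0):
    """Monte Carlo estimate of P(||x|| <= 1) for x ~ N(0, I_D)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        sq_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(D))
        inside += sq_norm <= 1.0
    return inside / n_samples

for D in (1, 2, 5, 10):
    print(D, mass_within_one_sigma(D))
```

In the univariate case the estimate is close to the well-known 68%, whereas for D = 10
almost no sample falls within one σ of the center.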
What are the practical consequences of this phenomenon? Firstly, desirable properties
that we pragmatically take as being only marginally violated in low-dimensional
spaces may degrade severely in higher dimensions. Secondly, if we don't possess
strong prior knowledge of the object, studying it locally will prove to be exponentially
more difficult. In particular, when having only limited data at our disposal,
our knowledge of the whole problem risks losing global relevance. These issues are
quite critical in the sense that there is no easy workaround to avoid them.
¹ From now on, we will take the liberty to leave out the prefix "hyper" in order to improve
readability.
Figure 3.1: 3rd and 5th order approximations of the Hilbert curve
3.1.2 Intrinsic dimensionality
A nuance is to be made regarding the conclusions of the previous point. The curse
of dimensionality is indeed structural to the real dimension of the object considered,
and there is nothing to do against it. Nevertheless, we should bear in mind that
the important word here is real. Let's suppose we analyze a data set X, each
element x_i of which is formed of several variables x_i1, x_i2, ..., x_iD. If we wish to
extrapolate a continuous geometrical set from it, then the dimensionality which
intuitively appears adequate to us is simply D, the number of variables. But there may be
redundancies, even partial ones, among them (in a statistical context, the variables are
not independent), and thus a dimension d < D would actually already be sufficient.
The intrinsic dimension can be naively thought of as the minimal number of latent
variables which are necessary to describe the considered set completely. However,
fractal objects put this definition into difficulty: consider for instance the Hilbert
curve (see figure 3.1), which covers the space of a square, although the points on a curve
can be described by a single parameter. The Lebesgue covering dimension (also called
topological dimension) is in that case more theoretically sound. But when dealing
with limited data, it might be easier to use one of the fractal dimensions instead.
Another argument in favor of the fractal dimensions is that they do not restrict
themselves to integer numbers, which may perhaps seem a bit absurd at first glance,
but actually characterizes the nature of finite data sets much better.
There are several fractal dimensions, which actually vary the free parameter
q of the generalized q-dimension (q does not itself represent a dimension, but only
determines how to estimate one), such as the capacity, information and correlation
dimensions. Let us just consider here the capacity dimension to illustrate the data-driven
quality of these approaches. We can circumscribe the given data set within a
hypercube, and then split it into a grid of smaller hypercubes of edge length ε. Let
N(ε) be the number of those hypercubes that contain at least one data point.
The capacity dimension is then defined as:
$$d_{cap} = -\lim_{\varepsilon \to 0} \frac{\log N(\varepsilon)}{\log \varepsilon} \tag{3.4}$$
The intuition behind this formula is to cover the space that is really
occupied with increased precision, as ε approaches 0. This is very similar to digital
images constituted of pixels (or voxels in 3D): once the resolution is set high enough,
there is no more visible difference with the original.
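With a finite data set the limit of equation 3.4 cannot be taken literally. A common
practical substitute, used in the sketch below, is to measure the slope of log N(ε) against
log(1/ε) between two box sizes. The function names, the two values of ε and the toy data
are assumptions made for this illustration.

```python
import math
import random

def count_occupied_boxes(points, eps):
    """N(eps): number of grid cells of edge eps holding at least one point."""
    return len({tuple(math.floor(c / eps) for c in p) for p in points})

def capacity_dimension(points, eps_small, eps_large):
    """Finite-data stand-in for equation 3.4: slope of log N(eps) against log(1/eps)."""
    n_small = count_occupied_boxes(points, eps_small)
    n_large = count_occupied_boxes(points, eps_large)
    return (math.log(n_small) - math.log(n_large)) / math.log(eps_large / eps_small)

rng = random.Random(0)
# A straight segment described by D = 3 variables, but intrinsically 1-dimensional.
segment = [(t, 2 * t, -t) for t in (rng.random() for _ in range(20000))]
print(capacity_dimension(segment, eps_small=0.01, eps_large=0.1))
```

Although each point has three coordinates, the estimate stays close to 1, which matches
the single latent variable t that generates the data.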
3.2 Machine Learning concepts
The field of Machine Learning is dedicated to the task of designing programs able
to adjust their behavior automatically, provided they receive adequate training. The
commonly accepted definition [25] states:
A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E.
3.2.1 Inductive bias
In the above-quoted definition, one may wonder how far the tasks T could deviate
from available experience E. In the extreme case where the tasks consist of a predefined
set of requests exactly matching the complete experience, a database system
could actually be considered as a learning program (as filling it will indeed make it
more precise)! The human equivalent of this would be a passive pupil strictly reciting
his lesson, without making any effort to think about the meaning of the words
he takes from his memory. Obviously, those so-called rote-learners are not what
we expect from Machine Learning. A program attempting to recognize people by
face cannot realistically assume that the input image could directly be found in the
knowledge base, far from it! Briefly put, solely accumulating experience cannot be
a purpose in itself, as the benefits of learning are then limited to what was already
at our disposal in the first place. Instead, we expect our program to generalize what
it gets from its experience to a much broader set of examples.
At that point, it should be noted that, for all its merits, generalization by induction
is not at all in accordance with formal logic. Indeed, consider for instance,
having X ⊂ Y:
$$\frac{\forall x \in X,\ P(x)}{\forall y \in Y,\ P(y)} \tag{3.5}$$
This claim clearly does not hold. This should not be a surprise, however, as
induction is actually the counterpart of deduction: reversing the inference in 3.5 makes it
logically correct. Nevertheless, pragmatically speaking, we could treat the conclusion
on Y as at least plausible, provided we strongly believe that X is representative
of Y with regard to property P. Philosophers ardently discussed the validity of
inductive arguments, and their status inside science itself. Popper even argued
that statements must necessarily be falsifiable to be called scientific! And indeed, in
a strictly deductive world, we would never learn any rule from observation, leading to
a very poor understanding of our universe. In this sense, induction has the strength
of its weaknesses.
Inductive bias is maybe the most fundamental characterization of any Machine
Learning technique. It is, so to speak, the criterion the algorithm uses to arbitrate
the choice between several contradictory hypotheses. Two levels of biases must be
distinguished: restriction and preference [16]. In the first case, we assume some predefined
properties of the general rule, and therefore discard a whole set of hypotheses
that will never be considered, even if the data strongly suggests one of them. In the
second case, more data-driven, we follow some guiding criteria through the complete
hypothesis space.
These two kinds of biases will be exemplified in our context in point 3.3.1, by
the distinction between parametric and non-parametric density estimators.
3.2.2 Training and test sets
The experience mentioned in the above definition does not imply a direct interaction
of the learning program with its environment. In most cases, we rather
provide the data ourselves, after having measured it independently. This is called
the training set, or in other words the knowledge base of the learning algorithm.
The model built from that data can be explicit (e.g. a decision tree), or implicit
(the so-called lazy learners). But even in the explicit case, the quality of the model
cannot be evaluated objectively without comparing the predictions made to a
group of already well-established results. This is called the test set.
There are at least two critical conditions which need to be fulfilled, if we want
to guarantee the significance of the results. First of all, the two mentioned sets need
to be determined independently of each other. Should we fail to respect this,
learners overly confident in the training set (this is called overfitting) would be
judged favorably, but their predictive power would actually be biased by the training
set, and still not cover the whole domain we are interested in! To understand why,
one has to remember the nature of inductive reasoning (see point 3.2.1). The more
we believe the examples at our disposal have the value of a fixed rule for all
possible situations, the more "dogmatic" and categorical the results will be, while being
in accordance with our (limited) experience. As previously said, this is necessary to
generalize, but we need the test set to be totally independent and unbiased toward
the incomplete clues suggested by the training set.
This leads us to the second fundamental condition, namely a reasonable proportion
of training and test sets among the available data. It should be fairly clear that the
bigger the training set, the less overfitting there will be, and similarly, the bigger
the test set, the more reliable and statistically significant the evaluation will be.
Unfortunately, most of the time, we have rather limited data already determined
(relative to the hypothesis space), and that's precisely the reason why we would
like the learning program to do it for us automatically! When dealing with a critically
low amount of data, priority should be given to the test set, even if this consequently
degrades the performance of the learner confronted with an even smaller training set. A
useful trick in this situation is to perform a K-fold cross-validation, see point 3.2.5.
On top of that, it is also plausible that even the data vetted by us is not
perfectly reliable: in other words, that it contains approximations or even strict
logical contradictions. For the training set, depending on the learner used, noise
will affect performance in different proportions. In general, statistically-based
approaches show more robustness to noise than logic-based ones. For the test set,
this also impacts the evaluation's variability, in addition to the above-mentioned
potential non-representativity of the test set.
In our case, we get reference data about the pose directly from marker-based motion
capture. This ensures that we remain independent of the results obtained
with image-based reconstruction techniques which, even if the optical conditions are
optimal, are still more prone to noise than poses determined by solving the inverse
kinematics problem on the quite precise positions of the markers. Nevertheless, noise
remains an issue, especially when pose variation is being considered between short
temporal sequences (consecutive frames). In the field of human motion tracking,
building marker-based reference data is designated by the term ground truth (see
e.g. [26]).
3.2.3 Supervised learning
Another important aspect of Machine Learning is whether the training is supervised
or not. Roughly put, it is about the labeling of the data we already possess. If
we give the algorithm a corresponding result on each training sample, and then
expect it to predict the same type of result on the subsequent unlabeled instances
of the test set, it is called supervised learning. In this case, we determine ourselves
which variable, or group of variables, should be considered as the result, and thus
we implicitly establish a causality chain. Many supervised algorithms consider only
discrete output values, typically class names and not numbers (the algorithm is then
called a classifier), but this is not a sine qua non condition, and the output value
could also be a vector of continuous values for instance.
In contrast to this, unsupervised learning abstracts the data as a whole, and is
more interested in discovering the underlying structure, without making assumptions
about the causality links between the variables. The paradigmatic examples of this
class are the clustering algorithms, where we only decide on the number of anonymous
data classes (the so-called clusters) we wish to obtain in the end.
The supervised nature of our learner is perhaps not obvious at first glance. Indeed,
kernel density estimation is considered to be linked to unsupervised learning, as
guessing the general probability distribution of the data is clearly an unsupervised
approach. But we are not interested in the distribution in itself, rather we use it as
a tool to guide prediction. See point 4.3.1 for more details about the structure of the
training samples.
3.2.4 Gradient ascent
The technique of gradient ascent is not a concept of Machine Learning strictly speaking,
but several ML algorithms rely more or less heavily on it, most notably the
famous Artificial Neural Networks. In general, gradient ascent (resp. descent)
makes it possible to find local maxima (resp. minima) of a function whose general
shape is unknown. This methodology does not suffer from the pitfalls of more naive
approaches, like building a whole grid of the function domain, which could be both
expensive to compute and imprecise (as this grid is necessarily discrete, the real maximum
will very probably lie between the meshes of the net). Starting with a more
or less arbitrary point, gradient ascent will converge toward a peak on the function
surface.
Figure 3.2: Missing global maximum (red is high, blue low)
Mathematically speaking, the gradient ∇f of a function f is the vector of all
partial derivatives of f. If we evaluate it at a given point, the gradient determines
the direction of steepest ascent, or in other words, an arrow toward the locally most
likely candidate for a maximum. The length of the step to be made in that direction is
significant, as steps that are too big might "jump over the hill" so to speak, whereas
minimalist steps will slow down convergence. For a sufficiently smooth function, applying
gradient ascent iteratively converges toward a local maximum, provided the step
size decreases in a meaningful manner over time. It is important to note the word
local: ideally, we would like to find a global maximum, which is unfortunately a much
more difficult task. Depending on the chosen starting point, we might thus land on
local maxima of varying quality. Figure 3.2 shows an example of this. Starting with
point x0, gradient ascent will attain local maximum lm after 5 iterations. Global
maximum gm is unfortunately not discovered. If x0 was situated somewhere else
(not necessarily closer to gm), it might have been the case though.
In our implementation, we use gradient ascent to find a local maximum on the
surface of the density estimation. See point 4.3.4 for more details.
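The dependence on the starting point can be reproduced with a small Python sketch. It is
a generic illustration, not the routine of point 4.3.4: the function names, the step size,
its decay factor and the toy function with two peaks are all assumptions.

```python
import math

def gradient_ascent(grad, x0, step=0.1, decay=0.99, n_iter=500, tol=1e-12):
    """Follow the gradient from x0 with a slowly decreasing step size."""
    x = list(x0)
    for _ in range(n_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) < tol:   # gradient (almost) zero: stop
            break
        x = [xi + step * gi for xi, gi in zip(x, g)]
        step *= decay
    return x

def f(x):
    """Toy function: global peak near x = 2, lower local peak near x = -2."""
    return math.exp(-(x[0] - 2) ** 2) + 0.5 * math.exp(-(x[0] + 2) ** 2)

def grad_f(x):
    """Analytical derivative of f."""
    return [-2 * (x[0] - 2) * math.exp(-(x[0] - 2) ** 2)
            - (x[0] + 2) * math.exp(-(x[0] + 2) ** 2)]

from_left = gradient_ascent(grad_f, [-1.5])   # ends on the local maximum
from_right = gradient_ascent(grad_f, [1.0])   # ends on the global maximum
print(from_left, f(from_left), from_right, f(from_right))
```

Both runs converge, but only the second one reaches the higher peak, exactly as in the
situation of figure 3.2.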
3.2.5 Cross-validation
Cross-validation is a methodology which is often employed in Machine Learning. Its
core principle is to extract a subset from the data available as training set to form a
distinct entity, called the validation set. This is done several times, each time with
a new selection to form the validation set, in order to guarantee its representativity
and neutrality relative to the remaining training set. Especially in the case where
data is limited, and where building a significant test set would also directly impact
the quality of training itself, it may be interesting to treat each validation set exactly
as a small test set (speaking of "cross-testing" might actually be more accurate
then). Otherwise, in the classical usage, the validation set is directly used to improve
performance, and must therefore remain strictly separated from the test set as well.
The general idea behind this third set is to act as an interactive feedback: instead
of just judging the final model to assess its quality (test set), it evaluates the current
model proposed, while remaining independent of the current training data itself. Of
course, learning algorithms already select the model that they consider to be the
"fittest". Nevertheless, many of them require some tuning parameters to be fixed
from outside, the so-called hyper-parameters. Choosing their values manually at
first guess might be largely suboptimal, so a good practice is to seek the best
hyper-parameters with cross-validation. When confronted with multiple hyper-parameters,
there is of course no guarantee that taking the best ones on an individual scale (that
is, by letting only one vary at a time) will bring the best combination overall, as they
are unlikely to impact performance independently, so in theory all combinations
should be considered. However, in some cases, the time overhead might become
truly prohibitive.
A quite generic way of extracting several validation sets from the base training
set is to partition it into K folds of the same size. This makes it possible to cover the
whole base training set with no overlap between the successive validation sets. The most
complete form of K-fold cross-validation is leave-one-out cross-validation, where K is
simply the size of the base training set. Each element of the training set is thus removed
in turn to act as a singleton validation set. The disadvantage of this method is of
course its very high computational cost, especially with big training sets.
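The partitioning just described can be sketched in a few lines of Python. The function
names and the `fit` and `score` callables are hypothetical placeholders for whatever
learner and performance measure are being tuned.

```python
def k_fold_splits(n_samples, k):
    """Yield (training_indices, validation_indices) for K folds of near-equal size."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

def cross_validate(data, k, fit, score):
    """Average validation score over the K folds."""
    total = 0.0
    for train_idx, val_idx in k_fold_splits(len(data), k):
        model = fit([data[i] for i in train_idx])
        total += score(model, [data[i] for i in val_idx])
    return total / k

# Leave-one-out is the special case where k equals the number of samples.
for training, validation in k_fold_splits(5, 5):
    print(training, validation)
```

To tune a hyper-parameter, one would call `cross_validate` once per candidate value and
keep the value with the best average score.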
3.3 Kernel density estimation
3.3.1 Non-parametric density estimation
Basically speaking, density estimation is the inverse operation of drawing samples
from a given probability distribution. There is a major difference, though. Sampling
from a probability distribution yields unpredictable results, yet can be done in all
neutrality. This is much more difficult to achieve when we deal with limited data and seek
to reconstitute the original distribution underlying it. There are an infinite number
of potentially matching distributions, but of course they are not all equiprobable,
that is, many could have generated the data but are very unlikely to have done so.
To help us discriminate between the candidate distributions, we need to define some
sort of inductive bias.
A first way to resolve the ambiguity is to make a strong assumption about the
sought distribution, fixing everything but its parameters. For instance, in a
univariate setting, we might assume our data follows an exponential distribution
λe^(−λx). We then seek the parameter λ for which the obtained data is the best fit.
This is called parametric density estimation. Although choosing the appropriate
function seems to be rather arbitrary if we do not possess prior knowledge external
to the available data itself, the estimation will already be precise with limited data if
our "bet" is right. On the other hand, if this condition is not fulfilled, the estimation
will always be wrong, even with infinite data at our disposal. In this sense, the
employed inductive bias is restrictive (see end of point 3.2.1).
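For the exponential example, one usual way of defining the "best fit" is maximum
likelihood, which here has the closed form λ = N / Σ x_i. The Python sketch below assumes
this criterion; the function name, the seed and the true value of λ are illustrative.

```python
import random

def fit_exponential(samples):
    """Maximum-likelihood estimate of lambda for the density lambda * exp(-lambda * x)."""
    return len(samples) / sum(samples)

rng = random.Random(0)
true_lambda = 2.0
samples = [rng.expovariate(true_lambda) for _ in range(1000)]
lambda_hat = fit_exponential(samples)
print(lambda_hat)
```

A single number summarizes the whole density here, which is why so few samples are needed
when the exponential assumption is right, and why no amount of data helps when it is wrong.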
A second approach, namely non-parametric density estimation, is to rely primarily
upon the data, without constraining it into a pre-established framework. This
methodology has the big advantage of automatically converging toward the true
distribution as the number of data samples grows. The classical example is the histogram
estimator. It consists in dividing the domain covered by the samples into a
grid of bins with length ε, and then counting the number of samples falling into each
of those bins. Finally, the overall sum is normalized to 1 (to respect the definition of
density), and we obtain quite simply a discrete estimate of the underlying distribution.
This intuitive technique leads by itself to the true distribution when ε → 0
and the number of samples N → ∞.
Figure 3.3: Estimators receiving samples drawn from the standard normal: the first three
plots are histograms, the fourth plot is a parametric estimator for normal distributions
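A minimal Python sketch of the histogram estimator follows. The names, the seed and the
bin layout (bins aligned on multiples of ε) are assumptions. Counts are divided by N·ε so
that the bar areas, rather than the bar heights, sum to 1.

```python
import math
import random

def histogram_density(samples, eps):
    """Return {bin_index: density}; bin k covers the interval [k*eps, (k+1)*eps)."""
    counts = {}
    for x in samples:
        k = math.floor(x / eps)
        counts[k] = counts.get(k, 0) + 1
    n = len(samples)
    return {k: c / (n * eps) for k, c in counts.items()}

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(10000)]
density = histogram_density(samples, eps=0.2)
print(density[0])   # estimate on [0, 0.2), to compare with the normal density near 0
```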
Of course, in practice, we cannot just fix ε to a low number when data is not
sufficient to remain representative at a very local bin level. On the other
hand, the bigger ε is, the rougher the resulting distribution will be. In other words,
ε remains a hyper-parameter (another equivalent hyper-parameter would be k, the
number of bins). This does not contradict the adjective non-parametric, as ε is not
the parameter of any fixed density, but rather inherent to the methodology itself.
Figure 3.3 summarizes all this. The histogram with 1000 samples is either coarse or
chaotic. But with sufficient samples and a low ε, we get a much better estimation.
In contrast, the parametric estimator needs only 1000 samples to generate a very
precise approximation, but bear in mind that this works only if the true distribution
is indeed a normal.
Within a non-parametric setting, a first disadvantage of employing a histogram
is that its convergence to a smooth estimation is particularly slow. Its inductive
bias lets it prefer discrete distributions, whereas we might very well expect natural
distributions to be continuous functions. A second point which could be improved
is that the bins only approximate the local influence of each data sample: whatever
the exact position of the sample inside the bin, the result will be the same. Kernel
density estimation is a non-parametric technique which avoids these limitations.
3.3.2 Kernel definition
What is a kernel? The word has several meanings, depending on the context.
We will restrict ourselves here to stochastic kernels of order 2, as defined in [38]. In
this definition, a kernel K(u) is a function over ℝ such that:
1. $K(u) = K(-u),\ \forall u$ (symmetry)
2. $\int_{-\infty}^{+\infty} K(u)\,du = 1$ (valid density)
3. $\int_{-\infty}^{+\infty} u^2 K(u)\,du \neq 0$
The third condition is not as straightforward to interpret as the two others, and
comes from the general kernels of order k. Fortunately, for the case of k = 2, it
only eliminates specific cases among those that allow negative values for the function
K(u). If we stick to kernels that can be interpreted as probability distributions, we
can thus simply ignore it (the Wikipedia entry on stochastic kernels does not even
mention it).
Figure 3.4: Epanechnikov kernel using various univariate conversions, from darker
to lighter: Euclidean norm, Manhattan distance, ∞-norm, multiplicative kernels
How can such a function be used for density estimation? The idea is to center a
kernel on each data sample, and then to combine them as in a mixture distribution.
Kernels being themselves probability densities, the only but important difference
with a mixture distribution is that the number of base distributions is not held fixed,
but varies with the sample size, hence the non-parametric nature of the estimator.
For the moment, let us still limit ourselves to the univariate case. This basic kernel
estimator of a true density f(x), given a sample x_1, ..., x_N of data, is computed as:
$$\hat{f}(x) = \frac{1}{N} \sum_{i=1}^{N} K(x - x_i) \tag{3.6}$$
Additionally, we can generalize the chosen kernel density with an important external
parameter h, called the bandwidth. Given a kernel K(u), we can define
K_h(u) = K(u/h)/h. At this point, the bandwidth h may seem to only have a minor
scaling effect, but as we will see later on, its impact on the estimator obtained is
actually quite crucial.
Name            K(u)
Uniform         (1/2) I(|u| ≤ 1)
Triangle        (1 − |u|) I(|u| ≤ 1)
Epanechnikov    (3/4) (1 − u²) I(|u| ≤ 1)
Gaussian        (1/√(2π)) e^(−u²/2)
Table 3.1: Some common kernel types
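The kernels of table 3.1 and the estimator of equation 3.6, with K replaced by
K_h(u) = K(u/h)/h, translate almost literally into code. The Python sketch below is
illustrative only; the function names and the sample values are assumptions.

```python
import math

def uniform(u):
    return 0.5 if abs(u) <= 1 else 0.0

def triangle(u):
    return 1 - abs(u) if abs(u) <= 1 else 0.0

def epanechnikov(u):
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def gaussian(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, samples, kernel=gaussian, h=1.0):
    """Equation 3.6 with the scaled kernel K_h(u) = K(u / h) / h."""
    return sum(kernel((x - xi) / h) for xi in samples) / (len(samples) * h)

samples = [-1.2, -0.8, 0.1, 0.9, 1.1]
for kernel in (uniform, triangle, epanechnikov, gaussian):
    print(kernel.__name__, kde(0.0, samples, kernel, h=0.5))
```

Dividing by h inside `kde` is what keeps the estimate a valid density whatever the
bandwidth: a narrower kernel has to be proportionally taller.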
In the general case of multivariate data, we deal with vectors $\mathbf{x}$ and $\mathbf{x}_i$
instead of scalars $x$ and $x_i$. In order to use (univariate) kernels in this setting, an
approach is to use a norm on the difference between the vectors $\mathbf{x}$ and $\mathbf{x}_i$
to determine the single variable u. It is also possible to take the product of the kernels
evaluated on each single variable difference. Notice this is not the same as a single kernel
on the product of the variables' absolute differences!
Generally, as in the isotropic kernel of equation 2.3, the Euclidean norm is implicitly
chosen. But other types of norms can of course be used. For instance,
the anisotropic kernel of equation 2.5 uses the Mahalanobis distance
$d(\mathbf{x},\mathbf{x}') = \sqrt{(\mathbf{x}-\mathbf{x}')^T S^{-1} (\mathbf{x}-\mathbf{x}')}$
as defined between two random vectors (the square root disappears as u is itself squared).
There are also the Minkowski distances or p-norms, that are defined as:
$$d_p(\mathbf{x},\mathbf{x}') = \left(\sum_{i=1}^{D} |x_i - x'_i|^p\right)^{1/p} \tag{3.7}$$
Figure 3.4 illustrates the Epanechnikov kernel (see table 3.1) in the bivariate
case. In the first three kernels, data is converted to univariate using the most classic
Minkowski distances: 1-norm (Manhattan), 2-norm (Euclidean) and ∞-norm (the
limit of the p-norm when p → ∞). The fourth kernel is obtained with the product
of the two univariate kernels. Overall, the Euclidean distance generates the most
circular kernel.
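The two conversion strategies can be compared in a short Python sketch. The function
names are assumptions, and the norm-based kernels are left unnormalized here: after the
conversion they would need a dimension-dependent scaling to integrate to 1.

```python
import math

def epanechnikov(u):
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def minkowski(x, y, p):
    """Equation 3.7; p = math.inf gives the limit case (largest absolute difference)."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

def norm_kernel(x, xi, p):
    """Univariate kernel applied to a p-norm of the difference (unnormalized)."""
    return epanechnikov(minkowski(x, xi, p))

def product_kernel(x, xi):
    """Product of univariate kernels, one per variable difference."""
    result = 1.0
    for a, b in zip(x, xi):
        result *= epanechnikov(a - b)
    return result

x, center = (0.5, 0.5), (0.0, 0.0)
for p in (1, 2, math.inf):
    print(p, norm_kernel(x, center, p))
print("product", product_kernel(x, center))
```

At the point (0.5, 0.5) the Manhattan version is already zero, whereas the other three
still give positive values, which reflects the different supports of the kernels.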
Among the functions that respect the properties defined above, it is interesting
to cite the kernels of table 3.1, where I(•) is the indicator function (1 if true, 0 if
false). These kernel types are illustrated in figure 3.5. There are other common types
of kernels, but they don't differ much from the ones presented here. The mode, if
any, will always be centered on 0 (remember it is the position of the sole currently
considered data point), and from there the kernel will monotonically decrease on
both sides. The Gaussian kernel is particular though. Unlike the others, it extends
infinitely instead of being cut just after the bandwidth limit. Although the quantity
soon becomes negligible (exponential decrease), it is not the same as 0, and this
information is valuable when comparing two points falling outside the range of the
kernels. It should nonetheless be noted that, like the histogram, all kernels will
eventually converge toward the true distribution when given enough samples (and
setting the bandwidth h close to 0, as for ε).
Figure 3.5: Common kernel types
3.3.3 Bandwidth selection
The choice of h is a tricky one. This is because we are typically confronted with
rather limited data, from which we wish to generalize without losing too much focus
on what we know. Values of h that are too low will generate erratic variations in the
estimated density, while values that are too high will flatten it toward a uniform
distribution.
From a purely visual point of view, a human experimenter might very well be
tempted to increase h until obtaining a smooth surface, pretty to the eye and looking
more natural. This is alas not a very convincing criterion, without even having
to consider its subjective character. The problem is not to assume that the true
distribution should be relatively smooth (it is a fairly natural rule, provided no
hidden variable is present), but rather to assume that we could obtain an estimate from
limited data that is at the same time smooth and precise.
Figure 3.6: Pitfalls of good-looking curves (100 observed points, h = 0.1 and 1.5
respectively; figure idea taken from [38])
Figure 3.6 illustrates this psychological effect. If we forget about the (normally
hidden) blue reference function, the right red curve looks much more like a respectable
distribution than the left one, but it actually fails to detect the bimodal nature of
the blue curve. Conversely, we should not expect numerical wonders from
low values of h either: as for the histogram, the increased "precision" is gained relative
to the available data, not to the true distribution.
This dilemma can be interpreted in terms of notions related to Machine Learning.
Bandwidth values that are too high correspond to underfitting (too simple a model), and
values that are too low to overfitting (relying exaggeratedly on the known samples). In
[31], this is presented as the trade-off between bias (high h) and variance (low h)
relative to the sought density.
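The two failure modes can be made tangible by counting the local maxima of a Gaussian
kernel estimate for several bandwidths. The Python sketch below uses synthetic bimodal
data; the sample, the grid and the three values of h are assumptions chosen for the
illustration and are not the data of figure 3.6.

```python
import math
import random

def gaussian_kde(x, samples, h):
    """Univariate Gaussian kernel density estimate at x."""
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)
    return total / (len(samples) * h * math.sqrt(2 * math.pi))

def count_modes(samples, h, lo=-4.0, hi=4.0, n_grid=801):
    """Number of strict local maxima of the estimate on a regular grid."""
    xs = [lo + (hi - lo) * k / (n_grid - 1) for k in range(n_grid)]
    ys = [gaussian_kde(x, samples, h) for x in xs]
    return sum(1 for k in range(1, n_grid - 1) if ys[k - 1] < ys[k] > ys[k + 1])

rng = random.Random(0)
# 100 points drawn from two well-separated normal components.
samples = ([rng.gauss(-1.0, 0.4) for _ in range(50)]
           + [rng.gauss(1.0, 0.4) for _ in range(50)])
for h in (0.05, 0.4, 1.5):
    print(h, count_modes(samples, h))
```

A very low h produces many spurious peaks that follow individual samples, while a very
high h merges the two components into a single bump and hides the bimodality.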
In fact, there is no universally accepted way to define the best h overall for a
given data set. There are, however, ways to determine an optimal value relative to
an error criterion through mathematical analysis, but the best performance criterion
itself is not universally agreed upon! For instance, the commonly accepted Integrated
Square Error (ISE) and Mean Integrated Square Error (MISE) will choose different
bandwidth values as the best one. See [38] for a complete discussion of the subject,
and other criteria of optimality.
Depending on what we want exactly, we will thus choose different ways to find
a good h. In our case, the criterion is not so much to approximate the true distribution
as a whole as closely as possible, but rather to ensure that the true local
maxima look similar to the estimated ones, which is a very different thing.
4. Chosen methodology
Taking nine Squares, each an inch
every way, I had put them together
so as to make one large Square,
with a side of three inches, and I
had hence proved to my little
Grandson that - though it was
impossible for us to see the inside of
the Square - yet we might ascertain
the number of square inches in a
Square by simply squaring the
number of inches in the side.
Edwin A. Abbott, Flatland
4.1 Preliminary analysis
We will explain here the overall intuition of [5] sketched in 2.2. The idea is to build a
kernel density estimator not for itself, but as a tool to guide prediction. Concretely,
body poses are modeled through angles at each joint, the length of the limbs being
held fixed. For a given frame in a sequence, this gives us a vector, each variable
having the value of the corresponding joint angle. If we want to get a better idea of a
certain type of movement, we could build a kernel density estimator based on a
group of frames where the joints are already known through some other means (the
training set). The resulting density would model the relative likelihood of an external
pose, as belonging to the same distribution as the elements of the sample used as
data set.
This would however only help to estimate the chances for a pose to
belong to one type of movement or another (in other words, classifying it), whereas we
actually want to go a step further, and predict the next pose when knowing the past
ones. Obviously, we then need to take time into account.
For this task, the training set must also contain information about the past.
The vectors corresponding to each pose are then no longer composed solely of the
current values of the joints, but also contain the values of the joints for some fixed past
frames. The values are simply juxtaposed, the actual order having no importance.
For instance, to convert a frame t into a vector of the training set, one could retain
the values of frames t, t − 1 and t − 2. The kernel density then represents the relative
likelihood of the sequence of frames as a whole. Poses that are maybe realistic taken
individually, but that do not make a coherent succession, will thus be seen as unlikely.
Notice that taken as such, there is no explicit causality link following a chronological
order (we stay at an unsupervised level up to this point).
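The construction of such training vectors can be sketched as follows in Python. The
function name, the default lags and the toy two-joint poses are assumptions made for
illustration; the actual structure of the training samples is detailed in point 4.3.1.

```python
def stack_frames(poses, lags=(0, 1, 2)):
    """Juxtapose the joint angles of frames t - lag, for every usable frame t."""
    max_lag = max(lags)
    return [
        [angle for lag in lags for angle in poses[t - lag]]
        for t in range(max_lag, len(poses))
    ]

# Toy sequence of 5 frames with 2 joint angles each.
poses = [[t, 10 * t] for t in range(5)]
for vector in stack_frames(poses):
    print(vector)
```

The first two frames produce no vector of their own, since they lack the required past
frames.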
Now, the point is, we have one of those successions of frames at hand, for which
we know the past poses, but not the current one (as it is precisely the one we would
like to predict). A trick would be to propose various predictions for the current
pose, evaluate the kernel density estimator on each vector taken as a whole (with the
fixed past poses), and retain the most likely combination of current and past poses.
Of course, the tried current pose configurations must not be determined arbitrarily.
Gradient ascent can be used to achieve this in a systematic way. This is basically
the core principle of the methodology. Section 4.3 explains it much more in detail.
In the context of point 3.3.2, this corresponds to a Gaussian kernel, with the
Euclidean distance as criterion to obtain parameter u of the univariate kernel (and
the according scaling to make it a valid density). Isotropic means being equivalent
in all directions. In other words, the bandwidth is applied to all dimensions with
the same strength. However, we do not expect at all our data to have an equivalent
variance in all dimensions. The isotropic kernel would thus require a large training
set and a low bandwidth to follow the true distribution on a local level. Could we
capture the variances with fewer training elements?
Well, the already mentioned Mahalanobis distance is precisely based on a covari-
ance matrix. We could theoretically just use the sample covariance matrix, but we
would like the bandwidth to be adaptive not only with respect to the different dimensions,
but also at a local level. By this, we mean that dense regions
should have a low bandwidth h to make good use of the precision offered there by
the training set, whereas sparsely populated regions should have a larger h to cover
smoothly the rareness of information known, instead of a rigid global compromise.
For this, we need a local covariance matrix for each element of the training set. We
can use the isotropic kernel itself as a way to weight the relative influence of the
other training set points for the covariance matrix of the current training set point
(this is also to allow us to conserve h as a scaling parameter).
We implemented the whole system in Matlab. Additionally, in order to reduce computation
time, we also implemented a C version of the functions that dominate the
complexity (see section 4.6). This brought us a speed-up of about ×4, depending
on the context.
4.2 Kernel reconstruction of artificial examples
4.2.1 Visualization
No realistic human model can simply be based on a low number of Degrees of Free-
dom (DoF). This poses an issue of visualization. It remains quite easy to observe the
behavior of the predictions made by the system when applied to a 3D virtual jointed
model. But analyzing the shape and behavior of the underlying kernel density estimation
with more than 3 variables is pretty difficult for human experimenters. In
order to have a direct and concrete feeling of how kernel estimation itself behaves,
we first restricted ourselves to artificial true distributions in dimensions 2 to 4.
We implemented a GUI in Matlab to ensure this straightforward interaction with
the estimations themselves, and the influence of their hyperparameters. As such, this
Figure 4.1: Isotropic kernel estimation of a uniform spiral (1000 observed points)
toolbox is conceptually independent of kernel estimators' potential applications (in
our case, human motion prediction). The toolbox contains 23 predefined artificial
densities. They are all easy to parametrize, and can be combined in a modular way to
facilitate the creation of new ones. The user can choose among the kernel types listed in
the table of point 3.1. Both the isotropic and anisotropic variants, presented in section
2.2 for the Gaussian type, are supported for all those kernel types. The
hyper-parameters can be determined automatically via different methods, or be set
manually. The number of generated training samples can also be chosen, and can be
held for several estimations of the same true distribution. With the according option
set and reasonable training data size, it is also possible to see in real-time the impact
of tuning the different parameters.
In the case of 3D data, the density lies in 4D space: each point in space has
an associated density value. We patch together sets of points for which the density value
is greater than or equal to a predefined tolerance threshold, and thus obtain a group of 3D objects.
Figure 4.2: Two views of an isotropic kernel estimation of the open-box manifold (each point is equiprobable), depending on the value of the tolerance threshold (500 observed points)
The value of the tolerance threshold to accept a point can be modified in real-time
by a slide bar, and thus let the objects shrink or grow along this dimension. In figure
4.2, left (resp. right) view is obtained via a high (resp. low) tolerance threshold.
4.2.2 Integrated Square Error
Visualization alone remains a bit too subjective. As noted in point 3.3.3, the human
mind tends to favor smooth estimations even when they are actually blurring the
real shape of the sought distribution. Therefore, numerical criteria are desirable to
complement the viewer's impression. A common measure of error when comparing
a function to its estimate is the Integrated Square Error. It is defined, as can be
expected:
\[ \mathrm{ISE} = \int \left( \hat{f}(x) - f(x) \right)^2 \, dx \tag{4.1} \]
It is important to note that we normally don't have the reference function f(x)
at our disposal, as it is precisely what we try to estimate. The ISE itself then needs to be
estimated (while remaining distinct from the Mean Integrated Square Error, but this is
another story).
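For artificial examples, where the true density is known, the ISE can be approximated numerically. The sketch below is only illustrative and rests on several assumptions: a 1D standard normal as true distribution, a standard Gaussian kernel, a midpoint rule on a bounded interval, and hypothetical function names. One would expect a strongly under-smoothed estimate to obtain a larger ISE than a reasonably smoothed one.

```python
import math
import random


def gaussian_kde_1d(samples, h):
    """Return the 1D Gaussian kernel density estimate as a function of x."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))

    def f_hat(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

    return f_hat


def standard_normal_pdf(x):
    """True density used for this artificial example."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def integrated_square_error(f_hat, f, lo, hi, steps=2000):
    """Midpoint-rule approximation of the ISE (equation 4.1) over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        total += (f_hat(x) - f(x)) ** 2
    return total * dx


random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(200)]
ise_smooth = integrated_square_error(
    gaussian_kde_1d(samples, 0.4), standard_normal_pdf, -6.0, 6.0)
ise_rough = integrated_square_error(
    gaussian_kde_1d(samples, 0.05), standard_normal_pdf, -6.0, 6.0)
```

Truncating the integral to [-6, 6] is harmless here because both densities are negligible outside this interval; in higher dimensions a grid of this kind quickly becomes impractical.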
4.2.3 Average Negative Log Likelihood
The Average Negative Log Likelihood is another performance criterion (used for
instance in [40]), much closer to a Machine Learning approach. It consists of evalu-
ating the estimated density with a set of m points that were generated with the true
distribution. It is defined as:
\[ \mathrm{ANLL} = -\frac{1}{m} \sum_{i=1}^{m} \log \hat{f}(\mathbf{x}_i) \tag{4.2} \]
Contrary to the ISE, this does not require any external knowledge about the unknown
function f(x), as we can build a test set by setting aside a subset of the available data.
Of course, two different estimations must be compared on the same test
set. Note that summing the logarithms is equivalent to taking the logarithm of
the product; hence, a very low density value on even a single test point directly penalizes the final result.
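As an illustration of equation 4.2, here is a minimal sketch of the ANLL on a held-out test set. It assumes the usual normalized isotropic Gaussian kernel (the exact form of equation 2.2 is not restated here), and the function names are hypothetical.

```python
import math


def gaussian_kde(train, h):
    """Isotropic Gaussian kernel density estimate in D dimensions (assumed form)."""
    d = len(train[0])
    norm = 1.0 / (len(train) * (2.0 * math.pi * h * h) ** (d / 2.0))

    def f_hat(x):
        total = 0.0
        for xi in train:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
            total += math.exp(-sq_dist / (2.0 * h * h))
        return norm * total

    return f_hat


def anll(f_hat, test):
    """Average Negative Log Likelihood of the test points under f_hat."""
    return -sum(math.log(f_hat(x)) for x in test) / len(test)


# Toy data (made-up values): the last points are set aside as test set.
data = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.15, 0.15], [0.25, 0.2]]
train, test = data[:4], data[4:]
score = anll(gaussian_kde(train, h=0.2), test)
```

A lower value is better. Note that `math.log` fails if the estimated density underflows to 0 on a test point, which is the extreme case of the penalization mentioned above; a practical implementation would work in the log domain.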
4.3 Processing real data
4.3.1 Building of training set
In the context of human poses, samples are not drawn from any known probability
density. Actually, it sounds rather weird to ask ourselves what the global maximum
is (in other words, the most probable sequence of human poses in general). We also
have to bear in mind that any model based on joint angles remains only a model,
and does not precisely reflect the complete human body configuration. In any case,
the task of establishing a data set representative of all arbitrary movements would
be totally intractable. In this respect, trying to reveal the nature of human poses
by a kernel density estimator is, theoretically speaking, inherently flawed and should
Figure 4.3: The MMM joint model during a boxing move
remain out of question. The real purpose here is to use the estimation locally in a
purely pragmatic way to help predictions on similar movements.
There are two different joint models used in the Go Hu.MAn group: the Master
Motor Map (MMM) model depicted in figure 4.3, and the human model. The first
one was originally conceived to execute man-like movements on a humanoid robot,
without being constrained by anatomical limits (such as the impossibility of deploying
one's elbow above 180◦, unless being subjected to an especially painful situation).
Each joint is thus simply constituted of three unrestricted angle-based Degrees of
Freedom (DoF). The second model is directly intended to represent a human person.
Both models can be used for prediction.
In addition to the joints of the model itself, we have to consider what is called
in [5] the global twist. It represents the position of the body in a fixed reference
environment, and is thus composed of 3 variables for translation and 3 others for
rotation. The MMM model has a total of 18 joints, which accounts for 60 variables
(3*18 + 6). The human model has a varying number of DoF/joint, and comprises
42 variables overall, global twist included.
The input data for the two model types were originally generated from the same
marker-based data. Each category of movement is composed of several sequences,
where a marker-covered sportsman was asked to perform the desired movement. All
in all, there are a bit more than 100 sequences available, covering 10 categories of
movements (acrobatic jumps and/or martial arts).
Each sequence is a matrix representing the model configuration over time. As
mentioned in section 2.2, each data point of the kernel density estimator is in fact the
juxtaposition of the model's variables over several frames, following a predefined rule
which we call the time sequence. For instance, if we follow the time sequence t,
t−1 and t−2, the data point corresponding to frame number 1200 will be a vector of
180 variables (with the MMM model) containing the poses of frames number 1200,
1199 and 1198.
At this point, there is a difference with the approach presented in [5]. In order to
slightly reduce the total number of dimensions, the authors recommend including the
global twist variables only in the present frame t (and taking relative values instead
of absolute ones). Doing so in our case led to extremely imprecise predictions
on the global twist in preliminary experiments, the system having only the other
(very indirectly correlated) variables at its disposal to guide prediction. We suppose
that this is much less a problem in their case, as the combination with the image
fidelity method through global optimization implicitly overwrites the predictions on
the global twist. Additionally, tracking the body as a whole is much easier
than detecting the local joints in general, and this is even more true in the presence of
partial occlusions (thus, prediction is rarely needed at all for global twist). In any case, in
the MMM model, the global twist represents only 10 % of the complete model, so the dimensional
overhead is small compared with the severe loss of precision.
4.3.2 Frame subsampling
So far we have discussed how to generate data for the kernel density estimator, but
not which frames to give it. We could for instance, starting from frame t, retain only
t+ 10, t+ 20, etc, instead of t+ 1, t+ 2, etc. We call this gap between two retained
frames the frame step. It can be defined for both the training and test sets.
But is this cut justified at all? Indeed, there are time considerations that make
it worth considering the selection of only a subset of the available data, especially for
the training set. Without going into the details of the complexity analysis (see
section 4.6), computation time will increase linearly relative to increase in test set
and training set for the isotropic kernel, and quadratically relative to increase in
training set for the anisotropic kernel.
In the data at our disposal, the frame rate is quite high, so the poses vary
little over a unit frame step. Worse, the proportion of those variations
that is attributable to noise is then no longer negligible, which is a highly undesirable
effect. Furthermore, the human body is not able to change trajectories instantly, so
we can quite reasonably expect interpolating behavior at a small time scale. And of
course, if sufficient data is available, it might be more interesting, computation time
being equal, to "invest" in additional sequences of the movement category (variety)
instead of superfluous frames, or to tune the other hyper-parameters more finely.
Finally, this is not inconsistent with the methodology itself, as from an external
point of view, this is simply equivalent to a camera with a slower frame rate. These
arguments must however be nuanced for the test set frame step. If the end application
wants results at the level of the original frame rate, we should also assess performance
(but not necessarily train!) at this scale.
4.3.3 Step-by-step prediction
Another point to clarify is that the whole system (prediction and image fidelity
combined) can only work step-by-step, meaning that all pose estimations must be
made in chronological order. Indeed, the prediction method needs the pose
configuration of the previous frames mentioned in the time sequence in order to make a
prediction for t. This also means that at the beginning of the process (until the i-th
frame is reached, where t − i + 1 is the "deepest" frame of the time sequence), the
image fidelity method alone is responsible for the pose estimations.
Figure 4.4 illustrates this step-by-step handling of data, in the simplest case
Figure 4.4: Step-by-step estimate, test frame step = 1, time sequence of t and t− 1
where the time sequence is composed of t and t−1 (thus, prediction can be activated
as soon as frame 2), and test frame step is equal to 1 (thus, each frame after frame
1 is estimated with both methods). The ⊕ symbol has no formal meaning here,
simply meaning that the two outputs are somehow merged into a single estimate.
Training frame step is not relevant in this scheme, as it solely concerns the `Training
data' component.
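The chronological handling described in this section can be sketched as a simple loop. Everything below is a hypothetical stand-in: `image_fidelity`, `predict` and `merge` are placeholder callables (the actual merging rule is not specified in this section), and frames are numbered from 0 rather than from 1.

```python
def track_sequence(num_frames, offsets, image_fidelity, predict, merge):
    """Estimate poses in chronological order.

    offsets: past-frame offsets of the time sequence, e.g. (1,) for t and t-1.
    Returns the estimates and, per frame, whether the predictor was used.
    """
    deepest = max(offsets)
    estimates, used_prediction = [], []
    for t in range(num_frames):
        observed = image_fidelity(t)
        if t < deepest:
            # Not enough past poses yet: image fidelity works alone.
            estimates.append(observed)
            used_prediction.append(False)
        else:
            past = [estimates[t - o] for o in offsets]
            estimates.append(merge(predict(past), observed))
            used_prediction.append(True)
    return estimates, used_prediction


# Scalar "poses" and trivial placeholder functions, for illustration only.
estimates, used = track_sequence(
    num_frames=5,
    offsets=(1,),
    image_fidelity=lambda t: float(t),
    predict=lambda past: past[0] + 1.0,
    merge=lambda predicted, observed: 0.5 * (predicted + observed),
)
```

With the time sequence t and t − 1, only the very first frame is estimated without prediction, which matches the case of figure 4.4.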
4.3.4 Prediction by gradient ascent
The next question is: how do we actually predict a pose, given a training set and
some past poses? The kernel density estimator is based upon the training set only.
If we evaluate it on a point x, we have a relative likelihood estimate for this point
to appear. Remember a request to our learner takes the form of a vector composed
of the juxtaposition of human body poses over a time sequence, from which we only
know the poses of the past frames and want to induce the current frame's pose. Our
goal is thus to discover which vector x is the most likely, under the constraint that
the variables belonging to the past frames are of course held fixed. Hence, we search
for a maximum on a subspace of the density hyper-surface.
However, we evidently can't afford to compute the estimate over a domain, and
only search afterward for the best candidate matching the variable values of the past
frames. Conceptually speaking, we have a density estimation at our disposal, but
each point of its domain is costly to compute, and because of the curse of dimensionality,
there are lots of points. A global maximum is thus definitely out of reach.
What we can do is search for a local maximum, using gradient ascent.
The initial point can be determined in different ways (the variables of past frames
being already fixed). We chose to take the same pose as the nearest past frame of
the time sequence. In preliminary experiments, we found that this seems to
slightly outperform arbitrary choices, though the influence seems rather small. We
must nonetheless point out that this result holds for small training set sizes in high-
dimensional space. In the inverse situation, one may expect less sparse distributions
and thus a more important role for the choice of the initial point.
We actually compute the gradient of the logarithm of the estimation (as the logarithm
function is strictly monotonic, this does not influence the direction of steepest
ascent):
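\[ \frac{\partial \log \hat{f}_{iso}(\mathbf{x})}{\partial \mathbf{x}} = -\frac{\sum_{i=1}^{N} K^{iso}_{h}(\mathbf{x}, \mathbf{x}_i)\,(\mathbf{x} - \mathbf{x}_i)}{h^2 \sum_{i=1}^{N} K^{iso}_{h}(\mathbf{x}, \mathbf{x}_i)} \tag{4.3} \]
\[ \frac{\partial \log \hat{f}_{ani}(\mathbf{x})}{\partial \mathbf{x}} = -\frac{\sum_{i=1}^{N} K^{ani}_{i}(\mathbf{x})\,\Sigma_i^{-1}(\mathbf{x} - \mathbf{x}_i)}{\sum_{i=1}^{N} K^{ani}_{i}(\mathbf{x})} \tag{4.4} \]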
where the definitions of f̂iso(x) and f̂ani(x) are given in equations 2.2 and 2.4 respectively.
Notice that this is the complete gradient, whereas in practice we are
only allowed to use the partial derivatives with respect to the variables of the current frame. The
derivations are detailed in appendix A.
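To make the constrained ascent concrete, here is a minimal sketch for the isotropic case (equation 4.3). It assumes a Gaussian kernel, a fixed step size and a fixed number of iterations, and a vector layout with the current pose first followed by the past frames; all of these, as well as the names `log_density_gradient` and `predict_current_pose`, are illustrative assumptions rather than the thesis implementation.

```python
import math


def log_density_gradient(x, train, h):
    """Gradient of log f_iso at x (equation 4.3), assuming a Gaussian kernel."""
    sq_dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in train]
    # Shifting by the smallest squared distance avoids underflow; the common
    # factor cancels between numerator and denominator.
    smallest = min(sq_dists)
    weights = [math.exp(-(s - smallest) / (2.0 * h * h)) for s in sq_dists]
    total = sum(weights)
    grad = [0.0] * len(x)
    for w, xi in zip(weights, train):
        for k in range(len(x)):
            grad[k] -= w * (x[k] - xi[k])
    return [g / (h * h * total) for g in grad]


def predict_current_pose(past, init, train, h, rate=0.1, iters=500):
    """Gradient ascent on the current-pose variables only; 'past' stays fixed."""
    current = list(init)
    for _ in range(iters):
        grad = log_density_gradient(current + past, train, h)
        # Only the partial derivatives of the current frame are used.
        for k in range(len(current)):
            current[k] += rate * grad[k]
    return current


# Toy training set (made-up): 1 variable per pose, layout [pose(t), pose(t-1)],
# where the pose increases by 1 at each frame.
train = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0], [4.0, 3.0]]
past = [1.0]
prediction = predict_current_pose(past, init=list(past), train=train, h=0.3)
```

The initial point is the pose of the nearest past frame, as chosen above. On this toy data the ascent should settle close to 2.0, the value that follows a past pose of 1.0 in the training set. The step size must stay small relative to h², otherwise the iteration can oscillate or diverge.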
We can also remark at this point that the isotropic kernel estimator is a lazy
learner (i.e. the whole computation is done on request), while the anisotropic kernel
estimator is not (inverse covariance matrices and determinants are kept in memory
after the first request).
4.4 Hyper-parameters selection
4.4.1 Bandwidth
As already discussed in general in point 3.3.3, selecting the adequate value of h
is almost the fundamental question of kernel density estimation. It is theoretically
possible to optimize h, along with the regularization parameter α in the anisotropic
case, via cross-validation. If we optimize in terms
of accuracy of the estimated density, the performance criterion to use is the ANLL
defined in equation 4.2, applied on the union of all validation sets instead of the
test set. However, the real performance criterion is eventually the accuracy of the
prediction made after the gradient ascent, which is not quite the same thing (although
we expect a good ANLL to be at least partially helping) and implies the associated
computations.
But even when simplifying to the ANLL and applying a low-cost cross-validation
(say, only 2 folds), this multiplies the number of computations of an already expensive
density estimator. Moreover, if a large part of the training set is "sacrificed"
to the validation set, the value obtained for h will actually be over-estimated. Indeed,
remember that the larger the training set, the lower we should allow h to go. The
good values for h are in fact highly dependent upon the training set size.
And unlike integer hyper-parameters, there is no clear numerical step to use to
explore the influence of h (and α in the anisotropic case) on performance. Depending
on the training set, fixed initial values and step sizes are quite prone to be arbitrary.
To counter this problem, one may iteratively refine the search between the two best
candidates (see e.g. the approach proposed in [37]), a bit like in a dichotomic search.
Nonetheless, very much like in gradient ascent, one has to remain conscious that
this is still not sufficient to guarantee finding a global optimum, unless the
performance function over the parameter space is unimodal.
Figure 4.5: Evolution of average (solid blue) and maximum (dashed green) nearest neighbor distance with sample size varying from 10 to 1000, D = 1 and D = 100 resp.
For all these reasons, optimizing h in a systematic manner seems very costly to
apply in practice, especially in the anisotropic case. If an easily computable and
reasonable approximation exists, the time spent on optimizing may instead be used
to evaluate a larger training set (assuming sufficient data is available). In [6], the
authors propose to use the maximal nearest neighbor distance of the training set.
However, a year later, in [5], they switch to the average nearest neighbor distance.
The idea that may justify the max nearest neighbor distance is that it allows
the bandwidth to expand until it finds a neighbor, meaning there is no gap of low
(Gaussian) or null (strict window kernels) value between two neighbors, a region we
would intuitively not expect to see falling suddenly to very low probability levels.
That said, this argument is not based on an objective numerical criterion, and might
lead to some over-smoothing (remember figure 3.6). The average nearest neighbor
distance also tries to achieve this effect, but in a less radical manner.
An important point is the behavior of these approximations when the number of
training samples changes. A highly desirable property would be that the proposed
value for h decreases as samples are added. It turns out that, in this respect, the average
nearest neighbor distance behaves much better than the alternative maximum nearest
neighbor distance.
Figure 4.5 illustrates this. We generated data drawn from the standard normal
with D variables, and compared the evolution of the two measures when adding
new elements to the sample (of course, our real data is certainly not generated from
a unimodal distribution such as this one, but we think the following observations
will nonetheless hold). The maximum nearest neighbor distance is, as expected, substantially
larger than the average nearest neighbor distance. But more importantly, it does not offer any
clear guarantee to decrease as the training set grows.
This is especially problematic for low-dimensional data (left plot). The variation
is chaotic because it is based on the degree of isolation of the most isolated outlier in
the sample, which is a property liable to vary abruptly. This unpredictability
is less severe in high dimensionality (right plot), thanks to a positive (in this case!)
aspect of the curse of dimensionality, which states that the relative difference between
the maximal and minimal distance in a data set converges to 0 as dimensionality
increases (see [4]). In other words, distance progressively loses its discriminative
power as a defining characteristic between two points, which helps to stabilize the
maximum nearest neighbor distance.
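Both bandwidth heuristics are cheap to compute, as the following sketch shows. It uses a brute-force search and standard normal toy data, in the spirit of the experiment of figure 4.5; the function names and the sample sizes are illustrative assumptions, and the sketch makes no claim about the exact values plotted in the figure.

```python
import math
import random


def nearest_neighbor_distances(points):
    """Distance from each point to its nearest other point (brute force)."""
    distances = []
    for i, p in enumerate(points):
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        distances.append(nearest)
    return distances


def bandwidth_candidates(points):
    """Return (average, maximum) nearest neighbor distance as candidates for h."""
    nn = nearest_neighbor_distances(points)
    return sum(nn) / len(nn), max(nn)


random.seed(1)
for n in (10, 100, 400):
    sample = [[random.gauss(0.0, 1.0) for _ in range(2)] for _ in range(n)]
    avg_nn, max_nn = bandwidth_candidates(sample)
    print(n, round(avg_nn, 3), round(max_nn, 3))
```

The maximum is by construction never smaller than the average. The brute-force search is quadratic in the sample size, which is acceptable since it is done once per training set.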
4.4.2 Alpha
This is a hyper-parameter specific to the anisotropic kernel. Unfortunately, the
anisotropic kernel is already much more time-costly than the isotropic alternative,
so trying to optimize this supplementary hyper-parameter only further widens the
gap between the two.
In point 2.2, we presented α as a way to avoid ill-conditioned covariance matrices.
What about simply setting α just high enough to ensure a reasonable condition
number? Well, the problem is that a high condition number is a relatively blurry and gradual
notion. We could choose a threshold of what is deemed an inoffensive
condition number, but picking α to match this threshold on the worst Σi is certainly
not optimal in terms of performance. As for h, there is a trade-off at play.
So, basically, what is the effect of letting α grow too much? The covariance
matrix will more and more concentrate on its diagonal, and converge toward an
identity matrix scaled by α. And what does such a covariance matrix represent?
An isotropic kernel, which is certainly not as terrible as an ill-conditioned matrix.
Pragmatically speaking, there is a major difference with the normal isotropic kernel
though, in that we waste much time building a covariance matrix which we will
mostly ignore in the end! And afterward, the evaluation itself will also involve costly
vector multiplications, instead of the straightforward h factor.
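This limit can be made explicit with a short derivation. It assumes that α enters as an additive term on the diagonal of the local covariance matrix; this is an assumption made for illustration, since the definition of point 2.2 is not restated here, and the notation \tilde{\Sigma}_i is introduced only for this purpose.

```latex
\tilde{\Sigma}_i = \Sigma_i + \alpha I
  = \alpha \left( I + \tfrac{1}{\alpha} \Sigma_i \right)
\quad \Longrightarrow \quad
\alpha \, \tilde{\Sigma}_i^{-1}
  = \left( I + \tfrac{1}{\alpha} \Sigma_i \right)^{-1}
  \xrightarrow[\alpha \to \infty]{} I ,
\\[4pt]
(\mathbf{x} - \mathbf{x}_i)^{\top} \tilde{\Sigma}_i^{-1} (\mathbf{x} - \mathbf{x}_i)
  \approx \frac{\lVert \mathbf{x} - \mathbf{x}_i \rVert^{2}}{\alpha}
  \qquad \text{when } \alpha \gg \lVert \Sigma_i \rVert .
```

Under this assumption, the Mahalanobis distance degenerates into a scaled Euclidean distance, so the anisotropic Gaussian kernel behaves like an isotropic one whose squared bandwidth is of the order of α.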
As for h, there are no definite answers on what would be the best value to give to
α. In [5], α is fixed for rather obscure reasons to h/5. That said, it is fairly clear that
α should, at least partially, depend upon h in some way. The scaling factor must
however be adjusted by heuristic results. So far, we see no automated alternatives
able to better grasp the complex relations between h and the condition number of
the covariance matrix that the isotropic kernel generates.
4.4.3 Training frame step
The question of the training frame step seems less delicate. We would expect that
low frame steps can never lower the precision of the predictions, and that the trade-off
is thus between precision and computational cost. Depending on how much time we
are ready to give to the algorithm, setting the value of the hyper-parameter could
be done quite naturally. Still, if we want to increase efficiency, there might be some
ways to specify it less arbitrarily. One could use an objective precision versus time
criterion for instance.
There is also something else to be noted: the training frame step and the time
sequence are not independent. Changing the former will directly determine which
past frames will also be considered, thus changing the meaning of the time sequence
itself. For example, t, t − 1 and t − 2 are not the same thing for a frame step of 1
(past frames are also selected frames) as for one of 3 (past frames are the missing
link between two selected frames). This means that lowering
the training frame step after having fixed the time sequence might actually in some
cases decrease performance (bigger training set, but formatted in another manner).
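The interaction between the two hyper-parameters is easy to see by listing which original frames end up in each data point. The sketch below assumes that the selected frames are the multiples of the frame step and that the time sequence is expressed in original frame units; the function name and the 0-based numbering are illustrative choices.

```python
def data_point_frames(num_frames, frame_step, offsets=(0, 1, 2)):
    """For each selected frame t, list the original frame indices it juxtaposes."""
    deepest = max(offsets)
    # First multiple of frame_step that has enough past frames behind it.
    first = ((deepest + frame_step - 1) // frame_step) * frame_step
    return [[t - o for o in offsets] for t in range(first, num_frames, frame_step)]


step_1 = data_point_frames(num_frames=6, frame_step=1)
step_3 = data_point_frames(num_frames=10, frame_step=3)
```

With a frame step of 1, the past frames of a data point are themselves selected frames (frame 3 uses frames 2 and 1, which also head their own data points). With a frame step of 3, frame 6 uses frames 5 and 4, which lie between the selected frames 3 and 6 and never appear as a current frame.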
4.4.4 Time sequence
The time sequence is, once again, not an easy hyper-parameter to determine. There
are actually two di�erent things to decide: the number of frames included, and how
those frames are arranged.
The first aspect will have a dramatic impact on dimensionality, as the time sequence's
length multiplies the number of dimensions of each data point. This directly
affects computational cost, especially for the anisotropic kernel (see section 4.6). But
this is not the main issue. Indeed, all effects related to the curse of dimensionality
will only be exacerbated (see point 5.3.1). If the human body model used already contains
too many dimensions, the length of the time sequence must be set accordingly.
Nonetheless, in order to make prediction possible, the time sequence must of
course still contain at least one past frame.
The second aspect covers potentially as many hyper-parameters as there are past
frames in the time sequence. However, we may expect good sequences to follow
some kind of simple rule, like an arithmetic progression or an exponential growth.
The former would consider all time intervals to have the same relevance, while the latter
would give both deeper insight into the past and information about the near
past. On top of that, there remains of course a scaling factor similar in nature to the
frame step: does it really bring something to cover redundant close frames
instead of directly looking deeper into the past?
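The two families of rules mentioned above reduce the time sequence to one or two hyper-parameters. As a hedged illustration (the function names, the `spacing` and `base` parameters and the convention that offset 0 denotes the current frame are all assumptions):

```python
def arithmetic_offsets(num_past, spacing=1):
    """Current frame plus past frames at constant spacing: t, t-s, t-2s, ..."""
    return [0] + [spacing * k for k in range(1, num_past + 1)]


def exponential_offsets(num_past, base=2):
    """Current frame plus past frames at growing distance: t, t-1, t-2, t-4, ..."""
    return [0] + [base ** k for k in range(num_past)]


regular = arithmetic_offsets(3, spacing=2)
growing = exponential_offsets(3)
```

Here `regular` stands for the time sequence t, t − 2, t − 4, t − 6 and `growing` for t, t − 1, t − 2, t − 4. Both have the same dimensional cost (three past frames), but they do not reach equally far into the past.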
4.5 Combination with image fidelity
4.5.1 Confidence weight of estimation
As previously said, prediction method and image fidelity method are two distinct
algorithms, and our work is solely focused on the former. However, in order to
guarantee a harmonized mix between the two results obtained, it might be a good
idea to consider weighting their respective influence relative to their estimated quality
on the proposed instance. Formulated another way, it would be nice if the predictor
could itself give a scalar confidence value on the pose it proposes.
At first glance, knowing that the kernel density estimator is precisely a probability
density, we could simply return the value of the local maximum found and use
it as an estimator of the probability of this pose sequence. But relative likelihood
is not probability! A well-defined probability must actually be an integral over a
region included in the domain of the density, the value of a point's relative likelihood
being only meaningful when compared to that of other points. Due to the curse of
dimensionality, the relative likelihood of a point in a high-dimensional space will in
general be much lower than one in a space of lower dimensionality.
A first practical consequence of this is that when two models do not have
the same dimensionality (e.g., different lengths for the time sequence), it is not
relevant to compare the confidence values obtained by taking the estimate on the
pose sequence. A second practical consequence is that we really have to scale the
relative likelihood should we want to use it as an ersatz for probability. But varying
conditions must be taken into account, so using this kind of static approach might
still lack flexibility.
We suggest instead to use the value returned on the pose sequence divided by
what it would have been without the prediction. Comparing two relative likelihoods
ensures that we stay on the same scale. A reasonable candidate for the pose of the
current frame without the prediction (remember that the variables relative to the
pose of the current frame are undefined) could be the nearest past frame. In this
sense, the learner would compare its self-confidence with what it judges to be the
value of staying purely conservative. Notice it is also the initial point we use in the
gradient ascent (see point 4.3.4), so it could also provide a measure of the estimated
"success" of the complete gradient ascent operation.
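The suggested ratio can be sketched as follows for the isotropic Gaussian case. The sketch assumes a vector layout with the current pose first and the past frames after it, nearest past frame first; the function names are hypothetical. Since both densities share the same normalization constant, only the kernel sums need to be compared, and working with logarithms avoids underflow in high dimension.

```python
import math


def log_kernel_sum(x, train, h):
    """log of sum_i exp(-||x - x_i||^2 / (2 h^2)), computed in a stable way."""
    exponents = [
        -sum((a - b) ** 2 for a, b in zip(x, xi)) / (2.0 * h * h) for xi in train
    ]
    largest = max(exponents)
    return largest + math.log(sum(math.exp(e - largest) for e in exponents))


def confidence_ratio(predicted, past, train, h):
    """Density at the predicted pose divided by the density at the conservative
    candidate, i.e. the pose of the nearest past frame copied as current pose."""
    conservative = list(past[: len(predicted)])
    log_ratio = (log_kernel_sum(list(predicted) + list(past), train, h)
                 - log_kernel_sum(conservative + list(past), train, h))
    return math.exp(log_ratio)


# Toy training set (made-up): 1 variable per pose, layout [pose(t), pose(t-1)].
train = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0], [4.0, 3.0]]
ratio = confidence_ratio(predicted=[2.0], past=[1.0], train=train, h=0.3)
```

A ratio of 1 means the prediction is judged no better than staying at the previous pose; values above 1 indicate that the gradient ascent found a more likely pose sequence.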
4.5.2 Ambiguity measure of training set
The kernel density estimator returns relative likelihood values on the points requested,
but this is highly dependent on the parameters used, and on the methodology in
general. And obviously, this value is defined locally for the test point in question,
not on a global scale for the training set. In complement to that measure, it would
be nice to get information on the predictive power of the training set itself. That is,
we don't talk about kernels anymore and think of the data points as composed of
two sets of continuous variables: those we want to predict, and the others. So, if we
suddenly remove the values to be predicted, to which extent are the others sufficient
to retrieve the "lost" values? To answer this question, we imagined a way to measure
the ambiguity inherent to a data set, relative to a set of removed variables.
Consider the following naive example: Paul goes to school on each day from
Monday to Friday, but not on Saturday and Sunday. If it is Wednesday, does Paul
go to school? The answer is straightforward, because there is no ambiguity. Now,
knowing that Paul goes to school, which day is it? There are 5 different possible
answers. Having Paul's agenda for this week, we could compute a measure of the
difficulty there is in this agenda to predict the day knowing Paul's occupation (or
vice-versa). For each pair made from 2 of the 7 tuples (day, occupation) we could see if
there is an ambiguity when the day is removed or not. For example, if we have (Tuesday, school)
versus (Thursday, school), there is ambiguity, whereas for (Monday, school) versus
(Sunday, ¬ school) there is not. That way we could get a ratio of the ambiguous
pairs if the day is omitted, which already gives a good idea of how ambiguous it is
to predict the day with respect to the occupation.
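The example can be worked out in a few lines. The sketch below treats a pair as ambiguous when both tuples share the same occupation, which is our reading of the rule illustrated above; the function name is hypothetical.

```python
from itertools import combinations

week = [
    ("Monday", "school"), ("Tuesday", "school"), ("Wednesday", "school"),
    ("Thursday", "school"), ("Friday", "school"),
    ("Saturday", "no school"), ("Sunday", "no school"),
]


def ambiguous_pair_ratio(tuples):
    """Share of pairs that cannot be told apart once the day is removed."""
    pairs = list(combinations(tuples, 2))
    ambiguous = sum(1 for (_, occ_a), (_, occ_b) in pairs if occ_a == occ_b)
    return ambiguous / len(pairs)


ratio = ambiguous_pair_ratio(week)
```

There are 21 pairs in total, of which 10 pair two school days and 1 pairs the two days off, so 11 pairs out of 21 are ambiguous when the day is omitted.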
However, it should be clear that such an approach could only be relevant with
restricted discrete values (the data set covers well the range of possible values). In
the continuous case, this is by definition impossible with a finite data set. Instead
of deciding in a binary way if the current pair is ambiguous or not, we may judge
how much the lost variables account for the numerical discrimination between the two
data points. A simple criterion to evaluate how much two sets of continuous variables
differ from one another is the Euclidean distance between the two corresponding
vectors. This is coherent with all the classical quadratic error criteria.
If we interpret this geometrically, variables are dimensions, so removing one of
them consists of "compressing" the data set to a single value (i.e. 0 or any other
constant) along this dimension. A sphere would thus be virtually reduced to a disk
along the z axis for instance. Rebuilding the sphere would be ambiguous because
each point on the disk could have originated from either the bottom or the upper surface.
A hemisphere placed on a horizontal plane would be easier to rebuild, but unlike
what the binary ambiguity criterion would have said, not absolutely unambiguous,
because variations/imprecisions on x, y would induce variations on z. In fact, only a
perfectly flat horizontal surface would be perfectly unambiguous, because no matter
what the x, y coordinates are, the result to predict will always be the same. On the
contrary, if we compress this surface along the x, y dimensions, there is no way to
reconstitute anything of the original data set, so there is complete ambiguity. Notice
that this will always be the case if all dimensions are compressed to a single point of
no dimension.
In brief, we want to penalize high variance in the compressed dimensions (because
it's more difficult to be accurate when having to predict them). The idea is, instead of
hunting strict ambiguities, to estimate for each pair how much the distance between
the vectors composed only of the compressed dimensions determines the distance
between the complete vectors.
Having a data set X composed of D-dimensional vectors $x_1, \ldots, x_N$ and a set C
of the dimensions to be compressed so that $\forall c \in C\,(c \in \mathbb{N}_1 \wedge c \leq D)$, let us define
$x'_i$ as the |C|-dimensional vector composed of all the variables of $x_i$ indexed in C. This
gives us the following ambiguity ratio:

\[
\mathrm{Ambig}_C(X) = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j>i}^{N} \frac{\lVert x'_i - x'_j \rVert}{\lVert x_i - x_j \rVert} \tag{4.5}
\]
This ratio automatically normalizes the ambiguity to the interval [0, 1]. There
is nonetheless an indeterminate form 0/0 in the case of duplicates. To avoid this
situation, we could eliminate the duplicates in X, but this would not reflect
the nature of the original data set, because 10 instances of a data point should be
much more influential than a single one. Instead, one may simply ignore the duplicates
when they are paired together, but still let them count in the other pairs. This gives
us a redefined and generalized ambiguity ratio:
\[
\mathrm{Ambig}_C(X) = \frac{2}{\lvert \{(i,j) : x_i \neq x_j\} \rvert} \sum_{i=1}^{N} \sum_{\substack{j>i \\ x_j \neq x_i}}^{N} \frac{\lVert x'_i - x'_j \rVert}{\lVert x_i - x_j \rVert} \tag{4.6}
\]

Figure 4.6: Curves of ambiguity repartition obtained by histograms (see text for description)
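As an illustration only, the ambiguity ratio of equation 4.6 could be sketched as follows in Python. The function name `ambiguity_ratio` and the 0-based indexing of the compressed dimensions are assumptions made for this sketch (the text indexes dimensions from 1). The set in the denominator of 4.6 is read here as a set of ordered pairs, so that the factor 2 reduces to an average over unordered non-duplicate pairs.

```python
import math
from itertools import combinations


def ambiguity_ratio(points, compressed):
    """Average ratio of compressed-subspace distance to full distance.

    points     -- sequence of equal-length tuples (the data set X)
    compressed -- 0-based indices of the dimensions in C (hypothetical convention)
    """
    total = 0.0
    pairs = 0
    for a, b in combinations(points, 2):
        full = math.dist(a, b)
        if full == 0.0:
            # Duplicates paired together are ignored (avoids the 0/0 form),
            # but each duplicate still counts in its other pairs.
            continue
        sub = math.dist([a[c] for c in compressed], [b[c] for c in compressed])
        total += sub / full
        pairs += 1
    if pairs == 0:
        raise ValueError("the data set needs at least two distinct points")
    return total / pairs


# A flat horizontal surface: compressing z loses nothing,
# compressing x and y loses everything.
flat = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
print(ambiguity_ratio(flat, [2]))
print(ambiguity_ratio(flat, [0, 1]))
```

The two calls reproduce the geometric intuition given earlier: the flat surface is unambiguous when compressed vertically and completely ambiguous when compressed along x and y.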
One more remark about what the ambiguity ratio actually measures: in order
for the Euclidean distance to make sense, the variables must be numerically
comparable. A priori, this excludes putting together variables of different nature, such
as temperatures and weights. But even with kilograms and tons, one has to scale
them to a common unit. Normalization is not a solution either, because
it would eliminate relative variabilities. In fact, if there is a scaling to do, it must
depend on our own weighting of the error on the concerned variable. One simply has
to make sure that an error of ε on a variable has the same cost for us, no matter
which one is perturbed. This holds not only for the variables to predict, but also
for the variables which will in practice be given.
To get an approximate idea of how formula 4.6 behaves in general (are low
ambiguity ratios common or rare, how smooth is the repartition?), we evaluated it on
a relatively large number of small low-dimensional data sets, generated by
combining the variables in various ways on a limited domain. We present in figure
4.6 the result of one of those preliminary experiments. The ambiguity ratio obtained
on each data set is put in a bin of a histogram (with bin width ε = 0.05). Each
data set is composed of only 4 two-dimensional points, and each variable can take
only one of the values {-4, -3, -2, -1, 0, 1, 2, 3, 4}. We generated all 62370
non-redundant combinations of those points which comprise neither duplicates nor multiple
values of y for a given value of x (as in a function). This last condition implies
different repartitions along the two compressible dimensions, as it is less ambiguous
to compress vertically. Despite the rather coarse conditions, the displayed curves are
almost smooth (notice the hiatus at the left of the vertical compression though).
4.6 Complexity analysis
Let us define several variables relevant to the complexity. Let N be the number of
elements in the training set, and M the number of elements in the test set. Let d be the
dimensionality of the model used, T the length of the time sequence, and D = dT
the dimensionality of an element in the training and test sets. Let F be the total number
of available frames. Finally, let X be the maximal number of evaluations tolerated for
one call of the gradient ascent.
4.6.1 Space complexity
There are several data structures and functions using them, but the space
complexity essentially only plays a role for the (inverse) covariance matrices of the
anisotropic kernel. In the isotropic case, both training and test sets are stored in
memory, but this is more inherent to the problem than to the methodology used to
solve it. That said, theoretically at least, one has to distinguish the space taken by
the available raw data from the constituted training and test sets. The former is in
Θ(dF), while the latter is in Θ(DN + DM), which is potentially bigger. In practice
however, we would not expect T to be more than a constant factor, and neither
N nor M can exceed F.
Back to the inverse covariance matrices, it would be possible to store only one
matrix at a time, but this would force us to recompute each one for each new kernel
evaluation, including during the gradient ascent. This space versus time trade-off is
out of the question in practice. All the inverse covariance matrices thus have to be stored
at the same time. There is one D × D matrix for each element of the training set,
thus the space complexity lies in $\Theta(ND^2)$. Concretely, knowing that each variable is
stored in a 64-bit double, this represents 80 megabytes for N = 1000 and D = 100.
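The figure of 80 megabytes can be checked with a few lines of Python. The helper name `inverse_covariance_bytes` is hypothetical, and a megabyte is taken here as 10^6 bytes.

```python
def inverse_covariance_bytes(n_train, dim, bytes_per_value=8):
    """Memory needed to hold one dim x dim matrix of doubles per training element."""
    return n_train * dim * dim * bytes_per_value


size = inverse_covariance_bytes(n_train=1000, dim=100)
print(size / 1e6)  # 80.0 (megabytes, with 1 MB = 10**6 bytes)
```

Since the cost is quadratic in D, doubling the dimensionality quadruples the memory, whereas doubling N only doubles it.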
4.6.2 Time complexity
The time complexity is alas not as gentle as the space complexity. In the isotropic
case, we deal with a time complexity in O(DNMX). The D factor is explained by
the distance calculation, the N by all the individual kernels averaged, and the M by
the loop over the test set. Unlike the other complexity factors, which are in Θ, X is only
a worst-case estimate, as the gradient ascent may in most cases converge quickly enough,
meaning that we do not need to interrupt it. In fact, the X factor exists solely to
offer the possibility of a strict upper bound, as the gradient ascent is guaranteed
to converge in a finite, yet variable, number of steps. For the strict window kernel
types, it is actually possible to also get the D factor in O instead of in Θ. This is
done by cutting the computation of the distance short as soon as it exceeds
the bandwidth, as the result will be equal to 0 anyway.
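The early exit described above could look like the following sketch. It assumes a window kernel that is exactly zero whenever the distance exceeds the bandwidth h. The function name `squared_distance_capped` and the use of `None` as the "outside the window" signal are choices made for this illustration only.

```python
def squared_distance_capped(x, y, h):
    """Squared Euclidean distance between x and y, or None once it exceeds h**2.

    Returning None tells the caller that the window kernel evaluates to 0,
    so the remaining dimensions never need to be visited.
    """
    limit = h * h
    acc = 0.0
    for a, b in zip(x, y):
        acc += (a - b) ** 2
        if acc > limit:
            return None  # already outside the window: stop early
    return acc


print(squared_distance_capped((0.0, 0.0, 0.0), (3.0, 0.0, 0.0), h=1.0))
```

In the worst case (points inside the window) all D dimensions are still visited, which is why the D factor becomes an upper bound rather than a tight one.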
The anisotropic case is a bit more complicated. First, there is the computation of
the N covariance matrices. For each matrix, this is done by calling, for each element
of the training set, the isotropic kernel, which scales the matrix obtained by computing
$(x_i - x_j)(x_i - x_j)^T$ (see equation 2.6). So, in total this operation is in $\Theta(N^2(D + D^2)) =
\Theta(N^2D^2)$. Interestingly, the inner call to the isotropic kernel is negligible compared
to the vector multiplications. Then, we also have to invert these matrices and
compute their determinants. This is handled by Matlab, and can be done N
times in $O(D^3)$ for each matrix, supposing it uses standard algorithms (according to
Wikipedia, the fastest known algorithm can accomplish both tasks for big matrices in
$O(D^{2.376})$). So, to summarize, the matrix computation phase is in $O(N^2D^2 + ND^3)$.
Then comes the evaluation itself. The anisotropic kernel must perform vector
multiplications with the inverse covariance matrix in order to obtain a scalar, so this
costs $D^2$ for each evaluation. As for the isotropic kernel, we have to multiply this by
NMX. If we consider the matrix computation phase and the subsequent evaluation
together, the total time complexity of the anisotropic kernel lies in the not very nice
but still polynomial $O(N^2D^2 + ND^3 + D^2NMX) = O(ND^2(N + D + MX))$. As
previously said, the majority of these factors are in Θ, so this theoretical complexity
also has very concrete consequences on the actual computational time.
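To get a feeling for the orders of magnitude, the dominant terms above can be turned into a rough operation count. This is only a sketch: the function names are hypothetical, all constant factors are dropped, the cubic cost assumes standard inversion algorithms, and the values chosen for M and X are arbitrary.

```python
def isotropic_cost(D, N, M, X):
    """Dominant term of the isotropic kernel: O(DNMX)."""
    return D * N * M * X


def anisotropic_cost(D, N, M, X):
    """Dominant terms of the anisotropic kernel: O(N^2 D^2 + N D^3 + D^2 N M X)."""
    build = N * N * D * D         # the N covariance matrices
    invert = N * D ** 3           # inversions and determinants
    evaluate = D * D * N * M * X  # kernel evaluations during gradient ascent
    return build + invert + evaluate


# Arbitrary illustrative sizes (not taken from the experiments).
D, N, M, X = 100, 1000, 50, 20
print(anisotropic_cost(D, N, M, X) / isotropic_cost(D, N, M, X))
```

The factored form $ND^2(N + D + MX)$ gives the same count, which makes it easy to see which of the three phases dominates for a given setting.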
In general, optimizing hyper-parameters adds yet other multiplicative factors to
the whole complexity, with the nuance that the validation set's size may certainly
be smaller than M, and that X could possibly be dropped by approximating the
performance estimation (see point 4.4.1).
What about the proposed shortcut for selecting the bandwidth h? The average
(or maximum) nearest neighbor distance is computed only once in $\Theta(N^2D)$, which
might be quite inconvenient for the isotropic kernel (depending on whether MX is much
bigger than N or not). For the anisotropic kernel, already much more expensive, this
remains a negligible effect.
Finally, concerning the self-evaluation of the algorithm, the confidence weight of
estimation adds no time complexity cost at all (as it is based on what is already
computed). The ambiguity measure of the training set runs in $\Theta(N^2D)$, like the nearest
neighbor distance measurements.
5. Experiments
The Reader will probably
understand from these two
instances how - after a very long
training supplemented by constant
experience - it is possible for the
well-educated classes among us to
discriminate with fair accuracy
between the middle and lowest
orders, by the sense of sight.
Edwin A. Abbott, Flatland
5.1 Performance criteria
How can we measure the precision of the predictions? A subjective appreciation of
global success or failure could be obtained by watching a run of the human body
models configured by the predictions over the frames of the test set, but this lacks
automation, fairness and precision. Yet, a potential issue with numerical criteria is
that their neutrality does not imply problem relevance if they are not defined so as
to exactly reflect what we actually want to measure in the end. Is a big error in one
variable acceptable if it helps to keep the other variables precise? Are all variables
comparable in terms of error? We would like to clarify here how we decided to answer
those two questions in our context, and thus evaluated our results.
The first question, namely by which kind of rule an error should impact the
global result, is not that controversial. Errors are basically defined as the
absolute deviation relative to a reference. When evaluating a whole system, we
could for instance propose to multiply all individual errors. This assumes that each
individual error considerably worsens the existing problems. This does not seem
very relevant for us, because local errors do not suddenly make the rest of the body
configuration totally aberrant. On the contrary, we could simply sum the individual
errors. This assumes that a local error has absolutely no other consequences for
the global solution than itself. Again, this is a bit too extreme in our case, because
when one joint starts to really deviate from the true one, this affects our anatomical
interpretation of the whole body more than if this error were distributed between all
joints.
A good compromise in our problem between those two points of view is to take the
sum of squared errors. This criterion has also the advantage of being in accordance
with more abstract statistical measures, like the ISE (see equation 4.1). This is of
course defined for a single frame. When evaluating the results for a whole test set,
we average all individual results.
The second question is a bit more delicate to answer. Firstly, we have to recall
that among the variables of both our body models, there is the so-called global
twist, which is composed of three angles, but also of a translation in space. This
is quite problematic because the translation does not capture the same reality as
all the other variables. Even if in practice this has not led to a noticeably different
contribution to the total error, one has to bear in mind that this is basically due
to luck, as scaling the spatial unit will directly impact this fragile "equilibrium".
But even between two variables both measured in radians, are we sure that they
are really comparable? For a human being, is 180° comparable when applied to the
horizontal rotation of the neck or to the bending of the knee? Even without violating
those so-called anatomical constraints (not included in the MMM model anyway),
one could argue that we should add weighting factors based on expert judgment.
As we do not know those anatomical priors, especially for the MMM model, a
possible solution to this dilemma is to measure the error in a relative way, in
complement to the absolute error. We propose to scale each individual error on a variable
by the value of the reference's sample standard deviation for this variable. For a set
of M reference D-dimensional vectors $X$ and a set of M estimated D-dimensional
vectors $\hat{X}$, with the predicted variables being the first d ones, and with
$\bar{x}_j = \frac{1}{M}\sum_{i=1}^{M} x_{ij}$ and
$s_j = \sqrt{\frac{1}{M-1}\sum_{i=1}^{M}(x_{ij} - \bar{x}_j)^2}$
being respectively the reference's average and sample
standard deviation for variable j, this gives us the following formulas:

\[
\mathrm{AbsErr}(\hat{X}, X) = \frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{d} (\hat{x}_{ij} - x_{ij})^2 \tag{5.1}
\]

\[
\mathrm{RelErr}(\hat{X}, X) = \frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{d} \left( \frac{\hat{x}_{ij} - x_{ij}}{s_j} \right)^2 \tag{5.2}
\]
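A minimal Python sketch of equations 5.1 and 5.2 follows. The function names `abs_err` and `rel_err` and the list-of-tuples data layout are assumptions made for this illustration. `statistics.stdev` is used because it computes the sample standard deviation with the $\frac{1}{M-1}$ factor.

```python
from statistics import stdev


def abs_err(estimated, reference, d):
    """Equation 5.1: sum of squared errors over the first d variables, averaged over M."""
    m = len(reference)
    return sum((e[j] - r[j]) ** 2
               for e, r in zip(estimated, reference)
               for j in range(d)) / m


def rel_err(estimated, reference, d):
    """Equation 5.2: same as abs_err, each error scaled by the reference's s_j."""
    m = len(reference)
    if m < 2:
        raise ValueError("M must be > 1 for the sample standard deviation")
    s = [stdev([r[j] for r in reference]) for j in range(d)]
    if any(v == 0.0 for v in s):
        raise ValueError("a reference variable is constant (s_j = 0)")
    return sum(((e[j] - r[j]) / s[j]) ** 2
               for e, r in zip(estimated, reference)
               for j in range(d)) / m


# Toy data: two poses, two predicted variables.
reference = [(0.0, 0.0), (2.0, 4.0)]
estimated = [(1.0, 0.0), (2.0, 2.0)]
print(abs_err(estimated, reference, d=2))
print(rel_err(estimated, reference, d=2))
```

In the toy data the second variable carries the larger absolute error but also varies more in the reference, so both variables end up contributing equally to the relative error.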
It should be clear that the absolute and relative errors presented here measure
performance in two different ways, and cannot be directly compared. The
relative error penalizes imprecision more severely for variables that do not
vary much. In the extreme case, $s_j$ might be equal to 0, but with real data this would
hardly ever happen. Of course, M must be > 1, otherwise all $s_j$ would be indeterminate
(measuring performance on a single pose is of doubtful relevance anyway).
Notice finally that the factor $\frac{1}{M-1}$ in the sample standard deviation $s_j$ makes $s_j^2$ an
unbiased estimate of the variance $\sigma_j^2$, contrary to the alternative $\frac{1}{M}$, which underestimates it. This
way, $s_j$ will not systematically grow with M. This desirable property must nonetheless
be viewed with caution, as a condition for the total absence of bias is that
the sample values must be drawn independently, which clearly will not be the case if
the test set consists of poses belonging to the same sequence.
This section would not be complete without mentioning the relevance
of also measuring the time performance, alongside the precision. This is highly
helpful to concretely see the difference between the isotropic and anisotropic kernels,
considering their contrasting time complexities. Fortunately, time measurement is
quite a trivial and objective task.
5.2 Test set selection
In our tests, we wanted to respect the step-by-step approach presented in point 4.3.3,
so we did not run the estimations independently from frame to frame on a fixed test set.
We simulated the missing image fidelity method by the reference data itself (a perfect
estimator in that sense). Formulated that way, this might sound like a violation of
the important principle that the model should not influence the test process, but
conceptually this is not really the case.
For all frames of the test set, the variables relative to the current frame t stay
untouched, so the reference for evaluation is never influenced. Each individual (i.e.,
on a single estimated pose frame) performance measure is of course done on the
prediction alone (since we want to evaluate the predictive part only), so the help of
the perfect image fidelity method will not influence the result on the current frame.
For subsequent frames, combining past frames with perfect image fidelity is of
course advantageous compared to using prediction alone, but let us remember that
the test set as taken by default contains all the true past frames! Hence, if we let
the perfect estimate completely overwrite the prediction, it would be the same as
using the predefined test set. In other words, any other combination with the perfect
estimate can only be less precise, so that we could never artificially improve our
predictor's performance that way.
One may still argue that the principle of an inflexible test set is also meant to
avoid under-estimating the actual performance. But the whole question is then,
what has more relevance as an actual performance measure? Allowing past frames to
be perfect is unrealistic because it eliminates the need for a predictor at all. For
instance, we could return the nearest past frame as prediction for the current one.
As there is only minimal variation between successive frames, this would give a very
precise predictor running in no time (but totally useless in practice, needless to say)!
Another argument, which will constitute our conclusion in this discussion, is that
the principle of test set neutrality is not violated if we consider the whole
sequence as a single complex component to test, instead of a test set made of falsely
independent instances (the mode of combination being held fixed, conditions always
remain the same, and comparing results is fair).
This is thus how we evaluate the test data, whatever the mode of combination
chosen to merge the pose of the image fidelity method with the one of the prediction
method. One could think of many more or less sophisticated ways to merge the two
outputs. Most importantly, both methods should deliver an honest self-confidence
estimate to ensure adaptive weighting, depending on the situation. If large occlusions
suddenly appear, the image fidelity method should notice it and rely more heavily on
the results of the prediction method, and conversely if the predictor notices that the
current movement does not seem to fit its prior knowledge. For the predictor, we
propose to take into account the two measures defined in section 4.5, for assessing
its self-confidence both in a dynamic (confidence weight of estimation) and static
(ambiguity measure of training set) way.
Nevertheless, for our experiments themselves, we limited ourselves to three very
basic combinations that do not make use of those estimates. The reason is simply
that we have only a perfect image fidelity method at our disposal, so it would
not make sense to give it an artificially low self-confidence estimate. There was of
course the possibility of making the data noisy, but this would not have reflected
the imprecision of the method in a realistic way, as noise should be on the image, not
directly on the resulting poses.
Due to this limited setting, we simply combined both poses $x_{image}$ and $x_{pred}$ in a
linear way for each vector, with a scalar λ ∈ [0, 1] for the proportion:

\[
x_{out} = \lambda\, x_{image} + (1 - \lambda)\, x_{pred} \tag{5.3}
\]
Once again, we insist that in our testing framework, this is only applied to the
past frames given as input, so that the performance for the current frame is not
directly affected, whatever the choice for λ. We chose λ = 1 to always get the real
past poses, λ = 0.5 to get a simple mix and λ = 0 to rely solely on prediction. The
third combination is an especially harsh testing mode, because it conceptually means
that the camera is blinded for the whole movement once the deepest frame in the
time sequence is reached.
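The linear combination of equation 5.3 and the three tested values of λ can be sketched as below. The function name `combine_poses` and the representation of poses as plain lists of floats are assumptions made for this illustration.

```python
def combine_poses(x_image, x_pred, lam):
    """Equation 5.3: element-wise mix of the image-based and predicted poses."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie in [0, 1]")
    if len(x_image) != len(x_pred):
        raise ValueError("both poses must have the same dimensionality")
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_image, x_pred)]


x_image = [1.0, 2.0]  # toy pose from the (perfect) image fidelity method
x_pred = [3.0, 6.0]   # toy pose from the predictor
for lam in (1.0, 0.5, 0.0):
    print(lam, combine_poses(x_image, x_pred, lam))
```

With λ = 1 the past frame is the true pose, with λ = 0 it is entirely the earlier prediction, so errors can accumulate along the sequence.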
This brings us to another remark: in order for $x_{pred}$ to be defined, the predictor
must of course have been called on the corresponding frame before. Otherwise, one
must rely solely on the image fidelity method for this past frame, which is not so terrible
in the real system, but would make the testing experiments meaningless for λ < 1,
as some privileged past frames would de facto enjoy a λ = 1. This is inevitably the
case for the starting frames (as the predictor can only start when its deepest past
frame exists), but we would like to at least avoid gaps afterward. To ensure this, the
elements in the time sequence should all be multiples of the test frame step, so that
both remain synchronized.
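The synchronization condition amounts to a simple divisibility check. In the sketch below, the time sequence is represented by its backward offsets in frames (for instance 0 and 10 for a sequence t, t−10). This representation and the function name `is_synchronized` are assumptions of the illustration.

```python
def is_synchronized(offsets, frame_step):
    """True if every past-frame offset is a multiple of the test frame step."""
    if frame_step <= 0:
        raise ValueError("frame_step must be positive")
    return all(offset % frame_step == 0 for offset in offsets)


print(is_synchronized([0, 10], frame_step=10))  # sequence t, t-10 with a step of 10
print(is_synchronized([0, 15], frame_step=10))  # t-15 would never have been predicted
```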
5.3 Limitations of the methodology
Before presenting the results themselves, we would like to explain the nature of the
two main issues encountered when running the experiments. These problems were
certainly partially predictable; however, they only became truly apparent in the concrete
tests, which is why this section is placed in the present chapter.
5.3.1 Curse of dimensionality
This issue covers all indirect consequences of the so-called curse of dimensionality
sketched in point 3.1.1. At first glance, the core problem seems to be the
non-representativity of any training set of reasonable size, due to the high number of
dimensions (a minimum of 120 for the MMM model).
This pessimistic view is to be corrected, however, by the fact that the variables are
very likely to show little independence. Firstly, the dependency between successive
frames of the time sequence is precisely what justifies the use of a learner: if they were
independent, there would be nothing to predict. Secondly, the body configuration as
a whole for a given frame and a more or less defined type of movement is suspected to
lie in the neighborhood of a much lower dimensional manifold (embedding of space),
even if these manifolds are likely to vary noticeably with the complexity of the
type of movement concerned. See point 5.4 for more details.
So, it seems that we deal with a structural curse of dimensionality only on a
limited scale, which is reassuring, because a full-scale one would have meant the
practical impossibility of any prediction. Nevertheless, we still have to face the
perverse effects of a curse of dimensionality on the representation used, in the sense
that the data treated by the kernel density estimators remains of high dimensionality.
In fact, lowering the number of variables could be achieved using techniques of
nonlinear dimensionality reduction, a large number of which are very well presented
in [24]. Alas, they do not provide a two-way mapping, so we could not practically
make much use of a reduced prediction. Indeed, representing a reduced human model,
and more importantly, combining it with the unreduced pose of the image fidelity
method seems to be almost as difficult as finding a relevant parametric estimator for
the movement type concerned.
Hence, even if the curse of dimensionality does not affect the learning process
as a whole, it still has very concrete consequences on the kernel density estimation
computed. What are they? Remember that densities, having an integral of 1, tend to
spread out in all the dimensions available to them. In particular,
isotropic Gaussian kernels will directly be affected by particularly brutal numerical
problems, appearing with too high or too low bandwidths h. In both cases, the
estimated relative likelihood value for a given point risks being so dramatically close
to zero that it will fall beneath the numerical limit for double precision, which is in
the order of $10^{-308}$ according to [10].
This numerical effect "kills" the gradient if all individual kernels return 0. It is
rather hard to estimate the amount of imprecision caused by approximating very
small positive doubles in a limited format, but in any case the abrupt cut-off of any
prediction and the indeterminate gradient in the region of $10^{-308}$ were very clearly
observable in our experiments. Without getting into the technical details, we tried
to scale the result before the approximation to zero happens, but this is delicate to
do while respecting the meaning of h. Once over the fatal barrier, a bit like for the
event horizon of a black hole, there is not much to do apart from relying on a longer
floating point representation (but as with scaling, this does not solve the problem in
general, and costs more time and space).
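The underflow can be reproduced in a few lines. The sketch below assumes the textbook isotropic Gaussian kernel $(2\pi h^2)^{-D/2} \exp(-u^2 / (2h^2))$, which may differ in its normalization from equation 2.3, so the numbers are illustrative and are not meant to reproduce the bounds of figure 5.1. The function name `log_gaussian_kernel` is hypothetical.

```python
import math
import sys


def log_gaussian_kernel(u2, h, dim):
    """Logarithm of an isotropic Gaussian kernel (assumed textbook form)."""
    return -0.5 * dim * math.log(2.0 * math.pi * h * h) - u2 / (2.0 * h * h)


D = 120    # minimal number of dimensions mentioned for the MMM model
u2 = 1.0   # squared distance to the kernel centre (arbitrary)
underflow_limit = math.log(sys.float_info.min)  # log of the smallest normal double

for h in (0.01, 1.0, 1000.0):
    log_k = log_gaussian_kernel(u2, h, D)
    # exp() silently returns 0.0 once log_k is far below the limit
    print(h, log_k, log_k < underflow_limit, math.exp(log_k))
```

Under these assumptions, both the very small and the very large bandwidth drive the kernel value to exactly 0.0 in double precision, while the intermediate one remains representable.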
Figure 5.1: Upper (dashed) and lower (solid) bounds of h for varying $u^2$, D = 90 and D = 120 resp.
We also tried to use the log-likelihood instead of the likelihood in the definition
of the kernel, which allows us to discard the exponential without affecting the general
shape of the function (as the logarithm is a monotonic function). Unfortunately,
this can only be applied to the individual kernels, not to the estimator that averages
them. Indeed, we have to apply the logarithm inside $K^{iso}_h$ (resp. $K^{ani}_i$), because
if we apply it to the whole estimator $\hat{f}_{iso}$ (resp. $\hat{f}_{ani}$) only once it has been computed,
the numerical issue will already have appeared. The problem is that log(a + b) is
a monotonic function of the sum a + b (it preserves the gradient direction), whereas
log(a) + log(b) is not, and thus invalidates the relevance of the gradient. We did not find a way to
express the log-likelihood of the kernel estimator in terms of the log-likelihoods of
its individual kernels.
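For reference, a standard identity relates these quantities: $\log \sum_i e^{\ell_i} = m + \log \sum_i e^{\ell_i - m}$ with $m = \max_i \ell_i$, often called the log-sum-exp trick. The sketch below is not part of the method described here and was not evaluated in our experiments. It only illustrates the identity, with a hypothetical function name and the assumption that the individual log-kernel values $\ell_i$ are available.

```python
import math


def log_mean_exp(log_values):
    """Log of the average of exp(l_i), computed without leaving log space."""
    m = max(log_values)
    if m == -math.inf:
        return -math.inf  # every individual kernel is exactly zero
    shifted = sum(math.exp(v - m) for v in log_values)  # largest term is exp(0) = 1
    return m + math.log(shifted) - math.log(len(log_values))


# Two log-kernel values whose exponentials both underflow to 0.0 in doubles.
logs = [-1000.0, -1001.0]
print(sum(math.exp(v) for v in logs))  # naive sum of kernels
print(log_mean_exp(logs))              # stays finite in log space
```

Whether working with such a log-estimator throughout the gradient ascent would preserve the intended meaning of h in our setting is left open here.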
In fact, this issue is indeed very closely related to the dimensionality. Let us
call $u^2$ the squared Euclidean distance between x and x′, in accordance with the
terminology used in point 3.3.2. For a given value $u^2$, let us define an upper and
a lower bound for h, outside which numerical approximation to zero appears. This
means that bandwidth values lying outside this range for a given value $u^2$
stop the gradient ascent if the requested point is located at a distance v ≥ u from its
nearest neighbor among the training points.
Figure 5.1 shows how the range of possible values gets narrower with D (and with
$u^2$ of course). We compare the case of D = 90 and D = 120 (the minimal number
of dimensions in the MMM model). Notice that there is not only a broader range
of usable values for h in the left plot, but also a big contrast between the maximal
$u^2$ allowed before both bounds abruptly merge (approx. $4 \times 10^4$ for D = 90 but only
approx. 6900 for D = 120). We do not show comparisons with the univariate case,
as the upper bound is so radically higher, and stays so until $u^2$ reaches such a large
number, that one could consider it not to exist at all for non-pathological data.

Figure 5.2: Value of an isotropic kernel with fixed squared distance $u^2$ and varying h, D = 1 and D = 60 resp.
The problem only worsens considering that if we have two vectors of a fixed
number of variables, and then add other variables, the new Euclidean norm of the
difference can only grow if the added variables are not identical. This is precisely
what is done when adding several past frames in the time sequence!
To have another view on the influence of high dimensionality, consider what
happens in equation 2.3 if, instead of fixing the bandwidth and varying $u^2$, we choose a
fixed $u^2$ and observe what happens with various h. Figure 5.2 plots the value of the
kernel over varying bandwidths with $u^2$ equal to 1 and 0.99. To make the contrast
apparent, we chose to show here the plots of D = 1 and D = 60.
There are nevertheless precautions to take with this sort of plot, because the
choice of $u^2$ also has consequences, especially in the high-dimensional case. For example,
the right plot has a mode of a very large value, but when progressively increasing $u^2$,
it will shrink dramatically. Regarding the shapes obtained, let us only observe that
setting D to 1 results in a plot that bears resemblance to an inverse $\chi^2$ distribution,
whereas increasing D results in what seems to be a convergence toward a normal
distribution. Those graphical similarities do not make them valid densities however, as
those plots do not integrate to 1 in general.
To summarize the issues discussed in this point, we strongly advocate in favor
of a body model that does not comprise too many dimensions. The relevance obtained
through a more realistic model risks being jeopardized by numerical effects of the curse
of dimensionality (not to mention the impact of high dimensionality on complexity).
5.3.2 Ill-conditioned matrices
Another issue arises from the anisotropic kernel this time. This is not to say that the
anisotropic kernel is not affected by the problems of the isotropic kernel, as it uses
it to build its covariance matrices! Those are at least always positive semi-definite.
But their condition number could also be dangerously high, rendering the inverted
matrices unusable in practice, unless we raise α. Nonetheless, as discussed in 4.4.2,
this can partially waste the information gained by the computation of the
covariance matrices.
The problem we have to face in practice is that, at least for our data and our
number of dimensions, even an α smaller than what is needed to avoid
ill-conditioned matrices will comparatively dwarf the values obtained by the isotropic
kernels outside the diagonal. In other words, in order to make the anisotropic kernel usable,
one has to blur it so much that it does not show dramatic precision improvement
over the simpler isotropic equivalent.
Without much alternative to regularization through α in order to reduce the
condition number, it does not seem easy to circumvent the problem. We also tried
the Moore-Penrose pseudo-inverse, an alternative inversion technique that can also be
applied to m × n matrices. Of course, it changes nothing about the condition number,
but we had thought it might possibly give a better inverted matrix. It turns out that
the matrix obtained is indeed different, but multiplying it on the left by $(x - x_i)^T$ and
on the right by $(x - x_i)$, as is done in the anisotropic kernel (see equation 2.5), returns
the same number.
The original causes of this ill-conditioning problem are not perfectly clear to us.
It might once again have something to do with dimensionality, but a strict distinction
from the problems encountered by the isotropic kernel is quite delicate to achieve,
as the values returned by the affected isotropic kernels will also have their own
influence. We can say that a minimal threshold for α seems much less critical for
low-dimensional data. However, values for α are not really comparable on this
level, and furthermore, the data comes from a distinct source, so one should be
cautious before attributing everything solely to the curse of dimensionality.
Thus, we want to emphasize that it is quite plausible that the anisotropic kernels
do not reveal their true potential in our experiments, and that they might
perform better in other settings.
5.4 Estimation of intrinsic dimensionality
In complement to the experiments on the data itself, we were interested in knowing
a bit more about the dependencies between variables. One could naively use
Pearson's correlation, derived from the sample covariance matrix, to try to find out the
dependencies between variables. Given two variables x and y found in a sample of
size N, this coefficient basically scales the sample covariance by the product of both
sample standard deviations, and is defined as:

\[
r_{xy} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x s_y} \tag{5.4}
\]
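Equation 5.4 can be sketched in Python as follows. The function name `pearson_r` and the toy data are assumptions made for this illustration. The second data set is a deterministic but nonlinear function of x, chosen to show what the coefficient does and does not capture.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Equation 5.4: sample covariance scaled by both sample standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of the same size N >= 2")
    mx, my = mean(xs), mean(ys)
    cov_sum = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov_sum / ((n - 1) * stdev(xs) * stdev(ys))


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [3.0 * x + 1.0 for x in xs]  # perfectly linear dependence
quadratic = [x * x for x in xs]       # perfectly dependent, but not linearly

print(pearson_r(xs, linear))
print(pearson_r(xs, quadratic))
```

The quadratic case yields a coefficient of zero although y is entirely determined by x, because the positive and negative products in the numerator cancel out.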
Unfortunately, linear measures such as $r_{xy}$ detect solely linear dependencies, as
illustrated in figure 5.3, whereas we do not expect this property in human poses,
so this dependence itself is not easy to appreciate directly. We thus relied instead
on methods closely related to the domain of dimensionality reduction, namely
estimators of intrinsic dimensionality. This was done using the drtoolbox created and
distributed by Laurens van der Maaten, available at
http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html.
Resulting dimensions are expected neither to be integer numbers (think about fractal
dimensions), nor to stay the same for all types of movements.

Figure 5.3: Various bivariate data sets and the corresponding correlation coefficient (image taken from the Wikipedia entry on correlation and dependence)
Table 5.1 shows the results for 4 of the 6 intrinsic dimensionality estimators
included in the drtoolbox. GMST stands for Geodesic Minimum Spanning Tree,
and MLE stands for Maximum Likelihood Estimator. It is not surprising that the
different estimators give contrasting numbers, considering they capture the idea of
intrinsic dimensionality from different points of view. For instance, one of the 2
remaining estimators, namely the packing numbers, returned 0, which is of course
absurd in our case.
Another foreseeable result is that each estimate varies from one type of movement
to another. What is more surprising, however, is that it also varies within a single
type of movement: we would have expected much less intra-class variance.
Moreover, types of movements that intuitively appear less complex (a simple boxing
move compared to acrobatic jumps) do not necessarily have a lower estimated
dimensionality. Another surprise is that, with the exception of the eigenvalue
estimator (whose criterion is, roughly put, to cut off dimensions whose influence is
deemed "negligible" relative to the preceding one, after having ordered dimensions
by decreasing importance), the estimate of intrinsic dimensionality is actually very
low compared with the original number of variables.
Sequence                        CorrDim   GMST   EigValue   MLE
Boxschlag_hinten_2_1_Q_Human    1.48      1.86   5.00       2.03
Boxschlag_hinten_6_1_Q_Human    2.41      2.38   4.00       2.49
Boxschlag_hinten_11_1_Q_Human   2.64      2.33   4.00       2.67
Salto_rückw_3_1_Q_Human         1.71      2.64   5.00       2.34
Salto_rückw_4_1_Q_Human         2.27      2.81   7.00       2.70
Salto_rückw_6_1_Q_Human         1.74      3.19   6.00       2.98
Butterfly_Kick_1_1_Q_Human      2.20      3.40   6.00       3.39
Butterfly_Kick_3_1_Q_Human      1.53      1.57   7.00       1.67
Butterfly_Kick_5_1_Q_Human      1.99      2.56   7.00       2.73
Boxschlag_hinten_2_1_Q_MMM      1.54      1.91   5.00       2.20
Boxschlag_hinten_6_1_Q_MMM      2.26      2.43   5.00       2.42
Boxschlag_hinten_11_1_Q_MMM     2.51      2.62   6.00       2.40
Salto_rückw_3_1_Q_MMM           1.54      2.73   6.00       2.75
Salto_rückw_4_1_Q_MMM           1.73      4.25   9.00       3.61
Salto_rückw_6_1_Q_MMM           1.47      3.13   7.00       2.94
Butterfly_Kick_1_1_Q_MMM        1.84      2.26   8.00       2.17
Butterfly_Kick_3_1_Q_MMM        1.85      1.98   7.00       2.13
Butterfly_Kick_5_1_Q_MMM        1.69      3.03   6.00       3.04
Table 5.1: Number of estimated dimensions for sequences used in experiments, frame step of 10, time sequence of t, t−10
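To make the eigenvalue criterion described above more tangible, here is a minimal sketch of one possible reading of it. It is not the drtoolbox implementation: the function name, the threshold of 0.1 and the spectrum are all hypothetical. Dimensions are ordered by decreasing eigenvalue, and counting stops as soon as one eigenvalue is deemed negligible relative to the preceding one.

```python
def eigenvalue_dimension(eigenvalues, ratio_threshold=0.1):
    """Count leading dimensions until one is negligible relative to the previous one.

    ratio_threshold is a hypothetical cut-off, not the value used by the drtoolbox.
    """
    ordered = sorted(eigenvalues, reverse=True)
    dimension = 1
    for previous, current in zip(ordered, ordered[1:]):
        if current < ratio_threshold * previous:
            break  # 'current' is negligible compared with 'previous'
        dimension += 1
    return dimension

# Made-up covariance spectrum (not thesis data).
spectrum = [0.02, 5.0, 3.5, 0.01, 2.0, 1.2]
estimated_dimension = eigenvalue_dimension(spectrum)
```

Such a criterion always returns an integer, which is consistent with the EigValue column of table 5.1 being the only one without fractional values.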
Finally, to get a more graphical idea of how complex the dependencies actually
are, we tested several dimensionality reduction techniques on the sequences used
in the experiments. We present here three examples reduced via MDS in figure 5.4.
We chose this technique in particular simply because we do not really want
to use (or give explicit meaning to) the reduced variables obtained. Rather, we
are interested in mere visualization, for which MDS is especially well suited. A nice
property of MDS is that if we increase the number of dimensions desired, the initial
mapping stays relevant and simply expands into the supplementary dimension. For
instance, if we represented the fourth dimension by color, figure 5.4 would be the
same, just colorful.
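The nesting property mentioned above can be made explicit under the assumption that the classical (Torgerson) variant of MDS is meant, which the text does not state. The symbols below (n points, squared-distance matrix D^{(2)}, centering matrix J) are notation introduced here for illustration only.

```latex
% Classical MDS: double-center the squared distances, then diagonalize.
B = -\tfrac{1}{2}\, J\, D^{(2)} J,
\qquad J = I_n - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{\top},
\qquad B = V \Lambda V^{\top},
\quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n .

% The k-dimensional embedding keeps the k leading eigenpairs
% (assuming \lambda_1, \dots, \lambda_{k+1} \ge 0):
X_k = \begin{bmatrix} \sqrt{\lambda_1}\, v_1 & \cdots & \sqrt{\lambda_k}\, v_k \end{bmatrix},
\qquad
X_{k+1} = \begin{bmatrix} X_k & \sqrt{\lambda_{k+1}}\, v_{k+1} \end{bmatrix}.
```

The first k coordinates are therefore unchanged when a (k+1)-th dimension is requested. Variants of MDS that minimize a stress function iteratively do not offer this guarantee in general.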
While in all generality there is absolutely no guarantee that the mappings of
similar movements will themselves turn out to be graphically similar (MDS is not
based on topology in that sense), the plots for sequences Salto_rückw_3_1_Q_MMM
and Salto_rückw_4_1_Q_MMM do bear some structural resemblance to each
other. This similarity is also observable in the color when using 4 dimensions. It is
also very apparent that the points are not arbitrary clouds but follow the time line of
the movement. Compared to the backflip move, the plots of boxing look substantially
simpler, as the movements indeed seem to be in reality.
Figure 5.4: 3-dimensional MDS applied on various MMM sequences: boxing 6 (top, left), backflip 6 (top, right), 3 (below, left) and 4 (below, right)
5.5 Evaluation
5.5.1 Default case
One could think of many settings to evaluate the methodology (hyper-parameter
selection, model used, size and selection of the training set, and so on). We present
here only a small subset of those variations, which overall produce exploitable
results. By this, we mean results that are of "acceptable quality" if combined with
the image fidelity method, that is, that more or less respect the motion being
performed. This is, of course, a rather subjective criterion, but one can fairly discard
a setting if all tests run on it produce results that substantially mislead the tracking
over the whole sequence. As a consequence, the results presented here do not seek to
show the impact of bad hyper-parameters, and are not very contrastive. Quite
importantly, we point out that the whole process remains deterministic, so repeated
runs with one setting should differ only in the time they take.
We will refer to table 5.2 as our default case. In other words, subsequent variations
are made relative to this setting. We chose the MMM model, because it does not
have anatomical constraints that might be violated, and the visual results tend to be
more convincing. We chose a frame step of 10 for both training and test sets because
of the significant speed-up this brings (×100, since both the training and the test
frames are subsampled by a factor of 10), which is especially convenient for the
evaluation of the anisotropic kernels. The time sequence covers only one past frame
to avoid issues arising from high dimensionality. The bandwidth is set to the average
nearest neighbor distance.
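As an illustration of the bandwidth choice just mentioned, the following minimal sketch computes the average and the maximum nearest neighbor distance of a training set by brute force. It assumes Euclidean distances, and the function names and toy data are hypothetical. It is not the code used for the experiments.

```python
import math

def nearest_neighbor_distances(samples):
    """For each sample, the Euclidean distance to its closest other sample."""
    distances = []
    for i, x in enumerate(samples):
        best = math.inf
        for j, y in enumerate(samples):
            if i != j:
                best = min(best, math.dist(x, y))
        distances.append(best)
    return distances

def bandwidth_candidates(samples):
    """Average and maximum nearest neighbor distance, two candidate values for h."""
    nn = nearest_neighbor_distances(samples)
    return sum(nn) / len(nn), max(nn)

# Toy 2-D training set (made-up values, not thesis data).
toy_samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
h_average, h_maximum = bandwidth_candidates(toy_samples)
```

On this toy set the isolated point (5, 5) alone determines the maximum, which shows how a single outlying sample can inflate that choice of h, whereas the average is less affected.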
In table 5.2, as well as in all subsequent tables, results are expressed in terms
of absolute ('Abs') and relative ('Rel') error, as defined in section 5.1. Each test
is always made with the three ways of determining past frames: 'Pred' (100%
predicted values), 'Mix' (50% predicted and 50% real values) or 'Real' (100% real
values), according to the λ of equation 5.3. Among the categories of motion at our
disposal, we chose boxing moves, backflips and butterfly kicks, because intuitively
they seem to be of varying complexity. The selection of the sequences taken for the
evaluation within the corresponding categories is arbitrary, as is the split between
training and test sets. Time is expressed in seconds (as measured on a dual-core
1.6 GHz laptop), and does not cover the animation of the sequence. Sequences last
between 201 (boxing 6 and 11) and 901 frames (backflip 3), with more or less the
same number of frames within each category.
When looking at the results of table 5.2, we notice, as expected, that the manner
of determining past frames has a clear impact on performance. What is perhaps
less intuitive is the noticeable variation of error between the test sets. Notice,
however, that relative errors are defined with respect to the reference data, which
makes direct comparison inadequate (fortunately, two different settings on the same
test set stay comparable). Absolute errors are more neutral in that sense, especially
when the motion category is the same, and they also vary noticeably. This is also
observable when examining the predictions visually.
Train             Test              Pred                Mix                 Real                Time (s)
                                    Abs       Rel       Abs       Rel       Abs       Rel
Boxing 2          Boxing 6          94.86     2.05e+3   81.64     1.95e+3   79.42     1.88e+3   23.89
                  Boxing 11         30.10     8.18e+4   25.86     7.71e+4   18.97     7.20e+4   14.77
Backflip 3        Backflip 4        32.88     1.20e+2   8.25      7.61e+1   3.83      4.76e+1   48.86
                  Backflip 6        45.97     1.31e+2   25.22     8.05e+1   14.76     5.97e+1   48.31
Butterfly kick 1  Butterfly kick 3  95.49     1.20e+2   47.77     1.45e+2   33.52     1.21e+2   61.97
                  Butterfly kick 5  103.69    1.78e+2   94.61     1.66e+2   57.75     1.42e+2   67.38
Table 5.2: Results for isotropic kernels on the MMM model, frame step of 10 for both training and test, time sequence of t, t−10, h = average nearest neighbor distance
In the variations we are going to consider, we deliberately omit the results for
the human model. This is because the meaning of the two error measures is intrinsic
to the model used, and direct comparison can only be made on video, so all results
would have to be presented twice, once for each model. For space reasons, we simply
decided to leave these results in electronic form. Visually speaking, predictions on
the human model tend to be of lower quality because the estimate frequently
degenerates to a likelihood of 0, although this model should intuitively be less prone
to suffer from the effects of high dimensionality with multiple past frames (42
parameters instead of 60 in the MMM model).
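One plausible cause of a likelihood degenerating to 0 is floating-point underflow of the Gaussian kernel when squared distances are large relative to h. This is an assumption on our part, as the cause is not detailed at this point of the text. The sketch below uses made-up distances and hypothetical function names. It compares a naive sum of kernels with the same quantity computed in the log domain.

```python
import math

def naive_kernel_sum(squared_distances, h):
    """Sum of unnormalized isotropic Gaussian kernels exp(-u^2 / (2 h^2))."""
    return sum(math.exp(-u2 / (2.0 * h * h)) for u2 in squared_distances)

def log_kernel_sum(squared_distances, h):
    """Logarithm of the same sum, computed with the log-sum-exp trick."""
    exponents = [-u2 / (2.0 * h * h) for u2 in squared_distances]
    largest = max(exponents)
    return largest + math.log(sum(math.exp(e - largest) for e in exponents))

# Made-up squared distances, large relative to h (not thesis data).
squared_distances = [2000.0, 2100.0, 2500.0]
h = 1.0
naive_value = naive_kernel_sum(squared_distances, h)  # underflows in double precision
log_value = log_kernel_sum(squared_distances, h)      # remains finite
```

Working with log-likelihoods keeps the ranking of candidate poses meaningful even when every individual kernel value is too small to be represented.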
5.5.2 Some graphical results
Before presenting the numerical tables for the variations we retained here, we would
like to illustrate part of the preceding results in a more concrete form. Figures 5.5
and 5.6 show 6 frames of the sequences Boxing 11 and Backflip 4 respectively.
The noticeable errors of the prediction in figure 5.5 are the wrong twist of the left
hand and the bad time synchronization of the motion (the blow is delivered too
late). When watching the video sequence, it also appears that the return of the fist
remains somewhat shaky.
Although the prediction may appear less faithful to the sequence in figure 5.6, this
is actually much more a question of time synchronization, as the predicted motion is
quite smooth and matches the reference sequence on the whole (we do not show here
the predictions based on the real past frames, which are even more precise).
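A synchronization offset of this kind could be quantified, for a single pose variable, by searching for the shift that best aligns the predicted curve with the reference. The following sketch is only an illustration on synthetic curves: the function names and the mean-squared-difference criterion are our own assumptions, and it is not part of the evaluation carried out in this work.

```python
import math

def mean_squared_difference(reference, predicted, lag):
    """Mean squared difference between reference[t] and predicted[t + lag] on the overlap."""
    pairs = [(reference[t], predicted[t + lag])
             for t in range(len(reference))
             if 0 <= t + lag < len(predicted)]
    return sum((r - p) ** 2 for r, p in pairs) / len(pairs)

def estimate_lag(reference, predicted, max_lag):
    """Shift (in frames) that best aligns the prediction with the reference."""
    return min(range(-max_lag, max_lag + 1),
               key=lambda lag: mean_squared_difference(reference, predicted, lag))

# Synthetic curve for one variable; the 'prediction' is the same curve, 3 frames late.
reference = [math.sin(0.2 * t) for t in range(100)]
predicted = [math.sin(0.2 * (t - 3)) for t in range(100)]
best_lag = estimate_lag(reference, predicted, max_lag=10)
```

A positive lag means that the prediction trails the reference, as in the late blow of figure 5.5. A negative lag would mean that it anticipates the motion.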
Figure 5.5: Sequence for Boxing 11 with 40 frames between each image, reference (top) and prediction of table 5.2 with real past frames (bottom)
Figure 5.6: Sequence for Backflip 4 with 160 frames between each image, reference (top) and prediction of table 5.2 with mixed past frames (bottom)
Figure 5.7: Values of absolute (left) and relative (right) error over time for test of table 5.2 on Boxing 6
Figure 5.7 shows the distribution of error over the tested frames of Boxing 6 (with
a test frame step of 10, this means 21 frames out of a total of 201). Obviously, the
quality of the prediction fluctuates with time. Here, there is a brief passage where
the prediction with real past frames suddenly matches the actual pose very well,
corresponding to the blow itself in the middle of the motion. A bit surprising is that
better past frames do not systematically dominate less precise ones for each frame,
although this is almost always the case for the average error.
In the preceding case, the shapes of the two error functions were quite similar.
But this is not always so, as absolute and relative errors do not measure the same
notion. In figure 5.8, the similarity between absolute and relative error is much less
clear.
The preceding plots were not very smooth, but since the number of frames is small,
this is not surprising. However, this also seems to be the case for longer sequences,
such as Backflip 4, as can be seen in figure 5.9. The similarity between the two error
functions is partially present in their shape, in the sense that local maxima and
minima more or less correspond, even if the levels themselves differ.
Figure 5.8: Values of absolute (left) and relative (right) error over time for test of table 5.2 on Boxing 11
Figure 5.9: Values of absolute (left) and relative (right) error over time for test of table 5.2 on Backflip 4
Another interesting plot, although not easy to interpret, shows all variables
of the reference and predicted poses over time. Notice that we are only interested
here in the prediction alone, so we do not represent the values of the variables for the
past frames. As previously said, there are 60 variables to describe the current pose in
the MMM model. Figures 5.10 and 5.11 illustrate this for sequence Backflip 4, with all
variables grouped together and considered individually, respectively.
Figure 5.10: Global plot of the variables over time for reference (top) and prediction of table 5.2 with mixed past frames (bottom), on Backflip 4
Figure 5.11: All individual plots of the variables for reference (blue) and prediction of table 5.2 with mixed past frames (red), on Backflip 4
As can be seen in figure 5.10, the predicted plot looks less "messy" and chaotic
than the actual plot. This does not mean that it is smoother on an individual scale,
as can be seen more clearly in figure 5.11. This suggests that the proposed model
underestimates the complexity of the actual model, in other words, that the
learning model suffers from underfitting. Figure 5.11 also helps to detect the joints
for which the prediction performs worst. Here, one can see, e.g., that the third
variable of the first row and the second variable of the fourth row do not fit the
reference at all. These variables correspond to the z coordinate of the global twist
and to the y coordinate of the right elbow, respectively.
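The visual inspection described above could be complemented by ranking the pose variables by their average discrepancy over time. The sketch below is a hypothetical helper on toy data. The per-variable mean absolute difference it uses is our own choice and is not necessarily the absolute error defined in section 5.1.

```python
def mean_absolute_error_per_variable(reference, predicted):
    """reference, predicted: lists of frames, each frame being a list of pose variables."""
    n_frames = len(reference)
    n_variables = len(reference[0])
    return [sum(abs(reference[t][v] - predicted[t][v]) for t in range(n_frames)) / n_frames
            for v in range(n_variables)]

def worst_variables(reference, predicted, count):
    """Indices of the 'count' variables with the largest mean absolute error."""
    errors = mean_absolute_error_per_variable(reference, predicted)
    order = sorted(range(len(errors)), key=lambda v: errors[v], reverse=True)
    return order[:count]

# Toy sequences with 3 frames and 4 variables (made-up values, not thesis data).
reference = [[0.0, 1.0, 2.0, 3.0], [0.1, 1.1, 2.1, 3.1], [0.2, 1.2, 2.2, 3.2]]
predicted = [[0.0, 1.0, 2.5, 3.0], [0.1, 1.3, 2.9, 3.1], [0.2, 1.2, 3.0, 3.3]]
worst_two = worst_variables(reference, predicted, 2)
```

On real data, the returned indices would still have to be mapped back to joints and coordinates, as done by hand for the global twist and the right elbow above.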
5.5.3 Maximum nearest neighbor distance
The setting retained here is the same as for the default case, except that h is now
equal to the maximum nearest neighbor distance. Table 5.3 summarizes the results.
We indicated in bold the errors that are lower than their equivalent in the default
case (we did not apply this to time, because variations are much more likely to be
due to external causes, such as a lag in Matlab for the time taken by Boxing 6). This
setting systematically dominates the default one for 'Pred' past frames, but is less
advantageous when past frames are more trustworthy. We consider, however, that
'Mix' and 'Real' past frames should be granted more relevance, as in practice the
image fidelity would help to keep the tracking consistent until ambiguities appear.
Notice that, comparatively speaking, this setting tends to perform better when
evaluated on relative error than on absolute error.
Train             Test              Pred                Mix                 Real                Time (s)
                                    Abs       Rel       Abs       Rel       Abs       Rel
Boxing 2          Boxing 6          92.85     1.97e+3   87.85     1.98e+3   85.87     1.99e+3   99.75
                  Boxing 11         24.61     8.02e+4   22.43     8.06e+4   21.69     8.08e+4   8.43
Backflip 3        Backflip 4        30.17     9.79e+1   9.28      6.94e+1   4.46      4.98e+1   40.73
                  Backflip 6        41.53     1.22e+2   21.94     7.85e+1   16.42     5.97e+1   42.74
Butterfly kick 1  Butterfly kick 3  81.06     1.05e+2   54.99     9.78e+1   46.34     1.05e+2   40.95
                  Butterfly kick 5  98.07     1.46e+2   95.17     1.47e+2   81.71     1.37e+2   46.41
Table 5.3: Results for isotropic kernels on the MMM model, frame step of 10, time sequence of t, t−10, h = maximum nearest neighbor distance
In point 4.4.1, we already suggested that the maximum nearest neighbor distance
as a value selector for the bandwidth may have a tendency to blur the actual
complexity of the motion (in other words, to underfit it). Figure 5.12 illustrates this
more concretely, when we compare it with the plots of figure 5.10. Not only is the
inner structure less intricate, but the curves are also smoother than with the average
nearest neighbor distance.
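The blurring effect of a large bandwidth can be reproduced on a one-dimensional toy example. This is a minimal sketch with made-up samples and arbitrary bandwidths, unrelated to the pose data. It counts the local maxima of a Gaussian kernel density estimate evaluated on a regular grid.

```python
import math

def kde(x, samples, h):
    """One-dimensional Gaussian kernel density estimate at x with bandwidth h."""
    norm = len(samples) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-((x - s) ** 2) / (2.0 * h * h)) for s in samples) / norm

def count_modes(samples, h, lo=-6.0, hi=6.0, steps=240):
    """Count strict local maxima of the estimate on a regular grid over [lo, hi]."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    values = [kde(x, samples, h) for x in grid]
    return sum(1 for i in range(1, steps)
               if values[i - 1] < values[i] > values[i + 1])

# Two well-separated clusters (made-up values, not thesis data).
samples = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]
modes_small_h = count_modes(samples, h=0.5)   # both clusters remain visible
modes_large_h = count_modes(samples, h=3.0)   # the clusters are merged into one bump
```

With the small bandwidth the estimate keeps one mode per cluster, whereas the large bandwidth merges them into a single smooth bump, which is the one-dimensional analogue of the loss of inner structure observed in figure 5.12.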
Figure 5.12: Global plot of the variables over time for prediction of table 5.3 with mixed past frames, on Backflip 4
5.5.4 Anisotropic kernels
We present here results for two values of α, namely h/3 and h/5 (more extreme
values quickly degrade performance). As for the preceding variation, bold numbers
indicate lower error than in the default case. Once again, we consider 'Pred' past
frames to be less relevant. The quality of the results is far from exceptional compared
to the isotropic kernel, considering the considerable difference in time cost. Of course,
with this data the computational cost remains quite reasonable, but we have to bear
in mind that the training and test samples are very reduced here. So if a trade-off
comes into play, the precision that might be gained by anisotropic kernels is largely
outweighed by the much larger time cost. As an illustration of the predicted poses,
we can also compare figure 5.13 with the previous corresponding plots. It appears
slightly more intricate than the prediction of figure 5.10, but this is a bit subjective.
Train             Test              Pred                Mix                 Real                Time (s)
                                    Abs       Rel       Abs       Rel       Abs       Rel
Boxing 2          Boxing 6          95.07     2.06e+3   85.63     1.95e+3   74.35     1.59e+3   191.86
                  Boxing 11         30.10     8.16e+4   25.69     7.65e+4   18.32     6.70e+4   187.82
Backflip 3        Backflip 4        32.88     1.20e+2   8.31      7.82e+1   3.76      4.76e+1   1352.74
                  Backflip 6        45.94     1.30e+2   19.10     7.47e+1   13.22     5.58e+1   1337.57
Butterfly kick 1  Butterfly kick 3  95.75     1.24e+2   44.36     1.41e+2   37.89     1.45e+2   1156.57
                  Butterfly kick 5  104.33    1.82e+2   90.97     1.64e+2   69.65     1.89e+2   1239.76
Table 5.4: Results for anisotropic kernels on the MMM model, frame step of 10 for both training and test, time sequence of t, t−10, h = average nearest neighbor distance, α = h/3
Train             Test              Pred                Mix                 Real                Time (s)
                                    Abs       Rel       Abs       Rel       Abs       Rel
Boxing 2          Boxing 6          89.63     2.23e+3   86.91     1.98e+3   83.06     1.91e+3   401.02
                  Boxing 11         30.18     8.15e+4   24.86     8.12e+4   18.45     6.90e+4   174.03
Backflip 3        Backflip 4        32.85     1.20e+2   7.93      7.37e+1   3.83      4.91e+1   1596.69
                  Backflip 6        45.89     1.30e+2   17.39     7.36e+1   13.45     5.64e+1   1286.90
Butterfly kick 1  Butterfly kick 3  95.82     1.26e+2   44.50     1.46e+2   36.30     1.39e+2   1194.74
                  Butterfly kick 5  104.47    1.83e+2   90.54     1.63e+2   73.25     2.06e+2   2531.14
Table 5.5: Results for anisotropic kernels on the MMM model, frame step of 10 for both training and test, time sequence of t, t−10, h = average nearest neighbor distance, α = h/5
Figure 5.13: Global plot of the variables over time for prediction of table 5.5 with mixed past frames, on Backflip 4
5.5.5 Longer time sequence
We will now briefly present one of the results obtained with other time sequences.
It turned out in preliminary experiments that long time sequences quickly suffer
from the effects of high dimensionality (at least with relatively large body models
such as MMM), and from the related numerical problems discussed previously. Thus,
we selected a basic case where the problem does not appear. We simply add t − 20
to the samples, which corresponds to the nearest available past frame before t − 10
(remember from point 5.2 that the test frame step should be synchronized with the
time sequence to ensure that past frames respect the rules defined for 'Mix' and
'Pred').
Train             Test              Pred                Mix                 Real                Time (s)
                                    Abs       Rel       Abs       Rel       Abs       Rel
Boxing 2          Boxing 6          95.12     2.06e+3   65.57     1.47e+3   80.35     1.87e+3   14.96
                  Boxing 11         30.16     8.16e+4   27.52     7.85e+4   20.60     7.28e+4   15.17
Backflip 3        Backflip 4        32.85     1.20e+2   11.40     9.13e+1   5.51      5.89e+1   55.38
                  Backflip 6        45.91     1.31e+2   28.03     8.46e+1   16.08     6.41e+1   49.88
Butterfly kick 1  Butterfly kick 3  95.55     1.20e+2   98.01     1.22e+2   40.54     1.35e+2   66.01
                  Butterfly kick 5  103.94    1.78e+2   93.38     1.65e+2   70.36     1.48e+2   59.62
Table 5.6: Results for isotropic kernels on the MMM model, frame step of 10 for both training and test, time sequence of t, t−10, t−20, h = average nearest neighbor distance
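The effect of a longer time sequence on the dimensionality of the samples can be sketched as follows. We assume here that a sample is simply the concatenation of the pose at t with the poses at the chosen past offsets. The function name, the ordering of the concatenation and the placeholder poses are illustrative and not taken from the implementation.

```python
def build_samples(frames, offsets):
    """Concatenate the pose at t with the poses at t - offset, for each offset.

    frames:  list of pose vectors, one per frame
    offsets: past offsets in frames, e.g. [10] for (t, t-10), [10, 20] for (t, t-10, t-20)
    """
    start = max(offsets)  # first frame for which every past frame exists
    samples = []
    for t in range(start, len(frames)):
        sample = list(frames[t])
        for offset in offsets:
            sample.extend(frames[t - offset])
        samples.append(sample)
    return samples

POSE_DIM = 60  # variables per pose in the MMM model, as stated in the text
frames = [[float(t)] * POSE_DIM for t in range(201)]  # placeholder poses, not real data

default_samples = build_samples(frames, [10])       # time sequence of t, t-10
longer_samples = build_samples(frames, [10, 20])    # time sequence of t, t-10, t-20
```

Under this assumption, adding t − 20 raises the dimensionality of each sample from 120 to 180 for the MMM model, while slightly reducing the number of frames for which a complete sample exists.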
Results are summarized in table 5.6. As always, errors that are lower than in the
default case are marked in bold. The supplementary knowledge gained by looking
deeper into the past does not seem to have had a positive impact overall. Of course,
one could also argue that this case is not representative of the principle, but in
general we have been a bit disappointed by the performance gained through multiple
past frames. An explanation for this phenomenon might be that the Markov
assumption (i.e., a state depends only on the previous one) works reasonably well in
the context of human motion.
Once again, we can compare the case of Backflip 4 on mixed past frames in figure
5.14. The level of complexity seems to be more or less the same as for the prediction
of the default case. An interesting detail appears when the predicted plots presented
here are aligned vertically: local maxima appear at roughly the same moment for all
the other plots, but for the extended time sequence they appear a few frames earlier.
When there is a correspondence with the reference plot, its maxima also come
earlier. It would appear that this setting can anticipate the general tendency of the
motion to come slightly better. Despite this property, the default setting performs
better both numerically and visually on video in this case.
Figure 5.14: Global plot of the variables over time for prediction of table 5.6 with mixed past frames, on Backflip 4
6. Conclusion
He has no cognizance even of the
number Two; nor has he a thought
of Plurality; for he is himself his
One and All, being really Nothing.
Yet mark his perfect
self-contentment, and hence learn
this lesson, that to be self-contented
is to be vile and ignorant, and that
to aspire is better than to be
blindly and impotently happy.
Edwin A. Abbott, Flatland
One of the main weaknesses of this application of kernel density estimation is
certainly the volatility of quality depending on the setting and the data used. We
think it is quite fundamental to avoid following the results blindly, without cross-
checking their relevance by the complementary image fidelity method. Another major
issue is that too high dimensionality has a clearly negative impact on both time and
performance, limiting the potential complexity of both the model and of the past
frames.
Quite surprisingly, the more elaborate anisotropic kernels do not seem to
systematically outperform their isotropic counterparts on our data. Not only do they
add a hyper-parameter that is delicate to optimize, they also consume vastly more
resources. They may perform better on larger training sets less prone to overfitting,
but the computational cost quickly becomes prohibitive when compared with the
isotropic alternative. If the anisotropic kernels are forced to select only a subset of the
available data because of time constraints, it might be a better deal to rely on the
isotropic kernels instead, as kernels are nonparametric estimators and as such are
guaranteed, under suitable conditions on the bandwidth, to converge to the true
density when more samples are added.
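The convergence argument invoked above can be stated a little more precisely. The notation below (n samples x_i in dimension D, kernel K, bandwidth h_n depending on n) is introduced here for illustration, and the statement is the standard consistency result for kernel estimators, given under the usual regularity assumptions on K rather than derived in this work.

```latex
% Kernel density estimate built from n samples in dimension D:
\hat{p}_n(x) = \frac{1}{n\, h_n^{D}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h_n}\right)

% Consistency at every continuity point x of the true density p:
h_n \to 0
\quad\text{and}\quad
n\, h_n^{D} \to \infty
\quad (n \to \infty)
\;\;\Longrightarrow\;\;
\mathbb{E}\!\left[\big(\hat{p}_n(x) - p(x)\big)^{2}\right] \to 0 .
```

The second condition shows why high dimensionality is costly: for a given bandwidth, the number of samples required grows quickly with D.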
While all those limitations must be kept in mind, there are nonetheless many
potential applications of prior knowledge when it is carefully combined with a more
classic approach to tracking. In this context, we propose to put somewhat into
perspective reasonable levels of errors on each joint of the predicted poses (low to
average lack of precision globally). The image-based method is likely to be much
more precise when there is no visual ambiguity, but can otherwise fail to configure
the pose correctly on the few ambiguous joints (large lack of precision locally). This
complementarity is the reason why we insist on the meaningfulness of delivering
self-confidence estimates to balance both methods harmoniously.
We would like to mention a few perspectives on what could be done to improve the
technique. Firstly, hyper-parameters might be tuned more finely with more
sophisticated heuristics, or optimized automatically in an efficient way. Secondly,
when dealing with larger training sets, an approximation of the Gaussian kernels may
also be obtained by specific clustering techniques (such as the one presented in [12]).
This would be especially helpful for making anisotropic kernels tractable without
sacrificing generality by truncating the available data. An interesting idea to improve
evaluation would be to weight errors on joints depending on anatomical or
model-based criteria, such as the position in the kinematic chain (an error close to the
root influences the overall pose more than an error on an end effector does). Finally,
we thought it would be interesting to let the frame steps vary with the current velocity
of the motion, which would allow quick variations to be analyzed with more scrutiny
than slower ones (synchronizing might be delicate, though).
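The idea of weighting joint errors by their position in the kinematic chain could, for instance, take the following form. This is purely a sketch of the perspective mentioned above: the exponential decay, its value of 0.5 and the four-joint chain are hypothetical choices that have not been evaluated.

```python
def depth_weights(depths, decay=0.5):
    """Weight each joint by decay**depth (root has depth 0), normalized to sum to 1."""
    raw = [decay ** d for d in depths]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_error(joint_errors, depths, decay=0.5):
    """Weighted sum of per-joint errors, giving more importance to joints near the root."""
    weights = depth_weights(depths, decay)
    return sum(w * e for w, e in zip(weights, joint_errors))

# Hypothetical chain: root -> shoulder -> elbow -> hand.
depths = [0, 1, 2, 3]
error_at_root = weighted_error([10.0, 0.0, 0.0, 0.0], depths)
error_at_hand = weighted_error([0.0, 0.0, 0.0, 10.0], depths)
```

With these weights, the same raw error counts eight times more at the root than at the hand. Any other monotonically decreasing weighting would express the same intent.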
To conclude, the use of kernel density estimators in human motion tracking does
certainly make sense if employed wisely. Throughout this work, we have tried to put
the emphasis on the study of the feasibility of the proposed methodology, and we
hope to have conducted the analysis of its strengths and limitations with as much
fairness as possible.
List of Figures
2.1 From left to right: stick, 2D contours and volumetric human models (images taken from [2])
2.2 Two local ranges for the elbow, depending on the shoulder configuration (image taken from [17])
2.3 Sketch of the methodology proposed in [33] (image taken from it)
3.1 3rd and 5th order approximations of the Hilbert curve
3.2 Missing global maximum (red is high, blue low)
3.3 Estimators receiving samples drawn from the standard normal: three first plots are histograms, fourth plot is a parametric estimator for normal distributions
3.4 Epanechnikov kernel using various univariate conversions, from darker to lighter: Euclidean norm, Manhattan distance, ∞-norm, multiplicative kernels
3.5 Common kernel types
3.6 Pitfalls of good-looking curves (100 observed points, h = 0.1 and 1.5 respectively; figure idea taken from [38])
4.1 Isotropic kernel estimation of a uniform spiral (1000 observed points)
4.2 Two views of an isotropic kernel estimation of the open-box manifold (each point is equiprobable), depending on the value of the tolerance threshold (500 observed points)
4.3 The MMM joint model during a boxing move
4.4 Step-by-step estimate, test frame step = 1, time sequence of t and t − 1
4.5 Evolution of average (solid blue) and maximum (dashed green) nearest neighbor distance with sample size varying from 10 to 1000, D = 1 and D = 100 resp.
4.6 Curves of ambiguity repartition obtained by histograms (see text for description)
5.1 Upper (dashed) and lower (solid) bounds of h for varying u², D = 90 and D = 120 resp.
5.2 Value of an isotropic kernel with fixed squared distance u² and varying h, D = 1 and D = 60 resp.
5.3 Various bivariate data sets and the corresponding correlation coefficients (image taken from the Wikipedia entry on correlation and dependence)
5.4 3-dimensional MDS applied on various MMM sequences: boxing 6 (top, left), backflip 6 (top, right), 3 (below, left) and 4 (below, right)
5.5 Sequence for Boxing 11 with 40 frames between each image, reference (top) and prediction of table 5.2 with real past frames (bottom)
5.6 Sequence for Backflip 4 with 160 frames between each image, reference (top) and prediction of table 5.2 with mixed past frames (bottom)
5.7 Values of absolute (left) and relative (right) error over time for test of table 5.2 on Boxing 6
5.8 Values of absolute (left) and relative (right) error over time for test of table 5.2 on Boxing 11
5.9 Values of absolute (left) and relative (right) error over time for test of table 5.2 on Backflip 4
5.10 Global plot of the variables over time for reference (top) and prediction of table 5.2 with mixed past frames (bottom), on Backflip 4
5.11 All individual plots of the variables for reference (blue) and prediction of table 5.2 with mixed past frames (red), on Backflip 4
5.12 Global plot of the variables over time for prediction of table 5.3 with mixed past frames, on Backflip 4
5.13 Global plot of the variables over time for prediction of table 5.5 with mixed past frames, on Backflip 4
5.14 Global plot of the variables over time for prediction of table 5.6 with mixed past frames, on Backflip 4
Bibliography
[1] Edwin A. Abbott. Flatland: A Romance of Many Dimensions. Oxford World's
Classics, 1884.
[2] J. K. Aggarwal and Q. Cai. Human motion analysis: A review. Computer
Vision and Image Understanding, 73(3):428-440, March 1999.
[3] Yoshua Bengio. Gradient-based optimization of hyper-parameters. Neural
Computation, 12(8):1889-1900, August 2000.
[4] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is
"nearest neighbor" meaningful? Lecture Notes in Computer Science,
1540:217-235, 1999.
[5] Thomas Brox, Bodo Rosenhahn, Daniel Cremers, and Hans-Peter Seidel.
Nonparametric density estimation with adaptive, anisotropic kernels for human
motion tracking. In Human Motion - Understanding, Modeling, Capture and
Animation, Second Workshop, Human Motion 2007, Rio de Janeiro, Brazil, October
20, 2007, Proceedings, volume 4814 of Lecture Notes in Computer Science, pages
152-165. Springer, 2007.
[6] Thomas Brox, Bodo Rosenhahn, Uwe G. Kersting, and Daniel Cremers.
Nonparametric density estimation for human pose tracking. In Pattern Recognition,
28th DAGM Symposium, Berlin, Germany, September 12-14, 2006, Proceedings,
volume 4174 of Lecture Notes in Computer Science, pages 546-555. Springer,
2006.
[7] Allison Bruce and Geoffrey Gordon. Better motion prediction for
people-tracking. In Proceedings of the IEEE International Conference on Robotics
and Automation (ICRA), February 04 2004.
[8] Stefano Corazza, Lars Mündermann, Emiliano Gambaretto, Giancarlo Ferrigno,
and Thomas P. Andriacchi. Markerless motion capture through visual hull,
articulated ICP and subject specific model generation. International Journal of
Computer Vision, 87(1-2):156-169, 2010.
[9] S. L. Dockstader and A. M. Tekalp. Multiple camera tracking of interacting
and occluded human motion. Proceedings of IEEE, 89(10):1441-1455, October
2001.
[10] Sun Documentation. Chapter 2: IEEE arithmetic.
http://docs.sun.com/source/806-3568/ncg_math.html.
[11] A. M. Elgammal and C. S. Lee. The role of manifold learning in human motion
analysis. In Human Motion Understanding, Modelling, Capture, and Animation,
page 2, 2008.
[12] Ahmed M. Elgammal, Ramani Duraiswami, and Larry S. Davis. Efficient kernel
density estimation using the fast Gauss transform with applications to color
modeling and tracking. IEEE Trans. Pattern Anal. Mach. Intell,
25(11):1499-1504, 2003.
[13] Ahmed M. Elgammal and Chan-Su Lee. Tracking people on a torus. IEEE
Trans. Pattern Anal. Mach. Intell, 31(3):520-538, 2009.
[14] Keith Grochow, Steven L. Martin, Aaron Hertzmann, and Zoran Popović.
Style-based inverse kinematics. ACM Transactions on Graphics, 23(3):522-531,
August 2004.
[15] Jihun Ham, Daniel D. Lee, Sebastian Mika, and Bernhard Schölkopf. A kernel
view of the dimensionality reduction of manifolds. In Carla E. Brodley,
editor, Machine Learning, Proceedings of the Twenty-first International
Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, volume 69 of ACM
International Conference Proceeding Series. ACM, 2004.
[16] David Haussler. Quantifying inductive bias: AI learning algorithms and
Valiant's learning framework. Artificial Intelligence, 36(2):177-221, September
1988.
[17] L. Herda, R. Urtasun, and P. Fua. Hierarchical implicit surface joint limits
for human body tracking. Computer Vision and Image Understanding: CVIU,
99(2):189-209, August 2005.
[18] Julian Hoch. Model-based 3D object classification for humanoid robots using
semi-global features. Master's thesis, Karlsruhe Institute of Technology, April
2010.
[19] Nicholas R. Howe, Michael E. Leventon, and William T. Freeman. Bayesian
reconstruction of 3D human motion from single-camera video. In Sara A. Solla,
Todd K. Leen, and Klaus-Robert Müller, editors, NIPS, pages 820-826. The
MIT Press, 1999.
[20] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures
of local experts. Neural Computation, 3:79-87, 1991.
[21] C. N. Kuruwita. A Bayesian approach for bandwidth selection in kernel density
estimation. userwww.service.emory.edu/~cmagnan/ACEStalks/Kuruwita.ppt.
[22] Chan-Su Lee and Ahmed M. Elgammal. Homeomorphic manifold analysis:
Learning decomposable generative models for human motion analysis. In René
Vidal, Anders Heyden, and Yi Ma, editors, WDV, volume 4358 of Lecture Notes
in Computer Science, pages 100-114. Springer, 2006.
[23] Chan-Su Lee and Ahmed M. Elgammal. Human motion synthesis by motion
manifold learning and motion primitive segmentation. In Francisco J. Perales
López and Robert B. Fisher, editors, Articulated Motion and Deformable
Objects, 4th International Conference, AMDO 2006, Port d'Andratx, Mallorca,
Spain, July 11-14, 2006, Proceedings, volume 4069 of Lecture Notes in Computer
Science, pages 464-473. Springer, 2006.
[24] John A. Lee and Michel Verleysen. Nonlinear Dimensionality Reduction.
Springer, 2007.
[25] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[26] Lars Mündermann, Stefano Corazza, and Thomas P. Andriacchi. Markerless hu-
man motion capture through visual hull and articulated icp. In NIPS Workshop
on Evaluation of Articulated Human Motion and Pose Estimation, 2006.
[27] Thomas B. Moeslund, Adrian Hilton, and Volker Krüger. A survey of advances
in vision-based human motion capture and analysis. Computer Vision and Image
Understanding, 104(2-3):90–126, 2006.
[28] E. Parzen. On estimation of a probability density function and mode. Annals
of Mathematical Statistics, 33:1065–1076, September 1962.
[29] Vladimir Pavlovic, James M. Rehg, and John MacCormick. Learning switching
linear models of human motion. In Todd K. Leen, Thomas G. Dietterich, and
Volker Tresp, editors, NIPS, pages 981–987. MIT Press, 2000.
[30] Rómer Rosales and Stan Sclaroff. Learning body pose via specialized maps.
In Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, editors,
Advances in Neural Information Processing Systems 14 [Neural Information
Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001,
Vancouver, British Columbia, Canada], pages 1263–1270. MIT Press, 2001.
[31] Stephan R. Sain. Multivariate locally adaptive density estimation, 2002.
[32] H. S. Seung and D. D. Lee. The manifold ways of perception. Science,
290(5500):2268–2269, December 2000.
[33] H. Sidenbladh, M. J. Black, and L. Sigal. Implicit probabilistic models of human
motion for synthesis and tracking. In ECCV, page I: 784 ff., 2002.
[34] Cristian Sminchisescu and Allan D. Jepson. Generative modeling for continuous
non-linearly embedded visual inference. In Carla E. Brodley, editor, Machine
Learning, Proceedings of the Twenty-first International Conference (ICML
2004), Banff, Alberta, Canada, July 4-8, 2004, volume 69 of ACM International
Conference Proceeding Series. ACM, 2004.
[35] Cristian Sminchisescu, Atul Kanaujia, and Dimitris N. Metaxas. Learning joint
top-down and bottom-up processes for 3D visual inference. In CVPR, pages
1743–1752. IEEE Computer Society, 2006.
[36] Cristian Sminchisescu and Bill Triggs. Estimating articulated human motion
with covariance scaled sampling. International Journal of Robotic Research,
22(6):371–392, 2003.
[37] Carl Staelin. Parameter selection for support vector machines. Technical report,
Hewlett-Packard, November 21 2002.
[38] Berwin A. Turlach. Bandwidth selection in kernel density estimation: A review.
Discussion paper 9317, Institut de Statistique, UCL, Louvain-la-Neuve,
Belgium, January 15 1993.
[39] Raquel Urtasun, David J. Fleet, and Pascal Fua. 3D people tracking with
gaussian process dynamical models. In CVPR, pages 238–245. IEEE Computer
Society, 2006.
[40] Pascal Vincent and Yoshua Bengio. Manifold parzen windows. In Suzanna
Becker, Sebastian Thrun, and Klaus Obermayer, editors, NIPS, pages 825–832.
MIT Press, 2002.
[41] Christopher Richard Wren and Alex Pentland. Dynamic models of human motion.
In Proceedings of the Third IEEE International Conference on Automatic
Face and Gesture Recognition, pages 22–29. IEEE Computer Society, 1998.
A. Gradients derivation
Let us define the elements of $\mathbf{x}$ as $(x_1, \ldots, x_D)^T$ and those of
$\mathbf{x}_i$ as $(x_{i1}, \ldots, x_{iD})^T$.
We begin with a single partial derivative of the logarithm of the isotropic kernel
density estimator $\hat{f}_{\mathrm{iso}}$, where $j$ is the dimension along which we
wish to differentiate. For notational convenience, let $[f(\bullet)]'$ denote
$\frac{\partial f(\bullet)}{\partial x_j}$. Additionally, note that `log' stands
here for the natural logarithm, not for $\log_{10}$.
The partial derivative of the logarithm of the isotropic estimator is:
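\begin{align*}
[\log \hat{f}_{\mathrm{iso}}(\mathbf{x})]'
&= \frac{[\hat{f}_{\mathrm{iso}}(\mathbf{x})]'}{\hat{f}_{\mathrm{iso}}(\mathbf{x})}
&& \text{(derivative of $\log(\bullet)$)} \\
&= \frac{\left[\frac{1}{N}\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)\right]'}
        {\frac{1}{N}\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)}
&& \text{(definition of $\hat{f}_{\mathrm{iso}}$)} \\
&= \frac{\left[\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)\right]'}
        {\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)}
&& \text{(extraction and simplification of $\tfrac{1}{N}$)} \\
&= \frac{\sum_{i=1}^{N} \left[K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)\right]'}
        {\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)}
&& \text{(derivative of a sum)} \\
&= \frac{\sum_{i=1}^{N} \left[\frac{1}{(2\pi h^2)^{D/2}}
        \exp\left(-\frac{\|\mathbf{x}-\mathbf{x}_i\|^2}{2h^2}\right)\right]'}
        {\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)}
&& \text{(definition of $K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)$)} \\
&= \frac{\sum_{i=1}^{N} \frac{1}{(2\pi h^2)^{D/2}}
        \exp\left(-\frac{\|\mathbf{x}-\mathbf{x}_i\|^2}{2h^2}\right)
        \left[-\frac{\|\mathbf{x}-\mathbf{x}_i\|^2}{2h^2}\right]'}
        {\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)}
&& \text{(derivative of $\exp(\bullet)$)} \\
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)
        \left[\|\mathbf{x}-\mathbf{x}_i\|^2\right]'}
        {2h^2 \sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)}
&& \text{(extraction of scalar factor)} \\
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)
        \left[\sum_{k=1}^{D} (x_k - x_{ik})^2\right]'}
        {2h^2 \sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)}
&& \text{(definition of squared norm)} \\
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)
        \sum_{k=1}^{D} \left[(x_k - x_{ik})^2\right]'}
        {2h^2 \sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)}
&& \text{(derivative of a sum)} \\
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)
        \left[(x_j - x_{ij})^2\right]'}
        {2h^2 \sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)}
&& \text{(nullification of constants)} \\
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)\,
        2 (x_j - x_{ij}) \left[x_j - x_{ij}\right]'}
        {2h^2 \sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)}
&& \text{(derivative of $f(\bullet)^2$)} \\
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i) (x_j - x_{ij})}
        {h^2 \sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)}
&& \text{(derivative of $x_j$ and of constant $x_{ij}$)}
\end{align*}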
Now, we can compute the partial derivatives for all j:
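\begin{align*}
\frac{\partial \log \hat{f}_{\mathrm{iso}}(\mathbf{x})}{\partial \mathbf{x}}
&= \left(\frac{\partial \log \hat{f}_{\mathrm{iso}}(\mathbf{x})}{\partial x_1},
        \ldots,
        \frac{\partial \log \hat{f}_{\mathrm{iso}}(\mathbf{x})}{\partial x_D}\right)^T \\
&= \left(-\frac{\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)(x_1 - x_{i1})}
               {h^2 \sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)},
        \ldots,
        -\frac{\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)(x_D - x_{iD})}
               {h^2 \sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)}\right)^T \\
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)
         (x_1 - x_{i1}, \ldots, x_D - x_{iD})^T}
        {h^2 \sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)} \\
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)(\mathbf{x} - \mathbf{x}_i)}
        {h^2 \sum_{i=1}^{N} K^{\mathrm{iso}}_h(\mathbf{x},\mathbf{x}_i)}
\end{align*}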
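The closed-form isotropic gradient can be checked numerically. The following is a minimal sketch, not the implementation used in this work: it uses only the Python standard library, the function names, sample points and bandwidth are made up for illustration, and central finite differences serve as an independent reference.

```python
import math

def iso_kernel(x, xi, h):
    """Isotropic Gaussian kernel K_h^iso(x, x_i) in D = len(x) dimensions."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * h * h)) / (2.0 * math.pi * h * h) ** (d / 2.0)

def log_density_iso(x, samples, h):
    """Logarithm of the estimate (1/N) * sum_i K_h^iso(x, x_i)."""
    return math.log(sum(iso_kernel(x, xi, h) for xi in samples) / len(samples))

def grad_log_density_iso(x, samples, h):
    """Closed form: -sum_i K_i (x - x_i) / (h^2 sum_i K_i)."""
    weights = [iso_kernel(x, xi, h) for xi in samples]
    total = sum(weights)
    return [
        -sum(w * (x[j] - xi[j]) for w, xi in zip(weights, samples)) / (h * h * total)
        for j in range(len(x))
    ]

def numeric_grad(f, x, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for j in range(len(x)):
        up = list(x)
        up[j] += eps
        down = list(x)
        down[j] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad

# Hypothetical toy data: four 2-D training points and one query point.
samples = [[0.0, 0.0], [1.0, 0.5], [-0.5, 2.0], [2.0, -1.0]]
x = [0.3, 0.7]
h = 0.8

analytic = grad_log_density_iso(x, samples, h)
numeric = numeric_grad(lambda p: log_density_iso(p, samples, h), x)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_error)
```

The normalisation constant of the kernel cancels between numerator and denominator here, because it is the same for every sample; only the relative kernel weights matter for the gradient.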
The derivation for the anisotropic kernel is similar. Let us define $\mathbf{O}$ and
$\mathbf{I}$ respectively as the null and identity matrices of dimension $D \times D$.
Let $\mathbf{e}_j$ be the $j$th column of $\mathbf{I}$. The single partial derivative is:
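\begin{align*}
[\log \hat{f}_{\mathrm{ani}}(\mathbf{x})]'
&= \frac{\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})
        \left[-\frac{1}{2}(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i)\right]'}
        {\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})}
&& \text{(similar to $\log \hat{f}_{\mathrm{iso}}$)} \\
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})
        \left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i)\right]'}
        {2 \sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})}
&& \text{(extraction of scalar factor)}
\end{align*}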
To lighten the notation, we first differentiate the inner quadratic form and then
substitute the result into the full fraction:
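\begin{align*}
&\left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i)\right]' \\
&= \left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1}\right]' (\mathbf{x}-\mathbf{x}_i)
   + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \left[\mathbf{x}-\mathbf{x}_i\right]'
&& \text{(derivative of a product)} \\
&= \left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1}\right]' (\mathbf{x}-\mathbf{x}_i)
   + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1}
     \left[\sum_{k=1}^{D} (x_k - x_{ik}) \mathbf{e}_k\right]'
&& \text{(expansion of vector $(\mathbf{x}-\mathbf{x}_i)$)} \\
&= \left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1}\right]' (\mathbf{x}-\mathbf{x}_i)
   + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1}
     \sum_{k=1}^{D} \left[x_k - x_{ik}\right]' \mathbf{e}_k
&& \text{(derivative of a sum)} \\
&= \left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1}\right]' (\mathbf{x}-\mathbf{x}_i)
   + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \left[x_j - x_{ij}\right]' \mathbf{e}_j
&& \text{(nullification of constants)} \\
&= \left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1}\right]' (\mathbf{x}-\mathbf{x}_i)
   + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j
&& \text{(derivative of $x_j$ and of constant $x_{ij}$)} \\
&= \left[(\mathbf{x}-\mathbf{x}_i)^T\right]' \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i)
   + (\mathbf{x}-\mathbf{x}_i)^T \left[\Sigma_i^{-1}\right]' (\mathbf{x}-\mathbf{x}_i)
   + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j
&& \text{(derivative of a product)} \\
&= \left[(\mathbf{x}-\mathbf{x}_i)^T\right]' \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i)
   + (\mathbf{x}-\mathbf{x}_i)^T \mathbf{O} (\mathbf{x}-\mathbf{x}_i)
   + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j
&& \text{(nullification of constants)} \\
&= \left[(\mathbf{x}-\mathbf{x}_i)^T\right]' \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i)
   + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j
&& \text{(simplification)} \\
&= \mathbf{e}_j^T \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i)
   + (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j
&& \text{(similar to above)} \\
&= 2 (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j
&& \text{(because $\Sigma_i^{-1}$ is symmetric)}
\end{align*}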
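The last step relies on the symmetry of $\Sigma_i^{-1}$. The sketch below illustrates this numerically under stated assumptions: the matrix `P` stands in for $\Sigma_i^{-1}$, all names and values are hypothetical, and finite differences provide the reference derivative. With a symmetric `P` the shortcut $2(\mathbf{x}-\mathbf{x}_i)^T P\,\mathbf{e}_j$ matches; with a deliberately asymmetric `P` it does not.

```python
def quad_form(x, xi, P):
    """(x - x_i)^T P (x - x_i), with P playing the role of Sigma_i^{-1}."""
    d = [a - b for a, b in zip(x, xi)]
    n = len(d)
    return sum(d[r] * P[r][c] * d[c] for r in range(n) for c in range(n))

def quad_form_partial(x, xi, P, j):
    """Shortcut 2 (x - x_i)^T P e_j, valid only when P is symmetric."""
    d = [a - b for a, b in zip(x, xi)]
    return 2.0 * sum(d[r] * P[r][j] for r in range(len(d)))

def numeric_partial(f, x, j, eps=1e-6):
    """Central finite difference along dimension j."""
    up = list(x)
    up[j] += eps
    down = list(x)
    down[j] -= eps
    return (f(up) - f(down)) / (2.0 * eps)

def max_gap(x, xi, P):
    """Largest disagreement between the shortcut and the numeric derivative."""
    return max(
        abs(quad_form_partial(x, xi, P, j)
            - numeric_partial(lambda p: quad_form(p, xi, P), x, j))
        for j in range(len(x))
    )

# Hypothetical values chosen only for illustration.
x, xi = [1.0, -0.5], [0.2, 0.3]
P_sym = [[2.0, 0.5], [0.5, 1.0]]    # symmetric, like an inverse covariance
P_asym = [[2.0, 1.0], [0.0, 1.0]]   # not symmetric: the shortcut breaks

sym_gap = max_gap(x, xi, P_sym)
asym_gap = max_gap(x, xi, P_asym)
print(sym_gap, asym_gap)
```

Since an inverse covariance matrix is always symmetric, the asymmetric case cannot occur in the derivation; it is shown only to make clear which property the final simplification depends on.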
Substituting this into the previous equality, we get:
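\begin{align*}
[\log \hat{f}_{\mathrm{ani}}(\mathbf{x})]'
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})\,
         2 (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j}
        {2 \sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})}
&& \text{(see above)} \\
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})
         (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_j}
        {\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})}
&& \text{(extraction and simplification of 2)}
\end{align*}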
Finally, we can compute the partial derivatives for all j:
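\begin{align*}
\frac{\partial \log \hat{f}_{\mathrm{ani}}(\mathbf{x})}{\partial \mathbf{x}}
&= \left(\frac{\partial \log \hat{f}_{\mathrm{ani}}(\mathbf{x})}{\partial x_1},
        \ldots,
        \frac{\partial \log \hat{f}_{\mathrm{ani}}(\mathbf{x})}{\partial x_D}\right)^T \\
&= \left(-\frac{\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})
                (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_1}
               {\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})},
        \ldots,
        -\frac{\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})
                (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_D}
               {\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})}\right)^T \\
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})
         \left((\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_1, \ldots,
               (\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{e}_D\right)^T}
        {\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})} \\
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})
         \left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1}
               (\mathbf{e}_1, \ldots, \mathbf{e}_D)\right]^T}
        {\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})} \\
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})
         \left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1} \mathbf{I}\right]^T}
        {\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})} \\
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})
         \left[(\mathbf{x}-\mathbf{x}_i)^T \Sigma_i^{-1}\right]^T}
        {\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})} \\
&= -\frac{\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})\,
         \Sigma_i^{-1} (\mathbf{x}-\mathbf{x}_i)}
        {\sum_{i=1}^{N} K^{\mathrm{ani}}_i(\mathbf{x})}
&& \text{(because $\Sigma_i^{-1}$ is symmetric)}
\end{align*}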
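As with the isotropic case, the anisotropic gradient can be verified against finite differences. This is a minimal sketch under explicit assumptions: it is restricted to two dimensions, it assumes the usual normalised Gaussian form for $K^{\mathrm{ani}}_i$ (which this appendix does not restate), it takes each precision matrix $\Sigma_i^{-1}$ directly as input, and every name and value is hypothetical.

```python
import math

def ani_kernel(x, xi, P):
    """2-D Gaussian kernel with precision matrix P = Sigma_i^{-1} (assumed form)."""
    d = [x[0] - xi[0], x[1] - xi[1]]
    quad = sum(d[r] * P[r][c] * d[c] for r in range(2) for c in range(2))
    det_p = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    # sqrt(det P) / (2 pi) equals 1 / ((2 pi)^(D/2) |Sigma|^(1/2)) for D = 2.
    return math.sqrt(det_p) / (2.0 * math.pi) * math.exp(-0.5 * quad)

def log_density_ani(x, samples, precisions):
    """Logarithm of (1/N) * sum_i K_i^ani(x)."""
    total = sum(ani_kernel(x, xi, P) for xi, P in zip(samples, precisions))
    return math.log(total / len(samples))

def grad_log_density_ani(x, samples, precisions):
    """Closed form: -sum_i K_i Sigma_i^{-1} (x - x_i) / sum_i K_i."""
    weights = [ani_kernel(x, xi, P) for xi, P in zip(samples, precisions)]
    total = sum(weights)
    grad = [0.0, 0.0]
    for w, xi, P in zip(weights, samples, precisions):
        d = [x[0] - xi[0], x[1] - xi[1]]
        for j in range(2):
            grad[j] -= w * (P[j][0] * d[0] + P[j][1] * d[1]) / total
    return grad

def numeric_grad(f, x, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for j in range(len(x)):
        up = list(x)
        up[j] += eps
        down = list(x)
        down[j] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad

# Hypothetical toy data: three samples, each with its own symmetric
# positive definite precision matrix.
samples = [[0.0, 0.0], [1.0, 0.5], [-0.5, 2.0]]
precisions = [
    [[2.0, 0.5], [0.5, 1.0]],
    [[1.0, -0.3], [-0.3, 0.5]],
    [[0.8, 0.0], [0.0, 3.0]],
]
x = [0.3, 0.7]

analytic = grad_log_density_ani(x, samples, precisions)
numeric = numeric_grad(lambda p: log_density_ani(p, samples, precisions), x)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_error)
```

Unlike the isotropic case, the normalisation factor differs from kernel to kernel, so it has to be kept in the weights. Setting every precision matrix to $\mathbf{I}/h^2$ makes this gradient coincide with the isotropic one.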