Multimodal Deep Learning for
Automated Exercise Logging
David Stromback
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2019
Abstract
Fitness tracking devices have risen in popularity in recent years, but limitations
in their accuracy [4] and their failure to track many common exercises present a
need for improved fitness tracking solutions. This work proposes a multimodal deep
learning approach to leverage multiple data perspectives for robust and accurate activ-
ity segmentation, exercise recognition and repetition counting in the gym setting. To
enable this exploration we introduce the Multimodal Gym dataset: a substantial dataset
of participants performing full-body workouts, consisting of time-synchronised multi-
viewpoint (MVV) RGB-D video, multiple inertial sensing modalities, and 2D and 3D
pose estimates. We demonstrate the effectiveness of our CNN-based architecture at
extracting modality-specific spatio-temporal representations from inertial sensor and
skeleton sequence data. We compare the exercise recognition performance of uni-
modal, single-device and multimodal models across a number of smart devices and
modalities. Furthermore, we propose a fully-connected multimodal autoencoder to
learn cross-modal representations, which achieves near-perfect accuracy of 99.5% for
exercise recognition on the Multimodal Gym dataset. We evaluate the generalisability
of our models by testing on unseen subjects, and explore the effect of pretraining on
external datasets. We also propose and present the initial results of a training strat-
egy for strengthening single-device performance using multimodal training, which we
call multimodal zeroing-out training. Finally, we implement a strong repetition count-
ing baseline on the Multimodal Gym dataset, and propose promising future research
directions for the repetition counting task.
Acknowledgements
I would like to express my sincere gratitude to my academic supervisor, Dr. Radu, for
his expert guidance throughout the project, and our many fruitful discussions. I am
also very grateful to my industry supervisor, Dr. Huang, for our regular conversations
that stimulated new research directions and ideas. I would also like to thank Sony for
providing funding for this research project, and the data collection participants, for
being so generous with their time. Finally, I would like to acknowledge with gratitude
the ever-present support and love from my family and friends.
Table of Contents
1 Introduction
  1.1 Contribution
2 Background
  2.1 Human Action Recognition (HAR)
    2.1.1 Inertial Sensor-Based HAR
    2.1.2 Skeletal-Based HAR
    2.1.3 HAR in the Gym
  2.2 Multimodal Deep Learning
  2.3 Repetition Counting
3 Methodology
  3.1 Data Collection
  3.2 Activity Segmentation and Exercise Recognition
    3.2.1 Pre-Processing
    3.2.2 Unimodal Autoencoders
    3.2.3 Multimodal Autoencoder
    3.2.4 Multimodal Classification
  3.3 Repetition Counting
4 Evaluation and Discussion
  4.1 Activity Segmentation and Exercise Recognition
    4.1.1 Pretraining Unimodal Autoencoders
    4.1.2 Unimodal, Single-Device and Multimodal Models
    4.1.3 Multimodal Zeroing-Out Training Strategy
  4.2 Repetition Counting
5 Conclusion
  5.1 Summary
  5.2 Future Work
Bibliography
A Additional Figures
  A.1 Exercise Visualisations
  A.2 Unimodal Autoencoder Hyperparameter Search
  A.3 Confusion Matrices
Chapter 1
Introduction
The popularity of fitness tracking devices has risen in recent years [30], enabling users
to monitor their health and fitness levels through their smart devices. Current fitness
trackers, however, are predominantly focused on tracking continuous high-movement
aerobic activities, such as walking, running, and swimming. A small subset of fitness
trackers also support tracking body-weight and resistance exercises, but they are of-
ten restricted to only tracking exercises in which the fitness tracker is heavily involved
in the exercise movements, thus often failing to track many common exercises. Fur-
thermore, questions about the reliability of current fitness tracking technology have
been raised [4], with significant discrepancies being observed across different fitness
tracking brands and models.
In this work we propose a multimodal deep learning approach to develop a robust
and accurate automated exercise logging system in the gym setting, and to contribute
to the development of current fitness tracking technology. The system we propose
consists of two main stages: an activity segmentation and exercise recognition stage,
followed by a repetition counting stage. In addition to addressing the shortcomings of
current fitness tracking technologies, this work makes contributions towards research
into the use of multimodal learning for human activity recognition (HAR), and for the
relatively unexplored task of repetition counting.
There are a number of factors that make HAR and repetition counting challenging
problems. The high variation in how different people perform the same activity, and
even the variation in how the same person performs an activity on separate occasions,
sets high demands on HAR and repetition counting systems. There are also modality
specific challenges that arise for HAR and repetition counting tasks. Occlusion and
viewpoint variation are two common challenges that need to be handled when working
with RGB and depth data. Learning from sensor data, in particular from sensor data
generated across multiple heterogeneous sensors, also poses a number of challenges.
The model and the brand of the sensor device, the positioning of the sensor, and the
sampling frequency of the sensor are among the factors that can cause significant
variations in sensor readings. An additional limitation of relying on sensor data for
HAR and repetition counting is that only activities involving the body part to which
the sensor is attached can be tracked, making some activities
indistinguishable. Finally, sensor drift is a problem that affects all sensors over time,
and can lead to unreliable and deteriorating sensor reading accuracy.
To tackle these challenges we propose a multimodal deep learning approach that
combines inertial sensor data from user-worn smart devices with 3D skeleton sequence
data extracted from RGB video; we hypothesise that these modalities will
complement each other, resulting in more robust and accurate HAR and repetition
counting models. The inertial sensing devices used in this work are accelerometer and
gyroscope sensors, which record the acceleration and angular velocity, respectively.
The acceleration and angular velocity of body-worn devices change with the body
movements, and are thus informative for inferring what activity a person is perform-
ing. The third modality we study in this work is 3D pose estimate data. Advances in
2D and 3D pose estimation have made it possible to acquire highly accurate and robust
pose estimates in real-time [6]. 3D pose sequence information is highly discriminative
for the task of action recognition, and allows for a much more compact representa-
tion than raw image or depth data, resulting in more lightweight models. With the
widespread availability of cheap cameras and inertial sensors, and the growing trend
of ubiquitous computing, we are seeing an increase in the number of data modalities
that are available across a wide range of settings. This contributes to the timeliness of
this research, as we believe there are many settings where multimodal learning
systems are both feasible and beneficial. Ambient-assisted living (AAL) solutions,
self-driving cars, and smart gyms, as investigated in this work, are just a few examples
of application areas that would benefit from multiple data perspectives. We explore
deep learning methods [26] for this task due to their demonstrated ability to learn gen-
eralised hierarchical representations, which are important for dealing with the high
intra-class variability found in HAR datasets. In addition, deep learning methods can
be trained end-to-end directly on the raw data, avoiding the need to design handcrafted
features that are often used in shallow learning methods.
1.1 Contribution
This section outlines the main contributions made in this work. We have collected and
annotated the Multimodal Gym dataset, a substantial dataset of participants performing
full-body workouts, consisting of synchronised multi-viewpoint (MVV) RGB-D video,
multiple inertial sensing modalities, and 2D and 3D pose estimates. The Multimodal
Gym dataset will be made available to the research community, with the hope that it
will stimulate further developments in multimodal learning for HAR and repetition
counting.
We propose a deep learning approach that learns unimodal and cross-modal repre-
sentations, and evaluate this approach for exercise recognition on the Multimodal Gym
dataset. Furthermore, we use inertial sensor data and 3D pose estimate data; to the
best of our knowledge, this is the first work to explore fusing these modalities for HAR.
Additionally, we have implemented a baseline model for repetition counting on the
Multimodal Gym dataset, and invite the research community to further improve upon
the baseline performance.
Finally, we propose a multimodal zeroing-out training strategy, to investigate whether
single-device model performance can be improved through multimodal training, and
present our initial results.
Chapter 2
Background
2.1 Human Action Recognition (HAR)
Human action recognition approaches can be divided into two main categories: sensor-
based methods [8, 25, 5, 48], relying on data from inertial measurement units (IMUs),
such as accelerometers and gyroscopes, and video-based methods [35, 1, 18], operat-
ing on visual input data, such as colour and depth. The latter category also includes
skeletal-based methods which use as input 2D or 3D human pose estimates extracted
from colour or depth data. Previous works have predominantly focused on unimodal
learning, however, there are many real-world applications, such as ours, where it is
feasible to leverage information from multiple modalities. Performance on HAR tasks
has improved significantly in recent years, as research focus has shifted from using
handcrafted features to using modern deep learning solutions [18]. For conciseness,
this chapter will be centred on recent deep-learning approaches, and will focus on
sensor-based and skeletal-based methods, as these are the modalities investigated in
this work. An overview of previous HAR works in the gym setting is also presented.
2.1.1 Inertial Sensor-Based HAR
There are a range of sensing modalities that have been used for action recognition,
with accelerometers and gyroscopes being the most commonly used [48]. These sen-
sors output multivariate time-series data, which are often segmented using a sliding-
window approach, before being passed on to the feature extraction and classification
stages. Traditional pattern recognition techniques, such as Hidden Markov Models
(HMMs), decision trees, and Support Vector Machines (SVMs), have demonstrated
strong results on various action recognition tasks in controlled environments; however,
their reliance on handcrafted features and their ability to learn only shallow representations
restrict their performance and generalisability [25]. Deep learning approaches, on the
other hand, can learn high-level features directly from the raw sensor data as part of an
end-to-end system. Given a sufficient amount of data, traditional machine learning and
signal processing techniques are in large part outperformed by modern deep learning
approaches [13]. Convolutional Neural Networks (CNNs) [27] are particularly well
suited for working with sensor data, as they efficiently capture local dependencies in
the data, and can learn scale-invariant features, which are important for dealing with
activities performed at different speeds. Zeng et al. presented one of the first works
using CNNs on sensor data for HAR, in which they convolved and max-pooled each
accelerometer axis separately before flattening and concatenating the learnt features
from each axis into a fully-connected network [51]. They demonstrated state-of-the-
art results on three public HAR datasets. Yang et al. adopted a similar approach but
instead of performing convolutions on each axis separately, they stacked the time-series
signals from each sensor axis and from multiple sensors to form a 2D image, which
was then convolved and pooled along the temporal dimension [50]. In addition to oper-
ating on multiple sensors, a key difference in [50] compared to [51] is that information
across axes (and sensors in [50]) is fused together early on in the learning process,
as opposed to only towards the end. In a related work, Ha et al. also stack signals
from multiple sensors into a 2D image; however, they add zero padding to the
input between modality groups, to prevent kernels in the first convolutional layer from
intermixing different modalities and failing to learn modality-specific features [15]. In our
approach we force the network to specialise on each modality and sensor by initially
treating each modality and sensor separately, before fusing the unimodal features and
learning cross-modal features. Hammerla et al. carried out a comparison of different
deep learning approaches for HAR, investigating Recurrent Neural Networks (RNNs),
deep fully-connected networks, and CNNs on three public datasets. They found that
for tasks where long-term dependencies play an important role, RNNs tend to perform
best, whereas if detecting local patterns is of higher importance, as is the case for exer-
cise recognition in the gym setting, CNNs are preferable [16]. Autoencoders and
Restricted Boltzmann Machines (RBMs) have also been applied to sensor data for HAR
[2, 36]. Radu et al. demonstrate superior HAR performance compared to traditional
shallow-learning approaches, by using a multimodal-RBM network to learn a shared
representation for accelerometer and gyroscope data [36]. Similar to our approach they
extract unimodal features before learning a cross-modal representation.
2.1.2 Skeletal-Based HAR
With the release of the Microsoft Kinect depth sensor and body tracking SDK in 2010,
and the more recent development of real-time and accurate RGB-based human pose es-
timation techniques [6, 31, 38], the research area of skeletal-based HAR has emerged.
Skeletal sequence data consists of joint trajectories, and can, similarly to inertial sen-
sor data, be viewed as multivariate time-series data. Early works focus on constructing
handcrafted features to obtain discriminative skeleton representations [47, 12, 46], with
more recent approaches, again, focusing on deep learning.
A range of deep learning approaches have been explored for skeletal-based HAR,
in particular, recurrent neural networks (RNNs) [11], graph neural networks (GNNs)
[49], and CNNs [10, 29, 22]. RNNs are able to efficiently leverage time-series infor-
mation, while GNNs are able to incorporate the spatial constraints and relationships
of human joints in their models. Finally, CNNs have widely been demonstrated to
perform well in learning representations at different abstraction levels, and at learning
spatio-temporal features.
Ke et al. propose a 3D skeleton representation form using cylindrical coordinates
to encode the relative joint positions with respect to selected reference joints [22].
They demonstrate the efficiency of this representation in combination with deep CNNs
to learn spatio-temporal features for HAR. Motivated by their success, we use their
proposed skeleton representation form in this work.
2.1.3 HAR in the Gym
Only a few works have focused on HAR in the gym setting. Chang et al. presented
one of the earliest works that tackled exercise recognition [17]. They propose a Naive
Bayes classifier and an HMM approach using two triaxial accelerometers to achieve
90% accuracy on a dataset with nine exercise classes. Morris et al. present an au-
tomated exercise logging system with many similarities to ours [32]. Their system
consists of three main stages: segmenting exercise and non-exercise segments, recog-
nising exercises, and counting repetitions. A key difference compared to our approach
is that Morris et al. use handcrafted features in combination with an SVM to classify
exercises. They evaluate their system on the RecoFit dataset, a large-scale dataset con-
sisting of accelerometer and gyroscope data collected from an arm-worn sensor. Their
system achieves a precision and recall of greater than 95% for the segmentation task,
and recognition rates on three exercise circuits consisting of 4, 7, and 13 exercises,
of 99%, 98%, and 96%, respectively. The repetition counting approach proposed by
Morris et al. is the baseline method we implement on the Multimodal Gym dataset. A
more recent deep learning based approach was proposed by Um et al. [45], where they
applied a CNN to accelerometer data formatted into an image. They apply their model
on a large-scale dataset with 50 exercises, and obtain 92% accuracy for the exercise
recognition task.
2.2 Multimodal Deep Learning
Multimodal deep learning is concerned with how to fuse different modalities, such
that the learning task can best leverage the different data perspectives. Ngiam et al.
propose a cross-modal RBM learning approach to learn better unimodal features by
training on multiple modalities [34]. They demonstrate the efficiency of their ap-
proach by training on video and audio data, and testing on just a single modality,
for the task of audio-visual speech classification. Their approach inspired the multi-
modal zeroing-out training strategy that we investigate in this work, with the major
differences being that we extend the strategy to multiple modalities, and use a gradual
zeroing-out process of the modalities. Radu et al. explore a number of fusing strategies
for multimodal deep learning methods [37], comparing early feature concatenation and
modality-specific architectures, on activity and context recognition tasks using inertial
sensor data. Modality-specific architectures first learn unimodal features before learn-
ing a shared cross-modal representation. This is also the multimodal learning approach
that we take in this work. [37] demonstrate that their proposed modality-specific CNN
architecture outperforms the other explored approaches on three out of four tasks, and
achieves an average accuracy that is 5% higher across all four tasks, compared to the
early feature concatenation architecture.
2.3 Repetition Counting
We focus on sensor-based repetition counting, as this is the setting of our work; however,
there are also a number of works on visual repetition counting in video [28, 39]. The
majority of sensor-based repetition counting works focus on counting repetitions in
the gym setting [17, 9, 33], and rely on signal-processing techniques. Chang et al.
propose two approaches for repetition counting on triaxial accelerometer data: a peak
counting algorithm, and a method using the Viterbi algorithm with an HMM. On a
dataset of nine workout exercises they achieve a miscount rate of 5%. Choi et al. adopt
a simple signal-processing technique to track repetitions using accelerometer sensors
attached to gym exercise machines [9]. Their predictions are within one repetition of
the ground-truth, 95% of the time. A recent deep learning approach has also been
proposed by Soro et al. using a CNN applied on inertial sensor data collected from
a wrist-worn and an ankle-worn sensor, to count the repetitions across ten Cross-Fit
exercises [42]. Their model’s predictions are within one of the true repetition count,
91% of the time.
Chapter 3
Methodology
This chapter outlines and motivates the design choices for our proposed automated ex-
ercise logging system. The system consists of two main components: an activity seg-
mentation and exercise recognition module, followed by a repetition counting module.
We begin by introducing the data collection procedure.
3.1 Data Collection
To enable research into multimodal learning for activity segmentation, exercise recog-
nition and repetition counting in the gym setting, we have collected and annotated a
multimodal dataset of participants performing full-body workouts¹. We refer to this
dataset as the Multimodal Gym dataset. The Multimodal Gym dataset will be made
publicly available, with the hope that it will stimulate further research into multimodal
learning. This section outlines the data collection procedure and provides an overview
of the Multimodal Gym dataset.
Participants were asked to carry out full-body workouts whilst being recorded by
two depth cameras and wearing five differently-positioned devices collecting inertial
sensor data. To restrict the task to single-person exercise recognition, each workout
session involved just one participant. The workout sessions consist of three sets of ten
exercises, with ten repetitions for each set. The following set of well-known gym
exercises was chosen: squats, lunges (with dumbbells), bicep curls (alternating arms),
sit-ups, push-ups, sitting overhead dumbbell triceps extensions, standing dumbbell
rows, jumping jacks, sitting dumbbell shoulder press, and dumbbell lateral shoulder
¹The data collection study has undergone ethical screening in accordance with the University of Edinburgh School of Informatics ethics process, and received ethical approval from the Ethics Panel (Ref. No. 70572).
raises. See A.1 for an example of each exercise. The exercises were demonstrated to
the participants before their workout session to ensure familiarity with each exercise,
but no corrections to the participant’s form were made during the workout. For the
exercises involving dumbbells, the participants had access to two dumbbells with ad-
justable weight of up to 7.5kg each, and were free to choose how much weight to use.
In between sets participants could decide how long to rest, and what to do, as long as
they remained within the cameras’ field of view. The data collection was carried out
in a controlled environment, keeping the background and lighting conditions constant
across all workouts.
Participants were asked to wear two smartwatches, one on each wrist, an earbud in
their left ear, and two smartphones, one in their left trouser pocket (Huawei P20) and
the other in their right trouser pocket (Samsung S7). The Huawei P20 was only used
to collect data in 6 of the workout sessions. The smartwatches and earbud are fixed in
orientation when worn, and the smartphones were always placed with the camera
facing downwards and outwards. Inertial sensor data was collected from all five
devices using the inbuilt triaxial accelerometer and gyroscope, and also the magne-
tometer in the smartphones. Heart rate data was also collected from the smartwatches
using the optical heart rate sensor. Figure 3.1 displays an example accelerometer and
gyroscope sensor segment. Custom applications were developed to stream and store
the sensor data from each device. For the smartwatches a Wear OS Java application
was developed to locally store sensor readings. For the eSense earbud, an Android OS
Java application was developed, allowing the earbud to stream sensor readings to a host
smartphone over Bluetooth Low Energy (BLE). For the smartphones, an Android OS
Java application developed by Dr. Radu was used to stream and store inertial sensor
data. The devices used in the data collection, along with the sampling frequency for
each device and modality are given in Table 3.1.
Colour images and depth maps of the workouts were recorded at 30 frames per
second from two viewpoints using two Orbbec Astra Pro RGB-D cameras. A C++ ap-
plication was developed to stream and store the colour and depth data from these cam-
eras. The Astra Pro depth sensor uses structured light to capture depth information,
and has a range of 0.6-8m. We placed the cameras such that the participant
was always within a 2-6m range of the cameras. The depth maps capture the planar
distance to the camera of the scene. For each workout session the cameras were set up
in approximately the same position, with both cameras facing the front of the partici-
pant but at different angles.
[Figure 3.1 plots: (a) Accelerometer, m/s² against time (s), axes X, Y, Z; (b) Gyroscope, rad/s against time (s), axes X, Y, Z.]
Figure 3.1: Sensor readings from a left-worn smartwatch triaxial accelerometer and
gyroscope recorded during a set of squats.
The colour and depth data was recorded at the maximum resolution, 1280×720 for
colour, and 640×480 for depth. Example RGB frame and depth map outputs are given
in Figure 3.2.
[Figure 3.2 images: (a) Camera 1 RGB; (b) Camera 2 RGB; (c) Camera 2 depth map, depth scale 0-4m.]
Figure 3.2: Example colour and depth output of camera 1 and camera 2.
To use the Multimodal Gym dataset for multimodal learning, the data from each
of the seven devices used in the data collection must be time-synchronised. This was
done by asking each participant to perform an abrupt synchronisation jump from a
stand-still position at the beginning of each workout. The synchronisation jump was
used to determine the offset between each device’s system clock and a chosen reference
clock, which was chosen to be the system clock of the laptop connected to one of the
cameras. Given the offset of each device’s system clock to our reference clock, the
data from each device could be aligned in time.
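As a concrete illustration, a minimal sketch of this alignment step, assuming the synchronisation jump is located as the sample of peak acceleration magnitude; all function and variable names here are ours:

    import numpy as np

    def jump_timestamp(timestamps, accel_xyz):
        # Locate the synchronisation jump as the sample with the largest
        # acceleration magnitude near the start of the recording.
        mag = np.linalg.norm(accel_xyz, axis=1)
        return timestamps[np.argmax(mag)]

    def align_to_reference(timestamps, t_jump_device, t_jump_reference):
        # Shift a device's timestamps onto the reference clock using the
        # clock offset observed at the synchronisation jump.
        offset = t_jump_device - t_jump_reference
        return np.asarray(timestamps) - offset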
Each workout has been manually annotated with the video frame boundaries of where each exercise set begins and ends, and the number of repetitions in each set.
Device                Modality           Frequency (Hz)   Resolution
Orbbec Astra Pro      RGB                30               1280x720
                      Depth              30               640x480
Mobvoi TicWatch Pro   Accelerometer      100              -
                      Gyroscope          100              -
                      Heart rate (bpm)   1                -
eSense [21]           Accelerometer      90               -
                      Gyroscope          90               -
Samsung S7            Accelerometer      210              -
                      Gyroscope          210              -
                      Magnetometer       100              -
Huawei P20            Accelerometer      500              -
                      Gyroscope          500              -
                      Magnetometer       65               -

Table 3.1: An overview of the devices used in the data collection, and the collected modalities.
The workouts were annotated by the same annotator by inspecting the videos of the
workout sessions. Participants were instructed to perform 10 repetitions in each set,
however, in limited cases, due to miscounting, fewer or more than 10 repetitions were
performed. There is some ambiguity involved when annotating the start and end frames
of exercises, and the number of repetitions. When determining when a participant starts
an exercise, is it when they are in the exercise start position, or is it only once they start
the exercise motion? Is the end of a set of jumping jacks when the feet come together
for the last time, or is it when the arms stop swinging? Does sitting up at the end of a
set of sit-ups count as a repetition? These ambiguous cases do not arise very frequently,
but to minimise the impact on the data quality we handled these cases consistently
across all workouts. Given the discussed ambiguities, and the difficulties
in identifying exactly when a workout begins and ends, we estimate that the majority
of our start and end annotations are within 2-3 frames (60-90ms) of
the ground-truth, which is more than sufficient for the task of activity recognition and
repetition counting.
Sixteen workout sessions were recorded, totalling 580 minutes of data. The workout
lengths range from 27 to 58 minutes, with an average duration of 36 minutes. A total
of 466 sets, and 4678 repetitions were collected, and exercises constitute 26%, or 150
minutes of the data, with the remaining portion corresponding to non-exercise activi-
ties. Two of the five participants carried out six workout sessions each, one participated
in two sessions, and the other two participants carried out one workout session each.
3.2 Activity Segmentation and Exercise Recognition
In this section we outline our proposed multimodal approach for activity segmentation
and exercise recognition. Our approach can be split into three main training stages:
learning modality-specific representations using unimodal autoencoders, learning a
shared cross-modal representation using a multimodal autoencoder, and finally, train-
ing a classifier for segmentation and exercise recognition using the shared cross-modal
representation. Each of these three stages and the pre-processing stage will be covered
in the following subsections. An illustration of the complete architecture is given in
Figure 3.3.
3.2.1 Pre-Processing
Our system takes as input time-synchronised inertial sensor and skeleton sequence data
with different sampling frequencies. We downsample the accelerometer and gyroscope
sensor readings to 50Hz, and leave the skeleton sequence data at its original sampling
rate of 30Hz. Previous HAR works have suggested that a sampling rate of 20Hz is suf-
ficient to capture human motions without risking losing informative parts of the signal
[23]. We take a cautious approach and use a higher sampling rate, but acknowledge
that a lower sampling rate could most likely be used with little or no loss of accuracy to
our model’s performance. Using as low a sampling rate as possible is desirable when
deploying models on resource-constrained devices, in order to reduce energy
consumption and storage requirements. To facilitate the training process we standardise
our data along the feature dimensions, (i.e. the X, Y, and Z axes for the accelerometer
and gyroscope data, and along the joint dimensions for the skeleton data). We use the
global training mean and standard deviation to standardise all instances.
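A minimal sketch of this pre-processing, assuming a (time, features) array and statistics computed on the training split only; the function name is ours:

    import numpy as np
    from scipy.signal import resample_poly

    def preprocess(x, fs_in, fs_out, train_mean, train_std):
        # Downsample from fs_in to fs_out Hz along the time axis, then
        # standardise each feature dimension using the global training
        # mean and standard deviation.
        if fs_in != fs_out:
            x = resample_poly(x, up=fs_out, down=fs_in, axis=0)
        return (x - train_mean) / train_std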
[Figure 3.3 diagram: Stage 1, per-device inertial 1D convolutional autoencoders (X, Y, Z inputs) and a skeleton 2D convolutional autoencoder (r, θ, z inputs); Stage 2, multimodal autoencoder; Stage 3, activity segmentation and recognition classifier.]
Figure 3.3: An overview of the proposed activity segmentation and exercise recog-
nition approach. Stage 1: Learn modality-specific representations using a separate
autoencoder for each device and modality. The layers of the inertial 1D convolutional
autoencoder block are used to illustrate that separate autoencoders are used for each
device and modality. Stage 2: Flatten and concatenate the modality-specific repre-
sentations outputted by the encoder component of each unimodal autoencoder. Learn
a shared cross-modal representation using a fully-connected multimodal autoencoder
that attempts to reconstruct the original inputs from the shared representation. The out-
put vector of the multimodal autoencoder is split along the concatenation indices, and
fed to the decoder component of the corresponding unimodal autoencoder, to recon-
struct the original input, and backpropagate the reconstruction loss. Stage 3: A fully-
connected classifier is attached to the learnt shared cross-modal representation. The
entire network is trained for the task of activity segmentation and exercise recognition,
with the pretrained unimodal and multimodal autoencoder weights being fine-tuned.
3.2.2 Unimodal Autoencoders
Motivated by the demonstrated strong performance of modality-specific architectures
in previous multimodal deep learning works [34, 37], we employ the same late fusing
strategy in this work (discussed in section 2.2). We first train separate autoencoder
networks for each device and modality to extract modality-specific representations, and
to enable us to pretrain our network on external datasets. We pretrain each autoencoder
on a publicly available dataset from a similar domain to ours, in order to evaluate
whether transfer learning boosts the robustness and generalisability of our models.
We propose using a stacked convolutional autoencoder with a bottleneck to force
the network to learn a compact modality-specific representation. As discussed in sec-
tion 2.1.1, CNNs are efficient at learning temporal patterns and scale-invariant features,
which make them well-suited for our task. The autoencoders are constructed in a sym-
metric fashion, with the encoder consisting of convolutional and max-pooling layers,
and the decoder consisting of deconvolutional and max-unpooling layers. The rectified
linear unit (ReLU) activation function [14] is applied after every convolutional layer,
and after the first two deconvolutional layers. The network configurations for the
accelerometer and gyroscope modalities, and for the skeleton modality, are given in
Tables 3.2 and 3.3, respectively. The autoencoders are trained end-to-end, with the Mean Squared
Error (MSE) as the reconstruction loss.
3.2.2.1 Inertial Sensor Autoencoders
This section outlines how unimodal representations are learned from the accelerometer
and gyroscope data collected from the smartwatches, smartphones and earbud.
We treat the triaxial accelerometer and gyroscope data from the smartwatches and
earbud as 1D images with three channels, corresponding to the X, Y, and Z axis. For
the smartphone data we use the magnitude of the accelerometer and gyroscope read-
ings, √(X² + Y² + Z²), instead of the raw triaxial signal, and hence the smartphone data
only has one channel. Accelerometer and gyroscope readings are dependent on the
sensor’s orientation. The smartwatches and the earbud are worn such that the orien-
tation of the devices are relatively fixed across all workouts, however, the orientation
of the smartphones can vary significantly throughout and across workouts. This trans-
formation is performed to make our smartphone models invariant to the orientation
of the smartphone. The magnitude of the accelerometer and gyroscope corresponds
to the speed of acceleration, and speed of rotation, irrespective of the direction. This
transformation results in a loss of directional information, but we gain invariance to
the orientation of the sensing device.
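Concretely, a minimal sketch of this reduction (the function name is ours):

    import numpy as np

    def magnitude(xyz):
        # xyz: (time, 3) triaxial signal. Returns the orientation-invariant
        # magnitude sqrt(X^2 + Y^2 + Z^2) as a single-channel signal.
        return np.linalg.norm(xyz, axis=-1)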
The resulting 1D images for each modality and device are convolved and max-pooled
with 1D filters along the temporal dimension.
Layer     Kernel dims (HxW)      Output dims (C@HxW)
          Acc        Gyr
input     -          -           3@1x250
conv1     1x11 (G)   1x3         9@1x125
conv2     1x11 (G)   1x3         15@1x63
conv3     1x11       1x3         24@1x32
maxpool   1x2        1x2         24@1x16
unpool    1x2        1x2         24@1x32
deconv1   1x11       1x3         15@1x63
deconv2   1x11       1x3         9@1x125
deconv3   1x11       1x3         3@1x250
Table 3.2: Stacked convolutional autoencoder network architecture for smartwatch and earbud data. The configuration for the smartphone autoencoders differs only in that the number of channels is divided by a factor of three, since the smartphone data only has one input channel. The first two convolutional layers in the accelerometer autoencoder are grouped (G) convolutions, with the number of groups equal to three. The ReLU activation function is applied after every convolutional layer, and after the first two deconvolutional layers. All the layers use a kernel stride of two.
Layer     Kernel dims (HxW)   Output dims (C@HxW)
input     -                   3@150x16
conv1     11x11 (G)           9@75x16
conv2     11x11 (G)           15@38x16
conv3     11x11               24@19x16
maxpool   2x2                 24@9x8
unpool    2x2                 24@19x16
deconv1   11x11               15@38x16
deconv2   11x11               9@75x16
deconv3   11x11               3@150x16
Table 3.3: Stacked convolutional autoencoder network architecture for skeleton data. The first two convolutional layers in the skeleton autoencoder are grouped (G) convolutions, with the number of groups equal to three. The ReLU activation function is applied after every convolutional layer, and after the first two deconvolutional layers. All the layers use a kernel stride of two in the height dimension, and one in the width dimension.
The accelerometer and gyroscope data are processed in separate networks to extract modality-specific representations.
We use a cross-validation framework to evaluate different network configurations, in-
cluding the number of convolutional and deconvolutional layers, the kernel size, the
kernel stride, and using depth (grouped) or regular convolutions. The final network
configurations for all three devices and modalities are given in Table 3.2.
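For concreteness, a minimal PyTorch sketch of the smartwatch accelerometer autoencoder of Table 3.2; the padding values are our assumption, chosen so that the layer output sizes match the table:

    import torch
    import torch.nn as nn

    class InertialAE(nn.Module):
        # Stacked 1D convolutional autoencoder for a (batch, 3, 250)
        # accelerometer segment; the embedding is the 24@1x16 pooled map.
        def __init__(self):
            super().__init__()
            self.enc = nn.ModuleList([
                nn.Conv1d(3, 9, 11, stride=2, padding=5, groups=3),
                nn.Conv1d(9, 15, 11, stride=2, padding=5, groups=3),
                nn.Conv1d(15, 24, 11, stride=2, padding=5),
            ])
            self.pool = nn.MaxPool1d(2, stride=2, return_indices=True)
            self.unpool = nn.MaxUnpool1d(2, stride=2)
            self.dec = nn.ModuleList([
                nn.ConvTranspose1d(24, 15, 11, stride=2, padding=5),
                nn.ConvTranspose1d(15, 9, 11, stride=2, padding=5),
                nn.ConvTranspose1d(9, 3, 11, stride=2, padding=5,
                                   output_padding=1),
            ])

        def forward(self, x):                     # x: (batch, 3, 250)
            for conv in self.enc:                 # ReLU after every conv
                x = torch.relu(conv(x))
            z, idx = self.pool(x)                 # embedding: (batch, 24, 16)
            x = self.unpool(z, idx)
            x = torch.relu(self.dec[0](x))        # ReLU after first two deconvs
            x = torch.relu(self.dec[1](x))
            x = self.dec[2](x)                    # reconstruction: (batch, 3, 250)
            return x, z

Training then minimises the MSE reconstruction loss, e.g. torch.nn.functional.mse_loss(recon, x).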
3.2.2.2 Skeleton Autoencoder
To incorporate visual information into our model, we use 3D pose information as one
of our input modalities. Unlike the inertial sensor streams, where we use the raw sensor
data as input to our model, the skeleton stream requires some processing to extract 3D
pose estimates from our video data. We extract 3D pose estimates from RGB frames
using the highly-accurate 2D pose estimation system OpenPose [6], and the 2D to
3D regression model proposed by Martinez et al. [31]. OpenPose is an open source
multi-person 2D pose estimation system that uses a CNN-based bottom-up approach
to find body parts in RGB images. To ’lift’ the 2D pose estimates to 3D, Martinez
et al. propose a simple but effective deep fully-connected network, which at the time
of publication in 2017, outperformed the state-of-the-art 3D pose estimation on the
Human3.6M dataset by 30% [31]. Examples of 2D and 3D pose estimates on two
RGB frames from the Multimodal Gym dataset are given in Figure 3.4. The resulting
pose estimates are highly accurate, robust to fast movements, and can even handle
simple cases of occlusion, as demonstrated in Figure 3.4.
Figure 3.4: 2D and 3D pose estimates obtained using OpenPose [6], and Martinez et
al. [31], respectively.
Our 3D skeleton model consists of 17 joints, for which we have the Cartesian
coordinate positions in all video frames. The coordinate system used to describe the
pose estimate is relative to the person, with the origin coordinate being at the centre
hip joint. We use a skeleton representation proposed by Ke et al. [22], which involves
transforming the 3D Cartesian coordinates, (x,y,z), to cylindrical coordinates, (r,θ,z),
as follows:
r = √(x² + y²)
θ = tan⁻¹(y/x)                                    (3.1)
z = z
[22] demonstrate improved performance when using cylindrical coordinates over
Cartesian coordinates. They attribute this improvement to the fact that human motions
use pivotal movements, and are therefore best described by cylindrical coordinates.
There is however a limitation with using cylindrical coordinates, that is not discussed
in the original paper [22], namely the angle discontinuity at −180° and 180° of the az-
imuth component of the cylindrical coordinate. This causes large jumps in the azimuth
signal for potentially very small pivotal movements which complicates learning. After
the transformation, a reference joint, R, is chosen, and each joint position is represented
in terms of its relative position to the reference joint. This has the advantage of making
the representation invariant to different viewpoints. We chose the centre hip joint as
the reference joint since it is a relatively stable and stationary joint, which results in the
final skeleton representation being less prone to noise. Finally the relative cylindrical
joint coordinates are aligned into a 3D image. The channels in the image correspond to
the radial, azimuth, and vertical components, the width corresponds to the spatial di-
mension, and the height to the temporal dimension. This results in an N×16×3 image,
where N is the sequence length and 16 is the number of joints, after discarding the reference
centre hip joint. A visualisation of the skeleton representation is given in Figure 3.5,
where the repetitive nature of a set of squat repetitions is captured in the form of a
repeating visual pattern.
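A minimal sketch of constructing this representation, assuming an (N, 17, 3) array of Cartesian joint positions with the centre hip at index 0 (the index is our assumption); np.arctan2 yields the full-range azimuth with the ±180° discontinuity discussed above:

    import numpy as np

    def skeleton_image(joints, ref=0):
        # joints: (N, 17, 3) Cartesian joint positions over N frames.
        # Returns an (N, 16, 3) image of cylindrical coordinates (r, θ, z)
        # relative to the reference joint, which is then discarded.
        rel = joints - joints[:, ref:ref + 1, :]   # viewpoint invariance
        rel = np.delete(rel, ref, axis=1)          # drop the reference joint
        x, y, z = rel[..., 0], rel[..., 1], rel[..., 2]
        r = np.hypot(x, y)                         # radial component
        theta = np.arctan2(y, x)                   # azimuth in [-pi, pi]
        return np.stack([r, theta, z], axis=-1)    # channels-last image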
The skeleton representation enables us to use a stacked convolutional autoencoder
model to learn modality-specific spatio-temporal features from our data. The skele-
ton autoencoder we propose is very similar to the autoencoders we use for the inertial
sensor streams, with the major difference being that 2D convolution and pooling opera-
tions are used instead of 1D operations. The final model configuration was selected by
cross-validating a number of different configurations in a hyperparameter grid search.
Table 3.3 contains the final network configuration for the skeleton autoencoder.
(a) Non-exercise segment.
(b) Squats segment.
Figure 3.5: Visualisation of two 10 second segments of the skeleton representation
during a non-exercise activity, and a set of squats. The segments have been standardised,
and then scaled channel-wise between 0 and 255. The radial coordinate, r, corre-
sponds to the red channel, the azimuth coordinate, θ, to the blue channel, and the
vertical coordinate, z, to the green channel.
In [22] a VGG19 [41] model pretrained on images was used to extract spatio-
temporal features from the skeleton representation. We adopt a different approach, and
instead leverage the large number of 3D pose estimates in the Multimodal Gym dataset
to train our model just on skeleton representations. We hypothesise that training our
network exclusively on skeleton data will improve our model as the skeleton repre-
sentation images differ significantly from natural images, even in their low-level features.
This is illustrated by the example skeleton representation shown in Figure 3.5.
3.2.3 Multimodal Autoencoder
Once the modality-specific representations have been learnt for each modality and de-
vice, the features from the embedding layer are flattened and concatenated, before
being fed into the next stage: the multimodal autoencoder. The aim of this stage is
to learn cross-modal representations that hopefully will prove to be discriminative for
the exercise recognition task. This is again done through the use of an autoencoder
network, but this time using a stacked fully-connected autoencoder. By using fully
connected layers, the network is easily able to intermix features from all modalities.
As can be seen in Figure 3.3, the architecture structure is symmetric, with the decoder
component of the multimodal autoencoder reconstructing a vector of the same size
as the concatenated input vector. The output vector is then split at the concatenation
indices, such that each segment can be reshaped into the same size as its correspond-
ing modality-specific embedding layer. The reshaped vector segment is then passed
on to the decoder module of the corresponding unimodal autoencoder, which attempts
to reconstruct the original input for that modality. A hyperparameter search using
cross-validation was conducted to select the configuration with the lowest validation
reconstruction loss. The final network configuration is outlined in Table 3.4. As for
the modality-specific autoencoders, MSE is used as the reconstruction loss, and the
network is trained end-to-end using backpropagation.
Layer                          Output Units
input                          12200
mod specific encoders          4288
flatten & concat. embeddings   4288
enc fc1                        1000
enc fc2                        1000
enc fc3                        1000
dec fc1                        1000
dec fc2                        1000
dec fc3                        4288
split                          4288
mod specific decoders          12200
Table 3.4: The network configuration for the stacked fully-connected multimodal autoencoder. The inputs are all the modalities from all the devices, where 12200 = 6×(3@1×250) + 2×(1@1×250) + (3@150×16) is the total size of the inputs across all devices and modalities, and 4288 = 6×(24@1×16) + 2×(8@1×16) + (24@9×8) is the size of the concatenated embedding features from each modality and device.
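A minimal PyTorch sketch of this concatenate-compress-split structure, assuming the pretrained unimodal encoder and decoder modules and their embedding shapes are supplied; the exact ReLU placement inside the fully-connected stacks is our assumption:

    import math
    import torch
    import torch.nn as nn

    class MultimodalAE(nn.Module):
        def __init__(self, encoders, decoders, embed_shapes):
            super().__init__()
            self.encoders = nn.ModuleList(encoders)
            self.decoders = nn.ModuleList(decoders)
            self.shapes = embed_shapes                    # e.g. (24, 16) per modality
            self.sizes = [math.prod(s) for s in embed_shapes]
            total = sum(self.sizes)                       # 4288 in our setup
            self.enc = nn.Sequential(
                nn.Linear(total, 1000), nn.ReLU(),
                nn.Linear(1000, 1000), nn.ReLU(),
                nn.Linear(1000, 1000))                    # shared embedding
            self.dec = nn.Sequential(
                nn.ReLU(), nn.Linear(1000, 1000), nn.ReLU(),
                nn.Linear(1000, 1000), nn.ReLU(),
                nn.Linear(1000, total))

        def forward(self, inputs):
            # Encode each modality, then flatten and concatenate embeddings.
            embeds = [enc(x) for enc, x in zip(self.encoders, inputs)]
            flat = torch.cat([z.flatten(1) for z in embeds], dim=1)
            code = self.enc(flat)
            # Split the decoder output at the concatenation indices and hand
            # each segment to the matching unimodal decoder.
            parts = torch.split(self.dec(code), self.sizes, dim=1)
            recons = [dec(p.view(p.size(0), *s))
                      for dec, p, s in zip(self.decoders, parts, self.shapes)]
            return recons, code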
3.2.4 Multimodal Classification
The final stage of our approach is the classification stage, in which the learned cross-
modal representations are fed into a classifier to determine the predicted label for each
input segment. Unlike [32], we treat activity segmentation and exercise recognition as
one task, instead of two separate tasks, which simplifies our exercise logging pipeline.
This is done by adding an extra non-exercise class to the set of exercise classes. The
classifier consists of three fully-connected layers which are attached to the embedding
layer of the multimodal autoencoder. The first two fully connected layers consist of
100 hidden units, and use ReLU as the activation function. The final layer corresponds
to the predicted output for each class, and consists of 11 units (one for each exercise
class, and one for the non-exercise class). The entire network is trained end-to-end,
fine-tuning the autoencoder weights to specialise the learnt features for the task of
exercise recognition. The cross-entropy loss is backpropagated through the network to
update the model parameters.
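In PyTorch terms, the classification head amounts to the following (a minimal sketch, with layer sizes taken from the text above):

    import torch.nn as nn

    # Three fully-connected layers on top of the 1000-unit shared embedding:
    # two hidden layers of 100 ReLU units, and 11 output logits (ten
    # exercise classes plus the non-exercise class).
    classifier = nn.Sequential(
        nn.Linear(1000, 100), nn.ReLU(),
        nn.Linear(100, 100), nn.ReLU(),
        nn.Linear(100, 11))

    loss_fn = nn.CrossEntropyLoss()   # cross-entropy on the output logits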
3.3 Repetition Counting
To count the number of repetitions we implemented a signal processing based ap-
proach proposed by [32]. This approach serves as a strong baseline on the Multimodal
Gym dataset, and can act as a benchmark for other researchers to compare their results
against.
The approach relies on identifying periodic peaks and auto-correlations in the data,
which are strong indicators of a repetition. It is assumed that the signal has been
segmented, and that the exercise being performed is known. To ensure that these pre-
requisites are satisfied, the repetition counting stage should be preceded by the activity
segmentation and exercise recognition stage.
The approach can easily be applied on most modalities, as it just requires being able
to transform the data into a 1D signal that captures the maximal directions of variance
in the data. In our case, the multidimensional data (3 axes for accelerometer and gyro-
scope data, or 16 relative joint positions in cylindrical coordinates for skeleton data) is
first standardised and then smoothed using a third-degree polynomial Savitzky–Golay
filter [40], along each feature dimension. The processed data is then projected onto its
first principal component direction to obtain a 1D signal.
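A minimal sketch of this projection step, assuming a (time, features) segment; the filter window length is our choice and would in practice be tuned per modality:

    import numpy as np
    from scipy.signal import savgol_filter

    def to_1d_signal(x, window=31):
        # Standardise each feature dimension, smooth with a third-degree
        # Savitzky-Golay filter, then project onto the first principal
        # component to obtain a 1D signal of maximal variance.
        x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
        x = savgol_filter(x, window_length=window, polyorder=3, axis=0)
        x = x - x.mean(axis=0)                  # centre before PCA
        _, _, vt = np.linalg.svd(x, full_matrices=False)
        return x @ vt[0]                        # first principal direction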
The next stage is to identify peaks, or local maxima, in the processed 1D signal.
This is done by simply comparing neighbouring values in the signal. Although the
smoothing and PCA removed a lot of noise from the original data, there are still many
incorrectly detected peaks that do not correspond to a repetition. To remove these
false positives, we sort the list of candidate peaks based on their amplitude, and going
through the list in descending order, we add candidate peaks to a new candidate list, as
long as there is not a peak already in the list that is within a minimum allowed threshold
distance of the current peak. This minimum threshold is chosen independently for each
exercise based on an estimate of the minimum amount of time required to execute one
repetition of the exercise.
In the next step we leverage the repetitive nature of repetitions by incorporating
information from the auto-correlation of the signal to exclude unlikely peaks. The
autocorrelation is computed for a window centred at each peak for lags between the
minimum and maximum expected duration of a repetition. The largest autocorrelation
value within this range of lags is chosen to be the period, P, at that candidate peak.
Any peaks with smaller amplitudes, and within a distance of 0.75∗P from the current
peak, are removed.
The final filtering that is performed is to remove all peaks that have an amplitude
lower than half of the 40th percentile of the remaining peak amplitudes. The number
of remaining peaks is our predicted repetition count.
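Putting the peak-filtering steps together, a simplified sketch that omits the autocorrelation-based filter for brevity; min_gap is the per-exercise minimum repetition duration in samples:

    import numpy as np

    def count_repetitions(signal, min_gap, percentile=40):
        # signal: 1D array produced by the smoothing-and-PCA step above.
        # 1. Candidate peaks: samples larger than both neighbours.
        idx = np.where((signal[1:-1] > signal[:-2]) &
                       (signal[1:-1] > signal[2:]))[0] + 1
        # 2. Keep peaks in descending amplitude order, rejecting any
        #    candidate within min_gap of an already accepted peak.
        kept = []
        for i in idx[np.argsort(signal[idx])[::-1]]:
            if all(abs(i - j) >= min_gap for j in kept):
                kept.append(i)
        if not kept:
            return 0
        # 3. Drop peaks with amplitude below half of the 40th percentile
        #    of the remaining peak amplitudes.
        amps = signal[np.array(kept)]
        return int(np.sum(amps >= 0.5 * np.percentile(amps, percentile)))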
Chapter 4
Evaluation and Discussion
This chapter outlines and discusses the evaluation procedure and results of our pro-
posed exercise logging system on the Multimodal Gym dataset. The system is eval-
uated on exercise recognition and repetition counting. In addition, we present and
analyse the initial results for our proposed multimodal zeroing-out training strategy.
4.1 Activity Segmentation and Exercise Recognition
We evaluate our proposed multimodal deep learning approach (see section 3.2) on the
Multimodal Gym dataset, and compare its performance to unimodal models, single-
device models, and single-device models trained using our proposed multimodal
zeroing-out strategy. Unimodal models are models that only take input from one modality
and device, for example, accelerometer data from the left smartwatch. Single-device
models are models that take as input all the modalities generated by a single device,
for example, accelerometer and gyroscope data from the left smartwatch. We com-
pare using pretrained unimodal autoencoders, and training from scratch. Furthermore,
we evaluate each model on two test sets, one consisting of participants who are also
present in the training set, and one consisting of participants not previously seen by the
model.
4.1.1 Pretraining Unimodal Autoencoders
This section outlines the experimental setup used to pretrain the unimodal autoen-
coders (see section 3.2.2). We trained a total of seven autoencoders, corresponding to
the following devices and modalities: smartwatch accelerometer and gyroscope, smart-
phone accelerometer and gyroscope, earbud accelerometer and gyroscope, and skele-
ton data. The following three datasets were used for pretraining: RecoFit [32], the
Heterogeneity Human Activity Recognition (HHAR) dataset [44], and Human3.6M
[7, 20].
The Microsoft RecoFit dataset [32] was used to pretrain the smartwatch accelerom-
eter and gyroscope autoencoders. The RecoFit dataset consists of accelerometer and
gyroscope data collected at 50Hz from an arm-worn sensor of individuals working
out in a gym. The dataset consists of 114 participants, across 146 sessions. To pre-
train the smartphone and earbud accelerometer and gyroscope autoencoders we use
the smartphone data from the HHAR dataset. The HHAR dataset consists of triaxial
accelerometer and gyroscope data collected from 12 smart device models at different
frequencies. The following activities are present in the HHAR dataset: biking, sitting,
standing, walking, walking down stairs, and walking up stairs. Finally, the skeleton
autoencoder was pretrained on ground truth motion capture data from the Human3.6M
dataset. The Human3.6M dataset contains 3D pose information collected from a highly
accurate motion capture system at 50Hz. The data was collected in a controlled en-
vironment across a number of participants, and a range of diverse scenarios, such as
eating, walking, and discussing.
The reason for using device-specific datasets to pretrain the autoencoders is that
the motion patterns experienced by differently positioned devices differ significantly.
The arm-worn sensor in the RecoFit dataset is likely to be involved in a large variety of
motions, from fine-grained movements to rapid movements, whereas a smartphone in
the pocket will most likely experience more rigid motion patterns. We want our models
to learn these device-specific motion patterns. Since there are no publicly available
datasets with accelerometer or gyroscope data collected from earbuds, we also use the
HHAR smartphone data to pretrain the earbud autoencoders. We hypothesise that the
motions picked up by the earbud will be more similar to those recorded by smartphones
in the pocket, compared to arm-worn sensors.
As a pre-processing step we first downsample all inertial sensor readings to 50Hz,
and the 3D pose data to 30Hz. Then the magnitude of the smartphone accelerometer
and gyroscope data was computed, and the 3D pose data was transformed to the skele-
ton representation form outlined in section 3.2.2.2. Each dataset was then randomly
split into a train, test, and validation set, using a 70-15-15 split. Finally, all instances
were standardised using the training set mean and standard deviation.
To select the network configuration for our stacked convolutional autoencoder we
use a cross-validation framework to evaluate different network configurations. The hy-
perparameter search was carried out separately for each modality, however, we only
used the RecoFit dataset to choose the network configuration for the accelerometer
and gyroscope autoencoders. We made the assumption that a similar network archi-
tecture would work well across all three types of devices for the same modality. The
following network configuration options were explored: the number of convolutional and de-
convolutional layers, the kernel size, the kernel stride, and using depth (grouped) or
regular convolutions. A selection of the configurations with the lowest reconstruction
loss on the external dataset’s test set were evaluated on the Multimodal Gym dataset
for exercise recognition, with the configuration with the highest validation accuracy
selected as our final configuration. We use the results from the autoencoder model
hyperparameter search to guide the final model selection for each modality, with the
intuition that models with a low reconstruction loss are able to preserve salient fea-
tures of the data in their embedding layer, which should be useful for the exercise
recognition task. See A.2 for the accelerometer, gyroscope, and skeleton autoencoder
hyperparameter search results. The final network configurations are given in Table 3.2
and 3.3.
As discussed in section 4.1.2, we chose to use 5 second windows for the activity
recognition task, and therefore we also use 5 second window segments as input to the
autoencoders. Instances are randomly sampled from the training set, using a batch size
of 128. Each model was trained for 100 epochs, using an Adam optimiser [24] with
a learning rate of 0.001, and the following exponential decay rates for the moment
estimates: β1 = 0.9 and β2 = 0.999. The final weights were saved for
each of the seven autoencoder models to later be used for exercise recognition on the
Multimodal Gym dataset.
4.1.2 Unimodal, Single-Device and Multimodal Models
Given our pretrained unimodal autoencoders, we now detail the approach taken to train
exercise recognition models that leverage the learnt modality-specific representations.
We split the Multimodal Gym dataset into a training set, a validation set, and two test sets,
one containing participants seen in the training set, and one with previously unseen
participants. The dataset is divided by assigning each workout session to one of the
splits. For reproducibility and to enable future works to compare results we provide
the workout IDs for each split: train (1, 2, 3, 4, 6, 7, 8, 9), validation (14, 15), test seen
(10, 11), test unseen (0, 5, 12, 13). In this work we use data from 5 of the 7 devices in
the Multimodal Gym dataset, ignoring one of the RGB-D cameras and the left smart-
phone. The use of multi-view visual information is left for future work, and the left
smartphone is excluded because there is only data collected for that device in a subset
of the recorded workout sessions.
All the models take a 5 second window segment as input, which corresponds to
250 sensor samples at 50 Hz, or 150 skeleton samples at 30 Hz. The optimal window
size varies across different activity recognition applications, and there is often a trade-
off between accuracy and recognition speed [3]. Five second windows were chosen as
this is large enough to capture the repetitive nature of the exercises, whilst also ensuring
that the recognition speed (2.5 seconds) is acceptable for an exercise logging system.
During training, instances were randomly sampled from all workouts in the training set,
using a batch size of 128. At test time we generated batches using sequential strided
sampling, with a stride of 0.2 seconds. This simulates how the system would be used in
a real-world application, and is equivalent to the setup used in [32]. Window segments
are labelled according to the majority class in the segment, which is common practice
for activity recognition.
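A minimal sketch of this windowing and labelling scheme is shown below, assuming the samples and per-sample labels are NumPy arrays; the helper name is ours. During training, window start positions would instead be drawn at random.

import numpy as np

def sliding_windows(samples, labels, rate_hz, window_s=5.0, stride_s=0.2):
    """Yield (window, majority_label) pairs via sequential strided
    sampling; 5 s windows are 250 samples at 50 Hz or 150 at 30 Hz."""
    win = int(window_s * rate_hz)
    stride = int(stride_s * rate_hz)
    for start in range(0, len(samples) - win + 1, stride):
        # Label each window with the majority class in the segment.
        values, counts = np.unique(labels[start:start + win],
                                   return_counts=True)
        yield samples[start:start + win], values[np.argmax(counts)]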
A hyperparameter search was carried out to determine the number of fully-connected
layers, hidden units, and amount of dropout [43] to use for the unimodal, single-device,
and multimodal models. To limit the number of experiments, the search was performed
only on the left smartwatch accelerometer data for the unimodal model, and on the left
smartwatch accelerometer and gyroscope data for the single-device model. The
configuration with the highest validation accuracy was chosen for each model. The final
configuration was 3 layers with 100 hidden units, and no dropout, for all three model
types. A hyperparameter search was also carried out for the multimodal autoencoder in
order to select the number of encoder and decoder layers, the number of hidden units,
and the number of units in the embedding layer. The configuration with the lowest
validation reconstruction loss was selected. The final configuration consisted of three
encoder and three decoder fully-connected layers, with 1000 hidden units, and 1000
units in the embedding layer. The unimodal models are evaluated for exercise
recognition by flattening the embedding representation from the modality-specific
autoencoder and attaching the three-layered fully-connected classifier. The same
approach is taken for the single-device models, with the only difference being that the
embedding layers from the two modalities are flattened and concatenated before
attaching the fully-connected classifier. Finally, the multimodal model is evaluated
for exercise recognition by attaching the fully-connected classifier to the embedding
layer of the multimodal autoencoder. In all three model types, the autoencoder weights
are fine-tuned using a lower learning rate than for the fully-connected layers.
All experiments were trained using an Adam optimiser [24] with a learning rate of
0.001 and exponential decay rates for the moment estimates of β1 = 0.9 and β2 = 0.999.
For fine-tuning the pretrained weights, a learning rate an order of magnitude lower,
0.0001, was used. Furthermore, a learning rate scheduler was used to decrease the
learning rate by an order of magnitude if the validation metric did not improve for 10
epochs. The models were trained until convergence, which was typically after 10 to 20
epochs.
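In PyTorch terms, the two learning rates and the scheduler described above could be configured as follows; this is a sketch, and the encoder and classifier modules are stand-ins for the pretrained autoencoder encoder and the three-layer classifier head.

import torch
from torch import nn

embedding_dim, n_classes = 1000, 11  # 10 exercises plus non-exercise
encoder = nn.Linear(embedding_dim, embedding_dim)  # pretrained stand-in
classifier = nn.Sequential(                        # 3 layers, 100 hidden units
    nn.Linear(embedding_dim, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, n_classes))

optimiser = torch.optim.Adam(
    [{"params": encoder.parameters(), "lr": 0.0001},     # fine-tuned
     {"params": classifier.parameters(), "lr": 0.001}],  # trained from scratch
    betas=(0.9, 0.999))

# Decrease the learning rate by an order of magnitude when the validation
# metric has not improved for 10 epochs; step once per epoch with the
# current validation accuracy.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimiser, mode="max", factor=0.1, patience=10)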
The Multimodal Gym exercise recognition accuracies for the explored models are
presented in Table 4.1, along with confusion matrices for the multimodal model in Fig-
ure 4.1, and for the single-device models in Figures 4.2 and 4.3 (see Appendix A.3 for additional
confusion matrices). The highest accuracies on the two test sets, seen and unseen sub-
jects, are obtained by the multimodal model with 99.51% and the skeleton model with
92.94%, respectively. The skeleton model also obtained strong performance on the
seen test set, with 99.37% accuracy. The strong performance of the skeleton model is
expected, since the visual information encoded in the skeleton sequence data is natu-
rally very discriminative for the exercise recognition task. Incorporating sensor infor-
mation into the skeleton model through multimodal learning has little or no benefit.
The next-best performing models are the left and right single-device smartwatch
models, combining accelerometer and gyroscope data, with test accuracies of 98.72%
and 98.75%, respectively. The strong performance of the smartwatch models can be explained by the
fact that the smartwatches are involved in the movements of all ten exercises studied in
the Multimodal Gym dataset. The earbud and right smartphone models perform sub-
stantially worse than the other devices’ models, which is expected since the earbud and
smartphone are positioned such that they are not heavily involved in many of the exer-
cises. Inspecting the confusion matrices for the single-device earbud model in Figures
4.2e and 4.3e, and for the single-device smartphone model in Figures 4.2d and 4.3d,
supports this hypothesis. The smartphone and earbud show poor classification perfor-
mance on exercises with limited head and core movement, such as biceps curls, triceps
extensions, shoulder press and lateral raises, and better performance on, for example,
squats and jumping jacks.
Among the unimodal models we observe that the unimodal accelerometer models
consistently outperform the corresponding unimodal gyroscope models, suggesting that
acceleration is more discriminative than angular velocity for exercise recognition. Fus-
ing accelerometer and gyroscope data, as is done in the single-device models, further
improves exercise recognition accuracy for all devices, demonstrating the benefit of
leveraging multiple data perspectives through multimodal learning.
To evaluate the generalisation ability of our models we tested each model on un-
seen test subjects, that is, subjects that were not present in the training data. Different
participants will perform exercises with slight variations in form and tempo, which an
automated exercise logging system must be robust to. The results show that our models
struggle to generalise well to unseen participants. Morris et al. initially observed the
same generalisation problem in their work, when training on a dataset of 30 participants
[32]. Once they expanded their dataset to 160 participants, their models were able to
generalise considerably better. To improve the generalisability of our models we could expand
our dataset to include more participants, and also apply data augmentation methods to
introduce more variability in the data.
The confusion matrices in Figures 4.1, 4.2, and 4.3 reveal that most misclassi-
fications, in particular on the unseen test set, are exercise segments mislabelled as
non-exercise segments. Further analysis into these failure cases is required, but most
likely many of these misclassifications are occurring at the ambiguous boundary re-
gions at the start and end of exercise sets, where the window segment contains both
exercise and non-exercise samples. This problem could be alleviated through tempo-
ral smoothing of predictions, for example only changing the currently predicted class
to a different class if three consecutive predictions are in agreement (sketched below).
Another potential reason for these misclassifications is the degree of class imbalance
in the dataset, where non-exercise samples outnumber those of each exercise by roughly
30 to 1. This problem could be addressed through stratified batching during training.
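The three-in-agreement smoothing rule could, for instance, take the following form (the function name is ours):

def smooth_predictions(raw, agreement=3):
    """Only switch to a new class once `agreement` consecutive raw
    predictions agree; otherwise keep the current class."""
    current, smoothed = raw[0], []
    for i, pred in enumerate(raw):
        recent = raw[max(0, i - agreement + 1):i + 1]
        if len(recent) == agreement and all(p == pred for p in recent):
            current = pred
        smoothed.append(current)
    return smoothed

# A single spurious non-exercise prediction at a set boundary is suppressed:
print(smooth_predictions(["squats", "squats", "squats", "non-exercise",
                          "squats", "squats", "squats"]))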
The results indicate that pretraining does not have a significant impact on model
performance for our exercise recognition task. This is most likely because we are
using relatively lightweight models with few parameters, for which our dataset is suf-
ficiently large to train from scratch without overfitting. However, in settings where
only a small dataset is available for the target task, but where there are many similar
unlabelled instances available, pretraining could help improve performance. This is
especially relevant for tasks involving inertial sensor data, as large unlabelled datasets
are readily available, but creating labelled inertial sensor datasets is a challenging and
time-consuming task.
4.1.3 Multimodal Zeroing-Out Training Strategy
To learn more robust representations for segmentation and exercise recognition when
data from only a single device is available at test time, we investigate whether
multimodal training can be used to strengthen the unimodal representations. We do this
by gradually and randomly zeroing out, during training, the input modalities that are
only available at training time, whilst still requiring the multimodal autoencoder to
reconstruct all the original inputs. By forcing the network to reconstruct modalities for
which it only occasionally sees the input, we hypothesise that the network will learn
features that encode information about the relationships between different modalities,
potentially supplementing and strengthening the unimodal features of the device which
is available at test time. In particular, we believe that the less discriminative modalities
from the earbud and smartphone can benefit from the data perspectives provided by
the stronger skeleton and smartwatch modalities. We refer to the device that is used at
test time as the target device.
We now outline the proposed multimodal zeroing-out strategy. We initialise the
multimodal autoencoder with the pretrained weights of a multimodal autoencoder
trained on all available modalities. For the first two epochs, we randomly select one
device (excluding the target device) to zero out for each batch; for every following two
epochs, we increase the number of devices to zero out by one. This process continues
until only the target device is not being zeroed out, at which point we let the network
train for another five epochs. The model is evaluated on the validation set after every
epoch, by zeroing out all the modalities except those belonging to the target device, and
recording the reconstruction loss as the model attempts to reconstruct all the original
inputs. The model with the lowest validation reconstruction loss is selected to be
evaluated for exercise recognition. To evaluate the learnt representation on the
Multimodal Gym dataset for exercise recognition, three fully-connected layers with 100
hidden units are attached to the embedding layer of the multimodal autoencoder. The
model is then trained for exercise recognition, zeroing out all modalities except those
belonging to the target device. The model with the highest validation accuracy is
selected and evaluated on the test sets. The experimental setup is identical to the one
outlined in section 4.1.2 in terms of learning rate, optimiser and batch size.
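The batch-level schedule can be summarised by the sketch below; the helper is ours, and in the actual training loop the sampled devices' inputs are replaced with zeros while all reconstruction targets are left untouched.

import random

DEVICES = ["camera", "watch_left", "watch_right", "phone_right", "earbud"]

def devices_to_zero(epoch, target):
    """Every two epochs, one more non-target device is zeroed out per
    batch, until only the target device remains visible."""
    others = [d for d in DEVICES if d != target]
    n = min(1 + epoch // 2, len(others))  # 1 device in epochs 0-1, 2 in 2-3, ...
    return random.sample(others, n)

# Early in training: one random non-target device is zeroed per batch.
print(devices_to_zero(epoch=0, target="earbud"))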
We trained five models, each with a different target device: left camera, left
smartwatch, right smartwatch, right smartphone, and earbud. The test set accuracies for each
of the five models are given in Table 4.2, along with the confusion matrices in Figures
A.17 and A.18. The multimodal zeroing-out training strategy results in marginally bet-
ter accuracy for the two smartwatches on the seen test set, and for the right smartwatch
on the unseen test set compared to the corresponding single-device models. However,
for the remaining devices, the test set accuracy is lower when using the zeroing-out
training strategy. These initial results indicate that the zeroing-out training strategy is
not effective at learning more robust representations or at leveraging external data
perspectives. However, more extensive exploration is required before conclusively
discarding this approach. We propose trying variations of this approach, such as mod-
ifying how modalities are removed during training, and trying to reconstruct just one
modality instead of multiple ones, so that training can focus on leveraging information
from just one data perspective. Furthermore, we plan on experimenting with instance-
level zeroing-out, instead of batch-level zeroing-out, which we hypothesise will result
in smoother learning.
Device        Modality    Accuracy    Accuracy    Accuracy         Accuracy
                          (w/o PT)    (w PT)      (w/o PT, UTS)    (w PT, UTS)
All           -           -           99.51%      -                92.28%
Camera        Skeleton    99.37%      99.36%      92.55%           92.94%
Watch left    Acc         98.51%      98.30%      90.37%           88.54%
              Gyr         97.81%      98.10%      88.79%           86.51%
              Acc & Gyr   98.72%      98.72%      93.85%           92.48%
Watch right   Acc         98.50%      98.37%      88.54%           88.07%
              Gyr         97.86%      97.87%      86.87%           86.85%
              Acc & Gyr   98.75%      98.68%      91.95%           91.34%
Phone right   Acc         91.21%      90.00%      82.17%           82.17%
              Gyr         87.64%      86.62%      78.97%           77.08%
              Acc & Gyr   92.65%      93.31%      83.91%           83.94%
Earbud        Acc         92.85%      91.89%      78.89%           77.04%
              Gyr         88.65%      88.68%      79.17%           77.64%
              Acc & Gyr   94.17%      94.22%      82.11%           80.47%

Table 4.1: Multimodal, unimodal and single-device exercise recognition accuracies on
the Multimodal Gym test sets. We evaluate whether pretraining (PT) boosts performance,
and how well each model generalises to unseen test subjects (UTS).
Figure 4.1: Confusion matrices for the multimodal model on the Multimodal Gym test
set: (a) seen subjects, (b) unseen subjects. True labels against predicted labels over the
ten exercises and the non-exercise class.
Device             Accuracy   Accuracy (UTS)
Camera left        99.35%     91.13%
Smartwatch left    98.78%     92.78%
Smartwatch right   99.08%     91.99%
Smartphone right   91.01%     81.60%
Earbud             93.08%     81.28%

Table 4.2: Single-device test accuracies for pretrained models trained using the
zeroing-out strategy, for both seen and previously unseen test subjects (UTS). Results
should be compared with the corresponding single-device pretrained model's accuracies
in Table 4.1.
4.2 Repetition Counting
We have implemented and evaluated a strong baseline approach (see section 3.3) for
the task of repetition counting on the Multimodal Gym dataset. See Figure 4.4 for an
example output of the repetition counting approach. Since the proposed approach does
not require any training, and we do not fit any model hyperparameters using any data,
we are able to evaluate the approach on all 16 workout sessions. This brings the total
number of exercise sets on which our approach was evaluated to 466. The ground truth
number of repetitions across all exercise sets ranges from 9 to 14, with a mean of 10.04
repetitions per set.
The mean absolute repetition counting error (MAE) per set obtained by the peak
counting and auto-correlation method, for the different modalities and devices, is given
in Table 4.3. Table 4.4 displays the percentage of predictions within different ranges of
the ground truth. The left smartwatch gyroscope modality obtains the lowest MAE,
0.29, closely followed by the right smartwatch gyroscope, 0.32, and the left and right
smartwatch accelerometers, 0.34 and 0.38, respectively. The predictions made using
phone and earbud data are significantly worse compared to those made using the watch
and skeleton data. This is expected since the phone and earbud are positioned such that
they do not capture the central movements of many of the exercises, for example biceps
curls, rows, and triceps extensions. Surprisingly, the skeleton modality obtains an MAE
over twice that of the best performing modality, the left smartwatch gyroscope. This is
surprising since intuitively the skeleton sequence data should be the most informative,
and be able to capture the distinguishing movements of any exercise, irrespective of
which body parts are involved. Analysing the results further at the exercise level (Table
4.3) reveals that the skeleton modality struggles to count biceps curls, but performs well
on all the other exercises. The reason for the skeleton modality's poor performance on
counting biceps curl repetitions is explained by Figure 4.5. The processed signal
obtained using the implemented approach defines one biceps curl repetition as two
alternating curls (one for each arm). This highlights an ambiguous case of how a
repetition should be defined: is one curl a repetition, or are two alternating curls one
repetition? Ignoring the biceps curl exercise results in an MAE of 0.29 for the skeleton
modality, on a level with the performance of the left smartwatch gyroscope.
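To give a flavour of the peak-counting stage of the baseline (see section 3.3), a deliberately simplified sketch is shown below; the full pipeline also involves signal pre-processing and auto-correlation, and the duration threshold here is an illustrative placeholder rather than the per-exercise values actually used.

import numpy as np
from scipy.signal import find_peaks

def count_repetitions(signal_1d, rate_hz, min_rep_s=1.0):
    """Count repetitions as peaks in a processed 1D signal, with the
    minimum peak spacing set by a minimum repetition duration."""
    peaks, _ = find_peaks(np.asarray(signal_1d),
                          distance=int(min_rep_s * rate_hz))
    return len(peaks)

# Ten synthetic 1 Hz "repetitions" sampled at 50 Hz:
t = np.arange(0, 10, 1 / 50)
print(count_repetitions(np.sin(2 * np.pi * t), rate_hz=50))  # -> 10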
The results we obtain are in line with the results reported by Morris et al. for
repetition counting on the RecoFit dataset [32]. Using the same approach, and also
evaluating on the ground-truth exercise segmentation boundaries, they obtain an MAE
of 0.26 across 160 sets for 26 different exercises. Their predictions were obtained using
data from a right-arm worn accelerometer sensor. They obtain exact predictions in 50%
of the sets, within-1 in 97% of the sets, and within-2 in 99% of the sets, which can be
compared to our right watch accelerometer, exact 75%, within-1 95%, and within-2
97%. One difference between our findings and those of Morris et al. is that they did
not find gyroscope data useful for repetition counting [32]; our results, however, show
that the method performs marginally better in terms of MAE on gyroscope data than on
accelerometer data, across all devices.
The repetition counting approach is intuitive, efficient, and performs remarkably
                   Exercises                                                      Total
Device     Mod.    Sq    Pu    Sp    Lu    Ro    Su    Te    Bc    Lr    Jj
Cam L.     Skel    0.06  0.63  0.73  0.0   0.31  0.06  0.37  4.74  0.09  0.33   0.70
Watch L.   Acc     0.19  0.52  0.19  0.02  0.20  0.23  0.24  1.35  0.05  0.53   0.34
           Gyr     0.17  0.21  0.71  0.47  0.31  0.21  0.12  0.35  0.0   0.35   0.29
Watch R.   Acc     0.17  0.60  0.25  0.02  0.16  0.27  0.24  1.51  0.05  0.60   0.38
           Gyr     0.23  0.19  0.52  0.57  0.33  0.17  0.24  0.37  0.07  0.47   0.32
Phone R.   Acc     1.40  0.21  1.77  0.32  1.27  2.69  2.51  2.51  1.77  1.40   1.58
           Gyr     3.04  0.58  1.40  0.87  0.57  2.81  1.71  1.21  1.35  1.60   1.52
Earbud     Acc     0.19  0.60  1.54  0.21  1.63  0.90  0.53  2.00  1.47  2.91   1.17
           Gyr     0.79  0.88  1.06  0.72  1.41  0.56  0.51  1.77  1.28  2.49   1.12

Table 4.3: The mean absolute repetition counting error per set for each modality and
device across all exercises: squats (Sq), push-ups (Pu), shoulder press (Sp), lunges
(Lu), dumbbell rows (Ro), sit-ups (Su), triceps extensions (Te), biceps curls (Bc), lateral
raises (Lr), and jumping jacks (Jj).
Device             Modality   Exact              Within 1           Within 2
Camera left        Skeleton   67.17% (74.00%)    89.06% (97.87%)    90.56% (99.53%)
Smartwatch left    Acc        74.03%             95.28%             98.07%
                   Gyr        75.32%             97.00%             98.93%
Smartwatch right   Acc        75.11%             95.06%             96.57%
                   Gyr        72.96%             97.00%             99.14%
Smartphone right   Acc        40.56%             65.67%             76.18%
                   Gyr        31.55%             68.45%             84.33%
Earbud             Acc        43.56%             70.17%             82.62%
                   Gyr        40.34%             72.75%             84.55%

Table 4.4: The percentage of exercise sets for which the predicted repetition count is
exact, within 1, or within 2 of the ground-truth repetition count, for each modality and
device. The percentages in brackets for the skeleton modality are the results obtained
when ignoring biceps curls.
well. However, there are clear limitations associated with this approach. One such
limitation is that a minimum and maximum repetition duration threshold needs to be
manually chosen for each exercise. The performance is sensitive to how accurate these
thresholds are, and the reliance on them means that the approach is not invariant to the
rate at which repetitions are performed. Another limitation is that the method is heavily
reliant on accurate segmentation and classification in order to obtain good results.
Finally, the method relies on simple, heuristic-based signal processing techniques, and
does not learn a high-level representation of what constitutes a repetition. This problem
is exemplified by the approach's behaviour when counting biceps curl repetitions using
smartwatch data. Intuitively one would expect the repetition counting method to predict
five repetitions for a set of ten alternating biceps curls, since the smartwatch would only
be able to register the curls from one of the arms. However, the method often predicts
ten, or close to ten, repetitions in these scenarios. This is because the assumptions
supplied to the model about how long a biceps curl repetition should last cause the
model to interpret the small peaks in between the actual repetitions as repetitions. These
peaks are most likely just noise, but the method cannot distinguish between genuine
repetition peaks and noisy ones. More sophisticated methods that incorporate
higher-level reasoning are required for this.
We present this baseline approach on the Multimodal Gym dataset with the hope
that future work can build on and improve these results. One interesting future research
direction is to use more sophisticated supervised learning approaches to count
repetitions. Such methods would benefit from repetition-level annotations. Since
manually annotating this information would be very time-consuming, we propose using
the presented baseline approach to automatically generate repetition-level annotations.
A promising future research direction is the use of Long Short-Term Memory (LSTM)
networks [19], which are well-suited for sequence tasks where previous information is
useful for the current prediction.
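One possible formulation, given purely as an illustrative sketch and not as an implemented method, is an LSTM that tags each timestep with the probability that a new repetition begins there, so that the count is the number of rising edges in its output:

import torch
from torch import nn

class RepCounterLSTM(nn.Module):
    """Per-timestep repetition-onset tagger (illustrative sketch)."""
    def __init__(self, in_channels=3, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):              # x: (batch, time, channels)
        out, _ = self.lstm(x)
        return torch.sigmoid(self.head(out)).squeeze(-1)

model = RepCounterLSTM()
probs = model(torch.randn(1, 250, 3))  # one 5 s tri-axial sensor window
count = ((probs > 0.5).int().diff() > 0).sum()  # rising edges ~ repetitions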
Figure 4.2: Confusion matrices for the single-device models (with pretraining) on the
Multimodal Gym test set: (a) smartwatch left acc & gyr, (b) smartwatch right acc & gyr,
(c) skeleton, (d) smartphone right acc & gyr, (e) earbud left acc & gyr.
Figure 4.3: Confusion matrices for the single-device models (with pretraining) on the
Multimodal Gym test set for unseen subjects: (a) smartwatch left acc & gyr, (b)
smartwatch right acc & gyr, (c) skeleton, (d) smartphone right acc & gyr, (e) earbud
left acc & gyr.
Figure 4.4: Repetition counting for a set of ten squat repetitions using gyroscope
readings (X, Y and Z axes, in rad/s) from a smartwatch worn on the left wrist. The
original gyroscope data is shown in the top plot, and the processed 1D signal is shown
in the bottom plot, with the predicted repetitions marked out.
Figure 4.5: Processed skeleton modality signal for repetition counting for a set of biceps
curls, with the predicted repetitions marked out.
Chapter 5
Conclusion
5.1 Summary
This work proposes an automated exercise logging system, consisting of a joint
activity segmentation and exercise recognition stage, followed by a repetition counting
stage. We use a multimodal deep learning approach for activity segmentation and
exercise recognition that extracts modality-specific representations using a stacked
convolutional autoencoder. These representations are then fed into a deep
fully-connected autoencoder to learn a shared cross-modal representation, which is used
for the exercise recognition task.
To train and evaluate our automated exercise logging system, we have collected and
annotated the Multimodal Gym dataset, a substantial dataset of subjects performing
full-body workouts, consisting of RGB-D video, inertial sensing, and pose estimate
data. This dataset will be made publicly available, with the hope that it will be a useful
resource for multimodal research. We investigate fusing inertial sensor data and 3D
pose estimates, which has not previously been tried for HAR. Our proposed
multimodal approach demonstrates near-perfect exercise recognition performance on
the Multimodal Gym dataset, with 99.5% accuracy. We also evaluate and compare
unimodal and single-device models across a number of smart devices, and evaluate the
generalisability of our models on unseen test subjects.
In addition, we introduce the zeroing-out training strategy as an approach for
strengthening unimodal features by leveraging external data perspectives through
multimodal training. We present our initial results, which indicate that our zeroing-out
strategy is not effective at leveraging external data perspectives; however, a more
comprehensive exploration is required, for which we suggest a number of future
research directions.
Furthermore, we have implemented a strong baseline method, proposed by Morris
et al. [32], for the repetition counting task on the Multimodal Gym dataset. The best
performing modality was the smartwatch gyroscope, with a mean absolute prediction
error per set of 0.29. We discuss the limitations of the baseline approach and propose
promising future research directions for repetition counting.
5.2 Future Work
Future work will focus on improving a number of aspects of the proposed automated
exercise logging system. To improve the robustness and generalisability of our exer-
cise recognition models we will investigate data augmentation techniques for sensor
and skeleton data. Additionally, the Multimodal Gym dataset could be expanded to
include more participants in order to introduce more variability in the data. These
modifications should result in improved model performance on unseen test subjects.
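Two augmentations commonly used for inertial data are additive jitter and random scaling; the sketch below is illustrative only, and not a technique we have evaluated.

import numpy as np

def augment(window, rng=None):
    """Additive Gaussian jitter plus random per-channel scaling for a
    (channels x samples) sensor window; magnitudes are placeholders."""
    rng = rng or np.random.default_rng()
    jitter = rng.normal(0.0, 0.05, size=window.shape)
    scale = rng.uniform(0.9, 1.1, size=(window.shape[0], 1))
    return (window + jitter) * scale

window = np.zeros((3, 250))   # a 5 second tri-axial accelerometer window
print(augment(window).shape)  # (3, 250)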
To decrease the number of exercise segments being misclassified as non-exercise
segments, we propose three potential solutions to investigate. First, we propose evalu-
ating whether temporal smoothing of the predictions improves performance. This can
be done in an online fashion with a small additional cost in recognition speed. Sec-
ondly, we will explore whether stratified batching can be used to handle the class
imbalance problem, ensuring that the network places equal importance on all classes.
Finally, a potential solution to explore is to label, and predict, each window segment
with a vector containing the proportion of each class in the segment, instead of a single
class per segment (sketched below). This would eliminate the problem of ambiguous
overlapping windows at the borders of exercise and non-exercise segments.
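A sketch of the proportion-vector labelling (the helper is ours); such soft targets could then be trained against using, for example, a cross-entropy or KL-divergence loss:

import numpy as np

def proportion_label(segment_labels, num_classes):
    """Label a window with the proportion of each class it contains,
    instead of a single majority class."""
    counts = np.bincount(segment_labels, minlength=num_classes)
    return counts / counts.sum()

# A boundary window that is 60% squats (class 0), 40% non-exercise (class 10):
window = np.array([0] * 150 + [10] * 100)
print(proportion_label(window, num_classes=11))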
An additional promising future research direction for further improving the ex-
ercise recognition model is to remove the intermediate stage of extracting 3D pose
estimates, and to work directly on the raw RGB and depth data. This would make the
learning task significantly more difficult, thus requiring a larger and more sophisticated
network, but would allow the network to determine which features in the raw data
are relevant for the task of exercise recognition.
We also propose a more extensive exploration of the zeroing-out training strategy
(see section 4.1.3). This will include modifying how modalities are removed during
training, experimenting with instance-level zeroing-out, and using a smaller bottleneck
size in the multimodal autoencoder. This would make the network more likely to learn
a shared cross-modal representation, as the network would be restricted in the space it
has to learn independent modality-specific representations.
Finally, as discussed in section 4.2, we highlight LSTMs as a promising approach
that is well-suited for repetition counting. Since LSTMs would benefit from having an-
notations at the repetition level, we propose using a signal processing technique, such
as the one proposed by Morris et al. [32], to automatically generate these
repetition-level annotations.
Bibliography
[1] Jake K. Aggarwal and Lu Xia. Human activity recognition from 3D data: A
review. Pattern Recognition Letters, 48:70–80, 2014.
[2] B Almaslukh. An effective deep autoencoder approach for online smartphone-
based human activity recognition. International Journal of Computer Science
and Network Security, 17, 04 2017.
[3] Oresti Banos, Juan-Manuel Galvez, Miguel Damas, Hector Pomares, and Ignacio
Rojas. Window size impact in human activity recognition. Sensors, 14(4):6474–
6499, 2014.
[4] C. G. Bender, J. C. Hoffstot, B. T. Combs, S. Hooshangi, and J. Cappos. Measur-
ing the fitness of fitness trackers. In 2017 IEEE Sensors Applications Symposium
(SAS), pages 1–6, March 2017.
[5] Andreas Bulling, Ulf Blanke, and Bernt Schiele. A tutorial on human activ-
ity recognition using body-worn inertial sensors. ACM Comput. Surv., 46:33:1–
33:33, 2014.
[6] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Open-
Pose: realtime multi-person 2D pose estimation using Part Affinity Fields. In
arXiv preprint arXiv:1812.08008, 2018.
[7] Catalin Ionescu, Fuxin Li, and Cristian Sminchisescu. Latent structured models for
human pose estimation. In International Conference on Computer Vision, 2011.
[8] Liming Chen, Jesse Hoey, Chris D. Nugent, Diane J. Cook, and Zhiwen Yu.
Sensor-based activity recognition. Trans. Sys. Man Cyber Part C, 42(6):790–
808, November 2012.
[9] K. S. Choi, Y. S. Joo, and S. Kim. Automatic exercise counter for outdoor exer-
cise equipment. In 2013 IEEE International Conference on Consumer Electron-
ics (ICCE), pages 436–437, Jan 2013.
[10] Yong Du, Yun Fu, and Liang Wang. Skeleton based action recognition with con-
volutional neural network. 2015 3rd IAPR Asian Conference on Pattern Recog-
nition (ACPR), pages 579–583, 2015.
[11] Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for
skeleton based action recognition. 2015 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 1110–1118, 2015.
[12] Georgios Dimitrios Evangelidis, Gurkirt Singh, and Radu Horaud. Skeletal
quads: Human action recognition using joint quadruples. 2014 22nd Interna-
tional Conference on Pattern Recognition, pages 4513–4518, 2014.
[13] Hristijan Gjoreski, Jani Bizjak, Martin Gjoreski, and Matjaz Gams. Compar-
ing deep and classical machine learning methods for human activity recognition
using wrist accelerometer. 2016.
[14] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural
networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors,
Proceedings of the Fourteenth International Conference on Artificial Intelligence
and Statistics, volume 15 of Proceedings of Machine Learning Research, pages
315–323, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.
[15] S. Ha, J. Yun, and S. Choi. Multi-modal convolutional neural networks for ac-
tivity recognition. In 2015 IEEE International Conference on Systems, Man, and
Cybernetics, pages 3017–3022, Oct 2015.
[16] Nils Y. Hammerla, Shane Halloran, and Thomas Plötz. Deep, convolutional, and
recurrent models for human activity recognition using wearables. In IJCAI, 2016.
[17] Keng hao Chang, Mike Y. Chen, and John F. Canny. Tracking free-weight exer-
cises. In UbiComp, 2007.
[18] Samitha Herath, Mehrtash Tafazzoli Harandi, and Fatih Murat Porikli. Going
deeper into action recognition: A survey. Image Vision Comput., 60:4–21, 2017.
[19] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural
Comput., 9(8):1735–1780, November 1997.
[20] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Hu-
man3.6m: Large scale datasets and predictive methods for 3d human sensing
in natural environments. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 36(7):1325–1339, jul 2014.
[21] Fahim Kawsar, Chulhong Min, Akhil Mathur, and Alessandro Montanari. Ear-
ables for personal-scale behavior analytics. IEEE Pervasive Computing, 17:83–
89, 2018.
[22] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Ahmed Sohel, and
Farid Boussaïd. A new representation of skeleton sequences for 3d action recog-
nition. 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 4570–4579, 2017.
[23] Aftab Khan, Nils Y. Hammerla, Sebastian Mellor, and Thomas Plötz. Optimis-
ing sampling rates for accelerometer-based human activity recognition. Pattern
Recognition Letters, 73:33–40, 2016.
[24] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimiza-
tion. In International Conference on Learning Representations (ICLR), 2015.
[25] Oscar D. Lara and Miguel A. Labrador. A survey on human activity recognition
using wearable sensors. IEEE Communications Surveys & Tutorials, 15:1192–
1209, 2013.
[26] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature,
521(7553):436–444, 5 2015.
[27] Yann Lecun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based
learning applied to document recognition. In Proceedings of the IEEE, pages
2278–2324, 1998.
[28] O. Levy and L. Wolf. Live repetition counting. In 2015 IEEE International
Conference on Computer Vision (ICCV), pages 3020–3028, Dec 2015.
[29] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Skeleton-based action recog-
nition with convolutional neural networks. 2017 IEEE International Conference
on Multimedia Expo Workshops (ICMEW), pages 597–600, 2017.
[30] Shanhong Liu. Fitness & activity tracker. https://www.statista.com/study/35598/
fitness-and-activity-tracker/, 2018.
[31] Julieta Martinez, Rayat Hossain, Javier Romero, and James J. Little. A simple
yet effective baseline for 3d human pose estimation. In ICCV, 2017.
[32] Dan Morris, T. Scott Saponas, Andrew Guillory, and Ilya Kelner. Recofit: using a
wearable sensor to find, recognize, and count repetitive exercises. In CHI, 2014.
[33] Bobak Mortazavi, Mohammad Pourhomayoun, Gabriel Alsheikh, Nabil Alshu-
rafa, Sunghoon Ivan Lee, and Majid Sarrafzadeh. Determining the single best
axis for exercise repetition recognition and counting on smartwatches. 2014 11th
International Conference on Wearable and Implantable Body Sensor Networks,
pages 33–38, 2014.
[34] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and An-
drew Y. Ng. Multimodal deep learning. In ICML, 2011.
[35] Ronald Poppe. A survey on vision-based human action recognition. Image Vision
Comput., 28:976–990, 2010.
[36] Valentin Radu, Nicholas D. Lane, Sourav Bhattacharya, Cecilia Mascolo, Ma-
hesh K. Marina, and Fahim Kawsar. Towards multimodal deep learning for
activity recognition on mobile devices. In Proceedings of the 2016 ACM In-
ternational Joint Conference on Pervasive and Ubiquitous Computing: Adjunct,
UbiComp ’16, pages 185–188, New York, NY, USA, 2016. ACM.
[37] Valentin Radu, Catherine Tong, Sourav Bhattacharya, Nicholas D. Lane, Cecilia
Mascolo, Mahesh K. Marina, and Fahim Kawsar. Multimodal deep learning for
activity and context recognition. Proc. ACM Interact. Mob. Wearable Ubiquitous
Technol., 1(4):157:1–157:27, January 2018.
[38] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense
human pose estimation in the wild. 2018.
[39] T. F. H. Runia, C. G. M. Snoek, and A. W. M. Smeulders. Real-world repetition
estimation by div, grad and curl. In 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 9009–9017, 2018.
[40] Abraham Savitzky and M. J. E. Golay. Smoothing and differentiation of data
by simplified least squares procedures. Analytical Chemistry, 36(8):1627–1639,
1964.
[41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition. In International Conference on Learning Representations,
2015.
[42] Andrea Soro, Gino Brunner, Simon Tanner, and Roger Wattenhofer. Recogni-
tion and repetition counting for complex physical exercises with deep learning.
Sensors, 19(3), 2019.
[43] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: A simple way to prevent neural networks from overfit-
ting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014.
[44] Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow,
Mikkel Baun Kjærgaard, Anind Dey, Tobias Sonne, and Mads Møller Jensen.
Smart devices are different: Assessing and mitigating mobile sensing hetero-
geneities for activity recognition. In Proceedings of the 13th ACM Conference on
Embedded Networked Sensor Systems, SenSys ’15, pages 127–140, New York,
NY, USA, 2015. ACM.
[45] Terry Taewoong Um, Vahid Babakeshizadeh, and Dana Kulic. Exercise motion
classification from large-scale wearable sensor data using convolutional neural
networks. 2017 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), pages 2385–2390, 2017.
[46] Raviteja Vemulapalli, Felipe Arrate, and Rama Chellappa. Human action recog-
nition by representing 3d skeletons as points in a lie group. 2014 IEEE Confer-
ence on Computer Vision and Pattern Recognition, pages 588–595, 2014.
[47] Jiang Wang, Zicheng Liu, Ying Wu, and Junsong Yuan. Learning actionlet en-
semble for 3d human action recognition. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 36:914–927, 2014.
[48] Jindong Wang, Yiqiang Chen, Shuji Hao, Xiaohui Peng, and Lisha Hu. Deep
learning for sensor-based activity recognition: A survey. Pattern Recognition
Letters, 119:3–11, 2019.
[49] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional
networks for skeleton-based action recognition. In AAAI, 2018.
[50] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiaoli Li, and Shonali Krish-
naswamy. Deep convolutional neural networks on multichannel time series for
human activity recognition. In IJCAI, 2015.
[51] Ming Zeng, Le T. Nguyen, Bo Yu, Ole J. Mengshoel, Jiang Zhu, Pang Wu, and
Joy Zhang. Convolutional neural networks for human activity recognition using
mobile sensors. 6th International Conference on Mobile Computing, Applica-
tions and Services, pages 197–205, 2014.
Appendix A
Additional Figures
A.1 Exercise Visualisations
Figure A.1: Squats.
Figure A.2: Push-ups.
Figure A.3: Dumbbell shoulder press.
Figure A.4: Lunges.
Figure A.5: Dumbbell rows.
Figure A.6: Triceps extensions.
Figure A.7: Biceps curls.
Figure A.8: Sit-ups.
Figure A.9: Jumping jacks.

Figure A.10: Lateral shoulder raises.

A.2 Unimodal Autoencoder Hyperparameter Search
Figure A.11: Validation reconstruction loss on RecoFit accelerometer data for different
convolutional autoencoder model configurations (varying layer count, kernel size,
stride, and grouped vs. regular convolutions), with model parameter counts shown
alongside.
Figure A.12: Validation reconstruction loss on RecoFit gyroscope data for different
convolutional autoencoder model configurations, with model parameter counts shown
alongside.
Figure A.13: Validation reconstruction loss on Human3.6M motion capture data for
different convolutional autoencoder model configurations, with model parameter counts
shown alongside.
A.3 Confusion Matrices
Figure A.14: Confusion matrices for the single-device models (trained from scratch) on
the Multimodal Gym test set: (a) smartwatch left acc & gyr, (b) smartwatch right acc &
gyr, (c) skeleton, (d) smartphone right acc & gyr, (e) earbud left acc & gyr.
Figure A.15: Confusion matrices for the unimodal accelerometer models (trained from
scratch) on the Multimodal Gym test set: (a) smartwatch left, (b) smartwatch right, (c)
earbud left, (d) smartphone right.
[Figure: confusion-matrix heatmaps, panels (a) Smartwatch left, (b) Smartwatch right, (c) Earbud left and (d) Smartphone right; same classes, axes and per-true-class normalisation as Figure A.14.]
Figure A.16: Confusion matrices for the unimodal gyroscope models (trained from scratch) on the Multimodal Gym test set.
[Figure: confusion-matrix heatmaps, panels (a) Smartwatch left, (b) Smartwatch right, (c) Skeleton, (d) Smartphone right and (e) Earbud left; same classes, axes and per-true-class normalisation as Figure A.14.]
Figure A.17: Confusion matrices on the seen-subjects test set for the single-device models trained using the multimodal zeroing-out training strategy.
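The zeroing-out strategy itself is defined in the main text; purely as a rough illustration, one plausible implementation blanks all but one randomly chosen modality's input during multimodal training, so the shared network stays usable when only a single device is available at test time. The PyTorch-style sketch below is a hypothetical reading of that idea, not the author's exact scheme.

    import torch

    def zero_out_modalities(batch, p_zero=0.5):
        # batch: list of per-modality tensors, each of shape (B, C, T).
        # With probability p_zero, keep one randomly chosen modality and
        # zero the rest; otherwise pass the full multimodal batch through.
        if torch.rand(1).item() >= p_zero:
            return batch
        keep = torch.randint(len(batch), (1,)).item()
        return [x if i == keep else torch.zeros_like(x)
                for i, x in enumerate(batch)]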
[Figure: confusion-matrix heatmaps, panels (a) Smartwatch left, (b) Smartwatch right, (c) Skeleton, (d) Smartphone right and (e) Earbud left; same classes, axes and per-true-class normalisation as Figure A.14.]
Figure A.18: Confusion matrices on the unseen-subjects test set for the single-device models trained using the multimodal zeroing-out training strategy.