Multimodal Deep Learning for

Automated Exercise Logging

David Strömbäck

Master of Science

Artificial Intelligence

School of Informatics

University of Edinburgh

2019


Abstract

Fitness tracking devices have risen in popularity in recent years, but limitations in their accuracy [4] and their failure to track many common exercises present a need for improved fitness tracking solutions. This work proposes a multimodal deep learning approach that leverages multiple data perspectives for robust and accurate activity segmentation, exercise recognition and repetition counting in the gym setting. To enable this exploration we introduce the Multimodal Gym dataset: a substantial dataset of participants performing full-body workouts, consisting of time-synchronised multi-viewpoint (MVV) RGB-D video, multiple inertial sensing modalities, and 2D and 3D pose estimates. We demonstrate the effectiveness of our CNN-based architecture at extracting modality-specific spatio-temporal representations from inertial sensor and skeleton sequence data. We compare the exercise recognition performance of unimodal, single-device and multimodal models across a number of smart devices and modalities. Furthermore, we propose a fully-connected multimodal autoencoder to learn cross-modal representations, which achieves near-perfect accuracy of 99.5% for exercise recognition on the Multimodal Gym dataset. We evaluate the generalisability of our models by testing on unseen subjects, and explore the effect of pretraining on external datasets. We also propose and present the initial results of a training strategy for strengthening single-device performance using multimodal training, which we call multimodal zeroing-out training. Finally, we implement a strong repetition counting baseline on the Multimodal Gym dataset, and propose promising future research directions for the repetition counting task.


Acknowledgements

I would like to express my sincere gratitude to my academic supervisor, Dr. Radu, for his expert guidance throughout the project, and our many fruitful discussions. I am also very grateful to my industry supervisor, Dr. Huang, for our regular conversations that stimulated new research directions and ideas. I would also like to thank Sony for providing funding for this research project, and the data collection participants, for being so generous with their time. Finally, I would like to acknowledge with gratitude the ever-present support and love from my family and friends.


Table of Contents

1 Introduction 1
  1.1 Contribution 3

2 Background 4
  2.1 Human Action Recognition (HAR) 4
    2.1.1 Inertial Sensor-Based HAR 4
    2.1.2 Skeletal-Based HAR 6
    2.1.3 HAR in the Gym 6
  2.2 Multimodal Deep Learning 7
  2.3 Repetition Counting 7

3 Methodology 9
  3.1 Data Collection 9
  3.2 Activity Segmentation and Exercise Recognition 13
    3.2.1 Pre-Processing 13
    3.2.2 Unimodal Autoencoders 14
    3.2.3 Multimodal Autoencoder 19
    3.2.4 Multimodal Classification 20
  3.3 Repetition Counting 21

4 Evaluation and Discussion 23
  4.1 Activity Segmentation and Exercise Recognition 23
    4.1.1 Pretraining Unimodal Autoencoders 23
    4.1.2 Unimodal, Single-Device and Multimodal Models 25
    4.1.3 Multimodal Zeroing-Out Training Strategy 29
  4.2 Repetition Counting 31

5 Conclusion 38
  5.1 Summary 38
  5.2 Future Work 39

Bibliography 41

A Additional Figures 47
  A.1 Exercise Visualisations 47
  A.2 Unimodal Autoencoder Hyperparameter Search 48
  A.3 Confusion Matrices 48


Chapter 1

Introduction

The popularity of fitness tracking devices has risen in recent years [30], enabling users to monitor their health and fitness levels through their smart devices. Current fitness trackers, however, are predominantly focused on tracking continuous high-movement aerobic activities, such as walking, running, and swimming. A small subset of fitness trackers also supports tracking body-weight and resistance exercises, but they are often restricted to tracking only exercises in which the fitness tracker is heavily involved in the exercise movements, and thus often fail to track many common exercises. Furthermore, questions about the reliability of current fitness tracking technology have been raised [4], with significant discrepancies being observed across different fitness tracking brands and models.

In this work we propose a multimodal deep learning approach to develop a robust and accurate automated exercise logging system in the gym setting, and to contribute to the development of current fitness tracking technology. The system we propose consists of two main stages: an activity segmentation and exercise recognition stage, followed by a repetition counting stage. In addition to addressing the shortcomings of current fitness tracking technologies, this work makes contributions towards research into the use of multimodal learning for human activity recognition (HAR), and for the relatively unexplored task of repetition counting.

There are a number of factors that make HAR and repetition counting challenging problems. The high variation in how different people perform the same activity, and even the variation in how the same person performs an activity on separate occasions, sets high demands on HAR and repetition counting systems. There are also modality-specific challenges that arise for HAR and repetition counting tasks. Occlusion and viewpoint variation are two common challenges that need to be handled when working with RGB and depth data. Learning from sensor data, in particular from sensor data generated across multiple heterogeneous sensors, also poses a number of challenges. The model and brand of the sensor device, the positioning of the sensor, and the sampling frequency of the sensor are all factors that can cause significant variations in sensor readings. An additional limitation of relying on sensor data for HAR and repetition counting is that only activities involving the body part to which the sensor is attached can be tracked, making some activities indistinguishable. Finally, sensor drift is a problem that affects all sensors over time, and can lead to unreliable and deteriorating sensor reading accuracy.

To tackle these challenges we propose a multimodal deep learning approach that combines inertial sensor data from user-worn smart devices with 3D skeleton sequence data extracted from RGB video; we hypothesise that these modalities will complement each other and result in more robust and accurate HAR and repetition counting models. The inertial sensors used in this work are accelerometers and gyroscopes, which record acceleration and angular velocity, respectively. The acceleration and angular velocity of body-worn devices change with the body movements, and are thus informative for inferring what activity a person is performing. The third modality we study in this work is 3D pose estimate data. Advances in 2D and 3D pose estimation have made it possible to acquire highly accurate and robust pose estimates in real-time [6]. 3D pose sequence information is highly discriminative for the task of action recognition, and allows for a much more compact representation than raw image or depth data, resulting in more lightweight models. With the widespread availability of cheap cameras and inertial sensors, and the growing trend of ubiquitous computing, we are seeing an increase in the number of data modalities that are available across a wide range of settings. This contributes to the timeliness of this research, as we believe there are many settings where multimodal learning systems are both feasible and beneficial. Ambient-assisted living (AAL) solutions, self-driving cars, and smart gyms, as investigated in this work, are just a few examples of application areas that would benefit from multiple data perspectives. We explore deep learning methods [26] for this task due to their demonstrated ability to learn generalised hierarchical representations, which are important for dealing with the high intra-class variability found in HAR datasets. In addition, deep learning methods can be trained end-to-end directly on the raw data, avoiding the need to design the handcrafted features that are often used in shallow learning methods.


1.1 Contribution

This section outlines the main contributions made in this work. We have collected and annotated the Multimodal Gym dataset, a substantial dataset of participants performing full-body workouts, consisting of synchronised multi-viewpoint (MVV) RGB-D video, multiple inertial sensing modalities, and 2D and 3D pose estimates. The Multimodal Gym dataset will be made available to the research community, with the hope that it will stimulate further developments in multimodal learning for HAR and repetition counting.

We propose a deep learning approach that learns unimodal and cross-modal representations, and evaluate this approach for exercise recognition on the Multimodal Gym dataset. Furthermore, we use inertial sensor data and 3D pose estimate data; to the best of our knowledge, this is the first work to explore fusing these modalities for HAR. Additionally, we have implemented a baseline model for repetition counting on the Multimodal Gym dataset, and invite the research community to further improve upon the baseline performance.

Finally, we propose a multimodal zeroing-out training strategy to investigate whether single-device model performance can be improved through multimodal training, and present our initial results.


Chapter 2

Background

2.1 Human Action Recognition (HAR)

Human action recognition approaches can be divided into two main categories: sensor-based methods [8, 25, 5, 48], relying on data from inertial measurement units (IMUs) such as accelerometers and gyroscopes, and video-based methods [35, 1, 18], operating on visual input data such as colour and depth. The latter category also includes skeletal-based methods, which use as input 2D or 3D human pose estimates extracted from colour or depth data. Previous works have predominantly focused on unimodal learning; however, there are many real-world applications, such as ours, where it is feasible to leverage information from multiple modalities. Performance on HAR tasks has improved significantly in recent years, as research focus has shifted from handcrafted features to modern deep learning solutions [18]. For conciseness, this chapter is centred on recent deep learning approaches, and focuses on sensor-based and skeletal-based methods, as these are the modalities investigated in this work. An overview of previous HAR works in the gym setting is also presented.

2.1.1 Inertial Sensor-Based HAR

There is a range of sensing modalities that have been used for action recognition, with accelerometers and gyroscopes being the most commonly used [48]. These sensors output multivariate time-series data, which are often segmented using a sliding-window approach before being passed on to the feature extraction and classification stages. Traditional pattern recognition techniques, such as Hidden Markov Models (HMMs), decision trees, and Support Vector Machines (SVMs), have demonstrated strong results on various action recognition tasks in controlled environments; however, their reliance on handcrafted features and their ability to learn only shallow representations restrict their performance and generalisability [25]. Deep learning approaches, on the other hand, can learn high-level features directly from the raw sensor data as part of an end-to-end system. Given a sufficient amount of data, traditional machine learning and signal processing techniques are in large part outperformed by modern deep learning approaches [13]. Convolutional Neural Networks (CNNs) [27] are particularly well suited to working with sensor data, as they efficiently capture local dependencies in the data, and can learn scale-invariant features, important for dealing with activities performed at different speeds. Zeng et al. presented one of the first works using CNNs on sensor data for HAR, in which they convolved and max-pooled each accelerometer axis separately before flattening and concatenating the learnt features from each axis into a fully-connected network [51]. They demonstrated state-of-the-art results on three public HAR datasets. Yang et al. adopted a similar approach, but instead of performing convolutions on each axis separately, they stacked the time-series signals from each sensor axis and from multiple sensors to form a 2D image, which was then convolved and pooled along the temporal dimension [50]. In addition to operating on multiple sensors, a key difference of [50] compared to [51] is that information across axes (and sensors in [50]) is fused together early on in the learning process, as opposed to only towards the end. In a related work, Ha et al. also stack signals from multiple sensors into a 2D image; however, they also add zero padding to the input between modality groups, to prevent kernels in the first convolutional layer from intermixing different modalities and failing to learn modality-specific features [15]. In our approach we force the network to specialise on each modality and sensor by initially treating each modality and sensor separately, before fusing the unimodal features and learning cross-modal features. Hammerla et al. carried out a comparison of different deep learning approaches for HAR, investigating Recurrent Neural Networks (RNNs), deep fully-connected networks, and CNNs on three public datasets. They found that for tasks where long-term dependencies play an important role, RNNs tend to perform best, whereas if detecting local patterns is of higher importance, as is the case for exercise recognition in the gym setting, CNNs are preferable [16]. Autoencoders and Restricted Boltzmann Machines (RBMs) have also been applied to sensor data for HAR [2, 36]. Radu et al. demonstrate superior HAR performance compared to traditional shallow-learning approaches by using a multimodal-RBM network to learn a shared representation for accelerometer and gyroscope data [36]. Similar to our approach, they extract unimodal features before learning a cross-modal representation.
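To make the sliding-window segmentation shared by the sensor-based methods above concrete, the following is a minimal illustrative sketch (not code from any of the cited works); the window length and step size are arbitrary example values.

import numpy as np

def sliding_windows(signal, window_len, step):
    # Segment a (T, C) multivariate time-series into overlapping
    # (window_len, C) windows with the given step size.
    starts = range(0, signal.shape[0] - window_len + 1, step)
    return np.stack([signal[s:s + window_len] for s in starts])

# Example: 5 s of triaxial accelerometer data at 50 Hz, 2 s windows, 50% overlap.
acc = np.random.randn(250, 3)
print(sliding_windows(acc, window_len=100, step=50).shape)   # (4, 100, 3)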

2.1.2 Skeletal-Based HAR

With the release of the Microsoft Kinect depth sensor and body tracking SDK in 2010, and the more recent development of real-time and accurate RGB-based human pose estimation techniques [6, 31, 38], the research area of skeletal-based HAR has emerged. Skeletal sequence data consists of joint trajectories and can, similarly to inertial sensor data, be viewed as multivariate time-series data. Early works focus on constructing handcrafted features to obtain discriminative skeleton representations [47, 12, 46], with more recent approaches, again, focusing on deep learning.

A range of deep learning approaches has been explored for skeletal-based HAR, in particular recurrent neural networks (RNNs) [11], graph neural networks (GNNs) [49], and CNNs [10, 29, 22]. RNNs are able to efficiently leverage time-series information, while GNNs are able to incorporate the spatial constraints and relationships of human joints in their models. Finally, CNNs have widely been demonstrated to perform well at learning representations at different abstraction levels, and at learning spatio-temporal features.

Ke et al. propose a 3D skeleton representation form using cylindrical coordinates to encode the relative joint positions with respect to selected reference joints [22]. They demonstrate the efficiency of this representation in combination with deep CNNs to learn spatio-temporal features for HAR. Motivated by their success, we use their proposed skeleton representation form in this work.

2.1.3 HAR in the Gym

Only a few works have focused on HAR in the gym setting. Chang et al. presented one of the earliest works tackling exercise recognition [17]. They propose a Naive Bayes classifier and an HMM approach using two triaxial accelerometers to achieve 90% accuracy on a dataset with nine exercise classes. Morris et al. present an automated exercise logging system with many similarities to ours [32]. Their system consists of three main stages: segmenting exercise and non-exercise segments, recognising exercises, and counting repetitions. A key difference compared to our approach is that Morris et al. use handcrafted features in combination with an SVM to classify exercises. They evaluate their system on the RecoFit dataset, a large-scale dataset consisting of accelerometer and gyroscope data collected from an arm-worn sensor. Their system achieves a precision and recall of greater than 95% for the segmentation task, and recognition rates on three exercise circuits consisting of 4, 7, and 13 exercises of 99%, 98%, and 96%, respectively. The repetition counting approach proposed by Morris et al. is the baseline method we implement on the Multimodal Gym dataset. A more recent deep learning based approach was proposed by Um et al. [45], who applied a CNN to accelerometer data formatted into an image. They apply their model to a large-scale dataset with 50 exercises, and obtain 92% for the exercise recognition task.

2.2 Multimodal Deep Learning

Multimodal deep learning is concerned with how to fuse different modalities such that the learning task can best leverage the different data perspectives. Ngiam et al. propose a cross-modal RBM learning approach to learn better unimodal features by training on multiple modalities [34]. They demonstrate the efficiency of their approach by training on video and audio data, and testing on just a single modality, for the task of audio-visual speech classification. Their approach inspired the multimodal zeroing-out training strategy that we investigate in this work, with the major differences being that we extend the strategy to multiple modalities, and use a gradual zeroing-out process of the modalities. Radu et al. explore a number of fusion strategies for multimodal deep learning methods [37], comparing early feature concatenation and modality-specific architectures on activity and context recognition tasks using inertial sensor data. Modality-specific architectures first learn unimodal features before learning a shared cross-modal representation. This is also the multimodal learning approach that we take in this work. Radu et al. [37] demonstrate that their proposed modality-specific CNN architecture outperforms the other explored approaches on three out of four tasks, and achieves an average accuracy that is 5% higher across all four tasks, compared to the early feature concatenation architecture.

2.3 Repetition Counting

We focus on sensor-based repetition counting, as this is the setting of our work; however, there are also a number of works on visual repetition counting in video [28, 39]. The majority of sensor-based repetition counting works focus on counting repetitions in the gym setting [17, 9, 33], and rely on signal-processing techniques. Chang et al. propose two approaches for repetition counting on triaxial accelerometer data: a peak counting algorithm, and a method using the Viterbi algorithm with an HMM. On a dataset of nine workout exercises they achieve a miscount rate of 5%. Choi et al. adopt a simple signal-processing technique to track repetitions using accelerometer sensors attached to gym exercise machines [9]. Their predictions are within one repetition of the ground truth 95% of the time. A recent deep learning approach has also been proposed by Soro et al., using a CNN applied to inertial sensor data collected from a wrist-worn and an ankle-worn sensor to count the repetitions across ten Cross-Fit exercises [42]. Their model's predictions are within one of the true repetition count 91% of the time.


Chapter 3

Methodology

This chapter outlines and motivates the design choices for our proposed automated exercise logging system. The system consists of two main components: an activity segmentation and exercise recognition module, followed by a repetition counting module. We begin by introducing the data collection procedure.

3.1 Data Collection

To enable research into multimodal learning for activity segmentation, exercise recognition and repetition counting in the gym setting, we have collected and annotated a multimodal dataset of participants performing full-body workouts¹. We refer to this dataset as the Multimodal Gym dataset. The Multimodal Gym dataset will be made publicly available, with the hope that it will stimulate further research into multimodal learning. This section outlines the data collection procedure and provides an overview of the Multimodal Gym dataset.

Participants were asked to carry out full-body workouts whilst being recorded by two depth cameras and wearing five differently-positioned devices collecting inertial sensor data. To restrict the task to single-person exercise recognition, each workout session involved just one participant. The workout sessions consist of three sets of ten exercises, with ten repetitions for each set. The following set of well-known gym exercises was chosen: squats, lunges (with dumbbells), bicep curls (alternating arms), sit-ups, push-ups, sitting overhead dumbbell triceps extensions, standing dumbbell rows, jumping jacks, sitting dumbbell shoulder presses, and dumbbell lateral shoulder raises. See A.1 for an example of each exercise. The exercises were demonstrated to the participants before their workout session to ensure familiarity with each exercise, but no corrections to the participant's form were made during the workout. For the exercises involving dumbbells, the participants had access to two dumbbells with adjustable weight of up to 7.5kg each, and were free to choose how much weight to use. In between sets participants could decide how long to rest, and what to do, as long as they remained within the cameras' field of view. The data collection was carried out in a controlled environment, keeping the background and lighting conditions constant across all workouts.

¹ The data collection study has undergone ethical screening in accordance with the University of Edinburgh School of Informatics ethics process, and received ethical approval from the Ethics Panel (Ref. No. 70572).

Participants were asked to wear two smartwatches, one on each wrist, an earbud in their left ear, and two smartphones, one in their left trouser pocket (Huawei P20) and the other in their right trouser pocket (Samsung S7). The Huawei P20 was only used to collect data in 6 of the workout sessions. The smartwatches and earbud are fixed in orientation when worn, and the smartphones were placed with the camera facing downwards and outwards. Inertial sensor data was collected from all five devices using the inbuilt triaxial accelerometer and gyroscope, and also the magnetometer in the smartphones. Heart rate data was also collected from the smartwatches using the optical heart rate sensor. Figure 3.1 displays an example accelerometer and gyroscope sensor segment. Custom applications were developed to stream and store the sensor data from each device. For the smartwatches, a Wear OS Java application was developed to locally store sensor readings. For the eSense earbud, an Android OS Java application was developed, allowing the earbud to stream sensor readings to a host smartphone over Bluetooth Low Energy (BLE). For the smartphones, an Android OS Java application developed by Dr. Radu was used to stream and store inertial sensor data. The devices used in the data collection, along with the sampling frequency for each device and modality, are given in Table 3.1.

[Figure 3.1: two panels showing (a) accelerometer (m/s²) and (b) gyroscope (rad/s) X, Y and Z readings over roughly 20 seconds.]

Figure 3.1: Sensor readings from a left-worn smartwatch triaxial accelerometer and gyroscope recorded during a set of squats.

Colour images and depth maps of the workouts were recorded at 30 frames per second from two viewpoints using two Orbbec Astra Pro RGB-D cameras. A C++ application was developed to stream and store the colour and depth data from these cameras. The Astra Pro depth sensor uses structured light to capture depth information, and has a range of 0.6-8m. We placed the cameras such that the participant was always within a 2-6m range from the cameras. The depth maps capture the planar distance of the scene to the camera. For each workout session the cameras were set up in approximately the same position, with both cameras facing the front of the participant but at different angles. The colour and depth data was recorded at the maximum resolution, 1280×720 for colour and 640×480 for depth. Example RGB frame and depth map outputs are given in Figure 3.2.

[Figure 3.2: three panels: (a) Camera 1 RGB, (b) Camera 2 RGB, (c) Camera 2 depth map.]

Figure 3.2: Example colour and depth output of camera 1 and camera 2.

To use the Multimodal Gym dataset for multimodal learning, the data from each of the seven devices used in the data collection must be time-synchronised. This was done by asking each participant to perform an abrupt synchronisation jump from a stand-still position at the beginning of each workout. The synchronisation jump was used to determine the offset between each device's system clock and a reference clock, chosen to be the system clock of the laptop connected to one of the cameras. Given the offset of each device's system clock to our reference clock, the data from each device could be aligned in time.
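As an illustration of the alignment step, the sketch below assumes each device stream provides its own timestamps and that the synchronisation jump can be located as the sample with the largest acceleration magnitude; the function names and that detection heuristic are our own simplifications, not the exact procedure used in this work.

import numpy as np

def jump_timestamp(timestamps, acc_xyz):
    # Locate the synchronisation jump as the sample with the largest
    # acceleration magnitude (a simplifying assumption for illustration).
    magnitude = np.linalg.norm(acc_xyz, axis=1)
    return timestamps[int(np.argmax(magnitude))]

def align_to_reference(timestamps, device_jump_t, reference_jump_t):
    # Shift a device's timestamps onto the reference clock using the offset
    # between the jump time as seen by the device and by the reference.
    return timestamps + (reference_jump_t - device_jump_t)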

Device              | Modality               | Frequency (Hz) | Resolution
Orbbec Astra Pro    | RGB                    | 30             | 1280x720
Orbbec Astra Pro    | Depth                  | 30             | 640x480
Mobvoi TicWatch Pro | Accelerometer          | 100            | -
Mobvoi TicWatch Pro | Gyroscope              | 100            | -
Mobvoi TicWatch Pro | Heart beats per minute | 1              | -
eSense [21]         | Accelerometer          | 90             | -
eSense [21]         | Gyroscope              | 90             | -
Samsung S7          | Accelerometer          | 210            | -
Samsung S7          | Gyroscope              | 210            | -
Samsung S7          | Magnetometer           | 100            | -
Huawei P20          | Accelerometer          | 500            | -
Huawei P20          | Gyroscope              | 500            | -
Huawei P20          | Magnetometer           | 65             | -

Table 3.1: An overview of the devices used in the data collection, and the collected modalities.

Each workout has been manually annotated with the video frame boundaries of where each exercise set begins and ends, and the number of repetitions in each set. The workouts were annotated by the same annotator by inspecting the videos of the workout sessions. Participants were instructed to perform 10 repetitions in each set; however, in limited cases, due to miscounting, fewer or more than 10 repetitions were performed. There is some ambiguity involved when annotating the start and end frames of exercises, and the number of repetitions. When determining when a participant starts an exercise, is it when they are in the exercise start position, or is it only once they start the exercise motion? Is the end of a set of jumping jacks when the feet come together for the last time, or is it when the arms stop swinging? Does sitting up at the end of a set of sit-ups count as a repetition? These ambiguous cases do not arise very frequently, but to minimise the impact on the data quality we handled them consistently across all workouts. Given the discussed ambiguities, and the difficulties in identifying exactly when a workout begins and ends, we estimate that the majority of our start and end annotations are within 2-3 frames (60-90ms) of the ground truth, which is more than sufficient for the task of activity recognition and repetition counting.

Sixteen workout sessions were recorded, totalling 580 minutes of data. The workout lengths range from 27 to 58 minutes, with an average duration of 36 minutes. A total of 466 sets and 4678 repetitions were collected, and exercises constitute 26%, or 150 minutes, of the data, with the remaining portion corresponding to non-exercise activities. Two of the five participants carried out six workout sessions each, one participated in two sessions, and the other two participants carried out one workout session each.

3.2 Activity Segmentation and Exercise Recognition

In this section we outline our proposed multimodal approach for activity segmentation and exercise recognition. Our approach can be split into three main training stages: learning modality-specific representations using unimodal autoencoders, learning a shared cross-modal representation using a multimodal autoencoder, and finally, training a classifier for segmentation and exercise recognition using the shared cross-modal representation. Each of these three stages and the pre-processing stage will be covered in the following subsections. An illustration of the complete architecture is given in Figure 3.3.

3.2.1 Pre-Processing

Our system takes as input time-synchronised inertial sensor and skeleton sequence data with different sampling frequencies. We downsample the accelerometer and gyroscope sensor readings to 50Hz, and leave the skeleton sequence data at its original sampling rate of 30Hz. Previous HAR works have suggested that a sampling rate of 20Hz is sufficient to capture human motions without losing informative parts of the signal [23]. We take a cautious approach and use a higher sampling rate, but acknowledge that a lower sampling rate could most likely be used with little or no loss in our model's performance. Using as low a sampling rate as possible is desirable when deploying models on resource-constrained devices, in order to reduce the energy consumption and storage requirements. To facilitate the training process we standardise our data along the feature dimensions (i.e. the X, Y, and Z axes for the accelerometer and gyroscope data, and along the joint dimensions for the skeleton data). We use the global training mean and standard deviation to standardise all instances.
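A minimal sketch of this pre-processing is given below. The thesis does not specify the resampling method, so linear interpolation is assumed here, and the helper names are ours.

import numpy as np

def resample(timestamps, signal, target_hz):
    # Resample a (T, C) signal with per-sample timestamps (in seconds) onto a
    # uniform grid at target_hz, using linear interpolation (an assumption).
    grid = np.arange(timestamps[0], timestamps[-1], 1.0 / target_hz)
    return np.stack([np.interp(grid, timestamps, signal[:, c])
                     for c in range(signal.shape[1])], axis=1)

def standardise(segment, train_mean, train_std):
    # Standardise along the feature dimension using statistics computed on
    # the training data only (the global training mean and standard deviation).
    return (segment - train_mean) / (train_std + 1e-8)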


[Figure 3.3: block diagram. Per-device and per-modality inertial 1D convolutional autoencoders (X, Y, Z input) and a skeleton 2D convolutional autoencoder (r, θ, z input) form Stage 1, feeding the multimodal autoencoder (Stage 2), which feeds the activity segmentation and recognition classifier (Stage 3).]

Figure 3.3: An overview of the proposed activity segmentation and exercise recognition approach. Stage 1: Learn modality-specific representations using a separate autoencoder for each device and modality. The layers of the inertial 1D convolutional autoencoder block are used to illustrate that separate autoencoders are used for each device and modality. Stage 2: Flatten and concatenate the modality-specific representations outputted by the encoder component of each unimodal autoencoder. Learn a shared cross-modal representation using a fully-connected multimodal autoencoder that attempts to reconstruct the original inputs from the shared representation. The output vector of the multimodal autoencoder is split along the concatenation indices, and fed to the decoder component of the corresponding unimodal autoencoder, to reconstruct the original input and backpropagate the reconstruction loss. Stage 3: A fully-connected classifier is attached to the learnt shared cross-modal representation. The entire network is trained for the task of activity segmentation and exercise recognition, with the pretrained unimodal and multimodal autoencoder weights being fine-tuned.

3.2.2 Unimodal Autoencoders

Motivated by the demonstrated strong performance of modality-specific architectures in previous multimodal deep learning works [34, 37], we employ the same late fusion strategy in this work (discussed in section 2.2). We first train separate autoencoder networks for each device and modality to extract modality-specific representations, and to enable us to pretrain our network on external datasets. We pretrain each autoencoder on a publicly available dataset from a similar domain to ours, in order to evaluate whether transfer learning boosts the robustness and generalisability of our models.

We propose using a stacked convolutional autoencoder with a bottleneck to force the network to learn a compact modality-specific representation. As discussed in section 2.1.1, CNNs are efficient at learning temporal patterns and scale-invariant features, which makes them well suited to our task. The autoencoders are constructed in a symmetric fashion, with the encoder consisting of convolutional and max-pooling layers, and the decoder consisting of deconvolutional and max-unpooling layers. The rectified linear unit (ReLU) activation function [14] is applied after every convolutional layer, and after the first two deconvolutional layers. The network configurations for the accelerometer and gyroscope modalities and for the skeleton modality are given in Tables 3.2 and 3.3, respectively. The autoencoders are trained end-to-end, with the Mean Squared Error (MSE) as the reconstruction loss.

3.2.2.1 Inertial Sensor Autoencoders

This section outlines how unimodal representations are learned from the accelerometer and gyroscope data collected from the smartwatches, smartphones and earbud.

We treat the triaxial accelerometer and gyroscope data from the smartwatches and earbud as 1D images with three channels, corresponding to the X, Y, and Z axes. For the smartphone data we use the magnitude of the accelerometer and gyroscope readings, √(X² + Y² + Z²), instead of the raw triaxial signal, and hence the smartphone data has only one channel. Accelerometer and gyroscope readings are dependent on the sensor's orientation. The smartwatches and the earbud are worn such that the orientation of the devices is relatively fixed across all workouts; however, the orientation of the smartphones can vary significantly throughout and across workouts. The magnitude transformation is therefore performed to make our smartphone models invariant to the orientation of the smartphone. The magnitude of the accelerometer and gyroscope readings captures the overall acceleration and rotation rate, irrespective of direction. This transformation results in a loss of directional information, but we gain invariance to the orientation of the sensing device.
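The orientation-invariant magnitude signal amounts to a one-line transformation; a small sketch (helper name ours) is shown below.

import numpy as np

def magnitude(xyz):
    # Per-sample magnitude sqrt(X^2 + Y^2 + Z^2) of a (T, 3) accelerometer or
    # gyroscope signal, giving a single orientation-invariant channel.
    return np.linalg.norm(xyz, axis=1, keepdims=True)   # shape (T, 1)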

The resulting 1D images for each modality and device are convolved and max-pooled with 1D filters along the temporal dimension. The accelerometer and gyroscope data are processed in separate networks to extract modality-specific representations. We use a cross-validation framework to evaluate different network configurations, including the number of convolutional and deconvolutional layers, the kernel size, the kernel stride, and the use of depth (grouped) or regular convolutions. The final network configurations for all three devices and modalities are given in Table 3.2.

Layer   | Kernel dims (HxW) Acc | Kernel dims (HxW) Gyr | Output dims (C@HxW)
input   | -                     | -                     | 3@1x250
conv1   | 1x11 (G)              | 1x3                   | 9@1x125
conv2   | 1x11 (G)              | 1x3                   | 15@1x63
conv3   | 1x11                  | 1x3                   | 24@1x32
maxpool | 1x2                   | 1x2                   | 24@1x16
unpool  | 1x2                   | 1x2                   | 24@1x32
deconv1 | 1x11                  | 1x3                   | 15@1x63
deconv2 | 1x11                  | 1x3                   | 9@1x125
deconv3 | 1x11                  | 1x3                   | 3@1x250

Table 3.2: Stacked convolutional autoencoder network architecture for smartwatch and earbud data. The configuration for the smartphone autoencoders differs only in that the number of channels is divided by a factor of three, since the smartphone data has only one input channel. The first two convolutional layers in the accelerometer autoencoder are grouped (G) convolutions, with the number of groups equal to three. The ReLU activation function is applied after every convolutional layer, and after the first two deconvolutional layers. All layers use a kernel stride of two.

Layer   | Kernel dims (HxW) | Output dims (C@HxW)
input   | -                 | 3@150x16
conv1   | 11x11 (G)         | 9@75x16
conv2   | 11x11 (G)         | 15@38x16
conv3   | 11x11             | 24@19x16
maxpool | 2x2               | 24@9x8
unpool  | 2x2               | 24@19x16
deconv1 | 11x11             | 15@38x16
deconv2 | 11x11             | 9@75x16
deconv3 | 11x11             | 3@150x16

Table 3.3: Stacked convolutional autoencoder network architecture for skeleton data. The first two convolutional layers in the skeleton autoencoder are grouped (G) convolutions, with the number of groups equal to three. The ReLU activation function is applied after every convolutional layer, and after the first two deconvolutional layers. All layers use a kernel stride of two in the height dimension, and one in the width dimension.
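For concreteness, the following PyTorch sketch loosely follows the accelerometer configuration of Table 3.2 (three-channel input of length 250, grouped first two convolutions, stride two, and a 24@1x16 embedding). The padding and output-padding values are our own choices, made only to reproduce the listed layer sizes; they are not taken from the thesis.

import torch
import torch.nn as nn

class InertialAutoencoder(nn.Module):
    # 1D convolutional autoencoder for a (B, 3, 250) triaxial segment.
    def __init__(self):
        super().__init__()
        self.enc = nn.ModuleList([
            nn.Conv1d(3, 9, kernel_size=11, stride=2, padding=5, groups=3),
            nn.Conv1d(9, 15, kernel_size=11, stride=2, padding=5, groups=3),
            nn.Conv1d(15, 24, kernel_size=11, stride=2, padding=5),
        ])
        self.pool = nn.MaxPool1d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool1d(2, stride=2)
        self.dec = nn.ModuleList([
            nn.ConvTranspose1d(24, 15, kernel_size=11, stride=2, padding=5),
            nn.ConvTranspose1d(15, 9, kernel_size=11, stride=2, padding=5),
            nn.ConvTranspose1d(9, 3, kernel_size=11, stride=2, padding=5,
                               output_padding=1),
        ])
        self.relu = nn.ReLU()

    def encode(self, x):
        for conv in self.enc:
            x = self.relu(conv(x))
        x, idx = self.pool(x)              # embedding: (B, 24, 16)
        return x, idx

    def forward(self, x):
        z, idx = self.encode(x)
        y = self.unpool(z, idx)
        y = self.relu(self.dec[0](y))
        y = self.relu(self.dec[1](y))
        y = self.dec[2](y)                 # no ReLU on the final reconstruction
        return y, z

# Reconstruction loss, as described above:
# recon, embedding = model(batch)
# loss = nn.MSELoss()(recon, batch)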


3.2.2.2 Skeleton Autoencoder

To incorporate visual information into our model, we use 3D pose information as one of our input modalities. Unlike the inertial sensor streams, where we use the raw sensor data as input to our model, the skeleton stream requires some processing to extract 3D pose estimates from our video data. We extract 3D pose estimates from RGB frames using the highly accurate 2D pose estimation system OpenPose [6], and the 2D-to-3D regression model proposed by Martinez et al. [31]. OpenPose is an open source multi-person 2D pose estimation system that uses a CNN-based bottom-up approach to find body parts in RGB images. To 'lift' the 2D pose estimates to 3D, Martinez et al. propose a simple but effective deep fully-connected network, which at the time of publication in 2017 outperformed the state of the art in 3D pose estimation on the Human3.6M dataset by 30% [31]. Examples of 2D and 3D pose estimates on two RGB frames from the Multimodal Gym dataset are given in Figure 3.4. The resulting pose estimates are highly accurate, robust to fast movements, and can even handle simple cases of occlusion, as demonstrated in Figure 3.4.

Figure 3.4: 2D and 3D pose estimates obtained using OpenPose [6] and Martinez et al. [31], respectively.

Our 3D skeleton model consists of 17 joints, for which we have the Cartesian coordinate positions in all video frames. The coordinate system used to describe the pose estimate is relative to the person, with the origin at the centre hip joint. We use a skeleton representation proposed by Ke et al. [22], which involves transforming the 3D Cartesian coordinates, (x, y, z), to cylindrical coordinates, (r, θ, z), as follows:

r = √(x² + y²)
θ = tan⁻¹(x/y)     (3.1)
z = z

Ke et al. [22] demonstrate improved performance when using cylindrical coordinates over Cartesian coordinates. They attribute this improvement to the fact that human motions use pivotal movements, and are therefore best described by cylindrical coordinates. There is, however, a limitation of using cylindrical coordinates that is not discussed in the original paper [22], namely the angle discontinuity at −180° and 180° in the azimuth component of the cylindrical coordinate. This causes large jumps in the azimuth signal for potentially very small pivotal movements, which complicates learning. After the transformation, a reference joint, R, is chosen, and each joint position is represented in terms of its position relative to the reference joint. This has the advantage of making the representation invariant to different viewpoints. We chose the centre hip joint as the reference joint since it is a relatively stable and stationary joint, which results in the final skeleton representation being less prone to noise. Finally, the relative cylindrical joint coordinates are arranged into a 3D image. The channels in the image correspond to the radial, azimuth, and vertical components, the width corresponds to the spatial dimension, and the height to the temporal dimension. This results in an N×16×3 image, where N is the sequence length and 16 is the number of joints, discarding the reference centre hip joint. A visualisation of the skeleton representation is given in Figure 3.5, where the repetitive nature of a set of squat repetitions is captured in the form of a repeating visual pattern.
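A small sketch of this representation is given below, assuming an (N, 17, 3) array of 3D joint positions with the centre hip joint at index 0 (the joint ordering is an assumption). NumPy's arctan2 is used for the azimuth rather than the ratio form of Equation 3.1, purely to avoid division by zero; the ±180° discontinuity discussed above remains.

import numpy as np

def skeleton_image(joints_xyz, hip_index=0):
    # Relative joint positions: subtract the reference (centre hip) joint,
    # then drop it, leaving (N, 16, 3).
    rel = joints_xyz - joints_xyz[:, hip_index:hip_index + 1, :]
    rel = np.delete(rel, hip_index, axis=1)
    x, y, z = rel[..., 0], rel[..., 1], rel[..., 2]
    # Cylindrical coordinates (Equation 3.1); channels ordered r, theta, z.
    r = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)
    return np.stack([r, theta, z], axis=-1)    # (N, 16, 3) image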

The skeleton representation enables us to use a stacked convolutional autoencoder model to learn modality-specific spatio-temporal features from our data. The skeleton autoencoder we propose is very similar to the autoencoders we use for the inertial sensor streams, with the major difference being that 2D convolution and pooling operations are used instead of 1D operations. The final model configuration was selected by cross-validating a number of different configurations in a hyperparameter grid search. Table 3.3 contains the final network configuration for the skeleton autoencoder.
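The encoder half of this model could look as follows in PyTorch; as with the inertial sketch above, the padding values are our own choices made to match the output sizes listed in Table 3.3, and the decoder (omitted) would mirror the encoder with ConvTranspose2d and MaxUnpool2d layers.

import torch.nn as nn

class SkeletonEncoder(nn.Module):
    # Encoder for a (B, 3, 150, 16) skeleton segment with channels r, theta, z.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 9, kernel_size=11, stride=(2, 1), padding=5, groups=3),
            nn.ReLU(),
            nn.Conv2d(9, 15, kernel_size=11, stride=(2, 1), padding=5, groups=3),
            nn.ReLU(),
            nn.Conv2d(15, 24, kernel_size=11, stride=(2, 1), padding=5),
            nn.ReLU(),
            nn.MaxPool2d(2),               # embedding: (B, 24, 9, 8)
        )

    def forward(self, x):
        return self.net(x)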


[Figure 3.5: two panels: (a) non-exercise segment, (b) squats segment.]

Figure 3.5: Visualisation of two 10-second segments of the skeleton representation during a non-exercise activity and a set of squats. The segments have been standardised, and then scaled channel-wise between 0 and 255. The radial coordinate, r, corresponds to the red channel, the azimuth coordinate, θ, to the blue channel, and the vertical coordinate, z, to the green channel.

In [22] a VGG19 [41] model pretrained on images was used to extract spatio-temporal features from the skeleton representation. We adopt a different approach, and instead leverage the large number of 3D pose estimates in the Multimodal Gym dataset to train our model just on skeleton representations. We hypothesise that training our network exclusively on skeleton data will improve our model, as the skeleton representation images differ significantly from natural images, even in their low-level features. This is illustrated by the example skeleton representation shown in Figure 3.5.

3.2.3 Multimodal Autoencoder

Once the modality-specific representations have been learnt for each modality and device, the features from the embedding layer are flattened and concatenated, before being fed into the next stage: the multimodal autoencoder. The aim of this stage is to learn cross-modal representations that will hopefully prove to be discriminative for the exercise recognition task. This is again done through the use of an autoencoder network, but this time using a stacked fully-connected autoencoder. By using fully-connected layers, the network is easily able to intermix features from all modalities. As can be seen in Figure 3.3, the architecture is symmetric, with the decoder component of the multimodal autoencoder reconstructing a vector of the same size as the concatenated input vector. The output vector is then split at the concatenation indices, such that each segment can be reshaped into the same size as its corresponding modality-specific embedding layer. The reshaped vector segment is then passed on to the decoder module of the corresponding unimodal autoencoder, which attempts to reconstruct the original input for that modality. A hyperparameter search using cross-validation was conducted to select the configuration with the lowest validation reconstruction loss. The final network configuration is outlined in Table 3.4. As for the modality-specific autoencoders, MSE is used as the reconstruction loss, and the network is trained end-to-end using backpropagation.

Layer                        | Output units
input                        | 12200
modality-specific encoders   | 4288
flatten & concat. embeddings | 4288
enc fc1                      | 1000
enc fc2                      | 1000
enc fc3                      | 1000
dec fc1                      | 1000
dec fc2                      | 1000
dec fc3                      | 4288
split                        | 4288
modality-specific decoders   | 12200

Table 3.4: The network configuration for the stacked fully-connected multimodal autoencoder. The inputs are all the modalities from all the devices, where 12200 = 6×(3@1×250) + 2×(1@1×250) + (3@150×16) is the total size of the inputs across all devices and modalities, and 4288 = 6×(24@1×16) + 2×(8@1×16) + (24@9×8) is the size of the concatenated embedding features from each modality and device.
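A sketch of the multimodal autoencoder is shown below. Layer sizes follow Table 3.4; the choice of ReLU between the fully-connected layers and the use of torch.split at the concatenation indices are our assumptions, since the table only lists output units.

import torch
import torch.nn as nn

class MultimodalAutoencoder(nn.Module):
    # Fully-connected multimodal autoencoder over the concatenated unimodal
    # embeddings; embed_sizes lists the flattened embedding length of each
    # modality/device (these sum to 4288 in Table 3.4).
    def __init__(self, embed_sizes):
        super().__init__()
        total = sum(embed_sizes)
        self.embed_sizes = list(embed_sizes)
        self.encoder = nn.Sequential(
            nn.Linear(total, 1000), nn.ReLU(),
            nn.Linear(1000, 1000), nn.ReLU(),
            nn.Linear(1000, 1000), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(1000, 1000), nn.ReLU(),
            nn.Linear(1000, 1000), nn.ReLU(),
            nn.Linear(1000, total),
        )

    def forward(self, unimodal_embeddings):
        # unimodal_embeddings: list of (B, size_i) flattened encoder outputs.
        x = torch.cat(unimodal_embeddings, dim=1)
        shared = self.encoder(x)
        recon = self.decoder(shared)
        # Split at the concatenation indices; each chunk is reshaped and passed
        # to the matching unimodal decoder outside this module.
        return shared, torch.split(recon, self.embed_sizes, dim=1)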

3.2.4 Multimodal Classification

The final stage of our approach is the classification stage, in which the learned cross-modal representations are fed into a classifier to determine the predicted label for each input segment. Unlike [32], we treat activity segmentation and exercise recognition as one task, instead of two separate tasks, which simplifies our exercise logging pipeline. This is done by adding an extra non-exercise class to the set of exercise classes. The classifier consists of three fully-connected layers, which are attached to the embedding layer of the multimodal autoencoder. The first two fully-connected layers consist of 100 hidden units each, and use ReLU as the activation function. The final layer corresponds to the predicted output for each class, and consists of 11 units (one for each exercise class, and one for the non-exercise class). The entire network is trained end-to-end, fine-tuning the autoencoder weights to specialise the learnt features for the task of exercise recognition. The cross-entropy loss is backpropagated through the network to update the model parameters.
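A sketch of the classification head and the fine-tuning step is given below; the optimiser and training-loop details are not specified in the text and are therefore only indicated in comments.

import torch.nn as nn

# Two hidden fully-connected layers of 100 ReLU units on top of the 1000-unit
# shared embedding, and an 11-way output (ten exercises plus non-exercise).
classifier = nn.Sequential(
    nn.Linear(1000, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 11),
)

# Fine-tuning sketch: the cross-entropy loss is backpropagated through the
# classifier, the multimodal autoencoder and the unimodal encoders.
# logits = classifier(shared)                  # 'shared' from the multimodal AE
# loss = nn.CrossEntropyLoss()(logits, labels)
# loss.backward(); optimiser.step()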

3.3 Repetition Counting

To count the number of repetitions we implemented the signal-processing-based approach proposed by Morris et al. [32]. This approach serves as a strong baseline on the Multimodal Gym dataset, and can act as a benchmark for other researchers to compare their results against.

The approach relies on identifying periodic peaks and auto-correlations in the data, which are strong indicators of a repetition. It is assumed that the signal has been segmented, and that the exercise being performed is known. To ensure that these prerequisites are satisfied, the repetition counting stage should be preceded by the activity segmentation and exercise recognition stage.

The approach can easily be applied to most modalities, as it only requires the data to be transformed into a 1D signal that captures the maximal directions of variance in the data. In our case, the multidimensional data (3 axes for accelerometer and gyroscope data, or 16 relative joint positions in cylindrical coordinates for skeleton data) is first standardised and then smoothed using a third-degree polynomial Savitzky-Golay filter [40] along each feature dimension. The processed data is then projected onto its first principal component direction to obtain a 1D signal.
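A sketch of this step is given below; the Savitzky-Golay window length is our assumption (the thesis specifies only a third-degree polynomial), and the first principal component is obtained via a singular value decomposition of the mean-centred segment.

import numpy as np
from scipy.signal import savgol_filter

def to_1d_signal(segment, window_length=31, polyorder=3):
    # segment: standardised (T, C) data; window_length must be odd and < T.
    smooth = savgol_filter(segment, window_length, polyorder, axis=0)
    centred = smooth - smooth.mean(axis=0)
    # The first right-singular vector is the first principal component direction.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[0]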

The next stage is to identify peaks, or local maxima, in the processed 1D signal. This is done by simply comparing neighbouring values in the signal. Although the smoothing and PCA remove a lot of noise from the original data, there are still many incorrectly detected peaks that do not correspond to a repetition. To remove these false positives, we sort the list of candidate peaks by amplitude and, going through the list in descending order, add a candidate peak to a new candidate list only if no peak already in the list is within a minimum allowed threshold distance of the current peak. This minimum threshold is chosen independently for each exercise, based on an estimate of the minimum amount of time required to execute one repetition of the exercise.
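A minimal sketch of the peak detection and minimum-distance filtering described above is shown below; min_distance is the exercise-specific threshold expressed in samples.

import numpy as np

def candidate_peaks(signal, min_distance):
    # Local maxima found by comparing neighbouring values.
    peaks = [i for i in range(1, len(signal) - 1)
             if signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]]
    # Greedily keep peaks in descending order of amplitude, skipping any peak
    # closer than min_distance to an already accepted one.
    peaks.sort(key=lambda i: signal[i], reverse=True)
    kept = []
    for i in peaks:
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)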

In the next step we leverage the periodic nature of repetitions by incorporating information from the auto-correlation of the signal to exclude unlikely peaks. The autocorrelation is computed for a window centred at each peak, for lags between the minimum and maximum expected duration of a repetition. The lag with the largest autocorrelation value within this range is chosen as the period, P, at that candidate peak. Any peaks with smaller amplitudes that lie within a distance of 0.75P from the current peak are removed.

The final filtering step removes all peaks whose amplitude is lower than half of the 40th percentile of the remaining peak amplitudes. The number of remaining peaks is our predicted repetition count.
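The remaining two filtering steps could be sketched as follows; the autocorrelation window size and edge handling are our assumptions, and min_lag/max_lag are the expected minimum and maximum repetition durations in samples.

import numpy as np

def estimate_period(signal, centre, min_lag, max_lag, half_window=150):
    # Autocorrelation of a window centred on the peak; the lag with the
    # largest value within [min_lag, max_lag] is taken as the period P.
    lo, hi = max(0, centre - half_window), min(len(signal), centre + half_window)
    w = signal[lo:hi] - signal[lo:hi].mean()
    ac = np.correlate(w, w, mode='full')[len(w) - 1:]    # lags 0..len(w)-1
    max_lag = min(max_lag, len(ac) - 1)
    if max_lag < min_lag:
        return min_lag
    return min_lag + int(np.argmax(ac[min_lag:max_lag + 1]))

def prune_peaks(signal, peaks, min_lag, max_lag):
    # Drop lower-amplitude peaks within 0.75 * P of a stronger peak.
    kept = []
    for p in sorted(peaks, key=lambda i: signal[i], reverse=True):
        period = estimate_period(signal, p, min_lag, max_lag)
        if all(abs(p - q) >= 0.75 * period for q in kept):
            kept.append(p)
    if not kept:
        return []
    # Final amplitude filter: half of the 40th percentile of remaining peaks.
    threshold = 0.5 * np.percentile([signal[p] for p in kept], 40)
    return sorted(p for p in kept if signal[p] >= threshold)

# predicted_repetitions = len(prune_peaks(one_d_signal, peaks, min_lag, max_lag))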


Chapter 4

Evaluation and Discussion

This chapter outlines and discusses the evaluation procedure and results of our proposed exercise logging system on the Multimodal Gym dataset. The system is evaluated on exercise recognition and repetition counting. In addition, we present and analyse the initial results for our proposed multimodal zeroing-out training strategy.

4.1 Activity Segmentation and Exercise Recognition

We evaluate our proposed multimodal deep learning approach (see section 3.2) on the Multimodal Gym dataset, and compare its performance to unimodal models, single-device models, and single-device models trained using our proposed multimodal zeroing-out strategy. Unimodal models are models that only take input from one modality and device, for example, accelerometer data from the left smartwatch. Single-device models are models that take as input all the modalities generated by a single device, for example, accelerometer and gyroscope data from the left smartwatch. We compare using pretrained unimodal autoencoders against training from scratch. Furthermore, we evaluate each model on two test sets, one consisting of participants who are also present in the training set, and one consisting of participants not previously seen by the model.

4.1.1 Pretraining Unimodal Autoencoders

This section outlines the experimental setup used to pretrain the unimodal autoen-

coders (see section 3.2.2). We trained a total of seven autoencoders, corresponding to

the following devices and modalities; smartwatch accelerometer and gyroscope, smart-

23

Page 29: Multimodal Deep Learning for Automated Exercise Logging · modalities. Furthermore, we propose a fully-connected multimodal autoencoder to learn cross-modal representations, which

Chapter 4. Evaluation and Discussion 24

phone accelerometer and gyroscope, earbud accelerometer and gyroscope, and skele-

ton data. The following three datasets were used for pretraining, RecoFit [32], the

Heterogeneity Human Activity Recognition (HHAR) dataset [44], and Human3.6M

[7, 20].

The Microsoft RecoFit dataset [32] was used to pretrain the smartwatch accelerom-

eter and gyroscope autoencoders. The RecoFit dataset consists of accelerometer and

gyroscope data collected at 50Hz from an arm-worn sensor of individuals working

out in a gym. The dataset consists of 114 participants, across 146 sessions. To pre-

train the smartphone and earbud accelerometer and gyroscope autoencoders we use

the smartphone data from the HHAR dataset. The HHAR dataset consists of triaxial

accelerometer and gyroscope data collected from 12 smart device models at different

frequencies. The following activities are present in the HHAR dataset: biking, sitting,

standing, walking, walking down stairs, and walking up stairs. Finally, the skeleton

autoencoder was pretrained on ground truth motion capture data from the Human3.6M

dataset. The Human3.6M dataset contains 3D pose information collected from a highly

accurate motion capture system at 50Hz. The data was collected in a controlled en-

vironment across a number of participants, and a range of diverse scenarios, such as

eating, walking, and discussing.

The reason for using device-specific datasets to pretrain the autoencoders is that

the motion patterns experienced by differently positioned devices differ significantly.

The arm-worn sensor in the RecoFit dataset is likely to be involved in a large variety of

motions, from fine-grained movements to rapid movements, whereas a smartphone in

the pocket will most likely experience more rigid motion patterns. We want our models

to learn these device-specific motion patterns. Since there are no publicly available

datasets with accelerometer or gyroscope data collected from earbuds, we also use the

HHAR smartphone data to pretrain the earbud autoencoders. We hypothesise that the

motions picked up by the earbud will be more similar to those recorded by smartphones

in the pocket, compared to arm-worn sensors.

As a pre-processing step, we first downsampled all inertial sensor readings to 50Hz, and the 3D pose data to 30Hz. The magnitude of the smartphone accelerometer and gyroscope data was then computed, and the 3D pose data was transformed into the skeleton representation outlined in section 3.2.2.2. Each dataset was then randomly

split into a train, test, and validation set, using a 70-15-15 split. Finally, all instances

were standardised using the training set mean and standard deviation.
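For concreteness, the pre-processing of a single inertial recording could be sketched as below, assuming NumPy/SciPy, integer sampling rates, and a (T, 3) array layout; the function names are illustrative rather than taken from the actual codebase.

    import numpy as np
    from scipy.signal import resample_poly

    def downsample(samples, orig_rate, target_rate):
        """Resample a (T, C) signal from orig_rate to target_rate (both in Hz)."""
        return resample_poly(samples, target_rate, orig_rate, axis=0)

    def magnitude(xyz):
        """Orientation-independent magnitude of a (T, 3) triaxial signal."""
        return np.linalg.norm(xyz, axis=1)

    def standardise(train, *other_splits):
        """Standardise every split using the training-set mean and standard deviation."""
        mean, std = train.mean(axis=0), train.std(axis=0) + 1e-8
        return [(split - mean) / std for split in (train,) + other_splits]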

To select the network configuration for our stacked convolutional autoencoder we


use a cross-validation framework to evaluate different network configurations. The hy-

perparameter search was carried out separately for each modality; however, we only

used the RecoFit dataset to choose the network configuration for the accelerometer

and gyroscope autoencoders. We made the assumption that a similar network archi-

tecture would work well across all three types of devices for the same modality. The

following network configuration options were explored: the number of convolutional and deconvolutional layers, the kernel size, the kernel stride, and the use of depthwise (grouped) or

regular convolutions. A selection of the configurations with the lowest reconstruction

loss on the external dataset’s test set were evaluated on the Multimodal Gym dataset

for exercise recognition, with the configuration with the highest validation accuracy

selected as our final configuration. We use the results from the autoencoder model

hyperparameter search to guide the final model selection for each modality, with the

intuition that models with a low reconstruction loss are able to preserve salient fea-

tures of the data in their embedding layer, which should be useful for the exercise

recognition task. See A.2 for the accelerometer, gyroscope, and skeleton autoencoder

hyperparameter search results. The final network configurations are given in Tables 3.2

and 3.3.

As discussed in section 4.1.2, we chose to use 5 second windows for the activity

recognition task, and therefore we also use 5 second window segments as input to the

autoencoders. Instances are randomly sampled from the training set, using a batch size

of 128. Each model was trained for 100 epochs, using an Adam optimiser [24] with

a learning rate of 0.001, and the following exponential decay rates for the moment estimates: β1 = 0.9 and β2 = 0.999. The final weights were saved for

each of the seven autoencoder models to later be used for exercise recognition on the

Multimodal Gym dataset.
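A minimal sketch of the pretraining loop for one unimodal autoencoder, assuming a PyTorch implementation with a mean-squared reconstruction loss (the loss choice, and the autoencoder and train_windows objects, are assumptions made for illustration):

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    def pretrain_autoencoder(autoencoder, train_windows, epochs=100, batch_size=128):
        """Pretrain a unimodal autoencoder on randomly sampled 5-second windows."""
        loader = DataLoader(TensorDataset(train_windows), batch_size=batch_size,
                            shuffle=True)                          # random sampling
        optimiser = torch.optim.Adam(autoencoder.parameters(), lr=0.001,
                                     betas=(0.9, 0.999))           # beta1, beta2
        criterion = nn.MSELoss()                                   # reconstruction loss
        for _ in range(epochs):
            for (batch,) in loader:
                optimiser.zero_grad()
                loss = criterion(autoencoder(batch), batch)
                loss.backward()
                optimiser.step()
        torch.save(autoencoder.state_dict(), "autoencoder.pt")     # reused for recognition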

4.1.2 Unimodal, Single-Device and Multimodal Models

Given our pretrained unimodal autoencoders, we now detail the approach taken to train

exercise recognition models that leverage the learnt modality-specific representations.

We split the Multimodal Gym dataset into a train, validation, and two test sets,

one containing participants seen in the training set, and one with previously unseen

participants. The dataset is divided by assigning each workout session to one of the

splits. For reproducibility and to enable future works to compare results we provide

the workout IDs for each split: train (1, 2, 3, 4, 6, 7, 8, 9), validation (14, 15), test seen


(10, 11), test unseen (0, 5, 12, 13). In this work we use data from 5 of the 7 devices in

the Multimodal Gym dataset, ignoring one of the RGB-D cameras and the left smart-

phone. The use of multi-view visual information is left for future work, and the left

smartphone is excluded because data was only collected for that device in a subset

of the recorded workout sessions.
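For reference, the split can be written down directly in terms of the workout IDs listed above (a sketch; only the dictionary structure is an assumption):

    # Workout-session IDs assigned to each split of the Multimodal Gym dataset.
    SPLITS = {
        "train":       [1, 2, 3, 4, 6, 7, 8, 9],
        "validation":  [14, 15],
        "test_seen":   [10, 11],        # participants also present in the training set
        "test_unseen": [0, 5, 12, 13],  # participants not seen during training
    }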

All the models take a 5 second window segment as input, which corresponds to

250 sensor samples at 50Hz, or 150 skeleton samples at 30Hz. The optimal window

size varies across different activity recognition applications, and there is often a trade-

off between accuracy and recognition speed [3]. Five second windows were chosen as

this is long enough to capture the periodic nature of the repetitions, whilst also ensuring

that the recognition speed (2.5 seconds) is acceptable for an exercise logging system.

During training, instances were randomly sampled from all workouts in the training set,

using a batch size of 128. At test time we generated batches using sequential strided

sampling, with a stride of 0.2 seconds. This simulates how the system would be used in

a real-world application, and is equivalent to the setup used in [32]. Window segments

are labelled according to the majority class in the segment, which is common practice

for activity recognition.
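The windowing scheme can be sketched as follows, assuming NumPy arrays of sensor samples and per-sample integer class labels (during training the resulting windows are sampled randomly; at test time a 0.2 second stride gives the sequential strided sampling described above):

    import numpy as np

    def make_windows(signal, labels, rate, window_s=5.0, stride_s=None):
        """Cut a (T, C) signal into fixed-length windows with majority-vote labels."""
        win = int(window_s * rate)
        stride = win if stride_s is None else int(stride_s * rate)
        segments, segment_labels = [], []
        for start in range(0, len(signal) - win + 1, stride):
            segments.append(signal[start:start + win])
            # Label the window with the majority class among its samples.
            segment_labels.append(np.bincount(labels[start:start + win]).argmax())
        return np.stack(segments), np.array(segment_labels)

    # Test-time sampling at 50Hz with a 0.2 s stride:
    # X_test, y_test = make_windows(sensor_data, sample_labels, rate=50, stride_s=0.2)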

A hyperparameter search was carried out to determine the number of fully-connected

layers, hidden units, and amount of dropout [43] to use for the unimodal, single-device,

and multimodal models. The hyperparameter search was only performed on the left

smartwatch accelerometer data for the unimodal model, and the left smartwatch ac-

celerometer and gyroscope data for the single-device model, in order to limit the num-

ber of experiments. The configuration with the highest validation accuracy was chosen

for each model. The final configuration was 3 layers with 100 hidden units, and no

dropout for all three model types. Finally, a hyperparameter search was also carried out

for the multimodal autoencoder in order to select the number of encoder and decoder

layers, the number of hidden units, and the number of units in the embedding layer.

The configuration with the lowest validation reconstruction loss was selected. The final

configuration consisted of three encoder and three decoder fully-connected layers, with

1000 hidden units, and 1000 units in the embedding layer. The unimodal models are

evaluated for exercise recognition by flattening the embedding representation from the

modality-specific autoencoder and attaching the three-layered fully-connected classi-

fier. The same approach is taken for single-device models, with the only difference

being that the embedding layers from the two modalities are flattened and concatenated be-

fore attaching the fully-connected classifier. Finally, the multimodal model is evaluated


for exercise recognition by attaching the fully-connected classifier to the embedding

layer of the multimodal autoencoder. In all three model types, the autoencoder weights

are fine-tuned using a lower learning rate than for the fully-connected layers.
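One possible reading of this model assembly is sketched below in PyTorch style: the pretrained modality-specific encoders produce embeddings that are flattened, concatenated, and passed to a three-layer fully-connected head with 100 hidden units. The encoder objects, the embedding dimensionality, and the ReLU activations are assumptions for illustration; the eleven output classes correspond to the ten exercises plus the non-exercise class.

    import torch
    from torch import nn

    class SingleDeviceClassifier(nn.Module):
        """Concatenated modality embeddings followed by a 3-layer fully-connected head."""

        def __init__(self, encoders, embedding_dim, num_classes=11, hidden=100):
            super().__init__()
            self.encoders = nn.ModuleList(encoders)      # pretrained, fine-tuned later
            self.classifier = nn.Sequential(
                nn.Linear(embedding_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, num_classes),
            )

        def forward(self, inputs):                       # one tensor per modality
            embeddings = [encoder(x).flatten(start_dim=1)
                          for encoder, x in zip(self.encoders, inputs)]
            return self.classifier(torch.cat(embeddings, dim=1))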

All models were trained using an Adam optimiser [24] with a learning rate of 0.001, and the following exponential decay rates for the moment estimates: β1 = 0.9 and β2 = 0.999. For fine-tuning the pretrained weights, an order of mag-

nitude lower learning rate of 0.0001 was used. Furthermore, a learning rate scheduler

was used to decrease the learning rate by an order of magnitude if the validation metric

did not improve for 10 epochs. The models were trained until convergence, which was

typically after 10 to 20 epochs.
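The two learning rates and the plateau-based schedule could be wired up as in the following sketch (PyTorch-style, referring to the classifier sketch above; using ReduceLROnPlateau is one possible realisation of the described scheduler):

    import torch

    # model = SingleDeviceClassifier(encoders, embedding_dim)   # see sketch above
    optimiser = torch.optim.Adam(
        [
            {"params": model.encoders.parameters(),   "lr": 0.0001},  # fine-tuned pretrained weights
            {"params": model.classifier.parameters(), "lr": 0.001},   # newly added layers
        ],
        betas=(0.9, 0.999),
    )

    # Decrease the learning rate by an order of magnitude if the validation
    # metric has not improved for 10 epochs.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimiser, mode="max", factor=0.1, patience=10)

    # After each epoch: scheduler.step(validation_accuracy)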

The Multimodal Gym exercise recognition accuracies for the explored models are

presented in Table 4.1, along with confusion matrices for the multimodal model in Fig-

ure 4.1, and for the single-device models in Figures 4.2 and 4.3 (see A.3 for additional

confusion matrices). The highest accuracies on the two test sets, seen and unseen sub-

jects, are obtained by the multimodal model with 99.51% and the skeleton model with

92.94%, respectively. The skeleton model also obtained strong performance on the

seen test set, with 99.37% accuracy. The strong performance of the skeleton model is

expected, since the visual information encoded in the skeleton sequence data is natu-

rally very discriminative for the exercise recognition task. Incorporating sensor infor-

mation into the skeleton model through multimodal learning has little or no benefit.

The next-best performing models are the left and right multimodal smartwatch

accelerometer and gyroscope models, with test accuracies of 98.72% and 98.75%, re-

spectively. The strong performance of the smartwatch models can be explained by the

fact that the smartwatches are involved in the movements of all ten exercises studied in

the Multimodal Gym dataset. The earbud and right smartphone models perform sub-

stantially worse than the other devices’ models, which is expected since the earbud and

smartphone are positioned such that they are not heavily involved in many of the exer-

cises. Inspecting the confusion matrices for the single-device earbud model in Figures

4.2e and 4.3e, and for the single-device smartphone model in Figures 4.2d and 4.3d

supports this hypothesis. The smartphone and earbud show poor classification perfor-

mance on exercises with limited head and core movement, such as biceps curls, triceps

extensions, shoulder press and lateral raises, and better performance on, for example,

squats and jumping jacks.

Among the unimodal models we observe that the unimodal accelerometer models

consistently outperform the corresponding unimodal gyroscope models, suggesting that


acceleration is more discriminative than angular velocity for exercise recognition. Fus-

ing accelerometer and gyroscope data, as is done in the single-device models, further

improves exercise recognition accuracy for all devices, demonstrating the benefit of

leveraging multiple data perspectives through multimodal learning.

To evaluate the generalisation ability of our models we tested each model on un-

seen test subjects, that is, subjects that were not present in the training data. Different

participants will perform exercises with slight variations in form and tempo, which an

automated exercise logging system must be robust to. The results show that our models

struggle to generalise well to unseen participants. Morris et al. initially observed the same generalisation problem when training on a dataset of 30 participants [32]. Once they expanded their dataset to 160 participants, their models were able to generalise considerably better. To improve the generalisability of our models we could expand

our dataset to include more participants, and also apply data augmentation methods to

introduce more variability in the data.

The confusion matrices in Figures 4.1, 4.2, and 4.3, reveal that most misclassi-

fications, in particular on the unseen test set, are exercise segments mislabelled as

non-exercise segments. Further analysis into these failure cases is required, but most

likely many of these misclassifications are occurring at the ambiguous boundary re-

gions at the start and end of exercise sets, where the window segment contains both

exercise and non-exercise samples. This problem could be alleviated through tempo-

ral smoothing of predictions, for example, only changing the currently predicted class

to a different class if three consecutive predictions are in agreement. Another poten-

tial reason for these misclassifications is the degree of class imbalance in the dataset,

where non-exercise samples outnumber the samples of each exercise class by roughly 30 to 1. This class imbalance could be mitigated through stratified batching during training.
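A minimal sketch of the proposed temporal smoothing rule, where the emitted class only changes once three consecutive raw window predictions agree (the function name and the list-based interface are illustrative):

    def smooth_predictions(raw_predictions, agreement=3):
        """Switch the output class only when `agreement` consecutive predictions agree."""
        smoothed, current = [], None
        for i, prediction in enumerate(raw_predictions):
            recent = raw_predictions[max(0, i - agreement + 1):i + 1]
            if current is None or (len(recent) == agreement and len(set(recent)) == 1):
                current = prediction
            smoothed.append(current)
        return smoothed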

The results indicate that pretraining does not have a significant impact on model

performance for our exercise recognition task. This is most likely because we are

using relatively lightweight models with few parameters, for which our dataset is suf-

ficiently large to train from scratch without overfitting. However, in settings where

only a small dataset is available for the target task, but where there are many similar

unlabelled instances available, pretraining could help improve performance. This is

especially relevant for tasks involving inertial sensor data, as large unlabelled datasets

are readily available, but creating labelled inertial sensor datasets is a challenging and

time-consuming task.


4.1.3 Multimodal Zeroing-Out Training Strategy

To learn more robust representations for segmentation and exercise recognition when

data from only a single device is available at test time, we investigate whether mul-

timodal training can be used to strengthen the unimodal representations. We do this

by during training gradually randomly zeroing out the input modalities that are only

available at training time, but still requiring the multimodal autoencoder to reconstruct

all the original inputs. By forcing the network to reconstruct modalities for which it

only sees the input for occasionally, we hypothesise that the network will learn features

that encode information about the relationships between different modalities, and thus

potentially supplementing and strengthening the unimodal features of the device which

is available at test time. In particular, we believe that the less discriminative modalities

from the earbud and smartphone, can benefit from the data perspectives provided by

the stronger skeleton and smartwatch modalities. We refer to the device that is used at

test time as the target device.

We now outline the proposed multimodal zeroing-out strategy. We initialise the

multimodal autoencoder with the pretrained weights of a multimodal autoencoder

trained on all available modalities. For the first two epochs, we randomly select a device (excluding the target device) to zero out for each batch, and for every following two epochs we increase the number of devices to zero out. This process

continues until only the target device is not being zeroed-out, at which point we let the

network train for another five epochs. The model is evaluated on the validation set after

every epoch, by zeroing out all the modalities except for the modalities belonging to

the target device, and recording the reconstruction loss as the model attempts to recon-

struct all the original inputs. The model with the lowest validation reconstruction loss

is selected to be evaluated for exercise recognition. To evaluate the learnt represen-

tation on the Multimodal Gym dataset for exercise recognition, three fully connected

layers with 100 hidden units are attached to the embedding layer of the multimodal

autoencoder. The model is then trained for exercise recognition, zeroing out all modal-

ities, except for the modalities belonging to the target device. The model with the

highest validation accuracy is selected and evaluated on the test sets. The experimental

setup is identical to the one outlined in section 4.1.2 in terms of learning rate, optimiser

and batch size.
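One way of realising this schedule is sketched below; the device grouping, the two-epoch increments, and the batch-level zeroing follow the description above, while the dictionary-of-tensors batch format and the surrounding training loop are assumptions.

    import random

    def devices_to_zero(epoch, devices, target_device):
        """Choose which devices to zero out for one batch at the given epoch.

        Every two epochs one more non-target device is zeroed out, until only the
        target device remains visible to the multimodal autoencoder.
        """
        others = [d for d in devices if d != target_device]
        num_zeroed = min(1 + epoch // 2, len(others))    # epochs 0-1: one device, ...
        return random.sample(others, num_zeroed)

    def zero_out(batch, zeroed_devices):
        """Zero the selected devices' inputs; the reconstruction targets stay intact."""
        return {device: (x * 0 if device in zeroed_devices else x)
                for device, x in batch.items()}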

We trained five models, each with a different target device: left camera, left smart-

watch, right smartwatch, right smartphone, and earbud. The test set accuracies for each


of the five models are given in Table 4.2, along with the confusion matrices in Figures

A.17 and A.18. The multimodal zeroing-out training strategy results in marginally bet-

ter accuracy for the two smartwatches on the seen test set, and for the right smartwatch

on the unseen test set compared to the corresponding single-device models. However,

for the remaining devices, the test set accuracy is lower when using the zeroing-out

training strategy. These initial results indicate that the zeroing-out training strategy is

not effective at learning more robust representations or at leveraging external data

perspectives. However, more extensive exploration is required before conclusively

discarding this approach. We propose trying variations of this approach, such as mod-

ifying how modalities are removed during training, and trying to reconstruct just one

modality instead of multiple ones, so that training can focus on leveraging information

from just one data perspective. Furthermore, we plan on experimenting with instance-

level zeroing-out, instead of batch-level zeroing-out, which we hypothesise will result

in smoother learning.

Device        Modality     Accuracy      Accuracy      Accuracy          Accuracy
                           (w/o PT)      (w PT)        (w/o PT, UTS)     (w PT, UTS)
All           -            99.51%        -             92.28%            -
Camera        Skeleton     99.37%        99.36%        92.55%            92.94%
Watch left    Acc          98.51%        98.30%        90.37%            88.54%
Watch left    Gyr          97.81%        98.10%        88.79%            86.51%
Watch left    Acc & Gyr    98.72%        98.72%        93.85%            92.48%
Watch right   Acc          98.50%        98.37%        88.54%            88.07%
Watch right   Gyr          97.86%        97.87%        86.87%            86.85%
Watch right   Acc & Gyr    98.75%        98.68%        91.95%            91.34%
Phone right   Acc          91.21%        90.00%        82.17%            82.17%
Phone right   Gyr          87.64%        86.62%        78.97%            77.08%
Phone right   Acc & Gyr    92.65%        93.31%        83.91%            83.94%
Earbud        Acc          92.85%        91.89%        78.89%            77.04%
Earbud        Gyr          88.65%        88.68%        79.17%            77.64%
Earbud        Acc & Gyr    94.17%        94.22%        82.11%            80.47%

Table 4.1: Multimodal, unimodal and single-device exercise recognition accuracies on the Multimodal Gym test sets. We evaluate whether pretraining (PT) boosts performance, and how well each model generalises to unseen test subjects (UTS).


[Figure 4.1: Confusion matrices for the multimodal model on the Multimodal Gym test set, for (a) seen subjects and (b) unseen subjects. True versus predicted labels over the ten exercise classes (squats, lunges, biceps curls, sit-ups, push-ups, triceps extensions, rows, jumping jacks, shoulder press, lateral raises) and the non-exercise class.]

Device             Accuracy   Accuracy (UTS)
Camera left        99.35%     91.13%
Smartwatch left    98.78%     92.78%
Smartwatch right   99.08%     91.99%
Smartphone right   91.01%     81.60%
Earbud             93.08%     81.28%

Table 4.2: Single-device test accuracies for pretrained models trained using the zeroing-out strategy, for both seen and previously unseen test subjects (UTS). Results should be compared with the corresponding single-device pretrained model's accuracies in Table 4.1.

4.2 Repetition Counting

We have implemented and evaluated a strong baseline approach (see section 3.3) for

the task of repetition counting on the Multimodal Gym dataset. See Figure 4.4 for an

example output of the repetition counting approach. Since the proposed approach does

not require any training, and as we do not fit any model hyperparameters using any

data, we are able to evaluate the approach on all 16 workout sessions. This brings the

total number of exercise sets on which our approach was evaluated to 466. The ground truth number of repetitions across all exercise sets ranges from 9 to 14, with the

mean number of repetitions per set equal to 10.04.


The Mean Absolute repetition counting Error (MAE) per set obtained by the peak

counting and auto-correlation method for the different modalities and devices is given in Table 4.3. Table 4.4 displays the percentage of predictions within different

ranges of the ground truth. The left smartwatch gyroscope modality obtains the lowest

MAE, 0.29, closely followed by the right smartwatch gyroscope, 0.32, and the left and

right smartwatch accelerometers, 0.34 and 0.38, respectively. The predictions made

using phone and earbud data are significantly worse compared to those made using the

watch and skeleton data. This is expected since the phone and earbud are positioned

such that they do not capture the central movements of many of the exercises, for ex-

ample biceps curls, rows, and triceps extensions. Surprisingly, the skeleton modality

obtains a MAE more than twice that of the best-performing modality, the left smartwatch gyroscope. This is surprising since, intuitively, the skeleton sequence data should be the most informative, and be able to capture the distinguishing movements of any exercise, irrespective of which body parts are involved. Analysing the

results further by looking at the performance at the exercise level (Table 4.3) reveals

that the skeleton modality struggles to count biceps curls, but performs well on all the

other exercises. The reason for the skeleton modality’s poor performance on counting

biceps curl repetitions is explained by Figure 4.5. The processed signal obtained using

the implemented approach defines one biceps curl repetition as two alternating curls

(one for each arm). This highlights an ambiguous case of how a repetition should be

defined; is one curl a repetition, or are two alternating curls one repetition? Ignoring

the biceps curl exercise results in a MAE of 0.29 for the skeleton modality, which is on par with the performance of the left smartwatch gyroscope.
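The metrics reported in Tables 4.3 and 4.4 reduce to a few lines; the sketch below assumes per-set predicted and ground-truth repetition counts as plain sequences.

    import numpy as np

    def counting_metrics(predicted, actual):
        """Per-set mean absolute error and the fraction of sets within N repetitions."""
        errors = np.abs(np.asarray(predicted) - np.asarray(actual))
        return {
            "MAE": errors.mean(),
            "exact": (errors == 0).mean(),
            "within_1": (errors <= 1).mean(),
            "within_2": (errors <= 2).mean(),
        }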

The results we obtain are in line with the results reported by Morris et al. for

repetition counting on the RecoFit dataset [32]. Using the same approach, and also

evaluating on the ground-truth exercise segmentation boundaries, they obtain a MAE

of 0.26 across 160 sets for 26 different exercises. Their predictions were obtained using

data from a right-arm worn accelerometer sensor. They obtain exact predictions in 50%

of the sets, within-1 in 97% of the sets, and within-2 in 99% of the sets, which can be

compared to our right watch accelerometer, exact 75%, within-1 95%, and within-2

97%. One difference between our findings and those of Morris et al. is that they did not find gyroscope data useful for repetition counting [32]; however, our results

show that the method performs marginally better in terms of MAE on gyroscope data

than accelerometer data, across all devices.

The repetition counting approach is intuitive, efficient, and performs remarkably well.


Device     Mod.    Sq     Pu     Sp     Lu     Ro     Su     Te     Bc     Lr     Jj     Total
Cam L.     Skel    0.06   0.63   0.73   0.0    0.31   0.06   0.37   4.74   0.09   0.33   0.70
Watch L.   Acc     0.19   0.52   0.19   0.02   0.20   0.23   0.24   1.35   0.05   0.53   0.34
Watch L.   Gyr     0.17   0.21   0.71   0.47   0.31   0.21   0.12   0.35   0.0    0.35   0.29
Watch R.   Acc     0.17   0.60   0.25   0.02   0.16   0.27   0.24   1.51   0.05   0.60   0.38
Watch R.   Gyr     0.23   0.19   0.52   0.57   0.33   0.17   0.24   0.37   0.07   0.47   0.32
Phone R.   Acc     1.40   0.21   1.77   0.32   1.27   2.69   2.51   2.51   1.77   1.40   1.58
Phone R.   Gyr     3.04   0.58   1.40   0.87   0.57   2.81   1.71   1.21   1.35   1.60   1.52
Earbud     Acc     0.19   0.60   1.54   0.21   1.63   0.90   0.53   2.00   1.47   2.91   1.17
Earbud     Gyr     0.79   0.88   1.06   0.72   1.41   0.56   0.51   1.77   1.28   2.49   1.12

Table 4.3: The mean absolute repetition counting error per set for each modality and device across all exercises: squats (Sq), push-ups (Pu), shoulder press (Sp), lunges (Lu), dumbbell rows (Ro), sit-ups (Su), triceps extensions (Te), biceps curls (Bc), lateral raises (Lr), and jumping jacks (Jj).

Device             Modality   Exact             Within 1          Within 2
Camera left        Skeleton   67.17% (74.00%)   89.06% (97.87%)   90.56% (99.53%)
Smartwatch left    Acc        74.03%            95.28%            98.07%
Smartwatch left    Gyr        75.32%            97.00%            98.93%
Smartwatch right   Acc        75.11%            95.06%            96.57%
Smartwatch right   Gyr        72.96%            97.00%            99.14%
Smartphone right   Acc        40.56%            65.67%            76.18%
Smartphone right   Gyr        31.55%            68.45%            84.33%
Earbud             Acc        43.56%            70.17%            82.62%
Earbud             Gyr        40.34%            72.75%            84.55%

Table 4.4: The percentage of exercise sets for which the predicted repetition count is exact, within 1, or within 2 of the ground-truth repetition count, for each modality and device. The percentages in brackets for the skeleton modality are the results obtained when ignoring biceps curls.

However, there are clear limitations associated with this approach. One such

limitation is that a minimum and maximum repetition duration threshold needs to be

manually chosen for each exercise. The performance is sensitive to how accurate these


thresholds are, and the reliance on them also means that the approach is not invariant to the rate at which

repetitions are performed. Another limitation is that the method is heavily reliant on

accurate segmentation and classification in order to obtain good results. Finally, the

method relies on simple heuristic-based signal processing techniques, and does not

learn a high-level representation of what constitutes a repetition. This problem is ex-

emplified by the approach’s behaviour when counting biceps curl repetitions using

smartwatch data. Intuitively one would expect the repetition counting method to pre-

dict five repetitions for a set of ten alternating biceps curls since the smartwatch would

only be able to register the curls from one of the arms. However, the method often pre-

dicts ten or close to ten repetitions in these scenarios. This is because the assumptions

supplied to the model about how long a biceps curl repetition should last cause the model to interpret the small peaks in between the actual repetitions as repetitions themselves. These peaks are most likely just noise, but the method cannot distinguish between genuine repetition peaks and noise peaks. More sophisticated methods that

incorporate higher-level reasoning are required for this.

We present this baseline approach on the Multimodal Gym dataset, with the hope

that future work can build on and improve these results. One interesting future research

direction is to use more sophisticated supervised learning approaches to count repeti-

tions. Such methods would benefit from having annotations of repetition counts on

the repetition level. Since manually annotating this information would be very time-

consuming, we propose using the presented baseline approach for automatic annota-

tion of repetition counts on the repetition level. A promising future research direction

is the use of Long Short-Term Memory (LSTM) networks [19], which are well-suited

for sequence data related tasks where previous information is useful for the current

prediction.


[Figure 4.2: Confusion matrices (true versus predicted labels over the ten exercise classes and the non-exercise class) for the single-device models (with pretraining) on the Multimodal Gym test set. (a) Smartwatch left acc & gyr. (b) Smartwatch right acc & gyr. (c) Skeleton. (d) Smartphone right acc & gyr. (e) Earbud left acc & gyr.]


[Figure 4.3: Confusion matrices (true versus predicted labels over the ten exercise classes and the non-exercise class) for the single-device models (with pretraining) on the Multimodal Gym test set for unseen subjects. (a) Smartwatch left acc & gyr. (b) Smartwatch right acc & gyr. (c) Skeleton. (d) Smartphone right acc & gyr. (e) Earbud left acc & gyr.]


[Figure 4.4: Repetition counting for a set of ten squat repetitions using gyroscope readings from a smartwatch worn on the left wrist. The original triaxial gyroscope data (X, Y, Z, in rad/s) is shown in the top plot, and the processed 1D signal in the bottom plot, with the predicted repetitions marked out; the time axis spans 0 to 10 seconds.]

[Figure 4.5: Processed skeleton modality signal for repetition counting for a set of biceps curls, with the predicted repetitions marked out; the time axis spans 0 to 10 seconds.]


Chapter 5

Conclusion

5.1 Summary

This work proposes an automated exercise logging system, consisting of a joint ac-

tivity segmentation and exercise recognition stage, followed by a repetition counting

stage. We use a multimodal deep learning approach for activity segmentation and ex-

ercise recognition that extracts modality-specific representations using a stacked con-

volutional autoencoder. These representations are then fed into a deep fully-connected

autoencoder to learn a shared cross-modal representation, which is used for the exer-

cise recognition task.

To train and evaluate our automated exercise logging system, we have collected and

annotated the Multimodal Gym dataset, a substantial dataset of subjects performing

full-body workouts, consisting of RGB-D video, inertial sensing, and pose estimate

data. This dataset will be made publicly available, with the hope that it will be a useful

resource for multimodal research. We investigate fusing inertial sensor data and 3D

pose estimates, which has not previously been explored for HAR. Our proposed

multimodal approach demonstrates near-perfect exercise recognition performance on

the Multimodal Gym dataset, with 99.5% accuracy. We also evaluate and compare

unimodal and single-device models across a number of smart devices, and evaluate the

generalisability of our models on unseen test subjects.

In addition, we introduce the zeroing-out training strategy as an approach for strength-

ening unimodal features by leveraging external data perspectives through multimodal

training. We present our initial results, which indicate that our zeroing-out strategy is

not effective at leveraging external data perspectives; however, a more comprehensive

exploration is required, for which we suggest a number of future research directions.


Furthermore, we have implemented a strong baseline method, proposed by Morris

et al. [32], for the repetition counting task on the Multimodal Gym dataset. The best

performing modality was the smartwatch gyroscope, with a mean absolute prediction

error per set of 0.29. We discuss the limitations of the baseline approach and propose

promising future research directions for repetition counting.

5.2 Future Work

Future work will focus on improving a number of aspects of the proposed automated

exercise logging system. To improve the robustness and generalisability of our exer-

cise recognition models we will investigate data augmentation techniques for sensor

and skeleton data. Additionally, the Multimodal Gym dataset could be expanded to

include more participants in order to introduce more variability in the data. These

modifications should result in improved model performance on unseen test subjects.

To decrease the number of exercise segments being misclassified as non-exercise

segments, we propose three potential solutions to investigate. First, we propose evalu-

ating whether temporal smoothing of the predictions improves performance. This can

be done in an online fashion with a small additional cost in recognition speed. Sec-

ondly, we will explore whether stratified batching can be used to handle the class im-

balance problem, to ensure that the network places equal importance on classifying

all classes. Finally, a potential solution to explore is to label and predict each window

segment instance with a vector containing the proportion of each class in the current

window segment, instead of just labelling and predicting one class for each window

segment. This would eliminate the problem of ambiguous overlapping windows at the

borders of exercise and non-exercise segments.

An additional promising future research direction for further improving the ex-

ercise recognition model is to remove the intermediate stage of extracting 3D pose

estimates, and to work directly on the raw RGB and depth data. This would make the

learning task significantly more difficult, thus requiring a larger, and more sophisti-

cated network, but would allow the network to determine what features in the raw data

are relevant for the task of exercise recognition.

We also propose a more extensive exploration of the zeroing-out training strategy

(see section 4.1.3). This will include modifying how modalities are removed during

training, experimenting with instance-level zeroing-out, and using a smaller bottleneck

size in the multimodal autoencoder. This would make the network more likely to learn


a shared cross-modal representation, as the network would be restricted in the space it

has to learn independent modality-specific representations.

Finally, as discussed in section 4.2, we highlight LSTMs as a promising approach

that is well-suited for repetition counting. Since LSTMs would benefit from having an-

notations at the repetition level, we propose using a signal processing technique, such

as the one proposed by Morris et al. [32], to automatically generate these repetition

level annotations.


Bibliography

[1] Jake K. Aggarwal and Lu Xia. Human activity recognition from 3D data: A

review. Pattern Recognition Letters, 48:70–80, 2014.

[2] B Almaslukh. An effective deep autoencoder approach for online smartphone-

based human activity recognition. International Journal of Computer Science

and Network Security, 17, 04 2017.

[3] Oresti Banos, Juan-Manuel Galvez, Miguel Damas, Hector Pomares, and Ignacio

Rojas. Window size impact in human activity recognition. Sensors, 14(4):6474–

6499, 2014.

[4] C. G. Bender, J. C. Hoffstot, B. T. Combs, S. Hooshangi, and J. Cappos. Measur-

ing the fitness of fitness trackers. In 2017 IEEE Sensors Applications Symposium

(SAS), pages 1–6, March 2017.

[5] Andreas Bulling, Ulf Blanke, and Bernt Schiele. A tutorial on human activ-

ity recognition using body-worn inertial sensors. ACM Comput. Surv., 46:33:1–

33:33, 2014.

[6] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Open-

Pose: realtime multi-person 2D pose estimation using Part Affinity Fields. In

arXiv preprint arXiv:1812.08008, 2018.

[7] Catalin Ionescu, Fuxin Li, and Cristian Sminchisescu. Latent structured models for

human pose estimation. In International Conference on Computer Vision, 2011.

[8] Liming Chen, Jesse Hoey, Chris D. Nugent, Diane J. Cook, and Zhiwen Yu.

Sensor-based activity recognition. Trans. Sys. Man Cyber Part C, 42(6):790–

808, November 2012.


[9] K. S. Choi, Y. S. Joo, and S. Kim. Automatic exercise counter for outdoor exer-

cise equipment. In 2013 IEEE International Conference on Consumer Electron-

ics (ICCE), pages 436–437, Jan 2013.

[10] Yong Du, Yun Fu, and Liang Wang. Skeleton based action recognition with con-

volutional neural network. 2015 3rd IAPR Asian Conference on Pattern Recog-

nition (ACPR), pages 579–583, 2015.

[11] Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for

skeleton based action recognition. 2015 IEEE Conference on Computer Vision

and Pattern Recognition (CVPR), pages 1110–1118, 2015.

[12] Georgios Dimitrios Evangelidis, Gurkirt Singh, and Radu Horaud. Skeletal

quads: Human action recognition using joint quadruples. 2014 22nd Interna-

tional Conference on Pattern Recognition, pages 4513–4518, 2014.

[13] Hristijan Gjoreski, Jani Bizjak, Martin Gjoreski, and Matjaz Gams. Compar-

ing deep and classical machine learning methods for human activity recognition

using wrist accelerometer. 2016.

[14] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural

networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudık, editors,

Proceedings of the Fourteenth International Conference on Artificial Intelligence

and Statistics, volume 15 of Proceedings of Machine Learning Research, pages

315–323, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.

[15] S. Ha, J. Yun, and S. Choi. Multi-modal convolutional neural networks for ac-

tivity recognition. In 2015 IEEE International Conference on Systems, Man, and

Cybernetics, pages 3017–3022, Oct 2015.

[16] Nils Y. Hammerla, Shane Halloran, and Thomas Plötz. Deep, convolutional, and

recurrent models for human activity recognition using wearables. In IJCAI, 2016.

[17] Keng hao Chang, Mike Y. Chen, and John F. Canny. Tracking free-weight exer-

cises. In UbiComp, 2007.

[18] Samitha Herath, Mehrtash Tafazzoli Harandi, and Fatih Murat Porikli. Going

deeper into action recognition: A survey. Image Vision Comput., 60:4–21, 2017.


[19] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural

Comput., 9(8):1735–1780, November 1997.

[20] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Hu-

man3.6m: Large scale datasets and predictive methods for 3d human sensing

in natural environments. IEEE Transactions on Pattern Analysis and Machine

Intelligence, 36(7):1325–1339, jul 2014.

[21] Fahim Kawsar, Chulhong Min, Akhil Mathur, and Allesandro Montanari. Ear-

ables for personal-scale behavior analytics. IEEE Pervasive Computing, 17:83–

89, 2018.

[22] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Ahmed Sohel, and

Farid Boussaïd. A new representation of skeleton sequences for 3d action recog-

nition. 2017 IEEE Conference on Computer Vision and Pattern Recognition

(CVPR), pages 4570–4579, 2017.

[23] Aftab Khan, Nils Y. Hammerla, Sebastian Mellor, and Thomas Plötz. Optimis-

ing sampling rates for accelerometer-based human activity recognition. Pattern

Recognition Letters, 73:33–40, 2016.

[24] Diederick P Kingma and Jimmy Ba. Adam: A method for stochastic optimiza-

tion. In International Conference on Learning Representations (ICLR), 2015.

[25] Oscar D. Lara and Miguel A. Labrador. A survey on human activity recognition

using wearable sensors. IEEE Communications Surveys & Tutorials, 15:1192–

1209, 2013.

[26] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature,

521(7553):436–444, 5 2015.

[27] Yann Lecun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based

learning applied to document recognition. In Proceedings of the IEEE, pages

2278–2324, 1998.

[28] O. Levy and L. Wolf. Live repetition counting. In 2015 IEEE International

Conference on Computer Vision (ICCV), pages 3020–3028, Dec 2015.

[29] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Skeleton-based action recog-

nition with convolutional neural networks. 2017 IEEE International Conference

on Multimedia Expo Workshops (ICMEW), pages 597–600, 2017.


[30] Shanhong Liu. Fitness & activity tracker. https://www.statista.com/study/35598/

fitness-and-activity-tracker/, 2018.

[31] Julieta Martinez, Rayat Hossain, Javier Romero, and James J. Little. A simple

yet effective baseline for 3d human pose estimation. In ICCV, 2017.

[32] Dan Morris, T. Scott Saponas, Andrew Guillory, and Ilya Kelner. Recofit: using a

wearable sensor to find, recognize, and count repetitive exercises. In CHI, 2014.

[33] Bobak Mortazavi, Mohammad Pourhomayoun, Gabriel Alsheikh, Nabil Alshu-

rafa, Sunghoon Ivan Lee, and Majid Sarrafzadeh. Determining the single best

axis for exercise repetition recognition and counting on smartwatches. 2014 11th

International Conference on Wearable and Implantable Body Sensor Networks,

pages 33–38, 2014.

[34] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and An-

drew Y. Ng. Multimodal deep learning. In ICML, 2011.

[35] Ronald Poppe. A survey on vision-based human action recognition. Image Vision

Comput., 28:976–990, 2010.

[36] Valentin Radu, Nicholas D. Lane, Sourav Bhattacharya, Cecilia Mascolo, Ma-

hesh K. Marina, and Fahim Kawsar. Towards multimodal deep learning for

activity recognition on mobile devices. In Proceedings of the 2016 ACM In-

ternational Joint Conference on Pervasive and Ubiquitous Computing: Adjunct,

UbiComp ’16, pages 185–188, New York, NY, USA, 2016. ACM.

[37] Valentin Radu, Catherine Tong, Sourav Bhattacharya, Nicholas D. Lane, Cecilia

Mascolo, Mahesh K. Marina, and Fahim Kawsar. Multimodal deep learning for

activity and context recognition. Proc. ACM Interact. Mob. Wearable Ubiquitous

Technol., 1(4):157:1–157:27, January 2018.

[38] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. 2018.

[39] T. F. H. Runia, C. G. M. Snoek, and A. W. M. Smeulders. Real-world repetition

estimation by div, grad and curl. In 2018 IEEE/CVF Conference on Computer

Vision and Pattern Recognition, pages 9009–9017, 2018.


[40] Abraham Savitzky and M. J. E. Golay. Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, 36(8):1627–1639, 1964.

[41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[42] Andrea Soro, Gino Brunner, Simon Tanner, and Roger Wattenhofer. Recognition and repetition counting for complex physical exercises with deep learning. Sensors, 19(3), 2019.

[43] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, January 2014.

[44] Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, Tobias Sonne, and Mads Møller Jensen. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, SenSys '15, pages 127–140, New York, NY, USA, 2015. ACM.

[45] Terry Taewoong Um, Vahid Babakeshizadeh, and Dana Kulić. Exercise motion classification from large-scale wearable sensor data using convolutional neural networks. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2385–2390, 2017.

[46] Raviteja Vemulapalli, Felipe Arrate, and Rama Chellappa. Human action recognition by representing 3d skeletons as points in a Lie group. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 588–595, 2014.

[47] Jiang Wang, Zicheng Liu, Ying Wu, and Junsong Yuan. Learning actionlet ensemble for 3d human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36:914–927, 2014.

[48] Jindong Wang, Yiqiang Chen, Shuji Hao, Xiaohui Peng, and Lisha Hu. Deep learning for sensor-based activity recognition: A survey. Pattern Recognition Letters, 119:3–11, 2019.


[49] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 2018.

[50] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiaoli Li, and Shonali Krishnaswamy. Deep convolutional neural networks on multichannel time series for human activity recognition. In IJCAI, 2015.

[51] Ming Zeng, Le T. Nguyen, Bo Yu, Ole J. Mengshoel, Jiang Zhu, Pang Wu, and Joy Zhang. Convolutional neural networks for human activity recognition using mobile sensors. In 6th International Conference on Mobile Computing, Applications and Services, pages 197–205, 2014.


Appendix A

Additional Figures

A.1 Exercise Visualisations

Figure A.1: Squats.

Figure A.2: Push-ups.

Figure A.3: Dumbbell shoulder press.


Figure A.4: Lunges.

Figure A.5: Dumbbell rows.

Figure A.6: Triceps extensions.

Figure A.7: Biceps curls.

Figure A.8: Sit-ups.



Figure A.9: Jumping jacks.

Figure A.10: Lateral shoulder raises.

A.2 Unimodal Autoencoder Hyperparameter Search

Figure A.11: Validation reconstruction loss on RecoFit accelerometer data for different convolutional autoencoder model configurations. (Plot: test reconstruction loss and model parameter count per configuration, e.g. 3_k11_GNN or 1_k3_N, comparing stride 2 and stride 3.)


Figure A.12: Validation reconstruction loss on RecoFit gyroscope data for different convolutional autoencoder model configurations. (Plot: test reconstruction loss and model parameter count per configuration, comparing stride 2 and stride 3.)

Figure A.13: Validation reconstruction loss on Human3.6M motion capture data for different convolutional autoencoder model configurations. (Plot: test reconstruction loss and model parameter count per configuration, comparing stride 2 and stride 3.)
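The configurations swept in Figures A.11–A.13 vary the depth, kernel size and stride of a 1D convolutional autoencoder over windows of sensor data. As a rough illustration only (a minimal sketch, not the implementation evaluated in this work; the helper name, channel width and window length are assumptions), such a configuration could be expressed as:

# Minimal sketch of a symmetric 1D convolutional autoencoder whose depth,
# kernel size and stride can be swept; names and widths are illustrative.
import torch
import torch.nn as nn

def build_conv_autoencoder(in_channels=3, num_layers=3, kernel_size=11,
                           stride=2, hidden_channels=16):
    pad = kernel_size // 2
    layers = []
    c_in = in_channels
    # Encoder: num_layers strided Conv1d blocks.
    for _ in range(num_layers):
        layers += [nn.Conv1d(c_in, hidden_channels, kernel_size,
                             stride=stride, padding=pad), nn.ReLU()]
        c_in = hidden_channels
    # Decoder: mirror the encoder with transposed convolutions.
    for i in range(num_layers):
        c_out = in_channels if i == num_layers - 1 else hidden_channels
        layers.append(nn.ConvTranspose1d(hidden_channels, c_out, kernel_size,
                                         stride=stride, padding=pad,
                                         output_padding=stride - 1))
        if i < num_layers - 1:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# Example: reconstruct a window of 3-axis accelerometer samples
# (window length chosen so the strided shapes round-trip exactly).
model = build_conv_autoencoder(in_channels=3, num_layers=3, kernel_size=11, stride=2)
x = torch.randn(8, 3, 128)                      # (batch, channels, window length)
reconstruction_loss = nn.functional.mse_loss(model(x), x)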


A.3 Confusion Matrices

Figure A.14: Confusion matrices for the single-device models (trained from scratch) on the Multimodal Gym test set. Panels: (a) smartwatch left acc & gyr, (b) smartwatch right acc & gyr, (c) skeleton, (d) smartphone right acc & gyr, (e) earbud left acc & gyr. Each panel plots predicted labels against true labels over the ten exercise classes (squats, lunges, biceps curls, sit-ups, push-ups, triceps extensions, rows, jumping jacks, shoulder press, lateral raises) and the non-exercise class.


Figure A.15: Confusion matrices for the unimodal accelerometer models (trained from scratch) on the Multimodal Gym test set. Panels: (a) smartwatch left, (b) smartwatch right, (c) earbud left, (d) smartphone right.


Figure A.16: Confusion matrices for the unimodal gyroscope models (trained from scratch) on the Multimodal Gym test set. Panels: (a) smartwatch left, (b) smartwatch right, (c) earbud left, (d) smartphone right.


Figure A.17: Confusion matrices on the seen-subjects test set for the single-device models trained using the multimodal zeroing-out training strategy. Panels: (a) smartwatch left, (b) smartwatch right, (c) skeleton, (d) smartphone right, (e) earbud left.


Figure A.18: Confusion matrices on the unseen-subjects test set for the single-device models trained using the multimodal zeroing-out training strategy. Panels: (a) smartwatch left, (b) smartwatch right, (c) skeleton, (d) smartphone right, (e) earbud left.