

A data augmentation methodology for training machine/deep learning gait recognition algorithms

Christoforos C. Charalambous
http://www.bicv.org/team/christoforos-charalambous/

Anil A. Bharath
http://www.bicv.org/team/anil-anthony-bharath/

BICV Research Group, Department of Bioengineering, Imperial College London, London, UK

Abstract

There are several confounding factors that can reduce the accuracy of gait recognition systems. These factors can reduce the distinctiveness, or alter the features used to characterise gait; they include variations in clothing, lighting, pose and environment, such as the walking surface. Full invariance to all confounding factors is challenging in the absence of high-quality labelled training data. We introduce a simulation-based methodology and a subject-specific dataset which can be used for generating synthetic video frames and sequences for data augmentation. With this methodology, we generated a multi-modal dataset. In addition, we supply simulation files that provide the ability to simultaneously sample from several confounding variables. The basis of the data is real motion capture data of subjects walking and running on a treadmill at different speeds. Results from gait recognition experiments suggest that information about the identity of subjects is retained within synthetically generated examples. The dataset and methodology allow studies into fully-invariant identity recognition spanning a far greater number of observation conditions than would otherwise be possible.

1 Introduction

Gait has received significant attention as a biometric, particularly with the increasing quality of security cameras, the growth in city populations and the subsequent interest in tracking the movement of people in such environments. There have been several studies [1, 2] that support the usefulness of gait for identity recognition within a sample population. However, there are many confounding factors that can reduce the usefulness of gait as an identifying biometric; these range from the difficulties in extracting particular signatures to the distinctiveness and robustness of those features to confounding factors. The confounding factors can be divided into three different categories, according to whether they are related to:

• the environment in which the subject is being captured (e.g. weather conditions, walking surface, background, ambient light, shadows, occlusions by other objects)

• the imaging system used in capturing the subject (e.g. subject-camera relative viewing angle, lighting conditions, camera lens distortions, image modality)

© 2016. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.

Pages 110.1-110.12

DOI: https://dx.doi.org/10.5244/C.30.110


• the appearance of the subject (e.g. clothing, footwear, speed, gender, age, physical state, injuries)

Emerging gait recognition systems depend on supervised machine/deep learning. Machine learning algorithms (especially deep networks) require vast amounts of application-specific, high-quality labelled training data, which is either very expensive or not feasible to acquire. In order to collect sufficient amounts of labelled data for training deep networks for gait recognition from scratch, participants would need to walk for hours. In addition, to collect data under different conditions, the experiment would have to be repeated multiple times, to cover all the possible combinations of conditions that are likely to occur. The only alternative strategy would be to employ transfer learning, but it is hard to find a problem for transfer learning which is sufficiently close to the complexity of gait recognition.

Taking several confounding variables together poses significant challenges. How does one gather sufficient variations across all confounding factors, and with many different subjects, within a finite time in order to train, let alone test, a system to provide accurate recognition? One strategy would be to synthesize gait motion within virtual environments in which clothing, lighting, camera pose and environments can be systematically changed. This paper takes a first step towards showing that this is feasible, conditional on sufficient quality of real motion capture (mocap) data, rendering and imaging models.

We suggest a method to generate augmented labelled training data for machine learning, including quantities of data that could be used for deep supervised learning algorithms. A new multi-modal dataset is introduced, consisting of more than 6.5 million frames of real motion capture data, video data and 3D mesh models. Based on the motion capture data collected from 26 subjects, we also provide simulation files with several confounding factors being controllable (see Figure 3). The motion data itself is captured from real people walking and running on a treadmill at different speeds.

Generating synthetic data for data augmentation is not new in the field of Computer Vision. Studies like [3, 4, 5] use synthetically generated data to augment training and/or testing sets, while others use it to learn features invariant to certain conditions [6]. While such methods focus on a very limited number of covariate factors, to our knowledge, this is the first attempt to provide the ability to generate synthetic video data with so many controllable conditions from real human motion data. This opens the possibility of bootstrapping or pre-training gait recognition methods that can identify people by their gait while being invariant to several different confounding factors simultaneously. We discuss the process by which this dataset is generated and demonstrate that characteristics of identity are preserved within the motion of the synthetically generated data.

2 Literature Review

This work focuses on generating subject-specific synthetic video data for the purpose of data augmentation. Data augmentation provides a means for increasing the quantity of training data available for machine learning, and is particularly relevant when training deep learning systems from scratch [3].

In the context of gait recognition, data augmentation is likely to be useful for appearance-based, rather than model-based, approaches. The key idea behind appearance-based gait recognition is to collapse temporal information, from a video or a series of frames, down to a single image or a pair of images. An early attempt [7] to analyse human motion suggested using two complementary images, the (binary) Motion Energy Image (MEI) and the (scalar-valued) Motion History Image (MHI). Inspired by [7], Han introduced the Gait Energy Image (GEI) [8], a grayscale image calculated by averaging the binary silhouettes of a person walking across the duration of a complete gait cycle. A different gait representation, the Key Fourier Descriptors (KFDs), proposed by Yu et al. [9], used frequency information from the contours of binary silhouettes to extract gait features. Subsequent work by Lam et al. introduced the Motion Silhouette Image (MSI) [10]. Inspired by the MHI, the MSI is a grayscale image created from binary silhouette sequences of ambulatory motion. The intensity value of a pixel in an MSI is a non-linear function of the temporal history of motion at that pixel location.

An alternative approach by Liu et al. [11] suggested the Gait History Image (GHI), which addresses some shortcomings of the MEI, MHI and GEI metrics. A slightly different approach by Bashir et al. [12] took advantage of the ability of information entropy to calculate the average amount of information in a signal; they introduced the Gait Entropy Image (GEnI), which is obtained by calculating the Shannon entropy for each pixel in a sequence of size-normalised, and centre-aligned, binary silhouettes.
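For illustration only (not the authors' code), the per-pixel entropy behind the GEnI reduces to a short NumPy sketch, assuming `silhouettes` is a frames × height × width binary array:

```python
import numpy as np

def gait_entropy_image(silhouettes):
    """GEnI as described in [12]: per-pixel Shannon entropy over a stack of
    size-normalised, centre-aligned binary silhouettes."""
    p = silhouettes.mean(axis=0)   # probability that each pixel is foreground
    eps = 1e-12                    # avoid log(0) at always-on/always-off pixels
    return -(p * np.log2(p + eps) + (1.0 - p) * np.log2(1.0 - p + eps))
```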

Most of the appearance-based methods that collapse spatio-temporal information down to single images show promise in discriminating subjects when applied under controlled imaging environments. When used in less controlled environments, the many factors that act as confounding variables can substantially reduce recognition performance. The most common confounding factor is the camera-subject relative viewing angle. One of the studies aimed at providing viewpoint-invariance in gait recognition is the work of Makihara et al. [13]. The authors suggested a View Transformation Model (VTM), which is applied to Fourier-based features, transforming the features from one camera-subject viewpoint relationship into features from one or more reference views. The authors construct a feature matrix that has a different row for each viewpoint and a different column for each subject. This feature matrix is then decomposed using Singular Value Decomposition (SVD) and, using simple algebra, analytical expressions to estimate the feature vector at a specific viewpoint are derived.
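The linear-algebraic core of the VTM idea can be sketched as follows (a minimal illustration with placeholder dimensions and random data, not the setup of [13]):

```python
import numpy as np

# Feature matrix: one block-row of d features per viewpoint, one column per subject.
rng = np.random.default_rng(0)
n_views, n_subjects, d = 4, 10, 64
M = rng.random((n_views * d, n_subjects))

# SVD factorises M into view-dependent and subject-intrinsic parts: M = P @ V.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
P = U * s   # stacked view-dependent projection matrices
V = Vt      # subject-intrinsic vectors, one column per subject

# Estimate subject j's features in reference view 0 from its intrinsic vector.
j = 3
feat_view0 = P[0 * d:1 * d, :] @ V[:, j]
```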

Perhaps a greater challenge for gait recognition algorithms is to deal with different clothing. Nandy et al. [14] conducted a recent study into the possibility of clothing-invariant subject recognition. Based on the GEnI [12], the authors built a new feature vector, considering the width of each row in the calculated GEnI. Their results suggest that this feature extraction process not only reduces the feature space dramatically, but also includes distinctive signatures while neglecting redundant static information. The reliability of the proposed feature was evaluated using statistical tests, such as pair-wise clothing correlation and intra-clothing variance.

In real-world scenarios it is also common to encounter occlusions due to objects between the camera and the human subject. Hofmann et al. [15] attempted to address the problem of dynamic and static inter-object occlusion, introducing a new dataset and baseline algorithms that used colour histograms and the GEI feature image [8]. Another very common confounding factor, which directly affects appearance-based gait recognition algorithms, is walking speed. Guan et al. [16] employ Gabor functions at five scales and eight orientations to generate the Gabor-GEI feature template. The generated Gabor-GEI features were projected to subspaces using a Random Subspace Method (RSM), followed by 2D Linear Discriminant Analysis (2DLDA), which then projected the samples into a canonical space. The authors aimed to reduce the generalisation error by using a large number of random subspaces/base classifiers.


Data augmentation has been widely used in some areas of computer vision that rely heavily on machine learning, where suitable augmentation can help achieve object-camera pose invariance. Common forms of image augmentation include simple transformations such as left-right flipping, rotations, translations or scaling of an image [3, 17]. Flipping, cropping and resizing have also been used in gait recognition [8]. However, this represents only a small subset of the transformations to which we wish a gait recognition system to be invariant.

3 Data Collection

Given that the objective of this work is to support the ability to generate augmented gait sequences in such a way as to incorporate controllable confounding factors, we sought to use computer generated imagery driven by real motion capture data of human volunteers. The motion of these subjects was captured whilst they walked and ran on a treadmill at five different walking speeds (3–7 km/h) and at five different jogging/running speeds (8–12 km/h).

Figure 1: Selected screenshots showing key steps of the overall pipeline, for subject 3: (a) the subject with reflective markers attached to the body, (b) the 3D representation of the markers in Vicon™ Nexus software, (c) the estimated 3D skeleton model, (d) a skeleton following the real mocap data within Blender, (e) the 3D mesh model created in the MakeHuman environment, and (f) the 3D mesh deformed according to the collected motion capture data.

The motion capture data was collected using a multi-camera Vicon™ motion capture system, operating at 200 Hz. The system was configured so that 10 infrared cameras were strategically spaced around the treadmill, which was placed in the middle of the lab. A total of 38 highly reflective markers were attached to selected landmarks on the body of each subject, as seen in Figure 1(a). Figure 1(b) shows the 3D representation of the markers attached to the same subject. The 3D coordinates of the markers, along with a number of anthropometric measurements taken for each subject at the start of the experiment, provide the necessary information to fit a 3D skeleton model to the subject (Figure 1(c)). The estimated 3D skeleton consists of a total of 19 joints, each fully characterised by its 3D position and orientation. Vicon™ Plug-in-Gait provides additional estimated kinetics (i.e. forces and moments), which are also available; however, they are not used in this work. The Vicon™ Nexus software [18] was used to pre-process the collected data and to estimate the 3D skeleton.


4 Data Preparation

We used the open-source system Blender [19] to render the motion of characters driven by the real motion capture data. Figure 1(d) shows a skeleton driven by the motion capture data. In order to generate plausible synthetic video sequences, a 3D character (i.e. avatar – see Figure 1(e)) was created using the anthropometry of each volunteer. The open-source MakeHuman package [20] was chosen for this task. MakeHuman allows human-shaped templates to be formed into detailed 3D characters. Features such as gender, age, muscle percentage, weight, height and body proportions can be adjusted. Confounding and identity-obscuring changes can also be applied to the visual appearance of characters, including clothing and hair. A small sample of avatars, created for six different subjects from our dataset, is presented in Figure 2; differences in height, gender, textures, materials, body proportions and anthropometry are readily apparent.

Figure 2: Front and side views of the avatars created for six different human subjects from our dataset (starting from upper left, going clockwise): subjects 1, 3, 7, 10, 9, 8. Differences in height, gender, textures, materials, body proportions and anthropometry are readily apparent.

With MakeHuman, one can also attach the avatar of a human subject to a rig (i.e. a skeleton, known as an armature), which then makes it possible to deform and move the character around. This is known as the skinning procedure, in which each vertex of the 3D mesh model is associated with a certain bone in the rig. Each vertex is weighted differently, according to a physics engine that takes into account how the surface of the body moves relative to the skeleton.

In order to deform the 3D avatar of a human subject (Figure 1(e)) according to the corresponding real motion capture data for that subject, the motion from the real motion capture data (Figure 1(d)) has to be “transferred” to the rig inside the 3D avatar (Figure 1(f)). This procedure is known as retargeting. We found it necessary to adjust some inconsistencies between the representation of the skeleton obtained from the motion capture system and the representation of the skeleton using the MakeHuman armature. We provide examples of the processing required to do this in the supplementary material. Once done, it was possible to render the avatar corresponding to a given human subject within Blender, and to vary lighting, camera angles and several other parameters. Video sequences generated from avatar movements compared favourably with real video sequences captured from the corresponding human subjects.


The combination of real motion capture, data preparation, avatar construction and scripted rendering allows synthetic frames to be generated with almost arbitrary degrees of variation. Samples of controllable confounding factors are presented in Figure 3, which shows frames from the synthetic video data, the corresponding binary silhouettes, and the GEI features calculated from the corresponding gait cycles. This data was used to create a large dataset across time, confounding factors and individuals. In addition to raw video and motion capture data from volunteers, the dataset contains the deformed 3D mesh representations of all subjects for a walking trial (5 km/h), allowing subject-specific, camera-specific and pose-specific image sequences to be produced.
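As a hypothetical example of such scripted rendering (the object names "Camera" and "Armature" and the output path are our assumptions, not part of the released simulation files), one confounding factor – the subject-camera azimuth – could be swept inside Blender like this:

```python
import math
import bpy  # Blender's Python API; this script runs inside Blender

scene = bpy.context.scene
cam = bpy.data.objects["Camera"]              # assumed object name

# Keep the camera aimed at the rigged avatar with a Track To constraint.
track = cam.constraints.new(type='TRACK_TO')
track.target = bpy.data.objects["Armature"]   # assumed name of the retargeted rig
track.track_axis = 'TRACK_NEGATIVE_Z'
track.up_axis = 'UP_Y'

# Sweep the subject-camera azimuth in 5-degree steps, rendering one still per angle.
radius, height = 5.0, 1.2
for azimuth_deg in range(0, 360, 5):
    a = math.radians(azimuth_deg)
    cam.location = (radius * math.cos(a), radius * math.sin(a), height)
    scene.render.filepath = f"//renders/az_{azimuth_deg:03d}.png"
    bpy.ops.render.render(write_still=True)
```

Lighting, clothing and other scene parameters can be varied in the same scripted manner.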

Figure 3: A small sample of the controllable confounding factors. The top row shows frames from the synthetic video data, the middle row shows the corresponding binary silhouettes, and the bottom row shows the synthetic GEI features extracted for the corresponding gait cycles. It is readily apparent how these confounding factors directly affect the extracted gait features.

5 Experiments

The process of data augmentation through motion capture and avatars is much more complex than that used in other forms of data augmentation in machine vision to date. A key question that must be addressed is: to what extent is identity preserved in applying such a high degree of complexity? In order to evaluate the degree to which information on a subject's identity is passed from the real data onto the synthetic data, we chose to perform experiments using the Gait Energy Image (GEI) features [8]; these were used as gait features in order to train Support Vector Machines (SVMs) to identify the subjects. GEI features are widely used for gait recognition, mainly because they are easy to extract. Due to the averaging procedure they employ [8], they are also relatively immune to errors such as foot sliding and jitter in human movement, as well as to noise present in the binary silhouettes due to suboptimal segmentation, which is very frequent in real-world applications.

The GEI feature requires a binary silhouette to be extracted from each frame in a video sequence, followed by an averaging operation across a complete gait cycle in time. The real binary sequences are calculated by converting all the frames from the real gait sequences into the LAB colour space and then subtracting the background from each frame in the video. The resulting grayscale image is then segmented by clustering the intensity values into two separate classes, corresponding to the foreground and the background pixel intensities. A simple thresholding procedure, according to that clustering, provides the real binary silhouettes. The segmentation of synthetic data is based on simple colour separation, which is facilitated by choosing a uniform colour for the background when rendering the scene.
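A minimal OpenCV sketch of this segmentation step for the real data (the clustering parameters are our choices, for illustration):

```python
import cv2
import numpy as np

def real_silhouette(frame_bgr, background_bgr):
    """Convert to LAB, subtract a static background, then cluster the
    per-pixel difference into two classes (foreground vs background)."""
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    bg = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    diff = np.linalg.norm(lab - bg, axis=2)           # grayscale difference image

    # Two-class k-means on the difference values, then threshold by cluster.
    samples = diff.reshape(-1, 1).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(samples, 2, None, criteria, 5,
                                    cv2.KMEANS_PP_CENTERS)
    fg = int(np.argmax(centers))                      # larger difference = foreground
    return (labels.reshape(diff.shape) == fg).astype(np.uint8)
```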

Once the binary silhouettes are extracted, the video sequences are divided into gait cycles. Inspired by [21], we consider only the lower part of the human body and count the “on” (white) pixels for each frame in a video sequence. Assuming that the number of “on” pixels reaches a maximum value when the legs are at their extreme positions (i.e. full-stride stance), we plot that number for each frame in a video sequence. The result is a periodic signal with half the period of a gait cycle. To follow the correct definition of a gait cycle, one has to consider every second peak. The timings of these peaks define the boundaries of the gait cycles.
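A sketch of this cycle-detection heuristic (the `leg_fraction` and `min_gap` parameters are assumptions):

```python
import numpy as np
from scipy.signal import find_peaks

def gait_cycle_boundaries(silhouettes, leg_fraction=0.5, min_gap=10):
    """Count 'on' pixels in the lower body per frame and keep every second
    peak; silhouettes has shape (frames, height, width)."""
    h = silhouettes.shape[1]
    lower = silhouettes[:, int(h * (1 - leg_fraction)):, :]  # legs region
    on_pixels = lower.sum(axis=(1, 2))                       # per-frame count
    peaks, _ = find_peaks(on_pixels, distance=min_gap)       # full-stride stances
    return peaks[::2]   # every second peak bounds a complete gait cycle
```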

With all the binary silhouettes extracted and resized down to 50×30 pixels, we perform an averaging operation across frames for each gait cycle. This results in a normalised grayscale image, which is, by definition, the GEI image [8]. We apply further data augmentation by left-right flipping, systematically cropping the image features and then resizing back to 50×30 [8]. Figure 4(a) presents examples of real (top row) and synthetic (middle row) binary silhouettes, and the resulting GEI images, for subject 3, as well as their fusion (bottom row). Figure 4(b) shows additional real and synthetic GEI feature images for the subjects presented in Figure 2, along with their fusion.
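The GEI computation and the augmentation described above reduce to a few lines (a sketch; the crop margin is an assumption):

```python
import cv2
import numpy as np

def gait_energy_image(silhouettes):
    """GEI [8]: average the binary silhouettes of one gait cycle, each
    resized to 50x30 pixels (note cv2.resize takes (width, height))."""
    resized = [cv2.resize(s.astype(np.float32), (30, 50)) for s in silhouettes]
    return np.mean(resized, axis=0)

def augment_gei(gei, margin=2):
    """Left-right flip plus a systematic crop-and-resize back to 50x30."""
    h, w = gei.shape
    flipped = gei[:, ::-1]
    cropped = cv2.resize(gei[margin:h - margin, margin:w - margin], (w, h))
    return [flipped, cropped]
```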

Figure 4: (a) Seven key frames of a gait cycle and the resulting GEI image, for subject 3, with the real binary silhouettes on top, the synthetic ones in the middle and the fusion of the two at the bottom (green for real, red for synthetic and yellow for their overlap). (b) GEI features for the subjects of Figure 2 (left to right, subjects 1, 3, 7, 8, 9 and 10), with the real GEIs on top, the synthetic ones in the middle and their fusion at the bottom (same colouring).

Using the synthetic frames and corresponding GEIs, we conducted experiments designed to assess the level to which identity is preserved within the generated synthetic data sets. The first experiment compared binary silhouettes from real subjects with those from their avatars, driven by the real motion capture data. The second experiment compared the principal components of the 1500-dimensional GEIs; this is divided into two parts: a visualisation of the principal components of real and synthetic frames, and an exploration of training and testing recognition with both datasets separately and together (real/synthetic data augmentation).
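A condensed scikit-learn sketch of the recognition part of the second experiment (the linear kernel and component count are our assumptions; `X_*` hold flattened 1500-dimensional GEIs and `y_*` the subject labels):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_and_score(X_train, y_train, X_test, y_test, n_components=30):
    """Project GEIs onto principal components, train an SVM, report accuracy."""
    model = make_pipeline(PCA(n_components=n_components), SVC(kernel="linear"))
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)   # identification accuracy
```

Varying which of `X_train`/`X_test` hold real or synthetic GEIs reproduces the same-mode, cross-mode and mixed-mode conditions reported in Figure 7(b).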


6 Results & Discussion

In order to assess the degree of agreement between the real and synthetic binary silhouettes, we calculated the Jaccard index between them, for each frame, for each individual, with frame alignment. The Jaccard index is defined as the size of the intersection divided by the size of the union of the two regions (see Equation 1), and approaches 1 for perfect alignment.

J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \qquad (1)
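Equation 1 amounts to a one-line NumPy computation per aligned silhouette pair:

```python
import numpy as np

def jaccard_index(a, b):
    """Jaccard index of Equation 1 for two aligned binary silhouettes."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0
```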

We also calculated the similarity between real binary silhouettes of the same subject, across different gait cycles, which serves as a baseline (see Figure 5). This can be seen as the realistic upper bound for the Jaccard index when it is used as a metric to assess the consistency of silhouette alignment across cycles. The calculated similarity using the Jaccard index over the entire walking trial (at 5 km/h) for subjects 1–14 is presented in Figure 6. With Figure 5 as a baseline, it shows that the synthetically generated data aligns well with the real sequences.

[Bar plot: Jaccard index (0–1) per subject id (1–14).]

Figure 5: Similarity between real binary silhouettes, across different gait cycles of the same subject, for a walking trial, measured using the Jaccard index between the corresponding binary silhouettes (a baseline for the similarity between real and synthetic binary silhouettes).

[Bar plot: Jaccard index (0–1) per subject id (1–14).]

Figure 6: Similarity between real and synthetic binary silhouettes for a walking trial of subjects 1–14, measured using the Jaccard index between the corresponding binary silhouettes.


There are several reasons why real and synthetic binary silhouettes do not fully match. Hairstyle, for example, can play a significant role in the appearance of a subject; when not properly modelled, it may affect the match to a great extent. Also, the Jaccard index is neither rotation-invariant (a possible roll-angle offset on the video camera capturing the real data may cause such misalignment) nor translation-invariant, which means that small misalignments – see Figure 4(a) – decrease the similarity values dramatically. Furthermore, since we compare silhouettes from single frames with the Jaccard index, segmentation noise in the real image sequences – see Figure 4(a) (top row) – introduces common imperfections which are not present in the generated synthetic data. This decreases the average Jaccard index values reported.

However, as seen in Figure 4, qualitatively, the synthetic binary silhouettes keep a subject's individual locomotion style. This is also readily apparent, quantitatively, in Figure 7(b), where the accuracy for the experiments in which training and testing data were solely synthetic is relatively high. This means that discriminating characteristics of a subject's identity are preserved and passed on to the generated synthetic data, making it suitable for data augmentation.

[Line plot: accuracy (%) versus number of principal components (5–40); legend: R – R, S – S, S – R, R – S, 70%R+S – 30%R, 30%R+S – 70%R.]

Figure 7: (a) The first nine principal components calculated on the GEI features, using real data (left) and synthetic data (right); it is readily apparent that the feature space in the two cases is similar but not identical. (b) Results from our experiments, with different training and testing sets, using combinations of real and/or synthetic data. We present the accuracies achieved, with the GEI features projected onto different numbers of principal components.

Results from our experiments – presented in Figure 7(b) – suggest that information about the identity of the subjects is retained within the synthetically generated data. Figure 7(a) shows the first nine principal components calculated on the GEI features, using real data (left) and synthetic data (right), whereas Figure 7(b) shows the results of our experiments. Under same-mode conditions (either training and testing with solely real data, R – R, or training and testing with solely synthetic data, S – S) we achieve the best results. It is also readily apparent that, given a sufficient number of principal components, using a limited amount of real data augmented with the synthetically generated data (cases 70%R+S – 30%R and 30%R+S – 70%R) results in identification of the subjects with an accuracy of more than 95%. However, in the cross-mode conditions (training with real data and testing with synthetic data, R – S, or the other way around, S – R) we obtain the lowest accuracies. This is probably because the features learnt solely from real data and solely from synthetically generated data are similar, but not necessarily identical (see Figure 7(a)); the feature space learnt is therefore not actually shared, but rather shifted between the two cases.


The use of a treadmill allowed us to limit the area of motion capture, while being able to capture long sequences of the order of several minutes. Whilst possibly altering natural gait, the use of a treadmill is supported by a study of the biomechanics of human motion [22], which compared overground with treadmill walking in healthy individuals. The differences between overground and treadmill walking were found to be surprisingly small.

Finally, because anthropometric data was introduced into the avatars by hand, details of body appearance have not been fully incorporated in the 3D mesh models. Since subsequent steps, including silhouette calculations, depend on these meshes, there may be loss of detail introduced by this manual process. This loss of detail may also result from the kinematics of the gait not being perfectly transferred to the avatar, due to the physics deformation of the avatar differing from that of the real subject.

7 Conclusion and Future Work

We suggested a method to generate synthetic video data for data augmentation of gait sequences. The process allows sequences to be generated with multiple confounding factors, and allows exploratory work into training machine/deep learning algorithms for fully-invariant gait recognition, with a far greater amount of synthetic training and test data than would otherwise be possible.

A multi-modal dataset produced during this study is based on real motion capture data, with a total of more than 6.5 million frames. Augmenting the dataset with synthetic data introduces further confounding factors, allowing even larger amounts of data to be synthesized. Even though the amount of synthetically generated data is limited by the amount of real mocap data collected, the proposed methodology attempts to maximise that amount by modelling as many of the confounding factors stated in Section 1 as possible. For example, introducing different subject-camera relative viewing angles, with a simple sampling of the angular space in steps of 5°, in both longitudinal and latitudinal directions, results in 703 different pairs of azimuth and elevation angles. Augmenting the data by introducing a single confounding factor like subject-camera relative viewing angle thus increases the number of frames to more than 4.5 billion.
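One sampling consistent with the quoted figures (the exact angular ranges are our assumption) is azimuth 0°–90° crossed with elevation −90°–90°, both in 5° steps:

```python
azimuths = range(0, 91, 5)       # 19 azimuth angles
elevations = range(-90, 91, 5)   # 37 elevation angles
pairs = len(azimuths) * len(elevations)   # 19 * 37 = 703 viewpoint pairs
frames = 6_500_000 * pairs                # 4,569,500,000 > 4.5 billion frames
```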

As seen qualitatively in Figure 4 and quantitatively in Figure 7(b), the synthetic video data contains identity information that discriminates subjects. One could argue that using avatars driven by motion capture to generate subject-specific image data is a poor substitute for using real image data for training systems in identity recognition. However, there are still open questions about the best way to attain invariant recognition of human gait, given the very large number of confounding factors that affect the performance of recognition. By generating a large set of data that – at least visually – replicates some of the variability in the appearance of subjects, it is possible to explore features and techniques that are more robust than existing gait recognition methods.

Future work includes: a) expanding the span of confounding factors in the datasets (age, gender, body shape, physical fitness); b) improving confounding factors, such as camera models and garment deformations, to close the gap between synthetic and real data; and c) continuing the collection of motion capture data from additional subjects to increase the population of our dataset. The dataset, along with the simulation files, is publicly available at http://www.bicv.org/datasets/.


References

[1] M. P. Murray. Gait as a total pattern of movement. American Journal of Physical Medicine, 46(1):290–333, 1967.

[2] L. Bianchi, D. Angelini, F. Lacquaniti. Individual characteristics of human walking mechanics. Pflügers Archiv: European Journal of Physiology, 436(3):343–356, 1998.

[3] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pages 1097–1105, 2012.

[4] J. Shotton et al. Real-time human pose recognition in parts from single depth images. Computer Vision and Pattern Recognition 2011, pages 1297–1304, 2011.

[5] R. Sandwell et al. Synthesising unseen image conditions to enhance classification accuracy for sparse datasets: applied to chimpanzee face recognition. British Machine Vision Workshop, page 1, 2013.

[6] J. Zhang et al. Arbitrary view action recognition via transfer dictionary learning on synthetic training data. IEEE International Conference on Robotics and Automation, pages 1678–1684, 2016.

[7] A. F. Bobick, J. W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):257–267, 2001.

[8] J. Han, B. Bhanu. Statistical feature fusion for gait-based human recognition. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, 2004.

[9] S. Yu, L. Wang, T. Tan, W. Hu. Gait analysis for human identification in frequency domain. Third International Conference on Image and Graphics, pages 282–285, 2004.

[10] T. Lam. A new representation for human gait recognition: Motion Silhouettes Image (MSI). Advances in Biometrics, pages 612–618, 2005.

[11] J. Liu, N. Zheng. Gait history image: A novel temporal template for gait recognition. IEEE International Conference on Multimedia and Expo, pages 663–666, 2007.

[12] K. Bashir, T. Xiang, S. Gong. Gait recognition using gait entropy image. 3rd International Conference on Imaging for Crime Detection and Prevention (ICDP), pages 1–6, 2009.

[13] Y. Makihara, R. Sagawa, Y. Yagi, T. Echigo, Y. Mukaigawa. Which reference view is effective for gait identification using a view transformation model? Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, page 45, 2006.

[14] A. Nandy, A. Pathak, P. Chakraborty. A study on gait entropy image analysis for clothing invariant human identification. Multimedia Tools and Applications, pages 1–35, 2016.


[15] M. Hofmann, S. Sural, G. Rigoll. Gait recognition in the presence of occlusion: A new dataset and baseline algorithms. Proceedings of the International Conference on Computer Graphics, Visualization and Computer Vision, pages 99–104, 2011.

[16] Y. Guan, C. Li. A robust speed-invariant gait recognition system for walker and runner identification. International Conference on Biometrics (ICB), pages 1–8, 2013.

[17] K. Wang. Image classification with pyramid representation and rotated data augmentation on Torch 7. 2015.

[18] Nexus. http://www.vicon.com/products/software/nexus. Accessed 11/05/2016.

[19] Blender. https://www.blender.org/. Accessed 04/05/2016.

[20] MakeHuman. http://www.makehuman.org/. Accessed 04/05/2016.

[21] S. Sarkar, P. J. Phillips, Z. Liu, I. R. Vega, P. Grother, K. W. Bowyer. The HumanID gait challenge problem: data sets, performance, and analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(2):162–177, 2005.

[22] S. Lee, J. Hidler. Biomechanics of overground vs. treadmill walking in healthy individuals. Journal of Applied Physiology, 104(3):747–755, 2008.