
Multi-modal Gesture Recognition Challenge 2013: Dataset and Results

Sergio Escalera
Dept. Applied Mathematics, Universitat de Barcelona
Computer Vision Center, UAB
[email protected]

Jordi Gonzàlez
Dept. Computer Science, UAB, Barcelona
Computer Vision Center, UAB
[email protected]

Xavier Baró
EIMT at the Open University of Catalonia, Barcelona
Computer Vision Center, UAB
[email protected]

Miguel Reyes
Dept. Applied Mathematics, Universitat de Barcelona
Computer Vision Center, UAB
[email protected]

Oscar Lopes
Computer Vision Center, Campus UAB, Barcelona
[email protected]

Isabelle Guyon
ChaLearn, Berkeley, California
[email protected]

Vassilis Athitsos
University of Texas
[email protected]

Hugo J. Escalante
INAOE, Puebla, Mexico
[email protected]

ABSTRACT

The recognition of continuous natural gestures is a complex and challenging problem due to the multi-modal nature of the visual cues involved (e.g. finger and lip movements, subtle facial expressions, body pose), as well as technical limitations such as limited spatial and temporal resolution and unreliable depth cues. To promote research in this field, we organized a challenge on multi-modal gesture recognition. We made available a large video database of 13,858 gestures from a lexicon of 20 Italian gesture categories recorded with a Kinect™ camera, providing audio, skeletal model, user mask, RGB, and depth images. The focus of the challenge was on user-independent multiple gesture learning. There are no resting positions, and the gestures are performed in continuous sequences lasting 1-2 minutes, each containing between 8 and 20 gesture instances. As a result, the dataset contains around 1,720,800 frames. In addition to the 20 main gesture categories, ‘distracter’ gestures are included, meaning that additional audio and gestures outside the vocabulary appear in the sequences. The final evaluation of the challenge was defined in terms of the Levenshtein edit distance, where the goal was to indicate the real order of gestures within the sequence. 54 international teams participated in the challenge, and outstanding results were obtained by the first ranked participants.

1. INTRODUCTION

Predictive modelling competitions, or challenges, have been fostering progress in Computer Vision in recent years. One of the most important challenges so far has undoubtedly been the PASCAL Visual Object Classes Challenge¹, organized by Everingham et al. [1], which has contributed to pushing the state of the art in image classification, detection, and segmentation.

These initial efforts, which concentrated on image understanding, are now moving towards the analysis of video data, where the recognition of human activities has received much attention. Improving the automatic recognition of human actions in visual data will enable novel applications in surveillance and security, a new generation of eHealth applications such as assisted living, natural interfaces for Human-Computer Interaction, and improved control in leisure scenarios.

As a result, several contests have been organized on activity recognition with applications in video surveillance, for example the VIRAT Action Recognition Challenge², the 3D Human Reconstruction and Action Recognition Grand Challenge³, the Human Activities Recognition and Localization Competition⁴, and the Contest on Semantic Description of Human Activities⁵.

Following this trend, we organized a challenge called the ‘Multi-modal Gesture Recognition Challenge’⁶, focusing on recognizing multiple gestures from a novel data set of videos recorded with a Microsoft Kinect™ camera. The Kinect™ has revolutionized computer vision in recent years by providing an affordable 3D camera. Its applications, initially driven by the game industry [5], have rapidly diversified and include video surveillance, computer interfaces, robot vision and control, and education [3].

¹ http://pascallin.ecs.soton.ac.uk/challenges/VOC
² http://www.umiacs.umd.edu/conferences/cvpr2011/ARC
³ http://mmv.eecs.qmul.ac.uk/mmgc2013/
⁴ http://liris.cnrs.fr/harl2012/
⁵ http://cvrc.ece.utexas.edu/SDHA2010/
⁶ http://gesture.chalearn.org

Our previous one-shot learning challenge [2] was devoted to learning a gesture category from a single example drawn from a limited vocabulary, using RGB and depth data. Our novel data set offers several labeled instances of each gesture from a dictionary of 20 Italian sign gesture categories. Several features make this new competition extremely challenging, including the recording of continuous sequences, the presence of distracter gestures (not included in the dictionary), the relatively large number of categories, the length of the gesture sequences, and the variety of users. To attack such a difficult problem, several modalities are provided in the data set, including audio, RGB, depth maps, user masks, and the user skeletal model. The data set and the results obtained in the Multi-modal Gesture Recognition Challenge are presented in the following sections.

2. PROBLEM SETTING AND DATA

The focus of the challenge is on multiple-instance, user-independent learning of gestures from multi-modal data, which means learning to recognize gestures from several instances of each category performed by different users, drawn from a vocabulary of 20 gesture categories. A gesture vocabulary is a set of unique gestures, generally related to a particular task. In this challenge we focus on the recognition of a vocabulary of 20 Italian cultural/anthropological signs; see Figure 1 for one example of each Italian gesture.

In all the sequences, a single user is recorded in front of a Kinect™, performing natural communicative gestures and speaking in fluent Italian. The main characteristics of the gesture dataset are:

• 13,858 gesture samples recorded with the Kinect™ camera, including audio, skeletal model, user mask, RGB, and depth images.
• RGB video stream at 8-bit VGA resolution (640×480) with a Bayer color filter, and depth sensing video stream at VGA resolution (640×480) with 11 bits. Both are acquired at 20 fps on average.
• Audio data captured using the Kinect™ multi-array microphone.
• A total of 27 users appear in the data set.
• The data set contains the following numbers of sequences: development: 393 (7,754 gestures), validation: 287 (3,362 gestures), and test: 276 (2,742 gestures). Each sequence lasts between 1 and 2 minutes and contains between 8 and 20 gesture samples, around 1,800 frames. The total number of frames in the data set is 1,720,800.
• All the gesture samples belonging to the 20 main gesture categories from the Italian gesture dictionary are annotated at the frame level with the gesture label.
• 81% of the participants were Italian native speakers, while the remaining 19% of the users were not Italian, but Italian speakers.
• All the audio that appears in the data relates to the Italian dictionary. In addition, sequences may contain distracter words and gestures, which are not annotated since they do not belong to the main dictionary of 20 gestures.

This dataset, available at http://sunai.uoc.edu/chalearn, presents various features of interest, as listed in Table 1.

Table 1: Easy and challenging aspects of the data.

Easy:
• Fixed camera
• Near frontal view acquisition
• Within a sequence, the same user
• Gestures performed mostly by arms and hands
• Camera framing upper body
• Several available modalities: audio, skeletal model, user mask, depth, and RGB
• Several instances of each gesture for training
• Single person present in the visual field

Challenging:
• Within each sequence:
  – Continuous gestures without a resting pose
  – Many gesture instances are present
  – Distracter gestures out of the vocabulary may be present, in terms of both gesture and audio
• Between sequences:
  – High inter- and intra-class variability of gestures, in terms of both gesture and audio
  – Variations in background, clothing, skin color, lighting, temperature, resolution
  – Some parts of the body may be occluded
  – Different Italian dialects

Figure 2: Skeleton joint positions.⁷

2.1 Data format and structure

We provide the X_audio.ogg, X_color.mp4, X_depth.mp4, and X_user.mp4 files containing the audio, RGB, depth, and user mask videos for a sequence X, respectively; see Figure 3. We also provide a script to export the data to Matlab, which yields the following Matlab structures (a loading sketch is given after the structure listing):

• NumFrames: Total number of frames.
• FrameRate: Frame rate of the video in fps.
• Audio: Structure that contains the WAV audio data.
  – y: Audio data.
  – fs: Sample rate of the data.
• Labels: Structure that contains the label data for the sequence, sorted in order of appearance. The labels correspond to the 20 gesture categories shown in Figure 1.
  – Name: The name given to this gesture, one of:

    1. vattene          11. ok
    2. vieniqui         12. cosatifarei
    3. perfetto         13. basta
    4. furbo            14. prendere
    5. cheduepalle      15. noncenepiu
    6. chevuoi          16. fame
    7. daccordo         17. tantotempo
    8. seipazzo         18. buonissimo
    9. combinato        19. messidaccordo
    10. freganiente     20. sonostufo

⁷ Image from http://msdn.microsoft.com/en-us/library/microsoft.kinect.jointtype.aspx


Figure 1: Data set gesture categories: (1) Vattene, (2) Vieni qui, (3) Perfetto, (4) È un furbo, (5) Che due palle, (6) Che vuoi, (7) Vanno d'accordo, (8) Sei pazzo, (9) Cos'hai combinato, (10) Non me ne frega niente, (11) Ok, (12) Cosa ti farei, (13) Basta, (14) Le vuoi prendere, (15) Non ce n'è più, (16) Ho fame, (17) Tanto tempo fa, (18) Buonissimo, (19) Si sono messi d'accordo, (20) Sono stufo.

• RGB: This matrix contains the RGB color image, at 8-bit VGA resolution (640×480) with a Bayer color filter.

• Depth: This matrix contains the pixel-wise z component, at VGA resolution (640×480), represented with 11 bits. The depth value is expressed in millimeters.

• UserIndex: The user index matrix contains the player index of each depth pixel. A non-zero pixel value means that a tracked subject occupies the pixel, and a value of 0 means that no tracked subject occupies the pixel.

• Skeleton: An array of Skeleton structures, contained within a Skeletons array. Each structure contains the joint positions and bone orientations comprising a skeleton. The format of a Skeleton structure is:

  – JointType: The skeleton joints that make up a tracked skeleton. Figure 2 visualizes these joint types.

    1. HipCenter        9. HipLeft
    2. Spine            10. KneeLeft
    3. ShoulderCenter   11. AnkleLeft
    4. Head             12. FootLeft
    5. ShoulderLeft     13. HipRight
    6. ElbowLeft        14. KneeRight
    7. WristLeft        15. AnkleRight
    8. HandRight        16. FootRight

  – JointPosition: Contains the joint positions in the following three coordinate systems:

    ∗ WorldPosition: The world coordinates position structure represents the global position of a tracked joint. The format is X, Y, Z, which represent the x, y, and z components of the subject's global position (in millimeters).

    ∗ PixelPosition: The pixel coordinates position structure represents the position of a tracked joint over the RGB map. The format is X, Y, which represent the x and y components of the joint location (in pixel coordinates).

    ∗ WorldRotation: The world rotation structure contains the orientations of the skeletal bones as absolute transformations, stored in a 20×4 matrix where each row contains the W, X, Y, Z values of the quaternion describing the rotation. It provides the orientation of a bone in the 3D camera space. The orientation of a bone is relative to its child joint, and the HipCenter joint contains the orientation of the player/subject.
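As an illustration of how the exported per-sequence structure described above might be accessed, the Python sketch below uses SciPy; the file name, the top-level variable name Video, and the exact field nesting are assumptions based on the listing rather than the confirmed output of the provided export script.

```python
# Illustrative only: reading one exported sequence with SciPy, assuming the
# export script stores a Matlab struct named 'Video' with the fields described
# above (NumFrames, FrameRate, Audio.y/fs, Labels with a Name field, ...).
import numpy as np
import scipy.io as sio

def load_sequence(path):
    """Load one exported sequence and print basic information (sketch)."""
    mat = sio.loadmat(path, squeeze_me=True, struct_as_record=False)
    video = mat['Video']  # assumed top-level variable name

    print('frames:', video.NumFrames, 'fps:', video.FrameRate)
    print('audio sample rate:', video.Audio.fs)

    # Labels are sorted in order of appearance; Name is one of the 20
    # Italian gesture identifiers (vattene, vieniqui, ...).
    for label in np.atleast_1d(video.Labels):
        print(label.Name)
    return video

# Example usage with a hypothetical file name:
# seq = load_sequence('Sample00001_data.mat')
```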

3. PROTOCOL AND EVALUATION

The timeline of the competition was as follows:

• April 30th, 2013: Beginning of the challenge competition; release of the first data examples.
• May 20th, 2013: Full release of training and validation data, with ground truth labels for the training data.
• August 1st, 2013: Encrypted final evaluation data and ground truth labels for the validation data are made available.
• August 15th, 2013: End of the challenge competition; deadline for code submission. The organizers start code verification by running it on the final evaluation data and obtaining the team scores.
• August 25th, 2013: Deadline for fact sheets.
• September 1st, 2013: Release of the verification results to the participants for review.

The challenge consisted of two main components: a development phase (April 30th to Aug 1st) and a final evaluation phase (Aug 2nd to Aug 15th). The submission and evaluation of the challenge entries was via the Kaggle platform⁸. The official participation rules were provided on the website of the challenge. In addition, publicity and news on the ChaLearn Multi-modal Gesture Recognition Challenge were published on well-known online platforms, such as LinkedIn, Facebook, Google Groups, and the ChaLearn website.

During the development phase, the participants were asked to build a system capable of learning a vocabulary of 20 Italian sign gesture categories from several gesture samples. To that end, the teams received the development data to train and self-evaluate their systems. In order to monitor their progress, they could use the validation data, for which the labels were not provided. The prediction results on validation data could be submitted online to get immediate feedback. A real-time leaderboard showed the participants their current standing based on their validation set predictions.

During the final phase, the labels for the validation data were published, and the participants performed tasks similar to those of the previous phase, using both the validation and training data sets to train their systems with more gesture instances. The participants had only a few days to train their systems and upload them. The organizers used the final evaluation data to generate the predictions and obtain the final score and rank for each team. At the end, the final evaluation data was revealed, and the authors submitted their own predictions and fact sheets to the platform.

⁸ https://www.kaggle.com/c/multi-modal-gesture-recognition

4. EVALUATION METRIC

For each unlabeled video, the participants were instructed to provide an ordered list of labels R corresponding to the recognized gestures. We compared this list with the truth labels T, i.e. the prescribed list of gestures that the user had to play during data collection. We computed the Levenshtein distance L(R, T), that is, the minimum number of edit operations (substitution, insertion, or deletion) that one has to perform to go from R to T (or vice versa). The Levenshtein distance is also known as the ‘edit distance’. For example: L([1, 2, 4], [3, 2]) = 2, L([1], [2]) = 1, L([2, 2, 2], [2]) = 2. The overall score is the sum of the Levenshtein distances for all the lines of the result file compared to the corresponding lines in the truth value file, divided by the total number of gestures in the truth value file. This score is analogous to an error rate. For simplicity, in what follows, we call it Error Rate, although it can exceed 1.0. A public score appeared on the leaderboard during the development period and was based on the validation data. Subsequently, a private score for each team was computed on the final evaluation data released at the end of the development period, which was not revealed until the challenge was over. The private score was used to rank the participants and determine the prizes.
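As a concrete illustration, the following Python sketch computes this metric; the function names are ours and this is a standard dynamic-programming edit distance, not the organizers' official scoring code.

```python
# Sketch of the challenge metric: Levenshtein distance between predicted and
# ground-truth gesture label lists, summed over sequences and divided by the
# total number of ground-truth gestures ("Error Rate", may exceed 1.0).

def levenshtein(r, t):
    """Minimum number of substitutions, insertions, and deletions to turn r into t."""
    prev = list(range(len(t) + 1))
    for i, ri in enumerate(r, start=1):
        curr = [i]
        for j, tj in enumerate(t, start=1):
            cost = 0 if ri == tj else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

def error_rate(predicted, truth):
    """predicted, truth: lists of label sequences, one per test video."""
    total_edits = sum(levenshtein(r, t) for r, t in zip(predicted, truth))
    total_gestures = sum(len(t) for t in truth)
    return total_edits / total_gestures

# The examples given in the text:
assert levenshtein([1, 2, 4], [3, 2]) == 2
assert levenshtein([1], [2]) == 1
assert levenshtein([2, 2, 2], [2]) == 2
```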

5. RESULTS

The challenge attracted a high level of participation, with a total of 54 teams and nearly 300 entries. This is a good level of participation for a computer vision challenge requiring very specialized skills. In the end, 17 teams successfully submitted their predictions on the final test set, while also providing their code for verification and summarizing their method by means of a fact sheet questionnaire.

After verifying the codes and results of the participants, the final scores of the top ranked participants on both the validation and test sets were made public. These results are shown in Table 2, where the first row corresponds to the winner on the final test set. In the end, the best error rate on the test data set was around 12%.

5.1 Statistics on the results

Figure 5 shows the correlation between the validation and test error scores obtained by the top ranked participants of the challenge. One can see that most of them obtain similar results on both sets. However, there are a few outliers whose validation and test scores are not correlated. Most of the participants that achieved top positions on the test scores also achieved high recognition rates on the validation set. However, some participants that achieved low error on validation considerably increased their error on the test set. This is most likely related to overfitting of method parameters on the validation data, which then failed to generalize to the variability of the new users present in the test set.


Figure 3: Different data modalities of the provided data set: RGB, depth, user mask, and skeletal model.

Table 2: Top rank results on validation and test sets.

TEAM          Validation score   Test score
IVA MM        0.20137            0.12756
WWEIGHT       0.46163            0.15387
ET            0.33611            0.16813
MmM           0.25996            0.17215
PPTK          0.15199            0.17325
LRS           0.18114            0.17727
MMDL          0.43992            0.24452
TELEPOINTS    0.48543            0.25841
CSI MM        0.32124            0.28911
SUMO          0.49137            0.31652
GURU          0.51844            0.37281
AURINKO       0.31529            0.63304
STEVENWUDI    1.43427            0.74415
JACKSPARROW   0.86050            0.79313
JOEWAN        0.13653            0.83772
MILAN KOVAC   0.87835            0.87463
IAMKHADER     0.93397            0.92069

On the other hand, the final scores on the test set are in general lower than those on the validation set. This is likely because the participants trained on more data for the final evaluation and, if properly trained, this small final improvement is expected. The best public score on the validation set achieved during the challenge period, taking into account all team submissions over time, is summarized in Figure 4.

Figure 5: Correlation between the validation and test results of the top ranked participants.

In Figure 6, we show the histograms of validation and test scores based on the best score achieved by each team on each of the two sets. One can see that the scores on the validation set are more spread out, and that the teams that finally submitted their predictions on the test set, except for two cases, achieved error rates below 1 in Levenshtein score.

5.2 Fact sheets

We asked the participants to fill out a survey about the methods employed. The 17 top ranked test participants filled out this survey. We briefly summarize the results next.

From the questions in the survey, the most relevant aspects for the challenge were: the modalities considered by the methods, the temporal segmentation methodology applied, the fusion strategy, and the recognition techniques used. We additionally comment on the programming language used by the participants. The information about modalities, segmentation strategy, fusion, classifier, and programming language is summarized in Figures 7, 8, 9, 10, and 11, respectively. The details of each particular team's strategy are shown in Table 3.

Figure 6: Validation and test scores histograms.

Figure 7: Modalities considered.

Looking at the considered modalities in Figure 7, one can see that none of the teams considered only audio information, and most of them used multiple modalities to describe the data. In particular, the skeleton was the most commonly used feature when only a single modality was considered. In general, combining audio and skeleton information was the predominant strategy among the participants, and the one considered by the first three ranked teams on the test set.

The considered temporal segmentation strategies are shown in Figure 8.


Figure 4: Best public score obtained in the validation set during the Challenge.

Figure 8: Segmentation strategy.

Figure 9: Fusion strategy.

One can see that two strategies were mainly applied: those based on features and those based on temporal windows or classifiers (such as Dynamic Time Warping). In the first case, audio information and joint positions were the most commonly used cues to split the continuous sequence into gesture candidates. In the second case, the sliding-window technique was the preferred choice of the participants.

Regarding the fusion strategy (Figure 9), several authors did not apply any, either because only one cue was used in their approach or because different cues were used independently in different stages, such as one for temporal segmentation and another for describing the segmented candidate gestures and for the final classification. Among the participants that fused different modalities (the majority of the teams), few combined feature vectors in an early fusion fashion before training a classifier. The preferred strategy was to train classifiers on different feature sets from different modalities and to fuse the weighted outputs of the classifiers in a late fusion fashion.

Regarding the classifiers (Figure 10), it is interesting to see the broad variety of strategies, covering most of the state-of-the-art Machine Learning approaches. Both discriminative and generative classifiers were considered, in all cases applying supervised learning. The preferred strategy was Hidden Markov Models. On the other hand, dynamic programming, and in particular Dynamic Time Warping, which is often one of the preferred methods for gesture recognition, was not widely applied in our challenge (fourth choice). Random Forest and Neural Network variants were the second and third choices of the participants, respectively.

Figure 10: Learning strategy.

Figure 11: Programming language.

Finally, Figure 11 summarizes the programming languages: mainly Matlab/Octave and Python were used, and in a few cases the code was sped up by means of GPU programming.

Table 3 details, for each of the top ranked teams of the challenge, the particular strategy behind these statistics. Interestingly, the three top ranked participants agree in the modalities and segmentation strategy considered, although they differ in the final classifier applied.

5.3 Summary of the winner methods

Table 3 summarizes the strategies considered by each of the top ranked participants on the test set. Next, we describe in more detail the approaches designed by the three winners of the challenge.

The first ranked team on the test set, IVA MM, used a feature vector based on audio and skeletal information and applied late fusion to obtain the final recognition results. A simple time-domain end-point detection algorithm based on joint coordinates is applied to segment continuous data sequences into candidate gesture intervals. A Gaussian Hidden Markov Model is trained with 39-dimensional MFCC features and generates confidence scores for each gesture category. A Dynamic Time Warping based skeletal feature classifier is applied to provide complementary information.


Table 3: Team methods and results. Early and late refer to early and late fusion of features/classifier outputs. HMM: Hidden Markov Models. KNN: Nearest Neighbor. RF: Random Forest. Tree: Decision Trees. ADA: Adaboost variants. SVM: Support Vector Machines. Fisher: Fisher Linear Discriminant Analysis. GMM: Gaussian Mixture Models. NN: Neural Networks. DGM: Deep Boltzmann Machines. LR: Logistic Regression. DP: Dynamic Programming. ELM: Extreme Learning Machines.

TEAM          Test score   Rank   Modalities               Segmentation      Fusion   Classifier
IVA MM        0.12756      1      Audio, Skeleton          Audio             None     HMM, DP, KNN
WWEIGHT       0.15387      2      Audio, Skeleton          Audio             Late     RF, KNN
ET            0.16813      3      Audio, Skeleton          Audio             Late     Tree, RF, ADA
MmM           0.17215      4      Audio, RGB+Depth         Audio             Late     SVM, Fisher, GMM, KNN
PPTK          0.17325      5      Skeleton, RGB, Depth     Sliding windows   Late     GMM, HMM
LRS           0.17727      6      Audio, Skeleton, Depth   Sliding windows   Early    NN
MMDL          0.24452      7      Audio, Skeleton          Sliding windows   Late     DGM+LR
TELEPOINTS    0.25841      8      Audio, Skeleton, RGB     Audio, Skeleton   Late     HMM, SVM
CSI MM        0.28911      9      Audio, Skeleton          Audio             Early    HMM
SUMO          0.31652      10     Skeleton                 Sliding windows   None     RF
GURU          0.37281      11     Audio, Skeleton, Depth   DP                Late     DP, RF, HMM
AURINKO       0.63304      12     Skeleton, RGB            Skeleton          Late     ELM
STEVENWUDI    0.74415      13     Audio, Skeleton          Sliding windows   Early    DNN, HMM
JACKSPARROW   0.79313      14     Skeleton                 Sliding windows   None     NN
JOEWAN        0.83772      15     Skeleton                 Sliding windows   None     KNN
MILAN KOVAC   0.87463      16     Skeleton                 Sliding windows   None     NN
IAMKHADER     0.92069      17     Depth                    Sliding windows   None     RF

Figure 12: ExtraTreesClassifier Feature Importance.

The confidence scores generated by the two classifiers are first normalized and then combined to produce a weighted sum. A single-threshold approach is employed to separate meaningful gesture intervals from meaningless intervals caused by false detections of speech.
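A minimal sketch of this kind of late fusion follows; the normalization, weight, and threshold values are illustrative assumptions, not the team's actual parameters, and the two classifiers are assumed to output one confidence score per gesture category for each candidate interval.

```python
# Sketch of the late-fusion step described above, assuming per-interval
# confidence vectors over the 20 gesture classes from an audio HMM and a
# skeletal DTW classifier. Weight and threshold are illustrative choices.
import numpy as np

def fuse_scores(hmm_scores, dtw_scores, audio_weight=0.5, threshold=0.1):
    """Return the predicted gesture index for one candidate interval, or None
    if the fused confidence falls below the rejection threshold."""
    # Normalize each classifier's scores so they are comparable (sum to 1).
    hmm = np.asarray(hmm_scores, dtype=float)
    dtw = np.asarray(dtw_scores, dtype=float)
    hmm = hmm / hmm.sum()
    dtw = dtw / dtw.sum()

    fused = audio_weight * hmm + (1.0 - audio_weight) * dtw  # weighted sum
    best = int(np.argmax(fused))
    # Reject intervals whose best fused confidence is too low
    # (e.g. false detections of speech or distracter gestures).
    return best if fused[best] >= threshold else None
```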

The second ranked team, WWEIGHT, combined audio and skeletal information, using both the joint spatial distribution and the joint orientations. The method first searches for regions of time with high audio energy to define 1.8-second-long windows of time that potentially contain a gesture. This had the effect that the development, validation, and test data were treated uniformly. Feature vectors are then defined using a log-spaced audio spectrogram and the joint positions and orientations above the hips. At each time sample the method subtracts the average 3D position of the left and right shoulders from each 3D joint position. Data is down-sampled onto a 5 Hz grid over the 1.8 seconds. There were 1593 features in total (9 time samples × 177 features per time sample). Since some of the detected windows can contain distracter gestures, an extra 21st label is introduced, defining the ‘not in the dictionary’ gesture category. Python's scikit-learn was used to train two models: an ensemble of randomized decision trees (ExtraTreesClassifier, 100 trees, 40% of the features) and a K-Nearest Neighbor model (7 neighbors, L1 distance). The posteriors from these models are averaged with equal weight. Finally, a heuristic is used (12 gestures maximum, no repeats) to convert the posteriors into a prediction for the sequence of gestures.
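A minimal sketch of this two-model ensemble in scikit-learn is given below, using the settings reported above (100 extremely randomized trees on 40% of the features; KNN with 7 neighbors and L1 distance; posteriors averaged with equal weight). The window feature matrices and the final 12-gesture/no-repeat heuristic are assumed to be built elsewhere; this is not the team's actual code.

```python
# Sketch of a WWEIGHT-style classifier ensemble: ExtraTrees + KNN trained on
# the same 1593-dimensional window descriptors, with label 21 standing for
# the 'not in the dictionary' class.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier

def train_ensemble(X_train, y_train):
    trees = ExtraTreesClassifier(n_estimators=100, max_features=0.4)
    knn = KNeighborsClassifier(n_neighbors=7, metric='manhattan')  # L1 distance
    trees.fit(X_train, y_train)
    knn.fit(X_train, y_train)
    return trees, knn

def predict_window_posteriors(models, X_windows):
    """Equal-weight average of the two models' class posteriors per window."""
    trees, knn = models
    return 0.5 * trees.predict_proba(X_windows) + 0.5 * knn.predict_proba(X_windows)
```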

Figure 12 shows the mean feature importance over the 1.8-second window for the three sets of features: joint coordinates, joint orientations, and audio spectrogram.

Figure 13: Recognition of the test sequences by the three challenge winners. A black bin means that the complete list of ordered gestures was successfully recognized.

One can note that features from all three sets are selected as discriminative by the classifier, although the skeletal features are the most useful for the ExtraTreesClassifier. Additionally, the most discriminative features are those in the middle of the window, since begin and end features are shared among different gestures (transitions) and are thus less discriminative for the classifier.

The third ranked team, ET, combined the output decisions of two approaches. In the first approach, they look for gesture intervals (unsupervised) using the audio files and extract MFCC features from these intervals. Using these features, the authors train random forest and gradient boosting classifiers. The second approach uses simple statistics (median, variance, min, max) over the first 40 frames of each gesture to build the training samples. The prediction phase uses a sliding window. The authors compute a weighted average of the outputs of these two models [4]. The features considered were skeleton information and the audio signal.
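For completeness, a similarly hedged sketch of blending two scikit-learn models in the ET style follows; the blend weight and the classifier hyper-parameters are illustrative assumptions, not the team's reported settings.

```python
# Sketch: random forest and gradient boosting trained on the same features,
# with their class probabilities blended by a weighted average.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

def train_and_blend(X_train, y_train, X_test, rf_weight=0.6):
    rf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
    gb = GradientBoostingClassifier().fit(X_train, y_train)
    # Weighted average of the two models' class posteriors on the test windows.
    return rf_weight * rf.predict_proba(X_test) + (1 - rf_weight) * gb.predict_proba(X_test)
```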

Finally, we extracted some statistics from the results of the three challenge winners in order to analyze common points and difficult aspects of the challenge. Figure 13 shows the recognition of the 276 test sequences by the winners. A black bin means that the complete list of ordered gestures was successfully recognized for that sequence. One can see that there is some correlation among the methods. Taking into account that consecutive sequences belong to the same user performing gestures, this means that some gestures are easier to recognize than others. Since different users appear in the training and test sequences, it is sometimes difficult for the models to generalize to the style of new users based on the gesture instances used for training.

We also investigated the difficulty of the problem by gesture category.


Figure 14: Deviation of the number of gesture samples for each category by the three winners in relation to the GT data.

For each of the 20 Italian gesture categories, Figure 14 shows, for each winner method, the deviation between the number of gesture instances recognized and the total number of gestures in that category. This was computed for each sequence independently, and the deviations were then summed over all the sequences. A zero value means that the participant's method recognized the same number of gesture instances of a category as recorded in the ground truth data. Although this measure cannot guarantee that the order of the recognized gestures matches the ground truth, it gives an idea of how difficult the gesture sequences were to segment into individual gestures. Additionally, the sums of the total deviation over all gestures for the three teams were 378, 469, and 504, which correlates with the final rank of the winners. The figure suggests a correlation between the performances of the three winners. In particular, categories 6, 7, 8, 9, 16, 17, 18, and 19 were recognized most accurately by all the participants, while categories 1, 2, 3, 5, and 12 introduced the highest recognition error. Note that the public data set provides accurate begin-end label annotations of the gestures, and thus a more detailed recognition analysis could be performed by applying a recognition measure other than the Levenshtein distance, such as Jaccard overlap or sensitivity score estimation, which would also allow confusion matrix estimation based on both inter- and intra-user and gesture category variability. This is left to future work.
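The deviation statistic could be computed as in the sketch below; the input format (one list of predicted labels and one list of ground-truth labels per sequence) and the use of absolute deviations are our assumptions based on the description above.

```python
# Sketch of the per-category deviation statistic: for every sequence, compare
# how many instances of each gesture category a method recognized against the
# ground-truth count, and sum the (assumed absolute) deviations over sequences.
from collections import Counter

def category_deviation(predicted_seqs, truth_seqs, num_categories=20):
    """Return a dict: category label -> summed absolute count deviation."""
    deviation = {c: 0 for c in range(1, num_categories + 1)}
    for pred, truth in zip(predicted_seqs, truth_seqs):
        pred_counts, truth_counts = Counter(pred), Counter(truth)
        for c in deviation:
            deviation[c] += abs(pred_counts[c] - truth_counts[c])
    return deviation

# The total deviation reported for a team would then be sum(deviation.values()).
```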

6. CONCLUSIONS

This paper describes the ChaLearn Multi-modal Gesture Recognition Challenge. For this purpose, we designed a large data set in which several people perform gestures from a vocabulary of 20 Italian sign gesture categories. The data also include distracter gestures to make the recognition task challenging. The modalities provided include audio, RGB, depth maps, user masks, and the skeletal model. The data set has been manually annotated to provide ground truth for the temporal segmentation of the signal into individual gestures, and it has been made publicly available.

Different classifiers for gesture recognition were used by the participants. The preferred one was Hidden Markov Models (used by the first ranked team of the challenge), followed by Random Forests (used by the second and third winners). Although several state-of-the-art gesture learning and testing techniques were applied in the last stage of the participants' methods, the feature vector descriptions are still mainly based on MFCC audio features and skeleton joint information. This supports the use of complementary sources of information, but it is expected that the use of more sophisticated features in the near future will help reduce the current error rate achieved on the data set. For instance, we think that structural hand information around the hand joint could be useful to discriminate among gesture categories that share similar hand/arm trajectories.

Although the current error rate on the data set is about 12% using the Levenshtein edit distance over the order of predicted gestures, the data set still offers room for improvement and for testing other metrics given the provided ground truth, such as the Jaccard overlap index or sensitivity.

7. ACKNOWLEDGMENTS

The ChaLearn Multi-modal Gesture Recognition Challenge was organized thanks to the support of ChaLearn⁹, the University of Barcelona Mathematics Faculty, the Universitat Autònoma de Barcelona, the Computer Vision Center, the Universitat Oberta de Catalunya, and the Human Pose Recovery and Behavior Analysis Group¹⁰. We thank the Kaggle submission website for its wonderful support, together with the committee members and participants of the ICMI 2013 Multi-modal Gesture Recognition workshop for their support, reviews, and contributions. We would lastly like to thank the students Pau Rodríguez, Meysam Madadi, and Aaron Negrín for their help in the annotation of gestures. This work has also been partially supported by the Pascal2 network of excellence and the Spanish projects TIN2009-14501-C02-02, TIN2012-39051, and TIN2012-38187-C03-02.

8. REFERENCES

[1] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. IJCV, 88(2):303–338, 2010.

[2] I. Guyon, V. Athitsos, P. Jangyodsuk, H. Escalante, and B. Hamner. Results and analysis of the ChaLearn gesture challenge 2012. ICPR, 2012.

[3] A. Hernandez-Vela, N. Zlateva, A. Marinov, M. Reyes, P. Radeva, D. Dimov, and S. Escalera. Human limb segmentation in depth maps based on spatio-temporal graph-cuts optimization. JAISE, 4(6):535–546, 2012.

[4] F. Pedregosa et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, pages 2825–2830, 2011.

[5] J. Shotton et al. Real-time human pose recognition in parts from single depth images. CVPR, pages 1297–1304, 2011.

⁹ ChaLearn: http://chalearn.org

¹⁰ HuPBA research group: http://www.maia.ub.es/~sergio/